File Identity Recovery mini program/script

File Identity Recovery is a small application that I built to help identify file extensions, based on the Python magic library.

It’s especially useful after running data recovery software on failed media or storage devices.

Most recovery programs will either fail to identify a file’s extension and recover it as unknown, or append the wrong extension to the file, which is about as good as not recovering it in the first place.

This is where the File Identity Recovery program comes into play.

Features

    1. User-friendly configuration
    2. Ability to exclude files during initial scan by size
    3. Saves a snapshot index of dirs and files for later processing (with thousands or millions of files, the data only needs to be scanned once)
    4. Custom file destination mapping (Advanced file type specification)
    5. Custom minimum size per identity mapping
    6. Auto creates destination dirs and corrects identities initially generated by recovery software
    7. Does not require any installs (No prerequisites)
    8. No dependencies
    9. Works on almost any Linux distro
    10. In-place Updates/Upgrades

Use Cases

The File Identity Recovery program is designed mainly to correct files that were misidentified by recovery software.

You’ll almost always need it right after performing software recovery on failed media, where files have lost their original extensions.

It can also be used to identify individual files when the “file” command is not available.
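
As a rough illustration of what signature-based identification means, the sketch below checks the first bytes of a file against a tiny table of well-known magic numbers, using only the Python standard library. It is a simplified stand-in and not the program’s actual code: the real tool relies on the magic library, which knows far more formats, and the names SIGNATURES and identify are hypothetical.

```python
# Minimal sketch of magic-number identification (not the tool's real code).
# SIGNATURES and identify() are hypothetical names used for illustration.
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"\x89PNG\r\n\x1a\n", "PNG image data"),
    (b"\xff\xd8\xff", "JPEG image data"),
    (b"PK\x03\x04", "Zip archive data"),
]


def identify(path):
    """Return a short description based on the file's leading bytes."""
    with open(path, "rb") as handle:
        header = handle.read(16)
    for magic_bytes, description in SIGNATURES:
        if header.startswith(magic_bytes):
            return description
    # Fallback for content that matches no known signature.
    return "data"
```

Because the check looks only at the file’s content, it gives the same answer whether the file is named report.pdf, f0001.chk, or has no extension at all.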

Installation

If the package you’re using is ever outdated, you can always download an official binary from the File Identity Recovery project, or use the in-place update by running file-identity-recovery --update

Self in-place updates are supported since v0.1

The official binaries are up to date, built in a reproducible and verifiable way, and can be downloaded and run without any additional installation work.

Please see the Downloads section below for the available downloads.

Documentation

Getting Started

Download the binary from the Downloads section, or get the source code from git.

Rename the binary if needed (there are no restrictions on the name).

Add the binary to your path, e.g. /usr/local/bin/file-identity-recovery, or simply call it directly as ./file-identity-recovery

Add your configuration to

config.yaml 

Get the help menu by running

file-identity-recovery --help 

The Config.yaml Configuration

The config.yaml is divided into 3 main sections

Section 1 - userConfig:


  userConfig:
    rootScanDir: '/path/to/files/recovered/by/sw-recovery'  

Do not add a trailing / at the end.


    BaseRecoverFilterPath: '/path/to/where/place/your/correctly/identified/files' 

Do not add a trailing / at the end. This location will be used to create the identity sub dirs such as images, mp3, etc.

Section 2 - appConfig:

    appConfig:
      resultsDatabase: './FilesToTypesDictionary.pickle'   
An absolute or relative path is fine. This can be any file with any name (it’s a Python pickle file used to store the indexes, and it’s fine to leave this as is).

      errorsDatabase: './FileTypeDictionaryErrors.pickle'
An absolute or relative path is fine. This can be any file with any name (it’s a Python pickle file used to store the indexes, and it’s fine to leave this as is).
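
The idea behind these pickle files can be sketched as follows: the scan results are kept in a dictionary that is written to disk, so a later run can reload it instead of rescanning. The dictionary layout (file path mapped to identification string) and the function names save_index and load_index are assumptions made for illustration; the program’s real on-disk structure may differ.

```python
import os
import pickle


def save_index(index, database_path):
    """Write the path -> identity dictionary to a pickle file."""
    with open(database_path, "wb") as handle:
        pickle.dump(index, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_index(database_path):
    """Reload a previously saved index, or start empty if none exists."""
    if not os.path.exists(database_path):
        return {}
    with open(database_path, "rb") as handle:
        return pickle.load(handle)
```

Since unpickling can execute arbitrary code, a pickle file should only be loaded if it was produced by a run you trust.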

Section 3 - programSettings:

    programSettings:
      minimumFileSizeGlobal: '0'   

Value in KB. Files smaller than this setting will not be included in the index; 0 means index all files.


      maxFilesToDetect: '0' 

Maximum number of files to add to the index. It’s a good idea to set this to a low value like 100 while you test the program, then set it to 0 afterwards, which means include all files.
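
The two limits described so far can be pictured with the sketch below, which walks a directory tree and keeps only the files that pass the size and count filters. It is an illustration under assumptions, not the program’s code: the function name build_index_candidates is hypothetical, the limits are passed as integers rather than the quoted strings used in config.yaml, and 1 KB is taken to be 1024 bytes.

```python
import os


def build_index_candidates(root_scan_dir, minimum_size_kb=0, max_files=0):
    """Walk root_scan_dir and return paths that pass the size and count limits."""
    selected = []
    for dirpath, _dirnames, filenames in os.walk(root_scan_dir):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # A value of 0 disables the size filter.
            if minimum_size_kb and os.path.getsize(path) < minimum_size_kb * 1024:
                continue
            selected.append(path)
            # A value of 0 disables the file-count limit.
            if max_files and len(selected) >= max_files:
                return selected
    return selected
```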


      dryRun: 'True' 

If set to true, only the index will be built; no copying or moving of identified files will take place.


      showFullDebug: 'True' 

High verbosity


      moveFiles: 'False' 

If set to true, files that are identified will be moved into their sub directories under BaseRecoverFilterPath instead of being copied.
It’s recommended to set this to true if both the source and destination dirs are on the same physical disk.
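
How dryRun and moveFiles interact can be sketched as below. This is a minimal illustration, assuming the two settings have already been converted from the quoted strings in config.yaml to booleans; the function name place_file and its return values are hypothetical.

```python
import os
import shutil


def place_file(source, destination, dry_run=True, move_files=False):
    """Copy or move source to destination; do nothing when dry_run is set."""
    if dry_run:
        return "skipped"
    # Create the destination sub directory on demand.
    os.makedirs(os.path.dirname(destination), exist_ok=True)
    if move_files:
        shutil.move(source, destination)
        return "moved"
    shutil.copy2(source, destination)  # copy2 also preserves timestamps
    return "copied"
```

When source and destination are on the same filesystem, shutil.move can rename the file instead of rewriting its contents, which is why moving is the cheaper choice in that case.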


      filesExtensionMapping:  

This is where all the power of the program lies.
This setting lets you configure how the program identifies files and what it should do when a file is successfully identified.


    - type1:  

An arbitrary identifier for the type of identity being defined. You can call it anything, but it must be unique.


        - 'PDF document' 

That’s the string that must be present in the file identification string.
Let’s say we want to identify PDF documents.
Run the command:
./file-identity-recovery -f somepdfdocument.pdf
Sample output looks like this:
PDF document, version 1.7

Since the string “PDF document” is distinctive, it is a good candidate for a type identifier: no other type of file should contain “PDF document” as part of its identity.


        - '/pdf' 

That’s the sub-location (dir) where files identified as PDF will be placed. The full path is the value of BaseRecoverFilterPath followed by /pdf


        - '80' 

That’s the minimum size (KB) for PDF files to be identified. PDFs smaller than this value will not be included in the final output.


        - 'pdf' 

That’s the extension that files identified as PDF should have. In almost every software recovery case files may not have the proper extension, so the extension specified here will be appended to the identified file.
Setting this value to none will retain whatever extension the file already has.
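
Putting the four values of a mapping entry together, the decision for a single file can be sketched as follows. This is an illustration under assumptions rather than the program’s actual logic: the list form of the entry, the name destination_for, and the 1 KB = 1024 bytes conversion are hypothetical, and the extension is appended to the existing file name as described above.

```python
import os

# Hypothetical in-memory form of one filesExtensionMapping entry:
# [match string, sub directory, minimum size in KB, extension]
PDF_RULE = ["PDF document", "/pdf", "80", "pdf"]


def destination_for(path, identity, size_bytes, rule, base_recover_filter_path):
    """Return the destination path for a file, or None if the rule does not apply."""
    match_string, sub_dir, minimum_size_kb, extension = rule
    # The match string must appear in the identification string.
    if match_string not in identity:
        return None
    # Files below the per-type minimum size are left out.
    if size_bytes < int(minimum_size_kb) * 1024:
        return None
    name = os.path.basename(path)
    # "none" keeps whatever extension the file already has.
    if extension.lower() != "none":
        name = name + "." + extension
    return base_recover_filter_path + sub_dir + "/" + name
```

For example, a 100 KB file /rec/f0001 identified as “PDF document, version 1.7” maps to /out/pdf/f0001.pdf when BaseRecoverFilterPath is /out. The plain string concatenation also shows why BaseRecoverFilterPath should not end with a /: the sub directory already starts with one.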

Getting/Resetting config.yaml

To generate config.yaml simply run

file-identity-recovery --init  

This will generate config.yaml in the current working directory

Downloads

Latest File Identity Recovery Binary

Git Source Code

The program supports self in-place updates: simply run file-identity-recovery --update