Determining a file format with Python

Background

Hello. Recently I ran into a problem: for no apparent reason, my memory card began dumping all files into the LOST.DIR folder without extensions. Over time, more than 500 files of different types accumulated: pictures, video, audio, documents. It was impossible to work out each file's format by hand, so I began looking for a way to solve the problem programmatically.

Searching for a solution

I did not want to use ready-made solutions such as web services or programs, so I decided to write a console utility that would go through all the files and assign the extensions automatically. I chose Python for the utility. My search for suitable modules and libraries was unsuccessful for several reasons:

Lack of support from the developer
Excessive functionality
No support for new versions of Python
Excessive code complexity

Python-magic (almost 1000 stars on GitHub), a wrapper around the libmagic library, stood out among the many libraries. However, it cannot be used on Windows without a DLL build of the Unix library, so I rejected this option.

The solution

Based on the above, I decided not to use third-party libraries or modules and to solve the problem without them. After a brief search for ways to implement this, it became clear that the only correct approach was to determine the format by file signature.

A file signature is a sequence of bytes that identifies the file format. Written in hexadecimal, a signature looks like this:

50 4D 4F 43 43 4D 4F 43
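
As a minimal sketch of what this means in practice, a signature written this way can be turned back into raw bytes and compared with the start of a file's contents. The `header` value below is made-up sample data, not a real file.

```python
# Signature from the text, written as space-separated hex pairs.
signature_hex = "50 4D 4F 43 43 4D 4F 43"

# bytes.fromhex ignores the spaces between the pairs.
signature = bytes.fromhex(signature_hex)

# Hypothetical file contents: the signature followed by arbitrary data.
header = signature + b"\x00\x01\x02\x03"

# A file "has" the signature if its first bytes are equal to it.
matches = header.startswith(signature)
print(signature, matches)
```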

Fortunately, there are two good sites on the Internet that list many signatures for different formats. I focused on the most common formats.
As it turned out, some signatures are shared by several file formats, for example the signature of Microsoft Office files. Because of this, in some cases a list of suitable file extensions has to be returned.

 print(get("D:\\some_ms_office_document")) #  ['doc', 'ppt', 'xls']

Also, a signature often does not start at the first byte but at an offset from the beginning of the file, as in 3GP multimedia container files.
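
A rough sketch of an offset-aware check on raw bytes follows. It assumes the commonly documented layout in which the ASCII marker `ftyp` starts at byte 4 of a 3GP file. The helper name `matches_at` and the sample header are hypothetical and exist only for illustration.

```python
def matches_at(header: bytes, signature: bytes, offset: int) -> bool:
    """Return True if `signature` occurs in `header` starting at byte `offset`."""
    return header[offset:offset + len(signature)] == signature


# Made-up header: four size bytes, then the "ftyp" marker and a brand name.
sample_header = b"\x00\x00\x00\x18ftyp3gp5\x00\x00\x00\x00"
ftyp = bytes.fromhex("66 74 79 70")  # ASCII "ftyp"

at_offset_4 = matches_at(sample_header, ftyp, 4)
at_offset_0 = matches_at(sample_header, ftyp, 0)
print(at_offset_4, at_offset_0)
```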

1. Making a list of data

I decided to store the data in a JSON file containing a 'data' key whose value is an array of objects of the following form:

 {"format": "jpg", "offset": 0, "signature": ["FF D8 FF E0", "FF D8 FF E1", "FF D8 FF E2", "FF D8 FF E8"]}

Where:
format - file format;
offset - the signature offset from the beginning of the file;
signature - an array of suitable signatures under the specified file format.
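
To make the layout concrete, here is a small sketch that parses a one-entry version of such a file from a string and reads the three fields back. The surrounding `{"data": [...]}` wrapper follows the description above; the real data file contains many more entries.

```python
import json

# A one-entry stand-in for the contents of the JSON data file.
raw = """
{
  "data": [
    {"format": "jpg", "offset": 0,
     "signature": ["FF D8 FF E0", "FF D8 FF E1", "FF D8 FF E2", "FF D8 FF E8"]}
  ]
}
"""

data = json.loads(raw)["data"]

for entry in data:
    # Each entry names a format, a byte offset, and the signatures that identify it.
    print(entry["format"], entry["offset"], len(entry["signature"]))
```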

2. Writing a utility

First, we import the necessary modules:

 import os
 import json

Read the list of data:

 abspath = os.path.abspath(os.path.dirname(__file__))
 # Load the signature list stored next to this script; "with" closes the file afterwards
 with open(os.path.join(abspath, "data.json"), "r", encoding="utf-8") as f:
     data = json.load(f)["data"]

OK, the data list is loaded. Now we read the file as bytes. We read only the first 32 bytes, since no more are needed to determine the common formats, and reading a large file in full would take a long time.

 file = open("path_to_the_file", "rb").read(32)

If we print the variable file, we will see something like this:

 \x90\x00\x03\x00\x00\x00\x04

Now the bytes we have read must be converted to hexadecimal notation:

 hex_bytes = " ".join(['{:02X}'.format(byte) for byte in file])
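
As an illustration of what this conversion produces, the sketch below applies the same expression to a made-up ten-byte header. It also shows `bytes.hex` with a separator, available on Python 3.8 and later, which yields the same text in lowercase. This is an alternative, not what the utility itself uses.

```python
# Made-up sample standing in for the first bytes of a file.
sample = b"\xff\xd8\xff\xe0\x00\x10JFIF"

# Same conversion as in the article: two uppercase hex digits per byte, space-separated.
hex_bytes = " ".join("{:02X}".format(byte) for byte in sample)
print(hex_bytes)

# Python 3.8+ alternative: bytes.hex accepts a separator and returns lowercase digits.
alternative = sample.hex(" ").upper()
print(alternative == hex_bytes)
```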

Next, we create a list that will collect the matching formats:

 out = []

And now the most interesting part: a loop that checks the file against every signature of every format in the data list:

 for element in data:
     for signature in element["signature"]:
         offset = element["offset"]*2+element["offset"]
         if signature == hex_bytes[offset:len(signature)+offset].upper():
             out.append(element["format"])

Regarding this line:

 offset = element["offset"]*2+element["offset"]

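Since our bytes are represented as a string in which each byte takes two characters followed by a space, we multiply the byte offset by 2 and add one character per skipped byte for the spaces between the "bytes". In other words, a byte offset of n corresponds to index 3n in the string.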
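
A short worked example of this arithmetic, using a made-up hex string and a byte offset of 4. The values are illustrative only.

```python
# Made-up hex string: four bytes, then the bytes we want to match.
hex_bytes = "00 00 00 18 66 74 79 70 33 67 70 35"
signature = "66 74 79 70"
byte_offset = 4

# Two hex digits per byte plus one space per byte: 4 * 2 + 4 = 12.
offset = byte_offset * 2 + byte_offset

# Slice out exactly as many characters as the signature string has.
window = hex_bytes[offset:offset + len(signature)]
print(offset, repr(window), window == signature)
```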
All that remains is to display the list of matching formats, which is stored in the variable out.

 print(out) # ['format_1', 'format_2']
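
Putting the steps together, the whole procedure might be wrapped in a single function along the following lines. This is a sketch under stated assumptions, not the code of the published module. The name `detect_formats`, the `read_size` parameter, the two-entry data list, and the `break` that avoids listing one format twice are all my own additions for illustration.

```python
import os
import tempfile


def detect_formats(path, data, read_size=32):
    """Return the formats from `data` whose signature matches the file at `path`."""
    with open(path, "rb") as f:
        header = f.read(read_size)

    # Space-separated uppercase hex, as in the steps above.
    hex_bytes = " ".join("{:02X}".format(byte) for byte in header)

    out = []
    for element in data:
        offset = element["offset"] * 3  # two hex digits plus one space per byte
        for signature in element["signature"]:
            if hex_bytes[offset:offset + len(signature)] == signature.upper():
                out.append(element["format"])
                break  # assumption: one match per format is enough
    return out


# Illustrative data list; the 3gp entry assumes the "ftyp" marker at byte 4.
data = [
    {"format": "jpg", "offset": 0, "signature": ["FF D8 FF E0", "FF D8 FF E1"]},
    {"format": "3gp", "offset": 4, "signature": ["66 74 79 70"]},
]

# Write a made-up JPEG-like header to a temporary file and detect it.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "unknown_file")
    with open(path, "wb") as f:
        f.write(b"\xff\xd8\xff\xe0\x00\x10JFIF\x00")
    result = detect_formats(path, data)

print(result)
```

Returning a list rather than a single string keeps the behaviour described earlier, where one signature can belong to several formats.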

Conclusion

As it turned out, various projects need to recognize file formats, so I decided to release my solution as an open-source Python module called fleep (link to the GitHub page). You can now install the module using pip, the standard Python package installer:

 pip install fleep

The project's GitHub page also has usage examples and a full list of supported file formats.

Thanks for your attention!

Source: https://habr.com/ru/post/345822/
