Hi, Habr!
Recently I wondered how many bytes are needed to correctly determine the MIME type of a file. I googled first, but the answers I found did not satisfy me, so I decided to do a little research on the topic.
The following task pushed me to look into this: determining the MIME type of a file located on an SMB server. The best I came up with was to copy a piece of the file to the local machine and then try to recognize the MIME type from that piece.
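The "copy a piece of the file" step can be sketched as follows. This is only an illustration: `read_prefix` is a hypothetical helper name, and it assumes the file is reachable through an ordinary path (for example, a mounted share) rather than through an SMB client library.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read up to `n` bytes from the start of `path`
 * into `buf`. Returns the number of bytes actually read (less than `n`
 * for short files), or -1 if the file cannot be opened. */
long read_prefix(const char *path, unsigned char *buf, size_t n)
{
    assert(path != NULL && buf != NULL);

    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;

    size_t got = fread(buf, 1, n, f);
    fclose(f);
    return (long)got;
}
```

The resulting buffer could then be handed to a detector that accepts memory rather than a file, such as libmagic's `magic_buffer()`.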

First, I'll tell you what I found and why I didn't like it:
Stack Overflow gives 2 links to Wikipedia:
- File Signature says that in most cases 2-4 bytes are sufficient. Unfortunately, this is not true of, for example, a format as popular as PDF.
- List of signatures provides a list of signatures for files of different formats, but it is far from complete.
Then I found File Signatures, which seems to cover more than the rest.
But back to PDF. If this source is to be believed, four bytes (0x25 0x50 0x44 0x46) are enough to determine that a file is a PDF. However, given only the first four bytes, libmagic reported the MIME type of a PDF file as text/plain, and given five bytes, application/pdf. I find it hard to say exactly what causes this; one would have to look at the sources.
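One possible explanation, not verified here against the libmagic sources, is that the detector's rule is one byte longer than the signature in the table, for example `%PDF-` instead of `%PDF`. The sketch below shows how such a rule behaves. The table is illustrative only, the five-byte PDF entry is an assumption, and `match_signature` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One signature: the bytes expected at offset 0 and a type label. */
struct signature {
    const char *bytes;
    size_t len;
    const char *label;
};

/* Illustrative table only; real detectors use far larger rule sets. */
static const struct signature SIGNATURES[] = {
    { "%PDF-",      5, "application/pdf" }, /* assumed five-byte rule */
    { "GIF8",       4, "image/gif" },
    { "PK\x03\x04", 4, "application/zip" },
    { "\xFF\xD8",   2, "image/jpeg" },
};

/* Returns the label of the first signature that `buf` (of length `len`)
 * starts with, or "unknown" when the prefix is too short or differs. */
const char *match_signature(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);

    for (size_t i = 0; i < sizeof SIGNATURES / sizeof SIGNATURES[0]; i++) {
        const struct signature *s = &SIGNATURES[i];
        if (len >= s->len && memcmp(buf, s->bytes, s->len) == 0)
            return s->label;
    }
    return "unknown";
}
```

With this table, a four-byte prefix of a PDF file matches nothing, because the rule needs a fifth byte; a detector that falls back to a text type in that case would show the behavior described above.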
Now let's move on to what I actually did. I wrote a very small program that reads all the files in one directory, copies the first N bytes of each to another directory, and then tries to determine the MIME type from the partial copy. It repeats this with a growing N until the MIME type of the partial copy matches the MIME type of the original. At the end, the program reports how many bytes it took to determine each type. Here is its code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <dirent.h>
#include <fcntl.h>
#include <unistd.h>
#include <magic.h>

#define TEST_DIR "test-dir/"
#define TMP_DIR "tmp-dir/"

magic_t cookie;

// Detects how many bytes are required for the correct MIME type of this file
void detect_size(const char *filename)
{
    char strin[512], strout[512], type[100];
    char buf[4096];

    snprintf(strin, sizeof strin, "%s%s", TEST_DIR, filename);
    snprintf(strout, sizeof strout, "%s%s", TMP_DIR, filename);

    // Detect the MIME type of the original once and keep a copy,
    // because libmagic reuses its result buffer on the next call
    const char *mime_type = magic_file(cookie, strin);
    if (mime_type == NULL) {
        fprintf(stderr, "%s: %s\n", strin, magic_error(cookie));
        return;
    }
    snprintf(type, sizeof type, "%s", mime_type);

    // The prefix length is capped by the size of buf
    for (size_t bytes = 1; bytes <= sizeof buf; bytes++) {
        // Make a partial copy of the given file
        int infd = open(strin, O_RDONLY);
        int outfd = open(strout, O_WRONLY | O_CREAT | O_TRUNC, 0666);
        ssize_t got = -1;
        if (infd >= 0 && outfd >= 0) {
            got = read(infd, buf, bytes);
            if (got > 0)
                got = write(outfd, buf, (size_t)got);
        }
        if (infd >= 0) close(infd);
        if (outfd >= 0) close(outfd);
        if (got < 0) {
            fprintf(stderr, "%s: I/O error\n", filename);
            unlink(strout);
            return;
        }

        // Detect the MIME type of the partial copy
        mime_type = magic_file(cookie, strout);
        unlink(strout);

        // Check whether the MIME type was detected correctly
        if (mime_type != NULL && strcmp(mime_type, type) == 0) {
            printf("%s detected correctly in %zu bytes\n", type, bytes);
            return;
        }
    }
    printf("%s not detected within %zu bytes\n", type, sizeof buf);
}

int main(void)
{
    DIR *dir = opendir(TEST_DIR);
    if (dir == NULL) {
        perror(TEST_DIR);
        return EXIT_FAILURE;
    }

    cookie = magic_open(MAGIC_MIME_TYPE | MAGIC_ERROR);
    if (cookie == NULL || magic_load(cookie, NULL) != 0) {
        fprintf(stderr, "cannot initialize libmagic\n");
        return EXIT_FAILURE;
    }

    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        // Ignore "." and ".."
        if (!strcmp(entry->d_name, ".") || !strcmp(entry->d_name, ".."))
            continue;
        detect_size(entry->d_name);
    }

    magic_close(cookie);
    closedir(dir);
    return EXIT_SUCCESS;
}
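The same search can also be run entirely in memory, without temporary files. The following is a minimal sketch under stated assumptions, not the program used for the measurements: `min_prefix_len`, `detector_fn` and `toy_detect` are hypothetical names, and the toy detector stands in for a real one (with libmagic, a thin wrapper around `magic_buffer()` could play that role).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A detector maps a byte buffer to a type label (or NULL on error). */
typedef const char *(*detector_fn)(const unsigned char *buf, size_t len);

/* Returns the smallest prefix length at which `detect` reports the same
 * label as for the full buffer. Returns 0 for an empty buffer or when
 * the full buffer itself cannot be classified. */
size_t min_prefix_len(const unsigned char *data, size_t len, detector_fn detect)
{
    assert(data != NULL && detect != NULL);
    if (len == 0)
        return 0;

    const char *label = detect(data, len);
    if (label == NULL)
        return 0;

    /* Copy the label: a detector may reuse its result buffer between calls. */
    char full[128];
    strncpy(full, label, sizeof full - 1);
    full[sizeof full - 1] = '\0';

    for (size_t n = 1; n < len; n++) {
        label = detect(data, n);
        if (label != NULL && strcmp(label, full) == 0)
            return n;
    }
    return len; /* only the complete buffer yields the full label */
}

/* Toy detector for illustration: knows only the GIF signature. */
const char *toy_detect(const unsigned char *buf, size_t len)
{
    if (len >= 4 && memcmp(buf, "GIF8", 4) == 0)
        return "image/gif";
    return "application/octet-stream";
}
```

The loop is linear on purpose: nothing guarantees that detection results change monotonically with the prefix length, so a binary search over N would not be safe.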
Then I started experimenting with a bunch of different files in the test-dir folder. Of course, this is no full-scale, serious study, but some of the results are still interesting. Here is a brief summary:
application/x-sharedlib detected in 18 bytes
application/msword detected in 1793 bytes
image/gif detected in 4 bytes
application/zip detected in 4 bytes
application/x-dosexec detected in 2 bytes
application/vnd.oasis.opendocument.presentation detected in 85 bytes
text/html detected in 14 bytes
image/jpeg detected in 2 bytes
application/x-executable detected in 18 bytes
text/x-makefile detected in 1594 bytes
application/x-executable detected in 18 bytes
application/x-gzip detected in 2 bytes
audio/mpeg detected in 2291 bytes
text/x-c detected in 27 bytes
audio/x-flac detected in 4 bytes
application/pdf detected in 5 bytes
A few things that seemed interesting to me:
- First, of course, the already mentioned PDF, which is recognized from 5 bytes and not from 4, as one would expect.
- And finally, despite how cool the idea of determining the file type from the first N bytes is, in my opinion it failed: the number of bytes needed ranged from 2 to 2291 in my tests.
Well, that is probably all I wanted to share this time; I don't like to write a lot. I hope someone finds this article interesting.
Thanks for your attention.