File storage

Recently I have been working a lot with sites that store large amounts of data in the file system: photo sites, file hosting services, and sites that offer video downloads. Some were designed and programmed by me from scratch; others were copied, finished, or “put in order”.
I should note that, for many programmers, storing files in the file system is an area that escapes their attention.
First, a short overview of common mistakes:
1. The file is stored in the file system under a Cyrillic name. What actually happens: the user uploads a file with a Cyrillic name, say the Russian equivalent of “nameless-1.jpg”, and the programmer drops it into the storage directory under that same name. I hope I don't need to explain what problems this can cause.
2. The file is stored under the name the user uploaded it with, but characters outside the Latin alphabet are transliterated. That is better, but it still causes many problems. For example, users love to upload files with identical names. And not out of malice: my camera, for example, restarts photo numbering from 00001 every time the memory card is cleared.
And the third most common mistake:
3. More files are stored in one directory than the file system can handle comfortably. Consider a concrete example. I rewrote a large file hosting service; at the time of the rewrite it held close to four terabytes of data, even though 80 percent of the files were images. All files on a disk (there were four one-terabyte disks) were scattered at random across two dozen directories until the disk filled up, and then the program moved on to the next disk. Each directory held about twenty thousand files. As a result, it took the web server about three seconds to open a directory. You will agree that this is disastrously slow.
After analyzing several such situations, I tried to work out a file storage method that satisfies the following conditions:
1. Directories must not slow down: no more than 1000 files or subdirectories should be stored in one directory (the number is chosen with a margin).
2. File names should not be repeated.
3. It is advisable not to keep two copies of the same file.
After some deliberation, I arrived at the following scheme, which I want to share with fellow programmers.
I'll start with the last requirement: not keeping two copies of a file. MD5 hashes have long been used successfully to verify file integrity. In PHP this is handled by the function md5_file(filename), which computes the hash of the file named by its argument using the RSA Data Security, Inc. MD5 algorithm and returns it. The hash is a 32-digit hexadecimal number.
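For readers outside PHP, here is a minimal sketch of the same idea in Python. The function name md5_of_file and the chunk size are illustrative assumptions, not part of the original scheme; only hashlib from the standard library is used.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 hash of a file as a 32-character hex string.

    Hypothetical helper, roughly analogous to PHP's md5_file().
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so that large uploads are never held in memory whole.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks matters here because the files in question (photos, video) can be large.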
If two files are identical, their hashes are identical; if the files differ, the hashes will, for all practical purposes, differ too. Now stones will fly at me, accompanied by arguments about collisions and the unreliability of MD5. I'll answer in order. MD5 is unreliable? But we are not trying to fool a “probable adversary”! We just need a unique file ID, nothing more. As for collisions... I don't insist that you copy my method one to one; use another hash function if you like. Just think: two to the power of 128 (the number of possible MD5 values) is a lot! When people tell me about the possibility of collisions, I ask them for an example of two strings or two files with the same MD5 hash... Nobody has given me such a pair yet, so for me the possibility remains purely theoretical.
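To put a rough number on “a lot”, the sketch below uses the standard birthday-bound approximation: among n files, the chance that any two share a hash is about n(n-1)/2 divided by the number of possible hash values. It assumes the hash behaves like a uniformly random 128-bit value, so it only covers accidental collisions, not files crafted on purpose to collide. The function name is an illustrative assumption.

```python
def accidental_collision_probability(file_count, hash_bits=128):
    """Approximate chance that any two of file_count files share a hash.

    Birthday-bound estimate; only meaningful while the result is far below 1.
    Assumes hash values are uniformly distributed (accidental collisions only).
    """
    pairs = file_count * (file_count - 1) // 2  # number of distinct file pairs
    return pairs / 2 ** hash_bits


# Even a billion stored files leave the estimate vanishingly small.
print(accidental_collision_probability(10 ** 9))
```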
The second requirement, “file names should not be repeated”, follows directly from the third. If we use the MD5 hash string as the file's name on disk, file names never repeat (the original names, as uploaded by users, can be stored in the database). If users upload two identical files, both get the same name. So, first, files are not duplicated, and second, we don't have to worry about names within directories.
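A minimal sketch of this deduplication, under stated assumptions: a plain Python dict stands in for the database table of original names, files go into a single flat directory, and store_upload is a hypothetical helper name. It reads each upload fully into memory, which is fine for a demonstration but not for very large files.

```python
import hashlib
import shutil
from pathlib import Path


def store_upload(source_path, original_name, storage_dir, name_index):
    """Store a file under its MD5 hash and remember the user's file name.

    name_index is a dict standing in for a database table (an assumption):
    it maps hash -> list of original names uploaded with that content.
    """
    file_hash = hashlib.md5(Path(source_path).read_bytes()).hexdigest()
    storage = Path(storage_dir)
    storage.mkdir(parents=True, exist_ok=True)
    target = storage / file_hash
    if not target.exists():
        # First time this content is seen: keep exactly one copy on disk.
        shutil.copyfile(source_path, target)
    # Always record the name, even when the content was already stored.
    name_index.setdefault(file_hash, []).append(original_name)
    return file_hash
```

Two users uploading the same photo as 00001.jpg and holiday.jpg would end up with one file on disk and two entries in the index.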
Storing the files on disk is a little more complicated. I create a nested directory structure based on the file names. Here, too, there is plenty of room for imagination, and I am by no means urging you to copy my method blindly. I usually use two or three levels of nested directories. The first level is the first two characters of the file name (don't forget, the file name is its MD5 hash!); the second level is the third and fourth characters, and so on.
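The mapping from hash to nested path can be sketched as follows. The function name sharded_path and the storage_root parameter are illustrative assumptions; the two-characters-per-level rule is the one described above.

```python
from pathlib import Path


def sharded_path(storage_root, file_hash, levels=2):
    """Build a nested path from successive two-character slices of the hash."""
    if not 0 <= levels <= len(file_hash) // 2:
        raise ValueError("levels must fit within the hash length")
    # Level 1 is characters 1-2 of the hash, level 2 is characters 3-4, etc.
    parts = [file_hash[i * 2:i * 2 + 2] for i in range(levels)]
    return Path(storage_root).joinpath(*parts, file_hash)
```

For the hash 5d41402abc4b2a76b9719d911017c592 and two levels, the file lands in the subdirectory 5d/41 of the storage root, under its full hash as the file name.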
Each level of nesting multiplies the number of directories by 256.
That is, if I put no more than 1000 files into one directory, then with one nesting level I can safely place 256,000 files on a disk; with two levels, 65,536,000; with three, 16,777,216,000; and so on. The length of the MD5 hash string allows up to 16 levels of nesting. In my opinion, this is enough for even the most capacious disks. In practice, though, three levels are usually more than enough for projects of any complexity.
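The arithmetic above is easy to reproduce. The sketch below assumes the same figures as the text (1000 files per directory, 256 subdirectories per level); the function names are illustrative.

```python
def max_files(levels, files_per_dir=1000, fanout=256):
    """Files that fit when each leaf directory holds files_per_dir files."""
    return files_per_dir * fanout ** levels


def levels_needed(total_files, files_per_dir=1000, fanout=256):
    """Smallest nesting depth whose capacity covers total_files."""
    levels = 0
    while max_files(levels, files_per_dir, fanout) < total_files:
        levels += 1
    return levels


for depth in (1, 2, 3):
    print(depth, max_files(depth))  # 256000, 65536000, 16777216000
```

A helper like levels_needed can be used to pick the depth up front from an estimate of how many files the project will hold.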
PS Updated and detailed version (written following the discussion)

Source: https://habr.com/ru/post/70147/
