
How to build your own file system on top of blob fields in a database. Why is it convenient? Efficiency issues

Dear readers, this is the third article in my database series.

Table of contents:

  1. How to set up different time zones in different databases on the same server.
  2. How to keep logs of users' data changes in a separate database (so that the main database is not clogged with garbage and does not grow).
  3. How to build your own file system on top of blob fields in a database. Why is it convenient? File storage efficiency (how to get maximum speed with minimum space used).

This article describes a way to store the files that users attach on a site via the web interface. It will not be a "file system" in the sense the operating system organizes one.

The example described here solves a problem I once faced: "Create a section within a company's account on a web site where the company's employees can store their files and create folders (let's call it a 'Disk'). The Disk must be isolated from the accounts of other companies and must be integrated into the account's workflows (storing files attached to tasks, projects, counterparty cards, reports, etc.)."

I note right away that in terms of speed this method is less efficient than simply storing files on the web server. But it has certain advantages which, in my case, outweighed the loss in transfer speed.

Disadvantage: increased file upload and download time

Benefits

  1. Distributed storage. I.e. client files do not have to be stored on the web server itself: you can keep them anywhere, on any servers in your network, and easily move them from server to server if necessary.
  2. Convenient backups of client files: everything can be done with standard database backup tools.
  3. Security. This method avoids the main web-application vulnerabilities related to file uploads (you can read about them, for example, here). It also provides physical isolation of each customer's data from other customers' data, because each company has its own database.

The main argument for implementing this file storage system was the possibility of distributed storage. In principle, you could use solutions like CIFS and Samba, mount network drives from other machines and store client files there. But at the time I arrived at this somewhat non-standard solution, and I am completely satisfied with it.

In this article we will consider the implementation process from simple to complex:

  1. The general organization of the storage structure.
  2. Save files to blob fields. Direct extraction.
  3. Saving files with intermediate archiving in blob fields. Extraction with intermediate unarchiving.
  4. Saving files with deferred and selective archiving (not all files are worth archiving — sometimes the game is not worth the candle — and it does not always make sense to archive immediately).
  5. Then we consider the organization of the directory structure, access rights, file operations, some special cases, etc.

So, let's begin.

As a database, firebird 3 is used.

1. The general organization of the storage structure.

As in the previous article (on storing logs), you should not store files in the main working database: they would add a lot of garbage, cause problems with backups, etc. It is better to dedicate a separate database to them; let's call it the "file database". The main database stores the directory structure and links to the files, while the binary data itself lives in the file database.

This gives you both convenient backups and a fast main database. For example, in my system users can schedule their own backups, including backups of the file database. There is even an option to automatically upload the backup archive to the user's own FTP (for example, if he wants to keep backups on his own hardware rather than pay for cloud space). Such an implementation is possible precisely because files are stored in a separate database within a distributed storage system.

Also, with this organization the web server performs only the functions of a web server, not of file storage, and you can distribute resources across the network more sensibly. With a huge number of clients (a huge number of databases and file structures), non-distributed storage would simply be unacceptable: the structure must be simple and scale without limits.

The scheme can be represented as follows:

(diagram: the web server, the main database and the file database as separate components)

Moreover, each of these elements can be (and preferably should be) placed on its own separate server. The storage structure can be organized as follows: a table with metadata about each attached file is created in the main database.

CREATE TABLE FILES_ (
  ID BIGINT,
  DATA TIMESTAMP,             -- date the file was added
  DATA_DEL TIMESTAMP,         -- date the file was deleted
  USER_ INTEGER,              -- user who added the file
  USER_DEL INTEGER,           -- user who deleted the file
  ID_FILE INTEGER,            -- ID of the file in the file database
  FILE_NAME VARCHAR(256),     -- original file name
  FILE_NAME_TMP VARCHAR(256), -- temporary name on the web server
  CONTENT_TYPE VARCHAR(100),  -- MIME type of the file
  STATUS SMALLINT,            -- file status (working, deleted, etc.)
  SIZE BIGINT,                -- original file size
  SIZE_ZIP BIGINT,            -- size after archiving
  SIZE_ZIP_2 BIGINT,          -- size after trial (preventive) archiving
  ZIP_USE SMALLINT,           -- whether the file is stored archived
  ID_FOLDER INTEGER,          -- folder the file belongs to
  FLAG_ZIP SMALLINT           -- archiving-processed flag
);

The structure of the file database can look like this (minimal data about the file, plus the file itself in a blob field):

CREATE TABLE FILES_ (
  ID BIGINT,
  DATA TIMESTAMP,             -- date the file was added
  USER_ INTEGER,              -- user who added the file
  FILE_NAME VARCHAR(256),     -- original file name
  CONTENT_TYPE VARCHAR(100),  -- MIME type of the file
  FILE_DATA BLOB SUB_TYPE 0 SEGMENT SIZE 80, -- the file itself
  STATUS SMALLINT,            -- file status (working, deleted, etc.)
  SIZE INTEGER,               -- original file size
  SIZE_ZIP INTEGER,           -- size after archiving
  ZIP_USE SMALLINT            -- whether the file is stored archived
);

There is some redundancy between these two tables, but it will come in handy later — for deferred backup operations, debugging, subsequent data analysis, etc. — and it saves you from having to run the same query against several databases.

Let's start the review with a simple file entry in the blob field.

2.1. Save files to blob fields.

Web forms and the process of uploading a file to a directory on a web server will not be covered here. It is assumed that the reader is familiar with this.

So, the user clicked a button in the web form, and the file was uploaded and saved to a directory on the web server (for example, tmp).

At this point the file's metadata is available in the $_FILES superglobal.
(The examples are in PHP.)

P.S.: For simplicity I deliberately omit converting dangerous special characters to entities, forcing numeric inputs to numbers, etc. It is assumed that the reader has his own routines for this and is aware of the danger of SQL injection. If not, I strongly recommend studying the topic and adding the appropriate sanitization.

 // connect to the file database
 $dbh_file = ibase_connect(…);
 // open the uploaded temporary file
 $fd = fopen($_FILES[…]['tmp_name'], 'r');
 // import its contents into a blob
 $blob = ibase_blob_import($dbh_file, $fd);
 fclose($fd);
 // on failure $blob holds an error instead of a blob id string
 if (!is_string($blob)) {
     // handle the error
 } else {
     $query = "INSERT INTO FILES_ (USER_, FILE_NAME, CONTENT_TYPE, FILE_DATA, SIZE, SIZE_ZIP, ZIP_USE)
               VALUES (".$USER_.", '".$_FILES[…]['name']."', '".$file_type."', ?, ".$_FILES[…]['size'].", ".$_FILES[…]['size'].", 0)
               RETURNING ID";
     $prepared = ibase_prepare($dbh_file, $query);
     $res_query = ibase_execute($prepared, $blob);
     $prom_query = ibase_fetch_row($res_query);

$prom_query[0] will contain the ID of the stored file. After successfully writing the file data to the file database, we record the file's metadata in the main database.

 // if the write succeeded, we have the ID of the new record
 if (isset($prom_query[0])) {
     // connect to the main database
     $dbh_osn = ibase_connect(…);
     $query2 = "INSERT INTO FILES (USER_, ID_FILE, FILE_NAME, SIZE, SIZE_ZIP, CONTENT_TYPE, FILE_NAME_TMP, ZIP_USE)
                VALUES (".$USER_.", ".$prom_query[0].", '".$_FILES[…]['name']."', ".$_FILES[…]['size'].", ".$_FILES[…]['size'].", '".$_FILES[…]['type']."', '".basename($_FILES[…]['tmp_name'])."', 0)";
     $res_query2 = ibase_query($dbh_osn, $query2);
     // remove the original file from the tmp directory on the web server
     unlink($_FILES[…]['tmp_name']);
 }

That's it: the file is recorded in the file database and is no longer on the web server. The main database holds the information about it — its status, type, size and its identifier in the file database.

2.2. Reading a file from the database.

When the user clicks the link with the file name, we need to reverse the process: extract the file from the database and hand the byte sequence to the user's browser, telling it what type of file it is.

The file identifier is passed in the link as a GET parameter (or a salted hash identifier — it depends on the system's security requirements; for now we consider the simple case). For example, a link might look like this: <a href="file_b.php?id=123">File</a>.

At the stage of simple reading from the database (without archiving), it is enough to query the file database.

 // connect to the file database
 $dbh_file = ibase_connect(…);
 $query = "SELECT p.FILE_DATA, p.CONTENT_TYPE, p.FILE_NAME, p.SIZE FROM FILES_ p WHERE p.ID = ".$_GET['id'];
 $res = ibase_query($dbh_file, $query);
 $data = ibase_fetch_row($res);

When specifying the file name in the header, a Chrome-specific quirk turned up (at least at the time): Chrome required the filename in the Content-Disposition header to be in single quotes, while other browsers required it without quotes. You can handle it like this:

 preg_match("/(MSIE|Opera|Firefox|Chrome|Version)(?:\/| )([0-9.]+)/", $_SERVER['HTTP_USER_AGENT'], $browser_info);
 list(, $browser, $version) = $browser_info;
 if ($browser == 'Chrome') {
     header("Content-Disposition: attachment; filename='".str_replace(' ', '_', $data[2])."'");
 } else {
     header("Content-Disposition: attachment; filename=".str_replace(' ', '_', $data[2]));
 }

Spaces are replaced with "_" so that the file name survives the download intact; otherwise it will most likely be cut off at the first space.

After that we output the file's binary data to the browser. Note that ibase_blob_echo() writes the blob contents to the output itself, so it must not be wrapped in echo (that would append its boolean return value to the binary stream):

 ibase_blob_echo($data[0]);

P.S.: A typical mistake to watch out for: there must be no characters before <?php or after ?>, otherwise the download will be corrupted.

Now, when the user clicks the file_b.php?id=123 link, the browser will show a download dialog for his file (with its proper name and type).

3.1. Uploading a file with intermediate archiving

We have covered a fairly simple, basic way of storing files. Now let's improve it a little. After all, the database stores a binary sequence — why not make it shorter and save space? To do this, compress the file before writing it to the blob.

Some statistics:
The idea first occurred to me about 1.5 years ago, when I was developing an invoice reconciliation module for one company and decided to integrate this function into it experimentally. All sorts of files with invoice and contract information get attached to this module: PDF, XLS, DOC, etc.

Over 1.5 years in a mid-sized company with a staff of 50-70 people, 4,085 files with a total volume of 1,526 MB were attached to this one module, while on disk it all takes up 1,240 MB. I.e. zip archiving gave a saving of about 20%, which is pretty good.
Back then I implemented archiving through the zip.lib.php library directly in the script. Later I concluded that this method is not optimal in terms of compression or speed; at present, the 7zip archiver is used.

Uploading a file with archiving poses no special problems: you only need to archive the file before saving its byte sequence into the blob, for example:

 exec('7z a <archive name> <file name>');

And at the end, delete not only the original but also its archive from the web server.
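Since the archive and file names end up in a shell command, it is worth escaping them. A minimal sketch of building those 7z command lines with escapeshellarg() — the helper function names here are mine, not from the original code:

```php
<?php
// Build the shell command that adds $file to $archive with 7z.
// escapeshellarg() guards against shell metacharacters in user-supplied names.
function build_7z_add_cmd(string $archive, string $file): string
{
    return '7z a ' . escapeshellarg($archive) . ' ' . escapeshellarg($file);
}

// Build the command that extracts $archive into $outDir.
// Note: 7z's -o switch takes the directory with no space after it.
function build_7z_extract_cmd(string $archive, string $outDir): string
{
    return '7z e ' . escapeshellarg($archive) . ' -o' . escapeshellarg($outDir) . ' -y';
}
```

Usage would be, for example, exec(build_7z_add_cmd($tmpZip, $tmpFile)); in place of the hand-assembled string.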

But the process of extracting such a file to the user is much more interesting.

3.2. Reading a file from the database with extracting it from the archive

Direct extraction, as before, will not work here. After the user clicks the link, we must first read the file from the database, save it on the web server, unpack it, and only then hand it to the user.

I.e. when the user clicks the file_b.php?id=123 link, the script should call another script that reads the data from the blob (exactly as in section 2.2) and saves it to disk on the server; it then unpacks the archive and streams the extracted file's data to the user with the proper header, so that the browser shows the usual download dialog.

For this purpose we use CURL.

After determining the browser type and setting the header, we do the following.

 // open a temporary file on the web server to receive the data
 $fp = fopen($path, 'w');
 // curl request to the helper script that reads the archived file
 // from the blob (as in section 2.2) and returns its contents
 $ch = curl_init();
 curl_setopt($ch, CURLOPT_URL, $url);
 curl_setopt($ch, CURLOPT_USERAGENT, $agent);
 curl_setopt($ch, CURLOPT_FILE, $fp);
 curl_exec($ch);
 curl_close($ch);
 fclose($fp);

Thus the archived file ends up in the file named $path. Now we unpack it:

 exec('7z e '.$path.' -o<output directory> -y');

After that, we read the extracted file and stream its contents into the response using fpassthru():

 $stream = fopen(<extracted file>, 'r');
 fpassthru($stream);
 fclose($stream);

Now everything should work out.

4. Saving files with deferred and selective archiving.

Now let's consider speed. We save space, but instead of the user waiting only for the file to upload to the server, we make him also wait for the archiving and for the insertion into the blob field; for the user, upload time often more than doubles. Retrieval is similar: instead of reading the data directly, we first save it, unpack it, and only then serve it. None of this is great. What can be done?

1) First, you can decouple uploading a file from archiving it. I.e. the user uploads the file in its original form, and a script that runs periodically later picks up newly uploaded files and archives them. Uploads become much faster for the user, while we still save space. Moreover, this script can run on another server and consume none of the web server's resources (why should a highly loaded server busy itself with archiving?) — the process can happen on the side, at its own pace.

2) Second, not all file types are worth compressing. If the difference between the compressed and uncompressed file is less than 10%, the game is definitely not worth the candle: the space saving is minimal, while the CPU load and the user's waiting time are maximal. For myself I selected the following poorly compressible types, which the archiver does not even touch but simply marks as processed: RAR, GZ, ZIP, JAR, TAR, ARJ, UC2, UUE, LHA, CAB, LZH, ACE, TGZ, 7Z, AVI, MPG, 3GP, WMV, ASF, FLV, MP3, AAC, WMA, AMR, TIF, JPG, JP2, GIF, PNG.

P.S. (tip from ivanbolhovitinov): a valid XLSX or DOCX is always a ZIP archive, so its first two bytes are "PK". You can check them like this: file_get_contents($filename, false, null, 0, 2) === 'PK'.

3) Since extracting a file for the user also costs an unarchiving step, even a file whose type is not in the list from point 2 is not necessarily worth archiving. Pointless archiving eats your resources and — worse — the user's waiting time. So before archiving a file, the system must check whether it makes sense. For this I had to build a system of trial (preventive) archiving: the script first "tries" — it fetches the file and archives it, but before replacing the original with the archive it checks by how many percent the archive is smaller than the original. If the saving is again below 10%, archiving makes no sense at all; such files are marked as processed and left unarchived.

Typical representatives are XLSX files. The format itself is compressed, but in practice files differ wildly: some compress by 50%, others barely by 5%, depending on their contents.

4) It also makes no sense to archive files smaller than about 250 bytes: the archive will come out larger than the original.

5) There is no need to archive large files either. Empirically, the threshold is around 15 MB and up: even if such a file compresses well, decompressing it (on top of extracting it from the blob) takes time, and waits like that can already strain the user.
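The heuristics from points 2–5 can be gathered into one check. A sketch under the rules of thumb above (the function names and exact thresholds are my illustration, not the author's production code): skip known-compressed extensions, files under 250 bytes and files over 15 MB, and after a trial compression keep the archive only if it saves at least 10%:

```php
<?php
// Extensions the archiver should not even try (already-compressed formats,
// per the list in point 2).
const SKIP_EXTENSIONS = ['rar','gz','zip','jar','tar','arj','uc2','uue','lha',
    'cab','lzh','ace','tgz','7z','avi','mpg','3gp','wmv','asf','flv','mp3',
    'aac','wma','amr','tif','jpg','jp2','gif','png'];

// Is this file worth a trial compression at all?
function worth_trying(string $fileName, int $sizeBytes): bool
{
    $ext = strtolower(pathinfo($fileName, PATHINFO_EXTENSION));
    if (in_array($ext, SKIP_EXTENSIONS, true)) return false;  // point 2
    if ($sizeBytes < 250)              return false;          // point 4
    if ($sizeBytes > 15 * 1024 * 1024) return false;          // point 5
    return true;
}

// After a trial compression: keep the archive only if it saves >= 10% (point 3).
function keep_archive(int $origSize, int $zipSize): bool
{
    return ($origSize - $zipSize) / $origSize >= 0.10;
}
```

Only files that pass both checks get their original replaced by the archive; everything else is simply marked as processed.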

5. Organization of the directory structure, organization of access rights, file operations, some special cases.

So far we have considered only the file storage layer — the foundation everything else rests on. But for a full-fledged "file system" we still need to structure these files somehow, assign access rights to them, etc.

Perhaps this chapter will not interest everyone, since the implementation of this structure depends heavily on the task at hand. I built it for the erp-platforma.com cloud service, so in this chapter I will go over the tasks the file system had to solve for that service:

1) Classic - Directory tree files. The mechanism of applying files, editing their metadata, deleting.

2) This item needs a short introduction. Consider something not obvious to everyone about how automation systems work: when you attach a file to a task, it is not physically stored "in" the task but on the disk; the task contains only a link to the file.

When disk space runs out, the system administrator may need to clear it, and making him walk through twenty years' worth of tasks deleting files from each one is simply silly.
I.e. there should be a file structure — for example, a "Tasks" service directory on the disk where all files attached to tasks are stored — so that the sysadmin can simply clean out old files.

But what happens if he deletes them? The tasks would be left with links leading nowhere — also no good. Consequently, a task should contain not a link but a kind of "window" into the Disk, through which the task sees only its own files.

But bare "windows" are not enough. A user may remember that he attached a file, while the "evil" sysadmin silently deletes it from the disk and tells the user he is imagining things. Therefore the file's data may be deleted, but the record that the file existed should remain somewhere in the archives, and in our "window" the link should be shown greyed out and inactive.

I used a "task" as the example; files can similarly be attached to projects, contractors, employees, objects, etc. I.e. the system should be universal.

But that's not all. A customer may well want counterparties to have two file attachment areas — say, one where invoices are attached and one for documents. I.e. these "windows" should be bound not just to a page and its input data but to specific elements of the page.

To summarize this point: the file system must support "windows" into it from the external system, and each "window" must offer the usual file operations within its scope.

3) The user may need to find, within a folder, the files that were attached to a given counterparty or project — or by some other criterion entirely, regardless of what the files happen to be named. I.e. we need a search not only over file names but also over the files' relationships to external structures.

P.S.: In my opinion, such things are best implemented with hashtags. For this, files can get a couple of extra properties: "system hashtags" and "user hashtags". System ones are written by the system — for example, the counterparty's name; user ones are set arbitrarily by the user when attaching the file.

4) Access rights. Naturally, some users should be able to view certain folders and others should not, with separate rights to add and delete. It may also happen that a user should have access to a task and the files attached to it, but not to the folder those files live in. I.e. a "window" must have rights of its own, and the system administrator needs a mechanism for granting and revoking access rights for specific users.

These are the requirements for the file system structure.

Implementing all of this amounts to quite complicated, branching code in the scripts and in the database structure, so I will not reproduce it here — the article is very long as it is.
Instead I will describe the basic principles behind each of the tasks.

The directory structure is fairly straightforward: a simple table in the company's database containing:

1) Folder name
2) Folder Creator
3) Folder node in which this folder is located
4) Folder status
5) Folder ID
6) General rights to the default folder (which all users can do in the folder if they are not granted special rights)

More on each item:

1) A folder name may be duplicated at different levels of the tree, in different nodes, but not within a single node. This is trivial to implement: just put a unique index on fields 1 and 3 of the table (name + parent node), and the system will raise an error on a duplicate name within one node.

2) The folder's creator must be recorded. The creator always has full rights to the folder unless the administrator specifies otherwise: even if a user created a folder, the system administrator can still block his access to it with a corresponding entry in the role.

The creator can configure default permissions for other users: whether others see the folder, whether they may only read the files or may also delete them, modify user hashtags, etc. Again, the system administrator may override these settings.

3) Each folder records the node (folder) in which it resides. This is needed to build the tree-like folder structure.

4) Folder status. A folder can be active or deleted, and the question is not as simple as it looks. Each developer can decide for himself what to do here: for example, you can organize a service "Trash" folder holding folders with the deleted status, or really delete them — but then you must work out what happens to the files inside and to the folder's links elsewhere in the system. The right solution, in my opinion, is not to physically delete a folder record as long as the system still links to it or it still contains files: first clean out the files, then the links, and only then delete. Otherwise you can get glitches in dependent modules.

5) Each folder must have its own identifier. Names of folders in different nodes may repeat, but identifiers may not. When programming the system, whenever a user places a File element on a web page, he always specifies the folder identifier in that element, and it is into that folder that the system saves files attached through it.

6) Default folder rights. When a user opens a folder, the system must first check whether there are per-user rights records for this folder (be they full rights or a blanket prohibition); if so, those rules apply, and if nothing is recorded, the rules set by the folder's creator take effect.
In my implementation I integrated folders into the general rights system, so adding folders to a user's role causes the company administrator no trouble (an example of implementing access rights to system elements can be read here).

There is an elegant solution for binding a file to a page element: compute a hash over the unique identifier of the page element plus the page's data (for example, the project or task number). This hash is always unique and is stored in the file's data record. When the page is rendered, at the moment the "window" is displayed, the same binding hash is computed and used to look up the files to list on the page.
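A sketch of that binding hash (the function name and inputs are illustrative — the text only says the element identifier and the page key, e.g. the task number, go into the hash):

```php
<?php
// Compute the binding hash for a "window": the page element's unique id
// plus the page's own key (e.g. project or task number).
function file_bind_hash(string $elementId, string $pageKey): string
{
    return md5($elementId . '_' . $pageKey);
}
```

The value of file_bind_hash('FILE_ELEM_7', $taskId) would be stored in the file record on upload, and recomputed when rendering the window to select that window's files.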

Also, the "window" must know the directory into which new files are to be added.

Writing hashtags to files. User (arbitrary) hashtags pose no problem: just give the user the ability to add notes to a file when attaching or editing it.

System hashtags are more complex. For example, every file added to a task must get the tag "#Task No. …#". Then, typing e.g. "No. …#" into the search box in the Tasks folder, the user gets all the files of the task of interest. I implemented this at the programming-language level, in the properties of the File form element (the "window" of this article): in its properties you can specify a string of identifier elements and associate them with the required data source, and the system builds the rest automatically.

As a nice bonus, consider a special case of playing with blob fields: an image storage system.

At some point, while developing an internal programming language, I needed to introduce a non-standard data type. There are integer, varchar, timestamp, etc., but databases have no "image" type — and there was a need for one. For example, it is very convenient to write a single query that renders a table containing images — say, up and down arrows for moving rows, icons for deleting rows, etc. — so that everything lives in the database and is handled at that level, rather than being wired up in the user interface every time. For example, I can output things like this in one query:

(screenshot: a query result table containing such image controls)

This is possible thanks to blob fields. In addition to the file storage table, you can create, say, an IMG table in the client's file database and store images there. At the physical level, a column of this type stores only the image identifier; when rendering the page, the interpreter script, on encountering the image data type, uses this ID to fetch the image from the file database's blob field and display it. The user never sees this internal kitchen: he simply declares an output parameter of type image in his procedure and writes a query that selects an image-typed table field into that parameter. A very convenient mechanism.

This concludes my database series for now.

If you have suggestions for improving the mechanisms described here, or alternative ways to organize file storage with similar capabilities, write them in the comments.

Source: https://habr.com/ru/post/319418/

