⬆️ ⬇️

Storage of a large number of files

image



Good health, Habrazhiteli! In the process of working on a dating site project, it became necessary to organize the storage of user photos. According to the terms of reference, the number of photos of one user is limited to 10 files. But there may be tens of thousands of users. Especially considering the fact that the project in its current form already exists from the beginning of the “zero”. That is, there are already thousands of users in the database. Almost any file system, as far as I know, reacts very negatively to the large number of child nodes in the folder. From experience, I can say that problems begin after 1000-1500 files / folders in the parent folder.



Disclaimer I googled before writing the article and found several solutions to the issue under discussion (for example, here or here ). But I did not find a single solution that exactly corresponds to mine. In addition, in this article I just share my own experience in solving the problem.



Theory



In addition to the storage task as such, there was also a condition in the TK, according to which the ability to leave captions and photographs was needed. Of course, without a database here is not enough. That is, the first thing we do is create a table in which we prescribe the mapping of meta-data (signatures, titles, etc.) with files on the disk. Each file corresponds to one line in the database. Accordingly, each file has an identifier.

')

A small digression. Let's talk about autoincrement. On a dating site there can be a dozen or two thousand users. The question is how many users in general go through the project for all the time of its existence. For example, the active audience of "dating-ru" is several hundred thousand. However, just imagine how many users have retired during the lifetime of this project; how many users have not been activated yet. And now add our legislation, which obliges you to store information about users for at least six months ... Sooner or later, 4 of a billion UNSIGNED INT kopecks will run out. By this best for the primary-key to take BIGINT .



Now let's try to imagine a BIGINT number. This is 8 bytes. Each byte is from 0 to 255. 255 child nodes is quite normal for any file system. That is, we take the file identifier in hexadecimal representation, break it into chunks of two characters each. We use these chunks as folder names, the latter being the name of a physical file. PROFIT!



0f/65/84/10/67/68/19/ff.file



Elegant and simple. File extension here is not important. Anyway, the file will be given to the script, which will give the browser in particular the MIME type, which we will also store in the database. In addition, storing information about a file in the database allows you to override the path to it for the browser. Let's say the file is actually located relative to the project directory along the path /content/files/0f/65/84/10/67/68/19/ff.file . And in the database you can write him the URL, for example, /content/users/678/files/somefile . SEO-Schnick now, probably, quite smiled. All this allows us not to worry more about where to place the file physically.



Table in db



In addition to the identifier, MIME type, URL and physical location, we will store in the table md5 and sha1 files for screening out the same files if necessary. Of course, we also need to store relationships with entities in this table. Suppose the user ID to which the files belong. And if the project is not very big, then in the same system we can store, say, photos of the goods. By this, we will also store the name of the class of the entity to which the entry belongs.



By the way, about the birds. If you close the folder with .htaccess for external access, the file can only be obtained through a script. And in the script it will be possible to determine access to the file. Looking ahead a bit, I’ll say that in my CMS (where the aforementioned project is being sawn) access is determined by basic user groups, of which I have 8 - guests, users, managers, admins, unactivated, blocked, remote and super admins. A super admin can do absolutely everything, so he is not involved in the definition of access. If the user has a super-admin flag, then he is super-admin. It's simple. That is, we will determine the accesses of the remaining seven groups. Access is simple - either to give the file, or not to give. Total, you can take the field type TINYINT .



And one moment. According to our legislation, we will have to physically store custom pictures. That is, we need to somehow mark the pictures as deleted, instead of physically deleting. It is most convenient to use a bit field for these purposes. I usually use the INT type field in such cases. So with a margin, so to speak. Moreover, I have already established tradition to place the DELETED flag in the 5th bit from the end. But it is not fundamentally the same.



What we have in the end:



 create table `files` ( `id` bigint not null auto_increment, --   `entity_type` char(32) not null default '', --   `entity` bigint null, -- ID  `mime` char(32) not null default '', -- MIME- `md5` char(32) not null default '', -- MD5 `sha1` char(40) not null default '', -- SHA1 `file` char(64) not null default '', --   `url` varchar(250) not null default '', -- URL `meta` text null, -- -   JSON    `size` bigint not null default '0', --  `created` datetime not null, --   `updated` datetime null, --   `access` tinyint not null default '0', --   `flags` int not null default '0', --  primary key (`id`), index (`entity_type`), index (`entity`), index (`mime`), index (`md5`), index (`sha1`), index (`url`) ) engine = InnoDB; 


Dispatcher class



Now we need to create a class with which we will upload files. The class must provide the ability to create files, replace / change files, delete files. In addition, it is worth considering two points. First, the project can be transferred from server to server. So in the class you need to define a property that contains the root directory of files. Secondly, it will be very unpleasant if someone bangs a table in the database. So you need to provide the ability to recover data. With the first all in general is understandable. As for data backup, we will only reserve something that cannot be restored.



ID - restored from the physical location of the file

entity_type - not recoverable

entity - cannot be restored

mime - restored using finfo extension

md5 - restored from the file itself

sha1 - restored from the file itself

file - restored from the physical location of the file

url - not restored

meta - cannot be restored

size - recovers from the file itself

created - you can take information from the file

updated - you can take information from a file

access - not restored

flags - cannot be restored



Immediately you can drop the meta-information. It is not critical for the functioning of the system. And for more rapid recovery, you still need to save the MIME type. Total: entity type, entity ID, MIME, URL, access, and flags. In order to increase the reliability of the system, we will store backup information on each final folder separately in the folder itself.



Class code
 <?php class BigFiles { const FLAG_DELETED = 0x08000000; //    "" /** @var mysqli $_db */ protected $_db = null; protected $_webRoot = ''; protected $_realRoot = ''; function __construct(mysqli $db = null) { $this->_db = $db; } /** * /   URL- * @param string $v  * @return string */ public function webRoot($v = null) { if (!is_null($v)) { $this->_webRoot = $v; } return $this->_webRoot; } /** * /    * @param string $v  * @return string */ public function realRoot($v = null) { if (!is_null($v)) { $this->_realRoot = $v; } return $this->_realRoot; } /** *   * @param array $data   * @param string $url URL   * @param string $eType   * @param int $eID ID  * @param mixed $meta - * @param int $access  * @param int $flags  * @param int $fileID ID   * @return bool * @throws Exception */ public function upload(array $data, $url, $eType = '', $eID = null, $meta = null, $access = 127, $flags = 0, $fileID = 0) { $meta = is_array($meta) ? serialize($meta) : $meta; if (empty($data['tmp_name']) || empty($data['name'])) { $fid = intval($fileID); if (empty($fid)) { return false; } $meta = empty($meta) ? 'null' : "'" . $this->_db->real_escape_string($meta) . "'"; $q = "`meta`={$meta},`updated`=now()"; $this->_db->query("UPDATE `files` SET {$q} WHERE (`id` = {$fid}) AND (`entity_type` = '{$eType}')"); return $fid; } // File data $meta = empty($meta) ? 'null' : "'" . $this->_db->real_escape_string($meta) . "'"; $finfo = finfo_open(FILEINFO_MIME_TYPE); $mime = finfo_file($finfo , $data['tmp_name']); finfo_close($finfo); // FID, file name if (empty($fileID)) { $eID = empty($eID) ? 'null' : intval($eID); $q = <<<sql insert into `files` set `mime` = '{$mime}', `entity` = {$eID}, `entityType` = '{$eType}', `created` = now(), `access` = {$access}, `flags` = {$flags} sql; $this->_db->query($q); $fid = $this->_db->insert_id; list($ffs, $fhn) = self::fid($fid); $url = $this->_webRoot . $url . '/' . $fid; $fdir = $this->_realRoot . $ffs; self::validateDir($fdir); $index = self::getIndex($fdir); $index[$fhn] = array($fhn, $mime, $url, ($eID == 'null' ? 0 : $eID), $access, $flags); self::setIndex($fdir, $index); $fname = $ffs . '/' . $fhn . '.file'; } else { $fid = intval($fileID); $fname = $this->fileName($fid); } // Move file $fdir = $this->_realRoot . $fname; if (!move_uploaded_file($data['tmp_name'], $fdir)) { throw new Exception('Upload error'); } $q = '`md5`=\'' . md5_file($fdir) . '\',`sha1`=\'' . sha1_file($fdir) . '\',' . '`size`=' . filesize($fdir) . ',`meta`=' . $meta . ',' . (empty($fileID) ? "`url`='{$url}',`file`='{$fname}'" : '`updated`=now()'); $this->_db->query("UPDATE `files` SET {$q} WHERE (`id` = {$fid}) AND (`entity_type` = '{$eType}')"); return $fid; } /** *   * @param string $url URL * @param string $basicGroup    * @throws Exception */ public function read($url, $basicGroup = 'anonimous') { if (!ctype_alnum(str_replace(array('/', '.', '-', '_'), '', $url))) { header('HTTP/1.1 400 Bad Request'); exit; } $url = $this->_db->real_escape_string($url); $q = "SELECT * FROM `files` WHERE `url` = '{$url}' ORDER BY `created` ASC"; if ($result = $this->_db->query($q)) { $vars = array(); $ints = array('id', 'entity', 'size', 'access', 'flags'); while ($row = $result->fetch_assoc()) { foreach ($ints as $i) { $row[$i] = intval($row[$i]); } $fid = $row['id']; $vars[$fid] = $row; } if (empty($vars)) { header('HTTP/1.1 404 Not Found'); exit; } $deleted = false; $access = true; $found = ''; $mime = ''; foreach ($vars as $fdata) { $flags = intval($fdata['flags']); $deleted = ($flags & self::FLAG_DELETED) != 0; $access = self::granted($basicGroup, $fdata['access']); if (!$access || $deleted) { continue; } $found = $fdata['file']; $mime = $fdata['mime']; } if (empty($found)) { if ($deleted) { header('HTTP/1.1 410 Gone'); exit; } elseif (!$access) { header('HTTP/1.1 403 Forbidden'); exit; } } else { header('Content-type: ' . $mime . '; charset=utf-8'); readfile($this->_realRoot . $found); exit; } } header('HTTP/1.1 404 Not Found'); exit; } /** *   ()   * @param mixed $fid () * @return bool * @throws Exception */ public function delete($fid) { $fid = is_array($fid) ? implode(',', $fid) : $fid; $q = "delete from `table` where `id` in ({$fid})"; $this->_db->query($q); $result = true; foreach ($fid as $fid_i) { list($ffs, $fhn) = self::fid($fid_i); $fdir = $this->_realRoot . $ffs; $index = self::getIndex($fdir); unset($index[$fhn]); self::setIndex($fdir, $index); $result &= unlink($fdir . '/'. $fhn . '.file'); } return $result; } /** *  ()  "" * @param int $fid () * @param bool $value   * @return bool */ public function setDeleted($fid, $value=true) { $fid = is_array($fid) ? implode(',', $fid) : $fid; $o = $value ? ' | ' . self::FLAG_DELETED : ' & ' . (~self::FLAG_DELETED); $this->_db->query("update `files` set `flags` = `flags` {$o} where `id` in ({$fid})"); return true; } /** *   * @param int $fid  * @return string * @throws Exception */ public function fileName($fid) { list($ffs, $fhn) = self::fid($fid); self::validateDir($this->_realRoot . $ffs); return $ffs . '/' . $fhn . '.file'; } /** *   . *           . * @param int $fid   * @return array */ public static function fid($fid) { $ffs = str_split(str_pad(dechex($fid), 16, '0', STR_PAD_LEFT), 2); $fhn = array_pop($ffs); $ffs = implode('/', $ffs); return array($ffs, $fhn); } /** *    * @param string $f     * @return bool * @throws Exception */ public static function validateDir($f) { if (!is_dir($f)) { if (!mkdir($f, 0700, true)) { throw new Exception('cannot make dir: ' . $f); } } return true; } /** *    * @param string $f       * @return array */ public static function getIndex($f) { $index = array(); if (file_exists($f . '/.index')) { $_ = file($f . '/.index'); foreach ($_ as $_i) { $row = trim($_i); $row = explode('|', $row); array_walk($row, 'trim'); $rid = $row[0]; $index[$rid] = $row; } } return $index; } /** *    * @param string $f       * @param array $index    * @return bool */ public static function setIndex($f, array $index) { $_ = array(); foreach ($index as $row) { $_[] = implode('|', $row); } return file_put_contents($f . '/.index', implode("\r\n", $_)); } /** *   * @param string $group   (. ) * @param int $value   * @return bool */ public static function granted($group, $value=0) { $groups = array('anonimous', 'user', 'manager', 'admin', 'inactive', 'blocked', 'deleted'); if ($group == 'root') { return true; } foreach ($groups as $groupID => $groupName) { if ($groupName == $group) { return (((1 << $groupID) & $value) != 0); } } return false; } } 




Consider some points:



- realRoot - full path to the file system folder ending in a slash.

- webRoot - path from the site root without a leading slash (see below why).

- As a DBMS, I use the MySQLi extension.

- In fact, in the upload method, the first argument passes information from the $ _FILES array .

- If, when calling the update method, to transfer the ID of an existing file, it will be replaced if the input array in the tmp_name is non-empty.

- You can remove and change the file flags for several pieces at once. For this you need to transfer instead of the file identifier either an array with identifiers, or a string with them separated by commas.



Routing



Actually, it all comes down to several lines in htaccess in the site root (it is assumed that mod_rewrite is enabled):



 RewriteCond %{REQUEST_URI} ^/content/(.*)$ RewriteCond %{REQUEST_FILENAME} !-f RewriteRule ^(.+)$ content/index.php?file=$1 [L,QSA] 


“Content” is the folder in the root of the site in my case. Needless to say, you can name the folder differently. And of course, index.php itself, which in my case is stored in the content folder:



 <?php $dbHost = '127.0.0.1'; $dbUser = 'user'; $dbPass = '****'; $dbName = 'database'; try { if (empty($_REQUEST['file'])) { header('HTTP/1.1 400 Bad Request'); exit; } $userG = 'anonimous'; //      ;      $files = new BigFiles(new mysqli($dbHost,$dbUser,$dbPass,$dbName)); $files->realRoot(dirname(__FILE__).'/files/'); $files->read($_REQUEST['file'],$userG); } catch (Exception $e) { header('HTTP/1.1 500 Internal Error'); header('Content-Type: text/plain; charset=utf-8'); echo $e->getMessage(); exit; } 


Well, by itself, we close the file system itself from external access. Put in the root of the content/files folder the .htaccess file with only one line:



 Deny from all 


Total



This solution avoids the loss of file system performance due to an increase in the number of files. At least troubles in the form of thousands of files in one folder can be precisely avoided. And at the same time, we can organize and control access to files by human-readable addresses. Plus compliance with our grim legislation. Immediately make a reservation, this solution is NOT a complete way to protect content. Remember: if something is played in the browser, it can be downloaded for free.

Source: https://habr.com/ru/post/423875/



All Articles