
Small storage for small files

During the development of one project, it became necessary to store a large number of files (more than 4 million), and their number kept growing. Once there were more than 6 million, working with them became difficult: even with the files spread across an extensive directory tree, traversing parts of that tree took hours. Of course, at first nobody thought about how all this should be stored, and we used an ordinary hard drive with ext4. At some point the read speed from that partition dropped to 9 MB/s, which is far too slow. An experimental switch to btrfs raised it to 13 MB/s, but those figures are not impressive either. Nobody was going to use SSDs for this, and the volume had already exceeded 1 TB, so everything pointed toward RAID. Since the commercial success of the project was in doubt, costs had to be kept to a minimum, which meant a software implementation.

So, we need a small storage system on a single server or computer, i.e. no more than four drives, holding small files of 1-3 MB each.

In our case btrfs was faster than ext4, so we decided to use it.
The candidates for managing the disks are raid0 and raid10. Raid0 is objectively the fastest, although some believe raid10 is faster: in theory raid10 can match raid0 on reads, but in practice that is very doubtful, and plenty of hardware RAID benchmarks confirming this are easy to find.
The main advantage of raid10 is reliability, which matters less for data that is not critical. If you need to store a lot of data, you end up using either RAID or LVM anyway; you could also do it the old-fashioned way and distribute the data across the drives by hand. The reliability is the same in all cases: lose a drive, lose data. In raid0 the stripe size (chunk) is configurable, so with small files most of the data can be recovered.
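
The chunk size mentioned above is set when the array is created. A minimal sketch with mdadm's --chunk option, using the same four partitions that appear later in the article; the 512 KiB value is only an illustration, not what was actually used:

 # set an explicit stripe (chunk) size when creating the raid0 array;
 # 512 KiB is an illustrative value, not the one used in the article
 mdadm -v --create /dev/md0 --level=raid0 --chunk=512 --raid-devices=4 \
     /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1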

Four Western Digital Caviar Blue 1TB drives were taken as the test subjects - probably among the fastest drives commonly available without paying a premium.
raid0 is created as follows:

 mdadm -v --create /dev/md0 --level=raid0 --raid-devices=4 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1


Creation happens instantly, because there is nothing to copy or rebuild.
Save the config so that the array is reassembled after a reboot:
 mdadm --examine --scan --config=mdadm.conf >> /etc/mdadm.conf
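
To check that the array came up as expected, the standard md status file and mdadm's detail view can be used (these checks are not shown in the original article, but both are standard tooling):

 # quick sanity check of the freshly created array
 cat /proc/mdstat
 mdadm --detail /dev/md0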


Create a file system:

 # mkfs.btrfs -Lraid0 /dev/md0
 
 WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
 WARNING! - see http://btrfs.wiki.kernel.org before using
 
 fs created label raid0 on /dev/md0
         nodesize 4096 leafsize 4096 sectorsize 4096 size 3.64TB
 Btrfs Btrfs v0.19


Interestingly, the file system is still marked experimental. That, however, did not stop the Fedora 18 developers from making btrfs the default file system.

Let's perform a quick speed test:

 # hdparm -t /dev/md0
 
 /dev/md0:
  Timing buffered disk reads: 2008 MB in 3.00 seconds = 669.11 MB/sec


On average, linear read speeds range from 650 to 690 MB/s.
For comparison, that is faster than any SATA3 SSD, since it exceeds the bandwidth of the interface itself.

Usable capacity is 3.7 TB.

A two-day test showed stable random read and random write speeds during file operations of 202 MB/s and 220 MB/s respectively. Restricting the load to reads only or writes only gives about 560 MB/s. These are steady averages.
Very good; many consumer SSDs do not reach that level.
The speed is quite decent for now, so this is what we settled on. If one of the drives starts to fail, the whole array has to be stopped, the data copied sector by sector to a new drive, and the array brought back online. For storage that is not particularly important, this is acceptable. If the data suddenly does become important, reliability and overall fault tolerance can be improved with drbd.
During testing, peak read and write speeds of around 700 MB/s were frequently observed, but since the files are small, these peaks occurred when several files were being read in parallel. The chunk size apparently plays a role here.

For the sake of completeness, let's run the same tests with raid10.
Creating raid10:

 mdadm -v --create /dev/md0 --level=raid10 --raid-devices=4 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1


This time, however, you have to wait a while for the initial sync: a couple of hours.
As a result, we have the following:

 # hdparm -t /dev/md0
 
 /dev/md0:
  Timing buffered disk reads: 1080 MB in 3.01 seconds = 359.07 MB/sec


To put it mildly, this is disappointing: exactly the same result as raid0 on two disks. The theoretical ability to read from all four drives at once simply does not show up.
But linear speed, while indicative, is not what matters most to us; random access to the files is more important.
It came to 190 MB/s for reading and 125 MB/s for writing.
The write result is a complete disappointment, although the read speed is comparable to raid0. And remember that with raid10 we also lose half the disk capacity.

There is another way to get a software raid0, using btrfs itself. The documentation describes two parameters that affect performance and reliability. The data profile controls how data is stored and affects both access speed and storage reliability; the options are raid0, raid1, raid10 and single. The metadata profile controls how metadata is stored and mainly affects the ability to recover data after minor failures; the options are raid0, raid1, raid10, single and dup. So we see the already familiar raid0 and raid10. A third parameter, mixed, modestly promises a possible performance gain when working with files up to 1 GB; its effect is not obvious in our case.
The parameters are set when the file system is created:

 # mkfs.btrfs -d raid0 -m raid1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
 
 WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
 WARNING! - see http://btrfs.wiki.kernel.org before using
 
 adding device /dev/sdd1 id 2
 adding device /dev/sde1 id 3
 adding device /dev/sdf1 id 4
 fs created label (null) on /dev/sdc1
         nodesize 4096 leafsize 4096 sectorsize 4096 size 3.64TB
 Btrfs Btrfs v0.19


If you decide to use mixed, note that the data and metadata profiles must then be the same.
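
A minimal sketch of creating the file system with mixed block groups, assuming the --mixed (-M) option of mkfs.btrfs; data and metadata are given the same profile, as required:

 # mixed data/metadata block groups; both use the same raid0 profile
 mkfs.btrfs -M -d raid0 -m raid0 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1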
Note that, unlike with md raid, there is no single block device here. You mount any one of the partitions belonging to the file system, for example: mount /dev/sdc1 /mnt/raid0
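
On a multi-device btrfs the kernel usually needs to be told about all member devices before mounting; a small sketch, assuming the mount point /mnt/raid0 (the path is only illustrative):

 # register all btrfs member devices with the kernel, then mount via any one of them
 btrfs device scan
 mkdir -p /mnt/raid0
 mount /dev/sdc1 /mnt/raid0
 # the multi-device pool can be inspected afterwards
 btrfs filesystem show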

As a result, the figures turned out very close to raid0 on md: 198 MB/s read and 216 MB/s write, with read peaks a little under 700 MB/s. The average read-only or write-only speed was 660 MB/s, which was a pleasant surprise.

When the file system is created with -m raid0, concurrent read/write speeds rise slightly, to 203 MB/s read and 219 MB/s write, but the separate read-only/write-only speed drops slightly, to 654 MB/s.

Overall, pure btrfs can be recommended, without additional layers that could become points of failure.

There is also the option of storing data on LVM in striping mode, which is a very close analogue of raid0. However, the author has not gotten along with LVM for more than a year now, so this option was not considered, for purely subjective reasons.
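
For reference, a minimal sketch of what LVM striping across the same four partitions could look like; the volume group and logical volume names are made up for the example, and the stripe size is only illustrative:

 # prepare the physical volumes and a volume group
 pvcreate /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
 vgcreate vg_store /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
 # create a logical volume striped across all four drives (raid0-like layout)
 lvcreate -i 4 -I 512 -l 100%FREE -n lv_store vg_store
 mkfs.btrfs -L lvm-stripe /dev/vg_store/lv_store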

And finally - an unsuccessful experiment with raid10 in btrfs:

 # mkfs.btrfs -L raid10 -d raid10 -m raid10 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
 
 WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
 WARNING! - see http://btrfs.wiki.kernel.org before using
 
 adding device /dev/sdd1 id 2
 adding device /dev/sde1 id 3
 adding device /dev/sdf1 id 4
 fs created label raid10 on /dev/sdc1
         nodesize 4096 leafsize 4096 sectorsize 4096 size 3.64TB
 Btrfs Btrfs v0.19


Creation took a couple of seconds, which raised more questions than joy.
Reading came to 122 MB/s and writing to 127 MB/s; the separate read-only/write-only speed was 657 MB/s. In other words, it is not suitable for our task, although the speed of the separate operations is surprisingly pleasant.

To put colleagues' minds at ease, we also tested ext4 on top of raid0. The results were as expected.

 # mkfs.ext4 /dev/md0 -L raid0
 mke2fs 1.42.5 (29-Jul-2012)
 Filesystem label=raid0
 OS type: Linux
 Block size=4096 (log=2)
 Fragment size=4096 (log=2)
 Stride=256 blocks, Stripe width=1024 blocks
 244195328 inodes, 976760832 blocks
 48838041 blocks (5.00%) reserved for the super user
 First data block=0
 Maximum filesystem blocks=4294967296
 29809 block groups
 32768 blocks per group, 32768 fragments per group
 8192 inodes per group
 Superblock backups stored on blocks:
         32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
         4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
         102400000, 214990848, 512000000, 550731776, 644972544
 
 Allocating group tables: done
 Writing inode tables: done
 Creating journal (32768 blocks): done
 Writing superblocks and filesystem accounting information: done


Let's do a little tuning that favorably affects performance:

 tune2fs -O extents,uninit_bg,dir_index /dev/md0
 tune2fs -o journal_data_writeback /dev/md0
 tune2fs -m 0 /dev/md0
 mount /dev/md0 raid0 -o acl,user_xattr,noatime,nodiratime,barrier=0,commit=0
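
To keep these mount options across reboots they could also be placed in /etc/fstab; a sketch, assuming a mount point of /mnt/raid0 (the path is an assumption, the article mounts into a relative directory called raid0):

 # /etc/fstab entry with the same tuning options (mount point is illustrative)
 /dev/md0  /mnt/raid0  ext4  acl,user_xattr,noatime,nodiratime,barrier=0,commit=0  0 0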


For our task the results are dismal: 60 MB/s read and 86 MB/s write under mixed load, and 535 MB/s for read-only or write-only.

The conclusions are obvious: for our task btrfs is far more advantageous than ext4, and raid0 is faster than raid10.

It only remains to add that, in addition to the production data handlers, third-party utilities were used: stress and bonnie++ for synthetic tests, while the actual read/write figures were collected with sysstat.
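
By way of illustration only (the exact invocations are not given in the article), the load and the measurements could look roughly like this:

 # synthetic load: I/O and disk stress workers plus a file-oriented benchmark
 stress --io 4 --hdd 4 --timeout 600s
 bonnie++ -d /mnt/raid0 -u nobody
 # collect per-device throughput with sysstat's iostat, once per second
 iostat -mx 1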

Source: https://habr.com/ru/post/176759/

