
Topface's photo storage is now open source

We are good at storing photos, so we decided to make life easier for anyone who wants to build their own Tumblr, Facebook, or Imgur. The task is actually simple, but there are subtleties that are better known in advance. On top of that, we built everything on node.js, which is not very typical for a storage system holding more than 100,000,000 photos.


We decided to open-source the part of the infrastructure that interacts directly with the disks. The rest is handled by nginx, and our improvements to it are either already in the main nginx branch (progressive JPEG support and variable support in the image_filter module) or waiting in the nginx-devel mailing list (the ability to "stick" to a particular edge of the image when cropping).

The storage is suitable not only for photos but for any data that tends to live forever and has a finite, reasonably small size. Large objects such as video are easier and cheaper to store by simply putting them on the file system. The data must be immutable: if you apply a filter to a photo, save the result under a new name.
There is no single point of failure anywhere in the infrastructure, so a meteorite hitting a couple of servers is not a problem for us. Disks fail much more often than we would like, so this property was designed in from the very beginning.

I will briefly describe each of the components.

backpack-coordinator


Link to GitHub. There you can also learn how to deploy the system as a whole.

Coordinators are the layer the application communicates with to store data in the system. They support the following commands:



Coordinator instances are equal and interchangeable; they manage data sharding and coordinate via ZooKeeper, so they know where to put the next photo. The number of instances in a shard determines the number of copies of the data; the number of shards and their size determine the total storage capacity.

A shard is several backpack instances holding the same data set.

When a photo is uploaded to the storage, it is saved on one instance of a shard, after which replication tasks are created for the other instances of the same shard. We run 3 backpack instances per shard on different physical machines, so that if a meteorite hits one server, the shard keeps functioning.

It is worth noting that we try hard to avoid unnecessary work, so adding a new shard triggers no rebalancing; we simply write to the new shard more actively. This is not because rebalancing is hard (it actually is), or because we failed to implement it (we did, then changed our minds), but because it is simply not necessary.
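The "write more actively to the new shard" idea can be sketched as weighted shard selection, where a shard's weight is its remaining free space. This is only an illustration under assumed names (`pickShard`, the `capacity`/`used` fields), not the real backpack-coordinator API:

```javascript
// Hypothetical sketch: pick a shard for the next upload, weighted by free
// space. New (emptier) shards naturally receive more writes, so no
// rebalancing is needed when capacity is added.
function pickShard(shards) {
  // Weight each shard by how many bytes it has left.
  const weights = shards.map(s => Math.max(s.capacity - s.used, 0));
  const total = weights.reduce((a, b) => a + b, 0);
  if (total === 0) throw new Error('storage is full');

  // Draw a random point in [0, total) and find the shard it lands on.
  let r = Math.random() * total;
  for (let i = 0; i < shards.length; i++) {
    r -= weights[i];
    if (r < 0) return shards[i];
  }
  return shards[shards.length - 1];
}
```

A full shard gets weight 0 and is never chosen, which matches the append-only nature of the storage: old shards fill up and stop receiving new writes without any data movement.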

backpack-replicator


Link to GitHub.

The replication layer; all it does is spread pictures across backpack instances. Unavailable instances receive their data once they come back up. It uses the zk-redis-queue module to guarantee that every message is processed at least once.

Replicator instances are also equal and interchangeable, so you can run as many as your load requires.
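The at-least-once contract behind zk-redis-queue can be shown with a minimal in-memory sketch (the real module coordinates via Redis and ZooKeeper; class and method names here are illustrative only). A task stays "in flight" until the worker acknowledges it; if the worker dies, the task is requeued, so the same replication task may be delivered more than once, but never lost:

```javascript
// Toy at-least-once queue: tasks are not removed on take(), only on ack().
class AtLeastOnceQueue {
  constructor() {
    this.pending = [];            // tasks waiting for a worker
    this.inFlight = new Map();    // id -> item currently being processed
    this.nextId = 0;
  }
  push(task) {
    this.pending.push({ id: this.nextId++, task });
  }
  // Hand a task to a worker; it stays in flight until ack(id).
  take() {
    const item = this.pending.shift();
    if (!item) return null;
    this.inFlight.set(item.id, item);
    return item;
  }
  ack(id) {
    this.inFlight.delete(id);
  }
  // Called when a worker is presumed dead: the task goes back to pending,
  // which is why delivery is "at least once" rather than "exactly once".
  requeue(id) {
    const item = this.inFlight.get(id);
    if (item) {
      this.inFlight.delete(id);
      this.pending.push(item);
    }
  }
}
```

This is also why the stored data must be immutable: replaying a replication task just overwrites a photo with identical bytes, so duplicate delivery is harmless.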

backpack


Link to GitHub.

The data-storage layer, the salt of the whole system. It speaks WebDAV commands (GET, PUT):



If desired, it can be replaced by nginx with the WebDAV module, or anything else that speaks WebDAV. nginx was not enough for us, so we had to build backpack ourselves. It is inspired by Haystack, the system Facebook stores its photos in. Haystack turned out to be closed source, but there is a whitepaper about it, from which precious knowledge was drawn.

The essence of backpack is to keep the index entirely in memory (Redis) and to pack many small files into large ones. To read a file, we only need to fetch its location (offset and length) from the index and issue a single pread. There is no need to load per-file metadata (permissions and the like); any piece of data is one disk seek away. Among other things, we use O_DIRECT to align reads on disk, which gives a few bonus percent of performance.

The gain of backpack over nginx can be seen in a test. We took 1,700,000 random photos and stored them both in backpack and under plain nginx on identical servers, then picked 100,000 random files and read them in 20 concurrent streams.

The red line is backpack, the green line is nginx.

Requests per second

[graph: requests per second, backpack vs nginx]

Backpack finished the whole run in 732 seconds, while nginx needed 1,200 seconds: about 39% less time.

Disk reads per second

[graph: disk reads per second, backpack vs nginx]

Here it becomes clear why: nginx does far more reads from the disk and moves the disk head more, and that operation is not free. Over time nginx fills the disk cache with metadata and speeds up, but even that is not enough. With even more data, the probability of nginx hitting the disk cache would be even lower, and we have much more data.

Disk I/O utilization

[graph: disk I/O utilization, backpack vs nginx]

So that you do not think I am deceiving you: the I/O graphs show that both setups kept the disk 100% busy the whole time.

All projects support node.js 0.10 and do not appear to leak: the heap stays constant after weeks of work. We run the daemons via mon.

We will be glad if you decide to deploy the backpack stack yourself!

Source: https://habr.com/ru/post/184652/

