
How and why we changed the shard configuration in the Evernote architecture

In last year's overview post on the Evernote architecture, we gave a general description of the servers, or "shards", that we use for both data storage and application logic. Since Evernote is a much more personal service than, say, a social network, we can easily spread individual users' data across different shards, which gives us fairly simple linear scalability. Each such pair of shards runs two virtual machines:

[image: diagram of the previous shard configuration]

Each of these virtual machines stores transactional "metadata" in a MySQL database on a RAID-1 array of two 300 GB, 15,000 rpm Cheetah disks. A separate RAID-10 array of 3 TB, 7,200 rpm Constellation disks is partitioned to store the large Lucene full-text search index files for each user. The paired virtual machines replicate each of these partitions from the current primary to the current secondary machine using synchronous DRBD.

These shards have enough disk space and I/O capacity to comfortably serve 100,000 registered Evernote users each for at least four years, and they have spare drive bays in their 4U enclosures so they can be expanded later if necessary. With dual L5630 processors and 48 GB of RAM, each such unit costs up to $10,000 and draws about 374 watts. That works out to roughly $0.10 of hardware cost and 3.7 milliwatts of power per registered user.
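
As a quick check of those per-user figures, the arithmetic is simply the unit cost and power divided by the number of users served (an illustrative snippet using only the numbers quoted above):

    // Per-user hardware cost and power for the previous shard generation,
    // computed from the figures quoted above.
    public class OldShardPerUser {
        public static void main(String[] args) {
            double unitCostUsd = 10_000.0;   // cost per unit
            double unitPowerW = 374.0;       // power draw per unit, in watts
            int usersPerUnit = 100_000;      // registered users served per unit

            System.out.printf("cost per user:  $%.2f%n", unitCostUsd / usersPerUnit);           // ~$0.10
            System.out.printf("power per user: %.1f mW%n", 1000.0 * unitPowerW / usersPerUnit); // ~3.7 mW
        }
    }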

Opportunities for improvement


The generation of shards described above gave us a good price-to-performance ratio with the very high level of data redundancy that we need. However, we found several areas where this configuration was not ideal for our purposes. For example:
  1. The 15,000 rpm disks holding the MySQL database are typically idle 95% of the time, since InnoDB does an excellent job of caching and serializing I/O. However, we found occasional bottlenecks when users with large accounts start the initial synchronization of their data on a new device: if their metadata is not already in the buffer pool in RAM, the resulting burst of I/O can become very expensive.
  2. The Lucene search indexes for our users generate much more I/O than we expected; we see Lucene performing twice as many read/write operations as MySQL. This is largely due to our usage model: every time a note is created or edited, we have to update its owner's index and flush the changes to disk so that they take effect immediately (see the sketch after this list).
  3. DRBD is great for replicating one or two small partitions, but it becomes very awkward with a large number of big partitions per server. Each partition must be configured, managed, and monitored independently, and various problems can require a full resynchronization of all resources, which can take many hours even over a dedicated 1 Gb/s crossover cable.
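
To make the second point concrete, here is a minimal sketch of the update-and-flush-immediately pattern against a per-user index, written for a modern version of Lucene (the class and field names are our own illustration, not Evernote's actual code, and the Lucene API of 2012 differed in detail):

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;

    public class NoteIndexer {
        private final IndexWriter writer;

        public NoteIndexer(String userIndexDir) throws Exception {
            // One index directory per user, as described in the post.
            writer = new IndexWriter(
                    FSDirectory.open(Paths.get(userIndexDir)),
                    new IndexWriterConfig(new StandardAnalyzer()));
        }

        public void indexNote(String noteGuid, String noteText) throws Exception {
            Document doc = new Document();
            doc.add(new StringField("guid", noteGuid, Field.Store.YES));
            doc.add(new TextField("content", noteText, Field.Store.NO));

            // Replace any previous version of this note in the owner's index.
            writer.updateDocument(new Term("guid", noteGuid), doc);

            // Committing after every edit forces the changes to disk immediately,
            // which is what makes the search indexes so I/O-heavy.
            writer.commit();
        }
    }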

These constraints were the main factor limiting the number of users we could assign to each shard. Improving the manageability and I/O performance of the metadata storage would let us safely increase the density of user accounts. We address these problems in our new generation of shards by moving metadata storage onto solid-state drives and moving the redundancy logic for bulk file storage out of the operating system and into our application.

New configuration


Our new configuration replaces racks of a dozen 4U servers with racks that combine fourteen 1U shards for metadata and application logic with four 4U servers for file storage.

[image: diagram of the new rack configuration]

Each 1U shard runs a pair of simpler virtual machines, each of which uses a single partition on a separate RAID-5 array of Intel 300 GB solid-state drives. These two partitions are replicated with DRBD, and each virtual machine image runs on only one server at a time. We use no more than 80% of the SSD capacity, which significantly improves write reliability and I/O throughput. We also include a spare SSD in each unit instead of using RAID-6, avoiding an additional performance penalty of up to 15%, since rebuild times are short and DRBD replication hedges against the hypothetical failure of several disks at once.

File storage has moved from local disks on the main servers to pools of dedicated WebDAV servers that manage huge file systems on RAID-6 arrays.
Every time a resource file is added to Evernote, our application synchronously writes a copy of the file to two different file servers in the same rack before the metadata transaction commits. Off-site redundancy is also handled by the application, which replicates each new file to a remote WebDAV server asynchronously in the background.
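
Here is a hedged sketch of what that application-level redundancy could look like: two synchronous PUTs to WebDAV servers in the same rack before the metadata transaction commits, plus an asynchronous off-site copy in the background. The host names, path handling, and class names are invented for illustration and are not Evernote's actual code.

    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ResourceStore {
        // Hypothetical endpoints: two file servers in the local rack, one remote.
        private static final String[] LOCAL_DAV = {
                "https://dav1.rack1.example.com", "https://dav2.rack1.example.com"};
        private static final String REMOTE_DAV = "https://dav1.remote.example.com";

        private final ExecutorService background = Executors.newSingleThreadExecutor();

        public void storeResource(String path, byte[] data) throws IOException {
            // Synchronous copies to two file servers in the same rack; only after
            // both succeed does the caller commit the MySQL metadata transaction.
            for (String host : LOCAL_DAV) {
                put(host + path, data);
            }
            // The off-site copy is made asynchronously in the background.
            background.submit(() -> {
                try {
                    put(REMOTE_DAV + path, data);
                } catch (IOException e) {
                    // A real implementation would queue the file and retry.
                }
            });
        }

        private void put(String url, byte[] data) throws IOException {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(data);
            }
            if (conn.getResponseCode() >= 300) {
                throw new IOException("PUT failed: " + url + " -> " + conn.getResponseCode());
            }
            conn.disconnect();
        }
    }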

Results


This new configuration has enough I/O capacity and memory to serve up to 200,000 users per shard for at least four years. A rack of 14 shards and 4 file servers costs about $135,000 and consumes about 3,900 watts, which is roughly $0.05 and 1.4 milliwatts per user.
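
Again, the per-user figures follow directly from the rack numbers above (illustrative arithmetic only):

    // Per-user cost and power for the new rack, and the power reduction
    // versus the previous generation, using only the figures quoted in the post.
    public class NewRackPerUser {
        public static void main(String[] args) {
            int users = 14 * 200_000;                          // 2.8 million users per rack
            double costPerUser = 135_000.0 / users;            // ~$0.048
            double powerPerUserMw = 1000.0 * 3_900.0 / users;  // ~1.39 mW

            System.out.printf("cost per user:  $%.3f%n", costPerUser);
            System.out.printf("power per user: %.2f mW%n", powerPerUserMw);
            System.out.printf("power reduction vs. 3.7 mW: %.0f%%%n",
                    100.0 * (1.0 - powerPerUserMw / 3.7));     // ~62%
        }
    }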

Thus, per-user server cost and power consumption drop by about 60% on the new servers. The per-user power consumption of the rest of our service equipment (switches, routers, load balancers, image text-recognition servers, etc.) has decreased by a total of about 50% compared to our previous architecture. Together, these changes reduce our hosting costs over the long term.

We would rather not make grand environmental claims (this is where we could insert a photo of Phil Libin hugging a fluffy white seal pup), but it is worth noting that this 50% reduction in energy consumption proportionally reduces the carbon emissions from our equipment.

Beyond the obvious savings, the process of evaluating and testing these solutions gives us a deeper understanding of the components and technologies we use. We plan to write a few more posts on the details of testing and tuning SSD RAID arrays, comparing Xen and KVM in terms of I/O throughput, managing DRBD, and so on. We hope this information will be useful to colleagues building high-load services.

Source: https://habr.com/ru/post/143340/

