
In October 2011, HighLoad++, the annual conference for developers of high-load projects, was held in Moscow.
I decided to share the main points of the conference with readers. Since all the information is open and available on the conference page, I figured that gathering all the theses in one place would not be a bad undertaking. Note right away that this write-up contains no detailed information about each talk; only the key points are covered.
So, here is what was said at HighLoad++ 2011.
Designing a cloud web service “the grown-up way” / Sergey Ryzhikov (1C-Bitrix)
1C has three data centers, located in Russia, the USA, and Europe. A loss of connectivity between data centers can last for hours. Asynchronous master-master replication is configured between the database servers.
The entire architecture is built on Amazon Web Services.
Amazon S3 is used for static content; among its advantages is its low price compared to EBS.
Elastic Load Balancing, CloudWatch, and Auto Scaling are also used.
The database machines run on EC2, each with 17.1 GB of RAM. Software RAID arrays are built from EBS volumes; RAID-10 was chosen as the fastest and most reliable option.
InnoDB is used as the storage engine.
Backups are made via EC2 snapshots with the file system frozen (freeze) for consistency.
Amazon RDS is not used for the following reasons:
- No full root access to the database
- Not transparent
- Risk of long downtime
Architecture of binary data storage at Odnoklassniki / Alexander Khristoforov, Oleg Anastasyev (Odnoklassniki)
The possibility of using one-db was evaluated initially. It had the following limitations:
- Poor performance
- Long backups (up to 17 hours)
HDFS, Cassandra, Voldemort, and Krati were also evaluated.
The current solution consists of a ZooKeeper cluster and a storage cluster.
Each cluster node hosts N storages. Each storage contains data segments (256 MB) and a RAID-1 array of logs on a separate disk. I/O is handled by an NIO socket server (Mina).
Disk space reservation (preallocation) is done using xfs_io.
The index is a hash table built on a plain array, stored in direct (off-heap) memory.
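Such an index might look roughly like the following minimal sketch: an open-addressing hash table over a plain array of slots kept in a direct ByteBuffer, so it lives outside the Java heap (the layout, names, and fixed capacity are my assumptions, not Odnoklassniki's actual code):

```java
import java.nio.ByteBuffer;

// Sketch of an off-heap index: a fixed-capacity open-addressing hash
// table mapping long keys to long offsets, stored in direct memory so
// it adds no pressure to the garbage collector. Assumes the table is
// never filled to capacity and that key 0 is never used.
public class OffHeapIndex {
    private static final int SLOT_SIZE = 16; // 8-byte key + 8-byte offset
    private static final long EMPTY = 0L;    // 0 is reserved as "no key"
    private final ByteBuffer slots;
    private final int capacity;

    public OffHeapIndex(int capacity) {
        this.capacity = capacity;
        this.slots = ByteBuffer.allocateDirect(capacity * SLOT_SIZE);
    }

    public void put(long key, long offset) {
        int i = slot(key);
        // Linear probing until we find an empty slot or the same key.
        while (slots.getLong(i * SLOT_SIZE) != EMPTY
                && slots.getLong(i * SLOT_SIZE) != key) {
            i = (i + 1) % capacity;
        }
        slots.putLong(i * SLOT_SIZE, key);
        slots.putLong(i * SLOT_SIZE + 8, offset);
    }

    public long get(long key) {
        int i = slot(key);
        long k;
        while ((k = slots.getLong(i * SLOT_SIZE)) != EMPTY) {
            if (k == key) return slots.getLong(i * SLOT_SIZE + 8);
            i = (i + 1) % capacity;
        }
        return -1; // not found
    }

    private int slot(long key) {
        return (int) ((key & Long.MAX_VALUE) % capacity);
    }
}
```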
At startup, data integrity is checked. Logs are synchronized and cleaned up as necessary.
The replication factor is 3.
Routing is based on partitioning: a hash of the ID is computed and the remainder of dividing it by the number of partitions is taken. The resulting partition value is looked up in the routing table, which points to the disk holding the data.
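A minimal sketch of this routing scheme (the class shape and the routing-table representation are assumptions for illustration):

```java
import java.util.Map;

// Sketch of hash-based routing: the partition is the remainder of the
// ID's hash divided by the number of partitions, and a routing table
// maps each partition to the disk holding the data.
public class Router {
    private final int partitions;
    private final Map<Integer, String> routingTable; // partition -> disk id

    public Router(int partitions, Map<Integer, String> routingTable) {
        this.partitions = partitions;
        this.routingTable = routingTable;
    }

    public String diskFor(long id) {
        int hash = (int) (id ^ (id >>> 32));             // hash of the ID
        int partition = (hash & Integer.MAX_VALUE) % partitions;
        return routingTable.get(partition);              // routing table lookup
    }
}
```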
The concept of regions is used, so no data movement is required when expanding the cluster.
ZooKeeper is used for coordination; its advantages are reliability and performance. The following data is stored in ZooKeeper (a minimal usage sketch follows the list):
- Available servers and their addresses
- Drive locations and statuses
- Routing tables
- Distributed locking
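As a sketch of how a node might register itself and read shared state via the ZooKeeper Java API (the znode paths and layout are assumptions for illustration, and the parent znodes are assumed to exist):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch: a storage node registers itself with an ephemeral znode, so
// the list of available servers shrinks automatically when a node's
// session dies; the routing table lives in a persistent znode.
public class Coordination {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000, null);

        // Ephemeral node: exists only while this server's session is alive.
        zk.create("/cluster/servers/node-42", "10.0.0.42:9999".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Read the shared routing table published by the cluster manager.
        byte[] routing = zk.getData("/cluster/routing-table", false, null);
        System.out.println(new String(routing));
    }
}
```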
The system is deployed on 24 servers: 20 for storage, 2 for logs, 2 for backups.
Why you should not use MongoDB / Sergey Tulentsev
The main theses of the report:
- Map/Reduce is slow and single-threaded
- Each map/reduce operation takes a read or write lock
- Memory-mapped files are a problem: memory management is left to the OS and is poor
- Having all shards be equal is not always convenient, and the built-in auto-sharding is hard to configure well
However, the speaker says that you can use MongoDB as a default choice, at least until the speed of individual queries becomes an issue or the capabilities of relational databases are needed.
For the first time in RuNet: a tale about 100M emails a day / Andrey Sas (Badoo)
Sending is asynchronous. The send queue is implemented on top of files: native tooling, easy logging, and ease of reading and writing.
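A minimal sketch of a file-backed send queue in this spirit (the format and API are assumptions, not Badoo's actual code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Sketch of a file-backed mail queue: producers append one message
// reference per line; the sender worker claims the whole file by
// renaming it, so messages already accepted to disk survive a process
// crash instead of vanishing with an in-memory queue.
public class FileMailQueue {
    private final Path queueFile;

    public FileMailQueue(Path queueFile) { this.queueFile = queueFile; }

    // Appends are cheap; an fsync could be added for stronger
    // durability at the cost of throughput.
    public synchronized void enqueue(String messageId) throws IOException {
        Files.write(queueFile, (messageId + "\n").getBytes(),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Atomically rename the file to claim the batch; new messages go
    // to a fresh file while the worker drains the claimed one.
    public synchronized Path claimBatch() throws IOException {
        Path batch = queueFile.resolveSibling(
                queueFile.getFileName() + "." + System.currentTimeMillis());
        Files.move(queueFile, batch, StandardCopyOption.ATOMIC_MOVE);
        return batch;
    }
}
```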
The SSMTP client is used instead of sendmail.
In-memory queues are not used, for fault-tolerance reasons (fear of losing emails).
A local cache of DNS queries was implemented, and the number of DNS and SMTP workers was increased.
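A minimal sketch of such a local DNS cache with a TTL (everything beyond the idea itself is an assumption for illustration):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a local DNS cache: resolved addresses are kept with an
// expiry time, so repeated sends to the same domains skip the resolver.
public class DnsCache {
    private static class Entry {
        final InetAddress address;
        final long expiresAt;
        Entry(InetAddress address, long expiresAt) {
            this.address = address;
            this.expiresAt = expiresAt;
        }
    }

    private final ConcurrentHashMap<String, Entry> cache = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public DnsCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

    public InetAddress resolve(String host) throws UnknownHostException {
        long now = System.currentTimeMillis();
        Entry e = cache.get(host);
        if (e != null && e.expiresAt > now) {
            return e.address;                      // cache hit
        }
        InetAddress address = InetAddress.getByName(host);
        cache.put(host, new Entry(address, now + ttlMillis));
        return address;
    }
}
```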
A big recipe book, or frequently asked questions on managing complex systems / Alexander Titov, Igor Kurochkin (Skype)
For provisioning, the Cobbler tool is proposed; it supports a wide range of Linux distributions and has convenient interaction mechanisms.
For configuration management - Chef.
Monitoring - via the Zabbix API.
Backups cover statistics, databases, and repositories.
Designing large-scale data collection applications / Josh Berkus
Mozilla's Socorro project is recommended for collecting crash statistics. HBase is used for storage as the most scalable solution: the data itself lives in HBase (40 TB on 30 nodes), metadata (500 GB) is handled by 2 PostgreSQL servers, and 6 servers do load balancing.
Puppet is used as the configuration management tool.
12 Redis use cases - in Tarantool / Alexander Calendarev, Konstantin Osipov (Mail.ru)
Tarantool is a NoSQL DBMS for storing the “hottest” data. It keeps all data in RAM and writes every change to disk, thus ensuring durability in the event of a failure. Keeping data in memory allows it to serve up to 200-300k requests per second on typical modern hardware.
Scaling is proposed to be done with a Tarantool proxy and shards.
Tarantool will soon also get the following features:
- Transaction support
- Master replication
- Cluster Manager
- Load balancer
Thus, the Mail.ru representatives present Tarantool as a replacement for Redis, trailing it in performance by only ~5%.
Secrets of garbage collection in Java / Alexey Ragozin
Reference counting is not the best memory-management tool: it cannot reclaim cyclic graphs and combines poorly with multithreading. It also costs 15-30% of CPU time.
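A classic toy illustration of the cycle problem (my example, not from the talk): the two objects below keep each other's reference count nonzero forever, while a tracing collector frees them once they become unreachable from the roots:

```java
// Toy illustration: a cycle that pure reference counting can never free.
public class CycleDemo {
    static class Node {
        Node next;
    }

    public static void main(String[] args) {
        Node a = new Node();
        Node b = new Node();
        a.next = b;   // a -> b
        b.next = a;   // b -> a: both reference counts stay >= 1
        a = null;
        b = null;
        // Under reference counting the pair would leak; the JVM's
        // tracing collector sees the pair is unreachable and frees it.
        System.gc();  // hint only; collection is not guaranteed
    }
}
```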
According to the generational hypothesis, most objects die young, and while they live, only a small number of other objects refer to them. Thus, by storing “young” and “old” objects separately, a performance gain can be achieved.
JVMs such as HotSpot have a great many tuning flags. Information on how to use them can be found in Alexey's blog.
Concurrent Mark Sweep (CMS) GC is recommended for applications that are sensitive to garbage collection pauses. Enable, in particular: -XX:+ExplicitGCInvokesConcurrent
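For illustration, a launch line with CMS enabled might look like this (only -XX:+ExplicitGCInvokesConcurrent comes from the talk; the rest of the flag set is my assumption):

```
java -Xms32g -Xmx32g \
     -XX:+UseConcMarkSweepGC \
     -XX:+ExplicitGCInvokesConcurrent \
     -XX:+PrintGCDetails \
     -jar app.jar
```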
CMS mostly collects garbage while the application keeps running: “young” generation objects are collected in stop-the-world mode (a fairly fast operation), while the “old” generation is collected concurrently, which takes a long time. Accordingly, the application must satisfy the generational hypothesis.
As a result, pauses can be kept to no more than 150 ms even on a 32 GB heap.
An alternative is to use off-heap memory, but that is a much harder task.
Apache Cassandra - another NoSQL store / Vladimir Klimontovich
Apache Cassandra = Amazon Dynamo + Google Bigtable.
It partitions data based on a token ring topology; replication follows the same topology. In this respect Cassandra resembles Amazon Dynamo.
From Bigtable it takes the key/value data model. Complex queries and indexes are available (a useless thing, in the speaker's opinion).
The query cache uses a LIFO mechanism.
The easiest way to scale is by a factor of 2: the token ranges of the partitions are simply split in half.
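A minimal sketch of ring ownership and doubling on a toy token space (numbers and names are illustrative):

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of token-ring partitioning as used by Dynamo-style systems:
// a key belongs to the first node whose token is >= the key's token,
// wrapping around to the lowest token at the end of the ring.
public class TokenRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public void addNode(long token, String node) {
        ring.put(token, node);
    }

    public String ownerOf(long keyToken) {
        SortedMap<Long, String> tail = ring.tailMap(keyToken);
        Long token = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(token);
    }

    public static void main(String[] args) {
        TokenRing ring = new TokenRing();
        // Two nodes split a [0, 100) token space in half.
        ring.addNode(50, "node-A");   // owns tokens up to 50 (plus wrap-around)
        ring.addNode(100, "node-B");  // owns (50, 100]
        // Doubling the cluster: each range is simply split in two, so
        // existing nodes keep half of their old range and no unrelated
        // data has to move.
        ring.addNode(25, "node-C");
        ring.addNode(75, "node-D");
        System.out.println(ring.ownerOf(60)); // node-D after doubling
    }
}
```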
Nodes communicate with each other based on the Thrift protocol.
AJAX Layout / Oleg Illarionov (VK)
The page is divided into several iframes, which send independent AJAX requests.
HTML5 is actively used, in particular local storage.
Static assets and content are loaded in parallel.
Pages are cached. Custom stubs are used to work with the browser's History API. When navigating back, the tree is detached from the DOM and the environment variables are copied.
For quick search through content, the search is performed on the client.
Building cloud storage for storing and delivering static content based on the integration of Nginx and OpenStack Swift / Stanislav Bogatyrev, Nikolay Dvas (Clodo)
For use in clouds, object storage such as Amazon S3 or Rackspace Cloud Files seems to be the right solution. Clodo's cloud storage is built on Swift (the technology underlying Rackspace Cloud Files) and is aimed primarily at storing and serving content for websites; unlike the more general-purpose S3 and Cloud Files, this application is the main focus of its design. Swift storage is slow at serving content to a large number of clients, so nginx was chosen as the frontend and modified in two respects:
- A multi-zone cache was added (saving 40% of space on the expensive disks used for caching);
- A cache management mechanism was added, working both per object and per container, for use with a frontend cluster running an independent nginx on each node.
PS I hope the article will not provoke a negative reaction because of the nature of the write-up (extremely concise information: theses from the slides and the speakers' talks). Not all the talks are covered; information about the rest can be found on the official website.
PPS Thanks to Oleg Bunin and the Ontiko team for organizing HighLoad++ - looking forward to the next conference in 2012!