The other day the HighLoad++ conference for developers of high-load systems was held in Moscow, and I was lucky enough to attend. Below I briefly go over the talks I sat in on, highlighting the points I personally found interesting.
I'll warn you right away that I may have misunderstood or misinterpreted some things. If that matters to you, don't read this post; come to the next conference in person instead!
Day 1
Storing and delivering VK content
The talk began with a story from two years ago, when I asked Pasha Durov how VKontakte stores user data. That day Pasha answered "on disk," and I haven't liked him since :) This time Oleg (the speaker) promised to answer the question properly, dedicating the whole talk to it.
Initially all files (photos) were stored on the same server as the code, in an upload folder. This approach is simple, but has two flaws: security (uploading executable code, XSS) and scalability.
Next the guys tackled scalability with the following scheme: files are uploaded to the main server and then moved to auxiliary servers dedicated to user data.
Each user is always assigned the same auxiliary server, determined by the user id.
However, this approach has a bottleneck: the main server.
So the next step was to upload files directly to a content server, bypassing the main application server. To implement this, one question has to be answered: which server should a photo be uploaded to?
- the most recently added server (not suitable: it quickly runs out of space);
- the server with the most free space (not suitable: all the upload traffic hits that one server);
- a random server with free space (the best option; a small sketch follows below);
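A rough sketch of both routing rules in Python (the server registry and the free-space threshold are invented for illustration, this is not VK's code):

```python
import random

# Assumed in-memory registry: server id -> free space in bytes.
servers = {1: 120_000_000, 2: 980_000_000, 3: 450_000_000}

def server_for_user(user_id):
    """Old scheme: a user is always pinned to the same auxiliary server."""
    ids = sorted(servers)
    return ids[user_id % len(ids)]

def server_for_upload(min_free=200_000_000):
    """New scheme: pick a random server that still has enough free space."""
    candidates = [sid for sid, free in servers.items() if free >= min_free]
    return random.choice(candidates)

print(server_for_user(42), server_for_upload())
```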
Find and eliminate bottlenecks. Usually there is only one bottleneck; compared to it, the rest of the system's nodes work an order of magnitude better. When storing large amounts of user data, the bottlenecks can be:
- data backup (RAID is used; if an entire 2-disk array burns out, the user files are lost forever);
- caching (60% of the content is served from the cache on the client);
- traffic (the solution is a CDN, especially for video: VKontakte has servers in Moscow that cache video from the data center in St. Petersburg);
XFS is used as the file system. However, user files are not stored as separate files in the file system, but inside a single large file. That file is always open for reading, and its index is kept in RAM. The guys see the further development of this technology as complete abstraction from the physical media.
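The exact format was not shown, so here is only a toy sketch of the idea (file name and layout are made up): blobs are appended to one big file, and an in-memory dictionary maps a file id to its offset and length.

```python
DATA_FILE = "storage.bin"
index = {}  # file_id -> (offset, length), kept in RAM

def put(file_id, blob):
    """Append a blob to the single data file and remember where it landed."""
    with open(DATA_FILE, "ab") as f:
        offset = f.tell()
        f.write(blob)
    index[file_id] = (offset, len(blob))

def get(file_id):
    """Read a blob back via the in-memory index, no per-file metadata on disk."""
    offset, length = index[file_id]
    with open(DATA_FILE, "rb") as f:
        f.seek(offset)
        return f.read(length)

put("photo_1", b"...jpeg bytes...")
print(get("photo_1"))
```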
Because public WiFi is now everywhere, VKontakte is switching to HTTPS. Since there is a huge number of content servers, buying a certificate for each of them is too expensive, so VKontakte runs several servers that proxy content over HTTPS.
As for photos:
- more than 30,000,000,000 in total, with 17,000,000 added per day;
- images are pre-compressed on the client (Flash);
- GraphicsMagick is used;
- when a photo is saved, it is immediately resized to all the required resolutions;
As for audio:
- 130K tracks are added per day;
- users can "add" tracks from another user's page to their own, which reduces the volume of uploaded audio by orders of magnitude;
- for copyright holders' requests, files were initially matched by MD5 hash; now an algorithm has been developed that finds similar recordings by their acoustic characteristics;
As for video:
- 320K videos are added per day;
- video processing is deferred via a queue: processing video online is too expensive;
- ffmpeg is used for video processing;
- videos get duplicated when re-uploaded by other users; this is not considered a problem;
- at one time VKontakte had P2P video delivery in Flash (wat!?); now they manage without it;
Overall, I got the feeling that VKontakte is built relatively simply. Simple, but effective.
NoSQL in a high-load project
A talk about how NoSQL is used at Mamba. A peculiarity of the system in question is that more than 30% of executed queries are counter increments and counter reads.
The search for suitable storage started, as it should, with Memcached. It did not satisfy them because of the lack of persistence (RAM only). They tried Redis: also RAM only.
In addition, under heavy load Memcached performance drops a hundredfold (the Brutis tool was used for load testing).
To cut a long story short, the guys ended up using TokyoTyrant. However, it turned out to have database-integrity problems when the server was unplugged from the power socket :) They then decided to use a newer project by the author of TokyoTyrant, KyotoTycoon, but could not write 30M records into it because of architectural constraints.
So the guys moved on to LevelDB from Google. This database is built on an LSM-tree. Data is stored in SSTable files: sorted, immutable key-value pairs.
Writes first go into a similar (but mutable) structure in RAM; from time to time the in-memory subtree is merged down to disk.
To avoid losing data in a sudden power outage, a Write Ahead Log is used.
The guys ran their tests again and found that in most cases LevelDB beats all the options they had tried before. They now handle 4,700 gets/sec and 1,600 updates/sec with 200M records in the database.
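As an illustration of the counter workload on top of LevelDB, here is a minimal sketch using the plyvel Python bindings (my assumption for the example; Mamba's own code and key layout were not shown):

```python
import plyvel  # Python bindings for LevelDB

db = plyvel.DB("/tmp/counters", create_if_missing=True)

def incr(key, delta=1):
    """Read-modify-write of a counter stored as ASCII bytes.

    Note: this is not atomic; a real service would need locking or sharded counters.
    """
    raw = db.get(key)
    value = (int(raw) if raw is not None else 0) + delta
    db.put(key, str(value).encode())
    return value

print(incr(b"user:42:visits"))  # 1
print(incr(b"user:42:visits"))  # 2
db.close()
```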
MVCC unmasked
MultiVersion Concurrency Control is a mechanism that lets readers not block writers and writers not block readers. In general, it significantly reduces the number of locks in the database. It is present in Oracle, MySQL InnoDB, PostgreSQL and some others.
Each row in the tables of an MVCC system has two attributes: creation (xmin), the id of the transaction that created the row, and expiration (xmax), the id of the transaction that deleted it.
INSERT: xmin 40, xmax NULL
DELETE: xmin 40, xmax 47
UPDATE: xmin 64, xmax 78 / xmin 78, xmax NULL
When each statement is executed, an MVCC snapshot is created, which determines which data in the database is visible/accessible to that statement.
The algorithm for determining row visibility is roughly as follows (a toy model follows after the list):
- take the number of the last committed transaction (the current one);
- consider visible those rows for which:
- xmin < the current number;
- they are not deleted (xmax = NULL), or the deleting transaction has not been committed;
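To make the rule concrete, here is a toy model in Python (my own simplification, not PostgreSQL code): a set of committed transaction ids stands in for pg_clog, and the snapshot is just the id of the last completed transaction.

```python
# A row is visible if it was created by a committed transaction no newer than
# the snapshot, and is either not deleted or deleted by a transaction the
# snapshot cannot yet see.
def is_visible(row, snapshot_xid, committed):
    if row["xmin"] > snapshot_xid or row["xmin"] not in committed:
        return False  # created "in the future" or by an uncommitted transaction
    if row["xmax"] is None:
        return True   # never deleted
    return row["xmax"] > snapshot_xid or row["xmax"] not in committed

committed = {40, 47}
rows = [
    {"xmin": 40, "xmax": None},  # plain insert: visible
    {"xmin": 40, "xmax": 47},    # deleted by committed tx 47: invisible
    {"xmin": 64, "xmax": None},  # inserted after the snapshot: invisible
]
print([is_visible(r, 50, committed) for r in rows])  # [True, False, False]
```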
The xmin and xmax fields are present in every table but hidden by default. You can always request them explicitly, though:
SELECT xmin, xmax, * FROM test;
The current transaction id can be obtained with SELECT txid_current(). Transaction statuses are kept in pg_clog, and it is important to understand that rolling back a transaction simply sets the corresponding flag on that transaction's record there. No data is deleted; the transaction is just marked as rolled back.
Non-MVCC DBMSs have to change the data immediately when rows are modified or deleted. MVCC DBMSs do not have this problem: all outdated row versions are cleaned up later.
Here I will add on my own behalf that this deferred cleanup (the so-called VACUUM) is not as perfect as one would like... but that is a topic for a separate discussion.
By the way, here is the speaker's site, where he promises many interesting articles.
MySQL at Google
Perhaps the biggest disappointment among the talks. In short, its essence boils down to a single thesis: "Yes, we run MySQL, slightly patched by us."
Maybe my expectations were too high, but I expected something more interesting from the Good Corporation.
The largest projects: Ads, Checkout and, of course, YouTube.
The data centers use hardware that is neither the most expensive nor the most advanced. Components (disks) fail very often, but are easy and cheap to replace.
MySQL is used as a cluster. For each shard there is a separate process (a "decider") responsible for electing a new master if the old one fails.
Heartbeat is implemented by a simple script that writes some data to the master and checks that it shows up on the replicas.
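The script itself was not shown; below is a rough Python sketch of the same idea (pymysql, the host names and the heartbeat table are all my assumptions): write a timestamp to the master and see how far behind a replica is.

```python
import time
import pymysql

master = pymysql.connect(host="master.db", user="hb", password="...", database="monitoring")
replica = pymysql.connect(host="replica.db", user="hb", password="...", database="monitoring")

# Write the current timestamp to the master.
now = time.time()
with master.cursor() as cur:
    cur.execute("REPLACE INTO heartbeat (id, ts) VALUES (1, %s)", (now,))
master.commit()

time.sleep(1)  # give replication a moment

# Read it back from the replica; the difference approximates replication delay.
with replica.cursor() as cur:
    cur.execute("SELECT ts FROM heartbeat WHERE id = 1")
    row = cur.fetchone()
lag = time.time() - float(row[0]) if row else float("inf")
print("approximate replication delay, seconds:", lag)
```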
To achieve scaling and fault tolerance, the guys impose many restrictions on working with the database. For example, a write transaction may not take longer than 30 seconds; this is needed so that another slave can be promoted to master quickly.
SPDY support in NginX
SPDY is a binary protocol on top of a TCP/TLS connection that serves as a transport for HTTP. Developed by Google.
Main features:
- multiplexing (several requests per connection, both from server to client and vice versa);
- header compression (zlib, deflate);
- flow control (a window per TCP connection);
- server push (the server "pushes" data into the browser on its own initiative);
Pros: built-in HTTPS (thanks to TLS), good performance with a large number of images (thanks to multiplexing).
Cons: works with a single domain and, as a result, does not play well with CDNs.
According to (production) measurements on Wordpress: SPDY is faster than HTTPS, but slower than plain HTTP.
SQL tricks
The talk was a list of SQL/PostgreSQL features that are interesting but not familiar to everyone:
- a view is an effective tool for encapsulating logic and/or tables. I do not agree with the former, but I do use the latter (for renaming tables);
- Postgres relies on the OS cache (so the suggestion is to allocate its own buffers either very little memory or all of it);
- pgbench is a Postgres benchmarking utility;
- Common Table Expressions (CTE) / recursive CTEs;
- window functions;
- JSON support (since 9.2);
- LATERAL support (using values computed from the current row inside the FROM clause, since 9.3);
- an extension for working with text and trigrams: pg_trgm;
- an extension for storing key-value pairs: hstore;
- querying tables in other databases via dblink;
- extensions for working with geo data: PostGIS and EarthDistance;
- prepared statements allow you to avoid the overhead of parsing a query, checking object existence and permissions, and building a plan (this can sometimes take more time than the execution itself);
- locks can be taken explicitly via SELECT ... FOR SHARE;
- in PL/pgSQL prepared statements are available through EXECUTE;
- code blocks: DO $code$ ... $code$ (since 9.1);
- upsert is better implemented as insert -> catch the constraint violation -> update (a sketch follows after this list);
- a function can return several values by returning cursors (within a transaction) or by building JSON (row_to_json, array_to_json);
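For the upsert item, here is a hedged sketch of that pattern with psycopg2 (the DSN, the kv table and its unique constraint are made up; the talk only named the approach, not this code):

```python
import psycopg2

conn = psycopg2.connect("dbname=test user=test")  # placeholder DSN

def upsert(key, value):
    """Try to insert; if the constraint fires, fall back to an update."""
    try:
        with conn.cursor() as cur:
            cur.execute("INSERT INTO kv (k, v) VALUES (%s, %s)", (key, value))
        conn.commit()
    except psycopg2.IntegrityError:
        conn.rollback()  # the row already exists
        with conn.cursor() as cur:
            cur.execute("UPDATE kv SET v = %s WHERE k = %s", (value, key))
        conn.commit()

upsert("lang", "sql")
```

(INSERT ... ON CONFLICT did not exist yet at the time of the talk; it appeared only in Postgres 9.5.)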
The speaker, by the way, is seriously dangerous: he wrote a JSON parser in Postgres using a recursive CTE and regexps, in a single query!
Crouching server, hidden diode
"Hello, my name is Andrei, and I'm a redneck from Voronezh." The description of this talk could end right there :) In general, Aksyonov in all his glory.
Key points:
- when benchmarking, the main thing is to keep the goal in mind: measure the right thing, not the load average;
- average values of metrics mean nothing: the average temperature across a hospital is made up of cold corpses and feverish patients;
- scalability is non-linear: by Amdahl's law, even if only 5% of the code does not parallelize, 64 CPUs give only about a 14.5x speedup; a more general model is C(n) = n / (1 + a*(n-1) + b*n*(n-1)), where a is the degree of contention (the cost of the non-parallelized code) and b is the degree of coherency (the cost of coherence, communication and synchronization);
- there is a sweet spot: the amount of resources at which performance is optimal (a quick calculation follows after this list);
- past the sweet spot performance starts to fall, so adding resources can actually make things worse;
- do not test with default settings: for most software they are, to put it mildly, strange (fsync after every INSERT and a 32 MB innodb_buffer_pool in MySQL);
- if you run the same query 1000 times, you are testing the cache, not the database;
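Just to get a feel for those numbers, here is a quick check of the formula above in Python (the a and b values are illustrative, not figures from the talk):

```python
# Universal Scalability Law: speedup C(n) for n CPUs with contention a and coherency b.
def usl(n, a=0.05, b=0.0001):
    return n / (1 + a * (n - 1) + b * n * (n - 1))

for n in (1, 8, 16, 32, 64, 128, 256):
    print(f"{n:4d} CPUs -> speedup {usl(n):6.1f}")

# The sweet spot (maximum of C) is roughly sqrt((1 - a) / b).
print("sweet spot ~", round(((1 - 0.05) / 0.0001) ** 0.5), "CPUs")
```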
World constants (it is important to understand roughly how much each operation costs):
- CPU L1: 1,000,000,000 op/sec (1e9)
- CPU L2, branch mispredict: 100,000,000 op/sec (1e8)
- RAM access: 10,000,000 op/sec (1e7)
- SSD behind a megaraid: 100,000 op/sec (1e5)
- SSD: 10,000 op/sec (1e4)
- LAN roundtrip, 1 MB from RAM: 1,000 op/sec (1e3)
- HDD seek, 1 MB over LAN: 100 op/sec (1e2)
- WAN roundtrip: 10 op/sec (1e1)
- Memcached access: 10,000 op/sec (1e4)
- simple select from an RDBMS: 100 op/sec (1e2)
Go language
Getting started and "Hello, World" in Go.
Day 2
How search works
A sobering morning talk from Andrei Aksyonov :)
The work of any search engine has four milestones: data acquisition (the spider robot), indexing, searching, and scaling.
Indexing includes:
- extracting text (html, pdf -> text);
- tokenization;
- morphological processing (stemming, lemmatization);
- building an inverted index (keyword -> the pages of the book it occurs on, plus the positions of the word in them);
The data in the index must be compressed. Compression can be bit-, byte- or block-level: the less data there is, the less has to be processed per query.
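Byte-level compression of a posting list usually means storing gaps between sorted document ids and packing each gap into a variable number of bytes. A small sketch of my own for illustration (not any particular engine's format):

```python
# Delta + variable-byte encoding: 7 payload bits per byte, the high bit marks
# the last byte of a gap.
def encode_postings(doc_ids):
    out = bytearray()
    prev = 0
    for doc_id in sorted(doc_ids):
        gap, chunk = doc_id - prev, []
        prev = doc_id
        while True:
            chunk.insert(0, gap % 128)
            if gap < 128:
                break
            gap //= 128
        chunk[-1] += 128  # mark the terminating byte
        out.extend(chunk)
    return bytes(out)

# Three document ids fit into 6 bytes instead of 3 * 4 = 12.
print(len(encode_postings([824, 829, 215406])))
```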
The search itself consists of two orthogonal stages: fast matching (find) and high-quality ranking (rank).
The matching criteria can vary depending on the use case (web, data mining). Ranking, in turn, can never be fully correct in principle, because relevance is personal for every user. So tricky algorithms like BM25 (TF, IDF, document length) are used, and their results are personalized whenever possible.
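As a minimal illustration of the "inverted index plus fast matching" part, here is a toy example in Python (my own, unrelated to any particular engine; ranking is left out entirely):

```python
from collections import defaultdict

docs = {
    1: "to be or not to be",
    2: "to do is to be",
    3: "do be do be do",
}

# Build the inverted index: keyword -> set of document ids.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.split():  # trivial tokenization, no morphology
        index[token].add(doc_id)

def match(query):
    """Fast matching: intersect the posting lists of all query terms."""
    postings = [index[t] for t in query.split()]
    return set.intersection(*postings) if postings else set()

print(match("to be"))  # {1, 2}
```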
As for scaling: search needs resources, so Google has millions of search servers and Yandex tens of thousands.
Search at Odnoklassniki
The guys managed to get by with full-scan search (on MS SQL) up to 30M users. On a cluster of 16 databases a single search query took 15-30 seconds on average.
Realizing that they could not go on living like that, they started looking for a solution.
Since the project is written in Java, they looked towards Lucene. Over 3 years of working with Lucene they made the following changes to it:
- added replication;
- storing indexes in memory (they tried a RAM drive and mapping the files into the heap, but in the end they simply loaded the files into a byte array in the old generation);
- rewrote the index search (the default implementation created too many objects, which led to GC problems);
Now the infrastructure has a separate indexer server, which builds the search index. The index is replicated to the query servers, which keep it both on disk and in memory (queries are served from the in-memory copy). The indexer receives data through a queue, which allows it to be unavailable from time to time.
A big flaw was the absence of any way to update the index directly from the database, bypassing the queue, because some messages in the queue were occasionally lost. I can confirm the same problem at my own company. Conclusion: when working with queues there should always be a sanity check.
Only 5% of the queries (the most frequent ones) live permanently in the cache. The cache hit rate is 60%.
For personal queries, temporary caches are created on demand and live for a few minutes.
Each user of the portal is assigned to a particular app server (computed from the userId). If that server has problems, the user is moved to a reserve one.
Searching for online users was initially done over the common index, filtering by the "online" flag outside the index. Later a separate index was created for this.
Storing data in Evernote
An overview talk about how Evernote is built, with numbers.
Software: Java 6, MySQL 5.1, Debian (stable), DRBD (distributed storage), Xen.
Hardware: SuperMicro 1U, 2x L5630 CPUs, 96 GB RAM, 6x 300 GB Intel SSD, LSI RAID 5 (+ spare), ~$8,000.
Database: MySQL, 10 TB (peak read IOPS 350, peak write IOPS 50).
Search engine: Lucene (peak read IOPS 800, peak write IOPS 50).
Servers are located in the USA and China (400 Linux servers).
File access goes through Apache/WebDAV. Data is always stored on the same host and is never migrated.
Compared to NFS, WebDAV has a slight overhead, but it is much simpler and cheaper to deploy and maintain. Load balancing is handled by an A10 hardware balancer with SSL termination (HTTPS faces the outside world, HTTP is proxied inside).
Taking into account the specifics of the service (lots of small files), the problems and their criticality can be summarized by the following table:
| | size | normal load | peak load |
|---|---|---|---|
| bandwidth | - | medium | medium |
| latency | - | low | low |
| cpu | - | low | medium |
| file size | high | low | medium |
| metadata | low | medium | low |
The author does not recommend cloud platforms if your application is resource-intensive (bandwidth, storage, CPU) or uses resources in widely varying amounts.
If in doubt, calculate how much buying and running your own servers for the application would cost versus how much Amazon would. In Evernote's case the difference is 1-4 orders of magnitude.
DDoS mechanics
I missed this talk almost entirely because I was at lunch :) From the final part I only managed to catch a couple of theses.
The simplest protection against HTTP attacks is limit zones in Nginx. The nginx "testcookie" module is also worth a look.
Nevertheless, it is important to understand that Nginx is not a cure for DDoS attacks.
DDoS attacks in Russia in 2012
In total there were more than 2,600 attacks in 2012 (against roughly 1,700 the year before).
A botnet of 3,000 machines is enough to bring down an average website. The largest botnet observed consisted of 150K machines. Most bots live in Germany, the USA, Ukraine and Kazakhstan (in descending order).
DDoS activity coincides with e-commerce activity (well, plus a bit of politics).
Political DDoS attacks use botnets from Pakistan and India (out of reach of law-enforcement agencies).
Most attacks generate less than 1 Gbit of traffic (an ordinary site's bandwidth is no wider anyway).
When under attack, the first thing to do is analyze the logs (access.log) and take a traffic dump (tcpdump -s0 -c1000000 -w attack.dump). Then neutralize the attack: filter out the junk traffic, block the attacking IP addresses, change the IP.
You should always have at least a 2x performance margin for the site.
A peculiarity of Russian DDoS is the Full Browser Stack: the use of real browsers that pass all anti-DDoS checks.
Well, and yes: if you expect problems, contact Qrator.
Scaling in 2012
The talk was most likely about load optimization or something along those lines; alas, I spent its entire beginning in the toilet :) I mention it only because of a rather interesting discussion that unfolded during the Q&A.
The speaker mentioned the complete absence of redundancy and scaling on his project. This led to a question that at first seemed unexpected: what is your site's business model? It turned out that on the speaker's project people pay for access to the site's features with subscriptions (for 6, 12 or 24 months). Given that outages that take the site down happen only a couple of times a year, and that all users have already paid for their subscriptions, high availability turns out to be rather expensive, complicated and, most importantly, not all that necessary :) It would be a different story if the project's revenue depended on every single request!
Proactive Web Performance Optimization
The talk was supposed to be given by a Twitter employee, so I personally awaited it with great impatience, anticipating something interesting. However, right after the talk started it turned out that he had joined Twitter only recently and had spent most of his time doing web optimization. And to finish off the crowd, he told us about a wonderful tool/utility/profiler/plug-in for web optimization: YSlow.
Distributed, scalable, high-load data storage for virtual machines
By and large, the talk describes how Parallels built its own take on GoogleFS, with chunks, metadata servers and the other usual attributes.
A distinctive (to me) detail is that the guys propose using SSDs, which are still expensive but fast. At Google, by contrast, the policy comes down to buying cheap hardware that is not a pity to throw away and replace.
The main purposes of virtualization are sharing the resources of a single host, plus efficient backup and deployment.
The ideal solution is to merge the disks into an array and then split it between virtual machines via SAN storage. But that is expensive, so everyone uses regular disks.
With journaling, data first goes to the log and only then to disk. This way, if the disk suddenly loses power, the log can be replayed and the missing data written to the disk.
What stuck with me from the talk is the mention of the PAXOS algorithm, told through the story of the Greek island of Paxos and its parliament. The members of parliament are also businessmen, so they are often away, and at any moment no more than half of the parliamentarians are present. The absent ones can only be reached by messengers. Nevertheless, the parliament must keep working and passing laws, and every parliamentarian must always know the latest laws. As far as I understood, the guys use this algorithm to keep the chunk servers in sync.
A power outage can be emulated by sending the process SIGSTOP. The application on the other end of the TCP connection ends up in the same situation as during a real power failure, and it is much faster than actually pulling the plug.
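In Python this boils down to a couple of lines (the pid below is a placeholder for the process you want to freeze):

```python
import os
import signal

# Freeze the process: it stops responding, but its TCP sockets stay open,
# so the peer sees the same picture as during a sudden power loss.
pid = 12345  # placeholder: pid of the server process under test
os.kill(pid, signal.SIGSTOP)

# ... observe how the client behaves against the frozen server ...

os.kill(pid, signal.SIGCONT)  # let the process run again
```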
A recommendation service on a virtual Hadoop cluster
The guys implemented a "site recommendations" map/reduce task on Hadoop. The talk walked through their particular problems and solutions, so there is probably nothing to write down here.
MariaDB: The new MySQL
An overview talk about MariaDB, a fork of MySQL developed by MySQL's founder after MySQL passed from Sun to Oracle.
The default engine is Percona XtraDB; InnoDB is also supported.
In essence, the talk boils down to the changelog of MariaDB 5.5.
I’m on the journey
A talk from Oracle, mostly about what will appear in new versions: a kind of MySQL roadmap.
The talk answered a rather interesting question: why does Oracle support MySQL at all? The company wants to be present in all markets, and MySQL covers Web, Mobile and Embedded. Besides, nobody wants to lose clients like Facebook and Twitter. And the huge community goes without saying.
Separately, I would note that a migration utility from databases such as MsSQL, Sybase and others is included in MySQL Installer for Windows. Unfortunately, it cannot also rewrite all of the project's surrounding code, so its point is rather lost on me.
How to choose the right index
A talk about what indexes are and what they are eaten with. It contained a fair amount of theory and examples, so I recommend the slides to anyone interested in the topic.
The theses I found most valuable:
- do not create too many indexes: they slow down inserts and carry overhead in general, so add indexes only where they are really needed;
- select only the data you need: less data travels over the network and less data is processed by the CPU;
- sorting is a rather expensive operation;
- DISTINCT is a very slow operation, since it works on the result set of the query;
- index (a, b) is better than index (a) only for equality comparisons;
- a clustered index contains all the row data (apparently meaning that with a clustered index the table data is physically ordered by the index);
- if a column is not in the index but is used in the query, the row has to be fetched from the table;
Presentations
All the presentations are available on SlideShare.
PS
I will leave this post without pictures, since the WiFi in the MuMu cafe at Kievsky Station is not much better than the local Olivier salad :)