
HighLoad++ 2012

The other day the HighLoad++ conference for developers of high-load systems was held in Moscow, and I was fortunate enough to attend. Below I briefly go over the talks I sat in on, highlighting the points I found interesting.

I'll warn you right away that I may have misunderstood or misinterpreted some things. If that matters to you, don't read this post; come to the next conference in person!

Day 1


Storing and delivering VK content


The talk began with a two-year-old story: back then I asked Pavel Durov how VKontakte stores user data, and he answered "on disk." I haven't liked him since :) This time Oleg (the speaker) promised to answer that question properly, dedicating a whole talk to it.

Initially, all files (photos) were stored on the same server as the code, in an upload folder. This approach is simple, but has two flaws: security (upload of executable code, XSS) and scalability.
Next, the team tackled scalability with the following scheme: files are uploaded to the main server and then transferred to auxiliary servers (dedicated to user data).
Each user is always mapped to the same auxiliary server, determined by the user id.
However, this approach has a bottleneck: the main server.

Therefore, the next step was to upload files directly to a content server, bypassing the main application server. To implement this approach, however, one question has to be answered: which server should a photo be uploaded to?
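The talk didn't spell out the exact mapping, but the idea of deterministically picking a content server from the user id can be sketched like this; the server list and the plain modulo scheme below are my assumptions, not VK's actual code:

    # Hypothetical sketch: deterministically map a user to a content server.
    CONTENT_SERVERS = [
        "cs1.example.com",
        "cs2.example.com",
        "cs3.example.com",
    ]

    def content_server_for(user_id: int) -> str:
        """Every user always lands on the same server for uploads and reads."""
        return CONTENT_SERVERS[user_id % len(CONTENT_SERVERS)]

    # The app only hands the client an upload URL; the bytes then flow
    # straight to the content server, bypassing the main server.
    print(content_server_for(12345))  # -> cs1.example.com (12345 % 3 == 0)

One caveat of plain modulo is that adding a server reshuffles almost all users, which is why schemes like consistent hashing are often preferred.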



Find and eliminate bottlenecks. There is usually just one: compared to it, the rest of the system's nodes work an order of magnitude better. When storing large amounts of user data, the bottlenecks can be the following:


XFS is used as the file system. However, user files are not stored as individual files in the file system; they are packed into a single large file. That file is always open for reading, and its index is kept in RAM. The team sees the further development of this technology in complete abstraction from the physical media.
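As I understood it, the scheme is essentially an append-only blob store with an in-memory index; here is a minimal sketch of that idea (my reconstruction, not VK's actual on-disk format):

    import os

    class BlobStore:
        """Many user files packed into one big file; the index lives in RAM."""

        def __init__(self, path: str):
            self.f = open(path, "a+b")   # the file is always open
            self.index = {}              # file_id -> (offset, length)

        def put(self, file_id: str, data: bytes) -> None:
            self.f.seek(0, os.SEEK_END)
            offset = self.f.tell()
            self.f.write(data)
            self.f.flush()
            self.index[file_id] = (offset, len(data))

        def get(self, file_id: str) -> bytes:
            offset, length = self.index[file_id]
            self.f.seek(offset)
            return self.f.read(length)

    store = BlobStore("photos.blob")
    store.put("photo_1", b"...jpeg bytes...")
    assert store.get("photo_1") == b"...jpeg bytes..."

The win over one-file-per-photo is that a read costs a single seek plus a read, with no filesystem metadata lookups on the hot path.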

Because of the wide spread of (open) WiFi, VKontakte is switching to HTTPS. Since there is a huge number of content servers, buying a certificate for each of them is too expensive. Therefore, VKontakte has several servers that proxy content over HTTPS.

As for pictures:


As for audio:


As for video:



Overall, I got the feeling that VKontakte is built relatively simply. Simple but effective.

NoSQL in a high-load project


A talk on how NoSQL is used at Mamba. A peculiarity of the system in question is that more than 30% of executed queries are counter increments and reads.

The search for suitable storage started, as it should, with Memcached. It didn't satisfy them: no persistence, RAM only. They tried Redis: RAM only as well.
In addition, under heavy load Memcached performance drops by a factor of 100 (the Brutis tool was used for load testing).

Sooner or later the team arrived at TokyoTyrant. However, it unexpectedly showed database integrity problems when the server was unplugged from the socket :) They then decided to try a newer development from the author of TokyoTyrant, Kyoto Tycoon. However, they could not write 30M records into it due to architectural constraints.

Therefore, the team moved on to LevelDB from Google. This database is built on the LSM-tree. Data is stored in SSTable files: sorted, immutable key-value pairs.
Writes go into a similar (but mutable) structure in RAM, and from time to time the in-memory subtrees are merged down to disk.
To avoid problems in the event of a sudden power outage, a Write Ahead Log is used.
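To make the access pattern concrete, here is a sketch of working with LevelDB from Python via the plyvel binding; the binding choice and the counter key schema are mine, purely for illustration:

    import plyvel

    db = plyvel.DB("/tmp/counters", create_if_missing=True)

    def incr(key: bytes, delta: int = 1) -> int:
        """Counter increment as read-modify-write (LevelDB has no atomic incr)."""
        raw = db.get(key)
        value = (int(raw) if raw is not None else 0) + delta
        db.put(key, str(value).encode())  # goes through the WAL, so it is durable
        return value

    incr(b"profile:42:views")
    print(db.get(b"profile:42:views"))    # b'1'
    db.close()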

They benchmarked again and found that in most cases the LevelDB library beats all the options tried earlier. They now sustain 4700 get/sec and 1600 update/sec with 200M records in the database.

MVCC unmasked


MultiVersion Concurrency Control is a mechanism that lets readers not block writers and writers not block readers. Overall it significantly reduces the number of locks in a database. It is present in Oracle, MySQL (InnoDB), PostgreSQL and some others.

Each row in the tables of an MVCC system has two attributes: creation (xmin), the number of the transaction in which the row was created, and expiration (xmax), the number of the transaction in which the row was deleted.

     INSERT: xmin 40, xmax NULL
     DELETE: xmin 40, xmax 47
     UPDATE: xmin 64, xmax 78 / xmin 78, xmax NULL


When each statement is executed, an MVCC snapshot is created that determines which data in the database is visible/accessible to the statement.
The algorithm for determining row visibility is roughly as follows:
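The exact rules from the slide didn't survive in my notes, but the textbook check goes roughly like this (a simplified Python sketch of PostgreSQL-style visibility, ignoring in-progress transactions and subtleties like hint bits):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class RowVersion:
        xmin: int             # transaction that created this version
        xmax: Optional[int]   # transaction that deleted it, if any

    def is_visible(row: RowVersion, snapshot_xid: int, committed: set) -> bool:
        # The creating transaction must be committed and visible to the snapshot.
        if row.xmin not in committed or row.xmin > snapshot_xid:
            return False
        # Never deleted: the row is visible.
        if row.xmax is None:
            return True
        # Deleted, but only by a transaction the snapshot cannot see yet.
        return row.xmax not in committed or row.xmax > snapshot_xid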



The xmin and xmax fields are present in every table but hidden by default. They can always be selected explicitly, though: SELECT xmin, xmax, * FROM test;

You can get the current transaction id with the query SELECT txid_current(). Transaction statuses are kept in pg_clog. It is important to understand that rolling back a transaction is simply setting the appropriate flag on the transaction's record there. No data is deleted; the transaction is merely marked as rolled back.

Non-MVCC DBMSs have to modify data immediately when records are updated or deleted. MVCC DBMSs don't have this problem: all outdated row versions are cleaned up lazily.
Here I'll add on my own behalf that this deferred cleanup (the notorious VACUUM) is not as perfect as one would like... but that's a topic for a separate discussion.

By the way, here is the speaker's site, where he promised to post many interesting articles.

MySQL at Google


Perhaps the biggest disappointment among the talks. In short, its essence boils down to a single thesis: "Yes, we run MySQL, slightly patched by us."
Maybe my expectations were too high, but I was hoping for something more interesting from the Good Corporation.

Largest projects: Ads, Checkout and, of course, YouTube.

The data centers use hardware that is not the most expensive but commodity-grade: components (disks) fail quite often, but they are easy and cheap to replace.

MySQL is used as a cluster. Each shard has a dedicated process (the decider) responsible for electing a new master if the old one fails.
Heartbeat is done by a simple script that writes some data to the master and checks for its presence on the replicas.
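A heartbeat like that fits in a dozen lines; here is a sketch (the table name and credentials are hypothetical, and any MySQL driver would do in place of pymysql):

    import time
    import pymysql

    MASTER = dict(host="master.db", user="hb", password="secret", database="hb")
    REPLICAS = [dict(host="replica1.db", user="hb", password="secret", database="hb")]

    def beat() -> None:
        now = time.time()
        conn = pymysql.connect(**MASTER)
        cur = conn.cursor()
        cur.execute("REPLACE INTO heartbeat (id, ts) VALUES (1, %s)", (now,))
        conn.commit()
        conn.close()
        for r in REPLICAS:                       # did the write arrive?
            conn = pymysql.connect(**r)
            cur = conn.cursor()
            cur.execute("SELECT ts FROM heartbeat WHERE id = 1")
            row = cur.fetchone()
            conn.close()
            lag = now - (row[0] if row else 0.0)
            print(f"{r['host']}: replication lag {lag:.1f}s")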

To achieve scaling and fault tolerance, many restrictions are imposed on working with the database. For example, a write transaction cannot take longer than 30 seconds; this is needed so that another slave can quickly be promoted to master.

SPDY support in Nginx


SPDY is a binary protocol on top of a TCP/TLS connection that acts as a transport for HTTP. Developed by Google.

Main features:



Pros: built-in HTTPS (thanks to TLS), good handling of pages with many images (thanks to multiplexing).
Cons: it works with a single domain, and as a result does not play well with CDNs.

According to (production) measurements on Wordpress: SPDY is faster than HTTPS, but slower than plain HTTP.

SQL tricks


The talk was a list of SQL/PostgreSQL features that are interesting but not familiar to everyone:


The speaker, by the way, is genuinely dangerous: he wrote a JSON parser in Postgres using a recursive CTE and regexps, all in a single query!

Crouching server, hidden diode


"Hello, my name is Andrey, and I'm a redneck from Voronezh." That alone could serve as the description of this talk :) In short, Aksyonov in all his glory.

Key points:


Constants of the world: it is important to understand how much each operation costs.



Go language


Getting started and Hello, World in Go.

Day 2


How search works


A morning sobering-up session from Andrey Aksyonov :)
The work of any search engine can be broken into four milestones: data acquisition (the crawler spider), indexing, searching, and scaling.

Indexing includes:


Data in the index must be compressed. Compression can be bit-level, byte-level, or block-level. The less data there is, the less has to be processed per query.
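Byte-level compression usually means something like variable-byte encoding of the gaps between sorted document ids in a posting list; a sketch of the idea (my illustration, not any specific engine's format):

    def varbyte_encode(numbers):
        """7 bits of payload per byte; the high bit marks a continuation."""
        out = bytearray()
        for n in numbers:
            groups = [n & 0x7F]
            n >>= 7
            while n:
                groups.append(n & 0x7F)
                n >>= 7
            # higher-order groups first, continuation bit on all but the last
            for g in groups[:0:-1]:
                out.append(g | 0x80)
            out.append(groups[0])
        return bytes(out)

    doc_ids = [3, 7, 21, 22, 500]
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    print(gaps)                        # [3, 4, 14, 1, 478]
    print(len(varbyte_encode(gaps)))   # 6 bytes instead of 20 as 4-byte ints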

The search itself consists of two orthogonal stages: fast matching (find) and high-quality ranking (rank).
The matching criteria may vary depending on the use case (web, data mining). Ranking, in principle, can never be entirely right, because relevance is personal to every user. So tricky algorithms like BM25 (TF, IDF, document length) are used for ranking, and their results are personalized wherever possible.
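For reference, the standard Okapi BM25 term score mentioned above looks like this (k1 and b are the common defaults; the numbers in the example are made up):

    import math

    def bm25(tf: int, df: int, n_docs: int, doc_len: int, avg_len: float,
             k1: float = 1.2, b: float = 0.75) -> float:
        """Score contribution of one query term; sum over terms for a document."""
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))

    print(bm25(tf=3, df=120, n_docs=100_000, doc_len=250, avg_len=300))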

As for scaling: search devours resources, which is why Google has millions of search servers and Yandex tens of thousands.

Search at Odnoklassniki


The team managed to get by with full-scan search (on MSSQL) with 30M users. On a cluster of 16 databases, one search query took 15-30 seconds on average.
Realizing they could not live like that, they went looking for a solution.

Since the project is written in Java, they looked towards Lucene. Over 3 years of working with Lucene, the following changes were made to it:


Now the infrastructure has a separate indexer server that builds the search index. The index is replicated to the query servers, which keep it both on disk and in memory (queries are served from the in-memory index). The indexer receives data through a queue, which allows it to be unavailable from time to time.

A big flaw was the lack of logic for updating the index directly from the database, bypassing the queue, because messages in the queue would sometimes get lost. I can confirm the same problem at my own company. Conclusion: when working with queues, there should always be a sanity check (see the sketch below).
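Such a sanity check can be as simple as a periodic reconciliation pass; a sketch with hypothetical hooks (fetch_db_version, fetch_index_version, reindex stand in for whatever the project provides):

    def reconcile(user_ids, fetch_db_version, fetch_index_version, reindex):
        """Re-index every record whose DB version is newer than the index's."""
        stale = [uid for uid in user_ids
                 if fetch_db_version(uid) > fetch_index_version(uid)]
        for uid in stale:
            reindex(uid)      # repairs records whose queue message was lost
        return stale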

Only the 5% most frequently issued queries are kept permanently in the cache; the cache hit rate is 60%.
For personalized queries, temporary caches are created (on demand) that live for a few minutes.

Each user of the portal is pinned to a specific app server (computed from the userId). If that server has problems, the user is moved to a reserve one.

Searching among online users was initially done on the common index, with filtering by the "online" flag outside the index. Later, a separate index was created for this purpose.

Storing data in Evernote


An overview talk about Evernote's architecture and numbers.

Software: Java 6, MySQL 5.1, Debian (stable), DRBD (distributed storage), Xen.
Hardware: SuperMicro 1U, 2x L5630 CPU, 96 GB RAM, 6x 300GB Intel SSD, LSI RAID 5 (+ spare), ~$8000.

Database: MySQL, 10TB (peak read IOPS 350, peak write IOPS 50).
Search engine: Lucene (peak read IOPS 800, peak write IOPS 50).

Servers are located in the USA and China (400 Linux servers).

File access goes through Apache/WebDAV. Data is always stored on the same host and never migrates. Compared to NFS, WebDAV has a slight overhead, but is much simpler and cheaper to deploy and maintain.

Load balancing is done by an A10 hardware balancer with SSL termination (HTTPS faces the outside world, HTTP is proxied inside).

Given the specifics of the service (storing lots of small files), the problems and their criticality can be summarized in the following table:

              size    normal load    peak load
bandwidth     -       medium         medium
latency       -       low            low
cpu           -       low            medium
file size     high    low            medium
metadata      low     medium         low


The author does not recommend cloud platforms if your application is resource-intensive (bandwidth, storage, CPU) or consumes resources in widely varying amounts.
If in doubt, calculate how much buying and supporting your own servers would cost for the application's needs, and how much the same would cost on Amazon. In Evernote's case, the difference is 1-4 orders of magnitude.

DDOS mechanics


I missed this talk almost completely, as I was at lunch :) From the final part I managed to extract only a couple of points.

The simplest protection against HTTP attacks is a limit zone in Nginx. The nginx "testcookie" module is also worth a look.

Nevertheless, it is important to understand that Nginx is not a cure for DDoS attacks.

DDOS attacks in Russia 2012


In total, there were >2600 attacks in 2012 (last year there were somewhere over 1700).

A botnet of 3,000 machines is enough to take down an average website. The largest botnet observed consisted of 150K machines. Most botnets live in Germany, the USA, Ukraine, and Kazakhstan (in descending order).

DDoS activity tracks e-commerce activity (plus a bit of politics).
Political DDoS attacks use botnets from Pakistan and India (out of reach of our law enforcement agencies).

Most attacks generate less than 1Gb of traffic (an ordinary site's bandwidth is no larger anyway).

When under attack, the first thing to do is analyze the logs (access.log) and capture a traffic dump (tcpdump -s0 -c1000000 -w attack.dump). Then neutralize the attack: filter out the junk traffic, block the IP addresses the attack comes from, or change your IP.
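The log-analysis step can start with something as simple as counting requests per IP; a sketch assuming the default combined log format with the client IP in the first column:

    from collections import Counter

    counts = Counter()
    with open("access.log") as log:
        for line in log:
            counts[line.split(" ", 1)[0]] += 1

    for ip, hits in counts.most_common(20):
        print(f"{hits:8d}  {ip}")   # the top entries are blocking candidates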

You should always have at least a 2x performance margin for your site.

A distinctive feature of Russian DDoS is the Full Browser Stack: the use of real browsers that pass all anti-DDoS tests.

And yes: if you expect problems, contact Qrator.

Scaling in 2012


The talk was most likely about load optimization or something like it; alas, I spent its entire beginning in the restroom :) I mention it only because of a rather interesting discussion that unfolded during the Q&A.

The speaker mentioned the complete absence of redundancy and scaling on his project. This led to a question that at first seemed unexpected: what is your site's business model? It turned out that on the speaker's project, people pay for access to the site's features via subscriptions (for 6, 12, 24 months). Given that outages bring the site down only a couple of times a year, and that all users have already paid for the service, high availability turns out to be rather expensive, complicated and, most importantly, not all that necessary :) It would be another matter entirely if the project's monetization depended on every single request!

Proactive Web Performance Optimization


The talk was supposed to be given by a Twitter employee, so I personally awaited it with great impatience, anticipating something interesting. However, right after the talk began, it turned out that he had joined Twitter only recently and had spent most of his career doing web optimization. And to finish off the crowd, he announced he would tell us about a wonderful tool/utility/profiler/plugin for web optimization: YSlow.

A distributed, scalable, high-load data storage system for virtual machines


By and large, the talk describes how Parallels built its own version of GoogleFS, with chunks, metadata servers and the other usual attributes.
A distinctive (to me) trait is that the team favors still-expensive but fast SSD drives, whereas at Google the policy supposedly comes down to buying
cheap hardware that can be thrown out and replaced without regret.

The main purposes of virtualization are sharing the resources of a single host, and efficient backup and deployment.

The best solution is to merge the disks into an array and then divide it among the virtual machines using SAN storage. But that is expensive, so everyone uses regular disks.

With journaled writes, data goes to the log first and only then to disk. This is needed so that if the power suddenly goes out, we can replay the log and write the missing data to disk.
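A minimal sketch of that journaling discipline (a toy key-value "disk", not Parallels' actual code): append the record to the log and fsync it before applying, then replay the log on startup:

    import json
    import os

    def replay(log_path: str, state: dict) -> None:
        """On startup, re-apply any writes that may not have reached 'disk'."""
        if not os.path.exists(log_path):
            return
        with open(log_path) as log:
            for line in log:
                rec = json.loads(line)
                state[rec["k"]] = rec["v"]

    def write(log_path: str, state: dict, key: str, value) -> None:
        with open(log_path, "a") as log:
            log.write(json.dumps({"k": key, "v": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())   # the record now survives a power cut
        state[key] = value           # only then apply to the main storage

    state = {}
    replay("journal.log", state)     # crash recovery
    write("journal.log", state, "a", 1)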

From the talk I remember the mention of the PAXOS algorithm, told through the story of the Greek island of Paxos and its parliament. The members of parliament are also businessmen and are therefore often away, so no more than half of the parliamentarians are present at any given time; the absent can be reached only by messengers. Meanwhile, parliament must keep working and passing laws, and every parliamentarian must always know the current laws. As far as I understood, the team uses this algorithm to keep the chunk servers in sync.

A power outage can be emulated by sending a process SIGSTOP. The application at the other end of the TCP connection ends up in the same situation as during a real power cut, and it is much faster than actually pulling the plug.
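In Python that trick is a couple of lines (the pid here is hypothetical):

    import os
    import signal

    pid = 12345                      # the server process under test
    os.kill(pid, signal.SIGSTOP)     # freeze it: peers see sudden silence
    # ... observe how the other end of the TCP connection copes ...
    os.kill(pid, signal.SIGCONT)     # thaw it afterwards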

Recommendation service on a virtual Hadoop cluster


The team implemented a "site recommendation service" as map/reduce jobs on Hadoop. The talk walked through their specific problems and solutions, so there is not much to write down here.

MariaDB: The new MySQL


An overview talk about MariaDB, the MySQL fork developed by MySQL's founder after MySQL passed from Sun to Oracle.
It uses Percona XtraDB as the default engine and supports InnoDB as well.

In essence, the talk boils down to the changelog of MariaDB 5.5.

I’m on the journey


A talk from Oracle, mostly about what is coming in new versions. A kind of MySQL roadmap.

The talk answered a rather interesting question: why does Oracle support MySQL at all? The company wants to be present in every market, and MySQL covers Web, Mobile, and Embedded. Besides, it doesn't want to lose clients like Facebook and Twitter. The huge community goes without saying.

Separately, I'd note that a migration utility from databases such as MSSQL, Sybase, etc. is bundled into MySQL Installer for Windows. Unfortunately, it can't also rewrite all the accompanying project code in one go, so its point remains lost on me.

How to choose the right index


A talk about what indexes are and what they're eaten with. It had plenty of material and examples, so I recommend the slides to anyone interested in the topic.

The following theses are of the greatest value in my opinion:


Presentations


All presentations are available on SlideShare.

PS


I'll leave this post without pictures, since the WiFi in the MuMu cafe at Kievsky Station is not much better than the local Olivier salad :)

Source: https://habr.com/ru/post/155951/

