
“Complex architecture is very easy to do” - an interview with Oleg Anastasiev from Odnoklassniki



Meet Oleg Anastasiev, a leading Odnoklassniki developer, a speaker at Java and Cassandra conferences, and an expert in distributed and fault-tolerant systems. We talked with Oleg about architecture and architects, running thousands of servers, fault tolerance, and why Odnoklassniki is moving away from SQL Server.




As always, the full text transcript of the conversation is under the cut.


Architecture and Architects


- Oleg, although your role is close to that of an architect, you always call yourself a "leading programmer". I know you have a particular dislike for the word "architect". Tell me why.

- In the industry today, "architect" has come to mean a person with a huge head who always knows everything in advance and who, naturally, stopped writing code long ago. His job is to hand out instructions to those who have not yet grown into architects; at most he draws boxes and connects them with arrows. Like a building architect who draws the plans but builds nothing himself.

The concept of "architect" is already heavily emasculated. And therefore I am against this term, I am more for what an architect should be in our industry. This is a leader, a leading developer who writes the code himself, makes architectural solutions himself, implements them himself, checks whether he alone or with the help of someone else, sees his mistakes, mistakes of others, redoes it all before the architecture becomes good, verified , understandable, not containing anything extra.

- So you do "architecture" after all, but you also write code? We caught you - an architect after all!

- Well, yes, but the important point is that there is no single person with a giant head who can foresee everything. Architecture is a team game. Architecture should be made by all the people who build the project. There are one or several leaders who, rather than simply handing down decisions, mostly propose solutions, and then the team should accept the architecture democratically, by consensus.

Ours is a very complex technical industry, so one person can very easily be mistaken, miss an obvious solution, or drive himself into a mental trap. That is why it is so important to have a second, third, fourth person who looks at the same problem from a slightly different angle and proposes solutions that make you wonder, "Why didn't I think of that before? It's obvious, simple and clear."

- Judging by what I have heard, simplicity is an important engineering principle for you - you have said so two or three times now. Did I hear you correctly?

- Yes. Complex architecture is very easy to do. As they say, any problem can be solved by introducing another level of abstraction. The real art, and the real difficulty, is making a complex system simple - making it contain as few components as possible. The fewer components a system has, the fewer failures it has. You end up with a more reliable architecture and a more reliable system: it needs less babysitting, it breaks less often, and that means it earns money more often.

11 thousand servers


- Odnoklassniki now has many thousands of servers; I am not even sure how many. People quote figures of 9,500, 10,000, 11,000. So the first question is how many there really are, the second is why there are so many, and the third is how that squares with simplicity.

- Right now (the interview was recorded in December 2016 - author's note) it is probably closer to 11,000. For the most part this is because Odnoklassniki keeps launching new services, and they need capacity. Secondly, the load on the old services keeps growing too. For example, over the last year we doubled video traffic, and we are now number one in the Runet for video views. Video is a very resource-intensive project that requires a lot of capacity. So I think we are somewhere around 11,000, and we will most likely keep growing.

- Why so many machines, and how does that fit with the simplicity you declare as a principle?

- There are a lot of servers because the load is large. In our country there are two big social networks, VKontakte and Odnoklassniki. A social network is the kind of problem with an enormous number of connections between people and between systems. It cannot be sharded with the simplest shared-nothing architecture, where you have two servers with nothing in common that know nothing about each other and work independently. That does not work for a social network, because its whole purpose is to connect people, and so the servers inside are connected too, pushing information back and forth. So on the one hand this problem is not easy to scale, and on the other hand this number of servers is probably close to the required minimum for a social network of our size. VKontakte, I believe, has far more servers than we do.

- And these servers, as I understand it, are still not loaded at 100%?

- No, not 100.

- How do you choose the balance between throughput and latency, between "load the machine to the hilt" and "leave it underloaded so that it responds to requests quickly"?

- There is such a thing as efficient data center utilization. At the moment our data centers are entirely bare metal, that is, each task runs on its own dedicated server or set of servers.

- In the sense that there is no virtualization?

- No virtualization or anything of the kind. Such a data center is underloaded, on the one hand, but on the other hand it can respond to requests as quickly as possible, and it is the most predictable, because nothing runs on a machine besides its one task. Nothing can get in that task's way.

- Is diagnostics easier?

- Diagnostics is easier, and so is hunting down anomalies and breakdowns. We took the path of least resistance: there were many other problems to solve besides these, so we just kept adding hardware.

- Space in the data centers around Moscow is not unlimited, right?

- It is not unlimited; probably as early as 2017 or 2018 we will have used up all the capacity we currently have in the data centers...

- Oh, you will have fun.

- It will be fun, but we are already working on improving data center utilization.

- So there are reserves?

- We have reserves, and we now have technical solutions that will let us run several tasks on a single machine safely enough - what people call a "cloud", something like that. We are moving in that direction.



Fault tolerance


- External traffic is clear: video is growing tremendously, so the overall growth is enormous. But what about internal traffic? With that many machines you also generate some wild amount of internal traffic - I don't know, tens, hundreds or thousands of times more than the external traffic. How do you deal with that? How do you manage traffic between data centers - do you need huge capacity to move data back and forth?

- Here you are getting into an area I have had virtually nothing to do with for the last eight years or so. A separate team handles that - the network engineers - and I am quite far removed from them.

- Well then, let me reformulate my question along those lines. You have three or four data centers, I do not even know...

- Three and a half!

- Three and a half. How are data and load split between these data centers, and on what principles was that done?

- We have three data centers so that we can survive the failure of any one of them. Our goal is, first, to survive such a failure, and second, for users not to notice anything. Many people think a data center failure is extremely rare, something that happens once a century. Not at all! Data center failures are a regular thing; from my own experience I can say that roughly every three months one of them goes down. We cannot predict how long it will be out - 15 minutes, an hour, two. And every such hour costs us a lot of money.

- Why do they fail so often - is it some kind of emergency, or?..

- Different things. There are emergencies, and there is the human factor. For example, quite recently two machine halls went down in one of our data centers - that is a great many machines - because both power feeds failed. Mind you, all our data centers are top-tier, with diesel generators, UPSs and so on. But an error had crept into the generators' software, and when the data center lost power they simply did not start. There were some anomalies on the electrical grid; the generators looked at those anomalies and decided they did not want to start. The batteries lasted 10 minutes. And then that was it.

- And how did the portal actually react to all this?

- The portal handled it perfectly; nobody noticed. It was literally a month ago, yet there was nothing in the news like "Odnoklassniki is down" or "#okzhivi" - everything went fine. It was sorted out within two hours: power restored, everything started back up.

- Was there any degradation of functionality, or did you get by without it?

- It passed completely unnoticed. We were a little lucky: if the other two halls had gone down as well, there would have been some degradation of functionality. But we are moving toward a point where the loss of a data center does not affect the user at all.

- How do you achieve that kind of service reliability? How do you make a system spread across machines in three different data centers work stably and not degrade?

- The main idea, of course, is redundancy. Obviously, capacity has to be planned so that losing one third of it still leaves the remaining two data centers able to carry the load. And it is not only about load but also about data availability. If a data center crashes and some important data lived there and is now unreachable, then at best you lose functionality because you cannot reach those databases, and at worst nothing works at all.

We keep a replica of the data in each data center - in the West this is called "replication factor 3" - that is, each data center has its own replica. When we change data, we write to all data centers, to all three replicas, and wait for responses from the two fastest, because the third may be down at that moment. We consider the data written as soon as we have confirmations from the two fastest.
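
To make that write path concrete, here is a minimal sketch of a 2-out-of-3 quorum write. `Replica` stands in for a hypothetical per-data-center client; none of these names come from Odnoklassniki's code.

```java
import java.util.List;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: Replica is a stand-in for a client of one data-center replica.
interface Replica {
    boolean write(String key, byte[] value);         // true on successful write
}

public class QuorumWriter {
    private final List<Replica> replicas;             // one per data center, 3 total
    private final ExecutorService pool = Executors.newCachedThreadPool();

    public QuorumWriter(List<Replica> replicas) {
        this.replicas = replicas;
    }

    /** Sends the write to all replicas and returns as soon as 2 of 3 acknowledge. */
    public boolean write(String key, byte[] value) throws InterruptedException {
        int needed = 2;                                // quorum out of 3 replicas
        CountDownLatch acks = new CountDownLatch(needed);
        AtomicInteger failures = new AtomicInteger();

        for (Replica r : replicas) {
            pool.submit(() -> {
                boolean ok = false;
                try {
                    ok = r.write(key, value);
                } catch (RuntimeException ignored) {
                    // a slow or dead data center must not block the commit
                }
                if (ok) acks.countDown();
                else if (failures.incrementAndGet() > replicas.size() - needed) {
                    // quorum can no longer be reached; unblock the waiter
                    while (acks.getCount() > 0) acks.countDown();
                }
            });
        }
        acks.await();                                  // wait for the two fastest acks
        return failures.get() <= replicas.size() - needed;
    }
}
```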

All data stores at Odnoklassniki are built on this principle, except Cold Storage - a special system that stores large blobs: video, photos, music.

- And why is that? Is it expensive?

- Yes. Replication factor 3 means we hold three copies of the data, and our video is growing very fast - it is already many petabytes - so storing it in three copies is quite expensive. To avoid losing reliability we still use redundancy, but that system has a replication factor of 2.1. Roughly speaking, it is RAID 5 assembled not inside a single machine but across a distributed system: where RAID 5 normally uses locally installed disks, here those disks are spread across different data centers.
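
As a toy illustration of the RAID 5 idea behind that 2.1 factor (this is not the actual Cold Storage code), two data blocks plus an XOR parity block can be spread over three data centers, and any single lost block can be recomputed from the other two:

```java
import java.util.Arrays;

// Toy illustration of RAID-5-style redundancy spread over data centers:
// store block A, block B and parity = A xor B in three different places.
public class XorParityDemo {

    static byte[] xor(byte[] a, byte[] b) {
        byte[] out = new byte[a.length];
        for (int i = 0; i < a.length; i++) out[i] = (byte) (a[i] ^ b[i]);
        return out;
    }

    public static void main(String[] args) {
        byte[] blockA = "video-chunk-A".getBytes();   // stored in data center 1
        byte[] blockB = "video-chunk-B".getBytes();   // stored in data center 2
        byte[] parity = xor(blockA, blockB);          // stored in data center 3

        // Data center 1 goes down: blockA is lost, but parity xor blockB restores it.
        byte[] recoveredA = xor(parity, blockB);
        System.out.println(Arrays.equals(blockA, recoveredA)); // prints: true
    }
}
```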

- And if any of that crashes, recovery is distributed too?

- Yes.

What admins play at


- It is a well-known problem that after replacing failed nodes or bringing a data center back we have to catch the data up, which sends a lot of traffic that way and eats up all the network capacity, and so on. Is anything done about this, is the process controlled?

- There has to be enough work not only for the programmers but also for the system administrators. So, to understand what can fall over during a recovery, we stage drills and test accidents. It is a very interesting game that the administration department plays periodically.

- What does it look like?

- Someone is appointed director of the accident, and he comes up with a scenario: what is happening, what the symptoms are. The rest do not know in advance what will happen. Then he starts the drill: he says that such-and-such has switched off, such-and-such lights have come on, monitoring reports this, the charts show that. And the participants write down what they do and how they would fix the situation.

Those are the theoretical drills; there are practical ones as well. When we know that, say, cold storage should survive losing a third of its data, we pick a day (naturally, at first such tests are run at the times of lowest load, for example at night). A data center, or the machines in it, is completely shut down or partitioned off the network, depending on which scenario we are rehearsing; then we turn it back on and see what happens.

- It is pretty scary, generally speaking, to switch something off in production. How did you overcome that fear? I know a few enthusiasts who say "let's run an experiment in production right now", and they get told "no way, no way, don't touch our production". Surely it did not happen for you right away either?

- The fear of breaking something in production goes away roughly the tenth time you have broken something in production. The system itself has to respond correctly to being broken - that is, with a minimum of special effects for the user.

- Tell me more about that, please. What is a "right response" and a "wrong response"?

- An incorrect response, for example: a long time ago, if one of our SQL servers failed, the portal simply did not work. It showed the "wheel" - "Sorry, come back later." That was at the very beginning of Odnoklassniki, around 2007-2008. There were 16 servers, they did not fail that often, they were brought back quickly, and everything was fine. When there were 64 of them, then 128, then 256, it naturally started happening more and more often, and that approach stopped working. It was simply bad for the business.

So we started thinking about the system's tolerance to failures, that is, about fault tolerance. Back then there was no information about it on the Internet at all. Failure tolerance was something we invented for ourselves and began to implement. The idea was quite simple: if a server is down, we cannot read its data. If we needed that data to render something, then fine - we would just do without it.

An example. Take the friends list. Say you have 32 friends and 16 partitions - 2 friends on each. One partition fails and we cannot read two of the friends. In that situation we show only 30 friends plus a caption: "The data shown may be incomplete." Because we know we could not read the data from one of the shards.

By and large, at that particular moment the user may not care about those two friends, especially since, being on the failed partition, they cannot do anything right now anyway. Meanwhile the user can keep working with the rest of the information. That covers reading data and displaying it on screen. Conversely, if we want to change data that lives on the server that is currently down, such a transaction cannot be performed; it fails with a message like "sorry, that did not work right now, try again later".

So because of one failed server, only a very small part of the functionality suffers - the part that needs to change data on that server. Everything else works. You cannot see all of your friends on the social network (you have temporarily lost those two) and you cannot unfriend them at the moment, but you can do everything else - read the feed, post photos, discuss things in the comments, and so on. This is called failure isolation: we isolated the failure of that server at the cost of degraded data.
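
A rough sketch of that read-side behavior, with `FriendShard` as a hypothetical partition client: failed shards are skipped and the page is flagged as possibly incomplete instead of failing outright.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of failure isolation on read: failed partitions are skipped,
// the page is rendered from whatever could be fetched, with a warning flag.
interface FriendShard {
    List<Long> friendIds(long userId);   // throws if the partition is down
}

class FriendsPage {
    final List<Long> friendIds = new ArrayList<>();
    boolean maybeIncomplete = false;     // drives the "data may be incomplete" caption
}

public class FriendService {
    private final List<FriendShard> shards;   // e.g. 16 partitions

    public FriendService(List<FriendShard> shards) {
        this.shards = shards;
    }

    public FriendsPage loadFriends(long userId) {
        FriendsPage page = new FriendsPage();
        for (FriendShard shard : shards) {
            try {
                page.friendIds.addAll(shard.friendIds(userId));
            } catch (RuntimeException partitionDown) {
                // Isolate the failure: skip this shard instead of failing the whole page.
                page.maybeIncomplete = true;
            }
        }
        return page;
    }
}
```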

- And what if the entire friends cluster goes down - a virus or something?

- Not necessarily a virus. A whole cluster can go down very easily: people make mistakes, there are bugs, someone carelessly takes a lock on an entire table - there can be lots of reasons. Bugs in the database software itself, the network drops out - and the cluster is completely unavailable, there just is none. In that case we have no choice but to degrade functionally, that is, to switch off some of the friends functionality entirely.

- What does the user see then?

- The user sees a stub: the corresponding section shows that the feature is currently unavailable, come back later, while the rest of the site keeps working as long as it is not affected by the incident. This mode can be triggered automatically or manually; an administrator can also disable any function of the site at any time.



Why the SQL servers were abandoned


- You mentioned Microsoft SQL Server, of which you had 16, 32, 64 instances and so on. How do that many instances get along with each other at all? Database scalability, replication and sharding are still open questions. People say that even the most sophisticated systems like Oracle scale up to about 10 machines, and beyond that the trouble usually starts. What is the situation with MS SQL servers at Odnoklassniki now?

- With MS SQL servers at Odnoklassniki it is very simple: they are being decommissioned. We are getting rid of them wherever we can.

- In favor of what? Which systems are replacing the SQL servers? What used to live on them, and what is it being replaced with?

- They held the usual business data that needs ACID transactions: information about users, friendships, money, photos (not the blobs, just the metadata), all the "classes" (likes), comments, messages - everything. SQL Server is a normal paradigm, it works well, and it is convenient for writing code quickly. That is what we did - we wrote code quickly, until at some point it became clear that the SQL Server paradigm did not fit architecturally into our data-center-failure model.

- Tell me about it a little more.

- You have a SQL server. It is built to be single-master: only one specific server instance may modify a given piece of data. There are procedures for switching the master if the first one fails - leader election, failover... I am already starting to forget these terms. A failover cluster can be built in several ways. The classic enterprise option: you have shared storage, two servers attached to it over optical links, and if one fails the other takes over - that is the whole scheme. But first, the hardware itself is very expensive. Second, those two servers sit right next to the shared storage, so they fundamentally cannot be placed in another data center. And if the data center housing that hardware fails, it turns out you spent a lot of money for nothing - in that situation the scheme gives you no fault tolerance at all.

There is another scheme, where the master is in one data center, the slave in another, and between them runs...

- Cross data center replication.

- Yes, yes. On commit, the record is written to the slave first, and only once the slave has committed it and everything is fine does the master tell the system the transaction is committed. This scheme is not great either. The moment your slave slows down a little, the master stops confirming commits. The same if something happens to the network. Such a transaction runs at the speed of the slowest server plus the network latency.
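
The problem he describes can be seen in a condensed sketch of such a semi-synchronous commit (hypothetical names): with a single mandatory slave, every commit blocks on that slave's acknowledgement, so commit latency is the slave's latency plus the network round trip.

```java
import java.util.concurrent.*;

// Sketch: semi-synchronous master-slave commit across data centers.
// Unlike the 2-of-3 quorum sketch above, there is exactly one slave, so every
// commit runs at the speed of that slave plus the network between the DCs.
public class SemiSyncMaster {
    private final ExecutorService io = Executors.newSingleThreadExecutor();

    /** Pretends to ship the log record to the remote slave and waits for its ack. */
    private Future<Void> shipToSlave(String logRecord) {
        return io.submit(() -> {
            Thread.sleep(30);                // cross-DC round trip + slave write, in ms
            return null;
        });
    }

    public void commit(String logRecord) throws Exception {
        Future<Void> slaveAck = shipToSlave(logRecord);
        slaveAck.get(1, TimeUnit.SECONDS);   // commit blocks until the slave confirms
        // only now is the transaction reported as committed to the client
    }
}
```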

Those are the main problems, really. There is also multi-master, but that is a whole separate story - very complex, with lots of conflicts. It does not work at all under our load, so we did not even consider that option.

So it became clear that classic ACID databases like that do not work for us: they do not let us build the kind of fault tolerance we want. To solve the problem we wrote our own ACID system that distributes data across several data centers. I talked about it at Joker about two years ago - how it works, its basic principles. And now we are migrating all our data onto this system as fast as we can.

- We will add a link to that talk, but for those who cannot watch it, can you describe in basic terms how the new system solves the problems you described?

- First, the main thing: a transaction is confirmed as soon as the two fastest of the three replicas have confirmed that they accepted it. Second, Cassandra is used as the data storage. We had already been using it for a long time and had convinced ourselves that it is currently the best solution on the market in terms of resilience, cross-data-center replication, and so on.

Furthermore, in our system the coordinators that manage transactions do not own data; only the storage nodes own data. That is, we separated the transaction coordinators from the storage.

- And what do the transaction coordinators do - logic, failure handling?..

- Transaction logic. You can open a transaction on a coordinator and build it up - say, you do several inserts in a row; that is a transaction, and the coordinator does not execute the operations yet, it only accumulates them in memory. If you, as the client who opened the transaction, then read something from the database, your read request goes through that coordinator. The coordinator reads the current state from Cassandra storage, overlays the changed state belonging to the transaction the client has open, and returns the result to the client.

So inside the transaction you see your own changes, while other clients - or even other threads of the same client - go directly to Cassandra storage, bypassing the transaction coordinators, and see the data as it was before the transaction began. When you commit, the coordinator writes the data in a quorum, to two replicas out of three, and then confirms the successful commit to the client. Clients that did not take part in the transaction start seeing this data only once the transaction has been committed. That gives us read committed. And since the transaction coordinators do not own data, switching between them is very simple: if one crashes, another one is used instead.
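
A very simplified sketch of that coordinator behavior (hypothetical code, not the actual system): pending writes live only in the coordinator's memory and are overlaid on reads made inside the transaction, while everyone else reads the storage directly and therefore sees only committed data.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch of a transaction coordinator that does not own data:
// pending writes are buffered in memory and overlaid on reads made inside
// the transaction; other clients read the storage directly (read committed).
public class TxCoordinator {
    private final Map<String, String> storage;                 // stands in for Cassandra
    private final Map<Long, Map<String, String>> pending = new ConcurrentHashMap<>();
    private long nextTxId = 1;

    public TxCoordinator(Map<String, String> storage) {
        this.storage = storage;
    }

    public synchronized long begin() {
        long txId = nextTxId++;
        pending.put(txId, new HashMap<>());
        return txId;
    }

    /** Buffered in coordinator memory only; nothing hits storage yet. */
    public void put(long txId, String key, String value) {
        pending.get(txId).put(key, value);
    }

    /** Reads inside the transaction see their own uncommitted changes. */
    public String get(long txId, String key) {
        Map<String, String> changes = pending.get(txId);
        return changes.containsKey(key) ? changes.get(key) : storage.get(key);
    }

    /** On commit the buffered changes are written out (here a plain map;
        in the real system, a quorum write to the replicas). */
    public void commit(long txId) {
        storage.putAll(pending.remove(txId));
    }
}
```

Clients outside the transaction simply read the storage map directly and never observe a transaction's changes until commit() has written them out.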

- Databases have different transaction isolation levels. Why read committed - is it read committed everywhere?

- Our system supports only read committed. For most systems read committed is actually enough. I have seen few systems that required serializable isolation, and very few people who went for read uncommitted of their own accord. Back when we worked with Microsoft SQL Server we used read uncommitted, not because it is so great, but for performance.

- Which versions of Cassandra do you use in production?

- We currently run three versions. One is the good old 0.6 - as simple as a Kalashnikov, nothing superfluous in it. It is used as a key-value store, or more precisely as a column-family store: you have a key, and the value is an ordered map of columns with their values. That is very convenient in a lot of places, and we work with those stores at a pretty low level.
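
Conceptually, the column-family model he describes is just "row key maps to an ordered map of column name to value". A plain-Java analogy (this is not the Cassandra API, just an illustration of the data model):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Plain-Java analogy of the column-family model used with Cassandra 0.6:
// each row key maps to an ordered map of columns, which can be read whole
// or as a contiguous slice by column name. Single-threaded sketch only.
public class ColumnFamily {
    private final Map<String, NavigableMap<String, byte[]>> rows = new HashMap<>();

    public void insert(String rowKey, String column, byte[] value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    /** Slice query: all columns of a row between two column names, in order. */
    public NavigableMap<String, byte[]> slice(String rowKey, String from, String to) {
        NavigableMap<String, byte[]> row = rows.getOrDefault(rowKey, new TreeMap<>());
        return row.subMap(from, true, to, true);
    }
}
```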

More modern versions of Cassandra introduced CQL, the Cassandra Query Language, which is somewhat SQL-like. Of course it is not really SQL at all, just something similar that makes it easier for people to get into the new system. For that we use Cassandra 2.0. Why we chose 2.0 for migrating the SQL databases onto it: roughly speaking, because there is a SELECT over there and a SELECT over here. For someone programming application logic, outwardly very little changes - they wrote selects there, and they write selects here.

In MS SQL we did not use joins - we gave them up because they are very slow at our volumes. In Cassandra there simply is no join.

- So the data is heavily denormalized?

- Naturally.

- You mentioned Cassandra versions 0.6 and 2.0; what is the third one?

- The third is 1.2. It was a very transitional version; we took it, tried it in one place, and the thing works rather clumsily, but it still works somehow.



Technical debt


- So in production you have three different versions of Cassandra. I know plenty of people who, watching or reading this interview, will raise a finger at this point and say: "So you guys have technical debt - shouldn't you start doing something about it?" Is the very fact of using three different versions of the same product technical debt?

[The rest of this exchange is garbled in the source. The surviving fragments indicate that it covered whether running Cassandra 0.6, 1.2 and 2.0 side by side counts as technical debt; how bugs found in the long-frozen 0.6 branch are handled (JIRA tickets with steps to reproduce, fixed in-house); how Cassandra was chosen around 2010, when the team was moving off the master-slave Microsoft SQL Server setup and considered LinkedIn's Voldemort before taking Cassandra (then around version 0.5-0.6) from GitHub as a column-family storage; and how the team's fork relates to the DataStax 2.0 and 3.0 releases.]



Open source


[Most of this section is garbled in the source. The surviving fragments indicate that it discussed contributing changes upstream to projects such as Cassandra, OpenJDK and Guava, and why that does not always happen; how this compares with open source work at Netflix (Atlas is mentioned); and Odnoklassniki's own one-nio library published on GitHub, including its remoting layer used in production and low-level tricks such as combining Long.reverseBytes() with XOR.]



Big Data


[Most of this exchange is garbled in the source. The surviving fragments indicate that it covered how Big Data, Smart Data and Data Science are applied at Odnoklassniki, including recommendation work on user-generated content started around 2011 and a comparison with Spotify and Last.fm, as well as the latency constraints on such features, before turning to how new features are launched gradually (a "smooth start").]


- What is a smooth start?

- When a feature is rolled out at Odnoklassniki, it is, first of all, rolled out switched off. When the update reaches the physical servers, the feature is off and should not change the site's behavior at all. Then the person who programmed the feature starts turning it on gradually, beginning with a small number of users.

- A small number - what percentage?

- It usually depends on how confident the person is. Typically it starts with one partition (that is, 1/256 of the users), or with one server, or with a small number of users from a particular country. If that works well, it is enabled more widely and checked against the same statistics.
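
A bare-bones sketch of such a switch (hypothetical code; only the 256-partition figure comes from the interview): the feature is off by default, is opened partition by partition, and can be killed entirely by an operator, as mentioned earlier.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Bare-bones feature switch: off by default, then enabled partition by
// partition (users are spread over 256 partitions by user id).
public class FeatureSwitch {
    private static final int PARTITIONS = 256;
    private final Set<Integer> enabledPartitions = ConcurrentHashMap.newKeySet();
    private volatile boolean killSwitch = false;   // operator can turn it all off

    public void enablePartition(int partition) { enabledPartitions.add(partition); }

    public void disableAll() { killSwitch = true; }

    public boolean isEnabledFor(long userId) {
        if (killSwitch) return false;
        int partition = (int) (userId % PARTITIONS);  // user ids assumed non-negative
        return enabledPartitions.contains(partition);
    }
}
```

Enabling a single partition then corresponds to the 1/256 rollout; widening the experiment means enabling more partitions, and the administrator's kill switch maps to disableAll().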

- That is standard practice. But the system is large: if you enable a feature for one 256th of users, it turns out that two versions of the same thing (maybe more) are running in production at the same time. How do they get along with each other at all?

- It varies, depending on the feature.

- Let me reformulate the question, then. Do you have to spend extra resources and effort on the problem of different versions of a feature "cohabiting" in production?

- Of course, but we have no other choice. Once upon a time the update procedure looked like wrestling horses in the dark. A new version was rolled out to the site, a whole crowd of users hit it, exceptions poured into the logs, something did not work, activity sagged - dropping here, rising there - nothing made sense, yet it had to be fixed fast. It was all firefighting. Eventually it became clear that we could not go on living like that, and the system of gradually introducing features was invented.

- That is a whole infrastructure of switches, kill switches, partitions and whatnot. Does it require special skills from the person writing all this, or are there clear instructions on what to do and how?

- Usually, experiments are switched on by people who have been cleared to work on production. Being cleared to work on production means the person has been given instructions on how to work with production and how not to. All of this is explained to them: about experiments, where to look, how to enable things smoothly, what to do if something goes wrong. Then they turn on their feature themselves, watch all of it, and deal with the fallout.

- So it is end to end - from writing the code to rolling it out.

- Yes. It is very interesting and it motivates people to see how their code behaves. It also quickly teaches them how not to do things. Every one of our leading programmers has, at least once, taken production down. Completely. But, as the saying goes, one beaten is worth two unbeaten, and all the leading programmers enable their experiments very carefully, because they understand perfectly what can break, how and where, and what the consequences would be.

Source: https://habr.com/ru/post/324424/

