High-load application architecture. Scaling distributed systems. Part two

This week we posted the first part of the podcast transcript. The second part is now ready.

What the second part of the podcast covers:

Horizontal scaling of the project

- when to use cloud services and when to use physical hosting;
- “beautiful solutions” versus “dirty but productive” code; ORMs and similar tools;
- multilingual, multizone projects: problems and solutions.

Asynchronous tasks. Queues.

- asynchronous tasks in distributed systems;
- when they come to the rescue, and which technologies exist and are actively developing;
- which approaches to organizing asynchronous tasks are used at Badoo;
- what problems came up, and still come up, when working with queues;
- useful books and interesting conferences;
- interesting case studies.

Host: Yes, that is a good approach, because engineers are often perfectionists: they want to do everything perfectly and, if nobody controls them, will redo things ten times. What can I say, I remember being like that myself. With time, experienced engineers come to understand that sometimes things simply need to be done faster. The main thing is to get a result, not a “spherical target in a vacuum”.

In principle, that is a very sound, reasonable approach. I have also looked at the articles you wrote on Habr about how you automate the warm-up and roll-out of new releases using Git. As far as I know, you use JIRA as the bug tracker, plus TeamCity, and you tie all of this together nicely with a tool called AIDA. I would suggest that you tell us more about it in one of the next episodes of our podcast; for now we probably will not dwell on it. I think you have a lot to tell, since you have experience in setting up all these processes well within the team. Let's talk about something else: as I understand it (this is also a question from Roman Skvazh, by the way), your developers are not “multi-platform”, and you separate them clearly. There are PHP developers, there are C developers, and there is that very experienced and powerful DBA for the databases, right?

A.R.: In any company there have to be admins. And if there are hundreds of database servers, there has to be a DBA, simply because it is not a developer's job at all to repair things and oversee the roll-out of new databases. I recently had a small budget and ordered a MySQL patch for a problem that had gone unfixed for ten years. Among other things, I needed to involve a system administrator so that it would be rolled out on all clusters. There must be a DBA; that is not even up for discussion. As for “multi-platform”... Generally speaking, people who know PHP need to know some other scripting languages too. There are about 100 of them, well, maybe fewer, around 80 people. They know PHP and some other scripting languages, but they mostly write in PHP. We have considerably fewer C developers, so, by and large, we can move a person from one team to another.

As for platforms in general... Among our developers there are probably none who can write for the web and for Android at the same time. For Android we write in Java, I believe. For iOS we have Objective-C. So we have a kind of “mobile zoo”, and in this sense the people who write the server side for mobile applications and the people who write the client side really are different people.

Host: Yes. There is another problem that arises in large projects and is often a headache for managers, team leads, and ordinary developers: the project has to support many languages and a large amount of text, or any content in general, that differs by country and has to be adapted, translated, and so on. On top of that, countries differ not only in language; they are also far apart, so local time differs everywhere and the project turns out to be “multizone”. What problems are there here, including at Badoo? What approaches and architectural solutions do you see, how is this settled, and how do you live with it all?

A.R.: It seems to me this is a problem of moderate complexity; compared with performance and scaling in a large project, it is of course not that big a problem. As for “multizone”, everything there is simple, in my opinion, and is solved by configuration. There is a small file that maps a country to, roughly speaking, its offset from UTC; everything is stored in a single format, in UTC, and local times are calculated from that. It is pretty simple.
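A minimal sketch of the configuration-based approach described here, assuming a hypothetical table of country codes and fixed UTC offsets (the names and values are illustrative, not Badoo's actual configuration, and daylight saving time is ignored for brevity):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical configuration: country code -> offset from UTC in hours.
# Fixed offsets keep the sketch short; real zones also have DST rules.
COUNTRY_UTC_OFFSET_HOURS = {
    "ES": 1,
    "BR": -3,
    "RU": 3,
}


def to_local(utc_moment: datetime, country: str) -> datetime:
    """Convert a stored UTC timestamp to the local time configured for a country."""
    offset = timezone(timedelta(hours=COUNTRY_UTC_OFFSET_HOURS[country]))
    return utc_moment.astimezone(offset)


# Everything is stored in UTC; conversion happens only when displaying.
stored = datetime(2013, 7, 1, 12, 0, tzinfo=timezone.utc)
print(to_local(stored, "BR").isoformat())  # 2013-07-01T09:00:00-03:00
```

Keeping storage in UTC means comparisons and sorting never depend on the viewer's country; only the final conversion consults the configuration.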

As for the large number of texts, that is a really interesting question. We went through many different stages and are ending up with a system in which the users themselves do the translating. We have several dozen languages; I don't remember exactly, about 40. And of course it is very important for us that the translation system is agile enough not to slow features down.

Oddly enough, everything I am about to describe lies more in the area of management than technology. Imagine that you want to roll out a feature and need to translate it into at least several important languages. In Europe Badoo is most widespread in three countries, Spain, Italy and France, and in the Americas in the United States and Brazil. So that is already at least five very important languages.

And we cannot roll out the feature. Or rather, we can roll it out for some countries and launch it there, but we cannot make a full release without these translations. Now consider our release cycle: two releases a day. We do everything we can to make two releases a day. For that, our 15 or 16 thousand tests have to pass quickly, in about ten minutes (clarification: already 18 thousand in 3.5 minutes). When we are making such a release, we do not want to wait for the translation. So when we built the new deployment system we changed the translation system considerably and integrated what I would call, I don't even know how to put it correctly, “translation forward”. Roughly speaking, at the testing stage, on staging, which usually lasts a day or a few days, the translator can already start translating the feature that will later go to production. And since there are several dozen translators, the main challenge was to create a clean, clear interface for translation. Among other things, we experimented with formats such as on-screen translation. There is a special translation session: you log in, see the site in English and translate it into Russian; you see some text in English, click, translate it into Russian, click, and immediately see everything in Russian. Yes, right down to that.

Host: Yeah.

A.R.: As for the technical side, the only technical thing we did here is that we do not translate anything at run time. Let me put it this way: we do not pick the language or the text in real time. Many solutions keep a huge file with all the translations, and the application inserts one line or another on the fly. We do not do that. We generate a very large set of templates, each already in its final language. This is our solution, and it differs slightly from the standard approach. And it works really well!

Host: The standard approach is something like .po files, which, as I understand it, smaller projects can use and try out just fine, right?

A.R.: I would not say that all of this will necessarily start to slow down on a large project, but obviously, if you have ready-made templates in a given language, you simply do not spend any CPU time looking up a particular phrase. If the page is big, there will clearly be a great many phrases and therefore lookups, and even if each lookup is fast, it still costs some amount of processor time.
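The idea of generating templates that are already in their final language can be sketched roughly as follows. This is only an illustration under stated assumptions: the `{{key}}` marker syntax, the phrase tables, and the function names are hypothetical and say nothing about Badoo's actual template engine.

```python
from string import Template

# Hypothetical source template: {{key}} marks a translatable phrase,
# $name marks per-request user data.
SOURCE_TEMPLATE = "<h1>{{greeting}}, $name!</h1><p>{{new_messages}}: $count</p>"

# Hypothetical translation tables, one per language.
TRANSLATIONS = {
    "en": {"greeting": "Hello", "new_messages": "New messages"},
    "es": {"greeting": "Hola", "new_messages": "Mensajes nuevos"},
}


def build_localized_templates(source, translations):
    """Build step: substitute every phrase once, producing one template per language."""
    compiled = {}
    for lang, phrases in translations.items():
        text = source
        for key, phrase in phrases.items():
            text = text.replace("{{" + key + "}}", phrase)
        compiled[lang] = Template(text)
    return compiled


# Done once at build or deploy time, not on every request.
TEMPLATES = build_localized_templates(SOURCE_TEMPLATE, TRANSLATIONS)


def render(lang, **data):
    """Request time: only user data is substituted; no phrase lookups remain."""
    return TEMPLATES[lang].substitute(**data)


print(render("es", name="Ana", count=3))
```

The trade-off is more generated files in exchange for no per-phrase lookups while serving a page.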

Host: Alexey, I wanted to clarify one number you just mentioned. You said two releases a day; how did you arrive at that figure, and why did you decide to deploy this way?

A.R.: There is a nuance here: in principle we could deploy less often, but it turns out that the rhythm of checks and experiments, the rhythm of our product department, is such that once a day is not enough. We settled on twice a day. Why not more? Because the more releases there are, the more often some people's work is paralyzed at release time, the release engineers' at the very least. We have a group of release engineers; we put it together quite recently, by the way.

At release time (at least while we were still debugging this system) a lot was done by hand. Then everything was automated, and now we could do several releases a day, but it seems to me that two is enough. In the morning we make a release and roll out what was tested the night before, and everything that was finished and tested before lunch is rolled out after lunch. That is the scheme. I don't even know what further explanation there could be, but in my opinion it is quite reasonable.

Host: Yes. But to my mind that is almost too much; I don't really see how you can accumulate enough features in a day to have something to roll out by the evening.

Host: They know in advance, as I understand it. I don't know how you organize it, Scrum or something else. You have, say, a certain set of tasks, and you understand that they will be finished by, say, the day after tomorrow and can be rolled out in the first half of that day. And that is how it goes all the time, right?

A.R.: Planning is, of course, always a painful process. We do try to work iteratively, but we do not use any of the proper agile methodologies. If you like, I can tell you separately how we tried to introduce agile in our company and why it did not take; that is a very interesting topic in its own right. There are simply a lot of teams, and each is doing something. Some teams build new features, others work on infrastructure tasks, so naturally there are quite a lot of commits. And there is a single point where it all comes together: the release.

Host: Is the development office in one place, or are some teams geographically distributed?

A.R.: Most of our server development is in Moscow, with a small part in London; mobile development, in turn, is practically absent in Moscow and is all in London. That is the split.

Host: That raises yet another question, about communication, of course. But let's finally get to the scaling topic: tell me why you don't like ORM.

A.R.: I don't like ORM for the following reason: I was once hooked on ORM myself, I even wrote an ORM, and it seemed to me a great “silver bullet”. Then I realized that this problem simply cannot be solved properly, for the following reasons. The number one reason is the fundamental, so-called impedance mismatch between databases and object-oriented development. If we build something that works poorly with the database, that is worse than building something that is not very object-oriented or not very elegant, simply because most of the trouble will most likely surface on the database side, which is the greater risk for the project. So the database should be used as efficiently as possible, and you need to talk to it in SQL. To talk to it in SQL, you must, among other things, be able to influence that SQL. And most ORMs are written so that, by and large, everything is done for you.

Once you start influencing this SQL, you are given some special interface. In Hibernate, for example (Hibernate is used in serious Java projects), you are offered a special language (I believe it is called OQL, Object Query Language), and in fact you write in much the same SQL, except that instead of some joins you may navigate with dot notation from one object to an object contained inside it, and so on. The join is added for you, but either way you have started writing queries by hand, which means you already have a mix: the OQL code, plus the code the ORM generates for you. And if you need to do something unusual, you will start writing SQL anyway. So if you have a very large project and use an ORM in it, over time you will end up with some things done automatically, some written in this OQL, and SQL somewhere else. There is an alternative, fairly simple approach, where you say: “Okay, there is an impedance mismatch and I cannot get away from it. I have to design a good database and talk to it in SQL; I will hide all of that behind an interface, and that is it. Somewhere in the middle I will have some mapping, and I will spend some time on it.”

The issue is really purely psychological: people feel that spending time writing this layer means, roughly speaking, not respecting yourself. If a machine can do something for me automatically, why should I do it myself? In fact, that is not so. In a large project this layer does not take that much time, and an explicit layer that talks to the database and returns data, which is then turned into objects separately, lets you talk to the database in whatever language you need and tune anything; it is easy enough to change, although there is a bit more code than with a standard ORM. Besides, look at where ORMs are mainly used: small projects where you need to quickly build an interface for, say, editing some simple entities. Try using an ORM to write anything that deals with analytics, where there is a certain amount of data and you need to run analytical queries...
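A minimal sketch of such a hand-written layer: explicit SQL on one side, plain objects on the other, and a small mapping step in between. The `users` table, the `User` class, and the `UserGateway` name are hypothetical and chosen only for illustration; SQLite stands in for the real database so that the example is self-contained.

```python
import sqlite3
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class User:
    id: int
    name: str
    country: str


class UserGateway:
    """Hand-written data layer: explicit SQL in, plain objects out."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn

    def by_country(self, country: str) -> List[User]:
        # The SQL is visible here and can be tuned directly.
        rows = self.conn.execute(
            "SELECT id, name, country FROM users WHERE country = ? ORDER BY id",
            (country,),
        )
        # The mapping step: one row becomes one object.
        return [User(*row) for row in rows]

    def count_by_country(self) -> Dict[str, int]:
        # An analytical query is written in plain SQL, with no object mapping at all.
        rows = self.conn.execute(
            "SELECT country, COUNT(*) FROM users GROUP BY country"
        )
        return dict(rows.fetchall())


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO users (name, country) VALUES (?, ?)",
    [("Ana", "ES"), ("Luca", "IT"), ("Marta", "ES")],
)
gateway = UserGateway(conn)
print(gateway.by_country("ES"))
print(gateway.count_by_country())
```

The layer costs a few more lines than an ORM call, but every query is in one place and can be changed without fighting generated SQL.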

Host: Well, yes, that is hard.

Host: That's how it is. I have been looking at all these newfangled ODMs for Mongo. Doctrine supports it now, and there are things like Mandango. Colleagues have experience using them on projects that are far from small. But we, for example, also write only thin layers that help us automate some tasks and avoid copy-pasting similar pieces of queries. Everything in our project currently happens at that level, so we do not use them either. I suggest we move to the final part of the podcast and talk about another very interesting topic: as a project grows, we try more and more to isolate components, and we also realize that many more tasks can be done asynchronously. This is where words like queues, asynchronous tasks, and async jobs come up. How do you work with them, what have you come up with, what do you use?

A.R.: I have one small problem here, which I also talk about a lot at my seminar: we use a great many technologies that strike people as antediluvian. I want to lay this out very carefully, at least in my own head, and explain why. The fact is that for queues we do a lot simply on databases. It is a rather unexpected decision; in a sense it even seems wrong: how can a large number of real-time events be pumped through a database? Through the database we pump those events that were themselves generated by transactions in that database. Let me explain in more detail.

Imagine that you are a user who lives on some database node, and you upload a photo or write something, and, say, a URL shows up in your message. If you upload a photo, it has to be moderated. If a URL shows up in a message, we may want to check after the fact whether you are a spammer: collect information from different systems, build some ratings, see whether this link is mentioned too often, and so on, and analyze it somehow. Or perhaps you have just registered and some trigger has fired for a second check. In general, some deferred event has to be processed. All of these asynchronous things were generated by transactions in the database where you live. Now imagine that we have some third-party server that processes the queue. That raises the problem of two-phase commit and synchronization of operations. As I understand the problem, two-phase commit is by and large a myth, and one way or another we can end up with inconsistent data. Either everything is fine in the database but the event is missing, lost for whatever reason; or everything is bad in the database, that is, the transaction failed and was not committed for whatever reason, but the event has apparently arrived.

Host: Yeah.

A.R.: So at some point we decided not to solve such problems at all. If an event is generated by a database, let it be created on each database where it can be generated. A separate system then either collects these events centrally or processes them in a decentralized way. I describe this scheme in more detail at the seminar.

The idea is that the total number of such events is not very large. How many transactions do you have on one server? Say there are several hundred per second; in the same transaction you make a fairly compact record of the event. We are not talking about pumping through billions of events related to clicks or anything like that.
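The scheme described here, where the event is recorded in the same transaction and on the same database as the data that caused it, can be sketched as follows. This is an illustration under assumptions: the `photos` and `events` tables and the function names are hypothetical, and SQLite stands in for the production database only to keep the example self-contained.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, user_id INTEGER, path TEXT)")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT, payload TEXT, "
    "processed INTEGER NOT NULL DEFAULT 0)"
)


def upload_photo(user_id, path):
    """Write the photo and its 'needs moderation' event in one local transaction."""
    with conn:  # commits on success, rolls back if any statement raises
        cur = conn.execute(
            "INSERT INTO photos (user_id, path) VALUES (?, ?)", (user_id, path)
        )
        photo_id = cur.lastrowid
        conn.execute(
            "INSERT INTO events (kind, payload) VALUES (?, ?)",
            ("moderate_photo", json.dumps({"photo_id": photo_id})),
        )
    return photo_id


def process_pending(handler):
    """Separate worker: pick up unprocessed events and mark each one as done."""
    rows = conn.execute(
        "SELECT id, kind, payload FROM events WHERE processed = 0 ORDER BY id"
    ).fetchall()
    for event_id, kind, payload in rows:
        handler(kind, json.loads(payload))
        with conn:
            conn.execute("UPDATE events SET processed = 1 WHERE id = ?", (event_id,))
    return len(rows)


upload_photo(42, "/photos/a.jpg")
seen = []
handled = process_pending(lambda kind, data: seen.append((kind, data)))
print(handled, seen)
```

Because the data row and the event row commit or roll back together, there is no window in which one exists without the other, so no two-phase commit with an external queue server is needed. In this sketch, a worker that crashes between handling an event and marking it processed would handle it again on the next run, so handlers should tolerate repeats.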

, ( , , , , , ). - , Facebook – Scribe, -, , . - , - . , , Scribe.

- , -, RabbitMQ, 0MQ. - , , , , , , Scribe. , , , , Pinba — , UDP- application- «» , . , - .

.: , , MySQL, , MySQL? — , ?

..: , .

.: , , , , .

..: , - , .

.: , . … … , , , , , . ? , - . , - .

..: , , . , , , , . - , . - - — , . — , . . - , , .

, UDP, - ( UDP ) - , . , , , . - , , . , , , , -, , , . .

.: , . - ?

..: .

.: , , , , , . , , - , , , real-time, , , - .

..: Badoo «». , «» , Badoo .

.: . , , ?

..: , , . , , , . . . . research, , , , , , - «» .

. , ? , , « ». - , « ». application-, web. «» «», , . .

.: , , , . , «», - . , , PHP, , , , , , «» , , «». , Facebook HipHop' HipHopVM . , ? - , ?

..: : , , , , , . , - , . , , PHP, : «, «» Ruby, Ruby», : « , , , , Ruby», , , -. , , , , .

, - , , . , -, , , , . - Python', , , « , , ». . , , , . - . , ( , ), , , «». , - . , , , . .

.: . , , . .

..: --. , .

.: , : , . , : , , , , - ?

..: , , . . : , «» . , . , . , , , .

, , - Erlang' — . — production-. Badoo , , , . , , . , , , : « , , , , -». , , , , . - , - , , , .

.: , , , - , , , - , . .

..: , , . , , , . , , , , , - , . , , - , , , .

.: , , . , , , , , «» . - . « » -, -. , , , ? - .

..: , , , . — , — . ― highload, . , , «» , UNIX-, , — , , . « » — , , .

- , … , , . . - , «», , - , . -, -, . , - IT-, . - , , — , community - . - HighLoad PHP-, PHPClub, SQADays , -, , community , — .

, , HighLoad, . - highload-, . , , … CodeFest, - , DevConf - highload, - «», , highload, , .

, , - , — , , , , — , , , (, ) . , , : Velocity, MySQL Conf – MySQL , MySQL . , MySQL- Percona . PostgreSQL, , «» , , .

.: .

..: . , ?

.: , , -?

..: , , , . : — . , ?

.: , .

..: , , . , ?

.: , .

.: . , , , , , , ― . , , , , . .

..: : , , , , . , , : « , , , , . , “”?» . , , : . , , . , , . , , — « ». , … , . . , , . , — .

, , , , , , , , . , , . , PHP/MySQL-, PHP , - UNIX MySQL, -. highload, , , . — php.feedme.ru , ― .

.: . , , , , , , , . , - . , , , , git, … , , . .

..: , . , — , , - .

.: , . , , .

Host: And I urge our listeners: if you have questions, if you did not manage to ask something... Today we made the announcement by popular demand. The topic really is a good one. If you did not have time or did not think of a question, ask now. Alexey will answer in the comments and explain as best he can. On that note we will wrap up. With you today were Anton Kopylov, Anton Sergeev and our guest, Alexey Rybak from Badoo. Listen to our episodes on itcompot.ru and on the podfm.ru podcast site, and subscribe in iTunes. Good luck with your development, and see you next time. Bye!

A.R.: Goodbye, thank you.


Source: https://habr.com/ru/post/185596/
