
Everything has its time


Banki.ru is a project with a 10-year history. At different times banki.ru has faced different loads. The portal was rebuilt for new requirements both logically and technologically: some things we changed in emergency mode, others evolved gradually. Average traffic is now about 2 million page views, so the project is no longer small, but not quite large either.


This article is a transcript of the talk by Roman Ivliev (CIO of Banki.ru) at the HighLoad++ Junior training conference, held a couple of months ago in Moscow as part of the Russian Internet Technologies festival.
In this article we want to talk about optimization and its timeliness, about suboptimization, and about the fact that best practices for building high-load systems do not always benefit the business.


Let's look at examples and look for answers to questions:


  1. Is your highload really that high?
  2. Should the "Habr effect" be a reason to introduce high technology?
  3. "Crutch" or "high-tech solution": which to choose? Advantages and disadvantages.
  4. How do you choose the moment to start a new era? Are there criteria for when it makes sense to start optimizing your application and doing cool things "like the grown-ups"?
  5. How can you use the "Bunin list" to achieve very good results, and do you really need every item on it?
  6. How do you work with technical debt so that it does not gather moss?

In conclusion, Roman Ivliev will share several examples from Banki.ru's experience of replacing technology solutions under high load, and what came of it.



Everything has its time


Roman Ivliev (Banki.ru)


I am standing here in front of you because I have been involved in all this IT mess for 15 years, and over that time I have gone from engineer to boss. I have worked at various companies that are well known in their circles:




And this is about Banki.ru:




I hope you have heard of us. Anyone who uses banking services probably knows what Banki.ru is.


I will explain where these figures come from and why they are here at all. This is not advertising, it is what we actually have. The traffic is nothing special, but nevertheless we serve about half a million people a day. That is not the number of requests but the number of unique visitors; the number of requests is naturally higher, up to two million, maybe even three. And when there are "miracles" with the currency, or some big bank has its license revoked, the numbers go wild. In 2014, when the Central Bank let the exchange rate float, the load on the exchange rates service grew 85 times in one day: yesterday we had 10 requests, and then there were 850. Per second, naturally.


What did I want to talk about?




About how projects develop in general, that is, about the fact that what you consider HighLoad may not be HighLoad at all. About the Habr effect: surely everyone has heard of this nuisance and knows what it is. About how you can fight all this with "crutches" or with high-tech solutions, and about the moment you realize it is time to stop and start doing things properly, the way the grown-up guys at cool companies do. And, by and large, about what to do with everything left over from this whole mess.




An important point: before you try anything I am about to tell you on yourself, try it on your neighbor first, because this is my experience and it will not necessarily match yours. Moreover, some of the things I will describe are probably better never done at all. But nonetheless…




So, HighLoad or not HighLoad? That is, is your project really that loaded, and is it loaded at all?




Can you tell from questions like these whether your project is loaded? For example, can the number of requests, the number of servers, or the number of records in the database tell you whether a project is loaded or not?


In fact it is quite hard to tell from these particular questions, because the assessment has to be comprehensive. If you have, say, 100 thousand servers handling 100 thousand requests, one per server, then that is probably not HighLoad. But if you have a trading system that handles 100 thousand requests per second on each server, it is probably worth thinking about.




What about questions like these, can they tell you whether it is HighLoad? That is, are your servers coping with the load? Do you need to do something about it, or can you, for example, just remove an infinite loop and everything will be fine? Or do you need to think about how to bolt Tarantool onto a place where it has not been bolted yet? Can you tell from questions like these?


This is actually closer to the truth, because HighLoad is, in essence, when something stops coping with the tasks it used to handle. If your little VPS suddenly starts dying, then you have already arrived at HighLoad. HighLoad for you. There is no rule saying that, for example, 10 requests per second is not HighLoad and 100 requests per second is. So from here on I will be talking about the situation where your current solution reaches a state in which it at best starts to degrade and at worst simply stops working.




One example of when things are not going well for you is the Habr effect. Surely you have heard of it. For those who have not: it is when load suddenly arrives from somewhere, and the degree of surprise can be truly astonishing. Once, when I was working at Kaspersky, someone posted a screenshot crudely drawn in Paint claiming that the Kaspersky home page had been defaced and now showed a naked woman. And, lo and behold, within minutes traffic soared to cosmic heights, because everyone knows what drives humanity, and pictures of that kind are part of it.




That is, explosive growth is a situation where you suddenly become popular: something happens, and extra load suddenly arrives at your resource, application, whatever it is.


It can be a process you can predict, for example when your marketers send out some very cool newsletter saying that a miracle has happened and your online store has, say, a 90% discount on the iPhone 6. Then it all lands in some social network like VKontakte, people start reposting like mad, and an avalanche begins. Freebies drive people too, so a lot of them come to you. The first 10 people bring your solution down, and the next 100 get nothing, curse you, and never come back. But facts are facts.


And there is the random process, when, for example, a colleague of yours or someone else wrote a successful article on Habr and it got indexed really well. In short, people came from somewhere. Or it is the work of competitors, which is really just a plain DDoS. That is, you can end up in a situation where it is not a Habr effect at all, but a DDoS.


But is this a reason to decide that you have finally become HighLoad? Actually, not at all. Because the Habr effect will be over in 20 minutes and you will feel relieved. And you can go on believing that things will ease off, because most likely they will. I know this because in 2014 we were hammered for three days. For three days people were interested in exchange rates, because the dollar had been 30-40 rubles, then became 90, and at some point went above 100. So we were hammered for three days, and then it let go. Was that a sign to us that something had happened? Yes, it was.


In fact, all these moments when your load starts to grow explosively are a cool way to get load testing for free, with parameters you probably could not reproduce in normal life. For example, our test infrastructure cannot even dream of generating the number of requests produced by the people who were trying to find a profitable deposit, or some other way to quickly stash their money.


All this is a reason, once it lets go of you, and it surely will (if it does not, that is a topic for another talk), to review your infrastructure and figure out where your weak spot is. You cannot die everywhere at once. Something dies first and then drags everything else down with it like a locomotive. It will be either the database, or the frontend, or the application. This needs to be investigated (but that, again, is another topic).


Based on the above, the Habr effect is just a reason to think. First, to understand where it came from. If, for example, your marketers miraculously forgot to warn you about an advertising campaign, it is worth asking them not to do that again. After all, they wasted the money: you did not serve anyone anyway. Still, you now know there is a weak link that you can go on working with.




What broke first? The first thing to do is figure out why it all fell over. Why did the growth happen? Is there a chance it will happen again?


This is very important. If you are moving to some next level, you need to understand that (how to understand it, more on that later). And if you see that this is systematic, then it is a reason to think. Until then, it is more for your information.


Naturally, while you are lying there like a dead fish in the sun, it would be nice to try to do something so that things get better for people and they start to somehow hear and see something again. That is what "crutches" are for.




Crutches are not a way to get around, they are a way to make your solution work.


There are two kinds of "crutch": the first is the ordinary "quick" one, the second is the high-tech "crutch". Most often they are the same thing, and many people, when solving their problems, are very shy about admitting that they put in a "crutch". For some reason this is considered bad. In fact, not at all. I tell you honestly: it is even good.




You can solve the problem quickly. Yes, it will not always be simple, and yes, it will not always be clever, that is obvious. Moreover, the next day you will most likely look at this solution, see it clearly and promise yourself never to do that again. But nevertheless you do solve the problem: you find the weak spot, the weak spot is removed, and people, your customers, start getting served. Maybe not all of them, but more of them.


Most often these "crutches" are put in very clumsily. But if you have a process for rolling code out to production, then when things get bad you can do it the other way round. For example, you can fix something right on the live servers, see that it all works, and only then commit it. Sometimes we did exactly that, and we coped.


And the most important thing is to remember that you now owe. You owe yourself, you owe the system, you owe the people. By and large, you owe everyone all around.




As for a technological solution... What do I mean by a technological solution? It is when you do what you do more consciously, so to speak. It involves some design, understanding, awareness, and estimates for the future of how it will or will not work. It is the same "crutch", only put in "by the book", that is, following the process: you filed a task, the task went into planning, you planned it, did it, evaluated it, and so on.


Most often this will be done well and neatly, but when you are being overwhelmed you will not be able to do it this way, because you simply have no time. Situations like ours happen, for example. Back when things got bad for us too, we bolted Varnish onto production in a hurry and, in principle, left it there afterwards, only tuned it, because at the moment we bolted it on it did not work very well. In particular, it started caching the cities people come from, the banners people see and some personal messages, and eventually it cached a 404 error for someone and served that 404 to everyone for about 5 minutes. But we quickly realized that something was wrong. Then we polished it all, reconfigured it, and so on.
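The Varnish story boils down to a missing rule about what a shared cache is allowed to store. Below is a minimal sketch of such a rule in Python. It is not the Varnish configuration used at Banki.ru; the `Response` type, the cookie names and the 60-second TTL are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Response:
    status: int
    headers: dict = field(default_factory=dict)


# Hypothetical cookie names that mark a request as personalized.
PERSONAL_COOKIES = {"session", "user_city"}


def cache_ttl(request_cookies: dict, response: Response) -> int:
    """Return how many seconds a shared cache may keep the response (0 = never)."""
    # Never store errors: one cached 404 would be served to everyone.
    if response.status != 200:
        return 0
    # A response that sets a cookie is addressed to one particular visitor.
    if "Set-Cookie" in response.headers:
        return 0
    # Requests carrying personal cookies get personalized pages (city, messages).
    if PERSONAL_COOKIES.intersection(request_cookies):
        return 0
    return 60  # assumed TTL for anonymous, successful pages
```

With a rule like this, an anonymous 200 response is kept for a minute, while a 404 or a page built for a logged-in visitor is passed through uncached.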




By the way, here is one of the algorithms. The third point is very important. It is common, especially in small companies, to have a kind of chief-driven development, when the boss comes running and says: "I have been told that we must do this." The first thing that comes to mind is: "But it is the boss, for goodness' sake, you cannot offend the man." That is when "crutches" are born, not even technological ones, but just to get it out of sight, and then we will finish it somehow in the evening.


Here is the algorithm. The algorithm is actually cool, but it does not work, because in the end it turns out that you also owe the boss, speaking of debts and crutches. You know that he owes you money, and now you owe him as well, which is no good.




When does it make sense to start doing something to avoid such situations? Is there an algorithm, a point of no return at which you understand that this is it? When has the time come to finally come to your senses, stop believing that everything is under control, remember that the system is full of crutches, and decide what to do about it?




The only way to see it is monitoring. How else would you know? It will actually be hard to tell, and I will show why in a moment. But if you have no monitoring, if all you have is screams from the next room, you are unlikely to figure it out. Then again, who is screaming? If it is that guy from the third point, then most likely it is time. Until you understand what is happening with your system, you have no chance of noticing that it is getting worse.


And analyze the results. In some cases the notorious Google Analytics, which deftly draws those beautiful trends for you, will be enough. I will show you how it fails, but at least some numbers let you begin to see that time is ticking. Naturally, any technology has its own safety margin, which is discovered differently in different cases, most often by a crash. Or else you start doing something in advance.


Now, about that "in advance".




On the question of forecasting. Purely theoretically, it would be possible if the world were arranged that way, and then there would be no problems at all. That is, if 10 users generated 10% CPU load (CPU is just an example here), 20 users 20%, 30 users 30%, and so on up to, say, 1000 if you have 10 cores. In reality this is almost never the case. It probably happens, but I have not seen it. More often 100 users means 10% CPU, and 200 users means, alas, you are done. Or it turns out that 200 users are nothing for the CPU, but you hit the limit somewhere else. Therefore there are, relatively speaking, 4-5 points in your system that need to be monitored at all times.


Those points are CPU, memory, network channel, disk and the number of database requests, because the database is not made of rubber either. If you have managed to allow more connections on the application side than the database can accept, you will not notice it at all until enough people arrive, and then it will die.
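The connection mismatch mentioned above can be checked with plain arithmetic before any load arrives. Here is a minimal sketch; the function name, the parameters and the number of reserved connections are assumptions for illustration and should be replaced with the values from your own application and database settings.

```python
def connection_budget(app_instances: int, pool_size_per_instance: int,
                      db_max_connections: int, reserved: int = 3) -> int:
    """Return the number of spare database connections.

    A negative result means the applications can together request more
    connections than the database will accept, so the problem stays
    hidden only until enough users arrive.
    """
    # Keep a few connections back for administrators and maintenance (assumed).
    usable = db_max_connections - reserved
    return usable - app_instances * pool_size_per_instance


# 4 application instances with pools of 20 against a limit of 100: 17 spare.
spare = connection_budget(4, 20, 100)
# Doubling the instances without touching the database overshoots by 63.
overshoot = connection_budget(8, 20, 100)
```

Running a check like this whenever the number of application instances or the pool size changes is cheaper than finding out during a traffic spike.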


Then there are configuration errors. The most common reason for something going wrong is a configuration error. Take, for example, PostgreSQL out of the box, install it, and it works. On the test machine it works, with 5 users it works, with 10 it works, with 20 it does not. The same story happens with almost everything: if you have programmed your solution and it works great for you, that does not mean it will keep working. So the only thing you can do here is run periodic studies and take periodic measurements of these threshold values, to catch, roughly speaking, the moment when your load starts to bend one of the curves. Say your CPU load is almost linear. You serve more and more, it starts to deviate a little, and at some point it deviates sharply, or it does not deviate but something else does... So if you take something like Zabbix, you can put five graphs on one screen where you can see each against the others. It is best to stack them in a column. Why in a column? Because then you clearly see what happens to all the parameters as the load grows.
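Catching the moment when a curve stops being linear can also be automated on top of the measurements you already collect. The sketch below fits a straight line to the first few points and reports the first load level where the metric runs noticeably above that line. The sample numbers, the three-point baseline and the 25% tolerance are assumptions for illustration, not measurements from Banki.ru.

```python
def first_nonlinear_point(users, cpu, baseline_points=3, tolerance=0.25):
    """Return the first user count where CPU exceeds the linear forecast.

    The forecast is a least-squares line through the origin fitted to the
    first `baseline_points` measurements. Returns None if every later
    measurement stays within `tolerance` (a fraction) of the forecast.
    """
    base_u = users[:baseline_points]
    base_c = cpu[:baseline_points]
    # Slope of the best-fit line through the origin: sum(u*c) / sum(u*u).
    slope = sum(u * c for u, c in zip(base_u, base_c)) / sum(u * u for u in base_u)
    for u, c in zip(users[baseline_points:], cpu[baseline_points:]):
        if c > slope * u * (1 + tolerance):
            return u
    return None


# Made-up measurements: linear up to about 40 users, then the curve bends.
users = [10, 20, 30, 40, 50, 60]
cpu_percent = [5.0, 10.0, 15.0, 21.0, 35.0, 80.0]
bend_at = first_nonlinear_point(users, cpu_percent)
```

The same function can be run against memory, disk or database metrics; as the talk points out, the curve that bends first is not necessarily the CPU.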




This is all very simple to do, with a mouse click in Zabbix, no problems there. If you do not have this yet, my advice is to set it up. We have TVs like this hanging on the way to the restroom, a place every engineer passes more than once a day. He walks past it on the way into work and on the way out. The route just happens to run that way. And as he passes, he looks. The graphs hang there in a column, three monitors, and everything is clearly visible; you do not even need a plasma screen.


As a result, if you have even the slightest fear that something will go wrong with the next increase in load, turn to the people who can help you with this. Either these are testers, if you have them, or take the most basic JMeter, write a couple of lines for it, and it will show you what happens next. That is, the situation where a curve has started to deviate and keeps going without coming back is not some "you made an unlucky commit today"; it is also the cause of many Habr effects.
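If JMeter is not at hand, the same idea fits in a few lines of Python. This is a stand-in sketch rather than a replacement for a real load-testing tool: `run_load` and its parameters are hypothetical names, and the `send` callable is whatever performs one request against your test stand.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(send, total_requests, concurrency):
    """Call send() total_requests times using `concurrency` worker threads.

    Returns (ok_count, error_count, latencies), where latencies is a sorted
    list of per-call durations in seconds. Any exception counts as an error.
    """
    def one_call(_):
        started = time.perf_counter()
        try:
            send()
            ok = True
        except Exception:
            ok = False
        return ok, time.perf_counter() - started

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(one_call, range(total_requests)))

    latencies = sorted(duration for _, duration in results)
    ok_count = sum(1 for ok, _ in results if ok)
    return ok_count, total_requests - ok_count, latencies
```

In practice `send` could be something like a function that fetches a page from a staging host with `urllib.request.urlopen`. Repeating the run with growing `concurrency` and watching the error count and the slowest latencies gives the same kind of curve as the monitoring graphs.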




- ? Sure you may. – . , , , . , … Hetzner – ? , -? / , , . , , , , . , , - , - , , , , 400-500 , , , 1000. 1000 – . – , , , , .


, . . - , , , , :




. Those. - , - , , , , . , .


– ? , 1000 , 10 . . , . google-, . – , , . , . , , . …




– . . Those. : « ?» – «, , 500-600». . ? , ( ). Those. 4 . . Those. , « 4.5 . », . 10 ., , , , .., , . – , . , , . Those. . Those. - , -, - - . , , , ..


. Those. . What it is? , - . , , , , , , . , .




, , 50 , , 2 , 1 . , ± , .. . , , . , , , , , , .




? , , , , , , , , , , - . , , , .


, – , . , , - , - , , , , , Nginx, , ...


, , , , .. , . . Those. – «», , .


, - «» . Badoo Yandex ( HighLoad), : «, «». So what? : «, «». , , . «» – , , .




, , . - - , - - … – « ».


, , . -, , , , , , . , -, , , - . , – , – . , «1» – , «1.1» – , «1.1» 34 , - . , .


, . , , , ? , , , . 100 . , , . , - – , Tarantool. NoSQL, Postgress ( ), - , .




. - . – ? . , , , . . , , , , , , , , , . 30 . , , , , , , , , – . , - , , .


, , . , , – . , , . , , , . , , , , , , . , – , , , , . , «»/ , Landing Page.


Landing Page? html, , Nginx' . Nginx 80 . , – . 80 . . , - , . – – , , , «» «», -, . «»? «». Decision? Why not? , , , , . , 80 . , , .


, , , , .




, – , , ? , :




, 100500 , . , J:




YouTube « », , , 20. 20, , , , J


Those. , , , -. , «-».




, , , , . , , - Badoo. , , , , . , . , , , .. . , , , , . , …


, , ? , , . , , , , : «, , . ». , . « ». , 700 . , , , ID … , – , , . , , , . , «, » – , , ..


( ?) , , . Those. – « Nginx ?», « ?», « Postgress?» , , « - slow log MySQL?». . , , DBA. « ». Google , . Google , , , J.


, , ? . . . , : « . 1 2016 , ». , - , , . . – , .




– , .


– , , , Badoo' , , , . , , , , . - ..


«». – . Those. - . - , . – . . Those. , select * from (- ) – , . , 200 . , 200 . . , 2 , - , , , , , , , , , , , .


Badoo'? . , , , -. Those. , . Those. , – , , – 100 , , , .


? . - . Those. . – IT . , . – , , , , , . , , , , . , , «, , , , , ».




, . . , , , javascript . , , . . Those. , – .


, . – , . Those. , , . « » , - , - -, - . . , , . - . , . , - , , . . – : « ».


, , NPM . ? . «-! ». , … J – , , , . – . , javascritp … , , , , , . , - - , , url… – url , , , - . url, «--, », , «», , , , . , . . Those. , - , - , , , , , , .


. , GitHub, . . . , , .


– , , . , . , slow , , , slow. , php , , javascript . . ' , -, , . -, , . javascritp .


, . -, - , , , , , , , .




«», . . . , , , - , , , , , , . , «». – . . , - , . , , … , . - , - . - , ? There is. , , .


? , .




, «» – . , , , , , - , «» Nginx, , , « ». , : « , . , ». , , Nginx, .


It is never done. Because "later" in fact means "never" if you have not done it now. Priorities are the same nonsense. Some people think that other technical tasks are more important than technical debt. That is rubbish. If you keep accumulating technical debt, you will end up with a kind of "Pandora's box", that is, a box with something nasty lying in it.




And there is such a thing as default. It is the inability to pay your debts. A technical default is when you are able to pay your debts and still do not pay them. The same thing happens with technical debt. You could pay it off, but the business does not allow you to. You can ignore the business, but then the business will ignore you and you will have to change jobs. You can get into this state very quickly, but digging yourself out will take a very long time.


For example, some of our teams have about 170 tasks in technical debt, which is roughly one team-month. We have established experimentally that this is the ceiling. When we hit this ceiling, I go to the business and ask them to give us, between projects, anywhere from a week to a month and a half to clean out these "Augean stables" and throw away what we have programmed there. It does not always work, but it works for us, because the explanation is usually simple: if we do not do this, everything will stop working. Then you listen to what a "fine fellow" you are, and "who have I hired, who am I paying money to, what..." and many other words. But nevertheless, this is the only way.


Debt always accumulates. I have never seen a single company... I have read about one that claims it can work without defects: for them the appearance of a defect is a "red rag" that says "everyone drop everything and go fix the defect". I honestly do not know how fast they move forward. I have not looked into it, but it seems to me that this does not work well.




In conclusion. Strategy is important! Why strategy? The people who work with you should all think alike. When some incident happens, everyone should act in roughly the same way. That is why humanity invented communication, so that you can talk to your colleagues. I understand that programmers do not like talking to their colleagues, but nonetheless. It would be nice to get together and at least appoint the one responsible, that is, say: "If any trouble happens, you are the one who fixes it all." Most likely it will not come to that. Of course everyone will help fix it, but I have seen how it happens live. Usually it looks like this: half the team says "Oh, come on!" and sort of thinks about what to do. The other half says: "Well, that's it, damn... We had 100 requests per second and now it's 500! Well, that's it." And then there is one person who, calling the boys girls and the girls boys to perk up their inner ego, says: "Well, let's think a little."


Thinking is important! In general, programmers tend to think. You need to think about this too: about what is happening and what to do. We have a short plan, 5 points, for what to do if trouble strikes.




It all looks like this: you go here and look at what is there, then go there and look at what is there, and so on.


You have to keep an eye on the system. If you are building a project (and not some Viagra site, or whatever is fashionable to sell these days, that will run for a month and die), then it would be nice to have some picture of what is going on. Think of the system you work on as your car. Would you drive a car whose brakes might not work well? Probably not. If you have once driven into a pole at full speed and realized that your brakes do not work well, then doing it again would be cool, of course, but really it would be better to fix something.


Risks must always be assessed. What is risk assessment? For us it looks like this: how much money will I lose if I am down for a minute? Well, $100, relatively speaking. And how much will it cost me later? Or how badly will the company suffer?
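The questions above can be turned into a rough number that is easy to show to the business. A minimal sketch follows; the $100 per minute comes from the talk, while the function name, the 20-minute outage and the follow-up cost are assumptions for illustration.

```python
def downtime_cost(minutes_down: float, loss_per_minute: float,
                  follow_up_cost: float = 0.0) -> float:
    """Estimate what one outage costs.

    loss_per_minute is the money lost while the service is down;
    follow_up_cost covers what you pay afterwards (compensation,
    overtime, lost customers), estimated however you can.
    """
    return minutes_down * loss_per_minute + follow_up_cost


# A 20-minute outage at $100 per minute plus an assumed $500 of follow-up costs.
estimate = downtime_cost(20, 100.0, 500.0)
```

Comparing such an estimate with the cost of the fix that would prevent the outage is one simple way to decide whether a piece of technical debt is worth paying off now.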


There is also a cool thing that deserves a separate discussion: reputational risks. Every screw-up of yours is not just a monetary screw-up, it is a reputational one. If you are a young up-and-coming social network called facebook2, you had better not go down, otherwise you will have no chance of catching up with facebook1. Incidentally, this was demonstrated by the guys from the Chinese chat app, an analogue of WhatsApp, who proudly entered the market and were told: "Hey, who are you, get out of here." And they stayed in China with their billion users. Because they did not foresee something, did not think something through, did not calculate something, and simply had no strategy for what to do about it all.


Try it on cats. Virtual hosting is very cheap now: you take some, knock your system together on your knee somewhere in a corner, and stuff in whatever technologies you like, any zoo you want. Let it fall, let it die. You can even, for fun, send a slice of your traffic there. Nginx can do that, remember balancing? Send 10 people there and see what happens. That is what the guys from Badoo do. This is not an advertisement, they simply talked about it. They test their services on, I believe, Icelanders. They picked, so to speak, the most earthquake-proof people, who by and large do not care what happens in the world: they sit on their island and everything is fine, Eyjafjallajökull and all. New solutions are rolled out to them and then watched. And they are like: "Oh well, let's go have another beer." This is a true story, they told it themselves.
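Sending a slice of traffic to an experimental installation is usually done at the balancer, but the selection logic itself is simple. The sketch below shows one common way to do it, written in Python rather than as an Nginx configuration: hash a stable visitor identifier and route a fixed percentage of the hash space to the experiment. The function name and the backend labels are hypothetical.

```python
import hashlib


def pick_backend(user_id: str, canary_percent: float) -> str:
    """Route roughly canary_percent of visitors to the experimental backend.

    The choice is deterministic: the same user_id always lands on the same
    backend, so one visitor does not bounce between the two systems.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the visitor to one of 10000 buckets (0..9999).
    bucket = int.from_bytes(digest[:4], "big") % 10000
    return "canary" if bucket < canary_percent * 100 else "stable"
```

Starting with a fraction of a percent and raising it only while the experimental backend stays healthy keeps the blast radius small if the new stack does fall over.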




You should always keep track of current software versions. Everything changes all the time, but there is a catch: if you forget to update something, chances are that the next version you eventually pull from the repository will be incompatible with your solution. This is a very nasty pitfall that many people forget about: it is in Git, so that is that. But everyone forgets that the Git repository is not your property, it belongs to the guy who wrote the solution. And deleting a branch is, in principle, his right, granted to him, as they say, by the constitution of the Internet. It is his code, and he does with it what he wants. So if you want everything to be clean, honorable and noble, keep track of your software, watch for updates, and watch what is going on on the Internet in general. If you are, for example, a currency exchange website, it would be nice to keep an eye on what is happening with that currency. Because sooner or later trouble can happen.




Important words. A very clever man said them. He said that everything I have said here is, by and large, just everything I have said here. That is, what to actually do is for you to decide.




We will talk about developing high-load systems in more detail at Roman Ivliev's webinar "HighLoad development from scratch and step by step". It will be a one-day intensive master class in which we will cover system development, refactoring and reengineering much more thoroughly:
System development

  • When is it time to start doing something?
  • The evolution of the classic loaded web portal (for example, banki.ru)
  • Optimization methods for various elements of the system
  • Managed degradation and preparation for it
  • Preparing the system for growth: flexibility and reliability
  • Hype - its pros and cons

Refactoring or reengineering

  • Business Continuity in System Development
  • When else can you safely use the capabilities of the current system?
  • When do you need to start building again?
  • PDCA cycle and “lean” system development.

Source: https://habr.com/ru/post/308616/

