
Summary of the report “What we know about microservices” (HighLoad++ 2018, Avito, Vadim Madison)

Hi, %username%!



The HighLoad++ conference has just ended (thanks again to the entire organizing team and to olegbunin personally; it was very cool!).



On the eve of the conference, Alexey Fisher proposed creating an initiative group of “stalkers” at the conference. During the talks we wrote short notes and exchanged them. Some of the notes turned out to be quite detailed.


The community on social networks responded positively to this format, so I decided (with permission) to publish my outline of the first talk. If this format proves interesting, I can prepare a few more articles.



image



Off we go



Avito has many services and a lot of links between them, which causes problems:





A large number of infrastructure elements:





There are several layers; the report covers only one of them (PaaS).



The platform has 3 main parts:





Standard microservice development pipeline



CLI push -> CI -> Bake -> Deploy -> Test -> Canary -> Production



CLI-push



For a long time we taught developers to do things right, but it still remained a weak point.



This was automated with a CLI utility that scaffolds the basis for a microservice:



  1. Creates a service from a template (templates exist for a number of languages).
  2. Automatically deploys the local development infrastructure.
  3. Connects a database (no configuration required; the developer does not think about access to any database).
  4. Live build.
  5. Generates autotest stubs.


The config is described in a TOML file.



Sample file:



image
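The sample manifest is only shown as a screenshot in the talk; the field names below are purely hypothetical, sketched in the same spirit:

```toml
# Hypothetical service manifest; the real field names are not given
# in the report, only an image of the sample file.
[service]
name = "recommendations"
language = "go"

[database]
engine = "postgresql"   # provisioned automatically, no manual access config

[resources]
cpu = "500m"
memory = "256Mi"
```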



Validation



Basic validation checks:





Documentation



Everyone should have documentation, but almost no one does.



The documentation should be:





Documentation needs to be reviewed.



Pipeline preparation





Bake





Service ownership is determined from pushes (the number of pushes and the amount of code in them).



If there are potentially dangerous migrations (ALTER), a trigger is registered in Atlas and the service is placed in quarantine.



The quarantine is resolved with the owners via push notifications (in manual mode?).



Conventions Check



Checking:





Tests



Testing is performed in a closed loop (for example, with hoverfly.io): typical load is recorded and then replayed in the closed loop.



Resource consumption is checked against expectations (we look separately at the extreme cases: too few or too many resources), with a cut-off by RPS.



Load testing also shows the performance delta between versions.
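The closed-loop idea can be sketched as a record/replay step: capture typical traffic once, then replay it against a new version without touching real dependencies. The talk mentions hoverfly.io for this; the code below is a generic toy illustration, not hoverfly's API.

```python
# Toy record/replay loop: capture (request, response) pairs under typical
# load, then replay them against a candidate handler and count mismatches.
recorded = []  # pairs captured from production-like traffic

def record(request, real_backend):
    """Capture one request/response pair while the real backend serves it."""
    response = real_backend(request)
    recorded.append((request, response))
    return response

def replay(handler):
    """Replay the recorded requests against a handler; return mismatch count."""
    mismatches = 0
    for request, expected in recorded:
        if handler(request) != expected:
            mismatches += 1
    return mismatches
```

In a real setup the recorded traffic would also drive the load profile, so the performance delta between versions can be measured under identical input.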



Canary tests



We start the launch on a very small number of users (<0.1%).



Minimum soak: 5 minutes. Main phase: 2 hours. Then the share of users is increased if everything is OK.
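The ramp-up rule above can be sketched as a small decision function. The figures (<0.1% start, 5-minute minimum, 2-hour main phase) come from the report; the 10x ramp factor is an assumption for illustration.

```python
# Canary ramp-up sketch: hold traffic during the soak and observation
# windows, roll back on failure, grow the share afterwards.
CANARY_START = 0.001  # <0.1% of users, per the report

def next_traffic_share(current_share, minutes_elapsed, healthy):
    """Decide the canary's next traffic share."""
    if not healthy:
        return 0.0                        # roll back immediately
    if minutes_elapsed < 5:
        return current_share              # minimum soak time
    if minutes_elapsed < 120:
        return current_share              # main observation window
    return min(current_share * 10, 1.0)   # ramp up if everything is OK
```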



We look:





Squeeze testing



Testing by “squeezing” an instance to its limit.



We load one instance with real users up to the point of failure and observe its ceiling. Then we add another instance, load it, and observe the next ceiling. We check for regressions, and enrich or replace the load-testing data in Atlas.
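The search for an instance's ceiling can be sketched as a stepwise load increase. `measure` stands in for a real load generator, and the step sizes are assumptions:

```python
# Squeeze-testing sketch: push one instance with increasing load until it
# stops keeping up; return the last load it served successfully.
def find_ceiling(measure, start_rps=100, step=100, max_rps=10_000):
    """Increase RPS step by step; measure(rps) is True while the instance copes."""
    ceiling = 0
    rps = start_rps
    while rps <= max_rps:
        if not measure(rps):
            break            # point of failure reached
        ceiling = rps        # last known good load
        rps += step
    return ceiling

# A fake instance that degrades above 750 rps has its ceiling at 700:
assert find_ceiling(lambda rps: rps <= 750) == 700
```

Comparing the ceilings found for consecutive versions is what surfaces the regressions mentioned above.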



Scaling



Scaling on CPU alone is bad; you need to add product metrics as well.



The final scheme:





When scaling, do not forget to check dependencies between services. Remember the scaling cascade (+1 level). Look at the historical data of the initiating service.
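A minimal sketch of a scaling rule that combines CPU with a product metric (RPS against the ceiling measured by squeeze testing). The 70% CPU and 80%-of-ceiling thresholds are assumptions for illustration:

```python
import math

# Scale on whichever signal demands more capacity: CPU utilization or
# the product metric (RPS per instance vs. the measured ceiling).
def desired_instances(current, cpu_util, rps_per_instance, rps_ceiling):
    """Return the instance count suggested by the stronger of two signals."""
    by_cpu = current + 1 if cpu_util >= 0.7 else current
    # keep each instance below ~80% of its squeeze-tested ceiling
    by_rps = math.ceil(current * rps_per_instance / (0.8 * rps_ceiling))
    return max(by_cpu, by_rps, 1)
```

A cascade-aware version would apply the same rule one level down the dependency graph before committing the scale-up.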



Additionally





Dashboard



We look at everything from above in an aggregated form and draw conclusions.





Example:



image

Source: https://habr.com/ru/post/429460/


