
Summary of the talk “What we know about microservices” (HL2018, Avito, Vadim Madison)

Hi, %username%!

The Highload++ conference ended just recently (thanks again to the entire organizing team and to olegbunin personally. It was very cool!).

On the eve of the conference, Alexey fisher proposed forming an initiative group of “stalkers” at the conference. During the talks we wrote short notes and exchanged them. Some of the notes turned out to be quite detailed.
The community on social networks liked this format, so I decided (with permission) to publish my notes on the first talk. If the format proves interesting, I can prepare a few more articles.

image

Let's go


Avito has many services and many connections between them. This causes problems:


A large number of infrastructure elements:


There are several layers; the talk covers only one (PaaS).

The platform has 3 main parts:


Standard microservice development pipeline


CLI push -> CI -> Bake -> Deploy -> Test -> Canary -> Production

CLI-push


It took a long time to teach developers to do this correctly. It still remains a weak point.

This step is automated with a CLI utility that helps create the skeleton of a microservice:

  1. Creates a service from a template (templates are supported for a number of programming languages).
  2. Automatically deploys the local development infrastructure.
  3. Connects the database (no configuration is required; the developer does not have to think about access to any database).
  4. Live build.
  5. Generates stubs for autotests.

The config is described in a TOML file.

Sample file:

image

Validation


Basic validation checks:


Documentation


Everyone should have the documentation, but almost nobody has it.

The documentation should be:


Documentation needs to be reviewed.

Pipeline preparation



Bake



The owner is determined from pushes (the number of pushes and the amount of code in them).
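As a rough illustration of this heuristic, here is a minimal sketch that scores each author by push count and by lines changed. The `Push` type, the `guess_owner` function, and both weights are hypothetical; the talk does not say how the two signals are actually combined.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class Push(NamedTuple):
    author: str
    lines_changed: int


def guess_owner(
    pushes: List[Push],
    push_weight: float = 1.0,    # assumed weight of a single push
    line_weight: float = 0.01,   # assumed weight of one changed line
) -> Optional[str]:
    """Return the author with the highest combined score, or None if there are no pushes."""
    scores: Dict[str, float] = defaultdict(float)
    for push in pushes:
        scores[push.author] += push_weight + line_weight * push.lines_changed
    if not scores:
        return None
    # On a tie, max() keeps the author that appeared first.
    return max(scores, key=lambda author: scores[author])


history = [Push("alice", 100), Push("bob", 10), Push("bob", 10), Push("bob", 10)]
owner = guess_owner(history)
```

With these weights, three small pushes outweigh one larger push, so `owner` is `"bob"`; changing the weights shifts the balance toward code volume.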

If there are potentially dangerous migrations (ALTER), a trigger is registered in Atlas and the service is placed in quarantine.

Quarantine is resolved with the owners via push notifications (manually?).
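A check of this kind could look roughly like the sketch below. It assumes that "potentially dangerous" simply means "contains an ALTER statement"; the function names are hypothetical, and the real Atlas trigger is not described in the talk.

```python
import re
from typing import Iterable, List

# Assumption: any statement that starts a line with ALTER counts as dangerous.
# SQL comments such as "-- ALTER ..." do not match because the line starts with "--".
ALTER_RE = re.compile(r"^\s*ALTER\s+\w+", re.IGNORECASE | re.MULTILINE)


def dangerous_statements(migration_sql: str) -> List[str]:
    """Return the ALTER fragments found in one migration script."""
    return [match.group(0).strip() for match in ALTER_RE.finditer(migration_sql)]


def needs_quarantine(migrations: Iterable[str]) -> bool:
    """True if at least one migration contains a potentially dangerous statement."""
    return any(dangerous_statements(sql) for sql in migrations)


safe = "CREATE TABLE items (id int);"
risky = "CREATE TABLE t (id int);\nalter table t ADD COLUMN x int;"
quarantined = needs_quarantine([safe, risky])
```

A line-based regular expression is deliberately crude: it can miss statements that follow another statement on the same line, so a real check would more likely rely on a SQL parser.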

Conventions Check


Checking:


Tests


Testing runs in a closed loop (for example, with hoverfly.io): a typical load is recorded and then replayed in the closed loop.

Resource consumption is checked against expectations (extreme cases, too few or too many resources, are examined separately), along with the cutoff by rps.

Load testing also shows the performance delta between versions.
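The delta between versions can be expressed as a relative change. The sketch below uses latency as the measured value and a 10% regression threshold; both choices, and the function names, are assumptions made for illustration.

```python
def performance_delta(baseline_ms: float, candidate_ms: float) -> float:
    """Relative latency change of the candidate version vs. the baseline, in percent."""
    if baseline_ms <= 0:
        raise ValueError("baseline latency must be positive")
    return (candidate_ms - baseline_ms) / baseline_ms * 100.0


def is_regression(baseline_ms: float, candidate_ms: float, threshold_pct: float = 10.0) -> bool:
    """True if the candidate is slower than the baseline by more than threshold_pct."""
    return performance_delta(baseline_ms, candidate_ms) > threshold_pct


delta = performance_delta(100.0, 120.0)   # candidate is about 20% slower
flagged = is_regression(100.0, 120.0)
```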

Canary tests


We start the rollout on a very small share of users (<0.1%).

The minimum run is 5 minutes; the main run is 2 hours. If everything is OK, the share of users is then increased.
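The schedule can be sketched as a list of stages. Only the starting share (<0.1%), the 5-minute minimum, and the 2-hour main run come from the talk; the tenfold growth factor, the 2-hour ramp stages, and all names below are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CanaryStage:
    name: str
    user_share: float    # fraction of users, 0.001 == 0.1%
    duration_min: int    # 0 means "no time limit"


def canary_plan(start_share: float = 0.0005, growth: float = 10.0) -> List[CanaryStage]:
    """Build a rollout plan: minimum run, main run, then assumed ramp-up stages."""
    if not 0 < start_share < 1:
        raise ValueError("start_share must be between 0 and 1")
    if growth <= 1:
        raise ValueError("growth must be greater than 1")
    stages = [
        CanaryStage("minimum", start_share, 5),     # from the talk: 5 minutes
        CanaryStage("main", start_share, 120),      # from the talk: 2 hours
    ]
    share = start_share * growth
    while share < 1.0:
        stages.append(CanaryStage("ramp", share, 120))   # assumed duration
        share *= growth
    stages.append(CanaryStage("full", 1.0, 0))
    return stages


plan = canary_plan()
```

Each stage would only begin if the previous one finished without alarms; the plan itself does not encode that check.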

We look at:


Squeeze testing


Testing by squeezing.

We load one instance with real users up to the point of failure and note its ceiling. Then we add another instance, load it, and note the next ceiling. We watch for regressions. The results enrich or replace the load-testing data in Atlas.
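The procedure amounts to a ramp loop repeated for a growing number of instances. In the sketch below the real traffic is replaced by a hypothetical `is_healthy(instances, rps)` callback, and the step size and limits are assumptions.

```python
from typing import Callable, List


def find_ceiling(
    is_healthy: Callable[[int, int], bool],
    instances: int,
    start_rps: int = 100,
    step_rps: int = 100,
    max_rps: int = 1_000_000,
) -> int:
    """Highest tested rps at which the given number of instances stayed healthy (0 if none)."""
    ceiling = 0
    rps = start_rps
    while rps <= max_rps and is_healthy(instances, rps):
        ceiling = rps
        rps += step_rps
    return ceiling


def squeeze(is_healthy: Callable[[int, int], bool], max_instances: int = 2) -> List[int]:
    """Ceilings for 1, 2, ..., max_instances instances."""
    return [find_ceiling(is_healthy, n) for n in range(1, max_instances + 1)]


# Stand-in for a real service: each instance survives up to 450 rps.
def fake_service(instances: int, rps: int) -> bool:
    return rps <= 450 * instances


ceilings = squeeze(fake_service)
```

Comparing consecutive ceilings shows whether capacity grows roughly linearly with instances; a much smaller gain from the second instance would point at a shared bottleneck.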

Scaling


Scaling on CPU alone is bad; product metrics need to be added.
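One common way to combine CPU with a product metric is to compute the desired replica count for each metric and take the largest. The sketch below borrows the proportional rule used by the Kubernetes Horizontal Pod Autoscaler; it is not necessarily what Avito does, and the metric names and targets are made up.

```python
import math
from typing import Dict, Tuple


def desired_replicas(current_replicas: int, metrics: Dict[str, Tuple[float, float]]) -> int:
    """metrics maps a metric name to (current value, target value).

    For each metric: ceil(current_replicas * current / target); the result is the maximum.
    """
    if current_replicas < 1:
        raise ValueError("current_replicas must be at least 1")
    if not metrics:
        return current_replicas
    wanted = []
    for name, (current, target) in metrics.items():
        if target <= 0:
            raise ValueError("target for %r must be positive" % name)
        wanted.append(math.ceil(current_replicas * current / target))
    return max(max(wanted), 1)


replicas = desired_replicas(
    4,
    {
        "cpu_utilization": (0.5, 0.6),      # CPU alone would keep 4 replicas
        "orders_per_second": (300, 200),    # hypothetical product metric asks for 6
    },
)
```

Here CPU looks comfortable while the product metric is 50% over its target, so the service is scaled to 6 replicas.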

The final scheme:


When scaling, don't forget to check the service's dependencies. Remember the scaling cascade (+1 level). Look at the historical data of the initiating service.
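Reading "+1 level" as "also consider the direct dependencies of the service being scaled" (an interpretation, not something the talk spells out), the cascade can be sketched as a bounded walk over a dependency graph. The graph and service names are hypothetical.

```python
from typing import Dict, List, Set


def scaling_cascade(deps: Dict[str, List[str]], service: str, levels: int = 1) -> Set[str]:
    """Services reachable from `service` within `levels` dependency hops, excluding itself."""
    seen = {service}
    frontier = {service}
    for _ in range(levels):
        # Move one hop further, skipping services that were already visited.
        frontier = {dep for svc in frontier for dep in deps.get(svc, [])} - seen
        seen |= frontier
    return seen - {service}


graph = {
    "api": ["auth", "search"],
    "search": ["index"],
    "auth": [],
}
affected = scaling_cascade(graph, "api")
```

With the default of one level, scaling `api` flags `auth` and `search` for review but not `index`, which is two hops away.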

Additionally



Dashboard


We look at everything from above in an aggregated form and draw conclusions.


Example:

image

Source: https://habr.com/ru/post/429460/

