Hello everyone, my name is Yuri Builov, I lead development at CarPrice. I'll briefly tell you how and why we came to microservices in PHP and Golang, what we use, and how we instrument and monitor our applications in production. Then I'll talk about distributed tracing, which gives us transparency into our services.

Why microservices
Microservices have been quite a trendy topic lately, and many want them even where they are not needed. It is a rather slippery path, and before stepping onto it you need to understand what awaits you ahead. We came to microservices not for the sake of the trend but out of necessity, aware of all the difficulties we would have to face.
Initially, CarPrice was built as a monolithic Bitrix application by outsourced developers, with a focus on development speed. At a certain stage this played an important role in the project's successful entry to the market.
Over time, it became impossible to keep the monolith stable: every release turned into an ordeal for testers, developers, and admins, and different processes interfered with each other's normal operation. For example, an employee from the document workflow department could start generating documents for completed auctions, and at that moment dealers could not bid normally because the backend slowed to a crawl.
We began to change. Large parts of the business logic were moved into separate services: logistics, a service checking the legal status of a car, an image processing service, a service for booking dealer inspections and car handovers, billing, a bid-receiving service, an authentication service, a recommender system, and an API for our mobile and React applications.
What we write in
At the moment, we have dozens of services on different technologies that communicate over the network.
These are mostly small Laravel (PHP) applications that each solve a specific business problem. Such a service provides an HTTP API and may have an administrative web UI (Vue.js).
We try to extract common components into libraries delivered via Composer. In addition, the services inherit from a common php-fpm Docker image, which removes the headache of updates: for example, we run PHP 7.1 almost everywhere.
Speed-critical services we write in Golang.
For example, the jwt-authentication service issues and validates tokens; it can also cut off an unscrupulous dealer whom a manager has disconnected from the auction platform for misdeeds.
The bid-receiving service processes dealers' bids, saves them to the database, and publishes events to RabbitMQ and the real-time notification service.
For Golang services we use go-kit and gin/chi.
go-kit attracted us with its abstractions, support for various transports, and wrappers for metrics, but it is a bit tiring with its love of functional style and verbosity, so we use it in large, solid services with rich business logic.
gin and chi are convenient for assembling simple HTTP services: they are ideal for getting a small service into production quickly and with minimal effort, as the sketch below shows. If the entities grow complex, we try to migrate the service to go-kit.
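As an illustration of how little ceremony this takes, here is a minimal sketch of such a service on chi (the import path is the current go-chi/chi/v5 release; older code used github.com/pressly/chi, and the port and route are arbitrary):

    package main

    import (
        "log"
        "net/http"

        "github.com/go-chi/chi/v5"
    )

    func main() {
        r := chi.NewRouter()

        // a single endpoint; a real service would decode a request,
        // call the domain logic and encode a response
        r.Get("/health", func(w http.ResponseWriter, req *http.Request) {
            w.Write([]byte("ok"))
        })

        log.Fatal(http.ListenAndServe(":8080", r))
    }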
Evolution of monitoring
In the monolith days, New Relic was enough for us. When we jumped onto the microservice rung, the number of servers grew, we abandoned it for financial reasons, and turned to open source: Zabbix for the hardware, ELK, Grafana, and Prometheus for APM.

First of all, we shipped the nginx logs from all services into ELK and built graphs in Grafana; to investigate the requests that spoiled the 99th percentile, we went to Kibana.
And here the quest began: understanding what was actually happening with a request.
In the monolithic application everything was simple: if it was PHP, there was xhprof, armed with which you could figure out what was going on. With microservices, where a request passes through several services, even on different technologies, this trick no longer works. Somewhere it's the network, somewhere synchronous calls, somewhere a stale cache.
Suppose we found a slow request to our API. From the code we determined that the request called three services, assembled the result, and returned it. Now we have to find the downstream requests by indirect evidence (timestamps, request parameters) to understand which of the services caused the slowness. And even having found that service, we have to dig through its metrics or logs, while it often turns out that the downstream services respond quickly and yet the resulting request is slow. In short, not much fun.
And we realized it was time: we needed distributed tracing.
Jaeger, welcome!
Motivation:
- Searching for anomalies: why the 99th percentile degrades, e.g. network timeouts, service errors, or locks in the database.
- Diagnosing mass problems (in the 50th or 75th percentile) after a deployment, a configuration change, or a change in the number of instances.
- Distributed profiling: finding slow services, components, or functions.
- Visualization (a Gantt chart) of the request's stages, so you can see what happens inside.
Remembering Google's Dapper, we first came to OpenTracing, the universal standard for distributed tracing. It is supported by several tracers; the best known are Zipkin (Java) and Appdash (Golang). However, among the old-timers supporting the standard, a new and promising tracer has recently appeared: Jaeger from Uber Technologies. We will talk about it.
Backend - Go
UI - React
Storage - Cassandra / Elasticsearch
It was developed from the start around the OpenTracing standard.
Unlike Zipkin's, Jaeger's model natively supports key-value logging, and traces are represented as a directed acyclic graph (DAG) rather than just a tree of spans.
In addition, quite recently, at the Open Source Summit in LA, Jaeger was put on the same shelf as such honorable projects as Kubernetes and Prometheus.

Architecture
Each service collects timings and additional information into spans and sends them over UDP to the jaeger-agent sitting next to it. The agent, in turn, forwards them to the jaeger-collector, after which the traces become available in jaeger-ui. On the official site the architecture is depicted like this:

Jaeger in production
Most of our services are deployed in Docker containers. They are built by Drone and deployed by Ansible. Unfortunately (not), we have not yet moved to an orchestration system like k8s, Nomad, or OpenShift, and the containers are run by Docker Compose.
A typical service of ours, in conjunction with Jaeger, looks like this:

Installing Jaeger in production amounts to running several services plus storage:
→ collector - accepts spans from services and writes them to storage
→ query - the web UI and API for reading spans from storage
→ storage - stores all the spans; either Cassandra or Elasticsearch can be used
For dev environments and local development, it is convenient to use the Jaeger "all-in-one" build with in-memory storage for traces:
jaegertracing/all-in-one:latest
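For example, a quick way to run it locally (port 6831/udp accepts spans from client libraries, 16686 serves the web UI; the container name is arbitrary):

    docker run -d --name jaeger \
      -p 6831:6831/udp \
      -p 16686:16686 \
      jaegertracing/all-in-one:latest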
How it works
The service collects timings and meta information about the request into spans. A span is passed between methods via the context, and to downstream services by injecting the trace context into the request headers.
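As an illustration, a minimal sketch of such injection using the opentracing-go API (injectSpan is a hypothetical helper, not from our codebase):

    import (
        "context"
        "net/http"

        opentracing "github.com/opentracing/opentracing-go"
    )

    // injectSpan copies the current span's context into the headers of
    // an outgoing request so the downstream service can continue the trace
    func injectSpan(ctx context.Context, req *http.Request) {
        if span := opentracing.SpanFromContext(ctx); span != nil {
            // the error is ignored here for brevity
            _ = opentracing.GlobalTracer().Inject(
                span.Context(),
                opentracing.HTTPHeaders,
                opentracing.HTTPHeadersCarrier(req.Header),
            )
        }
    }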

For demonstration, the Uber team prepared a good example illustrating tracing in a driver search service: HotROD.
How it looks in code
First we need to create the tracer itself.
    import (
        "github.com/uber/jaeger-client-go"
        "github.com/uber/jaeger-client-go/config"
        ...
    )

    jcfg := config.Configuration{
        Disabled: false,
        // the sampler and reporter settings below are illustrative defaults,
        // added to make the truncated snippet complete
        Sampler: &config.SamplerConfig{
            Type:  jaeger.SamplerTypeConst,
            Param: 1, // sample every request
        },
        Reporter: &config.ReporterConfig{
            LocalAgentHostPort: "jaeger-agent:6831",
        },
    }
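From this config we then build the tracer and register it globally, a sketch in which the service name and error handling are illustrative:

    import (
        "log"

        opentracing "github.com/opentracing/opentracing-go"
    )

    // "my-service" is a placeholder service name
    tracer, closer, err := jcfg.New("my-service")
    if err != nil {
        log.Fatal(err)
    }
    defer closer.Close()

    // make the tracer available to the rest of the application
    opentracing.SetGlobalTracer(tracer)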
Then we add middleware (opentracing.TraceServer), which creates the root span for the API method. All nested spans will be attached to it.
    endpoint := CreateEndpoint(svc)
    // wrap the endpoint so each call starts a root span
    // (operation name "DoSmth" is illustrative)
    endpoint = opentracing.TraceServer(tracer, "DoSmth")(endpoint)
In addition, we extract the trace context from the headers of the incoming request (opentracing.FromHTTPRequest). This way, our service is linked to its upstream service, provided the latter passed the trace context in the request (Inject).
    r.Handle(path, kithttp.NewServer(
        endpoint,
        decodeRequestFn,
        encodeResponseFn,
        // pull the trace context out of the incoming headers;
        // newer go-kit releases call this helper HTTPToContext
        kithttp.ServerBefore(opentracing.FromHTTPRequest(tracer, "DoSmth", logger)),
    ))
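For the simple gin/chi services that do not use go-kit, the same linking can be done by hand with the raw opentracing-go API; a sketch (withRootSpan is a hypothetical wrapper name):

    import (
        "net/http"

        opentracing "github.com/opentracing/opentracing-go"
        "github.com/opentracing/opentracing-go/ext"
    )

    // withRootSpan extracts the caller's trace context from the request
    // headers and opens a server-side root span for the handler
    func withRootSpan(tracer opentracing.Tracer, name string, h http.HandlerFunc) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            wireCtx, _ := tracer.Extract(
                opentracing.HTTPHeaders,
                opentracing.HTTPHeadersCarrier(r.Header),
            )
            // RPCServerOption tolerates a nil context, so untraced calls still work
            span := tracer.StartSpan(name, ext.RPCServerOption(wireCtx))
            defer span.Finish()
            h(w, r.WithContext(opentracing.ContextWithSpan(r.Context(), span)))
        }
    }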
Next, we instrument our methods:
    func (s Service) DoSmth(ctx context.Context) error {
        span := s.Tracing.StartSpan("DoSmth", ctx)
        defer span.Finish()
        // ... the method's actual work ...
        return nil
    }
And starting the span itself looks like this:
    func (t AppTracing) StartSpan(name string, ctx context.Context) opentracing.Span {
        span := opentracing.SpanFromContext(ctx)
        if span != nil {
            span = t.Tracer.StartSpan(name, opentracing.ChildOf(span.Context()))
        } else {
            span = t.Tracer.StartSpan(name)
        }
        return span
    }
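One caveat: this helper returns the span but does not put it back into the context, so deeper calls will still attach to the old parent. A possible variant (our addition, not part of the helper above) also returns an updated context:

    // StartSpanCtx is a hypothetical variant that also returns a context
    // carrying the new span, so that nested calls become its children
    func (t AppTracing) StartSpanCtx(name string, ctx context.Context) (opentracing.Span, context.Context) {
        span := t.StartSpan(name, ctx)
        return span, opentracing.ContextWithSpan(ctx, span)
    }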
That's all. Now we can observe the work of our service in real time.
(Big picture)

For example, in the picture above we found a slow request and saw that half the time was spent on the network between services, and the other half on an update to the database. That's already something you can work with.
Thanks for your attention. I hope this article proves useful, and that Jaeger helps someone bring transparency to the work of their services.
Useful links
→ Project site
→ Repository
→ OpenTracing website
→ Example