Events, buses and data integration in the challenging world of microservices
Valentin Gogichashvili explains microservices. Below is a transcript of his talk at HighLoad++.
Good afternoon, I am Valentin Gogichashvili. I made all the slides in the Latin alphabet, I hope that will not be a problem. I am from Zalando.
What is Zalando? You probably know Lamoda; in its time, Zalando was Lamoda's dad. To understand what Zalando is, imagine Lamoda and scale it up several times.
Zalando is a clothing shop. We started out selling shoes, very good ones by the way, and kept expanding more and more. From the outside the site looks very simple. Over the 6 years I have been working at Zalando, and over its 8 years of existence, this company has been one of the fastest growing in Europe. Six years ago, when I joined Zalando, it was growing at about 100% a year.
When I started 6 years ago, it was a small startup; I came quite late, there were already 40 people there. We started in Berlin, and in these 6 years we have expanded Zalando Technology to many cities, including Helsinki and Dublin. In Dublin sit the data scientists, in Helsinki the mobile developers.
Zalando Technology is growing. At the moment we hire around 50 people a month, which is a frightening pace. Why? Because we want to build the world's coolest fashion platform. Very ambitious; let's see what happens.
I want to go back into history a little and show you the old world that you, most likely, have been in at some point in your career.
Zalando began as a small service that had 3 tiers: web application, backend and database. We used Magento. By the time I was called into Zalando, we were the biggest Magento users in the world. We had huge headaches with MySQL.
We started the project REBOOT. I came to this project 6 years ago.
What have we done?
We rewrote everything in Java, because we knew Java. We put PostgreSQL everywhere, because I knew PostgreSQL. As for Python, practically any sane person will back me up that Python is the only right choice for tooling (people from the Perl world, don't kill me). Python is a great thing for writing tooling.
We began to build a setup like this:
We ended up with a system of macroservices: a Java backend, with PostgreSQL storage and PostgreSQL sharding. Two years ago, at this same conference, I talked about how we do PostgreSQL sharding, how we manage schemas and how we roll out new versions without downtime; it was very interesting.
As I said, we all knew Java. SOAP was used to connect the macroservices to each other. PostgreSQL gave us the ability to keep very clean data: we had a schema, clean data, transactions and stored procedures, which we taught to all the Java developers, including those who had come from the PHP world and whom we taught both Java and stored procedures.
One hint: if you are operating at fewer than 15 million users per month, you can use the Java SProc Wrapper to do automatic PostgreSQL sharding from Java. A very interesting thing that essentially turns PostgreSQL into an RPC system.
Everything was fine; we wrote and rewrote everything. For example, we first bought our warehouse management system and then rewrote it, because we had to move much faster than the people we bought the system from could.
Everything worked fine until the staffing problems started. Our beautiful world began to crumble before our eyes. The standardization we had introduced at the level of Java and SOAP began to crumble. People began to complain and leave, or simply not to join at all.
We told them: you have to write in Java; if you leave, what are we going to do? If you write something in Haskell or Clojure and then leave, what are we going to do? And they answered us: fuck you.
We decided to get serious about it. We decided to rebuild not only the architecture but the entire organization. We began a restructuring the likes of which German industry had not seen, in which we said we were completely tearing down everything we had. It was an organization of around 900 people, and we dismantled the hierarchical structure in the form it existed. We announced Radical Agility.
What does it mean?
We announced that we have autonomous teams that move forward with purpose. And of course, we want the people doing the work to do it with mastery.
Autonomous teams.
They can choose their own technology stack. If a team decides they will write in Haskell or Clojure, so be it. But you have to pay for it: teams must support the services they wrote themselves, and wake up at night themselves. This includes the choice of the storage stack. We taught you PostgreSQL; if you want to choose MongoDB, then no, stop, MongoDB is actually blocked for us. We have a technology radar, for which we conduct monthly surveys, and technologies that we consider dangerous we put in the red sector. That means a team can still choose those technologies, but the responsibility falls entirely on them if something goes wrong.
We said that teams would be isolated in their own AWS accounts. Before that we ran our own data centers; by choosing AWS we made a deal with the devil. We said: we know it will cost more, but we will move faster. We would no longer have situations like in our own data centers, where it took 6 weeks to order one hard disk. That was unbearable and impossible; we could not move forward.
Many people believe that autonomy is anarchy. Autonomy is not anarchy. With autonomy comes a lot of responsibility, especially for Zalando, which is a publicly traded company. We are on the stock exchange, and as with any publicly traded company, auditors come to us and check how our systems work. We had to create some kind of structure that would allow our developers to work with AWS and still be able to answer auditors' questions like: “Why is this IP address publicly accessible without authentication?”
The result was a system like this:
We wanted to make it as simple as possible, and it really is simple. But everyone swears when they see it.
Let me remind you: if you go to AWS with this speed and this openness, and if you embrace the idea of microservices or publicly reachable services, you may have to pay for it, including if you want to build a system that is secure and can answer the questions our auditors might ask.
Of course, we said that in order to support a heterogeneous technology stack, we are moving standardization up from the level of Java and PostgreSQL to a higher level: the level of REST APIs.
What does that mean? I mentioned in a previous talk that we need a system for describing APIs, a description of how microservices talk to each other. We need order; at some level we have to standardize. We announced that we would be API First: before a team begins to write a service, it must come to the API guild and convince them to accept the API into the set of approved APIs. We wrote REST API guidelines, which are very interesting; they have even been referenced elsewhere. There are API-First libraries that let you use Swagger (OpenAPI) definitions to drive the server. For example, connexion is such a wrapper for Flask in Python, and play-swagger is one for Scala's Play framework; for Clojure there is the same kind of thing, it is very convenient. You first write the Swagger file, describe what you want your microservice to do, and then simply indicate which functions in your system should perform which operations of the API.
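As a rough sketch of this API-first flow in Python, here is what it could look like with connexion; the spec, path and handler names below are invented for illustration, not taken from Zalando's code:

```python
# app.py - a minimal API-first sketch with connexion (Flask under the hood).
# The Swagger/OpenAPI spec drives the routing; each operationId names the
# Python function that handles that operation. All names are illustrative.
import connexion

SPEC = {
    "swagger": "2.0",
    "info": {"title": "articles", "version": "1.0"},
    "paths": {
        "/articles/{article_id}": {
            "get": {
                "operationId": "app.get_article",
                "parameters": [
                    {"name": "article_id", "in": "path",
                     "required": True, "type": "string"}
                ],
                "responses": {"200": {"description": "one article"}},
            }
        }
    },
}

ARTICLES = {"42": {"id": "42", "name": "black sneakers"}}

def get_article(article_id):
    # connexion has already validated the request against the spec
    if article_id in ARTICLES:
        return ARTICLES[article_id], 200
    return {"error": "not found"}, 404

app = connexion.App(__name__)
app.add_api(SPEC)   # a dict or a path to the Swagger file both work here

if __name__ == "__main__":
    app.run(port=8080)
```

You write and agree on the spec first, and only then wire functions to it, which is exactly the order described above.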
But there is a problem with microservices. I want to repeat this phrase several times: microservices are an answer to organizational problems, not a technical answer. I would not recommend microservices to anyone who is small. I would not recommend microservices to anyone who does not have the problem of a motley technology base, who does not need to write one service in Scala and another in Python or Haskell. The number of problems microservices bring is quite high. It is a barrier, and to justify crossing it you need to have experienced quite a lot of pain beforehand, as we did.
One of the biggest problems with microservices: by definition, microservices close off access to the data persistence layer. The databases are hidden inside the microservices.
Thus the classic Extract-Transform-Load process does not work.
Let's take one step back and remember how we work in the classic world. What do we have? In the classic world we have developers, junior and senior, DBAs and Business Intelligence.
How does it work?
In the simplest case, we have the business logic, a database, and an ETL process that pulls our data straight from the database and stuffs it into the Data Warehouse (DWH).
On a larger scale, we have many services, many databases and one process, most likely written by hand, that pulls the data out of these databases and puts it into a special database for the business analysts. That database is very well structured, and the business analysts understand what they are doing.
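A hand-written, pull-based ETL job of that kind might look roughly like this in Python; the connection strings, tables and columns here are invented for illustration:

```python
# A toy pull-based ETL job: read rows straight from a service database and
# load them into the data warehouse. All names and DSNs are illustrative.
import psycopg2

SERVICE_DSN = "dbname=orders_service"
DWH_DSN = "dbname=dwh"

def run_etl():
    with psycopg2.connect(SERVICE_DSN) as src, psycopg2.connect(DWH_DSN) as dwh:
        with src.cursor() as read_cur, dwh.cursor() as write_cur:
            # Extract: pull yesterday's orders directly from the service's base
            read_cur.execute(
                "SELECT id, customer_id, total, created_at "
                "FROM orders WHERE created_at >= now() - interval '1 day'"
            )
            for order_id, customer_id, total, created_at in read_cur:
                # Transform: trivial here; real jobs reshape and clean the data.
                # Load: stuff the row into the warehouse's fact table.
                write_cur.execute(
                    "INSERT INTO fact_orders (order_id, customer_id, total, created_at) "
                    "VALUES (%s, %s, %s, %s) ON CONFLICT (order_id) DO NOTHING",
                    (order_id, customer_id, total, created_at),
                )

if __name__ == "__main__":
    run_etl()
```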
Of course, all of this is not without problems; it is all very hard to automate. And in the world of microservices, everything is different.
When we announced microservices, when we announced Radical Agility, when we announced these great innovations for developers, business analysts were very unhappy.
How do you collect data from a huge number of microservices?
We are not talking about dozens, but about hundreds or even thousands. Then Valentin rides in on a horse and says: we will write everything to a stream, to a queue. Then the architects say: why a queue? Someone will use Kafka, someone will use Rabbit, how will we integrate it all? Our security people said: never, we will not allow it. Our business analysts said: if there is no schema there, we will hang ourselves; we will not be able to understand what is flowing through it, and it will be just a drainpipe rather than a data transport system.
We sat down to confer and decide what to do. Our main goals were: ease of use; no single point of failure, no monster that takes everything down with it when it falls; the system must be secure; it must meet the needs of Business Intelligence and satisfy our data scientists; and ideally it should let other services use the data that flows through the bus.
Very simple: an Event Bus.
From the Event Bus we can feed Business Intelligence or various data-heavy services. DDDM is a favorite concept lately: data-driven decision making. Any manager will be delighted by such words: machine learning and DDDM.
What did we come up with?
Nakadi. You have probably noticed that my surname is rather Georgian. Nakadi in Georgian means a stream, for example a mountain stream.
So we started building such a stream. Let me go over the basic principles we put into it.
We said that we would have a standard HTTP API, RESTful if possible. We would build a centralized (or, where possible, not too centralized) event type registry. We would introduce different event types; at the moment we support two categories: data change events and business events. That is, if an entity changes, we can record a data change event with all the necessary meta-information. If we just need to record that something happened in a business process, that is usually a much simpler case and we can write a simpler event. But the business analysts still demand that we have a structure that can be parsed automatically.
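To make this concrete, here is a sketch of registering a business event type with a JSON schema over Nakadi's HTTP API. It follows the shape of the open-source Nakadi API as I understand it; the URL, the token handling and the event type name are illustrative assumptions:

```python
# Sketch: register a 'business' event type with a JSON schema in Nakadi.
# The Nakadi URL, token and event type name are illustrative assumptions.
import json
import requests

NAKADI_URL = "https://nakadi.example.com"   # hypothetical endpoint
TOKEN = "..."                                # OAuth token, obtained elsewhere

event_type = {
    "name": "order.order_created",           # illustrative name
    "owning_application": "order-service",
    "category": "business",                  # "data" for data change events
    "schema": {
        "type": "json_schema",
        # The schema itself is a JSON Schema document passed as a string.
        "schema": json.dumps({
            "properties": {
                "order_number": {"type": "string"},
                "total": {"type": "number"},
            },
            "required": ["order_number"],
        }),
    },
}

resp = requests.post(
    f"{NAKADI_URL}/event-types",
    json=event_type,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
```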
Having extensive experience with PostgreSQL and with schemas, we knew that nothing would work without support for schema versioning. If we slide down to the level where programmers have to define order_created, then order_created_1, _2, _3, we essentially end up with a naming scheme like Microsoft Windows versions, and it becomes very hard to understand how an entity evolves and how it is versioned. It is also very important that the interface allows data to be streamed, so that you can react as quickly as possible to the arrival of messages and notify everyone who wants to receive them.
We did not want to reinvent the wheel. Our goal was to make the system as minimal as possible and to build on existing systems. Therefore, at the moment we have taken Kafka as the underlying system and PostgreSQL for storing metadata and schemas.
Nakadi Cluster is what we have today; it is available as an open source project. At the moment it validates events against the schema that was registered beforehand, and it can record additional information in the event's metadata field, for example the time of arrival, or, if the client has not generated a unique id for the event, it can put a unique id there.
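And a matching sketch of publishing an event, to show where that metadata lives. Again, the endpoint, token and values are illustrative and follow the open-source Nakadi API as I understand it:

```python
# Sketch: publish a business event to Nakadi. The server validates it against
# the registered schema and can enrich the metadata (e.g. arrival time, or a
# unique id if the client omitted one). URL and token are illustrative.
import datetime
import uuid
import requests

NAKADI_URL = "https://nakadi.example.com"   # hypothetical endpoint
TOKEN = "..."

event = {
    "metadata": {
        "eid": str(uuid.uuid4()),                                   # client-generated id
        "occurred_at": datetime.datetime.utcnow().isoformat() + "Z",
    },
    "order_number": "24873",
    "total": 109.95,
}

resp = requests.post(
    f"{NAKADI_URL}/event-types/order.order_created/events",
    json=[event],                  # events are posted in batches (a JSON array)
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
```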
We also felt that we needed to take control of offsets. Who here knows how Kafka works? Somebody does, good, but not most of you. Kafka is a classic pub/sub system in which the producer writes data sequentially and the server does not keep per-client copies, as classic messaging systems do.
Separate copies of messages are not created for each client; the only thing the client needs is the offset, that is, its position in this endless stream. You can think of Kafka as an endless stream of data in which every line is numbered. If your client wants to read data, it says: read from position X, and Kafka gives it the data starting from position X. Because the data is ordered this way, the server is guaranteed not to need to store much information, unlike classic messaging systems that let you acknowledge individual events from what you have read. The flip side is that you cannot commit just a piece of a read block. But now I am going off topic; I did not want to talk about Kafka, sorry.
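To illustrate the offset model itself (this is plain kafka-python, not Nakadi code; the broker address, topic and offsets are invented):

```python
# Sketch of offset-based reading with kafka-python: the consumer itself decides
# the position X to read from; the broker keeps no per-client message copies.
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")

tp = TopicPartition("order.order_created", 0)   # topic, partition
consumer.assign([tp])

consumer.seek(tp, 1000)           # "read from position X": jump to offset 1000

for msg in consumer:
    print(msg.offset, msg.value)  # each message carries its own offset
    if msg.offset >= 1010:        # stop after a few messages in this sketch
        break
```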
A high-level interface makes reading from Kafka very easy for clients. Clients should not have to exchange information about who reads from which partition or what offset values they have stored; a client just comes and gets what it needs from the system. We took the path of least resistance: Zookeeper is already there for Kafka, and however terrible Zookeeper may be, we already have it and already have to manage it, so we use it to store offsets and additional information. PostgreSQL is used for metadata and for storing schemas.
Now I want to tell you which direction we are moving in.
We are moving very fast. So by the time I get back to Berlin, some of these parts will already be done.
At the moment we have the Nakadi Cluster, and we have the Nakadi UI, which we started writing in Elm to get other people interested. Elm is cool; I love it.
As the next step, we want to be able to manage multiple clusters. We have already seen things go wrong when a new producer shows up and starts writing 10 thousand events per second without warning anyone.
Our cluster does not have time to scale. We want to have different clusters for different data types. We standardized the interface so that there is no hard binding to Kafka.
We can switch from Kafka to Redis, and from Redis to Kinesis. Essentially the idea is that, depending on the needs of the service and the properties of the events it writes, if ordering does not matter to someone, they can use a system that does not support ordering and is more efficient than Kafka. At the moment our interface gives us the ability to abstract this away.
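As an illustration of what such an abstraction boundary can look like, here is a sketch with invented names; this is not the actual Nakadi interface:

```python
# Illustrative sketch of hiding the underlying event store behind one
# interface so the backing system (Kafka, Kinesis, ...) can be swapped.
# Class and method names are invented, not taken from Nakadi's codebase.
from abc import ABC, abstractmethod
from typing import Iterable, List


class EventStore(ABC):
    """Minimal contract the event bus needs from a backing system."""

    @abstractmethod
    def append(self, event_type: str, events: List[dict]) -> None:
        """Durably append a batch of events for an event type."""

    @abstractmethod
    def read(self, event_type: str, position: str) -> Iterable[dict]:
        """Stream events starting at an opaque position (offset, cursor, ...)."""


class KafkaEventStore(EventStore):
    def append(self, event_type, events):
        ...  # produce to the Kafka topic for this event type

    def read(self, event_type, position):
        ...  # seek the consumer to the offset encoded in `position`


class KinesisEventStore(EventStore):
    def append(self, event_type, events):
        ...  # put_records into the Kinesis stream

    def read(self, event_type, position):
        ...  # read shards starting from the sequence number in `position`
```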
The Nakadi Schema Manager needs to be pulled out of the cluster, because it has to be shared. The next step is the idea of schema discovery: a microservice comes up, publishes its swagger file and publishes a list of its event types in the same format as the swagger file. A crawler automatically picks all of this up, which removes the need for developers to push the schema into the message bus separately before deployment.
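A schema-discovery crawler of that kind could be sketched like this; every endpoint, URL and payload convention here is an assumption for illustration:

```python
# Illustrative sketch of a schema-discovery crawler: poll each service's
# published event type list and register anything new with the registry.
# All endpoints and payload shapes here are assumptions.
import requests

SERVICES = [
    "https://order-service.example.com",
    "https://stock-service.example.com",
]
REGISTRY_URL = "https://nakadi.example.com/event-types"   # hypothetical

def crawl_once():
    for base in SERVICES:
        # Hypothetical convention: each service exposes its event type
        # definitions next to its swagger file.
        event_types = requests.get(f"{base}/event-types.json", timeout=5).json()
        for et in event_types:
            # Idempotent registration: duplicates are rejected or ignored.
            resp = requests.post(REGISTRY_URL, json=et, timeout=5)
            if resp.status_code not in (200, 201, 409):
                resp.raise_for_status()

if __name__ == "__main__":
    crawl_once()
```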
And of course a topology manager, so that producers and consumers can somehow be routed to different clusters. They say Kafka works like an elephant; no, not like an elephant, like a locomotive. In our situation this train breaks down all the time. I do not know who built this locomotive, but it turned out that running Kafka in AWS is not so simple.
We wrote a system called Bubuku, a very good name, very Russian.
What does Bubuku do?
I had a big slide showing what Bubuku does, but it turned out too large. You can see it all at the link.
In short, Bubuku aims to do with Kafka what others do not. The basic ideas are automatic repartitioning, automatic scaling and the ability to survive lightning strikes and crazy monkeys that kill instances.
By the way, our system is tested with Chaos Monkey and it holds up very well. I recommend this to everyone: if you write microservices, always think about how your system copes with Chaos Monkey. It is a Netflix tool that randomly kills nodes, disables the network and generally breaks your system.
Whatever system you build, if you do not test it this way, it will not keep working when something breaks.
To conclude this brief overview, I want to say that everything I have told you about we are now developing in open source. Why open source? We even wrote up why Zalando does open source.
When people write open source code, they write not just for the company but partly for themselves. As a result we see that product quality is better and that products are better isolated from our infrastructure: nobody hardcodes zalando.de or credentials into the code and commits them to Git.
We have principles on how to do open source. The question comes up in the company: should we open source this or not? There is the principle of open source first: before starting a project, we think about whether it is worth open sourcing. To understand and answer that question, you need to answer the following questions:
Who needs it?
Do we need this?
Do we want to deal with this as an open source project?
Can we maintain what we are going to keep in this public repository?
There are cases when you should not open source:
If your project contains the domain knowledge that makes your company what it is, then of course it is not open sourced.
This is the last slide, here are the projects that were mentioned today:
If you go to zalando.imtqy.com, there are a lot of PostgreSQL projects and a lot of libraries for both the backend and the frontend; I highly recommend it.