Scaling development: from a startup to hundreds of engineers

Like many other large IT companies, Badoo started out as a startup. In recent years the company has grown from several dozen engineers to several hundred. Nikolai Krapivny was on the front line for most of that journey and made the decisions: what to do, what not to do, and how to cope with the problems that came up. His talk at TeamLead Conf was devoted to this experience and to the picture of the world that formed as a result.

Of course, each company has its own path, but the problems of human communication are much the same everywhere. Someone else's experience helps you think ahead about the problems a growing company will face. Even if these lessons do not apply to you directly, they will suggest which direction to think in.

The story consists of three parts. The first is about communications and how they change as the company grows. The second is about how to try to keep development speed as the number of engineers in a team increases. The third is about why Badoo lives in two offices and how to cope with the communication problems this creates.

Let's get started!

About the speaker: Nikolai Krapivny (@cyberklin) has been working at Badoo for the last eight years; for five of them he has been managing teams and building development processes.

Before diving into the first part, I want to say that this is a story about our path, and it does not claim to be the absolute truth. Each company has its own way, but I am sure that our experience, the values we have formed for ourselves and some of what we have learned will help you as you grow and help you build the right process. Even though your specifics differ and everything works a little differently for you, I hope this will be useful.

Communications

To begin with, let's discuss in theory what happens to communications when a company grows.

Communications are how departments and people interact with each other, and how information flows so that things get done in the company.

Let us consider a hackneyed but nonetheless realistic example: the team of an abstract startup. Several people have gathered; some are closer to the business, some are more technical. Overall, it is a small team building something that may someday become the second Facebook. In this team all work is built on communications. The team is small, and there is no point in introducing any processes. Everything just works: someone talked to someone, they agreed to do something quickly, and they did it.

A process built only on communications, on conversations like “Let's do it”, “Let's do it quickly”, “Let's do it this way”, has certain problems, but such a team certainly has its advantages.

Work happens quickly. The time from an idea to the moment it becomes available to the user is very short. An idea comes up, we discuss with someone how to do it faster, and it is already done.
It is flexible. In this small team nobody does only one specific thing and is unable to join an important task when needed. In principle, everyone does everything, and if something is important to us, everyone puts in the effort to get it done.
Such work is quite effective, precisely because no processes have been built yet. We do not spend extra time on overhead, on processes or on elaborate formal schemes.

These are exactly the values every business wants to see: maximum flexibility in the use of resources, minimal time-to-market and low operating costs.

The company grows - communications "break".

When a company grows, the advantages of a small team, where everything works quickly through interaction and conversation, turn into a problem. The load on communications, that is, the amount of information being passed around, starts to grow, and eventually communications "break". We start to lose more on communications than we gain. We have to talk with too many people; somewhere a misunderstanding creeps in as information passes from person to person; somewhere we simply lose or forget something. And everything we built earlier, which gave us speed, we gradually begin to lose.

If you extrapolate and look at the company's development over a long time interval, it looks like a cycle. The number of people increases, the load on the process increases, and communications begin to break. What used to work stops working, so we are forced to repair something in those places. Often this happens at the boundaries between departments. To fix it, we have to formalize the communication process. This cycle repeats many times: the number of people increases, something starts to work inefficiently, we introduce new processes and formalize them, and we get new headroom for growth until it breaks in another place, and so on. It is like scaling a system for performance: if you increase the load on the system, the weakest element, the slowest part, gives way. We repair and improve it, and a window appears in which the load can be increased again. The same goes for scaling a company.

That was a short theoretical introduction.

Now let's take a practical look at what cycles we went through, what problems we encountered, and how we solved them.

Technical specification

As a first example, consider the task of formalizing the relationship between the business and the engineering team. The technical specification, or PRD as we call it, is a request describing what needs to change in terms of design and functionality. This is a fairly obvious formalization that every company goes through. I think most of you work in companies that have some kind of formal process for handing a task over to development, whether it comes from the product team, from the business or from an external customer.

We went through several stages of making this process more complex. At first we just wrote things down. When the team outgrew the size at which things can be done simply by talking to each other, we began to write everything in tasks. Tasks were formulated as “what needs to be done”. Then the complexity of the product grew, the number of people in the company grew, and we came to the conclusion that it is useful to keep an up-to-date description of how the system currently works in one place. We moved it all to the wiki, and moved the discussion of changes into the wiki comments, so that everything was in one place. The next step was to formalize what a PRD must contain and to add a PRD review process. Now we have a template that records what information a PRD must contain, what should be described and what data should be collected before work starts. For example, the PRD template currently contains the following blocks:

The goal: why we are building this functionality.
The platforms, products and countries in which we plan to launch.
A description of the functionality in use-case format: the main cases plus a pre-written list of “difficult cases” that everyone tends to forget.
Tokens (processed separately by a copywriter).
Communications: whether there will be email or push notifications for this functionality and, if so, which ones.
The release plan, including dependencies on marketing and other projects in the company.
Analytics: how we will evaluate the results and which business metrics we need to add to assess the success of the change.

Thus, in its current form, the interaction between the product team and the technical team is formalized quite heavily, which helps us not to lose any important points when a task is handed over to development.
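
A template like this can also be checked mechanically before a task is handed over. The following is only a minimal sketch of that idea, not Badoo's actual tooling: the `PRD` class, its field names and the `missing_blocks` helper are hypothetical and simply mirror the blocks listed above.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class PRD:
    """One field per block of the PRD template (field names are illustrative)."""
    goal: str = ""
    launch_scope: str = ""      # platforms, products, countries
    use_cases: str = ""         # main cases plus the "difficult cases" list
    tokens: str = ""            # texts handed to the copywriter
    communications: str = ""    # email / push notifications
    release_plan: str = ""      # dependencies on marketing and other projects
    analytics: str = ""         # how results are evaluated, which metrics to add


def missing_blocks(prd: PRD) -> List[str]:
    """Return the names of template blocks that are still empty."""
    return [f.name for f in fields(prd) if not getattr(prd, f.name).strip()]


# Hypothetical draft with only two blocks filled in.
draft = PRD(goal="Example goal", use_cases="Main case: ...")
print(missing_blocks(draft))
```

A review step could refuse to pass the PRD further while `missing_blocks` returns anything.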

Client and server

We grew further; mobile development appeared and became one of the key areas. This brought the next point at which communication "broke": the interface between the client and the server, that is, how the client should interact with the server at the protocol level. At first this was settled in conversations between client and server developers. But the number of teams grew, as did the number of people in them, and the fact that information about the client-server interaction was stored only in developers' heads began to cause problems.

Documentation

The problems we encountered were fairly simple and obvious. The client-server relationship is not only a protocol but also a scheme of interaction over that protocol: which commands to send and when, how the client should request something, how the application starts up. All of this must follow the protocol.

For example, client-side developers are solving a task and believe that the API has a suitable command that can simply be called and everything will be fine. The client is released and creates a problem on the server, because the command turns out to be too heavy for it and requires too many resources. In addition, the iOS and Android teams understand the API a little differently and implement it differently, and because of this we cannot change the API quickly. Thus, we came to the conclusion that the protocol needs to be documented.

A release cannot be rolled back

The peculiarity of mobile platforms is that a release cannot be rolled back. Once the application has been published in the store and the user has installed it, we will most likely have to live with that version for a very long time. A mistake at the protocol design stage, when the interaction between the client and the server is being defined, is expensive. At Badoo we have to support any released application for another year or two, until the number of its users drops below a certain limit.

To solve this problem, we decided to set up a separate MAPI (Mobile API) team, which documents the protocol and serves as the knowledge-sharing point between the client and the server. The team includes people from both client and server development. This mixed team turns product requirements into protocol changes and documents them. Since a mistake at the protocol stage is rather expensive for us, the processes in this team are somewhat more complicated and stricter than in all the others. They use double code review to try to eliminate the possibility of error.

This team quickly became the center of knowledge sharing. Situations often arise where client and server developers cannot agree on how they should interact. For example, iOS can only do it one way, but that way does not suit Android. The new team resolves such disputes and, if necessary, gathers the right people to make the right decision.

In the outline of our process, the Mobile API team is an intermediate link between the moment the requirements are ready and the moment development begins. That is:

a task with the specification (PRD) comes from the product team;
the protocol design team writes the documentation;
development of the client and server parts begins according to the documentation.

With such a process, server and client development can proceed independently, and we often take advantage of this.

The statistics problem

The company continued to grow and develop; there were more people and more projects. Gradually a separate team emerged that deals with data and statistics and helps the product team analyze how users react to changes. As I said, problems appear at the junctions between teams. We had a new team, and after a while this junction also began to work inefficiently.

The point is that analysts need good data to identify patterns and answer tricky product questions. Good data means that all statistics follow a single language: when we talk about statistics and our product, we all need to speak the same language.

Before this, the product manager described the principles of collecting statistics in each specification in words: measure the click rate for this button, measure the conversion for this screen. But then the developer decided which events to track, how to measure them (on the client or on the server), which charts to draw and which breakdowns to add to these events. A developer might build charts, break them down by device type, add gender, or collect statistics by country. This disparate data reached the analytics department, but it was impossible to use it to assess the quality of a product decision accurately. The result was a backflow of tasks: we plan a change, the change is implemented, the product manager requests an analysis, the statistics team requests additional data, the task goes back for rework, the statistics are reworked, and we wait again... This stretched the product cycle and was a problem for us.

The process of collecting and analyzing statistics needs to be formalized.

We decided that the statistics requirements would be recorded in the PRD and that the analysts would own the knowledge of these requirements. When the PRD is handed over to development, the analyst says which statistics are needed, which events to track and which breakdowns to split the data by. If the analyst asks to extend existing statistics or add new ones, we add new functionality or modify the existing one. To support this, we formalized working with data in code: we made a single API that simply does not allow insufficient or invalid data to be sent.
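
To make the idea of an API that rejects incomplete data concrete, here is a minimal sketch. It is an assumption about how such a check could look, not the real Badoo API: the event names, the required breakdowns, `InvalidEvent` and `build_event` are all hypothetical.

```python
from typing import Dict

# Hypothetical registry maintained by analysts:
# event name -> breakdowns that must accompany every event of that type.
REQUIRED_BREAKDOWNS = {
    "button_click": {"platform", "country", "gender"},
    "screen_view": {"platform", "country"},
}


class InvalidEvent(ValueError):
    """Raised instead of sending incomplete or malformed statistics."""


def build_event(name: str, **breakdowns: str) -> Dict[str, str]:
    """Validate an event and return the payload that would be sent."""
    if name not in REQUIRED_BREAKDOWNS:
        raise InvalidEvent(f"unknown event: {name!r}")
    required = REQUIRED_BREAKDOWNS[name]
    missing = required - set(breakdowns)
    if missing:
        raise InvalidEvent(f"{name}: missing breakdowns {sorted(missing)}")
    unexpected = set(breakdowns) - required
    if unexpected:
        raise InvalidEvent(f"{name}: unexpected breakdowns {sorted(unexpected)}")
    empty = sorted(key for key, value in breakdowns.items() if not value)
    if empty:
        raise InvalidEvent(f"{name}: empty values for {empty}")
    return {"event": name, **breakdowns}
```

With this shape, `build_event("screen_view", platform="ios", country="BR")` returns a payload, while omitting `country` raises `InvalidEvent` at the call site, so the gap is found by the developer and not later by the analyst.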

In parallel, on the tooling side, we have MicroStrategy, a fast data visualization tool, and our own A/B testing tool. The analysts own all the knowledge of how to use these tools properly.

Another stage is added to the process diagram: the PRD goes through approval in the analytics department, and only then is it passed on to MAPI and development. This is how it works right now.

Load distribution

The next problem is related to the growth of load and to interaction within a single department. I lead the backend development team for our products, and I will use it as an example to illustrate the problems that arise as the number of people within one team grows.

In a team of up to 15 people, everything is quite simple. We assumed that everyone does everything and distributed tasks mainly on the principle that whoever is free right now takes the task. Why up to 15?

It is believed that one team lead or tech lead should manage a team of up to 7–9 people. This is an empirically established figure for a manageable number of direct reports.

We had a team lead and his deputy, so together we managed 14–15 people. With further growth, some additional division became necessary, and the flow of development tasks needed to be balanced. We defined the main requirement for this process: we build specialization. Each piece of code gets experts, one or two and ideally three, who know this code best and support it. At the same time, there is an orthogonal requirement: to maintain flexibility. For example, if five people support the messenger and too many urgent tasks arrive for it, those tasks should not sit waiting: if the team has free resources elsewhere, they should be brought in to work on other people's tasks. These requirements contradict each other, but we still want to try to satisfy both.

We divided the large team into development groups of 4-9 people. Each group is headed by a lead, who is the direct manager of its members. We introduced the concept of a component. A component is a piece of code that is complete in terms of product functionality. Each component is assigned to a specific group, and within the group each component has one to three people who are experts on it and are responsible for its development and support.

In terms of load sharing, each task has a component.
Technical debt and support tasks go to the “native” group, the one to which the component is assigned.
We try to give new functionality to the “native” group as well, but only if it has the capacity.
To maintain flexibility, we allow situations where one group helps another and works on something unrelated to its own components.
In that case, either a review of the technical design or a code review is carried out by the “native” group.
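
The rules above can be written down as a small routing function. This is only an illustrative sketch under stated assumptions: the component-to-group mapping, the task kinds, the notion of "free slots" and the choice of the helper group with the most free capacity are all hypothetical, not a description of how tasks are actually assigned at Badoo.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical mapping of components to their "native" groups.
COMPONENT_OWNER = {"messenger": "group_a", "registration": "group_b"}


@dataclass(frozen=True)
class Assignment:
    group: str                 # group that does the work
    reviewer: Optional[str]    # "native" group that reviews, if someone else does the work


def assign(component: str, kind: str, free_slots: Dict[str, int]) -> Assignment:
    """Pick a group for a task according to the component rules."""
    native = COMPONENT_OWNER[component]
    # Technical debt and support always stay in the native group.
    if kind in ("tech_debt", "support"):
        return Assignment(native, None)
    # New functionality goes to the native group when it has capacity.
    if free_slots.get(native, 0) > 0:
        return Assignment(native, None)
    # Otherwise another group with free capacity helps, reviewed by the native group.
    helpers = {g: n for g, n in free_slots.items() if g != native and n > 0}
    if not helpers:
        return Assignment(native, None)  # nobody is free: the task waits in the native group
    helper = max(sorted(helpers), key=lambda g: helpers[g])
    return Assignment(helper, reviewer=native)
```

The point of the `reviewer` field is that expertise stays with the native group even when the work itself moves.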

This is how we work now. The team has 30 people, 5 groups and 22 components shared between them. So far we do not see a limit to further growth in this format, and up to a certain scale we will stick to it.

An interesting side effect appears when the number of projects, people and changes in a team grows considerably. There is so much going on that it becomes difficult to understand the specific reason for any given change.

Take, for example, a rise in new user registrations in Brazil. The reason could be a spam bot that registers new accounts and spoils our lives, problems with sessions, a promo campaign, or the launch of a new wave of marketing in Brazil. The graph shows a change, and we want to understand with minimal effort what caused it.

We built ourselves a tool called WTF. It is a single tool that collects events from the various subsystems and parts of production that can somehow influence the metrics. It is integrated into the charting tool, so you can see the changes that fall within a given interval. As a bonus, we try to integrate not only technical events (incidents, configuration changes) but also business events (promo and advertising campaigns).

The interface is simple: a red line on the chart marks an event, such as a configuration change. This tool helps to track down the causes of changes in a project that has grown large.
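
The core of such a tool is a merged event feed that can be filtered by the time interval shown on a chart. The sketch below illustrates only that idea; the `Event` structure, the source names and the sample feed are hypothetical and say nothing about how WTF is really implemented.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class Event:
    at: datetime
    source: str        # e.g. "config", "incident", "marketing" (illustrative)
    description: str


def events_in_window(events: Iterable[Event], start: datetime, end: datetime) -> List[Event]:
    """Return events inside [start, end], oldest first, to overlay on a chart."""
    return sorted((e for e in events if start <= e.at <= end), key=lambda e: e.at)


# Hypothetical feed merged from several subsystems.
feed = [
    Event(datetime(2018, 3, 2, 14, 0), "marketing", "promo campaign started in Brazil"),
    Event(datetime(2018, 3, 1, 9, 30), "config", "registration form config changed"),
    Event(datetime(2018, 2, 20, 12, 0), "incident", "session storage outage"),
]

for event in events_in_window(feed, datetime(2018, 3, 1), datetime(2018, 3, 3)):
    print(event.at.isoformat(), event.source, event.description)
```

Each returned event would become one vertical marker on the metric chart for the same interval.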

Let's sum up the first part of my talk:

As the team grows, communications will break down. They become overloaded and ineffective.
Most often this happens between departments, in our case between server and client development.
Where it breaks, we formalize the process.
New tools will be needed as the number of projects grows.

What worked for us:

Formal interaction between the product and engineering departments is implemented through the PRD.
Interaction with BI is based on the analysts' requirements.
The MAPI team handles the protocol between the client and server parts.
All interaction within the department is organized around components, which are a way to formalize the distribution of tasks.

The development process now involves 200 people. With further growth we may face new challenges. Then in a couple of years there will be a new talk about how we redid everything :)

Speed

We want to keep the speed of making changes to the system as the team grows. At the same time, having run into communication problems, we introduced a number of formal processes and ended up with a multi-stage scheme.

With such a process, time-to-market keeps growing. This is what we look like now.

Our system is like a big ship. It sails very fast and is heavily armed; everything is great until you need to make some very small change. To maneuver and react to the market, a change has to be pumped through our entire scheme.

Then we thought: maybe everything is wrong. Maybe we are growing in the wrong way altogether and need to redo everything. The option of cross-functional teams comes to mind. We have been scaling the system vertically: more work means more people, and we have lost delivery speed. It may be worth switching to a scheme in which our team is a large number of startups. Each startup does its part of the work itself and has effective communications inside, so formal processes are not needed.

The idea of converting functional teams into cross-functional ones in order to speed things up has come up many times over the course of our evolution. We rejected it because of several drawbacks.

Less flexibility with resources. Moving people between cross-functional teams is more difficult, and the response to a change in load or process is slower.
The question of controlling processes in the system. Say there are 10 teams, each with backend developers, frontend developers and analysts. The question arises: will each backend developer start writing in his own language and pull the development stack in his own direction? This also risks reinventing the wheel, with new solutions built for the same tasks. All of this places an additional burden on the administration of the entire system.
This system works only at a certain scale. You need a bus factor greater than one, so you cannot make a team with only one backend developer. There must be at least two of every specialist, and subjectively it seems that more people are needed to do the same number of tasks.

If we think of our system as a queuing system for requests (where the requests are product hypotheses and changes), we can find the answer to the question about speed. The graph shows the saturation curve from queuing theory, with requests per second on the X axis; for our process this is the number of tasks completed. The Y axis shows the processing time of each request.

From the point of view of queuing theory, the system can be optimized either for the number of tasks completed or for the processing time of a task.

The functional team is optimized for the number of tasks completed; the cross-functional team, for delivery time. In a cross-functional team everything happens faster and time-to-market is shorter, but fewer tasks get done. For a task to be done faster, some resources must either be completely free, waiting for the task to arrive, or be busy with something less important that can be postponed. With functional teams we essentially optimize the utilization of development resources, and thanks to this internal optimization we get a large number of completed tasks.
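
The trade-off between utilization and delivery time can be illustrated with the textbook M/M/1 queue, where the mean time a request spends in the system is W = 1 / (μ - λ) for arrival rate λ and service rate μ. Choosing M/M/1 is an assumption made for this sketch; the talk does not say which model is behind its graph, and the rates below are made-up numbers.

```python
def mean_time_in_system(arrival_rate: float, service_rate: float) -> float:
    """Mean time a request spends in an M/M/1 queue (waiting plus service).

    Both rates must use the same unit, e.g. tasks per week.
    """
    if arrival_rate < 0 or service_rate <= 0:
        raise ValueError("rates must be positive")
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable: arrival rate must be below service rate")
    return 1.0 / (service_rate - arrival_rate)


# Hypothetical team that can complete 10 tasks per week.
SERVICE_RATE = 10.0
for utilization in (0.5, 0.9, 0.99):
    arrival_rate = utilization * SERVICE_RATE
    weeks = mean_time_in_system(arrival_rate, SERVICE_RATE)
    print(f"utilization {utilization:.0%}: mean delivery time {weeks:.2f} weeks")
```

In this model a team loaded to 50% delivers a task in about 0.2 weeks on average, at 90% in about 1 week, and at 99% in about 10 weeks: keeping people fully busy maximizes throughput at the cost of delivery time, which is the effect described above.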

Let's return to the problem. We still lack flexibility and speed for fast product projects. We want delivery time to be minimal and do not want to waste time on processes. We want the advantages of both approaches. To achieve this, we split our workflow. For the business and for some specific tasks (marketing, for example), speed is important; for these we use the approach with cross-functional teams. For areas where delivery speed is not so important, we apply the general scheme.

In effect, these are project teams. The product department says what is needed now, what is important for the company, and where we want to add and improve. Most often these are experimental projects where it is not known in advance whether they will take off. There is no need to invest a lot of resources in documentation or in building ideal solutions for them.