"Breaking Bugs" at Sberbank: how to fix a week's quota of bugs in a day
Bug fixing is a tedious but mandatory part of any development work, and not everyone wants to do it. How do you turn it into something exciting? Hold a competition! In this post we describe our 24-hour bug-fixing marathon in detail, from the preparation beforehand to sorting out the last commits after the winners were awarded.
Spreading the idea
Development of our Sberbank Online app scaled up markedly over the past year. Along the way, small bugs began to accumulate that had no visible effect on the key metrics. But we understood that this was a time bomb and that something had to be done about it.
We were inspired by how our colleagues at Avito solve similar problems, and we decided to organize a massive attack on bugs in the Bugathon format, adapted to our development structure, culture and the specifics of our flow.
We had to set things up so that people would want to take part in the Bugathon and prove how good they are on their own, without directives from above. For that, the competition needed a cool atmosphere. We decided to come up with a distinctive style, something recognizable and bug-related. Bugs are, well, bugs. Who kills insects in everyday life? Exterminators, the guys in yellow hazmat suits. Where have they turned up in recent years? In one popular series about a chemistry teacher. With the concept in place, we added activities: a video game tournament, a quiz with prizes, cool individual nominations ... and, of course, lots of delicious food. But the main thing, whatever one may say, was the bug-fixing competition itself. A dashboard with a web interface served as a reminder, showing the teams' progress, their current positions, their points and so on. We discussed everything with the team leads, and they approved our plans.
Android vs iOS: not a fair fight
At first we wanted to pit Sberbank Online's Android developers against their iOS counterparts and play on the platform rivalry. But while organizing the event we realized this was not the best option, because technically the platforms work under unequal conditions: as it happens, our builds and autotests run faster on iOS.
So we changed the format and made mixed teams, each with five Android and five iOS developers. Beforehand, we chose captains from among the proactive developers to help form the teams. We ended up with nine teams. And although we had settled the hardware question from a fair-play standpoint, we still had to make sure no other restrictions would get in the way of our army of bug fixers.
The next quest was choosing the date of the Bugathon. Release dates differ between the platforms, so we picked a date that suited everyone. We tried to get as close as possible to the date when the release candidate was cut.
In addition, a Bugathon puts a heavy load on the infrastructure of both platforms. When teams compete to fix bugs fastest, the number of pull requests shoots up. As late as a month and a half before the Bugathon, there was a risk that our hardware would not cope with the predicted peaks. But we were expecting new hardware at that point, and it arrived just in time. We managed to connect and configure it and to increase the infrastructure capacity of both platforms several times over.
Pipeline: how not to send everything down the drain
Here is how we set it up: right before the start of the Bugathon we branched off our develop branch, and the teams worked in that new branch. During the Bugathon, a stream of pull requests with bug fixes was merged into it. Autotests ran on each one, developers reviewed the pull requests, and testers checked the new builds to confirm the bugs were fixed. And so it went for all 24 hours of the competition.
We also had to distribute the load across testers. We drew up an hourly forecast of the number of pull requests over the 24 hours of the Bugathon, based on the number of participants, server load, side activities and so on. We compared it with the average productivity of the testers and the number of effective working hours of each person supporting the Bugathon. We then scheduled "duty shifts" so that the queues would be as short as possible by Saturday morning. In short, we put real effort into it.
We also took into account that regression testing had to start right after the Bugathon, so that we could assess the quality of the branch as soon as possible and decide whether to merge it into develop. That is an additional load on the testers.
Review specifics
It was very important for us not just to fix bugs but to fix them well. Three procedures verify the code that developers submit in pull requests. For a fix to be accepted, all three must pass:
three experienced developers have reviewed and approved the code;
the code builds cleanly and passes the autotests;
after the build and the merge, the bug no longer reproduces in the build under the described conditions.
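The three conditions can be expressed as a simple acceptance check. This is a minimal sketch, assuming a hypothetical `PullRequest` record; the field names are invented for illustration and do not correspond to any real Jira or CI API.

```python
from dataclasses import dataclass

REQUIRED_APPROVALS = 3  # from the text: three experienced developers


@dataclass
class PullRequest:
    approvals: int        # number of reviewers who approved the code
    build_ok: bool        # the code built successfully
    autotests_ok: bool    # the autotests passed
    bug_reproduces: bool  # tester verdict on the build that contains the fix


def failed_checks(pr):
    """Return the names of the checks that the pull request has not passed."""
    reasons = []
    if pr.approvals < REQUIRED_APPROVALS:
        reasons.append("review")
    if not (pr.build_ok and pr.autotests_ok):
        reasons.append("ci")
    if pr.bug_reproduces:
        reasons.append("verification")
    return reasons


def is_accepted(pr):
    """A fix counts only when all three checks pass."""
    return not failed_checks(pr)


print(failed_checks(PullRequest(approvals=2, build_ok=True, autotests_ok=True, bug_reproduces=True)))
```

Returning the list of failed checks rather than a bare boolean makes it clear why a fix was sent back.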
We were afraid that in competition mode nobody would review anyone else's code, and we could not keep reviews inside the team either. So we decided not to invent anything and to follow the standard flow, as in normal work: arbitrary cross-review, where whoever is free picks up the review.
We also had to watch that reviews did not pile up in a queue. To be safe, we brought senior developers into the review (including those not taking part in the Bugathon itself) and kept reminding participants about the focus on quality. One senior iOS developer, in parallel with fixing bugs for his own team, managed to review 80 pull requests, reading and understanding each one. That really is a lot!
Selecting and scoring bugs
We took low-priority bugs and filtered out the obvious junk by labels and dates. That gave us 490 bugs in total, mostly small and medium ones that we had never gotten to because of more important tasks. These were all sensible trivial and minor issues:
bugs that repeatedly moved from version to version
bugs reported by users
the most recent crashes
regression bugs
bugs that affect UX
All the bugs were divided into three waves by closing priority:
The first wave - about 230 bugs
The second wave - about 150 bugs
The third wave (spare) - about 110 bugs
Defects were scored by business criticality rather than by complexity. The most critical ones were “artificially” and temporarily raised to the “blocker” and “critical” priorities. The higher a bug's priority, the more points it was worth. Complexity was not taken into account: sometimes a blocker was closed in 20 minutes and a trivial bug took 4 hours. A single bug could earn from 1 to 7 points.
We kept each team's score from its closed bugs, valued according to the Bugathon rules. When a team had time left, it took the next defect. Motivating teams through bug value let us get the more critical bugs closed first.
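Scoring by priority comes down to a lookup table and a running total per team. In the sketch below only the 1 to 7 range comes from the text; the exact points per priority, the priority names other than "blocker" and "trivial", and the team names are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical point values: the text only says a bug was worth 1 to 7 points
# and that a higher priority earned more.
POINTS = {"trivial": 1, "minor": 2, "major": 3, "critical": 5, "blocker": 7}


def scoreboard(closed_bugs):
    """closed_bugs: iterable of (team, priority). Returns (team, points) pairs, highest score first."""
    totals = Counter()
    for team, priority in closed_bugs:
        totals[team] += POINTS[priority]
    return totals.most_common()


closed = [
    ("Team A", "blocker"),
    ("Team B", "minor"),
    ("Team A", "trivial"),
    ("Team B", "critical"),
]
print(scoreboard(closed))
```

Because the time spent on a fix never enters the calculation, a blocker closed in 20 minutes is worth more than a trivial bug that took four hours, which matches the rule described above.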
How the bugs were closed
We divided the first wave into 11 groups (with a margin), equal in total points and in the ratio of Android to iOS bugs. The first wave consisted of the "expensive" priority bugs with a raised value. To make them easy to find in Jira, we gave them matching labels. Each group came to about 20 bugs.
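The text does not say how the groups were balanced. One simple way to get groups that are close in total points and in platform mix is a greedy heuristic: take bugs from most to least valuable and put each into the group that has the fewest bugs of that platform so far, breaking ties by the lowest point total. The sketch below illustrates that heuristic with invented bug keys; it is not the procedure the team used and it does not guarantee an optimal split.

```python
from collections import Counter


def split_into_groups(bugs, group_count):
    """bugs: list of (key, platform, points). Returns group_count lists of bugs."""
    groups = [[] for _ in range(group_count)]
    totals = [0] * group_count
    platform_counts = [Counter() for _ in range(group_count)]

    # Place the most valuable bugs first so that later, cheaper bugs even out the totals.
    for bug in sorted(bugs, key=lambda b: -b[2]):
        _, platform, points = bug
        target = min(
            range(group_count),
            key=lambda i: (platform_counts[i][platform], totals[i], i),
        )
        groups[target].append(bug)
        totals[target] += points
        platform_counts[target][platform] += 1
    return groups


# Hypothetical bug keys and point values.
bugs = [
    ("IOS-1", "ios", 7), ("IOS-2", "ios", 5), ("IOS-3", "ios", 3), ("IOS-4", "ios", 1),
    ("AND-1", "android", 7), ("AND-2", "android", 5), ("AND-3", "android", 3), ("AND-4", "android", 1),
]
for group in split_into_groups(bugs, 2):
    print(sum(points for _, _, points in group), [key for key, _, _ in group])
```

Each resulting group could then be given its own Jira label, as described above.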
At the start of the Bugathon we gathered the team captains and drew lots for the labels. Each captain then set the drawn label in a filter and distributed the matching bugs within the team. This way we avoided chaotic bug fixing, where people would simply take whatever looked clearest to them.
For the first four hours, teams earned points only for bugs carrying the label of the group they had drawn, which set a certain rhythm. When that time was up, the bugs still open moved to the second wave, joining the others that it made sense to close during the Bugathon.
At 19:00, all bugs still open moved into the third wave, on top of the bugs already planned for it. As a result, for the evening we still had “quick” bugs that would have been closed in the usual flow anyway: crashes and recent bugs exported literally the day before the Bugathon, as well as bugs with the lowest priority. All three waves went into work. In the end, 286 of the 493 allocated bugs were closed during the Bugathon.
The Bugathon brings people together
The Bugathon headquarters was in our conference hall, where the quizzes and the video game tournament were also held. The teams were not confined to one place and settled wherever it was convenient for them. As a result, the whole bank found out about the Bugathon. One product owner from the fourth floor said: “I was on my way to a meeting on the 14th floor, looking for the right room. Suddenly I realized I had just seen familiar faces. I went back, and there were my developers, working flat out and paying me zero attention. Ha, I thought, you can't hide from your product owner even 10 floors away. Okay, stay put, bug fixing is a good cause.”
One team had only a single Android developer show up for the Bugathon, alongside six strong iOS developers. As an exception, we gave that team an extra batch of iOS bugs.
In addition, seven developers from the regions came to the Bugathon. Some of them met their teammates in person for the first time, having previously seen them only over video calls. It was great to watch these guys jump right into the process.
How the results were evaluated
For almost a hundred developers we had only 15 testers, and at night just four. That was not enough, so testing continued the next day. Since testers were the ones awarding points, we took them out of the teams to rule out bias. In the normal workflow, a tester can call the developer and say: “Listen, man, there’s this problem ...”. At the Bugathon it was strict: testers had to send back anything that did not pass.
This showed us that some developers do not work within the accepted flow. The Bugathon acted as a kind of catalyst that brought every deviation to the surface. Those who follow the flow well had their fixes tested in the first wave and received points. Everyone who did not follow it so well ended up in a queue that was cleared only after the Bugathon. It held 60 bugs.
Incidents
Overall, everything ran normally; the incidents were typical and were resolved as part of routine work. When something broke, some of the senior developers immediately switched from bug fixing to resolving the incident.
There was one funny case. While preparing the dashboard, we listed the possible risks: access to Jira, updates being rolled out, and so on. We notified all administrators that all maintenance work and updates to Jira and the servers had to be suspended for the duration of the Bugathon. We created backup accounts for Jira access. And suddenly, at around 6:00 pm, we realized that the dashboard had stopped collecting data. There were various theories. Maybe we had overlooked some security protocol? The actual reason was unexpected. Our organization is very large, and it is not always possible to get complete information about every planned process. Our dashboard was deployed on a virtual machine on one of the secondary servers. It turned out that on that very day, Friday evening, under a plan nobody on our side knew about, this server was physically unplugged, loaded into a vehicle and sent off to its new permanent home in our new data center. As a result, by Saturday morning we had to collect all the data and calculate the points manually.
Merging branches and other results
In normal operation, the whole branch is run manually through 800+ test cases. Regular regression testing takes the full team of testers two weeks. We could not afford to keep develop frozen for that long. To shorten testing, we picked the main test cases for the app's functionality, about 107 of them. By the end of Monday we had run 80% of them on iOS and 50% on Android without finding a single critical bug. We decided the branches could be merged.
Of the 286 bugs closed during the Bugathon, 182 were actually fixed. The rest were rejects: bugs that were no longer relevant for various reasons (in some places the design or functionality had already changed). These bugs were not critical, but now they will no longer distract us and we can calmly focus on important tasks.
After the Bugathon, many people also asked: how many new bugs did we introduce? Only eight on iOS and seven on Android.
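For a quick sanity check, the figures reported in this post can be put side by side. The snippet below uses only numbers stated in the text; the variable names and the derived ratios are our own illustration.

```python
allocated = 493                       # bugs allocated to the three waves
closed = 286                          # bugs closed during the event
fixed = 182                           # closed bugs that were actually fixed
introduced = {"iOS": 8, "Android": 7}  # new bugs introduced by the fixes

rejected = closed - fixed             # closed as no longer relevant
closed_share = closed / allocated     # share of allocated bugs that were closed
fixed_share = fixed / closed          # share of closed bugs that needed a real fix
introduced_total = sum(introduced.values())

print(f"rejected: {rejected}")
print(f"closed: {closed_share:.0%} of allocated")
print(f"fixed: {fixed_share:.0%} of closed")
print(f"introduced: {introduced_total} new bugs for {fixed} fixes")
```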
It is important to us that developers feel responsible for the product code together with the other team members. This matters in any development, but in distributed development it becomes a prerequisite for success. In our opinion, we managed to raise that very sense of ownership and team spirit. It turned out to be a story with plenty of benefits: in a short time we fixed a pile of bugs, cleared the backlogs, leveled up the teams' skills and had a lot of fun.
Material prepared by the Sberbank Digital Business Platform team