Should deploying to production be prohibited at certain times? Or has the
#NoDeployFriday movement become a relic of the days before comprehensive integration tests and continuous deployment?
Your team might be facing the same dilemma. Who is right and who is wrong? Is refusing to deploy on Fridays a reasonable risk-reduction strategy, or a harmful culture that prevents us from building better and more resilient systems?
Ring, ring
I am sure that every engineer who has had the pleasure of being on call has, at some point, lost a weekend to a broken Friday change. I have been there too. A phone call while you are out with your family, or in the middle of the night, telling you that the application has crashed. You log in, check the rapidly growing logs, and it becomes obvious that everything has been brought down by a rare unhandled exception. Disgusting.
The post-mortem reveals that no tests were written for the scenario that caused the crash, apparently because it was not considered likely. After a series of lengthy calls with other engineers to find the best way to roll back the change and fix everything, the system is up and running again. Phew.
On Monday, there is a "five whys" meeting.
"
Let's just stop deploying on Fridays. Then everything will work steadily over the weekend, and next week we will be on the alert after all sorts of releases ."
Everyone nods. If something is not in production by noon on Thursday, it waits until Monday morning. Does this approach help or harm?
As you know, comments on Twitter tend to be rather opinionated. Although a ban on Friday releases seems reasonable, someone will quickly point out that it is just a crutch for a fragile platform, caused by poor testing and deployment processes.
Some even suggest that you simply prefer a more relaxed deployment to the weekend itself:
Other users believe that a possible solution to the problem is the introduction of feature flags (a minimal sketch of the idea follows below).
This user believes that, with the processes and tools available to us today, risky deployments should not be a problem in the first place.
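To make the feature-flag suggestion a bit more concrete, here is a minimal sketch in Python. The flag store, flag names, and code paths are all made up for illustration; in practice you would more likely use a dedicated library or flag service.

```python
# Minimal illustration of a feature flag: the code for a risky change ships
# on Friday, but the new behaviour stays dark until the flag is turned on later.
# The flag store and names here are hypothetical.

RISKY_FEATURES = {
    "new_checkout_flow": False,  # deployed, but switched off until Monday
}

def is_enabled(flag_name: str) -> bool:
    """Look up a flag; unknown flags default to off."""
    return RISKY_FEATURES.get(flag_name, False)

def checkout(cart):
    if is_enabled("new_checkout_flow"):
        return new_checkout(cart)   # the freshly deployed code path
    return legacy_checkout(cart)    # the battle-tested path

def new_checkout(cart):
    ...  # new implementation, dark until the flag flips

def legacy_checkout(cart):
    ...  # existing behaviour
```

The point is that deploying the code and releasing the behaviour become two separate decisions, so the risky part can be enabled gradually, and on a quiet weekday.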
Who makes the decisions?
This whole exchange of views shows that we, as a community of engineers, can strongly disagree with one another. Who would have thought. It probably also shows that the overall picture around #NoDeployFriday has nuances that Twitter does not capture well. Is it really true that we all have to practice continuous deployment, otherwise "we are doing it wrong"?
There is a psychological aspect to this decision. The dislike of Friday releases comes from the fear that a mistake made during the week (through fatigue or haste) will hurt while most employees are off for two days. A Friday commit containing a latent problem can ruin the weekend for a whole heap of people: the on-call engineers, other engineers who will help remotely to resolve the issue, and possibly the infrastructure specialists who will have to repair damaged data. If the failure is serious enough, other employees may be pulled in as well to contact customers and minimize the damage.
Taking an idealist's position, we could assume that in an ideal world with ideal code, ideal test coverage, and ideal QA, no change could ever cause a problem. But we are people, and people make mistakes. There will always be some odd edge case that was not covered during development. That's life. So the #NoDeployFriday movement makes sense, at least in theory. But it is a blunt instrument. I believe changes should be evaluated case by case; the default assumption should be that we deploy every day, Fridays included, while staying able to single out the changes that should wait until Monday.
There are several issues worth discussing here. I have divided them into categories:
- Understanding the "blast radius" of a change.
- The maturity of the deployment process.
- The ability to detect errors automatically.
- How long it takes to fix problems.
Let's take them one by one.
Understanding the "blast radius"
When the online debate about Friday releases flares up yet again, the most important thing is always forgotten: the nature of the changes themselves. No two changes to a code base are alike. Some commits tweak the interface slightly and nothing more; others refactor hundreds of classes without affecting the program's behavior; others change database schemas and make major changes to how data is consumed in real time; some restart a single instance, while others trigger a cascading restart of multiple services.
Looking at the code, engineers should have a good sense of the "blast radius" of their changes. Which parts of the code and of the application will be affected? What could break if the new code fails? Is it just a button click that throws an error, or will all new records be lost? Is the change confined to one isolated service, or do many services and dependencies change at once?
I cannot imagine anyone refusing to ship a change with a small "blast radius" and a simple deployment on any day of the week. But major changes, especially those touching the storage infrastructure, should be handled more carefully, perhaps at a time when fewer users are online. Better still, such large-scale changes could be rolled out in parallel with the existing system, so that they can be tested and evaluated under real load without anyone even noticing.
Decisions here have to be made case by case. Does every engineer understand the "blast radius" of their changes in the production environment, and not just in the development environment? If not, why not? Can documentation, training, and the visibility of how code changes behave in production be improved?
"Radius of defeat" is small? Start on Friday.
"The radius of destruction" large? Wait until Monday.
The maturity of the deployment process
One way to reduce risk is to continuously improve the deployment process. If shipping a fresh version of the application still requires someone to know which script to run and which files to copy where, then it is time to invest in automation. The tooling in this area has come a long way in recent years. We often use Jenkins Pipeline and Concourse, which let you define build, test, and deployment pipelines directly as code.
A fully automated deployment process is an interesting thing. It lets you step back and abstract over everything that should happen from the moment a pull request is opened to the moment the application reaches production. Describing all the steps in code, for example with the tools mentioned above, helps you generalize the step definitions and reuse them across applications. It will also make you notice some of the strange or lazy decisions you once made and have lived with ever since.
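As a toy illustration of the "steps as code" idea, here is a sketch in Python rather than actual Jenkins Pipeline or Concourse syntax; the commands and image names are placeholders, not a real project's configuration.

```python
# Toy "pipeline as code": each stage is an ordinary function, and the
# pipeline is just an ordered list of stages. Real tools (Jenkins Pipeline,
# Concourse) express the same idea in their own formats; the commands below
# are placeholders.
import subprocess

def run(cmd: str) -> None:
    print(f"--> {cmd}")
    subprocess.run(cmd, shell=True, check=True)  # a failing command stops the pipeline

def build():  run("docker build -t example-app:candidate .")
def test():   run("docker run --rm example-app:candidate pytest -q")
def deploy(): run("kubectl set image deployment/example-app app=example-app:candidate")

PIPELINE = [build, test, deploy]

if __name__ == "__main__":
    for stage in PIPELINE:
        stage()  # any failing stage halts the rollout here
```

The point is not these particular commands but that the whole path from merge to production lives in one reviewable, reusable place instead of in someone's head.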
For every engineer who read the previous two paragraphs and responded with "Of course! We've been doing this for years!", I can guarantee there are nine others who pictured their own application infrastructure and winced, realizing how much work it would take to move the system onto a modern deployment pipeline. That means taking advantage of modern tools that not only perform continuous integration but also continuously deliver code to production, with engineers simply pressing a button to ship it (or having it happen automatically, if you are brave enough).
Improving the deployment pipeline requires commitment and dedicated people; it is definitely not a side project. A good approach is to assign a team to improve the internal tooling. If they do not already know about the existing problems (they almost certainly do), they can gather information about the most painful parts of the release process, prioritize it, and work with others on fixes. Slowly but surely the situation will improve: code will ship faster and with fewer problems. More and more people will learn the better approaches and make improvements themselves. As things improve, the approaches will spread across teams, and the next new project will be set up properly from the start, without copying the old bad habits.
Everything from the moment a pull request is merged to the commit going live should be automated to the point where you do not even have to think about it. This not only helps isolate real problems in QA, because the only variable is the changed code, it also makes writing code far more enjoyable. Shipping becomes decentralized, which increases personal autonomy and responsibility. And that, in turn, leads to more thoughtful decisions about when and how to roll out new code.
A reliable deployment pipeline? Roll out on Friday.
Copying scripts by hand? Wait until Monday.
The ability to detect errors
The job is not done once the code is running in production. If something goes wrong, we need to know about it, and ideally we should be told rather than having to go looking. That requires automatically scanning application logs for errors, explicitly tracking key metrics (for example, messages processed per second, or the error rate), and an alerting system that notifies engineers of critical issues and of metrics trending in the wrong direction.
Production always differs from development, and engineers need to observe how specific parts of the system behave. For every change we need to be able to answer: did it make the system faster or slower? Are there more timeouts or fewer? Are we CPU-bound or I/O-bound?
Metrics and error data should feed into the alerting system. Teams should be able to define which signals indicate trouble and have automatic notifications sent. For our teams and the most serious incidents, we use PagerDuty.
Measuring a production system's metrics means engineers can see whether each deployment changed anything, for better or worse. And in the worst cases, the system will automatically notify someone about the problem.
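As a rough sketch of such an automatic check, here is what the logic might look like in Python; fetch_error_rate and page_on_call are hypothetical stand-ins for a real metrics and alerting stack (Prometheus, Datadog, PagerDuty, and so on).

```python
# Sketch of a post-deploy health check: compare the recent error rate against
# a threshold and page someone if it looks bad. The helpers below are
# placeholders for whatever monitoring and alerting systems a team actually uses.

ERROR_RATE_THRESHOLD = 0.02  # alert if more than 2% of requests fail

def fetch_error_rate(window_minutes: int = 10) -> float:
    """Ask the metrics system for the recent error rate (placeholder)."""
    raise NotImplementedError("query your metrics backend here")

def page_on_call(message: str) -> None:
    """Notify the on-call engineer (placeholder for an alerting API call)."""
    raise NotImplementedError("create an incident via your alerting service here")

def check_deploy_health() -> None:
    rate = fetch_error_rate()
    if rate > ERROR_RATE_THRESHOLD:
        page_on_call(f"Error rate {rate:.1%} exceeds {ERROR_RATE_THRESHOLD:.0%} after deploy")
```

Run automatically after every release, a check like this turns "we found out from a customer on Saturday" into "the on-call engineer was paged five minutes after the deploy".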
Good monitoring, alerting, and an on-call rota? Deploy on Friday.
Reading logs manually over ssh? Wait until Monday.
How long it takes to fix problems
Finally, the key criterion is how long it takes to fix a problem. This partly depends on the "blast radius" of the change. Even with a polished deployment pipeline, some changes are hard to fix quickly. Rolling back a change to the data-ingestion system or to the search index schema may require time-consuming reindexing on top of the one-line code fix. Deploying, verifying, fixing, and redeploying a CSS change may take minutes, while a serious change to the storage layer may take days.
For all the work on the deployment pipeline, which at the macro level makes shipping changes more reliable, no two changes are the same, so each needs to be assessed on its own. If something goes wrong, can we fix it quickly?
Fully fixable with a single revert commit? Deploy on Friday.
Hard to untangle if something goes wrong? Wait until Monday.
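Put together, the four questions above amount to a simple checklist. The sketch below is just one way of writing that reasoning down in Python; the field names and the all-or-nothing rule are illustrative, and every team will weigh these factors differently.

```python
# The four questions from this article, folded into a rough checklist.
# The fields and the decision rule are illustrative, not a formula to follow blindly.
from dataclasses import dataclass

@dataclass
class Change:
    small_blast_radius: bool      # limited, well-understood impact?
    automated_pipeline: bool      # one-click (or zero-click) deploy?
    monitored_and_alerted: bool   # will we hear about failures automatically?
    quick_to_fix: bool            # can a single revert commit undo it?

def ok_to_deploy_on_friday(change: Change) -> bool:
    return all([
        change.small_blast_radius,
        change.automated_pipeline,
        change.monitored_and_alerted,
        change.quick_to_fix,
    ])

# Example: a CSS tweak versus a storage schema migration.
css_tweak = Change(True, True, True, True)
schema_migration = Change(False, True, True, False)

print(ok_to_deploy_on_friday(css_tweak))        # True  -> ship it
print(ok_to_deploy_on_friday(schema_migration)) # False -> wait until Monday
```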
Think for yourself, decide for yourself
So where do I stand on #NoDeployFriday? I think it depends on the release. Changes with a small "blast radius" that are easy to roll back can be deployed at any time, on any day. Large changes whose impact needs to be watched closely in production I strongly recommend holding until Monday.
Ultimately, whether to deploy on Fridays is up to you. If you are working with a creaky, fragile system, it is better to avoid Fridays until you have done the work needed to improve the deployment process. Just make sure you actually do that work rather than brushing it off. Refusing to release on Fridays is a normal way to cover for temporary shortcomings in the infrastructure; it is reasonable damage control for the benefit of the business. But it is bad if the rule ends up covering for permanent defects.
If you are not sure what effect a change will have, postpone it until Monday. But think about what you can do next time to understand that effect better, and improve the surrounding infrastructure accordingly. As always in life, every decision has its nuances. These decisions are not black and white, right or wrong: as long as we do our best for the business, the applications, and each other while improving our systems, we are doing fine.
Happy deploying.