Changes to complex software systems seem to take forever, don't they? Even engineers often feel that changes take longer than they should, although we are well aware of how complex the system is!
For customers, the situation is even harder to understand. The problem is made worse by the accidental complexity that accumulates over time when a system is poorly maintained. It can feel like bailing water out of a ship with a thousand holes.
So, sooner or later, the customer will send an email: “Why the hell does it take so long?” Let's not forget that we, as software engineers, have a window into a world that customers often lack. They place a lot of trust in us, but sometimes a seemingly insignificant change takes a lot of time. That is why questions arise.
Do not be offended by this question; take it as an opportunity to show empathy and give people a clearer picture of the system's complexity. At the same time, you can suggest ways to improve the situation. The moment when someone is upset is the best time to offer a solution!
Below is a letter that we have sent, in one form or another, many times over the years. We hope it will help you answer such questions.
Letter
Dear Customer,
I saw your comment on the “Notify before the end of the assignment deadline” card and will be happy to discuss it at our next meeting. For reference, I will summarize my thoughts here; there is no need to reply.
Paraphrasing your note:
Changing the task deadline by one day for the email notification should be a one-line change. How can it take 4-8 hours? What am I missing?
On the one hand, I agree with you. It is enough to change part of the query from
tasks due <= today
to
tasks due <= tomorrow
On the other hand, by reducing it to such a simplified idea, we inadvertently ignore the inherent complexity and silently make a number of engineering decisions. Some of them we should discuss.
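To make the one-line idea concrete, here is a minimal sketch in Python. It is only an illustration: the `Task` type, its `due` field, the `tasks_to_notify` function, and the `lead_days` parameter are hypothetical names, not the actual code in either system.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Task:
    # Hypothetical stand-in for a task record; the real schema is not shown here.
    title: str
    due: date


def tasks_to_notify(tasks, today, lead_days=0):
    """Return tasks whose due date is on or before today + lead_days.

    lead_days=0 corresponds to "tasks due <= today";
    lead_days=1 corresponds to "tasks due <= tomorrow".
    """
    cutoff = today + timedelta(days=lead_days)
    return [task for task in tasks if task.due <= cutoff]


today = date(2024, 3, 14)
tasks = [
    Task("Submit report", date(2024, 3, 14)),  # due today
    Task("Review budget", date(2024, 3, 15)),  # due tomorrow
    Task("Plan offsite", date(2024, 3, 20)),   # due next week
]

current = [t.title for t in tasks_to_notify(tasks, today)]
proposed = [t.title for t in tasks_to_notify(tasks, today, lead_days=1)]
print(current)   # only the task due today
print(proposed)  # the tasks due today and tomorrow
```

In isolation the change really is this small. The rest of this letter is about everything that surrounds it.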
Part 1. Why is this small change bigger than it seems?
This is a simple, small change: one line of code. Spending a whole day on it, or even half a day, seems excessive.
Of course, you cannot just roll the change out to production without running it at least locally or on a test server. You have to make sure that the code executes correctly, and since a query is changing, you need to compare the output and make sure that it looks more or less correct.
Here, the comparison of the output can be minimal, just a small spot check to make sure that the results make sense. This notification goes to internal employees. If the date math is slightly off, we will quickly hear about it from the teams. If this were, say, an email to your customers, it would require a more in-depth check. For this light testing and review, 20-40 minutes is enough, depending on whether something strange or unexpected turns up. Digging into the data can be time-consuming. Shipping changes without a review is simply unprofessional.
Then we add time for normal logistics, such as committing code, merging changes, and deployment: from the start of work to release in production, it takes at least an hour for a competent, professional engineer.
Of course, this assumes that you know exactly which line of code to change. The task workflow mostly lives in the old system, but some parts of the logic live in the new system. Moving logic out of the old system is good, but it means that the task functionality is currently split across two systems.
Since we have worked together for so long, our team knows which process sends the overdue-task email and can point to the line of code in the new system that initiates the process. So we do not need to spend time figuring that out.
But if we look at the task code in the old system, there are at least four different ways to determine when a task is due. In addition, judging by the email templates and behavior, there are at least two more places where custom logic seems to be implemented for this.
And the notification logic is more complicated than you might think. It distinguishes between general and individual tasks, public and private tasks, and recurring tasks, and it includes an additional notification to the manager when a task is overdue, among other things. But we can fairly quickly establish that only 2 of the 6+ definitions of an overdue task are actually used for notifications, and only one of them needs to change to achieve the goal.
Such a review can easily take another half hour or so, maybe less if you have recently worked in this part of the code base. In addition, the hidden complexity means that we may exceed our estimate for manual testing. But let's just add 30 minutes for the extra effort.
Thus, we have reached 1.5 hours to feel confident that the change will work as intended.
Of course, we have not yet checked whether any other processes use the query being changed. We do not want to accidentally break other functions by redefining “due” as the day before the last day to complete a task. We must review the code base from this point of view. In this case, there seem to be no such dependencies, probably because the bulk of the user interface is still in the old system. Therefore, there is no need to worry about changing or testing other processes. At best, this is another 15-30 minutes.
Oh, and since most of the task user interface is still in the old system, we really need to do a quick review of the task functionality in that system and make sure that what users see is still correct. For example, if the user interface highlights tasks whose deadline has arrived, we may need to change that logic to match the notification, or at least go back and ask the customer what they want to do. I have not looked at the task functionality in the old system recently, and I do not remember whether it has any notion of due / overdue. This review adds another 15-30 minutes, perhaps more if the old system also has several definitions of a “due” task.
Thus, we are now in the range of 2–2.5 hours to complete the task with confidence that everything will go well, without unintended side effects or confusion for users.
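For reference, the running total can be tallied with a few lines of Python. This is just the arithmetic from the paragraphs above; the step labels are my own summaries, and “at least an hour” is treated here as a flat 60 minutes.

```python
# Each step is (description, minimum minutes, maximum minutes),
# taken from the estimates discussed above.
steps = [
    # "At least an hour" is treated as exactly 60 minutes for this tally.
    ("Change, light testing, review, commit, merge, deploy", 60, 60),
    ("Reviewing the notification logic, extra manual testing", 30, 30),
    ("Checking other processes that use the query", 15, 30),
    ("Reviewing task functionality in the old system", 15, 30),
]

low = sum(minimum for _, minimum, _ in steps)
high = sum(maximum for _, _, maximum in steps)

print(f"Total: {low / 60:g}-{high / 60:g} hours")  # prints "Total: 2-2.5 hours"
```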
Part 2. How can I reduce this time?
Unfortunately, the only outcome of all this effort is the completed task. This is suboptimal, and quite disappointing. The knowledge the developer gains along the way is personal and ephemeral. If another developer (or we ourselves in 6 months) needs to change this part of the code again, the whole process will have to be repeated.
There are two main tactics to correct the situation:
- Actively clean up the code base to reduce duplication and complexity.
- Write automated tests.
Note: we have discussed documentation before, but in this case it is not the best solution. Documentation is useful for high-level ideas, for example, to explain business logic or frequently repeated processes, such as a list of new partners. But when it comes to code, documentation quickly grows too voluminous and goes out of date as the code changes.
You may have noticed that neither of these tactics is included in our 2–2.5 hours.
For example, maintaining a clean code base means that instead of simply completing the task, we ask questions:
- Why are there so many different ways to identify tasks whose deadline has expired?
- Are they all needed, and do they all work?
- Is it possible to reduce these methods to one or two concepts / methods?
- If the concept is divided between the old and the new systems, can it be consolidated?
And so on.
Answering these questions can be quick: for example, if we run into obviously dead code. Or it can take several hours: for example, if tasks are used in many complex processes. Once we have the answers, it will take even more time to refactor, reducing duplication and confusion, so that there is a single definition of the “due date” concept, or to rename the concepts in the code so that it is clear how they differ and why.
But in the end, this part of the code base will be much simpler, and easier to read and modify.
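As an illustration of where such a cleanup could end up, the sketch below gives each concept one clearly named function. The names `is_overdue`, `is_due_soon`, and `needs_notification` are hypothetical and chosen only for this example; the real code would also have to account for general, private, and recurring tasks.

```python
from datetime import date, timedelta


def is_overdue(due, today):
    """The deadline has already passed."""
    return due < today


def is_due_soon(due, today, lead_days=1):
    """The deadline has not passed yet and falls within the next lead_days days."""
    return today <= due <= today + timedelta(days=lead_days)


def needs_notification(due, today, lead_days=1):
    """One shared rule that both the email and the UI highlighting could call."""
    return is_overdue(due, today) or is_due_soon(due, today, lead_days)


today = date(2024, 3, 14)
print(needs_notification(date(2024, 3, 15), today))  # due tomorrow
print(needs_notification(date(2024, 3, 16), today))  # due in two days
```

With a single definition like this, changing the lead time would touch one place, and the difference between “overdue” and “due soon” would be visible in the names.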
The other tactic we usually use is automated testing. In a sense, automated tests are like documentation that cannot go out of date and is easier to discover. Instead of manually running the code and inspecting the output, we write test code that runs the query and programmatically checks the output. Any developer can run this test code to understand how the system is supposed to work and to make sure that it still works that way.
If you have a system with decent test coverage, changes like this take much less time. You can change the logic, then run the full test suite and make sure that
- the change works correctly;
- the change did not break anything (this is even more valuable than the first point).
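A minimal sketch of what such a test could look like, using Python's standard `unittest` module. The function `due_for_notification` is a hypothetical stand-in for the real notification query, which would run against the database rather than a list of dates.

```python
import unittest
from datetime import date, timedelta


def due_for_notification(due_dates, today, lead_days=1):
    # Hypothetical stand-in for the real notification query.
    cutoff = today + timedelta(days=lead_days)
    return [due for due in due_dates if due <= cutoff]


class DueForNotificationTest(unittest.TestCase):
    TODAY = date(2024, 3, 14)

    def test_includes_tasks_due_tomorrow(self):
        # The new behavior: notify one day before the deadline.
        tomorrow = self.TODAY + timedelta(days=1)
        self.assertEqual(due_for_notification([tomorrow], self.TODAY), [tomorrow])

    def test_still_includes_tasks_due_today_or_earlier(self):
        # The old behavior must not break.
        yesterday = self.TODAY - timedelta(days=1)
        dates = [yesterday, self.TODAY]
        self.assertEqual(due_for_notification(dates, self.TODAY), dates)

    def test_excludes_tasks_due_later(self):
        next_week = self.TODAY + timedelta(days=7)
        self.assertEqual(due_for_notification([next_week], self.TODAY), [])


# Run the tests programmatically so the script does not exit on completion.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(DueForNotificationTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The first test covers the new behavior, and the other two guard the existing behavior, which is the part that manual spot checks tend to miss.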
When we build systems from scratch at Simple Thread, we always include time for writing automated tests in our estimates. This may slow down initial development, but it greatly improves the efficiency of later work and maintenance. You only really appreciate the importance of tests once the system has grown, but by that point it can be very difficult to retrofit tests into the system. Having tests also makes onboarding new employees much easier, and makes changing the system's behavior much faster and safer.
Part 3. Where did we come from? Where are we going?
To date, our estimates for you have rarely included time for cleaning up the code or writing tests. This is partly because writing tests from the start is a minor overhead, while adding tests to an existing code base after the fact is a lot of work, like rebuilding the foundation of a house while people are living in it.
It is also partly because, when we started working with you, we immediately went into emergency mode. We had almost daily problems with synchronizing third-party data, weekly problems with generating reports, constant support requests for small data changes, inadequate monitoring and logging, and so on. The code base was sinking under the weight of technical debt, and we were frantically trying to keep the systems afloat while sealing the holes with tape.
Over time, the systems have become more stable and reliable, and we have automated frequent support requests or provided a self-service UI for them. We still have a lot of technical debt, but we are out of emergency mode. But I do not think that we have ever fully moved from this emergency mentality to a more proactive, mature “plan and execute” mentality.
We try to clean up the code as we go, and we always test thoroughly. But being careful and diligent is not the same as proactively refactoring or building the infrastructure needed for good automated tests.
If we do not start paying down some of the technical debt, we will never significantly improve the situation. It will keep taking months for highly qualified, competent developers to find their way around and make non-trivial changes.
In other words, 4–8 hours for this task is about 2–4 times the bare minimum, but the extra time will significantly reduce the effort for similar changes in the future. If this part of the code base were cleaner and had good automated test coverage, a competent, experienced developer would make the change in an hour or less. And the key point is that a new developer would need only slightly longer.
To spend time this way, we need your consent. This is a deliberate attempt to improve the performance of your system at a fundamental level, not just how users perceive it. I understand that such investments are hard to approve precisely because there is no visible benefit, but we will be happy to sit down with you and prepare some clear figures showing how these investments will pay off in the long run from an engineering point of view.
Thanks,
El