
How to cope with the problems of a project inherited after three other teams

This article does not pretend to be a universal recipe; it describes the problems we ran into, and our solutions to them, in a project that came to us after three other teams.

First, let us briefly describe the essence of the project. Doctors in clinics use special devices to dictate information about a patient and their visit. This dictation is then transcribed into text (a dedicated department is responsible for this: its employees listen to the recordings and type them up), the text is checked, and a template is filled in. The work then moves through a workflow consisting of different stages with different business logic, followed by integration with several external systems. Finally, a letter is printed and sent to the patient. After some time the work is archived (but can be restored if needed).


Part of this workflow can be performed on an iPad, there is a separate part for configuring the system and the workflow, and another part for editing and approving documents in the browser. But the bulk of the work is done on the user's local machine, which sits inside the hospital.
There is also a hard requirement that data must not be lost. Therefore, everything on the user's local machine is stored in a MySQL database, and its data is synchronized with the main server.

The project's domain is healthcare in a Western European country. The software is deployed in more than 10 clinics, each with its own unique quirks. Persuading a clinic to take a new version also requires quite a lot of time: the release is first deployed to the clinic's test server, the clinic then tests it for 3 weeks, and only after that is it deployed to the live server.
The databases also contain confidential patient information, so each clinic runs its own server (and, as practice has shown, even persuading a clinic to buy a new hard drive is a problem).

What did we get?
As you can guess, the customer did not part with the three previous teams because of the high quality of their work. Here is what we received:
  1. About 400,000 lines of C# source code (plus some external libraries); part of the source code was lost.
  2. No documentation, no tests, no test data.
  3. No one on the customer's side who knew the whole system (someone who had been around since the project was created).
  4. Code written in such a way that it could not be covered with unit tests.
  5. A database of approximately 120 tables (MS SQL) with performance problems.
  6. ADO.NET, Dapper, LINQ, and Entity Framework 4 all used side by side for database access.

Part of the application architecture:




Some numbers:
A year and a half ago, about 5,000 jobs per week were processed in total. "Processing a job" means taking it through all the required workflow stages, interacting with all the external services, and so on.

What do we have now (main achievements in 1.5 years)?
The performance of some operations increased 10-12 times, the number of simultaneous users grew 5 times, the number of jobs processed per week reached 30,000, and the customer calmed down.
Below we describe some of the problems we encountered and what we did to overcome them.

Problem 1: how to prove to the customer that everything is bad?

As you can imagine, the customer sees that the program works and that it can process tasks. Proving that “under the hood” everything is held together by hacks, or is running at its limits, is problematic. Which leads to the question: how do we show the customer that the system is improving?

The option we ultimately chose was NDepend, a static code analysis tool, and a very good one. What we particularly liked was that it comes with about 150 standard rules.

Especially useful were the rules for finding dead code: unused classes and methods. In 90% of cases they were right, but there were false positives: they claimed that default constructors were unused and that implicit conversion operators were not needed, when in fact they were.

We were also positively impressed by its speed: analyzing our 40 resulting libraries took 4-6 seconds. It is also easy to integrate with Continuous Integration (although that will surprise only a few people). Another useful thing was the dependency graph that NDepend can build. When we showed it to the customer, it served as a very good justification of the time it took to implement one of the features.

Another interesting NDepend feature is the ability to write your own code analysis rules. We did not personally need it, since the standard rules were enough. At the same time, because the project was inherited, we relaxed the standard rules (so that they passed), made them critical (so that Continuous Integration reports any violation), and then gradually raised the code "quality" threshold.
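For reference, NDepend rules are written in CQLinq, a LINQ-style query language over the code model. A minimal sketch of what such a rule can look like (the threshold of 40 lines is an arbitrary illustration, not one of our actual rules):

// CQLinq rule: runs inside NDepend, not as a standalone C# program.
// Flags methods that have grown beyond a line-count threshold.
warnif count > 0
from m in JustMyCode.Methods
where m.NbLinesOfCode > 40   // illustrative threshold
orderby m.NbLinesOfCode descending
select new { m, m.NbLinesOfCode }

Rules like this can be marked as critical, which is the mechanism mentioned above for having Continuous Integration report violations.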

Another very useful feature is comparing the current analysis against previous results and seeing the difference.
Conclusions on this problem: NDepend has very good capabilities and builds many nice charts. Once the customer started seeing those charts, he became calmer: he could see progress.

Problem 2: data confidentiality.

The clinics work with real patients, and the data about them is confidential. Moreover, it is stored on the clinics' own servers. This data includes audio files with doctors' dictations, official documents, and so on.

For better testing we needed a copy of this database, but because of the confidential data the hospitals were not eager to hand it over.

Therefore, an anonymization program was written that replaces every letter in the documents with the digit '1', replaces all audio content with arrays of zeros, replaces all names with Name 1, Name 2, and so on.
Implementing this program took about 3 days. After that, a copy of the live database was created and anonymized.

What was our mistake? We did not take into account how long it would take (as it turned out, the net anonymization time was about 5 days). The anonymization put additional load on the same physical disk that held the live database, which caused problems. Fortunately, we were monitoring just in case, and when we saw the first signs of a long disk read queue, we stopped the anonymization program.

After that, we added the ability to resume anonymization after an interruption and started running it only at night.
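
Conceptually the utility did something like the following; this is only a sketch in plain ADO.NET, where the connection string, the table and column names, and the IsAnonymized resume flag are illustrative rather than the real schema:

using System.Collections.Generic;
using System.Data.SqlClient;
using System.Text.RegularExpressions;

class Anonymizer
{
    static void Main()
    {
        using (var connection = new SqlConnection("Server=.;Database=ClinicCopy;Integrated Security=true"))
        {
            connection.Open();

            // 1. Read only the documents that still need anonymizing,
            //    so the program can be stopped and resumed later.
            var documents = new List<KeyValuePair<int, string>>();
            using (var select = new SqlCommand("SELECT Id, Body FROM Documents WHERE IsAnonymized = 0", connection))
            using (var reader = select.ExecuteReader())
            {
                while (reader.Read())
                    documents.Add(new KeyValuePair<int, string>(reader.GetInt32(0), reader.GetString(1)));
            }

            // 2. Replace every letter with the digit '1' and mark the row as done.
            foreach (var doc in documents)
            {
                var anonymized = Regex.Replace(doc.Value, @"\p{L}", "1");
                using (var update = new SqlCommand(
                    "UPDATE Documents SET Body = @body, IsAnonymized = 1 WHERE Id = @id", connection))
                {
                    update.Parameters.AddWithValue("@body", anonymized);
                    update.Parameters.AddWithValue("@id", doc.Key);
                    update.ExecuteNonQuery();
                }
            }

            // Audio blobs are overwritten with zero-filled arrays of the same length,
            // and patient names with "Name 1", "Name 2", ... in the same fashion.
        }
    }
}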

Conclusions on this problem: confidential data is a very big deal. As further use showed, the benefits of this solution turned out to be not that great, because different hospitals had different business processes and sometimes quite different data structures. Still, the anonymized database undoubtedly gave short-term advantages and partially reassured the customer: he understood it was a copy of real data, which was good evidence that the new modules would perform at a predictable rate on real data.

Problem 3: code that was not written for testing.

Inherited code that needs to be extended without breaking anything? It would seem easiest to write unit tests and enjoy life. However, the quality of the inherited code was very low, and in practice it turned out not to have been written with testing in mind (singletons, an old version of Entity Framework that is poorly suited to testing). We still had to put tests around it somehow, so we chose the path of integration tests, and we tried to write new features in such a way that they could be covered with unit tests.
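
For new features, the usual trick is to hide the legacy singletons behind narrow interfaces and inject them; a small sketch with invented names (none of these classes come from the real code base):

// Legacy singleton that old code calls directly.
public sealed class LegacyAuditLog
{
    public static readonly LegacyAuditLog Instance = new LegacyAuditLog();
    private LegacyAuditLog() { }
    public void Write(string message) { /* writes to the production database */ }
}

// New code depends on a narrow interface instead of the singleton...
public interface IAuditLog
{
    void Write(string message);
}

// ...with a thin adapter used in production...
public class LegacyAuditLogAdapter : IAuditLog
{
    public void Write(string message)
    {
        LegacyAuditLog.Instance.Write(message);
    }
}

// ...so a new feature can be unit-tested with a fake IAuditLog.
public class WorkCompletionService
{
    private readonly IAuditLog _audit;

    public WorkCompletionService(IAuditLog audit)
    {
        _audit = audit;
    }

    public void Complete(int workId)
    {
        // new business logic goes here
        _audit.Write("Work " + workId + " completed");
    }
}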

To write integration tests, we created an additional data-saving utility and a database-restore function. The programmer brings the program to the state just before the action under test and runs the data-saving utility. Then, at the beginning of the test, he calls the database-restore function, passing the path to the saved data. The saving utility simply walks all the tables in the database and dumps their data to XML. The restore function then dynamically creates a new database, restores the data into it, and modifies the connection strings so that the tests see this new database. To restore the data we generate INSERT commands and execute them through ADO.NET, which keeps us independent of Entity Framework and the like. As a positive consequence, this utility is now easy to port to any other language.
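
A rough sketch of the idea (the names are ours for illustration; the real utility also creates the target database and its schema, orders tables by foreign keys, handles identity columns, and rewrites the connection strings for the tests):

using System.Data;
using System.Data.SqlClient;
using System.Text;

public static class TestDatabase
{
    // Walk all user tables and dump their data to one XML file.
    public static void Save(string connectionString, string xmlPath)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            var snapshot = new DataSet("Snapshot");

            var tableNames = new DataTable();
            new SqlDataAdapter(
                "SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_TYPE = 'BASE TABLE'",
                connection).Fill(tableNames);

            foreach (DataRow row in tableNames.Rows)
            {
                var table = new DataTable((string)row["TABLE_NAME"]);
                new SqlDataAdapter("SELECT * FROM [" + table.TableName + "]", connection).Fill(table);
                snapshot.Tables.Add(table);
            }

            snapshot.WriteXml(xmlPath, XmlWriteMode.WriteSchema);
        }
    }

    // Replay the dump into a freshly created, schema-only test database by generating
    // plain INSERT commands: no dependency on Entity Framework or a specific ORM.
    public static void Restore(string testDbConnectionString, string xmlPath)
    {
        var snapshot = new DataSet();
        snapshot.ReadXml(xmlPath, XmlReadMode.ReadSchema);

        using (var connection = new SqlConnection(testDbConnectionString))
        {
            connection.Open();
            foreach (DataTable table in snapshot.Tables)
            {
                foreach (DataRow row in table.Rows)
                {
                    var columns = new StringBuilder();
                    var values = new StringBuilder();
                    using (var insert = new SqlCommand { Connection = connection })
                    {
                        for (int i = 0; i < table.Columns.Count; i++)
                        {
                            if (i > 0) { columns.Append(", "); values.Append(", "); }
                            columns.Append("[" + table.Columns[i].ColumnName + "]");
                            values.Append("@p" + i);
                            insert.Parameters.AddWithValue("@p" + i, row[i]);
                        }

                        insert.CommandText = "INSERT INTO [" + table.TableName + "] (" +
                                             columns + ") VALUES (" + values + ")";
                        insert.ExecuteNonQuery();
                    }
                }
            }
        }
    }
}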

As practice showed, if the database is prepared properly, restoring it takes about 1 second. And since these were nightly tests on CI, that was an acceptable figure.

Later the testers adopted this utility as well, when they started testing with Ranorex: command-line arguments were added, and they began using it to restore the databases they needed.

Conclusions on this problem:
  1. With legacy code, reinventing a few small wheels is almost inevitable.
  2. As soon as you start thinking about integration tests, think about nightly test runs as well.
  3. Restoring a database from a dump can actually be very convenient.
  4. There is no need to keep the database dumps in the test project itself; it is better to put them in a zip archive on Dropbox and keep only a link to the archive.
  5. A positive side effect of the restore step was that it verified the correctness of database schema changes, which we occasionally have to make.

Problem 4: performance counters on the target machine.

Another question that arose during the year: how do you show the state of an inherited system, and the fact that it is gradually starting to buckle under the load? One of the most effective tools is collecting performance counters at weekly intervals. There are plenty of counters, they produce good charts, and they can be exported to Excel and analyzed. Our only regret is that we did not start doing this from the very beginning, when there were 5,000 jobs a week.

Another advantage of this approach is that the customer is very easy to convince: it is a built-in Windows feature, so launching the statistics collection required almost no persuasion at all.
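
The counters themselves were collected with the standard Windows tools; purely to illustrate what kind of data they provide, here is a small sketch that samples a couple of typical counters from code (the choice of counters is ours, not the project's actual set):

using System;
using System.Diagnostics;
using System.Threading;

class CounterSampler
{
    static void Main()
    {
        // Typical counters for spotting CPU and disk pressure.
        var cpu = new PerformanceCounter("Processor", "% Processor Time", "_Total");
        var diskQueue = new PerformanceCounter("PhysicalDisk", "Avg. Disk Queue Length", "_Total");

        for (int i = 0; i < 10; i++)
        {
            // Rate counters need two samples, so wait between NextValue() calls.
            Thread.Sleep(1000);
            Console.WriteLine("CPU: {0:F1}%  Disk queue: {1:F2}", cpu.NextValue(), diskQueue.NextValue());
        }
    }
}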

Conclusions: we should have been collecting performance counters from the very start of the project. Even now, though, comparing them at one-month intervals is quite interesting.

Problem 5: do not trust RTF-to-HTML converters.

In the inherited code, documents were stored as RTF. The customer asked us to add the ability to edit documents in the web client. The customer had a license for TxTextControl v19, so the editor was implemented with that component. Later, however, it turned out that formatting (which requires converting from RTF to HTML and back) leads to problems: sometimes a font size of 10 becomes 10.5 after conversion, plus many, many other small glitches. The vendor's technical support recommended upgrading to version 20, but a closer look showed that version 20 has other glitches that do not suit us either.

Conclusions on this problem: do not trust RTF-to-HTML converters without first checking them on your own real documents.


Problem 6: recovering the lost iOS part.

In addition to the web part, there is also a separate part for using the application on the iPad. As it turned out, its source code had been lost. After some searching, we tried to restore the source code through decompilation (this can be done with the Developer Bundle from RedGate, which also includes ANTS Profiler). The quality of decompilation is generally very good; within about 2 hours of analysis we were able to figure out how iOS authentication worked on the WCF server and restore it in the latest version of the program.

Conclusions: The RedGate Developer Bundle really helped us out in that situation.

Problem 7: the profiler.

We also ran into one performance problem: after 40 minutes of work on the client, printing delays started to appear. We checked everything we could on the server side and found nothing wrong there.
Fortunately, at that point we decided to try ANTS Profiler. Its big advantage is that it is really easy to use, although it requires administrative rights to install on the local machine. Launching a program under it is literally a matter of 3 clicks. Another nice thing is the fully featured 14-day trial: if you need a one-off run on the target machine, it is the most suitable choice.
After running the client for 40 minutes under the profiler, finding the problem took about 3 minutes: in one place a list of abbreviations kept accumulating, and that list was checked on every typed character, which caused the delays.
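
Reduced to a sketch, the pattern looked something like this (the class and member names are invented, not taken from the real code):

using System.Collections.Generic;

public class AbbreviationChecker
{
    // Bug: entries were re-added over time, and every keystroke triggered a linear
    // scan over the growing list, so typing got slower the longer the client ran.
    private readonly List<string> _abbreviations = new List<string>();

    public void RegisterAbbreviation(string abbreviation)
    {
        _abbreviations.Add(abbreviation);        // duplicates keep accumulating
    }

    public bool IsAbbreviationSlow(string word)
    {
        return _abbreviations.Contains(word);    // O(n) check on every typed character
    }

    // Fix: store each abbreviation once and use a hash lookup, so the per-keystroke
    // cost stays constant no matter how long the client has been running.
    private readonly HashSet<string> _abbreviationSet = new HashSet<string>();

    public void RegisterAbbreviationFixed(string abbreviation)
    {
        _abbreviationSet.Add(abbreviation);      // a set ignores duplicates
    }

    public bool IsAbbreviationFast(string word)
    {
        return _abbreviationSet.Contains(word);  // O(1) lookup
    }
}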

Another advantage is its very visual reports; it can also log SQL queries, disk accesses, and network requests.

A downside: at the highest level of detail, profiling does not work well with COM objects (at least it didn't for us). But after some experimenting we found that method-level detail loses us nothing, and we have been using this fine tool for more than a year now.

Problem 8: database performance and snapshot isolation.

As already mentioned, because we work in healthcare, our update process is rather long. About half a year ago, when the load grew to 20,000 jobs per week, the customer ran into problems: the system could no longer keep up with the task queue, and from the users' point of view jobs stopped moving through the workflow. The underlying cause was that reads were taking locks on the tables, so the update commands against those tables were not executed as quickly as we would have liked.

So we were given the task of optimizing the system without changing the application on the server.
It is logical that if you need to optimize without touching the code, you have to optimize not the application but the database. The necessary indexes had already been created.

After a week of analysis, we decided to try switching the database to snapshot isolation. Unexpectedly, it gave remarkably good results: the queue shrank from 400 jobs to 12 (which was already within the normal range).
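
In MS SQL this essentially comes down to two ALTER DATABASE commands that make readers use row versions instead of shared locks; a minimal sketch (the database name and connection string are illustrative, and in practice this is run from a maintenance script during a quiet window):

using System.Data.SqlClient;

class EnableSnapshotIsolation
{
    static void Main()
    {
        using (var connection = new SqlConnection("Server=.;Database=master;Integrated Security=true"))
        {
            connection.Open();

            // Allow explicit SNAPSHOT transactions.
            Execute(connection, "ALTER DATABASE [WorkflowDb] SET ALLOW_SNAPSHOT_ISOLATION ON");

            // Make ordinary READ COMMITTED reads use row versioning, so long-running
            // SELECTs no longer hold shared locks that delay UPDATEs on the same tables.
            Execute(connection, "ALTER DATABASE [WorkflowDb] SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE");
        }
    }

    static void Execute(SqlConnection connection, string sql)
    {
        using (var command = new SqlCommand(sql, connection))
            command.ExecuteNonQuery();
    }
}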

Conclusions on this problem: besides creating indexes, do not forget about this mechanism that MS SQL provides.

Problem 9: reports.

There was one more performance problem. The system had an admin panel that could build various kinds of reports. These reports were not written optimally, and after a while their generation started to take 40 minutes; they included, for example, assessments of the typists' performance. They also frequently caused table locks (this was before we solved the database performance problem with snapshot isolation). According to the business logic, the reports showed data for the previous day: at the start of the working day an employee launched the generation from the admin panel, a file then appeared on a network drive, and everyone looked at it. Meanwhile, because of the table locks, update and delete operations in the database would periodically stall.

The problem was solved very simply: we created a console utility, scheduled it with the Windows Task Scheduler, and removed report generation from the admin panel. This way we kept the business requirement (the report is ready by the start of the working day) and avoided the locking problems.
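
The shape of that change, as a sketch (the program name, output path, report query, and schedule below are illustrative, not the real ones):

using System;
using System.IO;

// Console report runner, registered once with the built-in scheduler, e.g.:
//   schtasks /Create /TN "NightlyReports" /TR "C:\Tools\ReportRunner.exe" /SC DAILY /ST 05:00
// so the file is already on the network share before the working day starts.
class ReportRunner
{
    static void Main()
    {
        var reportDate = DateTime.Today.AddDays(-1);   // the reports cover the previous day
        var outputPath = @"\\fileserver\reports\typing-" + reportDate.ToString("yyyy-MM-dd") + ".csv";

        var report = BuildTypingPerformanceReport(reportDate);   // the old admin-panel logic moved here
        File.WriteAllText(outputPath, report);
    }

    static string BuildTypingPerformanceReport(DateTime date)
    {
        // Placeholder: in the real utility this runs the same SQL the admin panel
        // used to run, just outside working hours.
        return "typist;documents;lines" + Environment.NewLine;
    }
}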

Conclusions on this problem: it paid off to analyze how often the reports are generated and by what time they have to be ready.

We have listed some of the problems we encountered while working on an inherited project. Were our solutions perfect? Hardly.

But they allowed us to keep the project alive, develop it and make it sustainable, and the customer stayed with us.

Source: https://habr.com/ru/post/272417/

