
Do not touch the logs with your hands. Part 2: How we implemented Unified Logfile Analyzer

In the previous article we described the system we built, called ULA (Unified Logfile Analyzer). Its main functions are collecting and aggregating incoming error messages using the shingles algorithm, making decisions on them, and automatically notifying about problems with the test environment. Today we will share our experience of finding and fixing bugs while rolling this system out, and our plans.







At the moment we have connected a little less than half of the planned projects to the system. Why not all of them? Perhaps the employees did not receive enough information, or perhaps they simply did not believe in the product. As it turned out, it was not enough just to tell managers about the product and hold a few presentations. In the projects where developers from the ULA team themselves were involved, adoption went better, and the initial rollout happened without much effort.



We introduced ULA through meetings with management and testers: we talked about the system at presentations and demonstrated its main functions. After several such sessions, connection requests gradually started coming in from ESF autotest developers. Perhaps the launch would have gone better if we had announced the tool in advance, so that users were waiting for the release.


Typical questions we were asked:



- Does your system compete with HP ALM?

Perhaps in the future, in terms of collecting metrics for automated testing.



- Can your system aggregate the logs of the ESF systems themselves?

Not at the moment, but in the future we will implement analysis of the systems' own logs. For now this data is collected and attached to the tests as additional information.



- Why not the ELK stack (Elasticsearch, Logstash, Kibana)?

We need more complex processing logic, decision-making functionality, integration with HP SM and HP ALM, and the ability to query the source data on demand for a specific request; in other words, we do not need a constant stream of data from the system logs.



- And who will use the system?

Here everything is ambiguous. The assumption was that errors would be analyzed by the team of engineers that does mostly manual testing and reviews autotest results. But this is not always the case: in brand-new projects, autotest developers or other engineers are often the ones doing the triage. So it is now important for us to understand the situation in each project clearly and to identify who needs to be trained to work with the system.



Now about the problems that we faced after connecting several ESF projects.



Quality of autotest logs



The main problem that required a change to the basic algorithm is the presence of near-identical stack traces in the logs that the autotests write. A number of tests use TestNG, and on error the developers write the full trace generated by the framework to the log. As a result, up to 80% of the length of an error message looks the same across messages. We had to do something about it, and quite urgently. Cutting off part of the log and not processing it at all would have been completely wrong. So we decided to introduce weighted shingles, i.e. to assign weights to the canonized phrases cleared of "garbage". The classical algorithm has no such notion.



In the future, when enough statistical data has been gathered, we will derive a suitable polynomial for determining the weights. For now, after reviewing several hundred messages, we decided to use a slightly adjusted arctangent function: the first 20 to 30 words of the message carry the most weight, then a gentle decline begins (the start of the stack trace), and the tail of the trace matters least. Later we may need to make the algorithm's parameters depend on the subsystem under test and the framework used.
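To make the idea more concrete, here is a minimal sketch of weighted shingles with an arctangent-shaped positional weight. It is an illustration only: the shingle length of 4 words, the cutoff around word 25 and the weighted-Jaccard similarity are assumptions for the example, not the exact production parameters.

import java.util.*;

// Sketch: similarity of two canonized error messages using weighted shingles.
// Each shingle's weight comes from an arctan curve over word position, so the
// head of the message dominates and the stack-trace tail matters little.
public class WeightedShingles {

    // Positional weight: close to 1.0 for the first ~25 words, smoothly dropping afterwards.
    static double weight(int wordIndex) {
        return 0.5 - Math.atan((wordIndex - 25) / 5.0) / Math.PI; // value in (0, 1)
    }

    // Build shingles of 4 consecutive words; keep the maximum weight seen per shingle.
    static Map<String, Double> shingles(String canonizedMessage) {
        String[] words = canonizedMessage.split("\\s+");
        Map<String, Double> result = new HashMap<>();
        for (int i = 0; i + 4 <= words.length; i++) {
            String shingle = String.join(" ", Arrays.copyOfRange(words, i, i + 4));
            result.merge(shingle, weight(i), Math::max);
        }
        return result;
    }

    // Weighted Jaccard-style similarity: shared weight over union weight.
    static double similarity(String a, String b) {
        Map<String, Double> sa = shingles(a), sb = shingles(b);
        double shared = 0, total = 0;
        for (Map.Entry<String, Double> e : sa.entrySet()) {
            Double w = sb.get(e.getKey());
            if (w != null) shared += Math.min(e.getValue(), w);
        }
        for (double w : sa.values()) total += w;
        for (double w : sb.values()) total += w;
        total -= shared; // union weight
        return total == 0 ? 0 : shared / total;
    }
}

Messages whose similarity exceeds a chosen threshold would then be aggregated into one group.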



Performance



Although load testing was carried out in every sprint during development, it did not save us from a number of performance problems once real projects were connected. We ran into the following:





At times the queue receives up to 200 messages per second and they start to pile up. In the end everything does get processed without critical incidents, but a CPU at 100% affects the operation of the web services. Here is what we have done so far to address the performance problems:





However, the performance issue is not fully resolved, and the team keeps working on it.



Thread synchronization in the DBMS



Messages from the Oracle AQ queue are processed by a procedure registered as a subscriber. The DBMS manages the multithreading, but under heavy load we ran into a problem.



The point is that we need to keep counters of messages arriving in the system (for us, one message is a record of a test step). The counters are grouped by unique launch IDs. This is necessary to compare the number of incoming messages with the expected number and understand whether the launch is complete, to build the test tree and to display the aggregated error table. Such a counter cannot be maintained without some form of thread synchronization. First we reinvented the wheel and made a MUTEX table that was locked for a fraction of a second while the counter value was calculated. Under heavy load we started catching deadlocks. Then we used the DBMS_LOCK package and put a lock around the piece of code that works with the counter. For a long time we could not understand why the counter sometimes showed an incorrect value, but in the end we concluded it was a synchronization problem. For those interested, we recommend reading this article about the pitfalls of locks.
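To picture the approach, here is a rough sketch, in Java over JDBC, of how a per-launch counter update can be serialized with DBMS_LOCK. The LAUNCH_COUNTERS table, its columns and the lock-name prefix are hypothetical names for the example, not our actual schema.

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;

// Sketch: one named lock per launch, so concurrent AQ callbacks cannot
// read-modify-write the same counter row at the same time.
public class LaunchCounter {

    static void incrementCounter(Connection conn, String launchId) throws SQLException {
        String plsql =
            "DECLARE\n" +
            "  l_handle VARCHAR2(128);\n" +
            "  l_status INTEGER;\n" +
            "BEGIN\n" +
            "  -- a separate lock name per launch, so different launches do not block each other\n" +
            "  DBMS_LOCK.ALLOCATE_UNIQUE('ula_counter_' || ?, l_handle);\n" +
            "  l_status := DBMS_LOCK.REQUEST(l_handle, 6 /* exclusive */, 10 /* sec timeout */, TRUE);\n" +
            "  IF l_status NOT IN (0, 4) THEN -- 0 = acquired, 4 = already held\n" +
            "    RAISE_APPLICATION_ERROR(-20001, 'could not acquire counter lock: ' || l_status);\n" +
            "  END IF;\n" +
            "  UPDATE launch_counters\n" +
            "     SET received = received + 1\n" +
            "   WHERE launch_id = ?;\n" +
            "  COMMIT; -- release_on_commit = TRUE frees the lock here\n" +
            "END;";
        try (CallableStatement cs = conn.prepareCall(plsql)) {
            cs.setString(1, launchId);
            cs.setString(2, launchId);
            cs.execute();
        }
    }
}

With release_on_commit set to TRUE the lock is freed by the COMMIT, so the critical section stays as short as the counter update itself.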



Versatility



We position the system as universal: to connect a project, it is enough to write your own parser for its autotest reports. In practice, even for Allure this turned out to be quite difficult, because the same thing can be recorded in a report in different ways and there are no common rules. As a result, we had to make fixes constantly for two weeks, and most likely this is not the end. We even had to dig into Allure's own code, but more on that later.
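To give a feel for what "write your own parser" means here, a plug-in contract could look roughly like this; the interface and class names are invented for illustration and do not reflect the real ULA API.

import java.nio.file.Path;
import java.util.List;

// Hypothetical sketch of a report-parser plug-in contract.
interface ReportParser {
    // True if this parser recognizes the report layout (e.g. Allure XML, TestNG results).
    boolean supports(Path reportDir);

    // Turn a framework-specific report into the analyzer's common test-step records.
    List<TestStepRecord> parse(Path reportDir);
}

// Minimal common record the analyzer would aggregate on (illustrative fields only).
class TestStepRecord {
    String launchId;
    String testClass;
    String testMethod;
    String status;        // passed / failed / canceled
    String errorMessage;  // fed into the shingles aggregation
}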



System limitations and design errors





Allure



The first Allure problem we encountered is the difference between adapters for different frameworks. This is not specific to our autotests but a general issue. The testClass and testMethod labels by which we identify a test were a feature of the TestNG adapter, and other adapters did not provide them by default. Adding the two labels turned out to be easy, since the model (AllureModelUtils) already had these methods:



public static Label createTestClassLabel(String testClass) {
    return createLabel(LabelName.TEST_CLASS, testClass);
}

public static Label createTestMethodLabel(String testMethod) {
    return createLabel(LabelName.TEST_METHOD, testMethod);
}


We decided not to rewrite the parser logic but to create our own listener in which these two labels are added.
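As an illustration, a listener of this kind could derive the two labels from TestNG metadata roughly as follows. The package layout assumes the Allure 1.x commons model, and the hand-off of the labels into the Allure lifecycle depends on the adapter version, so it is only marked with a comment.

import org.testng.ITestContext;
import org.testng.ITestListener;
import org.testng.ITestResult;
import ru.yandex.qatools.allure.model.Label;
import ru.yandex.qatools.allure.utils.AllureModelUtils;

// Sketch of a listener that builds the missing testClass / testMethod labels
// from TestNG metadata. Attaching them to the current Allure test case is
// adapter- and version-specific, so that step is left as a placeholder comment.
public class TestIdentityLabelListener implements ITestListener {

    @Override
    public void onTestStart(ITestResult result) {
        Label classLabel  = AllureModelUtils.createTestClassLabel(
                result.getTestClass().getName());
        Label methodLabel = AllureModelUtils.createTestMethodLabel(
                result.getMethod().getMethodName());

        // ...pass classLabel and methodLabel to the Allure lifecycle here,
        // i.e. add them to the current test case in your adapter.
    }

    // The remaining callbacks are not needed for this sketch.
    @Override public void onTestSuccess(ITestResult result) { }
    @Override public void onTestFailure(ITestResult result) { }
    @Override public void onTestSkipped(ITestResult result) { }
    @Override public void onTestFailedButWithinSuccessPercentage(ITestResult result) { }
    @Override public void onStart(ITestContext context) { }
    @Override public void onFinish(ITestContext context) { }
}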



The second problem we encountered is specific to TestNG: the adapter creates separate tests for the before-methods (e.g. @BeforeMethod) if an error occurred while they were being executed, and the tests themselves go to the canceled status. As a result we were getting duplicate tests in our system.



A fix for this Allure behaviour was flagged in the Allure 2.0 roadmap, but most projects were still using version 1.5 or even lower, and our parser was written primarily for those versions. We could not wait, so once again we went down the path of correcting the listener.



Multi-browser support



For the design we chose React JS and focused on working in Google Chrome. We showed the system to management, started testing it in other browsers, and it turned out that almost nothing worked. In the future we will need to devote more time to cross-browser compatibility. At the moment the web part of the system works in Google Chrome, Mozilla Firefox and the latest versions of MS IE.



Shoemaker without shoes



We got so carried away with other people's logs that we forgot about our own. They existed, of course, but the level of detail turned out to be insufficient. When real operation began and problems started pouring in, we had to spend several days going through all the functionality and adding proper logging to the system itself. Logs are now written for errors in queue processing, in the called procedures and in the system services themselves, and every user action is logged.



Rush



To speed up getting the system into production use, we used plain bash to search for the needed pieces of logs in the file system of the ESA test environment. We wrote a script that walks the directories, unpacks the necessary files, searches for entries belonging to the given session and writes intermediate results into a rather large temporary file. That last step was a mistake: the solution turned out to be a dead end and unacceptable for us. By now we have rewritten almost all of this functionality in Java, and the intermediate results are kept entirely in memory.
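A simplified sketch of what the Java replacement does: walk the log directory, unpack compressed files on the fly and keep only the lines of the requested session in memory. The ".log.gz" suffix and the inline session marker are assumptions for the example.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.GZIPInputStream;

// Sketch: collect all log lines for one session without a temporary file.
public class SessionLogCollector {

    public static List<String> collect(Path logRoot, String sessionId) throws Exception {
        List<Path> logFiles;
        try (Stream<Path> walk = Files.walk(logRoot)) {
            logFiles = walk.filter(p -> p.toString().endsWith(".log.gz"))
                           .collect(Collectors.toList());
        }

        List<String> matched = new ArrayList<>();
        for (Path file : logFiles) {
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(Files.newInputStream(file)), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    if (line.contains(sessionId)) { // session marker assumed to appear in the line
                        matched.add(line);
                    }
                }
            }
        }
        return matched; // intermediate results stay in memory instead of a temp file
    }
}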



Future plans



In the near future we plan:





Despite all the bugs, we are optimistic and believe that this development will help engineers significantly reduce the time spent analyzing results and improve the quality of that analysis. By now we have accumulated a sizable backlog, and implementing it will give us new and interesting experience and make the product better. We will be happy to answer your questions on the topic and to hear about your own practice and cases.

Source: https://habr.com/ru/post/342112/


