In 2015, 6,488 new software vulnerabilities were registered in the U.S. National Vulnerability Database (NVD), bringing the total for the period 1988-2016 to 74,885. Static analysis tools check program source code for defects, including potential security vulnerabilities, and issue diagnostic messages (warnings) that indicate the location of the suspected defect, its nature, and usually some additional context. Users then have to judge whether each warning is accurate. The effort required to review every warning and fix every confirmed defect by hand is often far beyond a project's budget and schedule. Users therefore need tools that sort warnings by importance and thereby determine the order in which they are reviewed. This article describes our work on this problem: we build classification models that help analysts and programmers prioritize warnings and decide the order in which the corresponding defects should be fixed.
Static analyzer warnings: the problem and our task
Static analysis tools examine applications without executing them, unlike dynamic analysis tools. Static analysis usually checks the source files of a program, although binaries can also be analyzed. The output of such tools is a set of warnings that, at a minimum, contain the location of the suspected defect in the source code (for example, the file path and line number) and a textual description of the problem. Many analyzers also provide additional context, such as the execution paths through the code in question and the variable values that may trigger the diagnostic. For errors that span several lines of code, some analyzers report both the starting and ending lines.
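To make the input data concrete, here is a minimal sketch, with hypothetical field names, of the record a single warning reduces to once it is parsed out of an analyzer report:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Warning:
    """One diagnostic message parsed from a static analyzer report.

    Field names are illustrative; real tools differ in what they emit.
    """
    tool: str                 # analyzer that produced the warning
    checker: str              # tool-specific diagnostic identifier
    file_path: str            # location of the suspected defect
    line: int                 # starting line
    end_line: Optional[int]   # set when the defect spans several lines
    message: str              # textual description of the problem
    cert_rule: Optional[str]  # mapped SEI CERT rule, e.g. "INT33-C", if known

example = Warning(
    tool="analyzer_a", checker="DIV_BY_ZERO",
    file_path="src/util/math.c", line=42, end_line=None,
    message="possible division by zero", cert_rule="INT33-C",
)
```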
Some of the warnings are false positives (false alarms): warnings issued for correct code as a result of the trade-off between precision and analysis speed that is inherent to static analysis. In particular, an analyzer may be forced to produce imprecise results in order to finish within a reasonable time and with an acceptable amount of computing resources. Some static analyzers, for example tools built on automated theorem provers such as Z3, favor precision at the expense of performance. Besides false positives, static analyzers can also produce false negatives (no warning is issued for erroneous code). In some cases this happens simply because the tool cannot detect certain kinds of errors.
Acceptable trade-offs between the speed and precision of analysis have been worked out in discussions among developers of static analyzers (both closed- and open-source), which converged on how many warnings a tool should issue to reveal real errors without producing too many false positives. Al Bessey of Coverity wrote a noteworthy article about how difficult those trade-offs are: A Few Billion Lines of Code Later: Using Static Analysis to Find Bugs in the Real World.
An earlier study found little overlap in the types of defects detected by the most popular tools. On one hand, developing a static analyzer from scratch is laborious, and vendors simply cannot include every kind of analysis at once. On the other hand, diagnostics for some error types require a great deal of memory, analysis time, or disk space, and such checks are left out of a tool if few users demand them.
By checking a code base with several analyzers at once, users can uncover a wider range of problems and obtain a deeper analysis. However, this approach also means accumulating far more warnings to review, including false positives.
According to empirical data from Kremenek et al., even for analyzers that detect software errors effectively, false positives account for 30% or more of the warnings they issue. Our task, therefore, is to automate as much as possible the process of establishing whether a warning is true or false. We also determine a confidence level for the alerts that still require human review, to make it easier to sort them by priority.
Our approach
Studies by other authors on combining and evaluating warnings from different tools ([Meng 2008], [Kong 2007], [Kremenek 2004a], and [Plakosh 2014]) rely on statistical methods that do not capture the potentially complex correlations between factors (they use fewer classification features and/or simpler mathematical models), or they do not treat warnings from several tools in detail. At the same time, there are many works on classifying the warnings of a single tool and prioritizing them based on the same kinds of characteristics that our classification models take into account; examples include [Heckman 2011], [Ruthruff 2008], [Kremenek 2004b], and [Heckman 2007].
The authors of previous studies were able to estimate the reliability of warnings from individual analyzers quite accurately and, based on that indicator and several other factors, rank them by priority. For example, the authors of Predicting Accurate and Actionable Static Analysis Warnings: An Experimental Approach developed models that correctly identified false alarms among the warnings of the FindBugs static analyzer in more than 85% of cases. Warnings awaiting assessment were prioritized using dynamically updated data on whether confirmed errors of a given type had been fixed in the past. Earlier work on separating false alarms from real defects has used methods such as collecting contextual information about the code, selecting warning types, data fusion, machine learning, and mathematical and statistical models. All of these methods and mechanisms are used in our study. In addition, we use dynamic defect detection, graph theory, and model checking. Building on the work of other researchers, we combined many classification features describing both the warnings and the code itself.
Our approach also builds on the experience of the Carnegie Mellon Software Engineering Institute (SEI) study Improving the Automated Detection and Analysis of Secure Coding Violations [Plakosh 2014], which produced three binary logistic-regression models for classifying warnings (the warnings used were those with corresponding rules in the SEI CERT Secure Coding Standards); these models take into account which analyzers flagged a given diagnostic. In other words, for three rules from the SEI CERT Secure Coding Standards the authors developed classifiers, which were then trained on a set of already-audited warnings from several tools. The classifiers were then tested on a different set of warnings, each classifier judged them as true or false, and the accuracy of the predictions was compared with the actual manual evaluations. That work was the first to propose advanced statistical analysis of a set of warnings from different static analyzers in order to predict as accurately as possible whether a warning is true or false.
In our study, we extend the SEI's work by evaluating additional classification methods and using a much larger number of classification features. The data set we analyze is also larger and more heterogeneous. We developed per-rule classifiers as well as an overall classifier that uses the SEI CERT rule identifier (its format comprises a three-letter prefix denoting the rule type, a number, a hyphen, and the programming language, for example "INT33-C") as one more classification feature. We continue to improve our classification models by trying different classifier parameters, adding new data to the warning sets on which the models are trained and tested, and experimenting with the input data sets used to develop the classifiers. Once a classifier is ready, we apply it to the corresponding warnings in the test set to evaluate the prediction accuracy of our models. The novelty of our approach is that we use several analyzers, take more features into account, and apply a number of advanced classification methods, selecting the best-performing ones.
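For illustration, splitting such an identifier into components that can serve as classification features might look like this (the function and field names are ours, not from the study):

```python
import re

# SEI CERT rule identifiers look like "INT33-C": a three-letter prefix for the
# rule category, a number, a hyphen, and the programming language.
RULE_ID = re.compile(r"^(?P<prefix>[A-Z]{3})(?P<number>\d+)-(?P<language>\w+)$")

def rule_id_features(rule_id: str) -> dict:
    """Turn a CERT rule identifier into categorical features for a classifier."""
    m = RULE_ID.match(rule_id)
    if m is None:
        return {"rule_prefix": "UNKNOWN", "rule_number": -1, "rule_language": "UNKNOWN"}
    return {
        "rule_prefix": m.group("prefix"),        # e.g. "INT" (integer rules)
        "rule_number": int(m.group("number")),   # e.g. 33
        "rule_language": m.group("language"),    # e.g. "C"
    }

print(rule_id_features("INT33-C"))
# {'rule_prefix': 'INT', 'rule_number': 33, 'rule_language': 'C'}
```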
The team of SEI researchers helping me with this project includes David Svoboda, Will Snavely, Robert Stoddard, David Zubrow, Jennifer Burns, Guillermo Marsé-Santurio, Eli Kanal, Christine Beck, and Richard Jin. Our team works with Claire Le Goues, a faculty member at Carnegie Mellon's School of Computer Science (CMU SCS), who serves as a consultant. Her expertise is extremely valuable to the project: she conducts research in software engineering and programming languages, specializing in how software systems are built, debugged, and assured. Our study also responds to the Department of Defense's stated need for technology for rapid and automated analysis and verification of applications. In addition, it supports one of the two goals of the SEI strategic plan: ensuring the security of software-reliant systems throughout their life cycle.
Our approach will allow analysts and programmers to sort warnings by importance by automating the following processes (a minimal sketch of the resulting triage step follows the list):
- Determining the confidence level of a warning (that is, the probability that the warning is true or false).
- Partitioning warnings into three categories: expected true positives (e-TP), expected false positives (e-FP), and indeterminate warnings (I). Warnings in the first group (e-TP) are sent straight to the team that fixes the code, bypassing manual review.
- Sorting indeterminate warnings by their confidence level, possibly taking into account additional factors such as the risks and costs associated with handling each warning.
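A minimal sketch of that triage step, assuming a trained model that exposes a hypothetical predict_proba interface returning the probability that a warning is true (the threshold values are illustrative, not the ones used in the study):

```python
def triage(warnings, model, tp_threshold=0.95, fp_threshold=0.05):
    """Split warnings into e-TP, e-FP, and indeterminate (I) buckets.

    `model.predict_proba` is assumed to return the probability that a warning
    is a true positive; the thresholds here are illustrative only.
    """
    e_tp, e_fp, indeterminate = [], [], []
    for w in warnings:
        p_true = model.predict_proba(w.features)
        if p_true >= tp_threshold:
            e_tp.append(w)                     # send straight to the bug-fixing team
        elif p_true <= fp_threshold:
            e_fp.append(w)                     # ignore as an expected false positive
        else:
            indeterminate.append((p_true, w))  # keep for manual review
    # Review the indeterminate warnings in order of decreasing confidence.
    indeterminate.sort(key=lambda pair: pair[0], reverse=True)
    return e_tp, e_fp, [w for _, w in indeterminate]
```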
When developing a classifier for a rule from the SEI CERT Secure Coding Standards, we use historical data on all audited warnings related to that rule, as well as new data collected in this project. Archived and new data are likewise combined when building the classifier for the entire set of warnings.
Based on the experience of the studies mentioned above (both inside and outside the SEI), we selected the following classification features for inclusion in our models (the list is incomplete; a sketch of a feature vector built from it follows the list):
- warning depth (the depth of the suspected defect within the file)
- number of significant lines of code (SLOC) in the function/method, file, and program
- total number of lines of code (LOC) in the function/method, file, and program
- cyclomatic complexity
- coupling (of program modules)
- cohesion (of the elements within a module)
- programming language
- matching warnings (same line of code, file, and rule) issued by the other analyzers
- code churn (rate of change of the code)
- number of warnings (per file, function, etc.)
- number of tokens in the function/method
- number of functions/methods in the file
- number of parameters of the function/method
- average number of tokens
- average SLOC
- overlapping file paths
- class name (where applicable)
- method/function name
- package name (where applicable)
- many other tool-specific indicators that may vary from warning to warning
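For illustration only, a feature vector assembled from this list for one warning might look like the following (the names and values are hypothetical):

```python
# One row of the training data: features describing a warning and the code
# around it, plus the analyst's verdict used as the training label.
feature_vector = {
    "warning_depth": 0.37,            # relative position of the defect in the file
    "sloc_function": 48, "sloc_file": 1210, "sloc_program": 240_000,
    "loc_function": 61, "loc_file": 1544,
    "cyclomatic_complexity": 9,
    "coupling": 4, "cohesion": 0.8,
    "language": "C",
    "num_tools_agreeing": 2,          # other analyzers flagging the same file/line/rule
    "code_churn": 13,                 # recent changes touching this code
    "warnings_in_file": 27, "warnings_in_function": 3,
    "tokens_in_function": 312, "functions_in_file": 18, "parameter_count": 4,
    "avg_tokens_per_function": 201.5, "avg_sloc_per_function": 35.2,
    "path_component": "src/net",      # partially overlapping file paths
    "cert_rule": "INT33-C",
    "label_true_positive": True,      # analyst's verdict (training label)
}
```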
The archived data on audited warnings, together with the corresponding features from the list above, are processed using four classification techniques.
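The four techniques are not named in this excerpt, so the sketch below only illustrates the general shape of the training-and-comparison step, using two common scikit-learn classifiers as stand-ins and hypothetical file and column names:

```python
import csv

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def parse(row):
    """Coerce CSV strings to numbers where possible; keep the rest categorical."""
    out = {}
    for key, value in row.items():
        try:
            out[key] = float(value)
        except ValueError:
            out[key] = value
    return out

# Load the flattened .csv of audited warnings (hypothetical file and label column).
with open("audited_warnings.csv", newline="") as f:
    rows = [dict(r) for r in csv.DictReader(f)]
y = [r.pop("label_true_positive") == "True" for r in rows]
X = DictVectorizer(sparse=False).fit_transform([parse(r) for r in rows])

# Compare candidate classification techniques by cross-validated accuracy.
for name, clf in [("logistic regression", LogisticRegression(max_iter=1000)),
                  ("random forest", RandomForestClassifier(n_estimators=200))]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```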
One of the data sets used in our study includes the results of auditing warnings for 20 source code bases with a total size of 19,237 KLOC (thousands of lines of code); it contains 3,147 warnings confirmed as true and 11,772 confirmed as false alarms.
These code bases were audited with the Source Code Analysis Laboratory (SCALe), a CERT software platform that combines a variety of commercial, open-source, and experimental analysis tools. Because SCALe uses several static analyzers, it detects more defects than any single tool; the price of this approach is a large volume of output and, accordingly, a large amount of effort to process it. In total, more than 16 million lines of code have been analyzed with SCALe, including code bases from the Department of Defense, energy delivery systems, medical devices, and more.
Using the SCALe GUI, the user loads a set of static analyzer reports and the corresponding source files into the tool. The alerts are converted to a common format and stored in a SCALe database. Each warning is accompanied by additional information, such as the corresponding rule from the CERT Secure Coding Standards. SCALe also stores the associations between these rules and the warnings of each connected analyzer, so several warnings from different tools can map to the same CERT rule. These warning-to-rule associations are particularly important for our task of classifying warnings at the level of individual rules.
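The real SCALe schema is not reproduced here; the following sketch merely illustrates the warning-to-rule mapping described above, with hypothetical table and column names:

```python
import sqlite3

# Illustrative only: a tiny schema capturing the idea that several warnings
# from different tools can map to the same CERT rule.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE warnings (
    id        INTEGER PRIMARY KEY,
    tool      TEXT NOT NULL,
    checker   TEXT NOT NULL,
    file_path TEXT NOT NULL,
    line      INTEGER NOT NULL,
    message   TEXT,
    verdict   TEXT          -- 'true', 'false', or NULL if not yet audited
);
CREATE TABLE rule_mappings (
    warning_id INTEGER REFERENCES warnings(id),
    cert_rule  TEXT NOT NULL  -- e.g. 'INT33-C'
);
""")

# Two tools flag the same location; both warnings map to the same CERT rule.
conn.executemany(
    "INSERT INTO warnings (tool, checker, file_path, line, message) VALUES (?,?,?,?,?)",
    [("analyzer_a", "DIV_BY_ZERO", "src/util/math.c", 42, "possible division by zero"),
     ("analyzer_b", "zero-div",    "src/util/math.c", 42, "divisor may be zero")])
conn.executemany("INSERT INTO rule_mappings VALUES (?, ?)", [(1, "INT33-C"), (2, "INT33-C")])

for row in conn.execute("""
    SELECT r.cert_rule, COUNT(*) FROM warnings w
    JOIN rule_mappings r ON r.warning_id = w.id
    GROUP BY r.cert_rule"""):
    print(row)   # ('INT33-C', 2)
```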
The application can also be used to audit the alerts: a browser-based interface lets the user filter warnings, sort them by priority, view the relevant section of source code, and mark warnings as true or false. The database is updated dynamically as changes are made.
We extended SCALe for this project so that it now collects additional information for each warning and performs a number of auxiliary tasks, for example anonymizing the data. The additional information is pulled from several sources. Source code metrics (such as cyclomatic complexity and the number of significant lines of code) are computed with a modified version of the Lizard tool. Fields for additional parameters are extracted from the analyzer reports.
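As a rough illustration of the metrics step, here is how the stock Lizard library can be queried for per-function metrics (our modified version differs; the helper function below is ours):

```python
import lizard  # pip install lizard

def file_metrics(path):
    """Collect per-file and per-function metrics used as classification features."""
    info = lizard.analyze_file(path)
    functions = info.function_list
    return {
        "file_nloc": info.nloc,                 # significant lines in the file
        "functions_in_file": len(functions),
        "avg_tokens_per_function": (
            sum(f.token_count for f in functions) / len(functions) if functions else 0),
        "per_function": [
            {
                "name": f.name,
                "cyclomatic_complexity": f.cyclomatic_complexity,
                "nloc": f.nloc,
                "token_count": f.token_count,
                "parameter_count": len(f.parameters),
                "start_line": f.start_line,
            }
            for f in functions
        ],
    }

print(file_metrics("src/util/math.c"))
```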
In addition, we developed a script that merges and analyzes warnings, preparing the data for statistical analysis tools. The script converts the multi-table database of our modified SCALe into a flat .csv file with comma-separated records, a convenient format for the tools that perform the classification. The script also fuses warnings that match on [rule, line number, file]. Finally, it performs additional analysis and adds information such as the number of warnings per file, the number of warnings per function, and the nesting depth of the file within the project, and it splits file paths so that partially matching paths can be used as classification features.
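A minimal sketch of that fusion step, using only the standard library and hypothetical field names:

```python
import csv
from collections import defaultdict

def fuse_warnings(rows):
    """Fuse warnings that match on (CERT rule, file path, line number).

    `rows` are dicts such as those exported from the SCALe database; the field
    names here are illustrative.
    """
    fused = defaultdict(lambda: {"tools": set()})
    warnings_per_file = defaultdict(int)
    for r in rows:
        key = (r["cert_rule"], r["file_path"], int(r["line"]))
        fused[key]["tools"].add(r["tool"])
        warnings_per_file[r["file_path"]] += 1

    for (rule, path, line), data in fused.items():
        parts = path.split("/")
        yield {
            "cert_rule": rule, "file_path": path, "line": line,
            "num_tools_agreeing": len(data["tools"]),
            "warnings_in_file": warnings_per_file[path],
            "path_depth": len(parts) - 1,                    # nesting depth of the file
            "top_dir": parts[0] if len(parts) > 1 else "",   # partial-path feature
        }

def write_csv(fused_rows, out_path):
    """Write the fused records to the flat .csv consumed by the classifiers."""
    fused_rows = list(fused_rows)
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(fused_rows[0]))
        writer.writeheader()
        writer.writerows(fused_rows)
```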
Our method of developing classifiers can easily be extended to other standards and to other platforms that store audit data. For example, the CERT coding rules could be replaced by other similar standards, and SCALe by another platform. The SCALe associations between warnings and rules could be replaced with associations to other rules or numbered weakness lists, such as the Common Weakness Enumeration, the OWASP Application Security Verification Standard Project, or the MISRA C software development standard. Likewise, alert audit data from other multi-analyzer platforms can be converted into the formats supported by our classifiers and the tools used to develop them.
Testing our approach with partners from the Department of Defense
In addition to the 20 code bases audited by CERT, we use data from three Department of Defense (DoD) organizations (we will describe this collaboration in a separate article), two of which have stated the need to audit the security of code bases exceeding 100 MSLOC (million significant lines of code). Extrapolating from the archived data of previous CERT audits (a rate of 3.31 warnings per 1,000 lines of code), we expect roughly 662,000 warnings across the code bases of these two organizations. Our goal is to automatically determine the truth or falsity of flagged defects with 95% accuracy. If we succeed, our method (and tools based on it) will significantly reduce the effort needed to evaluate analysis results and prioritize the defects found.
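The arithmetic behind that estimate, written out (assuming the two code bases total roughly 200 MSLOC, consistent with the figures used later in the text):

```python
warnings_per_kloc = 3.31          # rate observed in archived CERT audits
ksloc_to_audit = 2 * 100_000      # two code bases of ~100 MSLOC each, expressed in KSLOC
expected_warnings = warnings_per_kloc * ksloc_to_audit
print(f"{expected_warnings:,.0f} expected warnings")   # ~662,000
```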
With our automatic classification system, specialists in these organizations could process warnings according to the following scheme:
- e-TP (expected true positives) are sent directly to the team that fixes the code.
- I (indeterminate warnings) are sent to a team of analysts for evaluation.
- e-FP (expected false positives) are ignored.
Further extrapolation of the CERT archive data gives a ratio of true to false warnings of 1:3.74. Given our ambitious goal of automatically and correctly assigning 90% of warnings to the e-TP and e-FP groups, we expect the following results for 200 MSLOC: 126,000 e-TP warnings, which go straight to the team that fixes the code; 470,000 e-FP warnings, which are ignored; and 66,000 I warnings, which are evaluated manually. These numbers imply a 90% reduction in the time spent manually evaluating warnings, while every warning is still accounted for, so the security of the code does not suffer (in other words, automatically classifying 90% of warnings as e-TP or e-FP corresponds to a 90% reduction in manual evaluation). The confidence level of the indeterminate warnings will help determine the order in which they are reviewed; in practice, those with the lowest priority may be ignored entirely.
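The same extrapolation written out (values rounded as in the text; the five-minutes-per-warning estimate is the one cited with the figure below):

```python
expected_warnings = 662_000
auto_classified = 0.90 * expected_warnings            # e-TP + e-FP, handled automatically
indeterminate = expected_warnings - auto_classified   # ~66,200, reviewed manually

tp_to_fp_ratio = 1 / 3.74                              # true : false ratio from archived audits
e_tp = auto_classified * tp_to_fp_ratio / (1 + tp_to_fp_ratio)   # ~126,000
e_fp = auto_classified - e_tp                                    # ~470,000

minutes_per_manual_review = 5
hours_manual = indeterminate * minutes_per_manual_review / 60    # ~5,500 hours
print(f"e-TP ~{e_tp:,.0f}, e-FP ~{e_fp:,.0f}, I ~{indeterminate:,.0f}, "
      f"manual review ~{hours_manual:,.0f} hours")
```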
The figure below shows how our method would improve the application auditing process. Code bases are checked by several static analyzers, each of which produces its own set of warnings. The arrows labeled "Today", leading from the "Alerts" block to the diagram with the red frame, show the familiar alert-handling strategy in which every alert and the corresponding code must be checked manually to establish whether it is true or false. This process usually takes far more time than a project's budget allows. The yellow circle in that diagram shows the number of warnings that can be evaluated in 12,939 hours of work (assuming each alert normally takes about 5 minutes, per the study [Hayward 2008]), and the red ellipse shows the remaining 506,733 unexamined warnings. The upper row of arrows, in contrast, shows our strategy: 90% of alerts are automatically and correctly assigned to the e-TP or e-FP groups, so the user only has to check the 66,000 indeterminate warnings. That takes only 5,500 hours, less than half the time of the first scenario. Moreover, our method ensures that every warning requiring manual review actually receives it.
Fig. 1: The goal of our study is to significantly reduce the number of unexamined warnings and the time spent manually evaluating them. The "Woman And Laptop" image is taken from www.publicdomainpictures.net/view-image.php?image=47526&picture=woman-and-laptop .
Further work
A recent study, Automatically Learning Semantic Features for Defect Prediction, showed that analyzing the semantic features of programs can significantly improve the accuracy of defect detection. We therefore plan to add such analysis to our classification models, and we expect to use semantic features extracted from reports stored in repositories when developing classifiers. We also intend to use automatic parameter optimization for the classification techniques, which, according to the recent study Automated Parameter Optimization of Classification Techniques for Defect Prediction Models, should significantly improve our classifiers.
In the future we may also add a more advanced analysis of the costs, risks, and benefits of reviewing warnings and take these indicators, along with the confidence level, into account when tuning the sensitivity of our classification models. This should increase the likelihood that software companies adopt them.
Our system supports all of the CERT coding rules, and in the future we plan to develop classifiers for other standards. At the moment we are limited by the set of analyzers we use and by the willingness of potential users to acquire and maintain several tools.
It is also possible that the results of our research will eventually be combined with the automatic code repair solutions currently being developed at SEI. In that case our models would be applied after automatic refactoring of the code, which by itself eliminates only some of the warnings about potential errors. Our method could then be used to prioritize, for expert review, the potential semi-automatic fixes that are not guaranteed to be correct and require manual evaluation (for example, a specialist must decide whether a proposed automatic fix is correct given the desired behavior of the code). All remaining warnings, for which no automatic fixes exist, can be classified in the usual way with our method, as described above (that is, categorized as e-TP, e-FP, or I).
In the next article in this series, we will discuss our team's collaboration with the three Department of Defense organizations mentioned above.
We welcome feedback on this work; please leave it in the comments section below.
Additional resources
Read and comment on our articles and help us improve the SEI CERT Coding Standards, which are developed on publicly accessible wikis.
We also invite you to visit the CERT source code analysis website.
Sources cited in this article:
[Heckman 2011] Heckman, Sarah, and Laurie Williams. A systematic literature review of actionable alert identification techniques for automated static code analysis. Information and Software Technology 53.4 (2011): 363-387.
[Heckman 2007] Heckman, Sarah. Adaptively ranking alerts generated from automated static analysis. Crossroads 14.1 (2007).
[Kong 2007] Kong, Deguang, et al. ISA: a source code static vulnerability detection system based on data fusion. Proceedings of the 2nd International Conference on Scalable Information Systems. ICST, 2007.
[Kremenek 2004] Kremenek, Ted, et al. Correlation exploitation in error ranking. ACM SIGSOFT Software Engineering Notes 29.6 (2004).
[Meng 2008] Meng, N., Wang, Q., Wu, Q., Mei, H. An approach to merge results of multiple static analysis tools. International Conference on Quality Software, Oxford, UK, August 12-13, 2008.
[Plakosh 2014] Plakosh, Daniel, Robert Seacord, Robert W. Stoddard, David Svoboda, and David Zubrow. Improving the Automated Detection and Analysis of Secure Coding Violations. (2014).
[Ruthruff 2008] Ruthruff, Joseph R., et al. Predicting accurate and actionable static analysis warnings: an experimental approach. Proceedings of the 30th International Conference on Software Engineering. ACM, 2008.
Translation acknowledgments
- This is a non-SEI-sanctioned translation of a 2016 blog post by Lori Flynn of Carnegie Mellon University's Software Engineering Institute.
- This translation has not been reviewed or endorsed by Carnegie Mellon University. The accuracy and interpretation of this translation are the responsibility of Andrey Karpov.
- ANY MATERIAL OF CARNEGIE MELLON UNIVERSITY AND/OR ITS SOFTWARE ENGINEERING INSTITUTE CONTAINED HEREIN IS FURNISHED ON AN "AS-IS" BASIS. CARNEGIE MELLON UNIVERSITY MAKES NO WARRANTIES OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTY WITH RESPECT TO FREEDOM FROM PATENT, TRADEMARK, OR COPYRIGHT INFRINGEMENT.