Analysis of software performance using mathematical experiment planning

“Premature optimization is the root of all evil”
Anthony Hoar

Greetings to all users of Habr!
This article originated as a useful by-product of my scientific research. I would be happy if the ideas outlined below seem interesting and useful to you, and even better if they get their application and further development in real-life projects.

Software performance (software) is an important aspect in the development of any software product. The urgency of the issue is due to the ever-increasing complexity and significance of software. Particular attention is paid to performance:

in engineering and scientific developments, where complex long-term computations are often made, and processor time on cluster systems is expensive and limited;
in web applications in which the time of page generation is critical for the user and directly depends on the amount of server capacity;
embedded software, etc.

A thorough analysis of the performance of a software product can significantly reduce the cost of the equipment itself and the cost of maintaining performance, increase the loyalty of software users.
The concept of performance in terms of software means either productivity or reactivity :

productivity - the amount of information processed by the system per unit of time;
reactivity - the time between the presentation of the input data to the system and the appearance of the relevant output information.

In this paper, reactivity will be considered as a measure of software performance. Such a choice is not fundamental and the only possible, and was made only to simplify measurements in experiments. Consideration of productivity requires a more complicated test procedure, since measurements of the volumes of information processed are not always possible. Estimation of data processing time by the system is much easier to implement, incl. and automate the testing process.
When analyzing software, the most obvious and logical task facing the researcher will be an increase in productivity. Formally, we have an optimization problem, and in the context of this study, the problem of minimizing the time of processing incoming information by a software system. Thus, the optimization criterion is a certain function.

where (1)

- information processing time;
- all parameters (factors or influencing factors) that can directly or indirectly affect system performance;
$x_i \ in [a_i, b_i]$ - the domain of the i- th factor, which is the constraint of the problem.

The choice of the entire set of factors affecting program performance is not a trivial problem. As a rule, modern software is an extremely complex system with a huge number of links and dependencies. And it's not just the complexity of the software itself. Any program has a runtime programming language (runtime), operates in the operating system, interacts with other services and software — all of these elements are also separate complex systems with no fewer internal connections and dependencies. If we also add mutual dependencies of factors (which can be not only paired and / or linear), then we get an incredible number of various combinations of factors.
For example, consider a program that works with a database (DB) and produces statistical processing of the information stored in it. Examples of factors affecting performance include:

the amount of computer memory;
speed of access to the hard disk;
maximum frequency of work and average processor load;
DBMS settings, etc.

The possibility of dependencies between these factors is obvious. For example, increasing the amount of RAM can reduce the need for frequent access to the hard drive, and, consequently, reduce the influence of the corresponding factor.
')
To find the optimal values of influencing factors, i.e. To solve the optimization problem (1), we can offer the following options:

full enumeration of all possible combinations of values of influencing factors;
random selection of a certain number of combinations and the subsequent selection of the best option;
analytical study of the system;
application of specialized software;
use of mathematical models.

All of these options have their advantages and disadvantages. Using a full search it can be argued that the desired parameters will be found, however, with a large number of factors and variants of their values, the number of all possible combinations may be too large, and the experiments will take too much time. The number of all possible combinations of N can be found by combinatorial rules:

N = n_1 n_2 ... n_k

where (2)

- the number of variants of the values of the i -th factor.

Then, with 5 factors and 5 possible values of each factor, we get $5 ^ {^ {5}} = 3125$ combinations.
With a random choice of the optimal combination, there is a high probability that the solution obtained will be very far from the global optimum.
Analytical study of the system is often difficult or impossible when analyzing existing products without source code. In addition, such an approach requires a full understanding by the researcher of all the algorithms used in the software, connections and dependencies of the components.
Special software tools such as profilers [5] provide only some statistical information about the execution of the program code: the number of method calls, the average execution time of methods, etc. Optimization in this case comes down to identifying the so-called. Bottlenecks and optimization of the algorithms used. Such an approach is quite popular, but it does not allow one to obtain the desired solution to problem (1) .
Mathematical models of software performance analysis were repeatedly considered by foreign and domestic authors. So in [1] , [2] , original approaches to solving this problem were proposed at different stages of software development.
Summing up, it should be noted that the main disadvantages of the proposed methods of solution are:

excessively long measurement time;
reliance on the ability of the researcher to draw conclusions based on the analysis of the program;
complexity of application and use.

To overcome the above difficulties and solve the problem, it is proposed to use an apparatus developed in the theory of mathematical experiment planning (MBE).
Let us dwell in more detail on the question of the applicability of the MBE to software performance analysis. One of the main ideas of experiment planning is the use of a black box for the object under study cybernetic abstraction [3] (see Fig. 1 ).

Fig. 1. Abstraction black box.

Such an abstraction implies a refusal to consider the internal mechanisms of the phenomenon or object under study due to its great complexity. Analysis of the phenomenon is reduced to the analysis of input parameters affecting the object (factors) and output characteristics (responses) [3] .
For the applicability of the MPE is necessary to comply with a number of conditions [3]:

reproducibility of experiments;
manageability factors;
measurability of output characteristics and the ability to express it in one number;
unambiguity and compatibility of factors, etc.

After analyzing the analogies, we can conclude about the possibility of considering the software system as a black box. Then, with the fulfillment of the above requirements, the entire existing MPE apparatus can be used to study the performance.
As an example, and to refine the method of using MBE for analyzing the performance of a software system, consider a web application that is part of the Professional Clubs project [4] .

Stage 1. Analysis of a priori information.

When analyzing a priori information about a software product, it became known:

Software is a web application written in the PHP programming language;
The software runs on the Apache web server, the PHP interpreter is connected as a module;
The software uses MySQL for data storage;
It is possible to enable data caching by the application itself.

As a mathematical model of the experiment [3], we will use the simplest one — a linear model without dependencies between factors. Most likely, such an assumption and such a combination of factors do not correspond to the actual situation, but may well be used to test the method itself. Then this task is reduced to finding the values of the coefficients. k_i

regression equations

y = k_0 + k_1 x_1 + k_2 x_2 + k_3 x_3 + k_4 x_4 + k_5 x_5

. (3)

The factors with the highest values will have the greatest effect on the output response.

Stage 2. The choice of influencing factors.

Table 1 presents a set of influencing factors selected as a result of analyzing a priori information about software.

Table 1.

Factor	Description
	MySQL key_buffer_size A DBMS tuning parameter that determines the amount of memory allocated for storing indexes [6] .
	MySQL table_cache DBMS tuning parameter that determines the number of constantly open database tables [6] .
	MySQL query_cache_limit DBMS tuning parameter that determines the maximum amount of memory for caching query results [6] .
	Caching data by the application. It can be either on or off.
	Web server and PHP interpreter launch method. The application under study can be launched in two ways: Apache web server + PHP module; web server nginx + php-fpm.

Stage 3. The choice of the upper and lower levels for the factors.

Now you need to select the upper and lower levels for each factor [3] . Hereinafter, we will use the notation adopted in the IPE:

+1 corresponds to the upper level of the factor;
-1 corresponds to the lower level of the factor.

Table 2.

Factor	Top Level (+1)	Lower level (-1)
	265 Mb.	16 Mb.
	300	64
	64 Mb.	1 Mb.
	Cache is enabled.	The cache is turned off.
	Nginx + php-fpm.	Apache + mod_php.

Stage 4. Drawing up a planning matrix and conducting experiments.

Table 3 presents the results of a series of experiments.

Table 3.

No						y	No						y
one	-one	-one	-one	-one	-one	6,909456902	17	+1	-one	-one	-one	-one	6.956250343
2	-one	-one	-one	-one	+1	6,265920885	18	+1	-one	-one	-one	+1	6.27117213
3	-one	-one	-one	+1	-one	1,046864681	nineteen	+1	-one	-one	+1	-one	1,049605346
four	-one	-one	-one	+1	+1	0,959287777	20	+1	-one	-one	+1	+1	0.960128005
five	-one	-one	+1	-one	-one	6.922491238	21	+1	-one	+1	-one	-one	6.94905457
6	-one	-one	+1	-one	+1	6,292138541	22	+1	-one	+1	-one	+1	6,288483698
7	-one	-one	+1	+1	-one	1,047327693	23	+1	-one	+1	+1	-one	1,048429732
eight	-one	-one	+1	+1	+1	0,959178464	24	+1	-one	+1	+1	+1	0.959984639
9	-one	+1	-one	-one	-one	6,947828159	25	+1	+1	-one	-one	-one	6.944574752
ten	-one	+1	-one	-one	+1	6.269961421	26	+1	+1	-one	-one	+1	6.281574535
eleven	-one	+1	-one	+1	-one	1,047032595	27	+1	+1	-one	+1	-one	1,047937875
12	-one	+1	-one	+1	+1	0.960076244	28	+1	+1	-one	+1	+1	0,960813348
13	-one	+1	+1	-one	-one	6.954160943	29	+1	+1	+1	-one	-one	6,952602925
14	-one	+1	+1	-one	+1	6.278223336	thirty	+1	+1	+1	-one	+1	6.284795263
15	-one	+1	+1	+1	-one	1,048019483	31	+1	+1	+1	+1	-one	1,047952991
sixteen	-one	+1	+1	+1	+1	0,960559206	32	+1	+1	+1	+1	+1	0,960591927

Stage 5. Analysis of the results.

The coefficients of the regression equation (3) can be found as

$k_j = \ frac {\ sum_ {i = 1} ^ {N} x_ {ji} y_i} {N}, j = 1 ... k$ . (four)

The results are presented in table 4 and fig. 2

Table 4.


3,97361211	0.0519711	0.0245041	0,0034686	-2,9182692	-0,2483070

Fig. 2. The coefficients of the regression equation.

Stage 6. Conclusion.

From fig. 2 that the main contribution to the time of web page generation is made by the factor x_4

that characterizes data caching in the application. Negative regression coefficient b_4

means that this parameter reduces the black box response function. In our example, this means a reduction in page generation time.
Strictly speaking, such results are obvious, since the inclusion of data caching by the application entails a minimal number of queries to the DBMS. Thus, the effect of the DBMS settings becomes insignificant. The obtained results confirm the possibility of using the MBE in relation to the analysis of software performance.

In this article, I tried to consider the possibility of applying the existing method of mathematical planning of experiment for a long time as applied to the analysis of software performance. Such an approach makes it possible to overcome a number of difficulties arising in the analysis of software systems, to identify the factors that most strongly influence the analyzed characteristic, to reveal the dependencies between the factors.
Of course, the presented methodology does not claim to replace the existing methods of performance analysis, for example, profiling. However, there are a number of tasks in which the use of such a technique can significantly facilitate the tasks of a researcher.

Thanks for attention!

Bibliography

Dubakov S.A. Information technology analysis of performance in the software development process: dis. Cand. tech. Sciences / Dubakov SA - Tomsk, 2005. - 135 p.
Moiseychuk LD Development of models and methods for analyzing software performance based on strictly hierarchical stochastic Petri nets: dis. Cand. tech. Sciences / Moiseychuk LD - St. Petersburg, 2002. - 152 p.
Adler Yu.P., Markova EV, Granovsky Y.V. Planning an experiment when searching for optimal conditions. - M .: Science, 1976. - 279 p.
LLC "Professional Clubs". - URL: http://prof-club.ru/ . Appeal date: 25.09.2011.
Kaspersky K. Technique program optimization. Efficient use of memory. - SPb .: BHV-Petersburg, 2003. - 464 p.
MySQL Server System Variables. - URL: http://dev.mysql.com/doc/refman/5.0/en/server-system-variables.html . Appeal date: 25.09.2011.

Source: https://habr.com/ru/post/129346/

All Articles