Introduction. In the modern world, a great many tasks involve processing large arrays of homogeneous data. Vivid examples are systems that analyze stock quotes, weather conditions, or network traffic statistics. Many such systems use relational databases whose tables hold so much data that correctly constructing and optimizing queries against those tables becomes essential for the normal functioning of the system. This article describes methods for solving several data retrieval tasks (together with comparative timings of the methods used) against MySQL tables containing statistics about the network traffic passing through the routers of one of the major Russian network providers. The data stream coming from the main router is so intense that, on average, from 400 million to a billion records describing TCP/IP transactions enter the tables of the network traffic monitoring system in use (the router in question exports data via the NetFlow protocol).
MySQL is used as the database management system.
The use of databases. Queries in relational database theory are based on the notion of operations on sets, and set theory has no concept of time. In reality things are different: a DBMS implements abstract concepts on top of real hardware. As a result, executing a query takes time (sometimes an extremely long time), and waiting is not always an option, so the problem of finding ways to speed up queries is quite acute; fortunately, such ways exist. One of the main and most frequently used techniques is indexing tables, which speeds up scanning them. Another is to write queries in a way that influences the planning mechanism, allowing the requests of different clients to interact more effectively. It is also possible to adjust hardware settings to improve performance, working around the physical limitations inherent in a particular type of equipment.
Working with indexes. A table with no indexes is an unordered set of rows; to find the desired record, say by id, every row must be checked against the desired value, which means scanning the entire table from beginning to end. This can take a long time and is extremely inefficient, especially if only a few records satisfy the search condition.
As an example, consider a table of organizations under the jurisdiction of the Internet provider in question. In this context the table has two fields: the organization id (org_id) and the organization name (org_name).
Suppose we added an index on the org_id field (see Fig. 1). The index contains an entry for each row of the table, and the index entries are sorted by org_id. Now, instead of scanning all the records in the table, we can use the index. Suppose we need to find the row describing an organization (Institute of Automation and Electrometry of the Siberian Branch of the Russian Academy of Sciences) with the unique identifier (org_id) equal to 2. Scanning by index returns a single row. The values in the index are sorted in ascending order, so on reaching the next value (org_id equal to 3) the scan can stop: after 3 no further matches will be found. If the desired values are somewhere in the middle of the index, special algorithms (for example, binary search) lead to the desired row without a long linear scan of the table.

Fig. 1. Interaction with a table using an index
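To make the index mechanics concrete, here is a small Python sketch of the lookup described above. The table contents (apart from the institute's row) and all helper names are made up for illustration; a real MySQL index is a B-tree, while this sketch models it as a sorted list searched with `bisect`.

```python
import bisect

# Hypothetical miniature of the organizations table from Fig. 1:
# rows are (org_id, org_name), stored in arbitrary insertion order.
table = [
    (3, "Org C"),
    (1, "Org A"),
    (2, "Institute of Automation and Electrometry SB RAS"),
    (4, "Org D"),
]

# Without an index: a full scan touches every row, O(n).
def full_scan(rows, org_id):
    return [name for oid, name in rows if oid == org_id]

# An "index" here is a sorted list of (key, row_position) pairs;
# binary search finds the first matching entry in O(log n).
index = sorted((oid, pos) for pos, (oid, _) in enumerate(table))
keys = [oid for oid, _ in index]

def index_lookup(rows, org_id):
    i = bisect.bisect_left(keys, org_id)
    result = []
    # Scan forward while the key still matches; stop at the next key,
    # exactly as described for org_id = 2 above.
    while i < len(index) and index[i][0] == org_id:
        result.append(rows[index[i][1]][1])
        i += 1
    return result

print(index_lookup(table, 2))  # the single matching row
```

Both functions return the same rows; the index version merely avoids visiting rows that cannot match.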
In general, most queries can be optimized by creating the right indexes on the table and constructing the query so that they are used efficiently. However, there are queries whose execution indexes do not speed up at all. For example, when more than roughly 1/10 of all records in a table would be selected via an index, the optimizer prefers an FTS (Full Table Scan) instead, since sequential reads from disk are faster than reads from scattered parts of the disk (moving the head across the disk, a seek, is an "expensive" operation). The optimizer can be "forced" to use the index with the FORCE INDEX option, but this usually gives no gain in speed. Tables can also be so large that creating an index is impractical in terms of the space occupied or the duration of the operation. In addition, the convenience of indexes comes at a price: indexes have their drawbacks. Most of them are minor, but they are worth knowing about.
Disadvantages of using indexes. First, indexes speed up data retrieval but slow down adding, deleting, and modifying data in the indexed columns, because every change to the data in such a table requires updating the index. The relationship is simple: the more indexes a table has, the more operations on its records slow down.
Second, the index file occupies space on disk (there are cases where indexes take up more space than the data itself). With a large number of indexes, the index file can quickly approach its maximum size, which may cause tables to hit their size limits sooner than they would without indexes.
- For MyISAM tables, excessive indexing can cause the index file to reach its maximum size faster than the data file (the problem can be mitigated with PARTITIONING).
- BDB tables store data and indexes in a single file, which naturally makes the table file reach its maximum size sooner.
- InnoDB tables live in a single tablespace; adding indexes exhausts the disk space allocated to the tablespace faster.
User variables. MySQL has supported user variables since version 3.23.6. Values can be assigned to variables and read back later, which makes it possible to save the results of calculations for use in subsequent queries.
For example, to find the products with the minimum price, you can run:
SELECT @min_price := MIN(price) FROM shop;
SELECT * FROM shop WHERE price = @min_price;
User variables are written as @var_name and can hold integer (int), floating-point (real), and string (string) values. A variable can be assigned with the SET statement (SET @var1 = 'RUS'), with a SELECT statement (SELECT 'RUS' INTO @var1;), or during query execution with the := operator (@var1 := 72); inside a SELECT, a plain = is treated as a comparison, so := must be used for assignment.

An uninitialized variable is NULL.
Variables must be initialized before the query is executed; otherwise the variable's type is determined during execution of the query itself, may be detected incorrectly, and the query will return an incorrect result.
The benefits of using user variables. The ability to save intermediate results in variables and operate on them during further query execution can significantly speed up some queries (thanks to a smaller number of table scans), and also makes it possible to express queries that are very hard, or outright impossible, to implement in the standard relational model.
Disadvantages of using user variables.
- Poor portability to other DBMSs (the user variable mechanism is a MySQL-specific add-on to the relational database model). This drawback is not very critical, however, since at the moment all large DBMSs deviate from the ANSI SQL standard in their own ways.
- The behavior of user variables depends on the order in which a complex query is executed, and the MySQL optimizer may change that order (unless STRAIGHT_JOIN is used).
Example 1. One of the important tasks in network traffic monitoring is detecting load peaks on the communication channels provided by Internet providers (see Fig. 2).

Fig. 2. Graph showing the channel load dynamics
The monitoring system stores complete information about incoming and outgoing traffic, in particular for each channel in a separate table. The table type is MyISAM.
A simplified form of such a table:
t (timestamp) | src (bytes sent) | dst (bytes received)
To find the maximum increase of incoming traffic (the same applies to outgoing traffic, in which case the src field is used), the entire table must be scanned while comparing the dst values corresponding to t_prev (the timestamp at the previous step) and t_curr (the timestamp at the current step). Here lies the main difficulty: within the relational model it is impossible to remember the previous value while scanning the table and use it directly. It can be computed with the subquery
SELECT dst FROM t t2 WHERE t2.t < t1.t ORDER BY t2.t DESC LIMIT 1;
(where t1.t is the current timestamp value), but this construct greatly complicates the query: its complexity grows from O(n) to O(n²), because for each timestamp the previous one has to be computed by scanning the entire table, which strongly affects execution speed. If the table has a unique index on the t field, the computation runs much faster, but only if it is known that there are no missing timestamps in the table (an ideal situation almost never encountered in practice), so that the previous timestamp can be obtained simply by subtracting the desired interval from the current one (lookups by a unique index are very fast). In the general case, the timestamp of the previous step has to be computed by a subquery based not on strict equality but on an inequality with sorting, and such a query, of course, runs much longer. In the relational variant, the subquery needs an index on the t field for faster retrieval, and the outer query works with the dst field; therefore a composite index t_dst on the fields (t, dst) is created on the table.
A query variant within the relational database model:
SELECT MAX(t1.dst - (SELECT dst FROM t t2 WHERE t2.t < t1.t ORDER BY t2.t DESC LIMIT 1)) FROM t t1;
As the query shows, the table has to be traversed n + 1 times. User variables allow this query to be implemented in a single pass over the table. For a single-pass query to work correctly, the data must be selected in ascending order of the timestamp.
This can be achieved in the following ways.
- Sort the data in the table (if required) with ALTER TABLE `t` ORDER BY `t` (a working solution for MyISAM tables) before running the query. If the data was inserted sequentially but some records were later deleted, the table must be defragmented with OPTIMIZE TABLE `t`. Since the data is pre-sorted, the query uses the IGNORE INDEX(t_dst) hint so that the data is read sequentially rather than through the index.
- A composite index on the selected fields (those used in WHERE / ORDER BY / GROUP BY) allows the data to be read through that index, that is, in ascending order of the indexed field. Here the composite index t_dst on the fields (t, dst) yields the data in ascending order of the timestamp t. To make the optimizer use the index for the selection, the FORCE INDEX(t_dst) hint is needed.
- Select the data not from table t itself but from a derived table built from it, sorted by the field t in ascending order (SELECT * FROM `t` ORDER BY `t`).
The ability to select the data in one pass rests on the fact that the dst value from the previous row of the table can be memorized in a user variable and compared with the dst value of the current row.
A query that works with a table where the data is already arranged in the right order:
SET @data = NULL, @next_data = NULL, @max_value = 0;
SELECT @next_data := `dst` AS `nextdata`,
       IF(@next_data - @data > @max_value, @max_value := @next_data - @data, @max_value),
       @data := `dst`
FROM `t` IGNORE INDEX(t_dst);
SELECT @max_value;
A query that selects data in the desired order by index:
SET @data = NULL, @next_data = NULL, @max_value = 0;
SELECT @next_data := `dst` AS `nextdata`,
       IF(@next_data - @data > @max_value, @max_value := @next_data - @data, @max_value),
       @data := `dst`
FROM `t` FORCE INDEX(t_dst);
SELECT @max_value;
A query that sorts the data during the selection process:
SET @data = NULL, @next_data = NULL, @max_value = 0;
SELECT @next_data := `dst` AS `nextdata`,
       IF(@next_data - @data > @max_value, @max_value := @next_data - @data, @max_value),
       @data := `dst`
FROM (SELECT * FROM `t` ORDER BY t) t1;
SELECT @max_value;
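The single-pass logic of these queries can be illustrated in Python. This is only a sketch: the sample data is made up, `prev` plays the role of @data, and `max_value` plays the role of @max_value; the quadratic function mirrors the relational variant, which re-derives the previous row with a subquery for every row.

```python
# (t, dst) pairs, already sorted by t; values invented for illustration.
rows = [
    (0, 100), (60, 340), (120, 310), (180, 900), (240, 905),
]

def max_increase_single_pass(rows):
    prev = None       # plays the role of @data (NULL at the start)
    max_value = 0     # plays the role of @max_value
    for _, dst in rows:
        if prev is not None and dst - prev > max_value:
            max_value = dst - prev
        prev = dst    # @data := dst
    return max_value

def max_increase_quadratic(rows):
    # The relational variant: for every row, a "subquery" re-derives the
    # previous row by scanning all rows, giving O(n^2) work in total.
    best = 0
    for i in range(1, len(rows)):
        prev_dst = max((r for r in rows if r[0] < rows[i][0]),
                       key=lambda r: r[0])[1]
        best = max(best, rows[i][1] - prev_dst)
    return best

print(max_increase_single_pass(rows))  # 590 (the jump from 310 to 900)
```

Both functions return the same answer; the difference is that the first touches each row once, which is exactly the saving the user variable queries achieve.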
Table 1 shows the query execution times for different numbers of rows in the table. The number of rows is determined by the length of the period; one-minute measurement intervals are considered here.
Period | Number of rows | Relational query | Query with user variables (sequential read) | Query with user variables (sorting during selection) | Query with user variables (ordered by index)
1 day | 1,440 | 2.67 sec | 0.00 sec | 0.00 sec | 0.00 sec
1 week | 10,080 | 2 min 11 sec | 0.00 sec | 0.01 sec | 0.02 sec
1 month | 43,200 | 40 min 1 sec | 0.02 sec | 0.04 sec | 0.07 sec
1 year | 525,600 | more than three days | 0.54 sec | 0.68 sec | 0.94 sec
Table 1. Execution times of the queries from Example 1.
The measured execution times confirm that the relational variant of the query is completely inapplicable in this case, while all variants with user variables finish in under a second, which is acceptable in most situations.
Example 2. To analyze channel load, one regularly has to evaluate the load over fixed time intervals (several seconds or several minutes). Here an interval of 5 seconds is chosen. The structure of the source table is the same as in Example 1; the timestamp t must be of type TIMESTAMP.
With a relational query, such a selection is implemented through grouping followed by sorting of the results:
SELECT `t` - INTERVAL (UNIX_TIMESTAMP(`t`) % 5) SECOND AS s5, SUM(`dst`) AS `d` FROM `t` GROUP BY s5 ORDER BY `t`;
Besides a full pass over the table, this query also performs grouping and sorting of the result.
The same query can be implemented with user variables at a smaller cost, because explicit grouping, and hence sorting, is avoided (provided there are no missing timestamps in the table).
A query that works with a table where the data is already arranged in the right order:
SET @c := 0;
SELECT `t`, `dst`
FROM (SELECT `t`,
             IF(UNIX_TIMESTAMP(`t`) % 5, @c := @c + `dst`, @c := `dst`) AS `dst`,
             IF((UNIX_TIMESTAMP(`t`) + 1) % 5, 1, 0) AS `mark`
      FROM `t` IGNORE INDEX(`t_dst`)) t1
WHERE mark = 0;
A query that selects data in the desired order by index:
SET @c := 0;
SELECT `t`, `dst`
FROM (SELECT `t`,
             IF(UNIX_TIMESTAMP(`t`) % 5, @c := @c + `dst`, @c := `dst`) AS `dst`,
             IF((UNIX_TIMESTAMP(`t`) + 1) % 5, 1, 0) AS `mark`
      FROM `t` FORCE INDEX(`t_dst`)) t1
WHERE mark = 0;
A query that sorts the data during the selection process:
SET @c := 0;
SELECT `t`, `dst`
FROM (SELECT `t`,
             IF(UNIX_TIMESTAMP(`t`) % 5, @c := @c + `dst`, @c := `dst`) AS `dst`,
             IF((UNIX_TIMESTAMP(`t`) + 1) % 5, 1, 0) AS `mark`
      FROM (SELECT * FROM `t` ORDER BY `t`) t1) t2
WHERE mark = 0;
Inside the subquery, the running sum is accumulated in the @c variable; at the start of each five-second interval the counter is re-initialized with the current dst value, and on the last second of the interval mark is set to 0. The outer query then selects exactly the rows whose mark equals 0, that is, one total per interval.
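To illustrate, the counter logic of the subquery can be sketched in Python; the sample data and function names are invented, `c` plays the role of @c, and the groupby variant mirrors the relational GROUP BY query.

```python
# Sample (unix_ts, dst) rows with no missing seconds, sorted by time.
rows = [(ts, 10 + ts % 3) for ts in range(100, 115)]  # 15 seconds of data

def sum_per_5s_single_pass(rows):
    out = []
    c = 0                                      # plays the role of @c
    for ts, dst in rows:
        c = dst if ts % 5 == 0 else c + dst    # reset at interval start
        if (ts + 1) % 5 == 0:                  # last second of the interval
            out.append((ts, c))                # the mark = 0 rows
    return out

def sum_per_5s_groupby(rows):
    # The relational variant: explicit GROUP BY (ts - ts % 5).
    buckets = {}
    for ts, dst in rows:
        key = ts - ts % 5
        buckets[key] = buckets.get(key, 0) + dst
    return sorted(buckets.items())

print(sum_per_5s_single_pass(rows))
```

Both variants produce the same per-interval sums; the single-pass version just labels each sum with the interval's last timestamp instead of its first.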
Table 2 shows the query execution times for different numbers of rows in the table.
Period | Number of rows | Relational query | Query with user variables (sequential read) | Query with user variables (sorting during selection) | Query with user variables (ordered by index)
1 day | 86,400 | 0.11 sec | 0.05 sec | 0.1 sec | 0.08 sec
1 week | 604,800 | 0.86 sec | 0.33 sec | 0.73 sec | 0.6 sec
1 month | 2,592,000 | 33.72 sec | 1.65 sec | 3.62 sec | 2.83 sec
1 season (3 months) | 7,776,000 | 2 min 45 sec | 4.87 sec | 10.46 sec | 8.36 sec
Table 2. Execution times of the queries from Example 2.
As in the first example, the given query variants using MySQL user variables run several times faster.
Example 3. From the network's point of view, subordinate organizations are described by certain technical characteristics, in particular by the ranges of IP addresses allocated to them. Sometimes it is necessary to single out the largest contiguous blocks of IP addresses allocated to an organization.
The table of organizations and their network ranges (nets) looks as follows:
org_id (organization id) | org_name (organization name) | net (network address, 4 bytes) | mask (network mask, 4 bytes)
Selecting the single largest contiguous network range for each organization presents no particular difficulty. But if the two largest ranges are needed, the relational query variant becomes quite complicated: for each range one must count the number of larger ranges, and then filter out those that are not among the first two.
Query variant in the relational model:
SELECT t1.org_id, t1.org_name,
       (SELECT COUNT(*) FROM `nets` t2
        WHERE t2.org_id = t1.org_id
          AND ((t2.mask << 32) + t2.net) < ((t1.mask << 32) + t1.net)) c,
       INET_NTOA(t1.net), INET_NTOA(t1.mask)
FROM nets t1
HAVING c < 2
ORDER BY org_id, mask;
With user variables, the number of preceding (larger) network ranges can be counted within one pass, resetting the counter whenever the organization changes. Once the counter reaches the threshold (here, 2), further records for that organization are not output. The table itself must first be sorted by the size of the network ranges within each organization.
Query variant with MySQL user variables:
set @i = 0, @p = 0;
SELECT t.org_id, t.org_name, INET_NTOA(t.net), INET_NTOA(t.mask)
FROM (SELECT * FROM `nets` ORDER BY `org_id`, `mask`) AS t
WHERE ((IF(@p = org_id, @i := @i + 1, (@i := 0) OR (@p := org_id))) AND @i < 2);
The high efficiency of the user variable query stems from the fact that the relational variant contains a subquery that is executed for every record in the table.
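The counter-and-reset logic of this query can be sketched in Python. The data is made up; `i` plays the role of @i, `prev_org` plays the role of @p, and a smaller mask prefix length stands in for a larger network range, so the input is pre-sorted by (org_id, mask) just as the SQL variant requires.

```python
# (org_id, mask_bits) pairs, sorted by org_id, then mask.
rows = [
    (1, 16), (1, 20), (1, 24),
    (2, 8),  (2, 24), (2, 24),
]

def top2_per_org(rows):
    out = []
    prev_org = None   # plays the role of @p
    i = 0             # plays the role of @i
    for org, mask in rows:
        if org == prev_org:
            i += 1
        else:
            i, prev_org = 0, org   # new organization: reset the counter
        if i < 2:                  # keep only the first two ranges
            out.append((org, mask))
    return out

print(top2_per_org(rows))  # [(1, 16), (1, 20), (2, 8), (2, 24)]
```

One ordered pass suffices, whereas the relational variant runs a counting subquery per row.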
Example 4. The previous experiments never considered modifying the table structure (in real life it is far from always possible or convenient to change it). When working with temporal tables, however, a straightforward solution is to add another column holding the previous value prev of the measured parameter. For the fastest possible search for the maximum difference, one can also add a diff field holding the difference between the current measurement value d and the previous one.
Consider the option of changing the table structure by adding the prev column. The original data table is:
t (timestamp) | d (measurement data)
The table contains no indexes and no primary key: with an index on the t field or a composite index on the (t, d) fields, the relational query variant runs much slower.
Modification variant within the relational database model:
CREATE TABLE `t_modifyed`
SELECT t,
       (SELECT `d` FROM `t` `in` WHERE `in`.t < `out`.`t` ORDER BY t DESC LIMIT 1) prev,
       d
FROM `t` `out` ORDER BY `t`;
Modification variant using user variables:
set @v = NULL;
CREATE TABLE `t_modifyed` SELECT t, @v prev, @v := d AS d FROM `t` ORDER BY `t`;
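The user variable version amounts to carrying the previous row's value forward in one ordered pass, which can be sketched in Python as follows. The sample rows and function name are invented; `v` plays the role of @v and starts as None, just as the variable starts as NULL.

```python
# (t, d) rows, sorted by t; values invented for illustration.
rows = [(0, 5), (60, 8), (120, 3), (180, 9)]

def with_prev(rows):
    out = []
    v = None                     # plays the role of @v (starts as NULL)
    for t, d in rows:
        out.append((t, v, d))    # emit t, prev, d
        v = d                    # @v := d
    return out

print(with_prev(rows))
# [(0, None, 5), (60, 5, 8), (120, 8, 3), (180, 3, 9)]
```

The first row gets prev = None (NULL), matching the behavior of the SQL variant.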
Table 3 shows the query execution times for different numbers of rows in the table.
Period | Number of rows | Relational query | Query with user variables (sequential read) | Query with user variables (sorting during selection)
1 day | 1,440 | 0.35 sec | 0.00 sec | 0.00 sec
1 week | 10,800 | 12.33 sec | 0.00 sec | 0.01 sec
1 month | 43,200 | 3 min 45 sec | 0.10 sec | 0.15 sec
1 year | 525,600 | more than 5 hours | 0.54 sec | 0.96 sec
Table 3. Execution times of the queries from Example 4.
Other uses. Many relational DBMSs let you use the number of the current output row during query execution:
- Oracle (the rownum pseudocolumn);
- Microsoft SQL Server (the ROW_NUMBER() function);
- PostgreSQL (the ROW_NUMBER() function).
In MySQL, rownum can be emulated correctly only with user variables:
set @n := 0;
select @n := @n + 1 as rownum, t.* from t;
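As a point of comparison, the same row-numbering idea, an incrementing counter over the result set, is a one-liner in a general-purpose language; the sample data here is made up.

```python
# rownum emulation is just an incrementing counter over the result set,
# which Python expresses directly with enumerate().
rows = ["a", "b", "c"]
numbered = [(n, r) for n, r in enumerate(rows, start=1)]
print(numbered)  # [(1, 'a'), (2, 'b'), (3, 'c')]
```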
Strictly speaking, rownum is rarely needed in MySQL, since the LIMIT clause restricts the number of rows a query returns, but occasionally it is still required.
Conclusion. Using user variables in certain kinds of queries achieves higher execution speed than their relational counterparts. This happens when:
- the query cannot use indexes efficiently (too many values would be selected via the index, so the optimizer performs an FTS);
- the query uses values computed in previous steps and stored in user variables.
Similar facilities in other relational DBMSs. A mechanism similar to user variables exists in almost all modern relational DBMSs. This section gives examples for PostgreSQL and Microsoft SQL Server. When using the functions below, keep in mind that no measures were taken in them to ensure that rows are selected from the table in the required order (ascending t).
PostgreSQL:
PostgreSQL does not support user variables during query execution, but starting from version 9.0 it supports anonymous code blocks. Unfortunately, these take no formal parameters and return no values. It is possible, however, to create a function in the pg_temp schema, from which the function is automatically removed when the session ends. An example of a function computing the maximum difference:
DROP FUNCTION IF EXISTS pg_temp.max_drop();
CREATE FUNCTION pg_temp.max_drop() RETURNS int AS $$
DECLARE
  prev INTEGER;
  curr INTEGER;
  max_ INTEGER;
BEGIN
  max_ := 0;
  prev := 0;
  FOR curr IN SELECT d FROM pogoda LOOP
    IF (curr - prev > max_) THEN
      max_ := curr - prev;
    END IF;
    prev := curr;
  END LOOP;
  RETURN max_;
END
$$ LANGUAGE plpgsql;
Microsoft SQL Server: the main difference from the MySQL user variable query is that the variables must be declared in advance.
DECLARE @diff int;
DECLARE @prev int;
set @diff = 0;
set @prev = 0;
select @diff = case when @diff > (d - @prev) then @diff else (d - @prev) end, @prev = d from npogoda;
select @diff;
Oracle: at the time of writing, Oracle SQL offers no user variables; such a facility is available only in PL/SQL and in tools such as SQL*Plus.
Acknowledgments. The author is grateful to the following people for their invaluable help in writing this article: