Key features of CUBRID 8.4.0

Greetings to all!

This blog will be very entertaining! Today I will talk about the very interesting features of the latest version of CUBRID 8.4.0, about what you usually would not find in the manual. I will give very important recommendations for optimizing queries and indexes, I will give test results, as well as examples of use in real Web services.

Earlier, I was talking superficially about the changes in the new version, about the doubled accelerated database engine, about the extended support for MySQL syntax, etc. And today I will talk about them and other things in more detail, focusing on how we were able to double the performance of CUBRID.
')
The main areas that affected the performance of CUBRID are:

Reducing the size of the database volume
Improved parallel computing in the Windows version
Index optimization
LIMIT optimization processing
Optimization of condition processing in GROUP BY

Reducing the size of the database volume

In CUBRID 8.4.0, the database volume size has decreased by as much as 218%. The reason for this is a completely redesigned structure for storing indexes, which in parallel affects the performance of the entire system.

In the following figure, you can see a comparison of the size of the database volumes in the previous version 8.3.1 and the new 8.4.0. In this case, both databases stored 64,000,000 records with a primary key. Data is in gigabytes.

Improved parallel computing in the Windows version

CUBRID 8.4.0 improves parallel computing in the version for the Windows platform using advanced Mutexes. The following graph shows the comparative performance results of the previous and new versions.

Improved parallel computing in the Windows version

Index optimization

It is here that I will tell you everything in great detail.

CUBRID 8.4.0 differs from the previous version in half the accelerated database engine. We implemented several very important index optimizations like:

Covering index
Optimization of condition processing in LIMIT
- Restriction on keys (Key Limit)
- Multiband scanning (Multi Range)
Optimization of condition processing in GROUP BY
Index scan descending
Index scan support in LIKE operators

Now let's see how the index structure is organized in CUBRID 8.4.0. In CUBRID, the index is implemented as a B + tree [link to a Wikipedia article], in which the index key values are stored in the leaves of the tree.

For a practical example, I suggest looking at the following table structure (STRING = VARCHAR (1,073,741,823)):

CREATE TABLE tbl (a INT, b STRING, c BIGINT);

Enter the data:

INSERT INTO tbl VALUES (1, 'AAA, 123), (2, 'AAA', 12), …;

And create a multi-column index. By the way, notice that I create the index after I entered the data. This is the recommended way if you want to enter data at the initial stage or when recovering it. Thus, you can avoid the time and resources spent on indexing with each entry. You can read more about the recommendations when entering big data here .

CREATE INDEX idx ON tbl (a, b);

The figure below shows the structure of this index, in the leaves of which there are pointers ( OID ) to the data itself, which is in the heap file (heap file) on the disk.

Index structure in CUBRID

Thus, the index key values ( a and b ) are sorted by magnification (default).
Each sheet has a pointer (indicated by an arrow) to the corresponding data (entry in the table) located in the heap on the disk.
The data in the heap are arranged in a random order, as indicated in the figure.

Index scan

Now let's see how the search usually happens using indexes. Given the above created table, we will run the following query.

SELECT * FROM tbl
WHERE a > 1 AND a < 5
AND b < 'K'
AND c > 10000
ORDER BY b;

First, CUBRID will find all the leaves in which a> 1 and a <5 .
Then, among this result, he will select the leaves in which b <'K' .
Since column c is not indexed, it is necessary to contact the heap that is on the disk to get its values.
Each sheet in the index tree contains an OID (Object Identifier) value, which indicates exactly where the data of a particular table entry is stored on the disk.
Based on these OIDs, the server will contact the heap to get the values in column c .
Then CUBRID will find all those records in which c> 10,000 .
As a result, all these records will be sorted by column b , as required in the request.
Then the results are sent to the client.

Covering index

Now let's see how a covering index can significantly increase the performance of CUBRID. In short, a covering index allows you to get query results without having to refer to the heap on disk, which reduces the number of I / O operations, which in turn is the most expensive operation in terms of elapsed time.

However, the magic of the Covering Index can only be applied when all the columns whose values are requested in the request are in the same composite index . In other words, their values must be in one sheet of the index tree. For example, look at the following query.

SELECT a, b FROM tbl
WHERE a > 1 AND a < 5
AND b < 'K'
ORDER BY b;

As you can see, all the columns used in this query are in the same multi-column index that we created at the very beginning.
The WHERE clause specifies only those pegs that are in the same multi-column index.
Also in the condition of the ORDER BY operator, only the column that is in the same multi-column index is indicated.

And so, if we run this query:

As part of the normal index scanning process, CUBRID will first find all the leaves in the index tree, in which a> 1 and a <5 .
Then, among this result, he will select the leaves in which b <'K' .
Since the values of columns a and b have already been obtained in the process of scanning the index, it is no longer necessary to go and look at the heap on the disk to get these values. Thus, after the second step, the server immediately proceeds to sorting the results by column b .
Then returns values.

Let's now see how a covering index can improve server performance. For the same example above, we assume that a very large amount of data is stored in the database.

Q1. Below is a query that uses the columns specified in a single composite index.

SELECT a, b FROM tbl WHERE a BETWEEN ? AND ?

Q2. And now the query, where column a is indexed, but column c is not.

SELECT a, c FROM tbl WHERE a BETWEEN ? AND ?

The following graph shows how quickly requests can be processed if they use a covering index.

Cover index performance in CUBRID

LIMIT optimization processing

Key Limit

CUBRID 8.4.0 has a very “smart” LIMIT operator analyzer. This analyzer was very optimized, which allows you to process only the number of records required in the condition of the LIMIT operator, at which the server returns results immediately. For example, look at the following query.

SELECT * FROM tbl
WHERE a = 2
AND b < 'K'
ORDER BY b
LIMIT 3;

CUBRID first finds the first leaf in the index tree, in which a = 2 .
Since the index includes the values of column b , which are already sorted, there is no need to sort the results separately.
The server passes only the first 3 index keys and stops there, since there is no need to return more than 3 results.
Then the server is already heaped to get the values of all the other columns. Thus, only 3 entries will be affected on the disc.

Multi-Range Scan (Multi Range)

Optimizing multi-band scanning is another major improvement in the new CUBRID 8.4.0. When users fill in data that lies in a certain range, for example, between a> 0 AND a <5 , the task is fairly easy for most DBMSs. However, things get much more complicated when scattered ranges are included, for example, a> 0 AND a <5 AND a = 7 AND a> 10 AND a <15 . This is where CUBRID is different. A new optimization function in-place sorting (sorting on the fly) allows you to solve two problems at once:

Key Limit
As well as sorting records on the fly

For example, consider the following query.

SELECT * FROM tbl
WHERE a IN (2, 4, 5)
AND b < 'K'
ORDER BY b
LIMIT 3;

Since all keys in the index tree are sorted, the server will start scanning starting from the first sheet, where a = 2 (see the figure below).
Since it is necessary to get only 3 rows of the table, sorted by column b , the server will sort the results that satisfy the condition a IN (2, 4, 5) AND b <'K' , on the fly.
1. At the very beginning, the server will find a record (2, AAA), which gives the 1st result.
2. Then finds the record (2, ABC), which gives the 2nd result.
3. Then finds the record (2, CCC), which gives the 3rd result.
4. Since the server has already found 3 records, it jumps to the next range, in order to search for records, where the values of column b will be less than the values already found.
1. First, the server will find a record (4, DAA), which is greater than the last value of column b of the records already found. Therefore, this range immediately disappears, and the server jumps to the next range.
2. Finds a record (5, AAA) that is smaller than ABC and CCC. Therefore removes the last entry and inserts this entry in the appropriate place.
3. The next entry (5, BBB) is already more than the last entry of preliminary results. Therefore, this scan of this range is completed. The entire search is also completed, since there are no other ranges needed for scanning.
Since all the results are already sorted, it remains only to look at the heap and get the values of the other columns.

Thanks to this multi-band scanning feature with on-the-fly sorting, CUBRID can perform a very fast search among a large amount of data.

Multi-band scanning in CUBRID

Test results

In Korea, there is a very popular Web service Me2Day, an analogue of Twitter. The following test results were obtained based on real data from this service.

Like Twitter, Me2Day has a table of posts where all the tweets are stored. Statistics of users and their relationships show that:

50% of users follow 1-50 users.
40% of users follow 51-2000 users.
10% of users follow 2001+ users.

The following index has been created for this table.

INDEX (author_id, registered DESC)

The most important request, which is most often requested in both Twitter and Me2Day services, is “to show the last 20 random posts of all users I follow ”. Below is this very query.

SELECT * FROM posts
WHERE author_id IN (?, ?, ..., ?) AND registered < :from ORDER BY reg_date DESC
LIMIT 20;

The test was launched for 10 minutes, during which this request was continuously processed. Below is a graph of test results that compares the UNION operator in MySQL , which is on average 4 times faster than the IN operator in MySQL, with the IN operator in CUBRID . For one thing, comparing with the previous version, you can see how the performance of CUBRID 8.4.0 has increased after the implementation of multi-band scanning.

Test results of the operator IN in CUBRID

Test results of the operator IN in CUBRID

After such positive results, we replaced the Me2Day MySQL server, which is responsible for the daily operation of the service, with the CUBRID server. Next time I will tell about this test in more detail. In the meantime, you can also read about it in English on the main site .

Optimization of condition processing in GROUP BY

The new version of CUBRID 8.4.0 significantly accelerated the processing of queries containing ORDER BY and GROUP BY operators. When the columns included in a multi-column index are used in ORDER BY and GROUP BY terms, there is no need to sort the values, since they are already sorted in the index tree. This optimization can significantly increase the processing performance of the entire request. We can look at the work of the next request.

SELECT COUNT(*) FROM tbl
WHERE a > 1 AND a < 5
AND b < 'K' AND c > 10000
GROUP BY a;

As part of the normal index scanning process, CUBRID will first find all the leaves in the index tree, in which a> 1 and a <5 .
Using the OID values, the server will contact the heap to get the values in column c .
Then CUBRID will find all those records in which c> 10,000 .
Since all the necessary values are already sorted, the GROUP BY operation will be performed immediately without prior sorting.
Then the server returns the results.

Increase Developer Productivity

In addition to improving the performance of the entire system, the new version of CUBRID 8.4.0 supports more than 90% of the SQL syntax of the MySQL DBMS. We also implemented extended support for implicit type conversions so that developers focus on improving the functionality of their applications, while CUBRID will do all the internal conversions. Below are a few examples of new syntax.

Implicit type conversion
CREATE TABLE x (a INT); INSERT INTO x VALUES ('1');
SHOW Requests
SHOW TABLES; SHOW COLUMNS; SHOW INDEX; …
ALTER TABLE ... CHANGE / MODIFY COLUMN ...
CREATE TABLE t1 (a INTEGER); ALTER TABLE t1 CHANGE ab DOUBLE; ALTER TABLE t2 MODIFY col1 BIGINT DEFAULT 1;
UPDATE ... ORDER BY
UPDATE t SET i = i + 1 WHERE 1 = 1 ORDER BY i LIMIT 10;
DROP TABLE IF EXISTS ...
DROP TABLE IF EXISTS history;

In total, there are 23 new DATE / TIME syntaxes in the new version, 5 associated with strings, and 5 new aggregation functions. The entire list of new syntax can be found in the blog on off. site.

Improving High Availability Availability

Improving next key lock

Also in CUBRID 8.4.0, the locking mechanism was significantly improved to minimize the occurrence of stasis. For example, in a High Availability environment, stagnation will not occur between transactions that enter data into the same table at the same time.

Conclusion

As you probably already understood, the new version of CUBRID 8.4.0 clearly surpasses all previous versions in performance, reliability, and design convenience. CUBRID is designed to be used in Web applications and services, therefore all major developments, improvements and optimization are carried out in the area of functions often used in Web applications (for example, as IN operators, LIMIT restrictions, grouping and sorting, as well as High Availability), superiority in the performance of which is proved by the results of comparative tests.

If you have certain questions, write in the comments. I would be very happy to clarify everything!

Source: https://habr.com/ru/post/117684/

All Articles