📜 ⬆️ ⬇️

MariaDB at Google Summer of Code: GSoC16 Results

A somewhat belated report about MariaDB, our projects at the last GSoC, our students, kings and cabbage .

Last - 2015 - GSoC we have turned out very unfortunate. There were a total of eight students, but many failed in the middle of summer (at the midterm evaluation), and three were from one university in Cameroon (and obviously from one course), with excellent applications, but they did not do anything together, from the word "quite" , well, maybe one line of comment was corrected in a month and a half. And after the failure of the midterm they tried to protest our decision to Google, and even sent us a letter with vague threats. They say that it’s not good to fail so many students, to spoil the image of yourself, next year Google won't let you.

But Google did not obey them and gave them. And this year, probably in contrast, turned out to be extremely successful.

This year we were allocated 10 slots, which were claimed by 54 students. Considering the sad experience of the past year, we ensured that all the accepted students were from different universities - if they were to tinker, then independently of each other. And it was as always - some students wrote to our mailing list in advance, appeared on IRC, discussed projects. Two students even made working prototypes. Such active, of course, were all taken. The rest were left behind, the competition was big. From the accepted ten students it turned out that:


Daniel Medvedev: NO PAD collations


As written in the documentation (actually a glimpse and between the lines), when comparing lines of different lengths, MariaDB Server (and Percona Server and MySQL) complement the shorter line with spaces until the lengths become equal (that is, of course, nothing is actually added, but the result is the same, as if supplemented). It looks like this:
')
MariaDB [test]> select "abcd" > "abc"; +----------------+ | "abcd" > "abc" | +----------------+ | 1 | +----------------+ 1 row in set (0.00 sec) MariaDB [test]> select "abc\t" > "abc"; +-----------------+ | "abc\t" > "abc" | +-----------------+ | 0 | +-----------------+ 1 row in set (0.00 sec) 

In the second case, the long string is lexicographically less than the short, because '\t' less than a space. It's all good and standard. But not really. In the SQL standard collation has such a property, it can be PADSPACE or NOPAD. In the first case, a short line is achieved with spaces when comparing, in the second, respectively, this does not occur. So it turns out that so far all comparisons have been made as if they were PADSPACE. And now Daniel Medvedev has made us NO PAD collations. The effect is not always noticeable, but it is:

 MariaDB [test]> set collation_connection=utf8_general_nopad_ci; Query OK, 0 rows affected (0.00 sec) MariaDB [test]> select "abc\t" > "abc"; +-----------------+ | "abc\t" > "abc" | +-----------------+ | 1 | +-----------------+ 1 row in set (0.00 sec) 

This project was completely finished and now it is already in MariaDB 10.2.2.

Galina Shalygina: Condition pushdown into non-mergeable views


Galina is not new to us. She is well acquainted with one of the most complex subsystems of MariaDB - the query optimizer and is the author of the implementation of the WITH operator, recursive and non-recursive CTE .

For this GSoC Galina chose to optimize views for which you cannot merge with the main query. Best of all, of course, perform a query with views using the merge method (view merge), in which the view is substituted as a macro in the main query. For example, such a query:

 CREATE VIEW v1 AS SELECT a1+b1 AS c1, d1 FROM t1 WHERE e1>5; SELECT x2 FROM v1, t2 WHERE c1=y2 AND d1=10; 

You can convert to:

 SELECT x2 FROM t1, t2 WHERE a1+b1=y2 AND d1=10 AND e1>5 ; 
And here the optimizer sees all the tables used, all the conditions, and can choose the best plan. But not all views can be optimized in this way, for example, a merge cannot be done if the view uses the GROUP BY . In such cases, you have to execute the view as a separate query, saving the result into a temporary table, and then use it in the main query. As a result, the optimizer does not see the “whole picture” and is forced to consider both queries separately. That is, in the previous example, he would not be able to use the condition d1=10 when sampling from table t1 .

This is where the new optimization comes into effect. If the view cannot be substituted into the main query and the optimizer is forced to read it separately using a temporary table, it will now try to use the conditions from the main query when filling this temporary table, pushing them in some way inside the view. That is, in the example above, the optimizer will save the result of such a query in a temporary table:

 SELECT a1+b1 AS c1, d1 FROM t1 WHERE e1>5 AND d1=10; 

What can speed up the execution of a request for orders.

This work was also completely completed, and it is already available in 10.2.3.

Shubham Barai, Sachin Setiya: Arbitrary length UNIQUE constraints


The task was to implement the uniqueness constraints of arbitrary length. Usually, when someone writes a UNIQUE(a,b,c) when creating a table, MariaDB (both Percona and MySQL) create an index for the fields a, b, c. Hence the restriction is obtained - the uniqueness of the combination of fields can be guaranteed only when the total length of these fields does not exceed the maximum allowable length of the index key. But this is illogical! The "index" is a way to optimize data access, it is not at all in the SQL standard. And UNIQUE is a logical restriction of data, directly from the SQL standard, and it is completely incomprehensible why it should suffer because of some limitations of a specific implementation of query acceleration. Our solution was to create another field, hidden, users do not need to show it, and write a hash of what UNIQUE is for. And on this field you can already create a regular index. When writing in it, you need to look for any conflicts, and if there is, pull conflicting records and compare the values ​​directly, the hashes may in fact coincide coincidentally. Moreover, MyISAM already knows how to do it, even does it — this is how SELECT DISTINCT works. But for user tables it was not available. What I wanted to fix.

This task has a difficult situation. We have offered it at GSoC for several years now, nobody wanted it. And this year, it received two applications, and both are quite sensible. Then both students wrote a working (!) Prototype - for the first time in my eight years as a mentor at GSoC. And then, again, both (!) Stated that, they say, this is too simple, but let them also do it in InnoDB. I had to take both. But I slightly corrected one of them.

Shubham dived into the wilds of InnoDB, and began to deal with its indexes, logs, transactions, recovery, and other magic. And Sachin pretended to forget that inside MyISAM this functionality already exists, threw out his prototype, and began to implement everything at the server level, in an engine-independent way. As a result, both completed the task successfully, the low-level implementation is a little bit faster, and the high-level one works with different engines, allows the optimizer to use these invisible fields for query optimization, and brings with it other interesting buns. We are still not fully determined what we take - everything is so tasty ... So in 10.2 it most likely will not fall. It will be at 10.3, probably.

Sachin Setiya: Hidden columns


And this is the aforementioned bun. To make an invisible field (and it is also virtual, you can not write this hash to the disk), we thought up to add what is called Invisible columns in Oracle, and Implicitly Hidden in DB2. If it is simple, then when creating a field, you can specify that it is “invisible”. After this, SELECT * and INSERT INTO table VALUE (...) will be overlooked. Do not show or, respectively, do not write the specified values ​​in it. But if you mention this field by name - in any team - then it behaves like a normal visible field. This was conceived for the expansion of the scheme so as not to break the applications running (possibly also with closed source) applications. Hidden fields are added, old requests do not see them, and new ones call them by name, and everything works.

Only we have somewhat expanded this idea by adding different levels of invisibility. The first level is as above as in Oracle and DB2. Well, there is still zero - everything is visible, it is not interesting. The second level - the field is visible only in SELECT and only if explicitly mentioned by name. That is, it is not visible in INSERT / UPDATE - its value cannot be changed. And it is not visible in CREATE / ALTER - it cannot be created or deleted. It is created automatically. Example - ROWID and other pseudo-fields. And the third level, the field is not visible anywhere else. This is exactly what was needed to create invisible fields with hash. And you can also use them for functional indices of the type INDEX(a+b) - also, create an invisible virtual field and index it. And you can ... In general, when this bun appeared, the ideas did not keep waiting for themselves.

As part of the previous task, this is also likely to fall only in 10.3.

Varun Gupta: SQL aggregate functions


Also an idea that was thought over for a long time. But I really managed to formulate only this GSOC. The standard says how to create user-defined functions in SQL, and MariaDB, of course, is able to:

 CREATE FUNCTION COUNT_X(x INT) RETURNS INT RETURN (SELECT COUNT(*) FROM data_table WHERE value=x); 

But these are ordinary features. But you cannot create your aggregate functions in SQL. At least in the standard it is not. In Oracle there is, and in PostgreSQL there, and, for example, in HSQLDB too. I wanted to be in MariaDB. We thought a lot about syntax - there is no standard, everyone has their own way. Compared like the others. In the end, they decided not to show off, and did it like everyone else - that is, in their own way. Of course, our version is the most natural and easiest. Probably those who invented this syntax in Oracle / PostgreSQL / HSQLDB also consider that their option is the best.

It will work like this:

 CREATE AGGREGATE FUNCTION agg_and(x INT) RETURNS INT BEGIN DECLARE z INT DEFAULT 65535; DECLARE CONTINUE HANDLER FOR NOT FOUND RETURN z; LOOP FETCH GROUP NEXT ROW; SET z= (z&x); END LOOP; END 

For comparison, in order to understand where the legs grow from, here is absolutely according to the standard a written non-aggregate function that calculates the AND of all the values ​​of a certain column in the table:

 CREATE FUNCTION col_and() RETURNS INT BEGIN DECLARE x INT; DECLARE z INT DEFAULT 65535; DECLARE CONTINUE HANDLER FOR NOT FOUND RETURN z; DECLARE cur CURSOR FOR SELECT col FROM data_table; OPEN cur; LOOP FETCH cur INTO x; SET z= (z&x); END LOOP; END 

That is, the logic is the same, just a special cursor is fed into the aggregate function that passes through all the values ​​of the group.

Unfortunately, this is also unlikely to fall into 10.2. Although the project is also almost complete.

This year, for some reason, there was not a single project completed by, say, three-quarters. So that and quit sorry and inconvenient to carry. All that was wanted was done. And this is great!

Looking forward to next year ...

Source: https://habr.com/ru/post/312838/


All Articles