
MySQL Optimizer Problems

What I have managed to understand about MySQL over several years of working on its development:


I will describe two serious problems in MySQL's DNA that every user of MySQL 5.1 and 5.5 indirectly runs into (I haven't checked 5.6, but I don't think anything has changed there).

How does MySQL work?
There is MySQL itself: the parser, the optimizer, and the query executor.

There are other parts, but these are the main ones.

There is a Storage Engine: a plugin that implements roughly the following functionality:

- storing and reading records;
- index access: lookups and range scans;
- transactions;
- reporting statistics to the optimizer.
The list is not exhaustive, and the responsibilities are not divided very cleanly, but it will do for our purposes.

The Storage Engine API is a sacred cow that must not be touched.
I want to give you a wonderful example of how "perfectly" this Storage Engine API is designed and what problems arise from it.

Specifically, I studied these problems while fighting this bug: bugs.mysql.com/bug.php?id=63320
If anyone has the strength and desire to dig into both the bug and the source code, I wish them luck.
Those who are lazier and willing to trust my account can simply read on.

So, what information can the Storage Engine report to the optimizer?

- the total number of records in the table;
- for each index, rows_per_key: the average number of records per key value.

Note that the number of unique records cannot be reported directly.
MySQL computes it as the number of records in the table divided by rows_per_key.

Let's write this as code (this is how the value ends up being calculated):
  int rows_per_key = row_count / unique_row_count_engine;  // computed inside the Storage Engine
  int unique_row_count_mysql = row_count / rows_per_key;   // reconstructed inside MySQL
  // or, as a single expression:
  int unique_row_count_mysql = row_count / ((int) (row_count / unique_row_count_engine));

Let me explain: if b is the total number of records and a is the true number of unique records, then the value the optimizer ends up with is
  a2 = (int) b / ((int) (b / a));


Solely because this value cannot be passed directly, we lose precision by rounding the number twice in a row.
Let me explain with an example. Suppose there is a table with a million records.
Consider different numbers of unique keys.
Total records | Unique records | rows_per_key | Unique records (per the optimizer)
1,000,000     | 100,000        | 10           | 100,000
1,000,000     | 200,000        | 5            | 200,000
1,000,000     | 300,000        | 3            | 333,333
1,000,000     | 400,000        | 2            | 500,000
1,000,000     | 500,000        | 2            | 500,000
1,000,000     | 600,000        | 1            | 1,000,000
1,000,000     | 700,000        | 1            | 1,000,000
1,000,000     | 800,000        | 1            | 1,000,000
1,000,000     | 900,000        | 1            | 1,000,000
1,000,000     | 1,000,000      | 1            | 1,000,000
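
You can reproduce this arithmetic in MySQL itself, purely as a demonstration: DIV is MySQL's integer-division operator, and the two nested divisions below are exactly the two roundings described above (the numbers are taken from the table):

  select 1000000 div (1000000 div 300000);  -- 333333 instead of 300,000
  select 1000000 div (1000000 div 600000);  -- 1000000 instead of 600,000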

How do you like that?
There is NO reason to lose precision here other than the Storage Engine API. None.
And the API must not be touched.
This problem applies to any Storage Engine.

That is only half the trouble. Let's go further.
To choose an efficient plan, it is not enough for the optimizer to know the number of unique records and the total number of records.
Often* it also needs to know the number of NULLs, the number of non-NULLs, and, for composite keys, various combinations of these.
* Here "often" means cases such as the following:

- inner joins on nullable columns;
- outer (LEFT / RIGHT) joins;
- subqueries (IN, EXISTS) on nullable columns.

With NULL statistics, one could build various effective optimizations, for example the one sketched below.
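
As a quick illustration of what such an optimization could look like (my own sketch, not something MySQL is claimed to do; t1 and t2 are hypothetical tables): an inner join on a nullable column can never match NULLs, so an optimizer that knows t2.id is mostly NULL could safely reject those rows up front:

  -- same result as the plain join, because NULLs never satisfy t1.id = t2.id
  select * from t1 join t2 on t1.id = t2.id where t2.id is not null;
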
More on that later; first, let's describe the problem itself.

The Storage Engine API has no way to distinguish NULLs from other entries.
Well, almost no way.
InnoDB and MyISAM implement two global variables: innodb_stats_method and myisam_stats_method.
These global variables can take the following values:

- nulls_equal: all NULLs are treated as equal to one another and form a single key-value group (this is the default);
- nulls_unequal: every NULL is treated as a distinct value and forms a group of its own;
- nulls_ignored: NULLs are excluded from statistics collection entirely.
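
For illustration, here is how such a method is applied; note that it is one global switch for the whole server, so it changes statistics collection for every table at once (table a here stands for any InnoDB table):

  set global innodb_stats_method = 'nulls_unequal';
  analyze table a;  -- index statistics are recollected under the new method
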
When I first saw it in the source code, I could not believe my own eyes.
These variables affect statistics collection for ALL tables, and therefore every query that has NULLs in its join keys or subqueries.

Why is this bad?
Here is a simple query:
  select * from a join b on a.id = b.id 

If nulls_ignored is set, everything is fine.
If nulls_equal is set, everything is also fine.
If nulls_unequal is set, things are very bad: the optimizer will choose a pessimistic plan, assuming the join will produce a great many rows.
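
One way to observe this is MySQL's EXPLAIN: its "rows" column holds the optimizer's estimate, which is exactly the number distorted by the NULL handling. Re-running it after switching innodb_stats_method (and re-running ANALYZE TABLE) shows the plan change (a and b are the tables from the query above):

  explain select * from a join b on a.id = b.id;
  -- compare the "rows" column under different innodb_stats_method settings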

Another query:
  select * from a left join b on a.id = b.id 

If nulls_unequal is set, everything is fine.
If nulls_equal or nulls_ignored is set, things are very bad: the optimizer will choose an optimistic plan, assuming the join will produce few rows.

With subqueries, it can go either way.
In fact, the bug I referred to is precisely about how record counts are calculated depending on the innodb_stats_method setting.
The architecture never provided for separating NULLs from everything else, and now the developers torment themselves inventing ever more cunning crutches and heuristics so as to miss less often.

From my point of view, these are poultices for a corpse.
The MySQL optimizer, built directly on top of the AST (abstract syntax tree), is a dead end: it cannot be developed or improved. It is a huge hole that devours a great deal of time and leaves no opportunity to improve query performance.

The Storage Engines themselves are fast, but the optimizer does not know how to use them: no API, no opportunity, no proper statistics.

I do not believe this situation will fundamentally change. To fix it, you would need to throw out a significant piece of MySQL and rewrite it from scratch.
But that would be a completely different product.

Source: https://habr.com/ru/post/173021/

