Examining JOIN Performance in MySQL

I think it’s no secret that JOIN is considered to be a rather expensive operation, and many novice programmers (who use MySQL) like to be intimidated, that JOIN is bad, and it’s best to do without them, if possible.

Let's explore this issue in more detail and see if a JOIN is really bad, and when you should think about it at all.

What I will not write

To begin, I would like to immediately say that I will not do:

- MySQL tuning: all settings are taken by default ( including innodb_buffer_pool_size = 8 MB, etc. )
- integration with programming languages: all requests will be made via the MySQL Sequel Pro client, and time will be measured based on its readings
- obvious things, like joining when sampling 3 lines: I don’t want to consider saving on matches, or not - we’ll consider saving ten times rather than tens of percent
')

Initial conditions

We will have two simple disgraceful tables (the tables were made just for example and filled with random data ):

Calls - 10 million lines:

id	INT PRIMARY KEY AUTO_INCREMENT
user_id	Int
cost	Int
call_dt	DATETIME
tariff_id	Int

Users - 100 thousand lines:

id	INT PRIMARY KEY AUTO_INCREMENT
birthdate	DATE
name	VARCHAR (10)
sex	ENUM ('M', 'F')

The names, I think, speak for themselves, the indexes are only on the primary key (id). Purely in principle, we could create some indexes that would help us fulfill those queries that we will explore, but our goal is different, namely, to investigate how fast the JOIN works.

Table types

For the purposes of the study, the Calls table was taken in two types - MyISAM and InnoDB, and the Users table, to which we are joining, three types - MyISAM, InnoDB and MEMORY

First blood

All tests were performed on my laptop, from MySQL version 5.5.9 on Mac OS X with default settings from MAMP. All tables fit into memory completely, requests were run several times to make sure that everything gets into the cache.

To begin with, let's just look at the speed of browsing rows in MyISAM and InnoDB by running such a query (remember that there are no indexes for price or user_id - we measure the speed of FULL SCAN in MySQL)

SELECT MAX(cost) FROM calls WHERE user_id = 5000; --  №1

And such a request:

 SELECT MAX(cost) FROM calls; --  №2

Results (error less than 5%):

No	InnoDB, ms	MyISAM ms
one	5 360	862
2	5 390	1,150

I don’t want to study in more detail why the aggregation samples in MyISAM are so affected by the number of WHERE lines that fell under the WHERE, but the fact remains that with full sequential table scans, MyISAM is 4.5 times faster than InnoDB. Hence the opinion that InnoDB is a “brake”, and that MySQL itself (with MyISAM) is very smart.

Mini-output: With full sequential browsing of MyISAM 5 times faster than InnoDB

JOIN

Let's now connect to the case the Users table - no wonder we created it.
We will investigate requests for such a plan:

 SELECT MAX(calls.cost) FROM calls JOIN users ON calls.user_id = users.id WHERE calls.cost > %d AND users.sex = 'M'

The parameter for cost is chosen in such a way that a certain percentage of entries in the Calls table are selected.

If we make the query given above, to the muscle, with the cost parameter, which will correspond to N% of rows, MySQL will make the join only in N% of cases, and it will not make JOIN for the rest (100-N)% of rows. Here is a MySQL smart.

Let's get to the test results with JOIN:

FULL SCAN + JOIN 0.1% of rows

Users \ Calls	InnoDB, ms (only JOIN, ms)	MyISAM, ms (only JOIN, ms)
InnoDB	5,450 (~ 0)	857 (~ 0)
MyISAM	5,450 (~ 0)	937 (~ 100)
MEMORY	5,350 (~ 0)	845 (~ 100)

So far, times differ very little from FULL SCAN. It is understandable - after all, JOIN is made for a meager number of rows.

FULL SCAN + JOIN 1% of rows

Users \ Calls	InnoDB, ms (only JOIN, ms)	MyISAM, ms (only JOIN, ms)
InnoDB	5,660 (300)	999 (140)
MyISAM	6,530 (1,200)	1,810 (950)
MEMORY	5,460 (100)	911 (65)

Funny, yes? Only 1% of the rows are jointed, and the results for MyISAM + MyISAM are 2 times more than for MyISAM + InnoDB. Funny enough, the JOIN to InnoDB in this case is faster than the JOIN to MyISAM. And we have not started testing this yet :)!

FULL SCAN + JOIN 10% of rows

Users \ Calls	InnoDB, ms	MyISAM ms
InnoDB	7,230 (1,900)	2 190 (990)
MyISAM	16,100 (8,800)	10,200 (9,000)
MEMORY	6,080 (700)	1,440 (580)

It’s a shame for the country (MyISAM), but what can I do ... It turns out that MyISAM is not that smart ... Or not? Let's look at the results of the final testing.

FULL SCAN + JOIN 100% rows

Users \ Calls	InnoDB, ms	MyISAM ms
InnoDB	18,000 (14,650)	12,500 (11,655)
MyISAM	100,000 (96,650)	91,600 (90,750)
MEMORY	10,500 (7,150)	5,280 (4,435)

Pay attention to the monstrous (!) Sampling times with JOIN with MyISAM. But InnoDB was pleasantly surprised - due to its architecture, the JOIN is not a very expensive operation for InnoDB. To be honest, I was very surprised when I got such a result that the second fastest JOIN option is when InnoDB is joining MyISAM.

Well, with MEMORY, I think everything is clear - MEMORY gives an overhead of 525% (4,435 ms) on a join over PK, InnoDB gives an overhead of 1,380% (11,655 ms), but I’m ashamed to say about MyISAM.

Replacing JOIN with IN (...)

A keen eye could notice that for our scenario (when we do JOIN to users, to weed out all women from the table of calls) there is a way without a JOIN, but with a simple enumeration of all user_id in IN ():

 SET group_concat_max_len = 10000000; SELECT GROUP_CONCAT(id) FROM users WHERE sex = 'M'; --    50 ,  -   SELECT MAX(cost) FROM calls WHERE user_id IN(%s); -- %s

Such a query to a table of the MyISAM type will work for 3,730 ms, and to InnoDB - for 8,290 ms. A keen eye may notice that this method is faster than the JOIN to MEMORY, although not by much. This method is suitable if you have a very fast connection to MySQL (for example, a UNIX socket). In other cases, IMHO, with such numbers of records, it is obvious that driving a huge amount of traffic between the MySQL server and the application server is not a good idea.

You can make your own conclusions : if you need a lot of joining, and for some reason you need high performance, use a combination of MyISAM + InnoDB, or just InnoDB + InnoDB, if there are more than one join. Well, if you are extreme, you can use MyISAM + MEMORY for maximum performance in all scenarios. There is another option with MEMORY + MEMORY, but for a large number of records I would not do that;).

UPD: I would like to thank the hombrauser homm for very useful comments, such as this . In general, I highly recommend reading the comments, they explain a lot of things that for some reason were not obvious to the readers:
- query cache is disabled
- JOIN is done on the primary key
- indexes on the table Calls are not created, because we do not set the task to optimize a particular query

Source: https://habr.com/ru/post/122210/

All Articles