I think it’s no secret that JOIN is considered to be a rather expensive operation, and many novice programmers (who use MySQL) like to be intimidated, that JOIN is bad, and it’s best to do without them, if possible.
Let's explore this issue in more detail and see if a JOIN is really bad, and when you should think about it at all.
What I will not write
To begin, I would like to immediately say that I will not do:
- MySQL tuning: all settings are taken by default (
including innodb_buffer_pool_size = 8 MB, etc. )
- integration with programming languages:
all requests will be made via the MySQL Sequel Pro client, and time will be measured based on its readings- obvious things, like joining when sampling 3 lines:
I don’t want to consider saving on matches, or not - we’ll consider saving ten times rather than tens of percent')
Initial conditions
We will have two simple disgraceful
tables (the
tables were made just for example and filled with random data ):
Calls - 10 million lines:
id | INT PRIMARY KEY AUTO_INCREMENT |
user_id | Int |
cost | Int |
call_dt | DATETIME |
tariff_id | Int |
Users - 100 thousand lines:
id | INT PRIMARY KEY AUTO_INCREMENT |
birthdate | DATE |
name | VARCHAR (10) |
sex | ENUM ('M', 'F') |
The names, I think, speak for themselves, the indexes are only on the primary key (id). Purely in principle, we could create some indexes that would help us fulfill those queries that we will explore, but our goal is different, namely, to investigate how fast the JOIN works.
Table types
For the purposes of the study, the
Calls table was taken in two types - MyISAM and InnoDB, and the
Users table, to which we are joining, three types - MyISAM, InnoDB and MEMORY
First blood
All tests were performed on my laptop, from MySQL version 5.5.9 on Mac OS X with default settings from MAMP. All tables fit into memory completely, requests were run several times to make sure that everything gets into the cache.
To begin with, let's just look at the speed of browsing rows in MyISAM and InnoDB by running such a query (remember that there are no indexes for price or user_id - we measure the speed of FULL SCAN in MySQL)
SELECT MAX(cost) FROM calls WHERE user_id = 5000;
And such a request:
SELECT MAX(cost) FROM calls;
Results (error less than 5%):
No | InnoDB, ms | MyISAM ms |
---|
one | 5 360 | 862 |
2 | 5 390 | 1,150 |
I don’t want to study in more detail why the aggregation samples in MyISAM are so affected by the number of WHERE lines that fell under the WHERE, but the fact remains that with full sequential table scans, MyISAM is 4.5 times faster than InnoDB. Hence the opinion that InnoDB is a “brake”, and that MySQL itself (with MyISAM) is very smart.
Mini-output: With full sequential browsing of MyISAM 5 times faster than InnoDB
JOIN
Let's now connect to the case the Users table - no wonder we created it.
We will investigate requests for such a plan:
SELECT MAX(calls.cost) FROM calls JOIN users ON calls.user_id = users.id WHERE calls.cost > %d AND users.sex = 'M'
The parameter for cost is chosen in such a way that a certain percentage of entries in the Calls table are selected.
If we make the query given above, to the muscle, with the cost parameter, which will correspond to N% of rows, MySQL will make the join only in N% of cases, and it will not make JOIN for the rest (100-N)% of rows. Here is a MySQL smart.
Let's get to the test results with JOIN:
FULL SCAN + JOIN 0.1% of rows
Users \ Calls | InnoDB, ms (only JOIN, ms) | MyISAM, ms (only JOIN, ms) |
---|
InnoDB | 5,450 (~ 0) | 857 (~ 0) |
MyISAM | 5,450 (~ 0) | 937 (~ 100) |
MEMORY | 5,350 (~ 0) | 845 (~ 100) |
So far, times differ very little from FULL SCAN. It is understandable - after all, JOIN is made for a meager number of rows.
FULL SCAN + JOIN 1% of rows
Users \ Calls | InnoDB, ms (only JOIN, ms) | MyISAM, ms (only JOIN, ms) |
---|
InnoDB | 5,660 (300) | 999 (140) |
MyISAM | 6,530 (1,200) | 1,810 (950) |
MEMORY | 5,460 (100) | 911 (65) |
Funny, yes? Only 1% of the rows are jointed, and the results for MyISAM + MyISAM are 2 times more than for MyISAM + InnoDB. Funny enough, the JOIN to InnoDB in this case is faster than the JOIN to MyISAM. And we have not started testing this yet :)!
FULL SCAN + JOIN 10% of rows
Users \ Calls | InnoDB, ms | MyISAM ms |
---|
InnoDB | 7,230 (1,900) | 2 190 (990) |
MyISAM | 16,100 (8,800) | 10,200 (9,000) |
MEMORY | 6,080 (700) | 1,440 (580) |
It’s a shame for the country (MyISAM), but what can I do ... It turns out that MyISAM is not that smart ... Or not? Let's look at the results of the final testing.
FULL SCAN + JOIN 100% rows
Users \ Calls | InnoDB, ms | MyISAM ms |
---|
InnoDB | 18,000 (14,650) | 12,500 (11,655) |
MyISAM | 100,000 (96,650) | 91,600 (90,750) |
MEMORY | 10,500 (7,150) | 5,280 (4,435) |
Pay attention to the monstrous (!) Sampling times with JOIN with MyISAM. But InnoDB was pleasantly surprised - due to its architecture, the JOIN is not a very expensive operation for InnoDB. To be honest, I was very surprised when I got such a result that the second fastest JOIN option is when InnoDB is joining MyISAM.
Well, with MEMORY, I think everything is clear - MEMORY gives an overhead of 525% (4,435 ms) on a join over PK, InnoDB gives an overhead of 1,380% (11,655 ms), but I’m ashamed to say about MyISAM.
Replacing JOIN with IN (...)
A keen eye could notice that for our scenario (when we do JOIN to users, to weed out all women from the table of calls) there is a way without a JOIN, but with a simple enumeration of all user_id in IN ():
SET group_concat_max_len = 10000000; SELECT GROUP_CONCAT(id) FROM users WHERE sex = 'M';
Such a query to a table of the MyISAM type will work for 3,730 ms, and to InnoDB - for 8,290 ms. A keen eye may notice that this method is faster than the JOIN to MEMORY, although not by much. This method is suitable if you have a very fast connection to MySQL (for example, a UNIX socket). In other cases, IMHO, with such numbers of records, it is obvious that driving a huge amount of traffic between the MySQL server and the application server is not a good idea.
You can make your own
conclusions : if you need a lot of joining, and for some reason you need high performance, use a combination of MyISAM + InnoDB, or just InnoDB + InnoDB, if there are more than one join. Well, if you are extreme, you can use MyISAM + MEMORY for maximum performance in all scenarios. There is another option with MEMORY + MEMORY, but for a large number of records I would not do that;).
UPD: I would like to thank the hombrauser
homm for very useful comments, such
as this . In general, I highly recommend reading the comments, they explain a lot of things that for some reason were not obvious to the readers:
- query cache is disabled
- JOIN is done on the primary key
- indexes on the table Calls are not created, because we do not set the task to optimize a particular query