Visualize and understand Hash Match Join

This post is the third part of the series on connection operators (be sure to read part 1 - nested loops joins , and part 2 - merge joins ). The translation of the article was prepared specifically for students of the “MS SQL Server Developer” course.

Hash Match Joins are reliable workhorses of physical connection operators.
While Nested Loops Join will fail, if there is too much data to put in memory, and Merge Join will require the input data to be sorted, Hash Match will connect any data that you submit to the input (provided that the equality predicate is executed for the connection and so far there is enough free space in your tempdb).

Watch a video on the topic on YouTube

The hash match algorithm consists of two stages, which work as follows:

During the first “Build” phase, SQL Server creates a hash table in memory from one of the input tables (usually the smallest of the two). The hashes are calculated based on the keys of the input data and then stored together with the string in the hash table in the appropriate block. In most cases, there is only one row of data in each block, except when:

There are lines with duplicate keys.
The hash function creates a collision, and completely different keys get the same hash (this is rare, but possible).

After creating the hash table, the “Probe” stage begins (validation). In the second stage, SQL Server calculates the hash of the key for each row in the second input table and checks if it exists in the hash table created in the first stage. If there is a match for this hash, then it is checked whether the keys of the string (s) in the hash table and the rows from the second table match (this check must be performed due to possible collisions).
A common variant of the hash match algorithm occurs when the build phase fails to create a hash table that can be fully stored in memory:

This happens when there is more data than can be allocated in memory, or when SQL Server provides insufficient memory for a hash match connection.

When SQL Server does not have enough memory to store the hash table during the build phase, it continues to work, storing some blocks in memory and placing other blocks in tempdb.
During the validation phase, SQL Server connects the data rows from the second table to blocks in the build phase, which are in memory. If the block to which this string potentially matches is currently not in memory, SQL Server writes this string to tempdb for later comparison.

When matches for one block are completed, SQL Server clears this data from memory and loads the following blocks into memory. It then compares the rows of the second table (currently in tempdb) with the new blocks in memory.

As is the case with each physical connection operator in this series, details about the hash match operator can be found in Hugo Cornelis help (Hugo Kornelis) about hash match .

What does Hash Match Join show?

Knowing the intrinsic features of how the match join hash works allows us to determine what the optimizer thinks about our data and higher-level join operators, helping us focus on tuning performance.

Here are a few scenarios that should be considered the next time you see that the hash match join is used in your execution plan:

While a hash match join can aggregate huge data sets, building a hash table from the first input table is a blocking operation that prevents subsequent statements from being executed. In this regard, I always check if there is an easy way to convert hash match to nested loops or merge join. Sometimes this is not possible (too many lines for nested loops or unsorted data for merge join), but you should always check whether a simple index change or improved estimates from statistics update causes SQL Server to choose a non-blocking hash match join operator
Hash match joins are great for large connections, as they can be sent to tempdb, this allows them to perform connections with large data sets that can cause the connection to fail in memory using nested loops or merge join operators.
- If you see the hash match join operator, this means that SQL Server considers the input data to be too large. If we know that our input data should not be so big, then it’s worth checking whether there are problems with statistics or estimates that cause SQL Server to choose the wrong hash match join .
When running in memory, the hash match join is quite effective. Problems arise when the build phase goes into tempdb.
- If I notice a small yellow triangle indicating that the connection goes into tempdb, I see why it happened: if there is more data than memory available, there is little that can be done, but if the allocated memory seems unreasonably small, it may mean that we probably have another problem with statistics, which leads to too low estimates for the SQL Server optimizer.

Thank you for reading the article. You might also like my twitter .

We covered this topic in the previous open lesson . We are waiting for your comments!

Source: https://habr.com/ru/post/459314/

All Articles

Visualize and understand Hash Match Join

What does Hash Match Join show?

More articles: