SQL Server Joins Part 2

This lesson covers the merge join and the hash join.

Merge Join

Unlike the nested loops join, which supports any join predicate, the merge join requires at least one equijoin predicate. Moreover, the inputs to the merge join must be sorted on the join keys. For example, given the join predicate "T1.a = T2.b," table T1 must be sorted on T1.a and table T2 must be sorted on T2.b.

The merge join works by simultaneously reading and comparing the two sorted inputs one row at a time. At each step, we compare the next row from each input. If the rows are equal, we output a joined row and continue. If the rows are not equal, we discard the lesser of the two inputs and continue. Since the inputs are sorted, we know that the discarded row is less than any of the remaining rows in either input and, thus, can never join.

We can express the algorithm in pseudo-code as:

get first row R1 from input 1
get first row R2 from input 2
while not at the end of either input
    begin
        if R1 joins with R2
            begin
                return (R1, R2)
                get next row R2 from input 2
            end
        else if R1 < R2
            get next row R1 from input 1
        else
            get next row R2 from input 2
    end

Unlike the nested loops join, where the total cost may be proportional to the product of the number of rows in the input tables, with a merge join each table is read at most once and the total cost is proportional to the sum of the number of rows in the inputs. Thus, merge join is often a better choice for larger inputs.

The above pseudo-code implements a one-to-many merge join. After we join two rows, we discard R2 and move to the next row of input 2. This presumes that we will never find another row from input 1 that joins with the discarded row; in other words, input 1 may not contain duplicates. Duplicates in input 2, on the other hand, are acceptable, since we did not discard the current row from input 1. Merge join can also support many-to-many merge joins.
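The one-to-many loop above can be sketched in Python. This is an illustration of the algorithm, not SQL Server's implementation; it assumes each row is a tuple whose first element is the join key, and that input 1 is sorted and unique on that key while input 2 is sorted but may contain duplicates:

```python
def merge_join_one_to_many(input1, input2):
    """One-to-many merge join: input1 sorted and unique on the key,
    input2 sorted on the key (duplicates allowed)."""
    results = []
    i, j = 0, 0
    while i < len(input1) and j < len(input2):
        r1, r2 = input1[i], input2[j]
        if r1[0] == r2[0]:
            results.append((r1, r2))  # rows join: output and discard R2
            j += 1                    # keep R1 in case input2 has duplicates
        elif r1[0] < r2[0]:
            i += 1                    # R1 can never join a remaining row
        else:
            j += 1                    # R2 can never join a remaining row
    return results

t1 = [(1, 'a'), (2, 'b'), (4, 'c')]               # unique keys
t2 = [(1, 'x'), (2, 'y'), (2, 'z'), (3, 'w')]     # duplicate key 2 is fine
print(merge_join_one_to_many(t1, t2))
# [((1, 'a'), (1, 'x')), ((2, 'b'), (2, 'y')), ((2, 'b'), (2, 'z'))]
```

Note that, just as described above, each input is scanned at most once, so the cost is proportional to the sum of the input sizes.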
In this case, we must keep a copy of each row from input 2 whenever we join two rows. This way, if we later find a duplicate row from input 1, we can play back the saved rows. If, on the other hand, the next row from input 1 is not a duplicate, we can discard the saved rows. We save these rows in a worktable in tempdb; the amount of disk space we need depends on the number of duplicates in input 2.

A one-to-many merge join is always more efficient than a many-to-many merge join since it does not need a worktable. To use a one-to-many merge join, the optimizer must be able to determine that one of the inputs consists strictly of unique rows. Typically, this means that there is either a unique index on the input or an explicit operator in the plan (perhaps a sort distinct or a group by) to ensure that the input rows are unique.

Sort merge join vs. index merge join

There are two ways to get sorted inputs for a merge join: we may explicitly sort the inputs using a sort operator, or we may read the rows from an index. In general, a plan that uses an index to achieve sort order is cheaper than a plan that uses an explicit sort.

Join predicates

Merge join supports multiple equijoin predicates so long as the inputs are sorted on all of the join keys. The specific sort order does not matter so long as both inputs are sorted in the same order. For example, if we have a join predicate "T1.a = T2.a and T1.b = T2.b," we can use a merge join so long as T1 and T2 are both sorted either on "(a, b)" or on "(b, a)."

Merge join also supports residual predicates. For example, consider the join predicate "T1.a = T2.a and T1.b > T2.b." Although we cannot use the inequality predicate as part of a merge join, we can still use the equijoin portion of this predicate to perform a merge join (presuming the inputs are sorted on column "a"). For each pair of rows that join on column "a," we then apply the inequality predicate.
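The play-back logic can be sketched as follows. This is a toy model: a plain Python list stands in for the tempdb worktable, and both sorted inputs may contain duplicate keys:

```python
def merge_join_many_to_many(input1, input2):
    """Many-to-many merge join: both inputs sorted on the key, duplicates
    allowed in both. Matching input2 rows are buffered (the in-memory
    analogue of the tempdb worktable) so a duplicate key in input1 can
    replay them."""
    results = []
    j = 0
    for r1 in input1:
        # Discard input2 rows whose key is below r1's key; they can never join.
        while j < len(input2) and input2[j][0] < r1[0]:
            j += 1
        # Save the run of input2 rows equal to r1's key (the "worktable").
        saved = []
        k = j
        while k < len(input2) and input2[k][0] == r1[0]:
            saved.append(input2[k])
            k += 1
        # Play back the saved rows; j is not advanced past them, so a
        # duplicate key in input1 plays them back again.
        for r2 in saved:
            results.append((r1, r2))
    return results

t1 = [(1, 'a'), (2, 'b'), (2, 'c')]   # duplicate key 2 in input1
t2 = [(2, 'x'), (2, 'y'), (3, 'z')]   # duplicate key 2 in input2
print(merge_join_many_to_many(t1, t2))
# [((2, 'b'), (2, 'x')), ((2, 'b'), (2, 'y')),
#  ((2, 'c'), (2, 'x')), ((2, 'c'), (2, 'y'))]
```

The saved-row buffer is exactly the extra cost a one-to-many join avoids.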
If the inequality evaluates to true, we return the joined row; if not, we discard it.

Outer and semi-joins

Merge join supports all outer and semi-join variations. To implement an outer join, we simply need to track whether each row has joined. Instead of discarding a row that has not joined, we can NULL extend it and output it as appropriate. We can implement a semi-join or an anti-semi-join in a similar way. Merge join also supports a special case for full outer join: in some cases, we generate a merge join for a full outer join even if we have no equijoin predicate.

The following examples use this simple schema:

create table T1 (a int, b int, x char(200))
create table T2 (a int, b int, x char(200))

set nocount on
declare @i int
set @i = 0
while @i < 1000
  begin
    insert T1 values (@i * 2, @i * 5, @i)
    insert T2 values (@i * 3, @i * 7, @i)
    set @i = @i + 1
  end

Let's start with a simple example:

select * from T1 join T2 on T1.a = T2.a option (merge join)

Since we've asked for a merge join with the option hint and since there are no indexes on these two tables, the optimizer must add explicit sorts to the plan:

Rows  Executes
334   1   |--Merge Join(Inner Join, MANY-TO-MANY MERGE:([T1].[a])=([T2].[a]), RESIDUAL:([T2].[a]=[T1].[a]))
1000  1        |--Sort(ORDER BY:([T1].[a] ASC))
1000  1        |    |--Table Scan(OBJECT:([T1]))
668   1        |--Sort(ORDER BY:([T2].[a] ASC))
1000  1             |--Table Scan(OBJECT:([T2]))

Although the rows in both input tables are indeed unique, the optimizer does not know that and cannot enforce it, so we generate a many-to-many join. Note that each table scan executed only once regardless of the number of rows processed. Also note that the sort on T2 returned only 668 of its 1000 rows. What happened? After processing 668 rows from T2, the merge join encountered a row from T2 that is greater than any row in T1. At that point, the merge join read all the remaining rows from T1. Once it reached the end of T1, the merge join exited even though it had not read all of the rows in T2.
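The NULL-extension idea for outer joins can be sketched on top of the one-to-many merge loop (illustrative Python, not SQL Server internals; `None` plays the role of NULL):

```python
def merge_left_outer_join(input1, input2):
    """Left outer merge join: input1 sorted and unique on the key, input2
    sorted on the key. Rows from input1 that never join are NULL-extended
    instead of discarded."""
    results = []
    j = 0
    for r1 in input1:
        joined = False                          # track whether this row joined
        while j < len(input2) and input2[j][0] < r1[0]:
            j += 1                              # discard unjoinable input2 rows
        while j < len(input2) and input2[j][0] == r1[0]:
            results.append((r1, input2[j]))
            joined = True
            j += 1
        if not joined:
            results.append((r1, None))          # NULL extend the unmatched row
    return results

t1 = [(1, 'a'), (2, 'b'), (3, 'c')]
t2 = [(1, 'x'), (3, 'y')]
print(merge_left_outer_join(t1, t2))
# [((1, 'a'), (1, 'x')), ((2, 'b'), None), ((3, 'c'), (3, 'y'))]
```

A semi-join would output only `r1` on a match, and an anti-semi-join only the rows that set `joined` to false, exactly as the text describes.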
Now observe what happens if we create an index on T1:

create unique clustered index T1ab on T1(a, b)

And run the same query again:

Rows  Executes
334   1   |--Merge Join(Inner Join, MANY-TO-MANY MERGE:([T1].[a])=([T2].[a]), RESIDUAL:([T2].[a]=[T1].[a]))
1000  1        |--Clustered Index Scan(OBJECT:([T1].[T1ab]), ORDERED FORWARD)
668   1        |--Sort(ORDER BY:([T2].[a] ASC))
1000  1             |--Table Scan(OBJECT:([T2]))

This is the same plan except that we now need a sort only on T2, since we can use the index on T1 to achieve the sort order. Note that even though we now have a unique index on T1, we still have a many-to-many join. Why? The index guarantees uniqueness on "(a, b)," not on column "a" alone. We could have duplicate values of column "a," so, as long as we join only on "a," we need the many-to-many join.

Now let's try adding an index on T2:

create unique clustered index T2a on T2(a)

With both indexes, we no longer need the hint to get a merge join:

select * from T1 join T2 on T1.a = T2.a

Rows  Executes
334   1   |--Merge Join(Inner Join, MERGE:([T2].[a])=([T1].[a]), RESIDUAL:([T2].[a]=[T1].[a]))
668   1        |--Clustered Index Scan(OBJECT:([T2].[T2a]), ORDERED FORWARD)
1000  1        |--Clustered Index Scan(OBJECT:([T1].[T1ab]), ORDERED FORWARD)

Observe that, with both indexes, we no longer need any sorts. Moreover, the unique index on T2 does guarantee uniqueness on column "a," so we can now do a one-to-many join. Notice that the MANY-TO-MANY keyword is gone in this example. (There is no ONE-TO-MANY keyword in text showplan.) Notice also that the optimizer switched the order of the inputs to put the unique input T2 first so that we can use a one-to-many join.

Hash Join

When it comes to physical join operators, hash join does the heavy lifting. While nested loops join works well with relatively small data sets and merge join helps with moderately sized data sets, hash join excels at performing the largest joins.
Hash joins parallelize and scale better than any other join and are great at maximizing throughput in data warehouses.

Hash join shares many characteristics with merge join. Like merge join, it requires at least one equijoin predicate, supports residual predicates, and supports all outer and semi-joins. Unlike merge join, it does not require ordered input sets; and, while it does support full outer join, it requires an equijoin predicate even for a full outer join.

The hash join executes in two phases: build and probe. During the build phase, it reads all rows from the first input (often called the left or build input), hashes the rows on the equijoin keys, and creates an in-memory hash table. During the probe phase, it reads all rows from the second input (often called the right or probe input), hashes these rows on the same equijoin keys, and looks or probes for matching rows in the hash table. Since hash functions can lead to collisions (two different key values that hash to the same value), we typically must check each potential match to ensure that it really joins:

for each row R1 in the build table
  begin
    calculate hash value on R1 join key(s)
    insert R1 into the appropriate hash bucket
  end
for each row R2 in the probe table
  begin
    calculate hash value on R2 join key(s)
    for each row R1 in the corresponding hash bucket
      if R1 joins with R2
        return (R1, R2)
  end

Note that, unlike the nested loops and merge joins, which immediately begin flowing output rows, the hash join is blocking on its build input: it must completely read and process its entire build input before it can return any rows. Moreover, unlike the other join methods, the hash join requires a memory grant to store the hash table. Thus, there is a limit to the number of concurrent hash joins that SQL Server can run at any given time. While these characteristics are generally not a problem for data warehouses, they are undesirable for most OLTP applications.
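The build/probe pseudo-code above can be sketched in Python (an illustration of the algorithm, not SQL Server's implementation; rows are tuples keyed on their first element):

```python
def hash_join(build_input, probe_input):
    """Two-phase hash join. Build: hash every build row into a bucket table.
    Probe: hash each probe row and check the candidates in its bucket."""
    # Build phase: blocking -- the entire build input is consumed first.
    hash_table = {}
    for r1 in build_input:
        hash_table.setdefault(hash(r1[0]), []).append(r1)
    # Probe phase: joined rows can flow to the output as rows are probed.
    results = []
    for r2 in probe_input:
        for r1 in hash_table.get(hash(r2[0]), []):
            if r1[0] == r2[0]:     # re-check keys: hash collisions are possible
                results.append((r1, r2))
    return results

t1 = [(1, 'a'), (2, 'b'), (4, 'c')]             # smaller input -> build side
t2 = [(2, 'x'), (4, 'y'), (4, 'z'), (5, 'w')]   # larger input -> probe side
print(hash_join(t1, t2))
# [((2, 'b'), (2, 'x')), ((4, 'c'), (4, 'y')), ((4, 'c'), (4, 'z'))]
```

The explicit key comparison in the probe loop corresponds to the RESIDUAL predicate shown in the showplan examples below.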
Hash Join – Memory & Spilling

Before a hash join begins execution, SQL Server tries to estimate how much memory it will need to build its hash table. We use the cardinality estimate for the size of the build input along with the expected average row size to estimate the memory requirement. To minimize the memory required by the hash join, we try to choose the smaller of the two tables as the build table. We then try to reserve this much memory to ensure that the hash join can successfully execute.

What happens if we grant the hash join less memory than it requests, or if the estimate is too low? In these cases, the hash join may run out of memory during the build phase. If the hash join runs out of memory, it begins spilling a small percentage of the total hash table to disk (to a workfile in tempdb). The hash join keeps track of which "partitions" of the hash table are still in memory and which have been spilled to disk. As we read each new row from the build table, we check whether it hashes to an in-memory or an on-disk partition. If it hashes to an in-memory partition, we proceed normally. If it hashes to an on-disk partition, we write the row to disk. This process of running out of memory and spilling partitions to disk may repeat multiple times until the build phase is complete.

Hash Join - Left deep vs. right deep vs. bushy hash join trees

These terms refer to the shape of the query plan, as illustrated by this figure:

[figure: left deep, right deep, and bushy hash join trees]

The shape of the join tree is particularly interesting for hash joins as it affects the memory consumption. In a left deep tree, the output of one hash join is the build input to the next hash join. Because hash joins consume their entire build input before moving to the probe phase, in a left deep tree only adjacent pairs of hash joins are active at the same time. For example, in the above picture, we begin by building the hash table for HJ1.
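The spilling mechanism described above can be sketched with a toy model. Everything here is illustrative: a plain list stands in for the tempdb workfile, a row-count cap stands in for the memory grant, and the partition count and victim-selection policy are arbitrary choices, not SQL Server's:

```python
NUM_PARTITIONS = 4

def build_with_spilling(build_input, memory_limit):
    """Build phase of a hash join with partition spilling (toy model).
    Rows are assigned to partitions by key hash; when the in-memory row
    count exceeds the 'memory grant', a whole partition is spilled, and
    later rows hashing to it go straight to 'disk'."""
    in_memory = {p: [] for p in range(NUM_PARTITIONS)}
    spilled = {}                        # partition -> rows written to "disk"
    rows_in_memory = 0
    for row in build_input:
        p = hash(row[0]) % NUM_PARTITIONS
        if p in spilled:
            spilled[p].append(row)      # on-disk partition: write row to disk
            continue
        in_memory[p].append(row)        # in-memory partition: proceed normally
        rows_in_memory += 1
        if rows_in_memory > memory_limit:
            # Out of memory: spill one partition (here, the largest) to disk.
            victim = max(in_memory, key=lambda q: len(in_memory[q]))
            spilled[victim] = in_memory.pop(victim)
            rows_in_memory -= len(spilled[victim])
    return in_memory, spilled

rows = [(i, 'r%d' % i) for i in range(20)]
in_mem, on_disk = build_with_spilling(rows, memory_limit=10)
print(sum(len(v) for v in in_mem.values()),
      sum(len(v) for v in on_disk.values()))
```

After the build phase, the real operator also has to process probe rows that hash to spilled partitions by spilling them as well and joining partition pairs from disk; that step is omitted here.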
When HJ1 begins probing, we use the output of HJ1 to build the hash table for HJ2. When HJ1 is done probing, we can release the memory used by its hash table. Only then do we begin probing HJ2 and building the hash table for HJ3. Thus, HJ1 and HJ3 are never active at the same time and can share the same memory grant. The total amount of memory we need is max(HJ1 + HJ2, HJ2 + HJ3).

In a right deep tree, the output of one hash join is the probe input to the next hash join. All of the hash joins must build their complete hash tables before we can begin probing. All of the hash joins are active at once and cannot share memory. When we do begin probing, rows flow up the entire tree of hash joins without blocking. Thus, the total amount of memory we need is HJ1 + HJ2 + HJ3.

Hash Join - Examples

create table T1 (a int, b int, x char(200))
create table T2 (a int, b int, x char(200))
create table T3 (a int, b int, x char(200))

set nocount on
declare @i int
set @i = 0
while @i < 1000
  begin
    insert T1 values (@i * 2, @i * 5, @i)
    set @i = @i + 1
  end
set @i = 0
while @i < 10000
  begin
    insert T2 values (@i * 3, @i * 7, @i)
    set @i = @i + 1
  end
set @i = 0
while @i < 100000
  begin
    insert T3 values (@i * 5, @i * 11, @i)
    set @i = @i + 1
  end

select * from T1 join T2 on T1.a = T2.a

Rows   Executes
334    1   |--Hash Match(Inner Join, HASH:([T1].[a])=([T2].[a]), RESIDUAL:([T2].[a]=[T1].[a]))
1000   1        |--Table Scan(OBJECT:([T1]))
10000  1        |--Table Scan(OBJECT:([T2]))

Notice that T2 has ten times as many rows as T1, and indeed the optimizer chooses to use T1 as the build table and T2 as the probe table.
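The build-side choice the optimizer just made can be sketched in one line: pick the input with the smaller (estimated) cardinality as the build side, since the whole build input must fit in the hash table's memory grant. This is an illustration of the heuristic, using actual row counts where the optimizer would use estimates:

```python
def choose_build_and_probe(input_a, input_b):
    """Pick the smaller input as the build side to minimize hash table
    memory, mirroring the optimizer's use of cardinality estimates."""
    return (input_a, input_b) if len(input_a) <= len(input_b) else (input_b, input_a)

t1 = [(i * 2,) for i in range(1000)]      # 1,000 rows
t2 = [(i * 3,) for i in range(10000)]     # 10,000 rows
build, probe = choose_build_and_probe(t1, t2)
print(len(build), len(probe))
# 1000 10000 -> T1 becomes the build table, T2 the probe table
```

When the estimates are wrong, the chosen build side may be larger than expected, which is exactly the situation that triggers the spilling described earlier.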
select * from (T1 join T2 on T1.a = T2.a) join T3 on T1.b = T3.a

Rows    Executes
334     1   |--Hash Match(Inner Join, HASH:([T1].[b])=([T3].[a]), RESIDUAL:([T1].[b]=[T3].[a]))
334     1        |--Hash Match(Inner Join, HASH:([T1].[a])=([T2].[a]), RESIDUAL:([T1].[a]=[T2].[a]))
1000    1        |    |--Table Scan(OBJECT:([T1]))
10000   1        |    |--Table Scan(OBJECT:([T2]))
100000  1        |--Table Scan(OBJECT:([T3]))

Note that the optimizer has selected a left deep plan. First, we join the two smaller tables, T1 and T2. This join yields only 334 rows, which we use to build a hash table before joining with the large table T3.

Now observe what happens if we add a predicate to restrict the size of the two smaller tables. (A single where clause suffices; the optimizer can derive "T2.a < 100" from "T1.a < 100" and "T1.a = T2.a".)

select * from (T1 join T2 on T1.a = T2.a) join T3 on T1.b = T3.a where T1.a < 100

Rows    Executes
17      1   |--Hash Match(Inner Join, HASH:([T2].[a])=([T1].[a]), RESIDUAL:([T1].[a]=[T2].[a]))
34      1        |--Table Scan(OBJECT:([T2]), WHERE:([T2].[a]<(100)))
50      1        |--Hash Match(Inner Join, HASH:([T1].[b])=([T3].[a]), RESIDUAL:([T1].[b]=[T3].[a]))
50      1             |--Table Scan(OBJECT:([T1]), WHERE:([T1].[a]<(100)))
100000  1             |--Table Scan(OBJECT:([T3]))

This time the optimizer selected a right deep plan. T1 and T2 are now so small (50 and 34 rows, respectively) that it is better to build hash tables on these two tables and probe them with the large table T3 than to build a hash table on an intermediate hash join result.

Joins Summary

The following table summarizes the characteristics of the three physical join operators:

                     Nested loops join     Merge join                 Hash join
Best for             Small inputs          Medium to large inputs,    The largest inputs;
                                           especially pre-sorted      data warehouses
Equijoin required?   No                    Yes (except a full outer   Yes
                                           join special case)
Sorted inputs?       No                    Yes                        No
Extra storage?       No                    tempdb worktable for       Memory grant for the hash
                                           many-to-many joins         table; may spill to tempdb
Blocking?            No                    No                         Yes, on the build input

Next time, we'll talk about how SQL Server performs aggregation internally at the physical level.