Candy Crush and SQL Server’s Missing Index Requests

It’s My Blog It’s My Rules


Since I know this’ll come out Friday, I get to have fun.

I have a feeling most of you know what Candy Crush is. It’s a mindless, infinite thumb scroll replacement.

When you’re playing, Candy Crush will suggest moves to you. It thinks they’re good ideas. Though I don’t know what the algorithm is, it seems to recommend some goofy stuff.

Take this example. It wants me to move the orange piece. It’s not a bad move, but it’s not the best possible move.

Savvy Candy Crush players will see the obvious choice. Make five in a row with the purple pieces, get the cookie thing, cookie thing drops down next to the explodey thing, and when you combine it with the wrapped candy thing, all the purple pieces turn into explodey pieces.

Level over, basically.

But Candy Crush isn’t thinking that far ahead. Neither are missing index requests.

SQL Server Does The Same Thing


Let’s take this query, for example. It’s very Post-centric.

SELECT   p.OwnerUserId, p.Score, p.Title
FROM     dbo.Comments AS c
JOIN     dbo.Posts AS p
    ON p.OwnerUserId = c.UserId
WHERE    p.PostTypeId = 1
AND      p.ClosedDate >= '2018-06-01'
ORDER BY p.Score DESC;

Right now, the only indexes I have are clustered, and they’re not on any columns that help this query. Sure, they help other things, and having clustered indexes is usually a good idea.

This is the home run use-case for nonclustered indexes. You know. Organized copies of other parts of your data that users might query.

This is such an obvious move that SQL Server is all like “hey, got a minute?”

This is where things get all jumpy-blinky. Just like in Candy Crush.

Hints-a-Hints


This is the index SQL Server wants:

CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[Posts] ([PostTypeId],[ClosedDate])
INCLUDE ([OwnerUserId],[Score],[Title])

Would this be better than no index? Yes.

Is it the best possible index? No. Not by a long shot.

Let’s start with our predicates. SQL Server picked PostTypeId as the leading key column.

SELECT SUM(CASE p.PostTypeId WHEN 1 THEN 1 ELSE 0 END) AS [count_type_one],
       SUM(CASE WHEN p.ClosedDate >= '2018-06-01' THEN 1 ELSE 0 END) AS count_closed_date
FROM dbo.Posts AS p;

Is it selective?

That ain’t good

Regardless of selectivity, the missing index request mechanism will always put equality predicates first.

What I’m getting at is that the missing index request isn’t as well thought out as a lot of people hope. It’s just one possible index for the query, weighted toward helping us find data in the WHERE clause.

With a human set of eyes on it, you may discover one or more better possible indexes. You may even discover one for the Comments table, too.
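As a sketch of what a hand-tuned pair might look like: the index names, the choice to lead on ClosedDate, and the Comments index are my assumptions, not anything the hint produced.

CREATE NONCLUSTERED INDEX ix_Posts_ClosedDate_PostTypeId
    ON dbo.Posts (ClosedDate, PostTypeId)
    INCLUDE (OwnerUserId, Score, Title);

CREATE NONCLUSTERED INDEX ix_Comments_UserId
    ON dbo.Comments (UserId);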

Other Issues


There’s also the issue of the included columns it chose. We’re ordering by Score. We’re joining on OwnerUserId.

Those may be helpful as key columns, depending on how much data we end up joining, and how much data we end up sorting.

[Image: the query plan. Caption: “Guesses. Just guesses.”]

Complicated Game


If you don’t have anyone doing regular index tuning, missing index hints are worth following because they’re better than nothing.

They’re like… acceptable. Most of the time.

The big things you have to watch out for are the incredibly wide requests, duplicative requests, and ones that want big string columns.
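If you want to eyeball what’s piled up before following any of it, the missing index DMVs are the place to look. A sketch, with the ordering being just one way to rank them:

SELECT d.statement AS table_name,
       d.equality_columns,
       d.inequality_columns,
       d.included_columns,
       s.user_seeks,
       s.avg_user_impact
FROM sys.dm_db_missing_index_details AS d
JOIN sys.dm_db_missing_index_groups AS g
    ON g.index_handle = d.index_handle
JOIN sys.dm_db_missing_index_group_stats AS s
    ON s.group_handle = g.index_group_handle
ORDER BY s.user_seeks * s.avg_user_impact DESC;

Wide included_columns lists and near-duplicate rows for the same table are the ones to treat with suspicion.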

Thanks for reading!


What SQL Server Indexes Don’t Do With Data

You Fine And All


You can read a lot about how indexes might improve queries.

But there are certain query mechanics that indexes don’t solve for, and you can get stuck not only with an index that isn’t really helping, but also with all the same query problems you were looking to solve.

Understanding how indexes and queries interact together is a fundamental part of query tuning. 

In this post, we’re going to look at some query patterns that indexes don’t do much to fix.

Part Of The Problem


Indexes don’t care about your queries. They care about storing data in order. 

There are very definite ways that you can encourage queries to use them efficiently, and probably even more definite ways for you to discourage them from using them efficiently.

Even with syntax that seems completely transparent, you can get into trouble.

Take the simple example of two date columns: there’s nothing in your index that tracks how two dates in a row relate to each other.

Nothing tracks how many years, months, weeks, days, hours, minutes, seconds, milliseconds, or whatever magical new unit Extended Events has to make itself less usable.

Heck, it doesn’t even track if one is greater or less than another.

Soggy Flautas


Ah, heck, let’s stick with Stack Overflow. Let’s even create an index.

CREATE INDEX whatever 
ON dbo.Posts(CreationDate, ClosedDate);

Now let’s look at these super important queries.

    SELECT   COUNT(*) AS records
    FROM     dbo.Posts AS p
    WHERE    p.CreationDate < p.ClosedDate;


    SELECT   COUNT(*) AS records
    FROM     dbo.Posts AS p
    WHERE    p.CreationDate > p.ClosedDate;

How much work will the optimizer think it has to do, here? How many rows will it estimate? How will it treat different queries, well, differently? If you said “it won’t”, you’re a smart cookie. Aside from the “Actual Rows” counts, each plan has the same attributes across the board.

[Images: the two query plans, captioned “And I’m Homosapien like you” and “And we’re Homosapien too”]

Treachery


Neither of these query plans is terrible on its own.

The problem is really in larger plans, where bad decisions like these have their way with other parts of a query plan.

Nasty little lurkers they are, because you expect things to get better when creating indexes and writing SARGable predicates.

Yet for both queries, SQL Server does the same thing, based on the same guesses on the perceived number of rows at play. It’s one of those quirky things — if it’s a data point we care about, then it’s one we should express somehow.

A computed column might work here:

    ALTER TABLE dbo.Posts 
        ADD created_less_closed AS 
            CONVERT(BIT, CASE WHEN CreationDate < ClosedDate 
                              THEN 1
                              WHEN CreationDate > ClosedDate
                              THEN 0
                         END)


    CREATE INDEX apathy 
        ON dbo.Posts (created_less_closed);

Which means we’d have to change our queries a bit, since expression matching has a hard time reasoning this one out:

    SELECT   COUNT(*) AS records
    FROM     dbo.Posts AS p
    WHERE    1 = CONVERT(BIT, CASE WHEN CreationDate < ClosedDate 
                              THEN 1
                              WHEN CreationDate > ClosedDate
                              THEN 0
                         END)
    AND 1 = (SELECT 1);


    SELECT   COUNT(*) AS records
    FROM     dbo.Posts AS p
    WHERE    0 = CONVERT(BIT, CASE WHEN CreationDate < ClosedDate 
                              THEN 1
                              WHEN CreationDate > ClosedDate
                              THEN 0
                         END)
    AND 1 = (SELECT 1);
[Images: the two new query plans, captioned “Better” and “Butter”]

What Would Be Better?


Practically speaking, it isn’t the job of an index (or statistic) to track things like this. We need to have data that represents things that are important to users.

Though neither of these plans is terrible in isolation, bad guesses like these flow all through query plans and can lead to other bad decisions and oddities.

It’s one of those terrible lurkers just waiting to turn an otherwise good query into that thing you regard with absolute loathing every time it shows up in [your favorite monitoring tool].

Knowing this, we can start to design our data to better reflect data points we care about. A likely hero here is a computed column to return some value based on which is greater, or a DATEDIFF of the two columns.
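A minimal sketch of the DATEDIFF flavor, with made-up column and index names (and remember the queries would need to reference the computed column for the index to help):

    ALTER TABLE dbo.Posts
        ADD hours_to_close AS DATEDIFF(HOUR, CreationDate, ClosedDate);

    CREATE INDEX ennui
        ON dbo.Posts (hours_to_close);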


SQL Server Clustered Column Store Indexes and String Aggregation

String aggregation is a not uncommon problem in SQL Server. By string aggregation I mean grouping related rows together and concatenating a string column for each group, in a defined order, into a single column. How well do CCIs do with this type of query?

The Data Set


A simple table with three columns is sufficient for testing. ITEM_ID is the id for the rows that should be aggregated together. LINE_ID stores the concatenation order within the group. COMMENT is the string column to be concatenated. The table will have 104857600 rows total with 16 rows per ITEM_ID.

CREATE TABLE dbo.CCI_FOR_STRING_AGG (
	ITEM_ID BIGINT NOT NULL,
	LINE_ID BIGINT NOT NULL,
	COMMENT VARCHAR(10) NULL,
	INDEX CCI CLUSTERED COLUMNSTORE
);

DECLARE @loop INT = 0
SET NOCOUNT ON;
WHILE @loop < 100
BEGIN
	INSERT INTO dbo.CCI_FOR_STRING_AGG WITH (TABLOCK)
	SELECT
	t.RN / 16
	, 1 + t.RN % 16
	, CHAR(65 + t.RN % 16)
	FROM
	(
		SELECT TOP (1048576)
		(1048576 * @loop) - 1 +
		ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
		FROM master..spt_values t1
		CROSS JOIN  master..spt_values t2
	) t
	OPTION (MAXDOP 1);

	SET @loop = @loop + 1;
END;

On my machine this code takes around a minute and a half and the final table size is around 225 MB. I’m inserting batches of 1048576 rows with MAXDOP 1 to get nice, clean rowgroups.
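A quick sanity check that the rowgroups came out the way I want, as a sketch using the SQL Server 2016+ physical stats DMV; the expectation is 100 COMPRESSED rowgroups of 1048576 rows each:

SELECT rg.row_group_id,
       rg.state_desc,
       rg.total_rows,
       rg.trim_reason_desc
FROM sys.dm_db_column_store_row_group_physical_stats AS rg
WHERE rg.object_id = OBJECT_ID('dbo.CCI_FOR_STRING_AGG')
ORDER BY rg.row_group_id;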

The Test Query


Let’s concatenate the strings at an ITEM_ID level for all ITEM_IDs with an id between 3276800 and 3342335. All of the data is stored in a single rowgroup and the CCI was built in such a way that all other rowgroups can be skipped with that filter. This should represent the best case for CCI performance. A common way to concatenate a column is with the FOR XML PATH method:

SELECT
  o.ITEM_ID
, STUFF(
	(
		SELECT ','+ i.COMMENT
		FROM dbo.CCI_FOR_STRING_AGG i
		WHERE o.ITEM_ID = i.ITEM_ID
		ORDER BY i.LINE_ID
		FOR XML PATH('')
	)
,1 ,1, '') ALL_COMMENTS
FROM dbo.CCI_FOR_STRING_AGG o
WHERE o.ITEM_ID BETWEEN 3276800 AND 3342335
GROUP BY o.ITEM_ID;

Note that I’m not bothering with the additional arguments for the XML part, but you should whenever using this method in production.
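For reference, the safer variant looks roughly like this; the TYPE directive plus .value() keeps XML special characters such as ampersands intact. The rest of the post sticks with the simpler form above.

SELECT
  o.ITEM_ID
, STUFF(
    (
        SELECT ','+ i.COMMENT
        FROM dbo.CCI_FOR_STRING_AGG i
        WHERE o.ITEM_ID = i.ITEM_ID
        ORDER BY i.LINE_ID
        FOR XML PATH(''), TYPE
    ).value('.', 'VARCHAR(MAX)')
,1 ,1, '') ALL_COMMENTS
FROM dbo.CCI_FOR_STRING_AGG o
WHERE o.ITEM_ID BETWEEN 3276800 AND 3342335
GROUP BY o.ITEM_ID;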

The query plan looks reasonable at first glance:

[Image: estimated plan for the FOR XML query]

However, the query takes nearly six minutes to complete. That’s not the lightning fast CCI query performance that we were promised!

Spool Problems


We can see from the actual execution plan that SQL Server scanned all 104.8 million rows from the CCI on a single thread:

[Image: actual plan showing the single-threaded CCI scan]

Those rows were then sent into an index spool, which was read from all four threads in the nested loop. The CCI scan and the build of the index spool took the most time in the plan, so it makes sense that CPU time is less than elapsed time even for a MAXDOP 4 query:

CPU time = 305126 ms, elapsed time = 359176 ms.

Thinking back to how parallel nested loop joins work, it seems unavoidable that the index spool was built with just one thread. The rows from the outer result set are sent to the inner query one row at a time on different threads. All of the threads run independently at MAXDOP 1. The eager spool for the index is a blocking operation and the blocking occurs even when the subquery isn’t needed. Consider the following query in which the ITEM_ID = 3400000 filter means that rows from the FOR XML PATH part of the APPLY will never be needed:

SELECT o.ITEM_ID
, ac.ALL_COMMENTS
FROM
(
	SELECT TOP (9223372036854775807) b.ITEM_ID
	FROM dbo.CCI_FOR_STRING_AGG b
	WHERE b.ITEM_ID BETWEEN 3276800 AND 3342335
	GROUP BY b.ITEM_ID
) o
OUTER APPLY (
	SELECT 'TEST'

	UNION ALL

	SELECT STUFF(
		(
			SELECT ','+ i.COMMENT
			FROM dbo.CCI_FOR_STRING_AGG i
			WHERE o.ITEM_ID = i.ITEM_ID
			ORDER BY i.LINE_ID
			FOR XML PATH('')
		)
	,1 ,1, '')
	WHERE o.ITEM_ID =  3400000
)  ac (ALL_COMMENTS)
OPTION (MAXDOP 4, QUERYTRACEON 8649);

The index spool is still built for this query on one thread even though the startup expression predicate condition is never met. It seems unlikely that we’ll be able to do anything about the fact that the index spool is built with one thread and that it blocks execution for the query. However, we know that we only need 1048576 rows from the CCI to return the query’s results. Right now the query takes 104.8 million rows and throws them into the spool. Can we reduce the number of rows put into the spool? The most obvious approach is to simply copy the filter on ITEM_ID into the subquery:

SELECT
  o.ITEM_ID
, STUFF(
	(
		SELECT ','+ i.COMMENT
		FROM dbo.CCI_FOR_STRING_AGG i
		WHERE o.ITEM_ID = i.ITEM_ID
		and i.ITEM_ID BETWEEN 3276800 AND 3342335
		ORDER BY i.LINE_ID
		FOR XML PATH('')
	)
,1 ,1, '') ALL_COMMENTS
FROM dbo.CCI_FOR_STRING_AGG o
WHERE o.ITEM_ID BETWEEN 3276800 AND 3342335
GROUP BY o.ITEM_ID;

This doesn’t have the desired effect:

[Image: plan with the filter applied after the index spool is built]

The filter is moved after the index build. We’ll still get all of the rows from the table put into the spool, but no rows will come out unless ITEM_ID is between 3276800 and 3342335. This is not helpful. We can get more strict with the query optimizer by adding a superfluous TOP to the subquery. That should force SQL Server to filter on ITEM_ID before sending rows to the spool because otherwise the TOP restriction may not be respected. One implementation:

SELECT
  o.ITEM_ID
, STUFF(
	(
		SELECT ','+ i.COMMENT
		FROM
		(
			SELECT TOP (9223372036854775807)
			a.ITEM_ID, a.LINE_ID, a.COMMENT
			FROM dbo.CCI_FOR_STRING_AGG a
			WHERE a.ITEM_ID BETWEEN 3276800 AND 3342335
		) i
		WHERE o.ITEM_ID = i.ITEM_ID
		ORDER BY i.LINE_ID
		FOR XML PATH('')
	)
,1 ,1, '') ALL_COMMENTS
FROM dbo.CCI_FOR_STRING_AGG o
WHERE o.ITEM_ID BETWEEN 3276800 AND 3342335
GROUP BY o.ITEM_ID;

SQL Server continues to outsmart us:

[Image: plan with no index spool]

As you can see in the plan above, the spool has completely disappeared. I was not able to find a way, even with undocumented black magic, to reduce the number of rows going into the spool. Perhaps it is a fundamental restriction regarding index spools. In fact, we can use the undocumented trace flag 8615 to see that spools are not even considered at any point during optimization of the new query. On the left is the previous query with the spool, with an example highlighted. On the right is the new query. The text shown here is just for illustration purposes, but we can see the spool on the left:

[Image: trace flag 8615 memo output compared side by side for the two queries]

The important point is that for this query we appear to be stuck.
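If you want to peek at the memo yourself, here is a sketch of how that dump can be produced (undocumented trace flags, strictly for a test box; 3604 routes the output to the messages tab and 8615 prints the final memo):

SELECT
  o.ITEM_ID
, STUFF(
    (
        SELECT ','+ i.COMMENT
        FROM dbo.CCI_FOR_STRING_AGG i
        WHERE o.ITEM_ID = i.ITEM_ID
        ORDER BY i.LINE_ID
        FOR XML PATH('')
    )
,1 ,1, '') ALL_COMMENTS
FROM dbo.CCI_FOR_STRING_AGG o
WHERE o.ITEM_ID BETWEEN 3276800 AND 3342335
GROUP BY o.ITEM_ID
OPTION (QUERYTRACEON 3604, QUERYTRACEON 8615);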

Rowgroup Elimination Problems


We can try our luck without the spool by relying on rowgroup elimination alone. The spool can’t be eliminated with the NO_PERFORMANCE_SPOOL hint, but another option (other than the TOP trick above) is to use the undocumented QUERYRULEOFF syntax to disable the optimizer rule for building spools:

SELECT
  o.ITEM_ID
, STUFF(
	(
		SELECT ','+ i.COMMENT
		FROM dbo.CCI_FOR_STRING_AGG i
		WHERE o.ITEM_ID = i.ITEM_ID
		ORDER BY i.LINE_ID
		FOR XML PATH('')
	)
,1 ,1, '') ALL_COMMENTS
FROM dbo.CCI_FOR_STRING_AGG o
WHERE o.ITEM_ID BETWEEN 3276800 AND 3342335
GROUP BY o.ITEM_ID
OPTION (QUERYRULEOFF BuildSpool);

The spool is gone but we don’t get rowgroup elimination:

[Image: plan with no rowgroup elimination]

We can get rowgroup elimination even without hardcoded filters with a bitmap from a hash join. Why don’t we get it with a nested loop join? Surely SQL Server ought to be able to apply rowgroup elimination for each outer row as it’s processed by the inner part of the nested loop join. We can explore this question further with tests that don’t do string aggregation. The query below gets rowgroup elimination, but only because the constant filters are pushed all the way down into the CCI scan as a predicate:

SELECT *
FROM (VALUES (1), (2), (3), (4)) AS v(x)
INNER JOIN dbo.CCI_FOR_STRING_AGG a ON v.x = a.ITEM_ID
OPTION (FORCE ORDER, LOOP JOIN);

If we throw the four values into a temp table:

SELECT x INTO #t
FROM (VALUES (1), (2), (3), (4)) AS v(x)

The join predicate is applied in the nested loop operator. We don’t get any rowgroup elimination. Perhaps TOP can help us:

SELECT *
FROM #t v
CROSS APPLY (
	SELECT TOP (9223372036854775807) *
	FROM dbo.CCI_FOR_STRING_AGG a
	WHERE v.x = a.ITEM_ID
) a
OPTION (LOOP JOIN);

Sadly not:

[Image: plan for the TOP attempt, with the join predicate in a Filter operator]

The join predicate is applied in the filter operation. We still don’t get any rowgroup elimination. Frustratingly, we get the behavior that we’re after if we replace the CCI with a heap:

[Image: plan showing the predicate pushed down when the table is a heap]

I don’t really see an advantage in pushing down the predicate with a heap. Perhaps Microsoft did not program the optimization that we’re looking for into the query optimizer. After all, this is likely to be a relatively uncommon case. Our query is simple enough that we filter directly against the CCI, so in theory we can give it a fighting chance by adding a redundant filter on ITEM_ID to the subquery. Here’s the query that I’ll run:

SELECT
  o.ITEM_ID
, STUFF(
	(
		SELECT ','+ i.COMMENT
		FROM dbo.CCI_FOR_STRING_AGG i
		WHERE o.ITEM_ID = i.ITEM_ID
		and i.ITEM_ID BETWEEN 3276800 AND 3342335
		ORDER BY i.LINE_ID
		FOR XML PATH('')
	)
,1 ,1, '') ALL_COMMENTS
FROM dbo.CCI_FOR_STRING_AGG o
WHERE o.ITEM_ID BETWEEN 3276800 AND 3342335
GROUP BY o.ITEM_ID
OPTION (QUERYRULEOFF BuildSpool);

Unfortunately performance is even worse than before. The query finished in around 51 minutes on my machine. It was a proper parallel query and we skipped quite a few rowgroups:

Table ‘CCI_FOR_STRING_AGG’. Segment reads 65537, segment skipped 6488163.
CPU time = 10028282 ms, elapsed time = 3083875 ms.

Skipping trillions of rows is impressive but we still read over 68 billion rows from the CCI. That’s not going to be fast. We can’t improve performance further with this method since there aren’t any nonclustered indexes on the CCI, so there’s nothing better to seek against in the inner part of the loop.

Temp Tables


We can use the humble temp table to avoid some of the problems with the index spool. We’re able to insert into it in parallel, build the index in parallel, and we can limit the temp table to only the 1048576 rows that are needed for the query. The following code finishes in less than two seconds on my machine:

CREATE TABLE #CCI_FOR_STRING_AGG_SPOOL (
	ITEM_ID BIGINT NOT NULL,
	LINE_ID BIGINT NOT NULL,
	COMMENT VARCHAR(10) NULL
);

INSERT INTO #CCI_FOR_STRING_AGG_SPOOL WITH (TABLOCK)
SELECT a.ITEM_ID, a.LINE_ID, a.COMMENT
FROM dbo.CCI_FOR_STRING_AGG a
WHERE a.ITEM_ID BETWEEN 3276800 AND 3342335;

CREATE CLUSTERED INDEX CI_CCI_FOR_STRING_AGG_SPOOL
ON #CCI_FOR_STRING_AGG_SPOOL (ITEM_ID, LINE_ID);

SELECT
  o.ITEM_ID
, STUFF(
	(
		SELECT ','+ i.COMMENT
		FROM #CCI_FOR_STRING_AGG_SPOOL i
		WHERE o.ITEM_ID = i.ITEM_ID
		ORDER BY i.LINE_ID
		FOR XML PATH('')
	)
,1 ,1, '') ALL_COMMENTS
FROM dbo.CCI_FOR_STRING_AGG o
WHERE o.ITEM_ID BETWEEN 3276800 AND 3342335
GROUP BY o.ITEM_ID;

The query plan for the last query:

[Image: plan for the temp table approach]

With this approach we’re not really getting much from defining the underlying table as a CCI. If the table had more columns that weren’t needed then we could skip those columns with column elimination.

Other Solutions


An obvious solution is simply to add a covering nonclustered index to the table. Of course, that comes with the admission that columnstore currently doesn’t have anything to offer with this type of query. Other solutions which are likely to be more columnstore friendly include using a CLR function to do the aggregation and the SQL Server 2017 STRING_AGG function.
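For example, the SQL Server 2017 version is pleasantly short. A sketch (I haven’t benchmarked how well it pairs with the CCI here):

SELECT
  a.ITEM_ID
, STRING_AGG(a.COMMENT, ',') WITHIN GROUP (ORDER BY a.LINE_ID) AS ALL_COMMENTS
FROM dbo.CCI_FOR_STRING_AGG a
WHERE a.ITEM_ID BETWEEN 3276800 AND 3342335
GROUP BY a.ITEM_ID;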

Note that recursion is not likely to be a good solution here. Every recursive query that I’ve seen involves nested loops. The dreaded index spool returns:

[Image: recursive plan with the index spool]
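For concreteness, the kind of recursive rewrite that produces a plan like that is roughly the following; the anchor on LINE_ID = 1 and the VARCHAR(200) cast are my assumptions about how you’d write it:

WITH agg AS (
    SELECT a.ITEM_ID,
           a.LINE_ID,
           CAST(a.COMMENT AS VARCHAR(200)) AS ALL_COMMENTS
    FROM dbo.CCI_FOR_STRING_AGG a
    WHERE a.LINE_ID = 1
    AND a.ITEM_ID BETWEEN 3276800 AND 3342335

    UNION ALL

    SELECT a.ITEM_ID,
           a.LINE_ID,
           CAST(agg.ALL_COMMENTS + ',' + a.COMMENT AS VARCHAR(200))
    FROM agg
    JOIN dbo.CCI_FOR_STRING_AGG a
        ON a.ITEM_ID = agg.ITEM_ID
        AND a.LINE_ID = agg.LINE_ID + 1
)
SELECT agg.ITEM_ID, agg.ALL_COMMENTS
FROM agg
WHERE agg.LINE_ID = 16;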

Final Thoughts


Note that the poor performance of these queries isn’t a problem specific to columnstore. Rowstore tables can have similar performance issues with string aggregation if there isn’t a sufficient covering index. However, if CCIs are used as a replacement for all nonclustered indexes, then the performance of queries that require the table to be on the inner side of a nested loop join may significantly degrade. Some of the optimizations that make querying CCIs fast appear to be difficult to realize for these types of queries. The same string aggregation query, when run against a table with a rowstore clustered index on ITEM_ID and LINE_ID, finished in 1 second. That is an almost 360X improvement over the CCI.

Thanks for reading!


The SQL Server Query Without an Efficient Join

Sometimes simple changes to a query can lead to dramatic changes in performance. I often find it helpful to dig into the details as much as I can whenever this happens.

Test Data


All that’s needed is a table with a clustered key defined on both columns. The first column will always have a value of 1 and the second column will increment for each row. We need two identical tables and may change the number of rows inserted into both of them for some tests. Below is SQL to follow along at home:

DROP TABLE IF EXISTS dbo.outer_tbl;
DROP TABLE IF EXISTS dbo.inner_tbl;

CREATE TABLE dbo.outer_tbl (
ID1 BIGINT NOT NULL,
ID2 BIGINT NOT NULL,
PRIMARY KEY (ID1, ID2)
);

CREATE TABLE dbo.inner_tbl (
ID1 BIGINT NOT NULL,
ID2 BIGINT NOT NULL,
PRIMARY KEY (ID1, ID2)
);

INSERT INTO dbo.outer_tbl WITH (TABLOCK)
SELECT TOP (20000)
  1
, ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2;

INSERT INTO dbo.inner_tbl WITH (TABLOCK)
SELECT TOP (20000)
  1
, ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2;

UPDATE STATISTICS dbo.outer_tbl
WITH FULLSCAN;
UPDATE STATISTICS dbo.inner_tbl
WITH FULLSCAN;

The Query


The following query will return one row from outer_tbl if there is at least one row in inner_tbl with a matching value for ID1 and ID2. For our initial data this query will return all 20000 rows. The query optimizer unsurprisingly picks a merge join and the query finishes with just 15 ms of CPU time:

SELECT *
FROM dbo.outer_tbl o
WHERE EXISTS (
	SELECT 1
	FROM dbo.inner_tbl i
	WHERE o.ID1 = i.ID1
	AND o.ID2 = i.ID2
);

Query plan:

[Image: plan for the original query]

Now let’s make a small change:

SELECT *
FROM dbo.outer_tbl o
WHERE EXISTS (
	SELECT 1
	FROM dbo.inner_tbl i
	WHERE o.ID1 = i.ID1
	AND o.ID2 IN (-1, i.ID2)
);

Query plan:

[Image: plan for the adjusted query]

Now the query will return one row from outer_tbl if ID2 = -1 and there is at least one row in inner_tbl with a matching value for ID1, or if there is at least one row in inner_tbl with a matching value for both ID1 and ID2. The query optimizer gives me a hash join and the query finishes with 11360 ms of CPU time. If I force a loop join the query finishes with 10750 ms of CPU time. If I force a merge join the query finishes with 112062 ms of CPU time. What happened?
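For reference, one way to force a particular join type for a comparison like this is a query-level join hint, along these lines (a sketch; swap MERGE JOIN for LOOP JOIN or HASH JOIN as needed):

SELECT *
FROM dbo.outer_tbl o
WHERE EXISTS (
    SELECT 1
    FROM dbo.inner_tbl i
    WHERE o.ID1 = i.ID1
    AND o.ID2 IN (-1, i.ID2)
)
OPTION (MERGE JOIN);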

The Problem


The problem becomes easier to see if we change both of the tables to have just one column. We know that ID1 always equals 1 for both tables so we don’t need it in the query for this data set. Here’s the new set of tables:

DROP TABLE IF EXISTS dbo.outer_tbl_single;
DROP TABLE IF EXISTS dbo.inner_tbl_single;

CREATE TABLE dbo.outer_tbl_single (
ID2 BIGINT NOT NULL,
PRIMARY KEY (ID2)
);

CREATE TABLE dbo.inner_tbl_single (
ID2 BIGINT NOT NULL,
PRIMARY KEY (ID2)
);

INSERT INTO dbo.outer_tbl_single WITH (TABLOCK)
SELECT TOP (20000)
ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2;

INSERT INTO dbo.inner_tbl_single WITH (TABLOCK)
SELECT TOP (20000)
ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2;

UPDATE STATISTICS dbo.outer_tbl_single
WITH FULLSCAN;
UPDATE STATISTICS dbo.inner_tbl_single
WITH FULLSCAN;

If I try to force a hash join:

SELECT *
FROM dbo.outer_tbl_single o
WHERE EXISTS (
	SELECT 1
	FROM dbo.inner_tbl_single i
	WHERE o.ID2 IN (-1, i.ID2)
)
OPTION (MAXDOP 1, HASH JOIN);

I get an error:

Msg 8622, Level 16, State 1, Line 27
Query processor could not produce a query plan because of the hints defined in this query. Resubmit the query without specifying any hints and without using SET FORCEPLAN.

That makes sense. Hash join requires an equality condition and there isn’t one. Attempting to force a merge join throws the same error. A loop join is valid but there’s a scan on the inner side of the loop:

SELECT *
FROM dbo.outer_tbl_single o
WHERE EXISTS (
	SELECT 1
	FROM dbo.inner_tbl_single i
	WHERE o.ID2 IN (-1, i.ID2)
)
OPTION (MAXDOP 1, NO_PERFORMANCE_SPOOL);

Query plan:

[Image: loop join plan for the single-column tables]

Note that we don’t need to read the 400 million (20000 * 20000) rows from the inner table. This is a semi join so the loop stops as soon as it finds a matching row. On average we should read (1 + 20000) / 2 rows from the inner table before finding a match so the query should need to read 20000 * (1 + 20000) / 2 = 200010000 rows from the inner table. This is indeed what happens:

[Image: actual rows read by the loop join]

Hash Join Performance


Let’s go back to the initial slow query with two columns. Why is the query eligible for a hash join? Looking at the details of the hash join we can see that the hash probe is only on the ID1 column:

[Image: hash join probing on the ID1 column only]

With this data set we end up with the worst possible case for a hash join. All of the rows from the build will hash to the same value and no rows from the probe will be eliminated. It’s still a semi join, so while we don’t know what the order of the hash table is, we can still expect to need to check (1 + 20000) / 2 entries of the hash table on average before finding the first match. This means that we’ll need to check 200010000 entries in the hash table just for this query. Since all of the work is happening in the hash join, we should also expect quadratic performance for this query relative to the number of rows in the table. If we change the number of rows in the table we see the quadratic relationship:

[Image: chart of rows in table vs. elapsed time for the hash join]

For the table scans in the query we’re likely to do allocation order scans so the data may not be fed into the hash join in index order. However I just created the tables and data is returned in order in other queries, so I’m going to assume that data is fed into the join in order in this case. You may voice your objections to this assumption in the comments. Of course, the final output of the join may not be in order. We can monitor the progress of the query in a lightweight way using sys.dm_exec_query_profiles with trace flag 7412 enabled. I tested with 40k rows in each table to get a longer running join. The number of rows processed every quarter second was not constant:

[Image: chart of rows processed per quarter second for the 40k row test]

In fact, the query seemed to speed up as time went on. This was unexpected. Perhaps the hash table is built in such a way that the earliest rows for a hash value are at the end of the linked list and the last values processed are at the start? That would explain why the beginning of the query had the lowest number of rows processed per second and the number of rows processed per second increased over time. This is complete speculation on my part and is very likely to be wrong.
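For anyone who wants to watch the rate themselves, the polling query is roughly the following, run from a second session while the join executes; the session_id is a placeholder, and the target session needs trace flag 7412 (or an equivalent profiling method) enabled for the DMV to return rows:

SELECT qp.physical_operator_name,
       qp.node_id,
       qp.thread_id,
       qp.row_count,
       qp.elapsed_time_ms
FROM sys.dm_exec_query_profiles AS qp
WHERE qp.session_id = 56 /* placeholder: the session running the join */
ORDER BY qp.node_id, qp.thread_id;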

The costing for the hash join with the new CE is disappointing. All of the information is available in the statistics to suggest that we’ll get the worst case for the hash join but the total query cost is only 0.75 optimizer units. The legacy CE does better with a total query cost of 100.75 optimizer units but the hash join is still selected.

Merge Join Performance


For merge join the data is only sorted by ID1 for the join:

[Image: merge join plan sorted on ID1 only]

The other part of the join involving the ID2 column is processed as a residual. This is also a many-to-many join. Craig Freedman has a good explanation of how many-to-many joins can lead to a lot of extra work in the join operator:

Merge join can also support many-to-many merge joins. In this case, we must keep a copy of each row from input 2 whenever we join two rows. This way, if we later find that we have a duplicate row from input 1, we can play back the saved rows. On the other hand, if we find that the next row from input 1 is not a duplicate, we can discard the saved rows. We save these rows in a worktable in tempdb. The amount of disk space we need depends on the number of duplicates in input 2.

We have the maximum possible number of duplicates per join key. It seems reasonable to expect very bad performance along with lots of tempdb activity. Since we’re reading the data in order, we might expect the rate of rows processed per second to be at its maximum at the beginning and for the query to slow down as time goes on. As ID2 increases we need to read through more and more previous rows in tempdb. My intuition does not match reality at all in this case. There’s a silly bug with sys.dm_exec_query_profiles that needs to be worked around (merge join always reports 0 rows returned), but the rate of rows processed per second was linear:

[Image: chart of rows processed per second for the merge join]

The logical reads for the join are also linear, stabilizing at 49 reads per row returned with 10k rows in the base table.

[Image: chart of logical reads per row returned for the merge join]

The number of logical reads per row returned stabilizes at 95 when there are 20k rows in the base table. This doesn’t appear to be a coincidence. Perhaps the data in tempdb is stored in a format that does not preserve the original ordering. We can look at the merge_actual_read_row_count column from the DMV for the merge join to see that the number of rows read per row processed from the inner table approaches the number of rows in the table. The ramp up happens very quickly. The measurements aren’t very precise, but for the 20k row test I saw that the merge join had processed 380000 rows after reading just 20 rows from the inner table. That would explain why the logical reads and runtime of the merge join grow quadratically with the number of rows in the table, while the number of rows processed per second appears to be constant for a given table size.

For the merge join the query optimizer appears to be aware of the problem. Many-to-many joins seem to have quite a bit of a penalty from a costing perspective. With the new CE we get a total query cost of 421.072 optimizer units. The legacy CE gives a very similar cost.

Loop Join Performance


For the loop join we need to pay attention to the clustered index seek in the inner part of the join:

[Image: the clustered index seek on the inner side of the loop join]

The -1 predicate causes trouble here as well. A filter condition will only be used as a seek predicate if it can be safely applied to every row. With the -1 check we don’t have anything to seek against for ID2 on the inner table, so we only seek against ID1. As before, we should expect to read 200010000 rows from the seek. Our expectations are met:

[Image: actual rows read from the seek]

This is the first join that exposes the problem in a direct way in the actual plan XML, which is nice. However, performance will still be quadratic with the number of rows in the table, just as with the hash join and the merge join. This is also the first query that we have a shot at fixing. All we need to do is write the join in such a way that a seek can always be performed against the ID2 column in the inner table. The fact that the column cannot be NULL makes this possible. Consider the following query:

SELECT *
FROM dbo.outer_tbl o
WHERE EXISTS (
	SELECT 1
	FROM dbo.inner_tbl i
	WHERE o.ID1 = i.ID1
	AND i.ID2 BETWEEN
		CASE
			WHEN o.ID2 = -1 THEN -9223372036854775808
			ELSE o.ID2
		END
		AND
		CASE
			WHEN o.ID2 = -1 THEN 9223372036854775807
			ELSE o.ID2
		END
)
OPTION (MAXDOP 1, LOOP JOIN);

Note that I’m only using BETWEEN to work around a wordpress bug.

When o.ID2 = -1 the code simplifies to the following:

 WHERE o.ID1 = i.ID1
AND i.ID2 BETWEEN -9223372036854775808
AND 9223372036854775807

When o.ID2 <> -1 the code simplifies to the following:

WHERE o.ID1 = i.ID1
AND i.ID2 BETWEEN o.ID2 AND o.ID2

In other words, we use the bounds of the BIGINT type to create a seek on ID2 that can always be done:

[Image: seek on both ID1 and ID2]

The improved query returns the results with just 47 ms of CPU time. Only 20000 rows are read from the index seek which is the best possible result for our data set. The estimated cost of the query is dramatically reduced to 3.51182 for the new CE, but that isn’t enough for it to be naturally chosen over the much less efficient hash join. That is unfortunate. As you might have guessed the LOOP JOIN hint is not necessary when using the legacy CE.

Manual Rewrites


We can avoid the problem altogether by changing the join condition to allow joins to work in a much more natural way. Splitting it up with a UNION finishes very quickly with our sample data set:

SELECT *
FROM dbo.outer_tbl o
WHERE EXISTS (
	SELECT 1
	FROM dbo.inner_tbl i
	WHERE o.ID1 = i.ID1
	AND o.ID2 = i.ID2
)

UNION

SELECT *
FROM dbo.outer_tbl o
WHERE o.ID2 = -1 AND EXISTS (
	SELECT 1
	FROM dbo.inner_tbl i
	WHERE o.ID1 = i.ID1
);

Query plan:

[Image: plan for the UNION rewrite]

Splitting it into two semijoins does fine as well, although the hash match (aggregate) at the end is a bit more expensive for this data set than the hash match (union) employed by the other query:

SELECT *
FROM dbo.outer_tbl o
WHERE EXISTS (
	SELECT 1
	FROM dbo.inner_tbl i
	WHERE o.ID1 = i.ID1
	AND o.ID2 = -1
) OR EXISTS (
	SELECT 1
	FROM dbo.inner_tbl i
	WHERE o.ID1 = i.ID1
	AND o.ID2 = i.ID2
);

Query plan:

[Image: plan for the OR rewrite]

Fully optimizing the above query is an interesting exercise but Paul White already covered the problem quite well here.

Final Thoughts


At first glance the following query is a little suspicious but the worst case performance scenario wasn’t immediately obvious (at least to me):

SELECT *
FROM dbo.outer_tbl o
WHERE EXISTS (
	SELECT 1
	FROM dbo.inner_tbl i
	WHERE o.ID1 = i.ID1
	AND o.ID2 IN (-1, i.ID2)
);

In fact, it will even perform well for some data sets. However, as the number of duplicate rows for a single ID grows we can end up with a triangular join. I found it interesting that all three join types scaled in the same way. Thanks for reading!