But You Should
This post isn’t meant to dissuade you from using EXISTS or NOT EXISTS when writing queries. In fact, most of the time I think they make a lot of sense.
But weird things can happen along the way, especially if you don’t have supporting indexes, or if supporting indexes aren’t chosen by the optimizer for various reasons.
In this post, I’m going to show you a query plan pattern that can occur in semi-join plans, and what you can do about it.
Row Goalies
This can happen without a TOP in the query, I’m only using it because it gets me the plan shape I care about quickly, and returns results “quickly” enough.
SELECT TOP (1) c.* FROM dbo.Comments AS c WHERE NOT EXISTS ( SELECT 1/0 FROM dbo.Votes AS v WHERE c.PostId = v.PostId ) ORDER BY c.CreationDate DESC;
The part of the plan that we care about is on the inner side of the semi join, where the TOP is applied in a funny way.
This is a very good example of where query costing can be quite off. The part of the plan that is responsible for most of the runtime is estimated to cost 0% of the total cost.
Closer
Here’s what’s going on in those two operators!
The optimizer thinks it can find a matching row in three executions, but it takes 70.
It also takes reading 2,630,801,081 rows to return 49 rows. There are 52,928,700 rows in the table.
Yes, if you multiply 52928700 * 49 = 2630801081. Technically it’s 49.7, but the plan shows rounded numbers. Close enough for the optimizer, so they say.
Index Not Exists
Without an index, a hash join plan would have been a much better option, which we can get by adding an OPTION(HASH JOIN)
hint to the query.
While ~12 seconds is great, we can do better. I’ve heard about these things called indexes.
How do we indexes?
An Index, You Say
It looks straight forward to create an index that should get us the plan we want, if we create it on the PostId column on the Votes table.
CREATE INDEX v ON dbo.Votes(PostId);
Assuming that the plan shape otherwise stays the same, we’ll scan the Comments table, Sort by PostId, and then do a Seek into the Votes table on the inner side of the join.
Right?
While it is quite a bit faster, it’s not at all what we want. And hey, look, the costs and times here are all screwy again.
Stop looking at costs.
Adding a FORCESEEK hint gets us the plan shape we care about, of course:
SELECT TOP (1) c.* FROM dbo.Comments AS c WHERE NOT EXISTS ( SELECT 1/0 FROM dbo.Votes AS v WITH(FORCESEEK) WHERE c.PostId = v.PostId ) ORDER BY c.CreationDate DESC;
Along with a much faster runtime:
Gosh Wrong
If you’re keeping score at home, the optimizer chose:
- A TOP > Scan that took 30 seconds instead of a Hash Join that took 11 seconds
- A Hash Join that took 5.2 seconds instead of a Top > Seek that took 2.2 seconds
But I bet I know what you’re thinking. We need an index on the Comments table. And I bet you’re thinking it should look something like this:
CREATE INDEX c ON dbo.Comments (PostId, CreationDate) INCLUDE (Score, UserId, Text);
But, prepare you biggest internet sigh, that just won’t work out.
Much like above, if we add a FORCESEEK hint, we get a better/faster plan that looks a lot like the plan above.
Flip That Script
The only index that gets us a respectable execution plan without a bunch of hints is this one:
CREATE INDEX c ON dbo.Comments (CreationDate, PostId) INCLUDE (Score, UserId, Text);
Which gets us this execution plan:
Thanks for reading!
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.
I nearly crushed on my chair laughing when you said “I’ve heard about these things called indexes”
?