Look, I really like EXISTS and NOT EXISTS. I do. They solve a lot of problems.
This post isn’t a criticism of them at all, nor do I want you to stop using them. I would encourage you to use them more, probably.
But there’s some stuff you need to be aware of when you use them, whether it’s in control-flow logic, or in queries.
If you keep your head about you, you’ll do just fine.
The issue you can hit here is one of row goals. And a T-SQL implementation shortcoming.
If I run this query, it’ll chug along for about 10 seconds.
IF EXISTS ( SELECT 1/0 FROM dbo.Posts AS p JOIN dbo.Votes AS v ON p.Id = v.PostId WHERE v.VoteTypeId = 1 AND v.CreationDate >= '2018-12-01' AND p.PostTypeId = 1 ) BEGIN SELECT x = 1; END;
The part of the plan that we care about is a seek into the Votes table.
SQL SERVER’S COST BASED OPTIMIZER™ thinks that 2.52 (rounded to 3) rows will have to get read to find data we care about, but it ends up having to do way more work than that.
It’s worth a short topic detour here to point out that when you’re tuning a slow query, paying attention to operator costs can be a real bad time. The reason this query is slow is because the costing was wrong and it shows. Costed correctly, you would not get this plan. You would not spend the majority of the query execution time executes in the lowest-costed-non-zero operator.
Normally, you could explore query hints to figure out why this plan was chosen, but you can’t do that in the context of an IF branch. That sucks, because a Hash Join hinted query finished in about 400ms. We could solve a problem with that hint, or if we disabled row goals for the query.
In order to tune this, we need to toggle with the logic a little bit. Rather than put a query in the IF EXISTS, we’re going to set a variable based on the query, and use the IF logic on that, instead.
DECLARE @do_it bit; SELECT @do_it = ( SELECT CONVERT ( bit, ISNULL ( MAX(1), 0 ) ) FROM dbo.Posts AS p JOIN dbo.Votes AS v ON p.Id = v.PostId WHERE v.VoteTypeId = 1 AND v.CreationDate >= '2018-12-01' AND p.PostTypeId = 1 ) OPTION(HASH JOIN); IF @do_it = 1 BEGIN SELECT x = 1; END;
This produces the fast plan that we’re after. You can’t use a CASE expression here and get a hash join though, for reasons explained in this post by Pablo Blanco.
But here it is. Beautiful hash join.
EXISTS With OR Predicates
A common query pattern is to is EXISTS… OR EXISTS to sort out different things, but you can end up with a weird optimizer query rewrite (SplitSemiApplyUnionAll) that looks a lot like the LEFT JOIN… IS NULL pattern for finding rows that don’t exist. Which is generally a bad pattern, as discussed in the linked post.
SELECT c = COUNT_BIG(*) FROM dbo.Users AS u WHERE u.Reputation > 1000000 AND EXISTS ( SELECT 1/0 FROM dbo.Badges AS b WHERE b.UserId = u.Id ) OR EXISTS ( SELECT 1/0 FROM dbo.Comments AS c WHERE c.UserId = u.Id );
This is what I’m talking about, in the plan for this query.
Rather than do two semi joins here for the EXISTS, we get two right outer joins. That means (like in the linked post above), all rows between tables are joined, and filters are applied much later on in the plan. You can see one of the right outer joins, along with the filters (on expressions!) in the nice picture up there.
The fix here, of course (of course!) is to write the query in a way that the optimizer can’t apply that foolishness to.
SELECT c = SUM(x.c) FROM ( SELECT c = COUNT_BIG(*) FROM dbo.Users AS u WHERE u.Reputation > 1000000 AND EXISTS ( SELECT 1/0 FROM dbo.Badges AS b WHERE b.UserId = u.Id ) UNION ALL SELECT c = COUNT_BIG(*) FROM dbo.Users AS u WHERE EXISTS ( SELECT 1/0 FROM dbo.Comments AS c WHERE c.UserId = u.Id ) ) AS x;
This query completes in around 1.5 seconds, compared to 4.9 seconds above.
It’s rough when you run into these problems, because solutions aren’t always obvious (obvious!), nor is the problem.
Most of the posts I write about query tuning arise from issues I solve for clients. While most performance problems come from predictable places, sometimes you have to deal with edge cases like this, where the optimizer mis-costs things.
Thanks for reading!
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.
- SQL Server 2017 CU 30: The Real Story With SelOnSeqPrj Fixes
- Getting Parameter Values From A SQL Server Query Plan For Performance Tuning
- Getting The Top Value Per Group With Multiple Conditions In SQL Server: Row Number vs. Cross Apply With MAX
- Getting The Top Value Per Group In SQL Server: Row Number vs. Cross Apply Performance