Sweetness, Sweetness
I’ll be brief here, and let you know exactly when I’ll use IN and NOT IN rather than anything else:
- When I have a list of literal values
That’s it. That’s all. If I have to go looking in another table for anything, I use either EXISTS or NOT EXISTS. The syntax just feels better to me, and I don’t have to worry about getting stupid errors about subqueries returning more than one value.
For IN clauses, it’s far less of an ordeal, usually. But for NOT IN, there are some additional concerns around NULLable columns.
Of course, actual NULL values really screw things up, but even when SQL Server needs to protect itself against potential NULL values, you can end up in performance hospice.
Historical
First, a little bit of history. With NOT IN. Take the below script, and flip the insert into the @bad
table variable to use 2 instead of NULL, after you’ve seen what happens with NULL.
DECLARE @good table ( id int NOT NULL ); DECLARE @bad table ( id int NULL ); INSERT @good ( id ) VALUES (1); INSERT @bad ( id ) VALUES (NULL); /*Change this between NULL and 2*/ SELECT records = COUNT_BIG(*) /*Should be 1, or something*/ FROM @good AS g WHERE g.id NOT IN ( SELECT b.id FROM @bad AS b );
You’ll see pretty quickly that NOT IN gives you wonky results when it hits a NULL.
This is known.
Imaginary NULLs
Let’s take the below setup. Though each table allows NULLs in their single column, no NULL values will be inserted into them.
CREATE TABLE #OldUsers ( UserId int NULL ); CREATE TABLE #NewUsers ( UserId int NULL ); /* But neither one will have any NULL values at all! */ INSERT #OldUsers WITH (TABLOCK) ( UserId ) SELECT p.OwnerUserId FROM dbo.Posts AS p WHERE p.OwnerUserId IS NOT NULL; INSERT #NewUsers WITH (TABLOCK) ( UserId ) SELECT c.UserId FROM dbo.Comments AS c WHERE c.UserId IS NOT NULL;
The real lesson here is that if you know that no NULL values are allowed into your tables, you should specify the columns as NOT NULL.
I know, it’s scary. Really scary. Errors. What if. How dare.
But these are the kind of adult decisions you’ll have to make as an application developer.
Be brave.
Protect Your NULL
The big problem with NOT IN, is that SQL Server goes into defensive driving mode when you use it under NULLable conditions.
I don’t have a license because it would be irresponsible, and I’ve lived my entire life in big cities where having a car would be more trouble than it’s worth. But I assume that when I see people complain about drivers not knowing what to do the second there’s a rain drizzle or a snow flurry somewhere in the area is a similarly annoying scenario out there on the roads. All of a sudden, seemingly competent drivers turn into complete basket cases and drive like they’ve got a trunk full of dead bodies clowns.
Here’s an example of a bad way to deal with the situation, vs. a good way to deal with the situation:
/*Bad Way*/ SELECT records = COUNT_BIG(*) FROM #NewUsers AS nu WHERE nu.UserId NOT IN ( SELECT ou.UserId FROM #OldUsers AS ou ); /*Good Way*/ SELECT records = COUNT_BIG(*) FROM #NewUsers AS nu WHERE NOT EXISTS ( SELECT 1/0 FROM #OldUsers AS ou WHERE nu.UserId = ou.UserId );
Note the very professional formatting and correct syntax. Ahem. Bask.
Count to 10 while you’re basking.
Results
The resulting execution plan for each example should be illuminating. Here they are:
The NOT IN version takes 15 minutes, and the NOT EXISTS version takes 415ms.
Reality Bites
Since there are no NULLs, the first query returns the correct results. But the amount of work SQL Server has to do to make sure there are no NULLs is both absurd and preposterous.
If you’re like me, and you want to throw the query optimizer in the garbage every time you see a TOP over a Scan, you might say something like “an index would make this really fast”.
You wouldn’t be wrong, but most people either:
- Take bad advice and never index #temp tables
- Create nonclustered indexes on #temp tables that don’t get used
Well, you get what you deserve pay for.
Thanks for reading!
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.