Performance Pains With NOT IN And NULLable Columns In SQL Server

Performance Pains With NOT IN And NULLable Columns In SQL Server



Thanks for watching!

Video Summary

In this video, I delve into the performance pitfalls of using NOT IN and nullable columns in SQL Server queries. Erik Darling from Darling Data shares his expertise on why you should be cautious when employing NOT IN, especially when dealing with null values, as it can lead to unexpected results and complex execution plans. I explain how using NOT EXISTS instead can simplify your queries and avoid the performance overhead associated with NOT IN, making your SQL code more efficient and reliable.

Full Transcript

Erik Darling here with Darling Data, the most punch drunk SQL Server consultancy on the planet. Anyway, some days we’re just tired. SQL Server is exhausting, isn’t it? It’s like, man who thought he couldn’t possibly get any more tired got way more tired. Anyway, in today’s video, we’re going to talk about a performance peril with NOTIN and Nullable columns. Now, if you’ve been working with SQL Server for, or databases in general, for like longer than 15 seconds, you’ve probably run into someone on the internet saying if you use NOTIN and you hit a null, your query will bail out and return no results, or return results that you wouldn’t expect. And you should use something else to do, to write your query, or do something else to do. something to handle the nulls, right? Is null, coalesce, something, you know, some, some other crappy idea. In this video, I’m going to talk to you about, like, not only like how logically that works, but also from a performance perspective, how even if you don’t hit that situation, SQL Server has to come up with some really weird execution plans to protect itself when you do run into that. So with that, that, that intro complete, I suppose this is where I beg you for beg you for your, your, your, your alms for the poor. If you appreciate my channel content, and you would like to donate $4 a month to keeping Erik Darling alive and somewhat well fed, you can do that by clicking the link down in the video description. It says become a member. You can become a member. It’ll cost you $4 a month. If you don’t have $4 a month, well, if you’re just some freeloading, weird, I don’t know, college student or something, you can like, you can comment, you can subscribe.

You know, whatever floats your boat, strikes your fancy, blows your hair back, keeps your powder dry, whatever, however you, however it makes you feel, you can do those things. If you need help with SQL Server, if you need consulting, well, golly and gosh, you found a young, handsome consultant with reasonable rates right here on YouTube. You can get in touch with me. I can do all this stuff. I can do more stuff. This is just the stuff that I really like doing. So if you have something else, well, I don’t know, suppose we can talk about that. My rates will remain reasonable. If you would like some high quality, low cost SQL Server training that lasts for as long as you live, you can go to that link and then you can enter that code and then you can get all of mine for about $150. It’s a pretty good deal.

If you want to catch me live and in person, I will be at Past Data Summit. November 4th and 5th, I will be delivering pre-cons. Of course, the Past Data Summit itself goes on much longer than that. I even have a regular session on isolation levels. It’s really going to twist your melon, bake your noodle, you know, all that stuff. So you should come to that and you should see me live because it’s even better in person. I promise.

So with that mundane nonsense out of the way, let’s talk about not in and the problems that it can cause. All right. So I’m going to run this, not that it does anything, and I’m going to run a couple of funny looking queries. It’s very, the reason why they’re funny looking and the reason they might strike you as funny looking is not because they’re standing next to me, but because there is no from clause in these.

All right. It’s just select one, X equals one, where select zero is not in one or two, and then there’s select one, where select zero is not in one, select null. Right. I guess there’s a union all in there, too, for good measure. But when we run these, this is sort of what I was talking about from a logical standpoint, where if you have a null, SQL Server is like, I don’t know, can’t possibly figure that out.

You get a blank result rather than what you would expect, which is this. Right. You would expect to see a one come back because zero is not in one or null. Zero is also not in one or two, but we got a result back from that one, whereas this one we didn’t.

Right. So that’s not that’s not a very good time. Now, the point of this video is not that. Right. That is obviously a logical sort of I mean, I’m not going to call it an inconsistency, but it is it is sort of strange.

Right. Because in you have no problem there. So if I were to give you a general rule of thumb, if you’re going to use in or not in, then you should do that only when you have literal values that you write into your in or not in.

Right. Whether there are numbers, strings, dates, something like that. If you are typing the actual values you care about in in or not in, then you’re probably all right. For not in, you do have to pay attention to if the column in the table is nullable or not.

But, you know, that that’s up to you to figure out. But if you need to supply a sub query or anything like that, then you really should be using exists or not exists, because you will you do not have to deal with the same sort of weirdness that you end up with when you use in and not in or rather specifically not in.

But what I what I did is what I’m going to do here is I’m going to prep a couple of tables and I’m going to stop there. Well, I’m going to run that and then I’m going to come over to the query plan. So this is the not in version of the query plan right here.

Right. So this is the thing that I ran with not in after populating these. The thing that you’ll notice is that for both of these tables, these columns have been declared as nullable. Right. This this column is nullable.

Where’s that other one? There it is. This column is nullable. But we have specifically put a whole bunch of not null values into both of these tables. Right. We’ve only put values into the comment table or user IDs from the comment table where user ID is not null.

And we have only put the owner user IDs from the post table where owner user ID is not null. So we don’t have any nulls in there, but the columns are both nullable. And since we define the table that way, that can happen under all sorts of other circumstances, too.

If your column is nullable and you do select into or you create a table and you don’t specify null or not null, you might be surprised at what what the actual table definition is. But this was the query plan that that ran for the not in version.

Right. You can see the not in up here. Select this stuff. That is exactly what I showed you in the other tab. This thing runs for a little over 13 minutes.

For that little over 13 minutes, you’re going to notice some extra stuff. Right. So the query itself, of course, we are just saying from this. Where this column. I didn’t frame that up well at all, did I from this where this column is not in select this column.

Right. That’s all we got there. Right. But we look at the query plan. We see comment post comment comment.

We have three references to comment and one reference one one reference to post. Now, what ends up happening is SQL Server has to do a whole lot of gymnastics to figure out which columns might be nullable or might not be. So SQL Server uses this row count spool over here to start figuring out if any if any rows in the comment table are null.

And it has to count all of these to figure out if it’s going to just bail and give you nonsense. Then we also have SQL Server doing all this crazy stuff where for 13 minutes we have a top above a scan where SQL Server just keeps going in here and doing an anti semi join to the post table to figure out some of the not in. Right. But when we do this, this is when we’re also trying to figure out if there’s any nulls in there that might mess things up.

And then finally, way over here, we have the result of all this work right anti semi joined to the base table itself. So SQL Server had a real tough time and did a whole lot of extra work trying to figure out if there were any nulls in there. Like what’s going to happen? Should I bail early?

Like like like can we even fit? Can we figure this query out? That doesn’t happen with with not exists. So the not exists version of the query and this is, you know, something that when I first started writing SQL, I messed up because I, you know, I was like, holy crap. In and not in. Oh, man. Stuff is weird with them. Like not fully knowing the whole story.

So I just started taking in and not in and writing like exists and not exists. But I would always forget this part of the exists, like the correlating part. So it would just be like we’re not in select something from this table.

And then that that that doesn’t work. Right. Because with exists and not exists, you don’t project out anything from the select list. That’s why I can stick one divided by zero in the select list and nothing happens.

I don’t get an error because SQL Server doesn’t evaluate this. Right. You can put whatever you want in there except an aggregate. It doesn’t matter.

A lot of people think that if you put top and it exists or not exists, it’s somehow faster. It’s not. There is an implied top in there already. Ah, crazy. Anyway, it’s called a row goal.

You should learn about it. When I run this query and say select count from whatever where not exists, this thing, oops, I didn’t turn on query plans. This does not take 13 minutes to run.

This runs pretty quickly. And we just have one simple hash join, a right anti-semi join, which from the comments table to the post table. And we don’t have all that extra work.

We don’t have that row count spool where SQL Server is trying to figure out if there are any nulls in there. We don’t have that weird top above a scan where SQL Server is trying to figure out if it can keep getting rows. If like, should I bother?

Should I bother? Should I bother? What’s this? What’s this? What’s this? Pass it along over here. We don’t have all that. SQL Server just takes care of all this with the not exists in one simple join. And this finishes a lot quicker, right?

It doesn’t take 13 minutes. It takes 1.7 seconds. So when you’re writing queries, do be very careful with not in. Do make sure that if you are going to use not in, like I said, sort of a general rule of thumb, use literal values. Make sure the column that you care about is not nullable.

And, you know, if you are unsure, if you just want to be safe in general, exists and not exists are a far easier way to deal with subqueries and other stuff like that. Because you don’t have all these sort of execution plan oddities that you can end up with when you start using not in for these things. So my general preference is exists and not exists.

But as usual, or, you know, this is your first run around the block. Don’t forget that correlation in the exists or not exists. Otherwise, you could be very surprised by the results.

They are. They might be shocking. You might find that you have found everything or you have found nothing. Right?

Go figure there. Anyway, thank you for watching. I hope you enjoyed yourselves. I hope you learned something. I hope that you will avoid nullable columns. I hope you will avoid nulls. And especially mixing all of that stuff with not in because the results you get will not be happy ones.

All right. Anyway, thank you for watching. Goodbye.

Have a nice day. I hope it’s the best day of your life. Which is secretly, I guess, a curse because every day after that would just pale in comparison. Huh.

Well, maybe you live to be a thousand in interesting times. Life’s funny.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.