How To Write SQL Server Queries Correctly: IN and NOT IN

Thanks for watching!

Video Summary

In this video, I delve into the nuances of using `IN` and `NOT IN` clauses in T-SQL queries, particularly focusing on how these directives behave when dealing with nullable columns. I share practical examples from my series on writing SQL Server queries correctly, emphasizing why `EXISTS` and `NOT EXISTS` are often superior choices due to their consistent handling of nulls and better performance. I also address the complexities that arise with `NOT IN`, such as the defensive query plans generated by SQL Server when null values are involved, which can lead to significant performance issues. The video concludes with a comparison between using `IN` and `NOT IN`, highlighting the importance of understanding these behaviors for writing efficient and reliable SQL queries.

Full Transcript

Erik Darling here, with Darling Data. Look at that nice blue. Really brings out the suffering in my eyes. Anyway, today’s video, guess what? We’re going to continue my series on how to write SQL Server queries correctly. And in today’s video, we are going to cover in and not in, because there are some funny things about these directives in T-SQL. But of course, before we get on with that, let’s talk about you and me, and you being nice to me for once. Mom. If you would like to sign up for a membership to my channel, there’s a link right in the video description that says become a member. You can do so for as little as $4 a month. If $4 a month would of, want. www, Eh?upg, not 100문in.

2015. *** buscar leads. directives to use in SQL in general. We’re talking about T-SQL specifically, but these rules will generally apply across all SQLs. So just really, you know, stick to exist and not exist as much as you can. In and not in, you know, if it’s literal values within, fine. But even if it’s literal values with not in, you have to be very careful about nulls. And you have to write your query defensively to explicitly remove nulls in order to get that working. Now, I did a performance focus video about not in, and I’m going to rehash a little bit of that. Again, oh, why aren’t you spelled out correctly? Oh, I hate when I use old code where I didn’t use the full word integer.

It’s very embarrassing for me. But I’ve already created two tables that allow, that are integers that allow nulls. And I’ve written two versions of, I’ve populated them explicitly from two different tables in the Stack Overflow database where no null values ended up in the tables, but the columns are nullable. Now, the big problem here is that SQL Server adds sort of like without even like announcing itself unbeknownst to you, adds a whole bunch of defensive stuff to the query plan in the event that a null occurs in the results. I’m going to show you what that looks like. So what I’m going to show you is this is the estimated plan for this. It does all sorts of things in here.

The estimated plan doesn’t really, like this is mostly just to show you that there is a lot of added complexity. We hit old users once, twice, three times, and we touch new users once. But the actual execution plan, you can see just how painful this was. This ran for over 20 minutes, 21 minutes, 22 seconds. And a lot of the problem in this query was, it’s a query plan pattern that is really, really bad. But SQL Server just inserts it when you use not in with nullable columns, right? See, like, you might see this query plan pattern in other places in SQL Server, but this is where SQL Server just throws it in there to protect itself. So you have this top above a scan. And like, if you see a top above a scan, and the scan is on a big table, you’re in real trouble. That’s never any fun, because this nested loops join is going to make this do a lot of work, right? So you have a lot of rows that come out of here, you have a lot of rows that go into the loop join, and you have a lot of scans of the old users table. That’s a bad time. Now, what I said about using not exist, like not having to worry about having perfect indexes, or writing really complex, or rather, adding complexity to the query to like, like hard code protection against nulls. Like, sure, I could add an index on old users on the the user ID column, or whatever I called it. And there would be a top above a seek, and that would be faster. But you would still run into the sort of logic, I’m going to call it a logical inconsistency, even though it’s consistent behavior. It doesn’t feel right to me for not in to screw up with nulls the way it does, or to handle nulls the way it does. So you could add an index, sure, and that would be faster. But now you have to worry about, you know, indexing your temp tables every single time you do this pattern. You could add a bunch of explicit not null checks on the old on the new users and old users table, you could write that query very defensively. Or you could not worry about any of that stuff, and just use the not exist version. This, that’s what I’ve done here, we say, you know, select the records from new users where not exists, correlate on that. And rather than taking 20 minutes, this runs in a few seconds, actually about two and a little under two and a half seconds.

So apart from the logical reasons for avoiding not in when with nullable columns, because, you know, let’s face it, a lot of people out there are afraid of making a column not null, because who knows, right, for the same reason that developers will make every string column varchar 255, or even a max data type, just because they’re afraid of truncation errors, or who knows what, even though the column is like state code, and you’re like, wow, like Massachusetts, New York, California, NACA, NYCTA, like none of these things are ever going to be 255. But who knows what will happen? Maybe, I don’t know, I don’t even have a reasonable thing to put in there. Maybe someone will like, say they live in every state or something, and you’ll have, you know, 50 times two, and that’s 100, but not even 255. Wow, that’d be real hard to do that. But developers screw things up constantly. That’s why I am a consultant with reasonable with reasonable rates, fixes these things. Okay, so anyway, you are far better off from both a performance and a consistency point of view, using exists and not exists over in and not in. Like I’ve said a few times now, it doesn’t really matter with in, because in does works the way that the same way that exists does, regardless of nulls.

So if you have a list of literal values or a subquery, you are safe using in. If you have a list of literal values or a column to column comparison with not in, and those columns are nullable, but don’t contain any nulls, you can end up with a really wacky query plan unless you explicitly filter out nulls with your query. And you have really good indexes in place to support that query.

With not in, if you like, if you just have a column and a list of literals, you still have to worry about that column, because any nulls in that column will make the list of literals from the not in clause misbehave, right? So just be like, be wary out there, exists and not exists are the better choice, because they both act consistently with probably what you would want to get back for your query results. Since there are no nulls, the first query returns correct results, but the amount of work SQL Server has to do to make sure that it doesn’t encounter any nulls or that it can behave safely if any nulls is pretty absurd, right? 20 something minutes to do that work sucks.

You can, of course, index the temp tables, but a lot of people have read one single blog post in their entire career about that and think that it’s always a bad idea. So it’s a separate conversation. But, you know, I’m going to be honest with you, I’ve had a lot of really good luck indexing temp tables in my life. So, what can I tell you? What can I tell you? The SQL Server has room for a breadth of experiences in when it comes to performance tuning. But anyway, I was going to say something else, but now I’m just giggling internally and I’ve lost my train of thought. So that was just about it for in and not in. Again, exists and not exists are usually far better options. Next up, I’m going to talk about CTE a bit. I actually have two videos coming up for CTE and they both sort of have different approaches to them. So there’s one that I’m going to talk about next and there’s one that I’m going to talk about, one that I’m going to go at the very end of the series. So there’s multiple contents on common table expressions because there is quite a bit to say about them. So anyway, thank you for watching. I hope you enjoyed yourselves. I hope you learned something and I hope to see you in the next video about common table expressions, which is going to be fun.

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.