Learn T-SQL With Erik: Subqueries With NOT IN

Learn T-SQL With Erik: Subqueries With NOT IN


Video Summary

In this video, I delve into the intricacies of subqueries in T-SQL, focusing on a specific scenario involving the `NOT IN` operator and nullable columns. Erik Darling from Darling Data provides a detailed walkthrough using the Stack Overflow data dump as an example, illustrating how null values can lead to unexpected results when used with `NOT IN`. The video explores different query approaches to achieve accurate results, including modifications to handle nulls effectively and comparing `NOT IN` with `NOT EXISTS`. By examining execution plans closely, viewers gain insights into why certain queries perform differently and how to write more robust T-SQL code. This is part of a preview for upcoming beginner-level material, which can be purchased now at half the future price. If you’re attending the PaaS Data Community Summit in November, this content will align well with the pre-cons taught by Kendra Little and me, offering free access to all materials.

Full Transcript

Erik Darling here with Darling Data. In today’s video, we’re going to learn a little bit more about T-SQL here. In this video, we’re going to talk about subqueries. Now, of course, this is a preview of the beginner material that I’m putting out. This is for presale down in the old video description somewhere about in this area, this vicinity somewhere over in here. You can purchase everything now for half the price that it will be when the material is fully published. This is, of course, companion material, not like full material, but companion material to the pre-cons that Kendra Little and I will be teaching at PaaS Data Community Summit this November. And if you’re coming to those, you will get all of this stuff for free because they will mesh fairly well. Anyway, let’s talk a little bit about like the particular form of subquery with not in. Now, it’s at least fairly well known amongst SQL people that when you use not in and one side is or both sides are nullable, things get kind of weird and you need to think a little bit harder about how you write your queries. So I promise this isn’t like just that isn’t the lesson I’m going to teach you. We’re going to go further than that. But that’s sort of the starting point. All right. So like if we say if we run this query, and so what we’re looking at here is a query where both sides of the not in expression are nullable. So in the votes table, wake up, zoom it. Come on, buddy, you can do it. And the votes table, the user ID column is not only nullable, but contains lots of nulls. Because this table is heavily scrubbed in the public data dump because like it would be unfair to publish people’s like voting records out in the wild, right? You vote and it’s supposed to be a private thing, right? How you vote is supposed to be between you and the piece of paper you jam a hole in.

Don’t want any hanging chads in the Stack Overflow community. And then like in the comments table, some of the user IDs are null. This column is again nullable and contains nulls, right? So there are nulls on both sides of this not in expression. When, of course, it’s only if there are nulls on the inner side of the not in expression that we get the sort of incorrect result-y stuff from it, right? So if we run this, this is going to run for about four seconds and it’s going to return a count of zero. So it’s going to look like there are no matches in these.

But of course, we look at the execution plan. We’ll see that, I mean, we’re not going to see that there were matches. What we’re going to see is YSQL came to the conclusion that there were no matches, right? So we have what, so up at the very, very top here, this would be the actual join between votes and comments, right? This set, like this here, this is the comments table. This is the votes table. And this is where they would get joined together to produce a result from that subquery.

But you can see about right here is where SQL decides that there are no matches whatsoever, right? That’s where we, that’s where we first hit a zero. So let’s examine the query plan a little bit. What we have down in this section where SQL Server decides that we have zero rows is, like, we have the votes table, which gets left anti-semi joined to a, to the comments table here. And there’s a row count lazy spool here that counts a whole bunch of rows.

And then there’s another left anti-semi join, which we just talked about, which is where we hit zero. And then down here, there’s another copy of the comments table and another row count spool. So a row count spool just does exactly kind of what the name implies. It counts rows.

This iteration of the, or rather this reference to the comments table is just counting all of the rows in the comments table. It counts all, like, 24.5 million rows or something. Then this one down here is counting something a little bit different.

This one is looking for nulls, right? Because this column is nullable, but may not contain nulls, right? Just because a column allows nulls does not mean anyone has put a null in there.

SQL Server needs to do this to figure out if there is a null. And then if there is a null, right? Like, this is where the semi joins get kind of funny. So, like, here we’re just counting rows, right?

And we’re joining that count of rows to the votes table here, right? This left anti-semi join, which means, like, not exists, not in. And this one, but the predicate on this one is just where the votes, the user ID column in the votes table is null, right?

So, v.userID is null. So, it’s like, okay, are there nulls here? I don’t know.

And then this one doesn’t actually have a predicate on it. What are we left anti-semi joining to? Well, we’re just figuring out if there are nulls in this table. And if there are nulls, then this returns no rows, right?

This is where we go from 3, 5, 8, 2, 7, 1 to 0. So, what we can do to get a somewhat correct result, somewhat, is do this, right? We can say, select count from votes where the user ID column is not in this again.

But this time, we’re going to get rid of nulls from the comments table, okay? Now, I want you to pay attention to what this number turns out to be. Do-do-do-do-do.

This does not really get all that much faster. It may be, like, about a second faster. The last query took about 4 seconds. This one takes about 3 and change. But the result that we get back is 293716. I just want you to keep that number in mind.

Because it is a number, but is it the right number? We’re going to figure that out. And we’re going to ask a probing question about data to help you write your queries correctly, right? So, if we look at the execution plan for this one, we are down to one row count spool right here.

We don’t have the two row count spools anymore. And this one is just looking, this one is just counting where c.userID is not null. And then we have our left anti-semi join here to where v.userID is null.

And then we have our actual join up between votes and comments here on userID. So, we get 293716 back from this. Is that number right or is that number wrong?

Well, one way to sort of visualize what that thing is actually counting is to write the query like this. Oh, and I have all these ridiculous things popping up on my screen now. So, if we add in this predicate, we will get rid of both row, we will have no row count spools in the query execution plan anymore.

Because SQL Server will no longer have to track this. And we’re going to get the same number back. But this is where I want you to ask yourself if this is the right number or not.

Because SQL Server can only count where userID is not a null mark, which is what it gave us in the previous query. And it’s exactly what we get in this query too. Right?

We run this. It still takes about three seconds. We still get 293716. If we look at the execution plan, there are no more row count spools. There is just one join here.

And the number of rows that leave that join is 293716. And that’s what we count here with the stream aggregate. Is this right or wrong? Well, one way to find out if you think this is right or wrong is to write the query a little bit differently.

Now, this does help us move into the next topic that we’re going to talk about, which is going to be exists and not exists. But what happens if we run this query? Because, see, like whenever people talk about not in and how it can return incorrect results if there are nulls on the inner side of the not in, is they will tell you to either screen out nulls from the query itself.

Right? That’s one way to protect yourself from getting incorrect results or incorrect results. Or they’ll tell you to use not exists.

But not exists will fundamentally give you a different result here. Okay? So let’s look at, if we run this query, this is going to run a lot faster. Right?

This finishes in about a second. But look at the number we get back. Right? Four, nine, six, three, nine, six, five, five. That is an eight-digit number of rows that we get back from this count, which is much higher than the 293,716, I believe, that we got back from the other query. So this is counting all of the rows in the votes table where the values don’t exist in the comments table for user ID.

Right? And that includes the nulls. Remember, like, not exist and exist and not exist handle nulls in ways that, or rather, not exist handles nulls in a much better way than not in handles them.

So we still do that join, but we’re counting all of the rows and votes that are still null. Right? Because those don’t match.

So depending on what you want to actually count, you might need to write this query differently in either case. If you want to count of all the rows, including nulls, in the votes table that don’t, where the user ID there doesn’t exist in the user ID column in the comments table, then just write not exists. But what happens if we change that not exist query to also get rid of nulls from the outer side?

Right? So we’ll no longer count null rows from votes. What do we come back to?

293.716. So you can kind of get a sense here that the data, like, what query results are correct depends on what you’re actually looking for. If you’re purely looking for matches between votes and comments, then you’d, like, you would want to screen out nulls from the votes table.

If you want to count all the rows, including nulls from the votes table that don’t match the user ID in the comments table, then you could just write not exists and leave it at that. So it really does depend on what you’re looking for here, how you want to write this query to get back the correct results. Anyway, that’s about what I wanted to talk about here.

Thank you for watching. Thank you for watching. Again, this is just a small portion of the material. You can see we’re down in, like, the 300 rows here. There’s a whole bunch of other stuff that I didn’t cover because it’s a preview. It’s supposed to urge you to buy the video course so you see all the content.

Cool. So, thank you for watching. I hope you enjoyed yourselves. I hope you learned something. I hope that you will now have something very interesting to think about when you are writing your queries.

And, what else? I don’t know. I think that’s about it. Anyway, it is Saturday here, so I’m going to go do something else now with my entire day.

So, thank you and goodbye.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.