How To Write SQL Server Queries Correctly: EXISTS and NOT EXISTS
Thanks for watching!
Video Summary
In this video, I dive into the world of SQL Server queries, focusing on the often-overlooked `EXISTS` and `NOT EXISTS` clauses. These powerful tools can significantly improve your query performance by reducing duplicate results and making your code more efficient. I share practical examples and explain why using these directives correctly is crucial for writing effective SQL queries. Whether you’re a seasoned SQL professional or just starting out, understanding how to leverage `EXISTS` and `NOT EXISTS` can greatly enhance your ability to write optimized and maintainable queries.
I also take some time to address common misconceptions about these clauses, particularly the idea that what you select in an `EXISTS` clause matters for performance. I demonstrate with examples that selecting anything—like one divided by zero or even the entire King James Bible—has no impact on the query’s execution plan. The key is understanding how SQL Server processes these clauses to find matches or non-matches, and why using them properly can lead to more efficient queries.
Full Transcript
Erik Darling here with Darling Data, and we are getting into video weird green screen. I don’t know why that’s happening behind me like that. I think I need to adjust the light over there a little bit. But in this video, we’re going to continue the series on how to write SQL Server queries correctly. And in this video, we are going to talk about the wonderful, the fabulous, the underused, the malnourished, the oft overlooked, exists and not. The number of times I have tuned queries just by changing some paradigm joins in, not in, things like that to use exists or not exists is amazing, quite frankly. So I think you’re going to enjoy this one. But before we do that, we need to talk about how much child support you owe me. Just kidding. If you like this channel and you want to become a member, there’s a link right in the video description. It says something like, become a member, you can choose to pay me $4 a month in child support if you like the video babies that I’m cranking out. If you cannot afford the $4 a month, of course, you can like, you can comment, you can subscribe, you can, you know, show your affection in other ways. I guess it’s sort of like a rich dad, poor dad or something, but you know, whatever. If you need help with SQL Server, in any, for really anything, you have the best SQL Server consulting in the world available to you, and I can do any of these things at a very reasonable rate. I can also do more stuff, depending on sort of what you’re what you’re into and what you need. But these are these are the things that I usually end up doing with people. So I’m pretty good at them at this point. If you need some very high quality, very low cost SQL Server performance tuning training, you can get all of mine, again, via link in the video description that talks about that says, like, buy training, or something, or something, or something, and you can get 75% off. That brings it down to about 150 US dollars. And that is for life. That is not an expiring thing. The only thing expiring is you. Video training will be there until, I don’t know, hard to say. Forever, maybe. I mean, as long as it’s useful. Yes. No upcoming events, end of the year, 2025.
We’ll figure it out. With that out of the way, let’s talk about exists and not exists. Now, I think what’s great about SQL is the structured query language is designed in a way where certain directives is very, very obvious what you are getting when you use them. exists and not exists. You can say, I want to find things where they exist, or say, I want to find things that don’t exist. I don’t know. I think these are wonderful things. What if you were able to find aliens like that, or like a multiverse or something?
All sorts of interesting things could happen. The sort of awful thing about SQL is that it has a lot of rules. And they are selectively applied, just sort of like the English language itself. I have a young daughter who is learning how to read and explaining to her various rules for spelling and pronunciation and grammatical correctness is… It’s a fun challenge. It’s a fun, fun challenge. I, of course, have, you know, my gripes and grievances with SQL.
If you want, like, the pettiest example, I think that instead of select, we should just write get. Not only is it half as long, but it’s far more obvious what you’re doing. When I go to the store, I do not select milk, eggs, steak, butter, salt, pepper, scotch, anything like that. I usually just get them. I get the things that I want. But, you know, that’s, you know, that’s my breakfast of champions. I don’t know what yours is.
But two of the most overlooked things in SQL are exist and not exist. Perhaps they would get more traction if they were called there or not there. But I think if you had to deal with a where clause and a there clause, things would get real weird real quick. But it might be kind of fun to say, like, select star from table where there is not or where there is or, you know, something like that.
Where something is there or not there. I don’t know. It would be fun for me anyway. But whenever I bring these up, people get kind of weird about them because they’ve usually read some, like, really incorrect blog post at some point in their life that says subqueries are bad. And, of course, exists and not exists work off something that looks pretty well like a subquery.
And they’re like, no, can’t do that. Bad, bad, bad, bad. But those people are fools and they read things by fools. And then now we’ve just multiplied the number of fools in the world.
This is the problem with people writing foolish things is that foolish people read them. So you have foolish people writing foolish things that foolish people read and become doubly foolish. The foolish is just exponential.
What is it? Phrase it something. Exponentially caustic foolishness in the world. Now, I think the fun thing about… That shouldn’t be there.
You get out, you idiot. Foolish thing. The nice thing about exists and not exists, and this is something that comes up quite frequently when we’re talking about using these directives in SQL, is people think that something, whatever you put in the select list of the exists makes some difference to performance.
It does not. You can put select star. You can put one divided by zero, which is my favorite thing to do to prove that it doesn’t mean anything. You could put the entire contents of the King James Bible in there.
And guess what? Wouldn’t make a difference. SQL Server throws it away, forgets it ever existed. Likewise, if you add distinct or top or anything like that to an exist clause, it doesn’t matter.
Offset fetch would be another row filtering thing. It doesn’t matter. Group by doesn’t matter, right?
Because SQL Server does not work, does not care about that. It only goes in to find a match or not a match. And there is a one-row goal on that anyway.
So it doesn’t care about that. One thing that we talked about in the joins video is if you have a one-to-many or a many-to-many relationship, SQL Server will show you the results when you use join with exists and not exists. We only care that a row is there or a row is not there.
Right? Don’t need to find… If we find, like, let’s say we have user ID 1 here and we have 10 rows that match user ID 1 here, we say where exists this.
SQL Server does not say, oh, I found one. Oh, I’m going to go find nine more ones. It just says I found a one. We’re good. If we say where not exists and the SQL Server is like, oh, wait, but I found a one. It’s going to say that one is there.
It’s not going to go find that one nine more times to make sure that it still doesn’t exist. Right? So we find one or we don’t find one and we bail out. So both exists.
We already talked about that. Now, let’s say you are a brand new query writer, you know, doing your thing in the world and you have been tasked. Your boss says, hey, pretty please, give me a list of people, of IDs and display names from the users table who have made a post, who have a reputation of one.
Right? What stinks is that if you were to write this query and, you know, the most straightforward way possible, you would get a whole bunch of duplicates back. Why?
Because there are a whole bunch of people who might have made multiple posts, who all have an ID of one. There are a lot of people in here who have that problem. Right?
I mean, just think about how many rows matched community up here. Lots of them. So we’re going to see lots of duplicates in here. Right? Here’s May Taha with like five rows of duplicates. You look at that and you say, ah, boy, I don’t feel like sifting through all that.
Ah, group by, I’d have to write two column names in there. I don’t feel like writing all that. Typing?
For idiots. Typing. Fingers get tired. I’m old. I’m just going to throw distinct up at the top. Okay. Well, you can do that.
And, I mean, the query itself doesn’t actually run that much faster, but we do get the results back faster because we have, we send fewer rows to SSMS to process and put into grid form. So, like, the query plans and the performance don’t matter much here. But you could do this and you could, you know, very easily get back the results that you want to see.
Which is fine. But, uh, that only works kind of up to a certain point performance-wise. After a certain point performance-wise, uh, throwing distinct, especially on a very long column list, uh, can, can, can become pretty painful.
I would, I would strongly advise against using distinct for long column lists. I would strongly advise in favor of, uh, you know, uh, either getting that distinctness some other way. Like, you could add, um, like, you, you might know that there are two or three columns in the results that make up a distinct, uh, tuple.
And you could use, like, a row number or something to just include those three columns to figure out dupes and then filter to where row number equals one. Uh, you could also write the query slightly differently so that, uh, duplicate results are discarded the very moment you start joining rows together. Right?
So, uh, let’s say that we wanted to write this query and we wanted to just get the distinct results from users that have a matching row in posts. Well, that is exactly what exists, exists to do. All right.
So if we run this query and we, uh, get this stuff like the same way that we did before, uh, we’re going to get an execution plan that has, well, this one’s a little misleading. Usually when you write a query that does this kind of thing, you will see, uh, a semi join or, uh, for exists or an anti-semi join for not exists. This one just works a little bit differently.
Uh, and this one just kind of groups by, uh, the post table. Uh, it aggregates all those rows. So there’s only one of each. And then we join a distinct result set of owner user IDs from posts.
All right. Uh, ooh, there we go. The emerge join to the users table. Um, if we throw, let’s just so I can show you the semi join version of this. Let’s say option force order and let’s get an estimated plan for this.
Oh no. It does the same thing. It just reverses it. Nevermind. Okay. Forget that happened. No query plan for you.
Um, so once exists locates a match, um, then it’s, it’s like, cool, we got this row. I’m going to return that row, right? It’s basically inter joining the tables together.
Uh, I see a lot of people attempt to write exists queries. The same way that they attempt to write in or not in queries. And again, this comes down to the column list that comes out of exists does not matter.
Right? Because notice, we notice what we’re missing here. There’s the, we had it up here. This, this was correct.
Right? We, we selected nothing. We selected one divided by zero from the post table, but we had this correlating where clause. Right? So where the owner user ID here matches here, I see a lot of people try to do this and it just doesn’t go well because there’s no correlation in here. So if there’s for every row and posts, SQL Server is like, yeah, yeah, there’s something’s there.
Okay. We, we, we get a thing and then SQL Server just basically gives us a, a, a, a left semi join with no join predicate, which is not what you want. This query is incorrect.
If you write your exists or not exist queries like this, you will be sadly, you’ll be very sad about the results because, um, they won’t, they won’t be right. They will be completely wrong. Right?
So make sure that when you write your exists and not exist queries, they are properly correlated inside of the exists and they do not just look like this because this is bad and wrong. We do not want this happening. Now, one thing that grinds my years, you know, it gets, gets me fired up, angry at the world.
Uh, I want to, I want to drink that 24 ounce glass of scotch, go out, go out there, the baseball bat is whenever I see a SQL tutorial, uh, they give this advice about finding rows in one table that do not exist in another table. And they, they, they seem to all, uh, uh, seem, all seem to favor using a left join to do that. So the basic query pattern, and I’m not saying that you should never do this.
There are times when this will be the better choice, when this will perform better. It is, uh, up to a lot of very, very localized, um, uh, things around like indexing and stuff like that. Uh, but in like, you know, SQLs and the optimizer making good join choices based on good cardinality estimates and things like that.
So things like indexes, the cardinality, uh, estimator you’re using legacy or default, legacy or new or default or new or whatever, old or new, let’s just say. Uh, lots of things can mess up how SQL Server chooses to do these sorts of, how to do these sorts of joins, how to, which physical join operator they choose. Um, so there’s all sorts of things that can make one choice or the other good or bad.
But the basic thing that you do is you left join, you select from the users table, and then you left join to whatever other table. And then you find, typically you use the primary key, um, ID is the clustered primary key in the post table. This is the most common thing that you’ll do.
You could use, technically you could use any non-nullable column that you want, but, you know, clustered primary key is a pretty good choice for that. The problem that you run into is that when you, when you use this pattern, what SQL Server does is it takes both tables, right? We have users here and we have posts here.
SQL Server does not choose for this query to do any early aggregation. Uh, that makes sense for the users table because ID is the clustered primary key. That probably makes less sense for the owner user ID column.
Remember when we did the exist query and I showed you, it did that, uh, aggregate from the post table. So there only one row would come out of that. Um, you know, SQL Server is just like, okay, well, no, uh, no early aggregation for you here.
Uh, we fully, you fully joined both tables together. And then after those two tables get fully joined together, then you filter out rows, right? This is where we start.
This is where we start reducing the result set in our where clause. And if you hover over the filter, you’re going to see that our predicate for that filter is where the ID column from the post table. Again, the primary clustered key is null.
And this is usually a pretty bad, this is usually a pretty bad query pattern to see because you want SQL Server to filter out rows as early as possible. Not join every single possible row and then filter out, uh, rows that don’t match. So, uh, a better way of writing that is of course to use not exists because not exists will tell you which rows aren’t there.
And if we run this, remember this query runs for about 1.6 seconds. Uh, we run this one. This one finishes up in about half the time, about 800 and some odd milliseconds.
But notice we have the different, we don’t have, we have all, we have a fairly close pattern here. Not exactly perfectly, you know, aligned, but, uh, SQL Server does opt for the early aggregation on, on, from the post table, right? So it makes a, it does the aggregation on the owner user ID column.
And now we have a different type of join, right? We don’t have just an outer join with a filter afterwards. We just have the aggregate that does the count afterwards. But we actually have up here is a left anti-semi join, right?
So this means that the rows get eliminated at the join rather than fully joining the tables and filtering them out later. So a lot of times when I’m tuning queries and I see that pattern with the left join where some column is null, like one of my first instincts is to replace that with not exist to see if that gets us any sort of performance improvement.
Most of the time it does, not every single time, but most of the time it will. Um, so, uh, your developer life will be a whole lot less confusing and tiresome, uh, if you make sure that you fully take advantage of all of the things that SQL Server has available to it. Um, the real tough thing, the thing, something that bothers me quite a bit about, um, you know, certain frameworks, like ORMs and any framework being one of them, uh, is that it’s not always obvious, or rather it’s not, it’s not always done correctly where, uh, a join or exist is used where it should.
They are capable of doing it, but it doesn’t always happen. And, and there’s different ways to write those types of queries so that if you’re doing, uh, joins to either find just the existence of something like, say like just to do, like you’re doing the join for the purpose of filtering, not for the purpose of displaying data. If you just need to find rows that match from one table to another, or rows that don’t match from one table to another, exists and not exists are usually the more efficient choices there.
Um, that changes, of course, if you needed to bring data back from the table as well, right? That’s when you would want to use a join because you can’t project data out of exists or not exists. That’s why the select list up here doesn’t matter.
So just kind of keep that in mind when you’re writing queries, uh, that, you know, like you can’t project anything out of exists or not exists. So if you, if we needed columns from the post table, we wouldn’t want to use that. We would, that’s where we want to use a join instead.
But if we’re just looking for rows, there rows, not there exists and not exists will usually get you there faster. Now, uh, the next thing we’re going to talk about in this series is, uh, subqueries of the correlated variety. Um, suppose non-correlated subqueries are a bit more exotic and a bit less useful, but, uh, that’s what we’re going to do.
So, I hope that, I hope that that excites and titillates you and, uh, you’ll stick around to watch that. So, I’m going to do that one next. So, thank you for watching.
I hope you enjoyed yourselves. I hope you learned something. And, uh, what else is in there? I hope that you can’t hear the sirens currently going by. That would be nice.
Um, I’m actually going to, going to listen to this video, the very end of this video, uh, to see if, if the sirens show up on the recording. Because, you know, life in the big city. All right.
Anyway, uh, let’s get going here. I’m just, I’m just, just going on at this point. Goodbye.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.