Performance Tuning Semi and Anti-Semi Joins In SQL Server

Thanks for watching!

Video Summary

In this video, I delve into tuning semi-joins in SQL Server, specifically focusing on `EXISTS` and `NOT EXISTS` queries. Semi-joins are powerful tools that can simplify complex queries but often introduce performance challenges due to row goals and unexpected query plans. I discuss the importance of proper indexing and demonstrate how adding a hash join hint or creating targeted indexes can significantly improve query performance, sometimes reducing execution time from over a minute down to just eight seconds. The video also touches on the nuances of sorting and the use of batch mode for large datasets, providing practical solutions for optimizing these types of queries in your own work.

Full Transcript

Erik Darling here with Darling Data. Getting a little bit more night recording in. It’s making my green screen lighting a little bit weird, but we’re not going to let that stop us. Apparently the daylight that normally comes in through my office window over there that you can’t see makes a pretty big difference in not casting strange shadows behind me. Who would have thunk it? More light is better when it comes to green screens. But anyway, in this video, this is a video, right? Camera’s on, microphone’s on. Remember to hit record this time. We’re good. We’re going to talk about tuning semi-joins. Semi-joins, of course, generally crop up when you use exists or not exists. And you can have either a semi-join or an anti-semi-join. A semi-join is, of course, for exist queries, and an anti-semi-join is for not exists queries. And both of them are wonderful, spectacular things when used correctly and appropriately. And I’m a big proponent of using exists and not exists because they are wonderful natural extensions of the SQL language that you should be taking advantage of. All the things that you can do to attempt to replace them.

You know, like writing a join with a distinct, a derived join with a distinct in it, or writing a left join where, you know, correlated and then where, you know, the left join to table filtered with like some primary key or not nullable column is null to find rows that don’t exist. Ah, boy. It’s just a lot more work and effort than it’s usually worth. So before we get into that, we of course need to talk about you and me, our future together. If you would like to say thank you for all of the video content and get in on the ground floor of many great and wonderful Darling Day to things that are set to come up in the new year, you can sign up for a membership to this channel for the low, low cost of $4 a month. Not bad.

Not bad. Even with inflation. If you are unable to scrounge $4 out of the couch cushions this month or any month, you can of course interact with my channel in many meaningful, ingratiating ways. You can throw me some likes on the videos. You can throw me some comments on the videos. I do love interacting with each and every one of you almost on a daily basis. And of course, if you subscribe to the channel, you will get these wonderful, wonderful, near instantaneous notifications every time I publish a video. If you find yourself needing help with SQL Server and you need any of this type of stuff done, I am very good at these things. I’m quite good at these things at this point in my career.

So if you are looking for SQL Server help for any of these things, my rates are reasonable. If you need help with something else dealing with SQL Server, I can guarantee you my rates will remain reasonable. If you would like to get some high quality, low cost training, you can do that. You can get mine for that. You can’t get anyone else’s for $150 for life for 24 hours of performance tuning content. Good luck with that.

I don’t even think Pluralsight can beat those numbers, but I don’t think anyone cares about them anymore. Right? A little dicey. But if you use the links in the show description, either to sign up for a membership or to buy my training, those are other good things you can do both for me and you. I will be speaking live with a whole lot of gumption in Seattle, Washington, November 4th and 5th at Past Data Summit, co-hosting two fantastic, glorious days of performance tuning pre-cons with Kendra Little.

And of course, if there is an event nearby you where you would like to single white female me, you can do that by telling me which one might need a pre-con speaker. Because doing the pre-cons, doing the payday and covering some of the travel stuff. I do not make an extraordinary amount of money from pre-cons.

Just covering the travel stuff is a good way to get me to show up somewhere and, you know, I don’t know, whatever you want to do. Smell my hair. Follow me around.

Steal my silverware after I use it. Whatever it is, I don’t know. I don’t know what you’re up to. I don’t know what goes on in that deviant brain of yours. You can do that.

But until then, let’s talk about these here semi-joins. Now, one very, very interesting thing about semi-joins is that they often introduce a row goal into your query. So you might notice if you look at the query that I’ve written here, we have an outer top one for the comments table.

But when we look at the query plan, we’re going to see a rather mysterious top operator in our query plan. Perhaps one that we did not plan on seeing in our plan. An unplanned plan.

So if we look in here, we’re going to see we have our outer top out here, right? That’s our presentation top. We have an order by to make that correct. We have an order by on creation date descending, which is a non-unique column.

And then we continue the order by, extend the order by with the ID column from the comments table, which is a unique column. So you have that tiebreaker in there to make sure that we get consistent results back. If we did not have that in there, especially with parallel execution plans, we could see all sorts of strange things happen.

Now, something that we’re going to wrestle with in all of these plans is sorts. But before we wrestle with sorts, we’re going to wrestle with one of my least favorite plan shapes of all time. And that is when you have a top above a scan.

Right there. We spend 55 seconds in the top above the scan. Now, remember when I said that exists and not exists introduce row goals.

That row goal is there because with exists and not exists, we care not about duplicates. We either find something or we don’t find something. And quite often, SQL Server will use sort of an injected top into the query plan to do that.

We just keep finding the top one over and over and over again. The problem is that every time we run this top one for a row that comes out of here, we have to scan and scan and scan in here. So we end up doing so even though, let’s get this right.

Even though there are only this many rows in the votes table, this is how many rows we end up reading over all of those top scans. Right. That is a fairly brutally large number.

Right. And that is absolutely no fun whatsoever. If we go and we look at the properties of this and we look at this, we will see some really big numbers on all these threads adding up to this really big number on this thread. So SQL Server did not have a good time in here.

Now, part of the reason for this is that we don’t have a good index to support any part of this. We’re going to get to that in a second. But before we do, what you should know is that oftentimes when you end up with a plan of that wretched nature, you can fix it pretty quickly and easily. Without doing any further optimizations, you can fix it pretty quickly and easily.

Now, let’s remember, this thing ran for about a minute without any intervention. If we run this with just a hash join hint, SQL Server will no longer have that nested loops with the top above the scan. We’ll just do a big scan of comments and a big scan of votes, and we will end up in much better shape here.

So big scan over here, big scan over here, hash join in here. The whole thing takes just about eight seconds. So going from one minute to eight seconds is a pretty good improvement right off the bat.

Let’s experiment with indexes a little bit, though. So the first thing that I want to index, because it’s the simplest index to create, it’s the only column that we care about on the vote side of things, is the post ID column.

It’s not a unique column. Obviously, because people can, you know, cast multiple votes on a post. So we can’t make a unique index here, but we can at least create an index to make this part of the query, you know, give that inner side of the query a little something to work with.

The trouble that you run into a lot, though, is once you add a good index, SQL Server starts thinking a little too much about things, and they aren’t good thoughts. So let’s run, well, actually, let’s run this query.

I don’t know why I got the estimated plan there. I was thinking about something else for a minute. So let’s run, get this going. And this runs for about five seconds. We shaved another, like, three seconds off the initial query, but it’s kind of stupid.

The reason it’s kind of stupid is because we fully scanned that nonclustered index over here, and then we fully scanned this clustered index over here. And even though it’s a little bit more efficient, you have to kind of wonder why SQL Server wouldn’t seek into the index that we just created, because that, right, if we create an index, we now have a seekable thing for that exist clause.

So let’s see what that query plan might look like. Let’s get the estimated plan here. All right, so now we have this one here with a top above a seek.

Let’s see how this goes. All right. I am mostly happy with this. We put this down to 3.5 seconds, right?

But now we have this sort of interesting thing over here. Most of our problem in this query is SQL Server needing to order by creation date descending and comment the ID column in the comments table ascending.

Okay. Well, well, you know, it’s kind of weird, right? Isn’t it? Put this stuff in order from the comments table.

Okay. Well, I mean, I guess we can do that over here. We could do that over here after we, like, found stuff. But, you know, SQL Server doing this over here. Sometimes that’s a good plan. All right. So let’s create an index on the comments table.

And what this index is on is it’s on post ID, right? Because post ID in the comments table is what we’re correlating on to the post ID in the votes table. And then we’re going to put creation date second.

And the hope here is that, you know, if you’ve watched any of my other videos about how indexes hold data, that for every post ID we find, which is within a quality predicate, the creation date column will be in order. And since this table has a clustered primary key on the ID column, that ID column, and we have a, this nonclustered index is not unique.

The ID column from the comments table will be a hidden third key column here. So we would technically have post ID and creation date and ID in this index in the order we want. All right.

So let’s, let’s create that. Let’s see what happens here. Let’s go with this. Let’s roll with all this fun stuff. I’m going to create this index and see what happens.

It might be a, might be a real fun time, right? Okay. Here we go. You got it.

We got our new index in. All right. And let’s run this. Let’s see here. This appears to be, this appears to have gotten worse with an index, doesn’t it?

Sure did. 14 seconds. What happened? Well, you’re going to see what happens.

It’s SQL Server chose a serial merge join plan. Look at this garbage monster monstrosity that SQL Server has chosen here. Because we now have both of our join keys in order, SQL Server was like, oh, I don’t need to sort anything.

I’ll just, just do a merge join. But even SQL Server sometimes knows that a parallel merge join is the worst thing in the world. And so it gives us this, this serial plan.

Even worse, what do we get over here? We still have a, the votes table is still on the outer side of the query with an, with an index scan on it. And the comments table is in here with an index scan on it.

We didn’t even seek to any of the stuff that we cared about. We just use the ordered nature of things to, to, to, to, to, to give us that serial merge join plan. Now, this is where things get kind of annoying is that if we, if we just, if we had a force seek hint to the comments table and we just try to get an estimated plan, SQL Server says, no bueno.

We, we are out of buenos. You can have no buenos. You, you, you, no buenos for you.

The query processor could not deal with this, which, you know, it’s kind of weird because without the force seek hint, you know, we had the comments table on the inner side of the join. And you would think that SQL Server could just take, you know, stuff from here and seek into it with the, the post column over here. But apparently not.

Now, if we add a force seek hint on the inner side and we run this, this is what our plan turns into. But we still have this sort here on the comments table. All right.

So this is okay because we’re down to about three seconds now. We’re still, we have about 1.5 seconds reading from the comments table and now about 1.5 seconds on the sort. And the past, the sort had spilled a bit and things were a little bit weird.

This is about as good as we’re going to get. All right. Honestly, as far as query tuning, this thing goes. And this, this took kind of considerable work to get from, you know, a minute to eight seconds, which is so far the biggest jump. But to get from eight seconds to like three seconds, you know, we had to add two indexes and now we need a force seek hint.

And things are just, things are just a little rocky. If I’m going to be honest with you, if I’m really tuning a big query that does all this stuff, what I want is batch mode. All right.

But just use, like get batch mode involved somehow. That’s really what I want to go after. But in this one, we’re going to focus just a little bit on some of the indexing pitfalls that can happen in here. So because we need to sort data a little bit differently, and I have, I accidentally deleted that from my index definition.

Sorry about that. All right. We’re going to change our index a little bit to be on creation ID descending, ID ascending, and then post ID as the last column in the index.

All right. And we’re going to drop our existing index with the wonderful underused drop existing equals on index option. And now we’re going to see how things go with this plan.

Let’s get the estimated plan here and see what happens. Now we have a serial nested loops join plan with no sorting. And if we run that, we finish things up just about instantly.

So if you’re tuning queries that use exists and not exist, right, they’re going to have a semi join of some type. Exist will be a plain semi join. Not exist will be an anti semi join because you are finding stuff that isn’t there.

They sometimes take, they sometimes need some extra help to get them to be reliably nice and fast. Sometimes that extra help is just throwing a hash join hint on there. Sometimes that extra help is just getting batch mode involved.

If you have two very, very large tables, batch mode is going to do you a hell of a lot more good than all of the indexing in the world. These tables aren’t quite big enough to qualify for that. Batch mode does do really well on them.

But, you know, I’d rather show you a little bit of the indexing stuff. The other thing that you can do with the indexing is pay really attention to really close attention to the query plans. If you end up with these types of plans and you have all sorts of sorting going on in them, you might need to adjust your indexes in a way that is a little bit counterintuitive.

And what I mean by that is you might need to sort of index towards the sort with these columns rather than index towards the join with this column. Remember that when we added this column in, things got a bit better.

Not like spectacularly faster. We saved like two and a half or so seconds on things. But it wasn’t really like spectacular. And then when we added a force seek hint in, we ended up with that awful serial merge join plan.

So sometimes depending on what the problem, like really, if this video were to have like a great title, it would be how to index for what the problem and the plan is. The issue is that this video would take me, that video would be months long.

So this is just one example. So this is just kind of indexing to help you tune exists and not exist queries. And some of the things that you might see in query plans along the way.

In my case, for the query that I’m running, my best set of indexes was, of course, one on the votes table on the post ID column, right? Because that gives us a good, clean, ordered path into the data that we’re trying to figure out if it exists or not.

And then it was the final index that we created down here where we geared our index towards sorting the data the way that we needed it and then having the column that we care about for correlation in here. Because remember, when we tried to put a for seek hint on the comments table, we got that optimizer error anyway.

So there was no way SQL Server was seeking into the comments table and doing anything helpful with an index that led on post ID. But there was just no use in that. So with the index that we have down here, once again, we get a fairly simple, easy query plan, just a plain serial nested loops plan.

And we end up finishing this very quickly because now we don’t need to sort data. And now we have a really efficient way to locate the data that we care about on the inner side here. So no sort, top above a seek, simple top for the presentation, and boom, we have a fully tuned not exist query.

So with that out of the way, once again, from me and bats, thank you for watching. I hope you enjoyed yourselves. I hope you learned something.

I hope that if we ever meet, you bring me Pez because my Pez dispenser is empty. I’m just kidding. I don’t actually eat candy. I have no sweet tooth whatsoever. If you brought me a salt lick, I would be far more appreciative.

That’s my problem in life. The salt. Savory and the salty.

Anyway, I’m going to get going because I obviously have some other tabs to close out. So we’re going to get those recorded, and I will see you over in the next video. Thank you for watching.

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.