A Little About Hash Join Spills And Bailouts In SQL Server

Thanks for watching!

Video Summary

In this video, I delve into the complexities of hash join spills in SQL Server, providing a detailed walkthrough of how these operations can lead to significant performance issues. We start by examining a query that selects from a table containing large text data, which quickly demonstrates the severe impact of hash joins on system resources when dealing with large datasets. Through extended events and a series of tests, I illustrate how even minimal data volumes can trigger extensive recursion and bailouts, leading to prolonged execution times that are both frustrating and time-consuming to observe. The video also touches on batch mode and row mode operations, highlighting the consistent unpleasantness they bring to query performance during hash join spills.

Full Transcript

Erik Darling here with Darling Data. Still, uh, feels like a perpetual situation. I’m not sure, I’m not sure that this can ever be resolved. At least not peaceably. Anyway, in today’s video, we’re going to talk about hash join spills. Because, oh, I don’t know, those seem pretty important. Because, uh, much like hash aggregate spills, if these really start piling up, they can ruin your day. Not in a good way. Not like, uh, if your car breaks down in front of a bar, and you’re like, I’ll just go inside and call AAA from the payphone. And you end up having a great day inside the bar. Because you’re like, well, I don’t have any quarters. And the bartender points to a sign that’s like, no change. With a payphone. And so you have to buy a beer that costs, 75 cents to get a quarter to call the, to call the tow truck company. Then you just decide to hang around for a while.

That’s my kind of day. Anyway, uh, yeah, hash join spills. Sorry, I was, I was, I got a little carried away there. I actually got, I actually got lost in that moment in my head. I was like, because I can, I can, like, picture the bar and the bartender. Did something for me. Did, did something special for me. So, uh, hash join memory grants. There is a lot to say about them. And thankfully, uh, I don’t have to say all this. Uh, there’s a link. Well, my, my hand goes away where I want to point. But, uh, there’s a link up in the, that I’m going to put in the, in the show notes, as it were. Uh, written by, uh, Craig Friedman, who actually played the part of the bartender in that, in that scenario that we just talked about.

Uh, where he will, he will tell you in great, well, I don’t know. Is that great detail? It’s pretty good detail. How memory grants for hash joins, uh, are calculated by SQL Server. And, um, this is, and what happens when they spill. So, there’s all this great stuff that you will learn from Craig. Uh, I’m not going to repeat all this stuff because that would be weird plagiarism. I’m just going to tell you that it exists here. If you want to pause and read it and like, like, like, like type in the URL from there, but just, just know that my, my source is cited. All right. I’m not, I’m not claiming that this green text is mine.

Definitely not. Definitely would never write all that stuff. So, uh, like I promised in the video about hash aggregates, uh, we are going to use an extended event, which is over here. And that is going to show us, uh, when we hit hash warnings, uh, in, uh, with, with the hash operator that spills out and starts, and starts going through different levels of recursion and then hits a bailout point.

And the bailout point is, uh, when the hash join switches over to some, uh, naive kind of nested loops join. Uh, so that’s, it’s not, it’s not a, not a fun time. I promise you.

So what I’ve done is I pre-run four queries and, uh, I’m going to, I’m going to show you what the queries are because they, they relate to what is up here. Where like the size of the data that is going through all the hashing stuff has a big part to do with how much of a memory grant the, the hash join queries need. So I’ve got two queries here.

Uh, they both do just about the same thing. They join from votes to comments, but they join on some really low selectivity columns. Right.

The post ID column and the votes table and the post ID column in the comments table are like, there’s only eight possible like numbers in there. So there’s a lot of matches in there. These are not unique columns where there’s like very few buckets of matches.

Right. And some of the buckets of stuff are going to be way bigger than other buckets of stuff because there’s way more of certain post types than others. That’s something that we’ve looked at a million times in these videos.

So, uh, I ran these two up here without any memory grant hints on them. Right. So these ones get the full memory grant that they want to run. And then there are two down below that are, that are capped.

It’s essentially the same two queries in the same order, just with caps on them. Now, if you remember from the hash aggregate video, the, the, I mean, aside from the row count and aside from the intent of the table, uh, the, the main, like the focal point difference, the, the crucial difference between the votes table and the comments table is that the votes table has like five or six integer columns in a date time column.

And the comments table has like four or five integer columns, a date time column, and then an envarchar 700 text column, string column. So the, and the string texty columns inflate memory grant needs way higher because of the way SQL Server estimates the column fullness. And because it’s a string and strings are a mistake and you shouldn’t put strings in databases.

It just screws everything up. So I’ve got these four queries already run and we’re going to examine the query plans just a little bit. So this first one where we select just from votes takes 7.7 seconds.

The one where we select from comments, right? We see the C dot star here and the V dot star here indicating the, the alias of the table we selected from. Okay.

Got that. And, uh, the, the one that selects from comments does take a couple seconds longer. Um, you know, not real, really any real reason other than like the chunk, the chunkiness of the chunkiness of the data. Right.

So, uh, we can see that happening sort of all throughout the plan, uh, where, you know, the stuff takes longer. Right. And, um, if we look at the memory grants for these, the one that just selects the columns from the votes table gets about a 4.5 gig memory grant, almost 4.6 there. And the one that selects from the comments table gets nearly a 10 gig memory grant.

Right. 9,855 megabytes. It’s about 9.85 gigabyte. Uh, yeah.

Almost 10 gigs. Yeah. 9.8 gigs. Yeah. Close enough. There we go. Math. I can do that. Sometimes. Sometimes I remember things. So, obviously, uh, like SQL Server’s memory grants here, nothing spills from either of these. The hash joins are fine here, which means that, and then also another good thing to point out is that there are no warnings on the selects.

So, sometimes if SQL Server, um, is like detects after query execution that a memory grant was either too big, way too big or way too small, uh, it’ll throw up a warning on the select operator. And it’ll tell you that, like, the memory grant was too big or too small, and if you have some sort of, you know, um, memory grant feedback mechanism in place, it’ll start adjusting that. If not, it’ll just twiddle its thumbs and stare at you and be like, hmm, guess you should have paid for Enterprise Edition, hmm?

Hmm. Oh, you’re not using the newest compatibility level? Hmm. Weird. Weird for you.

Yeah. Oh, that’s too bad. Hmm. Hmm. Yeah, I’m just, I’m, I’ll be over here if you need me. All these, all these features and capabilities, duh, you’re not in the right compat level, or you didn’t pay $7,000 a course, so I’m just gonna hang on over here and wait for you. Someday you’ll get there.

Real, real helpful, real nice, real cool. Psyched on that. So, uh, obviously, stifling the query that selects from the comments table is gonna hurt way more from a memory grant perspective than stifling the comment, cycling the query that hits from the votes table, because that string column in the comments table is gonna really whomp things up.

So, if we scroll down and look at what happens to the two sort of nerf-balled queries, uh, SQL Server begs for an index on this one, right? It has not begged for an index previously.

And, uh, if you look at the hash join operations, uh, this one spills for, uh, this query, a whole thing spills for, like, nearly a minute. But the one where we select from the comments table spills for nearly four, over four minutes. Nearly four minutes and fifteen seconds.

Nearly. One second off. Uh, and if we look at the, uh, spill levels on these, uh, this one spilled to level three, and, of course, all eight threads spilled. And keep this number in mind, 663-800, right?

So, uh, that’s how many pages got spilled. Now, if you were to look at, uh, like, the hash warning thing for these queries, the level, uh, the spill level would match the recursion level that SQL Server notes for, uh, the, for, in the hash warning thing. Um, and then, if we look at this one, this is also, oops, oops, that, this thing keeps reframing, and that, that messed me up a little bit.

This one spilled to level four, and with eight, of course, still with eight, all eight threads spilling, but that’s way more pages, right? The last one was, like, 663,000. That’s 3, 1, 1, 9, 4, 8, 8.

That’s a seven-digit number. I only have these fingers left. So, that’s 3.1 million pages. So, that’s pretty tough there, right? We spilled a lot more because that text data takes up way more space on the pages.

You need way more pages to hold on to it. So, this is obviously not a very good situation, but, uh, none of these, neither of these queries, even when we nerf them down to, um, 0.1 max grant percent hit, do we hit the hash bailout. Now, the first thing I want to show you is that hash bailout is not just for hash joins.

So, you may remember this query from, that runs for about 30 seconds from the video about hash aggregates. We’re going to run this again, and we’re going to watch the extended event that I have over here. And, uh, about 5, 10 seconds in, this will start showing stuff.

And, uh, we’ll see it go through the different levels of recursion and then the bailout. There we go. There’s recursion one.

Eh, no, this thing runs for 30 seconds. And then you have to wait for extended events to, like, you know, get its act together and put the stuff in there. Ah, there’s two.

And now it finished. Hey, look at that. All right. So, what happens in here? Let’s, let’s open this up and let’s take a slightly closer look. There we go.

So, here are our threads. And here, well, we only, let’s do the top one, so there’s only one. And, uh, I mean, kind of awkward, isn’t it, right? Uh, we have a bailout and then another recursion. And then, well, some recursions here and a bailout and then, uh, then some more bailouts.

So, uh, I, I, I don’t know. Maybe it showed up out of order or maybe, maybe things are just weird. Uh, but anyway, uh, you can totally get bailout and recur, recur, recursion and then bailout with just a hash aggregate.

You don’t need a hash join for it. But now, let’s behold the real majesty here. Oh, not that one.

This one. This, this one, this one’s some real good majesty. So, we’re going to take our, our really crappy query that selects from the comments table. Right? And we’re going to run this one.

And I’ve, I cleared out the data in here. So, there’s nothing in there anymore. And if we run this, uh, this thing will, uh, almost immediately, uh, start recursing and recursioning and bailing and outing. Uh, it does not take much for, for this one to kick in.

Uh, now, uh, I had, I’m going to come back to that one in a second. Now, I had run this same query with the, uh, with batch mode going for it. Um, this one fared okay.

Uh, you know, the, the weight stats. Uh, there’s one kind of new one in here. Um, so, when we saw the, the hash problems, uh, with, um, just the hash aggregates, I believe it was HT build and HT delete that were way up top. Uh, went with the hash join.

We’re starting to see a lot more HT memo and we’re starting to see this HT repartition weight. Uh, sleep task is still a big deal for, um, for both the row mode and the batch mode hash join spills. So, the sleep task is still a big part of that and it’s still not in the query plan XML for either one of those.

So, just something that you should be aware of there. The sleep task thing is still a fact. The sleep task weight type is still a factor there.

But now, let’s come back and look at this. And we can see that, uh, this query has been bailing out for quite a while now. So, uh, we hit some recursion and then we just started bailing and bailing and bailing and bailing and bailing.

This query will run for a very long time. This query will run for longer than I care to stand here. This query will run for longer than you would care to watch.

Um, it, it, it would, it would be awful. So, uh, we’re not going to watch this thing finish because it takes too long to finish. Um, there’s, there’s.

Probably a good joke in there that I’m not going to make. But, uh, if we look at all this data, we can see all of the bailing out happening across all of these threads over and over and over again. And, uh, we have just hit, we have hit a point where we, we no longer care to live.

So, uh, I don’t know. That, the, those, that’s what happens during hash joins or hash join spills. Um, you know, the, the weight stats are the same as, uh, as they are for hash aggregate spills.

And, um, yeah, they’re unpleasant. Uh, batch mode, still not good. Row mode, no fun at all. Uh, and again, probably way more worth, uh, paying attention to, uh, hash, different hash spills than different sort spills.

Unless the sort spills are in, uh, batch mode. Um, next video, uh, we’re gonna, we’re gonna look at exchange spills. Which are when parallel, uh, exchange operators, uh, run out of memory buffer space.

And begin spilling all over the place. Like, often, awful drunken bar patrons who swore they just needed to use the payphone 57 beers ago. They haven’t left.

I don’t know. Anyway. Uh, maybe, maybe someday that’ll be me. If I ever, if I ever build a time machine and go back to, like, 1981. That’ll, that’ll be my plan.

Gosh, the car broke down. I’ve, I’ve got all these bills from the, from the year 2030. No?

Alright. Whatever. Uh, okay. Um, yeah. Uh, thank you for watching. I hope you learned something. Uh, if you invented time machine, please take me with you. Um, what was I gonna say?

Uh, hope you enjoyed yourselves. I would enjoy myself if you made a time machine and took me with you. Um, that was, that was, that’s really the crux of this whole thing. Um, uh, if you like this video, for some reason, if you like learning about how bad SQL Server can be at things, uh, feel free to give me a cordial thumb up.

Or a cordial comment. I like those. Feel good about those. They really make my day. They brighten my whole mood.

Uh, and if you enjoy this sort of SQL Server content, uh, you should hit the subscribe button. Because we’re, we’re, we’re getting awfully close to 4,000 here. Which would, uh, I think break, break, uh, break my, uh, tie.

Or break my current sort of standing with, uh, Amiga repair channels. So, we’re, we’re gonna get up there. We’re gonna, we’re gonna break through to a new level of SQL Server fandom.

You and me. All together, my, my data darlings up there in the world. So, uh, yes, you should, you should like, you should, you should subscribe. You should, you should cordial comment.

And, uh, you should see me in the next video where we talk about when, when parallelism gets real, real messed up. All right. Thank you for watching.

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.