A Little About Hash Aggregate Spills and SLEEP_TASK Waits In SQL Server
Thanks for watching!
Video Summary
In this video, I dive into the world of hash spills within SQL Server, specifically focusing on hash match aggregates and their behavior in row mode. With a bit of a reflective tone, I share my experiences as Erik Darling from Darling Data, discussing how these spills can significantly impact query performance during what should be a relaxing Friday afternoon. The video covers various scenarios, including the differences between batch and row modes, and delves into the disappointing lack of detailed weight statistics provided by SQL Server for these operations, highlighting the frustration with Microsoft’s decision-making in this area. Through multiple queries and detailed analysis, I explore how adding more columns or text data can exacerbate hash spills, leading to substantial increases in execution time and page volumes spilled to disk. The goal is not only to understand these issues better but also to advocate for clearer insights into query performance bottlenecks.
Full Transcript
Erik Darling here with Darling Data. Feeling real bubbly and effervescent on this joyous, I think probably the final Friday of June 2024. Where did this year go? What happened? This year disappeared on us. Feels like slowly, slowly disappeared. Anyway, my eyes feel weird. My left eyelid won’t stop twitching. I think I have some form of irritation in there. But today, as you can see from the giant floating zarzad head of green text, in today’s video we’re going to talk about hash spills.
Because what the hell else are we going to do on a beautiful summer Friday except talk about hash spills. Yesterday we talked about sort spills. At least I think it was yesterday. It was probably yesterday. If anyone can remind me. What we did yesterday. That’d be wonderful.
Alright, so before the Don Julio shows up and everything changes, we’re going to talk about hash spills. Now, this video is only about hash match aggregates. This is not about hash joins. Hash joins are going to be in the next video. Not today. Another time.
Because hash joins… What’s really interesting with hash joins is looking at extended events for the hash bailouts and recursion stuff. Because that’s where you can see the sort of spectacularly bad performance that can come out of hash joins when they spill. So, in general. And now, I’m just… Specifically for row mode.
For batch mode, hash and sort spills make me very nervous. Because if you remember the sort spill video from yesterday, the batch mode sort took forever. It was like five minutes. Whereas, like the equivalent query in row mode only spilled for like a few seconds or something.
So, both batch… Both hash and sort spills in batch mode make me incredibly nervous when I see them. In row mode, hash spills tend to make me more nervous than sort spills. Because, as you’ll see in this video, like, hash spills of… Like, when we cap memory at the same amount.
And sorts usually require a lot more data. Because remember, you’re sorting all the columns you’re selecting by the columns you’re ordering by. But it’s usually like a much closer to size of data operation.
Whereas, with hashes, when those spill, for some reason, they just… Whatever algorithm is responsible for the hash spills tends to really beat performance up. Even for like similarly sized spill amounts.
So, like in pages, right? So, let’s look at this first query here. And we have query plans turned on.
Thank God, if I forgot that. If I forgot that and I had to restart recording this video, I don’t know what I would do. Alright, so, this query takes around 4 and a half seconds. And, you know, I’d really like Microsoft to recall the summer intern who did the operator times here.
Because these make no gosh darn sense. And I’m not quite entirely sure how to interpret this. Because in the row mode plan, the times are supposed to be cumulative.
So, the repartition streams should not… Should have technically run for as long… Should contain the time from the clustered index scan.
So, it should be at least 1.626 plus whatever time gets spent in there. I don’t know where to put that 1 in the accounting. I don’t know where that 1 goes.
I don’t know if we add it to the 1.6. I don’t know if it’s part of the 1.6. I don’t know if the 1.6 is part of the 1. I don’t… I just don’t know what to do with it. Alright?
So, let’s just say that the hash match aggregate ran for 2 and a half seconds or something. Right? Or 2 seconds. I really… These numbers are too depressing for me to think about too much. How we ended up here, I don’t know.
I like the batch mode version better where each operator is its own time. Because then I don’t have to worry about whatever mess this is. Alright.
So, relatively simple. Relatively straightforward. Now, we’re going to look at these next two queries in other windows. Because these next two queries took more time than I want to fill up dead air for.
While I’m waiting for them to finish. Alright? So, what we’re going to do in each of these windows is run the query and then look at our session weight stats for what happened when these queries ran.
And there’s going to be some real disappointing stuff happening here. Alright? And it’s not just related to the timing here where once again this number is lower than this number for some reason.
And I don’t know how to munch those numbers together into some sense. So, let’s just say that the hash match aggregate took, I don’t know, about 25 seconds. With the spilling.
Right? So, in the query where it didn’t spill, it took like two, two and a half seconds. In the query where it did spill, it took a lot longer. Right?
About ten times as long. Alright? Sort spills don’t usually hit you for ten times as long. I guess hash spills are different in that way. This hash spill looks about like so. Spill level two!
We had to make two passes of the spills. All eight threads did that and about 290,000 pages spilled out to disk. So, I don’t know.
That seems pretty slow for 290,000 pages to be honest with you on that. I don’t really know what to say there. Whatever. Not having a good time internally that hash spill. But what’s really disappointing is something that we’ve seen before in here where our old friend Sam, someone at Microsoft, that glorious idiot, decided to hide information from us.
Because the top weights that you see for this query over here are CX import, CX packet, and SOS scheduler yield. But they don’t really tell you where the time went in this query, do they? They don’t really account for the 30 seconds that we spent in here.
What comes a lot closer to accounting for the 30 seconds we spent in here are these things up top. Like Sleep Task and CX Consumer. That our good friend Sam said, It’s for your own good.
It’s in your own self-interest. All you wanted was a Pepsi, but you don’t get these weight stats. You can have the Pepsi. You’re not getting these weight stats in your query plan.
So the Sleep Task and the CX Consumer weights. Don’t show up in your query plan. Because Sam is an idiot. We don’t like Sam, do we?
Sam is not our friend. All right. So looking at this same query, essentially, but this one running in batch mode. So again, I’m using the query optimizer compatibility level to get batch mode on rowstore.
And I’m making this thing spill a whole bunch. All right. Spills are the name of the game.
They are the word of the day is spill. They’re also the number of the day. They’re also the special of the day. They are everything.
Everything for us. Now in this one, let’s look at the weight stats here first. So now we have some kind of new weights, don’t we? We have some HT weights. All right.
These are related to batch mode. These are very batchy mode-y related weights. But we still have a bunch of Sleep Task. And we still have a bunch of CX Consumer. Now let’s go look at our query plan and see what happened in here.
All right. This thing spilled for about 10 seconds. And this one isn’t too bad. Where they get really bad is with the bigger spills, which we’re going to take a look at in a minute.
But if we look at the weights over here. And we look at the weight stats. There we go.
That’s the button. We get htbuild and we get htdelete. Apparently these ones were considered important enough to show. But we still get no Sleep Task. And we still get no CX Consumer. Even though those are at the very top of our waiting query game.
Right. So to recap, we get this and we get this. We do not get this or this. We are all quite sad by that.
Especially because, you know, if you look at some of these wait times, these max wait times, that’s almost 11 seconds on CX Consumer. Right. And if you look at these total wait times in here, that’s a lot of time to not account for in an executing query. Right.
Kind of. It’s not cool, Sam. It’s not cool at all. We don’t like being lied to, Sam the man. All right. So let’s move on a little bit.
And let’s see what happens when our hash spills involve more columns. Remember yesterday when we had sort spills, the more data we added to those sort spills, the worse they got. It was from like a time perspective, from like a weight perspective.
Like they just dragged on and on and on and on and on. These, well, at least for the non-spilling query, even here we add some time to it. Right.
The first one was four and a half seconds. We’re essentially doing the same thing. We’re just selecting more columns and this took about a second and a half longer. Right. Like everything in here, despite, like, holy cow.
It worked on this one. I don’t know what happened. I don’t know what magic happened, but look, the time is actually cumulative on this one. That wasn’t true for any of the other ones.
What happened in here? I don’t know. I don’t know. Sometimes it works. Sometimes it doesn’t. These racy conditions, I guess. What’s happening inside your head?
It’s like trying to figure out what a toddler is thinking. It’s amazing. But anyway, this one did take a little bit longer. Right. This one did.
This one did take. It’s a little bit longer than the one where we were just selecting one integer column. In this query, we’re selecting one, two, three, four integer columns and one date time column. All right.
So now let’s go look at this one in another window. I’ve already pre-run this because, again, I don’t want anyone sitting around bored. And now look what happens here. That went from taking 30 seconds to taking one minute and 33 seconds.
This one took a full minute longer selecting more columns. All right. More stuff in here spilled because we are selecting more stuff.
This one got to spill level three. Right. This is a full level higher. A full spill level higher than the one before that. And the number of pages is also about tripled.
Right. This one went from like about 200 and something thousand to 832,000. Right. So a lot more stuff spilled, though, because we had a lot more in the hash to spill. So when it was just one column we were grouping by, we didn’t have a lot.
I mean, we still ended up messing things up pretty good. Good job, us. But we didn’t. But we didn’t. But this one, because we have more columns that we need to group by and all the other stuff, we end up doing way, way more work.
And in the results, we have way more sleep task and way more CX consumer than we did in the other query. Now, since this one is in row mode, we don’t have the HT weights, which is OK.
Like, we don’t need to know that. But again, these weights aren’t going to show up in the query plan weight stats because Sam needs to get talked to by someone.
Sam needs a talking to. Sam, Sam, Sam. So now let’s look at what happens when we start messing with text columns.
All right, so in this query, we are going to, if I recall correctly, I don’t know, again, some of these queries were written yesterday. Just kidding.
They weren’t written yesterday. I just can’t keep everything in my head all the time. Now that we’re selecting a text column and we’re grouping by this text column, this text column in the comments table.
I don’t know if you remember the sort spill video. We looked a lot at the like average length and the like, you know, like how SQL Server estimates memory for these things. And, you know, and especially at how even like the text column stuff in row mode tended to make spills worse because you’re dealing with larger data when it spills off to disk.
All right. So this query takes about 8.8 seconds. And this doesn’t spill. And somehow, miraculously, the repartition streams is working here.
Maybe it just takes like more data to make a repartition streams to work. Maybe like something has to really slow down in order for the code and read the repartition streams one to work the way it should.
I don’t know. It’s really weird. But at least it’s cumulative. I don’t know if it’s right, but at least it tracks, right? At least it’s logically cumulative going from here to here to here. At least we have that going for us.
But this, we know that by the time we get past the hash, it takes, we are a few seconds ahead of where we were when we were not messing with any text columns. Right?
So now we’re going to look at this thing running in batch mode and spilling. And, oh wait, this is the one I was supposed to close. Ah, nuts.
Here we go. So this is what happens when batch mode hash match spills a text column. Look at that.
2 minutes and 24 seconds. Ain’t that something? That’s crazy, right? That’s nuts. Like, like it’s, it’s right up there with how bad the batch mode sort was. 2 minutes and 24 seconds.
Can you imagine waiting 2 minutes and 24 seconds for this? Now, the spill level on this is back to, is back to spill level 3. But there’s a lot more pages in this, right?
Because we had that text column involved. And there, like there’s definitely some differences between the votes table and the comments table. The votes table is like 53 million rows about.
And the comments table is like 25 million rows. So the comments table, even though it’s smaller, because it has, we’re spilling that text column out. The data pages that we’re spilling out are way bigger.
I mean, not like way big, like there’s a way bigger number of them because the text column makes the pages, like adds more space, right? So the, when we’re dealing with like a whole bunch of narrow data types, even though we did spill a lot and it took a long time, it’s still not quite as disastrous as when we spill out like, like anything that involved text data.
Like the, the high end three, level three, eight spilled threat, eight spilled threat hash join from the votes table was like 800 something thousand pages. This is like 2.6 million, almost 2.7 million pages. So that text column adds a lot more page volume to the spill and really messes things up.
And of course we have in here, our friends. We have sleep task, ht build and ht delete, and cx consumer. And you know, again, for this query, the ht weights will be available in the query plan x xml, but sleep task and cx consumer, because our enemy Sam at Microsoft doesn’t want us to see these weights.
They are not going to be in the query plan, and a lot of the time that you would, a lot of the time that you would, would account for like what, what went wrong with this query. A lot of things that you would, you know, maybe see peripherally, like, you know, like when you go and examine a query plan, that would help you determine stuff are just not in there, right? So it’s always good to know where this stuff comes from.
Now, you know, I do a lot of experimentation with running queries and seeing what their weights are, and that’s sort of how I figure this stuff out. And that’s why in SP pressure detector, you know, the list of weights that I have in there, and with like a sort of description on them, will, you know, decode some of this stuff for you. So like if you, if you were looking at this query plan on your own, when you might like, like, look, the spill is visible, there’s an exclamation point on it, the operator time is visible, you can see how long that thing spilled for.
When you look at the weights, you don’t get the full story of what weights show up when these things happen. And that’s what you kind of have to know because when you’re looking at a server from the top down, if you’re like, you know, you just get on a server, and like you use whatever script you want to look at weight stats, hopefully it doesn’t screen any of these out because someone at Microsoft is a jerk. But maybe it would show you like, you know, these weights in total.
And if you saw the HT weights, and if you saw the CX consumer weights, and if you saw the sleep task weights, you saw the IO completion weights like from yesterday’s video, it would give you a better indicator of like maybe where queries are struggling as a whole. Right? And like that maybe like gives you a place to focus.
Right? Maybe it helps you figure out like, you know, like where the stress and strain on the server is. Now, especially if you see a lot of these spilly type weights, right, like IO completion, sleep task, the HT stuff, if it’s batch mode, you know, and you also see a lot of the page IO latch underscore whatever weights, that’s a pretty good sign that there’s just a constant battle going on between the buffer pool and query memory grants. And that’s that server probably doesn’t have enough memory in it in general.
You know, it might, you know, like, there might be all sorts of other ways you could go to try and get those numbers under control. But like, it just might be a sign that the server is completely underpowered. And that’s where you need to start.
Like, that’s where the quickest performance win is just like, just get some more memory in this thing if you can. Right? So, let’s go look at one last query in here. And we’re going to close this out.
And this one is particularly interesting to me because this one will get a hash match flow distinct. And that hash match flow distinct will spill. And we’re playing kind of a weird trick on SQL Server here with top.
And the bigger you set this number to, the worse this spill is. I had to find something in the middle. And then we’re going to say optimize for top equals one. And even with a recompile hint, can’t figure that out.
Or rather, it is still under the spell of the optimize for hint. And if you look at what happened in here, we, of course, have a number of things that we’re going to do. Once again, a whole bunch of sleep tasks up at the top.
10.991 milliseconds. And I think the reason why, like, you know, this one is helpful to look at is because this one is single threaded and a lot of the other ones run in parallel. So, it’s a little bit more clear, like, where weights go in here.
And so, if you look at the results where we have, you know, fully 10 seconds of sleep tasking, we can probably figure out just how much time was spent actually spilling on that single thread in there. Right?
So, but, you know, once again, if we look at what happened here and we look at the weight stats for the query, the only thing we will see is 4 milliseconds of SOS scheduler yield. All right?
There that is. There’s that 4 milliseconds of SOS scheduler yield and a query that ran for 25 seconds. All right?
So, we took 6.7 seconds here and we took, well, I mean, 25 is 19. And so, we spent about 10 of the 19 seconds in this operator spilling to disk. Isn’t that exciting?
Isn’t that exciting to know about? And, of course, the spill level for this is spill level 5. One spilled thread. Anyway, that’s about enough about hash spills.
Now, again, this was purely about hash aggregate spills. Tomorrow’s video, or actually, no, tomorrow’s Saturday. So, probably not tomorrow’s video and probably not Sunday’s video.
Maybe Monday’s video will be about hash join spills. So, I hope that you’ll join me for that. Anyway, thank you for watching.
I hope you enjoyed yourselves. I hope you learned something. If you ever meet someone at Microsoft, I hope you have a good talk with them about the weight stats that they’re including in these query plans. That’d be nice.
You know, we deserve better. Us people paying, well, I mean, I don’t pay per core, but you probably pay per core. So, you deserve better.
Apparently, I deserve whatever I get. That’s okay. Yeah, if you enjoyed this video, I do like thumbs ups and I like encouraging comments. Up to and including you, go girl.
If you enjoy this sort of SQL Server content, you can join. Let’s see. Let’s make sure we have this refreshed up until the absolute most current. You can join nearly 3,790 other of my data darlings by subscribing to this channel and getting a notification every time I publish one of these.
And I would just like to apologize to anyone not in an East Coast time zone who gets this notification late at night, like someone in Europe maybe. Or even further away than Europe. Past Europe.
I don’t even know what time it is in New Zealand right now. Australia? Who can tell? So, I’m not sure if anyone else subscribes to me from further away than that. Probably not.
Anyway. I’m gonna go start Friday-ing, because it is Friday and it is time to Friday. Thank you for watching and I will see you in the next video about hash join spills. It will be just as exciting and riveting.
I promise you. I would never lie to you. I’m not from Microsoft. Or the government. Or the government. And that’s smart because I would take care of them. Welcome to this band right now. Now, what we’re aware of is that these are folks that we can play for with our muutest territory where we’re not Arabia Thank you.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 25% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.