A Little About Sort Spills And IO_COMPLETION waits In SQL Server

A Little About Sort Spills And IO_COMPLETION waits In SQL Server



Thanks for watching!

Video Summary

In this video, I delve into the fascinating world of sort spills in SQL Server, explaining how they can impact query performance and offering insights on when to address them. Starting off with a bit of personal frustration over Microsoft support’s suggestion to shut down an Azure instance to save costs, I highlight why better support is available through services like mine at Darling Data. The video then focuses on demonstrating sort spills using two queries—both with hints to ensure one spills while the other doesn’t—and explores why the spilled query can sometimes run faster than its non-spilled counterpart due to factors such as local storage speed and the complexity of sorting multiple columns, especially those containing text data. Through detailed analysis and practical examples, I aim to provide viewers with a deeper understanding of sort spills and their implications for database performance optimization.

Full Transcript

Erik Darling here with, as you may have guessed, Darling Data. In this video we’re going to talk about sort spills. Now, if you’re wondering why I sound a little bit low right now, oh boy. I just got off a very depressing client call where Microsoft support suggested out loud and with a serious face that one way that they could say, save money on their Azure bill would be to turn off their managed instance that runs their e-commerce site overnight so it didn’t accrue any spend. So yeah, we’ll just turn the website off at night. It’s like the early 90s when you would leave the office and turn off the lights and it would also turn off all the servers or something. So yeah, anyway, that’s the thing. That stunk. No one was happy. So once again, if you would like better support than Microsoft is willing to offer you. My rates are reasonable and I am available for higher.

Erik Darling of Darling Data is in fact available for hire for these kinds of things. So you don’t have to be abused by Microsoft financially and mentally. So you might notice that something up here that I’m doing is setting compat level explicitly to 140, 140 because that is the 2017 compat level. The reason I’m doing that is because if we use the 150 or 160 compat level, we will get sort spills that neither of us have the patience to sit and wait for during one of these videos. So here’s what I got from these two queries. It’s really the same query twice. But this is what happens when a batch mode sort spills. That is nearly five minutes. And remember, this is a fully batch mode plan.

So all of the operators in this plan are responsible for only for the time that is spent in them. Down here in the row mode plan, things are a little bit different. Things do get weird because you might see that this ran for almost 2.2 seconds. Repartition stream says, no, I only ran for 1.3 seconds. And the sort says, I ran for 4 seconds. So that’s a little misleading. But anyway, the reason for this is all explained in great detail in this wonderful post by Mr. Paul White, or Ms. A.K.A. Pablo Blanco. And it was based on this demo. And actually, my name actually appears in the blog post, which is a magnificent thing.

And never thought that I would see the day. And if you feel like Microsoft should probably work on this scalability issue, there’s also a feedback item that is actually under review. To Microsoft’s credit, I will click on this so you can see. So this feedback item opened by y’all is truly nine months ago. Oh, man, this one’s ready to pop.

This is under review. And I actually got a thank you from the company. That’s as good as a gold watch, isn’t it? All right. Getting a thank you. Comment thank you. Anyway, let’s get back to regular row mode sort spills. So I’m going to run these two queries. And they’ve got all sorts of hints and stuff on them to recompile and clear out the procedure cache.

And this one up here is I’m using the min grant percent hint to ensure that this thing gets the minimum amount of required memory to not spill. Because I want this one to not spill. And I’ve got this query down here using the max grant percent hint, ensuring that it most definitely will spill.

It was built to spill if you’re into that kind of music. We hear it darling data or not. We like hard goth. Hard goth only. Anyway, just kidding. We like a wide variety of music.

Depends on what the mood is. When it’s a bad mood because Microsoft support is awful, we listen to the hard goth. So let’s look at the query plans for these.

Because that’s what we do, isn’t it? We’re the data darlings who stare at query plans. And let’s be moderately surprised when we look at these two query plans. And we see that the plan that didn’t spill.

Oh, that was terrible framing by me. We’re going to do some sit-ups after this one. And the plan that did spill, we see way over here, these sword operators. The sword operator way…

Oh, my hand. I look like I’m doing something awful to that sword operator. The sword operator way up top did not spill. And that took five seconds. The sword operator right here…

Ooh, that’s nice. That’s right there. Man, that’s good framing on my part. That took 3.4 seconds. But why? Why?

Why my data darlings did this… Why did the query that spilled take less time than the query that didn’t spill? Now, this is something that I… Like, I never used to, like, really catch well query tuning things before Microsoft introduced operator times into query plans.

Because you would see a spill and all you would have to go on is, Well, crap. Spills are pretty slow, right? You should try to get rid of spills. If I fix the spill, maybe it’ll be faster.

It didn’t always turn out that way. Now, the reason why my spill is faster, I mean, first and foremost, is because I am on fast local storage, right? So this is Crystal Disk Mark hitting my fast local SSDs on this computer.

Again, these are SSDs plugged directly into all the same parts and components encased in a beautiful Lenovo laptop right next to where all the other hardware and stuff is.

Because, you know, you don’t have that probably, though. Because you work for knuckleheads. And you work for knuckleheads who dragged you kicking and screaming into the cloud where storage is awful.

Generally awful. And if it’s not the storage that’s awful, then it’s the path that the data has to take getting to the storage. It has to get way over here, right? Your data is nowhere near your SQL Server.

It’s miles away, probably. Miles of network cable away. So I get the benefit of fast local storage that I don’t have to go across miles of wires to get to. You probably don’t have that because you work for knuckleheads.

I work for one knucklehead. But the one knucklehead I work for bought one nice laptop to do demos on. So that’s why this sort is fast for me. It probably wouldn’t be fast for you.

I realize some cloud instances do have, like, a local storage with, like, you know, hyper drives on them. And you could get stuff fast there, too, probably. But most people don’t have that.

So they get really screwed up by this stuff. So you will probably want to fix sort spills. I probably don’t need to fix sort spills. But I’m going to show you in a minute how you know if you need to fix sort spills.

Aside from, like, just, or, like, if you have a lot of sort spilling and, you know, doing things. So one thing that’s sort of interesting about sort spills, at least in parallel execution plans, and we’re going to hope that this query works correctly the first time because this demo is a little weird.

Sometimes it’s, like, great the second I run it. Other times I have to tinker with the memory grant percents. And it’s not fun when I have to tinker with the memory grant percents.

And what do you know? I’m probably going to have to tinker with the memory grant percents. Let’s change this one to, like, 13. Because, you know, what’s funny is it worked three seconds ago when I ran this before recording the video.

Don’t take it out on me. I’m still better than Microsoft support. There we go.

That’s what I wanted to see. So if you look at this top query up here, right? This query, when it spilled, it only… So this query, just to make sure everyone understands, this query is running at doc 8. That’s this many fingers.

And this query spilled… Well, spilled level 1 and spilled 7 threads out to disk, right? That’s this many fingers.

8 threads is this many fingers. And so one of these threads is showing that it did something, right? So if we come over here and we look at the properties and we look at this, we will see one thread with 1,435 rows on it. It looks like it did some stuff.

But this is just a weird query plan timing issue, right? This is not actually an actuality kind of what happened. It’s sort of what happened. Both the thread stuff and the operator time stuff, as we saw in the previous demo with the row mode thing where the repartition streams was not in the realm of reality of what the other operators around it were doing.

The operator timing and the thread stuff can also not be anywhere near reality. It’s sort of like me after 8 p.m. Me and reality are not shaking hands anymore.

But so this query down here, which spilled to level 1 and spilled, oh, why did you disappear? You were right there. All you had to do was not leave like my dad.

So this is level 1 and spilled all 8 threads. So this sort, even though almost 53 million rows from both of these go into this sort, and both of these sorts sort 53 million rows, this one looks like it didn’t do anything. All right, it’s just, it’s all zeros in there, all right?

Like my report cards. So, again, the reason why, like, the sort spills are generally faster is because… I have nice local storage, which you don’t have, probably.

I hate whispering. Sorry. Sorry about that. Sorry about that. So sort spills can get worse as you have more columns to spill. So just, you’ll allow me to go back in time one moment.

If we look at this sort, we have one column in the output list, that is post ID, and one column in the order by, which is post ID descending. Okay? So if we run these two queries now, and again, I have my little hints here just to make sure everything happens the way I want it to.

That should be… Is that… Those are the right two queries?

I didn’t highlight the one above it, did I? That was rather foolish of me. Rather foolish. Oh, Eric. Where does your foolishness cease? Ever.

So we’re going to run these two, and this is important because you should understand this about sorting data in SQL Server. Right? And this one is still a little bit faster.

Not as crazy faster as the other one. Right? Six point… Oh, man. Zoom it is all over the place today. 6.2 seconds versus 5.7 seconds. But now the sort operators are going to look a little bit different than they did in the previous demo.

And they’re going to look different because we have, if the tooltip ever graces us with its presence, we have way more columns in the output list now. Right? We’re still only ordering by post ID.

But what SQL Server has to do is all the columns in the output list, those also have to be put in order. Right? Like, that’s what a sort does.

It sorts all the data that you’re outputting by the column that you’re ordering by. So, you know, again, I’ve probably gone over the Excel analogy a few times where when you’re using Excel and you click that button in the top left-hand corner and everything gets highlighted. And then you click sort and you choose a column and everything in the spreadsheet flips to match the sort order of that one column.

Or Excel kind of yells at you and is just like, are you sure you just want to sort this one column and not everything around it? Because you’d look kind of stupid if you did. So, that’s what SQL Server kind of has to do in memory too.

It has to flip all the order by columns to the order of the, so it has to flip all the output columns to the order of the order by column. So, that’s why queries that select more columns and need to order those columns need more memory. Right?

So, that’s one thing to keep in mind there. And one thing that will exacerbate those issues is when you have text data, or not just like the data type text or ntext, I mean like any string data really. Anything that is not, like all the columns that we’ve been dealing with before, these are all integers or dates.

I guess they’re all integers and there’s one date time. So, these aren’t like, you know, big honking columns with like variable lengths and stuff where SQL Server has to guess how much data is in them. Right?

So, if we look at the comments table and we run this query, right, what I want to show you is that a lot of columns in here don’t, like, so the, just to make sure, make sure, sure, we understand what we’re talking about here. The, the text column in the comments table is an envarchar 700. Right?

Envarchar. Double, double byte encoded text. And so, that’s why I have data length divided by two. Also, data length tends to be a little bit faster than length when we do these things. So, that’s why, that’s the divide by two there.

So, if you look at all the, the stuff in here, like, a lot of the comments that, you know, just, listen to this top section, don’t have very long length, byte lengths, compared to the maximum byte length of the column. But, what’s really interesting, ready, like, so, I’m going to show you this and I’m going to talk a little bit about memory grant stuff, is when we run this now, we have this average column length of 302 bytes. And this is actually, this actually plays pretty well into how SQL Server does memory grants for string columns.

Right? Because what it does is it guesses that every row that is produced, that needs to be sorted, for a string column, that, that, that row data will be half full. Right?

So, for a var, and varchar 700, having 302 bytes in there is actually pretty, pretty close to half. Right? So, the average comment length in here actually works pretty well for that algorithm. It might not work well for every, like, data set.

You might have, in varchar 700 columns, where, like, legit, like, everything really is only, like, you know, half or, like, 50 bytes full or something. And you would just get really excessive memory grants for that stuff. So, if we run this query right here.

And this query does not select the text column. And we look at what it does. This, this one still sorts, spills, this one still spills a little bit.

But it’s pretty quick, right? Two seconds. Like, no one’s, no one’s going to really gripe about two seconds. But now, if we run this query, which is, which is captive, a max grand percent of one with the text column involved. Here’s, here’s what we’re going to have to, here’s what we’re going to do.

Is we’re going to come over here. And we’re going to run sp pressure detector with a sample of 12 seconds, which might, which might help you understand exactly how long that query that I just highlighted is going to run for. So, we’re going to kick that off.

And then we’re going to run this. And what we’re going to see at the end is that SQL Server spent way more time spilling when there was text data involved. Because we have way more pages to spill out.

We have bigger data size to spill out because of that text column. Right? So, this whole thing takes about 10 seconds, which is just, which, if you’re wondering why pressure detector was at 12 seconds, it’s so I could run it and have like a second or two of grace period to, to come over here and execute this one. Right?

So, if we look at what sp pressure detector tells us about the 12 seconds that this ran for, what you’re going to see way up at the top is this IO completion weight. And if you, if you notice the, the helpful description column that I’ve put into SP pressure detector, just for you, just for you, because I love you and I care about you way more than Microsoft support does. So, this is the weight type.

This is the weight type that you will see from queries when they are spilling sorts a lot. There are different weight types, which we’re going to look at in future videos that happen when hash spills, hash operators spill from both a hash aggregate and hash join perspective. But the IO completion weight is pretty decent, like you can associate that pretty decently with row mode sort hash spills.

So, if this were, if this were a batch mode query, you would see BP underscore sort as the batch mode sort thing that was happening when things were spilling and getting awful. So, if you, if you’re looking at a server as a whole, or if you’re looking at weight stats for a query and you’re wondering what IO completion means, well, if, you know, if you have a lot of slow queries that are, you know, doing a lot of sorting and they’re doing that sorting in row mode, there’s a pretty good chance that they are spilling lots and lots of stuff out to disk. And since you work for knuckleheads who, you know, put you on the cloud, you’re probably going to have to fix those because that actually can meaningfully slow a query down right there.

So, IO completion weights, if you see those associated with running queries, a lot, some spilling going on. Whether that spill is the root cause of why the query is slow, you’re going to have to determine that or hire me to do that because I’m happy to, happy to tell you. Either way.

But that’s, that’s what you would have to do there. Anyway, my, my wife has been texting me for 20 minutes. So I should probably respond or something. But before I do, before I go, thank you for watching.

I hope you enjoyed yourselves. I hope you learned something about sort spills. If you like this video for whatever reason you like it, it doesn’t have to be the content. It could be, it doesn’t have to be what you see in SSMS.

It could just be my bright, sunshiny presence here on the screen. I like the thumbs ups on the videos. And I like, you know, the you go girl comments in the videos.

You can even say you go girl. I won’t, I won’t be offended. So there’s that. If you like this sort of SQL Server content in general, please subscribe to my channel and you can join drum roll. Let’s hit this refresh button.

Make sure we’re totally up to date. You can join nearly 3,779 other data darlings who get notified every time I publish one of these videos that mean so, so very much to me. So once again, from the bottom of my heart, thank you for watching.

Thank you.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 25% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.