Diagnosing and Fixing tempdb Contention from Spills in SQL Server

Thanks for watching!

Video Summary

In this video, I delve into the fascinating world of TempDB contention caused by spills in SQL Server. Joining forces with my buddy Erik Darling from Darling Data, we explore how these spills can significantly impact performance and cause TempDB to become a bottleneck. We demonstrate this through an engaging example using SQL Server Management Studio and SQL Query Stress, revealing the ins and outs of how spills affect query execution times and CPU usage. By the end of the video, you’ll understand the importance of indexing and sort order alignment to mitigate these issues, ensuring smoother database performance in highly concurrent environments.

Full Transcript

Your best friend in the world, Erik Darling here with Darling Data. And in today’s video, we are going to talk about how spills can cause TempDB contention. Isn’t that just miraculous? We can just keep finding new ways to abuse TempDB. If it wasn’t like table variables and temp tables and, you know, I don’t know, all the other stuff that hits TempDB, if that wasn’t enough, now we have to worry about spills causing TempDB contention. And, boy, does that suck. But first, let’s talk about you and me. If you have already done this stuff, if you have already liked, if you have already commented, if you have already subscribed, and I do thank you from the deep, deep, dark bottom of my heart for doing that, and you want to go steady with me, you want to take things to the next level, you can click the link in the video description just below to sign up for a low-cost $4 a month membership. To the channel, which will just say thank you for doing a good job. That would be cool. If you need help with your SQL Server, and you think, boy, Erik Darling sure would be useful doing any of these things, hit me up. My rates are reasonable.

If you need me to do anything else with SQL Server, the same thing applies there. We can talk. My rates will remain reasonable just for you. Just make sure that you said, make sure when you hit me up, you say, hey, I’m coming from your YouTube channel where you said your rates are reasonable. Otherwise, I’ll have no idea who you are, and I might say something unreasonable. If you would like to get some high-quality, low-cost SQL Server training that is good for the rest of your life, so the longer you live, the more it’s worth, you can get about 24 hours of it for about $150 at my site there.

You can use the discount code SPRINGCLEANING for that. If you want to catch me live and in person, I will be in Seattle, Washington with Kendra Little doing two days of SQL Server performance tuning witchcraft, wizardry, warlockery. And it would be cool to see you there because it’s my birthday and I’m making goodie bags, so you have that to look forward to.

With that out of the way, let us begin our SQL Server partying. Now, let’s go over to SQL Server Management Studio and let’s make sure we have no indexes here. The first thing I want to point out is that if we run this store procedure on its own, it’ll run for about 1.3 seconds, right?

We spend about 200 milliseconds scanning the clustered index and then another second or so over here in the sort, right? Okay, so this runs pretty quick right now by itself with nothing else going on. Let’s come over to SQL Query Stress and let’s run this.

And we’re going to give this a few seconds to warm up. What we’re going to see while this starts warming up and doing stuff is that the queries start running for longer and longer over here, right? No longer are we at just the 1.3 second mark.

Some of these have been going for almost 4.5 seconds. What you’re going to see over in this column is a lot of stuff being runable, right? So runable is going to be SOS scheduler yield related.

And if we keep looking over here and running this, all of a sudden we’re going to start seeing queries taking a lot longer over here. Now, classic tempdb contention signs would be seeing stuff like in the wait info column like pagelatch underscore UP or pagelatch underscore EX. And sometimes you might see that in here if you have enough of the spill contention going on.

That could still totally happen. But we’re not seeing that in there. What we’re seeing are queries running longer and longer because they’re getting bogged down in tempdb all trying to do the CPU work to deal with the spills. Now, keep in mind, this is only 50 threads, right?

It’s 100 iterations, but it’s only 50 active threads at a time, right? All the results from spwhoisactive will say about 50 rows down here in the armpit zone for anything that’s actively running. So, like, not a lot in here is, like, you know, like, I’m not exhausting worker threads as part of this.

If I ran a lot more threads, things would get a lot worse. But you can see in this column that now stuff is taking, like, 9, 10, 11 seconds in here. And, indeed, if I run my store procedure with the rest of the workload going, it’s going to take a while, right?

That was, like, 1.3 seconds before. Now this thing took, well, just about 10 seconds. Or, sorry, just about 9 seconds.

8.675 seconds now here. 726 seconds here. If we look at the weights for this query, right? If we come over here and look, the weight stats for IO completion won’t be that bad, right? This is the typical sort-spill thing in here.

And I think that there’s been some improvement in the row mode algorithm for that. Really, what we get bogged down in is the SOS scheduler yield weights over here. There was about 6 seconds of those for the 8-second query.

So, about 6 of the 8 seconds that we spent waiting for this thing to finish were just, like, waiting on, like, CPU, cooperative CPU scheduling. And, again, this isn’t a ton of stuff going at once, right? I’m not, like, you know, throttling the server with 1,000 active requests.

This is just 50 queries running. Granted, they’re all doing the same thing and they’re all spilling. But, you know, that’s just kind of funny to see, right? So, yeah, it’s like running this over and over again, things get kind of wonky.

And we, so, this ran for almost three minutes and we barely completed, a little over 900 total copies of this query finished in that time, right? So, like, on average, about five and a half seconds of CPU. The, like, some of the iteration average in there, 17 seconds.

So, things were getting really slow on the server itself. Now, with nothing running on here, if I come back and run this again, tempDB goes back to normal. We’re at about 1.3 seconds for this.

Cool. Let’s look at how we might fix that. Now, in-memory tempDB stuff might help, right? Depending on what kind of contention you’re seeing.

If we were seeing the page latch contention, then in-memory tempDB stuff might be good. You know, but I think, you know, for a lot of the queries where I see, you know, something like this where, you know, there’s a little spill from a little sort and we could fix it pretty easy with an index, we should do that. Now, one thing that might surprise you a little bit is that creating this index on reputation does not fix the spill.

Or, rather, does not get rid of the sort. It does make the sort smaller, right? Or, it does make the sort faster because we’re reading from a much smaller index.

But, it doesn’t actually get rid of the sort. To do that, we need to actually match the sort order of the reputation column here in the index to the sort order of the query. Now, a lot of people might write stuff like, does index sort direction ever matter?

No, it’s stupid. Leave it alone. You’re an idiot. Leave it. That’s not true. There are lots of times when it does matter. This can be one of them.

Another time when it matters quite a bit is when you’re writing windowing functions. And, the windowing function specifies, I mean, not in the partitioning clause because the partitioning clause doesn’t have an ascending, descending. But, in the order by clause of a windowing function, you might specify some column descending.

And, then, your index sort direction matters quite a bit. Okay. So, with this index created, at least I’m pretty sure it’s created, if we run this, we’ll see that this query finishes just about instantly. And, now we have a top here but without the sort involved.

So, now, if we come over here and we run SQL query stress for, again, 100 iterations on 50 threads, we won’t even have time to get back to SQL Server management studio because this will have completed in about 15 milliseconds. So, that’s a pretty good deal there. I like it quite a bit.

And, I think you should, too. So, anyway. With that out of the way, thank you for watching. I hope you enjoyed yourselves.

I hope you learned something. I hope that you are now able to go out into the world and resolve TempTB contention issues from spills because it’s a good thing to do. You’ll probably be a hero to someone if you start fixing these sort of problems.

Now, granted, these aren’t the sort of, like, big query problems that you might see from, like, reports where you go from, you know, 30 minutes to 3 seconds or something. But, this is the kind of, like, you know, highly concurrent workload stuff that you have to start thinking about and analyzing and addressing when you’re dealing with big, highly concurrent OLTP workloads because this kind of stuff can really sneak up and bite you. So, you know, as usual, spwhoisactive is a great tool to start tracking this stuff down.

And, you know, the more you can do to, you know, monitor and get an idea of what’s going wrong on servers when things are in a highly concurrent state, the better job you can do of starting to tune things so that you don’t have to worry about the server falling over. If I were to, if I really wanted to make things awful, I could have thrown way more worker threads at this and things would have been just terrible in here, right? 50 active users, right, obviously doesn’t take much to bog down a SQL Server.

So, you know, you can see some really profound effects even from small amounts of really highly concurrent spilling activity. So, anyway, that’s about it for this one. I got a couple more videos about rewriting functions and function inlining and stuff.

So, we’re going to get to those. And then, I don’t know, I think we’re going to, I don’t know what we’re going to do, to be honest. We’re just going to close these tabs out and I’m going to get on an airplane tomorrow.

Then I’ll figure out what to do when I get home. All right. Thank you for watching.

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

8 thoughts on “Diagnosing and Fixing tempdb Contention from Spills in SQL Server”

TomF says:

October 18, 2024 at 5:28 pm

On recent non-VMs… use of NVMe for tempdb shiuld minimize this potential, right?
1. Erik Darling says:
  
  October 18, 2024 at 5:34 pm
  
  Well, this is a VM, but it’s a very local VM working on direct attached NVME. So you are free to draw conclusions from that.
Henrik Staun Poulsen says:

October 22, 2024 at 8:05 am

Would you mind telling us why you use “Sort_In_tempdb = on” ?
“A friend of mine” always told tought that meant double the writes.
1. Erik Darling says:
  
  October 22, 2024 at 8:35 am
  
  I use it to avoid unnecessary user database file activity. It’s difficult to figure out the intent of your second sentence, so I’ll leave it be.
  1. Henrik Staun Poulsen says:
    
    October 23, 2024 at 12:57 pm
    
    hi Erik,
    I always thought that sorting in tempdb meant writing things twice.
    But when running
    create index ix_reputation on dbo.users (reputation) with (sort_in_tempdb=on, data_compression=page, drop_existing=on)
    create index ix_reputation on dbo.users (reputation) with (sort_in_tempdb=off, data_compression=page, drop_existing=on)
    my server has everything in RAM, so no difference.
    After setting MaxMem to 1GB, and running
    drop index ix_title on dbo.Posts
    create index ix_title on dbo.Posts (title) with (sort_in_tempdb=off, data_compression=page)
    drop index ix_title on dbo.Posts
    create index ix_title on dbo.Posts (title) with (sort_in_tempdb=on, data_compression=page)
    I was able to lower the writes on mydb 10x, and overall writes by 10%
    
    It looks as if I need to copy your “sort_in_tempdb=on”.
    Thank you.
    1. Erik Darling says:
      
      October 23, 2024 at 2:00 pm
      
      Well, congratulations.
Tom Wickerath says:

October 24, 2024 at 4:20 am

Using Microsoft SQL Server 2019 (RTM-CU20), I was unable to produce a tempdb spill, even before creating the initial index, until I changed the database compat level to 140 (SQL Server 2017). After doing that, I was able to see the spill to tempdb in the execution plan with no index and with the first version of the index created. Upon creating the second version of the index, with the descending sort on Reputation, I no longer got the tempdb spill with compat. = 140.
1. Tom Wickerath says:
  
  October 24, 2024 at 4:22 am
  
  Forgot to say this, but cool demo!

Comments are closed.