sp_PressureDetector: What SQL Server Looks Like Under CPU Pressure
Video Summary
In this video, I delve into the intricacies of CPU and memory pressure in SQL Server, focusing on how to identify these issues through practical examples. We start by examining what CPU pressure looks like in a server environment, specifically looking at THREADPOOL waits and SOS_SCHEDULER_YIELD waits. I walk you through calculating worker threads for different versions of SQL Server, explaining why the calculations changed with newer versions. Additionally, I demonstrate how parallel queries can impact your server’s thread pool and provide insights on using sp_PressureDetector to monitor CPU usage effectively. By the end of this video, you’ll have a clearer understanding of how to recognize and address CPU pressure in your SQL Server environment.
Full Transcript
Erik Darling here from Darling Data, with Darling Data. It came from the data. He’s just emailing me. Never mind. Nothing important. And we’re going to continue our lecture series on sp_PressureDetector by first talking about what CPU pressure can look like in SQL Server. That’s a pretty good one to start with. We’ll also look at what memory pressure might look like in SQL Server. Well, not in this video. That’s the next video. If I put everything in one video, it would be an hour long and you would die watching it. Who knows? Oh, and then in another video, we’ll talk about how sp_PressureDetector can help you identify which queries are having which issues. Don’t look so amazed.
Anyway, I think the way that CPU pressure most commonly manifests itself, when servers are having it, is in the wait stats: either a lot of THREADPOOL waits, or a lot of SOS_SCHEDULER_YIELD waits. Just not enough CPUs to service all the CPU stuff that’s going on. It’s not going to be CXPACKET, CXCONSUMER, or any of the CX waits. It’s going to be SOS_SCHEDULER_YIELD and THREADPOOL for the most part. Now, there are various ways to figure out how many worker threads you have, and this is generally the calculation that you’ll want to use. This is the calculation for a 64-bit server; if you’re using a 32-bit server, I can’t conceive of you at this point in time. It’s windy out there today. If you can hear wind chimes in the background, it’s because it’s a very windy day here. So this is the calculation that you would use for modern-ish versions of SQL Server that have fewer than 64 cores in them. Now, Microsoft actually did change the calculation semi-recently for servers with more than 64 cores.
Before I talk about that, I did get a user question. They said, why don’t you wear one of your t-shirts when you record these videos? Why are you wearing an Adidas shirt when you record these videos? Because I don’t have any of these shirts left. I did not save one for myself. I gave all of these away at various pre-cons, SQLBits, and PASS Data Community Summit. Sorry, Community Summit. And should I ever get invited back to one of those lovely venues again, you can be sure that there will be more available.
I think I have a pretty good grip on which sizes I need to bring now, so there won’t be a scramble to give away leftovers of lots of small and extra small t-shirts at the end. But let’s look at this impressive piece of documentation. What we’re going to do is skip this part, because if you are using a version older than this, I just feel bad for you.
So we’re going to focus on this section down here, starting with SQL Server 2016 SP2, and SQL Server 2017 and up. This also applies to 2019 and 2022.
Of course, it doesn’t stop at 2017. That’d be absurd. The calculations that you want to use look like this: if you have fewer than or equal to 4 logical CPUs, you get 512 worker threads.
If you have greater than 4 and less than or equal to 64, you get 512 plus (the number of logical CPUs minus 4) times 16. And if you have greater than 64 cores, the only thing that changes is that it goes times 32. Now, it’s a little confusing, because it looks like it’s 512 squared, but that’s really just a footnote down here.
I don’t know. It’s an odd design choice, that whole thing. But whatever. Anyway, that’s where I got that calculation from.
And that’s how I know that, based on the number of CPUs I have, 512 plus (CPUs minus 4) times 16 is the number of worker threads for my server currently. So you should be glad to know all that. Now, the reason why this is pertinent to the conversation is partially because of the thread pool thing that we talked about.
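As a sanity check, the documented formula can be sketched in a few lines of Python. The function name is my own, and this assumes a 64-bit server on SQL Server 2016 SP2 or later; the thresholds come from the documentation page shown above.

```python
def max_worker_threads(logical_cpus: int) -> int:
    """Default max worker threads for 64-bit SQL Server 2016 SP2 and up."""
    if logical_cpus <= 4:
        return 512
    if logical_cpus <= 64:
        return 512 + (logical_cpus - 4) * 16
    # Above 64 cores, the multiplier changes from 16 to 32.
    return 512 + (logical_cpus - 4) * 32

# An 8-core server gets 512 + (8 - 4) * 16:
print(max_worker_threads(8))   # 576
# A 96-core server uses the times-32 multiplier:
print(max_worker_threads(96))  # 3456
```

Keep in mind this is only the default; if someone has set max worker threads manually on your server, none of this math applies.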
Now, on just about every server, you are going to see some amount of THREADPOOL waits. My wonderful friend Josh Darnell, a.k.a. Josh the Coder, a.k.a. the guy who keeps my website alive, did some digging. And he had an interesting blog post, which I’ll link to in the show notes, about how SQL Server will grow and trim the pool of worker threads as the workload needs more threads and as the workload needs fewer threads.
And so you’re always going to see some THREADPOOL waits as SQL Server grows and trims the pool of workers. Where you have to worry about it is when you compare the total duration of THREADPOOL waits to server uptime and it’s a rather significant portion. If it’s like 5%, 10% or higher, you might have a pretty consistent thread pool issue, especially if there are long average waits per wait on THREADPOOL.
When queries are waiting many seconds to get threads to run, that’s not a good sign.
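That uptime comparison is easy to sketch. To be clear, the 5% threshold is just the rough rule of thumb from above, not an official number, and the function name is mine:

```python
def threadpool_looks_bad(threadpool_wait_ms: float, uptime_ms: float,
                         threshold_pct: float = 5.0) -> bool:
    """Flag a likely thread pool problem when total THREADPOOL wait time
    is a significant fraction of server uptime."""
    pct = 100.0 * threadpool_wait_ms / uptime_ms
    return pct >= threshold_pct

# A server up for 24 hours with 2 total hours of THREADPOOL waits (~8.3%):
uptime_ms = 24 * 60 * 60 * 1000
print(threadpool_looks_bad(2 * 60 * 60 * 1000, uptime_ms))  # True
# The same server with only 30 seconds of THREADPOOL waits:
print(threadpool_looks_bad(30 * 1000, uptime_ms))           # False
```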
Now, where things become rather interesting, of course, is with parallel queries. Because parallel queries can reserve a lot of worker threads. And they can work towards exhausting your available threads a lot faster than a bunch of single-threaded queries.
So execution plans are turned on, so I don’t have to press any extra buttons. If we run this query and we get a count from the comments table, we will have a parallel query plan. All this stuff here is parallel.
And if we look at the properties of the select, and again, you know, the properties section is a really important thing to look at. I’m going to blow this up over here so I don’t have to do too much tinkering with moving around. We will see that this particular query, which ran at a DOP of 8, well, let me open this up, reserved and used 8 threads.
A parallel query can reserve far more threads than it actually uses. So that’s a good thing to look at if you’re troubleshooting a parallel plan issue. Now, let’s get rid of that for a moment.
Now, that’s a parallel query that has just one single liner, one single branch. If we run this query, which will join a couple other tables together, and we look at the query plan, eventually it will show up. I promise.
This one has some additional branches and has some additional parallel exchanges. We have some parallel exchange here. We have a parallel exchange here.
We have a parallel exchange here. And typically, we want to think about those exchanges as being stopping points to a particular branch of the query. Now, sometimes SQL Server can share threads prior to a parallelism exchange.
So if we look at the Thread Stats here, it looks a little bit different. Now we have two branches, and we have reserved and used 16 threads there. Now, what’s important to note is that there will always be a plus 1.
So if you have DOP set to 8, it’ll be 8 plus 1. And you can see that if you look at a parallel operator. It doesn’t have to be a parallelism exchange operator.
But if you look under here, you’ll see that there are nine threads involved. Thread 0 is this coordinator thread, and then the other eight threads are ones that are doing actual work. All right, so it’s 8 plus 1.
So it was technically 9, but don’t shoot the messenger. And if we run this third query, which is a little bit different: the join that I do to Votes is a little bit different.
And, well, I mean, this join makes sense. This join makes no sense. I just did it to get a different join operator between Votes and Users. But if we change this query a little bit to join Comments to Votes on the PostId column, which makes far more sense, we will get what looks like a quite similar parallel plan.
And it’ll finish eventually, I promise. I swear to you. I wouldn’t lie about that.
God, I write good demos, don’t I? Who writes demos better than me? Look at that thing. 19 seconds on the nose, just like I predicted.
But if we look at this query, and again, we go to the... oh, hey, thanks, tooltip, show up whenever you want. If we go to the properties here and look at the Thread Stats, now we have a third branch, and we have 24 reserved and used threads.
All right, so typically, the bigger the plan and the more branches you have in a parallel query, the more threads it reserves. If you notice the difference between this plan and the plan before (I don’t want to run it again and make you wait another 20 seconds), the last plan had a nested loops join between Users and Votes, and that’s why we didn’t have an additional parallelism exchange.
This one, we have parallel exchange here, we have a parallel exchange here, we have a parallel exchange here, and we have parallel exchanges here and here.
So these are all sort of potential separate branches. But since we had DOP 8, and we had three branches in this plan, we had 24 reserved threads, even though we only have eight threads really working at any one time.
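The pattern across these three demo queries reduces to some napkin math. This is a deliberate simplification (real reserved thread counts depend on which branches can run concurrently and whether threads get shared), and the function names are mine:

```python
def reserved_threads(branches: int, dop: int) -> int:
    """Worst-case worker threads a parallel plan reserves:
    one set of DOP workers per parallel branch."""
    return branches * dop

def total_threads(branches: int, dop: int) -> int:
    """Reserved workers plus thread 0, the coordinator."""
    return reserved_threads(branches, dop) + 1

# The three demo queries, all at DOP 8:
for branches in (1, 2, 3):
    print(branches, reserved_threads(branches, 8), total_threads(branches, 8))
# 1 branch:   8 reserved,  9 total
# 2 branches: 16 reserved, 17 total
# 3 branches: 24 reserved, 25 total
```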
But again, it’s going to be plus this coordinator thread. All right. So with all that out of the way, we’re going to use SQLQueryStress again, and we’re going to run this query in it, which you may recognize from the remote DAC video.
And I’ve already got this window up here set to show remote DAC stuff. And we’re going to use sp_PressureDetector to check on CPU issues.
So right now, I don’t really have much going on here. I have some SOS_SCHEDULER_YIELD waits, but I just cleared out wait stats.
We don’t expect to see much of anything. We have some CPU details in here that are kind of nice. It’ll tell you if any CPUs are offline. This is usually what I look at first, but there’s some other stuff in here that’s useful for getting the lay of the land for the CPUs.
And then if, at any point since the server has been up, you have hit greater than 50% CPU, we’ll also log those occurrences over here.
You can see that. But we’re going to be paying attention to this section over here for the remainder of the demos.
So let’s hit go here first, and let’s hit go here. And since we already have the DAC, what we’re going to see is which queries are running.
There’s a section here which will tell you if there are any requests currently hitting THREADPOOL waits. Now, this is important, because sessions waiting on THREADPOOL do not have a session ID yet.
They don’t have that context set for them. So you can sit there waiting for a very long time to get a worker thread.
Wait stats are going to start to pile up as the workload chugs along. But if we look at this section right now, we’re going to see that we have 576 total threads, but we have 618 in use, which means we have negative 42 threads on this server.
Now, at the moment, we have seven requests waiting for threads. You can see that there. And if we look down in this thread pool section, there are one, two, three, four, five, six, seven in here that have been waiting on THREADPOOL for a bit.
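Those numbers line up with the worker thread formula from earlier: 576 is the default for an 8-core 64-bit server (512 + (8 - 4) * 16), and the deficit is simple subtraction. A small sketch, with a made-up function name:

```python
def thread_headroom(max_workers: int, workers_in_use: int) -> int:
    """Positive means spare worker threads; negative means new requests
    queue up waiting on THREADPOOL."""
    return max_workers - workers_in_use

# The demo server: 512 + (8 - 4) * 16 = 576 max workers, with 618 in use.
print(thread_headroom(576, 618))  # -42
```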
Now, let’s run this again, since the workload’s been going for a little bit. Now we have four that are waiting on THREADPOOL. And if we look at the waits, THREADPOOL has slowly crept up.
Again, this is just my contrived workload, so we haven’t had a ton of queries waiting on THREADPOOL, but the ones that did waited about five and a half seconds on average.
That’s if my cursor will show up in the right place. All right, server, I’m really beating the tar out of you there. These queries are waiting about five and a half seconds to get a thread to execute, and that’s where things can look no good at all.
So this is holding pretty steady. We have some additional requests. The query that I’m running runs kind of slow, so this isn’t going to give you an awesome view of that creeping up with a lot of requests waiting on it.
But I think you can get the idea pretty well from that. So let’s cancel this stuff and just take a look at waits again. Now, remember the two things that I told you are really going to creep up when we have CPU pressure on the server.
One of them is going to be THREADPOOL, which is cleverly documented for you in sp_PressureDetector as worker thread exhaustion. And again, that held pretty steady around the five and a half second mark, which is not good.
We don’t want THREADPOOL to be a wait that queries are experiencing. But where things get, I think, really interesting is the other wait that I told you can get out of control when your server is under CPU pressure, and that’s SOS_SCHEDULER_YIELD.
So if we look at that, we have that wait here, SOS_SCHEDULER_YIELD, which again is query scheduling. And if we come over here a little bit, we have had 151,215 waiting tasks.
And on average, they waited 157 milliseconds for CPU attention. That’s queries saying, I’m ready to go.
Can the CPU pick me up, please? Waiting for my CPU driver. So that’s not a good situation. Now, this is a very exacerbated situation, but I think what we should keep in mind with SOS_SCHEDULER_YIELD waits is what happens when queries get on a CPU to do some executing.
You have this thing called the CPU quantum; I guess that’s close enough for our purposes. And I apologize for scratching my eyes so much.
Again, it’s very windy out today, and I was outside earlier, and I have like grit in my eyeballs that is making noise. I blink.
It’s fun. But you have this quantum, which is a set amount of time that a query gets to use a CPU before it jumps off and allows another query to have CPU attention, if there’s another query waiting; or your query will just immediately jump right back on if there’s nothing else in line.
In this case, everything was screwed up, and that quantum, that set amount of time, once you factor everything in, ends up around four milliseconds.
I’m not going to say exactly four milliseconds, because imagine measuring four milliseconds exactly every single time. Bonkers. It’s insane. Now, granted, there are situations where you could have a non-yielding scheduler, where something is just dominating a scheduler and not letting it go.
I see that happen sometimes with weird xp_cmdshell stuff, or with system stuff too, like buffer latch timeouts; you might see non-yielding schedulers arise from that.
But queries were waiting 157.4 (and a bunch of other decimal places) milliseconds to get back on a CPU for four milliseconds of CPU attention.
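To put those two numbers side by side: treating the quantum as roughly 4 milliseconds (an approximation, as noted above), the wait-to-work ratio looks like this:

```python
avg_wait_ms = 157.4  # average SOS_SCHEDULER_YIELD wait from the demo
quantum_ms = 4.0     # approximate scheduler quantum

ratio = avg_wait_ms / quantum_ms
print(f"Queries waited roughly {ratio:.0f}x longer for a CPU "
      f"than they got to spend on one.")
```

A healthy server sits nowhere near that; when the ratio climbs past 1, queries spend more time waiting for CPU than using it.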
And that’s a really bad situation. So, we learned in the prior video why you want the DAC, the dedicated admin connection: so you can run diagnostic queries when your SQL Server is under a fair amount of strain, which is what we’re doing here.
We’re using the dedicated admin connection. And we learned that when we have a set amount of worker threads for our SQL Server, and when we start exhausting those threads, things can really start to pile up and cause trouble.
Now, you don’t have to hit THREADPOOL to have CPU pressure, but I showed you two signs of it: THREADPOOL waits with long average waits, and SOS_SCHEDULER_YIELD waits really piling up to the top of the list.
Especially if those have long average waits, because that means queries are potentially waiting longer to get on a CPU than they’re spending on a CPU.
And that’s a really, really bad sign. It can mean a lot of things: your workload is just god-awful, or you have too few CPUs to service the workload.
Or, I don’t know, you have really old one-or-two-gigahertz CPUs that are spinning too slowly. Maybe you have Balanced power mode on your server.
A lot of things can happen in there. But anyway, I’m going to call this one here, and I’m going to start putting myself together to talk about what servers can look like when they are under a lot of memory pressure.
So that’ll be, that’ll be fun for you too, right? Anyway, thank you for watching. I hope you enjoyed yourselves. You’re listening to the grit in my eye crunch.
If you, if you enjoy this sort of content, you know, gritty, gritty eye crunches or not, you should like and subscribe to the channel.
So every time I publish something, YouTube will yell at you about it. You get 10 million notifications. It should be a lot of fun for you. And I think that’s about it.
Yeah. 22 minutes of me babbling here. That’s, that’s quite enough. All right. Thanks. Goodbye.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.
missing your written blog posts
Yeah, I just haven’t felt like writing lately, so I’m producing content in a way that I currently enjoy.
Your blog, your rules