A Little About PAGEIOLATCH Waits In SQL Server
Video Summary
In this video, I delve into the concept of PAGEIOLATCH waits in SQL Server and how they can indicate performance issues related to insufficient memory for the amount of data stored. Using my scaled-down server setup, which mimics common client environments with limited resources, I demonstrate how high PAGEIOLATCH waits can significantly impact query performance by forcing frequent disk access. I also discuss practical steps to mitigate these issues, such as optimizing indexes and improving query writing to better leverage available memory.
Full Transcript
Erik Darling here with Darling Data. In today’s video, we’re going to talk a little bit about page IO latch waits. And before we do, what I need to do is introduce you in case, well, let’s be honest, you may not have seen every single video on my channel. You may not have seen very specific videos where I talked about this tiny little server that I use to really beat up on and show how bad things can get with SQL Server. So let’s zoom in on this a little bit. I should probably be in the right window context for that, right? I hope hitting control and one didn’t break anything important outside of this thing. So this tiny little server has about 14 gigs of memory assigned to it and only two virtual processors. So this thing is very easy to overload and overwhelm. The thing is that this is sort of like me having to create something that scales, or like create a good scale model of some of the hardware configurations that I see when working with clients. So like my Stack Overflow database is only like a hundred and something gigs, but like, you know, the VM that I usually use on my laptop is 16 cores and like 96 gigs of memory. So like, that’s a pretty solid hardware-to-data comparison, right? You can still make bad things happen with it, but it’s a lot easier to show the kind of performance problems I see with a really scaled down VM, but still the same size of data. So like you could take, you know, like a production server that maybe you’ll work with and it might have like, I don’t know, like eight cores or 16 cores or something and like 128 gigs of memory, but you have like a terabyte or two of data on there, right? So like, this is just sort of like a scale model of the size of the database that I normally use, to make sure that the server can kind of display some of the same issues that I see happening with live production servers in my day-to-day client work. 
So that’s kind of the reason behind this thing. So let’s get out of there. And let’s talk a little bit about page IO latch waits before we go too much further. Now, like, I don’t know, maybe like a hundred, 200 years ago, sometime around the time of the American revolution, there were a lot of blog posts that said, like, if you have high page IO latch waits, either EX or SH, then you have a disk IO subsystem problem.
I’ve never met a disk IO subsystem. So I don’t really know when one would have a problem or not. It’s kind of an outdated term now anyway, I think. But that’s not really what it means. What it means when these waits really accumulate, and when you have a lot of queries that are constantly waiting on these, or you look at wait stats for the server as a whole and you just see that page IO latch waits are a significant number of hours of waits compared to how many hours your server has been up, is that either you don’t have enough memory for the amount of data that you’re storing, period. Right? Like nothing you do is ever going to work. You might have great indexes. You might have great queries. You just can’t possibly cache everything you need in memory. So your queries are constantly going out to disk to get it, to bring it into memory, so that your queries can start using it.
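If you want to make that comparison yourself, a rough sketch looks like this. This isn’t a script from the video; it’s just the standard DMVs, comparing PAGEIOLATCH wait hours against server uptime hours:

```sql
/* Sketch: how many hours of PAGEIOLATCH waits have piled up
   compared to how many hours the server has been running? */
SELECT
    dows.wait_type,
    wait_hours =
        dows.wait_time_ms / 1000. / 60. / 60.,
    uptime_hours =
        DATEDIFF(SECOND, dosi.sqlserver_start_time, SYSDATETIME()) / 60. / 60.
FROM sys.dm_os_wait_stats AS dows
CROSS JOIN sys.dm_os_sys_info AS dosi
WHERE dows.wait_type LIKE N'PAGEIOLATCH%'
ORDER BY dows.wait_time_ms DESC;
```

Keep in mind that wait stats are cumulative since the last restart (or manual clear), so bursty workloads can make the ratio look better or worse than it really is.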
It could also mean that maybe there’s just a deficiency somewhere, right? Like, maybe your queries are written like crap, with a lot of non-sargable predicates, like ISNULL, COALESCE, SUBSTRING, LEFT, RIGHT, whatever, in the where clause. Or maybe you just have no good indexes for your query to use, and everything is scanning a heap or a clustered index or something. I don’t really have a good hard and fast rule about how much memory you should have compared to how much data you have.
You know, if you can get close to one-to-one and, you know, you see your page IO latch waits almost disappear, great. You might be in weird situations where, you know, like you might have on your server a five terabyte database full of nothing that anyone ever touches. You just dump data into it from other places and it’s like archive, cold storage. Like maybe once a year someone will do something with it.
And your main production database is reasonably small. So don’t just look purely at size of data when you want to figure out how much memory you need. Really what you want to do is look at your page IO latch waits.
And kind of like a weird brute force way of thinking about it is that every time you double memory, you’ll cut your page IO latch waits in half. That’s not always going to be perfectly true. That’s not always going to be like exactly what happens.
But that’s just a way to start thinking about it so that you can start figuring out exactly how much memory you might need to get rid of them as completely as possible. The reason you want to get rid of them as completely as possible is because when queries have to go to disk, right, disk is always going to be the absolute slowest component in a server. I say “in a server” loosely, because most storage these days is not actually in the server.
Most storage these days is connected via network. That’s also true of my VM, right? Because if you look at my VM, it’s on this tiny little Samsung disk, right?
And this Samsung disk is connected via a little USB cable right here. And this tiny little USB cable is often what has a very, very hard time moving data around, right? So my VM and everything is on here, but all my memory is on my laptop.
So anything that I do has to get scooted across that wire, which can really slow things down. One great way of figuring out if not your storage, but if your storage networking is having a hard time keeping up with the amount of data load that you’re putting on it, is to look in the error log for 15-second I.O. warnings.
To make this easy on you, I have a stored procedure called sp_LogHunter. I apologize. My social media intern who does all the sound effects in my stored procedures is still out sick.
I don’t know. Maybe they quit. I just haven’t found the letter yet.
If you go to code.erikdarling.com, you can grab sp_LogHunter, and you can use that to parse through your error log and look for all the awful things that hide out in there. The 15-second I/O warnings are one of the messages that it looks for. So what I’m going to do is I’m going to configure my tiny little tuning server to have 8 gigs of memory.
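sp_LogHunter does the heavy lifting for you, but for reference, the quick-and-dirty manual version uses the built-in xp_readerrorlog to search the error log for the warning text. The search string below is pulled from the standard message, which reads something like “SQL Server has encountered N occurrence(s) of I/O requests taking longer than 15 seconds to complete…”:

```sql
/* Search the current SQL Server error log for 15-second I/O warnings.
   Arguments: log number (0 = current), log type (1 = SQL Server),
   and a string to search for. */
EXECUTE sys.xp_readerrorlog
    0,
    1,
    N'longer than 15 seconds';
```

If the disks, or more likely the networking in front of them, are struggling, these messages will show up during your busy periods.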
And I’m going to create this stored procedure, which actually it’s already created. I don’t know who I think I’m fooling here. And I’m going to use SQLQueryStress to throw 10 threads at it, at least according to my notes here.
I need 10 threads and 10 iterations. All right. Good background information there, right?
10 threads, 10 iterations. 10 times 10 is 100. I’m going to run this thing 100 times. I’m actually going to run it more than 100 times. If you look at the logic in here, it’s kind of gnarly. This runs in a while loop, and this while loop essentially will make this thing run 10 times.
Every time it runs, it’ll pick a random number between 1 and 5. And depending on what number comes up, we’re going to do a select count from a different one of the main tables in the Stack Overflow database. So what this is designed to do, with the amount of memory that we have in here, is just really stress the disk.
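The procedure itself isn’t shown in full here, but based on the description, a while loop, a random number between 1 and 5, and a count against one of the main Stack Overflow tables, a sketch of it would look something like this. The procedure name and loop count are my guesses; the table names are the standard Stack Overflow database tables:

```sql
CREATE OR ALTER PROCEDURE dbo.disk_stressor
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE
        @i integer = 1,
        @pick integer,
        @junk bigint;

    WHILE @i <= 10
    BEGIN
        /* Pick a random number between 1 and 5 */
        SELECT @pick = ABS(CHECKSUM(NEWID())) % 5 + 1;

        /* Count a different big table depending on the pick,
           dragging its pages through the buffer pool */
        IF @pick = 1 SELECT @junk = COUNT_BIG(*) FROM dbo.Users;
        IF @pick = 2 SELECT @junk = COUNT_BIG(*) FROM dbo.Posts;
        IF @pick = 3 SELECT @junk = COUNT_BIG(*) FROM dbo.Comments;
        IF @pick = 4 SELECT @junk = COUNT_BIG(*) FROM dbo.Votes;
        IF @pick = 5 SELECT @junk = COUNT_BIG(*) FROM dbo.Badges;

        SELECT @i += 1;
    END;
END;
```

With 10 threads and 10 iterations in SQLQueryStress, and 10 loop iterations inside the procedure, that’s around 1,000 full counts of large tables competing for 8 gigs of max server memory.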
All right. So if we look at, where is that hiding? Oh.
That’s in here somewhere. Maybe it’s not. Maybe I lied to you. Maybe I need to type in a demo. I’m sure I’ll offend someone. Usually do, right? Someone gets mad about something. Let’s look at sp_PressureDetector.
Okay. Let’s examine some stuff here. One of them, or rather, what I want to look at, is down here in the memory section. So I have 126 gigs of data, 14 physical gigs of memory, and 8 gigs assigned for max server memory.
All right. So there’s no way that me reading all those tables randomly from all this stuff is going to play well with this stuff. All right.
So obviously, we’re at a little bit of a disadvantage here in our data to memory ratio. So let’s, I already have this window open with sp_WhoIsActive in it. Oh, that’s two semicolons.
I typo’d again. Freak out. Right now, nothing’s going on. But if I hit go here, and we start looking at what’s running, all of these queries are going to get really just beaten up by page IO latch waits. All right.
We just keep running this and looking at stuff. We have some really, really bad waits going on here. All right. Almost 500 milliseconds. There’s another one.
Oh, 200 milliseconds. There’s a couple more that are almost 200 milliseconds. So this is what query workloads will look like on servers where, like, you just plain don’t have enough memory. It might not be this amazingly consistent.
It might not be, like, just everything waiting all the time. But if you use sp_WhoIsActive enough and you get familiar enough with your server, you’ll probably see this situation happen quite a bit. Right.
If you, you know, keep an eye on your really busy times of the day, this is going to happen, like, constantly. Right. In whatever list of queries is running, you’ll see a bunch of them waiting on page IO latch waits. So let’s go cancel this, because we don’t need to keep doing that.
So that’s what, in general, queries will look like when page IO latch waits are, A, dominant in the workload, and B, when you’re overwhelming storage hardware at some point. I mean, I guess it could be the disks. It’s usually not the disks.
The disks are usually fine. It’s usually the networking between your server and wherever you’re reading stuff from. So one thing to sort of think about with all that, again: as you double memory, you might cut those waits in half.
Now, if I were running this demo on my laptop where the storage is direct attached and really fast and, you know, just, like, all internal and, like, the memory and the storage and the CPU are, like, within, I don’t know, it’s a 16-inch laptop. So they’re going to be pretty close to each other. There’s not a lot of, like, distance to travel for things to get from one part to the other.
We would still see a lot of page IO latch waits, right, because I don’t have enough memory to keep what I need available for queries to use. But you probably just wouldn’t see the milliseconds get up as high, right? And I know that, you know, almost no one is using bare metal servers with direct attached storage.
And, you know, if you’re on a VM where your storage is remote to your server or you’re in the cloud where you’re, like, I know there are certain cloud instances that do offer, like, some forms of direct attached storage. But let’s face it, if you’re watching free videos on YouTube, like, to learn about SQL Server, your company is most likely not springing for, like, the really good servers with, like, the really fast direct attached storage. You’re probably, your storage is probably, like, a couple tin cans and some old yarn.
So there is that. Now, one way that, well, two ways that you can sort of make better use of the memory that you have is to give queries better indexes to access data, like, quickly and easily. And, you know, like, all these queries, if we look at the estimated plan for this thing, all of these queries are just going to be doing clustered index scans, right?
So we’re reading, like, effectively the table, like, everything, right? It’s, ugh, get in there. The clustered indexes are, you know, big.
They’re all the columns in the table. I was actually at a session sort of recently by a fellow named Tim Chapman who works at Microsoft. He made a really good point that calling clustered indexes clustered indexes can be somewhat misleading because it’s really clustered tables, right?
Because you have a heap table or you have a clustered table. Putting a clustered index on a heap makes it a clustered table. It’s not like you have this heap object and then a clustered index somewhere over here.
You just cluster the table, right? So your table is the clustered index. So I thought that was a really good point. I’m trying to make, like, a mental note to, you know, phrase things that way because I think it’s actually a really good point and something that a lot of people do get confused about when they’re working with SQL Server.
A lot of other database engines don’t necessarily have the concept of a clustered index, right? Like Postgres is heaps, and then what would be treated as nonclustered indexes, and the 7,000 other kinds of indexes you can create there. Oracle, I mean, who knows?
I don’t have enough money to test it. So I’ll just leave that alone. But if we wanted to make better use of the memory that we currently have in the server, again, we could create narrow, nonclustered indexes that would be useful for our queries to use. But we also have to make sure that the queries are written in a way to take advantage of those indexes.
So while this index creates, I forget how long this takes. You know, it’s a tiny little baby server. It might take a second.
I’m going to show you with this index. And this is a single key column index on the PostID column in the comments table. What I’m just going to show is two examples of queries where we have a good where clause like this, right, plainly written without any interference, and then a bad where clause that looks like this, where, you know, we’re not going to be able to take good advantage of the index that we just created on the table.
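Based on the description here, the index and the two query shapes look roughly like this. The exact names and the literal value in the predicate are my reconstruction; the page compression option reflects what gets mentioned a little later in the video:

```sql
/* Narrow single-key-column index on Comments.PostId,
   page compressed so it's smaller on disk and in memory */
CREATE INDEX c
ON dbo.Comments (PostId)
WITH (DATA_COMPRESSION = PAGE);

/* Good where clause: plainly written, the index can seek on it */
SELECT c = COUNT_BIG(*)
FROM dbo.Comments AS c
WHERE c.PostId = 999999;

/* Bad where clause: wrapping the column in COALESCE makes the
   predicate non-sargable, so the whole index gets scanned */
SELECT c = COUNT_BIG(*)
FROM dbo.Comments AS c
WHERE COALESCE(c.PostId, 0) = 999999;
```

Both queries return the same count; the difference shows up in the query plan and in how much of the index has to come into memory.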
So I’m going to run the query two different ways. And in those two different ways, what I’m going to show you afterwards is how much of the index ends up in memory, like after the query runs. All right.
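One way to measure how much of each index lands in the buffer pool after a query runs, a sketch against the standard DMVs rather than the exact script from the video, is something like:

```sql
/* How many megs of each index on Comments are currently
   sitting in the buffer pool (8 KB pages * count / 1024) */
SELECT
    index_name = i.name,
    buffer_pool_mb = COUNT_BIG(*) * 8 / 1024.
FROM sys.dm_os_buffer_descriptors AS dobd
JOIN sys.allocation_units AS au
    ON  au.allocation_unit_id = dobd.allocation_unit_id
    AND au.type IN (1, 3) /* in-row and row-overflow data */
JOIN sys.partitions AS p
    ON p.hobt_id = au.container_id
JOIN sys.indexes AS i
    ON  i.object_id = p.object_id
    AND i.index_id = p.index_id
WHERE dobd.database_id = DB_ID()
AND   p.object_id = OBJECT_ID(N'dbo.Comments')
GROUP BY i.name
ORDER BY buffer_pool_mb DESC;
```

Run it after each version of the query to compare what the seek pulled in against what the scan pulled in.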
So we’re going to come back up here. And we’re going to run this first with the good where clause. And let’s execute you.
All right. So we have our nice narrow little index, right? If you look up here, you can see, let me frame that a little bit.
The clustered primary key, the clustered table that is the comments table, is about 8 gigs. And our narrow little nonclustered index is about 241 megs. Now, I did use page compression with this index.
I tend to use page compression a lot with indexes when I create them, because it’s just another really good way to make better use of your server’s memory. It’s a smaller object on disk. It’s a smaller object in memory.
And there’s just some, you know, some CPU overhead as you decompress it on the way out of the buffer pool. So our query, when it ran, it used our nice narrow little nonclustered index here. And we’re able to seek very easily to the row or rows.
I mean, there’s one row. It counted one that we cared about. And because of that, we were able to read just a very small portion into memory. Now, granted, this is just one execution of one equality predicate.
In real life, you’ll have, like, a whole bunch of queries that are running and looking for different equality predicates. So you might, over the course of time, you’ll end up reading portions of that index. And you’ll probably see more parts of that index in the buffer pool.
But it’s still a much better use of the buffer pool than just straight up scanning a clustered index. So let’s rerun this with the bad where clause. Zero on that and one on that.
I should probably write this as a loop or something. But, you know, I’m kind of lazy. And I like typing in demos because it makes all the right people upset. So now let’s run this with the bad where clause using that coalesce function there.
And what happens is we still get the same result back of one row. But when we look at the query plan, it’s a scan of that nonclustered index. It’s no longer a seek into that nonclustered index.
And because we had to scan that whole thing. Now, not every scan will always read everything in there. If you use, like, top or offset fetch, you could bail out early. But in this case, we don’t have either of those.
We’re just getting a count. And because we had to scan the whole thing and pull it all into memory, it looks a lot different this time. Right? Now all 241 megs of the index end up in memory.
So, let’s say that page IO latch waits, both the _SH and the _EX flavors, those are the most common ones that I see. I don’t usually see the other page IO latch waits cropping up and being awful issues.
Let’s say that those are really just predominant waits on your server. Or every time you sample your workload, you see a lot of those page IO latch waits cropping up and slowing queries down. Right?
They just add, like, variable-speed queries. Like, maybe not parameter sniffing. Maybe it’s just, you know, crappy luck with what data pages are in memory. So, let’s say that you’ve made an assessment of the server that page IO latch waits are a problem. You can kind of, again, brute force reducing them by thinking every time you double memory, you’ll cut those waits in half.
Right? Like, for me, ideally, I don’t really like to see page IO latch waits be more than, like, 20% of your server’s uptime. Granted, bursty workloads can make it look like they’re barely a problem.
But, you know, like, really, like, you know, consistent, like, ongoing workloads. I want to say 24-7. But I hate saying 24-7 because it reminds me of being on call, which is, like, depressing.
But, like, let’s say, you know, you’ve made that assessment. You can generally think that every time you double memory, those waits will get cut in half. That’s usually a good option when you’re thinking, I want to make a big change really quickly.
Like, I want to, like, I’m going to do something to the server that’s going to make every query faster immediately. That’s one thing you can do. If you can’t, like, I don’t know, for whatever reason, either you can’t do that or you’ve already done that, then it’s time to start looking at, you know, queries and indexes.
You know, like, your first step probably shouldn’t be adding indexes. Your first step should be, like, assessing indexes. Look for unused indexes.
Look for duplicative indexes. sp_BlitzIndex, of course, is wonderful for that. And you’re going to want to start trimming down indexes first. After the indexes get trimmed down a whole bunch, then adding in new indexes that support queries better is a good second step.
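If you use sp_BlitzIndex for that assessment, the call looks something like this. The @Mode parameter is from the First Responder Kit, where mode 2 produces a detailed index inventory; the database name here is just an example:

```sql
/* Inventory all the indexes in one database so you can find
   unused and duplicative ones before adding anything new */
EXECUTE dbo.sp_BlitzIndex
    @DatabaseName = N'StackOverflow',
    @Mode = 2;
```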
Also, make sure that your queries are written in a way that they can actually take advantage of all the indexes that you’ve created. So that you don’t create these indexes and then realize that your queries all have ISNULL, COALESCE, LEFT, RIGHT, SUBSTRING, or other non-sargable predicates in them that make it so your queries can’t seek to the data. They have to scan data to filter everything out.
So, you generally don’t want to tinker with disk speed to fix this. You want to make sure that you’re not reliant on disk anymore, right? Like, getting faster disks might speed up the page IO latch waits, but your goal should be to just get rid of them by not being reliant on disk.
Faster disks can be useful for some things, right? So, like, if you have really high WRITELOG waits and the average milliseconds per wait is pretty high, faster disks can be useful for that. But having more memory is still valuable, especially considering most servers are using remote storage, because you will be taking traffic off the network pipe between your server and the storage.
So, you will clear that space so the write log stuff can happen faster. IO completion waits, backup-related waits, those can also be helped by faster disks and whatnot. But really, your goal should be to increase memory to the point where, A, your queries are more consistently fast because you don’t have to sometimes go to disk and sometimes not.
And, B, you want to clear that network path so that when queries do need to go to disk, either to read data or to write data, they have a pretty clear pipe to do it. So, those are my thoughts on that. And that is a little bit about page IO latch waits.
Whew! All right. So, as always, I hope you learned something. Hope you enjoyed yourselves.
Thank you for watching. If you like this video, there’s a thumbs up button that I would appreciate you punching the crap out of. Like, really wind up. Go at it.
Like, punch it like it kicked your dog or something. If you like this kind of SQL Server performance tuning content, there is also a subscribe button, which I would like you to attack with equal aggression, equal ferocity. That’d be cool.
And let’s see. What else? I don’t know. Yeah, like, subscribe. Got that covered. I think that’s about it. All right. It’s dinner time. I’m going to get out of here.
I’m going to go change into my formal Adidas t-shirt and do the dinner thing. So, thank you for watching, and I will see you in the next video. It might be tomorrow if you’re all good, well-behaved out there and you really like and subscribe with adequate force.
Thank you.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.