A Little About Out Of Date Statistics In SQL Server
Thanks for watching!
Video Summary
In this video, I dive into an often-overlooked issue in SQL Server: outdated statistics and their impact on cardinality estimation. Specifically, I explore how the default and legacy cardinality estimators can lead to significant performance issues when statistics are not updated frequently enough or rely solely on auto-update stats. By using a modified version of the Stack Overflow database, I demonstrate how adding new data without updating statistics can result in severely inaccurate execution plans, especially with inequality predicates. The video highlights the differences between the legacy and default cardinality estimators, showing how the latter provides slightly better estimates but still struggles when faced with ascending key columns that aren’t marked as such by SQL Server. Through practical examples, I explain the math behind density vector estimation and provide insights into why frequent statistics updates are crucial for maintaining optimal query performance.
Full Transcript
Erik Darling here with Darling Data, voted by BeerGut Magazine to be the producer of the freshest, hottest SQL Server content on the internet. So if you’re into that sort of thing, you’re in the right place. Welcome aboard! In this video, we’re going to talk about a somewhat overlooked problem that you can run into with outdated statistics. It can really apply to either cardinality estimation model: what Microsoft quite smugly refers to as the default cardinality estimator, which is the one you use if you’re at compatibility level 120 or higher (that’s SQL Server 2014 and up), or the legacy cardinality estimator (again, very smug nomenclature), which is the one you use if you’re at compatibility level 110 or lower, or if you have the database scoped configuration option for the legacy cardinality estimator enabled. This thing is looking a little funny on me. It’s a little too happy down there.
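If you want to check which estimator a database is likely to get, or flip the switch I just mentioned, it looks something like this (the database name is a placeholder for whatever you’re testing in):

```sql
-- Compat level 120 and up (SQL Server 2014+) defaults to the new CE
SELECT d.name, d.compatibility_level
FROM sys.databases AS d
WHERE d.name = N'StackOverflow';

-- Forces the legacy CE for the database without changing compat level
ALTER DATABASE SCOPED CONFIGURATION
SET LEGACY_CARDINALITY_ESTIMATION = ON;
```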
One thing that a lot of people overlook: when we think about outdated statistics, what most people focus on is that you have this 200-step histogram, and perhaps the values in it don’t reflect the actual counts for data that already exists in the table. That’s a totally reasonable thing to be worried about, and it can happen for a few reasons. Maybe you haven’t updated stats in a while, a bunch of data got added, and the numbers in the histogram just don’t line up anymore: the old histogram said you had 500 rows for a value, but now you’ve got 500,000. Granted, that’s a little far-fetched. If you have a table big enough to see that sort of difference, auto-update stats will kick in eventually, and I know how wild you crazy kids are about rebuilding indexes, which also updates stats.
But let’s just say for a minute that you had some wildly outdated stats because you have a huge table, you haven’t hit the auto-update stats threshold yet, and the numbers are wildly inaccurate. That can also happen if you don’t use a high enough sampling rate when you update stats. If you just use the default sampling, you can miss a whole lot of really interesting data points.
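For what it’s worth, the sampling knob lives right on the stats update. A couple of hedged examples, with the Stack Overflow Votes table standing in:

```sql
-- Default sampling can miss a lot on a big table; either of these reads more
UPDATE STATISTICS dbo.Votes WITH FULLSCAN;          -- read every row
UPDATE STATISTICS dbo.Votes WITH SAMPLE 50 PERCENT; -- or pick an explicit rate
```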
Apparently I smoke too many cigarettes. Or rather, since I don’t do that, we’re a family-friendly, addiction-free channel here. We are only addicted to making SQL Server go faster. We don’t smoke cigarettes.
Though we do miss smoking cigarettes. But really, one of the big things I run into is new data that isn’t reflected in the statistics histogram at all. A lot of people call that the ascending key problem, which is a valid name for it.
There’s all sorts of writing about it, and it’s something that can cause pretty severe performance issues, because depending on which cardinality estimator you’re using, you can get some wildly different guesses depending on how many rows are in the table, how many values match a certain predicate, and all this other stuff. You can end up with some really bad execution plans when you start joining those tables off to other tables, because the predicate estimates for rows coming out of the table don’t reflect reality.
So the join you get is, again, not in line with reality, and you end up with something like a nested loops join when you really shouldn’t have. We’re not going to delve that far into it in this video.
I’m just going to show you what it looks like when that sort of thing happens. In my copy of the Stack Overflow database, I’ve created a table called Votes_Beater, and we’re going to beat this table up a little bit.
I didn’t want to do it to my actual copy of the Votes table, because I don’t like messing with those, and life is hard enough without ever having to restore a database. If I do something like that, I keep a version of the Stack Overflow database that I do nothing in, called Stack Overflow Clean.
So if I ever really mess something up in a table, I can just copy it from that other database. I’m a cheater like that, because I don’t like restoring databases.
It’s not my thing. So I’ve created an alternate table to use, inserted all the data from the actual Votes table into it, and, because I did all my homework, created an index on the PostId column.
That’s the column we’re going to focus on here.
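The exact DDL isn’t shown in the transcript, but the setup amounts to something like this sketch (the index name here is my own invention):

```sql
-- Copy the Votes table so the real one stays pristine
SELECT v.*
INTO dbo.Votes_Beater
FROM dbo.Votes AS v;

-- The column all of the demo predicates will target
CREATE INDEX ix_votes_beater_postid
ON dbo.Votes_Beater (PostId);
```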
Now, at this point (I already know, because I did all this ahead of time), this query returns some statistical information about what’s in there. Right now I have 229,561 modifications against this table, and for auto-update stats to kick in, that counter has to hit the new auto-update stats threshold, which works out to roughly the square root of 1,000 times the table’s row count. The old threshold was 20% of the table plus 500 rows, which, let’s just face it, is 20% of the table.
It kills me when people insist on the “plus 500 rows” part, because it’ll still be close enough to 20%. If it’s a table big enough for you to care about, 20% is 20%. Adding 500 isn’t meaningful.
And the current range high key in the statistics histogram is this, but there’s a whole lot of data within those modifications that has a PostId higher than that. That’s the important thing here.
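The query on screen isn’t reproduced in the transcript, but sys.dm_db_stats_properties gets you the modification counter, and the histogram’s high-water mark comes from DBCC SHOW_STATISTICS. Something like:

```sql
-- Modification counter, row counts, and last update per stats object
SELECT s.name, sp.last_updated, sp.rows,
       sp.rows_sampled, sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID(N'dbo.Votes_Beater');

-- The final step's RANGE_HI_KEY is the histogram's high-water mark
DBCC SHOW_STATISTICS ('dbo.Votes_Beater', 'ix_votes_beater_postid')
WITH HISTOGRAM;
```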
So, what I did in order to facilitate this nightmare: I created a temp table, did some work to validate the PostIds that were going to be deleted, and then deleted the top chunk of them.
The stats modification counter I just showed you is sitting about 500 rows shy of the threshold. Plus 500, right? 500 rows, who cares?
Well, now we’re going to talk about how much of a difference 500 rows makes. When the table was at full size, before I deleted rows out of it, the auto-update stats threshold was 230,062.
So I took that number minus 501, deleted that many rows, and on the delete command I used OUTPUT to send those rows into the votes temp table.
Then, one really important thing I did before anything else: I updated statistics again. That stats update is there to keep me from going over the modification threshold, because remember, deletes count as modifications.
So I wanted to update stats to zero that counter out. Then I inserted all the rows from the temp table back into Votes_Beater, and did some work to find the highest PostId with more than one hit to it.
Or rather, the PostId above the old range high key of the statistics object that had the most rows attached to it, right?
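My exact script isn’t shown in the video, but the whole shuffle works out to something like this sketch. The row count comes from the narration, and the column list assumes the standard Stack Overflow Votes schema:

```sql
-- Somewhere to stash the deleted rows so they can go back in later
CREATE TABLE #VotesTemp
(
    Id int NOT NULL,
    PostId int NOT NULL,
    UserId int NULL,
    BountyAmount int NULL,
    VoteTypeId int NOT NULL,
    CreationDate datetime NOT NULL
);

-- Take the highest PostId rows out: threshold minus 501 of them
WITH highest AS
(
    SELECT TOP (229561) vb.*
    FROM dbo.Votes_Beater AS vb
    ORDER BY vb.PostId DESC
)
DELETE FROM highest
OUTPUT deleted.Id, deleted.PostId, deleted.UserId,
       deleted.BountyAmount, deleted.VoteTypeId, deleted.CreationDate
INTO #VotesTemp (Id, PostId, UserId, BountyAmount, VoteTypeId, CreationDate);

-- Deletes count as modifications, so zero the counter back out
UPDATE STATISTICS dbo.Votes_Beater;

-- Put the rows back: they now live above the histogram's range high key,
-- and the modification counter sits just under the auto-update threshold
INSERT dbo.Votes_Beater
    (Id, PostId, UserId, BountyAmount, VoteTypeId, CreationDate)
SELECT vt.Id, vt.PostId, vt.UserId,
       vt.BountyAmount, vt.VoteTypeId, vt.CreationDate
FROM #VotesTemp AS vt;
```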
So I want to show you how crappy the guess gets. With the legacy cardinality estimator, the problem you run into is that whether you use an inequality predicate or an equality predicate, SQL Server only ever guesses that one row is going to come out of there.
Let me just scooch these dummies together a little bit. So for the inequality predicate, we get 229,877 rows back, but SQL Server thinks only one row is coming out of there.
For the index seek, where we find about 300 rows, SQL Server again guesses that one row is going to come out of there. Roughly what those two test queries look like is sketched below.
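In this sketch, the literal is a placeholder for whatever RANGE_HI_KEY your histogram tops out at, and the use hint is one way to get the legacy CE for a single query:

```sql
-- Inequality predicate: ~229,877 rows come back, but the estimate is 1
SELECT vb.*
FROM dbo.Votes_Beater AS vb
WHERE vb.PostId > 999999999 /* placeholder: your histogram's RANGE_HI_KEY */
OPTION (USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'));

-- Equality predicate: ~300 rows come back, and the estimate is still 1
SELECT vb.*
FROM dbo.Votes_Beater AS vb
WHERE vb.PostId = 999999999 /* placeholder: a PostId above the high key */
OPTION (USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'));
```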
That one-row guess is a legacy CE behavior that Microsoft has done some work to improve upon, tweaking and mangling all sorts of things along the way. It’s still better than it was, but the constant tweaking, like, who can keep up? Right?
Every CU there’s some thing that’s slightly different by like a percent, and you’re left going: nothing works anymore, everything’s broken. With the new cardinality estimator, or what Microsoft so pompously calls the default CE, things look different.
What I’m going to show you here has the ENABLE_HIST_AMENDMENT_FOR_ASC_KEYS use hint on it. The reason it’s there is that SQL Server has not marked the PostId column as being ascending.
I tried a few different columns in the table, including the CreationDate column, and SQL Server didn’t think any of them were ascending. So we’re going to stick with PostId and throw this hint on there to show you the difference.
If I throw this hint on the query and search for PostIds greater than what’s in the current histogram, we get an estimate that just about matches the modification counter on the table: 229,561. If we’d hit 229,562 modifications, auto-update stats would have kicked in. So when we tell SQL Server that this is an ascending key and SQL Server runs this query, the cardinality estimation process looks at how many modifications there are and says, okay, well, maybe there’s that many rows.
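Spelled out, with the same placeholder literal as before:

```sql
-- With the histogram amendment hint, the inequality estimate lands
-- right around the modification counter: 229,561
SELECT vb.*
FROM dbo.Votes_Beater AS vb
WHERE vb.PostId > 999999999 /* placeholder: your histogram's RANGE_HI_KEY */
OPTION (USE HINT ('ENABLE_HIST_AMENDMENT_FOR_ASC_KEYS'));
```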
That’s for an inequality predicate. If we don’t add the hint, SQL Server guesses that about 30% of the modifications will qualify.
So the number changes from being the modification counter to about 30% of it: 68,868. I’ll show you the math on that when we get through these demos.
So you can be happy that you watched this, and when I say, “I hope you learned something,” you can say, “Oh yeah, I did learn something. I learned some math.”
And then you’re going to say, “Oh, I learned math from a high school dropout. Maybe I shouldn’t learn math from a high school dropout.” We’ll see. I don’t know.
So anyway, if we look at the equality predicate with the ascending key hint in there, we get an estimate of 55 rows out of the actual 294.
Remember, with the legacy CE, both of those were guesses of one row. So with the new cardinality estimator, you get somewhat different guesses, and if SQL Server had already marked the column’s index or statistics as being an ascending key, you wouldn’t have to use these hints.
But in my case, I’m using them to sort of fudge things a little bit. So let’s look at the stats properties again, which will help explain those numbers.
So for the ascending key inequality predicate, we got basically the modification counter as the cardinality estimate. For the default CE without the hint on there, the 68,868 estimate comes from taking the modification counter times 0.3, so about 30% of the modifications.
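Here’s that math, worked out:

```sql
-- Default CE, no hint: ~30% of the 229,561 modifications
SELECT CONVERT(bigint, 229561 * 0.3) AS inequality_estimate; -- 68868
```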
For the default CE equality estimate, we actually get a density vector estimate.
Now, if you’re unfamiliar with density vector estimates, they come from the statistics object, which we’re going to go look at in its properties.
If you’ve ever watched my videos or read my posts (or anyone’s posts) about local variables, you’ll know that when you have an equality predicate with a local variable, absent a recompile hint, cardinality estimation changes from a parameter sniffing thing or a literal value estimation thing to using some magic numbers.
Those magic numbers come from the statistics object. If we go into the details here (let’s make this full screen to make it easy) and look at the density of the PostId column, which is this, and the number of rows in the table, which is this, and multiply one by the other, we get that density vector guess.
“All density” is just sort of the assumed uniqueness of the column: SQL Server does some math in there to figure out how unique it thinks the PostId column is, which boils down to one divided by the number of distinct values.
SQL Server comes up with that density number and multiplies it by the row count, and that’s where the guess of 55 comes from. That pretty much explains old versus new when it comes to outdated statistics, especially when the high key of those statistics doesn’t match the current high value in the table.
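To reproduce that math yourself (the stat name is assumed from the setup sketch earlier):

```sql
-- STAT_HEADER carries the table's row count;
-- DENSITY_VECTOR carries the "All density" value
DBCC SHOW_STATISTICS ('dbo.Votes_Beater', 'ix_votes_beater_postid')
WITH STAT_HEADER, DENSITY_VECTOR;

-- Equality estimate = All density * Rows; his numbers multiply out to ~55
```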
It’s maybe a little easier to think about with a date or datetime column, where, much like an identity column, every row that gets added is new: a higher value than anything there before.
It’s sort of an append-only insert scenario. Think of an order date column rather than a ship date column (ship dates come in NULL and get updated later).
Order dates are always going to be new; every order is newer than the order before it. So that’s an ascending key, same as an identity column in that regard. And really, one of the biggest performance landmines I see people run into is not updating stats frequently enough, or just depending on auto stats to do their thing, and ending up with histograms that don’t represent the newest data.
It’s not really about the data that’s already in there. Most of the time, the data that’s already in there is probably close enough to reality.
There are obvious exceptions to that, of course; nothing is perfect. But it’s usually the data that lives beyond the current range high key that messes things all up. Now, of course, if we ran a stats update on the Votes_Beater table, we’d have representation for those higher values and we wouldn’t have this problem anymore.
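For completeness, that fix is a one-liner (FULLSCAN is my choice here; a sufficient sample rate also works):

```sql
-- Pick up the new high PostId values in the histogram
UPDATE STATISTICS dbo.Votes_Beater WITH FULLSCAN;
```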
So it’s something you have to be really careful about, especially if you’re using the legacy cardinality estimator, which I often think is the much better cardinality estimator. Getting that one-row guess can be an absolute disaster.
If you get that one-row guess and SQL Server comes up with a crappy nested loops plan, where you have lots and lots of rows but SQL Server is convinced there’s only one, you can run into some real issues.
The newer cardinality estimator does do better with this. It at least gives you a guess higher than one. It might not always be perfectly accurate, but the fact that it’s a bit closer to reality does buy you out of some potential performance landmines or pitfalls. Potential performance pitfalls.
Pleasant, pleasant, pleasant alliteration. Anyway, you learned some math from a high school dropout. I hope you enjoyed yourselves.
I hope you learned something other than math from a high school dropout, too. If you like this video, you can, you know, create a bunch of YouTube accounts to give it extra thumbs up.
I’m not responsible for that voting ring, though. I would never encourage that behavior. I don’t even thumbs-up my own videos, which probably tells you something about what an ethical, honest person I am.
If you like this sort of SQL Server content, please subscribe. Subscribers are awesome. I finally cracked 4,000, which in the grand scheme of things is probably not a lot.
But if one of you lovely 4,000 or so people sees these videos and thinks, hey, we could use the kind of help that Erik Darling offers for SQL Server, you can hire me, too. I am a consultant.
I consult: I find these problems, analyze these problems, and fix these problems.
So it’s the FAF method. The patented Darling Data find it, analyze it, fix it. The FAF of SQL Server performance tuning. Don’t steal that from me.
I’ll come find you with my Delta miles. And I think that’s probably about good for today. I should probably eat something; perhaps my blood sugar is cratering a little bit.
I think I’ll go eat a beer. Or something. Maybe some chicken. Chicken’s good, right?
Chicken’s healthy for you. But I do apologize if any of the data darlings out there are vegans. You can pretend I said something else. Tofurky. Some sort of bean.
Anyway, I’m gonna go do that. And apparently I have some drivers to install, because Intel keeps yelling at me. So I’m gonna do all sorts of fun things with beer and chicken and driver updates.
And I’m sure you’re gonna be very jealous. I know that if I were not me, I would be very jealous. So, as always, thank you for watching.
And I will see you next time in another video. Maybe, hopefully, there will be no math, because that’s a promise I’d like to keep. All right.
Thank you.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.