Lost In Costs: Part 2
Video Summary
In this video, I dive into the fascinating world of query optimization in SQL Server, focusing on how rules and heuristics drive the process. We explore the often confusing structure of estimated execution plans versus actual ones, highlighting why certain costs remain estimates even after running a query. By walking through examples like reputation lookups and index scans, I illustrate how SQL Server's cost-based optimizer can sometimes make decisions that don't align with modern storage capabilities or our expectations. The video also touches on the nuances of random I/O costing, covering topics such as key lookups, bitmap filters, and the impact of SELECT * queries, all while emphasizing the importance of understanding these intricacies for effective query tuning.
Full Transcript
Erik Darling here with Darling Data. And because you are such data darlings, I'm going to skip over the usual intro. We are going to start with part two of the Lost in Cost series. If you just stumbled upon this, you should probably watch part one first if you haven't seen it already. And depending on when you find this video (remember that website, StumbleUpon? What a nightmare), there might even be a part three available, which would be the final part. So lucky you, you'd get to see me three times. Doesn't get much better than that. Anyway, where were we? Right: query optimization. Query optimization is driven by rules and heuristics. Early on, queries are mapped into these funny little tree structures, something like Bob Ross type tree structures. And those structures are pattern matched against various rules that the optimizer has, to figure out different things it might do with your query plan in order to make it very cheap, very cost effective. Those rules produce different logically equivalent plans. Things like the placement of aggregates, the placement of predicates, and the placement of joins can all be affected by these rules. And if you ever want to see them, there's this great DMV called sys.dm_exec_query_transformation_stats.
And if you look at it, you can see the names and birthdays and social security numbers of all these different rules that the optimizer has at its disposal to come up with query plans for you. Right? And if you scroll down through them, there are some pretty funny ones down at the bottom here. One of my favorites is this "TOp on empty" one. That's a good one. I'm not sure why the O is capitalized in that TOp on purpose. "TOp on purge", whatever. Anyway, Paul White has a great video about all these things; you should watch that sometime. Anyway, when you get an actual execution plan, there are lots of operators that will have the required estimates for cost based optimization, and they will have information added to them that reflects actuals for a variety of metrics. Costs are not one of them. Right? So you will see actuals for rows, and potentially batches and executions, and stuff like rebinds and rewinds. But let's use this very simple query as an example. If we get the estimated plan, and I'm only hitting Ctrl+L for this one, right? You can tell that it's Ctrl+L because we don't have the operator times in here.
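If you want to poke at that DMV yourself, a minimal query is enough; the exact set of columns beyond the rule name varies a bit across SQL Server versions, so this sketch just selects everything:

```sql
/* Browse the optimizer's transformation rules by name.
   Requires VIEW SERVER STATE permission. */
SELECT *
FROM sys.dm_exec_query_transformation_stats
ORDER BY name;
```

Scroll through the `name` column and you'll find the funny ones mentioned above.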
Right? But if we look at this and we zoom in, we will see some estimates, right? Estimated I/O, CPU, and subtree cost. They will be 65, 7.2, and 72. This is another one that gets me: why wouldn't you put the main cost at the top? Now I have to add 65 to 7.2 to get the subtree cost. It's nuts, right? Who designs these things? But the important thing here is to sort of remember these numbers, right? 65, 7.2, 72. If we run that query and get the actual plan, right? I assure you this is the actual plan, because we got a result back, right? It counted zero rows. And we look at this. Now we have operator times and stuff. So this actually did a thing. This tooltip got magically bigger, and it got magically bigger because now we have these actual things in it, right? We have these things that are actualities in the plan, including this one, which doesn't actually say actual, but we must infer that they meant to put actual in there.
And maybe someone was just a little drunk that afternoon. We don't know. But there are also some estimates down here. Some of the estimates down here correlate to actuals up here, which is also a bit confusing, but you know, we've all had bad days. The important thing that I want to show you is that these costs remain estimates, and they all remain the same, right? We have 65, 7.2, and 72, right? So that's what we get: estimated costs, no actual costs. Some costs and estimates might look wrong, right? A great example of when this happens is when things change after initial cardinality estimation.
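One way to see this for yourself is to capture an actual plan in XML form. This is a hypothetical sketch against the Stack Overflow sample database (the table and predicate are stand-ins for the demo query); the point is what the plan XML does and does not contain:

```sql
/* Capture the actual execution plan as showplan XML. */
SET STATISTICS XML ON;

SELECT COUNT_BIG(*) AS records
FROM dbo.Users AS u
WHERE u.Reputation = 0;

SET STATISTICS XML OFF;

/* In the returned XML, operators carry EstimateIO, EstimateCPU, and
   EstimatedTotalSubtreeCost attributes, plus RunTimeInformation
   elements with ActualRows, ActualExecutions, and so on. There is
   no "ActualCost" anywhere -- costs are estimates, full stop. */
```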
Like sometimes SQL Server might add something to a query plan, perhaps a bitmap, right? Which gets created up here, but then gets used down here, and can affect cardinality, right? So SQL Server expected to read 2.4 million rows, but only actually read 40,000 rows. And that's because this bitmap got created up here, passed through the hash join, passed through the repartition streams, and then finally got applied as a predicate when SQL Server was reading data down here.
So SQL Server could actually read fewer pages and whatnot with the bitmap in there. The optimizer is heavily biased against doing random I/O. Why?
Well, if you remember from the first video, if you did not get too drunk between videos, then you will remember I showed you a picture of a computer. And the hardware within that computer was not nearly as powerful as the hardware that you might have today. Maybe not if you’re on Azure, but in general.
And the storage in particular on those things was pretty bad. When you did random I/O on old spinning hardware, things that looked a little bit more like record players than the storage you might know and love from today, there was a big physical penalty for going around and doing random I/O.
This is why, back when people were on storage hardware where things physically had to move about, things like index fragmentation and page splits were a much bigger deal. Right?
On modern storage hardware, far less of a big deal. Unless you’re in the cloud. Who knows? Right? So I’ve got this index on the reputation column. I’ve already got that there. And what I want to show you is sort of, I mean, some people might call it like a tipping point demo.
This isn't really a tipping point demo. I'm just going to show you costing stuff. So if we look at this query for Reputation equals nine, SQL Server uses a key lookup. Right? It seeks into our nonclustered index. It takes one row at a time from there. It goes into a nested loops join. And because it's a loop, we're doing random I/O at this point. And then it goes into the key lookup down here to fetch all the columns that I'm selecting from the Users table that are not contained in that single-key-column index on the Reputation column.
So, you know, that’s what we got there. Nice key lookup plan. Probably not a key lookup plan that I would want to fix.
It takes about 27 milliseconds. Right? So I don’t think adding a covering index and trying to reduce the impact of this query by 99.8% is well worth my time. But let’s continue looking at the costs.
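For reference, the demo setup looks something like this. This is a sketch against the Stack Overflow database; the index name is made up, and the exact query shape is assumed from the narration:

```sql
/* Single-key-column index on Reputation, as described above. */
CREATE INDEX users_reputation
    ON dbo.Users (Reputation);

/* Few users have a Reputation of exactly 9, so the optimizer
   chooses an index seek feeding a nested loops join into a
   key lookup on the clustered index. */
SELECT u.*
FROM dbo.Users AS u
WHERE u.Reputation = 9;
```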
Okay? That’s what we’re here to do. So if we run this query and we say, hey, SQL Server, I want you to go find reputation 51 now. SQL Server changes its mind about a few things.
SQL Server decides, well, I’m just going to scan the clustered index for this one. Doing a key lookup would be too much work for me. Right? And this takes about 200 milliseconds. Hmm.
We went from 27 milliseconds to 200 milliseconds. SQL Server, I think you might be incorrect about some of this random I/O stuff. If we were to force SQL Server to use that narrow index, right? We would apply an index hint here. SQL Server would say, well, guess what? I can use that index and I can finish this query in 33 milliseconds. So costing around random I/O is often wrong. And often, at least on newer hardware (again, local factors apply), random I/O is not the terrible enemy that it used to be. Now, I know what you're saying.
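The hinted version of the query might look like this; this is a sketch assuming a nonclustered index named `users_reputation` exists on `dbo.Users (Reputation)` (the name is hypothetical):

```sql
/* Left to its own devices, the optimizer scans the clustered index
   for the less-selective value. The index hint forces the
   seek + key lookup plan, which turns out to be faster here. */
SELECT u.*
FROM dbo.Users AS u WITH (INDEX (users_reputation))
WHERE u.Reputation = 51;
```

Worth saying: index hints are a blunt instrument, and the point of the demo is the costing bias, not that you should go around hinting everything.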
Erik, Erik Darling. Please. Why are you using SELECT *? Why are you selecting star? Surely, selecting star must be a terrible thing for you to be doing in front of all these people. And you know what?
You're right. The thing is, excuse me, the thing is that SQL Server costs that random I/O the same, whether you are saying SELECT * or just selecting one additional integer column. If we look at these two queries and we force SQL Server to use our index on Reputation, when we look at what gets returned, we can see up here, clearly, this is SELECT *.
We have all of the columns from the Users table. And clearly down here, we have just two narrow integer columns. These two query plans, though they differ in many ways, will have identical costs, right?
It's about 37.6767 query bucks for both of them, right? 37.6767 apiece. So random I/O cost is not at all affected by the amount of work that each of those random I/Os might have to do getting more data from the clustered index.
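Side by side, the pair of queries looks something like this. This is a sketch: the index name `users_reputation` and the two integer columns (`Id` and `Age`) are assumptions, since the transcript doesn't name them:

```sql
/* All the columns: the key lookup must fetch everything
   not in the narrow Reputation index. */
SELECT u.*
FROM dbo.Users AS u WITH (INDEX (users_reputation))
WHERE u.Reputation = 51;

/* Two narrow integer columns: the key lookup fetches far
   less data per row. Estimated cost is identical anyway. */
SELECT u.Id, u.Age
FROM dbo.Users AS u WITH (INDEX (users_reputation))
WHERE u.Reputation = 51;
```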
SQL Server costs them the same whether it's one column or all the columns, right? It's kind of a nutty thing. Now, maybe all your queries are SELECT * for a reason. I don't know. Maybe you have a great reason for doing that. Even so, it would not be a good idea for you, as a responsible, seasoned, qualified database professional, to go around having every nonclustered index include every column in your table, right?
Think about what a clustered index is, right? A clustered index is every column in the table, logically ordered by the clustered index key column or columns. If you think about what a nonclustered index is, well, it is every column in the key and the includes, if you have includes (you don't have to, includes are optional), ordered by whatever nonclustered index key columns you assign to that index. If you're the type of crazy person who would create a nonclustered index with some number of key columns and then include every other column in the table, it would essentially be a second clustered index in all but name, right? Because it would be every column in the table, logically ordered by the nonclustered index key columns. It's not a good practice, right?
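To make the anti-pattern concrete, here is what such an index would look like on the Stack Overflow Users table. The index name is made up, and the INCLUDE list approximates the columns in the public Stack Overflow export (your copy may differ slightly):

```sql
/* Don't actually do this: a nonclustered index whose INCLUDE list
   covers every remaining column is a second copy of the table,
   ordered by Reputation instead of the clustering key. Every
   write now maintains the table twice. */
CREATE INDEX users_in_name_only
    ON dbo.Users (Reputation)
    INCLUDE (AboutMe, Age, CreationDate, DisplayName, DownVotes,
             EmailHash, LastAccessDate, Location, UpVotes,
             Views, WebsiteUrl, AccountId);
```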
It's not a good thing for you to be doing. I've explained that in another video, so if you care, go find it. Anyway, let's look at another example of SQL Server hating random I/O. Okay, it'd be a good time. Oh yeah, we're creating indexes. Now, there are times when higher cost plans can be faster, but of course won't be chosen, because of the perceived expense of doing random I/O by the optimizer.
That's obviously not a good situation for us to be in, because we would expect queries to be as fast as possible. But they're not always, because SQL Server often makes wrong costing decisions thanks to the random I/O bias. So if we run this query and get the actual execution plan, it returns one row, right?
Fine. And it takes 904 milliseconds in total. Okay, well, is that good?
Is that bad? Was that ugly? I don’t know. How do you feel about it? 900 milliseconds. Probably not the end of the world. Probably not the worst query you’ve ever seen. But if we tell SQL Server, right? Actually, there’s an important part here.
SQL Server scans the nonclustered index that we created on the Posts table, but scans the clustered index on the Users table, which is not what we would expect, because we just created an index on the Users table that helps our query quite a bit.
If we say, SQL Server, well, could you pretty please use my index, right? So we're going to tell SQL Server, I would prefer that you use this index. All right, remember: 900 milliseconds.
Well, that sure felt a bit snappier to me. And it was, right? We got that in about 170 milliseconds. Now, of course, SQL Server chose a parallel plan here, which, of course, helps the final speed of it a bit.
But the reason why it chose a parallel plan was because of how high the cost of doing the random I/O associated with the key lookup was, right?
The initial query was not… Oh, I meant to scroll up there. Actually, let’s run these together. Let’s make sure we do the right thing. If we look at these two queries, the first one has a cost of 30 query bucks.
All right, okay. Cheaperoo. The second one has a cost of 76 query bucks. So this one obviously wasn’t chosen because it has a much higher cost. But the much higher cost led to a parallel plan, led to a faster query.
I’m not saying parallelism is precisely why this one is faster, but it’s just interesting to note the plan changes because of the higher cost here. So there we have sort of an interesting thing, right?
We have SQL Server choosing a slower plan that does not do random I/O, because the plan that would have done random I/O would have cost more and would have been parallel.
So it's like, ah, it's crazy, right? The optimizer didn't think that parallelism would reduce the effort of this plan enough to make it worthwhile. Kind of a funny thing. Anyway, sometimes when we want to figure out stuff about parallelism, it helps to start with understanding how SQL Server chooses a parallel plan.
Sometimes it helps to start way back at the beginning. And for me, that beginning is, of course, looking at the cost threshold for parallelism setting, which I have set to 50 here locally.
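Checking or changing that setting is a quick `sp_configure` call. This is a standard pattern, with the video's value of 50 shown; it's an advanced option, so those have to be surfaced first:

```sql
/* Cost threshold for parallelism is an advanced option,
   so surface the advanced options first. */
EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;

/* Check the current value (config_value / run_value columns). */
EXEC sys.sp_configure N'cost threshold for parallelism';

/* Set it to 50, the value used in this demo. */
EXEC sys.sp_configure N'cost threshold for parallelism', 50;
RECONFIGURE;
```

The default of 5 is widely considered far too low for modern hardware, which is part of why this setting matters.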
And what I want to show you, actually, is just a saved plan. Again, I've been burned by this demo too many times live and in person. So you and I, we're going to look at this thing. Saved. All right. So up here, we have a query that's selecting the top 1892 rows from the Users table joined to the Posts table.
There is no hint applied to this one, right? The query plan itself is sort of immaterial. But what I want to show you is that this thing has a cost of 49.9968 query bucks.
All right. So that’s about as close as you can get to 50 without hitting 50. All right.
And maybe having SQL Server think about a parallel plan. If we look at this query and say we want the top 1893, right? That's one more than the 1892 plan up there. And we look at this.
It has a cost of exactly… well, 50.0038 query bucks. But this query has a MAXDOP 1 hint on it. So we said, SQL Server, you're not allowed to make this query parallel.
Okay. So keep that in mind. The one at the very bottom is the same SELECT TOP 1893, but you'll notice that there's no MAXDOP 1 hint over there. This query has a cost of 44.5187 query bucks.
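The three demo queries have roughly this shape. This is a sketch only: the join columns, select list, and ordering are assumptions, and the costs in the comments come from the video's saved plans, so your numbers will differ on a different copy of the data:

```sql
/* TOP 1892: costs just under the threshold of 50, stays serial. */
SELECT TOP (1892) u.Id, u.DisplayName, p.Score
FROM dbo.Users AS u
JOIN dbo.Posts AS p
    ON p.OwnerUserId = u.Id
ORDER BY p.Score DESC;            /* ~49.9968 query bucks */

/* TOP 1893 with parallelism forbidden: the serial cost
   creeps just over the threshold. */
SELECT TOP (1893) u.Id, u.DisplayName, p.Score
FROM dbo.Users AS u
JOIN dbo.Posts AS p
    ON p.OwnerUserId = u.Id
ORDER BY p.Score DESC
OPTION (MAXDOP 1);                /* ~50.0038 query bucks */

/* Same TOP 1893, no hint: the parallel plan comes back
   cheaper than the serial one, and below the threshold. */
SELECT TOP (1893) u.Id, u.DisplayName, p.Score
FROM dbo.Users AS u
JOIN dbo.Posts AS p
    ON p.OwnerUserId = u.Id
ORDER BY p.Score DESC;            /* ~44.5187 query bucks */
```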
That's quite interesting, isn't it? Something that trips a lot of people up around costing, especially parallel plan stuff, is that they expect a parallel plan to have a cost that is higher than the cost threshold for parallelism, right?
After all, SQL Server chose parallelism. The cost of the query must be so high that it’s above the cost threshold for parallelism setting, but that’s not true. See, every query plan starts as a serial execution plan, right?
Parallelism is explored later, potentially, if the cost of a serial query is higher than the cost threshold for parallelism setting. And that’s what happened here, right?
For the top 1893 query, we set MAXDOP to one, right? We said no parallelism, and the cost of the serial query was 50 query bucks. When we ran that query without the MAXDOP hint, we said, hey, SQL Server, do whatever you want.
Go crazy, pal. And what did it do? Well, it said 50 point something query bucks. Hell yeah. Parallel plan exploration. And SQL Server looked at the parallel plan and said, hey, you’re pretty good.
I like you. How much do you cost? And the plan was like, I cost 44 query bucks. SQL Server said, I’ll take it. So all the parallel plan has to do is be cheaper than the serial plan, right?
The parallel plan doesn’t have to have a higher cost than cost threshold for parallelism. It only has to have a cost lower than the serial plan that it started off with.
Very interesting stuff there, I know. We’re going to pause here, right? That’s a good logical stopping place, I think, for this. We’re also nearing the 20 minute mark.
And, you know, I see the watch times on these things. You squirrel people. Anyway, thank you for watching. I will see you in part three, where we will finish up all of this grand and groovy material. All right.
Thank you for watching.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.