How Poor Cardinality Estimates Can Lead To Worse Blocking And Deadlocking In SQL Server
Thanks for watching!
Video Summary
In this video, I delve into how poor cardinality estimates can exacerbate locking situations in SQL Server, particularly when dealing with updates and modifications. I demonstrate this through a series of queries and stored procedures, showing how bad cardinality estimates lead to excessive locks on the entire table instead of using an index effectively. By creating realistic examples and explaining the nuances of cardinality estimation, I highlight the importance of ensuring good indexes are in place and the potential pitfalls of using local variables. The video concludes with practical advice on fixing these issues, including the use of option recompile hints and parameterized dynamic SQL.
Full Transcript
Erik Darling here with Darling Data, trying to talk in a way that would be really easy to train the AI off of. Hopefully someday I will be AI training worthy. Training AI worthy? I don’t know. Whatever they call it when AI steals your stuff, I guess. And in today’s video, we’re going to talk about how bad cardinality estimates can make locking situations worse in SQL Server. Alright. So, hope you’re ready. I realized that I don’t have a good affectionate name for all my viewers out there, my watchers. Maybe watchers is the right word. All 3,562 of you as of the recording of this video. Who knows? Maybe that will go up while I’m recording. You never can tell. I don’t want to call you my data darlings. That might get me in trouble with the missus. And I don’t want to call you data heads because it sounds like I’m about to say something a little bit more rude. So, we’re going to, I don’t know, we’re going to have to think about that. If you have a good idea, leave a comment.
Because I would love to know how you would like me to refer to you. Because I can’t possibly learn all your names. So, we’re going to have to come to some sort of group descriptor. Anyway, I’ve got this lovely index on the Posts table. And, you know, it’s good enough to prove a point. It’s not anything overly fancy. As soon as you get too fancy in demos, they start failing and people start staring at you like you’re an idiot. And what I want to show you is, well, how bad cardinality estimates can make locking worse in SQL Server.
So, this is an easy one for SQL Server, right? This query right in here, very easy. We’re going to run this and we’re going to roll it back immediately because we don’t need to keep any of these changes. And I’m going to run this update in the transaction. Within the transaction, I’m going to select data out of my little helper function there called WhatsUpLocks. It’s available at my GitHub repo somewhere. And what this is going to do is show us all the locks that were taken by this update.
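The pattern here, sketched in T-SQL. The table and column names follow the Stack Overflow demo database, the updated column is purely illustrative, and the WhatsUpLocks call signature is my assumption, so check the repo for the real one:

```sql
BEGIN TRANSACTION;

    /* An update that is easy to estimate: a literal PostTypeId value */
    UPDATE p
        SET p.Score = p.Score + 1   /* updated column is illustrative */
    FROM dbo.Posts AS p
    WHERE p.PostTypeId = 3;         /* 3 = Wiki; only ~167 rows qualify */

    /* While the transaction is still open, inspect the locks it holds */
    SELECT wul.*
    FROM dbo.WhatsUpLocks(@@SPID) AS wul;  /* helper function; signature assumed */

ROLLBACK TRANSACTION;  /* we don't want to keep the change */
```

Rolling back inside the same batch lets you rerun the demo repeatedly without actually changing the data.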
All right. So, we go and we run this. Lo and behold, it runs pretty quickly. And if we zoom in over here, let me frame that up real nice for everyone at home. We see this request mode column right here. But notice only one of these rows has an X by itself. All right. That means this is the thing that actually took the locks. And the thing that actually took the locks was on 167 keys.
All right. So, that’s pretty easy. It’s pretty low, pretty lightweight. All right. It’s a serviceable number of locks. This thing finished pretty quickly. We didn’t have to worry too much about anything at all. All right. Pretty okay in here.
All right. And let’s go back to the query plan real quick. SQL Server started with a seek over here. And started with a very good cardinality estimate of 167.
Now, the thing that is important to note here is that when you start with a seek, you are most likely going to start with key locks. You start with a scan, you are most likely going to start with page locks. I don’t think you can really start with much else.
Unless you’re using a heap or something. Then RID locks, maybe. Get some RID locks in your life.
And so, sort of generalized advice there. The storage engine sees a seek hitting a low number of keys and says, hey, key locks. And then from row or page, you might move up to an object level lock, which we’ll see in a minute.
But you will not go from row to page to object. You just go from row or page to object. Assuming that SQL Server finds valid reason to engage in an attempt at lock escalation and is successful.
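Escalation behavior can be observed and, if necessary, constrained per table. A hedged sketch: LOCK_ESCALATION is a real ALTER TABLE option, but whether you should change it is very situational.

```sql
/* See how a table is currently configured for lock escalation */
SELECT t.name, t.lock_escalation_desc
FROM sys.tables AS t
WHERE t.name = N'Posts';

/* Escalation can be disabled per table. The trade-off is holding
   many more row/page locks and using more lock memory. */
ALTER TABLE dbo.Posts SET (LOCK_ESCALATION = DISABLE);
```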
If there are any competing locks on the table, it may not be successful. Now, the thing that almost no one, well, let’s see, what’s a good way to put this? The thing that almost everyone takes for granted is that their end users are not walking data dictionaries.
They do not have the numerical meanings for different things printed out at their desk. We’re going to have to look stuff up for them to make their jobs easier. PostTypeID equals three isn’t going to mean much to anyone at home.
No one’s going to memorize all the different post types in the post types table. It’s just not a thing that they’re going to do. So what they do know, usually, is the type of post that they want to find.
It could be question, it could be answer. For the sake of this demonstration, we’re going to be looking for wiki posts because those hit a relatively small number of rows. Questions and answers are, like, most of the table.
All the other things are the rest of the table, but, like, there’s 17 million rows in the post table. Like, 6 million are questions. Like, 11 million are answers.
And there’s, like, a few hundred thousand of the other stuff. But look what happens with this query. This is a real gosh darn shame what happens here. This is not a very quick finishing query at all, is it?
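The query in question is, roughly, an update driven by a join on the human-readable type name. This is a sketch assuming the Stack Overflow schema, where PostTypes has Id and Type columns, and the updated column is again illustrative:

```sql
/* Users know "Wiki", not PostTypeId = 3, so the lookup
   happens inside the update via a join */
UPDATE p
    SET p.Score = p.Score + 1          /* updated column is illustrative */
FROM dbo.Posts AS p
JOIN dbo.PostTypes AS pt
    ON pt.Id = p.PostTypeId
WHERE pt.Type = N'Wiki';               /* one row comes out of PostTypes... */
/* ...but SQL Server guesses badly about how many Posts rows will match */
```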
Not at all. It’s just four seconds. Right? Well, actually, how long did that take? Well, it was 87 milliseconds.
What’s your problem? What’s your gosh darn problem? Right? 1.8 seconds in here doing all this stuff. And, of course, 1.8 seconds over here.
Notice that we did not use our nice narrow little index that we created on the post table, on the post type ID column. We ignored it. SQL Server says, no, I’m not using that index.
Because I don’t want to do key lookups. If you want me to use this index, you have to put the column that you’re updating in the index. Good luck with that later.
We’ll see how that goes. But the reason why SQL Server doesn’t use it is because SQL Server makes a real crappy guess at cardinality. Right?
If you look a little bit more closely at what happens in this query plan, this parallel Distribute Streams operator uses a partitioning type of broadcast. What broadcast means is that this one row gets sent out to eight threads because we’re running at MAXDOP 8.
Right? We get one row from here. This thing turns that one row into eight copies of one row. And then when we come down here in the clustered index, we have some…
Let me get both of these things open so you can see a little bit better. There we go. We have eight threads in here that act cooperatively to scan all 17 million rows.
Right? These numbers in here will add up to 17 million. And then up in this section, we have the actual number of rows that got produced by each of those threads after that post type ID filter was applied. So, you know, there’s a decent spread here.
Nothing’s too, too off. I guess the 11 is a little bit low. But 30, 11, 19. This will add up to the 175 rows that we get here. So, all well and good.
And then when we finally do our join over here, that gets whittled down to 167 rows. Right? So, a lot of extra work.
And what’s kind of funny is that even if we tell SQL Server to use our index, right? If we say SQL Server, we created an index. It’s perfectly usable.
You’re being a goofball. Use our index. We still get, well, actually, you know, I should repeat myself. But if you were paying close attention to the output of this from the first demo, this does indeed lock the entire table. Right?
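Forcing the index looks like this. A sketch only; the index name here is made up for illustration:

```sql
UPDATE p
    SET p.Score = p.Score + 1     /* updated column is illustrative */
FROM dbo.Posts AS p WITH (INDEX(ix_posts_posttypeid))  /* hypothetical index name */
JOIN dbo.PostTypes AS pt
    ON pt.Id = p.PostTypeId
WHERE pt.Type = N'Wiki';
/* Faster, but the poor estimate still drives lock escalation
   up to an object-level X lock on the whole table */
```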
So, we still get over here a rather poor cardinality estimate down here. 167 rows out of 2,142,770. So, that’s a seven-digit number.
So, I think 2.1 million rows are going to get hit over here. So, we don’t use it. And, of course, when SQL Server is like, holy cow, that’s a lot of locks. It escalates those up to the object.
And I’m just going to, just to make sure that you don’t think I’m being goofy here. I’m going to, that query did run a lot faster. It still took a lot of locks, though. So, if we rerun this, this is the one that takes about two seconds or so, I guess. De-da-de-dee.
This one also locks the entire object. We have this X lock at the object level. We lock the whole gosh darn thing. So, in practice, a lot of people will experiment with joins with modifications. And that’s maybe not so great.
EXISTS tends to work a bit better, unless you need to, like, join one table to another to set the columns in one table to the columns in another table. Then EXISTS does you no good.
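For the cases where it does apply, meaning the other table only filters the update rather than supplying column values, the EXISTS form looks roughly like this (same assumed schema, illustrative updated column):

```sql
UPDATE p
    SET p.Score = p.Score + 1             /* column comes from Posts only */
FROM dbo.Posts AS p
WHERE EXISTS
(
    SELECT 1                              /* EXISTS ignores the select list */
    FROM dbo.PostTypes AS pt
    WHERE pt.Id = p.PostTypeId
    AND   pt.Type = N'Wiki'
);
```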
But, like, in this case, we could use EXISTS. Maybe it would turn out a little bit better. But that’s usually not the way most people are going to write that query the first time. Now, going back to our users not being data dictionaries, right? What we’re going to do is create, I’m going to say, an incredibly realistic stored procedure.
It’s like uncanny valley levels of realism for this stored procedure, where we’re going to ask our users to supply a post type. We’re going to look that post type up for them.
And then we are going to do an update based on the post type ID that we find for them, right? So, let’s create this stored procedure.
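The uncanny-valley procedure looks something like this sketch. The procedure name is hypothetical, the schema is the assumed Stack Overflow one, and the updated column is illustrative. The key detail is that the looked-up value lands in a local variable:

```sql
CREATE OR ALTER PROCEDURE dbo.UpdatePostsByType  /* hypothetical name */
    @PostType nvarchar(50)
AS
BEGIN
    DECLARE @PostTypeId int;

    /* Look up the id so users can pass "Wiki" instead of 3 */
    SELECT @PostTypeId = pt.Id
    FROM dbo.PostTypes AS pt
    WHERE pt.Type = @PostType;

    /* The optimizer can't see @PostTypeId's value at compile time,
       so it falls back to a density-based guess */
    UPDATE p
        SET p.Score = p.Score + 1    /* updated column is illustrative */
    FROM dbo.Posts AS p
    WHERE p.PostTypeId = @PostTypeId;
END;
```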
Let’s make sure we have this created, because there’s another copy of that that has a little fix for it. So, if we run this, and it’s going to do roughly the same thing as all the other ones, what do we get?
We get this big honking object level lock with the local variable in effect. The reason why is that SQL Server makes terrible guesses when we use local variables. Whomp whomp.
167 out of 2.1 million. And, again, we are not using our nice narrow nonclustered index. We are using our big honking clustered index. All right.
SQL Server has said no. No to the nonclustered index. Yes to the clustered index. And SQL Server is once-ing again. Once-ing. Ah!
It’s a good time. Once-ing. Where’d that come from? Once-ing again.
Asking for an index that not only leads with our post type ID column, but also includes the column we are attempting to update, which is maybe not the greatest, the grandest of ideas.
So, of course, you know, local variables cause all sorts of problems. You may run into cardinality estimation issues for all sorts of other reasons, but local variables are just a very easy and convenient way to show you how crappy cardinality can get when you use them.
Of course, there’s a very easy fix for this, right? And what we’re going to do is just create or alter our stored procedure, and we’re just going to stick an OPTION (RECOMPILE) at the end, right?
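Inside the procedure, only the update statement changes. A sketch, with the variable declared inline so the fragment stands alone:

```sql
DECLARE @PostTypeId int = 3;     /* stands in for the looked-up value */

UPDATE p
    SET p.Score = p.Score + 1    /* updated column is illustrative */
FROM dbo.Posts AS p
WHERE p.PostTypeId = @PostTypeId
OPTION (RECOMPILE);  /* the statement recompiles with the runtime value
                        of @PostTypeId visible, so the histogram-based
                        estimate (167) comes back, and so do key locks */
```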
And I just want to show you the difference here. It’s not really anything incredibly groundbreaking. Yeah. Local variables, option recompile.
That’s like the first step in the decision tree: figure out how bad this thing is, then run and do that. All right. So if we run this, what we’ll see is the same behavior as when we used an inlined literal value, where we have the 167 exclusive locks, sorry, 167 exclusive key locks here, and then, you know, some other intent exclusive locks in other places that don’t really do anything because they don’t actually take the locks.
The only one that actually takes the locks is the one that has X by it. So when you’re writing modification queries, be very, very careful. Make sure that you have good indexes in place so that your queries can find the data they’re looking for to modify.
If you find yourself needing to use local variables, some ways to fix problems with them are, of course, OPTION (RECOMPILE) hints, parameterized dynamic SQL, or an inner stored procedure call, meaning a call to a stored procedure that will actually do the update, because those will treat whatever local variable you create outside of them as a parameter when you pass it into them.
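The parameterized dynamic SQL variant looks roughly like this. sp_executesql treats @PostTypeId as a proper parameter, so it gets sniffed like one (schema and updated column assumed as before):

```sql
DECLARE @PostTypeId int = 3,   /* stands in for the value looked up from PostTypes */
        @sql nvarchar(MAX) = N'
UPDATE p
    SET p.Score = p.Score + 1
FROM dbo.Posts AS p
WHERE p.PostTypeId = @PostTypeId;';

EXECUTE sys.sp_executesql
    @sql,
    N'@PostTypeId int',
    @PostTypeId = @PostTypeId;  /* passed as a real parameter, not a variable */
```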
If you’re using table variables for whatever reason, you know, they’re in memory only, right? Just try using a temp table instead. Usually get better cardinality estimates.
Table variables don’t get a histogram that shows the data distribution in there, and that can cause some pretty big problems when you start joining them off to other tables. If you have, I don’t know, poorly written queries, overly complex join and where clauses, maybe out-of-date stats, you can hire me to do most of that stuff.
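The difference in a sketch: a temp table can get column statistics, including a histogram, that joins can use; a table variable cannot.

```sql
/* Table variable: no histogram, so joins get rough fixed guesses */
DECLARE @t table (PostTypeId int NOT NULL);
INSERT @t SELECT pt.Id FROM dbo.PostTypes AS pt;

/* Temp table: statistics can be created on its columns,
   so join estimates reflect the actual data distribution */
CREATE TABLE #t (PostTypeId int NOT NULL);
INSERT #t SELECT pt.Id FROM dbo.PostTypes AS pt;
```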
I’ll even update statistics for you if you feel like you need me to. One thing that is sometimes good to mess with, it didn’t help here, I tried it every which way in this demo, but sometimes changing the cardinality estimation model for your queries can be useful.
You have the new one and the legacy one. I have a strong preference for the legacy cardinality estimator for most of the things that I do. A lot of the demos that I write are using the new cardinality estimator where things just, you know, fly off the rails in a lot of ways.
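Switching models per query is done with a USE HINT. Both hints below are real, though whether either one helps is workload-specific:

```sql
/* Use the legacy (pre-2014) cardinality estimator for one query */
SELECT COUNT_BIG(*)
FROM dbo.Posts AS p
OPTION (USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION'));

/* Or force the default (new) model when the database-scoped
   configuration is set to legacy */
SELECT COUNT_BIG(*)
FROM dbo.Posts AS p
OPTION (USE HINT('FORCE_DEFAULT_CARDINALITY_ESTIMATION'));
```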
Then there are other things that you might be doing in your queries that the optimizer does not reason terribly well with. If you have scalar-valued functions in a where clause or a join predicate, or if you have multi-statement table-valued functions that return a table variable, and you’re cross-applying or joining to those, you can hire me to rewrite those because I do that for fun.
Money. I do that for money so I can have fun. And of course, if you are in the midst of a modification query, if you are shredding XML or JSON and attempting to use some sort of join or where predicate or some sort of isolating predicate, you can also hire me to fix that because I do that sort of stuff also for money fun.
So anyway, we’ve learned today that poor cardinality estimates can lead to more intrusive locking.
Don’t let the intrusive locking win. Fix your queries so that when you modify data, you take as few locks as possible. You don’t try to escalate those locks all the time.
And then you cause all sorts of blocking and deadlocking issues. Of course, if you have all sorts of blocking and deadlocking issues, you can also hire me with money to fix that so I can have fun doing this.
All right? Good deal. All right? Anyway, thank you for watching. I hope you enjoyed yourselves. I hope you learned something. I hope you’ll hire me.
I don’t know why I’m pushing that so hard. I don’t know. It’s like I have vacation coming up. The more people who hire me, the harder it is to take vacation.
If you like this video, thumbs ups are nice. Just make sure that you put your thumbs up somewhere appropriate.
Nice comments are nice. Do you like those? Kissy face emojis. Always a winner. And if you like this sort of SQL Server content, you can subscribe to my channel.
So that, hold on, let’s drum roll this. So that you can join nearly 3,563 other lucky people who get notified when these videos are published.
So, yeah, that’s all that. Anyway, I’m going to go work because fun is over. I’ve had my designated playtime.
I’ve got my yard time today. So now it’s time to go back to work. Thank you for watching.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 25% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.