A Little About Multi-Column Indexes In SQL Server

A Little About Multi-Column Indexes In SQL Server



Thanks for watching!

Video Summary

In this video, I dive into the world of multi-column indexes in SQL Server, explaining their capabilities and limitations. Erik Darling from Darling Data shares his experience dealing with unexpected technical issues that delayed recording a planned video on parallel nested loops. Despite these challenges, he manages to deliver valuable insights by comparing query plans generated using different cardinality estimators and demonstrating how adding computed columns can significantly improve query performance. I hope you find this content as enlightening as I did while preparing it, especially if you’re dealing with complex queries involving multi-column comparisons.

Full Transcript

Erik Darling here with Darling Data. And today’s super ultra, probably the most important video you’ll watch over the next 10 minutes. We’re going to talk about multi-column, we’re going to talk a little bit about multi-column indexes in SQL Server and what they do and what they don’t do. So, last week, I promised a longer video on parallel nested loops and let me just tell you how many things have been just ganging up on me to prevent that from happening. One, Friday, when I went to actually record it, all of the lights in my office and actually apartment started getting weird and flickery and like things were like turning off like the like my computer monitors would start turning off off and on and then like randomly lights wouldn’t work. And I don’t know how many of you out there are like certified electricians. I know that there are a lot of forklift operators in the crowd, but I’m not sure about the electrician segment of Darling Data fans. But if you ever open up a fuse box, like behind the panel, there are, well, if you’re in America, there are usually two big wires coming in and they have screws that kind of hold the stuff in place to make a connection and provide electricity through the fuse box and the apartment at large. And apparently just over time, because of the natural vibrations of the city of New York, the screws had come loose. And so they had to be tightened so that electricity could once again flow freely and smoothly throughout everything. So that got done like late Friday night. And by that point, I was like, well, there’s just no way, no way in hell I’m trying to record a video about parallel nested loops. Now I’ve already had a few drinks and it’s a, it’s a dense enough set of, of, of explanations without the mind being a bit foggy. And so that didn’t happen. And then of course the weekend came along and you know what they say about weekends?

No one watches SQL Server videos on weekends. No one cares. So, um, you know, uh, Monday, uh, kicked my butt. Uh, a lot, a lot of, a lot of, a lot of query tuning work done on Monday. And, uh, here we are on Tuesday. And I kind of realized that the, the material that I have is a bit too dense for just one, one quick video. So I’m going to have to, I’m going to have to figure out a better way to present all that stuff. But it’s, it’s in the works. In the meantime, I hope you’ll accept this piece offering about multi-column indexes. So, uh, let’s do it. All right. So I’ve got, uh, four queries here. Um, two of them are using the, what Microsoft so, uh, I don’t know, hubristically refers to as the default cardinality estimator, which is stupid.

Um, it’s just new. Um, and, uh, two queries that are using the legacy cardinality estimator. And, uh, it’s not legacy. It’s just the, it’s the older one. Um, and what I want to show you here is, uh, well, it’s the first thing I want to show you here. This is first of many things that I’m going to show you here. I’m going to try to make this one on the shorter side because it’s hot. And part of the reason why it’s hot is there’s too much hair coating my head in some places anyway.

Uh, and it’s, it’s, it’s a sweaty one, sweaty one here in the city. And I have not been, um, meritally cleared to put the air conditioners in yet. So here we are getting a little sweaty on camera for you. Uh, but what I want to show you first is that, uh, neither one of these cardinality estimators does a tip, does a terribly good job of actually estimating cardinality. So, uh, here’s what all four of these queries return. Uh, the first, the, uh, first one, 269,420, I did not intentionally write a query, uh, that returned those numbers.

I’m just now realizing what those numbers are now that I read them out loud. Uh, and then we have two that return zero. So, uh, let’s recover smoothly from that. And let’s look at the query plans because what I want to show you in the query plans is how different cardinality estimation models handle these sort of predicates. Right? So we’re just looking for first where creation date is less than close date, which actually, you know, has rows that hit it because often, uh, you have to create a question before you can close it.

And then we have these, uh, that make no sense because you can’t close a question or you can’t create a question before you close, after you close it, something like that. Uh, so yeah, no, this just makes no sense. But SQL Server estimates the same number of rows, uh, for both of them. You get a same sort of stock guess of about 30% of the rows in the table. The post table has about 17 million rows in it. 5 million is apparently about 30% of 7 million.

So we have that going for us because five times three would be 15. And if we added 3%, then we would get like 99. That would bring us a little bit closer to like, you know, 5 point something million, a little higher. So trust me, math. We’ve got, we’re good there. And the legacy cardinality estimator, the calculation is not nearly so straightforward, but the end result is that SQL Server thinks that 97,000 rows will, uh, qualify for this predicate, uh, which is also wrong. It’s a little bit less wrong, right? Cause you know, 269,000, it’s a lot closer to 97,000 than it is to 5.1 million. But the legacy cardinality estimator does the same thing.

And, uh, that’s not what exactly what I wanted. Uh, and it also gives you the same stock guess. Now, like I said, this math is a whole lot less straightforward to figure out. There is a lot of funny symbols and letters that are actually numbers and things like that. Basically, uh, unless you are an advanced mathematician, uh, it would do no good to try to explain this formula to you. You would never remember it anyway. Uh, it would do no good. So what a lot of people, notice I don’t have any good indexes for these queries. We’re just working off the clustered index.

So, uh, let’s take a first stab at an index on, for this query, right? Let’s just say we want to create a query, uh, sorry, we want to create an index on creation date and close date. All right. And so we’ll get that index moving and let’s run these queries again. Now the last set of four queries took about seven seconds to run. Uh, they all generated parallel execution plans. We did some fancy fun work. We got results back. Everything was groovy.

And now with that index in place, well, I mean, do we really do any better? Not, not, not really. Everything finishes in just about the same amount of time. We get the same estimates for all of these 97,318 here, 5.1 million, blah, blah, blah over here. And we get all the same results back. Now, the thing to keep in mind here is that multi-column indexes like this don’t track any correlations between the columns in them, right?

You really only get a histogram on creation date. Uh, SQL Server may have in the background created a system statistic on close date. That’s totally possible, but this isn’t like a bad statistics thing. This is just SQL Server doesn’t like SQL Server indexes. Don’t keep track of this, right? It has no idea how, like how creation date, how many creation dates are less than or greater than close dates.

Like this is not something that statistics track. And it’s very, very hard to communicate some of these things to SQL Server. So like, unless we had a check constraint that told SQL Server that creation date always has to be greater than, uh, close date and that close date can never be greater or creation date can never be greater than close date. Creation date always has to be less than close date. Like, you know, we might get like giving a little bit more information might be kind of helpful, but SQL Server is going to have no idea how many creation dates, uh, like will be less than close date.

Cause remember not every question gets closed, right? So close is going to have a lot of null values in it. So only the questions that get closed have values here. And SQL Server doesn’t keep track of how many of those like, like might just show up, right? We just don’t have that kind of good information. So the only way to really give SQL Server any better info, like not the only way, but one, one way that you can give SQL Server a cleaner path to these sort of predicates is to use a computed column.

Now, uh, what I’m going to do is alter the post table and I’m going to add this column, which converts a bit, uh, converts these ones and zeros to bits. So when creation date is less than close date, it’s one when creation date is greater than close date at zero. And that’s going to be that, right? So we add our computed column and notice how quickly that added because I did not add the persisted keyword.

And I want you to pay attention to something else very closely here is that I don’t need to persist this column in order to create an index on it. Likewise, I also don’t need to persist that column for it to get statistics generated on it. Persisted is just a special thing that you do when you, you, I mean, it’s a tough choice to make because even persisting computed columns doesn’t guarantee you a whole lot in the way of, um, you know, things, things going well when, when you query them. Uh, that’s often, often quite a crapshoot.

So, uh, that index is in place and I’m just going to show, uh, a couple of versions here with the, uh, the, well, I guess the default cardinality estimator. I’m going to go along with Microsoft silly, silly naming scheme. And we’re going to just run these two queries. And these two queries finish a whole lot faster because we are able to very seek, very easily and quickly seek. And if this thing would just let me do the grabby thing and move the query up, we are able to very quickly, uh, jump to where, uh, the data that we care about, right?

Notice that we didn’t have to directly reference the computed column in the where clause. We just had to exactly write the form of the computed column in the where clause. Uh, so, but this is where we’re looking for one equals this. And this is where we’re looking for zero equals this. And notice that cardinality estimates do improve quite a bit here because SQL Server has an absolute, I mean, SQL Server always thinks that one row is going to exist. Even if no rows exist, you’ll never see a zero.

I mean, I’m not going to say never. You’re almost guaranteed to never see zero of zero for one of these things. Uh, at least for like an actual, like data acquisition operator. There are probably some like in memory operators where you could see zero of zero. But anyway, I, I, I, I digress. And in a minute, I’m going to undress cause holy God, it’s hot. Uh, but like what you see here is that we get, because we actually stabilized that expression, we gave SQL Server like an actual, like materialization of what we’re looking for and what we care about.

Uh, we are able to, uh, like generate good statistics for like exactly what we’re looking for. So we get the exact hundred, hundred percent spot on guess here. And even though we get zero of one here, that’s pretty gosh darn close to it. We get very nice index seeks exactly to the data that we care about. We don’t have to scan the whole index, figure out if one is greater or one is less.

And it just saves a lot of time generally. So if you’re dealing with query plans, uh, that even do something kind of simple, like, like you just want to figure out if one column is greater than the other. Um, this, you know, you might run into all sorts of issues with cardinality estimation. You might have a tough time indexing for those queries. And, you know, you might suffer all sorts of plan quality issues because you’re not getting good cardinality estimates from these things.

Remember with the default cardinality estimator, you get a stock guess of 30%. That could be way, way off with the legacy cardinality estimator. You get a very fancy math estimate, which in this case was closer to reality, but was still off by like a hundred percent or a little bit more because it was 97 versus two, two, six, nine, four, 20. Yes.

Not going to jail for that. So, uh, if you, if you find yourself having to do these sorts of calculations, uh, and you find yourself getting bad cardinality estimates and bad query plans, things slowing down, uh, one thing, one option that you might want to look at is creating a computed column that expresses this stuff for you. Where this gets a little tougher is that, uh, you know, I do run into a lot of people who have to make this comparison across tables.

So like, let’s just say for the context of the stack overflow database, let’s just say we were comparing a creation date column in the post table to a date column in like users or comments or votes or something or badges or any other column with a date in it. You know, it would, it would, it would be, you can’t really index that. You can’t index across tables, uh, directly.

You could create an index view and then create, you know, if you needed to create whatever other stuff you had to on top of that. But, uh, that would be one way of indexing across tables. Other than that, you would be looking at like dumping stuff into a temp table where you combine the results and then index that.

Or just, you know, some other, uh, some other data materialization layer, uh, where that you, that is indexable in that way. Which now that I think about it, like realistically, it’s going to be indexed views and temp tables, but all situational, isn’t it? Anyway, um, I got other stuff to do.

Um, and there is some sort of aircraft going by. Uh, I hope doesn’t pick up on the mic. But anyway, um, thank you for watching. I hope you enjoyed yourselves.

I hope you learned something. I hope you, um, are not hot and sweaty wherever you are. I hope that you are air conditioned and comfortable. Uh, and you’re, you’re having a good day.

Uh, if you like this video, uh, I, I do enjoy a nice thumbs up and I do enjoy a nice, uh, nice comment. Um, uh, you know, again, just, just be, go easy on me because I’m having, I’m having a sweaty one over here. Uh, if you like this sort of SQL Server content, please subscribe to the channel.

Uh, you can join. Now, let me, let me refresh this so I get the exact number. You can join nearly 3,545, nearly, other subscribers to get notified when, when I drop these precious gems. I drop these jewels on you.

And, um, I, I do, I do also promise that eventually we will get, we will, we will get to the parallel nested loops video. And, um, I’m looking forward to seeing the watch metrics on that. I’m looking forward to see exactly how long viewers stick around for on that one because, uh, like I said, it is some dense material.

And, uh, it is kind of, kind of mind numbing when we, when you get down to it. But, uh, anyway, uh, it’s time for me to go de-sweat myself. So, um, you can, I’ll leave you with that visual.

Uh, once again, thank you for watching. Goodbye. Uh, why won’t this thing stop recording?

Oh, it did stop recording. No, it didn’t stop recording.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.