SQL Server Index Design: Getting Key Column Order Right

SQL Server Index Design: Getting Key Column Order Right


Video Summary

In this video, I delve into how data is stored within rowstore B-tree indexes in SQL Server, focusing on the mental visualization technique that can help you understand index structures more intuitively. While Microsoft has made some features only available in Enterprise Edition, which can cause performance issues when moving to Standard Edition, I use the `generate series` function from SQL Server 2022 to demonstrate how this function can be a valuable tool for visualizing and optimizing indexes. By cross-joining the function multiple times, I illustrate how data is logically ordered within an index, emphasizing why it’s crucial to consider key column order when designing indexes for optimal query performance.

Full Transcript

Erik Darling here with Darling Data. Feeling hyped up, amped up, ready to go, ready to plow through this day. In this video, I want to talk about how you and your mental brain should visualize, it’s not like the literal storage format, but it’s a good visualization practice. for how data is stored in rowstore indexes. This is not a video about columnstore indexes. columnstore indexes are a completely different thing. No information that I give you here to be used for the mental visualization of columnstore index storage. Just rowstore B-tree indexes, alright? Nothing else. Isn’t those crazy in-memory hash and range indexes or whatever? This is just rowstore B-tree indexes, alright? Nothing else. Just regular rowstore B-tree on-disk indexes, alright? So, I want to start this video off not surprisingly lodging a complaint against Microsoft. And the reason why this is a complaint is because they’re usually smarter than this. Usually, like, take developer edition of SQL Server, for example. It’s Enterprise Edition.

All the Enterprise Edition features, there’s no way to turn Enterprise Edition features off in Standard Edition. So, they get all you suckers hooked on Enterprise Edition stuff, and then when you’re like, haha, well, I’m going to get all this cool stuff for two grand decor, and you go to restore your developer database over to Standard Edition, and Standard Edition’s like, whoa, whoa, whoa, whoa, whoa. Whoa. Can’t do that here. Or even worse. You unwittingly, unknowingly, were benefiting from Enterprise Edition Optimizer features that you stopped getting in Standard Edition, and all of a sudden, performance sucks. The only way for you to fix it quick, you get Enterprise Edition. Usually, Microsoft is real smart about dangling that candy in front of you, waving that lollipop in your face. But they’ve gotten, apparently, they’ve just gotten stupid over the years. Because watch this. So, generate series is a new function for SQL Server 2022. And like a lot of new functions, like string ag and like other stuff like that, string split is another one, they hit it.

They’ve tucked it away behind a compatibility level. But like for a lot of stuff, you can tell SQL Server what compatibility level you want your query to run in. So, I’m going to tell SQL Server, I want you to run in compat level 160, and I want you to use this generate series function. But it tells me it’s an invalid object name. It’s invalid, they say in France. I don’t think it means the same thing. I think it’s different. And this is stupid. If you want to get people hooked on something, you’ve got to make it easy for them to get.

That’s why drug dealers are very successful. And Microsoft could be as successful as drug dealers. They would just stop being stupid about these things.

So, let’s get rid of a couple things here. Let’s move on. Our demonstration for the day. Let’s put our database into the 160 compat level. Let’s get rid of this foolish thing. Which does us no good.

No good whatsoever. And let’s frame up this query nicely. Now, to make things short and tidy, what I’m doing is I’m cross-joining the generate series function to itself four times.

And I’m only generating the numbers one through five. This is just to keep the result set small and manageable to help give you some drugs. By drugs, I mean an easy way to visualize index contents in your rowstore, Btree, on disk index.

So, stick with me here. I’m going to run this query and the results are going to look like this. And if I can just grab that thing and buy ourselves a little bit more real estate up here, if we were to create an index across these four columns, this is about how the data would be ordered in that index.

Maybe not perfectly physically ordered, something that’s logical ordering, but just to give you a nice mental grasp on things, the primary sorting of the index index is the first key column.

And if we scroll down through first key column a little bit, eventually, first key column, around the 125th row or so, is going to flip to the number two.

Right? And if we scroll down another 125 rows, we’re going to flip to the number three. Right? You can see that flip right there. And if we just, you know, real quick go down, we’ll see four and we’ll see five, and five will be the end.

So this is why people will tell you things like, this is the column that’s in like a lot of where clauses. This is like the primary thing that users are going to be looking to find when they search for data.

It’s a good idea to have that be the first key column and to have other, perhaps less common, less commonly searched columns be later in the index, because this index will define the primary sorting, sorry, this column will define the primary sorting of the index.

And guess what? Data is real easy to find when it’s sorted. That’s S-O-R-T-E-D, not S-O-R-D-I-D. Sorted.

Sortied. I think it’s French for exited. Maybe. I don’t know. So like, if you wanted to find values in your first key column, where they equal two, well, it’s really easy to seek to this chunk of twos.

Right? And once you’re in this chunk of twos, it is very easy to navigate to other key columns, through other key columns in the index.

So once you have, once you have that range of twos, it’s real easy to say, find all the ones in the second key column, because they’re all in order. Right?

And if you wanted to extend your where clause and say, I also want to find all the ones in my third key column, those would be very easy to find, because you would have all the ones in order here, and then all the ones in order here.

And if you wanted to extend that to a fourth predicate and say, I only want to find the ones in my fourth key column, that would be very easy to do. You could find two, one, one, one very easily. Right?

There’s all sorts of things about index stores that make these sort of searches nice and easy. But it’s also why if your queries aren’t using the first key column, all the other key columns are a bit of a jumble, and it takes a lot more work to go find values in them.

Like, let’s say we wanted to find all of the fives from the second key column. All right? We would have to go through all the ones, twos, threes, fours, and fives in the first key column.

Right? And then we would have to find where, right? Because we can’t, we’re not searching on this. So we can’t seek to anything in here and then go find anything here.

We need to find everything in here. It’s a five. Right? And we have fives throughout all of the different rows. in the first key column.

So searching on this, just this, is not going to be as efficient as searching on this and this. Okay? Now, let’s play that again with the third key column.

You can see that the fives get even more spread out. Right? Because now we have to go through all the ones, twos, threes, fours, and fives here.

We have to go through all the ones, twos, threes, fours, and fives here. And then we finally get to where the fives are in this third key column. But you can see how spread out they are after the other key columns are sorted for the index.

And if we take that a step further and we go and we look for where the fourth column equals five, the values are even more spread out because we have all the ones, twos, threes, fours, and fives here.

All the ones, twos, threes, and fours, three, fours, and fives here. I should go back to kindergarten. All the ones, twos, threes, fours, and fives here. And then we finally get down to where there are just fives here.

So this is why I record a lot of videos about indexing, about index key column order, and about why it’s really, really important for you to think about this stuff when you’re tuning queries.

Not necessarily at the initial index design phase because when you’re initially designing indexes, you might have very little idea how your application will actually be used. You might have a few ideas about how it will get used, right?

You might have a few modules that do specific things, but as soon as you let users free in your database, it’s very, very difficult to predict what kind of queries they’re going to be running, how they’re going to use your application to do things, right?

Things that might seem totally logical, straightforward, and obvious to you as the application developer may not be totally obvious, straightforward, or anything to end users once they start poking around in there.

You might think, well, everyone’s going to want to search by this first key column. It’s going to be great. Look how fast this stuff is. Then users get in and they’re like, no, I really want to find stuff in the third key column.

And all of a sudden, their query performance is not as good as your query performance, especially as that database gets bigger and bigger and bigger and things grow. Queries have to get more data, work through more data in order to find people are interested.

So, this is a good way to visualize how indexes store data. Again, it’s not the exact physical implementation.

It’s just a good visualization technique for you. good idea to have the columns in your filter definition be at least included columns for the SQL server has them all available in your index to use.

So, yeah, when you’re designing indexes, you’re thinking about index key column order, there are a lot of things to think about, but it becomes easier to think about all of those things when you have a good mental picture of how your query can navigate through your index to find the data that it’s looking for.

This is why pretty common advice is to, you know, go for equality predicates first because to some degree, equality predicates will generally be more selective than inequality predicates. This is less true depending on the data type of the equality predicate, right?

Like a bit column, potentially not as selective as, say, an identity column, right? An integer, big int or something. Range predicates like, you know, start dates, end dates, things like that.

You know, different things to think about. Like, this search might be selective if we’re only looking for a week, but this search we’re looking for 10 years, not very selective. So there’s all sorts of things that come into play when you’re designing indexes, but keeping index key column order in mind so that you can narrow down what queries are searching for and figure out what a good leading key column index is, this makes life a lot easier.

You can think of each column in the key of an index as sort of like a gatekeeper to the next column, right? If you are not searching on the first column, data in the second column is not going to be in good order.

If you are not searching on the first and second columns, index in that third key column is not going to be in a good searchable order and so on. So hopefully this was helpful to you.

Hopefully this will aid you in your quest to design the perfect index for your query or queries. I don’t know.

I have a call starting soon. I should probably go prepare for that inevitability. But if you enjoyed this video, thumbs up is always appreciated.

If you like this kind of content, a little subscribe is always appreciated. It’s free. It costs you nothing to get a notification that I recorded a video.

And as always, I hope you enjoyed yourselves. I hope you learned something. And I hope to see you in the next video, whenever that may be.

So, thank you for watching.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.