A Little About How Indexes Store Data In SQL Server

Video Summary

In this video, I delve into how indexes store and sort data in SQL Server, focusing on B-tree rowstore indexes since they are the most commonly used type. I explain that creating indexes helps speed up searches by organizing data in a way that makes locating specific records more efficient. The video walks through visualizing index storage with query results to illustrate how data is sorted within an index and why this matters for query performance. I also touch on inequality predicates, which can cause sorting issues when crossing boundaries between index values, impacting the need for sorts in query plans.

Additionally, I discuss the implications of sorts on memory grants, noting that they can lead to large memory requests or spills to disk, potentially affecting parameter sniffing and overall query execution efficiency. The video aims to provide insights into why proper indexing is crucial for optimizing SQL Server performance and sets up the next segment by previewing upcoming discussions on memory grants and their impact on query plans.

Full Transcript

Erik Darling here with Darling Data, the most sought after SQL Server consultancy on the planet and known universe. And I’m re-recording this video because it was pointed out to me by the ever vigilant Randolph West that my audio sync was off quite a bit in the last one. I’m not sure why. I recorded the isolation level video right before this one. Everything was fine. Walked away for a minute, came back, let the other video upload and start processing on YouTube. And this one was a disaster area. So we’re going to do this over again because, you know, I guess if I’m going to be pitching these videos as a reason why you should either, you know, buy my training or, you know, purchase consulting from me, I should not look like a badly dubbed Kung Fu movie when I do it. So, um, that’s fun. Anyway, uh, this video is going to talk about how indexes store and sort data because of course the reason we create indexes is that they put data in an order that we care about. And of course, uh, when we create indexes and we put data in order, it is to make searches of that data faster.

And of course the fastest data to search is data that is sorted in a way that makes locating the data that we care about as efficient as possible. So, uh, let’s make sure that query plans are turned on because I have a helpful note to do that right here because sometimes I am a bonehead and I forget to do that. There’s nothing worse than running a query that takes like 30 seconds and then finding out that, uh, you didn’t turn on query plans.

I think maybe the only thing worse than that is recording a video for 15 minutes and then realizing that you looked like a badly dubbed Kung Fu movie. So we’re gonna, we’re gonna try that again. Um, and of course we care about, uh, sorted data because when we have sort operators in our query plans, bad things can happen. Uh, we can have rather large memory grants and we’re gonna talk about memory grants in the next video.

So I’m not gonna spend too much time explainifying them here. I’m just going to say that sometimes sorts ask for big memory grants because those are, uh, size of data operations. Uh, if we don’t get enough memory, we might spill to disk.

If we ask for way too much memory, we might, uh, you know, knock a bunch of data out of the buffer pool. We might clear out other memory consuming caches and memory managers in order to loan memory out to query so that they can execute. Um, memory grants and sort of things like that can, uh, make queries more sensitive to parameter sniffing.

Um, even with the memory grant feedback, uh, intelligent query processing, uh, feature in newer versions of SQL Server, we can still run into situations where, uh, it does not quite work out once you start sort of mixing, um, different performance issues with those features. It’s just, you know, parameter sniffing is sort of a classic one. I know we have the parameter sensitive plan optimization, but, um, you know, most folks not on SQL Server 2022, aren’t gonna feel a lot of, uh, uh, pain relief from that one.

And, um, even like when you have a normal parameter sniffing situation, uh, the memory grant feedback thing, um, kind of kicks back and forth a lot. Even if it settles on like a middle ground value, that middle ground value can still not be all that great. Um, you still end up with a fair amount of spilling and a fair amount of overestimating memory grants.

And, uh, you sometimes have to override that when you can with the, uh, the, the max grant percent query hint. Uh, so let’s look at the index that I have created here. Uh, it is on the users table and the columns that make up the key of the index are reputation, upvotes, downvotes, and creation date, which I just stuck in descending order for a little bit of flavor.

It’s not gonna make a whole lot of difference to our query. Uh, down in the included columns, we only have display name. Um, and, uh, the thing about included columns is that they are just window dressing for your query.

They are only stored in the data pages. Uh, they are not sorted the way that the key of the index is. And since they’re not in the key of the index, they don’t offer the storage engine, any sort of, um, uh, like optimized way of locating exactly where things are.

So if we had a query that was like where reputation equals one and display name is like capital A percent, like, like we couldn’t like seek to any values for display name. Cause it’s down to the includes the order of included columns doesn’t matter for the index definition. Key column order matters a whole lot.

Included column order does not matter one single lick. So the first thing that I want to do is kind of talk you through how, uh, you can visualize the way that indexes store data. Uh, that makes the, uh, makes finding data in them more efficient.

So I’m going to run this query and notice that I have an order by reputation upvotes downvotes and creation date descending. I have a bunch of columns in the where clause. That’s sort of less important, but what I want to do is run this.

And show you the query plan and show you that there is no sort in the query plan because we have an index that exactly matches the ordering. The present presentation level ordering of our data. Uh, so we don’t have to sort that, but what I want to show you in the results to sort of help you visualize the way that indexes store data.

When you have a multi key column, nonclustered index is by showing you the results over here. So for reputation, we only have the value one 24. Uh, of course, the reputation column is going to be ordered from low from an ascending order from a 124.

From lowest value to highest value. But just for the chunk of this index where reputation equals one 24, we can sort of get a sense for how other, other data in the index is sorted. Now reputation, I’m going to call it the primary sort of the index.

I don’t want you to get that confused with the primary key. I just want you to know that because it’s the leading key column of the index, every, it is the primary sorting of the index is by this column. All the other columns are sorted within a duplicate value chunk by it, like within that.

Um, we’re going to talk about what happens when you cross boundaries a little bit later when we talk about inequality predicates. But with it within reputation one 24, up votes are all sorted in ascending order from one down to 10. Down votes are all sorted within any duplicate values in up votes.

So we’re, we’re up votes is one down votes is one one two. And that pattern kind of carries on except for poor up vote number four that only has a one and a two. But for six, we have one two three for seven, we have two four.

So it’s ascending order within every range of duplicate values. Um, for unique, you know, if this was a unique index, you know, uh, where we had like, you know, if this is like an identity column or something, there wouldn’t be a lot of room for that.

So, um, it would look a little bit, it would look different, of course. But if, you know, if we had a unique leading key column, we wouldn’t have this nice visualization here. And then within, uh, any, uh, duplicate values for down votes, uh, creation date will be sorted in descending order for those.

All right. So like, that’s how you can visualize this index data being stored, right? Whether, wherever there’s duplicate values, uh, values for the column in the key of the index after it will be stored in ascending order within that duplicate range.

And that’ll go for any other key columns that come along. So I’m going to run these two queries back to back. And we’re going to look at the query plans in a second.

But what I want to just, you know, bring up is to point out kind of the obvious what’s on your screen. Uh, this, this query, uh, is just the top 1000 ordered by upvotes. And since upvotes isn’t the leading key column, we don’t have that.

We don’t have that column nicely ordered from a, an ascending order from smallest to largest. We have reputation ordered that way, but not upvotes. For the second query, we’re only looking for where reputation equals one.

So we’re only going to return the top 1000 rows for reputation equals one. Now that the query plans look different. And, uh, well, I guess at least half an important way.

The top query, uh, goes parallel and has a top end sort in it to put data in the order that we care about. So if we look at the details of that top end sort, uh, we’re going to see an output list and we are going to see an order by upvotes in ascending order. Right.

Cause we had to, we don’t have upvotes perfectly in, we don’t have upvotes is like the primary sorting of this index, in this index or any other index. So we need to sort that we, the SQL Server needs to get all those rows and ask for memory is scratch space, memory grant, and write down all of the columns that we’re selecting in order of the column that we’re asking to be ordered by. The second query plan where we’re only looking for reputation equals one, we don’t have any sort at all.

There’s no sort operator in this one because we have that equality predicate on reputation. We have upvotes in perfect ascending order for just that chunk of reputation equals one. Remember when we looked at that first query where we had reputation equals one 24 and upvotes was in like, like one through 10 after that.

It’s the same deal here. So we don’t have to physically sort data because we have data in, in index order for reputation equals one, right? So we use that leading key column.

We have an equality predicate. We only search that chunk of data and upvotes is in perfect order within that chunk of data. We can expand that across the entire key of the index and we can order by creation date descending. And we will still not need to sort in our query plan.

Our query plan is sort free, right? We have a seek, we have a top. And because the equality predicates preserve index order across the key of the index, we don’t need to sort data for that. Included columns are useless for that.

If we look at this query with an order by on display name. Remember display name was in the include section, include region of the index. If we look at the, we get one row back, one single solitary row, but our execution plan decides to sort that one row.

So even though we seek to all the values we want, that gets us down to a single row. SQL Server is like, ah, well, let’s make sure that row is nicely sorted for them. They asked so nicely for that order by, I guess, I guess it has to follow the rules.

Otherwise we’d have mungodb. So something that I brought up a little bit earlier, inequality predicates do not work in the same way. So if we look for where reputation is less than or equal to one or greater than or equal to 1 million.

Again, we’re only going to get one row back for this, but SQL Server is still going to choose to sort data. Well, SQL Server is still going to have to sort data for these two queries. So if this thing will cooperate and we can get these things nice and close to each other.

So when I zoom in, they’re not 10 million miles apart. It looks like a moon mission. So this top query, we have to sort data. SQL Server chooses a parallel execution plan to do it.

And for the second query, even though we get a single row back, we still need to sort that single row because we have a top and order by. So crossing boundaries presents an issue for these queries because every time you cross a boundary, so like let’s say we went from reputation 124 to 125, the upvotes column would reset the ordering.

So let’s look at how that we can visualize that. So this query is almost the same as the first query, except we’re doing in 124, 125. If we look at the results, the important thing that I want to get to is where this boundary changes.

Because now we’ve crossed from 124, we’ve hit upvotes equals 10. Now we get to 125 and the sorting for upvotes resets. So the same thing is going to happen here where we’re not going to have downvotes like perfectly in order, even after the upvotes, right?

So like that this boundary resets. If we did where reputation in 124, 125 and upvotes equals zero or one or something, then we would have a different sort of set of data there.

But you can see all the sorting resets every time you cross a boundary like that. Now that matters less as long as we are sorting by the leading key column. If we search on other columns, like on other key columns in the index, upvotes is the second key column in the index.

So we can cross boundaries in upvotes or downvotes. And as long as we’re ordering by the leading key column in the index, then sorting this data is free, right?

Looking at these two, we have two nearly identical query plans. They both scan the index because again, we’re not searching on reputation. We’re only ordering by reputation.

But so the order by is free, but the search does result in a scan because we’re not also searching on reputation. All right. So that is how a little bit about how indexes was Btree rowstore indexes in SQL Server store data.

columnstore indexes are of course a little bit different. By that, I mean a lot different. And they are not a topic in this video because they are so different.

And most of y’all are just using the Btree rowstore indexes anyway. So that’s why I chose to talk about it. Anyway, again, we care about sorts.

Sorts ask for memory grants. We’re going to talk about those next. You know, sorts can spill the disk. Sorts can also ask for way more memory. Steal a bunch of pages from the buffer pool. Clear out your plan cache.

All sorts of nasty stuff. And they just kind of generally make code a little bit more sensitive to parameter sniffing issues on account of the memory grant thing. And in parameter sniffing scenarios, the memory grant feedback loop gets thrown off a little bit. So that is all I have to say about that at the moment.

Next video will be more about memory grants. We can talk about those. And yeah, I don’t know. Thanks for watching.

Please, pretty please, like and subscribe so that you can keep seeing these. And if you like and subscribe in the next five minutes, I’m timing you, I will send you an Adidas t-shirt. Leave me your address in the YouTube comments.

All right. Good enough. Thanks for watching.

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.