Indexing SQL Server Queries For Performance: Index Sorting
Thanks for watching!
Video Summary
In this video, I dive into the fascinating world of how indexes store and sort data in SQL Server, particularly focusing on multi-key indexes. We explore why understanding this is crucial for optimizing queries and improving performance. Whether you’re a seasoned SQL professional or just starting out, grasping these concepts can significantly enhance your ability to write efficient T-SQL code. I also share some practical tips on how to leverage indexes effectively, making sure that data retrieval processes are as swift and smooth as possible. Along the way, we take a fun detour through my database of DMX tracks, using it to illustrate key points about index ordering and query optimization in an engaging and relatable manner.
Moreover, I delve into the nuances of equality versus inequality predicates in queries, explaining how they affect the need for sorting or reordering data. By walking you through detailed execution plans and practical examples, I aim to demystify these concepts and equip you with the knowledge needed to make informed decisions when designing your database schema and writing SQL queries. Whether you’re looking to fine-tune existing queries or plan new ones from scratch, this video provides valuable insights that can help streamline your workflow and improve overall performance.
Full Transcript
Erik Darling here with Darling Data. In today’s video, we’re going to get on with our indexing series and talk a little bit about how indexes store and sort data in SQL Server and how you can use this knowledge to your advantage to make queries faster and more productive and, I don’t know, I’m sure it will meet some other KPIs along the way. But before we do that, let’s talk about how Erik Darling stores and sorts money. If you would like to support my efforts bringing you this ultra-high quality SQL Server training content, you can sign up for a membership to my channel down in the video description for around $4 a month. Of course, I don’t see the full $4 a month. The IRS has something to say about that. So you can do that. If you, you know, did I use the joke about the bully stealing your lunch money already? Well, I mean, I guess not everyone watches every video, so screw it. The schoolyard bully took your lunch money and you can do that. If you don’t have, you don’t have four bucks, you can like, you can comment, you can subscribe and help me grow this magnificent channel in all sorts of other ways. If you, you know, watch these videos and you’re like, holy cow, that Erik Darling sure does know a lot about SQL Server.
You can hire the best SQL Server consultant in the world, me, to do all this stuff for you. And I would be happy to do that in exchange for money. And as always, my rates are reasonable. Speaking of reasonable, segue right into this beautiful bit. If you would like all of my SQL Server performance tuning training at 75% off, that brings the price down to about $150 for the whole enchilada. You can, you can go to that URL and punch in that coupon code and by the grace of God, check out with a major credit card and we’re all happy people after that.
There is finally an upcoming event that has been announced. SQL Saturday, New York City 2025 will be taking place on May the 10th. What a wonderful day that is.
May in New York City is just a beautiful time of year. You know, it’s not, you have left the freezing grasp of winter and you have not yet entered the sweaty armpit of summer. It’s just a perfect time of year.
So I would encourage anyone who is feeling like having a nice slice of Sbarro pizza to come on down to New York City for the SQL Saturday event. I might even be there, who knows. And with that out of the way, let’s, let’s go have ourselves a little party with SQL Server and I should probably click on the right thing, shouldn’t I?
So, uh, order data is great for all sorts of things in databases. Uh, searching, right? If you want to search through data, it’s a lot easier if it’s an, if it’s in an order that allows you to efficiently search through it.
Uh, if you want to say group by or order by or, uh, you know, have a windowing function do some stuff, which we’re going to talk about in the next video, if you couldn’t see that in the tab name right next to this one. Uh, there are all sorts of things that having indexes in place that pre-sort your data in a useful way can help with.
Uh, but we need to talk a little bit about how indexes store data on, like for multi, multi-key indexes store data. For a single key column, for a single key column index, it’s probably pretty obvious the index is sorted by that column. But I still get questions from people like, uh, I have this index on column one.
Is that the same thing as this index on column three, two, one? And I’m like, no, column one is over here. And this one, column one is the only key column in this one.
How, how would that work? It’s kind of a thing that I still get. So, uh, those, those are, those are the fun questions that Erik Darling answers, uh, off the cuff.
So, uh, in my, uh, database called CRAP, I have created a table called albums. And this album is actually just, it has two indexes on it. Uh, we have a clustered primary key on a column helpfully called surrogate ID to, just so you understand.
It has no meaning, uh, outside of being the clustered primary key and getting us all the things that we desire from clustered indexes. Um, uh, you know, like, uh, you know, being, um, you know, uh, narrow. That’s a good one, right?
Biggins, narrow. Uh, being unique, right? It’s a, it’s an identity. So every value in there, uh, unless we do something silly is going to, uh, be a unique entry. Um, it is ever increasing, meaning it’s a, it’s an, it’s an ever growing value unless we do something goofy and we need to reseed it and then it’ll be a negative growing value or something.
Uh, and it is static, meaning that we are not going to be modifying the values in this thing. We are not, we are not allowed to directly update an identity column. We can, you know, do stuff with identity insert, but we can’t, you know, update the column.
So we have this thing, which has all sorts of great clustered index attributes associated with it. And then we have this index down here, which is a nonclustered index. And this, uh, index is on artist, release year, album title, and track ID.
Um, I guess in a, you know, I guess we could have swapped release year and album title. Wouldn’t really have made that much of a difference, uh, in the table or in the index. But, uh, for the purpose of this demo, let’s just pretend this is exactly the way we want things.
Now, what I did was I put, um, a bunch of rows into this table. Uh, and I, like this, this, this, this, this, this value’s clause is a little misleading because what we end up in, what we end up with in the table actually looks like this, where we have, uh, the surrogate ID, the artist, the release year, the album title, and then the track ID for the number of tracks that are in the album that we care about, right?
So we actually have multiple entries for each album. Now, um, in rowstore indexes, key columns are stored in sorted order in order to make it easy to navigate the, uh, the B tree that they’re stored in efficiently to locate rows.
Uh, they are not stored individually like in columnstore indexes. In columnstore indexes, you really get like a column, like an index on each column that is in the index, which is the series of little indexes, uh, on those things, uh, row groups and whatnot.
Um, but, uh, you know, like way back in the, the, the bad days of, of life, um, there, a lot of people would, uh, use the phone book analogy to talk about indexes. Uh, I’m going to update that for the modern, um, I don’t even know what generation is the most current one. Uh, whatever.
Everyone makes playlists. Everyone loves a playlist. I have YouTube playlists. I have Spotify playlists. I have playlists all over the place. I assume that other people who are far more organized than I also have playlists in their lives. So let’s, uh, let’s go back to what we have on this table.
Oops. Index wise. I didn’t frame that very well. We have a clustered index, uh, a clustered primary key on surrogate ID, and we have a nonclustered index on artist, release year, album title, and track ID. Now, if I want to sort, if I want to say select star from albums, order by surrogate ID, of course, SQL Server is able to take advantage of our clustered primary key.
Uh, it’s only 200 rows. It’s not a big deal here. Performance wise.
Oops. I hit control, but snap, snap, snap, it didn’t listen. Uh, we scan the clustered index, but we don’t have a sort operator in our query plan. We have asked for this in the order of surrogate ID, but since the clustered index put surrogate ID in the order that we wanted it in, SQL Server does not have to physically implement a sort of this data. The same thing goes for if we, um, say select star from albums, order by artist.
If we look at the execution plan, we have an index scan, a nonclustered index scan this time. You can see the non-col here. It would be nice if SSMS had the ability to show us full object names the way Plan Explorer, uh, which hasn’t seen a, I don’t know, any development in who knows how long.
Uh, it would be nice if SSMS had this ability, but, uh, for now we just, we’re just left with these ellipsis cliffhangers to figure out what happened. But notice there is no physical sort operator in this plan. Uh, we can, let’s just contrast these two plans really quickly.
Where if we were to ask for SQL Server to, uh, sort our data by release here, and SSMS is being real picky about that right there. But now we do have to physically sort that data when we say order by release here, because release here is not the leading key of any index in the table. Uh, the, the clustered primary key is surrogate ID, and the nonclustered index that we have is, uh, artist.
It leads, leads with artist. So asking for this data to be ordered by release here means that we have to re-sort the data from what the index initially had it sorted, the order that the index initially had it sorted in. The same thing would go if we used any other trailing key, like album title or track ID.
If we look at the execution plans for these, come on, SSMS, help me out a little bit here. Just make this video go, like, kind of smoothly. I’ll give you $50.
We have, uh, so asking for order by album title or by track ID, uh, in these situations, both leads to a physical sorting of the data, or let’s call it a reordering of the data from the order that the index had it stored in. Now we can order by, if we, if we say order by all the columns in the table, artist really, or rather all the columns in the nonclustered index, we’re leaving surrogate ID out of this. Now, even though surrogate ID is the clustered primary key, and it is technically included in the nonclustered index, uh, they’re like, we’re just saying that’s not an important column for our results.
Right. So like, just to make it, uh, maybe perhaps make it a little bit more obvious. Um, if we were to, if we were to say, do all this stuff and also say order by surrogate ID, uh, we would still not need to sort the data because SQL Server, uh, when you create a, uh, a non-clustered, non-unique index on a table that has a clustered index, the clustered index key column or columns, right?
If you have, you can have a multi-key clustered index end up in the, in the nonclustered index is like invisible final key columns. So we still don’t need to sort data when we add surrogate key to the order by here. All right.
So, uh, if the, if this index were unique, then surrogate ID would end up as an included column. Um, and then we would probably have to sort the data, uh, by surrogate ID because it would be an include and includes are useless for ordering, but we’ll talk more about that later. So, um, this, this, this order by is still free, right?
Like all of these columns, there was no sort in this query plan. If we were to do something like this and we were to take the leading key column out of the order by elements and just say order by these three, we are back to having to sort this data because it needs to be reordered from the way the index initially stored it. Right.
Because the index stores data ordered by artist. And then for any duplicates within artist, we are ordered by release here for any duplicates within release here, we are ordered by album title. And for any, uh, duplicates with an album title, we are ordered by track ID.
So an easy way to sort of see how this works is to include an equality predicate in our query. Because right now with this query, we are, we are effectively skipping over the artist column in the, in the, in the, in the, that leads the index, that leads the nonclustered index here. But as soon as we ask for an equality predicate on the artist column, we are able to order by these three columns for free because we don’t have to reorder anything.
We seek into, uh, our nonclustered index. We find every, uh, entry where the artist is DMX and then for, uh, then we have, because that is an equality predicate for a single value, uh, from the, the leading key of the index. Uh, the next, the next columns over like, uh, release year and album title and track ID, uh, the order is maintained in the index.
We don’t have to worry about reordering data in the index because we just did a single equality seek to this. So in the results, we have DMX and because this is all DMX here, these are all, this is all a duplicate value, right? This is just, you know, uh, 127 rows of DMX.
Uh, then release year is, is, is in order within this range of duplicate values. So we have 1998 at the beginning and we have two, oops, oops. Uh, come on, scroll bar, stay where I put you.
We have 2012 at the end. So this release year column is all in order from an ascending order for DMX. And then for 1998, we actually have two albums, right?
DMX, what a, what a beast, uh, released two albums in 1998. Flesh of my flesh, blood of my blood, and it’s dark and hell is hot. And so, uh, we have DMX in order.
We have the release year 1998 in order. And then we have album title. This is, of course, F comes before I. So this is in order.
And then the track IDs, one through 16 and then one through 19 are in order for each of these album titles. What that means is that if we were to say, um, search, uh, across, uh, artist release year and album title and then order by track ID with, so we have three equality predicates. We would find, we would seek to DMX.
We would seek to release year. We would seek to flesh of my flesh, blood of my blood. And then we would have, or track ID in the exact order that we care about it stored in the index. We don’t have to see, we don’t have to sort data here, right?
So we get that data, we get that data in order for free. Inequality predicates do not preserve data because we have to cross boundaries within the index when orders would, when, when, when an order by, when order resets, right? So, uh, if we look at this query plan, we are back to having to sort this data, right?
And the reason we have to sort that data is because we are asking for, uh, artist release year album title, uh, where the artist equals DMX and the release year is greater than 1998 ordered by album title and track ID. Now, when we had an equality predicate here, it, it wasn’t a big deal. We didn’t have to resort that, but with an inequality predicate, as soon as we cross over from 19, um, sorry, I asked for this in order by album title.
So as soon as we cross a boundary and release year, say going from, uh, 1999 to 1998, we’re ordered by album title and track ID, right? This, this, this one and this one. So, uh, this data was not in order, uh, the way we wanted it after we crossed a boundary with, for a release year.
So, uh, so like if we, another thing that’s kind of funny is that sorting, uh, with duplicates is, uh, often called non-deterministic because like, look at these results. If we were to say, um, select these columns and then order by, uh, track ID, we would get all the track IDs one in here, uh, like for this chunk. But, uh, the order of the columns is a little bit out of whack.
It’s like 98, 98, 99, 2001, 2003, 2012, 2006. So these track IDs were not sorted perfectly by anything else. If we wanted, uh, a deterministic sort order, uh, to, with like tiebreakers and stuff like that, we would need to include a unique column in there.
So if we also ask for a sort by surrogate ID, then we would get, we could have asked for, I guess, release year two, but that might’ve been a little funky with, uh, oops, a little funky with, uh, the two albums that got released in 98. But if we say order by surrogate ID as well, then we get 98, 99, 2001, 2003, 2006, 2012. So there are the way that you, uh, the order that you store, uh, or create your indexes stores the, uh, data that you put in the key of the index in a specific key, in a specific order.
Uh, if you ask for a different column to be ordered by, or if you ask for a different set of columns to be ordered by, and you don’t have a quality predicates that maintain that order, you will have to resort data from the order that it’s stored in from the index. The same thing would go for, um, like an in search. So a lot of times the way that SQL Server handles in searches and the query plan makes it look like it is just two separate seek predicates, but it’s not two seeks with like ordered, uh, results after that.
Uh, like, like if this were just, you know, like when we ran this query and it just said the artist equals DMX ordered by these three columns, we didn’t have to worry about, um, sorting this data. But when we say in Blondie comma DMX, we do have to sort that data because every time we cross a boundary, right? When we go from Blondie to DMX, the release here ordering resets.
So for Blondie, uh, I only, I didn’t put all of Blondie’s albums in here cause let’s face it, not all of Blondie’s albums are worth putting on a playlist. May have had some good songs later in the career, but at a certain point, the, the, the whole, the full, the full album experience kind of lost some gusto. So, but, um, so for Blondie, the release here is perfectly ordered from 1976 down to 1982.
But then as soon as we cross over from DMX, that ordering restarts. So this is 1998 through is going to be like 2012. So we have that data in the order that we care about in the index per artist.
But as soon as we cross a boundary to a new artist, that ordering resets in the year column goes from ascending to descending. Uh, so from lowest to highest for that artist. Now let’s look at a slightly more complicated example.
And this is what we’re going to finish up with. Uh, if we have, we say, oops, I need to, uh, go into the stack overflow database for this one, don’t I? So a slightly more complicated example, we’re going to select the top 40, uh, users from the users table where reputation is greater than, I think that’s a hundred thousand.
deprivation is great because they’re Thiessen times a million. It stands for a littleack casos and it’s three times per object in them old and that’s a your circle. But the fields of definition is could have a million times time to define schemas.
It mean, it makes the video that’s healthier because it has a million times. And theункты sampling tribunk wages are very difficult. It means they’re going to meet teams and um, but if we do a bunch of things on the basis of EstoyDeal, I’m going to feature point because we’d take advantage through time for most of the reasons. We use that to visit the标志 and including data. The mushrooms data, I used to be able a lot larger information for each member, but we have a digital tool called user- STRANGO date.
the query. Now if we create this index on the users table and we rerun our query, we don’t need to sort that data anymore because we have the index that we created has a key is keyed on reputation. So now reputation is stored in the order that our query cares about it being in.
So if we look at this plan, there is no sort, there is just a top. So we go in here, we find, we seek to the top 40 rows that we care about. And since id is the clustered primary key, we don’t need to do any additional work to sort this. Now if we look at the estimated plan for this query, where we’re adding another top in here, we’re cross applying to the top one, this is the most, this is going to get the most recent badge for any given user. If we look at this query plan, we’re going to have an eager index spool in here. That’s something that I’ve talked about in this video series as well as many other times on the channel. And one of the things that a lot of, what a lot of people do when they need to fix an eager index spool is they look at the details of the eager index spool down here. And they say something like, well, I need to seek to the user ID in the badges table, and I need to output the name and the date column from the badges table. So they might create an index that looks like this on this, on the badges table in order to facilitate, in order to get rid of that eager index spool, which is, you know, if you were strictly speaking, this, this is the index that SQL Server created to facilitate faster seeking on the inner side of that nested loops joint. If we look at this query now, we do get rid of the eager index spool, but we still need to sort this data, right? We seek in here, and then, but we, but since the date column is in the includes of this index, we do not have an ordered, an ordered avenue to where this data lives in order to make that sorting free. So let’s say if we wanted to run a slightly different query, like say, select the top one from users, and then the select their top 1000 most recent badges, right, we could do this, we get John Skeet, and I forgot to, wait, I did, what on earth? Oh, yeah, good answer. Okay, so if we look at this, we still need to sort that data, and now our sort even spills a little bit. So don’t we feel foolish? Now, if we were to change this index definition a little bit, and we were to say, select user, I’m sorry, we were to put date is the second key column in the index, then we would still be able to, we would be able to get rid of the eager index spool, and we would still be able to navigate the B tree after the nested loops join, and not have to sort that data anymore. We still have a top down here, but it’s not a top end sort. So you will serve for every time, for every trip, remember that nested loops is a per row activity. So for every row that comes, that we seek to here, like one row comes out, goes into the nested loops join, comes down here, and then SQL Server seeks to that one value, right? Every time you come in here, we seek to this, and then for every time that we seek into this table, for every user ID, the date column is now in order. So because our index is on user ID, and then date, every time we have an equality predicate on user ID, like remember, like just like with the artists and albums, we had like user ID 22656, the dates are all in order for user ID 22656. So for every equality predicate, that second key column is in the order that we care about it being in. So remember that our index now is on these two, right? User ID and then date. So if we were to select the top 100, and we were to say order by user ID and date, again, this ordering is free, we have a top, but we do not have a sort in the query plan. And if we were to say, hey, let’s select the top 100, but let’s order by these three columns now, user ID, date and name. Because name is an included column, that data is not stored in any particular order. Includes have no ordering to them, they just exist down at the leaf level. So we end up with a sort back in the query plan. So understanding concepts like this is really, really important for making queries go faster. A lot of queries have either presentation or implied order bys. You know, presentation layer order by is something like this. But, you know, implied order bys might be, you know, to support certain query plan operators like stream aggregates and merge joins.
You might need to sort data to match the, what you’re asking for in a windowing function, the partition by order by clause. You might have a top with an order by in your query, there might be all sorts of reasons why having data in order would make queries go faster. It’s important to understand how equality predicates and inequality predicates can change the way that SQL Server thinks about presenting the data in the order that you’re asking it for, and all sorts of other things.
You know, fixing sorts is not just a matter of, you know, generally making queries go faster, but, you know, sorts can ask for very large memory grants. Sorts can spill if they don’t get an appropriate memory grants. Sorts can ask for way, way, way more memory than they end up using.
So there’s all sorts of reasons to get the indexing and querying right so that you don’t end up in a situation with a slow query because of that. But we’ve talked about this long enough now. Thank you for watching. I hope you enjoyed yourselves. I hope you learned something, and I hope you stick around for the next video on windowing functions where we will we will experience even more further joy than we experience in this one. So goodbye.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.