Shook Spear
Video Summary
In this video, I delve into how indexes can help us avoid sorting data in SQL Server, which is crucial because sorts can be memory-intensive and slow down query performance. We explore the intricacies of sort operators, their impact on memory grants, and how they internally block other operations. I also discuss how equality predicates in indexes can preserve order without needing a sort operator, while inequality predicates often require one. Additionally, I provide practical examples to illustrate these concepts and highlight the importance of writing queries that leverage index design effectively to minimize sorting.
Full Transcript
And here is the 5,011th attempted recording of this video. I don’t know why it’s so much easier to do this stuff live. I do it live, blaze right through, everything’s good. I record stuff, I’m like, ugh, gotta fix that. My video editing skills suck, so that usually just means take two. In this video, we’re going to talk about sorts, but rather how indexes can help you avoid sorting data. And we’re going to do that because there are some things about sorts that we care about. Things that we care about with sorts is one, they will ask for memory, and the size of the memory grant that they will ask for is going to depend on how long your data is, how many rows you’re selecting, and how wide your data is, how many columns you’re selecting and their data types. The bigger your data is, the bigger a memory grant that sort is going to ask for. Now remember that SQL Server doesn’t work with pages on disk, it only works with pages in memory. Those pages that get stored in memory get stored in the buffer pool, and the more, and memory being a finite resource, the more queries you have asking for large memory grants, that memory has to come from somewhere. Usually that memory, if you don’t have a ton of extra memory on your system, will come from space that it was going to use for the buffer pool.
For a memory saving example, you have one set of parameters that, you know, returns a small number of rows, you don’t have to sort a lot of data, then another set of parameters where it returns a lot of rows, not even returns a lot of rows, just processes a lot of rows to get you your final result set. And that might need far more memory to run optimally. And that’s where you end up with the spilling and the slowness and everything sucks.
And then they’re also, they’re internally blocking. I don’t mean that they are blocking other queries, just inside of an execution plan, all the rows have to get to a sort, and SQL Server has to write them all down in order, and then they can pass things on to other rows downstream, or other operators downstream. And it’s, you know, you just have a bunch of operators sitting around waiting for that sort to finish and do its thing potentially. Unless it’s at like the, all the way to the left in your execution plan, and then I guess it doesn’t matter. So let’s look at how index design can help us avoid sorting data. So I’m going to, well, I can’t remember if I created this index or not. I already created it. Good for me. Sweet. This is so long ago. I forgot.
But what I want to show you first is, for this index, how the data is going to be stored in this index. And I’m going to do it with this query. Now, this, I’m using this query because it gives me a nice, easy set of data to illustrate this point with. So this, this index is on first, the first key column is reputation. The second key column is upvotes. The third key column is downvotes. The fourth key column is creation date descending. We have display name included, which is not going to help us sort data at all, because included columns are in no particular helpful order.
But we’ll get to that later. So if we look, if we think logically about how data is stored in this index, we have a whole bunch of duplicate values for reputation 124. Within that, within this duplicate range of values, we have values for upvotes stored in order next. So this is all 124. And then we have groups of ones, groups of twos, groups of, oops, groups of threes, and groups of four, five, six, seven, eight, down, on down to 10.
All right. So we have 124. Then data in upvotes is in order after that. And then within duplicate ranges or within a range of values in upvotes, we will have downvotes in order next. So for where upvotes equals one, we have downvotes one, one, two. And then within any duplicate range of values here, we will have creation date in descending order here, because we have it in descending order in the index. So for 124, we have upvote equals one, and then downvote equals one, and then creation date equals this.
So this is how the index is in order stored logically, not physically, logically. So if I run these two queries, the first one is just going to say select the top 1,000, those four columns in the index, ordered by upvotes. And the second one is that same query, but with an equality predicate on reputation.
If I run these two queries back to back, the first query will have a sort. We have to sort by upvotes to get data in the order we want here. But if we have an equality predicate on reputation, there’s no sort operator in this plan.
We have an index seek, we have a top operator that’s not a top sort, and then we just have a select. So in this query, with no equality predicate on reputation, ordering by upvotes, now we have to sort data. And this spills a little bit, not enough for me to get riled up about, but it does spill.
Which means that logically, we can select the top 1,000, those four columns in the index, with an equality predicate on reputation, upvotes, and downvotes. And we can order by creation date descending. And we still will not need a sort operator in the query plan to get the data in the order we want it.
Using equality predicates going across like that means that we can traverse the data just like I showed it to you with that first query. So we find equality values here, equality values here, equality values here. And then by the time we get over to the creation date column, that data is in the order that we want it.
That order is preserved. So nice, right? With rowstore indexes, you have that sort of column to column dependency. It’s not true with columnstore indexes, but that’s not what we’re going to get to here.
But like I mentioned, includes are useless. So even if we have an equality predicate on creation date and we say order by display name, well, we’re going to get this one row back, but we’re also going to have a sort operator in the query plan.
Now, this is a top end sort. This is not a fully sorted thing-a-bob. Anyway.
Equalities are also not so kind. So in the original query where I had reputation equals one, and in this query we’re going to say less than or equal to one or greater than or equal to, I believe that’s a million.
I’m too lazy to count all those zeros. But if we run these two queries and look at the execution plan, we will have a sort here, top end sort here. And we’ll have a sort here, another top end sort. Again, this one spills a little bit, not much. And this one doesn’t spill at all because there’s a very, very small number of rows involved in that one.
But so I guess if there’s a message here, it’s an inequality operators do not handle order preservation as well. There are some funny rules to that, though. Let’s take these queries, for example, where I can, we’re going to filter on, we’re going to, well, all of them are going to order on reputation.
All of them are going to order by that first column in the index. These are going to filter on columns that occur later in the index. And these are going to be inequality predicates that occur on the reputation column.
But if we run these queries and we look at the execution plans, none of these need to sort. None of these will have a sort operator in them, even if we scroll all the way down. All we see here is tops, not top end sorts.
So whenever you’re designing indexes, a lot of times a rule will be something like, Use equality predicates first, man. And then, you know, inequality predicates second, man.
And then, you know, include your select list, which is 45 columns because everyone was too lazy to normalize the table. Man. But I think that the indexing advice should be perhaps a little bit different.
So if we have queries where we need to optimize for sorting data, Sometimes I think that it is a wise idea to change the index creation advice to be on equality columns first, and then sorting elements after equality columns, and then inequality predicates, and then your 45 include columns.
I’m kidding. Don’t. Please don’t use 45 include columns. I will cry. I cry every time. I’m going to just stop doing that. Just get…
Stop. Stop. There are better ways. There are better ways in the world. Anyway. So that’s one way to think about indexes. It’s a little bit different from the way that a lot of people think about indexing for things today.
Now, there’s also times when different ways of writing queries can help you avoid sorting data. So what I have here is an index on the votes table. It’s on vote type ID, post ID, bounty amount descending, and creation date.
Now, this query has a couple predicates in it. One is on vote type ID being in 1.3, and one is on creation date being greater than or equal to 2010.
Now, the way that this query plan is going to look, which I’ll show you in a second, we’re going to see seeks for this, and we’re going to see a thing for that, like something. So we’ll get to it.
But this row number function, right, where we’re partitioning by post ID, and we’re ordering by bounty amount, that’s following the rules I just talked about, where, you know, we should have this thing, we should have this equality predicate on vote type ID here, and we’re going to partition by post type ID here, and we’re going to order by bounty amount descending here, and then we’re going to filter on creation date being greater than or equal to here.
But when I run this, something kind of not cool happens. I highlighted too much, and I hit an error message. But if I do what I was supposed to do, if I pay attention a little bit, and I run this query, it’s not going to be fast.
It’s not going to return any rows either, because I’m playing a mean trick on the optimizer, because I’m generating this row number, and I’m filtering where that row number equals zero, but row number doesn’t ever equal zero. But the optimizer is just like, yeah, we’ll go for it.
It’s kind of goofy. But if we look over at the query plan, well, what do we have here? Big old sort. Big old sort. Look at that.
We sorted lots of rows there. And that’s after an index seek. And that’s, well, we have this seek predicate on vote type ID, and we have this predicate on creation date, but still we had to sort data.
And that’s because in, even though it’s seeking, and even though it seeks twice there, and we have two seek predicates, it’s like an or, kind of.
And that is not so hot at preserving data. If we want to avoid the sort there, we need to change the way that we write the query a little bit. So if we rewrite the query to be one expression where we look for vote type ID equals one, and another one where we look for vote type ID equals three, and we run that exact same query, this will turn out a little bit better for us, in that we won’t have to sort any data.
It’ll run a bit faster, too. Well, it’s just about twice as fast. Right? And there’s no sort operator in here. So equality predicates can help you navigate indexes a little bit better.
Indexes can put data in a helpful order to help you avoid needing to sort things. But sometimes we need to write queries in slightly different ways, slightly less lazy ways, that aren’t always totally intuitive.
So there’s some unexpected stuff that does not produce data in the order that we want it, and that the optimizer is like, we’re going to have to sort a lot of that data instead. Yeah.
So that’s that. So now remember, we care about sorts. Similarly, we care about hashes as well, because hashes are similar operators, and that they will ask for memory grants, but typically not size of data memory grants, the way that sorts will.
But we care about memory-consuming operators because they ask for memory grants. Memory grants will often need to take memory from the buffer pool, unless you have lots and lots and lots of memory.
When we ask for lots of memory grants at once, we may run out of memory grant space. We may hit resource semaphore weights. If queries run and they do not get adequate memory, we may spill to disk, and large enough spills can certainly impact performance.
They can make code sensitive to parameter sniffing, where different parameters would produce, well, we need to process different numbers of rows, where we might need far more memory to do that in some circumstances.
And they’re internally blocking, meaning that all the rows have to show up at the sort operator before we can start writing things down. And if we need to sort lots of data, we could be there for a minute when we write all that stuff down in order.
So anyway, this is the five billionth run through of this video, and I am done with it. I am going to produce this one.
I am hopeful that Camtasia will mix it in a way that uses both stereo speakers. And I will see you over in the next video. Thank you for watching.
Thank you.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 25% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.
Related Posts
- The SQL Server Performance Tasting Menu: Performance Issues With Parallel Query Plans
- The SQL Server Performance Tasting Menu: How DISTINCT Queries Can Hurt Performance
- The SQL Server Performance Tasting Menu: How You Compare Date Columns Can Hurt Query Performance
- The SQL Server Performance Tasting Menu: How Multi-Statement Table Valued Functions Hurt Performance