Indexing SQL Server Queries For Performance: Equality and Inequality Predicates

Thanks for watching!

Video Summary

In this video, I dive into the intricacies of indexing queries for better performance in SQL Server, focusing specifically on equality versus inequality predicates. As we explore common scenarios and problems encountered when designing indexes, you’ll learn how to optimize your queries so they can take full advantage of those indexes. The discussion covers a range of topics, from understanding why certain index designs might not perform as expected due to non-selective columns required in every query, to strategies for rewriting both the index and the query itself to improve performance. By the end of this video, you’ll gain insights that can help you avoid common pitfalls and make more informed decisions when designing indexes for your databases.

Full Transcript

Erik Darling here with Darling Data. We’re going to embark on an adventure together, you and I. We’re going to delve deep into the caverns of how to index queries for better performance. I obviously can’t cover every single scenario, but I’m going to cover many of the most common ones and some of the most common problems that I see when I’m doing my consulting work. And hopefully you will get a at least enough knowledge to be dangerous in your own work. In this first video, we’re going to talk about equality versus inequality predicates, because the way that you design your queries and the way that you design your indexes really has to kind of go hand in hand so that you make sure that your queries actually use your indexes well. And if you have queries that are not performing well, perhaps the way they’re written is the reason why you’re not taking full advantage of those wonderful indexes that you have. Perhaps, you can stop worrying about fragmentation and start worrying about real problems. Be a grand thing. But before we start talking about equality and inequality, I promise we’re not going to go off on any soapbox tangents here about the world at large. But before we do that, if you would like to become a member of this channel and support the channel and all the work that I do to bring you this wonderful content, there is a link at the very top of the video that says become a member in which you can click on that very link.

And you can give me a minimum of $4 a month towards continuing to create this content. It’s a nice thing to do. If you’re in the market for nice things to do that are a little bit less expensive, you may like, you may comment, you may subscribe. You can do all of the above, none of the above. But of course, then you’d be mean. If you just need to be a mean, cruel person. Well, maybe this isn’t the channel for you. We’re all nice people here. If you are in the market for SQL Server consulting help, well, by golly and gosh, I do quite a bit of it. All these things over here to my right. I’m in the market to do them. And if you need something aside from maybe one of these things, well, you can always talk to me and tell me what it is.

And as always, my rates are reasonable. Training. I got some. You need some. It’s about $150 for the rest of your God-given life. Again, you can use that link and use Spring Cleaning the discount code or visit the video description for a link that puts all of these things together for you.

So that you can just go right on ahead and not have to deal with copying and pasting anything. I think by the time these videos start airing, it will be 2025. It will be the beginning of the year. So I guess news about upcoming events will be forthcoming.

Won’t that be lovely for you and me? Because that means I’m getting out of the house. But with that out of the way, let’s do a little disco dance and talk about these inequality and equality predicates. Now, a little preamble on this stuff.

Pretty much since I started reading about indexes, when I started caring about SQL Server performance, the general wisdom has always seemed to have been to you should design your indexes. The key of your indexes to support the most restrictive search predicates first.

That could either be, you know, like a unique column, a unique column of values, or just one that is highly selective in some way. And that’s not terrible starting advice.

If you’re a beginner out there in the world and you see that advice, that is not advice that you need to be terribly suspicious of. You can follow that advice up to a point and probably have pretty well good enough performance for a lot of your queries.

You know, I realize that there’s going to be a lot of questions about, you know, whether you should index for your where clause first or your join clause first or whatever. And the crazy thing is that there are going to be different scenarios where indexing for, like prioritizing different parts of your query for indexing can make a big difference.

But if I were to give you some just basic starter level advice to go along with indexing the most selective columns first, I would say that you should index for your where clause first and then prioritize other parts of the query. That’s not always going to work and that’s not always going to be true because sometimes you’re not going to have a where clause, you’re only going to have a join clause.

And sometimes other parts of the query might come into play as to how you want to design your indexes. Things like, you know, windowing functions, group by, order by, things, you know, columns you’re selecting, very specific restrictive filters on things.

All sorts of things come into play when you’re designing indexes, but we’re going to talk through a bunch of those during the course of the series, so I’m not going to talk too much about those here. One problem that I see pretty consistently is that a lot of databases end up designed with some very non-selective columns that are required for just about every query.

Think about something like soft deletes where most rows are not deleted. Most rows are in play. You might have a status column that has a variety of statuses in it that is still not very selective, but, you know, has outcomes in it that you need to, that most of your queries are going to be filtering specifically for.

Now, leaving filtered indexes specifically out for the moment, a lot of the columns that I see, rather, a lot of the indexes that I see, have those types of columns in them.

I don’t want to say included in them because I don’t want you to think that we’re just talking about included columns because those columns end up in various places in an index depending on how the index is laid out overall.

If it is a filtered index, then they usually end up in the include columns because they’re in the where clause of the index. And having them in the includes gives you a little bit more, gives you a little bit more gusto than leaving them out entirely.

Of course, putting them in the key is fine too, but usually includes is good enough. But, like, for all of these required type columns, whether it’s, you know, you know, like if you have a multi-tenant type database and you have, like, your organization ID or you have, like, the soft delete thing or the status thing that you need to look up, there are all sorts of columns along those lines which are not very selective, which you do need in the index to make your queries fast because you need your data organized by these things.

Some of the index design patterns that stem from that are rather unhealthy. Now, in this post, what I want to look at is both an index change and a query change that can help you out when you have, like, let’s say you have an index that maybe wasn’t designed great, but your query can be rewritten to take better advantage of that index or how to rewrite the index so that whatever way your query is written, you can do pretty well with it.

So I’ve already created two indexes and a constraint on the post table. Sorry. I’ve created two indexes, one on the badges table on name and then user ID, and this index is plenty good enough, and then one on the post table on post type ID and then owner user ID.

Post type ID is taking the place of sort of like a status column because, you know, there are about like eight different, sorry, there’s like eight different post types in the table, so they’re going to, and like some of them are much more selective than others, and the problem is that, like, for some types of posts, an index that leads with post type ID would be very selective, and for other types of posts, an index that leads with post type ID is going to be not very selective at all.

And then I also have a constraint on the post table telling SQL Server that every post type ID that exists in that column is greater than zero and less than nine. All right.

These are hard limits that we have set on the post table. So, in this case, the post type ID column, like I said, it’s going to play the role of our non-selective column that every query is required to have filtered in there.

Granted, that’s not going to be the case with a lot of Stack Overflow queries, but, you know, just go along with me here. Let’s, like, just pretend that all the queries need to find a particular type of thing, but what else you might be searching for depends on what else the module that your users are executing is going to run.

Now, I have this query, and this query is not going to be very fast. I’m just going to walk you through what it does real quick. Let’s start down here a little bit.

So, we are selecting from the badges table, and the reason why we have the index that we have on the badges table is to grab, well, it partially helps with this, but really this part of the query is mostly assisted by the ID column on the users table being the clustered primary key. So, seeking to various user IDs is totally fine and easy.

The reason why we have the index that we have on the badges table is because we start filtering the badges table with an initial, I mean, I know it says in popular question. I guess before there were some more things in there, and I narrowed it down to this to make the demo, like, work in a reasonable amount of time, but this is essentially an equality predicate on popular question.

And then for the post table, this is what we’re doing. And if you remember the index that I created, it was on post type ID and then owner user ID, and then it had score in the includes, which is obviously not going to be a completely ideal index for this situation, but the thing to pay attention to really is what’s going on in the where clause here.

We can just leave this out of the equation for the time being. Now, if we look at the plan that we get for this query, you’ll notice it doesn’t return any results, and I purposely have this not return any results because I want you to see the raw execution time of the query without returning any rows to SQL Server Management Studio.

That’s why I filtered to where that row number column equals zero down here. I want nothing to come back from this. Now, looking, like, more closely at the query plan, like I said before, getting to the user’s table is trivial.

It takes about eight milliseconds of time, but you’ll see that we spend about 32 and a half seconds executing the query. The majority of the time, if you’ll allow me to drag this over, is in this section of the query plan. And in this section of the query plan, we start with a scan on the post table.

And the reason why we start with a scan on the post table is because we don’t have SQL with the inequality predicate, the way we’ve written this, we can’t seek into that index that leads with post type ID. SQL Server ends up building an eager index spool based on this.

Or rather, we could seek into it. SQL Server just doesn’t. SQL Server just says, like, I don’t know. And this is even with that constraint on the table, showing where SQL Server knows that any value less than three is going to be one or two. SQL Server could have converted that to equals one or equals two, but it doesn’t.

So we end up building an index at runtime. And this takes about 21 seconds because there’s about four seconds spent scanning the index. So the majority of the execution time really is in here.

Okay. Now, like, it would be nice if SQL Server did do this, but, like, did do things a little bit more naturally. And, like I said, it could choose to do this, but it doesn’t.

If we look at the estimated plan for this with the 4C Kent on there, this looks a lot friendlier, right? This looks like a lot better of a situation, but it’s not. If I start this running, we are going to be waiting here for a very, very long time.

I have, in the past, attempted to let this thing run to completion, but it runs for about an hour before I give up on it. So this thing will just run and run and run and run and run. This actually doesn’t do any better with the 4C Kent.

Sometimes it is useful to apply various hints to your queries to see why SQL Server doesn’t choose plans. You know, I find things very interesting to do are 4C Kent and option loop join at the end of your query because that will really show you indexing issues with whatever query you’re running. Now, this thing has been running for about 45 seconds, and we haven’t gotten anything back from it.

The main problem with this query is if you look at the index seek, we are doing this. And actually, let me expand this out as well just so it’s a little bit more clear in here. Range columns.

Range columns. So for every time we go into the POST table, we scan for everything that’s less than 3. And the trouble with that in our case is that everything less than 3 is the majority of the table. So every time we go and seek into that index, we end up with a big, big problem.

Having to read through about 17 million rows and then doing a residual predicate on the owner user ID. If you look at the, look at this, we actually have a, do an index seek. Oh, wait, that’s the wrong one.

That’s the user’s table. I’m sorry. Highlighted the wrong thing. I was like, wait a minute. That doesn’t make sense. So we have a seek predicate on a POST type ID less than 3 and then a residual predicate on owner user ID. This is a bad time because every single time we go into this and seek, we seek through 17 million rows and apply a residual predicate on owner user ID.

That is not a very good scenario. If we were searching on a very limited set of POST type IDs, that would be maybe okay. But the way that this thing is going, it’s very much not okay.

So the 4C does show us that this is a bad plan idea. So when we, like I was saying, if you look at, let me start this running. And then we’re going to talk about the distribution of records for POST type ID in the POST table.

I actually have to put that in a new window probably because I’m building an index there. So like I was saying, we have to go through about 11 million records to find anything with a POST type ID of less than 3. 3 is down here with 167 records.

But the POST type IDs that we care about that are less than 3, that’s 1 and 2. SQL Server doesn’t. SQL Server would have to scan through 17 million of those.

Now, what I was saying before, which I think I mumbled up a little bit, was that SQL Server could convert this logically because there’s a constraint on the table that says, like, you know, all of the values in here are greater than 0 and less than 9. SQL Server could logically translate less than 3 with everything that it knows about that data to equals 1 or equals 2.

But it doesn’t do that on its own. Right? It just doesn’t.

It can’t unfurl that logic on us. So I’m creating an alternate index here on owner user ID and then post type ID. And what I want to show you first is that if we get the estimated plan for this, this thing should start using the new index that we created, which is not badges underscore x.

And if we do that, this will actually come back pretty quickly. At least it did at one point. Who knows how long it’ll take now.

But there we go. So that was about 8 seconds. And we can see that this is much improved. All right? This is no longer building an index. This is no longer, what do you call it, building an eager index boolean is no longer taking 30-something seconds to execute this whole part of the loop.

And, you know, having the reverse index where we use the other one, that didn’t cause us, like, this actually does better. Now, if we wanted to write this query in a way that would have it turn out well for our indexes with our original index, what we could do is actually let me tweak this a little bit to say index equals not badges on this.

And if we look at the query plan for this, it is pretty okay. There is still no eager index boole being built in the estimated plan. And because we have separated out our search logic a bit, and we’ve said select this stuff where post type ID equals 1, and select this stuff where post type ID equals 2, we’ve done the optimizer’s job of unrolling less than 3 with the, you know, because, again, with the constraint being greater than 0, less than 9, less than 3 can only mean 1 or 2, or 1 and 2, depending on how you look at it.

But if we run the query and we tell SQL Server to use our original index, not badges, we still do fine with this because SQL Server does two pretty quick seeks, and this ends up taking roundabout the same amount of time as the query with the unchanged less than 3 predicate and the new index that we created that leads with owner user ID.

So with that index, we’re able to seek to each owner user ID and then filter out the post type IDs. Owner user ID is a far less dense column. It’s very, very unique amongst the 17 plus million rows in the post table.

So the highest count for a single user in this version of the database is around 28,000 rows, and seeking to span a range of 28,000 rows for a single user ID and applying a predicate after that is pretty trivial as far as seeks go.

And then another thing that we could do is we could convert the where clause to say where post type ID is in 1, 2, and we could run this, and this should also take roundabout eight seconds. The query plan is slightly different for this one.

6, 7, 8, there we go. Okay, it was a little under 8 seconds, 7.8. Maybe the cache was a little bit better or something on that one. But rather than having two distinct seeks like we had with the query 1, union all, query 2 thing, we just have one seek into this.

And if we look at the seek predicate, we will have actually two seeks in here. One of them will be for post type ID 1, and the other will be for post type ID 2. It’s a little hard to get this with ZoomIt.

Let me see if I can… Yeah. So there’s the first seek for post type ID 1. And then if we expand this one out, we will see the second seek for post type ID 2.

So you can, depending on how your indexes are arranged, you can sometimes change the way that where clauses are written to take better advantage of indexes as they exist now. If you are allowed to change indexes, sometimes that like, but not change queries, like if you’re working with a third party vendor app, and they’re like, no, you can’t change the queries, but you can do stuff with indexes if you’re smart.

One thing that you could do is change the order of some key columns, depending on how the queries are written, in order to have those indexes get used more intelligently by those queries. So there is that.

Anyway, that’s a little foray into inequality and equality predicates in queries, and how those tend to align with the way that you’re in the index key column order, and the way that you design your indexes for those queries to be able to get seeked into.

Sometimes it does work. Sometimes you can write queries so that having a less selective column first works out okay. Other times you have to either rewrite the query or reshape the index so that the intent of your query is better serviced by the key columns in your index.

So with that out of the way, I hope you enjoyed yourselves. I hope you learned something, and I will see you in the next video, which will be about, oh, what’s number three over here? Unpredictable search queries.

So don’t read this yet, because that’s what I’m talking about next. Anyway, thank you for watching.

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.