Learn T-SQL With Erik: Nonclustered Index Interactions

Chapters

*00:00:00* – Introduction
*00:01:33* – Nonclustered Index Basics
*00:04:05* – Index Choice Based on Metadata
*00:07:27* – Clustered Index Scan Cost
*00:09:38* – Unique vs Non-Unique Indexes
*00:12:05* – Index Union and Merge Join
*00:15:17* – Index Hints and Join Types
*00:18:48* – Conclusion

Full Transcript

Erik Darling here with Darling Data. And in today’s video, we are going to continue with our little baby slices of material from my Learn T-SQL with Erik course. And today we are going to talk about nonclustered indexes and their relationship to either the heap or clustered table that they were created on. And of course, this, you know, again, this is not specifically like, here’s how to write a query, Learn T-SQL, SQL Server stores data, and learning about how those data structures relate both to each other and to the base tables that they are created on. In this case, is very, very valuable for people who want to seriously practitioners T-SQL, because without this knowledge, your queries are going to suck. They’re going to be terribly slow, and people are going to be angry at you. You can be as advanced as you want in the writing of T-SQL, but unless it, unless those queries ever finish, no one’s ever truly going to apply and appreciate them. So, that’s why we must talk about these things. So, down in the video description, you will see all sorts of very helpful links for you to give me money. You can hire me for consulting. I still need to tell people what to do with their SQL servers, which is fine with me. I don’t mind doing that. In fact, I’m rather accustomed to it. You can purchase my training materials. There is a link, actually, with a discount code attached for this very course.

Right? Bloop, bloop, bloop, bloop, bloop, bloop, bloop. You can become a supporting member of the channel to keep the wind in my sails, the gas in my engine, whatever. Does the gas go into the engine? I don’t know. I’ve never had a car. But to keep creating this high-quality SQL Server content for you or one of your friends, if you share this video with one of them after you like and subscribe, if you haven’t done either of those already. I don’t want to be repetitive here, but unless I say these things, people just sort of passively watch and then I never see them again.

If you would like to monitor your SQL servers for free, I have a very, very free $0, zero-cost monitoring tool available up on GitHub. It’s this link, but it’s down in the video description. You got to go click on something, right? But it’s totally free, totally open source, no email, no phone home, no weird telemetry data. It’s just all the stuff that I would look at if I were a SQL Server monitoring tool, which now I am. So now you know what I’d look at. Congratulations.

Weight stats, blocking, deadlocks, queries, CPU, memory, disk I.O., all that fun stuff. You know, normal SQL Server monitoring stuff, just way better. And if you are the type of robot-loving person that seems to make up the majority of the IT world today, you can turn on, optionally, if you can opt in to using MCP servers to allow your robots to talk to purely your monitoring data.

And they can make a much better sense out of what I have collected than if you were to just go allow them to run random DMV queries on their own. Because, I mean, look, I’m not trying to get on the case of the LLMs and, like, all that, but they still mess a lot of stuff up, and they still make a lot of stuff up, at least, you know, in my dealings with them.

No, I don’t know that I would just let them go do that. But the MCP tools that I built in here are all very well defined. The schema is all sort of predefined.

The results, like, the columns that they represent is all predefined. And so it’s much easier for them to manage that. There’s nothing to hallucinate there. So they could still make things up, but they can’t hallucinate things existing or not existing, which is supposedly a step in the right direction.

Anyway, if you would still, if you still value human interactions, I will be leaving the house. And when I leave the house, these are the places I’ll be going, right?

I’ll be in, well, let’s do this in order, kind of. I’m going to be at Pass on Tour. I’m going to lose my entire arm up here. I’m going to be at Pass on Tour in Chicago, May 7th and 8th, right?

And then from there I fly to Poland from May 11th to the 13th. And then from there I fly home. And after I’m home for about a month, I fly to date a Saturday, Croatia, June 12th and 13th.

And then I’m going to be home for a little, as far as I know, I’m going to be home for a little bit. And then I go to Seattle, Washington, November 9th through 11th. So get your dance and rain boots on for that.

But for now, it is April. And April is baseball and floating hot dogs. Because this gentleman database is clearly either, that thing is about to drop to the ground.

Or, I don’t know, maybe it’s like floating to his mouth. I don’t know what happened there. But everyone’s got beer and they’re happy because they won.

At least they think they did. This guy looks angry. You know, like angry eyebrows. This guy’s kind of like sad looking. This guy’s angry looking too. Champion’s guy looks happy.

But he’s, again, wearing red. So I don’t know how he got into this picture. I feel like, oh, this isn’t brown. I don’t know. Someone’s about to get shot, maybe. Anyway, let’s talk about nonclustered indexes. Now, we cannot talk about nonclustered indexes without some sort of actually type person piping up and saying.

But what about, what about, what about, I don’t know, chicken little. What if the sky does fall on your head? I don’t know what to tell you.

What, won’t they slow down inserts, updates, and deletes? I’m like, yeah, merges too. They do have an impact on these things. But we’ve got this query over here that’s running for like a minute.

And if we add an index, it’ll run really fast. So maybe reducing that query’s impact on the server sort of amortizes the cost of maintaining this index for the inserts, updates, and deletes. Maybe that’ll happen.

Maybe there are trade-offs that we have to make in life. Maybe there are things that we as professionals in the database world have to test. But make sure that they work as expected.

I know, it’s crazy. So yes, every index is a separate object on disk and memory. It is a copy of the data from either the heap or clustered table.

Every index will add some overhead to modifications because you’re going to have to write those changes to the transaction log. You’re even going to have to lock them to complete writes. All right?

My goodness. Even indexes that are unused by reads. And we hope to get rid of those. We do seek to remove indexes that are not helping our queries go faster. So don’t keep those around. Need to be maintained because the SQL Server doesn’t know when a query might show up and need to use them.

All right? So we need to read those indexes into the buffer pool. Sometimes a little teeny bit of the index if we do a seek. Sometimes a lot of an index if we do a scan.

Right? So the main points that poor indexing habits can introduce, the main problems that they can introduce to your systems, consist of typically these three points.

Buffer pool contention because you have more objects vying for space in your buffer pool. You have more transaction log activity because there are more objects you need to write information about changes to the transaction log. And you may even see increased locking and maybe even increased lock escalation attempts and sometimes even successes.

Assuming that there are no competing locks that would prevent a lock escalation from becoming a success after it gets attempted. So there are things that we do have to be aware of when we are indexing things. So all of these things being true simultaneously, we must come to an agreement, you and I.

We must come to an arrangement that suits everybody. We must compromise and we must say that we need indexes but we should enjoy them responsibly. The same way that we enjoy other leisure activities responsibly so that we don’t end up throwing up on ourselves.

All right? But we also need to query responsibly. Don’t we?

The wider your indexes, or rather the wider your queries are, the wider your indexes may have to be in order for SQL Server to actually use them. All right? Because if we are writing select star-ish queries, or if we have queries that are, you know, attempting to sort of kitchen sink, evaluate 14 different columns in a table, well, if we don’t have indexes that SQL Server considers covering for those very wide queries, if our queries have a very wide berth, we’ve got wide hips on that query, well, guess what?

SQL Server might say, ah, no, I’m just going to scan that clustered index instead. And let’s talk about why that happens. That’s a great question.

Why does that happen? I’ve created this index on the reputation column in the users table. It maxed up one so that I get a more consistent statistics histogram. Sometimes that works.

Sometimes it doesn’t. But anyway, if we, the important thing that I want to show you in this query, or rather across these three queries, is the cost of the key lookup in the plan.

Because remember, our index up here is only on the reputation column. And so when we go to get other columns from the table, and I’m telling SQL Server here for a bit of demo stability that it must use this index, or else it will be in big trouble, then SQL Server has to get this column from somewhere, and so a key lookup gets involved.

If we run this query, and we look at the cost of the key lookup, SQL Server says, I think that’s going to cost 47.3194 query bucks to fetch that one column, that one integer column, from the users table, a total of 15,656 times, right?

Because we have to do that many loops to go get that set of columns, or that column from the clustered index down here. Okay, remember that number, four, gosh darn it, you, 47.3194, alright? What if we say, I need to go get an envarchar max column, because this is a count ID column again, this is an integer, right?

And this about me column, well this is not an integer at all. If that other helpful thing will show up, we can see this, about me is indeed an envarchar maximum, alright? Max envarchar there. How much does this cost?

Well, shockingly, 47.3194 query bucks. Oh dear. So, SQL Server assigns the same cost to going to get a 4 byte integer as it does to going to get a book with a key lookup.

Isn’t that interesting? That completely leaves out that we might have to do, like, lob reads, or read off-road pages, or off-road data, or any of that other stuff. It just has the same cost, right?

Wild. And even still, if we sell SQL Server, hey, go get everything from that table. I guarantee you, my friends, this is not a metadata cost that we are concerned about here. Anyone who talks about metadata in these situations should be hit with a hammer.

We have the same plan, right? But we have, also, we have the same cost assigned here. 47.3194 query bucks.

So, getting everything from the table has the same cost, with a lookup, has the same cost as getting just one integer column from the table, right? So, you don’t even have to do a select star query.

Remember, all the folks on LinkedIn who give you the groundbreaking advice to avoid select star. My goodness, you geniuses, you. They are not quite, not quite all together, are they?

Because you don’t have to do select star. All you have to do is get one extra column, even just one integer column that is not represented in your nonclustered index, for SQL Server to need to evaluate the cost of lookup versus clustered index scan plans, and to maybe not use your index just based on that one column difference.

So, while I do agree that select star makes things a little bit harder, it also doesn’t take select star to get index choice to be a sort of painful point during query optimization.

You can ignore that number. That was a previous run. I guess, I don’t know. I installed a cumulative update, math changed on me, or something. Hard to tell.

But, that’s all. All I know is that the last time I ran through this, that’s how much it cost it did. Funny. All right. So, let’s see.

When we let SQL Server choose which index to use, right? And we’re going to look for reputation equals 16 now. SQL Server says 38,000. Nope.

Scanning the clustered index. That takes 234 milliseconds to do that. Okay. That’s fine. Keep in mind that the clustered index scan is a fixed cost, right? So, if we needed to go get, say, like, that was, how many?

38,000? I think 21 is like 58,000 or something. Yeah, 51,000. This will have the same fixed cost, right?

Because we still have to scan the clustered index and do the same amount of work either way. So, that’s 60, oh, let’s see. 68.3519 query bucks.

And that is, execution plan. 68.3519 query bucks. So, because scanning the clustered index represents the same amount of work, no matter how many rows we estimate might come out of that clustered index scan, it is a fixed cost based on pages and whatnot, right?

So, the bigger your clustered index is the more expensive it gets, at least from a CPO and IO perspective. Lookup costs are dynamic, and they are based on the number of times the lookup has to be performed. I find this to be a very sensible arrangement.

If we were to go look at the execution plan for this, we would see that the cost of this lookup is no longer 47.3819 query bucks, but for one single row, it is, oh, you know what? Oh, that’s the right place. 0.0032831 query bucks.

A very small number. So, lookup costs are dynamic based on the number that SQL Server has to get done. There is some math done around sort of like the first one is expensive, but then following ones are less expensive because SQL Server sort of figures, well, like this one might be from an empty, it’s going to be from an empty buffer pool, it assumes. And then like the next one will, like maybe the day is going to be on the same page, so I can read that from memory.

So, like the first one’s expensive, but then the next ones are, the cost is reduced for each one. So, good things to know when one is looking at query plans and whatnot. There is a fun quirk with unique versus non-unique nonclustered indexes.

In the users table, we have this account ID column, which is unique. It is entirely unique. It is a somewhat monotonically increasing number.

However, it is not exactly the same as the identity column that is the clustered primary key on the table, which is the ID column, but we can create a unique and a non-unique nonclustered index on the account ID column, right? I think I got that one. The other one’s showing me a red squiggle.

Yeah, okay, I got that one. So, when you create a non-unique nonclustered index, the clustered primary key acts as a hidden key column in the index definition. So, if we run this query and we tell SQL Server to use our non-unique index, we will get a double seek into the nonclustered index.

We can see both where ID and account ID are evaluated with the literal values that we passed in. So, we get a double seek there. In a unique nonclustered index, that clustered primary key column is more like an include in the index, where we are no longer able to double seek.

Notice that we do have a seek predicate here to account ID, but now we have a residual predicate up here on ID. So, that’s the truth about that. But when I talked about how nonclustered indexes have a relationship both to the table that they get created on and other nonclustered indexes.

Well, let’s take a look at some of that. There’s a type of plan that SQL Server can use, a type of execution plan SQL Server can use called index intersection. And hopefully, this all works out.

Where SQL Server has taken… Let’s do our magic zoom trick here. Oh, that’s too big. There we go.

SQL Server has done a seek into two non-clustered… So, what’s annoying is that it’s very… I don’t know why those ellipses kick in so soon, but if we were to look here, we would see an index seek on this nonclustered index, and then an index seek into this nonclustered index. And then SQL Server is able to merge join those two indexes together and produce the result that we care about.

And if we look at the merge join, it is a residual, well, this is a merge join, on the ID column in one index being matched to the ID column in the other index. So, when I talked about how nonclustered indexes inherit the clustered index key column in some form, depending on uniqueness, SQL Server does a merge join here because that ID column is the second key column. And so, that data is in order after we do equality predicates to the actual key column that we defined the index on.

Interesting stuff, right? If we look at index intersection, we can even get index intersection with a key lookup, right? And we’re going to hint SQL Server to use those indexes, right?

Because that’s the easiest way to get this sort of plan. But now, we have what we saw before, which is the merge join and the two nonclustered index thingies here. But now, we also have a lookup back to the clustered index to get that additional column, right?

Because now, we’re trying to get display name, but display name is not in either of our indexes. So, SQL Server can do index intersection plus a lookup. But the index hint order is significant, right?

So, up here, we have the non-unique account ID and reputation. And down here, we have reputation first and then the non-unique one, right? So, if you’re going to be the type of person who hints indexes, you may want to consider the order that you’re hinting them in because SQL Server will hit them in the order that you hint them, right? Because we hit reputation first in this one and we hit the non-unique one second in this one, right?

But up here in this query, right? If you look at this plan again, right? Execution plan, this hits non-unique first and then this one hits reputation second.

So, very interesting things when you start hinting in indexes, especially multiple indexes. We also have the concept of an index union plan. This is where SQL Server brings the results of two indexes together, like if you were doing a union or a union all type query, right?

So, this is merge join concatenation. This is not just a merge join. This is concatenating results from both of these.

And we’ll see once again that, well, we don’t really have much of a, we don’t have a join condition because this is not a merge join. It is a concatenation operator. But it still says merge join, which is a little confusing, but that’s okay.

I’m sure Microsoft is on the case, right? But we hit both of our nonclustered indexes and then we did that. And then we can also see that index union with a key lookup is also a perfectly valid plan choice, right? So, that’s all that happening in there.

So, index intersection happens most commonly with AND predicates. You can see, well, you can see there are a variety of different join types that might be chosen, sometimes even a hash join, depending on your index setup. Both indexes must be fairly selective.

The result set must be fairly small. The optimizer doesn’t choose these quite as often because usually it will do either a covering index or, you know, scan a clustered index or just do a regular old key lookup. Those don’t always require hints, but, you know, sometimes if you want it to happen, it’s not going to happen naturally.

You have to tell it to happen. Index union is more common with OR predicates. You see a concatenation operator that combines streams of data from two nonclustered indexes.

You may even see a stream aggregate to remove duplicates, which is a nice thing. Each branch has to be seekable. I guess it’s somewhat more commonly chosen than index intersection, but, you know, whatever.

They don’t get used a lot. Like, they don’t get, those plans don’t get chosen very often. The cost model assumes some overhead to, you know, doing that type of stuff that is unfriendly to them.

It prefers to hit a single index path when possible. Sometimes statistically, or rather, the statistics that you have, the histograms that you have, may not, like, present a very big benefit to the optimizer for using them. And then there’s always some risk of eliminating duplicates, adding a bunch of overhead, which, you know, for a, that’s why they have to be selective.

But they can be very beneficial if you have selective AND or OR conditions, and each predicate has a good candidate index to seek into. So, if you are sort of, you know, hoping for that type of execution plan, then you will have to sort of do some work to get one quite often, either with a foreseek hint or hinting the indexes to use in the appropriate order to use them. Anyway, that went on a lot longer than I thought it was gonna.

I’m gonna go finish my coffee now. Thank you for watching. I hope you enjoyed yourselves. I hope you learned something. Today is Friday. Nah, I shouldn’t have saved a long video for Friday.

No one watches the long videos on Friday. But I will see you on Monday, where we will do yet another Office Hours. And hope that someone out there cares.

Alright. Thank you for watching.

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

Learn T-SQL With Erik: Nonclustered Index Interactions