Why Some Indexes Create Slower Than Others In SQL Server
Thanks for watching!
Video Summary
In this video, I dive into why certain indexes in SQL Server can create much more slowly than others. After waking up feeling unusually sleepy despite my usual morning caffeine fix, we decided to use that extra energy to explore the nuances of index builds on different columns. I demonstrate how non-selective columns lead to uneven thread distribution and significantly longer build times compared to selective columns. By examining the properties of these indexes, we uncover fascinating insights into SQL Server’s indexing behavior, particularly when using standard edition versus developer edition. The video also touches on a humorous anecdote about Microsoft support, highlighting the importance of accurate information in troubleshooting complex issues. Overall, this session provides valuable lessons for database administrators and developers looking to optimize their indexing strategies.
Full Transcript
Erik Darling here with Darling Data and a little sleepy. I don’t know why. I woke up this morning, shot out of a cannon, ready to go. And for some reason after drinking my customary two double shots of espresso, I got to feeling sleepy. I’m not really sure what the story with that is. It’s kind of a strange thing when your body reacts. the opposite way to something that usually has a pretty good effect. Anyway, today’s video, we’re going to talk about why some indexes create far more slowly than others. Now, if you’re on standard edition, this likely does not apply to you because you cannot create indexes with parallel threads. You are limited to offline index builds with a single thread, and you’re not going to have these problems because you just don’t have multiple threads to see this sort of stuff. So, if you’re on standard edition, I don’t know, you can create your indexes really slowly with a single thread or you can just watch this video. I don’t know, like maybe install developer edition to see how grand being able to create indexes quickly or more quickly is usually.
Sometimes for the most part. Now, what we’re going to do is we’re going to start, well, I’ve already created them because, I mean, it does, if we look under the armpit down here, let me zoom in under the armpit. I want to make sure that you get the full underarm experience from me. Where’s that time thing? Where are you? Where are you hiding from me? There we go. One minute and 18 seconds to create all these. And I didn’t want to sit there and make you wait for these things to pop up. Before we look at the plans for these, though, what I want to point out is that there’s a link up at the top there at my website, different index build strategies for SQL Server. And that’s not a post that, I mean, I did write the post, but really it’s a collection of links from 2006, back when Microsoft actually wrote useful things about SQL Server, not just like bland marketing material about how new feature is going to drive modernization data, blah, blah, blah, blah, blah, blah, blah. This was actually useful technical information. Good stuff. So there’s like six or seven posts up there at that link to old Microsoft posts about different index build strategies for SQL Server. And some of them you’ll see in here if you read all that stuff.
This link will be in the show notes as usual. So I’ll stick that in the old YouTube description. And without further ado, let’s look at some stuff. So what I just want to get the first four things up here. There’s four in total. But we have four indexes that got created, two of them on the votes table and two of them on the post table. And what I want to show you really quickly before we move on is the columns that these indexes got created on. So the vote type ID column is very not selective. There are like majority of vote types or upvotes or downvotes.
There are some, there’s a decent amount of question marked or rather answer marked as green check mark the answer in there. But then there’s a bunch of other things for like spam, offensive, whatever. And so it’s just not a terribly selective bunch of data. The second, the second index we created was on a column called post ID and post ID is much more selective. Granted, there are some posts with way more votes than others. Right. So it’s like skewed data, but it’s pretty selective generally.
For the post table, we did almost the same thing. We created one index on a very not selective column post type ID because most, most post types are going to be questions or answers. And then there’s like a smattering of other stuff in the table as well. And the next one is on owner user ID. Now, owner user ID is pretty similar to post ID up here. And that, you know, there’s going to be some skew towards users who ask more questions or post more answers. But in general, this is a fairly selective column.
At the far outlier of this is John Skeet, who in the 2013 version of Stack Overflow has around 27,000 or so answers or questions and answers combined. I think mostly answers, to be honest with you. I don’t think John Skeet has ever asked a question from thinking about things logically. At least a question that was not rhetorical. He’s, you know, one of those. Maybe he’s asked, you know, maybe he’s asked questions of other people in interviews, but, you know, I don’t think he’s ever asked a question he didn’t know the answer to.
Fascinating, fascinating way to live life. So let’s look at kind of what happens in here. Now, we’re going to go get the properties of all these things because that’s where all the helpful stuff lives. And we’re just going to expand the rows red thing a little bit. We can expand this too, but it’s not going to really make much of a difference. If you’ve watched other videos of mine, you know that this is the number of rows that a thread handled, and this is the number of rows that a thread produced.
So if we had, like, we don’t have a filter on this index. We had a filter on this index. These threads might have produced much lower numbers. But since we don’t have a filter or anything on here, these threads up here will produce the same number of rows that were read down here. All right. So good to know. Good to know. Good things to know.
And I don’t know what that accent was. It was very nonspecific. I wasn’t making fun of anyone. It was just a voice that came out of my body. Maybe I’m possessed. Maybe I’m just exhausted. Who knows? But if you look at the sort for this non-selective query, this is where things get a little jangly.
All right. If you look at all this stuff, some of these threads handled way more work than others. All right. This one handled a whole bunch. This one handled, I guess, a decent one. This one handled the most, though. If we, like, drew a line down under this 8, let’s see if I can draw a straight line with this thing.
Pretty good. Not bad. I haven’t had my morning drink yet, so it’s a little squiggly. A little shaky. But if we look at this, like, this thread number one handled far and away the most.
Like, no other number is quite as long as thread number one. And then thread number three did, like, nothing. And some of these handled, like, way fewer rows than others.
And that almost matches the distribution of data in the column. And I’m going to show you that in a second. And then if we look at the, let’s stick with the sort, because the sort seems to be where the interesting stuff happens.
And here, if we look at the sort for the index that got created on post ID, the numbers are much, much more even in here. All right. This is a much easier distribution of data. And if we pay attention to the times.
Go away, tooltip. No one needs your nonsense here. It took 40 seconds to create this index that leads with vote type ID. And it took about 18 seconds to create this index that leads with post type ID.
This is the same number of rows going into there. Right. There’s no filter on either of these. And the only thing that’s really different is the distribution of data.
Right. Like, even if you think about it, vote type ID is an integer. But it’s only ever, like, I think there are only, like, eight vote types. Right. So, like, you really only have the number one through eight.
If you have post ID, it’s also an integer. But it’s, like, you know, far bigger integers. So it’s not like there’s a, integers are all four bytes anyway. So it’s not like there’s a big difference in, like, the type of data we’re creating the index on.
It’s just the distribution that makes creating some indexes a lot slower. Right. And we’ll see almost the same pattern if we look at what happened in the index creation for the votes, for the post table. Sorry. If we look at what happened over here.
Holy cow. We only use three threads. And look at this distribution. You could think of that and that as questions and answers. And this is everything else.
Right. So we have about six million questions in the post table. They all ended up on one thread. We have about 11 million answers in the post table. They ended up in one thread. We have about 50,000 other things in the post table. And they all ended up on a third thread.
Now, you might be looking at this and saying, why in God’s name did SQL Server only use three threads to do this? Why wouldn’t we break these things up further? Why wouldn’t we use more threads?
Why wouldn’t we do that? And so you might even think about doing something insane like adding a max stop 8 hint to the index build. The index create script, sorry.
And you would be sorely disappointed to learn that max stop is not min dop. Now, it’s a short digression here. I was on a customer call recently where they had a support ticket open with Microsoft.
And when I say with Microsoft, I say that very loosely. Because Microsoft support is not all just Microsoft employees. Microsoft farms out support to like two or three other companies.
And this was a gentleman who worked for one of those two or three other companies. And the problem generally was that, well, the customer really wanted to get a parallel execution plan for this one query. They didn’t want to change any settings.
They didn’t want to add any hints to the query. They kept, you know, seeing all this stuff. Well, we want to get a parallel plan. And the gentleman from the third party support group working for Microsoft kept telling them, well, just try it with a max stop 8 hint. And I kept having to tell this gentleman that max stop is the maximum dop, but it is not the minimum dop.
If you can add a max stop 8 hint to anything, it’s not going to make that query go parallel. It’s going to tell SQL Server that that query can’t go more parallel than 8. And he refused to believe me.
Now, I’m just going to throw this out there. If you work for a company that’s paying Microsoft for support and you’re unhappy with it, you should talk to me instead. Because, at least for the client that I was working with, they paid $75,000 a year to Microsoft for support.
And this is the type of person who they would get on a call with. Someone with about 18 months of experience with SQL Server doesn’t know their butt from their elbow. I’m going to keep that one family friendly just in case you want to show that to your boss.
And they just don’t know anything. They’re, again, like 18 months of experience max with SQL Server. So, for about the price of one, for a little bit less than the cost of one core of Enterprise Edition, you could have a whole lot of help from me who actually knows something about SQL Server.
Wouldn’t that be grand? So, if we look, so this query, sorry, this index create only uses three threads, which is a little depressing. Max stop 8 doesn’t help because it doesn’t, again, doesn’t set the minimum dop.
There’s no min dop hint, much as I wish there was a min dop hint. We don’t get that. There’s a trace flag and there’s a use hint.
They’re both still technically, like, legally unsupported by Microsoft. But they do work, but they still don’t set a minimum dop. You can use the trace flag or the use hint with a maximum dop, but there’s no, like, you must use eight threads for this.
So, that’s a little bit silly, ain’t it? Anyway, if we look at the second index create for the post table, again, much more evenly distributed. Right?
Everything in there, pretty evenly distributed. Look at all those 21s all the way down. Very, very nice. And there’s, again, a pretty significant timing difference between creating the index on non-selective data versus creating the index on selective data.
Right? Now, there’s a big difference between the votes table and the post table. The votes table is about 53 million rows. The post table is about 17 million rows. So, there are significant timing differences between the two tables.
But within the two tables, creating the indexes with a non-selective leading column can really increase the amount of time you spend building that index. First, creating the index on a selective column first because you just get better distribution in there.
Now, to kind of round things out with this, for this video, I want to show you a couple things down here. Now, when we looked at the thread distribution for the non-selective columns in both the votes table and the post table, this is what the runtime counters per thread looked like.
Right? And I’ve omitted thread zero here because thread zero did not receive any rows. Thread zero generally does not receive any rows. And if we run this query, and we tuck this down a little bit, you’re going to see some numbers that look kind of familiar down here that you also see up here.
Now, I don’t have it completely mentally mapped out in my head, as I probably should. Let’s tuck that up a little bit. Oh, not that high.
Come on. Get down. There we go. Now we can see everything. But if you look… Oh, that didn’t do it. Come on, baby. If you look down here, we can see 733 here. And we can see 733 here.
Let’s see. There’s a 3511 733 here. There’s a 3511 733 here. Let’s see.
Do we have a 41? No, we don’t. Do we have a 203? We do. There’s a 2039371 there. Yeah, there’s a 2039371 there.
And then, you know, there’s a… Let’s see. Do we have an 818? Do we have an 818? Do we have 16? We have an 818477 up here. Look, this is very exciting stuff.
We have an 818477 up here. We have an 818477 right here. So there are a bunch of these threads that only got to work on like a single vote type ID. Some of them spread out a little bit more.
And some of them, you know, just kind of did their own thing. I think… Oh, there’s another good one. Look at this. 3-5-7-3-4-5-0.
3-5-7-3-4-5-0. So you get… Some of these threads did work on just one specific vote type ID. Other ones, you know, again, SQL Server kind of spread that out a little bit.
So that was nice of SQL Server, I suppose. Now, where things get interesting, too, is… And I kind of…
I kind of spoiled this one earlier when I talked about it, but we’ll do it anyway. If we were on this query on the post table, and we look at the post type IDs versus the actual rows, here’s post type ID 2 with 11 million.
Right there. Those are all your answers. There’s post type ID 1 with about 6 million questions. And there’s that there.
And then if we look at this number, 50597, I bet that would just about add up to what you have in here. Right?
So we’ve got a couple 25,000s plus a little. We’ve got a 167, a 166, a 4, and a 2. We’re going to bet if you added those numbers up, they would add up to 50597. So the three threads in here did sort of an unfortunate amount of work.
Like, I think what’s unfortunate about it is, like, if you think about what happened up here, like, SQL Server broke up, like, some of the bigger vote type IDs across multiple threads. It did not do that down here.
Right? We only got three threads. Right? Maybe, like, a fourth thread could have evened this out. And then we could have had, like, a 50,000, and then, like, a 5.5 million, and another 5.5 million, and then a 6 million.
That would have been a little bit nicer. But SQL Server did not choose that. So anyway, if you’re ever creating indexes, and you wonder why some indexes kind of create slower than others, this might be why.
You might be creating some indexes with leading non-selective columns, which sometimes you’ve got to do. Sometimes that’s the wise thing to do. Sometimes that’s what your where clause is on.
You’ve got to respect that where clause. And other times you might be creating indexes on fairly selective leading columns, and you might think, wow, that index created a lot faster.
And this would probably explain why. If you’re ever incredibly curious, and you get the actual execution plan for your index create statements, you might see stuff just like this, very uneven row distributions across threads.
Maybe, like in the case of the post table, you might not see very many threads involved at all. And that could also be part of it. And remember, especially if you are working for a Microsoft third-party support vendor out there, MacStop is not MinDop.
And again, if you are overpaying Microsoft for terrible support, I’m your mans. I can certainly do better than add a MacStop hint to make a query go parallel, because that’s a sure sign of a lack of knowledge.
I hope you enjoyed yourselves. Lord knows I did. I think I kind of pepped up a little bit as I was talking.
Maybe just sitting at my desk was what was making me a little sluggish feeling. Anyway, I hope you learned something. If you like this video, I do like thumbs-ups in appropriate places, and I do like nice comments.
And if you like this sort of SQL Server content, if you like learning more, if you want to know more about SQL Server than Microsoft Support does, and you want to keep watching these videos, a great way to get notified is to subscribe to my channel.
If you do that, you will join… Hang on, I’ve got to get the official number as of this recording. You will join nearly 3,574 other people and celebrating every time I post a video.
That would be fantastic, wouldn’t it? Wouldn’t that be just lovely for you? You wouldn’t have to keep refreshing the page. You would just get a little notification that said, Erik Darling did a thing.
And then you would be able to watch the thing. And you would be a smarter, happier, more well-informed person for doing so. Anyway, I’m going to…
I’ve got stuff to do. Actually, I’m finally getting a haircut in about a half hour. So I should probably prepare myself for that eventuality. And in the next video I record, I’m going to not look like some sort of, like, I don’t know, weird nerd.
It’s a curly Q thing over here. I don’t care that I’m… I don’t care that my hair is thinning. I care that my hair waves when I don’t want it to. I’m in my mid-40s.
My hair is probably going to get thin unless I intervene in some way. And I’d rather just shave my head. And the reason I’d rather shave my head because when you’re a guy with a shaved head, when it’s not due to illness, there is a tremendous amount of responsibility on you to maintain a reasonable weight because you don’t want to have a big face with a shaved head.
At least I don’t. It makes me look very… I look like Dr. Evil if I get chubby with a shaved head. So you don’t want to see that. So anyway, I’m going to go do my self-improvement stuff and I will see you in the next video.
Thank you, truly, from the bottom of my heart, for watching.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.