Why DISKANN Indexes Are So Slow To Create In SQL Server
Video Summary
In this video, I delve into the intricate process of creating vector indexes in SQL Server, specifically focusing on the background tasks and query hints involved. By examining the detailed query plan generated during index creation, I uncover a series of unique trace flags, parallel plan preferences, and other hints that are not typically available to users. This exploration reveals how these elements contribute to the complexity—and current slowness—of vector index creation, while also hinting at potential optimizations for future versions of SQL Server. Through this analysis, viewers gain insight into the inner workings of SQL Server’s query optimizer and the advanced techniques used during index builds that could impact performance and usability in various scenarios.
Full Transcript
Erik Darling here with Darling Data. And in today’s video, we are going to endeavor to answer an important question. Hopefully a temporarily important question. But why are disk ANN or disk ANN indexes so slow to build? These are, of course, the new vector search indexes that Microsoft has given us in SQL Server 2025. And of course, the usual spiel does apply here. This is, this is all pertinent to RC1. So release candidate one of SQL Server 2025. I believe this will be the last one until we get an RTM since in about a month, there will both Microsoft Ignite and past data summit will be going on. And usually that’s about when an RTM build drops. So things could change by then, but at least for the next three or so weeks, this is what happens in the background. Before we get into that, down in the video description, all sorts of helpful links. You can hire me for consulting by my training, support this channel. You can also ask me office hours questions. That is totally free. I answer five of those every Monday. And if you enjoy this content, you should probably like subscribe and tell a friend. May not want to pass this one on to anyone who works at Microsoft for now. Speaking of past data community summit, like I said, yeah, that’s got, that’s happening in beautiful Seattle, lovely this time of year. November 17th to 21st, I will be delivering two days of T-SQL pre-cons with Ms. Kendra Little. They are going to be the best T-SQL pre-cons that have ever taken place in all of human existence. So you have that to look forward to. Anyway, let’s go.
Look at this. Now I am, I am officially in two SSMS spawns territory with how busy I am with things, but that’s a, that’s a good thing, right? Busy is good. So I want to go over a couple of things and then show you the real, the real spicy stuff. So I am using, so start things way back. My, my dear friend, Joe Sack, my bosom buddy, Joe Sack has been doing a lot of work. And the, the AI, the AI, the AI space with things. He’s a, he’s a, he’s a bit more keen on it than I am. But, but he’s been messing around with, with the vector stuff in SQL Server. And so all credit to this database existing goes to Joe Sack. I’ll, I’ll have a link both to his, uh, substack where he, where he’s talking about all this interesting stuff. And, um, the GitHub repo where he keeps this database. If you want to play along, play a lot, play around or play along or whatever you want to do, whatever form of play you’re into. I’m not sure.
Uh, but Joe had a post, uh, actually yesterday, uh, was it? Well, I mean, yes, yesterday is, as I’m recording this, but you know, temporal elements are all, all screwy. When we start publishing things online, uh, about a query that, about like a top something query with vector search. It wasn’t slow. It was just, you had to ask for a much bigger top than you thought, um, in order to get back results.
Uh, but anyway, uh, I saw the query and I was like, wow, that’s messed up. Like there’s gotta be a way to fix that. And, um, it turns out there, there, there wasn’t cause of the way the vector search thing works, but, uh, it all sort of led to this. So, um, what I’m going to do is I’m going to, uh, so Joe has a table in his database called properties.
Uh, and I forgot an S there, didn’t I? There we go. All good. Brand new script, you know, not even saved with the file name yet. So we’re all, you know, a little, a little dicey there, but, uh, I’m just going to create a 20,000 row table based on Joe’s table that has stuff in it.
Right. So let’s drop and do that. Right. So 20,000 rows. And, um, now what I’m going to do is create the vector index. And one thing I want to point out here that’s sort of interesting is I have, uh, actual execution plans enabled.
So when I run these two queries, I of course get it in a, uh, a query plan for the, uh, creating the table and selecting data into the empty table, whatever, whatever select into really does. And then, uh, adding the, um, primary key to it. Right. So the first two things that I do there, cause you need a primary key if you want to create a vector index.
So now I’m going to create the vector index and this, keep in mind, this is 20,000 rows. And while this is not like the biggest beefiest VM that’s ever existed, it’s about 10 gigs of memory, the four cores, something like that. This is 20,000 rows. Right.
And then we’re waiting and we’re at 10. I mean, if you, if you look, uh, sorry, this elbow, you can see like 14, 15, 16, 20, 20,000 rows is pretty slow. So, and I did get to wondering why 20,000 rows was so slow and I’m going to show you in a moment what’s going on while a vector index builds.
But, uh, what I wanted to show you before that are maybe some other things that, um, you, you, you maybe won’t be so thrilled about and maybe something you will be so thrilled about when with vector indexes. So this all took about 30 seconds for 20,000 rows, which is not, not a good time, uh, to be honest with you. But of course, RC1, all this stuff gets better changes, you know, just currently things not so hot.
Um, so just like columnstore indexes, uh, when they first came out, adding a vector index to your table, uh, does not just make the, the column that’s indexed read only. It makes the entire table read only. So if like, if I try to update a column in the table, it’s called bedrooms, right? And I remember I created the index on this description vector column and, uh, I guess SSMS still hasn’t quite worked out all the like new stuff going on in there.
But, um, at least SSMS 21 has it, maybe SSMS 22 has, I should give that a shot someday. It makes the entire table read only, right? So data modification statement failed because tiny properties has a vector index on it. So you like the entire table is read only like you can, it’s not just like the vector index column, which is, I don’t know, maybe not great if you have a transactional system.
Uh, but two nice things that I do want to say about vector indexes is that they do not support rebuild or reorganize. So you cannot rebuild or reorganize a vector index. So I suppose Ola will have to get to work making sure to exclude vector indexes, at least as they currently live.
Um, but anyway, what I really wanted to show you today, what the, the real meat and potatoes stuff is what’s going on in the background while a vector index builds. Okay. So we’re going to start creating this and we’re going to run a little query over here that shows us what’s happening while a vector index creates.
Um, SP who is active strangely did not pick up on this, but, um, you know, whatever. So, uh, we get a query plan and some query text back. The query plan actually shows us a query plan for what’s going on.
Like when I created the thing over there, there was no query plan when I hit, when I said, Hey, like make this index. And all of a sudden the query plan was missing. So this is like the query plan.
And if we zoom around, we’ll see, uh, you know, uh, there’s tiny properties and here’s the vector index seek. We’re doing something over there and well, there’s tiny properties again. And then if we scroll over this way a little bit, we will have a clustered index seek into this graph edge table, 120, something or other.
And apparently we’re, uh, we’re missing some statistics on this table. I don’t know, Microsoft, you ought to create some statistics on that one. I think, but, uh, anyway, the query that’s running while all this happens is not a typical create index query.
No, no, no. It is this whole wacky wild adventure. So while that index creates, right.
I’m just going to format this stuff a little bit. So it fits on screen a little bit better. Um, these are some, I guess, parameters that get passed into whatever we’re doing. Um, we declare dop, right, which is an integer and equals four.
Um, I think that might be, um, I think that might be related to my system, particularly since I have four CPUs in there. And then we start running this query. So we have a, if random start ID equals zero, we do this, uh, and we get some stuff in here.
And this, this, to me, this is interesting. So the, the query that’s running in the background uses the enable parallel plan preference hint. This, this hint, uh, has up to now been completely unsupported, almost disavowed by Microsoft.
This does not show up in the valid use hints DMV. So maybe this is a sign that this hint is now safe for all of us to use in production. Right.
That’s exciting stuff, right? That’s big news there. Uh, and then we, uh, well, let’s see, we select a top one thing here that Microsoft might need to get itself a copy of SQL prompt because some of this formatting is an atrocity. Uh, I can, I can, I’m willing to share my license with you, Microsoft, if you’re, if you’re really feeling, I know there’s been a lot of layoffs lately.
So if you’re really feeling a budgetary pinch, we can, we can talk later. You just let me know. Uh, so we get the property ID and we’ve got a table sample here with repeatable one.
Hmm. All right. And then, uh, this is where I guess things start getting interesting. This is that graph edge table. Uh, I guess at least one of them that we saw in the query plan.
And we, uh, insert some stuff into there, right? Uh, we select some properties and, ooh, cast that as binary 192. Very exciting stuff.
And then we do this row per page and oh boy, we are, we are using the fizz lock in here. My goodness. Huh? There we are. We are fizz locking things. That’s, that’s a fun one.
All right. Uh, and then, uh, apparently down here. So like the, the query plan we saw was most likely the stuff that’s coming up down here because there is a bit of a loop or something that goes on, I think, or at least, uh, you know, in RC zero, there was a loop.
I forget what they do in this one. Maybe. Oh yeah. There’s a batch start and batch end. Anyway, uh, there is some interesting stuff that goes on in here. Um, uh, what was it? Uh, did it, did, uh, maybe it’s down here.
Uh, there’s a, uh, there’s a function that we can’t use is disc and prune, right? So this has got a whole within group thing. So this is some kind of like probably internal windowing function that, uh, that, that Microsoft has access to from like a system thing that we don’t have access to.
I haven’t tried any of that yet. Uh, and then down here we have a query with a bunch of hints on it. We go to merge union, very exciting stuff. We have our parallel plan preference hint again.
I don’t know. This is feeling very production ready to me now. And then we’ve got some trace flags. Now, some of these trace flags I am familiar with. So there’s stuff for like, uh, nested loops batch sort, disabling nested loops batch sort. There’s stuff for disabling spools.
There’s stuff for disabling nested loops pre fetch. Yeah. But this one’s 7454. I don’t, I don’t know what that one does. Be honest with you. I just don’t. Um, I don’t know every, every single trace flag, but then we exec sp vector index build update stats.
We update stats, uh, after doing that. And then, uh, we have a batch start where we select the max property ID and some batch size stuff in here, right? All good things.
Uh, and then, uh, I don’t know the sort of some of this stuff, uh, more batch size stuff. And then where was, there was something else. There was something else in here that was very interesting to me.
Um, uh, where was it? It’s not this thing. I mean, this is sort of the index creation stuff that goes on in the background, right? So like when you create a vector index, you tell SQL Server, whether you want it to be, uh, cosine or one of the other mathy things that they have, but then there are some options that you, at least currently, I don’t see a way to specify them, but, uh, like stuff like this, that shows up in the sys dot, uh, whatever vector indexes DMV there is, but it doesn’t show up as like a creation element right now.
But then actually this, this one was, uh, was interesting as well. So use hint, uh, disallow batch mode. Again, we have enable parallel plan preference, but then there were, there were some other funny ones and I don’t, I, I doubt these are going to be, um, available to us mere mortals, uh, out in SQL Server world, but parallel DML, so we can disable batch mode, enable parallel plan, enable parallel plan preference, parallel, enable parallel DML for vector index build and parallel DML using independent transactions.
So this is the, these ones I have not seen before. I don’t, I doubt these are valid use sense for us, but, uh, I might go and try those. And then we have, uh, the same query trace on stuff that we saw before. So I’m not going to explain that one again, but then we are optimizing for, uh, certain values for the batch start and end.
And, uh, then there was some other interesting stuff in here, uh, like this pack int function. Do we get to use that? I don’t know.
I, I, I, I, I’m, I’m just seeing all this. So, uh, I, I, I want to figure that out too. And, uh, good Lord. Come on, put cross apply in a new line. All right.
So then, uh, well, there’s unpack, there’s unpack int. So we get to pack int, pack an int and then unpack an int. That’s a fun, that’s a fun pairing of things. And down here there was a nut. Well, there was disk and prune, but then there was, oh, where was it?
Uh, I don’t know. This, this stuff, this stuff all gets rather difficult to read. But then, uh, you know, when, when people tell you to be careful with query hints, my Lord, Microsoft. My Lord.
Uh, we have some of the same things that we saw before with the, um, uh, disabling batch mode, enabling parallel plan preference. Uh, I apparently deleted something there that I shouldn’t have. There we go.
Line that back up. Uh, and then the parallel DML stuff that we saw before. And, and here’s a, here’s a fun one. Uh, Microsoft turns off eager index spools in this query. They’re like no eager index pools for you.
Select. So with sell to index on the fly, uh, that is the, that is the optimizer rule that would build an eager index pool in your, in your query plan. So Microsoft apparently anti eager index pool as well. So high five Microsoft.
We share, we share, we share a mutual loathing for the eager index pool. And then at the end, of course, we, uh, we, uh, we, uh, we sp vector index build update stats again.
So right now I would imagine that like, since this, this is all like very loopy stuff going on or like, um, I imagine that this, this kind of thing is why, um, building vector or disc and indexes is so slow currently.
Um, I would also imagine that this is why they are also currently, this also currently makes your whole table read only. Cause if you were to like run a command that changed stuff in the table and you had to run all this stuff over again to like figure out, like, like, like rebuild your vector index, um, or like update your vector index, like during like regular transactional processing, that would be kind of a nightmare.
So I understand why we are not, we do not currently have, um, writeable, uh, disc and indexes. Cause that wouldn’t be fun, but, uh, I really do hope that this stuff gets sorted out. Um, if not by RTM in three weeks, no pressure guys, uh, then hopefully in the near future, because, um, you know, uh, it would, it would be a terribly uncompetitive thing for a SQL Server to not have, uh, writeable, uh, vector indexes.
Um, and like, again, trade-offs and stuff, right? Like Oracle has writeable vector indexes. Postgres has them.
Elastic has them. I forget Mungo, but like lots of other database engines have writeable indexes. I don’t know what trade-offs they made in order to accomplish that. Perhaps Microsoft, there were, there were certain trade-offs Microsoft was not willing to make, uh, in doing that, or perhaps there are other limitations involved that, um, we, we, we, who do not have the source code are unaware of.
So anyway, if you, if you, if you’ve heard people talk about why, or talk about, uh, disc and indexes being slow to create, uh, hopefully this answers your question as to why they are slow to create.
Um, I think it’s, it might be worth, uh, looking at this. So this is the full properties table or a full or properties table. I, I, uh, there were some problems earlier.
Uh, I don’t want to talk about that, but, um, there are 400,000 rows in the pro, like I just reinserted some extra ones. The one that you download from Joel only have 100,000 in it.
So you have to beef the table up a little bit. But, um, when you create, when I tried to create a vector index on the full properties table, the, um, the query looked a little bit different. Um, I think it, I think it was doing more to batch it, but, um, it might not be terribly interesting to go find more of that in here at this point.
But I don’t know that, that looks kind of the same to me. But anyway, uh, at least now we know we have some answers as to, uh, why building the disc and index is not terribly fast as of RC1.
Okay. As of RC1 disclaimers and all that. Uh, thank you for watching. I hope you enjoyed yourselves. I hope you learned something and I will see you in the next video where we will, um, I don’t know yet, but actually I don’t, I don’t have anything else planned.
So it’ll, it’ll be a surprise to you and me. All right. Thank you for watching.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.