Get AI-Ready With Erik: Why Approximate Search Matters

Video Summary

In this video, I delve into the world of approximate search performance, focusing on vector indexes in SQL Server and their implications for query optimization. With the rise of vector search functions and the preview features for disk and indexes, it’s clear that Microsoft is moving towards a future where vector data plays a significant role. However, the current implementation leaves much to be desired; the performance of vector distance queries degrades significantly as data sizes increase. I explore this issue by examining tables with millions of rows and gigabytes of data, showcasing how query times balloon out of control. The video also touches on potential improvements and why it’s crucial for Microsoft to enhance their vector index capabilities to support efficient vector search operations in the future.

Full Transcript

Erik Darling here with Darling Data. Living the dream as it goes. In today’s video, we are going to get even more AI ready than we have ever been before by talking about sort of like approximate search performance, right? Because, you know, there’s like a reason why people care and have vested interests in things like, you know, vector indexes, like in SQL Server, or there is a preview feature for disk and indexes, right? And there is also a vector search function, which can use those indexes. The vector distance function does not do that, right? The vector distance function’s job is to take one set of floaty points and figure out how far they are away from another set of floaty points. And what we’re going to find is that performance for that is not so hot as data gets in. So, you know, it’s bigger and this is why Microsoft better get off its keister and figure out how to make creating disk and indexes not be a terrible slog and how to, you know, not make, not have them make your tables read only because without that, I just don’t know where we’re going to go, right? So, you know, run this. And if we look at the approximate size of our stuff in here, the post embeddings table, which is just question titles.

Uh, is about a million rows and approximately four gigs. The answer embeddings table is 2.6 million rows and approximately 10 gigs. So the, like, it’s, it’s just about like doubled in size. Well, I mean, it’s, you know, like, like 1 million to 2 and a half million. If it was double, it’d be like eight gigs, right? It’s like double and a half. So it’s like 10 gigs. We added a little bit more in there. This, like the size of it, the same damn thing happens with query performance, right? It is bonkers, right? So if we, uh, run some stuff here. To evaluate, uh, the speed of a vector distance query against the post embeddings table, that’s the one with 1 million rows in it, right? If we run this, let me look at what we get back. Here’s the query plan for it. Uh, and this took, this was a million rows and 219 milliseconds. Right? So we look at the query plan. Uh, this is what that looks like. You know, we get this whole thing happening here. Uh, the clustered index scan, uh, like we’re not filtering on vector distance.

So there’s like no like filter operator for that stuff. Uh, if we, if we were that, that would be there, but like just to like scan everything and then sort the rows down to what the top 10 that we care about, uh, you can see that we are sorting by expression 1001 ascending, which is the vector distance result. So that took about 200 milliseconds or so. And if we look at this for 2.6 million rows, you know, a million rows is not terribly, I’ll give you that. We look at this for 2.6 million rows. Run this thing. Uh, a bit more sluggish. Uh, this was five and a half seconds for some reason. Uh, it’s usually not that slow. It’s usually, uh, quite a bit slower.

Uh, usually quite a, usually a little bit faster, but, uh, this one, we have to do this whole song and dance, right? So, um, SQL Server, not having a great time with the, uh, the vector distance stuff. So, so really the message here is sort of like the, the, you know, the bigger your data is, which is true for like any, any other principle in database, like the bigger your database, the bigger your data is the, um, the, like the more often you will have to think about various query, query performance things. The thing is with, with, with vector distance, like there’s not a lot you can do to performance tune that unless you have other indexes to like incorporate other search, like obligations in here or something.

Right. There’s like just selecting the top 10 from this table. We do a lot of work for that. It’s like significant, but this is, this was the performance aspect that I, that interested me, that set me down this whole terrible path. learning all this other stuff. So we’re, we’re, we’re going to talk more about that as we go. But, uh, as a wise man once said, it is Friday and, uh, we do not have a lot of time to, to waste on a Friday thinking about SQL Server and vectors and all that other stuff. So this is just going to be a short video, uh, for us to, to, to, to think a little bit about how we might want to manage our, our vector data in the future. Right.

And of course, why it is imperative that Microsoft sort itself out with the vector index stuff so that we can use vector search, the function, the vector search function to make things faster. We are, we are going to talk about that shortly. So that out of the way, thank you for watching. I hope you enjoyed yourselves. I hope you learned something and I will see you next week with a fresh brand new office hours episode. and I don’t know whatever other stuff comes along. So anyway, thank you for watching.

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.