Get AI-Ready With Erik: Vector Indexes Just Not There Yet

Get AI-Ready With Erik: Vector Indexes Just Not There Yet


Summary

In this video, I discuss vector indexes in SQL Server 2025, focusing on their current state as a preview feature. I walk through the process of inserting data into a table to prepare for creating a vector index and explain why these indexes are not yet fully production-ready. I highlight key limitations such as read-only tables, lack of multi-column support, and the absence of inline index definitions, emphasizing that while they offer some functionality, there is still much work to be done before vector indexes can be considered mature features in SQL Server.

Topics

`SQL Server 2025`, `Vector Indexes`, `Vector Data Type`, `Vector Distance`, `Vector Norm`, `Vector Normalize`, `AI Generate Embeddings`, `AI Generate Chunks`, `Preview Features`, `Read-Only Table`, `Clustered Primary Key`, `Compressible Columnstore Index`, `Vector Index Quantization`, `System Databases`, `Inline Index Definition`, `Multi Column Vector Index`, `Multi Key Vector Index`, `Enterprise Database Product`, `SQL Server 2025 Update`, `Erik Darling`, `Darling Data`, `Get AI Ready with Erik Course`

Chapters

Full Transcript

Erik Darling here with Darling Data, and we are just waiting for the last, oh, I don’t know, thousand or so records to go into this table to make it an even 10,000 rows so that I can show you some stuff about vector indexes. You can see maybe from, let me, well, you know, let’s go in here and let’s zoom in a little bit. We have local zoomit running in here. So, I was putting, like, 8,000 rows into the Taylor, and it’s gonna, it’s gonna take a little while, right, because I had 2,000 rows in there, but I wanted to get it up to 10,000 rows. So, you know, I’m just waiting on that to finish up. We’re almost done. Yeah, it’s 7,500 out of 8,000 embeddings, and, oh, I don’t know, let’s see.

Did this finally stop moving? Yeah, it did. All right, so our local olama has finished creating embeddings. Some input tokens were not marked as outputs. Okay, that’s great. Well, anyway, nice to know. Let’s talk a little bit about why vector indexes are just kind of not there yet.

This is all part of my Get AI Ready with Erik course. You can buy it with the link up there that is also handily down in the video description below. And the nice tidally added coupon code will get you 100 full US American dollars off. So right now in SQL Server 2025, the vector things that you have generally available are, of course, the vector data type, which currently only allows for a float 32 in GA.

You have three vector centric functions, vector distance, vector norm, and vector normalize. I go over all those in more detail in the full course material. I’m not going to, I can’t, I can’t do everything here. Give it all away for free. What’s the point?

You can create external models. So you can do things. Well, not like I just showed you. Well, actually you could do that. I’m going to show you just not in this video, but you can use like functions within SQL Server, like AI generate embeddings and AI generate chunks to sort of do the job that the Python script I was just showing you would do. So SQL Server 2025 has a database scope configuration with the ability to allow you to use preview features, which you have to explicitly opt into.

So you don’t accidentally use a preview feature in production and then be like, oh, I didn’t know. Whoops. If you want to turn on preview features, if you just don’t care, like I don’t care about this local VM. So screw it, just do whatever I want with it. You can do that by saying, by altering the database scope configuration to set preview features equals on.

Preview features has not quite made it to blue text in SSMS 22 point, whatever this is yet. I don’t know. Is this thing telling me there’s an update? What is this saying? Oh, there is. 22 point, 2.1 is now available. Well, we’ll get to that later.

Maybe, maybe preview features will turn blue in that one. We don’t know. But creating a vector index requires a single column integer primary key of the clustered variety. And of course, a vector column vector data type column on which to create it.

Our post embeddings table meets all of those qualifications. And as long as everything goes right, we should have exactly 10,000 rows in there. Now we clearly don’t have a vector index currently on the table because if we did all of those inserts that we were just doing would have failed.

All right. So we can run this and it’ll just say it wasn’t there. So we did not do anything. But like there are two, I mean, the two main problems with vector indexes. One, they make the table that you create them on read only in full, not just the vector column, the entire table read only.

No, no writes whatsoever. Insert, update, delete, merge out. All right. No. I mean, obviously no partition switching because you can’t partition this stuff, but all the things in the world. So what I want to show you is what happens when you create a vector index in SQL Server by running this.

All right. And like I’m not going to go through all the code because it’s a pretty hellacious mess. But, you know, this thing took about 10 seconds for about 10,000 rows. You know, fairly long time for an index create.

I believe it is sped up with cumulative update one a bit for 2025. But this is the this is the query that currently builds a vector index. And there’s a lot of stuff going on in here. Right.

Like this is not a normal index create. And it is kind of funny that like when you when you like turn on execution plans, normally when you create an index, you get an execution plan back. Doing it for creating a vector index, you don’t get an execution plan back for it.

But you do get this handy warning when you create a vector index that the join order has been enforced because a local join hint is used. So, you know, real, real professional there. And that’s that’s always a good sign.

But this is the code that currently does it right where we’re doing all this wacky stuff. And this is, you know, not how a normal index gets created. Microsoft right now has to do all of this stuff in order to build a vector index, which is why it’s so slow and why this is a preview feature.

And this is not, you know, production ready for anyone, because if you like to tell someone that, like, you know, like when you create an index in your database, it runs a series of like, you know, I don’t know. I don’t know if internally this is from a store procedure. This is just ad hoc SQL that runs.

But if you were to tell someone this is what you’re doing, like a serious database person, I think there’s a pretty good chance you would get like laughed at pretty hard. You know, cool workaround. But this isn’t this is not what you expect from an enterprise database product.

Not not at all. Yeah. So anyway, this is why vector index is preview only.

And once you create a vector index, there are some other some other things in there, too. Right. Once you create a vector index, you can see a bit about it by using.

I mean, it’s kind of also kind of we have to use sys.internal tables and join that off to some other stuff to find like information about it. But like like this is looking at the post embeddings table. This is the full table that we have, which is like one point something million rows.

I think it’s a you know, it’s a pretty big index. And it’s not like you can compress this thing. I mean, you can create a columnstore index, which is compressible. Like you would still have like the base table or like if it was a non clustered columnstore.

So like you could there is some compression that like kind of works on this, but not not like page or row compression. Right. So this is the index that you get back.

There’s also this other funny row called vector index quantization table, which apparently would be like a compressed vector index thing. I don’t have the full details on that, but rather anything more than what I see there. But it’s not in use yet.

There’s also sort of a funny thing where like you can’t turn on preview features and system databases. So like when I when I when I was first reading about this and I was like vector indexes are read only. Well, that’s kind of whack.

But we could we could like maybe put them into a temp table, put like some stuff into a temp table and then create the vector index on like a smaller set of things and then search like in that. But since you can’t turn on preview features and system databases, you can’t create a temp table with a vector index on it. Which I don’t know, I mean, you know, you’re messing with my work around here.

You’re messing with like the like the perf tuner stuff in here. I don’t I don’t like I don’t particularly care for that. There’s also some other funny like interoperability things where like like most index things you can like create in line with the table definition.

You can not do that with vector indexes. Vector indexes don’t allow you the inline index definition thing. You get one vector index per column, meaning that if you for some reason wanted to like measure like if some reason like you wanted like, you know, like like Euclidean or dot product or you wanted to mix cosine with the other two, you couldn’t index the same vector column two ways, which is weird. And you also can’t have multi column vector indexes, you know, like the vector index like no includes, which is, you know, not awesome for people who might want to like, you know, hit their vector column and, you know, select other columns, things like that.

But, you know, there’s also like like like like no multi key vector indexes. So there’s kind of a lot of stuff about the vector implementation right now that feels really half baked. It feels like this was like rushed out the door, like there was a lot of pressure to get, you know, some movement on this.

And, you know, it just shows us really unpolished, you know, I’m not going to say unprofessional because screw it. Look at me. But, you know, it’s just like really unpolished as as a as a feature at the moment. And again, like we’re not we’re not getting like at least I’m not getting any like feedback about like roadmaps or timeline for when any of this stuff will get worked out.

So, you know, I can’t tell you any of that. Maybe maybe some MVP out there can can tell you because, you know, they’ll they’ll they’ll they’ll kiss the right end of someone and get some details. But that ain’t me, babe. All right. Anyway, vector indexes, maybe someday.

Right now, you know, no, no, no, no buenos to be had. All right. Thank you for watching. Thank you for watching.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

Get AI-Ready With Erik: Vector Search Function, Nice Someday

Get AI-Ready With Erik: Vector Search Function, Nice Someday


Summary

In this video, I delve into the fascinating world of vector search functions in SQL Server, specifically focusing on how they can transform your search queries by making them more efficient and accurate. I break down the intricacies of using the vector search function, which returns a single column called distance, and explain its unusual syntax and functionality. By walking through a practical example with a full post embeddings table, I demonstrate how to use this function effectively while highlighting its quirks, such as the odd behavior when aliasing tables and the necessity for a top number of rows to be specified. Additionally, I compare vector search against vector distance, discussing their performance differences and providing insights into optimizing your queries for better results.

Topics

`SQL Server`, `vector search function`, `table-valued function`, `distance column`, `cosine metric`, `vector index`, `query plan`, `full post embeddings table`, `cross-apply`, `embedding tables`, `foreign key`, `column store indexes`, `vector distance function`, `filtered searches`, `SQL Server preview feature`, `exact search`, `course recommendation`

Chapters

  • *00:00:00* – Introduction
  • *00:01:30* – Vector Search Function Basics
  • *00:04:07* – Query Plan and Performance Comparison
  • *00:05:31* – Filters After Vector Search
  • *00:06:21* – Top Number Considerations

Full Transcript

Erik Darling here with Darling Data, and continuing on with more vector-y nonsense, I’m going to talk about why the vector search function will someday be able to make your search queries suck a little bit less. So the vector search function is a table-valued function that returns one column called distance. The inside of the function is very, very strange, at least syntactically to me, because you can’t just use like position, like at least I haven’t been able to get it to work with like positional stuff. But you have a table that equals, you know, whatever table you want to hit. You can alias that table here and then use columns from the table in here outside of the vector search function, you know, cross-apply-ish. Then we have the column in this table, right, that has our, that is a vector data type that has our vector index on it. We have the original vector that we want to compare and find things that are, that are similar to it similar to it. And then we tell it which metric to use, which, you know, because we’re doing all similarity search, we want cosine. And then we have to give it a top number of rows to pull out. I’m going to talk a little bit more about this at the end of the video, but this is, this is an odd, this is an odd duck to me. So this is like sort of the basic breakdown of it. Now, what we’re going to do is just search a little bit and just kind of look at a query plan for one, using the vector search function against our, our full post embeddings table, just to find 10 rows, right?

So like, obviously this, you know, stuff is all in there. It’s kind of weird. Is it like, you can’t alias this, right? And if you say like PE dot embedding, uh, it’s like, nah, I don’t know you, right? Like what? It’s kind of, it’s a little bit odd that that’s a, that’s a thing there, but you know, what can you do? Oh, green screen. Never knew you cared. So, uh, we run this query. We will see our vector index in use. And you can see that you, you, you, you, you can actually seek into a vector index, which is, is going to be great. Um, what’s kind of weird is that you will, you always get a one row guess out of the vector index, which is maybe not great. Like I’ve, I’ve messed around with this and it’s always one row and you get this new vector index seek, uh, icon, which looks a little bit like someone sitting down doing something lewd, but I don’t know. That’s maybe that’s just my Rorschach test, but, uh, the only column that you get back is the added VS dot star. And this is aliased as VS. The only column you get back is this distance column from there. If you want other columns, you have to either, they have to either be in the table.

You store your post and bet your, where your embeddings are, or like you have to join off to another table. So for example, if we wanted to get information from the actual post table, uh, for like rows that the po like matched with the post embeddings table, we could join like the, again, the syntax is very odd. You can join like the table that you reference in here outside over here. I mean, it’s kind of like cross supply ish, but you know, it’s just like an odd thing to see. So now we can get like base table data out and also, you know, search for similar embeddings at the same time. So it does take a little bit more there again, like, like, like to start your, you’re going to want, not because of the vector index thing, but to start, you’re probably going to want to keep your embeddings in a different table, uh, with like a foreign key that like, you know, like maintains the primary key from whatever table your text data lives in.

Cause you don’t, you’re not, you’re not really don’t want to blow it out your main table with all the embedding stuff. It’s pretty big right now. And there’s not really like a great compression story for it, especially if it’s like all in like, you know, some historical legacy base table, um, like new tables, you could create columnstore indexes on, but like old ones, it’s not good. So let’s just do a brief comparison here between, um, vector distance and vector search. All right. So we’ll run these two queries and we’ll get some timing from them because timing is important, right? So the vector search, uh, just for a small table takes about a second right here. And the vector search, uh, is about 16 milliseconds. And of course, you know, that’s, that’s exactly what sort of, that’s sort of exactly what you would expect because being able to seek into an index, even if it’s a vector index, it takes a long time to create only ever gives you a one row estimate and, um, you know, makes your table read only.

At least it still gives you a faster query. Whereas with, with vector distance, vector distance, you don’t get that. What’s interesting. I think about this query plan is like, I expected to see like a filter operator way later in the query plan for the, for the vector distance. But this one, it shows up at the nested loops join, right? Like this is where we do our filtering.

Cause I, I, I had stuck the vector index in a where clause up here. Uh, there we go. All right. So it was just like where vector distance is less than 0.2. And, and like, just, just to have a number in there. And I was like, ah, where’s the filter operator, but no, there it is sitting at the nested loops join, like a big weirdo.

So, I mean, I guess that’s, that’s, that’s better than, you know, a lot of other things. It might, uh, result in the late filter in SQL Server, but you know, still, still, still a little funny looking. Um, so one thing to understand about the vector search function though, is that any filters, um, that you might have in like the outer part of the query will apply after vector search does its thing.

Not before, not during, after. Inside vector search, you have to specify the top end number of similar vectors that you want it to return first. And then your where clause might filter those results out.

So if your filter eliminates like a lot of the results that the top end similarity thing finds, you might get way, way fewer rows than you expected back, which might, might be fine for some, some searches, right? That might be totally okay, right? You might be like, give me the top 10, then I got to weed them out a little bit, or give me the top 100, then I got to weed them out a little bit.

Um, but sometimes if you are expecting to like, you know, get back a top 10 or a top five or anything like that, you might need to put a much larger top number in here in order to get those rows out. Right. In order to get that number out.

And it’s, it’s really hard to sort of predict that ahead of time. So you might have to do a lot more work to like, you know, like say I want the top 10. Like I say, I want like the top 5,000 or 10,000 here and then apply whatever search stuff and then still give you this back. Like, I don’t know.

It’s, it’s, it’s a real weird situation in there. But, uh, if you’re using filtered searches, use a larger top end to account for filtering. Um, you know, again, you want to create separate embedding tables by, uh, you know, like by category so that, you know, you can, you know, sort of manage them separately and differently. Or if, you know, you’re, you’re not going to go the vector index route.

If you’re not like in, you know, just like get down with the preview feature sickness, uh, you know, right now you can just use a vector distance function. You can get exact search, but which is, you know, and still, you can still filter on that, but you know, it’s just, it’s just slower because there’s no, like it can’t take advantage of an index. So, anyway, that’s about a good here.

Thank you for watching. I hope you enjoyed yourselves. I hope you learned something. I hope you will buy my wonderful course, Get AI Ready with Erik, and I will see you next time. All right.

Goodbye.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

SQL Server Performance Office Hours Episode 50

SQL Server Performance Office Hours Episode 50



Chapters 

To ask your questions, head over here.

## Full Transcript

Erik Darling here with Darling Data. Another exciting Monday is upon us, which means another exciting, thrilling, breathtaking, blockbuster, record-breaking, I already said that, maybe, I forget, episode of Office Hours is upon us at long last. Here is your chance to shine, SQL Server community. I answer five questions every week doing this. So you could have your question asked and potentially even answered. You know, no voting required. You just stick it in there, I’m gonna read it. So you could throw your question in there and I’ll do it. Down in the video description, you can figure out how to do that.

I had to update the link sort of recently. There was a weird snafu with a redirect plug-in on my WordPress site. Apparently, the dash in there, I was redirecting it to blog posts about office hours instead of going to the page where you can ask questions. Maybe that would explain why there’s a slight drop-off in questions for a minute, but it’s fixed now, so you can do that.

Down the link, down in the video description, there’s all sorts of other links. I’ve redone my training site quite a bit. There’s all sorts of new consulting offerings there. There’s all sorts of new training offerings there. So you should check that out and click on things and give me money.

As always, you should like, subscribe and tell a friend too, because, you know, I do like to see the numbers go up when I hit refresh. Obsessively, my YouTube app, looking for comments and numbers going up. Life as hell.

So, look, I had a couple of screw ups on this the last time. I forgot to change the dates for Croatia. The dates for Croatia are now correct. I will not be in Croatia and somewhere else on the same dates anymore.

You know, copy and paste. Look, I make a lot of copy and paste mistakes in my life. It’s just my way of living. So I live with it. I deal with it. I move on. Croatia is now completely aligned and correct.

I will be at all these places and I will be delivering my advanced T-SQL pre-con. Of course, all attendees get free access to the Learn T-SQL with Erik course, which is the backing material to everything I’ll be teaching. You know, we only I only have like six and a half hours for these pre-con days to tell people stuff, but I have like way more material than that.

And if you just a quick note, if you have purchased the Learn T-SQL course, there’s a bunch of new advanced material in there that I’ve been I’ve been getting out as tech review completes. So if you have it and you were looking forward to the advanced material, maybe you even completed the beginner material. It’s all it’s it’s there’s a bunch of new stuff there if you haven’t picked up the course yet.

Now’s a good time because like we’re like getting close to the end there. Right. It’s almost all completely done. Anyway, still January still 2026, at least as far as I can tell.

So we are. Well, I mean, at least I am locally still very, very cold. It is currently. Let me take a look here. How many degrees is it? Not a lot of them. I’m going to tell you that much. It is currently 31 degrees.

It feels a lot feels it felt a lot colder outside. I’ll tell you. Well, Saturday is a high of 16. Eat my life. All right. Let’s answer some questions while we’re nice and cozy and toasty indoors here.

Do to do. First question. How can I find out why the clustered index update node in an execution plan is slow? I have this problem on many servers, so I just need the general principles.

Well, if you are hopefully judging this by getting an actual execution plan and not looking at just simply at costs or percentages, you could you could try looking at at the weight stats for the query that might give you some like if you like right click on like the fine, like the root operator in the plan, you go to properties, helpful little properties tab opens up on the right side of the screen, which should if you get an actual execution plan have weight stats in it.

That’s that’s that’s one way to find out. Another way, if you know, if you because, you know, I’ll be honest and Microsoft lies about a lot of stuff in there, right? It’s like, yeah, or they don’t include everything they should.

You could also like run sys dot query against sys dot DM exec session weight stats, which would show you the weight stats for your session. So you could run the update and look at that.

If nothing obvious sticks out there, then it could be something behind the scenes that, that is going on. Other things that you might want to look at is if you hover over the clustered index update, you might see that there are lots of nonclustered indexes listed in there, right?

Because it’s going to say like object and it’s going to tell you all the objects. SQL Sentry plan Explorer back before SolarWinds bought it and ruined everything that the SQL Sentry company created. Used to have a really nice thing where if you had a clustered index update that updated multiple nonclustered indexes under it, it would actually like tell you like clustered index plus number of nonclustered indexes, which is fantastic, right?

Because you can very easily see it’s like, oh, clustered index plus 17 nonclustered indexes. Well, no wonder. It’s a lot of running around that might be going on in there. Um, you know, if, if it were something like maybe a, um, if it were something like maybe, what do you call it?

Um, like, like, it’s probably not going to be, I mean, I guess it could be a trigger. Uh, I would look for triggers. If it were an index view, that would show up in the query plan.

If it were foreign keys, that would show up in the query plan. So I guess you could look at triggers on the table. Uh, like, like aside from that, uh, other things that you might look at are, um, you know, like really like kind of depending on how many rows you’re updating.

Um, you may find that, um, like there’s just a lot of background activity going on. Um, you might consider, you know, unfortunately you might have to look at like, uh, like perfmon counters or something for like page splits or, um, like any, like, like writing to the transaction logs, stuff like that.

Um, you know, just lots of general sort of other stuff going on. Um, so like you might not get the answer from the query plan, but you could certainly get answers by looking at the, the query plan and some sort of like correlating surrounding statistics and also check for triggers on the table as well.

All right. Let’s see here. Oh boy. As a developer calling me a developer.

Flattery will get you everywhere. Uh, how do you deal with looking into one bug or validating a new feature and finding other features that are unrelated, finding other issues that are unrelated along the way?

Uh, boy, that’s a, that’s a tough one. I am not good at that. Uh, like anytime I’m looking at like my, like one of my store procedures, just basically like the, like, like, like the extent of my, like, I’m a developer work. Uh, man, I, I get lost in so much stuff.

Um, it could be anything from just like, Oh, that formatting annoys me. Now I got to fix that. Um, or like, like, what is this? Like, why am I doing this this way? I got to fix that.

Um, like, like, like, like, uh, I get, I get so lost. So like, it really depends. Um, if, if I am just meandering a little bit and I like spot a few, like, I don’t know, let’s just call them like superficial things. I’ll just, I’ll usually just crack them out along the way.

If I spot something that I like, like knowing how I am, almost everything turns into a terrible rabbit hole. So, uh, usually I will, I will just add a to do and move on. Um, like, I will come back to that to do almost immediately after doing my other stuff.

But, uh, I know that if I, if I start, if I stop everywhere along the way that, um, like, like, like causes me to like, like pause and like have interest or like, you know, like mentally jars me a little bit, I will never get to the thing that I want to get done. So really it’s just, it’s about like, just sort of managing your, like whatever mental thing you have that, uh, that, that like, you know, forces you to analyze things as like, you can’t just ignore things and get to where you’re going. Uh, it’s really just about managing that.

Uh, and you know, make like noting stuff that you want to look at later. Um, you know, like, like, you don’t have to write a book about it. Just be like, to do like, like short blurb or something like that.

Uh, like there’s, there’s probably better ways of doing that. But like, if I, if I, if I started like context switching and like, if I had, like, if I was looking at this stuff and I had to like make a note somewhere else, like open an issue, forget it. Like, like it would just wouldn’t be unmanageable.

So like, usually for me, it’s just a short, like, like, like all caps to do colon, like, figure this out. Like, what is this? Like, fix this later. Like, don’t be an idiot. Why are you doing things this way?

So that’s, that’s about, that’s about all the advice I have. There’s probably much better advice out there. Uh, I, I don’t have anything better than that. Let’s see.

We updated statistics and the plan didn’t improve. These things happen. Isn’t that supposed to fix bad estimates? Well, it, you know, uh, to a degree, you, you might fix a, a root cardinality estimate, assuming that your query is written in a way that allows SQL Server to make a good estimate in the first place.

Uh, you know, there are all sorts of things that may get in the way of SQL Server making a good estimate. Uh, you know, non-sargable predicates, local variables, table variables, uh, you know, a lot of different things. Um, sometimes query complexity just sort of gets in the way of SQL Server making a reasonable estimate on things.

Um, you might try, uh, different cardinality estimation models. You have the default as Microsoft calls it, a lot of hubris in that, but there is a default cardinality estimator and the legacy cardinality estimator. You might try that.

There are also all sorts of use hints to, um, change different things about the way cardinality estimation works. But a lot of that stuff is, I mean, it can’t, I’m not going to say it’s overkill. Like if, if you are really invested in like figuring out why a cardinality estimate is bad, there are all sorts of use hints that you could like, you could do that.

Like, like assume min selectivity and like other, other things like that. The other thing to consider though, is that, you know, cardinality estimation can really only be good at like the table access. As soon as you move past that and you get into like, you know, like any like grouping situation, like, like any aggregate, uh, you start joining stuff together.

You start mixing all these cardinality estimates together, and things can get really weird, especially because the final query plan that you see might be a Frankenstein of a plan. SQL Server might cost and reshuffle and do all sorts of other things along the way.

And you might end up with this weird stapled-together plan that almost looks deranged. I remember when I first started looking at query plans, I would see these things where the estimated numbers would just be crazy all over the place, in inconsistent ways.

And it’s just because one branch of a plan might get a cardinality estimate, and then SQL Server might keep this part of the plan but change something in that part of the plan, and then the cardinality estimate down here might change and look weird up there.

There’s just all sorts of stuff that can go on. So yeah, fix bad estimates. But it only goes so deep; it’s only one point in the plan.

Can you really see the fruit of that further along? If estimates were way off here and you improve them, you might improve estimates downstream. But you might have had good estimates in general way over here, and then stuff that happened at later stages in the plan threw cardinality estimates off.
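
To make that compounding idea concrete, here’s a toy model in Python. This is emphatically not SQL Server internals, and all the numbers are invented for illustration; it just shows how a modest per-operator misestimate snowballs by the time you reach the top of a plan.

```python
# Toy model, not SQL Server internals: if each operator in a plan
# multiplies the running row estimate by some relative error, the error
# at the final operator compounds, even when each individual guess is
# close. All numbers here are invented for illustration.

def compounded_estimate(true_rows, per_op_error, ops):
    # Apply the same relative misestimate at each of `ops` operators.
    est = true_rows
    for _ in range(ops):
        est *= per_op_error
    return est

# A 30% overestimate at each step looks harmless in isolation...
final = compounded_estimate(true_rows=1000, per_op_error=1.3, ops=5)
ratio = final / 1000
# ...but after five joins and aggregates, the final estimate is ~3.7x reality.
```

That’s the whole point about fixing one estimate: it helps at one spot, but everything stacked above it is still doing its own arithmetic.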

So yeah, there is an element of disappointment to updating statistics and nothing improving, or the plan not getting meaningfully better in some way. But there are all sorts of things you might be doing that screw SQL Server up, and all sorts of cases where, just due to query or plan complexity, you’re getting bad estimates because at some point SQL Server kind of gave up, right?

It was just “good enough plan found” or “optimization timeout” or something like that. It only goes so far; at some point you have to dissect things a little bit. If you’re still getting real bad estimates and the query is complex past a certain point, that might be a good time to start breaking the query up into some component parts.

If you’re selecting from two tables and joining them together, and each of those tables occupies some space in the where clause, try isolating each individual piece. You might try putting one thing into a temp table and going from there, maybe even both things into temp tables, so SQL Server has a materialized result to work off of. Updating statistics can only do so much in a plan that is past a certain point of complexity.

All right. We created all the indexes suggested by the missing index DMVs, but performance got worse.

Yeah. Okay. Why would SQL Server recommend bad indexes? Well, maybe SQL Server didn’t recommend terrible indexes.

Maybe SQL Server was just doing its best, right? Imagine if every time you looked at a query, you thought to yourself, oh, maybe this index would be okay, but you had to do it really fast.

Right. So SQL Server’s missing index requests generally care about two things. One: making data easy to locate.

That’s the where clause. And two: not having to do key lookups. That’s the select list, plus all sorts of other stuff that might be useful in an index, especially as key columns: order by, group by, and if you have a windowing function, the partition by and order by in there.

There are all sorts of things that would make sense. Even join keys: SQL Server doesn’t put join keys in the key of a missing index request, right? It’s just the where clause stuff.

So there are all sorts of things that missing index requests miss, things that you as a human being would do better at. This isn’t an indictment of missing index requests. I’d rather have one show up, just so I can say, hey, maybe something is wrong here.

But when you see missing index requests, you should never just look at the DMVs and implement them all. That’s a real bad idea. What you should be doing is looking for long running queries.

All right. Look for your slow queries, and then, if there is a missing index request (or more than one) in the query plan, see if those missing index requests align with data access methods, like an index seek or scan or whatever, that are also slow.

But also, you don’t just create what the missing index request says. Use a little bit of your gray matter on that one. All right.

Look a little more closely at the query. It’s not that the index recommendations are necessarily bad. Some of them are certainly not what a senior person might recommend for an index. But I think they are a good enough sign that you should look into things, and that’s on a query by query basis.

That is not at the DMV level, right? You can correlate the missing index requests you see in a query plan to what’s in the DMVs. But I certainly wouldn’t just go off the DMVs, because SQL Server might recommend a missing index for just about anything.

It happens all the time in the Stack Overflow database when I’m doing demos. There will be a scan of the Users table that takes like 80 milliseconds, but SQL Server says missing index, 98% impact. Like, add it and we’re going home, boys.

That’s not the case. Right? So careful with that. I have a five table join. Do you now?

Well, call me later. It shows statement optimization level FULL and reason for early termination: timeout in the plan XML. But the query only has simple equality joins and returns 50 rows.

Well, I feel like I’ve said this before, but I’ll say it again quickly, because these are 20 minutes of my life I’ll never get back.

Just kidding. I love these 20 minutes. Timing out does not necessarily mean that SQL Server spent some amount of time doing something and then gave up. What the optimization timeout means is that very early on in query optimization, based on a heuristic view of your query (complexity, number of joins, group bys, all the other stuff that goes on in a query)...

SQL Server will give that query sort of an initial cost starting point, right? It’s not going to be the final cost, and costs are all estimates anyway. So it’s all kind of a wild goose chase in the dark with a stab, or something.

But what it’ll do is allot a certain number of steps that it is willing to take in order to come up with a query plan for you. The reason for early termination being a timeout means that SQL Server has gone through all of the steps it was willing to take, and you get whatever the cheapest plan it currently has after those steps have been taken.
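
Here’s a deliberately simplified sketch of that idea in Python. The costs and the budget are made up, and this is nothing like how SQL Server actually searches the plan space; it just illustrates “cheapest plan found so far when the step budget runs out.”

```python
# A deliberately simplified sketch of "reason for early termination: timeout".
# The optimizer allots a step budget up front, explores plan alternatives,
# and when the budget runs out it returns the cheapest plan found so far,
# which is not necessarily the cheapest plan that exists. The costs and
# budget below are made up; this is not how SQL Server actually searches.

def cheapest_plan_within_budget(candidate_costs, step_budget):
    explored = candidate_costs[:step_budget]  # the steps it was willing to take
    timed_out = step_budget < len(candidate_costs)
    return min(explored), timed_out

costs = [100, 80, 95, 60, 40, 25]  # hypothetical alternative plan costs
best, timed_out = cheapest_plan_within_budget(costs, step_budget=4)
# best is 60: the 40-cost and 25-cost plans were never even examined
```

A small step budget isn’t a punishment; a cheap-looking query just doesn’t earn a big exploration allowance.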

So regardless of your simple equality joins or only returning 50 rows, SQL Server doesn’t know that the query only returns 50 rows when it’s coming up with a query plan, right? It might estimate something different to begin with, but it doesn’t know that the final result will always be exactly 50 rows.

So that’s not really the point there. Seeing a five table join hit a reason for early termination timeout probably just means that your starting query, because it has just simple equality joins, and maybe the tables are relatively small, could have a relatively low starting cost, heuristically.

And because of that low starting cost, SQL Server just might not be willing to invest all that many steps in coming up with alternative plans for it. So that’s probably good. Anyway, thank you for watching.

I hope you enjoyed yourselves. I hope you learned something and I will see you. Well, I’ll see you tomorrow, but I’ll also see you next Monday for another office hours episode. Thank you.

Oh, so very much for watching.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

Get AI-Ready With Erik: Why Approximate Search Matters



Video Summary

In this video, I delve into the world of approximate search performance, focusing on vector indexes in SQL Server and their implications for query optimization. With the rise of vector search functions and the preview DiskANN vector indexes, it’s clear that Microsoft is moving towards a future where vector data plays a significant role. However, the current implementation leaves much to be desired; the performance of vector distance queries degrades significantly as data sizes increase. I explore this issue by examining tables with millions of rows and gigabytes of data, showcasing how query times balloon out of control. The video also touches on potential improvements and why it’s crucial for Microsoft to enhance their vector index capabilities to support efficient vector search operations in the future.

Full Transcript

Erik Darling here with Darling Data. Living the dream, as it goes. In today’s video, we are going to get even more AI ready than we have ever been before by talking about approximate search performance, right? Because there’s a reason why people care and have vested interests in things like vector indexes in SQL Server. There is a preview feature for DiskANN indexes, right? And there is also a vector search function, which can use those indexes. The vector distance function does not do that. The vector distance function’s job is to take one set of floaty points and figure out how far they are away from another set of floaty points. And what we’re going to find is that performance for that is not so hot as data gets bigger. This is why Microsoft better get off its keister and figure out how to make creating DiskANN indexes not be a terrible slog, and how to not have them make your tables read only, because without that, I just don’t know where we’re going to go, right? So, run this. And if we look at the approximate size of our stuff in here, there’s the post embeddings table, which is just question titles.

It’s about a million rows and approximately four gigs. The answer embeddings table is 2.6 million rows and approximately 10 gigs. So it’s just about doubled in size. Well, 1 million to 2 and a half million; if it was double, it’d be like eight gigs, right? It’s like double and a half, so it’s like 10 gigs. We added a little bit more in there. And the same damn thing happens with query performance, right? It is bonkers. So we run some stuff here to evaluate the speed of a vector distance query against the post embeddings table, that’s the one with 1 million rows in it, right? If we run this, let me look at what we get back. Here’s the query plan for it. This was a million rows in 219 milliseconds, right? So we look at the query plan, and this is what that looks like. We get this whole thing happening here: the clustered index scan. We’re not filtering on vector distance.

So there’s no filter operator for that stuff. If we were filtering, that would be there. But it just has to scan everything and then sort the rows down to the top 10 that we care about. You can see that we are sorting by expression 1001 ascending, which is the vector distance result. So that took about 200 milliseconds or so, and a million rows is not terribly big, I’ll give you that. If we look at this for 2.6 million rows, run this thing, it’s a bit more sluggish. This was five and a half seconds for some reason. It’s usually not that slow.

Usually it’s a little bit faster, but this one had to do the whole song and dance, right? So SQL Server is not having a great time with the vector distance stuff. Really, the message here is that the bigger your data is (and this is true for just about anything in databases), the more often you will have to think about query performance. The thing with vector distance is that there’s not a lot you can do to performance tune it, unless you have other indexes to incorporate other search conditions in here or something.
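
The reason this degrades with size is that it’s a brute-force scan. Here’s a pure-Python sketch of what the plan is effectively doing (not SQL Server’s actual implementation, and the tiny two-dimensional “embeddings” are invented for illustration): compute a distance for every row, then sort for the top N.

```python
# Why vector distance with no usable index slows down as the table grows:
# it is a brute-force scan. This mirrors what the query plan showed:
# compute a distance for every row, then sort to get the top N.
# The fake two-dimensional "embeddings" below are just for illustration.
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def top_n_nearest(rows, query_vector, n=10):
    # O(row_count * dimensions) work on every single query
    scored = [(cosine_distance(vec, query_vector), row_id) for row_id, vec in rows]
    return sorted(scored)[:n]

rows = [(i, [float(i), 1.0]) for i in range(1, 1001)]  # 1,000 fake rows
nearest = top_n_nearest(rows, query_vector=[1.0, 1.0], n=3)
```

Double the rows, double the work, every single time you run the query. That’s the linear wall an approximate index is supposed to knock down.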

Right. This is just selecting the top 10 from this table, and we do a lot of work for that. It’s significant. But this was the performance aspect that interested me, that set me down this whole terrible path of learning all this other stuff. So we’re going to talk more about that as we go. But as a wise man once said, it is Friday, and we do not have a lot of time to waste on a Friday thinking about SQL Server and vectors and all that other stuff. So this is just going to be a short video for us to think a little bit about how we might want to manage our vector data in the future. Right.

And of course, why it is imperative that Microsoft sort itself out with the vector index stuff, so that we can use the vector search function to make things faster. We are going to talk about that shortly. So, that out of the way, thank you for watching. I hope you enjoyed yourselves. I hope you learned something, and I will see you next week with a fresh brand new office hours episode, and whatever other stuff comes along. So anyway, thank you for watching.



Get AI-Ready With Erik: Combining Search Scoring



Video Summary

In this video, I delve into the world of search algorithms, specifically focusing on reciprocal rank fusion (RRF) and weighted scoring techniques. These methods are crucial when dealing with diverse data sources that require different ranking systems to be combined for a unified, superior search result. I explain how RRF works by summing each document’s reciprocal rank and adjusting it with a constant K, ensuring higher weight is given to items appearing high across multiple methods. Additionally, I discuss weighted scoring as an alternative approach, where scores are normalized between 0 and 1 to provide more precise control over which factors matter most in the search results. The video covers practical examples like SQL Server documentation searches, enterprise bug tracking systems, and performance tuning recommendations engines, illustrating how these techniques can be applied to improve overall search quality without complex retraining processes.

Full Transcript

Erik Darling here, Darling Data. And there’s no demos in this video, so if you just want to skip right ahead and buy the course for $100 off with that coupon up top, you could just go ahead and do that. You can hear me prattle on about this stuff from the comfort of your home whenever you want. So, low, low price of $100 off. When you are working with vectors, one thing that you’re going to have to deal with as a database person is that you might have multiple things in your data, aside from just the vectors we’ve been talking about, that would indicate that a certain result is a good match for a search. It’s a wonderful, it’s a wonderful quote, right?

Makes it, we can sound very authoritative reading this quote. Reciprocal rank fusion, RRF, is a powerful model agnostic algorithm used in hybrid search to combine results from multiple ranking systems, like keyword and vector search, into one unified superior ranking.

Superior, like Perrier sparkling water. Stuff goes straight to your head, I hear. And what it does is it sums each document’s reciprocal rank, which is one divided by the rank from each list, and you adjust it by a constant.

The constant is called K. Why is constant spelled with a K? Perhaps it’s some Germanic influence that I’m unaware of, but that’s what’s in the math, so we’re going to stick with it. Because, you know, people who know math tend to be pretty smart.

So if they want to spell constant with a K, they can go right on and do that. But the whole idea is to give higher weight to height. Sorry, this is a direct quote, so I can’t go off script here.

Giving higher weight to items appearing high across different methods, improving overall search quality without complex retraining. It’s crucial for blending diverse search types into coherent results, commonly used in modern LLM-based systems for better relevance. So just in the Stack Overflow database alone, we have multiple things that might inform us as to what is a good match, what is good content, what is high quality.
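
That quoted formula is small enough to sketch in a few lines of Python. The document IDs here are made up, and k = 60 is just the commonly cited default for the constant, not anything from this video:

```python
# A minimal reciprocal rank fusion sketch, straight from the quoted formula:
# each document's fused score is the sum of 1 / (k + rank) across lists.
# The document IDs are invented; k = 60 is the commonly cited default.

def rrf_scores(ranked_lists, k=60):
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores

vector_results = ["post_7", "post_2", "post_9"]   # ranked by vector distance
keyword_results = ["post_7", "post_4", "post_2"]  # ranked by full text search

fused = rrf_scores([vector_results, keyword_results])
best = max(fused, key=fused.get)
# post_7 tops both lists, so it wins without comparing raw scores at all
```

Notice that only positions go in; the raw distances and ranks never have to live on the same scale, which is the whole appeal.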

Where do we go from here? We have vector distance, right, which is a number from 0 to 2, but lower is better. Okay, so like the lower your vector distance is, the more similar two sets of vectors are.

Then we have, in the post table, we have a column called score. And typically, you know, not, maybe not like, maybe it’s not a perfect system, but, you know, it’s a pretty okay one. We have the score column, and that can be negative, right, or it can be a very high positive number.

But we couldn’t automatically just add or subtract score from vector distance, because vector distance is going to be a tiny little decimal. And if you have a score of, like, 18,000 on a thing and you say, well, I’ll just make it negative, then you’re at negative 18,000, and stacking negatives just makes it more negative. So you could have just a really big negative number, but that would look weird.

So it’s, like, you could have just a really big, like, negative number, but that would look weird. And then you could also, like, do stuff like keyword boost, like we looked at in the last video, where you can, like, you know, you can use, like, sort of, like, some, like, backup search terms to be, like, oh, well, if it has this in it, then it’s, like, like, Thai food. And then, like, my favorite Thai dish is called, it’s, like, it’s called Rama.

And it’s, like, this spicy peanut sauce thing. So, like, whenever I’m looking for, like, whenever I’m in a new city, if I’m, like, you know, got to hankering for Thai food, one thing I’m, like, well, if you’re a Thai restaurant and you have this Rama dish, usually chicken because I’m a little bit of a coward. But if you, like, if you have this chicken Rama, I’m, like, man, I’m going to you first.

I don’t care how much, like, I don’t care if you have a D for your, like, you know, health and sanitation report. I’m in there. We need some dirty peanut sauce. So, like, that, you know, that’s just, like, keyword boost, right?
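
The keyword boost idea can be sketched like this. The 0.2 boost factor and the example rows are invented for illustration; the point is just that a flat nudge for a must-have term can rerank results that similarity scoring alone would order differently.

```python
# The keyword boost idea, sketched: score results however you like, then
# nudge anything containing a must-have term. The 0.2 boost and the
# example rows are invented for illustration.

def keyword_boost(results, keyword, boost=0.2):
    # results: list of (similarity, text), where higher similarity is better
    return sorted(
        ((score + (boost if keyword in text.lower() else 0.0), text)
         for score, text in results),
        reverse=True,
    )

results = [(0.70, "Best pad thai in town"),
           (0.65, "Chicken rama with peanut sauce")]
reranked = keyword_boost(results, keyword="rama")
# the rama dish jumps to the top: 0.85 beats 0.70
```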

Good stuff. So, the first option you have is reciprocal rank fusion, where you ignore scores and use rank positions only because you have to do some very fancy math to figure this out. So, I go over all that math and show you queries that can do this in the full course.

I’m not doing that here because, well, I mean, you got to save something for marriage, right? But you would want to use reciprocal rank fusion when you are combining different ranked lists or rather ranked lists from different sources where the scores are not comparable because they are different units and scales. It’s also very simple and robust and, you know, in that simplicity and robustness, you lose a little bit of fine-grained control.

Where this might make sense is if you are building a SQL Server documentation search and you wanted to combine vector similarity search, which returns distances, and then you would have, like, full text search, which would return ranks, like, 128, 95, and 42. Like, just, like, those two numbers are completely incompatible.

Then you also might want to think about, like, recency, right? Because, like, for a lot of things, I mean, look, I’m like a crate digger when it comes to, like, SQL Server information. Like, usually when I’m looking for something, I’m looking for something real old, right?

Like, the new stuff, I don’t know. Maybe if I need to learn about something that just came out, it’s one thing. But usually I’m looking for some old Craig Freedman post. And I’m like, no, no, no.

Get that bottle under the bar that’s, like, covered in dust. Yeah, that’s the one. Right? So, like, that’s usually what I’m about. But, like, all three of these things sort of, like, have their own incompatibilities. It’s, like, really hard to sort of, like, coagulate.

Like, oh, well, you know, this is your vector similarity. But, you know, like, you wanted some keywords in there that were pretty hard for vectors to navigate. So, we did some full text search and got you some numbers back on that.

But do you want the new stuff or do you want, like, that old dusty bottle under the bar? Right? Like, what are you looking for here? So, like, you might have, like, a recency or, like, some, like, sort of temporal time-based way of ranking stuff.

But these things are just not, you just can’t, like, put those things together. Right? Like, that’s what reciprocal rank fusion does.

You also have the concept of weighted scoring, where you would normalize all your scores to a number between 0 and 1, right? Which is sort of like the vector cosine distance range of 0 to 2.

The idea here is to put everything on the same scale. Right? And then you would decide how much each factor matters.

Like, you know, I think, like, in the, well, I mean, I know in the video that I do, I show you an example where it’s, like, starting with, like, a 90% semantic and 10% popularity setup. And then, like, how, like, how that changes at, like, 70-30 and 60-40 and 50-50 and stuff. And, like, how that really changes the search results.

But, like, the 90% semantic is usually a pretty good one. But, so, like, what it would mean is that, like, relevance would matter the most. But you still want to be, like, well, this is a really popular post.

So, like, maybe, maybe we ought to think about, you know, like, you know, incorporating that feedback from our voting system in here a little bit. But the thing is that those two metrics just live in completely different universes. You know, think, like, again, going back to the stack overflow data, vector cosine distance is a range from 0.0, which indicates identicality, and 2.0, which means opposite direction.

Right? So, like, 0.0 was, like, you are just, like, giving each other a hug and you’re walking this way. 2.0 was, like, you, like, just, like, broke up and going your separate ways.

You know, like, the three records you let her borrow under the thing and you’re mad because she stole your Cure t-shirt. And you’re, like, man. Stinks.

Anyway. Again, if you try to combine those raw scores, then the post score would dominate everything because, like, if, like, you have some post with 18,000 votes and it’s always going to win, even if it’s completely irrelevant. Right?

If there’s, like, a high cosine distance, then, like, but, like, score being a number that high, you’ll be, like, well, like, even if I turn that into a negative number, it’s still, like, kind of whack. So, with the 90-10 split, relevance still wins, but you take in another feedback. There’s another feedback mechanism at play.

So, just, I mean, just think of it sort of like a pie chart, right? So, it’s, like, the result is 90% about relevance and 10% about popularity. It also makes it easier to reason about things, like, if I want popularity to matter more, I can change the weighting from 0.9 and 0.1, which is the 90-10 split, to 0.7 and 0.3, to, like, a 70-30 split. The 90-10 split essentially means I care about relevance way more than popularity, but I still want popularity to matter, right?
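
Here’s a small Python sketch of that weighted blend. The normalization choices (flipping distance/2 into a similarity, min-maxing the vote score) are one reasonable option rather than the only one, and every number below is invented:

```python
# Weighted scoring, sketched: normalize each signal to 0..1 first, then
# blend with weights. The normalization choices (distance/2 flipped into a
# similarity, min-max on vote score) are one reasonable option, not the
# only one, and all the numbers are invented.

def weighted_score(cosine_distance, votes, min_votes, max_votes,
                   w_relevance=0.9, w_popularity=0.1):
    relevance = 1.0 - cosine_distance / 2.0                     # 0..2 -> 1..0
    popularity = (votes - min_votes) / (max_votes - min_votes)  # min-max -> 0..1
    return w_relevance * relevance + w_popularity * popularity

# A highly relevant post with 3 votes beats an irrelevant 18,000-vote post:
relevant = weighted_score(cosine_distance=0.1, votes=3,
                          min_votes=-10, max_votes=18000)
popular = weighted_score(cosine_distance=1.8, votes=18000,
                         min_votes=-10, max_votes=18000)
```

Turning the knobs is just changing `w_relevance` and `w_popularity`, which is exactly the 90-10 versus 70-30 discussion above.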

It’s, like, it’s a sensible thing to do. You would want to use weighted scoring when you need more precise control over which factors are important to you. It gives you a lot more knobs to tune, right?

And you can do that for multiple things. It’s not just, like, you know, similarity and popularity. Like, if you wanted to throw recency into the mix, you could do that, too, and you could add that to your weighting measures, right? You could have a three-way weighting split, not, you know, just a monogamous two-way rating split, right?

You got to get, like, get a whole room involved, right? Like an orgy of rating split, but ranking splits, whatever. So the other thing that weighted scoring is good for that, I mean, reciprocal rank fusion is also kind of good for that, is if you want to incorporate, like, a whole bunch of different stuff, not just ranks, and, like, the stuff that you have to combine is, like, just somewhat, like, it’s just, like, too different as far as, like, what it returns.

So, like, let’s say that you wanted to have an enterprise search where recency matters three times more than relevance, right? Like, you, like, maybe you’re, like, it’s only 60% relevant, but if it’s recent, like, we should show it because maybe that’s what someone’s looking for. And, like, maybe, like, like, we have, like, a bug tracking system, and you want to find recent bugs with similar keywords because, like, those are going to be more relevant to something you’re working on today than, like, maybe something from, like, three years ago that, like, has, you know, that is more similar.

But, you know, it’s just, again, it’s, like, something that you’ve dealt with years ago. It’s maybe not relevant to what you’re doing today. Another example would be, like, let’s say you were a SQL, like, you were putting together a SQL Server performance tuning recommendations engine, right?

And because you watch my videos, like, I’m not doing this. If you were doing this out there in the world, you would know from watching my videos that query execution time matters about 50 times more than logical reads because it is not SQL Server.

Just to call back to one of the earlier videos this week, it is not, right, because vector search is not good at not. It is not SQL Server 2008 or SQL Server 2008 R2 anymore. We live in 2026, and if we’re still looking at logical reads, we ought to have our heads checked.

And we also know that things like statistics freshness matter 20 times more than index fragmentation, right? I mean, it could clearly be 100 times more; how many zeros can we put in there, right? Having up-to-date statistics is much more important than caring about index fragmentation because, again, it is not SQL Server 2008.

It is not SQL Server 2008 R2, and we should act like we know what time it is, right? Anyway, that’s probably good here. Thank you for watching.

I hope you enjoyed yourselves. I hope you learned something. And I will see you in tomorrow’s big Friday video, which is actually, thankfully, a small Friday video, so I can get you on your way to having a great weekend and can all live in peace. All right.

Thank you for watching.

Video Summary

In this video, I delve into the world of search algorithms, specifically focusing on reciprocal rank fusion (RRF) and weighted scoring techniques. These methods are crucial when dealing with diverse data sources that require different ranking systems to be combined for a unified, superior search result. I explain how RRF works by summing each document’s reciprocal rank and adjusting it with a constant K, ensuring higher weight is given to items appearing high across multiple methods. Additionally, I discuss weighted scoring as an alternative approach, where scores are normalized between 0 and 1 to provide more precise control over which factors matter most in the search results. The video covers practical examples like SQL Server documentation searches, enterprise bug tracking systems, and performance tuning recommendations engines, illustrating how these techniques can be applied to improve overall search quality without complex retraining processes.

Full Transcript

Erik Darling here, Darling Data. And there’s no demos in this video, so if you just want to skip right ahead and buy the course for $100 off with that coupon up top, you could just go ahead and do that. You can hear me prattle on about this stuff from the comfort of your home whenever you want. So, uh, low, low price of $100 off. So, uh, when, when you are working with vectors, um, one thing that you’re going to have to deal with as a database person is that you might have multiple, multiple things in your data, uh, aside from just the, the vectors that we’ve been talking about that would, uh, indicate that a, a certain, uh, match to your data. That a search, uh, a search, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, uh, It’s a wonderful, it’s a wonderful quote, right?

Makes it, we can sound very authoritative reading this quote. Reciprocal rank fusion, RRF, is a powerful model agnostic algorithm used in hybrid search to combine results from multiple ranking systems, like keyword and vector search, into one unified superior ranking.

Superior, like Perrier sparkling water. Stuff goes straight to your head, I hear. And what it does is it sums each document’s reciprocal rank, which is one divided by the rank from each list, and you adjust it by a constant.

The constant is called K. Why is constant spelled with a K? Perhaps it’s some Germanic influence that I’m unaware of, but that’s what’s in the math, so we’re going to stick with it. Because, you know, people who know math tend to be pretty smart.

So if they want to spell constant with a K, they can go right on and do that. But the whole idea is to give higher weight to height. Sorry, this is a direct quote, so I can’t go off script here.

Giving higher weight to items appearing high across different methods, improving overall search quality without complex retraining. It’s crucial for blending diverse search types into coherent results, commonly used in modern LLM-based systems for better relevance. So just in the Stack Overflow database alone, we have multiple things that might inform us as to what is a good match, what is good content, what is high quality.
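The formula from that quote can be sketched in a few lines of Python. This is a minimal illustration, not the course demo: the document IDs, the rank lists, and the k = 60 default (a commonly used value for RRF) are all my own assumptions here.

```python
# Reciprocal Rank Fusion: a document's fused score is the sum of
# 1 / (k + rank) over every ranked list it appears in. Higher is better,
# and items ranked high in multiple lists float to the top.
def rrf(ranked_lists, k=60):
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort documents by fused score, best first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical example: vector search and keyword search rank posts
# differently, and RRF blends the two orderings without comparing scores.
vector_results = ["post_a", "post_b", "post_c"]
keyword_results = ["post_a", "post_d", "post_b"]
fused = rrf([vector_results, keyword_results])
```

Note that RRF never looks at the underlying scores, only the positions, which is exactly why it works across lists whose scores live in different units.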

Where do we go from here? We have vector distance, right, which is a number from 0 to 2, but lower is better. Okay, so like the lower your vector distance is, the more similar two sets of vectors are.

Then we have, in the post table, we have a column called score. And typically, you know, not, maybe not like, maybe it’s not a perfect system, but, you know, it’s a pretty okay one. We have the score column, and that can be negative, right, or it can be a very high positive number.

But we can’t, like, so we couldn’t automatically just, like, add or subtract score from vector distance because, like, you know, vector distance is going to be a tiny little decimal. And if you have, like, a score of, like, 18,000 on a thing and you’re like, well, oh, I’ll just make it negative. And it’s, like, you’re at negative 18,000 plus, like, minus, well, I mean, like, it’s two negative numbers plus, so it’s more of a negative.

So it’s, like, you could have just a really big, like, negative number, but that would look weird. And then you could also, like, do stuff like keyword boost, like we looked at in the last video, where you can, like, you know, you can use, like, sort of, like, some, like, backup search terms to be, like, oh, well, if it has this in it, then it’s, like, like, Thai food. And then, like, my favorite Thai dish is called, it’s, like, it’s called Rama.

And it’s, like, this spicy peanut sauce thing. So, like, whenever I’m looking for, like, whenever I’m in a new city, if I’m, like, you know, got to hankering for Thai food, one thing I’m, like, well, if you’re a Thai restaurant and you have this Rama dish, usually chicken because I’m a little bit of a coward. But if you, like, if you have this chicken Rama, I’m, like, man, I’m going to you first.

I don’t care how much, like, I don’t care if you have a D for your, like, you know, health and sanitation report. I’m in there. We need some dirty peanut sauce. So, like, that, you know, that’s just, like, keyword boost, right?

Good stuff. So, the first option you have is reciprocal rank fusion, where you ignore scores and use rank positions only because you have to do some very fancy math to figure this out. So, I go over all that math and show you queries that can do this in the full course.

I’m not doing that here because, well, I mean, you got to save something for marriage, right? But you would want to use reciprocal rank fusion when you are combining different ranked lists or rather ranked lists from different sources where the scores are not comparable because they are different units and scales. It’s also very simple and robust and, you know, in that simplicity and robustness, you lose a little bit of fine-grained control.

Where this might make sense is if you are building a SQL Server documentation search and you wanted to combine vector similarity search, which returns distances, and then you would have, like, full text search, which would return ranks, like, 128, 95, and 42. Like, just, like, those two kinds of numbers are completely incompatible.

Then you also might want to think about, like, recency, right? Because, like, for a lot of things, I mean, look, I’m like a crate digger when it comes to, like, SQL Server information. Like, usually when I’m looking for something, I’m looking for something real old, right?

Like, the new stuff, I don’t know. Like, maybe if I need to learn about, like, something that just came out, it’s one thing. But, like, usually I’m looking for, like, some old Craig Freedman post. And I’m like, no, no, no.

Get that bottle under the bar that’s, like, covered in dust. Yeah, that’s the one. Right? So, like, that’s usually what I’m about. But, like, all three of these things sort of, like, have their own incompatibilities. It’s, like, really hard to sort of, like, coagulate.

Like, oh, well, you know, this is your vector similarity. But, you know, like, you wanted some keywords in there that were pretty hard for vectors to navigate. So, we did some full text search and got you some numbers back on that.

But do you want the new stuff or do you want, like, that old dusty bottle under the bar? Right? Like, what are you looking for here? So, like, you might have, like, a recency or, like, some, like, sort of temporal time-based way of ranking stuff.

But these things are just not, you just can’t, like, put those things together. Right? Like, that’s what reciprocal rank fusion does.

You also have the concept of weighted scoring, where you would normalize all your scores to a number between 0 and 1. Right? Which is sort of like the vector cosine distance range of 0 to 2.

The idea here is to put everything on the same scale. Right? And then you would decide how much each factor matters.

Like, you know, I think, like, in the, well, I mean, I know in the video that I do, I show you an example where it’s, like, starting with, like, a 90% semantic and 10% popularity setup. And then, like, how, like, how that changes at, like, 70-30 and 60-40 and 50-50 and stuff. And, like, how that really changes the search results.

But, like, the 90% semantic is usually a pretty good one. But, so, like, what it would mean is that, like, relevance would matter the most. But you still want to be, like, well, this is a really popular post.

So, like, maybe, maybe we ought to think about, you know, like, you know, incorporating that feedback from our voting system in here a little bit. But the thing is that those two metrics just live in completely different universes. You know, think, like, again, going back to the stack overflow data, vector cosine distance is a range from 0.0, which indicates identicality, and 2.0, which means opposite direction.

Right? So, like, 0.0 was, like, you are just, like, giving each other a hug and you’re walking this way. 2.0 was, like, you, like, just, like, broke up and going your separate ways.

You know, like, the three records you let her borrow under the thing and you’re mad because she stole your Cure t-shirt. And you’re, like, man. Stinks.

Anyway. Again, if you try to combine those raw scores, then the post score would dominate everything because, like, if, like, you have some post with 18,000 votes and it’s always going to win, even if it’s completely irrelevant. Right?

If there’s, like, a high cosine distance, then, like, but, like, score being a number that high, you’ll be, like, well, like, even if I turn that into a negative number, it’s still, like, kind of whack. So, with the 90-10 split, relevance still wins, but you take in another feedback. There’s another feedback mechanism at play.

So, just, I mean, just think of it sort of like a pie chart, right? So, it’s, like, the result is 90% about relevance and 10% about popularity. It also makes it easier to reason about things, like, if I want popularity to matter more, I can change the weighting from 0.9 and 0.1, which is the 90-10 split, to 0.7 and 0.3, to, like, a 70-30 split. The 90-10 split essentially means I care about relevance way more than popularity, but I still want popularity to matter, right?
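A weighted-scoring sketch might look like the following. The numbers are hypothetical and this is not the course's actual query, just an illustration of the normalize-then-weight idea: cosine distance (0 to 2, lower is better) gets flipped into a 0-to-1 similarity, the vote score gets min-max normalized, and a 90-10 split combines them.

```python
def weighted_score(distance, votes, min_votes, max_votes,
                   w_relevance=0.9, w_popularity=0.1):
    # Cosine distance runs 0.0 (identical) to 2.0 (opposite direction);
    # flip it into a 0-1 similarity where higher is better.
    relevance = 1.0 - (distance / 2.0)
    # Min-max normalize the raw vote score into 0-1 so a huge vote count
    # can't dominate the tiny distance numbers.
    if max_votes > min_votes:
        popularity = (votes - min_votes) / (max_votes - min_votes)
    else:
        popularity = 0.0
    return w_relevance * relevance + w_popularity * popularity

# Hypothetical posts: one very similar with few votes, one barely similar
# with 18,000 votes. With a 90-10 split, relevance still wins.
close_post = weighted_score(distance=0.15, votes=12, min_votes=-5, max_votes=18000)
popular_post = weighted_score(distance=1.60, votes=18000, min_votes=-5, max_votes=18000)
```

Changing the weights to 0.7 and 0.3 is the 70-30 split from above; everything else stays the same, which is what makes this approach easy to reason about.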

It’s, like, it’s a sensible thing to do. You would want to use weighted scoring when you need more precise control over which factors are important to you. It gives you a lot more knobs to tune, right?

And you can do that for multiple things. It’s not just, like, you know, similarity and popularity. Like, if you wanted to throw recency into the mix, you could do that, too, and you could add that to your weighting measures, right? You could have a three-way weighting split, not, you know, just a monogamous two-way rating split, right?

You got to get, like, get a whole room involved, right? Like an orgy of rating split, but ranking splits, whatever. So the other thing that weighted scoring is good for that, I mean, reciprocal rank fusion is also kind of good for that, is if you want to incorporate, like, a whole bunch of different stuff, not just ranks, and, like, the stuff that you have to combine is, like, just somewhat, like, it’s just, like, too different as far as, like, what it returns.
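A three-way split along those lines could be sketched like this. Again, this is my own illustrative math, not something shown in the video: the weights, the 180-day half-life, and the exponential recency decay are all assumptions.

```python
# Three-way weighted scoring sketch: relevance, popularity, and recency are
# each normalized to 0-1 before weighting, so no single factor dominates.
def combined_score(distance, norm_votes, age_days,
                   weights=(0.6, 0.1, 0.3), half_life_days=180.0):
    relevance = 1.0 - distance / 2.0              # cosine distance 0-2 -> 0-1
    recency = 0.5 ** (age_days / half_life_days)  # exponential decay, 0-1
    w_rel, w_pop, w_rec = weights
    return w_rel * relevance + w_pop * norm_votes + w_rec * recency

# Two hypothetical posts with identical relevance and popularity: the
# brand-new one outscores the two-year-old one on recency alone.
fresh = combined_score(distance=0.8, norm_votes=0.2, age_days=0)
stale = combined_score(distance=0.8, norm_votes=0.2, age_days=720)
```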

So, like, let’s say that you wanted to have an enterprise search where recency matters three times more than relevance, right? Like, you, like, maybe you’re, like, it’s only 60% relevant, but if it’s recent, like, we should show it because maybe that’s what someone’s looking for. And, like, maybe, like, like, we have, like, a bug tracking system, and you want to find recent bugs with similar keywords because, like, those are going to be more relevant to something you’re working on today than, like, maybe something from, like, three years ago that, like, has, you know, that is more similar.

But, you know, it’s just, again, it’s, like, something that you’ve dealt with years ago. It’s maybe not relevant to what you’re doing today. Another example would be, like, let’s say you were a SQL, like, you were putting together a SQL Server performance tuning recommendations engine, right?

And because you watch my videos, like, I’m not doing this. If you were doing this out there in the world, you would know from watching my videos that query execution time matters about 50 times more than logical reads because it is not SQL Server.

Just to call back to one of the earlier videos this week, it is not, right, because vector search is not good at not. It is not SQL Server 2008 or SQL Server 2008 R2 anymore. We live in 2026, and if we’re still looking at logical reads, we ought to have our heads checked.

And we also know that things like statistics freshness matter 20 times more than index, right? I mean, that should be, like, it could clearly be 100 times more, you know, like, how many zeros can we put in there, right? Like, having up-to-date statistics is much more important than caring about index fragmentation because, again, it is not SQL Server 2008.

It is not SQL Server 2008 R2, and we should act like we know what time it is, right? Anyway, that’s probably good here. Thank you for watching.

I hope you enjoyed yourselves. I hope you learned something. And I will see you in tomorrow’s big Friday video, which is actually, thankfully, a small Friday video, so I can get you on your way to having a great weekend and can all live in peace. All right.

Thank you for watching.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

Get AI-Ready With Erik: Helping Vectors Rank Better With Keyword Boosts

Get AI-Ready With Erik: Helping Vectors Rank Better With Keyword Boosts


Video Summary

In this video, I delve into strategies to enhance vector search relevance in semantic search, focusing on SQL Server scenarios. I explore how keyword searches and full-text indexes can complement each other, discussing their strengths and limitations. By demonstrating practical techniques such as adding specific keywords to the query conditions and using boosts based on domain knowledge, I show how to refine search results to prioritize conceptual relevance over popularity or frequency. This approach is particularly useful when you need exact terms but also want the search engine to understand the context better, ensuring that highly relevant content isn’t missed due to semantic gaps.

Full Transcript

Erik Darling here with Darling Data. You know, trying to get my AI ready on. I’m gonna make sure that I’m prepared for the world falling apart around us and all that other good stuff. So today’s video is going to be about some things you can do that might help vectors rank things a bit better, right? Because, you know, like we talked about in the last video, vector might miss some stuff, right? It’s not, didn’t, we were looking for a SQL Server 2008 R2 when we got back a lot of SQL Server 2008. And then we were like, I want to find stuff about connection strings, but not any framework. And Vector was like, hey, any framework, you said those words, here you go. Uh, I got, I hear that right? Yeah. Ah, well, there you go. It’s some entity framework for you. Um, so semantic search is there to find conceptually related content, but it has a, it has a bad habit of missing exact terms. Uh, keyword search, like, you know, we should, we, I show you, I had to unfortunately show you some full text index stuff, which I feel dirty about still, but I’ll get over it when you buy something. Uh, keyword search finds exact terms, but may not rank by conceptual relevance, uh, as well. Right. Cause like, you know, we can, we can put in explicit words and we can find explicit words.

Which is great for stuff like, you know, error messages or like, you know, product versions and things like that. But you know, less so for like, ah, I think I want a sandwich. What kind of sandwich do I want? I don’t know. Who has sandwiches, right? Best sandwich. Right. So like, you know, stuff that, you know, you might, might have a harder time. Anyway. Um, so this is, what I’m going to show you is useful when a specific term must appear in results, but you want results ordered by conceptualization. Conceptual relevance, maybe rather than like popularity or something. Right. Cause you know, in the, in the stack overflow database, we have a score column, right? And like, you know, we could certainly like, you know, like, like factor score into things, which we will talk about.

But, uh, typically we, we, you know, if we’re, we’re spending all this time and effort and energy into getting our vectors, right. We, we, we would probably want to, uh, you know, involve them at least to some degree. Right. What’s the point of all this damn data. Don’t use it. Uh, so what we’re going to do is try to find stuff about SQL injection prevention, but we’re going to, we’re going to like give a secret. We’re going to give our, our search a little bit of help, right? Because we want to make sure that the title of the post is at least, you know, has SQL injection in it. Right.

So this is another form of keyword search kind of, um, it’s just an alternate to full text indexes. You can, you can throw some stuff like this in, I guess, in the where clause to help filter stuff out. There’s also another, um, uh, there’s also another, uh, technique that I’m going to show you down below. And this will at least get us to the point where, you know, all of the post titles have SQL injection in them. And then it’s sort of up to us to figure out like, okay, well, I mean like, you know, distance wise, right? Like, like all of these things do pretty okay.

But the scores are all over the place. We’ve got a zero, a 10, a 2700, right? But like, like, we’re not seeing, we’re not seeing stuff with like, uh, and working at questions too. Right. So like, like two things that we’re not doing here that like if in real life we would probably want to be doing is one, making sure that these questions have answers. Right. Cause like the question doesn’t have any answers. How do you like, it might just be a similar question if you’re doing like, you know, like content deduplication or whatever.

But if you want to like find related content that might have helpful things in it, aside from that, we would probably want to make sure that they had some answers. We also want, might want to factor a score into it somehow because score at least, you know, would help dictate like, Hey, this is a good question. Like, or at least this is a highly upvoted question. Good or not is, uh, come on. It’s stack overflow. Right.

Like something, something’s got really hot. A lot of upvotes, like more upvotes than I will ever get in my life. But, uh, you look at it and you’re like, you link to the document. You rat. Anyway, another option we have, right, another way that we could potentially make this all better somehow, would be to add a boost based on the presence of certain words in the title. Right. So the way that we would do that is, I mean, this is just to show you what the boost actually turns out being.

So if we’re like, if we’re off looking for like how to optimize database performance, you know, conceptually, we’re going to find stuff that’s pretty close to that. But like, we, we might, you know, kind of like, you know, know our data or like, you know, have some idea about like, like allow people to enter in some like, uh, like, uh, some like, uh, what do you call them? Uh, like backup search words or something. We’re like, Oh, like, yeah, I’m really interested in indexes. Right. Yeah. That’s a good one.

So we could add in some, some like booster stuff like that. And then we could adjust, uh, what the vector distance is based on that boost. And what we’ll end up with is a query that looks a bit like this, man. I’m actually going to highlight the whole thing. Cause the last thing I want in one of these videos is like an error. And then you’re like, uh, then I have to like go vector search the error message and figure to ask AI how to fix it.

And AI is like, Oh, well, it looks like maybe you forgot to highlight where you declared a variable. And then you want to be like, man, AI, you’re so good. You’re so smart. How’d you? God. Anyway, uh, what we have here is a few columns. Oh, we’ve got more than a few, but we have a few columns that we would care about, uh, the contents of for this exercise.

So we have the normal vector distance thing in here, right? And you know, these numbers are all just wonderful, right? They are all floating point numbers, I know, decimals. Nailed it. Uh, but then now then we have our boost column and all of these are, I mean, we don’t, we don’t have any, we didn’t hit on index or like maybe index got screened out past the top 15. Like maybe index didn’t help like the index one didn’t help. Cause like what we did up here, uh, was say if, you know, um, uh, minus case when title, like indexes, remember we have to, we want, we want to get these numbers lower.

Right. So we’re subtracting. This is not boosting by adding onto the vector distance. Cause that would make it further apart or make, make the things seem further apart in the numbers. Uh, we’re doing this by, um, subtracting the, like add relevance to certain keywords. So if the title is like index or the title is like performance or the title is like slow, then we would like have these multipliers, uh, get subtracted so that we could see more stuff.

So, um, like everything that we found at least because we, we matched conceptually, but the vector search very close to performance, everything was either a three or zero, right? So like, like a lot of these are three, three, three, and then a lot of them are zero, zero, uh, zero, zero and stuff. But you can see how the, the semantic distance got adjusted by the, uh, by the boost, right?

So like the one that started at 1077 got adjusted down to triple sevens, right? Uh, the one that started at 845 got boosted by nothing and stayed at 845. And then the one that started at 1358 got boosted by 0.03 and adjusted down to 1058.

So this is just another way of like helping like the vector distance stuff, like get some more, um, like contextual, like, you know, words in play and like adjust the score a little bit based on certain things that are very closely related, like you might know about from domain knowledge stuff. Uh, but you know, the vector similarity search is just like, I got these floating points over here and I got these floating points over here and I got to figure out how close these floating points are.

It’s not thinking about like, you know, additional, like, like, Oh, well, if I had this other floating point, if I thought about this other floating point, I, I could, I could really, I could really get some better searches in here or some better search results in here. So this is just another way of doing that anyway. It’s a lot of stuff to think about with this, isn’t there?
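The boost mechanics described above can be sketched in Python. The boost terms, the 0.03 weights, and the row data here are illustrative assumptions meant to mirror the numbers mentioned in the video, not the actual T-SQL from the course.

```python
# Keyword boost sketch: subtract a small amount from the vector distance
# when the title contains a boost term, so boosted rows sort closer.
# Lower distance = more similar, which is why we subtract, not add.
BOOSTS = {"index": 0.03, "performance": 0.03, "slow": 0.03}

def adjusted_distance(distance, title):
    boost = sum(w for term, w in BOOSTS.items() if term in title.lower())
    return distance - boost

# Hypothetical rows of (title, raw vector distance).
rows = [
    ("Why is my query slow?", 0.1077),
    ("Fixing deadlocks", 0.0845),
    ("Performance tuning basics", 0.1358),
]
ranked = sorted((adjusted_distance(d, t), t) for t, d in rows)
```

With these made-up numbers, 0.1077 drops to the triple sevens (0.0777), 0.0845 matches nothing and stays put, and 0.1358 comes down to 0.1058, so the boosted row jumps to the top of the ordering.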

Well, maybe someday, right? I don’t, I don’t know if anyone’s on SQL Server 2025. Yeah, they did.

They did just release a cumulative update one with a shockingly low number of, uh, fixes in it. Unfortunately, there were also no, uh, no, it was not graduation day for any of our preview features. So we’re still, still stuck there.

Anyway, that’s probably good for this one. Thank you for watching. I hope you enjoyed yourselves. I hope you learned something and I hope that you will use this, this wonderful, fabulous coupon code, uh, up, up, up here to buy the entire course. And, uh, I don’t know, maybe pay rent or a electric bill or phone bill or something.

I got all sorts of stuff I gotta do. All right. All right.

Anyway, thank you for watching. I’ll see you tomorrow.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

Get AI-Ready With Erik: Full Text Keyword Search In SQL Server

Get AI-Ready With Erik: Full Text Keyword Search In SQL Server


Video Summary

In this video, I dive into the world of vector search in SQL Server 2025 and its limitations. Starting off with a discussion on why vector search might not be ideal for precise queries like searching for specific versions of SQL Server (e.g., 2008 R2), I highlight how these searches can return irrelevant results due to the nature of vector matching. Moving forward, I explore other challenges such as understanding negations and how certain terms are interpreted by vector search algorithms. To address some of these issues, I propose using full text search—a tool that better understands logical operators like “not”—as a complementary method for enhancing vector search results. While full text search isn’t without its drawbacks, it offers a practical solution to refine searches in SQL Server 2025, making the overall process more robust and accurate.

Full Transcript

Erik Darling here with Darling Data, and we are going to get back into our SQL Server 2025 AI Preparedness Tour. This is, of course, in support of my new album, Get AI Ready with Erik. If you are interested in purchasing this album, not perching on this album, that would be uncomfortable for everyone. You can do that at the link up there. The blue thing, it starts with HTTPS. I realize there’s some other blue things down there, but there’s definitely only just one blue thing way at the top. That link is also down in the video description where you can very conveniently click on it and change your life. So, we’ve talked about a bunch of vector-y stuff, so we’re going to continue on with that. And the thing that we have to, you know, first go through, and I mean, I’m going to say something unpopular at the end of this. I know you might be shocked by that, but I’m going to say something unpopular at the end of this one. But up front, vector search is not good for some things. It lacks precision, one might say. So, if, you know, let’s say you care desperately about SQL Server 2008 R2 for some reason, I don’t know, it’s just, it makes for a good example.

But, and you vector search for it. R2 isn’t like a big, significant, vector-y thing, right? Like, the vector sounds like, oh, R2, D2, right? So, it treats 2008 and 2008 R2, because like SQL Server 2008, that is like, kabow! But R2 is like, not so much. So, let’s run this query against our database. And we didn’t, like, normally, the first thing I do before I run a query is make sure execution plans are turned on, but that doesn’t matter here. So, that’s good for us, right? We don’t need to worry so much about that. We can just run our query and not care about the execution plan, at least not yet anyway.

But if we look at the results here, right? And using my lovely SQL Server SSMS 22 result zooming, look what we get back here. Like, sure, the top result is SQL Server 2008 R2, but with a distance of zero, I’m pretty sure that was the post title that we searched for, right? Because we didn’t filter out the exact search. Well, I don’t know. Actually, no. Yeah, no, actually, that won’t matter here. Good.

Because sometimes with these, you know, you have to, like, if it came from an original post, you should screen out the original post because you don’t want to match your, like, the original post to, like, the, like, candidate posts. But that doesn’t matter for this one. So, we got a bunch of, but if you look at this column, there are a lot of no’s in here, right? There’s, like, five in a row and then a sixth one. What is that? Six in a row? One, two, three, four. It looks like five, six, and then seven, eight, nine, ten.

Yeah. So, yeah, like, six out of the ten results that we got back were not 2008 R2. They were just SQL Server 2008. And that does not bode terribly. That may feel like it doesn’t bode terribly well for vector search. But the thing you need to understand is that, like, vector search is not meant to be, like, the end-all, be-all replacement search for everything else.

There are certain things that you might have to do to bolster searches like this. And we’ll talk about an unpopular option for that at the end here. Another thing that vector searches do not have a very good understanding of is the word not.

Which, I don’t know, if you’ve ever talked to an LLM, like, I don’t know, those things seem to do okay with stuff here. Like, I am not interested in buying a Lamborghini. I am interested in buying a different kind of sports car.

Like, I don’t know, I’m pretty sure they would figure out. But, you know, they’re probably using a whole different set of, I don’t know, maybe smarter, better things inside of them. They can understand not, right?

But if we were to search something like this, right, and we were to say connection string not related to, like, not entity framework. I want a different connection string. I want, like, a Python connection string or something, right?

You should probably say that. But this, again, just makes it for a pretty good example. Where, if we run this query and we look at what we get back, well, we get problems with entity. I mean, this one says EF4.

So, I mean, it doesn’t exactly match to identity. But it’s still that. This one doesn’t have entity in it. But, you know, I don’t know that that’s going to help. These ones are all entities.

And, you know, I mean, this one is still ADO.net. Maybe that’s close enough to entity framework. But at least in my mind, they’re basically the same thing, which is untrue. But, you know, in my mind, I’m just like, you do that stuff?

Cool. I don’t, I don’t, I don’t. You need database help. You should hire me for that. But you can see that a lot of stuff in here does come back as entity framework related. And this is probably another good example of computers maybe not understanding strings too well.

Right? Because this one is EF4. And we were just searching for entity. But I’ll be damned if I’m going to list out all the EFs in this query to search for. So, you know, it’s close enough.

I can still see stuff got weird in there. So the unpopular thing and the thing that you might need if you are going to do vector search in SQL Server is one of my least favorite things in SQL Server.

Both from a, you know, from a performance perspective, certainly. But at least, you know, like, like I used to do a lot of work on a database product called Relativity. It’s a e-discovery product.

I used to be a DBA at a shop that had them. And for a long time, like they’re like early, much earlier. I’m sure it’s much better now. Like way earlier on, they were using full text search for like document search stuff. Because it’s all like lawyers searching through cases, like trying to sue people and find documents and stuff like that.

So there’s a lot of full text search in that. And my, like, my head goes to terrible places when I think about full text search. Because there was like terabytes and terabytes of data in those things.

But, you know, full text search, for what it’s worth, does understand not. Because you can say not contains entity. And you can say contains connection string when you want to search for stuff.

And boy, howdy, does entity framework not show up in this one, right? So like at least, at least it understands not contains, right? We can say not contains and we can skip the entity framework stuff, right?

Or at least entity, right? Or whatever we did. Entity. Yeah, entity. So we are able to get rid of entity using full text search. But there is, of course, a lot of, you know, sort of backroom pain involved with full text search.

Despite its, I guess, advanced understanding of not. So, you know, you could choose to go this path, you know, which kind of sucks, right? It’s maybe not the greatest thing in the world.

However, it could be used to bolster vector search in some places. And, you know, I guess your alternative, if you’re like, you know, full text search, full text that, you could always go back to the bad old days of, you know, double wildcard text searching and say, and title not like percent entity percent. With, of course, the correct unicode thing in front of the string if it is unicode.

Because one should always mind their data types. All right. Well, that hurt to say.

Anyway, that’s about good here. If you’re looking at the top there, you see that I have four more SSMS tabs open. So we have four more tabs of things to talk about. Well, three more tabs of things to talk about at least this week.

And then we will start afresh next week with office hours, getting AI ready and all sorts of other things. All right. Anyway, that’s probably about good here.

Thank you so much for watching. I hope you enjoyed yourselves. I hope you learned something. And I will see you in tomorrow’s video where we will be even more ready for AI in SQL Server 2025.

Video Summary

In this video, I dive into the world of vector search in SQL Server 2025 and its limitations. Starting off with a discussion on why vector search might not be ideal for precise queries like searching for specific versions of SQL Server (e.g., 2008 R2), I highlight how these searches can return irrelevant results due to the nature of vector matching. Moving forward, I explore other challenges such as understanding negations and how certain terms are interpreted by vector search algorithms. To address some of these issues, I propose using full text search—a tool that better understands logical operators like “not”—as a complementary method for enhancing vector search results. While full text search isn’t without its drawbacks, it offers a practical solution to refine searches in SQL Server 2025, making the overall process more robust and accurate.

Full Transcript

Erik Darling here with Darling Data, and we are going to get back into our SQL Server 2025 AI Preparedness Tour. This is, of course, in support of my new album, Get AI Ready with Erik. If you are interested in purchasing this album, not perching on this album, that would be uncomfortable for everyone. You can do that at the link up there. The blue thing, it starts with HTTPS. I realize there’s some other blue things down there, but there’s definitely only just one blue thing way at the top. That link is also down in the video description where you can very conveniently click on it and change your life. So, we’ve talked about a bunch of vector-y stuff, so we’re going to continue on with that. And the thing that we have to, you know, first go through, and I mean, I’m going to say something unpopular at the end of this. I know you might be shocked by that, but I’m going to say something unpopular at the end of this one. But up front, vector search is not good for some things. It lacks precision, one might say. So, if, you know, let’s say you care desperately about SQL Server 2008 R2 for some reason, I don’t know, it’s just, it makes for a good example.

But, and you vector search for it. R2 isn’t like a big, significant, vector-y thing, right? Like, the vector sounds like, oh, R2, D2, right? So, it treats 2008 and 2008 R2, because like SQL Server 2008, that is like, kabow! But R2 is like, not so much. So, let’s run this query against our database. And we didn’t, like, normally, the first thing I do before I run a query is make sure execution plans are turned on, but that doesn’t matter here. So, that’s good for us, right? We don’t need to worry so much about that. We can just run our query and not care about the execution plan, at least not yet anyway.

But if we look at the results here, right? And using my lovely SQL Server SSMS 22 result zooming, look what we get back here. Like, sure, the top result is SQL Server 2008 R2, but with a distance of zero, I’m pretty sure that was the post title that we searched for, right? Because we didn’t filter out the exact search. Well, I don’t know. Actually, no. Yeah, no, actually, that won’t matter here. Good.

Because sometimes with these, you know, you have to, like, if it came from an original post, you should screen out the original post because you don’t want to match your, like, the original post to, like, the, like, candidate posts. But that doesn’t matter for this one. So, we got a bunch of, but if you look at this column, there are a lot of no’s in here, right? There’s, like, five in a row and then a sixth one. What is that? Six in a row? One, two, three, four. It looks like five, six, and then seven, eight, nine, ten.

Yeah. So, like, six out of the ten results that we got back were not 2008 R2. They were just SQL Server 2008. And that may feel like it doesn’t bode terribly well for vector search. But the thing you need to understand is that vector search is not meant to be the end-all, be-all replacement for every other kind of search.
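For reference, the kind of query being run here can be sketched like this. This is only a sketch: the table, the embedding column, the vector dimension, and the model name are all hypothetical stand-ins for whatever you have set up.

```sql
/*
Sketch of a vector similarity search in SQL Server 2025.
dbo.Posts, the Embedding column, vector(768), and the ollama
model name are all hypothetical; swap in your own objects.
*/
DECLARE @search_vector vector(768) =
    AI_GENERATE_EMBEDDINGS(N'SQL Server 2008 R2' USE MODEL ollama);

SELECT TOP (10)
    p.Title,
    distance =
        VECTOR_DISTANCE('cosine', @search_vector, p.Embedding)
FROM dbo.Posts AS p
ORDER BY
    distance;
```

The exact-match row coming back with a distance of zero is expected behavior, since the distance between a string's embedding and itself is zero.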

There are certain things that you might have to do to bolster searches like this. And we’ll talk about an unpopular option for that at the end here. Another thing that vector searches do not have a very good understanding of is the word not.

Which, I don’t know, if you’ve ever talked to an LLM, like, I don’t know, those things seem to do okay with stuff here. Like, I am not interested in buying a Lamborghini. I am interested in buying a different kind of sports car.

Like, I don’t know, I’m pretty sure they would figure out. But, you know, they’re probably using a whole different set of, I don’t know, maybe smarter, better things inside of them. They can understand not, right?

But if we were to search something like this, right, and we were to say connection string not related to, like, not entity framework. I want a different connection string. I want, like, a Python connection string or something, right?

You should probably say that. But this, again, just makes for a pretty good example. Where, if we run this query and we look at what we get back, well, we get problems with entity. I mean, this one says EF4.

So, I mean, it doesn’t exactly match to entity, but it’s still that. This one doesn’t have entity in it, but I don’t know that that’s going to help. These ones are all entities.

And, you know, I mean, this one is still ADO.net. Maybe that’s close enough to entity framework. But at least in my mind, they’re basically the same thing, which is untrue. But, you know, in my mind, I’m just like, you do that stuff?

Cool. I don’t, I don’t, I don’t. You need database help. You should hire me for that. But you can see that a lot of stuff in here does come back as entity framework related. And this is probably another good example of computers maybe not understanding strings too well.

Right? Because this one is EF4. And we were just searching for entity. But I’ll be damned if I’m going to list out all the EFs in this query to search for. So, you know, it’s close enough.

I can still see stuff got weird in there. So the unpopular thing and the thing that you might need if you are going to do vector search in SQL Server is one of my least favorite things in SQL Server.

Both from a performance perspective, certainly. But, you know, I used to do a lot of work on a database product called Relativity. It’s an e-discovery product.

I used to be a DBA at a shop that had it. And for a long time, way earlier on (I’m sure it’s much better now), they were using full text search for document search stuff. Because it’s all lawyers searching through cases, trying to sue people and find documents and stuff like that.

So there’s a lot of full text search in that. And my, like, my head goes to terrible places when I think about full text search. Because there was like terabytes and terabytes of data in those things.

But, you know, full text search, for what it’s worth, does understand not. Because you can say not contains entity. And you can say contains connection string when you want to search for stuff.

And boy, howdy, does entity framework not show up in this one, right? So like at least, at least it understands not contains, right? We can say not contains and we can skip the entity framework stuff, right?

Or at least entity, right? Or whatever we did. Entity. Yeah, entity. So we are able to get rid of entity using full text search. But there is, of course, a lot of, you know, sort of backroom pain involved with full text search.
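The shape of that query is simple enough to sketch. This assumes a full text index already exists on a hypothetical dbo.Posts table; the names here are made up.

```sql
/*
Full text search understands negation, unlike vector search.
Assumes a full text index on dbo.Posts(Title); names are hypothetical.
*/
SELECT
    p.Title
FROM dbo.Posts AS p
WHERE CONTAINS(p.Title, N'"connection string"')
AND   NOT CONTAINS(p.Title, N'entity');
```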

Despite its, I guess, advanced understanding of not. So, you know, you could choose to go this path, you know, which kind of sucks, right? It’s maybe not the greatest thing in the world.

However, it could be used to bolster vector search in some places. And, I guess, as an alternative, if you’re like, you know, full text search? Full text THAT. You could always go back to the bad old days of double wildcard text searching and say, and title not like percent entity percent. With, of course, the N prefix in front of the string if it is unicode.
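That fallback would look something like this (dbo.Posts and Title are stand-in names):

```sql
/*
The bad old days: double wildcard searching, with the N prefix
on the string literals because Title is assumed to be nvarchar.
Leading wildcards mean this can't seek, so expect a scan.
*/
SELECT
    p.Title
FROM dbo.Posts AS p
WHERE p.Title LIKE N'%connection string%'
AND   p.Title NOT LIKE N'%entity%';
```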

Because one should always mind their data types. All right. Well, that hurt to say.

Anyway, that’s about good here. If you’re looking at the top there, you see that I have four more SSMS tabs open. So we have four more tabs of things to talk about. Well, three more tabs of things to talk about at least this week.

And then we will start afresh next week with office hours, getting AI ready and all sorts of other things. All right. Anyway, that’s probably about good here.

Thank you so much for watching. I hope you enjoyed yourselves. I hope you learned something. And I will see you in tomorrow’s video where we will be even more ready for AI in SQL Server 2025.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

SQL Server Performance Office Hours Episode 49

SQL Server Performance Office Hours Episode 49


Query Store shows this plan has the lowest average duration, so it must be the best plan. Should we force it?
One of my queries uses an eager index spool operator that’s costing 45% of the query. I understand it’s building a temporary index, but I can’t figure out what’s triggering it – there’s no Halloween protection needed and no self-joins. The query runs fine if I break it into temp tables. Is there a way to see what the optimizer is trying to protect against, or how to prevent the spool without rewriting?
Let’s say your friend was hired as the sql developer/DBA for a very mature application. This application uses a model where each client has their own database, but each database has the same schema. Most of these databases are hosted in Azure SQLDB. However, many top-paying clients host their own server/on-prem sql and refuse to upgrade past 2014/2016. Your friend notices a lot of transformations done in procedures are way quicker using some of the features/functions in newer versions. Would your friend be an idiot for creating two separate variables, one with the sql statement for the old versions, and one for the new, then executing the appropriate one based on the SERVERPROPERTY(‘Edition’) value? How would you handle this? Also, will you be my friend? Thanks, from [hopefully] your new friend.
Postgres and Oracle have skip scans. Why Doesn’t SQL Server?
I am interested in your AI course but my company probably won’t be on SQL Server 2025 for a while. How relevant will this content be in a year or two? The AI world seems to change rapidly.

To ask your questions, head over here.

Video Summary

In this video, I dive into some interesting questions submitted by viewers during my Office Hours session. I address topics ranging from Query Store optimization to handling different database environments and SQL Server versions. Whether you’re curious about forcing a plan that shows as having the lowest average duration or dealing with eager index spool operators in your queries, there’s something for everyone. Additionally, I explore strategies for managing code deployments across multiple server editions, suggesting approaches like using environment variables or parent-child stored procedures to handle different SQL Server versions without complicating your existing codebase too much. The video also touches on the absence of skip scans in SQL Server compared to other database systems and why this feature might not be as critical due to how SQL Server optimizes query plans. As always, I encourage viewers to engage by submitting their own questions or becoming a supporting member to help keep the content relevant and useful for all SQL enthusiasts out there!

Full Transcript

Erik Darling here with Darling Data, and we are in the midst of a Monday. But I suppose that’s good news for you, because that means you get to experience yet another Office Hours. I think we are coming real close to closing in on episode 50, which I will do absolutely nothing to commemorate.

But anyway, let’s talk through our stuff. I will answer five user-submitted questions to my Office Hours link. Now, if you look down in the video description, you will find all sorts of helpful links.

And not to spend too long here, but I recently spent some time redoing my consulting offerings, so there are some new ones in there for folks who might need different sorts of engagements than the typical blocks of hours that I was selling in the past.

Those are still available, but I do have some new stuff if you want to check that out. I’ve also added a lot of training and redone some of the training offerings in that you can now buy them as subscription products at the monthly or yearly rate. So if you are on the fence about spending some chunk of money all at once, you can now spend a smaller chunk of money, either on a monthly or yearly basis.

You can also use the links below to become a supporting member of the channel. If you just like all the stuff that I talk about, but you want to spend like, I don’t know, four bucks a month just to buy me half a New York espresso here, you can do that. There’s also a very helpful link down the bottom if you want to ask me office hours questions for one of these things.

So if you have a SQL Server question and you’re like, boy, it would be cool if Erik Darling answered that on YouTube, you can submit one and have that happen. And as always, please do the old like, subscribe, tell a friend, all that stuff. This training page is going to get fuller and fuller, at least until things start happening.

I will be at DataTune in Nashville. So, for all of these, I’ll just get it out of the way: I have pre-cons at all of these.

I will be teaching my advanced T-SQL material at all of these. That’s DataTune, Nashville, March 6th and 7th. Data Saturday, Chicago, March 13th and 14th.

SQL Day, Poland, May 11th and 13th. And Data Saturday, Croatia, March 6th and 7th. I should probably fix the alignment on those a little bit, but you know what? I kind of like them staggered some.

I think it gives it a dynamic vibe. It’s like, whew. It’s just misaligned. That’s my fault. Anyway, we’re trudging through January here in fabulous New York City. So we’ll get there someday.

Drunken marshmallows. Data marshmallows. Anyway, let’s answer some of these questions. Here we go.

All right. Query Store shows this plan has the lowest average duration, so it must be the best plan. Should we force it? Well, that’s an interesting question, because I would be tempted to try it out and see how it goes. So should you force it?

Depends on a few things. I’ll talk through that in a moment. But, you know, you don’t need my permission to try stuff. Right.

You can always right click and force it. And if things go well, you can just be like, hey, I tuned a query. And if things don’t go well, you can always unforce it. Right.

So it could totally work out in your favor. Right. It could absolutely just be like, cool, it’s done. But you might need to be a little bit careful, because you might need to investigate some other query metrics around that. I would prefer if you used my stored procedure, sp_QuickieStore, for this type of analysis.

Because, you know, searching by query ID in Query Store can’t be done. Right. You just can’t do it.

So if you use sp_QuickieStore and you search for that query ID and see all the different plans, what you might find is some differing metrics between them. There might be something around CPU and duration. Well, obviously duration is different, because you said this one has the lowest, but you may find some other stuff to be interesting.

Some of the other plans may be parallel, and some of the other plans may produce way more rows. There is a row count column in Query Store, which can be very helpful for this. It could be that your query has the lowest duration for a very small number of rows, like 10 or a hundred or a thousand, and other times this query runs, the number of rows that it deals with is much higher.
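A minimal sketch of that kind of lookup with sp_QuickieStore. The database name and query id here are invented examples, and the parameter names are as I recall them from the procedure’s help output:

```sql
/*
Pull every plan Query Store has for one query id, so you can
compare metrics across plans. Values below are hypothetical.
*/
EXECUTE dbo.sp_QuickieStore
    @database_name = N'StackOverflow',
    @include_query_ids = '13377';
```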

In which case that low duration query, which might just be some tiny little serial plan that’s got a couple of nested loops joins and whatever, just finds its thousand rows and moves on. Well, there’s a reason it’s the lowest duration, right? Because it was the least physical effort in the database.

You might find that other times that query ran, there were many more rows involved. So what you could do is a couple of things. I have another stored procedure at my GitHub repo, which is at code.erikdarling.com, and it is called sp_QueryReproBuilder.

It has almost the same plumbing as sp_QuickieStore, but it has a different purpose, and the purpose is to give you a runnable query based on Query Store stuff. The nice thing about Query Store is that it will give you an execution plan and the query that produced that plan.

And there’s all sorts of stuff in that query plan that can help you reproduce the query. If your query is using formal parameters from your stored procedure, and you’re not using recompile hints or temp tables or local variables, then it’ll set up that query to run with all of the appropriate settings that your query ran with when it hit the database.

What you will find is an executable version of the query, with all of the different parameter values that were used for it. So you could experiment a bit with that, and see if caching a plan for that low duration thing and then running other parameter values for that query maintains the same low duration. That is the approach I would take, right?

Because that might make life easy. Like I said, though, there are some limitations. If your query involves a temp table or a table variable, I can’t tell you what rows ended up in there. I can’t even tell you what the definition of it is.

If you’re using local variables, the values for those disappear in cached plans. And if you’re using recompile hints, then I will do my best to pull the literal values that were used out of the query plan, but you will have to figure out which parameters they belong to. Because once you have recompiled, I lose that.

I lose that sense and sensibility. So that’s the approach I would take there. All right. One of my queries uses an eager index spool operator.

That’s costing 45% of the query. I understand it’s building a temporary index, but I can’t figure out what’s triggering it. There’s no Halloween protection needed and no self joins.

The query runs fine if I break it into temp tables. Is there a way to see what the optimizer is trying to protect against, or how to prevent the spool without rewriting? Oh, well, it’s an eager index spool, which means one of two things.

And neither of them is Halloween protection. Usually, table spools are what get used for Halloween protection, and that’s of course to prevent sort of infinite loop modifications when a modification query runs.

What you would want to do is try one of two things with your query, on whichever table the eager index spool is being built off of. The first thing you do is add a FORCESEEK hint to your query, and see if there is an actual seekable index on that table. If you get an error, then there isn’t one.

If you get an error, what you need to do is look at the eager index spool, because there will be some properties of the eager index spool which will help you create a permanent index on the table, so that SQL Server does not need to build an index every single time your query runs. So just to make sure we covered that: the eager index spool is not for Halloween protection.

It is strictly a performance optimization. Usually, well, actually, not just usually, it’s on the inner side of a nested loops join. Other things you could try doing if you felt like it: say, an OPTION (HASH JOIN) hint. That might be one way to get around it, but I can’t guarantee that will be faster.

Usually the two things that you do are: see if SQL Server is missing an opportunity to use an index that you have, with the FORCESEEK hint. If it’s not, then creating a permanent index will usually get rid of the eager index spool. But sometimes you might find that you still need the FORCESEEK hint, because the optimizer is kind of a dumb-dumb at some points.
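As a sketch, turning spool properties into a permanent index looks like this. The table and column names are invented; the real ones come from the spool’s Seek Predicates (key columns) and Output List (included columns) in the plan:

```sql
/*
Hypothetical example: the eager index spool's Seek Predicates become
the key columns, and its Output List becomes the included columns.
*/
CREATE INDEX
    replace_the_eager_index_spool
ON dbo.SomeTable
    (SeekColumn)
INCLUDE
    (OutputColumnOne, OutputColumnTwo)
WITH
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);
```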

All right. Let’s see what we got here. This is a long one.

Good Lord. Let’s say your friend was hired as the SQL developer/DBA for a very mature application. All right. Hot, mature applications want you to develop for them, or something.

This application uses a model where each client has their own database, but each database has the same schema. Wild. Never heard of that before.

Most of these databases are hosted in Azure SQL DB. God help you. I wish I had some holy water to throw on you. However, many top paying clients host their own server/on-prem SQL and refuse to upgrade past 2014/2016. Your friend notices a lot of transformations done in procedures are way quicker using some of the features/functions in the newer versions.

Would your friend be an idiot for creating two separate variables, one with the SQL statement for the old versions and one for the new, then executing the appropriate one based on the SERVERPROPERTY('Edition') value? How would you handle this? Also, will you be my friend?

Thanks from, hopefully, your new friend. Well, I don’t know. You might talk too much to be my friend. I can be honest with you. It’s a lot of words. The beginning and end of that are very far apart. I’m exhausted now.

So, I don’t know if I would handle this in the code, at execution time. I would probably be more in favor of this being a code deployment strategy. You know, you’re rolling out code and you’re like, oh, you’re on Azure SQL DB where you have access to new stuff, but it’s still Azure SQL DB.

So it sucks. Then you would roll out the specific code for that. I would probably not want to make this a runtime decision in every stored procedure or query that’s executing.

Hopefully you’re using stored procedures, where you can change your deployment process to roll out a different code base to a different version or edition of SQL Server. But I don’t think I would want to do this in the stored procedure.

Because you’re just thinking of very, very long, complex stored procedures, depending on how long and complex your stored procedures are currently. I would be a little scared of that. Even with dynamic SQL, I would be a little frightened of that.

You know, IF branches and query plans have very, very strange side effects, unless you bury things. So, if you just wanted to straight deploy all the code no matter what, one way of handling that would be to change the model to a parent/child stored procedure thing. And then be like, hey, run this parent stored procedure, take all the parameters that we would take from whatever the call is.

And then execute a child stored procedure based on environment variables. That would be one way of handling it where you don’t have to tinker with the actual deployment process too much, and you still get sort of the runtime effect of things, but without making your stored procedures insanely long and complex, and without dealing with a lot of the repercussions of IF branching in stored procedures.

But like I said, I wouldn’t want giant IF branches in the stored procedure dictating which path to follow for a bunch of queries. That’s not a good time. So, one of those other two things: either deployment-specific code based on environment variables, or a parent/child caller situation with your stored procedures, where everyone gets all the code, but only one version of the stored procedure will run based on those environment variables.
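A bare-bones sketch of the parent/child idea. The procedure names and the version check are made up; you would key off whatever server property fits your environment:

```sql
/*
Hypothetical parent/child pattern: everyone gets all the code,
but the parent decides which child runs on this server.
13 = SQL Server 2016; adjust the cutoff to taste.
*/
CREATE OR ALTER PROCEDURE
    dbo.TransformStuff
(
    @ClientId integer
)
AS
BEGIN
    IF CONVERT(integer, SERVERPROPERTY('ProductMajorVersion')) >= 13
    BEGIN
        EXECUTE dbo.TransformStuff_New @ClientId = @ClientId;
    END
    ELSE
    BEGIN
        EXECUTE dbo.TransformStuff_Old @ClientId = @ClientId;
    END;
END;
```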

All right. Uh, Postgres and Oracle have skip scans. Come on.

Postgres just got skip scans. Don’t act all highfalutin with me, pal. Postgres and Oracle have skip scans. Why doesn’t SQL Server? I don’t know.

They haven’t put it in there. Maybe they will someday. Skip scans are cool, though, and I do sometimes wish SQL Server had them, because what a skip scan means is you would have to worry a whole lot less about a lot of different nonclustered index permutations. So, let’s use either the Posts or the Votes table in the Stack Overflow database. In the Posts table, you have a PostTypeId.

It’s highly non-unique, because it’s just post types. There are like eight of them, and most of them are one and two, which are questions and answers.

And in the Votes table, you have a VoteTypeId, and most of them are one, two, or three, which are marked as answer, upvote, and downvote. So, if you had an index that leads with either one of those, but then you have another column in the table... let’s just make it easy.

Let’s say it’s on the Posts table and you have an index on PostTypeId and Score, and your query is just WHERE Score = 1000, right? Just to throw an example out there.

If you don’t have an index that leads with Score, you’re looking at having to scan an index in order to figure out what score is 1000. And in the typical B-tree index, that means you’re scanning the entire thing, right? Because your index leads with PostTypeId.

So your index is ordered by PostTypeId. And then within all the PostTypeId ones, Score is ordered in ascending order from, let’s just say, zero to a five kabillion, right?

What a skip scan allows you to do is, essentially, it gives you almost like a double seek thing, where you break the index into chunks. The first thing it would do is, let’s just say, replace it with PostTypeId = 1, and then find all the thousands for PostTypeId = 1.

And then instead of just scanning and scanning, the next thing you would do is skip, and jump to PostTypeId = 2. Then you would find all the Score = 1000 rows for PostTypeId = 2, and then three and four and five, until you finish and get to eight.

So you just kind of chunk an index up, and you do almost like a seek, and then a mini-seek to what your actual predicate is, and then another one.

It’s almost like if you did a UNION ALL, right? One way to think about it is, instead of saying Score = 1000, it would be WHERE PostTypeId = 1 AND Score = 1000, or PostTypeId = 2 AND Score = 1000. It’s almost like you just UNION ALL those queries together.

So skip scans would be awesome to have, right? They’re pretty cool. Right now you can write some weird recursive CTE queries to emulate them, but it would be great if we just had them. It’d be a very powerful thing, and the lack of them is one of a lot of reasons why so many people over-index tables in SQL Server.

Because we don’t have nice things like that. All right.
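The UNION ALL mental model from above, written out against the hypothetical index on dbo.Posts (PostTypeId, Score):

```sql
/*
Rough emulation of what a skip scan would do: one seek per
leading key value, stitched together with UNION ALL.
Table and values come from the Stack Overflow example above.
*/
SELECT p.Id FROM dbo.Posts AS p WHERE p.PostTypeId = 1 AND p.Score = 1000
UNION ALL
SELECT p.Id FROM dbo.Posts AS p WHERE p.PostTypeId = 2 AND p.Score = 1000
UNION ALL
SELECT p.Id FROM dbo.Posts AS p WHERE p.PostTypeId = 3 AND p.Score = 1000;
/* ...and so on, up through PostTypeId = 8 */
```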

You can be my friend. Look at you, with your nice pithy answers that allow me to talk about things I can sell. I like you. I am interested in your AI course, but my company probably won’t be on SQL Server 2025 for a while. How relevant will this content be in a year or two?

The AI world seems to change rapidly. You are right. The greater AI world changes very quickly. Every week there is a new most-powerful-reasoning-thinking-thing-yet that just blows the doors off everything that has happened before and is revolutionizing the AI world. Microsoft SQL Server ain’t that, right?

Microsoft SQL Server moves at a glacial pace. I mean, it’s never been terribly fast moving, but let’s not pretend that Fabric isn’t draining every valuable resource and dumping it into a useless cesspool. But anyway, with the stuff that I cover in my AI course, the only thing that would change in the material is that certain preview features may become generally available. Vector indexes and a couple of other things might go from preview to generally available.

And I will update the course content as that stuff changes. But the course content itself is focused largely on the best way to approach vector stuff in SQL Server, and how it works, and how it works is not going to change that much. And I do cover all the preview features in there.

So it’s not like you’re just going to be left going, "preview features, I don’t know." I do talk through all of the preview features in there: how they work, what they do, vector indexes, vector search, float16 for what it’s worth. I talk through all that stuff, and I will, again, update the material when it changes, if it changes, because it’s not guaranteed that anything will go from preview to generally available.

But really, it’s a lot of foundational material: how it works, what you should care about, the best way to do things. There’s a lot of stuff in there. It doesn’t make a difference if the wider AI world is changing.

How you use it in the context of SQL Server will not change. So you could buy it now, watch it, and be prepared for a year or two. Or you could buy it now and, in a year or two, watch it.

It doesn’t matter. Right? Uh, just buy it. Buy it. Arrgh. Come on. Rent is due. Buy the course.

All right. That’s probably good for me here. Thank you for watching. I hope you enjoyed yourselves. I hope you learned something. I hope you buy the course because rent is due. I will see you in tomorrow’s video where actually, you know what, I’m going to go, I’m going to be talking more about the new AI course. So I don’t know.

One thing leads to another, doesn’t it? One thing always just drags right into another. Monday to Tuesday to Wednesday to Thursday to Friday. Then you die.

All right. Thank you for watching.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

Get AI-Ready With Erik: A Little About Cross Table Searches



Video Summary

In this video, I delve into the exciting world of AI preparedness and explore a practical application: finding related content in real-time. Whether you’re building advanced chatbots using SQL Server’s vector capabilities or managing a Q&A site like StackOverflow, being able to suggest potential duplicates before submission can save time and effort for both users and moderators. I demonstrate how vector distance functions can help identify near-duplicates by embedding query vectors and comparing them against existing content, providing a likelihood score that filters out irrelevant results. While the performance isn’t always optimal due to row-by-row comparisons, this method offers significant improvements over keyword searches, especially in understanding the intent behind user queries. By combining similarity search with keyword or full-text search strategies, we can achieve more accurate and comprehensive results, ensuring that users get the most relevant information as quickly as possible.

Full Transcript

Erik Darling, Darling Data, continuing. Are you ready for some AI? AI preparedness, football preparedness, all very important things for us to prepare for. Today we’re going to talk a little bit about the idea of finding related content. Whether you are building the world’s greatest chatbot based on SQL Server’s vector capabilities, or you run a Q&A site like StackOverflow, let’s say someone asks a question and you want to show them some potential duplicates before they submit it. If you wanted to do this as they’re typing, you would probably need to use keyword or full text search to accomplish that, because generating an embedding as someone types a letter (and, I spelled that wrong, delete, delete, delete, delete) would be sloppy and messy. So we don’t want to necessarily do that. But let’s say someone submits a question, and you have a middle screen, right? You preserve all their work, but you also say, hey, these other things look pretty close to what you’re doing.

What you were looking for, I think, might be on the right track here. Maybe you should check some of these resources out before you send in your new thing and set a bunch of people running around trying to find answers for you, like, wait a minute, that reminds me of this thing. You could maybe do a slightly better job, right? Save them some time, save yourself some time, save the world some time. If we can’t save the world some time, what do we have computers for? So what we’ll do here is expand on just saying, hey, vector distance, whatever. Now we can maybe get a little feedback about what we’re doing. We’ll have a new column in our select list that’ll show the likelihood of the query being a near duplicate. We’re going to use a threshold that I’m going to say is arbitrary, in that it makes sense for what I’m doing here and the data I’m working with here.

In your real-life workload, query, and embedding life, you might choose a different number to decide how likely a duplicate something might be. And then we are also going to start filtering on this. Now, I’m going to be honest with you.

If you are coming to this from a DBA perspective, from a performance tuning perspective (which is, again, why I started looking at all this stuff, because I was like, hey, I bet this is going to need some performance tuning work), this predicate does not get pushed anywhere that might be useful.

This is not a SARGable predicate. SQL Server has to, again, compare this row by row, right? You have to take each row, feed it into the vector distance function, and compare the table embedding to the local variable. Which, again, from a cardinality estimation perspective, the local variable doesn’t make a difference here, right?

You can throw a recompile on this; it’s not going to help anything. SQL Server is not going to be like, oh, now I know better than a density vector guess, I’m going to give you a good cardinality estimate based on the histogram.

You don’t get that, so don’t get tied up on that. But of course, the vector distance function has to compare, again, in a row-by-row form: put this embedding in, compare it to this embedding.

Is that less than 0.30? We’ll find out when it runs, right? So the performance is not going to be awesome for that. Again, this is where the vector search and vector indexing stuff comes in.
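To make the row-by-row point concrete, here is a minimal, hypothetical sketch (in Python, not T-SQL) of what a non-SARGable vector distance predicate forces the engine to do: evaluate the function against every row, then filter. The titles and tiny three-dimensional "embeddings" are made up.

```python
# Brute-force distance filtering: every row is fed through the function.
# No seek, no early out -- the same shape as the Filter operator in the plan.
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

rows = [
    ("SQL queries taking too long", [0.9, 0.1, 0.0]),
    ("JavaScript array methods",    [0.0, 0.2, 0.9]),
    ("How can I speed up SQL?",     [0.8, 0.2, 0.1]),
]
query_vector = [1.0, 0.0, 0.0]

# The predicate runs once per row, after every row has been read.
matches = [title for title, emb in rows
           if cosine_distance(emb, query_vector) < 0.30]
```

A vector index would avoid touching every row; without one, the cost grows linearly with the table, which is exactly the behavior the execution plan shows.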

But those are still in preview, and Microsoft released a woefully incomplete implementation of all this at RTM and called it AI ready. Eh, wrong. Oh, boy.

What do we do with ourselves? But anyway, let’s run this query, and we get some results back. This was not terribly slow, because we don’t have a lot of data in here, and I have a pretty decent VM backing this thing up.

But what’s interesting, right, is we have this filter here. And this is where our predicate ends up for vector distance: less than 2.99E-01, right?

So we scan the whole posts table, we bring all million rows out of that, and we do the filtering of the vector distance function in there. You can see vector distance, cosine, et cetera, in this Filter operator.

Again, because it’s not SARGable. This is why things like pre-filtering matter, anything you can push down. I mean, there’s nothing we can really push to the post embeddings table, but for the posts table itself, there’s some pre-filtering we could do there to at least get some of the rows out.

But sometimes, if this were a regular nested loops join, the vector distance function ends up at the nested loops join. We get an adaptive join here. So all that does a thing.

Anyway, I just really wanted to show you that it wasn’t a SARGable predicate. We couldn’t scan the posts table with the vector distance function filtering out rows as we touch the table. But the results that we get back are at least filtered, distance-wise, to things that would make sense for us.

And the titles that we get back are pretty close to "how do I speed up my SQL queries?" But this is kind of a funny thing, because it’s "how do I speed up my SQL queries?" My SQL queries, or MySQL queries, right?

Because MySQL is a database, but "my SQL" could be anything, right? SQL could be SQL Server, Postgres, Oracle, DB2, Snowflake, you name it. SQL is in everything.

Is SQL the original AI? No, it’s SQL. But yeah, a bunch of these results are MySQL, MySQL. But we weren’t necessarily searching for anything specific to MySQL, the database. We were just thinking about "my SQL," SQL that I own.

So this is one of those things that similarity in vector search can miss, and why you might need to think about combining some alternate search strategies to give you the best of all worlds when you’re searching. Because not only do you need to think about similarity, you need to think about the things that similarity search might miss in some cases.

Certain contextual things might be really important, where you need to bring in keyword search or some other exact-match search to find more exact stuff. Or you might need to start adding in some sort of weighting to make sure that the quality of the search result is okay, too. So all of this stuff is about speeding up SQL queries, right?
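One hedged way to sketch the "combine search strategies" idea above: blend a pretend cosine similarity score with a simple keyword-overlap score, so exact terms can pull relevant rows up the ranking. The 70/30 weights, the scores, and the data are all assumptions for illustration, not a recommendation.

```python
# Hybrid ranking sketch: vector similarity plus keyword overlap.
def keyword_score(query, title):
    # Fraction of the query's words that appear in the title.
    q = set(query.lower().split())
    t = set(title.lower().split())
    return len(q & t) / len(q)

def hybrid_score(similarity, query, title, w_vec=0.7, w_kw=0.3):
    # Assumed weights: 70% vector similarity, 30% exact keyword overlap.
    return w_vec * similarity + w_kw * keyword_score(query, title)

query = "speed up sql queries"
candidates = [
    ("sql queries taking too long", 0.90),   # (title, pretend cosine similarity)
    ("speed up sql queries please", 0.80),
]
ranked = sorted(candidates,
                key=lambda c: hybrid_score(c[1], query, c[0]),
                reverse=True)
```

Even though the first candidate has the higher raw similarity, the exact-phrase match wins after blending, which is the kind of re-ranking a pure similarity search can’t do on its own.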

There’s all sorts of things in here that might not even show up otherwise. I think this is a great example: the search phrase was "how do I speed up my SQL queries?" If you were to take that and just apply pure keyword or full-text search semantics to this query, you probably wouldn’t see "SQL queries taking too long."

But the similarity search figures out that "how do I speed up my SQL queries" is pretty close to "SQL queries taking too long," right? That intent behind the words is what doesn’t come out in a lot of other search forms, and what vector search or similarity search really does bring out.

It can also be useful in other ways. Let’s say you’re on a Q&A site where people ask questions, like Stack Overflow, and you have swarms of completely unpaid slave labor moderators in there who have to deal with queues of things like duplicate question flags and all sorts of other stuff. This could be really useful to help them find duplicate questions.

Because, again, it’s reviewing the content of a thing beyond just keyword search stuff. If you were looking at "how do I speed up my SQL queries?" you might find an exact duplicate of the question body in "SQL queries taking too long" that wasn’t obvious from the title.

So if you have any sort of moderation thing happening, like they do on Stack Overflow (I don’t work for Stack Overflow), you might have something like this where moderators have to go look at stuff, and you might want to help them find, oh yeah, that is a duplicate of that.

Or, oh, okay, someone flagged it as a duplicate of this thing, but it’s really a duplicate of this other thing. So what we’ll do is this: I have a source post and a query vector.

We are going to find our source post here, and I’m going to show you exactly what we’re searching for. And then, just a way to screen for potential duplicates using a very similar search pattern to what we did before.

All right. So let’s pretend that we are unpaid moderators, and we are looking to find "how to undo the most recent commits in Git," and see what we get back. Now, are these great?

I mean, are these awesome across the board? No. But again, the similarity stuff can miss certain things.

So this does have "undo" and "Git" and "commit" in it. And this one has the word "recent" in it, right?

So it’s "how to undo the most recent commits in Git." But one thing that similarity search misses is negation. "Not." I wanted to undo the most recent commit, but this other one is about a commit that isn’t the most recent.

Similarity search will miss stuff like that. So for "how to undo the most recent commits in Git," the most similar thing we found is "Git: undo a commit that isn’t the most recent." Similarity search kind of barfs on that.

This one is a lot closer. I mean, if you actually read "undo last commit" against "how to undo the most recent commits," that second one is actually, in the literal sense, closer, even though the textual representation as float numbers scored it differently.

This one, this one seems closer. How to get back to the most recent version in Git. Maybe not particularly there.

"Undo a particular commit in Git that’s been pushed to remote repos." I don’t know about that one. "How can I undo the last git add / commit?" Yeah, that one’s probably closer there, too. So similarity search is not perfect.
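The negation blind spot can be illustrated with a toy example, using crude bag-of-words vectors rather than real embeddings (an assumption for illustration; real embeddings behave better, but the same failure mode shows up): adding "not" barely moves the vector, so the negated title still scores as extremely similar.

```python
# Why negation slips through similarity search: the negating words are a
# tiny fraction of the vector, so cosine similarity stays high even when
# the meaning flips.
import math
from collections import Counter

def bow_cosine(a, b):
    # Cosine similarity over simple word-count vectors.
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    words = set(ca) | set(cb)
    dot = sum(ca[w] * cb[w] for w in words)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb)

s1 = "undo the most recent commit in git"
s2 = "undo a commit in git that is not the most recent"
similarity = bow_cosine(s1, s2)   # stays high despite the flipped meaning
```

Nearly every word overlaps, so the score stays high; the one word that inverts the meaning barely registers, which is exactly why keyword or rule-based checks still earn their keep next to similarity search.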

But there’s a lot of stuff that similarity search might miss, or might not catch in a perfect way, that a human eyeball doing unpaid moderation on your private equity website just might catch and understand a little bit better. Anyway, thank you for watching. Hope you enjoyed yourselves.

I hope you learned something. Again, the full version of all this content is available for purchase with the AI Ready coupon. The link for this is fully expressed down in the video description. But if you head over to training.erikdarling.com, you can buy the Get AI Ready with Erik course for $100 off.

It’s a great deal. Because then you’ll be AI Ready far before anyone else is. Except me.

I was AI Ready before you. That’s how I got to do all this. All right. Thank you for watching.


Get AI-Ready With Erik: A Little About Finding Related Content



Video Summary

In this video, I delve into user search handling in AI systems, specifically focusing on how database administrators can manage and utilize user queries effectively. By leveraging embeddings generated by application layers, we explore a workflow where user searches are converted into vectors and stored in a searchable table. This allows for efficient similarity searches using vector distance functions to provide relevant results based on the user’s intent rather than relying on traditional keyword matching. I also discuss practical considerations such as ensuring unique search phrases through hashing, handling embedding data from application layers, and optimizing query performance with minimal impact on cardinality estimation.

Full Transcript

Erik Darling here with Darling Data, and we are going to continue our AI preparedness and talk a little bit about dealing with user searches. This is, of course, all brief snippets of the content that I have available in my Get AI Ready with Erik course. You can get it from training.erikdarling.com. Right now, it’s a hundred bucks off with the coupon code AIREADY. The link that you see floating above you in SQL Server Management Studio is down in the video description for your tender loving clicking. So we’ve got that going for us. Now, so far, what we’ve done is used existing embeddings as query vectors. We said, hey, is any of this other stuff similar to this thing? Which can be a useful thing to measure. It can be useful to pre-generate embeddings and say, is this thing similar to this other thing? But you might have users in your database. Right? Shocking. Crazy, I know. And we, the general DBA public, may need to deal with these users, and these users may have their own specific questions to ask. So what we’re going to look at here is a workflow where the application does the embedding for us: a user types a search query into some form in the front end, and then we’re going to pretend that a service in the application layer generated an embedding based on that user search query and stuck it in the database. And now it’s our job to take that user search query and see if there’s anything similar to it.

Right? You know, developers love their app layers. So we’re going to use app layers. Whatever. But this video will focus mostly on how you as a DBA might take those search phrases, stick them in a table, and then do a search using them with the vector distance function. So you likely want a table that looks something like this. And if I had to really design this thing and not just have a simple demo table, I’d probably want a user ID or something else where I could uniquely identify the user and their search queries, so when I go and get whatever someone sent in, I could find the right thing. But for us, this is good enough right now: we’ll have the general phrase that they sent in as their search text. We’re going to hash that phrase so that we can make sure that we only have unique stuff in there. If someone has already asked the same question, then we would want to say, hey, we already have that. We might even have cached search results somewhere in our app layer, and we might be able to use those instead.
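A minimal, hypothetical sketch of the hash-the-phrase idea (Python standing in for the table plus app layer): hash the normalized search text and treat the hash as the uniqueness key, so the same question only lands in the table once. SHA-256 is an assumption here; any stable hash behind a unique constraint would do.

```python
# Deduplicating user search phrases by hashing normalized text.
import hashlib

def phrase_hash(search_text):
    # Lowercase and collapse whitespace so trivially-different inputs match.
    normalized = " ".join(search_text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

stored = {}   # stand-in for the search-phrases table, keyed by the hash

def store_search(search_text, embedding):
    h = phrase_hash(search_text)
    if h in stored:
        return False          # duplicate: reuse cached results instead
    stored[h] = (search_text, embedding)
    return True

store_search("How do I make SQL queries faster?", [0.1, 0.2])
is_new = store_search("how do i make   sql queries faster?", [0.1, 0.2])
```

The second insert is rejected because, after normalization, it hashes to the same value, which is the cheap way to ask "have we already generated an embedding for this exact question?" before paying for another one.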

We will have the embedding that was generated, again, in the app layer. I’ve prefilled this table with the embeddings; we’re pretending that the app layer is doing some magic and inserting into this table, and all that other good stuff. So if we look at what is currently in our search phrases table, these are all searches that users have submitted. If we zoom in on this: how do I make SQL queries faster? JavaScript array methods. What is a foreign key? Python file handling. There’s a lot of stuff in here that we could probably use for, I don’t know.

These are all questions that people cared about. Loop without cursor SQL. How dare you? Cursors are great. So we’re going to start with something here. Just to make life a little bit simpler, we will declare a query vector local variable and assign an embedding to it. Normally, the local variable thing would be a bit icky. If you’re a skilled T-SQL practitioner, perhaps you’ve watched my other videos.

Perhaps you have bought my prior trainings, or attended some of them in person. You will know that local variables can often have deleterious effects on cardinality estimation. But in this case, it doesn’t matter, because it is not in a WHERE clause. It is just being fed into this vector distance function.

And it has no effect on cardinality estimation or anything else, so we don’t have to worry too much about that there. You could throw a recompile hint on this if you wanted; it would change absolutely nothing for you. But what we’re going to do is look up the phrase "how do I make my SQL queries faster?" We’re going to assign the embedding that was generated in the app layer for that search query, and then we’re going to find the top 10 posts that might be similar to it. So again, we are using the vector distance function, with cosine distance.

Here we are comparing the embedding in the post embeddings table to the query vector, and then we are ordering by the distance alias in ascending order so that we get back the most relevant results first. We could think about other ways of doing this; maybe score would be a valid thing to also bring in, so we get a little bit higher quality content or something. But for now, we’re just going to stick to finding purely what is the most similar.
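The TOP (10) plus ORDER BY distance pattern above can be sketched like so (Python, with invented titles and tiny two-dimensional vectors): compute the distance for every candidate and keep the k smallest, which is the same shape as TOP with an ascending ORDER BY.

```python
# Top-k nearest posts by cosine distance, the brute-force way.
import heapq
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

posts = [
    ("What exactly is a foreign key?", [0.9, 0.1]),
    ("Python file handling",           [0.1, 0.9]),
    ("Foreign keys, what do they do?", [0.8, 0.3]),
]
query_vector = [1.0, 0.0]

# Smallest distance first = most similar first, like ORDER BY distance ASC.
top_2 = heapq.nsmallest(
    2, posts, key=lambda p: cosine_distance(p[1], query_vector)
)
```

Every candidate still gets a distance computed, so this is O(rows) per query; a vector index is what would let you skip most of that work.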

If we look over here, we get some rather low-scoring things back. These might be the most similar things, like I was saying, but they might not be the highest quality things. We don’t even have a guarantee that a thing has an answer, right? Because we’re just comparing titles.

We’re just saying, is the title of this thing similar to the title of any other things? What other titles is this title similar to? And these titles are quite similar, aren’t they? "How optimize SQL query, make it faster." "How can I speed up the SQL query," twice: two different IDs with the same damn post title.

"Speed up the SQL query" in MySQL. Well, stop using MySQL. Don’t be stupid. "SQL query is taking too long," right? All this stuff is pretty similar. And this is what I think is the kind of neat thing about vector search: you can search based on what a user meant, what a user’s intent was.

And for DBAs, think of all the times you’ve yelled at developers for throwing in wildcard string searches, like WHERE title LIKE '%something%', or title LIKE this AND title LIKE that. Think of all the times you’ve screamed at them for this, and how you can now just say, well, we don’t have to do that anymore.

Now we can just take this magical embedding, compare it to this other magical embedding, and figure out how similar they are without having to do a bunch of wildcard string searches. It’s kind of a nice thing for DBAs, because you no longer have this question of, well, how are the developers going to figure out if this thing is like this or this or this or this without just ruining your life?

And giving you, making steam come out your ears. It’s not a good feeling. Trust me. I’ve had steam go in. I’ve had steam come out. It’s not.

It’s cleansing, but it’s painful. Let’s try another one using something similar. We’re going to take this query vector for "what is a foreign key?"

What is a foreign key? Why would I stick it in a lock if I know it’s not going to work? So it’s from a foreign country and foreign countries all have far different locking mechanisms than American locks do.

But if we run this query, now we’re going to do the same thing: get our search phrase and plug it into the vector distance function.

We are going to compare the embedding for our "what is a foreign key" search phrase to other things in the post table, and we’re going to get back some pretty similar stuff here.

Right? Right. What exactly is a foreign key? Foreign keys, foreign keys. What do they do for me? How can foreign keys benefit me?

And all this stuff is generally about foreign keys. This is without having to do, again, the LIKE '%foreign key%' thing, inevitably having to worry about case sensitivity or something like that. We get a bunch of stuff back about foreign keys.

Now, again, we have not established a high quality benchmark for this, because the scores for this are all over the place.

11, 1, 7, 15, minus 1, 0, 2. There’s some stuff in there where you might want some of the higher scoring stuff to show up a little bit higher in the search results. We’ll talk about that later, and how to weight things differently.

Because distance is a good indicator of similarity between two strings. The vectors and the embeddings are a good indicator of similarity, but they’re not a terribly good measure of whether the content after that is quality.

Like, like it could be completely wrong. It could be an AI generated response, which is full of malarkey. Right?

If you’ve ever asked an AI a question and gotten a bunch of stuff back where you’re like, none of that is right. None of that is a key insight. Thank you.

It doesn’t do anything to tell us how high quality the information after that is. It can tell us how similar these things are, and these things are very similar.

They all say foreign keys in them. Granted, there’s a little bit of a spread there between MySQL and general databases and stuff. But what you get back is at least foreign-key related, without having to search LIKE '%foreign%key%' over and over again, doing traditional keyword searches that often looked like this.

Like "SQL" and "performance" and "foreign" and "key" and stuff like that. We all know how we feel about this. And while we can get stuff back that contains SQL and performance,

it does not really narrow things down. We’d have to give it way more things, way more tokens, if you will, in order to narrow it to something that we care about, like SQL Server. And then maybe a SQL Server version, and then, I don’t know, how many LIKEs are you going to throw in there before you’re like, ah, I finally found what I was looking for? It’s a lot of LIKEs. A lot of likes. Unlike my posts.

It’s a lot of likes. All right. Thank you for watching. Hope you enjoyed yourselves.

I hope you learned something. And again, if you are interested in the full course content, it is all available for purchase at that link, which is down in the video description. The coupon code embedded in the link will get you 100 smackaroos, dollaroos, buckaroos.

I’m going to stop there. Off the price. So, thank you and goodbye.
