Get AI-Ready With Erik: Helping Vectors Rank Better With Keyword Boosts

Video Summary

In this video, I delve into strategies to improve vector search relevance in SQL Server. I explore how keyword filters and full-text indexes can complement vector search, discussing the strengths and limitations of each. By demonstrating practical techniques such as adding specific keywords to the query conditions and using boosts based on domain knowledge, I show how to refine search results so they are ordered by conceptual relevance rather than popularity. This approach is particularly useful when you need exact terms to appear but still want results ranked by meaning, so that highly relevant content isn’t missed due to semantic gaps.

Full Transcript

Erik Darling here with Darling Data. You know, trying to get my AI ready on. I’m gonna make sure that I’m prepared for the world falling apart around us and all that other good stuff. So today’s video is going to be about some things you can do that might help vectors rank things a bit better, right? Because, you know, like we talked about in the last video, vector search might miss some stuff, right? We were looking for SQL Server 2008 R2 and we got back a lot of SQL Server 2008. And then we were like, I want to find stuff about connection strings, but not Entity Framework. And vector search was like, hey, Entity Framework, you said those words, here you go. Uh, did I hear that right? Yeah. Ah, well, there you go. Here’s some Entity Framework for you. Um, so semantic search is there to find conceptually related content, but it has a bad habit of missing exact terms. Uh, keyword search, like, you know, I had to unfortunately show you some full-text index stuff, which I feel dirty about still, but I’ll get over it when you buy something. Uh, keyword search finds exact terms, but may not rank by conceptual relevance, uh, as well. Right. Cause like, you know, we can put in explicit words and we can find explicit words.

Which is great for stuff like, you know, error messages or like, you know, product versions and things like that. But you know, less so for like, ah, I think I want a sandwich. What kind of sandwich do I want? I don’t know. Who has sandwiches, right? Best sandwich. Right. So like, you know, stuff that, you know, you might have a harder time with. Anyway. Um, so what I’m going to show you is useful when a specific term must appear in results, but you want results ordered by conceptual relevance, maybe, rather than like popularity or something. Right. Cause you know, in the Stack Overflow database, we have a score column, right? And like, you know, we could certainly, like, factor score into things, which we will talk about.

But, uh, typically, you know, if we’re spending all this time and effort and energy into getting our vectors right, we would probably want to, uh, you know, involve them at least to some degree. Right. What’s the point of all this damn data if you don’t use it? Uh, so what we’re going to do is try to find stuff about SQL injection prevention, but we’re going to give our search a little bit of help, right? Because we want to make sure that the title of the post, at least, you know, has SQL injection in it. Right.

So this is another form of keyword search, kind of. Um, it’s just an alternative to full-text indexes. You can throw some stuff like this in, I guess, in the where clause to help filter stuff out. There’s also another, uh, technique that I’m going to show you down below. And this will at least get us to the point where, you know, all of the post titles have SQL injection in them. And then it’s sort of up to us to figure out like, okay, well, I mean like, you know, distance wise, right? Like, all of these things do pretty okay.
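The filter-then-rank idea described above can be sketched outside the database. This is a minimal Python illustration, not the query from the video: the post titles, the three-element embeddings, and the function names are all hypothetical, and the cosine distance is computed by hand rather than by SQL Server. The point is only that the title filter decides which rows are eligible, and vector distance decides their order.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 0 means same direction, larger means further apart."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def keyword_filtered_search(posts, query_vector, required_phrase, top_n=15):
    """Keep only posts whose title contains the phrase, then order by distance."""
    phrase = required_phrase.lower()
    # Rough stand-in for a case-insensitive "title LIKE '%phrase%'" filter.
    matches = [p for p in posts if phrase in p["title"].lower()]
    ranked = sorted(
        matches, key=lambda p: cosine_distance(p["embedding"], query_vector)
    )
    return ranked[:top_n]


# Hypothetical rows with made-up toy embeddings.
posts = [
    {"title": "How do I prevent SQL injection in PHP?", "embedding": [0.9, 0.1, 0.0]},
    {"title": "Parameterized queries vs. string concatenation", "embedding": [0.95, 0.05, 0.0]},
    {"title": "Is this SQL Injection attack real?", "embedding": [0.5, 0.5, 0.1]},
]
query_vector = [1.0, 0.0, 0.0]

results = keyword_filtered_search(posts, query_vector, "sql injection")
for post in results:
    print(post["title"])
```

In this toy data the second post is the closest vector match, but it is excluded because its title lacks the required phrase. That is the trade-off of a hard filter: guaranteed exact terms, at the cost of dropping conceptually close rows that are worded differently.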

But the scores are all over the place. We’ve got a zero, a 10, a 2700, right? And we’re only looking at questions, too. Right. So like, two things that we’re not doing here that in real life we would probably want to be doing: one is making sure that these questions have answers. Right. Cause like, if the question doesn’t have any answers, it might just be a similar question, which is fine if you’re doing like, you know, content deduplication or whatever.

But if you want to like find related content that might have helpful things in it, aside from that, we would probably want to make sure that they had some answers. We also might want to factor score into it somehow, because score at least, you know, would help dictate like, hey, this is a good question. Or at least this is a highly upvoted question. Good or not is, uh, come on. It’s Stack Overflow. Right.
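As a rough illustration of those two missing checks, here is a small Python sketch. It assumes each candidate row already carries a vector distance, an answer count, and a score; the field names, the distances, and the `min_score` threshold are hypothetical, and only the scores 0, 10, and 2700 echo the values mentioned above. How to blend score into the ranking itself is left open here, since the transcript only says "somehow".

```python
def answered_and_upvoted(candidates, min_score=1):
    """Keep questions that have at least one answer and a minimum score."""
    return [
        p for p in candidates
        if p["answer_count"] > 0 and p["score"] >= min_score
    ]


# Hypothetical candidates; distances and answer counts are made up.
candidates = [
    {"title": "Question A", "distance": 0.10, "answer_count": 0, "score": 0},
    {"title": "Question B", "distance": 0.12, "answer_count": 3, "score": 10},
    {"title": "Question C", "distance": 0.15, "answer_count": 12, "score": 2700},
]

kept = answered_and_upvoted(candidates)
kept.sort(key=lambda p: p["distance"])  # still ordered by vector distance

for post in kept:
    print(post["title"], post["distance"], post["score"])
```

Here score only acts as a gate: the surviving rows are still ordered by distance, so a heavily upvoted question does not jump ahead of a closer match.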

Like something got really hot. A lot of upvotes, like more upvotes than I will ever get in my life. But, uh, you look at it and you’re like, you linked to the documentation. You rat. Anyway, another option we have, right, another way that we could potentially, uh, make this all better somehow would be to add a boost based on the presence of certain words in, uh, in the title. Right. So, uh, the way that we would do that is, I mean, this is just to show you what the boost actually turns out being.

So if we’re like, if we’re off looking for like how to optimize database performance, you know, conceptually, we’re going to find stuff that’s pretty close to that. But like, we, we might, you know, kind of like, you know, know our data or like, you know, have some idea about like, like allow people to enter in some like, uh, like, uh, some like, uh, what do you call them? Uh, like backup search words or something. We’re like, Oh, like, yeah, I’m really interested in indexes. Right. Yeah. That’s a good one.

So we could add in some booster stuff like that. And then we could adjust, uh, what the vector distance is based on that boost. And what we’ll end up with is a query that looks a bit like this. Man, I’m actually going to highlight the whole thing. Cause the last thing I want in one of these videos is like an error. And then I have to like go vector search the error message and ask AI how to fix it.

And AI is like, Oh, well, it looks like maybe you forgot to highlight where you declared a variable. And then you want to be like, man, AI, you’re so good. You’re so smart. How’d you? God. Anyway, uh, what we have here is a few columns. Oh, we’ve got more than a few, but we have a few columns that we would care about, uh, the contents of for this exercise.

So we have the normal vector distance thing in here, right? And you know, these numbers are all just wonderful, right? They are all floating point numbers, I know, decimals. Nailed it. Uh, but then we have our boost column, and all of these are, I mean, we didn’t hit on index, or like maybe index got screened out past the top 15. Like maybe the index one didn’t help. Cause like what we did up here, uh, was say, you know, um, minus CASE WHEN title LIKE index. Remember, we want to get these numbers lower.

Right. So we’re subtracting. This is not boosting by adding onto the vector distance. Cause that would make the things seem further apart in the numbers. Uh, we’re doing this by, um, subtracting, to add relevance for certain keywords. So if the title is like index or the title is like performance or the title is like slow, then we would have these multipliers, uh, get subtracted so that we could see more of that stuff.

So, um, like everything that we found, at least, because we matched conceptually with the vector search very close to performance, everything was either a 0.03 or zero, right? So like, a lot of these are 0.03, 0.03, 0.03, and then a lot of them are zero, zero, uh, zero, zero and stuff. But you can see how the semantic distance got adjusted by the, uh, by the boost, right?

So like the one that started at 1077 got adjusted down to triple sevens, right? Uh, the one that started at 845 got boosted by nothing and stayed at 845. Uh, and then the one at 1358 got boosted by 0.03 and got adjusted to 1058.
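The boost arithmetic described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the query from the video: the titles and the per-keyword weights are hypothetical, and the spoken numbers are read as four-decimal distances (0.1077, 0.0845, 0.1358), which is consistent with a 0.03 boost turning 1358 into 1058.

```python
# Hypothetical weights: the transcript names the keywords but not every value.
KEYWORD_BOOSTS = {
    "index": 0.05,
    "performance": 0.03,
    "slow": 0.02,
}


def keyword_boost(title):
    """Sum the weights of every boost keyword found in the title."""
    lowered = title.lower()
    # Substring match, so "index" also hits "indexes", much like LIKE '%index%'.
    return sum(w for keyword, w in KEYWORD_BOOSTS.items() if keyword in lowered)


def boosted_distance(distance, title):
    """Subtract the boost: a lower distance means a better rank."""
    return distance - keyword_boost(title)


# Hypothetical titles paired with distances as read from the transcript.
rows = [
    ("Why is my query performance so bad?", 0.1077),
    ("Tuning a reporting database", 0.0845),
    ("Database performance best practices", 0.1358),
]

ranked = sorted(rows, key=lambda row: boosted_distance(row[1], row[0]))
for title, distance in ranked:
    adjusted = round(boosted_distance(distance, title), 4)
    print(f"{distance:.4f} -> {adjusted:.4f}  {title}")
```

With these inputs the first row moves ahead of the unboosted 0.0845 row once 0.03 is subtracted, which is the whole point of the technique. Because the boost is a flat subtraction, its size has to be chosen relative to the typical spread of distances in the result set, or it will either do nothing or swamp the semantic ranking.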

So this is just another way of like helping the vector distance stuff get some more, um, like contextual, you know, words in play and adjust the score a little bit based on certain things that are very closely related. Like you might know about like domain knowledge stuff. Uh, but you know, the vector similarity search is just like, I got these floating points over here and I got these floating points over here and I got to figure out how close these floating points are.

It’s not thinking about like, you know, additional, like, like, Oh, well, if I had this other floating point, if I thought about this other floating point, I, I could, I could really, I could really get some better searches in here or some better search results in here. So this is just another way of doing that anyway. It’s a lot of stuff to think about with this, isn’t there?

Well, maybe someday, right? I don’t, I don’t know if anyone’s on SQL Server 2025. Yeah, they did.

They did just release Cumulative Update 1 with a shockingly low number of, uh, fixes in it. Unfortunately, it was also not graduation day for any of our preview features. So we’re still stuck there.

Anyway, that’s probably good for this one. Thank you for watching. I hope you enjoyed yourselves. I hope you learned something and I hope that you will use this wonderful, fabulous coupon code, uh, up here to buy the entire course. And, uh, I don’t know, maybe pay rent or an electric bill or phone bill or something.

I got all sorts of stuff I gotta do. All right. All right.

Anyway, thank you for watching. I’ll see you tomorrow.
