Get AI-Ready With Erik: A Little About Cross Table Searches

Video Summary

In this video, I delve into the exciting world of AI preparedness and explore a practical application: finding related content in real-time. Whether you’re building advanced chatbots using SQL Server’s vector capabilities or managing a Q&A site like StackOverflow, being able to suggest potential duplicates before submission can save time and effort for both users and moderators. I demonstrate how vector distance functions can help identify near-duplicates by embedding query vectors and comparing them against existing content, providing a likelihood score that filters out irrelevant results. While the performance isn’t always optimal due to row-by-row comparisons, this method offers significant improvements over keyword searches, especially in understanding the intent behind user queries. By combining similarity search with keyword or full-text search strategies, we can achieve more accurate and comprehensive results, ensuring that users get the most relevant information as quickly as possible.

Full Transcript

Erik Darling, Darling Data, continuing. Are you ready for some AI? AI preparedness, football preparedness, all very important things for us to prepare for. Today we’re going to talk a little bit about sort of like the idea of finding related content. So, you know, for you, whether you are building the world’s greatest chatbot based on SQL Server’s vector capabilities, or you just want to have someone like, you know, I don’t know, let’s say you run a Q&A site like StackOverflow, or I don’t know, let’s just say someone asks a question or something and you want to show them some potential duplicates before they submit thing, submit something. If you wanted to do this as they’re typing, you know, you would probably need to use like keyword or full text search to accomplish that because like, like generating and embedding as someone like, you know, types a letter and like, I spelled that wrong, delete, delete, delete, delete. It would be sloppy and messy. So we don’t want to necessarily do that. But, you know, let’s say, you know, someone’s like submits a question, but like, like you have like a, like a, like a middle screen, right? Like you, you preserve all their work, but then you also, you’re also like, hey, these other things look pretty close to what you’re doing.

What you were looking for, like, I think, I think maybe, maybe might be on the right track here. Maybe you should check some of these resources out before you, you know, send in your new thing and set a bunch of people running around trying to find answers for you and like, wait a minute, that reminds me of this thing. You know, you could maybe, you know, do a slightly better job. Right? Maybe save them some time, save yourself some time, save the world some time, right? So we can’t save the world some time. What are, what do we have computers for? So what we’ll do here is sort of expand on just saying, hey, vector distance, you know, whatever. Now we can, you know, get, maybe get a little feedback about what we’re doing. And we’ll have a new column in our select list that’ll show the likeliness of the query being a near duplicate. So we’re going to use this sort of, I mean, it’s, I’m going to say it’s, it’s arbitrary in that, like, like it makes sense for what I’m doing here and the data I’m working with here.

In your real life workload, query, embedding life, you might choose a different number to decide how likely a duplicate, likely a duplicate something might be. And then we are also going to start filtering on this. Now, I’m going to be honest with you.

This, this is, if you are coming to this from a DBA perspective, from like a performance-ish tuning perspective, which is again, why I started looking at all this stuff. Cause I was like, Hey, I bet this is going to need some performance tuning work. This does not get pushed anywhere that might be useful.

This is not like a SARGA-able predicate. SQL Server has to, again, compare this row by row, right? You have to take each row, feed it into the vector distance function, compare the table embedding to the local variable, which again, like from a, from a like performance cardinality perspective, the local variable doesn’t make a difference here, right?

Like you can throw a recompile on this. It’s not going to help anything. SQL Server is not going to be like, Oh, now I know that’s not a, you know, density vector guess. I’m going to give you this other, I’m going to give you like a good cardinality estimate based on the histogram.

You don’t get that. So like, don’t, don’t get tied up on that. But of course the vector distance, distance function has to compare again, sort of like in a row by row, uh, row by row form. Like, you know, like put this embedding in compared to this embedding.

Is that less than 0.30? Uh, we’ll find out when it runs, right? So like it, like the performance is not going to be awesome for that. Again, this is where sort of like the vector search or like vector index stuff comes in, vector indexing comes in.

But those still being in preview and Microsoft releasing a woefully incomplete implementation of all this, uh, at RTM and calling it AI ready. Eh, wrong. Eh, boy.

What do we do with ourselves? But anyway, let’s run this query. And we get some results back. And this was not terribly, this was not terribly slow because, you know, we don’t have a lot of terribly, we don’t have a lot, a lot of data in here. And I have a pretty decent VM backing this thing up.

But, uh, what’s, what’s interesting, right, is, um, so we have like this, this filter here, right? And this is where our predicate ends up for vector distance, right? Say less than 2.99 E minus 0, 1, right?

So, um, like we scan the whole post table, we bring all million rows out of that. We compare, we do the filtering of the vector distance function in there, right? You can see vector distance, cosine, et cetera, in this filter operator.

Again, because it’s not struggle. This is why things like pre-filtering, like anything, anything you can, we could push. I mean, there’s nothing we can really push to the post embeddings table. But, you know, like for the post table itself, there’s like some pre-filtering we could do there to at least, you know, get some of the rows out.

But, you know, sometimes this ends, like if this were like a regular nested loops join, sometimes the vector distance function ends up at the nested loops join. We get an adaptive join here. So, you know, all that does a thing.

Anyway, I just really wanted to show you that it wasn’t a sargable predicate. We couldn’t like scan the post table with the vector distance function there filtering out rows when we touch the table. But, you know, the results that we get back are at least filtered distance wise to things that would make sense for us.

And the titles that we get back are pretty close to, you know, how do I speed up my SQL queries? But, you know, this is like kind of a funny thing because it’s like how do I speed up my SQL queries? My SQL queries or my SQL queries, right?

Because my SQL is a database, but my SQL, it could be anything, right? SQL could be in SQL Server, Postgres, Oracle, DB2, Snowflake, you name it, right? SQL is in like everything.

Is SQL the original AI? It’s like, no, it’s SQL. But, yeah. So, like a bunch of these results are like, you know, like my SQL, my SQL. But, like, not necessarily like, but we weren’t necessarily searching for anything specific to my SQL, the database. We were just thinking about my, like, SQL that I own.

So, this is like one of those things that similarity in vector search can miss and why you might need to think about combining some alternate search strategies to sort of give you the best of, like, all worlds when you’re searching. Because, like, not only do you need to worry, like, think about similarity, you need to think about the things that similarity search might miss in some cases.

Like, certain contextual things that might make, like, might be really important where you need to sort of bring keyword search or some other, like, keyword search to find, like, more exact stuff. Or you might need to start adding in, like, you know, some sort of waiting to make sure that the quality of the search result is okay, too. So, it’s like, you know, like, all of this stuff is, you know, like, about speeding up SQL queries, right?

There’s all sorts of things in here that might not even show up. Like, I think this is a great example where, you know, the search phrase was, how do I speed up my SQL queries? If you were to take that and just apply, like, pure, like, keyword or full-text search semantics to this query, you probably wouldn’t see SQL queries taking too long.

But, like, the similarity search figures out, oh, like, how do I speed up my SQL queries is pretty close to SQL queries taking too long, right? Like, so that sort of, like, intent of the words, that’s what doesn’t come out in, like, a lot of other search forms that vector search or similarity search really does bring out. Right?

So, it can also be useful. Let’s say that, like, you know, let’s say you’re on a Q&A site where people ask questions, like Stack Overflow. And you might have, like, swarms of completely unpaid slave labor moderators in there who, you know, have to deal with, like, cues of things like duplicate question flags and all sorts of other stuff. But this could be really useful for them to help them find duplicate questions.

Because, again, like, you know, like, reviewing the content of a thing beyond, like, just, like, keyword search stuff. Like, if you were looking, like, how do I speed up my SQL queries? You might find an exact duplicate of, like, the question body in, like, you know, SQL queries taking too long that wasn’t obvious exactly from the title.

So, like, if you’re, like, if you have any sort of, like, moderation thing happening, like we do in, like, or they do in Stack Overflow. I don’t work for Stack Overflow. But, like, you might have something like this where, like, moderators have to go look at stuff and you might want to help moderators find, like, oh, yeah, that is a duplicate of that.

Like, oh, okay. Oh, well, you know, someone flagged it as a duplicate of this thing, but it’s really a duplicate of this thing. So what we’ll do is we’ll take, I have a source post and a query vector.

And we are going to find our source post here. And I’m going to show you what exactly what we’re searching for. And then, like, just a way to screen for potential duplicates using a very similar search pattern to what we did before.

All right. So let’s go pretend that we are unpaid moderators and we are looking to try to find how to undo the most recent commits in Git and what we get back. Now, are these great?

I mean, are these, like, awesome across the board? No. But, you know, again, the similarity stuff can miss some certain things.

So, like, this does have, like, undo and get and commit. And, but this one, like, it has the word recent in it. Right.

So it’s, like, how to undo the most recent commits in Git. But this one thing that similarity search misses is negation. So, like, not. Like, like, like, I want this to, like, not be, like, like, saying, like, I wanted to undo the most recent. But the another one saying that isn’t the most recent.

Similarity search will miss stuff like that. Right. So, like, how to undo the most recent commits in Git. The most, like, the most similar thing we found is Git undo a commit that isn’t the most recent. But similarity search kind of barfs on that.

This one is a lot closer. I mean, if you actually, like, read, like, undo last commit. Like, how to undo the most recent commits. Like, that second one is actually, you know, like, like, in the literal sense, closer. Even though from, like, textual representation with float numbers.

This one, this one seems closer. How to get back to the most recent version in Git. Maybe not particularly there.

Right. Undo a particular commit in Git that’s been pushed to remote repos. I don’t know about that one. How can I undo the list, the last Git add commit. Yeah, that one, that one, that one’s probably closer there, too. So, like, similarity search is not perfect.

But there’s, like, a lot of stuff that similarity search might miss or might, you know, not catch in a perfect way that, you know, a human eyeball doing unpaid moderation on your, you know, your private equity website might, just might catch and understand a little bit better. Anyway, thank you for watching. Hope you enjoyed yourselves.

I hope you learned something. Again, the full version of all this content is available for purchase with the AI Ready coupon. The link for this is fully expressed down in the video description. But if you head over to training.erikdarling.com, you can buy the Get AI Ready with Erik course for $100 off.

It’s a great deal. Because then you’ll be AI Ready far before anyone else is. Except me.

I was AI Ready before you. That’s how I got to do all this. All right. Thank you for watching.

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.