Get AI-Ready With Erik: A Little About VECTOR_DISTANCE
Video Summary
In this video, I discuss vector distance measures in the context of similarity search, specifically focusing on cosine and Euclidean distances. I start by giving a shout-out to ChatGPT for creating an apt back-to-work image that perfectly captured the dreary mood of a cold morning. The video then delves into how cosine distance ignores the length of vectors, making it ideal for text similarity searches where direction is more important than magnitude. In contrast, Euclidean distance measures the straight-line distance between points, which can be useful in certain scenarios but often isn’t as relevant when dealing with vector embeddings and textual data. To illustrate these concepts, I walk through a practical example using SQL queries to find similar posts based on performance-related questions, demonstrating how cosine similarity provides more meaningful results for text-based comparisons.
Full Transcript
Erik Darling here with Darling Data. Before we jump in, I'm not going to go through the whole deck, but I did want to give a shout-out to my boy ChatGPT for coming up with a banger back-to-work image for 2026. I just said, give me something grim, and it nailed it. My man over here is missing an arm again; it froze off. Everyone's coffee is a disaster. I don't know what that is; it looks like a coffee with a muffin in it. But good job. Nailed it. Everyone looking miserable and cold, hunched over. It's like a Dostoyevsky novel or something. Anyway, with that out of the way, let's talk a little bit about vector distance. Like I said before, if you're doing similarity search, you, 100 emoji, just want to use cosine. Nothing else makes sense for it. Okay, maybe not a hundred percent; there might be some real weird cases out there, but if you're just getting into this, you're probably not getting into weird stuff. Someone's probably just going to say, hey, give me some recommendations.
Or, you know, build a chat bot for me. This stuff is much more useful for that. But the thing with cosine is that it ignores what we'll call the length of a vector. You can think of the length, loosely, as how big the numbers in the vector are overall (formally it's the straight-line distance from the origin, the square root of the sum of the squared components, but the loose version is close enough here). Cosine doesn't care about that. It just cares whether the vectors are pointing roughly the same way. So in this example here, it doesn't care that one vector is one, two, three and the other is two, four, six.
It just means they're pointing in the same general neighborhood. The Euclidean stuff, though, does care about length. Euclidean distance measures the straight-line distance between two points, like GPS coordinates: if I were to walk from point A to point B, how far is that? So if you have those same vectors, one, two, three and two, four, six, granted they're pointing in the same direction, but Euclidean says one point might be over here and the other might be over there. Even though they're still going the same way, they're not close to each other in that cosine-similarity sense.
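If you want to see that with actual numbers, here's a minimal sketch, assuming you're on SQL Server 2025 or Azure SQL Database with the vector type and VECTOR_DISTANCE available:

```sql
/* One vector is exactly twice the other: same direction, different length. */
DECLARE
    @a vector(3) = '[1, 2, 3]',
    @b vector(3) = '[2, 4, 6]';

SELECT
    cosine_distance    = VECTOR_DISTANCE('cosine', @a, @b),    /* 0: same direction */
    euclidean_distance = VECTOR_DISTANCE('euclidean', @a, @b); /* ~3.74: lengths differ */
```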
Even if two vectors are far apart in whatever cosine space there is in the world, they can still point in the same general direction, and that's what cosine picks up on. I realize that sounds a little dumb, but it's a good generalization of how things work. So if we took these two, where we have one, zero, zero, and a hundred, zero, zero, cosine is going to say: you know what?
You are pointing in the same direction, so your distance is zero. But Euclid says: you are 99 Euclids apart, because there's a difference of 99 between one and a hundred. Cosine just shrugs, because one vector runs from one down to zeros and the other runs from a hundred down to zeros; the shape is the same.
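Same sketch, same assumptions as above:

```sql
/* Same direction, wildly different lengths. */
DECLARE
    @a vector(3) = '[1, 0, 0]',
    @b vector(3) = '[100, 0, 0]';

SELECT
    cosine_distance    = VECTOR_DISTANCE('cosine', @a, @b),    /* 0: identical direction */
    euclidean_distance = VECTOR_DISTANCE('euclidean', @a, @b); /* 99 "Euclids" apart */
```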
So that's what that is. I don't know why one of these is capitalized and the other one isn't; that's a very strange artifact of something, but we'll fix that right now. Then if we look at these, now we have negative one, negative fifty, negative a hundred on one side, and a hundred, five hundred, a thousand on the other. Now cosine says you're a horrible match. Because up here, cosine said you were a great match: zero.
Cosine distance goes from zero to two, so the closer something is to zero, the better the match. Then we run this one, negative, negative, negative against positive, positive, positive, and cosine says you are very, very far apart; you are not going in the same direction.
We have negative numbers against positive numbers, all different. So this one is about as imperfect a match as you can get, at 1.99-and-change. Euclid says: well, now you're 1,233 Euclids apart. So again, Euclidean distance is measuring like GPS coordinates, while cosine is measuring similarity in the direction the floating point numbers in the vectors are pointing.
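As a sketch, same assumptions:

```sql
/* Opposite signs: cosine distance approaches its maximum of 2. */
DECLARE
    @a vector(3) = '[-1, -50, -100]',
    @b vector(3) = '[100, 500, 1000]';

SELECT
    cosine_distance    = VECTOR_DISTANCE('cosine', @a, @b),    /* ~1.997: nearly opposite */
    euclidean_distance = VECTOR_DISTANCE('euclidean', @a, @b); /* ~1,234 Euclids apart */
```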
Just think of my fingers as arrows: the negative ones are pointing this way and the positive ones are pointing that way. That's why these two are very, very far apart as far as cosine similarity goes. But if we do this one, it almost feels like: where are they going?
Because we have one, fifty, a hundred, and then a hundred, five hundred, a thousand. Those arrows are still pointing in the same general direction, right? One, fifty, a hundred; a hundred, five hundred, a thousand.
They're still pointing the same general way, so it'll be an okay match. Not a perfect match like before, but an okay match. Euclid, of course, will say you are many Euclids apart.
The cosine similarity is pretty good. It's not wow-that's-amazing good, but Euclid says you're 1,011 Euclids apart, just because all it cares about is the length.
One vector is a hundred, five hundred, a thousand and the other is one, fifty, a hundred: two completely different points on the map. Cosine, meanwhile, says you could have some things in common. It's like matchmaker, matchmaker.
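One more sketch, same assumptions:

```sql
/* Roughly the same direction, much bigger magnitudes. */
DECLARE
    @a vector(3) = '[1, 50, 100]',
    @b vector(3) = '[100, 500, 1000]';

SELECT
    cosine_distance    = VECTOR_DISTANCE('cosine', @a, @b),    /* ~0.003: a decent match */
    euclidean_distance = VECTOR_DISTANCE('euclidean', @a, @b); /* ~1,011: far apart in space */
```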
So these two are a little more similar. But for text similarity, you generally want to use cosine. I've said that a million times now, because you don't care how many Euclids apart two texts are.
You care how similar those texts are to each other, and that's what the vectors, the embeddings, capture: the floating point numbers that describe the text of the question, its intent, its meaning. Those are what you care about similarity on.
So let's do a quick example of that here. It's a lot easier to see in the results than if I just sit here and yak about what we're after. If we run this query, right?
What we're essentially doing is asking for the top one post. Granted, we don't even need this score-greater-than-100 filter or the order by score descending; that was a little silly on my part.
I think that was just left over from testing. But if we look at this query, we're going to find the top one post that's about performance. We'll look at the title of what we're searching for, and then we'll look at the top 10 results by vector distance.
One thing that's important here: when you're trying to figure out what is most similar, you want to order by the distance in ascending order, because that gives you the lowest numbers, the most similar stuff, first. So let's look at what we got back.
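Here's a rough sketch of the shape of that query. The table and column names (Posts, PostEmbeddings, and so on) are stand-ins for the demo schema from the last video, and the vector dimension depends on whatever embedding model you used, so treat this as an outline rather than the exact script:

```sql
/* Step 1: grab a well-scored post about performance as the query post.
   Table and column names here are assumptions, not the demo's exact schema. */
DECLARE
    @post_id integer,
    @query_vector vector(1536); /* dimension must match your embedding model */

SELECT TOP (1)
    @post_id = p.id,
    @query_vector = pe.embedding
FROM dbo.Posts AS p
JOIN dbo.PostEmbeddings AS pe
  ON pe.post_id = p.id
WHERE p.title LIKE '%performance%'
ORDER BY p.score DESC;

/* Step 2: find the 10 nearest posts by cosine distance. */
SELECT TOP (10)
    p.title,
    distance = VECTOR_DISTANCE('cosine', pe.embedding, @query_vector)
FROM dbo.PostEmbeddings AS pe
JOIN dbo.Posts AS p
  ON p.id = pe.post_id
WHERE pe.post_id <> @post_id /* don't match the post to itself */
ORDER BY distance ASC;       /* lowest distance = most similar, first */
```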
The question that we were looking for, the top scoring question about performance, is "Improve INSERT-per-second performance of SQLite." And as far as similar things that we got back, here are our distances.
We start off pretty similar, and then things get a little less similar as we go down the list. That's how the similarity thing works. The top result is "SQLite insert performance," which is pretty similar to that, right?
Pretty similar. Then "SQLite insert very slow?", question mark. Still pretty close to this, right?
Getting there. Then "Faster bulk inserts in SQLite 3." The vectors and the embeddings aren't good at figuring out whether there's a difference between SQLite and SQLite 3, the same way they're not good at figuring out whether there's a difference between SQL Server 2008 and 2008 R2.
The 3 in SQLite 3 and the R2 in SQL Server 2008 R2 don't really stand out to the vectors, so you get some weird crossover there. Then we have an Android SQLite database, slow insertion.
Romance is alive. Then "SQLite .NET performance, how to speed up things." Still SQLite, still performance, but not necessarily about inserts.
Then "Slow SQLite insert using JDBC drivers in Java." It's still about slow SQLite inserts, but does the JDBC and Java part matter? Is that relevant?
Then "Speed up SQL inserts." We've drifted away from SQLite, but we're still dealing with SQL inserts. And we sort of start losing the script down here a little bit, don't we?
"SQLite optimization for millions of entries." "SQLite slowing down after millions of rows, how to speed up." So there's still stuff about SQLite and performance; these are still relatively similar question titles.
But we're starting to get away from purely improving SQLite insert performance. Okay. So anyway, that's what we did there.
The way we did that was we grabbed a post ID and an embedding from our post embeddings table, the one we just filled up in the last video. We also looked at the title,
so we could sanity-check whether what we're getting back is close to what we asked for. Then down here we used the VECTOR_DISTANCE function, always with cosine when we want similarity.
We compared all the embeddings in the post embeddings table to the query vector that we got from up top, the embedding we assigned to the variable.
Then we ordered those results by the distance column in ascending order, and we filtered the original post ID out. If we don't do that, we get back the same thing:
the post we were just looking for similar posts to. We don't need that. It's no good for us.
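That filter matters because any vector is a perfect match for itself: its cosine distance to itself is zero, so without the exclusion the original post would always come back as its own best result.

```sql
/* A vector compared to itself: cosine distance is exactly 0. */
DECLARE @v vector(3) = '[1, 2, 3]';

SELECT self_distance = VECTOR_DISTANCE('cosine', @v, @v); /* returns 0 */
```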
Anyway, that's probably good for this one. I hope you enjoyed yourselves, I hope you learned something, and I hope you enjoyed this banger of an image from my boys at ChatGPT. All right.
Thank you for watching.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.