Get AI-Ready With Erik: A Little About Vectors and Distances


Video Summary

In this video, I dive into the exciting new vector data type and vector distance function introduced in Microsoft SQL Server 2025 and Azure SQL Database. Starting off with a brief introduction to the vector data type, I explain its syntax and default behavior, including how it handles dimensions and the importance of specifying the correct number of dimensions when working with vectors. I also highlight that while this is just a taste of what’s covered in my comprehensive course, purchasing it remains essential for those looking to fully leverage these new features. The video then delves into practical examples using SQL Server Management Studio 2025 and demonstrates how vector distance functions like cosine, Euclidean, and dot work, emphasizing their importance in similarity searches and text analysis tasks.

Full Transcript

Erik Darling here. Darling Data. Since I just released a course about all the new AI hubbub in Microsoft SQL Server 2025, Azure SQL Database and Managed Instance, I figured we should do some videos to give you a little idea of what the content looks like, in order to hopefully spur you into purchasing it. There’s a coupon code up yonder that will get you $100 off the course. I’ll put this link in the video description for as many as I remember. But the coupon code is just AI ready. So if you go over to my training site at training.erikdarling.com, you can put in that coupon code without having this whole full link in front of you in order to get that course content.

Anyway, let’s start off with just a little tiny itty bitty introduction to the vector data type and a little bit about the vector distance function. There’s of course a lot more material in the full course, but if I did the whole thing for free here, there’d be no point in purchasing it, would there? Unless you just wanted to say thanks, which very few of you want to do.

Anyway, the new vector data type looks like this. It’s like Victor with an E. You say, I want this vector data type. You tell SQL Server how many dimensions will be in that data type. And of course, dimensions are these things over here. Each dimension is a number. They’re all separated by commas, right? So 1.0, 2.0, 3.0, and they have to be within these little square brackets. By default, they will all be float32. We’ll talk more about what that means. There is of course a float16 that is in preview. Float32 is the general availability data type that is fully supported currently.

And so that’s generally what we get. You generally won’t have three dimensions in a vector. It’s just for a little bit of simplicity here. But this is a little bit of, oh, this is kind of what it looks like. Let me use my fabulous new SQL Server Management Studio 2025 content zoom feature. And this is what you get back when you look at a vector in SSMS. It’s actually kind of a funny XML clickable column. And notice that these numbers look a little bit different from the numbers that we put in. These get converted to big crazy floats. But this is what the vector data type looks like.
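To make the float32 point a little more concrete, here is a minimal Python sketch (not T-SQL, and not a claim about SQL Server internals) that rounds a few values to 32-bit floats. The helper name `to_float32` is made up for illustration, and the scientific-notation formatting is only an approximation of the kind of display you get back in SSMS.

```python
import struct


def to_float32(value):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    # Packing as a 4-byte float and unpacking again performs the rounding.
    return struct.unpack("<f", struct.pack("<f", value))[0]


for value in (1.0, 2.0, 3.0, 0.1):
    stored = to_float32(value)
    # Show the original value, the value after the float32 round trip,
    # and a scientific-notation rendering of the stored value.
    print(f"{value!r} -> {stored!r} ({stored:.7e})")
```

Values like 1.0, 2.0, and 3.0 survive the round trip exactly. Something like 0.1 comes back as 0.10000000149011612, because a 32-bit float cannot represent it exactly.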

Sort of. Well, I mean, it’s not what it looks like in storage. It’s what it looks like presented to you as a person. Sort of like on Star Trek when aliens are like so big and weird and scary that they have to wear devices that present themselves as like human looking to the ship members and the crew and stuff. It’s sort of like that. If you saw what vectors look like, if you saw their internal representation, your mind would collapse on itself.

All right. It’s like hearing the voice of God or something. One thing to be aware of, and this is something that’s going to scare some people when they maybe start working with these, is that vector data types are rather inflexible in many ways. You have to know exactly how many dimensions you are going to have for your vector data type (and again, a dimension is each number inside here), and you must use precisely that many.

For example, if we declare a vector with one dimension, we cannot assign three dimensions to it and just have it silently truncate. We’ll get an error. Nor can we under-pack a vector. You know, a lot of people will be like, I’m going to just use nvarchar(max) for this state field, and then store, you know, MA, NY, RI, CA, TX and stuff in it.

You can’t do that with vectors, right? So if we say I want a vector(1024) and we try to put just three dimensions in it, SQL Server is like, no can do. 1024 and 3 do not match. Sorry, pal. The number of dimensions that you use for your vector data types is going to be determined by whatever embedding model you choose to use to generate your embeddings.
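The strictness described above can be sketched outside the database. This is a minimal Python illustration, assuming a bracketed, comma-separated literal like the ones in the demo. The function name `parse_vector` and the wording of its error message are hypothetical, not SQL Server's actual behavior or message text.

```python
import json


def parse_vector(literal, dimensions):
    """Parse a '[1.0, 2.0, 3.0]' style literal, insisting on an exact dimension count."""
    values = json.loads(literal)
    if not isinstance(values, list):
        raise ValueError("a vector literal must be a bracketed list of numbers")
    # No silent truncation and no under-packing: the count must match exactly.
    if len(values) != dimensions:
        raise ValueError(
            f"vector dimensions {dimensions} and {len(values)} do not match"
        )
    return [float(v) for v in values]


print(parse_vector("[1.0, 2.0, 3.0]", 3))

for declared in (1, 1024):
    try:
        parse_vector("[1.0, 2.0, 3.0]", declared)
    except ValueError as error:
        print(f"declared {declared}: {error}")
```

Both the too-small and the too-large declaration fail, which mirrors the point that you need to know the model's dimension count up front.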

That’s a big, scary, crazy sentence. But don’t worry. We’ll talk more about that in, you know, some other videos this month. But also I talk way more about that in the course material.

So if you are just so eager to know more about that, you can get way ahead and buy the course now. But think about different models. Right. Every time, you know, Anthropic or ChatGPT or whatever Google is doing says this is our newest, most powerful model yet, they’re talking about things like that.

Right. It’s the model that you’re using that generates the numbers that assign meaning to the text data that you have. Right. And those models generate a certain number of dimensions. There are some newer ones that are dynamic, but that’s not going to help us here.

The model that you choose is going to be dependent on a lot of stuff. You might not even be the one choosing it.

You just have to know how many dimensions that model creates when it generates embeddings. But once you pick a model, right, or once a model is chosen for you, all your vectors need to match that dimension. So this is called length and you cannot compare different ones.

Probably one of the more common ways that you’re going to be using vector data types in SQL Server, at least today, is the vector distance function, because the vector search function, which again we’ll get to later, is still in preview. It is not a generally available feature. So the most common way that most people in production are going to be doing things is by using the vector distance function.

That’s this thing right here. This is how SQL Server will tell you how similar things are, usually using the cosine calculation here. But if we have, let’s say up here, a vector(3) and a vector(4), we cannot compare those. Right. SQL Server just says vector dimensions 3 and 4 do not match.

Right. So we can’t compare a vector with three dimensions to a vector with four dimensions, or any other differing number of dimensions. Doesn’t matter. But the vector distance function has three metrics available.

We have cosine, Euclidean and dot. Cosine is what you’re going to be using most of the time because that’s a similarity search. Euclidean is almost like GPS coordinates.

It’s a totally different thing. And dot is just weird. I don’t even like talking about dot because it’s just that bizarre. But say we have these two vectors up here, and we’re again keeping things simple.

With just three dimensions, we have 1.0, 2.0, 3.0 and 1.0, 2.0, 3.5. What we get when we run these things are just different ways to measure similarity. Right. So.

It’s different, but for similarity search, right, for searching text for things that are considered close to each other, cosine is going to be what you use mostly.

So let’s run this and compare. Let’s see how similar these two vectors are. For the cosine distance, generally a lower number is better.

Right. So the lower a number is, the better off it is for the cosine distance. This is measuring the similarity of all of the dimensions that we have up there. The Euclidean distance is 0.5.

And the only reason this is 0.5 is because this is 0.5 higher. So when I said this is like GPS coordinates or something, it really is. And this isn’t going to be really good for like text similarity search stuff.

And this is not what you want to use here. So just because that dimension is 0.5 somethings longer than the other dimension, the difference is 0.5.

The dot product distance is negative 15.5. The calculation for this is just 1 times 1 plus 2 times 2 plus 3 times 3.5. Right. So we get 15 and a half back, as a negative number.
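The arithmetic for all three metrics on these two vectors can be checked with a short Python sketch. This is a plain reimplementation of the textbook formulas, not SQL Server's code. The function names are made up for illustration, and the dimension check simply mimics the "do not match" rule described earlier.

```python
import math


def _check_same_dimensions(a, b):
    # Vectors with different dimension counts cannot be compared.
    if len(a) != len(b):
        raise ValueError(f"vector dimensions {len(a)} and {len(b)} do not match")


def cosine_distance(a, b):
    """1 minus the cosine of the angle between a and b (assumes non-zero vectors)."""
    _check_same_dimensions(a, b)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between the two points."""
    _check_same_dimensions(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def negative_dot_product(a, b):
    """Dot product with the sign flipped, so that lower means more similar."""
    _check_same_dimensions(a, b)
    return -sum(x * y for x, y in zip(a, b))


v1 = [1.0, 2.0, 3.0]
v2 = [1.0, 2.0, 3.5]

print("cosine:   ", cosine_distance(v1, v2))       # a small number, roughly 0.0026
print("euclidean:", euclidean_distance(v1, v2))    # 0.5
print("dot:      ", negative_dot_product(v1, v2))  # -15.5
```

Only the third dimension differs, so the Euclidean distance is exactly the 0.5 gap, while the cosine distance stays tiny because the two vectors point in nearly the same direction.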

So that’s really all that is. Anyway, like I said, the lower the numbers that you get back for cosine are, the closer to 0, the more similar the vectors are.

So say we have these two strings here that match entirely. Right. So now both are 1.0, 2.0, 3.5. And we look at what we get back here.

Now cosine is exactly 0. Euclidean is exactly 0. And the dot product distance is negative 17.25. But again, this is just 1 times 1 plus 2 times 2 plus 3.5 times 3.5.

So this number is just getting lower for whatever reason. I mean, we made the numbers bigger.

So the negative number got bigger, which is a weird thing to think about. But like I said, the lower numbers are the more similar they are. I use cosine throughout like the entire course because I use a Stack Overflow database.

And the goal of using that, which we’ll talk about more in a later video, is that it probably most closely matches what a lot of you will end up doing with it. We’re using question titles to find similar questions. We’re using question titles and comparing them to answer bodies to find maybe good answers for a similar question title, stuff like that.
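As a rough picture of that kind of title-to-title search, here is a minimal Python sketch that ranks a few titles by cosine distance to a query. Everything in it is an assumption made for illustration: the titles are invented, and the three-number "embeddings" are hand-written toy values, not the output of any real embedding model or anything from the Stack Overflow database.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine of the angle between a and b (assumes non-zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical titles with made-up 3-dimension vectors (real models emit far more).
titles = {
    "How do I rebuild an index?": [0.9, 0.1, 0.2],
    "Why is my query slow?": [0.2, 0.9, 0.1],
    "Index rebuild vs reorganize": [0.8, 0.2, 0.3],
}

# A made-up vector standing in for the embedding of the question being asked.
query = [1.0, 0.0, 0.0]

# Lower cosine distance means more similar, so sort ascending.
ranked = sorted(titles, key=lambda title: cosine_distance(query, titles[title]))

for title in ranked:
    print(f"{cosine_distance(query, titles[title]):.4f}  {title}")
```

The two index-related titles land at the top because their toy vectors point in nearly the same direction as the query, which is the whole idea behind ordering by cosine distance.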

So cosine is really what you want to be using. But look what happens here when we change the lower string to use all negative numbers. Right. So negative 1, negative 2, negative 3.5.

Now, all of a sudden we get back very different numbers. Now, the cosine distance here is almost two, which is the far end of the range for a cosine distance.

The numbers that you’ll get back are between zero and two. So like there’s not a lot of forgiveness in there. So like the closer you are to zero for cosine distance, the better.

Euclidean distance is now 7.8. And really, it’s just because this thing is going positive and the other thing is going negative. So again, it’s just like GPS coordinates.

These are two different points on a map like these just got further apart on like whatever grid map weird tesseract these vectors exist in. And the dot product distance is now a positive 15.5 telling us that these are very, very different dots. Like I said, we don’t talk about dot product anyway.
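The identical and the all-negative cases can be checked the same way. This is a hedged Python sketch of the standard formulas, with a made-up helper name `distances`. It assumes the upper vector in the all-negative comparison is 1.0, 2.0, 3.0, which is inferred from the 15.5 result rather than stated outright.

```python
import math


def distances(a, b):
    """Return (cosine distance, Euclidean distance, negative dot product)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 - dot / norms, euclidean, -dot


# Two identical vectors: cosine and Euclidean bottom out at (about) zero.
same = distances([1.0, 2.0, 3.5], [1.0, 2.0, 3.5])

# Assumed pair for the all-negative case: the vectors point in opposite directions.
opposite = distances([1.0, 2.0, 3.0], [-1.0, -2.0, -3.5])

print("identical:", same)      # cosine ~0, euclidean 0.0, dot -17.25
print("opposite: ", opposite)  # cosine just under 2, euclidean ~7.89, dot 15.5
```

Cosine distance is one minus a cosine, and a cosine runs from 1 down to -1, which is why the results stay between zero and two.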

Thank you for watching. I hope you enjoyed yourselves, and I’ll see you in tomorrow’s video, where we will talk about some more fun vectory stuff in SQL Server. Anyway, thank you for watching.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.