Get AI-Ready With Erik: A Little About Embeddings

Video Summary

In this video, I dive into the world of embeddings and their importance in bridging the gap between what computers can do and what humans need them to do. I explain why traditional string comparisons fall short when it comes to understanding the meaning behind text, especially for tasks like optimizing SQL queries or improving database performance. By converting text into numerical vectors, embeddings allow us to compare texts based on their semantic similarity rather than just surface-level differences, making it possible for computers to understand and process natural language more effectively.

Full Transcript

Erik Darling here, back in my AI groove. Today we’re going to talk a little bit about embeddings. Yesterday we talked a little bit about dimensions, and how dimensions make up embeddings and stuff like that. So today we’re going to talk about why we have embeddings. What is an embedding, and why do we need it? That’s our goal today: to just go over that a little bit.

Now, the thing is that computers are great at comparing things like numbers, which is what embeddings are made up of, all those little floaty things that we looked at. But computers are very, very bad at understanding meaning. Unless we get into all sorts of terrible wildcard-percent searching in our string columns, which we would like to not do, there is really no good way to compare text to see how similar that text is. All we can do is say: is this word in this text? Is this pattern in this text? Even regex can’t tell you what something means.

So, let’s say we don’t want that problem. If we were to take the phrases “How do I optimize a SQL query?” and “What’s the best way to speed up my database queries?”, the computer would not be able to compare those strings in any meaningful way for what their intent is, what they mean to us. All the computer can do is compare the little bits and bobs in each string and say: hey, do they match? Are they equal? So to a computer doing string comparison, those are completely different strings. Those strings are not equal to each other. But to a human, they’re kind of asking the same question. Embeddings try to bridge that gap.

Now, you might hate AI. I mean, I’m pretty gosh-diggity-darn sick of hearing about AI everything, and having everyone shove an AI thing into their application or their product and say, “Oh look, I changed this page from loading to thinking, and now I’ve got AI.” But as a DBA, or even as a developer, you should be into this, because you can avoid doing all that really painful wildcard string search stuff in a lot of places.

So if we were to take those two strings again: “How do I optimize a SQL query?” and “What’s the best way to speed up my database queries?” The computer can’t figure that out. It just says that these things are different: these are not the same string, dummy. But they mean the same thing, right? They have the same general meaning. Even if you were to do some awful search with a bunch of wildcards in there, which SQL is awful at, or even if you get into full text search, these strings are not at all equivalent. There’s no match.
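Just to make that concrete, here’s a minimal T-SQL sketch (my own illustration, not something from the video) showing both plain equality and a wildcard search coming up empty:

    DECLARE
        @q1 nvarchar(100) = N'How do I optimize a SQL query?',
        @q2 nvarchar(100) = N'What''s the best way to speed up my database queries?';

    SELECT
        /* String equality: the computer says these are different */
        are_equal =
            CASE WHEN @q1 = @q2 THEN 'equal' ELSE 'not equal' END,
        /* A wildcard keyword search doesn't bridge them either:
           'optimize' appears in one question but not the other */
        wildcard_match =
            CASE WHEN @q1 LIKE N'%optimize%'
                  AND @q2 LIKE N'%optimize%'
                 THEN 'both match' ELSE 'no shared match' END;

Both comparisons come back negative, even though any human would say the two questions mean the same thing.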

It’s terrible. So what embeddings do is turn text into numbers. Take “How do I optimize a SQL query?” Just to jump ahead a little: if your embedding model generates 1024 dimensions, then that string becomes 1024 numbers. In SQL Server, that shows up as a little square-bracketed list of various floats and bits and bobs. And “What’s the best way to speed up my database queries?” would also become 1024 numbers. And because of the way embedding models are trained, on the billions and billions of texts they’ve seen, these two strings generate numbers that are pretty close to each other. They’re not going to be exact.
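In SQL Server terms, that square-bracketed list is the vector data type. Here’s a tiny sketch, assuming SQL Server 2025 or Azure SQL Database, where that type exists. A real model would hand back 1024 or more dimensions; I’m using three made-up values just so the output fits on screen:

    /* Assumes SQL Server 2025 or Azure SQL Database, where the
       vector data type is available. Real embeddings come from
       a model; these three dimensions are just for show. */
    DECLARE @v vector(3) = '[0.1, -0.4, 0.9]';

    SELECT embedded = @v;
    /* Comes back as that square-bracketed list of floats */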

But again, we’re not trying to find exact here. It’s not like you’re searching for an error code, like “I’m hitting error 0x80085,” where you want an exact match, because if you get error 0x80086, that’s different. It’s not the same error. So it’s not an exact match we’re looking for. We’re looking for things that are similar to each other.

So if we were to compare these in SQL Server, the embedding model would take the two strings, turn them into numbers, and then those numbers become how we figure out how similar the strings are. And again, the lower the number you get using the cosine metric with VECTOR_DISTANCE, the better off you are. So this is a very, very similar pair, because this number is getting pretty close to zero.

So this would be a pretty good indication that these strings are similar. If instead we had a one-point-something over here, that would not be very similar, because cosine distance goes from zero to two. We would not have a very similar match there.
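Here’s what that looks like in T-SQL, with the same assumptions as above (SQL Server 2025 or Azure SQL Database). The vectors are toy values I made up to point in nearly the same direction, standing in for two similar questions:

    DECLARE
        @optimize_query vector(3) = '[0.80, 0.55, 0.10]',
        @speed_up_query vector(3) = '[0.78, 0.60, 0.12]';

    SELECT
        cosine_distance =
            VECTOR_DISTANCE('cosine', @optimize_query, @speed_up_query);
    /* Comes back very close to zero: similar meaning.
       The scale runs from 0 (same direction) to 2 (exact opposites). */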

Now, if we took two other strings. Let’s say we had these two questions: Who was the governor of Campania during the Herculonius period? And how do I get pizza off my eyeglasses?

Completely different questions, right? Not even close; we’re in different worlds. So these would generate completely different embeddings, and those completely different embeddings would produce a cosine distance that is nearly two, which is about as high, as distant, as we can get with two text embedding vectors.
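Same sketch as before, but with toy vectors pointing in opposite directions to stand in for two unrelated questions. In practice, real embedding models rarely push unrelated text all the way out to the maximum, but the toy version makes the point:

    DECLARE
        @campania_governor vector(3) = '[0.9, 0.1, 0.2]',
        @pizza_eyeglasses  vector(3) = '[-0.9, -0.1, -0.2]';

    SELECT
        cosine_distance =
            VECTOR_DISTANCE('cosine', @campania_governor, @pizza_eyeglasses);
    /* Returns 2: exact opposites, as far apart as cosine distance gets */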

It’s very, very far apart in the world of cosines. But the whole thing is that each dimension is there to capture some aspect of meaning. What does this string of words mean?

What are we indicating here? What is our intent? And at least nobody I know knows exactly what each dimension represents. That number? You got me on that one.

Search me, baby. But I’m sure there’s someone out there who is very, very smart and very, very good at math who can tell you what each one means. They could probably read dimensions the way some people can read assembly code.

Like, okay, good for you. Some people can read binary: “Oh, that’s a 17.” And you’re like, okay. Okay, sure it is.

Okay, sure it is. But I’m sure someone, someone out there who’s very smart could, you know, explain to you what each dimension represents. But all of these different AI models, right? All of these models learn patterns from billions and billions of text examples, right?

It could be, you know, copyrighted internet material. It could be books they stole. It could be your website, right?

It could be your blog posts. It could be, I don’t know, like, anything, right? Like, all these text examples, these models get trained on them. You know, it’s sort of like, like, OCR stuff where, you know, the, like, different programs will get trained on different sets of images so that they can recognize something from those images. Like, you know, birds or bowling balls or, you know, Adidas t-shirts.

So it’s sort of the same thing, but just with text. Embeddings are sort of like GPS coordinates for meaning. New York and Boston are pretty close, right? They’re nearby each other. New York and Tokyo are pretty far from each other; I think those are almost exact opposite sides of the planet. So if you were to take SQL optimization and query tuning, those are pretty close by. Those are close-by concepts, right?

They’re very near. They are near neighbors to each other. But SQL optimization and best pizza are pretty far apart.

Even though the best SQL optimization and the best pizza are in New York, right? That’s me. I’m not the best pizza, but.

Ah, whatever. Anyway, that was a little off the cuff; I apologize. But those things are far apart, right? If you’re trying to find query tuners in New York and you get back a list of pizza places, you’re probably not going to be too happy with that search.
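That “nearby concepts” idea is exactly what a vector search does. Here’s a hedged sketch using a hypothetical dbo.Posts table with a post_embedding column (both names are mine, not from the video), assuming the column was populated ahead of time by the same embedding model that produces the search vector:

    /* Hypothetical table and column names. In real life,
       @search_vector would be the 1024-dimension embedding of
       the question being asked; three toy dimensions here. */
    DECLARE @search_vector vector(3) = '[0.80, 0.55, 0.10]';

    SELECT TOP (5)
        p.post_title,
        distance =
            VECTOR_DISTANCE('cosine', p.post_embedding, @search_vector)
    FROM dbo.Posts AS p
    ORDER BY
        VECTOR_DISTANCE('cosine', p.post_embedding, @search_vector);
    /* Smallest cosine distance first: the closest neighbors
       in meaning, not the closest string matches */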

Anyway, that’s a little bit about embeddings. I hope you enjoyed yourselves, and I hope you learned something.

And I will see you over in tomorrow’s video, where we will talk about more stuff from the course. And again, this course is all currently on sale. You get a hundred bucks off with that coupon code.

The link will be down in the video description. But if you don’t feel like clicking through on things, and you feel like there might be some malicious UTM codes or something in there, you can just go to training.erikdarling.com. So grab that Get AI-Ready course, and just use the coupon code AIREADY to get that hundred bucks off.

Anyway, I’ll see you over in tomorrow’s video, where we will talk about some other vector-y things. All right.

Thank you for watching.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.