Get AI-Ready With Erik: Bad Vector Observability
Summary
In this video, I delve into the world of bad vectors and embeddings, illustrating what they look like through real-life examples and practical scenarios. Drawing from my experience attempting to transcribe, summarize, and chapterize a large YouTube catalog using local large language models (LLMs), I highlight how imperfect these models can be, leading to issues such as chapter timestamps that run far past the actual video length. This example underscores the need for robust validation mechanisms to ensure embeddings are of high quality before they’re stored or used in any pipeline. I also explore techniques like using the dot product argument of the vector distance function to identify problematic vectors, emphasizing the importance of catching these issues early on to maintain the integrity of the data processing workflow.
Chapters
- *00:00:00* – Introduction
- *00:01:00* – Bad Embeddings Example
- *00:02:14* – LLM Pipeline Issues
- *00:03:20* – Vector Distance Function
- *00:05:33* – Filtering Out Bad Embeddings
- *00:07:46* – Trigger for Validation
- *00:08:45* – Conclusion
Full Transcript
Erik Darling here with Darling Data. What I want to show you in this video is what bad vectors, or bad embeddings, look like. This can happen for a lot of reasons: something weird can go wrong while you’re generating the embedding, you can have truncated text feeding it, and so on. Let me give you an example from my real life. One of the things I mentioned trying to do was take my whole YouTube catalog and have it transcribed, summarized, and chaptered using local LLMs. I have a pipeline set up to download each YouTube video, use one local LLM to produce the transcription, and then another LLM to read the transcription and generate the summary and chapters. What was really interesting, a funny thing I didn’t catch until I applied a fair amount of scrutiny, is what happens when one LLM generates the transcript and another LLM reads it and summarizes it. The summaries were generally okay, if a little repetitive: “in this video I delve into,” “in this video I dive into,” whatever, it doesn’t matter.
But what was really interesting was the chapters. The reason the chapters were interesting is that the LLM reading the transcript, until I put the duration in there, had no idea how long the video was. So I would have a 10, 12, 15 minute video, and the local LLM would start putting chapters at an hour, an hour and a half, two hours out. And I thought: that’s not good. Someone’s going to look at that and think I’m crazy. This is a 15 minute video, and there’s a chapter three hours in? Good night. I’ve had to redo a lot of stuff because of that. I spent yesterday with my MacBook propped up on drink coasters, with ice sleeves for pain underneath it, because it was getting hot.
It was the whole thing. Anyway, it was about seven hours of reprocessing 700 videos or so. But, again, here’s something we can all agree on: LLMs are currently imperfect, pipelines are also somewhat imperfect, and computers are quite imperfect. So there are a lot of potential reasons why you might need to find and deal with things that look like this. One way you can do that is by using the dot product argument of the vector distance function. You can generally use those numbers to find vectors, or embeddings, that are not good. Thankfully, I don’t have any of those in my real data; nothing in there is messed up, all zeros, or very weak with the same low numbers all across.
What I want to show you is what happens when we mix good ones and bad ones together. I’m using some literal values here to show some okay stuff, then some bad stuff that’s all zeros, and finally a NULL. We can use some fancy queries to categorize those and find the ones that are not good. So we end up with something like this: five rows that are near zero magnitude, essentially zeros, one row that’s okay, and one row that’s NULL. And we can use another fancy query to get the detail behind that, since the first one was an aggregation.
When the dot distance is zero, that’s probably not a good sign. A number like negative four? Also not a good sign. Generally you want to see values that are okay: negative one, or close to negative one. And to be clear, this is the vector compared against itself. I’m not talking about the dot product between two different vectors. When you compare a vector to something else, that’s a completely different thing; comparing an embedding to itself is what we’re looking at here.
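A rough sketch of what that self-comparison might look like, assuming SQL Server 2025’s vector type and the VECTOR_DISTANCE function with the 'dot' metric. The table, values, and thresholds here are made up for illustration, with tiny 3-dimension vectors standing in for real embeddings:

```sql
/* Hypothetical sample data: one healthy row, some zero
   and near-zero rows, and a NULL. */
CREATE TABLE #embeddings
(
    id integer IDENTITY PRIMARY KEY,
    embedding vector(3) NULL
);

INSERT
    #embeddings (embedding)
VALUES
    (CAST('[0.6, 0.8, 0.0]'     AS vector(3))), /* normalized, healthy */
    (CAST('[0.0, 0.0, 0.0]'     AS vector(3))), /* all zeros, bad      */
    (CAST('[0.001, 0.0, 0.002]' AS vector(3))), /* near zero, weak     */
    (NULL);                                     /* missing entirely    */

/* Compare each embedding to itself with the dot metric.
   A roughly unit-length embedding lands close to -1;
   an all-zeros embedding lands at 0. */
SELECT
    e.id,
    self_dot = VECTOR_DISTANCE('dot', e.embedding, e.embedding),
    verdict =
        CASE
            WHEN e.embedding IS NULL
            THEN 'missing'
            WHEN VECTOR_DISTANCE('dot', e.embedding, e.embedding)
                 BETWEEN -1.05 AND -0.95
            THEN 'looks okay'
            ELSE 'needs reprocessing'
        END
FROM #embeddings AS e;
```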
The reason you would care about this is that you might have the text of a document somewhere in your database, and you might also have the embedding for it. It doesn’t matter whether the text is one line, a couple of paragraphs, or a long document: if the embedding is all zeros, it’s never going to match anything. It will never come up as being similar to anything. That’s what you have to be really careful of, because it makes the LLMs look worse than they actually are.
So I’m going to create a slightly different table here called bad embeddings, using the same kind of setup, with some okay rows and some not-so-okay rows. If we run a query against it and look at what comes back, the good match and the great match come up on top, of course, but the weak matches and noise aren’t too far behind. In other videos I’ve talked about being careful when filtering on this, saying vector distance is less than 0.2 or whatever.
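A sketch of what that bad embeddings setup and search might look like. The table, labels, and query vector are all hypothetical stand-ins for the demo, again using tiny 3-dimension vectors:

```sql
/* Hypothetical table mixing healthy and broken embeddings. */
CREATE TABLE #bad_embeddings
(
    id integer IDENTITY PRIMARY KEY,
    label varchar(20) NOT NULL,
    embedding vector(3) NOT NULL
);

INSERT
    #bad_embeddings (label, embedding)
VALUES
    ('great match', CAST('[0.6,  0.8,  0.0]'  AS vector(3))),
    ('good match',  CAST('[0.55, 0.83, 0.05]' AS vector(3))),
    ('weak match',  CAST('[0.1,  0.1,  0.1]'  AS vector(3))),
    ('noise',       CAST('[0.0,  0.0,  0.0]'  AS vector(3)));

/* Search with a query vector. The broken rows still come back;
   they just trail the good ones in the ordering. */
DECLARE @q vector(3) = CAST('[0.6, 0.8, 0.0]' AS vector(3));

SELECT
    b.label,
    distance = VECTOR_DISTANCE('dot', @q, b.embedding)
FROM #bad_embeddings AS b
ORDER BY distance;
```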
You can sort of get rid of stuff like that with a filter. But if you were expecting good matches from some of these rows, you might be pretty surprised when you don’t get them, and that’s going to come down to the vectors, the embeddings, being messed up. So if we run a couple of queries, one saying vector distance not between -1.05 and -0.95, and one saying between -1.05 and -0.95, we’ll see those two different result sets come in. The good match and great match ended up in one; the bad ones ended up in the other. So this is one way of catching and filtering out bad embeddings: by comparing them to themselves.
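Those two filtering queries might be sketched like this, with a hypothetical table for brevity. Rows whose self dot product falls close to -1 are kept; everything else gets flagged:

```sql
/* Hypothetical table with one healthy row and one zeroed row. */
CREATE TABLE #filter_demo
(
    label varchar(20) NOT NULL,
    embedding vector(3) NOT NULL
);

INSERT
    #filter_demo (label, embedding)
VALUES
    ('great match', CAST('[0.6, 0.8, 0.0]' AS vector(3))),
    ('noise',       CAST('[0.0, 0.0, 0.0]' AS vector(3)));

/* Healthy rows: self dot product close to -1. */
SELECT f.label
FROM #filter_demo AS f
WHERE VECTOR_DISTANCE('dot', f.embedding, f.embedding)
      BETWEEN -1.05 AND -0.95;

/* Broken rows to log and reprocess: everything else. */
SELECT f.label
FROM #filter_demo AS f
WHERE VECTOR_DISTANCE('dot', f.embedding, f.embedding)
      NOT BETWEEN -1.05 AND -0.95;
```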
Because again, this isn’t two different embeddings; it’s the same embedding. We’re just asking: how much do you agree with yourself? How strongly do you reinforce yourself?
So you might now be thinking: what are some ways we could validate vectors as they come into the database, so we can catch this stuff? Unfortunately, at least as things currently exist, this is a real bummer. I’m on CU1 of SQL Server 2025.
I know it says RTM down here, but I’m on CU1; RTM doesn’t update to say CU1. Thanks, Microsoft, for making me look dumb. You might think you could create a computed column over each vector and then add a check constraint to say: if you’re messed up, I don’t want you on my table.
But it doesn’t work. We can’t persist the computed column, because vector distance is non-deterministic. So that’s messed up.
And of course, if we take out the PERSISTED and say, okay, no persisting, then we get a different error: we can’t create the check constraint on the column, specifically because it is not persisted. So that messes us up further.
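The two failed attempts described above might look like this. The table name is made up, and the error behavior is as described in the video, not reproduced output:

```sql
/* Hypothetical table for the two failed validation attempts. */
CREATE TABLE dbo.validated_embeddings
(
    id integer IDENTITY PRIMARY KEY,
    embedding vector(3) NOT NULL
);

/* Attempt 1: a persisted computed column over the self dot product.
   This errors out, because VECTOR_DISTANCE is non-deterministic,
   and non-deterministic expressions can't be persisted. */
ALTER TABLE dbo.validated_embeddings
ADD self_dot AS VECTOR_DISTANCE('dot', embedding, embedding) PERSISTED;

/* Attempt 2: add the computed column without PERSISTED (this works),
   then try a check constraint on it. The constraint errors out,
   because the column is not persisted. */
ALTER TABLE dbo.validated_embeddings
ADD self_dot AS VECTOR_DISTANCE('dot', embedding, embedding);

ALTER TABLE dbo.validated_embeddings
ADD CONSTRAINT ck_self_dot
    CHECK (self_dot BETWEEN -1.05 AND -0.95);
```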
So you’re kind of stuck with a trigger. I’m just going to give you a simple AFTER INSERT, UPDATE trigger, where real life would be an INSTEAD OF trigger: you would insert the good rows into the real table and insert the bad rows into a logging table that says, I need to reprocess you.
But one way you can do it is with a trigger that says: hey, if you’re not between these magic numbers that I care about, you’re out of here. So this insert passes and this one fails. That’s just one way of protecting yourself from bad vectors getting in, for whatever reason.
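A minimal sketch of that trigger, with hypothetical object names and the same magic-number range used earlier. As noted above, a real pipeline would more likely use an INSTEAD OF trigger that routes bad rows to a logging table instead of failing the whole statement:

```sql
/* Hypothetical guarded table plus a simple AFTER trigger. */
CREATE TABLE dbo.embeddings_guarded
(
    id integer IDENTITY PRIMARY KEY,
    embedding vector(3) NOT NULL
);
GO

CREATE OR ALTER TRIGGER dbo.reject_bad_embeddings
ON dbo.embeddings_guarded
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    /* If any incoming row disagrees with itself, reject the batch. */
    IF EXISTS
    (
        SELECT
            1
        FROM inserted AS i
        WHERE VECTOR_DISTANCE('dot', i.embedding, i.embedding)
              NOT BETWEEN -1.05 AND -0.95
    )
    BEGIN
        THROW 50000, N'Bad embedding: self dot product out of range.', 1;
    END;
END;
GO

/* This one passes: */
INSERT dbo.embeddings_guarded (embedding)
VALUES (CAST('[0.6, 0.8, 0.0]' AS vector(3)));

/* This one fails with the error above: */
INSERT dbo.embeddings_guarded (embedding)
VALUES (CAST('[0.0, 0.0, 0.0]' AS vector(3)));
```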
Again: LLM failures, encoding failures, embedding failures, all the stuff that can happen. If you’ve ever dealt with any sort of ETL pipeline or import process in your life, you’re no stranger to things like this: having some mechanism for capturing the bad stuff, logging it, saying “I need to reprocess this,” and letting the good stuff in.
Anyway, that’s about it. Thank you for watching. I hope you enjoyed yourselves, I hope you learned something, and I will see you over in tomorrow’s video. Thank you.
I hope that at some point you learn to love me. Am I cool yet? Ah, screw it.
All right. Thank you.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

