Get AI-Ready With Erik: Generating Chunks

Summary

In this video, I delve into the intricacies of managing long content when using embedding models such as mxbai-embed-large and OpenAI’s text-embedding-ada-002. With token limits being a critical factor, I explain how quickly they can be reached in scenarios such as Stack Overflow posts, where question bodies and answer bodies can easily exceed the 512-token limit of some models, or even the more generous 8191-token limit of others. To address this challenge, I introduce Microsoft’s chunking function, AI_GENERATE_CHUNKS, detailing its inputs (source text, chunk type (currently only fixed), chunk size, and overlap) and how they interact to preserve context while respecting token constraints.

Chapters

*00:00:00* – Introduction to Chunks and Token Limits
*00:03:45* – Stack Overflow Database Considerations
*00:07:28* – Example Text Chunking
*00:10:09* – Overlap in Chunks
*00:12:04* – Conclusion

Full Transcript

Erik Darling here with Darling Data, and we’re going to finish out this week talking about a subject that is near and dear to my heart, a subject that I have a lot of experience with, and that is dealing with long content. Because, you know, I have written some pretty long blog posts in my life. I’ve written some extensive training material in my life, and I’ve got to manage a lot of long content day to day. So, we’re going to do that. These are, of course, all videos, all snippets, tidbits, tiny chunks, right, mere shadows of the full course material from my class, course, whatever you want to call it, my correspondence course. You can become a locksmith, repair guns, crack safes, get AI ready with Erik, where you can learn about all of these things in full, gruesome, gory detail. And, you know, so we’ve got that going for us. And my green screen is acting a little funny in the background, but not so funny that I’m willing to stop.

So, embedding models, as you either know now or will know in the near future, have things called token limits. If you’ve ever used an LLM to any degree, you know, if you’re using like a web interface, you might see something like, this conversation length is done. That’s enough. We’re over here. We have reached our context. And this is token driven. And if you’ve ever used, you know, a more professional grade LLM product, like say, Claude Code or something like that, you may have noticed it slowly counting up tokens as it does things.

Where it’s like, you know, thinking about it. I read this file. I’m thinking about it. And you just see the number of tokens rack up and up. And then you hit a certain amount of tokens and they charge you more money. So, embedding models have token limits. The one that we use for the course, mxbai-embed-large, has a 512 token limit. OpenAI’s text-embedding-ada-002 has an 8191 token limit. And there are some other ones that have much smaller token limits, like the MiniLM, blah, blah, blah.

Stack Overflow content, for a lot of things, will fit just fine into the 512 token limit. Titles are usually fine because they’re less than 300 characters. But bodies are often long. They are often 1,000 to 10,000 or more characters. Some long content, right? Some pretty verbose answerers on Stack Overflow. And that’s even just for like question bodies. And of course, answer bodies, which is probably more in line with what I was just talking about, can be very long.

You’ll have code examples. You’ll have all sorts of stuff in there. You know, quotes from documentation that are extensive. Things like that. That may exhaust the token limit for your model. You don’t want an exhausted model, I can tell you that much.

And I think what’s especially, you know, maybe surprising to some of you out there is that if you exceed the limit of tokens, then the model just stops, right? It just silently truncates the rest of it. It just leaves a whole bunch of stuff out. Which, you know, can be not great if you’ve got a very long answer that’s full of very good information and, like, very important details.

And the first, you know, like, let’s say, I don’t know, 1,000 characters or so is just sort of like, you know, preamble. You probably ain’t training stuff right, right? You’re not going to have very good context for things. So, Microsoft has given us a way to deal with long content in which we can generate chunks.
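To make the truncation problem concrete, here is a minimal Python sketch of what "silently truncates" costs you. It is an illustration only: whitespace-separated words stand in for tokens (real tokenizers split text differently), and the function name `embed_input` and the tiny limit of 8 are hypothetical.

```python
def embed_input(text: str, token_limit: int) -> tuple[str, str]:
    """Split text into the part a model would see and the part it would drop.

    Assumption: one whitespace-separated word counts as one token.
    Real tokenizers produce different counts, so treat this as a rough picture.
    """
    words = text.split()
    kept = " ".join(words[:token_limit])
    dropped = " ".join(words[token_limit:])
    return kept, dropped


# A long-winded answer: six words of preamble, then the useful part.
answer = "preamble " * 6 + "the actual fix is to add an index"
kept, dropped = embed_input(answer, token_limit=8)

print("seen by the model:", kept)
print("silently lost:    ", dropped)
```

Most of what survives is preamble, and the part that carried the meaning never reaches the embedding.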

I don’t know who names these things. Just, like, like, obvious barf references. Why? Why? Why are you going to call it chunks?

What are you doing over there? Ah, generating chunks. Feeling okay? No. So, generating chunks requires, well, requires three, but there is an optional fourth, input to your chunk generator. There is the source from which you wish to generate chunks.

There is chunk type, which currently, I believe now, you can only do fixed. And then there is chunk size, which is the number of characters with which you wish to chunk. Then you have this other thing called overlap.

And overlap can be interesting because if you’re dealing with something that’s, like, paragraphs, like, and you generate a chunk of, like, let’s say, I don’t know, like, 300 characters, but the paragraph itself is, like, 600 characters, the next chunk you generate is going to be just, like, the second half of the paragraph.

And so you might want to have your chunks overlap a little bit so that, like, you retain some context from, like, the first half of the paragraph into the second half of the paragraph, right?

So, like, you kind of want to, like, blend things a little bit, right? It’s kind of like your blender tool where you can say, like, well, you know, this is a really important, like, and you’re not going to be doing this line by line when you generate chunks or else you might actually generate chunks.

But, like, you’re not going to do this line by line, but just, you know, like, like, thinking about a situation where, like, well, like, you’re reading through a paragraph and you’re like, that’s an important sentence.

And then, like, the next sentence kind of, like, carries on with that. You’re like, like, that’s the kind of, like, contextual stuff that you want to carry across so that when you’re generating embeddings, they sort of, like, retain more context across the chunks that you generate.
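The idea of carrying context across a chunk boundary can be sketched in a few lines of Python. This is not how SQL Server implements it; it is a simplified model in which overlap is a percentage of the chunk size, rounded down, and the function name `fixed_chunks` is hypothetical.

```python
def fixed_chunks(text: str, size: int, overlap_pct: int = 0) -> list[str]:
    """Cut text into fixed-size character chunks, optionally overlapping."""
    if size <= 0 or not 0 <= overlap_pct < 100:
        raise ValueError("size must be positive and overlap_pct in [0, 100)")
    overlap_chars = size * overlap_pct // 100  # assumed rounding: floor
    step = size - overlap_chars                # how far each chunk start moves
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):          # this chunk reached the end
            break
        start += step
    return chunks


alphabet = "abcdefghijklmnopqrstuvwxyz"
no_overlap = fixed_chunks(alphabet, size=10)
with_overlap = fixed_chunks(alphabet, size=10, overlap_pct=20)

print(no_overlap)    # neighbouring chunks share nothing
print(with_overlap)  # each chunk starts with the last 2 characters of the previous one
```

With a 20% overlap on 10-character chunks, every chunk repeats the final two characters of the one before it, which is the "blending" across boundaries described above.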

But AI_GENERATE_CHUNKS will return a table-valued result with the chunk, which is the text fragment that is being chunked. And that is what would feed into the AI_GENERATE_EMBEDDINGS function.

The chunk order, right? So you can see, you know, I guess, like, breakfast, lunch, or dinner. And then you have the chunk offset, which is the position in the source.

So maybe that was, like, an afternoon snack. And then you have the chunk length, which is the number of characters in the chunk. So just a very basic example would be something like this, where I have this text column that is just a string that I’m selecting in here.

And we’re going to just CROSS APPLY AI_GENERATE_CHUNKS here. And we have our source pointing to this thing. And, well, we can only say fixed here.

But then we’re going to just give me a chunk size of 50 characters, right? It’s not, that’s not a best practice. That’s not, like, what you should do. You have to figure that stuff out.

This is just the example I’m using based on the piece of text that I have there that made for a reasonable demo. So that’s what we’ve got, right?
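The shape of the result described above (the chunk text, its order, its offset in the source, and its length) can be mocked up in Python. This is a sketch of the row shape only, not the SQL Server function: `ChunkRow` and `chunk_rows` are hypothetical names, and starting `chunk_order` at 1 and `chunk_offset` at 0 are assumptions that the real function may not share.

```python
from typing import NamedTuple


class ChunkRow(NamedTuple):
    chunk: str         # the text fragment
    chunk_order: int   # position of the chunk in the sequence (assumed 1-based)
    chunk_offset: int  # where the chunk starts in the source (assumed 0-based)
    chunk_length: int  # number of characters in the chunk


def chunk_rows(text: str, size: int) -> list[ChunkRow]:
    """Return one row per fixed-size chunk, with no overlap."""
    rows = []
    for order, offset in enumerate(range(0, len(text), size), start=1):
        piece = text[offset:offset + size]
        rows.append(ChunkRow(piece, order, offset, len(piece)))
    return rows


text = "Chunks feed the embedding step one row at a time."
rows = chunk_rows(text, size=20)
for row in rows:
    print(row)
```

Only the final row can be shorter than the requested size, and joining the chunk column back together in order reproduces the source text.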

Again, 50 might be great, right? Like, cost threshold for parallelism might be amazing. I don’t know. Anyway, here’s what we get back. And here’s what we see in here.

Now, there are some things you have to consider with this. And one of those things is very specific to the Stack Overflow database, in that the Body column in the Posts table has, like, HTML formatting in it, right?

It’s sort of, like, Markdown-y formatting kind of. But there’s a lot of, like, brackets and stuff. Now, it’s not like an embedding model is going to consider those things important, but they are characters that do contribute to your token limit.

So, depending on the cleanliness of your data, and, you know, how many tokens your embedding model allows, you might have to seriously think about cleaning out these, like, nonsense things, like, you know, BR and H1 and H3, stuff like that.
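One way to clean markup out before chunking is to strip the tags and keep only the text. The sketch below uses Python's standard-library `HTMLParser`; the class name `TextOnly`, the `BLOCK_TAGS` set, and the sample body are all hypothetical, and real post bodies may need more care (code blocks, entities, nested lists).

```python
from html.parser import HTMLParser

# Tags that should leave a space behind so neighbouring words don't fuse.
BLOCK_TAGS = {"p", "br", "div", "li", "pre", "blockquote", "h1", "h2", "h3"}


class TextOnly(HTMLParser):
    """Collect text content and discard the tags around it."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs) -> None:
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag) -> None:
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_tags(html_body: str) -> str:
    parser = TextOnly()
    parser.feed(html_body)
    parser.close()
    # Collapse the runs of whitespace left where tags used to be.
    return " ".join("".join(parser.parts).split())


body = "<h1>Slow query</h1><p>Add an index on <code>OwnerUserId</code>.<br>Then test.</p>"
clean = strip_tags(body)
print(len(body), "characters before,", len(clean), "after")
print(clean)
```

Every character removed here is one that no longer counts against the chunk size or the token limit.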

But this is what we get back. We get back four lines. We see the order of the chunks. And, like, we see each of these chunks on one line, and they’re all 50 characters, right?

But notice, like, also, like, there are, like, big spaces in it. Like, the spaces contribute to this too, right? So, like, spaces, like, any character, right, whether it’s useful or not, contributes to the chunk size.

If you have empty lines in your stuff, they, like, you’ll get empty lines back from generating chunks, right? So, like, it’s like, I am a line with all these empty things in it that will do no one any good whatsoever.

So, like, sometimes you might have to filter out, like, you know, you might want to, like, you know, pre-filter garbage from here and just say, like, well, you know, if you don’t have any, like, useful characters in you, I’m hacking you out because you’re no good, right?

So just this simple WHERE chunk LIKE, you know, A to Z and 0 to 9, right? So, anything that actually has that in it, we’ll keep. Anything else, go away, right? Like, you’re not words anymore.

Like, I don’t know, it’s, well, I mean, I say that from a very, like, you know, you know, like, anglo-centric point of view where these are my letters. If you can do that with your letters, right?

Like, lesson learned, make sure your letters are in there, your alphabet, whatever language you’re doing this in, make sure you’re represented. That’s mine, so that’s what I’m doing. But overlap is there to prevent losing context at chunk boundaries.
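The filter described above, and its language caveat, can be illustrated in Python. This is a sketch of the idea rather than the T-SQL predicate from the video: `keep_useful` is a hypothetical helper, and the `unicode_aware` switch is one possible way to make sure other alphabets are represented.

```python
import re

# Matches any chunk containing at least one ASCII letter or digit.
HAS_ASCII_ALNUM = re.compile(r"[A-Za-z0-9]")


def keep_useful(chunks: list[str], unicode_aware: bool = False) -> list[str]:
    """Drop chunks that contain no letters or digits at all."""
    if unicode_aware:
        # str.isalnum() is true for letters and digits in any script.
        return [c for c in chunks if any(ch.isalnum() for ch in c)]
    return [c for c in chunks if HAS_ASCII_ALNUM.search(c)]


chunks = ["SELECT TOP (10)", "   \n\n   ", "-----", "日本語のテキスト"]

print(keep_useful(chunks))                      # ASCII-only filter
print(keep_useful(chunks, unicode_aware=True))  # keeps non-Latin text too
```

The ASCII-only version throws away the Japanese chunk along with the whitespace and dashes, which is exactly the trap to watch for if your data is not in English.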

So, each chunk will include some text from the previous chunk. If you set overlap to 10, that means each chunk will overlap the previous one by 10% of the chunk size. So, like, we do all this and say overlap equals 10, right?

And we run this, then we get back this, right? We get back five rows, right? And, you know, like, embed, and embed, and, you know, like, we don’t, there’s not a lot of overlap there.

If we crank the overlap up to, say, like, 25, right? So, let’s make that a little bit bigger, then the results change a bit, right? Like, and, actually, go away, Redgate. And then if we do 50, we will get far different results.

Now we get back seven rows of stuff. And, actually, it’s probably more, probably a little bit more illustrative to just run these all together, right? And just kind of see how things change across these.

And so, you know, you can kind of see that these don’t exactly line up in the exact same way, right? Like, the alignment here is not the same across, you know, a 10% overlap and a 25% overlap. And then a 50% overlap actually causes more rows to get produced, because we end up having more chunks of 50 characters, because we preserve more from, like, the previous row.
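Why a bigger overlap produces more rows comes down to arithmetic: the more each chunk repeats, the less new text it covers. Here is a minimal Python sketch, assuming overlap is a percentage of the chunk size rounded down and using a hypothetical 200-character source (the demo text itself is not shown in the transcript, and SQL Server's exact boundary handling may differ).

```python
def chunk_count(text_length: int, size: int, overlap_pct: int) -> int:
    """Count the fixed-size chunks needed to cover text_length characters."""
    if text_length <= 0:
        return 0
    if text_length <= size:
        return 1
    overlap_chars = size * overlap_pct // 100  # assumed rounding: floor
    step = size - overlap_chars                # new characters covered per chunk
    remaining = text_length - size             # what is left after the first chunk
    return 1 + -(-remaining // step)           # 1 + ceiling(remaining / step)


counts = {pct: chunk_count(200, 50, pct) for pct in (0, 10, 25, 50)}
for pct, rows in counts.items():
    print(f"overlap {pct:>2}% -> {rows} chunks")
```

Under these assumptions, a 200-character string gives 4, 5, 5, and 7 chunks for overlaps of 0, 10, 25, and 50, which lines up with the four, five, and seven rows mentioned for the 0%, 10%, and 50% runs.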

So, if you have long content, and long content, of course, will vary by the size of your text that you’re dealing with and the token limit of the embedding model that you are using, you may need to consider generating chunks. Thank you for watching.

I hope you enjoyed yourselves. I hope you learned something. And I’ll see you next week. Another one. All right.

Goodbye. Goodbye.

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.