Get AI-Ready With Erik: A Little About The StackOverflow Demo Database

Video Summary

In this video, I dive into the demo database used for my new course, Get AI Ready with Erik, which focuses on vector data and search strategies in SQL Server 2025, Azure SQL Database, and Azure Managed Instance. I explain why I chose Stack Overflow 2010 instead of the larger Stack Overflow 2013 database: its smaller initial size (about a 1 gig download and 10 gigs in database size) still grows significantly after generating embeddings and other vector data, reaching around 40 gigs. Throughout the video, I discuss various aspects of the database, such as the benefits and challenges of using different columns for embedding generation, like question titles and bodies, and how to handle text length limits by chunking longer texts so that meaningful context is maintained across embeddings.

Full Transcript

All right, Erik Darling here with Darling Data. Like I said, we’re going to spend a little time in January talking about my new course, Get AI Ready with Erik, which is all about dealing with the vector stuff in SQL Server 2025, Azure SQL Database, and Azure Managed Instance. Today, we’re going to talk a little bit about the demo database that I’m using here. Most of my training uses a version of the Stack Overflow database called Stack Overflow 2013. But for this course, I started with a smaller one, because this course isn’t necessarily about performance tuning in all its glory, where a bigger database presents more performance problems. This is just about getting started with the vector stuff: learning how search works, generating embeddings, chunking, and various different search strategies. For this course, I’m using Stack Overflow 2010, because it’s a lot smaller to start with. It’s about a 1 gig download and 10 gigs in database size. But once we generate all the embeddings and such, it ends up being around 40 gigs, so this thing gets really big compared to where it started. If I did this with the Stack Overflow 2013 database, which is about 50 gigs, it may well have ended up between 100 and 150 gigs once all the embeddings got generated.

So let’s talk about a few things that make for good embeddings in Stack Overflow. In the Posts table, we have things like the question title, which is great, because it’s short, it’s compact, and it’s easy to fit the entire thing into one embedding. Then we have the question and answer bodies, which are also good, but they present some challenges because they tend to be longer text. We may need to think about chunking the longer ones up, because if we don’t, they’d get silently truncated after a certain number of tokens.
We’ll talk about all this more as we go through things, but I want you to get an understanding of where we’re going with the course. I’m going to jump ahead a little bit in the things that I say, so if there’s anything you don’t understand, just make a mental note and write it down. I promise you, something will come up later where we’ll talk about it. Of course, this is all fully covered in the course material that I have.

And then we have the Tags column, which is not really good for generating embeddings on, because we’re probably not going to be looking for similar tags for a lot of stuff. But the tags do make for a good pre-filtering element, to sort of guarantee that something we’re getting is actually what we’re looking for. Again, stuff we’ll talk about as we go through. The Users table is a bit of a mixed bag. The AboutMe column would be good for generating embeddings, because it might help you find users who would be good at answering certain questions, based on what they’ve typed in their little biography field. Location is one you could talk me into, because in Stack Overflow land, location is not a dropdown. It is a free text entry form, and people can write in all different things.

If we wanted to find a bunch of people in New York, we could have “New York” spelled out. We could have “NY”. We could have “New York, New York”. We could have Brooklyn, Manhattan, Queens, Staten Island, the Bronx. All these different things could indicate New York without literally being the words “New York”. So maybe location would be kind of cool. Website? Unless you were looking for someone spamming some domain, like an online casino or something, probably not, because you could already find that pretty easily. In the Comments table, there is a Text column with the comment in it, but we don’t read the comments here.

So probably not that one either. It would be like trying to find similar user reviews or something. I mean, maybe you could find comment spam with it, but comment spam tends to be pretty copy and paste: hey, join my Discord server, win a chance to win 3 million bitcoins, or something. The way we want to look at this is that what we’re going to use is generally in the Posts table, to sort of link different things together.

So, starting with the question title and then finding other similar questions by title, or starting with the question title and then finding what might be another good answer to the question, but maybe one that’s the answer to a different question. All sorts of different things. If we look at the breakdown of questions and answers in the Posts table (let’s zoom in on this a bit), we have about 2.6 million answers, about 1 million questions, and about a thousand others.

What the others are doesn’t matter; they’re the things no one cares about in Stack Overflow. The titles that we get in Stack Overflow look like this, and they’re generally pretty short. People tend to write short, descriptive titles for their questions, to get the attention of people who could answer them.

People are doing their own similarity search for things they know about, right? Someone sees “how do I get the most recent commits?” and goes, oh, I know how to do that, I can answer that question. The human brain pattern matches.

So this is kind of what stuff generally looks like. When I said that titles are generally short, it’s because people don’t want to write their life story and their whole question in the title. But question titles can also be somewhat vague, yet still helpful.

If you look at the DBA Stack Exchange site, there’s going to be a ton of questions like “how to optimize this SQL query”, over and over again, like they were copied and pasted in.

The actual details and the useful stuff are going to be in the question body. So the question titles are great because they tend to be short, and it’s very easy to generate fully self-contained embeddings based on short text like that.
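To make the title-plus-tags idea concrete (search by title similarity, with tags as a pre-filter, as mentioned earlier), here is a minimal Python sketch. Everything in it is made up for illustration: the in-memory posts list stands in for the Posts table, and the tiny 3-dimensional vectors stand in for real embeddings, which would have hundreds or thousands of dimensions.

```python
import math

# Hypothetical stand-in for the Posts table: each row has tags and a
# precomputed embedding (tiny 3-dimensional vectors for illustration only).
posts = [
    {"id": 1, "tags": "<sql-server><performance>", "embedding": [0.9, 0.1, 0.0]},
    {"id": 2, "tags": "<python><pandas>",          "embedding": [0.1, 0.9, 0.2]},
    {"id": 3, "tags": "<sql-server><t-sql>",       "embedding": [0.8, 0.2, 0.1]},
]

def cosine_similarity(a, b):
    # Standard cosine similarity: dot product over the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def search(query_embedding, required_tag):
    # Pre-filter on tags first, so the similarity ranking only runs
    # against rows that are guaranteed to be on-topic.
    candidates = [p for p in posts if required_tag in p["tags"]]
    return sorted(
        candidates,
        key=lambda p: cosine_similarity(query_embedding, p["embedding"]),
        reverse=True,
    )

results = search([1.0, 0.0, 0.0], "<sql-server>")
print([p["id"] for p in results])  # only sql-server posts, best match first
```

The same shape applies whether the distance function runs in application code like this or in the database engine; the pre-filter just shrinks the set of vectors that need comparing.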

In the Body column, this stuff tends to get a bit longer. So if we look at… this is taking forever. Why is this so slow? We’re just ordering by score descending. Jeez Louise.

Bodies. There’s some interesting stuff to deal with in the bodies that can influence some of your choices, which we’ll talk about later. One of the really annoying things is that in the Body column, Stack Overflow stores all the HTML, and that HTML tends to count towards the number of tokens and stuff.

It’s aggravating, but it’s the best that we’ve got here. I didn’t want to write a function to clean HTML out of Stack Overflow questions. I’m sorry. I guess I could have had AI do it.
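For what it’s worth, a bare-bones tag stripper doesn’t have to be much code. This is just an illustration of the option, not what the course does (the course leaves the HTML in place); it uses Python’s standard library html.parser and the example body string is made up.

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only the text content, dropping the HTML tags themselves."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags; tags themselves are skipped.
        self.parts.append(data)

def strip_html(html_text):
    stripper = TagStripper()
    stripper.feed(html_text)
    return "".join(stripper.parts)

body = "<p>How do I <code>ORDER BY</code> a computed column?</p>"
print(strip_html(body))  # How do I ORDER BY a computed column?
```

Stripping the markup before generating embeddings would keep the HTML from eating into the token budget, at the cost of losing things like code formatting hints.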

But this is what bodies tend to look like. Now, if we look at the distribution of title lengths in the Posts table, we’ll see that a lot of them are on the very short side. I actually don’t know what the minimum number of characters for a question title is.

So let’s just say between 1 and 100 characters: the majority of the questions we have fall into that range. Most of them are between 30 and 59 characters, and a few of them are on the longer side up here.

But they’re still short enough to be pretty self-contained, as far as generating one single self-contained embedding without having to think about breaking your text up over multiple embeddings.
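The kind of length-distribution breakdown described above can be sketched in a few lines. The titles list here is a made-up sample standing in for the Posts table; in the course this sort of bucketing runs as a query against the database.

```python
# Hypothetical sample of question titles, standing in for the Posts table.
titles = [
    "How to optimize this SQL query",
    "Why is my index not being used?",
    "Getting the most recent commits in git",
    "What is the difference between CHAR and VARCHAR in SQL Server, and when should I use each one?",
]

def bucket_lengths(texts, bucket_size=30):
    # Group character lengths into ranges like 1-30, 31-60, and so on,
    # then return the buckets in ascending order.
    buckets = {}
    for t in texts:
        lo = (len(t) - 1) // bucket_size * bucket_size + 1
        label = f"{lo}-{lo + bucket_size - 1}"
        buckets[label] = buckets.get(label, 0) + 1
    return dict(sorted(buckets.items(), key=lambda kv: int(kv[0].split("-")[0])))

print(bucket_lengths(titles))
```

Running the same bucketing over body lengths instead of title lengths is what surfaces the long tail that makes chunking necessary.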

So if we look at body lengths, and run the same kind of query, a lot of them are less than 500 bytes, but a lot of them tend to get pretty big. And this is where you have to start worrying about text length limits for your embedding model.

The reason this matters is because embedding models convert text to what are called tokens. If you’ve ever used an LLM command line tool, like Claude Code or Cursor, you’ll notice there’s a little counter that says tokens, and it racks up pretty quickly, like you’re trying to get a high score.

All those embedding models convert text to tokens, and you get about four characters per token. That’s why I was getting kind of annoyed about the HTML being stored in there. Different embedding models will have different token limits, right?

And those token limits define how much text goes into generating an embedding. So if your embedding model has a limit of, say, 512 tokens, any text beyond those 512 tokens just gets silently thrown away.

And 512 tokens is maybe around 400 words, or roughly 2,000 characters at four characters per token. A lot of question bodies exceed that limit. This is where the concept of chunking comes in.

You take your long bodies and chunk them. Say you have something that would take 1,024 tokens.

You break it up into two pieces and generate your embeddings over both. And those pieces can have a percentage of overlap to maintain context across the boundary. Let’s say you had a short paragraph first and then a longer paragraph.

You’d want some overlap, so that the chunks aren’t just the first half of the text and then the second half, with a hard cut in between. You’d lose some context and meaning if you did that.
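A minimal character-based version of chunking with overlap might look like this. Production chunkers usually split on sentence or token boundaries, but the character version keeps the idea visible; the 2,048-character chunk size and 256-character overlap are made-up values.

```python
def chunk_text(text, chunk_size=2048, overlap=256):
    # Slide a window of chunk_size characters across the text, stepping
    # forward by (chunk_size - overlap) each time, so consecutive chunks
    # share `overlap` characters of context.
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

body = "a" * 3000  # stand-in for a long question body
chunks = chunk_text(body)
print(len(chunks), [len(c) for c in chunks])  # 2 [2048, 1208]
```

Each chunk then gets its own embedding, and the shared 256 characters mean a sentence that straddles the boundary still appears whole in at least one chunk.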

The whole point of chunking with overlap is to maintain some context across longer texts. Anyway, that’s about enough for this one. Thank you for watching.

I hope you enjoyed yourselves, and I’ll see you over in tomorrow’s video, where we’ll have some additional vector data type things to talk about. See you tomorrow.

Goodbye.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.