Get AI-Ready With Erik: Capacity Planning
Summary
In this video, I delve into the fascinating world of capacity planning as it pertains to vector embeddings in SQL Server databases. I start by comparing the size of a typical Stack Overflow post table from 2010 to its corresponding embeddings table, revealing just how much space these embeddings can consume—eight and a half gigs compared to 6.4 gigs for the original data. This comparison sets the stage for an exploration of vector sizes, which depend on the number of dimensions in your chosen model. I also discuss the implications of different models and their embedding sizes, touching on both theoretical maximums and practical considerations. Additionally, I explore how vector indexes work differently from traditional B-tree indexes, emphasizing the importance of columnstore indexes for optimizing space usage. Finally, I provide some practical tips for managing database space when working with large embeddings, including the use of clustered columnstore indexes to achieve significant savings.
Chapters
- *00:00:00* – Introduction
- *00:00:32* – Embedding Size Comparison
- *00:01:04* – Post Table vs Post Embeddings Table
- *00:01:53* – Projection and Estimation
- *00:02:17* – Embeddings Column Details
- *00:02:34* – Vector Storage Format
- *00:03:02* – Vector Size Dependence
- *00:03:31* – Model-Specific Sizes
- *00:04:29* – Vector Index Overview
- *00:05:04* – Graph Structure of Vector Index
- *00:05:10* – Edge Table Size
- *00:05:29* – Controlling Embedding Sizes
- *00:06:00* – Columnstore Indexing
- *00:06:37* – Interoperability Issues
- *00:06:42* – Currently Available Features
- *00:07:03* – Capacity Planning Tips
- *00:07:14* – Clustered Columnstore Index
- *00:07:41* – Course Promotion
Full Transcript
Erik Darling here with Darling Data. Boy oh boy, exciting times are upon us. We’re going to talk about a DBA-ish subject here, but who knows, maybe someday a developer will care about this sort of thing as well, and that is capacity planning. Because vectors themselves, the embeddings, can be rather large. And what I’m going to show you is... you know what, actually, I’m going to change this on the fly a little bit, because I want to show you just how big embeddings can get, and I don’t know why it didn’t occur to me to do this at the beginning. I’m going to compare the size of the Posts table to the size of the post embeddings table. Now, this is Stack Overflow 2010, so this is not even close to the full-size database; it covers roughly the first two or three years of Stack’s existence. The Posts table, with the title column, the tags column, the body column, last editor, a whole bunch of string columns and stuff in there, the whole thing is about 6.4 gigs. The post embeddings table, with far fewer columns in it, just having the embeddings in there, is eight and a half gigs. It’s no joke, size-wise.
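A comparison along those lines is a couple of `sp_spaceused` calls; the table names here are stand-ins for whatever your demo database actually uses:

```sql
/*
    Compare on-disk size of the source table
    and its embeddings table. Names assumed.
*/
EXEC sp_spaceused 'dbo.Posts';
EXEC sp_spaceused 'dbo.PostEmbeddings';
```

The `data` column in the output is the number to watch; the embeddings table has almost no index overhead, so nearly all of its footprint is the vector column itself.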
And one way that you can sort of figure that out: the average row size comes out to about 8,218 bytes. We’re going to take that number and try a rough projection. Now, remember, post embeddings is about eight and a half gigs full-size up here. If we just look at different row counts and how big they might be, that still doesn’t get us close; we’re still about a gig off for all of those different row counts. And part of that is because the post embeddings table has the embeddings column in it, which is the majority of the space, but there are a few other columns in the table.
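The rough projection I’m doing can be sketched like this, assuming that ~8,218 average bytes per row; the row counts are just illustrative:

```sql
/*
    Back-of-the-napkin size projection:
    rows * average row bytes, converted to GB.
    8218 is the observed average row size here;
    yours will vary with model dimensions.
*/
SELECT
    row_count = v.row_count,
    estimated_gb =
        (v.row_count * 8218.) / 1024. / 1024. / 1024.
FROM
(
    VALUES
        (1000000),
        (2000000),
        (3000000)
) AS v (row_count);
```

It lands a bit under the real table size, which is the point being made: the non-vector columns and per-row overhead still add up to about a gig.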
Nothing huge. It’s ID, post ID, a datetime2 column, that sort of thing. So if you look at the table itself, those embeddings take up the majority of it, and every embedding looks like this: 1,024 floaty numbers jammed into a JSON string.
So, you know, not kind to storage. If you’re a storage vendor out there, God bless. I’m in the wrong line of work, probably.
But vector size is going to depend on the number of dimensions in your embeddings, and the number of dimensions in your embeddings is going to depend on the model you choose. There are all sorts of different models out there that generate embeddings with different numbers of dimensions.
The max number of dimensions that you can use for a float32 vector in SQL Server is 1,998. That doubles to 3,996 for float16, but float16 is still a preview feature.
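That 1,998 ceiling shows up right in the column definition. A minimal sketch, assuming the SQL Server 2025 `vector` type (table names made up for the example):

```sql
/*
    1,998 dimensions is the max for a float32 vector:
    1,998 * 4 bytes per float lands just under the
    ~8,000-byte in-row limit.
*/
CREATE TABLE dbo.MaxVector
(
    id integer IDENTITY PRIMARY KEY,
    embedding vector(1998) NOT NULL /* works */
);

/* One more dimension and the CREATE fails:

CREATE TABLE dbo.TooBig
(
    id integer IDENTITY PRIMARY KEY,
    embedding vector(1999) NOT NULL
);
*/
```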
Here we are back at square one. So, if we look at what different models might generate as far as size goes, something like 768 dimensions, you know, theoretically, I don’t even know offhand how many gigs that would be.
But they get bigger as you add more dimensions, up until you hit around 8,000 actual bytes at the 1,998-dimension max for float32. But at least no model that I’ve ever seen uses exactly 1,998.
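The math per row is simple enough to sketch, assuming float32 at 4 bytes per dimension plus a small per-value header (the exact overhead is an assumption and may differ by version):

```sql
/*
    Rough storage math for common model dimension counts:
    bytes per vector, and projected GB for 10 million rows
    of just the vector column.
*/
SELECT
    dimensions = d.dimensions,
    bytes_per_vector = d.dimensions * 4 + 8,
    gb_per_10m_rows =
        (d.dimensions * 4 + 8) * 10000000. / 1024. / 1024. / 1024.
FROM
(
    VALUES (768), (1024), (1536), (1998)
) AS d (dimensions);
```

This is just the vector column; keys, other columns, and row overhead go on top, which is why real tables come out larger than the napkin math.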
If you’re using OpenAI, you’re going to have the 1,536 dimensions that you see in the majority of the Microsoft demos, because, gee, I wonder why they’re all using OpenAI. Anyway, what’s cool about vector indexes is that they’re not like a B-tree index, right?
Because you’re not taking all the data and just putting it in order. Instead, you’re making a graph out of it and connecting lines within that graph. That’s why the vector index create process looks the way it does.
If you go back a few videos to where I showed you the stuff that Microsoft is running when they build a vector index: you’re not just taking the data and putting it in B-tree order. It’s not like seeking into a regular B-tree and saying, oh, this is my entry point, I found this thing, I seek exactly to this primary key row.
It’s not like that at all. It builds a graph, and the graph structure itself is a lot smaller than the vector embeddings are. You can see the size of the edge table is 286 megs, which is much, much smaller than the size of the embedding data that we have in the post embeddings table.
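One way to eyeball that edge table yourself is to look at the internal tables the index creates. This is a sketch, assuming the vector index’s graph structures surface in `sys.internal_tables` the way other index-support objects do:

```sql
/*
    Size up internal tables (including a vector index's
    graph/edge tables, if present) by reserved pages.
*/
SELECT
    t.name,
    reserved_mb = SUM(ps.reserved_page_count) * 8. / 1024.
FROM sys.internal_tables AS t
JOIN sys.dm_db_partition_stats AS ps
  ON ps.object_id = t.object_id
GROUP BY
    t.name
ORDER BY
    reserved_mb DESC;
```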
If you want to control embedding sizes outside of just which model you choose: if you pick a model that’s 1,024 or 1,500 dimensions, whatever, you’re going to have some pretty big vector data.
Columnstore works, though. A nonclustered columnstore is okay, but you still have the base table, which might be a rowstore clustered index or something, and that’s not going to be very well compressed.
So, clustered columnstore: you can create that on the table that has the embeddings in it, and columnstore will compress it pretty well. Row and page compression don’t do anything for vectors; that’s not the type of data those can actually compress. It’s the same reason you can’t row- or page-compress JSON or XML. You can have compressed XML indexes in SQL Server, whatever, but if you just have an nvarchar(max) column or something, that doesn’t get compressed.
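The clustered columnstore move is a one-liner; table and index names here are made up for the example:

```sql
/*
    Compress the embeddings table with a clustered
    columnstore index. If the table already has a rowstore
    clustered index, drop it first (or use DROP_EXISTING
    with a matching index name) before creating this.
*/
CREATE CLUSTERED COLUMNSTORE INDEX cci_PostEmbeddings
ON dbo.PostEmbeddings;

/* Then compare before/after sizes: */
EXEC sp_spaceused 'dbo.PostEmbeddings';
```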
The only catch is you can’t intermix columnstore and vector indexes. If you have a table with a clustered columnstore index on it, or I think even a nonclustered columnstore, too, you can’t create a vector index on it.
And if you have a vector index on a table, you can’t create a columnstore index on it. They just don’t interoperate. That’s not super important today, since vector indexes are still preview-only. The only generally available stuff is float32, VECTOR_DISTANCE, and a few other pieces like AI_GENERATE_CHUNKS and things like that.
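VECTOR_DISTANCE, the generally available piece, works without any vector index at all; it’s a brute-force scan. A toy-sized sketch (real columns would match your model’s dimensions, like vector(1024), and the search vector would come from your embedding model):

```sql
/*
    Brute-force nearest-neighbor search with VECTOR_DISTANCE.
    Toy 3-dimension example; the search vector's dimensions
    must match the column's. No vector index required.
*/
DECLARE @search_vector vector(3) =
    CAST('[0.1, 0.2, 0.3]' AS vector(3));

SELECT TOP (10)
    pe.PostId,
    distance = VECTOR_DISTANCE('cosine', pe.embedding, @search_vector)
FROM dbo.PostEmbeddings AS pe
ORDER BY
    distance;
```

On a table with a clustered columnstore index, a query like this can also pick up batch mode, which is part of why the space-savings route is a reasonable default today.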
So, you may as well, at this point, just get the space savings and maybe some batch mode in your query plans by creating the clustered columnstore index on it.
Anyway, just a little bit of stuff about the capacity planning there, a couple hints on how you can maybe save some database space by using clustered columnstore.
But anyway, that’s it there. Thank you for watching. I hope you enjoyed yourselves. I hope you learned something. And I hope that you will buy this course, Get AI Ready with Erik, available today with a coupon code AI Ready, which takes 100 entire dollars off the cost of the course.
It makes it well worth your time and money if your company is planning on doing anything with vectors in SQL Server. All right.
Well, that’s probably good there. Thank you for watching. Goodbye.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.