Over And Over
Some of the best questions I get some clients, conference attendees, and random email, are about how to design indexes.
A lot of developers out there have a rather foggy picture of exactly how indexes work. They’re all seen phone books, and drawings of B-Tree indexes, but some common things still escape them.
In this post, I’m going to talk about a few things like I’m speaking to someone who has never created a table before.
Even if you’ve created lots of tables and indexes, you may find some interesting points in here, if you pay close attention.
The topics I’m going to go over are:
- Heaps
- Clustered indexes
- Nonclustered indexes
- Index design concepts
Hopefully it helps to clear some things up!
What’s A Heap?
A heap is a table that doesn’t have a clustered index on it. It may have a nonclustered primary key, nonclustered indexes, constraints of all variety, foreign keys, etc.
Heaps are not indexes. Heaps are unordered data structures.
There are good reasons for some tables to be heaps, but most tables shouldn’t be. A good example is a staging table for ETL loads.
Dumping a bunch of data into a heap typically works faster because data doesn’t need to be sorted to conform to index key ordering. More indexes also means more transaction log writes.
I don’t want to get into minimal logging, because that can be a really tricky thing to get in many circumstances. If you want to read more about that, you can follow this link: Minimal Logging with INSERT…SELECT into Heap Tables
Heaps behave different in some ways that are surprising, though when you update or delete (not truncate) rows from them:
- Updates move rows around and leave pointers that need to be followed in their place
- Deletes may remove all rows from data pages but not deallocate them from the table (unless you use a TABLOCK hint, or the delete is escalated to use a table lock naturally)
This isn’t something that happens in tables with clustered indexes on them, and they can have surprising impacts on query performance.
If you update variable length data in your heaps a whole bunch, you can end up with a lot of row movement, which will make later scans of the heap do a lot more reads to follow all those pointers to the location of the new rows.
Same thing with deletes. When you scan through a heap (because there’s no indexed data you can seek to), you may end up reading lots of empty pages to satisfy a complete scan of the data.
I don’t have a good explanation for why heaps are designed that way for updates. Perhaps it’s an optimization for large updates to not automatically split pages, which adds a bit of logging overhead. For deletes, though, I believe it’s so pages are at the ready for new data being loaded in without having to allocate new pages. Since heaps have no order, you can stick data anywhere.
One additional note here is that even with a TABLOCK hint, deletes may not immediately deallocate pages if you’re using Read Committed Snapshot Isolation, or Snapshot Isolation.
In general, there’s not a lot of sense to leaving tables as heaps in SQL Server, at least for OLTP workloads.
What’s A Clustered Index?
A clustered index logically orders the data in your table by the clustered index key column(s). Physical ordering is a separate issue.
I often hear people ask if clustered indexes are a copy of the table, and the answer is… kind of, but only in the way that nonclustered indexes are also copies of the data. Adding nonclustered indexes to a heap would likewise be a copy of data in the heap.
However, a clustered index is not a copy of a heap.
In other words, a single table cannot exist as a clustered and heaped version of itself simultaneously. You are either clustered or heaped.
A good clustered index (often on an identity column) does not necessarily have to provide optimized search access to the data stored in it. Often, the choice is to optimize the insert workload.
To pick a good clustered index, you should usually follow the NUDES acronym:
- Narrow: Numbers, dates, not string-ish
- Unique: Identities or sequences are good for this
- Distinct: If not unique, then as distinct as possible
- Ever: Increasing (append only is the goal, here
- Static: Definitely not a column you ever update
The goal is to append unique, static values to the end of the list, that take up very few bytes so that intermediate (not leaf) pages can be more densely packed.
If you have a non-surrogate (natural) key that accomplishes this, great. If not, surrogate keys (again, identity columns) are a good choice to check these boxes. Clustered indexes do not have to be unique, nor do they have to be primary keys, but they are most often tightly coupled in SQL Server relational designs.
In most OLTP workloads, lacking a natural key to cluster on is fine, because you can create nonclustered indexes to optimize searches, joins, and other relational activities that benefit from ordering (group by, order by, partition by, etc.). If your OLTP queries end up scanning clustered indexes a lot, it’s usually a good sign that you haven’t designed nonclustered indexes well.
This assumes they have a where clause on them, of course.
What’s A Nonclustered Index?
Nonclustered indexes are separate copies of the table data, whether it’s clustered or heaped. On disk, in memory, in the transaction log, with different statistics, column ordering, and data pages assigned to them.
We’ll talk about some design strategies next, but the ideal usage for nonclustered indexes is to optimize data access, so your query’s where clause can locate data quickly. There are many other considerations and uses (for instance, supporting foreign keys, or backing up unique constraints), but if you’re very new to indexing, focus on helping your queries get to the data they’re searching for quickly.
A common misconception is that just because a column is in the key of an index, it will help the query be more efficient, regardless of what position in key column order it is, but that’s absolutely not true.
I like to think of key column order as sort of like following a recipe. Let’s say you’re following one where the start is a big piece of ~whatever~, and the end is cooked and cubed pieces of ~whatever~
The initial steps in the recipe are:
- Defrost ~whatever~
- Cube ~whatever~
- Cook ~whatever~
If you don’t defrost ~whatever~ first, cubing it is going to take a long time (forget skipping to cooking it). If you don’t cube ~whatever~, cooking it is going to take a long time.
This is a bit like the way index key column order works (though not a perfect metaphor). If you don’t access the leading key, accessing any following keys is less efficient.
Each column is a gatekeeper to the next.
Nonclustered Index Design Patterns
Clustered indexes are the foundation and frame of a table. Nonclustered indexes are where we go for specific purposes, to find specific things.
If we didn’t have things organized in smaller chunks, we’d end up running around the whole place looking for toilet paper after we found a toilet in the dining room.
It’s a situation no one wants to encounter. Your queries are no different. The less running around looking for things they have to do, the happy they are.
In most cases, the fastest way to reduce the time they spend looking for things is to create indexes that match the goals of your where clause.
There is a persistent and rather idiotic strain of advice that the most selective column should always come first. That’s likely true if the key of the index is unique, and you’re searching equality predicates for a single row.
It’s just that most queries I see aren’t doing that — queries that do that (something like a primary key lookup) already perform well — the types of queries I deal with are usually dealing with a ton of rows.
There’s a second persistent piece of goofiness that inequality predicates should always be indexed for after equality predicates. That’s certainly not true when a small range (say the last hour of data) matches very few rows, but an equality predicate (say a bit column) would match tons of rows.
One important thing to remember about nonclustered indexes is that they inherit things from both clustered and heaped tables: identifiers.
- If your table is a heap, there’s a row identifier (RID) that ends up in your nonclustered indexes
- If your table is clustered, the clustered index key(s) end up in your nonclustered indexes
- For non-unique nonclustered indexes, they end up as a final key column
- For unique nonclustered indexes, they end up as an included column
Tomorrow, we’ll look at some index design patterns in practice.
Thanks for reading!
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.