SQL Server – Page 8 – Darling Data

Indexing SQL Server Queries For Performance: Fixing An Eager Index Spool

Posted on January 24, 2024January 24, 2024 by Erik Darling

Bugger

Probably the most fascinating thing about Eager Index Spools to me is how often the optimizer will insert them into execution plans, often to a query’s detriment.

In a sane world, a non-loop join plan would be chosen, a missing index request would be registered that matches whatever would have been spooled into an index, and we’d all have an easier time.

While I understand that all of the optimizer stuff around spools in general was written before storage hardware wasn’t crap, and 32bit software couldn’t see more than a few gigs of memory, I do find it odd that so little revision and correction has been applied.

Of course, there are use cases for everything. I was involved in a data warehouse tuning project where rewriting a query to corner the optimizer into using a nested loops join was necessary to build an Eager Index Spool. Maintaining a nonclustered index on the staging table made data loads horrible, but letting SQL Server build one at query runtime was a massive improvement over other options. All that had to be done was to rewrite a simple inner join to remove any direct equality predicates.

While the below queries don’t even come mildly close to reproducing the performance improvement I’m talking about above, it should give you some idea of how it was done.

/*How it started*/
SELECT
    p.Id,
    UpMod =
        SUM(CASE WHEN v.VoteTypeId = 2 THEN 1 ELSE 0 END),
    DownMod =
        SUM(CASE WHEN v.VoteTypeId = 3 THEN 1 ELSE 0 END),
    PostScore = 
        SUM(p.Score)
FROM dbo.Posts AS p
JOIN dbo.Votes AS v
  ON v.PostId = p.Id
WHERE p.Score > 1000
GROUP BY
    p.Id;

/*How it's going*/
SELECT
    p.Id,
    UpMod =
        SUM(CASE WHEN v.VoteTypeId = 2 THEN 1 ELSE 0 END),
    DownMod =
        SUM(CASE WHEN v.VoteTypeId = 3 THEN 1 ELSE 0 END),
    PostScore = 
        SUM(p.Score)
FROM dbo.Posts AS p
JOIN dbo.Votes AS v
  ON  v.PostId >= p.Id
  AND v.PostId <= p.Id
WHERE p.Score > 1000
GROUP BY
    p.Id;

With no equality predicate in the join clause of the second query, only a nested loops join is available. But again, this is the type of thing that you should really have to push the optimizer to do.

Of course, for the above queries, the second plan is a disaster, like most Eager Index Spool plans tend to be. The non-spool query with the hash join finishes in about 600ms, and the Eager Index Spool plan takes a full 1 minute and 37 seconds, with all of the time spent building the spool.

So, like I’ve been saying, one should really have to go out of their way to have this type of plan chosen.

Matter Worse

Compounding the issue is that the optimizer will sometimes choose Eager Index Spool plans when they are entirely unnecessary, and indexes exist to fully support query requirements.

The below join doesn’t actually work, because it’s not how the tables are related, but it’s a good example of that I mean.

SELECT
    u.Id,
    u.DisplayName,
    p.*
INTO #p1
FROM dbo.Users AS u
OUTER APPLY
(
    SELECT
        Score = SUM(p.Score),
        AnswerCount = SUM(p.AnswerCount)
    FROM dbo.Posts AS p
    WHERE p.Id = u.Id
) AS p;

The Id column in both the Users table and Posts table is the clustered primary key. There’s no sensible reason for an index to be created at runtime, here.

Of course, the Posts table relates to the Users table via a column called OwnerUserId, but whatever.

The point is the resulting query plan.

If we tell the optimizer that’s it’s being a dunce, we get a better, much faster, Eager Index Spool-free query plan.

SELECT
    u.Id,
    u.DisplayName,
    p.*
INTO #p2
FROM dbo.Users AS u
OUTER APPLY
(
    SELECT
        Score = SUM(p.Score),
        AnswerCount = SUM(p.AnswerCount)
    FROM dbo.Posts AS p WITH (FORCESEEK) /*I am different*/
    WHERE p.Id = u.Id
) AS p;

Sometimes this is the only way to solve spool problems.

Option One: Adding An Index

In most cases, Eager Index Spools are just really irritating missing index requests.

Here’s an example of one. The query itself touches the Posts table three times. Once to find questions, a second time to find answers related to those questions, and a third time to make sure it’s the highest scoring question for the answer.

SELECT TOP (100)
    QuestionOwner = 
        (
            SELECT 
                u.DisplayName 
            FROM dbo.Users AS u 
            WHERE pq.OwnerUserId = u.Id
        ),
    QuestionScore = 
        pq.Score,
    QuestionTitle = 
        pq.Title,
    AnswerOwner = 
        (
            SELECT 
                u.DisplayName 
            FROM dbo.Users AS u 
            WHERE pa.OwnerUserId = u.Id
        ),
    AnswerScore = 
        pa.Score
FROM dbo.Posts AS pq
JOIN dbo.Posts AS pa
  ON pq.Id = pa.ParentId
WHERE pq.PostTypeId = 1
AND   pa.PostTypeId = 2
AND   pa.Score >
(
    SELECT
        MAX(ps.Score)
    FROM dbo.Posts AS ps
    WHERE ps.ParentId = pa.ParentId
    AND   ps.Id <> pa.Id
)
ORDER BY
    pa.Score DESC,
    pq.Score DESC;

Are there many different ways to write this query? Yes. Would they result in different query plans? Perhaps, perhaps not.

Right now, this query has this index available to it, along with the clustered primary key on Id.

CREATE INDEX
    p
ON dbo.Posts
    (PostTypeId, ParentId, OwnerUserId)
INCLUDE
    (Score)
WITH
    (MAXDOP = 8, SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

Because we don’t have an index that leads with ParentId, or that allows us to easily seek to ParentId in the MAX subquery (more on that later, though), the optimizer decides to build one for us.

We can see what index the spool is building by looking at the tool tip. In general, you can interpret the seek predicate as what should be the key column(s), and what should be included by what’s in the output list.

There is sometimes some overlap here, but that’s okay. Just ignore any output columns that are already in the seek predicate. And of course, we can generally ignore any clustered index key column(s), since the nonclustered index will inherit those anyway.

Adding this index will get rid of the Eager Index Spool:

CREATE INDEX
    p2
ON dbo.Posts
    (ParentId, Score)
WITH
    (MAXDOP = 8, SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

In this case, I’ve chosen to add the Score column to they key of the index to allow for an ordered aggregation (SUM function) to take place without a Sort operator.

Option Two: Over Communicating

Let’s take a step back. We currently have this index, that leads with PostTypeId.

CREATE INDEX
    p
ON dbo.Posts
    (PostTypeId, ParentId, OwnerUserId)
INCLUDE
    (Score)
WITH
    (MAXDOP = 8, SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

The section of the query that generates the Eager Index Spool is this one:

(
    SELECT
        MAX(ps.Score)
    FROM dbo.Posts AS ps
    WHERE ps.ParentId = pa.ParentId
    AND   ps.Id <> pa.Id
)

What we know, that the optimizer doesn’t know, is that only rows with a PostTypeId of 2 are answers. We don’t need to compare answers to any other kind of post, because we don’t care about them here.

If we change the subquery to limit comparing answers to other answers, it would also allow us to take care of the existing index by locating the right type of Post, and give seekable access to the ParentId column.

(
    SELECT
        MAX(ps.Score)
    FROM dbo.Posts AS ps
    WHERE ps.ParentId = pa.ParentId
    AND   ps.PostTypeId = 2
    AND   ps.Id <> pa.Id
)

That changes the full query to this:

SELECT TOP (100)
    QuestionOwner = 
        (
            SELECT 
                u.DisplayName 
            FROM dbo.Users AS u 
            WHERE pq.OwnerUserId = u.Id
        ),
    QuestionScore = 
        pq.Score,
    QuestionTitle = 
        pq.Title,
    AnswerOwner = 
        (
            SELECT 
                u.DisplayName 
            FROM dbo.Users AS u 
            WHERE pa.OwnerUserId = u.Id
        ),
    AnswerScore = 
        pa.Score
FROM dbo.Posts AS pq
JOIN dbo.Posts AS pa
  ON pq.Id = pa.ParentId
WHERE pq.PostTypeId = 1
AND   pa.PostTypeId = 2
AND   pa.Score >
(
    SELECT
        MAX(ps.Score)
    FROM dbo.Posts AS ps
    WHERE ps.ParentId = pa.ParentId
    AND   ps.PostTypeId = 2 /* I am new and different and you should pay attention to me */
    AND   ps.Id <> pa.Id
)
ORDER BY
    pa.Score DESC,
    pq.Score DESC;

Which changes the section of the query plan that we’re concerned with to this:

Sometimes the optimizer just needs a little but more information from you.

Thanks for reading!

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.

A Little More About Isolation Levels In SQL Server

Posted on January 23, 2024January 16, 2024 by Erik Darling

A Little More About Isolation Levels In SQL Server

In this video, I talk about some of the misguided expectations that Read Committed has associated with it, and query patterns to watch out for under Read Committed Snapshot Isolation.

If you’d like to see my full day performance tuning session, here are upcoming dates:

Going Further

Join me at DataTune in Nashville, March 8-9 2024

Posted on January 22, 2024January 16, 2024 by Erik Darling

Spring Training

This March, I’ll be presenting my full day training session The Foundations Of SQL Server Performance Tuning.

All attendees will get free access for life to my SQL Server performance tuning training. That’s about 25 hours of great content.

Get your tickets here for this event, taking place Friday, March 8th-9th 2024 at Belmont University – Massey Center 1900 Belmont Blvd, Nashville, TN 37212

Here’s what I’ll be presenting:

The Foundations Of SQL Server Performance Tuning

Session Abstract:

Whether you want to be the next great query tuning wizard, or you just need to learn how to start solving tough business problems at work, you need a solid understanding of not only what makes things fast, but also what makes them slow.

I work with consulting clients worldwide fixing complex SQL Server performance problems. I want to teach you how to do the same thing using the same troubleshooting tools and techniques I do.

I’m going to crack open my bag of tricks and show you exactly how I find which queries to tune, indexes to add, and changes to make. In this day long session, you’re going to learn about hardware, query rewrites that work, effective index design patterns, and more.

Before you get to the cutting edge, you need to have a good foundation. I’m going to teach you how to find and fix performance problems with confidence.

Event Details:

Get your tickets here for this event!

Going Further

Indexing SQL Server Queries For Performance: Fixing A Non-SARGable Predicate

Posted on January 19, 2024January 16, 2024 by Erik Darling

Nausea

Okay, deep breath. Deep, deep breath.

Would you believe I still have to fix queries with a bunch of functions in joins and where clauses? Like every day? All day?

Where things get tough is when there’s some third party vendor app where code can’t be changed, but customers have become so fed up with performance that they’re willing to make other changes to help things along.

This isn’t a big fireworks demo. I could have spent a lot more time finding a worse scenario, and I’ve hinted it to exacerbate the issue a bit.

Sometimes my hardware is too good for bad demos, probably because it’s not in the cloud.

That whole “cloud” thing has some real problems.

Setup

Anyway, let’s say we have this index (because we do, I just created it).

CREATE INDEX
    p
ON dbo.Posts
    (CommunityOwnedDate, Score)
WITH
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);
GO

Let’s further presume that we have this stored procedure.

This stored procedure could have been a query.

CREATE OR ALTER PROCEDURE
    dbo.FixNonSargable
(
    @CommunityOwnedDate datetime,
    @Score integer
)
AS
BEGIN
    SET NOCOUNT, XACT_ABORT ON;

    SELECT
        p.*
    FROM dbo.Posts AS p
    WHERE COALESCE(p.CommunityOwnedDate, '17530101') >= @CommunityOwnedDate
    AND   p.Score >= @Score
    ORDER BY
        p.Id DESC
    OPTION
    (
        USE HINT('QUERY_OPTIMIZER_COMPATIBILITY_LEVEL_140'), /*No batch mode*/
        MAXDOP 1 /*No parallelism*/
    );
END;
GO

When executed with such great gusto, it’s a bit slow.

EXEC dbo.FixNonSargable
    @CommunityOwnedDate = '20130101',
    @Score = 10;

At ~1.4 seconds, we’re very upset with performance. Throw the whole thing in the trash. Install DuckDB.

Index Reversal

If we change the key column order of our index, so the column with a seekable predicate can go first, we can get faster query execution.

CREATE INDEX
    p
ON dbo.Posts
    (Score, CommunityOwnedDate)
WITH
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE, DROP_EXISTING = ON);
GO

Now you don’t have to wait 6-never months for your vendor to fix their garbage code.

In this case, changing the order of key columns was a more beneficial arrangement for this particular query.

Results may vary. Not query results! I mean like, if you go do this in your database. On “your” hardware. In the “cloud”.

Thanks for reading!

Going Further

Creating Uncacheable Stored Procedures In SQL Server

Posted on January 18, 2024January 16, 2024 by Erik Darling

Creating Uncacheable Stored Procedures In SQL Server

Here’s the demo script from today’s video:

/*Create a table if you need to*/
CREATE TABLE
    dbo.DinnerPlans
(
    id bigint IDENTITY,
    name nvarchar(40) NOT NULL,
    seat_number tinyint NULL,
    is_free bit NOT NULL,
);
GO

/*First example, with an object that doesn't exist*/
CREATE OR ALTER PROCEDURE
    dbo.i_live
(
    @decider bit = NULL
)
AS
BEGIN
    SET NOCOUNT, XACT_ABORT ON;
   
    IF @decider = 'true'
    BEGIN
        SELECT
            dp.*
        FROM dbo.DinnerPlans AS dp;
    END;
   
    IF @decider = 'false'
    BEGIN
        SELECT
            whatever.*
        FROM dbo.AnObjectThatDoesntEvenPretendToExist AS whatever;

        --SELECT result = 'false!';
    END;
   
    IF @decider IS NULL
    BEGIN
        SELECT
            result =
                'please make a decision.'
    END;
END;
GO

/*Say goodbye!*/
DBCC FREEPROCCACHE;

/*This runs without an error*/
EXEC dbo.i_live
    @decider = 'false';

/*But there's no query plan!*/
SELECT
    object_name =
       OBJECT_NAME(deps.object_id, deps.database_id),  
    deps.type_desc,
    deps.last_execution_time,
    deps.execution_count,
    dest.text,
    query_plan =
        TRY_CAST(detqp.query_plan AS xml)
FROM sys.dm_exec_procedure_stats AS deps
OUTER APPLY sys.dm_exec_sql_text(deps.plan_handle) AS dest
OUTER APPLY sys.dm_exec_text_query_plan(deps.plan_handle, 0, -1) AS detqp;
GO

Thanks for watching!

Going Further

sp_QuickieStore: Find Your Worst Performing Queries During Working Hours

Posted on January 17, 2024January 16, 2024 by Erik Darling

sp_QuickieStore: Find Your Worst Performing Queries During Working Hours

Thanks for watching!

Going Further

Catch the latest episode of the @sqldatapartners podcast, guest starring yours truly

Posted on January 16, 2024January 10, 2024 by Erik Darling

Ziggy Played Guitar

I was recently honored to be a guest on the SQL Data Parters podcast. We talked about SQL Server performance tuning scripts, consulting, and responsible levels of meat.

Remember to rate and like and subscribe and all that jazz.

Thanks for listening!

Going Further

Join me at DataTune in Nashville, March 8-9 2024

Posted on January 15, 2024January 4, 2024 by Erik Darling

Spring Training

This March, I’ll be presenting my full day training session The Foundations Of SQL Server Performance Tuning.

All attendees will get free access for life to my SQL Server performance tuning training. That’s about 25 hours of great content.

Get your tickets here for this event, taking place Friday, March 8th-9th 2024 at Belmont University – Massey Center 1900 Belmont Blvd, Nashville, TN 37212

Here’s what I’ll be presenting:

The Foundations Of SQL Server Performance Tuning

Session Abstract:

I work with consulting clients worldwide fixing complex SQL Server performance problems. I want to teach you how to do the same thing using the same troubleshooting tools and techniques I do.

Before you get to the cutting edge, you need to have a good foundation. I’m going to teach you how to find and fix performance problems with confidence.

Event Details:

Get your tickets here for this event!

Going Further

Indexing SQL Server Queries For Performance: Fixing Predicate Key Lookups

Posted on January 12, 2024January 11, 2024 by Erik Darling

Make Things Fast

Our very lofty goal as query tuners is to make the time from request to result shorter. Sometimes that requires a bit of sacrifice on the server side:

Queries going parallel, using more CPU to reduce wall clock time
Queries using more memory to avoid spilling to disk
Queries using space in tempdb via a #temp table to store results
Queries having better indexes available to them for searching

Just about every query tuning technique requires sacrificing some resource on the server (and sometimes more than one resource) to achieve our very lofty goal.

The only exclusion is fixing some miserably-written query to bring it back to sensibility. That requires only a sacrifice on your side.

Indexes are not free rides, of course, though much of the cost attributed to them is overblown. I’m not here to give you license to add 100 indexes to a table, but I am giving you permission to judge indexes by quality (read vs write usage) over quantity.

Anyway, let’s look at some of my favorite index design patterns when I’m tuning queries for clients.

Many of the issues are going to contextually framed by issues with parameter sensitivity in order to exacerbate them. You may see similar bad judgement without parameter sensitivity, of course. Either the optimizer screws up some cardinality estimation, or you do something that screws up cardinality estimation for the optimizer. Common causes for that are local variables, table variables, non-SARGable predicates, overly complicated queries, etc.

I don’t write queries that do stupid things, except to show you when things are stupid.

In this post, I’m going to cover the fixing predicate key lookups! Because why not? We’re already here. Stuck together.

Forever.

Key Lookups In General

Key lookups represent a choice made by the optimizer between:

Clustered index scan
Nonclustered index seek/scan + Clustered index lookup

Lookups can be quite useful to avoid creating quite wide nonclustered index, but the optimizer is heavily biased against them. It does not like random I/O.

Typically lookups get chosen when a relatively small number of rows are expected to be retrieved from the nonclustered index.

Bad estimates in general, and self-inflicted bad estimates from the list above, can contribute to lookups being chosen inappropriately.

Also like I said above, parameter sensitivity is a very easy way to demonstrate the problem.

Before we go on though, let’s talk about the two things lookups can be assigned to do:

Retrieve columns from the clustered index that aren’t part of the nonclustered index definition
Evaluate predicates in the clustered index that aren’t part of the nonclustered index definition

Often, fixing lookups only to avoid retrieving columns is a last resort for me, because it can mean greatly expanding the number of columns included in a nonclustered index.

But fixing lookups that have predicates in them is quite appealing to me, because a good index should support locating and filtering as much data is possible for important queries.

Okay, now we can go on.

Predicate Key Lookups: Good

Let’s start with this index:

CREATE INDEX
    p
ON dbo.Posts
    (Score, CreationDate, PostTypeId)
WITH
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);
GO

And let’s start with this procedure:

CREATE OR ALTER PROCEDURE
    dbo.PredicateKeyLookup
(
    @Score integer,
    @CreationDate datetime,
    @PostTypeId integer,
    @OwnerUserId integer
)
AS
BEGIN
    SET NOCOUNT, XACT_ABORT ON;
   
    SELECT TOP (100)
        p.*
    FROM dbo.Posts AS p
    WHERE p.Score >= @Score
    AND   p.CreationDate >= @CreationDate
    AND   p.PostTypeId = @PostTypeId
    AND   p.OwnerUserId = @OwnerUserId
    ORDER BY
        p.Score DESC,
        p.CreationDate DESC,
        p.Id DESC;
END;
GO

I know, selecting star is a bad practice, but I’m going to share something shocking with you (and I’ll blog about this later, too): Lookups are assigned the same optimizer cost unit, whether you select one column, or 1000 columns.

The thing to note right here, is that the stored procedure filters on all the columns in our index, plus it has an additional filter on the OwnerUserId column. That (along with the select list columns) is the work our lookup will have to do.

Starting with a query that selects a small number of rows:

EXEC dbo.PredicateKeyLookup
    @Score = 0,
    @CreationDate = '20131101',
    @PostTypeId = 1,
    @OwnerUserId = 39677;

The query plan looks like this, and is very fast. Yay. We smart.

Zooming in on the lookup a bit, here’s what it’s doing:

The clustered primary key on the Posts table is Id, which the nonclustered index inherits as a final key column in the nonclustered index (seek predicate), so this is how we match rows across the two
The clustered index has all these columns in it that are not part of the nonclustered index definition, so we can retrieve any rows matched via the seek predicate
The clustered index has OwnerUserId in it as well, which is also not part of the nonclustered index definition, so we can additionally filter any rows matched via the seek predicate

Wordy, I know, for bullet points, but the goal of these posts is to describe things to developers who may have never encountered such things.

Predicate Key Lookups: Bad

If we change our execution parameters to ones that are not selective at all, we’ll use the same execution plan, but speed will get a bit worse.

EXEC dbo.PredicateKeyLookup
    @Score = 0,
    @CreationDate = '20080101',
    @PostTypeId = 2,
    @OwnerUserId = 22656;

To run down what changed:

CreationDate will fetch several more years of posts (2008 vs 2013)
We’re looking for answers (PostTypeId 2) from questions (PostTypeId 1)
The OwnerUserId is now a much more active user on the site

Now, rather than finishing in ~300ms, the query takes ~3.5 seconds.

The majority of the time is spent looping and looking up, which is not what we want at all. Again, indexes should help us find and filter as much data as possible in one place.

You can think of each index as a separate data source. If you need two data sources on one table to filter things, you’re leaving yourself open to performance issues.

Predicate Key Lookups: Fixing

In this case, we can fix things well-enough by changing the index to this:

CREATE INDEX
    p
ON dbo.Posts
    (Score, CreationDate, PostTypeId, OwnerUserId)
WITH
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE, DROP_EXISTING = ON);
GO

Now, either execution, in either execution order, is fast (enough).

Now, for sure, with enough domain knowledge, and enough understanding of the queries, we could do things differently.

If we know most/all queries will filter on OwnerUserId/PostTypeId
We’re allowed to create a completely new index to help this query if it’s important enough

CREATE INDEX
    p
ON dbo.Posts
    (OwnerUserId, PostTypeId, Score, CreationDate)
WITH
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE, DROP_EXISTING = ON);
GO

In this index, OwnerUserId and PostTypeId are the leading columns. The way b-tree indexes work, equality predicates on leading column(s) maintain sorting of any following columns.

Since we’re ordering by Score and CreationDate, this is important (for this query, anyway). Now, both plans look like this.

Predicate Key Lookups: Thoughts

Often, fixing a predicate key lookup by adding the filtering column to the nonclustered index your query is already using is good enough.

Other times, to squeeze every last bit of performance out of things, we may need to make additional adjustments. Changing existing indexes in a way that rearranges key column order can be perilous.

While the query you’re working on may work way better, others may not be so happy with your tinkering. Again, this is where you need additional domain knowledge.

If you find yourself wanting to really change things like that, you may be safer creating a new index instead.

Thanks for reading!

Going Further

Indexing SQL Server Queries For Performance: Common Questions

Posted on January 11, 2024January 5, 2024 by Erik Darling

Over And Over

Some of the best questions I get some clients, conference attendees, and random email, are about how to design indexes.

A lot of developers out there have a rather foggy picture of exactly how indexes work. They’re all seen phone books, and drawings of B-Tree indexes, but some common things still escape them.

In this post, I’m going to talk about a few things like I’m speaking to someone who has never created a table before.

Even if you’ve created lots of tables and indexes, you may find some interesting points in here, if you pay close attention.

The topics I’m going to go over are:

Heaps
Clustered indexes
Nonclustered indexes
Index design concepts

Hopefully it helps to clear some things up!

What’s A Heap?

A heap is a table that doesn’t have a clustered index on it. It may have a nonclustered primary key, nonclustered indexes, constraints of all variety, foreign keys, etc.

Heaps are not indexes. Heaps are unordered data structures.

There are good reasons for some tables to be heaps, but most tables shouldn’t be. A good example is a staging table for ETL loads.

Dumping a bunch of data into a heap typically works faster because data doesn’t need to be sorted to conform to index key ordering. More indexes also means more transaction log writes.

I don’t want to get into minimal logging, because that can be a really tricky thing to get in many circumstances. If you want to read more about that, you can follow this link: Minimal Logging with INSERT…SELECT into Heap Tables

Heaps behave different in some ways that are surprising, though when you update or delete (not truncate) rows from them:

Updates move rows around and leave pointers that need to be followed in their place
Deletes may remove all rows from data pages but not deallocate them from the table (unless you use a TABLOCK hint, or the delete is escalated to use a table lock naturally)

This isn’t something that happens in tables with clustered indexes on them, and they can have surprising impacts on query performance.

If you update variable length data in your heaps a whole bunch, you can end up with a lot of row movement, which will make later scans of the heap do a lot more reads to follow all those pointers to the location of the new rows.

Same thing with deletes. When you scan through a heap (because there’s no indexed data you can seek to), you may end up reading lots of empty pages to satisfy a complete scan of the data.

I don’t have a good explanation for why heaps are designed that way for updates. Perhaps it’s an optimization for large updates to not automatically split pages, which adds a bit of logging overhead. For deletes, though, I believe it’s so pages are at the ready for new data being loaded in without having to allocate new pages. Since heaps have no order, you can stick data anywhere.

One additional note here is that even with a TABLOCK hint, deletes may not immediately deallocate pages if you’re using Read Committed Snapshot Isolation, or Snapshot Isolation.

In general, there’s not a lot of sense to leaving tables as heaps in SQL Server, at least for OLTP workloads.

What’s A Clustered Index?

A clustered index logically orders the data in your table by the clustered index key column(s). Physical ordering is a separate issue.

I often hear people ask if clustered indexes are a copy of the table, and the answer is… kind of, but only in the way that nonclustered indexes are also copies of the data. Adding nonclustered indexes to a heap would likewise be a copy of data in the heap.

However, a clustered index is not a copy of a heap.

In other words, a single table cannot exist as a clustered and heaped version of itself simultaneously. You are either clustered or heaped.

A good clustered index (often on an identity column) does not necessarily have to provide optimized search access to the data stored in it. Often, the choice is to optimize the insert workload.

To pick a good clustered index, you should usually follow the NUDES acronym:

Narrow: Numbers, dates, not string-ish
Unique: Identities or sequences are good for this
Distinct: If not unique, then as distinct as possible
Ever: Increasing (append only is the goal, here
Static: Definitely not a column you ever update

The goal is to append unique, static values to the end of the list, that take up very few bytes so that intermediate (not leaf) pages can be more densely packed.

If you have a non-surrogate (natural) key that accomplishes this, great. If not, surrogate keys (again, identity columns) are a good choice to check these boxes. Clustered indexes do not have to be unique, nor do they have to be primary keys, but they are most often tightly coupled in SQL Server relational designs.

In most OLTP workloads, lacking a natural key to cluster on is fine, because you can create nonclustered indexes to optimize searches, joins, and other relational activities that benefit from ordering (group by, order by, partition by, etc.). If your OLTP queries end up scanning clustered indexes a lot, it’s usually a good sign that you haven’t designed nonclustered indexes well.

This assumes they have a where clause on them, of course.

What’s A Nonclustered Index?

Nonclustered indexes are separate copies of the table data, whether it’s clustered or heaped. On disk, in memory, in the transaction log, with different statistics, column ordering, and data pages assigned to them.

We’ll talk about some design strategies next, but the ideal usage for nonclustered indexes is to optimize data access, so your query’s where clause can locate data quickly. There are many other considerations and uses (for instance, supporting foreign keys, or backing up unique constraints), but if you’re very new to indexing, focus on helping your queries get to the data they’re searching for quickly.

A common misconception is that just because a column is in the key of an index, it will help the query be more efficient, regardless of what position in key column order it is, but that’s absolutely not true.

I like to think of key column order as sort of like following a recipe. Let’s say you’re following one where the start is a big piece of ~whatever~, and the end is cooked and cubed pieces of ~whatever~

The initial steps in the recipe are:

Defrost ~whatever~
Cube ~whatever~
Cook ~whatever~

If you don’t defrost ~whatever~ first, cubing it is going to take a long time (forget skipping to cooking it). If you don’t cube ~whatever~, cooking it is going to take a long time.

This is a bit like the way index key column order works (though not a perfect metaphor). If you don’t access the leading key, accessing any following keys is less efficient.

Each column is a gatekeeper to the next.

Nonclustered Index Design Patterns

Clustered indexes are the foundation and frame of a table. Nonclustered indexes are where we go for specific purposes, to find specific things.

If we didn’t have things organized in smaller chunks, we’d end up running around the whole place looking for toilet paper after we found a toilet in the dining room.

It’s a situation no one wants to encounter. Your queries are no different. The less running around looking for things they have to do, the happy they are.

In most cases, the fastest way to reduce the time they spend looking for things is to create indexes that match the goals of your where clause.

There is a persistent and rather idiotic strain of advice that the most selective column should always come first. That’s likely true if the key of the index is unique, and you’re searching equality predicates for a single row.

It’s just that most queries I see aren’t doing that — queries that do that (something like a primary key lookup) already perform well — the types of queries I deal with are usually dealing with a ton of rows.

There’s a second persistent piece of goofiness that inequality predicates should always be indexed for after equality predicates. That’s certainly not true when a small range (say the last hour of data) matches very few rows, but an equality predicate (say a bit column) would match tons of rows.

One important thing to remember about nonclustered indexes is that they inherit things from both clustered and heaped tables: identifiers.

If your table is a heap, there’s a row identifier (RID) that ends up in your nonclustered indexes
If your table is clustered, the clustered index key(s) end up in your nonclustered indexes
- For non-unique nonclustered indexes, they end up as a final key column
- For unique nonclustered indexes, they end up as an included column

Tomorrow, we’ll look at some index design patterns in practice.

Thanks for reading!