Performance Tuning Batch Mode HTDELETE and HTBUILD Waits In SQL Server
Thanks for watching!
Here’s the demo query, in case you’re interested in following along.
SELECT
u.Id,
u.DisplayName,
u.Reputation,
TopQuestionScore =
MAX(CASE WHEN p.PostTypeId = 1 THEN p.Score ELSE 0 END),
TopAnswerScore =
MAX(CASE WHEN p.PostTypeId = 2 THEN p.Score ELSE 0 END),
TopCommentScore =
MAX(c.Score)
FROM dbo.Users AS u
JOIN dbo.Posts AS p
ON u.Id = p.OwnerUserId
JOIN dbo.Comments AS c
ON u.Id = c.UserId
WHERE EXISTS
(
SELECT
1/0
FROM dbo.Votes AS v
WHERE v.PostId = p.Id
AND v.VoteTypeId IN (1, 2, 3)
)
AND u.Reputation > 10000
AND p.Score > 0
AND c.Score > 0
GROUP BY
u.Id,
u.DisplayName,
u.Reputation
ORDER BY
TopAnswerScore DESC
OPTION
(
RECOMPILE,
USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION'),
USE HINT('QUERY_OPTIMIZER_COMPATIBILITY_LEVEL_140')
);
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.
Whether you want to be the next great query tuning wizard, or you just need to learn how to start solving tough business problems at work, you need a solid understanding of not only what makes things fast, but also what makes them slow.
I work with consulting clients worldwide fixing complex SQL Server performance problems. I want to teach you how to do the same thing using the same troubleshooting tools and techniques I do.
I’m going to crack open my bag of tricks and show you exactly how I find which queries to tune, indexes to add, and changes to make. In this day long session, you’re going to learn about hardware, query rewrites that work, effective index design patterns, and more.
Before you get to the cutting edge, you need to have a good foundation. I’m going to teach you how to find and fix performance problems with confidence.
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.
When you’ve been working with the same technology for long enough, especially if you’ve found your niche, you’re bound to find yourself having to solve the same problems over and over again with it.
Even if you move to a new job, they probably hired you because of your track record with, and skillset in, whatever you were doing before.
One thing I do for clients is review resumes and interview applicants for developer/DBA roles. There are some laughable ones out there, where someone claims to be an expert in every technology they’ve ever touched, and there are some ridiculous ones out there that are obviously copy/paste jobs. I even came across one resume where the job description and qualifications were posted under experience.
As a performance tuning consultant, even I end up fixing a lot of the same issues day to day. There are, of course, weird and cool problems I get to solve, but most folks struggle with the grasping the fundamentals so bad that I have to ~do the needful~ and take care of really basic stuff.
There’s something comforting in that, because I know someone out there will always need my help, and I love seeing a client happy with my work, even if it wasn’t the most challenging work I’ve ever done.
Anyway, to celebrate the times where we’ve gotten to come out of our groundhog holes (I’m sure they have a real name, but I’m not a groundhogologist), leave a comment below with a cool problem you’ve gotten to solve recently.
In the case of a tie, there will be a drinking contest to determine the winner.
Thanks for reading!
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.
bit Obscene Episode 1: What Developers Need To Know About Transactions
In the inaugural episode, Joe Obbish and Erik Darling talk about how developers use and abuse transactions, and some surprises you might run into for them.
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.
Before I go on about how adding an index can fix your blocking and deadlocking problems, let me warn you about how indexes can cause you problems.
Without fail, whenever I suggest adding indexes to clients, someone gets real worried about write speeds.
Like, the server can be burning down, spurting blood everywhere, and cursing our mothers in ancient Aramaic, but no, let’s stop and worry about write speeds going up by a few milliseconds.
You have to add quite a few indexes (20), and be modifying quite a few rows (millions) before the number of indexes really starts to hurt “write speeds”.
I haven’t seen a server whose biggest problem was write speeds (outside of Managed Instance and SQLDB) since spinning disks were measured in RPMs. The real problem I see many servers face from having “too many” indexes is increased locking.
The more indexes you add, the more you have to lock when modifications happen, even when you’re changing the same number of rows as you were before. You also increase your changes for lock escalation attempts.
Having a handsome young professional with reasonable rates (me) review your indexing is a good idea.
But you can end up with locking and deadlocking problems when you err in the opposite direction of “too many” indexes, especially if your modification queries don’t have good supporting indexes to help them find the data they wanna change.
It Started With A Scan
I never spent a summer at camp Scans-Are-Bad, but scans can tell us something important about modification queries.
In a very general sense, if the operators in your modification queries are acquiring data from the tables they need to modify via a scan, they’ll start by locking pages. If they start by seeking to rows in the table they need to modify, they’ll start by locking rows.
UPDATE
p
SET
p.Score += 1000
FROM dbo.Posts AS p
JOIN dbo.Users AS u
ON u.Id = p.OwnerUserId
WHERE u.Reputation >= 800000
AND p.Score < 1000;
And this is the query plan for it:
We would care very much about the Posts table being scanned to acquire data, because the storage engine doesn’t have a very granular way to identify rows it’s going to modify.
We would care so very much, that we might add an index like this:
CREATE INDEX
p
ON dbo.Posts
(OwnerUserId, Score)
WITH
(MAXDOP = 8, SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);
To get an execution plan that looks like this:
Now, this isn’t awesome, because we’re updating the Score column, and we need the Score column in our index to find the rows we care about, but if the query goes from taking 5 seconds to taking a couple hundred milliseconds, we’re going in the right direction.
There is reasonably sound advice to avoid indexing columns that change a lot, but if you need to find rows to modify in columns that change a lot, you may need to follow the also quite reasonably sound advice to make your queries faster so there’s less chance for them to become entangled.
The longer modification queries run for, the longer they have to interfere with other queries.
It Started With A Tran(saction)
I often see transactions (ab)used by developers. Here’s a lazy example:
BEGIN TRAN;
UPDATE
b
SET
b.UserId = 138
FROM dbo.Badges AS b
WHERE b.Date >= '2010-12-25'
AND b.Date < '2010-12-26';
/*Go do some other stuff for a while, I hear ACID is good.*/
COMMIT; /*No error handling or anything, just screw it all.*/
The query plan for this update will look simple, harmless, never did nothin’ to nobody. But while we’re doing all that ~other stuff~ it’s holding onto locks.
This isn’t always the end of the world.
Right now, the only index is the clustered primary key. Some queries will be able to finish immediately, as long as they’re located in they’re in the clustered primary key prior to the rows that are being locked.
For example, this query can finish immediately because the date value for its row is a touch before the pages we’re updating.
SELECT
b.*
FROM dbo.Badges AS b
WHERE b.Id = 1305713;
But this query, and any queries that search for an Id value on a locked page, will be blocked.
SELECT
b.*
FROM dbo.Badges AS b
WHERE b.Id = 1306701
That Id value is just outside the range of dates we’re modifying, but because the storage engine is locking pages and not rows, it has to protect those pages with locks.
With this index, both queries would be able to finish immediately, because the storage engine would know precisely which rows to go after, and a more granular locking arrangement (rows instead of pages) would be available.
CREATE INDEX
woah_mama
ON dbo.Badges
(Date)
WITH
(MAXDOP = 8, SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);
In this example, we’re not updating the Date column, so we don’t have to worry about the sky falling on write speeds, or write speeds crying wolf, or a thousand other cautionary tales about overreaction.
NOLOCK, And Other Drugs
Let’s say we’re idiots unaware of how bad uncommitted reads are, and we decide to use NOLOCK hints to avoid our select queries being blocked.
SELECT
b.*
FROM dbo.Badges AS b WITH(NOLOCK)
WHERE b.Id = 1306700;
While that transaction is open, and we’re still doing all those other highly ACIDic things to our database, this query will return the following results:
We can see the in-flight, uncommitted, change from the update.
Maybe we return this data to an end user, who goes and makes a really important, life-altering decision based on it.
And then maybe something in that transaction fails for some reason, and everything needs to roll back.
That really important, life-altering decision is now totally invalidated, and the end user’s life doom-spirals into sure oblivion.
And it’s all your fault.
Thanks for reading!
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.
Faster, easier, cloudier, smarter, shardier. Blah blah blah. Sounds great! But how do you get new users to try your product and make any reasonable assessment of it?
Not everyone comes to the table with a bunch of data they can jam right in and get partying with.
Worse, they might be on a platform where exporting their data to use on your platform (because, let’s face it, no two products have the same backup format) is a real and complete pain.
The number of hurdles that users face just getting to the point where they can make an assessment of your product are pretty huge.
Let’s talk about how you can make that easier.
Working Data Sets
Stack Overflow provides public access to their data sets, both as XML dumps, and a web interface.
I’m not saying that their relational data makes sense for every system out there, but it’s a big data set that’s good for analysis.
If you own a database product, you could:
Grab the dumps and turn them into a format that makes sense for your platform
Have a hosted playground for new users to run queries against interactively
This helps potential new customers get comfortable with the inevitable proprietary syntax, gauge query efficiency.
Microsoft, for all its silliness, gives SQL Server users a couple different sample databases to work off of. They even update them for new versions to show off all the new memes features they’ve added.
They even have a free developer edition of the product that you can install and run with pretty quickly. You don’t need this if your product is all cloud-based, but you get the idea.
Hands-down, the most annoying part of testing any database platform, is getting reasonable data to test against in there.
Portability
If you are an installer-based life form, and your database as a lot of settings that might matter for performance and reliability, or uses a specific OS, you should consider having a few different VM images available for download.
This lets you easily distribute a golden copy of an ideal environment for your product, with the OS, database, and data all packed together.
Oracle does this, and for the short time I had to experiment with some stuff on their platform, it was incredibly handy.
If you don’t want to go this route, because you don’t quite have Oracle money, being a fledgling database product, have a dedicated install and config page:
Recommended hardware
OS version
Database install steps
Any additional dependencies
Recommended database configurations
Where to get ample sample data to play with
While we’re talking about sample data, why not have a few different sizes of data? Not everyone wants to set up a 64 core, 2TB of RAM virtual machine just to mess with a petabyte set of time series data.
Have some small, medium, large, and extra large sets available for testing.
Sure, prospective clients might opt for small and medium, but the folks you want to evangelize your product are going to love you for having bigger data sets to show more complex problems and solutions.
If part of the sell for your product is how great data ingestion is, have data ready to ingest based on whatever format you excel at, even if it’s Excel files.
More likely it’s csv, parquet, json, or uh… something.
Visibility
A lot of folks are used to having more than a command line to interact with their database.
Postgres has pgAdmin, Oracle has SQL Developer, Microsoft has SQL Server Management Studio and Azure Data Studio, and there are many third party tools that can connect to a variety of platforms, too.
Writing large, complex queries in a CLI is a drag. It might be great for administration, and simple stuff, but correcting syntax errors in them is like threading needles blindfolded.
You may not want to build a whole bunch of tooling up front for developers to work in, but a lightweight browser-based tool with a “run” button can go a long way.
Take db<>fiddle as an example. You can’t do any database management with it, but you can pretty much fully interact with the database by sending queries in, the way a developer would write and test queries.
Nerdery
I love playing with other databases, but I do not love all the foreplay it takes just to get moving with one.
The more things are different about your platform — and those differences may be spectacular — the harder it is to lure away folks who are inexperienced with the stack you’ve chosen.
You might even have an amazing sales tech team who will come in and do a lot of the heavy lifting for prospective clients, but some companies out there want to do a test run before investing a lot of time and getting hourly check-ins from sales reps about how things are going and when the contract will get signed.
That also ignores one of the most powerful segments of any database community: the developers who will build content around your product, and go out in the world to blog, present, and develop training content.
Thanks for reading!
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.
Whether you want to be the next great query tuning wizard, or you just need to learn how to start solving tough business problems at work, you need a solid understanding of not only what makes things fast, but also what makes them slow.
I work with consulting clients worldwide fixing complex SQL Server performance problems. I want to teach you how to do the same thing using the same troubleshooting tools and techniques I do.
I’m going to crack open my bag of tricks and show you exactly how I find which queries to tune, indexes to add, and changes to make. In this day long session, you’re going to learn about hardware, query rewrites that work, effective index design patterns, and more.
Before you get to the cutting edge, you need to have a good foundation. I’m going to teach you how to find and fix performance problems with confidence.
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.
Why Partially Fixing Key Lookups Doesn’t Work In SQL Server
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.
SQL Server’s missing index requests (and, by extension, automatic index management) are about 70/30 when it comes to being useful, and useful is the low number.
The number of times I’ve seen missing indexes implemented to little or no effect, or worse, disastrous effect… is about 70% of all the missing index requests I’ve seen implemented.
If they’re all you have to go on, be prepared to drop or disable them after reviewing server and index usage metrics.
Here’s what you’re way better off doing:
Find your slowest queries
See if there’s a missing index request
Run them, and get the actual execution plan
Look at operator times in the execution plan
Ask yourself if the index would benefit the slowest parts
Or, you can hire me to do all that. I don’t mind. Even the Maytag Man has an alarm clock.
Poor Performer
Let’s start with a query, and just the base tables with no nonclustered indexes added. Each table still has a clustered primary key on its Id column.
Initially, I thought showing the query plan in Row Mode over Batch Mode would make issues more clear, but row mode operator times are a real disaster.
They’re supposed to be cumulative going from right to left, but here we go from 9 to 4 to 10 to 27 to 22 to 41 to 32. Forget that. Batch Mode it is.
Anyway, here’s the query.
SELECT
u.Id,
u.DisplayName,
TopQuestionScore =
MAX(CASE WHEN p.PostTypeId = 1 THEN p.Score ELSE 0 END),
TopAnswerScore =
MAX(CASE WHEN p.PostTypeId = 2 THEN p.Score ELSE 0 END),
TopCommentScore =
MAX(c.Score),
TotalPosts =
COUNT_BIG(DISTINCT p.Id),
TotalComments =
COUNT_BIG(DISTINCT c.Id)
FROM dbo.Users AS u
JOIN dbo.Posts AS p
ON u.Id = p.OwnerUserId
JOIN dbo.Comments AS c
ON u.Id = c.UserId
WHERE EXISTS
(
SELECT
1/0
FROM dbo.Votes AS v
WHERE v.PostId = p.Id
AND v.VoteTypeId IN (1, 2, 3)
)
AND u.Reputation > 10000
AND p.Score > 10
AND c.Score > 0
GROUP BY
u.Id,
u.DisplayName
ORDER BY
TotalPosts DESC;
The goal is to get… Well, pretty much what the column names describe. A good column name goes a long way.
If you had free and unfettered access to these tables, what row store indexes would your druthers lead you to?
I’m limiting your imagination to row store here, because that’s what the missing index requests are limited to.
Underwhelm
The optimizer has decided two indexes, on the same table, would really help us out. There are two very clunky ways to see them both.
You can always see the first one in green text at the top of your query plan, when a missing index request exists.
If you prefer something human readable, this is what they would translate to, with a little bit of hot sauce courtesy of yours truly.
CREATE INDEX
c
ON dbo.Comments
(Score)
INCLUDE
(UserId)
WITH
(MAXDOP = 8, SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);
CREATE INDEX
c2
ON dbo.Comments
(UserId, Score)
WITH
(MAXDOP = 8, SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);
Big Reveal
With all that in mind, let’s look at the query plan before adding the indexes. We’re going to skip ahead a little bit in the bullet points above, to the last two:
Look at operator times in the execution plan
Ask yourself if the index would benefit the slowest parts
Here’s the plan, which takes ~10 seconds in total. The arrow is pointing at where the optimizer thinks a new index will help the most.
Since the operators in this plan are mostly in Batch Mode, every operator is showing CPU time just for itself.
That entire portion of the plan is responsible for about half of the total execution time, but there’s no index recommendation there.
And look, I get it, missing index requests happen prior to query execution, while index matching is happening. The optimizer has no idea what might actually take a long time.
But if we’re looking at the only pre-execution metrics the optimizer has, you’d think the estimated costs alone would push it to ask for an index on the Posts table.
Perhaps missing index requests should be selected after query execution. After all, that’s when the engine knows how long everything actually took.
Generous Soul
Okay, so those two indexes on the Comments table up there? I added both of them.
The query plan changes, but it doesn’t get any faster.
Once again, a missing index request is registered, but only one this time.
On the Votes table.
Not the Posts table.
CREATE INDEX
v
ON dbo.Votes
(VoteTypeId)
INCLUDE
(PostId)
WITH
(MAXDOP = 8, SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);
Okay SQL Server, you got me. I’ll add it.
Impatience
With that index in place, what sort of totally awesome, fast query plan do we get?
Every time we add an index, this query gets one second slower. Part of the problem, of course, is that the optimizer really likes the idea of joining Posts to Votes first.
All of the query plans we’ve looked at have ad a similar pattern, where Vote is on the outer side of a Nested Loops join, and Posts is on the inner side, correlated on the pre-existing clustered primary key on Posts.
But Posts has a much more important join to the Users table. If we were to make that more efficient, we could perhaps change the optimizer’s mind about join ordering.
And there’s no missing index request to tell us that. We have to use our damned eyes.
Maybe something like this.
CREATE INDEX
p
ON dbo.Posts
(Score, OwnerUserId)
INCLUDE
(PostTypeId)
WITH
(MAXDOP = 8, SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);
Let’s give that a shot.
“Self-Tuning Database”
When vendors tell you about their self-tuning database systems, they’re lying to you.
Maybe Oracle isn’t. I don’t know.
But I’m so confident in this new index that I’m going to get rid of all the indexes that SQL Server has suggested so far.
They were bad. They made our query slower, and I don’t want them interfering with my awesome index.
Now the query is twice as fast, at 5 seconds (down from the original 10 seconds). The two operators that take up the majority of the query execution time now are the Hashes; Inner Join and Aggregate.
They don’t spill, but they are likely ill-prepared for the number of rows that they have to deal with. One may infer that from the estimated vs. actual rows that each one sees.
SQL Server 2014 now uses one shared hash table instead of per-thread copy. This provides the benefit of significantly lowering the amount of memory required to persist the hash table but, as you can imagine, the multiple threads depending on that single copy of the hash table must synchronize with each other before, for example, deallocating the hash table. To do so, those threads wait on the HTDELETE (Hash Table DELETE) wait type.
I tried many different indexing schemes and combinations trying to get the terrible underestimate from the Comments table to not cause this, but nothing quite seemed to do it.
In cases where you run into this, you may need to use a temp table to partially pre-aggregate results, and then join to the troublesome table(s) using that data instead.
Thanks for reading!
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.
Probably the most fascinating thing about Eager Index Spools to me is how often the optimizer will insert them into execution plans, often to a query’s detriment.
In a sane world, a non-loop join plan would be chosen, a missing index request would be registered that matches whatever would have been spooled into an index, and we’d all have an easier time.
While I understand that all of the optimizer stuff around spools in general was written before storage hardware wasn’t crap, and 32bit software couldn’t see more than a few gigs of memory, I do find it odd that so little revision and correction has been applied.
Of course, there are use cases for everything. I was involved in a data warehouse tuning project where rewriting a query to corner the optimizer into using a nested loops join was necessary to build an Eager Index Spool. Maintaining a nonclustered index on the staging table made data loads horrible, but letting SQL Server build one at query runtime was a massive improvement over other options. All that had to be done was to rewrite a simple inner join to remove any direct equality predicates.
While the below queries don’t even come mildly close to reproducing the performance improvement I’m talking about above, it should give you some idea of how it was done.
/*How it started*/
SELECT
p.Id,
UpMod =
SUM(CASE WHEN v.VoteTypeId = 2 THEN 1 ELSE 0 END),
DownMod =
SUM(CASE WHEN v.VoteTypeId = 3 THEN 1 ELSE 0 END),
PostScore =
SUM(p.Score)
FROM dbo.Posts AS p
JOIN dbo.Votes AS v
ON v.PostId = p.Id
WHERE p.Score > 1000
GROUP BY
p.Id;
/*How it's going*/
SELECT
p.Id,
UpMod =
SUM(CASE WHEN v.VoteTypeId = 2 THEN 1 ELSE 0 END),
DownMod =
SUM(CASE WHEN v.VoteTypeId = 3 THEN 1 ELSE 0 END),
PostScore =
SUM(p.Score)
FROM dbo.Posts AS p
JOIN dbo.Votes AS v
ON v.PostId >= p.Id
AND v.PostId <= p.Id
WHERE p.Score > 1000
GROUP BY
p.Id;
With no equality predicate in the join clause of the second query, only a nested loops join is available. But again, this is the type of thing that you should really have to push the optimizer to do.
Of course, for the above queries, the second plan is a disaster, like most Eager Index Spool plans tend to be. The non-spool query with the hash join finishes in about 600ms, and the Eager Index Spool plan takes a full 1 minute and 37 seconds, with all of the time spent building the spool.
So, like I’ve been saying, one should really have to go out of their way to have this type of plan chosen.
Matter Worse
Compounding the issue is that the optimizer will sometimes choose Eager Index Spool plans when they are entirely unnecessary, and indexes exist to fully support query requirements.
The below join doesn’t actually work, because it’s not how the tables are related, but it’s a good example of that I mean.
SELECT
u.Id,
u.DisplayName,
p.*
INTO #p1
FROM dbo.Users AS u
OUTER APPLY
(
SELECT
Score = SUM(p.Score),
AnswerCount = SUM(p.AnswerCount)
FROM dbo.Posts AS p
WHERE p.Id = u.Id
) AS p;
The Id column in both the Users table and Posts table is the clustered primary key. There’s no sensible reason for an index to be created at runtime, here.
Of course, the Posts table relates to the Users table via a column called OwnerUserId, but whatever.
The point is the resulting query plan.
If we tell the optimizer that’s it’s being a dunce, we get a better, much faster, Eager Index Spool-free query plan.
SELECT
u.Id,
u.DisplayName,
p.*
INTO #p2
FROM dbo.Users AS u
OUTER APPLY
(
SELECT
Score = SUM(p.Score),
AnswerCount = SUM(p.AnswerCount)
FROM dbo.Posts AS p WITH (FORCESEEK) /*I am different*/
WHERE p.Id = u.Id
) AS p;
Sometimes this is the only way to solve spool problems.
Option One: Adding An Index
In most cases, Eager Index Spools are just really irritating missing index requests.
Here’s an example of one. The query itself touches the Posts table three times. Once to find questions, a second time to find answers related to those questions, and a third time to make sure it’s the highest scoring question for the answer.
SELECT TOP (100)
QuestionOwner =
(
SELECT
u.DisplayName
FROM dbo.Users AS u
WHERE pq.OwnerUserId = u.Id
),
QuestionScore =
pq.Score,
QuestionTitle =
pq.Title,
AnswerOwner =
(
SELECT
u.DisplayName
FROM dbo.Users AS u
WHERE pa.OwnerUserId = u.Id
),
AnswerScore =
pa.Score
FROM dbo.Posts AS pq
JOIN dbo.Posts AS pa
ON pq.Id = pa.ParentId
WHERE pq.PostTypeId = 1
AND pa.PostTypeId = 2
AND pa.Score >
(
SELECT
MAX(ps.Score)
FROM dbo.Posts AS ps
WHERE ps.ParentId = pa.ParentId
AND ps.Id <> pa.Id
)
ORDER BY
pa.Score DESC,
pq.Score DESC;
Are there many different ways to write this query? Yes. Would they result in different query plans? Perhaps, perhaps not.
Right now, this query has this index available to it, along with the clustered primary key on Id.
CREATE INDEX
p
ON dbo.Posts
(PostTypeId, ParentId, OwnerUserId)
INCLUDE
(Score)
WITH
(MAXDOP = 8, SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);
Because we don’t have an index that leads with ParentId, or that allows us to easily seek to ParentId in the MAX subquery (more on that later, though), the optimizer decides to build one for us.
We can see what index the spool is building by looking at the tool tip. In general, you can interpret the seek predicate as what should be the key column(s), and what should be included by what’s in the output list.
There is sometimes some overlap here, but that’s okay. Just ignore any output columns that are already in the seek predicate. And of course, we can generally ignore any clustered index key column(s), since the nonclustered index will inherit those anyway.
Adding this index will get rid of the Eager Index Spool:
CREATE INDEX
p2
ON dbo.Posts
(ParentId, Score)
WITH
(MAXDOP = 8, SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);
In this case, I’ve chosen to add the Score column to they key of the index to allow for an ordered aggregation (SUM function) to take place without a Sort operator.
Option Two: Over Communicating
Let’s take a step back. We currently have this index, that leads with PostTypeId.
CREATE INDEX
p
ON dbo.Posts
(PostTypeId, ParentId, OwnerUserId)
INCLUDE
(Score)
WITH
(MAXDOP = 8, SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);
The section of the query that generates the Eager Index Spool is this one:
(
SELECT
MAX(ps.Score)
FROM dbo.Posts AS ps
WHERE ps.ParentId = pa.ParentId
AND ps.Id <> pa.Id
)
What we know, that the optimizer doesn’t know, is that only rows with a PostTypeId of 2 are answers. We don’t need to compare answers to any other kind of post, because we don’t care about them here.
If we change the subquery to limit comparing answers to other answers, it would also allow us to take care of the existing index by locating the right type of Post, and give seekable access to the ParentId column.
(
SELECT
MAX(ps.Score)
FROM dbo.Posts AS ps
WHERE ps.ParentId = pa.ParentId
AND ps.PostTypeId = 2
AND ps.Id <> pa.Id
)
That changes the full query to this:
SELECT TOP (100)
QuestionOwner =
(
SELECT
u.DisplayName
FROM dbo.Users AS u
WHERE pq.OwnerUserId = u.Id
),
QuestionScore =
pq.Score,
QuestionTitle =
pq.Title,
AnswerOwner =
(
SELECT
u.DisplayName
FROM dbo.Users AS u
WHERE pa.OwnerUserId = u.Id
),
AnswerScore =
pa.Score
FROM dbo.Posts AS pq
JOIN dbo.Posts AS pa
ON pq.Id = pa.ParentId
WHERE pq.PostTypeId = 1
AND pa.PostTypeId = 2
AND pa.Score >
(
SELECT
MAX(ps.Score)
FROM dbo.Posts AS ps
WHERE ps.ParentId = pa.ParentId
AND ps.PostTypeId = 2 /* I am new and different and you should pay attention to me */
AND ps.Id <> pa.Id
)
ORDER BY
pa.Score DESC,
pq.Score DESC;
Which changes the section of the query plan that we’re concerned with to this:
Sometimes the optimizer just needs a little but more information from you.
Thanks for reading!
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.