All The Performance Problems With Select * Queries In SQL Server

This was originally posted by me as an answer here. I’m re-posting it locally for posterity.

The two reasons that I find the most compelling not to use SELECT * in SQL Server are

  1. Memory Grants
  2. Index usage

Memory Grants


When queries need to Sort, Hash, or go Parallel, they ask for memory for those operations. The size of the memory grant is based on the size of the data, both row and column wise.

String data especially has an impact on this, since the optimizer guesses half of the defined length as the ‘fullness’ of the column. So for a VARCHAR 100, it’s 50 bytes * the number of rows.

Using Stack Overflow as an example, if I run these queries against the Users table:

SELECT TOP 1000 
       u.DisplayName 
FROM dbo.Users AS u 
ORDER BY u.Reputation;


SELECT   TOP 1000
         u.DisplayName,
         u.Location
FROM     dbo.Users AS u
ORDER BY u.Reputation;

 

DisplayName is NVARCHAR 40, and Location is NVARCHAR 100.

Without an index on Reputation, SQL Server needs to sort the data on its own.

SQL Server Query Plan
NUTS

But the memory it nearly doubles.

DisplayName:

NUTS

DisplayName, Location:

NUTS

This gets much worse with SELECT *, asking for 8.2 GB of memory:

NUTS

It does this to cope with the larger amount of data it needs to pass through the Sort operator, including the AboutMe column, which has a MAX length.

NUTS

Index Usage


If I have this index on the Users table:

CREATE NONCLUSTERED INDEX ix_Users ON dbo.Users ( CreationDate ASC, Reputation ASC, Id ASC );

 

And I have this query, with a WHERE clause that matches the index, but doesn’t cover/include all the columns the query is selecting…

SELECT   u.*,
         p.Id AS PostId
FROM     dbo.Users AS u
JOIN     dbo.Posts AS p
    ON p.OwnerUserId = u.Id
WHERE    u.CreationDate > '20171001'
AND      u.Reputation > 100
AND      p.PostTypeId = 1
ORDER BY u.Id;

The optimizer may choose not to use the narrow index with a key lookup, in favor of just scanning the clustered index.

SQL Server Query Plan
NUTS

You would either have to create a very wide index, or experiment with rewrites to get the narrow index chosen, even though using the narrow index results in a much faster query.

SQL Server Query Plan
NUTS

CX:

SQL Server Execution Times: CPU time = 6374 ms, elapsed time = 4165 ms.

 

NC:

SQL Server Execution Times: CPU time = 1623 ms, elapsed time = 875 ms.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Why Functions Are Bad in SQL Server Where Clauses

Raised Right


It seems like every time I check out a server, the query plans are a nightmare. Users are freaking out, and management’s coffee is more Irish than Colombian.

Many times, the issue is that people are using presentation layer functions for relational processes. The where clause, joins, group by, and order by parts of a query.

Think about built-in string and date functions, wrapped around columns, and the problems they can cause.

These are things you should actively be targeting in existing code, and fighting to keep out of new code.

Nooptional


When you’re trying to get rid of them, remember your better options

  • Cleaning data on input, or via triggers: Better than wrapping everything in RTRIM/LTRIM
  • Using computed columns: Better than relying on runtime calculations like DATEADD/DATEDIFF
  • Breaking queries up: Use UNION ALL to query for either outcome (think ISNULL)
  • Using indexed views: If you need to calculate things in columns across tables
  • Creating reporting tables: Sometimes it’s easier to denormalize a bit to make writing and indexing easier
  • Using #temp tables: If you have data that you need to persist a calculation in and the query to generate it is complicated

Note the things I’m not suggesting here:

  • CTEs: Don’t materialize anything
  • @table variables: Cause more problems than they solve
  • Views: Don’t materialize unless indexed
  • Functions: Just no, thanks

More Work


Yes, finding and fixing this stuff is more work for you. But it’s a whole lot less work for the optimizer, and your server, when you’re done.

If that’s the kind of thing you need help with, drop me a line.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

SQL Server Index Spools When You Have An Index

PAUL WHITE IS WRONG


Ha ha. Just kidding. He’s never wrong about anything.

But he did write about Eager Index Spools recently, and the post ended with the following statement:

Eager index spools are often a sign that a useful permanent index is missing from the database schema.

I’d like to show you a case where you may see an Eager Index Spool even when you have the index being spooled.

Funboy & Funboy & Funboy


Let’s say we’ve got a query that, for better or worse, was written like so:

SELECT SUM(records)
FROM dbo.Posts AS p 
CROSS APPLY 
(
    SELECT COUNT(p2.Id)
    FROM   dbo.Posts AS p2
    WHERE  p2.LastEditDate >= '20121231'
	AND    p.Id = p2.Id

    UNION ALL

    SELECT COUNT(p2.Id)
    FROM   dbo.Posts AS p2
    WHERE  p2.LastEditDate IS NULL
	AND    p.Id = p2.Id
) x (records);

Right now, we’ve got this index:

CREATE INDEX to_null_or_not_to_null ON dbo.Posts(LastEditDate);

Which means we’ve effectively got an index on (LastEditDate, Id), because of how clustered index key columns are inherited by nonclustered indexes.

The APPLY section of the query plan looks like so:

SQL Server Query Plan
Spooled to death.

Each spool runs for nearly 53 seconds. The entire plan runs for 1:52.

There have been times when I’ve seen index spools created to effectively re-order existing indexes.

Perhaps that’s the case here? Let’s add this index.

CREATE /*UNIQUE*/ INDEX that_is_a_question ON dbo.Posts(Id, LastEditDate);

I’ve got UNIQUE in there in case you’re playing along at home. It makes no difference to the outcome.

I’d expect you to ask about that. I have high expectations of you, dear reader. I love you.

PLEASE DON’T LEAVE ME.

Get Out


The new execution plan looks uh.

SQL Server Query Plan
I’m In It

That’s frustrating, isn’t it? Why would you do that?

When I asked Paul why the optimizer was wrong (I understand that many of you confuse Paul with the optimizer. To wit, they’ve never been seen together.), he said something along the lines of:

The issue is that you have a unique clustered index that prevents the index matching logic from finding the better nonclustered index.

Well okay yeah lemme just go drop that clustered index or something.

Workarounds


There are several workarounds, like using FORCESEEK inside the APPLY logic.

Of course, the better method is just to write the query so there’s no need for the optimizer to join a table to itself a couple times.

SELECT SUM(x.records)
FROM (    
	SELECT COUNT(p.records)
    FROM 
    (
        SELECT 1 AS records
        FROM   dbo.Posts AS p2
        WHERE  p2.LastEditDate >= '20121231'
    ) AS p
    
    UNION ALL
    
    SELECT COUNT(p.records)
    FROM 
    (
        SELECT 1 AS records
        FROM   dbo.Posts AS p2
        WHERE  p2.LastEditDate IS NULL
    ) AS p
) AS x (records);

Which’ll finish in about 1.5 seconds.

But hey, nifty demo.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

The Right Way To Compare Date Columns In Two Different SQL Server Tables

A Certain Ratio


I have a lot to say about this demo on SQL Server 2019, which I’ll get to in a future post.

For now, I’m going to concentrate on ways to make this situation suck a little less for you wherever you are.

Let’s talk.

Pinky Blue


Let’s get a couple indexes going:

CREATE INDEX something ON dbo.Posts(PostTypeId, Id, CreationDate);
CREATE INDEX stuffy ON dbo.Comments(Score, PostId, CreationDate);

And look at a maybe kinda sorta stupid query.

SELECT COUNT(*)
FROM   dbo.Posts AS p
JOIN   dbo.Comments AS c
    ON  p.Id = c.PostId
WHERE DATEDIFF(YEAR, p.CreationDate, c.CreationDate) > 1
AND p.PostTypeId = 1
AND c.Score > 0;

We wanna find questions where a comment was left a year after they were posted, and the comment was upvoted.

BIG DATA!

What We Know


From yesterday’s post, we know that even if we put our date columns first in the index, we wouldn’t have anything to seek to.

Unlike yesterday’s post, these columns are in two different tables. We can’t make a good computed column to calculate that.

The indexes that I created help us focus on the SARGable parts of the where clause and the join conditions.

That query takes about 2 seconds.

SQL Server Query Plan
Hm.

Switch Hitter


You might be tempted to try something like this, but it won’t turn out much better unless you change your indexes.

SELECT DATEDIFF(YEAR, p.CreationDate, c.CreationDate) AS Diffo
INTO #t
FROM   dbo.Posts AS p
JOIN   dbo.Comments AS c
    ON  p.Id = c.PostId
WHERE p.CreationDate < DATEADD(YEAR, -1, c.CreationDate)
AND   c.CreationDate > DATEADD(YEAR,  1, p.CreationDate)
AND p.PostTypeId = 1
AND c.Score > 0

SELECT COUNT(*)
FROM #t AS t
WHERE t.Diffo > 1

DROP TABLE #t;

Moving CreationDate to the second column helps the first query quite a bit.

CREATE INDEX something_alt ON dbo.Posts(PostTypeId, CreationDate, Id);
CREATE INDEX stuffy_alt ON dbo.Comments(Score, CreationDate, PostId);
SQL Server Query Plan
Chock full of Jacques

Opternatives


You could try an indexed view here, too.

CREATE VIEW dbo.Sneaky
WITH SCHEMABINDING
AS
SELECT COUNT_BIG(*) AS records
FROM   dbo.Posts AS p
JOIN   dbo.Comments AS c
    ON  p.Id = c.PostId
WHERE DATEDIFF(YEAR, p.CreationDate, c.CreationDate) > 1
AND p.PostTypeId = 1
AND c.Score > 0
GO 

CREATE UNIQUE CLUSTERED INDEX whatever ON Sneaky(records);
SQL Server Query Plan
Mexican Ham Radio

But, you know. That might be overkill.

Depending.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

The Right Way To Compare Date Columns In SQL Server Queries

Tell It To The Judge


A common dilemma is when you have two date columns, and you need to judge the gap between them for something.

For instance, say you have a table of orders and you want to figure out how long on average it takes an ordered item to ship, or a shipped item to be delivered.

You’re not left with many good ways to write the query to take advantage of indexes.

Let’s have a look-see.

Iron Mask


We’re gonna skip the “aw shucks, this stinks without an index” part.

CREATE INDEX gloom ON dbo.Posts(CreationDate, LastActivityDate);

SELECT COUNT(*) AS records
FROM dbo.Posts AS p
WHERE DATEDIFF(YEAR, p.CreationDate, p.LastActivityDate) > 9;

Because it still stinks with an index. Check it out.

SQL Server Query Plan
Why would you?

Though we have a predicate, and an index on both columns, we don’t have anything to seek to.

Why? Because our predicate isn’t on anything that the index is keeping track of.

Indexes don’t care how many years, months, days, hours, minutes, seconds, milliseconds, or microseconds difference there are between these two columns.

That’d be a really cool kind of index to have for sure, but insert a shrug that fills your screen the way dark matter fills the universe here.

All we got is workarounds.

Another For Instance


We can use a computed column:

ALTER TABLE dbo.Posts 
    ADD despair AS DATEDIFF(YEAR, CreationDate, LastActivityDate);

CREATE INDEX sadness ON dbo.Posts(despair);

The result is something we can seek to.

SQL Server Query Plan
How could you?

Which is probably the type of plan that you’d prefer.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

How To Do Date Math In Your Where Clause To Avoid SQL Server Performance Problems

Pattern Baldness


I see this pattern quite a bit in stored procedures, where the front end accepts an integer which gets passed to a query.

That integer is used to specify some time period — days, months, whatever — and the procedure then has to search some date column for the relevant values.

Here’s a simplified example using plain ol’ queries.

Still Alive


Here are my queries. The recompile hints are there to edify people who are hip to local variable problems.

DECLARE @months_back INT = 1;

SELECT COUNT(*) AS records
FROM dbo.Posts AS p
WHERE DATEADD(MONTH, @months_back * -1, p.CreationDate) <= '20100101' --Usually GETDATE() is here
OPTION(RECOMPILE);
GO 

DECLARE @months_back INT = 1;

SELECT COUNT(*) AS records
FROM dbo.Posts AS p
WHERE p.CreationDate <= DATEADD(MONTH, @months_back, '20100101') --Usually GETDATE() is here
OPTION(RECOMPILE);
GO

The problem with the first query is that the function is applied to the column, rather than to the variable.

If we look at the plans for these, the optimizer only thinks one of them is special enough for an index request.

SQL Server Query Plan
please respond

Sure, there’s also a ~200ms difference between the two, which is pretty repeatable.

But that’s not the point — where things change quite a bit more is when we have a useful index. Those two queries above are just using the clustered index, which is on a column unrelated to our where clause.

CREATE INDEX etadnoitaerc ON dbo.Posts(CreationDate);
SQL Server Query Plan
Oops.

Side by side:

  • The bad query uses >10x more CPU
  • Still runs for >3x as long
  • Scans the entire index
  • Reads 10x more rows
  • Has to go parallel to remain competitive

At MAXDOP 1, it runs for just about 2.2 seconds on a single thread. Bad news.

Qual


This is one example of misplacing logic, and why it can be bad for performance.

All these demos were run on SQL Server 2019, which unfortunately didn’t save us any trouble.

In the next post, I’m going to look at another way I see people make their own doom.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

When SELECT * Doesn’t Matter In SQL Server Queries

Pavlovian


There’s a strange response to some things in the SQL Server community that borders on religious fervor. I once worked with someone who chastised me for having SELECT * in some places in the Blitz scripts. It was odd and awkward.

Odd because this person was the most Senior DBA in the company, and awkward because they didn’t believe me that it didn’t matter in some cases.

People care about SELECT * for many valid reasons, but context is everything.

One For The Money


The first place it doesn’t matter is EXISTS. Take this index and this query:

CREATE INDEX specatular_blob ON dbo.Posts(PostTypeId, OwnerUserId);

SELECT COUNT(*) AS records
FROM   dbo.Users AS u
WHERE  EXISTS ( SELECT * 
                FROM dbo.Posts AS p 
				WHERE p.OwnerUserId = u.Id 
				AND p.PostTypeId = 2 );

The relevant part of the query plan looks like this:

SQL Server Query Plan
What’s My Name

We do a seek into the index we created on the two columns in our WHERE clause. We didn’t have to go back to the clustered index for everything else in the table.

That’s easy enough to prove if we only run the subquery — we have to change it a little bit, but the plan tells us what we need.

SELECT *
FROM   dbo.Posts AS p
WHERE  p.OwnerUserId = 22656
AND    p.PostTypeId = 2;

This time we do need the clustered index:

SQL Server Query Plan
Who We Be

You can even change it to something that would normally throw an error:

SELECT COUNT(*) AS records
FROM   dbo.Users AS u
WHERE  EXISTS ( SELECT 1/0 
                FROM dbo.Posts AS p 
				WHERE p.OwnerUserId = u.Id 
				AND p.PostTypeId = 2 );

Two For Completeness


Another example is in derived tables, joins, and apply.

Take these two queries. The first one only selects columns in our nonclustered index (same as above).

The second one actually does a SELECT *.

/*selective*/
SELECT     u.Id,
           u.DisplayName,
		   ca.OwnerUserId, --I am only selecting columns in our index
		   ca.PostTypeId,
		   ca.Id
FROM       dbo.Users AS u
CROSS APPLY( SELECT TOP (1) * --I am select *
             FROM dbo.Posts AS p 
			 WHERE p.OwnerUserId = u.Id 
			 AND p.PostTypeId = 2 
			 ORDER BY p.OwnerUserId DESC, p.Id DESC) AS ca
WHERE U.Reputation >= 100000;

/*less so*/
SELECT     u.Id,
           u.DisplayName,
		   ca.* --I am select *
FROM       dbo.Users AS u
CROSS APPLY( SELECT TOP (1) * --I am select *
             FROM dbo.Posts AS p 
			 WHERE p.OwnerUserId = u.Id 
			 AND p.PostTypeId = 2 
			 ORDER BY p.OwnerUserId DESC, p.Id DESC) AS ca
WHERE U.Reputation >= 100000;

The first query only touches our narrow nonclustered index:

SQL Server Query Plan
Blackout

The second query does a key lookup, because we really do select everything.

SQL Server Query Plan
Party Up

Trash Pile


I know, you’ve been well-conditioned to freak out about certain things. I’m here to help.

Not every SELECT * needs to be served a stake through the heart and beheading.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Why Not Exists Makes More Sense Than Left Joins For Performance

Checklist


This is always fun to talk to people about, because of all the misconceptions around the concept.

You need to find rows in one table that don’t have a match in another table. Maybe it’s a reconciliation process, maybe it’s part of ETL or something.

Doesn’t matter. Pay attention!

Choices


The way most people will write this query on the first try is like this:

    SELECT   COUNT_BIG(u.Id) AS records
    FROM     dbo.Users AS u
    LEFT JOIN     dbo.Posts AS p
        ON u.Id = p.OwnerUserId
    WHERE p.Id IS NULL;

The query plan has one of my (generally) least favorite things in it: A filter.

SQL Server Query Plan
2.8 seconds!

What’s the filter doing?

SQL Server Query Plan
EZ-NULLZ

Looking for NULL values after the join. Yuck.

Better Choices


Expressed more SQL-y, we could use NOT EXISTS.

    SELECT   COUNT_BIG(u.Id) AS records
    FROM     dbo.Users AS u
    WHERE NOT EXISTS ( SELECT 1/0 
                       FROM dbo.Posts AS p 
    				   WHERE p.OwnerUserId = u.Id );

I mean, look, the Id column in the Posts table is the PK/CX. That means it can’t be NULL, unless it’s a non-matched row in a left join.

If that column is NULL, then every other column will be NULL too. You don’t ever need to select any data from the Posts table.

The query plan looks like this now:

SQL Server Query Plan
1.6 seconds

This performs better (under most circumstances), and gets some additional optimizations: A Bitmap, and a pre-aggregation of the OwnerUserId column in the Posts table.

Other Than Speed?


The Not Exists query will ask for around ~200MB less memory to run.

SQL Server Query Plan
Every Penny

Why is this? Why is there such a difference between logically equivalent queries?

The Left Join version forces both tables to be fully joined together, which produces matches and non-matches.

After the join, we eliminate non-matches in the Filter. This is why I’m generally suspicious of Filter operators. They often mean we’ve some expression or complication in our logic that prevents the optimizer from eliminating rows earlier. This is to be expected when we do something like generate and filter on a row number — the row number doesn’t exist until the query runs, and has to be filtered later than existing data.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

When Does NOT IN And NULLable Columns Hurt SQL Server Query Performance?

What This Isn’t


I’m not writing about how if you use NOT IN, and suffer from NULL values in a column you’re comparing, you’ll get an empty result.

That’s easy enough to show:

DECLARE @not_null TABLE ( id INT NOT NULL );
INSERT @not_null ( id )
VALUES ( 1 );

DECLARE @null TABLE ( id INT NULL );
INSERT @null ( id )
VALUES ( 1 );
INSERT @null ( id )
VALUES ( NULL );

SELECT *
FROM   @not_null AS nn
WHERE  nn.id NOT IN ( SELECT n.id FROM @null AS n );

What’s sort of more interesting to me is what happens in execution plans when you’re comparing NULL-able columns that don’t contain any NULLs.

Exampled!


Here’s the first query I want to show you. In the Posts table, the OwnerUserId column is NULLable, but doesn’t have any NULLs in it. The Id column in the Users table is not NULLable — it’s the PK/CX on the table.

SELECT COUNT_BIG(*) AS records
FROM   dbo.Users AS u
WHERE  u.Id NOT IN ( SELECT p.OwnerUserId 
                     FROM dbo.Posts AS p 
					 WHERE p.Score < 0 );

Note that for all these queries, I’ve created these indexes:

CREATE INDEX beavis ON dbo.Users(AccountId);
CREATE INDEX butthead ON dbo.Posts(OwnerUserId, Score);
CREATE INDEX stewart ON dbo.Posts(Score, OwnerUserId);

The important part of the query looks like this:

SQL Server Query Plan
Good thing I have great indexes.

Zooming in to a sorta weird part of the query:

SQL Server Query Plan
I’m in it

The optimizer spins up a Row Count Spool to make sure it’s right about the lack of NULLs. You can see the seek predicate doing so, and finding 0 rows, but taking ~200ms to do it. This time was much worse without those great indexes.

Explicitly Not NULL


If we change our query slightly, we can get a less exotic query plan:

SELECT COUNT_BIG(*) AS records
FROM   dbo.Users AS u
WHERE  u.Id NOT IN (   SELECT p.OwnerUserId
                       FROM   dbo.Posts AS p
                       WHERE  p.Score < 0
                       AND    p.OwnerUserId IS NOT NULL );
SQL Server Query Plan
Not Null’em

Which is a little bit faster overall, at ~350ms vs 550ms.

This is the equivalent of writing your query the way you should have written it in the first place, using NOT EXISTS.

SELECT COUNT_BIG(*) AS records
FROM   dbo.Users AS u
WHERE  NOT EXISTS (   SELECT 1 / 0
                      FROM   dbo.Posts AS p
                      WHERE  p.OwnerUserId = u.Id
                      AND    p.Score < 0 )

Recap


The optimizer is a weird place. It’s the ultimate defensive driver. Dealing with NULLs can be quite odd and difficult.

I generally save IN and NOT IN for when I have short lists of literal values. Likely, EXISTS/NOT EXISTS when values are contained in other tables, or dumping values into a #temp table when there’s a lot of values can be more efficient, if only to make better indexing available.

While IN doesn’t have the same side effects as NOT IN with NULLs, I think using EXISTS instead is just a good habit.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Why Spills In SQL Server Parallel Plans Can Hurt Performance

Go Along, Get Along


This is a somewhat complicated topic. There’s not a great TL;DR here, but I don’t want anyone to walk away from this post thinking that parallelism or indexes are “bad”.

What I do want to show is how uneven parallelism can exacerbate existing plan quality issues, and how some indexing can be unhelpful.

Query The First


This is the query we’ll be working with.

SELECT      u.Id, u.DisplayName, u.Reputation, ca.*
FROM        dbo.Users AS u WITH
CROSS APPLY (   SELECT c.*, 
                       ROW_NUMBER() 
					       OVER ( PARTITION BY c.PostId 
						          ORDER BY c.Score DESC ) AS PostScoreRank
                FROM   dbo.Comments AS c
                WHERE  u.Id = c.UserId
                AND    c.Score > 0 ) AS ca
WHERE       u.Reputation >= 100000
AND         ca.PostScoreRank < 1
ORDER BY    u.Reputation DESC
OPTION(RECOMPILE);

I’m using cross apply because the optimizer is likely to pick a Nested Loops join plan. These plans are unlikely to see a Redistribute Streams on the inner side of the join.

Within the apply, I’m making SQL Server do a more significant amount of work than outside of it. This will make more sense later on.

Outside of the apply, I’m doing a little bit of work against a few columns in the Users table, columns that would probably make good candidates for indexing.

The index that I currently have on the Comments table looks like this:

    CREATE INDEX kerplop 
    ON dbo.Comments(UserId, PostId, Score DESC) 
    WHERE Score > 0

Anyway, the query plan for this run looks like this:

SQL Server Query Plan
Stevenage overspill

The part I want to focus on are the spills.

SQL Server Query Plan
Goes on…

What you should keep in mind is that while all 4 threads spill, they all spill pretty evenly.

Thread distribution is pretty good across parallel workers. Not perfect, but hey.

All together now

If you want perfect, go be disappointed in what you get for $47k per .75 cores of Oracle Enterprise Edition.

Query The Second


Knowing what we know about stuff, we may wanna add this index:

    CREATE UNIQUE INDEX hey_scully
    ON dbo.Users (Id, Reputation DESC) 
    INCLUDE(DisplayName);

But when we do, performance gets much worse.

SQL Server Query Plan
If only.

Zooming back in on the Sorts…

SQL Server Query Plan
Happening in mine.

Each spill was about ~2x as bad, because thread distribution got much worse.

SQL Server Query Plan
Fall down

Poor thread 4 got stuck with ~534k rows. The problem here is that each thread in a parallel plan gets an even cut of the memory grant. That doesn’t rebalance if parallelism is skewed. Threads may rebalance if a Redistribute Streams operator appears, but we don’t have one of those here. We will sometimes get one on the outer side of nested loops joins, if the optimizer decides it’s needed.

But since we don’t, things get all screwy.

SQL Server Query Plan
Underage

Thread 2, which had only 63k rows assigned to it didn’t use the full amount of memory it got, though it still apparently spilled. Same with thread 3, but to a lesser extent (get it?).

But why did this happen when we added an index?

Paper Boy


Reading the plan from right to left, top to bottom, we start with a scan of the Users table. This is when something called the parallel page supplier kicks in and starts handing out rows as threads ask for them. Its job is to make sure that parallel workers get rows when they ask for them, and that different threads don’t get the same rows. To do that, it uses key ranges from the statistics histogram.

It makes for a rather dull screenshot, but both histograms are identical for the clustered and nonclustered indexes in this demo. It’s not a statistical issue.

Nor are indexes fragmented, so, like, don’t get me started.

According to my Dear Friend, the parallel page supplier aims for 64k chunks. The smaller index just happens to end up with a more unfortunate key range distribution across its fewer pages.

Feuer

What About A Different Index?


Let’s switch our indexes up and add this one:

    CREATE UNIQUE INDEX spooky_mulder
    ON dbo.Users (Reputation DESC, Id) 
    INCLUDE(DisplayName);

The plan no longer goes parallel, and it runs for about 4 seconds.

SQL Server Query Plan
First Resort, Last Resort

We’re doing the same amount of work on the inner side of the nested loops join. The only part of the plan that changed is on the outer side.

SQL Server Query Plan
Monkey Bread

This is more of an aside than anything, but in parallel nested loops plans, the optimizer only considers if parallelism will reduce the cost of the outer side of the join.

The plan changing to use a cheaper seek with no need to sort data means the outer side is rather cheap to execute, but the inner side is just as expensive.

SQL Server Query Plan
Not to brag but

The DOP 1 plan is only slightly cheaper, here. You may expect a plan that “costs” this much to go parallel, but alas, it was not meant to be.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.