Parallelism Archives | Page 4 of 6

My Favorite SQL Server Cursor Options: Local Static

Posted on February 24, 2020January 20, 2026 by Erik Darling

Cursory

Video Summary

In this video, I dive into the world of cursors and explore why my favorite cursor option is the local static cursor. Despite some contention in the SQL community about which cursor options are best, I share one compelling reason for preferring local static cursors over others like the fast forward cursor. I explain how using a fast forward cursor can force your entire query to run serially, leading to suboptimal performance. By contrast, a local static cursor allows for parallel execution, significantly improving query performance in certain scenarios. The video also covers the importance of testing different cursor options and understanding their impact on execution plans and performance metrics.

Full Transcript

This is still just plain black coffee and I’m having a tough time with that this morning. I’m recording video number two today on this cold, chilly, rainy day. Brrr, day where it sure would be nice to have a heaping helping of some Lafroig or Lagavulin or Ardbeg in a fellas coffee cup. Sure would be nice. Ah, boy. So I want to talk about my favorite cursor options. I know that there is some contention.

I almost knocked my coffee over because I’m apparently too sober to function. So there is some contention about what the best cursor options are and why. And you should always use this and never use this. And I’m going to throw my hat in the ring.

And my favorite cursor options are local static. And there are very smart people out there like Erlend Somerskog who agree with me. And I’m going to show you one of my reasons why I like the local static cursor better than other cursor types. But first, we have to go look at that other cursor type.

So this is a type of cursor that a lot of people will tell you to use a local fast forward cursor. And if you look at the documentation for cursor options, you’ll see that fast forward has some optimizations. Optimizations.

What those optimizations are, we don’t know. I don’t know. I don’t think they’re documented. But one optimization about fast forward cursors that I heartily dislike is that when you run them, much like many other things in SQL Server, like inline scalar, non-valued functions, non-inlined scalar valued functions, modifying table variables, reading from multi-statement table valued function table variables, accessing some system views, blah, blah, blah, blah, blah, will force your entire query to run serially.

Let’s go take a look. So with the fast forward cursor in place, we can see that this query plan up here is entirely serial. And it runs for just about 11 and a half seconds. If I go look at the properties of this open cursor, right?

Something again, I think I say this in a lot of videos when I’m talking about execution plans. One should always be looking at the properties window when they are looking at query plans. There is so much good, rich, detailed information in there about what is going on with your query.

Without it, you would be lost. Sad babe in the woods. We can see immediately that there is a non-parallel plan reason. Now, why none of these things have spaces in them is beyond, is a little bit beyond me.

I’m not quite sure why. You know, I understand that sometimes space bars stop working. You could probably just ask Microsoft’s IT department for a new one.

They probably have some spares sitting around. But anyway, we have a non-parallel plan reason that says no parallel fast forward cursor, which is a bit of a departure from most non-parallel plan reasons.

Most of the time when you see a non-parallel plan reason, it will either say maxed offset to one or could not generate valid parallel plan. And could not generate valid parallel plan is an umbrella for a lot of things that we cannot have query parallelism for.

Now, am I saying that parallel queries should always be used or that they are the best or that you should always strive to have a parallel query? No. No, of course not. And if we’re being honest about the query that’s running, there are probably some indexes that would help this thing run a lot faster, where you wouldn’t notice a difference in the type of cursor that you are using.

But not everyone is in a position where they can, you know, change indexes or, you know, tune queries or tune logic or any of that other stuff. And you might have a big honking crazy query that drives the initial result set of the cursor you’re going to use. And you might not have too many options for tuning that.

This is one thing where changing the cursor option can help you a little bit. So this query takes 11 and a half seconds when it runs serially with the local and fast forward options assigned to the cursor. But over here we have a query or we have a cursor.

We have a… With my favorite cursor options, local static. And when we look at the query plan for local static, we can see that we do not have the same parallelism inhibitor going on that we did in the fast forward plan.

We actually have parallel, parallel, parallel, parallel up until we gather streams here. And this runs for about 2 point… Well, let’s be fair.

About 2.8 seconds. So we got about 12 seconds versus about 3 seconds. Right? Okay. So about a 9 second difference in the query that drives the initial population of the cursor temp table doodad thingy. So, kinda nice that you can…

You should be careful with your cursors and all that. And, you know… I think…

I think cursors are a bit inevitable in some cases. And, you know… Well, you know, some people might watch this video and hiss because I’m talking about ways to tune a cursor rather than just getting rid of the cursor. Well, I mean, like, you know, what if I was going to take the results of this cursor?

What if I was going to take ID, display name, and score? And what if I was going to feed those into a stored procedure and I was going to have to run that stored procedure over a whole bunch of rows? Like, how else am I really going to do that?

If I write a while loop, is it going to make a difference in how I call a stored procedure? If I dump those rows into a temp table, is it really going to be all that different than if I use a cursor to populate a static temporary object? No.

It’s not going to be all that different. So, you know, well, a lot of people will get mad about the cursor in general. And I understand why because there are some very bad uses for cursors. And, you know, I see them and I got to help people with those kind of performance problems pretty often.

But, you know, if I had the choice between spending a long time tuning a cursor query and just changing the cursor option and, you know, saving nine seconds right off the bat, you know, if I’m in a hustle, if I’m trying to like, you know, really get some stuff done, I might just go with the different cursor options to move things along. Maybe get to something else that’s, you know, a performance bump in the night for something I’m trying to tune.

Anyway, the moral of the story is don’t use cursors if you don’t have to. But if you have to or if you’re stuck with them for some other reason, choose a cursor. Look at your cursor options carefully.

Don’t just jump right to the fast forward cursor. It can, you know, do some do some performance testing, you know, run the run the query inside the cursor and see how it goes. And then then run the query inside the cursor and see if see if you get the same execution plan.

See if you get the same time, the same performance metrics, because, you know, if you run the query, then, you know, you don’t use outside of the cursor and you get a nice parallel plan and everything’s fast. And then you put it in the cursor and it takes 20 minutes, you’re gonna be like, oh, this damn cursor. But it’s you. It’s your fault. You chose the wrong cursor options.

Don’t blame the cursor. Blame the you. That’s it. Anyway, I’m gonna go now. I’m going to go invest in Scotland and have a nice day. So, thanks for watching. I hope you I hope you learned something.

I hope that you enjoyed yourself. Maybe. And I will see you in another video. Maybe even another one today.

I don’t know. I don’t know where I don’t know where today may take me in. And if I’m feeling proper Scottish, I might I might feel very record-y today. I don’t know. We’ll see. It’s a surprise for everyone.

Anyway. Thank you. And I will see you in another video where maybe I’ll still be able to stand up. Goodbye.

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

Signs Your SQL Server Execution Plan Is Lying To You: Repartition Streams

Posted on February 7, 2020January 20, 2026 by Erik Darling

Steakenly

Video Summary

In this video, I delve into how parallelism can lead to execution plan issues that are challenging to diagnose, specifically focusing on repartitioned streams operators. I walk through a query that runs for about 42 seconds and analyze the execution plan, highlighting the significant time spent in these operators despite no data spilling. By examining the distribution of rows across threads, I illustrate how skew can cause certain parallelism exchanges to become bottlenecks, even when not all operators show this issue. Additionally, I explain why cached plans often hide crucial details about such issues and discuss the limitations of tools like SP Blitz Cache in detecting these problems post-execution.

Full Transcript

Yip, yip, yip, yip, yip, yip, yip, yip, yip. As my dear friend Hamer says, yip, yip, yip, yip, yip, yip, yip, yip, yip, yip. I’d like to finish out talking about where parallelism can cause plan issues and where it is difficult to figure out why an execution plan might be solved. I’m looking at it by talking about repartitioned streams. Now, we looked in the last video where several exchange operators spilled and things were rather dire. In this one, we’re going to look at where things don’t spill, but perhaps we have some reasons why things might have gone poorly. Now, again, this query runs for about 42 seconds, so that’s a pretty good chunk of change there that I don’t want you to sit through, so I ran the query ahead of time.

And when we look at the plan, let’s see, go over here, we have index seek into the votes table, and then we have this repartitioned streams operator. And the first thing that I want you to notice is that we spend just about, let’s see, again, these are row mode plans, so the operator times are cumulative, reading from right to left. So 19.6 minus 1.6 is, let’s see, 18.6, 18. So we spent just about, well, I mean, like, what, 17 and a half seconds in this repartitioned streams operator?

That’s not a good sign. This kind of happens again, too. If we look up a little bit further in the plan, we’ll have this hash match operator, which is about 24 and a half seconds, and then this repartitioned streams operator, which is about 33 and a half seconds. So another good chunk of time, almost 10 seconds spent in there. Now, in this case, I do believe it’s because there is quite a bit of skew in the parallelism, which I’ll show you in a minute.

But I’ve definitely seen cases where this happens when there isn’t as profound skew. I’ve definitely seen cases where it happens when there is no skew. In this case, this demo just happens to work out really well where there’s skew and there’s slow parallelism exchanges. So if you look at what happened here, all 3.37332131, ooh, that’s 37 million rows ended up on a single thread.

That is not a good time, apparently. Not much got repartitioned here. If we look at this operator, if we look at the index seek, things started off kind of okay. Like, kind of okay, right? A lot of 1.2s. Thread 3 was an outlier. And then when we went to rebalance the streams, we ended up in a bad spot.

And this sort of happens again. I should probably keep this in focus here. This repartitioned streams actually does its job. Look how nice and even that is. And then this hash match is nice and even, too. But then this repartitioned streams is back to Sucksville.

And that’s no good at all. And that’s all going into this thing here, which only ends up with one row. Just a bit absurd. And I’m not exactly sure why this gather streams takes another seven seconds. I’m not that handy with a debugger. But this is where a lot of the clogging happens in the execution plan.

Now, one could make all sorts of reasonable efforts to tune this query plan or to tune this query. One may see things one may want to do. But we’re not going to do that here because we are not tuning queries here. We are just talking about how the cache plans can hide things from you.

Now, this is one thing where SP Blitz Cache can’t help. There is no way to detect parallel skew after the fact. And there is no way to get per operator run times from a cache plan to give you a warning about what went wrong when we ran this.

But what I can show you is that… Excuse you. Excuse you again. What I can show you is that we lose all of that interesting information in the cached plan.

Right? We don’t see how long this ran for. It’s like, oh, it cost 9%. What? No big deal. This cost 24%. We should do something about that. And then, like, you know, we’ll see stuff here.

But none of those… None of the parallelism operators where there was really a big holdup in the execution plan are showing… Are showing why they were…

Or showing that they were the slow points. We do see that there were spills in this. But the spills weren’t the problem. Right? The spills were pretty small. And we just didn’t…

You know, we just didn’t… Spill so much that, like, you know, fixing the spill would be the big fix for this. Right? And this doesn’t even have the big problem that the query plan with the exchange bills… Or rather, the metadata about the query plan with the exchange bills had.

In this case, we have total CPU at about 110 seconds. The duration at about 40 seconds. So we don’t even have, like, that, like, you know, forensic helper of, like, you know, CPU being even or lower than duration in a parallel plan to look at. So that can be rather misleading as well.

So, again, you know, stuff to keep in mind when you’re looking at cached plans is, you know, that a lot of stuff’s going to be missing. Looking at the metadata can help sometimes. Other times it’s a mystery.

You know, there are… Like, a lot of times you will definitely need to see an actual execution plan in order to make any sort of, like, reasonable guess at what part of the query or query plan to focus on. Anyway, that’s it for this one.

The next video is going to start a whole new topic. We’re going to start fresh. Who knows what clothing I’ll take off for the next video. Anyway.

Thanks for watching and see you over there. Goodbye.

Going Further

Signs Your SQL Server Execution Plan Is Lying To You: Exchange Spills

Posted on February 6, 2020May 16, 2022 by Erik Darling

Since They Wanna Know

Going Further

An Unfortunate Performance Side Effect Of The OUTPUT Clause In SQL Server

Posted on January 20, 2020May 14, 2022 by Erik Darling

Put Out

I see people using OUTPUT to audit modifications from time to time, often because “triggers are bad” or “triggers are slow”.

Well, sometimes, sure. But using OUTPUT can be a downer, too.

Let’s look at how.

A Process Emerges

Say we’ve got a table that we’re using to track user high scores for their questions.

CREATE TABLE dbo.HighQuestionScores
(
    Id INT PRIMARY KEY CLUSTERED,
    DisplayName NVARCHAR(40) NOT NULL,
    Score BIGINT NOT NULL
);

To test the process, let’s put a single user in the table:

INSERT dbo.HighQuestionScores WITH (TABLOCK)
        (Id, DisplayName, Score)
SELECT u.Id, u.DisplayName, p.Score
FROM dbo.Users AS u
JOIN 
(
    SELECT p.OwnerUserId, 
	       MAX(p.Score) AS Score
    FROM dbo.Posts AS p
    WHERE p.PostTypeId = 1
    GROUP BY p.OwnerUserId
) AS p ON p.OwnerUserId = u.Id
WHERE u.Id = 22656;

To exacerbate the problem, I’m not going to create any helpful indexes here. This is a good virtual reality simulator, because I’ve seen your indexes.

Yes you. Down in front.

The relevant part of the query plan is the scan of the Posts table:

It’s parallel, and takes 1.8 seconds.

Aw, dit

Now let’s add in an OUTPUT clause.

I’m going to skip over inserting the output into any structure, because I want you to understand that the target doesn’t matter.

INSERT dbo.HighQuestionScores WITH (TABLOCK)
        (Id, DisplayName, Score)
OUTPUT Inserted.Id,
       Inserted.DisplayName,
       Inserted.Score
SELECT u.Id, u.DisplayName, p.Score
FROM dbo.Users AS u
JOIN 
(
    SELECT p.OwnerUserId, MAX(p.Score) AS Score
    FROM dbo.Posts AS p
    WHERE p.PostTypeId = 1
    GROUP BY p.OwnerUserId
) AS p ON p.OwnerUserId = u.Id
WHERE u.Id = 22656;

The relevant part of the plan now looks like this:

We’ve lost parallelism, and inspecting the properties of the Insert operator tells us why:

SQL Server Query Plan Tool Tip — Less Successful

We’ve got a Non Parallel Plan Reason. Why aren’t there any spaces? I don’t know.

Why can’t that go parallel? I also don’t know.

What About Triggers?

If we create a minimal trigger on the table, we can see if it has the same overhead.

CREATE OR ALTER TRIGGER dbo.hqs_insert ON dbo.HighQuestionScores
AFTER INSERT
AS
BEGIN
SET NOCOUNT ON;

    SELECT Inserted.Id,
           Inserted.DisplayName,
           Inserted.Score
    FROM Inserted;

END

Let’s go back to the original insert, without the output! We care about two things:

Is the parallel portion of the insert plan still there?
Is there any limitation on parallelism with the Inserted (and by extension, Deleted) virtual tables?

The answers are mostly positive, too. The insert plan can still use parallelism.

I’m not gonna post the same picture here, you can scroll back fondly.

Though the select from the Inserted table within the trigger doesn’t go parallel, it doesn’t appear to limit parallelism for the entire plan. It does appear that reads from the Inserted table can’t use parallelism (sort of like the table variable in a MSTVF).

Let’s modify the trigger slightly:

CREATE OR ALTER TRIGGER dbo.hqs_insert ON dbo.HighQuestionScores
AFTER INSERT
AS
BEGIN
SET NOCOUNT ON;

    DECLARE @Id INT
	DECLARE @DisplayName NVARCHAR(40)
	DECLARE @Score BIGINT

    SELECT @Id = Inserted.Id,
           @DisplayName = Inserted.DisplayName,
           @Score = Inserted.Score
    FROM Inserted
	JOIN dbo.Comments AS c 
	    ON c.UserId = Inserted.Id;

END

And for variety, let’s insert a lot more data into our table:

TRUNCATE TABLE dbo.HighQuestionScores;

INSERT dbo.HighQuestionScores WITH (TABLOCK)
        (Id, DisplayName, Score)
SELECT u.Id, u.DisplayName, p.Score
FROM dbo.Users AS u
JOIN 
(
    SELECT p.OwnerUserId, MAX(p.Score) AS Score
    FROM dbo.Posts AS p
    WHERE p.PostTypeId = 1
    GROUP BY p.OwnerUserId
) AS p ON p.OwnerUserId = u.Id
WHERE u.Id < 500000;

Here’s the query plan:

The read from Inserted is serial, but the remainder of the plan fully embraces parallelism like a long lost donut.

Togetherness

Given a well-tuned workload, you may not notice any particular overhead from using OUTPUT to audit certain actions.

Of course, if you’re using them alongside large inserts, and those large inserts happen to run for longer than you’d like, it might be time to see how long they take sans the OUTPUT clause. It’s entirely possible that using a trigger instead would cause fewer performance issues.

Thanks for reading!

Going Further

Quickly Detecting CPU and Memory Pressure On A SQL Server With sp_PressureDetector

Posted on January 16, 2020May 14, 2022 by Erik Darling

Options Ahoy

There are lots of great options out there for figuring out what’s happening on a server. If you’re the kind of person still running sp_who, or sp_who2, or sp_who8045, it’s time to stop.

You’ve got much better options, like sp_WhoIsActive, or the First Responder Kit. They’re both great, but sometimes I needed this very specific information very quickly without all the bells and whistles.

I’ve had to do some pretty weird troubleshooting in my life, where a server was totally jammed up, and these help me figure out what’s going on.

Sir-Procs-A-Lot

I call it the pressure detector. It’s four simple DMV queries that will give you different levels of detail about memory and CPU usage currently taking place on a server. Like a lot of these scripts, when a server is Baltimoring, you’re better off running them using the DAC. Otherwise, you’re kinda stuck in the same place all your other queries are.

So what does this help you with? Situations where you’re running out of worker threads, or when you’re running out of memory.

When you hit those, you’ll get smacked with a couple nasty waits:

THREADPOOL: Running out of worker threads
RESOURCE_SEMAPHORE: Running out of query memory

The thing is, sometimes it’s hard to see what’s happening, what’s causing problems, and what’s on the receiving end.

Check out this video to walk you through the procedure results.

Video Killed

Thanks for reading!

Going Further

How Table Variables Prevent SQL Server From Using A Parallel Query Plan

Posted on January 7, 2020May 16, 2022 by Erik Darling

Well, huh

Going Further

In SQL Server, Does Query Parallelism Change Query Memory Grants?

Posted on December 9, 2019May 16, 2022 by Erik Darling

This was originally posted as an answer by me here, I’m re-posting it locally for posterity

Sup?

For SQL Server queries that require additional memory, grants are derived for serial plans. If a parallel plan is explored and chosen, memory will be divided evenly among threads.

Memory grant estimates are based on:

Number of rows (cardinality)
Size of rows (data size)
Number of concurrent memory consuming operators

If a parallel plan is chosen, there is some memory overhead to process parallel exchanges (distribute, redistribute, and gather streams), however their memory needs are still not calculated the same way.

Memory Consuming Operators

The most common operators that ask for memory are

Sorts
Hashes (joins, aggregates)
Optimized Nested Loops

Less common operators that require memory are inserts to column store indexes. These also differ in that memory grants are currently multiplied by DOP for them.

Memory needs for Sorts are typically much higher than for hashes. Sorts will ask for at least estimated size of data for a memory grant, since they need to sort all result columns by the ordering element(s). Hashes need memory to build a hash table, which does not include all selected columns.

Examples

If I run this query, intentionally hinted to DOP 1, it will ask for 166 MB of memory.

SELECT *
FROM 
     (  
        SELECT TOP (1000) 
               u.Id 
        FROM dbo.Users AS u
        ORDER BY u.Reputation
     ) AS u
OPTION(MAXDOP 1);

If I run this query (again, DOP 1), the plan will change, and the memory grant will go up slightly.

SELECT *
FROM (  
        SELECT TOP (1000) 
               u.Id
        FROM dbo.Users AS u
        ORDER BY u.Reputation
     ) AS u
JOIN (
        SELECT TOP (1000) 
               u.Id
        FROM dbo.Users AS u
        ORDER BY u.Reputation
     ) AS u2
ON u.Id = u2.Id
OPTION(MAXDOP 1);

There are two Sorts, and now a Hash Join. The memory grant bumps up a little bit to accommodate the hash build, but it does not double because the Sort operators cannot run concurrently.

If I change the query to force a nested loops join, the grant will double to deal with the concurrent Sorts.

SELECT *
FROM (  
        SELECT TOP (1000) 
               u.Id
        FROM dbo.Users AS u
        ORDER BY u.Reputation
     ) AS u
INNER LOOP JOIN ( --Force the loop join
        SELECT TOP (1000) 
               u.Id
        FROM dbo.Users AS u
        ORDER BY u.Reputation
     ) AS u2
ON u.Id = u2.Id
OPTION(MAXDOP 1);

The memory grant doubles because Nested Loop is not a blocking operator, and Hash Join is.

Size Of Data Matters

This query selects string data of different combinations. Depending on which columns I select, the size of the memory grant will go up.

The way size of data is calculated for variable string data is rows * 50% of the column’s declared length. This is true for VARCHAR and NVARCHAR, though NVARCHAR columns are doubled since they store double-byte characters. This does change in some cases with the new CE, but details aren’t documented.

Size of data also matters for hash operations, but not to the same degree that it does for Sorts.

SELECT *
FROM 
     (  
        SELECT TOP (1000) 
                 u.Id          -- 166MB (INT)
               , u.DisplayName -- 300MB (NVARCHAR 40)
               , u.WebsiteUrl  -- 900MB (NVARCHAR 200)
               , u.Location    -- 1.2GB (NVARCHAR 100)
               , u.AboutMe     -- 9GB   (NVARCHAR MAX)
        FROM dbo.Users AS u
        ORDER BY u.Reputation
     ) AS u
OPTION(MAXDOP 1);

But What About Parallelism?

If I run this query at different DOPs, the memory grant is not multiplied by DOP.

SELECT *
FROM (  
        SELECT TOP (1000) 
               u.Id
        FROM dbo.Users AS u
        ORDER BY u.Reputation
     ) AS u
INNER HASH JOIN (
        SELECT TOP (1000) 
               u.Id
        FROM dbo.Users AS u
        ORDER BY u.Reputation
     ) AS u2
ON u.Id = u2.Id
ORDER BY u.Id, u2.Id -- Add an ORDER BY
OPTION(MAXDOP ?);

There are slight increases to deal with more parallel buffers per exchange operator, and perhaps there are internal reasons that the Sort and Hash builds require extra memory to deal with higher DOP, but it’s clearly not a multiplying factor.

Thanks for reading!

Going Further

How TABLOCK Hints Can Hurt Column Store Insert Performance

Posted on November 20, 2019May 16, 2022 by Erik Darling

Utility

There are a lot of good reasons to use tablock on your inserts, as long as you can handle the reduced concurrency of a table level lock.

You may get minimal logging here, here. And you may even get parallel inserts. You may even avoid weird compression side effects.

All very good reasons to vie for the affections of tablock hints!

You should do that.

You should try.

Because Sometimes

You may run into oddball scenarios where the results without tablock are better.

For example, I was experimenting with clustered column store temp tables for another post, and I came across this oddity.

Assume two identical inserts, but one has a tablock hint and one doesn’t.

DROP TABLE IF EXISTS #t_cci;
GO 

CREATE TABLE #t_cci (UserId INT, VoteTypeId INT, CreationDate DATETIME, INDEX c CLUSTERED COLUMNSTORE);

INSERT #t_cci WITH (TABLOCK) -- A tab of locks
    ( UserId, VoteTypeId, CreationDate )
SELECT v.UserId, v.VoteTypeId, MAX(v.CreationDate)
FROM dbo.Votes AS v
WHERE v.UserId IS NOT NULL
GROUP BY v.UserId, v.VoteTypeId;

DROP TABLE IF EXISTS #t_cci;
GO 

CREATE TABLE #t_cci (UserId INT, VoteTypeId INT, CreationDate DATETIME, INDEX c CLUSTERED COLUMNSTORE);

INSERT #t_cci -- A lack of tabs
    ( UserId, VoteTypeId, CreationDate )
SELECT v.UserId, v.VoteTypeId, MAX(v.CreationDate)
FROM dbo.Votes AS v
WHERE v.UserId IS NOT NULL
GROUP BY v.UserId, v.VoteTypeId;

The plans for those queries, in that order, look like this:

The fully parallel query takes 4.6 seconds, and the serial insert query takes 1.7 seconds.

Even more strange, the insert with tablock leaves four open rowgroups, and the non-tablock query has one compressed rowgroup.

SQL Server Performance Metrics — It was not a dream.

Adding to that, using sp_spaceused on the temp tables, we can see some important differences.

The uncompressed temp table is much larger, has more unused space, and has… index size?!

Yep, because open row groups, or delta stores, are stored as b-trees.

the columnstore index might store some data temporarily into a clustered index called a deltastore

I believe this happens with the parallel insert, because each thread inserts ~75k rows, which is fewer than the minimum 102,400 needed to qualify for skipping the delta store and being compressed. A simple way to test this is to run the tablock query at MAXDOP 2.

That ends up being a little faster than the serial insert, and also gets full compression.

Each thread is ~150k rows and gets compressed. Hooray.

Looking at a PerfView diff of the good and bad samples, this path is only present in the slow run:

When inserting to the delta store, we need to acquire exclusive latches like just about any b-tree insert, and reserve log space to do so.

Morality Hilarity

To summarize the problem:

The TABLOCK hint gets us a fully parallel insert plan
There aren’t enough rows on each parallel thread to get row group compression
We need to insert into a clustered index behind the scenes that makes up the delta store
Which results in threads waiting on exclusive latches and reserving log space

Will you ever hit this? Maybe someday.

This will also happen with real tables (not just temp tables), so I suspect it’s a potentially common scenario for column store users.

You get all worked up getting these parallel inserts and then your compression sucks.

Thanks for reading!

Going Further

Why Spills In SQL Server Parallel Plans Can Hurt Performance

Posted on November 1, 2019May 16, 2022 by Erik Darling

Go Along, Get Along

This is a somewhat complicated topic. There’s not a great TL;DR here, but I don’t want anyone to walk away from this post thinking that parallelism or indexes are “bad”.

What I do want to show is how uneven parallelism can exacerbate existing plan quality issues, and how some indexing can be unhelpful.

Query The First

This is the query we’ll be working with.

SELECT      u.Id, u.DisplayName, u.Reputation, ca.*
FROM        dbo.Users AS u WITH
CROSS APPLY (   SELECT c.*, 
                       ROW_NUMBER() 
					       OVER ( PARTITION BY c.PostId 
						          ORDER BY c.Score DESC ) AS PostScoreRank
                FROM   dbo.Comments AS c
                WHERE  u.Id = c.UserId
                AND    c.Score > 0 ) AS ca
WHERE       u.Reputation >= 100000
AND         ca.PostScoreRank < 1
ORDER BY    u.Reputation DESC
OPTION(RECOMPILE);

I’m using cross apply because the optimizer is likely to pick a Nested Loops join plan. These plans are unlikely to see a Redistribute Streams on the inner side of the join.

Within the apply, I’m making SQL Server do a more significant amount of work than outside of it. This will make more sense later on.

Outside of the apply, I’m doing a little bit of work against a few columns in the Users table, columns that would probably make good candidates for indexing.

The index that I currently have on the Comments table looks like this:

    CREATE INDEX kerplop 
    ON dbo.Comments(UserId, PostId, Score DESC) 
    WHERE Score > 0

Anyway, the query plan for this run looks like this:

The part I want to focus on are the spills.

What you should keep in mind is that while all 4 threads spill, they all spill pretty evenly.

Thread distribution is pretty good across parallel workers. Not perfect, but hey.

If you want perfect, go be disappointed in what you get for $47k per .75 cores of Oracle Enterprise Edition.

Query The Second

Knowing what we know about stuff, we may wanna add this index:

    CREATE UNIQUE INDEX hey_scully
    ON dbo.Users (Id, Reputation DESC) 
    INCLUDE(DisplayName);

But when we do, performance gets much worse.

Zooming back in on the Sorts…

Each spill was about ~2x as bad, because thread distribution got much worse.

Poor thread 4 got stuck with ~534k rows. The problem here is that each thread in a parallel plan gets an even cut of the memory grant. That doesn’t rebalance if parallelism is skewed. Threads may rebalance if a Redistribute Streams operator appears, but we don’t have one of those here. We will sometimes get one on the outer side of nested loops joins, if the optimizer decides it’s needed.

But since we don’t, things get all screwy.

Thread 2, which had only 63k rows assigned to it didn’t use the full amount of memory it got, though it still apparently spilled. Same with thread 3, but to a lesser extent (get it?).

But why did this happen when we added an index?

Paper Boy

Reading the plan from right to left, top to bottom, we start with a scan of the Users table. This is when something called the parallel page supplier kicks in and starts handing out rows as threads ask for them. Its job is to make sure that parallel workers get rows when they ask for them, and that different threads don’t get the same rows. To do that, it uses key ranges from the statistics histogram.

It makes for a rather dull screenshot, but both histograms are identical for the clustered and nonclustered indexes in this demo. It’s not a statistical issue.

Nor are indexes fragmented, so, like, don’t get me started.

According to my Dear Friend, the parallel page supplier aims for 64k chunks. The smaller index just happens to end up with a more unfortunate key range distribution across its fewer pages.

What About A Different Index?

Let’s switch our indexes up and add this one:

    CREATE UNIQUE INDEX spooky_mulder
    ON dbo.Users (Reputation DESC, Id) 
    INCLUDE(DisplayName);

The plan no longer goes parallel, and it runs for about 4 seconds.

We’re doing the same amount of work on the inner side of the nested loops join. The only part of the plan that changed is on the outer side.

This is more of an aside than anything, but in parallel nested loops plans, the optimizer only considers if parallelism will reduce the cost of the outer side of the join.

The plan changing to use a cheaper seek with no need to sort data means the outer side is rather cheap to execute, but the inner side is just as expensive.

The DOP 1 plan is only slightly cheaper, here. You may expect a plan that “costs” this much to go parallel, but alas, it was not meant to be.

Thanks for reading!

Going Further

Why Does My Serial Query Plan Have Parallel Wait Stats Like CXPACKET And CXCONSUMER?

Posted on September 23, 2019May 16, 2022 by Erik Darling

Harkening

In dramatic fashion, I’m revisiting something from this post about stats updates.

It’s a quick post, because uh… Well. Pick a reason.

Get In Gear

Follow along as I repeat all the steps in the linked post to:

Load > 2 billion rows into a table
Create a stats object on every column
Load enough new data to trigger a stats refresh
Query the table to trigger the stats refresh

Except this time, I’m adding a mAxDoP 1 hint to it:

SELECT COUNT(*)
FROM dbo.Vetos
WHERE UserId = 138
AND   PostId = 138
AND   BountyAmount = 138
AND   VoteTypeId = 138
AND   CreationDate = 138
OPTION(MAXDOP 1);

Here’s Where Things Get Interesting

Our MaXdOp 1 query registers nearly the same amount of time on stats updates and parallelism.

But our plan is indeed serial. Because we told it to be.

By setting maxDOP to 1.

Not Alone

So, if you’re out there in the world wondering why this crazy kinda thing goes down, here’s one explanation.

Are there others? Probably.

But you’ll have to find out by setting MAXdop to 1 on your own.

Thanks for reading!