Execution Plans Archives | Page 18 of 31

Query Tuning SQL Server 2019 Part 3: Who Died And Made You The Optimizer?

Posted on June 17, 2020May 14, 2022 by Erik Darling

Be Yourself

We’ve got a problem, Sam Houston. We’ve got a problem with a query that has some strange issues.

It’s not parameter sniffing, but it sure could feel like it.

When the procedure compiles and runs with VoteTypeId 5, it runs for 12 minutes
Other VoteTypeIds run well with the same plan that VoteTypeId 5 gets
When VoteTypeId 5 runs with a “small” plan, it does okay at 10 seconds

Allow me to ruin a graph to illustrate. The Y axis is still seconds, but… it goes up a little higher now.

SQL Server Query Execution Times — weigh-in

The Frustration (A Minor Digression)

Here’s where life can be tough when it comes to troubleshooting actual parameter sniffing.

If you’re relying solely on the plan cache, you’re screwed. You’ll see the plan, and the compile value, but you won’t have the runtime value anywhere that “caused” the problem. In other words, the set of parameters that were adversely affected by the query plan that didn’t fit.

There are some things that can help, like if you’re watching it happen live, or if you have a monitoring tool that might capture runtime parameters.

OR IF YOU USE SP UNDERSCORE HUMANEVENTS.

Like I said, this isn’t parameter sniffing, but it feels like it.

It could extra-feel like it because you might see a misbehaving query, and a compile-time parameter that runs quickly on its own when you test it, e.g. VoteTypeId 6.

It would be really hard to tell that even if a plan were to compile specifically for a different parameter, it would still run for 12 minutes.

Heck, that’d even catch me off-guard.

But that’s what we have here: VoteTypeId 5 gets a bad plan special for VoteTypeId 5.

Examiner

Let’s dig in on what’s happening to cause us such remarkable grief. There has to be a reason.

I don’t need more grief without reason; I’ve already got a public school education.

If we were to summarize the problem here: that Hash Match Left Anti Semi Join runs for 12 minutes on its own.

No other operator, or group of operators, is responsible for a significant amount of time comparatively.

Magnifier

Some things to note:

The bad estimates aren’t from predicates, they’re from Batch Mode Bitmaps
Those bad estimates end up producing a much larger number of rows from the Adaptive Join
The Hash Match ends up needing to probe 932 million rows

Taking 12 minutes to probe 932 million rows is probably to be expected, now that I think about it.

If the optimizer had a good estimate from the Bitmaps here, it may have done the opposite of what a certain Pacific Island Dwelling Bird said:

Getting every nuance of this sort of relational transformation correct can be tricky. It is very handy that the optimizer team put the effort in so we do not have to explore these tricky rewrites manually (e.g. by changing the query text). If nothing else, it would be extremely tedious to write all the different query forms out by hand just to see which one performed better in practice. Never mind choosing a different version depending on current statistics and the number of changes to the table.

In this case, the Aggregate happens after the join. If the estimate were correct, or even in the right spacetime dimension, this would be fine.

We can gauge the general efficiency of it by looking at when this plan is used for other parameters that produce numbers of rows that are closer to this estimate.

If the optimizer had made a good guess for this parameter, it may have changed the plan to put an aggregate before the join.

Unfortunately we have very little control over estimates for Bitmaps, and the guesses for Batch Mode Bitmaps are a Shrug of Atlassian proportions.

Finisher

We’ve learned some things:

This isn’t parameter sniffing
Batch Mode Bitmaps wear pants on their head
Batch Mode Bitmaps set their head-pants on fire
Most of the time Batch Mode performance covers this up
The plan for VoteTypeId 5 is not a good plan for VoteTypeId 5
The plan for VoteTypeId 5 is great for a lot of other VoteTypeIds

In tomorrow’s post, we’ll look at how we can fix the problem.

Thanks for reading!

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Query Tuning SQL Server 2019 Part 2: Big Databases, Big Ideas

Posted on June 16, 2020May 14, 2022 by Erik Darling

Are We Still Friends?

When I first wrote this demo, I called it dbo.ParameterSniffingMonstrosity.

Because , you know, it’s really terrible.

CREATE OR ALTER PROCEDURE dbo.VoteSniffing( @VoteTypeId INT )
AS
SET XACT_ABORT, NOCOUNT ON;
    BEGIN
        SELECT   ISNULL(v.UserId, 0) AS UserId,
                 SUM(CASE WHEN v.CreationDate >= '20190101'
                          AND  v.CreationDate < '20200101'
                          THEN 1
                          ELSE 0
                     END) AS Votes2019,
                 SUM(CASE WHEN v.BountyAmount IS NULL
                          THEN 0
                          ELSE 1
                     END) AS TotalBounty,
                 COUNT(DISTINCT v.PostId) AS PostCount,
                 @VoteTypeId AS VoteTypeId
        FROM     dbo.Votes AS v
        WHERE    v.VoteTypeId = @VoteTypeId
        AND      NOT EXISTS
                (   
                    SELECT 1/0
                    FROM dbo.Posts AS p
                    JOIN dbo.Badges AS b 
                        ON b.UserId = p.OwnerUserId 
                    WHERE  p.OwnerUserId = v.UserId
                    AND    p.PostTypeId = 1 
                )
        GROUP BY v.UserId;
    END;
GO

The only parameter is for VoteTypeId, which has some pretty significant skew towards some types, especially in the full size Stack Overflow database.

SQL Server Query Results — Ask me about my commas

It’s like, when people tell you to index the most selective column first, well.

Sometimes it’s pretty selective.
Sometimes it’s not very selective

But this is exactly the type of data that causes parameter sniffing issues.

With almost any data set like this, you can draw a line or three, and values within each block can share a common plan pretty safely.

But crossing those lines, you run into issues where either little plans do far too much looping and seeking and sorting for “big” values, and big plans do far too much hashing and scanning and aggregating for “little” values.

This isn’t always the exact case, but generally speaking you’ll observe something along these lines.

It’s definitely not the case for what we’re going to be looking at this week.

This week is far more interesting.

That’s why it’s a monstrosity.

Fertilizer

The indexes that I create to support this procedure look like so — I’ve started using compression since at this point in time, 2016 SP1 is commonplace enough that even people on Standard Edition can use them — and they work quite well for the majority of values and query plans.

CREATE INDEX igno
ON dbo.Posts 
    (OwnerUserId, PostTypeId)
    WHERE PostTypeId = 1 
WITH(MAXDOP = 8, SORT_IN_TEMPDB = ON, DATA_COMPRESSION = ROW);
GO

CREATE INDEX rant
ON dbo.Votes 
    (VoteTypeId, UserId, PostId)
INCLUDE 
    (BountyAmount, CreationDate) 
WITH(MAXDOP = 8, SORT_IN_TEMPDB = ON, DATA_COMPRESSION = ROW);
GO 

CREATE INDEX clown ON dbo.Badges( UserId ) 
WITH(MAXDOP = 8, SORT_IN_TEMPDB = ON, DATA_COMPRESSION = ROW);
GO

If there are other indexes you’d like to test, you can do that locally.

What I want to point out is that for many values of VoteTypeId, the optimizer comes up with very good, very fast plans.

Good job, optimizer.

In fact, for any of these runs, you’ll get a good enough plan for any of the other values. They share well.

EXEC dbo.VoteSniffing @VoteTypeId = 4;
EXEC dbo.VoteSniffing @VoteTypeId = 6;
EXEC dbo.VoteSniffing @VoteTypeId = 7;
EXEC dbo.VoteSniffing @VoteTypeId = 9;
EXEC dbo.VoteSniffing @VoteTypeId = 11;
EXEC dbo.VoteSniffing @VoteTypeId = 12;
EXEC dbo.VoteSniffing @VoteTypeId = 13;
EXEC dbo.VoteSniffing @VoteTypeId = 14;
EXEC dbo.VoteSniffing @VoteTypeId = 15;
EXEC dbo.VoteSniffing @VoteTypeId = 16;

VoteTypeIds 1, 2, 3, 5, 8, and 10 have some quirks, but even they mostly do okay using one of these plans.

There are two plans you may see occur for these.

Plan 1

Plan 2

Particulars & Peculiars

Plan 1 is first generated when the procedure is compiled with VoteTypeId 4, and Plan 2 is first generated when the procedure is compiled with VoteTypeId 6.

There’s a third plan that only gets generated when VoteTypeId 2 is compiled first, but we’ll have to save that for another post, because it’s totally different.

Here’s how each of those plans works across other possible parameters.

Plan 1 is grey, Plan 2 is blue. It’s pretty easy to see where each one is successful, and then not so much. Anything < 100ms got a 0.

The Y axis is runtime in seconds. A couple are quite bad. Most are decent to okay.

Plans for Type 2 & 8 obviously stick out, but for different plans.

This is one of those things I need to warn people about when they get wrapped up in:

Forcing a plan (e.g. via Query Store or a plan guide)
Optimizing for unknown
Optimizing for a specific value
Recompiling every time (that backfires in a couple cases here that I’m not covering right now)

One thing I need to point out is that Plan 2 doesn’t have an entry here for VoteTypeId 5. Why?

Because when it inherits the plan for VoteTypeId 6, it runs for 17 minutes.

This is probably where you’re wondering “okay, so what plan does 5 get when it runs on its own? Is this the mysterious Plan 4 From Outer Space?”

Unfortunately, the plan that gets generated for VoteTypeId 5 is… the same one that gets generated for VoteTypeId 6, but 6 has a much smaller memory grant.

If you’re not used to reading operator times in execution plans, check out my video here.

Since this plan is all Batch Mode operators, each operator will track its time individually.

The Non-Switch

VoteTypeId 5 runtime, VoteTypeId 6 compile time

If I were to put a 17 minute runtime in the graph (>1000 seconds), it would defeat the purpose of graphing things.

Note the Hash Match has, by itself, 16 minutes and 44 seconds of runtime.

VoteTypeId 5 runtime, and compile time

This isn’t awesome, either.

The Hash Join, without spilling, has 12 minutes and 16 seconds of runtime.

Big Differentsiz

You have the same plan shape and operators. Even the Adaptive Join follows the same path to hash instead of loop.

Sure, the spills account for ~4 minutes of extra time. They are fairly big spills.

But the plan for VoteTypeId 5, even when compiled specifically for VoteTypeId 5… sucks, and sucks royally.

There are some dismally bad estimates, but where do they come from?

We just created these indexes, and data isn’t magically changing on my laptop.

TUNE IN TOMORROW!

Thanks for reading

Going Further

Query Tuning SQL Server 2019 Part 1: Changing Databases

Posted on June 15, 2020May 14, 2022 by Erik Darling

Teeth To Grit

I’ve always had trouble standing still on SQL Server versions, but most companies don’t. Hardly anyone I talk to is on SQL Server 2017, though these days SQL Server 2016 seems more common than SQL Server 2012, so at least there’s that. Mostly I’m happy to not see SQL Server 2014. God I hate SQL Server 2014.

Despite the lack of adoption, I’ve been moving all my training material to SQL Server 2019. Heck, in a few years, my old posts might come in handy for you.

But during that process, I kept running into the same problem: The demos generally still worked for the OLTP-ish queries, but for the report-ish queries Batch Mode On Rowstore (BMOR, from here) was kicking butt (most of the time anyway, we’re gonna look at some hi-jinks this week).

The problem, so far as I could tell, was that the Stack Overflow 2013 database just wasn’t enough database for SQL Server 2019 (at least with my hardware). My laptop is quad core (8 with HT) @2.9GHz, with 64GB of RAM, and max server memory set to 50GB. The SO2013 database is… just about 50GB.

While it’s fun to be able to create performance problems even with the whole database in memory, it doesn’t match what lot of people are dealing with in real life.

Especially you poor saps on Standard Edition.

My options seemed to be:

Drop max server memory down
Use a VM with lower memory
Use the full size Stack Overflow database

Flipping and Flopping

Each of these has problems, though.

Dropping max server memory down is okay for the buffer pool, but SQL Server (it seems especially with column store/batch mode) is keen to use memory above that for other things like memory grants.

A lot of the interesting struggle I see on client servers between the buffer pool and query memory grants didn’t happen when I did that.

Using a VM with lower memory, while convenient, just didn’t seem as fun. Plus, part of the problem is that, while I make fun of other sample databases for being unrealistically tiny, at least they have relatively modern dates in some of them.

I was starting to feel really goofy having time stop on January 31st, 2013.

I suppose I could have updated all the CreationDate columns to modernize things, but who knows what that would have thrown off.

Plus, here’s a dirty little secret: all the date columns that start with “Last” that track stuff like when someone last logged in, or when a post was last active/edited, they don’t stop at 2013-12-31. They extend up to when the database was originally chopped down to size, in 2017 or so. I always found that a bit jarring, and I’d have to go and add time to them, too, to preserve the gaps.

It all starts to feel a bit like revisionist history.

The End Is Thigh

In the end, I settled on using the most recent version available here, but with a couple of the tables I don’t regularly use in demos cut out: PostHistory, and PostLinks. Once you drop those out, a 360GB database drops down to a much more manageable 150Gb or so.

If you’d like to get a copy, here’s the magnet link.

SQL Server Database Properties — Four users, huh?

The nice thing is that the general cadence of the data is the same in many ways and places, so it doesn’t take a lot to adjust demos to work here. Certain Post and Vote Types, User Ids, Reputations, etc. remain skewed, and outliers are easy to find. Plus, at 3:1 data to memory, it’s a lot harder to keep everything safely in the buffer pool.

This does present different challenges, like index create time to set up for things, database distribution, etc.

But if I can give you better demos, that seems worth it.

Plus, I hear everything is in the cloud now anyway.

Alluding To

In the process of taking old demos and seeing how they work with the new database, I discovered some interesting stuff that I want to highlight a little bit. So far as I can tell, they’re not terribly common (yet), but that’s what makes them interesting.

If you’re the kind of person who’s looking forward to SQL Server 2019’s performance features solving some problems for you auto-magick-ally, these may be things you need to watch out for, and depending on your workload they may end up being quite a bit more common than I perceive.

I’m going to be specifically focusing on how BMOR (and to some extent Adaptive Joins) can end up not solving performance issues, and how you may end up having to do some good ol’ fashion query tuning on your own.

In the next post, we’ll look at how one of my favorite demos continues to keep on giving.

Thanks for reading!

Going Further

SQL Server’s Optimizer Could Be Smarter About Constraints

Posted on June 10, 2020May 16, 2022 by Erik Darling

Couldn’t We All

Let me ask you a question: If I told you that all the numbers in an integer column were either:

> 0
>= 1

You’d probably agree that the lower number you could possible see is 1.

And that’s exactly the case with the Reputation column in Stack Overflow.

Non-Alignment Pact

Assume that I am being truthful about creating this index:

CREATE INDEX constraints_are_silly
ON dbo.Users
(
Reputation,
UpVotes
) INCLUDE (DisplayName);

Also assume that this is the most important query ever written by human hands for the benefit of humanity:

SELECT TOP (1000)
u.Reputation,
u.UpVotes,
u.DisplayName
FROM dbo.Users AS u
WHERE u.Reputation <= 1
ORDER BY u.UpVotes;

However, I’m dissatisfied with the query plan. This requires no assumption. It’s a bit much for what I’m asking, I think.

Since a current implementation rule for the database is that no one can have a Reputation of 0 or less, I add this constraint hoping that SQL Server will see this and stop sorting data, because it knows that 1 is the lowest integer it will find, and the order of UpVotes won’t reset for Reputation = 0.

ALTER TABLE dbo.Users ADD CONSTRAINT checko CHECK (Reputation > 0);

But I still end up with the same execution plan. In neither case is the plan a) trivial, or b) simple parameterized. We can’t blame the optimizer trying to be helpful.

Now assume that I get really mad and change my constraint. This requires minimal assumption.

ALTER TABLE dbo.Users ADD CONSTRAINT checko CHECK (Reputation >= 1);

And now I get a query plan that does not have a sort in it. My approval does not require assumption.

Why does one constraint remove the need to sort, and one not?

Over My Head

The answer is in the query plan. Sometimes I have to be reminded to look at these.

The Seek Predicate on the left is from when we defined the constraint as > 0. It has a <= 1 predicate.
The Seek Predicate on the right is an equality on = 1

For a little more detail, I asked a question. Apparently the optimizer… doesn’t consider data types here.

Thanks for reading!

Going Further

How Trivial Plans Can Make Identifying Queries Confusing In SQL Server

Posted on May 26, 2020May 16, 2022 by Erik Darling

End Of The Line

There are a lot of words you can use to describe Extended Events. I’ll skip over my list to keep the blog family friendly.

In this post, we’re going to look at how two event can show the same query having a different plan handle.

I stumbled on this while working on sp_HumanEvents, when one of my precious sessions suddenly STOPPED RETURNING RESULTS.

Very helpful.

Let’s look at why.

A Plan Here, A Plan There

When the optimizer gets its hands on a query, it has some early choices, where it might choose to keep a plan at the trivial optimization level, and it may choose simple parameterization in order to make things more re-usable.

Apparently this causes some differences when it comes to hashing plans.

Here’s an example query:

SELECT COUNT(*) FROM dbo.Votes WHERE BountyAmount = 500 AND UserId = 22565;

Looking at the execution plan for it, we know a couple things.

It didn’t stay trivial because it went parallel, and the optimizer thought that simple parameterization was a good idea. Even though we passed in literals, they’re replaced by parameters in the execution plan text.

What’s the big deal?

Well, nothing, unless you want to correlate information about a single query from multiple Extended Events.

Check It Out

Let’s run sp_HumanEvents and get debug information — because the regular results won’t return anything for this query right now (I have fixed this issue, don’t worry).

EXEC sp_HumanEvents @event_type = N'query', @debug = 1, @seconds_sample = 5;

In another window, we’ll run our query.

SELECT COUNT(*) FROM dbo.Votes WHERE BountyAmount = 500 AND UserId = 22565;

What do we get back? The debug parameter will spit out data in temp tables, and print out dynamic SQL commands for troubleshooting.

Eagle-eyed readers will notice that the plan handles in each event for the same query are different, but query hash and plan hash are identical.

Because of this, I’ve had to change the way I get plan handles, so I only take the one that comes from post execution showplan, because…

There’s Another Way

Which plan_handle ends up in your plan cache? Not the one from sql_statement_completed.

We can see that by running sp_HumanEvents to capture wait stats, which will also try to correlate waits to the queries that caused them.

EXEC sp_HumanEvents @event_type = N'waits', @debug = 1, @seconds_sample = 5;

I’m grabbing the debug here too so we can see what ends up in the temp tables along the way.

We can grab information about which queries caused waits, and the plan handle here is the same one that we saw in the post execution showplan event.

Yay?

In the Actual Results Plus from sp_HumanEvents, we can see the simple parameterized version of the query ended up in the plan cache.

There are two lines here because we had waits on two different wait stats, but the point is more that we can see simple parameterization at work here, but not in the events that collected information about executed queries.

Frustration Point

The problem would have been more obvious if the parameterized text showed up in the post execution showplan event, and not the version with literal values. It would have been easier to figure out why the plan handles were different, anyway.

So what’s the solution? Now, sp_HumanEvents will only get the plan handle from the post execution showplan event, in case you want to go and do anything with it later.

Thanks for reading!

Going Further

SQL Server Index Choice and Batch Mode On Rowstore

Posted on May 22, 2020July 8, 2024 by Erik Darling

A/B Testing

Let’s look at a pretty simple query against the Votes table:

SELECT v.VoteTypeId, 
       MIN(v.CreationDate) AS min_creation_date,
       MAX(v.CreationDate) AS max_creation_date,
       COUNT_BIG(*) AS records
FROM dbo.Votes AS v
GROUP BY v.VoteTypeId
ORDER BY v.VoteTypeId;

There are only two columns involved:

VoteTypeId, which we’re grouping on
CreationDate, which we’re finding min and max values for

There are two different ways to approach this:

CREATE INDEX even_steven 
    ON dbo.Votes(VoteTypeId, CreationDate);

CREATE INDEX steven_even 
    ON dbo.Votes(CreationDate, VoteTypeId);

But first, let’s look at the query with just the clustered index, which is on the Id column, not mentioned at all in the query.

We’ll call this our baseline.

This takes 2 seconds.

Even Steven

With an index on VoteTypeId, CreationDate, what happens?

It’ll take 5 seconds no matter how many times I run it.

This might sound like a very good index though, because even though we don’t have a where clause looking for VoteTypeIds, we’re grouping by them.

Having CreationDate next in the index key should make it really easy to find a min and max value for each VoteTypeId, because CreationDate will be in order.

And you know what? That sort of works out. We get a Stream Aggregate in the plan without a Sort operator.

But it still sucks: Why?

Steven Even

With this index, we go right back to… What we had before.

We went through all the trouble of adding indexes, to have one be slower, and one not get us any faster than just using the clustered index.

What gives?

Teenage Angst

I’ve been avoiding something a little bit, dear reader. You see, I’m using SQL Server 2019.

The first plan and the third plan — the ones that finished in 2 seconds — they both used batch mode on rowstore. That’s an Enterprise Edition optimizer feature available in compat level 150.

If you were to run this on SQL Server 2017 or earlier, you would find no measurable difference between any one of these queries.

And look, batch mode on row store does represent a good improvement in many cases for large aggregation queries — the type of queries that would benefit from columnstore in general — and maybe in places where you’re unable to use columnstore today because you’re also using triggers, cursors, Replication, or another feature that it disagrees with.

If you suddenly saw a 60% improvement in some of your “big” queries, you’d probably be pretty happy. I’m not saying it comes for free, or that it’s a magickal world where everything is perfect for every query now.

You can only get that if you have PLAN GUIDES FOR EVERY QUERY. H ah aa ah ah ha a.

But let’s consider something else

It only kicked in when our indexes were lacking.

When we had a “good index”, SQL Server chose a plan with no batch-y mode-y at all.

If you’ve carefully crafted some indexes over the years, even a sure shot for the type of query you want to get some batch mode love may not see it.

Remember: It only gets used if your query is tall enough to ride.

Thanks for reading!

Going Further

Using Exchange Demand Partitioning to Improve Parallel Query Scalability In SQL Server

Posted on May 21, 2020May 16, 2022 by Joe Obbish

One of our SQL Server workloads runs many concurrent parallel queries on a possibly already busy server. Other work occurring on the server can have dramatic effects on parallel query runtimes. One way to improve scalability of parallel query runtimes is to achieve query plans with operators that allow a dynamic amount of work to be completed by each parallel worker thread. For example, exchange operators with a partitioning type of hash or round robin force a typically even amount of work to be completed by each worker thread. If a worker thread happens to be on a busy scheduler then query runtime may increase due to the longer runtime of the worker thread that is competing for CPU resources. Serial zones in parallel query plans can of course have the same problem. Batch mode operators, exchange operators with a partitioning type of broadcast or demand, and the parallel page supplier are all examples of operators which can do a dynamic amount of work per thread. Those are the operators that I prefer to see in query plans for this workload.

Very little has been written about exchange operators with a partitioning type of demand, so I forgive you for not hearing of it before today. There is a brief explanation available here, an example of using demand partitioning to improve some query plans involving partitioned tables, and a Stack Exchange answer for someone comparing round robin and demand partitioning. You have the honor of reading perhaps the fourth blog post about the subject.

Start

The demos are somewhat dependent on hardware so you may not see identical results if you are following along. I’m testing on a machine with 8 CPU and with max server memory was set to 6000 MB. Start with a table with a multi-column primary key and insert about 34 million rows into it:

DROP TABLE IF EXISTS #;
CREATE TABLE # (
ID BIGINT NOT NULL,
ID2 BIGINT NOT NULL,
STRING_TO_AGG VARCHAR(MAX),
PRIMARY KEY (ID, ID2)
);

INSERT INTO # WITH (TABLOCK)
SELECT RN, v.v, REPLICATE('REPLICATE', 77)
FROM (
SELECT TOP (4800000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM master..spt_values
CROSS JOIN master..spt_values t2
) q
CROSS JOIN (VALUES (1), (2), (3), (4), (5), (6), (7)) v(v);

The query tuning exercise is to insert the ID column and a checksum of the concatenation of all the STRING_TO_AGG values for each ID value ordered by ID2. This may seem like an odd thing to do but it is based upon a production example with an adjustment to not write as much data as the real query. Not all of us have SANs in our basements, or even have a basement. Use the following for the target table:

DROP TABLE IF EXISTS ##;

CREATE TABLE ## (
ID BIGINT NOT NULL,
ALL_STRINGS_CHECKSUM INT
);

Naturally we use SQL Server 2019 CU4 so the STRING_AGG function is available to us. Here is one obvious way to write the query:

INSERT INTO ## WITH (TABLOCK)
SELECT ID, CHECKSUM(STRING_AGG(STRING_TO_AGG , ',') WITHIN GROUP (ORDER BY ID2))
FROM #
GROUP BY ID
OPTION (MAXDOP 8);

The above query takes about 63 seconds on my machine with a cost of 2320.57 optimizer units. The query optimizer decided that a serial plan was the best choice:

SQL Server Query Plan

This is a rather lame result for a query tuning exercise so I will assume that I know better and force a parallel query plan with undocumented trace flag 8649. SQL Server warns us that the estimated cost is 4816.68 optimizer units but surely doesn’t expect a detail like that to stop me. The adjusted query executes in 75 seconds:

SQL Server Query Plan

My arrogance is checked. The parallel query is slower than the serial version that the query optimizer wanted us to have. The problem is the hash exchange operator. It is an order preserving exchange with a high MAXDOP and wide rows. This is the worst possible situation for exchange performance. How else can we write the query? Batch mode is not available for STRING_AGG so that’s out. Does anyone remember anything about tuning row mode queries?

The Dark Side

Query hints along with carefully constructed T-SQL are the pathway to many abilities, some considered to be unnatural. We can give the classic Parallel Apply query pattern made famous by Adam Machanic a shot to solve this problem. Perhaps you are thinking that there is no driving table for the outer side of a nested loop join, but we can create one by sampling the clustered index of the base table. I’ll skip that part here and just use what I know about the data to divide it into 96 equal ranges:

DROP TABLE IF EXISTS ###;
CREATE TABLE ### (s BIGINT NOT NULL, e BIGINT NOT NULL);

INSERT INTO ###
VALUES
(1, 50000),
(50001, 100000),
(100001, 150000),
(150001, 200000),
(200001, 250000),
(250001, 300000),
(300001, 350000),
(350001, 400000),
(400001, 450000),
(450001, 500000),
(500001, 550000),
(550001, 600000),
(600001, 650000),
(650001, 700000),
(700001, 750000),
(750001, 800000),
(800001, 850000),
(850001, 900000),
(900001, 950000),
(950001, 1000000),
(1000001, 1050000),
(1050001, 1100000),
(1100001, 1150000),
(1150001, 1200000),
(1200001, 1250000),
(1250001, 1300000),
(1300001, 1350000),
(1350001, 1400000),
(1400001, 1450000),
(1450001, 1500000),
(1500001, 1550000),
(1550001, 1600000),
(1600001, 1650000),
(1650001, 1700000),
(1700001, 1750000),
(1750001, 1800000),
(1800001, 1850000),
(1850001, 1900000),
(1900001, 1950000),
(1950001, 2000000),
(2000001, 2050000),
(2050001, 2100000),
(2100001, 2150000),
(2150001, 2200000),
(2200001, 2250000),
(2250001, 2300000),
(2300001, 2350000),
(2350001, 2400000),
(2400001, 2450000),
(2450001, 2500000),
(2500001, 2550000),
(2550001, 2600000),
(2600001, 2650000),
(2650001, 2700000),
(2700001, 2750000),
(2750001, 2800000),
(2800001, 2850000),
(2850001, 2900000),
(2900001, 2950000),
(2950001, 3000000),
(3000001, 3050000),
(3050001, 3100000),
(3100001, 3150000),
(3150001, 3200000),
(3200001, 3250000),
(3250001, 3300000),
(3300001, 3350000),
(3350001, 3400000),
(3400001, 3450000),
(3450001, 3500000),
(3500001, 3550000),
(3550001, 3600000),
(3600001, 3650000),
(3650001, 3700000),
(3700001, 3750000),
(3750001, 3800000),
(3800001, 3850000),
(3850001, 3900000),
(3900001, 3950000),
(3950001, 4000000),
(4000001, 4050000),
(4050001, 4100000),
(4100001, 4150000),
(4150001, 4200000),
(4200001, 4250000),
(4250001, 4300000),
(4300001, 4350000),
(4350001, 4400000),
(4400001, 4450000),
(4450001, 4500000),
(4500001, 4550000),
(4550001, 4600000),
(4600001, 4650000),
(4650001, 4700000),
(4700001, 4750000),
(4750001, 4800000);

I can now construct a query that gets a parallel apply type of plan:

INSERT INTO ## WITH (TABLOCK)
SELECT ca.*
FROM ### driver
CROSS APPLY (
	SELECT TOP (987654321987654321) ID, CHECKSUM(STRING_AGG(STRING_TO_AGG , ',') WITHIN GROUP (ORDER BY ID2)) ALL_STRINGS_CHECKSUM
	FROM # WITH (FORCESEEK)
	WHERE ID BETWEEN driver.s AND driver.e
	GROUP BY ID
) ca
OPTION (MAXDOP 8, NO_PERFORMANCE_SPOOL, FORCE ORDER);

This is an unnatural query plan. The query optimizer assigned it a cost of 36248.7 units. I had to add the TOP to get a valid query plan with an index seek. Removing the TOP operator results in error 8622. Naturally such things won’t stop us and running the query results in an execution time between 15 – 19 seconds on my machine which is the best result yet.

SQL Server Query Plan

This query plan has an exchange partitioning type of round robin. Recall such exchange types can lead to trouble if there’s other work executing on one of the schedulers used by a parallel worker thread. So far I’ve been testing these MAXDOP 8 queries with nothing else running on my 8 core machine. I can make a scheduler busy by running a MAXDOP 1 query that has no real reason to yield before exhausting its 4 ms quantum. Here is one way to accomplish that:

SELECT TOP (1) t1.high + t2.high + t3.high + t4.high
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
CROSS JOIN master..spt_values t4
ORDER BY t1.high + t2.high + t3.high + t4.high
OPTION (MAXDOP 1, NO_PERFORMANCE_SPOOL);

Wait stats for this query if you don’t believe me:

SQL Server Wait Stats

Running this query at the same time as the parallel query can apply a large performance penalty to the parallel query. The parallel query can take up to 48 seconds to execute if even a single worker thread has to share time on a scheduler with another. That is, the query ran 3 times slower when I added a single MAXDOP 1 query to the workload. Looking at thread details for the parallel query:

SQL Server Threads

As you can see, one of the worker threads took a significantly longer amount of time to complete its work compared to the other threads. There is no logged wait statistic for this kind of performance problem in which the other parallel worker threads complete their work much earlier than when the query finishes. If there’s no worker thread then there is no wait associated with the query. The only way to catch this is to look at actual row distribution or the CPU time to elapsed time ratio.

You may be wondering why the query is worse than twice as slow as before. After all, if all workers do an equal amount of work but one now gets access to half as much CPU as before it seems reasonable to expect the runtime to double instead of triple. The workers of the parallel query have many reasons they might yield before exhausting their full 4 ms quantum – an I/O wait for example. The MAXDOP 1 SELECT query is designed to not yield early. What is very likely happening is that the MAXDOP 1 query gets a larger share of the scheduler’s resources than 50%. SQL Server 2016 made adjustments to try to limit this type of situation but by its very nature I don’t see how it could ever lead to a perfect sharing of a scheduler’s resources.

Demanding Demand Partitioning

We can get an exchange operator with demand based partitioning by replacing the driving temp table with a derived table. Full query text below:

INSERT INTO ## WITH (TABLOCK)
SELECT ca.*
FROM (
VALUES
(1, 50000),
(50001, 100000),
(100001, 150000),
(150001, 200000),
(200001, 250000),
(250001, 300000),
(300001, 350000),
(350001, 400000),
(400001, 450000),
(450001, 500000),
(500001, 550000),
(550001, 600000),
(600001, 650000),
(650001, 700000),
(700001, 750000),
(750001, 800000),
(800001, 850000),
(850001, 900000),
(900001, 950000),
(950001, 1000000),
(1000001, 1050000),
(1050001, 1100000),
(1100001, 1150000),
(1150001, 1200000),
(1200001, 1250000),
(1250001, 1300000),
(1300001, 1350000),
(1350001, 1400000),
(1400001, 1450000),
(1450001, 1500000),
(1500001, 1550000),
(1550001, 1600000),
(1600001, 1650000),
(1650001, 1700000),
(1700001, 1750000),
(1750001, 1800000),
(1800001, 1850000),
(1850001, 1900000),
(1900001, 1950000),
(1950001, 2000000),
(2000001, 2050000),
(2050001, 2100000),
(2100001, 2150000),
(2150001, 2200000),
(2200001, 2250000),
(2250001, 2300000),
(2300001, 2350000),
(2350001, 2400000),
(2400001, 2450000),
(2450001, 2500000),
(2500001, 2550000),
(2550001, 2600000),
(2600001, 2650000),
(2650001, 2700000),
(2700001, 2750000),
(2750001, 2800000),
(2800001, 2850000),
(2850001, 2900000),
(2900001, 2950000),
(2950001, 3000000),
(3000001, 3050000),
(3050001, 3100000),
(3100001, 3150000),
(3150001, 3200000),
(3200001, 3250000),
(3250001, 3300000),
(3300001, 3350000),
(3350001, 3400000),
(3400001, 3450000),
(3450001, 3500000),
(3500001, 3550000),
(3550001, 3600000),
(3600001, 3650000),
(3650001, 3700000),
(3700001, 3750000),
(3750001, 3800000),
(3800001, 3850000),
(3850001, 3900000),
(3900001, 3950000),
(3950001, 4000000),
(4000001, 4050000),
(4050001, 4100000),
(4100001, 4150000),
(4150001, 4200000),
(4200001, 4250000),
(4250001, 4300000),
(4300001, 4350000),
(4350001, 4400000),
(4400001, 4450000),
(4450001, 4500000),
(4500001, 4550000),
(4550001, 4600000),
(4600001, 4650000),
(4650001, 4700000),
(4700001, 4750000),
(4750001, 4800000)
) driver( s, e)
CROSS APPLY (
	SELECT TOP (987654321987654321) ID, CHECKSUM(STRING_AGG(STRING_TO_AGG , ',') WITHIN GROUP (ORDER BY ID2)) ALL_STRINGS_CHECKSUM
	FROM # WITH (FORCESEEK)
	WHERE ID BETWEEN CAST(driver.s AS BIGINT) AND CAST(driver.e AS BIGINT)
	GROUP BY ID
) ca
OPTION (MAXDOP 8, NO_PERFORMANCE_SPOOL, FORCE ORDER);

Query performance is effectively random. The query was observed to execute as quickly as 15 seconds and as slowly as 45 seconds. In some situations there was an incredible amount of skew in row distributions between threads:

SQL Server Threads

This is an unexpected situation if there are no other queries running on the server. Query performance is most disappointing.

Is there a trace flag?

Yes! The problem here is that the nested loop join uses the prefetch optimization. Paul White writes:

One of the available SQL Server optimizations is nested loops prefetching. The general idea is to issue asynchronous I/O for index pages that will be needed by the inner side — and not just for the current correlated join parameter value, but for future values too.

That sounds like it might be wildly incompatible with a demand exchange operator. Querying sys.dm_exec_query_profiles during query execution proves that the demand exchange isn’t working as expected: worker threads no longer fully process the results associated with their current row before requesting the next one. That is what can lead to wild skew between the worker threads and as a result query performance is effectively random.

Documented trace flag 8744 disables this optimization. Adding it to the query using QUERYTRACEON results in much more stable performance. The query typically finishes in about 15 seconds. Here is an example thread distribution:

SQL Server Threads

If you fight the ISV fight like I do, you may not be able to enable trace flags for individual queries. If you’re desperate you could try artificially lowering the cardinality estimate from the derived table. An OPTIMIZE FOR query hint with a direct filter is my preferred way to accomplish this. I like to set the cardinality estimate equal to MAXDOP but I have no real basis for doing this. Here is the full query text:

DECLARE @filter BIGINT = 987654321987654321;
INSERT INTO ## WITH (TABLOCK)
SELECT ca.*
FROM (
VALUES
(1, 50000),
(50001, 100000),
(100001, 150000),
(150001, 200000),
(200001, 250000),
(250001, 300000),
(300001, 350000),
(350001, 400000),
(400001, 450000),
(450001, 500000),
(500001, 550000),
(550001, 600000),
(600001, 650000),
(650001, 700000),
(700001, 750000),
(750001, 800000),
(800001, 850000),
(850001, 900000),
(900001, 950000),
(950001, 1000000),
(1000001, 1050000),
(1050001, 1100000),
(1100001, 1150000),
(1150001, 1200000),
(1200001, 1250000),
(1250001, 1300000),
(1300001, 1350000),
(1350001, 1400000),
(1400001, 1450000),
(1450001, 1500000),
(1500001, 1550000),
(1550001, 1600000),
(1600001, 1650000),
(1650001, 1700000),
(1700001, 1750000),
(1750001, 1800000),
(1800001, 1850000),
(1850001, 1900000),
(1900001, 1950000),
(1950001, 2000000),
(2000001, 2050000),
(2050001, 2100000),
(2100001, 2150000),
(2150001, 2200000),
(2200001, 2250000),
(2250001, 2300000),
(2300001, 2350000),
(2350001, 2400000),
(2400001, 2450000),
(2450001, 2500000),
(2500001, 2550000),
(2550001, 2600000),
(2600001, 2650000),
(2650001, 2700000),
(2700001, 2750000),
(2750001, 2800000),
(2800001, 2850000),
(2850001, 2900000),
(2900001, 2950000),
(2950001, 3000000),
(3000001, 3050000),
(3050001, 3100000),
(3100001, 3150000),
(3150001, 3200000),
(3200001, 3250000),
(3250001, 3300000),
(3300001, 3350000),
(3350001, 3400000),
(3400001, 3450000),
(3450001, 3500000),
(3500001, 3550000),
(3550001, 3600000),
(3600001, 3650000),
(3650001, 3700000),
(3700001, 3750000),
(3750001, 3800000),
(3800001, 3850000),
(3850001, 3900000),
(3900001, 3950000),
(3950001, 4000000),
(4000001, 4050000),
(4050001, 4100000),
(4100001, 4150000),
(4150001, 4200000),
(4200001, 4250000),
(4250001, 4300000),
(4300001, 4350000),
(4350001, 4400000),
(4400001, 4450000),
(4450001, 4500000),
(4500001, 4550000),
(4550001, 4600000),
(4600001, 4650000),
(4650001, 4700000),
(4700001, 4750000),
(4750001, 4800000)
) driver( s, e)
CROSS APPLY (
	SELECT TOP (987654321987654321) ID, CHECKSUM(STRING_AGG(STRING_TO_AGG , ',') WITHIN GROUP (ORDER BY ID2)) ALL_STRINGS_CHECKSUM
	FROM # WITH (FORCESEEK)
	WHERE ID BETWEEN CAST(driver.s AS BIGINT) AND CAST(driver.e AS BIGINT)
	GROUP BY ID
) ca
WHERE driver.s <= @filter
OPTION (OPTIMIZE FOR (@filter = 350001), MAXDOP 8, NO_PERFORMANCE_SPOOL, FORCE ORDER);

Query performance is the same as with TF 8744:

SQL Server Query Plan

Does this query do better than round robin partitioning when there is a busy MAXDOP 1 query running at the same time? I ran it a few times and it completed in about 15-16 seconds every time. One of the worker threads does less work and the others cover for it:

SQL Server Threads

In this example the nested loop join only gets the prefetch optimization if the cardinality estimate is more than 25 rows. I do not know if that number is a fixed part of the algorithm for prefetch eligibility but it certainly feels unwise to rely on this behavior never changing in a future version of SQL Server. Note that prefetching is a trade-off. For some workloads you may be better off with round robin partitioning and prefetching compared to demand without prefetching. It’s hard to imagine a workload that would benefit from demand with prefetching but perhaps I’m not being creative enough in my thinking.

Final Thoughts

In summary, the example parallel apply query that uses demand partitioning performs 2-3 times better the query that uses round robin partitioning when another serial query is running on the server. The nested loop prefetch optimization does not work well witth exchange operator demand partitioning and should be avoided via a trace flag or other tricks if demand partitioning is in use.

There are a few things that Microsoft could do to improve the situation. A USE HINT that disables nested loop prefetching would be most welcome. I think that there’s also a fairly strong argument to make that a query pattern of a nested loop join with prefetching with a first child of a demand exchange operator is a bad pattern that the query optimizer should avoid if possible. Finally, it would be nice if there was a type of wait statistic triggered when some of a parallel query’s workers finish their work earlier than others. The problematic queries here have very little CXPACKET waits and no EXECSYNC waits. Imagine that I put a link to UserVoice requests here and imagine yourself voting for them.

Thanks for reading!

Going Further

Detecting Parallel Query Plan Performance Issues In Query Store

Posted on May 19, 2020May 16, 2022 by Erik Darling

Let’s Have Fun

I got asked recently if there’s a way to find parallel queries that are potentially skewed.

One thing that you’ll see quite often is that duration and cpu time aren’t much different. For parallel queries, it should be the opposite — you should be using more CPU in order to reduce duration. Granted, this query could also find queries that were blocked, but it’s a start.

WITH XMLNAMESPACES ( 'http://schemas.microsoft.com/sqlserver/2004/07/showplan' AS x ), 
skew AS 
(
    SELECT qsp.query_id, 
           qsp.plan_id,
           qsrs.avg_cpu_time / 1000. AS avg_cpu_time_ms, 
           qsrs.avg_duration / 1000. AS avg_duration_ms,
           (qsrs.avg_duration - qsrs.avg_cpu_time) / 1000. AS the_difference_ms,
           qsrs.avg_dop, 
           CONVERT(XML, qsp.query_plan) AS query_plan
    FROM sys.query_store_runtime_stats AS qsrs
    JOIN sys.query_store_plan AS qsp
        ON qsrs.plan_id = qsp.plan_id
    WHERE qsp.is_parallel_plan = 1
    AND qsrs.avg_duration >= qsrs.avg_cpu_time
)
SELECT sk.*
FROM skew AS sk
    OUTER APPLY sk.query_plan.nodes('//x:StmtSimple') AS s(c)
WHERE s.c.exist('@StatementType[.= "SELECT"]') = 1
ORDER BY sk.avg_duration_ms DESC;

If you’re wondering why I’m looking at the XML outside a CTE, it’s because Query Store doesn’t store plans in XML, so you get an error. This was the first workaround I found.

Msg 9506, Level 16, State 1, Line 125
The XMLDT method 'nodes' can only be invoked on columns of type xml.

And if you’re wondering how Query Store stores query plans, JOIN THE CLUB:

Just when you thought the image datatype was deprecated, huh?

Thanks for reading!

Going Further

How Are Table Variables Different In SQL Server 2019?

Posted on April 3, 2020May 16, 2022 by Erik Darling

WallaWallaBingBang

Going Further

Interpreting Key Lookups In SQL Server Query Plans Can Be Tricky

Posted on March 18, 2020May 16, 2022 by Erik Darling

This Looks Bonkers!

If you’ve ever read Kim Tripp’s wonderful post on tipping points, you’re probably staring at this Key Lookup and wondering why SQL Server would ever even consider it here. That’s like a 12 digit number. Twelve. That’s like a foreign phone number. That’s like what I spent on cheese plates last year.

I kid, I kid. Though I would not be opposed to that lifestyle.

The thing is, what that number represents is a little different from what we might first expect.

That number comes from multiplying these two numbers:

But Really

That’s how many rows we read. Which isn’t great, obviously. Look how long that thing runs for.

And it gets worse when we examine how rows were distributed on threads.

We produced far fewer rows than that in all, because the Lookup is evaluating a predicate, which only produces ~27k rows.

So for each of the ~11 million rows that comes out of the index seek of our nonclustered index, we:

“Join” it to the clustered index based on the clustered index key column
Evaluate if the OwnerUserId for that row is 22656

The 27,062 number is how many rows are produced after the filter is applied. That’s a bit more obvious when using Plan Explorer.

This lookup doesn’t produce any rows or columns, that’s why there are 0.0 rows per iteration.

It’s purely to filter data out, and it does that. Slowly.

Look, I’m not defending the choice, I’m just using it to teach you something.

How Can You Fix It?

In this case, it would probably be worth adding the OwnerUserId column to the nonclustered index we already have on Posts that’s being used in this query, assuming that it wouldn’t be disruptive to other queries. If that’s not possible, then a new index that satisfies the entire where clause would be a good solution.

If neither of those is palatable, then you might try some exotic rewrites to isolate those rows, correlate on a different column, or “persuade” the optimize to stop pursuing nested loops joins.

Thanks for reading!