Execution Plans – Page 9

How Useful Is Column Store Indexing In SQL Server Standard Edition?

Posted on March 17, 2021May 16, 2022 by Erik Darling

Speed Limit

When I’m blogging about performance tuning, most of it is from the perspective of Enterprise Edition. That’s where you need to be if you’re serious about getting SQL Server to go as fast as possible. Between the unrealistic memory limits and other feature restrictions, Standard Edition just doesn’t hold up.

Sure, you can probably get by with it for a while, but once performance becomes a primary concern it’s time to fork over an additional 5k a core for the big boat.

They don’t call it Standard Edition because it’s The Standard, like the hotel. Standard is a funny word like that. It can denote either high or low standing through clever placement of “the”. Let’s try an experiment:

Erik’s blogging is standard for technical writing
Erik’s blogging is the standard for technical writing

Now you see where you stand with standard edition. Not with “the”, that’s for sure. “The” has left the building.

Nerd Juice

A lot of the restrictions for column store in Standard Edition are documented, but:

DOP limit of two for queries
No parallelism for creating or rebuilding indexes
No aggregate pushdown
No string predicate pushdown
No SIMD support

Here’s a comparison for creating a nonclustered column store index in Standard and Enterprise/Developer Editions:

SQL Server Query Plan — your fly is down

The top plan is from Standard Edition, and runs for a minute in a full serial plan. There is a non-parallel plan reason in the operator properties: MaxDOPSetToOne.

I do not have DOP set to one anywhere, that’s just the restriction kicking in. You can try it out for yourself if you have Standard Edition sitting around somewhere. I’m doing all my testing on SQL Server 2019 CU9. This is not ancient technology at the time of writing.

The bottom plan is from Enterprise/Developer Edition, where the the plan is able to run partially in parallel, and takes 28 seconds (about half the time as the serial plan).

Query Matters

One of my favorite query tuning tricks is getting batch mode to happen on queries that process a lot of rows. It doesn’t always help, but it’s almost always worth trying.

The problem is that on Standard Edition, if you’re processing a lot of rows, being limited to a DOP of 2 can be a real hobbler. In many practical cases, a batch mode query at DOP 2 will end up around the same as a row mode query at DOP 8. It’s pretty unfortunate.

In some cases, it can end up being much worse.

SELECT 
    MIN(p.Id) AS TinyId,
    COUNT_BIG(*) AS records
FROM dbo.Posts AS p WITH(INDEX = ncp)
JOIN dbo.Votes AS v
    ON p.Id = v.PostId
WHERE p. OwnerUserId = 22656;

SELECT 
    MIN(p.Id) AS TinyId,
    COUNT_BIG(*) AS records
FROM dbo.Posts AS p WITH(INDEX = 1)
JOIN dbo.Votes AS v
    ON p.Id = v.PostId
WHERE p. OwnerUserId = 22656;

Here’s the query plan for the first one, which uses the nonclustered column store index on Posts. There is no hint or setting that’s keeping DOP at 2, this really is just a feature restriction.

Higher Ground

The second query, which is limited by the MAXDOP setting to 8, turns out much faster. The batch mode query takes 3.8 seconds, and the row mode query takes 1.4 seconds.

In Enterprise Edition, there are other considerations for getting batch mode going, like memory grant feedback or adaptive joins, but those aren’t available in Standard Edition.

In a word, that sucks.

Dumb Limit

The restrictions on creating and rebuilding column store indexes to DOP 1 (both clustered and nonclustered), and queries to DOP 2 all seems even more odd when we consider that there is no restriction on inserting data into a table with a column store index on it.

As an example:

SELECT 
    p.*
INTO dbo.PostsTestLoad
FROM dbo.Posts AS p
WHERE 1 = 0;

CREATE CLUSTERED COLUMNSTORE INDEX pc ON dbo.PostsTestLoad;

SET IDENTITY_INSERT dbo.PostsTestLoad ON;

INSERT dbo.PostsTestLoad WITH(TABLOCK)
(
    Id, AcceptedAnswerId, AnswerCount, Body, ClosedDate, 
    CommentCount, CommunityOwnedDate, CreationDate, 
    FavoriteCount, LastActivityDate, LastEditDate, 
    LastEditorDisplayName, LastEditorUserId, OwnerUserId, 
    ParentId, PostTypeId, Score, Tags, Title, ViewCount 
)
SELECT TOP (1024 * 1024)
    p.Id, p.AcceptedAnswerId, p.AnswerCount, p.Body, p.ClosedDate, p.
    CommentCount, p.CommunityOwnedDate, p.CreationDate, p.
    FavoriteCount, p.LastActivityDate, p.LastEditDate, p.
    LastEditorDisplayName, p.LastEditorUserId, p.OwnerUserId, p.
    ParentId, p.PostTypeId, p.Score, p.Tags, p.Title, p.ViewCount 
FROM dbo.Posts AS p;

SET IDENTITY_INSERT dbo.PostsTestLoad OFF;

Unsupportive Parents

These limits are asinine, plain and simple, and I hope at some point they’re reconsidered. While I don’t expect everything from Standard Edition, because it is Basic Cable Edition, I do think that some of the restrictions go way too far.

Perhaps an edition somewhere between Standard and Enterprise would make sense. When you line the two up, the available features and pricing are incredibly stark choices.

There are often mixed needs as well, where some people need Standard Edition with fewer HA restrictions, and some people need it with fewer performance restrictions.

Thanks for reading!

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

A Useful Rewrite Using APPLY To Improve Query Performance In SQL Server

Posted on March 12, 2021May 16, 2022 by Erik Darling

More To The Matter

In the year 950 B.C., Craig Freedman write a post about subqueries in CASE expressions. It’s amazing how relevant so much of this stuff stays.

In today’s post, we’re going to look at a slightly different example than the one given, and how you can avoid performance problems with them by using APPLY.

Like most query tuning tricks, this isn’t something you always need to employ, and it’s not a best practice. It’s just something you can use when a scalar subquery doesn’t perform as you’d like it to.

How Much Wood

Our starting query looks like this. The point of it is to determine the percentage of answered questions per month.

SELECT 
    x.YearPeriod,
    MonthPeriod = 
        RIGHT('00' + RTRIM(x.MonthPeriod), 2),
    PercentAnswered = 
        CONVERT(DECIMAL(18, 2), 
           (SUM(x.AnsweredQuestion * 1.) /
           (COUNT_BIG(*) * 1.)) * 100.)
FROM
(
    SELECT 
        YearPeriod = YEAR(p.CreationDate),
        MonthPeriod = MONTH(p.CreationDate),
        CASE 
            WHEN EXISTS
                 ( 
                     SELECT 
                         1/0
                     FROM dbo.Votes AS v
                     WHERE v.PostId = p.AcceptedAnswerId
                     AND   v.VoteTypeId = 1 
                 ) 
            THEN 1
            ELSE 0
        END AS AnsweredQuestion
    FROM dbo.Posts AS p
    WHERE p.PostTypeId = 1
) AS x
GROUP BY 
    x.YearPeriod, 
    x.MonthPeriod
ORDER BY 
    x.YearPeriod ASC, 
    x.MonthPeriod ASC;

Smack in the middle of it, we have a case expression that goes looking for rows in the Votes table where a question has an answer that’s been voted as the answer.

Amazing.

To start with, we’re going to give it this index.

CREATE INDEX p 
    ON dbo.Posts(PostTypeId, AcceptedAnswerId) 
    INCLUDE(CreationDate) 
WITH(MAXDOP = 8, SORT_IN_TEMPDB = ON);

Planpains

In all, this query will run for about 18 seconds. The majority of it is spent in a bad neighborhood.

Why does this suck? Boy oh boy. Where do we start?

Sorting the Votes table to support a Merge Join?
Choosing Parallel Merge Joins ever?
Choosing a Many To Many Merge Join ever?
All of the above?

Bout It

If we change the way the query is structured to use OUTER APPLY instead, we can get much better performance in this case.

SELECT 
    x.YearPeriod,
    MonthPeriod = 
        RIGHT('00' + RTRIM(x.MonthPeriod), 2),
    PercentAnswered = 
        CONVERT(DECIMAL(18, 2), 
           (SUM(x.AnsweredQuestion * 1.) /
           (COUNT_BIG(*) * 1.)) * 100.)
FROM
(
    SELECT 
        YearPeriod = YEAR(p.CreationDate),
        MonthPeriod = MONTH(p.CreationDate),
        oa.AnsweredQuestion
    FROM dbo.Posts AS p
    OUTER APPLY 
    (
        SELECT 
            AnsweredQuestion = 
                CASE 
                    WHEN v.Id IS NOT NULL 
                    THEN 1 
                    ELSE 0 
                END
        FROM dbo.Votes AS v
        WHERE v.PostId = p.AcceptedAnswerId
        AND   v.VoteTypeId = 1
    ) oa
    WHERE p.PostTypeId = 1
) AS x
GROUP BY 
    x.YearPeriod, 
    x.MonthPeriod
ORDER BY 
    x.YearPeriod ASC, 
    x.MonthPeriod ASC;

This changes the type of join chosen, and runs for about 3 seconds total.

We avoid all of the problems that the parallel many-to-many Merge Join brought us.

Thanks, Hash Join.

It’s also worth noting that the OUTER APPLY plan asks for an index that would help us a bit, though like most missing index requests it’s a bit half-baked.

USE [StackOverflow2013]
GO
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[Votes] ([VoteTypeId])
INCLUDE ([PostId])
GO

Index Plus

Any human could look at this query and realize that having the PostId in the key of the index would be helpful, since we’d have it in secondary order to the VoteTypeId column

CREATE INDEX v 
    ON dbo.Votes(VoteTypeId, PostId) 
WITH(MAXDOP = 8, SORT_IN_TEMPDB = ON);

If we add that index, we can make the subquery fairly competitive, at about 4.5 seconds total.

But the issue here is now rather than poorly choosing a Sort > Merge Join, we go into a Nested Loops join for ~6 million rows. That’s probably not a good idea.

This index doesn’t leave as profound a mark on the APPLY version of the query. It does improve overall runtime by about half a second, but I don’t think I’d create an index just to get a half second better.

But hey, who knows? Maybe it’d help some other queries, too.

Indexes are cool like that.

Back On The Map

If you’ve got subqueries in your select list that lead to poor plan choices, you do have options. Making sure you have the right indexes in place can go a long way.

You may be able to get competitive performance gains by rewriting them as OUTER APPLY. You really do need to use OUTER here though, because it won’t restrict rows and matches the logic of the subquery. CROSS APPLY would act like an inner join and remove any rows that don’t have a match. That would break the results.

Thanks for reading!

Going Further

Multiple Distinct Aggregates: Still Harm Performance Without Batch Mode In SQL Server

Posted on March 10, 2021May 16, 2022 by Erik Darling

Growler

Well over 500 years ago, Paul White wrote an article about distinct aggregates. Considering how often I see it while working with clients, and that Microsoft created column store indexes and batch mode rather than allow for hash join hints on CLR UDFs, the topic feels largely ignored.

But speaking of all that stuff, let’s look at how Batch Mode fixes multiple distinct aggregates.

Jumbo Size

A first consideration is around parallelism, since you don’t pay attention or click links, here’s a quote you won’t read from Paul’s article above:

Another limitation is that this spool does not support parallel scan for reading, so the optimizer is very unlikely to restart parallelism after the spool (or any of its replay streams).

In queries that operate on large data sets, the parallelism implications of the spool plan can be the most important cause of poor performance.

What does that mean for us? Let’s go look. For this demo, I’m using SQL Server 2019 with the compatibility level set to 140.

SELECT
   COUNT_BIG(DISTINCT v.PostId) AS PostId,
   COUNT_BIG(DISTINCT v.UserId) AS UserId,
   COUNT_BIG(DISTINCT v.BountyAmount) AS BountyAmount,
   COUNT_BIG(DISTINCT v.VoteTypeId) AS VoteTypeId,
   COUNT_BIG(DISTINCT v.CreationDate) AS CreationDate
FROM dbo.Votes AS v;

In the plan for this query, we scan the clustered index of the Votes table five times, or once per distinct aggregate.

In case you’re wondering, this results in one intent shared object lock on the Votes table.

<Object name="Votes" schema_name="dbo">
  <Locks>
    <Lock resource_type="OBJECT" request_mode="IS" request_status="GRANT" request_count="9" />
    <Lock resource_type="PAGE" page_type="*" index_name="PK_Votes__Id" request_mode="S" request_status="GRANT" request_count="14" />
  </Locks>
</Object>

This query runs for 38.5 seconds, as the crow flies.

A Join Appears

Let’s join Votes to Posts for no apparent reason.

SELECT
   COUNT_BIG(DISTINCT v.PostId) AS PostId,
   COUNT_BIG(DISTINCT v.UserId) AS UserId,
   COUNT_BIG(DISTINCT v.BountyAmount) AS BountyAmount,
   COUNT_BIG(DISTINCT v.VoteTypeId) AS VoteTypeId,
   COUNT_BIG(DISTINCT v.CreationDate) AS CreationDate
FROM dbo.Votes AS v
JOIN dbo.Posts AS p
    ON p.Id = v.PostId;

The query plan now has two very distinct (ho ho ho) parts.

This is part 1. Part 1 is a spoiler. Ignoring that Repartition Streams is bizarre and Spools are indefensible blights, as we meander across the execution plan we find ourselves at a stream aggregate whose child operators have executed for 8 minutes, and then a nested loops join whose child operators have run for 20 minutes and 39 seconds. Let’s go look at that part of the plan.

Each branch here represents reading from the same spool. We can tell this because the Spool operators do not have any child operators. They are starting points for the flow of data. One thing to note here is that there are four spools instead of five, and that’s because one of the five aggregates was processed in the first part of the query plan we looked at.

The highlighted branch is the one that accounts for the majority of the execution time, at 19 minutes, 8 seconds. This branch is responsible for aggregating the PostId column. Apparently a lack of distinct values is hard to process.

But why is this so much slower? The answer is parallelism, or a lack thereof. So, serialism. Remember the 500 year old quote from above?

Another limitation is that this spool does not support parallel scan for reading, so the optimizer is very unlikely to restart parallelism after the spool (or any of its replay streams).

In queries that operate on large data sets, the parallelism implications of the spool plan can be the most important cause of poor performance.

Processing that many rows on a single thread is painful across all of the operators.

Flounder Edition

With SQL Server 2019, we get Batch Mode On Row store when compatibility level gets bumped up to 150.

The result is just swell.

The second query with the join still runs for nearly a minute, but 42 seconds of the process is scanning that big ol’ Posts table.

Grumpy face.

Thanks for reading!

Going Further

Performance Tuning Memory Queries For SQL Server

Posted on March 9, 2021May 16, 2022 by Erik Darling

I am a video

Thanks for watching!

Going Further

Database Context Is Everything In SQL Server

Posted on March 3, 2021May 16, 2022 by Erik Darling

Full Frontal

Going Further

SQL Server Queries Go Faster When They Don’t Touch Disk

Posted on March 2, 2021May 16, 2022 by Erik Darling

A Rears

GitHub scripts

Thanks for watching!

Going Further

How I Set Up To Tune SQL Server Stored Procedures

Posted on March 1, 2021May 16, 2022 by Erik Darling

I am a heading

Going Further

What’s Really Different About In-Memory Table Variables In SQL Server?

Posted on February 16, 2021May 16, 2022 by Erik Darling

Kendra, Kendra, Kendra

My dear friend Kendra asked… Okay, look, I might have dreamed this. But I maybe dreamed that she asked what people’s Cost Threshold For Blogging™ is. Meaning, how many times do you have to get asked a question before you write about it.

I have now heard people talking and asking about in-memory table variables half a dozen times, so I guess here we are.

Talking about table variables.

In memory.

Yes, Have Some

First, yes, they do help relieve tempdb contention if you have code that executes under both high concurrency and frequency. And by high, I mean REALLY HIGH.

Like, Snoop Dogg high.

Because you can’t get rid of in memory stuff, I’m creating a separate database to test in.

Here’s how I’m doing it!

CREATE DATABASE trash;

ALTER DATABASE trash 
ADD FILEGROUP trashy 
    CONTAINS MEMORY_OPTIMIZED_DATA ;
     
ALTER DATABASE trash 
ADD FILE 
(
    NAME=trashcan, 
    FILENAME='D:\SQL2019\maggots'
) 
TO FILEGROUP trashy;

USE trash;

CREATE TYPE PostThing 
AS TABLE
(
    OwnerUserId int,
    Score int,
    INDEX o HASH(OwnerUserId)
    WITH(BUCKET_COUNT = 100)
) WITH
(
    MEMORY_OPTIMIZED = ON
);
GO

Here’s how I’m testing things:

CREATE OR ALTER PROCEDURE dbo.TableVariableTest(@Id INT)
AS
BEGIN

    SET NOCOUNT, XACT_ABORT ON;
    
    DECLARE @t AS PostThing;
    DECLARE @i INT;

    INSERT @t 
        ( OwnerUserId, Score )
    SELECT 
        p.OwnerUserId,
        p.Score
    FROM Crap.dbo.Posts AS p
    WHERE p.OwnerUserId = @Id;

    SELECT 
        @i = SUM(t.Score)
    FROM @t AS t
    WHERE t.OwnerUserId = 22656
    GROUP BY t.OwnerUserId;

    SELECT 
        @i = SUM(t.Score)
    FROM @t AS t
    GROUP BY t.OwnerUserId;

END;
GO

Hot Suet

So like, the first thing I did was use SQL Query Stress to run this on a bunch of threads, and I didn’t see any tempdb contention.

So that’s cool. But now you have a bunch of stuff taking up space in memory. Precious memory. Do you have enough memory for all this?

Marinate on that.

Well, okay. Surely they must improve on all of the issues with table variables in some other way:

Modifications can’t go parallel
Bad estimates
No column level stats

But, nope. No they don’t. It’s the same crap.

Minus the tempdb contetion.

Plus taking up space in memory.

But 2019

SQL Server 2019 does offer the same table level cardinality estimate for in-memory table variables as regular table variables.

If we flip database compatibility levels to 150, deferred compilation kicks in. Great. Are you on SQL Server 2019? Are you using compatibility level 150?

Don’t get too excited.

Let’s give this a test run in compat level 140:

DECLARE @i INT = 22656;
EXEC dbo.TableVariableTest @Id = @i;

Switching over to compat level 150:

Candy Girl

So what do memory optimized table variables solve?

Not the problem that table variables in general cause.

They do help you avoid tempdb contention, but you trade that off for them taking up space in memory.

Precious memory.

Do you have enough memory?

Thanks for reading!

Going Further

When Index Sort Direction Matters For Query Performance In SQL Server

Posted on February 12, 2021May 16, 2022 by Erik Darling

Ever Helpful

I got a mailbag question recently about some advice that floats freely around the internet regarding indexing for windowing functions.

But even after following all the best advice that Google could find, their query was still behaving poorly.

Why, why why?

Ten Toes Going Up

Let’s say we have a query that looks something like this:

SELECT
    u.DisplayName,
    u.Reputation,
    p.Score, 
    p.PostTypeId
FROM dbo.Users AS u
JOIN
(
    SELECT
        p.Id,
    	p.OwnerUserId,
    	p.Score,
    	p.PostTypeId,
    	ROW_NUMBER() OVER
    	(
    	    PARTITION BY
    		    p.OwnerUserId,
    			p.PostTypeId
    		ORDER BY
    		    p.Score DESC
    	) AS n
    FROM dbo.Posts AS p
) AS p
    ON  p.OwnerUserId = u.Id
    AND p.n = 1
WHERE u.Reputation >= 500000
ORDER BY u.Reputation DESC,
         p.Score DESC;

Without an index, this’ll drag on forever. Or about a minute.

But with a magical index that we heard about, we can fix everything!

Ten Toes Going Down

And so we create this mythical, magical index.

CREATE INDEX bubble_hard_in_the_double_r
ON dbo.Posts
(
    OwnerUserId ASC, 
    PostTypeId ASC, 
    Score ASC
);

But there’s still something odd in our query plan. Our Sort operator is… Well, it’s still there.

Oddly, we need to sort all three columns involved in our Windowing Function, even though the first two of them are in proper index order.

OwnerUserId and PostTypeId are both in ascending order. The only one that we didn’t stick to the script on is Score, which is asked for in descending order.

Dram Team

This is a somewhat foolish situation, all around. One column being out of order causing a three column sort is… eh.

We really need this index, instead:

CREATE INDEX bubble_hard_in_the_double_r
ON dbo.Posts
(
    OwnerUserId ASC, 
    PostTypeId ASC, 
    Score DESC
);

Granted, I don’t know that I like this plan at all without parallelism and batch mode, but we’ve been there before.

Thanks for reading!

Going Further

Mind Your OUTPUT Targets In SQL Server, Some Of Them Hurt Query Performance

Posted on February 10, 2021May 16, 2022 by Erik Darling

Browser History

I’ve blogged about OUTPUT a couple times, and those posts are Still Accurate™

But it’s worth noting that, for the second post OUTPUT forced the query to run serially with no target; just returning data back to SSMS.

Depending on the query behind the putting of the out, parallelism could be quite important.

That’s why in the first post, the put out into a real table didn’t cause performance to suffer.

Of course, if you OUTPUT into a table variable, you still have to deal with table variables being crappy about modifications.

Samesies

If you compare the performance of queries that output into a @table variable vs one that outputs into a #temp table, you’ll see a difference:

Even though the parallel zone is limited here, there’s a big difference in overall query time. Scanning the Votes table singe-threaded vs. in parallel.

When you’re designing processes to be as efficient as possible, paying attention to details like this can make a big difference.

Thanks for reading!