Annoyances When Indexing For Windowing Functions In SQL Server

One Day


I will be able to not care about this sort of thing. But for now, here we are, having to write multiple blogs in a day to cover a potpourri of grievances.

Let’s get right to it!

First, without a where clause on any column in the Votes table (the filter on the generated row number doesn’t count), the optimizer doesn’t think that an index could improve one single, solitary metric about this query. We humans know better, though.

WITH Votes AS 
(
    SELECT
        v.Id,
        ROW_NUMBER() 
            OVER(PARTITION BY 
                     v.PostId 
                 ORDER BY 
                     v.CreationDate) AS n
    FROM dbo.Votes AS v
)
SELECT *
FROM Votes AS v
WHERE v.n = 0;

The tough part of this plan will be putting data in order to suit the Partition By, and then the Order By, in the windowing function.

Without any other clauses against columns in the Votes table, there are no additional considerations.

Two Day


What often happens is that someone wants to add an index to help the windowing function along, so they follow some basic guidelines they found on the internet.

What they end up with is an index on the Partition By, Order By, and then Covering any additional columns. In this case there are no additional Covering Considerations, so we can just do this:

CREATE INDEX v2 ON dbo.Votes(PostId, CreationDate);
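
If there were extra columns in the select list, the textbook version would also cover them. Just as a sketch, with BountyAmount standing in as a hypothetical covering column:

CREATE INDEX v2_covering
    ON dbo.Votes(PostId, CreationDate) -- Partition By, then Order By
    INCLUDE(BountyAmount); -- hypothetical covering column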

If you’ve been following my blog, you’ll know that indexes put data in order, and that with this index you can avoid needing to physically sort data.

[SQL Server query plan: limousine]

Three Day


The trouble here is that, even though we have Cost Threshold For Parallelism (CTFP) set to 50, and the plan costs around 195 Query Bucks, it doesn’t go parallel.

Creating the index shaves about 10 seconds off the ordeal, but now we’re stuck with this serial calamity, and… forcing it parallel doesn’t help.
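
For the curious, “forcing it parallel” means something like the sketch below, using an undocumented hint that’s fine for testing but not something to lean on in production:

WITH Votes AS 
(
    SELECT
        v.Id,
        ROW_NUMBER() 
            OVER(PARTITION BY 
                     v.PostId 
                 ORDER BY 
                     v.CreationDate) AS n
    FROM dbo.Votes AS v
)
SELECT *
FROM Votes AS v
WHERE v.n = 0
OPTION(USE HINT('ENABLE_PARALLEL_PLAN_PREFERENCE')); -- undocumented; testing only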

Our old nemesis, repartition streams, is back.

[SQL Server query plan: wackness]

Even at DOP 8, we only end up about 2 seconds faster. That’s not a great use of parallelism, and the whole problem sits in the repartition streams.

This is, just like we talked about yesterday, a row mode problem. And just like we talked about the day before that, windowing functions generally do benefit from batch mode.
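
If you’re stuck on a version without batch mode on rowstore, one hedged way to coax batch mode for this query is the empty columnstore join trick that shows up later in this series:

CREATE TABLE #bm(id int NULL, INDEX c CLUSTERED COLUMNSTORE);

WITH Votes AS 
(
    SELECT
        v.Id,
        ROW_NUMBER() 
            OVER(PARTITION BY 
                     v.PostId 
                 ORDER BY 
                     v.CreationDate) AS n
    FROM dbo.Votes AS v
    LEFT JOIN #bm AS b
        ON 1 = 0 -- never true; only here to make batch mode available
)
SELECT *
FROM Votes AS v
WHERE v.n = 0;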

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

An Overlooked Benefit Of Batch Mode For Parallel Query Plans In SQL Server

Make It Count


When queries go parallel, you want them to be fast. Sometimes they are, and it’s great.

Other times they’re slow, and you end up staring helplessly at a repartition streams operator.

[SQL Server query plan: brick wall]

Sometimes you can reduce the problem with higher DOP hints, or better indexing, but overall it’s a crappy situation.

Snap To


Let’s admire a couple of familiar-looking queries, because that’s been working really well for us so far.

WITH Comments AS 
(
    SELECT
        ROW_NUMBER() 
            OVER(PARTITION BY 
                     c.UserId
                 ORDER BY 
                     c.CreationDate) AS n
    FROM dbo.Comments AS c
)
SELECT *
FROM Comments AS c
WHERE c.n = 0
OPTION(USE HINT('QUERY_OPTIMIZER_COMPATIBILITY_LEVEL_140'));

WITH Comments AS 
(
    SELECT
        ROW_NUMBER() 
            OVER(PARTITION BY 
                     c.UserId
                 ORDER BY 
                     c.CreationDate) AS n
    FROM dbo.Comments AS c
)
SELECT *
FROM Comments AS c
WHERE c.n = 0
OPTION(USE HINT('QUERY_OPTIMIZER_COMPATIBILITY_LEVEL_150'));

One is going to run in compatibility level 140, the other in 150, as foretold by ancient alien prophecy.

The two query plans will have a bit in common, but…

[SQL Server query plan: just batch]

The second query, which runs in batch mode, runs about 15 seconds faster. One big reason why is that we skip that most unfortunate repartition streams operator.

It’s a cold sore. An actual factual cold sore.

The only ways I’ve found to fix it completely are:

  • Induce batch mode
  • Use the parallel apply technique

But the parallel apply technique doesn’t help much here, because of local factors.

In this case, the local factor is me generating the largest possible result set and then filtering it down to nothing at the end.
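
For reference, the parallel apply technique is roughly this shape of rewrite. A sketch, assuming every UserId in Comments maps back to a row in dbo.Users:

SELECT ca.n
FROM dbo.Users AS u
CROSS APPLY
(
    SELECT
        ROW_NUMBER() 
            OVER(ORDER BY 
                     c.CreationDate) AS n
    FROM dbo.Comments AS c
    WHERE c.UserId = u.Id -- the Partition By becomes a correlation
) AS ca
WHERE ca.n = 0;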

Thanks for reading!


An Overlooked Benefit Of Batch Mode With Windowing Functions In SQL Server

Lavender


If you ask people who tune queries why batch mode is often much more efficient with windowing functions, they’ll tell you about the window aggregate operator.

That’s all well and good, but there’s another, often sneaky limitation of fully row mode execution plans with windowing functions.

Let’s go take a look!

Global Aggregates


One thing that causes an early serial zone in execution plans is using a windowing function that only has an order by.

For example, let’s look at the plans for these two queries:

WITH Comments AS 
(
    SELECT
        ROW_NUMBER() 
            OVER(ORDER BY 
                     c.CreationDate) AS n
    FROM dbo.Comments AS c
)
SELECT *
FROM Comments AS c
WHERE c.n = 0;

WITH Comments AS 
(
    SELECT
        ROW_NUMBER() 
            OVER(PARTITION BY 
                     c.UserId
                 ORDER BY 
                     c.CreationDate) AS n
    FROM dbo.Comments AS c
)
SELECT *
FROM Comments AS c
WHERE c.n = 0;

The resulting estimated plans look like this, using the 140 compatibility level:

[SQL Server query plan: oops]

In the top plan, where the windowing function only has an order by, the serial zone happens immediately before the Segment operator. In the second plan, the parallel zone carries on until right before the select operator.

If you’re wondering why we’re only looking at estimated plans here, it’s because repartition streams ruins everything.

In The Year 2000


In compatibility level 150, things change a bit (yes, a window aggregate appears):

[SQL Server query plan: merry christmas]

And the window aggregate appears within the parallel zone. The parallel zone does end before the filter operator, which may or may not be a disaster depending on how restrictive your filter is, and how many rows end up at it.

Also note the distinct lack of a repartition streams operator ruining everything. We’ll talk about that tomorrow.

Thanks for reading!


Batch Mode On Row Store vs Batch Mode Tricks For Performance Tuning SQL Server Queries

Quiet Time


I think Batch Mode is quite spiffy for the right kind of query, but up until SQL Server 2019, we had to play some tricks to get it:

  • Do a funny join to an empty table with a column store index
  • Create a filtered column store index with no data in it
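
The second trick, the filtered columnstore index that can never hold data, looks something like this sketch. The contradictory filter is what keeps it empty:

CREATE NONCLUSTERED COLUMNSTORE INDEX ncci
    ON dbo.Posts(Id)
    WHERE Id = -1 AND Id = -2; -- contradiction: no row can ever qualify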

If you’re on SQL Server 2019 Enterprise Edition, and you’ve got your database in compatibility level 150, you may heuristically receive Batch Mode without those tricks.

One important difference between Batch Mode Tricks™ and Batch Mode On Rowstore (BMOR) is that the latter allows you to read from row mode tables using Batch Mode, while the former doesn’t.

Tricks have limits, apparently.

Squish Squish


To cut down on typing, I’ll often create a helper object like this:

CREATE TABLE dbo.t
(
    id int NULL,
    INDEX c CLUSTERED COLUMNSTORE
);

If you read this post, you’ll understand more about why.

Now, let’s compare these two queries:

SELECT 
    p.OwnerUserId,
    COUNT_BIG(*) AS records
FROM dbo.Posts AS p
LEFT JOIN dbo.t
    ON 1 = 0
WHERE p.Score < 50
GROUP BY p.OwnerUserId
HAVING COUNT_BIG(*) > 2147483647
OPTION(USE HINT('QUERY_OPTIMIZER_COMPATIBILITY_LEVEL_140'), MAXDOP 8);

SELECT 
    p.OwnerUserId,
    COUNT_BIG(*) AS records
FROM dbo.Posts AS p
WHERE p.Score < 50
GROUP BY p.OwnerUserId
HAVING COUNT_BIG(*) > 2147483647
OPTION(USE HINT('QUERY_OPTIMIZER_COMPATIBILITY_LEVEL_150'), MAXDOP 8);

One executes in compatibility level 140, the other in 150.

Splish Splash


There are a couple of interesting things here.

[SQL Server query plan: the porter]

Even though both queries have operators that execute in Batch Mode (Filter, Hash Match), only the second query can read from the row store clustered index in Batch Mode. In this case, that shaves a couple hundred milliseconds off the seek.

There is likely some additional invisible benefit to not having to convert the row mode seek to a batch mode hash join at the next operator, since one executes for 501ms, and the other executes for 278ms. There’s nothing in the query plan to signal that happening, so you’ll just have to use your imagination.

Thanks for reading!


Making The Most Of Temp Tables Part 4: Batch Mode

Le Big Zoom Zoom


When you have queries that need to process a lot of data, and probably do some aggregations over that lot-of-data, batch mode is usually the thing you want.

Originally introduced to accompany column store indexes, it works by allowing CPUs to apply instructions to up to 900 rows at a time.

It’s a great thing to have in your corner when you’re tuning queries that do a lot of work, especially if you find yourself dealing with pesky parallel exchanges.

Oh, Yeah


One way to get that to happen is to use a temp table with a column store index on it.

SELECT 
    v.UserId, 
    SUM(v.BountyAmount) AS SumBounty
FROM dbo.Comments AS c
JOIN dbo.Votes AS v
    ON  v.PostId = c.PostId
    AND v.UserId = c.UserId
GROUP BY v.UserId
ORDER BY SumBounty DESC;

CREATE TABLE #t(id INT, INDEX c CLUSTERED COLUMNSTORE);

SELECT 
    v.UserId, 
    SUM(v.BountyAmount) AS SumBounty
FROM dbo.Comments AS c
JOIN dbo.Votes AS v
    ON  v.PostId = c.PostId
    AND v.UserId = c.UserId
LEFT JOIN #t AS t 
    ON 1 = 0
GROUP BY v.UserId
ORDER BY SumBounty DESC;

Keep in mind, this trick won’t work if you’re on SQL Server 2019 and using memory-optimized tempdb metadata. But aside from that, you’re free to rock and roll with it.

If you end up using this enough, you may just wanna create a real table to use, anyway.

Remarkable!


If we look at the end (or beginning, depending on how you read your query plans) just to see the final times, there’s a pretty solid difference.

[SQL Server query plan: you can’t make me]

The first query takes around 10 seconds, and the second query takes around 4 seconds. That’s a pretty handsome improvement without touching anything else.

Thanks for reading!


Making The Most Of Temp Tables Part 3: More Opportune Indexes To Make Queries Go Faster

I Know You


You have too many indexes on too many tables already, and the thought of adding more fills you with a dread that has a first, middle, last, and even a confirmation name.

This is another place where temp tables can save your bacon, because as soon as the query is done they basically disappear.

Forever. Goodbye.

Off to buy a pack of smokes.

That Yesterday


In yesterday’s post, we looked at how a temp table can help you materialize an expression that would otherwise be awkward to join on.

If we take that same query, we can see how using the temp table simplifies indexing.

SELECT
    p.OwnerUserId,
    SUM(p.Score) AS TotalScore,
    COUNT_BIG(*) AS records,
    CASE WHEN p.PostTypeId = 1 
         THEN p.OwnerUserId
         WHEN p.PostTypeId = 2
         THEN p.LastEditorUserId
    END AS JoinKey
INTO #Posts
FROM dbo.Posts AS p
WHERE p.PostTypeId IN (1, 2)
AND   p.Score > 100
GROUP BY CASE
             WHEN p.PostTypeId = 1 
             THEN p.OwnerUserId
             WHEN p.PostTypeId = 2 
             THEN p.LastEditorUserId
         END,
         p.OwnerUserId;

CREATE CLUSTERED INDEX c ON #Posts(JoinKey);

SELECT *
FROM #Posts AS p
WHERE EXISTS
(
    SELECT 1/0
    FROM dbo.Users AS u
    WHERE p.JoinKey = u.Id
);

Rather than worry about how to handle a bunch of columns across the where, join, and select, we can just stick a clustered index on the one column we care about doing anything relational with to get the final result.

Thanks for reading!


Making The Most Of SQL Server Temporary Tables Part 2: Materializing Expressions

Bad Data


A lot of the time, when I see queries written with all sorts of gymnastics in the join or where clause and ask some questions about them, people usually start complaining about the design of the table.

That’s fine, but when I ask about changing the design, everyone gets quiet. Normalizing tables, especially for Applications Of A Certain Age™, can be a tremendously painful project. This is why it’s worth it to get things right the first time. Simple!

Rather than make someone redesign their schema in front of me, oftentimes a temp table is a good workaround.

Egg Splat


Let’s say we have a query that looks like this. Before you laugh, and you have every right to laugh, keep in mind that I see queries like this all the time.

They don’t have to be this weird to qualify. You can try this if you have functions like ISNULL, SUBSTRING, REPLACE, or whatever in joins and where clauses, too.

SELECT
    p.OwnerUserId,
    SUM(p.Score) AS TotalScore,
    COUNT_BIG(*) AS records
FROM dbo.Users AS u
JOIN dbo.Posts AS p
    ON u.Id = CASE 
                   WHEN p.PostTypeId = 1 
                   THEN p.OwnerUserId
                   WHEN p.PostTypeId = 2
                   THEN p.LastEditorUserId
              END
WHERE p.PostTypeId IN (1, 2)
AND   p.Score > 100
GROUP BY p.OwnerUserId;

There’s not a great way to index for this, and sure, we could rewrite it as a UNION ALL, but then we’d have two queries to index for.
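
For the record, that UNION ALL rewrite would look something like this sketch, with each branch wanting its own index:

SELECT
    x.OwnerUserId,
    SUM(x.Score) AS TotalScore,
    COUNT_BIG(*) AS records
FROM
(
    SELECT p.OwnerUserId, p.Score
    FROM dbo.Users AS u
    JOIN dbo.Posts AS p
        ON u.Id = p.OwnerUserId
    WHERE p.PostTypeId = 1
    AND   p.Score > 100

    UNION ALL

    SELECT p.OwnerUserId, p.Score
    FROM dbo.Users AS u
    JOIN dbo.Posts AS p
        ON u.Id = p.LastEditorUserId
    WHERE p.PostTypeId = 2
    AND   p.Score > 100
) AS x
GROUP BY x.OwnerUserId;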

Sometimes getting people to add indexes is hard, too.

People are weird. All day weird.

Steak Splat


You can replace it with a query like this, which also allows you to index a single column in a temp table to do your correlation.

SELECT
    p.OwnerUserId,
    SUM(p.Score) AS TotalScore,
    COUNT_BIG(*) AS records,
    CASE WHEN p.PostTypeId = 1 
         THEN p.OwnerUserId
         WHEN p.PostTypeId = 2
         THEN p.LastEditorUserId
    END AS JoinKey
INTO #Posts
FROM dbo.Posts AS p
WHERE p.PostTypeId IN (1, 2)
AND   p.Score > 100
GROUP BY CASE
             WHEN p.PostTypeId = 1 
             THEN p.OwnerUserId
             WHEN p.PostTypeId = 2 
             THEN p.LastEditorUserId
         END,
         p.OwnerUserId;

SELECT *
FROM #Posts AS p
WHERE EXISTS
(
    SELECT 1/0
    FROM dbo.Users AS u
    WHERE p.JoinKey = u.Id
);

Remember that temp tables are like a second chance to get schema right. Don’t waste those precious chances.

Thanks for reading!


Making The Most Of Temp Tables In SQL Server Part 1: Fully Parallel Inserts

Sing Along


If you have a workload that uses #temp tables to stage intermediate results, and you probably do because you’re smart, it might be worth taking advantage of being able to insert into the #temp table in parallel.

Remember that you can’t insert into @table variables in parallel, unless you’re extra sneaky. Don’t start.

If your code is already using the SELECT ... INTO #some_table pattern, you’re probably already getting parallel inserts. But if you’re following the INSERT ... SELECT ... pattern, you’re probably not, and, well, that could be holding you back.

Pile On


Of course, there are some limitations. If your temp table has indexes, primary keys, or an identity column, you won’t get the parallel insert no matter how hard you try.

The demo code is available here if you’d like to test it out.

[SQL Server query plan: amanda lear]

The first thing to note is that inserting into an indexed temp table, parallel or not, does slow things down. If your goal is the fastest possible insert, you may want to create the index later.
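
In sketch form, that order of operations looks like this: land the data in a bare heap first, then build the index once the rows are in.

SELECT p.Id, p.Score
INTO #p -- a brand new heap keeps the insert parallel-eligible
FROM dbo.Posts AS p;

CREATE CLUSTERED INDEX c ON #p(Id); -- index after the load, not before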

No Talent


When it comes to parallel inserts, you do need the TABLOCK or TABLOCKX hint to get them, e.g. INSERT #tp WITH(TABLOCK), which is sort of annoying.
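
Spelled out, a minimal sketch of the pattern, with #tp standing in for whatever staging table you’re loading:

CREATE TABLE #tp(Id int NOT NULL, Score int NULL);

INSERT #tp WITH(TABLOCK) -- TABLOCK makes the insert eligible to go parallel
    (Id, Score)
SELECT p.Id, p.Score
FROM dbo.Posts AS p;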

But you know. It’s the little things we do that often end up making the biggest differences. Another little thing we may need to tinker with is DOP.

[SQL Server query plan: little pigs]

Here are the query plans for 3 fully parallel inserts into an empty, index-less temp #table. Note the execution times dropping as DOP increases. At DOP 4, the insert really isn’t any faster than the serial insert.

If you start experimenting with this trick, and don’t see noticeable improvements at your current DOP, you may need to bump it up to see throughput increases.

Also remember that if you’re doing this with clustered column store indexes, it can definitely make things worse.

Page Supplier


Though the speed-ups above at higher DOPs are largely efficiency boosters while reading from the Posts table, the speed does stay consistent through the insert.

If we crank one of the queries that gets a serial insert up to DOP 12, we lose some speed when we hit the table.

[SQL Server query plan: oops]

Next time you’re tuning a query and want to drop some data into a temp table, you should experiment with this technique.

Thanks for reading!


An Edge Case When Working With Date Parameters In SQL Server

Wheeze Man


When people tell you that working with correct data types is important, it’s for a variety of very good reasons.

Not only can you avoid performance issues, but you can avoid strange query plan distractions, too.

Let’s look at an example for when you use date parameters against datetime columns.

Wrong And Mean


Index from outta nowhere pow!

CREATE INDEX pe ON dbo.Posts(LastEditDate);

The important thing about the LastEditDate column in the Posts table is that it’s nullable.

Not all posts will get edited. Especially mine. They’re always correct the first time.

Basically read only, if we’re being honest about things.

Or maybe it’s about 50/50.

If your columns aren’t nullable, you’ll run into far fewer problems and ambiguities.

I’d like a new data type called ABYSS. Or maybe VOID.

The Problem: Wrong Data Type And NULL Checks


DECLARE @d date = '20170601';
DECLARE @sql nvarchar(MAX) = N'
SELECT
    COUNT_BIG(*) AS records
FROM dbo.Posts AS p
WHERE p.LastEditDate > @d
AND   p.LastEditDate IS NOT NULL;'

EXEC sp_executesql @sql, 
                   N'@d date', 
                   @d;
GO

If we pass in a parameter that has a date datatype, rather than datetime, an odd thing will happen if we add in a redundant IS NOT NULL check.

[SQL Server query plan: yortsed]

The seek predicate will only seek to the first non-NULL value, rather than immediately to the start of the range of dates we care about, which means we end up reading a lot more rows than necessary.

Note the query runtime of 743 milliseconds, and that we end up reading quite a few more rows than we return.

And here I was told Seeks are always efficient?

Solution One: Stop Checking For NULLs


If we stop checking for NULLs, we’ll get around the issue.

DECLARE @d date = '20170601';
DECLARE @sql nvarchar(MAX) = N'
SELECT
    COUNT_BIG(*) AS records
FROM dbo.Posts AS p
WHERE p.LastEditDate > @d;'

EXEC sp_executesql @sql, 
                   N'@d date', 
                   @d;
GO

The plan for this query looks a bit different, but performance is no worse for the wear.

[SQL Server query plan: still using the wrong datatype]

Note the 25 millisecond execution time. A clear improvement over the 743 milliseconds above. Though the query plan does look a bit odd.

The compute scalar gins up a date range, which is checked in the seek:

[SQL Server query plan: HELLO COMPUTER]

I wonder what Expr1002 is up to.

Solution Two: Just Use The Right Datatype To Begin With


In reality, this is what we should have done from the start, but the whole point of this here blog post is to show you what can happen when you Do The Wrong Thing™.

When we use the right datatype, we get a simple plan that executes quickly, regardless of the redundant NULL check.

DECLARE @d date = '20170601';
DECLARE @sql nvarchar(MAX) = N'
SELECT
   COUNT_BIG(*) AS records
FROM dbo.Posts AS p
WHERE p.LastEditDate > @d
AND   p.LastEditDate IS NOT NULL;'

EXEC sp_executesql @sql, 
                   N'@d datetime', 
                   @d;

[SQL Server query plan: no fuss, no muss]

Here, the NULL check is a residual predicate rather than the Seek predicate, which results in a seek that really seeks instead of just meandering past some NULLs.

[SQL Server query plan: gerd jerb]

Thanks for reading!


Why You Shouldn’t Ignore Filter Operators In SQL Server Query Plans Part 2

If You Remember Part 1


We looked at a couple examples of when SQL Server might need to filter out rows later in the plan than we’d like, and why that can cause performance issues.

Now it’s time to look at a few more examples, because a lot of people find them surprising.

As much as I love surprising people, sometimes I’d much rather… not have to explain this stuff later.

Since all my showering and errands are out of the way, we should be able to get through this list uninterrupted.

Unless I get thirsty.

Max Datatypes


If we need to search a column that has a MAX datatype, or if we define a parameter as being a MAX datatype and search a more sanely typed column with it, both will result in a later filter operation than we may care for.

SELECT 
    COUNT_BIG(*) AS records
FROM dbo.Users AS u
WHERE u.AboutMe = N'Hi';

DECLARE @Location nvarchar(MAX) = N'here';
SELECT 
    COUNT_BIG(*) AS records 
FROM dbo.Users AS u 
WHERE u.Location = @Location 
OPTION(RECOMPILE);

Even with a recompile hint!

[SQL Server query plan: opportunity knocked]

Here we can see the value of properly defining string widths! If we don’t, we may end up reading entire indexes, and doing the work to weed out rows later.

Probably something that should be avoided.
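
Assuming Location is defined as nvarchar(100), sizing the parameter to match the column avoids the problem entirely:

DECLARE @Location nvarchar(100) = N'here'; -- matches the column definition
SELECT 
    COUNT_BIG(*) AS records 
FROM dbo.Users AS u 
WHERE u.Location = @Location 
OPTION(RECOMPILE);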

Functions


There are some built-in functions, like DATALENGTH, which can’t be pushed when used in a where clause.

Of course, if you’re going to do this regularly, you should be using a computed column to get around the issue, but whatever!

SELECT 
    COUNT_BIG(*) AS records
FROM dbo.Users AS u
WHERE DATALENGTH(u.Location) > 0;

[SQL Server query plan: measuring up]
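
Going back to the computed column fix mentioned above: a sketch, assuming you’re free to alter dbo.Users. SQL Server can usually match the DATALENGTH expression to the computed column on its own:

ALTER TABLE dbo.Users
    ADD LocationLength AS DATALENGTH(Location);

CREATE INDEX ix_location_length
    ON dbo.Users(LocationLength);

SELECT 
    COUNT_BIG(*) AS records
FROM dbo.Users AS u
WHERE DATALENGTH(u.Location) > 0; -- can now use the index on the computed column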

And of course, everyone’s favorite love-to-hate, the scalar UDF.

Funny thing about these is that sometimes tiny bumps in the number of rows you’re after can make for big jumps in time.

SELECT TOP (165)
    u.Id,
    u.DisplayName
FROM dbo.Users AS u
WHERE dbo.ScalarFunction(u.Id) > 475
ORDER BY u.Id;

SELECT TOP (175)
    u.Id,
    u.DisplayName
FROM dbo.Users AS u
WHERE dbo.ScalarFunction(u.Id) > 475
ORDER BY u.Id;

[SQL Server query plan: 10 more rows, 5 more seconds]

Complexity


Sometimes people (and ORMs) will build up long parameter lists, and use them to build up a long IN clause list, and sometimes even a long OR clause list.

To replicate that behavior, I’m using code I’m keeping on GitHub in order to keep this blog post a little shorter.

To illustrate where things can get weird beyond just the Filter, I’m going to run this with a few different numbers of parameters.

EXEC dbo.Longingly @loops = 15;
EXEC dbo.Longingly @loops = 18;
EXEC dbo.Longingly @loops = 19;

This will generate queries with different length IN clauses:

[SQL Server missing index request: bigger than others]

Which will result in slightly different query plans:

[SQL Server query plan: THREE!]

We can see some tipping points here.

  • At 15 parameters, we get a scan with a stream aggregate
  • At 18 parameters, we get a scan with a filter
  • At 19 parameters, we get a parallel scan with a filter

Parallelism to the rescue, again, I suppose.

Thanks for reading!
