Joe Obbish Archives | Darling Data

How to Add Clippy Support to SSMS

Posted on April 1, 2026March 30, 2026 by Joe Obbish

Step 1: Install the Clippy digital assistant SSMS add-on

Go to github and grab the code. Clippy wants you to be on a recent version of SQL Server so only SQL Server 2022+ is supported.

Step 2: Create your first Clippy alert and schedule it

For illustrative purposes, below is a very basic check that throws an alert if CPU >= 80%

DECLARE
    @CPUFromBuffer INT,
    @LogMessage VARCHAR(100);

select TOP (1) @CPUFromBuffer = record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int')
from (
    select timestamp, convert(xml, record) as record
    from sys.dm_os_ring_buffers
    where ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR'
    and record like '%<SystemHealth>%'
) as RingBufferInfo
ORDER BY timestamp DESC;

IF @CPUFromBuffer >= 80
BEGIN
    SET @LogMessage = 'Your SQL Server is using ' + CAST(@CPUFromBuffer AS VARCHAR(10)) + '% of available CPU.'
    EXEC dbo.AddPaperClipToErrorLog @LogMessage;
END;

Step 3: Trigger the alert

On my local machine with 8 physical cores I can simply run the following query and wait:

 WITH vCTE AS (
      SELECT v.v
      FROM
      (VALUES
      (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41),(42),(43),(44),(45),(46),(47),(48),(49),(50),(51),(52),(53),(54),(55),(56),(57),(58),(59),(60),(61),(62),(63),(64),(65),(66),(67),(68),(69),(70),(71),(72),(73),(74),(75),(76),(77),(78),(79),(80),(81),(82),(83),(84),(85),(86),(87),(88),(89),(90),(91),(92),(93),(94),(95)
      ) v(v)
)
SELECT MIN(v0.v + ca.v)
FROM vCTE v0
CROSS APPLY (
      SELECT MIN(v1.v + v2.v + v3.v + v4.v + v5.v) v
      FROM vCTE v1
      CROSS JOIN vCTE v2
      CROSS JOIN vCTE v3
      CROSS JOIN vCTE v4
      CROSS JOIN vCTE v5
      WHERE v1.v > (v0.v - 96)
) ca
OPTION (NO_PERFORMANCE_SPOOL, MAXDOP 8, USE HINT (N'ENABLE_PARALLEL_PLAN_PREFERENCE'));

Step 4: Check for alerts in the error log

a69 paperclip

Hyper-threading on Azure VMs is a SQL Licensing Trap

Posted on March 26, 2026March 26, 2026 by Joe Obbish

Azure VMs with hyper-threading enabled are sized according to logical cores instead of physical cores. These logical cores can perform with 50% of the power of physical cores for high levels of activity but will always perform at 100% of the SQL Server licensing cost rate. As a result, moving from a busy on-premises SQL Server VM sized to an Azure VM with hyper-threading enabled can result in a surprise SQL Server licensing bill.

The Evidence

In my opinion, the documentation is fairly clear that logical cores are what’s for sale here, but there seems to be a fair amount of confusion on this topic. Hopefully one of these five exhibits is able to convince the reader.

Exhibit A: Hyper-threading can be disabled and it cuts the vCPU count in half

It is possible to disable hyper-threading within the VM as demonstrated here. The number of available vCPUs within the VM is reduced by half after hyper-threading is disabled. Conveniently, the blog post also provides a command line method to see physical core and virtual core count. The physical core count stays the same between configurations.

Exhibit B: Marketing language used for VMs that don’t allow SMT

Quoting from the Famsv7 series document, with emphasis mine:

The Famsv7-series utilizes AMD’s 5th Generation EPYC™ 9005 processor that can achieve a boosted maximum frequency of up to 4.5 GHz. The Famsv7-series VM comes without Simultaneous Multithreading (SMT), meaning a vCPU is now mapped to a full physical core, allowing software processes to run on dedicated and uncontested resources. These new full core VMs suit workloads demanding the highest CPU performance. Famsv7-series offers up to 80 full core vCPUs and 640 GiB of RAM.

These documents will present the VMs in the best possible light, and there sure is a lot of writing stressing physical cores. There’s no similar language for the document describing the Ebdsv5 and Ebsv5 series.

Exhibit C: Microsoft’s AI helper

a68 ai bad

a68 ai good

Check
out
my
superior
use
of
bullet
points
AI
could
never

Exhibit D: Not enough sockets

The Ebdsv5 and Ebsv5 series is backed by the Intel® Xeon® Platinum 8370C (Ice Lake) processor. This processor has a physical core count of 32, a logical core count of 64, and can support up to a two socket configuration. Microsoft offers VMs above 64 vCPU, so presumably the the vCPU count cannot represent physical cores. The only alternative is some kind of super NUMA that is stitching together sockets from different physical hosts into a single VM, and it’s best not to think about the implications of that.

Exhibit E: Simple scaling testing on the Standard_E8bds_v5

The scalability tests described here were also run against a Standard_E8bds_v5 Azure VM. No special configuration was done for the VM. Tests were run after installing SQL Server 2022 Developer Edition and SSMS. The scaling of Standard_E8bds_v5 Azure acts as expected of a VM with 4 physical cores and 8 virtual cores:

The Difference

Consider a small company using VMware which is able to fit its entire production workload onto a two socket host with little to no oversubscription of CPU resources. Suppose that company has a single 8 vCPU VM running SQL Server and owns SQL licenses for 8 cores. ESXI will preferentially schedule those 8 vCPU onto 8 different physical cores, so the VM will likely scale as if it has 8 physical cores:

ESXi hosts manage processor time intelligently to guarantee that load is spread smoothly across processor cores in the system. Logical processors on the same core have consecutive CPU numbers, so that CPUs 0 and 1 are on the first core together, CPUs 2 and 3 are on the second core, and so on. Virtual machines are preferentially scheduled on two different cores rather than on two logical processors on the same core.

Of course, this is not guaranteed behavior. However, SQL licenses can be expensive, so it can often make business sense to make sure that the SQL Server VM is able to get as much physical CPU power as allowed by the vCPU count. This is entirely different conceptually compared to getting a VM in the cloud. The rented cloud VM will run on a physical host with other VMs from different customers. Cloud providers carefully limit resources available to each VM to make things fair for everyone instead of giving as much resources as possible to the VMs with expensive software licensing costs per core.

The Trap

Mapping an on-premises VM to the closest Azure VM in terms of CPU power requires careful analysis accounting for differences in clock speed, processor generation, resource usage within the VM, oversubscription of the VM host, and other factors. To keep things simple, imagine going dumpster diving on Microsoft property and building a single socket Intel® Xeon® Platinum 8370C (Ice Lake) server. All of the configuration scenarios in the table below are appropriately sized for their SQL Server workloads and there is no oversubscription of CPU on the VM host, when applicable. Matching the effective physical CPU count in Azure VMs requires doubling the SQL licenses for many different configurations of the OS running SQL Server:

a68 trap

This could be quite the expensive surprise for SQL Server Enterprise Edition.

The Escape

Paying monthly for new licenses instead of bringing your own licenses with software assurance does not avoid this problem. Looking today at US East, the F16ams v7 model costs
$5962.64 per month for a SQL Enterprise VM and $1810.40 per month for a Windows OS VM, for an effective SQL licensing cost of $259.52 per physical core. The E16bds v5 model costs $5,892.56 per month for a SQL Enterprise VM and $1,512.56 per month for a Windows OS VM, for an effective SQL licensing cost of $547.50 per physical core. Naturally, the license costs per vCore are quite similar between models.

Constrained vCPU sizes for database workloads can be used to reduce licensing costs, but this feature is designed to offer better memory, storage, and I/O bandwidth at lower CPU counts. It is unlikely that reducing the vCPU size in this way will somehow lead to physical core scaling instead of vCore scaling behavior. However, note that this configuration was not directly tested.

Sadly, there is no recent licensing update that allows for CPU affinity to be used to reduce licensing costs. All of the vCPUs exposed within the VM OS must be licensed for SQL Server.

There are a few VM series offered within the Azure VM family which do not offer SMT. This directly resolves the license trap, but there aren’t many choices available at the time of publication for this blog post. The Famsv7 series mentioned earlier is one modern example, and perhaps this is a great CPU for SQL Server workloads, but its VMs do not offer any local temporary storage.

It is otherwise possible to disable hyper-threading within the VM using a preview feature. This will reduce licensing costs for Windows OS VMs for which a SQL Server license is supplied by the user, but will not reducing licensing costs when paying directly for new licenses by renting a “SQL Server on Azure Windows” VM. Disabling hyper-threading may not be straightforward to do and hesitance to rely on preview features is understandable, but the potential license cost savings here can be huge depending on the size of the VM.

Final Thoughts

Some cloud vendors offer virtual machines with “hyper-threaded cores” which seems to allow them to offer bigger vCPU counts for a lower price. Unfortunately, a virtual machine with lots of weak vCPUs is not cost-effective for software with a high license cost per vCPU, such as SQL Server Enterprise Edition. Be sure to understand what each cloud vendor’s vCPU count truly represents in terms of supporting your workload. Thanks for reading!

Emulating GETDATE() on Azure SQL Database

Posted on February 23, 2026March 1, 2026 by Joe Obbish

GETDATE() always uses the UTC time zone on Azure SQL Database which can be a compatibility issue for applications that assume that GETDATE() is using a different time zone. Developers may wish to replace GETDATE() with code that continues to use their expected time zone to avoid UTC time reforms. For example, I’ve seen application code with over 50,000 references to the GETDATE function. Replacing all of those calls with GETUTCDATE() would be a herculean effort.

FAST 1

I don’t know how good Erik’s SEO is, but here’s a simple replacement if you don’t care about performance and just need something that gives you the previous behavior of GETDATE():

CREATE OR ALTER FUNCTION dbo.[GetDateNew]()
RETURNS DATETIME
WITH SCHEMABINDING
AS
BEGIN
    RETURN SYSDATETIMEOFFSET() AT TIME ZONE N'Central Standard Time';
END;
GO

Keep reading if you’d like to do better than that.

AT TIME ZONE At Risk

AT TIME ZONE was a great addition to SQL Server 2016 that allowed for the retirement of many well-meaning but poorly implemented attempts to do time zone conversions in T-SQL. However, it was not without its own problems:

The CPU cost per execution was surprisingly high, but this was addressed in SQL Server 2022 and in Azure SQL Database.
AT TIME ZONE makes scalar UDFs ineligible for inlining.
AT TIME ZONE usage results in an “unknown” cardinality estimate.

You can see the cardinality estimate issue in action by populating a simple table with about 6 million rows.

DROP TABLE IF EXISTS dbo.Level100;

CREATE TABLE dbo.Level100 (
    TenantID INT NOT NULL,
    InsertTime DATETIME NOT NULL
);

INSERT INTO dbo.Level100 (TenantID, InsertTime)
SELECT q.RN % 4, DATEADD(MINUTE, -1 * RN, GETDATE())
FROM
(
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
    FROM master..spt_values t1
    CROSS JOIN master..spt_values t2
) q;

CREATE INDEX TenantID_InsertTime ON dbo.Level100 (TenantID) INCLUDE (InsertTime);

CREATE INDEX InsertTime_TenantID ON dbo.Level100 (InsertTime) INCLUDE (TenantID);

The first query uses GETDATE():

SELECT COUNT(*)
FROM dbo.Level100 l
WHERE l.TenantID = 2
AND l.InsertTime > DATEADD(MINUTE, -10, GETDATE())
OPTION (MAXDOP 1);

The filter against TenantID is expected to return 25% of the rows in the table and the filter against InsertTime is expected to return less than 0.001% of the rows. As expected, the query optimizer chooses to filter against the InsertTime_TenantID index and the query finishes instantly:

a66 good

The second query uses AT TIME ZONE:

SELECT COUNT(*)
FROM dbo.Level100 l
WHERE l.TenantID = 2
AND l.InsertTime > DATEADD(MINUTE, -10, CAST(SYSDATETIMEOFFSET() AT TIME ZONE N'Central Standard Time' AS DATETIME))
OPTION (MAXDOP 1);

Once again, the filter against TenantID is expected to return 25% of the rows in the table. However, the presence of AT TIME ZONE results in an unknown inequality estimate against InsertTime which is 30%. The query optimizer makes a different choice and goes with the TenantID_InsertTime index because 25% < 30%:

a66 bad

Meme Preparation

It seems that we need to return to the bad old days before we had AT TIME ZONE available in SQL Server. One advantage that we have is that the function only needs to do the time zone conversion for the current moment in time. Central time is six hours behind UTC at the time of publication, so we could construct a function like this:

CREATE OR ALTER FUNCTION dbo.[GetDateCT](@PassInGetDateUTC DATETIME)
RETURNS DATETIME
WITH SCHEMABINDING
AS
BEGIN
    RETURN DATEADD(HOUR, -6, @PassInGetDateUTC);
END;
GO

The input parameter is there to make the function eligible for inlining because GETDATE() and other similar functions prevent inlining. Returning to the same query as before:

SELECT COUNT(*)
FROM dbo.Level100 l
WHERE l.TenantID = 2
AND l.InsertTime > DATEADD(MINUTE, -10, dbo.[GetDateCT](GETUTCDATE()))
OPTION (MAXDOP 1);

We can see that the new function results in a reasonable cardinality estimate and the query optimizer makes the correct index choice:

a66 also good

Of course, this function definition will need to be updated a few times per year to deal with daylight savings time changes. It isn’t too difficult to design a process to automatically perform these updates, but honestly I expect that the hassle of doing something will exceed the tolerance of most companies. However, this option is available to you if you need to preserve the usually superior cardinality estimation behavior enjoyed with GETDATE().

How to Convert Time Zones in SQL Server?

a66 maymay

Parsing and Arsing

Any solution you pick, other than a naked reference to SYSDATETIMEOFFSET() along with AT TIME ZONE, may result in parsing issues in your code. For example, some parts of certain queries do not allow for subqueries, so a subquery replacement against a table-valued function will not work as a direct replacement. Even the simplest scalar UDF reference can result in code that fails to parse, such as the following:

CREATE TABLE #t (
    StupidColumn DATETIME dbo.[GetDateNew]()
);

Fortunately, most of these issues are likely rare in practice and should be straightforward to address.

Keep in mind that you also need to fix GETDATE() references in areas such as default column values, constraints, computed columns, default column tables for table types, and so on. I would personally use a naked reference to SYSDATETIMEOFFSET() along with AT TIME ZONE for these areas as I see no benefit in using a scalar UDF.

Summary of Issues for Various Solutions

I was going to put some introduction text here, but whatever, you can figure out the point of this section on your own.

Inline table-valued function approach

Most T-SQL compatibility issues
Very verbose
Cardinality issues with AT TIME ZONE

Naked SYSDATETIMEOFFSET() AT TIME ZONE N’Central Standard Time’ replacement

Very verbose
Cardinality issues with AT TIME ZONE

Simple scalar UDF with AT TIME ZONE:

UDF is not eligible for inlining
UDF will return different values as the query executes which probably isn’t what you want

Non-static scalar UDF:

Function definition must occasionally be refreshed
Somewhat verbose
Some types of queries, such as those with CTEs, will not allow any UDF to be inlined

Final Thoughts

Messy no-win situations like this highlight the importance of picking the right cloud platform for your application. Azure SQL Managed Instance and Azure VMs avoid this problem entirely. Thanks for reading!

Erik cries when SSMS lies

Posted on November 14, 2025November 14, 2025 by Joe Obbish

My SQL Server friendo Mr. Erik C. Darling was recently telling me about some work he did getting batch mode on paging queries. This sounded a bit odd to me because paging queries make me think of small seeks of rows against finely curated nonclustered rowstore indexes, so I asked him for a link to his blog post about it. He grumpily refused and told me to find the link myself, which I eventually did.

High End Machine Performance

Erik’s first attempt used OFFSET/FETCH and resulted in a row mode query:

SQL Server Query Plan

The clustered index scan makes this a bit of a sad paging query. In Erik’s defense, there’s a whole host of real world reasons as to why you wouldn’t be able to make the perfect nonclustered index for your paging query:

End users may choose to sort on many different columns and you can’t index them all
A key member of your Index Approval Committee is on vacation
You already have more indexes on the table than your number of fingers
You’re working with third party software which does not allow you to create custom indexes

Getting back to the query, it doesn’t look that offensive to me. The row mode sort is a parallel top N sort and the overall number of rows to return is low, so each thread can independently sort its rows and return 1000 locally sorted rows to the parent operator. This is about as good as it gets with parallel row mode sorting. This is a row mode only query so the operator times that you see are the sum of that operator’s work and its children. In terms of real work done by the query, the scan clocks in at 1.168 seconds and the sort clocks in at 0.84 seconds. The final accounting at the end by the parent Parallelism (Gather Streams) is misleading at best and an outright LIE at worst. There wasn’t 4 seconds of work done by this query. There was only 2 seconds. The red lines illustrate the problem perfectly and I won’t be elaborating further:

a65 red

Erik’s second attempt uses ROW_NUMBER() and he achieves a plan with some batch mode operators using BMOR (batch mode on row store):

SQL Server Query Plan

The parallel batch mode sort works just fine here in that the single thread output property isn’t an issue. The parent operator is a batch mode window aggregate, but even if it wasn’t, the grandparent is a gather streams operator so the rows would end up on one thread anyway. Actual time statistics accounting works differently for batch mode operators: each batch mode operator only tracks its own work. In terms of real work done by the query, the scan clocks in at 1.1022 seconds and the sort clocks in at 0.892 seconds. This is quite similar to the first attempt. It could be argued that the batch mode sort is more efficient than the row mode top N sort, but I’d call it a wash considering the unpredictable rowstore to batch mode conversion overhead (which does seem to be small for this table).

Low End Machine Performance

I tested on my local machine with 64 GB of RAM which is less than Erik’s laptop. My clustered index scans took significantly longer than his, but as usual, there’s a lot to learn from low end machine performance. Let’s go back to the first reason as to why the table might not be indexed well for this particular query:

End users may choose to sort on many different columns and you can’t index them all

Microsoft presents a standard solution for this scenario: the humble nonclustered columnstore index. This will be great for my low end machine because I’ll be able to fit the new NCCI in memory. For those following along at home on their own low end machines, I created the index on every column except the Body column in a very carefree fashion:

CREATE NONCLUSTERED COLUMNSTORE INDEX ncci ON posts (
  Id
, AcceptedAnswerId
, AnswerCount
, ClosedDate
, CommentCount
, CommunityOwnedDate
, CreationDate
, FavoriteCount
, LastActivityDate
, LastEditDate
, LastEditorDisplayName
, LastEditorUserId
, OwnerUserId
, ParentId
, PostTypeId
, Score
, Tags
, Title
, ViewCount
) WITH (MAXDOP = 1)

I ran the OFFSET/FETCH query on my low end machine with MAXDOP 4 and it only took 430 CPU ms and 130 ms of elapsed time:

a65 gotta go fast

That’s a huge improvement compared to the 15-20 second runtime I was experiencing earlier. Interestingly, the second query (the ROW_NUMBER() approach) sticks with the parallel batch mode sort and performs significantly worse in comparison:

a65 too slow

The key difference here is the batch mode Top N sort in the first query. Remember that the query compile process for BMOR is different than what you get when compiling in the presence of a glorious columnstore index. You can get the improved batch mode top N sort by also doing a fake join to an empty CCI table. Serious batch mode connoisseurs should be mindful of the compile differences as they seek to gain the greatest performance benefit possible from batch mode.

Final Thoughts

Friends don’t let friends be lied to by SSMS. Thanks for reading!

I finally got my first unfriendly stack overflow comment

Posted on October 24, 2025 by Joe Obbish

I remember feeling a little nervous when when I first started contributing to stack exchange. It was supposedly unwelcoming and full of unfriendly people. I even planned on going to the version of the site full of database administrators, which obviously would be much worse than the average exchange site. My worries didn’t last long though. Sure, there was the occasional answer with a gruff tone, but I realized pretty quickly that I was getting help from industry experts for free. I interacted with a lot of cool, knowledgeable people (including the guy who owns this site) and I was able to help a lot of people with their problems. I even eventually got to shake hands with Sean Gallardy, the SQL dumping king.

In my opinion, one of the nicest things that you can do on the internet is spend your free time to help someone else with a problem. I also believe that it’s often whiny and counterproductive to complain about the free help that other people are offering. Those beliefs guided my interactions on the site and I ended up ranked #26 in reputation. The DBA stack exchange was a wonderful place of helping. The help wasn’t always perfect, but what is? Especially something given for free?

The Problem

Speaking of imperfections, I recently had a problem at work with CLR functions that were accessing files on Azure Blob Storage. We were planning for a migration to Azure SQL Database and these functions were flagged as an issue because that platform does not support CLR functions. The ideal solution would be to remove all interactions with the binary file data within SQL Server, but the complexity of the project and the migration timelines made this a tall order. So I went looking for a workaround.

I didn’t find anything except for a poorly answered stack exchange where someone had what appeared to be a similar problem to me.

A Solution

After picking at the problem and dealing with AI hallucinations, we eventually found something that worked as a proof of concept in a development environment. I’ll go ahead and reproduce my freshly minted stack overflow answer here:

We need to do something similar as part of a quick migration to Azure SQL Database to replace a CLR function that accessed data in Azure Blob Storage. Basic overview of process:

Use extended events file names to make read and write requests. You can create one file per request and encode the request information in the file name.

Use Azure Event Grid to trigger C# function execution whenever a new file is created. Your function code interacts with Azure Blob Storage and returns the necessary data to your Azure SQL Database.

As a first step, create an extended event session with an event condition that will never trigger and without any targets. I used user_event with an impossible filter value for event_id.

Detailed steps for making a read request for a file:

Take your preferred concurrency protection lock to serialize file creation (sp_getapplock is a common choice).

Use ALTER EVENT SESSION ... ON DATABASE ADD TARGET package0.event_file(SET filename=N'... to create a file. The file name should contain the information necessary to identify the requested file. In our application we use a GUID to identify files so we include the GUID in the file name.

Use ALTER EVENT SESSION ... ON DATABASE DROP TARGET package0.event_file; to remove the file target within SQL Server (the file will persist in Azure Blob Storage).

Release your concurrency protection lock.

Wait in a loop until the requested data shows up in a cache table.

While the T-SQL code waits, the C# function triggers off of the file creation. It parses the file name, make the API request to get the file from Azure Blob Storage, and writes the file data to the cache table within Azure SQL Database. In our development environment we experienced typical latency of around 500 ms for this step.

The T-SQL wait loop ends and your code now has access to the file data within Azure SQL Database.

Detailed steps for making a write request for a file:

Write the file data along with an identifying ID to a cache table.

Take your preferred concurrency protection lock to serialize file creation (sp_getapplock is a common choice).

Use ALTER EVENT SESSION ... ON DATABASE ADD TARGET package0.event_file(SET filename=N'... to create a file. The file name can be anything because it just informs your C# code that there’s something to process in the cache table. There’s no need to parse any information from the file name in the C# code.

Use ALTER EVENT SESSION ... ON DATABASE DROP TARGET package0.event_file; to remove the file target within SQL Server (the file will persist in Azure Blob Storage).

Release your concurrency protection lock.

Wait in a loop until the requested data is deleted from the cache table.

While the T-SQL code waits, the C# function triggers off of the file creation. It gets the file data from Azure SQL Database, makes the API request to write the file to Azure Blob Storage, and deletes the row from the cache table to signal success.

The T-SQL wait loop ends and your code proceeds now that the file has been written to Azure Blob Storage.

As an aside, it’s theoretically possible to use sp_trace_generateevent to pass up to 8000 bytes of binary data to an extended event target file. However, extended events are buffered to files and you will likely experience high latency. It is even possible for events to be lost and not written to the file at all. Avoiding these issues is why we went with the approach that creates a single 4 KB file per file request.

Obviously this is a horrible way to abuse Extended Events but this was the best quick fix solution that we found.

Have you ever seen a solution so ugly that it wraps all the way around and starts to look beautiful? Sure, I wouldn’t recommend implementing something like this if you had any other option. But sometimes there’s simply no better option. Anyway, I posted that answer to give a faint sliver of hope to the next poor soul who was trying to solve the same problem. If it inspires someone else to find a solution, that’s awesome. I contribute to stack exchange solely to help others.

Apparently, the Real Problem Was Trying to Help

Some people contribute to the site for other reasons, such as this guy:

a63 so mean

I tried to reply with a comment but stack overflow didn’t let me do it. Seemed to be some kind of bug with the site. I’ll implement another ugly workaround and respond here instead.

The question on stack overflow was posted in July 2022 and has 492 views and +3 net upvotes right now as I write this blog post. I can’t believe that I have to explain this, but I ended up at this “old” stack overflow question because that’s where google took me after trying to find a solution for the same problem. Who the hell cares how old the question is? Would the question somehow be better if I called in a favor with Aaron Bertrand and asked him to edit the database to change the year to 2025 instead of 2022? There’s even a meta post about answering old questions. The overwhelming consensus is that it’s perfectly okay to answer old questions.

Besides, are you sure that you want to discount the value of old questions and answers? Here’s one of your top ranked answers:

a63 point and click lol

I got goosebumps while reading it. Truly extraordinary.

With respect to the question I answered, it is perfectly answerable and meets all of the necessary qualifications for a stack overflow question:

a63 question

The OP doesn’t know how to do something within Azure SQL Database. What is the point of demanding a minimally reproducible example when he doesn’t know how to do it? That’s like asking someone to prove a negative. What do you expect him to provide? A code comment lamenting that he doesn’t know how to do it?

Finally, I followed the principles documented at how to answer while constructing my answer. There is nothing wrong with writing a “non-trivial” or “detailed” answer. There’s even a summary at the top. If you don’t like my answer, great, go down vote it. Please do anything other than arrogantly tut-tutting at me in a comment in a way that’s contrary to stack overflow’s own published principles. Not that it matters because I got no interest in playing unpaid janitor, but I don’t even have permission to close-vote the answer. I just want to help people. Maybe use some of your “community energy” to actually help people? A “Principal Software Engineer at Microsoft” surely can do better than one answer in the last year, right?

a63 zzzzzz

Big Data

Perhaps my experience was caused by a difference in culture between the DBA stack exchange and stack overflow. After all, the stats seem to speak for themselves:

DBA stack exchange: ~0 unfriendly comments on 459 answers = 0% unfriendly rate
stack overflow: 1 unfriendly comment on 1 answer = 100% unfriendly rate

Database administrators might not be so bad after all!

Final Thoughts

It’s wild how people act on the internet even while presenting their full name, photo, and their workplace. Some people should maybe stick to answering questions about how to reset passwords and leave the technical questions to others. This experience honestly lowered my opinion of the average Microsoft employee. I’d write you guys off completely if Sean wasn’t so good at analyzing dumps. Thanks for reading!

Cursor Declarations That Use LOB Local Variables Can Bloat Your Plan Cache

Posted on May 8, 2024May 13, 2024 by Joe Obbish

I found a few stored procedures with single plan cache entries over 500 MB of memory each according to sys.dm_exec_cached_plans while optimizing the plan cache in production. The stored procedures were relatively simple and both execution plans were less than 1 MB when saved as sqlplan files. I was confused, offended, and confused.

First published on MSDN on Jan 09, 2007

Did you know that the memory used for cursors can be cached along with the compiled plan? I’ll include a direct quote of the ancient wisdom:

Inactive cursors are cached in the compiled plan. Just like MXCs there are two lists maintained here: the lookup list and the enum list. The lookup list stores the inactive cursors. When a cursor is de-allocated, the memory used to store the cursor is cached. On subsequent re-execution of the batch with this cursor, the cached cursor memory is reused and re-initialized as an active cursor. The enum list has the list of active and inactive cursors and is used to enumerate all cursors for a compiled plan.

Cursors are re-generatable given a compiled plan. Relative to compiled plans they are also relatively inexpensive to re-generate.

I checked sys.dm_exec_cached_plan_dependent_objects and found that nearly all of the cached memory was used for the cursors instead of the execution plans. This makes sense because there was no reason why a simple procedure should use 500 MB for a cached plan, but why were the cursors using so much memory?

Back to 2022

I was able to create a simple reproduction of the issue on SQL Server 2022 CU13 which is the most recent version currently available at the time of writing. Create a stored procedure that uses the OPENJSON() function along with an NVARCHAR(MAX) local variable as part of a cursor declaration. Here is one example:

CREATE OR ALTER PROCEDURE TestMemoryLeak_1
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @json NVARCHAR(MAX);

    SET @json = N'[
        {
            "Order": {
                "Number": "SO43659",
                "Date": "2024-05-31T00:00:00"
            },
            "AccountNumber": "AW29825",
            "Item": {
                "Price": 2024.9940,
                "Quantity": 1
            }
        },
        {
            "Order": {
                "Number": "SO43661",
                "Date": "2024-06-01T00:00:00"
            },
            "AccountNumber": "AW73565",
            "Item": {
                "Price": 2024.9940,
                "Quantity": 3
            }
        }
    ]';

    DECLARE json_cursor CURSOR FAST_FORWARD FOR 
    SELECT Number, OrderDate, Customer, Quantity
    FROM OPENJSON(@json) WITH (
        Number VARCHAR(200) '$.Order.Number',
        OrderDate DATETIME '$.Order.Date',
        Customer VARCHAR(200) '$.AccountNumber',
        Quantity INT '$.Item.Quantity'
    );

    DECLARE 
        @Number VARCHAR(200),
        @OrderDate DATETIME,
        @Customer VARCHAR(200),
        @Quantity INT;

    OPEN json_cursor;
    FETCH NEXT FROM json_cursor INTO @Number, @OrderDate, @Customer, @Quantity;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- nobody wants to work anymore

        FETCH NEXT FROM json_cursor INTO @Number, @OrderDate, @Customer, @Quantity;
    END;
    CLOSE json_cursor;
    DEALLOCATE json_cursor;

    RETURN;
END;
GO

The amount of memory cached for the cursor execution can be seen with the following code:

DECLARE @plan_handle varbinary(64);
SELECT @plan_handle = plan_handle
FROM sys.dm_exec_procedure_stats
where object_id = object_id('[TestMemoryLeak_1]');

SELECT * FROM 
sys.dm_os_memory_objects
WHERE type = 'MEMOBJ_CURSOREXEC'
AND memory_object_address IN  (
    SELECT memory_object_address
    FROM sys.dm_exec_cached_plan_dependent_objects(@plan_handle)
);

I executed the test stored procedure 1 time, measured the memory usage, then executed it 9999 more times, and finally measured the memory usage again. I expected the memory usage to not change because the cursor always processes the same data. SQL Server betrayed my expectations:

a62 leak

A cursor that processes 2 rows of data has managed to use 163 MB of cached memory. Wonderful. One workaround is to remove OPENJSON() from the cursor definition by inserting the data into a table variable and making the cursor read from the table variable:

DECLARE @json_table TABLE (
    Number VARCHAR(200) NULL,
    OrderDate DATETIME NULL,
    Customer VARCHAR(200) NULL,
    Quantity INT NULL
);

INSERT INTO @json_table (Number, OrderDate, Customer, Quantity)
SELECT Number, OrderDate, Customer, Quantity
FROM OPENJSON(@json) WITH (
    Number VARCHAR(200) '$.Order.Number',
    OrderDate DATETIME '$.Order.Date',
    Customer VARCHAR(200) '$.AccountNumber',
    Quantity INT '$.Item.Quantity'
);

DECLARE json_cursor CURSOR FAST_FORWARD FOR 
SELECT Number, OrderDate, Customer, Quantity
FROM @json_table;

SQL Server no longer betrays my expectations. The amount of cached memory for the cursor does not change between 1 and 10000 executions:

a62 no leak

Here is a comparison using the sys.dm_exec_cached_plans DMV:

a62 compare

Determining which row is associated with each stored procedure is an exercise left up to the reader.

A second workaround is to define the cursor using the LOCAL argument. A third workaround is to define the local variable as NVARCHAR(4000). Thank you to the comments for both of these workarounds. Note that I originally thought the issue was related to OPENJSON() but it appears that was an error on my part. The presence of an LOB variable looks to be the determining factor.

Final Thoughts

Using local LOB variables as part of a cursor declaration can lead to ever-increasing amounts of cached memory associated with its cached plan. It’s possible that SQL Server will prune this plan once it experiences plan cache pressure, but it may also choose to prune other plans instead. I recommend coding defensively by always using the LOCAL argument for your cursor declarations. Thanks for reading!

Another Method to Reduce Temp Table Plan Cache Pollution

Posted on December 17, 2022 by Joe Obbish

SQL Server allows stored procedures to reference temporary tables that were created in an outer scope. However, doing so can cause the plan cache pollution problem described here.

The Polluter

The problem can be observed with a simple repro script. The child stored procedure performs a series of updates on a temp table. The parent procedure creates the temp table, calls the child stored procedure, and returns data from the temp table.

CREATE OR ALTER PROCEDURE dbo.update_temp_table
AS
BEGIN
    SET NOCOUNT ON;
    -- the #outer_scope temp table is created in an outer scope

    UPDATE #outer_scope SET CNT = (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t2) + (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t2)
    WHERE ID = 1;

    UPDATE #outer_scope SET CNT = (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t3) + (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t2)
    WHERE ID = 1;

    UPDATE #outer_scope SET CNT = (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t4) + (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t2)
    WHERE ID = 1;

    UPDATE #outer_scope SET CNT = (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t5) + (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t2)
    WHERE ID = 1;
END;

GO

CREATE OR ALTER PROCEDURE dbo.parent_proc
AS
BEGIN
    SET NOCOUNT ON;

    CREATE TABLE #outer_scope (ID INT NOT NULL, CNT INT NULL);
    INSERT INTO #outer_scope (ID, CNT)
    VALUES (1, NULL);

    EXEC dbo.update_temp_table;

    SELECT * FROM #outer_scope;
END;

GO

I executed the parent_proc procedure on three different sessions and was rewarded with three different cached plans:

a61 cache pollution

The Cleaner

One way to resolve this issue is to change the child procedure to create a new local temp table, to copy the data from the original table into the new one using its own child procedure, to perform the updates against the new table, and to copy back the data into the original table, again in a child procedure. That was a lot of words so perhaps code will be easier to understand:

CREATE OR ALTER PROCEDURE dbo.update_temp_table_import_temp_data
WITH RECOMPILE
AS
BEGIN	
    IF OBJECT_ID(N'tempdb..#outer_scope', 'U') IS NOT NULL
    BEGIN
        INSERT INTO #local_scope
        SELECT * FROM #outer_scope;
    END;
END;

GO


CREATE OR ALTER PROCEDURE dbo.update_temp_table_export_temp_data
WITH RECOMPILE
AS
BEGIN
    IF OBJECT_ID(N'tempdb..#outer_scope', 'U') IS NOT NULL
    BEGIN
        TRUNCATE TABLE #outer_scope;	

        INSERT INTO #outer_scope
        SELECT * FROM #local_scope;
    END;
END;

GO


CREATE OR ALTER PROCEDURE dbo.update_temp_table_NEW
AS
BEGIN
    SET NOCOUNT ON;
    
    CREATE TABLE #local_scope (ID INT NOT NULL, CNT INT NULL);
    
    EXEC dbo.update_temp_table_import_temp_data;
    
    UPDATE #local_scope SET CNT = (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t2) + (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t2)
    WHERE ID = 1;

    UPDATE #local_scope SET CNT = (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t3) + (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t2)
    WHERE ID = 1;

    UPDATE #local_scope SET CNT = (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t4) + (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t2)
    WHERE ID = 1;

    UPDATE #local_scope SET CNT = (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t5) + (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t2)
    WHERE ID = 1;
    
    EXEC update_temp_table_export_temp_data;
END;	
    
GO


CREATE OR ALTER PROCEDURE dbo.parent_proc_NEW
AS
BEGIN
    SET NOCOUNT ON;

    CREATE TABLE #outer_scope (ID INT NOT NULL, CNT INT NULL);
    INSERT INTO #outer_scope (ID, CNT)
    VALUES (1, NULL);

    EXEC dbo.update_temp_table_NEW;

    SELECT * FROM #outer_scope;
END;

GO

Running the new procedure in three different sessions no longer results in cache pollution:

a61 cache clean

This is because the references to outer-scoped temp tables have been moved to small, simple procedures that are built with a procedure level RECOMPILE hint. This approach has the following advantages:

There is no cache pollution for any of the code.
The update statements will be compiled significantly less often, especially on SQL Server 2019 and above.

It also has the following disadvantages:

Data from the outer table needs to be written to the local table and data from the local table is written back to the outer table.
The definition and columns of the temp table need to be duplicated in additional places in code.

I don’t think that this is the best general solution by any means, but it does fulfill a particular niche use case and I’ve been able to use this method in production code. For other ideas, Erland Sommarskog has a detailed write up of different ways of passing data between procedures.

Final Thoughts

When you need your plan cache to be cleaned there’s really only one person to call:

a61 cache cleaner

Thanks for reading!

Improving AT TIME ZONE Performance

Posted on November 11, 2022November 11, 2022 by Joe Obbish

Queries that use the AT TIME ZONE operator may perform worse than desired. For example, Jonathan Kehayias blogged about performance struggles with AT TIME ZONE at sqlskills.com. One key difference between Jonathan’s opinion and my own is that I enjoy writing these kinds of posts.

Test Data

The test data consists of one million rows per day over the month of January 2022 for a total of 31 million rows.

DROP TABLE IF EXISTS dbo.CCIForBlog;

CREATE TABLE dbo.CCIForBlog (
    SaleTimeUTC DATETIME2 NOT NULL,
    WidgetCount BIGINT NOT NULL,
    INDEX CCI CLUSTERED COLUMNSTORE
);

GO

SET NOCOUNT ON;

DECLARE
    @StartDate DATETIME2 = '20220101',
    @DaysToLoad INT = 31,
    @DaysLoaded INT = 0;

WHILE @DaysLoaded < @DaysToLoad
BEGIN
    INSERT INTO dbo.CCIForBlog (SaleTimeUTC, WidgetCount)
    SELECT DATEADD(SECOND, q.RN / 11.5, @StartDate), q.RN / 100000
    FROM
    (
        SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
        FROM master..spt_values t1
        CROSS JOIN master..spt_values t2
    ) q (RN)
    OPTION (MAXDOP 1);

    SET @StartDate = DATEADD(DAY, 1, @StartDate);
    SET @DaysLoaded = @DaysLoaded + 1;
END;

CREATE STATISTICS S1 ON CCIForBlog (SaleTimeUTC);

Data is saved to a table with a clustered columnstore index. Of course, this is a small amount of data for a columnstore table. However, it is more than sufficient to demonstrate AT TIME ZONE as a performance bottleneck.

Filtering

Suppose an end user wants a count of widgets sold between January 3rd and January 6th. A first attempt at this query could look like the following:

SELECT SUM(WidgetCount)
FROM dbo.CCIForBlog c
WHERE c.SaleTimeUTC >= '20220103'
AND c.SaleTimeUTC < '20220106'
OPTION (MAXDOP 1);

This query plays to all of columnstore’s strengths and it only takes around 10 milliseconds to execute. The data was loaded in date order so most of the rowgroups are eliminated. However, end users don’t think in terms of UTC time. The end user actually wanted Central Standard Time. After a bit of research, the query is changed to as follows:

SELECT SUM(WidgetCount)
FROM dbo.CCIForBlog c
WHERE SWITCHOFFSET(c.SaleTimeUTC, 0) AT TIME ZONE 'Central Standard Time' >= '20220103'
AND SWITCHOFFSET(c.SaleTimeUTC, 0) AT TIME ZONE 'Central Standard Time' < '20220106'
OPTION (MAXDOP 1);

The AT TIME ZONE operator is useful when the number of offset minutes is unknown. UTC always has an offset of 0 so it is better to use SWITCHOFFSET(datetimeoffset_expression, 0)as opposed to AT TIME ZONE ‘UTC’. Even so, the query still takes over 3 minutes to execute on my machine. Nearly all of the execution time is spent on performing AT TIME ZONE calculations:

a60 slow

Note that using c.SaleTimeUTC AT TIME ZONE 'UTC' AT TIME ZONE 'Central Standard Time' would double the execution time.

One idea is to rewrite the filtering so that the time zone conversions are done on the constant values instead of the column:

WHERE c.SaleTimeUTC >= SWITCHOFFSET('20220103' AT TIME ZONE 'Central Standard Time', 0)
AND c.SaleTimeUTC < SWITCHOFFSET('20220106' AT TIME ZONE 'Central Standard Time', 0)

I strongly recommend against this approach. It can lead to wrong results for some time zones and boundary points. Instead, you can use the fact that datetimeoffset only supports an offset of up to +-14 hours. That means that (local – 14 hours) < UTC < (local + 14 hours) is true for any time zone and any point in time. A logically redundant filter can be added to the query:

SELECT SUM(WidgetCount)
FROM dbo.CCIForBlog c
WHERE SWITCHOFFSET(c.SaleTimeUTC, 0) AT TIME ZONE 'Central Standard Time' >= '20220103'
AND SWITCHOFFSET(c.SaleTimeUTC, 0) AT TIME ZONE 'Central Standard Time' < '20220106'
AND c.SaleTimeUTC >= DATEADD(HOUR, -14, '20220103')
AND c.SaleTimeUTC < DATEADD(HOUR, 14, '20220106')
OPTION (MAXDOP 1);

The newly improved query finishes in around 26 seconds. It is able to benefit from rowgroup elimination and performs significantly fewer time zone conversions compared to the original query. In this case, we were able to use knowledge about time zones and a bit of date math to improve performance from over 3 minutes to about 26 seconds.

Grouping

Suppose that an end user wants widget counts summarized by date. A first attempt at this query could look like the following:

SELECT ca.ConvertedDate, SUM(WidgetCount)
FROM dbo.CCIForBlog c
CROSS APPLY (
    SELECT CAST(c.SaleTimeUTC AS DATE)
) ca (ConvertedDate)
GROUP BY ca.ConvertedDate
OPTION (MAXDOP 1);

This query takes about 1 second on my machine. However, once again, the end user wants the data to be in CST instead of UTC. The following approach takes around 3 minutes:

SELECT ca.ConvertedDate, SUM(WidgetCount)
FROM dbo.CCIForBlog c
CROSS APPLY (
    SELECT CAST(SWITCHOFFSET(c.SaleTimeUTC, 0) AT TIME ZONE 'Central Standard Time' AS DATE)
) ca (ConvertedDate)
GROUP BY ca.ConvertedDate
OPTION (MAXDOP 1);

This should not be a surprise because the bottleneck in query performance is performing 31 million AT TIME ZONE calculations. That doesn’t change if the query performs filtering or grouping.

Historically, governments only perform daylight saving time or offset switches on the minute. For example, an offset won’t change at 2:00:01 AM, but it might change at 2:00:00 AM. The source data has one million rows per day, so grouping the date truncated to the minute, applying the time zone conversion to the truncated distinct values, and finally grouping by date should lead to significant performance improvement. One way to accomplish this:

SELECT ca.ConvertedDate, SUM(SumWidgetCount)
FROM 
(
    SELECT DATEADD(MINUTE, DATEDIFF(MINUTE, '20000101', c.SaleTimeUTC), '20000101'), SUM(WidgetCount)
    FROM dbo.CCIForBlog c
    GROUP BY DATEADD(MINUTE, DATEDIFF(MINUTE, '20000101', c.SaleTimeUTC), '20000101')
) q (SaleTimeUTCTrunc, SumWidgetCount)
CROSS APPLY (
    SELECT CAST(SWITCHOFFSET(q.SaleTimeUTCTrunc, 0) AT TIME ZONE 'Central Standard Time' AS DATE)
) ca (ConvertedDate)
GROUP BY ca.ConvertedDate
OPTION (MAXDOP 1);

The new query takes around 4 seconds on my machine. It needs to perform 44650 time zone conversions instead of 31 million. Once again, we were able to use knowledge about time zones and a bit of date math to improve performance.

Functioning

I’ve developed and open sourced replacement functions for AT TIME ZONE to provide an easier way of improving performance for queries that use AT TIME ZONE. The TZGetOffsetsDT2 function returns a pair of offsets and the TZFormatDT2 function transforms those offsets into the desired data type. The filtering query can be written as the following:

SELECT SUM(WidgetCount)
FROM dbo.CCIForBlog c
OUTER APPLY dbo.TZGetOffsetsDT2 (c.SaleTimeUTC, N'UTC', N'Central Standard Time') o
CROSS APPLY dbo.TZFormatDT2 (c.SaleTimeUTC, N'UTC', N'Central Standard Time', o.OffsetMinutes, o.TargetOffsetMinutes) f
WHERE f.ConvertedDateTime2 >= '20220103'
AND f.ConvertedDateTime2 < '20220106'
OPTION (MAXDOP 1);

The new query takes around 10 seconds to execute. The new query plan is able to use batch mode processing at every step:

a60 filter fast

If desired, performance can be further improved by adding the same redundant filters as before:

SELECT SUM(WidgetCount)
FROM dbo.CCIForBlog c
OUTER APPLY dbo.TZGetOffsetsDT2 (c.SaleTimeUTC, N'UTC', N'Central Standard Time') o
CROSS APPLY dbo.TZFormatDT2 (c.SaleTimeUTC, N'UTC', N'Central Standard Time', o.OffsetMinutes, o.TargetOffsetMinutes) f
WHERE f.ConvertedDateTime2 >= '20220103'
AND f.ConvertedDateTime2 < '20220106'
AND c.SaleTimeUTC >= DATEADD(HOUR, -14, '20220103')
AND c.SaleTimeUTC < DATEADD(HOUR, 14, '20220106')
OPTION (MAXDOP 1);

The most optimized version takes around 1 second to execute. Quite an improvement compared to 3 minutes!

The grouping query can also be rewritten using the new functions:

SELECT f.ConvertedDate, SUM(WidgetCount)
FROM dbo.CCIForBlog c
OUTER APPLY dbo.TZGetOffsetsDT2 (c.SaleTimeUTC, N'UTC', N'Central Standard Time') o
CROSS APPLY dbo.TZFormatDT2 (c.SaleTimeUTC, N'UTC', N'Central Standard Time', o.OffsetMinutes, o.TargetOffsetMinutes) f
GROUP BY f.ConvertedDate
OPTION (MAXDOP 1);

This version takes about 7 seconds to execute. This is slower than the date truncation method which took 4 seconds but still much faster than the original AT TIME ZONE method.

Unfortunately, SQL Server limitations require a pair of functions to get the best performance. The open source project does provide a simpler function that can be called by itself but it is only eligible for nested loop joins.

Upgrading

SQL Server 2022 RC1 has signs of improvement for AT TIME ZONE. The basic queries that use AT TIME ZONE in this blog post take about 75 seconds to execute on my machine, so it could be estimated that SQL Server 2022 will reduce the CPU footprint of AT TIME ZONE by 60%. Release candidates are not fully optimized so it’s possible that final performance numbers will be different once the product goes GA. I suspect that these performance improvements are already present in Azure SQL Database but I can’t find any documentation for the change.

Final Thoughts

Please try my open source project if you’re experiencing performance problems with AT TIME ZONE. Thanks for reading!

Why Parallel Queries are Sometimes Slow In SQL Server

Posted on August 24, 2022August 24, 2022 by Joe Obbish

You may have noticed large variations in elapsed time for parallel queries while performing query tuning or observing a production workload. This post reviews some of the possible explanations for those variations. I’m testing on SQL Server 2022 RC0 but this behavior can be observed on all currently supported versions of SQL Server.

The Setup

The test data is very simple. Insert ten million sequential integers into a single column table with a clustered index:

CREATE TABLE dbo.ParallelTest (
Id BIGINT NOT NULL,
CONSTRAINT PK_ParallelTest PRIMARY KEY (Id)
);

INSERT INTO dbo.ParallelTest (Id)
SELECT [value]
FROM GENERATE_SERIES(1, 10000000);

Now consider the following query:

SELECT SUM(a.ID + b.ID) s
FROM dbo.ParallelTest a
INNER JOIN dbo.ParallelTest b ON a.ID = b.ID
OPTION (USE HINT ('DISALLOW_BATCH_MODE'));

The query optimizer naturally picks a merge join for this query. The self-joined tables obviously have the same row count and the data is already sorted. That makes this query an ideal candidate for a serial merge join from the optimizer’s point of view . The USE HINT in the query isn’t particularly important and the example works without it, but its presence will be explained later. Here’s what the query plan looks like for me:

a59 serial merge

The query takes 2721 ms of CPU time and elapsed time to execute on my machine. It’s reasonable for CPU and elapsed time to match here because there’s nothing for the query to wait on. It’s the only user query running against the SQL instance. Next I’ll force a parallel merge join query plan by changing the query hints to the following:

SELECT SUM(a.ID + b.ID) s
FROM dbo.ParallelTest a
INNER JOIN dbo.ParallelTest b ON a.ID = b.ID
OPTION (MAXDOP 8, MERGE JOIN, USE HINT('ENABLE_PARALLEL_PLAN_PREFERENCE', 'DISALLOW_BATCH_MODE'));

I chose MAXDOP 8 because my test machine has 8 physical CPU cores. This is an unnatural query plan with a higher parallel cost than serial cost. The query optimizer only picked it because it was forced to. Note the presence of the exchange operators with a relatively high estimated cost:

a59 parallel merge

The parallel query takes 6465 ms of CPU time and 1723 ms of elapsed time to execute on my machine. It finished about 33% faster than the serial query but CPU time more than doubled. The exchange operators added a large overhead for this query. Next I’ll run the same query again but I’ll make one of my CPUs busier than before. Consider the following infinite loop coded as a stored procedure:

CREATE OR ALTER PROCEDURE #p AS
SET NOCOUNT ON;
DECLARE @i INT;
WHILE 1 = 1
SET @i = 0;
GO

EXEC #p;

This code does not accomplish anything besides efficiently using CPU resources, which is exactly what I need for my test. I’m running the merge query at MAXDOP 8 on a machine with 8 CPU cores so I’m pretty much guaranteed to see scheduler contention for some of the workers of the parallel query. They will compete for CPU resources with the infinite loop.

Take a moment to predict the performance penalty that the parallel query will experience due to the busy scheduler. Will it be 2X slower? 3X slower? Worse? For a few hints, recall that hash distributed exchange operators will generally try to force an equal amount of work to each thread. Also recall that a parallel merge join requires order-preserving exchange operators. I’ve included Erik running at MAXDOP 8 to hopefully prevent you from accidentally seeing the answer before you’ve finalized your prediction.

a59 erik

With the busy scheduler, the query now takes 6818 ms of CPU time and 31745 ms of elapsed time to execute on my machine. The same code is over 18X slower simply due to the highly efficient stored procedure that’s running on the same SQL instance. Here are the query wait stats in case they are of interest:

<Wait WaitType=”CXPACKET” WaitTimeMs=”454673″ WaitCount=”5262″ />
<Wait WaitType=”CXSYNC_PORT” WaitTimeMs=”32539″ WaitCount=”41″ />
<Wait WaitType=”CXSYNC_CONSUMER” WaitTimeMs=”43″ WaitCount=”14″ />

scheduler_ring_buffer_recorded

This is an inappropriately brief introduction to a complex topic, but workers cooperate with each other by sharing time on schedulers. Under normal conditions, a worker will execute code on the scheduler for up to 4 milliseconds before yielding. The worker yields in case there’s another worker waiting to execute on the scheduler. When there’s not the original worker immediately starts executing code again. An example of when a worker yields early is if it needs to wait for an I/O operation.

scheduler_ring_buffer_recorded is the name of a debug extended event that you should probably never use. I often don’t follow my own advice, so I used it to get an idea of how often workers were yielding to the scheduler for the merge join queries. I only looked at action types of “SCHEDULER_NONPREEMPTIVE_RESUME” and I admit that this analysis is quite approximate. For the MAXDOP 1 query, there was a total of 705 events over 2872 milliseconds. In other words, the average time spent on the scheduler before yielding is close to the 4 ms soft maximum. This should be expected for the serial query if all of the data is in the buffer pool. There is no memory grant for the query (the reason for the USE HINT) and there’s not really any reason to yield early.

The parallel query had significantly more yields. Looking at one worker thread as an example, it had 14862 events over 1714 milliseconds. In other words, it had around 35 times the number of scheduler yields compared to the MAXDOP 1 query. The higher number of yields is a performance liability when competing with workers from other queries on the same scheduler. The infinite loop stored procedure dominates the workers from the parallel and gets around 97% of the cpu time on the scheduler. That’s a big part of why the parallel query is 18X slower than before.

Note that you can also get this information by looking at the yield_count column of the sys.dm_of_schedulers DMV, but those numbers are server wide and aren’t isolated to one query. With that said, I did get reasonably similar results when comparing that column to the extended event results.

To summarize, parallel query performance overhead isn’t as simple as just the direct overhead of the exchange operators. A query changing from serial to parallel can result in completely different interactions between the workers and the scheduler. The serial query executed its code 4 milliseconds at a time. The parallel query had to yield significantly more often. This type of change isn’t something that you can directly observe in a query plan.

Why Parallel Queries are Sometimes Slow

There are many reasons why a parallel query may not be as efficient as you hoped. I find these to be the most common:

1) There is an overhead associated with parallel exchange operators which is not present in serial queries. Moving data between worker threads requires additional CPU time which adds to the overall execution time of the query. I believe this is what is described by the following quote from the documentation:

Simple queries, frequently found in transaction processing applications, find the additional coordination required to execute a query in parallel outweigh the potential performance boost.

2) There’s some kind of additional coordination required between threads which delays query completion. For example, an order-preserving repartition streams operator is generally worse for performance than one that doesn’t need to preserve order. A different type of example can be seen in my blog post about slow parallel scans caused by cloud storage.

3) Work may be unevenly distributed between worker threads. For example, an eager index spool can only be built with one thread. Other threads may be idle while that temporary index is built. For a different example, a filtered parallel scan of a table may not output the same number of rows to all threads. All else being equal, a parallel query that cannot use all of the CPUs allowed by MAXDOP is not as efficient as one that can.

4) Work may be evenly distributed between worker threads but some of the CPUs are busier than others. Some parallel queries are as slow as their busiest CPU, even when queries are evenly sharing time on the scheduler. Of course, dramatic performance penalties can be seen when the parallel query gets an uneven amount of time on the scheduler (like in the demo in this blog post).

Mitigations

Individual parallel query performance can degrade due to contention on a single scheduler that’s used by one or more of its workers. You are more likely to run into problems with higher DOP or a higher number of concurrent queries. Lowering MAXDOP decreases the odds that you’ll run into the worst type of contention because the query uses fewer schedulers. Decreasing the number of queries decreases contention per scheduler which also helps.

To explain it in a different way, the simplest way to reduce problems with parallel queries is to have fewer worker threads per scheduler. This is of course a balancing act, but it’s one that you must perform for some workloads. Set Cost Threshold for Parallelism and server/database/resource governor MAXDOP appropriately. When possible, don’t run too many concurrent queries for your hardware.

For very important queries for which performance is critical, consider changing to a type of parallelism that isn’t as susceptible to scheduler contention issues. The two most common methods are a parallel apply pattern, perhaps using demand partitioning, or batch mode execution.

Batch mode is yet another complicated topic that I’m not going to explain here, but it uses a fundamentally different method of parallelism compared to row-mode execution which is generally much more robust against problems caused by busy schedulers. Below is a comparison of elapsed times with and without the infinite loop stored procedure running:

The elapsed time of the parallel batch mode query only changes slightly. This is also a nice property when considering overall server scalability.

Final Thoughts

I was going to originally blog about DOP feedback, but it’s taking a while for my feedback to kick in. Thanks for reading!

Ordered Columnstore Indexes on SQL Server 2022 CTP 2.0

Posted on July 11, 2022 by Joe Obbish

Brent recently blogged about ordered columnstore indexes in SQL Server 2022 and had some trouble with them, so I decided to take a look into the mechanics of the feature. I’m testing on SQL Server 2022 CTP 2.0.

What does the ordered columnstore feature do?

A sort operator may be added to query plans that insert into the table. The sort operator is a bit unusual in that the data may not be fully sorted.
A sort operator is added when initially creating an ordered columnstore index. The level of rowgroup elimination fragmentation will depend on memory, DOP, and other factors.
A sort operator is added when rebuilding an ordered columnstore index. The level of rowgroup elimination fragmentation will depend on memory, DOP, and other factors.

Technical details for insert

The sort for inserting into an ordered columnstore is a DML request sort. It appears to use the same internal mechanism as the sort that’s added for inserting into partitioned columnstore tables. The difference is that the data is sorted by the specified columns instead of a calculated partition id. In my testing, the sort appears to be a best effort sort that does not spill to tempdb. This means that if SQL Server thinks there won’t be enough memory then the data will not be fully sorted. Parallel inserts have an additional complication. Consider the following query plan image:

a58 parallel insert

That is a row mode sort. It is a row mode sort because a batch mode parallel sort would put all resulting rows on a single thread which would make the parallel insert pointless. However, there’s no repartition streams operator as a child of the sort. Data is sorted on each thread in a best effort fashion. Even if there is enough memory to fully sort the data, you will end up with DOP threads of sorted data instead. The data will not be sorted globally. The split into threads will increase rowgroup elimination fragmentation.

As mentioned earlier, the sort operator does not always appear. It is not present when the cardinality estimate is very low (around 250 rows). I suspect that the same logic is used for adding the sort as adding the memory grant for compression. For very low cardinality estimates, the data will be inserted into delta rowgroups, even if there’s more than 102399 rows. By that same reasoning, I expect that there is no sort operator if the INSERT query hits a memory grant timeout.

For more information on this sort, see CQScanPartitionSortNew in Paul White’s blog post about different sort types in SQL Server.

Technical details for CREATE/REBUILD index

I spent less time looking into the sort that’s added as part of CREATE or REBUILD index. In my testing, the sort again does not spill to tempdb. The sort is also performed on a per thread basis for parallel index operations. The minimum fragmentation level will be achieved for a MAXDOP 1 operation with sufficient memory. Reducing memory or running the create index in parallel will increase fragmentation. This is unfortunate because ordered columnstore indexes do not support online index creation or rebuilds.

An unpopular opinion

I think that the community worries too much with columnstore with respect to achieving perfect segment ordering and keeping rowgroups at exactly 1048576 rows. If you perform basic maintenance and partition tables appropriately then that should be good enough for most data warehouse workloads. Most query performance issues are going to be caused by getting no elimination at all, scanning through too many soft-deleted rows, or the usual query performance problems. Scanning 11 rowgroups instead of 5 probably isn’t why your queries on columnstore indexes are slow today.

Final thoughts

As is, this feature can be described as a poor man’s partitioning. The sweet spot for this functionality feels very limited to me in its current state, but we’re still on CTP 2.0. Maybe there will be changes before RTM. Thanks for reading!