How to Add Clippy Support to SSMS

Step 1: Install the Clippy digital assistant SSMS add-on

Go to GitHub and grab the code. Clippy wants you to be on a recent version of SQL Server, so only SQL Server 2022+ is supported.

Step 2: Create your first Clippy alert and schedule it

For illustrative purposes, below is a very basic check that throws an alert if CPU usage is at or above 80%:

DECLARE
    @CPUFromBuffer INT,
    @LogMessage VARCHAR(100);

SELECT TOP (1) @CPUFromBuffer = record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int')
FROM (
    SELECT timestamp, CONVERT(XML, record) AS record
    FROM sys.dm_os_ring_buffers
    WHERE ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR'
    AND record LIKE '%<SystemHealth>%'
) AS RingBufferInfo
ORDER BY timestamp DESC;

IF @CPUFromBuffer >= 80
BEGIN
    SET @LogMessage = 'Your SQL Server is using ' + CAST(@CPUFromBuffer AS VARCHAR(10)) + '% of available CPU.';
    EXEC dbo.AddPaperClipToErrorLog @LogMessage;
END;
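Scheduling the check is standard SQL Agent fare. Below is a minimal sketch, assuming the check above has been wrapped in a stored procedure with the hypothetical name dbo.ClippyCpuCheck inside a database named YourDatabase:

```sql
USE msdb;
GO

-- Create a job that runs the hypothetical dbo.ClippyCpuCheck procedure once per minute
EXEC dbo.sp_add_job @job_name = N'Clippy CPU Check';

EXEC dbo.sp_add_jobstep
    @job_name = N'Clippy CPU Check',
    @step_name = N'Run the CPU check',
    @subsystem = N'TSQL',
    @command = N'EXEC dbo.ClippyCpuCheck;',
    @database_name = N'YourDatabase';

EXEC dbo.sp_add_schedule
    @schedule_name = N'Every minute',
    @freq_type = 4,            -- daily
    @freq_interval = 1,
    @freq_subday_type = 4,     -- minute units
    @freq_subday_interval = 1; -- every 1 minute

EXEC dbo.sp_attach_schedule
    @job_name = N'Clippy CPU Check',
    @schedule_name = N'Every minute';

EXEC dbo.sp_add_jobserver @job_name = N'Clippy CPU Check';
```

Adjust the schedule frequency to taste; once per minute is aggressive but matches the spirit of a CPU alert.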

Step 3: Trigger the alert

On my local machine with 8 physical cores, I can simply run the following query and wait:

WITH vCTE AS (
      SELECT v.v
      FROM
      (VALUES
      (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41),(42),(43),(44),(45),(46),(47),(48),(49),(50),(51),(52),(53),(54),(55),(56),(57),(58),(59),(60),(61),(62),(63),(64),(65),(66),(67),(68),(69),(70),(71),(72),(73),(74),(75),(76),(77),(78),(79),(80),(81),(82),(83),(84),(85),(86),(87),(88),(89),(90),(91),(92),(93),(94),(95)
      ) v(v)
)
SELECT MIN(v0.v + ca.v)
FROM vCTE v0
CROSS APPLY (
      SELECT MIN(v1.v + v2.v + v3.v + v4.v + v5.v) v
      FROM vCTE v1
      CROSS JOIN vCTE v2
      CROSS JOIN vCTE v3
      CROSS JOIN vCTE v4
      CROSS JOIN vCTE v5
      WHERE v1.v > (v0.v - 96)
) ca
OPTION (NO_PERFORMANCE_SPOOL, MAXDOP 8, USE HINT (N'ENABLE_PARALLEL_PLAN_PREFERENCE'));

Step 4: Check for alerts in the error log
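Assuming the add-on logs its paperclips as ordinary error log messages, a quick search with the undocumented (but long-lived) xp_readerrorlog should find them:

```sql
-- 0 = current log, 1 = SQL Server error log, then a search string from the alert text
EXEC sys.xp_readerrorlog 0, 1, N'available CPU';
```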

a69 paperclip

Hyper-threading on Azure VMs is a SQL Licensing Trap

Azure VMs with hyper-threading enabled are sized according to logical cores instead of physical cores. These logical cores can perform at 50% of the power of physical cores under high levels of activity but will always perform at 100% of the SQL Server licensing cost rate. As a result, moving from a busy on-premises SQL Server VM to a same-sized Azure VM with hyper-threading enabled can result in a surprise SQL Server licensing bill.

The Evidence

In my opinion, the documentation is fairly clear that logical cores are what’s for sale here, but there seems to be a fair amount of confusion on this topic. Hopefully one of these five exhibits is able to convince the reader.

Exhibit A: Hyper-threading can be disabled and it cuts the vCPU count in half

It is possible to disable hyper-threading within the VM as demonstrated here. The number of available vCPUs within the VM is reduced by half after hyper-threading is disabled. Conveniently, the blog post also provides a command line method to see physical core and virtual core count. The physical core count stays the same between configurations.
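For a quick sanity check from inside SQL Server itself (separate from the blog post’s command line method), sys.dm_os_sys_info exposes both counts:

```sql
SELECT
    cpu_count,         -- logical processors visible to SQL Server
    hyperthread_ratio  -- logical processors exposed per physical processor package
FROM sys.dm_os_sys_info;
```

If hyper-threading is disabled inside the VM, cpu_count drops by half while the physical core count stays the same.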

Exhibit B: Marketing language used for VMs that don’t allow SMT

Quoting from the Famsv7 series document, with emphasis mine:

The Famsv7-series utilizes AMD’s 5th Generation EPYC™ 9005 processor that can achieve a boosted maximum frequency of up to 4.5 GHz. The Famsv7-series VM comes without Simultaneous Multithreading (SMT), meaning a vCPU is now mapped to a full physical core, allowing software processes to run on dedicated and uncontested resources. These new full core VMs suit workloads demanding the highest CPU performance. Famsv7-series offers up to 80 full core vCPUs and 640 GiB of RAM.

These documents will present the VMs in the best possible light, and there sure is a lot of writing stressing physical cores. There’s no similar language for the document describing the Ebdsv5 and Ebsv5 series.

Exhibit C: Microsoft’s AI helper

a68 ai bad

a68 ai good

  • Check
  • out
  • my
  • superior
  • use
  • of
  • bullet
  • points
  • AI
  • could
  • never

Exhibit D: Not enough sockets

The Ebdsv5 and Ebsv5 series are backed by the Intel® Xeon® Platinum 8370C (Ice Lake) processor. This processor has a physical core count of 32, a logical core count of 64, and can support up to a two socket configuration. Microsoft offers VMs above 64 vCPU, so presumably the vCPU count cannot represent physical cores. The only alternative is some kind of super NUMA that is stitching together sockets from different physical hosts into a single VM, and it’s best not to think about the implications of that.

Exhibit E: Simple scaling testing on the Standard_E8bds_v5

The scalability tests described here were also run against a Standard_E8bds_v5 Azure VM. No special configuration was done for the VM. Tests were run after installing SQL Server 2022 Developer Edition and SSMS. The scaling of the Standard_E8bds_v5 Azure VM acts as expected for a VM with 4 physical cores and 8 virtual cores:

a68 new table

The Difference

Consider a small company using VMware which is able to fit its entire production workload onto a two socket host with little to no oversubscription of CPU resources. Suppose that company has a single 8 vCPU VM running SQL Server and owns SQL licenses for 8 cores. ESXi will preferentially schedule those 8 vCPUs onto 8 different physical cores, so the VM will likely scale as if it has 8 physical cores:

ESXi hosts manage processor time intelligently to guarantee that load is spread smoothly across processor cores in the system. Logical processors on the same core have consecutive CPU numbers, so that CPUs 0 and 1 are on the first core together, CPUs 2 and 3 are on the second core, and so on. Virtual machines are preferentially scheduled on two different cores rather than on two logical processors on the same core.

Of course, this is not guaranteed behavior. However, SQL licenses can be expensive, so it can often make business sense to make sure that the SQL Server VM is able to get as much physical CPU power as allowed by the vCPU count. This is entirely different conceptually compared to getting a VM in the cloud. The rented cloud VM will run on a physical host with other VMs from different customers. Cloud providers carefully limit the resources available to each VM to make things fair for everyone instead of giving as many resources as possible to the VMs with expensive software licensing costs per core.

The Trap

Mapping an on-premises VM to the closest Azure VM in terms of CPU power requires careful analysis accounting for differences in clock speed, processor generation, resource usage within the VM, oversubscription of the VM host, and other factors. To keep things simple, imagine going dumpster diving on Microsoft property and building a single socket Intel® Xeon® Platinum 8370C (Ice Lake) server. All of the configuration scenarios in the table below are appropriately sized for their SQL Server workloads and there is no oversubscription of CPU on the VM host, when applicable. Matching the effective physical CPU count in Azure VMs requires doubling the SQL licenses for many different configurations of the OS running SQL Server:

a68 trap

This could be quite the expensive surprise for SQL Server Enterprise Edition.

The Escape

Paying monthly for new licenses instead of bringing your own licenses with software assurance does not avoid this problem. Looking today at US East, the F16ams v7 model costs $5,962.64 per month for a SQL Enterprise VM and $1,810.40 per month for a Windows OS VM, for an effective SQL licensing cost of $259.52 per physical core. The E16bds v5 model costs $5,892.56 per month for a SQL Enterprise VM and $1,512.56 per month for a Windows OS VM, for an effective SQL licensing cost of $547.50 per physical core. Naturally, the license costs per vCore are quite similar between models.
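As a sanity check on those numbers, the per-physical-core figures fall out of simple arithmetic. The Famsv7 VM has no SMT, so its 16 vCPUs are 16 physical cores; the Ebdsv5 VM’s 16 vCPUs are 8 physical cores:

```sql
-- Prices as quoted above for US East
DECLARE
    @F16ams_SQL DECIMAL(10, 2) = 5962.64,
    @F16ams_OS  DECIMAL(10, 2) = 1810.40,
    @E16bds_SQL DECIMAL(10, 2) = 5892.56,
    @E16bds_OS  DECIMAL(10, 2) = 1512.56;

SELECT
    (@F16ams_SQL - @F16ams_OS) / 16 AS F16ams_per_physical_core, -- 4152.24 / 16, about 259.52
    (@E16bds_SQL - @E16bds_OS) / 8  AS E16bds_per_physical_core; -- 4380.00 / 8 = 547.50
```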

Constrained vCPU sizes for database workloads can be used to reduce licensing costs, but this feature is designed to offer better memory, storage, and I/O bandwidth at lower CPU counts. It is unlikely that reducing the vCPU size in this way will somehow lead to physical core scaling instead of vCore scaling behavior. However, note that this configuration was not directly tested.

Sadly, there is no recent licensing update that allows for CPU affinity to be used to reduce licensing costs. All of the vCPUs exposed within the VM OS must be licensed for SQL Server.

There are a few VM series offered within the Azure VM family which do not offer SMT. This directly resolves the license trap, but there aren’t many choices available at the time of publication of this blog post. The Famsv7 series mentioned earlier is one modern example, and perhaps this is a great CPU for SQL Server workloads, but its VMs do not offer any local temporary storage.

It is otherwise possible to disable hyper-threading within the VM using a preview feature. This will reduce licensing costs for Windows OS VMs for which a SQL Server license is supplied by the user, but will not reduce licensing costs when paying directly for new licenses by renting a “SQL Server on Azure Windows” VM. Disabling hyper-threading may not be straightforward to do and hesitance to rely on preview features is understandable, but the potential license cost savings here can be huge depending on the size of the VM.

Final Thoughts

Some cloud vendors offer virtual machines with “hyper-threaded cores” which seems to allow them to offer bigger vCPU counts for a lower price. Unfortunately, a virtual machine with lots of weak vCPUs is not cost-effective for software with a high license cost per vCPU, such as SQL Server Enterprise Edition. Be sure to understand what each cloud vendor’s vCPU count truly represents in terms of supporting your workload. Thanks for reading!

Double Your Core Count With Azure SQL Managed Instance

I’m going to open with a perhaps controversial statement: “when you buy 4 vCores on the Azure SQL Managed Instance platform, what you’re actually buying is 2 physical cores presented as 4 hyperthreaded cores to SQL Server”. That means that if you have 8 physical cores on your SQL Server machine today then your starting Managed Instance vCore equivalent count could be closer to 16 vCores instead of 8. Perhaps this is already well known to everyone else, but I couldn’t find any (accurate) writing on this topic so I gave it a shot.

The Test Machines

A CPU core is only as good as its performance, so instead of theoretical stuff, I’m simply going to test identical, synthetic workloads on 3 different machines:

  • An Intel i7-9700k processor gaming PC with Windows Home (big regret)
  • An 8 core VMware VM on an Intel Gold 6248R physical host without any special configuration
  • An 8 vCore next-gen general purpose Azure SQL Managed Instance on a Dev/Test subscription in US East

If the Managed Instance has the equivalent of 8 physical cores then I expect to see similar scaling on all three machines. If it effectively only has four physical cores then I expect to see much worse scaling with the Managed Instance compared to the other two machines.

One Big Query

For the first test, I wanted to run a single query with varying MAXDOP on a quiet machine. I wrote a query to burn CPU rather efficiently while minimizing shared state: there’s no I/O, the memory grant is tiny, there’s very little intra-worker communication, and the query uses demand based parallelism to distribute rows between the parallel worker threads. Here is the query that I ran at MAXDOP 1, MAXDOP 4, and at MAXDOP 8:

WITH vCTE AS (
      SELECT v.v
      FROM
      (VALUES
      (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41),(42),(43),(44),(45),(46),(47),(48),(49),(50),(51),(52),(53),(54),(55),(56),(57),(58),(59),(60),(61),(62),(63),(64),(65),(66),(67),(68),(69),(70),(71),(72),(73),(74),(75),(76),(77),(78),(79),(80),(81),(82),(83),(84),(85),(86),(87),(88),(89),(90),(91),(92),(93),(94),(95)
      ) v(v)
)
SELECT MIN(v0.v + ca.v)
FROM vCTE v0
CROSS APPLY (
      SELECT MIN(v1.v + v2.v + v3.v + v4.v) v
      FROM vCTE v1
      CROSS JOIN vCTE v2
      CROSS JOIN vCTE v3
      CROSS JOIN (SELECT TOP (10) v FROM vCTE) v4
      WHERE v1.v > (v0.v - 96)
) ca
OPTION (NO_PERFORMANCE_SPOOL, MAXDOP 8, USE HINT (N'ENABLE_PARALLEL_PLAN_PREFERENCE'));

as well as the testing results:

a67 big elapsed

The bare metal machine’s single-threaded performance is more than twice as fast as the Managed Instance’s. Of course, this isn’t unexpected. The i7-9700k has a higher clock speed and I’m not paying any virtualization/cloud/managed instance taxes. What I think is more notable is how the query performs as MAXDOP increases. The VMware VM and the bare metal machine both have pretty reasonable scaling. It would be nice to see a speedup of exactly 8.0 at MAXDOP 8, but this is fairly difficult to achieve in practice due to OS overhead, SQL Server overhead, and other factors. However, the Managed Instance acts exactly as I would expect from a 4 physical core/8 virtual core machine. The query can run at DOP 8 but the overall speedup barely exceeds the theorized number of physical cores: 4.

Here is the CPU time used by each tested query:

a67 big cpu

Again, the Managed Instance acts as expected for a 4 physical core/8 virtual core machine. DOP can certainly be set above the number of physical cores, but this is going to result in inflated CPU time metrics because two hyperthreads sharing a physical core cannot both execute at full speed at the same time. Note that we do not see the same inflation of CPU time at MAXDOP 8 for the other two test machines.

Many Small Queries

The second and final test runs as many small queries as possible within a five minute window. The number of concurrent sessions varies between 1, 4, and 8. As usual, I am forcing each session onto its own scheduler in order to get the most consistent test results possible. Long-time readers of my blog posts with very good memories already know the usual SQLCMD routine. Here is the query that was run, which is very similar to the previous one for highly technical (laziness) reasons:

WITH vCTE AS (
    SELECT v.v
    FROM
    (VALUES
    (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41),(42),(43),(44),(45),(46),(47),(48),(49),(50),(51),(52),(53),(54),(55),(56),(57),(58),(59),(60),(61),(62),(63),(64),(65),(66),(67),(68),(69),(70),(71),(72),(73),(74),(75),(76),(77),(78),(79),(80),(81),(82),(83),(84),(85),(86),(87),(88),(89),(90),(91),(92),(93),(94),(95)
    ) v(v)
)
SELECT @Dummy = MIN(v0.v + ca.v)
FROM vCTE v0
CROSS APPLY (
    SELECT MIN(v1.v + v2.v + v3.v) v
    FROM vCTE v1
    CROSS JOIN vCTE v2
    CROSS JOIN (SELECT TOP (8) v FROM vCTE) v3
    WHERE v1.v > (v0.v - 96)
) ca
OPTION (NO_PERFORMANCE_SPOOL, MAXDOP 1);

as well as the testing results:

a67 small count

Again, I think that the scaling is what matters here. The Managed Instance cannot exceed four times the throughput of a single session even with 8 concurrent sessions on all different schedulers. As before, the Managed Instance has the same performance profile as a 4 physical core machine. Both of the other machines have quite reasonable scaling, as well as almost triple the Managed Instance’s throughput during the 8 session test.

Some Theoretical Stuff

Microsoft provides some documentation about the CPU models that power the physical machines that host the Managed Instances. The premium series Managed Instance hardware uses the Intel 8370C (Ice Lake) processor which conveniently is not on Intel’s website. Wikipedia says that this processor has 32 physical cores, 64 logical cores, and supports a maximum of two sockets. That means that the biggest possible Intel 8370C physical host has 64 physical cores and 128 logical cores. Microsoft offers a 128 vCore Managed Instance for premium series CPU counts. It is simply impossible to have 128 physical cores with the Intel 8370C processor.

The same argument can be made for the general purpose series. The Intel E5-2673 v4 (Broadwell) has 20 physical cores, 40 logical cores, and supports a maximum of two sockets. An 80 vCore Managed Instance cannot have 80 physical cores on Intel E5-2673 v4 hardware. It is simply not possible.

I admit that I don’t have any insider knowledge here. Maybe I’m totally wrong about what Microsoft’s virtualization layers are doing with respect to Managed Instance. However, the 8 vCore MI seems to behave exactly as I would expect from a 4 physical core/8 virtual core VM, so that’s what my mental model will be until I see new evidence.

Final Thoughts

Of course, your real production workloads will be more complicated than simply burning CPU. You may have over-provisioned your SQL Server machines or maybe you’ve already exposed hyperthreads to your OS running SQL Server. Those of you in those categories may not see a vCore surprise if you migrate to the Managed Instance platform. However, if you typically think in terms of physical cores and your starting point is a well-sized SQL Server machine, your best bet may be to assume that each MI vCore gives half as much CPU power as you’re used to. Thanks for reading!

Emulating GETDATE() on Azure SQL Database

GETDATE() always uses the UTC time zone on Azure SQL Database, which can be a compatibility issue for applications that assume that GETDATE() is using a different time zone. Developers may wish to replace GETDATE() with code that continues to use their expected time zone to avoid UTC time reforms. For example, I’ve seen application code with over 50,000 references to the GETDATE function. Replacing all of those calls with GETUTCDATE() would be a herculean effort.

FAST 1

I don’t know how good Erik’s SEO is, but here’s a simple replacement if you don’t care about performance and just need something that gives you the previous behavior of GETDATE():

CREATE OR ALTER FUNCTION dbo.[GetDateNew]()
RETURNS DATETIME
WITH SCHEMABINDING
AS
BEGIN
    RETURN CAST(SYSDATETIMEOFFSET() AT TIME ZONE N'Central Standard Time' AS DATETIME);
END;
GO

Keep reading if you’d like to do better than that.

AT TIME ZONE At Risk

AT TIME ZONE was a great addition to SQL Server 2016 that allowed for the retirement of many well-meaning but poorly implemented attempts to do time zone conversions in T-SQL. However, it was not without its own problems:

  • The CPU cost per execution was surprisingly high, but this was addressed in SQL Server 2022 and in Azure SQL Database.
  • AT TIME ZONE makes scalar UDFs ineligible for inlining.
  • AT TIME ZONE usage results in an “unknown” cardinality estimate.

You can see the cardinality estimate issue in action by populating a simple table with about 6 million rows.

DROP TABLE IF EXISTS dbo.Level100;

CREATE TABLE dbo.Level100 (
    TenantID INT NOT NULL,
    InsertTime DATETIME NOT NULL
);

INSERT INTO dbo.Level100 (TenantID, InsertTime)
SELECT q.RN % 4, DATEADD(MINUTE, -1 * RN, GETDATE())
FROM
(
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
    FROM master..spt_values t1
    CROSS JOIN master..spt_values t2
) q;

CREATE INDEX TenantID_InsertTime ON dbo.Level100 (TenantID) INCLUDE (InsertTime);

CREATE INDEX InsertTime_TenantID ON dbo.Level100 (InsertTime) INCLUDE (TenantID);

The first query uses GETDATE():

SELECT COUNT(*)
FROM dbo.Level100 l
WHERE l.TenantID = 2
AND l.InsertTime > DATEADD(MINUTE, -10, GETDATE())
OPTION (MAXDOP 1);

The filter against TenantID is expected to return 25% of the rows in the table and the filter against InsertTime is expected to return less than 0.001% of the rows. As expected, the query optimizer chooses to filter against the InsertTime_TenantID index and the query finishes instantly:

a66 good

The second query uses AT TIME ZONE:

SELECT COUNT(*)
FROM dbo.Level100 l
WHERE l.TenantID = 2
AND l.InsertTime > DATEADD(MINUTE, -10, CAST(SYSDATETIMEOFFSET() AT TIME ZONE N'Central Standard Time' AS DATETIME))
OPTION (MAXDOP 1);

Once again, the filter against TenantID is expected to return 25% of the rows in the table. However, the presence of AT TIME ZONE results in an unknown inequality estimate against InsertTime which is 30%. The query optimizer makes a different choice and goes with the TenantID_InsertTime index because 25% < 30%:

a66 bad

Meme Preparation

It seems that we need to return to the bad old days before we had AT TIME ZONE available in SQL Server. One advantage that we have is that the function only needs to do the time zone conversion for the current moment in time. Central time is six hours behind UTC at the time of publication, so we could construct a function like this:

CREATE OR ALTER FUNCTION dbo.[GetDateCT](@PassInGetDateUTC DATETIME)
RETURNS DATETIME
WITH SCHEMABINDING
AS
BEGIN
    RETURN DATEADD(HOUR, -6, @PassInGetDateUTC);
END;
GO
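On SQL Server 2019+ and Azure SQL Database, sys.sql_modules can confirm whether the optimizer considers a scalar UDF eligible for inlining:

```sql
SELECT OBJECT_NAME(object_id) AS function_name, is_inlineable
FROM sys.sql_modules
WHERE object_id = OBJECT_ID(N'dbo.GetDateCT');
```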

The input parameter is there to make the function eligible for inlining because GETDATE() and other similar functions prevent inlining. Returning to the same query as before:

SELECT COUNT(*)
FROM dbo.Level100 l
WHERE l.TenantID = 2
AND l.InsertTime > DATEADD(MINUTE, -10, dbo.[GetDateCT](GETUTCDATE()))
OPTION (MAXDOP 1);

We can see that the new function results in a reasonable cardinality estimate and the query optimizer makes the correct index choice:

a66 also good

Of course, this function definition will need to be updated a few times per year to deal with daylight saving time changes. It isn’t too difficult to design a process to automatically perform these updates, but honestly I expect that the hassle of doing so will exceed the tolerance of most companies. However, this option is available to you if you need to preserve the usually superior cardinality estimation behavior enjoyed with GETDATE().
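One hedged sketch of such a refresh process: pay the AT TIME ZONE cost once at maintenance time to compute the current offset, then rebuild the function with dynamic SQL. This assumes whole-hour offsets and would be run from a scheduled job around each daylight saving transition:

```sql
-- Compute the current UTC offset for Central Time from a single captured instant
DECLARE @Now DATETIMEOFFSET = SYSDATETIMEOFFSET();
DECLARE @OffsetHours INT = DATEDIFF(
    HOUR,
    CAST(@Now AT TIME ZONE N'Central Standard Time' AS DATETIME2),
    CAST(@Now AT TIME ZONE N'UTC' AS DATETIME2)
);

-- Rebuild the function with the offset baked in as a constant
DECLARE @sql NVARCHAR(MAX) = N'
CREATE OR ALTER FUNCTION dbo.[GetDateCT](@PassInGetDateUTC DATETIME)
RETURNS DATETIME
WITH SCHEMABINDING
AS
BEGIN
    RETURN DATEADD(HOUR, -' + CAST(@OffsetHours AS NVARCHAR(5)) + N', @PassInGetDateUTC);
END;';

EXEC (@sql);
```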

How to Convert Time Zones in SQL Server?

a66 maymay

Parsing and Arsing

Any solution you pick, other than a naked reference to SYSDATETIMEOFFSET() along with AT TIME ZONE, may result in parsing issues in your code. For example, some parts of certain queries do not allow for subqueries, so a subquery against a table-valued function will not work as a direct replacement. Even the simplest scalar UDF reference can result in code that fails to parse, such as the following:

CREATE TABLE #t (
    StupidColumn DATETIME dbo.[GetDateNew]()
);

Fortunately, most of these issues are likely rare in practice and should be straightforward to address.

Keep in mind that you also need to fix GETDATE() references in areas such as default column values, constraints, computed columns, default column values for table types, and so on. I would personally use a naked reference to SYSDATETIMEOFFSET() along with AT TIME ZONE for these areas as I see no benefit in using a scalar UDF.

Summary of Issues for Various Solutions

I was going to put some introduction text here, but whatever, you can figure out the point of this section on your own.

Inline table-valued function approach

  • Most T-SQL compatibility issues
  • Very verbose
  • Cardinality issues with AT TIME ZONE

Naked SYSDATETIMEOFFSET() AT TIME ZONE N'Central Standard Time' replacement

  • Very verbose
  • Cardinality issues with AT TIME ZONE

Simple scalar UDF with AT TIME ZONE

  • UDF is not eligible for inlining
  • UDF will return different values as the query executes which probably isn’t what you want

Non-static scalar UDF

  • Function definition must occasionally be refreshed
  • Somewhat verbose
  • Some types of queries, such as those with CTEs, will not allow any UDF to be inlined

Final Thoughts

Messy no-win situations like this highlight the importance of picking the right cloud platform for your application. Azure SQL Managed Instance and Azure VMs avoid this problem entirely. Thanks for reading!

Erik cries when SSMS lies

My SQL Server friendo Mr. Erik C. Darling was recently telling me about some work he did getting batch mode on paging queries. This sounded a bit odd to me because paging queries make me think of small seeks of rows against finely curated nonclustered rowstore indexes, so I asked him for a link to his blog post about it. He grumpily refused and told me to find the link myself, which I eventually did.

High End Machine Performance

Erik’s first attempt used OFFSET/FETCH and resulted in a row mode query:

SQL Server Query Plan

The clustered index scan makes this a bit of a sad paging query. In Erik’s defense, there’s a whole host of real world reasons why you wouldn’t be able to create the perfect nonclustered index for your paging query:

  • End users may choose to sort on many different columns and you can’t index them all
  • A key member of your Index Approval Committee is on vacation
  • You already have more indexes on the table than your number of fingers
  • You’re working with third party software which does not allow you to create custom indexes

Getting back to the query, it doesn’t look that offensive to me. The row mode sort is a parallel top N sort and the overall number of rows to return is low, so each thread can independently sort its rows and return 1000 locally sorted rows to the parent operator. This is about as good as it gets with parallel row mode sorting. This is a row mode only query so the operator times that you see are the sum of that operator’s work and its children. In terms of real work done by the query, the scan clocks in at 1.168 seconds and the sort clocks in at 0.84 seconds. The final accounting at the end by the parent Parallelism (Gather Streams) is misleading at best and an outright LIE at worst. There wasn’t 4 seconds of work done by this query. There was only 2 seconds. The red lines illustrate the problem perfectly and I won’t be elaborating further:

a65 red

Erik’s second attempt uses ROW_NUMBER() and he achieves a plan with some batch mode operators using BMOR (batch mode on row store):

SQL Server Query Plan

The parallel batch mode sort works just fine here in that the single thread output property isn’t an issue. The parent operator is a batch mode window aggregate, but even if it wasn’t, the grandparent is a gather streams operator so the rows would end up on one thread anyway. Actual time statistics accounting works differently for batch mode operators: each batch mode operator only tracks its own work. In terms of real work done by the query, the scan clocks in at 1.1022 seconds and the sort clocks in at 0.892 seconds. This is quite similar to the first attempt. It could be argued that the batch mode sort is more efficient than the row mode top N sort, but I’d call it a wash considering the unpredictable rowstore to batch mode conversion overhead (which does seem to be small for this table).

Low End Machine Performance

I tested on my local machine with 64 GB of RAM, which is less than Erik’s laptop has. My clustered index scans took significantly longer than his, but as usual, there’s a lot to learn from low end machine performance. Let’s go back to the first reason as to why the table might not be indexed well for this particular query:

End users may choose to sort on many different columns and you can’t index them all

Microsoft presents a standard solution for this scenario: the humble nonclustered columnstore index. This will be great for my low end machine because I’ll be able to fit the new NCCI in memory. For those following along at home on their own low end machines, I created the index on every column except the Body column in a very carefree fashion:

CREATE NONCLUSTERED COLUMNSTORE INDEX ncci ON posts (
  Id
, AcceptedAnswerId
, AnswerCount
, ClosedDate
, CommentCount
, CommunityOwnedDate
, CreationDate
, FavoriteCount
, LastActivityDate
, LastEditDate
, LastEditorDisplayName
, LastEditorUserId
, OwnerUserId
, ParentId
, PostTypeId
, Score
, Tags
, Title
, ViewCount
) WITH (MAXDOP = 1);

I ran the OFFSET/FETCH query on my low end machine with MAXDOP 4 and it only took 430 CPU ms and 130 ms of elapsed time:

a65 gotta go fast

That’s a huge improvement compared to the 15-20 second runtime I was experiencing earlier. Interestingly, the second query (the ROW_NUMBER() approach) sticks with the parallel batch mode sort and performs significantly worse in comparison:

a65 too slow

The key difference here is the batch mode Top N sort in the first query. Remember that the query compile process for BMOR is different than what you get when compiling in the presence of a glorious columnstore index. You can get the improved batch mode top N sort by also doing a fake join to an empty CCI table. Serious batch mode connoisseurs should be mindful of the compile differences as they seek to gain the greatest performance benefit possible from batch mode.

Final Thoughts

Friends don’t let friends be lied to by SSMS. Thanks for reading!

I finally got my first unfriendly Stack Overflow comment

I remember feeling a little nervous when I first started contributing to stack exchange. It was supposedly unwelcoming and full of unfriendly people. I even planned on going to the version of the site full of database administrators, which obviously would be much worse than the average exchange site. My worries didn’t last long though. Sure, there was the occasional answer with a gruff tone, but I realized pretty quickly that I was getting help from industry experts for free. I interacted with a lot of cool, knowledgeable people (including the guy who owns this site) and I was able to help a lot of people with their problems. I even eventually got to shake hands with Sean Gallardy, the SQL dumping king.

In my opinion, one of the nicest things that you can do on the internet is spend your free time to help someone else with a problem. I also believe that it’s often whiny and counterproductive to complain about the free help that other people are offering. Those beliefs guided my interactions on the site and I ended up ranked #26 in reputation. The DBA stack exchange was a wonderful place of helping. The help wasn’t always perfect, but what is? Especially something given for free?

The Problem

Speaking of imperfections, I recently had a problem at work with CLR functions that were accessing files on Azure Blob Storage. We were planning for a migration to Azure SQL Database and these functions were flagged as an issue because that platform does not support CLR functions. The ideal solution would be to remove all interactions with the binary file data within SQL Server, but the complexity of the project and the migration timelines made this a tall order. So I went looking for a workaround.

I didn’t find anything except for a poorly answered stack exchange question where someone had what appeared to be a similar problem to mine.

A Solution

After picking at the problem and dealing with AI hallucinations, we eventually found something that worked as a proof of concept in a development environment. I’ll go ahead and reproduce my freshly minted stack overflow answer here:

We need to do something similar as part of a quick migration to Azure SQL Database to replace a CLR function that accessed data in Azure Blob Storage. Basic overview of process:

  1. Use extended event file names to make read and write requests. You can create one file per request and encode the request information in the file name.
  2. Use Azure Event Grid to trigger C# function execution whenever a new file is created. Your function code interacts with Azure Blob Storage and returns the necessary data to your Azure SQL Database.

As a first step, create an extended event session with an event condition that will never trigger and without any targets. I used user_event with an impossible filter value for event_id.
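That first step might look something like this hedged sketch (the session name FileRequestSession is hypothetical; the filter value just needs to be impossible so the event never fires):

```sql
-- A session with one never-firing event and no targets (yet)
CREATE EVENT SESSION FileRequestSession ON DATABASE
ADD EVENT sqlserver.user_event (
    WHERE event_id = 2000000000 -- impossible filter value, per the text above
);
```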

Detailed steps for making a read request for a file:

  1. Take your preferred concurrency protection lock to serialize file creation (sp_getapplock is a common choice).
  2. Use ALTER EVENT SESSION ... ON DATABASE ADD TARGET package0.event_file(SET filename=N'... to create a file. The file name should contain the information necessary to identify the requested file. In our application we use a GUID to identify files so we include the GUID in the file name.
  3. Use ALTER EVENT SESSION ... ON DATABASE DROP TARGET package0.event_file; to remove the file target within SQL Server (the file will persist in Azure Blob Storage).
  4. Release your concurrency protection lock.
  5. Wait in a loop until the requested data shows up in a cache table.
  6. While the T-SQL code waits, the C# function triggers off of the file creation. It parses the file name, makes the API request to get the file from Azure Blob Storage, and writes the file data to the cache table within Azure SQL Database. In our development environment we experienced typical latency of around 500 ms for this step.
  7. The T-SQL wait loop ends and your code now has access to the file data within Azure SQL Database.
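The read-request steps above might be sketched in T-SQL like this; the session name, storage URL, and cache table are all hypothetical placeholders:

```sql
-- Hypothetical sketch of steps 1 through 5; names and URLs are placeholders.
DECLARE @FileGuid UNIQUEIDENTIFIER = NEWID(), @Sql NVARCHAR(MAX);

-- Step 1: serialize file creation.
EXEC sp_getapplock @Resource = N'xe_file_request', @LockMode = N'Exclusive', @LockOwner = N'Session';

-- Step 2: the file name must be a literal in ALTER EVENT SESSION, hence dynamic SQL.
SET @Sql = N'ALTER EVENT SESSION [file_request_signal] ON DATABASE
ADD TARGET package0.event_file (SET filename = N''https://yourstorage.blob.core.windows.net/requests/read_'
    + CAST(@FileGuid AS NVARCHAR(36)) + N'.xel'');';
EXEC (@Sql);

-- Step 3: drop the target; the file persists in Azure Blob Storage.
ALTER EVENT SESSION [file_request_signal] ON DATABASE
DROP TARGET package0.event_file;

-- Step 4: release the lock.
EXEC sp_releaseapplock @Resource = N'xe_file_request', @LockOwner = N'Session';

-- Step 5: poll until the C# function writes the file data to the cache table.
WHILE NOT EXISTS (SELECT 1 FROM dbo.FileCache WHERE FileGuid = @FileGuid)
    WAITFOR DELAY '00:00:00.100'; -- real code should also have a timeout guard
```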

Detailed steps for making a write request for a file:

  1. Write the file data along with an identifying ID to a cache table.
  2. Take your preferred concurrency protection lock to serialize file creation (sp_getapplock is a common choice).
  3. Use ALTER EVENT SESSION ... ON DATABASE ADD TARGET package0.event_file(SET filename=N'... to create a file. The file name can be anything because it just informs your C# code that there’s something to process in the cache table. There’s no need to parse any information from the file name in the C# code.
  4. Use ALTER EVENT SESSION ... ON DATABASE DROP TARGET package0.event_file; to remove the file target within SQL Server (the file will persist in Azure Blob Storage).
  5. Release your concurrency protection lock.
  6. Wait in a loop until the requested data is deleted from the cache table.
  7. While the T-SQL code waits, the C# function triggers off of the file creation. It gets the file data from Azure SQL Database, makes the API request to write the file to Azure Blob Storage, and deletes the row from the cache table to signal success.
  8. The T-SQL wait loop ends and your code proceeds now that the file has been written to Azure Blob Storage.

As an aside, it’s theoretically possible to use sp_trace_generateevent to pass up to 8000 bytes of binary data to an extended event target file. However, extended events are buffered to files and you will likely experience high latency. It is even possible for events to be lost and not written to the file at all. Avoiding these issues is why we went with the approach that creates a single 4 KB file per file request.
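For completeness, a minimal sketch of the sp_trace_generateevent alternative mentioned above (the payload and description are illustrative):

```sql
-- Sketch only: pushes up to 8000 bytes through a user_event.
-- Subject to the buffering latency and potential event loss described above.
DECLARE @payload VARBINARY(8000) = CAST(N'request payload' AS VARBINARY(8000));
EXEC sp_trace_generateevent
    @eventid = 82,               -- user-defined events use ids 82 through 91
    @userinfo = N'read request', -- free-form description
    @userdata = @payload;        -- binary payload captured by the event
```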

Obviously this is a horrible way to abuse Extended Events but this was the best quick fix solution that we found.

Have you ever seen a solution so ugly that it wraps all the way around and starts to look beautiful? Sure, I wouldn’t recommend implementing something like this if you had any other option. But sometimes there’s simply no better option. Anyway, I posted that answer to give a faint sliver of hope to the next poor soul who was trying to solve the same problem. If it inspires someone else to find a solution, that’s awesome. I contribute to stack exchange solely to help others.

Apparently, the Real Problem Was Trying to Help

Some people contribute to the site for other reasons, such as this guy:

a63 so mean

I tried to reply with a comment but stack overflow didn’t let me do it. Seemed to be some kind of bug with the site. I’ll implement another ugly workaround and respond here instead.

The question on stack overflow was posted in July 2022 and has 492 views and +3 net upvotes right now as I write this blog post. I can’t believe that I have to explain this, but I ended up at this “old” stack overflow question because that’s where google took me after trying to find a solution for the same problem. Who the hell cares how old the question is? Would the question somehow be better if I called in a favor with Aaron Bertrand and asked him to edit the database to change the year to 2025 instead of 2022? There’s even a meta post about answering old questions. The overwhelming consensus is that it’s perfectly okay to answer old questions.

Besides, are you sure that you want to discount the value of old questions and answers? Here’s one of your top ranked answers:

a63 point and click lol

I got goosebumps while reading it. Truly extraordinary.

With respect to the question I answered, it is perfectly answerable and meets all of the necessary qualifications for a stack overflow question:

a63 question

The OP doesn’t know how to do something within Azure SQL Database. What is the point of demanding a minimal reproducible example when he doesn’t know how to do it? That’s like asking someone to prove a negative. What do you expect him to provide? A code comment lamenting that he doesn’t know how to do it?

Finally, I followed the principles documented at how to answer while constructing my answer. There is nothing wrong with writing a “non-trivial” or “detailed” answer. There’s even a summary at the top. If you don’t like my answer, great, go downvote it. Please do anything other than arrogantly tut-tutting at me in a comment in a way that’s contrary to stack overflow’s own published principles. Not that it matters because I’ve got no interest in playing unpaid janitor, but I don’t even have permission to close-vote the answer. I just want to help people. Maybe use some of your “community energy” to actually help people? A “Principal Software Engineer at Microsoft” surely can do better than one answer in the last year, right?

a63 zzzzzz

Big Data

Perhaps my experience was caused by a difference in culture between the DBA stack exchange and stack overflow. After all, the stats seem to speak for themselves:

  • DBA stack exchange: ~0 unfriendly comments on 459 answers = 0% unfriendly rate
  • stack overflow: 1 unfriendly comment on 1 answer = 100% unfriendly rate

Database administrators might not be so bad after all!

Final Thoughts

It’s wild how people act on the internet even while presenting their full name, photo, and their workplace. Some people should maybe stick to answering questions about how to reset passwords and leave the technical questions to others. This experience honestly lowered my opinion of the average Microsoft employee. I’d write you guys off completely if Sean wasn’t so good at analyzing dumps. Thanks for reading!

Cursor Declarations That Use LOB Local Variables Can Bloat Your Plan Cache

While optimizing the plan cache in production, I found a pair of stored procedures that each had a single plan cache entry using over 500 MB of memory according to sys.dm_exec_cached_plans. The stored procedures were relatively simple and both execution plans were less than 1 MB when saved as sqlplan files. I was confused, offended, and confused.

First published on MSDN on Jan 09, 2007

Did you know that the memory used for cursors can be cached along with the compiled plan? I’ll include a direct quote of the ancient wisdom:

Inactive cursors are cached in the compiled plan. Just like MXCs there are two lists maintained here: the lookup list and the enum list. The lookup list stores the inactive cursors. When a cursor is de-allocated, the memory used to store the cursor is cached. On subsequent re-execution of the batch with this cursor, the cached cursor memory is reused and re-initialized as an active cursor. The enum list has the list of active and inactive cursors and is used to enumerate all cursors for a compiled plan.

Cursors are re-generatable given a compiled plan. Relative to compiled plans they are also relatively inexpensive to re-generate.

I checked sys.dm_exec_cached_plan_dependent_objects and found that nearly all of the cached memory was used for the cursors instead of the execution plans. This makes sense because there was no reason why a simple procedure should use 500 MB for a cached plan, but why were the cursors using so much memory?

Back to 2022

I was able to create a simple reproduction of the issue on SQL Server 2022 CU13 which is the most recent version currently available at the time of writing. Create a stored procedure that uses the OPENJSON() function along with an NVARCHAR(MAX) local variable as part of a cursor declaration. Here is one example:


CREATE OR ALTER PROCEDURE TestMemoryLeak_1
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @json NVARCHAR(MAX);

    SET @json = N'[
        {
            "Order": {
                "Number": "SO43659",
                "Date": "2024-05-31T00:00:00"
            },
            "AccountNumber": "AW29825",
            "Item": {
                "Price": 2024.9940,
                "Quantity": 1
            }
        },
        {
            "Order": {
                "Number": "SO43661",
                "Date": "2024-06-01T00:00:00"
            },
            "AccountNumber": "AW73565",
            "Item": {
                "Price": 2024.9940,
                "Quantity": 3
            }
        }
    ]';

    DECLARE json_cursor CURSOR FAST_FORWARD FOR 
    SELECT Number, OrderDate, Customer, Quantity
    FROM OPENJSON(@json) WITH (
        Number VARCHAR(200) '$.Order.Number',
        OrderDate DATETIME '$.Order.Date',
        Customer VARCHAR(200) '$.AccountNumber',
        Quantity INT '$.Item.Quantity'
    );

    DECLARE 
        @Number VARCHAR(200),
        @OrderDate DATETIME,
        @Customer VARCHAR(200),
        @Quantity INT;

    OPEN json_cursor;
    FETCH NEXT FROM json_cursor INTO @Number, @OrderDate, @Customer, @Quantity;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- nobody wants to work anymore

        FETCH NEXT FROM json_cursor INTO @Number, @OrderDate, @Customer, @Quantity;
    END;
    CLOSE json_cursor;
    DEALLOCATE json_cursor;

    RETURN;
END;
GO

The amount of memory cached for the cursor execution can be seen with the following code:

DECLARE @plan_handle varbinary(64);
SELECT @plan_handle = plan_handle
FROM sys.dm_exec_procedure_stats
where object_id = object_id('[TestMemoryLeak_1]');

SELECT * FROM 
sys.dm_os_memory_objects
WHERE type = 'MEMOBJ_CURSOREXEC'
AND memory_object_address IN  (
    SELECT memory_object_address
    FROM sys.dm_exec_cached_plan_dependent_objects(@plan_handle)
);

I executed the test stored procedure 1 time, measured the memory usage, then executed it 9999 more times, and finally measured the memory usage again. I expected the memory usage to not change because the cursor always processes the same data. SQL Server betrayed my expectations:

a62 leak

A cursor that processes 2 rows of data has managed to use 163 MB of cached memory. Wonderful. One workaround is to remove OPENJSON() from the cursor definition by inserting the data into a table variable and making the cursor read from the table variable:

DECLARE @json_table TABLE (
    Number VARCHAR(200) NULL,
    OrderDate DATETIME NULL,
    Customer VARCHAR(200) NULL,
    Quantity INT NULL
);

INSERT INTO @json_table (Number, OrderDate, Customer, Quantity)
SELECT Number, OrderDate, Customer, Quantity
FROM OPENJSON(@json) WITH (
    Number VARCHAR(200) '$.Order.Number',
    OrderDate DATETIME '$.Order.Date',
    Customer VARCHAR(200) '$.AccountNumber',
    Quantity INT '$.Item.Quantity'
);

DECLARE json_cursor CURSOR FAST_FORWARD FOR 
SELECT Number, OrderDate, Customer, Quantity
FROM @json_table;

SQL Server no longer betrays my expectations. The amount of cached memory for the cursor does not change between 1 and 10000 executions:

a62 no leak

Here is a comparison using the sys.dm_exec_cached_plans DMV:

a62 compare

Determining which row is associated with each stored procedure is an exercise left up to the reader.

A second workaround is to define the cursor using the LOCAL argument. A third workaround is to define the local variable as NVARCHAR(4000). Thank you to the commenters for both of these workarounds. Note that I originally thought the issue was related to OPENJSON(), but that appears to have been an error on my part. The presence of a LOB variable looks to be the determining factor.
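The cursor-scoped workaround is a one-word change to the repro procedure:

```sql
-- Adding LOCAL prevents the deallocated cursor memory from accumulating
-- in the cached plan between executions.
DECLARE json_cursor CURSOR LOCAL FAST_FORWARD FOR
SELECT Number, OrderDate, Customer, Quantity
FROM OPENJSON(@json) WITH (
    Number VARCHAR(200) '$.Order.Number',
    OrderDate DATETIME '$.Order.Date',
    Customer VARCHAR(200) '$.AccountNumber',
    Quantity INT '$.Item.Quantity'
);
```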

Final Thoughts

Using local LOB variables as part of a cursor declaration can lead to ever-increasing amounts of cached memory associated with its cached plan. It’s possible that SQL Server will prune this plan once it experiences plan cache pressure, but it may also choose to prune other plans instead. I recommend coding defensively by always using the LOCAL argument for your cursor declarations. Thanks for reading!

Another Method to Reduce Temp Table Plan Cache Pollution

SQL Server allows stored procedures to reference temporary tables that were created in an outer scope. However, doing so can cause the plan cache pollution problem described here.

The Polluter

The problem can be observed with a simple repro script. The child stored procedure performs a series of updates on a temp table. The parent procedure creates the temp table, calls the child stored procedure, and returns data from the temp table.

CREATE OR ALTER PROCEDURE dbo.update_temp_table
AS
BEGIN
    SET NOCOUNT ON;
    -- the #outer_scope temp table is created in an outer scope

    UPDATE #outer_scope SET CNT = (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t2) + (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t2)
    WHERE ID = 1;

    UPDATE #outer_scope SET CNT = (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t3) + (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t2)
    WHERE ID = 1;

    UPDATE #outer_scope SET CNT = (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t4) + (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t2)
    WHERE ID = 1;

    UPDATE #outer_scope SET CNT = (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t5) + (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t2)
    WHERE ID = 1;
END;

GO

CREATE OR ALTER PROCEDURE dbo.parent_proc
AS
BEGIN
    SET NOCOUNT ON;

    CREATE TABLE #outer_scope (ID INT NOT NULL, CNT INT NULL);
    INSERT INTO #outer_scope (ID, CNT)
    VALUES (1, NULL);

    EXEC dbo.update_temp_table;

    SELECT * FROM #outer_scope;
END;

GO

I executed the parent_proc procedure on three different sessions and was rewarded with three different cached plans:

a61 cache pollution

The Cleaner

One way to resolve this issue is to change the child procedure to create a new local temp table, to copy the data from the original table into the new one using its own child procedure, to perform the updates against the new table, and to copy back the data into the original table, again in a child procedure. That was a lot of words so perhaps code will be easier to understand:

CREATE OR ALTER PROCEDURE dbo.update_temp_table_import_temp_data
WITH RECOMPILE
AS
BEGIN	
    IF OBJECT_ID(N'tempdb..#outer_scope', 'U') IS NOT NULL
    BEGIN
        INSERT INTO #local_scope
        SELECT * FROM #outer_scope;
    END;
END;

GO


CREATE OR ALTER PROCEDURE dbo.update_temp_table_export_temp_data
WITH RECOMPILE
AS
BEGIN
    IF OBJECT_ID(N'tempdb..#outer_scope', 'U') IS NOT NULL
    BEGIN
        TRUNCATE TABLE #outer_scope;	

        INSERT INTO #outer_scope
        SELECT * FROM #local_scope;
    END;
END;

GO


CREATE OR ALTER PROCEDURE dbo.update_temp_table_NEW
AS
BEGIN
    SET NOCOUNT ON;
    
    CREATE TABLE #local_scope (ID INT NOT NULL, CNT INT NULL);
    
    EXEC dbo.update_temp_table_import_temp_data;
    
    UPDATE #local_scope SET CNT = (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t2) + (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t2)
    WHERE ID = 1;

    UPDATE #local_scope SET CNT = (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t3) + (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t2)
    WHERE ID = 1;

    UPDATE #local_scope SET CNT = (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t4) + (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t2)
    WHERE ID = 1;

    UPDATE #local_scope SET CNT = (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t5) + (SELECT COUNT(*) FROM master..spt_values t1 CROSS JOIN master..spt_values t2)
    WHERE ID = 1;
    
    EXEC dbo.update_temp_table_export_temp_data;
END;	
    
GO


CREATE OR ALTER PROCEDURE dbo.parent_proc_NEW
AS
BEGIN
    SET NOCOUNT ON;

    CREATE TABLE #outer_scope (ID INT NOT NULL, CNT INT NULL);
    INSERT INTO #outer_scope (ID, CNT)
    VALUES (1, NULL);

    EXEC dbo.update_temp_table_NEW;

    SELECT * FROM #outer_scope;
END;

GO

Running the new procedure in three different sessions no longer results in cache pollution:

a61 cache clean

This is because the references to outer-scoped temp tables have been moved to small, simple procedures that are built with a procedure-level RECOMPILE hint. This approach has the following advantages:

  • There is no cache pollution for any of the code.
  • The update statements will be compiled significantly less often, especially on SQL Server 2019 and above.

It also has the following disadvantages:

  • Data from the outer table needs to be written to the local table and data from the local table is written back to the outer table.
  • The definition and columns of the temp table need to be duplicated in additional places in code.

I don’t think that this is the best general solution by any means, but it does fulfill a particular niche use case and I’ve been able to use this method in production code. For other ideas, Erland Sommarskog has a detailed write up of different ways of passing data between procedures.

Final Thoughts

When you need your plan cache to be cleaned there’s really only one person to call:

a61 cache cleaner

Thanks for reading!

Improving AT TIME ZONE Performance

Queries that use the AT TIME ZONE operator may perform worse than desired. For example, Jonathan Kehayias blogged about performance struggles with AT TIME ZONE at sqlskills.com. One key difference between Jonathan’s opinion and my own is that I enjoy writing these kinds of posts.

Test Data

The test data consists of one million rows per day over the month of January 2022 for a total of 31 million rows.

DROP TABLE IF EXISTS dbo.CCIForBlog;

CREATE TABLE dbo.CCIForBlog (
    SaleTimeUTC DATETIME2 NOT NULL,
    WidgetCount BIGINT NOT NULL,
    INDEX CCI CLUSTERED COLUMNSTORE
);

GO

SET NOCOUNT ON;

DECLARE
    @StartDate DATETIME2 = '20220101',
    @DaysToLoad INT = 31,
    @DaysLoaded INT = 0;

WHILE @DaysLoaded < @DaysToLoad
BEGIN
    INSERT INTO dbo.CCIForBlog (SaleTimeUTC, WidgetCount)
    SELECT DATEADD(SECOND, q.RN / 11.5, @StartDate), q.RN / 100000
    FROM
    (
        SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
        FROM master..spt_values t1
        CROSS JOIN master..spt_values t2
    ) q (RN)
    OPTION (MAXDOP 1);

    SET @StartDate = DATEADD(DAY, 1, @StartDate);
    SET @DaysLoaded = @DaysLoaded + 1;
END;

CREATE STATISTICS S1 ON CCIForBlog (SaleTimeUTC);

Data is saved to a table with a clustered columnstore index. Of course, this is a small amount of data for a columnstore table. However, it is more than sufficient to demonstrate AT TIME ZONE as a performance bottleneck.

Filtering

Suppose an end user wants a count of widgets sold between January 3rd and January 6th. A first attempt at this query could look like the following:

SELECT SUM(WidgetCount)
FROM dbo.CCIForBlog c
WHERE c.SaleTimeUTC >= '20220103'
AND c.SaleTimeUTC < '20220106'
OPTION (MAXDOP 1);

This query plays to all of columnstore’s strengths and it only takes around 10 milliseconds to execute. The data was loaded in date order so most of the rowgroups are eliminated. However, end users don’t think in terms of UTC time. The end user actually wanted Central Standard Time. After a bit of research, the query is changed to as follows:

SELECT SUM(WidgetCount)
FROM dbo.CCIForBlog c
WHERE SWITCHOFFSET(c.SaleTimeUTC, 0) AT TIME ZONE 'Central Standard Time' >= '20220103'
AND SWITCHOFFSET(c.SaleTimeUTC, 0) AT TIME ZONE 'Central Standard Time' < '20220106'
OPTION (MAXDOP 1);

The AT TIME ZONE operator is useful when the number of offset minutes is unknown. UTC always has an offset of 0, so it is better to use SWITCHOFFSET(datetimeoffset_expression, 0) as opposed to AT TIME ZONE 'UTC'. Even so, the query still takes over 3 minutes to execute on my machine. Nearly all of the execution time is spent on performing AT TIME ZONE calculations:

a60 slow

Note that using c.SaleTimeUTC AT TIME ZONE 'UTC' AT TIME ZONE 'Central Standard Time' would double the execution time.
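To make the difference concrete, both expressions below produce the same Central Standard Time value, but the first performs one AT TIME ZONE lookup per row while the second performs two:

```sql
DECLARE @utc DATETIME2 = '20220103 12:00:00';

-- One AT TIME ZONE calculation: UTC always has an offset of 0.
SELECT SWITCHOFFSET(@utc, 0) AT TIME ZONE 'Central Standard Time';

-- Two AT TIME ZONE calculations: roughly double the CPU cost.
SELECT @utc AT TIME ZONE 'UTC' AT TIME ZONE 'Central Standard Time';
```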

One idea is to rewrite the filtering so that the time zone conversions are done on the constant values instead of the column:

WHERE c.SaleTimeUTC >= SWITCHOFFSET('20220103' AT TIME ZONE 'Central Standard Time', 0)
AND c.SaleTimeUTC < SWITCHOFFSET('20220106' AT TIME ZONE 'Central Standard Time', 0)

I strongly recommend against this approach. It can lead to wrong results for some time zones and boundary points. Instead, you can use the fact that datetimeoffset only supports an offset of up to ±14 hours. That means that (local - 14 hours) < UTC < (local + 14 hours) is true for any time zone and any point in time. A logically redundant filter can be added to the query:

SELECT SUM(WidgetCount)
FROM dbo.CCIForBlog c
WHERE SWITCHOFFSET(c.SaleTimeUTC, 0) AT TIME ZONE 'Central Standard Time' >= '20220103'
AND SWITCHOFFSET(c.SaleTimeUTC, 0) AT TIME ZONE 'Central Standard Time' < '20220106'
AND c.SaleTimeUTC >= DATEADD(HOUR, -14, '20220103')
AND c.SaleTimeUTC < DATEADD(HOUR, 14, '20220106')
OPTION (MAXDOP 1);

The newly improved query finishes in around 26 seconds. It is able to benefit from rowgroup elimination and performs significantly fewer time zone conversions compared to the original query. In this case, we were able to use knowledge about time zones and a bit of date math to improve performance from over 3 minutes to about 26 seconds.

Grouping

Suppose that an end user wants widget counts summarized by date. A first attempt at this query could look like the following:

SELECT ca.ConvertedDate, SUM(WidgetCount)
FROM dbo.CCIForBlog c
CROSS APPLY (
    SELECT CAST(c.SaleTimeUTC AS DATE)
) ca (ConvertedDate)
GROUP BY ca.ConvertedDate
OPTION (MAXDOP 1);

This query takes about 1 second on my machine. However, once again, the end user wants the data to be in CST instead of UTC. The following approach takes around 3 minutes:

SELECT ca.ConvertedDate, SUM(WidgetCount)
FROM dbo.CCIForBlog c
CROSS APPLY (
    SELECT CAST(SWITCHOFFSET(c.SaleTimeUTC, 0) AT TIME ZONE 'Central Standard Time' AS DATE)
) ca (ConvertedDate)
GROUP BY ca.ConvertedDate
OPTION (MAXDOP 1);

This should not be a surprise because the bottleneck in query performance is performing 31 million AT TIME ZONE calculations. That doesn’t change if the query performs filtering or grouping.

Historically, governments only perform daylight saving time or offset switches on the minute. For example, an offset won’t change at 2:00:01 AM, but it might change at 2:00:00 AM. The source data has one million rows per day, so truncating each datetime to the minute, applying the time zone conversion to the distinct truncated values, and finally grouping by date should lead to a significant performance improvement. One way to accomplish this:

SELECT ca.ConvertedDate, SUM(SumWidgetCount)
FROM 
(
    SELECT DATEADD(MINUTE, DATEDIFF(MINUTE, '20000101', c.SaleTimeUTC), '20000101'), SUM(WidgetCount)
    FROM dbo.CCIForBlog c
    GROUP BY DATEADD(MINUTE, DATEDIFF(MINUTE, '20000101', c.SaleTimeUTC), '20000101')
) q (SaleTimeUTCTrunc, SumWidgetCount)
CROSS APPLY (
    SELECT CAST(SWITCHOFFSET(q.SaleTimeUTCTrunc, 0) AT TIME ZONE 'Central Standard Time' AS DATE)
) ca (ConvertedDate)
GROUP BY ca.ConvertedDate
OPTION (MAXDOP 1);

The new query takes around 4 seconds on my machine. It needs to perform 44650 time zone conversions instead of 31 million. Once again, we were able to use knowledge about time zones and a bit of date math to improve performance.

Functioning

I’ve developed and open sourced replacement functions for AT TIME ZONE to provide an easier way of improving performance for queries that use AT TIME ZONE. The TZGetOffsetsDT2 function returns a pair of offsets and the TZFormatDT2 function transforms those offsets into the desired data type. The filtering query can be written as the following:

SELECT SUM(WidgetCount)
FROM dbo.CCIForBlog c
OUTER APPLY dbo.TZGetOffsetsDT2 (c.SaleTimeUTC, N'UTC', N'Central Standard Time') o
CROSS APPLY dbo.TZFormatDT2 (c.SaleTimeUTC, N'UTC', N'Central Standard Time', o.OffsetMinutes, o.TargetOffsetMinutes) f
WHERE f.ConvertedDateTime2 >= '20220103'
AND f.ConvertedDateTime2 < '20220106'
OPTION (MAXDOP 1);

The new query takes around 10 seconds to execute. The new query plan is able to use batch mode processing at every step:

a60 filter fast


If desired, performance can be further improved by adding the same redundant filters as before:

SELECT SUM(WidgetCount)
FROM dbo.CCIForBlog c
OUTER APPLY dbo.TZGetOffsetsDT2 (c.SaleTimeUTC, N'UTC', N'Central Standard Time') o
CROSS APPLY dbo.TZFormatDT2 (c.SaleTimeUTC, N'UTC', N'Central Standard Time', o.OffsetMinutes, o.TargetOffsetMinutes) f
WHERE f.ConvertedDateTime2 >= '20220103'
AND f.ConvertedDateTime2 < '20220106'
AND c.SaleTimeUTC >= DATEADD(HOUR, -14, '20220103')
AND c.SaleTimeUTC < DATEADD(HOUR, 14, '20220106')
OPTION (MAXDOP 1);

The most optimized version takes around 1 second to execute. Quite an improvement compared to 3 minutes!

The grouping query can also be rewritten using the new functions:

SELECT f.ConvertedDate, SUM(WidgetCount)
FROM dbo.CCIForBlog c
OUTER APPLY dbo.TZGetOffsetsDT2 (c.SaleTimeUTC, N'UTC', N'Central Standard Time') o
CROSS APPLY dbo.TZFormatDT2 (c.SaleTimeUTC, N'UTC', N'Central Standard Time', o.OffsetMinutes, o.TargetOffsetMinutes) f
GROUP BY f.ConvertedDate
OPTION (MAXDOP 1);

This version takes about 7 seconds to execute. This is slower than the date truncation method which took 4 seconds but still much faster than the original AT TIME ZONE method.

Unfortunately, SQL Server limitations require a pair of functions to get the best performance. The open source project does provide a simpler function that can be called by itself but it is only eligible for nested loop joins.

Upgrading

SQL Server 2022 RC1 has signs of improvement for AT TIME ZONE. The basic queries that use AT TIME ZONE in this blog post take about 75 seconds to execute on my machine, so it could be estimated that SQL Server 2022 will reduce the CPU footprint of AT TIME ZONE by 60%. Release candidates are not fully optimized so it’s possible that final performance numbers will be different once the product goes GA. I suspect that these performance improvements are already present in Azure SQL Database but I can’t find any documentation for the change.

Final Thoughts

Please try my open source project if you’re experiencing performance problems with AT TIME ZONE. Thanks for reading!

Why Parallel Queries are Sometimes Slow In SQL Server

You may have noticed large variations in elapsed time for parallel queries while performing query tuning or observing a production workload. This post reviews some of the possible explanations for those variations. I’m testing on SQL Server 2022 RC0 but this behavior can be observed on all currently supported versions of SQL Server.

The Setup

The test data is very simple. Insert ten million sequential integers into a single column table with a clustered index:

CREATE TABLE dbo.ParallelTest (
Id BIGINT NOT NULL,
CONSTRAINT PK_ParallelTest PRIMARY KEY (Id)
);

INSERT INTO dbo.ParallelTest (Id)
SELECT [value]
FROM GENERATE_SERIES(1, 10000000);

Now consider the following query:

SELECT SUM(a.ID + b.ID) s
FROM dbo.ParallelTest a
INNER JOIN dbo.ParallelTest b ON a.ID = b.ID
OPTION (USE HINT ('DISALLOW_BATCH_MODE'));

The query optimizer naturally picks a merge join for this query. The self-joined tables obviously have the same row count and the data is already sorted. That makes this query an ideal candidate for a serial merge join from the optimizer’s point of view. The USE HINT in the query isn’t particularly important and the example works without it, but its presence will be explained later. Here’s what the query plan looks like for me:

a59 serial merge

The query takes 2721 ms of CPU time and elapsed time to execute on my machine. It’s reasonable for CPU and elapsed time to match here because there’s nothing for the query to wait on. It’s the only user query running against the SQL instance. Next I’ll force a parallel merge join query plan by changing the query hints to the following:

SELECT SUM(a.ID + b.ID) s
FROM dbo.ParallelTest a
INNER JOIN dbo.ParallelTest b ON a.ID = b.ID
OPTION (MAXDOP 8, MERGE JOIN, USE HINT('ENABLE_PARALLEL_PLAN_PREFERENCE', 'DISALLOW_BATCH_MODE'));

I chose MAXDOP 8 because my test machine has 8 physical CPU cores. This is an unnatural query plan with a higher parallel cost than serial cost. The query optimizer only picked it because it was forced to. Note the presence of the exchange operators with a relatively high estimated cost:

a59 parallel merge

The parallel query takes 6465 ms of CPU time and 1723 ms of elapsed time to execute on my machine. It finished about 33% faster than the serial query but CPU time more than doubled. The exchange operators added a large overhead for this query. Next I’ll run the same query again but I’ll make one of my CPUs busier than before. Consider the following infinite loop coded as a stored procedure:

CREATE OR ALTER PROCEDURE #p AS
SET NOCOUNT ON;
DECLARE @i INT;
WHILE 1 = 1
SET @i = 0;
GO

EXEC #p;

This code does not accomplish anything besides efficiently using CPU resources, which is exactly what I need for my test. I’m running the merge query at MAXDOP 8 on a machine with 8 CPU cores so I’m pretty much guaranteed to see scheduler contention for some of the workers of the parallel query. They will compete for CPU resources with the infinite loop.

Take a moment to predict the performance penalty that the parallel query will experience due to the busy scheduler. Will it be 2X slower? 3X slower? Worse? For a few hints, recall that hash distributed exchange operators will generally try to force an equal amount of work to each thread. Also recall that a parallel merge join requires order-preserving exchange operators. I’ve included Erik running at MAXDOP 8 to hopefully prevent you from accidentally seeing the answer before you’ve finalized your prediction.

a59 erik

With the busy scheduler, the query now takes 6818 ms of CPU time and 31745 ms of elapsed time to execute on my machine. The same code is over 18X slower simply due to the highly efficient stored procedure that’s running on the same SQL instance. Here are the query wait stats in case they are of interest:

<Wait WaitType="CXPACKET" WaitTimeMs="454673" WaitCount="5262" />
<Wait WaitType="CXSYNC_PORT" WaitTimeMs="32539" WaitCount="41" />
<Wait WaitType="CXSYNC_CONSUMER" WaitTimeMs="43" WaitCount="14" />

scheduler_ring_buffer_recorded

This is an inappropriately brief introduction to a complex topic, but workers cooperate with each other by sharing time on schedulers. Under normal conditions, a worker will execute code on the scheduler for up to 4 milliseconds before yielding. The worker yields in case there’s another worker waiting to execute on the scheduler. When there isn’t one, the original worker immediately starts executing code again. An example of when a worker yields early is if it needs to wait for an I/O operation.

scheduler_ring_buffer_recorded is the name of a debug extended event that you should probably never use. I often don’t follow my own advice, so I used it to get an idea of how often workers were yielding to the scheduler for the merge join queries. I only looked at action types of “SCHEDULER_NONPREEMPTIVE_RESUME” and I admit that this analysis is quite approximate. For the MAXDOP 1 query, there was a total of 705 events over 2872 milliseconds. In other words, the average time spent on the scheduler before yielding is close to the 4 ms soft maximum. This should be expected for the serial query if all of the data is in the buffer pool. There is no memory grant for the query (the reason for the USE HINT) and there’s not really any reason to yield early.

The parallel query had significantly more yields. Looking at one worker thread as an example, it had 14862 events over 1714 milliseconds. In other words, it yielded to the scheduler at around 35 times the rate of the MAXDOP 1 query. The higher number of yields is a performance liability when competing with workers from other queries on the same scheduler. The infinite loop stored procedure dominates the workers from the parallel query and gets around 97% of the CPU time on the scheduler. That's a big part of why the parallel query is 18X slower than before.

Note that you can also get this information by looking at the yield_count column of the sys.dm_os_schedulers DMV, but those numbers are server wide and aren't isolated to one query. With that said, I did get reasonably similar results when comparing that column to the extended event results.
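For reference, a server-wide look at yields per scheduler can be had with something like the query below. All of these columns exist in sys.dm_os_schedulers; to approximate per-query numbers you'd sample it before and after the query and diff the counts.

```sql
-- Server-wide yield counts per visible scheduler (not isolated to one query)
SELECT
    scheduler_id,
    cpu_id,
    yield_count,
    current_tasks_count,
    runnable_tasks_count
FROM sys.dm_os_schedulers
WHERE status = N'VISIBLE ONLINE'
ORDER BY scheduler_id;
```

A nonzero runnable_tasks_count is a quick sign that workers are waiting their turn on that scheduler.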

To summarize, parallel query performance overhead isn’t as simple as just the direct overhead of the exchange operators. A query changing from serial to parallel can result in completely different interactions between the workers and the scheduler. The serial query executed its code 4 milliseconds at a time. The parallel query had to yield significantly more often. This type of change isn’t something that you can directly observe in a query plan.

Why Parallel Queries are Sometimes Slow

There are many reasons why a parallel query may not be as efficient as you hoped. I find these to be the most common:

1) There is an overhead associated with parallel exchange operators which is not present in serial queries. Moving data between worker threads requires additional CPU time which adds to the overall execution time of the query. I believe this is what is described by the following quote from the documentation:

Simple queries, frequently found in transaction processing applications, find the additional coordination required to execute a query in parallel outweigh the potential performance boost.

2) There’s some kind of additional coordination required between threads which delays query completion. For example, an order-preserving repartition streams operator is generally worse for performance than one that doesn’t need to preserve order. A different type of example can be seen in my blog post about slow parallel scans caused by cloud storage.

3) Work may be unevenly distributed between worker threads. For example, an eager index spool can only be built with one thread. Other threads may be idle while that temporary index is built. For a different example, a filtered parallel scan of a table may not output the same number of rows to all threads. All else being equal, a parallel query that cannot use all of the CPUs allowed by MAXDOP is not as efficient as one that can.

4) Work may be evenly distributed between worker threads but some of the CPUs are busier than others. Some parallel queries are as slow as their busiest CPU, even when queries are evenly sharing time on the scheduler. Of course, dramatic performance penalties can be seen when the parallel query gets an uneven amount of time on the scheduler (like in the demo in this blog post).

Mitigations

Individual parallel query performance can degrade due to contention on a single scheduler that’s used by one or more of its workers. You are more likely to run into problems with higher DOP or a higher number of concurrent queries. Lowering MAXDOP decreases the odds that you’ll run into the worst type of contention because the query uses fewer schedulers. Decreasing the number of queries decreases contention per scheduler which also helps.

To explain it in a different way, the simplest way to reduce problems with parallel queries is to have fewer worker threads per scheduler. This is of course a balancing act, but it’s one that you must perform for some workloads. Set Cost Threshold for Parallelism and server/database/resource governor MAXDOP appropriately. When possible, don’t run too many concurrent queries for your hardware.
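The server-level knobs mentioned above look like the following. The values 50 and 4 are illustrative placeholders only; the right numbers depend entirely on your hardware and workload.

```sql
-- Example values only; tune for your own workload.
EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure N'cost threshold for parallelism', 50;
EXEC sys.sp_configure N'max degree of parallelism', 4;
RECONFIGURE;

-- Or scope MAXDOP to a single database (SQL Server 2016+):
ALTER DATABASE SCOPED CONFIGURATION SET MAXDOP = 4;
```

The database scoped configuration wins over the server setting for queries in that database, which is handy when one database hosts the only workload that benefits from high DOP.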

For very important queries for which performance is critical, consider changing to a type of parallelism that isn’t as susceptible to scheduler contention issues. The two most common methods are a parallel apply pattern, perhaps using demand partitioning, or batch mode execution.

Batch mode is yet another complicated topic that I’m not going to explain here, but it uses a fundamentally different method of parallelism compared to row-mode execution which is generally much more robust against problems caused by busy schedulers. Below is a comparison of elapsed times with and without the infinite loop stored procedure running:

The elapsed time of the parallel batch mode query only changes slightly. This is also a nice property when considering overall server scalability.
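If you're not on a version with batch mode on rowstore (SQL Server 2019+) and the table doesn't otherwise justify a columnstore index, one well-known way to make batch mode available is a filtered nonclustered columnstore index whose predicate can never be true. A sketch, with a hypothetical table and column:

```sql
-- dbo.MyFactTable and its ID column are hypothetical stand-ins.
-- The contradictory filter keeps the index permanently empty; its mere
-- presence makes batch mode execution available to queries on the table.
CREATE NONCLUSTERED COLUMNSTORE INDEX nci_batch_mode_trick
ON dbo.MyFactTable (ID)
WHERE ID = -1 AND ID = -2;
```

Test carefully before deploying a trick like this; batch mode changes plan shapes, not just operator execution modes.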

Final Thoughts

I was originally going to blog about DOP feedback, but it's taking a while for my feedback to kick in. Thanks for reading!