Cheat Sheet Archives | Darling Data

The How To Write SQL Server Queries Correctly Cheat Sheet: INTERSECT And EXCEPT

Posted on March 27, 2024 (updated April 1, 2024) by Erik Darling

Who Am I Kidding?

I have never once seen anyone use these. The most glaring issue with them is that unlike a lot of other directives in SQL, these ones just don’t do a good job of telling you what they do, and their behavior is sort of weird.

Unlike EXISTS and NOT EXISTS, which state their case very plainly, as do UNION and UNION ALL, figuring these out is not the most straightforward thing. Especially since INTERSECT has operator precedence rules that many other directives do not.

INTERSECT gives you the set of unique rows that appear in both queries
EXCEPT gives you the set of unique rows from the “first” query that don’t appear in the second

So, cool, if you know you want a unique set of rows from somewhere, these are good places to start.

What’s better, is that they handle NULL values without a lot of overly-protective syntax with ISNULL, COALESCE, or expansive and confusing OR logic.

The tricky part is spotting when you should use these things, and how to write a query that makes the most of them.

And in what order.
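To make the precedence point concrete, here’s a minimal sketch that uses literal values rather than real tables (the column alias n is only for illustration). Per the SQL Server documentation, INTERSECT is evaluated before UNION and EXCEPT unless parentheses say otherwise.

```sql
/*INTERSECT runs first: 2 INTERSECT 2, then 1 UNION ALL that result*/
SELECT 1 AS n
UNION ALL
SELECT 2
INTERSECT
SELECT 2;
/*Returns two rows: 1 and 2*/

/*Parentheses force the UNION ALL to run first*/
(
    SELECT 1 AS n
    UNION ALL
    SELECT 2
)
INTERSECT
SELECT 2;
/*Returns one row: 2*/
```

Same three selects, different answers. When you mix operators, parentheses are cheap insurance.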

Easy Examples

Often the best way to get a feel for how things work is to run simple queries and test the results vs. your expectations, whatever they may be.

I like these queries, because the UserId column in the Comments table is not only NULLable, but contains actual NULLs. Wild, right?

SELECT
    c.*
FROM dbo.Comments AS c
WHERE c.UserId IS NULL
AND   c.Score > 2

INTERSECT

SELECT
    c.*
FROM dbo.Comments AS c
WHERE c.UserId IS NULL
AND   c.Score > 3
ORDER BY
    c.Score;

Running this will return results where a Comment’s Score is greater than 3 only, because that’s the starting point for where both query results begin to match results across all the columns.

Note that the UserId column being NULL doesn’t pose any problems here, and doesn’t require any special handling. Like I said. And will keep saying. Please remember what I say, I beg of you.
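If you don’t have the Comments table handy, the NULL behavior can be sketched with a couple of made-up literal rows. The x and y aliases and their values are hypothetical and only exist for this demo:

```sql
/*Set operators treat NULLs as matching each other*/
SELECT
    x.UserId
FROM (VALUES (CONVERT(integer, NULL)), (1)) AS x (UserId)

INTERSECT

SELECT
    y.UserId
FROM (VALUES (CONVERT(integer, NULL)), (2)) AS y (UserId);
/*Returns one row: NULL*/

/*An equality join does not*/
SELECT
    x.UserId
FROM (VALUES (CONVERT(integer, NULL)), (1)) AS x (UserId)
JOIN (VALUES (CONVERT(integer, NULL)), (2)) AS y (UserId)
  ON x.UserId = y.UserId;
/*Returns no rows: NULL = NULL is not true*/
```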

Moving on to EXCEPT:

SELECT
    c.*
FROM dbo.Comments AS c
WHERE c.UserId IS NULL
AND   c.Score > 2

EXCEPT

SELECT
    c.*
FROM dbo.Comments AS c
WHERE c.UserId IS NULL
AND   c.Score > 3
ORDER BY
    c.Score;

This will only return results from the “first” query (often referred to as the left, or outer query) with a Score of 3, because that’s the only data that exists in it that isn’t also in the “second” (or right, or inner) query.

Both queries will find many of the same rows after Score hits 2 or 3, but those get filtered out to show only the difference(s) between the two.

In case it wasn’t obvious, it’s a bit like using NOT EXISTS, in that rows are only checked, and not projected from the second/right/inner query, looking for Scores greater than 3.

Again, NULLs in the UserId column are handled just fine. No ISNULL/COALESCE/OR gymnastics required.

I’m really trying to drive this home, here.

In The Year 2000

SQL Server 2022 introduced some “modern” ANSI syntax. By modern, I mean that IS DISTINCT FROM was introduced to the standard in 1999, and IS NOT DISTINCT FROM was introduced in 2003.

While no database platform adheres strictly or urgently to ANSI standards, waiting 20 years for an implementation in SQL Server is kind of really-extra-super-duper son-of-a-gun boy-howdy dag-nabbit-buster alright-bucko hold-your-horses listen-here-pal levels of irritating.

Think of all the useless, deprecated, and retired things we’ve gotten in the past 20 years instead of basic functionality. It’s infinitely miffing.

Anyway, I like these additions quite a lot. In many ways, these are extensions of INTERSECT and EXCEPT, because the workarounds for them involved those very directives. Sort of like Microsoft finally adding GREATEST and LEAST, after decades of developers wondering just what the hell to do instead, I hope they didn’t show up too late to keep SQL Server from being bullied by developers who are used to other platforms.

We can finally start to replace mutton-headed, ill-performing syntax like this:

SELECT
    c = COUNT_BIG(*)
FROM dbo.Users AS u
JOIN dbo.Posts AS p
  ON u.Id = p.LastEditorUserId
  OR (p.LastEditorUserId IS NULL);

With stuff that doesn’t suck, like this:

SELECT
    c = COUNT_BIG(*)
FROM dbo.Users AS u
JOIN dbo.Posts AS p
  ON u.Id IS NOT DISTINCT FROM p.LastEditorUserId;

The query plan timings tell enough of a story here:

[Image: SQL Server query plans, captioned “i’ve been waiting for so long”]

But not everyone is able to use the latest and greatest (or least and greatest, ha ha ha) syntax. And the newest syntax isn’t always better for performance, without additional tweaks.
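If you’re stuck on something older than SQL Server 2022, the usual null-safe stand-in is to wrap INTERSECT in EXISTS. This is a minimal sketch using made-up literal rows (the l and r aliases and the id column are hypothetical), not one of the queries timed in this post:

```sql
SELECT
    l.id AS left_id,
    r.id AS right_id
FROM (VALUES (1), (2), (NULL)) AS l (id)
JOIN (VALUES (1), (3), (NULL)) AS r (id)
  ON EXISTS
     (
         SELECT l.id
         INTERSECT
         SELECT r.id
     );
/*Returns (1, 1) and (NULL, NULL)*/
```

Swap the literal rows for real tables and columns and you get the same matching rules as IS NOT DISTINCT FROM, just with more typing.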

And that’s okay with me. I do performance tuning for a living, and my job is to know all the available options and test them.

Like here. Like now.

The Only One I Know

Let’s compare these two queries. It’ll be fun, and if you don’t think it’s fun, that’s why you’ll pay me. Hopefully.

SELECT
    c = COUNT_BIG(*)
FROM dbo.Users AS u
JOIN dbo.Posts AS p
  ON u.Id IS NOT DISTINCT FROM p.LastEditorUserId;

SELECT
    c = COUNT_BIG(*)
FROM dbo.Users AS u
JOIN dbo.Posts AS p
  ON u.Id = p.LastEditorUserId
WHERE EXISTS
(
    SELECT p.LastEditorUserId FROM dbo.Posts AS p
    INTERSECT
    SELECT u.Id FROM dbo.Users AS u
);

Here’s the supporting index that I have for these queries:

CREATE INDEX 
    LastEditorUserId 
ON dbo.Posts
    (LastEditorUserId)
WITH
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

It’s good enough. That’s what counts, I guess. Showing up.

20th Century Boy

At first glance, many queries may appear to be quite astoundingly better. SQL Server has many tricks up its sleeves in newer versions, assuming that you’re ready to embrace higher compatibility levels, and pay through the nose for Enterprise Edition.

This is a great example. Looking at the final query timing, you might think that the new IS [NOT] DISTINCT FROM syntax is a real dumb dumb head.

But unless you’re invested in examining these types of things, you’ll miss subtle query plan differences, which is why you’ll pay me, hopefully.

The second query receives the blessing of Batch Mode On Row Store, while the first does not. If we use a helper object to get them both functioning on even terms, performance is quite close:

SELECT
    c = COUNT_BIG(*)
FROM dbo.Users AS u
JOIN dbo.Posts AS p
  ON u.Id IS NOT DISTINCT FROM p.LastEditorUserId 
LEFT JOIN dbo.columnstore_helper AS ch 
  ON 1 = 0;

In this case, the slightly tweaked query just barely edges out the older way of writing the query.
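The definition of dbo.columnstore_helper isn’t shown in this post. One common way to build a helper like this, offered as an assumption rather than the exact object used in the test, is an empty table with a clustered columnstore index. The column and index names below are made up:

```sql
/*Assumed definition: the post does not show the real one*/
CREATE TABLE
    dbo.columnstore_helper
(
    id integer NOT NULL,
    INDEX c CLUSTERED COLUMNSTORE
);
```

The table stays empty. Joining to it on 1 = 0 adds no rows, but having a columnstore index somewhere in the query makes batch mode available to the optimizer.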

I Can’t Imagine The World Without Me

There are many ways to write a query, and examine the performance characteristics. As SQL Server adds more options, syntax, capabilities, and considerations, testing and judging them all (especially with various indexing strategies) becomes quite an endeavor.

I don’t blame developers for being unaware of, or unable to test, a variety of different rewrites and scenarios. The level of understanding that it takes to tune many queries extends quite beyond common knowledge or sense.

The aim of these posts is to give developers a wider array of techniques, and a better understanding of what works and why, while exposing them to newer options available as upgrade cycles march bravely into the future.

Keeping up with SQL Server isn’t exactly a full time job. Things are changed and added from release to release, which are years apart.

But quite often I find companies full of people struggling to understand basic concepts, troubleshooting, and remediations that are nearly as old as patches for Y2K bugs.

My rates are reasonable, etc.

Thanks for reading!

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.

The How To Write SQL Server Queries Correctly Cheat Sheet: UNION vs. UNION ALL

Posted on March 26, 2024 by Erik Darling

Are You Sure You’re Sure?

UNION and UNION ALL seem to get used with the same lack of discretion and testing as several other things in T-SQL: CTEs vs temp tables, temp tables vs. table variables, etc.

There are many times I’ve seen developers use UNION when result sets have no chance of being non-unique anyway, and many times I’ve seen them use UNION ALL when there would be a great benefit to discarding unnecessary duplicates.

Even worse is when the whole query is written incorrectly in the first place, and both DISTINCT and UNION are dumped all over queries to account for unwanted results across the board.

For example, someone may test a query in isolation, decide that DISTINCT needs to be applied to that result set, and then use UNION when appending another set of results to the final query. Throw in the typical slew of NOLOCK hints and one is left to wonder if anyone even understands what correct output might look like at all.

The answer to most questions about the correct way to write a query of course hinges on the quality of the underlying data, and any observed flaws reported by end users or QA testers.

This all becomes quite difficult to wade through, because developers may understand the correct logic, but not the correct way to implement it.

Just An Onion

To start, let’s flesh out what each operator means in the most basic way.

Using a nifty SQL Server 2022 function, and the power of batch separator loops, we’re going to load the numbers 1-5 into one table and 1-6 into another, twice each.

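CREATE TABLE
    #t1
(
    i integer
);
GO /*Own batch, so GO 2 repeats only the INSERT*/

INSERT
    #t1 (i)
SELECT
    gs.*
FROM GENERATE_SERIES(1, 5) AS gs;
GO 2

CREATE TABLE
    #t2
(
    i integer
);
GO /*Own batch, so GO 2 repeats only the INSERT*/

INSERT
    #t2 (i)
SELECT
    gs.*
FROM GENERATE_SERIES(1, 6) AS gs;
GO 2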

Doing this will provide a unique set of the numbers 1-6 from both temporary tables.

SELECT
    t.i
FROM #t1 AS t

UNION

SELECT
    t.i
FROM #t2 AS t;

Which is not logically equivalent to doing this:

SELECT DISTINCT
    t.i
FROM #t1 AS t

UNION ALL

SELECT DISTINCT
    t.i
FROM #t2 AS t;

The first query will not only deduplicate rows within each query, but also in the final result.

The second query will only deduplicate results from each query, but not from the final result.

To avoid playing word games with you, the first query will return the numbers 1-6 only once, and the second query will return 1-5 once, and 1-6 once.

Some additional sense can be made of the situation by looking at the query plans, and where the distinctness is applied.

To put things plainly: if you’re already using UNION to bring two results together, there’s not a lot of sense in adding DISTINCT to each query.

Precedence, etc.

To better understand how UNION and UNION ALL are applied, I’d encourage you to use this simple example:

/*Changing these to UNION makes no difference*/
SELECT 1
UNION ALL
SELECT 2 
UNION ALL
SELECT 3    
/*Changing these to UNION makes a difference*/
UNION ALL
SELECT 1
UNION ALL
SELECT 2 
UNION ALL
SELECT 3;

Specifically, look at the actual execution plans for these as you quote out ALL from the lines below the second comment.

You may even be surprised by what comes back when you get to the final UNION-ed select!

Orthodoxy

There has been quite a bit of performance debate about UNION and UNION ALL. Obviously, using UNION incurs some overhead to deduplicate results.

When you need it for result correctness, I’d encourage you to think about a few things:

The number of columns you’re selecting
The data types of the columns you’re selecting
What data actually identifies a unique row

I’ve come across many queries that were selecting quite a long list of columns, with lots of string data involved, that did a whole lot better using windowing functions over one, or a limited number of columns, with more manageable data types, to produce the desired results.

Here is a somewhat undramatic example:

DROP TABLE IF EXISTS
    #u1;

SELECT
    c.CreationDate,
    c.PostId,
    c.Score,
    c.Text,
    c.UserId
INTO #u1
FROM dbo.Comments AS c
WHERE c.Score IN (2, 9, 10)
AND   c.UserId IS NOT NULL

UNION

SELECT
    c.CreationDate,
    c.PostId,
    c.Score,
    c.Text,
    c.UserId
FROM dbo.Comments AS c
WHERE c.Score IN (3, 9, 10)
AND   c.UserId IS NOT NULL;

DROP TABLE IF EXISTS
    #u2;

SELECT
    y.CreationDate,
    y.PostId,
    y.Score,
    y.Text,
    y.UserId
INTO #u2
FROM
(
        
    SELECT
        x.*,
        n = 
            ROW_NUMBER() OVER
            (
                PARTITION BY
                    x.UserId,
                    x.Score,                    
                    x.CreationDate,
                    x.PostId
                ORDER BY
                    x.UserId,
                    x.Score,                    
                    x.CreationDate,
                    x.PostId
            )
    FROM
    (
        SELECT
            c.CreationDate,
            c.PostId,
            c.Score,
            c.Text,
            c.UserId
        FROM dbo.Comments AS c
        WHERE c.Score IN (2, 9, 10)
        AND   c.UserId IS NOT NULL
        
        UNION ALL
        
        SELECT
            c.CreationDate,
            c.PostId,
            c.Score,
            c.Text,
            c.UserId
        FROM dbo.Comments AS c
        WHERE c.Score IN (3, 9, 10)
        AND   c.UserId IS NOT NULL
    ) AS x
) AS y
WHERE y.n = 1;

In the first query, we’re doing a straight union of all the columns in the Comments table, which includes the Text column (nvarchar 700).

In the second query, the UNION has been replaced by UNION ALL, and I’m using ROW_NUMBER on the non-text columns, and filtering to only the first result.

Here are the query plans:

If you’re looking at the second query plan and wondering why you’re not seeing the usual traces of windowing functions (window aggregates, or segment and sequence project, a filter operator to get n = 1), I’d highly suggest reading Undocumented Query Plans: The ANY Aggregate.

Like I said, this is a somewhat undramatic example. It only shaves about 500ms off the execution time, though that is technically about 30% faster in this scenario. It’s a good technique to keep in mind.

The index in place for these queries has this definition:

CREATE INDEX
    c   
ON dbo.Comments
    (UserId, Score, CreationDate, PostId)
INCLUDE 
    (Text)
WITH
    (SORT_IN_TEMPDB  = ON, DATA_COMPRESSION = PAGE);

Is UNION Ever Better Than UNION ALL?

There have been a number of times when producing distinct results has improved things rather dramatically, but there are a couple general characteristics they all shared:

Producing unique rows, either via UNION or DISTINCT is not prohibitively time consuming
The source being unique-ified feeds into an operation that is time consuming

Here’s an example:

CREATE INDEX
    not_badges
ON dbo.Badges
    (Name, UserId)
WITH
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

CREATE INDEX
    not_posts
ON dbo.Posts
    (OwnerUserId)
INCLUDE
    (Score, PostTypeId)
WITH
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);
GO
  
DROP TABLE IF EXISTS
    #waypops;

CREATE TABLE
    #waypops
(
  UserId integer NOT NULL
);

INSERT
    #waypops WITH(TABLOCKX)
(
    UserId
)
SELECT
    b.UserId
FROM dbo.Badges AS b
WHERE b.Name IN
(
    N'Popular Question', N'Notable Question', 
    N'Nice Question', N'Good Question', 
    N'Famous Question', N'Favorite Question',
    N'Great Question', N'Stellar Question',
    N'Nice Answer', N'Good Answer', N'Great Answer'
);

SELECT
    wp.UserId,
    SummerHereSummerThere =
        SUM(ca.Score)
FROM #waypops AS wp
CROSS APPLY
(
    SELECT
        u.Score,
        ScoreOrder =
            ROW_NUMBER() OVER
            (
                ORDER BY
                    u.Score DESC
            )
    FROM
    (
        SELECT
            p.Score,
            p.OwnerUserId
        FROM dbo.Posts AS p
        WHERE p.OwnerUserId = wp.UserId
        AND   p.PostTypeId = 1

        UNION /*ALL*/

        SELECT
            p.Score,
            p.OwnerUserId
        FROM dbo.Posts AS p
        WHERE p.OwnerUserId = wp.UserId
        AND   p.PostTypeId = 2
    ) AS u
) AS ca
WHERE ca.ScoreOrder = 0
GROUP BY
    wp.UserId
ORDER BY
    wp.UserId;

Executing this query as UNION-ed results gives us a query that finishes in about 3 seconds.

Note that the Distinct Sort operator chosen to implement the desired results of the UNION reduces the rows from 13,300,793 to 2,406,354. This is especially important when Lazy Table Spools are involved.

Here’s the query plan when it’s executed with UNION ALL:

Execution time goes from 3 seconds to 33 seconds. You may notice that the numbers on the inner side of the nested loops join are much larger across the plan, and that the Lazy Table Spool goes from just under 1 second (2.587 seconds minus 1.612 seconds) to taking about 18 seconds (20 seconds minus 1.7 seconds). The Nested Loops Join also suffers rather dramatically, taking nearly 9 seconds, instead of the original 300ms, largely owing to the fact that it has to deal with 946,677,558 additional rows.

You’d suffer, too. Mightily.

Championship Belt

Choosing between UNION and UNION ALL is of course primarily driven by logical query correctness, but you should fully consider which columns actually identify a unique row for your query.

There are sometimes better ways of identifying uniqueness than comparing every single column being selected out in the final result set.

When you run into slow queries that are using UNION and UNION ALL, it’s usually worth investigating the overall usage, and if using one over the other gives you better performance along with correct results.

Where UNION can be particularly troublesome:

You’re selecting a lot of columns (especially strings)
You’re attempting to deduplicate many rows
You’re not working with a primary key
You’re not working with useful supporting indexes

Where UNION ALL can be particularly troublesome:

You’re selecting a lot of rows, and many duplicates exist in them
You’re sending those results into other operations, like joins (particularly nested loops)
You’re doing something computationally expensive on the results of the UNION ALL

Keep in mind that using UNION/UNION ALL is a generally better practice than writing some monolithic query with endless OR conditions in it.
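As a rough sketch of that last point, here’s an OR across two columns next to a UNION rewrite. It assumes Id uniquely identifies a row in dbo.Posts, and the user id is an arbitrary value picked for illustration:

```sql
/*One query, two columns, one OR*/
SELECT
    c = COUNT_BIG(*)
FROM dbo.Posts AS p
WHERE p.OwnerUserId = 22656
OR    p.LastEditorUserId = 22656;

/*Two simple queries, deduplicated on the key column only*/
SELECT
    c = COUNT_BIG(*)
FROM
(
    SELECT
        p.Id
    FROM dbo.Posts AS p
    WHERE p.OwnerUserId = 22656

    UNION

    SELECT
        p.Id
    FROM dbo.Posts AS p
    WHERE p.LastEditorUserId = 22656
) AS x;
```

Each branch can use an index on its own column, and the UNION only has to compare a single integer to throw out posts that match both predicates. Whether it’s faster for you depends on your indexes and data, so test both.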

Thanks for reading!

The How To Write SQL Server Queries Correctly Cheat Sheet: Views vs. Inline User Defined Functions

Posted on March 19, 2024 by Erik Darling

Making Plans For Nigel

Views get a somewhat bad rap from performance tuners, but… It’s not because views are inherently bad. It’s just that we’ve seen things. Horrible things.

Attack ships on fire off the shoulder of Orion… I watched sea-beams glitter in the dark near the Tannhäuser Gate. All those moments will be lost in time, like tears in rain…

The problem is really the stuff that people stick into views. They’re sort of like a junk drawer for data. Someone builds a view that returns a correct set of results, which becomes a source of truth. Then someone else comes along and uses that view in another view, because they know it returns the correct results, and so on and so on. Worse, views tend to do a bunch of data massaging, left joining and coalescing and substringing and replacing and case expressioning and converting things to other things. The bottom line is that views are as bad as you make them.

The end result is a trash monster with a query plan that can only be viewed in full from deep space.

When critical processes start to rely on these views, things inevitably slow to a crawl.

I’ve said all that about views to say that the exact same problem can happen with inline table valued functions. I worked with a client last year who (smartly) started getting away from scalar and multi-statement functions, but the end results were many, many layers of nested inline functions.

Performance wasn’t great. It wasn’t worse, but it was nothing to gloat and beam over.

The Case For Views

Really, the main reason to use a view over an inline table valued function is the potential for turning it into an indexed view. If Microsoft would put an ounce of effort into making indexed views more useful and usable, it would loom a bit larger.

There are some niche reasons too, like some query generation applications use metadata discovery to build dynamic queries that can’t “see” into inline table valued functions the way they can with views, but I try not to get bogged down in tool-specific requirements like that without good reason.

Both views and inline table valued functions offer schemabinding as a creation option. This, among other incantations, is necessary if you’re going to follow the indexed view path.

But, here we find ourselves at the end of the case for views. Perhaps I’m not digging deep enough, but I can’t find much realistic upside.

While doing some research for this, I read through the CREATE VIEW documentation to see if I was missing anything. I was a bit surprised by this, but don’t see it as a great reason to use them:

CHECK OPTION

Forces all data modification statements executed against the view to follow the criteria set within select_statement. When a row is modified through a view, the WITH CHECK OPTION makes sure the data remains visible through the view after the modification is committed.

If you’re into that sort of thing, perhaps this will make views more appealing to you. I’m not sure I can think of why I’d want this to happen, but 🤷‍♂️
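For anyone who wants to see that behavior rather than read about it, here’s a small sketch. The table and view names are hypothetical and exist only for this demo:

```sql
CREATE TABLE
    dbo.CheckOptionDemo
(
    id integer NOT NULL PRIMARY KEY,
    is_active bit NOT NULL
);
GO

CREATE OR ALTER VIEW
    dbo.ActiveOnly
AS
SELECT
    cod.id,
    cod.is_active
FROM dbo.CheckOptionDemo AS cod
WHERE cod.is_active = 1
WITH CHECK OPTION;
GO

/*Succeeds: the new row is visible through the view*/
INSERT
    dbo.ActiveOnly (id, is_active)
VALUES
    (1, 1);

/*Fails: the new row would not be visible through the view*/
INSERT
    dbo.ActiveOnly (id, is_active)
VALUES
    (2, 0);
```

Without WITH CHECK OPTION, the second insert would go through and the row would simply never show up when you query the view.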

The Case For Inline Functions

We’re squared away on views, and we’re starting with the understanding that either of these module types can be lousy for performance if you put a lousy query in them and fail to create useful indexes for those queries to access data efficiently.

What would sway my heart of stone towards the humble inline table valued function?

Parameters.

Views can’t be created in a way that lets you pass parameters directly to them. Being able to do that can be a huge performance win under the right conditions, especially if you use cross or outer apply to integrate an inline table valued function into your query: you can pass table columns directly into the function as parameter values. Inline table valued functions take the ick away.

You know how with stored procedures, if you want to use one to process multiple rows from a table, the most workable approach is to use a loop or cursor to assign row values to parameters, and then execute the procedure with them?

Just an example, if you had a stored procedure to take (to make it simple, full) backups, it would be handy to be able to do something like this:

EXEC dbo.TakeAFullBackup
    @DatabaseName AS 
    SELECT
        d.name
    FROM sys.databases AS d
    WHERE d.database_id > 4;

But no, we have to write procedural code to get a list of database names, loop through them, and execute the procedure for each one (or some other close-enough approximation).
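For contrast, here’s roughly what the procedural version looks like. This is only a sketch: dbo.TakeAFullBackup is the hypothetical procedure from above, and the cursor name is made up.

```sql
DECLARE
    @DatabaseName sysname;

DECLARE
    database_cursor CURSOR LOCAL FAST_FORWARD
FOR
SELECT
    d.name
FROM sys.databases AS d
WHERE d.database_id > 4;

OPEN database_cursor;

FETCH NEXT
FROM database_cursor
INTO @DatabaseName;

WHILE @@FETCH_STATUS = 0
BEGIN
    /*One procedure call per row, which is the whole complaint*/
    EXEC dbo.TakeAFullBackup
        @DatabaseName = @DatabaseName;

    FETCH NEXT
    FROM database_cursor
    INTO @DatabaseName;
END;

CLOSE database_cursor;
DEALLOCATE database_cursor;
```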

Kinda lame, SQL Server. Kinda lame.

Rat Race

When I first came across this oddity, I probably thought (and wrote) things like: “though this is a rare occurrence in views…”

Time has tried that line of thinking and found it wanting. I’ve seen this happen many, many times over now. It’s funny, the more things you learn that can go wrong in a query plan, the more things you become quite paranoid about. The mental checklist is astounding.

Let’s start, as we often do, with an index:

CREATE INDEX
    p
ON dbo.Posts
    (OwnerUserId, Score DESC)
INCLUDE 
    (CreationDate, LastActivityDate)
WITH
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

Now, before we move on, it’s worth noting that this issue is fixed under certain conditions:

You’re on SQL Server 2017 CU30, and have query optimizer hot fixes enabled
I’m not sure which CU this fix was released for in SQL Server 2019, it’s not in any that I can find easily
You’re on SQL Server 2022 and using compatibility level 160

From my testing, it doesn’t matter which compatibility level you’re in on SQL Server 2017 or 2019, as long as optimizer hot fixes are enabled.

/*Using a database scoped configuration*/
ALTER DATABASE SCOPED CONFIGURATION 
SET QUERY_OPTIMIZER_HOTFIXES = ON;

/*Using a trace flag instead*/
DBCC TRACEON(4199, -1);

/*SQL Server 2022+ only*/
ALTER DATABASE StackOverflow2013
SET COMPATIBILITY_LEVEL = 160;

For our purposes, we’ll be using SQL Server 2022 in compatibility level 150, with query optimizer hot fixes disabled.
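If you want to check where your own database stands before following along, a quick sketch like this reads both settings from the catalog views. Run it in the database you’re testing:

```sql
/*Current compatibility level*/
SELECT
    d.name,
    d.compatibility_level
FROM sys.databases AS d
WHERE d.database_id = DB_ID();

/*Current optimizer hot fix setting: 0 is off, 1 is on*/
SELECT
    dsc.name,
    dsc.value
FROM sys.database_scoped_configurations AS dsc
WHERE dsc.name = N'QUERY_OPTIMIZER_HOTFIXES';
```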

No Problemo (Query)

Here’s a view and a query, where things work just fine:

CREATE OR ALTER VIEW
    dbo.DasView
WITH SCHEMABINDING
AS
SELECT
    p.Score,
    p.OwnerUserId,
    p.CreationDate,
    p.LastActivityDate,
    DENSE_RANK() OVER 
    ( 
        PARTITION BY 
            p.OwnerUserId
        ORDER BY 
            p.Score DESC
    ) AS PostRank
FROM dbo.Posts AS p;
GO

SELECT 
    p.*
FROM dbo.DasView AS p
WHERE p.OwnerUserId = 22656;
GO

The reason this works fine is because the where clause contains a literal value, and not a variable or parameter placeholder.

Everything is how we would expect this query plan to look, given the indexes available.

Si Problemo (View)

Where things become wantonly unhinged is when we supply a placeholder for that literal value.

CREATE OR ALTER PROCEDURE 
    dbo.DasProcedure 
(
    @OwnerUserId integer
)
AS
SET NOCOUNT, XACT_ABORT ON;
BEGIN
    SELECT 
        p.*
    FROM dbo.DasView AS p
    WHERE p.OwnerUserId = @OwnerUserId
    /*OPTION(QUERYTRACEON 4199)*/
    /*OPTION(USE HINT('QUERY_OPTIMIZER_COMPATIBILITY_LEVEL_160'))*/;
END;
GO

EXEC dbo.DasProcedure 
    @OwnerUserId = 22656;
GO

Note that I have a query trace on and use hint here, but quoted out. You could also use these to fix the issue for a single query, but my goal is to show you what happens when things aren’t fixed.

Here’s what that looks like:

Rather than a seek into the index we created, the entire thing is scanned, and we have a filter that evaluates our placeholder from 17 million rows and whittles the results down to 27,901 rows.

No Problemo (Function)

Using an inline table valued function allows us to bypass the issue, without any hints or database settings changes.

CREATE OR ALTER FUNCTION  
    dbo.DasFunction 
(
    @OwnerUserId integer
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
SELECT
    p.Score,
    p.OwnerUserId,
    p.CreationDate,
    p.LastActivityDate,
    DENSE_RANK() OVER 
    ( 
        PARTITION BY 
            p.OwnerUserId
        ORDER BY 
            p.Score DESC
    ) AS PostRank
FROM dbo.Posts AS p
WHERE p.OwnerUserId = @OwnerUserId;
GO

This changes our procedure as well:

CREATE OR ALTER PROCEDURE 
    dbo.DasProcedure 
(
    @OwnerUserId integer
)
AS
SET NOCOUNT, XACT_ABORT ON;
BEGIN
    SELECT 
        p.*
    FROM dbo.DasFunction(@OwnerUserId) AS p;
END;
GO


EXEC dbo.DasProcedure
    @OwnerUserId = 22656;

And our query plan goes back to normal.

Even if you don’t have this specific problem, it’s often worth exploring converting views to inline table valued functions, because quite often there is a common filtering or joining criteria, and having parameters to express that is beneficial in a couple ways:

It better shows the intent of the module and what it can be used for
It prevents developers from forgetting filtering criteria and exploding results
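As a sketch of the parameter-passing point from earlier, an inline table valued function can take a column from another table as its parameter when you use CROSS APPLY. The query below assumes the dbo.DasFunction defined above, and the Reputation filter is only there to keep the outer row count small:

```sql
SELECT
    u.DisplayName,
    df.Score,
    df.PostRank
FROM dbo.Users AS u
CROSS APPLY dbo.DasFunction(u.Id) AS df
WHERE u.Reputation > 999999;
```

There’s no way to write the equivalent against dbo.DasView without a join predicate that someone can forget.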

Thanks for reading!

The How To Write SQL Server Queries Correctly Cheat Sheet: Common Table Expressions

Posted on March 15, 2024 (last modified March 1, 2024) by Erik Darling

Followers

Common table expressions remind me of clothes in the 70s. A bunch of people with no taste convinced a bunch of people with no clue that they should dress like them, and so we got… Well, at least we got the 80s afterwards.

The big draw with common table expressions is that they filled in some blanks that derived tables left unanswered.

The first problem with common table expressions is that most people use them like nose and ear hair trimmers: they just sort of stick them in and wave them around until they’re happy, with very little observable feedback as to what has been accomplished.
The second big problem with common table expressions is that the very blanks they were designed to fill in are also the very big drawbacks they cause, performance-wise. Sort of like a grand mal petard hoisting.

To bring things full circle, asking someone why they used a common table expression is a lot like asking someone why they wore crocheted bell bottoms with a velour neckerchief in the 70s. Someone said it was a good idea, and… Well, at least we got the 80s afterwards.

Much like joins and Venn diagrams, anyone who thinks they have some advanced hoodoo to teach you about common table expressions is a charlatan or a simpleton. They are one of the least advanced constructs in T-SQL, and are no better or worse than any other abstraction layer, with the minor exception that common table expressions can be used to build recursive queries.

Other platforms, enviably, have done a bit to protect developers from themselves, by offering ways to materialize common table expressions. Here’s how Postgres does it, which is pretty much the opposite of how SQL Server does it.

By default, and when considered safe, common table expressions are materialized to prevent re-execution of the query inside them.

You can force the issue by doing this (both examples are from the linked docs):

WITH w AS MATERIALIZED (
    SELECT * FROM big_table
)
SELECT * FROM w AS w1 JOIN w AS w2 ON w1.key = w2.ref
WHERE w2.key = 123;

Or go your own way and choose to not materialize it:

WITH w AS NOT MATERIALIZED (
    SELECT * FROM big_table
)
SELECT * FROM w AS w1 JOIN w AS w2 ON w1.key = w2.ref
WHERE w2.key = 123;

You don’t get those options in SQL Server as of this writing, which really sucks because developers using other platforms may have certain expectations that are, unfortunately, not met.

Likewise, other sane and rational platforms use MVCC (optimistic locking) by default, which SQL Server does not. Another expectation that will unfortunately not be met for cross-platform developers.
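On the materialization half of that, the usual stand-in in SQL Server is to do it by hand with a temp table. This is a sketch of the general technique, not a drop-in replacement for the Postgres syntax, and the temp table name is made up:

```sql
DROP TABLE IF EXISTS
    #materialized;

/*Run the expensive part once and keep the results*/
SELECT
    u.Id,
    u.DisplayName
INTO #materialized
FROM dbo.Users AS u
WHERE u.Reputation > 999999;

/*Reference the stored results as many times as you like*/
SELECT
    first_name = m1.DisplayName,
    second_name = m2.DisplayName
FROM #materialized AS m1
JOIN #materialized AS m2
  ON m1.Id = m2.Id;
```

Had that been a common table expression referenced twice, the query inside it would have been run for each reference.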

Common Table Cult

The amount of developer-defense that common table expressions get is on par with the amount of developer-defense that table variables get.

It’s quite astounding to witness. How these things became such sacred cows is beyond me.

First, there are times when using a common table expression has no impact on anything:

WITH
    nocare AS
(
    SELECT
        u.*
    FROM dbo.Users AS u
    WHERE u.Reputation > 999999
)
SELECT
    n.*
FROM nocare AS n;


WITH
    nocare AS
(
    SELECT
        u.*
    FROM dbo.Users AS u
)
SELECT
    *
FROM nocare AS n
WHERE n.Reputation > 999999;

SQL Server is at least smart enough to be able to push most predicates used outside of common table expressions up into the common table expression.

One limitation on that pushing shows up when you put a windowing function into a common table expression:

WITH 
    nocare AS 
(   
    SELECT   
        v.*,
        n = 
            ROW_NUMBER() OVER 
            ( 
                PARTITION BY 
                    v.UserId
                ORDER BY 
                    v.CreationDate 
            )
    FROM dbo.Votes AS v 
)
SELECT  
    n.*
FROM nocare AS n
WHERE n.VoteTypeId = 8
AND   n.n = 0;

If VoteTypeId were in the PARTITION BY clause of the windowing function, it could be pushed into the common table expression portion of the query. Without it in there, it has to be filtered out later, when the where clause also looks for rows numbered as 0.
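Here’s a sketch of what that change would look like. Be aware that it also changes what the row numbers mean, since they now restart for every vote type within a user, so it’s only a fair swap when that’s the numbering you wanted anyway:

```sql
WITH
    nocare AS
(
    SELECT
        v.*,
        n =
            ROW_NUMBER() OVER
            (
                PARTITION BY
                    v.VoteTypeId, /*now a partitioning column*/
                    v.UserId
                ORDER BY
                    v.CreationDate
            )
    FROM dbo.Votes AS v
)
SELECT
    n.*
FROM nocare AS n
WHERE n.VoteTypeId = 8
AND   n.n = 0;
```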

But this does bring us to a case where common table expressions are generally okay, but would perform equivalently with a derived table: when you need to stack some logic that can’t be performed in a single pass.

Using a common table expression to filter out the results of a windowing function just can’t be done without some inner/outer context. Since objects in the select list are closer than they appear, you can’t reference them in the where clause directly.

I’m fine with that, as shown in the example above.
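For comparison, here’s a sketch of the same query written with a derived table instead. It keeps the same filters as the query above, and the inner/outer context is identical, just without the WITH up top:

```sql
/*
Derived table version of the windowing function query.
Same logic, same filters, same need for an outer query
to filter on the generated row number.
*/
SELECT
    n.*
FROM
(
    SELECT
        v.*,
        n =
            ROW_NUMBER() OVER
            (
                PARTITION BY
                    v.UserId
                ORDER BY
                    v.CreationDate
            )
    FROM dbo.Votes AS v
) AS n
WHERE n.VoteTypeId = 8
AND   n.n = 0;
```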

Common Stacks

Stacked common table expressions are also “fine” up to a point, and with caveats.

One of my absolute favorite paging methods uses stacked common table expressions.

DECLARE
    @page_number int = 1,
    @page_size int = 100;

WITH
    f /*etch*/ AS
(
    SELECT TOP (@page_number * @page_size)
        p.Id,
        n =
            ROW_NUMBER() OVER
            (
                ORDER BY
                    p.Id
            )
    FROM dbo.Posts AS p
    ORDER BY
        p.Id
),
    o /*ffset*/ AS
(
    SELECT TOP (@page_size)
        f.id
    FROM f
    WHERE f.n > ((@page_number - 1) * @page_size)
    ORDER BY
        f.id
)
SELECT  
    p.*
FROM o
JOIN dbo.Posts AS p
  ON o.id = p.Id
ORDER BY
    p.Id;

The reason why this is okay is because each common table expression has a single reference. There are two points in the query plan where data is acquired from the Posts table.

Where things get tricky is when you keep doing it over and over again.

Attack Stacks

Take a query like this, and imagine what the query plan will look like for a moment.

WITH
    top5 AS
(
    SELECT
        p.*,
        n = 
            ROW_NUMBER() OVER
            (
                PARTITION BY
                    p.OwnerUserId
                ORDER BY
                    p.Score DESC
            )
    FROM dbo.Posts AS p
    WHERE p.OwnerUserId = 22656
    AND   p.PostTypeId = 1
)
SELECT
    u.DisplayName,
    t1.Title,
    t2.Title,
    t3.Title,
    t4.Title,
    t5.Title
FROM dbo.Users AS u
LEFT JOIN top5 AS t1
  ON  t1.OwnerUserId = u.Id
  AND t1.n = 1
LEFT JOIN top5 AS t2
  ON  t2.OwnerUserId = u.Id
  AND t2.n = 2
LEFT JOIN top5 AS t3
  ON  t3.OwnerUserId = u.Id
  AND t3.n = 3
LEFT JOIN top5 AS t4
  ON  t4.OwnerUserId = u.Id
  AND t4.n = 4
LEFT JOIN top5 AS t5
  ON  t5.OwnerUserId = u.Id
  AND t5.n = 5
WHERE t1.OwnerUserId IS NOT NULL;

An utter disaster, predictably:

We hit the Posts table a total of five times, or once for each reference back to the original common table expression.

This is not a good use of a common table expression, and is a pattern in general to avoid when using them.

Think of common table expressions sort of like ordering a Rum Martinez. You might be happy when the results eventually show up, but every time you say “Rum Martinez”, the bartender has to go through the whole process again.

There’s no magickal pitcher of Rum Martinez sitting around for your poor bartender to reuse.

That’s called a Shirley Temp Table.

Pivot Peeve

This particular query could use a temp table to materialize the five rows, and re-joining to that would be cheap and easy, even five times, since it’s only five rows going in.

WITH
    top5 AS
(
    SELECT
        p.*,
        n = 
            ROW_NUMBER() OVER
            (
                PARTITION BY
                    p.OwnerUserId
                ORDER BY
                    p.Score DESC
            )
    FROM dbo.Posts AS p
    WHERE p.OwnerUserId = 22656
    AND   p.PostTypeId = 1
)
SELECT
    t.OwnerUserId,
    t.Title,
    n
INTO #top5
FROM top5 AS t
WHERE t.n <= 5;
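Assuming the #top5 temp table from the block above exists in your session, the re-join might look something like this sketch. The Title1 through Title5 aliases are my own addition, to keep the output columns distinguishable:

```sql
/*
Same five joins as the common table expression version,
but each one reads from the five-row temp table instead
of going back to dbo.Posts.
*/
SELECT
    u.DisplayName,
    Title1 = t1.Title,
    Title2 = t2.Title,
    Title3 = t3.Title,
    Title4 = t4.Title,
    Title5 = t5.Title
FROM dbo.Users AS u
LEFT JOIN #top5 AS t1
  ON  t1.OwnerUserId = u.Id
  AND t1.n = 1
LEFT JOIN #top5 AS t2
  ON  t2.OwnerUserId = u.Id
  AND t2.n = 2
LEFT JOIN #top5 AS t3
  ON  t3.OwnerUserId = u.Id
  AND t3.n = 3
LEFT JOIN #top5 AS t4
  ON  t4.OwnerUserId = u.Id
  AND t4.n = 4
LEFT JOIN #top5 AS t5
  ON  t5.OwnerUserId = u.Id
  AND t5.n = 5
WHERE t1.OwnerUserId IS NOT NULL;
```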

You could also just PIVOT this one, too:

WITH
    u AS 
(
    SELECT TOP (5)
        u.DisplayName,
        p.Title,
        n = ROW_NUMBER() OVER (ORDER BY p.Score DESC)
    FROM dbo.Posts AS p
    JOIN dbo.Users AS u
      ON p.OwnerUserId = u.Id
    WHERE p.OwnerUserId = 22656
    AND   p.PostTypeId = 1
    ORDER BY
        p.Score DESC
)
SELECT
    p.*
FROM u AS u
PIVOT
(
    MAX(Title)
    FOR n IN ([1], [2], [3], [4], [5])
) AS p;

For all the problems PIVOT can cause when misused, this is a full 19 seconds faster than our most precious common table expression query.

With a half-decent index, it’d probably finish in just about no time.

I’d take this instead any day.

A Note On Recursion

There may be times when you need to build a recursive expression, but you only need the top N children, or you want to get rid of duplicates in child results.

Since you can’t use DISTINCT, TOP, or OFFSET/FETCH directly in a recursive common table expression, some nesting is required.

Of course, we can’t currently nest common table expressions, and to be clear, I think that idea is dumb and ugly.

If Microsoft gives us nested common table expressions before materialized common table expressions, I’ll never forgive them.

WITH 
    postparent AS 
(
    SELECT 
        p.Id, 
        p.ParentId, 
        p.OwnerUserId,
        p.Score,
        p.PostTypeId,
        Depth = 0,
        FullPath = CONVERT(varchar, p.Id)
    FROM dbo.Posts AS p
    WHERE p.CreationDate >= '20131229'
    AND   p.PostTypeId = 1

    UNION ALL

    SELECT
        p2.Id,
        p2.ParentId,
        p2.OwnerUserId,
        p2.Score,
        p2.PostTypeId,
        p2.Depth,
        p2.FullPath
    FROM
    (
        SELECT
            p2.Id, 
            p2.ParentId, 
            p2.OwnerUserId,
            p2.Score, 
            p2.PostTypeId,
            Depth = pp.Depth + 1,
            FullPath = CONVERT(VARCHAR, CONCAT(pp.FullPath, '/', p2.Id)),
            n = ROW_NUMBER() OVER (ORDER BY p2.Score DESC)
        FROM postparent pp
        JOIN dbo.Posts AS p2
          ON pp.Id = p2.ParentId
        AND p2.PostTypeId = 2
    ) AS p2
    WHERE p2.n = 1
)
SELECT 
    pp.Id,
    pp.ParentId,
    pp.OwnerUserId,
    pp.Score,
    pp.PostTypeId,
    pp.Depth,
    pp.FullPath
FROM postparent AS pp
ORDER BY
    pp.Depth
OPTION(MAXRECURSION 0);

To accomplish this, you need to use a derived table, filtering the ROW_NUMBER function outside of it.

This is a more common need than most developers realize when working with recursive common table expressions, and can avoid many performance issues and max recursion errors.

It’s also a good way to show off to your friends at ~~disco~~ new wave parties.

Common Table Ending

Common table expressions can be handy to add some nesting to your query so you can reference generated expressions in the select list as filtering elements in the where clause.

They can even be good in other relatively simple cases, but remember: SQL Server does not materialize results, though it should give you the option to, and the optimizer should have some rules to do it automatically when a common table expression is summoned multiple times, and when it would be safe to do so. I frequently pull common table expression results into a temp table, both to avoid the problems with re-referencing them, and to separate out complexity. The lack of materialization can be hell on cardinality estimation.

In complicated queries, they can often do more harm than good. Excuses around “readability” can be safely discarded. What is “readable” to you, dear human, may not be terribly understandable to the optimizer. You’re not giving it any better information by using common table expressions, nor are you adding any sort of optimization fence to any queries in them without the use of TOP or OFFSET/FETCH. Row goals are a hell of a drug.
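To illustrate the fence part, here’s a sketch using a very large TOP inside a common table expression. It’s a demonstration of the mechanism, not a recommendation: a predicate written outside the common table expression can’t be pushed below the TOP, because doing so could change which rows qualify. Nothing gets materialized here, and whether a fence helps or hurts depends entirely on the query.

```sql
/*
Sketch only: the TOP stops the outer Reputation filter
from being pushed into the common table expression.
For a query this simple, that is worse than having no
fence at all. The point is to see the difference in the
query plan.
*/
WITH
    fenced AS
(
    SELECT TOP (9223372036854775807)
        u.Id,
        u.DisplayName,
        u.Reputation
    FROM dbo.Users AS u
    ORDER BY
        u.Reputation DESC
)
SELECT
    f.*
FROM fenced AS f
WHERE f.Reputation > 999999;
```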

Thanks for reading!

The How To Write SQL Server Queries Correctly Cheat Sheet: IN And NOT IN

Posted on March 14, 2024March 1, 2024 by Erik Darling

Sweetness, Sweetness

I’ll be brief here, and let you know exactly when I’ll use IN and NOT IN rather than anything else:

When I have a list of literal values

That’s it. That’s all. If I have to go looking in another table for anything, I use either EXISTS or NOT EXISTS. The syntax just feels better to me, and I don’t have to worry about getting stupid errors about subqueries returning more than one value.

For IN clauses, it’s far less of an ordeal, usually. But for NOT IN, there are some additional concerns around NULLable columns.

Of course, actual NULL values really screw things up, but even when SQL Server needs to protect itself against potential NULL values, you can end up in performance hospice.

Historical

First, a little bit of history. With NOT IN. Take the below script, and flip the insert into the @bad table variable to use 2 instead of NULL, after you’ve seen what happens with NULL.

DECLARE
    @good table 
(
    id int NOT NULL   
);

DECLARE
    @bad table
(
    id int NULL
);

INSERT
    @good
(
    id
)
VALUES
    (1);

INSERT
    @bad
(
    id
)
VALUES
    (NULL); /*Change this between NULL and 2*/

SELECT
    records =
        COUNT_BIG(*) /*Should be 1, or something*/
FROM @good AS g
WHERE g.id NOT IN
(
    SELECT
        b.id
    FROM @bad AS b
);

You’ll see pretty quickly that NOT IN gives you wonky results when it hits a NULL.

This is known.
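If you want to see why without any tables involved, here’s a small sketch using literals. NOT IN expands to a series of not-equal comparisons joined by AND, and any comparison to NULL evaluates to UNKNOWN rather than true:

```sql
/*
1 NOT IN (2, NULL) is evaluated as (1 <> 2) AND (1 <> NULL).
The second comparison is UNKNOWN, so the whole predicate
is UNKNOWN, and the CASE expression falls through to ELSE.
*/
SELECT
    not_in_with_null =
        CASE
            WHEN 1 NOT IN (2, NULL)
            THEN 'true'
            ELSE 'not true'
        END, /*returns 'not true'*/
    not_in_without_null =
        CASE
            WHEN 1 NOT IN (2, 3)
            THEN 'true'
            ELSE 'not true'
        END; /*returns 'true'*/
```

A where clause only keeps rows where the predicate is true, which is why the count in the script above comes back as zero when @bad holds a NULL.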

Imaginary NULLs

Let’s take the below setup. Though each table allows NULLs in their single column, no NULL values will be inserted into them.

CREATE TABLE
    #OldUsers
(
    UserId int NULL
);

CREATE TABLE
    #NewUsers
(
    UserId int NULL
);

/*
But neither one will have any NULL values at all!
*/
INSERT
    #OldUsers WITH (TABLOCK)
(
    UserId
)
SELECT
    p.OwnerUserId
FROM dbo.Posts AS p
WHERE p.OwnerUserId IS NOT NULL;

INSERT
    #NewUsers WITH (TABLOCK)
(
    UserId
)
SELECT
    c.UserId
FROM dbo.Comments AS c
WHERE c.UserId IS NOT NULL;

The real lesson here is that if you know that no NULL values are allowed into your tables, you should specify the columns as NOT NULL.

I know, it’s scary. Really scary. Errors. What if. How dare.

But these are the kind of adult decisions you’ll have to make as an application developer.

Be brave.
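A minimal sketch of the braver version, using hypothetical table names so they don’t collide with the temp tables above:

```sql
/*
Same shape as #OldUsers and #NewUsers, but the columns
are declared NOT NULL. Any insert that tries to sneak a
NULL in will fail with an error instead of quietly
succeeding.
*/
CREATE TABLE
    #OldUsersNotNull
(
    UserId int NOT NULL
);

CREATE TABLE
    #NewUsersNotNull
(
    UserId int NOT NULL
);
```

The insert statements from above would work unchanged against these, since they already filter out NULLs.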

Protect Your NULL

The big problem with NOT IN, is that SQL Server goes into defensive driving mode when you use it under NULLable conditions.

I don’t have a license because it would be irresponsible, and I’ve lived my entire life in big cities where having a car would be more trouble than it’s worth. But I assume that what I see people complain about, drivers not knowing what to do the second there’s a rain drizzle or a snow flurry somewhere in the area, is a similarly annoying scenario out there on the roads. All of a sudden, seemingly competent drivers turn into complete basket cases and drive like they’ve got a trunk full of ~~dead bodies~~ clowns.

Here’s an example of a bad way to deal with the situation, vs. a good way to deal with the situation:

/*Bad Way*/
SELECT
    records = COUNT_BIG(*)
FROM #NewUsers AS nu
WHERE nu.UserId NOT IN
(
    SELECT
        ou.UserId
    FROM #OldUsers AS ou
);

/*Good Way*/
SELECT
    records = 
        COUNT_BIG(*)
FROM #NewUsers AS nu
WHERE NOT EXISTS
(
    SELECT
        1/0
    FROM #OldUsers AS ou
    WHERE nu.UserId = ou.UserId
);

Note the very professional formatting and correct syntax. Ahem. Bask.

Count to 10 while you’re basking.

Results

The resulting execution plan for each example should be illuminating. Here they are:

The NOT IN version takes 15 minutes, and the NOT EXISTS version takes 415ms.

Reality Bites

Since there are no NULLs, the first query returns the correct results. But the amount of work SQL Server has to do to make sure there are no NULLs is both absurd and preposterous.

If you’re like me, and you want to throw the query optimizer in the garbage every time you see a TOP over a Scan, you might say something like “an index would make this really fast”.

You wouldn’t be wrong, but most people either:

Take bad advice and never index #temp tables
Create nonclustered indexes on #temp tables that don’t get used

Well, you get what you ~~deserve~~ pay for.
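As a sketch of one option, here are clustered indexes on the two temp tables from earlier. The index names are made up, and whether to create them before or after loading data is something to test in your own environment:

```sql
/*
Assumes #OldUsers and #NewUsers exist as created above.
A clustered index on the single column gives the
optimizer an ordered, seekable structure to work with.
*/
CREATE CLUSTERED INDEX
    ou
ON #OldUsers
    (UserId);

CREATE CLUSTERED INDEX
    nu
ON #NewUsers
    (UserId);
```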

Thanks for reading!

The How To Write SQL Server Queries Correctly Cheat Sheet: Cross Apply And Outer Apply

Posted on March 13, 2024March 13, 2024 by Erik Darling

Situations

I end up converting a lot of derived joins, particularly those that use windowing functions, to use the apply syntax. Sometimes good indexes are in place to support that, other times they need to be created to avoid an Eager Index Spool.

One of the most common questions I get is when developers should consider using apply over other join syntax.

The short answer is that I start mentally picturing the apply syntax being useful when:

There’s a small outer table (FROM), and a large inner table (APPLY)
I need to do a significant amount of work on the inner side of a join
The goal of the query is top N per group, or something similar
I’m trying to get parallel nested loops instead of some alternative plan choice
To replace a scalar UDF in the select list with an inline UDF
In order to use the VALUES construct in an odd way

Most of this is situational, and requires a bit of practice and familiarity to spot quickly.

Both cross and outer apply can be used in similar ways to subqueries in the select list, with the added bonus that you can return multiple columns and rows with apply, which you can’t do in a normal subquery.
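Here’s a sketch of the last item in that list, using apply with the VALUES construct to turn two columns into two rows. I’m assuming the Users table has CreationDate and LastAccessDate columns, as it does in the Stack Overflow database, and the DateType and DateValue names are my own:

```sql
/*
For each row from Users, the VALUES construct produces
two rows with two columns each, which is something a
select list subquery can't do.
*/
SELECT
    u.Id,
    u.DisplayName,
    x.DateType,
    x.DateValue
FROM dbo.Users AS u
CROSS APPLY
(
    VALUES
        ('Created', u.CreationDate),
        ('LastAccess', u.LastAccessDate)
) AS x (DateType, DateValue)
WHERE u.Reputation > 500000
ORDER BY
    u.Id,
    x.DateType;
```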

What Apply Does

The way to think about what apply is doing when a query runs is supplying a table-valued result on the inner side of a join for each row supplied by the outer side of a join.

Here’s a simple example:

SELECT
    u.Id,
    u.DisplayName,
    u.Reputation,
    p.Title,
    p.Score,
    p.CreationDate,
    p.LastActivityDate
FROM dbo.Users AS u
CROSS APPLY
(
    SELECT
        p.*
    FROM dbo.Posts AS p
    WHERE p.OwnerUserId = u.Id
    AND   p.PostTypeId = 1
    ORDER BY
        p.CreationDate DESC
    OFFSET 0 ROWS
    FETCH NEXT 3 ROWS ONLY
) AS p
WHERE u.CreationDate >= '20131230'
ORDER BY
    u.Reputation DESC,
    p.CreationDate DESC;

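We’re getting everyone from the Users table whose account was Created in the final days of 2013 and who has Posted at least one Question, along with up to three of their most recent Questions. Results are ordered by Reputation, and then by when each Question was Created.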

For every qualifying User, we get a tabular result showing the Title, Score, CreationDate, and LastActivityDate of their question.

You can picture it sort of like this:

Some Users have more than three results, and some may have fewer than three results, but since the query is self-limited to only the first three, our query sets a row goal and quits once three are found.

More About The Query

Because the Users table is correlated by Id to the Posts table on OwnerUserId, we need to make sure we have a good index to make that access seekable.

Also because we’re only looking for Questions (PostTypeId = 1), and ordering by the most recent ones (CreationDate DESC), it’s also a wise idea to have those in the key.

It’s also worth talking over an interesting point in the query itself: The select from Posts inside of the apply is doing a select star, sure, but the outer query is only pulling a few of the columns. The optimizer can recognize this, which means we don’t need a gigantic covering index to make this query fast. We also… Don’t really need a covering index at all in this case. Just the key columns are good enough.

CREATE INDEX 
    u 
ON dbo.Users 
    (CreationDate, Reputation, Id) 
INCLUDE 
    (DisplayName) 
WITH 
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE); 

CREATE INDEX 
    p 
ON dbo.Posts 
    (OwnerUserId, PostTypeId, CreationDate) 
WITH 
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

Efficiency Unit

An efficient query plan when using apply will look like this:

An inefficient query plan using apply will look like this:

If you’re experimenting with apply, either cross or outer, pay close attention to the query plans. If someone says “cross apply is always slow”, you can bet they stink at indexes.

Choices, Choices

The choice to use apply at all depends on the goal of the query, and the goals of the query tuner. It’s not always a magic performance bullet, but under the right circumstances, it can really make things fly.

The choice to use cross or outer apply depends on the semantics of the starting query. An inner join commutes easily to cross apply, and a left join commutes easily to outer apply.
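A rough sketch of those conversions, using a simplified query of my own rather than one from the examples above. Each pair should return the same results:

```sql
/*Inner join*/
SELECT
    u.Id,
    p.Title
FROM dbo.Users AS u
JOIN dbo.Posts AS p
  ON p.OwnerUserId = u.Id
WHERE u.Reputation > 500000;

/*Equivalent cross apply: the join condition moves inside*/
SELECT
    u.Id,
    p.Title
FROM dbo.Users AS u
CROSS APPLY
(
    SELECT
        p.Title
    FROM dbo.Posts AS p
    WHERE p.OwnerUserId = u.Id
) AS p
WHERE u.Reputation > 500000;

/*Left join*/
SELECT
    u.Id,
    p.Title
FROM dbo.Users AS u
LEFT JOIN dbo.Posts AS p
  ON p.OwnerUserId = u.Id
WHERE u.Reputation > 500000;

/*Equivalent outer apply: Users with no Posts are kept*/
SELECT
    u.Id,
    p.Title
FROM dbo.Users AS u
OUTER APPLY
(
    SELECT
        p.Title
    FROM dbo.Posts AS p
    WHERE p.OwnerUserId = u.Id
) AS p
WHERE u.Reputation > 500000;
```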

One important difference in how the joins are implemented is in the optimizer’s choice between normal nested loops, where the join is done at the nested loops operator, and apply nested loops, which is when the join keys are pushed to an index seek on the inner side of the join.

For an in-depth look at that, I’d suggest reading Paul White’s post, Apply versus Nested Loops Join.

The optimizer is capable of transforming an apply to a join and vice versa. It generally tries to rewrite apply to join during initial compilation to maximize the searchable plan space during cost-based optimization. Having transformed an apply to a join early on, it may also consider a transformation back to an apply shape later on to assess the merits of e.g. an index loops join.

Just writing a query using apply doesn’t guarantee that you get the apply nested loops version of a nested loops join. Having solid indexes and easy to search predicates can help push things in the right direction.

Thanks for reading!

The How To Write SQL Server Queries Correctly Cheat Sheet: Select List Subqueries

Posted on March 12, 2024March 1, 2024 by Erik Darling

Flash Boy

I think subqueries in select lists are very neat things. You can use them to skip a lot of additional join logic, which can have all sorts of strange repercussions on query optimization, particularly if you have to use left joins to avoid eliminating results.

Subqueries do have their limitations:

They can only return one row
They can only return one column

But used in the right way, they can be an excellent method to retrieve results without worrying about what kind of join you’re doing, and how the optimizer might try to rearrange it into the mix.

Since subqueries are in the select list, and can’t eliminate rows from results, you will most commonly see them incorporated late in query plans (much further to the left) as outer joins. The optimizer is generally smart enough to retrieve data for select list subqueries after as much other filtering as possible has been applied, so they can be evaluated for as few rows in the outer results as possible.

The most important thing you can do as a developer to make sure your select list subqueries are fast is to make sure you have good supporting indexes for them.

Well Done

Let’s start with this query:

SELECT
    u.Id,
    u.DisplayName,
    u.Reputation,
    TopQuestionScore =     
    (
        SELECT TOP (1)
            p.Score
        FROM dbo.Posts AS p
        WHERE p.PostTypeId = 1
        AND   p.OwnerUserId = u.Id
        ORDER BY
            p.Score DESC,
            p.Id DESC
    ),
    TopAnswerScore =     
    (
        SELECT TOP (1)
            p.Score
        FROM dbo.Posts AS p
        WHERE p.PostTypeId = 2
        AND   p.OwnerUserId = u.Id
        ORDER BY
            p.Score DESC,
            p.Id DESC
    ),
    TotalPosts =    
    (
        SELECT
            COUNT_BIG(*)
        FROM dbo.Posts AS p
        WHERE p.OwnerUserId = u.Id
        AND   p.PostTypeId IN (1, 2)
    )
FROM dbo.Users AS u
WHERE u.Reputation > 500000
ORDER BY
    u.Reputation DESC;

The goal is to find every User with a Reputation over 500,000, and then find their:

Top scoring question (with a unique tie-breaker on most recent post id)
Top scoring answer (with a unique tie-breaker on most recent post id)
Total questions and answers

You might look at this query with a deep sense of impending dread, wondering why we should make three trips to the Posts table to get this information. I totally get that.

But let’s say we have these indexes in place:

CREATE INDEX 
    u
ON dbo.Users 
    (Reputation, Id)
INCLUDE 
    (DisplayName)
WITH
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

CREATE INDEX
    p
ON dbo.Posts
    (OwnerUserId, PostTypeId, Score)
WITH
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

They have everything we need to support quick, navigational lookups.

Query Planner

The query plan for this arrangement looks like this, finishing in 23 milliseconds.

If you write select list subqueries, and they’re terribly slow, there’s a very good chance that the indexes you have in place are not up to the job, particularly if you see Eager Index Spools in the query plan.

All of the time in the plan is spent in the final subquery, that counts the total number of questions and answers. But even that, at 23 milliseconds, is not worth heaving our chests over.

Three round trips are not at all a problem here, but let’s compare.

One Way Ticket

I’m not opposed to experimentation. After all, it’s a great way to learn, observe, and become enraged with the state of query optimization generally.

Here are two rewrites of the above query, to only make one trip to the Posts table. The first one uses a derived join, and the second uses apply. They’re both written to use outer joins, to match the semantics of the three subqueries:

/*
Join
*/
SELECT
    u.Id,
    u.DisplayName,
    u.Reputation,
    p.TopQuestionScore,
    p.TopAnswerScore,
    p.TotalPosts
FROM dbo.Users AS u
LEFT JOIN
(
    SELECT
        p.OwnerUserId,
        TopQuestionScore = 
            MAX(CASE WHEN p.PostTypeId = 1 THEN p.Score ELSE NULL END),
        TopAnswerScore = 
            MAX(CASE WHEN p.PostTypeId = 2 THEN p.Score ELSE NULL END),
        TotalPosts = 
            COUNT_BIG(*)
    FROM dbo.Posts AS p
    WHERE p.PostTypeId IN (1, 2)
    GROUP BY
        p.OwnerUserId
) AS p
  ON p.OwnerUserId = u.Id
WHERE u.Reputation > 500000
ORDER BY
    u.Reputation DESC;

/*
Apply
*/
SELECT
    u.Id,
    u.DisplayName,
    u.Reputation,
    p.TopQuestionScore,
    p.TopAnswerScore,
    p.TotalPosts
FROM dbo.Users AS u
OUTER APPLY
(
    SELECT
        p.OwnerUserId,
        TopQuestionScore = 
            MAX(CASE WHEN p.PostTypeId = 1 THEN p.Score ELSE NULL END),
        TopAnswerScore = 
            MAX(CASE WHEN p.PostTypeId = 2 THEN p.Score ELSE NULL END),
        TotalPosts = 
            COUNT_BIG(*)
    FROM dbo.Posts AS p
    WHERE p.OwnerUserId = u.Id 
    AND   p.PostTypeId IN (1, 2)
    GROUP BY
        p.OwnerUserId
) AS p
WHERE u.Reputation > 500000
ORDER BY
    u.Reputation DESC;

A somewhat brief digression here: Query rewrites to use specific syntax arrangements are often not available in ORMs. Many times while working with clients, we’ll stumble across a bushel of quite painful application-generated queries. I’ll show them useful rewrites to improve things, and we’ll all marvel together at how much better things are. I’ll suggest implementing the rewrite as a stored procedure, and all hell will break loose.

Please don’t be one of those developers. Understand the limitations of the technology stack you’re working with. Not everything produced by code is good.

Compare and Contrast

In this case, both of the attempts at rewrites result in identical query plans. The optimizer does a good job here, but both of the single-trip queries are about 2x slower than the original.

In this case, that difference will look and feel microscopic. And it is, mostly because I provided indexes so good that you could write this query any which way and it would work out pretty well.

While one round trip certainly felt more efficient than three, each trip from outer to inner side of the nested loops had a bit more work to do each time, and that added up.

It’s nothing consequential here, but you may run into plenty of situations where it’s far worse (or to be fair, far better).

For me, the real advantage of writing out the three separate subqueries is to better understand which one(s) do the most work, and might need additional work done to make them fast.

When you do everything all at once, you have no idea which piece is responsible for slowdowns. We know from the very first query plan that getting the full count does the most work, but that wouldn’t be obvious to me, you, or anyone else looking at the two query plans in this section.

Thanks for reading!

The How To Write SQL Server Queries Correctly Cheat Sheet: EXISTS and NOT EXISTS

Posted on February 29, 2024March 1, 2024 by Erik Darling

Selectively Numb

The sort of wonderful thing about SQL is that it has many directives which are fairly easy to decipher and use appropriately.

The sort of awful thing about SQL is that it has just as many rules that are somewhat selectively applied as the English language itself.

I have my gripes and grievances with some of the choices, of course, and so will you as you delve further into the language. A petty example is that I sort of wish that SQL used GET instead of SELECT for retrieving data.

Very few people go to the store to select milk, eggs, steak, butter, salt, pepper, and scotch. Most of us just go get it. But enough about breakfast.

Let’s talk about two of the most overlooked and undervalued facilities in the SQL language: EXISTS and NOT EXISTS. Perhaps they’d get more traction if they were called THERE and NOT THERE, but it would be perilously easy to start confusing your WHERE and THERE clauses.

Often besmirched by someone who read a blog post about MySQL in 1998 as “subqueries, which are slower than joins”, they’re two of the most useful things you can grasp and use regularly.

Though they are a bit like subqueries, the columns that you select in an EXISTS or NOT EXISTS subquery can’t be used in the outer query. You can put whatever you want in the select list, from * to 1 to COUNT to 1/0 to the entire contents of the King James Bible, and it will never end up making even the dimmest difference in the world. Likewise, adding DISTINCT, TOP, or any other row-limiting device will do absolutely nothing to change the query plan or performance.

Get over yourself.

Both EXISTS and NOT EXISTS already set a row goal of 1, because all either one has to do is determine if a single row is there or not, just with the logic reversed for each.
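If you’d like to check the select list claim for yourself, here’s a sketch with two queries that differ only in what the EXISTS subquery selects. They should return the same rows, and the 1/0 never raises a divide by zero error because the expression is never evaluated:

```sql
/*Selecting every column inside EXISTS*/
SELECT
    records =
        COUNT_BIG(*)
FROM dbo.Users AS u
WHERE EXISTS
(
    SELECT
        *
    FROM dbo.Posts AS p
    WHERE p.OwnerUserId = u.Id
);

/*Selecting a divide by zero inside EXISTS*/
SELECT
    records =
        COUNT_BIG(*)
FROM dbo.Users AS u
WHERE EXISTS
(
    SELECT
        1/0
    FROM dbo.Posts AS p
    WHERE p.OwnerUserId = u.Id
);
```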

Your First Mistakes

Let’s say someone asks you to gin up a list of Users who have Posted anything at all, but whose Reputation still floats at the dreaded 1.

Your first instinct would likely be to write a query that looks like this.

SELECT 
    u.Id,
    u.DisplayName
FROM dbo.Users AS u
JOIN dbo.Posts AS p
  ON u.Id = p.OwnerUserId
WHERE u.Reputation = 1
ORDER BY 
    u.Id;

But you’d quickly find yourself confronted by many duplicate row values, because that’s what one-to-many joins produce. Duplicates.

Your next move, tongue hanging out, sweating profusely, knuckles creaking, nearly paralyzed by the uncertainty of your continued human functioning, would be to do something like this:

SELECT DISTINCT 
    u.Id,
    u.DisplayName
FROM dbo.Users AS u
JOIN dbo.Posts AS p
  ON u.Id = p.OwnerUserId
WHERE u.Reputation = 1
ORDER BY 
    u.Id;

And, sure, with a small enough data set, this is an okay choice. You passed the pop quiz, hot shot. Your instinct to use DISTINCT was not wholly incorrect, but there’s a better way.

But as you start dealing with larger and more imposing sets of data, DISTINCT will no longer cut it.

What EXISTS Does Different

While EXISTS will still use a join to match rows between tables, the semantics are quite a bit different. It can move on once it has determined that a row is either there or not there.

You don’t need to add DISTINCT, grouping, or anything else to get the results you wanted in the first place.

SELECT 
    u.Id,
    u.DisplayName
FROM  dbo.Users AS u
WHERE u.Reputation = 1
AND EXISTS 
( 
    SELECT 
        1/0 
    FROM dbo.Posts AS p 
    WHERE p.OwnerUserId = u.Id 
)
ORDER BY 
    u.Id;

Once EXISTS locates a match, it moves on to the next value from the outer side (in this case the Users table), and attempts to find a match. If no match is found, the row is discarded, which is common to inner joins.

Where a lot of developers get hung up at first is in assuming that EXISTS and NOT EXISTS work like IN or NOT IN, and they miss the inner where clause to tell the database which rows should match.

I’ve seen a lot of EXISTS queries written, quite incorrectly, like this:

SELECT 
    u.Id,
    u.DisplayName
FROM  dbo.Users AS u
WHERE u.Reputation = 1
AND EXISTS 
( 
    SELECT 
       p.OwnerUserId 
    FROM dbo.Posts AS p
)
ORDER BY 
    u.Id;

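Which will, of course, return every User with a Reputation of 1, whether they’ve Posted anything or not, so long as the Posts table has at least one row in it. Don’t do this.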

The column you select inside of the EXISTS subquery does not imply any sort of matching logic.

Like I said before, it’s essentially discarded by the optimizer.
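Here’s a small, self-contained sketch of the difference, using made-up table variables instead of the real tables:

```sql
DECLARE
    @DemoUsers table
(
    Id int NOT NULL
);

DECLARE
    @DemoPosts table
(
    OwnerUserId int NOT NULL
);

INSERT
    @DemoUsers
(
    Id
)
VALUES
    (1), (2), (3);

INSERT
    @DemoPosts
(
    OwnerUserId
)
VALUES
    (1);

/*
No correlation: the subquery only asks if @DemoPosts
has any rows at all. It does, so all three Ids come back.
*/
SELECT
    u.Id
FROM @DemoUsers AS u
WHERE EXISTS
(
    SELECT
        p.OwnerUserId
    FROM @DemoPosts AS p
);

/*
Correlated: the inner where clause does the matching,
so only Id 1 comes back.
*/
SELECT
    u.Id
FROM @DemoUsers AS u
WHERE EXISTS
(
    SELECT
        1/0
    FROM @DemoPosts AS p
    WHERE p.OwnerUserId = u.Id
);
```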

Your Second Mistakes

No half-assed SQL tutorial is complete without showing you the wrong way to find non-matching rows between two tables.

It will undoubtedly look something like this:

SELECT 
   records = 
       COUNT_BIG(u.Id)
FROM dbo.Users AS u
LEFT JOIN dbo.Posts AS p
  ON u.Id = p.OwnerUserId
WHERE p.Id IS NULL;

It’s not that this pattern is never better, it’s just that it shouldn’t be your go-to for each and every query with this goal in mind.

You take two tables, you join them together, and you add a predicate to your where clause to find rows where an ordinarily not-NULL column returns NULLs.

The problem is that SQL Server’s query optimizer doesn’t contain any logic to turn this into the type of query plan that you’d get using NOT EXISTS instead.

You end up needing to fully join any tables involved together, and then later on use a filter to remove rows where no match was found. This can be incredibly inefficient, especially on large data sets.

One may even be dealing with “big data” when the follies of this paradigm become quite clear.

A generally better approach to writing this type of query is to tell the database you’re using exactly what you’re after and exactly what you expect:

SELECT 
    records = 
        COUNT_BIG(u.Id)
FROM dbo.Users AS u
WHERE NOT EXISTS 
( 
    SELECT 
        1/0 
    FROM dbo.Posts AS p 
    WHERE p.OwnerUserId = u.Id 
);

Your developer-life will be a whole lot less confusing and tiresome if you arm yourself with options and alternatives, which means you’ll have lots of mental energy left over to, like, learn 17 new frameworks and really impress your friends.

Think of the frameworks.

Gear Up

You should make good use of the EXISTS and NOT EXISTS patterns in your queries when you don’t require any rows from another table, and you only need to validate if something is there or not.

In cases where you need to get information from another table, joins are likely the most direct path to getting back the data you need.

But this all brings up an interesting question: what if you want to get back information in the select list without adding in join clauses, worrying about inner, outer, full, or cross, and wondering silently if one day things might go pear shaped?

We’ll talk about that in the next post, when we go over correlated subqueries.

Thanks for reading!

The How To Write SQL Server Queries Correctly Cheat Sheet: Joins

Posted on February 27, 2024March 1, 2024 by Erik Darling

So Many Choices

SQL Server is full of ~~landmines~~ options when you’re writing queries. For most queries, you don’t need much beyond the basics.

Think of your standard CRUD operations. Most don’t even require a join; they’re very straightforward. And hey, if you completely denormalize all your data to one huge table, you’ll never have to think about a lot of this stuff anyway.

It’s only when developers are forced to think about things that things start to go wrong. I don’t mean to pick on developers specifically. It’s the human condition. Thinking often leads to poor choices.

In this post, I’m going to give you some basic guidance on when to use various T-SQL facilities, based on years of finding, fixing, and writing queries.

Some of the details and information may not surprise the more seasoned and spiced of you out there.

Here’s a piece of advice that I give everyone: Always start with a SELECT. I don’t care if the final form of your query is going to be an insert, update, or delete (I do care if it’s going to be a merge, because ew), you should always start off by writing a select, so you can validate query results first. It’s easy enough to change things over when you’re done, but please make sure what you’re changing is what you expect to change. I’d even go one step further and say that the first time you run your modification query, you should do it in a transaction with a ROLLBACK command.

I’ll usually do some variation on this, so I can see inserted and deleted results easily:

BEGIN TRANSACTION
    UPDATE TOP (100)
        u
    SET u.Reputation += 1000
    OUTPUT
        'D' AS d, Deleted.*,
        'I' AS i, Inserted.*
    FROM dbo.Users AS u
    WHERE u.Reputation < 1000
    AND   u.Reputation > 1;
ROLLBACK TRANSACTION;

Anyway, on to the cheat codes.

Inner Joins

Joins combine data horizontally (sideways, for the forgetful). The most basic thing you can do with two tables in a database, really.

The important thing to remember is that in one-to-many and many-to-many relationships, joins return a row for every match, so values from the “one” side will show up more than once.

If you don’t need to show data from another table, don’t use a join. We’ll talk about other options later, but please let this burn into your mind. The number of queries I’ve seen with needless DISTINCT instructions on them is nearing a decent pre-tax cash bonus.
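Here’s a minimal sketch of that duplication, and one way around it. The @Users and @Posts table variables and their rows are made up for illustration; they only stand in for the real tables.

```sql
/*Hypothetical stand-ins for dbo.Users and dbo.Posts*/
DECLARE @Users table
(
    Id integer PRIMARY KEY,
    DisplayName nvarchar(40) NOT NULL
);

DECLARE @Posts table
(
    Id integer PRIMARY KEY,
    OwnerUserId integer NULL
);

INSERT @Users (Id, DisplayName)
VALUES (1, N'Alice'), (2, N'Bob');

INSERT @Posts (Id, OwnerUserId)
VALUES (10, 1), (11, 1), (12, 1);

/*One-to-many join: Alice comes back three times, once per Post*/
SELECT
    u.Id,
    u.DisplayName
FROM @Users AS u
JOIN @Posts AS p
  ON p.OwnerUserId = u.Id;

/*No columns needed from Posts, so only check for a match: Alice comes back once*/
SELECT
    u.Id,
    u.DisplayName
FROM @Users AS u
WHERE EXISTS
(
    SELECT
        1
    FROM @Posts AS p
    WHERE p.OwnerUserId = u.Id
);
```

The second query needs no DISTINCT, because it never multiplied the Users rows in the first place.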

Here’s an example of when a join is necessary. We want to get all of our Users with a Reputation over 500,000, and sum up the Score on all their Posts, plus figure out what kind of Post the points were awarded to.

SELECT
    u.Id,
    u.DisplayName,
    PostType =
        CASE
             p.PostTypeId
             WHEN 1
             THEN 'Question'
             WHEN 2
             THEN 'Answer'
             ELSE 'Other'
        END,
    TotalScore = SUM(p.Score)
FROM dbo.Users AS u
JOIN dbo.Posts AS p
  ON p.OwnerUserId = u.Id
WHERE u.Reputation > 500000
GROUP BY
    u.Id,
    u.DisplayName,
    p.PostTypeId
ORDER BY
    TotalScore DESC;

Because we need multiple columns from the Posts table, we can’t just use a correlated subquery in the select list. Those only allow for one column or expression to be projected from the results.

Since this is an inner join, it restricts the results down only to matching rows. Now, it’s not really possible to get a Reputation over 1 without posting things that other users can vote on, so it doesn’t make sense to use an outer join here.

What if we wanted to find slightly different data?

(Left) Outer Joins

Let’s say we wanted to generate a report of people whose Reputation is sitting at one (the site minimum), to figure out if they’re inactive, unpopular, or if their account has been suspended for some reason.

We could use a query like this to do it.

SELECT
    u.Id,
    u.DisplayName,
    u.Reputation,
    TotalScore = SUM(p.Score),
    c = COUNT_BIG(p.Id)
FROM dbo.Users AS u
LEFT JOIN dbo.Posts AS p
  ON p.OwnerUserId = u.Id
WHERE u.Reputation = 1
GROUP BY 
    u.Id,
    u.DisplayName,
    u.Reputation
ORDER BY
    TotalScore;

Before talking about the logic, it’s important to note that when you’re counting rows from the table that a left join doesn’t preserve (Posts, here), you’ll usually wanna pass a non-nullable column from that table into the counting function, rather than (*), so you don’t count the NULL-filled row that a User with no Posts gets back as if it were a Post.

Primary key columns are your friend for this, but any non-NULLable column will do.
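A small sketch of the difference, using made-up table variables in place of the real Users and Posts tables:

```sql
/*Hypothetical stand-ins: user 1 has two Posts, user 2 has none*/
DECLARE @Users table
(
    Id integer PRIMARY KEY
);

DECLARE @Posts table
(
    Id integer PRIMARY KEY,
    OwnerUserId integer NULL
);

INSERT @Users (Id)
VALUES (1), (2);

INSERT @Posts (Id, OwnerUserId)
VALUES (10, 1), (11, 1);

SELECT
    u.Id,
    count_star = COUNT_BIG(*), /*Counts the NULL-filled row: user 2 gets 1*/
    count_id = COUNT_BIG(p.Id) /*Skips NULLs: user 2 gets 0*/
FROM @Users AS u
LEFT JOIN @Posts AS p
  ON p.OwnerUserId = u.Id
GROUP BY
    u.Id
ORDER BY
    u.Id;
```

Both counts agree for user 1 (two Posts), but only the column-based count reports zero for user 2.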

We need a left join here, because we want everyone with a Reputation of 1, not just those users who have posted. The left join preserves rows from the Users table in that case.

The results we get back find all sorts of interesting things (that I told you we were looking for):

Users who were very active, but then had their accounts suspended
Users who have posted, but were heavily downvoted
Users who haven’t posted at all

I’m not going to talk about right outer joins, because that’s the foolish domain of characterless buffoons who use Venn diagrams to explain join results.

I assume they have good intentions, they just lack the backbone to tell you that there is no natural reason to ever use a right join, that isn’t better logically expressed in a different way.

They’re usually trying to sell you something.

(Full) Outer Joins

In short, these preserve results from both tables, but still with a correlation. I’d nearly put these in the same category as right joins, except they have a couple decent use cases, and aren’t personally offensive to polite society.

Let’s say we want to figure out how many Posts don’t have an associated User, and how many Users don’t have an associated Post all in one query:

SELECT
    PostsWithoutAUser = 
        SUM(CASE WHEN u.Id IS NULL THEN 1 ELSE 0 END),
    UsersWithoutAPost = 
        SUM(CASE WHEN p.Id IS NULL THEN 1 ELSE 0 END)
FROM dbo.Users AS u
FULL JOIN dbo.Posts AS p
  ON p.OwnerUserId = u.Id;

It’s sort of an exception report, to let you know just how much referential integrity your data lacks.

Aside from oddball situations, you shouldn’t have to think much about these in your day to day life.

Cross Joins

Like full joins, I don’t see cross joins used terribly often, though they do have some uses, like populating a grid.

A reasonably worded example would be something like: you have a table of scotch, and a table of glass sizes, and you want to show someone all possible combinations of scotch and glass sizes.
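As a rough sketch of that, with invented table variables and values (nothing here comes from a real schema):

```sql
/*Hypothetical tables and rows, purely for illustration*/
DECLARE @Scotch table
(
    ScotchName varchar(30) NOT NULL
);

DECLARE @GlassSizes table
(
    GlassSize varchar(30) NOT NULL
);

INSERT @Scotch (ScotchName)
VALUES ('Islay'), ('Speyside'), ('Highland');

INSERT @GlassSizes (GlassSize)
VALUES ('Single'), ('Double');

/*No ON clause: every scotch pairs with every glass size, 3 * 2 = 6 rows*/
SELECT
    s.ScotchName,
    g.GlassSize
FROM @Scotch AS s
CROSS JOIN @GlassSizes AS g
ORDER BY
    s.ScotchName,
    g.GlassSize;
```

The row count is always the product of the two inputs, which is why this gets expensive quickly on anything bigger than a drinks menu.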

If you pick a big enough glass, eventually using cross joins in more creative ways will seem like a good idea. One place I’ve been forced to use them is in some of my stored procedures, like sp_PressureDetector.

Here’s one example:

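/*In the procedure this is a parameter; the value here is only illustrative*/
DECLARE
    @cpu_utilization_threshold integer = 50;

SELECT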
    sample_time =
        CONVERT
        (
            datetime,
            DATEADD
            (
                SECOND,
                (t.timestamp - osi.ms_ticks) / 1000,
                SYSDATETIME()
            )
        ),
    sqlserver_cpu_utilization =
        t.record.value('(Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]','int'),
    other_process_cpu_utilization =
        (100 - t.record.value('(Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]','int')
         - t.record.value('(Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]','int')),
    total_cpu_utilization =
        (100 - t.record.value('(Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int'))
FROM sys.dm_os_sys_info AS osi
CROSS JOIN
(
    SELECT
        dorb.timestamp,
        record =
            CONVERT(xml, dorb.record)
    FROM sys.dm_os_ring_buffers AS dorb
    WHERE dorb.ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR'
) AS t
WHERE t.record.exist('(Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization[.>= sql:variable("@cpu_utilization_threshold")])') = 1
ORDER BY
    sample_time DESC;

The sys.dm_os_sys_info view returns a single row, with no relation at all to sys.dm_os_ring_buffers, but I need the one value in the one column in that one row for every row the ring buffers view produces, so that I can turn the timestamp column into a human-understandable value.

Here’s another example from the same procedure, slightly abridged:

SELECT
    total_threads =
        MAX(osi.max_workers_count),
    used_threads =
        SUM(dos.active_workers_count),
    available_threads =
        MAX(osi.max_workers_count) - SUM(dos.active_workers_count),
    threads_waiting_for_cpu =
        SUM(dos.runnable_tasks_count),
    requests_waiting_for_threads =
        SUM(dos.work_queue_count),
    current_workers =
        SUM(dos.current_workers_count),
    total_active_request_count =
        SUM(wg.active_request_count),
    total_queued_request_count =
        SUM(wg.queued_request_count),
    total_blocked_task_count =
        SUM(wg.blocked_task_count),
    total_active_parallel_thread_count =
        SUM(wg.active_parallel_thread_count),
    avg_runnable_tasks_count =
        AVG(dos.runnable_tasks_count)
FROM sys.dm_os_schedulers AS dos
CROSS JOIN sys.dm_os_sys_info AS osi
CROSS JOIN
(
    SELECT
        wg.active_request_count,
        wg.queued_request_count,
        wg.blocked_task_count,
        wg.active_parallel_thread_count
    FROM sys.dm_resource_governor_workload_groups AS wg      
) AS wg;

In this case, I keep myself safe from exploding result sets by aggregating all of the selected columns. You may also find that necessary, should you choose to work with data so terrible that it requires cross joins.

One thing to be especially aware of is that cross joins can only be physically implemented in SQL Server with a nested loops join, so the larger your tables get, the worse performance will get.

Beware out there.

Thanks for reading!

A T-SQL Query To Get The Text Between Two Delimiters In SQL Server

Posted on October 20, 2020 (updated April 16, 2025) by Erik Darling

I Swear It’s Not String Splitting

I have to do this a fair amount, and I always go look at this little cheat sheet that I have.

Then it occurred to me that it might be worth sharing the details here, in case anyone else runs into the same need.

The way I learned to do it is with SUBSTRING and CHARINDEX, which is a pretty common method.

Why CHARINDEX? Because it accepts an optional 3rd parameter that PATINDEX doesn’t, where you can give it a starting position to search. That comes in really handy! Let’s look at how.

The first thing we need for our test case is the starting point, which I’ve arbitrarily chosen as a colon in error messages.

SELECT
    m.*,
    parsed_string = 
        SUBSTRING
        (
            m.text, /*First argument*/ 
            CHARINDEX(':', m.text), /*Second argument*/ 
            LEN(m.text)/*Third argument*/
        )
FROM sys.messages AS m
WHERE m.language_id = 1033
AND   m.text LIKE N'%:%:%';

But the results aren’t exactly what we want! We can still see all the colons.

What we really need is to cut out the first colon, which means we need to move one character over.

Keep in mind that this will change based on the length of the string you’re searching for. If you were looking for ‘Erik’, you’d need to move over four positions.

Sometimes it’s easier to remember that by calling LEN on the search string.

SELECT 
    m.*,
    SUBSTRING
    (
        m.text,
        CHARINDEX(':', m.text) + LEN(':'),
        LEN(m.text)
    ) AS parsed_string
FROM sys.messages AS m
WHERE m.language_id = 1033
AND   m.text LIKE N'%:%:%';
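The same offset idea with a longer search string, as a sketch. The @text and @search variables and their values are invented for illustration.

```sql
/*Hypothetical sample values*/
DECLARE
    @text nvarchar(100) = N'Written by Erik Darling',
    @search nvarchar(10) = N'Erik';

SELECT
    hit_position =
        CHARINDEX(@search, @text), /*12: where the E in Erik sits*/
    start_position =
        CHARINDEX(@search, @text) + LEN(@search), /*12 + 4 = 16*/
    parsed_string =
        SUBSTRING
        (
            @text,
            CHARINDEX(@search, @text) + LEN(@search),
            LEN(@text)
        ); /*Everything after Erik, leading space included*/
```

One thing to watch: LEN ignores trailing spaces, so a search string that ends in a space will come up short by that many positions.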

That’ll get us to where we need to be for the first position! Now we need to get the text up to the second colon, which is where things get a little more complicated.

Right now, we’re just getting everything through the end of the error message, using LEN as the 3rd argument to SUBSTRING.

To start with, let’s establish how we can use the third argument in CHARINDEX.

SELECT 
    m.*,
    SUBSTRING
    (
        m.text, 
        CHARINDEX
        (
            ':', m.text
        ) + LEN(':'),
        CHARINDEX
        (
            ':', 
            m.text, /*!*/
            CHARINDEX
            (
                ':', 
                m.text
            ) + LEN(':')/*!*/
        )
    ) AS parsed_string
FROM sys.messages AS m
WHERE m.language_id = 1033
AND   m.text LIKE N'%:%:%';

The start of the third argument is going to look nearly identical to the second one, except we’re going to start our search in the string after the first colon.

The code in between the exclamation points is the same as the second argument to SUBSTRING.

That’s because it gives us the correct starting position to start looking for the second colon from.
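To see the positions involved, here’s a sketch against a single made-up string rather than sys.messages. The @text value is an assumption chosen so the numbers are easy to count.

```sql
/*Hypothetical sample string: colons at positions 6 and 17*/
DECLARE
    @text nvarchar(100) = N'Error: disk full: retry later';

SELECT
    first_colon =
        CHARINDEX(N':', @text), /*6*/
    search_start =
        CHARINDEX(N':', @text) + LEN(N':'), /*7*/
    second_colon =
        CHARINDEX
        (
            N':',
            @text,
            CHARINDEX(N':', @text) + LEN(N':')
        ); /*17*/
```

Note that second_colon is a position counted from the start of the string, while the third argument to SUBSTRING is a length. Asking for 17 characters starting at position 7 runs well past the second colon.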

But, you know, this still doesn’t get us what we need, exactly. We need to chop some characters off.

How many?

I’ll save you some delicate CPU brain cycles: we need to subtract the length of the search string, and then subtract the position of the first search hit.

SELECT 
    m.*,
    SUBSTRING
    (
        m.text, /*First argument*/
        CHARINDEX /*Begin Second argument*/
        (
            ':', 
            m.text
        ) + LEN(':'), /*End Second argument*/
        CHARINDEX /*Begin Third argument*/
        (
            ':', /*CHARINDEX of the first : after...*/
            m.text, 
            CHARINDEX /*The CHARINDEX of the first : in the string...*/
            (
                ':', 
                m.text
            ) + LEN(':')
        )
        - LEN(':') /*Minus 1, effectively*/
        - CHARINDEX(':', m.text) /*Minus the CHARINDEX of the first : in the string*/
        /*End Third argument*/
    ) AS parsed_string
FROM sys.messages AS m
WHERE m.language_id = 1033
AND   m.text LIKE N'%:%:%';

Which finally gives us what we’re asking for: just the text between the first and second colons.
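If you’d rather check the arithmetic on one string than scroll through sys.messages, here’s a sketch with invented @text and @delimiter values:

```sql
/*Hypothetical sample values: colons at positions 6 and 17*/
DECLARE
    @text nvarchar(100) = N'Error: disk full: retry later',
    @delimiter nvarchar(10) = N':';

SELECT
    parsed_string =
        SUBSTRING
        (
            @text,
            CHARINDEX(@delimiter, @text) + LEN(@delimiter), /*Start at 7*/
            CHARINDEX
            (
                @delimiter,
                @text,
                CHARINDEX(@delimiter, @text) + LEN(@delimiter)
            ) /*Second colon at 17*/
            - LEN(@delimiter) /*Minus 1*/
            - CHARINDEX(@delimiter, @text) /*Minus 6, leaving a length of 10*/
        );
```

That returns the ten characters from position 7 through 16: a space, then disk full. If a string has no second delimiter, the inner CHARINDEX returns 0, the length goes negative, and SUBSTRING throws an error, which is why the queries above filter on LIKE N'%:%:%'.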

Now, this may not make you totally happy. After all, there are still leading and trailing spaces on each line.

If you want to get rid of those, you can either adjust your LEN functions, or you can call TRIM, LTRIM/RTRIM on the final result. I don’t care, really.

They’re your strings. They shouldn’t even be in the database.

Thanks for reading!