The How To Write SQL Server Queries Correctly Cheat Sheet: Select List Subqueries

Flash Boy

I think subqueries in select lists are very neat things. You can use them to skip a lot of additional join logic, which can have all sorts of strange repercussions on query optimization, particularly if you use have to use left joins to avoid eliminating results.

Subqueries do have their limitations:

They can only return one row
They can only return one column

But used in the right way, they can be an excellent method to retrieve results without worrying about what kind of join you’re doing, and how the optimizer might try to rearrange it into the mix.

Since subqueries are in the select list, and can’t eliminate rows from results, you will most commonly see them incorporated late in query plans (much further to the left) as outer joins. The optimizer is generally smart enough to retrieve data for select list subqueries after as much other filtering that can have been applied is applied, so they can be evaluated for as few rows in the outer results as possible.

The most important thing you can do as a developer to make sure your select list subqueries are fast is to make sure you have good supporting indexes for them.

Well Done

Let’s start with this query:

SELECT
    u.Id,
    u.DisplayName,
    u.Reputation,
    TopQuestionScore =     
    (
        SELECT TOP (1)
            p.Score
        FROM dbo.Posts AS p
        WHERE p.PostTypeId = 1
        AND   p.OwnerUserId = u.Id
        ORDER BY
            p.Score DESC,
            p.Id DESC
    ),
    TopAnswerScore =     
    (
        SELECT TOP (1)
            p.Score
        FROM dbo.Posts AS p
        WHERE p.PostTypeId = 2
        AND   p.OwnerUserId = u.Id
        ORDER BY
            p.Score DESC,
            p.Id DESC
    ),
    TotalPosts =    
    (
        SELECT
            COUNT_BIG(*)
        FROM dbo.Posts AS p
        WHERE p.OwnerUserId = u.Id
        AND   p.PostTypeId IN (1, 2)
    )
FROM dbo.Users AS u
WHERE u.Reputation > 500000
ORDER BY
    u.Reputation DESC;

The goal is to find every User with a Reputation over 500,000, and then find their:

Top scoring question (with a unique tie-breaker on most recent post id)
Top scoring answer (with a unique tie-breaker on most recent post id)
Total questions and answers

You might look at this query with a deep sense of impending dread, wondering why we should make three trips to the Posts table to get this information. I totally get that.

But let’s say we have these indexes in place:

CREATE INDEX 
    u
ON dbo.Users 
    (Reputation, Id)
INCLUDE 
    (DisplayName)
WITH
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

CREATE INDEX
    p
ON dbo.Posts
    (OwnerUserId, PostTypeId, Score)
WITH
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

They have everything we need to support quick, navigational lookups.

Query Planner

The query plan for this arrangement looks like this, finishing in 23 milliseconds.

If you write select list subqueries, and they’re terribly slow, there’s a very good chance that the indexes you have in place are not up to the job, particularly if you see Eager Index Spools in the query plan.

All of the time in the plan is spent in the final subquery, that counts the total number of questions and answers. But even that, at 23 milliseconds, is not worth heaving our chests over.

Three round trips are not at all a problem here, but let’s compare.

One Way Ticket

I’m not opposed to experimentation. After all, it’s a great way to learn, observe, and become enraged with the state of query optimization generally.

Here are two rewrites of the above query, to only make one trip to the Posts table. The first one uses a derived join, and the second uses apply. They’re both written to use outer joins, to match the semantics of the three subqueries:

/*
Join
*/
SELECT
    u.Id,
    u.DisplayName,
    u.Reputation,
    p.TopQuestionScore,
    p.TopAnswerScore,
    p.TotalPosts
FROM dbo.Users AS u
LEFT JOIN
(
    SELECT
        p.OwnerUserId,
        TopQuestionScore = 
            MAX(CASE WHEN p.PostTypeId = 1 THEN p.Score ELSE NULL END),
        TopAnswerScore = 
            MAX(CASE WHEN p.PostTypeId = 2 THEN p.Score ELSE NULL END),
        TotalPosts = 
            COUNT_BIG(*)
    FROM dbo.Posts AS p
    WHERE p.PostTypeId IN (1, 2)
    GROUP BY
        p.OwnerUserId
) AS p
  ON p.OwnerUserId = u.Id
WHERE u.Reputation > 500000
ORDER BY
    u.Reputation DESC;

/*
Apply
*/
SELECT
    u.Id,
    u.DisplayName,
    u.Reputation,
    p.TopQuestionScore,
    p.TopAnswerScore,
    p.TotalPosts
FROM dbo.Users AS u
OUTER APPLY
(
    SELECT
        p.OwnerUserId,
        TopQuestionScore = 
            MAX(CASE WHEN p.PostTypeId = 1 THEN p.Score ELSE NULL END),
        TopAnswerScore = 
            MAX(CASE WHEN p.PostTypeId = 2 THEN p.Score ELSE NULL END),
        TotalPosts = 
            COUNT_BIG(*)
    FROM dbo.Posts AS p
    WHERE p.OwnerUserId = u.Id 
    AND   p.PostTypeId IN (1, 2)
    GROUP BY
        p.OwnerUserId
) AS p
WHERE u.Reputation > 500000
ORDER BY
    u.Reputation DESC;

A somewhat brief digression here: Query rewrites to use specific syntax arrangements are often not available in ORMs. Many times while working with clients, we’ll stumble across a bushel of quite painful application-generated queries. I’ll show them useful rewrites to improve things, and we’ll all marvel together at how much better things are. I’ll suggest implementing the rewrite as a stored procedure, and all hell will break loose.

Please don’t be one of those developers. Understand the limitations of the technology stack you’re working with. Not everything produced by code is good.

Compare and Contrast

In this case, both of the attempts at rewrites result in identical query plans. The optimizer does a good job here, but both of the single-trip queries is about 2x slower than the original.

In this case, that difference will look and feel microscopic. And it is, mostly because I provided indexes so good that you could write this query any which way and it would work out pretty well.

While one round trip certainly felt more efficient than three, each trip from outer to inner side of the nested loops had a bit more work to do each time, and that added up.

It’s nothing consequential here, but you may run into plenty of situations where it’s far worse (or to be fair, far better).

For me, the real advantage of writing out the three separate subqueries is to better understand which one(s) do the most work, and might need additional work done to make them fast.

When you do everything all at once, you have no idea which piece is responsible for slowdowns. We know from the very first query plan that getting the full count does the most work, but that wouldn’t be obvious to me, you, or anyone else looking at the two query plans in this section.

Thanks for reading!

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.