The How To Write SQL Server Queries Correctly Cheat Sheet: Cross Apply And Outer Apply

Situations


I end up converting a lot of derived joins, particularly those that use windowing functions, to use the apply syntax. Sometimes good indexes are in place to support that, other times they need to be created to avoid an Eager Index Spool.

One of the most common questions I get is when developers should consider using apply over other join syntax.

The short answer is that I start mentally picturing the apply syntax being useful when:

  • There’s a small outer table (FROM), and a large inner table (APPLY)
  • I need to do a significant amount of work on the inner side of a join
  • The goal of the query is top N per group, or something similar
  • I’m trying to get parallel nested loops instead of some alternative plan choice
  • To replace a scalar UDF in the select list with an inline UDF
  • In order to use the VALUES construct in an odd way

Most of this is situational, and requires a bit of practice and familiarity to spot quickly.

Both cross and outer apply can be used in similar ways to subqueries in the select list, with the added bonus that you can return multiple columns and rows with apply, which you can’t do in a normal subquery.

What Apply Does


The way to think about what apply is doing when a query runs is supplying a table-valued result on the inner side of a join for each row supplied by the outer side of a join.

Here’s a simple example:

SELECT
    u.Id,
    u.DisplayName,
    u.Reputation,
    p.Title,
    p.Score,
    p.CreationDate,
    p.LastActivityDate
FROM dbo.Users AS u
CROSS APPLY
(
    SELECT
        p.*
    FROM dbo.Posts AS p
    WHERE p.OwnerUserId = u.Id
    AND   p.PostTypeId = 1
    ORDER BY
        p.CreationDate DESC
    OFFSET 0 ROWS
    FETCH NEXT 3 ROWS ONLY
) AS p
WHERE u.CreationDate >= '20131230'
ORDER BY
    u.Reputation DESC,
    p.CreationDate DESC;

We’re getting everyone from the Users table who Posted a Question in the final days of 2013, ordered by when it was Created.

For every qualifying User, we get a tabular result showing the Title, Score, CreationDate, and LastActivityDate of their question.

You can picture it sort of like this:

sql server query results
tabular

Some Users have have more than three results, and some may have fewer than three results, but since the query is self-limited to only the first three, our query sets a row goal and quits once three are found.

More About The Query


Because the Users table is correlated by Id to the Posts table on OwnerUserId, we need to make sure we have a good index to make that access seekable.

sql server query text
apply-ular

Also because we’re only looking for Questions (PostTypeId = 1), and ordering by the most recent ones (CreationDate DESC), it’s also a wise idea to have those in the key.

It’s also worth talking over an interesting point in the query itself: The select from Posts inside of the apply is doing a select star, sure, but the outer query is only pulling a few of the columns. The optimizer can recognize this, which means we don’t need a gigantic covering index to make this query fast. We also… Don’t really need a covering index at all in this case. Just the key columns are good enough.

CREATE INDEX 
    u 
ON dbo.Users 
    (CreationDate, Reputation, Id) 
INCLUDE 
    (DisplayName) 
WITH 
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE); 

CREATE INDEX 
    p 
ON dbo.Posts 
    (OwnerUserId, PostTypeId, CreationDate) 
WITH 
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

Efficiency Unit


An efficient query plan when using apply will look like this:

sql server query plan
hello.

An inefficient query plan using apply will look like this:

sql server query plan
oops

If you’re experimenting with apply, either cross or outer, pay close attention to the query plans. If someone says “cross apply is always slow”, you can bet they stink at indexes.

Choices, Choices


The choice to use apply at all depends on the goal of the query, and the goals of the query tuner. It’s not always a magic performance bullet, but under the right circumstances, it can really make things fly.

The choice to use cross or outer apply depends on the semantics of the starting query. An inner join commutes easily to cross apply, and a left join commutes easily to outer apply.

One important difference in how the joins are implemented is in the optimizer’s choice between normal nested loops, where the join is done at the nested loops operator, and apply nested loops, which is when the join keys are pushed to an index seek on the inner side of the join.

For an in-depth look at that, I’d suggest reading Paul White’s post, Apply versus Nested Loops Join.

The optimizer is capable of transforming an apply to a join and vice versa. It generally tries to rewrite apply to join during initial compilation to maximize the searchable plan space during cost-based optimization. Having transformed an apply to a join early on, it may also consider a transformation back to an apply shape later on to assess the merits of e.g. an index loops join.

Just writing a query using apply doesn’t guarantee that you get the apply nested loops version of a nested loops join. Having solid indexes and easy to search predicates can help push things in the right direction.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.

The How To Write SQL Server Queries Correctly Cheat Sheet: Select List Subqueries

Flash Boy


I think subqueries in select lists are very neat things. You can use them to skip a lot of additional join logic, which can have all sorts of strange repercussions on query optimization, particularly if you use have to use left joins to avoid eliminating results.

Subqueries do have their limitations:

  • They can only return one row
  • They can only return one column

But used in the right way, they can be an excellent method to retrieve results without worrying about what kind of join you’re doing, and how the optimizer might try to rearrange it into the mix.

Since subqueries are in the select list, and can’t eliminate rows from results, you will most commonly see them incorporated late in query plans (much further to the left) as outer joins. The optimizer is generally smart enough to retrieve data for select list subqueries after as much other filtering that can have been applied is applied, so they can be evaluated for as few rows in the outer results as possible.

The most important thing you can do as a developer to make sure your select list subqueries are fast is to make sure you have good supporting indexes for them.

Well Done


Let’s start with this query:

SELECT
    u.Id,
    u.DisplayName,
    u.Reputation,
    TopQuestionScore =     
    (
        SELECT TOP (1)
            p.Score
        FROM dbo.Posts AS p
        WHERE p.PostTypeId = 1
        AND   p.OwnerUserId = u.Id
        ORDER BY
            p.Score DESC,
            p.Id DESC
    ),
    TopAnswerScore =     
    (
        SELECT TOP (1)
            p.Score
        FROM dbo.Posts AS p
        WHERE p.PostTypeId = 2
        AND   p.OwnerUserId = u.Id
        ORDER BY
            p.Score DESC,
            p.Id DESC
    ),
    TotalPosts =    
    (
        SELECT
            COUNT_BIG(*)
        FROM dbo.Posts AS p
        WHERE p.OwnerUserId = u.Id
        AND   p.PostTypeId IN (1, 2)
    )
FROM dbo.Users AS u
WHERE u.Reputation > 500000
ORDER BY
    u.Reputation DESC;

The goal is to find every User with a Reputation over 500,000, and then find their:

  • Top scoring question (with a unique tie-breaker on most recent post id)
  • Top scoring answer (with a unique tie-breaker on most recent post id)
  • Total questions and answers

You might look at this query with a deep sense of impending dread, wondering why we should make three trips to the Posts table to get this information. I totally get that.

But let’s say we have these indexes in place:

CREATE INDEX 
    u
ON dbo.Users 
    (Reputation, Id)
INCLUDE 
    (DisplayName)
WITH
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

CREATE INDEX
    p
ON dbo.Posts
    (OwnerUserId, PostTypeId, Score)
WITH
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

They have everything we need to support quick, navigational lookups.

Query Planner


The query plan for this arrangement looks like this, finishing in 23 milliseconds.

sql server query plan
all i need

If you write select list subqueries, and they’re terribly slow, there’s a very good chance that the indexes you have in place are not up to the job, particularly if you see Eager Index Spools in the query plan.

All of the time in the plan is spent in the final subquery, that counts the total number of questions and answers. But even that, at 23 milliseconds, is not worth heaving our chests over.

Three round trips are not at all a problem here, but let’s compare.

One Way Ticket


I’m not opposed to experimentation. After all, it’s a great way to learn, observe, and become enraged with the state of query optimization generally.

Here are two rewrites of the above query, to only make one trip to the Posts table. The first one uses a derived join, and the second uses apply. They’re both written to use outer joins, to match the semantics of the three subqueries:

/*
Join
*/
SELECT
    u.Id,
    u.DisplayName,
    u.Reputation,
    p.TopQuestionScore,
    p.TopAnswerScore,
    p.TotalPosts
FROM dbo.Users AS u
LEFT JOIN
(
    SELECT
        p.OwnerUserId,
        TopQuestionScore = 
            MAX(CASE WHEN p.PostTypeId = 1 THEN p.Score ELSE NULL END),
        TopAnswerScore = 
            MAX(CASE WHEN p.PostTypeId = 2 THEN p.Score ELSE NULL END),
        TotalPosts = 
            COUNT_BIG(*)
    FROM dbo.Posts AS p
    WHERE p.PostTypeId IN (1, 2)
    GROUP BY
        p.OwnerUserId
) AS p
  ON p.OwnerUserId = u.Id
WHERE u.Reputation > 500000
ORDER BY
    u.Reputation DESC;

/*
Apply
*/
SELECT
    u.Id,
    u.DisplayName,
    u.Reputation,
    p.TopQuestionScore,
    p.TopAnswerScore,
    p.TotalPosts
FROM dbo.Users AS u
OUTER APPLY
(
    SELECT
        p.OwnerUserId,
        TopQuestionScore = 
            MAX(CASE WHEN p.PostTypeId = 1 THEN p.Score ELSE NULL END),
        TopAnswerScore = 
            MAX(CASE WHEN p.PostTypeId = 2 THEN p.Score ELSE NULL END),
        TotalPosts = 
            COUNT_BIG(*)
    FROM dbo.Posts AS p
    WHERE p.OwnerUserId = u.Id 
    AND   p.PostTypeId IN (1, 2)
    GROUP BY
        p.OwnerUserId
) AS p
WHERE u.Reputation > 500000
ORDER BY
    u.Reputation DESC;

A somewhat brief digression here: Query rewrites to use specific syntax arrangements are often not available in ORMs. Many times while working with clients, we’ll stumble across a bushel of quite painful application-generated queries. I’ll show them useful rewrites to improve things, and we’ll all marvel together at how much better things are. I’ll suggest implementing the rewrite as a stored procedure, and all hell will break loose.

Please don’t be one of those developers. Understand the limitations of the technology stack you’re working with. Not everything produced by code is good.

Compare and Contrast


In this case, both of the attempts at rewrites result in identical query plans. The optimizer does a good job here, but both of the single-trip queries is about 2x slower than the original.

In this case, that difference will look and feel microscopic. And it is, mostly because I provided indexes so good that you could write this query any which way and it would work out pretty well.

sql server query plan
back of the van

While one round trip certainly felt more efficient than three, each trip from outer to inner side of the nested loops had a bit more work to do each time, and that added up.

It’s nothing consequential here, but you may run into plenty of situations where it’s far worse (or to be fair, far better).

For me, the real advantage of writing out the three separate subqueries is to better understand which one(s) do the most work, and might need additional work done to make them fast.

When you do everything all at once, you have no idea which piece is responsible for slowdowns. We know from the very first query plan that getting the full count does the most work, but that wouldn’t be obvious to me, you, or anyone else looking at the two query plans in this section.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.

Join me In Boston May 10 For A Full Day Of SQL Server Performance Tuning Training

Spring Training


This May, I’ll be presenting my full day training session The Foundations Of SQL Server Performance Tuning.

All attendees will get free access for life to my SQL Server performance tuning training. That’s about 25 hours of great content.

Get your tickets here for this event, taking place Friday, May 10th 2024 at the Microsoft Offices in Burlington.

Here’s what I’ll be presenting:

The Foundations Of SQL Server Performance Tuning

Session Abstract:

Whether you want to be the next great query tuning wizard, or you just need to learn how to start solving tough business problems at work, you need a solid understanding of not only what makes things fast, but also what makes them slow.

I work with consulting clients worldwide fixing complex SQL Server performance problems. I want to teach you how to do the same thing using the same troubleshooting tools and techniques I do.

I’m going to crack open my bag of tricks and show you exactly how I find which queries to tune, indexes to add, and changes to make. In this day long session, you’re going to learn about hardware, query rewrites that work, effective index design patterns, and more.

Before you get to the cutting edge, you need to have a good foundation. I’m going to teach you how to find and fix performance problems with confidence.

Event Details:

Get your tickets here for this event!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.

bit Obscene Episode 2: SQL Server Agent Jobs

bit Obscene Episode 2: SQL Server Agent Jobs


Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.

SQL Server Index Design: Getting Key Column Order Right

SQL Server Index Design: Getting Key Column Order Right


Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.

How To Evaluate Index Effectiveness While Tuning SQL Server Queries

How To Evaluate Index Effectiveness While Tuning SQL Server Queries


Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.

How To Tell If Lazy Table Spools Are Slowing Your SQL Server Queries Down, And Fix Them

How To Tell If Lazy Table Spools Are Slowing Your SQL Server Queries Down, And Fix Them


Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.

Join me In Boston May 10 For A Full Day Of SQL Server Performance Tuning Training

Spring Training


This May, I’ll be presenting my full day training session The Foundations Of SQL Server Performance Tuning.

All attendees will get free access for life to my SQL Server performance tuning training. That’s about 25 hours of great content.

Get your tickets here for this event, taking place Friday, May 10th 2024 at the Microsoft Offices in Burlington.

Here’s what I’ll be presenting:

The Foundations Of SQL Server Performance Tuning

Session Abstract:

Whether you want to be the next great query tuning wizard, or you just need to learn how to start solving tough business problems at work, you need a solid understanding of not only what makes things fast, but also what makes them slow.

I work with consulting clients worldwide fixing complex SQL Server performance problems. I want to teach you how to do the same thing using the same troubleshooting tools and techniques I do.

I’m going to crack open my bag of tricks and show you exactly how I find which queries to tune, indexes to add, and changes to make. In this day long session, you’re going to learn about hardware, query rewrites that work, effective index design patterns, and more.

Before you get to the cutting edge, you need to have a good foundation. I’m going to teach you how to find and fix performance problems with confidence.

Event Details:

Get your tickets here for this event!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.

Why Multiple Seeks Are Slower For Some SQL Server Queries

Why Multiple Seeks Are Slower For Some SQL Server Queries


Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.

The How To Write SQL Server Queries Correctly Cheat Sheet: EXISTS and NOT EXISTS

Selectively Numb


The sort of wonderful thing about SQL is that it has many directives which are fairly easy to decipher and use appropriately.

The sort of awful thing about SQL is that it has just as many rules that are somewhat selectively applied as the English language itself.

I have my gripes and grievances with some of the choices, of course, and so will you as you delve further into the language. A petty example is that I sort of wish that SQL used GET instead of SELECT for retrieving data.

Very few people go to the store to select milk, eggs, steak, butter, salt, pepper, and scotch. Most of us just go get it. But enough about breakfast.

Let’s talk about two of the most overlooked and undervalued facilities in the SQL language: EXISTS and NOT EXISTS. Perhaps they’d get more traction is they were called THERE and NOT THERE, but but it would be perilously easy to start confusing your WHERE and THERE clause.

Often besmirched by someone who read a blog post about MySQL in 1998 as “subqueries, which are slower than joins”, they’re two of the most useful things you can grasp and use regularly.

Though they are a bit like subqueries, the columns that you select in an EXISTS or NOT EXISTS subquery can’t be used in the outer query. You can put whatever you want in the select list, from * to 1 to COUNT to 1/0 to the entire contents of the King James Bible, and it will never end up making even the dimmest difference in the world. Likewise, adding DISTINCT, TOP, or any other row-limiting device will do absolutely nothing to change the query plan or performance.

Get over yourself.

Both EXISTS and NOT EXISTS already set a row goal of 1, because all either one has to do is determine if a single row is there or not, just with the logic reversed for each.

Your First Mistakes


Let’s say someone asks you to gin up a list of Users who have Posted anything at all, but whose Reputation still floats at the dreaded 1.

Your first instinct would likely be to write a query that looks like this.

SELECT 
    u.Id,
    u.DisplayName
FROM dbo.Users AS u
JOIN dbo.Posts AS p
  ON u.Id = p.OwnerUserId
WHERE u.Reputation = 1
ORDER BY 
    u.Id;

But you’d quickly find yourself confronted by many duplicate row values, because that’s what one-to-many joins produce. Duplicates.

Your next move, tongue hanging out, sweating profusely, knuckles creaking, nearly paralyzed by the uncertainty of your continued human functioning, would be to do something like this:

SELECT DISTINCT 
    u.Id,
    u.DisplayName
FROM dbo.Users AS u
JOIN dbo.Posts AS p
  ON u.Id = p.OwnerUserId
WHERE u.Reputation = 1
ORDER BY 
    u.Id;

And, sure, with a small enough data set, this is an okay choice. You passed the pop quiz, hot shot. Your instinct to use DISTINCT was not wholly incorrect, but there’s a better way.

But as you start dealing with larger and more imposing sets of data, DISTINCT will no longer cut it.

What EXISTS Does Different


While EXISTS will still use a join to match rows between tables, the semantics are quite a bit different. It can move on once it has determined that a row is either there or not there.

You don’t need to add DISTINCT, grouping, or anything else to get the results you wanted in the first place.

SELECT 
    u.Id,
    u.DisplayName
FROM  dbo.Users AS u
WHERE u.Reputation = 1
AND EXISTS 
( 
    SELECT 
        1/0 
    FROM dbo.Posts AS p 
    WHERE p.OwnerUserId = u.Id 
)
ORDER BY 
    u.Id;

Once EXISTS locates a match, it moves on to the next value from the outer side (in this case the Users table), and attempts to find a match. If no match is found, the row is discarded, which is common to inner joins.

Where a lot of developers get hung up at first is in assuming that EXISTS and NOT EXISTS work like IN or NOT in, and they miss the inner where clause to tell the database which rows should match.

I’ve seen a lot of EXISTS queries written, quite incorrectly, like this:

SELECT 
    u.Id,
    u.DisplayName
FROM  dbo.Users AS u
WHERE u.Reputation = 1
AND EXISTS 
( 
    SELECT 
       p.OwnerUserId 
    FROM dbo.Posts AS p
)
ORDER BY 
    u.Id;

Which will, of course, return absolutely everything. Don’t do this.

The column you select inside of the EXISTS subquery does not infer any sort of matching logic.

Like I said before, it’s essentially discarded by the optimizer.

Your Second Mistakes


No half-assed SQL tutorial is complete without showing you the wrong way to find non-matching rows between two tables.

It will undoubtedly look something like this:

SELECT 
   records = 
       COUNT_BIG(u.Id)
FROM dbo.Users AS u
LEFT JOIN dbo.Posts AS p
  ON u.Id = p.OwnerUserId
WHERE p.Id IS NULL;

It’s not that this pattern is never better, it’s just that it shouldn’t be your go-to for each and every query with this goal in mind.

You take two tables, you join them together, and you add a predicate to your where clause to find rows where an ordinarily not-NULL column returns NULLs.

The problem is that SQL Server’s query optimizer doesn’t contain any logic to turn this into the type of query plan that you’d get using NOT EXISTS instead.

You end up needing to fully join any tables involved together, and then later on use a filter to remove rows where no match was found. This can be incredibly inefficient, especially on large data sets.

One may even be dealing with “big data” when the follies of this paradigm become quite clear.

A generally better approach to writing this type of query is to tell the database you’re using exactly what you’re after and exactly what you expect:

SELECT 
    records = 
        COUNT_BIG(u.Id)
FROM dbo.Users AS u
WHERE NOT EXISTS 
( 
    SELECT 
        1/0 
    FROM dbo.Posts AS p 
    WHERE p.OwnerUserId = u.Id 
);

Your developer-life will be a whole lot less confusing and tiresome if you arm yourself with options and alternatives, which means you’ll have lots of mental energy left over to, like, learn 17 new frameworks and really impress your friends.

Think of the frameworks.

Gear Up


You should make good use of the EXISTS and NOT EXISTS patterns in your queries when you don’t require any rows from another table, and you only need to validate if something is there or not.

In cases where you need to get information from another table, joins are likely the most direct path to getting back the data you need.

But this all brings up an interesting question: what if you want to get back information in the select list without adding in join clauses, worrying about inner, outer, full, or cross, and wondering silently if one day things might go pear shaped.

We’ll talk about that in the next post, when we go over correlated subqueries.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.