Are Self Joins Ever Better Than Key Lookups In SQL Server?

Sorta Topical

Like most tricks, this has a specific use case, but can be quite effective when you spot it.

I’m going to assume you have a vague understanding of parameter sniffing with stored procedures going into this. If you don’t, the post may not make a lot of sense.

Or, heck, maybe it’ll give you a vague understanding of parameter sniffing in stored procedures.

One For The Money

Say I have a stored procedure that accepts a parameter called @Reputation.

The body of the procedure looks like this:

    SELECT TOP (1000) 
	        u.*
    FROM dbo.Users AS u
    WHERE u.Reputation = @Reputation
    ORDER BY u.CreationDate DESC;

In the users table, there are a lot of people with a Reputation of 1.

There are not so many with a Reputation of 2.

+------------+---------+
| Reputation | records |
+------------+---------+
|          1 | 1090043 |
|          2 |    1854 |
+------------+---------+

Two For The Slow

Data distributions like this matter. They change how SQL Server approaches coming up with an execution plan for a query.

Which indexes to use, what kind of joins to use, how to aggregate data, if the plan should be serial or parallel…

The list goes on and on.

In this case, we have a narrow-ish nonclustered index:

    CREATE INDEX whatever 
        ON dbo.Users (Reputation, Age, CreationDate);

When I run my stored procedure and look for Reputation = 2, the plan is very fast.

EXEC dbo.WORLDSTAR @Reputation = 2;

SQL Server Query Plan — Getting to know you.

This is a great plan for a small number of rows.

When I run it for a large number of rows, it’s not nearly as fast.

EXEC dbo.WORLDSTAR @Reputation = 1;

We go from a fraction of a second to over three seconds.

This is bad parameter sniffing.

If we run it for Reputation = 1 first, we don’t have the same problem.

That’s good(ish) parameter sniffing.

Better For Everyone

Many things that prevent parameter sniffing will only give you a so-so plan. It may be better than the alternative, but it’s certainly not a “fix”.

It’s possible to get a better plan for everyone in this situation by re-writing the Key Lookup as a self join

    SELECT TOP (1000) 
	        u2.*
    FROM dbo.Users AS u
    JOIN dbo.Users AS u2
        ON u.Id = u2.Id
    WHERE u.Reputation = @Reputation
    ORDER BY u.CreationDate DESC;

The reason why is slightly complicated, but I’ll do my best to explain it simply.

Here’s what the bad parameter sniffing plan looks like for each query.

Note that the Key Lookup plan still runs for ~3 seconds, while the self-join plan runs for around half a second.

While it’s possible for Key Lookups to have Sorts introduced to optimize I/O… That doesn’t happen here.

The main difference between the two plans (aside from run time), is the position of the Sort.

In the Key Lookup plan (top), the Key Lookup between the nonclustered and clustered indexes runs to completion.

In other words, for everyone with a Reputation of 1, we go to the clustered index to retrieve the columns that aren’t part of the nonclustered index.

In the self-join plan (bottom), all rows go into the Sort, but only the 1000 come out.

Different World

The difference is more obvious when viewed with Plan Explorer.

In the Key Lookup plan, rows aren’t narrowed until the end so a seek occurs ~1mm times.

In the self-join plan, they’re eliminated directly after the Index Seek, so the join only runs for 1000 rows and produces 1000 seeks.

This doesn’t mean that Top N Sorts are bad, it just means that they may not produce the most optimal plans for Key Lookups.

When This Doesn’t Work

Without a TOP, the self-join pattern isn’t as dramatically faster, but it is about half a second better (4.3s vs. 3.8s) for the bad parameter sniffing scenario, and far less for the others.

Of course, an index change to put CreationDate as the second key column fixes the issue by removing the need to sort data at all.

    CREATE INDEX whatever --Current Index
        ON dbo.Users (Reputation, Age, CreationDate);
    GO 
    
    CREATE INDEX apathy --Better Index For This Query
	    ON dbo.Users (Reputation, CreationDate, Age);
    GO

But, you know, not everyone is able to make index changes easily, and changing the key column order can cause problems for other queries.

Thanks for reading!

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

5 thoughts on “Are Self Joins Ever Better Than Key Lookups In SQL Server?”

Pingback: Self-Joins Versus Key Lookups – Curated SQL
Brian says:

June 1, 2019 at 9:35 pm

SQL would be smarter if it used different plans for different parameters.
Why don’t we help it out a bit?

if( @reputation == 1 )
SELECT TOP (1000) u.* FROM dbo.Users AS u WHERE u.Reputation = 1 ORDER BY u.CreationDate DESC;
else
SELECT TOP (1000) u.* FROM dbo.Users AS u WHERE u.Reputation = @Reputation ORDER BY u.CreationDate DESC;
Jim says:

June 3, 2019 at 11:15 am

Could one use a CTE to produce a similar result (early sort in the execution plan)?

Alternatively, are there any downsides to slapping an OPTION (RECOMPILE) on a query that has a funky distribution? The main downsides I’m aware of are: 1. expensive if the query is run a lot and 2. you miss out on some helpful stats that could be gleaned from the DMV’s.
1. Erik Darling says:
  
  June 3, 2019 at 12:23 pm
  
  RE: a CTE — yeah, maybe sorta kinda. You’d have to play some tricks, most likely.
  
  Yeah, those are about the downsides of recompile. You can offset the DMV issue with a good monitoring tool, though.
  1. Jim says:
    
    June 3, 2019 at 4:53 pm
    
    That was a super interesting stackexchange post – thanks for the link!

Comments are closed.