Indexing Archives | Page 18 of 21

Does SQL Server Choose Multiple Indexes Based On Selectivity?

Posted on August 8, 2019May 16, 2022 by Erik Darling

Will Travel

If you recall yesterday’s post, we added a couple two column indexes to the Posts table.

Each one helped a slightly different query, but either index would likely be “good enough”.

This post will focus on another common scenario I see, where people added many single column indexes over the years.

In this scenario, performance is much more variable.

Singletonary

Here are our indexes:

CREATE INDEX ix_spaces
ON dbo.Posts(ParentId);

CREATE INDEX ix_tabs
ON dbo.Posts(Score);

Taking the same queries from yesterday:

SELECT COUNT_BIG(*) AS records
FROM   dbo.Posts AS p
WHERE  p.ParentId < 1
AND    p.Score > 19000
AND 1 = (SELECT 1);

SELECT COUNT_BIG(*) AS records
FROM   dbo.Posts AS p
WHERE  p.ParentId > 21100000
AND    p.Score < 1
AND 1 = (SELECT 1);

This is what the new plans look like:

The first thing you may notice is that the top plan performs a rather traditional key lookup, and the bottom plan performs a slightly more exotic index intersection.

Both concepts are similar. Since clustered index key columns are present in nonclustered indexes, they can be used to either join a nonclustered index to the clustered index on a table, or to join two nonclustered indexes together.

It’s a nice trick, and this post definitely isn’t to say that either is bad. Index intersection just happens to be worse here.

Wait, But…

You may have noticed that both queries get pretty bad estimates. You might even be thinking about leaving me a comment to update stats.

The thing is that I created these indexes, which means they get stats built with a full scan, and it’s a demo database where nothing changes.

We just get unfortunate histograms, in this case. If I create very specific filtered statistics, both plans perform a key lookup.

CREATE STATISTICS s_orta ON dbo.Posts(ParentId) WHERE ParentId > 21100000 WITH FULLSCAN;
CREATE STATISTICS s_omewhat ON dbo.Posts(Score) WHERE Score < 1 WITH FULLSCAN;
CREATE STATISTICS s_emi ON dbo.Posts(ParentId) WHERE ParentId < 1 WITH FULLSCAN;
CREATE STATISTICS s_lightly ON dbo.Posts(Score) WHERE Score > 19000 WITH FULLSCAN;

This is necessary with the legacy cardinality estimator, too. Rain, sleet, shine.

Bad estimates happen.

When your tables are large enough, those 200 (plus one for NULLs, I know, I know) steps often can’t do the data justice.

Filtered stats and indexes can help with that.

Something I try to teach people is that SQL Server can use whatever statistics or methods it wants for cardinality estimation, even if they’re not directly related to the indexes that it uses to access data.

With filtered statistics, things go fine for both plans:

When Could This Cause Trouble?

Obviously, plans like this are quite sensitive to parameter sniffing. Imagine a scenario where a “bad” plan got cached.

Having one instance of this query running doesn’t cause much of a CPU uptick, but if user concurrency is high then you’d notice it pretty quickly.

Parallel plans, by definition, use a lot more CPU, and more worker threads. These both reserve and use 8 threads.

SQL Server Query Plan Properties — Stretch

Those two plans aren’t even the worst possible case from a duration perspective. Check out these:

Doubledown

When talking index design, single column indexes are rarely a good idea.

Sometimes I’ll see entire tables with an index on every column, and just that column.

That can lead to some very confusing query plans, and some very poor performance.

Thanks for reading!

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Does SQL Server Choose Different Indexes Based On Selectivity?

Posted on August 7, 2019May 16, 2022 by Erik Darling

This Comes Up A Lot

When I’m working with clients, people who don’t spend a lot of time working with indexes have a lot of questions about indexes.

The general rule about leading column selectivity is an easy enough guideline to follow, but what happens if you’re not looking for equality predicates?

What if you’re looking for ranges, and those ranges might sometimes be selective, and other times not?

LET’S FIND OUT!

Chicken and Broccoli

Let’s take these queries against the Posts table. The number next to each indicates the number of rows that match the predicate.

SELECT COUNT_BIG(*) AS records /*6050820*/
FROM   dbo.Posts AS p
WHERE  p.ParentId < 1
AND 1 = (SELECT 1);

SELECT COUNT_BIG(*) AS records /*3*/
FROM   dbo.Posts AS p
WHERE  p.Score > 19000
AND 1 = (SELECT 1);

SELECT COUNT_BIG(*) AS records /*23*/
FROM   dbo.Posts AS p
WHERE  p.ParentId > 21100000
AND 1 = (SELECT 1);

SELECT COUNT_BIG(*) AS records /*6204153*/
FROM   dbo.Posts AS p
WHERE  p.Score < 1
AND 1 = (SELECT 1);

In other words, sometimes they’re selective, and sometimes they’re not.

If we run these without any indexes, SQL Server will ask for single column indexes on ParentId and Score.

But our queries don’t look like that. They look like this (sometimes):

SELECT COUNT_BIG(*) AS records
FROM   dbo.Posts AS p
WHERE  p.ParentId < 1
AND    p.Score > 19000
AND 1 = (SELECT 1);

SELECT COUNT_BIG(*) AS records
FROM   dbo.Posts AS p
WHERE  p.ParentId > 21100000
AND    p.Score < 1
AND 1 = (SELECT 1);

When we run that, SQL Server asks for… the… same index.

Missing index request column order is pretty basic.

Instead, we’re gonna add these:

CREATE INDEX ix_spaces
ON dbo.Posts(ParentId, Score);

CREATE INDEX ix_tabs
ON dbo.Posts(Score, ParentId);

Steak and Eggs

When we run those two queries again, each will use a different index.

Those finish in, apparently, NO TIME WHATSOEVER.

And they do pretty minimal reads.

Table 'Posts'. Scan count 1, logical reads 4
Table 'Posts'. Scan count 1, logical reads 4

If we force those queries to use the opposite index, we can see why SQL Server made the right choice:

SELECT COUNT_BIG(*) AS records
FROM   dbo.Posts AS p WITH (INDEX = ix_spaces)
WHERE  p.ParentId < 1
AND    p.Score > 19000
AND 1 = (SELECT 1);

SELECT COUNT_BIG(*) AS records
FROM   dbo.Posts AS p WITH (INDEX = ix_tabs)
WHERE  p.ParentId > 21100000
AND    p.Score < 1
AND 1 = (SELECT 1);

Time was discovered. As were a bunch more reads.

Table 'Posts'. Scan count 1, logical reads 13519
Table 'Posts'. Scan count 1, logical reads 13876

Sweaty Mess

Having two indexes like that may not always be the best idea.

To make matters worse, you probably have things going on that make answers less obvious, like actually selecting columns instead of just getting a count.

This is where it pays to look at your indexes over time to see how they’re used, or knowing which query is most important.

There isn’t that much of a difference in time or resources here, after all.

Thanks for reading!

Going Further

The Fastest Way To Get The Highest Value In SQL Server Part 2

Posted on July 24, 2019May 14, 2022 by Erik Darling

Whistle Whistle

In yesterday’s post, we looked at four different ways to get the highest value per use with no helpful indexes.

Today, we’re going to look at how those same four plans change with an index.

This is what we’ll use:

CREATE INDEX ix_whatever
    ON dbo.Posts(OwnerUserId, Score DESC);

Query #1

This is our MAX query! It does really well with the index.

SELECT u.Id,
       u.DisplayName,
	   u.Reputation,
	   ca.Score
FROM dbo.Users AS u
CROSS APPLY
(
    SELECT MAX(Score) AS Score
	FROM dbo.Posts AS p
	WHERE p.OwnerUserId = u.Id

) AS ca
WHERE u.Reputation >= 100000
ORDER BY u.Id;

It’s down to just half a second.

Query #2

This is our TOP 1 query with an ORDER BY.

SELECT u.Id,
       u.DisplayName,
	   u.Reputation,
	   ca.Score
FROM dbo.Users AS u
CROSS APPLY
(
    SELECT TOP (1) p.Score
	FROM dbo.Posts AS p
	WHERE p.OwnerUserId = u.Id
	ORDER BY p.Score DESC

) AS ca
WHERE u.Reputation >= 100000
ORDER BY u.Id;

This finished about 100ms faster than MAX in this run, but it gets the same plan.

Who knows, maybe Windows Update ran during the first query.

Query #3

This is our first attempt at row number, and… it’s not so hot.

SELECT u.Id,
       u.DisplayName,
	   u.Reputation,
	   ca.Score
FROM dbo.Users AS u
CROSS APPLY
(
    SELECT p.Score,
	       ROW_NUMBER() OVER (ORDER BY p.Score DESC) AS n
	FROM dbo.Posts AS p
	WHERE p.OwnerUserId = u.Id
) AS ca
WHERE u.Reputation >= 100000
AND ca.n = 1
ORDER BY u.Id;

While the other plans were able to finish quickly without going parallel, this one does go parallel, and is still about 200ms slower.

Query #4

Is our complicated cross apply. The plan is simple, but drags on for almost 13 seconds now.

SELECT u.Id,
       u.DisplayName,
	   u.Reputation,
	   ca.Score
FROM dbo.Users AS u
CROSS APPLY
(
    SELECT * 
	FROM 
	(
        SELECT p.OwnerUserId,
	           p.Score,
	           ROW_NUMBER() OVER (PARTITION BY p.OwnerUserId 
			                      ORDER BY p.Score DESC) AS n
	    FROM dbo.Posts AS p
	) AS p
	WHERE p.OwnerUserId = u.Id
	AND p.n = 1
) AS ca
WHERE u.Reputation >= 100000
ORDER BY u.Id;

Slip On

In this round, row number had a tougher time than other ways to express the logic.

It just goes to show you, not every query is created equal in the eyes of the optimizer.

Now, initially I was going to do a post with the index columns reversed to (Score DESC, OwnerUserId), but it was all bad.

Instead, I’m going to do future me a favor and look at how things change in SQL Server 2019.

Thanks for reading!

Going Further

The Fastest Way To Get The Highest Value In SQL Server Part 1

Posted on July 23, 2019August 29, 2023 by Erik Darling

Expresso

Let’s say you wanna get the highest thing. That’s easy enough as a concept.

Now let’s say you need to get the highest thing per user. That’s also easy enough to visualize.

There are a bunch of different ways to choose from to write it.

In this post, we’re going to use four ways I could think of pretty quickly, and look at how they run.

The catch for this post is that we don’t have any very helpful indexes. In other posts, we’ll look at different index strategies.

Query #1

To make things equal, I’m using CROSS APPLY in all of them.

The optimizer is free to choose how to interpret this, so WHATEVER.

SELECT u.Id,
       u.DisplayName,
	   u.Reputation,
	   ca.Score
FROM dbo.Users AS u
CROSS APPLY
(
    SELECT MAX(Score) AS Score
	FROM dbo.Posts AS p
	WHERE p.OwnerUserId = u.Id

) AS ca
WHERE u.Reputation >= 100000
ORDER BY u.Id;

The query plan is simple enough, and it runs for ~17 seconds.

Query #2

This uses TOP 1.

SELECT u.Id,
       u.DisplayName,
	   u.Reputation,
	   ca.Score
FROM dbo.Users AS u
CROSS APPLY
(
    SELECT TOP (1) p.Score
	FROM dbo.Posts AS p
	WHERE p.OwnerUserId = u.Id
	ORDER BY p.Score DESC

) AS ca
WHERE u.Reputation >= 100000
ORDER BY u.Id;

The plan for this is also simple, but runs for 1:42, and has one of those index spool things in it.

Query #3

This query uses row number rather than top 1, but has almost the same plan and time as above.

SELECT u.Id,
       u.DisplayName,
	   u.Reputation,
	   ca.Score
FROM dbo.Users AS u
CROSS APPLY
(
    SELECT p.Score,
	       ROW_NUMBER() OVER (ORDER BY p.Score DESC) AS n
	FROM dbo.Posts AS p
	WHERE p.OwnerUserId = u.Id
) AS ca
WHERE u.Reputation >= 100000
AND ca.n = 1
ORDER BY u.Id;

Query #4

Also uses row number, but the syntax is a bit more complicated.

The row number happens in a derived table inside the cross apply, with the correlation and filtering done outside.

SELECT u.Id,
       u.DisplayName,
	   u.Reputation,
	   ca.Score
FROM dbo.Users AS u
CROSS APPLY
(
    SELECT * 
	FROM 
	(
        SELECT p.OwnerUserId,
	           p.Score,
	           ROW_NUMBER() OVER (PARTITION BY p.OwnerUserId 
			                      ORDER BY p.Score DESC) AS n
	    FROM dbo.Posts AS p
	) AS p
	WHERE p.OwnerUserId = u.Id
	AND p.n = 1
) AS ca
WHERE u.Reputation >= 100000
ORDER BY u.Id;

This is as close to competitive with Query #1 as we get, at only 36 seconds.

Wrap Up

If you don’t have helpful indexes, the MAX pattern looks to be the best.

Granted, there may be differences depending on how selective data in the table you’re aggregating is.

But the bottom line is that in that plan, SQL Server doesn’t have to Sort any data, and is able to take advantage of a couple aggregations (partial and full).

It also doesn’t spend any time building an index to help that one.

In the next couple posts, we’ll look at different ways to index for queries like this.

Thanks for reading!

Going Further

How SQL Server’s Missing Index Requests Can Hurt Performance

Posted on July 12, 2019May 14, 2022 by Erik Darling

DON’T THROW EGGS

Thanks for watching!

Going Further

SQL Server 2019: If You Want Adaptive Joins, You Need Wider Indexes

Posted on July 10, 2019May 14, 2022 by Erik Darling

What You Make Of It

We’ve got this query. Handsome devil of a query.

You can pretend it’s in a stored procedure, and that the date filter is a parameter if you want.

SELECT u.DisplayName, p.Score
FROM dbo.Users AS u
JOIN dbo.Posts AS p
ON p.OwnerUserId = u.Id
WHERE u.CreationDate >= '20131201';

A long time ago, when we migrated Stack Overflow from Access to SQL Server 2000, we created indexes.

This one has worked alright.

CREATE INDEX ix_whatever 
    ON dbo.Posts(OwnerUserId);

But Now We’re On 2019

And we’ve, like, read a lot about Adaptive Joins, and we think this’ll be cool to see in action.

Unfortunately, our query doesn’t seem to qualify.

Now, there’s an Extended Event that… Used to work.

These days it just stares blankly at me. But since I’ve worked with this before, I know the problem.

It’s that Key Lookup — I’ll explain more in a minute.

Index Upgrade

First, let’s get rid of the Lookup so we can see the Adaptive Join happen.

CREATE INDEX ix_adaptathy 
    ON dbo.Posts(OwnerUserId, Score);

As We Proceed

Let’s think about what Adaptive Joins need:

An index on the column(s) you’re joining

This gives us a realistic choice between using a Nested Loops join to do efficient Seeks, or an easy scan for a Hash Join.

That index has to cover the query

Without a covering index, there’s too much for the optimizer to think about.

It’s not just making a choice between a Nested Loops or Hash Join, it’s also factoring in the cost of a Lookup.

This used to trigger the XE on eajsrUnMatchedOuter, meaning the outer table didn’t have an index that matched the query.

Why Revisit This?

When SQL Server 2019 comes out, people are gonna have really high hopes for their workloads automagickally getting faster.

While there are lots of things that it’ll likely help, it’s going to take a lot of work on your part to make sure your queries and indexes allow for the automagick to kick in.

Thanks for reading!

Going Further

Why Not Just Go For The Big Plan To Improve SQL Server Query Performance?

Posted on July 8, 2019May 14, 2022 by Erik Darling

M’ocean

Going Further

Eager Index Spool Suggestions For SQL Server

Posted on June 21, 2019May 14, 2022 by Erik Darling

Funtime

I’ve added a couple ideas to User Voice.

The ideas are pretty simple:

Eager Index Spools Should Generate Missing Index Requests

In query plans where an Eager Index Spool is directly after a data access operator, a missing index should be generated in the query plan, and/or missing index DMVs that describes the definition of the index needed to make the spool unnecessary.

I would not expect this to happen when a Lazy Index Spool occurs above a subtree.

I’d appreciate it if you’d consider voting for it. It’s something that I was able to implement pretty easily in sp_BlitzCache.

Eager Index Spools Should Generate Wait Stats

In query plans where an Eager Index Spool is directly after a data access operator, wait stats should be generated while the Spool is built. In a parallel plan, EXECSYNC waits are generated, but in a serial plan, you don’t see anything. Problem scenarios will become more common when FROID is released and adopted.

I would not expect this to happen when a Lazy Index Spool occurs above a subtree.

Thanks for reading!

And voting as many times as possible ?

Going Further

How Select List Column Sizes Change How Big Spool Operators Are In SQL Server Query Plans

Posted on June 18, 2019May 14, 2022 by Erik Darling

I’ll Use Those Columns Later, Maybe

This is a short post, since we’re on the subject of index spools this week, to show you that the columns that go into the spool will impact spool size and build time.

I know, that sounds obvious, but once in a while I care about “completeness”.

We’re going to look at two queries that build eager index spools, along with the time the spool takes to build and how many writes we do.

Query 1

On the side of the query where a spool gets built (inside the apply), we’re only selecting one column.

SELECT      TOP ( 10 )
            u.DisplayName, 
			u.Reputation, 
			ca.*
FROM        dbo.Users AS u
CROSS APPLY 
(   
    SELECT   TOP ( 1 )
			   p.Score
    FROM     dbo.Posts AS p
    WHERE    p.OwnerUserId = u.Id
    AND      p.PostTypeId = 1
    ORDER BY p.Score DESC 
) AS ca
ORDER BY    u.Reputation DESC;

In the query plan, we spend 1.4 seconds reading from the Posts table, and 13.5 seconds building the index spool.

We also do 21,085 writes while building it.

SQL Server Extended Events — Insert comma

Query 2

Now we’re going to select every column in the Posts table, except Body.

If I select Body, SQL Server outsmarts me and doesn’t use a spool. Apparently even spools have morals.

SELECT      TOP ( 10 )
              u.DisplayName, 
			  u.Reputation, 
			ca.*
FROM        dbo.Users AS u
CROSS APPLY 
(   
    SELECT   TOP ( 1 )
               p.Id, p.AcceptedAnswerId, p.AnswerCount, p.ClosedDate, 
               p.CommentCount, p.CommunityOwnedDate, p.CreationDate,
               p.FavoriteCount, p.LastActivityDate, p.LastEditDate,
               p.LastEditorDisplayName, p.LastEditorUserId, p.OwnerUserId, 
               p.ParentId, p.PostTypeId, p.Score, p.Tags, p.Title, p.ViewCount
    FROM     dbo.Posts AS p
    WHERE    p.OwnerUserId = u.Id
    AND      p.PostTypeId = 1
    ORDER BY p.Score DESC 
) AS ca
ORDER BY    u.Reputation DESC;
GO

In the query plan, we spend 2.8 seconds reading from the Posts table, and 15.3 seconds building the index spool.

We also do more writes, at 107,686.

This Is Not A Complaint

I just wanted to write this down, because I haven’t seen it written down anywhere else.

Thanks for reading!

Going Further

Common Parameter Sniffing In Entity Framework Queries For SQL Server

Posted on June 11, 2019May 14, 2022 by Erik Darling

Still Not A Developer

I’m going to use a funny example to show you something weird that I see often in EF queries.

I’m not going to use EF to do it, because I have no idea how to. Please use your vast imagination.

In this case, I’m going to figure out if a user is trusted, and only if they are will I show them certain information.

Here goes!

Trust Bust

The first part of the query establishes if the user is trusted or not.

I think this is silly because no one should ever trust users.

DECLARE @UserId INT = 22656, --2788872
        @PostId INT = 11227809,
		@IsTrusted BIT = 0,
		@SQL NVARCHAR(MAX) = N'';

SELECT @IsTrusted = CASE WHEN u.Reputation >= 10000 
                         THEN 1 
						 ELSE 0 
				    END
FROM   dbo.Users AS u
WHERE  u.Id = @UserId;

The second part will query and join a few tables, but one of the joins (to the Votes table) will only run if a user is trusted.

SET @SQL = @SQL + N'
SELECT p.Title, p.Score,
       c.Text, c.Score,
	   v.*
FROM dbo.Posts AS p
LEFT JOIN dbo.Comments AS c
    ON p.Id = c.PostId
LEFT JOIN dbo.Votes AS v
    ON p.Id = v.PostId
    AND 1 = @iIsTrusted
WHERE p.Id = @iPostId
AND   p.PostTypeId = 1;
';

EXEC sys.sp_executesql @SQL,
                       N'@iIsTrusted BIT, @iPostId INT',
					   @iIsTrusted = @IsTrusted,
					   @iPostId = @PostId;

See where 1 = @iIsTrusted? That determines if the join runs at all.

Needless to say, adding an entire join in to the query might slow things down if we’re not prepared.

First I’m going to run it for user 2788872, who isn’t trusted.

This query finishes rather quickly (2 seconds), and has an interesting operator in it.

SQL Server Query Plan Tool Tip — Pump the brakes

The filter has a startup expression in it, which means it’s sort of a gatekeeper, here. If the parameter is 0, we don’t touch Votes.

If it’s 1… Boy, do we touch Votes. This is another case of where cached plans can lie to us.

Rep Up

If we run this for user 22656 (Jon Skeet) afterwards, we will definitely need to touch the Votes table.

I grabbed the Live Query Plan to show you just how little progress it makes over 5 minutes.

The cached plan will look identical. And looking at the plan, it’ll be hard to believe there’s any way it could run >5 minutes.

If we clear the cache and run this for 22656 first, the plan runs relatively quickly, and looks a little different.

Running it for an untrusted user has a similar runtime. It’s not great, but it’s the better of the two.

Fixing It?

It’s difficult to control EF queries with much granularity.

You could branch the application code to run two different queries based on if a user is trusted.

In a perfect world, you’d never even consider that join at all, and avoid having to worry about it.

On the plus side (at least in this case), the good plan for trusted users runs in the same time as the good plan for untrusted users, even though they’re different.

If you’re feeling extra confident, you can try adding an OPTIMIZE FOR hint to your code, or implementing a plan guide.

Thanks for reading!