SQL Server Indexes Sort Data For Us

Yelling and Screaming


If you saw my post about parameterized TOPs, one thing you may have immediately hated is the index I created.

And rightfully so — it was a terrible index for reasons we’ll discuss in this post.

If that index made you mad, congratulations, you’re a smart cookie.

CREATE INDEX whatever ON dbo.Votes(CreationDate DESC, VoteTypeId)
GO

Yes, my friends, this index is wrong.

It’s not just wrong because we’ve got the column we’re filtering on second, but because there’s no reason for it to be second.

Nothing in our query lends itself to this particular indexing scenrio.

CREATE OR ALTER PROCEDURE dbo.top_sniffer (@top INT, @vtid INT)
AS
BEGIN

    SELECT   TOP (@top) 
             v.Id, 
             v.PostId, 
             v.UserId, 
             v.BountyAmount, 
             v.VoteTypeId, 
             v.CreationDate
    FROM     dbo.Votes AS v
    WHERE    v.VoteTypeId = @vtid
    ORDER BY v.CreationDate DESC;

END;

We Index Pretty


The reason I sometimes see columns appear first in an index is to avoid having to physically sort data.

If I run the stored procedure without any nonclustered indexes, this is our query plan:

EXEC dbo.top_sniffer @top = 1, @vtid = 1;
A SQL Server query plan
spillo

A sort, a spill, kablooey. We’re not having any fun, here.

With the original index, our data is organized in the order that we’re asking for it to be returned in the ORDER BY.

This caused all sorts of issues when we were looking for VoteTypeIds that were spread throughout the index, where we couldn’t satisfy the TOP quickly.

There was no Sort in the plan when we had the “wrong” index added.

A SQL Server query plan
Primal

B-Tree Equality


We can also avoid having to sort data by having the ORDER BY column(s) second in the key of the index, because our filter is an equality.

CREATE INDEX whatever ON dbo.Votes(VoteTypeId, CreationDate DESC)
GO

Having the filter column first also helps us avoid the longer running query issue when we look for VoteTypeId 4.

EXEC dbo.top_sniffer @top = 5000, @vtid = 4;
A SQL Server query plan
I like you better.
Table 'Votes'. Scan count 1, logical reads 2262

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 148 ms.

Solving for Sorts


If you’ve been following my blogging for a while, you’ve likely seen me say this stuff before, because Sorts have some issues.

  • They’re locally blocking, in that every row has to arrive before they can run
  • They require additional memory space to order data the way you want
  • They may spill to disk if they don’t get enough memory
  • They may ask for quite a bit of extra memory if estimations are incorrect
  • They may end up in a query plan even when you don’t explicitly ask for them

There are plenty of times when these things aren’t problems, but it’s good to know when they are, or when they might turn into a problem.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Predicate Selectivity And SQL Server Index Design

Loose Ends


A while back I promised I’d write about what allows SQL Server to perform two seeks rather than a seek with a residual predicate.

More recently, a post touched a bit on predicate selectivity in index design, and how missing index requests don’t factor that in when requesting indexes.

This post should tie the two together a bit. Maybe. Hopefully. We’ll see where it goes, eh?

If you want a TL;DR, it’s that neighboring index key columns support seeks quite easily, and that choosing the leading column should likely be a reflection of which is filtered on most frequently.

If you want more specific advice, I’d be happy to give it to you.

Index Management


Let’s get real wacky and create two indexes.

CREATE NONCLUSTERED INDEX whatever
    ON dbo.Posts ( PostTypeId, ClosedDate );

CREATE NONCLUSTERED INDEX apathy
    ON dbo.Posts ( ClosedDate, PostTypeId );

Now let’s run two identical queries, and have each one hit one of those indexes.

SELECT p.Id, p.PostTypeId, p.ClosedDate
FROM   dbo.Posts AS p WITH (INDEX = whatever)
WHERE  p.PostTypeId = 1
AND    p.ClosedDate >= '2018-06-01';

SELECT p.Id, p.PostTypeId, p.ClosedDate
FROM   dbo.Posts AS p WITH (INDEX = apathy)
WHERE  p.PostTypeId = 1
AND    p.ClosedDate >= '2018-06-01';

If you run them a bunch of times, the first query tends to end up around ~50ms ahead of the second, though they both sport nearly identical query plans.

The seek may look confusing, because PostTypeId seems to appear as both a seek and a residual predicate. That’s because it’s sort of both.

The seek tells us where we start reading, which means we’ll find rows starting with ClosedDate 2018-06-01, and with PostTypeId 1.

From there, we may find higher PostTypeIds, which is why we have a residual predicate; to filter those out.

More generally, a seek can find a single row, or a range of rows as long as they’re all together. When the leading column of an index is used to find a range, we can seek to a starting point, but we need a residual predicate to check for other predicates afterwards.

This is why the index rule of thumb for many people is to start indexes with equality predicates. Any rows located will be contiguous, and we can easily continue the seek while applying any other predicates.

Tool Tips from a SQL Server query plan
Seeky Scanny

There’s also differences in stats time and IO.

Table 'Posts'. Scan count 1, logical reads 8, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 156 ms.

Table 'Posts'. Scan count 1, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 106 ms.

Remember that this is how things break down for each predicate:

Counts from SQL Server Management Studio
Lotsa and Nunna

But in neither case do we need to touch all ~6mm rows of PostTypeId 1 to locate the correct range of ClosedDates.

Downstairs Mixup


When does that change?

When we design indexes a little bit more differenter.

CREATE NONCLUSTERED INDEX ennui
    ON dbo.Posts ( PostTypeId ) INCLUDE (ClosedDate);

CREATE NONCLUSTERED INDEX morose
    ON dbo.Posts ( ClosedDate ) INCLUDE (PostTypeId);

Running the exact same queries, something changes quite drastically for the first one.

Table 'Posts'. Scan count 1, logical reads 16344, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 297 ms,  elapsed time = 301 ms.

Table 'Posts'. Scan count 1, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 187 ms.

This time, the residual predicate hurts us, when we look for a range of values.

Took Tips from a SQL Server Management Studio
Yaw yaw yaw

We do quite a lot of extra reads — in fact, this time we do need to touch all ~6mm rows of PostTypeId 1.

Off By One


Something similar happens if we only rearrange key columns, too.

CREATE NONCLUSTERED INDEX ennui
    ON dbo.Posts ( PostTypeId, OwnerUserId, ClosedDate ) WITH ( DROP_EXISTING = ON );

CREATE NONCLUSTERED INDEX morose
    ON dbo.Posts ( ClosedDate, OwnerUserId, PostTypeId ) WITH ( DROP_EXISTING = ON );

I have both columns I’m querying in the key of the index this time, but I’ve stuck a column I’m not querying at all between them — OwnerUserId.

This also throws off the range predicate. We read ~30k more pages here because the index is larger.

Table 'Posts'. Scan count 1, logical reads 19375, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 312 ms,  elapsed time = 314 ms.

Table 'Posts'. Scan count 1, logical reads 7, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 178 ms.

The Seeks here look identical to the ones when I had columns in the include section of the index.

What’s It All Mean?


Index column placement, whether it’s in the key or in the includes, can have a non-subtle impact on reads, especially when we’re searching for ranges.

Even when we have a non-selective leading column like PostTypeId with an equality predicate on it, we don’t need to read every single row that meets the filter to apply a range predicate, as long as that predicate is seek-able.

When we move the range column to the includes, or we add a column before it in the key, we end up doing a lot more work to locate rows.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

In SQL Server Stored Procedures, Parameters In TOP Are Sniffed Too

Thank You For Your Business


In yesterday’s post, we looked at how row goals affected performance when data matching our join and where clauses didn’t exist.

Today we’re going to look at something similar, and perhaps a workaround to avoid the same issues.

Here’s our index setup:

CREATE INDEX whatever ON dbo.Votes(CreationDate DESC, VoteTypeId)
GO

Now we’re gonna wrap our query in a stored procedure.

CREATE OR ALTER PROCEDURE dbo.top_sniffer (@top INT, @vtid INT)
AS
BEGIN

    SELECT   TOP (@top) v.*
    FROM     dbo.Votes AS v
    WHERE    v.VoteTypeId = @vtid
    ORDER BY v.CreationDate DESC;

END;

Video Killed MTV


You know what? I don’t wanna write all this stuff. Let’s do a video.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Last Week’s Almost Definitely Not Office Hours: January 18

ICYMI


Last week’s thrilling, stunning, flawless episode of whatever-you-wanna-call-it.

Thanks for watching!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

SQL Server Query Performance When You Search For Rare Data Points

I Got No Rows


Over in the Votes table in the Stack Overflow database, a couple of the more popular vote types are 1 and 2.

A vote type of 1 means that an answer was accepted as being the solution by a user, and a vote type of 2 means someone upvoted a question or answer.

SQL Server Query Results
I Voted!

What this means is that it’s impossible for a question to ever have an accepted answer vote cast for it. A question can’t be an answer, here.

Unfortunately, SQL Server doesn’t have a way of inferring that.

SQL Server Query Results
ANSWER ME!

Anything with a post type id of 1 is a question.

The way the tables are structured, VoteTypeId and PostTypeId don’t exist together, so we can’t use a constraint to validate any conditions that exist between them.

SQL Server Management Studio Table
Votan
SQL Server Management Studio Table
Postan

Lost And Found


When we run a query that looks for posts with a type of 2 (that’s an answer) that have a vote type of 1, we can find 2500 of them relatively quickly.

    SELECT   TOP (2500) 
             p.OwnerUserId, 
             p.Score, 
             p.Title, 
             v.CreationDate,
             ISNULL(v.BountyAmount, 0) AS BountyAmount
    FROM     dbo.Posts AS p
    JOIN     dbo.Votes AS v
        ON  p.Id = v.PostId
    WHERE v.VoteTypeId = 2 --WHERE VoteTypeId = 2
    AND   p.PostTypeId = 1
    ORDER BY v.CreationDate DESC;

Here’s the stats:

Table 'Posts'. Scan count 0, logical reads 29044
Table 'Votes'. Scan count 1, logical reads 29131

 SQL Server Execution Times:
   CPU time = 63 ms,  elapsed time = 272 ms.

And here’s the plan:

SQL Server Query Plan
Nelly.

Colossus of Woes


Now let’s ask SQL Server for some data that doesn’t exist.

    SELECT   TOP (2500) 
             p.OwnerUserId, 
             p.Score, 
             p.Title, 
             v.CreationDate,
             ISNULL(v.BountyAmount, 0) AS BountyAmount
    FROM     dbo.Posts AS p
    JOIN     dbo.Votes AS v
        ON  p.Id = v.PostId
    WHERE v.VoteTypeId = 1 --Where VoteTypeId = 1
    AND   p.PostTypeId = 1
    ORDER BY v.CreationDate DESC;

Here’s the stats:

Table 'Posts'. Scan count 0, logical reads 11504587
Table 'Votes'. Scan count 1, logical reads 11675392

 SQL Server Execution Times:
   CPU time = 14813 ms,  elapsed time = 14906 ms.

You could say things got “worse”.

Not only that, but they got worse for the exact same plan.

SQL Server Query Plan
Argh.

So What Happened?


In the original plan, the TOP asked for rows, and quickly got them.

In the second plan, the TOP kept asking for rows, getting them from the Votes table, and then losing them on the join to Posts.

There was no parameter sniffing, there were no out of date stats, no blocking, or any other oddities. It’s just plain bad luck because of the data’s relationship.

If we apply hints to this query to:

  • Scan the clustered index on Votes
  • Choose Merge or Hash joins instead of Nested Loops
  • Force the join order as written

We get much better performing queries. The plan we have is chosen because the TOP sets a row goal that makes a Nested Loops plan using narrow (though not covering) indexes attractive to the optimizer. When it’s right, like in the original query, you probably don’t even think about it.

When it’s wrong, like in the second query, it can be quite mystifying why such a tiny query can run forever to return nothing.

If you want to try it out for yourself, use these indexes:

    CREATE INDEX whatever 
        ON dbo.Votes( CreationDate, VoteTypeId, PostId );

    CREATE NONCLUSTERED INDEX apathy
        ON dbo.Posts ( PostTypeId )
            INCLUDE ( OwnerUserId, Score, Title );

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

SQLBits Precon Logistics

Sell Out


If you’re coming to my SOLD OUT precon, and really, I appreciate that you all chose to come to my session, here’s some stuff you should do:

Join the Slack channel! Forgive their recent logo sins, and hang out in there to ask questions, yell at me, or ask when lunch is.

You can do that by going to http://sqlslack.com/ and entering your email address to get an invite. Once you’re in, you’ll wanna join #sqlbits-tuning.

If you wanna play along with any parts of the demos, you’ll wanna download this copy of the StackOverflow database.

Fair warning: if you’re gonna do this, do it well in advance. Downloading over public W-i-Fi is quite a gamble.

Lastly But Not Leastly


Check out all the other great sessions available that have seats remaining in them.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Costumery For A Good Cause At SQLBits

Charity Ain’t Just A Dancer


Whether or not you’ll be at SQLBits, I’d be quite grateful if you considered giving.

It’s for a good cause, and, well, I already bought the outfit.

Time to work on that mustache.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

What Metrics Does Rebuilding Indexes Improve In SQL Server?

Gall Of A Red Ox


For Row Store indexes — Columnstore is different and should be maintained differently — rebuilding indexes gets prescribed as a cure for performance issues of all types.

  • Sudden query slowdown? Rebuild’em!
  • Blocking? Rebuild’em!
  • High CPU? Rebuild’em!
  • No idea what’s going on? Rebuild’em!

While an index rebuild will update stats, which is sometimes beneficial, it’s a lot easier to, you guessed it, just update stats.

The main problem is that your index rebuild script is probably measuring the wrong thing.

It’ll have the same net effect on plan invalidation, with a heck of a lot of less resource usage and, potentially, blocking.

Bear Bile


It’s my opinion, and you can take it or leave it, that index rebuilds should be reserved for special circumstances.

  • You deleted a lot of data
  • You need to change something about the index
  • You have a Heap with a lot of forwarded fetches

But why is that my opinion? What evidence has informed it? No, it’s not just because I like to be disagreeable. It’s mostly that I went through a fairly normal progression.

  • A lot of people are talking about index fragmentation, sounds bad!
  • Hey, I think this fixed something? I’m gonna keep doing it.
  • Well, I still have problems, but at least I don’t have fragmentation.
  • I don’t have enough time to run CHECKDB, I need to rebuild less often.
  • No one seems to be complaining when I don’t rebuild indexes…
  • My problems had nothing to do with index fragmentation!

Here’s How To Order


But what metrics might an index rebuild fix, and why was so much fuss made about them for so long?

To test this, and to get you, dear reader, an explanation, I set up some tests.

First, I created a simple index.

CREATE INDEX ix_whatever ON dbo.Users (Reputation);

Then, I wrote queries that will touch 1%, 10%, and 100% of the Users table (2013 version, 2.4mm rows).

SELECT COUNT(*) AS records
FROM dbo.Users AS u
WHERE u.Reputation = 289
AND 1 = (SELECT 1);


SELECT COUNT(*) AS records
FROM dbo.Users AS u
WHERE u.Reputation BETWEEN 100 AND 450
AND 1 = (SELECT 1);


SELECT COUNT(*) AS records
FROM dbo.Users AS u
WHERE 1 = (SELECT 1);

After that, it’s a matter of introducing the “harmful” kind of fragmentation — empty space on pages.

This is the kind of fragmentation that bloats your indexes, both on disk and in memory, leaving less room for other things.

ALTER INDEX ix_whatever ON dbo.Users REBUILD WITH(FILLFACTOR = 5);
ALTER INDEX ix_whatever ON dbo.Users REBUILD WITH(FILLFACTOR = 10);
ALTER INDEX ix_whatever ON dbo.Users REBUILD WITH(FILLFACTOR = 20);
ALTER INDEX ix_whatever ON dbo.Users REBUILD WITH(FILLFACTOR = 30);
ALTER INDEX ix_whatever ON dbo.Users REBUILD WITH(FILLFACTOR = 40);
ALTER INDEX ix_whatever ON dbo.Users REBUILD WITH(FILLFACTOR = 50);
ALTER INDEX ix_whatever ON dbo.Users REBUILD WITH(FILLFACTOR = 60);
ALTER INDEX ix_whatever ON dbo.Users REBUILD WITH(FILLFACTOR = 70);
ALTER INDEX ix_whatever ON dbo.Users REBUILD WITH(FILLFACTOR = 80);
ALTER INDEX ix_whatever ON dbo.Users REBUILD WITH(FILLFACTOR = 90);
ALTER INDEX ix_whatever ON dbo.Users REBUILD WITH(FILLFACTOR = 100);

Now, since I don’t wanna take that many screenshots, I embraced my long-lost ENTJ and made an Excel table.

Retirement Plan

Not surprisingly, the chief metric that went down as fill factor went up is reads. CPU didn’t change much, if at all.

That being said, you have to be real goofy about fill factor to get reads to matter.

Within a sane boundary of fill factors, which I’d call 70 to 100, ask yourself if any variation there is the root cause of your performance problems.

Now, I know you. It’s hard to talk you out of your superstitions, and you’re gonna keep thrashing that expensive SAN with constant I/O nonsense.

But maybe I can talk you into raising the threshold that you do it at. Maybe I can get you to not rebuild (or reorg) your indexes until they hit >70% fragmentation. That’s 30% fill factor on the chart.

Virgin Wax


Alright, now we know that rebuilds help reads. When would number of reads matter most?

  • If your disks are the quite elderly spinning variety
  • And you don’t have enough memory to cache your hot data
  • And maybe your hot data could fit into memory if were defragmented to 100% fill factor

You can kinda start to put a picture together of when it mattered. Before SSDs, before 64bit OSes (32b capped out at ~3GB RAM), and well before flash storage, etc.

Way back when, rebuilding made sense. Reading unordered data could have a big impact on a workload because disks would have to physically move to find unordered/spread out data.

So yeah, if you notice that your workload is doing a lot more reads over time, and rebuilding at 100% fill factor reduces reads, it might be worth rebuilding at some point. But to solve any other common workload problems, look somewhere else. Rebuilding indexes is not the cure.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Open Office Hours, Or Something

Just Like Last Week


I’ll be on YouTube Friday at Noon EST, answering your questions about life, love, SQL Server, and fashion choices.

This time with 33% less potato.

Last week I was a victim of “how hard can this be?”, only to find out that I needed to download an encoder and figure out how to use it to broadcast.

Which is weird, because when I did a dry run, the website seemed to pick up on all the hardware stuff. Ah well. I’m still working on the single channel audio thing, which is apparently something that a lot of people with my setup have. Which solution will work?

TUNE IN TO FIND OUT!

If you wanna watch last week’s episode catch it here!

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

What’s Your Second Biggest SQL Server Problem?

Checking It Twice


Make a list of the top 3-5 problems you have on your SQL Server. Go ahead, I’ll wait.

Now put them in order. Which one is the biggest problem, and who is it a problem for?

That’s a big problem, huh? It’d take a lot of time and effort to fix that problem.

But what about the rest of your problems? If you look down the list, you’ll probably find stuff that you could make some progress on in the meantime.

Don’t Get Stuck


Just as an example, let’s say you have to install SQL Server. When you go to set it up, you realize that the service accounts weren’t made for it, and the Sysadmin who takes care of that stuff is out sick (of being a Sysadmin).

You might hold off on doing everything else because of that roadblock, even though you could have the server ready to go, and just swap the service account in later.

The same thing happens to people with other problems, too. Let’s say the list looks like this:

  1. We’re importing XML and shredding it constantly while queries run
  2. There are some scalar valued functions that we use as helpers to parse the XML
  3. We’ve got tables that need indexes (including clustered indexes)
  4. tempdb only has one data file
  5. Our parallelism settings are wrong

A lot of people will get stuck on their biggest problem. It’s happened to me, too.

Heck, just setting things up here over the past week has been full of roadblocks, and sometimes I’d find myself saying “well, what’s the point of doing X if I don’t have Y yet?”.

Current Events


Right now, my biggest problem is waiting on some paperwork so I can open a business checking account, but if I sat around waiting on that before setting up a mailing list, blogging, setting up Zapier, and all that, I’d be way far behind when that stuff did come in.

And when it did, my second biggest problem would turn into my biggest problem.

If you find yourself not working on other problems because you’ve got one monolithic problem, try to remember that solving smaller problems has an upstream effect. Especially with SQL Server, where small changes can make big impacts.

I’ve seen servers go from constantly running out of worker threads to humming along happily just by having smarter parallelism settings.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.