Indexing Archives | Page 11 of 21

Starting SQL: How Eager Index Spool Operators Hurt Query Performance In SQL Server

Posted on September 14, 2020May 15, 2022 by Erik Darling

Hot To Trot

Most of the time when the optimizer thinks an index will make a big enough difference, it’ll tell you.

Sure, it’s not perfect, but it can get you where you’re going fast enough. If you’re relying on that sort of feedback in query plans, or in the missing index DMVs, you’ll hate this.

Not only does SQL Server create an index for you, it doesn’t really tell you about it. There’s no loud warning here.

It also throws that index away when the query is done executing. It only exists in tempdb while the query executes, and it’s only available to the query that builds it.

And boy, they sure can take a long time to build.

Let’s take a closer look!

Skidding Out

Eager index spools can occur on the inner side of Nested Loops joins to reduce the amount of work that needs to be done there, by creating a more opportune index for the loop to work off of.

That all sounds very good, but there are some problems:

The index gets created single-threaded
The way data is loaded into the spool is very inefficient
The spool is disposed of when the query is finished,
It’ll get built over and over again on later executions
There’s no missing index request for the spool anywhere

I’m not suggesting that the query would be faster without the spool. Reliable sources tell me that this thing runs for over 6 hours without it. My suggestion is that when you see Eager Index spools, you should pay close attention.

Let’s talk about how you can do that.

Mountainous

In some cases, your only option is to look at the Eager Index Spool to see what it’s doing, and create an index on your own to mimic it.

If you’re the kind of person who likes free scripts, sp_BlitzCache will look for Eager Index Spools in your query plans and do that for you. You’re welcome.

If you’re a more manual type, here’s what you do: Look at the Eager Index Spool.

SQL Server Query Plan Tool Tip — ham ham ham ham

The Seek Predicate(s) are they key columns, and the Output List is the included columns.

CREATE INDEX spool_b_gone 
ON dbo.Posts(ParentId) INCLUDE (Score);

Since the Id column is the clustered index, we don’t explicitly need it in the index definition — remember that nonclustered indexes inherit them. It’ll end up as a “hidden” key column, after all.

Human Touch

In most cases, this will be good enough. The performance difference will be night and day, if the build source for the Eager Index Spool was fairly large, or if your query built the same Eager Index Spool multiple times.

Though just like missing index requests, Eager Index Spools don’t always come up with the *best* index.

Thinking through our query, we may want to move Score up to the key of the index.

SELECT pq.OwnerUserId, pq.Score, pq.Title, pq.CreationDate,
       pa.OwnerUserId, pa.Score, pa.CreationDate
FROM dbo.Posts AS pa
INNER JOIN dbo.Posts AS pq
    ON pq.Id = pa.ParentId
WHERE pq.PostTypeId = 1
AND   pq.CommunityOwnedDate IS NULL
AND   pq.AnswerCount > 1
AND   pa.PostTypeId = 2
AND   pa.OwnerUserId = 22656
AND   pa.Score >
(
    SELECT MAX(ps.Score)
    FROM dbo.Posts AS ps
    WHERE ps.ParentId = pa.ParentId 
    AND   ps.Id <> pa.Id
)
ORDER BY pq.Id;

See that subquery at the very end, where we’re aggregating on Score? Having Score in the key of the index will put the data in order, which makes a Stream Aggregate pretty painless. Remember that Stream Aggregates expect sorted input.

That’s, like, how they stream.

CREATE INDEX spool_b_gone 
    ON dbo.Posts(ParentId, Score);

Letter To The Query Editor

There are some cases where changing a query is a lot less painful than changing or adding indexes.

You might already have a lot of indexes, or you might have a really big table, or you might be on Standard Edition, which is a lot like being on one of those airplanes where the entire thing is economy class.

In this case, we can rewrite the query in a way that avoids the Eager Index Spool entirely:

SELECT pq.OwnerUserId, pq.Score, pq.Title, pq.CreationDate,
        pa.OwnerUserId, pa.Score, pa.CreationDate
FROM dbo.Posts AS pa
INNER JOIN dbo.Posts AS pq
    ON pq.Id = pa.ParentId
WHERE pq.PostTypeId = 1
AND   pq.CommunityOwnedDate IS NULL
AND   pq.AnswerCount > 1
AND   pa.PostTypeId = 2
AND   pa.OwnerUserId = 22656
AND NOT EXISTS
(
    SELECT 1/0
    FROM dbo.Posts AS ps
    WHERE ps.ParentId = pa.ParentId
    AND   ps.Id <> pa.Id
    AND   ps.Score >= pa.Score
)
ORDER BY pq.Id;

Which gets us a different plan. And you can see why we’d want one.

Avoiding the need for an Eager Index Spool reduces the query time from over a minute to under a second.

Thanks for reading!

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Starting SQL: Blind Spots In SQL Server’s Missing Index Requests

Posted on September 11, 2020May 15, 2022 by Erik Darling

Underfoot

There are all sorts of things that might artificially keep the optimizer from suggesting a helpful index while you’re writing or tuning a query.

But go figure, there are also totally sensible things that will make it less likely that you’ll get one, and there are definitely things that the requests don’t consider as heavily.

That doesn’t mean they’re awful and wrong, it just means you should practice looking at the requests that get generated with execution plans to see if there’s anything you can tweak to make things run a little bit more smoothly.

Anyway, let’s learn some important lessons about missing index requests.

Lesson One: Inequalities Are Iffy

In my copy of the Stack Overflow database, both of these queries return the same single row.

There is only one comment with 1270(!!!) upvotes, so either query works to locate the data.

/*Inequality predicate*/
SELECT c.CreationDate,
       c.PostId,
       c.Score,
       c.Text,
       c.UserId
FROM dbo.Comments AS c
WHERE c.Score >= 1270 --Hello I'm here
AND   c.CreationDate >= '20110101'
AND   c.CreationDate <  '20120101'
ORDER BY c.CreationDate DESC;

/*Equality predicate*/
SELECT c.CreationDate,
       c.PostId,
       c.Score,
       c.Text,
       c.UserId
FROM dbo.Comments AS c
WHERE c.Score = 1270 --Hello I'm here
AND   c.CreationDate >= '20110101'
AND   c.CreationDate <  '20120101'
ORDER BY c.CreationDate DESC;

Strangely, only one query qualifies for a missing index request, and it’s not the first one.

The index request look like this:

CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[Comments] ([Score],[CreationDate])
INCLUDE ([PostId],[Text],[UserId])

Which is likely fine, as long as you’re okay with the Text column being in there. It’s an NVARCHAR(700), which makes me a little hesitant.

Not adding it, of course, means the optimizer would have to make a choice about Key Lookups vs. Clustered Index Scans, depending on predicates.

Databases are hard.

Lesson Two: Joins Don’t Get A Lot Of Love

If we add a join to the Posts table, the missing index request doesn’t change.

SELECT c.CreationDate,
       c.PostId,
       c.Score,
       c.Text,
       c.UserId
FROM dbo.Comments AS c
JOIN dbo.Posts AS p
    ON p.Id = c.PostId
    AND p.OwnerUserId = c.UserId
WHERE c.Score = 1270 --Hello I'm here
AND   c.CreationDate >= '20110101'
AND   c.CreationDate <  '20120101'
AND 1 = (SELECT 1)
ORDER BY c.CreationDate DESC;

We get the same missing index request as above, which might strike you as a little odd.

savage

Sometimes it’s nice to have join columns in the key of an index. Orderly data tends to join more easily, and can give the optimizer more efficient ways to do it.

It probably doesn’t matter a ton for Hash Joins, but Merge Joins expect data in order. If it’s not, the optimizer will introduce a Sort to get it in order. Likewise, Nested Loops without indexes can be painful.

Having columns as includes means it’s not stored in an order that might be helpful. In fact, it’s not stored in any order at all.

Lesson Three: Neither Does Order By

In our original query, the CreationDate column, by coincidence, was both in the WHERE and ORDER BY clauses.

It ended up in the key columns of the index because of the WHERE clause, though. If we remove it from there, it’ll move to the included columns.

SELECT c.CreationDate,
       c.PostId,
       c.Score,
       c.Text,
       c.UserId
FROM dbo.Comments AS c
WHERE c.Score = 1270 --Hello I'm here
ORDER BY c.CreationDate DESC;

SQL Server Missing Index Request — now you’re getting sued

Why Are We Talking About Indexes Again?

If I do enough drilling, hopefully you’ll understand how important they are to performance.

What I want you to take away from this post is that when you’re looking specifically at missing index requests, whether they’re in query plans or in the missing index DMVs:

They’re not going to consider columns outside of the WHERE clause to be in the key of the index
If your where clause doesn’t have an equality predicate, it’s a lot harder to get missing index requests
Columns that would be helpful to have in index order won’t end up in the key if they’re not in the WHERE
Even the order of columns suggested for being in the key of the index isn’t scientific

I covered a lot of the “why” in these posts:

Without them, it may look weird to spend time talking about the limitations here. But this is why databases still need humans to care for them.

As much as I look forward to a day where they don’t, little things like which columns should be in the key of an index, and the order they should be in, are still important considerations.

You know. Little things.

Thanks for reading!

Going Further

Starting SQL: Why Wouldn’t An Index Help Your SQL Server Query Go Faster?

Posted on September 10, 2020May 14, 2022 by Erik Darling

Pennies From Redmond

We’re spoiled over here in SQL Server land. When we run queries, the optimizer will suggest indexes that might help it run faster. I say might, because sometimes it’s wrong. Like I mentioned before, these suggestions are a bit hasty.

If you read the limitations, the first thing they say is “It is not intended to fine tune an indexing configuration”.

And they’re not. But you know what? If you don’t have anyone who is fine tuning an indexing configuration, they’re a heck of a lot better than a whole lot of nothing.

Are You There, Optimizer?

Starting with no nonclustered indexes, and a clustered index on an unrelated column, this query is a prime candidate for an index. And yet, the optimizer has forsaken us in our time of need.

SELECT COUNT_BIG(*) AS records
FROM dbo.Comments AS c
WHERE COALESCE(c.UserId, 0) = 0;

This isn’t specific to COALESCE, any ol’ function — built-in or otherwise — would do it. But the optimizer doesn’t think an index would make this query any faster. That’s a real bummer.

You may hear people say things like “you can’t use indexes with functions”, but that sentence is incomplete. What’s meant is “you can’t use indexes to seek with functions”.

Going Boing

This goes back to the concept of SARGability that I talked about before. If we do a little bit more typing, the optimizer is our friend again. I know, typing more sucks, but you’ll thank me later.

SELECT COUNT_BIG(*) AS records
FROM dbo.Comments AS c
WHERE ( c.UserId = 0 
        OR c.UserId IS NULL );

The optimizer is back to having good ideas for us. But notice that both of these queries end up taking the same amount of time and doing the same amount of work.

If you were just getting started writing queries, performance tuning, or looking at query plans, you might see this and assume that query performance would remain identical.

It’s sort of a matter of opportunity, here. Like I said before, when you wrap predicates in functions, you lose the ability to seek into indexes.

Sure, you could use an index on UserId in either query, but it doesn’t usually fix a whole lot if you need to scan the entire thing.

Copying In

If we add in the index on UserId and compare plans, the difference is obvious.

Our clearly-expressed query does a lot better here. It’s able to seek, and even running with a serial plan is much faster than the parallel plan for the poorly-expressed query. These things do matter to performance. Functions, even built-in ones, have no relational meaning to the optimizer, and can’t be reasoned with.

When it comes to writing queries, it usually pays to over-communicate. When you take shortcuts, you might not get the best possible outcome.

Tomorrow, we’ll look at hidden missing index requests.

Thanks for reading!

Going Further

Starting SQL: Why Your SQL Server Query Can’t Go Parallel, Scalar Functions Edition

Posted on September 9, 2020May 17, 2022 by Erik Darling

I Won’t Share You

Yesterday we looked at where table variables can have a surprising! impact on performance. We’ll talk more about them later, because that’s not the only way they can stink. Not by a long shot. Even with 1 row in them.

Anyway, look, today’s post is sort of like yesterday’s post, except I’ve had two more drinks.

What people seem to miss about scalar valued functions is that there’s no distinction between ones that touch data and ones that don’t. That might be some confusion with CLR UDFs, which cause parallelism issues when they access data.

Beans and Beans

What I want to show you in this post is that it doesn’t matter if your scalar functions touch data or not, they’ll still have similar performance implications to the queries that call them.

Now look, this might not always matter. You could just use a UDF to assign a value to a variable, or you could call it in the context of a query that doesn’t do much work anyway. That’s probably fine.

But if you’re reading this and you have a query that’s running slow and calling a UDF, it just might be why.

If the UDF queries table data and is inefficient
If the UDF forces the outer query to run serially

They can be especially difficult on reporting type queries. On top of forcing them to run serially, the functions also run once per row, unlike inline-able constructs.

Granted, this once-per-row thing is worse for UDFs that touch data, because they’re more likely to encounter the slings and arrows of relational data. The reads could be blocked, or the query in the function body could be inefficient for a dozen reasons. Or whatever.

I’m Not Touching You

Here’s a function that doesn’t touch anything at all.

CREATE OR ALTER FUNCTION dbo.little_function (@UserId INT)
RETURNS BIGINT
WITH SCHEMABINDING, 
     RETURNS NULL ON NULL INPUT
AS
BEGIN
DECLARE @d DATETIME = GETDATE();
RETURN
(
    ( 
      SELECT @UserId
    )
)
END
GO

I have the declared variable in there set to GETDATE() to disable UDF inlining in SQL Server 2019.

Yes, I know there’s a function definition to do the same thing, but I want you to see just how fragile a feature it is right now. Again, I love where it’s going, but it can’t solve every single UDF problem.

Anyway, back to the story! Let’s call that function that doesn’t do anything in our query.

SELECT TOP (1000) 
    c.Id,
    dbo.little_function(c.UserId)
FROM dbo.Comments AS c
ORDER BY c.Score DESC;

The query plan looks like so, with the warning in properties about not being able to generate a valid parallel plan.

In this plan, we see the same slowdown as the insert to the table variable. There’s no significant overhead from the function, it’s just slower in this case because the query is forced to run serially by the function.

This is because of the presence of a scalar UDF, which can’t be inlined in 2019. The serial plan represents, again, a significant slowdown over the parallel plan.

Bu-bu-bu-but wait it gets worse

Let’s look at a worse function.

CREATE OR ALTER FUNCTION dbo.big_function (@UserId INT)
RETURNS BIGINT
WITH SCHEMABINDING, 
     RETURNS NULL ON NULL INPUT
AS
BEGIN
DECLARE @d DATETIME = GETDATE();
RETURN
(
    ( 
      SELECT SUM(p.Score)
      FROM dbo.Posts AS p
      WHERE p.OwnerUserId = @UserId
    ) - 
    (
      SELECT SUM(c.Score)
      FROM dbo.Comments AS c
      WHERE c.UserId = @UserId
    )
)
END
GO

Not worse because it’s a different kind of function, just worse because it goes out and touches tables that don’t have any helpful indexes.

Getting to the point, if there were helpful indexes on the tables referenced in the function, performance wouldn’t behave as terribly. I’m intentionally leaving it without indexes to show you a couple funny things though.

Because this will run a very long time with a top 1000, I’m gonna shorten it to a top 1.

SELECT TOP (1) 
    c.Id,
    dbo.big_function(c.UserId)
FROM dbo.Comments AS c
ORDER BY c.Score DESC;

Notice that in this plan, the compute scalar takes up a more significant portion of query execution time. We don’t see what the compute scalar does, or what the function itself does in the actual query plan.

The compute scalar operator is what’s responsible for the scalar UDF being executed. In this case, it’s just once. If I had a top that asked for more than one row, It would be responsible for more executions.

We don’t see the function’s query plan in the actual query, because it could generate a different query plan on each execution. Would you really want to see 1000 different query plans?

Anyway, it’s quite easy to observe with operator times where time is spent here. Most people read query plans from right to left, and that’s not wrong.

In that same spirit, we can add operator times up going from right to left. Each operator not only account for its own time, but for the time of all operators that come before it.

The clustered index scan takes 7.5 seconds, the Sort takes 3.3 seconds, and the compute scalar takes 24.9 seconds. Wee.

Step Inside

If you get an actual plan for this query, you won’t see what the function does. If you get an estimated plan, you can get a picture of what the function is up to.

This is what I meant by the function body being allowed to go parallel. This may lead to additional confusion when the calling query accrues parallel query waits but shows no parallel operators, and has a warning that a parallel plan couldn’t be generated.

It’s Not As Funny As It Sounds

If you look at a query plan’s properties and see a non-parallel plan reason, table variable modifications and scalar UDFs will be the most typical cause. They may not always be the cause of your query’s performance issues, and there are certainly many other local factors to consider.

It’s all a bit like a game of Clue. You might find the same body in the same room with the same bashed in head, but different people and blunt instruments may have caused the final trauma.

Morbid a bit, sure, but if query tuning were always a paint by numbers, no one would stay interested.

Anyway.

In the next posts? we’ll look at when SQL Server tells you it needs an index, and when it doesn’t.

Thanks for reading!

Going Further

Starting SQL: Index Wrap Up

Posted on August 21, 2020May 16, 2022 by Erik Darling

Watch Out Now

We talked about a bunch of index stuff this week, and we still barely scratched the surface. That’s okay, though. If you try to cram it all in at once, you’ll get overwhelmed and forget half of it anyway.

If you want to jump ahead to some more advanced topics, check out my presentation about indexes here.

You probably have questions about indexes that I didn’t cover here. Feel free to ask them in the comments section.

The core concepts I want you to take away from this week are:

Indexes put data in order
That order is what makes searching and joining more efficient
Having data in order can also help other parts of the query
Putting data in order without an index requires memory, sometimes a lot

Frog Meditation

If you’re wondering why I didn’t talk at all about index fragmentation, it’s because it’s not an effective query or index tuning strategy. I see far too many people set up some scripts to “fix” it, and then walk away from real tuning work.

I don’t get it, the indexes aren’t fragmented. Why is my query still slow?

Leaving aside that index maintenance scripts all measure the wrong thing, hammering your server night after night with pointless maintenance doesn’t fix the root cause of why performance is bad.

It doesn’t fix problems with functions, sargability, table variables, local variables, or anything else.

Sure, you can say your query got 100ms faster because you got pages back in order, but who cares about that when it still runs for 5-10 seconds, or longer?

Riding Through

The next couple things we’re going to cover this month are wait stats and query plans.

After that, who knows?

Thanks for reading!

Going Further

Starting SQL: How To Choose Key Column Order In SQL Server Indexes

Posted on August 19, 2020May 14, 2022 by Erik Darling

Found A Picture Of You

The way that many people will tell you to design indexes is something along the lines of:

Equality predicates first =
Inequality predicates next >, >=, <, <=, <>
This point intentionally left blank
Put all your select list columns as includes

Well, okay. That probably works with one of those tiny example databases that people keep coming up with creative ways to make bigger.

Real life is often larger than 100MB, though. And besides, real life queries are far more complicated.

Familiar

We don’t have to get too complicated to explose some of the issues with those rules, as queries go a little beyond the where clause.

SELECT TOP (5000) 
    v.Id,
    v.PostId,
    v.UserId,
    v.BountyAmount,
    v.VoteTypeId,
    v.CreationDate 
FROM dbo.Votes AS v
WHERE v.VoteTypeId = ?
AND   v.CreationDate >= ?
ORDER BY v.BountyAmount DESC;

If we were to faithfully follow the Rules of Indexing™, we might end up with an index like this:

CREATE INDEX an_attempt_was_made
    ON dbo.Votes (VoteTypeId, CreationDate)
    INCLUDE(BountyAmount, PostId, UserId);

Which, aside from Include column order (which we know doesn’t matter), is about what SQL Server’s missing index recommendation would be.

Could it be that conventional index design wisdom is based on a faulty algorithm?

Could be. Could be.

Strange

But what happens when we add the index, according to ancient index law, and run queries with different values that have different selectivities?

SELECT TOP (5000) 
    v.Id,
    v.PostId,
    v.UserId,
    v.BountyAmount,
    v.VoteTypeId,
    v.CreationDate 
FROM dbo.Votes AS v
WHERE v.VoteTypeId = 7 --Not a lot of these
AND   v.CreationDate >= '20080101'
ORDER BY v.BountyAmount DESC;


SELECT TOP (5000) 
    v.Id,
    v.PostId,
    v.UserId,
    v.BountyAmount,
    v.VoteTypeId,
    v.CreationDate 
FROM dbo.Votes AS v
WHERE v.VoteTypeId = 2 --Yes a lot of these
AND   v.CreationDate >= '20080101'
ORDER BY v.BountyAmount DESC;

This isn’t parameter sniffing, because we used literal values, and got different executions plans. They even have correct estimates.

Both queries used our index, too. It did okay for a small amount of data, but for a larger amount of data, we got totally sunk.

Order: Equality vs Inequality Predicates

We’ve talked about how indexes make searching easier by putting data in order, and how each key column in an index has a dependency for that order on the column before it. Obviously having the BountyAmount column as an include isn’t going to be helpful, because those columns aren’t stored in any order.

But there’s an important difference in where we put the BountyAmount column in the key of the index. For example, if we change it to look like this:

CREATE INDEX an_attempt_was_made
    ON dbo.Votes (VoteTypeId, CreationDate, BountyAmount)
    INCLUDE(PostId, UserId) WITH(DROP_EXISTING = ON);

We’re not going to do any better. Why? The predicate on CreationDate is an inequality. BountyAmount is only in order within groups of duplicate dates.

Once we cross a date boundary, the order resets, just like we talked about in yesterday’s post. If we want to tune this index to make this query fast for large or small predicates on VoteTypeId, we need to change our index to look like this:

CREATE INDEX an_attempt_was_made
    ON dbo.Votes (VoteTypeId, BountyAmount, CreationDate)
    INCLUDE(PostId, UserId) WITH(DROP_EXISTING = ON);

Rules Of The Road

For index key columns, data stored in order can make searching, joining, grouping, and ordering in queries a lot more efficient. That’s probably just the sort of thing you’re after when query tuning. And index tuning is, after all, part of query tuning.

But I bet you’re asking: why focus so much on Sorts? Tune in tomorrow to find out why!

Thanks for reading!

Going Further

Starting SQL: How SQL Server Stores Data In Nonclustered Indexes

Posted on August 18, 2020May 14, 2022 by Erik Darling

But What Happens When…

People may tell you to always put the most selective column first in a query, but selectivity depends on more than just what values are in a column.

It also depends on how columns are searched, doesn’t it? If people are using inequalities, like >, >=, < , <= then having a totally unique value on every row becomes a bit less helpful.

Likewise, if people can search IN() or NOT IN, NULL or NOT NULL, or even if perhaps the data in a column is only selective for some values, then selectivity can be a whole lot less selective.

Beyond that, it ignores a whole world of considerations around how you’re grouping or ordering data, if your query is a top (n) with an order by, and more.

Before we go jumping off on such wild adventures, let’s talk a little bit about multi-key indexes. It’s easy enough to visualize a single column index putting data in order, but multi-key indexes present a slightly different picture.

Janitorial

Single-column clustered indexes make a lot of sense. Single column nonclustered indexes often make less sense.

It’s sort of like the difference between a chef knife and a Swiss Army knife. You want one to be really good at one specific task, and another to be pretty useful to a bunch of tasks.

Will a Swiss Army knife be the best wine opener you’ve ever owned? No, but it’s a whole lot easier than trying to get a cork out with a cleaver, and it can also be a screwdriver, a pair of scissors, and open a beer bottle for your less industrious friends who can’t quite muster the strength to cope with a twist-off cap.

That multi-tool ability comes at a bit of a cost, too. There’s no such thing as a free index column (unless the table is read only).

All those columns have to be maintained when you modify table data, of course.

And there’s another thing: every key column in the index is dependent on the column that comes before it. Rather than try to word-problem this for you, let’s just look at some demos.

Withdrawals

Let’s say we’ve got this index which, albeit simple, is at least more than a single column. Congratulations, you’ve graduated.

CREATE INDEX joan_jett
ON dbo.Posts
(
    PostTypeId, Score
);

If we write queries like this, we’ll be able to use it pretty efficiently.

SELECT p.Id, p.PostTypeId, p.Score
FROM dbo.Posts AS p
WHERE p.PostTypeId = 7
AND   p.Score = 1;

I’m skipping over a little bit now, because data is mightily skewed in the PostTypeId column towards a couple of quite-common values. I’ll get to it, though.

For now, marvel at the simplicity and Seekiness of this plan.

Now let’s try to find data in the Score column without also searching on the PostTypeId column.

SELECT p.Id, p.PostTypeId, p.Score
FROM dbo.Posts AS p
WHERE p.Score = 999;

A couple things changed, here. We had to scan through the index to find Scores we’re interested in, and the optimizer thought that this would be a process-intensive enough task to use multiple CPU cores to do it.

Okay then.

Age Of Reason

If you’ve been kicking around databases for a little bit, you may have read about this before, or even seen it in action when writing queries and creating indexes.

What I’d like to do is try to offer an explanation of why that happens the way it does: Columns within an index are not ordered independently.

In other words, you don’t have all of the PostTypeIds in ascending order, and then all of the Scores in ascending order. You do have all the PostTypeIds in ascending order, because it’s the leading column, but Scores are only in ascending order after PostTypeId.

A simple query gets illustrative enough results.

SELECT p.PostTypeId, p.Score
FROM dbo.Posts AS p
WHERE p.PostTypeId IN (1, 2)
AND   p.Score BETWEEN 1950 AND 2000
ORDER BY p.PostTypeId, p.Score;

Note how the ordering of Score resets when you cross a value boundary in PostTypeId?

We can see that in action with other queries, too.

Then People Stare

Here are three queries, and three plans.

SELECT TOP (1000) p.Id, p.PostTypeId, p.Score
FROM dbo.Posts AS p
ORDER BY p.Score; --Score isn't stored in order independently


SELECT TOP (1000) p.Id, p.PostTypeId, p.Score
FROM dbo.Posts AS p
ORDER BY p.PostTypeId; --PostTypeId is the leading column, though


SELECT TOP (1000) p.Id, p.PostTypeId, p.Score
FROM dbo.Posts AS p
ORDER BY p.PostTypeId,
         p.Score; --Score is in order within repeated PostTypeId values

Only that first query, where we try to order by Score independently needs to physically sort data. They all use the same index, but that index doesn’t store Score in perfect ascending order, unless we first order by PostType Id.

In tomorrow’s post, we’ll mix things up a little bit and design an index for a tricky query.

Thanks for reading!

Going Further

Starting SQL: Designing Indexes For SQL Server Queries

Posted on August 17, 2020May 14, 2022 by Erik Darling

Sweet Like

Clustered indexes are great, really. Usually. Okay, if they’re bad it’s probably your fault.

Did you really need 10 columns in the key? Did you have to make it on a NVARCHA5(512)?

No. You messed that up. By that I mean the royal you. All of you.

The thing is, they’re of limited overall value for searching for data.

SQL Server Management Studio Table — searchy baby

If every single join and where clause has this Id column in it, we might be okay. But the odds of that being the case are slim to none.

As soon as we want to search by any other columns without searching for a specific Id, we’re toast. That data doesn’t exist in a helpful order for searching.

I know I’ve mentioned it before, but that’s what indexes do to make data easier to find: they put it in order. Ascending, descending. It’s up to you.

The Meaning Of Life

There are two main parts of a nonclustered index: key columns, and included columns.

Sure, there’s other stuff you can do with them, like make them unique, or add filters (where clauses) to them, but we’ll talk about that later.

For now, feast your eyes on the majesty of the nonclustered index create statement.

Look at all that. Can you believe how much faster is can make your queries?

Let’s talk about how that works.

Those Keys

If you want to visualize stuff key columns can help in a query, you can almost draw a Fibonacci whatever on top of it.

Much of the time, it makes sense to focus on the where clause first.

There will of course be times you’ll need to break from that, but as a general design pattern this is a helpful starting place. The stuff index key columns tend to help is under the from clause. That’s because these parts of the query are most often helped by having data in a pertinent order.

Sometimes things above the from clause can be improved above the from, when there’s an aggregate or windowing function involved, but those considerations are more advanced and specialized.

Inclusion Conclusion

Included columns are helpful for queries, because you can have a single index be the source of data for a query. No need for lookups, and fewer optimizer choices.

But included columns aren’t ordered the way key columns are. They’re kinda like window dressing.

Sure, you can use them to find data, it’s just less efficient without the ordering. You can think of them like all the non-key columns in your clustered index.

Some good uses for includes:

Columns only in the select list
Non-selective predicates
Columns in filter definitions

Includes, though, are the place where I see people go overboard. Thinking back a little, if you’re selecting long lists of columns from wide tables, the optimizer might suggest very wide indexes to compensate for that.

The wider your index definitions are, the higher your chances of modification queries needing to touch them are.

It’s a bit like a game of Battleship. The bigger your indexes get, the more of the board they take up, and the more likely it is you’re gonna get hit by one of those little plastic peg torpedoes.

this is from wikipedia. thanks, wikipedia.

Baby Teeth

We know we need indexes, and now we’ve got a rough idea of which parts of the index can help which part of our query.

Next, we’ll look at some of the deeper intricacies of index design, like the column-to-column dependencies that exist in row store indexes.

Thanks for reading!

Going Further

Starting SQL: How Nonclustered Indexes Work In SQL Server

Posted on August 14, 2020May 14, 2022 by Erik Darling

Facsimile

While clustered indexes or heaps are the table, nonclustered indexes are their own special little individual copies of the table data.

They’re ordered by the key columns that you choose, which should probably differ from your clustered index key column(s). There is, of course, no ordering of included columns.

Nonclustered indexes will also have different statistics, and a different set of underlying pages. If you’ve ever wondered why you can drop or disable a corrupt nonclustered index and recreate it, but you can’t do that with a clustered index, it’s because databases hate you and want you to suffer.

Downsides First

There are, of course, downsides to over-indexing a table. Your clustered index is generally there as a good foundation for everything else, but it can’t help everything.

Adding in nonclustered indexes will give you extra objects to:

Write to and lock when you modify data
Choices for the optimizer when it comes up with query plans
Backup, restore, check for corruption, maintain
Read into memory when you need to read them or write to them

Some caution needs to be exercised when creating indexes, of course. You’ll want to avoid overly wide indexes, and indexes that incorporate the same columns over and over again.

This can be complicated when creating wide tables that lack proper normalization. When you create tables that have many columns with a similar prefix, like “customer_”, it may be a sign that you those columns should be in a separate table with a key to connect it to other tables. Another sign is numbered columns, like “phone_1”, “phone_2” etc.

The more columns you add to a single table, the more trouble you’ll get into trying to index it. Users will want to search those tables in all variety of ways and want to return many different columns. It becomes quite difficult to effectively index a single table like that.

There are exceptions, of course. If you have a reporting table that is purposely denormalized, and has no transactional meaning, you can often afford more indexes being added to it, or even use columnstore indexes to aid reporting queries.

Mitigating Downsides

There are a number of things you can do to mitigate some of the issues you can run into with a lot of indexes, depending on what’s available to your version and edition of SQL Server.

For example, if you want to minimize locking issues, you should add NOLOCK everywhere. No but seriously, don’t do that unless you don’t care at all. A much better option is an optimistic isolation level. It’d be great if SQL Server used one by default, but it’s pointless to kick dust now.

Having good hardware, like enough memory to cache your heavily trafficked data, and write-friendly storage can also help with many issues around writes.

Of course, the indexes you need are going to be a personal issue. Some tables, and workloads, will be able to afford more indexing than others. Putting numbers on these things often takes some digging.

Coming back to the wide tables thing, you may find it difficult to stick to 5 or 10 indexes that have 5 or 10 columns in them without having a static group of queries that touches the table, and forget it if you write anything resembling “SELECT *” from a table like that.

Over Under

Deciding if a table is over-indexed comes down to looking at usage. If your server has been up for a month or longer, and you see a bunch of indexes that are totally unused by read queries, or queries that have way more writes then reads, you might wanna think about ditching those.

But always do this part first, because it’ll give you a more clear picture of what you should do with indexes that are leftover.

Other indexes that are safe to mess with are ones that have the exact same key columns. If they have different included columns, you can create one index to include them all.

Just remember to look for stuff in these index definitions like filters or uniqueness — those things can make indexes look a lot different to the optimizer.

Next we’ll start to talk about designing effective nonclustered indexes for your queries. Because that’s what we design indexes for, right? We don’t just make them up.

Thanks for reading!

Going Further

Starting SQL: Picking Clustered Indexes For Tables

Posted on August 13, 2020May 14, 2022 by Erik Darling

Need’em Got’em

Let’s say you’ve either got a heap, like we talked about yesterday, or you’re plotting a new table.

It’s OLTP, and you want the very best for your transactions. What are some good characteristics of a clustered index?

Here’s an easy to remember acronym!

Narrow: Numbers, dates, not string-ish
Unique: Identities or sequences are good for this
Distinct: If not unique, then as distinct as possible
Ever: Increasing (append only is the goal, here
Static: Definitely not a column you ever update

Follow that, and you’ll most likely never go wrong. Unless you’ve got GUIDs involved, maybe. But at least you’re trying.

Why do we care about those things?

Avoiding Problems

All of those traits help us avoid problems down the line. like shuffling pages in the middle around, or with lopsided data.

The narrow thing is important because, like I’ve mentioned, nonclustered indexes inherit those clustered index key columns. If you assign wide clustered indexes, either via the number of key columns, or via the data types of those key columns, you can end up with quite bloated nonclustered indexes.

The whole thing can be unpleasant.

The important thing to remember is that clustered indexes are a nearly free way to logically order your table data, and have an index that fully covers all of the columns. This is probably a good time to remind you that clustered indexes are not a copy of the table, but the table data itself ordered logically by the key columns.

The Questions

To BIGINT Or Not To BIGINT?

If you’re going to do what most people do and make your clustered index an identity, do yourself a favor and use a BIGINT.

Some people will hem and haw about the 4 extra bytes, but those 4 extra bytes don’t make too much difference until the table gets to be around 2 billion rows. At that point, the pain of altering an INT to a BIGINT can be pretty severe, with some careful planning.

I’ve seen enough people go through heck with it to tell you to avoid the problem from the get go.

But I Need So Many GUIDs

Okay, cool. I understand. GUIDs are hard to run out of. They can be rough if not generated sequentially, and even then… Well, at least they’re unique. Your nonclustered indexes won’t suffer *too* much by inheriting them.

But this is generally where a different design pattern makes sense:

Put a nonclustered primary key on your GUID
Add a clustered index to a more sensible column

But hey, businesses seem to get angry when they find out identities can have gaps in values. GUIDs totally don’t have that problem ?

Are Multi-Keys Okay?

I guess, if you need a couple columns to make a unique combination, or if you have a couple columns you’re always searching on.

Just make sure you really try to keep them narrow. Strings usually aren’t your friend here.

Action Indexes

But you know, once you’ve got a clustered index in place, it just kinda sits there.

The real fun comes in figuring out nonclustered indexes

How many can I have?
How many columns can I put in them?

Let’s talk about those a little next.

Thanks for reading!