Execution Plans Archives | Page 15 of 31

Starting SQL: Why Your SQL Server Query Can’t Go Parallel, Scalar Functions Edition

Posted on September 9, 2020May 17, 2022 by Erik Darling

I Won’t Share You

Yesterday we looked at where table variables can have a surprising! impact on performance. We’ll talk more about them later, because that’s not the only way they can stink. Not by a long shot. Even with 1 row in them.

Anyway, look, today’s post is sort of like yesterday’s post, except I’ve had two more drinks.

What people seem to miss about scalar valued functions is that there’s no distinction between ones that touch data and ones that don’t. That might be some confusion with CLR UDFs, which cause parallelism issues when they access data.

Beans and Beans

What I want to show you in this post is that it doesn’t matter if your scalar functions touch data or not, they’ll still have similar performance implications to the queries that call them.

Now look, this might not always matter. You could just use a UDF to assign a value to a variable, or you could call it in the context of a query that doesn’t do much work anyway. That’s probably fine.

But if you’re reading this and you have a query that’s running slow and calling a UDF, it just might be why.

If the UDF queries table data and is inefficient
If the UDF forces the outer query to run serially

They can be especially difficult on reporting type queries. On top of forcing them to run serially, the functions also run once per row, unlike inline-able constructs.

Granted, this once-per-row thing is worse for UDFs that touch data, because they’re more likely to encounter the slings and arrows of relational data. The reads could be blocked, or the query in the function body could be inefficient for a dozen reasons. Or whatever.

I’m Not Touching You

Here’s a function that doesn’t touch anything at all.

CREATE OR ALTER FUNCTION dbo.little_function (@UserId INT)
RETURNS BIGINT
WITH SCHEMABINDING, 
     RETURNS NULL ON NULL INPUT
AS
BEGIN
DECLARE @d DATETIME = GETDATE();
RETURN
(
    ( 
      SELECT @UserId
    )
)
END
GO

I have the declared variable in there set to GETDATE() to disable UDF inlining in SQL Server 2019.

Yes, I know there’s a function definition to do the same thing, but I want you to see just how fragile a feature it is right now. Again, I love where it’s going, but it can’t solve every single UDF problem.

Anyway, back to the story! Let’s call that function that doesn’t do anything in our query.

SELECT TOP (1000) 
    c.Id,
    dbo.little_function(c.UserId)
FROM dbo.Comments AS c
ORDER BY c.Score DESC;

The query plan looks like so, with the warning in properties about not being able to generate a valid parallel plan.

In this plan, we see the same slowdown as the insert to the table variable. There’s no significant overhead from the function, it’s just slower in this case because the query is forced to run serially by the function.

This is because of the presence of a scalar UDF, which can’t be inlined in 2019. The serial plan represents, again, a significant slowdown over the parallel plan.

Bu-bu-bu-but wait it gets worse

Let’s look at a worse function.

CREATE OR ALTER FUNCTION dbo.big_function (@UserId INT)
RETURNS BIGINT
WITH SCHEMABINDING, 
     RETURNS NULL ON NULL INPUT
AS
BEGIN
DECLARE @d DATETIME = GETDATE();
RETURN
(
    ( 
      SELECT SUM(p.Score)
      FROM dbo.Posts AS p
      WHERE p.OwnerUserId = @UserId
    ) - 
    (
      SELECT SUM(c.Score)
      FROM dbo.Comments AS c
      WHERE c.UserId = @UserId
    )
)
END
GO

Not worse because it’s a different kind of function, just worse because it goes out and touches tables that don’t have any helpful indexes.

Getting to the point, if there were helpful indexes on the tables referenced in the function, performance wouldn’t behave as terribly. I’m intentionally leaving it without indexes to show you a couple funny things though.

Because this will run a very long time with a top 1000, I’m gonna shorten it to a top 1.

SELECT TOP (1) 
    c.Id,
    dbo.big_function(c.UserId)
FROM dbo.Comments AS c
ORDER BY c.Score DESC;

Notice that in this plan, the compute scalar takes up a more significant portion of query execution time. We don’t see what the compute scalar does, or what the function itself does in the actual query plan.

The compute scalar operator is what’s responsible for the scalar UDF being executed. In this case, it’s just once. If I had a top that asked for more than one row, It would be responsible for more executions.

We don’t see the function’s query plan in the actual query, because it could generate a different query plan on each execution. Would you really want to see 1000 different query plans?

Anyway, it’s quite easy to observe with operator times where time is spent here. Most people read query plans from right to left, and that’s not wrong.

In that same spirit, we can add operator times up going from right to left. Each operator not only account for its own time, but for the time of all operators that come before it.

The clustered index scan takes 7.5 seconds, the Sort takes 3.3 seconds, and the compute scalar takes 24.9 seconds. Wee.

Step Inside

If you get an actual plan for this query, you won’t see what the function does. If you get an estimated plan, you can get a picture of what the function is up to.

This is what I meant by the function body being allowed to go parallel. This may lead to additional confusion when the calling query accrues parallel query waits but shows no parallel operators, and has a warning that a parallel plan couldn’t be generated.

It’s Not As Funny As It Sounds

If you look at a query plan’s properties and see a non-parallel plan reason, table variable modifications and scalar UDFs will be the most typical cause. They may not always be the cause of your query’s performance issues, and there are certainly many other local factors to consider.

It’s all a bit like a game of Clue. You might find the same body in the same room with the same bashed in head, but different people and blunt instruments may have caused the final trauma.

Morbid a bit, sure, but if query tuning were always a paint by numbers, no one would stay interested.

Anyway.

In the next posts? we’ll look at when SQL Server tells you it needs an index, and when it doesn’t.

Thanks for reading!

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Starting SQL: Query Memory Grants In SQL Server Execution Plans

Posted on September 7, 2020May 14, 2022 by Erik Darling

Do You Believe?

Memory is so very important to SQL Server performance. It doesn’t solve every problem, but you sure as heck feel it when there’s not enough. Like I said before, memory is a shared resource.

Your buffer pool needs it to cache data pages, and your queries need it for operators that need extra space to write stuff down.

If you constantly have queries and the buffer pool fighting over memory, sure, you might just need more memory. You might also need to grapple with those memory grants in a different way.

You & Me

To recap earlier posts, the most common things you’ll see asking for memory grants in a query plan are Sorts and Hashes. There is one form of optimized Nested Loops that’ll ask for memory, but I don’t see people running into problems with those all too often.

Let’s start with a simple query that will ask for a memory grant to sort data.

SELECT *
FROM 
     (  
        SELECT TOP (1000) 
                 u.Id          -- 166MB (INT)
               , u.DisplayName -- 300MB (NVARCHAR 40)
               , u.WebsiteUrl  -- 900MB (NVARCHAR 200)
               , u.Location    -- 1.3GB (NVARCHAR 100)
               , u.AboutMe     -- 9.4GB (NVARCHAR MAX)
        FROM dbo.Users AS u
        ORDER BY u.Reputation
     ) AS u
OPTION(MAXDOP 1, RECOMPILE);

Now, in a previous post I explained how the optimizer can make some weird guesses about memory grant needs when you get strings involved — not ordering by strings, just selecting them.

The guess made is that string columns will be half full. That can work out okay if your string-column-fullness hovers around the middle ground.

But if it hovers towards the high or low end of how you’ve defined the length of your string columns, you can end up with some crappy memory need estimates.

In the last post, we looked at how much the optimizer would ask for. Now let’s look at how much actually gets used.

Plantains

For the full query, the memory grant is 9.4GB, but only 380MB gets used. That seems bad to be.

Now, to be fair, the nice people at Microsoft are attempting to right these wrongs. In SQL Server 2017 we got Memory Grant Feedback, but it was only available for queries with Eau De Batch Mode on.

In SQL Server 2019, it’s available to Row Mode plans. Of course, that’s only in Enterprise Edition. If you’re on Standard Edition, you’re screwed :^)

So for all you screwed folks out there, which is most of you, you might have queries constantly asking for way more memory than they need. In this case, the query asks for about 9.1GB of memory more than it needs.

“Just Select Less”

Well, okay. What if we don’t select that MAX column? All your tables are well-designed, no one has used longer string columns than they should have, and none of your queries select a bunch of columns they don’t need.

SELECT *
FROM 
     (  
        SELECT TOP (1000) 
                 u.Id          -- 166MB (INT)
               , u.DisplayName -- 300MB (NVARCHAR 40)
               , u.WebsiteUrl  -- 900MB (NVARCHAR 200)
               , u.Location    -- 1.3GB (NVARCHAR 100)
        FROM dbo.Users AS u
        ORDER BY u.Reputation
     ) AS u
OPTION(MAXDOP 1, RECOMPILE);

We’ll still ask for 1.3GB of memory, and only use a fraction of it.

What’s interesting is that in total we only use about 100MB less memory without the AboutMe column in there, despite asking for about 8GB less memory.

Reducing Memory Grants

There are a number of ways to help mitigate queries asking for too much memory, but they all depend on what the root cause of the issue is.

You might have overly large columns, or queries that do something like select * — you might be able to fix that by rewriting queries. Given enough control, you should resize columns to be a more appropriate size, too.
You might be dealing with parameter sniffing, which has a whole bunch of different approaches which will require their own decades of research, ha ha ha.
You might be missing an opportune index that could help you avoid a sort, or help the optimizer choose a non-hash join or aggregate.
You might be asking queries to put data in order when it’s not necessary
You might be using some unfortunate “feature” that messes up cardinality estimation
You might be dealing with out of date statistics
You might need to apply a query hint to control the size of the grant
You might need to use Resource Governor to control the size of memory grants

Of course, all this effort depends on how severe your problems are.

A good way to judge if queries are trampling on each other is to look at resource semaphore waits. That’s queries waiting to get memory to run.

Another might be to look at PAGEIOLATCH_XX waits to see how much time you’re spending reading pages from disk. It’s possible that excessive query memory grants are forcing data out of the buffer pool.

Thanks for reading!

Going Further

Starting SQL: Things SQL Server Query Plans Might Tell You

Posted on September 4, 2020May 14, 2022 by Erik Darling

The Fox

Interpreting query plans and problems has gotten a lot easier over the years. Especially with more actuals getting added to actual plans. A lot of the ambiguity about where time and effort was spent in a query plan is gone.

Taking the guesswork out of performance tuning is awesome! The faster I can tune a query for someone, the better. And boy howdy, sometimes I get some awfully big query plans to look at.

But once you know what was slow, what do you do with it? How do you make it not slow?

The Hound

There are the “obvious” things, which we’ve talked about earlier in the series. They’re a good place to start.

There’s other stuff too, which we’ll talk about in detail next week. Normally I’d feel like a broken record, but having all of my fundamentals stuff in one set of posts seems like a reasonable idea. I think that’s the point of all this, anyway.

It’s stuff that isn’t surprising, if you’ve been around the block a little tuning queries. But T-SQL anti-patterns are like a bad haircut. Sometimes you don’t believe you have one until someone tells you.

A lot of the time I spend with clients is showing them how seemingly harmless choices are wrecking performance. Someone a long time ago read about this ~great new feature~ and used it everywhere and now the whole show is slumping towards catastrophe.

Thousand Dollars

This is another place where actual plans tell you way more than estimated or cached plans, but if you know enough about what to look for in a query plan you can find The Usual Suspects for performance issues.

You see, there are some things that nearly guarantee you’ll get one of those “bad estimates” that people blame things on before rebuilding every index in sight.

But all the rebuilt indexes in the world won’t fix these anti-patterns, because most performance problems aren’t because your data pages are a little out of order.

So what might your query plans tell you?

If you got a huge memory grant
If your query was forced to run serially
If you have missing indexes
If you have missing-missing indexes
If you have very sneaky missing indexes
If there was an icky function in your query
If there’s a table variable hanging about
If there’s implicit conversions
If there’s inappropriate scans

And that’s just the estimated plan! With actual plans, you get even more detail. On top of stuff we’ve already talked about, like operator times and wait stats, you can also see if your memory grant got used, spills, and other important details.

Thanks for reading!

Going Further

Starting SQL: Things SQL Server Query Plans Can’t Tell You (But Should)

Posted on September 3, 2020May 14, 2022 by Erik Darling

Hidey-Hole

When you’re looking at the execution plan for a query that someone is griping about, one thing to keep in mind is that it’s just part of the picture. There’s a lot of things that a query plan can’t tell you about the state of the server surrounding the query.

Query plans are, rightly, quite focused on the query at hand. If every query plan sucked in a bunch of other information, it’d be a nightmare.

I do think that actual execution plans should have blocking warnings, in the same way that we have spill warnings on some operators. Or at the very least, blocking waits should register in the wait stats that query plans collect.

Action Plan

We know we can’t see blocking in execution plans, which wouldn’t make a lot of sense at all to show in cached plans. If they’re only blocked sometimes, the value of that kind of feedback diminishes.

You can sometimes figure that out by comparing CPU and duration — if CPU is far less than duration, it could be a blocking problem.

Just like blocking, it’s nearly impossible to tell if a query was slow because the server was all manner of fiery catastrophe just by looking at the execution plan.

A lot of people will jump right to parameter sniffing when a query is sometimes slow, completely discounting other things that go bump in the night. While query plans can tell us a great story, they often can’t tell us a full story.

Going back to something we talked about much earlier in the series: We need to make sure we have the right information to solve the right problem. This is, again, where having a good monitoring tool is indispensable.

What It Is, It Is

We already spent a few days talking about some wait stats, and some things you might be able to do to compensate for them.

Any sporadic performance issue can be a tough nut to crack, and not every symptom is going to be obvious in the places you might first look.

They might not show up as in wait stats, because they’re not happening constantly. They might not show up in query plans because it’s not the fault of the query.

It’s an important thing to keep in mind, though, to not get too rabbit-holed looking at the wrong thing.

Thanks for reading!

Going Further

Starting SQL: Actual Plans Are Important For SQL Server Performance Tuning

Posted on September 2, 2020May 14, 2022 by Erik Darling

No, Really

Estimated plans are like giving someone driving directions. You might know an awesome way to get somewhere, but you don’t know that a truck full of beer crashed into a truck full of monkeys. And though there were no fatalities, the street scene is not one that encourages forward progress.

That’s a lot like what happens to the optimizer. Sometimes it comes up with a good route for Tuesday at noon, but now you gotta make the trip Friday at 5:30. Actual plans can help you identify where things backfired in a way that cached and estimated plans usually can’t.

Really, No

Here’s an example I love using in my training classes. If you saw this in a code review, you might just skip right past it.

SELECT TOP (10)  
    u.DisplayName,
    p.*
FROM dbo.Posts AS p
JOIN dbo.Users AS u
    ON p.OwnerUserId = u.Id
WHERE p.ParentId = @ParentId
ORDER BY u.Reputation DESC;

But queries like this terrify experienced query tuners:

Do we really need everything from Posts?
Do we have a good index on Posts for the where clause and join?
Is the ParentId column a highly skewed data set?
Is there an index on Users to avoid needing to Sort by Reputation?

Under the right circumstances, even the simplest queries can backfire. This is a great example of that.

Indecent Deeds

Query plans can spend a lot of time lying to you. It takes a while to develop an eye for those lies, and figuring out where things went wrong in estimated or cached plans. Actual plans will tell you the truth about most things.

Of course, getting cached or estimated plans is often better than not getting anything at all. But imagine someone sending you this and saying it’s really slow. What would you say? What would you ask for next?

here to baffle you

All you have are estimates. Seeks and estimates.

Go Crazy

You could even get the parameter from the properties, and see if the estimate lines up.

SQL Server Query — maybe we don’t need to update stats.

Eyeballs

Different types of plans provide different levels of information. Identifying bad plan choices in Estimated plans is certainly more difficult, but not impossible with enough practice.

Understanding those poor choices and fixing them is what we’re here to do, of course.

The more estimated plans you look at, the more suspicious you become of every operator and choice, and realize that operator costs are often not well-aligned with reality.

Many query tuning efforts start with Estimated plans, and end with Actual plans.

Why? Because we get lied to enough.

Thanks for reading!

Going Further

Starting SQL: What’s So Great About SQL Server Query Plans, Anyway?

Posted on September 1, 2020May 17, 2022 by Erik Darling

In The Dark

All query plans are based on estimates. The best we can hope for is that the estimates were good enough to get a good plan.

When you get an “actual plan”, all of the estimated metrics remain, but you also get the runtime metrics. Often query tuning is checking that those things lined up reasonably well.

As of now, there’s nothing that updates cached plan estimates after execution; there’s no runtime information added to them. Also notable is are the estimated metrics that don’t have “actual” counterparts, like operator costs.

Live Nude Plans

Microsoft has been adding some cool stuff to query plans, but actual plans are the recipients of most of it.

Wait stats
I/O Stats
Memory Grants
Spills
Thread usage
Operator Times
Statistics used
Row goals

What a list! You can do a whole lot of figuring out with a list like that accompanying such pretty drawings of what your queries did. And yet, we often need to get metrics about queries from other sources.

Jingling

SQL Server will keep lots of metrics about cached plans around to help you figure out where queries fall on the pain-causing scale. But the plan cache is an unreliable place, at times.

The plan cache can get wiped out for many reasons, like memory pressure, settings changes, running naughty DBCC commands, and of course restarting SQL Server.

The plan cache can also be difficult to interpret, because of recompilations and compilations. Recompilations kind of cover up a query’s trail, and compilations mean you could have the same query plan over and over again, but it’s not getting reused.

Both of those situations can make getting a clear picture of what a query has been doing over time.

Enter the Query Store. Though not a replacement for the plan cache, Query Store can be a quite helpful tool for finding performance issues. Especially on SQL Server 2017+ where high-level wait stats are also logged. Being able to tie queries to waits is pretty awesome.

Rubber, Glue, Flaming Pants

Another problem with execution plans is how they can lie to you, both by obfuscation and omission.

With all the detail involved in an execution plan, there are still some key data points missing from the estimated versions. That’s why we need stuff like DMVs and Query Store to act as additional sources of truth for what a query has been up to.

And even with all that, understanding query performance issues can still be tough.

Let’s spend some time learning how to figure out where your queries, and query plans, are going wrong.

Thanks for reading!

Going Further

Starting SQL: How To Choose Key Column Order In SQL Server Indexes

Posted on August 19, 2020May 14, 2022 by Erik Darling

Found A Picture Of You

The way that many people will tell you to design indexes is something along the lines of:

Equality predicates first =
Inequality predicates next >, >=, <, <=, <>
This point intentionally left blank
Put all your select list columns as includes

Well, okay. That probably works with one of those tiny example databases that people keep coming up with creative ways to make bigger.

Real life is often larger than 100MB, though. And besides, real life queries are far more complicated.

Familiar

We don’t have to get too complicated to explose some of the issues with those rules, as queries go a little beyond the where clause.

SELECT TOP (5000) 
    v.Id,
    v.PostId,
    v.UserId,
    v.BountyAmount,
    v.VoteTypeId,
    v.CreationDate 
FROM dbo.Votes AS v
WHERE v.VoteTypeId = ?
AND   v.CreationDate >= ?
ORDER BY v.BountyAmount DESC;

If we were to faithfully follow the Rules of Indexing™, we might end up with an index like this:

CREATE INDEX an_attempt_was_made
    ON dbo.Votes (VoteTypeId, CreationDate)
    INCLUDE(BountyAmount, PostId, UserId);

Which, aside from Include column order (which we know doesn’t matter), is about what SQL Server’s missing index recommendation would be.

SQL Server Missing Index Request — pedestrian blues

Could it be that conventional index design wisdom is based on a faulty algorithm?

Could be. Could be.

Strange

But what happens when we add the index, according to ancient index law, and run queries with different values that have different selectivities?

SELECT TOP (5000) 
    v.Id,
    v.PostId,
    v.UserId,
    v.BountyAmount,
    v.VoteTypeId,
    v.CreationDate 
FROM dbo.Votes AS v
WHERE v.VoteTypeId = 7 --Not a lot of these
AND   v.CreationDate >= '20080101'
ORDER BY v.BountyAmount DESC;


SELECT TOP (5000) 
    v.Id,
    v.PostId,
    v.UserId,
    v.BountyAmount,
    v.VoteTypeId,
    v.CreationDate 
FROM dbo.Votes AS v
WHERE v.VoteTypeId = 2 --Yes a lot of these
AND   v.CreationDate >= '20080101'
ORDER BY v.BountyAmount DESC;

This isn’t parameter sniffing, because we used literal values, and got different executions plans. They even have correct estimates.

Both queries used our index, too. It did okay for a small amount of data, but for a larger amount of data, we got totally sunk.

Order: Equality vs Inequality Predicates

We’ve talked about how indexes make searching easier by putting data in order, and how each key column in an index has a dependency for that order on the column before it. Obviously having the BountyAmount column as an include isn’t going to be helpful, because those columns aren’t stored in any order.

But there’s an important difference in where we put the BountyAmount column in the key of the index. For example, if we change it to look like this:

CREATE INDEX an_attempt_was_made
    ON dbo.Votes (VoteTypeId, CreationDate, BountyAmount)
    INCLUDE(PostId, UserId) WITH(DROP_EXISTING = ON);

We’re not going to do any better. Why? The predicate on CreationDate is an inequality. BountyAmount is only in order within groups of duplicate dates.

Once we cross a date boundary, the order resets, just like we talked about in yesterday’s post. If we want to tune this index to make this query fast for large or small predicates on VoteTypeId, we need to change our index to look like this:

CREATE INDEX an_attempt_was_made
    ON dbo.Votes (VoteTypeId, BountyAmount, CreationDate)
    INCLUDE(PostId, UserId) WITH(DROP_EXISTING = ON);

Rules Of The Road

For index key columns, data stored in order can make searching, joining, grouping, and ordering in queries a lot more efficient. That’s probably just the sort of thing you’re after when query tuning. And index tuning is, after all, part of query tuning.

But I bet you’re asking: why focus so much on Sorts? Tune in tomorrow to find out why!

Thanks for reading!

Going Further

Starting SQL: How SQL Server Stores Data In Nonclustered Indexes

Posted on August 18, 2020May 14, 2022 by Erik Darling

But What Happens When…

People may tell you to always put the most selective column first in a query, but selectivity depends on more than just what values are in a column.

It also depends on how columns are searched, doesn’t it? If people are using inequalities, like >, >=, < , <= then having a totally unique value on every row becomes a bit less helpful.

Likewise, if people can search IN() or NOT IN, NULL or NOT NULL, or even if perhaps the data in a column is only selective for some values, then selectivity can be a whole lot less selective.

Beyond that, it ignores a whole world of considerations around how you’re grouping or ordering data, if your query is a top (n) with an order by, and more.

Before we go jumping off on such wild adventures, let’s talk a little bit about multi-key indexes. It’s easy enough to visualize a single column index putting data in order, but multi-key indexes present a slightly different picture.

Janitorial

Single-column clustered indexes make a lot of sense. Single column nonclustered indexes often make less sense.

It’s sort of like the difference between a chef knife and a Swiss Army knife. You want one to be really good at one specific task, and another to be pretty useful to a bunch of tasks.

Will a Swiss Army knife be the best wine opener you’ve ever owned? No, but it’s a whole lot easier than trying to get a cork out with a cleaver, and it can also be a screwdriver, a pair of scissors, and open a beer bottle for your less industrious friends who can’t quite muster the strength to cope with a twist-off cap.

That multi-tool ability comes at a bit of a cost, too. There’s no such thing as a free index column (unless the table is read only).

All those columns have to be maintained when you modify table data, of course.

And there’s another thing: every key column in the index is dependent on the column that comes before it. Rather than try to word-problem this for you, let’s just look at some demos.

Withdrawals

Let’s say we’ve got this index which, albeit simple, is at least more than a single column. Congratulations, you’ve graduated.

CREATE INDEX joan_jett
ON dbo.Posts
(
    PostTypeId, Score
);

If we write queries like this, we’ll be able to use it pretty efficiently.

SELECT p.Id, p.PostTypeId, p.Score
FROM dbo.Posts AS p
WHERE p.PostTypeId = 7
AND   p.Score = 1;

I’m skipping over a little bit now, because data is mightily skewed in the PostTypeId column towards a couple of quite-common values. I’ll get to it, though.

For now, marvel at the simplicity and Seekiness of this plan.

Now let’s try to find data in the Score column without also searching on the PostTypeId column.

SELECT p.Id, p.PostTypeId, p.Score
FROM dbo.Posts AS p
WHERE p.Score = 999;

A couple things changed, here. We had to scan through the index to find Scores we’re interested in, and the optimizer thought that this would be a process-intensive enough task to use multiple CPU cores to do it.

Okay then.

Age Of Reason

If you’ve been kicking around databases for a little bit, you may have read about this before, or even seen it in action when writing queries and creating indexes.

What I’d like to do is try to offer an explanation of why that happens the way it does: Columns within an index are not ordered independently.

In other words, you don’t have all of the PostTypeIds in ascending order, and then all of the Scores in ascending order. You do have all the PostTypeIds in ascending order, because it’s the leading column, but Scores are only in ascending order after PostTypeId.

A simple query gets illustrative enough results.

SELECT p.PostTypeId, p.Score
FROM dbo.Posts AS p
WHERE p.PostTypeId IN (1, 2)
AND   p.Score BETWEEN 1950 AND 2000
ORDER BY p.PostTypeId, p.Score;

Note how the ordering of Score resets when you cross a value boundary in PostTypeId?

We can see that in action with other queries, too.

Then People Stare

Here are three queries, and three plans.

SELECT TOP (1000) p.Id, p.PostTypeId, p.Score
FROM dbo.Posts AS p
ORDER BY p.Score; --Score isn't stored in order independently


SELECT TOP (1000) p.Id, p.PostTypeId, p.Score
FROM dbo.Posts AS p
ORDER BY p.PostTypeId; --PostTypeId is the leading column, though


SELECT TOP (1000) p.Id, p.PostTypeId, p.Score
FROM dbo.Posts AS p
ORDER BY p.PostTypeId,
         p.Score; --Score is in order within repeated PostTypeId values

Only that first query, where we try to order by Score independently needs to physically sort data. They all use the same index, but that index doesn’t store Score in perfect ascending order, unless we first order by PostType Id.

In tomorrow’s post, we’ll mix things up a little bit and design an index for a tricky query.

Thanks for reading!

Going Further

Starting SQL: Designing Indexes For SQL Server Queries

Posted on August 17, 2020May 14, 2022 by Erik Darling

Sweet Like

Clustered indexes are great, really. Usually. Okay, if they’re bad it’s probably your fault.

Did you really need 10 columns in the key? Did you have to make it on a NVARCHA5(512)?

No. You messed that up. By that I mean the royal you. All of you.

The thing is, they’re of limited overall value for searching for data.

SQL Server Management Studio Table — searchy baby

If every single join and where clause has this Id column in it, we might be okay. But the odds of that being the case are slim to none.

As soon as we want to search by any other columns without searching for a specific Id, we’re toast. That data doesn’t exist in a helpful order for searching.

I know I’ve mentioned it before, but that’s what indexes do to make data easier to find: they put it in order. Ascending, descending. It’s up to you.

The Meaning Of Life

There are two main parts of a nonclustered index: key columns, and included columns.

Sure, there’s other stuff you can do with them, like make them unique, or add filters (where clauses) to them, but we’ll talk about that later.

For now, feast your eyes on the majesty of the nonclustered index create statement.

Look at all that. Can you believe how much faster is can make your queries?

Let’s talk about how that works.

Those Keys

If you want to visualize stuff key columns can help in a query, you can almost draw a Fibonacci whatever on top of it.

Much of the time, it makes sense to focus on the where clause first.

There will of course be times you’ll need to break from that, but as a general design pattern this is a helpful starting place. The stuff index key columns tend to help is under the from clause. That’s because these parts of the query are most often helped by having data in a pertinent order.

Sometimes things above the from clause can be improved above the from, when there’s an aggregate or windowing function involved, but those considerations are more advanced and specialized.

Inclusion Conclusion

Included columns are helpful for queries, because you can have a single index be the source of data for a query. No need for lookups, and fewer optimizer choices.

But included columns aren’t ordered the way key columns are. They’re kinda like window dressing.

Sure, you can use them to find data, it’s just less efficient without the ordering. You can think of them like all the non-key columns in your clustered index.

Some good uses for includes:

Columns only in the select list
Non-selective predicates
Columns in filter definitions

Includes, though, are the place where I see people go overboard. Thinking back a little, if you’re selecting long lists of columns from wide tables, the optimizer might suggest very wide indexes to compensate for that.

The wider your index definitions are, the higher your chances of modification queries needing to touch them are.

It’s a bit like a game of Battleship. The bigger your indexes get, the more of the board they take up, and the more likely it is you’re gonna get hit by one of those little plastic peg torpedoes.

this is from wikipedia. thanks, wikipedia.

Baby Teeth

We know we need indexes, and now we’ve got a rough idea of which parts of the index can help which part of our query.

Next, we’ll look at some of the deeper intricacies of index design, like the column-to-column dependencies that exist in row store indexes.

Thanks for reading!

Going Further

Starting SQL: A Little TOP Can Have A Big Sort In SQL Server Query Plans

Posted on August 11, 2020May 14, 2022 by Erik Darling

Shortcuts

To illustrate this problem a little bit better, I need to use a slightly different query.

While a tidy group by was nice to illustrate some of the simpler points about lookups, it overly complicates things when we want to involve other columns.

The second we need anything else in the select or order by portions of a query, we need to apply aggregates to them, or include them in the group by.

And you see, once you set up a query to return the TOP N rows, there’s an expectation that users get to choose the order they start seeing rows in. As long as we stick to columns whose ordering is supported by an index, things will be pretty stable.

Once we go outside that, a TOP can be rough on a query.

Order By CreationDate

Even if we order by CreationDate in descending order, with the index created in ascending order, we don’t need an explicit sort operation to put data in order.

SELECT TOP (1000)
       p.CreationDate,
       p.OwnerUserId,
       p.Score
FROM dbo.Posts AS p
WHERE p.CreationDate >= '20131015'
ORDER BY p.CreationDate DESC;

There are some additional possibilities for this kind of thing with multi-key indexes that we’ll talk about later in the series, but for now this is a good enough illustration of indexes putting data in order.

Order By Score

Let’s ask for data in a different order now. Score is a convenient villain, because there are lots of times when you might want to see things by a highest whatever metric is commendable to be high. Like not blood pressure, probably.

SELECT TOP (1000)
       p.CreationDate,
       p.OwnerUserId,
       p.Score
FROM dbo.Posts AS p
WHERE p.CreationDate >= '20131015'
ORDER BY p.Score DESC;

Without an index that has Score in the key, we need to physically put the data in order to fit the requirements of the query. Note that in this case, the optimizer no longer sees any benefit to using our nonclustered index.

Why Do We Care About Sorts?

Without jumping too far ahead, Sorts need extra memory to run.

You know, that stuff you cache data in, and you don’t have enough of already?

Yeah, that memory.

The same memory that gives you a conniption every time PLE fluctuates.

How much they ask for is going to depend on:

How much memory you have
What max server memory is set to
The size of the data you need to sort

If we aren’t able to get enough memory, or if we don’t ask for enough up front, data could spill from memory to disk.

When spills get big enough, they can sometimes cause performance issues.

This is a good time to start talking a little bit more about indexes, so we can understand more about how they work and help us solve problems.

Thanks for reading!