Starting SQL: Compensating For Memory Waits In SQL Server

Scrabble Words


You spend most of your life waiting. Sometimes it’s not bad. Even a long wait can be okay if you only have to do it once.

Once I waited an hour for a lobster roll. Once.

But let’s face it — it’s a lot different waiting for something you really want. When you’re waiting for a query to finish, it’s probably just to do your job.

That probably isn’t fun, like waiting all year to get a glass of Port Ellen in Seattle only to have PASS get moved to Houston and then have a pandemic break out.

Or something.

Semaphored Kind Of Life


Memory is precious, that’s why VM admins give you so little of it. SQL Server also thinks memory is precious, and will only give certain things so much of it. For instance, it’ll only give Windows a tiny bit, ha ha ha.

Just kidding. Sort of. Set Max Server Memory responsibly, though. Especially if you lock pages in memory.
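
If you'd rather check or change that setting with T-SQL than click around in SSMS, it's a couple of sp_configure calls. The number below is a made-up example for a box with 128 GB of RAM; plug in something sensible for yours.

EXEC sys.sp_configure 'show advanced options', 1;
RECONFIGURE;

/* Example value only: leave enough behind for the OS and anything else on the box */
EXEC sys.sp_configure 'max server memory (MB)', 110000;
RECONFIGURE;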

SQL Server will dedicate memory to two main areas:

  • Caching data in the buffer pool
  • Temporarily granting to queries to use as scratch space

I posted earlier in the series about what queries use memory for, but to recap:

  • Building hash tables
  • Sorting data
  • Optimized nested loops
  • Column store inserts

The most common uses will be hashing and sorting, but you should be aware of the others.

RAM, Limited


It’s “fairly” easy to figure out how big your buffer pool needs to be to deal with your most commonly used data. The real wild card is figuring out query memory grants.

  • SQL Server might make a bad guess for a memory grant
  • It might happen because you’ve picked unforgivably large sizes for string columns
  • It could make a good guess for one parameter that’s bad for another parameter
  • Or it might just be a bad guess all around, even if you picked good sizes for string columns

And any of this could happen at any time, because databases are completely insane places.

Will There Be Enough Room?


When memory grants go wild, either because a small number of queries start getting huge grants, or because a swarm of queries all get small to medium sized grants, they start stealing space from the buffer pool.

Sure, this sucks because as data gets pushed out, you might have to re-read it from disk. If your disks aren’t awful, you might not notice too much strain. Things’ll be a little less snappy, but it might not be the end of the world.

It’s not like you “run out” of buffer pool space, it just turns into a weird gas-station-sushi montage of things going out as fast as they go in. You absolutely can run out of memory grant space, though. And that, my friends, is when the real waiting can happen.

The wait that shows up when you run out of memory to give to queries is called RESOURCE_SEMAPHORE. I demo that happening in a video here. At that link, you’ll also find an easy to run script to help you figure out which queries might be causing those problems.
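
If you'd rather not click through right now, here's a rough sketch of the kind of check that script does, peeking at sys.dm_exec_query_memory_grants to see which queries are waiting on a grant and which ones got huge grants:

SELECT
    deqmg.session_id,
    deqmg.request_time,
    deqmg.grant_time, /* NULL here means the query is still waiting on its grant */
    deqmg.requested_memory_kb,
    deqmg.granted_memory_kb,
    deqmg.wait_time_ms,
    dest.text AS query_text
FROM sys.dm_exec_query_memory_grants AS deqmg
CROSS APPLY sys.dm_exec_sql_text(deqmg.sql_handle) AS dest
ORDER BY deqmg.requested_memory_kb DESC;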

Rock n’ Roll, Deal With It


There are some interesting ways to fight inflated memory grants. You do need to strike a balance between memory grant sizes and potential spill sizes, but that’s well beyond a blog post. That also assumes that memory grants can’t be properly addressed via other means.

You may run into this stuff for reasons that will sound eerily familiar if you’ve worked on other query fixes.

  • Are you writing crappy queries that make the optimizer make bad guesses?
  • Do you need to add an index to fix a sort?
  • Does adding a better index get you away from a hash join or aggregate?
  • Does separating data into relational and informational pieces help?

After that, you’re looking at more heavy-handed fixes, like Resource Governor, or query hints.

If you have a good alternative plan, you may be able to force it with a plan guide, or using Query Store. But that assumes you have a good alternative plan that’s good for a wide range of parameters. And that’s a big assumption.

What’s a good place to find information about memory grants? Well, we got those there query plan things, and maybe we should start poking around those.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Starting SQL: Compensating For Disk Waits In SQL Server

Repairer Of Reputations


CPU waits are relatively easy to figure out once you wrap your head around the settings that contribute to them, and the queries that cause them. There’s a pretty direct correlation between parallelism, thread usage, and CPU usage.

Compensating for disk waits is a bit of a different game, because there’s a bit to consider from a few different angles. But first, let’s distinguish a little bit.

Waits that commonly crop up when you’re waiting on disk:

  • PAGEIOLATCH_*
  • WRITELOG

When people hear they’re waiting a lot on disk, their first inclination might be that they need faster disks. For WRITELOG waits, that can definitely be a factor. For PAGEIOLATCH waits, it probably shouldn’t be your first move.

Relatively Speaking


When SQL Server hits PAGEIOLATCH waits, it’s a signal that queries needed to read pages from disk into memory. If you just rebooted, this is inevitable. You start with a totally cold buffer cache.

But if you have enough memory, you’re not likely to see queries consistently waiting on it. Why? Because if data you need is already in memory, that’s where you go get it from. Why go to disk if you don’t have to? It’s icky out there.
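
If you're curious what's actually living in your buffer pool, here's a minimal sketch that counts cached pages per database. Fair warning: it can take a while on servers with a lot of memory.

SELECT
    DB_NAME(dobd.database_id) AS database_name,
    COUNT_BIG(*) * 8 / 1024 AS cached_mb /* data pages are 8 KB each */
FROM sys.dm_os_buffer_descriptors AS dobd
GROUP BY dobd.database_id
ORDER BY cached_mb DESC;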

If you really want to compensate for this wait, you’re going to need to think about a few things, like

  • How much memory you have, and how much memory you’re allowed (non-Enterprise versions have limits)
  • How much data you have, and how many indexes you have

Let’s say end user queries are consistently waiting on reading from disk. It doesn’t matter much if the wait is fast or slow, what matters is that the data isn’t in memory. Sure, it matters more if the waits are slow, but the first question is memory.

  • Do you have enough?
  • Can you add more?
  • Would what you have be enough if you had fewer indexes? (Unused/Duplicative; a rough check is sketched below)
  • Would what you have be enough if you had less data? (Purging/Archiving)
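
Here's a minimal sketch of that "fewer indexes" check, using sys.dm_db_index_usage_stats in the current database. These counters reset when SQL Server restarts, so don't go dropping anything based on one quick look.

SELECT
    OBJECT_NAME(i.object_id) AS table_name,
    i.name AS index_name,
    ius.user_updates AS writes,
    ISNULL(ius.user_seeks + ius.user_scans + ius.user_lookups, 0) AS reads
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS ius
    ON  ius.object_id = i.object_id
    AND ius.index_id = i.index_id
    AND ius.database_id = DB_ID()
WHERE i.index_id > 1 /* nonclustered indexes only */
AND   OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
AND   ISNULL(ius.user_seeks + ius.user_scans + ius.user_lookups, 0) = 0 /* no reads recorded since startup */
ORDER BY ius.user_updates DESC;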

Judgement Night


Getting memory right is so crucial because of how much it’s responsible for.

Aside from caching all those thoughtfully crafted data pages, queries need it to sort and hash data, and there are all sorts of other lower level caches that rely on it. The plan cache is probably the most obvious.

Once you realize that memory is a shared resource, you treat it a whole lot differently. Especially if you know just how much memory some things can take.

To learn more, check out my videos.

Costing


Yeah, memory is cheap. Unless you need so much that your next step is going to Enterprise Edition.

But there’s an intermediate step in the mix that not many people talk about. You can have 2-3 Standard Edition boxes with data split out, and have it potentially be more cost effective than jumping to Enterprise Edition.

This is a better fit for applications/servers that use multiple databases, of course, but I’ve seen people do it with archival data too.

Of course, there are some disk things that you should fix. Like if you’re on a SAN and using <8Gb networking, or if you’re using a VM and not using PVSCSI disks.

The point though, is that if you have room to add memory, you should do that before fiddling with disks. It just plain goes further, because you’re not just helping queries read data into memory faster. You’re caching more pages overall, and you have more memory available for query memory grants (and other caching activities).

Faster disks also won’t do anything to help the waits we’ll talk about tomorrow, which can for sure be a sign that SQL Server doesn’t have adequate memory.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Starting SQL: Compensating For CPU Waits In SQL Server

Fast Car


Some waits relate, but nearly all waits can be compensated for.

It’s all pulling levers, really. The Great And Powerful Oz has nothing on SQL Server.

More to the point, you can end up in quite a tangle with wait stats. I see people get bogged down in one or two metrics that they read are the worst thing you can ever see on a SQL Server, only to have them be totally unrelated to the problem at hand.

Go figure, the internet misled someone. At some point, this post will probably be misleading too. It might even be misleading right now.

Let’s talk about how to compensate for the common waits we talked about yesterday.

CXPACKET and CXCONSUMER


The fastest answer here is to check your settings. If these are at the defaults, you’re messing up. There’s a lot of competing points of view about how to set parallelism settings in SQL Server. The thing is, any of it can be wrong.

I don’t want to write this and have you think what I say is set in concrete. I want you to understand that I’m giving you a starting point, and it’s going to be up to you to figure out where to go from here.

MAXDOP (Max Degree Of Parallelism)

This is a fun one, because of the number of misconceptions around it. Here’s the way I go about it.

  • If you have one NUMA node with <= 8 cores, you can probably leave it alone, though sometimes I’ll set it to 4 on a server with 8 cores (like my laptop, ha ha ha) if the server is in tough shape
  • If you have one NUMA node with > 8 cores, set it to 8
  • If you have more than one NUMA node with <= 8 cores, set it to the number of cores in a NUMA node
  • If you have more than one NUMA node with > 8 cores, set it to 8

Will this work perfectly for everyone? No, but it’s a better starting place than 0. There are even some maintenance tasks that you might want to run with higher or lower MAXDOP, but that’s way beyond anything I want to get into here.

Plus, there’s all sorts of oddball CPU configurations that you can see, either because your VM admin has a permanent vape cloud around their head, or because some newer CPUs have non-traditional core counts that don’t follow the power of 2 increases (2, 4, 8, 16, 32, 64…) that CPUs have generally used.
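
If you're not sure what your CPU and NUMA layout actually looks like, here's a quick sketch against sys.dm_os_nodes that shows how many nodes you have and how many schedulers live on each (the dedicated admin connection's node gets filtered out):

SELECT
    node_id,
    node_state_desc,
    memory_node_id,
    online_scheduler_count
FROM sys.dm_os_nodes
WHERE node_state_desc <> N'ONLINE DAC';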

If you leave MAXDOP set to 0, parallel queries can team up and really saturate your CPUs, especially if you don’t change…

CTFP (Cost Threshold For Parallelism)

The default for this is 5. This does not mean seconds. It meant seconds on one computer in the late 90s, but not anymore. I’d link to the post that talked about that one computer, but Microsoft nuked the blog it was on. Thanks, Microsoft.

It’s probably important for you not to have to think about all those old blogs when you’re concentrating so hard on quality CUs and documentation.

These days, query costs are sort of like Monopoly money. They don’t mean anything relative to your hardware. It’s not like weight, where you weigh different amounts based on the amount of gravity in your environment. These costs are only meaningful to the optimizer in coming up with plan choices.

Thing is, it’s really easy for a query to cost more than 5 of these whatevers. What’s important to understand up front is how this number is used.

When the optimizer gets a query that it needs to come up with an execution plan for, the first thing it comes up with is a serial execution plan. If the cost of that serial plan is over the CTFP threshold, the optimizer will consider parallel plans (assuming nothing is preventing a parallel plan from being formed, like scalar functions, etc).

If the cost of the parallel plan is lower than the cost of the serial plan, SQL Server will go with the parallel one. I’m gonna put this in bold, because it’s a question I answer all the time: The cost of a parallel plan does not have to be higher than CTFP.

Setting Cost Threshold For Parallelism

Well, okay then. This seems important to set correctly. If you have to start somewhere, 50 is a nice number. Do you have to leave it there? No, but just like MAXDOP, it’s a setting you’ll wanna tweak after some observations.

Most importantly, keep an eye on whether critical queries got slower after making changes. If they did, we need to figure out whether it’s because they stopped going parallel, or ended up with a lower degree of parallelism available to them.
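
If you'd rather script those starting points than click around in SSMS, it's a couple more sp_configure calls. Again, 8 and 50 are just the starting points from above, not laws.

EXEC sys.sp_configure 'show advanced options', 1;
RECONFIGURE;

EXEC sys.sp_configure 'max degree of parallelism', 8;
EXEC sys.sp_configure 'cost threshold for parallelism', 50;
RECONFIGURE;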

Some people will tell you to look at the query costs in your plan cache to figure out what to set this to, but there are some really big problems with that advice: your plan cache could be unstable, it might not contain plans that are a good representation of your workload, and query plans lie.

Most of that goes for Query Store, too. Even though it theoretically has more history, it’s up to you to sift through everything in there to find the queries you care about trying to get these settings right for.

Sure, it’s nice if those CX waits go down when you make changes, since that’s what the settings we talked about most closely control. But there’s a more important wait that changing these can help limit, too.

THREADPOOL


This is a wait you generally only want to see in demos done by professionals.

Why? Because it means SQL Server ran out of worker threads to give to queries so that they can run. It’s a really bad situation, because you might have a hard time figuring out what’s wrong.

  • You might not be able to connect to your SQL Server
  • Your monitoring tool might not be able to connect to your SQL Server

On top of that, your applications might not be able to connect, and even RDP connections might fail. You can mitigate some amount of THREADPOOL incidents by setting MAXDOP and CTFP correctly. Fewer queries going parallel, and perhaps going parallel to a lesser degree, can reduce the number of worker threads that a workload consumes.

But just doing that can’t fix everything. If you just plain have a higher degree of concurrency than your hardware can handle, or if queries are stacking up because of issues with blocking, you still run the risk of this happening. You might need to add hardware or do significant query and index tuning to fix the problem completely.

And no, setting MAXDOP to 1 isn’t a viable solution.
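
If you want a rough idea of how close you are to the worker thread ceiling, here's a sketch that compares what's configured to what's currently in use. If THREADPOOL is already in full swing, you may need the dedicated admin connection just to run it.

SELECT
    dosi.max_workers_count,
    SUM(dos.active_workers_count) AS active_workers,
    SUM(dos.work_queue_count) AS queued_tasks /* tasks waiting for a worker thread */
FROM sys.dm_os_schedulers AS dos
CROSS JOIN sys.dm_os_sys_info AS dosi
WHERE dos.status = N'VISIBLE ONLINE'
GROUP BY dosi.max_workers_count;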

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Starting SQL: How Different Wait Types Relate To Each Other In SQL Server

Going For Distance


There are waits, and then there are waits. If you know, you know, you know?

If you don’t know, here’s the deal: there are some waits that are gonna happen, and you’re not gonna be able to do much about them.

In this post, I’d rather introduce you to waits that you should pay attention to and can fix, and how they relate.

This isn’t meant to cover every single scenario, but it’s a good starting place. For example, with cloud and VM workloads being more common than ever, you may need to look at different settings or configurations outside of SQL Server.

This also goes for those of you using SANs, where everything from the networking to the controllers to the disk arrays can potentially be a bottleneck.

If one looks hard enough, one may find inefficiencies anywhere.

Going For Speed


The waits I generally feel are common enough and fixable enough are below.

There are plenty of less common waits with perfectly fine solutions, but let’s get you started with the ones that tend to drift up towards the top of all those magical scripts you find strewn about the internet.

  • CXPACKET/CXCONSUMER: Parallelism
  • LCK_*: Locking
  • PAGEIOLATCH_*: Reading pages from disk
  • RESOURCE_SEMAPHORE: Running out of memory to grant queries
  • THREADPOOL: Running out of worker threads

Of course, these waits can be closely related, can’t they?

  • Many concurrent parallel queries can lead to running out of worker threads
  • A common way to have many parallel queries stack up is to have them get blocked
  • Granting lots of memory to queries can mean less room for the buffer pool and more reading from disk

Of course, it’s really when these waits happen that matters, isn’t it? If there’s no end-user impact, who cares?

Always Be Checking


For the billionth time, if you have servers you care about performance for, get a monitoring tool.

If your boss won’t spring for one, try a free one. Or if you’re feeling really saucy, download a free trial of a paid tool.

Once you know which waits are happening when users might care, you stand a better chance at being able to start fixing things.

But I don’t want you to think you always need wait stats to tune things. Often, waits for a single query don’t matter. Waits really matter when you want to judge a workload.

Of course, there are exceptions. If every 2nd or 3rd time you run a query, it spends 20 seconds reading pages from disk, but the rest of the time it takes 2 seconds, it’s probably not the query (assuming everything in the plan is identical).

Tomorrow we’ll dig into CPU related waits, and how to address those.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Starting SQL: Measuring Wait Stats In SQL Server

Done Started Something


If you talk to anyone who bills by the hour for SQL Server, one of the first things they’ll ask you is what your wait stats are.

It’s not a bad question; it kills some time, and it lets them gauge how engaged you are with your server.

  • Waits can sometimes indicate a severe bottleneck
  • Waits can also be “high” on a server that has perfectly acceptable throughput

There are also some problems with wait stats, and performance counters in general: there’s no direct correlation between the counters and the queries that caused them.

Sure, Query Store in 2017+ tracks high level wait stats for queries, but I still don’t run into a lot of people using Query Store.
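
If you do happen to be on 2017 or better with Query Store turned on, those waits live in sys.query_store_wait_stats, bucketed by category rather than by individual wait type. A rough sketch for reading it:

SELECT TOP (20)
    qsws.plan_id, /* join through sys.query_store_plan and sys.query_store_query to sys.query_store_query_text to see the statement */
    qsws.wait_category_desc,
    SUM(qsws.total_query_wait_time_ms) AS total_wait_ms
FROM sys.query_store_wait_stats AS qsws
GROUP BY qsws.plan_id, qsws.wait_category_desc
ORDER BY total_wait_ms DESC;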

The other trouble with performance counters is that they just accumulate from the time the server starts up, with no partitioning by database or by time.

Brooklyn Basement


Some things to keep in mind when you’re looking at wait stats:

  • How they compare to server uptime
  • How many times the wait occurred
  • How long each occurrence took to complete on average

If your script or monitoring tool doesn’t report this stuff, it’s leaving some really important stuff out of the picture.

The relationship to server up-time is important, because if all you have to work off of is percentages, you have no idea if your server is constantly grinding on these waits, or if they only happened for a few minutes.

As a thought experiment: say 100% of your waits are on PAGEIOLATCH_SH, and you have 2 hours total of those waits.

How does your opinion of those waits change if the server has been up for:

  • 1 hour
  • 4 hours
  • 24 hours
  • 96 hours

Typically, the longer a server has been up, and the less concentrated waits become, the less of a focal point they should be. There’s no magic ratio here, but for me, the closer wait times are to up time, the more I may want to look at them.
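
Here's a bare-bones sketch of that math: total wait time, average wait per occurrence, and server uptime side by side. A real script would also filter out all the harmless background waits, which this one does not.

SELECT TOP (20)
    dows.wait_type,
    dows.waiting_tasks_count,
    dows.wait_time_ms / 1000 / 60 AS wait_time_minutes,
    dows.wait_time_ms / NULLIF(dows.waiting_tasks_count, 0) AS avg_wait_ms,
    DATEDIFF(MINUTE, dosi.sqlserver_start_time, SYSDATETIME()) AS uptime_minutes
FROM sys.dm_os_wait_stats AS dows
CROSS JOIN sys.dm_os_sys_info AS dosi
ORDER BY dows.wait_time_ms DESC;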

Longs and Shorts


Yes, there can be a mix of these, but there are still two basic ways that wait stats rack up.

  • Lots of waits that happen pretty quickly
  • Fewer waits that last a long time

Obviously, the source of those waits will impact how much we care about them. Here are some common examples:

  • You wait often on storage, but the waits are short (say <10 ms)
  • You don’t wait often on locks, but the waits are long (say > 10,000 ms)

It’s pretty trivial to look at your wait stats

I’m not gonna reinvent any wheels here, but I will reiterate a couple things:

  • You can see which queries are generating waits while they’re running (see the sketch after this list), but…
  • If you’re not constantly logging that, it disappears on you
  • Once waits are completed, they are faceless and blameless
  • Waits in general can be far less valuable when workloads are inconsistent
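
The "while they're running" part looks something like this: a sketch against sys.dm_os_waiting_tasks. As soon as a query finishes, its rows here are gone, which is the whole point of the list above.

SELECT
    dowt.session_id,
    dowt.wait_type,
    dowt.wait_duration_ms,
    dowt.blocking_session_id,
    dest.text AS query_text
FROM sys.dm_os_waiting_tasks AS dowt
JOIN sys.dm_exec_requests AS der
    ON der.session_id = dowt.session_id
CROSS APPLY sys.dm_exec_sql_text(der.sql_handle) AS dest
WHERE dowt.session_id > 50; /* skips most system sessions */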

Attentione!


Tomorrow, I’ll talk through some of the most common waits, and first-line ways to start solving them.

They may not solve every single problem, but they’re generally a good place to start.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Starting SQL: Index Wrap Up

Watch Out Now


We talked about a bunch of index stuff this week, and we still barely scratched the surface. That’s okay, though. If you try to cram it all in at once, you’ll get overwhelmed and forget half of it anyway.

If you want to jump ahead to some more advanced topics, check out my presentation about indexes here.

You probably have questions about indexes that I didn’t cover here. Feel free to ask them in the comments section.

The core concepts I want you to take away from this week are:

  • Indexes put data in order
  • That order is what makes searching and joining more efficient
  • Having data in order can also help other parts of the query
  • Putting data in order without an index requires memory, sometimes a lot

Frog Meditation


If you’re wondering why I didn’t talk at all about index fragmentation, it’s because it’s not an effective query or index tuning strategy. I see far too many people set up some scripts to “fix” it, and then walk away from real tuning work.

I don’t get it, the indexes aren’t fragmented. Why is my query still slow?

Leaving aside that index maintenance scripts all measure the wrong thing, hammering your server night after night with pointless maintenance doesn’t fix the root cause of why performance is bad.

It doesn’t fix problems with functions, sargability, table variables, local variables, or anything else.

Sure, you can say your query got 100ms faster because you got pages back in order, but who cares about that when it still runs for 5-10 seconds, or longer?

Riding Through


The next couple things we’re going to cover this month are wait stats and query plans.

After that, who knows?

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Starting SQL: Sorts And Memory Grants In SQL Server

Sup?


For SQL Server queries that require additional memory, grants are derived for serial plans. If a parallel plan is explored and chosen, memory will be divided evenly among threads.

Memory grant estimates are based on:

  • Number of rows (cardinality)
  • Size of rows (data size)
  • Number of concurrent memory consuming operators

If a parallel plan is chosen, there is some memory overhead to process parallel exchanges (distribute, redistribute, and gather streams); however, their memory needs are still not calculated the same way as the memory consuming operators below.

Memory Consuming Operators


The most common operators that ask for memory are

  • Sorts
  • Hashes (joins, aggregates)
  • Optimized Nested Loops

Less common operators that require memory are inserts to column store indexes. These also differ in that memory grants are currently multiplied by DOP for them.

Memory needs for Sorts are typically much higher than for hashes. Sorts will ask for at least the estimated size of the data as a memory grant, since they need to sort all result columns by the ordering element(s). Hashes need memory to build a hash table, which does not include all selected columns.

Examples


If I run this query, intentionally hinted to DOP 1, it will ask for 166 MB of memory.

SELECT 
    u.* 
FROM 
( 
    SELECT TOP (1000) 
        u.Id 
    FROM dbo.Users AS u 
    ORDER BY u.Reputation 
) AS u 
OPTION(MAXDOP 1);

[Image: SQL Server Query Plan / Big Memory Grant]

If I run this query (again, DOP 1), the plan will change, and the memory grant will go up slightly.

SELECT
    *
FROM
(
    SELECT TOP (1000) 
        u.Id 
    FROM dbo.Users AS u 
    ORDER BY u.Reputation
) AS u
JOIN
(
    SELECT TOP (1000) 
        u.Id 
    FROM dbo.Users AS u 
    ORDER BY u.Reputation
) AS u2
    ON u.Id = u2.Id
OPTION (MAXDOP 1);

[Image: SQL Server Query Plan / Another Memory Grant!]

There are two Sorts, and now a Hash Join. The memory grant bumps up a little bit to accommodate the hash build, but it does not double because the Sort operators cannot run concurrently.

If I change the query to force a nested loops join, the grant will double to deal with the concurrent Sorts.

SELECT
    *
FROM
(
    SELECT TOP (1000) 
        u.Id 
    FROM dbo.Users AS u 
    ORDER BY u.Reputation
) AS u
INNER LOOP JOIN /*Force the loop join*/
(
    SELECT TOP (1000) 
        u.Id 
    FROM dbo.Users AS u 
    ORDER BY u.Reputation
) AS u2
    ON u.Id = u2.Id
OPTION (MAXDOP 1);

[Image: SQL Server Query Plan / Double The Memory Grant!]

The memory grant doubles because Nested Loop is not a blocking operator, and Hash Join is.

Size Of Data Matters


This query selects different combinations of string columns. Depending on which columns I select, the size of the memory grant will go up.

The way size of data is calculated for variable string data is rows * 50% of the column’s declared length. This is true for VARCHAR and NVARCHAR, though NVARCHAR columns are doubled since they store double-byte characters. This does change in some cases with the new CE, but details aren’t documented.

Size of data also matters for hash operations, but not to the same degree that it does for Sorts.

SELECT 
    u.* 
FROM 
( 
    SELECT TOP (1000) 
        u.Id -- 166MB (INT) 
      , u.DisplayName -- 300MB (NVARCHAR 40) 
      , u.WebsiteUrl -- 900MB (NVARCHAR 200) 
      , u.Location -- 1.2GB (NVARCHAR 100) 
      , u.AboutMe -- 9GB (NVARCHAR MAX) 
    FROM dbo.Users AS u 
    ORDER BY u.Reputation 
) AS u 
OPTION(MAXDOP 1);

[Image: SQL Server Query Plan / NINE POINT SEVEN GIG MEMORY GRANT]

But What About Parallelism?


If I run this query at different DOPs, the memory grant is not multiplied by DOP.

SELECT
    *
FROM
(
    SELECT TOP (1000) 
        u.Id 
    FROM dbo.Users AS u 
    ORDER BY u.Reputation
) AS u
INNER HASH JOIN
(
    SELECT TOP (1000) 
        u.Id 
    FROM dbo.Users AS u 
    ORDER BY u.Reputation
) AS u2
    ON u.Id = u2.Id
ORDER BY
    u.Id,
    u2.Id -- Add an ORDER BY 
OPTION(MAXDOP ?);

[Image: SQL Server Query Results / Pick-A-DOP]

There are slight increases to deal with more parallel buffers per exchange operator, and perhaps there are internal reasons that the Sort and Hash builds require extra memory to deal with higher DOP, but it’s clearly not a multiplying factor.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Starting SQL: How To Choose Key Column Order In SQL Server Indexes

Found A Picture Of You


The way that many people will tell you to design indexes is something along the lines of:

  • Equality predicates first =
  • Inequality predicates next >, >=, <, <=, <>
  • This point intentionally left blank
  • Put all your select list columns as includes

Well, okay. That probably works with one of those tiny example databases that people keep coming up with creative ways to make bigger.

Real life is often larger than 100MB, though. And besides, real life queries are far more complicated.

Familiar


We don’t have to get too complicated to expose some of the issues with those rules, as queries go a little beyond the where clause.

SELECT TOP (5000) 
    v.Id,
    v.PostId,
    v.UserId,
    v.BountyAmount,
    v.VoteTypeId,
    v.CreationDate 
FROM dbo.Votes AS v
WHERE v.VoteTypeId = ?
AND   v.CreationDate >= ?
ORDER BY v.BountyAmount DESC;

If we were to faithfully follow the Rules of Indexing™, we might end up with an index like this:

CREATE INDEX an_attempt_was_made
    ON dbo.Votes (VoteTypeId, CreationDate)
    INCLUDE(BountyAmount, PostId, UserId);

Which, aside from Include column order (which we know doesn’t matter), is about what SQL Server’s missing index recommendation would be.

[Image: SQL Server Missing Index Request / pedestrian blues]

Could it be that conventional index design wisdom is based on a faulty algorithm?

Could be. Could be.

Strange


But what happens when we add the index, according to ancient index law, and run queries with different values that have different selectivities?

SELECT TOP (5000) 
    v.Id,
    v.PostId,
    v.UserId,
    v.BountyAmount,
    v.VoteTypeId,
    v.CreationDate 
FROM dbo.Votes AS v
WHERE v.VoteTypeId = 7 --Not a lot of these
AND   v.CreationDate >= '20080101'
ORDER BY v.BountyAmount DESC;


SELECT TOP (5000) 
    v.Id,
    v.PostId,
    v.UserId,
    v.BountyAmount,
    v.VoteTypeId,
    v.CreationDate 
FROM dbo.Votes AS v
WHERE v.VoteTypeId = 2 --Yes a lot of these
AND   v.CreationDate >= '20080101'
ORDER BY v.BountyAmount DESC;

[Image: SQL Server Query Plan / vas deferens]

This isn’t parameter sniffing, because we used literal values, and got different execution plans. They even have correct estimates.

Both queries used our index, too. It did okay for a small amount of data, but for a larger amount of data, we got totally sunk.

Order: Equality vs Inequality Predicates


We’ve talked about how indexes make searching easier by putting data in order, and how each key column in an index has a dependency for that order on the column before it. Obviously having the BountyAmount column as an include isn’t going to be helpful, because included columns aren’t stored in any order.

But there’s an important difference in where we put the BountyAmount column in the key of the index. For example, if we change it to look like this:

CREATE INDEX an_attempt_was_made
    ON dbo.Votes (VoteTypeId, CreationDate, BountyAmount)
    INCLUDE(PostId, UserId) WITH(DROP_EXISTING = ON);

We’re not going to do any better. Why? The predicate on CreationDate is an inequality. BountyAmount is only in order within groups of duplicate dates.

Once we cross a date boundary, the order resets, just like we talked about in yesterday’s post. If we want to tune this index to make this query fast for large or small predicates on VoteTypeId, we need to change our index to look like this:

CREATE INDEX an_attempt_was_made
    ON dbo.Votes (VoteTypeId, BountyAmount, CreationDate)
    INCLUDE(PostId, UserId) WITH(DROP_EXISTING = ON);

[Image: SQL Server Query Plan / no static at all]

Rules Of The Road


For index key columns, data stored in order can make searching, joining, grouping, and ordering in queries a lot more efficient. That’s probably just the sort of thing you’re after when query tuning. And index tuning is, after all, part of query tuning.

But I bet you’re asking: why focus so much on Sorts? Tune in tomorrow to find out why!

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Starting SQL: How SQL Server Stores Data In Nonclustered Indexes

But What Happens When…


People may tell you to always put the most selective column first in an index, but selectivity depends on more than just what values are in a column.

It also depends on how columns are searched, doesn’t it? If people are using inequalities, like >, >=, <, or <=, then having a totally unique value on every row becomes a bit less helpful.

Likewise, if people search with IN() or NOT IN, IS NULL or IS NOT NULL, or if the data in a column is only selective for some values, then selectivity can matter a whole lot less.

Beyond that, it ignores a whole world of considerations around how you’re grouping or ordering data, if your query is a top (n) with an order by, and more.

Before we go jumping off on such wild adventures, let’s talk a little bit about multi-key indexes. It’s easy enough to visualize a single column index putting data in order, but multi-key indexes present a slightly different picture.

Janitorial


Single-column clustered indexes make a lot of sense. Single column nonclustered indexes often make less sense.

It’s sort of like the difference between a chef knife and a Swiss Army knife. You want one to be really good at one specific task, and another to be pretty useful to a bunch of tasks.

Will a Swiss Army knife be the best wine opener you’ve ever owned? No, but it’s a whole lot easier than trying to get a cork out with a cleaver, and it can also be a screwdriver, a pair of scissors, and open a beer bottle for your less industrious friends who can’t quite muster the strength to cope with a twist-off cap.

That multi-tool ability comes at a bit of a cost, too. There’s no such thing as a free index column (unless the table is read only).

[Image: legally downloaded]

All those columns have to be maintained when you modify table data, of course.

And there’s another thing: every key column in the index is dependent on the column that comes before it. Rather than try to word-problem this for you, let’s just look at some demos.

Withdrawals


Let’s say we’ve got this index which, albeit simple, is at least more than a single column. Congratulations, you’ve graduated.

CREATE INDEX joan_jett
ON dbo.Posts
(
    PostTypeId, Score
);

If we write queries like this, we’ll be able to use it pretty efficiently.

SELECT p.Id, p.PostTypeId, p.Score
FROM dbo.Posts AS p
WHERE p.PostTypeId = 7
AND   p.Score = 1;

I’m skipping over a little bit now, because data is mightily skewed in the PostTypeId column towards a couple of quite-common values. I’ll get to it, though.

For now, marvel at the simplicity and Seekiness of this plan.

[Image: SQL Server Query Plan / rings bells]

Now let’s try to find data in the Score column without also searching on the PostTypeId column.

SELECT p.Id, p.PostTypeId, p.Score
FROM dbo.Posts AS p
WHERE p.Score = 999;

[Image: SQL Server Query Plan / turning point]

A couple things changed, here. We had to scan through the index to find Scores we’re interested in, and the optimizer thought that this would be a process-intensive enough task to use multiple CPU cores to do it.

Okay then.

Age Of Reason


If you’ve been kicking around databases for a little bit, you may have read about this before, or even seen it in action when writing queries and creating indexes.

What I’d like to do is try to offer an explanation of why that happens the way it does: Columns within an index are not ordered independently.

In other words, you don’t have all of the PostTypeIds in ascending order, and then all of the Scores in ascending order. You do have all the PostTypeIds in ascending order, because it’s the leading column, but Scores are only in ascending order after PostTypeId.

A simple query gets illustrative enough results.

SELECT p.PostTypeId, p.Score
FROM dbo.Posts AS p
WHERE p.PostTypeId IN (1, 2)
AND   p.Score BETWEEN 1950 AND 2000
ORDER BY p.PostTypeId, p.Score;

[Image: SQL Server Query Results / reset button]

Note how the ordering of Score resets when you cross a value boundary in PostTypeId?

We can see that in action with other queries, too.

Then People Stare


Here are three queries, and three plans.

SELECT TOP (1000) p.Id, p.PostTypeId, p.Score
FROM dbo.Posts AS p
ORDER BY p.Score; --Score isn't stored in order independently


SELECT TOP (1000) p.Id, p.PostTypeId, p.Score
FROM dbo.Posts AS p
ORDER BY p.PostTypeId; --PostTypeId is the leading column, though


SELECT TOP (1000) p.Id, p.PostTypeId, p.Score
FROM dbo.Posts AS p
ORDER BY p.PostTypeId,
         p.Score; --Score is in order within repeated PostTypeId values

[Image: SQL Server Query Plan / toenails]

Only that first query, where we try to order by Score independently, needs to physically sort data. They all use the same index, but that index doesn’t store Score in perfect ascending order, unless we first order by PostTypeId.

In tomorrow’s post, we’ll mix things up a little bit and design an index for a tricky query.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Starting SQL: Designing Indexes For SQL Server Queries

Sweet Like


Clustered indexes are great, really. Usually. Okay, if they’re bad it’s probably your fault.

Did you really need 10 columns in the key? Did you have to make it on an NVARCHAR(512)?

No. You messed that up. By that I mean the royal you. All of you.

The thing is, they’re of limited overall value for searching for data.

[Image: SQL Server Management Studio Table / searchy baby]

If every single join and where clause has this Id column in it, we might be okay. But the odds of that being the case are slim to none.

As soon as we want to search by any other columns without searching for a specific Id, we’re toast. That data doesn’t exist in a helpful order for searching.

I know I’ve mentioned it before, but that’s what indexes do to make data easier to find: they put it in order. Ascending, descending. It’s up to you.

The Meaning Of Life


There are two main parts of a nonclustered index: key columns, and included columns.

Sure, there’s other stuff you can do with them, like make them unique, or add filters (where clauses) to them, but we’ll talk about that later.

For now, feast your eyes on the majesty of the nonclustered index create statement.

[Image: SQL Server Index Definition / stay up]
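
Since the screenshot doesn't make the trip here, this is roughly the shape of the thing, written against the Users table we'll keep using in this series. The column choices are just for illustration, not a recommendation.

CREATE INDEX whatever
ON dbo.Users
    (Reputation, CreationDate) /* key columns: stored in order */
INCLUDE
    (DisplayName, Location); /* included columns: along for the ride, not ordered */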

Look at all that. Can you believe how much faster it can make your queries?

Let’s talk about how that works.

Those Keys


If you want to visualize the stuff key columns can help with in a query, you can almost draw a Fibonacci whatever on top of it.

[Image: SQL Server Query / i did math]

Much of the time, it makes sense to focus on the where clause first.

There will of course be times you’ll need to break from that, but as a general design pattern this is a helpful starting place. The stuff index key columns tend to help is under the from clause. That’s because these parts of the query are most often helped by having data in a pertinent order.

Sometimes things above the from clause can be improved too, when there’s an aggregate or windowing function involved, but those considerations are more advanced and specialized.

Inclusion Conclusion


Included columns are helpful for queries, because you can have a single index be the source of data for a query. No need for lookups, and fewer optimizer choices.

But included columns aren’t ordered the way key columns are. They’re kinda like window dressing.

Sure, you can use them to find data, it’s just less efficient without the ordering. You can think of them like all the non-key columns in your clustered index.

Some good uses for includes:

  • Columns only in the select list
  • Non-selective predicates
  • Columns in filter definitions

Includes, though, are the place where I see people go overboard. Thinking back a little, if you’re selecting long lists of columns from wide tables, the optimizer might suggest very wide indexes to compensate for that.

The wider your index definitions are, the higher the chances that modification queries will need to touch them.

It’s a bit like a game of Battleship. The bigger your indexes get, the more of the board they take up, and the more likely it is you’re gonna get hit by one of those little plastic peg torpedoes.

[Image: a Battleship board, from Wikipedia. Thanks, Wikipedia.]

Baby Teeth


We know we need indexes, and now we’ve got a rough idea of which parts of the index can help which part of our query.

Next, we’ll look at some of the deeper intricacies of index design, like the column-to-column dependencies that exist in row store indexes.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.