Erik Darling, Author at Darling Data

Starting SQL: How Different Wait Types Relate To Each Other In SQL Server

Posted on August 25, 2020 (updated May 14, 2022) by Erik Darling

Going For Distance

There are waits, and then there are waits. If you know, you know, you know?

If you don’t know, here’s the deal: there are some waits that are gonna happen, and you’re not gonna be able to do much about them.

In this post, I’d rather introduce you to waits that you should pay attention to and can fix, and how they relate.

This isn’t meant to cover every single scenario, but it’s a good starting place. For example, with cloud and VM workloads being more common than ever, you may need to look at different settings or configurations outside of SQL Server.

This also goes for those of you using SANs, where everything from the networking to the controllers to the disk arrays can potentially be a bottleneck.

If one looks hard enough, one may find inefficiencies anywhere.

Going For Speed

The waits I generally feel are common enough and fixable enough are below.

There are plenty of less common waits with perfectly fine solutions, but let’s get you started with the ones that tend to drift up towards the top of all those magical scripts you find strewn about the internet.

CXPACKET/CXCONSUMER: Parallelism
LCK_*: Locking
PAGEIOLATCH_*: Reading pages from disk
RESOURCE_SEMAPHORE: Running out of memory to grant queries
THREADPOOL: Running out of worker threads
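
If you don’t have one of those magical scripts handy, here’s a minimal sketch that pulls just the waits listed above out of sys.dm_os_wait_stats. The filter list is only the waits named in this post, and the numbers are cumulative since the last restart (or since the stats were last cleared), so treat the output as a starting point rather than a diagnosis.

```sql
/* A sketch, not a full wait stats script:
   only the wait types named above, biggest total wait time first. */
SELECT
    dows.wait_type,
    dows.waiting_tasks_count,
    dows.wait_time_ms,
    dows.signal_wait_time_ms
FROM sys.dm_os_wait_stats AS dows
WHERE dows.wait_type IN
      (
          N'CXPACKET',
          N'CXCONSUMER',
          N'RESOURCE_SEMAPHORE',
          N'THREADPOOL'
      )
OR    dows.wait_type LIKE N'LCK[_]%'        /* locking */
OR    dows.wait_type LIKE N'PAGEIOLATCH[_]%' /* reading pages from disk */
ORDER BY dows.wait_time_ms DESC;
```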

Of course, these waits can be closely related, can’t they?

Many concurrent parallel queries can lead to running out of worker threads
A common way to have many parallel queries stack up is to have them get blocked
Granting lots of memory to queries can mean less room for the buffer pool and more reading from disk

Of course, it’s really when these waits happen that matters, isn’t it? If there’s no end-user impact, who cares?

Always Be Checking

For the billionth time, if you have servers you care about performance for, get a monitoring tool.

If your boss won’t spring for one, try a free one. Or if you’re feeling really saucy, download a free trial of a paid tool.

Once you know which waits are happening when users might care, you stand a better chance at being able to start fixing things.

But I don’t want you to think you always need wait stats to tune things. Often, waits for a single query don’t matter. Waits really matter when you want to judge a workload.

Of course, there are exceptions. If every 2nd or 3rd time you run a query, it spends 20 seconds reading pages from disk, but the rest of the time it takes 2 seconds, it’s probably not the query (assuming everything in the plan is identical).

Tomorrow we’ll dig into CPU related waits, and how to address those.

Thanks for reading!

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

Starting SQL: Measuring Wait Stats In SQL Server

Posted on August 24, 2020 (updated May 14, 2022) by Erik Darling

Done Started Something

If you talk to anyone who bills by the hour for SQL Server, one of the first things they’ll ask you is what your wait stats are.

It’s not a bad question; it kills some time, and it lets them gauge how engaged you are with your server.

Waits can sometimes indicate a severe bottleneck
Waits can also be “high” on a server that has perfectly acceptable throughput

There are also some problems with wait stats, and performance counters in general: there’s no direct correlation between the counters and the queries that caused them.

Sure, Query Store in 2017+ tracks high level wait stats for queries, but I don’t run into a lot of people using Query Store still.
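
If you are one of the people using Query Store on 2017 or later, a rough sketch of reading those high level waits looks something like this. It assumes Query Store is enabled in the current database, and the TOP (10) cutoff is arbitrary.

```sql
/* Assumes SQL Server 2017+ with Query Store turned on for this database.
   Wait categories are coarse buckets, not individual wait types. */
SELECT TOP (10)
    qsp.query_id,
    qsws.plan_id,
    qsws.wait_category_desc,
    total_wait_ms = SUM(qsws.total_query_wait_time_ms)
FROM sys.query_store_wait_stats AS qsws
JOIN sys.query_store_plan AS qsp
    ON qsp.plan_id = qsws.plan_id
GROUP BY
    qsp.query_id,
    qsws.plan_id,
    qsws.wait_category_desc
ORDER BY total_wait_ms DESC;
```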

The other trouble with performance counters is that they just aggregate over time, with no partitioning either by database or time.

Brooklyn Basement

Some things to keep in mind when you’re looking at wait stats:

How they compare to server uptime
How many times the wait occurred
How long each occurrence took to complete on average

If your script or monitoring tool doesn’t report this stuff, it’s leaving some really important stuff out of the picture.
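
As a hedged example of reporting those three things together, the sketch below lines up each wait against uptime, occurrence count, and average duration. It assumes wait stats haven’t been manually cleared since startup; if they have, the uptime comparison is misleading.

```sql
/* Sketch: wait time vs. uptime, number of waits, and average wait length.
   Assumes sys.dm_os_wait_stats has been accumulating since sqlserver_start_time. */
SELECT
    dows.wait_type,
    hours_of_uptime =
        DATEDIFF(SECOND, osi.sqlserver_start_time, SYSDATETIME()) / 3600.,
    hours_of_wait =
        dows.wait_time_ms / 1000. / 3600.,
    dows.waiting_tasks_count,
    avg_ms_per_wait =
        dows.wait_time_ms * 1. / NULLIF(dows.waiting_tasks_count, 0)
FROM sys.dm_os_wait_stats AS dows
CROSS JOIN sys.dm_os_sys_info AS osi
WHERE dows.waiting_tasks_count > 0
ORDER BY dows.wait_time_ms DESC;
```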

The relationship to server up-time is important, because if all you have to work off of is percentages, you have no idea if your server is constantly grinding on these waits, or if they only happened for a few minutes.

As a thought experiment: say 100% of your waits are on PAGEIOLATCH_SH, and you have 2 hours total of those waits.

How does your opinion of those waits change if the server has been up for:

1 hour
4 hours
24 hours
96 hours

Typically, the longer a server has been up, and the less concentrated waits become, the less of a focal point they should be. There’s no magic ratio here, but for me, the closer wait times are to up time, the more I may want to look at them.
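
To put numbers on the thought experiment above, here’s the arithmetic for 2 hours of waits against each uptime. Wait time can be more than 100% of uptime because many tasks can be waiting at the same time.

```sql
/* Worked arithmetic only: 2 hours of waits as a percentage of uptime. */
SELECT
    v.uptime_hours,
    wait_hours = 2,
    wait_pct_of_uptime =
        CONVERT(decimal(5, 1), 2 * 100. / v.uptime_hours)
FROM
(
    VALUES (1), (4), (24), (96)
) AS v (uptime_hours);
/* Returns 200.0, 50.0, 8.3, and 2.1 percent respectively. */
```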

Longs and Shorts

Yes, there can be a mix of these, but there are still two basic ways that wait stats rack up.

Lots of waits that happen pretty quickly
Fewer waits that last a long time

Obviously, the source of those waits will impact how much we care about them. Here are some common examples:

You wait often on storage, but the waits are short (say <10 ms)
You don’t wait often on locks, but the waits are long (say > 10,000 ms)

It’s pretty trivial to look at your wait stats:

Running queries store them in sys.dm_os_waiting_tasks
In 2016+, session waits are in sys.dm_exec_session_wait_stats
Completed waits are stored in sys.dm_os_wait_stats
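
A minimal sketch of looking at the first two of those views follows. The first query shows what user sessions are waiting on right now; the second shows accumulated waits for your own session, and needs SQL Server 2016 or later.

```sql
/* What are user sessions waiting on at this moment? */
SELECT
    dowt.session_id,
    dowt.wait_type,
    dowt.wait_duration_ms,
    dowt.blocking_session_id,
    dowt.resource_description
FROM sys.dm_os_waiting_tasks AS dowt
JOIN sys.dm_exec_sessions AS des
    ON des.session_id = dowt.session_id
WHERE des.is_user_process = 1
ORDER BY dowt.wait_duration_ms DESC;

/* What has the current session waited on so far? (2016+) */
SELECT
    desws.wait_type,
    desws.waiting_tasks_count,
    desws.wait_time_ms
FROM sys.dm_exec_session_wait_stats AS desws
WHERE desws.session_id = @@SPID
ORDER BY desws.wait_time_ms DESC;
```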

I’m not gonna reinvent any wheels here, but I will reiterate a couple things:

You can see which queries are generating waits while they’re running, but…
If you’re not constantly logging that, it disappears on you
Once waits are completed, they are faceless and blameless
Waits in general can be far less valuable when workloads are inconsistent

Attentione!

Tomorrow, I’ll talk through some of the most common waits, and first-line ways to start solving them.

They may not solve every single problem, but they’re generally a good place to start.

Thanks for reading!

Starting SQL: Index Wrap Up

Posted on August 21, 2020 (updated May 16, 2022) by Erik Darling

Watch Out Now

We talked about a bunch of index stuff this week, and we still barely scratched the surface. That’s okay, though. If you try to cram it all in at once, you’ll get overwhelmed and forget half of it anyway.

If you want to jump ahead to some more advanced topics, check out my presentation about indexes here.

You probably have questions about indexes that I didn’t cover here. Feel free to ask them in the comments section.

The core concepts I want you to take away from this week are:

Indexes put data in order
That order is what makes searching and joining more efficient
Having data in order can also help other parts of the query
Putting data in order without an index requires memory, sometimes a lot

Frog Meditation

If you’re wondering why I didn’t talk at all about index fragmentation, it’s because it’s not an effective query or index tuning strategy. I see far too many people set up some scripts to “fix” it, and then walk away from real tuning work.

I don’t get it, the indexes aren’t fragmented. Why is my query still slow?

Leaving aside that index maintenance scripts all measure the wrong thing, hammering your server night after night with pointless maintenance doesn’t fix the root cause of why performance is bad.

It doesn’t fix problems with functions, sargability, table variables, local variables, or anything else.

Sure, you can say your query got 100ms faster because you got pages back in order, but who cares about that when it still runs for 5-10 seconds, or longer?

Riding Through

The next couple things we’re going to cover this month are wait stats and query plans.

After that, who knows?

Thanks for reading!

Starting SQL: Sorts And Memory Grants In SQL Server

Posted on August 20, 2020 (updated May 14, 2022) by Erik Darling

Sup?

For SQL Server queries that require additional memory, grants are derived for serial plans. If a parallel plan is explored and chosen, memory will be divided evenly among threads.

Memory grant estimates are based on:

Number of rows (cardinality)
Size of rows (data size)
Number of concurrent memory consuming operators

If a parallel plan is chosen, there is some memory overhead to process parallel exchanges (distribute, redistribute, and gather streams). However, their memory needs are still not calculated the same way.

Memory Consuming Operators

The most common operators that ask for memory are

Sorts
Hashes (joins, aggregates)
Optimized Nested Loops

Less common operators that require memory are inserts to column store indexes. These also differ in that memory grants are currently multiplied by DOP for them.

Memory needs for Sorts are typically much higher than for hashes. Sorts will ask for at least estimated size of data for a memory grant, since they need to sort all result columns by the ordering element(s). Hashes need memory to build a hash table, which does not include all selected columns.
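
If you want to see grants without opening a query plan, one option is to check sys.dm_exec_query_memory_grants from a second session while a query runs. This is just a sketch; fast queries may finish before you catch them, and granted_memory_kb is NULL while a query is still waiting for its grant.

```sql
/* Run from another session while the query of interest is executing. */
SELECT
    deqmg.session_id,
    deqmg.dop,
    deqmg.requested_memory_kb,
    deqmg.granted_memory_kb,
    deqmg.used_memory_kb,
    deqmg.max_used_memory_kb,
    deqmg.wait_time_ms
FROM sys.dm_exec_query_memory_grants AS deqmg
ORDER BY deqmg.requested_memory_kb DESC;
```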

Examples

If I run this query, intentionally hinted to DOP 1, it will ask for 166 MB of memory.

SELECT 
    u.* 
FROM 
( 
    SELECT TOP (1000) 
        u.Id 
    FROM dbo.Users AS u 
    ORDER BY u.Reputation 
) AS u 
OPTION(MAXDOP 1);

SQL Server Query Plan — Big Memory Grant

If I run this query (again, DOP 1), the plan will change, and the memory grant will go up slightly.

SELECT
    *
FROM
(
    SELECT TOP (1000) 
        u.Id 
    FROM dbo.Users AS u 
    ORDER BY u.Reputation
) AS u
JOIN
(
    SELECT TOP (1000) 
        u.Id 
    FROM dbo.Users AS u 
    ORDER BY u.Reputation
) AS u2
    ON u.Id = u2.Id
OPTION (MAXDOP 1);

There are two Sorts, and now a Hash Join. The memory grant bumps up a little bit to accommodate the hash build, but it does not double because the Sort operators cannot run concurrently.

If I change the query to force a nested loops join, the grant will double to deal with the concurrent Sorts.

SELECT
    *
FROM
(
    SELECT TOP (1000) 
        u.Id 
    FROM dbo.Users AS u 
    ORDER BY u.Reputation
) AS u
INNER LOOP JOIN /*Force the loop join*/
(
    SELECT TOP (1000) 
        u.Id 
    FROM dbo.Users AS u 
    ORDER BY u.Reputation
)AS u2
    ON u.Id = u2.Id
OPTION (MAXDOP 1);

The memory grant doubles because Nested Loop is not a blocking operator, and Hash Join is.

Size Of Data Matters

This query selects string data of different combinations. Depending on which columns I select, the size of the memory grant will go up.

The way size of data is calculated for variable string data is rows * 50% of the column’s declared length. This is true for VARCHAR and NVARCHAR, though NVARCHAR columns are doubled since they store double-byte characters. This does change in some cases with the new CE, but details aren’t documented.
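
As a rough illustration of that rule of thumb, the sketch below reads declared sizes from the catalog views for dbo.Users. In sys.columns, max_length is already in bytes (so NVARCHAR lengths are already doubled), which means half of max_length matches the 50% guess for both types. This only mirrors the rule described above; it is not how the optimizer exposes its estimates, and MAX columns are left out because they report -1.

```sql
/* Rule-of-thumb sketch: assumed bytes per row for variable-length strings. */
SELECT
    column_name = c.name,
    type_name = t.name,
    declared_bytes = c.max_length,
    assumed_bytes_per_row = c.max_length / 2.0 /* 50% of declared size */
FROM sys.columns AS c
JOIN sys.types AS t
    ON t.user_type_id = c.user_type_id
WHERE c.object_id = OBJECT_ID(N'dbo.Users')
AND   t.name IN (N'varchar', N'nvarchar')
AND   c.max_length > 0 /* skip MAX columns, which report -1 */
ORDER BY c.max_length;
```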

Size of data also matters for hash operations, but not to the same degree that it does for Sorts.

SELECT 
    u.* 
FROM 
( 
    SELECT TOP (1000) 
        u.Id -- 166MB (INT) 
      , u.DisplayName -- 300MB (NVARCHAR 40) 
      , u.WebsiteUrl -- 900MB (NVARCHAR 200) 
      , u.Location -- 1.2GB (NVARCHAR 100) 
      , u.AboutMe -- 9GB (NVARCHAR MAX) 
    FROM dbo.Users AS u 
    ORDER BY u.Reputation 
) AS u 
OPTION(MAXDOP 1);

But What About Parallelism?

If I run this query at different DOPs, the memory grant is not multiplied by DOP.

SELECT
    *
FROM
(
    SELECT TOP (1000) 
        u.Id 
    FROM dbo.Users AS u 
    ORDER BY u.Reputation
) AS u
INNER HASH JOIN
(
    SELECT TOP (1000) 
        u.Id 
    FROM dbo.Users AS u 
    ORDER BY u.Reputation
) AS u2
    ON u.Id = u2.Id
ORDER BY
    u.Id,
    u2.Id -- Add an ORDER BY 
OPTION(MAXDOP ?);

There are slight increases to deal with more parallel buffers per exchange operator, and perhaps there are internal reasons that the Sort and Hash builds require extra memory to deal with higher DOP, but it’s clearly not a multiplying factor.

Thanks for reading!

Starting SQL: How To Choose Key Column Order In SQL Server Indexes

Posted on August 19, 2020 (updated May 14, 2022) by Erik Darling

Found A Picture Of You

The way that many people will tell you to design indexes is something along the lines of:

Equality predicates first =
Inequality predicates next >, >=, <, <=, <>
This point intentionally left blank
Put all your select list columns as includes

Well, okay. That probably works with one of those tiny example databases that people keep coming up with creative ways to make bigger.

Real life is often larger than 100MB, though. And besides, real life queries are far more complicated.

Familiar

We don’t have to get too complicated to expose some of the issues with those rules, as queries go a little beyond the where clause.

SELECT TOP (5000) 
    v.Id,
    v.PostId,
    v.UserId,
    v.BountyAmount,
    v.VoteTypeId,
    v.CreationDate 
FROM dbo.Votes AS v
WHERE v.VoteTypeId = ?
AND   v.CreationDate >= ?
ORDER BY v.BountyAmount DESC;

If we were to faithfully follow the Rules of Indexing™, we might end up with an index like this:

CREATE INDEX an_attempt_was_made
    ON dbo.Votes (VoteTypeId, CreationDate)
    INCLUDE(BountyAmount, PostId, UserId);

Which, aside from Include column order (which we know doesn’t matter), is about what SQL Server’s missing index recommendation would be.

SQL Server Missing Index Request — pedestrian blues

Could it be that conventional index design wisdom is based on a faulty algorithm?

Could be. Could be.

Strange

But what happens when we add the index, according to ancient index law, and run queries with different values that have different selectivities?

SELECT TOP (5000) 
    v.Id,
    v.PostId,
    v.UserId,
    v.BountyAmount,
    v.VoteTypeId,
    v.CreationDate 
FROM dbo.Votes AS v
WHERE v.VoteTypeId = 7 --Not a lot of these
AND   v.CreationDate >= '20080101'
ORDER BY v.BountyAmount DESC;


SELECT TOP (5000) 
    v.Id,
    v.PostId,
    v.UserId,
    v.BountyAmount,
    v.VoteTypeId,
    v.CreationDate 
FROM dbo.Votes AS v
WHERE v.VoteTypeId = 2 --Yes a lot of these
AND   v.CreationDate >= '20080101'
ORDER BY v.BountyAmount DESC;

This isn’t parameter sniffing, because we used literal values, and got different execution plans. They even have correct estimates.

Both queries used our index, too. It did okay for a small amount of data, but for a larger amount of data, we got totally sunk.

Order: Equality vs Inequality Predicates

We’ve talked about how indexes make searching easier by putting data in order, and how each key column in an index has a dependency for that order on the column before it. Obviously having the BountyAmount column as an include isn’t going to be helpful, because included columns aren’t stored in any order.

But there’s an important difference in where we put the BountyAmount column in the key of the index. For example, if we change it to look like this:

CREATE INDEX an_attempt_was_made
    ON dbo.Votes (VoteTypeId, CreationDate, BountyAmount)
    INCLUDE(PostId, UserId) WITH(DROP_EXISTING = ON);

We’re not going to do any better. Why? The predicate on CreationDate is an inequality. BountyAmount is only in order within groups of duplicate dates.
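
One way to see that for yourself is to return a handful of rows in index key order and watch the BountyAmount column. This is only a sketch using the same literals as the queries above; the TOP (100) is arbitrary.

```sql
/* Rows come back in (VoteTypeId, CreationDate, BountyAmount) order.
   BountyAmount only ascends within rows sharing the same CreationDate. */
SELECT TOP (100)
    v.VoteTypeId,
    v.CreationDate,
    v.BountyAmount
FROM dbo.Votes AS v
WHERE v.VoteTypeId = 7
AND   v.CreationDate >= '20080101'
ORDER BY
    v.VoteTypeId,
    v.CreationDate,
    v.BountyAmount;
```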

Once we cross a date boundary, the order resets, just like we talked about in yesterday’s post. If we want to tune this index to make this query fast for large or small predicates on VoteTypeId, we need to change our index to look like this:

CREATE INDEX an_attempt_was_made
    ON dbo.Votes (VoteTypeId, BountyAmount, CreationDate)
    INCLUDE(PostId, UserId) WITH(DROP_EXISTING = ON);

Rules Of The Road

For index key columns, data stored in order can make searching, joining, grouping, and ordering in queries a lot more efficient. That’s probably just the sort of thing you’re after when query tuning. And index tuning is, after all, part of query tuning.

But I bet you’re asking: why focus so much on Sorts? Tune in tomorrow to find out why!

Thanks for reading!

Starting SQL: How SQL Server Stores Data In Nonclustered Indexes

Posted on August 18, 2020 (updated May 14, 2022) by Erik Darling

But What Happens When…

People may tell you to always put the most selective column first in an index, but selectivity depends on more than just what values are in a column.

It also depends on how columns are searched, doesn’t it? If people are using inequalities, like >, >=, <, <=, then having a totally unique value on every row becomes a bit less helpful.

Likewise, if people can search IN() or NOT IN, NULL or NOT NULL, or even if perhaps the data in a column is only selective for some values, then selectivity can be a whole lot less selective.

Beyond that, it ignores a whole world of considerations around how you’re grouping or ordering data, if your query is a top (n) with an order by, and more.

Before we go jumping off on such wild adventures, let’s talk a little bit about multi-key indexes. It’s easy enough to visualize a single column index putting data in order, but multi-key indexes present a slightly different picture.

Janitorial

Single-column clustered indexes make a lot of sense. Single column nonclustered indexes often make less sense.

It’s sort of like the difference between a chef knife and a Swiss Army knife. You want one to be really good at one specific task, and another to be pretty useful to a bunch of tasks.

Will a Swiss Army knife be the best wine opener you’ve ever owned? No, but it’s a whole lot easier than trying to get a cork out with a cleaver, and it can also be a screwdriver, a pair of scissors, and open a beer bottle for your less industrious friends who can’t quite muster the strength to cope with a twist-off cap.

That multi-tool ability comes at a bit of a cost, too. There’s no such thing as a free index column (unless the table is read only).

Image: stock photo, Ibach, Switzerland (caption: legally downloaded)

All those columns have to be maintained when you modify table data, of course.

And there’s another thing: every key column in the index is dependent on the column that comes before it. Rather than try to word-problem this for you, let’s just look at some demos.

Withdrawals

Let’s say we’ve got this index which, albeit simple, is at least more than a single column. Congratulations, you’ve graduated.

CREATE INDEX joan_jett
ON dbo.Posts
(
    PostTypeId, Score
);

If we write queries like this, we’ll be able to use it pretty efficiently.

SELECT p.Id, p.PostTypeId, p.Score
FROM dbo.Posts AS p
WHERE p.PostTypeId = 7
AND   p.Score = 1;

I’m skipping over a little bit now, because data is mightily skewed in the PostTypeId column towards a couple of quite-common values. I’ll get to it, though.

For now, marvel at the simplicity and Seekiness of this plan.

Now let’s try to find data in the Score column without also searching on the PostTypeId column.

SELECT p.Id, p.PostTypeId, p.Score
FROM dbo.Posts AS p
WHERE p.Score = 999;

A couple things changed, here. We had to scan through the index to find Scores we’re interested in, and the optimizer thought that this would be a process-intensive enough task to use multiple CPU cores to do it.

Okay then.

Age Of Reason

If you’ve been kicking around databases for a little bit, you may have read about this before, or even seen it in action when writing queries and creating indexes.

What I’d like to do is try to offer an explanation of why that happens the way it does: Columns within an index are not ordered independently.

In other words, you don’t have all of the PostTypeIds in ascending order, and then all of the Scores in ascending order. You do have all the PostTypeIds in ascending order, because it’s the leading column, but Scores are only in ascending order after PostTypeId.

A simple query gets illustrative enough results.

SELECT p.PostTypeId, p.Score
FROM dbo.Posts AS p
WHERE p.PostTypeId IN (1, 2)
AND   p.Score BETWEEN 1950 AND 2000
ORDER BY p.PostTypeId, p.Score;

Note how the ordering of Score resets when you cross a value boundary in PostTypeId?

We can see that in action with other queries, too.

Then People Stare

Here are three queries, and three plans.

SELECT TOP (1000) p.Id, p.PostTypeId, p.Score
FROM dbo.Posts AS p
ORDER BY p.Score; --Score isn't stored in order independently


SELECT TOP (1000) p.Id, p.PostTypeId, p.Score
FROM dbo.Posts AS p
ORDER BY p.PostTypeId; --PostTypeId is the leading column, though


SELECT TOP (1000) p.Id, p.PostTypeId, p.Score
FROM dbo.Posts AS p
ORDER BY p.PostTypeId,
         p.Score; --Score is in order within repeated PostTypeId values

Only that first query, where we try to order by Score independently, needs to physically sort data. They all use the same index, but that index doesn’t store Score in perfect ascending order, unless we first order by PostTypeId.

In tomorrow’s post, we’ll mix things up a little bit and design an index for a tricky query.

Thanks for reading!

Starting SQL: Designing Indexes For SQL Server Queries

Posted on August 17, 2020 (updated May 14, 2022) by Erik Darling

Sweet Like

Clustered indexes are great, really. Usually. Okay, if they’re bad it’s probably your fault.

Did you really need 10 columns in the key? Did you have to make it on an NVARCHAR(512)?

No. You messed that up. By that I mean the royal you. All of you.

The thing is, they’re of limited overall value for searching for data.

SQL Server Management Studio Table — searchy baby

If every single join and where clause has this Id column in it, we might be okay. But the odds of that being the case are slim to none.

As soon as we want to search by any other columns without searching for a specific Id, we’re toast. That data doesn’t exist in a helpful order for searching.

I know I’ve mentioned it before, but that’s what indexes do to make data easier to find: they put it in order. Ascending, descending. It’s up to you.

The Meaning Of Life

There are two main parts of a nonclustered index: key columns, and included columns.

Sure, there’s other stuff you can do with them, like make them unique, or add filters (where clauses) to them, but we’ll talk about that later.

For now, feast your eyes on the majesty of the nonclustered index create statement.
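
A minimal sketch of what such a statement generally looks like, using the dbo.Users table from elsewhere in this series. The index name and the choice of columns are hypothetical, picked only to show where key columns and included columns go.

```sql
/* Hypothetical example: key columns are ordered, included columns are not. */
CREATE NONCLUSTERED INDEX ix_users_reputation_sketch
    ON dbo.Users (Reputation DESC) /* key column(s): stored in this order */
    INCLUDE (DisplayName);         /* included column(s): along for the ride */
```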

Look at all that. Can you believe how much faster it can make your queries?

Let’s talk about how that works.

Those Keys

If you want to visualize stuff key columns can help in a query, you can almost draw a Fibonacci whatever on top of it.

Much of the time, it makes sense to focus on the where clause first.

There will of course be times you’ll need to break from that, but as a general design pattern this is a helpful starting place. The stuff index key columns tend to help is under the from clause. That’s because these parts of the query are most often helped by having data in a pertinent order.

Sometimes things above the from clause can be improved too, when there’s an aggregate or windowing function involved, but those considerations are more advanced and specialized.

Inclusion Conclusion

Included columns are helpful for queries, because you can have a single index be the source of data for a query. No need for lookups, and fewer optimizer choices.

But included columns aren’t ordered the way key columns are. They’re kinda like window dressing.

Sure, you can use them to find data, it’s just less efficient without the ordering. You can think of them like all the non-key columns in your clustered index.

Some good uses for includes:

Columns only in the select list
Non-selective predicates
Columns in filter definitions
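
Here’s a hedged sketch of those uses in one index, against the dbo.Votes table from earlier in the series. The index name is made up, and the filter assumes BountyAmount allows NULLs, which you’d want to confirm against your own copy of the table.

```sql
/* Hypothetical index: PostId is select-list only,
   BountyAmount appears in the filter definition. */
CREATE INDEX ix_votes_includes_sketch
    ON dbo.Votes (CreationDate)
    INCLUDE (PostId, BountyAmount)
    WHERE BountyAmount IS NOT NULL;
```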

Includes, though, are the place where I see people go overboard. Thinking back a little, if you’re selecting long lists of columns from wide tables, the optimizer might suggest very wide indexes to compensate for that.

The wider your index definitions are, the higher your chances of modification queries needing to touch them are.

It’s a bit like a game of Battleship. The bigger your indexes get, the more of the board they take up, and the more likely it is you’re gonna get hit by one of those little plastic peg torpedoes.

Image caption: this is from wikipedia. thanks, wikipedia.

Baby Teeth

We know we need indexes, and now we’ve got a rough idea of which parts of the index can help which part of our query.

Next, we’ll look at some of the deeper intricacies of index design, like the column-to-column dependencies that exist in row store indexes.

Thanks for reading!

Starting SQL: How Nonclustered Indexes Work In SQL Server

Posted on August 14, 2020 (updated May 14, 2022) by Erik Darling

Facsimile

While clustered indexes or heaps are the table, nonclustered indexes are their own special little individual copies of the table data.

They’re ordered by the key columns that you choose, which should probably differ from your clustered index key column(s). There is, of course, no ordering of included columns.

Nonclustered indexes will also have different statistics, and a different set of underlying pages. If you’ve ever wondered why you can drop or disable a corrupt nonclustered index and recreate it, but you can’t do that with a clustered index, it’s because databases hate you and want you to suffer.

Downsides First

There are, of course, downsides to over-indexing a table. Your clustered index is generally there as a good foundation for everything else, but it can’t help everything.

Adding in nonclustered indexes will give you extra objects to:

Write to and lock when you modify data
Weigh as choices when the optimizer comes up with query plans
Backup, restore, check for corruption, maintain
Read into memory when you need to read them or write to them

Some caution needs to be exercised when creating indexes, of course. You’ll want to avoid overly wide indexes, and indexes that incorporate the same columns over and over again.

This can be complicated when creating wide tables that lack proper normalization. When you create tables that have many columns with a similar prefix, like “customer_”, it may be a sign that those columns should be in a separate table with a key to connect it to other tables. Another sign is numbered columns, like “phone_1”, “phone_2” etc.

The more columns you add to a single table, the more trouble you’ll get into trying to index it. Users will want to search those tables in all variety of ways and want to return many different columns. It becomes quite difficult to effectively index a single table like that.

There are exceptions, of course. If you have a reporting table that is purposely denormalized, and has no transactional meaning, you can often afford more indexes being added to it, or even use columnstore indexes to aid reporting queries.

Mitigating Downsides

There are a number of things you can do to mitigate some of the issues you can run into with a lot of indexes, depending on what’s available to your version and edition of SQL Server.

For example, if you want to minimize locking issues, you should add NOLOCK everywhere. No but seriously, don’t do that unless you don’t care at all. A much better option is an optimistic isolation level. It’d be great if SQL Server used one by default, but it’s pointless to kick dust now.
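
For reference, turning on the optimistic isolation levels looks roughly like the sketch below. It’s written against whatever database you’re connected to, and the ROLLBACK IMMEDIATE option is an assumption about how aggressive you’re willing to be: it kicks out other sessions so the change can complete. Row versioning also adds load to tempdb, so test before flipping this on somewhere important.

```sql
/* Snapshot isolation: sessions have to ask for it explicitly. */
ALTER DATABASE CURRENT
    SET ALLOW_SNAPSHOT_ISOLATION ON;

/* Read committed snapshot: changes the default read committed behavior.
   ROLLBACK IMMEDIATE rolls back other sessions' open transactions. */
ALTER DATABASE CURRENT
    SET READ_COMMITTED_SNAPSHOT ON
    WITH ROLLBACK IMMEDIATE;

/* Check the result. */
SELECT
    d.name,
    d.snapshot_isolation_state_desc,
    d.is_read_committed_snapshot_on
FROM sys.databases AS d
WHERE d.database_id = DB_ID();
```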

Having good hardware, like enough memory to cache your heavily trafficked data, and write-friendly storage can also help with many issues around writes.

Of course, the indexes you need are going to be a personal issue. Some tables, and workloads, will be able to afford more indexing than others. Putting numbers on these things often takes some digging.

Coming back to the wide tables thing, you may find it difficult to stick to 5 or 10 indexes that have 5 or 10 columns in them without having a static group of queries that touches the table, and forget it if you write anything resembling “SELECT *” from a table like that.

Over Under

Deciding if a table is over-indexed comes down to looking at usage. If your server has been up for a month or longer, and you see a bunch of indexes that are totally unused by read queries, or indexes that have way more writes than reads, you might wanna think about ditching those.
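
A minimal sketch of that usage check for the current database is below. The usage numbers reset when the server restarts, which is why uptime matters here, and an index with no row in the usage view hasn’t been touched at all since then.

```sql
/* Sketch: reads vs. writes per nonclustered index in the current database. */
SELECT
    table_name = OBJECT_NAME(i.object_id),
    index_name = i.name,
    reads = ISNULL(ius.user_seeks + ius.user_scans + ius.user_lookups, 0),
    writes = ISNULL(ius.user_updates, 0)
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS ius
    ON  ius.object_id = i.object_id
    AND ius.index_id = i.index_id
    AND ius.database_id = DB_ID()
WHERE i.type = 2 /* nonclustered rowstore */
AND   OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
ORDER BY
    reads,
    writes DESC;
```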

But always do this part first, because it’ll give you a more clear picture of what you should do with indexes that are leftover.

Other indexes that are safe to mess with are ones that have the exact same key columns. If they have different included columns, you can create one index to include them all.

Just remember to look for stuff in these index definitions like filters or uniqueness — those things can make indexes look a lot different to the optimizer.
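
As an illustration of merging indexes with identical keys, here’s a sketch using two made-up indexes on dbo.Votes. The index names and include lists are hypothetical; the pattern is to widen one index in place and then drop the other.

```sql
/* Two hypothetical indexes with the same key, different includes. */
CREATE INDEX ix_votes_a ON dbo.Votes (VoteTypeId) INCLUDE (PostId);
CREATE INDEX ix_votes_b ON dbo.Votes (VoteTypeId) INCLUDE (UserId);

/* Merge: rebuild one index with both include lists... */
CREATE INDEX ix_votes_a
    ON dbo.Votes (VoteTypeId)
    INCLUDE (PostId, UserId)
    WITH (DROP_EXISTING = ON);

/* ...then drop the one that is now redundant. */
DROP INDEX ix_votes_b ON dbo.Votes;
```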

Next we’ll start to talk about designing effective nonclustered indexes for your queries. Because that’s what we design indexes for, right? We don’t just make them up.

Thanks for reading!

Starting SQL: Picking Clustered Indexes For Tables

Posted on August 13, 2020 (updated May 14, 2022) by Erik Darling

Need’em Got’em

Let’s say you’ve either got a heap, like we talked about yesterday, or you’re plotting a new table.

It’s OLTP, and you want the very best for your transactions. What are some good characteristics of a clustered index?

Here’s an easy to remember acronym!

Narrow: Numbers, dates, not string-ish
Unique: Identities or sequences are good for this
Distinct: If not unique, then as distinct as possible
Ever: Increasing (append only is the goal, here)
Static: Definitely not a column you ever update

Follow that, and you’ll most likely never go wrong. Unless you’ve got GUIDs involved, maybe. But at least you’re trying.
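
A minimal sketch of a table that follows those traits is below. The table and column names are hypothetical; the point is the narrow, unique, ever-increasing, never-updated clustered key.

```sql
/* Hypothetical table: the clustered key is a BIGINT identity. */
CREATE TABLE dbo.ExampleOrders
(
    Id bigint IDENTITY(1, 1) NOT NULL, /* narrow, unique, ever-increasing, static */
    CustomerId int NOT NULL,
    OrderDate datetime2(0) NOT NULL,
    CONSTRAINT PK_ExampleOrders
        PRIMARY KEY CLUSTERED (Id)
);
```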

Why do we care about those things?

Avoiding Problems

All of those traits help us avoid problems down the line, like shuffling pages in the middle around, or with lopsided data.

The narrow thing is important because, like I’ve mentioned, nonclustered indexes inherit those clustered index key columns. If you assign wide clustered indexes, either via the number of key columns, or via the data types of those key columns, you can end up with quite bloated nonclustered indexes.

The whole thing can be unpleasant.

The important thing to remember is that clustered indexes are a nearly free way to logically order your table data, and have an index that fully covers all of the columns. This is probably a good time to remind you that clustered indexes are not a copy of the table, but the table data itself ordered logically by the key columns.

The Questions

To BIGINT Or Not To BIGINT?

If you’re going to do what most people do and make your clustered index an identity, do yourself a favor and use a BIGINT.

Some people will hem and haw about the 4 extra bytes, but those 4 extra bytes don’t make too much difference until the table gets to be around 2 billion rows. At that point, the pain of altering an INT to a BIGINT can be pretty severe, even with some careful planning.

I’ve seen enough people go through heck with it to tell you to avoid the problem from the get go.

But I Need So Many GUIDs

Okay, cool. I understand. GUIDs are hard to run out of. They can be rough if not generated sequentially, and even then… Well, at least they’re unique. Your nonclustered indexes won’t suffer *too* much by inheriting them.

But this is generally where a different design pattern makes sense:

Put a nonclustered primary key on your GUID
Add a clustered index to a more sensible column
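
That pattern could be sketched like this. The table, column, and constraint names are all hypothetical, and an identity column stands in for the “more sensible column”; yours might be a date or something else.

```sql
/* Hypothetical table: GUID stays the primary key, but nonclustered. */
CREATE TABLE dbo.ExampleEvents
(
    EventGuid uniqueidentifier NOT NULL
        CONSTRAINT DF_ExampleEvents_EventGuid DEFAULT NEWID(),
    Id bigint IDENTITY(1, 1) NOT NULL,
    EventDate datetime2(0) NOT NULL,
    CONSTRAINT PK_ExampleEvents
        PRIMARY KEY NONCLUSTERED (EventGuid)
);

/* The clustered index goes on the narrow, ever-increasing column. */
CREATE UNIQUE CLUSTERED INDEX cx_ExampleEvents
    ON dbo.ExampleEvents (Id);
```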

But hey, businesses seem to get angry when they find out identities can have gaps in values. GUIDs totally don’t have that problem.

Are Multi-Keys Okay?

I guess, if you need a couple columns to make a unique combination, or if you have a couple columns you’re always searching on.

Just make sure you really try to keep them narrow. Strings usually aren’t your friend here.

Action Indexes

But you know, once you’ve got a clustered index in place, it just kinda sits there.

The real fun comes in figuring out nonclustered indexes:

How many can I have?
How many columns can I put in them?

Let’s talk about those a little next.

Thanks for reading!

Starting SQL: Quick Notes About Heaps In SQL Server (Tables Without Clustered Indexes)

Posted on August 12, 2020 (updated May 14, 2022) by Erik Darling

Nein Nein Nein

I know, I said we’d talk about indexes, and heaps are basically the opposite of indexes.

Even if you have a heap with a nonclustered primary key, or nonclustered indexes hanging around, the base table is still a heap.

Why? It’s all about that clustered index.

Do you always need a clustered index? Sometimes not. But if you’re running an OLTP workload, you probably do.

It’s not that heaps can’t be useful, it’s just that their best use-case is still loading the big datas.

Downsides Of Heaps

Heaps have a couple issues that are going to sound a lot like index fragmentation. In fact, they’re a lot like index fragmentation.

Forwarded fetches from updates
Empty pages from deletes

Forwarded Fetches

Forwarded fetches are a lot like logical fragmentation, which is when pages are out of order. Instead of having clustered index keys to follow, we have pointers from where the row used to be to where it got moved to.

This happens because updates change a value that makes a row not fit on a page anymore.

Heaps compound the issue a little bit though, because you can’t seek into a heap to avoid forwarded fetches (though you can seek into any index created on top of a heap). You can seek into a clustered index to avoid logical fragmentation.

Empty Space

Empty space from deletes goes a little bit further, too. You can end up with entire pages being empty, if queries that delete data don’t successfully escalate locks to the table level. Worse, those empty pages get read when the heap is scanned.

Sure, clustered indexes can end up with a bunch of empty space on pages, but when pages are totally emptied out they will get deallocated. Heaps don’t do that without a table level lock during the delete.
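
If you control the delete, one way to get that table level lock is to ask for it with a hint. The table and column below are hypothetical, and taking a table lock blocks everyone else for the duration, so this fits staging-style tables better than busy transactional ones.

```sql
/* Hypothetical heap: TABLOCK lets emptied pages be deallocated by the delete. */
DELETE sh
FROM dbo.StagingHeap AS sh WITH (TABLOCK)
WHERE sh.LoadDate < '20200101';
```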

The thing is that both of these only happen to the heap itself. If you’ve got nonclustered indexes on the heap and queries mostly use those, you may not notice either of these problems. Perhaps a little bit if you do bookmark lookups.

When They Might Matter

This is largely a size thing. The bigger your heaps are, the more pronounced these problems can be.

It also depends a bit on what heap tables get used for. Or more importantly, who uses them.

If they’re staging tables that end users don’t touch, fixing them probably won’t solve a big problem for you outside of ETL.

But if they’re transactional tables where end users are complaining about performance, you need to do two things: check the query plans for table scans, and then check the heaps for forwarded fetches and deletes.
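
For the second check, a rough sketch is to look at heaps in sys.dm_db_index_operational_stats. These counters only cover activity since the object’s metadata was last cached, so low numbers on a recently restarted server don’t prove much.

```sql
/* Sketch: forwarded fetches and deletes for heaps (index_id = 0)
   in the current database. */
SELECT
    table_name = OBJECT_NAME(ios.object_id),
    ios.forwarded_fetch_count,
    ios.leaf_delete_count
FROM sys.dm_db_index_operational_stats(DB_ID(), NULL, NULL, NULL) AS ios
WHERE ios.index_id = 0
ORDER BY ios.forwarded_fetch_count DESC;
```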

To fix them, you can rebuild the heap, or create a clustered index. In either case, be very careful if you’ve got nonclustered indexes on the table, because they’ll need to all get rebuilt to either fix the RIDs in a heap, or add the clustered index keys.
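
The two fixes look roughly like this. The table, column, and index names are hypothetical, and you’d pick one option or the other, not both.

```sql
/* Option 1: rebuild the heap in place.
   Nonclustered indexes get rebuilt too, since the RIDs change. */
ALTER TABLE dbo.StagingHeap REBUILD;

/* Option 2: stop having a heap.
   Nonclustered indexes get rebuilt to carry the clustered index key. */
CREATE CLUSTERED INDEX cx_StagingHeap
    ON dbo.StagingHeap (Id);
```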

Your next question is: Erik, how do I pick a clustered index? We’ll talk about that tomorrow!

Thanks for reading!