Query Tuning Archives | Page 15 of 30

Starting SQL: How SQL Server Stores Data In Nonclustered Indexes

Posted on August 18, 2020May 14, 2022 by Erik Darling

But What Happens When…

People may tell you to always put the most selective column first in a query, but selectivity depends on more than just what values are in a column.

It also depends on how columns are searched, doesn’t it? If people are using inequalities, like >, >=, < , <= then having a totally unique value on every row becomes a bit less helpful.

Likewise, if people can search IN() or NOT IN, NULL or NOT NULL, or even if perhaps the data in a column is only selective for some values, then selectivity can be a whole lot less selective.

Beyond that, it ignores a whole world of considerations around how you’re grouping or ordering data, if your query is a top (n) with an order by, and more.

Before we go jumping off on such wild adventures, let’s talk a little bit about multi-key indexes. It’s easy enough to visualize a single column index putting data in order, but multi-key indexes present a slightly different picture.

Janitorial

Single-column clustered indexes make a lot of sense. Single column nonclustered indexes often make less sense.

It’s sort of like the difference between a chef knife and a Swiss Army knife. You want one to be really good at one specific task, and another to be pretty useful to a bunch of tasks.

Will a Swiss Army knife be the best wine opener you’ve ever owned? No, but it’s a whole lot easier than trying to get a cork out with a cleaver, and it can also be a screwdriver, a pair of scissors, and open a beer bottle for your less industrious friends who can’t quite muster the strength to cope with a twist-off cap.

That multi-tool ability comes at a bit of a cost, too. There’s no such thing as a free index column (unless the table is read only).

All those columns have to be maintained when you modify table data, of course.

And there’s another thing: every key column in the index is dependent on the column that comes before it. Rather than try to word-problem this for you, let’s just look at some demos.

Withdrawals

Let’s say we’ve got this index which, albeit simple, is at least more than a single column. Congratulations, you’ve graduated.

CREATE INDEX joan_jett
ON dbo.Posts
(
    PostTypeId, Score
);

If we write queries like this, we’ll be able to use it pretty efficiently.

SELECT p.Id, p.PostTypeId, p.Score
FROM dbo.Posts AS p
WHERE p.PostTypeId = 7
AND   p.Score = 1;

I’m skipping over a little bit now, because data is mightily skewed in the PostTypeId column towards a couple of quite-common values. I’ll get to it, though.

For now, marvel at the simplicity and Seekiness of this plan.

Now let’s try to find data in the Score column without also searching on the PostTypeId column.

SELECT p.Id, p.PostTypeId, p.Score
FROM dbo.Posts AS p
WHERE p.Score = 999;

A couple things changed, here. We had to scan through the index to find Scores we’re interested in, and the optimizer thought that this would be a process-intensive enough task to use multiple CPU cores to do it.

Okay then.

Age Of Reason

If you’ve been kicking around databases for a little bit, you may have read about this before, or even seen it in action when writing queries and creating indexes.

What I’d like to do is try to offer an explanation of why that happens the way it does: Columns within an index are not ordered independently.

In other words, you don’t have all of the PostTypeIds in ascending order, and then all of the Scores in ascending order. You do have all the PostTypeIds in ascending order, because it’s the leading column, but Scores are only in ascending order after PostTypeId.

A simple query gets illustrative enough results.

SELECT p.PostTypeId, p.Score
FROM dbo.Posts AS p
WHERE p.PostTypeId IN (1, 2)
AND   p.Score BETWEEN 1950 AND 2000
ORDER BY p.PostTypeId, p.Score;

Note how the ordering of Score resets when you cross a value boundary in PostTypeId?

We can see that in action with other queries, too.

Then People Stare

Here are three queries, and three plans.

SELECT TOP (1000) p.Id, p.PostTypeId, p.Score
FROM dbo.Posts AS p
ORDER BY p.Score; --Score isn't stored in order independently


SELECT TOP (1000) p.Id, p.PostTypeId, p.Score
FROM dbo.Posts AS p
ORDER BY p.PostTypeId; --PostTypeId is the leading column, though


SELECT TOP (1000) p.Id, p.PostTypeId, p.Score
FROM dbo.Posts AS p
ORDER BY p.PostTypeId,
         p.Score; --Score is in order within repeated PostTypeId values

Only that first query, where we try to order by Score independently needs to physically sort data. They all use the same index, but that index doesn’t store Score in perfect ascending order, unless we first order by PostType Id.

In tomorrow’s post, we’ll mix things up a little bit and design an index for a tricky query.

Thanks for reading!

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Starting SQL: Designing Indexes For SQL Server Queries

Posted on August 17, 2020May 14, 2022 by Erik Darling

Sweet Like

Clustered indexes are great, really. Usually. Okay, if they’re bad it’s probably your fault.

Did you really need 10 columns in the key? Did you have to make it on a NVARCHA5(512)?

No. You messed that up. By that I mean the royal you. All of you.

The thing is, they’re of limited overall value for searching for data.

SQL Server Management Studio Table — searchy baby

If every single join and where clause has this Id column in it, we might be okay. But the odds of that being the case are slim to none.

As soon as we want to search by any other columns without searching for a specific Id, we’re toast. That data doesn’t exist in a helpful order for searching.

I know I’ve mentioned it before, but that’s what indexes do to make data easier to find: they put it in order. Ascending, descending. It’s up to you.

The Meaning Of Life

There are two main parts of a nonclustered index: key columns, and included columns.

Sure, there’s other stuff you can do with them, like make them unique, or add filters (where clauses) to them, but we’ll talk about that later.

For now, feast your eyes on the majesty of the nonclustered index create statement.

Look at all that. Can you believe how much faster is can make your queries?

Let’s talk about how that works.

Those Keys

If you want to visualize stuff key columns can help in a query, you can almost draw a Fibonacci whatever on top of it.

Much of the time, it makes sense to focus on the where clause first.

There will of course be times you’ll need to break from that, but as a general design pattern this is a helpful starting place. The stuff index key columns tend to help is under the from clause. That’s because these parts of the query are most often helped by having data in a pertinent order.

Sometimes things above the from clause can be improved above the from, when there’s an aggregate or windowing function involved, but those considerations are more advanced and specialized.

Inclusion Conclusion

Included columns are helpful for queries, because you can have a single index be the source of data for a query. No need for lookups, and fewer optimizer choices.

But included columns aren’t ordered the way key columns are. They’re kinda like window dressing.

Sure, you can use them to find data, it’s just less efficient without the ordering. You can think of them like all the non-key columns in your clustered index.

Some good uses for includes:

Columns only in the select list
Non-selective predicates
Columns in filter definitions

Includes, though, are the place where I see people go overboard. Thinking back a little, if you’re selecting long lists of columns from wide tables, the optimizer might suggest very wide indexes to compensate for that.

The wider your index definitions are, the higher your chances of modification queries needing to touch them are.

It’s a bit like a game of Battleship. The bigger your indexes get, the more of the board they take up, and the more likely it is you’re gonna get hit by one of those little plastic peg torpedoes.

this is from wikipedia. thanks, wikipedia.

Baby Teeth

We know we need indexes, and now we’ve got a rough idea of which parts of the index can help which part of our query.

Next, we’ll look at some of the deeper intricacies of index design, like the column-to-column dependencies that exist in row store indexes.

Thanks for reading!

Going Further

The Trillion Row Operator In SQL Server

Posted on August 12, 2020May 16, 2022 by Joe Obbish

SQL Server friend and excellent webmaster Mr. Erik C. Darling showed me a query plan gone wrong:

SQL Server Query Plan 48 billion rows for a single operator is certainly a large number for most workloads. I ended up completely missing the point and started wondering how quickly a query could process a trillion rows through a single operator. Eventually that led to a SQL performance challenge: what is the fastest that you can get an actual plan with at least one operator processing a trillion rows? The following rules are in play:

Start with no user databases
Any query can run up to MAXDOP 8
Temp tables may be created and populated but all such work needs to finish in 30 seconds or less
A RECOMPILE query hint must be present in the final query
Undocumented features and behavior are fair game

Unless stated otherwise here, all testing was done on SQL Server 2017 CU20. The processor of my local machine is an Intel i7-9700K. If you’re following along at home you may see different results depending on your hardware. The queries are extremely CPU intensive. If you’d like to skip ahead to the winning query then scroll down the bottom.

Loop Join

A natural way to generate a lot of rows quickly is to cross join together two equally sized tables. Throwing a million rows into a temporary table takes very little time. Finding the count of the result set should meet the challenge requirements as long as the right optimizer transforms are not available. An initial attempt for data prep:

DROP TABLE IF EXISTS #loop_test;

CREATE TABLE #loop_test (
ID TINYINT NOT NULL
);

INSERT INTO #loop_test WITH (TABLOCK)
SELECT TOP (1000000) CAST(0 AS TINYINT) ID
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON #loop_test (ID) WITH FULLSCAN, NORECOMPUTE;

Along with the query to run:

SELECT COUNT_BIG(*)
FROM #loop_test t1 WITH (TABLOCK)
CROSS JOIN #loop_test t2 WITH (TABLOCK)
OPTION (RECOMPILE, MAXDOP 8);

The query returns the expected result of 1 trillion, but there is no operator that processes one trillion rows:

SQL Server Query Plan

The query optimizer uses the LocalAggBelowJoin rule to perform an aggregate before the join. This brings down the total query estimated cost from 1295100 units to 6.23274 units for quite a savings. Trying again after disabling the transform:

SELECT COUNT_BIG(*)
FROM #loop_test t1 WITH (TABLOCK)
CROSS JOIN #loop_test t2 WITH (TABLOCK)
OPTION (RECOMPILE, MAXDOP 8, QueryRuleOff LocalAggBelowJoin);

Results in a new query plan:

SQL Server Query Plan

This query meets the requirements of the challenge. The stream aggregate, nested loops, and row count spool operators will process 1 trillion rows each. However, it is unsatisfactory from a performance point of view. The row mode stream aggregate has a significant profiling penalty compared to a batch mode hash aggregate alternative. It is simple enough to create a dummy table to add batch mode eligibility for the query plan:

DROP TABLE IF EXISTS #cci;
CREATE TABLE #cci (ID INT, INDEX CCI CLUSTERED COLUMNSTORE)

SELECT COUNT_BIG(*)
FROM #loop_test t1 WITH (TABLOCK)
CROSS JOIN #loop_test t2 WITH (TABLOCK)
LEFT OUTER JOIN #cci ON 1 = 0
OPTION (RECOMPILE, MAXDOP 8, QueryRuleOff LocalAggBelowJoin);

The new plan has a nice batch mode operator to do the counting:

SQL Server Query Plan

With an actual plan request, the new query processes rows at roughly twice the rate as the row mode query. Perhaps query profiling does its work once per batch for batch mode operators and once per row for row mode operators. Each batch has around 900 rows so profiling might be much cheaper per row for the hash aggregate compared to the stream aggregate. Strictly speaking, the goal of this challenge is to optimize the amount of time it takes to get an actual plan. This might lead to different tuning results than optimizing for query completion time without an actual plan request.

The final time for the batch mode query is 2 hours and 12 minutes. I uploaded an actual plan to pastetheplan.com. The query has a CPU to elapsed time ratio of only 5.37. That means on that average during query execution, more than 2.5 CPU cores were not doing useful work. Getting all eight CPU cores to perform work as much as possible during query execution should allow the query to finish faster. The CPU imbalance in this case is caused by how rows are distributed to threads on the outer side of the nested loop operator:

SQL Server Threads

The query engine has different algorithm choices to perform a parallel scan of a table. In this case, it doesn’t pick an algorithm that leads to optimal performance. The nested loop operator in total drives 42717 seconds of CPU work. It’s important to have that work as balanced between threads as possible. In this situation it is helpful to reduce the number of rows per page. To understand why I recommend watching this Pass Summit presentation by Adam Machanic. This results in better row distribution between threads and theoretically better performance. The following code only allows for a single row per data page:

DROP TABLE IF EXISTS #wide_1_million_rows;

CREATE TABLE #wide_1_million_rows (
ID TINYINT NOT NULL,
FILLER VARCHAR(4200) NOT NULL
);

INSERT INTO #wide_1_million_rows WITH (TABLOCK)
SELECT TOP (1000000) CAST(0 AS TINYINT) ID, REPLICATE('Z', 4200) FILLER
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON #wide_1_million_rows (ID) WITH FULLSCAN, NORECOMPUTE;

The new query plan generates in 1:29:40 with a CPU ratio of the query is 7.977 which is near perfect. Rows are quite balanced between threads:

SQL Server Threads

I don’t know of a method to further optimize the nested loop join query. Most of the query’s execution time is spent on profiling. Sampled stacks from 15 seconds of query execution:

ETW tracing tools such as PerfView along with internals knowledge can be used to more finely break down how CPU is used by this query. Almost 75% of the query’s execution time is spent on sqlmin!CQScanProfileNew:GetRowImp and on methods called by it. The query only takes 23 minutes to complete without an actual plan which is quite close to 25% of the runtime with an actual plan. If you’re interested in more examples of using PerfView to draw conclusiosn about SQL Server consider checking out the list of links in this stack exchange answer.

Merge Join

A cross join can only be implemented as a nested loop join but the same end result can be emulated with a merge join. Let’s start with testing a MAXDOP 1 query. Setup:

DROP TABLE IF EXISTS #cci;
CREATE TABLE #cci (ID INT, INDEX CCI CLUSTERED COLUMNSTORE)

DROP TABLE IF EXISTS #merge_serial;

CREATE TABLE #merge_serial (
ID TINYINT NOT NULL
);

INSERT INTO #merge_serial WITH (TABLOCK)
SELECT TOP (1000000) 0 ID
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

CREATE CLUSTERED INDEX CI ON #merge_serial (ID);

The query to run:

SELECT COUNT_BIG(*)
FROM #merge_serial t1 WITH (TABLOCK)
INNER JOIN #merge_serial t2 WITH (TABLOCK) ON t1.ID = t2.ID
LEFT OUTER JOIN #cci ON 1 = 0
OPTION (RECOMPILE, MERGE JOIN, MAXDOP 1, QueryRuleOff LocalAggBelowJoin);

I did not allow the query to run to completion because it would have taken around 25 hours to complete. On an 8 core machine with merge join, there’s no reason to expect that a parallel query could do any better than 25/8 = 3.125 hours. This makes merge join an inferior option compared to loop join so there isn’t a good reason to pursue further approaches that use a merge join.

For completeness, the best merge plan time I was able to get was 5 hours and 20 minutes. The plan avoids the scalability issues found with parallel merge and order preserving repartition streams. The end result is still poor. The merge join simply consumes a lot of CPU to do its work. The additional overhead added by the loop join is not helpful.

Row Mode Hash Join

Perhaps we’ll have more luck with a hash join compared to a merge join. After all, the query optimizer assigns a lower estimated cost for a row mode hash join compared to a merge join for the trillion row scenario. What could go wrong? The first thing that goes wrong is that the addition of the empty CCI will lead to a batch mode hash join for this query. Avoiding that requires changing the join to a type that doesn’t allow for batch mode execution. Among other methods, this can be accomplished by putting the hash join on the inner side of a nested loop or changing the join to also match on NULL. The second technique will be used at first because it leads to significantly less overhead compared to the loop join approach.

Starting again with a MAXDOP 1 test:

DROP TABLE IF EXISTS #cci;
CREATE TABLE #cci (ID INT, INDEX CCI CLUSTERED COLUMNSTORE)

DROP TABLE IF EXISTS #hash_maxdop_1;
CREATE TABLE #hash_maxdop_1 (
	ID TINYINT NULL
);

INSERT INTO #hash_maxdop_1 WITH (TABLOCK)
SELECT TOP (1000000) 0 
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON #hash_maxdop_1 (ID) WITH FULLSCAN, NORECOMPUTE;

The test query has a more complex join condition:

SELECT COUNT_BIG(*)
FROM #hash_maxdop_1 t1 WITH (TABLOCK)
INNER HASH JOIN #hash_maxdop_1 t2 WITH (TABLOCK) ON t1.ID = t2.ID OR (t1.ID IS NULL AND t2.ID IS NULL)
LEFT OUTER JOIN #cci ON 1 = 0
OPTION (MAXDOP 1, RECOMPILE)

Note how the join now includes rows where both sides are NULL. There are no such rows in the table but the query optimizer does not know that so a row mode join is used. The query completed in 5.5 hours. This means that a parallel hash join approach could be competitive with the nested loop join. An optimistic time estimate for a parallel query is 41 minutes.

Getting a parallel query isn’t as simple as changing MAXDOP from 1 to 8 will not have the desired effect. Take a look at the following query plan born from such an attempt:

SQL Server Query Plan

The repartition stream operators have a partitioning type of hash. That means that a hashing algorithm is used to assign rows to different threads. All of the rows from the table have the same column value, so they will all hash to the same value and will end up on the same thread. Seven CPUs will have no work to do even though the query technically executes at DOP 8. This will not lead to performance that’s better than executing at DOP 1.

Through testing I found that the following values for a nullable TINYINT all go to different threads: 0, 1, 4, 5, 7, 9, 14, and 17. Note that I’m relying on undocumented details here that might change between releases. Inserting 250k rows of each value into one table and 500k rows of each value into another table should result in 8 * 250000 * 500000 = 1 trillion rows total. Code to do this is below:

DROP TABLE IF EXISTS #cci;
CREATE TABLE #cci (ID INT, INDEX CCI CLUSTERED COLUMNSTORE)

DROP TABLE IF EXISTS #row_hash_join_repartition_8_build;
CREATE TABLE #row_hash_join_repartition_8_build (
	ID TINYINT NULL
);

INSERT INTO #row_hash_join_repartition_8_build
SELECT TOP (2000000) CASE ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) % 8 
WHEN 0 THEN 0
WHEN 1 THEN 1
WHEN 2 THEN 4
WHEN 3 THEN 5
WHEN 4 THEN 7
WHEN 5 THEN 9
WHEN 6 THEN 14
ELSE 17
END
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON #row_hash_join_repartition_8_build (ID) WITH FULLSCAN, NORECOMPUTE;


DROP TABLE IF EXISTS #row_hash_join_repartition_8_probe;
CREATE TABLE #row_hash_join_repartition_8_probe (
	ID TINYINT NULL
);

INSERT INTO #row_hash_join_repartition_8_probe
SELECT TOP (4000000) CASE ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) % 8 
WHEN 0 THEN 0
WHEN 1 THEN 1
WHEN 2 THEN 4
WHEN 3 THEN 5
WHEN 4 THEN 7
WHEN 5 THEN 9
WHEN 6 THEN 14
ELSE 17
END
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON #row_hash_join_repartition_8_probe (ID) WITH FULLSCAN, NORECOMPUTE;

Along with the query to run:

SELECT COUNT_BIG(*)
FROM #row_hash_join_repartition_8_build t1 WITH (TABLOCK)
INNER JOIN #row_hash_join_repartition_8_probe t2 WITH (TABLOCK) ON t1.ID = t2.ID OR (t1.ID IS NULL AND t2.ID IS NULL)
LEFT OUTER JOIN #cci ON 1 = 0
OPTION (RECOMPILE, HASH JOIN, MAXDOP 8, QueryRuleOff LocalAggBelowJoin);

The query plan appeared after 42 minutes with a CPU ratio of 7.7. This was a better result than I was expecting. The repartition stream operators don’t seem to have much of a negative effect on overall runtime.

For completeness, I also tested a parallel hash join plan without the repartition stream operators. All of the hash join work is on the inner side of a nested loop join. The loop join adds significant overhead so this ends up being a losing method. If you’d like to see for yourself, here is one way to set up the needed tables:

DROP TABLE IF EXISTS #cci;
CREATE TABLE #cci (ID INT, INDEX CCI CLUSTERED COLUMNSTORE)

DROP TABLE IF EXISTS #hash_join_demand_62500_rows;
CREATE TABLE #hash_join_demand_62500_rows (
	ID TINYINT NOT NULL
);

INSERT INTO #hash_join_demand_62500_rows WITH (TABLOCK)
SELECT TOP (62500) ISNULL(CAST(0 AS TINYINT), 0) ID
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON #hash_join_demand_62500_rows (ID);

As well as one way to generate such a query:

SELECT COUNT_BIG(*)
FROM
(VALUES
	  (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
	, (0),(0),(0),(0),(0),(0),(0),(0)
) v(v)
CROSS JOIN (
	SELECT 1 c
	FROM #hash_join_demand_62500_rows t1 WITH (TABLOCK)
	INNER HASH JOIN #hash_join_demand_62500_rows t2 WITH (TABLOCK) ON t1.ID = t2.ID
) ca
LEFT OUTER JOIN #cci ON 1 = 0
OPTION (RECOMPILE, FORCE ORDER, MAXDOP 8, MIN_GRANT_PERCENT = 99, NO_PERFORMANCE_SPOOL);

It might be difficult to understand what’s going on here, so here’s a picture of the query plan:

SQL Server Query Plan

Work is split up into 256 buckets. Each row fed into the outer side of the nested loop join results in 62500 rows hash joined with 62500 rows.

The disappointing final result takes about 2 hours of runtime. This query might perform better on a busy system compared to the simple parallel hash join because work is assigned to threads somewhat on demand. In this case, the repartition streams query query requires equal work to be completed on all threads. However, the testing environment only has a single query running at a time.

The query pattern here is similar to a parallel apply. Perhaps you’ve heard good things about that pattern and are wondering why it didn’t work out here. In my experience, that query pattern does best when an order preserving repartition streams operator can be avoided, there’s a supporting index on the inner side, and the row count coming out of the inner side is relatively low. The above query is 0 for 3.

Batch Mode Hash Join

I’ll be honest. I expected batch mode hash join to be the winner. I was wrong. Batch mode parallelism works in a fundamentally different way than row mode parallelism. There is no repartition streams operator which assigns rows to different threads based on a hash or other algorithm. Rather, each operator takes care of parallelism by itself. In my experience this is generally quite advantageous for both performance and scalability, but there can be drawbacks.

We intend to get a batch mode hash join, so it seems logical enough to join together a CCI with one million rows. Sample code to do that:

DROP TABLE IF EXISTS #batch_mode_hash_equal_sized;

CREATE TABLE #batch_mode_hash_equal_sized (
ID TINYINT NOT NULL,
INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO #batch_mode_hash_equal_sized WITH (TABLOCK)
SELECT TOP (1000000) CAST(0 AS TINYINT) ID
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON #batch_mode_hash_equal_sized (ID) WITH FULLSCAN, NORECOMPUTE;

The query to be run no longer needs the join to the empty CCI table:

SELECT COUNT_BIG(*) 
FROM #batch_mode_hash_equal_sized c1
INNER JOIN #batch_mode_hash_equal_sized c2 ON c1.ID = c2.ID
OPTION (RECOMPILE, MAXDOP 8, QueryRuleOff LocalAggBelowJoin);

An actual plan is generated on my machine in about 27.5 minutes. While clearly better than all other attempts so far, the CPU to elapsed ratio suggests that significant improvement is possible: it is only 4.6 out of a possible 8.0. Batch mode processing is generous enough to provide a helpful wait stat. There is about 5607 seconds of the HTDELETE wait. Put simply, some of the threads of the hash join ran out of work to do and waited for the other threads to finish their work. As I understand it with batch mode joins, the probe side is what matters in terms of work imbalance. Row count per thread of the probe side:

SQL Server Threads

The million row CCI table is split up into 9 chunks of rows which are divided between 8 threads which leads to many threads having no work to do for a while. Quite unfortunate. Oddly enough, switching to MAXDOP 9 could lead to much better performance for this query. That isn’t an available option on my local machine so I switched to testing on a VM with Intel 8280 sockets. The theory works out: the MAXDOP 8 query takes 50 minutes to complete and the MAXDOP 9 query takes 25 minutes to complete. At MAXDOP 9 the row distribution is quite nice on the probe side:

SQL Server Threads

The rules for the different algorithms available to do parallel scans on CCIs are not documented. I previously investigated it here. Interestingly enough, this is an example where an exchange operator could make a positive difference. However, the batch mode hash join doesn’t have that as an option. While there are various tricks to achieve even row distribution from a CCI, it seems simplest to just replace the tables with row store. Converting 2 million rows from row mode to batch mode won’t add any measurable overhead. Using the same table that was used for the best loop join time:

SELECT COUNT_BIG(*) 
FROM #wide_1_million_rows c1
INNER JOIN #wide_1_million_rows c2 ON c1.ID = c2.ID
LEFT OUTER JOIN #cci ON 1 = 0
OPTION (RECOMPILE, MAXDOP 8, QueryRuleOff LocalAggBelowJoin)

The new query plan finishes in just under 16 minutes. Row distribution per thread on the probe side is much more even:

SQL Server Threads

You might be thinking that 16 minutes is a pretty good result. I too initially thought that.

The Winner

All of the previous approaches spent the vast majority of their time performing a join to generate rows from smaller tables. The scans of the underlying tables themselves were quite fast. Perhaps a better approach would be to avoid the join entirely. For example, consider inserting a billion rows into a table and scanning that table 1000 times with UNION ALL. That would send one trillion rows to a single operator without the expense of a join. Of course, the downside is that SQL Server would need to read one trillion rows total from a table. It would be important to make the reading of rows as efficient as possible. The most efficient method I know is to create a columnstore table with extreme compression.

The title of this section gives it away, but let’s start with a small scale test to see if the approach has merit. Code to load 100 million rows into a CCI:

DROP TABLE IF EXISTS #count_test;

CREATE TABLE #count_test (
ID TINYINT NOT NULL,
INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO #count_test WITH (TABLOCK)
SELECT TOP (100000000) 0 ID
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON #count_test (ID);

SQL Server is able to compress this data quite well because there’s just a single unique value. The table is only 1280 KB in size. The data prep code does exceeds the 30 second time limit on my machine due to the statistics creation, but I’ll ignore that for now because it’s just a trial run. I can avoid all types of early aggregation with the following syntax:

SELECT COUNT_BIG(*)
FROM
(
	SELECT ID
	FROM #count_test
	UNION ALL
	SELECT ID
	FROM #count_test
) q
OPTION (RECOMPILE, MAXDOP 8, FORCE ORDER)

Picture of the query plan:

SQL Server Query Plans

On my machine, that query takes about 41 ms to process 200 million rows. Based on that, it seems reasonable to expect that a query that processes a trillion rows could finish in under four minutes. There is a balancing act to perform here. Putting more rows into the table makes it harder to complete data prep within 30 seconds, but cuts down on the number of UNION ALLs performed which reduces query compilation time.

The approach that I settled on was to load 312.5 million rows into a table. Reading that table 3200 times results in a trillion total rows. 3200 is a convenient number to achieve with nested CTEs and I’m able to load 312.5 million rows quite reliably under 30 seconds even with MAXDOP 4. I can’t say that this is the best possible approach but it seems to work fairly well.

The last detail in terms of data prep is to deal with statistics creation. SQL Server forces a sample rate of 100% for very small tables. Unfortunately, extremely compressed columnstore indexes can fall into the 100% range even if they have a high row count. Creating a statistic object with a sample of 312.5 million rows will take longer than 30 seconds. I don’t need good statistics for query performance so I work around the issue by creating a statistic with NORECOMPUTE after an initial load of 500k rows. Code to create and populate a CCI with 312.5 million rows:

DROP TABLE IF EXISTS #CCI_1M_ONE_COLUMN;

CREATE TABLE #CCI_1M_ONE_COLUMN (ID TINYINT NOT NULL, INDEX I CLUSTERED COLUMNSTORE);

DECLARE @insert_count TINYINT = 0;
WHILE @insert_count < 8
BEGIN
	INSERT INTO #CCI_1M_ONE_COLUMN
	SELECT TOP (125000) 0
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
	OPTION (MAXDOP 1);

	SET @insert_count = @insert_count + 1;
END;

DROP TABLE IF EXISTS #C;

CREATE TABLE #C (I TINYINT NOT NULL, INDEX I CLUSTERED COLUMNSTORE);

INSERT INTO #C WITH (TABLOCK)
SELECT TOP (500000) 0
FROM #CCI_1M_ONE_COLUMN
OPTION (MAXDOP 1);

CREATE STATISTICS S ON #C (I) WITH NORECOMPUTE, FULLSCAN;

WITH C13 AS (
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
	UNION ALL
	SELECT ID FROM #CCI_1M_ONE_COLUMN
)
, C312 AS
 (
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
	UNION ALL
	SELECT ID FROM C13
)
INSERT INTO #C WITH (TABLOCK)
SELECT 0
FROM C312
OPTION (MAXDOP 4, FORCE ORDER, QueryRuleOff GenLGAgg, QueryRuleOff EnforceBatch, QueryRuleOff GbAggToStrm);


DBCC TRACEON(10204);

ALTER INDEX I ON #C REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);
ALTER INDEX I ON #C REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);

DBCC TRACEOFF(10204);

The above code isn’t as efficient as possible but it gets the job done. It also provides a preview of the syntax that is used to query the table 3200 times. A nested CTE approach is used instead of writing out UNION ALL 3199 times. Trace flag 10204 is used because there’s no need to act on rows outside of the delta rowgroups.

Query compile time can be quite significant for this query. On an older PC while running these queries, I could determine when the query compile ended by a change in fan noise. The new additions to the query hints are there to minimize query compile time as much as possible. This is important when a table is referenced 3200 times in a query. I’m told that there are undocumented trace flags which can reduce query compile time even more but I’ll leave that as an exercise to the reader. Here’s the query text of the UNION ALL approach:

WITH C8 AS (
	SELECT I FROM #C WITH (TABLOCK)
	UNION ALL
	SELECT I  FROM #C WITH (TABLOCK)
	UNION ALL
	SELECT I FROM #C WITH (TABLOCK)
	UNION ALL
	SELECT I FROM #C WITH (TABLOCK)
	UNION ALL
	SELECT I FROM #C WITH (TABLOCK)
	UNION ALL
	SELECT I FROM #C WITH (TABLOCK)
	UNION ALL
	SELECT I FROM #C WITH (TABLOCK)
	UNION ALL
	SELECT I FROM #C WITH (TABLOCK)
)
, C160 AS
 (
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
	UNION ALL
	SELECT I FROM C8
)
, C3200 AS
(
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
	UNION ALL
	SELECT I FROM C160
)
SELECT COUNT_BIG(*)
FROM C3200
OPTION (RECOMPILE, MAXDOP 8, FORCE ORDER, QueryRuleOff GenLGAgg, QueryRuleOff EnforceBatch, QueryRuleOff GbAggToStrm);

The good news is that generating an actual plan takes about 3 minutes on my machine. The bad news is that uploading a 27 MB query plan is a lot more difficult than I thought. I ended up loading it to my personal website. You can view all of the XML in its glory here.

For a method more friendly to my personal bandwidth, you can look at the screenshot below:

SQL Server Query Plan

I don’t know of a more faster way to complete the challenge. Without an actual plan, the query takes about 63 seconds to complete. However, the query takes about 3 minutes to complete on SQL Server 2019, even without requesting an actual plan. The difference is caused by lightweight query profiling in SQL Server 2019, which is turned on by default. I find query profiling to be very useful in general for evaluating performance issues in real time and for estimating how long a query will take to complete. However, it can still add significant overhead for certain types of queries. The difference is quite apparent in the call stacks:

This is why I did all of my testing for this blog post on SQL Server 2017, in case you’ve been wondering that.

Final Thoughts

Summary of all attempts:

The results are quite specific to the test scenario: they should not be used to draw conclusions about general join performance. As mentioned earlier, in a few cases the bottleneck is the query profiling infrastructure instead of typical aspects of query performance. If anyone out there knows of a faster technique than those tested here, I would be delighted to learn about it.

Thanks for reading!

Going Further

Starting SQL: Quick Notes About Heaps In SQL Server (Tables Without Clustered Indexes)

Posted on August 12, 2020May 14, 2022 by Erik Darling

Nein Nein Nein

I know, I said we’d talk about indexes, and heaps are basically the opposite of indexes.

Even if you have a heap with a nonclustered primary key, or nonclustered indexes hanging around, the base table is still a heap.

Why? It’s all about that clustered index.

Do you always need a clustered index? Sometimes not. But if you’re running an OLTP workload, you probably do.

It’s not that heaps can’t be useful, it’s just that their best use-case is still loading the big datas.

Downsides Of Heaps

Heaps have a couple issues that are going to sound a lot like index fragmentation. In fact, they’re a lot like index fragmentation.

Forwarded fetches from updates
Empty pages from deletes

Forwarded Fetches

Forwarded fetches are a lot like logical fragmentation, which is when pages are out of order. Instead of having clustered index keys to follow, we have pointers from where the row used to be to where it got moved to.

This happens because updates change a value that makes a row not fit on a page anymore.

Heaps compound the issue a little bit though, because you can’t seek into a heap to avoid forwarded fetches (though you can seek into any index created on top of a heap). You can seek into a clustered index to avoid logical fragmentation.

Empty Space

Empty space from deletes goes a little bit further, too. You can end up with entire pages being empty, if queries that delete data don’t successfully escalate locks to the table level. Worse, those empty pages get read when the heap is scanned.

Sure, clustered indexes can end up with a bunch of empty space on pages, but when pages are totally emptied out they will get deallocated. Heaps don’t do that without a table level lock during the delete.

The thing is that both of these only happen to the heap itself. If you’ve got nonclustered indexes on the heap and queries mostly use those, you may not notice either of these problems. Perhaps a little bit if you do bookmark lookups.

When They Might Matter

This is largely a size thing. The bigger your heaps are, the more pronounced these problems can be.

It also depends a bit on what heap tables get used for. Or more importantly, who uses them.

If they’re staging tables that end users don’t touch, fixing them probably won’t solve a big problem for you outside of ETL.

But if they’re transactional tables where end users are complaining about performance, you need to do two things: check the query plans for table scans, and then check the heaps for forwarded fetches and deletes.

To fix them, you can rebuild the heap, or create a clustered index. In either case, be very careful if you’ve got nonclustered indexes on the table, because they’ll need to all get rebuilt to either fix the RIDs in a heap, or add the clustered index keys.

Your next question is: Erik, how do I pick a clustered index? We’ll talk about that tomorrow!

Thanks for reading!

Going Further

Starting SQL: Why SQL Server’s Optimizer Doesn’t Always Choose The Same Index

Posted on August 10, 2020May 14, 2022 by Erik Darling

What’s With Widths?

We don’t need to add a ton of columns to our query to have index usage change, but we do need to go back in time a little bit.

Here’s our query now, with just a one day difference in the where clause.

SELECT p.CreationDate,
       COUNT_BIG(*) AS records
FROM dbo.Posts AS p
WHERE p.CreationDate >= '20131015'
AND   p.OwnerUserId = 22656
GROUP BY p.CreationDate;


SELECT p.CreationDate,
       COUNT_BIG(*) AS records
FROM dbo.Posts AS p
WHERE p.CreationDate >= '20131016'
AND   p.OwnerUserId = 22656
GROUP BY p.CreationDate;

Not quite Halloween, but I’ll take the spooky factor.

The other difference is that now we’ve got the OwnerUserId column in there, which isn’t in our nonclustered index.

It’s in the where clause, not the select list, but if we added it there it would have a similar effect on the query. Either way, we have to do something with this new column, and we have to get it from somewhere.

CREATE INDEX CreationDate ON dbo.Posts(CreationDate);

Things Are Looking Up

The query plans for these will look a little bit different.

If you’re not familiar with what just happened here, a Key Lookup is a trick the optimizer has up its sleeve. It uses two indexes from the same table to satisfy one query.

We find rows in our nonclustered index that qualify for the filter on CreationDate. That’s a smart start!

Then we join the nonclustered index to the clustered index to find any columns we might need. Remember the clustered index has all the table columns in it.

Stuff like this is made possible by nonclustered indexes inheriting clustered index key columns. Crazy, right?

The Point At This Point

There are many internal details to explore around Key Lookups. There are even some interesting things about how clustered index keys get stored in nonclustered indexes.

What you need to know about Lookups right now is what they are (which we talked about), and that they represent a choice the optimizer has when it comes to index usage.

If you create a narrow index, say one that satisfies some part of the query like the join or where clause, but doesn’t fully contain all of the columns referenced in your query, it may not get used reliably. The usage is decided based on cardinality estimates. The more rows SQL Server expects, the less likely it is that your narrow index will get used.

For example, it may only get used when the value for CreationDate is estimated to return a small-ish number of rows. Parameterization and plan re-use can make this even more confusing.

Next, we’ll look at how we can encourage the optimizer to choose narrow indexes, and the problems we might run into.

Thanks for reading!

Going Further

Starting SQL: Messing Up A SQL Server Query’s Performance

Posted on August 7, 2020May 14, 2022 by Erik Darling

Your Fault

Yesterday we created an index, and things went pretty well. Good for us. By next week we’ll have six pack abs.

Today we’re going to look at things that I see happen pretty often in queries that mess with effciency.

If one were to sit and think carefully about the way B-Tree indexes are implemented, where key columns define order, one may see these problems coming a mile away.

Then again, one might expect a full formed datababby system to be able to figure some of these things out and use indexes appropriately anyway.

General Anti-Patterns

I’ve posted this on here a number of times, but here it goes again.

Things that keep SQL Server from being able to seek:

Function(Column) = …
Column + Column = …
Column + Value = …
Value + Column = …
Column = @Value or @Value IS NULL
Column LIKE ‘%…’
Some implicit conversions (data type mismatches)

Seeks aren’t always necessary, or even desirable, and likewise scans aren’t always bad or undesirable. But if we’re going to give our queries the best chance of running well, our job is most often to give the optimizer every opportunity to make the right decisions. Being the optimizer is hard enough without us grabbing it by the nose and poking it in the eyes.

To fix code, or make code that looks like that tolerable to The Cool DBA Kids™, there are some options like:

Computed columns
Dynamic SQL
Rewrites

Different options work well in different scenarios, of course. And since we’re here, I might as well foreshadow a future post: These patterns are most harmful when applied to the leading key column of an index. When they’re residual predicates that follow seek predicates, they generally make less of a difference. But we’re not quite there yet.

The general idea, though, is that as soon as we write queries in a way that obscure column data, or introduce uncertainty about what we’re searching for, the optimizer has a more difficult time of things.

Do It Again

Let’s compare a couple different ways of writing yesterday’s query. One good, one bad (in that order).

SELECT p.CreationDate,
       COUNT_BIG(*) AS records
FROM dbo.Posts AS p
WHERE p.CreationDate >= '20131225'
GROUP BY p.CreationDate;

SELECT p.CreationDate,
       COUNT_BIG(*) AS records
FROM dbo.Posts AS p
WHERE YEAR(p.CreationDate) = 2013
AND   MONTH(p.CreationDate) = 12
AND   DAY(p.CreationDate) >= 25
GROUP BY p.CreationDate;

The mistake people often make here is that they think these presentation layer functions have some relational meaning. They don’t.

They’re presentation layer functions. Let’s see those execution plans. Maybe then you’ll believe me.

Things are not so hot when we pile a Mess Of Functions™ into the where clause, are they?

I mean, our CPUs are hot, but that’s generally not what we’re after.

The Use Of Indexes

We could still use our index. Many people will talk about functions preventing the use of indexes, but more precisely we just can’t seek into them.

But you know what can prevent the use of nonclustered indexes? Long select lists.

Next time, we’ll look at that.

Thanks for reading!

Going Further

Starting SQL: Measuring A SQL Server Query’s Performance Metrics

Posted on August 6, 2020May 14, 2022 by Erik Darling

Outlaw Buttons

Yesterday, we started with a query, and wanted to add an index to make it faster. Remember, indexes help queries find and relate data by putting it in order.

In the weird old days, we might use SET STATISTICS TIME, IO ON; to measure a query. If you’re still using old (unsupported) versions of SQL Server, I understand if you have to continue to use those.

For those of us living in the present (and near past), we can get the detail we need from actual execution plans.

Look, you’re probably getting those anyway if you’re trying to tune a query with any seriousness. You might as well make the most of them.

Something New

When most people look at execution plans, they neglect one of the most important parts: operator properties.

Once you got a plan in front of you, you can either highlight an operator and hit F4, or right click on one and choose “Properties”. A little window pane should appear over to the right in SSMS, and it’ll have a whole bunch of useful information in it that doesn’t show up in the tool tips you get when you hover over operators.

What you can see over there in detail is information about reads and operator time.

SQL Server Query Plan Properties — signs and measures

Interpreting Stuff

Logical reads is the number of 8KB pages the query read through. I used to put a lot of stock in tuning queries to do fewer reads, but I’ve run into too many situations where a faster query did more reads. I no longer consider it to be a great indicator of performance being better.

If logical reads incidentally go down, great. If not, whatever.

The reason this is better to get from execution plans is because if you use the SET STATISTICS command, it only tells you reads at the table level, not at the index level. If you have more complicated query plans that use multiple indexes from the same table, you can’t really tell which did what. Getting it from the query plan is easier to interpret.

Down the bottom is some more useful information about CPU and elapsed time. I know what you’re thinking: that’s incredible, I can see how long each operator runs for. And that’s true; you can. But it’s a little more complicated depending on what kind of plan you’re looking at.

If the plan is all row mode operators, time is cumulative going from right to left
If the plan is all batch mode, time is per operator
If the plan is a mix, then the time will also be mixed

It’s a little confusing, I know. But that confusion is going to become more common as you nice people start using SQL Server 2019 and seeing Batch Mode On Rowstore. Otherwise, you’ll really only see Batch Mode when columnstore is involved.

If you use this feature enough, you’ll undoubtedly run into rowmode query plans where the times are inconsistent, especially around parallel exchanges. Don’t take it personally — accounting for parallel thread and operator times is difficult. Things will most often even out as you read through the plan.

For more detail on timing stuff, check out my videos here and here on it.

For Real Though

Let’s look at our query with and without an index.

CREATE INDEX CreationDate ON dbo.Posts(CreationDate);

SELECT p.CreationDate,
       COUNT_BIG(*) AS records
FROM dbo.Posts AS p
WHERE p.CreationDate >= '20131225'
GROUP BY p.CreationDate;

It’s pretty easy to see what got better by looking at the execution plans.

With an index on CreationDate, we avoid the clustered index scan and the need to go parallel to compensate for not having an opportune index.

We can seek easily to the dates we care about, because they’re ordered in a way that makes it easy to find them.

Okay, Great

Now we know how to figure out if we did something right when adding an index.

Tomorrow we’ll look at ways you can write your queries that might interfere with indexes being used efficiently.

Thanks for reading!

Going Further

Starting SQL: SQL Server Indexes Store Key Column Data In Order

Posted on August 5, 2020May 14, 2022 by Erik Darling

Index Skit (0:39)

The most common types of indexes you’re going to see in your life are of the rowstore variety.

As much as I love columnstore, it’s probably not something you’re going to see a whole lot outside of data warehouses. Sure, some folks will have reporting over OLTP, and might find some utility for them, but they can be tough to manage with all those tiny modifications.

Let’s stick with the stuff that’ll help you the most: clustered and nonclustered rowstore indexes.

There are important things to know about indexes, and we’ll get more in-depth later on. For now, let’s talk about how they can help a query.

Clustered

Our table has a clustered index on it, which is also playing the part of a primary key. The primary key attribute makes it unique, of course. By default, if you create a primary key, it’ll also be used as the clustered index key. If you only create a clustered index, it won’t be unique by default.

Let’s not get bogged down there, though. Here’s our index.

We can identify clustered indexes and which columns are in them pretty easily in SSMS.

To simplify a bit, the clustered index is all these columns:

Ordered by this column:

That’s Great But

Having the Id column in order doesn’t help us find data in other columns, because they’re not in order.

Let’s say we wanted to find posts by CreationDate. The values for it aren’t in an order that helps us search through them.

If our query is only concerned with the CreationDate column we can create a single-column index on it. As queries become more complicated and involve more columns, we need to consider wider indexes sometimes so that they stand a better chance of getting used, but we’ll come back to that later.

Here’s our overly-simple query.

SELECT p.CreationDate,
       COUNT_BIG(*) AS records
FROM dbo.Posts AS p
WHERE p.CreationDate >= '20131225'
GROUP BY p.CreationDate;

And here’s how the optimizer decides to answer our query.

We have to scan all of the data pages in the clustered index looking for CreationDates that match our predicate.

Make It Plain

It’s not such a crazy idea to create additional indexes to speed up queries, but how exactly do they do that?

What is it about indexes that magically make queries go faster? According to the title, they put data in order, so let’s go with that.

It’s easy enough to create a helpful index here.

CREATE INDEX CreationDate ON dbo.Posts(CreationDate);

Tomorrow, we’ll look at ways to see if our index gets used, and different ways to measure if it improves our query.

But What Happens When…

Janitorial

Withdrawals

Age Of Reason

Then People Stare

Going Further

Sweet Like

The Meaning Of Life

Those Keys

Inclusion Conclusion

Baby Teeth

Going Further

Loop Join

Merge Join

Row Mode Hash Join

Batch Mode Hash Join

The Winner

Final Thoughts

Going Further

Nein Nein Nein

Downsides Of Heaps

Forwarded Fetches

Empty Space

When They Might Matter

Going Further

What’s With Widths?

Things Are Looking Up

The Point At This Point

Going Further

Your Fault

General Anti-Patterns

Do It Again

The Use Of Indexes

Going Further

Outlaw Buttons

Something New

Interpreting Stuff

For Real Though

Okay, Great

Going Further

Index Skit (0:39)

Clustered

That’s Great But

Make It Plain

Going Further

awks ox

Going Further

WELL.

Going Further