columnstore – Page 4 – Darling Data

A Columnstore Compression Magic Trick For SQL Server

Posted on November 20, 2017May 16, 2022 by Joe Obbish

Columnstore compression is complicated, and in some cases, surprising.

The Setup

The source data for the CCI has enough rows to fit six perfect rowgroups. The ID column is just sequential integers from 1 to 6291456. The ID2 column is the ID column modulo 20001. Code to load the data into a temp table:

DROP TABLE IF EXISTS #STG_DATA;
CREATE TABLE #STG_DATA (
	ID BIGINT NOT NULL,
	ID2 BIGINT NOT NULL,
	PRIMARY KEY (ID)
);

INSERT INTO #STG_DATA WITH (TABLOCK)
SELECT t.RN, t.RN % 20001
FROM
(
	SELECT TOP (6 * 1048576) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) t;

Here’s the table definition for the target CCI:

DROP TABLE IF EXISTS dbo.TARGET_CCI;
CREATE TABLE dbo.TARGET_CCI (
	ID2 BIGINT NOT NULL,
	ID BIGINT NOT NULL,
	INDEX CCI CLUSTERED COLUMNSTORE
);

The reversal of column order is important for the demo below.

Changing MAXDOP

First let’s load the ID2 column the temp table into the CCI. The order of data can matter for compression so I have a superfluous TOP expression to force SQL Server to read the data in clustered key order.

INSERT INTO dbo.TARGET_CCI WITH (TABLOCK)
SELECT TOP (9876543210) ID2, 0
FROM #STG_DATA
ORDER BY ID
OPTION (MAXDOP 1);

The insert query takes 2765 ms of CPU time and 2771 ms of elapsed time on my machine. According to sys.dm_db_column_store_row_group_physical_stats each rowgroup has a size of 2098320 bytes:

a20_maxdop_1_rg_dmv

Now let’s move on to a parallel insert query with MAXDOP 2. The purpose of the second column in the CCI is to make the insert go parallel on my machine. It’s possible that you’ll need to use trace flag 8649 or some other trick to get a parallel insert. Here’s the code that I ran:

TRUNCATE TABLE dbo.TARGET_CCI;

INSERT INTO dbo.TARGET_CCI WITH (TABLOCK)
SELECT TOP (9876543210) ID2, 0
FROM #STG_DATA
ORDER BY ID
OPTION (MAXDOP 2);

The insert query now takes 3594 ms of CPU time and 2112 ms of elapsed time on my machine. The size of each rowgroup did not change. It’s still 2098320 bytes. Even though this is a parallel query there’s no element of randomness in this case. In the query plan we can see that the source table was scanned in a serial zone and round robin distribution is to used to distribute exactly half of the rows to each parallel thread.

a20_parallel_insert

This seems like a reasonable plan given that TOP forces a serial zone and we need to preserve order. It’s possible to rewrite the query to encourage a parallel scan of the source table, but that would introduce an order-preserving gather streams operator.

I’m not satisfied with the runtime yet, so I’m going to bump up MAXDOP to 3:

TRUNCATE TABLE dbo.TARGET_CCI;

INSERT INTO dbo.TARGET_CCI WITH (TABLOCK)
SELECT TOP (9876543210) ID2, 0
FROM #STG_DATA
ORDER BY ID
OPTION (MAXDOP 3);

The insert query now takes 114172 ms of CPU time and 39208 ms of elapsed time to execute. However, each rowgroup now is just 54496 bytes.

a20_maxdop_3_rg_dmv

The INSERT took significantly longer than before, but we have 38X better compression compared to the table after the MAXDOP 2 query. What happened?

Revealing the Magic Trick

An interesting pattern for compressed data sizes appears when working with repeated integers for a single rowgroup. The query that I tested with was roughly of the following format:

INSERT INTO dbo.CCI
SELECT t.RN % @MOD_NUM
FROM
(
	SELECT TOP (@ROWS_INSERTED)
		ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) t;

Below is a contour plot that shows how the compressed size for a single rowgroup varies as the number of rows and the modulus value changes:

a20_contour_size

Values that are repeated 64 or more times seem to be compressed much better than other values. This pattern definitely doesn’t always hold as you add more columns to the table which is why I made the ID2 column the first column in the target CCI. Why is this pattern relevant to the previous example?

Consider the MAXDOP 1 insert query. With a full rowgroup of 1048576 rows a value will be repeated at most 1048576/20001 = 53 times in each rowgroup. It doesn’t cross the threshold of 64 so we end up with a compressed size of 2098320 bytes.

Now consider the MAXDOP 2 insert query. The ordered data from the scan is distributed using round robin distribution on two threads. For the first 20001 rows from the scan, thread 0 gets all even values and thread 1 gets all odd values. For the next 20001 rows, thread 0 gets all odd values and thread 1 gets all even values. This occurs because 20001 isn’t divisible by 2. For all six compressed rowgroups we end up with the same data distribution as we had when doing MAXDOP 1 inserts. It makes sense that the compressed size remained at 2098320 bytes.

Now consider the MAXDOP 3 insert query. The query still uses round robin distribution but there are now three threads. 20001 is divisible by 3 so thread 0 only ends up with 6667 unique values from 0, 3, … to 19999. Thread 1 also ends up with 6667 unique values from 1, 4, … to 20000. Thread 2 follows a similar pattern. Each compressed rowgroup only has 6667 unique values instead of 20001. Each value shows up at least 157 times in the rowgroup, so all of the data qualifies for much better compression.

Final Thoughts

This has absolutely no practical value. Thanks for reading!

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

ROWGROUP_FLUSH Deadlocks In SQL Server Column Store Indexes

Posted on November 13, 2017May 16, 2022 by Joe Obbish

We recently observed many ROWGROUP_FLUSH deadlocks while doing concurrent inserts into CCIs. I’m not really a concurrency kind of guy but I figured that I should blog about this just so other people with the same problem can find some information about it.

Deadlock Reproduction

The schedulers of the involved sessions are important in some way, especially when going for a simple reproduction. It’s easiest to just make all new sessions go the same CPU:

ALTER SERVER CONFIGURATION
SET PROCESS AFFINITY CPU = 0;

Obviously you should never do that in production. After affinity has been addressed I recommend creating a nearly empty source table and a new CCI table:

DROP TABLE IF EXISTS dbo.CCI_DEADLOCKED;
CREATE TABLE dbo.CCI_DEADLOCKED (
	COL VARCHAR(1500),
	INDEX CCI CLUSTERED COLUMNSTORE
);

CREATE TABLE ##SOURCE_IDS (ID BIGINT NOT NULL);

INSERT INTO ##SOURCE_IDS WITH (TABLOCK)
SELECT TOP (1048576) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

One way to see the deadlock is to quickly kick off two inserts into the CCI_DEADLOCKED table from different sessions. Inserting a larger amount of data means that you’ll have more time to kick off the second session before the first completes, but a longer rollback time on the first session. On my machine inserting 1048576 rows of VARCHAR(1500) data seems like a reasonable compromise:

INSERT INTO dbo.CCI_DEADLOCKED
SELECT REPLICATE('Z', 1500)
FROM ##SOURCE_IDS
OPTION (MAXDOP 1, MAX_GRANT_PERCENT = 0);

The second session waits on the first with a LCK_M_IX wait event. The first session loads all of its rows into the delta store, then deadlocks and rolls them all back. You can see this happen in near real time by looking at sys.dm_db_column_store_row_group_physical_stats:

a19_disappearing_delta_store_rows

Here’s the deadlock XML for those who are interested in that kind of thing:

unknown

INSERT INTO dbo.CCI_DEADLOCKED SELECT REPLICATE('Z', 1500) FROM ##SOURCE_IDS OPTION (MAXDOP 1, MAX_GRANT_PERCENT = 0);

unknown

INSERT INTO dbo.CCI_DEADLOCKED SELECT REPLICATE('Z', 1500) FROM ##SOURCE_IDS OPTION (MAXDOP 1, MAX_GRANT_PERCENT = 0);

SSMS can't produce a deadlock graph for this type of deadlock. Below is the non-copy-and-pastable error message from it:

Failed to initialize deadlock control.
There is an error in XML document (1, 2497).
Instance validation error: 'ROWGROUP_FLUSH' is not a valid value for hobtlockSubresource.

Plan Explorer from SentryOne can help us:

a19_deadlock_graph

If you're following along at home don't forget to reset your affinity to whatever you had it before. The most common option:

ALTER SERVER CONFIGURATION
SET PROCESS AFFINITY CPU = AUTO;

The Workarounds

We've only observed this deadlock with multiple concurrent sessions insert to the delta store for the same target CCI due to server memory pressure or very low cardinality estimates (less than 251 rows). The correct mitigation depends on why you're seeing the issue in the first place. If you're seeing it due to low cardinality estimates then fix your estimates, or at the very least get them above 250 rows. If you're seeing them because the memory grant for the CCI build times out after 25 seconds then lower concurrency or increase server memory.

The problem can also be avoided by not doing concurrent inserts in the first place. In some cases a parallel insert may be a reasonable alterative. There's also some evidence that the deadlock is only seen when the number of rows for insert is very close to 1048576, but we weren't able to make any definitive conclusions around that.

Final Thoughts

Don't despair if you run into a ROWGROUP_FLUSH deadlock! There's probably something you can do in the application to avoid it. If you feel that you shouldn't have to take such measures feel free to vote for my connect item here.

Going Further

If this is the kind of SQL Server stuff you love learning about, you'll love my training. I'm offering a 75% discount to my blog readers if you click from here. I'm also available for consulting if you just don't have time for that and need to solve performance problems quickly.

Surprise Delta Stores In SQL Server’s Column Store Indexes

Posted on November 6, 2017May 14, 2022 by Joe Obbish

This post contains all of the possible causes for delta store creation that I’ve found. I cannot say with certainty that it’s a complete list, but some of them may be new or unexpected to the reader.

Why Care about Delta Stores?

Microsoft and many others will be quick to tell you that loading data into CCIs is much faster when you can bypass the delta store. In SQL Server 2016 and beyond, delta stores are uncompressed rowstore mini-tables that serve as a temporary holding data until the data can be compressed into columnar format. They’re good when you have a trickle of data to load into a CCI, but bad in all possible ways for a data warehouse workload.

Reviewing the Documentation

I briefly reviewed the documentation written by Microsoft concerning the appearance of delta stores. Here’s a quote:

Rows go to the deltastore when they are:
Inserted with the INSERT INTO VALUES statement.
At the end of a bulk load and they number less than 102,400.
Updated. Each update is implemented as a delete and an insert.

There are also a few mentions of how partitioning can lead to the creation of multiple delta stores from a single insert. It seems as if the document is incomplete or a little misleading, but I admit that I didn’t exhaustively review everything. After all, Microsoft hides columnstore documentation all over the place.

Test Data

The source data for the CCI inserts is fairly uninteresting. I put four rowgroups worth of rows into a rowstore table with a BIGINT column and a randomly generated VARCHAR(16) value.

DROP TABLE IF EXISTS dbo.STAGING_TABLE;

CREATE TABLE dbo.STAGING_TABLE (
	ID BIGINT NOT NULL,
	STR1 VARCHAR(16) NOT NULL,
	PRIMARY KEY (ID)
);

INSERT INTO dbo.STAGING_TABLE WITH (TABLOCK)
SELECT TOP (4 * 1048576)
  ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
, LEFT(CAST(NEWID() AS VARCHAR(36)), 16)
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

The columns for the table definition for the CCI were chosen to cover all of the demos except for the partitioning one. Your fact table definitions probably don’t look like this.

DROP TABLE IF EXISTS dbo.DELTA_STORE_DUMPING_GROUND;
CREATE TABLE dbo.DELTA_STORE_DUMPING_GROUND (
	ID BIGINT NULL,
	STR1 VARCHAR(100) NULL,
	STR2 VARCHAR(100) NULL,
	STR3 VARCHAR(100) NULL,
	STR1_MAX VARCHAR(MAX) NULL,
	INDEX CCI CLUSTERED COLUMNSTORE
);

Not Enough Rows For Bulk Load

The first reason for delta creation is well known and understood on SQL Server 2016. If you insert fewer than 102400 rows then SQL Server will not attempt to skip the delta store. This behavior is by design. The following query does not do a bulk load:

INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND (ID)
SELECT TOP (102399) ID
FROM dbo.STAGING_TABLE
ORDER BY ID
OPTION (MAXDOP 1);

We can see the delta store that was just created with the following query:

SELECT
    row_group_id
  , state_desc
  , total_rows
--, trim_reason_desc
--, deleted_rows
--, partition_number
FROM sys.dm_db_column_store_row_group_physical_stats rg
INNER JOIN sys.tables t ON rg.OBJECT_ID = t.OBJECT_ID
WHERE t.name = 'DELTA_STORE_DUMPING_GROUND';

The results:

The other examples in this post use similar queries to get information about the newly added rowgroups to the table. They will be omitted for brevity. Simply inserting one row results in the delta store getting skipped:

TRUNCATE TABLE dbo.DELTA_STORE_DUMPING_GROUND;

INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND (ID)
SELECT TOP (102400) ID
FROM dbo.STAGING_TABLE
ORDER BY ID
OPTION (MAXDOP 1);

Now the rowgroup is compressed:

The rules change slightly in SQL Server 2017 with support of VARCHAR(MAX) and other LOB columns in columnstore. The delta store can be skipped with an insert of as few as 251 rows. Whether or not you write to the delta store depends on the amount of data being written. Below is one query that still writes to the delta store:

TRUNCATE TABLE dbo.DELTA_STORE_DUMPING_GROUND;

INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND (STR1_MAX)
SELECT TOP (251) REPLICATE(STR1, 40)
FROM dbo.STAGING_TABLE
ORDER BY ID
OPTION (MAXDOP 1);

Once again you can see the delta store:

Things are different if we increase the length of the inserted data. The query below writes to a compressed rowgroup and bypasses the delta store:

TRUNCATE TABLE dbo.DELTA_STORE_DUMPING_GROUND;

INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND (STR1_MAX)
SELECT TOP (251) REPLICATE(STR1, 500)
FROM dbo.STAGING_TABLE
ORDER BY ID
OPTION (MAXDOP 1);

The resulting rowgroup is compressed:

Removing just a single row from the insert brings us back to the delta store.

Inserting to Multiple Partitions

If a MAXDOP 1 INSERT query writes to multiple partitions then it could possibly write to multiple delta stores. The number of rows written to each partition is important as opposed to the total number of rows written to the table. Below I define a simple table with 2 partitions:

CREATE PARTITION FUNCTION CLUNKY_SYNTAX_1
(BIGINT)
AS RANGE LEFT
FOR VALUES (
  0
, 2000000
); 

CREATE PARTITION SCHEME CLUNKY_SYNTAX_2
AS PARTITION CLUNKY_SYNTAX_1
ALL TO ( [PRIMARY] );

DROP TABLE IF EXISTS dbo.PARTITIONED_DELTA_STORE;
CREATE TABLE dbo.PARTITIONED_DELTA_STORE (
ID BIGINT NULL,
INDEX CCI CLUSTERED COLUMNSTORE
) ON CLUNKY_SYNTAX_2 (ID);

The insert writes 200k rows to the CCI which you might expect to bypass the delta store, but since the rows are evenly spread over two partitions we end up with two delta stores:

INSERT INTO dbo.PARTITIONED_DELTA_STORE (ID)
SELECT ID
FROM dbo.STAGING_TABLE
WHERE ID BETWEEN 1900001 AND 2100000
OPTION (MAXDOP 1);

a18_dmv_5

With MAXDOP 8 INSERT queries and the maximum number of partitions defined on a table, it is possible to get 120000 delta stores. I don’t recommend doing this.

Bulk Insert Leftovers

Often applications will not insert an exact multiple of 1048576 rows. That means that rows can be left over after a few rowgroups worth of inserted rows are compressed. Those leftover rows can go into a delta store. Consider the following insert query that inserts 100000 rows more than 1048576:

TRUNCATE TABLE dbo.DELTA_STORE_DUMPING_GROUND;

INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND (ID)
SELECT TOP (1048576 + 100000) ID
FROM dbo.STAGING_TABLE
ORDER BY ID
OPTION (MAXDOP 1);

As expected, the final result is one compressed rowgroup of 1048576 rows and one delta store of 100k rows.

a18_dmv_6

If we inserted just a few thousand more rows than we’d end up with two compressed rowgroups.

Updates

UPDATE queries always write to the delta store. There are many other reasons to avoid UPDATES to CCIs if the application makes it possible to do so.

TRUNCATE TABLE dbo.DELTA_STORE_DUMPING_GROUND;

INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND (ID)
SELECT TOP (1048576) ID
FROM dbo.STAGING_TABLE
ORDER BY ID
OPTION (MAXDOP 1);

At first there’s just a single compressed rowgroup:

Now run the UPDATE query and go make coffee:

UPDATE DELTA_STORE_DUMPING_GROUND
SET ID = ID;

Our table doesn’t look so hot:

a18_dmv_8

In SQL Server 2016 the Tuple Mover will not clean up this table. Another row needs to be inserted into the table before the rowgroup is marked as CLOSED.

Parallel Insert

Many parallel queries have an element of randomess around how rows are distributed to parallel threads. Rows are not moved between threads after they flow to the part of the plan that performs the insert into the CCI. It’s possible to end up with a number of new delta stores equal to the number of parallel threads for the query. Let’s start with a parallel insert that moves 4 * 1048576 rows into the CCI:

TRUNCATE TABLE dbo.DELTA_STORE_DUMPING_GROUND;

INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND
WITH (TABLOCK) (ID)
SELECT ID
FROM dbo.STAGING_TABLE
OPTION (MAXDOP 4);

It’s possible to end up without any delta stores and the results of the query against sys.dm_db_column_store_row_group_physical_stats will vary, but generally you’ll get at least one:

a18_dmv_9

If we have unnaturally high beauty standards for our rowgroups we can rewrite the query to effectively force rows to be evenly distributed on all threads. The query below does this with a join to a derived table:

TRUNCATE TABLE dbo.DELTA_STORE_DUMPING_GROUND;

INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND
WITH (TABLOCK) (ID)
SELECT stg.ID
FROM
(
	VALUES
	(0 * 1048576 + 1, 1 * 1048576),
	(1 * 1048576 + 1, 2 * 1048576),
	(2 * 1048576 + 1, 3 * 1048576),
	(3 * 1048576 + 1, 4 * 1048576)
)
v (start_id, end_id)
INNER JOIN dbo.STAGING_TABLE stg ON
	stg.ID BETWEEN v.start_id and v.end_id
OPTION (MAXDOP 4);

Perfection:

a18_dmv_10

I know that you were looking forward to another image of a tiny table, but here’s the important part of the query plan for those who like that sort of thing:

a18_parallel_query_plan_1

Getting perfect rowgroups can also be accomplished by adding the TOP operator to the original query, but that adds a serial zone to the plan:

TRUNCATE TABLE dbo.DELTA_STORE_DUMPING_GROUND;

INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND
WITH (TABLOCK) (ID)
SELECT TOP (9999999999999999) stg.ID
FROM dbo.STAGING_TABLE stg
OPTION (MAXDOP 4);

The key here is the parallelism operator in the plan uses a round robin method for distributing rows:

a18_parallel_query_plan_2

Dictionary Pressure

In SQL Server 2016 the maximum size for a column dictionary is 16 MB. This limit is raised in SQL Server 2017 for VARCHAR(MAX) and similar columns. I’m not going to get into the details of dictionaries here but it suffices to say that columns with too many unique string columns can experience dictionary pressure. Dictionary pressure leads to compressed rows that are less than the perfect size of 1048576 rows. Let’s insert the STR1 column into the CCI this time:

TRUNCATE TABLE dbo.DELTA_STORE_DUMPING_GROUND;

INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND (STR1)
SELECT TOP (1048576) stg.STR1
FROM dbo.STAGING_TABLE stg
ORDER BY ID
OPTION (MAXDOP 1);

Due to dictionary pressure there’s now a delta store with about 73000 rows:

a18_dmv_11

We can see that the dictionary size for the column is close to the limit with the query below:

SELECT csd.entry_count, csd.on_disk_size
FROM sys.column_store_dictionaries csd
INNER JOIN sys.partitions p
    ON csd.partition_id = p.partition_id
INNER JOIN sys.tables t
    ON p.OBJECT_ID = t.OBJECT_ID
WHERE t.name = 'DELTA_STORE_DUMPING_GROUND'
AND csd.column_id = 2;

Here are the results:

Rowgroup Memory Pressure

The memory grant for CCI compression for an INSERT is calculated based on DOP and column definitions of target columns in the target table. The memory grant can be insufficient to get a full 1048576 rows into a compressed rowgroup depending on the table definition and the characteristics of the data getting loaded into the table. Consider an example in which data is loaded into three columns of the CCI:

TRUNCATE TABLE dbo.DELTA_STORE_DUMPING_GROUND;

INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND
(STR1, STR2, STR3)
SELECT TOP (1048576)
  LEFT(STR1, 10)
, LEFT(STR1, 5)
, LEFT(STR1, 6)
FROM
dbo.STAGING_TABLE stg
ORDER BY ID
OPTION (MAXDOP 1);

With the above syntax the memory grant is calculated from just the STR1, STR2, and STR3 columns. The memory grant of 171152 KB isn’t enough to avoid a delta store:

a18_dmv_12

Note that you may not see the same results on your machine due to the randomness of the source data. For my table and source data set, adding a single column and inserting NULL into it bumps the memory grant up enough to avoid memory pressure:

TRUNCATE TABLE dbo.DELTA_STORE_DUMPING_GROUND;

ALTER TABLE dbo.DELTA_STORE_DUMPING_GROUND
ADD MORE_MEMORY_PLZ VARCHAR(1) NULL;

INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND
(STR1, STR2, STR3, MORE_MEMORY_PLZ)
SELECT TOP (1048576)
  LEFT(STR1, 10)
, LEFT(STR1, 5)
, LEFT(STR1, 6)
, NULL
FROM
dbo.STAGING_TABLE stg
ORDER BY ID
OPTION (MAXDOP 1);

The compressed rowgroup contains 1048576 rows now that memory pressure has been addressed.

Cardinality Estimate Less Than 251 Rows

SQL Server won’t even ask for a memory grant if the cardinality estimate is less than 251 rows. Perhaps this is because the memory grant would be wasted unless at least 102400 rows were inserted into the table. There’s no second chance at a memory grant here, so it’s possible to insert millions of rows to delta stores. A TOP expression with a variable will default to a cardinality estimate of 100 rows, so this works nicely to show the behavior:

TRUNCATE TABLE dbo.DELTA_STORE_DUMPING_GROUND;

DECLARE @top_rows BIGINT = 1048576;
INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND (ID)
SELECT TOP (@top_rows) ID
FROM dbo.STAGING_TABLE
ORDER BY ID
OPTION (MAXDOP 1);

Despite inserting 1048576 rows we aren’t able to bypass the delta store:

The same behavior can be observed with a cardinality estimate of 250 rows. The OPTIMIZE FOR query hint is used to control the estimate.

TRUNCATE TABLE dbo.DELTA_STORE_DUMPING_GROUND;

DECLARE @top_rows BIGINT = 1048576;
INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND (ID)
SELECT TOP (@top_rows) ID
FROM dbo.STAGING_TABLE
ORDER BY ID
OPTION (MAXDOP 1, OPTIMIZE FOR (@top_rows = 250));

However, if I bump up the estimate by one more row a memory grant is given to the query and the delta store is bypassed:

TRUNCATE TABLE dbo.DELTA_STORE_DUMPING_GROUND;

DECLARE @top_rows BIGINT = 1048576;
INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND (ID)
SELECT TOP (@top_rows) ID
FROM dbo.STAGING_TABLE
ORDER BY ID
OPTION (MAXDOP 1, OPTIMIZE FOR (@top_rows = 251));

Under this scenario we’ve observed deadlocks when multiple sessions insert into delta stores from the same target table.

Extreme Server Memory Pressure

Memory grants for queries that insert into CCIs have a hardcoded timeout of 25 seconds. After 25 seconds they execute with required serial memory and always write to the delta store. In the query below I simulate memory pressure with a MAX_GRANT_PERCENT hint of 0:

TRUNCATE TABLE dbo.DELTA_STORE_DUMPING_GROUND;

INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND (ID)
SELECT TOP (1048576) ID
FROM dbo.STAGING_TABLE
ORDER BY ID
OPTION (MAXDOP 1, MAX_GRANT_PERCENT = 0);

The query always writes to the delta store. It cannot compress data without a memory grant.

Under this scenario we’ve observed deadlocks when multiple sessions insert into delta stores from the same target table.

Final Thoughts

It took forever to do the formatting for this one, so I hope that someone finds it useful.

Going Further

An Adaptive Join Regression In SQL Server

Posted on October 2, 2017May 14, 2022 by Joe Obbish

Adaptive joins are a new feature in SQL Server 2017. For adaptive join operators the decision to do a hash or loop join is deferred until enough input rows are counted. You can get an introduction on the topic in this blog post by Joe Sack. Dmitry Pilugin has an excellent post digging into the internals. The rest of this blog post assumes that you know the basics of adaptive joins.

Getting an Adaptive Join

It’s pretty easy to create a query that has an adaptive join in SQL Server 2017. Below I create a CCI with 100k rows and an indexed rowstore table with 400k rows:

DROP TABLE IF EXISTS dbo.MY_FIRST_CCI;

CREATE TABLE dbo.MY_FIRST_CCI (
	FILTER_ID_1 INT NOT NULL,
	FILTER_ID_2 INT NOT NULL,
	FILTER_ID_3 INT NOT NULL,
	JOIN_ID INT NOT NULL,
	INDEX CI CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.MY_FIRST_CCI WITH (TABLOCK)
SELECT TOP (100000)
  t.RN
, t.RN
, t.RN
, t.RN
FROM
(
	SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) t
OPTION (MAXDOP 1);

ALTER TABLE dbo.MY_FIRST_CCI REBUILD WITH (MAXDOP = 1);

CREATE STATISTICS S1 ON dbo.MY_FIRST_CCI (FILTER_ID_1)
WITH FULLSCAN;
CREATE STATISTICS S2 ON dbo.MY_FIRST_CCI (FILTER_ID_2)
WITH FULLSCAN;
CREATE STATISTICS S3 ON dbo.MY_FIRST_CCI (FILTER_ID_3)
WITH FULLSCAN;
CREATE STATISTICS S4 ON dbo.MY_FIRST_CCI (JOIN_ID)
WITH FULLSCAN;

DROP TABLE If exists dbo.SEEK_ME;

CREATE TABLE dbo.SEEK_ME (
	JOIN_ID INT NOT NULL,
	PADDING VARCHAR(2000) NOT NULL,
	PRIMARY KEY (JOIN_ID)
);

INSERT INTO dbo.SEEK_ME WITH (TABLOCK)
SELECT TOP (400000)
  ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
, REPLICATE('Z', 2000)
FROM master..spt_values t1
CROSS JOIN master..spt_values t2

CREATE STATISTICS S1 ON dbo.SEEK_ME (JOIN_ID)
WITH FULLSCAN;

The full scan stats are just there to show that there isn’t any funny business with the stats. The below query gets an adaptive join:

SELECT *
FROM dbo.MY_FIRST_CCI o
INNER JOIN dbo.SEEK_ME i ON o.JOIN_ID = i.JOIN_ID

It’s obvious when it happens in SSMS:

a16_sample_adaptive_join

It’s possible to get an adaptive join with even simpler table definitions. I created the tables this way because they’ll be used for the rest of this post.

Adaptive Threshold Rows

Unlike some other vendors, Microsoft was nice enough to expose the adaptive row threshold in SSMS when looking at estimated or actual plans:

a16_threshold

The adaptive join saves input rows to a temporary structure and acts as a blocking operator until it makes a decision about which type of join to use. In this example, if there are less than 80388.3 rows then the adaptive join will execute as a nested loop join. Otherwise it’ll execute as a hash join.

The adaptive threshold row count can change quite a bit based on the input cardinality estimate. It changes to 22680 rows if I add the following filter that results in a single row cardinality estimate:

WHERE o.FILTER_ID_1 = 1

It was surprising to me to see so much variance for this query. There must be some overhead with doing the adaptive join but I wouldn’t expect the tipping point between a loop and hash join to change so dramatically. I would expect it to be close to a traditional tipping point calculated without adaptive joins.

Traditional Tipping Point

Let’s disable adaptive joins using the 'DISABLE_BATCH_MODE_ADAPTIVE_JOINS' USE HINT and consider how an execution plan would look for this query:

SELECT *
FROM dbo.MY_FIRST_CCI o
INNER JOIN dbo.SEEK_ME i ON o.JOIN_ID = i.JOIN_ID
WHERE o.FILTER_ID_1 BETWEEN @start AND @end
OPTION (
RECOMPILE,
USE HINT('DISABLE_BATCH_MODE_ADAPTIVE_JOINS')
);

We should expect a hash join if the local variables don’t filter out as many rows. Conversely, we should expect a loop join if the local variables on FILTER_ID_1 filter out many rows from the table. There’s a tipping point where the plan will change from a hash join to a loop join if we filter out a single additional row . On my machine, the tipping point is between 48295 and 48296 rows:

The estimated costs for the two queries are very close to each other: 74.6842 and 74.6839 optimizer units. However, we saw earlier that the tipping point for an adaptive join on this query can vary between 22680 and 80388.3 rows. This inconsistency means that we can find a query that performs worse with adaptive joins enabled.

The Regression

After some trial and error I found the following query:

SELECT *
FROM dbo.MY_FIRST_CCI o
INNER JOIN dbo.SEEK_ME i ON o.JOIN_ID = i.JOIN_ID
WHERE o.FILTER_ID_1 BETWEEN 1 AND 28000
AND o.FILTER_ID_2 BETWEEN 1 AND 28000
AND o.FILTER_ID_3 BETWEEN 1 AND 28000
ORDER BY (SELECT NULL)
	OFFSET 100001 ROWS FETCH NEXT 1 ROW ONLY
OPTION (MAXDOP 1);

The ORDER BY stuff isn’t important. It’s there just to not send any rows over the network. Here’s the plan:

a16_regression

The query has a cardinality estimate of 10777.7 rows coming out of the MY_FIRST_CCI table. The adaptive join has a tipping point of 27611.6 rows. However, I’ve constructed the table and the filter such that 28000 rows will be sent to the join. SQL Server expects a loop join, but it will instead do a hash join because 28000 > 27611.6. With a warm cache the query takes almost half a second:

CPU time = 469 ms, elapsed time = 481 ms.

If I disable adaptive joins, the query finishes in less than a fifth of a second:

CPU time = 172 ms, elapsed time = 192 ms.

A loop join is a better choice here, but the adaptive row threshold makes the adaptive join pick a hash join.

Final Thoughts

This post contains only a single test query, so it’s no cause for panic. It’s curious that Microsoft made the adaptive join tipping so dependent on cardinality estimates going into the join. I’m unable to figure out the design motivation for doing that. I would expect the other side of the join to be much more important.

Thanks for reading!

Going Further

The SQL Server Trillion Row Table

Posted on August 16, 2017May 14, 2022 by Joe Obbish

I loaded one trillion rows into a nonpartitioned table just to see what would happen. Spoiler: bad things happen.

Hardware and Table Plan

I did all of my testing on my home desktop which has an i5-4670 CPU (quad core), 5 GB of RAM for SQL Server, and a Samsung SSD 850 EVO 1 TB. Obviously not ideal to do testing like this but it was enough to get the job done. In order to fit a trillion rows into a table I had to use a CCI. Rowstore tables simply don’t offer good enough compression. For example, consider a page compressed heap with nothing but NULL values. One million rows takes up 10888 KB of disk space:

DROP TABLE IF EXISTS dbo.TEST_HEAP;
CREATE TABLE dbo.TEST_HEAP (
ID BIGINT
) WITH (DATA_COMPRESSION = PAGE);

INSERT INTO dbo.TEST_HEAP WITH (TABLOCK)
SELECT TOP (1000000) NULL
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

EXEC sp_spaceused 'TEST_HEAP';

Therefore, such a table with a trillion rows would require around 10 TB of disk space. That won’t fit on a 1 TB HDD, but the better compression of CCIs gives us some options. Ultimately I decided on building completely full rowgroups of 1048576 rows with each rowgroup only storing a single value. Estimating space for the final table is difficult, but we can hopefully get an upper bound using the following code:

DROP TABLE IF EXISTS dbo.TRIAL_BALLOON;
CREATE TABLE dbo.TRIAL_BALLOON (
ID BIGINT,
INDEX CCI_TRIAL_BALLOON CLUSTERED COLUMNSTORE
) 

INSERT INTO dbo.TRIAL_BALLOON WITH (TABLOCK)
SELECT TOP (10 * 1048576) NULL
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

EXEC sp_spaceused 'TRIAL_BALLOON';

The table has 776 KB reserved space and 104 KB space for data. In the worst case we might expect the final table to require 75 GB space on disk.

Serial Population Strategy

My strategy for populating the table was to run the same code in four different SQL Server sessions. Each session grabs the next ID from a sequence and does a MAXDOP 1 insert of 1048576 rows into the CCI. With only four concurrent sessions I didn’t expect to run into any locking issues such as the mysterious ROWGROUP_FLUSH wait event. The sequence definition is about as simple as it gets:

DROP SEQUENCE IF EXISTS dbo.CCI_Sequence;
CREATE SEQUENCE dbo.CCI_Sequence AS BIGINT
START WITH 1
INCREMENT BY 1
NO CACHE;

Here’s the code that I used to add rowgroups to the table:

ALTER SERVER CONFIGURATION
SET PROCESS AFFINITY CPU=0;

DECLARE @i BIGINT

SET NOCOUNT ON;

IF EXISTS (SELECT 1 FROM ##stop_table)
BEGIN
	SET @i = 9999999999999;
END
ELSE
BEGIN
	SELECT @i = NEXT VALUE FOR dbo.CCI_Sequence;
END; 

WHILE @i <= 953674
BEGIN
	WITH NUM (n) AS (
	SELECT n
	FROM
	(
	VALUES
	 (@i),(@i),(@i),(@i),(@i),(@i),(@i),(@i)
	,(@i),(@i),(@i),(@i),(@i),(@i),(@i),(@i)
	,(@i),(@i),(@i),(@i),(@i),(@i),(@i),(@i)
	,(@i),(@i),(@i),(@i),(@i),(@i),(@i),(@i)
	) v(n)
	)
	INSERT INTO dbo.BIG_DATA
	SELECT n1.n
	FROM NUM n1
	CROSS JOIN NUM n2
	CROSS JOIN NUM n3
	CROSS JOIN NUM n4
	OPTION (MAXDOP 1);

	IF EXISTS (SELECT 1 FROM ##stop_table)
	BEGIN
		SET @i = 9999999999999;
	END
	ELSE
	BEGIN
		SELECT @i = NEXT VALUE FOR dbo.CCI_Sequence;
	END;
END;

The affinity stuff was to get the work spread out evenly over my four schedulers. Each session was assigned to a different CPU from 0-3. It was also important to run the following in a new session after all four sessions started working:

ALTER SERVER CONFIGURATION
SET PROCESS AFFINITY CPU=AUTO;

It wasn’t clear to me why that was needed to get good throughput. Perhaps part of the work of building a compressed rowgroup is offloaded to a system process?

The references to the ##stop_table temp table are just a way to pause the work as needed without skipping numbers in the sequence. I think that the code to generate 1048576 rows is fairly optimized. I did try to optimize it since this code was going to be run over 950000 times, but I still suspect that there was a better way to do it that I missed.

The jobs finished after about 2 days of running on 4 CPUs. That’s a rate of around 86 million rows per core per minute which I was pretty happy with. After the sessions finished I needed to do one final, very nervous, insert into the table:

INSERT INTO dbo.BIG_DATA
SELECT TOP (331776) 953675
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

Running sp_spaceused has never felt so satisfying:

Under a GB for one trillion rows. Not bad.

Parallel Population Strategy

Taking advantage of natural parallelism within SQL Server may also be a viable method to populate a trillion row table (especially on an actual server), but I didn’t test it fully. In order to get perfect rowgroups, you need round robin parallelism with a driving table that’s a multiple of the query’s MAXDOP. For example, here’s a suitable plan:

a11_CCI_parallel_insert

The Constant Scan contains exactly four rows because I’m running MAXDOP 4. The source CCI has exactly 1048576 rows. The parallel insert happens without a repartition streams so we end up with exactly 1048576 rows on each thread and in each compressed rowgroup. Below is one way to generate such a plan:

DROP TABLE IF EXISTS dbo.SOURCE_CCI;
CREATE TABLE dbo.SOURCE_CCI (
ID BIGINT,
INDEX CCI_SOURCE_CCI CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.SOURCE_CCI WITH (TABLOCK)
SELECT TOP (1048576) 0
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

DROP TABLE IF EXISTS dbo.TRIAL_BALLOON;
CREATE TABLE dbo.TRIAL_BALLOON (
ID BIGINT,
INDEX CCI_TRIAL_BALLOON CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.TRIAL_BALLOON WITH (TABLOCK)
SELECT  driver.n
FROM (
SELECT TOP (4) v.n
FROM (
	VALUES
		(1),(2),(3),(4)
	) v(n)
) driver
INNER JOIN dbo.SOURCE_CCI sc ON sc.ID < driver.n
OPTION (MAXDOP 4, NO_PERFORMANCE_SPOOL);

With nothing else running on my machine I estimate that this method would take about 2 days to complete. The problem is that if one core is busy with something else, such as watching terrible Youtube videos, then the entire insert could be slowed down.

Updating Stats

To do anything interesting on a table we want statistics. Gathering statistics for extremely compressed data can be challenging in SQL Server. That is because the target sampled rate is based on the total size of the table as opposed to the number of rows in the table. Consider an 8 billion row table built in the same way as the one trillion row table above. SQL Server generates the following query to gather sampled stats against the table:

SELECT StatMan([SC0])
FROM (
SELECT TOP 100 PERCENT [ID] AS [SC0]
FROM [dbo].[BIG_DATA] WITH (READUNCOMMITTED)
ORDER BY [SC0]
) AS _MS_UPDSTATS_TBL
OPTION (MAXDOP 1)

You may notice the lack of TABLESAMPLE as well as the MAXDOP 1 hint. To gather sampled stats SQL Server will get all eight billion rows from the table, sort them, and build the statistics object using the eight billion rows. On my machine, this took almost 3 hours to complete and tempdb grew to 85 GB.

There is a trick to get more reasonable sampled stats. All that’s needed is to increase the table size while keeping the same data. Soft deletion of compressed rowgroups is a good way to accomplish this. First find a data distribution that doesn’t compress well in CCIs. Here’s one example:

SELECT 9999999 + SUM(RN) OVER (ORDER BY RN)
FROM (
	SELECT TOP (1048576) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) t
OPTION (MAXDOP 1);

Each rowgroup has a size of 8392 KB, so adding 100 rowgroups will add 839200 KB to the table. Deleting all of the newly added rows can take a little while and will log quite a bit to the transaction log, but the table size won’t change. Gathering sampled stats after the insert and delete took just a few seconds. The sample size was about 1% of the table. After a REORG the fully deleted rowgroups will be marked as TOMBSTONE and cleaned up by a background process.

For the one trillion row table I decided to roll the dice and go for sampled stats without any tricks. I gave tempdb a maximum size of 280 GB in order to not completely fill my hard drive. The stats update took 3 hours and 44 minutes. Surprisingly, the stat update grew tempdb to its maximum size but it didn’t fail. Perhaps I got very lucky. Here is the hard earned stats object:

a11_stats_object

Expected Query Performance

I expected reasonably fast query performance for queries designed to take advantage of rowgroup elimination. After all, the table was built in a way such that every compressed rowgroup only has a single value. I can get a count of rowgroups along with some other metadata in 20 seconds using the DMVs:

SELECT COUNT(*), MIN(css.min_data_id), MAX(css.max_data_id)
FROM sys.objects o
INNER JOIN sys.columns c ON o.object_id = c.object_id
INNER JOIN sys.partitions p ON o.object_id = p.object_id
INNER JOIN sys.column_store_segments css
    ON p.hobt_id = css.hobt_id
    AND css.column_id = c.column_id
WHERE o.name = 'BIG_DATA'
AND c.name = 'ID';

Getting the relevant segment_ids for a particular filter finishes in under a second:

SELECT segment_id
FROM sys.objects o
INNER JOIN sys.columns c ON o.object_id = c.object_id
INNER JOIN sys.partitions p ON o.object_id = p.object_id
INNER JOIN sys.column_store_segments css
    ON p.hobt_id = css.hobt_id
    AND css.column_id = c.column_id
WHERE o.name = 'BIG_DATA'
AND c.name = 'ID'
AND 500000 BETWEEN css.min_data_id AND css.max_data_id;

I’m dealing with DMVs so I would expect SQL Server to be able to do rowgroup elimination in a much more efficient way than the above queries. Therefore, 30 seconds seemed like a reasonable upper bound for the following query:

SELECT COUNT(*)
FROM dbo.BIG_DATA
WHERE ID = 500000;

The query takes over 15 minutes to complete despite reading only a single segment:

Table ‘BIG_DATA’. Scan count 4, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 6, lob physical reads 1, lob read-ahead reads 0.
Table ‘BIG_DATA’. Segment reads 1, segment skipped 953674.

SQL Server Execution Times:
CPU time = 905328 ms, elapsed time = 917478 ms.

The Evil Wait Event

The query had a max wait time of 915628 ms for QUERY_TASK_ENQUEUE_MUTEX. This is suspiciously close to the elapsed time of 917478 ms. Unfortunately this is a very unpopular wait event in the industry. The wait event library has almost no information about it as well.

I call this the evil wait event because while it’s happening queries cannot be canceled through SSMS and many unrelated queries won’t even run. Most of the time no useful work can be done on the instance. I can’t read Russian so I’m not sure what the wait event is about. After I restarted SQL Server the wait event no longer appeared as consistently, but query performance did not improve as far as I could tell.

Other Queries

I ran a few other tests queries as well, although I was limited in what I could do by the evil wait event. The following query is the only one that I found that wasn’t affected:

SELECT TOP 1 ID
FROM dbo.BIG_DATA;

For reasons I don’t understand the query still took a long time:

Table ‘BIG_DATA’. Scan count 1, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 6, lob physical reads 1, lob read-ahead reads 0.
Table ‘BIG_DATA’. Segment reads 1, segment skipped 0.

SQL Server Execution Times:
CPU time = 791625 ms, elapsed time = 811202 ms.

Counting the rows in the table took almost half an hour:

SELECT COUNT_BIG(*)
FROM dbo.BIG_DATA;

Statistics output:

Table ‘BIG_DATA’. Scan count 4, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 5722050, lob physical reads 819345, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 3734515 ms, elapsed time = 1635348 ms.

A query to sum every ID in the table took over 40 minutes:

SELECT SUM(ID)
FROM dbo.BIG_DATA;

Statistics output:

SQL Server Execution Times:
CPU time = 4810422 ms, elapsed time = 2433689 ms.

As expected performance gets worse without aggregate pushdown. Picking a simple query that’s not supported:

SELECT MAX(ID / 1)
FROM dbo.BIG_DATA;

This query took over an hour:

SQL Server Execution Times:
CPU time = 10218343 ms, elapsed time = 3755976 ms.

Queries against some of the CCIs DMVs don’t do very well either. A simple count took almost ten minutes:

SELECT COUNT(*)
FROM sys.dm_db_column_store_row_group_physical_stats
where OBJECT_ID = OBJECT_ID('BIG_DATA');

All of the work appears to be in the COLUMNSTORE_ROW_GROUPS table-valued function but I didn’t dig any more into it.

If you’re interested in how a query performs let me know in the comments. I will try anything that’s somewhat reasonable.

Final Thoughts

Now I can add working with trillion row tables to my resume. The compression for the one trillion row table was very impressive but everything else was decidedly less impressive. The very long QUERY_TASK_ENQUEUE_MUTEX wait times for nearly all queries were especially disappointing. I plan to do more testing at a later date with a partitioned table to see if that helps at all.

Thanks for reading!

Going Further

Rowgroup Elimination In SQL Server Column Store Indexes

Posted on August 7, 2017May 14, 2022 by Joe Obbish

Rowgroup elimination is a performance optimization based on compressed rowgroup metadata that can allow rowgroups to be skipped during query execution. It’s likely that all of the metadata used for the optimization is exposed in the sys.column_store_segments DMV. This blog post explores some of the less well known rules and limitations for rowgroup elimination.

Test Data

To keep things very simple we’ll build 100 rowgroups with exactly 1 million rows in each of them. ID and ID2 increase from 1 to 10000000 and ID_NULL is always NULL. Code to create and populate the table:

DROP TABLE IF EXISTS dbo.MILLIONAIRE_CCI;

CREATE TABLE dbo.MILLIONAIRE_CCI (
	ID BIGINT NULL,
	ID2 BIGINT NULL,
	ID_NULL BIGINT NULL,
	INDEX CCI_MILLIONAIRE_CCI CLUSTERED COLUMNSTORE
);

DECLARE @loop INT = 0;
BEGIN
	SET NOCOUNT ON;
	WHILE @loop < 100
	BEGIN
		INSERT INTO dbo.MILLIONAIRE_CCI WITH (TABLOCK)
		SELECT t.RN, t.RN, NULL
		FROM (
			SELECT TOP (1000000)
				(1000000 * @loop)
				+ ROW_NUMBER()
					OVER (ORDER BY (SELECT NULL)) RN
			FROM master..spt_values t1
			CROSS JOIN master..spt_values t2
			ORDER BY RN
		) t
		OPTION (MAXDOP 1);

		SET @loop = @loop + 1;
	END;
END;

We can expect very good rowgroup elimination on the ID and ID2 columns based on how we built them. That can be verified by calculating the REFF or by looking at sys.column_store_segments:

a10_not_null_DMV

Code to generate the above result set:

SELECT css.min_data_id, css.max_data_id, css.has_nulls
FROM sys.objects o
INNER JOIN sys.columns c ON o.object_id = c.object_id
INNER JOIN sys.partitions p ON o.object_id = p.object_id
INNER JOIN sys.column_store_segments css
    ON p.hobt_id = css.hobt_id
    AND css.column_id = c.column_id
INNER JOIN sys.dm_db_column_store_row_group_physical_stats s
    ON o.object_id = s.object_id
    AND css.segment_id = s.row_group_id
    AND s.partition_number = p.partition_number
WHERE o.name = 'MILLIONAIRE_CCI'
AND c.name = 'ID'
AND s.[state] = 3
ORDER BY css.min_data_id, css.segment_id;

Many of the test queries below select a single aggregate value. This isn’t done for any special reason other than to limit the size of the result set. The easiest way to see how many rowgroups were skipped is to use SET STATISTICS IO ON and that requires that the results be returned to the client.

Single Column Filtering

Consider the following query:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID IN (1000000, 2000001);

Based on how we built the segments for the ID column we might expect that only two segments will need to be read: segment 1 with ID values of 1-1000000 and segment 3 with ID values of 2000001-3000000. As usual, SQL Server does not care about our expectations:

Table ‘MILLIONAIRE_CCI’. Segment reads 3, segment skipped 97.

Why did the storage engine scan two segments instead of three? Running another test makes the problem more clear:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID IN (1, 100000000);

For this query we end up scanning the entire table:

Table ‘MILLIONAIRE_CCI’. Segment reads 100, segment skipped 0.

It seems as if the query optimizer reduces the predicate against the filtered column to be a range of IDs. That range of IDs is used for rowgroup elimination. In some cases it’s possible to write a WHERE clause that won’t return any rows but still isn’t eligible for rowgroup elimination. The storage engine is not able to skip any segments while executing the below query:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID < 0 OR ID > 100000000;

There isn’t an issue when the where clause is filtering on a contiguous range. For example, the following query skips 98 segments as expected:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID BETWEEN 1 AND 2000000;

There also isn’t an issue when filtering down to multiple values as long as those values are sufficiently close together, as shown with the first example. I also wasn’t able to find any liminations around the number of values in the IN clause. The query below reads 1 segment and skips 99 as we might hope:

SELECT MAX(l.ID)
FROM dbo.MILLIONAIRE_CCI l
WHERE l.ID IN (
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30
, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40
, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50
, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60
, 61, 62, 63, 64
);

If we add one more filter value then the query optimizer changes the plan to use a join:

a10_part1_to_join

The above query is eligible for rowgroup elimination but it follows slightly different ruless as covered later in this post.

It is possible to disable the transformation to a join by using the undocumented query hint QueryRuleOff SelToLSJ. With 976 entries in the IN clause I still get rowgroup elimination as expected. With 977 entries nothing was pushed to the scan at all, and we get a truly horrible plan:

a10_terrible_plan

This doesn’t appear to be a columnstore limitation. The same behavior can be observed with a clusted rowstore index.

Getting back on track, the internal calculation around which rowgroups to skip isn’t always as simple as calculating the minimum and maximum in the range and using those values to do elimination. It’s possible to end up with no rowgroup elimination even when the maximum and minimum ID in the WHERE clause are close to each other. Consider the following query:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID BETWEEN 1 AND 2
OR ID BETWEEN 2 AND 3;

The storage engine only has to read a single segment. We can see in the query plan that the optimizer was able to simplify the expression into something that happens to qualify for rowgroup elimination:

a10_part1_rewrite

Now consider the following query:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID BETWEEN 1 AND 2
OR ID BETWEEN 3 AND 4;

It would be helpful if the query optimizer changed the predicate to ID BETWEEN 1 AND 4 when doing calculations around which rowgroups can be skipped. This does not happen, and as a result all 100 rowgroups are scanned. Rowgroup elimination won’t be available when the WHERE clause is a sufficiently complicated mix of AND and OR conditions, even when filtering on just one column.

NULLs

Information about NULLs is stored internally and can be used for rowgroup elimination. SQL Server knows that none of the compressed segments for the ID column contain NULL, so the storage engine can skip all 100 segments for the following query:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID IS NULL;

Naturally, reversing the filter for this query will require the storage engine to scan the entire table.

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID IS NOT NULL;

We might expect that query to skip all segments if we change the filter column to ID_NULL. All rows in the rowgroups for ID_NULL are NULL and SQL Server ought to be aware of that fact. However, the storage engine still scans the entire table even for the query below:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID_NULL IS NOT NULL;

The DMV for ID_NULL doesn’t look as we might expect:

a10_NULL_DMV

sys.column_store_dictionaries has a value of 0 for the entry_count column. It seems likely that the fact that the segments only contain NULL can be deduced from information already tracked by SQL Server. Rowgroup elimination for IS NOT NULL may have not been added because it was thought to be too unlikely of a use case.

Filters on Multiple Columns

To state it simply, rowgroup elimination can work quite well with AND predicates on different columns. It will not work with OR predicates on different columns unless the query optimizer can simplify the expression to something that’s eligible for rowgroup elimination.

The following queries are all able to skip 99 rowgroups:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID = 1 AND ID2 = 1;

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID BETWEEN 1 AND 2
AND ID2 BETWEEN 3 AND 4;

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID BETWEEN 1 AND 100000000
AND ID2 BETWEEN 1000001 AND 2000000;

This query skips all 100 rowgroups:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID = 1 AND ID2 = 1000001;

The storage engine doesn’t take the union of rowgroups that could be relevant. It instead takes the intersection, so adding AND predicates won’t increase the number of segments scanned, unless perhaps if you do something very unreasonable. The following query scans one rowgroup as expected:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID BETWEEN 1 AND 100000000
AND ID2 BETWEEN 1000001 AND 2000000
AND ID > ID2;

The final part of the WHERE clause is implemented in a filter operator. The rest of the WHERE clause remains eligible for rowgroup elimination.

Now let’s try a simple query with an OR predicate:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID = 1 OR ID2 = 1;

We might hope that the storage engine is able to deduce that only the first segment is relevant. Instead, rowgroup elimination isn’t even attempted. The predicate is implemented as a filter:

a10_FILTER

The only situation with OR filters that I’ve found to work with rowgroup elimination is when the optimizer can eliminate one of them. For example, the following query scans 5 segments because the optimizer is able to eliminate the condition on the ID2 column:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID IN (1, 5000000) OR ID2 BETWEEN 1 AND 0;

Joins

The query optimizer is able to copy predicates when filtering and joining on the same column. The copied predicates are eligible for rowgroup elimination. Consider the query below:

SELECT MAX(l1.ID)
FROM dbo.MILLIONAIRE_CCI l1
INNER JOIN dbo.MILLIONAIRE_CCI l2 ON l1.ID = l2.ID
WHERE l1.ID BETWEEN 1 AND 1000000;

Only two segments are read because the filter on ID can be applied to both sides of the join. The same behavior can be observed when forcing a merge join. Loop join is a bit different. As covered in the post on CCI string aggregation, rowgroup elimination does not occur on the inner side of a loop. Consider the following query:

SELECT MAX(l1.ID)
FROM dbo.MILLIONAIRE_CCI l1
INNER JOIN dbo.MILLIONAIRE_CCI l2 ON l1.ID = l2.ID
WHERE l1.ID BETWEEN 1 AND 1000
OPTION (LOOP JOIN, NO_PERFORMANCE_SPOOL);

The inner side is scanned 1000 times and the outer side is scanned once. The filter on ID allows all segments to be skipped besides one. So we should read 1001 segments and skip 1001 * 100 – 1001 = 99099 segments. This is what happens:

Table ‘MILLIONAIRE_CCI’. Segment reads 1001, segment skipped 99099.

More segments will be read depending on how many rowgroups the filter crosses. Suppose that we include rows with an ID that’s between 999501 and 1000500:

SELECT MAX(l1.ID)
FROM dbo.MILLIONAIRE_CCI l1
INNER JOIN dbo.MILLIONAIRE_CCI l2 ON l1.ID = l2.ID
WHERE l1.ID BETWEEN 999501 AND 1000500
OPTION (LOOP JOIN, NO_PERFORMANCE_SPOOL);

Now each scan on both the inner and outer side will need to read two segments:

Table ‘MILLIONAIRE_CCI’. Segment reads 2002, segment skipped 98098.

It’s possible to get rowgroup elimination even when filtering and joining on different columns. Consider the following query that joins on ID but filters on ID2:

SELECT MAX(l1.ID)
FROM dbo.MILLIONAIRE_CCI l1
INNER JOIN dbo.MILLIONAIRE_CCI l2 ON l1.ID = l2.ID
WHERE l1.ID2 BETWEEN 1 AND 1000000;

We still get rowgroup elimination against both sides of the join:

Table ‘MILLIONAIRE_CCI’. Segment reads 2, segment skipped 198.

The key is the optimized bitmap:

a10_opt_bitmap

That allows rowgroup elimination to happen on both sides. Bitmap optimization can only occur with hash joins, so queries written in this way that do a merge or loop join won’t be able to take advantage of rowgroup elimination against both tables.

Less Reasonable Queries

Below is a set of sometimes unreasonable queries to test some of the limits around rowgroup elimilation. It was surprising how often the queries remained eligible for rowgroup elimination. For example, local variables seem to cause no issues, even without PEO. The following query reads just one segment:

DECLARE @ID_FILTER BIGINT = 1;
SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID = @ID_FILTER;

Data type conversions on the filtered expression don’t make the query ineligible for rowgroup elimination:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID = '1';

Casting on the filtered column is going to prevent rowgroup elimination. As will “optimizer tricks” like adding zero to the column:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID + 0 = 1;

We read all rowgroups:

Table ‘MILLIONAIRE_CCI’. Segment reads 100, segment skipped 0.

The query below is eligible for rowgroup elimination:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID <= CEILING(RAND());

Using scalar UDFs in queries is a terrible idea, but let’s create one for testing purposes:

CREATE OR ALTER FUNCTION dbo.CHEAP_UDF() RETURNS BIGINT
AS
BEGIN
	RETURN 1;
END;

As you might expect, the following query runs without parallelism and cannot skip any segments:

SELECT MAX(l.ID)
FROM dbo.MILLIONAIRE_CCI l
WHERE l.ID = dbo.CHEAP_UDF();

However, if we add SCHEMABINDING to the function definition then we get rowgroup elimination:

Table ‘MILLIONAIRE_CCI’. Segment reads 1, segment skipped 99.

The query below gets rowgroup elimination with and without SCHEMABINDING:

SELECT MAX(l.ID)
FROM dbo.MILLIONAIRE_CCI l
WHERE l.ID = (SELECT MAX(dbo.CHEAP_UDF()));

Query Rewrites for Better Rowgroup Elimination

In some cases it’s possible to rewrite queries to get better rowgroup elimination. This requires knowing your data and awareness of the rules around rowgroup elimination. Going back to an earlier example, the following query isn’t eligible for rowgroup elimination (without very convenient constraints):

SELECT *
FROM dbo.MILLIONAIRE_CCI
WHERE ID = 1 OR ID2 = 1;

It can be written to use UNION or UNION ALL. Here’s the UNION query:

SELECT *
FROM dbo.MILLIONAIRE_CCI
WHERE ID = 1

UNION 

SELECT *
FROM dbo.MILLIONAIRE_CCI
WHERE ID2 = 1;

Now the storage engine skips 198 segments and only reads 2:

Table ‘MILLIONAIRE_CCI’. Segment reads 2, segment skipped 198.

In some cases it may be advantageous to avoid the sort. The query below has the same rowgroup elimination:

SELECT *
FROM dbo.MILLIONAIRE_CCI
WHERE ID = 1

UNION ALL

SELECT *
FROM dbo.MILLIONAIRE_CCI
WHERE ID2 = 1 AND ID <> 1;

Here’s the query plan:

a10_section_rewrite_UNION_ALL

Consider another query with a wide range of values in the IN clause, but filtered against a single column. The query below won’t be able to skip any rowgroups because we’re including the minimum and maximum value of ID in the query’s results:

SELECT *
FROM dbo.MILLIONAIRE_CCI
WHERE ID IN (
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
, 100000000
);

It may be impractical to write out the query using UNION. Instead, we can essentially force a join by putting the filter values into a derived table. The optimizer is likely to move the IN list to a constant scan and do a hash join to the CCI. We can get rowgroup elimination through the bitmap filter on the hash join. Here’s one way to rewrite the query:

SELECT c.*
FROM
(
	VALUES (1), (2), (3), (4), (5)
	, (6), (7), (8), (9), (10)
	, (100000000)
) v(x)
INNER JOIN dbo.MILLIONAIRE_CCI c ON c.ID = v.x;

Here’s the plan:

a10_section_rewrite_hash

As expected, we only need to scan 2 rowgroups:

Table ‘MILLIONAIRE_CCI’. Segment reads 2, segment skipped 98.

SQL Server 2017 Changes

I ran all of the test queries against SQL Server 2017 RC2. I was not able to observe any differences. It may be that Microsoft did not choose to make improvements in this area, or any improvements were missed by my test cases.

Final Thoughts

Rowgroup elimination seems designed to reduce IO requirements for queries that filter against contiguous ranges against a column, like filtering against a single month of data from a table, or when joining to the CCI through a hash join. It’s possible to write queries for which rowgroup elimination does not occur, even though SQL Server in theory has all of the information that it would need to perform rowgroup elimination. From a practical point of the view, the biggest limitation is probably around OR logic.

Thanks for reading!

Going Further

Aggregate Pushdown Limitations With SQL Server Column Store Indexes

Posted on July 31, 2017May 14, 2022 by Joe Obbish

Aggregate pushdown is an optimization for aggregate queries against columnstore tables that was introduced in SQL Server 2016. Some aggregate computations can be pushed to the scan node instead of calculated in an aggregate node. I found the documentation to be a little light on details so I tried to find as many restrictions around the functionality as I could through testing.

Test Data

Most of my testing was done against a simple CCI with a single compressed rowgroup:

DROP TABLE IF EXISTS dbo.AP_1_RG;
CREATE TABLE dbo.AP_1_RG (
	ID1 bigint NULL,
	ID2 bigint NULL,
	AGG_COLUMN BIGINT NOT NULL,
	INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.AP_1_RG WITH (TABLOCK)
SELECT
t.RN % 8000
, t.RN % 8000
, 0
FROM
(
	SELECT TOP (1048576)
	ROW_NUMBER() OVER
		(ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) t
OPTION (MAXDOP 1);

I found this table structure and data to be convenient for most of the test cases, but all tests can be reproduced with different data or a different table structure.

Restrictions without GROUP BY

Aggregate pushdown is supported both with and without a GROUP BY clause. I found it convenient to test those two cases separately. Below is a list of restrictions that I found or verified during testing.

Data Type of Aggregate Column

The documentation says:

Any datatype <= 64 bits is supported. For example, bigint is supported as its size is 8 bytes but decimal (38,6) is not because its size is 17 bytes. Also, no string types are supported.

However, this isn’t quite accurate. Float data types are not supported. numeric(10,0) is supported despite requiring 9 bytes for storage. Here’s a full table of results:

a9_data_type_table

* bit is not supported for aggregates in general
** not all data types were tested

I would summarize support as all date and time data types are supported except datetimeoffset. All exact numeric data types are supported if they are under 10 bytes. Approximate numerics, strings, and other data types are not supported.

Data Type Conversions

Aggregate pushdown does not appear to be supported if there is any data type conversion in any part of the aggregate expression. Both implicit and explicit data type conversions can cause issues, although it ultimately depends on the rules for data type precedence and if the query optimizer determines if a conversion is needed. For example, the following queries are eligible for pushdown:

SELECT MAX(ID1 + 1)
FROM dbo.AP_1_RG;

SELECT MAX(ID1 + CAST(1 AS BIGINT))
FROM dbo.AP_1_RG;

SELECT MAX(ID1 + CAST(1 AS INT))
FROM dbo.AP_1_RG;

SELECT SUM(ID1 + 1)
FROM dbo.AP_1_RG;

SELECT SUM(1 * ID1)
FROM dbo.AP_1_RG;

SELECT MAX(CAST(ID1 AS BIGINT))
FROM dbo.AP_1_RG;

However, the following queries are not:

SELECT MAX(CAST(ID1 AS INT))
FROM dbo.AP_1_RG;

SELECT SUM(1.5 * ID1)
FROM dbo.AP_1_RG;

SELECT SUM(ID1 + CAST(1 AS BIGINT))
FROM dbo.AP_1_RG;

Sometimes the compute scalar appears with the conversion even when we might not expect it, like for the last query:

a9_compute_scalar

Unsupported Operators

Division and modulus prevent aggregate pushdown even when they wouldn’t change the result or if there isn’t a data type conversion in the plan. There are likely other unsupported operators as well. The following queries are eligible for pushdown:

SELECT MAX(ID1 + 1)
FROM dbo.AP_1_RG;

SELECT MAX(ID1 - 1)
FROM dbo.AP_1_RG;

SELECT MAX(ID1 * 2)
FROM dbo.AP_1_RG;

The following queries are not eligible for pushdown:

SELECT MAX(ID1 / 1)
FROM dbo.AP_1_RG;

SELECT MAX(ID1 % 9999999999999)
FROM dbo.AP_1_RG;

Aggregate Cannot be Applied to Scan

The aggregate expression must be applied directly to the scan. If there’s a filter between the aggregate and the scan then it won’t work. A compute scalar node between the aggregate and the scan can be okay.

Filter expressions involving OR on different columns tend to be calculated as a filter. This means that the following query isn’t eligible for pushdown:

SELECT MIN(ID1)
FROM dbo.AP_1_RG
WHERE ID1 > 0 OR ID2 > 0;

It’s likely that this restriction will affect many queries with joins to other tables.

Local Variables Without PEO

Aggregate pushdown is not available if there is a local variable in the aggregate expression unless the optimizer is able to embed the literal parameter value in the query, such as with a RECOMPILE hint. For example, the first query is not eligible for pushdown but the second query is:

DECLARE @var BIGINT = 0;
-- no
SELECT MIN(ID1 + @var)
FROM dbo.AP_1_RG;

-- yes
SELECT MIN(ID1 + @var)
FROM dbo.AP_1_RG
OPTION (RECOMPILE);

Trivial Plans

Simple queries that use SUM, AVG, COUNT, or COUNT_BIG against very small tables may get a trivial plan. In SQL Server 2016 that trivial plan will not be eligible for batch mode so aggregate pushdown will not occur. Consider the following CCI with 10000 rows in a compressed rowgroup:

DROP TABLE IF EXISTS AGG_PUSHDOWN_FEW_ROWS;

CREATE TABLE dbo.AGG_PUSHDOWN_FEW_ROWS (
ID BIGINT NOT NULL,
INDEX CCI2 CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.AGG_PUSHDOWN_FEW_ROWS WITH (TABLOCK)
SELECT t.RN
FROM
(
	SELECT TOP (10000) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
	CROSS JOIN master..spt_values t3
) t
OPTION (MAXDOP 1);

ALTER INDEX CCI2 ON AGG_PUSHDOWN_FEW_ROWS
REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);

With the default cost threshold for parallelism value of 5 I get a trivial plan:

a9_trivial_plan

If I decrease the CTFP value to 0 I get batch mode along with aggregate pushdown. As far as I can tell, queries with MAX or MIN do not have this issue.

Lack of Hash Match

For some queries the query optimizer may cost a stream aggregate as a cheaper alternative to a hash match aggregate. Aggregate pushdown is not available with a stream aggregate. If I truncate the AGG_PUSHDOWN_FEW_ROWS table and load 3002 rows into it I get a stream aggregate and no pushdown. With 3003 rows I get a hash match and pushdown. The tipping point depends on the aggregate function, MAXDOP, the data loaded into the table, and so on. The undocumented query hint QUERYRULEOFF GbAggToStrm can be used to encourage a hash match aggregate. If the aggregated column is a string this gets more complicated.

Non-trivial CASE statements

Trivial CASE statements inside the aggregate are still eligible for pushdown:

SELECT SUM(CASE WHEN 1 = 1 THEN ID1 ELSE ID2 END)
FROM dbo.AP_1_RG;

However, many other CASE statements are not. Here is one example:

SELECT SUM(CASE WHEN ID1 < ID2 THEN ID1 ELSE ID2 END)
FROM dbo.AP_1_RG;

Basic Restrictions

For completeness I’ll list a few more of the more obvious restrictions on pushdown. Many of these are documented by Microsoft. The CCI must contain at least one compressed rowgroup. Delta stores are not eligible for pushdown. There are only six aggregate functions supported: COUNT, COUNT_BIG, MIN, MAX, SUM, and AVG. COUNT (DISTINCT col_name) is not supported.

If a query does not get batch mode due to TF 9453 or for other reasons it will not get aggregate pushdown. Undocumented trace flag 9354 directly disables aggregate pushdown.

Restrictions with GROUP BY

Queries with a GROUP BY have many, if not, all of the same restrictions on the aggregate expressions. As far as I can tell there are no restrictions on the data type of the GROUP BY columns. More than one GROUP BY column is supported as well. However, there are a few restrictions which only apply to queries with a GROUP BY clause.

Non-direct Column References

The columns in the GROUP BY need to be columns. Adding scalars and other nonsense appears to make the query ineligible. The following queries are not eligible:

SELECT ID1 + 0, SUM(AGG_COLUMN)
FROM dbo.AP_1_RG
GROUP BY ID1 + 0;

SELECT ID1 + ID2, SUM(AGG_COLUMN)
FROM dbo.AP_1_RG
GROUP BY ID1 + ID2;

This one is eligible:

SELECT ID1, ID2, SUM(AGG_COLUMN)
FROM dbo.AP_1_RG
GROUP BY ID1, ID2;

Segment Not Compressed Enough?

A rowgroup appears to be ineligible for aggregate pushdown if the GROUP BY column has a segment size which is too large. This can be observed with the following test data:

DROP TABLE IF EXISTS AP_3_RG;
CREATE TABLE dbo.AP_3_RG (
ID1 bigint NULL,
AGG_COLUMN BIGINT NOT NULL,
INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.AP_3_RG WITH (TABLOCK)
SELECT
t.RN % 16000
, 0
FROM
(
	SELECT TOP (1 * 1048576) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
	CROSS JOIN master..spt_values t3
) t
OPTION (MAXDOP 1);

INSERT INTO dbo.AP_3_RG WITH (TABLOCK)
SELECT
t.RN % 17000
, 0
FROM
(
	SELECT TOP (1 * 1048576) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
	CROSS JOIN master..spt_values t3
) t
OPTION (MAXDOP 1);

INSERT INTO dbo.AP_3_RG WITH (TABLOCK)
SELECT
t.RN % 16000
, 0
FROM
(
	SELECT TOP (1 * 1048576) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
	CROSS JOIN master..spt_values t3
) t
OPTION (MAXDOP 1);

Rows from the second rowgroup are not locally aggregated for the following query:

SELECT ID1, SUM(AGG_COLUMN)
FROM AP_3_RG
GROUP BY ID1
OPTION (MAXDOP 1);

The following query has 2097152 locally aggregated rows which correspond to the first and third rowgroups. From the actual plan:

a9_local_agg

Poking around in the sys.column_store_segments and sys.column_store_dictionaries DMVs doesn’t reveal any interesting differences other than the rowgroup with more distinct values has a much larger size (2097736 bytes versus 128576 bytes ). We can go deeper with the undocumented DBCC CSINDEX:

DBCC TRACEON (3604);

DBCC CSINDEX (
7, -- DB_ID
72057613148028928, -- hobt_id
2, -- 1 + column_id
0, -- segment_id
1, -- 1 for segment
0 -- print option
);

Among other differences, for the 16000 distinct value segment we see:

Bitpack Data Header:

Bitpack Entry Size = 16
Bitpack Unit Count = 0
Bitpack MinId = 3
Bitpack DataSize = 0

But for the 17000 distinct value segment we see:

Bitpack Data Header:
Bitpack Entry Size = 16
Bitpack Unit Count = 262144
Bitpack MinId = 3
Bitpack DataSize = 2097152

Perhaps bitpack compressed data is not eligible for aggregate pushdown?

It’s important to note that segments are not compressed independently in SQL Server, despite the data being stored at a column level. For the AP_1_RG table we’re eligible for pushdown when aggregating by ID1 or ID2. However, if I truncate the table and change the data slightly:

TRUNCATE TABLE dbo.AP_1_RG;

INSERT INTO dbo.AP_1_RG WITH (TABLOCK)
SELECT
t.RN % 8000
, t.RN % 8001 -- was previously 8000
, 0
FROM
(
	SELECT TOP (1048576) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
	CROSS JOIN master..spt_values t3
) t
OPTION (MAXDOP 1);

Now both columns are no longer eligible for aggregate pushdown. A REBUILD operation on the table does not help. Tricking SQL Server into assigning more memory to the columnstore compression also does not help.

Pushdown Surprises

During testing I was surprised by a few queries that supported aggregate pushdown. The most surprising was that the following query can get aggregate pushdown:

SELECT MAX(ID + ID2)
FROM dbo.AP_1_RG;

I have no idea how it works, but it does. For a few others, TABLESAMPLE does not prevent aggregate pushdown from happening. In addition, GROUP BY CUBE, GROUPING SETS, and ROLLUP are supported as well.

Changes with SQL Server 2017

I ran the same series of tests on SQL Server 2017 RC1. The only difference I observed was that I could no longer get a trivial plan without batch mode aggregation. This change was announced here by Microsoft.

Final Thoughts

As you can see, there are many undocumented restrictions around aggregate pushdown. These limits may be changed or go away as Microsoft continues to update SQL Server. Pushdown with GROUP BY is supported for some queries against some tables, but eligibility appears to be based on how the data is compressed, which cannot be predicted ahead of time. In my opinion, this makes it rather difficult to count on in practice.

Thanks for reading!

Going Further

SQL Server Clustered Columnstore Index Partitioning Part 1: Rowgroup Elimination Fragmentation

Posted on July 19, 2017May 14, 2022 by Joe Obbish

This is part 1 of a series on columnstore index partitioning. The focus of this post is on rowgroup elimination.

Defining Fragmentation for CCIs

There are many different ways that a CCI can be fragmented. Microsoft considers having multiple delta rowgroups, deleted rows in compressed rowgroups, and compressed rowgroups less than the maximum size all as signs of fragmentation. I’m going to propose yet another way. A CCI can also be considered fragmented if the segments for a column that is very important for rowgroup elimination are packed in such a way that limits rowgroup elimination. That sentence was a bit much but the following table examples should illustrate the point.

The Test Data

First we’ll create a single column CCI with integers from 1 to 104857600. I’ve written the query so that the integers will be loaded in order. This is more commonly done using a loop and I’m relying slightly on undocumented behavior, but it worked fine in a few tests that I did:

DROP TABLE IF EXISTS dbo.LUCKY_CCI;

CREATE TABLE dbo.LUCKY_CCI (
	ID BIGINT NOT NULL,
	INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.LUCKY_CCI WITH (TABLOCK)
SELECT TOP (100 * 1048576) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL)) RN
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
ORDER BY RN
OPTION (MAXDOP 1);

On my machine, the table takes less than 40 seconds to populate. We can see that the minimum and maximum IDs for each rowgroup are nice and neat from the sys.column_store_segments DMV:

a7_lucky

This table does not have any rowgroup elimination fragmentation. The segments are optimally constructed from a rowgroup elimination point of view. SQL Server will be able to skip every rowgroup except for one or two if I filter on a sequential range of one million integers:

SELECT COUNT(*)
FROM dbo.LUCKY_CCI
WHERE ID BETWEEN 12345678 AND 13345677;

STATISTICS IO output:

Table ‘LUCKY_CCI’. Segment reads 2, segment skipped 98.

Erik Darling was the inspiration for this next table, since he can never manage to get rowgroup elimination in his queries. The code below is designed to create segments that allow for pretty much no rowgroup elimination whatsoever:

DROP TABLE IF EXISTS dbo.UNLUCKY_CCI;

CREATE TABLE dbo.UNLUCKY_CCI (
	ID BIGINT NOT NULL,
	INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.UNLUCKY_CCI WITH (TABLOCK)
SELECT rg.RN
FROM
(
	SELECT TOP (100) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
) driver
CROSS APPLY (
	SELECT driver.RN

	UNION ALL

	SELECT 104857601 - driver.RN

	UNION ALL

	SELECT TOP (1048574) 100 + 1048574 * (driver.RN - 1)
		+ ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
	FROM master..spt_values t2
	CROSS JOIN master..spt_values t3
) rg
ORDER BY driver.RN
OPTION (MAXDOP 1, NO_PERFORMANCE_SPOOL);

Each rowgroup will contain an ID near the minimum (1 – 100) and an ID near the maximum (104857301 – 104857400). We can see this with the same DMV as before:

a7_unlucky

This table has pretty much the worst possible rowgroup elimination fragmentation. SQL Server will not be able to skip any rowgroups when querying almost any sequential range of integers:

SELECT COUNT(*)
FROM dbo.UNLUCKY_CCI
WHERE ID BETWEEN 12345678 AND 13345677;

STATISTICS IO output:

Table ‘UNLUCKY_CCI’. Segment reads 100, segment skipped 0.

Measuring Rowgroup Elimination Fragmentation

It might be useful to develop a more granular measurement for rowgroup elimination instead of just “best” and “worst”. One way to approach this problem is to define a batch size in rows for a filter against the column and to estimate how many total table scans would need to be done if the entire table was read in a series of SELECT queries with each processing a batch of rows. For example, with a batch size of 1048576 query, 1 would include the IDs from 1 to 1048576, query 2 would include IDs from 1 + 1048576 to 1048576 * 2, and so on. I will call this number the rowgroup elimination fragmentation factor, or REFF for short. It roughly represents how many extra rowgroups are read that could have been skipped with less fragmentation. Somewhat-tested code to calculate the REFF is below:

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

DECLARE @batch_size INT = 1048576;
DECLARE @global_min_id BIGINT;
DECLARE @global_max_id BIGINT;
DECLARE @segments BIGINT;

DROP TABLE IF EXISTS #MIN_AND_MAX_IDS;
CREATE TABLE #MIN_AND_MAX_IDS (
	min_data_id BIGINT,
	max_data_id BIGINT
);
CREATE CLUSTERED INDEX c ON #MIN_AND_MAX_IDS
(min_data_id, max_data_id);

INSERT INTO #MIN_AND_MAX_IDS
SELECT css.min_data_id, css.max_data_id
FROM sys.objects o
INNER JOIN sys.columns c ON o.object_id = c.object_id
INNER JOIN sys.partitions p ON o.object_id = p.object_id
INNER JOIN sys.column_store_segments css
	ON p.hobt_id = css.hobt_id
	AND css.column_id = c.column_id
INNER JOIN sys.dm_db_column_store_row_group_physical_stats s
	ON o.object_id = s.object_id
	AND css.segment_id = s.row_group_id
	AND s.partition_number = p.partition_number
WHERE o.name = 'LUCKY_CCI'
AND c.name = 'ID'
AND s.[state] = 3;

SET @segments = @@ROWCOUNT;

SELECT
  @global_min_id = MIN(min_data_id)
, @global_max_id = MAX(max_data_id)
FROM #MIN_AND_MAX_IDS;

DROP TABLE IF EXISTS #BUCKET_RANGES;
CREATE TABLE #BUCKET_RANGES (
	min_data_id BIGINT,
	max_data_id BIGINT
);

 -- allows up to 6.4 million pieces
INSERT INTO #BUCKET_RANGES
SELECT
  @global_min_id + (RN - 1) * @batch_size
, @global_min_id + RN * @batch_size - 1
FROM
(
	SELECT ROW_NUMBER() OVER
	(ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) t
WHERE -- number of buckets
t.RN <= CEILING((1.0 * @global_max_id
	- (@global_min_id)) / @batch_size);

 -- number of total full scans
SELECT 1.0 * SUM(cnt) / @segments
FROM (
	SELECT COUNT(*) cnt
	FROM #BUCKET_RANGES br
	LEFT OUTER JOIN #MIN_AND_MAX_IDS m
		ON m.min_data_id <= br.max_data_id AND m.max_data_id >= br.min_data_id
	GROUP BY br.min_data_id
) t;

For the LUCKY_CCI table, if we pick a batch size of 1048576 we get a REFF of 1.0. This is because each SELECT query would read exactly 1 rowgroup and skip the other 99 rowgroups. Conversely, the UNLUCKY_CCI table has a REFF of 100.0. This is because every SELECT query would need to read all 100 rowgroups just to return 1048576 rows. In other words, each query does a full scan of the table and there are 100 rowgroups in the table.

What about DUIs?

Suppose our lucky table runs out of luck and starts getting DUIs. As rows are deleted and reinserted they will be packed into new rowgroups. We aren’t likely to keep our perfect rowgroups for every long. Below is code that randomly deletes and inserts a range of 100k integers. There is some bias in which numbers are processed but that’s ok for this type of testing. I ran the code with 600 loops which means that up to 60% of the rows in the table could have been moved around. There’s nothing special about 60%. That just represents my patience with the process.

DECLARE @target_loops INT = 600,
@batch_size INT = 100001,
@current_loop INT = 1,
@middle_num BIGINT,
@rows_in_target BIGINT;

BEGIN

SET NOCOUNT ON;

SELECT @rows_in_target = COUNT(*)
FROM dbo.LUCKY_CCI;

DROP TABLE IF EXISTS #num;
CREATE TABLE #num (ID INT NOT NULL);

INSERT INTO #num WITH (TABLOCK)
SELECT TOP (@batch_size) -1 + ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL)) RN
FROM master..spt_values t1
CROSS JOIN master..spt_values t2;

WHILE @current_loop <= @target_loops
BEGIN
	SET @middle_num = CEILING(@rows_in_target * RAND());

	DELETE tgt WITH (TABLOCK)
 	FROM dbo.LUCKY_CCI tgt
 	WHERE tgt.ID BETWEEN
 	CAST(@middle_num - FLOOR(0.5*@batch_size) AS BIGINT)
	AND
	CAST(@middle_num + FLOOR(0.5*@batch_size) AS BIGINT);

 	INSERT INTO dbo.LUCKY_CCI WITH (TABLOCK)
 	SELECT @middle_num + n.ID - FLOOR(0.5 * @batch_size)
	FROM #num n
 	WHERE n.ID BETWEEN 1 AND @rows_in_target
	OPTION (MAXDOP 1);

	SET @current_loop = @current_loop + 1;
END;

END;

Our table now has a bunch of uncompressed rowgroups and many compressed rowgroups with lots of deleted rows. Let’s clean it up with a REORG:

ALTER INDEX CCI ON dbo.LUCKY_CCI
REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);

This table has a REFF of about 41 for a batch size of 1048576. That means that we can expect to see significantly worse rowgroup elimination than before. Running the same SELECT query as before:

Table ‘LUCKY_CCI’. Segment reads 51, segment skipped 75.

The LUCKY_CCI table clearly needs to be renamed.

The Magic of Partitioning

A partition for a CCI is a way of organizing rowgroups. Let’s create the same table but with partitions that can hold five million integers. Below is code to do that for those following along at home:

CREATE PARTITION FUNCTION functioning_part_function
(BIGINT)
AS RANGE RIGHT
FOR VALUES (
  0
, 5000000
, 10000000
, 15000000
, 20000000
, 25000000
, 30000000
, 35000000
, 40000000
, 45000000
, 50000000
, 55000000
, 60000000
, 65000000
, 70000000
, 75000000
, 80000000
, 85000000
, 90000000
, 95000000
, 100000000
, 105000000
); 

CREATE PARTITION SCHEME scheming_part_scheme
AS PARTITION functioning_part_function
ALL TO ( [PRIMARY] );

DROP TABLE IF EXISTS dbo.PART_CCI;

CREATE TABLE dbo.PART_CCI (
ID BIGINT NOT NULL,
INDEX CCI CLUSTERED COLUMNSTORE
) ON scheming_part_scheme(ID);

INSERT INTO dbo.PART_CCI WITH (TABLOCK)
SELECT TOP (100 * 1048576) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL)) RN
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
ORDER BY RN
OPTION (MAXDOP 1);

5 million was a somewhat arbitrary choice. The important thing is that it’s above 1048576 rows. Generally speaking, picking a partition size below 1048576 rows is a bad idea because the table’s rowgroups will never be able to reach the maximum size.

Using the sys.dm_db_column_store_row_group_physical_stats dmv we can see that rowgroups were divided among partitions as expected:

a7_part

This table has a REFF of 1.9 for a batch size of 1048576. This is perfectly reasonable because the partition size was defined as 5 * 1000000 instead of 5 * 1048576 which would be needed for a REFF of 1.0.

Now I’ll run the same code as before to do 60 million DUIs against the table. I expect the partitions to limit the damage because row cannot move out of their partitions. After the process finishes and we do a REORG, the table has a REFF of about 3.7. The rowgroups are definitely a bit messier:

a7_part_after_DUI

Running our SELECT query one last time:

SELECT COUNT(*)
FROM dbo.Part_CCI
WHERE ID BETWEEN 12345678 AND 13345677;

We see the following numbers for rowgroup elimination:

Table ‘PART_CCI’. Segment reads 4, segment skipped 3.

Only four segments are read. Many segments are skipped, including all of those in partitions not relevant to the query.

Final Thoughts

Large, unpartitioned CCIs will eventually run into issues with rowgroup elimination fragmentation if the loading process can delete data. A maintainance operation that rebuilds the table with a clustered rowstore index and rebuilds the CCI with MAXDOP = 1 is one way to restore nicely organized rowgroups, but that is a never-ending battle which grows more expensive as the table gets larger. Picking the right partitioning function can guarantee rowgroup elimination fragmentation.

Thanks for reading!

Going Further

Columnstore Parallel Scan Row Distribution In SQL Server

Posted on July 17, 2017May 14, 2022 by Joe Obbish

Parallel rowstore scans are parallel “aware”. This makes them unlike most other operators which work independently on different threads. Columnstore indexes store data in a dramatically different way than rowstore objects, so perhaps we can expect differences in how rows are distributed among threads during a parallel scan. This blog post explores some of the observable behavior around row distribution for parallel columnstore scans.

Methods of Rowstore Scan Parallel Row Distribution

This will be an extremely brief summary of how SQL Server distributes rows among threads for rowstore parallel scans. A parallel page supplier sends pages to each thread on a demand basis. Threads may end up processing different numbers of pages for many reasons. If you like, you can read more about this here and here. In addition, there is some fancy behavior for partitioned tables. The documentation describes this behavior in SQL Server 2008 and it may have changed since then, but this is sufficient to set the stage. The important part is that SQL Server will in some cases give each relevant partition its own thread. In other cases, SQL Server will assign multiple threads to a single partition.

Methods of Columnstore Scan Parallel Row Distribution

I investigated tables with only compressed rowgroups. I did not consider delta stores because real CCIs don’t have delta stores. As far as I can tell, there are at least three different methods that SQL Server can use to assign rows to threads during a parallel CCI scan.

Rowgroup Parallelism

One method of distributing rows is to assign each thread to a relevant rowgroup. Two or more threads will not read rows from the same rowgroup. This strategy can be used if the cardinality estimate from the scan is sufficiently high compared to the DOP used by the query. Rowgroup level parallelism will be used if:

Cardinality Estimate >= MAXDOP * 1048576 * 0.5

To show this, let’s build a CCI with a single column integer that runs from 1 to 1048576 * 10. Naturally, the table will have ten compressed rowgroups of the maximum size:

CREATE TABLE dbo.CCI_SHOW_RG_PARALLELISM (
	ID BIGINT NOT NULL,
	INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.CCI_SHOW_RG_PARALLELISM WITH (TABLOCK)
SELECT TOP (1048576 * 10) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

I ran my tests with a simple query that is likely to go parallel on my machine and that doesn’t qualify for aggregate pushdown. With a cardinality estimate of exactly 1000000 and MAXDOP of 2 all of the rows are sent to one thread:

SELECT MAX(SQRT(ID))
FROM dbo.CCI_SHOW_RG_PARALLELISM
WHERE ID <= 1000000
OPTION (MAXDOP 2);

From the actual plan:

a6_RG_1

If we reduce the cardinality estimate by one row, the rows are spread out more evenly on the two threads:

SELECT MAX(SQRT(ID))
FROM dbo.CCI_SHOW_RG_PARALLELISM
WHERE ID <= 999999
OPTION (MAXDOP 2);

From the actual plan:

a6_RG_2

Note that the cardinality estimate displayed in SSMS may be misleading. The first query has a displayed cardinality estimate of 2000000 rows, but it does not use rowgroup parallelism:

SELECT MAX(SQRT(ID))
FROM dbo.CCI_SHOW_RG_PARALLELISM
WHERE ID <= 1999999 -- query plans can round
OPTION (MAXDOP 4);

From the actual plan:

a6_RG_3

But this one does:

SELECT MAX(SQRT(ID))
FROM dbo.CCI_SHOW_RG_PARALLELISM
WHERE ID <= 2000000
OPTION (MAXDOP 4);

From the actual plan:

a6_RG_4

Of course, we can get the query that takes an aggregate of 1999999 rows to use rowgroup level parallelism by bumping up the estimate:

DECLARE @filter BIGINT = 1999999;

SELECT MAX(SQRT(ID))
FROM dbo.CCI_SHOW_RG_PARALLELISM
WHERE ID <= @filter
OPTION (MAXDOP 4);

Here the estimate is:

0.3 * 10485800 = 3145740.0

So we get the expected parallelism strategy:

a6_RG_5

We can show that the rowgroup parallelism strategy is demand-based by deliberately slowing down the required the thread that grabs the first rowgroup in the table. Here I’m defining first by the ID that’s returned when running a SELECT TOP 1 ID query against the table. On my machine I get an ID of 9437185. The following code will add significant processing time in the CROSS APPLY part for only the row with an ID of 9437185. Every other row simply does a Constant Scan and goes on its merry way:

SELECT COUNT(*)
FROM dbo.CCI_SHOW_RG_PARALLELISM o
CROSS APPLY (
	SELECT TOP 1 1 c
	FROM
	(
		SELECT 1 c
		WHERE o.ID <> 9437185

		UNION ALL

		SELECT 1 c
		FROM master..spt_values t1
		CROSS JOIN master..spt_values t2
		CROSS JOIN master..spt_values t3
		WHERE o.ID = 9437185
		ORDER BY (SELECT NULL)
			OFFSET 100000000 ROWS
			FETCH FIRST 1 ROW ONLY
	) t
) t2
OPTION (MAXDOP 2);

Thread 1 processes nine rowgroups and waits on thread 2 to process its single rowgroup:

a6_RG_6

Split Rowgroup Parallelism

If the CCI has a small number of rows and the cardinality estimate is low enough you may see a different parallel row distribution strategy employed. I couldn’t think of a good name for this, but SQL Server splits up each rowgroup into roughly equal pieces and threads can process those pieces on a demand basis. The number of pieces seems to depend on the number of rows in the rowgroup instead of MAXDOP . For MAXDOP larger than 2 this behavior can be observed if a table has one or two rowgroups. For a MAXDOP of 2 this behavior can be observed if a table has exactly one rowgroup.

The formula for the number of pieces appears to be the number of rows in the rowgroup divided by 104857, rounded down, with a minimum of 1. The maximum rowgroup size of 1048576 implies a maximum number of pieces of 10 per rowgroup. Here’s a table to show all of the possibilities:

a6_table

We can see evidence of this behavior in SQL Server with a few tests. As before I’ll need to slow down one thread. First I’ll put 943713 integers into a single compressed rowgroup:

DROP TABLE IF EXISTS dbo.TEST_CCI_SMALL;

CREATE TABLE dbo.TEST_CCI_SMALL (
ID BIGINT NOT NULL,
INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.TEST_CCI_SMALL WITH (TABLOCK)
SELECT TOP (943713) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

I will test with a very similar query to a previous test:

SELECT COUNT(*)
FROM dbo.TEST_CCI_SMALL o
CROSS APPLY (
	SELECT TOP 1 1 c
	FROM
	(
		SELECT 1 c
		WHERE o.ID <> 1

		UNION ALL

		SELECT 1 c
		FROM master..spt_values t1
		CROSS JOIN master..spt_values t2
		CROSS JOIN master..spt_values t3
		WHERE o.ID = 1
		ORDER BY (SELECT NULL)
			OFFSET 100000000 ROWS
			FETCH FIRST 1 ROW ONLY
	) t
) t2
OPTION (QUERYTRACEON 8649, MAXDOP 4);

The parallel scan should be split into nine pieces to be divided among threads because the table has a single rowgroup with a row count of 943713. This is exactly what happens:

a6_small_1

If I truncate the table and load one fewer row, the scan is now split up into eight pieces:

a6_small_2

I can also create two compressed rowgroups of a size that should led to seven pieces per rowgroup:

DROP TABLE IF EXISTS dbo.TEST_CCI_SMALL;

CREATE TABLE dbo.TEST_CCI_SMALL (
	ID BIGINT NOT NULL,
	INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.TEST_CCI_SMALL WITH (TABLOCK)
SELECT TOP (733999) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

INSERT INTO dbo.TEST_CCI_SMALL WITH (TABLOCK)
SELECT TOP (733999) 733999 + ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

Thread 1 processes the row with an ID of 734000 so it only gets one piece:

a6_small_3

With more than one rowgroup, the demand-based aspect of piece distribution doesn’t quite work in the same way as with a single rowgroup. I wasn’t able to work out all of the details.

A Third Way?

What about queries that do not meet either of the two above criteria? For example, a query against a not small CCI that has a low cardinality estimate coming out of the scan? In some cases SQL Server will use rowgroup level distribution. In other cases it appears to use a combination of the two methods described above. Most of the time the behavior can be described as each thread gets assigned an entire rowgroup and threads race for pieces of the remaining rowgroups. I wasn’t able to figure out exactly how SQL Server decides which method to use, despite running many tests. However, I will show most of the behavior that I observed. First put 7 rowgroups into a CCI:

DROP TABLE IF EXISTS dbo.MYSTERIOUS_CCI;

CREATE TABLE dbo.MYSTERIOUS_CCI (
	ID BIGINT NOT NULL,
	INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.MYSTERIOUS_CCI WITH (TABLOCK)
SELECT TOP (1048576 * 7) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

SELECT TOP 1 ID -- 6291457
FROM dbo.MYSTERIOUS_CCI;

I added local variables to my test query to lower the cardinality estimate. Otherwise I would get rowgroup distribution every time. Here’s an example query:

declare @lower_id INT = 1048576 * 2 + 1;
declare @upper_id INT = 1048576 * 7;

SELECT COUNT(*)
FROM dbo.MYSTERIOUS_CCI o
CROSS APPLY (
	SELECT TOP 1 1 c
	FROM
	(
		SELECT 1 c
		WHERE o.ID <> 6291457

		UNION ALL

		SELECT 1 c
		FROM master..spt_values t1
		CROSS JOIN master..spt_values t2
		CROSS JOIN master..spt_values t3
		WHERE o.ID = 6291457
		ORDER BY (SELECT NULL)
			OFFSET 100000000 ROWS
			FETCH FIRST 1 ROW ONLY
	) t
) t2
WHERE o.ID BETWEEN @lower_id AND @upper_id
OPTION (QUERYTRACEON 8649, MAXDOP 3);

With a MAXDOP of 3 and five processed rowgroups I get rowgroup level parallelism:

a6_3rd_1

With a MAXDOP of 4 and five processed rowgroups, each thread gets a rowgroup and the other three threads race to process the remaining 20 pieces:

a6_3rd_2

With a MAXDOP of 3 and six processed rowgroups we no longer get rowgroup level parallelism:

a6_3rd_3

What about Partitioning?

In tests not reproduced here, I was not able to observe any differences in how rows were distributed for parallel CCI scans when the underlying CCI was partitioned. This seems reasonable if we think about how partitioning for CCIs is different than for rowstore tables. CCI partitions are simply a collection of rowgroups containing rows relevant to the partition. If there’s a need to split up the underlying components of a table we can just assign rowgroups to different threads. For rowstore tables, we can think of each partition as a mini-table. Partitioning for rowstore adds underlying objects which can be distributed to parallel threads.

Fun with Query Plans

With a partial understanding of how SQL server distributes rows after a parallel scan, we can write queries that show some of the edge cases that can lead to poor performance. The following query is complete nonsense but it shows the point well enough. Here I’m cross joining to a few numbers from the spt_values table. The CROSS JOIN is meant to represent other downstream work done by a query:

SELECT MAX(SQRT(o.ID + t.number))
FROM dbo.MYSTERIOUS_CCI o
CROSS JOIN master..spt_values t
WHERE o.ID <= 1100000 AND
t.number BETWEEN 0 AND 4
OPTION (MAXDOP 2, NO_PERFORMANCE_SPOOL, QUERYTRACEON 8649);

The cardinality estimate and MAXDOP of 2 leads to rowgroup level parallelism being used. Unfortunately, this is very unbalanced:

a6_fun_1

And as a result, the query barely benefits from parallelism:

CPU time = 25578 ms, elapsed time = 24580 ms.

It’s a well-guarded secret that queries run faster with odd MAXDOP , so let’s try a MAXDOP of 3. Now my parallel threads no longer sleep on the job:

CPU time = 30922 ms, elapsed time = 11840 ms.

Here’s the row distribution from the scan:

a6_fun_2

Final Thoughts

This post explored a bit of the observable behavior for how rows are distributed to threads after a parallel columnstore index scan. The algorithms used can in some cases lead to row imbalance on threads which can cause performance issues downstream in the plan. The lack of repartition stream operators in batch mode can make this problem worse than it might be for rowstore scans, but I still expect issues caused by it to be relatively uncommon in practice.

Thanks for reading!

Going Further

SQL Server Clustered Column Store Indexes and String Aggregation

Posted on July 3, 2017May 14, 2022 by Joe Obbish

String aggregation is a not uncommon problem in SQL Server. By string aggregation I mean grouping related rows together and concatenating a string column for that group in a certain order into a single column. How will do CCIs do with this type of query?

The Data Set

A simple table with three columns is sufficient for testing. ITEM_ID is the id for the rows that should be aggregated together. LINE_ID stores the concatenation order within the group. COMMENT is the string column to be concatenated. The table will have 104857600 rows total with 16 rows per ITEM_ID.

CREATE TABLE dbo.CCI_FOR_STRING_AGG (
	ITEM_ID BIGINT NOT NULL,
	LINE_ID BIGINT NOT NULL,
	COMMENT VARCHAR(10) NULL,
	INDEX CCI CLUSTERED COLUMNSTORE
);

DECLARE @loop INT = 0
SET NOCOUNT ON;
WHILE @loop < 100
BEGIN
	INSERT INTO dbo.CCI_FOR_STRING_AGG WITH (TABLOCK)
	SELECT
	t.RN / 16
	, 1 + t.RN % 16
	, CHAR(65 + t.RN % 16)
	FROM
	(
		SELECT TOP (1048576)
		(1048576 * @loop) - 1 +
		ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
		FROM master..spt_values t1
		CROSS JOIN  master..spt_values t2
	) t
	OPTION (MAXDOP 1);

	SET @loop = @loop + 1;
END;

On my machine this codes takes around a minute and a half and the final table size is around 225 MB. I’m inserting batches of 1048576 rows with MAXDOP 1 to get nice, clean rowgroups.

The Test Query

Let’s concatenate the strings at an ITEM_ID level for all ITEM_IDs with an id between 3276800 and 3342335. All of the data is stored in a single rowgroup and the CCI was built in such a way that all other rowgroups can be skipped with that filter. This should represent the best case for CCI performance. A common way to concatenate a column is with the FOR XML PATH method:

SELECT
  o.ITEM_ID
, STUFF(
	(
		SELECT ','+ i.COMMENT
		FROM dbo.CCI_FOR_STRING_AGG i
		WHERE o.ITEM_ID = i.ITEM_ID
		ORDER BY i.LINE_ID
		FOR XML PATH('')
	)
,1 ,1, '') ALL_COMMENTS
FROM dbo.CCI_FOR_STRING_AGG o
WHERE o.ITEM_ID BETWEEN 3276800 AND 3342335
GROUP BY o.ITEM_ID;

Note that I’m not bothering with the additional arguments for the XML part, but you should whenever using this method in production.

The query plan looks reasonable at first glance:

a4 FOR XML estimated

However, the query takes nearly six minutes to complete. That’s not the lightning fast CCI query performance that we were promised!

Spool Problems

We can see from the actual execution plan that SQL Server scanned all 104.8 million rows from the CCI on a single thread:

a4 actual threads

Those rows were then sent into an index spool which was read from all four threads in the nested loop. The CCI scan and the build of the index spool took the most time in the plan so it makes sense why CPU time is less than elapsed time, even with a MAXDOP 4 query:

CPU time = 305126 ms, elapsed time = 359176 ms.

Thinking back to how parallel nested loop joins work, it seems unavoidable that the index spool was built with just one thread. The rows from the outer result set are sent to the inner query one row at a time on different threads. All of the threads run independently at MAXDOP 1. The eager spool for the index is a blocking operation and the blocking occurs even when the subquery isn’t needed. Consider the following query in which the ITEM_ID = 3400000 filter means that rows from the FOR XML PATH part of the APPLY will never be needed:

SELECT o.ITEM_ID
, ac.ALL_COMMENTS
FROM
(
	SELECT TOP (9223372036854775807) b.ITEM_ID
	FROM dbo.CCI_FOR_STRING_AGG b
	WHERE b.ITEM_ID BETWEEN 3276800 AND 3342335
	GROUP BY b.ITEM_ID
) o
OUTER APPLY (
	SELECT 'TEST'

	UNION ALL

	SELECT STUFF(
		(
			SELECT ','+ i.COMMENT
			FROM dbo.CCI_FOR_STRING_AGG i
			WHERE o.ITEM_ID = i.ITEM_ID
			ORDER BY i.LINE_ID
			FOR XML PATH('')
		)
	,1 ,1, '')
	WHERE o.ITEM_ID =  3400000
)  ac (ALL_COMMENTS)
OPTION (MAXDOP 4, QUERYTRACEON 8649);

The index spool is still built for this query on one thread even though the startup expression predicate condition is never met. It seems unlikely that we’ll be able to do anything about the fact that the index spool is built with one thread and that it blocks execution for the query. However, we know that we only need 1048576 rows from the CCI to return the query’s results. Right now the query takes 104.8 million rows and throws them into the spool. Can we reduce the number of rows put into the spool? The most obvious approach is to simply copy the filter on ITEM_ID into the subquery:

SELECT
  o.ITEM_ID
, STUFF(
	(
		SELECT ','+ i.COMMENT
		FROM dbo.CCI_FOR_STRING_AGG i
		WHERE o.ITEM_ID = i.ITEM_ID
		and i.ITEM_ID BETWEEN 3276800 AND 3342335
		ORDER BY i.LINE_ID
		FOR XML PATH('')
	)
,1 ,1, '') ALL_COMMENTS
FROM dbo.CCI_FOR_STRING_AGG o
WHERE o.ITEM_ID BETWEEN 3276800 AND 3342335
GROUP BY o.ITEM_ID;

This doesn’t have the desired effect:

a4 bad filter with spool

The filter is moved after the index build. We’ll still get all of the rows from the table put into the spool, but no rows will come out unless ITEM_ID is between 3276800 and 3342335. This is not helpful. We can get more strict with the query optimizer by adding a superfluous TOP to the subquery. That should force SQL Server to filter on ITEM_ID before sending rows to the spool because otherwise the TOP restriction may not be respected. One implementation:

SELECT
o.ITEM_ID
, STUFF(
(
SELECT ','+ i.COMMENT
FROM
(
SELECT TOP (9223372036854775807)
a.ITEM_ID, a.LINE_ID, a.COMMENT
FROM dbo.CCI_FOR_STRING_AGG a
WHERE a.ITEM_ID BETWEEN 3276800 AND 3342335
) i
WHERE o.ITEM_ID = i.ITEM_ID
ORDER BY i.LINE_ID
FOR XML PATH('')
)
,1 ,1, '') ALL_COMMENTS
FROM dbo.CCI_FOR_STRING_AGG o
WHERE o.ITEM_ID BETWEEN 3276800 AND 3342335
GROUP BY o.ITEM_ID;

SQL Server continues to outsmart us:

a4 no spool

As you can see in the plan above, the spool has completely disappeared. I was not able to find a way, even with undocumented black magic, to reduce the number of rows going into the spool. Perhaps it is a fundamental restriction regarding index spools. In fact, we can use the undocumented trace flag 8615 to see that spools are not even considered at any part in the query plan for the new query. On the left is the previous query with the spool with an example highlighted. On the right is the new query. The text shown here is just for illustration purposes, but we can see the spool on the left:

a4 TF diff

The important point is that for this query we appear to be stuck.

Rowgroup Elimination Problems

We can try our luck without the spool by relying on rowgroup elimation alone. The spool can’t be eliminated with the NO_PERFORMANCE_SPOOL hint, but another option (other than the TOP trick above) is to use the undocumented QUERYRULEOFF syntax to disable the optimizer rule for building spools:

SELECT
  o.ITEM_ID
, STUFF(
	(
		SELECT ','+ i.COMMENT
		FROM dbo.CCI_FOR_STRING_AGG i
		WHERE o.ITEM_ID = i.ITEM_ID
		ORDER BY i.LINE_ID
		FOR XML PATH('')
	)
,1 ,1, '') ALL_COMMENTS
FROM dbo.CCI_FOR_STRING_AGG o
WHERE o.ITEM_ID BETWEEN 3276800 AND 3342335
GROUP BY o.ITEM_ID
OPTION (QUERYRULEOFF BuildSpool);

The spool is gone but we don’t get rowgroup elimination:

a4 no RG elimination

We can get rowgroup elimination even without hardcoded filters with a bitmap from a hash join. Why don’t we get it with a nested loop join? Surely SQL Server ought to be able to apply rowgroup elimination for each outer row as it’s processed by the inner part of the nested loop join. We can explore this question further with tests that don’t do string aggregation. The query below gets rowgroup elimination but the constant filters are passed down to the CCI predicate:

SELECT *
FROM (VALUES (1), (2), (3), (4)) AS v(x)
INNER JOIN dbo.CCI_FOR_STRING_AGG a ON v.x = a.ITEM_ID
OPTION (FORCE ORDER, LOOP JOIN);

If we throw the four values into a temp table:

SELECT x INTO #t
FROM (VALUES (1), (2), (3), (4)) AS v(x)

The join predicate is applied in the nested loop operator. We don’t get any rowgroup elimination. Perhaps TOP can help us:

SELECT *
FROM #t v
CROSS APPLY (
	SELECT TOP (9223372036854775807) *
	FROM dbo.CCI_FOR_STRING_AGG a
	WHERE v.x = a.ITEM_ID
) a
OPTION (LOOP JOIN);

Sadly not:

a4 top failure

The join predicate is applied in the filter operation. We still don’t get any rowgroup elimination. Frustratingly, we get the behavior that we’re after if we replace the CCI with a heap:

a4 heap success

I don’t really see an advantage in pushing down the predicate with a heap. Perhaps Microsoft did not program the optimization that we’re looking for into the query optimizer. After all, this is likely to be a relatively uncommon case. This query is simple enough in that we filter directly against the CCI. In theory, we can give our query a fighting chance by adding a redundant filter on ITEM_ID to the subquery. Here’s the query that I’ll run:

SELECT
  o.ITEM_ID
, STUFF(
	(
		SELECT ','+ i.COMMENT
		FROM dbo.CCI_FOR_STRING_AGG i
		WHERE o.ITEM_ID = i.ITEM_ID
		and i.ITEM_ID BETWEEN 3276800 AND 3342335
		ORDER BY i.LINE_ID
		FOR XML PATH('')
	)
,1 ,1, '') ALL_COMMENTS
FROM dbo.CCI_FOR_STRING_AGG o
WHERE o.ITEM_ID BETWEEN 3276800 AND 3342335
GROUP BY o.ITEM_ID
OPTION (QUERYRULEOFF BuildSpool);

Unfortunately performance is even worse than before. The query finished in around 51 minutes on my machine. It was a proper parallel query and we skipped quite a few rowgroups:

Table ‘CCI_FOR_STRING_AGG’. Segment reads 65537, segment skipped 6488163.
CPU time = 10028282 ms, elapsed time = 3083875 ms.

Skipping trillions of rows is impressive but we still read over 68 billion rows from the CCI. That’s not going to be fast. We can’t improve performance further with this method since there aren’t any nonclustered indexes on the CCI, so there’s nothing better to seek against in the inner part of the loop.

Temp Tables

We can use the humble temp table to avoid some of the problems with the index spool. We’re able to insert into it in parallel, build the index in parallel, and we can limit the temp table to only the 1048576 rows that are needed for the query. The following code finishes in less than two seconds on my machine:

CREATE TABLE #CCI_FOR_STRING_AGG_SPOOL (
	ITEM_ID BIGINT NOT NULL,
	LINE_ID BIGINT NOT NULL,
	COMMENT VARCHAR(10) NULL
);

INSERT INTO #CCI_FOR_STRING_AGG_SPOOL WITH (TABLOCK)
SELECT a.ITEM_ID, a.LINE_ID, a.COMMENT
FROM dbo.CCI_FOR_STRING_AGG a
WHERE a.ITEM_ID BETWEEN 3276800 AND 3342335;

CREATE CLUSTERED INDEX CI_CCI_FOR_STRING_AGG_SPOOL
ON #CCI_FOR_STRING_AGG_SPOOL (ITEM_ID, LINE_ID);

SELECT
  o.ITEM_ID
, STUFF(
	(
		SELECT ','+ i.COMMENT
		FROM #CCI_FOR_STRING_AGG_SPOOL i
		WHERE o.ITEM_ID = i.ITEM_ID
		ORDER BY i.LINE_ID
		FOR XML PATH('')
	)
,1 ,1, '') ALL_COMMENTS
FROM dbo.CCI_FOR_STRING_AGG o
WHERE o.ITEM_ID BETWEEN 3276800 AND 3342335
GROUP BY o.ITEM_ID;

The query plan for the last query:

a4 temp table

With this approach we’re not really getting much from defining the underlying table as a CCI. If the table had more columns that weren’t needed then we could skip those columns with column elimination.

Final Thoughts

Note that the poor performance of these queries isn’t a problem specific to columnstore. Rowstore tables can have the similar performance issues with string aggregation if there isn’t a sufficient covering index. However, if CCIs are used as a replacement for all nonclustered indexes, then the performance of queries that require the table to be on the inner side of a nested loop join may significantly degrade. Some of the optimizations that make querying CCIs fast appear to be difficult to realize for these types of queries. The same string aggregation query when run against a table with a rowstore clustered index on ITEM_ID and LINE_ID finished in 1 second. That is an almost 360X improvement over the CCI.

Thanks for reading!