If you have a workload that uses #temp tables to stage intermediate results, and you probably do because you’re smart, it might be worth taking advantage of being able to insert into the #temp table in parallel.
If your code is already using the SELECT ... INTO #some_table pattern, you’re probably already getting parallel inserts. But if you’re following the INSERT ... SELECT ... pattern, you’re probably not, and, well, that could be holding you back.
Pile On
Of course, there are some limitations. If your temp table has indexes, primary keys, or an identity column, you won’t get the parallel insert no matter how hard you try.
The first thing to note is that inserting into an indexed temp table, parallel or not, does slow things down. If your goal is the fastest possible insert, you may want to create the index later.
No Talent
When it comes to parallel inserts, you do need the TABLOCK or TABLOCKX hint to get it, e.g. INSERT #tp WITH(TABLOCK), which is sort of annoying.
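To make that concrete, here’s a minimal sketch of the pattern, borrowing the Posts table from the Stack Overflow database used in these demos (the #tp table and its single column are just illustrative):
CREATE TABLE #tp (OwnerUserId INT NOT NULL);

INSERT #tp WITH (TABLOCK) --without the hint, the insert operator stays serial
SELECT p.OwnerUserId
FROM dbo.Posts AS p;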
But you know. It’s the little things we do that often end up making the biggest differences. Another little thing we may need to tinker with is DOP.
Here are the query plans for 3 fully parallel inserts into an empty, index-less #temp table. Note the execution times dropping as DOP increases. At DOP 4, the insert really isn’t any faster than the serial insert.
If you start experimenting with this trick, and don’t see noticeable improvements at your current DOP, you may need to bump it up to see throughput increases.
Though the speed-ups above at higher DOPs are largely efficiency boosters while reading from the Posts table, the speed does stay consistent through the insert.
If we crank one of the queries that gets a serial insert up to DOP 12, we lose some speed when we hit the table.
Next time you’re tuning a query and want to drop some data into a temp table, you should experiment with this technique.
Thanks for reading!
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.
So you’re that odd soul who has been listening to rumors about table variables. Perhaps things about them only being in memory, or that they’re okay to use if you only put less than some arbitrary number of rows in them.
Those things are both wrong. But of course, my favorite rumor is the one about arbitrary numbers of rows being safe.
Ouch! What a terrible performance
Let’s do everything in our power to help SQL Server make a good guess.
We’ll create a couple indexes:
CREATE INDEX free_food ON dbo.Posts(OwnerUserId);
CREATE INDEX sea_food ON dbo.Comments(UserId);
Those stats’ll be so fresh you could make tartare with them.
We’ll create our table variable with a primary key on it, which will also be the clustered index.
DECLARE @t TABLE( id INT PRIMARY KEY );
INSERT @t ( id )
VALUES(22656);
And finally, we’ll run the select query with a recompile hint. Recompile fixes everything, yeah?
SELECT AVG(p.Score * 1.) AS lmao
FROM @t AS t
JOIN dbo.Posts AS p
ON p.OwnerUserId = t.id
JOIN dbo.Comments AS c
ON c.UserId = t.id
OPTION(RECOMPILE);
GO
How does the query do for time? Things start off okay, but keep the cardinality estimate in mind.
But things quickly go downhill.
Fish are dumb, dumb, dumb
The whole problem here is that, even with just one row in the table variable, an index on the one column in the table variable, and a recompile hint on the query that selects from the table variable, the optimizer has no idea what the contents of that single row are.
That number remains a mystery, and the guess made ends up being wrong by probably more than one order of magnitude. Maybe even an order of manure.
Table variables don’t gather any statistical information about what’s in the column, so the optimizer has no frame of reference to make a better cardinality estimate on the joins.
If we insert a value that gets far fewer hits in both the Posts and Comments tables (12550), the estimate doesn’t really hurt. But note that the guesses across all operators are exactly the same.
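For contrast, here’s the same demo sketched with a #temp table, which at least can have column-level statistics behind its primary key. Whether the histogram is fresh at compile time depends on the usual statistics thresholds, so treat this as an experiment to run rather than a promise:
CREATE TABLE #t (id INT PRIMARY KEY);

INSERT #t (id)
VALUES (22656);

SELECT AVG(p.Score * 1.) AS lmao
FROM #t AS t
JOIN dbo.Posts AS p
    ON p.OwnerUserId = t.id
JOIN dbo.Comments AS c
    ON c.UserId = t.id
OPTION (RECOMPILE);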
C’est la vie mon ami
You have a database. Data is likely skewed in that database, and there are already lots of ways that you can get bad guesses. Parameter sniffing, out of date stats, poorly written queries, and more.
Databases are hard.
The point is that if you use table variables outside of carefully tested circumstances, you’re just risking another bad guess.
All of this is tested on SQL Server 2019, with table variable deferred compilation enabled. All that does is make the guessed number of rows accurate. It makes no attempt to get the contents of those rows correct.
So next time you’re sitting down to choose between a temp table and a table variable, think long and hard about what you’re going to be doing with it. If cardinality estimation might be important, you’re probably going to want a temp table instead.
Thanks for reading!
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.
The important thing to understand about parallelism is that it’s great when appropriate. Striking a balance between what should go parallel, the CPU usage it’s allowed, and what should stay serial can be tough.
It can be especially difficult when parameter sniffing comes into play. Here are a couple scenarios:
For a small amount of data, a serial query plan runs quickly and uses relatively few resources
For a large amount of data, a query re-uses the serial plan and runs for >20 seconds
For a small amount of data, a re-used parallel plan overloads the server due to many concurrent sessions
For a large amount of data, the re-used parallel plan finishes in 2-3 seconds
What do you do? Which plan do you favor? It’s an interesting scenario. Getting a single query to run faster by going parallel may seem ideal, but you need extra CPU, and potentially many more worker threads to accomplish that.
In isolation, you may think you’ve straightened things out, but under concurrency you run out of worker threads.
There are ways to address this sort of parameter sniffing, which we’ll get to at some point down the line.
Wrecking Crew
One way to artificially slow down a query is to use some construct that will inhibit parallelism when it would be appropriate.
There are some exotic reasons why a query might not go parallel, but quite commonly scalar valued functions and inserts to table variables are the root cause of otherwise parallel-friendly queries staying single-threaded and running for a long time.
While yes, some scalar valued functions can be inlined in SQL Server 2019, not all can. The list of ineligible constructs has grown quite a bit, and will likely continue to. It’s a feature I love, but it’s not a feature that will fix everything.
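If you’re on 2019 and curious whether your scalar functions made the cut, sys.sql_modules exposes an is_inlineable column. A quick sketch, nothing fancy:
SELECT OBJECT_SCHEMA_NAME(m.object_id) AS schema_name,
       OBJECT_NAME(m.object_id) AS function_name,
       m.is_inlineable
FROM sys.sql_modules AS m
JOIN sys.objects AS o
    ON o.object_id = m.object_id
WHERE o.type = 'FN'; --scalar user-defined functions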
Databases are hard.
XML Fetish
You don’t need to go searching through miles of XML to see it happening, either.
All you have to do is what I’ve been telling you all along: Look at those operator properties. Either hit F4, or right click and choose the properties of a select operator.
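If you’d rather go fishing across the plan cache than click through plans one at a time, a rough sketch like this (adjust the TOP and filtering to taste) will surface cached plans that carry a NonParallelPlanReason attribute:
WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT TOP (50)
    qp.query_plan.value('(//QueryPlan/@NonParallelPlanReason)[1]', 'NVARCHAR(200)') AS non_parallel_plan_reason,
    st.text AS query_text,
    qp.query_plan
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE qp.query_plan.exist('//QueryPlan[@NonParallelPlanReason]') = 1;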
Where I see these “performance surprises!” pop up is often when either:
Developers develop on a far smaller amount of data than production contains
Vendors have clients with high variance in database size and use
In both cases, small implementations likely mask the underlying performance issues, and they only pop up when run against bigger data. The whole “why doesn’t the same code run fast everywhere” question.
Well, not all features are created equally.
Simple Example
This is where table variables catch people off-guard. Even the “I swear I don’t put a lot of rows in them” crowd may not realize that the process to get down to very few rows is impacted by these @features.
SELECT TOP (1000) c.Id
FROM dbo.Comments AS c
ORDER BY c.Score DESC;
This query, on its own, is free to go parallel — and it does! It takes about 4.5 seconds to do so. It’s intentionally simple.
Once we try to involve a @table variable insert, parallelism goes away, time increases 3 fold, and the non-parallel plan reason is present in the plan XML.
DECLARE @t TABLE(id INT);
INSERT @t ( id )
SELECT TOP (1000) c.Id
FROM dbo.Comments AS c
ORDER BY c.Score DESC;
Truesy
This can be quite a disappointing ramification for people who love to hold themselves up as responsible table variable users. The same will occur if you need to update or delete from a @table variable. Though less common, and perhaps less in need of parallelism, I’m including it here for completeness.
This is part of why multi-statement table valued functions, which return @table variables, can make performance worse.
To be clear, this same limitation does not exist for #temp tables.
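For reference, a quick sketch of the same load aimed at a #temp table instead. The plan is still allowed to go parallel; whether the insert operator itself also runs in parallel comes down to the rules from the parallel insert post earlier (TABLOCK, no indexes, and so on):
CREATE TABLE #t (id INT);

INSERT #t WITH (TABLOCK) (id)
SELECT TOP (1000) c.Id
FROM dbo.Comments AS c
ORDER BY c.Score DESC;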
Anyway, this post went a little longer than I thought it would, so we’ll look at scalar functions in tomorrow’s post to keep things contained.
Thanks for reading!
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.
Rather than just give you that one-row estimate, SQL Server 2019’s table variable deferred compilation will wait until you’ve loaded data in, and then it will use table cardinality for things like joins to the table variable. Just be careful when you use them in stored procedures.
That can be a lot more helpful than what you currently get, but the guesses aren’t quite as helpful when you start using a where clause, because there still aren’t column-level statistics. You get the unknown guess for those.
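A rough sketch of what that looks like, reusing the Comments table from earlier. The join-friendly part is that the optimizer knows how many rows got loaded; the WHERE clause below still falls back to a fixed guess, because there’s no histogram on the column:
DECLARE @t TABLE (id INT);

INSERT @t (id)
SELECT TOP (1000) c.Id
FROM dbo.Comments AS c
ORDER BY c.Score DESC;

SELECT COUNT(*)
FROM @t AS t
WHERE t.id > 0; --no column statistics, so this predicate gets a fixed-style guess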
How Can You Test It Out Before SQL Server 2019?
You can use #temp tables.
That’s right, regular old #temp tables.
They’ll give you nearly the same results as Table Variable Deferred Compilation in most cases, and you don’t need trace flags, hints, or SQL Server 2019.
Heck, you might even fall in love with ’em and live happily ever after.
The Fine Print
I know, some of you are out there getting all antsy-in-the-pantsy about all the SQL Jeopardy differences between temp tables and table variables.
I also realize that this may seem overly snarky, but hear me out:
Sure, there are some valid reasons to use table variables at times. But to most people, the choice about which one to use is either a coin flip or a copy/paste of what they saw someone else do in other parts of the code.
In other words, there’s not a lot of thought, and probably no adequate testing behind the choice. Sort of like me with tattoos.
Engine enhancements like this that benefit people who can’t change the code (or who refuse to change the code) are pretty good indicators of just how irresponsible developers have been with certain ✌features✌. I say this because I see it week after week after week. The numbers in Azure had to have been profound enough to get this worked on, finally.
I can’t imagine how many Microsoft support tickets have been RCA’d as someone jamming many-many rows in a table variable, with the only reasonable solution being to use a temp table instead.
I wish I could say that people learned the differences a lot faster when they experienced some pain, but time keeps proving that’s not the case. And really, it’s hardly ever the software developers who feel it with these choices: it’s the end users.
Thanks for reading!
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.
Normally, I link people to this post by Kendra and this post by Paul when I need to point them to information about what goes wrong with local variables. They’re both quite good, but I wanted something a little more specific to the situation I normally see with people locally, along with some fixes.
First, some background:
In a stored procedure (and even in ad hoc queries or within dynamic SQL, like in the examples linked above), if you declare a variable within that code block and use it as a predicate later, you will get either a fixed guess for cardinality, or a less-confidence-inspiring estimate than when the histogram is used.
The local variable effect discussed in the rest of this post produces the same behavior as the OPTIMIZE FOR UNKNOWN hint, or executing queries with sp_prepare. I have that emphasized here because I don’t want to keep qualifying it throughout the post.
That estimate will be based on the number of rows in the table, and the “All Density” of the column multiplied together, for single equality predicates. The process for multiple predicates depends on which cardinality estimation model you’re using.
CREATE INDEX flubber
ON dbo.Posts(OwnerUserId);
DBCC SHOW_STATISTICS(Posts, flubber);
For example, this query using a single local variable with a single equality:
DECLARE @oui INT = 22656;
SELECT COUNT(*) FROM dbo.Posts AS p WHERE p.OwnerUserId = @oui;
Will get an estimate of 11.9-ish, despite 27,901 rows matching over here in reality.
Which can be replicated like so, using the numbers from the screenshot up yonder.
SELECT (6.968291E-07 * 17142169) AS [?]
Several Different Levels
You can replicate the “All Density” calculation by doing this:
SELECT (1 /
CONVERT(FLOAT, COUNT(DISTINCT p.OwnerUserId))
) AS [All Density]
FROM dbo.Posts AS p
GO
Notice I didn’t call the estimate “bad”. Even though it often is quite bad, there are some columns where the distribution of values will be close enough to this estimate for it not to matter terribly for plan shape, index choice, and overall performance.
Don’t take this as carte blanche to use this technique; quite the opposite. If you’re going to use it, it needs careful testing across a variety of inputs.
Why? Because confidence in estimates decreases as they become based on less precise information.
In these estimates we can see a couple optimizer rules in action:
Inclusion: We assume the value is there — the alternative is ghastly
Uniformity: The data will have an even distribution of unique values
For ranges (>, >=, <, <=), LIKE, BETWEEN, and <>, there are different fixed guesses.
These numbers may change in the future, but up through 2019 this is what my testing resulted in.
Heck, maybe this behavior will be alterable in the future :^)
No Vector, No Estimate
A lot of people (myself included) will freely interchange “estimate” and “guess” when talking about this process. To the optimizer, there’s a big difference.
An estimate represents a process where math formulas with strange fonts that I don’t understand are used to calculate cardinality.
A guess represents a breakdown in that process, where the optimizer gives up, and a fixed number is used.
Say there’s no “density vector” available for the column used in an equality predicate. Maybe you have auto-create stats turned off, or asynchronous statistics creation is on and this is the first compilation.
You get a guess, not an estimate.
ALTER DATABASE StackOverflow2013 SET AUTO_CREATE_STATISTICS OFF;
GO
DECLARE @oui INT = 22656;
SELECT COUNT(*) FROM dbo.Posts AS p WHERE p.OwnerUserId = @oui;
SELECT COUNT(*) FROM dbo.Posts AS p WHERE p.OwnerUserId = @oui OPTION(USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION'));
GO
ALTER DATABASE StackOverflow2013 SET AUTO_CREATE_STATISTICS ON;
GO
Using the new cardinality estimator (CE), which Microsoft has quite presumptuously started calling the Default CE, I get a guess of 4,140.
Using the legacy CE, which maybe I’ll start referring to as the Best CE, to match the presumptuousness of Microsoft, I get a guess of 266,409.
Though neither one is particularly close to the reality of 27,901 rows, we can’t expect a good guess because we’re effectively poking the optimizer in the eyeball by not allowing it to create statistics, and by using a local variable in our where clause.
These things would be our fault, regardless of the default-ness, or best-ness, of the estimation model.
If you’re keen on calculating these things yourself, you can do the following:
SELECT POWER(CONVERT(FLOAT, 17142169), 0.75) AS BEST_CE;
SELECT SQRT(CONVERT(FLOAT, 17142169)) AS default_ce_blah_whatever;
SELECT COUNT_BIG(*)
FROM dbo.Posts AS p
WHERE p.CreationDate = p.CommunityOwnedDate;
SELECT COUNT_BIG(*)
FROM dbo.Posts AS p
WHERE p.CreationDate = p.CommunityOwnedDate
OPTION(USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION'));
The so-called “default” CE thinks 1,714,220 rows will match for a column-equals-column comparison, and the “legacy” CE thinks 6.44248 rows will match, assuming that histograms are available for both of these queries.
How many actually match? 59,216.
I never said this was easy, HOWEVER!
Ahem.
The “legacy” CE estimate comes from advanced maths that only people who vape understand, while the so-called “default” CE just guesses ten percent, in true lazybones fashion. “You treat your stepmother with respect, Pantera!“, as a wise man once said.
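If you want to sanity-check the ten percent bit against the Posts row count used earlier in this post, the napkin math lines up, give or take display rounding in the plan:
SELECT CONVERT(FLOAT, 17142169) * 0.10 AS default_ce_column_equals_column_guess; --roughly the 1,714,220 estimate above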
Second, what we want to happen:
Code that uses literals, parameters, and other sniff-able forms of predicates uses the statistics histogram, which typically has far more valuable information about data distribution for a column. No, histograms aren’t always perfect, and sure, estimates can still be off if we use this, but that’s a chance I’m willing to take.
Even if they’re out of date. Maybe. Maybe not.
Look, just update those statistics.
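If you need the incantation, something along these lines will do it (FULLSCAN is optional; it’s just the thorough version):
UPDATE STATISTICS dbo.Posts WITH FULLSCAN;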
Like I mentioned before, these estimates typically have higher confidence levels because they’re often based on more precise details about the data.
If I had to rank them:
Direct histogram step hits for an equality
Intra-step hits for an equality
Direct histogram step hits for a range
Intra-step hits for a range
Inequalities (not equal to)
Joins
1000 other things
All the goofy stuff you people do to make this more difficult, like wrapping columns in functions, mismatching data types, using local variables, etc.
Of course, parameterized code does open us up to parameter sniffing issues, which I’m not addressing in this post. My only goal here is to teach people how to get out of performance jams caused by local variables giving you bad-enough estimates. Ha ha ha.
Plus, there’s a lot of negativity out there already about parameter sniffing. A lot of the time it does pretty well, and we want it to happen.
Over-Under
The main issue with the local variable/density vector estimates is that they most often don’t align well with reality, and they’re almost certainly a knee-jerk reaction to a parameter sniffing problem, or done out of ignorance of the repercussions. It would be tedious to walk through all of the potential plan quality issues that could arise from doing this, though I did record a video about one of them here.
Instead of doing all that stuff, I’d rather walk through what works and what doesn’t when it comes to fixing the problem.
But first, what doesn’t work!
Temporary Objects Don’t Usually Work
If you put the value of the local variable in a #temp table, you can fall victim to statistics caching. If you use a @table variable, you don’t get any column-level statistics on what values go in there (even with a recompile hint or trace flag 2453, you only get table cardinality).
There may be some circumstances where a #temp table can help, or can get you a better plan, but they’re probably not my first stop on the list of fixes.
The #temp table will require a uniqueness constraint to work
This becomes more and more difficult if we have multiple local variables to account for
And if they have different data types, we need multiple #temp tables, or wide tables with a column and constraint per parameter
From there, we end up with difficulties linking those values in our query. Extra joins, subqueries, etc. all have potential consequences.
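For reference, the pattern under discussion looks roughly like this. It’s a sketch only, and all the caveats above about statistics caching and extra joins still apply:
DECLARE @oui INT = 22656;

CREATE TABLE #ids (id INT NOT NULL PRIMARY KEY);

INSERT #ids (id)
VALUES (@oui);

SELECT COUNT(*)
FROM dbo.Posts AS p
JOIN #ids AS i
    ON i.id = p.OwnerUserId;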
Inline Table Valued Functions Don’t Work
They’re a little too inline here, and they use the density vector estimate. See this gist for a demo.
Recompile Can Work, But Only Do It For Problem Statements
It has to be a statement-level recompile, using OPTION(RECOMPILE). Putting recompile as a stored procedure creation option, i.e. WITH RECOMPILE, will not allow for parameter embedding optimizations.
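To make the distinction concrete, here’s a sketch of both shapes. The procedure names are made up for illustration, and the query mirrors the @ParentId example in the screenshots below:
--Statement-level recompile: the parameter embedding optimization applies,
--so @ParentId can be treated like a literal when the statement compiles
CREATE OR ALTER PROCEDURE dbo.statement_level_recompile (@ParentId INT)
AS
BEGIN
    SELECT COUNT(*)
    FROM dbo.Posts AS p
    WHERE p.ParentId = @ParentId
    OPTION (RECOMPILE);
END;
GO

--Procedure-level recompile: the plan gets thrown away every execution,
--but the parameter is not embedded as a literal
CREATE OR ALTER PROCEDURE dbo.procedure_level_recompile (@ParentId INT)
WITH RECOMPILE
AS
BEGIN
    SELECT COUNT(*)
    FROM dbo.Posts AS p
    WHERE p.ParentId = @ParentId;
END;
GO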
The tool tip on the left is from a plan with a statement-level recompile. On the right is from a plan with a procedure-level recompile. In the statement-level recompile plan, we can see the scalar operator is a literal value. In the procedure-level recompile, we still see @ParentId passed in.
The difference is subtle, but exists. I prefer statement-level recompiles, because it’s unlikely that every statement in a procedure should or needs to be recompiled, unless it’s a monitoring proc or something else with no value to the plan cache.
Targeting specific statements is smarterer.
Erer.
A more detailed examination of this behavior is at Paul’s post, linked above.
Dynamic SQL Can Work
Depending on complexity, it may be more straightforward to use dynamic SQL as a receptacle for your variables-turned-parameters.
CREATE PROCEDURE dbo.game_time(@id INT)
AS BEGIN
DECLARE @id_fix INT;
SET @id_fix = CASE WHEN @id < 0 THEN 1 ELSE @id END;
DECLARE @sql NVARCHAR(MAX) = N'';
SET @sql += N'SELECT COUNT(*) FROM dbo.Posts AS p WHERE p.OwnerUserId = @id;';
EXEC sys.sp_executesql @sql, N'@id INT', @id_fix
END;
Separate Stored Procedures Can Work
If you need to declare variables internally and perform some queries to assign values to them, passing them on to separate stored procedures can avoid the density estimates. The stored procedure occurs in a separate context, so all it sees are the values passed in as parameters, not their origins as variables.
In other words, parameters can be sniffed; variables can’t.
CREATE PROCEDURE dbo.game_time(@id INT)
AS
BEGIN
DECLARE @id_fix INT;
SET @id_fix = CASE WHEN @id < 0 THEN 1 ELSE @id END;
EXEC dbo.some_new_proc @id_fix;
END;
Just pretend the dynamic SQL from above occupies the stored procedure dbo.some_new_proc here.
Optimizing For A Value Can Work
But choosing that value is hard. If one is feeling ambitious, one could take the local parameter value, compare it to the histogram on one’s own, then choose a value on one’s own that, one, on their own, could use to determine if a specific, common, or nearby value would be best to optimize for, using dynamic SQL that one has written on one’s own.
Ahem.
CREATE PROCEDURE dbo.game_time(@id INT)
AS BEGIN
DECLARE @id_fix INT;
SET @id_fix = CASE WHEN @id < 0 THEN 1 ELSE @id END;
DECLARE @a_really_good_choice INT;
SET @a_really_good_choice = 2147483647; --The result of some v. professional code IRL.
DECLARE @sql NVARCHAR(MAX) = N'';
SET @sql += N'SELECT COUNT(*) FROM dbo.Posts AS p WHERE p.OwnerUserId = @id OPTION(OPTIMIZE FOR(@id = [a_really_good_choice]));';
SET @sql = REPLACE(@sql, N'[a_really_good_choice]', CONVERT(NVARCHAR(11), @a_really_good_choice));
EXEC sys.sp_executesql @sql, N'@id INT', @id_fix;
END;
GO
Wrapping Up
This post aimed to give you some ways to avoid getting bad density vector estimates with local variables. If you’re getting good guesses, well, sorry you had to read all this.
When I see this pattern in client code, it’s often accompanied by comments about fixing parameter sniffing. While technically accurate, it’s more like plugging SQL Server’s nose with cotton balls and Lego heads.
Sometimes there will be several predicate filters that diminish the impact of estimates not using the histogram. Often a fairly selective predicate evaluated first is enough to make this not suck too badly. However, it’s worth learning about, and learning how to fix it correctly.
Thanks for reading!
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.
You already know that your temp table needs an index. Let’s say there’s some query plan ouchie from not adding one. You’ve already realized that you should probably use a clustered index rather than a nonclustered index. Adding a nonclustered index leaves you with a heap and an index, and there are a lot of times when nonclustered indexes won’t be used because they don’t cover the query columns enough.
Do not explicitly drop temp tables at the end of a stored procedure; they will get cleaned up when the session that created them ends.
Do not alter temp tables after they have been created.
Do not truncate temp tables.
Move index creation statements on temp tables to the new inline index creation syntax that was introduced in SQL Server 2014 (a quick sketch of that syntax follows this list).
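Here’s what that inline syntax looks like, sketched with a throwaway table:
CREATE TABLE #ids
(
    id INT NOT NULL,
    INDEX c CLUSTERED (id) --the index is declared right in the CREATE TABLE statement
);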
Where inline index creation can be a bad option is:
If you can’t get a parallel insert even with a TABLOCK hint
Sorting the data to match index order on insert could result in some discomfort
After Creation
This is almost always not ideal, unless you want to avoid caching the temp table, and want the recompilation to occur for whatever reason.
It’s not that I’d ever rule this out as an option, but I’d wanna have a good reason for it.
Probably even several.
After Insert
This can sometimes be a good option if the query plan you get from inserting into the index is deficient in some way.
Like I mentioned up above, maybe you lose parallel insert, or maybe the DML Request Sort is a thorn in your side.
This can be awesome! Except on Standard Edition, where you can’t create indexes in parallel. Which picks off one of the reasons for doing this in the first place, and also potentially causes you headaches with temp tables not being cached, and with statement-level recompiles.
One upside of creating the index after the load: if you insert data into a temp table that already has an index, and then run a query that causes statistics generation, you’ll almost certainly get the default sampling rate, which could potentially cause other annoyances. Creating the index after loading data means you get the full scan stats.
Hooray, I guess.
This may not ever be the end of the world, but here’s a quick example:
DROP TABLE IF EXISTS #t;
GO
--Create a table with an index already on it
CREATE TABLE #t(id INT, INDEX c CLUSTERED(id));
--Load data
INSERT #t WITH(TABLOCK)
SELECT p.OwnerUserId
FROM dbo.Posts AS p;
--Run a query to generate statistics
SELECT COUNT(*)
FROM #t AS t
WHERE t.id BETWEEN 1 AND 10000
GO
--See what's poppin'
SELECT hist.step_number, hist.range_high_key, hist.range_rows,
hist.equal_rows, hist.distinct_range_rows, hist.average_range_rows
FROM tempdb.sys.stats AS s
CROSS APPLY tempdb.sys.dm_db_stats_histogram(s.[object_id], s.stats_id) AS hist
WHERE OBJECT_NAME(s.object_id, 2) LIKE '#t%'
GO
DROP TABLE #t;
--Create a table with no index
CREATE TABLE #t(id INT NOT NULL);
--Load data
INSERT #t WITH(TABLOCK)
SELECT p.OwnerUserId
FROM dbo.Posts AS p;
--Create the index
CREATE CLUSTERED INDEX c ON #t(id);
--Run a query to generate statistics
SELECT COUNT(*)
FROM #t AS t
WHERE t.id BETWEEN 1 AND 10000
--See what's poppin'
SELECT hist.step_number, hist.range_high_key, hist.range_rows,
hist.equal_rows, hist.distinct_range_rows, hist.average_range_rows
FROM tempdb.sys.stats AS s
CROSS APPLY tempdb.sys.dm_db_stats_histogram(s.[object_id], s.stats_id) AS hist
WHERE OBJECT_NAME(s.object_id, 2) LIKE '#t%'
GO
DROP TABLE #t;
On the left is the first 20 steps from the first histogram, and on the right is the first 20 from the second one.
You can see some big differences — whether or not they end up helping or hurting performance would take a lot of different tests. Quite frankly, it’s probably not where I’d start a performance investigation, but I’d be lying if I told you it never ended up there.
All Things Considerateded
In general, I’d stick to using the inline index creation syntax. If I had to work around issues with that, I’d create the index after loading data, but being on Standard Edition brings some additional considerations around parallel index creation.
Thanks for reading!
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.
This is the part of SQL Server I often find myself rolling my eyes at: poor feature interoperability, and that V1 smell that… never seems to turn into that V2 smell.
The full script is hosted here. I don’t want a tedious post full of setting up partitioning, etc.
I wanna get to the stuff that you might care about later.
If You Wanna…
The utility of Partitioning is being able to quickly switch partitions in and out. Data management. Not query performance (unless you’re using columnstore).
If you want to do that with temporal tables, your staging and “out” tables need to match exactly, down to the temporal-ness.
For example, this won’t work:
CREATE TABLE dbo.Votes_Deletes_Stage
(
Id INT NOT NULL,
PostId INT NOT NULL,
UserId INT NULL,
BountyAmount INT NULL,
VoteTypeId INT NOT NULL,
CreationDate DATETIME NOT NULL,
SysStartTime DATETIME2(7) NOT NULL, --Versioning column
SysEndTime DATETIME2(7) NOT NULL --Versioning column
CONSTRAINT dr_rockzo_Stage PRIMARY KEY CLUSTERED (CreationDate, Id) ON [PRIMARY]
) ON [PRIMARY];
You gotta have all the same stuff you used to get your partitioned table set up for temporal-ness.
CREATE TABLE dbo.Votes_Deletes_Stage
(
Id INT NOT NULL,
PostId INT NOT NULL,
UserId INT NULL,
BountyAmount INT NULL,
VoteTypeId INT NOT NULL,
CreationDate DATETIME NOT NULL,
SysStartTime DATETIME2(7) GENERATED ALWAYS AS ROW START HIDDEN
CONSTRAINT df_VotesDeletes_Stage_SysStartTime
DEFAULT SYSDATETIME(),
SysEndTime DATETIME2(7) GENERATED ALWAYS AS ROW END HIDDEN
CONSTRAINT df_VotesDeletes_Stage_SysEndTime
DEFAULT CONVERT(DATETIME2(7), '9999-12-31 23:59:59.9999999')
CONSTRAINT dr_rockzo_Stage PRIMARY KEY CLUSTERED (CreationDate, Id) ON [PRIMARY],
PERIOD FOR SYSTEM_TIME([SysStartTime], [SysEndTime])
) ON [PRIMARY];
Then If You Wanna…
Switch data in or out, you have to turn off the temporal-ness.
Msg 13546, Level 16, State 1, Line 97
Switching out partition failed on table ‘DeletesDemo.dbo.Votes_Deletes’ because it is not a supported operation on system-versioned tables. Consider setting SYSTEM_VERSIONING to OFF and trying again.
“Consider turning off the feature that takes forever to turn back on with large tables so you can do the thing partitioning does quickly”
Don’t worry, the color red you’re seeing is totally natural.
And hey, once you’ve turned it off, you can swap a partition in or out.
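Pieced together, that part of the dance looks something like this. It’s a sketch using this demo’s table names, and the partition number is just whichever one you’re after:
--Turn off the temporal-ness first (the slow, annoying part on big tables)
ALTER TABLE dbo.Votes_Deletes SET (SYSTEM_VERSIONING = OFF);
GO

--Now the switch out is allowed; partition 8 here is only an example
ALTER TABLE dbo.Votes_Deletes
    SWITCH PARTITION 8
    TO dbo.Votes_Deletes_Stage;
GO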
A Normal Partitioning Problem
The table holding the data you’re going to switch in needs to have a check constraint on it that tells SQL Server the data is valid for the partition you’re swapping it into.
Msg 4982, Level 16, State 1, Line 105
ALTER TABLE SWITCH statement failed. Check constraints of source table ‘DeletesDemo.dbo.Votes_Deletes_Stage’ allow values that are not allowed by range defined by partition 8 on target table ‘DeletesDemo.dbo.Votes_Deletes’.
The thing is, this error message sucks. It sucks all the suck. Especially when dealing with temporal tables, you might think something odd happened with the constraints on your versioning columns. They both have constraints on them. WHICH CONSTRAINT IS THE PROBLEM?
If you’re new to Partitioning, you may not have ever switched data into or out of a table before. This error message can be a real head-scratcher.
The fix is to add a check constraint to your staging table — the one you’re swapping in — that tells Partitioning about what’s in the table. In my case, I have the Votes_Deletes table partitioned by CreationDate, by one year ranges. For me, Partition 8 contains values for the year 2013. To make sure it’s safe to swap my staging partition into the partition for that year, it needs a constraint:
ALTER TABLE dbo.Votes_Deletes_Stage
ADD CONSTRAINT ck_yrself
CHECK (CreationDate >= '20130101' AND CreationDate < '20140101'
AND CreationDate IS NOT NULL);
And You Should Probably
Turn the temporal-ness back on. When you do that, you have an option. Do you want to make sure your data is consistent?
ALTER TABLE dbo.Votes_Deletes SET (SYSTEM_VERSIONING = ON
    (HISTORY_TABLE = dbo.Votes_Deletes_History,
     DATA_CONSISTENCY_CHECK = ON));
If you don’t, re-enabling is instant. Buuuuuuut you take the chance that some data in your table may have changed while you were tooting around trying to get partitions swapped in and out. I have no idea what the ramifications of skipping the check might be. In the context of this post, probably nothing. If you’ve got a full history table and the specter of changes during this whole shebang…
This is what the query plan for turning it back on looks like.
There’s nothing in the history table. If there were, this could get really bad (trust me, ha ha ha). What checks do we do when the history table is empty?
Le Shrug, as they say in the land of protest.
But a 53 million row assert sure does take a couple minutes.
Thanks for reading!
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.