SQL Server Foreign Keys Don’t Always Improve Query Plans

Short and Lousy


This is one of the most frustrating things I’ve seen from the optimizer in quite a while.

Here are a couple tables, with a foreign key between them:

CREATE TABLE dbo.ct(id int PRIMARY KEY, dt datetime);

CREATE TABLE dbo.ct_fk(id int PRIMARY KEY, dt datetime);

ALTER TABLE dbo.ct ADD CONSTRAINT
    ct_c_fk FOREIGN KEY (id) REFERENCES dbo.ct_fk(id);
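A quick aside: the optimizer will only consider this kind of join elimination when the constraint is trusted. If these tables ever get bulk loaded with the constraint disabled, it's worth a peek at sys.foreign_keys before blaming anyone (a minimal sketch):

SELECT
    fk.name,
    fk.is_disabled,
    fk.is_not_trusted
FROM sys.foreign_keys AS fk
WHERE fk.parent_object_id = OBJECT_ID(N'dbo.ct');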

When we use the EXISTS clause, join elimination occurs normally:

SELECT COUNT_BIG(*) AS [?]
FROM dbo.ct AS c
WHERE EXISTS
(
    SELECT 1/0
    FROM dbo.ct_fk AS cf
    WHERE cf.id = c.id
);

[Query plan screenshot: all the chickens]

But when we use NOT EXISTS, it… doesn’t.

SELECT COUNT_BIG(*) AS [?]
FROM dbo.ct AS c
WHERE NOT EXISTS
(
    SELECT 1/0
    FROM dbo.ct_fk AS cf
    WHERE cf.id = c.id
);

[Query plan screenshot]

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

But When Did We Capture That Changed Data?

Where CDC Can Fall Short


There are two common questions people ask about changed data that CDC doesn’t support very naturally:

  • When did it change?
  • Who changed it?

While it is possible to join to cdc.lsn_time_mapping to get that information, you may be dealing with a box product that doesn’t support that functionality.

Or something.

Hypothetically.
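For reference, if you are able to lean on it, the join for the time side of things looks roughly like this (a sketch, using the Posts capture instance that shows up later in this post):

SELECT
    ct.__$start_lsn,
    ct.__$operation,
    ltm.tran_begin_time,
    ltm.tran_end_time
FROM cdc.dbo_Posts_CT AS ct
JOIN cdc.lsn_time_mapping AS ltm
    ON ltm.start_lsn = ct.__$start_lsn;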

They’re Just Tables


The kinda funny thing about the tables that all of your changed data ends up in is that… they’re regular tables.

They’re not in the sys schema, they’re in the cdc schema. You can alter them in all sorts of ways. You can drop and truncate them if you want. I mean, not in that order, but you get my point. There’s no special protection for them.

That means you can add a column like this to them to track when the rows ended up in there. This also saves you from altering production tables to account for when things change.

ALTER TABLE cdc.dbo_Posts_CT 
    ADD ChangeTime datetime DEFAULT SYSDATETIME();

Oversight


I do think it’s a pretty big oversight to not have a column like this in the table already, but maybe people use CDC in far different ways than I normally see it used.

Or maybe not, and everyone has to come up with some workaround like this to deal with it. It could be that CDC data ends up like a lot of other data, and people get really excited about having it, but never actually look at it.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Indexed Views As Filtered Indexes In SQL Server

Pssst!


If you landed here from Brent’s weekly links, use this link to get my training for 90% off.

The access is for life, but this coupon code isn’t! Get it while it lasts.

Discount applies at checkout, and you have to purchase everything for it to apply.

A Persistent Frustration


SQL Server comes with some great features for tuning queries:

  • Computed Columns
  • Filtered Indexes
  • Indexed Views

But there’s an interoperability issue when you try to use things together. You can’t create a filtered index with the filter definition on a computed column, nor can you create a filtered index on an indexed view.

If you find yourself backed into a corner, you may need to consider using an indexed view without any aggregation (aggregation being the more typical use case).

Empty Tables


If we try to do something like this, we’ll get an error.

DROP TABLE IF EXISTS dbo.indexed_view;
GO

CREATE TABLE dbo.indexed_view
(
    id int PRIMARY KEY,
    notfizzbuzz AS (id * 2)
);
GO

CREATE INDEX nfb 
    ON dbo.indexed_view (notfizzbuzz) 
WHERE notfizzbuzz = 0;
GO

Yes, I’m putting the error message here for SEO bucks.

Msg 10609, Level 16, State 1, Line 19
Filtered index 'nfb' cannot be created on table 'dbo.indexed_view' because the column 'notfizzbuzz' in the filter expression is a computed column. 
Rewrite the filter expression so that it does not include this column.

An Indexed View Doesn’t Help


If we run this to create an indexed view on top of our base table, we still can’t create a filtered index, but there’s a different error message.

CREATE OR ALTER VIEW dbo.computed_column
WITH SCHEMABINDING
AS
SELECT
    iv.id, 
    iv.notfizzbuzz
FROM dbo.indexed_view AS iv;
GO 

CREATE UNIQUE CLUSTERED INDEX c 
    ON dbo.computed_column(id);

CREATE INDEX nfb 
    ON dbo.computed_column(notfizzbuzz) 
WHERE notfizzbuzz = 0;

Msg 10610, Level 16, State 1, Line 37
Filtered index 'nfb' cannot be created on object 'dbo.computed_column' because it is not a user table. 
Filtered indexes are only supported on tables. 
If you are trying to create a filtered index on a view, consider creating an indexed view with the filter expression incorporated in the view definition.

But what a thoughtful error message it is! Thanks, whoever wrote that.

Still Needs Help


We can create this indexed view just fine.

CREATE OR ALTER VIEW dbo.computed_column
WITH SCHEMABINDING
AS
SELECT
    iv.id, 
    iv.notfizzbuzz
FROM dbo.indexed_view AS iv
WHERE iv.notfizzbuzz = 0;
GO 

CREATE UNIQUE CLUSTERED INDEX c 
    ON dbo.computed_column(id);

But if we try to select from it, the view is expanded.

SELECT
    cc.id, 
    cc.notfizzbuzz
FROM dbo.computed_column AS cc
WHERE cc.notfizzbuzz = 0;

[Query plan screenshot: upstate]

The issue here is the simple parameterization that is attempted with the trivial plan.

If we run the query like this, and look at the end of the output, we’ll see a message at the bottom that our query is safe for auto (simple) parameterization. This may still happen even if the plan doesn’t remain trivial (more detail at the link above!)

DBCC FREEPROCCACHE;
GO 
DBCC TRACEON(8607, 3604);
GO 
SELECT
    cc.id, 
    cc.notfizzbuzz
FROM dbo.computed_column AS cc
WHERE cc.notfizzbuzz = 0;
DBCC TRACEOFF(8607, 3604);
GO 

********************
** Query marked as Cachable
** Query marked as Safe for Auto-Param
********************

Making It Work


The two ways we can run this query to get the indexed view to be used are like so:

SELECT
    cc.id, 
    cc.notfizzbuzz
FROM dbo.computed_column AS cc WITH(NOEXPAND)
WHERE cc.notfizzbuzz = 0;


SELECT
    cc.id, 
    cc.notfizzbuzz
FROM dbo.computed_column AS cc
WHERE cc.notfizzbuzz = 0
AND   1 = (SELECT 1);

[Query plan screenshot: thanks i guess]

A Closer Look


If we put those two queries through the wringer, we’ll still see auto (simple) parameterization from the first query:

DBCC FREEPROCCACHE;
GO 
DBCC TRACEON(8607, 3604);
GO 
SELECT
    cc.id, 
    cc.notfizzbuzz
FROM dbo.computed_column AS cc WITH(NOEXPAND)
WHERE cc.notfizzbuzz = 0;
GO 
DBCC TRACEOFF(8607, 3604);
GO 

********************
** Query marked as Cachable
** Query marked as Safe for Auto-Param
********************

DBCC FREEPROCCACHE;
GO 
DBCC TRACEON(8607, 3604);
GO 
SELECT
    cc.id, 
    cc.notfizzbuzz
FROM dbo.computed_column AS cc
WHERE cc.notfizzbuzz = 0
AND   1 = (SELECT 1);
GO 
DBCC TRACEOFF(8607, 3604);
GO 

********************
** Query marked as Cachable
********************

It’s goofy, but it’s worth noting. Anyway, if I had to pick one of these methods to get the plan I want, it would be the NOEXPAND version.

Using that hint is the only thing that will allow for statistics to get generated on indexed views.
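If you want to verify that on your own system, a quick peek at sys.stats for the view is one way to do it (a minimal sketch):

SELECT
    s.name,
    s.auto_created,
    s.user_created
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID(N'dbo.computed_column');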

In case you’re wondering, marking the computed column as PERSISTED doesn’t change the outcome for any of these issues.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Normalizing SQL Server Tables To Reduce Query Blocking

Chalky


I see a lot of tables that look something like this:

CREATE TABLE dbo.orders
(
    order_id int NOT NULL PRIMARY KEY
         DEFAULT (NEXT VALUE FOR dbo.order_id),
    order_date datetime NOT NULL,
    order_ship_date datetime NOT NULL,
    order_total money NOT NULL,
    order_tax money NOT NULL,
    customer_id int NOT NULL
        DEFAULT (NEXT VALUE FOR dbo.customer_id),
    customer_fullname nvarchar(250),
    customer_street nvarchar(250),
    customer_street_2 nvarchar(250),
    customer_city nvarchar(250),
    customer_state nvarchar(250),
    customer_zip nvarchar(250),
    customer_country nvarchar(250)
);
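These table definitions lean on sequences via NEXT VALUE FOR; if you want to run them yourself, the sequences need to exist first. A minimal sketch (the address_id sequence is for the address book table further down):

CREATE SEQUENCE dbo.order_id AS int START WITH 1 INCREMENT BY 1;
CREATE SEQUENCE dbo.customer_id AS int START WITH 1 INCREMENT BY 1;
CREATE SEQUENCE dbo.address_id AS int START WITH 1 INCREMENT BY 1;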

Looking at the design, there are two big problems:

  1. There are “order” columns that are going to get a lot of inserts and updates
  2. You’re going to be storing the same customer information over and over again

The more related, but not independent, data you store in the same table, the harder it becomes to effectively index that table.

A while back, I blogged about Tables Within Tables, but uh… surprisingly, the problem still exists! Usually when I blog about something, the problem disappears. Hm.

Better But Not Best


If we follow some practical guidance and put the customer columns into their own table, we end up with this:

CREATE TABLE dbo.orders
(
    order_id int NOT NULL PRIMARY KEY
         DEFAULT (NEXT VALUE FOR dbo.order_id),
    order_date datetime NOT NULL,
    order_ship_date datetime NOT NULL,
    order_total money NOT NULL,
    order_tax money NOT NULL,
    customer_id int NOT NULL
);

CREATE TABLE dbo.customers
(
    customer_id int NOT NULL PRIMARY KEY
        DEFAULT (NEXT VALUE FOR dbo.customer_id),
    customer_fullname nvarchar(250),
    customer_street nvarchar(250),
    customer_street_2 nvarchar(250),
    customer_city nvarchar(250),
    customer_state nvarchar(250),
    customer_zip nvarchar(250),
    customer_country nvarchar(250)
);

This is a better scenario, because we only store customer information once, and inserts/updates to order information don’t impact people working with customer data.

But this still isn’t great — what if a customer wants to send an order to a different address?

If we wanted to store everything in this table, we’d be breaking other practical rules: we’d have to have multiple rows per user, or we’d have to keep adding columns to the table to deal with multiple addresses. That’s a mess both for people who don’t use all those extra columns, and for people who might have half a dozen addresses they send to.

Getting There


A better way to phrase the customer table might be like this:

CREATE TABLE dbo.customers
(
    customer_id int NOT NULL PRIMARY KEY
        DEFAULT (NEXT VALUE FOR dbo.customer_id),
    default_fullname nvarchar(250),
    default_street nvarchar(250),
    default_street_2 nvarchar(250),
    default_city nvarchar(250),
    default_state nvarchar(250),
    default_zip nvarchar(250),
    default_country nvarchar(250)
);

Most of the time, people are going to send stuff to one address — call it home if you want. It’s probably also their billing address, so it makes sense for it to be the default, and to have it be the first choice.

Then we’ll have a table of EAV data that looks like this:

CREATE TABLE dbo.customers_address_book
(
    address_id int NOT NULL
        DEFAULT (NEXT VALUE FOR dbo.address_id),
    customer_id int NOT NULL,
    address_type tinyint,
    customer_fullname nvarchar(250),
    customer_street nvarchar(250),
    customer_street_2 nvarchar(250),
    customer_city nvarchar(250),
    customer_state nvarchar(250),
    customer_zip nvarchar(250),
    customer_country nvarchar(250),
    CONSTRAINT pk_cab_id PRIMARY KEY (customer_id, address_id)
);

In a table like this, whenever a customer ships to a non-default address it gets stored off here. Now customers can have as many addresses as they want to choose from without us having to have an extra bloated table of default information plus non-default information.

Because of the way this data is modeled, we don’t need to keep adding columns to accommodate multiple addresses. We just tack rows on, and since this data isn’t likely to get updated the insert/select pattern should end up with minimal blocking.

Tomato Sauce


I know, horrifying. You might have to write a join. You poor, downtrodden developer.
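And really, the join isn’t much of a hardship; a sketch against the tables above, grabbing a customer’s default info along with any address book entries:

SELECT
    o.order_id,
    o.order_date,
    c.default_fullname,
    cab.customer_city,
    cab.customer_zip
FROM dbo.orders AS o
JOIN dbo.customers AS c
    ON c.customer_id = o.customer_id
LEFT JOIN dbo.customers_address_book AS cab
    ON cab.customer_id = c.customer_id
WHERE o.order_id = 1;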

Of course, this makes the most sense when you’re dealing with OLTP workloads. And sure, a lot of these columns probably don’t need to be as long as they are, but that’s a totally different post.

When you’re dealing with reporting data, de-normalizing is generally preferred. Though if you’re doing serious reporting and using column store indexes, I’d probably wanna keep the strings out as much as possible, and just key back to them in other tables. Yuck.

Have I ever mentioned that strings in databases were a mistake?

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

SQL Server Performance Measures And Factors

Start To Finish


When I’m working with clients, we look at a lot of stuff together to make sure things are in good working order.

If we’re tuning a single query, the goal is pretty easy to define. We want it to finish as fast as possible.

How you get there is where things may get interesting, but let’s be honest: most of the time you’re going to be fixing the same half a dozen problems in some combination.

Note that a query finishing, and the results finishing being returned to the client, are two different metrics. A query may finish very quickly, but returning the results may take considerably longer for various reasons.

You can think of this as responsiveness, and it’s a cornerstone to building something people are happy with. When you’re talking about user experience, performance has to be part of the conversation.

How fast individual queries finish also plays a big part in overall concurrency.

Put Through


Typically this matters more for OLTP workloads. Data warehouses aren’t usually high-concurrency environments, in the sense that the Batch Requests/sec counter is mostly sitting around bored. Joe Obbish has a great talk on improving data warehouse loading patterns, because usually isn’t always. Ha ha ha. Barf.

Workloads that do have OLTP characteristics are more sensitive to all sorts of things, and I don’t just mean parameters. Locking and Deadlocking are high up on the list, along with available hardware resources.

Rather obviously, if you want to improve overall throughput, having queries finish as quickly as possible is an attractive goal. You can run way more queries in one second if they take 100ms than if they take 900ms, you know? You know. You’re smart.

Factorials


What are the factors that influence query speed?

Starting in the database:

  • Well written queries
  • Thoughtful indexes
  • Maintained Statistics
  • Correctly stored data types
  • Properly normalized tables

Moving out a little bit further, there are some important settings:

  • Parallelism
  • Optimistic Isolation Levels
  • Max Server Memory
  • tempdb stuff
  • Instant File Initialization
  • Lock Pages In Memory (mostly)

Further out, we have the hardware:

  • CPU
  • Memory
  • Disk
  • Network Path To Disks (SAN)

A lot of times something being off in one place can make it look like there’s a bottleneck in another place. Probably the easiest example to conjure up is if you’re missing an opportune nonclustered index, and you end up scanning a clustered index over and over again. The clustered index doesn’t fit in memory, so you end up generating a lot of PAGEIOLATCH_SH waits.

It might look like you have a memory shortage, but in reality you could be making far better use of the memory you have.

Margin For Error


Being able to identify and fix specific bottlenecks is an important skill. Wait stats and other counters can be useful, but are often underwhelming unless you’re watching at the right time, or you have a monitoring tool that logs them for you. Bottlenecks can live in many places, and oftentimes performance tuning is like playing whack-a-mole with them.
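If you do want to eyeball them, sys.dm_os_wait_stats is the usual starting point. A minimal sketch; keep in mind the numbers are cumulative since startup, which is part of why a one-off look can underwhelm:

SELECT TOP (10)
    ws.wait_type,
    ws.waiting_tasks_count,
    ws.wait_time_ms,
    ws.signal_wait_time_ms
FROM sys.dm_os_wait_stats AS ws
ORDER BY ws.wait_time_ms DESC;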

Hitting hard limits and running out of CPU or RAM is never fun. Picking hardware that’s up to the task is step one, but staying on top of query and index tuning to make sure it stays adequate for longer is a process that you need to go through. If this is the sort of thing you need help with, drop me a line.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Change Data Capture Learning Resources For SQL Server

Everything Old Is New Again


If I have to suffer, you all have to suffer. I should probably rephrase that: if I have to suffer, I want you to suffer less. Learn from my suffering, or something.

Recently I’ve been doing some work with Change Data Capture. I’ve done stuff with it before, but I never thought to round up places I’ve picked up tips, tricks, and knowledge for dealing with it. All of the information in these is relevant today, despite most being from 10 or more years ago. The only real change I’ve seen is a __$command_id column added to the cdc.<schema>_<table>_CT table somewhere between SQL Server 2012 and 2016.

First, Change Data Capture vs Change Tracking

Second, Tuning the Performance of Change Data Capture

Third,

Fourthly, from the comments:

Blog posts by the repltalk team on CDC https://repltalk.com/tag/cdc/

I hope you find these useful.

Actually, I hope you don’t, because you never have to deal with Change Data Capture. Ha ha ha.

Let me know in the comments if I missed anything!

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Tracking Row Changes With Temporal Columns In SQL Server

Extra, Extra


Temporal tables are cool, but keeping all that history can be stressful. Change Tracking adds overhead to every transaction, and requires Snapshot Isolation to be successful. Change Data Capture can also run into problems scanning the transaction log, depending on transaction volume, etc.

Change Data Capture is also a little unfortunate in that it doesn’t track schema changes like adding or dropping columns, or changing column data types. Change Tracking doesn’t either; it just tracks the keys of what changed when data is modified.

Temporal Tables are different, because they’ll add new columns to, and drop removed columns from, the history table as they change in the base table. They’ll also push data type changes across. But the history table will not have the various permutations tracked. It just mirrors the current state.

Just The Columns, Ma’am


Let’s say your needs aren’t robust enough to need any one of those highly specialized features, or even triggers to move data around when it changes.

You’re perfectly free and able to add the tracking columns that temporal tables use to your base tables, but adding them is far from free. When I added them to the 17 million row Posts table, it took about 40 seconds. My laptop doesn’t suck, either.

To show you a little how it works, let’s create a copy of the Votes table from Stack Overflow.

CREATE TABLE dbo.Votes_Tracked
(
    Id int NOT NULL,
    PostId int NOT NULL,
    UserId int NULL,
    BountyAmount int NULL,
    VoteTypeId int NOT NULL,
    CreationDate datetime NOT NULL,
    LastModified datetime2 GENERATED ALWAYS AS ROW START NOT NULL
        CONSTRAINT DF_LastModified DEFAULT (SYSDATETIME()),
    JunkDate datetime2 GENERATED ALWAYS AS ROW END HIDDEN NOT NULL
        CONSTRAINT DF_JunkDate DEFAULT ('9999-12-31 23:59:59.9999999'),
    PERIOD FOR SYSTEM_TIME (LastModified, JunkDate),
    CONSTRAINT PK_Votes_Id
        PRIMARY KEY CLUSTERED (Id ASC)
);
GO

Note that you need two columns to define the “period for system time”, and one of them will always be useless. That’s why I called it JunkDate, and not, like, whatever. But the good news is you can define that column as HIDDEN so that it doesn’t show up in all your queries.

Now we can stick some data in there and see how it works.

INSERT 
    dbo.Votes_Tracked WITH (TABLOCK)
(
    Id,
    PostId,
    UserId,
    BountyAmount,
    VoteTypeId,
    CreationDate
)
SELECT
    Id,
    PostId,
    UserId,
    BountyAmount,
    VoteTypeId,
    CreationDate
FROM StackOverflow2013.dbo.Votes AS v
WHERE v.CreationDate >= '20130101';

Looking Around


The table data looks like this:

[Table data screenshot: woah man]

If we run an update:

UPDATE v
    SET v.BountyAmount = 9999
FROM dbo.Votes_Tracked AS v
WHERE v.VoteTypeId = 7;

[Screenshot: temporally yours]
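No screenshot needed, though; a quick check against the columns we defined shows LastModified getting bumped for the updated rows (a sketch):

SELECT
    vt.Id,
    vt.BountyAmount,
    vt.LastModified
FROM dbo.Votes_Tracked AS vt
WHERE vt.VoteTypeId = 7;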

Note that these screen caps were taken without the HIDDEN keyword added to the table definition: that was an afterthought recommended by my friend Peter.

Whatabouts?


Of course, if you remove rows from the table, they’re just gone. You’d still need a trigger to cover deletes, if you need to track those.

And if you want to remove those columns later, it takes a little bit of tweaking.

ALTER TABLE dbo.Votes_Tracked DROP CONSTRAINT DF_LastModified, DF_JunkDate;
ALTER TABLE dbo.Votes_Tracked DROP COLUMN JunkDate, LastModified;

Msg 13588, Level 16, State 1, Line 63
Column 'JunkDate' in table 'Crap.dbo.Votes_Tracked' cannot be dropped because it is a part of period definition.

Of course, the table isn’t system versioned, so this command will also fail:

ALTER TABLE dbo.Votes_Tracked SET (SYSTEM_VERSIONING = OFF);

Msg 13591, Level 16, State 1, Line 66
SYSTEM_VERSIONING is not turned ON for table 'Crap.dbo.Votes_Tracked'.

If you want to remove them, you’ll need to use this:

ALTER TABLE dbo.Votes_Tracked DROP PERIOD FOR SYSTEM_TIME;

Now you’ll be able to remove them.

Is This A Good Idea?


Well, it depends on what you need. If you just need to know when a row changed, and you don’t need to know what changed or who changed it, it can work in a more automated way than triggers. It does require an additional column, which isn’t ideal, but it’s not a gigantic column, and you don’t need to worry about indexing it because the data is junk.

I haven’t found any ridiculous downsides to this, but I’ll keep this updated if I do.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Considerations For Implementing Soft Deletes In SQL Server

From The Beginning


Implementing soft deletes for an app that’s been around for a while can be tough, in the same way that adding Partitioning later can be tough if you want to get data management value from it (rebuilding clustered indexes on the partition scheme, making sure all nonclustered indexes are aligned, and all future indexes are too, and making sure you have sufficient partitions at the beginning and end for data movement).

Boy, I really stressed those parentheses out.

If you do either one from the outset, it’s far less painful to manage. The structural stuff is there for you from the beginning, and you can test different strategies early on before data changes become difficult to manage.

Queries


The first and most obvious thing is that all your queries now need to only find data that isn’t deleted.

Almost universally, it’s easier to put views on top of tables with the appropriate bit predicate for deleted or not-deleted rows baked in than to expect people to remember it.

CREATE VIEW dbo.Users_Active
AS
SELECT 
    u.*
FROM dbo.Users AS u
WHERE u.is_deleted = 0;

CREATE VIEW dbo.Users_Inactive
AS
SELECT 
    u.*
FROM dbo.Users AS u
WHERE u.is_deleted = 1;

It’s not that views have any magical performance properties; they’re just queries after all, but it gives you an explicit data source.

Indexes


Depending on how your other queries search for data, you may need to start accounting for the is_deleted flag in your indexes. This could make a really big difference if the optimizer stops choosing your narrower nonclustered indexes because it hates key lookups.

Typically, other predicates will give you a selective-enough result set that a residual predicate on a bit field won’t make much difference. If you’ve already got a seek to the portion of data you’re interested in and most of it will be not-deleted, who cares?

And let’s be honest, in most implementations deleted rows will be the minority of data, and searches for it will be far less common. Usually it’s just there for an occasional audit.

In adjacent cases where instead of deleted you need to designate things as currently active, and you may have many inactive rows compared to active rows, filtered indexes can be your best friend.

Coming back to the views, I don’t think that making them indexed is necessary by default, but it might be if you’re using forced parameterization and filtered indexes.

CREATE TABLE dbo.Users(id int, is_deleted bit);
GO 

CREATE INDEX u ON dbo.Users (id) WHERE is_deleted = 0;
GO

SELECT 
    u.id, u.is_deleted
FROM dbo.Users AS u
WHERE u.is_deleted = 0;

Under simple parameterization, this can be fine. Under forced parameterization, things can get weird.

[Query plan screenshot: tutor the tutors]

Tables and Tables


In some cases, it might be easier to create tables specifically for deleted rows so you don’t have unnecessary data in your main tables. You can implement this easily enough with after triggers. Just make sure they’re designed to handle multiple rows.
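Those triggers don’t need to be fancy. Here’s a minimal sketch, assuming a hypothetical dbo.Users_Deleted table shaped like dbo.Users, and written against the whole deleted pseudo-table so multi-row deletes are covered:

CREATE OR ALTER TRIGGER dbo.Users_SoftDeleteArchive
ON dbo.Users
AFTER DELETE
AS
BEGIN
    SET NOCOUNT ON;

    /*copy everything that was deleted, however many rows that is*/
    INSERT dbo.Users_Deleted
        (id, is_deleted)
    SELECT
        d.id,
        d.is_deleted
    FROM deleted AS d;
END;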

If you want something out of the box, you might mess with:

  • Temporal tables
  • Change Data Capture
  • Change Tracking

However, none of those help you deal with who deleted rows. For that, you’ll need an Audit.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Simplifying Archival Processes In SQL Server Using OUTPUT

Long List Of Demands


Many suggestions for making changes or improvements to a process in a database are met like orders that mean, if not certain death, then at least likely death.

EXT. A WEBEX MEETING

… Erik is smiling, happy to be wrapping up and moving on. Client DBAs and Developers are sweating profusely, mechanical keyboards all heavily armed with Cherry MX Blue switches in an attempt to drown out the part where he talks about NOLOCK hints. For some reason, everyone is smoking cigarettes and finalizing their wills.

ERIK

You should enable the remote DAC, it’ll be helpful next time you hit THREADPOOL.

There’s a script included to turn it on, and a bunch of examples on how to use it.

IT MANAGER

By God, there’s no telling who of us will survive this, but we’ll see’er through.

Oh, Now You Need Consistency


Well, friends, I have good news for you. This is an easy one to implement.

Let’s say that in Stack Overflow land, when a user deletes their account we also delete all their votes. That’s not how it actually works, but it’s the example I’ll use to show you how to condense what’s normally a difficult-to-isolate process into a single operation.

First, we need an archive table. So let it be done.

CREATE TABLE dbo.Votes_Archive
(
    Id int NOT NULL,
    PostId int NOT NULL,
    UserId int NULL,
    BountyAmount int NULL,
    VoteTypeId int NOT NULL,
    CreationDate datetime NOT NULL,
    CONSTRAINT PK_VotesArchive_Id
        PRIMARY KEY CLUSTERED (Id ASC)
);

Next, we need a query to, like, do things. I guess.

Normally, you’d insert the rows you want to archive into the archive table, and then remove them from the source table by using a delete … where exists  type query. Usually in a transaction, because you heard you need them.
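For contrast, that two-step version usually looks something like this (a sketch using the tables above):

BEGIN TRAN;

INSERT dbo.Votes_Archive
    (Id, PostId, UserId, BountyAmount, VoteTypeId, CreationDate)
SELECT
    v.Id, v.PostId, v.UserId, v.BountyAmount, v.VoteTypeId, v.CreationDate
FROM dbo.Votes AS v
WHERE v.UserId = 190597;

DELETE v
FROM dbo.Votes AS v
WHERE v.UserId = 190597
AND   EXISTS
(
    SELECT 1
    FROM dbo.Votes_Archive AS va
    WHERE va.Id = v.Id
);

COMMIT;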

Savvier practitioners might even use the output clause with their delete to dump things into a #temp table.

But the savviest practitioners will do it all in one go.

Sauvignon Blanc


What a lot of people miss about output is that the results are tabular, like many other things in a database.

Just like you can select from deleted and inserted virtual tables in a trigger, you can select from them in a… select statement.

I’m using a transaction here because I want to roll it back.

BEGIN TRAN

INSERT dbo.Votes_Archive
    (Id, PostId, UserId, BountyAmount, VoteTypeId, CreationDate)
SELECT 
    v.*
FROM 
(
    DELETE v
        OUTPUT
            Deleted.Id,
            Deleted.PostId,
            Deleted.UserId,
            Deleted.BountyAmount,
            Deleted.VoteTypeId,
            Deleted.CreationDate
    FROM dbo.Votes AS v
    WHERE v.UserId = 190597
) AS v;

--Validate stuff
SELECT * FROM dbo.Votes AS v WHERE v.UserId = 190597;

SELECT * FROM dbo.Votes_Archive AS vi;
--Stop validating stuff, nutso
ROLLBACK

So when whoever user 190597 is deletes their account, all their votes are inserted into the archive table at the same time they’re deleted.

The execution plan looks like this:

[Query plan screenshot: polar seltzer]

Which is pretty nifty, and eliminates the need for a lot of jumping around.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Deduplicating SQL Server Missing Index Requests Part 3

Dodo


We’ve got a set of missing index requests for a single table, and we’ve got the queries asking for them.

Going back to our queries and our index requests, all the queries have two things in common:

  • They filter on OwnerUserId
  • They order by Score

There are of course other elements in the where clause to attend to, but our job is to come up with one index that helps all of our queries.

Query Real Hard


To recap, these are our queries.

SELECT TOP (10)
    p.*
FROM dbo.Posts AS p
WHERE p.OwnerUserId = 22656
ORDER BY p.Score DESC;
GO 10

SELECT TOP (10) 
    p.*
FROM dbo.Posts AS p
WHERE p.OwnerUserId = 22656
AND   p.CreationDate >= '20130101'
ORDER BY p.Score DESC;

SELECT TOP (10)
    p.*
FROM dbo.Posts AS p
WHERE p.OwnerUserId = 22656
AND   p.PostTypeId = 1
ORDER BY p.Score DESC;

SELECT TOP (10)
    p.*
FROM dbo.Posts AS p
WHERE p.OwnerUserId = 22656
AND   p.LastActivityDate >= '20130101'
ORDER BY p.Score DESC;

SELECT TOP (10) 
    p.*
FROM dbo.Posts AS p
WHERE p.OwnerUserId = 22656
AND   p.Score > 0
ORDER BY p.Score DESC;

Index Real Dumb


Which means that all of our missing index requests are going to be on maybe a couple key columns, and then include every other column in the Posts table.

This is a bad idea, so we’re going to dismiss the includes and focus on keys.

CREATE INDEX [OwnerUserId_LastActivityDate_Includes] 
    ON [StackOverflow2013].[dbo].[Posts] ([OwnerUserId], [LastActivityDate]);

CREATE INDEX [OwnerUserId_Score_Includes] 
    ON [StackOverflow2013].[dbo].[Posts] ([OwnerUserId], [Score]);

CREATE INDEX [OwnerUserId_PostTypeId_Includes] 
    ON [StackOverflow2013].[dbo].[Posts] ([OwnerUserId], [PostTypeId]);

CREATE INDEX [OwnerUserId_CreationDate_Includes] 
    ON [StackOverflow2013].[dbo].[Posts] ([OwnerUserId], [CreationDate]);

CREATE INDEX [OwnerUserId_Includes] 
    ON [StackOverflow2013].[dbo].[Posts] ([OwnerUserId]);

Now that we’ve got a more sane bunch of requests to focus on, let’s do some thinking.

I hate thinking, so we won’t do a lot of it.

Indexes put data in order, and equality predicates preserve ordering of secondary index columns. That makes putting the key on (OwnerUserId, Score) a no-brainer. One could make an entire career out of avoiding sorting in the database.

But now we have three other columns to think about: LastActivityDate, PostTypeId, and CreationDate.

We could spend a whole lot of time trying to figure out the best order here, considering things like equality vs. inequality predicates, selectivity, and so on.

But what good would it do?

Dirty Secret


No matter what order we might put index key columns in after Score, it won’t matter. Most of our queries don’t search on OwnerUserId and then Score. Only one of them does, and it doesn’t search on anything else.

That means that most of the time, we’d be seeking to OwnerUserId, and then performing residual predicates against other columns we’re searching on.

On top of that, we’d have whatever overhead there is of keeping things in order when we modify data in the key of the index. Not that included columns are free-of-charge to modify, but you get my point. There’s no order preserved in them.

In reality, a good-enough-index for the good-enough-optimizer to come up with a good-enough-plan looks like this:

CREATE INDEX good_enough
    ON dbo.Posts
        (OwnerUserId, Score)
    INCLUDE 
        (PostTypeId, CreationDate, LastActivityDate);

Planama


The index above does a few things:

  • It helps us search on a selective predicate on OwnerUserId
  • It keeps Score in order after the equality, so the order by is free
  • It has all the other potential filtering elements, so we can apply any remaining predicates as residuals
  • It teaches us that include column order doesn’t matter

All of the query plans will look roughly like this, regardless of the where clause:

[Query plan screenshot: you can do it]

What Difference Does It Make?


Alright, so we’ve got one good-enough index for a bunch of different queries. By adding the index, we got all of them to go from taking ~600ms to taking 0ms.

What else did we do?

  • We made them faster without going parallel
  • They no longer need memory to sort data

And we did it without creating a gigantic covering index.

Of course, the optimizer still thinks we need indexes…

[Missing index request screenshot: of what?]

But do we really need them?

No.

77% of nothing is nothing.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.