So You Want To Implement Soft Deletes In SQL Server

Thanks for watching!

Video Summary

In this video, I delve into the considerations for implementing soft deletes in SQL Server tables. Whether you’re looking to add a soft delete feature to an existing table or considering building new tables from scratch, there are several key points to keep in mind. I discuss the challenges of backfilling nullable columns and the importance of batch updates when making such changes to large datasets. Additionally, I explore the complexities involved in ensuring data integrity with unique constraints and check constraints, drawing on insights from Alexander Kuznetsov’s work, which is now available through an Amazon affiliate link. The video also covers practical advice for creating views and filtered indexes to simplify querying active versus inactive rows, emphasizing the importance of proper index design based on query patterns. Lastly, I touch on the broader implications of soft deletes compared to actual data deletion, including the potential benefits of using temporal tables or archiving deleted records in an archive table.

Full Transcript

Erik Darling here with Darling Data. And as promised, I’m here to talk about some considerations for implementing soft deletes in SQL Server. Now, there are a lot of reasons to want to do this. And really, like, the implementation of a soft delete or the inclusion of a, like, you know, which row is currently active type thing, they’re almost two different outcomes, right? Because, you know, there are some situations where you might, you know, want to be able to soft delete something just because it was a mistake or, you know, just to, you know, mark it for, like, archival or something. And there are other times when you only want to be, you only want to query the most active row. And we’re going to talk a little bit about both of those. Now, the first thing that I want to point out is that setting this up for a table that currently exists and is of any, you know, remarkable size is really difficult for two reasons. One, you know, not because, like I said, like, look, if you’re adding a nullable column, or you’re adding a nullable column with no default value, it gets, it gets sent to the table instantaneously, right? It’s, there’s no locking, blocking problems with that. Your problem after that becomes, okay, which rows are deleted and which rows are not. So you, like, like, it would behoove you to backfill that column with some data to make sure that people know whether a row is deleted or not. Because what you don’t want to do is add that in and start dealing with three value logic with zero, one, and null. That’s just a recipe for disaster. So backfilling those and figuring out which ones are deleted or not can sometimes be a challenge. If the goal is just, we need this new thing, and we’re going to start soft deleting rows, don’t worry about it. Make everything zero.

But don’t make everything zero all at once, because if it’s a table of any, of any, like, remarkable size, you’re going to want to batch, batch those updates to set that to zero. That’s a lot kinder on your server than just adding a, you know, column with a default of zero. And, you know, having that happen to all the rows in one go. Batching them up is a lot, it’s a lot nicer. So if it’s a situation where, you know, neither one of those scenarios really fits what you want to do, there are definitely times when building new tables with those design improvements instead is a smarter and safer bet. I’ve been a part of a couple projects that were geared towards doing that.

And, you know, wrote all sorts of things to migrate the data to the new schema, you know, make sure that everything is set up the way it’s supposed to be, all that stuff. You know, so like, that is a doable, it is a doable, achievable task. You know, it’s just, you know, one of those things where it really comes down to what the goal is. Now, if you only want to have one row per group be active, you’re going to have to deal with at least one unique constraint.

And depending on the complexity of what defines an active row, you could need a whole bunch of unique constraints. Now, this guy, Alexander Kuznetsov, and I don’t know, I apologize, Alexander, if you watch this video and either I butchered your last name or you prefer to be called Alex. I’m not sure. I’m not sure you’ll ever see this.

From what I know, you’ve moved on to Postgres. But Alexander wrote a book that I actually, I read and loved a long time ago called Defensive Database Programming in SQL Server. However, I will have an Amazon affiliate link to that book.

So if you do decide to buy it, I’ll make a third of a cent. I’ll finally make some money off one of these videos. But Alexander also was a blogger for a while.

And he wrote on SQLblog.com, which has been dustbin, which is a shame because there are a lot of great blog posts on there that you can only get through like the archive.org stuff. And so one of the articles that he wrote was about storing intervals of time with no overlaps. And this is the table definition that I’m just going to format this a little bit so that when I, this is not how the table appeared in the original post.

I just need to format that a little bit so that when I zoom in, my giant head doesn’t cover anything important. So in this, so in order to implement this for a table that has one setting ID and value started at, finished at, previous finished at, five columns, you need one, two, three, four, five separate check constraints to ensure that this data is stored correctly. You need a primary key.

You need a unique constraint on setting ID and previous finished at. You need a foreign key in the table that references itself. You need another check constraint to ensure that there are no overlaps between previous finished at and started at and started at and finished at. There are, there’s a lot that you have to do to make this work right in order to ensure that your table doesn’t have any bad data in it.

So you can already imagine the complexity and imagine, imagine that you have a table that you thought was set up to do this. And you’re just now realizing that it wasn’t set up correctly to do this. And now you have to add these constraints in.

I will bet you a million dollars that adding at least one of these will fail. Right. Adding at least one of these constraints to your current set of data will fail miserably. In some cases, you just might want to use temporal tables.

Now, temporal tables are a feature that I do not get excited about because there are a lot of problems with them. There’s a blog post by a very famous Bob about when you can encounter error 1, 3, 5, 3, 5, where data, where having temporal tables can cause data modifications to fail. This is just the tip of the iceberg when it comes to problems with temporal tables.

There are all sorts of very strange bugs that can happen under concurrency and using transactions to modify data in a table just like this that can make things really weird and complicated and even make you end up with probably incorrect data sometimes. So temporal tables aren’t a feature that I get very excited about, but given a simple enough implementation, they’re probably okay for some requirements. Again, nothing that I’d get like, wow, use temporal tables.

They’re amazing. It’s more like you could try them. Make no promises. It’s like getting on an old roller coaster. You don’t really know how safe it is.

You don’t really know what might happen. You just don’t want to be that unlucky. You don’t want to be that unlucky passenger. You know, Microsoft and Boeing have a lot in common where there’s been this real degradation in quality over the years.

And a lot of really half-baked features have been implemented with not a lot of assurances that things will go well with them. So be careful out there. All right.

But if you’re going to, you know, let’s just say that, you know, you go the route of implementing soft deletes in your tables. And you want to make sure that everything works right. One thing, actually two things you’re going to want to do is you’re going to want to create a couple views.

And you’re going to want to create a couple views because you should never trust a developer. Never trust a developer to do the right thing or remember what they’re supposed to be doing. So whenever I’ve implemented soft deletes for clients, you know, the nice people who pay me to make these videos, you could be one of them someday.

You know where to find me. Is I will create a video that explicitly, I will create a video. No, Erik Darling will create a view.

Erik Darling does create videos, but not for soft deletes. Well, actually, I am creating a video about soft deletes. This is wild. But when I implement soft deletes for clients, I will create views for active and inactive whatevers so that specifically have a filter for where, for whatever we’re looking for. Is deleted equals zero or is deleted equals one?

Because I want to make sure that, you know, there might be even be a third one that, you know, just hits both. But that could just be the table. But really what I want is for people to be looking in the right place. And from my perspective, it’s a lot easier to see if a developer is hitting the wrong table than if they forgot something in a where clause.

So there’s that. The other thing that you’re going to probably want, and this is a very common thing, is you’re going to want a unique constraint on your table to sort of ensure that there is only one not deleted thing for a person. Now, going back to the Alexander Kuznetsov stuff about, you know, with the overlapping stuff, that’s a lot different than just figuring out if, like, just making sure that someone only has one active row period for themselves.

There might be all sorts of other considerations for that. And you might need more than just the column that I’m showing here for the unique index. It depends on what you’re trying to accomplish.

But in general, you just want to make sure that there’s one unique ID that is not deleted, right? You generally don’t want a bunch of active rows. If you want a bunch of – if you’re okay with having a bunch of active rows in there, if there might be duplicates for some reason, then you could just create a regular nonclustered index.

But having that filter operator in there is very important. The other big thing about doing this, and this is just a general piece of advice, because filtered indexes where you don’t have the column or columns that you’re filtering on in the where clause as part of the index somewhere, generally having them as an include is good enough. If you don’t have them in the where clause, if you don’t have them in the where clause, you are – it’s a real crapshoot if that filtered index will get used when it should.

That goes for bits. That goes for null or not null. That goes for, you know, greater than, less than, in, whatever the setup is.

Just do yourself a favor. Don’t screw this part up, because you might get real disappointed with your filtered indexes. Please put the columns that you are filtering on somewhere else physically in the index, either as a key or as an include.

Like I said, generally the include is good enough, but if it’s very special to you, you might want to put it in the key. All kind of depends on what the column is filtering on. Now, the nice thing about doing this with the views and the filtered indexes is let’s assume that the majority of your code is either from an ORM where things are parameterized or based on store procedures where things are parameterized.

What can happen, or even if you have forced parameterization turned on for your database, as long as your query is partially parameterized or in a store procedure, you don’t have to worry about the issue with filtered indexes and parameters or variables. So one big catch with filtered indexes is that if you have a piece of code, and let’s just say that that piece of code is, you know, filtering on the deleted column, is deleted column, and you want people to be able to search for either is deleted equals zero or is deleted equals one. Again, we’re throwing three-valued logic out with this.

If you allow nulls in a bit column, there is a special place right next to me in hell. So you better watch out. But when you have a parameter or variable for this, SQL Server can’t take advantage of your filtered indexes unless you throw a recompile hint on.

There are other ways to snake around it with dynamic SQL and putting a literal value in for one part of it. But in general, you know, if you’re like, you don’t want to have a parameter to search on a filtered index column because SQL Server needs to cache and reuse an execution plan that’s safe for is deleted being zero or is deleted equal being one. So, you know, at that point, you’re going to probably ignore your filtered index and get an unmatched index warning.

Now, the reason why I like to create these views up here is because these are literal values. All right. And even with forced parameterization turned on, if your query is in a stored procedure or partially parameterized, SQL Server won’t try to parameterize this part.

Right. If you have a trivial enough execution plan, you might get simple parameterization. But, you know, if, you know, again, most people live in a sufficiently complex world where that’s not something that’s reasonably going to happen.

You never know. But there’s something to watch out for. So when you’re going to implement soft deletes, the three things that you need to consider are what’s really your goal?

Are you going to allow multiple not deleted or can you only have one active not deleted row per user or whatever entity, whatever you want to call it? How are you going to set up for people to query those so that they don’t have to remember to apply some predicate to it? Especially if you are using ORMs, it can be really difficult to remember to do these things in your queries or even know how to do these things in your queries.

That’s why the views work out really well. You’re going to figure out whether you need unique indexes or non-unique indexes to maintain the referential or not the referential integrity. Just like the sort of constraints of the table.

And depending on query patterns, not every index can be filtered. Right? Not every index is going to be geared towards a query that’s looking for is deleted or is not deleted. There are some that might just have to span the whole table.

Implementing soft deletes is often a lot less scary to business users than actually deleting data because once you delete it, it’s gone. Now, granted, you can do things where either, again, temporal tables or you could put a trigger on your table so that if someone deletes a row, that row goes off to some other table until you’re legally not required to have it anymore. If you’re going to go that route, though, you might as well implement the soft deletes.

And then you might as well have a process that moves soft deleted columns off to some archive table eventually. It can be after three months, six months, nine months, seven years, whatever you’re really meant to do. Because, you know, even with filtered indexes and even with, you know, queries geared towards using only the non-deleted rows, it’s usually a good idea to keep your tables on the small as possible, especially if the deleted rows really start to pile up.

So that’s another big thing to consider. What is going to be the ratio of deleted to not deleted rows in your system, right? Because, you know, well, like, you know, at that point, you have to wonder if filtered indexes are even useful to you, right?

Because if your table is like, you know, 60, 70, 80, 90% not deleted rows and soft deletes are a rarity, filtered indexes don’t really buy you all that much. Unless you need that uniqueness, unless you need to ensure that only one user can have an active row, have a not deleted row at a time, it doesn’t really make sense to add those in, except as sort of like a logical check constraint to keep your data the way that you expect it to look. So generally, you know, these are the questions that I ask.

These are the things that I talk about with clients. You know, we might come to an understanding that perhaps adding soft deletes to existing tables is not what we want to do. Perhaps we would prefer temporal tables if we are willing to accept all of the risks that come along with using them.

There might be a solution with triggers and other things that would be more palatable. You know, you might want, you know, figuring out exactly what the end goal of the soft deletes are, figuring out exactly like, you know, the requirements before we implement stuff usually leads to a much better solution than just saying, oh, you wanted soft deletes? Cool.

Here’s a new bit column. That’ll be $10,000, please. Because I’ve seen a lot of that happen. And what the end result is, is people have this soft delete column. No one actually soft deletes anything.

And it’s kind of a bummer. So, you know, really, really figure out the problem you’re trying to solve by implementing soft deletes before you just go and throw a brand new column in a table or you go through a bunch of work to try to establish all the other stuff that we’ve talked about. So, thank you for watching.

I hope you enjoyed yourselves. I hope you learned something. If you like this video, you are free to give this video a thumb, a nice happy comment, a smiley face. Whatever you’re feeling at the time, feel free to comment.

If you like this sort of SQL Server content, you can subscribe to my channel. There are buttons that allow you to do all sorts of fantastic things in that regard. And lastly, but not leastly, thank you for watching.

I will see you in the next video. Where hopefully I don’t have to soft delete this video for any reason. That would make me sad.

That would just mean that the last 20 minutes of my life were all for naught. Which, you know, has happened far too often to me. Ugh.

You know, I think if you really tally up the number of regrettable 20 minute spans in your life, you can come up with a lot of those, right? Anyway. I’m going to go think about that for a while.

Might need a drink to wash that thought down with. So, I will see you in the next video. Goodbye.

Goodbye.

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

4 thoughts on “So You Want To Implement Soft Deletes In SQL Server”

Amperage Caterer says:

August 21, 2024 at 4:09 pm

Another nice video, thanks.

You know, I generally hate the subscriptions, but you are the one youtuber (no offense meant) that I would consider supporting on regular basis on Patreon or something.

Eh, maybe I just buy your training video bundle to show support.
1. Erik Darling says:
  
  August 21, 2024 at 5:03 pm
  
  Ha ha, no offense taken. There are memberships for my YouTube channel, but apparently getting to them is a hassle.
MuSQL says:

August 22, 2024 at 4:41 am

Great stuff!
We use a DATETIME2 column for our soft deletes to enable access to historical transactions on an entity. I don´t like it but it´s whats required.
Great SQL like AND(Item.Deleted IS NULL OR trn.Created <= Item.Deleted) 🙁
1. Erik Darling says:
  
  August 22, 2024 at 10:38 am
  
  That seems like a good use for canary values, like 99991231, but with a good index it might not matter so much.

Comments are closed.