Happy Labor Day, From Darling Data

Holiday In Suburbia


You may be reading this and thinking “what does a consultant know about labor?”

Well, I wasn’t always this idle in my profession. I’ve had a variety of jobs in my life that have hardened the hands and stiffened the back. Now I mostly harden my liver with stiff drinks.

Mostly I’m celebrating that I have soft hands again.

Remember to moisturize, friends.

If you’ve got the day off, make it a good one. See you tomorrow.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

Multiple DISTINCT Aggregate Performance In SQL Server




Thanks for watching!

Video Summary

In this video, I delve into a common performance challenge in SQL Server queries: multiple distinct aggregates and their impact on row mode execution. I explore why these operations can significantly slow down your queries, especially when dealing with large datasets. We then dive into how batch mode can mitigate these issues, showcasing the differences between achieving batch mode through temporary objects or compatibility levels, and how this can drastically improve performance. Additionally, I discuss the limitations faced by Standard Edition users, suggesting alternative strategies such as using temp tables to bypass some of the inefficiencies associated with spools. By the end of the video, you’ll understand not only why these issues arise but also practical steps to optimize your queries for better performance.

Full Transcript

Erik Darling here with Darling Data. According to the weather outside, I am recording this video during the end times. It’s wild out there. Whew! Hurricane Debby, what a gal. Anyway, sorry if that clap was loud. We should clap lower. Maybe that’s less painful for you. I’m never going to hear this, so deal with it. Alright, so let’s talk about what we’re going to talk about today. In this video, we’re going to cover how multiple distinct aggregates are tough for performance in SQL Server queries, but mostly only in row mode. We’re also going to cover how batch mode can help, but also how Microsoft has no respect for Standard Edition users. So we’re going to cover all the fun stuff today. As usual, this channel is free. I don’t have commercials. I don’t have any paid sponsors.

MeUndies has not beaten down my door, mostly because I don’t wear underwear. So if you want to sign up for a membership to show your love and appreciation for the channel, you can do that. There are some cheap ones. I’m going to work on better stuff for the higher price ones when I have some time to think. That doesn’t happen a lot. But when it does, boy, the gems come out.

If you’re unable to monetarily support the channel, that’s totally cool. There are free ways to show me that you care. You can like, you can subscribe, you can leave comments, all sorts of fun stuff. So if you can’t do one, think about doing the others, because I do enjoy the feedback. I like the engagement. I like knowing that the things that I write and record reach as many people as possible. That’s what makes me feel good. If you’re in need of a SQL Server consultant, you’re having some performance problem or other issue, and you need a trusted, reliable set of eyes and hands, I’m not going to put feet on your SQL Server. That’s weird. That costs extra.

You can hire me to do that. My rates are reasonable. If you’re looking for affordable SQL Server training at the beginner, intermediate, or advanced levels, you can go to training.erikdarling.com, and you can get 75% off the whole kit and caboodle with the discount code SPRINGCLEANING.

There’s also a link for that in the old show notes. So you can click on that, and it might be a bit easier for you than remembering all those letters up there. URLs. Wow. What a thing. As far as when you can see me live and in person on this slide with a tremendous amount of white space that I have not figured out what to meaningfully do with yet, perhaps I just need a graphic floating over my head or something.

You can catch me Friday, September 6th, doing a full-day pre-con at Data Saturday Dallas. You can also catch me November 4th and 5th at PASS Data Summit in Seattle, where I’ll be all week doing Seattle things. I don’t know. There’s a presidential election that week. Who knows if Seattle will survive that? Only time will tell. Let’s get on with the demos for today’s topic of interest. So, when you’re working on reporting queries, there are a lot of times when you’ll need to use DISTINCT, lest we end up counting many useless things. It’s not just for counting. It could be for summing or averaging or basically anything else that you can put in an aggregate: MIN, MAX, STDEV, standard deviations. There are so many things you could do. So many things. But you’ve got to do it. The problem is that when you do it in row mode, it’s very slow. The more you distinct, the slower you get.
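To make that concrete, here’s a minimal sketch of the kind of query I’m talking about, written against the Stack Overflow sample database; the column choices are stand-ins rather than the exact demo query:

```sql
/* A sketch of a reporting query with five DISTINCT aggregates,
   assuming the Stack Overflow sample database. In a row mode plan,
   each DISTINCT typically gets its own spool-fed aggregate branch. */
SELECT
    COUNT(DISTINCT p.OwnerUserId) AS distinct_posters,
    COUNT(DISTINCT p.PostTypeId)  AS distinct_post_types,
    COUNT(DISTINCT p.Score)       AS distinct_post_scores,
    COUNT(DISTINCT v.UserId)      AS distinct_voters,
    COUNT(DISTINCT v.VoteTypeId)  AS distinct_vote_types
FROM dbo.Posts AS p
JOIN dbo.Votes AS v
    ON v.PostId = p.Id;
```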

So let’s look at the query plan for this, which has one, two, three, four, five distinct aggregates applied to it. Now, the slow parts of the plan are fairly obvious once you start looking. The stuff that happens over here with the joins? Not that big a deal. Where things get wonky, and we talked about this a little bit in the video I did on TOP PERCENT query performance, is as soon as you need to start streaming rows from multiple threads into a spool in tempdb (because, you know, parallel spools, ha ha ha). Your life is in for an awful time.

This is not fun. Notice we go from 22 seconds here to a minute and 12 seconds here. That is no bueno. There are zero buenos about this. On the bueno scale, we are at absolute zero. Right? There have been no buenos. Things after that generally don’t change much. Through all this, we’ve added on like another three seconds in this branch: we go from 1:12… well, look, 1:15, and then it’s 1:15 all across. But then we hit three minutes and two seconds.

¿Por qué las hamburguesas? Why? How did that happen? Well, that happened for a few different reasons. Notice that we have all of these sort of headless spools. All of these spools are actually fueled by this spool, right? This is the spool that fuels the spools.

Ah, that made my mouth feel nasty. Yeah. Spool fuel. Hmm. Spool fuel to fuel your spools. So, each one of these spools has a whole bunch of rows in it, and each one of these spools feeds out into an aggregate that does some work to make whatever we’re aggregating distinct. This happens a whole bunch of times, but where it causes a real kerfuffle in the query plan is specifically in this branch, where we have a hash aggregate that spills for a very long time. This, my friends, is negative on the bueno scale. We have gone negative buenos. There was a bueno deficit for this particular branch of the query plan. So this entire thing ends up taking, again, about three minutes. Now, what’s funny is that you can avoid a lot of these problems with batch mode. And depending on where you are in the SQL Server world, as far as version, edition, compatibility level, and stuff, you can deal with this in different ways.

The big deal with batch mode is that you get rid of all those gigantic spools. Spools in a SQL Server query plan are often a sign that you have done something drastically, tremendously wrong that needs correcting. Spools are there for your benefit, but they only benefit bad mistakes.

They don’t tend to benefit good choices. They tend to benefit bad choices, right? So they’re like a little insurance policy, a little safety rail for your queries when you do something terrible in them. So, the two ways of approaching batch mode are, of course, to either create some sort of object with a columnstore index on it, or to use a compatibility level that gives you batch mode on rowstore. It doesn’t have to be a temporary object. In fact, if you’re using the SQL Server 2019 memory-optimized tempdb metadata feature, you can’t create a temporary object with a columnstore index on it.

But you’re perfectly able to create a stationary, permanent object in your database that has a columnstore index on it and no rows, or you can add a filtered nonclustered columnstore index to a table, filtered to a condition where no rows can possibly exist. When you do this, you get a sort of bastardized version of batch mode, and it’s bastardized because it is not fully implemented. It is not as tightly ingrained into the optimizer as when you use a higher compatibility level to get batch mode on rowstore. I have to be very specific about this: fooling SQL Server into using batch mode some of the time is not the same as getting batch mode on rowstore, the intelligent query processing feature.
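To make the two tricks concrete, here’s a rough sketch; the object names are made up, and the contradictory filter is the usual way of guaranteeing the columnstore index can never contain a row:

```sql
/* Option 1: a permanent, permanently empty table with a clustered
   columnstore index. Referencing it lets the optimizer consider
   batch mode for the rest of the query. Names are assumptions. */
CREATE TABLE dbo.batch_mode_helper (dummy integer NULL);
CREATE CLUSTERED COLUMNSTORE INDEX bmh_ccsi
    ON dbo.batch_mode_helper;

/* Option 2: a filtered nonclustered columnstore index on the table
   you're querying, filtered so that no row can ever qualify. */
CREATE NONCLUSTERED COLUMNSTORE INDEX votes_batch_trick
    ON dbo.Votes (Id)
    WHERE Id = -1 AND Id = -2;

/* With option 1, you tack the helper table onto the query with a
   join that can never produce rows: */
SELECT
    COUNT(DISTINCT v.UserId) AS distinct_voters
FROM dbo.Votes AS v
LEFT JOIN dbo.batch_mode_helper AS bmh
    ON 1 = 0; /* never matches; exists only to unlock batch mode */
```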

The intelligent query processing feature does a whole lot more; it’s able to affect a whole lot more operations in your query plan than just tricking SQL Server into using batch mode for some operators. All right, so we’re going to run these two, and these will not take three minutes to run, thankfully. While these execute (that’s about seven seconds there), I do want to say that part of what makes spools rough is that, when you load data into a temp table, there are all sorts of optimizations that temp table loading has gotten in SQL Server over the years. There were some big ones in 2014, and I’m sure some other ones in newer versions; things have gotten better over time, as they sometimes do. But loading data into a spool happens a single row at a time. It’s very, very slow. You have none of the temp table optimizations behind spools that you have for temp tables. So part of what makes the multiple distinct query really slow, aside from that monstrous hash spill, is just loading that big spool. Loading millions of rows into a spool is always going to be painful because of how little love and attention spools have gotten over the years.

If we look at the query plans for these two, they’re not that different. This is good for us, because if you’re not in a position where you can use a higher compatibility level, even for a single query like I did for this one, but you can finagle with the query a little bit to trick SQL Server into using batch mode, you can get much, much better performance. We’re down to four and a half seconds from three minutes. And notice we don’t have any big crazy spools in this thing. We’re not spooling in billions of rows. We’re not hash spilling. Nothing awful is happening here.

The second query, where I do get batch mode on rowstore, finishes about 1.3 seconds faster, which is not a meaningless improvement. The difference, of course, is in reading from the Votes table. Up here, we take 3.1 seconds to read from it, and down here, we take 1.9 seconds. The difference is that in the query where we trick SQL Server into using batch mode for some operators, we still read from the Votes table using row mode, because, again, when we play these tricks on the optimizer, we don’t get the fully ingrained batch mode on rowstore experience. We get, like, the half naked show, not the fully naked show. We do get batch mode for this hash join and for this hash aggregate. We do not get batch mode anywhere else. In neither of these queries do we read from the Posts table using batch mode, largely because the Posts… oh wait, we do. Weird. That didn’t happen before. Hey, that’s funny. We got batch mode on rowstore here too. Usually I don’t, because of the Body column in the Posts table. Maybe I’m just lucky on this one. I don’t know. So, cool. The reason I thought I still wasn’t getting it is that there’s not a really big difference in how long it took to read from there. I’m not going to start this video over again.

We’re just going to say, cool, it can happen sometimes. Usually I don’t get batch mode on the Posts table, though. Usually the big NVARCHAR(MAX) columns in there make SQL Server go, nah, we’re not doing it. So, yeah, we do save a little bit of time here. Now, let’s say you are on Enterprise Edition. Either one of these things will be fine for you.

These are great options if you are on Enterprise Edition and you want this query to go faster. If you are on Standard Edition, if you are cursed, if you are doomed to some bottomless existence, some pit that you will never arise from, one thing that you can do is replace the spool with a temp table by loading data into the temp table, which has all of the optimizations that you would expect, unlike a spool. You might not want to load huge temp tables up, but hey, everyone has their own fetish. If you’re on Standard Edition, you just might fit somewhere in the S or the M world of things. Maybe even both. I don’t know. So we’re going to try this query out in two different ways. Now, the reason why I bring up Standard Edition here is because in Standard Edition, batch mode stuff, columnstore stuff, is limited to at best a DOP of two. So if you find yourself in a situation where you get okay performance from a DOP 2 batch mode plan in Standard Edition, cool. If not, you’re going to have to do something that gets you a higher DOP plan. DOP 8 is a pretty good DOP. I’ve had pretty good luck with DOP 8 over the years. I am a DOP 8 proponent in most circumstances. So, one thing you can do is drop all of the things that you want to deal with into a temp table.

Now, loading the data into this temp table will be a lot faster than loading it into the spool. Remember, the spool took like, I don’t know, 30 seconds or something. Loading it into the temp table takes about 7.3 seconds. Is that super fast? No, but it’s better than 30 seconds.
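Here’s a rough sketch of that rewrite, with assumed columns: stage the joined data once in a temp table, then run the distinct aggregates against it:

```sql
/* Stage the columns you need once. Temp table loading gets the
   bulk optimizations that spools never did. Column choices here
   are assumptions, not the exact demo query. */
SELECT
    p.OwnerUserId,
    v.UserId,
    v.VoteTypeId
INTO #distinct_stage
FROM dbo.Posts AS p
JOIN dbo.Votes AS v
    ON v.PostId = p.Id;

/* Then aggregate off the stable temp table instead of a spool. */
SELECT
    COUNT(DISTINCT ds.OwnerUserId) AS distinct_posters,
    COUNT(DISTINCT ds.UserId)      AS distinct_voters,
    COUNT(DISTINCT ds.VoteTypeId)  AS distinct_vote_types
FROM #distinct_stage AS ds;
```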

And then once we’re working off of that, things will get a bit better for us. I probably should have started these running while I was off talking about Standard Edition and all that goofy stuff. But if you notice, in this query plan, we no longer have all of the terrible awfulness that we had in the other query plan. This is the first one, which is in row mode again, but we don’t have all of the spooling, because we’re just working off this temporary object for all of the distinct aggregates. And even though we get a spill in the same place for this leg of the branch, it’s a far less severe spill. This does not spill for one minute.

And so this ends up running for about six seconds rather than about three minutes. Now, working off the temp table over here, this runs for about a second, which is pretty good. And that’s the batch mode version, because in this version I bumped the compat level up to 160; 150 would have been sufficient here on Developer Edition to get batch mode on rowstore. But even the non-batch mode version working off the temporary object is a lot faster, both because you take the place of the very inefficient spool loading in the initial query, and because you don’t reuse that spool over and over again. And the estimates in this one are better, because we’ve stabilized the result set. We are no longer as prone to estimation errors, because we have a stable temp table that we’re working off of, not, like, “Oh, I don’t know what this join is going to do. Let’s all have a weird day.” So things turn out a bit better in this one.
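For reference, here’s roughly what those knobs look like. The database name is a placeholder, and the per-query compatibility level hint is my assumption about how you’d scope this to a single query:

```sql
/* Batch mode on rowstore needs compatibility level 150 or higher
   (160 is what was used here). Database name is a placeholder. */
ALTER DATABASE StackOverflow SET COMPATIBILITY_LEVEL = 160;

/* Or scope the optimizer behavior to one query with a hint: */
SELECT COUNT_BIG(*) AS records
FROM dbo.Votes AS v
OPTION (USE HINT ('QUERY_OPTIMIZER_COMPATIBILITY_LEVEL_150'));

/* If batch mode on rowstore misbehaves, it can also be turned off
   database-wide without touching the compat level: */
ALTER DATABASE SCOPED CONFIGURATION SET BATCH_MODE_ON_ROWSTORE = OFF;
```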

So, to recap: multiple distinct aggregates in row mode queries can have really nasty performance side effects. I think probably the best way of dealing with them on Enterprise Edition is to get batch mode involved in some manner. Batch mode on rowstore does have more optimizer support for more operators than tricking SQL Server into using batch mode, but sometimes just the batch mode trick is enough. If you’re not on Enterprise Edition, I would highly suggest the use of temp tables, because when you use temp tables, you take the nasty, inefficient spool out of the equation, and you can use DOP up to whatever DOP you have set for the server, database, or query, wherever your DOP fetish setting lies. I’m not saying it isn’t worth testing the batch mode stuff on Standard Edition; just, in my experience, the DOP 2 limitation really does hurt performance in many, many situations. You’d be very lucky if DOP 2 batch mode were equivalent to, like, DOP 8 row mode. It depends on some stuff, and it’s worth testing, but prepare yourself for disappointment. Of course, if you’re using Standard Edition, you are already well positioned to deal with disappointment, so you have that going for you. There’s always a silver lining in the Darling Data playbook. So I think that’s good for this. I hope you enjoyed yourselves, and I hope you learned something. I hope you paid attention to me babbling about liking and subscribing at the beginning of the video, because I’m not going to do it again at the end. Let’s face it: this isn’t where anyone pays attention. It’s over by now. You’ve drifted away. You’ve gone to sleep.

You’ve hit fast forward. You’ve absorbed all of the usefulness of this, and now we just have this useless husk of time left to close things out. So I’m going to go stand in the rain for a little bit and take myself a nature shower. All right, cool. Thank you for watching.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

“SARGable” Sorting and Grouping In SQL Server




Thanks for watching!

Video Summary

In this video, I delve into the concept of sargability in SQL Server, expanding on its traditional usage in join and where clauses to include sorting and grouping. I explain how wrapping columns in functions can disrupt index usage, leading to suboptimal query performance even when the differences might seem negligible at first glance. By sharing practical examples and execution plans, I illustrate the importance of maintaining sargability not just for search operations but also for preserving the ordering that indexes provide. Throughout the video, I emphasize the broader implications of non-sargable predicates on code quality and database performance, encouraging viewers to adopt cleaner coding practices that can prevent future performance issues.

Full Transcript

Erik Darling here with Darling Data, continuing with my new channel setup where I do my begging and pleading at the beginning of videos, while people are still paying attention. Of course, you fully have the option to skip over this part. That’s up to you. It’s a cruel world. What can you do? In this video, we’re going to use the term SARGable a little more loosely than some people might be comfortable with. Sargability typically deals with writing expressions in a way that allows SQL Server to effectively use indexes for them, whether it’s a join or a where clause. But we’re going to talk about sargability in the context of sorting and grouping, even though there’s not technically a search argument involved, because non-SARGable expressions can do damage there, too. So, this is my YouTube channel. All of this content is free. I don’t have any problems except Intel timing out looking for drivers constantly. God, I hate that thing. And I’d like to keep it free. So, in order to help me do that, you can subscribe to the channel. There are very low-cost memberships available where you can donate a buck or two if you’re feeling generous.

But if you can’t be bothered with a buck or two, then please: I like likes. I like comments. I like subscribers. I like growing this channel so that someday, when I do have to spring commercials on everyone, it can be to the largest possible audience. I’m also a SQL Server performance tuning consultant. I can do all of these things in order to make your life with SQL Server better and easier. And you can hate SQL Server less, because I’ll take that on for you. I will be your SQL Server sin eater. If you need SQL Server training at a low cost, you can get 75% off mine with the coupon code SPRINGCLEANING. Just go to training.erikdarling.com, plug that code in, and off you go. Stream away, my friends. Stream away.

As far as where you can catch me live and in the flesh: Friday, September 6th, I have a full-day pre-con at Data Saturday Dallas. And November 4th and 5th, I have full-day pre-cons both days at PASS Data Summit in Seattle. I also have regular sessions attached to those events. So, if you’re going there, stop by, give me a hug, come to my pre-cons.

It’s a nice thing to do. All right. Cool. Everything went black. Most people are used to dealing with the term sargability in the context of join and where clauses. So, if you do something like this, or if you do something like this, most right-minded people will call you a damn fool.

When you wrap columns in functions like this, you ruin SQL Server’s ability to use indexes in an orderly manner, and to find and join your data quickly. It is generally not great for performance when you do this.
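The on-screen demo code isn’t reproduced here, so these are stand-in examples of the pattern, sketched against the Stack Overflow database:

```sql
/* Non-SARGable: the function call hides the column from the index. */
SELECT c.Id
FROM dbo.Comments AS c
WHERE ISNULL(c.Score, 0) > 100;

/* Non-SARGable join: converting both sides defeats index order. */
SELECT c.Id
FROM dbo.Comments AS c
JOIN dbo.Users AS u
    ON CONVERT(varchar(11), u.Id) = CONVERT(varchar(11), c.UserId);

/* SARGable: compare the bare column to a value. */
SELECT c.Id
FROM dbo.Comments AS c
WHERE c.Score > 100;
```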

It’s not going to hurt you every single time. But it is something that you’re going to want to avoid because someday it will hurt. Or someday someone will copy what you did in your code and it will hurt.

And they’ll think you’re an idiot for doing these things. So, generally, this is sort of like a sanitary code thing for me. You know, avoiding this sort of stuff generally tends to help your code bases in general be cleaner.

I’ve said it on the channel before in other videos. But when you write code, code is culture. So, if your company culture has good, clean code in it that doesn’t have a bunch of things like non-sargable predicates, then you will tend to have a better, more quickly functioning code base than if you allow this sort of crap to go through.

Because this is ideally not what you want to see in your queries. This can also apply to other parts of the query, though. It’s not just joins and where clauses.

So, just to give a quick example, none of these queries are going to be bafflingly slow or awful or ugly. But you will see differences. I created an index on the comments table just on the score column.

Because it’s a good column to create an index on. Especially for the demos I’m about to run.

Would you believe that I wrote this specifically to show you these problems? I don’t know if you would believe that. People think I’m a lot more slapdash than I am.

Maybe I am. So, let’s run these two queries. And we’re just going to get the top one score from the comments table. And we actually sort of got lucky.

Because, you know, the score column is not unique in any regard. But we did not have a tie for the highest scoring comment. How a comment got a score of 1270 is a real remarkable thing to me.

Because this is the comment that has a high score of 1270. Did you really stumble upon that exact situation in a real life scenario? Or was this question only meant to be a puzzler just for the fun of it?

Space? Question mark. Apparently that was profound to almost 1300 people. So, I wonder…

I don’t know how many… I don’t know what that comment’s current upvote situation is. I might go to the Stack Overflow Data Explorer after I record this and see. Because I’m a little fascinated now.

So, both of these queries return the same row, which is, in our case, lucky. Because when you have non-unique columns that you’re ordering by, you often need a unique tiebreaker column to give you a reliable result.

You would need to order by, like, the score column and then, like, ID descending or something. Because ID is an identity column and you generally won’t see too many dupes in identity columns. But coming to the execution plans, look what happens here.

In the one where we have Score unencumbered by ISNULL, this whole thing runs in zero seconds with a serial execution plan. Right?

Even though we have to scan this index, that’s okay. Because all we have to do is grab one row out of the index. We do a lookup to get whatever else we want. But this lookup is for one row. And would you believe a one row lookup is not the end of the world?

It’s not a tremendously big penalty for a one row lookup, unless you’re using columnstore. So, this is pretty quick. In the second query, we wrapped the Score column in ISNULL, right?

So, if there were any nulls in this, we replaced it with zero. We have a parallel execution plan. This thing runs at DOP 8.

It takes almost a second to run. And SQL Server finds it necessary to sort the data. Notice we don’t have any sorts in this query plan up here, where we didn’t wrap Score in ISNULL, because the index presents Score to us in an orderly manner, right?

Indexes put data in order. This is one of the things they do that makes queries go faster. We have ordered data.

It’s good for searching. It’s good for already having data prepared in the order that we need to present it in. So, all good there. But as soon as we wrap Score in ISNULL, SQL Server says, oh, no, no.

This is no longer in order. We’ve got to sort this, right? And you can see that this sort is on an expression: Expr1001, descending. That expression maps directly to ISNULL(Score, 0) over there.

So, this goes parallel, requires almost a full second to run, and takes, oh, I guess that’s a one megabyte memory grant. Not the end of the world, but needing one at all for this is pretty ridiculous, because we have an index that has this data in order.
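As best I can reconstruct them, the two demo queries look something like this, with the nonclustered index on Score in place:

```sql
/* Reads Score pre-sorted from the index: serial plan, no Sort. */
SELECT TOP (1) c.*
FROM dbo.Comments AS c
ORDER BY c.Score DESC;

/* Has to sort the whole expression: parallel plan, Sort operator,
   memory grant, the works. */
SELECT TOP (1) c.*
FROM dbo.Comments AS c
ORDER BY ISNULL(c.Score, 0) DESC;
```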

But we ruin that index’s ability, or rather we ruin SQL Server’s ability to use that index effectively by doing this. This can also happen with grouping, right?

And, you know, this is probably less of a performance suck than I think the ordering problem is. But I want you to pay attention to a couple of things in these query plans because they’re interesting, right?

There’s not a vast difference in timing between the two, right? And these both generate parallel execution plans. This one takes about 500 milliseconds. This one takes about 700 milliseconds.

So, there’s only about a 200 millisecond difference. But in the plan where we wrap the Score column in ISNULL and group by that, SQL Server has to add extra operators, and we have to do extra work in order to do that.
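Reconstructed, the grouping pair looks something like this:

```sql
/* Grouping on the bare column: data arrives in index order. */
SELECT c.Score, COUNT_BIG(*) AS records
FROM dbo.Comments AS c
GROUP BY c.Score;

/* Grouping on the expression: an extra compute scalar, a partial
   aggregate, and a sort sneak into the plan. */
SELECT ISNULL(c.Score, 0) AS Score, COUNT_BIG(*) AS records
FROM dbo.Comments AS c
GROUP BY ISNULL(c.Score, 0);
```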

For this small query right here that we’re doing, it’s not that big a deal. For bigger queries that process way more rows, or even bigger queries where, you know, you might join stuff off together, and there might be all sorts of other crazy things going on in the query, this stuff has sort of like a snowball effect on things.

And, you know, I think it’s important to understand, again, coming back to the code is culture thing: if you write cleaner code, you’ll have fewer problems in general.

Paying attention to the basics and fundamentals is really important, because then your code runs more reliably fast. You don’t have as many weird bugs and other things to worry about. And you don’t have to worry about these pile-on effects where, I don’t know, let’s say you stuck this group by query in a CTE, and then you were joining it off to something else.

SQL Server might make all sorts of bad choices because it had to make extra steps, do things less efficiently, maybe make worse guesses as far as cardinality goes. All sorts of pile-on effects can happen as soon as you start allowing these little unwanted things into your code base.

Now, for this one, notice something kind of interesting here. SQL Server starts with a stream aggregate, right?

And this basically reads the data in order from here. So rather than having to, you know, sort data or do anything else, we have the data that we care about in order here.

We do need to sort after this one, because I don’t believe the repartition streams was helping us maintain data in order. That’s okay. But really, the important thing here is that we just don’t have any unnecessary additional steps in this plan in order to get the results that we need.

In this one, we have an extra compute scalar to generate the ISNULL on the Score column for grouping. And then we have this partial aggregate where we partially aggregate data.

And we are off a little bit in the cardinality estimation department. Again, this is the pile-on effect, right? This could make a more profound difference in a query where the cardinality estimation was off by more, and we had to do something else with this result: join it off somewhere else, you know, whatever it is.

And then, coming over here, we do have to, again, repartition, sort, stream aggregate. So we do just a little bit more work in this query. The amount of work that these two queries do, and even that the two sorts do, is pretty inconsequential. And if you just looked at the ordering case on its own, you would probably say, oh, yeah, a serial plan that finishes in zero seconds is better than a parallel plan that takes a full second.

For the grouping queries, you know, again, nothing, nothing end of the world here. I just want you to understand that when you start adding in these extra unnecessary things, you start adding in extra unnecessary work for SQL Server to do in order to produce the result you want.

So, the concept of sargability, if you want to be strict about it, is purely for situations like this, where you’re wrapping columns in functions, and those functions prevent SQL Server from being able to use your indexes effectively to find data, right?

That’s what’s search argument-able. But sargability for me kind of goes a little bit further, because I feel like the concept is important to grasp beyond just searching things.

Sargability is literally just about searching things, but I think the concept of not disturbing the contents of your indexes with function calls wrapped around columns needs a bigger word for it.

There has to be a concept beyond it: sargability is one part, and then there’s this other part where you’re disturbing index order, and disturbing that index order messes up SQL Server in all sorts of weird ways.

So it’s not just the sargability problem when you’re searching on stuff. There’s also this problem where, as soon as you wrap a column in a function, SQL Server has to process things in a different place, right?

Because things like reading data and preserving index ordering, that’s storage engine stuff. That’s stuff that happens way, way down deep in a query plan, right?

Running functions, even built-in functions, which have absolutely no relational meaning to the engine, happens a few layers up, in the expression service of the query execution engine.

So they don’t happen down here. They happen way up here. A bunch of stuff has to happen down here before you get to the point where you’re calling that function.

And that’s where you can really start screwing things up, is because you take away SQL Server’s ability to do things way deep down in the execution plan or in the storage engine.

And you all of a sudden force it to start doing stuff way up here, where it’s far less efficient, right? And again, for this query, there’s only a couple hundred milliseconds difference. But I don’t think most people would expect to see a couple hundred milliseconds difference when the only thing we’re really doing differently is grouping by Score on its own versus grouping by ISNULL(Score, 0), right?

So this is where we put the extra work in for this query. Now, again, a 200 millisecond difference is not gonna end your world. But as you start dealing with bigger and bigger data, or you start using data that you do this sort of thing to in the context of bigger queries, you’re gonna start noticing more and more profound differences, and seeing more and more profound problems with query performance, as time goes on and your data grows and your queries get slower and you start wondering why.

And then you say, gosh, I ought to hire that Erik Darling to fix these problems. And then my job is to come along and delete ISNULL. So, you know, there is that.

Anyway, thank you for watching. I hope you enjoyed yourselves. Hope you learned something. I hope that you didn’t skip over the part of the video where I asked you to like and subscribe and comment and hire me and buy training from me because that would be mean.

And I’m not mean to you. I don’t know why you’d be mean to me. So I’m gonna go upload this and then, I don’t know, maybe I’m gonna record something else.

We’ll see how it goes. I do need to get to the gym today. I do need to keep up this godlike physique that you see before you. So we’re gonna go to the gym.

Maybe we’ll talk more about that in a different video. I don’t know. Maybe that’ll be a TikTok exclusive. We’ll see what happens.

All right. Cool. Thank you for watching.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

Join Me At Data Saturday Dallas Sept 6-7



This September, I’ll be presenting my full day training session The Foundations Of SQL Server Performance Tuning for Data Saturday Dallas.

All attendees will get free access for life to my SQL Server performance tuning training. That’s about 25 hours of streaming on-demand content.

Get your tickets here for my precon, taking place Friday, September 6th, 2024, at Microsoft Corporation, 7000 State Highway 161, Irving, TX 75039.

Here’s what I’ll be presenting:

The Foundations Of SQL Server Performance Tuning

Session Abstract:

Whether you want to be the next great query tuning wizard, or you just need to learn how to start solving tough business problems at work, you need a solid understanding of not only what makes things fast, but also what makes them slow.

I work with consulting clients worldwide fixing complex SQL Server performance problems. I want to teach you how to do the same thing using the same troubleshooting tools and techniques I do.

I’m going to crack open my bag of tricks and show you exactly how I find which queries to tune, indexes to add, and changes to make. In this day long session, you’re going to learn about hardware, query rewrites that work, effective index design patterns, and more.

Before you get to the cutting edge, you need to have a good foundation. I’m going to teach you how to find and fix performance problems with confidence.

Event Details:

Get your tickets here for my precon!

Register for Data Saturday, taking place September 7th, here!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

Performance Tuning TOP PERCENT Queries In SQL Server




Thanks for watching!

Video Summary

In this video, I delve into performance tuning TOP PERCENT queries, a topic that might not come up every day, but is crucial when it does. We explore how to make these queries run much faster by rewriting them with Common Table Expressions (CTEs) or derived tables. By doing so, we eliminate the need for sorts in parallel plans, significantly reducing execution time. I walk through a stored procedure example where I demonstrate this technique step-by-step, showing how it can drastically improve performance without complicating your code too much. Additionally, I share some insights on upcoming events like Data Saturday Dallas and PASS Data Summit, where you might catch me live. If you’re interested in learning more about SQL Server performance tuning or need help with health checks, performance analysis, or training, feel free to reach out; my channel is here for you.

Full Transcript

Erik Darling here with Darling Data. Look at that handsome logo. Look at that handsome head. Look at all that handsome floating around in here. We’re going to try something a little bit different in these videos, because I did notice that when I told people to do stuff at the beginning of the video, they did stuff. So we’re going to start off with my appeals at the beginning of the videos, rather than just at the very end, where most people have stopped watching because they’ve got what they needed from me. And usually those transactions don’t end quite as well. And also, it’s my channel and I can do what I want. When you get your channel, you can do what you want. In today’s video, we’re going to be talking about performance tuning TOP PERCENT queries, which is something you might not have to do very often in your career. But when you do have to do it, it can be quite an excruciating process. And we’re going to talk about how you can make it much easier on yourself. So, about my channel: all of this content is free, and I’d like to keep it that way, without commercials and other crap. So thank you, Intel, for chiming in about the driver timeout. Great. Intel’s not doing so great lately. Their CEO was praying on Twitter the other day as the stock was melting down.

So we’re going to leave Intel be for a little bit. But yeah, all of this content is free. I like likes, I like comments, and I like subscribers. It’s a good way for me to know if I’m doing a good job or not. If you feel like supporting my channel, there are like low cost memberships where it’s like a few bucks a month if you just feel like saying thank you. You don’t have to, but if you’re feeling generous, go for it. I haven’t really promoted this much because I don’t really know what I want to do with it just yet. I think that ideally I would like to turn it into some more sort of like private one-on-one, well, maybe not one-on-one, but like private group stuff for people who donate at higher levels. We’ll see where it goes. I haven’t quite fleshed it all out yet, and I don’t know when I’m going to. So we’ll get to that. We’ll cross that bridge when we come to it. Hopefully, hopefully, and hopefully in a nice car. I’m a consultant. I do SQL Server performance tuning. If you need health checks, performance analysis, hands-on tuning, you have a SQL Server emergency, or you want to get your developers trained so that they don’t stink at SQL Server anymore, you can hire me. I do those things for money. That’s a good transaction for most people. I also sell training. If you go to training.erikdarling.com, there’s also a link in the show notes for this stuff.

You can get over 24 hours of streaming SQL Server performance tuning content at the beginner, intermediate, and advanced levels. And if you pay enough attention to this video, or click the link in the show notes, you can get 75% off everything with the discount code SPRINGCLEANING. As far as upcoming events go, where you can catch me live and in person: Friday, September 6th of this current year, 2024, I will be at Data Saturday Dallas. I have a pre-con on the Friday, and then I’m speaking at the regular event on Saturday. And of course, November 4th and 5th, I will be at PASS Data Summit. This slide has a lot of white space on it, and I haven’t quite figured out how I want to fill that yet. It’s also possible that after September 6th, this slide will have a lot more white space on it, because I will be purely focused on being prepared for PASS Data Summit.

So if you feel like seeing me in the flesh, as they say, those are two good ways to catch me there. And now we have gone black, which is a lot of the opposite problem of white space. And we’re going to move on and cover the subject of our video.

And well, let me actually close this. The query plans, we’re going to get to the query plans. But I have a couple indexes on the post table. I have one that supports what we’re searching for and what we’re ordering by.

And I have one that just supports what we’re searching for, because I want to show you two sort of different query plans related to that. Now, here’s the stored procedure that I have, which looks for the top whatever-percent of rows we plug in for a specific post type ID. Again, this is probably not the most common routine for returning rows out to a client.

But I do see a lot of people use it just because, you know, sometimes, you know, it becomes kind of wonky if you’re like, yeah, give me the top thousand rows, but a thousand rows don’t exist. So some people prefer to return a non-hard-coded number or a non-default number. They want to return a number of rows based on the population of data that they’ve got.

So, it’s a somewhat less common thing to do, but it can also be a little bit less confusing than saying, yeah, give me the top thousand rows, and then the thousand rows don’t come back. So I’ve got this query set up in a few different ways.

Four, to be precise. Four. We’ll count them. Number one, just a regular TOP N PERCENT query to start things off right here, where we’re just doing what we’re doing with no interference. And then below it, I have a query hinted to use the index that supports both the searching and the sorting here.

Then the third query is hinted to use the index that just supports the searching. And then finally, I have a query at the bottom where this is all expressed without any parameters. This is expressed with literal values, because what I wanted to show you is that this is not an effect of the top percent being parameterized.

This is just what TOP PERCENT query plans tend to look like. Below this, I’m going to show you a way to make these go way faster. Okay.
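For reference, here’s a sketch of the slow shape all four variations share; the parameter and procedure names are my guesses, but the TOP (@variable) PERCENT pattern is the part that matters:

```sql
/* The naive version: TOP PERCENT straight against the table.
   Names are assumptions based on the demo. */
CREATE OR ALTER PROCEDURE dbo.top_percent_slow
(
    @top_percent float,
    @post_type_id integer
)
AS
BEGIN
    SET NOCOUNT ON;

    SELECT TOP (@top_percent) PERCENT
        p.*
    FROM dbo.Posts AS p
    WHERE p.PostTypeId = @post_type_id
    ORDER BY p.CreationDate DESC;
END;
```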

So, coming back to the query plans, I think potentially my new least favorite trail of query plan operators is a sort, which of course is parallel, because right next to it, we have a gather streams. And then that goes into an eager table spool. And the reason why I pre-ran all of these? You can probably guess by looking at some of the numbers here.

These are not the fastest boys in the world. No, these are very slow boys. This one takes nearly a full minute.

This one down here takes 30 seconds, which is an improvement by nearly half. But still not great. Still not what I would call fast.

And then, down here… I think the reason I wanted to have an index where both search and ordering are supported, and one where only search is supported, is because the query plan difference here is interesting.

And the reason why I think the query plan difference here is interesting is, of course, because when you have a nested loops join, SQL Server only considers the stuff over here for how much faster it would be to execute the nested loops in parallel. Parallel nested loops don’t always operate cooperatively the way that parallel merge or hash joins do. With parallel merge or hash joins, you’re probably very used to seeing the number of rows on the inner side of the join split up equally, or hopefully as close to equally as possible, amongst the DOP threads in the query.

With a parallel nested loops join, you’re really running DOP copies of the nested loops on the inner side. So every thread is going to have every row on it. There are a couple caveats to that that are too involved to cover here.

But just in general, when you see parallel nested loops, you should know that you’re dealing probably with a lot more rows than you think you are per thread. So this one is interesting because this query where we have the search and the sorting supported does not get a parallel plan. This gets a serial plan which runs for about 30 seconds, which again is twice as fast as any parallel plan that we have in here.

The one up top, fully parallel plan. Granted, we scanned the clustered index. We have kind of a nasty sort here.

But even going from the sort to the gather streams, that goes from 18 seconds, jumps up to 35 seconds. And then going into the eager table spool, that jumps up to 58 seconds. So really awful sort of like chunks of time spent waiting for those operators to process rows around.

In the serial plan where we lose the sort, things do improve, and they improve dramatically: this query is twice as fast as the other one. But it’s still slow as hell. Right.

And a lot of the time, you know, we have a full 18 seconds getting into there. And then this seek to key lookup thing is not a great scenario. We spend almost 11 seconds just in these three operators alone. So that’s also unfortunate.

But what I thought was really interesting is that when we flip back to the parallel plan with the sort in it, we go from about three seconds here, and this jumps up to 11 seconds here. So we spend just about eight seconds in the sort.

And then from there, it’s another 15 seconds or so in the gather streams, and then a whole bunch of time in the eager table spool. That series of operators is just real unfortunate in these plans.

And then, of course, the one way down at the bottom is just the one with the literal values, which is pretty much an exact duplicate of the parameterized one up at the top. It’s a few seconds faster.

It’s about seven seconds faster, for whatever reason. I didn’t really dig into why; it’s not really pertinent. Mainly what I wanted to show you here was that you get the same plan shape with literal values as with parameterized values. So, one way to tune these queries and get much faster performance out of them is to use either a CTE or a derived table, whatever you want to do.

So I use the CTE here because I like to give equal opportunity to queries where it doesn’t make a difference. And inside of the CTE, what I’m doing is taking the count of qualifying records, multiplying it by the top parameter, and dividing that by 100 to turn the percentage into a row count, right?

So this piece of math right here will get us the top percent that we care about. And then, what I’m doing down here is selecting from that pct table where I get the percent; pct is a pretty obvious shorthand abbreviation for percent.

I’m cross applying to the Posts table, and I’m getting the top calculation from up here. So this records column is the number of rows that we need to get to satisfy that percentage.

And I’m passing that row count in here, to the top. And I’m doing the same thing in here with the where clause, and the same thing here with the ordering. And then, one thing that you always need to be thinking about is that SQL Server does not guarantee ordering unless you tell it what to order by. So just getting the select top percent in here ordered by creation date descending is not enough to guarantee that the presented results will also be ordered correctly.

So we have another order by out here to take care of that. And the nice thing is that when we create this stored procedure, or rather when we alter that stored procedure to use the new query, I’m going to execute this the exact same way: top one, post type ID equals one.
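Here’s a sketch of that rewrite; names are assumed from the demo. The CTE turns the percent into a concrete row count, and CROSS APPLY passes it to a plain TOP:

```sql
CREATE OR ALTER PROCEDURE dbo.top_percent_fast
(
    @top_percent float,
    @post_type_id integer
)
AS
BEGIN
    SET NOCOUNT ON;

    WITH pct AS
    (
        /* count of qualifying rows * percent / 100 = rows to return */
        SELECT records =
            CONVERT(bigint,
                CEILING(COUNT_BIG(*) * @top_percent / 100.))
        FROM dbo.Posts AS p
        WHERE p.PostTypeId = @post_type_id
    )
    SELECT ca.*
    FROM pct
    CROSS APPLY
    (
        /* plain TOP with a row count, not TOP PERCENT */
        SELECT TOP (pct.records)
            p.*
        FROM dbo.Posts AS p
        WHERE p.PostTypeId = @post_type_id
        ORDER BY p.CreationDate DESC
    ) AS ca
    /* ordering is only guaranteed by the outermost ORDER BY */
    ORDER BY ca.CreationDate DESC;
END;
```

The TOP inside the CROSS APPLY can reference pct.records from the outer scope, which is what lets a plain row-count TOP stand in for TOP PERCENT.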

And this is going to be a lot faster. You’ll notice that a lot of the time was spent returning the 60,000 rows out; the query itself actually finishes in about half a second.

So it’s very easy and convenient for us to seek to the rows we care about in here, aggregate (do our count to generate a number), and then use that inside the top here to get those 60,003 rows, which is the top 1%. All sounds great.

Everything’s good here. This pattern will generally apply well to most executions, but as the percentage gets higher against a lot of rows, performance is going to suffer.

So if we crank this up to, let’s say, the top 20%, on top of the fact that we’re going to spend more time sending rows out to SSMS (because 20% of 6 million is a fairly high number), this is going to slow things down. But most of the slowdown is returning the results to SQL Server Management Studio.

Of course, I could dump it into a temp table and maybe ignore it, or return the results from there. But a lot of the time in here is just time spent returning the results. The actual query itself finished executing in about three seconds flat.

We spent a whole bunch of extra time returning 1.2 million rows out. Now, there is a slight downside to this in that the top is always going to estimate 100 rows. But as long as your indexes are set up to support the seeking and ordering that you care about for the top end percent, that’s not going to hurt you too much.

Where it would hurt you is if you didn’t have the sorting element assigned in the index, and you had a sort operator with very variable memory grant requirements. Obviously, sorting 60,000 rows is a lot different than sorting 1.2 million rows, as far as how much memory you’d need, whether you’d want a parallel plan for that, things along those lines.

But generally, this pattern works out way better, like, up until a very, very high row count. And even then, so, like, let’s be a little bit honest about, like, high row count return queries. This thing dumped out 1.2 million rows, right?

That’s 20% of the 6 million something rows that have a post type ID of 1 in the Posts table. Cool. If you are returning 1.2 million rows to anybody, with the exception of, like, exporting to an Excel file or some other kind of file format, ain’t no one looking at 1.2 million rows.

No one is going to go through all 1.2 million rows and do something with them aside from, like, copy and paste them to another thing to make those results more portable. Most likely an Excel file. Most, you know, if you dump out 1.2 million rows to an end user, they’re not going to do anything with 1.2 million rows in your application results, most likely.

They’re going to take those 1.2 million rows, paste them into Excel, do whatever, you know, goofy Excel stuff people do in Excel, and then, you know, use that for whatever they’re building, right? Whether it’s a chart, graph, something, VLOOKUP, I don’t know. Whatever people do in Excel.

It’s crazy. So, returning 1.2 million rows out to SQL Server Management Studio or an application is generally not something you should be doing. If you’re at the point where you’re doing that, I have questions that you can pay me to ask you about just what the hell you’re doing, or what the hell an end user would be doing with that much data getting returned to them. Because most of the time, they’re not going to get to the end of 1.2 million results and be like, hmm, I’m satisfied now.

They’re just not. That’s generally not the way most human brains work, especially in this day and age where everyone is in therapy and medicated and has some sort of neurodivergency that prevents them from paying attention to 1.2 million rows of anything. Right?

Or that’s like, maybe that’s just basic human sanity. Like, 1.2 million rows? I’m not looking at all that. You would forget everything you saw. Right?

There’s generally no point to that. So, I think for most people doing TOP PERCENT things, you’re going to be pretty safe with this setup, because you will hopefully never be returning millions and millions of rows, especially to SQL Server Management Studio, which has a notoriously difficult time ingesting, displaying, and formatting all that stuff quickly. It’s not a fast boy for that.

So, anyway, thank you for watching. I hope you learned something. I hope you enjoyed yourselves. I hope that you paid attention earlier in the video where I talked about liking and subscribing and hiring me and buying training and all the other stuff. Because, you know, I like to have friends when I record these and I don’t like to feel lonely.

So, anyway, thank you for watching. I’m going to prepare one of these other demos. You see I have many tabs open up at the top there.

I’m going to go prepare one of those to record. And then I’m going to change one slide in the deck and I’m going to do that all again. So, anyway, it’s my channel. Remember?

Do what I want. Okay, cool. Thank you.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

A Little About Computed Columns, Filtered Indexes, and Indexed Views in SQL Server




Thanks for watching!

Video Summary

In this video, I delve into the world of indexed views in SQL Server and how they can be used to circumvent some of the limitations associated with computed columns and indexes. I share my experience from a previous attempt where my voice sounded like it had emerged from a crypt, which led me to create this video with a bit more preparation. We explore why you can’t create filtered indexes on computed columns or indexed views, and how we can work around these limitations by incorporating the filter expression directly into the view definition. I also discuss the practical implications for Standard Edition users and Enterprise Edition users, emphasizing the importance of using the `NOEXPAND` hint to ensure that SQL Server utilizes the indexed view effectively.

Full Transcript

Erik Darling here with Darling Data. And this is actually my second take of this video, because on the first take, it was actually the first words that I had spoken aloud all day, and I sounded like I had just emerged from a crypt. It even felt strange to me. So in this video, we are going to talk about how you can use indexed views to get around some of the limitations with computed columns and indexes in SQL Server. It’s going to be a lot of fun, assuming I don’t start sneezing at any moment now. So the big problem with computed columns is you can’t create filtered indexes on them. You just can’t do it. It’s impossible. You just get errors. It’s actually kind of a miserable experience. Like many other things in SQL Server, it is pure pain and agony and suffering. So here’s what we’re going to do. We’re going to create this table called indexed_view, which is not confusing at all, is it? It’s sort of like when you have a table with tbl in the name, and then eventually something happens where you have to convert that to a view that hits a table with a different definition, and now you have a view called tbl_something. It’s the same deal, right? So that’s good times. And we’re going to stick a little bit of data into this table just to make it semi-realistic. It’s only about 15,000 rows. Not a lot. That’s okay. We don’t need a lot to prove out the concept. Now, the problem, like I was saying, is that you cannot create a computed column on an indexed view.

I’m sorry, a filtered index on a computed column. So what it says here is: filtered index ‘n’ cannot be created on table ‘dbo.indexed_view’ because the column ‘not_fizzbuzz’ in the filter expression is a computed column. Rewrite the filter expression so that it does not include this column. So helpful. Well, it’s the only column that I care about filtering on, so what should I do there, Microsoft? Tell me what I should do. Illuminate me. I would love to know more. The one thing that I want to do can’t be done. Why bother telling me to do something else?
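If you want to play along at home, the setup looks roughly like this. The table name and the not_fizzbuzz column come from the demo; everything else here, including the computed column’s expression, is my assumption, since it isn’t shown on screen:

```sql
CREATE TABLE dbo.indexed_view
(
    id bigint PRIMARY KEY,
    some_value int NOT NULL,
    not_fizzbuzz AS (id % 2) /* computed column; expression assumed */
);

/* this fails with the error quoted above, because
   filter expressions can't reference computed columns */
CREATE INDEX ix_not_fizzbuzz
ON dbo.indexed_view (some_value)
WHERE not_fizzbuzz = 0;
```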

So here comes the indexed view. And before I create this, I should pause for a moment, because, again, like most things in SQL Server, there is agony and pain and suffering here. Indexed views and their limitations are certainly a large source of pain and agony and suffering for many SQL Server users. They can be used to great effect in very simple scenarios. And perhaps their limitations are a feature, and not just pure, unabashed laziness.

Because they do kind of prevent people from putting really crazy things into indexed views, and maybe indexed view maintenance for those things would be tough. But it is a real shame that even basic things like MIN and MAX aren’t supported by indexed views. Legend and lore has it that when they were putting indexed views in, Connor Cunningham decided that having an additional nonclustered index to support MIN and MAX was just too much.

So we don’t get that now. Great. Connor is working on CPU instructions, and we’re still suffering with not being able to put MIN and MAX into an indexed view.

Such happy times for us, the leftovers in SQL Server. Grand. So, yeah, we can’t create this. We just get errors.

No matter how many times you click this, we get errors. Now, the main place where I still find valid uses for indexed views is on standard edition. And they’re great for standard edition, because Microsoft has hobbled columnstore and batch mode in standard edition to the point of utter disrespect for their users.

So often it’s just like: we can try it, but what’s the point? It’s just, like, nothing.

It’s pointless. So, indexed views can be good because, you know, where columnstore in enterprise edition excels at being able to do large aggregations very quickly, you can often use indexed views to pre-compute those aggregations.

Indexed views do, of course, have, you know, maintenance downsides. But, you know, even in enterprise edition workloads where there’s some, you know, something about the workload or something about the table or something about, I don’t know, a million other things that make using columnstore indexes impractical or impossible, indexed views can even still be good there.

But, you know, in this case, we’re using it to get around a rowstore limitation with filtered indexes. So, let us continue. We are going to create a view.

And I just want to show you this part first, because this part is interesting, too. We’re going to create this view called computed_column. Again, not at all confusing. And we’re going to attempt to create a filtered index on that.

Now, we’re going to get a completely different error here. It’s going to say... let’s format this a little bit for easy reading. All right?

We don’t want to make reading any harder than it is. We are high school dropouts, after all. So: the filtered index can’t be created on computed_column because it is not a user table; it is a view.

Indexed views in general cannot have filtered indexes on them. But at least this error message is somewhat helpful. Consider creating an indexed view with the filter expression incorporated in the view definition.

Boy, howdy. We can do that, can’t we? We can follow those instructions. Those are actually useful instructions.

Whoever wrote that error message, you deserve some sort of gold medal. I hope you’ve gone on to a good job after that. So, we’re going to put our filter expression into the indexed view.

And we are going to comment the filter part of the index creation out. And we are going to recreate both indexes. Now, this gets us where we want to go.
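Sketched out, with the same assumed names, the workaround looks something like this. The filter expression that the filtered index rejected moves into the view’s WHERE clause instead:

```sql
CREATE VIEW dbo.computed_column
WITH SCHEMABINDING /* required for indexed views */
AS
SELECT iv.id, iv.some_value
FROM dbo.indexed_view AS iv
WHERE iv.id % 2 = 0; /* the computed column's expression, inlined */
GO

/* the unique clustered index is what actually materializes the view */
CREATE UNIQUE CLUSTERED INDEX cx_computed_column
ON dbo.computed_column (id);
```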

We have essentially an indexed, materialized, pre-computed thing, filtered to the stuff that we care about. Right? Sort of like that computed column.

And the big problem, of course, was that we can’t create a filtered index on a computed column. But we can create the indexed view to materialize that. And under most circumstances, as long as you have reasonable supporting indexes behind the tables and the indexed view, you’re in decent shape.

And, just being honest, I’m really not a big fan of indexed views that span multiple tables. Most of the time, I’d rather create two indexed views with one table apiece than join two tables in one indexed view. It’s a treacherous set of circumstances.

But now we have what we want there. And part of this is going to be because of the small data set that I’m using: we don’t get an interesting enough query plan just selecting some data from the table, or just getting a count from the table, matched on the not_fizzbuzz column being equal to zero.

Right? It’s just 15,000 rows. In real life, if someone said, we need an indexed view on this 15,000 row table, I would probably punch them.

It’s not reasonable. So when we run this query and we look at the execution plan, we will see something that we’ve seen in a few other examples and videos that I’ve recorded here before. This query is at the mercy of both getting a trivial plan and simple parameterization.

We can tell because the literal value that I used before has been replaced with @1, we have all these silly, completely unnecessary brackets injected into our code, and we have a complete lack of AS in the aliasing, which is something that I would never do, because I’m a professional human being.

So in order to get around that, you can always do the old 1 = (SELECT 1) trick. If you’re thinking about typing in the comments, why are you using 1 = (SELECT 1)? I have bad news for you.

Type it into a search engine instead, and you will get both a blog post and a video where I explain it. So if you ask in the comments, I’m banning you from my channel for life. Kidding.

I’m not. I will probably make fun of you a little bit, though. So if I run this and I look at the execution plan. Oh, yeah. Well, actually, I should backtrack a little bit.

What I meant to say up here is that, because of all that, you can notice that instead of using the indexed view that I created, we’re using the base table, which is named indexed_view. So that’s the thing we don’t like there. But if we add in the 1 = (SELECT 1), we avoid the trivial plan and the simple parameterization, and we switch to using the indexed view that I created, called computed_column, which is precisely what we wanted.
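The query shape is roughly this, reusing the assumed demo names:

```sql
/* the nonsense-looking 1 = (SELECT 1) defeats the trivial plan and
   simple parameterization, so indexed view matching can kick in */
SELECT COUNT_BIG(*) AS records
FROM dbo.indexed_view AS iv
WHERE iv.not_fizzbuzz = 0
AND   1 = (SELECT 1);
```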

But here’s an important thing, particularly for the much abused, left behind standard edition users: you usually want to include the NOEXPAND hint. Standard edition does nothing for indexed view matching.

The NOEXPAND hint is necessary in 99.99999 percent of cases where I need to make sure that the indexed view is reliably hit, rather than the base table behind it. The other really important thing, even for enterprise edition users, when it comes to indexed views and the NOEXPAND hint, is that this is the only way for SQL Server to create system statistics on columns in the indexed view. If you don’t put NOEXPAND in there, it doesn’t create histograms for you.
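A minimal sketch of the hint, again with the assumed names:

```sql
/* NOEXPAND forces the optimizer to use the view's index, which is the
   only way indexed views get used on standard edition, and it also
   lets SQL Server create statistics on the view's columns */
SELECT c.id, c.some_value
FROM dbo.computed_column AS c WITH (NOEXPAND)
WHERE c.id % 2 = 0;
```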

It’s really weird. I don’t get it personally, but it does appear to be the case. So sometimes in query plans, you might see a warning that says something like, oh, columns with no statistics, or something like that.

It’s a very misleading warning. Sometimes it’ll happen because you’re querying an indexed view, and you don’t have statistics on it because you didn’t use NOEXPAND when you queried it.

Other times, it’s because you don’t have statistics in the very specific column order that SQL Server would want. So you do have statistics on these columns; it’s just not the exact statistics. It’s almost like a missing index request that’s wrong.

That’s like a missing statistics request that’s wrong because you do have statistics. It’s just not the specific statistics that SQL Server wants. So what did we learn today?

What did we learn? SQL Server has a lot of bizarre limitations. You can create computed columns. You can create filtered indexes.

You just can’t create filtered indexes on computed columns. You can create indexed views, but you can’t create filtered indexes on indexed views. You can apply filters to indexed views, and then create whatever indexes you want on that view: clustered, and then whatever nonclustered indexes. But indexed views are a tough sell in a lot of cases.

You know, with the exception of people on standard edition, who are stuck with a just beat-to-death version of columnstore and batch mode. That’s no good, and indexed views can be good there.

And then, if you’re on enterprise edition, but for some reason columnstore is a no-go for you, for whatever weird pathological reasons you have in your database, whether it’s the data types, cursors, or other kinds of constraints and stuff, that’s another story. Columnstore does have some limitations, but, you know.

So I am still rather fond of columnstore most of the time. Rather sunny about columnstore. It’s got some good stuff going for it.

And Microsoft actually seems to be actively working on columnstore, which is a nice change of pace from indexed views, filtered indexes, partitioning, and a billion and a half other features that have been left at sort of V1, V2, with no real investment afterwards. So, you know, columnstore at least has that going for it. It’s okay there.

So, there we go. Thank you for watching. I hope you enjoyed yourselves. I hope you learned something. If you like this sort of SQL Server content, you can like the video. You can comment on the video.

You can subscribe to my channel, and you can join over 4,000 other lovely data darlings in getting notified every time I publish one of these videos, so that you can stay on the... I don’t know. I’d call it the cutting edge, but, gosh, is SQL Server the cutting edge anymore?

I often wonder. I often wonder. It’s a fun thing. Fun thing to ponder.

Fun thing to consider. Why? What happened to SQL Server? Why is everything seemingly spinning out of control? So, you can do that.

And then, I don’t know. Maybe you just like looking at me. Maybe you just like the sound of my voice. I don’t know. I don’t know what it is that gets people to subscribe.

But, if you do that, you get notified. And I’ll get larger subscriber counts. And larger subscriber counts are kind of my fetish at this point. So, you know, the higher that number goes, the happier I am.

And the happier I am, the more I record. So, it’s like a good feedback mechanism. You subscribe. I record.

We’re good. Right? Everyone. It’s a happy ending for everyone. All right. Apparently, I have a call starting soon. So, I’m going to go do that. And that call is with a nice client who is paying me money so that I can record these things for free.

There aren’t even commercials on my channel. Like some other SQL Server channels. So, you can watch these things uninterrupted.

And if you would like to hire me for consulting, well, I mean, you know my name. You know that my name is my website. And you can always get in touch with me that way.

All right. Great. Cool. Now, we’ve done all our plugs at the end. Because I’m an idiot, and I should do it at the beginning. But, you know. Can’t teach an old dog new tricks.

All right. Goodbye. Thank you for watching.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

So You Want To Implement Soft Deletes In SQL Server




Thanks for watching!

Video Summary

In this video, I delve into the considerations for implementing soft deletes in SQL Server tables. Whether you’re looking to add a soft delete feature to an existing table or considering building new tables from scratch, there are several key points to keep in mind. I discuss the challenges of backfilling nullable columns and the importance of batch updates when making such changes to large datasets. Additionally, I explore the complexities involved in ensuring data integrity with unique constraints and check constraints, drawing on insights from Alexander Kuznetsov’s work, which is now available through an Amazon affiliate link. The video also covers practical advice for creating views and filtered indexes to simplify querying active versus inactive rows, emphasizing the importance of proper index design based on query patterns. Lastly, I touch on the broader implications of soft deletes compared to actual data deletion, including the potential benefits of using temporal tables or archiving deleted records in an archive table.

Full Transcript

Erik Darling here with Darling Data. And as promised, I’m here to talk about some considerations for implementing soft deletes in SQL Server. Now, there are a lot of reasons to want to do this. And really, the implementation of a soft delete, and the inclusion of a which-row-is-currently-active type thing, are almost two different outcomes, right? Because there are some situations where you might want to be able to soft delete something just because it was a mistake, or just to mark it for archival or something. And there are other times when you only want to query the most active row. We’re going to talk a little bit about both of those. Now, the first thing that I want to point out is that setting this up for a table that currently exists, and is of any remarkable size, is really difficult, for a couple of reasons. Not because of the column itself: if you’re adding a nullable column with no default value, it gets added to the table instantaneously, right? There are no locking and blocking problems with that. Your problem after that becomes: okay, which rows are deleted and which rows are not? It would behoove you to backfill that column with some data, to make sure that people know whether a row is deleted or not. Because what you don’t want to do is add that in and start dealing with three-valued logic with zero, one, and null. That’s just a recipe for disaster. So backfilling those, and figuring out which ones are deleted or not, can sometimes be a challenge. If the goal is just, we need this new thing, and we’re going to start soft deleting rows, don’t worry about it. Make everything zero.

But don’t make everything zero all at once, because if it’s a table of any remarkable size, you’re going to want to batch those updates. That’s a lot kinder on your server than just adding a column with a default of zero and having that happen to all the rows in one go. Batching them up is a lot nicer. So if it’s a situation where neither one of those scenarios really fits what you want to do, there are definitely times when building new tables with those design improvements instead is a smarter and safer bet. I’ve been a part of a couple of projects that were geared towards doing that.
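Going back to the backfill for a second: a minimal sketch of the batched approach, with completely hypothetical table and column names, looks something like this:

```sql
/* adding a nullable column with no default is a metadata-only change */
ALTER TABLE dbo.users ADD is_deleted bit NULL;
GO

/* backfill in batches to keep locking and logging polite */
DECLARE @rows integer = 1;

WHILE @rows > 0
BEGIN
    UPDATE TOP (5000) u
    SET u.is_deleted = 0
    FROM dbo.users AS u
    WHERE u.is_deleted IS NULL;

    SET @rows = @@ROWCOUNT;
END;
```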

On those projects, I wrote all sorts of things to migrate the data to the new schema, make sure that everything was set up the way it was supposed to be, all that stuff. So that is a doable, achievable task. It’s just one of those things where it really comes down to what the goal is. Now, if you only want to have one row per group be active, you’re going to have to deal with at least one unique constraint.

And depending on the complexity of what defines an active row, you could need a whole bunch of unique constraints. Now, this guy, Alexander Kuznetsov, and I don’t know, I apologize, Alexander, if you watch this video and either I butchered your last name or you prefer to be called Alex. I’m not sure. I’m not sure you’ll ever see this.

From what I know, you’ve moved on to Postgres. But Alexander wrote a book that I actually, I read and loved a long time ago called Defensive Database Programming in SQL Server. However, I will have an Amazon affiliate link to that book.

So if you do decide to buy it, I’ll make a third of a cent. I’ll finally make some money off one of these videos. But Alexander also was a blogger for a while.

And he wrote on SQLblog.com, which has gone to the dustbin, which is a shame, because there are a lot of great blog posts on there that you can now only get through the archive.org stuff. One of the articles that he wrote was about storing intervals of time with no overlaps. And this is the table definition. I’m just going to format it a little bit, so this is not how the table appeared in the original post.

I just need to format that a little bit so that, when I zoom in, my giant head doesn’t cover anything important. So, in order to implement this for a table that has five columns, setting ID, value, started at, finished at, and previous finished at, you need one, two, three, four, five separate constraints to ensure that this data is stored correctly. You need a primary key.

You need a unique constraint on setting ID and previous finished at. You need a foreign key in the table that references itself. You need more check constraints to ensure that there are no overlaps between previous finished at and started at, and between started at and finished at. There’s a lot that you have to do to make this work right, in order to ensure that your table doesn’t have any bad data in it.
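For flavor, here’s a rough reconstruction of that pattern. The column names, data types, and exact constraint definitions are my assumptions from the description above, so treat Alexander’s original post as the authority:

```sql
CREATE TABLE dbo.settings_history
(
    setting_id int NOT NULL,
    setting_value int NOT NULL,
    started_at datetime2 NOT NULL,
    finished_at datetime2 NULL, /* NULL = the currently active interval */
    previous_finished_at datetime2 NULL,
    CONSTRAINT pk_settings_history
        PRIMARY KEY (setting_id, started_at),
    CONSTRAINT uq_settings_previous
        UNIQUE (setting_id, previous_finished_at),
    CONSTRAINT uq_settings_finished
        UNIQUE (setting_id, finished_at),
    /* every interval has to chain off the end of an existing one */
    CONSTRAINT fk_settings_self
        FOREIGN KEY (setting_id, previous_finished_at)
        REFERENCES dbo.settings_history (setting_id, finished_at),
    CONSTRAINT ck_settings_interval
        CHECK (finished_at IS NULL OR started_at < finished_at),
    CONSTRAINT ck_settings_no_overlap
        CHECK (previous_finished_at IS NULL OR previous_finished_at <= started_at)
);
```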

So you can already imagine the complexity. And imagine that you have a table that you thought was set up to do this, and you’re just now realizing that it wasn’t set up correctly, and now you have to add these constraints in.

I will bet you a million dollars that adding at least one of these will fail. Right. Adding at least one of these constraints to your current set of data will fail miserably. In some cases, you just might want to use temporal tables.

Now, temporal tables are a feature that I do not get excited about, because there are a lot of problems with them. There’s a blog post by a very famous Bob about when you can encounter error 13535, where having temporal tables can cause data modifications to fail. And this is just the tip of the iceberg when it comes to problems with temporal tables.

There are all sorts of very strange bugs that can happen under concurrency, when using transactions to modify data in a table just like this, that can make things really weird and complicated, and can even leave you with incorrect data sometimes. So temporal tables aren’t a feature that I get very excited about, but given a simple enough implementation, they’re probably okay for some requirements. Again, nothing that I’d get like, wow, use temporal tables.

They’re amazing. It’s more like you could try them. Make no promises. It’s like getting on an old roller coaster. You don’t really know how safe it is.

You don’t really know what might happen. You just don’t want to be that unlucky. You don’t want to be that unlucky passenger. You know, Microsoft and Boeing have a lot in common where there’s been this real degradation in quality over the years.

And a lot of really half-baked features have been implemented with not a lot of assurances that things will go well with them. So be careful out there. All right.

But let’s just say that you go the route of implementing soft deletes in your tables, and you want to make sure that everything works right. One thing, actually two things, you’re going to want to do is create a couple of views.

And you’re going to want to create a couple views because you should never trust a developer. Never trust a developer to do the right thing or remember what they’re supposed to be doing. So whenever I’ve implemented soft deletes for clients, you know, the nice people who pay me to make these videos, you could be one of them someday.

You know where to find me. What I do is, I will create a view that explicitly... I will create a video. No. Erik Darling will create a view.

Erik Darling does create videos, but not for soft deletes. Well, actually, I am creating a video about soft deletes. This is wild. But when I implement soft deletes for clients, I will create views for active and inactive whatevers, views that specifically have a filter for whatever we’re looking for: is_deleted equals zero, or is_deleted equals one.

There might even be a third one that just hits both, but that could just be the table. Really, what I want is for people to be looking in the right place. And from my perspective, it’s a lot easier to see that a developer is hitting the wrong table than that they forgot something in a where clause.
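Something like this pair of views, with hypothetical names, is all it takes. If a query hits dbo.users directly instead of one of these, that’s easy to spot in a code review:

```sql
CREATE VIEW dbo.users_active
AS
SELECT u.user_id, u.user_name
FROM dbo.users AS u
WHERE u.is_deleted = 0; /* the filter lives here, so no one can forget it */
GO

CREATE VIEW dbo.users_deleted
AS
SELECT u.user_id, u.user_name
FROM dbo.users AS u
WHERE u.is_deleted = 1;
```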

So there’s that. The other thing that you’re probably going to want, and this is a very common thing, is a unique constraint on your table, to ensure that there is only one not-deleted thing per person. Now, going back to the Alexander Kuznetsov stuff with the overlapping intervals: that’s a lot different from just making sure that someone only has one active row, period.

There might be all sorts of other considerations for that. And you might need more than just the column that I’m showing here for the unique index. It depends on what you’re trying to accomplish.

But in general, you just want to make sure that there’s one unique ID that is not deleted, right? You generally don’t want a bunch of active rows. If you’re okay with having a bunch of active rows in there, if there might be duplicates for some reason, then you could just create a regular nonclustered index.

But having that filter in there is very important. The other big thing about doing this, and this is just a general piece of advice: with filtered indexes, you want the column or columns from the filter’s where clause to be part of the index somewhere, and generally, having them as an include is good enough. If they’re not physically in the index, it’s a real crapshoot whether that filtered index will get used when it should.

That goes for bits. That goes for null or not null. That goes for, you know, greater than, less than, in, whatever the setup is.

Just do yourself a favor. Don’t screw this part up, because you might get real disappointed with your filtered indexes. Please put the columns that you are filtering on somewhere physically in the index, either as a key column or as an include.
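A sketch with the same hypothetical names, assuming a design where a user can have many rows but only one active one:

```sql
/* uniqueness only applies to rows matching the filter, so each
   user_id can have at most one not-deleted row */
CREATE UNIQUE NONCLUSTERED INDEX ux_users_one_active
ON dbo.users (user_id)
INCLUDE (is_deleted) /* the filter column, physically in the index */
WHERE is_deleted = 0;
```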

Like I said, generally the include is good enough, but if it’s very special to you, you might want to put it in the key. It all kind of depends on what the column is filtering on. Now, the nice thing about doing this with the views and the filtered indexes: let’s assume that the majority of your code is either from an ORM, where things are parameterized, or based on stored procedures, where things are parameterized.

Even if you have forced parameterization turned on for your database, as long as your query is partially parameterized or in a stored procedure, you don’t have to worry about the issue with filtered indexes and parameters or variables. So, one big catch with filtered indexes: let’s just say you have a piece of code that’s filtering on the is_deleted column, and you want people to be able to search for either is_deleted equals zero or is_deleted equals one. Again, we’re throwing three-valued logic out with this.

If you allow nulls in a bit column, there is a special place right next to me in hell. So you better watch out. But when you have a parameter or variable for this, SQL Server can’t take advantage of your filtered indexes unless you throw a recompile hint on.

There are other ways to snake around it, with dynamic SQL and putting a literal value in for one part of it. But in general, you don’t want to have a parameter to search on a filtered index column, because SQL Server needs to cache and reuse an execution plan that’s safe for is_deleted being zero or is_deleted being one. So, at that point, it’s probably going to ignore your filtered index, and you’ll get an unmatched index warning.
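A sketch of that catch, and the recompile escape hatch, with the same hypothetical names:

```sql
/* with a plain variable or parameter, the cached plan must be safe for
   any @is_deleted value, so the filtered index gets skipped;
   OPTION (RECOMPILE) lets the optimizer see the runtime value */
DECLARE @is_deleted bit = 0;

SELECT COUNT_BIG(*) AS records
FROM dbo.users AS u
WHERE u.is_deleted = @is_deleted
OPTION (RECOMPILE);
```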

Now, the reason why I like to create these views up here is because these are literal values. All right. And even with forced parameterization turned on, if your query is in a stored procedure or partially parameterized, SQL Server won’t try to parameterize this part.

Right. If you have a trivial enough execution plan, you might get simple parameterization. But, again, most people live in a sufficiently complex world where that’s not reasonably going to happen.

You never know. But it’s something to watch out for. So when you’re going to implement soft deletes, the three things that you need to consider are: what’s really your goal?

Are you going to allow multiple not-deleted rows, or can you only have one active, not-deleted row per user, or whatever entity you want to call it? How are you going to set things up for people to query those, so that they don’t have to remember to apply some predicate? Especially if you are using ORMs, it can be really difficult to remember to do these things in your queries, or even know how to do these things in your queries.

That’s why the views work out really well. You’re going to figure out whether you need unique indexes or non-unique indexes to maintain... not the referential integrity, but the sort of logical constraints of the table.

And depending on query patterns, not every index can be filtered. Right? Not every index is going to be geared towards a query that’s looking for is deleted or is not deleted. There are some that might just have to span the whole table.

Implementing soft deletes is often a lot less scary to business users than actually deleting data because once you delete it, it’s gone. Now, granted, you can do things where either, again, temporal tables or you could put a trigger on your table so that if someone deletes a row, that row goes off to some other table until you’re legally not required to have it anymore. If you’re going to go that route, though, you might as well implement the soft deletes.
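If you go the trigger route, a minimal sketch looks like this, assuming an archive table with matching columns already exists. Every name here is hypothetical:

```sql
CREATE TRIGGER dbo.users_archive_deletes
ON dbo.users
AFTER DELETE
AS
BEGIN
    SET NOCOUNT ON;

    /* the deleted virtual table holds the rows removed
       by the triggering statement */
    INSERT dbo.users_archive (user_id, user_name, archived_at)
    SELECT d.user_id, d.user_name, SYSDATETIME()
    FROM deleted AS d;
END;
```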

And then you might as well have a process that moves soft-deleted rows off to some archive table eventually. It can be after three months, six months, nine months, seven years, whatever you’re really meant to do. Because even with filtered indexes, and even with queries geared towards using only the non-deleted rows, it’s usually a good idea to keep your tables as small as possible, especially if the deleted rows really start to pile up.

So that’s another big thing to consider: what is going to be the ratio of deleted to not-deleted rows in your system? Because at that point, you have to wonder if filtered indexes are even useful to you, right?

Because if your table is, like, 60, 70, 80, 90% not-deleted rows, and soft deletes are a rarity, filtered indexes don’t really buy you all that much. Unless you need that uniqueness, unless you need to ensure that only one user can have an active, not-deleted row at a time, it doesn’t really make sense to add those in, except as sort of a logical check constraint to keep your data the way that you expect it to look. So generally, you know, these are the questions that I ask.

These are the things that I talk about with clients. You know, we might come to an understanding that perhaps adding soft deletes to existing tables is not what we want to do. Perhaps we would prefer temporal tables if we are willing to accept all of the risks that come along with using them.

There might be a solution with triggers and other things that would be more palatable. Figuring out exactly what the end goal of the soft deletes is, figuring out the requirements before we implement stuff, usually leads to a much better solution than just saying, oh, you wanted soft deletes? Cool.

Here’s a new bit column. That’ll be $10,000, please. Because I’ve seen a lot of that happen. And what the end result is, is people have this soft delete column. No one actually soft deletes anything.

And it’s kind of a bummer. So, you know, really, really figure out the problem you’re trying to solve by implementing soft deletes before you just go and throw a brand new column in a table or you go through a bunch of work to try to establish all the other stuff that we’ve talked about. So, thank you for watching.

I hope you enjoyed yourselves. I hope you learned something. If you like this video, you are free to give this video a thumb, a nice happy comment, a smiley face. Whatever you’re feeling at the time, feel free to comment.

If you like this sort of SQL Server content, you can subscribe to my channel. There are buttons that allow you to do all sorts of fantastic things in that regard. And lastly, but not leastly, thank you for watching.

I will see you in the next video. Where hopefully I don’t have to soft delete this video for any reason. That would make me sad.

That would just mean that the last 20 minutes of my life were all for naught. Which, you know, has happened far too often to me. Ugh.

You know, I think if you really tally up the number of regrettable 20 minute spans in your life, you can come up with a lot of those, right? Anyway. I’m going to go think about that for a while.

Might need a drink to wash that thought down with. So, I will see you in the next video. Goodbye.

Goodbye.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

One Way To Track Row Changes In Place In SQL Server




Thanks for watching!

Video Summary

In this video, I dive into the world of tracking changes in SQL Server tables using temporal columns—a feature that can help you monitor when data modifications occur without the complexity of full-fledged temporal tables. I start by explaining why change tracking is often a problematic choice and highlight how change data capture (CDC) simplifies setting up change tracking for new or existing tables, though it comes with its own set of challenges. Then, I demonstrate how to add temporal columns to an existing table to track modifications, emphasizing the importance of hidden junk date columns and default constraints that help record when a row was last modified. Through practical examples, I show how these changes can be implemented and queried, providing insights into their usefulness for data auditing and movement processes.

Full Transcript

Erik Darling here with Darling Data. Of course, I forgot to adjust my camera before I got to recording, so we’re going to make sure that we are crisp, clean consultants for this video. And I’ve been told that I should do this at the beginning, when people are still watching; apparently it’s more effective. If you like SQL Server performance tuning content, or just SQL Server informational stuff in general, and you end up liking this video, you can like the video and you can comment on it. You can also subscribe to my channel, if you want to be notified whenever I publish these things. So we’re doing that upfront this time. I’m not going to thank you for watching, because you haven’t watched it yet. They call that A-B testing in market research. In this video, we’re going to continue with the theme of yesterday’s video, the last video, not really yesterday yet... that was deep... with different ways to track changes in SQL Server. In yesterday’s video, we covered the fact that change tracking is one of the worst features that Microsoft has ever added to SQL Server, and that change data capture is fantastic. In this video, we’re going to talk about how you can use temporal columns, not temporal tables. Temporal tables have way too many problems for me to say, yes, you should use them. But temporal columns can help you make this a little bit easier. Now, full disclosure: change data capture is a lot easier to set up for a table, because you just start tracking new changes and all that stuff. If you wanted to add this to an existing table, especially if it’s a big table, it could be painful to add these columns, because you are adding not-null columns with default constraints, which means a whole lot of data pages get written to. But if you have a brand new table, or you’re designing a table, and you want to be able to track when data changed in that table, this could be a good way to do it. Now, keep in mind: unless you’re using soft deletes, this will not show when something got deleted. Soft deletes will track when the change happened to a row, like when you change is_deleted from zero to one. But if you delete a row out of the table, this won’t track that. You would need something more robust, like change data capture. If your only goal is to figure out when a row got modified, though, this is a pretty good way to do it. So what I’m going to do is drop this table if it exists, and then create this table. And this table is going to have all the hallmarks of a table that’s set up as a temporal table, but there’s no history table for it. We’re not assigning a history table. We are adding the two columns to it that you need to have. And this is really important: we’re going to talk about two important things at this juncture.

One, we have this column called junk_date, and we don’t really need junk_date. It doesn’t do us any good. So we’re marking it as hidden. We don’t want it to show up in select queries. We don’t need it. It’s useless to us. The other thing I’m going to cover is this: we have a default constraint on the column that we do care about, the one that shows us when something was last modified. I have SYSDATETIME() here, because this will help me know exactly when a row got into the table, right? That’s useful in some cases. In other cases, you might only care when a row changed in the table.

If that’s the case, it might look kind of weird and confusing to have a current date and time in that row, because that row might’ve just shown up in there. We don’t know if that’s when anything actually happened or not. So, if you want to know that a row actually changed at some point in time, having a current date in there might be a little confusing.

What might be a little less confusing is if you put a really old date in there, like 1900-01-01 or something. That obviously wouldn’t tell you when the row got into the table... I mean, unless you had a SQL Server in 1900, in which case, share your time traveling secrets with me, dear leader. I have some ideas.

One of them is called the stock market. Sounds like a winner. The other one is becoming a credit card company. That’s another good one. I got goals, you know.
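For reference, adding those two columns to an existing table looks roughly like this. The table and column names follow the demo; the max-datetime2 default on the hidden column is the standard requirement for a row end column:

```sql
ALTER TABLE dbo.votes_track
ADD last_modified datetime2
        GENERATED ALWAYS AS ROW START NOT NULL
        CONSTRAINT df_last_modified DEFAULT SYSDATETIME(),
    junk_date datetime2
        GENERATED ALWAYS AS ROW END HIDDEN NOT NULL /* hidden from SELECT * */
        CONSTRAINT df_junk_date
            DEFAULT CONVERT(datetime2, '9999-12-31 23:59:59.9999999'),
    PERIOD FOR SYSTEM_TIME (last_modified, junk_date);
```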

Anyway, so these are the two columns that we have to add to the table in order to make this whole thing work. And then, of course, we have this PERIOD FOR SYSTEM_TIME, blah, blah, blah thing, because apparently that’s just what’s syntactically correct. So let’s stick some rows from the votes table into the votes_track table.

Not a whole lot of them, just about a year’s worth. And then we’re going to look at what data ended up in there because, I don’t know, it seems like a reasonably fun thing to do. And of course, this insert is taking its sweet time, even with a TABLOCK hint. Screw you, SQL Server.

All right. So let’s take a quick view of what’s in this table. This is everything for vote type ID 7.

And you’ll see that we have a last_modified date over here of the day and time that I’m recording this. And since we bulk inserted this data, these are all going to have the same value. If you were inserting a single row at a time, or two or three rows at a time, they wouldn’t all have the same value.

But, you know, this, because it was just a big whopping insert, they all have this last modified date of today, which is not yesterday yet. Very deep, very deep thinkers here at Darling Data. And so we got that and that’s all well and good, right?

We have that. So right now, let’s just make a mental note that this is 2024-08-01 at 18:02:34, right? And now let’s update the bounty_amount column and set that to four nines, which is not three sixes. Four nines.

We’re not getting weird here. And let’s look at what’s in the table now. All right.

So this has obviously incremented by however long I was talking, because if we look at the results of these two things, they will have slightly different times. So we did not affect vote type ID 4, which still has a last_modified date of 18:02. And the last_modified date for the rows we did change is 18:03.

So this is another way of figuring out when data changed in your table. If you need to take that data and either, you know, audit it, be like, hey, this looks funny. It changed.

When did it last change? Huh? Now, it’s not going to tell you who changed it, and it’s not going to tell you what the previous values were, so it’s not really good for a data auditing scenario. But it is good if you need to take this data and put it somewhere else, and you need to figure out, again, sort of like we talked about with change data capture...

Let’s say you have a process that will look at this table every X minutes and look for data that has changed since the last time it ran and put that data into another table. Maybe it will aggregate it and do something else with it, but this is one way that you can figure out, hey, these rows changed since the last time this process ran. I need to take these rows and move them over.

And it’s kind of cool for that because, you know, it’s not change tracking, which is the worst feature that Microsoft has ever added to SQL Server. Or at least probably top five worst features. Again, I’ve seen that thing cause more trouble than it’s worth, I don’t know.

It’s just brutal. And, yeah, so this is just kind of another fun way to do that. And I’ve used this with a few clients to help them with, you know, data movement processes, which, you know, they turned out to be pretty happy with.

And happy clients is what I aim for, right? Again, clients, the nice people who pay me to make these free videos. If you would like to hire me so that you can support this channel and I can keep making free videos, you know how to find me.

We already talked about liking and subscribing and commenting, so we can skip that part in case you forgot because some of you people are a little forgetful up here. But anyway, thank you for watching. I hope you enjoyed yourselves.

I hope you learned something. And I will see you in the next video, which will not at all be about tracking changes. It just might be about considerations for implementing soft deletes, which would actually kind of go hand in hand with this. So, we’ll see how that goes, I suppose.

Maybe I’ll even see you there. Maybe you’ll even decide to subscribe and get notified for when that video comes out. Because I promise you, you will learn something.

I hope. I hope you will learn something. Or you will enjoy yourself. Or both. Lots of options. We have lots of potential here.

You and me, we’re going to be together for a long time. Anyway, thank you for watching.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

Join Me At Data Saturday Dallas Sept 6-7



This September, I’ll be presenting my full day training session The Foundations Of SQL Server Performance Tuning for Data Saturday Dallas.

All attendees will get free access for life to my SQL Server performance tuning training. That’s about 25 hours of streaming on-demand content.

Get your tickets here for my precon, taking place Friday, September 6th, 2024, at Microsoft Corporation, 7000 State Highway 161, Irving, TX 75039.

Here’s what I’ll be presenting:

The Foundations Of SQL Server Performance Tuning

Session Abstract:

Whether you want to be the next great query tuning wizard, or you just need to learn how to start solving tough business problems at work, you need a solid understanding of not only what makes things fast, but also what makes them slow.

I work with consulting clients worldwide fixing complex SQL Server performance problems. I want to teach you how to do the same thing using the same troubleshooting tools and techniques I do.

I’m going to crack open my bag of tricks and show you exactly how I find which queries to tune, indexes to add, and changes to make. In this day long session, you’re going to learn about hardware, query rewrites that work, effective index design patterns, and more.

Before you get to the cutting edge, you need to have a good foundation. I’m going to teach you how to find and fix performance problems with confidence.

Event Details:

Get your tickets here for my precon!

Register for Data Saturday, on September 7th here!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

Making It Easy To Tell When Data Changed With Change Data Capture In SQL Server




Thanks for watching!

Video Summary

In this video, I delve into tweaking change data capture (CDC) tables to make them more human-readable by adding a `change_time` column that captures the exact timestamp of when rows were modified. This simple tweak can significantly ease the process of working with CDC data, especially for tasks like populating data warehouses or performing other data movements. I also share my personal disdain for change tracking (CT), describing it as one of SQL Server’s worst features due to its tendency to make servers back up under high concurrency. The video walks through enabling CDC on a sample table and demonstrates how the `change_time` column can be used to streamline data movement processes, making it easier to identify new or updated records based on timestamps.

Full Transcript

Erik Darling here. Still Darlinging the data, as far as I know. Haven’t quite had any acquisition offers lately, so, you know, who knows, maybe someday. The dream of every founder is to be acquired and go start some new business. So, any day now, I’m sure. My billions await. Maybe I got into the wrong line of work for acquisitions, though. Who knows? Anyway, in today’s video, we are going to talk about how you can tweak change data capture tables so that you have a more human-readable idea of when rows ended up in your CDC tables. This can make all sorts of things easier. Now, look, there are built-in functions that can help you with this, that do all sorts of things. There’s lots of wacky stuff with log sequence numbers. The problem is, they’re complicated. It’s hard. I don’t like doing it. I like things to be easy, especially when my goal is to build bigger tasks around change data capture, whether it’s populating a data warehouse or doing some other data movement stuff. This is one of my favorite tricks. Now, for so many reasons, I absolutely loathe change tracking.

Change tracking is one of the worst features that Microsoft has ever rolled into SQL Server. It is crap on wheels. I have seen it take down so many servers under high concurrency, because what happens is you start to stack up these commit table waits. It’s commit, underscore, table waits. It’s a monster. And when your server backs up on those commit table waits, there is no helping it. You’re going to flame out. Again, change tracking, CT, is god-awful. If you turn it on, no one’s going to want to help you. I don’t even know if anyone at Microsoft still does anything with it.

I think it’s abandonware at this point. Change data capture is beautiful. It’s wonderful. It is my precious angel, because change data capture works asynchronously. You set up change data capture for a table or tables, and as changes happen to those tables, change data capture reads from the transaction log and moves that data into tables that you can actually see and do things with in your database, which is wonderful. Because you can do all sorts of neat stuff with those tables, including indexing them any way you want. You can add columns to them. You can, I don’t know, let’s see, you can prune them out whenever you want.

It’s not like change tracking where you have to do magic incantations and hope that whatever thing you run actually clears things out of the change tracking tables. Again, change tracking, nightmare nuisance, change data capture, beautiful, wonderful angel. So if you take one piece of advice from me and you work at a company where someone is like, do we want change tracking or change data capture?

Please say change data capture and save yourself all the problems in the world. Now, neither one of those things will tell you who changed something. Right? That’s auditing. That’s a completely separate thing.

You just need to track what data changed. Change data capture is what you should be using. So in my database called crap, and if you do not have a database called crap, I question whether you are a real database person, because every database person needs a database called crap to do crap like this in.

It’s just the way it goes. So I’ve created a table called posts, let me get that formatted correctly, that mimics the Posts table in the Stack Overflow database. At least the publicly available copy of it.

I’m sure the actual, you know, production copy of the post table is a nightmare. And I’ve just stuck a thousand rows in there to make life easy, because I felt like it, because I don’t need a huge table to show you this. So what we’re going to do is we’re going to walk through enabling change data capture for the crap database.

Oh, that happened so quickly. That was nice. And then we’re going to enable change data capture for the post table. And this will take a couple seconds to kick in, because it does all sorts of stuff in the background.

And, well, you know, okay, well, whatever. Cool. And what we’re going to do is, we’re going to add a column to the... we’re going to... jeez.

We’re going to add a column to the change data capture instance for the posts table, called change_time, with a default of SYSDATETIME(), that will tell us exactly when data changed. So let’s go and alter our table.
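Pulled together, the setup looks roughly like this. The database, table, and column names follow the demo; with the default capture instance, the change table lands at cdc.dbo_posts_CT:

```sql
USE crap;
GO

EXECUTE sys.sp_cdc_enable_db;

EXECUTE sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name = N'posts',
    @role_name = NULL;

/* the trick: bolt a human-readable timestamp onto the change table,
   so every captured row records when it arrived */
ALTER TABLE cdc.dbo_posts_CT
ADD change_time datetime2 NOT NULL
    CONSTRAINT df_change_time DEFAULT SYSDATETIME();
```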

That happened very quickly. That is lovely. Now, if we go and look in here, there will be nothing, because we have not captured any data changes. There simply have been no data changes to capture.

That’s okay, because we’re going to make some changes. It’s going to be great, right? It’s going to be so much fun. So we’re going to update the table, and we’re going to plug that in there. And one row got affected.

And now, wonderfully, magically, we’re going to have data in our change data capture table. Look at all this great stuff. Now, if you were looking to use change data capture professionally with the built-in functions, you could use the LSNs in here and whatever to figure out when things change.

But that’s Bush League. You want to do things in a way that people will actually be able to use, right? It’s like extended events.

Like, Microsoft just made it as hard as possible to do anything. And they’re like, why doesn’t anyone use this? Why do people still use Profiler?

Why? Well, I’ll tell you why. Because you made it impossible. You jammed it full of XML. What did you expect was going to happen when you made something as difficult as possible? You built something that was great for, like, you know, like smart support engineers to use.

You ruined it for normal people. Much like using LSNs to track when things changed. But now, if we scroll over a little bit in our change data capture results, we will, way over at the end here, have this lovely column called change_time, with the exact time and date that something changed.

This gives away when I’m recording this, doesn’t it? Crap. You’ve got me. Now, if we make a different update to that table, we’re going to just add 1,000 points onto the score here.

And we revisit our, oh, that highlight did not go well at all. And we revisit the change data capture results. Now we have some additional data in there. And if we scroll over a little bit further, we will see that the change time column is now incremented to exactly when we made the change over here.

Right? So this is about a minute and a half apart. So we see exactly what time something came in. This can be useful for all sorts of things.

Like, if you’re the type of person who uses change data capture to push data to another source, or rather another destination, another target, it’s a lot easier to use this change_time in your queries to figure out if this is new data that you need to move.

Because in whatever process you run, you can say: the last data movement was at this time. Is the change_time column greater than that time? Yes!

We move that data. Is it less than that? No, we don’t touch it. No more futzing around with anything else. You can have a jolly old time moving your data over without having to worry too much about it. And so for these reasons, which I believe to be self-evident, change data capture is superior in absolutely every single way to change tracking.
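In query form, that check is something like this, with an assumed watermark from the last run of the movement process:

```sql
/* hypothetical watermark recorded by the last run */
DECLARE @last_run datetime2 = '2024-08-01 18:00:00';

SELECT ct.*
FROM cdc.dbo_posts_CT AS ct
WHERE ct.change_time > @last_run;
```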

If you don’t like it, again, I’ve got the Delta Miles to come fight you. I’m happy to do it. So now we’re going to finish this thing off by turning off change data capture for our table and database.

Because if I don’t, then every time I run sp_WhoIsActive, I’ll have some weird job running in the background that just makes things weird and confusing. And no one needs that. So there we go.
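For reference, the teardown is just the enable steps in reverse:

```sql
EXECUTE sys.sp_cdc_disable_table
    @source_schema = N'dbo',
    @source_name = N'posts',
    @capture_instance = N'dbo_posts';

EXECUTE sys.sp_cdc_disable_db;
```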

I’m going to finish this video by saying, thank you for watching. I hope you learned something. I hope you enjoyed yourselves. I hope you never use change tracking in your life. I hope you always opt to use change data capture instead.

I should really start doing this stuff at the beginning of the video before everyone stops watching. But, you know, if you like this video, I like thumbs ups and I like comments.

Again, even mean comments because I do have the Delta Miles to come fight you. But still, you know, take your chances. Why not?

If you enjoy this type of SQL Server content, you can subscribe to the channel and join now over 4,000 data darlings in their joyful bliss every time they get a notification that I publish a video. So you should do all three of those things.

Except, I don’t know, maybe be nice. You could choose to be nice. What is it people say? It costs nothing to be nice.

I don’t know if that’s true. Sometimes there’s a physical cost to being nice. There’s a physical toll. But other times, you know, it’s cool. It’s fine.

You know, tip your bartenders. That’s basically my motto: be nice to your bartenders. The people in the world who I am the nicest to are bartenders, because anyone who just keeps bringing me drinks is my best friend.

So much like the way a dog will look at you when you give them a liver treat, that’s how I look at bartenders who bring me drinks. So that’s, that’s me though.

You might have your own kink. I don’t know. Whatever. Anyway, I’m going to get going. Apparently I have a phone call starting in a few minutes, so I should probably attend to that. That would be a client call.

One of the lovely people who paid me money to do SQL Server Consulting so that I can keep recording these videos for free. If you find yourself in the market for a SQL Server Consultant to help you with performance or choosing between change tracking and change data capture, you know how to get in touch with me.

Here I am. My rates are reasonable. Thank you for watching. I’m going to go entertain someone for money now. That sounds wronger than it actually is, but gosh darn it, it’s the truth. All right.

Cool. Adios.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.