All About SQL Server Stored Procedures: Batching Modifications

Video Summary

In this video, I delve into the topic of batching modifications in SQL Server stored procedures, a practice aimed at managing large-scale data changes more efficiently. Batching helps mitigate issues like locking and blocking, as well as potential server resource overloads, by breaking down big tasks into smaller, manageable chunks. Throughout the video, I explore various scenarios where batching can be particularly useful, such as when dealing with updates that might require rolling back if an error occurs. I also discuss practical techniques for implementing batch modifications, including using primary keys to efficiently filter and process data in batches, setting up external control tables to manage loop execution, and considering isolation levels to handle potential concurrency issues. If you’re interested in supporting the channel or need consulting help with SQL Server performance tuning, there are links provided to get started. Additionally, I cover some rare but important considerations like preserving previous data states for rollback purposes, which can be crucial in certain situations.

Full Transcript

Erik Darling here with Darling Data, and we are back to talking about stored procedures in SQL Server. And in this video, we’re going to talk a bit about batching modifications. Because there are some things that come along with batching that not a lot of people think about as they get along in their process of stuff. Right? And, there are some important things to consider, depending on what kind of batch modification you’re doing. And, also, what, what, like, how you need to handle various, like, tying back into last video about server-seed with the error handling stuff, sort of figuring out, like, okay, well, like, what’s going to happen if under these, these circumstances, like, if I hit an error, if, like, something weird happens, like, what do I do? So, we’re going to talk a bit about that. If you like this channel, and the content that I produce, and you think, wow, Erik Darling is worth four bucks a month, there’s a link down in the video description, where you can become a member of this fine channel, and, and do that. But, don’t let four dollars a month be, like, like, like, the limit. You can go, you can go higher than that if you want. Like, I’m not gonna, I’m not gonna complain. If you want to, like, instead of buying me, like, some of one espresso, you want to buy me some of two espressos, that’d be, that’d be cool, too. If you are, if you’re, if you’re, if you’re more of a tea person, and you, you just don’t want your money going towards the, the evil coffee bean, you, you, you can do all sorts of other things to support the channel. You can like, you can comment, you can subscribe, and if you want to ask questions that I will answer publicly in my office hours videos, you can do that via that link, which is also a link down there to make it really easy for clicking. If you need consulting help with SQL Server, if you are having performance problems, and you, you think it’s high time you hired someone to come look at these things, well, guess what? My rates are reasonable. I do all this stuff, and I’m great at it. Thanks for watching. If you want some SQL Server training, I have that, too, also at a reasonable rate, 100, but 150 US dollars per month once you apply that there discount code, and, and, and, I don’t know, you, you, you sit, and you watch it, and you learn, and you get better at stuff.

You actually have to watch it for that to happen. It just does, does not happen through osmosis. You cannot just headbutt the screen and receive information. SQL Saturday, New York City, 2025 is happening May the 10th, with a pre-con on May the 9th, a performance tuning pre-con by Andreas Walter, and we will, we will have a great time on both of those days. I will be at both, um, serving you lunch. I’m gonna be dressed up like a, like a, like an old school lunch lady. Hairnet, apron, rubber, weird baggy gloves. I got, got the works. It’s gonna be fun.

Um, anyway, let’s talk a little bit about batching modifications. So, um, the goal of batching modification is to accomplish a large number of row modifications without, like, causing a big issue on the server. Um, you know, like, if you have a very, very large amount of data that needs to change in a database, um, batching modifications is a great way to lessen the impact by separating it into smaller changes, smaller batches of changes. Um, you know, like, from the perspective of, you know, like locking and blocking, uh, the transaction log and, uh, you know, like hitting, like physical problems with the server, like, like, like your log file fills up and like you, like you hit an error and the whole thing rolls back.

Um, you know, like having, like hitting a, like hitting an error in one batch is a whole lot less devastating than hitting an error when you’re like 99% of the way through a really big modification. Uh, there’s just a lot of stuff that, um, you know, there’s a lot of stuff that, um, you know, there’s a lot of stuff that batching modifications can help. One thing that it may not help is the overall time it takes to get that amount of work done.

Cause you still have to do that full unit of work, right? Like let’s say, um, you need, you have to do something to like 10 million, 20 million rows and doing it all in one chunk would take an hour. Uh, batching up the modifications, it might still take an hour, but the point is, during that hour you’re, you’re just biting off smaller chunks rather than trying to do the whole thing at once.

Right. That’s, that’s the goal of batching the modifications. Uh, probably like, you know, obviously like the canonical post for me about bashing modifications is Michael J. Swart, uh, originally published back in 2014.

I still reference it whenever I need to start something, uh, just to make sure that I’m like, like mentally, I’m like in the right place for it. Uh, you might have some other requirements as well. Uh, a while back I wrote this post, uh, called batching modification queries is great until, and, uh, this is about updates.

So I do want to walk over to that, uh, that code for a second, because, uh, there is some, there is some stuff in here that, you know, does bear repeating. So like, if you wanted to, um, if you wanted to write code in a way where like, like if you’re like inserting data into a table from like something else, like batching inserts, fine.

Like, you know, like you, you, you don’t have to worry about too, too much because you have the source data over here, right? Like you’re just pulling data from here over here. And then when you’re done with that whole thing, like with all your data is there and safe and sound, you can truncate this if you want.

You can also keep it for reference. I don’t, I don’t know. I don’t care if you’re deleting data and you’re just like, this is stuff that we no longer care about, or this is stuff that we legally have to get rid of and we can’t keep it around anymore. Well, I mean, again, like if a batch error is out for some reason and you move on to the next batch, you just have to figure out at some point, like, okay, what went wrong there?

But if you’re updating data and, uh, either something goes wrong or you’re in a situation where like, okay, look, uh, I have to update these million rows. I’m going to do it a thousand rows at a time.

But if any one of these batches errors out, uh, I have to change this data back to the way it was before. I can’t keep going. I like, I like there is a problem with like the data somehow that I need to keep going. Like one handy way of, uh, dealing with that would be to like do an update with the output clause and output the, the data that changed into another table so that you can roll back that stuff.

Because if you’re doing, uh, all these batch modifications, you don’t want to do it all within the scope of one transaction because then you have most of the same problems that you would have if you were trying to do stuff in just one big update anyway.

And if you do it in like smaller transactions where you’re like, you know, you can either like do a transaction for every update or, um, we’re going to talk later about the concept of, um, like, like batching batches kind of for like with transactions.

Like, like once you commit that, that’s, that’s done. And you can’t just, you can’t roll that back. Like you’re onto the next transaction once that starts.

So like, like for some stuff, you know, you may have to preserve the data as it was before you made the changes for updates in order to roll that stuff back to something else. So the update, um, the update stuff is very useful or rather do the update with the output to capture that data can be a pretty useful thing in those cases.

Those are pretty rare though. Like those aren’t things that I run into very often. Right. So I, and like, well, there’s, there’s two things about that. One, it’s not a requirement that I run into very often.

Most people are just like, well, if one batch hit a problem, we can, you know, we can just figure out what went wrong with that batch. We don’t need to roll back like, you know, the 90 million rows we just changed because, you know, there was a problem with row 10 million, right?

It’s not, not, not a very common requirement, but if it is, there are ways to, there are ways to preserve the previous data and then like batch back through and read and like undo all the work you did.

It’s a pain, but there are ways to do it. Um, it would be nice if, uh, like it would be nice if SQL Server had something like Oracle has with like the flashback tables.

Um, you know, the, uh, the temporal table SQL Server has probably wouldn’t, wouldn’t be the greatest thing for this, but they, you could use that too. Uh, temporal tables just have too many problems for me to like recommend them is, you know, a great source of truth for stuff.

So, um, one thing or rather a couple of things that are worth talking about in here are, you know, a, like how you find where you want to start. Uh, typically you want to use a primary key column because that’s going to get you a, it’s like, it’s not nullable, which is great.

Uh, B, um, you know, uh, it’s, it’s, since it’s a primary key, it’s indexed and you have, it’s very easy for you to sort of like page through a primary key where you’re like filtering on say, like where the primary key is greater than this and then order by the primary key.

It’s, it just makes life a lot easier. If you don’t have a primary key on the table, you’re stuck with some weird cursor options, right?

There’s some very strange ways to page through, uh, unordered data with a cursor. We’re not going to talk about that here though. Cause that’s, that’s, that’s a little, a little bit too, um, a little bit too, uh, exotic for, uh, for general consumption.

But there are two ways to do it, right? Uh, you could either, you know, just like declare some, uh, parameters or declare some variables or use some parameterized stuff.

And you can find stuff in like the, the table that you want, the rows in the table you want to affect in one batch and then go through here and another, another way you could do this, uh, and like, so actually one thing that was worth talking about here is, um, whenever I need to do something like this, I like to set up like sort of an external control table.

Because what happens is, uh, you know, if like, let’s say this batch runs for some amount of time, right? You can build, you can build a timer into this where you’re like, you know, if this has been running for six hours, it needs to stop.

Right. Like, you know, like even like Ola Hallengrin scripts have like stoppers in them. Like this job is only allowed, like after the job is run for this long, we’re like, don’t process any more commands. So you could totally do something like that here.

And you could totally do something like that with a control table. Another thing that, but one thing that I like to do is, uh, has to have an external control table.

And, you know, I like to capture some information about what, like, you know, what’s going on in there. So like, you know, like if I have multiple jobs that have to do something like this, I’ll put in the job name, the name of the loop that I’m currently in. If there are multiple like steps to it.

Um, if the name of the step that I’m in, if it’s the current step that’s executing, or if it’s just like the next thing, uh, the last completion time. And usually you compute, like a computed column where I’ll have like the start time and the completion time, like set, like, uh, computed out with date diff to like minutes or hours or seconds or something.

Uh, so that I can, uh, I can very easily, um, like sort of like gauge how, like if I, if I need to fix anything in any of these loops, right? Like, like how long does this take?

Wow. There’s something that’s terrible in there. Maybe, maybe an index or something. Uh, but I like to have a, like at minimum a table that helps me control if the loop should keep going because inevitably what happens is like, like if you don’t have anything built in to stop it, like you’re eventually, you’re going to be like, uh, okay, well I’m going to do something and, uh, this has been running for a long time.

Uh, it’s getting into the business. I’m just going to hit cancel. Like if you’re, if you’re in the middle of this, like, that’s kind of annoying. Like you want this thing to be able to finish. So what I, what I like to do is set up sort of a loop, what I call a loop controller table.

And, uh, you know, I’ll just say, you know, should I continue? Yes. And then in the loop, I’ll say, you know, uh, while the, the row count is greater than zero.

In other words, while I still have rows to process and, uh, you know, like I’m allowed to continue, I want to keep running this loop. So that way, if I decide that like this thing has been running too long, I can just update the loop controller table to false.

And then the loop will stop the next time before it starts another batch of modification. So I can at least finish the last thing that I’m doing, uh, cleanly before, uh, chain, before bailing out.

So like, you know, just, you know, like, and again, if, if, if you have a loop controller table with like, you know, job name, step name, stuff like that. And it, like, you would have to just tell your where clause to make sure you’re ending the loop for the thing, the specific thing that you care about stopping.

Another, another interesting question that comes up with, um, with these sorts of batch modifications is from, is around isolation levels. Um, again, this is not a terribly common requirement, but it is worth noting that no matter which way you do this, uh, you are subject to concurrent potential concurrency issues.

So if you care about that, you need to think about like a, like transactions and B locking hints for when you find data. Uh, so there are like, there’s essentially two ways of looking for it, but both of them are subject to concurrency problems potentially, like not, not always.

And you might not care about them, but if you’re, if you’re not okay with them, you do have to think about that is like, let’s say like you’re using read committed, right? The, the crappy isolation level.

If you’re unfamiliar with why I think read committed is a crappy isolation level, I highly suggest you watch my series. Everything, you know, about isolation levels is wrong, very good stuff in there, all about problems with read committed.

But, uh, generally like, let’s say you do something here and you find this data and then something happens here in another transaction, which alters like whatever, like the, like the, the correctness of the data that you found here, like maybe invalidates something that you found in here.

Like let’s say there was some additional where clause where like you’re saying where the largest key processed is greater than or equal to the last thing that I just did. And the last modification date is less than, is like older than it, like a week or a month or a year or something.

And then something happens that like changes that last modification date. And all of a sudden you would be doing something to a role that you shouldn’t be doing something to anymore. Uh, that could be a problem, right? Like all of a sudden you shouldn’t be, you shouldn’t be doing something with this.

Right. And that could happen if you did the update like this too. Um, because you know, in this, in this scope, you would, you would take a lock here, but in this scope, you would not have, um, you would not have that same, you would not have any locks taken that would prevent data from being read or prevent data from moving around data you’ve already read.

So something could happen, like even with like a sub query thing like this, where you would invalidate the stuff that you would, you would be working on in the update. So there’s, there’s, you do have to be careful. So in those cases, um, you know, like depending on how, depending on what you care about and what you need to, what you need to happen, you could either add in locking hints.

Uh, you could use a static cursor, which copies the table data that you’re going to be working on, um, if you don’t want that to happen, you can use a dynamic cursor where it would be working directly off the table data.

Uh, or if, if you, if you like a key set cursor would also technically preserve the key values, but not the, not any other table data. So the static cursor might be the best way to go depending on what you want. Uh, but you would might have, if you, if you do care about isolating the work that the batch is doing and not allowing any other queries to mess with that data while you’re in the process of like either finding it or modifying it, you might have to use, uh, you might have to, you know, use a transaction and you might have to add in, um, oops, that’s, uh, that should have been upd lock, not, not hold lock.

Hold lock is a synonym for serializable, which means I probably messed that up somewhere else too. But anyway, uh, so you would have to, um, you know, because like, actually this came up in the YouTube comments recently, one thing that I like about some other database platforms like Postgres, MySQL, I think even Oracle has it is the select for update clause.

So like, instead of having to be surprised by locking hints in your from clause, you could do something like in, like when you write the query, you say like select for update up here.

And then that it turns that select into a locking select. So you could, you could do that here, like take locks on the rows that you, you, you find in this batch and then like pass that in here.

But again, you would need, uh, you would need this down here as well. If you want it, like, so like if you’re finding the next batch, next batch, next batch max, uh, using a select before you do the update, you would have to put that in there.

And then if you are finding the, the, the thousand rows and aware clause, you could do that down here, but you would need the locking hint. If you care about like nothing else, like messing with that data as you make your way through.

Um, but, uh, like kind of going back to, um, uh, some of the stuff that we talked about before, uh, in the previous video about like error in situation handling, writing batches like this is, uh, a great place to have retry logic built in, uh, because there are going to be taught, there are going to be times when your batch gets either deadlocked or blocked and you don’t want to, uh, you don’t want to like keep trying to, or like cause a bigger blocking chain that might already exist.

You like just want things to kind of keep going and make progress depending on the batch or the, depending on the purpose of the batch, you might, you might not care to the point where you could put like a read past hint in your query as a, as a locking hint so that you skip over any rows that are locked.

But you know, that, that’s a, that’s a, that’s a, that’s a, that’s more like a queue table situation. What we’re talking about here is, um, where you like a situation where like, you know, let’s say you kept hitting deadlocks with, when you try to do your modification, or let’s say that, you know, um, if you don’t want your process colliding with anything else, you might stick like a lock timeout or like, you know, retry logic if you had a deadlock.

And this comes back to like how you would incorporate some of the, the retry stuff that I talked about in the last video, uh, with a, with a batching process so that you could very easily, um, you know, like if you hit a problem within your loop, you just go back and start your loop over.

Like you just retry that batch the next time around. So that, that would be one way of doing it. So, um, yeah, anyway, uh, there’s a lot of interesting stuff to think about when it comes to batching modifications.

Um, you know, a, uh, is, is this something that I can batch? Is this something that like, um, like I, I, I, I, it’s even allowable to, you know, split this up into chunks.

There are some situations where it’s not, uh, the best way to write the batch. Again, that Michael J. Swart blog post is probably my favorite. Like it’s a great starting point for things. It’s a great explanation of things. There’s a lot of good technical detail in there.

Um, if it’s an update loop and you need to like maybe, maybe roll that stuff back, then, you know, that, that would be, that would be one thing. Um, I guess that, that could go for deletes too, but that depends on how much you care about the data.

Uh, and then of course, um, you know, like all sorts of other concurrency stuff like should be at least on your mind when you’re writing this stuff. You might decide immediately that you don’t care, but at least, you know, have the mental conversation with yourself about, uh, how much protection, uh, your batching process needs.

And, uh, always, always, always give yourself a way to bail out of the loop. So have that external control table or like a job logging table, uh, where you can control exactly like, you know, how long a loop can run for, or, uh, if you want to bail out on the next loop through, because that gives you a lot of flexibility for controlling the loop.

It’s not just hitting cancel in the middle of it and wondering when things will finish. So thank you for watching. I hope you enjoyed yourselves.

I hope you learned something and I will see you in the next video where, uh, well, I haven’t quite, I haven’t, I haven’t quite, uh, organized the schedule yet. So, uh, the next video might be a surprise.

We’ll see. We’ll see how it goes. Anyway, thank you for watching.

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

All About SQL Server Stored Procedures: Batching Modifications