A Little About String Splitting In SQL Server

A Little About String Splitting In SQL Server



To get the PASS Summit discount, head over here and use the discount code DARLINGE24 to get $150 regular price on the 3-day convention.

Thanks for watching!

Video Summary

In this video, I dive into the appropriate use of string split in SQL Server queries, highlighting common pitfalls and best practices. Erik Darling from Darling Data shares insights on how using string split directly within a query can lead to poor performance due to inaccurate cardinality estimations. He demonstrates this with a practical example, comparing it to his previous video on table variables, showing that both methods result in similarly suboptimal performance. To avoid these issues, he emphasizes the importance of dumping the results into a temporary table before using them in your query, which allows SQL Server to make better estimates and optimize the execution plan effectively. By following this advice, you can write more efficient queries and even take advantage of his special discount code DARLINGY24 for Pass Summit, saving 150 bucks while enjoying some well-deserved drinks with Erik on his birthday!

Full Transcript

Erik Darling here with Darling Data. Gosh darn it. And in today’s video, we’re going to talk about two things. The main point of the video is the appropriate use of string split in SQL Server queries. The first thing, probably the most interesting thing, is I have a discount code for you for past Data Community Summit 2020, 2024. That is, of course, coming up this November in Seattle. Remember, one of those days in there is my birthday where you are contractually obligated to buy me a drink. Everyone at the conference is contractually obligated to buy me one drink on my birthday. And I am contractually obligated to drink that drink on my birthday. I’m going to die. But if you are on the fence or if your employer is cheaping out on you about coming to pass Data Community Summit this November 4th, through 8th, through 8th, in Seattle, Washington, I have a discount code for you. So if you use, and this will be in the show notes for the Darling Data Radio podcast for Data Darlings, if you use the discount code, and I hate this, Darlingy24, you can get $150 off the three-day conference ticket. That’s this thing. So if you’re bad at math like me, and you use SSMS as a calculator, the regular price is $1,895. With the discount, it’s $1,745. Now granted, that probably isn’t coming out of your pocket. Your boss or your company is probably paying for that. So if you want to save your boss or your company $150, they can… I don’t know what a company does with $150, to be honest with you. It’s always blown my mind that $150 would make a difference.

in a company budget. But anyway, if you want to get that discount, and when you sign up to go to Pass Data Summit, you can plug Darlingy24 in there and receive your $150 off. So that’s cool, I think. Anyway, let’s talk about the appropriate use of splitting strings in SQL Server queries. Now I’ve created a couple opportune indexes here on the post table and on the user’s comments table, sorry. And what I’m going to do is I’m going to show you two queries that hit these tables. Now this demo might look very familiar to you. If you remember my smash… I’m going to actually… I’m going to start running this because…

it runs for a little bit. So if you remember my New York Times best-selling YouTube video on table variables, splitting strings and using table variables have two very big things in common. SQL Server has no idea what’s in them. SQL Server has no idea what’s in your table variable. It might know how many rows are in there, maybe, under the right circumstances, but it has no idea what data makes up those rows. And you would have seen that in my video called, for the last time it doesn’t matter how many rows you put in a table variable.

And the same thing goes for a string. So this demo might look very familiar to anyone who is… who watched that video. But here’s what I’m doing. Wait, let me… I’m sorry, I’ve messed things up a little bit. I did not frame things appropriately here. I want to make sure… ZoomIt is very laggy on me today, I’m not sure why. ZoomIt is once again rebelling against me. So right up here, I am declaring some IDs, or I’m declaring an ID variable, not a table variable.

Right? Just in the… For the naysayers out there, it’s only a varchar8000, it’s not a varcharMax, not a varcharMaximum data type. So there’s nothing to freak out about there, just a varchar8000. Still a pretty long string, right? It’s a pretty long letter. Someone wrote 8000 bytes to me, I probably wouldn’t read half of it.

But then what I’m doing is… exactly like what I did in the table variable demo, to show you how… Sebel Server cannot do good cardinality estimation with table variables. It can also not do good cardinality estimation with split strings.

So I am doing exactly the same thing, where I am inserting the ID for Mr. John Skeet. At least the last time I checked, he was still a mister. I don’t know if that may have changed. It’s difficult to keep up with John Skeet. And then I’m also sticking in the top 99 nobodies.

Sorry if you’re in this group. The top 99 nobodies in the Stack Overflow database. These are users with a reputation of one who have created their accounts most recently. And we do not have high expectations of these new recruits, these cadets, because they have not had enough time to make an impact yet.

John Skeet has been around from the beginning, making big impacts, tyrannosaurus foot impacts on things. And so, you know, we don’t expect them to have much data. And so what I see clients do a lot is something like this, where they will extract values from a list of things.

And then they will either do like, you know, with an in or exists or a join. I’ve chosen a join in this case because it was the easiest way to convert the old demo to this one. Pure laziness on my part. Nothing funny going on, I promise.

But this is what I do, where I extract the value from the string split. And then I join that to the post table. And I join the post table to the comments table. And then down here, I show you how wrong of an idea this is.

This is definitely not a good thing to do. This is the bad way of doing things. This is the inappropriate use of string split in a SQL Server query.

It’s not how you want to do things, because you will end up with very bad performance. If I move my armpit a little bit, oh, that’s my hand too. You might see two minutes and 13 seconds under there.

And that two minutes and 13 seconds was largely eaten up by just this query here by itself. Right? Because this is the crappy way of doing things. If you do things in this way, my rates are reasonable.

You can hire me to fix these things. I will fix them all day long, happily. I’ll whistle. I’ll draw you doodles of things. I’ll bring flowers.

Maybe we’ll go on a picnic. I don’t know. We could do all sorts of stuff. But really what you want to be doing is something more like this, where you can split your strings into a temp table with the appropriate data type already assigned to it. One thing that it doesn’t…

So it can certainly make things messier, but it’s not like the biggest issue with the string split. And I’ll show you in a minute. But by default, the value column that comes out of string split is like some big envarkar column.

But SQL Server is able to like do a convert implicit on what comes out of the string split function if you’re using it like I am where you’re joining it off to something. And I’ll show you that in a minute in the query plan. But then if we put that into a temp table with the correct data type, SQL Server can do this a lot better.

Because two reasons. One, you know, again the correct data type, minor point there. But when you put that data into a temp table, SQL Server is then free to figure out what makes up that data.

You get a statistical histogram on the temp table, so SQL Server can do better cardinality estimation. This is almost the exact same thing that happened with the table variable. SQL Server doesn’t know what makes up the data in the table variable.

SQL Server also doesn’t know what values, what comma separated values make up the string that you’re splitting. It does not pause every time it extracts a value and say, oh, it’s this value, cardinality estimate. Or it doesn’t like extract, it doesn’t fully extract the values and say, oh, this is what I’ve got.

Let’s do cardinality estimation based on these values. It just says, uh, crappy guess. Got a crappy guess for you?

Want a crappy guess? Got a crappy guess. Got lots of crappy guesses. Handing them out, making it rain crappy guesses. So, if you’re gonna split on that string, what you want to do is not this, where you immediately join to it in your query. You want to do that and dump it into a temp table.

Right? And I think the proof here is in the query plans. Now, we return exactly the same result for both of these. That’s a big number and all, but if we look at what goes on here, right?

We have our initial string split, right? Or, sorry, this is our initial string assignment, right, up here. And this is quick, about 45 milliseconds in total.

And then down here, this is our first attempt at the count query. And this ends up performing just about as poorly as the table variable query for just about the same reasons. Now, I could have made this worse, but I chose to not make it worse.

I chose to make it a little bit easier because if I made it worse, it might have looked a little unrealistic. But, I mean, if you look what happens here. Oops, come on, zoom it.

This query takes 2 minutes and 12 seconds in total. And SQL Server guesses 50, but we get 100, right? So we have a 200% misguess here.

But then SQL Server’s estimates down here are also awful. We are off by 5, wow, that lined up really nicely. I didn’t do that on purpose.

If I tried to do that a million times, I would not be able to do that. That is a very nice… Look at this, look at the symmetry on that. God bless. Wow. Anyway, SQL Server makes a very bad guess here, right?

We get almost, we’re not even close to reality. We get roughly 5,804% wrong. But granted, since we have a good index, it happens pretty quick. The trouble is that, I mean, A, the guesses are off.

And then SQL Server has chosen this particular plan shape, where we just go from nested loop to nested loop. And you’ll notice that we spend quite a bit of time in here, where this estimate is off by some number that can only be conveyed with an ellipsis at the end. That is 81704…

I don’t know where that number ends. I do know, looking at this, that if we pay attention to these numbers right here, and again, I’m not good at math as a high school dropout. A public education received student.

But, I can tell you that 1-3-0-4-0-0-9-0-6-7, that is a 10 digit number, is a lot bigger than 1-5-9-6-0, a 5 digit number. How much bigger? You figure it out.

81704 something percent bigger. A lot bigger, right? Orders of magnitude. If anyone knows how to order a magnitude, please let me know. I’m willing to pay extra.

So, this thing, this whole thing takes 2 minutes and 12 seconds. Now, if we look at the rest of what happened in here, when we go, and we look at our ingenious method of, dumping data into a temp table from the string that we are splitting, right?

We dump this in here, and we look at the query plan down here. It’s going to look a lot like the query plan that we got when we used a temp table in the, for the last time, it doesn’t matter how many rows you put in a table variable video, where SQL Server makes slightly better guesses.

So, in here, SQL Server actually makes the right guess. It knows 100 rows are going to come out of here. It even goes to the trouble of sorting those 100 rows, so that we match the order of, if we had a clustered index on the IDs table, we wouldn’t need to do that, but that’s okay.

For this, I’ll forgive the zero second, the zero anythings that it takes to do that, right? And then we, granted, this guess is not fantastic, but you know, the default cardinality estimator is not fantastic, so we don’t get an impossibly good guess here, but we don’t necessarily, you know, the nested loops join thing is fine here, okay?

The big deal here is that SQL Server sticks a stream aggregate between one nested loop and the other, right? So we have this combination of query plan operators in this plan that we don’t have in this plan, right? Remember I said from this one, it just goes from nested loop to nested loop.

For this one, SQL Server says, you know what? I’ve thought about it, and I think we need to trim this down. And so SQL Server has to do far less work in here.

This goes from taking, let’s go look at that again. This goes, that took, ah, get out of the way tooltip. No one needs you.

We’re all hitting F4 and getting properties when we need to see that stuff. So this goes from taking one minute and 28 seconds to taking one millisecond. Now granted, this cardinality estimate is still off by a little bit, but this aggregate saves our buttocks in a very serious way.

And so this query working off the temp table only ends up taking 10 milliseconds. It’s a big difference, all right? So when you’re, I beg of you, if you are writing queries in SQL Server, and you are passing comma separated or anything separated lists, even if you’re passing XML or JSON or anything like that, and you need to relate that blob of string text to something in a query, in a joiner where clause, or a subquery, or an exists, or really anything relational, anything with relational meaning, please, I beg of you, do not do the string splitting in the query.

Take the XML or the JSON or the comma separate, anything separated list of things you’re passing, dump them into a temp table.

You can use any string splitting facility you want for that. You can use the built-in one that Microsoft gave you in 2016. You can use the one that you copied and pasted that’s a multi-statement table valued function with a while loop in it. You can take that one that you copied and pasted from the internet in 2002 and use that, and it will not profoundly affect your query because you’re dumping it into a temp table.

And then you’re going to use that temp table in your query instead of doing the string split in the query, and you’re going to look so smart in front of everybody. You’re going to look so smart that your boss will be like, Hey, you know what?

You get to go to pass summit this year, and you get to use Erik Darling’s discount code DARLINGY24 to do it, because we’re going to save 150 bucks, and I’m going to put 150 bucks cold hard cash in your hands to buy Erik Darling drinks for his birthday. That’s exactly how that’s going to go.

So, for the sake of all of us, so that you get to come to pass and celebrate my birthday with me, and you get 150 bucks cold hard cash from your boss to buy me drinks, you’re going to not put string split where it would matter in your queries.

You’re going to dump your split string results into a temp table, and then you’re going to use that temp table in your query. All right? Cool.

So, with that out of the way, thank you for watching. I hope you enjoyed yourselves. I hope you learned something about string splitting, the appropriate uses for it in SQL Server queries.

If you like this video, you know what I like? Thumbs ups. I like nice comments.

I like motivational comments. Those are cool, too. If you like this sort of SQL Server content, you are free to subscribe to this channel and join, let’s see, let’s get the most current tally on this, nearly 3,775 other data darlings out there by subscribing to the channel and getting a delightful notification, very helpful notification every time I publish one of these videos.

And you know what that gets you? 150 bucks off pass. And the pleasure of my company.

And, I don’t know, cool way to kill 15 minutes? I don’t know. Maybe you’re like, you’re in the bathroom, you got nothing else to do. You finish playing Wordle.

You finish playing Words with Friends. And you’ve gone through your RSS feed. And you’re just like, what am I going to do for 20 minutes? Earbuds in.

Watch my video. Soothing. It’s meditative, practically. Reduce blood pressure. Give you a good laugh. I don’t know.

Whatever. It’s a cure for what ails you, basically. It’s like drinking a Guinness in the morning. Right? The smart thing to do. Anyway. That’s enough for me.

I’m going to go back to work. Maybe go to the gym. Give this Adidas t-shirt a workout. Right? Get sweaty a little bit.

And I don’t know. I’m going to think about what to record next. Because I’m always thinking for you. Anyway. Thank you for watching.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.