STRING_AGG vs SQL Server’s Optimizer
Thanks for watching!
Video Summary
In this video, I delve into the intricacies of using `STRING_AGG` in SQL Server and its interactions with the query optimizer. I explore how `STRING_AGG` can lead to unexpected performance issues due to its design, particularly when dealing with large concatenated strings that exceed certain size limits. I demonstrate real-world examples where queries using `STRING_AGG` with a `VARCHAR(MAX)` data type perform significantly worse than those with smaller string types, highlighting the importance of being cautious about the length of the concatenated strings and their impact on query execution plans.
Full Transcript
Erik dishwaskeldarling here, Darling Data. We live to fight another day, don’t we? In today’s video, we’re going to be talking about StringAg versus the optimizer. Now, StringAg was a string aggregation function that was designed to replace all that sort of XML for path type value and VARCAR max stuff that people used to have to put into queries in order to create a list that aggregated strings, usually separated by commas or some other delimiter, or some other delimiter. I guess spaces would be equally valid there. And it’s got some funny optimizer repercussions sometimes, things that we don’t like to see. And it’s actually a very unfortunate side effect of the way StringAg was designed, where rather than Microsoft telling StringAg, hey, if this thing that we’re concatenating together, is big, we should just convert it to a big thing, right? We should just use, like, put a convert in StringAg for the values that we’re concatenating. Because if you’re, like, concatenating integers or something together, like, it’s not like you get, it’s not like you, the problem is, like, you can’t put an integer with a comma. The problem is that if you put a big enough string together, SQL Server’s like, whoa! Can’t do that. You need to convert it.
And then convert that to a string that we, a size that we understand. And so you end up having to write code that looks rather silly. I didn’t mean to exit out of that because, of course, we need to talk about this stuff before we talk about that stuff. You can become a member of my channel for four bucks a month. It’s a pretty good deal. If four bucks a month is, like, all your lunch money, well, like, comment, subscribe. It’s all good stuff in there. I am a SQL Server Consultant. That is how I make my money. Right now, even with the 25 very thoughtful, very generous people who have become members of my YouTube channel, that puts my monthly income from YouTube at roughly 123 pre-tax dollars per month that pays one of my cable bills. So that’s good. But, you know, rent is a much bigger deal. And that’s where the consulting end of my life tends to come in because for less than the cost of one core of Enterprise Edition, we can fix lots of problems.
And then you will have to buy fewer cores of Enterprise Edition or spend fewer monies on Microsoft or Azure’s incredibly overpriced cloud offerings. Screw them. Why give them all your money? Give me your money, then save money. All right. It’s a nice tradeoff. If you would like some very high quality, very low cost SQL Server performance tuning training, you can get all of mine for 150 bucks for life. That beats the pants off all the Black Friday deals you’re going to see. So you should buy that. You can either use that link in this discount code to do it or down in the video description, click on the link in there and it’ll do all that for you. So, upcoming events. Well, you know, I don’t have any dates right now. So if you would like to go on a date with me, you can tell me about your event and I’ll show up. Dress nice. Smell good. At least at the beginning. Don’t know about the end. Anyway, let’s look at this string ag nightmare.
So, this is the thing that happens whenever you have, whenever you write queries with string ag in them. This is where you run into stuff. Is that the, like, when you’re like, so ID is an integer, right? Just a four bytes, whatever. But when we create this thing, SQL Server does not make a big enough data type or something for the concatenated string. And it’s just like, whoa, whoa, whoa. We can’t do that. It throws a dumb error. And then you have to write convert varchar max whatever thing plus the comma thing. Notice that the convert varchar max is not around the comma at all. The comma space is just around that ID column.
So, this is where you can start running into dumb stuff, right? And really, the path of least resistance is, of course, to just say varchar max because who knows when you might have two something point whatever gigs of IDs to put into a string. It’s a really good use of SQL Server, right? That’s good use of licensing money there. Create a two gig string of integers. Great. Thanks. Thanks. We’re good with that. You could, of course, mess with things a little bit. And you can experiment with smaller values, like, say, varchar 500.
And this will, of course, get you around the error and, you know, give you the string that you want. So, the difference between these two things, and this is where you have to be careful, is that when you say max or where you say that the value is above a certain point, query execution times are way different. If you, sorry, not you, I’m not going to make you do any work here.
If we zoom in here, this top query took 17 seconds and this bottom query took 7 seconds. So, the bottom query is a full 10 seconds faster. Granted, there are things about both of these queries that we could fix a bit. But the big problem with the first query is there’s an additional operator in it.
See, there’s a filter here and a filter here. And the filters here are for the halving from the count, right? So, this is going to be a filter no matter what, because SQL Server has to run the query, come up with the counts, and then filter on the count. So, that can only be filtered out after the results are run, after the results are calculated at runtime.
The top query, where the string column is a varchar max, has an additional filter in it that is not down here, right? Like, there’s no thing between the sort and the compute scalar the way there is up here. And this filter is saying where post type ID equals 1. It’s kind of weird.
What’s this filter doing? Greater than 1. What’s this filter doing? Greater than 1. What’s this index scan doing? Absolutely nothing. Just scanning the whole table. There is no predicate in here. What’s this index seek doing? This is seeking to… Oh, gosh darn it.
This is going to be a tough one to frame up. I’ve got to move this over a little bit. There we go. We are seeking to where post type ID equals 1 here. The reason why this happens is because of optimizer costing.
When we do the string ag for a big string, and SQL Server computes a scalar, this is where we compute that big string for string ag. If you don’t believe me, well, that’s too bad. That’s exactly where it happens. For a big string, SQL Server is like, oh, I can’t push a predicate down past that.
For a small string, SQL Server is like, oh, yeah, I can do an index seek for that. No problem. I’m glad you asked. That’s a great idea. The thing is that there’s like a weird level.
Like, I’ve experimented with this number quite a bit. And, like, if I went up to, like, 700 or 800 or even 600, that filter would come back. If I went down a little bit, it would go away.
If I changed compatibility levels, the number at which this changed from having that additional filter with the index scan to having the index seek with no additional filter would change. Sometimes statistic sampling would have something to do with it.
There were all sorts of things that would make this number weird each and every time. So be very careful when you’re using string ag in queries. You might find that using varchar max or even a longer string type than is probably necessary for the convert up there, that you might find that SQL Server all of a sudden starts choosing really stupid execution plans.
You might find that bringing that number down helps quite a bit sometimes. And other times the optimizer is like, ah, changed my mind. Think it costs differently now.
Price has changed, right? It’s like the stocks. They go ups, they go downs. They ruin your chances of retirement. They give you a little bit of hope for retirement.
It never really says, yay, I’m going to retire. Anyway, we do have to be careful with these things. And this is honestly something that Microsoft should fix because there is nothing about this string or what the value of this string is that should prohibit the optimizer from being able to push a predicate down and seek into an index rather than scan an index.
There is nothing going on in there where that should be an optimizer limitation. And, you know, I guess somewhat thankfully I still see most people not use string ag and just use the XML version of it. I’m particularly fond of the XML version of it because, I don’t know, kind of an XML thing.
A little bit. You know, kind of my first SQL Server frenemy was XML and XQuery. So, got a little bit of a crush on that.
Anyway, be careful out there when you’re using string ag. Be prudent with this sort of syntax when, if you are, you know, using non-string values and all of a sudden you need to, you know, concatenate those into a list, whether they’re numbers, dates, bits, I guess. I mean, one thing.
That would be a choice. But, you know, whatever you’re doing in there. Be very prudent with the length of the string that you convert whatever column data to because, you know, you want to avoid the error and you want to avoid truncating things, but you also want to avoid situations where SQL Server no longer decides to push predicates to where they belong and just starts filtering all your data out and making all your queries take way longer than they should.
So, I hope you enjoyed yourselves. I hope you learned something. And I will see you, I think this video is scheduled to go out on a Friday.
So, I do hope and pray that everyone has a great weekend and enjoys themselves to the fullest. All right. We’re good.
Time to turn these lights off. Getting sweaty. Thank you. I love you. Goodbye.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.