Particular String
Video Summary
In this video, I dive into the intriguing world of collation in SQL Server, specifically focusing on a peculiar scenario that caught my attention. Using a lengthy string value to populate a temporary table, I demonstrate how different collations can lead to unexpected query plan behavior and performance issues, especially when creating or updating statistics. The video delves into the nuances of using non-default Windows collations with supplementary character support (case sensitive, accent sensitive, with the _SC flag) versus the default SQL_Latin1_General_CP1_CS_AS, highlighting how these differences can result in misleading execution times and query plans that don't accurately reflect the actual performance. Through this exploration, I aim to shed light on potential pitfalls for database administrators and developers who might be using non-standard collations without fully understanding their implications.
Full Transcript
Erik Darling here with Erik Darling Data. Still alive despite the best efforts of many, including so far the only person who the Customer Experience Improvement Program has actively tried to murder. Actually, there was an attempt on my life via the artificial intelligence built into the CEIP. It was vicious. Got into a knockdown, drag-out brawl with it. I'm lucky to be alive. Lucky to be alive. Anyway, today we're going to talk about collation. Because collation is weird. Collation is bizarre. I'm not going to pretend that I understand a lot about it, because I really don't. Flat out, I just don't understand a ton about it. But I do want to talk about something interesting that I found slash saw slash ran into. I don't know why I'm slashing so many things. It's a very violent video so far. The assassinations, slashings, it's terrible. It's like the 60s. Alright, so to get us set up for this, to repro sort of what was happening when I noticed this issue with a customer, we're going to grab a fairly decent length of string.
That length of string. That length of string is going to be a part of a query plan. We're not going to represent it as XML, though. We're just going to represent it as regular old string storage. But the important thing is that we have one row in this temp table that has 50,000 characters or so in it. And we're going to use that because it's a nice low number where I can still get things to demo well. And I don't have to come up with all different lengths in the table that I'm going to create to get this thing to repro. And in real life, the table that I was looking at had, you know, some columns with two or three characters, others with like 200,000 characters. So it was just a lot of swing. And I don't want to work that hard. To be honest, this is a free video. Only so much I'm going to do. I'm kidding. Go to the end of the video, people.
So what I'm going to do is use a collation that is not the database's default collation. I'm using this Latin1_General collation, case sensitive, accent sensitive, and with supplementary characters, to fill a table with just a whole bunch of that value over and over again. So the first 50,000 characters from that query plan, we're just going to stick them into a table. And then we're going to look at a couple different things that happened that I think are interesting. I mean, you might, you might not, I don't know. We'll see what happens.
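A minimal sketch of that setup, with hypothetical table and column names, and Latin1_General_100_CS_AS_SC standing in for the case sensitive, accent sensitive, supplementary character collation described here:

```sql
-- Hypothetical repro: a temp table with a non-default _SC collation,
-- filled with many copies of one long string. In the video, the string
-- is the first ~50,000 characters of a query plan; REPLICATE is only a
-- stand-in here to get the length right.
CREATE TABLE #t
(
    id integer IDENTITY PRIMARY KEY,
    long_string nvarchar(max)
        COLLATE Latin1_General_100_CS_AS_SC NOT NULL
);

DECLARE @s nvarchar(max) =
    REPLICATE(CONVERT(nvarchar(max), N'X'), 50000);

INSERT #t (long_string)
SELECT TOP (10000) @s
FROM sys.messages AS m
CROSS JOIN sys.messages AS m2;
```

The CONVERT to nvarchar(max) before REPLICATE matters; otherwise the result gets truncated at 4,000 characters.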
It's a weird day. It's the second day of June and it's strangely cold, but I'm sweating. Anyway, so we have that table all populated. And, I mean, it was sort of unimportant, but just so you see what I'm talking about, the regular collation that I would be using with this database is this one, SQL_Latin1_General_CP1_CS_AS, case sensitive, accent sensitive. The SQL_Latin1_General family is, I think, the default for most databases.
They do generally use the case insensitive, accent insensitive version of this. But a long time ago, a really mean guy made me do all of my script development on case sensitive instances, just in case they had to get deployed on case sensitive instances out in the wild. Because, you know, no one wants to deal with those errors.
So I have had case sensitive databases for the last six years now. For good reason. So let's take a quick look at why in some cases you might need a different collation for things.
So if we run this query, we get back the emoji that we select here. That's no big deal. That's almost to be expected in any collation that supports Unicode characters.
That's not that big a deal. But where things change a little bit is, notice that we're using this here collation with the supplementary characters flag. And what this changes is, if we need to work with this as a string, we get two different sort of weird results here.
The first result that we get back is a little question mark. It's a little who's-he, what's-it. I don't know who you are, what you are.
But for this one, we get back the emoji that we asked for. So there are some valid use cases for different collations outside of the regular old default one. Where this got interesting was when SQL Server needed to update statistics or create statistics on these columns.
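A sketch of that difference, using an emoji as the supplementary character (the exact character and collation names in the video may differ):

```sql
-- Supplementary characters are stored as surrogate pairs (two UTF-16
-- code units). Under a non-_SC collation, string functions count code
-- units, so taking the first "character" returns half a surrogate pair,
-- which renders as a question mark.
DECLARE @e nvarchar(10) = N'😀';

SELECT LEFT(@e COLLATE SQL_Latin1_General_CP1_CS_AS, 1) AS broken_half;

-- Under an _SC collation, the surrogate pair counts as one character,
-- so we get the whole emoji back.
SELECT LEFT(@e COLLATE Latin1_General_100_CS_AS_SC, 1) AS whole_emoji;
```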
So what I'm going to do is run everything on this page. We're going to look at what our session waits are at the start. We're going to set statistics time and IO on.
And then we're going to run this query where we're going to get the length and the count from the test table, grouped by the nvarchar column in that table. Set statistics time and IO off, and then check out the session wait statistics at the very end. So I'm going to get that running.
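Roughly, that script looks like this, with the names assumed from the hypothetical setup above:

```sql
-- Session waits before the test.
SELECT dsws.wait_type, dsws.wait_time_ms
FROM sys.dm_exec_session_wait_stats AS dsws
WHERE dsws.session_id = @@SPID;

SET STATISTICS TIME, IO ON;

-- The test query: string length and a count, grouped by the
-- nvarchar column with the non-default collation.
SELECT LEN(t.long_string) AS string_length, COUNT_BIG(*) AS records
FROM #t AS t
GROUP BY t.long_string;

SET STATISTICS TIME, IO OFF;

-- Session waits after the test, for comparison.
SELECT dsws.wait_type, dsws.wait_time_ms
FROM sys.dm_exec_session_wait_stats AS dsws
WHERE dsws.session_id = @@SPID;
```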
We're going to see that we do not have much going on here, but when we go look over here, this query has been running for about four and a half seconds. You can see it's still executing over here. Still chugging along.
And if you run this a couple of times, we'll see this still go and go and go. It's going to run for a little bit. But if you go look at the execution plan, we sort of have a funny looking query plan. Now, if you were to look at this while the query was executing, you might think to yourself: oh, this is the query plan of the query that I'm looking at.
Would you be right? Would you be wrong? I don’t know.
But you see this query is still executing. We’re almost 40 seconds in, right? And it’s still just going and going and going. What could it possibly be up to for 40 seconds? This does not look like that big, bad or gnarly of a query plan, right?
Looking at this still going 50 seconds in. And it’s finally done after 50 seconds. Here’s the wild part.
Keep an eye on what this query plan is. We scan the clustered index. We compute a scalar. We sort some data. And then we stream aggregate some data. And that looks like a pretty reasonable execution plan for a query that just does a SELECT LEN with a COUNT and a GROUP BY. You might see that sort of pattern everywhere, where you sort the data and aggregate it.
And then you’re done, right? But look what happens here. This is where things get very interesting.
Down in the bottom, which I’m going to have to maneuver this in some kind of weird way. Actually, you know what? Maybe I’ll put my video head somewhere else for this video. It’s not completely in the way.
But if we look down here, we can see that this query ran for 53 seconds on the wall clock. The wait stats before this thing ran were nothing, right? We had some MEMORY_ALLOCATION_EXT waits because, I don't know, we opened a window in SSMS.
Down here, we can see that this query did not generate 53 seconds worth of waits anywhere. Maybe four seconds there total. Actually, I don't even think there's three seconds there total.
And now if we go look at the messages tab, the time that SQL Server reports is 54 seconds there. But look at all this chuggalug up here. All right.
And now, the mystery, the plot is going to thicken. Like Texas chili. This is the query plan that we get back for the query itself, which is a completely different query plan from the one that we just saw in that other window.
We have a clustered index seek and then a hash match aggregate. And this thing says that it took 53 milliseconds. What’s interesting is if we come over here and we look at the plan XML.
And I know, I know, I’m sorry. I apologize. Looking at XML.
Again, another video looking at XML. But look at what is buried in here: STATMAN. We all know that means SQL Server was doing something with statistics for this query. Look at this execution plan again.
This execution plan was for creating the statistics. That by itself isn't a big deal, but here's why it's an interesting deal: SQL Server 2019 introduced a brand new wait stat called WAIT_ON_SYNC_STATISTICS_REFRESH. This thing fires off when synchronous stats updates kick off.
This does not kick off (which I believe I covered in another video or blog post) when statistics are first created. So it can be very, very misleading, to say the least, trying to figure out what the problem is in this situation, where you're creating statistics for the first time and waiting on it for 50 seconds.
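If you want to confirm what the slow part actually produced, you can check for the auto-created statistics and the 2019 wait stat afterwards; a sketch, again using the hypothetical table name from above:

```sql
-- The statistics that the long-running query quietly created
-- (temp table metadata lives in tempdb).
SELECT s.name, s.stats_id, s.auto_created
FROM tempdb.sys.stats AS s
WHERE s.object_id = OBJECT_ID(N'tempdb..#t');

-- The SQL Server 2019 wait stat only covers synchronous stats
-- *updates*, so the initial stats creation doesn't register here.
SELECT dows.wait_type, dows.wait_time_ms, dows.waiting_tasks_count
FROM sys.dm_os_wait_stats AS dows
WHERE dows.wait_type = N'WAIT_ON_SYNC_STATISTICS_REFRESH';
```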
I don't know, nothing really. But it's just a little bit odd that nothing in the query plan or anything in here gives us any indication that this query actually executed for that long doing what it does. If we look at the query time stats, we're still going to have 53 milliseconds right there.
And, you know, just to be clear, this doesn't happen under the SQL collation. This only happens under this weird Windows collation. If I were to rerun this test and use the SQL collation, the one the database defaults to, instead of the Windows collation, then this would execute just about instantly and not be misleading. So if you have anyone using the Latin collations here that don't have SQL in front, coming up with statistics on string columns can be a rather brutish experience.
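For the comparison, the setup only differs by the COLLATE clause; a sketch of the hypothetical control table, where stats creation finishes almost instantly:

```sql
-- Same shape as the repro table, but using the database's default
-- SQL collation instead of the Windows _SC collation.
CREATE TABLE #t_sql
(
    id integer IDENTITY PRIMARY KEY,
    long_string nvarchar(max)
        COLLATE SQL_Latin1_General_CP1_CS_AS NOT NULL
);
```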
I’m not saying that you don’t have a valid use case for them and that they’re bad. But I am saying that you might hit some really weird performance issues, either when stats get created on string columns using this collation, or when stats get updated on columns using this collation. And I think that’s just about that.
I’m going to go do something completely different now. I don’t know. I don’t know what yet.
Thanks for watching. I hope you learned something. I hope you enjoyed yourself. And if you see the customer experience improvement program, you tell them I’m coming for them. All right.
Maybe you don’t say that. I don’t think that’s legally defensible at this point. It’s on video. I should probably just stop talking now. All right.
I’m going to go call my lawyer. Maybe I can get YouTube censored. All right. Goodbye.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.
We hit this exact problem today, so it was great to have read about it previously and recently. In trying to research it further (and with the help of Bing Translate), I found the following article, which was useful for some extra details.
http://www.db-berater.de/2017/02/optimierung-von-like-suche/
Oh wow! What a mess. They never should have let strings in the database.