Advanced String Searching In SQL Server

Advanced String Searching In SQL Server



Thanks for watching!

Video Summary

In this video, I dive into advanced string splitting techniques in C-SQL and SQL Server, focusing on scenarios where strings are delimited by various characters. I start with a basic example of using the `SUBSTRING` function to extract text between two delimiters, explaining the nuances of its three arguments—especially how the third argument works differently from other substring functions. Then, I move on to more complex cases involving multiple delimiters and special characters like asterisks, exclamation points, question marks, and dollar signs. To tackle these scenarios, I employ `CROSS APPLY` twice to find the first and second occurrences of each delimiter, demonstrating a practical solution for extracting text between them. This approach helps in creating a minimal viable product (MVP) that can be easily adapted to various string patterns. If you’re curious about how to handle such intricate string manipulations or just enjoy solving complex SQL queries, this video is definitely worth your time.

Full Transcript

Erik Darling here. I hope no one heard that weird hand fart. With Darling Data, recently voted by BareGut Magazine to have the YouTube channel with the most accidental hand farts, which was really, I think the good folks at BareGut Magazine are psychic. How else could they know? Anyway, today’s video, we’re going to talk about advanced string splitting in C-SQL. SQL Server, or really just one aspect of it. I can’t possibly cover very advanced aspects of string splitting in SQL Server because that would be a long video. And I’ve gotten complaints about videos that crested the 20 minute mark. And boy, oh boy, the attention span on you kids. If you feel like you need to stim a bit and step away from the computer, YouTube was kind enough to provide you with a pause button somewhere over there. Oh, my fingers gone. Somewhere over in that corner. So you can always hit pause and come back after you’ve hand flapped and gargled and done your fidget spinner or whatever. But anyway, we’re going to talk about you and me and how you can buy me a fidget spinner. For the low cost of $4 a month, you can sign up for a channel membership, which will get you all of these videos.

If you don’t have the $4 a month, if you feel that I am unworthy of fidget spinning, you can like and comment and even subscribe to the channel and join 4,700 other dated darlings who get notified every time one of these gorgeous videos drops. If you need SQL Server consulting, I am of course available not 24-7, but sometimes seven days a week, usually during the day, but not before like 8 a.m. and definitely not after like 6 p.m. That’s when I do other stuff. I have a family and all that who also require my time, though the pay on that is significantly lower.

Anyway, if you would like to watch me do this whenever you feel like it, 24 hours a day, seven days a week, you can get all of my performance tuning content for life for about $150 with the discount code SPRINGCLEANING. What’s nice about that is that you don’t have to worry about anything after you get it. You know, there are no time commitments.

You can go off and stim and spin and whatever, flap your arms, whatever wackadoodle stuff you need to do, and then come back and watch more of it. If you think 20 minutes is a long time, 24 hours is even longer. If you would like to catch me live and in person, I guess this would be a total of 16 hours and not in 20-minute chunks.

You can catch me and Kendra Little November 4th and 5th at Past Data Summit in Seattle doing SQL Server stuff, performance tuning, getting wild all day long, between the hours of 8 p.m. and 6 p.m. most likely. If there is an event nearby you and you think, boy, this handsome visage that stands before me sure would be a great accidental hand fart addition to that lineup, let me know what that is. Who knows? Maybe I’ll show up.

And with that out of the way, let’s talk about this string splitting nonsense. Now, before we get into the advanced stuff, I need to show you the proper way to get, let’s just say, a string between two delimiters. In this case, we’re going to have the same delimiter twice, but in real life, actually in the example that we’re going to look at below, we will have a variety of delimiters in slightly different circumstances.

So, first, I need to show you the proper way to do this. If you just have two delimiters and you’re like, give me whatever’s between them. Now, it doesn’t have to be colons.

It could be any two delimiters, right? It might be a period and the next period. It might be a period and a comma or a space and then something. Whatever it is. There’s all sorts of uses for this sort of stuff.

So, we need the substring function. And we need to talk about the three arguments of the substring function. All right? The first argument, because this is something a lot of people mess up. The first argument of the substring function is the string that you want to sub for.

Ah. Minus one family-friendly point. All right? The second argument of the string split is the position that you want to start your substring at.

In this case, it is the position of the first semicolon in the string. Right? That one up there.

Plus the length of that character. This is something that’s pretty important. Like the length, you need to add that onto the position of that so that you don’t include that. If you want to include it, you can.

But in this case, we don’t want that colon to show up. We want the space between the colons. I’m losing more family-friendly points as the further this goes on. The third argument is not the end position.

A lot of people think, because in some places, substring does function differently. But the third argument in SQL Server is not the position of the end. It is the number of bytes after the first thing that you want to include in the string.

So that’s where things get more complicated. Because the first one, we just need to get the car index of the colon in the text plus the length of that. And the second one, we need to use advanced car indexing to get the position of the first occurrence in the string after.

Right? There’s a third argument for car index. And we’re going to start it at the position of the first colon in the string, of course, plus the length of the colon.

Damn. This is not going well. Then we have to do some other stuff.

We have to subtract the length of the delimiter and the length of the car index of the first occurrence. And that will give us the string between the two things. Now, I’m using the sys.messages table.

And I’m only looking for rows that have two colons occur in them. It’s medically improbable, but there it is. And this is just to make it a little bit easier.

Because if we didn’t have this, we would need all sorts of case expressions or other protections for the substring function to make sure that we don’t throw an error if we give an invalid length to the substring function. So that’s why that’s there.

But if we run this query, we’re going to get back the actual text of the message. And we’ll be able to verify in a few different places that this is correct. Right?

So let’s just take this one as an easy example. It’s from colon space percent d to colon. And that’s what we get right there in the parse string. There’s another good example a little bit further down that’s really easy to show in there.

I forget exactly where it is, but we’ll just look at this one. There’s two colons in that one. That one’s a little weird.

I don’t know. You get the point. It worked. Right? Actually, this is the… Actually, no. That one’s not so good. These ones are good. These ones are easy. So here we have is page percent d.

And that’s exactly what we get back in there. Some people would throw like a L trim, R trim, or a trim on this to get rid of spaces around that. I’m not that fancy.

So we’re just going to leave that in there. But anyway, the whole point is this all works. Right? This all works just fine. The situation I had to deal with was I needed to find the space between a variety of delimiters. And I needed to find the first occurrence of each one.

So what I’m going to do is show you a little bit about what the strings looked like for me. This is not, of course, exact. This is just sort of what…

This is enough to make a simple MVP. It’s minimal viable product or MVC, MCVE, complete example, minimal viable complete example or whatever they call it, where the strings were kind of weird. Some of them started with a number and then some of them started with a character of some sort or special character, not like a letter.

But they were all sort of set up like this where I needed to find the space between the first weird thing and the next weird thing. And I knew what all the weird things were. They were asterisks in this case, exclamation points, question marks, and dollar signs.

Right? So these are all the weird things that I had to find. And I had to find the first occurrence of whatever came first in the string and then the second occurrence of whatever came next in the string. And that required some serious brain time from me.

Now, just because I don’t remember what I actually did from this, I’m going to rerun all this. And I’m going to create these two tables and populate them. And then I’m going to show you that this table has one row for each instance of the special characters I had to find.

I didn’t specifically need this. It just made writing the query a little bit easier. So the substring is going to do exactly what we did up there.

It just looks a lot cleaner because all I have to do is put the columns in here and operate on the columns rather than have to generate all the expressions in the select and operate on all the expressions. Because that turns into a real confusing time with all the functions, sub-function stuff in there. T-SQL doesn’t make it easy to nest these things.

But what I did was I used cross-apply twice. The first cross-apply will go and find the top one. And it will get the top one from this query.

And what this query does is find the search position. So it finds the car index of the search string and the string that we care about. And then it looks for where search position is greater than zero because this actually helped me avoid a lot of errors.

And, you know, it was better this way. And then we order by the earliest search position. So search position ascending, so the earliest search position.

Then, in the second cross-apply, we’re actually going to reuse elements that project out of the first cross-apply. So note that this one is called x1, right? And if we come down here and we zoom in on this, in this one we’re going to do, this is like the second argument of substring.

We’re going to search from the search position of the search element that we care about and the string that we care about at the starting position of the x1 search position. Right? So, and then down here, rather than filter on greater than zero, we’re going to say where the position in this one is greater than the search position that we found for the first one.

Right? And then we’re going to order that by this. And now if we run all this, what we’ll get back is exactly what we should see.

Right? So just to highlight this string a little bit, the first weird character was a dollar sign. The second weird character was an asterisk.

So, and then the substring between those two was the number 23. And that holds up for these as well, where the first weird character was an asterisk. Sorry.

The first one, the first weird character was a dollar sign. The second one, the first weird character is an asterisk. That’s in the first position. The second one was an exclamation point. And the fourth position in the substring between asterisk and exclamation point was the number 12. So this works for all of these pretty well.

Granted, it might not be like the most explosively well-performing code in the world. If you have very, very large data sets, of course, indexing for this stuff does help a bit. But once you get into the realm of like searching through strings and stuff, a lot of performance stuff can happen that isn’t easily controlled by you or indexes or really anything else.

It’s all very fuzzy string stuff. As I’ve said before, strings were a mistake. Should not be in databases.

Everyone should have just learned binary and learned how to read binary representations of strings. I’m kidding. You shouldn’t actually ever have to do that. But anyway, I hope you enjoyed yourselves. I hope you learned something.

I hope that you find sort of weird query stuff like this as fun as I do to solve and write and figure out how to get it to work. I am not the sharpest knife in the drawer. So actually writing this code took me quite a bit of like, you know, head on keyboard moments.

But once it got there, boy, was I proud of me. I was like, I’m a big boy now. I don’t need a diaper anymore yet.

I mean, I might need one again. Who knows where the weekend will take me. But, you know, weird stuff happens between, in the delimited between Friday and Monday, weird stuff happens in there. Hopefully not with my colon, though.

Anyway, that’s probably enough there. Thank you for watching. Goodbye. Please don’t report this video.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.