Learn T-SQL With Erik: Solving Problems With APPLY

Video Summary

In this video, I delve into the beginner section of my upcoming T-SQL course, focusing on a fundamental yet powerful feature: APPLY. Starting with an engaging introduction to APPLY as a unique blend between JOINs and subqueries, I walk through common issues encountered when trying to solve top-end-per-group problems using CTEs or derived joins. By demonstrating how APPLY elegantly handles these scenarios, I aim to demystify its syntax and application in real-world queries. Through practical examples, including the Stack Overflow database, I show how APPLY can be used to find the top three badges earned by users with a reputation over 750,000, illustrating why it’s such an essential tool for any SQL Server professional. The video also touches on alternative syntax and performance considerations, providing a comprehensive yet accessible overview of this powerful T-SQL feature.

Full Transcript

Erik Darling here with Darling Data, and we’re going to continue on with some of the preview content from my upcoming T-SQL course. I would like to thank everyone who has pre-purchased it so far. Your support, of course, means quite a bit to me. And my chief goal right now is to not disappoint you. Again, this content is going to be part of a larger paid course. If you would like to pre-purchase it, there is a link down in the video description for you to do that. The price is currently $250. That will be going up to $500 once the course is fully published. And this is companion content to the T-SQL pre-conferences that Kendra Little and I are doing at PASS Data Community Summit in Seattle this November. And so if you are attending those, don’t buy it because you will get access to the content. as part of your admission to those pre-conferences. So hold your horses for that. I mean, I’m going to tell you not to. If you decide to just because you’re like, Erik Darling, we love you. You’re worth $250. Cool, but you don’t have to. Save your money. Go drinking. Today’s video is going to be about apply. Now, this is the beginner section of the content. So there’s not going to be the advanced level of stuff that you would maybe normally find in content about apply that that is coming later. What we’re going to do today is solve a little problem with apply and sort of dissect how apply works. Now, we’ll also look at an alternative to the apply syntax because sometimes looking at alternative syntax helps people understand what we’re doing with apply a little bit better.

But apply is one of my favorite pieces of t-SQL syntax. Sort of generally. I was going to see if zoom it’s awake. It is wonderful. Apply is sort of like if a drive join in a subquery had a baby. It’s the beautiful bouncing baby apply. In other dialects of SQL, postgres, Oracle, some other fancy ones. The cross-apply syntax is fairly well equivalent to the lateral join syntax over there. Apparently, it was in the SQL standard, but it wasn’t fully codified in the SQL standard as the lateral join syntax until after Microsoft had added apply to SQL Server, So apply is not the like anti-SQL standard form of way of doing it, but it is the sort of equivalent idea. Now, apply does have two flavors. There is cross-apply, which is an inner join equivalent, and then there’s outer apply, which is an outer join equivalent.

And the problem that we’re going to solve today is probably the easiest one to help most folks grasp what apply is good at. And that is the sort of top end per group problem. What we’re going to look at is in the context of the stack overflow database is a few of the higher reputation users in the three most recent badges that they’ve earned for being really good at overflowing stacks. But in your work, you might look at this in other ways, like maybe the top three selling products per category, like bed and bath, home and garden, kitchen, sneakers, whatever product categories you have, you want to find the top three selling products for that. Things like that.

So we’re going to look at sort of some query patterns that don’t work for that and then how apply works for that. So if you were to try this with a CTE, and I’m not saying that it’s not doable with a CTE, just you can’t do it like this. And looking at this sort of help set up the apply syntax a little bit.

So if we were to run this query, which in here we’re looking, we’re getting the top three badges, you notice that this line right here is quoted out. And that’s because in the context of this CTE, the users table, the alias for the users table, the reference for the users table down here is not visible to it yet. We can’t access the users table from the CTE up here.

We can reference it after we join to the CTE down here, but it doesn’t get us the result that we want. So like if we just run this and like we see what the query produces, we just produce three random badges, right, for three random users. These are just the three most recent badges earned period, right?

So this doesn’t get us what we want. We want the top three badges for the users that qualify for our query down here. So this query produces no results, right? We don’t get anything back because no one with a reputation over 750,000 was in the top three badges that got earned period for the badges table, right?

This is just like the whole thing. These people weren’t in there. The other part of it that doesn’t work is if we tried to do a similar thing with a derived join. You notice I have this where clause similarly quoted out in here because like the users table, the reference to the users table, isn’t available inside of the derived join.

We can’t see it until after the join is complete and we do the on clause. You end up with the exact same problem in here, right? Selecting the top three most recent badges, none of those users have a reputation over 750,000, right?

So like this returns no results as well. This is why you can’t just go stick things like top into a CTE or a derived join. Like expect it to match the results of the other stuff, right?

Like it works for some things, but not for this particular task. So if we use apply, what we’re able to do is a see like that reference to the users table inside of the apply. So this does end up getting a little bit more spread out than it would normally be, but it’s only because I tried to put some helpful commentary in there.

Where inside we have the we select from the users table and then we take then we write our cross apply, right? So the cross apply is right after that. Like I said, the cross apply is like an inner join.

So it comes after that from and in here we can select the top three badges and we can make sure because we’re using top now would be the same story with offset fetch to right? Like you could just as easily use offset fetch instead of top here, just like offset zero rows fetch next three rows. But in here we’re using top because it’s faster to write.

No other reason. That most people in T-SQL will write top just sort of like muscle memory, right? Top, whatever.

So but with top we need order by and because this order by on the date column, the date column is not unique, right? We might have duplicates in here. We do need to have our unique tiebreaker here.

This ID column. The ID is the clustered primary key of the badges table. So this is our unique tiebreaker in case there are any duplicates in the date column, right? That’s a critical thing that you need when you are ordering by a date, when you’re ordering by a column that might have duplicates in it because it’s considered non-deterministic, right?

You get different results without that unique tiebreaker. But inside of the apply, we are able to see that outer reference to the users table here, right? We can see this inside of the apply, which means that we, when we run this query, we’ll actually get back the results that we want, which are the users with a rep So we have a total of over 750,000, right?

Let’s validate that looking in here. Everyone here, the lowest one is 754,947. And we get, as long as there are three badges, like they’re like, as long as they’ve earned three badges, you’ll see three badges for each person.

Now, this is the sort of interesting thing about apply is that, you know, you can say you want like the top however many, but if that however many don’t exist, you only get back like, the, you only get back however many results actually do exist. So like, let’s say we said top 10, but one person that only earned nine badges, we would only get nine rows for them.

We can’t invent a row to send back for them. And we’re not going to send back this a random duplicate row because that, that would be insane. Right?

So like you get the top however many are available, like for your, like to, until your top is either met or we run out of rows to find. So this give, this does give us the correct results, which is wonderful. So now let’s dissect a little bit more how apply works so that everyone can understand the majesty of this query.

Right? So, um, the other queries didn’t work because again, it’s sort of like you were saying, uh, select the top three badges of whatever, but like just changing this query to hit the users table and do a simple in to figure out who. Uh, like which users earned just the top three most recent badges ever in here.

You’ll see that we get back three kind of random people and none of them have a reputation even close to 750,000. Right? Maybe they do now.

Maybe they’re a fantastic, like just amazing stack overflow database users at this point. I don’t know. I don’t follow their careers. I’ll be honest with you. I don’t, don’t care that much, but like, like those three people were not in, did not meet that 7,000.

So we’re going to have a 750,000 reputation goal. So the way that apply works is sort of like if we take that, like the initial starting query, right? Like let’s say that we know we want to find like what, like our basic starting point is we want to select.

We want to select these columns from the users table. And all that we care about are users who have a reputation of over 750,000. And when we supply our results back to, you know, whatever application web front end, whatever, whatever you want to call it.

So we want those results ordered by reputation descending. And then for any, if there are any duplicates, the ID column descending, right? We need that unique tiebreaker to make sure the results come back in a sort of reliable, reliable order every single time.

These are, these are the people we get back. So these six rows. All right.

We get, of course, John Skeet. We get Balassie. I’m not sure who that is. We get Darren Dimitrov. Sounds very smart. We get Bon C. I’ve seen that name around. We get Hans Passant. Some sort of German French mass mashup there, perhaps, perhaps a Belgian.

I don’t know. You can’t tell. And then we have Marc Gravel right here rounding out the list with 754,947. So the, this is our starting query, and these are the results from our starting query.

The way apply works is sort of like for every row that your starting query produces, apply the result of the cross apply query to that row. So that’s why we get three rows back for John Skeet and three rows back for Balassie and so on down the list, because we take, we apply that top three query to the results of this query. And just to sort of go a little bit further with that, you know, again, like this is the, this is the query inside of the cross apply just without the correlation in here.

So what we need to say is for each user that qualifies for our where clause, apply this query result to that. Now you could absolutely do this with a derived join or a CTE. You just have to write the query a little bit differently.

Another common approach to this is to use the row number window function and to get a result, like generate a row number over that result and then filter on the row number after that result has been, after that result has been produced. So in here we’re saying where the, remember the alias for the row number inside of the derived join is n, right? That’s this thing right here.

So we’re saying where n is less than or equal to three. Now, just to kind of make this a little bit easier to understand, I’m just going to edit this query really quickly. And let’s grab, let’s look at this again.

And so these are our top six users. We’re just going to grab this number 22656. And then in here, we’re just going to change this for a minute where b.userid equals 22656. And we’re just going to run this result on its own because there’s no correlation inside of a derived join.

Again, you can’t access the stuff outside of a derived join within the derived join. So there’s no correlation to the user’s table inside of this. So we can just run this without an error and we can just see what the results are for one user, right?

So if we run this, we’re going to get back this result. And this is, these are going to be the top three most recent badges for this particular user ID, right? We had a good answer, a nice answer, and a great answer.

What a fantastic answer. And we have that ordered by date descending in here. And of course, we have this ordered by ID. If there were, I don’t think, no, I don’t see any duplicate dates in there, but, if there were any duplicate dates in here, then the ID would act as the tiebreaker to differentiate between those duplicate rows.

Like I said, you could, like with a duplicate, when you order by a column with duplicates in it, the result is considered non-deterministic. So what we, we get the same result back for this one. We just take on a slightly different query format with it.

So let’s get rid of this where clause, because we don’t need this anymore. That was purely a display model query. And let’s run this. And you’ll see that we get back the same thing.

Well, now we have this, this column in here, right? And one, two, three, one, two, three, over and over again. So I guess this could be useful if you wanted to quickly eyeball anyone who may not have had, that may not have had a top three badges in there.

Maybe someone who only earned two badges. They somehow accumulated that much reputation without ever doing anything really spectacular, I guess. But we, we get all the same sort of information.

We get the same sort of set of information back from that. So that’s a bit about how apply works. Hopefully it helps you understand the syntax for it and at least one application for apply that you can, where you can use it to solve interesting query riddles without breaking your back with and typing a whole lot more than you might if you start using windowing functions. There are some performance considerations around this stuff, but you know, there’s only so much time in the, in the preview material for me to spend on things.

So you’re just, you’re just going to have to wait for the full video on that, but you only get the full video if you buy the training. But anyway, thank you for watching. I hope you enjoyed yourselves.

I hope you learned something and I will see you in the next video where we will do some more of this fun T-SQL learning stuff. Anyway, thank you for watching. I’m out of here. Goodbye.

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.