How To Write SQL Server Queries Correctly: Joins With OR Clauses Part 2

How To Write SQL Server Queries Correctly: Joins With OR Clauses Part 2



Thanks for watching!

Video Summary

In this video, I dive into the intricacies of joins with OR clauses in SQL Server, specifically using examples from Stack Overflow’s database design to illustrate common pitfalls and suboptimal query plans. We explore how these join conditions can lead to inefficient execution plans, such as constant scans and nested loops joins, which often result in unnecessary row counts and sorting operations. I also discuss potential solutions, including query rewrites that leverage EXISTS clauses or CROSS APPLY operators, demonstrating how these changes can significantly improve performance. By the end of this video, you’ll understand why normalization is crucial for avoiding these issues and learn techniques to optimize your queries when OR clauses are necessary.

Full Transcript

Erik Darling here with Darling Data. All excited to talk more about joins with OR clauses. It’s going to be a real grand day for all of us here. This is, of course, part two of the joins with OR clauses videos. The first one, we talked about some simple stuff. It’s kind of set up, you know, the things that SQL Server can and can’t do when there are OR clauses in the table with the simple join. Now we’re going to talk about more complicated stuff and, of course, how you can fix the more complicated stuff. And, of course, because I like talking about it, we’re going to talk about shortcomings in SQL Server’s Query Optimizer. Because it’s got those. Lord knows. Anyway, before we do that, if you would like to slap four bucks a month on my rent check, you can click the link down in the video description below and you can do that. Maybe it’ll make you feel good.

You know, or maybe not. I don’t know. Maybe, hopefully, you just forget about it for a while. That’s what we’re all hoping for. If four bucks a month is not what you have in mind for me, you can like, you can comment, you can subscribe. All sorts of great things. If you are in need of consulting for your SQL Server, all of these things, I am the best in the world at them.

Anyone who tells you otherwise is a dirty, rotten liar. And as always, my rates are reasonable. Training. I’ve got it. 24 or so hours of it. Beginner, intermediate, expert. All of those things. You can get it for about 150 US dollars. That’s the link. That’s the coupon code.

And they are helpfully glued together for you also in the video description. Lucky, lucky you. No more, no upcoming events until 2025, probably later in the year. But for now, we can just go talk about what we came here to talk about, which is joins with or clauses. So joins with or clauses are one of my biggest pet peeves because the optimizer does have space to do this better.

It just isn’t there, right? It just doesn’t do it. There’s just no help for you when a join has an or clause in it. There’s not like a special thing you have to do or there’s not a hint. There’s not like a trace flag. All there is is query rewrites. And those query rewrites are annoying because SQL Server’s optimizer could figure it out.

Just doesn’t. It’s lazy. Very lazy, lazy optimizer sometimes. So I’ve created some very generous indexes on the votes table and two on the post table. And what we’re going to look at is the query pattern here.

Now, it is a little bit confusing when the first time you see it, because it is, well, I mean, not the most straightforward thing in the world. But I probably should have started this running before I started talking too much, right? OK, seven seconds. Not too bad.

But here’s the execution plan. And it is one of it’s this is one of those query plan patterns that I try to train people to look for, because this query plan pattern is always a sign that something bad is happening in your query. And that something bad is going to be a join with an or clause.

So we’re doing I mean, it’s a little bit of a funny query here, but it’s one that you have to do with the Stack Overflow database. Because, you know, everyone who gives credit to Stack Overflow for like when it first started being like the cool, smart, like, whoa, rogue band of geniuses don’t like they’re doing everything on the cutting edge of cool and fun. They were crap at database design.

Crappy, just awful. Should have hired someone who knew what who knew what they were doing like this much because the post table has all the posts in it. It has all the questions. It has all the answers. It has all the everything else.

And what’s what’s really crappy. Oh, my God, is like the version of the post table that you see in like the Stack Overflow data explorer and the version of the post table that you see, if you download the data dump, is not nearly reality. If you saw that, if you saw the actual table in there, it is so much wider and more denormalized and awful.

It would it would it would make you even if you know nothing about normalization, it would make your head spin. But we have to join the post table to itself because questions. Right. Those are these things. When you join them, you need to join the post table to itself to find which questions have accepted answers.

Great. Good. Good plan. Self joins. Hey, I like I like those sometimes. Sometimes you do have to you do have to join yourself. But then when we join to the votes table, what we’re saying is where we need where the the either the question ID equals the post ID in the votes table or the accepted answer ID equals the post ID in the votes table.

So we’re looking to get votes on both the question. That’s the ID and the answer, which is the accepted answer ID. Right. And we I don’t know through this. I forget why I threw this on there. I think it tidied things up a little bit. Maybe maybe maybe I didn’t put that on there in seven seconds or in like 70 seconds.

But then we’re looking for where p dot post type ID equals two. Right. And we’re looking for where p two dot accepted answer ID is greater than zero. So we’ve done everything that we we can there. Now, what makes this challenging or rather like.

What the optimizer could do is something like this. It could unroll the join from being this set of conditions to being this set of conditions. Right. Like it could do that because it is fairly equivalent. It could also do this. But it kind of doesn’t.

It could be a good answer. So what’s really interesting, but sort of difficult to describe well to the optimizer is that any given question can only have one accepted answer. So there’s no many to many relationship with that join.

So these are things that the optimizer could do, but it doesn’t. And this is the query plan pattern that I need to warn you about. And this is the query plan pattern that I want to show you is this is what happens.

So we do two seeks into the post table. Right. This finds this does our join. Right. This finds our questions or rather sends our answers with the questions of accepted answers and all that other good stuff. And then this is the join to the votes table. Now, what I want you to keep an eye on is the number of rows that come out of this three eight one eight four nine seven.

OK, remember that number because you’re going to see it again. And right now way over here in the query plan, this is this is most commonly what you will see in a query plan that has a bad join with an or clause in it, which means any join with an or clause. You’re going to see a constant scan.

With three eight one eight four nine seven rows come out of it and then another constant scan with three eight one eight four nine seven rows come out of it. That is the results of this twice. Right. So we have one set for the P dot ID column in the post table and one set for the accepted answer ID column in the post table. Right.

We have that many rows for both after we join those two tables together. That’s how many rows are produced. So each one of those columns gets produced by this by these constant scan operators. Notice these don’t touch a table. These sort of come out of nowhere, but they they really get fueled from here.

So then SQL Server says I have to put those two results together. It essentially does like a union union all sorry. So this number doubles right.

At least my math is correct. And then it sorts all that data. It puts all of those rows in order. So that it can merge duplicates. And it note the estimates in here are all pretty wonky.

Right. Of one of one of two of two of two. So we’re SQL Server. Expects two rows to come across all this stuff.

It we get way more than that. Now, I think like Microsoft could start a little bit earlier and like doing all this stuff by just like trying to get better estimates out of this part of the plan, because then maybe it wouldn’t choose a nested loops join for this part of the plan.

It’s also entirely possible that this is an optimizer limitation and that SQL Server could not possibly do any other kind of join here. But then we seek into the votes table and we we seek on this expression one oh two four. Right. Or sorry.

We have a between thing. Right. Because we merge the interval. So we go and find greater than expression one oh two three and less than expression one oh two four. That’s going to be the the start and end points of the constant scan results after we’ve merged the intervals in.

So we every time we loop in, we go and we find those that range of columns. Now, it’s bad enough that we have like this fake result set. Right. From the from the constant scans that we loop and find stuff in here.

But then like there’s an sort of this other nested loops join here that also like takes this stuff comes out of here and goes through all this. Right. So it’s like this loop, like even though it starts here, it kind of like really starts over here. So this all ends up taking about eight seconds or seven and a half seconds, which is pretty slow for a query.

And like I said, SQL Server could unroll all this stuff and do it differently. Now, it’s what we can’t. It’s hard to describe this to the optimizer without a lot of really difficult constraining that every question, if it has an accepted answer, can only have one accepted answer.

If we were to use the parent ID column instead of accepted answer, it would be a slightly different story because one question can have many answers, but only one of them can be the answer. So this is a great reason to normalize your data. This is one place where the Stack Overflow data design team failed.

Right. Off like just big thumbs down there was every question and every post are in the same post table. I do have some stuff about what happens if you split those out and I do have some other stuff about what you do if you put the body column on another table. But I don’t know. I feel like that’s worth money.

So that might not go on YouTube. Now, part of the reason why this optimization space is difficult is because you would have to you would have to unroll the query to look like one of these two things. Right. Where bounty amount is no and exists.

Select this on this. But this will give you identical results to what you got before. But without all that crazy stuff going on in it, this finishes in about 207 milliseconds.

And that’s, you know, arguably a lot better. Right. And this is this filter is a startup expression predicate. So this is not the awful kind of filter where all the rows come out and then try to pass the filter.

This filter only lets rows passed if if if if if anything meets that criteria. So that’s a nice thing in there. Right. This is all good stuff. SQL Server handles this really well.

And because the where clause in here is really only for one set one set of things, it turns out a lot better. Now, you could also rewrite that in this way, which is even more complicated, where you could use sort of a double exists with a union all. Right. So rather than write this or clause, you could write this in and you could get good results back from this one.

This query plan, it’s a little bit different. It takes about 385 milliseconds. There’s one filter at the end here, which I am not crazy about.

Where this is not null or this is not null. Usually this is the kind of stuff that I worry about in query plans. But in this case, it’s OK.

We get a couple adaptive joins over here between the post table. Well, sorry, between the votes table and the post table. Right. So that all turns out OK. And then if we needed columns from the inner part.

So like we talked about exists and not exists and joins and all that stuff. One thing that I talked to you about was that when you use exists, you can’t project columns out of it. So if this were a more complicated query, you know, like like for a lot of queries, I use count big just to make sure that I’m getting like the right number of rows back.

I just like basic starting point. Are these results correct? One one thing that, you know, like normal queries, people want to see data, right? They don’t just want to count.

They’re not just like, oh, well, yeah, just if you just show me counts, I’m happy. They want to see like the actual stuff in there. So if you need to rewrite a query like this and you need to project rows out, cross supply is a really handy way of getting the same performance. But but you can project columns out of the cross supply.

I don’t know why there’s a space in here. That’s that’s clearly clearly someone is sabotaging me. But if we write the query like this with the cross supply, it’s also pretty quick. 345 milliseconds.

And if we needed columns to come out of the this part of the query, we could get them. So if we like so like you can see, like we count the P dot ID column that comes out of the apply in here, which is union all in here. So we’re counting.

We’re able to reference the P dot ID column there, where if we try to do this up here, we couldn’t. So are these queries bigger and longer and more complicated? Yes.

But sometimes that’s what you that’s what you need to. That’s the kind of query you need to write for the optimizer to fully understand what you’re doing. It’s so, you know, one thing that I say quite frequently with clients and in training and to just random passerby on the street is anything you do that makes your job easier, make SQL servers job harder. Sometimes things that you do that make your life easier.

SQL servers like, oh, yeah, well, I’m going to make my life easy to take this query plan. When you start writing things out more verbosely, sometimes you can give SQL Server a better understanding of what you need to do, what data you need, what data you’re trying to touch. And SQL Server can come up with better plans based on that.

Shorter, shorter, simpler queries are not always faster queries. You know, there are a lot of people who seem to feel that way and they write these tiny little compact queries there. Some of them just on a single line, very long lines.

But like writing simple, compact queries is good for simple, compact logic. When your logic gets very complicated, sometimes very short queries that just, you know, try to compact everything very small. It’s not cold golf.

Sometimes the optimizer can’t do a lot with those. Sometimes you have to be you have to write out more specifically what you’re trying to do. And sometimes your queries might look like this. Sometimes your queries might have extra touches of tables in them.

But you can’t like don’t be afraid of trying this stuff because you might find queries perform much better when you are much more explicit to SQL Server about what you’re after. So don’t just stop at simple, complex query because you’re happy with simple, complex query. Get just the performance.

Look at the query plan. See what happens when you spread things out a little bit. Give SQL Server a little room to think and breathe. You don’t have to smother it all the time with these weird queries. Now, one thing that I’ve talked about a few times, maybe not in this video, but in the last video, before I decided to split this up into two videos, that’s why there might be something up in there that seems a little weird, but is case expressions and joins and where clauses.

We’re going to talk about that next. That is why this is 12 and the little tab next to it says 13 case expressions. So we’re going to talk about those next.

And we’re also going to look at some rewrites for those. Some of those look similar to these other than others of them look similar for two other things that we’ve talked about in the series. But it helps reinforce that you’re not writing SQL Server queries correctly and you need to start.

So, sorry to break it to you, but you know what? At least it’s not terminal. Just temporary.

It can be cured. Don’t worry. Without surgery, hopefully. Yeah, hammer. Alright, cool. Thank you for watching. I hope you enjoyed yourselves. I hope you learned something and I will see you over in the next video about case expressions.

And you’re going to learn a lot. So watch it and don’t skip stuff. You heard?

Alright, goodbye.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

How To Write SQL Server Queries Correctly: Joins With OR Clauses Part 1

How To Write SQL Server Queries Correctly: Joins With OR Clauses Part 1



Thanks for watching!

Video Summary

In this video, I delve into the intricacies of joins with OR clauses in SQL Server, demonstrating how they can sometimes lead to unexpected behavior and performance issues. Starting with a simple inner join setup between comments and posts tables, I illustrate how SQL Server infers join conditions from the WHERE clause, making the query efficient when both columns are involved. However, introducing an OR condition that splits the join across different tables causes SQL Server to struggle, resulting in slower execution plans. To address this, I show a workaround by adjusting the OR clause to involve only one table, allowing SQL Server to seek into indexes more effectively and improve performance. This video is part one of a series on joins with OR clauses, setting up the groundwork for exploring more complex join scenarios like left joins in future parts.

Full Transcript

Erik Darling here with Darling Data. Got a barbell next to a dumbbell. Anyway, today’s video we’re going to talk about joins with or clauses. I’m actually going to do this in two pieces. The first piece just kind of sets it up a little bit and gets some of the easier stuff out of the way. And then the second part we’ll go into more detail. And I tried to do it in one and it just went on too long. Even for me, I actually apologized to my own self for my length for once. So that was a fun change of pace. If you would like to support this channel by giving a dollar sign a finger mustache, you can click the link down in the video description. For as little as $4 a month, you can support me by giving a dollar sign a finger mustache. You can click the link down in the video description for as little as $4 a month. And I’ll see you next time.

Training. You probably need it. If you’re here, you definitely need it. You’re probably not getting enough of it. It’s like vitamins. You can get all of mine for about $150 by using that coupon code at that website up there. That’s about it. There’s also a link for that down in the video description. It’s very good stuff. Paul White Tech reviewed it. So I don’t know. Maybe that’ll convince you. No upcoming events until 2025. Ba-ba-ba-ba-ba-ba. With that out of the way, let’s talk about joins with OR clauses. So SQL Server has funny ways of handling OR expressions sometimes. Sometimes it can infer certain things depending on how you set your query up. And other times it cannot.

So I think a good way to start this off is by showing you this query. The top one is just there to get a quick result back. But we have a join between comments and posts. And the join is on p.ownerUserId equals c.userId. And we have a where clause to say where ownerUserId equals 22656. If we run this query, SQL Server is able to infer that we meant that we only want to see userId 22656 or we only care about ownerUserId 22656.

Because what happens is SQL Server tells us we don’t have a join predicate. We get this stupid little warning right here. And it says, yeah, guess what? No join predicate. Shame on you. You’re an idiot. The number of times I’ve seen this in a query plan and been like, huh, do we really not have a join predicate?

And been like, oh, no, no, no, we totally have a join predicate is pretty high at this point. Microsoft summer interns. What would you do without them? But if we look a little bit more closely at the plan, we will see a predicate on this side of the join that says where ownerUserId equals 22656.

And we will see a predicate on this side of the join that says where userId equals 22656. And so all of the results that come back from the post table and from the comments table have a join condition or rather have a value in their join column that is the same. So there is actually no need to do anything with that.

Right. We don’t need to do any additional join conditioning. We have met the condition of our join via two when right when we touch the indexes. So that’s where SQL Server is kind of good at stuff.

You know, we touch the indexes on both sides. We do the filtering right there. And then everything that we need to mush together is the same value. They’re all it’s two sets of 22656.

And there’s like, what do you join there? It’s already joined for you. We have we have implied the join via the where clause. Where the optimizer is somewhat less good at figuring these things out is when, of course, when you get separate columns involved.

Right. So if we look at this query. Right.

And I have these quoted out for the time being for dramatic reasons. We’re joining comments to votes on c.postid equals v.postid. And we’re going to say where c.postid equals 838 or v.userid equals 831.

Right. And if we run this query. Do do.

It takes a few seconds. We get 42 rows back. SQL Server has chosen the most adaptive of joins. We scan here for about a second and a half. We scan here for about half a second.

And in total, we take about 2.6 seconds to figure this all out. Now, if we quote out these four sequence and we rerun this, SQL Server says, there is no way we could do that. It is impossible.

Why have you tried to do this to me? You are trying to bend the rules of space time. You have divided by zero. You are not fit for this planet.

You think you’ve seen too many Doctor Strange movies. But this is kind of a funny thing because when you look at what SQL Server does, look at the execution plan. And then the entire query is evaluated here at the join.

We have this whole or clause thing in here. Now, what we can do is, since we know something about our data that SQL Server doesn’t, is that if we run this and we say where we flip that so that we say where both of these columns come from the votes table. Right?

Right? Now we’re saying where V is this on both. We get the same results back, right? This is 41. 41. Da, da, da, da, da, da, da, da, da, da, da, da. And this is 41. But much more, a much quick, much more quick 41.

Now SQL Server is able to seek into two indexes on the votes table and a third index on the comments table. And we are able to get back our results. The only change I’m making is that I’m using the post ID column from the votes table instead of the post ID comment from the comments table.

Right? So before it was c.post ID and v.user ID. Now it’s v.post ID and v.user ID.

So when you comment on or you vote on something, the post ID is logged. Right? You’re going to, since we’re joining on post ID, right, it’s going to be the same for both.

We didn’t change the v.user ID. If the v.user ID were the problem, that would not give us correct results. Or rather, that would pose a different set of issues.

Sort of like what we talked about with the where or clause stuff in the last one. But if we just change v.post ID to 138, then from c.post ID equals 138, then we’re getting all the same stuff back because we’re joining on post ID. Right?

That’s all we care about. But for some reason, this or clause getting slightly different tables involved really screws SQL Server up. So when you’re writing queries, these are the little things that you need to tweak and experiment with to make sure that you are not missing anything obvious.

A lot of times when I talk about query complexity, people start picturing that big, you know, monolithic query that spans 17 scrolls of SS through SSMS. And if you make the font in SQL Server Management Studio like 4%, it’s still just like it doesn’t fit on one screen. That, of course, is a different brand of hell.

But complexity in queries can arise in many small interesting places. You know, in clauses, or clauses. Like anything that adds a little bit of drama to your queries can certainly make things somewhat strange in there.

So that’s the set things up. Now, this was fairly simple because this was an inner join with just a where clause that had an or for just, right, a simple inner join, right? Just one table to another.

The where clause was only on a couple columns with one or condition. What we’re going to look at next is more complicated join scenarios with left joins and stuff. So we have that to look forward to.

Anyway, I hope you enjoyed yourselves. I hope you learned something. And I will see you in part two of Joins with Orr clauses when we will get deeper into the subject. And I will show you more awful things that can happen and how you can fix them.

Because what’s the point in just showing you bad things that can happen if I don’t show you how to fix the bad things? It doesn’t seem like a very, it doesn’t seem like very good training if all I do is tell you that bad things happen. It sounds like, it sounds like other training I’ve seen.

Bad things happen. How do you fix it? It’s hard. Yeah, you know, life’s hard. Screw it.

All right. Thanks for watching.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

How To Write SQL Server Queries Correctly: Where Clauses With OR In Them

How To Write SQL Server Queries Correctly: Where Clauses With OR In Them



Thanks for watching!

Video Summary

In this video, I delve into optimizing queries that use OR clauses across columns, particularly in scenarios involving outer joins and complex WHERE conditions. I explore how these types of queries can sometimes struggle to leverage indexes effectively, leading to suboptimal performance. By rewriting the query to break it down into more manageable parts, we can better guide SQL Server to utilize the appropriate indexes for seeking rather than scanning, thus improving overall query efficiency. Through practical examples and detailed explanations, I demonstrate how strategic indexing and query restructuring can significantly enhance performance in these challenging cases.

Full Transcript

Erik Darling here with Darling Data. And we’re going to talk about something quite interesting today. Now, we’re going to, of course, carrying on, soldiering on with our series on how to write queries correctly. And I did promise some tidbits, morsels, tiny, sprinkled little nuggets of Intel timing out looking for drivers. God, that’s depressing. It just does that randomly. I don’t understand. I wish Intel would just get out of computers. It was a good run. Intel and Boeing just get together and go do something else. Enough lives have been lost. So I did promise some stuff about performance and we’re going to cover that today because this is not precisely about how to write queries correctly. This is about indexing for a specific type of query and how you can rewrite that query in a slightly different way to maybe better understand why your indexes are not being used perhaps the way you think they should be. So we’re, of course, going to be talking about where clauses with or operators in them. Shocking, I know. Who would ever be indecisive about what data they’re returning in their query? It’s hard to know. The level of indecision, I have.

I see in many SQL Server queries. Why don’t you just say select star from table. The number of OR conditions that you have applied to this query makes me wary that you might just be doing a secret select star. Anyway. So let’s go on with that. But before we get into the meat, the beef of this subject, if you would like to become a member and support this channel with $4, as little as $4 a month, you are, of course, able to do that by clicking on the little become a member link in the video description. If $4 a month would mean that you would starve cold and alone in the streets, like these poor UK pensioners who have had their heating allowance pulled from their hands, you can like, you can comment, you can subscribe.

Those are all wonderful ways to keep me warm all winter long. If you need help with your SQL Server, if you’re out there in the world saying, God, I wish I knew why this SQL Server was so slow, or I wish someone would come and make this SQL Server faster for me. Well, I do all of those things. I can tell you why it’s slow, I can make it faster, and all of that at a reasonable rate. Ha! It’s wonderful.

If you would like some very high quality, incredibly high quality, probably the highest quality training you can possibly get at a very low price, probably the lowest price you can possibly get without pirating it, You can get all of my training for about $150 USD for the remainder of your years on this planet Earth. If you go to that link and put in that coupon code, there is also a link that assembles all of these details for you in the video description.

It’s your lucky day. No upcoming events, because end of the year, yada yada yada, 2025, we will reconsider our stance on no new events. For now, I just wish to be warm and snuggly with a bottle of coat roti and not going anywhere in a winter coat to talk about SQL Server.

With that out of the way, let’s fade to black and let’s talk about how we can make OR clause queries better. Now, the OR operator is, of course, perfectly valid to use. There are some times, and I hate to see it more than others, joins particularly, left joins especially.

It’s almost like you should have just done a full join, probably, for a lot of those WHERE clauses. But when you use an IN clause, there’s a perfectly reasonable chance that… Sorry, when you use an OR clause, there is a perfectly reasonable chance that the operator will convert it to a series of SEEK predicates like IN1, 2, 3, or it might take IN1, 2, 3, and make it OR 1, OR 2, OR 3, and you don’t have to do anything about it.

Optimizers are just kind of funny like that. They try to optimize things, and sometimes they do better than others. The problem isn’t sort of generally when asking for IN or OR on a single column, right?

If you are the type of, like, dim person who likes to say where is null column something is null or is not null or whatever… or coalesce or whatever thing you do, even… I even see people do it with null if…

like, null if empty string is null. But if you say, like, where column equals empty string or column is null or anything like that, usually you’re better off just typing more, right?

So if you say, like, where column 1 equals something or column 1 is null, it’s okay. As long as you have an index that supports, you know, that, like, supports what you’re searching for, an OR expression on a single column is not going to ruin your day.

Where OR clauses tend to get tougher from a performance perspective is when they span columns, especially across joins, especially outer joins, because outer joins, you know, they do preserve nulls. So you end up with a weird sort of set of things to think about as far as, like, query performance goes there.

So if you were to do something like this, you might have a tougher time getting it to perform as well as… if that were just where t1.thing equals something or t1.thing is null. You know, at any point, though, sufficiently complex filtering will…

like I’ve seen so many times, and something that we’re going to cover in a couple few videos, is when you… someone decides to take a where clause or a join clause and turn it into a case expression, thinking that it will simplify things.

So it’s like 1 equals case when some conditions equals 1, or some other conditions equals 1, or some other conditions equals 2. And that’s when… that’s really painful overkill.

Just a plain old select… case expression, a select list, doesn’t cause nearly as many headaches, right? I mean, some types of, like, exists in subqueries… are exceptions to that, but…

for the most part, a select list… with a… sorry, a case expression, a select list, it’s not going to be a problem. The main issue, of course, that we’re talking about today is performance, and sometimes figuring out a reasonable index or set of indexes.

Sometimes… you know, it’s crazy to think that sometimes you might have one query that needs more than one index. You might need some number of indexes in order to make your query work reliably fast.

If you work… if you’re the… if you’ve ever done work with, like, interval-type queries that have a start date and a begin date, and you need to either find overlapping or non-overlapping or consecutive or non-consecutive things, you’re probably very, very used to the idea that your start date and your begin date are each essentially going to need an index that leads with them so that your searches can get to things in there as quickly as possible.

Now, sometimes it’s not just the indexes. Sometimes your queries, when you do write these types of queries, will need index hints both to use the correct index and to perform the correct seek or scan on that index.

Some people get really scared of that stuff, but, you know, the optimizer just sometimes will not make the right choice on its own. It refuses to.

Even though you know better and you run a query, and it’s better and faster when you add those hints in, some people will just say, Oh, well, you can’t add hints. It’s too scary for me. And so you end up with a query that is just perpetually slow.

So, you know, if you are… You know, if you have perfect indexes and perfect queries and everything’s working great, I’m very happy for you, of course.

Couldn’t be more thrilled. You’ve reached a level of nirvana, of spiritual perfection and enlightenment that I can only hope to achieve one day. But until then, I’m just probably going to try to keep getting paid to fix other people’s lack of spiritual enlightenment and nirvana.

Hopefully bring them a bliss that they’ve never known before. Get a back rub too.

So I already have this index created on the post table. It’s on owner user ID, last editor ID, and score. And this is the query that I am trying to get to support.

Sorry, this is the query that this index is there to support. So I’m going to show you something. And it’s going to be something that maybe annoys you.

I know it annoys me sometimes, and I see it, is even though this index leads with owner user ID, and even though we do have owner user ID as an equality predicate in our query, we don’t seek into this index, do we?

That is not what happens. We scan the index. All right, and our scan predicate is looking for a score greater than zero, and owner user ID equals 22656, or last editor user ID equals 22656.

Well, you’d think that we could at least be able to seek to one of those, but when you try to tell SQL Server to seek into this index, it says no.

The query processor cannot come up with a query plan that will help this thing. Why is that? Very strange. Well, it’s not because the index is created wrong. If we did create the index with the score column first, we wouldn’t be able to seek to that.

The thing is, with this predicate, most everything in the, like, rather, not most everything, but really a significant chunk of the post table does have a score greater than zero.

So we wouldn’t really get much out of that. It wouldn’t help us as much as you think it would. If we run this, where we’re looking for a score greater than 1,000, of course, things with a score greater than 1,000 are far less common in the user’s table, sorry, in the post table.

So this would result in a lot fewer rows, but we still can’t seek to what we care about. Like, it’s still just not an option to us with this index. So, sort of paraphrasing what’s in here, because, you know, I do need to finish these videos at some point.

It’s really tough to get a seek with an or clause sometimes. If we put the score column first, depending on what the score column is searching for, it might get better because, it might be okay, because we’d at least be able to seek to a small number of scores, but we still wouldn’t get the good results that we want to out of this query.

Because essentially, we have two separate predicates, right? Where owner user ID equals 22656 and score is greater than zero, or last editor user ID equals 22656 and score is greater than zero.

Like, we have this sort of double Dutch, double date weird thing that we have to figure out. If we express the query, and this is where I was talking about, where if you rewrite the query in a slightly different way, it’s a bit easier to see what SQL Server can seek to and what it can’t, given the index that we have.

So, let’s run the query like this. And I’m just doing the score greater than a thousand. It’s a little bit quicker this way. But essentially, we’re taking the query that we had written in one compact bit, and we are writing it in two separate bits, and we’re joining, we’re unioning those results together and selecting from the results of both of these.

So, we have one up here for owner user ID and one down here for last editor user ID. When we write the query like this, what we need to do becomes a lot more obvious. Even, it’s so obvious, even SQL Server can figure it out.

Except what SQL Server tries to tell us we need is an index on three key columns that covers the entire post table, including the body column, which is an envarkar max, including several other string columns in there.

If you’re the type of person who, you know, I don’t know, maybe you’re a consultant even, who’s like, oh, looked at the missing index request, going to create those, please don’t.

It’s stupid. They’re not good. They’re not good ideas. Even if you, like, consolidate them a little bit, you have no idea what query they’re helping.

You have no idea what the actual, like, the helpfulness of the indexes will be for the workload, for specific queries. It’s really just working blind, and you shouldn’t be doing that.

You should know better. Because people are paying you. They’re not paying you not to know better. They’re paying you to know better than they do.

If you add missing index requests just from the DMVs, looking at uses, or impact, or anything like that, you’re screwing up big time. Don’t do it. So what we can do, because we are smarter than SQL Server’s missing index requests, is we can create a couple indexes that will help our query.

Now, we’ve seen the query plan for the one before, right? We were able to seek into one part of it, and we had to scan the other. We had to seek to, we could seek to owner user ID and score, but we couldn’t seek to last editor user ID and score, because last editor user ID is the second key column in the index.

So we don’t have that in a helpful order unless we hit owner user ID first. The thing is, we’re not scanning to where owner user ID equals something and last editor user ID equals something.

That or condition means we could go here or we could go here and we need score, right? So we can only seek to one of those arrangements based on what the index was.

We can only seek to this arrangement because last editor user ID is the second key column. We cannot seek to this arrangement. Okay, so with these two indexes in place though, if we look at this query plan now, SQL Server will use an index union plan.

We can tell us index union because we have a concatenation here where a SQL Server basically adds this index to another. If this were an index intersection plan, SQL Server would join these two indexes together.

So we basically seek to what we care about in this index, right? Which is score and owner user ID. And we seek to what we care about in this index, which is last editor user ID and score.

And then we bring those two results together and we do a lookup for the rows that we, for the remaining columns that aren’t in those indexes for the rows that we care about. All right, so SQL Server is able to bring those two results together and then combine the clustered index to get us a full result for the query, which looks pretty close to what we would see in here, but just arranged slightly differently.

Let me actually run both of these together so we can see the differences. There’s not, there’s really not a big timing difference on these. If I, if I ran these enough times, well, actually, these are all at zero seconds anyway.

Sometimes they end up at like a couple milliseconds, but for the most part, the choice is either to concatenate the results of both of these seeks via index union, sort them down a little bit.

It’s only 16 rows, so no one really cares, and then do one key lookup for the 16 rows that we have. The alternative is to do one index seek and one key lookup, and then another index seek and another key lookup, of course, the first index seek is just like we’ve seen before.

Oops, that didn’t frame up too well, did it? Owner user ID and score, and then the second index seek is last editor user ID and score, and then we just do slightly smaller lookups for the rest of these, and then we return out the 16 rows that we care about for both of them.

So indexing for these types of queries is a little bit more challenging. Like I said, though, whenever you’re fixated on writing some OR condition into a WHERE clause, we’re going to talk about joins with OR clauses next, but whenever you’re fixated on writing an OR clause into a, writing ORs in your WHERE clause, you might want to consider performance, performance, and the way to consider performance, of course, is to write the query out with each of the specific WHERE clauses expressed the way that you would care about that, and if you do that, what’s going to happen is SQL Server is going to point out where you have good indexes to do things and where you don’t have good indexes to do things, and that can really help guide you to make better indexes to support queries that have this kind of extended, expansive logic in them.

So anyway, that’s about enough of this. In the next video, we’re going to be talking about joins with OR clauses in them, which are a separate nightmare, but we’ll have fun with that, and that’ll also be a bit more performance-focused, but, you know, we’re going to…

we’re going to stick to the sort of script of the series and make sure that we all understand that there is a correct way to write these sorts of queries.

So thank you for watching. I hope you enjoyed yourselves. I hope you learned something, and I will see you in the next video about joins with OR clauses in them, and we will all have a good laugh at the end if we make it to the end.

Really, anything could happen, right? You have about 15, 20 minutes in which, I don’t know, the world could end. The lights could all go out.

It could be a solar storm. Anything. It would be awful, but hopefully not. Anyway, thank you for watching.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

How To Write SQL Server Queries Correctly: Views vs Common Table Expressions

How To Write SQL Server Queries Correctly: Views vs Common Table Expressions



Thanks for watching!

Video Summary

In this video, I dive deep into the world of views and Common Table Expressions (CTEs) in SQL Server, addressing some common misconceptions and providing practical insights. Erik Darling from Darling Data shares his experiences and observations on how these objects are often misused or misunderstood by developers. Whether you’re a seasoned DBA or just starting out, this video offers valuable perspectives on when to use views versus CTEs, the importance of avoiding materialization issues, and the potential pitfalls of using `TOP` in views. I also explore why there seems to be a bias against views among some developers while CTEs are often embraced without question. By the end of the video, you’ll have a clearer understanding of how to leverage these objects effectively for better query performance and maintainability.

Full Transcript

Erik Darling here with Darling Data. And in today’s video, we’re going to talk about two of my frenemies in SQL Server, views and CTE. There are some things to discuss with the views and CTE that I think are important for people to know. And we’ll do that today. in some level of detail that will be too much for some and too little for others. But guess what? It’s a free video. I’m doing what makes me happy. So if you are also made happy by this video, and gosh, I hope you are, you can become a member of this channel for as low as four, that is, quattro dollars a month. there’s a link in the video description somewhere in this general vicinity.

You can click on that, become a member, and that’d be cool. It’d be very kind and giving of you in the holiday spirit. If you like to show your holiday spirit in different ways, different holly jolly ways, you can like, you can comment, you can subscribe.

That’s all right, too. If you are watching this video, it’s like, I mean, I know this is going to get published the day after Christmas, but who knows when you’re watching it.

But let’s say you’re like, wow, we have all this New Year’s budget. How are we going to spend it? What about on your SQL Server with spending some quality time with Erik Darling from Darling Data? You can hire me to do all of these things.

And I do them the best in the world, because I’ve seen what other people do, and it sucks. If you would like some very high quality, very low cost, SQL Server performance tuning content, again, there is a link in the description below where you can get all of this stuff combined for you.

But it is about 150 USD for the whole thing, and that is for the rest of your life. So live a long time.

Be happy. Again, this is being recorded, well, this is being presented at the day after Christmas, so I’m going to be huddled up somewhere, probably drinking red wine and staring dreamily as my children ungratefully unwrap presents in Paris.

So, no, that’s my plan. I don’t know what you’re up to. And with that out of the way, let’s have fun.

Let’s talk about CTE and views, because I suppose that is what we came here for, isn’t it? So one of the most exhausting parts of my job is, you know, being like the groundhog DBA who has to sort of say the same thing to different people over and over again.

It’s like starting from scratch. It’s like, I get really good at the piano, but the world around me is just starting from scratch, and no one realizes how good I am at piano.

So there are a lot of notions that people have about SQL Server that are both untrue and untested by them. It’s like, oh, I thought I read a thing that said that once.

I’m like, oh, can you show me the thing? No, of course not. It’s all in the sands of time. But the way that I think about views and CTE are since when you make a view, you write create or alter view, and then you put the query in it, which hopefully doesn’t contain too many other views, then hopefully the view definition doesn’t contain anything too outrageously awful, because Lord knows they have that.

People have a propensity for putting all the worst things into their views. They are like a permanent home for your query. They are not a permanent home for your data, because your views are not self-materializing.

And you have to go through great troubles and lengths that Microsoft should apologize for in order to create an index view. But it’s a thing that lives in the definition of your database.

It is not a physical object, but it’s like a stored procedure doesn’t materialize the data that the stored procedure selects and does stuff with.

View doesn’t either. But CTE are a bit more like mobile homes, because you can take one and you can park it anywhere, and it doesn’t actually live there.

You can put it over here. You can put it in a stored procedure over here. You can put it in a stored procedure over here. You can write it wherever you want. Just pull up, park it, have it make performance suck there too. It’s sort of like installing a toilet, right?

There’s a time and a place for a toilet, usually in the bathroom. If you use the same amount of discretion with toilet installetry that you use with your views and CTE, and you say you plop it right in the middle of the kitchen, probably don’t hook any pipes up to it, just leave it there, it’s going to look stupid immediately, and eventually it’s going to stink in your kitchen.

Probably not what you want next to the dinner table, is it? Unless… Unless… Oh, I don’t even want to go down that path.

Since views are programmable objects, right? They are actually modules in your database. They do have a little bit more depth of character and flavor to them than CTE.

For example, like when I talked about views in another video, you can add a with check option to have SQL Server do stuff with the data in the view when you modify it. You can also index a view.

You cannot index a CTE. Contrary to what I’ve heard said at several live in-person events and read in several places, views in CTE actually can use the indexes on the underlying tables that they select from.

The data does not become an amorphous blob anywhere. You can use the indexes there. It’s pretty spiffy.

Yeah. All right. What I find particularly curious about the view and CTE thing is how developers are sort of racist against views in a way that they are not against CTE.

So like, like I’m pretty sure that this is just like a random conversation that has happened 5 million times in the world. One developer will be like, just wrote this short procedure.

It has hundreds of CTE in it. And another developer will be like, wow, that’s amazing. You’re the best at SQL. These CTE, dog, they’re so readable.

I can really, really understand all this query. I really can’t. And then if the developer did the same thing, was like, hey, this database has hundreds of views in it. Developer will be like, man, why, why, why, why you gotta like, you know, mess up the database with all these views?

What’s wrong with you? That just makes things hard and complicated and unreadable. So like, it is weird that despite them having so much in common, and despite views having a leg up on CTE in several ways, you know, people are sort of aligned against them.

But, you know, even I kind of get it because I cringe a little bit when I see that, when like, someone’s like, I don’t know, I have this simple query, but it takes forever.

And I’m like, oh, okay. And it’s just like, you know, select stuff from a thing. And then you’re like, oh, well, I bet it’s just missing an index. And so like, you go to get the estimated plan, and it’s like, see what’s going on.

And then the estimated plan is like, gigantic, like, like one of those like open world video games where the map just keeps getting bigger and bigger and like, fill, and then that’s the query plan.

And like, you have to like zoom all the way out to even be able to grasp the full size of it. So, so I understand because views have been abused so horribly, but, but, but, but, but, you know, by the same token, CTE in my experience have been abused, just as horribly by people.

It’s just easier to see upfront and it’s not more readable. One thing that I see quite a bit is people still trying to stick top 100% in a view, thinking that it will present their data in order when they select from it.

It won’t. You need the outer order by no matter what. But one thing that I want to show you is like, if let’s say that we have a select top one in a view, and then we have a select top 100% in a view, if I show you, I’m not going to get the actual plans for these because this one will select 100% of the 2.4 million rows out of the users table.

And I don’t want to sit here for that. But if I show you the estimated plans for these, you’ll notice something, a slight difference between them.

The first one has a top operator in it, because there is a top that is actually honored by the optimizer. The second one does not have a top operator in it. The optimizer throws top 100% away, because top 100% means the whole damn table.

That does not mean there is no need to do a top operator in there, because we are selecting everything. If we replace this with top 99%, well, that’s a huss of a different color, because SQL Server actually has to figure out 99%.

It just has to do it in a real ugly way. You don’t want SQL Server to scan the entire clustered index, spool the entire clustered index, 99% of the clustered index into an eager table spool, and then have the top read 99% of the rows from there.

That’s a bad time. So we’re going to say that’s a bad idea. We’re going to not go with that idea. I’m going to change that back, because I don’t want that to accidentally be there and have anyone…

I don’t want to, like, die and have anyone go through my database and be like, he had a view with top 99% in it. Glad he’s dead.

I don’t want that to happen. So one other thing that’s important, and let me just get rid of that red squiggle down below. One thing that is important to understand about views and CTE, and this is something that I’ve said, a point that I have belabored, that this dead horse is well fed, that they don’t materialize nothing.

So when you reference a view, or… Excuse me. I’m very dry in here. This winter heat stuff just…

Despite there being a prevalence of steam pipes, no steam releases from the pipes. It’s just dry, dry heat. Because views and CTE do not materialize data, every time you reference them, or every time you access a view or CTE, the entire query inside it has to run.

I have written an unnecessarily large query in here, but if I go and create this, and I repeat that same thing, this is using a CTE, of course.

This is using the with syntax to create a CTE. And then I either join that CTE to itself, like so, or I join the view to itself, like so, and we look at the query plans.

We will see that the query plan for both of these, when it finally does a thing, will repeat itself for both of them. We have the set of joins from the first time we talked to our CTE up here, and we have the set of joins from the second time that we talked to our CTE down here.

We have the exact same pattern repeat in the view. It’s the same query plan. This is the first time we touched the view. This is the second time we touched the view. Every single one of those joins has to happen all over again.

It’s not a good time. And I see people do this constantly with both views and CTE where they’ll take the view. They, oh, all I need to do is run the view with like this where clause in a CTE, and then run the view with this where clause in this other CTE, and then join one to the other.

And you have this query plan that, again, just defies all logic. Well, I mean, it doesn’t defy logic because I know what’s going to happen, but it really, I think the better is, it defies like reasonable analysis because you’re just looking at this giant query plan and going, why would you do this in the first place?

Why would you want to hurt SQL Server like this? So not a lot of difference here. No matter which one you use, there is no materialization.

You can at least materialize a view. You can index a view. Again, a lot of rules to follow there. You cannot index a CTE, like the definition of it.

You can’t say like with CTE as, and like define an index in the CTE definition. But both of them, both views and CTE can use whatever underlying indexes you have.

They’re often responsible for either one being successful. But of course, the crappier code you put into either a view or a CTE, the worse off you are.

So neither one is going to turn out well. I don’t really know. Let’s see. Yeah, okay. Well, I mean, one other good point in here is that if you use views and you hire a young, handsome performance tuning consultant like myself, you take a look at my reasonable rates and you think, gosh, how can we afford not to?

And let’s say I come along and I performance tune a view. Everything that relies on that view will get faster. If you just sprinkle CTE everywhere, like kitchen toilets, guess what’s going to happen?

I’m going to have to go through every place you use that CTE or a CTE, and I’m going to have to fix all of those individually. You can imagine which one is a better use of time. So there is that to consider.

Anyway, thank you for watching. I hope you enjoyed yourselves. I hope you learned something. I hope that you will practice better diligence in your use of both views and CTE.

And I also hope you keep watching. So let’s do all those things together. Go team.

I’m not much of a high five guy, but I don’t really have much of a… It wouldn’t make sense for me to do this. Like really for the camera, the high five or the fist bump is the only thing that makes any sense.

So anyway, let’s go record another video. We’re going to talk about, I guess, or and where clauses next. Ooh.

It’s going to be perf heavy topic, isn’t it? I’m going to have to get into that one. Anyway, thank you for watching. Goodbye.

I love you.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

How To Write SQL Server Queries Correctly: INTERSECT And EXCEPT

How To Write SQL Server Queries Correctly: INTERSECT And EXCEPT



Thanks for watching!

Video Summary

In this video, I dive into the lesser-known but incredibly useful `INTERSECT` and `EXCEPT` operators in SQL. These operations are particularly handy for comparing columns across multiple queries without the need for verbose and error-prone null handling logic. By exploring these operators through practical examples, you’ll see how they can simplify complex comparisons and improve query performance. I also delve into their unique operator precedence rules and discuss when to use them effectively in your SQL queries. Whether you’re a seasoned database professional or just starting out, understanding `INTERSECT` and `EXCEPT` will give you a powerful toolset for writing more efficient and readable code. Additionally, I touch on the recent additions like `IS DISTINCT FROM` and `IS NOT DISTINCT FROM` in SQL Server 2022, highlighting how they further enhance T-SQL’s capabilities. By the end of this video, you’ll have a solid grasp on when and how to leverage these operators for your own projects.

Full Transcript

Erik Darling here with Darling Data. We’re going to do a great job today. I can just feel it. I’m feeling in my bones. And I can tell we’re going to do a great job because we’re talking about a subject that almost no one ever talks about. Intersect and accept. You might notice that that word looks a little funny. Brevity is the soul of wit, I think. So, you know, these are not things that I just want to say. I see people use in production queries terribly often. Most of the time when I see people use these, they’re trying to compare, like, the entire contents of one table to another and try to figure out where there are differences. You don’t see people use these a lot in, like, smart ways in queries. They would much rather sit there and, you know, figure out if columns match or are both no. Or if, like, like, like, like wrapping things in is null and coalesce like, like the fools they are and these giant or clauses. And, and gosh, gosh, is it nice to just retype those as, as intersect or accept queries because not only do they, they look prettier, but they often perform better. And there’s often a lot of, a lot of room and potential for logical incorrectness.

in, uh, extended and, uh, or clause queries, uh, that, you know, uh, usually you’ll find a couple few bugs in those. So, we’re gonna have fun today. But before we have fun, we need to talk about, uh, stuff you can buy from me. Yeah, cause everyone’s trying to sell you something. I guess that, that includes me. Merry Christmas. Uh, if you would like to become a member of this channel, uh, and, and, and support my, my, my, enduring efforts to bring you quality SQL Server content, uh, there’s a link right in the video description that says become a member in which you can become a member for as low as $4 a month.

If the $4 a month is just too rich for your blood, uh, you can like, you can comment, you can subscribe. There are all sorts of buttons you can push that push my joy buttons, which is just the, probably one of the, the best things you can do in life, right? Uh, what is it? It costs nothing to be kind. So, it says the sages of every social media platform overrun with people, uh, being selectively kind. Uh, if you need help with SQL Server, if you’re looking at your SQL Server and thinking, gosh, this thing is slow, uh, Erik Darling, uh, from Darling Data does all of these things, the best in the world.

Uh, so you can, you can, you can hire him, me, us, as, as a, as a package and get this handsome devil to show up and make your SQL Server faster in exchange for money. And as always, my rates are reasonable. Uh, if you would like to get some training on SQL Server, I have about 24, 25 hours of it. Uh, there is a discount code you can use to get 75% off that brings it to about 150 USD and you get that for the rest of your natural life.

Uh, again, it is the end of 2024. I might just take this slide out because I’m getting sick of saying it. 2025. But, you know, then if I, if I stop saying it, maybe you’ll forget that I, I, I come places and I speak and also in exchange for money. Uh, so if you would like me at your event, tell, tell me what your event is.

With that out of the way, let us intersect and accept ourselves into oblivion. Now, one of the first times I actually ever saw someone use, uh, these operators was, of course, uh, my, my dear friend and my, my wine distributor from New Zealand, uh, Paul White. And, uh, one thing I want to note is, uh, you see that URL up there? Um, it’s, uh, right, right, right about there.

Uh, I want you to just take a quick look at the date in here. 2011. So June of 2011. Uh, so there, originally I was going to put a trigger warning on this because, uh, you know, obviously code from, uh, gosh, almost 15 years ago, uh, does not live up to today’s modern standards of code formatting style choices.

Uh, but instead I have reformatted this code to fit modern conventions, coding standards in T-SQL. Um, I’ve also taken the liberty of replace, uh, in the original, uh, queries, the, the table, the temp tables were table variables. I’ve only, I haven’t changed those for any overarching performance reasons only to make each of the select queries below a little bit more portable.

Okay. Uh, because otherwise I would have to declare and populate and run the select queries for each of the things below. And that, that just gets a little unwieldy. So, uh, we’re going to, uh, just do this instead. Uh, we’re going to make our table variable, our temp tables, not table variables. Uh, and then we’re going to insert some data into them.

Um, and then, uh, so the, the cool thing about Paul’s post is that he walks through like the query that you people, a lot of people would write, uh, that would get them the incorrect results. Right. Like this, because this does not account for nulls. This just returns us one row, uh, for the five that we put in both of these. This messes up anything where there are nulls and, and, and that doesn’t get us what we want back.

Right. This is obviously an incorrect result. Uh, you could rewrite the query. And this is how I see a lot of people do it where they have this gigantic or clause. A lot of ends sprinkled in a lot of people don’t get their parentheses, right?

Or don’t get there, or maybe they like copy, there are copy and paste errors in here. There’s a lot of things that can go wrong when you start writing, uh, non-trivial complicated queries like this. Uh, so this will get you the correct results.

Um, what is a little annoying here is this filter operator. Which expands out that entire or, the, the, all of, all of those or clauses for a couple of five row tables. Obviously this doesn’t make a performance difference, but when you’re dealing with larger data or darling data, which is the largest data you can get, uh, then these things do start to make a big difference.

This does get us correct results, right? This does bring us back everything that we care about. Uh, this will also get you correct results.

But as I’ve said in 10 million videos at this point, wrapping where and join clause columns in, uh, in functions, even built-in ones is just a recipe for performance disaster. Uh, so we can run this. We can get correct results, uh, but we would end up with, um, uh, just a weird sort of bunch of stuff happening, uh, in here.

There we go. So, uh, we have this whole predicate against this table, which I, which I suppose is a step up from a filter operator. But this whole predicate is just a series of case expressions for all the coalesces that we had in there.

So, uh, you can see the various case when, blah, blah, blah. Uh, case expressions are another thing that I, I, I strongly advise against putting into your join and where clause column, uh, predicates because you will, you will likewise be in for a bad time. Uh, there is also an example with isnel, and isnel should look roughly the same, except rather than a series of case expressions, we will have a predicate that is a series of isnels.

This is not necessarily a better situation. They’re both about equivalent performance wise once data becomes of a certain size. So be careful with that.

Uh, this is a much better way of writing the query because intersect will correctly deal with the nulls and it’s a whole lot less typing and a whole lot less error prone. So if we run this, we get back the correct results and we don’t have any weird stuff.

Uh, in our query plan as far as like gross predicates go, uh, all of this stuff just gets evaluated, uh, in a nested, in the nested loops join and we evaluate everything across there.

So it all ends up being fairly easy and straightforward. Um, there is sort of an unfortunate thing where not exist does not, uh, provide us with, uh, the entirely correct results on this.

We get back, um, an additional role that we don’t care for there. Uh, so, you know, uh, this is the best version of the query that I think could possibly happen. Intersect and accept are very useful for these sort of like, you know, uh, extended column comparison things.

So, uh, like I said before, um, I’ve never, uh, seen anyone really use these like when they should have, uh, in production queries.

Um, I think probably part of the problem is that it’s somewhat unclear what they do, uh, when you’re reading through any sort of SQL guide, uh, or rather even when you’re, you’re just looking at like, um, when the linguistic possibilities of SQL, you’re going to look at like, you know, stuff like join and exists and not exists and where and group by and order by and all the other words that come up.

And you’re going to say, huh, those make sense. I can sort of figure out what they do there. Intersect and accept.

it’s not really clear what they do from how they’re named. Um, uh, there are some weird rules around operator precedence with these two things. uh, has, uh, very specific operator precedence rules that other things don’t.

Uh, intersect will give you a unique set of rows from both queries and except will only give you a unique set of rows from the first query. First is going to be an air quotes there.

Uh, because, uh, like, you know, we’re just, we’re thinking about the queries as written in order. So the first one that you do, then the except, then the other thing, but then you could put other accepts and intersects below that.

Really, we just, we care about like the first one with except. Um, but the, the set of rows that you get back from them is going to be uniqueified.

So, uh, these are operators that work, uh, somewhat better or reason, not somewhat better.

I would say, uh, somewhere, somewhere between somewhat and profoundly better. If you have primary keys or unique indexes defined on the columns that you’re looking at, or at least one of the columns that you’re comparing there.

Uh, but I think probably the best part about them is that they handle null comparisons, uh, without a lot of crazy syntax, like we saw above with the ands, ors, and the is null coalesce and blah, blah, blah, blah, blah.

Um, that what is tricky about them is of course, figuring out when you should use them and what order to write things in. Often this takes quite a bit of experimentation to get, to start to get a real handle on.

And if you don’t use them for a few days, you will probably lose that handle entirely and have to come back and start from scratch. That’s sort of like when I have to write any XML query, I’m like, Oh God.

And I have to go reference everything that I’ve ever done before to figure out what I need to do now. But, uh, just to give you a couple examples of what they do, uh, let’s have, uh, let’s just do this.

And let’s say we want to intersect, uh, everything with the score over two with everything with the score over three. Uh, and we, we, we get the results we want.

Down here, we get about 6,500 rows and the old, but the, all the results are going to be everything where there’s a score over four from this, from both of these queries, right?

So this one is looking for greater than two. This one is looking for greater than three. And where those two results set start to actually find matching rows is when we hit a score of four. So everything in here is going to have a score of four and up, right?

So pretty easy stuff there. That’s where these two results start to overlap at four. We have greater than two. We have greater than three. What’s after three, four. What’s a couple after two, four.

That’s where we start showing what’s going to come back. Um, and like I said, uh, the user ID column, which is nullable and has nulls and it does not give us any issues here. Um, so if you, if you are writing queries and you have, find yourself having to like, you know, that compare, uh, columns like this, uh, especially like across a number of columns, often intersect or accept would give you better performance and, uh, better sort of, uh, code clarity and everything else because they handle nulls without a lot of, uh, overly verbose, uh, you know, uh, querying or adding in functions that replace nulls with canary values.

Uh, so this is, uh, this is the same two queries this time just using except. And, uh, this is going to give us just results from the first query, right?

Just from this one, right? Because we want to see everything from here except what’s in here. So we don’t show anything at all from this query. We only show the stuff from this query. Now this is only going to show results again from the first query, big air quotes there.

Uh, you could also call it the left, the left most query or the outer query with a score of three, because that’s the only data that exists in it. That’s also in the second query or the inner queer, right?

Or that’s, that’s not also in this one. So this we’re saying greater than three, this is greater than four. So where these start to collide is just at score three, right?

We’re only showing the threes in here. Um, so it is a bit like using not exists and that the rows are only checked from the second, from the, from this query. They’re not, we don’t project anything out from this one.

Uh, and again, the nulls are handled quite well. Uh, SQL Server 2022, uh, wow.

They, they, they really modernized T SQL with this one. Um, they added the, uh, is distinct from and is not distinct from syntax to SQL Server 2022. And I suppose at this point, we could be happy that they showed up at all because is distinct from was introduced in 1999.

99. This thing was old enough to drink before Microsoft got around to adding this in. And, uh, actually this one, this, this one was too.

Okay. I guess this would have been the funnier joke. Uh, this one was added in 2003. Well, I guess in 2022, it wasn’t quite old enough to drink. I guess it depends on where you live. If you’re, if you’re in those devil may care European countries where teenagers just get sloshed in the streets.

Well, my, actually that sounds kind of good to me. Uh, that’s what I did.

And I, I just had to find sneaky ways to do it because I didn’t want to get busted by the federal allies. Uh, so I mean, I understand no, no database really adheres like perfectly to ANSI standards, but waiting like 20 years to get, uh, basic syntax added to T SQL is, man, it is the, the choices Microsoft makes wild when it comes to this stuff.

Uh, I realized that like, you know, people aren’t exactly clamoring for like some of these things. things, but just like basic stock functionality where like, if, if you wanted to, you know, bring an application from, uh, like Postgres or Oracle or DB2 and put it in SQL Server and like, like, you had queries with this stuff written in there and you were like, Oh, that’s broken.

How do we rewrite it? It’s like, great. Awesome. Right. Good stuff. Good stuff.

Good stuff. Microsoft. We really appreciate all the wonderful things you’ve added to the product in the meantime that have changed everyone’s lives. So anyway, they are, the, the, the, the is distinct from, it is not distinct from what will be useful to someone someday when they start.

I don’t know. At this point, the SQL Server 2025 is probably going to be where people go. Not a lot of people went to SQL Server 2022. Apparently everyone was pretty happy with 2019.

Uh, I don’t know. I don’t, I don’t know fully what 2025 is going to bring us. Uh, there was just the announcements that it, it ignited fairly recently, which were, uh, I, I suppose, whelming at best.

Uh, of course, there was a lot of talk about, uh, fabric and AI, which, um, are two stupid things that, uh, of course being stapled onto, into SQL Server or SQL servers being stapled into them.

I don’t know. Uh, but there, there wasn’t a lot of, uh, wasn’t, it wasn’t a very good highlight reel about what’s actually in SQL Server 2025. Aside from stupid current hype cycle memes that are being shoved down all of our throats with, uh, very little care.

Anyway, uh, let’s look at how you can use is, is not distinct from, uh, I don’t, I don’t know if I have an is distinct from example here, but, uh, let’s say that, you know, one of my least favorite things to see in, in, in a join is an or clause.

All right. Something like this. Yuck, grotesque, awful, you do this sort of thing. You should have your keyboard removed from your hands, placed elsewhere.

Uh, but let’s run these two queries. And do, do, do, do, do, do, Yep.

Okay. Uh, there we go. They finally finished. So, uh, this first query, uh, took 6.2 seconds. This is the bad query with the or clause in it.

Uh, and it has all of the hallmarks. I’m going to make this a little bit bigger so we can see both the queries on, both the query plans on one screen here. Uh, this, this, this has all the hallmarks of a join with an or clause.

Uh, uh, we start off scanning the, well, I mean, you’re not always going to see a scan. You, you could technically see a seek here if we had a where clause on the user’s table, but we take all 2.4, some odd million rows from the user’s table.

We feed them through the constant scan operator twice. Notice that we have two sets, two full sets of rows from the user’s table. We have, um, uh, one set to find, uh, u dot ID, uh, in here.

So that’s, that’s quite a bit of not fun. And then down in here, we seek into, we, well, we do a bunch of work, right? We try to collapse all this stuff into here and we can catenate them and we sort them and we merge interval them to remove duplicates.

And then we spend time in here doing a nested loops join to the post table over and over again. And then finally 6.2 seconds later, we come up with a result set.

Hooray, hooray, hooray. We did it. Uh, this query down here, now granted, you know, there, there might be some indexing stuff we could do that would make this a little bit happier, but, uh, this takes about two and a half seconds, a pretty good improvement from six seconds, right?

Maybe, maybe not where I would stop if I were, you know, if I were being paid to tune this query, but I would, I would at least say, Hey, we could just do a little rewrite and get this in a better place. So that’s, that’s cool there. So I am excited for, to be able to start using these.

Uh, there’s just not a lot, a lot of opportunity for that yet. Um, you know, uh, I, I, I think that, um, T SQL additions like this should get back to, backported because like, they’re, they’re important, but, um, what do I know about backporting?

I’m just a, just a guy who writes door procedures. Uh, you could, there’s also, you could also read, it’s in the spirit of intersect and accept. We could also rewrite the queries in this way.

Now, uh, uh, that what’s, what’s fun about when I say fun, again, lots of air quotes in this, this, this video.

Uh, what’s, what’s interesting about query tuning these days is, um, the amount of things that may or may not kick in depending on how SQL Server feels at the time. Uh, things like batch mode on rowstore are, unless you force the issue with like, you know, some sort of, uh, column story thing, uh, it is entirely based on heuristics.

And so you can end up with like just strange performance differences, like for two queries that are, do they do sort of equivalent things like the join with the or clauses just get out. But if we were to run these two queries we’re using is not distinct from here.

And we’re using sort of the point of the video intersect here. All right. And if we look at these two query plans, this one runs for 2.6 seconds, just like above.

And this one runs for six, 600 milliseconds, which is quite different from above, right? Like the intersect query went faster. The intersect query went faster because SQL Server naturally chose batch mode on rowstore for this, for this query plan.

We have the batch there. We have the batch here. We have the batch here. Uh, I don’t think anything down here is eligible for the batch, except maybe reading from some of these, but, uh, we go, so we got, we got the batch up here when we read from the post table as well.

So SQL Server was like, yeah, batch mode on rowstore. Sounds great. Did not choose that here. Uh, and like I said, we could, we could force the issue, um, running, uh, this query and doing a funny little left join to a empty table with a columnstore index on it.

And if we look at the execution plan, now that we’ve got some batch mode in here, this one is all of a sudden competitive with the, uh, the intersect version up there.

So like I said, query tuning these days, real fun. Cause who knows what tiny little change you can make that will awaken SQL Server’s senses and say, oh yes, we should use batch mode.

We’re doing stuff with a lot of rows here. That would be smart. Um, so, you know, uh, keep an eye out for these things. Uh, I don’t really know where else to go with that.

Um, it’s a good, it’s a good time. It’s a real good time. Uh, so, uh, that’s about it for intersect and accept.

Um, I’m going to wrap this one up and I’m going to apparently talk about views versus CTE, which will be a rollicking ride. Uh, we’ll do that, I guess, as soon as this uploads.

So, we, we, see you then. Thank you for watching. I hope you enjoyed yourselves. I hope you learned something. I hope that you will expand your SQL vocabulary and start using intersect and accept. And I hope that if you, as soon as you are able to, you’ll start using is distinct from, and is not distinct from, as they become linguistically available to you.

Whatever SQL Server version or edition you end up on next. Thank you for watching. there’s a bright at heart and começa the людocker option, I wonder if or not could help you know why. A pastidade or assistant speakers,

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

How To Write SQL Server Queries Correctly: UNION and UNION ALL

How To Write SQL Server Queries Correctly: UNION and UNION ALL



Thanks for watching!

Video Summary

In this video, I delve into the nuances of using `UNION` and `UNION ALL` in SQL Server queries. Erik Darling from Darling Data shares insights on how these operations can affect query performance and result correctness. I highlight the differences between `UNION` and `UNION ALL`, explaining why `UNION` adds distinctness at the end, while `UNION ALL` simply concatenates results without deduplication. The video also explores various placement scenarios of these operators within a query to demonstrate their impact on execution plans and performance.

Furthermore, I discuss when it’s appropriate to use `UNION` versus `UNION ALL`, emphasizing that most of the time, `UNION ALL` is more efficient due to its lower overhead for deduplication. However, there are cases where ensuring a unique result set can optimize other parts of the query plan. The video includes practical examples and SQL Server 2022 function demonstrations to illustrate these concepts, making it easier for viewers to understand how to write queries that return accurate results efficiently.

Full Transcript

Erik Darling here with Darling Data. And we’re going to talk in today’s video, continuing on with our series about how to write queries correctly. And this, of course, you know, comes down to two things, like both getting an accurate, correct result, you know, according to the logical demands of the query, and also having it return data to you in as efficient a manner as possible. And today we are going to talk about Union and Union All, or as they say in the South, Union Y’all. All right. If you can forget that joke happened, and you would like to support this channel, not for the jokes, but for the SQL Server information, or if you like that joke, you can do it for the jokes, too. I don’t care.

Whatever your motivation is, is fine with me. There’s a link in the video description where you can become a member of the channel. If clicking that link is too hard for you, perhaps clicking other things you’ll find a little bit easier. Liking, commenting, subscribing, all wonderful things to do. If the topics in these videos are near and dear to your heart, and you’re having SQL Server issues that you think a young, handsome fellow like Erik Darling from Darling Data could help you solve in exchange for money, I am available for hire for any and all of these things. Have a great time with me together, one-on-one, solving all sorts of stuff.

Training. Good to have. Better to watch. I had someone email me and say, hey, when I go to your training site, I can’t add anything to the cart, and I can watch all the videos. Did I buy this before? And so I went and looked in my receipts drawer, and lo and behold, they had purchased the training in December of 2020. So I said, yes, you did purchase this.

Merely, I mean, just like on the cusp of four years ago. Training works best when you actually watch it. It’s a participation sport. You have to be involved. Otherwise, you still know nothing. No upcoming events. 2025, talk to me. Tell me where to go. I don’t know where you are or where you want me to go.

I’m not endowed with psychic abilities, though I wish I were. So you have to tell me where you would like Erik Darling to be. With that out of the way, let’s talk about union and union all. So union and union all are funny because they kind of get used interchangeably in almost the same way that a lot of other things are, with very little consideration for performance or result correctness or other things.

So like CTE, temp tables, temp tables, table variables, parameters, local variables, all sorts of things that, you know, joins and exists. People just, you know, start writing a query one way and then they just always write the query that way. I remember a long time ago I read it. I used to play drums when I was a kid.

And I was reading an interview with a drummer in some drummer magazine and he was like, Ah, yeah, you know, I was playing in this cover band and, you know, we were like, you know, it was good because it was a very successful cover band. But, you know, every time I sat down to play drums, the only thing I could think of to play were like the drums to these cover songs.

And I was like, like, wow, that’s depressing. And then I realized, wow, that’s how people write queries too. They sit down and they’re just like, Oh, I’m going to play Copacabana again, I guess. So there are lots of times when I see developers use union, like when the results have absolutely no chance of having duplicates in them.

They either like join different tables together or have different where clauses, or sometimes they even have like literal values in the select list. They’re just like, you know, like, you like, how can you how can you make those distinct? How can you do? How can you do that? It’s weird. You already have distinct in the select, how are you going to make it more distinct?

And, you know, a lot of this stuff comes from testing queries in isolation, not really knowing that, you know, what when you should use exist first joins, because they haven’t watched the other video in this series about exists and joins. And, you know, once you start getting other things involved, like no lock hints, you could just end up with crap everywhere.

Bad things popping around everywhere. Now, how to write queries correctly does depend on a number of things. A, the quality of your data in general.

B, the quality of the data structures that you have available. That largely means the indexes. And also, you know, like, what returns a result to you as quickly as possible that is still logically correct. This can all become really difficult to figure out.

You know, especially the more complex a schema is, the more things you have to get involved, the more calculations you have to do. Knowing what, I think, really the hardest part about writing a query, aside from knowing, like, the SQL behind it, is knowing what the correct result should look like. It’s a very hard thing to define up front.

You have to, like, write a query, get a result, and probably show it to someone who’s like, who can be like, yeah, I don’t know either. Probably. I guess it’s right.

Seems fine. You know, validating query result correctness is a challenging, challenging thing. I’m happy when my rewrites just match what the slower query returned.

I have no idea if that’s right or not. I’m not checking that. I’m just making sure that what, like, the same number of rows and the same data output is as far as I usually go. Because I don’t usually get involved enough to, like, dig deep into, like, someone’s data and understand what an actually correct query result would look like.

It was sort of a funny story where there was one client of mine who had a very, very slow recursive CTE, which I helped fix up. And the original version of it would just never return a result. It wasn’t like an infinite loop thing.

It was just slow as hell. And when we got it right and we sped it up, they were like, I think the logic in here is wrong because these results aren’t correct. So we had to, like, redo the recursive CTEbecause they were like, yeah, this isn’t showing us the correct thing.

So that was fun, too. So let’s use a rather nifty SQL Server 2022 function. So let’s start talking about union and union all and how they differ.

And actually, union and distinct is what we’re going to talk about first. So I’m going to, you know what, I think I actually already did this stuff. If we look at these two query plans, there should already be stuff in these tables.

Yes, there are. Okay, great. So we have two queries here. The first one says select I from T1, union select I from T2.

The second one says select distinct from T1, union all select distinct from T2. The thing that I want to show you here first is, of course, the results. We get very different results back.

The first one just returns 1, 2, 3, 4, 5, 6. The second one returns 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6. So even though we had 1 through 5 in this table twice and 1 through 6 in this table twice, these queries made different things unique at different points.

If we look at the query plans for these, the union query adds its distinctness at the end across both result sets. So we not only deduplicate the results from each table scan, but we deduplicate the entire result here. For the two distinct queries, the distinctness happens here, not at the end.

Right? So when we say select distinct from this one, select distinct from this one with a union all in the middle, we make this distinct and this distinct, but we don’t make the final result distinct.

We just spit back whatever these two things put together. So distinct and union all do make things, distinct and union do make different things unique. Most of the time, if you are using union, rather all of the time, if you are using union between two queries, you don’t need to also add distinct to the select list because you’re already going to get that at the end.

You might just be doing weird extra work. Another thing that is kind of fun about, is about union and union all placement in a query. So if we look at, oh, I should have highlighted the whole thing, shouldn’t I?

If we look at this and we look at the query plan, we just get a constant scan. If we were to quote any of these things, then, you know, it wouldn’t make a difference. But if we start changing these, we will start seeing slightly different results.

For example, this one removes an extra one and two, or sorry, removes an extra one, but we still have two twos and two threes. I think the execution plans for these are just kind of funny because we have two constant scans and a distinct up here, but then just a concatenation for the other two things, which are down below this.

If we quote this one out and this one in and run this, the execution plan now has this with the distinct over here, right? We have a constant scan, constant scan, concatenation, constant scan, concatenation, and the results are one, two, three, three. So we have removed some additional, we have removed that extra two from the query results.

And of course, if we put the union down here at the very end, we will see a slightly different query plan and a fully deduplicated result set. One, two, three. We look at the query plan for this.

We have one constant scan, one constant scan, and then the duplication at the end, just like with the query where we selected from the temp tables. So that’s fun right there. Now, there has been quite a lot of performance talk about union and union all over the years.

Well, I do agree that most of the time, as long as you get correct results, union all is going to be a bit cheaper on you because SQL Server will not attempt to deduplicate the result sets. Even attempting to deduplicate unique result sets, if you have a bunch of string columns involved, can be rather painful and unwieldy.

If you are in the habit of writing union queries or even putting distinct in your select list, I would really strongly encourage you to think about the number of columns you’re selecting, the data types of the columns you’re selecting, and what actually identifies a unique row. Because you might be doing distinct over a bunch of columns where it’s not making a difference. You might be doing union over a bunch of columns where it doesn’t make a difference.

And oftentimes there is a sort of hidden subset of columns in your data that you can make an easy sort of distinct result set from without having to worry too much about it, without having to worry about long select lists and stuff. So if we run this and we look at the query plan that comes back, this takes about two seconds.

This is not a terribly dramatic example, I admit it. And, you know, it’s okay. But, you know, the thing is that we’re, you know, doing select all this stuff and including this text column and, you know, it’s just, it’s, it’s, this is an envarchar 700.

And now you have to worry about deduplicating that. If we were to take this query and do this, and what we would be doing is taking the same base query with the same columns in it, but then only generating a row number over the columns that we know make a unique result set, this will, this is a little bit faster, right?

Like I said, this is not a terribly dramatic example. That’s about 400 or so milliseconds faster, but it’s still a good example of how you can improve things by thinking a little bit more, a little bit more analytically about your data and what you need to make unique and what you don’t.

Right? So like in this case, doing a, generating a row number over these two union all the result sets is a lot faster. Now we’re now when I said most of the time, I do mean most of the time, there is less overhead to union all over union.

But there are some cases where it does make sense to make a unique result set to make some other operation in a query plan more efficient. Uh, I want you to think of this sort of like, uh, pushing predicates down to when you touch tables rather than having a filter operator happen later. That’s something that I talked about in the, uh, exists versus joins video, where I showed you a query that uses a left join, uh, with a, a where clause, uh, to find rows that don’t exist in one table that exist in another table.

And how using, uh, not exist was much more efficient in that case because we joined less data together, right? Because with the, the left join thing, we had to fully join both tables and then filter out nulls afterwards. The same thing is for like pushing any predicate or reducing rows as much as possible before you do something that is, uh, computationally expensive in a query.

So sometimes getting a distinct set of rows for your query can make things a lot better. So what I’m going to do now is, uh, populate this temp table with, um, user IDs for people who have won these badges. And the first way that I’m going to run this query is with a union.

And we’re going to marvel at the query plan for this. So this takes about, let’s see, five seconds right there. Right.

And, uh, you know, this, this would get better with, you know, slightly better memory grant stuff like that. Uh, if I, if I ran this like multiple times, you would see it get a little bit faster because SQL, since I’m in compat level 160 for the other stuff, we’re getting memory grant feedback. So this would improve over a few runs that we would eventually see that spill kind of fall off.

Usually it’s like three or four runs before the spill goes away. But, um, anyway, now what I want to do is change this to union all. So now we’re going to be taking like, and before with union, we were deduplicating these results, right?

We were getting rid of them. Um, but now when we do this, what we’re going to notice is that this query no longer finishes in like five seconds. This query drags on for a little bit longer.

And by a little bit longer, I mean, this thing is going to run for about 30 or so seconds total. Um, it’s been a while since I timed this one. So, you know, who knows, maybe, maybe, you know, Intel gave me some supercharged boost to my CPUs and maybe it’ll finish a little bit quicker.

Um, maybe not. I don’t know. We’re going to, we’re just going to, we’re just going to let this thing ride.

Uh, and we’re at, well, we’re at 35, 36, 38. Oh, we’re at 40 seconds now. Uh, do, do, do.

Well, I don’t know. I think this, this might be proving the point a little too well. So we got up to almost 50 seconds on this one. We got 47 seconds. And that’s because rather than, um, and, and, and you can see that like the, the, the, the, the pain of this wasn’t even in here.

Like there was almost no overhead to like doing the concatenation of these. Uh, we did still have to, you know, do all this stuff in the sort and whatnot, but where this makes the biggest difference is the number of rows that end up or the number of, uh, things that we end up doing in the table spool. Right.

So we have a nested, we have a nested loops join here. SQL Server uses a table spool here. Um, I’ve talked about table spools in the past, but they are sort of interesting. Uh, SQL Server uses them on, like for in select queries, not modification queries, modification queries, spools are for Halloween protection.

In select queries, spools are often there to optimize, uh, operations on the inner side of nested loops. By the inner side of nested loops, I mean this portion of the query plan in here. So, um, a table spool is fun because a table spool, uh, you know, you take, uh, sorted data from this table, right?

You sort it here and then you pass it to the nested loops and the nested loops, uh, will, you know, tell the table spool to go run the query for, let’s just say ID one. Uh, it’ll populate the table spool. Uh, it’ll populate the table spool.

And then for any additional runs with ID one, it’ll reuse the data in the table spool. Then as soon as we get to ID two, the table spool will get truncated and repopulated. And then, uh, you’ll like, it will, it will reuse data in the table spool for ID two until we hit ID three.

So table spools really can save a lot of time and energy on the inner side of nested loops sometimes. But in this case, the problem is more that we end up with way, way, way more stuff to do with the, because we don’t deduplicate results here the way that we did with the union query. So if we go back and we quote out the all here, remember this is about 40 something seconds and we run this.

We’ll do this one more time with the, with the, the union rather than the union all, it should be about five seconds or so. So do do do. And we get the results. Notice that, uh, like, you know, we, we, we do the same thing where we, you know, concatenate the results. But then before we, uh, send it, send it along, we get, we use a distinct sort to make the results of these two things distinct.

And we spend a whole lot less time in this portion of the query plan. Right? So this, like this part does a whole lot less work because we made the results set unique in this part. So you can run into interesting situations where using union to deduplicate results makes a computationally expensive part of a query faster or repetitive part of a query faster.

In this case, the nested loops is really what did it. So that’s a, that’s a pretty good way of thinking about things sometimes where can I make, can I make this part of the query execute fewer times or do less work with a distinct result set rather than a non distinct result set. So anyway, that’s about all I have to say about union versus union all. As always, I hope you enjoyed yourselves. I hope you learned something.

I hope that you will continue watching this series. I hope that you will continue to write queries correctly. And I will see you over in the next video, which is going to be about two things that I never see anyone use. Intersect and accept.

Fun times. Oh boy. You’re going to really, you’re going to get a relational mouthful on the next one. So I will see you there. Goodbye. Bye.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

How To Write SQL Server Queries Correctly: Views vs Inline Table Valued Functions

How To Write SQL Server Queries Correctly: Views vs Inline Table Valued Functions



Thanks for watching!

Video Summary

In this video, I dive into the world of views and inline table-valued functions in SQL Server, comparing their pros and cons while highlighting common pitfalls that can lead to performance issues. Erik Darling from Darling Data shares his experiences working with clients who have created overly complex view hierarchies, emphasizing that views themselves are not inherently bad but can become problematic when misused. I also explore the benefits of inline table-valued functions, particularly their ability to accept parameters and push predicates further into queries. Additionally, I discuss Microsoft’s recent fixes for certain issues related to parameter sniffing in views, explaining how to enable these fixes on different SQL Server versions. By sharing practical examples and insights, this video aims to help database administrators and developers make informed decisions about when and how to use views versus inline functions in their projects.

Full Transcript

Erik Darling here with Darling Data. Having a lovely day so far. Absolutely lovely day. Stunning day. It’s freezing cold outside. In today’s video, we’re going to go to the next topic in our How to Write Queries Correctly series. In this one, we’re going to talk about views and how they compare with inline table-valued functions. I know that it just says views there with an exclamation point. I kind of ran out of room, but then it looked funny and the spacing was weird. So this is what you get. But there are some differences and there are some things that we should talk about because I see a lot of people making the same mistakes over and over and over again. And quite frankly, I’m getting tired of fixing them. So I’m hoping that we can talk through some things today and you can start not doing the wrong thing. First time for everything, right? As usual, if you like the channel and you want to support the channel, there’s a link down in the video description below. If you don’t want to do that with these things, you can do these things up there. If you need consulting for SQL Server, if you’re like, wow, that Erik Darling, he knows what he’s talking about. He seems like a nice, reasonable, fellow with reasonable rates who I could work with on MySQL Server performance problems. You can hire me to do all of this stuff. It’s a pretty good deal. If you’re into good deals, how about getting all of my training for about $150 for the rest of your life? Hard to pass that up. No upcoming events, end of year, blah, blah, blah. With that out of the way, let’s talk about views and functions and stuff.

Now, there are a lot of bad things that one could say about views because over the years, we, probably we, the royal we, I really just mean me, is a performance tuner, have seen people just do absolutely awful, egregiously disgusting things inside of their views. Views on their own are not the problem. It’s the way that people treat views. Views on their own are not the way that people treat views. They end up sort of being like a junk drawer for just like weird query logic.

And, you know, this is like coming from like a rather like personal place right now because like current, like one of my current clients, I was, I was trying to figure something out. And I did SP help text on eight view names before I found a view definition that touched one single physical table in this database. Every other view definition was selecting stuff from one or more other views.

I still haven’t gotten to like the root view, the like original vampire view that has everything in it. Like I just got to a certain point and I was like, that’s an, I, we’re not going any further. I have this figured out enough, but like eventually at some point I’m going to, I’m going to whittle it down and I’m going to figure out what the like core criteria is that is throughout these 50 billion views. And this is a common mistake that people make.

They think that views are some sort of performance thing. They think that views are automatically a materialized result set. They don’t understand that all they’re, all they’re doing is interacting with a query that has a name, right?

They’re just housing this query. And I understand the point of, of, of making them because, you know, if you have, you know, a lot of tables and, you know, it’s hard to remember all those joins correctly. And, you know, making sure you have like the one or more join columns, right?

And, you know, whatever other, you know, you know, manipulations you have to do to data in order to get the results that you’re after. They can, it can be time consuming and annoying. So I, I understand why people make views, but the things that they end up doing after that are just, just sinful.

Now, views really, you know, if we’re going to make, make a statement about views, they’re, they’re only as bad as you make them. If they’re bad, it’s your fault. You, you did all the bad stuff.

You put all the bad stuff in there. You, it was you. The view did not force you to do it. The view did not change itself overnight to become an evil view. You put all the bad stuff in the view and, and, and now you’re living with it.

So you have, you have reaped what you have sown, right? Reaped what you have sown. Whirlwind is in stuff.

So, let’s talk a little bit about the case for views, because there, there, there is one nice thing about views. And that, that is that if, if you obey like the 10 million rules that Microsoft has put in place, you, you can index a view. And then you do have a materialized result set and that is pretty nice.

Um, it, you know, uh, it gets a little dicey if you have more than one table reference in the view. Like if you’re joining multiple tables together, things can get kind of awkward with, uh, index view maintenance. Um, but in general, and like, you know, a single table index view, uh, with, uh, indexes, uh, available on the table in order to make index view maintenance, uh, very quick and efficient are really no different than having another nonclustered index available for your query.

Uh, there are some kind of funny things about index views. Like, um, you have to use the no expand hint if you want, uh, column level statistics generated on the view. Um, if you’re on standard edition, you don’t get the index view matching to the same way that you get it with enterprise edition.

So the no expand hint becomes even more useful, but, uh, even on enterprise edition, you need no expand for the column level statistics thing. Um, you know, so there’s, there’s stuff about index views that is kind of tricky. I’m not going to put them on the same level as partitioning because index views actually can make queries faster.

Whereas partitioning just doesn’t make queries faster. You’re lucky. You’re very lucky if you partition a table and, and, and, and performance stays the same.

Uh, in reality, partitioning a table is absolutely no different from having a good seekable index. On the table. I think where a lot of people kind of get confused is that when they partition the table, they changed the, they changed the definition of the clustered index and made it match better.

The, the sort of, uh, the path that the, the queries were taking to the data they wanted. And they’re like, wow, partitioning was magical, but really they just indexed the table poorly to begin with. So, uh, there are, so I forget the application name.

Well, like I’ve run into it with Looker. I know that there are a couple others that, um, sort of build queries based off metadata. And, but they, they, they’re unable to do that with inline table valued functions.

Um, they really only, uh, do that with views and stuff. For some reason they can’t see, uh, what that is to build queries off of it. Um, so, you know, so there’s that, uh, I, I really dislike having crappy applications like Looker dictate what sort of database objects I can use and create.

Um, but you know, some people make bad choices outside of the database too. And, you know, you’re, you kind of get stuck with them. Uh, index views, you know, they’d be, they’d be great if Microsoft would invest like an ounce of time into them.

Uh, they really haven’t gotten anything aside from like, you know, uh, like bug fixes, uh, since they first came on the scene. Um, that, uh, it’s, it’s really, it’s really a bit disgraceful what other database engines, uh, are capable of doing with index views and allow an index views that Microsoft does not. A bare minimum, like min and max aggregates.

Like how, like, like really, you can’t do that in an index view? Like what, that’s a sad, it’s a really pathetic state of affairs, really. Uh, you know, we, we, we’ve, we’ve gotten so many awful features that have, that have died on the vine.

Uh, and, and we can’t get min and max support and index views. It’s, one, one wonders where the people in charge of SQL Server store their heads. We, we wonder if maybe there’s a, there’s a glass belly button joke hiding from us in there.

Um, one kind of weird thing about views, and this is something that, uh, I, I, I, I, like I have read before, but it never really sticks with me. But you can create views with a, with, with an, with check option. And, uh, I’m just gonna read from the documentation that it forces all data modification statements executed against the view to follow the criteria set within select, within the select statement.

I don’t know why there’s an underscore there. I copied and pasted this. Uh, when a row is modified through a view, the width check option makes sure that data remains visible through the view after the modification is committed. Okay. If that’s important to you, use the width check option.

Um, I, I don’t know when I’d want that to happen. Uh, can’t think of anything quickly. So, views, you know, uh, they, there’s a, there’s a tremendous propensity for people to put a lot of bad crap into views and ruin performance over time by nesting, nesting, nesting, nesting, nesting, nesting, and putting worse and worse, more complicated queries in each sort of level. Uh, and then, you know, uh, at some point along the way, you’re like, oh, hey, hey, Dante.

Oh, Lucifer, it’s you. Nice to see you here. Uh, but inline table valued functions, you know, they, they, they do offer some things that views don’t. Uh, namely the ability to, uh, add parameters to them, right? Views don’t accept parameters in the, in the definition inline table valued functions can.

And that can allow you to push predicates a little bit further into queries, uh, in some circumstances than views will allow. Well, we’re going to talk, we’re going to show you that example in a minute. Um, you, of course you cannot index a function.

There are no, no such thing in the current state of SQL Server is, is, is materialized functions. Uh, so, I don’t, I don’t really know that I care. I mean, it’s not, it’s not, it’s not, it’s not that big of a deal.

And I can only imagine what awful things people would do with them. Uh, but, uh, the, the ability to pass parameters to a function is a pretty big deal in some cases. And again, we’re going to look at that.

Now, um, it would be nice if, uh, Uh, there were a way to pass parameters as queries to certain things, right? Like it would be cool if, you know, like if you had like, just, you know, as a, as a sort of a stupid example, let’s say you had a store procedure that took, that was responsible for taking full backups.

And, uh, you know, like the, really the only way to pass a parameter to that or pass a value to that, uh, store procedure and then have it do something with it is to exit is like, you know, build a loop or a cursor or an array, like a CSV and pass it into the store procedure. But even then, like, even after you, if you, even if you pass a CSV, uh, like a variable or parameter in, uh, you still have to break it apart and do stuff with each individual line inside the store procedure. It’d be really cool if you could do something like this, where, you know, you would say, take a full backup and database name is the result of this select query.

Like that would be awesome. Cause then you wouldn’t have to do all the like weird internal work to like, you know, uh, write a cursor correctly or write a loop correctly. And then like, you know, other things you could, you could just do this and life would be a lot easier.

Uh, but you can’t, which is kind of lame, but you know, it’s not, not really the whole point of this. Now, the thing that I want to show you with inline functions is, um, uh, they, when I first came across this, I was really, really puzzled, uh, because there was a view and, uh, uh, in, in, in, when you called the view, like in, like in, like outside of a store procedure with a literal value, everything went fine.

When you called the view inside of a procedure with a parameter things went not, not fine. And we’re, we’re, we’ll, we’ll talk about that. But, um, the, the thing that, so like that, that does lead me to Microsoft did add a fix for this.

Uh, the thing is you have to be on, uh, at least SQL Server 2017 CU 30 and have query optimized or hot fixes enabled, uh, in order for it to kick in. Uh, I’m not exactly sure which CU for 2019 this was available on. Um, I would probably, I would, I would imagine that I can’t remember if it was available in 2019 from the get go or, but I’m pretty sure it was back ported because it’s under, it’s, it’s, it’s, it’s, it’s, it’s, it’s, it’s, it’s in 2022 under compat level 160.

So I think whatever CU came out around the same time as CU 30 for 2017 is probably where this thing ended up for 2019. Um, but, uh, if you’re on SQL Server 2022, you can just use compat level 160 and get the, uh, get the same behavior. So, uh, what I want to show you here is that you could run this and get the fix that I want to show you, but I have query optimizer hot fixes not enabled.

Uh, you could also enable trace flag 4199 as long as you’re not, um, you know, Oh, geez, that could have been a disaster. As long as you’re not, um, uh, you know, uh, hampered by anything like, uh, being on, uh, Azure SQL DB or managed instance or, uh, not have sysadmin privileges. It’s kind of a downside of DBC, of D trace flag stuff like, uh, like DBCC trace on or, uh, uh, you know, query trace on and there’s a query hint.

Uh, and I’m also not in compat level 160, even though I am on SQL Server 2022. Currently I have this thing set to, I think, compat level 140. Um, just because, uh, I don’t know, sometimes too many things kick in, in the higher compat levels that makes coming up with, uh, good repeatable demos kind of difficult.

And, uh, as a presenter, I often just need good repeatable demos. Um, sometimes the higher compat levels do offer that because they do something really bad with some of these new features. But, you know, for the most part, I like just the stability of like not every single intelligent query process or feature trying to kick in and, and like, you know, ruin my day.

But anyway, uh, we have this view and the main, the main thing in this view that will cause us, uh, problems down the line is going to be the windowing function. Now, uh, I know that I talked in the CTE video, the first CTE video about, um, there’s going to be a second one. So that’s why I’m saying the first one about how, about pushing predicates, um, to window functions, uh, from outside of CTE.

And how, if you have the, uh, the partition by column, uh, is the column that you’re filtering on SQL Server has an easier time of doing that. But, uh, that does not hold true with views in all circumstances. Let me just make sure I actually created that.

So, uh, we have query plans turned on and if we run this query, this will run very quickly. Uh, we will have a very nice, easy index seek right here. Everything is fine.

Uh, even though, uh, well, simple parameterization was at least attempted here. Uh, I don’t, I’m not going to dig in and figure, I don’t, don’t believe it was successful because we, this, the query plan looks like this. But, um, the reason this works fine is because we have a literal value right here.

Uh, not a variable, not a parameter, not a placeholder of any variety. So everything kind of looks how we would expect because we have an index on owner user ID. And, you know, uh, we’re also, uh, that also has the score column sort of descending in it.

So our making this dense rank is very, very simple for us. Now, uh, I could do all, any of the things that I mentioned above with the, the database scope configuration with the trace flag, blah, blah, blah.

But, uh, what I want to show you before I do any of that stuff is what the query plan looks like with a parameter touching that view without any of the fixes in place. And this takes a little bit longer and we no longer have a nice, simple index seek plan. Uh, now this takes about seven seconds or sorry, about six seconds, I guess.

So we have a nice, simple index seek, we scan the entire table. Uh, we generate our row number over here, segment, segment, sequence project. And then way over here, we have a filter, uh, say where the predicate equals owner user ID, right?

The parameter that we passed in, uh, the, the limitation here is where SQL Server can’t push a parameter or a variable past the sequence project operator. It gets stuck outside of this thing. If we recreate the store procedure with option use hint, uh, query optimizer, compat level 160, uh, the plan will go back to what we expected.

It’s a nice, fast, simple index seek, and no longer having to wait about six seconds to do all this stuff, right? The, the, there’s no longer a filter over here. We were able to push that down past the sequence project.

It would be nice if a lot of other things, um, worked under, uh, worked when you hinted higher compatibility levels. Um, just as sort of a stupid example, like, um, uh, there are a lot of, uh, new, uh, functions and functionality added to T SQL, uh, with each release. And, um, you would think that if, if you put a query level hint on to say, like, use string split or string ag or one of those other things that comes along, but is only available under higher compat levels, you would think that you would be able to access that stuff just by using the query hint for a higher compat level.

But you, you can’t, uh, Microsoft wants you to use the, the, the, the higher compat level all around in order to get that, which, um, is, is a perilous venture. Uh, but the, since we’re comparing views and inline functions here, uh, I do want to show you that if we created the, this as a, as an inline table valued function. Uh, now, of course, this is not a scalar UDF and this is not a multi-statement table valued function.

Uh, we are returning a table, which just returns a select. There is no table variable. There is no data type involved here.

We are returning the results of a select query, right? So if we do that, we can pass a parameter in here and here. And if we do this, even without query optimizer hot fixes enabled, uh, SQL Server is able to, uh, push that seek or push that parameter down into there just the way that we would expect.

So even if you’re not having this specific problem, uh, I think it is often, uh, worth exploring, converting views into inline table valued functions. Uh, just because if there is a common filtering or joining criteria, uh, it’s very, very convenient having parameters to express, uh, express that into be able to pass those in. Um, it better shows the intent of the module and what it can be used for.

And it prevents developers from forgetting filtering. I thought that filtering criteria and getting really like just exploded out results. So, uh, this is just a sort of short walk through the differences between views and inline table valued functions.

Um, uh, you know, again, uh, views, you can materialize them. Um, if you follow the 10 billion rules, uh, inline table valued functions, you can’t materialize, but you can pass parameters too, which can be a very, very valuable performance tuning thing. Um, you know, when you like apply to an inline table valued function, then you can pass column names in, uh, for the parameters that you can get often get very, very nice, uh, performance increases doing stuff like that.

Uh, but anyway, uh, I hope you enjoyed yourselves. I hope you learned something. Uh, the next video will be about, uh, union verse union all.

Um, uh, we’re going to explore, uh, uh, sort of like where union starts making results distinct and things like that. And, uh, we’re also going to challenge, uh, uh, uh, a very common performance tuning, um, uh, uh, uh, I don’t know, just, it’s a strongly held religious belief that union all is always faster than union. And we’re going to look at an example where that is not true.

So I hope that, hope that you hope you’re wearing your helmet for that one. You’ve got your crucifix and garlic and all that stuff. So anyway, uh, I’m going to get this one, going to get this one sent along to YouTube and then, then record that one.

So see you shortly.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

How To Write SQL Server Queries Correctly: Common Table Expressions

How To Write SQL Server Queries Correctly: Common Table Expressions



Thanks for watching!

Video Summary

In this video, I delve into why Common Table Expressions (CTEs) are often overused and misapplied in SQL Server queries. I argue that CTEs can be a bit like 70s fashion—faddish and largely unnecessary, with little real benefit beyond readability, which is subjective at best. I explore the performance pitfalls of CTEs, explaining why they can lead to suboptimal query plans and poor execution times. Additionally, I highlight how different database systems handle CTEs differently, using Postgres as an example where CTEs are materialized by default, contrasting it with SQL Server’s behavior. The video also covers specific scenarios where CTEs might be necessary or useful, such as when dealing with windowing functions and row numbering, but emphasizes that in many cases, simpler alternatives exist. I provide practical examples to illustrate these points, showing how the placement of filters can significantly impact query performance and results. Finally, I touch on the future of T-SQL, speculating on potential improvements like the `QUALIFY` keyword, which could make CTE usage more efficient.

Full Transcript

Erik Darling here with Darling Data. Today’s video where we continue our series on how to write SQL Server queries correctly, I’m going to continue to aggravate certain portions of the public by talking about how completely stupid CTE are. Looking forward to it. Before we do that, let’s talk about stuff that we talk about all the time. If you like this channel, if you find the content in this channel valuable, and you would like to support this channel to the tune of four bucks a month or more, depending on your level of kindness and generosity, depending on where you are in your life as a matter of salary, you can click on the link right in the video description at the top there that says become a member, and you can join the 30 or so other people who have done that and contribute to the video description. I want to contribute a little bit to keeping the lights on for this channel. The light bulbs that it takes to do all this are very expensive. If four bucks a month is just too rich for your blood, for whatever reason, you know, your mom’s in a nursing home or something, liking, commenting, subscribing, all wonderful ways to make me feel cherished by you.

If you watch these videos and you think, wow, that Erik Darling sure does know a thing or two about SQL Server, perhaps he could make my SQL Server faster in exchange for money. You can hire me as a consultant to do just that. I am good at all these things, best in the world, according to most. So why not take a chance on that? If you would like some very high quality, very low cost SQL Server training, you can get all of mine, again, link in the video description down yonder. You can get all of mine for 150 bucks. And that’s for the rest of your life. No subscription required. Isn’t that lovely?

No upcoming events. No upcoming events. 2024. End of it. 2025. We’ll get into it again. I will go to as many pre-cons as my wife will let me. She does miss me terribly when I’m gone. So I can’t just fly around the country every weekend, but, you know, I’ll do my best to get to your very important event. But, yeah, let’s talk about, let’s expose CTE for the fraud that they are.

So CTE for me are a lot like clothes in the 70s. A bunch of people with absolutely no taste convinced a bunch of other people with absolutely no clue that they should dress just like them. And so if it weren’t for us getting the 80s after the 70s, the human race would have absolutely no redemption arc. In fact, since like probably the mid-90s or so, the redemption arc is descending. I’m not sure if you feel the same way, but boy, I think 1980s was the peak of human civilization.

That’s when all of the best things, all of the best music and movies and everything was going, fashion was wonderful. Nothing better. Now CTE were implemented to fill in some blanks that derive tables left, right?

Things like being able to re-reference them in queries and things like that. But the first, the problem with CTE is that the very things that they were designed or that they were implemented to address with derived tables are the things that make them suck from like performance-wise. I didn’t know what a petard was until I read this, but apparently it’s some kind of stick that you can hoist people with.

I guess it’s like a wedgie stick where you can pick someone up by the back of their underwear and just hoist them up and wiggle them around and make it real uncomfortable. But people use them the same way that they use nose and air hair trimmers where they just jam them in and wiggle them around. Obviously, nose and air hair trimmers weren’t my first choice of metaphor here, but I’ve got to keep it family friendly.

But they just jam it in and they wiggle it around. Maybe they just keep wiggling until they stop hearing hairs get trimmed here and then that’s it. There’s very little actual mental feedback for most people about if CTE are good, bad, or ugly.

You can’t see in your own ear too well and you can’t see up your own nose too well, but all you’re left with is a lack of clipping noises. It’s amazing to me how many people will just stick with, like get awful performance using CTE, but stick with them just because of this misguided notion that the query is more readable. They read a style guide.

The query is more readable. They read their 70s style guide and now that they’re showing up with crocheted bell bottoms and velour neckerchiefs and spread collar shirts with gold necklaces tangled in chest hair. Stop.

Okay, just do yourself a favor. Stop. Way back. CTE are one of the least advanced components in T-SQL. I’m going to cover more about that.

You can probably see up at the top there’s another tab 14, tab with numbered 14, where we’re going to talk a little bit more about CTE usage. But really I just want you to know there’s nothing all that interesting or advanced about them. Anyone who says they have anything advanced or interesting to teach you about CTE is either a complete simpleton or a charlatan.

Pretending that there’s an advanced notion, advanced usage of CTE. It’s the same level of idiocy as explaining joins with Venn diagrams. Like semi-colored circles is going to help anyone understand what their join query is doing.

It does absolutely nothing for anyone. So you should cast those people by the wayside because they are not good people. Now, if you’re coming from a different database platform, you might have a different experience with CTE.

For example, Postgres, by default and when considered safe to do so, CTE get materialized. They get like, it’s almost like a temp table. They cache a common sub-expression type thing.

It’s almost like a spool or a temp table, whatever you want to call it. It’s a temporary object that caches the result of the query. I have a couple examples from the Postgres documentation.

But if you’re coming from, like if you have a Postgres background and you’re used to CTE behaving in this way, you get to SQL Server and you’re like, wait a minute, why does this suck? Well, because it doesn’t behave the same way.

You’re also probably going to be wondering why your read queries are blocking and deadlocking with your write queries because SQL Server does not use an optimistic isolation level by default either. So woe to you, fine folks out there.

So with Postgres, you can choose. I mean, notice the red squiggles under here, right? Make this not valid T-SQL, maybe someday.

But in Postgres, you can choose to materialize a CTE result, which is just like sticking in an attempt table. And you can work off that.

But when you don’t materialize it, you re-execute the query in here as many times as you touch the CTE. Now, CTE, for some reason that I cannot fathom, get a lot of developer defense to the same extent that table variables get. It’s befuddling to me.

Like people will write a query, use a CTE, have it perform awfully, but then sit there and be like, oh, but it’s so readable. Oh, look how well I can read this query.

I have so much time to admire this readable query and re-read my query. Will I wait for this query to finish running because I’m sticking with this stupid CTE? It’s fascinating to watch.

It is, there’s some sort of mental disorder going on in people who do this sort of thing. Now, there are times when CTE will have no impact on anything, right? So if I run both of these, make sure query plans are turned on.

If I run both of these queries, one with a filter inside of the CTE, one with a filter outside of the CTE, we will get identical results and identical query plans.

And they make no difference in this case. None whatsoever. Right? So there are times when SQL Server just is nice enough to optimize the CTE away. It just throws it right out.

Now, one place where you do have to use CTE currently in SQL Server is if you need to sort of have some runtime expression, like a row number, and you want to filter on that row number.

SQL Server doesn’t offer a way to do this with a single query. Other database engines have a qualify keyword, which I’ll show you in a second, that allows you to do that without nesting your query at all.

So, but another interesting thing is that if you are going to put, you are going to use windowing functions in CTE, and you want to filter on stuff outside of that, sometimes SQL Server is unable to push your predicate up into where it, up into the CTE.

Now, it can’t do that here because the column that I’m filtering on, vote type ID 8, is not a partitioning element in the windowing function up here. So, what happens is I have to run this whole query.

I have to generate a row number over the votes table. And then once I get outside of the votes table, and I have generated my row number, and I start filtering on my row number, only then is it safe for SQL Server to apply a filter to the, to the, apply the vote type ID filter to the query.

Now, this, this sounds funny and weird, but obviously if I were also partitioning by vote type ID, I would be answering a somewhat different question with the windowing function.

Right? That would be like, like partitioning by user ID and vote type ID would mean that I am effectively asking SQL Server to rank things differently than just by user ID. There, there are times when it’s safe to do that in the partition by, but not here.

So, if we look at the query plan for this, you’ll see it ran for a heck of a long time. And if we look at what happened, the, the details of the filter operator, you can see that there’s a predicate on both vote type ID 8 and this expression 1 0 0 1 equals 0.

That’s basically this where clause right here. So, uh, obviously vote type ID 8. You can, that’s pretty easy to work out.

But then, uh, this being equal to 0 is the expression that is also evaluated in that filter. So SQL Server had to do all the work to generate the row number to then apply those filters later.

Uh, but I want to talk about how that is sort of answering a different question. So what I’m going to do is I’m going to create a table, uh, called the top answers of all time.

And I’m going to put, uh, uh, I don’t know, like 2,500 or so rows of Paul White in there with this number. And I’m going to put 10 rows of Erik Darling in there with that number minus whatever row number we’re at, right?

So row numbers 1 through 10, I’m going to subtract numbers 1 through 10 from this, right? So insert those rows in. And this is what the table looks like.

I have my 10 rows down here where I have a decrementing value from Paul’s big score down through all this stuff. And then I have, uh, 2,500 or so rows of Paul White’s high score, right?

So the reason why that answers, why the part, this answers different questions for like the example query that I was showing you up there was, let’s say that this is the initial query where I want to find the top answers and then find out the top 10 answers, right?

That’s this filter. And then ask if any of them have the answer, answerer name Erik Darling, right? So let’s just run this internal query first so I can show you what’s going on in here.

Obviously all of these 999s, they, they all tie, right? But they get, but they get an incrementing row number. It’s not like rank where rank would give you one for all of them. Uh, row number and dense rank will give you this incrementing number across all of them.

Uh, but if we get down here to the end, you’ll see my final 10 rows where like, obviously I’m, I’m out of the picture. So now if we run this to say where row number is less than or equal to 10, we’re going to have Paul’s 10 rows.

So if I’m asking the question, who are the top 10 answerers of all time? And are any of their names Erik Darling? The answer is going to be no, right? But if I change that and I put this in here, you’re going to, and I run this query, you’re going to see just my answers ranked, right?

So high answer to low answer. Now, if I run this query, we’re going to get those same 10 rows back because I only have 10 in the table, but you can, you know, if you want to just see a slightly different take on it, here’s the, here’s like the top five answers of Erik Darling.

So you actually answer different questions depending on where you put filters in for windowing functions, right? So that’s something to be aware of when you’re, when you’re using them.

Now, this does bring us to a case where CTE are generally okay because they’re generally needed, right? You can’t calculate in SQL Server at current, the way T-SQL is designed, you can’t filter on a row number within one query.

Like I said, you have to nest the query, generate the row number, and then filter on it. Now, Snowflake, a different database platform, has this keyword qualify.

Qualify allows you to either put a row number directly into the qualify clause like this, right? Like you can say qualify this equals one, or you can put a row number in your select list like this and then say qualify row num equals one.

Now, this is obviously just a little bit of syntactical sugar because you’re going to end up with the, kind of doing the same thing is like the query plan that we saw when I showed you the filtering thing where you’re going to have to run the query, generate the row number, and filter on it.

This is just a nice compact way of doing it. I think this should, obviously, I think this should be in T-SQL because it gets you out of a lot of like extra typing and nesting queries when you generate a row number. It’d be really nice to be able to do it in place like this.

Whether it’ll ever happen or not, I don’t know. Microsoft is busy burning all its money on OpenAI, co-pilot, so who knows, right? T-SQL just, who knows?

SQL Server might get no attention whatsoever. Who knows? Who knows, right? It’s crazy. You’ll probably get the ability to put a CTE inside of a CTE before you see any actual useful progress to T-SQL as a language.

But one of my favorite uses of CTE is paging queries. And this is a technique that, again, I mentioned Paul White for the five billionth time.

This is a technique that I learned in 2009 from Paul White blog posts where you can stack CTE like this. And this doesn’t have the same performance impact that re-referencing CTE via like joins does.

And I’ll show you what I mean. So let’s run this whole thing. And the reason why this is okay is because we only end up with, well, for the CTE part of it, I do want to point out that I do join back to the post table here to get, why are you red?

You should always be pink. You silly, you silly ghoul. We do have two references to the post table because I joined back to the post table here. But in this query plan, what looks kind of funny is the scan of the clustered index, the generating of the row number, and then a top, and then a filter, and then another top, right?

But the CTE isn’t to blame for the two references to the post table. I explicitly re-referenced the post table down here. So this is where you can see where I filter on the row number inside of the CTE, right?

That’s this thing right here, right? So the two tops in here, one is the one in this query, right? Where I get the top page number times page size.

That’s the first top. That’s this thing right here. Page number times page size. And then this other top where, ah, gosh darn it.

There we go. Where I’m just getting the top page size. So that lines up exactly with what I do here and what I do here. And then we saw in the filter operator, we saw this.

This is a really good way of writing page inquiries because you only hit the base table for a limited number of columns, filter down that primary key just to the rows that you care about, and then get all the columns you care about after you’ve reduced the rows that you care about.

So this is okay because we’re not re-referencing the CTE outside of this, right? We have them stacked up, which means that we hit the post table once and then run all the logic from these stacked things on the post table.

So where that gets different though is if you re-reference the CTE. So let’s say that I write a query like this, and this is an actual, this is inspired by an actual client query where they were doing this exact same thing, and their idea was to get the top, I mean, this was a different piece of software that had different things, but the idea of this query is to use the row number function to rate a particular user, sorry, a particular user’s questions by score descending, right?

So we want to find that essentially the top five, and the way that we’re doing that is for every, we left join to, ah, zoom it, curse you. We left join to the CTE to itself five times for five, and for each one, we get a slightly higher row number, right?

So we get row number one there, row number two there, row number three there, row number four there, row number five there. So when you see the query plan for this, you’ll start to understand what I mean by the petard hoisting of CTE.

So we’re derived tables, you couldn’t do that re-referencing. With CTE, you can, but unlike Postgres, which will, again, when it’s considered safe, materialize the result of a CTE, we don’t get that here.

We get a query plan, if I can manage SSMS, we get a query plan that hits the post table one, two, three, four, five times. For some reason, the first one is nice enough to go parallel.

The rest of them are just single-threaded scans of the post table, but each one of these, four seconds, four seconds, four and a half, oh, sorry, like four and a half seconds a piece. But the query as a whole takes almost, it takes about 18 seconds to run.

This obviously isn’t a good situation for the CTE. I don’t care how readable it makes your query. If it takes 20 seconds to get one row back, it stinks.

You shouldn’t just, it’s so readable. Who cares? Your query sucks and it’s slow. There are more important things to worry about. Like your end users aren’t going to call you up and say, hey, thanks for that nice readable query.

They’re going to say, hey, thanks for this really fast query result. I don’t have to wait my whole lunch break to get my report back. All right? CTE are not performance solutions. I don’t even think they make queries more readable.

Good formatting makes queries readable. CTE do nothing for query readability and do a lot to hurt query performance. So a lot to hurt query performability. How about that?

So you could do this in two different ways. You could create a temp table very cheaply with the five rows that you care about in it. And then you could join to that very small temp table five times in the same way that we did before.

And none of that takes anywhere near 20 seconds. You could also use pivot in this case. And you could get that very quickly as well. Again, with only one scan of the post or one scan of the post table.

So both of these end up just about the same. We do have to scan this, the temp table five times, but a five row temp table is pretty cheap to do that with. So there may be times when you need to build something recursive with a CTE.

And with recursive CTE, one sort of annoying thing is it in the recursive portion of the query, like the anchor portion of the query where you get like the row or rows that you want to build the recursion with.

You can do like almost anything in there. But with the anchor part of the query that you’re using to do the recursion down with, you can’t do anything like distinct or top or offset fetch to only get like one result per recursion.

But you can use a derived query inside of there. You can use row number inside of the derived query, and you can filter on the row number outside of it.

This is actually a pretty good performance thing for a lot of recursive CTE because a lot of the times when I see people build them, there’s a lot of duplication in them that shouldn’t be there.

So this is what I mean by that. We have the anchor part of the CTE here, and we have the recursive part of the CTE here.

Now, since we can’t put distinct or top or anything just like right in this query, like the way that we could write in a normal query, we have to nest it, right?

And we have to nest and do a row number in here and then filter on the row number outside of that. But you can do that in there just fine. And that’s one way to get around a lot of weird stuff that happens in recursive CTE.

So CTE can be handy to add some nesting to your query so you can reference generated expressions in the select list as filtering elements and where clauses.

Right now, SQL Server doesn’t have a qualify keyword that allows you to do that with windowing functions like some other database platforms do. They can even be good in relatively simple use cases. But remember, SQL Server does not materialize those results.

It would be nice if it gave you at least the option to. In the optimizer, it would be even better if it had some rules in place to automatically do it when it’s safe and you’re referencing a CTE multiple times.

I spend a lot of time in my consulting work putting CTE results into temp tables to avoid the re-execution problem that I showed you before and to reduce a lot of query complexity issues where bad cardinality estimates from a certain portion of a CTE leak out into other parts of the plan and lead to other bad choices elsewhere.

Materializing that result set and letting SQL Server build statistics on that result set is a really easy way to make a lot of queries go a lot faster. In complicated queries, CTE often do way more harm than good.

And any excuses that people feed you around readability are just throw them away. Again, it’s a stupid response. People will spend an inordinate amount of time trying to come up with cases where it’s okay to write queries the wrong way.

Mostly because they’re lazy and they want to just do the lazy thing and they want to keep writing queries the wrong way because they think they found this one magical time where it’s just okay and safe to do it.

And that’s almost never the case. You might find a query more readable by using CTE, but the optimizer will not. It does nothing to help the optimizer, does nothing to help guide it to better choices.

You can create, at least currently, some performance fences around things in CTE or derived tables by putting top or offset fetch in them, but that does not materialize the result.

It will sort of isolate that portion of the query, which can be useful sometimes, but there’s still no materialization. So if you put a top in there and you re-reference that CTE like I showed you in the demo query just prior, it’s a bad time.

So please use CTE really carefully. Don’t just assume that they are going to do anything best or better for performance.

Don’t think that they help the optimizer in any way. They really don’t, and you can run into a lot of big trouble with them if you start using them inappropriately. And it’s really easy to fall into that trap because you think you get this free lunch being able to re-reference them, but you just don’t.

So anyway, that’s about enough on that. This video went on longer than I thought it would, which story of my life. Again, I find myself having to apologize for the length here. So the next video, number seven, is going to be about views versus functions.

Very, very few demos in there. It’s a lot of spoken word poetry. So if you’re the type of person who really enjoys demos, maybe you can just skim that one a little bit.

But anyway, thank you for watching. I hope you enjoyed yourselves. I hope you learned something. And I will see you in the next video where we compare views and functions as a big happy family.

So great. Cool. Goodbye.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

How To Write SQL Server Queries Correctly: IN and NOT IN

How To Write SQL Server Queries Correctly: IN and NOT IN



Thanks for watching!

Video Summary

In this video, I delve into the nuances of using `IN` and `NOT IN` clauses in T-SQL queries, particularly focusing on how these directives behave when dealing with nullable columns. I share practical examples from my series on writing SQL Server queries correctly, emphasizing why `EXISTS` and `NOT EXISTS` are often superior choices due to their consistent handling of nulls and better performance. I also address the complexities that arise with `NOT IN`, such as the defensive query plans generated by SQL Server when null values are involved, which can lead to significant performance issues. The video concludes with a comparison between using `IN` and `NOT IN`, highlighting the importance of understanding these behaviors for writing efficient and reliable SQL queries.

Full Transcript

Erik Darling here, with Darling Data. Look at that nice blue. Really brings out the suffering in my eyes. Anyway, today’s video, guess what? We’re going to continue my series on how to write SQL Server queries correctly. And in today’s video, we are going to cover in and not in, because there are some funny things about these directives in T-SQL. But of course, before we get on with that, let’s talk about you and me, and you being nice to me for once. Mom. If you would like to sign up for a membership to my channel, there’s a link right in the video description that says become a member. You can do so for as little as $4 a month. If $4 a month would of, want. www, Eh?upg, not 100문in.

2015. *** buscar leads. directives to use in SQL in general. We’re talking about T-SQL specifically, but these rules will generally apply across all SQLs. So just really, you know, stick to exist and not exist as much as you can. In and not in, you know, if it’s literal values within, fine. But even if it’s literal values with not in, you have to be very careful about nulls. And you have to write your query defensively to explicitly remove nulls in order to get that working. Now, I did a performance focus video about not in, and I’m going to rehash a little bit of that. Again, oh, why aren’t you spelled out correctly? Oh, I hate when I use old code where I didn’t use the full word integer.

It’s very embarrassing for me. But I’ve already created two tables that allow, that are integers that allow nulls. And I’ve written two versions of, I’ve populated them explicitly from two different tables in the Stack Overflow database where no null values ended up in the tables, but the columns are nullable. Now, the big problem here is that SQL Server adds sort of like without even like announcing itself unbeknownst to you, adds a whole bunch of defensive stuff to the query plan in the event that a null occurs in the results. I’m going to show you what that looks like. So what I’m going to show you is this is the estimated plan for this. It does all sorts of things in here.

The estimated plan doesn’t really, like this is mostly just to show you that there is a lot of added complexity. We hit old users once, twice, three times, and we touch new users once. But the actual execution plan, you can see just how painful this was. This ran for over 20 minutes, 21 minutes, 22 seconds. And a lot of the problem in this query was, it’s a query plan pattern that is really, really bad. But SQL Server just inserts it when you use not in with nullable columns, right? See, like, you might see this query plan pattern in other places in SQL Server, but this is where SQL Server just throws it in there to protect itself. So you have this top above a scan. And like, if you see a top above a scan, and the scan is on a big table, you’re in real trouble. That’s never any fun, because this nested loops join is going to make this do a lot of work, right? So you have a lot of rows that come out of here, you have a lot of rows that go into the loop join, and you have a lot of scans of the old users table. That’s a bad time. Now, what I said about using not exist, like not having to worry about having perfect indexes, or writing really complex, or rather, adding complexity to the query to like, like hard code protection against nulls. Like, sure, I could add an index on old users on the the user ID column, or whatever I called it. And there would be a top above a seek, and that would be faster. But you would still run into the sort of logic, I’m going to call it a logical inconsistency, even though it’s consistent behavior. It doesn’t feel right to me for not in to screw up with nulls the way it does, or to handle nulls the way it does. So you could add an index, sure, and that would be faster. But now you have to worry about, you know, indexing your temp tables every single time you do this pattern. You could add a bunch of explicit not null checks on the old on the new users and old users table, you could write that query very defensively. Or you could not worry about any of that stuff, and just use the not exist version. This, that’s what I’ve done here, we say, you know, select the records from new users where not exists, correlate on that. And rather than taking 20 minutes, this runs in a few seconds, actually about two and a little under two and a half seconds.

So apart from the logical reasons for avoiding not in when with nullable columns, because, you know, let’s face it, a lot of people out there are afraid of making a column not null, because who knows, right, for the same reason that developers will make every string column varchar 255, or even a max data type, just because they’re afraid of truncation errors, or who knows what, even though the column is like state code, and you’re like, wow, like Massachusetts, New York, California, NACA, NYCTA, like none of these things are ever going to be 255. But who knows what will happen? Maybe, I don’t know, I don’t even have a reasonable thing to put in there. Maybe someone will like, say they live in every state or something, and you’ll have, you know, 50 times two, and that’s 100, but not even 255. Wow, that’d be real hard to do that. But developers screw things up constantly. That’s why I am a consultant with reasonable with reasonable rates, fixes these things. Okay, so anyway, you are far better off from both a performance and a consistency point of view, using exists and not exists over in and not in. Like I’ve said a few times now, it doesn’t really matter with in, because in does works the way that the same way that exists does, regardless of nulls.

So if you have a list of literal values or a subquery, you are safe using in. If you have a list of literal values or a column to column comparison with not in, and those columns are nullable, but don’t contain any nulls, you can end up with a really wacky query plan unless you explicitly filter out nulls with your query. And you have really good indexes in place to support that query.

With not in, if you like, if you just have a column and a list of literals, you still have to worry about that column, because any nulls in that column will make the list of literals from the not in clause misbehave, right? So just be like, be wary out there, exists and not exists are the better choice, because they both act consistently with probably what you would want to get back for your query results. Since there are no nulls, the first query returns correct results, but the amount of work SQL Server has to do to make sure that it doesn’t encounter any nulls or that it can behave safely if any nulls is pretty absurd, right? 20 something minutes to do that work sucks.

You can, of course, index the temp tables, but a lot of people have read one single blog post in their entire career about that and think that it’s always a bad idea. So it’s a separate conversation. But, you know, I’m going to be honest with you, I’ve had a lot of really good luck indexing temp tables in my life. So, what can I tell you? What can I tell you? The SQL Server has room for a breadth of experiences in when it comes to performance tuning. But anyway, I was going to say something else, but now I’m just giggling internally and I’ve lost my train of thought. So that was just about it for in and not in. Again, exists and not exists are usually far better options. Next up, I’m going to talk about CTE a bit. I actually have two videos coming up for CTE and they both sort of have different approaches to them. So there’s one that I’m going to talk about next and there’s one that I’m going to talk about, one that I’m going to go at the very end of the series. So there’s multiple contents on common table expressions because there is quite a bit to say about them. So anyway, thank you for watching. I hope you enjoyed yourselves. I hope you learned something and I hope to see you in the next video about common table expressions, which is going to be fun.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

How To Write SQL Server Queries Correctly: Apply

How To Write SQL Server Queries Correctly: Apply



Thanks for watching!

Video Summary

In this video, I dive into the world of SQL Server’s `APPLY` operator, specifically focusing on `CROSS APPLY` and `OUTER APPLY`. I explain how these operators can transform your queries to be faster and more efficient, especially when dealing with complex operations like windowing functions or derived joins. Whether you’re looking for top N rows, implementing row numbering, or performing pivot-like transformations, this video covers it all. I also highlight the importance of having appropriate indexes in place to leverage `APPLY` effectively, as well as discuss scenarios where using `APPLY` can lead to better performance than traditional derived joins. By the end, you’ll have a solid understanding of when and how to use `APPLY` to optimize your queries.

Full Transcript

Erik Darling here with Darling Data. Boy, am I hungry. It’s been a long day. In today’s video, we’re going to talk about, continue with my series, How to Write Queries Correctly. In this one, we’re going to cover the usage of apply. That’s cross-apply and outer-apply and different ways you can use cross-apply and outer-apply to make your queries faster and better and, I don’t know, do things like that. I like talking about apply. I like talking about apply. Hopefully, you’ll like me talking about apply. Hopefully, it just works out for everyone, right? Anyway, before we do that, we of course have some monetary concerns. We have to talk about fiscal policy here. If you would like to become a member of this fine channel, and maybe say thank you for the billions of hours of content that I produce, you can become a member. Join like, 30-something other people who have become members by clicking the first link in the video description down over there. It says become a member. If you are, I don’t know, if you are hawkish on your financial policy and you say, four bucks a month, Erik Darling, jeez, I don’t know. You can like, you can comment, you can subscribe. You can make my numbers go up in other places. Hopefully, it’s not going to be like blood pressure and cholesterol, because when those go up, bad things tend to happen.

Anyway, let’s not get grim. If you need help with your SQL Server, I am a SQL Server Performance Tuning Consultant. Some might even say they work at BeerGut Magazine, the best SQL Server Consultant in the world. I am available for all of these things and more. And as always, my rates are reasonable. If you would like to get access to my paid training, my paid SQL Server Performance Tuning training, you can get all 24 plus hours of it for about 150 USD.

And that lasts for life. It’s not a subscription. You have to renew every year. You buy it once and you have it. And then you can watch it and get better. And then you can be good at SQL, like me. Link, coupon code, done. No events. 2024 is done for events. 2025, I’m watching you. If there’s an event near you and you think Erik Darling would be good at that event, well, if you tell me what that event is, I’m not psychic. I have many things. I’m not psychic. I cannot magically guess which event you think I should go to. Let me know what it is.

And maybe I’ll be able to go. Who knows? I don’t know yet. Because you haven’t told me. All right. Now, let’s talk about apply. Now, I end up converting specifically a lot of derived joins, particularly ones that have windowing functions in them, to use apply instead for very specific reasons. Sometimes you need to create an index to support that thing. But mostly you want to avoid the eager index pool.

You need to at least be able to seek into an index on the inner side of a nested loops join to have that make sense. But particularly ones where row number is involved, it makes a lot of sense for reasons that I’m going to explain. All right. You will get a full explanation, but I just want to let you know that that’s usually where, for me, the apply stuff shines.

There are many other great reasons and places and things you can do with it. But that’s the one that I end up fixing the most. Now, when apply is most useful is if you have a small outer table and a large inner table.

Right. Because you want to have a small number of rows on the outer side of a nested loops join. And you can use that small number of rows to get to the inner side of the nested loops. Right. You don’t want nested loops for like you don’t want a big table on the outer side of nested loops.

And you don’t want two big tables involved with nested loops because you’re in for a bad time if you do. If the amount of work that like the query that goes into the apply is rather complex or does something that is computationally complex, windowing functions being one of those things, that’s another very good reason to use apply instead.

If I have very specific query goals that make apply pretty much the smartest way of doing things. Sometimes it’s like, you know, saying top three or offset zero rows fetch next three rows. Other times it is using a windowing function to filter out like where the windowing function is less than or equal to three.

Really, that depends a lot on data distribution density, stuff like that. That would be another good reason to use apply. If I am really trying my hardest to get a parallel nested loops plan, apply is usually a good way to do that.

If I need to replace scalar UDF in the select list with an inline UDF, that might be another good place to use apply. And if I need to use the values construct to do some surgery on one or more columns, that would be another good reason. We’ll talk through most of this stuff.

A lot of it is situational and it does require some practice to get familiar with it and know when the appropriate time to use apply is. Both cross and outer apply can be used in very similar ways to subqueries in the select list with the added bonus that, you know, like we took in the last video about subqueries. We talked about how you can really only return one row with them and you can only return one column with them.

With cross apply and outer apply, you don’t have those limitations. You can return multiple rows and multiple columns. That’s why I like the top end per group thing is really popular for apply.

What you really want to think of when you’re choosing which apply to use is cross apply really should be called inner apply because it’s like an inner join. And outer apply is actually appropriately named because it’s sort of like an outer join, right? Outer apply does not restrict rows.

Cross apply does. So here’s here’s sort of a simple example. Now you could use top three. You can use offset fetch in here. But let’s say that I just wanted to get the top three user the top three posts for a user.

I can do that with with this query pretty easily. Now. These are the these are the results.

Some of them. You will have three. Some of them you won’t. There might not be three for everyone. But for the people who do have three, you will get them. So that’s nice.

Right. There’s not really a great way to like insert a dummy row if if you just want like a third thing to show up for everybody. But no, whatever. Neither here nor there.

You can also use row number to do something similar if you don’t have good indexes in place or if your data distribution just sort of it just makes more sense to to use row number instead. There are some pretty good reasons to use row number if you are on if you’re in a higher SQL Server compat level where batch mode on rowstore is available. Or if you like, you know, can get batch mode involved using a trick with like a temporary table or an empty filtered non clustered columnstore index.

Because you can see the window aggregate from batch mode show up. And that’s a lot faster than the typical arrangement with row mode windowing functions where you’ll sometimes have a sort to put data in order for the partition by order by. And then the segment segment the sequence project.

The window aggregate is a lot usually a lot faster than the like the batch mode window aggregate is a lot faster than the row mode equivalent of those plans. So there are lots of good reasons to use apply depending on like and we’re, you know, we’re using apply in both of these just this one’s with row number and this one is you can again, you can use top or offset fetch. What what really drives the decision here is if this is fast enough, then cool, use this.

If this if that’s not fast enough, then you might want to think about using row number instead of the top with offset fetch. Really, like I said, it’s going to depend on like data distribution density, things like that. And, you know, getting like like top and offset fetch don’t really get like the batch mode benefit that windowing functions do.

So if if batch mode is a goal, then the windowing function will probably be faster there as well. So like for those queries, we’re just getting everyone from the users table who posted a question in the final days of 2013 ordered by when it was created and reputation and some other stuff. But the I guess the point is that this produces essentially a tabular result, right?

This produces like a second table that you’re joining to. And for everything that we find in here, for every row that we find that that meets this criteria, we apply this logic to every row. Right. So we can get multiple columns back.

We can get multiple rows back. We can get lots of stuff back. Now, one thing that I think is probably worth pointing out is like when we were talking about exists and not exists, one thing that I said is that it doesn’t matter if you what you put in the select list of exists and not exists because SQL Server just throws it away. Okay. Sort of in the same vein, it’s okay if you use star in apply or outer apply in the inside of the applied part of the query, because whatever columns you actually pull out of it up here, those are the only ones that the SQL Server is smart enough to realize you’re not selecting every column out of the post table.

SQL Server does some figuring when you first send it a query and it says, oh, even though there’s a select star in here, I know that in the outer select, I’m only getting title score creation date and last activity date. So I know that I like, I don’t actually need to treat this like a select star query. So using the select star inside of this is not a big deal because I’m not using select star in the final outer select slash project front for the results.

So that’s nice there. It’s sort of like how I have a select star here and a select star here, but SQL Server is like smart enough to realize that only these columns are involved aside from, you know, the stuff that I’m using inside of the query. So because the user’s table is correlated from ID to owner user ID in the post table, we do need to make sure that we at least have a good index that leads on owner user ID. So for every trip that we, for every time we apply that query to what’s in the user’s table, we have an efficient way to seek into that index and find the rows that we want.

You’re going to have some additional considerations with the windowing function thing, or if you have a top or offset fetch with a, with an order by in there, because you’re going to want to figure out how to not sort data every time, probably. Right. So just a couple notes on that. Another neat thing you can do with apply is sort of, you can do like a mock pivot and unpivot with apply. Itzik Ben-Gan has a lot of great videos on this. If you’ve never seen him present on apply, I would highly suggest just looking, looking for either, you know, his blog posts about apply or his videos about apply.

Pretty much anything where he talks about T-SQL is magical. But one thing that you can do is you can use this cross apply with a values clause to sort of combine the creation date and last activity date columns like this. And you can use that to sort of like pivot on them like this, like you’re turning each of these, you’re turning each of these, like each of these columns into a single column, right? So creation date and last activity date are two separate columns, but using values, we can pass them in as a single column and we can do, we can mimic the greatest and least functions that SQL Server 2022 added.

So like with SQL, if you’re on SQL Server 2022, we could just use greatest and least to figure out which value is higher or lower, the greater or the lesser. But with older versions of SQL Server that were the greatest and least functions aren’t available, which is weird because they’ve been around in other databases like forever. You can do something like this and this ends up with a pretty neat and nifty query plan.

We get all the stuff that we care about out of here. You’ll just have this constant scan, which you’ll see has just about twice as many rows as this because we need to basically make one long list from creation date and last activity date. And then we just aggregate to figure out the min and the max from those.

So you can use apply with the values clause for a lot of really powerful stuff. The choice to use apply really does depend on the goal of the query and the goals of the query tuner. It’s not always a magic performance tuning bullet, but under the right circumstances, it can really make things a lot faster than doing something like a derived join.

The choice of cross apply or outer apply, of course, comes down to query semantics. If you want the apply to restrict rows or filter rows, you want cross apply. If you want to do the equivalent of like an outer join, then you want to use outer apply.

One important difference in how the joins are implemented is in the optimizer’s choice between normal nested loops where the join is done at the nested loops operator and the apply nested loops, which is when the join keys are pushed to the index seek on the inner side of a join. When you see that, when you get apply nested loops, you can tell because when you highlight, when you hover over the nested loops join, you get the little tool tip that pops up.

Down at the bottom, you’ll see something that says outer references. And those outer references are the seek predicates being pushed into inside of the nested loops join rather than having them apply at the nested loops join. There’s a great post by Paul White about apply nested loops.

Again, if you’re feeling googly, definitely look for his post on Paul White apply nested loops because you’ll learn a lot about that there. Now, the optimizer is capable of transforming an apply to a join and vice versa. It will generally try to rewrite apply to a join during initial compilation because there’s more searchable plan space for that type of join.

If you transform to an apply early on, it may also consider a transformation back to an apply shape later just to figure out what would be cheaper. But just writing a query using apply does not guarantee that you get apply nested loops instead of just regular vanilla nested loops. Having good indexes in place is really like generally what tips the optimizer towards using that.

Now, there are a couple of things that, well, because we was talking about Itzikbengan earlier. There are a couple of cool things that you can do with apply that make life a lot easier. One of them is kind of what I showed you with greatest and least, except you can expand that to do lots of fun things, right?

Like finding the min and max per user. This isn’t a terribly fast query, but that’s okay for this one. We just start off, we start off by doing sort of what we did in the first query where we get sort of an initial min and max from things.

And then we use a slightly, I mean, it’s not even convoluted. It’s just something that we have to do some additional aggregations out here to have that group by ID and display name. Otherwise, we would have multiple rows, right?

We would get multiple rows back. Because, again, not like exists where you just get one thing. You know, like basically for every post in the users table, this is going to generate a row. So users who have multiple posts will generate multiple rows and we want to collapse that down just to get the min and max.

Another really cool thing that you can do with cross apply and continuing to use the values clause is you can sort of like, again, this is something Itzik talks about in his things. I think he’s right that this is a very neat trick, is you can do stuff like get the year that each of these things happened in.

All right. We have like the creation year and the last access date year. And then we could just assemble like the beginning and end span from that.

So we can use date from parts to take a year and just say 0101. Oops, that didn’t go well. Just say 0101 here and say 1231 here.

And we can get sort of like a span of time. So, I mean, some of these are more interesting than others, right? Like, well, like a lot of these are 2008 through 2018.

So it doesn’t really show off how cool this is. But this one, you know, we get the correct 2008 to 2017. Where are some good ones in here?

I don’t know. Some of these query results are just boring. But very neat things that you can do with apply and with values that can actually, and I’m going to talk about this when you get into CTE a bit more. With CTE, you typically have to like keep stacking them.

And, you know, the more complex things you have to do, the worse that stacking gets and like passing results and aliases down from one to another. But with apply, you can generally do stuff a lot more cleanly without having, without the fear of CTE executing the query inside them more than once and causing a giant cascading awful of query plan. So just a little bit about cross apply and outer apply there.

Very, very useful query techniques for all sorts of things. Most common use is sort of a top and per group thing. But, you know, there are lots of other cool uses for them that, you know, really, if I had all day to talk about apply, I could show you a lot of things.

But trying to keep this sort of short and basic so folks understand kind of what they are and how to use them. I don’t want to overcomplicate that and get into the really crazy stuff because I would probably lose a lot of people. So I don’t want to lose anyone.

I’ve lost enough in my life. I don’t know. We don’t need to talk about that. But you know who you are up there. Anyway, thank you for watching. I hope you enjoyed yourselves.

I hope you learned something. And I will see you in the next video, which is going to be about what’s number five here? In and not in.

So we’re going to have some fun things to say about in and not in in that video. So do try to contain yourselves. But I understand why some of you out there just might be orgasmic.

What we’re going to say in the next one. So I will see you in that video. Goodbye.

Not forever. Just until next time. Goodbye for now.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.