How To Write SQL Server Queries Correctly: Literally Everything You Need To Know About CTEs
Thanks for watching!
Video Summary
In this video, I dive into the world of Common Table Expressions (CTE) in SQL Server, providing a comprehensive yet critical analysis of their use and limitations. Erik Darling from Darling Data shares his extensive experience and expertise to show that while CTEs are useful for certain scenarios, they often come with significant drawbacks, such as redundant queries and performance issues due to single-threaded recursion. He emphasizes the importance of good formatting and comments over the supposed readability benefits of CTEs, encouraging viewers to focus on more effective query writing techniques like APPLY operators for simpler tasks.
Full Transcript
Erik Darling here with Darling Data, and in this video we’re going to continue on with our How to Write Queries Correctly series. And in this one I’m going to tell you, I’m going to teach you everything you need to know, literally everything you need to know about CTE for free. If anyone ever tries to convince you that they have something worth selling you, for you to learn about CTE, laugh in their face, throw something heavy at them. This is it. This is all you need to know about CTE.
Alright? So, before we do that, if you would like to spend $4 a month on this channel, you can. There’s a link in the video description where you can click that says, become a member, and then you become a member. And I get $3.10 a month after taxes. If you are just, no way, $4.00, you can like, you can comment, you can subscribe.
All very valid ways to show your undying love for me. Consulting! I do it. A lot of it. And according to my wife I need to do more of it, because it keeps me in my office. So, if you want to make my wife happy, you can hire me to do any one of these things, and more, with your SQL Server or servers.
And as always, my rates are reasonable. If you would like to get some very high quality, very low cost SQL Server training, none of it about CTE, at least, you know, not in a very elementary way, because I would never charge you for that. Save the good stuff for training.
You can get all of mine for $150 US dollars for the rest of your life. It’s not quite free, but if you live a long time, it’s close to it. There is also a link for that stuff in the video description.
You know, 2025, I’ll be on the road again for now. Give me a break. So, with that out of the way, let’s talk about the laughability of CTE. So, one, they’re just inline views, and you need them for some types of queries.
Unlike derived tables, you can reference them multiple times in a query. But this is where they start to cause problems. You can write modifications against them, which is mostly useful for doing ordered deletes.
For some reason, you can’t put a top with an order by and a delete. It’s very strange. Sorry, a delete with an order by and a top, a delete, and an order by. It’s weird. You can write recursive ones, but it’s really annoying.
And if you find yourself having to write recursive CTE a lot, you should probably consider a different data model. You can often do things a lot more cleanly with apply than you can with CTE. CTE require you to stack things one above the other and keep doing things.
With apply, you can just have a nice, neat query where you just have stuff very nice and streamlined and inlined without having multiple steps that someone has to read through and remember the column names and the CTE names. It’s all very aggravating.
So, like I said, the big gap that CTE were meant to fill in in SQL Server is with derived tables, you might notice some red squiggles here. That’s because we can’t talk to the derived expression x more than once.
Right? We just can’t do that. It just doesn’t work for us. Even if I name that correctly.
There we go. Even if I name that correctly, we still get red squiggles. And if I try to run this, it’ll say invalid object name. Next one, who are you? I don’t know.
But we can do that with CTE. The problem, as always, with CTE is every time you re-reference them, SQL Server runs the query in them again. If you look at this, we touch the user’s table not once, but twice.
And if we quote this in and we add a third join to our correctly named CTE, we will now see that we touch the user’s table three times. A one, a two, a three.
So, like I said, the thing that, like, Microsoft was like, cool, we have CTE now. Now you can reference them multiple times. No way to materialize them.
Other database engines give you that. Microsoft, as of this recording, does not. Maybe that’s a secret thing in SQL Server 2025. I don’t know. Couldn’t tell you.
I don’t know how much work goes into CTE. Microsoft is clearly busy with very important things involving fabric and AI that will go absolutely nowhere. So, we have that to look forward to.
Thanks. So, this is what I was talking about with deletes. You cannot do this, right? You can’t have that in there. But we can have that in here.
If we add an order by, u.id, this is completely valid where this is not, right? This will say, bleh, no. No ordering.
No ordering for u. This thing, though, this will give us a nice ordered delete, and we don’t have to worry about anything in there. There’s no sort in this because I’m ordering by the id column. I think when I was messing with this before, I think I deleted the order by by accident.
But if we were to order by a column that is not supported by an index, you would see it gets sorted there. When we use id, that is the clustered primary key of the users table. So, it’s already ordered for us.
So, we can delete things nice and orderly this way. Sometimes, you need a CTE. You could also do this with a derived table. My style guide says it doesn’t matter because for a query this simple, it’s just not that big of a deal, right?
You generate a row number, and you want to filter out to where that row number equals something. There’s no way in SQL Server to do that all in one step, right? You can’t do that all in one go for some reason.
And SQL Server, Microsoft makes you write a two-step query and generate everything. And then at the very end of that query, filter out everything in there, right? So, that’s a lot of fun, right?
Sometimes, you need to use them because Microsoft won’t implement useful things that make your life easier. You get big data clusters, synapse, managed instance. Why would anyone want these things?
They’re garbage. They’re stupid things. And, of course, you can write recursive CTE using CTE. Of course, you couldn’t write a recursive CTE without a CTE because there’s no just plain recursive.
You need that. But, you know, I did go a little bit more into performance details on this in the other video in this series about CTE, which you can find if you look at the playlist.
It’s all in there. But you start off with this part. And this is where you write the anchor portion of your CTE. And this is where you write the recursive portion of your CTE. And if you run this query, well, actually, let’s look at the query plan.
I forget. I don’t think I have indexes that make this good. So this would run for a long time. One of the big problems of CTE is that, recursive CTE, rather, is that the recursive part of the query is forced to run single-threaded.
And it can be very, very slow if you don’t have good indexes in place to support it. And another thing is that usually you’re going to see a lot of nested loops inside of your recursive CTE. And if you don’t have good indexes, you’re going to see eager index spools inside of here.
And that is a surefire sign that they are going to be terribly, terribly slow. So not a lot to say about the recursive CTE. Again, if this is something you need to do frequently, you probably need a better data model.
Now, for anyone who says they make queries more readable, again, for like the 10 billionth time, they don’t. They don’t make them more readable. They don’t make them more understandable.
What makes them, what makes queries readable is good formatting. And what makes queries understandable is good commenting. Writing a CTE does absolutely nothing to make a query either one of those things. It does not help.
There is still no explanation for what you did. There is still just wild formatting where you, for some reason, it’s like select and then like columns and then from is on the same line. And then join is on the same line, but on is down.
And like just format your queries and people will find them readable and make comments on your queries and people will find them understandable. It’s a fantastic thing. It’s a wonderful thing to do.
Otherwise, you’re just writing gobbledygook. So one thing that some people will do with CTE is stuff that they could do with apply very easily. Like, let’s say this is something that Itzik talks about sometimes in his training is that like just people who want to do simple things.
Like just get the year from a date or date time or whatever column. Date, date time to take time 1 million 17th. Yeah, fire.
Big city living folks, big city living. And then let’s say you wanted to get the following year, you could write another CTE. And then you could get all this stuff.
And then, I mean, granted, this is just sort of a goofy example, but it gets to the mentality of people who use CTE for just everything. Because they have it in their goofy heads that they think it’s going to make the query readable and understandable for everyone. Where you could just do something very simple like this and just cross apply some values to that year.
And then you have that expression available to you in the select list and your query becomes a lot more compact and you don’t need 17 different steps to express yourself or figure out or write your logic out. You can do things very easily and more compactly when you use better forms of query writing. So this is literally everything you need to know about CTE.
There is nothing magical or mystical about them. There is nothing fantastic about them. Perhaps someday materialization will come to SQL Server, but we don’t have that now. Fabric did just get Azure SQL database.
Wow. Hold your applause. No, I mean it. Hold your applause.
Anyway, thank you for watching. I hope you enjoyed yourselves. Honestly, I hope you learned nothing. Because this is all very generic information about CTE, but it is all the information you need about CTE. Alright.
I will see you in the next video about the output clause, which will be literally everything you need to know about the output clause. Alright. Have a great day. It’s pretty salty.
It’s clear!
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.
How To Write SQL Server Queries Correctly: Case Expressions
Thanks for watching!
Video Summary
In this video, I dive into the intricacies of case expressions in SQL Server queries, explaining why they can be problematic for performance tuning and optimization. I share insights from my experience as a database professional and highlight how seemingly innocent use of case expressions can lead to complex query plans and significant execution times. By walking through examples and referencing Aaron Bertrand’s blog post on the “Dirty Secrets of the Case Expression,” I aim to help you understand why it’s crucial to be mindful of where and when you use these constructs in your queries, especially in joins and where clauses.
Full Transcript
Erik Darling here with the old Darling Data. Having a lot of fun. A lot of fun. I don’t know how anyone doesn’t have fun with databases. They’re just the most amusing things one could possibly conceive of. Barrel of laughs. Today’s video, we’re going to get ever closer to the close of our series about how to write queries correctly. And in this one, we’re going to cover case expressions. Not case statements. Case is not a statement. Case is an expression. You’re just going to have to deal with that fact. Or else, I don’t know, I’ll drop a bag of hammers on you. If you’re that gung-ho about calling case a statement, it might have no effect on your skull. It’s probably rock solid anyway. If you like this content, and you’re like, wow, where would I be without this content? Well, for four bucks a month, you can support this content. And I can, you know, carry the two. I’ll make about about $3.10 after taxes on that.
Give or take. So, that’d be nice. If you don’t have $4 to turn into the alchemy of taxation to hand over to me, you can like, you can comment, you can subscribe. And there’s no tax on that. That is free love. If you are in need of SQL Server Consulting. If the things that I say in these videos resonates with you, and you think that perhaps my wit and wisdom would be of some benefit to the SQL Servers that you care and feed, well, you can hire me to do all sorts of things. And as always, my rates are reasonable. I also have training. And you could have that training for about $150 for the rest of your life. Not a subscription. You can go to that link and plug in that coupon code. And through, again, magical transformations. Actually, reverse taxation. You actually get money off of things. I guess that’s a tax on me.
Boy, this is a rough mental awakening for me. Me giving you discounts is a tax on me. That’s messed up. But you can get all of my training for, yeah, that money, that URL, that code. It’s amazing. No upcoming events because end of year, family, things like that. I’ll be around in 2025. With all that out of the way, let’s talk about these here case expressions. Now, one thing that you should know off the bat is that green screens later in the day are hard. That’s why I have strange things happening behind this armpit. Do not look behind the armpit.
Is that some things internally, even though they appear to be built-in functions that SQL Server has, the things that turn pink when you type them. Some of those things are just wrappers for case expressions. Translate is not one of them, though. When I was first looking into this, I half expected translate to be a case expression, but it is not. If we look at the query plan and we look at the compute scalar, the defined thing in here will not say anything about a case expression.
But if we dig into some of these other things, like if, and we look at the execution plan here, and we look at the compute scalar, we will see that internally, if, is just doing a case expression. You can see where that starts right here. So if is just a case expression. This is one of the reasons why when I start having to warn people about using functions and where clauses and joins, I also need to warn them about using case and where clauses and joins, because guess what?
They both have the same performance problems. It’s amazing. It doesn’t matter if you explicitly write case. Some of the functions that you might write will use case anyway. Behind the scenes, they’ll just magically crop up there for you. Another one of them is coalesce. If we do this, and we look at the execution plan, and we look at the predicate that gets applied to the table, that is a case expression too.
Case when is not null, then, blah, blah, blah, end. See? Coalesce. Inside, just a case expression. Choose, similarly, is a case expression. If we look at the query plan here, and we will look at the compute scalar, we will once again see the defined values show us a case expression.
So, if you can, because you’re the one writing queries, you should probably not put things like if, and coalesce, and choose, and your where clause, because they are case expressions, and they will have performance problems. Other built-in functions, like is null, and replace, and stuff like that, they are not case expressions inside, but they have very similar performance problems.
All of a sudden, using indexes becomes difficult for a SQL Server. Aaron Bertrand, who, I don’t know, if you ask me, I don’t talk to him enough. If you ask him, I talk to him too much.
So, there’s two sides to every coin. Either, you know, I don’t know what Canadian coins are called. But, these, he has a blog post anyway, called, The Dirty Secrets of the Case Expression.
Thank you, Aaron, for not calling it a case statement. And this is an example that he has in his blog post. There are more if you use the search engine of your choosing. I don’t care what it is.
And you search for Aaron Bertrand, Dirty Secrets of the Case Expression. You will find his post. This is the most amusing one that I think exists on there. Is this, when you declare a variable called i, and you say, case when i equals one, then one, else the minimum of one divided by zero.
And you get a divide by zero error, even though this part never should have been explored, right? Because this equals one. If you rewrite that a little bit, and you say, else select min, that will work.
Case figures that one out. Case doesn’t go that far along in that. But it’s an interesting thing, where case does not always short circuit where and when you think it might. All sorts of things can happen in there.
Now, let’s talk about performance, though. Because that’s what we care the most about. We are, for the most part, performance tuners. We’re pretty good at it at this point, I think.
At least I hope, anyway. That’s what we get paid for. I see a lot of queries with a lot of problems. And a lot of them look something like this.
And notice that there’s a join with a case expression here. All right, so there’s a, we’re going to join depending on if this, right? Remember, so like if one equals, ah, gosh darn it.
If one equals case when this, then one. And when this condition, then one. But if we don’t meet those conditions, then zero. And then another thing where one equals case when these things are, these conditions are true than one.
Else zero end. And I’m not actually going to run this query now. Because we’re already around the eight minute mark.
And this query takes a full minute and one second to execute. The query plan is not terribly interesting. Just note that everything is fine until we try to apply that expression right here. All right, a whole bunch of bad things get involved at the join.
And this portion of the plan eats up the majority of the time. It would be nice if I could draw like different shapes with ZoomIt. But I guess making two rectangles isn’t going to, you know, isn’t that much of a challenge for me.
It’s not that difficult. It would just be nice if I could like draw better in general. If I could draw better, I would probably have a different career, honestly.
But this query plan is, of course, a disaster. And you should not emulate this kind of query. This query is no good.
For the sake of those who may, I don’t know, be watching and not listening, maybe you take visual cues better than you take verbal cues. We are going to say no.
No. No. If I was better at art, I would make this a skull. But I’m not very good at art.
So, er, er, er. And we’ll get some things here that hopefully don’t look too much like other things. Eh.
We’re getting good at this. We’re getting better at this. Every day. Every day. We want to avoid this. This is bad. Okay. This is not what you want to do with your queries. This is frowny face.
Bad for performance. But, but. Okay. If all this is not enough clue, I don’t know what to do for you. We are, we are, I’m out of ideas.
So, one way that you could rewrite this query would be something like this. You don’t have to do this. But this, to me, this just looks kind of nice. We’re going to put the query, the predicates that we care about for the users table in here.
Right? Because that’s good. We care about that. And we’re going to cross apply to a union all. And if you are, you’ve been following this series, separating out queries like this.
Very common. Very useful query pattern. Even if it doesn’t solve the precise performance problem that you’re after, it can certainly illuminate you as to why you are having a performance problem with one or both of these queries or sets of, the set of predicates in here.
Perhaps it’s an indexing issue or something else that’s just driving things amok. But when we run this query, this thing will let me highlight things. This returns instantly.
Right? We no longer have to wait a very long time. We no longer have a giant parallel plan with a spool in it and join predicates figured out at the, or rather, predicates figured out at the join clause.
We have an apply nested loops plan. Again, we can tell that the apply nested loops kicked in because we have this outer references thing in the nested loops join, which means that our join predicates were sent out as seek predicates to here and here.
So, I mean, obviously that’s the, somewhat obviously, because it’s the same index name. We hit the same index twice and seek into it and all is well. And we get the, back the result that we care about immediately. So, when you’re writing queries, putting things like case expressions and joins and where clause may seem like a convenient vehicle for you to express the logic that you care about, but it is not, often not a very convenient vehicle for SQL Server to optimize.
Again, this is really, when you’re writing queries where performance matters, which you should write every query like performance matters because someday it might, even if it doesn’t today, someday performance might be of critical essence, critical importance to this query or to whoever’s running this query.
And you don’t want it to be slow. And you don’t want like one of those comments up at the top where it has your full name and a date and like all that stuff in there. And where you’re like, you know, you make notes about things that you swear work, but don’t actually work.
So, these may seem very convenient for you, but these are very bad for SQL Server. Much better ways exist to write queries that SQL Server can interpret into a much better, much faster execution plan for you.
So, one thing that I talked about at the, well, not at the beginning of this post, obviously, because this is, this is the culmination of several things is handling, like using or. So, the case, this video on case expressions really ties in with the, the two videos on where, where, where clauses with or in them and join clauses with or in them.
Uh, they have very similar performance implications and sometimes even query plan implications. So, one thing that I haven’t covered in these that I’m going to cover in a longer format about dynamic SQL is how to handle cases where you want to, um, have optional parameters passed in, uh, to a store procedure.
Sometimes they are null. Sometimes they are used and there are different ways to sort of deal with that. Um, but please, for, at least for the time being, keep case expressions out of your joins and your where clauses and, uh, be, be careful, be very diligent in how you write case expressions, even in your select list, because they may not shortcut when you expect them to.
SQL Server may do some additional exploring in those case expressions that you are not aware of until something weird happens. So, be careful there.
Anyway, thank you for watching. I hope you enjoyed yourselves. I hope you learned something. And I will see you in the next video on, uh, CTE. I know that I already did one in this series, but, um, the, the next two videos on common table expressions and output are going to be literally everything you need to know about CTE and output.
Do not buy anything from anyone purporting to teach you about case expression, about common table expressions and output. There is nothing worth paying for to learn about them. Okay?
Spend no money on these things. They’re not, they’re not worth it. Anyway, I will see you in those videos. Au revoir.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.
How To Write SQL Server Queries Correctly: Joins With OR Clauses Part 2
Thanks for watching!
Video Summary
In this video, I dive into the intricacies of joins with OR clauses in SQL Server, specifically using examples from Stack Overflow’s database design to illustrate common pitfalls and suboptimal query plans. We explore how these join conditions can lead to inefficient execution plans, such as constant scans and nested loops joins, which often result in unnecessary row counts and sorting operations. I also discuss potential solutions, including query rewrites that leverage EXISTS clauses or CROSS APPLY operators, demonstrating how these changes can significantly improve performance. By the end of this video, you’ll understand why normalization is crucial for avoiding these issues and learn techniques to optimize your queries when OR clauses are necessary.
Full Transcript
Erik Darling here with Darling Data. All excited to talk more about joins with OR clauses. It’s going to be a real grand day for all of us here. This is, of course, part two of the joins with OR clauses videos. The first one, we talked about some simple stuff. It’s kind of set up, you know, the things that SQL Server can and can’t do when there are OR clauses in the table with the simple join. Now we’re going to talk about more complicated stuff and, of course, how you can fix the more complicated stuff. And, of course, because I like talking about it, we’re going to talk about shortcomings in SQL Server’s Query Optimizer. Because it’s got those. Lord knows. Anyway, before we do that, if you would like to slap four bucks a month on my rent check, you can click the link down in the video description below and you can do that. Maybe it’ll make you feel good.
You know, or maybe not. I don’t know. Maybe, hopefully, you just forget about it for a while. That’s what we’re all hoping for. If four bucks a month is not what you have in mind for me, you can like, you can comment, you can subscribe. All sorts of great things. If you are in need of consulting for your SQL Server, all of these things, I am the best in the world at them.
Anyone who tells you otherwise is a dirty, rotten liar. And as always, my rates are reasonable. Training. I’ve got it. 24 or so hours of it. Beginner, intermediate, expert. All of those things. You can get it for about 150 US dollars. That’s the link. That’s the coupon code.
And they are helpfully glued together for you also in the video description. Lucky, lucky you. No more, no upcoming events until 2025, probably later in the year. But for now, we can just go talk about what we came here to talk about, which is joins with or clauses. So joins with or clauses are one of my biggest pet peeves because the optimizer does have space to do this better.
It just isn’t there, right? It just doesn’t do it. There’s just no help for you when a join has an or clause in it. There’s not like a special thing you have to do or there’s not a hint. There’s not like a trace flag. All there is is query rewrites. And those query rewrites are annoying because SQL Server’s optimizer could figure it out.
Just doesn’t. It’s lazy. Very lazy, lazy optimizer sometimes. So I’ve created some very generous indexes on the votes table and two on the post table. And what we’re going to look at is the query pattern here.
Now, it is a little bit confusing when the first time you see it, because it is, well, I mean, not the most straightforward thing in the world. But I probably should have started this running before I started talking too much, right? OK, seven seconds. Not too bad.
But here’s the execution plan. And it is one of it’s this is one of those query plan patterns that I try to train people to look for, because this query plan pattern is always a sign that something bad is happening in your query. And that something bad is going to be a join with an or clause.
So we’re doing I mean, it’s a little bit of a funny query here, but it’s one that you have to do with the Stack Overflow database. Because, you know, everyone who gives credit to Stack Overflow for like when it first started being like the cool, smart, like, whoa, rogue band of geniuses don’t like they’re doing everything on the cutting edge of cool and fun. They were crap at database design.
Crappy, just awful. Should have hired someone who knew what who knew what they were doing like this much because the post table has all the posts in it. It has all the questions. It has all the answers. It has all the everything else.
And what’s what’s really crappy. Oh, my God, is like the version of the post table that you see in like the Stack Overflow data explorer and the version of the post table that you see, if you download the data dump, is not nearly reality. If you saw that, if you saw the actual table in there, it is so much wider and more denormalized and awful.
It would it would it would make you even if you know nothing about normalization, it would make your head spin. But we have to join the post table to itself because questions. Right. Those are these things. When you join them, you need to join the post table to itself to find which questions have accepted answers.
Great. Good. Good plan. Self joins. Hey, I like I like those sometimes. Sometimes you do have to you do have to join yourself. But then when we join to the votes table, what we’re saying is where we need where the the either the question ID equals the post ID in the votes table or the accepted answer ID equals the post ID in the votes table.
So we’re looking to get votes on both the question. That’s the ID and the answer, which is the accepted answer ID. Right. And we I don’t know through this. I forget why I threw this on there. I think it tidied things up a little bit. Maybe maybe maybe I didn’t put that on there in seven seconds or in like 70 seconds.
But then we’re looking for where p dot post type ID equals two. Right. And we’re looking for where p two dot accepted answer ID is greater than zero. So we’ve done everything that we we can there. Now, what makes this challenging or rather like.
What the optimizer could do is something like this. It could unroll the join from being this set of conditions to being this set of conditions. Right. Like it could do that because it is fairly equivalent. It could also do this. But it kind of doesn’t.
It could be a good answer. So what’s really interesting, but sort of difficult to describe well to the optimizer is that any given question can only have one accepted answer. So there’s no many to many relationship with that join.
So these are things that the optimizer could do, but it doesn’t. And this is the query plan pattern that I need to warn you about. And this is the query plan pattern that I want to show you is this is what happens.
So we do two seeks into the post table. Right. This finds this does our join. Right. This finds our questions or rather sends our answers with the questions of accepted answers and all that other good stuff. And then this is the join to the votes table. Now, what I want you to keep an eye on is the number of rows that come out of this three eight one eight four nine seven.
OK, remember that number because you’re going to see it again. And right now way over here in the query plan, this is this is most commonly what you will see in a query plan that has a bad join with an or clause in it, which means any join with an or clause. You’re going to see a constant scan.
With three eight one eight four nine seven rows come out of it and then another constant scan with three eight one eight four nine seven rows come out of it. That is the results of this twice. Right. So we have one set for the P dot ID column in the post table and one set for the accepted answer ID column in the post table. Right.
We have that many rows for both after we join those two tables together. That’s how many rows are produced. So each one of those columns gets produced by this by these constant scan operators. Notice these don’t touch a table. These sort of come out of nowhere, but they they really get fueled from here.
So then SQL Server says I have to put those two results together. It essentially does like a union union all sorry. So this number doubles right.
At least my math is correct. And then it sorts all that data. It puts all of those rows in order. So that it can merge duplicates. And it note the estimates in here are all pretty wonky.
Right. Of one of one of two of two of two. So we’re SQL Server. Expects two rows to come across all this stuff.
It we get way more than that. Now, I think like Microsoft could start a little bit earlier and like doing all this stuff by just like trying to get better estimates out of this part of the plan, because then maybe it wouldn’t choose a nested loops join for this part of the plan.
It’s also entirely possible that this is an optimizer limitation and that SQL Server could not possibly do any other kind of join here. But then we seek into the votes table and we we seek on this expression one oh two four. Right. Or sorry.
We have a between thing. Right. Because we merge the interval. So we go and find greater than expression one oh two three and less than expression one oh two four. That’s going to be the the start and end points of the constant scan results after we’ve merged the intervals in.
So we every time we loop in, we go and we find those that range of columns. Now, it’s bad enough that we have like this fake result set. Right. From the from the constant scans that we loop and find stuff in here.
But then like there’s an sort of this other nested loops join here that also like takes this stuff comes out of here and goes through all this. Right. So it’s like this loop, like even though it starts here, it kind of like really starts over here. So this all ends up taking about eight seconds or seven and a half seconds, which is pretty slow for a query.
And like I said, SQL Server could unroll all this stuff and do it differently. Now, it’s what we can’t. It’s hard to describe this to the optimizer without a lot of really difficult constraining that every question, if it has an accepted answer, can only have one accepted answer.
If we were to use the parent ID column instead of accepted answer, it would be a slightly different story because one question can have many answers, but only one of them can be the answer. So this is a great reason to normalize your data. This is one place where the Stack Overflow data design team failed.
Right. Off like just big thumbs down there was every question and every post are in the same post table. I do have some stuff about what happens if you split those out and I do have some other stuff about what you do if you put the body column on another table. But I don’t know. I feel like that’s worth money.
So that might not go on YouTube. Now, part of the reason why this optimization space is difficult is because you would have to you would have to unroll the query to look like one of these two things. Right. Where bounty amount is no and exists.
Select this on this. But this will give you identical results to what you got before. But without all that crazy stuff going on in it, this finishes in about 207 milliseconds.
And that’s, you know, arguably a lot better. Right. And this is this filter is a startup expression predicate. So this is not the awful kind of filter where all the rows come out and then try to pass the filter.
This filter only lets rows passed if if if if if anything meets that criteria. So that’s a nice thing in there. Right. This is all good stuff. SQL Server handles this really well.
And because the where clause in here is really only for one set one set of things, it turns out a lot better. Now, you could also rewrite that in this way, which is even more complicated, where you could use sort of a double exists with a union all. Right. So rather than write this or clause, you could write this in and you could get good results back from this one.
This query plan, it’s a little bit different. It takes about 385 milliseconds. There’s one filter at the end here, which I am not crazy about.
Where this is not null or this is not null. Usually this is the kind of stuff that I worry about in query plans. But in this case, it’s OK.
We get a couple adaptive joins over here between the post table. Well, sorry, between the votes table and the post table. Right. So that all turns out OK. And then if we needed columns from the inner part.
So like we talked about exists and not exists and joins and all that stuff. One thing that I talked to you about was that when you use exists, you can’t project columns out of it. So if this were a more complicated query, you know, like like for a lot of queries, I use count big just to make sure that I’m getting like the right number of rows back.
I just like basic starting point. Are these results correct? One one thing that, you know, like normal queries, people want to see data, right? They don’t just want to count.
They’re not just like, oh, well, yeah, just if you just show me counts, I’m happy. They want to see like the actual stuff in there. So if you need to rewrite a query like this and you need to project rows out, cross supply is a really handy way of getting the same performance. But but you can project columns out of the cross supply.
I don’t know why there’s a space in here. That’s that’s clearly clearly someone is sabotaging me. But if we write the query like this with the cross supply, it’s also pretty quick. 345 milliseconds.
And if we needed columns to come out of the this part of the query, we could get them. So if we like so like you can see, like we count the P dot ID column that comes out of the apply in here, which is union all in here. So we’re counting.
We’re able to reference the P dot ID column there, where if we try to do this up here, we couldn’t. So are these queries bigger and longer and more complicated? Yes.
But sometimes that’s what you that’s what you need to. That’s the kind of query you need to write for the optimizer to fully understand what you’re doing. It’s so, you know, one thing that I say quite frequently with clients and in training and to just random passerby on the street is anything you do that makes your job easier, make SQL servers job harder. Sometimes things that you do that make your life easier.
SQL servers like, oh, yeah, well, I’m going to make my life easy to take this query plan. When you start writing things out more verbosely, sometimes you can give SQL Server a better understanding of what you need to do, what data you need, what data you’re trying to touch. And SQL Server can come up with better plans based on that.
Shorter, shorter, simpler queries are not always faster queries. You know, there are a lot of people who seem to feel that way and they write these tiny little compact queries there. Some of them just on a single line, very long lines.
But like writing simple, compact queries is good for simple, compact logic. When your logic gets very complicated, sometimes very short queries that just, you know, try to compact everything very small. It’s not cold golf.
Sometimes the optimizer can’t do a lot with those. Sometimes you have to be you have to write out more specifically what you’re trying to do. And sometimes your queries might look like this. Sometimes your queries might have extra touches of tables in them.
But you can’t like don’t be afraid of trying this stuff because you might find queries perform much better when you are much more explicit to SQL Server about what you’re after. So don’t just stop at simple, complex query because you’re happy with simple, complex query. Get just the performance.
Look at the query plan. See what happens when you spread things out a little bit. Give SQL Server a little room to think and breathe. You don’t have to smother it all the time with these weird queries. Now, one thing that I’ve talked about a few times, maybe not in this video, but in the last video, before I decided to split this up into two videos, that’s why there might be something up in there that seems a little weird, but is case expressions and joins and where clauses.
We’re going to talk about that next. That is why this is 12 and the little tab next to it says 13 case expressions. So we’re going to talk about those next.
And we’re also going to look at some rewrites for those. Some of those look similar to these other than others of them look similar for two other things that we’ve talked about in the series. But it helps reinforce that you’re not writing SQL Server queries correctly and you need to start.
So, sorry to break it to you, but you know what? At least it’s not terminal. Just temporary.
It can be cured. Don’t worry. Without surgery, hopefully. Yeah, hammer. Alright, cool. Thank you for watching. I hope you enjoyed yourselves. I hope you learned something and I will see you over in the next video about case expressions.
And you’re going to learn a lot. So watch it and don’t skip stuff. You heard?
Alright, goodbye.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.
How To Write SQL Server Queries Correctly: Joins With OR Clauses Part 1
Thanks for watching!
Video Summary
In this video, I delve into the intricacies of joins with OR clauses in SQL Server, demonstrating how they can sometimes lead to unexpected behavior and performance issues. Starting with a simple inner join setup between comments and posts tables, I illustrate how SQL Server infers join conditions from the WHERE clause, making the query efficient when both columns are involved. However, introducing an OR condition that splits the join across different tables causes SQL Server to struggle, resulting in slower execution plans. To address this, I show a workaround by adjusting the OR clause to involve only one table, allowing SQL Server to seek into indexes more effectively and improve performance. This video is part one of a series on joins with OR clauses, setting up the groundwork for exploring more complex join scenarios like left joins in future parts.
Full Transcript
Erik Darling here with Darling Data. Got a barbell next to a dumbbell. Anyway, today’s video we’re going to talk about joins with or clauses. I’m actually going to do this in two pieces. The first piece just kind of sets it up a little bit and gets some of the easier stuff out of the way. And then the second part we’ll go into more detail. And I tried to do it in one and it just went on too long. Even for me, I actually apologized to my own self for my length for once. So that was a fun change of pace. If you would like to support this channel by giving a dollar sign a finger mustache, you can click the link down in the video description. For as little as $4 a month, you can support me by giving a dollar sign a finger mustache. You can click the link down in the video description for as little as $4 a month. And I’ll see you next time.
Training. You probably need it. If you’re here, you definitely need it. You’re probably not getting enough of it. It’s like vitamins. You can get all of mine for about $150 by using that coupon code at that website up there. That’s about it. There’s also a link for that down in the video description. It’s very good stuff. Paul White Tech reviewed it. So I don’t know. Maybe that’ll convince you. No upcoming events until 2025. Ba-ba-ba-ba-ba-ba. With that out of the way, let’s talk about joins with OR clauses. So SQL Server has funny ways of handling OR expressions sometimes. Sometimes it can infer certain things depending on how you set your query up. And other times it cannot.
So I think a good way to start this off is by showing you this query. The top one is just there to get a quick result back. But we have a join between comments and posts. And the join is on p.ownerUserId equals c.userId. And we have a where clause to say where ownerUserId equals 22656. If we run this query, SQL Server is able to infer that we meant that we only want to see userId 22656 or we only care about ownerUserId 22656.
Because what happens is SQL Server tells us we don’t have a join predicate. We get this stupid little warning right here. And it says, yeah, guess what? No join predicate. Shame on you. You’re an idiot. The number of times I’ve seen this in a query plan and been like, huh, do we really not have a join predicate?
And been like, oh, no, no, no, we totally have a join predicate is pretty high at this point. Microsoft summer interns. What would you do without them? But if we look a little bit more closely at the plan, we will see a predicate on this side of the join that says where ownerUserId equals 22656.
And we will see a predicate on this side of the join that says where userId equals 22656. And so all of the results that come back from the post table and from the comments table have a join condition or rather have a value in their join column that is the same. So there is actually no need to do anything with that.
Right. We don’t need to do any additional join conditioning. We have met the condition of our join via two when right when we touch the indexes. So that’s where SQL Server is kind of good at stuff.
You know, we touch the indexes on both sides. We do the filtering right there. And then everything that we need to mush together is the same value. They’re all it’s two sets of 22656.
And there’s like, what do you join there? It’s already joined for you. We have we have implied the join via the where clause. Where the optimizer is somewhat less good at figuring these things out is when, of course, when you get separate columns involved.
Right. So if we look at this query. Right.
And I have these quoted out for the time being for dramatic reasons. We’re joining comments to votes on c.postid equals v.postid. And we’re going to say where c.postid equals 838 or v.userid equals 831.
Right. And if we run this query. Do do.
It takes a few seconds. We get 42 rows back. SQL Server has chosen the most adaptive of joins. We scan here for about a second and a half. We scan here for about half a second.
And in total, we take about 2.6 seconds to figure this all out. Now, if we quote out these four sequence and we rerun this, SQL Server says, there is no way we could do that. It is impossible.
Why have you tried to do this to me? You are trying to bend the rules of space time. You have divided by zero. You are not fit for this planet.
You think you’ve seen too many Doctor Strange movies. But this is kind of a funny thing because when you look at what SQL Server does, look at the execution plan. And then the entire query is evaluated here at the join.
We have this whole or clause thing in here. Now, what we can do is, since we know something about our data that SQL Server doesn’t, is that if we run this and we say where we flip that so that we say where both of these columns come from the votes table. Right?
Right? Now we’re saying where V is this on both. We get the same results back, right? This is 41. 41. Da, da, da, da, da, da, da, da, da, da, da, da. And this is 41. But much more, a much quick, much more quick 41.
Now SQL Server is able to seek into two indexes on the votes table and a third index on the comments table. And we are able to get back our results. The only change I’m making is that I’m using the post ID column from the votes table instead of the post ID comment from the comments table.
Right? So before it was c.post ID and v.user ID. Now it’s v.post ID and v.user ID.
So when you comment on or you vote on something, the post ID is logged. Right? You’re going to, since we’re joining on post ID, right, it’s going to be the same for both.
We didn’t change the v.user ID. If the v.user ID were the problem, that would not give us correct results. Or rather, that would pose a different set of issues.
Sort of like what we talked about with the where or clause stuff in the last one. But if we just change v.post ID to 138, then from c.post ID equals 138, then we’re getting all the same stuff back because we’re joining on post ID. Right?
That’s all we care about. But for some reason, this or clause getting slightly different tables involved really screws SQL Server up. So when you’re writing queries, these are the little things that you need to tweak and experiment with to make sure that you are not missing anything obvious.
A lot of times when I talk about query complexity, people start picturing that big, you know, monolithic query that spans 17 scrolls of SS through SSMS. And if you make the font in SQL Server Management Studio like 4%, it’s still just like it doesn’t fit on one screen. That, of course, is a different brand of hell.
But complexity in queries can arise in many small interesting places. You know, in clauses, or clauses. Like anything that adds a little bit of drama to your queries can certainly make things somewhat strange in there.
So that’s the set things up. Now, this was fairly simple because this was an inner join with just a where clause that had an or for just, right, a simple inner join, right? Just one table to another.
The where clause was only on a couple columns with one or condition. What we’re going to look at next is more complicated join scenarios with left joins and stuff. So we have that to look forward to.
Anyway, I hope you enjoyed yourselves. I hope you learned something. And I will see you in part two of Joins with Orr clauses when we will get deeper into the subject. And I will show you more awful things that can happen and how you can fix them.
Because what’s the point in just showing you bad things that can happen if I don’t show you how to fix the bad things? It doesn’t seem like a very, it doesn’t seem like very good training if all I do is tell you that bad things happen. It sounds like, it sounds like other training I’ve seen.
Bad things happen. How do you fix it? It’s hard. Yeah, you know, life’s hard. Screw it.
All right. Thanks for watching.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.
How To Write SQL Server Queries Correctly: Where Clauses With OR In Them
Thanks for watching!
Video Summary
In this video, I delve into optimizing queries that use OR clauses across columns, particularly in scenarios involving outer joins and complex WHERE conditions. I explore how these types of queries can sometimes struggle to leverage indexes effectively, leading to suboptimal performance. By rewriting the query to break it down into more manageable parts, we can better guide SQL Server to utilize the appropriate indexes for seeking rather than scanning, thus improving overall query efficiency. Through practical examples and detailed explanations, I demonstrate how strategic indexing and query restructuring can significantly enhance performance in these challenging cases.
Full Transcript
Erik Darling here with Darling Data. And we’re going to talk about something quite interesting today. Now, we’re going to, of course, carrying on, soldiering on with our series on how to write queries correctly. And I did promise some tidbits, morsels, tiny, sprinkled little nuggets of Intel timing out looking for drivers. God, that’s depressing. It just does that randomly. I don’t understand. I wish Intel would just get out of computers. It was a good run. Intel and Boeing just get together and go do something else. Enough lives have been lost. So I did promise some stuff about performance and we’re going to cover that today because this is not precisely about how to write queries correctly. This is about indexing for a specific type of query and how you can rewrite that query in a slightly different way to maybe better understand why your indexes are not being used perhaps the way you think they should be. So we’re, of course, going to be talking about where clauses with or operators in them. Shocking, I know. Who would ever be indecisive about what data they’re returning in their query? It’s hard to know. The level of indecision, I have.
I see in many SQL Server queries. Why don’t you just say select star from table. The number of OR conditions that you have applied to this query makes me wary that you might just be doing a secret select star. Anyway. So let’s go on with that. But before we get into the meat, the beef of this subject, if you would like to become a member and support this channel with $4, as little as $4 a month, you are, of course, able to do that by clicking on the little become a member link in the video description. If $4 a month would mean that you would starve cold and alone in the streets, like these poor UK pensioners who have had their heating allowance pulled from their hands, you can like, you can comment, you can subscribe.
Those are all wonderful ways to keep me warm all winter long. If you need help with your SQL Server, if you’re out there in the world saying, God, I wish I knew why this SQL Server was so slow, or I wish someone would come and make this SQL Server faster for me. Well, I do all of those things. I can tell you why it’s slow, I can make it faster, and all of that at a reasonable rate. Ha! It’s wonderful.
If you would like some very high quality, incredibly high quality, probably the highest quality training you can possibly get at a very low price, probably the lowest price you can possibly get without pirating it, You can get all of my training for about $150 USD for the remainder of your years on this planet Earth. If you go to that link and put in that coupon code, there is also a link that assembles all of these details for you in the video description.
It’s your lucky day. No upcoming events, because end of the year, yada yada yada, 2025, we will reconsider our stance on no new events. For now, I just wish to be warm and snuggly with a bottle of coat roti and not going anywhere in a winter coat to talk about SQL Server.
With that out of the way, let’s fade to black and let’s talk about how we can make OR clause queries better. Now, the OR operator is, of course, perfectly valid to use. There are some times, and I hate to see it more than others, joins particularly, left joins especially.
It’s almost like you should have just done a full join, probably, for a lot of those WHERE clauses. But when you use an IN clause, there’s a perfectly reasonable chance that… Sorry, when you use an OR clause, there is a perfectly reasonable chance that the operator will convert it to a series of SEEK predicates like IN1, 2, 3, or it might take IN1, 2, 3, and make it OR 1, OR 2, OR 3, and you don’t have to do anything about it.
Optimizers are just kind of funny like that. They try to optimize things, and sometimes they do better than others. The problem isn’t sort of generally when asking for IN or OR on a single column, right?
If you are the type of, like, dim person who likes to say where is null column something is null or is not null or whatever… or coalesce or whatever thing you do, even… I even see people do it with null if…
like, null if empty string is null. But if you say, like, where column equals empty string or column is null or anything like that, usually you’re better off just typing more, right?
So if you say, like, where column 1 equals something or column 1 is null, it’s okay. As long as you have an index that supports, you know, that, like, supports what you’re searching for, an OR expression on a single column is not going to ruin your day.
Where OR clauses tend to get tougher from a performance perspective is when they span columns, especially across joins, especially outer joins, because outer joins, you know, they do preserve nulls. So you end up with a weird sort of set of things to think about as far as, like, query performance goes there.
So if you were to do something like this, you might have a tougher time getting it to perform as well as… if that were just where t1.thing equals something or t1.thing is null. You know, at any point, though, sufficiently complex filtering will…
like I’ve seen so many times, and something that we’re going to cover in a couple few videos, is when you… someone decides to take a where clause or a join clause and turn it into a case expression, thinking that it will simplify things.
So it’s like 1 equals case when some conditions equals 1, or some other conditions equals 1, or some other conditions equals 2. And that’s when… that’s really painful overkill.
Just a plain old select… case expression, a select list, doesn’t cause nearly as many headaches, right? I mean, some types of, like, exists in subqueries… are exceptions to that, but…
for the most part, a select list… with a… sorry, a case expression, a select list, it’s not going to be a problem. The main issue, of course, that we’re talking about today is performance, and sometimes figuring out a reasonable index or set of indexes.
Sometimes… you know, it’s crazy to think that sometimes you might have one query that needs more than one index. You might need some number of indexes in order to make your query work reliably fast.
If you work… if you’re the… if you’ve ever done work with, like, interval-type queries that have a start date and a begin date, and you need to either find overlapping or non-overlapping or consecutive or non-consecutive things, you’re probably very, very used to the idea that your start date and your begin date are each essentially going to need an index that leads with them so that your searches can get to things in there as quickly as possible.
Now, sometimes it’s not just the indexes. Sometimes your queries, when you do write these types of queries, will need index hints both to use the correct index and to perform the correct seek or scan on that index.
Some people get really scared of that stuff, but, you know, the optimizer just sometimes will not make the right choice on its own. It refuses to.
Even though you know better and you run a query, and it’s better and faster when you add those hints in, some people will just say, Oh, well, you can’t add hints. It’s too scary for me. And so you end up with a query that is just perpetually slow.
So, you know, if you are… You know, if you have perfect indexes and perfect queries and everything’s working great, I’m very happy for you, of course.
Couldn’t be more thrilled. You’ve reached a level of nirvana, of spiritual perfection and enlightenment that I can only hope to achieve one day. But until then, I’m just probably going to try to keep getting paid to fix other people’s lack of spiritual enlightenment and nirvana.
Hopefully bring them a bliss that they’ve never known before. Get a back rub too.
So I already have this index created on the post table. It’s on owner user ID, last editor ID, and score. And this is the query that I am trying to get to support.
Sorry, this is the query that this index is there to support. So I’m going to show you something. And it’s going to be something that maybe annoys you.
I know it annoys me sometimes, and I see it, is even though this index leads with owner user ID, and even though we do have owner user ID as an equality predicate in our query, we don’t seek into this index, do we?
That is not what happens. We scan the index. All right, and our scan predicate is looking for a score greater than zero, and owner user ID equals 22656, or last editor user ID equals 22656.
Well, you’d think that we could at least be able to seek to one of those, but when you try to tell SQL Server to seek into this index, it says no.
The query processor cannot come up with a query plan that will help this thing. Why is that? Very strange. Well, it’s not because the index is created wrong. If we did create the index with the score column first, we wouldn’t be able to seek to that.
The thing is, with this predicate, most everything in the, like, rather, not most everything, but really a significant chunk of the post table does have a score greater than zero.
So we wouldn’t really get much out of that. It wouldn’t help us as much as you think it would. If we run this, where we’re looking for a score greater than 1,000, of course, things with a score greater than 1,000 are far less common in the user’s table, sorry, in the post table.
So this would result in a lot fewer rows, but we still can’t seek to what we care about. Like, it’s still just not an option to us with this index. So, sort of paraphrasing what’s in here, because, you know, I do need to finish these videos at some point.
It’s really tough to get a seek with an or clause sometimes. If we put the score column first, depending on what the score column is searching for, it might get better because, it might be okay, because we’d at least be able to seek to a small number of scores, but we still wouldn’t get the good results that we want to out of this query.
Because essentially, we have two separate predicates, right? Where owner user ID equals 22656 and score is greater than zero, or last editor user ID equals 22656 and score is greater than zero.
Like, we have this sort of double Dutch, double date weird thing that we have to figure out. If we express the query, and this is where I was talking about, where if you rewrite the query in a slightly different way, it’s a bit easier to see what SQL Server can seek to and what it can’t, given the index that we have.
So, let’s run the query like this. And I’m just doing the score greater than a thousand. It’s a little bit quicker this way. But essentially, we’re taking the query that we had written in one compact bit, and we are writing it in two separate bits, and we’re joining, we’re unioning those results together and selecting from the results of both of these.
So, we have one up here for owner user ID and one down here for last editor user ID. When we write the query like this, what we need to do becomes a lot more obvious. Even, it’s so obvious, even SQL Server can figure it out.
Except what SQL Server tries to tell us we need is an index on three key columns that covers the entire post table, including the body column, which is an envarkar max, including several other string columns in there.
If you’re the type of person who, you know, I don’t know, maybe you’re a consultant even, who’s like, oh, looked at the missing index request, going to create those, please don’t.
It’s stupid. They’re not good. They’re not good ideas. Even if you, like, consolidate them a little bit, you have no idea what query they’re helping.
You have no idea what the actual, like, the helpfulness of the indexes will be for the workload, for specific queries. It’s really just working blind, and you shouldn’t be doing that.
You should know better. Because people are paying you. They’re not paying you not to know better. They’re paying you to know better than they do.
If you add missing index requests just from the DMVs, looking at uses, or impact, or anything like that, you’re screwing up big time. Don’t do it. So what we can do, because we are smarter than SQL Server’s missing index requests, is we can create a couple indexes that will help our query.
Now, we’ve seen the query plan for the one before, right? We were able to seek into one part of it, and we had to scan the other. We had to seek to, we could seek to owner user ID and score, but we couldn’t seek to last editor user ID and score, because last editor user ID is the second key column in the index.
So we don’t have that in a helpful order unless we hit owner user ID first. The thing is, we’re not scanning to where owner user ID equals something and last editor user ID equals something.
That or condition means we could go here or we could go here and we need score, right? So we can only seek to one of those arrangements based on what the index was.
We can only seek to this arrangement because last editor user ID is the second key column. We cannot seek to this arrangement. Okay, so with these two indexes in place though, if we look at this query plan now, SQL Server will use an index union plan.
We can tell us index union because we have a concatenation here where a SQL Server basically adds this index to another. If this were an index intersection plan, SQL Server would join these two indexes together.
So we basically seek to what we care about in this index, right? Which is score and owner user ID. And we seek to what we care about in this index, which is last editor user ID and score.
And then we bring those two results together and we do a lookup for the rows that we, for the remaining columns that aren’t in those indexes for the rows that we care about. All right, so SQL Server is able to bring those two results together and then combine the clustered index to get us a full result for the query, which looks pretty close to what we would see in here, but just arranged slightly differently.
Let me actually run both of these together so we can see the differences. There’s not, there’s really not a big timing difference on these. If I, if I ran these enough times, well, actually, these are all at zero seconds anyway.
Sometimes they end up at like a couple milliseconds, but for the most part, the choice is either to concatenate the results of both of these seeks via index union, sort them down a little bit.
It’s only 16 rows, so no one really cares, and then do one key lookup for the 16 rows that we have. The alternative is to do one index seek and one key lookup, and then another index seek and another key lookup, of course, the first index seek is just like we’ve seen before.
Oops, that didn’t frame up too well, did it? Owner user ID and score, and then the second index seek is last editor user ID and score, and then we just do slightly smaller lookups for the rest of these, and then we return out the 16 rows that we care about for both of them.
So indexing for these types of queries is a little bit more challenging. Like I said, though, whenever you’re fixated on writing some OR condition into a WHERE clause, we’re going to talk about joins with OR clauses next, but whenever you’re fixated on writing an OR clause into a, writing ORs in your WHERE clause, you might want to consider performance, performance, and the way to consider performance, of course, is to write the query out with each of the specific WHERE clauses expressed the way that you would care about that, and if you do that, what’s going to happen is SQL Server is going to point out where you have good indexes to do things and where you don’t have good indexes to do things, and that can really help guide you to make better indexes to support queries that have this kind of extended, expansive logic in them.
So anyway, that’s about enough of this. In the next video, we’re going to be talking about joins with OR clauses in them, which are a separate nightmare, but we’ll have fun with that, and that’ll also be a bit more performance-focused, but, you know, we’re going to…
we’re going to stick to the sort of script of the series and make sure that we all understand that there is a correct way to write these sorts of queries.
So thank you for watching. I hope you enjoyed yourselves. I hope you learned something, and I will see you in the next video about joins with OR clauses in them, and we will all have a good laugh at the end if we make it to the end.
Really, anything could happen, right? You have about 15, 20 minutes in which, I don’t know, the world could end. The lights could all go out.
It could be a solar storm. Anything. It would be awful, but hopefully not. Anyway, thank you for watching.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.
How To Write SQL Server Queries Correctly: Views vs Common Table Expressions
Thanks for watching!
Video Summary
In this video, I dive deep into the world of views and Common Table Expressions (CTEs) in SQL Server, addressing some common misconceptions and providing practical insights. Erik Darling from Darling Data shares his experiences and observations on how these objects are often misused or misunderstood by developers. Whether you’re a seasoned DBA or just starting out, this video offers valuable perspectives on when to use views versus CTEs, the importance of avoiding materialization issues, and the potential pitfalls of using `TOP` in views. I also explore why there seems to be a bias against views among some developers while CTEs are often embraced without question. By the end of the video, you’ll have a clearer understanding of how to leverage these objects effectively for better query performance and maintainability.
Full Transcript
Erik Darling here with Darling Data. And in today’s video, we’re going to talk about two of my frenemies in SQL Server, views and CTE. There are some things to discuss with the views and CTE that I think are important for people to know. And we’ll do that today. in some level of detail that will be too much for some and too little for others. But guess what? It’s a free video. I’m doing what makes me happy. So if you are also made happy by this video, and gosh, I hope you are, you can become a member of this channel for as low as four, that is, quattro dollars a month. there’s a link in the video description somewhere in this general vicinity.
You can click on that, become a member, and that’d be cool. It’d be very kind and giving of you in the holiday spirit. If you like to show your holiday spirit in different ways, different holly jolly ways, you can like, you can comment, you can subscribe.
That’s all right, too. If you are watching this video, it’s like, I mean, I know this is going to get published the day after Christmas, but who knows when you’re watching it.
But let’s say you’re like, wow, we have all this New Year’s budget. How are we going to spend it? What about on your SQL Server with spending some quality time with Erik Darling from Darling Data? You can hire me to do all of these things.
And I do them the best in the world, because I’ve seen what other people do, and it sucks. If you would like some very high quality, very low cost, SQL Server performance tuning content, again, there is a link in the description below where you can get all of this stuff combined for you.
But it is about 150 USD for the whole thing, and that is for the rest of your life. So live a long time.
Be happy. Again, this is being recorded, well, this is being presented at the day after Christmas, so I’m going to be huddled up somewhere, probably drinking red wine and staring dreamily as my children ungratefully unwrap presents in Paris.
So, no, that’s my plan. I don’t know what you’re up to. And with that out of the way, let’s have fun.
Let’s talk about CTE and views, because I suppose that is what we came here for, isn’t it? So one of the most exhausting parts of my job is, you know, being like the groundhog DBA who has to sort of say the same thing to different people over and over again.
It’s like starting from scratch. It’s like, I get really good at the piano, but the world around me is just starting from scratch, and no one realizes how good I am at piano.
So there are a lot of notions that people have about SQL Server that are both untrue and untested by them. It’s like, oh, I thought I read a thing that said that once.
I’m like, oh, can you show me the thing? No, of course not. It’s all in the sands of time. But the way that I think about views and CTE are since when you make a view, you write create or alter view, and then you put the query in it, which hopefully doesn’t contain too many other views, then hopefully the view definition doesn’t contain anything too outrageously awful, because Lord knows they have that.
People have a propensity for putting all the worst things into their views. They are like a permanent home for your query. They are not a permanent home for your data, because your views are not self-materializing.
And you have to go through great troubles and lengths that Microsoft should apologize for in order to create an index view. But it’s a thing that lives in the definition of your database.
It is not a physical object, but it’s like a stored procedure doesn’t materialize the data that the stored procedure selects and does stuff with.
View doesn’t either. But CTE are a bit more like mobile homes, because you can take one and you can park it anywhere, and it doesn’t actually live there.
You can put it over here. You can put it in a stored procedure over here. You can put it in a stored procedure over here. You can write it wherever you want. Just pull up, park it, have it make performance suck there too. It’s sort of like installing a toilet, right?
There’s a time and a place for a toilet, usually in the bathroom. If you use the same amount of discretion with toilet installetry that you use with your views and CTE, and you say you plop it right in the middle of the kitchen, probably don’t hook any pipes up to it, just leave it there, it’s going to look stupid immediately, and eventually it’s going to stink in your kitchen.
Probably not what you want next to the dinner table, is it? Unless… Unless… Oh, I don’t even want to go down that path.
Since views are programmable objects, right? They are actually modules in your database. They do have a little bit more depth of character and flavor to them than CTE.
For example, like when I talked about views in another video, you can add a with check option to have SQL Server do stuff with the data in the view when you modify it. You can also index a view.
You cannot index a CTE. Contrary to what I’ve heard said at several live in-person events and read in several places, views in CTE actually can use the indexes on the underlying tables that they select from.
The data does not become an amorphous blob anywhere. You can use the indexes there. It’s pretty spiffy.
Yeah. All right. What I find particularly curious about the view and CTE thing is how developers are sort of racist against views in a way that they are not against CTE.
So like, like I’m pretty sure that this is just like a random conversation that has happened 5 million times in the world. One developer will be like, just wrote this short procedure.
It has hundreds of CTE in it. And another developer will be like, wow, that’s amazing. You’re the best at SQL. These CTE, dog, they’re so readable.
I can really, really understand all this query. I really can’t. And then if the developer did the same thing, was like, hey, this database has hundreds of views in it. Developer will be like, man, why, why, why, why you gotta like, you know, mess up the database with all these views?
What’s wrong with you? That just makes things hard and complicated and unreadable. So like, it is weird that despite them having so much in common, and despite views having a leg up on CTE in several ways, you know, people are sort of aligned against them.
But, you know, even I kind of get it because I cringe a little bit when I see that, when like, someone’s like, I don’t know, I have this simple query, but it takes forever.
And I’m like, oh, okay. And it’s just like, you know, select stuff from a thing. And then you’re like, oh, well, I bet it’s just missing an index. And so like, you go to get the estimated plan, and it’s like, see what’s going on.
And then the estimated plan is like, gigantic, like, like one of those like open world video games where the map just keeps getting bigger and bigger and like, fill, and then that’s the query plan.
And like, you have to like zoom all the way out to even be able to grasp the full size of it. So, so I understand because views have been abused so horribly, but, but, but, but, but, you know, by the same token, CTE in my experience have been abused, just as horribly by people.
It’s just easier to see upfront and it’s not more readable. One thing that I see quite a bit is people still trying to stick top 100% in a view, thinking that it will present their data in order when they select from it.
It won’t. You need the outer order by no matter what. But one thing that I want to show you is like, if let’s say that we have a select top one in a view, and then we have a select top 100% in a view, if I show you, I’m not going to get the actual plans for these because this one will select 100% of the 2.4 million rows out of the users table.
And I don’t want to sit here for that. But if I show you the estimated plans for these, you’ll notice something, a slight difference between them.
The first one has a top operator in it, because there is a top that is actually honored by the optimizer. The second one does not have a top operator in it. The optimizer throws top 100% away, because top 100% means the whole damn table.
That does not mean there is no need to do a top operator in there, because we are selecting everything. If we replace this with top 99%, well, that’s a huss of a different color, because SQL Server actually has to figure out 99%.
It just has to do it in a real ugly way. You don’t want SQL Server to scan the entire clustered index, spool the entire clustered index, 99% of the clustered index into an eager table spool, and then have the top read 99% of the rows from there.
That’s a bad time. So we’re going to say that’s a bad idea. We’re going to not go with that idea. I’m going to change that back, because I don’t want that to accidentally be there and have anyone…
I don’t want to, like, die and have anyone go through my database and be like, he had a view with top 99% in it. Glad he’s dead.
I don’t want that to happen. So one other thing that’s important, and let me just get rid of that red squiggle down below. One thing that is important to understand about views and CTE, and this is something that I’ve said, a point that I have belabored, that this dead horse is well fed, that they don’t materialize nothing.
So when you reference a view, or… Excuse me. I’m very dry in here. This winter heat stuff just…
Despite there being a prevalence of steam pipes, no steam releases from the pipes. It’s just dry, dry heat. Because views and CTE do not materialize data, every time you reference them, or every time you access a view or CTE, the entire query inside it has to run.
I have written an unnecessarily large query in here, but if I go and create this, and I repeat that same thing, this is using a CTE, of course.
This is using the with syntax to create a CTE. And then I either join that CTE to itself, like so, or I join the view to itself, like so, and we look at the query plans.
We will see that the query plan for both of these, when it finally does a thing, will repeat itself for both of them. We have the set of joins from the first time we talked to our CTE up here, and we have the set of joins from the second time that we talked to our CTE down here.
We have the exact same pattern repeat in the view. It’s the same query plan. This is the first time we touched the view. This is the second time we touched the view. Every single one of those joins has to happen all over again.
It’s not a good time. And I see people do this constantly with both views and CTE where they’ll take the view. They, oh, all I need to do is run the view with like this where clause in a CTE, and then run the view with this where clause in this other CTE, and then join one to the other.
And you have this query plan that, again, just defies all logic. Well, I mean, it doesn’t defy logic because I know what’s going to happen, but it really, I think the better is, it defies like reasonable analysis because you’re just looking at this giant query plan and going, why would you do this in the first place?
Why would you want to hurt SQL Server like this? So not a lot of difference here. No matter which one you use, there is no materialization.
You can at least materialize a view. You can index a view. Again, a lot of rules to follow there. You cannot index a CTE, like the definition of it.
You can’t say like with CTE as, and like define an index in the CTE definition. But both of them, both views and CTE can use whatever underlying indexes you have.
They’re often responsible for either one being successful. But of course, the crappier code you put into either a view or a CTE, the worse off you are.
So neither one is going to turn out well. I don’t really know. Let’s see. Yeah, okay. Well, I mean, one other good point in here is that if you use views and you hire a young, handsome performance tuning consultant like myself, you take a look at my reasonable rates and you think, gosh, how can we afford not to?
And let’s say I come along and I performance tune a view. Everything that relies on that view will get faster. If you just sprinkle CTE everywhere, like kitchen toilets, guess what’s going to happen?
I’m going to have to go through every place you use that CTE or a CTE, and I’m going to have to fix all of those individually. You can imagine which one is a better use of time. So there is that to consider.
Anyway, thank you for watching. I hope you enjoyed yourselves. I hope you learned something. I hope that you will practice better diligence in your use of both views and CTE.
And I also hope you keep watching. So let’s do all those things together. Go team.
I’m not much of a high five guy, but I don’t really have much of a… It wouldn’t make sense for me to do this. Like really for the camera, the high five or the fist bump is the only thing that makes any sense.
So anyway, let’s go record another video. We’re going to talk about, I guess, or and where clauses next. Ooh.
It’s going to be perf heavy topic, isn’t it? I’m going to have to get into that one. Anyway, thank you for watching. Goodbye.
I love you.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.
How To Write SQL Server Queries Correctly: INTERSECT And EXCEPT
Thanks for watching!
Video Summary
In this video, I dive into the lesser-known but incredibly useful `INTERSECT` and `EXCEPT` operators in SQL. These operations are particularly handy for comparing columns across multiple queries without the need for verbose and error-prone null handling logic. By exploring these operators through practical examples, you’ll see how they can simplify complex comparisons and improve query performance. I also delve into their unique operator precedence rules and discuss when to use them effectively in your SQL queries. Whether you’re a seasoned database professional or just starting out, understanding `INTERSECT` and `EXCEPT` will give you a powerful toolset for writing more efficient and readable code. Additionally, I touch on the recent additions like `IS DISTINCT FROM` and `IS NOT DISTINCT FROM` in SQL Server 2022, highlighting how they further enhance T-SQL’s capabilities. By the end of this video, you’ll have a solid grasp on when and how to leverage these operators for your own projects.
Full Transcript
Erik Darling here with Darling Data. We’re going to do a great job today. I can just feel it. I’m feeling in my bones. And I can tell we’re going to do a great job because we’re talking about a subject that almost no one ever talks about. Intersect and accept. You might notice that that word looks a little funny. Brevity is the soul of wit, I think. So, you know, these are not things that I just want to say. I see people use in production queries terribly often. Most of the time when I see people use these, they’re trying to compare, like, the entire contents of one table to another and try to figure out where there are differences. You don’t see people use these a lot in, like, smart ways in queries. They would much rather sit there and, you know, figure out if columns match or are both no. Or if, like, like, like, like wrapping things in is null and coalesce like, like the fools they are and these giant or clauses. And, and gosh, gosh, is it nice to just retype those as, as intersect or accept queries because not only do they, they look prettier, but they often perform better. And there’s often a lot of, a lot of room and potential for logical incorrectness.
in, uh, extended and, uh, or clause queries, uh, that, you know, uh, usually you’ll find a couple few bugs in those. So, we’re gonna have fun today. But before we have fun, we need to talk about, uh, stuff you can buy from me. Yeah, cause everyone’s trying to sell you something. I guess that, that includes me. Merry Christmas. Uh, if you would like to become a member of this channel, uh, and, and, and support my, my, my, enduring efforts to bring you quality SQL Server content, uh, there’s a link right in the video description that says become a member in which you can become a member for as low as $4 a month.
If the $4 a month is just too rich for your blood, uh, you can like, you can comment, you can subscribe. There are all sorts of buttons you can push that push my joy buttons, which is just the, probably one of the, the best things you can do in life, right? Uh, what is it? It costs nothing to be kind. So, it says the sages of every social media platform overrun with people, uh, being selectively kind. Uh, if you need help with SQL Server, if you’re looking at your SQL Server and thinking, gosh, this thing is slow, uh, Erik Darling, uh, from Darling Data does all of these things, the best in the world.
Uh, so you can, you can, you can hire him, me, us, as, as a, as a package and get this handsome devil to show up and make your SQL Server faster in exchange for money. And as always, my rates are reasonable. Uh, if you would like to get some training on SQL Server, I have about 24, 25 hours of it. Uh, there is a discount code you can use to get 75% off that brings it to about 150 USD and you get that for the rest of your natural life.
Uh, again, it is the end of 2024. I might just take this slide out because I’m getting sick of saying it. 2025. But, you know, then if I, if I stop saying it, maybe you’ll forget that I, I, I come places and I speak and also in exchange for money. Uh, so if you would like me at your event, tell, tell me what your event is.
With that out of the way, let us intersect and accept ourselves into oblivion. Now, one of the first times I actually ever saw someone use, uh, these operators was, of course, uh, my, my dear friend and my, my wine distributor from New Zealand, uh, Paul White. And, uh, one thing I want to note is, uh, you see that URL up there? Um, it’s, uh, right, right, right about there.
Uh, I want you to just take a quick look at the date in here. 2011. So June of 2011. Uh, so there, originally I was going to put a trigger warning on this because, uh, you know, obviously code from, uh, gosh, almost 15 years ago, uh, does not live up to today’s modern standards of code formatting style choices.
Uh, but instead I have reformatted this code to fit modern conventions, coding standards in T-SQL. Um, I’ve also taken the liberty of replace, uh, in the original, uh, queries, the, the table, the temp tables were table variables. I’ve only, I haven’t changed those for any overarching performance reasons only to make each of the select queries below a little bit more portable.
Okay. Uh, because otherwise I would have to declare and populate and run the select queries for each of the things below. And that, that just gets a little unwieldy. So, uh, we’re going to, uh, just do this instead. Uh, we’re going to make our table variable, our temp tables, not table variables. Uh, and then we’re going to insert some data into them.
Um, and then, uh, so the, the cool thing about Paul’s post is that he walks through like the query that you people, a lot of people would write, uh, that would get them the incorrect results. Right. Like this, because this does not account for nulls. This just returns us one row, uh, for the five that we put in both of these. This messes up anything where there are nulls and, and, and that doesn’t get us what we want back.
Right. This is obviously an incorrect result. Uh, you could rewrite the query. And this is how I see a lot of people do it where they have this gigantic or clause. A lot of ends sprinkled in a lot of people don’t get their parentheses, right?
Or don’t get there, or maybe they like copy, there are copy and paste errors in here. There’s a lot of things that can go wrong when you start writing, uh, non-trivial complicated queries like this. Uh, so this will get you the correct results.
Um, what is a little annoying here is this filter operator. Which expands out that entire or, the, the, all of, all of those or clauses for a couple of five row tables. Obviously this doesn’t make a performance difference, but when you’re dealing with larger data or darling data, which is the largest data you can get, uh, then these things do start to make a big difference.
This does get us correct results, right? This does bring us back everything that we care about. Uh, this will also get you correct results.
But as I’ve said in 10 million videos at this point, wrapping where and join clause columns in, uh, in functions, even built-in ones is just a recipe for performance disaster. Uh, so we can run this. We can get correct results, uh, but we would end up with, um, uh, just a weird sort of bunch of stuff happening, uh, in here.
There we go. So, uh, we have this whole predicate against this table, which I, which I suppose is a step up from a filter operator. But this whole predicate is just a series of case expressions for all the coalesces that we had in there.
So, uh, you can see the various case when, blah, blah, blah. Uh, case expressions are another thing that I, I, I strongly advise against putting into your join and where clause column, uh, predicates because you will, you will likewise be in for a bad time. Uh, there is also an example with isnel, and isnel should look roughly the same, except rather than a series of case expressions, we will have a predicate that is a series of isnels.
This is not necessarily a better situation. They’re both about equivalent performance wise once data becomes of a certain size. So be careful with that.
Uh, this is a much better way of writing the query because intersect will correctly deal with the nulls and it’s a whole lot less typing and a whole lot less error prone. So if we run this, we get back the correct results and we don’t have any weird stuff.
Uh, in our query plan as far as like gross predicates go, uh, all of this stuff just gets evaluated, uh, in a nested, in the nested loops join and we evaluate everything across there.
So it all ends up being fairly easy and straightforward. Um, there is sort of an unfortunate thing where not exist does not, uh, provide us with, uh, the entirely correct results on this.
We get back, um, an additional role that we don’t care for there. Uh, so, you know, uh, this is the best version of the query that I think could possibly happen. Intersect and accept are very useful for these sort of like, you know, uh, extended column comparison things.
So, uh, like I said before, um, I’ve never, uh, seen anyone really use these like when they should have, uh, in production queries.
Um, I think probably part of the problem is that it’s somewhat unclear what they do, uh, when you’re reading through any sort of SQL guide, uh, or rather even when you’re, you’re just looking at like, um, when the linguistic possibilities of SQL, you’re going to look at like, you know, stuff like join and exists and not exists and where and group by and order by and all the other words that come up.
And you’re going to say, huh, those make sense. I can sort of figure out what they do there. Intersect and accept.
it’s not really clear what they do from how they’re named. Um, uh, there are some weird rules around operator precedence with these two things. uh, has, uh, very specific operator precedence rules that other things don’t.
Uh, intersect will give you a unique set of rows from both queries and except will only give you a unique set of rows from the first query. First is going to be an air quotes there.
Uh, because, uh, like, you know, we’re just, we’re thinking about the queries as written in order. So the first one that you do, then the except, then the other thing, but then you could put other accepts and intersects below that.
Really, we just, we care about like the first one with except. Um, but the, the set of rows that you get back from them is going to be uniqueified.
So, uh, these are operators that work, uh, somewhat better or reason, not somewhat better.
I would say, uh, somewhere, somewhere between somewhat and profoundly better. If you have primary keys or unique indexes defined on the columns that you’re looking at, or at least one of the columns that you’re comparing there.
Uh, but I think probably the best part about them is that they handle null comparisons, uh, without a lot of crazy syntax, like we saw above with the ands, ors, and the is null coalesce and blah, blah, blah, blah, blah.
Um, that what is tricky about them is of course, figuring out when you should use them and what order to write things in. Often this takes quite a bit of experimentation to get, to start to get a real handle on.
And if you don’t use them for a few days, you will probably lose that handle entirely and have to come back and start from scratch. That’s sort of like when I have to write any XML query, I’m like, Oh God.
And I have to go reference everything that I’ve ever done before to figure out what I need to do now. But, uh, just to give you a couple examples of what they do, uh, let’s have, uh, let’s just do this.
And let’s say we want to intersect, uh, everything with the score over two with everything with the score over three. Uh, and we, we, we get the results we want.
Down here, we get about 6,500 rows and the old, but the, all the results are going to be everything where there’s a score over four from this, from both of these queries, right?
So this one is looking for greater than two. This one is looking for greater than three. And where those two results set start to actually find matching rows is when we hit a score of four. So everything in here is going to have a score of four and up, right?
So pretty easy stuff there. That’s where these two results start to overlap at four. We have greater than two. We have greater than three. What’s after three, four. What’s a couple after two, four.
That’s where we start showing what’s going to come back. Um, and like I said, uh, the user ID column, which is nullable and has nulls and it does not give us any issues here. Um, so if you, if you are writing queries and you have, find yourself having to like, you know, that compare, uh, columns like this, uh, especially like across a number of columns, often intersect or accept would give you better performance and, uh, better sort of, uh, code clarity and everything else because they handle nulls without a lot of, uh, overly verbose, uh, you know, uh, querying or adding in functions that replace nulls with canary values.
Uh, so this is, uh, this is the same two queries this time just using except. And, uh, this is going to give us just results from the first query, right?
Just from this one, right? Because we want to see everything from here except what’s in here. So we don’t show anything at all from this query. We only show the stuff from this query. Now this is only going to show results again from the first query, big air quotes there.
Uh, you could also call it the left, the left most query or the outer query with a score of three, because that’s the only data that exists in it. That’s also in the second query or the inner queer, right?
Or that’s, that’s not also in this one. So this we’re saying greater than three, this is greater than four. So where these start to collide is just at score three, right?
We’re only showing the threes in here. Um, so it is a bit like using not exists and that the rows are only checked from the second, from the, from this query. They’re not, we don’t project anything out from this one.
Uh, and again, the nulls are handled quite well. Uh, SQL Server 2022, uh, wow.
They, they, they really modernized T SQL with this one. Um, they added the, uh, is distinct from and is not distinct from syntax to SQL Server 2022. And I suppose at this point, we could be happy that they showed up at all because is distinct from was introduced in 1999.
99. This thing was old enough to drink before Microsoft got around to adding this in. And, uh, actually this one, this, this one was too.
Okay. I guess this would have been the funnier joke. Uh, this one was added in 2003. Well, I guess in 2022, it wasn’t quite old enough to drink. I guess it depends on where you live. If you’re, if you’re in those devil may care European countries where teenagers just get sloshed in the streets.
Well, my, actually that sounds kind of good to me. Uh, that’s what I did.
And I, I just had to find sneaky ways to do it because I didn’t want to get busted by the federal allies. Uh, so I mean, I understand no, no database really adheres like perfectly to ANSI standards, but waiting like 20 years to get, uh, basic syntax added to T SQL is, man, it is the, the choices Microsoft makes wild when it comes to this stuff.
Uh, I realized that like, you know, people aren’t exactly clamoring for like some of these things. things, but just like basic stock functionality where like, if, if you wanted to, you know, bring an application from, uh, like Postgres or Oracle or DB2 and put it in SQL Server and like, like, you had queries with this stuff written in there and you were like, Oh, that’s broken.
How do we rewrite it? It’s like, great. Awesome. Right. Good stuff. Good stuff.
Good stuff. Microsoft. We really appreciate all the wonderful things you’ve added to the product in the meantime that have changed everyone’s lives. So anyway, they are, the, the, the, the is distinct from, it is not distinct from what will be useful to someone someday when they start.
I don’t know. At this point, the SQL Server 2025 is probably going to be where people go. Not a lot of people went to SQL Server 2022. Apparently everyone was pretty happy with 2019.
Uh, I don’t know. I don’t, I don’t know fully what 2025 is going to bring us. Uh, there was just the announcements that it, it ignited fairly recently, which were, uh, I, I suppose, whelming at best.
Uh, of course, there was a lot of talk about, uh, fabric and AI, which, um, are two stupid things that, uh, of course being stapled onto, into SQL Server or SQL servers being stapled into them.
I don’t know. Uh, but there, there wasn’t a lot of, uh, wasn’t, it wasn’t a very good highlight reel about what’s actually in SQL Server 2025. Aside from stupid current hype cycle memes that are being shoved down all of our throats with, uh, very little care.
Anyway, uh, let’s look at how you can use is, is not distinct from, uh, I don’t, I don’t know if I have an is distinct from example here, but, uh, let’s say that, you know, one of my least favorite things to see in, in, in a join is an or clause.
All right. Something like this. Yuck, grotesque, awful, you do this sort of thing. You should have your keyboard removed from your hands, placed elsewhere.
Uh, but let’s run these two queries. And do, do, do, do, do, do, Yep.
Okay. Uh, there we go. They finally finished. So, uh, this first query, uh, took 6.2 seconds. This is the bad query with the or clause in it.
Uh, and it has all of the hallmarks. I’m going to make this a little bit bigger so we can see both the queries on, both the query plans on one screen here. Uh, this, this, this has all the hallmarks of a join with an or clause.
Uh, uh, we start off scanning the, well, I mean, you’re not always going to see a scan. You, you could technically see a seek here if we had a where clause on the user’s table, but we take all 2.4, some odd million rows from the user’s table.
We feed them through the constant scan operator twice. Notice that we have two sets, two full sets of rows from the user’s table. We have, um, uh, one set to find, uh, u dot ID, uh, in here.
So that’s, that’s quite a bit of not fun. And then down in here, we seek into, we, well, we do a bunch of work, right? We try to collapse all this stuff into here and we can catenate them and we sort them and we merge interval them to remove duplicates.
And then we spend time in here doing a nested loops join to the post table over and over again. And then finally 6.2 seconds later, we come up with a result set.
Hooray, hooray, hooray. We did it. Uh, this query down here, now granted, you know, there, there might be some indexing stuff we could do that would make this a little bit happier, but, uh, this takes about two and a half seconds, a pretty good improvement from six seconds, right?
Maybe, maybe not where I would stop if I were, you know, if I were being paid to tune this query, but I would, I would at least say, Hey, we could just do a little rewrite and get this in a better place. So that’s, that’s cool there. So I am excited for, to be able to start using these.
Uh, there’s just not a lot, a lot of opportunity for that yet. Um, you know, uh, I, I, I think that, um, T SQL additions like this should get back to, backported because like, they’re, they’re important, but, um, what do I know about backporting?
I’m just a, just a guy who writes door procedures. Uh, you could, there’s also, you could also read, it’s in the spirit of intersect and accept. We could also rewrite the queries in this way.
Now, uh, uh, that what’s, what’s fun about when I say fun, again, lots of air quotes in this, this, this video.
Uh, what’s, what’s interesting about query tuning these days is, um, the amount of things that may or may not kick in depending on how SQL Server feels at the time. Uh, things like batch mode on rowstore are, unless you force the issue with like, you know, some sort of, uh, column story thing, uh, it is entirely based on heuristics.
And so you can end up with like just strange performance differences, like for two queries that are, do they do sort of equivalent things like the join with the or clauses just get out. But if we were to run these two queries we’re using is not distinct from here.
And we’re using sort of the point of the video intersect here. All right. And if we look at these two query plans, this one runs for 2.6 seconds, just like above.
And this one runs for six, 600 milliseconds, which is quite different from above, right? Like the intersect query went faster. The intersect query went faster because SQL Server naturally chose batch mode on rowstore for this, for this query plan.
We have the batch there. We have the batch here. We have the batch here. Uh, I don’t think anything down here is eligible for the batch, except maybe reading from some of these, but, uh, we go, so we got, we got the batch up here when we read from the post table as well.
So SQL Server was like, yeah, batch mode on rowstore. Sounds great. Did not choose that here. Uh, and like I said, we could, we could force the issue, um, running, uh, this query and doing a funny little left join to a empty table with a columnstore index on it.
And if we look at the execution plan, now that we’ve got some batch mode in here, this one is all of a sudden competitive with the, uh, the intersect version up there.
So like I said, query tuning these days, real fun. Cause who knows what tiny little change you can make that will awaken SQL Server’s senses and say, oh yes, we should use batch mode.
We’re doing stuff with a lot of rows here. That would be smart. Um, so, you know, uh, keep an eye out for these things. Uh, I don’t really know where else to go with that.
Um, it’s a good, it’s a good time. It’s a real good time. Uh, so, uh, that’s about it for intersect and accept.
Um, I’m going to wrap this one up and I’m going to apparently talk about views versus CTE, which will be a rollicking ride. Uh, we’ll do that, I guess, as soon as this uploads.
So, we, we, see you then. Thank you for watching. I hope you enjoyed yourselves. I hope you learned something. I hope that you will expand your SQL vocabulary and start using intersect and accept. And I hope that if you, as soon as you are able to, you’ll start using is distinct from, and is not distinct from, as they become linguistically available to you.
Whatever SQL Server version or edition you end up on next. Thank you for watching. there’s a bright at heart and começa the людocker option, I wonder if or not could help you know why. A pastidade or assistant speakers,
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.
How To Write SQL Server Queries Correctly: UNION and UNION ALL
Thanks for watching!
Video Summary
In this video, I delve into the nuances of using `UNION` and `UNION ALL` in SQL Server queries. Erik Darling from Darling Data shares insights on how these operations can affect query performance and result correctness. I highlight the differences between `UNION` and `UNION ALL`, explaining why `UNION` adds distinctness at the end, while `UNION ALL` simply concatenates results without deduplication. The video also explores various placement scenarios of these operators within a query to demonstrate their impact on execution plans and performance.
Furthermore, I discuss when it’s appropriate to use `UNION` versus `UNION ALL`, emphasizing that most of the time, `UNION ALL` is more efficient due to its lower overhead for deduplication. However, there are cases where ensuring a unique result set can optimize other parts of the query plan. The video includes practical examples and SQL Server 2022 function demonstrations to illustrate these concepts, making it easier for viewers to understand how to write queries that return accurate results efficiently.
Full Transcript
Erik Darling here with Darling Data. And we’re going to talk in today’s video, continuing on with our series about how to write queries correctly. And this, of course, you know, comes down to two things, like both getting an accurate, correct result, you know, according to the logical demands of the query, and also having it return data to you in as efficient a manner as possible. And today we are going to talk about Union and Union All, or as they say in the South, Union Y’all. All right. If you can forget that joke happened, and you would like to support this channel, not for the jokes, but for the SQL Server information, or if you like that joke, you can do it for the jokes, too. I don’t care.
Whatever your motivation is, is fine with me. There’s a link in the video description where you can become a member of the channel. If clicking that link is too hard for you, perhaps clicking other things you’ll find a little bit easier. Liking, commenting, subscribing, all wonderful things to do. If the topics in these videos are near and dear to your heart, and you’re having SQL Server issues that you think a young, handsome fellow like Erik Darling from Darling Data could help you solve in exchange for money, I am available for hire for any and all of these things. Have a great time with me together, one-on-one, solving all sorts of stuff.
Training. Good to have. Better to watch. I had someone email me and say, hey, when I go to your training site, I can’t add anything to the cart, and I can watch all the videos. Did I buy this before? And so I went and looked in my receipts drawer, and lo and behold, they had purchased the training in December of 2020. So I said, yes, you did purchase this.
Merely, I mean, just like on the cusp of four years ago. Training works best when you actually watch it. It’s a participation sport. You have to be involved. Otherwise, you still know nothing. No upcoming events. 2025, talk to me. Tell me where to go. I don’t know where you are or where you want me to go.
I’m not endowed with psychic abilities, though I wish I were. So you have to tell me where you would like Erik Darling to be. With that out of the way, let’s talk about union and union all. So union and union all are funny because they kind of get used interchangeably in almost the same way that a lot of other things are, with very little consideration for performance or result correctness or other things.
So like CTE, temp tables, temp tables, table variables, parameters, local variables, all sorts of things that, you know, joins and exists. People just, you know, start writing a query one way and then they just always write the query that way. I remember a long time ago I read it. I used to play drums when I was a kid.
And I was reading an interview with a drummer in some drummer magazine and he was like, Ah, yeah, you know, I was playing in this cover band and, you know, we were like, you know, it was good because it was a very successful cover band. But, you know, every time I sat down to play drums, the only thing I could think of to play were like the drums to these cover songs.
And I was like, like, wow, that’s depressing. And then I realized, wow, that’s how people write queries too. They sit down and they’re just like, Oh, I’m going to play Copacabana again, I guess. So there are lots of times when I see developers use union, like when the results have absolutely no chance of having duplicates in them.
They either like join different tables together or have different where clauses, or sometimes they even have like literal values in the select list. They’re just like, you know, like, you like, how can you how can you make those distinct? How can you do? How can you do that? It’s weird. You already have distinct in the select, how are you going to make it more distinct?
And, you know, a lot of this stuff comes from testing queries in isolation, not really knowing that, you know, what when you should use exist first joins, because they haven’t watched the other video in this series about exists and joins. And, you know, once you start getting other things involved, like no lock hints, you could just end up with crap everywhere.
Bad things popping around everywhere. Now, how to write queries correctly does depend on a number of things. A, the quality of your data in general.
B, the quality of the data structures that you have available. That largely means the indexes. And also, you know, like, what returns a result to you as quickly as possible that is still logically correct. This can all become really difficult to figure out.
You know, especially the more complex a schema is, the more things you have to get involved, the more calculations you have to do. Knowing what, I think, really the hardest part about writing a query, aside from knowing, like, the SQL behind it, is knowing what the correct result should look like. It’s a very hard thing to define up front.
You have to, like, write a query, get a result, and probably show it to someone who’s like, who can be like, yeah, I don’t know either. Probably. I guess it’s right.
Seems fine. You know, validating query result correctness is a challenging, challenging thing. I’m happy when my rewrites just match what the slower query returned.
I have no idea if that’s right or not. I’m not checking that. I’m just making sure that what, like, the same number of rows and the same data output is as far as I usually go. Because I don’t usually get involved enough to, like, dig deep into, like, someone’s data and understand what an actually correct query result would look like.
It was sort of a funny story where there was one client of mine who had a very, very slow recursive CTE, which I helped fix up. And the original version of it would just never return a result. It wasn’t like an infinite loop thing.
It was just slow as hell. And when we got it right and we sped it up, they were like, I think the logic in here is wrong because these results aren’t correct. So we had to, like, redo the recursive CTEbecause they were like, yeah, this isn’t showing us the correct thing.
So that was fun, too. So let’s use a rather nifty SQL Server 2022 function. So let’s start talking about union and union all and how they differ.
And actually, union and distinct is what we’re going to talk about first. So I’m going to, you know what, I think I actually already did this stuff. If we look at these two query plans, there should already be stuff in these tables.
Yes, there are. Okay, great. So we have two queries here. The first one says select I from T1, union select I from T2.
The second one says select distinct from T1, union all select distinct from T2. The thing that I want to show you here first is, of course, the results. We get very different results back.
The first one just returns 1, 2, 3, 4, 5, 6. The second one returns 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6. So even though we had 1 through 5 in this table twice and 1 through 6 in this table twice, these queries made different things unique at different points.
If we look at the query plans for these, the union query adds its distinctness at the end across both result sets. So we not only deduplicate the results from each table scan, but we deduplicate the entire result here. For the two distinct queries, the distinctness happens here, not at the end.
Right? So when we say select distinct from this one, select distinct from this one with a union all in the middle, we make this distinct and this distinct, but we don’t make the final result distinct.
We just spit back whatever these two things put together. So distinct and union all do make things, distinct and union do make different things unique. Most of the time, if you are using union, rather all of the time, if you are using union between two queries, you don’t need to also add distinct to the select list because you’re already going to get that at the end.
You might just be doing weird extra work. Another thing that is kind of fun about, is about union and union all placement in a query. So if we look at, oh, I should have highlighted the whole thing, shouldn’t I?
If we look at this and we look at the query plan, we just get a constant scan. If we were to quote any of these things, then, you know, it wouldn’t make a difference. But if we start changing these, we will start seeing slightly different results.
For example, this one removes an extra one and two, or sorry, removes an extra one, but we still have two twos and two threes. I think the execution plans for these are just kind of funny because we have two constant scans and a distinct up here, but then just a concatenation for the other two things, which are down below this.
If we quote this one out and this one in and run this, the execution plan now has this with the distinct over here, right? We have a constant scan, constant scan, concatenation, constant scan, concatenation, and the results are one, two, three, three. So we have removed some additional, we have removed that extra two from the query results.
And of course, if we put the union down here at the very end, we will see a slightly different query plan and a fully deduplicated result set. One, two, three. We look at the query plan for this.
We have one constant scan, one constant scan, and then the duplication at the end, just like with the query where we selected from the temp tables. So that’s fun right there. Now, there has been quite a lot of performance talk about union and union all over the years.
Well, I do agree that most of the time, as long as you get correct results, union all is going to be a bit cheaper on you because SQL Server will not attempt to deduplicate the result sets. Even attempting to deduplicate unique result sets, if you have a bunch of string columns involved, can be rather painful and unwieldy.
If you are in the habit of writing union queries or even putting distinct in your select list, I would really strongly encourage you to think about the number of columns you’re selecting, the data types of the columns you’re selecting, and what actually identifies a unique row. Because you might be doing distinct over a bunch of columns where it’s not making a difference. You might be doing union over a bunch of columns where it doesn’t make a difference.
And oftentimes there is a sort of hidden subset of columns in your data that you can make an easy sort of distinct result set from without having to worry too much about it, without having to worry about long select lists and stuff. So if we run this and we look at the query plan that comes back, this takes about two seconds.
This is not a terribly dramatic example, I admit it. And, you know, it’s okay. But, you know, the thing is that we’re, you know, doing select all this stuff and including this text column and, you know, it’s just, it’s, it’s, this is an envarchar 700.
And now you have to worry about deduplicating that. If we were to take this query and do this, and what we would be doing is taking the same base query with the same columns in it, but then only generating a row number over the columns that we know make a unique result set, this will, this is a little bit faster, right?
Like I said, this is not a terribly dramatic example. That’s about 400 or so milliseconds faster, but it’s still a good example of how you can improve things by thinking a little bit more, a little bit more analytically about your data and what you need to make unique and what you don’t.
Right? So like in this case, doing a, generating a row number over these two union all the result sets is a lot faster. Now we’re now when I said most of the time, I do mean most of the time, there is less overhead to union all over union.
But there are some cases where it does make sense to make a unique result set to make some other operation in a query plan more efficient. Uh, I want you to think of this sort of like, uh, pushing predicates down to when you touch tables rather than having a filter operator happen later. That’s something that I talked about in the, uh, exists versus joins video, where I showed you a query that uses a left join, uh, with a, a where clause, uh, to find rows that don’t exist in one table that exist in another table.
And how using, uh, not exist was much more efficient in that case because we joined less data together, right? Because with the, the left join thing, we had to fully join both tables and then filter out nulls afterwards. The same thing is for like pushing any predicate or reducing rows as much as possible before you do something that is, uh, computationally expensive in a query.
So sometimes getting a distinct set of rows for your query can make things a lot better. So what I’m going to do now is, uh, populate this temp table with, um, user IDs for people who have won these badges. And the first way that I’m going to run this query is with a union.
And we’re going to marvel at the query plan for this. So this takes about, let’s see, five seconds right there. Right.
And, uh, you know, this, this would get better with, you know, slightly better memory grant stuff like that. Uh, if I, if I ran this like multiple times, you would see it get a little bit faster because SQL, since I’m in compat level 160 for the other stuff, we’re getting memory grant feedback. So this would improve over a few runs that we would eventually see that spill kind of fall off.
Usually it’s like three or four runs before the spill goes away. But, um, anyway, now what I want to do is change this to union all. So now we’re going to be taking like, and before with union, we were deduplicating these results, right?
We were getting rid of them. Um, but now when we do this, what we’re going to notice is that this query no longer finishes in like five seconds. This query drags on for a little bit longer.
And by a little bit longer, I mean, this thing is going to run for about 30 or so seconds total. Um, it’s been a while since I timed this one. So, you know, who knows, maybe, maybe, you know, Intel gave me some supercharged boost to my CPUs and maybe it’ll finish a little bit quicker.
Um, maybe not. I don’t know. We’re going to, we’re just going to, we’re just going to let this thing ride.
Uh, and we’re at, well, we’re at 35, 36, 38. Oh, we’re at 40 seconds now. Uh, do, do, do.
Well, I don’t know. I think this, this might be proving the point a little too well. So we got up to almost 50 seconds on this one. We got 47 seconds. And that’s because rather than, um, and, and, and you can see that like the, the, the, the, the pain of this wasn’t even in here.
Like there was almost no overhead to like doing the concatenation of these. Uh, we did still have to, you know, do all this stuff in the sort and whatnot, but where this makes the biggest difference is the number of rows that end up or the number of, uh, things that we end up doing in the table spool. Right.
So we have a nested, we have a nested loops join here. SQL Server uses a table spool here. Um, I’ve talked about table spools in the past, but they are sort of interesting. Uh, SQL Server uses them on, like for in select queries, not modification queries, modification queries, spools are for Halloween protection.
In select queries, spools are often there to optimize, uh, operations on the inner side of nested loops. By the inner side of nested loops, I mean this portion of the query plan in here. So, um, a table spool is fun because a table spool, uh, you know, you take, uh, sorted data from this table, right?
You sort it here and then you pass it to the nested loops and the nested loops, uh, will, you know, tell the table spool to go run the query for, let’s just say ID one. Uh, it’ll populate the table spool. Uh, it’ll populate the table spool.
And then for any additional runs with ID one, it’ll reuse the data in the table spool. Then as soon as we get to ID two, the table spool will get truncated and repopulated. And then, uh, you’ll like, it will, it will reuse data in the table spool for ID two until we hit ID three.
So table spools really can save a lot of time and energy on the inner side of nested loops sometimes. But in this case, the problem is more that we end up with way, way, way more stuff to do with the, because we don’t deduplicate results here the way that we did with the union query. So if we go back and we quote out the all here, remember this is about 40 something seconds and we run this.
We’ll do this one more time with the, with the, the union rather than the union all, it should be about five seconds or so. So do do do. And we get the results. Notice that, uh, like, you know, we, we, we do the same thing where we, you know, concatenate the results. But then before we, uh, send it, send it along, we get, we use a distinct sort to make the results of these two things distinct.
And we spend a whole lot less time in this portion of the query plan. Right? So this, like this part does a whole lot less work because we made the results set unique in this part. So you can run into interesting situations where using union to deduplicate results makes a computationally expensive part of a query faster or repetitive part of a query faster.
In this case, the nested loops is really what did it. So that’s a, that’s a pretty good way of thinking about things sometimes where can I make, can I make this part of the query execute fewer times or do less work with a distinct result set rather than a non distinct result set. So anyway, that’s about all I have to say about union versus union all. As always, I hope you enjoyed yourselves. I hope you learned something.
I hope that you will continue watching this series. I hope that you will continue to write queries correctly. And I will see you over in the next video, which is going to be about two things that I never see anyone use. Intersect and accept.
Fun times. Oh boy. You’re going to really, you’re going to get a relational mouthful on the next one. So I will see you there. Goodbye. Bye.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.
How To Write SQL Server Queries Correctly: Views vs Inline Table Valued Functions
Thanks for watching!
Video Summary
In this video, I dive into the world of views and inline table-valued functions in SQL Server, comparing their pros and cons while highlighting common pitfalls that can lead to performance issues. Erik Darling from Darling Data shares his experiences working with clients who have created overly complex view hierarchies, emphasizing that views themselves are not inherently bad but can become problematic when misused. I also explore the benefits of inline table-valued functions, particularly their ability to accept parameters and push predicates further into queries. Additionally, I discuss Microsoft’s recent fixes for certain issues related to parameter sniffing in views, explaining how to enable these fixes on different SQL Server versions. By sharing practical examples and insights, this video aims to help database administrators and developers make informed decisions about when and how to use views versus inline functions in their projects.
Full Transcript
Erik Darling here with Darling Data. Having a lovely day so far. Absolutely lovely day. Stunning day. It’s freezing cold outside. In today’s video, we’re going to go to the next topic in our How to Write Queries Correctly series. In this one, we’re going to talk about views and how they compare with inline table-valued functions. I know that it just says views there with an exclamation point. I kind of ran out of room, but then it looked funny and the spacing was weird. So this is what you get. But there are some differences and there are some things that we should talk about because I see a lot of people making the same mistakes over and over and over again. And quite frankly, I’m getting tired of fixing them. So I’m hoping that we can talk through some things today and you can start not doing the wrong thing. First time for everything, right? As usual, if you like the channel and you want to support the channel, there’s a link down in the video description below. If you don’t want to do that with these things, you can do these things up there. If you need consulting for SQL Server, if you’re like, wow, that Erik Darling, he knows what he’s talking about. He seems like a nice, reasonable, fellow with reasonable rates who I could work with on MySQL Server performance problems. You can hire me to do all of this stuff. It’s a pretty good deal. If you’re into good deals, how about getting all of my training for about $150 for the rest of your life? Hard to pass that up. No upcoming events, end of year, blah, blah, blah. With that out of the way, let’s talk about views and functions and stuff.
Now, there are a lot of bad things that one could say about views because over the years, we, probably we, the royal we, I really just mean me, is a performance tuner, have seen people just do absolutely awful, egregiously disgusting things inside of their views. Views on their own are not the problem. It’s the way that people treat views. Views on their own are not the way that people treat views. They end up sort of being like a junk drawer for just like weird query logic.
And, you know, this is like coming from like a rather like personal place right now because like current, like one of my current clients, I was, I was trying to figure something out. And I did SP help text on eight view names before I found a view definition that touched one single physical table in this database. Every other view definition was selecting stuff from one or more other views.
I still haven’t gotten to like the root view, the like original vampire view that has everything in it. Like I just got to a certain point and I was like, that’s an, I, we’re not going any further. I have this figured out enough, but like eventually at some point I’m going to, I’m going to whittle it down and I’m going to figure out what the like core criteria is that is throughout these 50 billion views. And this is a common mistake that people make.
They think that views are some sort of performance thing. They think that views are automatically a materialized result set. They don’t understand that all they’re, all they’re doing is interacting with a query that has a name, right?
They’re just housing this query. And I understand the point of, of, of making them because, you know, if you have, you know, a lot of tables and, you know, it’s hard to remember all those joins correctly. And, you know, making sure you have like the one or more join columns, right?
And, you know, whatever other, you know, you know, manipulations you have to do to data in order to get the results that you’re after. They can, it can be time consuming and annoying. So I, I understand why people make views, but the things that they end up doing after that are just, just sinful.
Now, views really, you know, if we’re going to make, make a statement about views, they’re, they’re only as bad as you make them. If they’re bad, it’s your fault. You, you did all the bad stuff.
You put all the bad stuff in there. You, it was you. The view did not force you to do it. The view did not change itself overnight to become an evil view. You put all the bad stuff in the view and, and, and now you’re living with it.
So you have, you have reaped what you have sown, right? Reaped what you have sown. Whirlwind is in stuff.
So, let’s talk a little bit about the case for views, because there, there, there is one nice thing about views. And that, that is that if, if you obey like the 10 million rules that Microsoft has put in place, you, you can index a view. And then you do have a materialized result set and that is pretty nice.
Um, it, you know, uh, it gets a little dicey if you have more than one table reference in the view. Like if you’re joining multiple tables together, things can get kind of awkward with, uh, index view maintenance. Um, but in general, and like, you know, a single table index view, uh, with, uh, indexes, uh, available on the table in order to make index view maintenance, uh, very quick and efficient are really no different than having another nonclustered index available for your query.
Uh, there are some kind of funny things about index views. Like, um, you have to use the no expand hint if you want, uh, column level statistics generated on the view. Um, if you’re on standard edition, you don’t get the index view matching to the same way that you get it with enterprise edition.
So the no expand hint becomes even more useful, but, uh, even on enterprise edition, you need no expand for the column level statistics thing. Um, you know, so there’s, there’s stuff about index views that is kind of tricky. I’m not going to put them on the same level as partitioning because index views actually can make queries faster.
Whereas partitioning just doesn’t make queries faster. You’re lucky. You’re very lucky if you partition a table and, and, and, and performance stays the same.
Uh, in reality, partitioning a table is absolutely no different from having a good seekable index. On the table. I think where a lot of people kind of get confused is that when they partition the table, they changed the, they changed the definition of the clustered index and made it match better.
The, the sort of, uh, the path that the, the queries were taking to the data they wanted. And they’re like, wow, partitioning was magical, but really they just indexed the table poorly to begin with. So, uh, there are, so I forget the application name.
Well, like I’ve run into it with Looker. I know that there are a couple others that, um, sort of build queries based off metadata. And, but they, they, they’re unable to do that with inline table valued functions.
Um, they really only, uh, do that with views and stuff. For some reason they can’t see, uh, what that is to build queries off of it. Um, so, you know, so there’s that, uh, I, I really dislike having crappy applications like Looker dictate what sort of database objects I can use and create.
Um, but you know, some people make bad choices outside of the database too. And, you know, you’re, you kind of get stuck with them. Uh, index views, you know, they’d be, they’d be great if Microsoft would invest like an ounce of time into them.
Uh, they really haven’t gotten anything aside from like, you know, uh, like bug fixes, uh, since they first came on the scene. Um, that, uh, it’s, it’s really, it’s really a bit disgraceful what other database engines, uh, are capable of doing with index views and allow an index views that Microsoft does not. A bare minimum, like min and max aggregates.
Like how, like, like really, you can’t do that in an index view? Like what, that’s a sad, it’s a really pathetic state of affairs, really. Uh, you know, we, we, we’ve, we’ve gotten so many awful features that have, that have died on the vine.
Uh, and, and we can’t get min and max support and index views. It’s, one, one wonders where the people in charge of SQL Server store their heads. We, we wonder if maybe there’s a, there’s a glass belly button joke hiding from us in there.
Um, one kind of weird thing about views, and this is something that, uh, I, I, I, I, like I have read before, but it never really sticks with me. But you can create views with a, with, with an, with check option. And, uh, I’m just gonna read from the documentation that it forces all data modification statements executed against the view to follow the criteria set within select, within the select statement.
I don’t know why there’s an underscore there. I copied and pasted this. Uh, when a row is modified through a view, the width check option makes sure that data remains visible through the view after the modification is committed. Okay. If that’s important to you, use the width check option.
Um, I, I don’t know when I’d want that to happen. Uh, can’t think of anything quickly. So, views, you know, uh, they, there’s a, there’s a tremendous propensity for people to put a lot of bad crap into views and ruin performance over time by nesting, nesting, nesting, nesting, nesting, nesting, and putting worse and worse, more complicated queries in each sort of level. Uh, and then, you know, uh, at some point along the way, you’re like, oh, hey, hey, Dante.
Oh, Lucifer, it’s you. Nice to see you here. Uh, but inline table valued functions, you know, they, they, they do offer some things that views don’t. Uh, namely the ability to, uh, add parameters to them, right? Views don’t accept parameters in the, in the definition inline table valued functions can.
And that can allow you to push predicates a little bit further into queries, uh, in some circumstances than views will allow. Well, we’re going to talk, we’re going to show you that example in a minute. Um, you, of course you cannot index a function.
There are no, no such thing in the current state of SQL Server is, is, is materialized functions. Uh, so, I don’t, I don’t really know that I care. I mean, it’s not, it’s not, it’s not, it’s not that big of a deal.
And I can only imagine what awful things people would do with them. Uh, but, uh, the, the ability to pass parameters to a function is a pretty big deal in some cases. And again, we’re going to look at that.
Now, um, it would be nice if, uh, Uh, there were a way to pass parameters as queries to certain things, right? Like it would be cool if, you know, like if you had like, just, you know, as a, as a sort of a stupid example, let’s say you had a store procedure that took, that was responsible for taking full backups.
And, uh, you know, like the, really the only way to pass a parameter to that or pass a value to that, uh, store procedure and then have it do something with it is to exit is like, you know, build a loop or a cursor or an array, like a CSV and pass it into the store procedure. But even then, like, even after you, if you, even if you pass a CSV, uh, like a variable or parameter in, uh, you still have to break it apart and do stuff with each individual line inside the store procedure. It’d be really cool if you could do something like this, where, you know, you would say, take a full backup and database name is the result of this select query.
Like that would be awesome. Cause then you wouldn’t have to do all the like weird internal work to like, you know, uh, write a cursor correctly or write a loop correctly. And then like, you know, other things you could, you could just do this and life would be a lot easier.
Uh, but you can’t, which is kind of lame, but you know, it’s not, not really the whole point of this. Now, the thing that I want to show you with inline functions is, um, uh, they, when I first came across this, I was really, really puzzled, uh, because there was a view and, uh, uh, in, in, in, when you called the view, like in, like in, like outside of a store procedure with a literal value, everything went fine.
When you called the view inside of a procedure with a parameter things went not, not fine. And we’re, we’re, we’ll, we’ll talk about that. But, um, the, the thing that, so like that, that does lead me to Microsoft did add a fix for this.
Uh, the thing is you have to be on, uh, at least SQL Server 2017 CU 30 and have query optimized or hot fixes enabled, uh, in order for it to kick in. Uh, I’m not exactly sure which CU for 2019 this was available on. Um, I would probably, I would, I would imagine that I can’t remember if it was available in 2019 from the get go or, but I’m pretty sure it was back ported because it’s under, it’s, it’s, it’s, it’s, it’s, it’s, it’s, it’s, it’s in 2022 under compat level 160.
So I think whatever CU came out around the same time as CU 30 for 2017 is probably where this thing ended up for 2019. Um, but, uh, if you’re on SQL Server 2022, you can just use compat level 160 and get the, uh, get the same behavior. So, uh, what I want to show you here is that you could run this and get the fix that I want to show you, but I have query optimizer hot fixes not enabled.
Uh, you could also enable trace flag 4199 as long as you’re not, um, you know, Oh, geez, that could have been a disaster. As long as you’re not, um, uh, you know, uh, hampered by anything like, uh, being on, uh, Azure SQL DB or managed instance or, uh, not have sysadmin privileges. It’s kind of a downside of DBC, of D trace flag stuff like, uh, like DBCC trace on or, uh, uh, you know, query trace on and there’s a query hint.
Uh, and I’m also not in compat level 160, even though I am on SQL Server 2022. Currently I have this thing set to, I think, compat level 140. Um, just because, uh, I don’t know, sometimes too many things kick in, in the higher compat levels that makes coming up with, uh, good repeatable demos kind of difficult.
And, uh, as a presenter, I often just need good repeatable demos. Um, sometimes the higher compat levels do offer that because they do something really bad with some of these new features. But, you know, for the most part, I like just the stability of like not every single intelligent query process or feature trying to kick in and, and like, you know, ruin my day.
But anyway, uh, we have this view and the main, the main thing in this view that will cause us, uh, problems down the line is going to be the windowing function. Now, uh, I know that I talked in the CTE video, the first CTE video about, um, there’s going to be a second one. So that’s why I’m saying the first one about how, about pushing predicates, um, to window functions, uh, from outside of CTE.
And how, if you have the, uh, the partition by column, uh, is the column that you’re filtering on SQL Server has an easier time of doing that. But, uh, that does not hold true with views in all circumstances. Let me just make sure I actually created that.
So, uh, we have query plans turned on and if we run this query, this will run very quickly. Uh, we will have a very nice, easy index seek right here. Everything is fine.
Uh, even though, uh, well, simple parameterization was at least attempted here. Uh, I don’t, I’m not going to dig in and figure, I don’t, don’t believe it was successful because we, this, the query plan looks like this. But, um, the reason this works fine is because we have a literal value right here.
Uh, not a variable, not a parameter, not a placeholder of any variety. So everything kind of looks how we would expect because we have an index on owner user ID. And, you know, uh, we’re also, uh, that also has the score column sort of descending in it.
So our making this dense rank is very, very simple for us. Now, uh, I could do all, any of the things that I mentioned above with the, the database scope configuration with the trace flag, blah, blah, blah.
But, uh, what I want to show you before I do any of that stuff is what the query plan looks like with a parameter touching that view without any of the fixes in place. And this takes a little bit longer and we no longer have a nice, simple index seek plan. Uh, now this takes about seven seconds or sorry, about six seconds, I guess.
So we have a nice, simple index seek, we scan the entire table. Uh, we generate our row number over here, segment, segment, sequence project. And then way over here, we have a filter, uh, say where the predicate equals owner user ID, right?
The parameter that we passed in, uh, the, the limitation here is where SQL Server can’t push a parameter or a variable past the sequence project operator. It gets stuck outside of this thing. If we recreate the store procedure with option use hint, uh, query optimizer, compat level 160, uh, the plan will go back to what we expected.
It’s a nice, fast, simple index seek, and no longer having to wait about six seconds to do all this stuff, right? The, the, there’s no longer a filter over here. We were able to push that down past the sequence project.
It would be nice if a lot of other things, um, worked under, uh, worked when you hinted higher compatibility levels. Um, just as sort of a stupid example, like, um, uh, there are a lot of, uh, new, uh, functions and functionality added to T SQL, uh, with each release. And, um, you would think that if, if you put a query level hint on to say, like, use string split or string ag or one of those other things that comes along, but is only available under higher compat levels, you would think that you would be able to access that stuff just by using the query hint for a higher compat level.
But you, you can’t, uh, Microsoft wants you to use the, the, the, the higher compat level all around in order to get that, which, um, is, is a perilous venture. Uh, but the, since we’re comparing views and inline functions here, uh, I do want to show you that if we created the, this as a, as an inline table valued function. Uh, now, of course, this is not a scalar UDF and this is not a multi-statement table valued function.
Uh, we are returning a table, which just returns a select. There is no table variable. There is no data type involved here.
We are returning the results of a select query, right? So if we do that, we can pass a parameter in here and here. And if we do this, even without query optimizer hot fixes enabled, uh, SQL Server is able to, uh, push that seek or push that parameter down into there just the way that we would expect.
So even if you’re not having this specific problem, uh, I think it is often, uh, worth exploring, converting views into inline table valued functions. Uh, just because if there is a common filtering or joining criteria, uh, it’s very, very convenient having parameters to express, uh, express that into be able to pass those in. Um, it better shows the intent of the module and what it can be used for.
And it prevents developers from forgetting filtering. I thought that filtering criteria and getting really like just exploded out results. So, uh, this is just a sort of short walk through the differences between views and inline table valued functions.
Um, uh, you know, again, uh, views, you can materialize them. Um, if you follow the 10 billion rules, uh, inline table valued functions, you can’t materialize, but you can pass parameters too, which can be a very, very valuable performance tuning thing. Um, you know, when you like apply to an inline table valued function, then you can pass column names in, uh, for the parameters that you can get often get very, very nice, uh, performance increases doing stuff like that.
Uh, but anyway, uh, I hope you enjoyed yourselves. I hope you learned something. Uh, the next video will be about, uh, union verse union all.
Um, uh, we’re going to explore, uh, uh, sort of like where union starts making results distinct and things like that. And, uh, we’re also going to challenge, uh, uh, uh, a very common performance tuning, um, uh, uh, uh, I don’t know, just, it’s a strongly held religious belief that union all is always faster than union. And we’re going to look at an example where that is not true.
So I hope that, hope that you hope you’re wearing your helmet for that one. You’ve got your crucifix and garlic and all that stuff. So anyway, uh, I’m going to get this one, going to get this one sent along to YouTube and then, then record that one.
So see you shortly.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.
How To Write SQL Server Queries Correctly: Common Table Expressions
Thanks for watching!
Video Summary
In this video, I delve into why Common Table Expressions (CTEs) are often overused and misapplied in SQL Server queries. I argue that CTEs can be a bit like 70s fashion—faddish and largely unnecessary, with little real benefit beyond readability, which is subjective at best. I explore the performance pitfalls of CTEs, explaining why they can lead to suboptimal query plans and poor execution times. Additionally, I highlight how different database systems handle CTEs differently, using Postgres as an example where CTEs are materialized by default, contrasting it with SQL Server’s behavior. The video also covers specific scenarios where CTEs might be necessary or useful, such as when dealing with windowing functions and row numbering, but emphasizes that in many cases, simpler alternatives exist. I provide practical examples to illustrate these points, showing how the placement of filters can significantly impact query performance and results. Finally, I touch on the future of T-SQL, speculating on potential improvements like the `QUALIFY` keyword, which could make CTE usage more efficient.
Full Transcript
Erik Darling here with Darling Data. Today’s video where we continue our series on how to write SQL Server queries correctly, I’m going to continue to aggravate certain portions of the public by talking about how completely stupid CTE are. Looking forward to it. Before we do that, let’s talk about stuff that we talk about all the time. If you like this channel, if you find the content in this channel valuable, and you would like to support this channel to the tune of four bucks a month or more, depending on your level of kindness and generosity, depending on where you are in your life as a matter of salary, you can click on the link right in the video description at the top there that says become a member, and you can join the 30 or so other people who have done that and contribute to the video description. I want to contribute a little bit to keeping the lights on for this channel. The light bulbs that it takes to do all this are very expensive. If four bucks a month is just too rich for your blood, for whatever reason, you know, your mom’s in a nursing home or something, liking, commenting, subscribing, all wonderful ways to make me feel cherished by you.
If you watch these videos and you think, wow, that Erik Darling sure does know a thing or two about SQL Server, perhaps he could make my SQL Server faster in exchange for money. You can hire me as a consultant to do just that. I am good at all these things, best in the world, according to most. So why not take a chance on that? If you would like some very high quality, very low cost SQL Server training, you can get all of mine, again, link in the video description down yonder. You can get all of mine for 150 bucks. And that’s for the rest of your life. No subscription required. Isn’t that lovely?
No upcoming events. No upcoming events. 2024. End of it. 2025. We’ll get into it again. I will go to as many pre-cons as my wife will let me. She does miss me terribly when I’m gone. So I can’t just fly around the country every weekend, but, you know, I’ll do my best to get to your very important event. But, yeah, let’s talk about, let’s expose CTE for the fraud that they are.
So CTE for me are a lot like clothes in the 70s. A bunch of people with absolutely no taste convinced a bunch of other people with absolutely no clue that they should dress just like them. And so if it weren’t for us getting the 80s after the 70s, the human race would have absolutely no redemption arc. In fact, since like probably the mid-90s or so, the redemption arc is descending. I’m not sure if you feel the same way, but boy, I think 1980s was the peak of human civilization.
That’s when all of the best things, all of the best music and movies and everything was going, fashion was wonderful. Nothing better. Now CTE were implemented to fill in some blanks that derive tables left, right?
Things like being able to re-reference them in queries and things like that. But the first, the problem with CTE is that the very things that they were designed or that they were implemented to address with derived tables are the things that make them suck from like performance-wise. I didn’t know what a petard was until I read this, but apparently it’s some kind of stick that you can hoist people with.
I guess it’s like a wedgie stick where you can pick someone up by the back of their underwear and just hoist them up and wiggle them around and make it real uncomfortable. But people use them the same way that they use nose and air hair trimmers where they just jam them in and wiggle them around. Obviously, nose and air hair trimmers weren’t my first choice of metaphor here, but I’ve got to keep it family friendly.
But they just jam it in and they wiggle it around. Maybe they just keep wiggling until they stop hearing hairs get trimmed here and then that’s it. There’s very little actual mental feedback for most people about if CTE are good, bad, or ugly.
You can’t see in your own ear too well and you can’t see up your own nose too well, but all you’re left with is a lack of clipping noises. It’s amazing to me how many people will just stick with, like get awful performance using CTE, but stick with them just because of this misguided notion that the query is more readable. They read a style guide.
The query is more readable. They read their 70s style guide and now that they’re showing up with crocheted bell bottoms and velour neckerchiefs and spread collar shirts with gold necklaces tangled in chest hair. Stop.
Okay, just do yourself a favor. Stop. Way back. CTE are one of the least advanced components in T-SQL. I’m going to cover more about that.
You can probably see up at the top there’s another tab 14, tab with numbered 14, where we’re going to talk a little bit more about CTE usage. But really I just want you to know there’s nothing all that interesting or advanced about them. Anyone who says they have anything advanced or interesting to teach you about CTE is either a complete simpleton or a charlatan.
Pretending that there’s an advanced notion, advanced usage of CTE. It’s the same level of idiocy as explaining joins with Venn diagrams. Like semi-colored circles is going to help anyone understand what their join query is doing.
It does absolutely nothing for anyone. So you should cast those people by the wayside because they are not good people. Now, if you’re coming from a different database platform, you might have a different experience with CTE.
For example, Postgres, by default and when considered safe to do so, CTE get materialized. They get like, it’s almost like a temp table. They cache a common sub-expression type thing.
It’s almost like a spool or a temp table, whatever you want to call it. It’s a temporary object that caches the result of the query. I have a couple examples from the Postgres documentation.
But if you’re coming from, like if you have a Postgres background and you’re used to CTE behaving in this way, you get to SQL Server and you’re like, wait a minute, why does this suck? Well, because it doesn’t behave the same way.
You’re also probably going to be wondering why your read queries are blocking and deadlocking with your write queries because SQL Server does not use an optimistic isolation level by default either. So woe to you, fine folks out there.
So with Postgres, you can choose. I mean, notice the red squiggles under here, right? Make this not valid T-SQL, maybe someday.
But in Postgres, you can choose to materialize a CTE result, which is just like sticking in an attempt table. And you can work off that.
But when you don’t materialize it, you re-execute the query in here as many times as you touch the CTE. Now, CTE, for some reason that I cannot fathom, get a lot of developer defense to the same extent that table variables get. It’s befuddling to me.
Like people will write a query, use a CTE, have it perform awfully, but then sit there and be like, oh, but it’s so readable. Oh, look how well I can read this query.
I have so much time to admire this readable query and re-read my query. Will I wait for this query to finish running because I’m sticking with this stupid CTE? It’s fascinating to watch.
It is, there’s some sort of mental disorder going on in people who do this sort of thing. Now, there are times when CTE will have no impact on anything, right? So if I run both of these, make sure query plans are turned on.
If I run both of these queries, one with a filter inside of the CTE, one with a filter outside of the CTE, we will get identical results and identical query plans.
And they make no difference in this case. None whatsoever. Right? So there are times when SQL Server just is nice enough to optimize the CTE away. It just throws it right out.
Now, one place where you do have to use CTE currently in SQL Server is if you need to sort of have some runtime expression, like a row number, and you want to filter on that row number.
SQL Server doesn’t offer a way to do this with a single query. Other database engines have a qualify keyword, which I’ll show you in a second, that allows you to do that without nesting your query at all.
So, but another interesting thing is that if you are going to put, you are going to use windowing functions in CTE, and you want to filter on stuff outside of that, sometimes SQL Server is unable to push your predicate up into where it, up into the CTE.
Now, it can’t do that here because the column that I’m filtering on, vote type ID 8, is not a partitioning element in the windowing function up here. So, what happens is I have to run this whole query.
I have to generate a row number over the votes table. And then once I get outside of the votes table, and I have generated my row number, and I start filtering on my row number, only then is it safe for SQL Server to apply a filter to the, to the, apply the vote type ID filter to the query.
Now, this, this sounds funny and weird, but obviously if I were also partitioning by vote type ID, I would be answering a somewhat different question with the windowing function.
Right? That would be like, like partitioning by user ID and vote type ID would mean that I am effectively asking SQL Server to rank things differently than just by user ID. There, there are times when it’s safe to do that in the partition by, but not here.
So, if we look at the query plan for this, you’ll see it ran for a heck of a long time. And if we look at what happened, the, the details of the filter operator, you can see that there’s a predicate on both vote type ID 8 and this expression 1 0 0 1 equals 0.
That’s basically this where clause right here. So, uh, obviously vote type ID 8. You can, that’s pretty easy to work out.
But then, uh, this being equal to 0 is the expression that is also evaluated in that filter. So SQL Server had to do all the work to generate the row number to then apply those filters later.
Uh, but I want to talk about how that is sort of answering a different question. So what I’m going to do is I’m going to create a table, uh, called the top answers of all time.
And I’m going to put, uh, uh, I don’t know, like 2,500 or so rows of Paul White in there with this number. And I’m going to put 10 rows of Erik Darling in there with that number minus whatever row number we’re at, right?
So row numbers 1 through 10, I’m going to subtract numbers 1 through 10 from this, right? So insert those rows in. And this is what the table looks like.
I have my 10 rows down here where I have a decrementing value from Paul’s big score down through all this stuff. And then I have, uh, 2,500 or so rows of Paul White’s high score, right?
So the reason why that answers, why the part, this answers different questions for like the example query that I was showing you up there was, let’s say that this is the initial query where I want to find the top answers and then find out the top 10 answers, right?
That’s this filter. And then ask if any of them have the answer, answerer name Erik Darling, right? So let’s just run this internal query first so I can show you what’s going on in here.
Obviously all of these 999s, they, they all tie, right? But they get, but they get an incrementing row number. It’s not like rank where rank would give you one for all of them. Uh, row number and dense rank will give you this incrementing number across all of them.
Uh, but if we get down here to the end, you’ll see my final 10 rows where like, obviously I’m, I’m out of the picture. So now if we run this to say where row number is less than or equal to 10, we’re going to have Paul’s 10 rows.
So if I’m asking the question, who are the top 10 answerers of all time? And are any of their names Erik Darling? The answer is going to be no, right? But if I change that and I put this in here, you’re going to, and I run this query, you’re going to see just my answers ranked, right?
So high answer to low answer. Now, if I run this query, we’re going to get those same 10 rows back because I only have 10 in the table, but you can, you know, if you want to just see a slightly different take on it, here’s the, here’s like the top five answers of Erik Darling.
So you actually answer different questions depending on where you put filters in for windowing functions, right? So that’s something to be aware of when you’re, when you’re using them.
Now, this does bring us to a case where CTE are generally okay because they’re generally needed, right? You can’t calculate in SQL Server at current, the way T-SQL is designed, you can’t filter on a row number within one query.
Like I said, you have to nest the query, generate the row number, and then filter on it. Now, Snowflake, a different database platform, has this keyword qualify.
Qualify allows you to either put a row number directly into the qualify clause like this, right? Like you can say qualify this equals one, or you can put a row number in your select list like this and then say qualify row num equals one.
Now, this is obviously just a little bit of syntactical sugar because you’re going to end up with the, kind of doing the same thing is like the query plan that we saw when I showed you the filtering thing where you’re going to have to run the query, generate the row number, and filter on it.
This is just a nice compact way of doing it. I think this should, obviously, I think this should be in T-SQL because it gets you out of a lot of like extra typing and nesting queries when you generate a row number. It’d be really nice to be able to do it in place like this.
Whether it’ll ever happen or not, I don’t know. Microsoft is busy burning all its money on OpenAI, co-pilot, so who knows, right? T-SQL just, who knows?
SQL Server might get no attention whatsoever. Who knows? Who knows, right? It’s crazy. You’ll probably get the ability to put a CTE inside of a CTE before you see any actual useful progress to T-SQL as a language.
But one of my favorite uses of CTE is paging queries. And this is a technique that, again, I mentioned Paul White for the five billionth time.
This is a technique that I learned in 2009 from Paul White blog posts where you can stack CTE like this. And this doesn’t have the same performance impact that re-referencing CTE via like joins does.
And I’ll show you what I mean. So let’s run this whole thing. And the reason why this is okay is because we only end up with, well, for the CTE part of it, I do want to point out that I do join back to the post table here to get, why are you red?
You should always be pink. You silly, you silly ghoul. We do have two references to the post table because I joined back to the post table here. But in this query plan, what looks kind of funny is the scan of the clustered index, the generating of the row number, and then a top, and then a filter, and then another top, right?
But the CTE isn’t to blame for the two references to the post table. I explicitly re-referenced the post table down here. So this is where you can see where I filter on the row number inside of the CTE, right?
That’s this thing right here, right? So the two tops in here, one is the one in this query, right? Where I get the top page number times page size.
That’s the first top. That’s this thing right here. Page number times page size. And then this other top where, ah, gosh darn it.
There we go. Where I’m just getting the top page size. So that lines up exactly with what I do here and what I do here. And then we saw in the filter operator, we saw this.
This is a really good way of writing page inquiries because you only hit the base table for a limited number of columns, filter down that primary key just to the rows that you care about, and then get all the columns you care about after you’ve reduced the rows that you care about.
So this is okay because we’re not re-referencing the CTE outside of this, right? We have them stacked up, which means that we hit the post table once and then run all the logic from these stacked things on the post table.
So where that gets different though is if you re-reference the CTE. So let’s say that I write a query like this, and this is an actual, this is inspired by an actual client query where they were doing this exact same thing, and their idea was to get the top, I mean, this was a different piece of software that had different things, but the idea of this query is to use the row number function to rate a particular user, sorry, a particular user’s questions by score descending, right?
So we want to find that essentially the top five, and the way that we’re doing that is for every, we left join to, ah, zoom it, curse you. We left join to the CTE to itself five times for five, and for each one, we get a slightly higher row number, right?
So we get row number one there, row number two there, row number three there, row number four there, row number five there. So when you see the query plan for this, you’ll start to understand what I mean by the petard hoisting of CTE.
So we’re derived tables, you couldn’t do that re-referencing. With CTE, you can, but unlike Postgres, which will, again, when it’s considered safe, materialize the result of a CTE, we don’t get that here.
We get a query plan, if I can manage SSMS, we get a query plan that hits the post table one, two, three, four, five times. For some reason, the first one is nice enough to go parallel.
The rest of them are just single-threaded scans of the post table, but each one of these, four seconds, four seconds, four and a half, oh, sorry, like four and a half seconds a piece. But the query as a whole takes almost, it takes about 18 seconds to run.
This obviously isn’t a good situation for the CTE. I don’t care how readable it makes your query. If it takes 20 seconds to get one row back, it stinks.
You shouldn’t just, it’s so readable. Who cares? Your query sucks and it’s slow. There are more important things to worry about. Like your end users aren’t going to call you up and say, hey, thanks for that nice readable query.
They’re going to say, hey, thanks for this really fast query result. I don’t have to wait my whole lunch break to get my report back. All right? CTE are not performance solutions. I don’t even think they make queries more readable.
Good formatting makes queries readable. CTE do nothing for query readability and do a lot to hurt query performance. So a lot to hurt query performability. How about that?
So you could do this in two different ways. You could create a temp table very cheaply with the five rows that you care about in it. And then you could join to that very small temp table five times in the same way that we did before.
And none of that takes anywhere near 20 seconds. You could also use pivot in this case. And you could get that very quickly as well. Again, with only one scan of the post or one scan of the post table.
So both of these end up just about the same. We do have to scan this, the temp table five times, but a five row temp table is pretty cheap to do that with. So there may be times when you need to build something recursive with a CTE.
And with recursive CTE, one sort of annoying thing is it in the recursive portion of the query, like the anchor portion of the query where you get like the row or rows that you want to build the recursion with.
You can do like almost anything in there. But with the anchor part of the query that you’re using to do the recursion down with, you can’t do anything like distinct or top or offset fetch to only get like one result per recursion.
But you can use a derived query inside of there. You can use row number inside of the derived query, and you can filter on the row number outside of it.
This is actually a pretty good performance thing for a lot of recursive CTE because a lot of the times when I see people build them, there’s a lot of duplication in them that shouldn’t be there.
So this is what I mean by that. We have the anchor part of the CTE here, and we have the recursive part of the CTE here.
Now, since we can’t put distinct or top or anything just like right in this query, like the way that we could write in a normal query, we have to nest it, right?
And we have to nest and do a row number in here and then filter on the row number outside of that. But you can do that in there just fine. And that’s one way to get around a lot of weird stuff that happens in recursive CTE.
So CTE can be handy to add some nesting to your query so you can reference generated expressions in the select list as filtering elements and where clauses.
Right now, SQL Server doesn’t have a qualify keyword that allows you to do that with windowing functions like some other database platforms do. They can even be good in relatively simple use cases. But remember, SQL Server does not materialize those results.
It would be nice if it gave you at least the option to. In the optimizer, it would be even better if it had some rules in place to automatically do it when it’s safe and you’re referencing a CTE multiple times.
I spend a lot of time in my consulting work putting CTE results into temp tables to avoid the re-execution problem that I showed you before and to reduce a lot of query complexity issues where bad cardinality estimates from a certain portion of a CTE leak out into other parts of the plan and lead to other bad choices elsewhere.
Materializing that result set and letting SQL Server build statistics on that result set is a really easy way to make a lot of queries go a lot faster. In complicated queries, CTE often do way more harm than good.
And any excuses that people feed you around readability are just throw them away. Again, it’s a stupid response. People will spend an inordinate amount of time trying to come up with cases where it’s okay to write queries the wrong way.
Mostly because they’re lazy and they want to just do the lazy thing and they want to keep writing queries the wrong way because they think they found this one magical time where it’s just okay and safe to do it.
And that’s almost never the case. You might find a query more readable by using CTE, but the optimizer will not. It does nothing to help the optimizer, does nothing to help guide it to better choices.
You can create, at least currently, some performance fences around things in CTE or derived tables by putting top or offset fetch in them, but that does not materialize the result.
It will sort of isolate that portion of the query, which can be useful sometimes, but there’s still no materialization. So if you put a top in there and you re-reference that CTE like I showed you in the demo query just prior, it’s a bad time.
So please use CTE really carefully. Don’t just assume that they are going to do anything best or better for performance.
Don’t think that they help the optimizer in any way. They really don’t, and you can run into a lot of big trouble with them if you start using them inappropriately. And it’s really easy to fall into that trap because you think you get this free lunch being able to re-reference them, but you just don’t.
So anyway, that’s about enough on that. This video went on longer than I thought it would, which story of my life. Again, I find myself having to apologize for the length here. So the next video, number seven, is going to be about views versus functions.
Very, very few demos in there. It’s a lot of spoken word poetry. So if you’re the type of person who really enjoys demos, maybe you can just skim that one a little bit.
But anyway, thank you for watching. I hope you enjoyed yourselves. I hope you learned something. And I will see you in the next video where we compare views and functions as a big happy family.
So great. Cool. Goodbye.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.