How To Write SQL Server Queries Correctly: Joins With OR Clauses Part 2
Thanks for watching!
Video Summary
In this video, I dive into the intricacies of joins with OR clauses in SQL Server, specifically using examples from Stack Overflow’s database design to illustrate common pitfalls and suboptimal query plans. We explore how these join conditions can lead to inefficient execution plans, such as constant scans and nested loops joins, which often result in unnecessary row counts and sorting operations. I also discuss potential solutions, including query rewrites that leverage EXISTS clauses or CROSS APPLY operators, demonstrating how these changes can significantly improve performance. By the end of this video, you’ll understand why normalization is crucial for avoiding these issues and learn techniques to optimize your queries when OR clauses are necessary.
Full Transcript
Erik Darling here with Darling Data. All excited to talk more about joins with OR clauses. It’s going to be a real grand day for all of us here. This is, of course, part two of the joins with OR clauses videos. The first one, we talked about some simple stuff. It’s kind of set up, you know, the things that SQL Server can and can’t do when there are OR clauses in the table with the simple join. Now we’re going to talk about more complicated stuff and, of course, how you can fix the more complicated stuff. And, of course, because I like talking about it, we’re going to talk about shortcomings in SQL Server’s Query Optimizer. Because it’s got those. Lord knows. Anyway, before we do that, if you would like to slap four bucks a month on my rent check, you can click the link down in the video description below and you can do that. Maybe it’ll make you feel good.
You know, or maybe not. I don’t know. Maybe, hopefully, you just forget about it for a while. That’s what we’re all hoping for. If four bucks a month is not what you have in mind for me, you can like, you can comment, you can subscribe. All sorts of great things. If you are in need of consulting for your SQL Server, all of these things, I am the best in the world at them.
Anyone who tells you otherwise is a dirty, rotten liar. And as always, my rates are reasonable. Training. I’ve got it. 24 or so hours of it. Beginner, intermediate, expert. All of those things. You can get it for about 150 US dollars. That’s the link. That’s the coupon code.
And they are helpfully glued together for you also in the video description. Lucky, lucky you. No more, no upcoming events until 2025, probably later in the year. But for now, we can just go talk about what we came here to talk about, which is joins with or clauses. So joins with or clauses are one of my biggest pet peeves because the optimizer does have space to do this better.
It just isn’t there, right? It just doesn’t do it. There’s just no help for you when a join has an or clause in it. There’s not like a special thing you have to do or there’s not a hint. There’s not like a trace flag. All there is is query rewrites. And those query rewrites are annoying because SQL Server’s optimizer could figure it out.
Just doesn’t. It’s lazy. Very lazy, lazy optimizer sometimes. So I’ve created some very generous indexes on the votes table and two on the post table. And what we’re going to look at is the query pattern here.
Now, it is a little bit confusing when the first time you see it, because it is, well, I mean, not the most straightforward thing in the world. But I probably should have started this running before I started talking too much, right? OK, seven seconds. Not too bad.
But here’s the execution plan. And it is one of it’s this is one of those query plan patterns that I try to train people to look for, because this query plan pattern is always a sign that something bad is happening in your query. And that something bad is going to be a join with an or clause.
So we’re doing I mean, it’s a little bit of a funny query here, but it’s one that you have to do with the Stack Overflow database. Because, you know, everyone who gives credit to Stack Overflow for like when it first started being like the cool, smart, like, whoa, rogue band of geniuses don’t like they’re doing everything on the cutting edge of cool and fun. They were crap at database design.
Crappy, just awful. Should have hired someone who knew what who knew what they were doing like this much because the post table has all the posts in it. It has all the questions. It has all the answers. It has all the everything else.
And what’s what’s really crappy. Oh, my God, is like the version of the post table that you see in like the Stack Overflow data explorer and the version of the post table that you see, if you download the data dump, is not nearly reality. If you saw that, if you saw the actual table in there, it is so much wider and more denormalized and awful.
It would it would it would make you even if you know nothing about normalization, it would make your head spin. But we have to join the post table to itself because questions. Right. Those are these things. When you join them, you need to join the post table to itself to find which questions have accepted answers.
Great. Good. Good plan. Self joins. Hey, I like I like those sometimes. Sometimes you do have to you do have to join yourself. But then when we join to the votes table, what we’re saying is where we need where the the either the question ID equals the post ID in the votes table or the accepted answer ID equals the post ID in the votes table.
So we’re looking to get votes on both the question. That’s the ID and the answer, which is the accepted answer ID. Right. And we I don’t know through this. I forget why I threw this on there. I think it tidied things up a little bit. Maybe maybe maybe I didn’t put that on there in seven seconds or in like 70 seconds.
But then we’re looking for where p dot post type ID equals two. Right. And we’re looking for where p two dot accepted answer ID is greater than zero. So we’ve done everything that we we can there. Now, what makes this challenging or rather like.
What the optimizer could do is something like this. It could unroll the join from being this set of conditions to being this set of conditions. Right. Like it could do that because it is fairly equivalent. It could also do this. But it kind of doesn’t.
It could be a good answer. So what’s really interesting, but sort of difficult to describe well to the optimizer is that any given question can only have one accepted answer. So there’s no many to many relationship with that join.
So these are things that the optimizer could do, but it doesn’t. And this is the query plan pattern that I need to warn you about. And this is the query plan pattern that I want to show you is this is what happens.
So we do two seeks into the post table. Right. This finds this does our join. Right. This finds our questions or rather sends our answers with the questions of accepted answers and all that other good stuff. And then this is the join to the votes table. Now, what I want you to keep an eye on is the number of rows that come out of this three eight one eight four nine seven.
OK, remember that number because you’re going to see it again. And right now way over here in the query plan, this is this is most commonly what you will see in a query plan that has a bad join with an or clause in it, which means any join with an or clause. You’re going to see a constant scan.
With three eight one eight four nine seven rows come out of it and then another constant scan with three eight one eight four nine seven rows come out of it. That is the results of this twice. Right. So we have one set for the P dot ID column in the post table and one set for the accepted answer ID column in the post table. Right.
We have that many rows for both after we join those two tables together. That’s how many rows are produced. So each one of those columns gets produced by this by these constant scan operators. Notice these don’t touch a table. These sort of come out of nowhere, but they they really get fueled from here.
So then SQL Server says I have to put those two results together. It essentially does like a union union all sorry. So this number doubles right.
At least my math is correct. And then it sorts all that data. It puts all of those rows in order. So that it can merge duplicates. And it note the estimates in here are all pretty wonky.
Right. Of one of one of two of two of two. So we’re SQL Server. Expects two rows to come across all this stuff.
It we get way more than that. Now, I think like Microsoft could start a little bit earlier and like doing all this stuff by just like trying to get better estimates out of this part of the plan, because then maybe it wouldn’t choose a nested loops join for this part of the plan.
It’s also entirely possible that this is an optimizer limitation and that SQL Server could not possibly do any other kind of join here. But then we seek into the votes table and we we seek on this expression one oh two four. Right. Or sorry.
We have a between thing. Right. Because we merge the interval. So we go and find greater than expression one oh two three and less than expression one oh two four. That’s going to be the the start and end points of the constant scan results after we’ve merged the intervals in.
So we every time we loop in, we go and we find those that range of columns. Now, it’s bad enough that we have like this fake result set. Right. From the from the constant scans that we loop and find stuff in here.
But then like there’s an sort of this other nested loops join here that also like takes this stuff comes out of here and goes through all this. Right. So it’s like this loop, like even though it starts here, it kind of like really starts over here. So this all ends up taking about eight seconds or seven and a half seconds, which is pretty slow for a query.
And like I said, SQL Server could unroll all this stuff and do it differently. Now, it’s what we can’t. It’s hard to describe this to the optimizer without a lot of really difficult constraining that every question, if it has an accepted answer, can only have one accepted answer.
If we were to use the parent ID column instead of accepted answer, it would be a slightly different story because one question can have many answers, but only one of them can be the answer. So this is a great reason to normalize your data. This is one place where the Stack Overflow data design team failed.
Right. Off like just big thumbs down there was every question and every post are in the same post table. I do have some stuff about what happens if you split those out and I do have some other stuff about what you do if you put the body column on another table. But I don’t know. I feel like that’s worth money.
So that might not go on YouTube. Now, part of the reason why this optimization space is difficult is because you would have to you would have to unroll the query to look like one of these two things. Right. Where bounty amount is no and exists.
Select this on this. But this will give you identical results to what you got before. But without all that crazy stuff going on in it, this finishes in about 207 milliseconds.
And that’s, you know, arguably a lot better. Right. And this is this filter is a startup expression predicate. So this is not the awful kind of filter where all the rows come out and then try to pass the filter.
This filter only lets rows passed if if if if if anything meets that criteria. So that’s a nice thing in there. Right. This is all good stuff. SQL Server handles this really well.
And because the where clause in here is really only for one set one set of things, it turns out a lot better. Now, you could also rewrite that in this way, which is even more complicated, where you could use sort of a double exists with a union all. Right. So rather than write this or clause, you could write this in and you could get good results back from this one.
This query plan, it’s a little bit different. It takes about 385 milliseconds. There’s one filter at the end here, which I am not crazy about.
Where this is not null or this is not null. Usually this is the kind of stuff that I worry about in query plans. But in this case, it’s OK.
We get a couple adaptive joins over here between the post table. Well, sorry, between the votes table and the post table. Right. So that all turns out OK. And then if we needed columns from the inner part.
So like we talked about exists and not exists and joins and all that stuff. One thing that I talked to you about was that when you use exists, you can’t project columns out of it. So if this were a more complicated query, you know, like like for a lot of queries, I use count big just to make sure that I’m getting like the right number of rows back.
I just like basic starting point. Are these results correct? One one thing that, you know, like normal queries, people want to see data, right? They don’t just want to count.
They’re not just like, oh, well, yeah, just if you just show me counts, I’m happy. They want to see like the actual stuff in there. So if you need to rewrite a query like this and you need to project rows out, cross supply is a really handy way of getting the same performance. But but you can project columns out of the cross supply.
I don’t know why there’s a space in here. That’s that’s clearly clearly someone is sabotaging me. But if we write the query like this with the cross supply, it’s also pretty quick. 345 milliseconds.
And if we needed columns to come out of the this part of the query, we could get them. So if we like so like you can see, like we count the P dot ID column that comes out of the apply in here, which is union all in here. So we’re counting.
We’re able to reference the P dot ID column there, where if we try to do this up here, we couldn’t. So are these queries bigger and longer and more complicated? Yes.
But sometimes that’s what you that’s what you need to. That’s the kind of query you need to write for the optimizer to fully understand what you’re doing. It’s so, you know, one thing that I say quite frequently with clients and in training and to just random passerby on the street is anything you do that makes your job easier, make SQL servers job harder. Sometimes things that you do that make your life easier.
SQL servers like, oh, yeah, well, I’m going to make my life easy to take this query plan. When you start writing things out more verbosely, sometimes you can give SQL Server a better understanding of what you need to do, what data you need, what data you’re trying to touch. And SQL Server can come up with better plans based on that.
Shorter, shorter, simpler queries are not always faster queries. You know, there are a lot of people who seem to feel that way and they write these tiny little compact queries there. Some of them just on a single line, very long lines.
But like writing simple, compact queries is good for simple, compact logic. When your logic gets very complicated, sometimes very short queries that just, you know, try to compact everything very small. It’s not cold golf.
Sometimes the optimizer can’t do a lot with those. Sometimes you have to be you have to write out more specifically what you’re trying to do. And sometimes your queries might look like this. Sometimes your queries might have extra touches of tables in them.
But you can’t like don’t be afraid of trying this stuff because you might find queries perform much better when you are much more explicit to SQL Server about what you’re after. So don’t just stop at simple, complex query because you’re happy with simple, complex query. Get just the performance.
Look at the query plan. See what happens when you spread things out a little bit. Give SQL Server a little room to think and breathe. You don’t have to smother it all the time with these weird queries. Now, one thing that I’ve talked about a few times, maybe not in this video, but in the last video, before I decided to split this up into two videos, that’s why there might be something up in there that seems a little weird, but is case expressions and joins and where clauses.
We’re going to talk about that next. That is why this is 12 and the little tab next to it says 13 case expressions. So we’re going to talk about those next.
And we’re also going to look at some rewrites for those. Some of those look similar to these other than others of them look similar for two other things that we’ve talked about in the series. But it helps reinforce that you’re not writing SQL Server queries correctly and you need to start.
So, sorry to break it to you, but you know what? At least it’s not terminal. Just temporary.
It can be cured. Don’t worry. Without surgery, hopefully. Yeah, hammer. Alright, cool. Thank you for watching. I hope you enjoyed yourselves. I hope you learned something and I will see you over in the next video about case expressions.
And you’re going to learn a lot. So watch it and don’t skip stuff. You heard?
Alright, goodbye.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 25% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.