How To Write SQL Server Queries Correctly: UNION and UNION ALL
Thanks for watching!
Video Summary
In this video, I delve into the nuances of using `UNION` and `UNION ALL` in SQL Server queries. Erik Darling from Darling Data shares insights on how these operations can affect query performance and result correctness. I highlight the differences between `UNION` and `UNION ALL`, explaining why `UNION` adds distinctness at the end, while `UNION ALL` simply concatenates results without deduplication. The video also explores various placement scenarios of these operators within a query to demonstrate their impact on execution plans and performance.
Furthermore, I discuss when it’s appropriate to use `UNION` versus `UNION ALL`, emphasizing that most of the time, `UNION ALL` is more efficient due to its lower overhead for deduplication. However, there are cases where ensuring a unique result set can optimize other parts of the query plan. The video includes practical examples and SQL Server 2022 function demonstrations to illustrate these concepts, making it easier for viewers to understand how to write queries that return accurate results efficiently.
Full Transcript
Erik Darling here with Darling Data. And we’re going to talk in today’s video, continuing on with our series about how to write queries correctly. And this, of course, you know, comes down to two things, like both getting an accurate, correct result, you know, according to the logical demands of the query, and also having it return data to you in as efficient a manner as possible. And today we are going to talk about Union and Union All, or as they say in the South, Union Y’all. All right. If you can forget that joke happened, and you would like to support this channel, not for the jokes, but for the SQL Server information, or if you like that joke, you can do it for the jokes, too. I don’t care.
Whatever your motivation is, is fine with me. There’s a link in the video description where you can become a member of the channel. If clicking that link is too hard for you, perhaps clicking other things you’ll find a little bit easier. Liking, commenting, subscribing, all wonderful things to do. If the topics in these videos are near and dear to your heart, and you’re having SQL Server issues that you think a young, handsome fellow like Erik Darling from Darling Data could help you solve in exchange for money, I am available for hire for any and all of these things. Have a great time with me together, one-on-one, solving all sorts of stuff.
Training. Good to have. Better to watch. I had someone email me and say, hey, when I go to your training site, I can’t add anything to the cart, and I can watch all the videos. Did I buy this before? And so I went and looked in my receipts drawer, and lo and behold, they had purchased the training in December of 2020. So I said, yes, you did purchase this.
Merely, I mean, just like on the cusp of four years ago. Training works best when you actually watch it. It’s a participation sport. You have to be involved. Otherwise, you still know nothing. No upcoming events. 2025, talk to me. Tell me where to go. I don’t know where you are or where you want me to go.
I’m not endowed with psychic abilities, though I wish I were. So you have to tell me where you would like Erik Darling to be. With that out of the way, let’s talk about union and union all. So union and union all are funny because they kind of get used interchangeably in almost the same way that a lot of other things are, with very little consideration for performance or result correctness or other things.
So like CTE, temp tables, temp tables, table variables, parameters, local variables, all sorts of things that, you know, joins and exists. People just, you know, start writing a query one way and then they just always write the query that way. I remember a long time ago I read it. I used to play drums when I was a kid.
And I was reading an interview with a drummer in some drummer magazine and he was like, Ah, yeah, you know, I was playing in this cover band and, you know, we were like, you know, it was good because it was a very successful cover band. But, you know, every time I sat down to play drums, the only thing I could think of to play were like the drums to these cover songs.
And I was like, like, wow, that’s depressing. And then I realized, wow, that’s how people write queries too. They sit down and they’re just like, Oh, I’m going to play Copacabana again, I guess. So there are lots of times when I see developers use union, like when the results have absolutely no chance of having duplicates in them.
They either like join different tables together or have different where clauses, or sometimes they even have like literal values in the select list. They’re just like, you know, like, you like, how can you how can you make those distinct? How can you do? How can you do that? It’s weird. You already have distinct in the select, how are you going to make it more distinct?
And, you know, a lot of this stuff comes from testing queries in isolation, not really knowing that, you know, what when you should use exist first joins, because they haven’t watched the other video in this series about exists and joins. And, you know, once you start getting other things involved, like no lock hints, you could just end up with crap everywhere.
Bad things popping around everywhere. Now, how to write queries correctly does depend on a number of things. A, the quality of your data in general.
B, the quality of the data structures that you have available. That largely means the indexes. And also, you know, like, what returns a result to you as quickly as possible that is still logically correct. This can all become really difficult to figure out.
You know, especially the more complex a schema is, the more things you have to get involved, the more calculations you have to do. Knowing what, I think, really the hardest part about writing a query, aside from knowing, like, the SQL behind it, is knowing what the correct result should look like. It’s a very hard thing to define up front.
You have to, like, write a query, get a result, and probably show it to someone who’s like, who can be like, yeah, I don’t know either. Probably. I guess it’s right.
Seems fine. You know, validating query result correctness is a challenging, challenging thing. I’m happy when my rewrites just match what the slower query returned.
I have no idea if that’s right or not. I’m not checking that. I’m just making sure that what, like, the same number of rows and the same data output is as far as I usually go. Because I don’t usually get involved enough to, like, dig deep into, like, someone’s data and understand what an actually correct query result would look like.
It was sort of a funny story where there was one client of mine who had a very, very slow recursive CTE, which I helped fix up. And the original version of it would just never return a result. It wasn’t like an infinite loop thing.
It was just slow as hell. And when we got it right and we sped it up, they were like, I think the logic in here is wrong because these results aren’t correct. So we had to, like, redo the recursive CTEbecause they were like, yeah, this isn’t showing us the correct thing.
So that was fun, too. So let’s use a rather nifty SQL Server 2022 function. So let’s start talking about union and union all and how they differ.
And actually, union and distinct is what we’re going to talk about first. So I’m going to, you know what, I think I actually already did this stuff. If we look at these two query plans, there should already be stuff in these tables.
Yes, there are. Okay, great. So we have two queries here. The first one says select I from T1, union select I from T2.
The second one says select distinct from T1, union all select distinct from T2. The thing that I want to show you here first is, of course, the results. We get very different results back.
The first one just returns 1, 2, 3, 4, 5, 6. The second one returns 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6. So even though we had 1 through 5 in this table twice and 1 through 6 in this table twice, these queries made different things unique at different points.
If we look at the query plans for these, the union query adds its distinctness at the end across both result sets. So we not only deduplicate the results from each table scan, but we deduplicate the entire result here. For the two distinct queries, the distinctness happens here, not at the end.
Right? So when we say select distinct from this one, select distinct from this one with a union all in the middle, we make this distinct and this distinct, but we don’t make the final result distinct.
We just spit back whatever these two things put together. So distinct and union all do make things, distinct and union do make different things unique. Most of the time, if you are using union, rather all of the time, if you are using union between two queries, you don’t need to also add distinct to the select list because you’re already going to get that at the end.
You might just be doing weird extra work. Another thing that is kind of fun about, is about union and union all placement in a query. So if we look at, oh, I should have highlighted the whole thing, shouldn’t I?
If we look at this and we look at the query plan, we just get a constant scan. If we were to quote any of these things, then, you know, it wouldn’t make a difference. But if we start changing these, we will start seeing slightly different results.
For example, this one removes an extra one and two, or sorry, removes an extra one, but we still have two twos and two threes. I think the execution plans for these are just kind of funny because we have two constant scans and a distinct up here, but then just a concatenation for the other two things, which are down below this.
If we quote this one out and this one in and run this, the execution plan now has this with the distinct over here, right? We have a constant scan, constant scan, concatenation, constant scan, concatenation, and the results are one, two, three, three. So we have removed some additional, we have removed that extra two from the query results.
And of course, if we put the union down here at the very end, we will see a slightly different query plan and a fully deduplicated result set. One, two, three. We look at the query plan for this.
We have one constant scan, one constant scan, and then the duplication at the end, just like with the query where we selected from the temp tables. So that’s fun right there. Now, there has been quite a lot of performance talk about union and union all over the years.
Well, I do agree that most of the time, as long as you get correct results, union all is going to be a bit cheaper on you because SQL Server will not attempt to deduplicate the result sets. Even attempting to deduplicate unique result sets, if you have a bunch of string columns involved, can be rather painful and unwieldy.
If you are in the habit of writing union queries or even putting distinct in your select list, I would really strongly encourage you to think about the number of columns you’re selecting, the data types of the columns you’re selecting, and what actually identifies a unique row. Because you might be doing distinct over a bunch of columns where it’s not making a difference. You might be doing union over a bunch of columns where it doesn’t make a difference.
And oftentimes there is a sort of hidden subset of columns in your data that you can make an easy sort of distinct result set from without having to worry too much about it, without having to worry about long select lists and stuff. So if we run this and we look at the query plan that comes back, this takes about two seconds.
This is not a terribly dramatic example, I admit it. And, you know, it’s okay. But, you know, the thing is that we’re, you know, doing select all this stuff and including this text column and, you know, it’s just, it’s, it’s, this is an envarchar 700.
And now you have to worry about deduplicating that. If we were to take this query and do this, and what we would be doing is taking the same base query with the same columns in it, but then only generating a row number over the columns that we know make a unique result set, this will, this is a little bit faster, right?
Like I said, this is not a terribly dramatic example. That’s about 400 or so milliseconds faster, but it’s still a good example of how you can improve things by thinking a little bit more, a little bit more analytically about your data and what you need to make unique and what you don’t.
Right? So like in this case, doing a, generating a row number over these two union all the result sets is a lot faster. Now we’re now when I said most of the time, I do mean most of the time, there is less overhead to union all over union.
But there are some cases where it does make sense to make a unique result set to make some other operation in a query plan more efficient. Uh, I want you to think of this sort of like, uh, pushing predicates down to when you touch tables rather than having a filter operator happen later. That’s something that I talked about in the, uh, exists versus joins video, where I showed you a query that uses a left join, uh, with a, a where clause, uh, to find rows that don’t exist in one table that exist in another table.
And how using, uh, not exist was much more efficient in that case because we joined less data together, right? Because with the, the left join thing, we had to fully join both tables and then filter out nulls afterwards. The same thing is for like pushing any predicate or reducing rows as much as possible before you do something that is, uh, computationally expensive in a query.
So sometimes getting a distinct set of rows for your query can make things a lot better. So what I’m going to do now is, uh, populate this temp table with, um, user IDs for people who have won these badges. And the first way that I’m going to run this query is with a union.
And we’re going to marvel at the query plan for this. So this takes about, let’s see, five seconds right there. Right.
And, uh, you know, this, this would get better with, you know, slightly better memory grant stuff like that. Uh, if I, if I ran this like multiple times, you would see it get a little bit faster because SQL, since I’m in compat level 160 for the other stuff, we’re getting memory grant feedback. So this would improve over a few runs that we would eventually see that spill kind of fall off.
Usually it’s like three or four runs before the spill goes away. But, um, anyway, now what I want to do is change this to union all. So now we’re going to be taking like, and before with union, we were deduplicating these results, right?
We were getting rid of them. Um, but now when we do this, what we’re going to notice is that this query no longer finishes in like five seconds. This query drags on for a little bit longer.
And by a little bit longer, I mean, this thing is going to run for about 30 or so seconds total. Um, it’s been a while since I timed this one. So, you know, who knows, maybe, maybe, you know, Intel gave me some supercharged boost to my CPUs and maybe it’ll finish a little bit quicker.
Um, maybe not. I don’t know. We’re going to, we’re just going to, we’re just going to let this thing ride.
Uh, and we’re at, well, we’re at 35, 36, 38. Oh, we’re at 40 seconds now. Uh, do, do, do.
Well, I don’t know. I think this, this might be proving the point a little too well. So we got up to almost 50 seconds on this one. We got 47 seconds. And that’s because rather than, um, and, and, and you can see that like the, the, the, the, the pain of this wasn’t even in here.
Like there was almost no overhead to like doing the concatenation of these. Uh, we did still have to, you know, do all this stuff in the sort and whatnot, but where this makes the biggest difference is the number of rows that end up or the number of, uh, things that we end up doing in the table spool. Right.
So we have a nested, we have a nested loops join here. SQL Server uses a table spool here. Um, I’ve talked about table spools in the past, but they are sort of interesting. Uh, SQL Server uses them on, like for in select queries, not modification queries, modification queries, spools are for Halloween protection.
In select queries, spools are often there to optimize, uh, operations on the inner side of nested loops. By the inner side of nested loops, I mean this portion of the query plan in here. So, um, a table spool is fun because a table spool, uh, you know, you take, uh, sorted data from this table, right?
You sort it here and then you pass it to the nested loops and the nested loops, uh, will, you know, tell the table spool to go run the query for, let’s just say ID one. Uh, it’ll populate the table spool. Uh, it’ll populate the table spool.
And then for any additional runs with ID one, it’ll reuse the data in the table spool. Then as soon as we get to ID two, the table spool will get truncated and repopulated. And then, uh, you’ll like, it will, it will reuse data in the table spool for ID two until we hit ID three.
So table spools really can save a lot of time and energy on the inner side of nested loops sometimes. But in this case, the problem is more that we end up with way, way, way more stuff to do with the, because we don’t deduplicate results here the way that we did with the union query. So if we go back and we quote out the all here, remember this is about 40 something seconds and we run this.
We’ll do this one more time with the, with the, the union rather than the union all, it should be about five seconds or so. So do do do. And we get the results. Notice that, uh, like, you know, we, we, we do the same thing where we, you know, concatenate the results. But then before we, uh, send it, send it along, we get, we use a distinct sort to make the results of these two things distinct.
And we spend a whole lot less time in this portion of the query plan. Right? So this, like this part does a whole lot less work because we made the results set unique in this part. So you can run into interesting situations where using union to deduplicate results makes a computationally expensive part of a query faster or repetitive part of a query faster.
In this case, the nested loops is really what did it. So that’s a, that’s a pretty good way of thinking about things sometimes where can I make, can I make this part of the query execute fewer times or do less work with a distinct result set rather than a non distinct result set. So anyway, that’s about all I have to say about union versus union all. As always, I hope you enjoyed yourselves. I hope you learned something.
I hope that you will continue watching this series. I hope that you will continue to write queries correctly. And I will see you over in the next video, which is going to be about two things that I never see anyone use. Intersect and accept.
Fun times. Oh boy. You’re going to really, you’re going to get a relational mouthful on the next one. So I will see you there. Goodbye. Bye.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.