Learn T-SQL With Erik: DISTINCT Isn't Always Evil

Learn T-SQL With Erik: DISTINCT Isn’t Always Evil

Video Summary

In this video, I delve into the nuances of using `DISTINCT` in T-SQL queries, addressing common misconceptions and providing practical examples to help you understand when and why `DISTINCT` is necessary. I start by explaining how `DISTINCT` can sometimes get a bad rap due to its impact on performance with large result sets, but emphasize that it’s not always harmful or avoidable. The video covers scenarios where `DISTINCT` is crucial for getting accurate query results, such as in join operations and when dealing with many-to-many relationships. I also discuss alternative methods like using `EXISTS` to achieve similar outcomes without the performance overhead of `DISTINCT`. This content is part of my comprehensive T-SQL course, which you can pre-order now at a special price of $250; after completion, the price will increase to $500. If you’re attending the PASS Data Community Summit in Seattle this November and are interested in these topics, be sure to check out the companion material for my pre-cons with Kendra Little-Leader.

Full Transcript

Erik Darling here with Darling Data and we are going to, in this video, which is again more sort of preview content from my T-SQL course, we’re going to talk about Distinct. Now, if you would like to get in on the pre-sale price for this video, down in the video description there is a link. You can buy it right now for $250. That price, once all the material is finished, will be going up to $500. And of course, this is companion material for the pre-cons that Kendra Little-Leader has. And I will be, I don’t know, I don’t know, I don’t know, I don’t know, I don’t know, I don’t know, I don’t know, I don’t know, I don’t know. I don’t know, I don’t know, I don’t know, I don’t know, I don’t know, I don’t know, there’s lots of P’s you could use in there. At PASS Data Community Summit, this November in Seattle. So if you are attending those pre-cons, you can, of course, get, you’ll, of course, get this content free with your admission there. Otherwise, the link is down below. So this is a short excerpt from my material on Distinct.

And I hope you enjoy it. Anyway, when it was like, like, like, I wanted to write specifically about Distinct, because there are times when it’s like, not harmful, and when it is necessary. And like, all the time, like, every time I open LinkedIn, it’s like, the first three things I see are like, this terrible LLM generated posts, where it’s like, top 10 things to avoid in SQL, and it’s like, fire emoji, avoid, avoid select star, and you’re like, here we go. And it’s rocket ship emoji, don’t use distinct. And you’re like, hmm, but, but some, but, and there’s never like, an alternative.

All right, there’s never like, like, here’s what, here’s a better way of doing anything. It’s always just like, don’t do this stuff later. You’re like, oh, great. Oh, wonderful. You really put your heart and soul into that one. Good, good, good job there. But what we’re going to talk about in this video is sort of like why Distinct kind of ends up sneaking into things. And when you might actually need to distinctify some stuff in order to make SQL, make, get query, get the correct query results back. Who knows, right? Sometimes you got to make something distinct.

There are a lot more examples in the full content, but this is, so this is just something to give you an idea of what’s in there, right? A little, this little, what’s that, what’s that you’re cooking? What’s in the kitchen? Good stuff. Come for dinner. Whatever. Anyway, let’s say that, you know, you’re, you’re writing a query and you want to see everyone, all the users with a reputation of one that have posted an answer. Remember in the inner join, outer join video, we talked about different join, like logical semantic stuff that preserves rows on whatever side of the thing.

And so you write a query that looks something like this, right? Where we’re selecting ID and display name from the users table. And we’re just joining to the post table to see, hey, did they post anything? Right? Did they post the right kind of thing in there? Oh, maybe they did. Maybe they didn’t. But we’re going to find out now.

The thing is when you run this query, you get like a lot of duplicates in here, right? You can see good old, a mean, a meanie. That’s duplicate rows. And, you know, as you go through old farmer, Dave, a couple in there and you realize that, well, you know, we got like a hundred thousand rows back from this thing. If we run this and get the actual execution plan, about 99,000 rows come out of this and we realize we got a lot of duplicates in there.

And we might not want a lot of duplicates in there. So what we do is slap an old distinct on it. Now, the reason why distinct gets sort of a bad rap is because it is like doing a big group buy on all the columns that you are selecting. Right? Maybe just distinct. Right? It doesn’t happen at the same place when you, in like the logical query processing part of the query is grouped by.

It happens like way at the end and like down, like if you remember the order that SQL processes queries in, distinct happens way down at the end with the select. So it’s like a group buy after you’ve already like potentially grouped by other stuff. Like there’s very little reason to mix distinct in with a group buy as well.

So like, let’s say we just slap a distinct on this one, which is no different from saying group buy ID display name. And when we look at the execution plan, rather than 99,000 rows coming back, it’s only 79,000. So we got like 20,000 some odd rows from the results by putting a distinct on this.

Now, every, now every combination of ID and display name that we see in here is going to be unique. We see that now we only have one Amin Amini and there’s going to be one Farmer Dave and everyone else. We just have one of them. They are now, this is now fully unique.

Now, the reason why this happens a lot is because people’s view of SQL in general, like as a language, how it fundamentally works is often full of shortcomings, right? The goal of this course is to fill those shortcomings in, right? There’s a need in the market, stuff about the economy. I don’t know.

But, um, the, like a lot of the times people will use a join, uh, some, some form of join syntax to figure out if rows exist or don’t exist. And that’s not really the best way to do this sort of thing. So, uh, if Zoomit will wake up, hello, Zoomit, come on, paid good money for you.

Just kidding, Mark Rassinovich. I downloaded wares. Uh, just kidding, it’s free. Um, so with a join, if you have a one-to-many relationship, the way you’d be doing the, from the user’s table to the post table, right?

Because one user could have many posts, joins preserve that, right? So we get all the rows back from a join that match, right? It doesn’t, doesn’t matter if there are one or 10 million, all those rows come back if they match, right?

The joins preserve that sort of duplication stuff, the one-to-many relationship. Because, like, you might, you know, want it for some queries, right? There might be something that you actually need, like, all of the matches for.

Like, if you were summing up the total score for a user, you would need all of their scores. You wouldn’t want just one of their scores. It wouldn’t make sense, right? You need all their scores to sum up a user’s total score from the post table.

If you just need to figure out if data is there or not there, the best way to go about that is to use exists or not exists. Because these either find a row or they don’t. If you find a row, it existed.

If you don’t find a row, it not existed. But you don’t preserve one-to-many relationships. You just find a relationship or don’t find a relationship. And that’s what meets the logical requirements of your query, right?

You don’t mess around with a whole bunch of other stuff. So, I love summing is now mad that apparently I downloaded it for free. But one way that you can get around having to use distinct is to use instead of a join, just use exists. Because this means that we’ll only get one row back when there’s a match.

So, like, for this, this just says, hey, if this user is over here, send them to the results, right? It’s not like if this user is over here, get all their results, right? Get all their posts.

So, if we run this and we look at the execution plan, we will have gotten the same 79,000 rows back without having to slap a distinct on it, which would make the LLMs of LinkedIn happy, right? I wish there was a way to just report those people. No semicolon.

I should report myself. Anyway, there are times when distinct is the correct thing to do. Now, let’s take a query like this, which is sort of a many-to-many relationship. So, we’re selecting from the post table and we’re doing a left join to the comments table, right?

And we’re just looking at this for one single user. Now, since this is a left join to the comments table, if we wanted to get a count of how many comments a user had, we could do this in one of three ways. We could either just do a count, which would be wrong.

We could do a count of just the ID column in the comments table. Now, remember, count doesn’t count nulls. It only counts present values, right?

But it will count duplicate present values, right? We talked about this in the case expression video a bit. And then the other way of doing this is to say, give me a count of just distinct IDs from the comments table. Now, this is important because ID is the clustered primary key of the comments table.

So, if there are duplicates in here, it’s not because we have some, like, error in our table and we have corrupt data and we have duplicates in our primary key. That’s not the case. It’s just the left join has a, like, it’s a many-to-many relationship between posts and comments.

Right? Like, one user can have a whole mess of comments. So, this is one of those things.

I guess it’s one-to-many relationships and it’s just one user. But one user can have lots of comments, but they have, but there’s also, but, like, the owner user ID in the post table is not unique. Right?

Because one user can have lots of posts. So, it’s, like, one, it is a many-to-many in that regard. So, if we were to run this query and look at what these different counts produce, we’d pretty quickly see that we do need distinct here in order to get correct results. So, let’s run this.

And so, it’ll take a second to run because we’re doing a lot of work here. But let’s look at the difference in what these counts produce. Right?

When we just do count or when we do a count of the comment IDs, right? That’s these two up here. Right? We have the count big and the count big CID. We get gigantic numbers back.

Right? But when we do a count of distinct IDs here, we get a much smaller number back. 46,737. I’m not sure how to pronounce this number.

We’re going to, there are no commas in there. We’re just going to be terrified. But we can see where these counts came from. At least the two big wrong ones. Right?

If we look at the execution plan, look at the number of rows that that join produces. That is the same number that we got from the count star and the count CID. Right?

So, the 27,901 rows in the post table that match owner user ID 22656 and the 46,737 rows in the comments table, those get joined together and they produce 1304009036. It’s a nine, 10 digit number.

I might have run out of fingers if I did that. Right? So, that produces a big result set. Right? It’s not quite a Cartesian product, but it’s a big number. It produces a big number of rows.

Right? Because they all match. There’s a many to many relationship. It’s like 27,901 times 46,737. It’s a lot of rows. Right?

Big number of rows. But the select distinct on the ID column over here produces the correct number of rows. And we can validate that by just hitting the comments table and getting a count for that specific user ID.

So, if we run this and we look at how many rows come back from that, we’ll see user ID 22656 does indeed have 46,737 rows in there. So, while I totally get the advice to avoid distinct, you know, it comes from a good place because, you know, distinct does get stuck on very big queries that select a lot of columns.

A lot of them are probably strings. And, like, doing that sort of distinctification of really, really big result sets can be painful. But there are plenty of times when using distinct, A, provides you correct results.

And, B, you know, not something we’re going to explore today. But there are times when using distinct in a query gives the optimizer some guarantees around a unique set of data being produced by that query.

And that can actually help the optimizer quite a bit to produce better query plans. And we’re actually, you know, using distinct to produce fewer rows from something that’s part of your query can be very useful at times.

So, don’t throw distinct out completely. Distinct can be useful. But, you know, there are times when, you know, just doing a select distinct on something can cause, you know, some performance pains.

But, you know, if you test it with and without distinct and it’s not that much different, maybe, I don’t know, maybe it’s okay. Maybe the LLMs aren’t always right.

Shocking, I know. Anyway, thank you for watching. I hope you enjoyed yourselves. I hope you learned something. And we will talk about something else wonderful and T-SQL-y in the next video. Anyway, goodbye.

Bye-bye. Dalai. Bye-bye. Bye-bye. Thank you.

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.