The SQL Server Performance Tasting Menu: How DISTINCT Queries Can Hurt Performance

This Stinks

Video Summary

In this video, I delve into a common performance issue in SQL Server: the overuse of the `DISTINCT` keyword. I explore how relying on `DISTINCT` can lead to unnecessary work for SQL Server and discuss alternative approaches that can be more efficient. Specifically, I demonstrate using `EXISTS` instead of joins with `DISTINCT`, which not only avoids extra processing but also results in a different execution plan. Additionally, I highlight the importance of leveraging primary key columns when possible, as they can significantly improve performance by reducing the need for complex operations like hash matches and sorts. The video also touches on more advanced techniques such as using window functions to generate row numbers, which can be even more performative than `DISTINCT` in certain scenarios. Overall, this content is aimed at helping database administrators and developers optimize their queries and improve overall system performance by making smarter choices when dealing with distinct result sets.

Full Transcript

Ah, yes. Ah, yes. Video 857 on this 10,000th day of quarantine. Just kidding. Man, the jokes in the, some of the jokes in the video is going to age very poorly, assuming that we age beyond, beyond this week. So, in this video, I want to talk about something that I see a lot, and something that I see sort of frequently causing performance issues, is people being overly reliant on the distinct keyword. Now, this could, this is interchangeable with a long string of group buys. Now, I don’t want to say that this isn’t always necessary, because sometimes it absolutely is, but I want to show you a couple different approaches to getting around having to use distinct and, I don’t know, some stuff to look at. So, I don’t want to vote for, I guess. It’s fun. It’s fun. It’s a lot of fun, isn’t it? Isn’t it fun, SQL Server? Database is fun and sexy. Everyone wearing their bikinis today? I hope so. I am, underneath my clothes.

Yeah. So, let’s just say that at some point, your boss is like, yeah, we need to get a report of people who did a thing. Right? So, we’re selecting some stuff from users, joined to posts, and everyone who has a reputational one. I don’t know, whatever. The query itself isn’t that important. What’s important is the results. But when we run this query, we get a bunch of duplicates back. You get 280,000 rows, and gosh darn it, there are a lot of duplicates in these rows. There’s lots of people posted multiple times. Now, we might say, oh, God. I can’t handle it no more. I can’t take it. We might write a query that looks like this, where we get a distinct group of rows back.

And this doesn’t take terribly long, a couple seconds. We get fewer rows back. I get 218,000 rows back. And I don’t know. It’s not that big a deal, maybe. But did we really need to do this? Did we really need to go and make SQL Server do the extra work to make these values distinct? What if at some point we said, you know what, we should stop displaying user IDs?

Because when we display user IDs, that’s actually someone’s account number. That could, well, that’s not good. Maybe that would be considered personally identifiable information of some kind.

But there we go, where order by items must appear in the select list if select distinct is specified. The other thing that kind of stinks is that if we take out ID, we can have duplicate display names. So there’s actually a minor loss, well, not minor. There’s a loss of fidelity if we just get distinct display names.

There are 2.4-ish million records in the user’s table. There are exactly the same number of unique IDs, but there’s only 1.8-something million unique display names. So there are duplicate display names in there.

A display name is not meant to be unique. Now, one way to get a distinct result setback without having to use a distinct keyword is to use exists. Because with exists, we don’t present duplicate rows.

We just check to see if a correlated row exists in the other table. And when we do this, I mean, the wall clock time isn’t, like, really all that much better. But we do get a different plan.

This runs a little bit faster, and we didn’t have to use distinct. And we didn’t have to select the ID column, right? We notice that we’re not selecting ID, but we can still order by ID. So there are a number of benefits to using exists over a join when we don’t need to see all of the duplicate rows that come out of what would be a join operation.

Exists does not produce duplicate rows like that. Now, where this can make this a more profound difference is with not exists. So something that I’ll see quite frequently is someone follow this query pattern.

We’ll say, select count from users joined to posts. And they’ll… Unruly child.

I don’t know whose kid that is, being noisy out there. But if we run this query, and we see what happens, it will be less pretty than alternatives. So this runs for about 1.8 seconds.

And what kind of sucks about this query, the way it’s written, is you have a full scan of this index on the users table, a full scan of this index on the post table, and then a full join of both of those indexes.

We don’t eliminate rows until after the join. We spend all our time up in this area, and then after this, we filter out rows based on where p.id is null.

If we change the way we write that query a little bit, and we change that to not exists, it runs a little bit faster. We save about 800 milliseconds, and we don’t have that filter after the join now.

We have a left anti-semi join, which indicates that we are just looking for the lack of existence. We are looking for non-matches. Ooh la la.

Big words. Big, fancy database words. Now, where things can get annoying with distinct or with non-unique results of some kind, is if you have queries that select many columns.

Because a distinct is sort of like a group by across a lot of columns. You’re asking SQL Server to find a distinct group of tuples, more and more crazy database words. A distinct set of tuples across many columns.

And that can be, depending on the number of columns and the data types and other things, that can be a fairly painful process. So let’s start with this query. Now, I just hit F5 here, and that ran for about 20 seconds.

Two zero seconds. If you look at the execution plan, it is, I think, in the running for one of the most defensive query plans I’ve ever seen. The reason for that is this part of the plan down here, where SQL Server, because we have chosen a merge join.

See, the optimizer was like, we need a merge join here. We need to sort the entire votes table in order to have the merge join run. This ends up spilling a bit.

Not a ton, but a little bit. But 52,000 pages. Probably not the end of the world, but a decent chunk. And then it’s just not a plan that I like. I hate this plan for many, many reasons.

We’ll leave that there, though. That’s what we call a side thing. Sidebar digression?

I don’t know. Whatever. Words aren’t fun anymore. But now let’s look at how long it takes to return a distinct set of rows here. Remember, that returned 53,946 rows. Let’s say we only want distinct values back.

Okay. Okay. So, what’s important to note about this query are a couple things. One is one of the columns that we’re selecting is the ID column.

Hmm? Yes. The ID column is the primary key. It’s the clustered index of the table. And if you recall back to the trivial plans video, if SQL Server knows that a column is unique, then it doesn’t have to really do any work to come up with a unique result set.

In this query, we get back 25,000 rows. Good. Good, good, good. Good.

Good. And then if we look at the query plan. That took eight seconds, by the way. If we look at the query plan, we can see over here that this asked for around about 740 megs of memory. And if we look at how SQL Server made the result set unique, well, where is it?

We have a hash match aggregate on the votes table. So what SQL Server did was it only made the post ID from votes unique, right? So index scan, hash match aggregate here, and then the join to the post table here.

At no point in this set of columns, or rather in this set of operators, do we see SQL Server making the post table unique. So all it had to do was work on the votes table, which is nice. That’ll change a bit.

If we take the ID column out. So now rather than selecting distinct ID, then accepted answer ID, if we just say select accepted answer ID, well, this query will have some different characteristics to it. And I’m going to show you why this stuff can be kind of nasty, yucky, icky when you get into, you know, more and more rows coming back.

Larger result sets, you know, having, it’s a lot of work. Making things unique. It’s a lot of work.

This query has already been running for 20 seconds to return the, 21 seconds to return the same number of rows. This time the query plan looks a bit different. So now we still have the votes table is over here. And we still have the votes table getting aggregated on post ID right here, right?

And then sorted and then another damn merge join. Boy howdy. If I had a dollar for every time I wanted to shoot a merge join. But again, nothing happening in here.

But now, after we come out of this, we do this join, we do this. Now we have a distinct sort. Mmm. Ooh la la.

A distinct sort. We also have some kind of funny business in the operator times here where we go from 11.3 to 10.2 to 10.2 to 11.6. But that’s neither here nor there.

It’s a good effort on those. But now we see that there is a distinct sort in the data, right? We see the distinct sort. And if we look at the output list and the order by list, we can see that SQL Server implemented getting a unique set of columns in the result set by ordering by every single column that we selected. If we look at the memory grant for this, it is 229.200, whatever that means.

Whatever that’s supposed to mean. But that was obviously not a very good memory grant given the amount that we spilled here and here. It was not a good time for this query.

You can see if we look carefully, we start at one second. And that’s going to be about two seconds. But then we spend, well, about six and a half seconds there and a couple of seconds there and a second there. So this was probably not the greatest way to do things.

Now, quite often when I see people doing this, it’s a bit misguided. Because usually there is not, usually there is within the result set, there is usually a set of columns that we can use to identify a distinct result set. That doesn’t mean we have to use every single column.

So let’s take this query where rather than say we’re going to get a distinct set of columns from everything that we put in, let’s just get a distinct set of columns based on the ID and run this. And that happens quite a stretch faster than even the query where we, again, this unruly child who I wish I knew who it belonged to so I could flog their parents.

But if we look at how long this took, this took about a second to get that unique result set. And we did not, it did not take eight seconds like it even did when we included the ID column. So if you’re really gung-ho on saying I need distinct, damn it, please consider, if you can, including a primary key column.

And if not, perhaps generating a row number over, you know, a couple few columns that you can get unique results set from instead would be a better approach than saying I need a distinct, all of these columns because that can be painful and time consuming. Anyway, where things can get even more annoying, especially with rowstore indexes, is let’s take this query. Now, I had been testing this on SQL Server 2019.

There’s a bit more hope for with batch mode on rowstore. That’s why I have this hint down here to disallow batch mode. But if we look at this query, all right, and we look at what happens, the execution plan is rather wild, right?

We joined the post table to itself and we do a bunch of work here and there’s a spool here and a spool here and a spool here. So really, we fed into this spool and then we reused the data in that spool here and we reused the data in that spool here. And we use that, it can get a result set.

It took about four and a half seconds. Just, I mean, probably not terrible. But let’s look at that same query, right? Same exact query. But now, so SQL Server 2019, we would be using batch mode for rowstore. That would be, you know, probably better, that would be an equally good approach.

Just as good. But you get some attention cake too. But in early versions prior to SQL Server 2019, 2014, 16, 17 specifically, if we create a temp table that’s empty with a clustered columnstore index on it, and we do a stupid join to that temp table that doesn’t actually give us any rows but doesn’t eliminate any rows, it just kind of gives SQL Server this whisper of columnstore in its ear.

And it can start using batch mode for things. We start using batch mode for some rowstore operations. We get a much faster query.

At least we did last time I ran this. But who knows what will happen this time now that people are watching. So this runs in about three seconds, which is better, right? It’s better than four and a half seconds. And we have a far less complicated query plan to get all that done.

So, you know, if you are running these kinds of queries, these kind of workloads, then, you know, columnstore is probably a good choice for you generally. But SQL Server 2019 batch mode on rowstore will definitely start helping out there.

So, what did we learn? If we need to get distinct result sets and we are joining tables together, or if we need to find a lack of existence, usually exist or not exist is a better choice than doing join and using distinct over the results, or doing a left join and saying where some column is null to the join table.

If you need to get a distinct result set, if there’s a primary key on the table, that can certainly help with performance if you’re including that. If not, then usually you can get better performance by using a windowing function like row number to generate row number, figure out where row number equals one.

And usually just doing that over a smaller set of columns is more performative than your alternatives. The more performative alternative. Maybe that’s going to be my company slogan.

The performative alternative.

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

One thought on “The SQL Server Performance Tasting Menu: How DISTINCT Queries Can Hurt Performance”

Sean says:

June 3, 2020 at 11:43 am

Hi Erik,

I haven’t seen an EXISTS with similar conditions to your >= u.id and = and <= would just equate to =)

Also, I really did ponder this for longer than I'd be comfortable to admit.

Comments are closed.

This Stinks

Video Summary

Full Transcript

Going Further

Related Posts

One thought on “The SQL Server Performance Tasting Menu: How DISTINCT Queries Can Hurt Performance”