The SQL Server Performance Tasting Menu: How To Count Distinct Values In SQL Server

Counting On You


Video Summary

In this video, I delve into the intricacies of counting distinct values using windowing functions in SQL Server, a topic that often leaves many developers scratching their heads. I share my experience of having to delete 15 minutes of footage because I accidentally took a sip of coffee and felt uncomfortable with the swallowing noise it might have picked up. This minor distraction led me to focus on the importance of clean audio for viewer enjoyment. The core content, however, centers around how counting distinct values over a window is currently impossible in SQL Server—despite Postgres supporting this feature effortlessly. I explore various workarounds, including a complex query rewrite that significantly improves performance but requires more code than necessary. Through detailed analysis and humor, I highlight the ongoing challenges with windowing functions and advocate for their continued improvement, emphasizing the need for better support and syntax in future updates.

Full Transcript

I want you to know how much I care about your listening and viewing pleasure with these videos. I just deleted 15 minutes of video because I took a sip of coffee towards the end, and I felt that there might have been a swallowing noise picked up, and that made me uncomfortable. So I said to myself, I'm not going to make these nice people watch 15 minutes of video just to hear a swallowing noise and feel grossed out, because if there's one thing that I can't abide, it's mouth noises: swallowing noises, chewing noises, any of that. It's all grotesque. I don't have any form of OCD whatsoever. Don't worry. I want to talk about counting distinct values in the context of windowing functions, because we can't. It's unfortunate. And the workarounds that we must use to get this to work are weird. And then getting them to be fast is also kind of weird. So without further ado, let's go look at what the heck I'm talking about. Now, this is my index right here. This is the one that I have on the Comments table. It's on user ID and then score. And I don't think that you could ask for a better index to support this query. It covers our where clause, right? User ID is not null. Score is greater than one. That's good setup for this.
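A sketch of what that index might look like. The index and table names here are my guesses based on a Stack Overflow-style Comments table, not necessarily the exact ones from the video:

```sql
/*
    An index keyed on UserId, then Score, to support a query
    that filters on UserId IS NOT NULL and Score > 1 and
    partitions by UserId. The name is made up for illustration.
*/
CREATE INDEX ix_comments_userid_score
    ON dbo.Comments (UserId, Score);
```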

And then more importantly, in this query, our partition by element is user ID, and we are getting distinct scores. So we're set up very, very well to do this. The problem is... well, this query runs really quickly. This is one of the fastest queries I've ever written, because it throws an error immediately. It's just done, right? It doesn't get much faster than that. We can't use distinct with the over clause. Why? I don't know. We can count. We can count distinct. We can count over. But for some reason, counting distinct over is just impossible. Can't be done. Or rather, hasn't been implemented. Why? I don't know.
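The query being described looks roughly like this. The column names are my reconstruction; what matters is the DISTINCT inside the windowed COUNT:

```sql
/*
    A sketch of the query from the video, not the exact one.
    SQL Server rejects it immediately with:
    Msg 10759: Use of DISTINCT is not allowed with the OVER clause.
*/
SELECT
    c.UserId,
    COUNT(DISTINCT c.Score) OVER
        (PARTITION BY c.UserId) AS distinct_scores
FROM dbo.Comments AS c
WHERE c.UserId IS NOT NULL
AND   c.Score > 1;
```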

I hear Postgres does it for free, though. So I don’t know. Maybe you want to think about migrating to Postgres if this is really important to you. If you would, I don’t know. Maybe the world would end if this was implemented properly. Well, if you’d like to vote for the world ending, you can go to this link right here and vote for this. Vote for this as a feedback item. Keep in mind, though, like this bitly link is case sensitive. So mind those Cs and Ds and Ps when you’re typing it in to go vote. Thanks for doing your democratic duty to vote for the end of the world.

So without much further ado, I think there was already without further ado. So let’s cut down on even more further ado here. And let’s look at a rewrite of that query. Significantly more query was written in order to get that to happen. So here we are selecting a distinct from comments. Cross-apply comments to itself correlated on user ID with the filtering elements that we need in there.
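Something in the spirit of that rewrite, assuming the same Stack Overflow-style Comments table (the alias and column names are mine):

```sql
SELECT DISTINCT
    c.UserId,
    ca.distinct_scores
FROM dbo.Comments AS c
CROSS APPLY
(
    /* correlate back to the same table on UserId,
       with the same filters, counting distinct scores */
    SELECT COUNT(DISTINCT c2.Score) AS distinct_scores
    FROM dbo.Comments AS c2
    WHERE c2.UserId = c.UserId
    AND   c2.Score > 1
) AS ca
WHERE c.UserId IS NOT NULL
AND   c.Score > 1;
```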

And this will, thankfully, not throw an error. However, it will not be particularly fast either. Using my patented scientific rows to second ratio, this is not a very fast query, right? Three seconds, 14,000 rows. Not into it. Maybe if we got, like, down to three seconds from some much higher number, it would be better.

But starting at three seconds, I’m like, oh, oh, not cool. So I’m not going to show you the query plan for this now because it’s sort of irrelevant at the moment. It’s not fully relevant until I show you the rewrite that is fast and why the plans are different.

So let’s scroll on down. Let’s also get rid of some more further ado here. Now what we’re going to do in this query is we are going to add a sort of redundant-looking group by on user ID. I say it’s sort of redundant-looking because we are not projecting the user ID column inside of the cross-apply.

We’re not selecting user ID. We’re not bringing that with us. So what’s the point of grouping by user ID? We’re just getting a count of distinct scores. We’re correlating on user ID here.

What possible good could grouping by user ID do? Well, how about this? How about zero seconds for a rows to second ratio?
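The only change from the first rewrite is the GROUP BY inside the apply. Again, a sketch with my own names:

```sql
SELECT DISTINCT
    c.UserId,
    ca.distinct_scores
FROM dbo.Comments AS c
CROSS APPLY
(
    SELECT COUNT(DISTINCT c2.Score) AS distinct_scores
    FROM dbo.Comments AS c2
    WHERE c2.UserId = c.UserId
    AND   c2.Score > 1
    /* redundant-looking, since UserId is never projected,
       but it nudges the optimizer toward an apply
       (nested loops) plan with a seek on UserId */
    GROUP BY c2.UserId
) AS ca
WHERE c.UserId IS NOT NULL
AND   c.Score > 1;
```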

How about that instead? Now, again, I’m not going to show you the query plan here because it doesn’t matter here. The query plan matters down here where we can look at them together and make fun of them.

We did a thing. So let’s see. This one’s still slow. About three and a half seconds. See? Three and a half seconds right there.

And if we trace the lineage of that three and a half seconds back a little bit, if we step back from here and here and here and here and here, we can kind of see that the bulk of the time in this query is spent just about in here. This is where most of the time the query plan comes from.

There’s a little bit of time spent in this branch. Now, keep in mind this is SQL Server 2017. This is a row mode execution plan. And so the times going from right to left will largely be cumulative.

So all these times will add up going from right to left. All right. So we spent about 335 milliseconds from here to here.

All right. So we start at 252 milliseconds and then another 25-ish. And then you get the point. Every operator's time includes its child operator or operators' time, plus its own. By the time we get over here, we're at about two and a half seconds.

Of that two and a half seconds, we’ve got about 500 milliseconds on this index scan. And then about a full second in the repartition streams. So then we have another, like, second or so here.

So why, though? Why? If you remember back a couple of videos ago, when I was talking about other distinct matters, I showed you a pattern where people, to find rows that don't exist in another table, will use a left join where some column from the other table is null.

And that's an anti-pattern because you'll end up with a full join and then a filter after the join. If you get the filter before the join, or if you write the query using not exists so that you don't need a filter at all, because the optimizer can use an anti semi join to rule out rows, you have a much faster query.
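A minimal sketch of that pattern, with made-up table names (a Users table is assumed here purely for illustration):

```sql
/* Anti-pattern: the optimizer fully joins,
   then filters out the matched rows afterwards. */
SELECT u.Id
FROM dbo.Users AS u
LEFT JOIN dbo.Comments AS c
    ON c.UserId = u.Id
WHERE c.Id IS NULL;

/* Rewrite: NOT EXISTS lets the optimizer use an anti semi join,
   ruling rows out without producing the joined rows first. */
SELECT u.Id
FROM dbo.Users AS u
WHERE NOT EXISTS
(
    SELECT 1
    FROM dbo.Comments AS c
    WHERE c.UserId = u.Id
);
```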

And that’s true here as well. So if you look at what happened, we aggregated down to about 150,000 rows here, but then we had to fully join those 150,000 rows to about 25 million rows. Now, this is kind of funny and kind of funky because it’s a self-join.

So any row that comes out of that 150,000 is going to match to the 25 million, right? We’re going to have matches there. We’re guaranteed to have matches, like fully guaranteed to have matches there.

It's not until after we join that and get a full 25 million rows back that we filter down to about 10 million rows. And then way later in the plan, we get down to about 14,000 rows.

Nearly at the very end of the plan, we get down to about 14,000 rows. That's different in the second query, the bottom one, where we get down to those 13,000 or so rows fairly early.

You can see the filter is before the join here. We don’t have the full hash join of 25 million rows to 150,000 rows. We have an apply nested loops join.

You can see in there we have that outer reference. So that means we got apply nested loops. And we can verify that by looking at the index seek over here. For some reason, that tool tip is being stubborn.

But we can see the seek predicate there for user ID. So we get down to about 14,000 user IDs. And we hit those and we get our rows back here. And then we very quickly get back down.

We aggregate back down to 14,000 rows. So we skip a lot of the pain of this hash join of joining 150,000 rows to 25 million rows and filtering down to about 10 million rows afterwards.

We get down to only producing about 10 million rows from here and then aggregating down to 14,000 rows nearly immediately, which is, I think, pretty spiffy.

So what’s the deal? That’s a clever rewrite. Sure, it’s probably helpful to some people sometime. But I think more to the point is that windowing functions have a long way to go.

I mean, I appreciate that we have them, that we got them at all, and that they solved real problems: they replaced, you know, a lot of painful cursor and correlated subquery stuff.

But there’s a lot of basic linguistic support still missing. And, like, trust me, I like them better than the alternative, but boy, oh, boy, we haven’t had, like, a serious update to windowing function syntax since, I don’t know, 2012-ish?

Somewhere in there. There’s a lot of stuff that we would have, should have, could have had in windowing functions, but instead we got, I don’t know, big data clusters, distributed availability groups, and I don’t know, other wacky stuff that someone likes, I guess, somewhere.

I don’t know. I don’t touch this stuff. Anyway, that’s that.

That’s what I had to say about that. I’m going to go turn off recording, take a sip of coffee, and then record another video. So you’re welcome.

And I love you. Goodbye. Bye. Bye. Bye.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.