Learn T-SQL With Erik: WHERE vs Having
Video Summary
In this video, I delve into a practical example to explain the differences between using `WHERE` and `HAVING` clauses in SQL Server queries. Erik Darling from Darling Data shares insights on how these clauses can be used effectively or misused when filtering data. By walking through an interesting query involving user display names and reputation scores from the Stack Overflow database, I demonstrate why applying a `WHERE` clause might not always yield the desired results and introduce conditional aggregates as a solution to preserve the integrity of grouped data while still filtering out unwanted rows. This video is a sneak peek into my upcoming T-SQL course, which will be available at a presale price of $250 until it launches in full, priced at $500. If you’re interested in learning more advanced SQL techniques and strategies from experienced instructors like myself and Kendra, make sure to check out the link in the video description for early access.
Full Transcript
Erik Darling here with Darling Data and today’s video is going to be another little sort of teaser-oo for my upcoming T-SQL course. If you are interested in getting in on the presale price for this course, link down in the video description to do that. It is currently $250. When the course is all completed, it will be going up to $500. And of course, this course, of course, of course, of course, is a horse. This course material is going to be $500. This course material is going to be in large part a representation of the material that Kendra and I are presenting at our pre-cons at PASS Data Community Summit in November. So, like the stuff that you learn there will be the stuff that is in this course. And if you attend our pre-cons in Seattle, I will give you access to this material as part of that ticket. So, you will get it for free if you show up there. If you’re not going to be there, well, the link to purchase it is below. There are no further discounts on this. So, don’t ask. It’s a lot of work. But in today’s video, we’re going to talk about where versus having a little bit. And because they’re both ways of filtering data, right? So, like you have the where clause, which filters out rows from query results, right? It applies a filter and it says you rose, you made it or you didn’t make it.
And then you have the having clause. And the having clause introduces filtering on a grouped table. And by grouped table, I mean you have a query with a group by expression in it. And you might want to figure out if the result of some aggregate on that group table meets some criteria. So, an easy way to show you that is to use this first query here. We’ll just wait for ZoomIt to wake up. ZoomIt apparently is just taking a very long nap and doesn’t want to do anything for me here.
So, this is an example of a grouped query. We have a group by on the display name column, right? We are also selecting the display name column. But we’re asking SQL Server to filter out the results by only display names. Now, this is a sort of a funny quirk about the Stack Overflow database and it’s kind of an amusing query because display names are non-unique.
Users can, multiple users can have the same display name. There are even two John Skeets in the Stack Overflow database for some strange reason. But it’s just sort of, so it’s sort of a funny query where we’re just taking all the display names and we are looking for that average reputation, average downvotes, average upvotes, and then the count of total users that have that display name. And we’re sort of putting them on teams together, right? So, like team display name, whatever your display name is.
So, if your display name has more than 2,000 participants, more than 2,000 team members, that’s where we can start caring about your numbers for these. So, this is our original query and ZoomIt has once again gone to sleep. So, if we run this, we’ll get some results back.
And let’s pretend that after running this query, we have sort of a practical problem. We have these nine rows, right? The nine rows is sort of important to the problem we’re going to solve. But the thing that our teams are mad about is the fact that their reputations, their average reputation, is being brought low by teammates with low reputation.
People with a reputation of, let’s say, one. Let’s say these low reputation users are really screwing things up. So, what we might be tempted to do is apply a where clause additionally to this query, right?
We’re still going to, we still need to figure out who has a total team size of greater than or equal to 2,000. But now we’re just going to say, well, we’d only, we’re going to screen out all the people with a reputation of one. And hopefully, this will make our, this will bring up the averages for these teams, right?
Because we want to make sure that these teams still have good averages. The problem we run into is that this significantly cuts down on the results. We only have five rows now.
We actually, we actually destroyed four teams by getting rid of, by filtering out reputation equals one users. These are the original results. So, we’re missing Matt, we’re missing Mark.
I don’t know. This is like weird, like apostle names or something. But like, you know, the averages did go up for these, right? Like these score, these average reputations are higher, but we no longer have this full list of teams.
And that’s not a very good sign, is it? We have done something wrong with our query. And we need to figure out a way to run this query and get the same nine rows back.
But without these reputations suffering, right? Like these reputations, sorry, these reputations suffering. These reputations have already suffered, right?
And granted, like the average downvotes and upvotes are probably a bit different too. But the total users is different because we have screened out all these reputation equals one people. So, the where cloud is the wrong way to do this.
Now, there’s one way that you can write this query to give you the, to preserve the nine rows, but still not consider the low reputation equals one people, is to write a conditional aggregate. Now, if we, I’m going to zoom in on this.
And I have, I have, I have formatted this in a way to make it hopefully very clear what we’re doing here. Like this average downvotes is very easy to figure out. It’s just an average on downvotes.
So, I’ve, I’ve spread this out a bit to make it very clear what we’re doing. And what we have in here is a case expression. Case expressions are wonderful things, right?
You can do all sorts of fun stuff with them. And one thing that you can do with them is stick them in an aggregate like this. So, what we’re going to say is if your reputation equals one, then we are going to, we’re going to substitute that value with a null, right?
So, average isn’t a, null isn’t a value. So, you can’t average nulls, right? It’s not a, not an averageable thing. And if your rep, so that’s that.
And then our bailout on the case expression is if your reputation is not equal to one, or rather higher than one, right? Because there’s no, no reputation lower than one in the Stack Overflow database. One is as low as you can go.
There’s not a check constraint on that in the 2013 copy. But in real life, your reputation can’t go below one, right? Doesn’t matter, doesn’t matter how bad you screw up, your reputation doesn’t go below one.
I guess otherwise it would be like, almost like the opposite of the high reputation contest, sort of the Stack Overflow thing, sort of gamifies for people answering questions and stuff. You’re able to say, I want to see how low I can get this.
I don’t know how many bad answers. Like, actually like rewarding bad behavior if they did that. So, if we run this query now, we will get, so like, so just to like make sure that we’re on the same page here. This query had the where clause where reputation is greater than one in it.
I’ve taken that out of this one, right? We have no where clause here. All we have in here is the having clause. So, say, we’re grouped by as, sorry, we’re count big as greater than or equal to 200, greater than or equal to 2,000 after we group by display name, right?
And if we run this, we will get our same nine rows back, but our reputation, right? Our reputation will be okay, right? So, all of the numbers, all of the other numbers are going to match up.
There’s not a great way to sync this up, but you can see that the numbers up here and the numbers in the upvotes and downvotes columns, these all match up nicely, right? These are all the same, but the numbers in our reputation column, our average reputation column, these are all, these have gone up.
Now, the order is a little bit different here. The reason why the order is a little bit different here is because once we’ve screened out the reputation equals one people, some of the averages have sort of changed places a little bit, right?
Like, there’s a little bit of difference in the order of these, where if we compare one to the other, Matt is still on top, Alex is still second, David is still third, Mark is still fourth, but Chris and Michael have switched places, and I think actually that’s about it there.
Yeah. So, this, like, that did change the scores a little bit, because the average reputations switched around because of, like, some display names had way, had more reputation equals one people that got, where the reputation got replaced with null, so that the scores, like, the order changed a little bit, so someone might be mad in there, someone might want to, I don’t know, start kicking some reputation one people off their team, maybe.
But this is one way to get around the problems you might run into when you are trying to sort of manipulate data that is being grouped, and stuff like the where clause and the having clause is not an appropriate way to filter things out. Sometimes you do need to filter out rows with the where clause, and sometimes you do need to figure out, filter out rows from the group table with the having clause.
Because other times, you can apply a little bit of filtering inside of your aggregates in order to change what actually gets aggregated in there. So, that’s about it for this one.
In the actual course material, there is, of course, a lot more other stuff that goes along with this lesson, but this is just sort of one of the more practical examples of what the things that I’m going to be teaching you. If you like this, again, you can buy it down there.
The videos will, the actual full-length course videos will start dropping soon. They’re out in tech review right now, kind of getting spruced up and fixed up. And, you know, I am like rehearsing them and stuff to make sure that my delivery is sound and solid for when I give these to you.
But they will be dropping very shortly. And I hope you actually buy them and watch them and like them and all that stuff. So, thank you for watching.
I hope you enjoyed yourselves. I hope you learned something. And I will see you in the next video where we will talk about some more of the course material. So, anyway, that’s about it for me. Goodbye.
Goodbye. Goodbye.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 25% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.