How To Write SQL Server Queries Correctly: Joins
Thanks for watching!
Video Summary
In this video, I dive into the world of joins in SQL Server, explaining how they union data horizontally and how unions join data vertically. This is a fundamental concept that every database query must understand, especially when dealing with complex relationships between tables. I also discuss different types of joins, such as inner joins and left joins, and provide practical examples to illustrate their usage. Whether you’re looking for high-quality SQL Server performance tuning or simply want to improve your query writing skills, this video is packed with valuable insights that will help you write more efficient and logical queries. If you find the content helpful and would like to support my channel, consider subscribing or making a small donation. Your support helps me continue creating these tutorials and offering resources for SQL Server enthusiasts and professionals alike.
Full Transcript
Erik Darling here with Darling Data, and we’re kicking off a, I want to say at this point, this is going to be a 15 video series on how to write queries correctly. This was a bunch of blog posts that I had put together and written, typed and typed and typed, and they actually got a fair amount of comments, which I was happy about because normally when I write things, it’s like crickets. Not a lot of commentary on those things. I don’t know why. It’s okay. Sometimes reading is hard. Perhaps my written self is not quite as persuasive or not quite as engaging as my video self. It’s entirely possible that I just have an enormous physicality, a presence, an unstoppable presence that just does not come across in written form, and that’s okay. Video just might be the better vehicle for me. And that’s all right. You live and you learn. But we’re going to start this off by talking about joins. There are all sorts of joins. There are all sorts of important things to think about and know about when it comes to writing queries that are joining data together. And that’s what we’re going to talk about today. But before we do, I need to convince you to give me some money. If you like this channel and the content provided therein, you can subscribe.
You can join 30 other people. I think over 30 other people. The last time I hit refresh on YouTube, which I do every 5 or 10 minutes. There are over 30 people who have decided that I am worth giving at least $4 a month to. So that’s nice. I appreciate you 30 some odd people who do that, who made that choice, who made that sacrifice upon the altar of darling data. But if you just don’t have $4 a month, you can like, you can comment, you can subscribe, and you can make a totally different set of numbers go up.
Which I would be, of course, very happy if you did. If you need help with SQL Server, if the kind of things, the performance tuning stuff, the topics that I cover in this channel are of interest to you in a way that you think I could come and look at your SQL Servers and help you with these things. Well, I am the absolute best SQL Server consultant in the world.
And, of course, my rates are reasonable. If you would like some very high quality, very low cost training, you can get over 24 hours of my performance tuning stuff, my thoughts, my knowledge, my beautiful voice, for about $150 USD. There are links for everything that I talk about in the video descriptions, the membership, the training.
But if you feel like just typing on your own, you can go to that URL and put in that discount code and you can get the whole kit and caboodle. Again, it’s the end of the year. I don’t have any events upcoming. In the new year, we can discuss these sorts of plans.
Right now, I am far too focused on festivities to travel. Well, sorry, to travel for work. I will do all sorts of other travel.
Trust me on that. But with that out of the way, let us begin our journey into joins. Let’s get over to SQL Server Management Studio. I believe I’ve already run this thing, so we’re good there.
Now, if we were to define joins in SQL Server, we could say that joins union data horizontally. That’s sideways, right? You have a table and you put it next to another table and you match the rows or whatever you’re doing.
So that’s one way of thinking about it. The inverse of that is union queries where unions join data vertically. So joins union data horizontally, unions join data vertically.
Kind of a funny way to think about it. But there’s probably one of the most basic things you can do with a database query, aside from selecting data from one table.
If you have the kind of database where you only have one gigantic, very wide, very long table full of data, I will reiterate for all of you, my rates are reasonable. If you’re having problems with that sort of thing.
The important thing to remember about joins is that they will respect one-to-many or many-to-many relationships, meaning that they will duplicate data, right? They don’t duplicate data.
But if you have like one row and one table that matches three rows and another table, or like four rows and they all have like the value two in the join key, and they match eight rows in the other table, all those things will join together and make a set, right?
So joins do respect that. We’re going to talk about some alternative syntax. I don’t mean using distinct, of course, that will join tables together without needing to deal with duplicates.
But that’s for another video. That’s, oh, that’s, I don’t know, that’s video like two or three or something. So you always have something to look forward to.
But one thing to understand is that if you don’t need to show data from another table, a lot of people just use joins to filter data from one table, like from a result set. Say something like, well, if the rows are here but not here, we don’t want to show the rows that aren’t here, right?
It’s almost like having a where clause on your query. But here’s an example of when a join is necessary. Let’s say that we want to get all the users in our users table with a reputation over 500,000, and we want to sum up all of their scores from the post table and figure out what kind of post those points were awarded to.
We could write a query that looks like this, where we’re selecting data from the users table. We’re joining that to the post table. We’re asking only for users with a reputation over 500,000.
And of course, because we have an aggregate in here, we need to group by stuff down here. Now, I do want to talk for a quick moment about what we’re doing with the score column, right? Because in Stack Overflow, scores can be negative numbers and many scores might be zero.
For some queries that you write, it would make a lot of sense to get rid of things that you maybe don’t want to include in there. Like negative numbers, you actually do probably want to include those because they do take away from a user’s total score. But for some types of reports, you may not want to do that.
For some types of reports, you may only want to include positive numbers. You may not want to take into account anything that might detract from the total score, which would be like either one could be sensible depending on what you want to show.
The other thing that is worth bringing up is that because anything with the score of like absolutely zero will not add anything to the totals, it could make sense to filter out rows where score is, I mean, this says greater than zero, but we’d probably want to say not equal to zero, right?
Because for the sake of this query, we want to include negative numbers. But for some reports, we might not want to. So actually, let’s just say we don’t want to.
It might make sense to do this for some cases because guess what? If we start screening out rows where the score is greater than zero or not equal to zero or greater than zero, we will have fewer rows in total to join together.
Remember, the where clause generally will like just filter out rows before the joins happen. So you could end up joining fewer rows and that could make your joins more efficient. We’re not doing that here.
I’m not going to go through all that because, well, I am primarily a SQL Server performance tuning consultant. Not every, and there are going to be some tidbits of performance tuning wit and wisdom throughout these videos. Most of this is just making sure that you can write your queries logically correctly.
So that’s what we’re doing here. That is the purpose of our presence here today. So this will give us some results that look like this.
And we will see the majority of the post types are answers. And we’ll see the total score for all these answers in here. And then we’ll get down a little bit further and we’ll finally have some, you know, results for questions.
And I think there may be some other. Yeah, there we go. There are some other things. And here’s where things kind of get interesting.
There’s a total score of zero for a lot of these. Right? So we might want to start filtering out other things. Like, like for like, so like if we were to filter out anything with a score of zero, we would have, we would not have to deal with a lot of the other stuff where we just had a bunch of zeros in there. Right?
Because we don’t, do we really care about that? Not particularly, to be honest with you. So that’s inner joins. Right? That is the, probably the simplest possible type of join you can write. Now, let’s say that we wanted to generate a report of people whose reputation is at, for sitting at one, which is the site minimum.
Right? Stack Overflow protects you from having a reputation in the negatives. If they did allow that, there would probably be, you know, quite a few sock puppet accounts out there that were really, really trying hard to have the lowest negative score.
Like, like, how can I, how can I be like the upside down John Skeet? Actually, I’m pretty sure someone would use, would, would take advantage of the Unicode usernames and Stack Overflow. Write John Skeet in upside down font and try to have negative, like a million whatever reputation.
Pretty sure someone would do that. We could write a query like this. Now, this is where we have to use a left join.
Right? Where we’ve changed from a left join to an inner join because there are going to be a bunch of people with a reputation of one who have not made any posts. There are people who probably joined the site, thought about asking a question because, or answering a question, but never actually did it.
So we can find all those people. But we can also find other types of people. Now, there are a bunch of reasons why an account with a high reputation, or sorry, with a, with a high scores in the post table might in the user’s table have a reputation of one.
Could be like, you know, they could, their account could be suspended or something. You know, they could have been downvoted into oblivion despite making a bunch of posts. So what we can do is look at what users within those situations by using a left join.
Now, to show you sort of the outlier stuff from that, I’ve ordered by total score descending. So here’s a bunch of people, including this lovely person, your common sense, who has a reputation of one, but a total score of 9,500. And, you know, some other stuff in there.
What was this? This one’s supposed to be total. Oh, yeah. So another thing that’s kind of interesting about when you start joining tables together is you might want to count things in the other table. When you’re looking at left, when you’re doing left joins, sometimes you do need to think about whether you want to get a count or a count distinct.
Right. So if we were counting, like this one getting the total count is just getting owner user ID. Right.
So we don’t need to worry about this being distinct because we’re using a non-nullable column in the table. Most often for something like this, you would want to use the table that you’re left joining to as primary key. In that case, that would be the p.id column.
It’s not shown here because it doesn’t help us in any way. But you can totally use another non-nullable column to do that here. You don’t want to count the nulls.
You want to count the actual values. So that’s where this sort of thing, that’s where this sort of logical stuff starts to get important. But you look over in the results. Like when we count up like the total post type IDs, you get a much different number from the distinct post type IDs.
Most people are only going to do something with a question or something with an answer. There are only two post type IDs for that. But just notice that when you count up, when you tally up the number of non-distinct values, you get the same answer here.
But for distinct values, you only get that there. So just something to consider when you’re writing these queries is when you’re getting things like counts and sometimes, well, I mean, sometimes other things like averages or sums or whatever, you do need to consider whether you want to get a distinct count or just a total count in there.
And when you do want to get a total count from a table you’re left joining to, most of the time you’re going to want, as long as that table has a primary key, you’re probably going to just want to get a count of the primary key. If not, you can use whatever non-nullable column from the left join to table you want.
Like you generally don’t want to include nulls and you don’t want to include like fake nulls where like, you know, a column might have like a million fake, a million real nulls in it. But then like you get up with a million more like sort of fake nulls from the left join.
So just some stuff to think about with that. Now, before we start talking about the overall logic, actually, we’ve already covered that. So the results that we get back find all sorts of interesting things that like we kind of talked about users who are very active, but then had their accounts suspended or deactivated or something.
Users who have posted but were heavily downvoted and users who haven’t posted at all. So like if we scroll down far enough, I think we have some negative numbers like way down at the bottom here. If the scroll bar will work.
For some reason, scroll bars have like stopped working for me. Where, yeah, there are some negatives in there. So there are some negative numbers where people have actually made posts but then got downvoted to nothing, right? So that’s there.
Now, we do need to make a small digression to talk about logical query processing. There are far smarter people who cover this stuff. Itzik is one of them.
But this is like my, like the reason why I want to talk about this is because a lot of people, when they write left joins, they do their filtering in a dumb way or they do their filtering in an incorrect way. It’s sort of hard to figure out all of this stuff when you’re just learning.
But here’s the deal. When you write queries, this is the order that you write things in. Select from, joins, where, group by, having, order by, like all the stuff that you can put like, you know, in here.
Oops. Any expressions like column one plus column two, aggregates like some count aliases, like, you know, let’s say you wanted to rename a column. Distinct and top.
Of course, when you’re writing your from clause, you can also write, you also put your joins in here and all your on conditions. Your where clause, like, I’m not a fan of like the old anti joins where it’s like from table one, comma, table two, and then put the on clause, like the basically put your on clause in the where clause. I don’t like that.
It just rattles my brain too much. Then you use your where clause to restrict rows, right? And this is kind of where we can stop here. Now, when SQL Server, when SQL, when, not just SQL Server, when any database runs queries, like the order that queries kind of get processed in is a bit different, right? You have your from clause and then your where clause and the, I don’t know, Zoom, it just exploded on me.
And then after everything you have like select and order by. So like the, like the, basically just the, the order of queries get processed, processed in is different from the order you write them in. But the important part here is for you to understand when you’re writing queries with left joins, where to do restrictions.
Now, if we, if we write, if we write the query this way and we say where a score is not equal to zero and the where clause that actually restricts, restricts things from the entire result set. And what, one thing that we’re doing is we’re getting rid of like, this would also filter out any nulls, right? So what this, what SQL Server actually does is internally, when you have a left join and then you have something in the where clause for that left join, that gets rid of values, including nulls.
SQL Server actually turns that into an inner join logically, right? Because that left join is, that left join is no longer useful. SQL Server says that’s not a real left join because we would be getting, we would be eliminating nulls from the, from the results.
So SQL Server actually just internally changes that to a left join. So if you wanted to filter out rows where score equals zero from the table that you’re left joining to, you would actually want to put that condition in the join clause, right? So not in the where clause, because the where clause will restrict it from the entire result set.
This will restrict it just from the table that you’re joining to. So for this one, SQL Server would still respect the semantics of a left join and give you, give you the results that we had before where we were, you know, way down the bottom. We show people who are users, but haven’t done anything.
And then over in the query plan, you can see that this remains an outer join. But of course, SQL Server rewrote our query and said, or rewrote our join order and then did a right outer join instead. So SQL Server flipped it on us.
Nutty SQL Server. So if you want to preserve results from the outer table, like, and keep like any non-matching null rows, but still get rid of some things from that, put your, put the things you want to get rid of in the join clause, not the where clause, right? There are some restrictions around, right, right joins.
I, I realize that not many people will actually ever intentionally write a right join. But one thing there’s to be aware of is that with some right joins, some right join plans can not happen. Specifically like nested loops with specific join orders.
So if we look at the query plan for this, SQL Server is perfectly capable of doing a left join and putting the tables in this order. But if we were to try to do this and say right join instead of left join, SQL Server can’t even come up, can’t even come up with a query plan for this one. All right.
So they get an error. SQL Server says, nope, not with those hints. You have messed us up too badly. We cannot do anything. That of course, if we said hash join, oops, not ha, ha, ha, join, but SQL Server can make a hash join plan out of that, right? It’s not a nested loops plan.
So just something to be aware of. You know, probably not the most important tidbit of information that you’ll come across in this video, but something that you should be aware of when you’re writing queries. Now, that brings us, so we’ve covered inner joins.
We’ve covered left joins. We covered the little thing that you need to know about right joins. It is worth talking about full joins just briefly. I’ve never found too, too many great uses for full joins except for doing things like writing exception reports.
So those queries tend to look something like this, where we’re taking everything from users and everything from posts, and we’re full joining on this condition. But the only thing that we care about, like we’re going to sum up any time where null, where we have, sorry, where we have a null in the users table, we’re going to say that’s a post that doesn’t have a user attached to them. And then every time we have something where the, again, these are the primary keys, the clustered primary keys for both of these tables.
So any time when these are null, we’re going to say either that’s a post without a user when u.id is null, and then users without a post when p.id is null. And if we run this, we will get a sum of users who, or posts without a user and users without a post. I’ve never, again, I’ve never had a tremendous amount of use for full joins, but they are kind of good for some things.
Cross joins are another thing that I don’t see things, or that I don’t have a tremendous amount of use for. I really don’t, what I really dislike about cross joins is, especially when very large tables get involved, the only type of, the only physical type of join that you can do a cross join with is nested loops.
And you can end up with real bad performance problems if you start cross joining big tables, right? It doesn’t have to be two big tables. You can do, you can cross join a small table to a big table and still run into nasty perf problems because of the nested loop nature of cross joins, right?
The only way they can be physically implemented is with nested loops. So, but where I see cross joins get used a lot is with stuff like permissions, where you need to like take a table of users and cross that, cross join that to a table of permissions and then figure out what users have with what permissions, right?
You need to blow the whole thing up and like assign every user every permission and then figure out which ones they actually have. So if you, let’s say we wanted to do something a little bit more fun and say that we have a table of scotches and a table of, a table of glass sizes and you want to show someone every possible example of scotch and glass size and then like with prices.
So you would have, you would have like, you know, your Lagavulin, Laphroaig, Ardbeg, Bunnehaben, Cow Isla, like all the good Isla scotches. And you would want to have like, like one ounce, two ounce, four ounce, six ounce, eight ounce, 24 ounce glass of scotch.
And then you would do some math with like the price per ounce and say, you know, whatever, like however many ounces of, you know, Bunnehaben 25 times 24 would be like a $30,000 glass of scotch or something, right?
So you would do something like that. Where I end up using cross joins most frequently is with DMV queries. Specifically, I use this in SP pressure detector to do some math on the timestamp column that you get from sys.osinfo, I think.
Or sorry, from, well, yeah, it comes from sys.osinfo. But you need to do some math to figure out when things happen in there. So we have a table. I’m just going to show you the top 10 rows from the ring buffer table.
But then we have literally one row in here. Now, for convenience, because I need to do, I need to convert this to XML and then work with the XML, I’m actually cross joining to the larger table.
It would probably be a lot more common of a solution to cross join to the single row table. Like just logically, a lot of people would say cross join this single row thing. I don’t know.
This is just me writing the query the way I have it. Because I have to do the convert to XML in here and then do some XML parsing out here, I’ve just chosen to use this as the inner thing where I can have the convert in here. But what you end up with is a convenient way to do this date math in here for every row and come up with a sample time for that.
So just a little bit of information there about inner joins, left and right outer joins, full joins, and cross joins. Again, full joins and cross joins, I don’t come up with too many uses for. But inner joins, left joins, all the time have to do stuff with those.
Be very careful with how you set up filtering for left joins. Because putting stuff in the where clause is a lot different from putting stuff in the on clause. And remember, let’s see, other stuff that we covered in here.
Joins will respect one-to-many and many-to-many relationships and show you data for all of the things that meet your join criteria. So what we’re going to talk about, I think, next, I think next, I think number two is exists and not exists. We’re going to talk about that next because those are somewhat different from joins.
And there are some important things that you need to know about them and when to use them and why to use them. And so that will be our next video. This ended up being a little bit longer than I intended.
I’m not sure exactly where I got carried away, but oh well. You got 25 free minutes out of me, I guess. Not exactly 900 number prices, but, you know, whatever.
Anyway, thank you for watching. Hope you enjoyed yourselves. I hope you learned something. And I hope that you will continue to watch these videos to learn more about how to write queries correctly. Thank you.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 25% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.