Row Goals: Part 2

Video Summary

In this video, I delve into the intricacies of execution plans in SQL Server, particularly focusing on how row goals affect query performance. I explore various scenarios where row goals are introduced and how they influence the query plan, such as when using `TOP`, `EXISTS`, and `IN` clauses. Through detailed examples, I demonstrate how these elements can lead to unexpected behaviors and highlight the importance of understanding their impact on query optimization. Additionally, I discuss the nuances between setting a row goal versus a row limit, illustrating practical cases where large numbers or specific conditions can significantly alter the execution plan, offering insights that can help optimize SQL queries more effectively.

Full Transcript

All right, Erik Darling here with Darling Data. And we are going to, this is part two of the Rogals playlist. Two of what? I don’t know yet. Hopefully four. I think I’m aiming for four. And if you have stumbled upon this and you’re hearing that it’s the second one and you are taken by surprise, well, there’s a playlist of these that you can find the other videos in. But right now, we’re going to do this. So, most people in the world, at least SQL Server people in the world, right? Because we have very nice graphic execution plans to look at. Not everyone is so lucky. Some people have text. And some people just have like weird tree looking things. But we’re lucky and we read execution plans kind of like starting over. Well, that’s, I guess this is mirror image from the right to the left, which is fine because that’s the way data flows. But when a query plan executes, it’s really, this is something that Paul White has talked about in a few blog posts and in a few videos where, you know, you have stuff that, sorry, my mirror image, stuff that opens up and starts executing over here and then sort of asks for rows down as it comes along. So, that’s part of why things like Rogals and operators like Top kind of work in the way that they do is because they are just continuously asking for rows until the, top specification is met or you run out of rows to produce. So, I have this table called high rep users. There’s only about 600 rows in it. But I want to show you sort of three different execution plans. They’re all two of them look pretty close, but it’s interesting to note this stuff. So, one is just selecting the top 10 rows with no particular order. Now, what we’re going to observe here is that rows will get returned in clustered index order.

It’s not a guaranteed order. This is only an observed behavior for this query is the way it’s running. So, keep that in mind. Then we have another query that’s asking for the top 10, but with the rows ordered by ID descending. Okay. Now, there’s another, the third query we’re going to run is the top 10, but with, but this is going to be ordered first by reputation descending. We don’t have any indexes on the reputation column that are going to help have, they’re going to put this in order. So, we’re going to have to physically sort the data. This will all make more sense when we look at the execution plans. So, let’s run these three. And what we’re going to see is, of course, three execution plans. If SSMS will be so kind as to let me get a little more screen real estate here.

So, the first query just starts asking for rows and these, these rows could be returned in any order, right? Cause we didn’t specify an order by. So, if we use maybe a different index than the clustered index, then we might get rows in a different order for this, right? But for this one, notice that the 10 rows, just SQL Server reads 10 rows from here, right? And we get those back in whatever order the thread happened to find them. And it’s a serial plan. So, it’s most likely just going to be clustered index order. Again, observed behavior, not guaranteed. The second query ordered by the top 10 with ID descending is we still don’t have to sort this data because we can still read this in clustered index order just backwards. And we just go and get 10 rows from here, right? So, this is just where we get all 10 rows from.

The last query is a little bit different. So, where this one is different is that the other two just had top, right? Just a top operator. This one has a top end sort. And the reason why this query is different is because we actually have to read all 613 rows from the table into the sort operator. And then we just get 10 rows from the top end sort as we get the data from here. We sort it and then we just get the top 10 rows in the order that we asked for it in, which is reputation descending and then ID descending.

So, while row goals do trickle down through the plan, sometimes you’ll end up with row goals that don’t quite make it down as far as other times, right? It all sort of depends on how your queries are and what your indexes look like. Now, because exists are something that may introduce a row goal, one thing that I see a lot of people do is act absolutely ridiculous when they write exists subqueries.

So, when we run this query, right, we’re going to say select star from high rep users where we have some where one of those users has a badge, which of course is going to be all of them because it’s impossible to get a high reputation without getting a badge, I think. Oh, wait, no, I lied. It’s only 609 rows. So, four people have a high reputation and no badge. All right. Maybe I’m wrong about this stuff sometimes. But if we look at this execution plan, it is just, it looks just like this, right?

We scan the clustered index of high rep users and then we have a left semi-join. In this case, the left semi-join does introduce a row goal. All right. I’m holding back a sneeze because I’m not wasting this five minutes.

And if we look at the properties of the index seek on the badges table, we’ll have this line right here, which tells us that a row goal was introduced, right? Estimate, estimate, estimate, not estimated, estimate rolls without row goal. Okay. Not estimated. Okay. Just estimate. Okay. Whatever.

But this shows us that a row goal was introduced, which happens sometimes with exists because a semi-join only needs to find one row. So, what I want to show you next is something that I see people do quite a bit with exists subqueries, which is a lot of useless typing, right? They will throw a distinct, a top one, they’ll group by, they’ll order by, they’ll do all this stuff.

Now, if we were to, if we run this query on its own, I just want to show you the execution plan for this real quick. If we run this without that outer correlation, we will get sort of a predictable execution plan, right? We have a top because we had a top. We have, well, I mean, we did a distinct and group by like fools, but, you know, okay.

We’re not going to distinct and group by, but, you know, stupid. But, you know, you at least have an aggregate here that shows we did some aggregating of things. If we run the full query, right, if we take out, if we, rather, we put the correlation back in and we run this, what we’ll see in the query plan is that SQL Server completely ignored all of the stuff that we did around the correlation, right?

The distinct, the top one, the group by, the order by. The execution plan doesn’t change one bit. We just go back to having a left semi-join.

There’s no aggregate, there’s no top, no nothing. Why? Because SQL Server is already sort of putting a top in there with the row goal. So, just a couple notes on some other T-SQL things. There are, there is syntax that is logically equivalent to exists, which is, which is like the in clause, okay?

We can say select from high rep users where ID is in, select user ID from badges. And if we do this, we will get the same semi-join query plan, right? The exact same thing that we got before.

I mean, obviously it’s not like, like exact exact because it’s a different query. Like other things will be different, like hashes and handles and things, but it’s still the exact same query plan shape and operators. Another way of writing a query with like this is to say where ID equals any.

Okay. That’s a funny one. Right? Bet you didn’t know you could do that. But any equals, uh, is sort of like doing in or a left semi-join, right? Right.

You could say equals any, and you could get this plan back, right? Which is the same thing with the left semi-join. You could also write that as high rep users equals some, which is equivalent to any, right? So any and some, uh, equivalent in this case.

And this will give you the same left semi-join plan. So there are all sorts of funny ways to get this syntax. We don’t have in, in T-SQL a way to say like, uh, like, uh, inner or outer semi-join or something like that. There’s no way to, um, give like, uh, request a semi-join, but there are ways that you can sort of, uh, influence the optimizer to give you the semi-join behavior.

There’s also another funny way, uh, to write queries, but this is not the, uh, this is not the equivalent. This will just get a very strange, well, not strange execution plan, but we won’t get any rows back because not all, all rows match. Don’t, not all rows match across all of these.

So this is it. Um, not in is of course not equivalent to not exists. Um, I’ve written and recorded quite a bit about that. So I’m not going to go into detail on that here.

Good Lord. Just know that I’m dying and my dying wishes for you to get to the end of this video. Uh, so I’m not going to talk about not in here, but, um, uh, we’re going to, uh, just point you to a handy little link that you can go to. All right.

Anyway, um, top 100% is absolutely useless in views. Um, if you see people do this, uh, yell at them. Um, if we were on this query and just the, just like from like without, not from the view, but you’ll see that we, we like SQL, there’s no top and there’s no top end sort. We do sort the data because we asked for data in a particular order that we do not have an index to support, but there is no top SQL Server ignores the top 100%.

So if we create a view that says select top 100% stuff, and then we select from that view, uh, you’ll note the execution plan has no top, no sort, no nothing. Right. So this obviously the optimizer does throws us away and it, but it does this to be helpful, right?

There’s no top operator. There’s no sort. So the optimizer does this to clue people in that they are potentially being dumb or potentially they are trying to rely on behavior that is not guaranteed. So we should all give the optimizer a round of applause for correcting our terrible, uh, I don’t know.

What is, what would you call that? Malfeasance? Right. Maybe correcting our terrible query writing.

I don’t know. Something like that. Anyway, there are other ways that row goals can sneak into your queries. Um, one interesting way, at least I think it’s interesting, but I think lots of things about this stuff are interesting that maybe other people don’t. Is, uh, when you have, when you’re, uh, filtering on row number.

So if we say where the, the N up here, N equals row number, right? Uh, if we say where N is between one and 10, SQL Server will, uh, do something that we’ve actually kind of saw before in one of the other queries. Where, uh, we have a top, right?

And this top starts asking for 10 rows, but before it can, before it can ask for 10 rows, those rows have to get sorted. Right. So we get all 613 rows from high rep users, but we don’t, we don’t reduce that to the, the row goal of 10 until we’ve sorted them. Right.

And then the, of course, the two operators in here are the row mode representation of generating the row number, the sequence project in the segment. Uh, but if we were to change this to say 20, we would see that row, we would see the, the, the, uh, 20 in here instead of a 10 in here. Right.

We would see this creep up and say, oh, now we need 20 rows from the sort. So row goals can sneak in, in all sorts of ways that you may not expect. There are other things that will also set row goals, but this, there’s a difference, right? Because we talked about it in the first video between setting a row goal and setting a row limit.

Right. So, uh, a query like this, where we say, uh, select whatever from table offset zero rows, fetch next one row only. This will set a row goal of one, but also only return one row.

Right. We’ll get Chris Jester Young here. We can also set a row goal with, with a fast end hint. Of course, you can put whatever number you want up to, I think the big int max in here, but that would, I don’t know.

I don’t know at what point that would become counterproductive. But when we say option fast one, we’re still setting a row goal. We’re just not setting a row limit.

Okay. So fast, like the fast hints will return all the rows, but with a query plan where the row goal is set to, uh, whatever number you put in there. Now, uh, the top operator, of course, this goes for offset fetch and fast two, uh, accepts, accepts up to the big int maximum.

There are a couple of neat shortcuts you can take. If you don’t remember the numbers involved, uh, using the power function. So, uh, two dot, the dot is very important.

If you forget the dot, you will get an error. But, uh, this one is two to the power of 31 minus one. And the big int max is two to the power of 63 minus one. But this returns those, those maximum numbers.

Sometimes if I’m, uh, feeling very ambitious, I will set up a view that has, uh, various SQL Server data types in their full ranges. I’m not like, like the, not like 2 billion rows full range, but like the min and the max for them. Because it’s, it’s a very handy reference for me when I’m trying to figure out what I, what I want and what I want to do.

Um, I don’t know if you ever do things like that. But, uh, getting back to rule goals a little bit. Uh, we can put big numbers into tops.

And those big numbers can, uh, again, sort of influence the optimizer in various ways. If I run this query where I’m asking for the top nine quintillion rows from high rep users where the display name equals John Skeet, we get this query plan where we seek into an index on high rep users that happens to be on the display name column. We do a key lookup for John Skeet.

And we still, but we still have a top, right? When we said top 100%, SQL Server said, well, I know better than that. But when we say top very big number, SQL Server does not know better than that. And SQL Server has to figure out a way to get that number of rows back to us.

So fun stuff there. What’s, what’s, what’s, what’s interesting too is that the optimizer and like sort of what, what might start cluing you in to what, how SQL Server kind of fences things around stuff like top. Is if we run a query like this, where we’ve sort of nested the top away under here.

And we have the display name filter out here that when we run this, we’ll get a different execution plan back. We still have a top, right? And we still have like SQL Server still gets the top nine quintillion rows from the 613 row table, but it has to do this first.

And then it can filter stuff out. This is like, this is more like logical correctness stuff though, right? Because asking for the top nine quintillion users and then figuring out if the top, anyone in the top nine quintillion users has a display name, John Skeet, or just got, is different from asking for the top nine quintillion John Skeets, right?

There’s two logically different things. One thing that I see quite a bit and often bothers me is people using non-deterministic ordering and not really caring that what they’re doing is potentially going to introduce incorrect result bugs. Okay.

So if we run this, that’s like the top five from score. And again, like remember what we were talking about in some of the previous demos in this video where I said like we’re most likely going to return this data in clustered index order. We had a serial plan.

You know, SQL Server was just asking for the top 10 rows. So we could observe that as long as we use the clustered index and as long as we had a serial plan, SQL Server would return data in the clustered index order, right? Again, observe, not guaranteed.

If we run this query, which is selecting the top five rows from the comments table ordered by score. And we look at the IDs that get returned. Like notice a score here is all zeros, right?

So there are duplicates in here. We get these five rows back nine, 11, whatever, 855, 19, blah, blah, blah. If we run this a few times, we’re going to see different rows showing up in here because we have a parallel execution plan. And when top, when there’s a parallel execution plan with top like this, SQL Server is just sending out the dot threads and whatever threads come back first or the top five rows that we get back.

Right? Right? So obviously, depending on timing and reads and all the other stuff, we might get back five different rows every single time.

If I run this query again and we look at the results, we can see that even the order of IDs have changed and which IDs we get back. This all changes every single time. The only way to make sure that ordering is deterministic is to have some sort of unique column in there to break ties within the column that has the duplicates.

There we go. All right. We’re smart people.

We can do this stuff. So if we run this query a few times, we will see that we get back this data in the same order each and every time. Right? So this is deterministic ordering. The one we did before is not deterministic ordering.

All right. So we have reached nearly the 20-minute mark. This is a good place to pause because I just finished a demo. And now we have another demo that we can do right after this. This is fantastic.

Right? And we can actually start looking at reshaping execution plans with row goals a little bit now that we’ve established some bare minimum fundamentals. All right. So thank you for watching.

I hope you enjoyed yourselves. I hope you learned something. And I will see you in the next video, which will hopefully be three of four. But we’re just going to have to see how things go. All right. Thank you for watching.

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

Row Goals: Part 2

Video Summary

Full Transcript

Going Further

Related Posts