Top 1 vs. Row Number in SQL Server

Cruel, Cruel Number


One is the loneliest number. Sometimes it’s also the hardest number of rows to get, depending on how you do it.

In this video, I’ll show you how a TOP 1 query can perform very differently from a query where you generate row numbers and look for the first one.

Thanks for watching!

Video Summary

In this video, I delve into an interesting performance discrepancy between two queries that produce the same results but exhibit vastly different execution times. The primary query uses a `CROSS APPLY` with `TOP 1`, which surprisingly took over a minute to return just 101 rows, despite minimal logical reads on the involved tables. By examining the execution plan and statistics, I highlight how an index spool operator was created behind the scenes by SQL Server, significantly impacting performance due to its single-threaded nature even in a parallel query context. To contrast this, I demonstrate a slightly modified version of the query that uses `ROW_NUMBER` instead, achieving a much faster execution time with similar logical reads but vastly reduced CPU and elapsed times. This comparison underscores how simple query rewrites can have substantial performance benefits.

Full Transcript

Howdy folks, Erik Darling here with Erik Darling Data because Brent is lazy. And I was kind of enjoying my Saturday afternoon and writing some blog posts when I came across what I thought was an interesting difference between two queries that are written slightly differently, give you the same results, but have quite different performance. So I wanted to talk about that with you, and unfortunately to talk about that with you I have to put this drink down to operate the computer. But that’s okay, because it’s a quick video and hopefully no one will notice. Now, I have this query here, and the whole point of the query is to get the top 100 users by reputation and their most recent badge. And to do that I’m using cross apply with the top one over to the badges table. And if you look down in the corner you can probably see that this query ran for a little over a minute to return those 101 rows. That’s a pretty long time for not a lot of data.
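The query isn’t shown in the transcript, but the pattern described is something like this sketch against the Stack Overflow sample database (table and column names are my reconstruction from the description, not copied from the video):

```sql
/* Sketch of the slow pattern: TOP (1) inside CROSS APPLY.
   Names assume the Stack Overflow sample database. */
SELECT TOP (100)
       u.Id,
       u.DisplayName,
       u.Reputation,
       ca.Name,
       ca.Date
FROM dbo.Users AS u
CROSS APPLY
(
    /* For each user, fetch only their most recent badge */
    SELECT TOP (1)
           b.Name,
           b.Date
    FROM dbo.Badges AS b
    WHERE b.UserId = u.Id
    ORDER BY b.Date DESC
) AS ca
ORDER BY u.Reputation DESC;
```

With no supporting index on Badges, this is the shape that tempts the optimizer into building an eager index spool for the inner side, as the rest of the video shows.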

Now, we can figure out why when we start looking at some different aspects of the query. And by different aspects I mean what happened with statistics time and IO and what happened in the query plan. Now, the CPU and elapsed time are almost the same which is a little bit weird because this is a parallel plan. Usually the whole point of a parallel plan is to use multiple threads to cut down on the total elapsed time. So you sacrifice using extra CPU to make the query overall run quicker. And you can also see that we didn’t do a lot of work against users or badges.
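The diagnostics mentioned here come from session-level statistics output, turned on before running the query, along with the actual execution plan:

```sql
/* Report per-query CPU time, elapsed time, and logical reads
   (including worktable reads) in the Messages tab. */
SET STATISTICS TIME, IO ON;
```

Worktable reads in the STATISTICS IO output are the clue that a spool (or another tempdb-backed operator) is doing work the base-table reads don’t account for.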

The users table we did 44,000 logical reads and the badges table we did about 50,000 logical reads. That’s not a lot of reads. The other thing we’re running into though is that we have this work table.

And this work table does a ton of logical reads. That’s about 24 million. So we have to ask ourselves where that came from.

And if we look over at the execution plan, it’ll become a little bit more obvious. Well, it’s really obvious to me and now it’s going to be obvious to you too. That work table comes from this index spool operator.

SQL Server wanted an index so badly on this data that it created an index behind your back up in tempdb. And it didn’t ask for an index. If you look at this top line here, there is no missing index request.

If we go and we look in the execution plan XML, there will be no missing index request. There is just this query running where SQL Server says, I’m going to create an index for you, you lazy bad DBA. What really stinks about this index spool?

Well, there’s a couple things that stink about it. One is that after this query runs, SQL Server will throw it away. And if this query runs again, or if this query runs a million times, every time this query runs, this spool will get created and thrown away. But what’s particularly nasty about these index spools is that if we go look at the properties, and we look at where all the rows line up across the parallel threads in the query, they all end up on one.

And it doesn’t matter how big this table is. It doesn’t matter if you use a different table, or if I use seven different tables. Index spools build the index behind them single-threaded.

That’s just the way it goes. So all eight million rows end up on one single thread. In this case, it’s thread three.

If I ran it a bunch of different times, they might all end up on a different thread, but they would always all end up on one thread. That’s no good. We don’t like that.

And that’s basically what made this parallel query run like a serial query. Because this whole unit of work is done serially. And this is really where we spent the majority of the time in the query.

Well, that stinks. And you see this pretty frequently, specifically with cross-apply with a top one. And it’s because the optimizer can’t really unroll that.

And what I mean by unroll that is just turn it into a regular join. It’s going to use kind of like the literal translation of cross apply: go get a row and apply it to what’s down here. The optimizer isn’t free to transform that into a regular join, especially when a top is involved.

Now, let’s contrast that with a query that’s written slightly differently. It’s still going to return the same results, but we’re going to use row number instead. We’re even still going to use cross-apply.

And we’re going to select the user ID and the name from the badges table. But this time we’re going to generate a row number over the same columns that we generated the top one with. And then we’re going to end up filtering out the results to only where row number equals one.
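The rewrite described here looks roughly like the following sketch (again my reconstruction, with names assuming the Stack Overflow sample database): the inner query numbers each user’s badges newest-first, and the outer filter keeps only row number one.

```sql
/* Sketch of the faster rewrite: ROW_NUMBER() instead of TOP (1).
   Names assume the Stack Overflow sample database. */
SELECT TOP (100)
       u.Id,
       u.DisplayName,
       u.Reputation,
       ca.Name,
       ca.Date
FROM dbo.Users AS u
CROSS APPLY
(
    SELECT b.UserId,
           b.Name,
           b.Date,
           /* Number each user's badges, newest first */
           ROW_NUMBER() OVER
           (
               PARTITION BY b.UserId
               ORDER BY b.Date DESC
           ) AS n
    FROM dbo.Badges AS b
    WHERE b.UserId = u.Id
) AS ca
WHERE ca.n = 1
ORDER BY u.Reputation DESC;
```

Because there’s no TOP inside the apply, the optimizer is free to decorrelate this into an ordinary join (a hash join, in the plan shown) and filter on the row number early.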

Now, remember that first query took a minute and three seconds to run. And if we go and execute this, it will be significantly faster. I forget how fast, but long enough for me to take a sip.

Now, that took six seconds. Why did that only take six seconds? You can see that it’s the same amount of reads here, the 44,000 and 49,000.

But for the CPU time and elapsed time, we did way better. That’s about, you know, there was like a tenth of the time. One percent of the time.

I’m not good. I’m not good at math no matter what. It doesn’t matter if it’s Saturday morning or not. If we look over in the execution plan, this plan is also parallel. But we don’t have any spooling operators.

You know, in this case, the optimizer was free to take that cross apply, and rather than do a nested loops row-by-row join, it was free to transform that into a hash join right here. And when we went and generated the row number over all the results of the badges table, partitioned by user ID and ordered by the date column, we filtered out all of the rows that we weren’t using with this filter operator pretty early on.

So we still did it. And I’m not saying that this query is perfect and that we couldn’t tune things better and that we couldn’t make things better for this query. But it just goes to show you that sometimes a pretty simple rewrite can have pretty profound effects on a query.

And that, you know, sometimes that cross-apply with top is not always the best form of a query that can be written. Anyway, I’m going to go get back to the rest of this. I hope you enjoyed this.

I hope you learned something. And I will see you, I don’t know, maybe, maybe, maybe I’ll record something later that I won’t remember. I don’t know. We’ll see. Thanks for watching.


Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.



3 thoughts on “Top 1 vs. Row Number in SQL Server”

  1. I kinda love that the optimiser’s such a magical beast that you get to have a “WTF?” moment whenever you catch it *actually* executing your high-level request.

  2. Interesting post Erik; there was one other thing different in the second query that may or may not make a difference. The second query started off with a select top 10 instead of the top 101 in the first query.
    I understood that there can (used to be?) also be a huge difference between top 100 and top 101, but that wasn’t important in this case, right?

    1. Yeah, if I use a bigger number, it runs for too long to make sense in a quick video, hahaha.

      Thanks!
