My SQL Server friendo Mr. Erik C. Darling was recently telling me about some work he did getting batch mode on paging queries. This sounded a bit odd to me because paging queries make me think of small seeks of rows against finely curated nonclustered rowstore indexes, so I asked him for a link to his blog post about it. He grumpily refused and told me to find the link myself, which I eventually did.
High End Machine Performance
Erik’s first attempt used OFFSET/FETCH and resulted in a row mode query:

The clustered index scan makes this a bit of a sad paging query. In Erik’s defense, there’s a whole host of real world reasons as to why you wouldn’t be able to make the perfect nonclustered index for your paging query:
- End users may choose to sort on many different columns and you can’t index them all
- A key member of your Index Approval Committee is on vacation
- You already have more indexes on the table than your number of fingers
- You’re working with third party software which does not allow you to create custom indexes
Getting back to the query, it doesn’t look that offensive to me. The row mode sort is a parallel top N sort and the overall number of rows to return is low, so each thread can independently sort its rows and return 1000 locally sorted rows to the parent operator. This is about as good as it gets with parallel row mode sorting. This is a row mode only query so the operator times that you see are the sum of that operator’s work and its children. In terms of real work done by the query, the scan clocks in at 1.168 seconds and the sort clocks in at 0.84 seconds. The final accounting at the end by the parent Parallelism (Gather Streams) is misleading at best and an outright LIE at worst. There wasn’t 4 seconds of work done by this query. There was only 2 seconds. The red lines illustrate the problem perfectly and I won’t be elaborating further:

Erik’s second attempt uses ROW_NUMBER() and he achieves a plan with some batch mode operators using BMOR (batch mode on row store):

The parallel batch mode sort works just fine here in that the single thread output property isn’t an issue. The parent operator is a batch mode window aggregate, but even if it wasn’t, the grandparent is a gather streams operator so the rows would end up on one thread anyway. Actual time statistics accounting works differently for batch mode operators: each batch mode operator only tracks its own work. In terms of real work done by the query, the scan clocks in at 1.1022 seconds and the sort clocks in at 0.892 seconds. This is quite similar to the first attempt. It could be argued that the batch mode sort is more efficient than the row mode top N sort, but I’d call it a wash considering the unpredictable rowstore to batch mode conversion overhead (which does seem to be small for this table).
Low End Machine Performance
I tested on my local machine with 64 GB of RAM which is less than Erik’s laptop. My clustered index scans took significantly longer than his, but as usual, there’s a lot to learn from low end machine performance. Let’s go back to the first reason as to why the table might not be indexed well for this particular query:
End users may choose to sort on many different columns and you can’t index them all
Microsoft presents a standard solution for this scenario: the humble nonclustered columnstore index. This will be great for my low end machine because I’ll be able to fit the new NCCI in memory. For those following along at home on their own low end machines, I created the index on every column except the Body column in a very carefree fashion:
CREATE NONCLUSTERED COLUMNSTORE INDEX ncci ON posts ( Id , AcceptedAnswerId , AnswerCount , ClosedDate , CommentCount , CommunityOwnedDate , CreationDate , FavoriteCount , LastActivityDate , LastEditDate , LastEditorDisplayName , LastEditorUserId , OwnerUserId , ParentId , PostTypeId , Score , Tags , Title , ViewCount ) WITH (MAXDOP = 1)
I ran the OFFSET/FETCH query on my low end machine with MAXDOP 4 and it only took 430 CPU ms and 130 ms of elapsed time:

That’s a huge improvement compared to the 15-20 second runtime I was experiencing earlier. Interestingly, the second query (the ROW_NUMBER() approach) sticks with the parallel batch mode sort and performs significantly worse in comparison:

The key difference here is the batch mode Top N sort in the first query. Remember that the query compile process for BMOR is different than what you get when compiling in the presence of a glorious columnstore index. You can get the improved batch mode top N sort by also doing a fake join to an empty CCI table. Serious batch mode connoisseurs should be mindful of the compile differences as they seek to gain the greatest performance benefit possible from batch mode.
Final Thoughts
Friends don’t let friends be lied to by SSMS. Thanks for reading!