Wild, Wild Life
Note: this was fixed in CU11.
Thanks for reading!
Video Summary
In this video, I delve into a peculiar performance issue I encountered while working with SQL Server 2019, specifically on CU9. The problem arose when creating an index on a table and running a windowing function over it, leading to bizarre parallel query performance at every DOP up to eight, and even at DOP 16. What made this issue even more intriguing was that the same query ran much faster on a single core, suggesting some form of internal contention or optimization hiccup. I explore various fixes, including running the query with a single thread, hinting the clustered index, adding an unnecessary ORDER BY clause, and using a nonclustered columnstore index as the data source. The issue seems to be unique to SQL Server 2019; it hasn't shown up on earlier versions like 2016 or 2017. If you're dealing with similar windowing function issues in parallel queries on SQL Server 2019, this video might offer some insights and potential solutions.
Full Transcript
Erik Darling here with Erik Darling Data, a company recently voted least likely to be acquired by anyone by BeerGut Magazine. So I'm not sure how I feel about that, but okay. Anyway, today I want to talk about a very, very weird issue that I've been running into with SQL Server 2019. Now, this is cumulative update 9. We are on the latest bits. The server has been rebooted many times at this point. There is nothing all that weird about my setup, and I've gotten this to repro in many different ways across many different servers. The only defining factor hardware-wise seems to be that the server has more than eight cores in it. You can repro it to some degree on smaller CPU configurations, but with anything over eight, this thing really kicks up.

Now, the original issue that I ran into was this: I created an index on a table, and then I ran a windowing function over the table, just with an OVER clause, nothing too weird in there. And the performance was just bizarre when it went parallel. It happened at every DOP up to eight. I could put it up to DOP 16 and see the same problem, but DOP 8 was where the query was running originally. And what was even weirder was that it got completely fixed at DOP 1. And this is what the query plans end up looking like. Oops, shouldn't have done that. You have this top query that goes parallel. And this was a bad run; actually, this is in the middle of bad runs. Really bad runs were like 45, 50 seconds.
Typical runs were more like 15 to 30 seconds; this one was a little higher on the uptick. It's just the run I happened to capture when I was recording this video. But you see the parallel version of this plan. We have an index scan that's quick. We have a compute scalar that's also quick. But then we have this Window Aggregate. The Window Aggregate shows up in queries that use batch mode; this is batch mode on rowstore. But you could most likely see the same thing happen if you were to trick SQL Server into using batch mode, whether by left joining off to an empty temp table, creating an empty nonclustered columnstore index, or whatever trick you want to use. You name it.
There are all sorts of things you could do that would trick the optimizer. But what gets really weird is that when we run that same query at DOP 1, it is pretty quick, right? Obviously the scan of the index over here is going to suffer, because we're scanning 25-ish million rows with a single thread.
That’s going to be slower. But the window aggregate finishes almost immediately. And this query runs, let’s go look at the time stats, properties, query time stats. So about 2.5 seconds there. But then up here, the parallel plan, well, this thing just gets the crap kicked out of it.
Yeah, I don't know. I don't claim to fully understand what's happening behind the scenes, but that was sort of the deal there.

And if we just run this, it runs quickly. Actually, it's not going to run too quickly this time, because it's going to recompile and it's going to spill a little bit on the first run. So this is going to look not so great. But once memory adjusts, because batch mode, this thing will run fairly quickly on the second go.
But, you know, this DOP 8 query is always suffering, and I'm not going to sit there and make you wait however long it might take to run. But that's sort of what happens.
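In case it helps to picture the setup, here's a minimal sketch of the shape of the demo, not the exact one from the video. The dbo.Comments table, the index name, and the columns are stand-ins I'm assuming for illustration; the point is a plain rowstore table with tens of millions of rows, a window aggregate with a bare OVER clause, and a MAXDOP hint to flip between the two plans. The filtered columnstore index at the end is one version of the batch mode trick mentioned above.

```sql
/* Hypothetical names throughout; swap in your own big rowstore table. */
CREATE INDEX ix_comments_userid_score
    ON dbo.Comments (UserId, Score);
GO

/* Parallel plan: batch mode Window Aggregate, erratic runtimes.
   The bigint conversion just avoids arithmetic overflow on big sums. */
SELECT c.UserId,
       SUM(CONVERT(bigint, c.Score)) OVER () AS total_score
FROM dbo.Comments AS c
OPTION (MAXDOP 8);
GO

/* Serial plan: the scan is slower on one thread, but the
   Window Aggregate finishes almost immediately. */
SELECT c.UserId,
       SUM(CONVERT(bigint, c.Score)) OVER () AS total_score
FROM dbo.Comments AS c
OPTION (MAXDOP 1);
GO

/* One way to trick the optimizer into batch mode: an empty, filtered
   nonclustered columnstore index. The contradictory WHERE clause
   guarantees it never holds a row. */
CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_batch_trick
    ON dbo.Comments (Id)
    WHERE Id = -1 AND Id = -2;
```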
Now, I was talking about this with some friends of mine, and my friend Forrest said I should look at spinlocks during that period of time. And I said, you know what? Forrest knows his spinlocks. Spinlocks it is. So this was, again, sort of a middle of the road run; it executed for about 37 seconds when this went off.
I didn't grab a query plan for it, because it was kind of useless in this case. But if we look during this 37-second window, the SOS_SUSPEND_QUEUE spinlock spun an extraordinary number of times. An extraordinary number of times in that window.
Now, of course, I don't claim to know much about spinlocks. So I went to look up what this spinlock meant. It said it was for internal use only. So I assume that I am never going to know what that means.
And I even had to go and steal the script from the lovely and talented Paul Randal. I will link to it in the YouTube notes. But there it is.
If you want to capture spinlock statistics for a period of time, that's the script to use. Aside from some mild formatting that I did, this is the before and after of spinlocks on my system when this thing runs at DOP 8. Now, I wanted to have some fixes for this. Or rather, I wanted to list out some of the things I found that worked as far as fixing the issue goes.
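If you just want the rough mechanics rather than the full script, here's a bare-bones sketch of that kind of before-and-after capture against sys.dm_os_spinlock_stats. This is my simplified stand-in, not Paul's script; his is linked at the bottom of the post.

```sql
/* Snapshot the spinlock counters before the test */
SELECT name, collisions, spins, spins_per_collision, backoffs
INTO #spins_before
FROM sys.dm_os_spinlock_stats;

/* ... run the misbehaving DOP 8 query here ... */

/* Diff the counters to see which spinlocks piled up */
SELECT s2.name,
       s2.spins      - s1.spins      AS spins_delta,
       s2.collisions - s1.collisions AS collisions_delta,
       s2.backoffs   - s1.backoffs   AS backoffs_delta
FROM sys.dm_os_spinlock_stats AS s2
JOIN #spins_before AS s1
  ON s1.name = s2.name
ORDER BY spins_delta DESC;

DROP TABLE #spins_before;
```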
Because I assume that if you stumble on this video, you might be running into the same problem. Or, you know, if you're on SQL Server 2019 and using a lot of windowing functions, you should probably be aware of this until, I don't know, maybe Microsoft fixes spinlocks. I'm kidding, obviously.
So, here are all the things I found that fixed the issue. First: if we run the query at DOP 1, things finish very quickly. That's all good there.
Second: if we hint the clustered index, we still face the issue to some degree, but it's much, much smaller for some reason. I don't know if it's something with the parallel page supplier, or the size of the rows in the aggregate.
I just don't know offhand, and I don't fully understand why going back to the clustered index works better than just going to the nonclustered index. Third: if we add a useless ORDER BY to the windowing function (here, I just stuck an ORDER BY c.Score in there), the issue does not reproduce. And fourth: if we use a columnstore index, in this case a nonclustered columnstore, as the data source, the issue also does not reproduce.
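To put those workarounds side by side, here's a hedged sketch using the same made-up dbo.Comments names as before. One caveat on the ORDER BY version: adding an ORDER BY to an OVER clause turns a grand total into a running total unless you also spell out the frame, so the sketch includes an explicit frame to keep the results identical.

```sql
/* 1. Run the query serially */
SELECT SUM(CONVERT(bigint, c.Score)) OVER () AS total_score
FROM dbo.Comments AS c
OPTION (MAXDOP 1);

/* 2. Hint the clustered index; INDEX(1) means the clustered index,
      and assumes the table actually has one */
SELECT SUM(CONVERT(bigint, c.Score)) OVER () AS total_score
FROM dbo.Comments AS c WITH (INDEX(1));

/* 3. Add a "useless" ORDER BY, with an explicit frame so the result
      stays a grand total rather than becoming a running total */
SELECT SUM(CONVERT(bigint, c.Score)) OVER
       (
           ORDER BY c.Score
           ROWS BETWEEN UNBOUNDED PRECEDING
                    AND UNBOUNDED FOLLOWING
       ) AS total_score
FROM dbo.Comments AS c;

/* 4. Feed the query from a nonclustered columnstore index instead */
CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_comments
    ON dbo.Comments (UserId, Score);
```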
So, all sorts of things potentially make this query faster or fix the issue with the spinlocks. This only seems to happen on SQL Server 2019. So far I haven't gotten it to repro on SQL Server 2016 or 2017.
And if you want a more generalized reproduction of this: create a temp table, stick a bunch of rows in it, and run sort of the same query, a SUM ... OVER () against that temp table. You can see the same generalized pattern, where a parallel window aggregate runs for a very long time and produces a whole bunch of those SOS spins, and where the serial version of the plan does not run into the same performance issue.
Again, this query takes about 8.2 seconds in total. This other query becomes batch mode, blah, blah, blah, and takes about five seconds total. So definitely something interesting there.
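Here's a self-contained sketch of that generalized repro, since it doesn't depend on any particular database. The row count is arbitrary, the outer MAX is just there so millions of rows don't get shipped to the client, and whether batch mode on rowstore actually kicks in is something to verify in the plan on your own build.

```sql
/* Build a throwaway temp table with ~10 million rows of junk */
DROP TABLE IF EXISTS #t;

SELECT TOP (10000000)
       CONVERT(bigint, ABS(CHECKSUM(NEWID())) % 1000) AS score
INTO #t
FROM sys.all_columns AS a
CROSS JOIN sys.all_columns AS b
CROSS JOIN sys.all_columns AS c;
GO

/* Parallel: a long-running batch mode Window Aggregate,
   and a whole lot of SOS_SUSPEND_QUEUE spins */
SELECT MAX(x.total_score) AS total_score
FROM (SELECT SUM(t.score) OVER () AS total_score FROM #t AS t) AS x
OPTION (MAXDOP 8);
GO

/* Serial: same query, no spinlock pileup */
SELECT MAX(x.total_score) AS total_score
FROM (SELECT SUM(t.score) OVER () AS total_score FROM #t AS t) AS x
OPTION (MAXDOP 1);
```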
I don't know if this is something that requires engineering fixes on Microsoft's side, or if maybe I've just broken 15 different servers by running this demo. But I spent last night and this morning spinning up cloud instances of various sizes, between 16 and 96 cores. So I'm going to have a bill to pay this month.
But it seemed to reproduce reliably on SQL Server 2019 on all of those. So if you're on SQL Server 2019 and you have window aggregates in your execution plans that are going parallel, you might find some weird performance issues. Performance might be highly variable, or it might be consistently bad.
Anyway, that’s all I had there. I’m going to go package these details up and send them to my friends who can fix these things. And hopefully in the next CU there will be far less spinlock contention when we have parallel batch mode window aggregates.
Well, I guess you couldn’t technically have a row mode window aggregate. That just wouldn’t be sensible, would it? No.
Anyway, thank you for watching. I hope you learned something. And I don’t know, it’s like 10 a.m. on a Tuesday. So it’s probably time for me to start thinking about what cocktail I’m going to have to get warmed up for lunch cocktails because it’s just around the corner. Don’t want to go into lunch cocktails too sober because you can really get ahead of yourself that way.
Anyway, yeah, that’s it. Thank you for watching. Goodbye.
I’m exhausted.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.
https://www.sqlskills.com/blogs/paul/capturing-spinlock-statistics-for-a-period-of-time/
Hey Erik, does this still happen since CU14 TF8101?
There’s a note immediately under the video that it was fixed in CU11. I have no idea how that trace flag interacts.