Fixing Parallel Row Skew With TOP In SQL Server (With A Brief Re-Complaint About CXCONSUMER Waits)

Fixing Parallel Row Skew With TOP In SQL Server (With A Brief Re-Complaint About CXCONSUMER Waits)


Video Summary

In this video, I delve into the issue of parallel skew in queries and how it can lead to suboptimal execution plans. After receiving feedback on my previous video about CX consumer weights and detecting parallel skew through CX sync port weights, I decided to create a follow-up focused on fixing these issues. We start by examining a problematic query that took 15 seconds to complete due to uneven parallelism, with most of the work being done on a single thread despite an eight-way degree of parallelism. The solution involves introducing a `TOP` operator within a `CROSS APPLY` to better distribute rows and balance the workload across threads, resulting in a more efficient execution plan.

Full Transcript

Erik Darling here with Darling Data. I did tweak some audio settings, so hopefully there’ll be less noise gate cutoff of things, and only when I’m actually silent will you not hear any of the weird background stuff that goes on in my office or the frequent sirens that go by my house in lovely, crime-free New York. So today’s video, I don’t actually, I’ll tell you here, a while back I recorded a video about CX weights, like CX consumer and CX sync port, how to detect parallel skew in queries and how to figure out if that’s why your query is slow, and some of the feedback that I got from that, both on YouTube and and story and they have spend.

Y Oh,unnoV Mod and they notes3 very, but silk. how do I fix it? So finally got around to recording a video about how to fix it. So let’s do that.

Let’s talk about fixing it. But let’s go back in time a little bit and let’s look at what the problem is and some of the other surrounding stuff. So let’s make sure I have query plans turned on here. Actually, let’s make sure I have query plans turned on here ahead of time too, so I don’t go and screw anything up. And I’m going to run this whole kit and caboodle right now.

And we’re going to get a little bit of execution plan noise on this, but that’s okay because we’re doing stuff up here and then we’re doing stuff down here. But the bottom line is that this query should take around 15 seconds to complete. And there we go, right at the 15 second mark. There we go. If you look under my armpit, you’ll see 15 seconds down there. It’s amazing. It’s immaculate, right? So what happened? Why? What went on with this awful query? This abomination? Well, let’s start by looking a little bit at weight stats. And one of my chief complaints about a lot of things in SQL Server is that, you know, at some point in time, someone at Microsoft said, CX consumer weights, no big deal. Ignore them. Don’t ever think about them. But I think we can see pretty well here that CX consumer weights were a big deal for this query and that we should maybe pay attention to them because it’s a lot of time spent waiting on the old CX consumer.

Now, in the original video, one thing I noted was that the CX sync port weight, the time on that was just about even with the query execution time. And I think that might be another way that no one has said we should ignore yet to detect parallel skew in a query, which is great, you know, awesome, cool. So let’s look at the query plan. And let’s look at where things went wrong. And then we’ll look at how to fix it, how to fix it. All right. So the bad part of this query, or rather where the problems start in this query is over here, when we have this serial zone, it gets flipped to a parallel zone, and then sent into a nested loops join. And what happens is if we look at the properties of this thing, actually, that was really well set up. Usually, I’m not that well prepared, but here we go. All right. Well, actually, this part’s a little bit off. Maybe this wasn’t the greatest idea in the world. Let’s see if I can slide that over a little bit. There we are. All right.

Maybe that’s the secret. Maybe I’m just not loud enough. I should be loud and dumb. I watched the movie American Fiction last night, finally. It’s a very funny movie. And there’s a line in it that really stuck out to me, where it said, I forget, I can’t remember. I’m not good at actor names. I can never remember anyone. Like, I don’t know. I’m just bad at it.

But the star of the movie said something like, the dumber I act, the richer I get. I got a kick out of that. And I was like, wow, I got to try acting real dumb. Because apparently, lots of people who act real dumb make a lot of money. So maybe that’s a good tack for me to take in my career. I’m going to start acting dumb or dumber than usual, maybe. I don’t know. Maybe I’m not acting dumb enough. But if we look at the details here, I can stop moving around. We can see that all these rows end up on a single thread. That’s an eight-digit number. Because I have two fingers left up. So that’s an eight-digit number there of rows. So that’s 10 million rows, I think. Ended up on a single thread. And of course, we have a fully parallel execution plan. But having 10 million rows on a single thread kind of just makes it a serial execution plan. It doesn’t matter how many threads you throw at it. If all 10 million or if rows heavily skew towards a single thread, it’s not going to be a good time for your parallel plan. Especially a parallel nested loops plan.

And you should watch my video about parallel nested loops to learn more about why. It’s a good video. If you like this video and you subscribe to my channel, you will get notifications every time I pass on one of these remarkable nuggets of knowledge to you for free.

Amazing things will happen to your career. You might go from junior DBA to regular DBA to senior DBA to staff DBA to principal DBA to president of the DBA society. You never know. You never can tell. But the thing here is that this situation proceeds throughout the query plan. The 10 million rows still all on one thread. There you can see that. I can zoom in a little bit to make life a little bit easier than some of you old fogies who watch with your bad eyes. And if we highlight this arrow right here, I mean, it’s going to be fewer rows on this one. Oh, what happened there?

Oh, you know what? The compute scale I messed things up. You can look at this one. But still, all of the rows are on a single thread, right? So we had essentially a parallel query where one thread did all the work, right? And that’s not good because this query ran at a degree of parallelism of eight. The ocho, as they say. Eight, right? And the reason why I brought up the weight stats over here, right, where you can see CX consumer loud and clear as a session level weight stat is because if you look in the query plan, CX consumer, at least I hope it’s going to not be there because it’s never been there before. If it’s there now, I’m going to lose my mind. But it’s not.

Okay, great. So I was actually writing a blog post because when this stuff originally started coming around, one of the things that I complained about was that lock weights didn’t show up in query plans. And then I went to complain about it in another blog post. And apparently they show up now. So I should probably actually write a blog post to correct my other blog post. That’s not a bad idea.

If someone can leave a comment on this video reminding me to do that, I would really appreciate it. So yeah, so CX consumer weights do not show up in here, but those CX sync port weights do, right? So we can see all that going on in there. That’s 15 seconds about, right? That’s about how long our query executed for. And if we look there, if we actually come back to the results, CX sync port, 14,779 milliseconds, which is 14.779 seconds. And well, that’s, you know what? Horseshoes, hand grenades, government work.

We’ll take it. That’s close enough for me. All right. I don’t know why this window moved. It had no business moving. So we’re going to pop that back over there. And then, actually, before I move on, I should show you the part of the query that caused the problem, right? Because we have a problem area in this query. And that’s right here, this cross join, where SQL Server’s like, yep, one row is going to come out of there. And sometimes cross joins can be fine. Other times like this, we end up with, I don’t know what we just saw, where we have a fully parallel plan. SQL Server didn’t do a good job of splitting rows out to all the parallel threads.

And this is one of those unfortunate things, because we can’t really control what SQL Server does when we write things so plainly, right? We have very little control over what, how SQL Server is going to generate a query plan for these things. And with a cross join, I mean, there’s really nothing to correlate in here, right? We can’t do anything in here, because this max is an expression here, and we can’t really do anything down here. Like we could try with having, but that just gets weird and awkward. So the only place we can really apply the filter that we get from here is down in here, in the where clause. And now, this is where our trouble lives, right? It’s this whole thing. SQL server gets a single row from this and passes it along to the whole rest of the query. If we go back to the query plan, this area here, oops, I don’t need that tool tip. This area here is where SQL Server is getting that max, right? We have the index scanned, and then we get a top one, and then we aggregate that down to the, to get the max creation date. And then we use that max creation date is a filter in here, right? So we have where the post creation date is less than that scalar operator, which is an expression, which is how SQL Server has identified that max aggregate inside the cross join. All right. Cool. Got that sorted out. We think we know what the problem is. This query has very uneven parallelism. We know the part of the query that is causing that issue, right? It’s this part here, where it goes from a serial plan to a parallel plan. And then that’s where one, or rather one thread does all the work with all 10 million some odd rows. So what we can do to fix that in this, in this situation, and this is a technique that works quite well in many places, is to introduce another top into the query. So what we’re going to do is rather than cross join, we’re going to cross apply.

And inside that cross apply, because it’s safe to do so here, because we’re getting a single thing back from here, we’re just going to say, I want the top one. Oops, that’s not what I wanted. Wait, where did you go? That was strange. We’re going to say, we just want the top one from here. And then inside the cross apply, we’re allowed to do something that we’re not allowed to do inside of a cross join, which is have this inner correlation, right? So we can push that up into here. And I don’t know if you noticed from when I just ran this, but it didn’t take 15 seconds. It took zero seconds, right? So we can do more work inside of this, and we end up with a different execution plan overall, right? Rather than have that serial zone that turns into, come on, man, help me out. There we go. Nope, I almost had it.

There we go. Rather than have that serial zone coming out of the cross join, we now have a parallel zone down in this part of the query, right? And the nice thing about top, the great thing about tops is, even if this didn’t happen, what top does, aside from when it’s on the inner side of a nested loops join, what top will often do is force a serial zone in your query plan. And if your query goes from parallel to serial to parallel, one thing that it has to do when it comes out of the serial zone is distribute stream, right? It has to go send all those rows out to multiple threads so that you can, or rather so that your query can process rows with all the parallel threads that are available to it, right? Great, good stuff. So, excuse me, the Microsoft documentation on top, or rather on parallelism kind of sucks because it seems to infer that top forces your entire query to run single-threaded, but it doesn’t. There is just generally a serial zone going into a top operator, except when that top is on the inner part of a nested loops join, which it is here. We have our nested loops up here, we have our outer part of the nested loops up here, and then our inner part of the nested loops is down here. But this ends up pretty good for us. And it’s a pattern that you can generally apply. It might not always result in a query plan that looks like this, but what adding another top in can do is gather streams where it goes into the serial zone for the top, and then distribute streams where it comes out of the serial zone and back into a parallel zone as it leaves the top. And the results are all the same, and the weight stats for this, of course, look quite a bit better, even though we somehow managed to rack up five seconds of CX Consumer in a query that executed for 668 milliseconds. It’s a fun time, right?

Bit of a mind bender there, I admit. What were all those CX consumers consuming? I’d like to consume some of what they consume. But anyway, this was, I thought, a reasonable solution and a pattern that you can apply generally to your queries. If you’re examining a parallel query and you’re wondering, why the hell is this thing so slow? And you look at any of the sort of little arrows or any of the operators where data is passed along, and you don’t see something like this.

All right. This is a pretty even good parallel row distribution on parallel threads, right? There’s about 3,000 on each. That means every thread is going to do a pretty normal amount of work, or pretty equivalent amount of work. If you see all or most, or even like, like if you’re looking at the numbers, and let’s just say for the sake of having something to talk about that thread four had like, I don’t know, like two, three, four more digits than the rest of them, then that would be a pretty good, pretty good place to start with. Your plan is to figure out, well, how can I get better thread distribution so that I can have my parallel plan, my parallel threads, my parallel workers all doing an equal amount of work? Because if you don’t do that, one of them’s going to start expecting a raise, and you give one of them a raise, they’re all going to want a raise, and it’s going to be really hard to explain to them that, well, this one thread did 10 million rows worth of work, and you did zero. You don’t deserve a raise. Sorry. It’s not in the cards.

We had record-breaking fiscal year, but, you know, you get a bowling party, and you’ll get laid off next week. Fun times, right? All right. Anyway, I got other stuff to do, so I’m going to wrap this up. Top is your friend when you have uneven parallel threads, and you want a way within your query to naturally gather streams into a serial zone and redistribute them, hopefully in a more even manner, going back into a parallel zone. Cross-supply is another very useful thing for this, because, again, cross-supply will quite frequently be optimized out to a nested loops join, and when you get a parallel, well, I mean, when you have a, I’m going to say this very specifically, when you have a parallel nested loops join, and you have a top on the inner side of the nested loops join, that top will no longer introduce a serial zone into the plan. So, use your tops wisely. All right. Cool. Thank you for watching. Hope you enjoyed yourselves. Hope you learned something. If you like this video, please do give it the old thumbs up. If you like SQL Server performance tuning content, well, this is a good channel to subscribe to for it, as far as I can tell. It’s all I talk about, so might be worth doing that. Ever closer to 3,000 subscribers. I don’t know what I’m going to do when I hit that number. I might just drink on camera. You can all see what I’m really like.

All right. Anyway, that’s enough for today. Thank you for watching, and I will see you in another video at another time in another place. Actually, most of that was a lie. I’m going to be here. We’ll, might probably be another time.

Same bat channel, though. I’m not, not moving out yet. Anyway, that’s good enough.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.



4 thoughts on “Fixing Parallel Row Skew With TOP In SQL Server (With A Brief Re-Complaint About CXCONSUMER Waits)

  1. you ask us to write a comment about you doing a blog post about Lock Waits? (at 7:40)

Comments are closed.