A Little About Nested Loops, Parallelism, and the Perils of Recursive Common Table Expressions
How can I optimize a recursive CTE inside an inline table valued function (iTVF)?
Video Summary
In this video, I dive into some interesting aspects of SQL Server query execution plans, focusing on nested loops joins and the peculiarities of recursive common table expressions (CTEs). Despite my current predicament of needing a haircut, which I can’t get for two more weeks, I share insights that might save you from similar delays in understanding complex query behaviors. I explore how parallelism works with nested loops joins, particularly highlighting the importance of one-row guarantees and the concept of “parallel skew.” By demonstrating practical examples, I illustrate why even a small tweak like adding a `TOP` clause can significantly impact performance and plan execution.
Full Transcript
Erik Darling here with Darling Data, for as long as you’ll have me. Despite the fact that I’m desperately in need of a haircut, and I can’t get one for like two weeks, I don’t want you to have to wait two weeks to learn about SQL Server stuff. So I want to talk today a little bit about nested loops, specifically parallel nested loops, and the perils of recursive CTEs. You don’t strictly need a recursive CTE to see the kind of performance stuff I’m going to talk about, but they do make a rather clever foil in that regard, and I answered a question about them recently. Here’s a typical recursive CTE plan. When you use a recursive common table expression, the recursive part of the query, which in this case is most of the query plan, is not eligible for parallelism. SQL Server just doesn’t go for it; it will force a serial zone in your query plan. As a sort of aside, I believe batch mode is also ineligible for the inner side of a recursive CTE plan, which kind of makes sense, because there really aren’t a lot of operators in here that are eligible for batch mode anyway. There’s nested loops joins, spools, and I suppose compute scalars, and I think concatenation is too, but what would be the point of doing that, right? Not a lot of it.
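To make the shape concrete, here’s a minimal sketch of a recursive CTE, assuming the Stack Overflow sample database; the post Id and column list here are my own stand-ins, not the exact demo query. The point is the structural rule: everything in the recursive member (below the `UNION ALL`) runs in a serial zone, no matter how the rest of the plan shakes out.

```sql
WITH Postal AS
(
    /* anchor member: one post, found by the clustered primary key */
    SELECT p.Id, p.ParentId, 0 AS Depth
    FROM dbo.Posts AS p
    WHERE p.Id = 11227809 /* hypothetical post id */

    UNION ALL

    /* recursive member: this zone is always serial,
       and ineligible for batch mode */
    SELECT p.Id, p.ParentId, pa.Depth + 1
    FROM dbo.Posts AS p
    JOIN Postal AS pa
      ON pa.Id = p.ParentId
)
SELECT *
FROM Postal
OPTION (MAXRECURSION 0); /* needed when the hierarchy runs deep */
```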
Now, you can sometimes use CROSS APPLY in some ways to change that. So with this query here, we’re cross applying, but SQL Server still chooses a single threaded execution plan. For this one down here, we do things a little bit differently, rather than looking for just a single post Id. Now, remember, the Id column in the Posts table is the clustered primary key.
So looking for one value means one row is going to come out of it. This is a particularly interesting value because it actually has a lot of answers and comments associated with it. If I tried to run this for real, it would run for a long time, and it would fail without the MAXRECURSION 0 hint. So, a lot of fun background there, right? Life-changing stuff. This version of the query, though, looks for something a little different, and I don’t know why IntelliSense is crapping all over this thing and telling me that the Posts table doesn’t exist, so let’s get rid of that. I don’t need your IntelliSense today anyway; these objects clearly exist. It’s not like the video where I was telling you to maybe pretend to select from an object that doesn’t exist so that you don’t cache query plans, but whatever. This query changes a little bit, and we’re going to talk about why in a minute, but we’re also going to talk a little bit about parallel nested loops, because there’s a lot of stuff about them that a lot of people kind of don’t understand.
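For reference, here’s roughly the shape of the kind of inline table valued function involved; the name `dbo.get_hierarchy` and the columns are hypothetical stand-ins for the demo’s actual function, again assuming the Stack Overflow sample database.

```sql
CREATE OR ALTER FUNCTION dbo.get_hierarchy (@PostId integer)
RETURNS TABLE
AS
RETURN
WITH h AS
(
    /* anchor member: one post, located by the clustered primary key */
    SELECT p.Id, p.ParentId, p.PostTypeId
    FROM dbo.Posts AS p
    WHERE p.Id = @PostId

    UNION ALL

    /* recursive member: walk the question/answer hierarchy */
    SELECT p.Id, p.ParentId, p.PostTypeId
    FROM dbo.Posts AS p
    JOIN h
      ON h.Id = p.ParentId
)
SELECT h.Id, h.ParentId, h.PostTypeId
FROM h;
```

Because it’s an inline function, the CTE gets expanded into the calling query’s plan, which is why the shape of the outer query can change what the whole plan is eligible for.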
This whole query now is eligible for parallelism, and there are some rules and some reasons and some things to watch out for. So let’s hide those. I don’t want to ruin the big surprise. I’m going to scroll up here a little bit where I’ve written some things about parallel nested loops that you should know. And there we go. That’s a bit better on the Zoom. I would just have to move this so my big head isn’t covering up anything important. And let’s talk a little bit about parallel nested loops.
It’s only chosen for the inner side when the outer side has a one-row guarantee. In the function that I was running, even when I was searching for OwnerUserId 22656, which has about 27 or 28,000 rows associated with it, what we’re passing into the inline table valued function that locates a post is the Id column of the table, right? Again, the clustered primary key. So even though 28,000 rows will eventually be taken and looped over, what SQL Server cares about is that what we’re finding each time we do that loop has a one-row guarantee, and that correlation on the clustered primary key of the Posts table is what gives it that. The costing stuff doesn’t consider the inner side of the nested loops join when SQL Server is figuring out if it’s going to give you a parallel plan or not. It only cares about making the outer part of the nested loops cheaper. Coming back over here, what I mean by that is this, just this, is all SQL Server cares about. If SQL Server thought that it wouldn’t be any more costly to use a single threaded execution plan, then it would have chosen a serial plan again, like we saw in the other ones. The parallel plan for this has an estimated subtree cost of 135 query bucks and change, and SQL Server thought this was a cheaper option than scanning the Posts table using a single thread. So none of the stuff on this side of the nested loops join is considered when costing a query for parallelism, because (get ready, because this is a wild one) what’s running in here in parallel is not exactly parallel. It’s running copies of the same query across threads, and each one is sort of like a serial plan inside, running individually, which conceptually is kind of like what happens with other parallel join types, like hash joins and merge joins and stuff like that.
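Put into code, the eligible shape looks roughly like this, using the hypothetical `dbo.get_hierarchy` function from above and the Stack Overflow sample database.

```sql
/* ~28,000 rows come off the outer side for this user, but each
   loop iteration correlates on p.Id, the clustered primary key of
   dbo.Posts. The seek inside the function can only ever find one
   row per execution: that's the one-row guarantee that makes the
   nested loops join eligible for a parallel plan. */
SELECT h.*
FROM dbo.Posts AS p
CROSS APPLY dbo.get_hierarchy(p.Id) AS h
WHERE p.OwnerUserId = 22656;
```

Correlating on anything without that uniqueness guarantee (say, `ParentId` on its own) would take the parallel option off the table for the inner side.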
But parallel merge joins were a mistake, and we’re not going to talk about those in this fine video. So, with all that out of the way, I’m going to talk a little bit about the setup for the query plans that we’re about to look at and the function that they use.
And then I’m going to show you why parallel nested loops joins can be just terrifically sensitive to parallel skew. By parallel skew, I mean how many rows end up on each thread, and I’ll show a neat way that you can fix those issues if you are having them. So we have a couple of indexes here to help our demo along: one on the Posts table and one on the Comments table. The columns in use here make total sense for what the function is doing. The function here, here’s the anchor part of the recursive CTE, where we look up a single post Id. And I have littered this function with FORCESEEK hints, because SQL Server, despite my fantastic indexes, was choosing to take those indexes and build nearly identical indexes from them using an eager index spool, which was making everything terrifically slow. So lesson one: if you have a good index on your table and SQL Server is building an eager index spool from that index, use FORCESEEK to make your life easy.
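Lesson one as a fragment, hedged the same way: the table and columns assume the Stack Overflow sample database, and `@PostId` stands in for the function’s parameter.

```sql
/* Inside the function body: without the hint, the optimizer was
   copying a perfectly good index into an eager index spool on every
   execution. FORCESEEK makes it seek the existing index instead. */
SELECT c.Id, c.Score
FROM dbo.Comments AS c WITH (FORCESEEK)
WHERE c.PostId = @PostId;
```

FORCESEEK is a table hint, so it goes in the WITH clause next to the table it applies to, which is handy inside a function where you can’t attach query-level hints from the caller.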
We have had a lot of instances lately, helping clients out, where SQL Server was choosing daft execution plans, and FORCESEEK hints were quite useful in addressing those issues. Now, down here, you have the anchor, and then the hierarchy-building part of your common table expression. There’s a word for it that’s slipping my mind right now (it’s the recursive member). Maybe I should have done some research before I started yacking away. So here’s where we go find things in the Posts table. What we found up top was an answer, right? The clustered primary key finds that, finds an answer that we care about. And then in here is where we find questions, or rather answers to the question. And then down here is where we find comments on the question or answer. So we find all the stuff that we can in here, and then we select stuff out of the recursive CTE; this is still inside the function. And then here are a couple of queries that demonstrate really well what I meant by parallel nested loops being very sensitive to skew. So I’m going to make sure that I hit R and not E so I don’t try to run this whole demo file over again, and we’re going to look at a couple portions of these query plans.
Now notice that the only difference between these two queries: here we just do a straight select from the CROSS APPLY where p.CreationDate is greater than or equal to 2013-03-06. And in here we have a SELECT COUNT FROM, and then we have a TOP in here with a big TOP number so that we don’t have to worry about maybe not getting enough rows back; we definitely have fewer than 2.1 billion rows in the table. And then we CROSS APPLY to the hierarchy function outside of that. So this top query takes just about a full minute to run. If we look at the final operator in the query plan, that’s 58.443 seconds, and there’s not a lot of weird mumbo jumbo in here where SQL Server messes up query operator times, because it does that quite a bit too. But really, what’s important here is, if we right click on that little line and look at the actual number of rows for all executions, we are going to see quite a bit of skew. Let me actually clear that out and just highlight the whole thing. Some threads, like here and here and here, end up with a lot more rows on them than other threads, like this one, and this one, and this one. The way that SQL Server figures out which rows are going to go on which thread is it uses some modulus-ish math, and this thing called the parallel page supplier starts dividing rows up depending on how many threads are involved, how many rows there are, and other stuff. You hope they end up getting split up evenly, but that doesn’t always happen, depending on how the values that come out of the table end up hashing out. If you have values that are, say, all even, then whatever modulus value the odd numbers would have landed on is going to be screwed, right?
Because they’re not going to find anything. You could end up with a million rows, zero rows, a million rows, zero rows, a million rows, zero rows across all the threads in your query, which would be a bad situation, like this. This is nearly as bad a situation, just not quite one-million-and-zero; it’s pretty close, with 1.4 million, 1.4 million, 1.6 million, and a bunch of other threads that have like 10 to 12,000 rows on them. So that’s a bad time, and this query ends up running for just about a full minute. We can call it a full minute; I feel like it’s a full minute. It felt like a full minute of my life, anyway. And then we have this query down here, which finishes twice as fast. The final timing on this is just about 30 seconds, close enough to 30 seconds for me anyway, and if it’s close enough for me, then it’s close enough. The reason why, and I wish this thing would stop jumping around and reframing, it’s pretty annoying, is that when we have a TOP in a query, that TOP, under most circumstances, unless it’s on the inner side of a nested loops join, will force a serial zone in your query plan. It does not force the whole query to be serial, which is the way the crappy Microsoft documentation about parallelism makes it sound.
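The two query shapes look roughly like this, again assuming the Stack Overflow sample database and my hypothetical `dbo.get_hierarchy` function; the date literal comes from the demo, the rest is a sketch.

```sql
/* Skew-prone: the parallel page supplier hands outer rows to
   threads directly, and the distribution can land very unevenly. */
SELECT c = COUNT_BIG(*)
FROM dbo.Posts AS p
CROSS APPLY dbo.get_hierarchy(p.Id) AS h
WHERE p.CreationDate >= '20130306';

/* The TOP (max int, so it filters nothing) forces a small serial
   zone. After it, SQL Server has to add a Distribute Streams
   operator to fan rows back out across threads, and that operator
   balances them far more evenly before the nested loops join. */
SELECT c = COUNT_BIG(*)
FROM
(
    SELECT TOP (2147483647) p.Id
    FROM dbo.Posts AS p
    WHERE p.CreationDate >= '20130306'
) AS p
CROSS APPLY dbo.get_hierarchy(p.Id) AS h;
```

The serial TOP itself is cheap; the win is entirely in the redistribution that follows it.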
I’ve had like three or four people be confused about that in recent memory, so it’s just badly documented. The TOP forces a serial zone in the plan up here. But the overall plan is parallel, right? So this is all parallel, and this is all parallel, and all this is obviously parallel. We know that because we have our fun little racing stripes here.
SQL Server needs to split those rows back out after the serial zone. All right, so if we, oh boy, this SSMS is really, really fighting me today. What SQL Server does to accomplish that is it distributes streams. So we have a parallel scan, we have a Gather Streams, and then we have a Distribute Streams, because this TOP has no racing stripes. Slow top, right? Single threaded top. I’m not saying it’s slow; it’s 750 milliseconds, who cares? But what this Distribute Streams operator forces us to do is redistribute those rows, and what it gives us the opportunity to do is distribute those rows evenly. Boy, you’re a real jerk. So if we look at this, go to the properties, and look at the actual number of rows, the same thing happened here, right? These numbers are all over the place. I mean, they’re not exactly the same, but they’re definitely all over the place. But if we look at the row distribution after that Distribute Streams, going into the nested loops join, these are all very, very even, right? We have 2, 4, 3, 2, 4, 3, and then 2, 4, 2, so we have a much better balance of rows on our parallel threads because of that. So, because I’ve talked long enough, and because I have a dinner reservation soon and I want to go eat: I’m hungry, I haven’t eaten all day for various reasons, I’ve been very busy. I know I don’t look like the type of person who skips meals, but dinner is the most important drink of the day.
I thought breakfast was the most important meal of the day. I don’t know, one of those things. Maybe I’ll write an article about what the most important drink of the day is for BeerGut magazine, see what happens there. So anyway, recursive CTEs: kind of a pain in the butt. They can be very slow, because oftentimes queries aren’t written in a way where they can engage a parallel plan. And even when they do engage a parallel plan, you have to be very, very careful about how that parallel plan is written, because you could end up in a situation with incredibly skewed rows across parallel threads, making your query very slow. In this case, it’s a 30-second versus full-minute query, and I will take 30 seconds over 60 seconds when query tuning any day. So, that being said, thank you for watching. I hope you learned something. Hope you enjoyed yourselves. If you like free SQL Server performance tuning content like this from young, handsome men who need haircuts, feel free to subscribe to my channel. If you like this video, well, give it the old thumbs up, because that’s pretty much the only thing that brings me joy these days, especially before I’ve had the most important drink of the day. So, I’m gonna get going now.
Thank you for watching, and I hope you also enjoy your most important drink of the day, if you’re the type of person who partakes in important drinks. So, thank you.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.