A Bug In Batch Mode Adaptive Join Plan Timings
Video Summary
In this video, I delve into a fascinating issue related to adaptive join plans and operator times in SQL Server. Erik Darling from Darling Data highlights discrepancies between the actual execution time of queries and what is displayed in the query plan when using adaptive joins. Specifically, he demonstrates how the time spent on certain operators is not accurately reflected, leading to confusion and potential frustration for database administrators and developers. To address this issue, I’ve opened a user voice item urging Microsoft to rectify these inaccuracies. If you’re interested in supporting this effort or learning more about SQL Server performance tuning, be sure to check out the helpful links provided in the video description, including options to hire me for consulting, buy my training materials, and become a paid member of the channel.
Full Transcript
Erik Darling here with Darling Data. And this is just a quick video to promote a user voice item that I opened up about, what do you call it there? Bugs with operator times and adaptive join plans. That’s, I guess what that says, isn’t it? Anyway, down in the video description, all sorts of helpful links. You can hire me, you can buy my training, you can become a supporting paid member of the channel. You can ask me questions for free for my Office Hours episodes, and of course, as always, please do if you enjoy this content, even in the most modest increments, like subscribe, tell a friend, and all that stuff. I still am speaking, as far as I know, hopefully, at Past Data Community Summit coming up in Seattle, November 17th to 21st, where Ms. Kendra Little and I will be delivering the two most fantastic days of T-SQL pre-cons that you have ever seen in your entire life. So, hope to see you there. And as a, just a little teaser note on that, I sent the order for my new line of promotional T-shirts out to the printers yesterday. So, I will have brand new, very cool, very funny T-shirts to give to attendees of the pre-cons. So, we all have that to look forward to. Anyway, oh, I clicked one too many times. Let’s take a look at a couple things.
So, these are the two queries that I’m going to start with, right? So, one of them, they’re both the same query. The only difference is one of them has this use hint, disable batch mode adaptive joins thing. If we look at the operator times for these two queries, we’ll see that they both take, I mean, close enough to the same amount of time. 38.7 and 38.57. So, like 500 milliseconds difference, but that’s not really the issue.
The thing is that in the second plan, you’ll notice that the hash match inner join, like, so this is a batch mode plan, right? Mostly. We didn’t get a batch mode adaptive join, but we did get a batch mode hash join, which I’ll show you in a second. But you’ll notice that this thing correctly accounts for about 30 seconds of time spent in the batch mode hash join, right?
If we hover over this thing, you’ll see that it is batch mode. And because it’s batch mode, that means that it is only accounting for the time spent in that operator. And if we sort of trundle along through the query plan a little bit further, like we have, you know, the milliseconds that we spent reading from the tables, the 29 seconds in here, about two seconds here, about seven and a half seconds here.
So, it all pretty much adds up close enough to 38.574 once you account for, like, the little, like, eight milliseconds and stuff in other places. And the adaptive join plan, that’s a lot different, right? Because if we look over here, we have 400 milliseconds and 700 milliseconds.
And, of course, because this is the unexplored branch of the adaptive join, zero milliseconds, right? We have zero rows that came out of this thing. All of the row action came out of this part of the plan where we hit the comments table, not this one.
But the adaptive join says that it only ran for about 900 milliseconds, which is a little bit funny. Because if we look at the rest of the operator times, they line up pretty closely from one plan to another. It’s 1.6 versus 1.9, 7.2 versus 7.5, and then seven milliseconds versus eight seconds.
So, the time in this plan is completely unaccounted for. Now, of course, I checked all sorts of things like latches and spin locks, thinking, like, I must be crazy because, you know, Microsoft would never mess up this badly. But the other thing that I looked at is, sorry, I’m holding back a sneeze at the moment.
We might have to break recording or something. I don’t know. But if you look at the wait stats, the top one for both of them is CXSync port, and it’s got just about the same amount of time spent in both of the plans. There’s really nothing else interesting for other waits in these plans that would account for, like, the amount of time spent in, like, nothing else that says, hey, 38 seconds, right?
So, a while back, so the reason I’m bringing this up is because both of these plans end with a batch mode sort, right? Batch mode sort here and batch mode sort here, right? The reason I bring this up is because a while back, I had posted something, and Paul White wrote an article called, what’s it called up here?
Impossible Execution Timings, where he actually referenced me by name, which was, you know, of course, an emotional moment in a young man’s life. But in the article, Paul goes into detail about the batch mode sort and the threads and all the other stuff that happens in there. But he also brought up some workarounds.
And the reason why I bring this up is because I also tried the workarounds, and I’m going to show you those in a moment. So, we had one which was putting data into a temporary table, right? So, like, you know, this stuff here. The other, which was an eager index spool, right?
And you get that by doing this top spid minus spid plus 100% query. And the other was reducing parallelism, so max.2. And then row mode sort, so using trace flag 9347 to get a row mode sort instead of a batch mode sort.
So, and this isn’t to say that Paul was wrong about any of this, because when is Paul ever wrong? This is just to say, I tried this stuff. This isn’t what’s happening, okay?
So, this is just to say, I’m not crazy. Paul’s still right. I’m just not, I’m not the one with the problem, right? So, if we come back over to Management Studio, we’re going to see all of the things that I mentioned in Paul’s post that I have here. So, one of them is putting stuff into a temp table.
One of them is saying top spid minus spid plus 100%. The other one is running the query at max.2. And then finally using trace flag 9347 to get rid of the batch mode sort. So, for all of these query plans, you’ll notice that there’s still 30 some odd seconds in each of them, right?
Some of these, like, you’re going to see the operator time show up differently because of the different plans that we got. But none of them that use an adaptive join will have the time correctly assigned to them. So, like, the first query where we dump data into a temp table, at least I think that’s what we did.
Oh, come on, zoom it. You’ll still see about 30 seconds represented in that regular hash join, right? So, SQL Server didn’t choose an adaptive join for this one.
So, this one actually does still look okay. For the next one down, this is that funny top spid minus spid plus 100% query. And, you know, the operator time does show up earlier because we have the gather streams here, right?
So, this is where the batch mode stuff pretty much ends. None of the parallel exchanges support batch mode. So, this is where we go into row mode.
So, we see, like, 40 seconds in here, but the adaptive join still shows the wrong amount of time here, 938 milliseconds, which, strangely, is exactly what the one over here showed, too. Or, actually, am I on crack? Yeah, I’m on a little bit of crack.
Okay, 894. All right, never mind. Not exactly. Close, though. If we scroll down, this is the plan where we reduced max dot to 2. And this runs for a lot longer because we have a much lower dot, right?
The other query ran at dot 8. This is at dot 2. But we still don’t see the correct operator time show up really across here. The adaptive join shows a little more time because it ran at a lower dot, but it’s only 2.5 seconds, right?
So, this thing ran for two full minutes. We see 22 seconds spent here. We see 5 seconds spent here.
We see 2.5 seconds spent here. But we don’t see anything that fully accounts for the two minutes anywhere, right? We can pretty much guess that it’s going to be happening at this point. And if we scroll down, finally, we’ll have one more query that has an adaptive join.
This was the one that disabled the batch mode sort. So, the sort at the end is in row mode, right? It is in row mode.
This did happen in row mode. This is not a batch mode sort. But still, the adaptive join shows about 940 or so milliseconds of time. And the full query runs for just about the same 38, 39 seconds. So, I think this is a bug.
And because I think this is a bug, I have opened up a user voice item to say, Hey, Microsoft, I think something is wrong here. That link to that will be down in the video description. So, on top of all the other helpful links in there that we talked about before, you also have that link to click on.
And you can upvote this item and hopefully get this fixed. Because if, like, you know, I do a lot of query tuning work. And so, to me, when I said this, I said, this smells fishy. And dug into it.
But if you were doing this kind of work, you might just, like, give up and, like, run away. Like, I don’t know. Just, like, not be so psyched on things. Anyway, go vote for it. This is a problem.
It should be fixed. Thank you for watching. I hope you enjoyed yourselves. I hope you learned something. And I will see you in tomorrow’s video. All right. I think that’s good here. Yeah?
All right. Goodbye.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.