A Little About Skewed Data and Skewed Parallelism
Video Summary
In this video, I delve into a fascinating SQL Server performance issue: how skewed data can lead to skewed parallelism, which undoes the usual bargain of parallel plans, where you sacrifice CPU resources to reduce wall clock time. I walk through an example using the Stack Overflow database, where I artificially skewed the `votes` table and ran a query that highlights this problem. By examining the query plan and analyzing metrics such as CPU time, elapsed time, and memory grants, we can identify when parallelism is ineffective due to skewed data distribution. This video serves as an introduction to recognizing and understanding this common performance bottleneck, setting the stage for future content where I will explore solutions to address these issues.
Full Transcript
Erik Darling here with Darling Data. In response to many miserable, wretched people complaining about the fact that I need to make a living, and owing to the fact that I needed to edit some slides anyway, I’ve redone my intro deck, so we’re going to experiment a little bit with that today. In this video, we’re going to talk about how skewed data can lead to skewed parallelism. Skewed parallelism, of course, is when many rows end up on one or a small number of threads, which undoes the potential speed improvement of sacrificing CPU in order to reduce wall clock time. We’ll talk about why that can happen, and how you can find it. I’m not going to talk about how to fix it in this one, because I’ve got something else coming up for that, but this is a good way to introduce the problem and help you identify it in a few different ways. If you look at the video description, you will see all sorts of helpful links: you can hire me for consulting, buy my training (at a discount, of course, because I do care about the people who buy my training, and I’m not going broke when they do), become a channel member to support my efforts to bring you this high-quality SQL Server content, or ask me questions on office hours. And if you like this stuff, please do like, subscribe, tell a friend, and all that good stuff. I just finished PASS on Tour New York City, so that one is off the list. I still have PASS on Tour Dallas and PASS on Tour Utrecht, where I have half-day precons and regular sessions at both, and of course the PASS Data Community Summit coming up in Seattle November 17th to 21st, where Kendra Little and I are tag-teaming two days of T-SQL precons. But with all that out of the way, let’s have ourselves a little database party here. Let’s go over to SQL Server Management Studio.
First, I’m going to explain how I created this data, what the query plan with the skewed parallelism looks like, and some ways you can start figuring out if you’re having this problem. We’ll get to the solving-it part later, in a different video. The votes table in the Stack Overflow database has a user ID column, and that column is scrubbed, not skewed — I’m the one who made it skewed. It’s scrubbed in the public data releases because, of course, you’re supposed to be able to vote in anonymity.
On Stack Overflow, no one knows who you upvoted or downvoted. You’re free to divulge that information if you’d like, but the nice folks at Stack Overflow don’t want to go publicly tell everyone what you did. So after I selected everything from the votes table into a new table, I updated that table to replace the null user IDs. There are a couple of vote types for bounties — bounties are when you want to give someone additional reputation, out of your own reputation, for having a really good answer — so some of the user IDs are populated. What I did was set all of the null user IDs, which are the majority of the table, equal to the value in the vote type ID column. There are not too many vote type IDs; they’re mostly upvotes and downvotes.
There are a few other ones, but they’re not terribly interesting, nor do they comprise much data in the table. Then I created a clustered primary key on the table, so we weren’t dealing with some terrible heapness that you could blame things on, and then an index to support the query that we’re about to look at — or rather, that I ran ahead of time.
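The setup described above looks roughly like this. To be clear, this is a hedged reconstruction, not the exact script from the video: the table name `dbo.VotesSkewed`, the constraint name, and the index definition are all my assumptions.

```sql
/* Copy the Votes table; object names here are assumptions. */
SELECT v.*
INTO dbo.VotesSkewed
FROM dbo.Votes AS v;

/* Most UserIds are NULL in the public data dump (votes are anonymous).
   Replacing them with the VoteTypeId crams most of the table into a
   handful of UserId values: instant skew. */
UPDATE dbo.VotesSkewed
SET UserId = VoteTypeId
WHERE UserId IS NULL;

/* A clustered primary key, so we are not dealing with a heap. */
ALTER TABLE dbo.VotesSkewed
ADD CONSTRAINT PK_VotesSkewed PRIMARY KEY CLUSTERED (Id);

/* An index to support the demo query. */
CREATE INDEX ix_VotesSkewed_UserId
ON dbo.VotesSkewed (UserId, VoteTypeId);
```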
Why did I run it ahead of time, you might ask? And I might tell you, if you were to ask me such a thing: because it runs for a little bit. What this query does is select from the users table, because the users table is kind of where most queries start — you want to figure out which users did what.
Then we outer apply the votes skewed table, correlated on that newly populated user ID column, and we say where vote type ID is between one and four. This is part of what gives us the skew, because the vote type ID ends up as the user ID for the rows that were null before.
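A sketch of what that query might look like. The exact column list and the shape of the apply are my guesses from the description, not the video’s script:

```sql
/* Assumed shape of the demo query: start at Users, outer apply the
   skewed votes correlated on UserId, and filter on VoteTypeId 1-4
   (the values the NULL UserIds were overwritten with). */
SELECT
    u.Id,
    u.DisplayName,
    v.PostId,
    v.CreationDate
FROM dbo.Users AS u
OUTER APPLY
(
    SELECT
        vs.PostId,
        vs.CreationDate
    FROM dbo.VotesSkewed AS vs
    WHERE vs.UserId = u.Id
    AND   vs.VoteTypeId BETWEEN 1 AND 4
) AS v;
```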
So this is where things get fun. If we look at the query plan for this, we will see that we indeed have all of the hallmarks of a parallel execution plan, and we have parallel exchanges.
Well, actually, we have one parallel exchange — okay, singular. We also have all sorts of little parallel indicators across all of these wonderful operators, telling us that SQL Server used multiple CPUs to run parts of the query. Oh, there’s one up here, too — let’s go get that and be completists over here.
Now, this one doesn’t count, because this sort had a spill. The spill doesn’t really matter all that much for this, but it is actually kind of interesting, in a way that we’ll talk about. So let’s start up here in this part of the query plan, and make it a little bit bigger so we can see what’s going on.
If we expand this actual number of rows node — or whatever you want to call it — and zoom in, you’ll see that each of these eight threads did a fairly even amount of work.
You might pick a few bones about some of these, but for most of this, it’s fine.
So the starting part of the query plan, the scan of the users table — we do not have any problems there. And when we go into the sort (and I don’t know why it decided to do that to me), we’ll see that parallelism remains pretty even across all of these threads, too.
We didn’t really lose anything here. And when we go into the nested loops join, things are still okay — we’re still doing all right.
But now this is where things start to get a little wonky on us, and why this lower branch of the plan is where we start having issues: if we go to the clustered index seek here, we will see that all of the rows ended up on a single thread.
The reason why this happens with skewed data is that SQL Server uses a thing called the parallel page supplier to hand rows out to threads. Part of how it decides which rows go on which threads is by using sort of a modulo hash function: if this number divides by this, then we send you to this thread. In this case, all of the rows just happened to hash out to thread one.
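This is not SQL Server’s actual hash function — just a back-of-the-napkin illustration of why a modulo-style distribution collapses on skewed data: when almost every row carries one of a handful of values, almost every row lands in the same bucket. Table and column names are the assumed ones from the setup.

```sql
/* Bucket rows into 8 pretend "threads" by UserId modulo 8.
   With the skew we created, one bucket hogs nearly everything. */
SELECT
    vs.UserId % 8 AS thread_bucket,
    COUNT_BIG(*)  AS row_count
FROM dbo.VotesSkewed AS vs
GROUP BY vs.UserId % 8
ORDER BY row_count DESC;
```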
And this is going to be true when we go across. Now, I said that the sort didn’t really matter, but there is something interesting about it. The interesting thing is that all query plans start as serial execution plans, and parallel plan exploration isn’t started unless the subtree cost of the serial plan is higher than the cost threshold for parallelism.
That means all of the things SQL Server initially assigns to a query — one of them being the memory grant — are decided before a parallel plan is explored, and aren’t changed if a parallel plan is chosen. So if you have a DOP 8 query, the memory grant is not then multiplied by DOP; the memory grant is divided by DOP.
The memory grant for this query, if we hover over here, is 359 megs, which isn’t big by any stretch of the imagination. But that number is divided by 8 for us, because the query is running at DOP 8 — which means our sort operator asked for enough memory to sort all that data. But look what happened.
The one thread that had any rows on it spilled, because that one thread had only one-eighth of the memory. If it had all 359 megs, maybe it wouldn’t have spilled. But since it had 359 divided by 8 — roughly 45 megs — it didn’t get nearly enough.
So it spilled a whole bunch. This was a particularly long-running spill: we had 4.6 seconds spent here — let’s just call it 4.7.
And then we have 45 seconds here; minus that 4.7, let’s call it 40 seconds of this one thread spilling out to disk. So this did cause a performance problem for us, because this one memory-requiring operator got one-eighth of the memory.
Or rather, the memory was spread across all eight threads, so each thread got one-eighth of the grant — and the one thread doing all the work didn’t have nearly enough memory.
So this spilled out a whole bunch. Now, aside from looking at the rows per thread like we did in the query plan, there are some other ways you might look at this and decide that you have an issue.
One thing that I always tell people to look at when they run a parallel query, and it does not run as quickly as they would hope, is the CPU time and elapsed time. One very good indicator that parallelism was ineffective is CPU time and elapsed time being very close together.
In a perfect world, with a DOP 8 query, we would want to see CPU time be 8x elapsed time: all eight threads doing an even-ish amount of work, cooperatively, producing an elapsed time — a wall clock time — that is one-eighth of CPU time.
So in this case, we have 45 seconds of CPU time and 53 seconds of elapsed time, so we can tell that parallelism wasn’t very effective. From a wait stats perspective, this is where things get a little disappointing.
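As an aside, that CPU-versus-elapsed check can also be run against the plan cache after the fact. The DMVs and columns below are real; the thresholds are arbitrary assumptions for illustration, so tune them to taste.

```sql
/* Long-running queries whose total CPU time barely exceeds total
   elapsed time: either they ran serially, or parallelism did them
   no favors. Times in these DMVs are microseconds. */
SELECT TOP (20)
    qs.total_worker_time  / 1000. AS cpu_ms,
    qs.total_elapsed_time / 1000. AS elapsed_ms,
    qs.total_worker_time * 1.
        / NULLIF(qs.total_elapsed_time, 0) AS cpu_to_elapsed_ratio,
    st.text AS query_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE qs.total_elapsed_time > 5000000 /* > 5 seconds; arbitrary */
AND   qs.total_worker_time  < qs.total_elapsed_time * 2
ORDER BY qs.total_elapsed_time DESC;
```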
If we look at wait stats for this query, we’re not going to see any parallelism waits in the query plan that help us figure stuff out. We are going to see a whole bunch of IO_COMPLETION wait from the sort that spilled, but we’re not going to see the wait that really makes a difference here, which is CXCONSUMER.
Microsoft screens CXCONSUMER waits out of actual execution plans because some guy who works at Salesforce says they don’t matter. But in reality, they actually do for this sort of thing.
So if you see a parallel query running, and the CXCONSUMER waits are really high, and you have other indicators like the ones I’m showing you here — the parallelism is obviously skewed when you look at the rows per thread, or the CPU time and elapsed time are very close — then that’s a pretty good sign that your query has very skewed parallelism.
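Since CXCONSUMER is filtered out of actual plans, you have to go to the wait stats DMVs to see it. A minimal sketch — the DMV and wait type names are real, but whether these numbers are meaningful on your server is for you to judge:

```sql
/* Server-wide parallelism-related waits accumulated since the last
   restart (or since the wait stats were last cleared). */
SELECT
    ws.wait_type,
    ws.waiting_tasks_count,
    ws.wait_time_ms,
    ws.signal_wait_time_ms
FROM sys.dm_os_wait_stats AS ws
WHERE ws.wait_type IN (N'CXPACKET', N'CXCONSUMER', N'CXSYNC_PORT')
ORDER BY ws.wait_time_ms DESC;
```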
So these are things for you to watch out for when parallel queries are not running as quickly as you would like them to. And just to finish things out, this is what the votes skewed table looks like as far as distributions go.
I suppose I should limit this to something a little more useful, because we have where vote type ID is between 1 and 4 in the query, so let’s add that in down here.
Let’s not be silly and foolish with ourselves — let’s make sure we’re looking at exactly the data we care about. This is how all the rows distributed, and because of whatever hash function SQL Server chose when it was doing the parallel page supplying, every single row ended up on one thread.
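That distribution check would look something like this — again, a sketch with my assumed table and column names, not the script from the video:

```sql
/* Row counts per UserId, limited to the VoteTypeIds the demo query
   cares about. With the skew we created, one UserId dominates. */
SELECT
    vs.UserId,
    COUNT_BIG(*) AS records
FROM dbo.VotesSkewed AS vs
WHERE vs.VoteTypeId BETWEEN 1 AND 4
GROUP BY vs.UserId
ORDER BY records DESC;
```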
So that’s what happened there, and that’s how skewed data can lead to skewed parallelism and cause you performance problems. Thank you for watching.
I hope you enjoyed yourselves. I hope you learned something. And I will have a follow-up video on this one eventually about how to fix skewed parallelism for this query. So we have that to look forward to.
But tomorrow’s video is going to be a different one, where I’m going to dig into a way of troubleshooting query performance that goes a little deeper than just looking at query plans and wait stats and stuff — because we’re going to look at a query that doesn’t have anything helpful in that regard.
And we have to go deeper. And we have to look at some more stuff. So anyway, thank you for watching. And I’ll see you in tomorrow’s video, I hope.
I pray we all survive the night. Anyway, goodbye. Bye.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.