Performance Issues With Denormalized Data In SQL Server
Thanks for watching!
Video Summary
In this video, I delve into the intricacies of database normalization and its impact on SQL Server performance. With a bit of a groggy start due to an early morning battle with allergies, I explore how a lack of proper normalization can lead to unexpected query performance issues. Specifically, we look at the Stack Overflow POST table, where questions and answers have different types of relationships and constraints that are challenging for the SQL Server optimizer to understand without explicit guidance. Through practical examples and queries, I demonstrate how these issues manifest and provide insights on optimizing queries in such scenarios.
Full Transcript
Erik Darling here with Darling Data, and boy do I have a big smile on my face today for no good reason. This Monday morning hit like an absolute ton of bricks. I woke up at 3.30 in the morning with every allergy in my head conspiring against me, and it took a very long time to get things calmed to a point where I could fall back to sleep. So if this video is a little weird or low energy, well, you have a pretty good reason why there.
My head is swimming with strange, my eyes have just all these weird floaty things, whatever those things are called. So in today’s video, we’re going to talk a little bit more about how a lack of database normalization can lead to some very weird performance issues. We’ve talked about it a little bit before in the Stack Overflow database.
Again, coming back to the POST table, and why the POST table is so important to this is because within the POST table, we have questions and answers and some other things, but only like certain things can apply to questions or answers, right? Like only answers can have a parent ID because the question is the parent of the answer.
And likewise, only answers in the POST table can be accepted as an answer to a question. Questions can get a different type of voting on them. They can get up votes and down votes just like answers can, but only answers have an additional attribute where they can be marked as the answer.
Get a little green check mark and 15 extra life points to use at the Stack Overflow company store to buy, I don’t know, windbreakers and kayaks and zippos and stuff. It’s like the Marlboro Miles of yore. So let’s look at voting in questions and answers because this is an election year, I hear.
So someone’s going to have to vote, unfortunately. And if we look at what we’ve got in the POST table, there are eight different kinds of POST that you can have in there. The ones that we care about are the ones that make up the majority of the table, which are questions and answers.
Now, like I said before, questions and answers can all be voted on in different ways, but only answers can get voted on with a vote type ID of one, right? So only anything with a POST type ID of two can get a vote type ID of one, right?
Ups and downs can apply to either one. So that’s fine, but it can make life a little weird when you’re querying data in there. Now, like we’re just running straight queries here. So I can show you, I can show you like an easy example of this. But imagine that you had a front end that called the store procedure and passed some parameters in.
You could end up in an even tougher spot than what I’m going to show you with parameter sniffing also involved in this situation. But even without parameter sniffing, you can end up with some weird query plan stuff. So this first query where we’re looking for, you know, anything, any POST type ID of one, which remember again, that’s a question, that has a vote type ID of two.
Right? So the top 2,500 POST ordered by creation date, this finishes nice and quick, right? We get a rather easy, simple looking query plan here. We seek into the votes table. We find some rows. We do a nested loops join to the POST table.
We find some rows and we return the top 2,500 ordered by creation date descending out to our fabulous, wonderful end users. Now, imagine a scenario where, again, if you had a store procedure and like, you know, you know your data, and you love your data, and you give your data big hugs and kisses, and ruffle its hair before it goes to school every morning, gets on the school bus, right, and lucky pencil behind its ear.
Like, you could like, put some safeguards in here. And even like, you know, if you were a reasonably gifted front end designer, you could put some safeguards in here, so that this didn’t happen to you.
But a lot of folks have no idea that, A, this can be a problem, B, that this is a problem that they should sort of guard themselves against. So now we’re going to look at where data doesn’t exist. So now we’re going to look for a vote type ID of 1, like there, and a POST type ID of 1.
And remember, that combination cannot exist across these two tables. Alright, so if we run this, this is going to be significantly slower than the previous query, because SQL Server is just looking and looking and looking for nothing, rows that can’t possibly exist.
They cannot possibly occur in that way, unless we had terribly wrong, erroneous data hanging out in our tables. I mean, God, I think we’re too smart for that, aren’t we? Right?
So if someone, you know, just like either entered the wrong thing, or, you know, like maybe just has no idea, again, most end users are not walking data dictionaries that understand all of these things. Like, like what the numbers are, and what they’re actually looking for.
Especially when it comes to, you know, the vote types, post types, I mean, it’s all weird. So, the query plan ends up looking like this, where we have to, our top spins up, and it starts asking for rows over here, and we have to keep going to the votes table.
We spend almost two full seconds going to the votes table repeatedly saying, okay, give me rows, give me rows, give me rows, give me rows, give me rows, give me rows, I need to look for stuff.
And then we spend almost, well, actually, this is a little over five and a half seconds, seeking into the post table, looking for rows and rows and rows and rows and rows, only to have nothing return. No rows met our criteria.
Right? And like the optimizer, I mean, you know, God bless it, it made some reasonable guesses about, like how many rows might exist and might come together, but it was wrong. Right?
There was zero of 7500. It’s crazy. Crazy. Silly little optimizer. How dare you? But like I said, this sort of, this sort of like domain logic is really difficult to express to the optimizer. Like, I guess you could create an index view, maybe, and use the index view instead.
But that might mess you up even still with all this stuff. But like, there’s no like cross table constraint that does a good job of this stuff. Because the only way to do that is to use a scalar UDF.
And if you put a scalar UDF into a check constraint, you’re an awful person. Your heart is just rotten garbage. It’s, get away from databases.
I’m going to send you the rock collecting book. Don’t do that. Scalar UDFs in computed columns and check constraints have this nasty habit of making, if not every, then nearly every query that touches the table ineligible for a parallel execution plan.
And that can have some really, really serious detrimental effects on the speed and efficiency of your workload. Alright? So don’t, don’t do that.
Don’t be awful in that way. Now, this is partially due to the fact that we are asking for a top. We have requested a top. We, the same thing would happen if we used offset and fetch, because under the covers offset and fetch is just, just a little topperoo.
Uh, but this is all because using top and, you know, a number of other things internally sets a little role goal, for the optimizer. And the optimizer, uh, sort of uses that role, row goal to hedge its bets against how hard it’s going to be to find this limited number of rows.
Remember, like a row goal means I don’t need to find all the matching rows. I just need to find the top X number of matching rows. Right?
Which, if we run this query with the optimizer row goal disabled, this comes back very quickly finding no data. But we choose a much different query plan. Right?
I mean, granted, it doesn’t run for like eight seconds, which is good, but it’s a much, much different query plan. Right? We still have some nice little index seeks here and here, and the whole thing takes around 470 milliseconds. And that’s not bad, but that row goal internally just kind of made things get, get wonky weird.
Right? All right. SQL Server chose a silly little navigational nested loopy easy. Well, you know, just, uh, I think this is going to be super easy to find this data.
All this data exists. I’m just going to find it real quick type bet. But that, that query plan was not a very good bet for the table. Now, uh, I, I’ve, I’ve, I’ve demonstrated this little trick before, but you know, it might be worth, you know, going through again.
If you’re, you know, the type of person who’s not allowed to use hints in queries, like disable, optimize, or row goal, you could do something like this, where you nest some tops, and you get a very similar execution plan. Right?
So this one again, about 400, 500 milliseconds. Same bigger query plan, parallel, lots of seeking and hashing and stuff. But, uh, it finds the lack of matching rows quicker. And if we were to, uh, use the original sort of argument of vote type ID two down here for post type ID one, we would still be able to find all the rows that we cared about relatively quickly.
Doing that. So, uh, this one goes back to using sort of the original plan with just the really easy nested loopy deal in here. But, you know, uh, either way we get, we get the results very quickly.
And the same thing would happen, uh, with the query above with this thing, uh, the disable optimizer row goal hint applied to it. This would also finish rather quickly because SQL Server is able to find this data very fast based on the indexes that we have. So, if there’s a message in here, it’s largely that well normalized data causes fewer problems.
Um, uh, in situations where I see any sort of, like, parent-child relationship or hierarchical relationship, uh, all designed into a single table, uh, the, the, the, the, like, the actual, like, stuff that you know can only happen for a parent or a child or a question or an answer, or the top of the food chain versus the middle and bottom of the food chain, uh, gets very difficult to, uh, get into a table. Get across to the optimizer as far as just like, you know, what data lives and can, is eligible for stuff in other tables, right? Because that’s what kind of happened here is SQL Server just doesn’t know that only these type of, uh, only answers can be accepted as the answer, uh, to a question.
Questions can’t do that. And the vote type ID of one and the post type ID of one, the vote type ID of two and the post type ID of one means absolutely nothing to the optimizer. It’s just making guesses based on what data lives in there.
So, um, if you need help with that sort of thing, I am, I am a SQL Server consultant. I can, I can fix these sorts of issues, apparently, that people pay me for, usually. Uh, they do not pay me for YouTube videos.
These are, these are free for you to, uh, you know, help you understand what’s going on with your SQL Server a little bit better. Uh, in that, uh, in that vein, I did, I do, I do hope you learned something. I do hope you enjoyed yourselves.
I do hope that my, my energy level was acceptable for, uh, the limited amount of highly damaged sleep that I got last night. Uh, and I hope that my, my, my, my, my wording and thoughts were clear to you. So, we have all that stuff going for us.
Uh, if you like this video, nice comments, especially about my new haircut, uh, are appreciated. Uh, if you like this sort of SQL Server content, well, golly and gosh, you can join. Look, let me get the, let me get the most up-to-date number here.
You can join nearly 3,600. We are at 3,599 subscribers to this channel who are blessed with magnificent, angelic notifications from YouTube. Every single time I post one of these free little gems for you.
Ain’t that nice? Anyway, uh, I’m going to go back to my natural state of being slumped in a chair. Um, maybe I’m going to drink some more coffee.
That might be a good idea. Espresso? I’m sorry. Not, not just coffee. Espresso. I go, I, I’m, I go for the hard stuff. All right? I’m a, I’m a hard caffeine user when it comes to that. So, uh, I’m going to go do that now.
And probably, I don’t know, do all sorts of other dorky stuff like take antihistamines and use nasal sprays. Maybe, maybe I’ll do a neti pot. But I’m, I’m not going to film me doing a neti pot because, uh, I know, I know, I know how many, um, sort of sensory issues exist in my, my audience.
So there’ll be no live neti potting. Don’t worry. Anyway, um, thank you for watching.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.