Learn T-SQL With Erik: Getting Defensive with CASE Expressions
Video Summary
In this video, I delve into the importance of writing defensive CASE expressions in SQL queries, drawing from my experience at Darling Data. I highlight how unexpected or problematic data can lead to errors and bad results, emphasizing the need for robust coding practices. Using examples like dividing by zero and handling NULL values, I demonstrate practical techniques such as CASE expressions and the NULLIF function to ensure your code remains error-free even when faced with unforeseen data scenarios.
Full Transcript
Erik Darling here with Darling Data, and in today’s video we are going to talk about defensive CASE expressions. Now, I don’t mean the kind of defense that involves beating up mannequins and whatnot. I mean the kind of defensive CASE expressions that keep you from hitting all sorts of weird errors and bad things in your queries. Writing defensive code is a big part of working with databases, because you never know what’s going to end up in them. So be prepared, right? There’s way more to this in the full material, but this particular module is very near and dear to me, because I have had to do this in so many of my stored procedures where I started running into things I didn’t anticipate. Like, that doesn’t make any sense. Why is that an error now? Oh, there’s a zero. Or, oh, there’s something wrong with this string. Okay, well, we’ve got to be more defensive with our code. If you look through the code that I write, a lot of it is defensive by default, because I have been to too many bad places. Not talking about restaurants in Boston. So let’s create a table and put some initial data into it. Just 10 rows is all we need to start with.
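The actual demo script isn’t shown in this transcript, but a minimal sketch might look like the following. The table and column names (dbo.math_problems, dividend, divisor) are my assumptions, reconstructed from the discussion, not necessarily what’s in the video:

```sql
/* Hypothetical demo table; names are assumptions.
   Note the question raised in the video: should a CHECK
   constraint reject zero divisors at the door? */
CREATE TABLE dbo.math_problems
(
    id integer IDENTITY PRIMARY KEY,
    dividend integer NOT NULL,
    divisor integer NOT NULL
);

/* 10 well-behaved starter rows: 1 / 5 will become our 20%. */
INSERT dbo.math_problems
    (dividend, divisor)
SELECT TOP (10)
    1,
    5
FROM sys.messages AS m;
```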
But now let’s say that we want to figure out the percentage of something from this table. So we’re just going to run this query. Honestly, it’s a very simple table and a very simple query. We’re not going to see anything amazing or groundbreaking here, but I just want to show you what happens when unexpected, or even unwanted, data ends up in our databases. Now, up in the table create statement, there were a couple of notes like: hey, should we have a check constraint to reject values of zero, since zero isn’t a valid data point? Some stuff to think about, but not necessarily stuff that you need to think about immediately. One thing before we go on: when you’re returning results like this, you do need to be explicit about converting data to the type that you expect. If you look at the results that we got back here, we got 20 with a whole mess of zeros.
All right. Do we need all those zeros? Do we want all those zeros? Do those zeros add anything for us? No, but because of the way that SQL Server chose to implicitly convert that division math, we ended up with a whole bunch of zeros. So if we want to control our results, we must convert them to display the way that we want. So here we go with a percent of 20 with only two zeros. And honestly, even those don’t add much to the equation. We could just have an integer there, but this makes it a little bit more interesting.
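In T-SQL terms, the fix is an explicit CONVERT around the division. The table and column names here are my reconstruction:

```sql
/* Implicit typing: dividend * 100. / divisor comes back as a
   numeric with far more decimal places than anyone asked for. */
SELECT
    mp.id,
    percentage = mp.dividend * 100. / mp.divisor
FROM dbo.math_problems AS mp;

/* Explicit conversion: we decide the display type, not SQL Server. */
SELECT
    mp.id,
    percentage =
        CONVERT(decimal(10, 2), mp.dividend * 100. / mp.divisor)
FROM dbo.math_problems AS mp;
```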
But now let’s pretend that a very problematic row appears in our table, and all of a sudden we are inserting the number zero into it. This is going to mess up everything. The whole works are going down. I thought the Titanic had a bad night; wait till we start running this. If we try to run this query now, we are of course going to get a divide by zero error. SQL Server will have returned some results: we get the first 10 rows back, but row 11? No can do. Now, of course, we could write a WHERE clause.
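Sketched out with my reconstructed names, the problematic row and the error it causes:

```sql
/* The problematic row: a zero divisor sneaks in. */
INSERT dbo.math_problems
    (dividend, divisor)
VALUES
    (1, 0);

/* Rows 1-10 come back fine, then row 11 fails with:
   Msg 8134: Divide by zero error encountered. */
SELECT
    mp.id,
    percentage =
        CONVERT(decimal(10, 2), mp.dividend * 100. / mp.divisor)
FROM dbo.math_problems AS mp;
```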
We could add this WHERE clause in and say: hey, divisor greater than zero only. We don’t get any errors, but now we’re missing that row from the results, and we don’t necessarily want that either. So this is where we have to get a little defensive in our code. Again, this is where we get to strike the mannequin, or the boxing dummy, or whatever you call it. What we can do here is use a CASE expression and say: when the divisor equals zero, then just replace the result with 0.00.
The ELSE condition would be to do our CONVERT to decimal(10, 2) and then have our division math in there. This gets us error-free results back, which include row 11, without any issues. No red text: that’s the big important thing. Another way you can do this is by using the NULLIF function.
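The defensive CASE version described above, again using my reconstructed names:

```sql
/* When the divisor is zero, short-circuit to 0.00 before the
   division ever runs; otherwise, do the math as before. */
SELECT
    mp.id,
    percentage =
        CASE
            WHEN mp.divisor = 0
            THEN 0.00
            ELSE CONVERT(decimal(10, 2), mp.dividend * 100. / mp.divisor)
        END
FROM dbo.math_problems AS mp;
```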
And I’m totally okay if you want to use the NULLIF function. Where this gets a little bit wonky, at least for me, is in how NULLIF works: it takes two arguments, and the first one is the thing being tested. If the two arguments are equal, NULLIF returns a NULL; otherwise, it returns the first argument.
So, NULLIF(divisor, 0): if divisor equals zero, then we emit a NULL. This does run successfully, and it produces the results that we want, kinda, except now we just have a NULL down here. And where this gets kind of annoying is that now we have to wrap the whole thing in ISNULL.
And we end up with a much larger expression than we did when we just used CASE to do the WHEN zero THEN 0.00. So we can run this and get back what we want, with ISNULL replacing the NULL at the end with 0.00, and all is well. So again, when you’re writing queries involving division, SUBSTRING, LEFT, or any of the many other things where the results might hit some sort of internal error, whether it’s mathematical or invalid string chopping and byte placement, please be very, very careful.
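The NULLIF version, for comparison (names reconstructed as before):

```sql
/* NULLIF(divisor, 0) turns a zero divisor into NULL, the division
   then yields NULL instead of an error, and ISNULL swaps the NULL
   back to 0.00. It works, but it is a noticeably bigger expression
   than the CASE version. */
SELECT
    mp.id,
    percentage =
        ISNULL
        (
            CONVERT
            (
                decimal(10, 2),
                mp.dividend * 100. / NULLIF(mp.divisor, 0)
            ),
            0.00
        )
FROM dbo.math_problems AS mp;
```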
Please always write your code defensively. Even if you know the data very well as it exists today, that does not stop bad data, or just data different from what you expect, from eventually ending up in there, and all of a sudden your code is throwing problems. So always think ahead a little bit when you’re writing these queries, and always try to write them in as defensive a way as possible so that you don’t end up getting surprised by weird errors.
And all of a sudden you’re getting paged in the middle of the night, and someone’s saying: hey, the ETL process won’t run. Everything’s dividing by zero. It’s all falling over. It’s burning. Help us. Save us. We can’t live without you. And then you go find that one place where you’re doing some division, and you fix it, and then you look like a big hero, and the ETL runs, and, I don’t know.
You have a dashboard in a day. Anyway, thank you for watching. I hope you enjoyed yourselves. I hope you learned something and I will see you in the next video where apparently we’re going to talk about subqueries. What are we going to talk about?
Well, you’ll just have to show up, won’t you? All right. Thank you for watching.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.
Learn T-SQL With Erik: Getting Fancy With CASE Expressions
Video Summary
In this video, I delve into the world of CASE expressions in T-SQL, dispelling any confusion around their terminology by clarifying that they are indeed “CASE expressions” and not “CASE statements.” I walk through a simple count query example to illustrate how using subqueries can lead to multiple table scans, significantly impacting performance. To address this issue, I demonstrate how to use CASE expressions to selectively apply aggregates, ensuring more efficient execution plans and reducing the number of times the table is accessed. The video also explores the nuances between using `COUNT` and `SUM` with CASE expressions, emphasizing the importance of correctly handling NULL values for accurate results. Throughout the discussion, I highlight common pitfalls when defining ranges in these expressions, which can cause data exclusion or double counting, and provide practical advice on how to write robust and efficient queries.
Full Transcript
Erik Darling here with Darling Data, and in today’s video we are going to continue on, this time talking about CASE expressions. I suppose it’s good to get it out of the way, first and foremost, that they are CASE expressions; they are not CASE statements. If you run into anyone in the wild who calls it a CASE statement, you can very safely disregard their opinion about basically anything and everything they might say. So it is kind of a nice litmus test for a person: call it a CASE statement, out you go. Anyway, let’s say that we have just a simple count query like this. The problem with this count query is that we need to count all different sorts of reputations in different ways.
You might end up, if you’re kind of new at T-SQL, writing a query that looks a little bit like this, where you think: well, I know it’ll work, so I’ll just write some subqueries that count for reputation equal to one, then reputation between one and 9,000, and then reputation over 9,000. This will work. This is perfectly fine.
And if we run it, for the small table that we’re working with, it’s fast enough. But if we look at the execution plan, we will see that we hit the Users table once, twice, three times. And each time we hit the Users table (oh, the tooltip went away), we read the whole thing, and we have sort of a not-good time. So if your queries are bigger, more complex, and slow, then for each one of these subqueries you will have a 3x slowdown for your query, rather than just hitting the table once.
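The shape of that subquery version, sketched out. The column names follow the Stack Overflow sample database, which this demo appears to use; the exact range boundaries in the video may differ:

```sql
/* Three separate subqueries, three separate scans of dbo.Users. */
SELECT
    reputation_of_one =
        (SELECT COUNT_BIG(*) FROM dbo.Users AS u
         WHERE u.Reputation = 1),
    reputation_under_9000 =
        (SELECT COUNT_BIG(*) FROM dbo.Users AS u
         WHERE u.Reputation > 1 AND u.Reputation < 9000),
    reputation_over_9000 =
        (SELECT COUNT_BIG(*) FROM dbo.Users AS u
         WHERE u.Reputation >= 9000);
```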
Now, what you can do is use CASE expressions to apply aggregates selectively. But you have to be a little bit careful with which aggregate you choose. Up here, we’re really just getting a count of all the users whose reputation falls into these various segments. If you use COUNT like this, you’re going to get wrong results back.
Right? Because we know when we ran this query for reputation equals one, we got 1,090,000 or so users back. If we run this query, what we’re going to get back is 2.46 million. Why? Because COUNT is counting the ones and the zeros. COUNT doesn’t exclude zeros; a zero is still a value, so it gets counted. If you wanted to use COUNT accurately for this, then you would have to suppress non-one reputations with NULL.
Then you could do this and get the right number of rows back: we get that 1,090,000. A lot of developers find it a little more intuitive to use SUM rather than COUNT, because with SUM you’re saying: if reputation equals one, then add one; if reputation doesn’t equal one, then add zero. That gives you an easier time getting the correct count back.
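Side by side, the wrong COUNT, the NULL-suppressed COUNT, and the SUM version (Stack Overflow-style names assumed):

```sql
/* Wrong: COUNT counts every non-NULL value, zeros included,
   so this just counts all rows in the table. */
SELECT
    c = COUNT_BIG(CASE WHEN u.Reputation = 1 THEN 1 ELSE 0 END)
FROM dbo.Users AS u;

/* Right: non-matches become NULL, and COUNT ignores NULLs. */
SELECT
    c = COUNT_BIG(CASE WHEN u.Reputation = 1 THEN 1 ELSE NULL END)
FROM dbo.Users AS u;

/* Often more intuitive: add 1 for a match, 0 for everything else. */
SELECT
    c = SUM(CASE WHEN u.Reputation = 1 THEN 1 ELSE 0 END)
FROM dbo.Users AS u;
```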
What that’s leading up to is that you can do something even more interesting with both SUM and COUNT, though SUM is usually a little more common for this: you don’t have to supply a static value for your THEN or ELSE expressions. You can actually use a column value. So let’s say we wanted to sum up everyone’s reputation for 2013.
We could say: where creation date is greater than or equal to 2013-01-01, and if that’s true, then we sum the reputation column; if that’s not true, then we add in a zero. So you can sum up a column that way. Now, what that’s leading up to is that when you want to do this across a range of things, there’s a very easy way to hit the entire Users table, but only hit it once, and just use CASE with the selective aggregates to get what you want.
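A sketch of that, with the date filter inside the CASE. The names and the exact date boundary are assumptions from the transcript:

```sql
/* THEN doesn't have to be a constant: here it's the Reputation
   column itself, so we sum reputation only for users created
   on or after 2013-01-01. */
SELECT
    reputation_since_2013 =
        SUM
        (
            CASE
                WHEN u.CreationDate >= '20130101'
                THEN u.Reputation
                ELSE 0
            END
        )
FROM dbo.Users AS u;
```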
The only thing that I’d say here is: when you’re writing stuff like this, be really careful, because you can get the ranges wrong. This query is very simple, so it’s hard to mess up. But as you write more complicated expressions, you have to be very careful that you don’t write overlapping ranges, at least where you don’t intend to.
It would be very easy to make a mental error and accidentally say less than 9,000 here and greater than 9,000 here, and then we would miss everyone with a reputation of exactly 9,000; we would leave them completely out of the results. Same thing if we accidentally did greater than or equal to 9,000 here: we would be double counting the 9,000s across both of these.
So when you’re writing these things, take your time, and be careful when you’re setting up these ranges that they don’t overlap or exclude things accidentally. But this can be a very handy way of getting a full result set, like we saw, without having to write repetitive subqueries and without having to hit the Users table multiple times. This does it as just one big scan of the Users table, which makes life a lot easier and simpler, especially if the query you need to pull these things from gets big and complicated.
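Put together, the single-scan version might look like this. Note the bucket boundaries meet exactly, with no overlap and no gap at 9,000 (names assumed, as above):

```sql
/* One scan of dbo.Users, all the buckets at once. */
SELECT
    reputation_of_one =
        SUM(CASE WHEN u.Reputation = 1
                 THEN 1 ELSE 0 END),
    reputation_under_9000 =
        SUM(CASE WHEN u.Reputation > 1
                 AND  u.Reputation < 9000
                 THEN 1 ELSE 0 END),
    reputation_over_9000 =
        SUM(CASE WHEN u.Reputation >= 9000
                 THEN 1 ELSE 0 END)
FROM dbo.Users AS u;
```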
This is just a very simple single table. If you have a big long set of joins, and you’re joining in views and functions and who knows what else, then just having to hit that once, rather than three or four more times, can be a lot easier on your server. Anyway, that’s just a little bit about CASE expressions and some more interesting ways of using them.
Thank you for watching. I hope you enjoyed yourselves. I hope you learned something. And I will see you in the next video where we will talk about using case expressions defensively in queries to avoid errors and other strange issues. So, hope you’ll join me there. Anyway, thank you for watching. Goodbye.
I work for a small startup (7 ppl, 3 on sql server including the boss). Our software is a complex ERP system with business logic heavily implemented on sql server. We are planning to move to Postgres. Is postgres a good platform for complex business logic and performance? Can a small team achieve this migration for a complex software? Your expert opinion?
Hey Erik. What deciding factors should be considered when marking an entire table for Update Statistics? Is it based on how fast the data changes, or is it okay to run the Update Statistics overnight, irrespective?
Hi Erik, I have top wait type SOS_SCHEDULER_YIELD which is way higher than the second one: SOS_SCHEDULER_YIELD: 75h of 230 hours sample, 1.5ms on avg. Second, PAGELATCH_SH, is 5h total. Should I focus mostly on the most CPU-intensive queries (in cache), or can this also be connected with “high CPU ready time on VM side” or a lack of proper indexes that causes too many index/table scans from memory? Strange is that I see this wait (I’m logging sp_Blitz every 15 min) as top also on weekends when there is no heavy load in the DB. Should I start logging sp_PressureDetector to see the number of worker threads used and try to correlate it with high SOS_SCHEDULER_YIELD wait? Thank you!
I see Brent is going to put AI in his procedures. Are you planning on doing that too?
Why would I use dynamic SQL to defeat parameter sensitivity, when I can just use OPTIMIZE FOR?
Video Summary
In this video, I dive into some of the most pressing questions submitted by our community during my office hours session. We tackled topics ranging from migrating to PostgreSQL for complex business logic and performance considerations, to optimizing SQL Server statistics updates and identifying CPU-intensive queries. It was a lively discussion with plenty of insights and advice shared among us. Whether you’re considering a database switch or looking to fine-tune your SQL Server environment, there’s something here for everyone. Additionally, I share my thoughts on the future of AI in SQL Server procedures, explaining why I believe it’s more practical to leverage existing tools like Copilot rather than building custom solutions. The video is packed with valuable information and actionable steps that can help you improve your database management skills.
Full Transcript
Erik Darling here with Darling Data, and, well, been huffin’ again, which means it’s time for some office hours. I will answer five of your questions that you submit through this link right here, this beauty, which is down in the video description. Speaking of the video description, if you would like to become a member, a paying member of this channel, to support my efforts to bring you all this wonderful, high-quality SQL Server content, you can sign up for a membership in the exact same place. I don’t know why this thing has me turning so pink right now. It’s a little unsettling. I am not normally radiation-poisoning red, but, I don’t know, maybe it’s just how I feel inside. If you don’t care to support me with money, you can like, you can comment, you can subscribe, and you can ask me questions for these here episodes of Office Hours, which is always a highlight. If you need help from a SQL Server consultant, I am one of them. One of too many, I think. When I look around LinkedIn, it’s absolutely saturated with idiots who have “expert” in their LinkedIn profiles, and then you see the things that they say and write about, and you wish that you could bring back firing squads.
None of them would make it. If you need help with SQL Server health checks, performance analysis, hands-on tuning of your SQL Server workloads, responding to SQL Server performance emergencies, and of course training your developers so that you don’t have those anymore and you can go back to drinking or huffing compressed air in peace, I’m available. And as always, my rates are reasonable. If you would like some performance tuning content from me, I have 24 hours of it available. With the discount code, it’s about $150, and that’s pretty good. That $150 will last you the rest of your life. So good for you if you choose to look that far into the future. Bet you have a 401(k) or something. A real whiz kid.
I, of course, have a new course coming out: Learn T-SQL With Erik. That’s me. That’s Erik right there. This fella. All this content is being recorded leading up to Kendra and me doing our PASS pre-cons, so attendees to those will get access to all of this as part of their admission. You buy tickets, you come, you get access. The course is available now for everyone else at a pre-sale price of $250 that will go up to $500 when everything is said and done. The videos will start dropping shortly. Well, actually, this is going out after May 10th, so videos will have already started dropping by the time this thing sees the light of day.
I record the office hours ahead of time and schedule them for every Monday, so that’s why sometimes the dates are a little funky on me. I apologize, but it would be a lot of work to fix that, so we’re not going to do it. I did manage to fix this slide, though, because New York City’s SQL Saturday will have already happened, so we no longer need to talk about that.
But we still do need to talk about PASS being on tour this summer. New York City, August 18th to 20th. Dallas, September 15th to 17th. And Amsterdam, October 1st to 3rd.
Of course, all of that is leading up to just warming us up, getting us nice and loose and limber and getting the blood flowing and muscles in order for PASS Data Community Summit in Seattle, November 17th to 21st. So we’ll be doing as much as we humanly possibly can to do all that.
But with that out of the way, let’s answer some questions here on this office hours episode. Oh, we got some long ones in here. Boy, oh boy, oh boy.
Y’all like typing so much. All right. I work for a small startup, seven pipples, three on SQL Server, including the boss. That’s a mistake.
Never let your boss use SQL Server. Our software is a complex ERP system with business logic heavily implemented on SQL Server. We are planning to move to Postgres. Is Postgres a good platform for complex business logic and performance?
I don’t know. I don’t use Postgres. But what I can tell you is the number of people who have said the exact same thing to me. We’re going to move to Postgres.
Postgres. One of my oldest clients: I started this consultancy in 2019, I have been a consultant since April... no, March of 2019, to today. That client, back in 2019, said: we’re going to be on Postgres in six to eight months. You know where they’re not?
Postgres. Postgres. If you want to dip your toes in it, try moving some little pieces of it over to Postgres. See how it goes. See how it handles whatever tiny little bits of logic you have. Maybe it’ll go well.
Maybe it’ll go not well. But if your idea is to just move the whole thing over, I will see you in six to eight months. You goofballs.
Where do you get this stuff from? Boy, oh boy. Oh, that’s the wrong ZoomIt option. Hey, Erik.
Hey. How you doing? What deciding factors should be considered when marking an entire table for update statistics? Is it based on how fast the data changes or is it okay to run the update statistics overnight, irrespective?
Oh, gosh. Are you… So, like, when… Reasons to update statistics, you know? You got a table that, you know, maybe sees a lot of modifications.
Maybe the statistics fall out of date pretty quickly. The big thing that I need to tell everyone about updating stats is that it isn’t generally there to help you with data that was already in the stats histogram and has since changed.
Usually, the beauty of update stats is to get new data into the histogram that is not currently represented in it. So, if you are constantly adding new data to the table, then I think it is a good idea to update your statistics for that table. So, you know, you don’t have to deal with any off-histogram searches against your table that might not have any representation.
It might get you a pretty bad cardinality estimate. So, yes, update your statistics. But, you know, overnight is fine.
But I’ve worked in some environments where we had to update stats way more often than that. And some other environments where not only did we have to update stats way more frequently than overnight, like every hour, half hour or so, we also had to do it at a very specific sampling percent in order to get a good histogram for things.
So, you know, you might find that the overnight… Just the overnight stats update is not enough. Don’t be afraid to update your stats more often.
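For reference, the moving parts here are plain UPDATE STATISTICS calls. The table name is an example, and the right sampling percent is workload-specific, as described above:

```sql
/* Default sampling, the usual overnight-job flavor. */
UPDATE STATISTICS dbo.Orders;

/* A fixed sampling percent, for when the default sample
   produces a histogram that misleads the optimizer. */
UPDATE STATISTICS dbo.Orders WITH SAMPLE 75 PERCENT;

/* Or read everything, when only a full scan gets it right. */
UPDATE STATISTICS dbo.Orders WITH FULLSCAN;
```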
All right. It’s hard to tell where this one ends. That’s a lot of writing. All right.
Hi, Erik. Hello. I have top wait type SOS_SCHEDULER_YIELD, which is way higher than the second one. SOS_SCHEDULER_YIELD is 75 hours of the 230-hour sample. PAGELATCH_SH is five hours total.
Should I focus mostly on CPU-intensive queries? Or can this also be connected with high CPU ready time on the VM side? So you’re going in a few different directions here.
And it seems like you’re doing a lot of analysis, but not a lot of actual fixing of things. I think you are generally right that you should be tuning CPU-intensive queries. What I’m concerned about here, though, are a couple things.
One, you have that much SOS_SCHEDULER_YIELD, 75 hours, and then the next wait down is PAGELATCH_SH at five hours. There’s no real mention of any CX waits: CXPACKET, CXCONSUMER, any of those.
So I’m concerned that you may have MAXDOP set to one, either at the server or database level. Perhaps you have cost threshold for parallelism set to some sky-high value. Or maybe you have just a crap ton of scalar UDFs that force your entire query to run single-threaded.
But so if it were me looking at your server, first I would be trying to figure out where’s all the parallelism? Because it doesn’t seem like you have any, at least not that you mentioned. Maybe there are some, you know, like background tasks that are eligible for parallelism that don’t bring it up past the five-hour mark in 230 hours of uptime, which would be interesting.
But that would be my main concern, is why you’re not getting any parallel queries on this. I don’t even know if that’s a thing for SharePoint anymore, to be honest with you. But, like, besides that, yeah, I mean, like, generally, yes, I would go after CPU.
I would go after queries that have the highest CPU on here, and I would also look at queries that have a very high execution frequency. So you could use sp_QuickieStore to look both by average CPU and by executions.
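As a sketch, that might look like the calls below. sp_QuickieStore’s @sort_order parameter is real (it shows up in the changelog later in this post), but double-check the accepted values and other parameters against the procedure’s documentation:

```sql
/* Slow stuff: sort Query Store data by average CPU. */
EXEC dbo.sp_QuickieStore
    @database_name = N'YourDatabase',
    @sort_order = 'cpu';

/* Choppy stuff: sort by execution count to catch queries
   that are individually cheap but run constantly. */
EXEC dbo.sp_QuickieStore
    @database_name = N'YourDatabase',
    @sort_order = 'executions';
```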
That would give you some pretty good insight into both the stuff that takes a long time when it runs and the stuff that, like, uses a lot of, like, very choppy CPU, right? Things like scalar UDFs would be especially prone to that, especially because, you know, like, they don’t run once per query. They run once per row that they have to process, which leads to a lot of additional scheduling.
I would also potentially, maybe not, like, very, very concerned, but I would also maybe be a little worried that you don’t have a lot of CPUs on this thing. And maybe queries might be waiting a long time to get CPU attention. That would tie into, like, the high CPU ready time on the VM side, potentially.
But, you know, that’s more than I can dig into with the text you’ve given me. In general, I wouldn’t add more observer-ness to this. I wouldn’t start logging more stuff.
I would just start going after meaningful queries that, like, either, like, run and take a long time or run a lot. They, you know, like, again, scalar UDFs, loops, cursors, like, loops and cursors in scalar UDFs, stuff like that. There’s a lot of things that I would go after there.
All right. Next question. I see Brent is going to put AI in his procedures. Are you planning on doing that too? So, no.
And so, like, you know, AI is, or LLMs, however you wish to refer to it. They’re always an interesting topic. You know, you don’t really want to watch where the ball is. You want to watch where the ball is going, right?
It’s not, like, just because things are, like, not great now doesn’t mean that they won’t be great down the line. You know, AI and LLMs, they are constantly improving. But we as humans tend to stay relatively the same.
We don’t make big leaps and bounds in our abilities very quickly. And when you think about a lot of the technology that has really fundamentally changed the way humans interact with the world around them: cars were invented, but it was a long time before cars were widespread.
You know, PCs were invented, but it was a long time before everyone had a PC. Never mind a PC sitting in their pocket, right? Like, cell phones.
Then the internet was around for a long, long time before everyone’s house was wired with it, and now we have satellite internet from space. So with AI, whatever it is now, there’s going to be a lag between the current hype cycle of everything-AI-all-the-time and how it actually ends up becoming a part of our lives. Who knows. But for me personally, putting AI into this stuff doesn’t make a lot of sense. It would be highly speculative for me to say this, but I saw Brent’s post about him doing it.
But it would only make sense to do that if I were planning on releasing my own, like, Erik Darling SQL Server agent that would, like, you would pay me to answer questions for you from these things, right? Like, that’s the only way it would make sense for me to do it. I don’t know if Brent’s doing that.
I don’t know if that’s his plan. Just think about most people’s usage of LLMs right now. Copilot is going to be in SSMS, right? So whatever results you get from an SSMS query, it’s not a far trip to Copilot for whatever comes back, whether it’s query plans or indexes or whatever.
Most people will have access to an LLM sitting next to them, right? You could copy something from SSMS and put it in a browser tab. So it doesn’t make sense for me to build anything in; if someone wants to take those results and do something with them in an LLM, it’s not a very long trip for them to do that on their own.
So, like, I don’t have any plans of, like, creating or introducing my own AI agent yakky mouth thing. So, like, I wouldn’t build anything into mine unless there was a way for that to, like, there’s a lot of work, right? Like, the security and figuring all that out.
A lot of work on that side. So I wouldn’t necessarily do that unless it were going to be, like, a profitable venture for me. So maybe Brent has something figured out that I don’t, if it has to do with marketing, he probably does.
There’s a 99% chance that he’s far ahead on that. But for me, I can’t see a lot of upside to putting that into my procedures in, at least as things currently stand with me and my life. All right.
So, last question here. Why would I use dynamic SQL to defeat parameter sensitivity when I can just use OPTIMIZE FOR? Well, I’m not sure why you’re asking me this. It’s a bit silly.
If you can figure out a way to get your parameter sensitivity problems fixed just by using OPTIMIZE FOR, go for it. The reason I present the options that I do is that there are situations where OPTIMIZE FOR does not solve your parameter sensitivity issues. Perhaps it just gives you one big plan that happens to make you happy in the moment.
I don’t know. You also left off what you’re optimizing for. If you’re going to say OPTIMIZE FOR UNKNOWN, I’m going to come to your house, and I’m going to perch on your bedpost and stare at you while you sleep.
Because that’s a nightmare. And you deserve nightmares for what you do to SQL Server. If you’re optimizing for specific values, that can certainly work.
I’ve done it in the past. But using dynamic SQL to fix parameter sensitivity issues is a good tool to have around, just in case any of that stuff doesn’t work. And if you’re talking about my defeating parameter sensitivity with dynamic SQL video, one of the methods I use in there actually does use a dynamic OPTIMIZE FOR.
So you can combine both of those into one big happy family of stuff to do. So if you can get away with optimizing for some values and fixing parameter sensitivity, great. You can stop right there.
If you run into a problem that OPTIMIZE FOR doesn’t solve, then you might need to reach a little deeper into your tool bag. Maybe even down into the dynamic SQL pocket and do a little bit more typing. But anyway, that’s it for today.
That was all five questions. Thank you for watching. I hope you enjoyed yourselves. I hope you learned something. And I will see you in the next office hours. And also the next video and so on and so forth into eternity.
All right. Thank you for watching. Goodbye. There’s the button. There we go. All right.
Now, now, now. All right. We’ll do that one more time since I didn’t stick the landing. Thank you for watching. Goodbye. Goodbye.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.
It’s been a busy few months working on scripts in my GitHub repo, including adding two new members to the family.
sp_IndexCleanup: Easily deduplicate indexes — scripts all the changes out for you!
sp_PerfCheck: A high-level review of performance-related settings and configurations!
Here are the main changes from lovely contributors:
sp_QuickieStore: Appended _ms where it was missing for avg_cpu_time by @ReeceGoding in https://github.com/erikdarlingdata/DarlingData/pull/600
sp_QuickieStore: Moved validation of @sort_order to be much earlier on. by @ReeceGoding in https://github.com/erikdarlingdata/DarlingData/pull/602
sp_QuickieStore: Moved special sorting columns for special sort order values to be jus… by @ReeceGoding in https://github.com/erikdarlingdata/DarlingData/pull/604
sp_QuickieStore: Made plan hash, query hash, or sql handle show when the parameter for filtering them out is passed in by @ReeceGoding in https://github.com/erikdarlingdata/DarlingData/pull/608
sp_QuickieStore: Documents that @format_output = 1 removes most decimals. by @ReeceGoding in https://github.com/erikdarlingdata/DarlingData/pull/606
Adding support for indexed views in addition to indexes on tables. by @FirstCall42 in https://github.com/erikdarlingdata/DarlingData/pull/610
sp_QuickieStore: Move regression validation to just after sort order by @ReeceGoding in https://github.com/erikdarlingdata/DarlingData/pull/612
Made sp_HumanEventsBlockViewer not error out when system_health is used by @ReeceGoding in https://github.com/erikdarlingdata/DarlingData/pull/618
In this video, I delve into the world of T-SQL paging queries, specifically focusing on the use of `TOP` and `OFFSET FETCH` in SQL Server. I explore how these techniques can be used to efficiently retrieve large sets of data while maintaining performance. I demonstrate practical examples using Common Table Expressions (CTEs) to illustrate both methods, providing a clear understanding of when and how to apply them effectively. Whether you’re just starting out with T-SQL or looking to enhance your skills, this video offers valuable insights into optimizing query performance for pagination scenarios.
Full Transcript
Erik Darling here with Darling Data. And today’s video we’re going to continue our T-SQL love fest, where I’m previewing some of the content from the beginner portion of the T-SQL course that I’m working on. You can of course purchase this course down in the video description for the pre-sale price. Once the course is fully published with the beginner and the advanced material, the price will go up to double what it is now. So now is a good time to buy. And of course, if you are attending the pre-cons that Kendra and I are giving at PASS Data Summit in Seattle this November, you will get access to this content for free with your admission to our pre-con. Anyway, let’s talk a little bit about OFFSET FETCH, and first TOP, at least in the context of the one place where they really compare well, which is writing paging queries. Now, assuming that you’ve written TOP, like SELECT TOP some number of rows from some query, ORDER BY something (there had better be an ORDER BY in there; at this point I have to hammer home that you must ORDER BY when you use TOP).
But assuming that you’ve written TOP without WITH TIES or PERCENT or anything, you get that number of rows back. If you say SELECT TOP 100 rows, but you only have 95 rows that qualify for your query results, SQL Server does not invent five rows just to make you happy. If you use TOP WITH TIES, though, you can get some kind of weird stuff back. By weird, I mean that if you’ve done it enough, you know what to expect, but it might look strange to someone who has not done it before. So just to show you what I mean there, if we say SELECT TOP (1) WITH TIES where Reputation equals 3622, we say top one, but there are 11 ties. So we get 12 rows back.
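The demo being described looks something like this, assuming a Stack Overflow style Users table with the usual Id, DisplayName, and Reputation columns:

```sql
/* TOP (1) WITH TIES: every row that ties with the "first" row
   on the ORDER BY column also comes back */
SELECT TOP (1) WITH TIES
    u.Id,
    u.DisplayName,
    u.Reputation
FROM dbo.Users AS u
WHERE u.Reputation = 3622
ORDER BY
    u.Reputation;
```

Every row that survives the WHERE clause has the same Reputation, so they all tie on the ORDER BY column. In the demo dataset, all 12 people with that reputation tie, and all 12 rows come back.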
All right. Look over here. There are 12 people with a reputation of 3622. Guess what? They all tied. So we get for top one with ties, 12 rows in this case. Kind of fun. But top does not have a natural syntactical offset in T-SQL, right? Because T-SQL is the only language that has top in it. Most other languages use limit, but most others also have incorporated offset fetch at this point. What offset fetch gives you is the ability to limit, not only limit the results the way top would, but also skip rows within that result set to page through things.
So the FETCH portion of OFFSET FETCH is like the TOP n rows, right? You say FETCH NEXT n ROWS ONLY. That gives you the TOP-ish behavior. The OFFSET is what gives you the skip behavior. So if you want to have skip behavior with TOP, you have to do a little bit more work, generally using the ROW_NUMBER window function. So the way to think about paging queries in general is that first you want to mentally separate the columns in your mind between relational columns and informational columns.
By relational columns, I mean things that you’re going to perform some relational activity on, whether it’s a where clause, whether it’s a join, whether it’s a group by, whether it’s an order by, stuff like that. Those are your relational columns. And then you have your informational columns. Your informational columns are just the ones that you show the end user. So when you say, you know, select these columns, those select columns are your informational columns.
There might be some overlap, of course, but in general, if it’s only in your select list, it’s informational. If you are saying like, you know, from this table where this column equals something, join to this other table on this expression, then those are relational columns. When you’re writing paging queries, you want to avoid getting the informational columns for as long as possible.
You want to do all of your initial filtering, all of your initial joining, any grouping and ordering, all that good stuff, with as narrow a set of columns as possible. This gives you a much better chance at consistent index usage, and at not blowing your server out with gigantic memory grants from having to sort a big long list of every string column you’ve ever put into a table as VARCHAR(MAX) because your developers are lazy. You avoid a lot of performance issues with paging queries that way.
So that’s really the main gist of it. Now, without getting too deep into indexing for paging queries, because that is a big, giant, complicated, messy hornet’s nest of a topic, we’re going to skip ahead. Like when people say a house has good bones, we’re going to say this paging query has good bones, right?
So what we’re going to do here is show you what a good starting point for a paging query looks like. All right. So notice that we’re only selecting one column. That is the clustered primary key of the Posts table.
And we’re going to, again, you know, avoid selecting all the columns you want to show people for as long as possible. In this case, thankfully, it’s not too long, right? Because once we use this CTE to find the rows that we care about, filtering to the owner user ID that we care about, and applying our offset in our fetch, we are going to join that CTE, which is called paging, back to the post table just for the rows that we found in the CTE.
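Reconstructed from that description, the OFFSET FETCH version of the paging query would look roughly like this. The variable names and page values are my assumptions, not pulled from the demo script:

```sql
DECLARE
    @OwnerUserId integer = 22656, /* hypothetical user */
    @PageNumber  integer = 1,
    @PageSize    integer = 100;

WITH
    paging AS
(
    /* Relational work only: just the clustered PK,
       filtered and ordered as narrowly as possible */
    SELECT
        p.Id
    FROM dbo.Posts AS p
    WHERE p.OwnerUserId = @OwnerUserId
    ORDER BY
        p.LastActivityDate,
        p.Id /* unique tiebreaker */
    OFFSET (@PageNumber - 1) * @PageSize ROWS
    FETCH NEXT @PageSize ROWS ONLY
)
SELECT
    p.* /* informational columns, only for the page we found */
FROM paging AS pg
JOIN dbo.Posts AS p
    ON p.Id = pg.Id
ORDER BY
    p.LastActivityDate,
    p.Id;
```

Note that the join back to dbo.Posts only happens for the handful of keys the CTE located.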
All right. So if we run this, we will get back our necessary required 100 rows, and we will be very, very happy with the results. This was speedy enough in this case. Was it the fastest query I’ve ever written? I don’t know. I don’t know what the fastest query I’ve ever written is.
Probably a lot of zeros involved, but it gives us the results that we want, right? And it’s set up the way that we care about. There are two ways you can do this with top. The first way I’m going to show you just uses a single CTE, and in that single CTE, we generate a row number over results from the post table where the owner user ID equals the owner user ID that we care about out here.
Once we’ve located those rows, then we are going to do some math, some very, very fancy math, on the row number that we created in our CTE. That is f dot n. n is our row number right here.
Okay? And we need to do a little bit of math on our page number and page size variables so that we get the correct page, the correct number of rows positioned in the correct point in the results.
And, of course, we will need to finish up by ordering by our wonderful last activity date and tiebreaker ID columns. So if we run this, we’ll get the same results that we got back from the last query. And, again, lickety-split pretty quick there.
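Under the same assumptions about names and values, the single-CTE TOP variant being described might look like:

```sql
DECLARE
    @OwnerUserId integer = 22656,
    @PageNumber  integer = 1,
    @PageSize    integer = 100;

WITH
    fetching AS
(
    /* Narrow relational work plus a row number
       over the presentation ordering */
    SELECT
        p.Id,
        n = ROW_NUMBER() OVER
            (
                ORDER BY
                    p.LastActivityDate,
                    p.Id
            )
    FROM dbo.Posts AS p
    WHERE p.OwnerUserId = @OwnerUserId
)
SELECT TOP (@PageSize)
    p.*
FROM fetching AS f
JOIN dbo.Posts AS p
    ON p.Id = f.Id
WHERE f.n > (@PageNumber - 1) * @PageSize /* the paging math on f.n */
ORDER BY
    p.LastActivityDate,
    p.Id;
```

The WHERE clause on f.n does the skipping that OFFSET did in the previous query, and TOP does the fetching.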
Same basic deal, only selecting the ID column and then joining to the Posts table outside here. The second way is a little bit more complicated. I’ve told this story many times, but way back in, like, 2009, I had only been working with SQL Server a very short amount of time, but I was tasked with writing a paging query, and I found this blog post by Paul White.
And this was the method that I learned from Paul in that blog post, where you use not one, but two stacked CTEs. Now, if you’ve heard me talk about CTEs before, I do want to say that this format of writing CTEs, where one CTE draws from the one before it, doesn’t have the performance issues that I’ve talked about with CTEs a lot in other videos.
If you haven’t seen those, hang on, because there are CTEs coming up in a few videos here. But if we run this query, what we’re going to see is that in this first CTE, again, we’re only selecting the narrowest set of rows that we need, but up in the TOP, we do the page number times the page size.
And what this gives us is a somewhat bigger result for us to page through. So, currently we’re on page number one, with a page size of 100.
So, page number times page size gives us 100. If we wanted page two, then it would be two times 100, so TOP 200.
So, this part of the CTE does give us a longer result set the deeper we go in. But what we immediately do is filter that out here, where we only select the top page size from the previous CTE, where the row number is greater than the page number minus one times the page size.
So, for, like, page one, we get the 100, page two, we get 200, all that other stuff. And we’re going to do the same thing out here, where we join back to the Posts table to get all of the columns that we care about, again ordered by the two columns that we care about: last activity date for the initial presentation ordering of the results, and then the ID as the tiebreaker, just in case there are any dupes in last activity date.
So, if we go and run this query, we will get, again, the same results back, and it will be just as lickety-split-quick as the other two. Right.
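As best I can reconstruct it from the walkthrough, the two stacked CTEs method has this shape (this is the pattern I associate with Paul White’s paging post; the exact names are mine):

```sql
DECLARE
    @OwnerUserId integer = 22656,
    @PageNumber  integer = 1,
    @PageSize    integer = 100;

WITH
    keys AS
(
    /* Grab page number times page size keys: a bigger pile
       the deeper into the results we go */
    SELECT TOP (@PageNumber * @PageSize)
        p.Id,
        n = ROW_NUMBER() OVER
            (ORDER BY p.LastActivityDate, p.Id)
    FROM dbo.Posts AS p
    WHERE p.OwnerUserId = @OwnerUserId
    ORDER BY
        p.LastActivityDate,
        p.Id
),
    page_keys AS
(
    /* Immediately trim back down to just the page we want */
    SELECT TOP (@PageSize)
        k.Id
    FROM keys AS k
    WHERE k.n > (@PageNumber - 1) * @PageSize
    ORDER BY k.n
)
SELECT
    p.*
FROM page_keys AS pk
JOIN dbo.Posts AS p
    ON p.Id = pk.Id
ORDER BY
    p.LastActivityDate,
    p.Id;
```

The first CTE grows with page depth, the second trims it back to one page, and only then do the wide informational columns get touched.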
We’re not doing anything too big and crazy here, but, you know, as paging queries get bigger and more complex, that’s when we need to start thinking about other things like indexing, and, you know, perhaps some additional query tricks to stabilize result sets.
But for most paging queries that you write, you will want to use one of these three methods and just sort of test them and figure out which one performs the best, based on what your normal workload is and what users are requesting.
Again, OFFSET FETCH tends to suffer a little bit performance-wise as you get deeper into result sets, but if your users aren’t typically going very deep into result sets, then it’s not really a concern.
Anyway, thank you for watching. I hope you enjoyed yourselves. I hope you learned something. And I hope that you’ll buy the course, because it’s a good one and I put a lot of good stuff into it.
Anyway, thank you for watching. Goodbye. Goodbye.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.
In this video, I delve into the intricacies of non-deterministic order by clauses and unexpected query results in T-SQL, drawing from my upcoming course that is currently on presale. With the material nearing completion and tech review underway, I’ve already started recording some sections, which will begin appearing on the training site shortly. This video serves as a sneak peek for those who might be attending the pre-cons at PASS Data Summit in Seattle, where you’ll gain free access to this valuable content. For everyone else, now is the time to act if you want to avoid doubling your regret later: grab the presale deal before prices go up. We explore how non-deterministic order by clauses can lead to inconsistent query results and discuss best practices for ensuring stable outcomes in T-SQL queries. Through practical examples, I demonstrate the importance of incorporating unique columns as tiebreakers and highlight common pitfalls that developers might encounter.
Full Transcript
Erik Darling here with Darling Data, and in today’s video we are going to go over a little bit of the material from my T-SQL course. You know the usual spiel about this stuff: it is on presale right now. The course is completely written at this point, I’m just finishing up the last bits of tech review and tidying up, and I’ve already started recording some of the already done content. So that’ll start appearing over on the training site in the next week or so. Of course, this is the companion material to the stuff Kendra and I will be teaching at PASS Data Summit. If you’re attending the pre-cons in Seattle, then you will get free access to the material. If you’re not attending the pre-cons in Seattle, you get nothing. You have to buy it just like everyone else. Once the material is fully published, of course, it will be going up to the full price, which of course will be double the presale price. So, get in now to have half the regret later on. Who knows what you could have done with the other half of the money that would have been good for you. You could have bought so many lottery tickets. Anyway, we’re going to talk a little bit about ORDER BY. And if you find yourself watching this material and saying, yeah, Erik, I know, well, congratulations. This is the beginner material and you just might not be a beginner, which is wonderful for you, right? But for those who have not yet progressed to your lofty place in the T-SQL world, this sort of education is necessary. So, we’re going to talk a little bit about ORDER BY, and specifically non-deterministic ORDER BYs.
And then, of course, expecting queries to return in a specific order when there is no ORDER BY, which is another thing that people tend to do. They observe a query running and returning results in a specific order. And then if it ever doesn’t return the results in that order, they freak out. They’re like, but it was all this way before. So, we’ll do a little bit of that.
So, first, let’s look. Query plans are turned on. Look at that. Baked right into the demo. You’d think I’ve been practicing or something. If we run this query a few times, what we’re going to see is sort of in general, the results, like just over here in the ID column, are going to come back in all different orders. And, of course, the reason for that is that this particular sort order is non-deterministic.
There’s no index on reputation currently, and at the beginning part of the values for the reputation column, the values are all ones. Right. And so, all we’re doing is ordering by the ones, but the ones can come back in any order that they want. Right. As long as they’re just ordered over here. So, if we keep running this a few times, we’re going to notice that the ID column flips around quite a bit.
Right. This looks different from the last one. We were on this and just keep running it. Now negative one is at the top. Negative one is still at the top with some other stuff switched around. All right. We just keep doing this over and over again. We’re going to see just like different results depending on like threads and timing and all sorts of things.
So, if we want a deterministic order by, if we want to always present results in a specific way, we need to incorporate a unique column as sort of a tiebreaker to our order by elements in here. Right. So, we need to add in, in this case, the ID column from the users table, which is the clustered primary key. Right. So, if we run this, then not only are we ordered by reputation here, but in order to tiebreak the ordering to make sure this is deterministic, now we order by the ID column as a secondary element.
So, we order first by reputation, but then for duplicates and reputation, we order by ID. And this will return absolutely stable results every single time. We can’t, we can’t mess with this ordering because of that, because of the uniqueness of the ID column.
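Something like this pair of queries shows the difference, assuming the Stack Overflow Users table, where Id is the unique clustered primary key:

```sql
/* Non-deterministic: lots of duplicate reputations (mostly 1s),
   so rows within a tie can come back in any order, run to run */
SELECT TOP (100)
    u.Id,
    u.Reputation
FROM dbo.Users AS u
ORDER BY
    u.Reputation;

/* Deterministic: the unique Id column breaks every tie,
   so the result order is stable on every run */
SELECT TOP (100)
    u.Id,
    u.Reputation
FROM dbo.Users AS u
ORDER BY
    u.Reputation,
    u.Id;
```

The only change is the tiebreaker, and it is the only thing standing between you and flip-flopping results.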
Now, this behavior can be sort of misleading to a lot of people. If their data doesn’t have a lot of duplicates, just some duplicates, right? So, if we flip the ordering of reputation and we just say order by reputation descending, then, you know, there’s not a lot of overlapping values at the top of the food chain here.
The reputation column is like, largely like, once you get up this high, not a lot of people have like, duplicate reputations where you would need a sort of tiebreaker thing. So, you could observe behavior like this working quite well for a while. But then, after a certain point, you are going to start hitting duplicates and things might start flipping around on you.
So, just for example, the first reputation at the high end that I could find that had a duplicate value was 160303. And sometimes this takes a few runs to kind of get to work correctly. But here you see the reputation 160303 at the very top here.
I don’t think there are any other dupes down below. But you see IDs 206403 and 19679. If we just kind of run this enough times, you’ll see those two rows flip back and forth.
And that’s because, you know, they are duplicates. And so, SQL Server only has to return this thing in order to a certain point. But within duplicates in there, it can return those in any sort of flip-floppy way that it wants, right?
We don’t know which one we’re going to get back. This might not seem like a very big deal to you in this pretty narrow case. But I guarantee you, at some point in your T-SQL developer career, you are going to have someone file a bug report and say something like, Hey, I’m getting weird results back here.
And you’re going to have a non-deterministic order by somewhere in your query that is messing things up for everyone. So, let’s just look at a quick example of where you might see, like, just the results change if your database changes, right? So, let’s pretend that for years we had this table and this query.
And every time we ran this query, I’ve got a couple index hints on here just to make sure that, just to show you what happens when we use different indexes. And so, this first query is always going to be hinted to use the clustered primary key on the post table. And this query is always going to be hinted to use the nonclustered index last editor display name on the post table.
That’s the index that we created right here on last editor display name. It’s very, very explanatory. It’s almost nearly self-documenting if you want to feel special about it.
But let’s just say that for years we had this query running. And every time this query ran, we got results in a specific order and everyone was happy. And at some point, someone was like, oh, well, there’s a missing index request or something.
We’re going to add this index and SQL Server is going to use this index and we’re all going to be thrilled. And if we run this, just run these back to back. Let’s zoom it, getting a little weird on me.
You’ll see that these two queries return data in very different orders, right? If we just kind of make that about half and half, we just get this stuff back. Like these two result sets are just way different, right?
5, 2, 2, 2, 2, 3, 4, 2, 2, whatever. This one, 2, 2, 2, 2, 3, 2, 2, 4, 18, 4, 2, 16. So we just got way different results back running the same query just using different indexes.
If we look at the execution plans, we get some sense of why. Well, we do an index seek into the post table here and then we do a hash match aggregate. And of course, hashes, when you see a hash, there’s absolutely no ordering required for that, right?
Hashes just mush everything together and throw it along to the next thing. And then the second query, we do an index seek into the nonclustered index. And then we have a stream aggregate.
And stream aggregates do require sorted data. If we didn’t have an index on last editor display name and SQL Server chose a stream aggregate, it would have to pre-sort the data for us.
In this case, we have the index, though, so we’re good there. So SQL Server just streams the results into there and shoots it along to the next operator, the filter operator. But in both of these cases, like neither one has an order by, but they both return results in slightly different order because the query plan is different and they use a different object to initially access data in the table.
So this is the kind of stuff that you’re going to have to deal with as a T-SQL developer, as you’re writing queries, as you’re troubleshooting bugs and results and stuff: non-deterministic ORDER BYs, or just expecting results to always return in the same order no matter what, without any ORDER BY whatsoever.
Two very, very big mistakes that people make quite a bit. Anyway, thank you for watching. I hope you enjoyed yourselves. I hope you learned something. I hope you will consider purchasing this course for the pre-sale price. The link is down in the video description.
If not, I will see everyone who is going to get this for free at PASS Data Summit, I guess, in November. Anyway, thank you for watching. Goodbye.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.
I’ve heard you talk about your career path a few times, and it’s pretty weird. Do you have any regrets? Are you still happy with what you do?
Do you have differing approaches for performance tuning an OLAP system vs an OLTP system?
Do you know of any disadvantages of using a filtered index to filter NULL values? We have a very heavy transactional table, like 10k trans/sec, with a clustered index and one non-clustered index. We don’t have any queries that select rows with NULL values from this table. The DBA team said we should avoid using a filtered index without any proof. What do you think?
In all your demos you compress (page) your indexes. Do you default to that with your all your client workloads? Do you see more benefit than a negative impact in your experience? Thanks!
I’ve seen you suggest columnstore for paging and dynamic searches. How do you make your non-clustered columnstore indexes perform acceptably on tables where all of the data is hot?
Video Summary
In this video, I dive into a series of questions from viewers, covering topics ranging from my career path and the challenges it has brought to performance tuning strategies for OLAP versus OLTP systems. I share insights on filtered indexes, page compression, and columnstore indexes, addressing common concerns and providing practical advice based on real-world experiences. Whether you’re curious about the ups and downs of a SQL Server consultant’s life or looking for tips on optimizing your database performance, there’s something here for everyone. Additionally, I discuss my upcoming courses and events, including the new “Learn SQL With Erik” course and PASS Data Community Summit in Seattle, offering early access deals that you won’t want to miss. So, if you have any burning questions or need some expert guidance on SQL Server performance, this video is a must-watch!
Full Transcript
Erik Darling here with Darling Data. And today me and my pal Bats here are going to kick off an Office Hours episode where I’m going to answer five of your most burning pertinent questions about life, love, high finance, extreme fitness, and of course SQL Server performance. So that’s what we’re doing today. If you like this channel and you want to support it with money, you can sign up for a membership. There is a link in the video description. If you like this channel, but not in a way that is monetarily beneficial to me, you are free to, and this is absolutely free, like, comment, subscribe, and of course, ask me questions for these Office Hours episodes that I enjoy so very thoroughly. If you need help with SQL Server, perhaps in a way that goes beyond what a YouTube Q&A can do for you, I am available as a consultant to consult, to do these things live and, well, I mean, sort of in person, on your SQL Servers. The usual stuff: health checks, performance analysis, hands-on tuning, responding to performance emergencies, and of course, training your developers so that you have fewer emergencies. No one likes heartburn, I guess. If you would like some performance tuning training from me, I have 24 hours of it available.
For 75% off, that brings the price down to about 150 USD, and that is a very good deal. I am also dropping a new course on T-SQL that is not part of that. It is a brand new thing. It is called Learn SQL With Erik. That’s me. It is in the works. It is up for tech review. I am recording things, and I’m going to be releasing it over the summer months, and it should all be complete by the time we get to PASS. There is going to be beginner and advanced material in there. If you are going to attend the pre-cons that I am doing at PASS, you will get free access to the material. But right now, the pre-sale price for the course is $250. Once all the material is fully built out and live, the price will go up to exactly $500. The price is going to double when the material is complete, because by then, I am probably going to want to make more money off this thing.
Upcoming events. Lots of fun. Well, by the time this goes live, SQL Saturday in New York City will have come and gone. I should probably delete that, huh? PASS is going on tour this summer. New York City, August 18th to 20th. Dallas, September 15th to 17th. And Amsterdam, October 1st to 3rd. And then, of course, all of those events presage the big one, PASS Data Community Summit, taking place in Seattle from November 17th to 21st.
So with that out of the way, let’s answer some questions over here. Let me just tidy this up a little bit. Let me bring that over a little so everything fits on the screen once Zoomit decides the appropriate level of default Zoomit-ness. Oh, so it’s a non-SQL Server question first. I’ve heard you talk about your career path a few times, and it’s pretty weird. Yes, it is. I forget when the last time I talked about it. It was probably on that Simple Talk Redgate podcast.
Do you have any regrets? Are you still happy with what you do? Jeez, bare my soul, huh? So, yes, of course, there are regrets. I think most of them, though, are limited to me. I do a lot of consulting, and being on the phone a lot wears you down.
I always wish that I had more time to put into producing new training material for money, but it always seems like as soon as I’m like, wow, this week’s going to be nice and open for me to do all this stuff, there’s just like I start working on it. I’m like, yeah.
And then it’s like avalanche of new people need to have problems. Like, well, you know, I’m not getting any younger. You can’t say no to money, so that happens.
You know, I’m still mostly happy with what I do. Of course, there are ups and downs. Some days are more frustrating than others. But, you know, it has been weird.
And, like, you know, a lot of the weirdness with my career path was, like, prior to SQL. But there’s been plenty of weirdness with my career path, like, since then, too. You know, like, before I started working for Brent Ozar Unlimited back in 2015, I was, like, a relative nobody, right?
You know, I had presented a few times and I had, like, you know, done some stuff, but, like, I had a blog that I maintained lightly. But, you know, I, like, I didn’t know how weird parts of the SQL community were before I started working there. Like, I was not at the cool kids table at all.
Like, I had no idea that there was, like, so much, like, just crappy high school cliquey, like, stuff. So, like, when I started working there, you know, like, you find out about, like, all, like, the stuff that goes on, like, beneath the covers a little bit. And, you know, like, there are people, like, especially in the MVP community, who have just been pure nasty to me because I worked there, right?
Because, like, I have some, like, friendships and relationships with, like, Brent and people who work there. There are people who have just been awful to me throughout my career. And, like, they can all go fix cars for all I care.
But, you know, it’s one of the, like, like, like the weirdness didn’t stop when, like, you know, I stopped bouncing and got into databases. Like, it’s just been weird all throughout, right? It’s just, like, like, it’s strange stuff, right?
But, yeah, you know, no, like, no, like, giant regrets. Of course, you know, you know, there are things that I wish I had done differently and a bit more smartly when I first started my own consultancy up. And there’s stuff that I still wish I was, like, doing a little bit better at.
Like, you know, I’m not good at, like, SEO and marketing and all the other stuff. Like, you know, I can produce content. But, like, when it comes down to it, you know, I’m not, I am not a marketing master. So, you know, there’s stuff that I wish I was better at.
But, you know, my regrets are all my own. They’re not anyone else’s. So, anyway, let’s go on to the next question here, which is: do you have differing approaches for performance tuning an OLAP system versus an OLTP system? Well, yeah, of course.
You know, OLTP is all about throughput and OLAP is all about latency, right? OLTP, you need to be able to pound, pound, pound and get a whole bunch of stuff in and out very quickly. OLAP, you need to have, like, big things happen faster, right?
So, you know, it certainly changes, you know, the things that you look at as far as, you know, like, which queries you go after. You know, like in an OLAP system, it might make total sense to go after things that take the, like, longest or use the most CPU to run. In OLTP, you do have to sort of balance that with, like, you know, what runs the most?
Is there anything we can do about this? Things like that. But, you know, there are, of course, differences, you know. Like, even, like, indexing strategies, OLAP, I’m going to push columnstore. OLTP, I’m going to push, like, narrow rowstore indexes, stuff like that.
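To make that contrast concrete, here’s a minimal sketch. The table and column names are made up for illustration; the point is the shape of the indexes, not any specific workload.

```sql
/* Hypothetical dbo.Sales table, just to illustrate the contrast. */

/* OLAP-leaning: a nonclustered columnstore index for big scans
   and aggregations over lots of rows. */
CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_sales
ON dbo.Sales
    (sale_date, region, amount);

/* OLTP-leaning: a narrow rowstore index for fast point lookups
   and small range scans. */
CREATE NONCLUSTERED INDEX ix_sales_customer
ON dbo.Sales
    (customer_id, sale_date)
INCLUDE
    (amount);
```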
You know, there are, of course, like, differences in those things. But, like, a lot of the environments I see are kind of mixed, right? Like, there’s, you know, there’s OLTP plus reporting, right?
Like, plus the OLAP stuff. So, you know, you do have to sort of balance out both, you know. OLAP is sort of, like, I don’t know. OLAP is interesting in a way because you have queries where, like, the expectation is that, yes, they’re going to take longer.
But, like, you also have to balance resource usage a lot differently, right? Because, like, you know, you might only have, like, four or five queries running at the same time. But, like, that’s where you really start running into, like, RESOURCE_SEMAPHORE stuff.
You know, OLTP is typically where you start running into, like, THREADPOOL stuff. Then when you mix them, you get both. It’s a real joy. But, yeah, there are different approaches to it. But, you know, like I said, a lot of the stuff that I see is kind of mixed.
So you do have to attack both sides of that coin when working on things. So let’s see here. The next question we have.
Oh, boy. It’s a lot of writing. Do you know of any disadvantages of using a filtered index to filter null values? We have a very heavy transactional table, like 10K transactions a second, with a clustered index and one nonclustered index.
We don’t have any queries that select rows with null values from this table. The DBA team said we should avoid using a filtered index without any proof. What do you think?
Well, what proof did they want you to present to them? That is the question. What proof? What are they looking for? How does one gather proof without being able to do an experiment in which evidence is gathered and some hypothesis is tested? So I don’t know what proof they want.
But, you know, the big thing with filtered indexes is they are really only terribly sensible if the filter is going to exclude a large number of rows. You know, if you’re only going to exclude like 25 or 30% of the rows, you’re not going to see a dramatic difference in query performance. Once you start getting to like the 50, 60, 75% range, that’s when you start to see bigger differences with things.
So, you know, first, you know, find out what proof they want. Second, figure out what your queries are doing. Figure out how many like null rows your filtered index would actually exclude and kind of go from there.
You know, take some of your, you know, assuming you have some sort of development environment, you know, create the filtered index that you care about and start testing queries against it. You know, there’s not really a downside. The only thing that you must remember with your filtered indexes is this: like you said, you want to exclude null values.
I assume there is a column, or maybe multiple columns, with null values that you are looking to exclude. Make sure that those columns are in the filtered index definition somewhere, not just in the filter, but as included columns too, so SQL Server doesn’t have to do as much guesswork.
It has those columns available to it, so it can evaluate whatever you want, lookup-free. There are a lot of peculiarities with the optimizer around filtered indexes, specifically with nulls, that do sort of force you to have those columns in the index definition beyond just the filter. All right.
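As a rough sketch of that advice, with made-up table and column names: put the filtered column in the index definition itself, not only in the filter.

```sql
/* Hypothetical table and columns, just to show the pattern.
   The filter excludes NULLs, and shipped_date also appears as an
   included column so the optimizer has it available without any
   extra guesswork or lookups. */
CREATE NONCLUSTERED INDEX ix_orders_shipped
ON dbo.Orders
    (customer_id, order_date)
INCLUDE
    (shipped_date)
WHERE
    shipped_date IS NOT NULL;
```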
Let’s see what we got here. In your demos, hey, someone’s paying attention, good for you. You page compress your indexes. Do you default to that with all your client workloads? Do you see more benefit than negative impact in your experience?
Yes, I do default to that. If you take a look at my new stored procedure, sp_IndexCleanup, part of the results in there, well, for the part of the stored procedure where index merge statements are generated for you, to, like, bring two indexes together so you can replace multiple indexes with a single index, any index create statement there gets created with page compression by default. There’s also a whole section of the results that scripts out adding page compression to your existing indexes. I use that.
I use the hell out of page compression. Most people who I work with have far more data than memory. And page compressing indexes helps a lot with that, you know, aside from columnstore compression, but you can’t put columnstore on a lot of tables for a lot of reasons, which, actually, now that I’m looking ahead a little bit, I see is sort of in the next question.
But yeah, like page compression makes your data smaller on disk and in the buffer pool, and you can make way better use of the hardware that you have. So if there is any IO-boundness to your workload, right, like you see a lot of PAGEIOLATCH_SH and PAGEIOLATCH_EX waits, you know, queries are just constantly going to disk.
You know, it could be when you look at wait stats as a whole, or when you hit sp_WhoIsActive and you see all these queries bogged up waiting on reading pages from disk. Page compression can take some of the edge off that by having smaller objects to read, right? Like, having less data to bring from disk into memory is a faster process.
And then having that data be compressed in the buffer pool means that every object up there takes up less space in the buffer pool. So you have more space for more things, right? It’s sort of like those vacuum bags where the people pack their, like, you know, winter clothes in when summer rolls around and they suck them down.
And, like, you have these giant puffer jackets and blankets and stuff, and you just bring them down to this tiny, nice, compressed little thing. It just gives you a lot more storage space. So it’s the same basic idea there.
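If you want to try this yourself, a reasonable starting point looks something like the below. The table name is hypothetical; sp_estimate_data_compression_savings is a real system procedure, and ONLINE = ON needs Enterprise Edition or Azure SQL, so treat that option as situational.

```sql
/* First, estimate how much space page compression would save
   on a (hypothetical) table before committing to it. */
EXEC sys.sp_estimate_data_compression_savings
    @schema_name = N'dbo',
    @object_name = N'Orders',
    @index_id = NULL,            /* NULL = all indexes */
    @partition_number = NULL,    /* NULL = all partitions */
    @data_compression = N'PAGE';

/* If the savings look good, rebuild with page compression.
   ONLINE = ON requires Enterprise Edition or Azure SQL. */
ALTER INDEX ALL
ON dbo.Orders
REBUILD WITH
    (DATA_COMPRESSION = PAGE, ONLINE = ON);
```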
Okay, so the final question. What do we have here? I’ve seen you suggest columnstore for paging and dynamic searches. How do you make your non-clustered columnstore indexes perform acceptably on tables where all of the data is hot?
Hmm. Hmm. So if you pay very careful attention, one thing that I say is that columnstore indexes should generally be reserved for large tables. Usually, you know, a lot of the pamphlet material from Microsoft says at least a million rows.
I’m not sure that’s where I’d start. I think, you know, 5, 10, 15, 20 million are more sensible numbers there. So I’m going to run a question by you.
And that is: do you really have 5, 10, 15, 20 million row tables where all of the data is hot? You have a table with that many rows where people are just constantly updating all 5, 10, 15, 20 million rows? Do you really, honestly, truly have a table like that? That would be a very strange thing. Very, very strange thing indeed.
I think perhaps you’re misinterpreting where I suggest using these columnstore indexes. If you truly have a table like that, then that would be an interesting consulting engagement. Good Lord.
Yeah, so, you know, you are right that, you know, updating columnstore indexes does not quite always go as well as updating the rowstore indexes. But, boy, I think there is some attention that needs to get paid to the types of tables that make good candidates for columnstore indexes here. All right.
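As a quick sanity check on the “is this table big enough” part, you can count rows from the metadata instead of scanning the table. The table name here is a placeholder.

```sql
/* Row count from metadata: index_id 0 is a heap, 1 is the
   clustered index; summing partitions covers partitioned tables. */
SELECT
    SUM(p.rows) AS total_rows
FROM sys.partitions AS p
WHERE p.object_id = OBJECT_ID(N'dbo.Sales')
AND   p.index_id IN (0, 1);
```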
I’m out of breath. I’m winded. My allergies are terrible. My lungs are not at full capacity. So I’m going to go breathe for a little bit and then I don’t know what. But anyway, thank you for watching.
I hope you enjoyed yourselves. I hope you learned something and I will see you in whatever video we do next. All right. Cool. Thank you. Goodbye.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.
In this video, I share a quick and handy trick using GROUP BY in SQL Server, perfect for those Friday afternoons when you just want to get your work done efficiently before heading off to enjoy the weekend. This isn’t a deep dive into the intricacies of GROUP BY; instead, it’s a practical solution to an annoyance many of us face: how messy queries can become when grouping by expressions. I demonstrate how using `CROSS APPLY` with the `VALUES` construct can clean up your code and make it more readable without resorting to complex Common Table Expressions (CTEs) or other workarounds that can be a pain to remember. So, whether you’re preparing for the precons at the PASS Data Summit or just looking for a way to streamline your SQL queries, this video has got you covered. Enjoy your weekend and happy coding!
Full Transcript
Erik Darling here with Darling Data. In today’s video, we’re going to do a quick little trick with GROUP BY. The reason it’s a quick video is because this is getting published on a Friday, and golly and gosh, I don’t want to keep you waiting around on a Friday. You should be gallivanting off to have a great weekend. So this is just a cool little trick. This is not going to be deep, fundamental, internal stuff about GROUP BY and grouping tables and all the other stuff that’s in this session, whatever you want to call it. Of course, this is part of the companion material to the precons Kendra Little and I are giving at the PASS Data Summit. If you are coming to the precons, you will get free access to this material, since it’s companion material to the precons. If you are not coming, you can get it for the presale price of $250 US. That will be going up to $500 when the course is fully published. So, get in while the getting’s good, as they say. Eat while the chicken’s hot, or something. Smokey beans. Anyway, what were they going to say? Oh yeah, GROUP BY. Fun stuff. So, one of my annoyances with GROUP BY, at least in SQL Server, is how messy queries can get once you start grouping by some expression. Once you start writing things out, you’re like, I’ve got to write this over and over again. Gosh almighty. Even if you’re really good at copying and pasting, it gets very irritating.
And some code completion tools, I’m not going to name any names, have some issues in this area. So, this is what a typical, you know, I-had-to-type-too-much query to do this GROUP BY thing looks like, where we’re getting the date parts for creation date, and then in the GROUP BY, we’ve got to do it all over again. Now, Oracle actually recently brought out some improvements to their SQL dialect, where you can actually reference aliased columns in the GROUP BY, which I think is fantastic.
T-SQL should have such a thing. SQL Server 2025, where are you at, buddy? T-SQL should have such a thing. Now, of course, if you’ve been paying attention to my videos and the logical query processing stuff, you know you can actually reference aliased columns in the ORDER BY. That would be okay. So, if we look at this query down here, we have some squiggles in the GROUP BY. The GROUP BY is not going to work, but the ORDER BY is totally kosher. This part’s fine.
But, you know, if we try to run this, SQL Server is going to be like, I don’t know, who are these columns? I’ve never heard of them. Where did they come from? From whence did they come? Which is annoying to me. But, and this is probably one of the most frequent things that you will hear said about T-SQL: don’t worry, there’s a workaround. Because it’s never just a straightforward, hey, we can just do this. It’s always: there’s a workaround. Another thing to memorize. And that is, of course, the CROSS APPLY with VALUES construct, clause, thing, whatever you want to call it.
So, we’re going to use CROSS APPLY in here. And in the CROSS APPLY, we are going to say VALUES. And within those VALUES, we are going to get the two expressions that we care about: creation date year and quarter, right? There’s year, there’s quarter. Sorry, it’s very pink on here today. And then we are going to alias those columns coming out of the CROSS APPLY as creation year and creation quarter. And, of course, now we can run this query with absolutely no problems. And, you know, assuming that you can cope with a little CROSS APPLY diversion in your query like this, I do find that this makes the code much cleaner and easier to deal with.
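Here’s roughly what that looks like in code. I’m using the kind of table and column names the narration implies (a Posts table with a CreationDate column); treat them as placeholders for whatever you’re actually grouping.

```sql
/* The repetitive version: the same expressions appear in the
   SELECT list and again in the GROUP BY. */
SELECT
    DATEPART(YEAR, p.CreationDate) AS creation_year,
    DATEPART(QUARTER, p.CreationDate) AS creation_quarter,
    COUNT_BIG(*) AS records
FROM dbo.Posts AS p
GROUP BY
    DATEPART(YEAR, p.CreationDate),
    DATEPART(QUARTER, p.CreationDate);

/* The CROSS APPLY (VALUES ...) version: compute each expression
   once, alias the columns, and group by the aliases. */
SELECT
    d.creation_year,
    d.creation_quarter,
    COUNT_BIG(*) AS records
FROM dbo.Posts AS p
CROSS APPLY
(
    VALUES
        (
            DATEPART(YEAR, p.CreationDate),
            DATEPART(QUARTER, p.CreationDate)
        )
) AS d (creation_year, creation_quarter)
GROUP BY
    d.creation_year,
    d.creation_quarter;
```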
And you don’t have to write any goofy CTEs and get into arguments on LinkedIn with people about how goofy CTEs are. They don’t actually make queries more readable. Because who has time for that? Not me. Anyway, that’s my quick video today. Like I said, this is going out on a Friday. Well, it’s Friday now for you.
So, please do have a great weekend. Please do drive safely. Even if you’re not drinking, drive safely. Not drinking is no excuse to drive worse. I hope you enjoyed yourselves. I hope you learned something.
And remember, this is the pre-sale thing for the course down here. So, if you’re enjoying this at all, imagine if you had, like, a couple days worth of T-SQL content like this to watch and learn from. It might be pretty good, right?
It’s a whole lot better than reading some stiff book or some terrible documentation or just not knowing what you’re doing and hoping that there’s a Stack Overflow answer for you. Or hoping that the LLM was right because, hoo-wee, you got a lot of hoping to do there. Anyway, thank you for watching. Goodbye.
In this video, I delve into some quirky aspects of T-SQL that might surprise even seasoned database professionals. Specifically, we explore the OUTPUT clause and its peculiar behaviors when used in certain scenarios. While working on my new T-SQL course, which is part of a series aimed at advanced topics beyond beginner material, I stumbled upon an interesting issue involving the OUTPUT clause with check constraints. Occasionally, attempting to insert values into a table with a check constraint results in an error that the statement conflicted with a check constraint on login time, even though retrying successfully inserts the row. This led me to experiment with using `DELETE` and `OUTPUT` together, expecting a clever workaround for session state management tasks. However, I found that even this approach didn’t work as intended, resulting in primary key violations due to the way SQL Server handles these operations within the query plan. The video is a light-hearted exploration of these oddities, aimed at both entertaining and educating viewers about the nuances of T-SQL.
Full Transcript
Erik Darling here with Darling Data. And this video is, of course, part of my ongoing T-SQL course stuff. This is just a little jump ahead, right? Because this is beyond what I would consider beginner material. So this is a little jump ahead to some of the more advanced stuff. But it was just one of those funny things that I came across while writing stuff, and me and Bats just got such a kick out of it that we wanted to talk about it now rather than later. So this is a little short video about some weird stuff that you might run into with the OUTPUT clause. Because it’s amusing to me. So once again, I am working on my new T-SQL course. The T-SQL course is, of course, companion material to the precons that Kendra Little and I will be delivering at the PASS Data Community Summit this November. If you are attending that, you can get the material for free with your attendance to the precons, of course. Now, if you go to someone else’s precons, screw you. Nobody’s going to give you anything free. If you’re not going to attend, then you can get it for the presale price of $250. And I’m just waving at it down here in the video description; there is a link where you can purchase it. Anyway, let’s get on with this funny little demo. Now, the first funny thing is this part, right, where we’re just dropping a table if it exists, creating a table with a check constraint, and then trying to insert some values into that table. Every once in a while when you run this, you’ll get this message that the statement conflicted with the check constraint on login time. I guess every once in a while SYSDATETIME just works and that check constraint works well, and sometimes it doesn’t. But let’s make sure we get a row in there. Okay, one row affected. Great. We did it. And if we look at that one row in the table, this is what we get.
We have spid 121, and this is our login time.
What a beautiful SYSDATETIME that is. Gorgeous. I’ve never seen anything so beautiful in my life. Now, the first sort of funny thing with OUTPUT is... so, in my head, I was like, well, let’s say you’re dealing with sort of a session state table, and, you know, maybe you have a procedure that checks to see if a row already exists in there, and if it does, you delete it, and if it doesn’t, you insert it. Well, I was thinking, you know what, I bet there’s a real clever way to use OUTPUT to do that, and so I was like, I got this.
I’m on this case. Like, I’m on the prowl here, right? And so what I figured I’d do is use DELETE with OUTPUT, because when you use DELETE with OUTPUT like this, you can nest DML. So notice that we have an INSERT ... SELECT here, and the SELECT portion is coming from this, right? So we’re sort of treating the DELETE with OUTPUT like a subquery. And in a lot of the videos, one of the things I explain is that in SQL Server, or in databases in general, but we’re using T-SQL and SQL Server, the result of everything is sort of tabular, right? So this is just sort of a table-valued result. It’s like a little derived table in there. It’s a table expression with a DELETE inside it. Wild. But we can do that. It’s valid.
But it doesn’t work, right? And I’ll show you what doesn’t work. We have a DELETE, and in that DELETE, we are deleting where the spid equals my spid, so spid 121 should get blown away. And then in the INSERT, we are reselecting the spid with a new SYSDATETIME. So we can try to do this, but every time we run it, this one never works. Every time we run this, we get this error message that it’s a primary key violation. We just can’t do it.
And it’s funny to me, too, because when you look at the execution plan, you have a Clustered Index Delete over here, right? That happens. That’s the very first operator in the plan. Clustered Index Delete, should get rid of that row, be gone. And then over here, we have the Clustered Index Insert. So it’s like, hey, check it out, we’re going to delete this thing over here, and then we’re going to insert this thing later on. But no, every single time, every single time, primary key violation. And granted, you could probably expand the primary key to be both spid and login time, and, you know, not to obviously foreshadow anything, but that was just a very surprising thing.
But, you know, I thought maybe I’m being too clever, maybe I’m being far too clever with things. So I thought maybe if I just did a DELETE with OUTPUT INTO the table, that would work, right? Because maybe the nesting is just too much for SQL Server. SQL Server is like, no, I can’t do it, I can’t figure this out, it’s too many operators in this query plan, it gets all cagey and weird. So I would just do a DELETE, and I would OUTPUT some stuff INTO here. Now, I know this would give me a primary key constraint violation, but I just wanted to see if it would get past anything. Because the query plan... ah, right, you can’t even get a query plan for this one. And when you try to execute it, you get a different error message. I’m just trying to find a good place to chop this off, because it’s a little bit longer than the first one. With that OUTPUT INTO clause, we get something different. You can’t even get an estimated plan for it, because SQL Server’s like, screw you, violation right off the bat: the target table of the OUTPUT INTO clause cannot have any enabled check constraints or any enabled rules. Found constraint or rule, CK nested blah blah blah, 1DD13137. Okay.
So it’s just funny to me, because with the one-plan thing, you get the primary key violation, and it’s like, oh, well, clearly we can’t violate a primary key. But then when you do it a slightly different way, in which you actually would violate the primary key, right, like there actually would be a primary key constraint violation here, you hit this other one on the check constraint. So I have not quite found the most clever way of doing this yet, but I’m working on it. So, if you’re into that sort of thing, perhaps the $250 that it costs to buy this course at the pre-sale price would be worth it to you. If not, you can just wait until it’s fully published and it costs $500, because maybe it would be so worth it to you that you’re like, no, no, I’m waiting for the price to go up. Erik Darling deserves his flowers. Anyway, thank you for watching. I hope you enjoyed yourselves. I hope you are as annoyed with this as I am, and I will see you in the next video. All right, goodbye.
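For reference, here’s a rough sketch of the kind of nested-DML demo described above. The table definition and names are my reconstruction from the narration, not the actual course code; the shape of the failure is the point.

```sql
/* A session-state table close to what the video describes. */
DROP TABLE IF EXISTS dbo.session_state;

CREATE TABLE dbo.session_state
(
    spid integer NOT NULL PRIMARY KEY,
    login_time datetime2 NOT NULL
);

INSERT dbo.session_state
    (spid, login_time)
VALUES
    (@@SPID, SYSDATETIME());

/* Nested DML: the DELETE ... OUTPUT acts like a derived table
   feeding the INSERT. The syntax is valid, but it fails at
   runtime with a primary key violation, because the delete and
   the insert happen inside a single statement's query plan. */
INSERT dbo.session_state
    (spid, login_time)
SELECT
    d.spid,
    SYSDATETIME()
FROM
(
    DELETE ss
    OUTPUT Deleted.spid
    FROM dbo.session_state AS ss
    WHERE ss.spid = @@SPID
) AS d;
```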