Trying Something New With sp_QuickieStore: 80/20 Query Scoring Across Multiple Metrics
Summary
In this video, I dive into an old-school approach to identifying SQL Server performance issues using sp_QuickieStore and a novel scoring method inspired by the Pareto principle. Traditionally, Query Store surfaces queries that consumed a lot of CPU over the last seven days, but often these results are too broad for practical use. To address this, I’ve developed a multi-dimensional scoring system that evaluates queries based on their impact across several key metrics: CPU usage, duration, physical reads, writes, and executions. This approach helps pinpoint the most problematic queries more accurately, even when they run outside of typical working hours or are unparameterized. By sharing these insights, I hope to provide a practical tool for SQL Server administrators looking to optimize their databases without relying solely on modern monitoring tools.
Chapters
- *00:00:00* – Introduction
- *00:00:39* – sp_QuickieStore Overview
- *00:01:22* – Work Hours Definition
- *00:01:38* – Query Execution Analysis
- *00:02:03* – Unparameterized Queries
- *00:02:52* – 80-20 Rule Application
- *00:03:12* – Resource Grouping Criteria
- *00:03:40* – Today’s Discussion
- *00:03:56* – Training Offer
- *00:04:21* – Office Hours Questions
- *00:04:43* – Free Monitoring Tool
- *00:05:07* – T-SQL Collectors Overview
- *00:05:47* – Dashboard Summaries
- *00:06:08* – Upcoming Conferences
- *00:07:04* – Results Explanation
- *00:07:23* – Primary Window Details
- *00:08:01* – Working Hours Definition
- *00:08:16* – Query Identifiers
- *00:08:36* – Impact Score Calculation
- *00:10:15* – Memory Consideration
- *00:10:20* – Diagnostics Layer
- *00:10:44* – Performance Metrics
- *00:11:05* – Plan Instability Check
- *00:11:22* – Volatility Metrics
- *00:12:02* – Time Zone Note
- *00:12:32* – HammerDB Usage
- *00:13:09* – Query Execution Period
- *00:13:42* – Grouping by Query Hash
- *00:14:08* – Plan IDs Explanation
- *00:14:35* – Impact Score Details
- *00:15:26* – Off Hours Queries
- *00:16:16* – High Load Query Analysis
- *00:17:23* – Multi-Metric Evaluation
- *00:18:00* – GitHub Repository
- *00:18:44* – Tool Usage Tips
- *00:19:44* – Conclusion
Full Transcript
Erik Darling here with Darling Data, taking a break from my monitoring tool mogul personhood to talk about some old-fashioned ways of finding SQL Server problems that aren’t my new, fancy, free, open source, better-than-all-the-paid-monitoring-tools monitoring tool. We’re going to do something a little bit different with sp_QuickieStore. Let me start by defining the problem I want to solve, and then show you a little bit about how I chose to go about solving it. By default, sp_QuickieStore looks at Query Store data from the last seven days and sorts by average CPU descending, which is a great way of surfacing things that burned a lot of CPU.
Often when you find those things and show them to people, though, they say, well, that runs at night, it doesn’t matter. At first, I tried to cope with that by adding in a few parameters, like workdays, which when set to one will, by default, look only at everything between 9 a.m. and 5 p.m. on weekdays, and show you that. And of course, the work start and work end parameters serve to define your workday to your needs. But you still needed a lot of other filtering in order to come to an agreement on which queries mattered.
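The work-hours filtering described above looks roughly like this. The parameter names here (@workdays, @work_start, @work_end) are the ones named in the video, and the database name is a placeholder; check @help = 1 against your installed version for the exact names and accepted time formats:

```sql
-- Default behavior: look at the last seven days of Query Store data,
-- sorted by average CPU descending.
EXEC dbo.sp_QuickieStore
    @database_name = N'YourDatabase'; -- placeholder

-- Restrict the analysis to weekday working hours.
-- Parameter names as described in the video; verify with @help = 1.
EXEC dbo.sp_QuickieStore
    @database_name = N'YourDatabase', -- placeholder
    @workdays = 1,
    @work_start = '9am',
    @work_end = '5pm';
```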
For example, some queries would still execute during the day and burn a lot of CPU on average, but people would say things like, well, it appears to have only executed once or twice. Or they would say, oh, that’s not user facing. A lot of stuff like that would come up, and the answer was always, okay, we can filter, we can set a minimum number of executions. But what makes sense? How many executions would you start to care about? I don’t know. And then there are all sorts of other things that act as traps in there, like unparameterized queries, which generate a new query ID for every new literal value that gets passed in.
And so it looks like a query maybe executed a few times, but if you look at the aggregate for that query hash, which is something I added with the include query hash totals option, the real picture is quite different. I think you get the point: it was still really hard to find what is really stressing the server out. So the way I chose to go about fixing this is by doing something like the Pareto principle, but with a matrix of metrics.
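The query hash rollup mentioned above might be invoked like this. The parameter name is my transcription of what was said in the video ("include query hash totals"), so treat it as an assumption and confirm with @help = 1:

```sql
-- Roll unparameterized queries (one query_id per literal value) up to
-- their shared query hash, so their aggregate cost becomes visible.
-- Parameter name as spoken in the video; verify with @help = 1.
EXEC dbo.sp_QuickieStore
    @database_name = N'YourDatabase', -- placeholder
    @include_query_hash_totals = 1;
```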
You know the theory: 80% of your blank is caused by 20% of your blank, the 80-20 rule, effectively. Except if you apply that across just one resource, you’re like, well, okay, CPU, fine. So I chose a group of resources that I care about: CPU, duration, physical reads (not logical reads; logical reads are for people using SQL Server 2008, and we’re not there anymore), writes, and executions.
So I’m taking those five metrics and finding things that have a meaningful impact on your server across all of them. That’s what we’re going to talk about today, and I’m going to show you what that looks like. Down in the video description, oh boy, all sorts of useful links. You can still hire me for consulting, you know, before the robots put every independent consultant out of business, you can still talk to me.
You can purchase my training to get the human angle on all this stuff, not just the “you’re absolutely right” LLM angle on any questions you might have. You can become a supporting member of the channel; wouldn’t it be just grand if you gave me about $4 a month, enough to pay just the tax on a cappuccino? You can also, for free, ask me office hours questions, and, even more for free, like, subscribe, and, assuming you’ve got any friends who are still living people and not large language models, tell them about the channel so maybe they can subscribe too.
Wouldn’t that be just fantastic? We could all be friends. Speaking of friends, I’m your friend. And because I’m your friend, and I want you to be happy, and I want your life to be better, I have made a free, open source SQL Server performance monitoring tool. No complicated installation, no licensing, no email needed, no phone-home telemetry, no nothing.
Just a bunch of T-SQL collectors running on a schedule, populating dashboards; basically all the stuff I would look at as a consultant if I were trying to keep a SQL Server happy from a performance point of view. The thing I think is pretty neat is that I have built MCP servers into the monitoring tool. They’re opt-in, of course, not enabled by default. So if you want to have your robot friends look at just your performance data, only the collected performance data, and ask them questions specifically about that nicely aggregated, collected, you know, gentrified data that it gathers for you, you can do that, and they can summarize very nicely all the stuff that has happened with your server that is making it unhappy.
I will be out and about in the world. I guess by the time this goes live, well, I forgot to take DataTune Nashville off the slide, but I like DataTune Nashville; maybe there will be one next year and I’ll go back for that. If there is, DataTune Nashville, that’s a great conference. Data Saturday Chicago is coming up; when this publishes, it’ll be this week.
SQL Day Poland in May, woohoo, and Data Saturday Croatia also, woohoo, June 12th and 13th. And if this slide will move on, you will see that I have at least updated it now that we have left the murderous winter. We have survived this Donner Party winter, we are now entering spring, and the hills are once again alive with the sound of database music, which is quite harmonious if you ask me.
Anyway, let’s go do this thing. Let’s go over to SQL Server Management Studio, something we haven’t seen in one of these videos in a little bit, not since the Get AI Ready with Erik series, a triumphant series of videos. Let’s talk about these results a little bit. Before I even do this, what I want to point out is that if you use the @help parameter in sp_QuickieStore, you’ll get this text right here in the results.
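The @help parameter is the built-in way to see what everything means:

```sql
-- Return sp_QuickieStore's built-in documentation: the available
-- parameters, their data types, and a description of each results
-- column, including the new high-impact columns discussed here.
EXEC dbo.sp_QuickieStore
    @help = 1;
```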
I made sure this all had a very good explanation, so it would be a durable reference for you should you choose to use the new find high impact parameter and look at Query Store in this way. So we’ve got some normal stuff you would expect to see: database name, start date, and end date, which is the window of time we looked at. Then the primary window column will tell you when the query runs most.
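The new mode might be switched on like this. The parameter name @find_high_impact is inferred from how it's spoken in the video, and the feature lives on the dev branch at the time of recording, so confirm the exact name with @help = 1:

```sql
-- Score queries across CPU, duration, physical reads, writes, and
-- executions, grouped by query hash, instead of sorting by one metric.
-- Parameter name inferred from the video; verify with @help = 1.
EXEC dbo.sp_QuickieStore
    @database_name = N'YourDatabase', -- placeholder
    @find_high_impact = 1;
```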
I just chose to separate the day in two: 7 a.m. to 7 p.m. is working hours, 7 p.m. to 7 a.m. is off hours. I also have designations in there for weekends, and for things whose execution profile spans everything.
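Conceptually, the 7-to-7 split is a simple hour bucket. This is a sketch of the idea, not the procedure's actual code (and DATENAME output is language-dependent, so a real implementation would be more careful):

```sql
-- Conceptual sketch of the primary-window classification:
-- 7 a.m.-7 p.m. is "working hours", the rest is "off hours",
-- weekends get their own label, and a hash whose executions land
-- in several buckets would end up marked "spread".
SELECT
    rs.runtime_stats_id,
    window_label =
        CASE
            WHEN DATENAME(WEEKDAY, rs.first_execution_time)
                     IN (N'Saturday', N'Sunday')
            THEN N'weekend'
            WHEN DATEPART(HOUR, rs.first_execution_time)
                     BETWEEN 7 AND 18
            THEN N'working hours'
            ELSE N'off hours'
        END
FROM sys.query_store_runtime_stats AS rs;
```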
I’ll show you what that looks like in a moment. Then we have the normal identifying characteristics you would expect to see for a query: the object name, the SQL text, the query plan, the top waits if they’re available for it, the query hash.
Something I want to point out here is that Query Store has query hashes in it, and I chose to use those as identifiers instead of query IDs because of the problem I was talking about earlier in the video: if you have unparameterized queries, individual query IDs might not look significant when, in the aggregate, they actually are.
So we’re doing this by query hash. I do show you the query IDs and query plans that those query hashes are associated with, but my default grouping is query hash.
There’s also a count of how many distinct query IDs that hash generated, which gives you a good idea of whether the code is unparameterized. And of course, a count of the plans.
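That per-hash count can be reproduced directly against the Query Store catalog views, something like:

```sql
-- Count how many distinct query IDs and plans each query hash has
-- generated. A high query_id count per hash is a strong hint that
-- the code is unparameterized.
SELECT
    q.query_hash,
    distinct_query_ids = COUNT(DISTINCT q.query_id),
    distinct_plans     = COUNT(DISTINCT p.plan_id)
FROM sys.query_store_query AS q
JOIN sys.query_store_plan AS p
    ON p.query_id = q.query_id
GROUP BY
    q.query_hash
ORDER BY
    distinct_query_ids DESC;
```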
Then there is an impact score, which is a number from zero to one. It works like the reverse of the vector search, or semantic search, stuff I was talking about before, where closer to zero was better. Here, closer to zero means your query is less impactful, which is good, and closer to one means there is something very impactful about this query. That score is generated by effectively taking the average percent rank across all of the active dimensions we’re looking at.
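The "average percent rank across the active dimensions" idea can be sketched with T-SQL's PERCENT_RANK window function. This is a conceptual illustration, not sp_QuickieStore's actual code, and #hash_totals is a hypothetical pre-aggregated rollup with one row and total columns per query hash:

```sql
-- Conceptual sketch of the impact score: rank each query hash along
-- every metric as a 0-to-1 percentile, then average the five ranks.
-- #hash_totals is a hypothetical rollup table, not the procedure's.
SELECT
    ht.query_hash,
    impact_score =
    (
        PERCENT_RANK() OVER (ORDER BY ht.total_cpu_time) +
        PERCENT_RANK() OVER (ORDER BY ht.total_duration) +
        PERCENT_RANK() OVER (ORDER BY ht.total_physical_reads) +
        PERCENT_RANK() OVER (ORDER BY ht.total_writes) +
        PERCENT_RANK() OVER (ORDER BY ht.total_executions)
    ) / 5.0
FROM #hash_totals AS ht
ORDER BY
    impact_score DESC;
```

A query that is merely heavy on one dimension gets diluted by low ranks on the other four, which is exactly the multi-metric behavior described above.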
Then there’s stuff like total executions, showing how many times that query has executed, and then what its share of CPU, duration, physical reads, writes, and memory is.
I think I forgot to mention memory before, but that’s okay; I’m mentioning it now because I see it, and we’re good. After that, we have some diagnostics, a rule-based layer on top of the score, which tells you some stuff I found while looking at your query’s execution patterns. Not the query plan or anything, just the way it runs.
I also include wait stats for the query, so we look at things like blocking, resource contention, and IO. We try to figure out if a query is parameter sensitive, and if there’s plan instability. And say we have a select query that runs and is causing writes: then we might say, hey, this thing might be spilling, or it might have spools in it that you could fix, so take a look at it.
We also flag metrics by volatility. If there are really big swings in the minimums or maximums for things compared to the averages, which can be a sign of parameter sensitivity or something similar, then you might have a situation that you need to take a look at. Because sometimes the query is very fast and sometimes it’s very slow; it might look okay on average, but it’s got a wide swing to it.
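The volatility flag can be sketched as a max-to-average ratio. Again, #hash_totals is a hypothetical rollup and the 100x threshold is an illustrative value, not the procedure's actual cutoff:

```sql
-- Conceptual volatility check: flag metrics whose maximum is a large
-- multiple of the average, a common symptom of parameter sensitivity.
-- The 100x threshold here is illustrative only.
SELECT
    ht.query_hash,
    duration_volatility =
        ht.max_duration * 1.0 / NULLIF(ht.avg_duration, 0),
    volatility_note =
        CASE
            WHEN ht.max_duration > ht.avg_duration * 100
            THEN N'max duration is 100x+ the average'
            ELSE N''
        END
FROM #hash_totals AS ht;
```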
So here’s what this ends up looking like in result form. You’ll have to excuse that a lot of this stuff says off hours because, and I can’t remember how many times I’ve said this now, all my Windows VMs that SQL Server is installed on are in Pacific time. It seems like no matter when I run a workload, it’s always off hours. Maybe it’s just a California thing; they’re all on vacation. But it looks like this. This is, of course, the database name that we ran it in.
I’ve been using HammerDB a bit more recently, because I can just run a workload and it causes a whole bunch of stuff to happen: high CPU, blocking, deadlocks. I can come back and say, oh cool, you did all this stuff, without me having to write a synthetic Stack Overflow workload, which would be annoying and time consuming, and not something I really want to do with my time right now.
I’ve got a lot of other stuff going on. So then you see the period that we ran this for, and then this is where we start getting into the stuff about the actual query. You see stuff that executed during business hours.
Again, this would be the 7 a.m. to 7 p.m. window. Right now this is hard-coded, but I’ll open it up a little more as people start making requests about it.
Off hours would mean 7 p.m. to 7 a.m. And then down here a little bit, we have one that says spread, which means its executions are spread across both off hours and business hours.
If any of this were coming from a stored procedure, that name would be there. We have the query text, and then we have the query plan, and all of the aggregated waits for that query.
So you can get a sense of what that thing is up to. Again, these come from Query Store, so they’re the high-level bucketed waits, not the granular waits that you might be used to if you’re looking at wait stats elsewhere.
And then down here we have a couple of examples. I know it’s a little cut off over my head, but that’s okay. So again, we’re grouped by query hash.
Because as you can see from just about all of these lines, one query hash can spawn a whole bunch of query IDs. If we were to do this by query ID, we would have a far worse view of which queries were most impactful in the aggregate. Look, there are like seven different query IDs; it’d be like, oh, well, these all look fine, what could be going wrong here? Then we have the list of plan IDs. All kinds of nice things to have in front of you.
There’s some sort of interesting stuff here, like where two different query IDs that group to the same hash have only one query plan ID between them, which is strange and bizarre. But then, coming over this way, this is where we start to get into which queries are most impactful.
So, the impact score again: I screen out anything that’s less than 0.5, because that’s where stuff gets pretty insignificant. Every impact score you see is going to be between 0.5 and 1,
with 1 being the highest you can get. The closer this gets to one, the more it means something like: this query is in the 98th percentile of how much stuff it does.
Like this one: across all of these metrics, everything you see here contributes to the general score. And then, coming over a little bit further, this is where the diagnostics come in.
The TPCC database is actually a little more interesting here, so let’s switch over to that and run this for that one. This one’s a little different. I mean, it’s still all off hours, because of those stupid West Coast VMs, but we actually get a bunch of stored procedure names in here, and we get slightly more interesting numbers.
I think so, anyway, over in this section. If you look at just this top query, where the impact score is 0.91, and you look across: holy cow, that’s a lot of executions. Wow, that even took me by surprise. You can see that this query, for this database and the timeframe we looked at, was 18% of the CPU, 1.3% of the duration, 3% of the physical reads, 54.2% of the writes, and 28.6% of the executions.
So this query is contributing a lot of load to the server. This is a query you would want to look at. Now, based on the execution count and all that, it’s unlikely you could make it a lot faster, but this is a query you might want to look at and ask: hey, do we need to run this this much? Can we cache this somewhere? The usual. I mean, maybe it is something you could tune and make faster. You’re not going to make it execute less by tuning it, but you could probably bring down the CPU share.
You could probably bring down some of the other numbers too, and make it less significant. But it also might be a query that you want to think about not running as much, because that is a bonkers number of executions. And then over here, not really any diagnostics, but we do get some volatility metrics, where sometimes the duration is 707 times the average.
Sometimes the physical reads are 93,000 times the average, and sometimes this one’s duration is almost 11,000 times the average. So this is just a different way of looking at the queries in Query Store: scoring them across a bunch of metrics and figuring out which ones are contributing the most across all of them, not just one.
Because oftentimes people are not just concerned about one thing. They’re like, well, CPU is high, and we’re constantly going to disk, and, you know, problems everywhere. So this is a good way for you to start figuring out, as a whole, which queries are contributing across a variety of metrics, so you can attack the things that are most meaningful across all of them.
Anyway, this is right now in the dev branch of my GitHub repo; that is code.erikdarling.com. Shockingly, this is the same repo that my monitoring tool is in. So I’m just going to bring that up: if you go to code.erikdarling.com, it’s going to be in the DarlingData repo. This is, of course, sp_QuickieStore. Go check that out. Again, it’s on the dev branch; it is not live on the main branch yet. But after I’ve let it soak a little and I’ve had a chance to kick it around a bit more, it will make its way over there.
So again, this is just a different way of looking at query performance, a different way of finding queries that are meaningful to go after. I think it’s pretty cool.
Hopefully you do too, and hopefully it helps you find queries that are meaningful on your servers. Because with any other tool you use to get this kind of information, you always just sort by one metric, right?
And even if you sort by two metrics, one metric dominates and sends a much stronger signal, while the other metrics get all wishy-washy.
So, to talk about the problem from the opposite direction: like I said, by default sp_QuickieStore orders by average CPU. If you flip that and say, I’m going to order by executions, you’ll find something that executes a lot, but it’ll be stuff with zero-millisecond CPU and duration, because it’s very fast code that isn’t doing anything.
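Flipping the sort to executions looks something like this. @sort_order is an existing sp_QuickieStore parameter; the accepted values are listed by @help = 1, so treat the literal here as an assumption:

```sql
-- Sorting by a single metric like raw execution counts tends to
-- surface fast, trivial queries, which is exactly the single-metric
-- trap the high-impact scoring is designed to avoid.
EXEC dbo.sp_QuickieStore
    @database_name = N'YourDatabase', -- placeholder
    @sort_order = 'executions';
```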
You’re still finding a problem, but you’re not finding the things that are causing more than one problem at once. This is a way to not just focus on one metric or one problem; it’s a way to look at your queries across a bunch of metrics that can all cause you problems. Anyway, thank you for watching.
I hope you enjoyed yourselves. I hope you learned something. I hope you’ll check this thing out; that’s sp_QuickieStore in my GitHub repo right there. Give it a shot and let me know how it goes. And, I don’t know, see you on the main branch someday soon.
All right. Thank you for watching.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.