SQL Server Performance Office Hours Episode 17

I work for a small startup (7 people, 3 on SQL Server, including the boss). Our software is a complex ERP system with business logic heavily implemented in SQL Server. We are planning to move to Postgres. Is Postgres a good platform for complex business logic and performance? Can a small team achieve this migration for complex software? Your expert opinion?
Hey Erik. What deciding factors should be considered when marking an entire table for Update Statistics? Is it based on how fast the data changes, or is it okay to run the Update Statistics overnight, irrespective?
Hi Erik, my top wait type is SOS_SCHEDULER_YIELD, which is way higher than the second one: SOS_SCHEDULER_YIELD is 75h of a 230-hour sample, 1.5ms on average. The second, PAGELATCH_SH, is 5h total. Should I focus mostly on the most CPU-intensive queries (in cache), or could this also be connected with high CPU ready time on the VM side, or a lack of proper indexes that causes too many index/table scans from memory? The strange thing is that I see this wait (I’m logging sp_Blitz every 15 minutes) at the top even on weekends when there is no heavy load in the DB. Should I start logging sp_PressureDetector to see the number of worker threads used and try to correlate it with the high SOS_SCHEDULER_YIELD wait? Thank you!
I see Brent is going to put AI in his procedures. Are you planning on doing that too?
Why would I use dynamic SQL to defeat parameter sensitivity, when I can just use OPTIMIZE FOR?

Video Summary

In this video, I dive into some of the most pressing questions submitted by our community during my office hours session. We tackled topics ranging from migrating to PostgreSQL for complex business logic and performance considerations, to optimizing SQL Server statistics updates and identifying CPU-intensive queries. It was a lively discussion with plenty of insights and advice shared among us. Whether you’re considering a database switch or looking to fine-tune your SQL Server environment, there’s something here for everyone. Additionally, I share my thoughts on the future of AI in SQL Server procedures, explaining why I believe it’s more practical to leverage existing tools like Copilot rather than building custom solutions. The video is packed with valuable information and actionable steps that can help you improve your database management skills.

Full Transcript

Erik Darling here with Darling Data and, well, Ben Huffin again, which means it’s time for some office hours. I will answer five of your questions that you submit through this link right here, this beauty, which is down in the video description. Speaking of the video description, if you would like to become a member, a paying member of this channel to support my efforts to bring you all this wonderful, high-quality SQL Server content, you can sign up for a membership in the exact same place. I don’t know why this thing has me turning so pink right now. It’s a little unsettling. I am not normally radiation poisoning red, but I don’t know. Maybe it’s just how I feel inside. If you do not care that much about me with money, you can like, you can comment, you can subscribe, and you can ask me questions for these here episodes of Office Hours, which is always a highlight. If you need help from a SQL Server consultant, I am one of them. One of too many, I think. When I look around LinkedIn, it’s absolutely saturated with idiots who have “expert” in their LinkedIn profiles, and then you see the things that they say and write about, and you wish that you could bring back firing squads.

None of them would make it. If you need help with SQL Server, health checks, performance analysis, hands-on tuning of your SQL Server workloads, responding to SQL Server performance emergencies, and of course, training your developers so that you don’t have those anymore, and you can go back to drinking or huffing compressed air in peace, I’m available. And as always, my rates are reasonable. If you would like some performance tuning content from me, I have 24 hours of it available. With the discount code, it’s about $150, and that’s pretty good. That $150 will last you the rest of your life. So good for you if you choose to look that far into the future. Bet you have a 401k or something. A real whiz kid.

I, of course, have a new course coming out. Learn T-SQL with Erik. That’s me. That’s Erik right there. This fella. All this content is being recorded leading up to Kendra and me doing our PASS pre-cons, so attendees to those will get access to all of this as part of their admission. You buy tickets. You come, you get access. The course is available now for everyone else for a pre-sale price of $250 that will go up to $500 when everything is said and done. The videos will start dropping shortly. Well, actually, this is going out after May 10th, so videos will have already started dropping by the time this thing sees the light of day.

I record the office hours ahead of time and schedule them for every Monday, so that’s why sometimes the dates are a little funky on me. So I apologize, but it would be a lot of work to fix that, so we’re not going to do it. I did manage to fix this slide, though, because SQL Saturday New York City will have already happened, so we no longer need to talk about that.

But we still do need to talk about PASS being on tour this summer. New York City, August 18th to 20th. Dallas, September 15th to 17th. And Amsterdam, October 1st to 3rd.

Of course, all of that is leading up to just warming us up, getting us nice and loose and limber and getting the blood flowing and muscles in order for PASS Data Community Summit in Seattle, November 17th to 21st. So we’ll be doing as much as we humanly possibly can to do all that.

But with that out of the way, let’s answer some questions here on this office hours episode. Oh, we got some long ones in here. Boy, oh boy, oh boy.

Y’all like typing so much. All right. I work for a small startup, seven pipples, three on SQL Server, including the boss. That’s a mistake.

Never let your boss use SQL Server. Our software is a complex ERP system with business logic heavily implemented on SQL Server. We are planning to move to Postgres. Is Postgres a good platform for complex business logic and performance?

I don’t know. I don’t use Postgres. But what I can tell you is the number of people who have said the exact same thing to me. We’re going to move to Postgres.

Postgres. One of my oldest clients... I started this consultancy in 2019, and I have had this client since April... no, March of 2019, to today. Back in 2019, they said, we’re going to be on Postgres in six to eight months. You know where they’re not?

Postgres. Postgres. If you want to dip your toes in it, try moving some little pieces of it over to Postgres. See how it goes. See how it handles whatever tiny little bits of logic you have. Maybe it’ll go well.

Maybe it’ll go not well. But if your idea is to just move the whole thing over, I will see you in six to eight months. You goofballs.

Where do you get this stuff from? Boy, oh boy. Oh, that’s the wrong... that’s the wrong ZoomIt option. Hey, Erik.

Hey. How you doing? What deciding factors should be considered when marking an entire table for update statistics? Is it based on how fast the data changes or is it okay to run the update statistics overnight, irrespective?

Oh, gosh. Are you… So, like, when… Reasons to update statistics, you know? You got a table that, you know, maybe sees a lot of modifications.

Maybe the statistics fall out of date pretty quickly. You know, the big thing that I need to tell everyone about update stats is, you know, the update stats isn’t generally there to help you with, like, data that already was in the stats histogram that’s changed.

Usually, the beauty of update stats is to get new data into the histogram that is not currently represented in it. So, if you are constantly adding new data to the table, then I think it is a good idea to update your statistics for that table. So, you know, you don’t have to deal with any off-histogram searches against your table that might not have any representation.

It might get you a pretty bad cardinality estimate. So, yes, update your statistics. But, you know, overnight is fine.

But I’ve worked in some environments where we had to update stats way more often than that. And some other environments where not only did we have to update stats way more frequently than overnight, like every hour, half hour or so, we also had to do it at a very specific sampling percent in order to get a good histogram for things.

So, you know, you might find that the overnight… Just the overnight stats update is not enough. Don’t be afraid to update your stats more often.
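
For reference, the table-level command is simple. Here’s a minimal sketch; the table name and sample percent are placeholders, not recommendations:

```sql
/* Update every statistics object on one table,
   letting SQL Server pick the sample size. */
UPDATE STATISTICS dbo.Posts;

/* Or pin a specific sampling percent, for the cases where
   the default sample builds a misleading histogram. */
UPDATE STATISTICS dbo.Posts
WITH SAMPLE 50 PERCENT;
```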

All right. It’s hard to tell where this one ends. That’s a lot of writing. All right.

Hi, Erik. Hello. I have a top wait type of SOS_SCHEDULER_YIELD, which is way higher than the second one. SOS_SCHEDULER_YIELD is 75 hours of a 230-hour sample. PAGELATCH_SH is five hours total.

Should I focus mostly on CPU-intensive queries? Or could this also be connected with high CPU ready time on the VM side? So you’re going in a few different directions here.

And it seems like you’re doing a lot of analysis, but not a lot of actual fixing of things. I think you are generally right that you should be tuning CPU-intensive queries. What I’m concerned about here, though, are a couple things.

One, you have that much SOS_SCHEDULER_YIELD, which is 75 hours. And then the next wait down is PAGELATCH_SH at five hours. There’s no real mention of, like, CX waits, like CXPACKET, CXCONSUMER, any of those.

So I’m concerned that you may have MAXDOP set to one, either at the server or database level. Perhaps you have cost threshold for parallelism set to some absurdly high value. Or maybe you have just a crap ton of scalar UDFs that force your entire query to run single-threaded.

But so if it were me looking at your server, first I would be trying to figure out where’s all the parallelism? Because it doesn’t seem like you have any, at least not that you mentioned. Maybe there are some, you know, like background tasks that are eligible for parallelism that don’t bring it up past the five-hour mark in 230 hours of uptime, which would be interesting.

But that would be my main concern, is why you’re not getting any parallel queries on this. I don’t even know if that’s a thing for SharePoint anymore, to be honest with you. But, like, besides that, yeah, I mean, like, generally, yes, I would go after CPU.

I would go after queries that have the highest CPU on here. And I would also look at queries that have a very high execution frequency. So you could use sp_QuickieStore to look both by average CPU and by executions.
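
A minimal sketch of those two passes with sp_QuickieStore, assuming Query Store is enabled in the database you point it at (the database name is a placeholder):

```sql
/* Top queries by average CPU. */
EXECUTE dbo.sp_QuickieStore
    @database_name = 'YourDatabase',
    @sort_order = 'cpu';

/* Top queries by execution count, to catch the frequent flyers. */
EXECUTE dbo.sp_QuickieStore
    @database_name = 'YourDatabase',
    @sort_order = 'executions';
```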

That would give you some pretty good insight into both the stuff that takes a long time when it runs and the stuff that, like, uses a lot of, like, very choppy CPU, right? Things like scalar UDFs would be especially prone to that, especially because, you know, like, they don’t run once per query. They run once per row that they have to process, which leads to a lot of additional scheduling.

I would also potentially, maybe not, like, very, very concerned, but I would also maybe be a little worried that you don’t have a lot of CPUs on this thing. And maybe queries might be waiting a long time to get CPU attention. That would tie into, like, the high CPU ready time on the VM side, potentially.

But, you know, that’s more than I can dig into with the text you’ve given me. In general, I wouldn’t add more observer-ness to this. I wouldn’t start logging more stuff.

I would just start going after meaningful queries that, like, either, like, run and take a long time or run a lot. They, you know, like, again, scalar UDFs, loops, cursors, like, loops and cursors in scalar UDFs, stuff like that. There’s a lot of things that I would go after there.

All right. Next question. I see Brent is going to put AI in his procedures. Are you planning on doing that too? So, no.

And so, like, you know, AI is, or LLMs, however you wish to refer to it. They’re always an interesting topic. You know, you don’t really want to watch where the ball is. You want to watch where the ball is going, right?

It’s not, like, just because things are, like, not great now doesn’t mean that they won’t be great down the line. You know, AI and LLMs, they are constantly improving. But we as humans tend to stay relatively the same.

We don’t make, like, big leaps and bounds in our abilities very quickly. And also, you know, when you think about, like, a lot of stuff that has, like, really fundamentally changed the way that humans interact with the world around them. Like, just, you know, technology-wise, technology side, you know, cars were invented, but it was a long time before cars were widespread.

You know, PCs were invented, but it was a long time before everyone had a PC. Never mind a PC sitting in their pocket, right? Like, cell phones.

Then, you know, the internet was around for a long, long time before everyone’s house was wired with internet. And we have satellite space internet now. So, like, AI, like, for whatever it is now, like, maybe, like, down the line, like, there’s going to be a lapse between, like, you know, the current hype cycle of everything AI all the time.

And, you know, like, what it actually ends up, like, how it actually ends up becoming a part of our lives, who knows. But, like, for me personally, putting AI into this stuff doesn’t make a lot of sense. Like, it would be highly speculative for me to say this, but, like, I saw Brent’s post about him doing it.

But it would only make sense to do that if I were planning on releasing my own, like, Erik Darling SQL Server agent that would, like, you would pay me to answer questions for you from these things, right? Like, that’s the only way it would make sense for me to do it. I don’t know if Brent’s doing that.

I don’t know if that’s his plan. Just because, like, right now, like, think about, like, most people’s usage of LLMs, right? Like, Copilot’s going to be in SSMS, right? So, like, whatever results you get from an SSMS query, it’s not a far trip to Copilot whatever comes back, whether it’s, you know, query plans or indexes or whatever.

Most people will have, like, have access to an LLM sitting next to them, right? So it’s like, you know, you could, like, you could copy something from SSMS, put it in a browser tab. So, like, it doesn’t make sense, like, for me to build anything into it that seems like if someone wants to take those results and do something with them in an LLM, it’s not a very long trip for them to do that on their own.

So, like, I don’t have any plans of, like, creating or introducing my own AI agent yakky mouth thing. So, like, I wouldn’t build anything into mine unless there was a way for that to, like, there’s a lot of work, right? Like, the security and figuring all that out.

A lot of work on that side. So I wouldn’t necessarily do that unless it were going to be, like, a profitable venture for me. So maybe Brent has something figured out that I don’t, if it has to do with marketing, he probably does.

There’s a 99% chance that he’s far ahead on that. But for me, I can’t see a lot of upside to putting that into my procedures in, at least as things currently stand with me and my life. All right.

So last question here. Why would I use dynamic SQL to defeat parameter sensitivity when I can just use OPTIMIZE FOR? Well, I mean, I’m not sure why you’re asking me this. It’s a bit silly.

If you can figure out a way to get your parameter sensitivity problems fixed with just using OPTIMIZE FOR, go for it. The reason why I present the options that I do is because perhaps there are situations where OPTIMIZE FOR does not solve your parameter sensitivity issues. Perhaps it just gives you one big plan that happens to make you happy in the moment.

I don’t know. You also left off what you’re optimizing for. If you’re going to say UNKNOWN, I’m going to come to your house and I’m going to perch on your bedpost and stare at you while you sleep.

Because that’s a nightmare. And you deserve nightmares for what you do to SQL Server. If you’re optimizing for specific values, that can certainly work.

I’ve done it in the past. But using dynamic SQL to fix parameter sensitivity issues is a good tool to have around just in case any of that stuff doesn’t work. And if you’re talking about my defeating parameter sensitivity with dynamic SQL video, one of the methods that I use in there actually does use a dynamic OPTIMIZE FOR.

So you can combine both of those into one big happy family of stuff to do. So if you can get away with optimizing for some values and fixing parameter sensitivity, great. You can stop right there.
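
To make that concrete, here’s a minimal sketch of the combination: dynamic SQL whose text, and therefore whose cached plan, changes along with a dynamically applied OPTIMIZE FOR hint. The procedure, table, and the values being special-cased are all hypothetical:

```sql
CREATE OR ALTER PROCEDURE dbo.GetPostsByUser
    @owner_user_id integer
AS
BEGIN
    DECLARE @sql nvarchar(MAX) = N'
    SELECT p.Id, p.Score, p.CreationDate
    FROM dbo.Posts AS p
    WHERE p.OwnerUserId = @owner_user_id';

    /* Branching on something you know about the data distribution
       gives big and small users different query text, and
       therefore separately compiled and cached plans. */
    IF @owner_user_id IN (22656, 1144035) /* hypothetical outliers */
        SET @sql += N'
    OPTION (OPTIMIZE FOR (@owner_user_id = 22656));';

    EXECUTE sys.sp_executesql
        @sql,
        N'@owner_user_id integer',
        @owner_user_id;
END;
```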

If you run into a problem that OPTIMIZE FOR doesn’t solve, then you might need to reach a little deeper into your tool bag. Maybe even down into the dynamic SQL pocket and do a little bit more typing. But anyway, that’s it for today.

That was all five questions. Thank you for watching. I hope you enjoyed yourselves. I hope you learned something. And I will see you in the next office hours. And also the next video and so on and so forth into eternity.

All right. Thank you for watching. Goodbye. There’s the button. There we go. All right.

Now, now, now. All right. We’ll do that one more time, since I didn’t stick the landing. Thank you for watching. Goodbye. Goodbye.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

Updates To My SQL Server Performance Troubleshooting Scripts

Snappy Name


It’s been a busy few months working on scripts in my GitHub repo, including adding two new members to the family.

  • sp_IndexCleanup: Easily deduplicate indexes — scripts all the changes out for you!
  • sp_PerfCheck: A high-level review of performance-related settings and configurations!
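
Quick usage sketches for both; the parameter shown is an assumption on my part, so check the repo documentation before running anything:

```sql
/* Script out index deduplication changes for one database. */
EXECUTE dbo.sp_IndexCleanup
    @database_name = 'YourDatabase';

/* Review performance-related settings and configurations. */
EXECUTE dbo.sp_PerfCheck;
```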

Here are the main changes from lovely contributors:

  • sp_QuickieStore: Appended _ms where it was missing for avg_cpu_time by @ReeceGoding in https://github.com/erikdarlingdata/DarlingData/pull/600
  • sp_QuickieStore: Moved validation of @sort_order to be much earlier on. by @ReeceGoding in https://github.com/erikdarlingdata/DarlingData/pull/602
  • sp_QuickieStore: Moved special sorting columns for special sort order values to be jus… by @ReeceGoding in https://github.com/erikdarlingdata/DarlingData/pull/604
  • sp_QuickieStore: Made plan hash, query hash, or sql handle show when the parameter for filtering them out is passed in by @ReeceGoding in https://github.com/erikdarlingdata/DarlingData/pull/608
  • sp_QuickieStore: Documents that @format_output = 1 removes most decimals. by @ReeceGoding in https://github.com/erikdarlingdata/DarlingData/pull/606
  • Adding support for indexed views in addition to indexes on tables. by @FirstCall42 in https://github.com/erikdarlingdata/DarlingData/pull/610
  • sp_QuickieStore: Move regression validation to just after sort order by @ReeceGoding in https://github.com/erikdarlingdata/DarlingData/pull/612
  • Made sp_HumanEventsBlockViewer not error out when system_health is used by @ReeceGoding in https://github.com/erikdarlingdata/DarlingData/pull/618

You can download all the scripts here.

Happy tuning!


My Upcoming Speaking Schedule

Busy Summer


The nice folks at Red Gate have decided to put me to work.

That means I’m going on tour, and maybe getting some socks and a Hawaiian shirt.

No word on a “Lego Erik” yet.

PASS On Tour Events:

  • New York City: August 18th to 20th
  • Dallas: September 15th to 17th
  • Amsterdam: October 1st to 3rd

PASS Data Community Summit:

  • Seattle: November 17th to 21st

Of course, Kendra Little and I are back in action to teach back-to-back T-SQL precons.

See you out there!


Learn T-SQL With Erik: TOP and OFFSET/FETCH

Video Summary

In this video, I delve into the world of T-SQL paging queries, specifically focusing on the use of `TOP` and `OFFSET FETCH` in SQL Server. I explore how these techniques can be used to efficiently retrieve large sets of data while maintaining performance. I demonstrate practical examples using Common Table Expressions (CTEs) to illustrate both methods, providing a clear understanding of when and how to apply them effectively. Whether you’re just starting out with T-SQL or looking to enhance your skills, this video offers valuable insights into optimizing query performance for pagination scenarios.

Full Transcript

Erik Darling here with Darling Data. And in today’s video we’re going to continue our T-SQL love fest, where I’m previewing some of the content from the beginner portion of the T-SQL course that I’m working on. You can of course purchase this course down in the video description for the pre-sale price. Once the course is fully published with the beginner and the advanced material, the price will go up to double what it is now. So now is a good time to buy. And of course, if you are attending Kendra’s and my pre-cons at PASS Data Summit in Seattle this November, you will get access to this content for free with your admission to our pre-con. There we go. Anyway, let’s talk a little bit about OFFSET/FETCH then, and first TOP, at least in the context of the one place where they really compare well, which is writing paging queries. Now, assuming that you’ve written TOP, like, say, SELECT TOP some number of rows from some query, ORDER BY something... there had better be an ORDER BY in there; at this point I’ve hammered home that you must ORDER BY when you use TOP.

But assuming that you’ve written TOP without WITH TIES or PERCENT or anything, you get that number of rows back. If you say select the top 100 rows, but you only have 95 rows that qualify for your query results, SQL Server does not invent five rows just to make you happy. If you say select the top 100 rows WITH TIES, you can get some kind of weird stuff back. By weird, I mean that if you’ve done it enough, you know what to expect, but it might look strange to someone who has not done it before. So just to show you what I mean there, if we say SELECT TOP (1) WITH TIES where Reputation equals 3622, we say top one, but there are 11 ties. So we get 12 rows back.
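
Sketched out, the demo looks roughly like this, using the Stack Overflow Users table from the video (the column list is illustrative):

```sql
SELECT TOP (1) WITH TIES
    u.Id,
    u.DisplayName,
    u.Reputation
FROM dbo.Users AS u
WHERE u.Reputation = 3622
ORDER BY u.Reputation;
/* TOP (1), but every qualifying row ties on Reputation,
   so all 12 rows come back. */
```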

All right. Look over here. There are 12 people with a reputation of 3622. Guess what? They all tied. So for TOP (1) WITH TIES, we get 12 rows in this case. Kind of fun. But TOP does not have a natural syntactical offset in T-SQL, right? Because T-SQL is the only language that has TOP in it. Most other languages use LIMIT, but most others have also incorporated OFFSET/FETCH at this point. What OFFSET/FETCH gives you is the ability to not only limit the results the way TOP would, but also skip rows within that result set to page through things.

So, like, the FETCH portion of OFFSET/FETCH is like the top n rows, right? You say OFFSET n ROWS, FETCH NEXT n ROWS ONLY. That gives you the TOP-ish behavior. The OFFSET is what gives you the skip behavior. So if you want skip behavior with TOP, you have to do a little bit more work, generally using the ROW_NUMBER window function.
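
In its simplest form, OFFSET/FETCH looks something like this (the values and ordering columns are illustrative):

```sql
DECLARE
    @page_number bigint = 3,
    @page_size bigint = 100;

SELECT
    u.Id,
    u.DisplayName
FROM dbo.Users AS u
ORDER BY
    u.Reputation,
    u.Id
OFFSET (@page_number - 1) * @page_size ROWS
FETCH NEXT @page_size ROWS ONLY; /* skip two pages, return the third */
```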

So the way to think about paging queries in general is to first mentally separate your columns into relational columns and informational columns. By relational columns, I mean things that you’re going to perform some relational activity on, whether it’s a where clause, whether it’s a join, whether it’s a group by, whether it’s an order by, stuff like that. Those are your relational columns. And then you have your informational columns. Your informational columns are just the ones that you show the end user. So when you say, you know, select these columns, those select columns are your informational columns.

There might be some overlap, of course, but in general, if it’s only in your select list, it’s informational. If you are saying like, you know, from this table where this column equals something, join to this other table on this expression, then those are relational columns. When you’re writing paging queries, you want to avoid getting the informational columns for as long as possible.

You want to do all of your initial filtering, all of your initial joining, any grouping, ordering, all that good stuff with as narrow a set of columns as possible. This gives you a much better chance at consistent index usage, and you won’t blow your server out with gigantic memory grants from having to sort a big long list of every string column you’ve ever put into a table in VARCHAR(MAX) because your developers are lazy. You avoid a lot of performance issues with paging queries that way.

So that’s really the main gist of it. Now, without getting too deep into indexing for paging queries, because that is a big, giant, complicated, messy hornet’s nest of a topic, we’re going to skip ahead to just kind of show you what a good starting point looks like. Like when people say this house has good bones, we’re going to say this paging query has good bones, right?

So what we’re going to do here is show you what a good starting point for a paging query looks like. All right. So like notice that we’re only selecting one column. That is the clustered primary key of the post table.

And we’re going to, again, you know, avoid selecting all the columns you want to show people for as long as possible. In this case, thankfully, it’s not too long, right? Because once we use this CTE to find the rows that we care about, filtering to the owner user ID that we care about, and applying our offset in our fetch, we are going to join that CTE, which is called paging, back to the post table just for the rows that we found in the CTE.
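
Here’s a sketch of that shape. The exact columns are assumptions based on the description, with LastActivityDate plus Id as the ordering used later in the video:

```sql
DECLARE
    @owner_user_id integer = 22656, /* hypothetical user */
    @page_number bigint = 1,
    @page_size bigint = 100;

WITH paging AS
(
    /* Relational work on the narrowest possible column set:
       just the clustered primary key. */
    SELECT p.Id
    FROM dbo.Posts AS p
    WHERE p.OwnerUserId = @owner_user_id
    ORDER BY
        p.LastActivityDate,
        p.Id
    OFFSET (@page_number - 1) * @page_size ROWS
    FETCH NEXT @page_size ROWS ONLY
)
SELECT p.* /* informational columns, fetched last */
FROM paging AS pg
JOIN dbo.Posts AS p
  ON p.Id = pg.Id
ORDER BY
    p.LastActivityDate,
    p.Id;
```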

All right. So if we run this, we will get back our necessary required 100 rows, and we will be very, very happy with the results. This was speedy enough in this case. Was it the fastest query I’ve ever written? I don’t know. I don’t know what the fastest query I’ve ever written is.

Probably a lot of zeros involved, but it gives us the results that we want, right? And it’s set up the way that we care about. There are two ways you can do this with top. The first way I’m going to show you just uses a single CTE, and in that single CTE, we generate a row number over results from the post table where the owner user ID equals the owner user ID that we care about out here.

Once we’ve located those rows, then we are going to do some math, some very, very fancy math, on the row number that we created in our CTE. That is f dot n. n is our row number right here.

Okay? And we need to do a little bit of math on our page number and page size variables so that we get the correct page, the correct number of rows positioned in the correct point in the results.

And, of course, we will need to finish up by ordering by our wonderful last activity date and tiebreaker ID columns. So if we run this, we’ll get the same results that we got back from the last query. And, again, lickety-split pretty quick there.
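
A sketch of that single-CTE ROW_NUMBER version, with the same assumed columns:

```sql
DECLARE
    @owner_user_id integer = 22656,
    @page_number bigint = 1,
    @page_size bigint = 100;

WITH fetching AS
(
    SELECT
        p.Id,
        n = ROW_NUMBER() OVER
            (
                ORDER BY
                    p.LastActivityDate,
                    p.Id
            )
    FROM dbo.Posts AS p
    WHERE p.OwnerUserId = @owner_user_id
)
SELECT p.*
FROM fetching AS f
JOIN dbo.Posts AS p
  ON p.Id = f.Id
/* the very fancy math on the row number and paging variables */
WHERE f.n > (@page_number - 1) * @page_size
AND   f.n <= @page_number * @page_size
ORDER BY
    p.LastActivityDate,
    p.Id;
```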

Same basic deal, only selecting the ID column and then joining to the post table outside here. The second way is a little bit more complicated. And this way is... I’ve told this story many times, but way back in, like, 2009, I had only been working with SQL Server a very short amount of time, but I was tasked with writing a paging query, and I found this blog post by Paul White.

And this was the method that I learned from Paul in that blog post, where you use not one, but two stacked CTEs. Now, if you’ve heard me talk about CTEs before, I do want to say that this format of writing CTEs, where one CTE draws from the one before it, doesn’t have the performance issues that I’ve talked about with CTEs a lot in other videos.

If you haven’t seen those, hang on, because there are CTEs coming up in a few videos here. But if we run this query, what we’re going to see is, in this first CTE, again, we’re only selecting the narrowest set of columns that we need, but up at the top, we do the page number times the page size.

And what this gives us is a somewhat bigger result for us to page through. So, if we’re on, I mean, currently we’re on page number one, and with a page size of 100.

So, page number times page size gives us 100. If we were in the top 200, then it would be page two, sorry, if we wanted page two, then it would be two times 100, so top 200.

So, this part of the CTE does give us a longer result set the deeper we go in. But what we immediately do is filter that out here, where we only select the top page size from the previous CTE, where the row number is greater than the page number minus one times the page size.

So, for, like, page one, we get the 100, page two, we get 200, all that other stuff. And we’re going to do the same thing out here, where we join back to the post table to get all of the columns that we care about, and again, ordered by the two columns that we care about for, A, the initial presentation ordering of the results, and then the ID for the tiebreaker, just in case there are any dupes and last activity date.
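
Reconstructed from that description, the stacked-CTE shape looks roughly like this; treat it as a sketch rather than the exact code from the demo:

```sql
DECLARE
    @owner_user_id integer = 22656,
    @page_number bigint = 1,
    @page_size bigint = 100;

WITH keying AS
(
    /* A somewhat bigger result the deeper you page:
       page number times page size rows. */
    SELECT TOP (@page_number * @page_size)
        p.Id,
        n = ROW_NUMBER() OVER
            (
                ORDER BY
                    p.LastActivityDate,
                    p.Id
            )
    FROM dbo.Posts AS p
    WHERE p.OwnerUserId = @owner_user_id
    ORDER BY
        p.LastActivityDate,
        p.Id
),
paging AS
(
    /* Immediately trimmed back down to just the requested page. */
    SELECT TOP (@page_size)
        k.Id
    FROM keying AS k
    WHERE k.n > (@page_number - 1) * @page_size
    ORDER BY k.n
)
SELECT p.*
FROM paging AS pg
JOIN dbo.Posts AS p
  ON p.Id = pg.Id
ORDER BY
    p.LastActivityDate,
    p.Id;
```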

So, if we go and run this query, we will get, again, the same results back, and it will be just as lickety-split-quick as the other two. Right.

We’re not doing anything too big and crazy here, but, you know, as paging queries get bigger and more complex, that’s when we need to start thinking about other things like indexing, and, you know, perhaps some additional query tricks to stabilize result sets.

But for most paging queries that you write, you will want to use one of these three methods and just sort of test them and figure out which one performs the best, based on what your normal workload is and what users are requesting.

Again, OFFSET/FETCH tends to suffer a little bit performance-wise as you get deeper into result sets, but if your users aren’t typically going very deep into result sets, then it’s not really a concern.

Anyway, thank you for watching. I hope you enjoyed yourselves. I hope you learned something. And I hope that you’ll buy the course, because it’s a good one and I put a lot of good stuff into it.

Anyway, thank you for watching. Goodbye. Goodbye.


Learn T-SQL With Erik: ORDER BY

Video Summary

In this video, I delve into the intricacies of non-deterministic ORDER BY clauses and unexpected query results in T-SQL, drawing from my upcoming course that is currently on presale. With the material nearing completion and tech review underway, I’ve already started recording some sections, which will begin appearing on the training site shortly. This video serves as a sneak peek for those who might be attending the pre-cons at PASS Data Summit in Seattle, where you’ll gain free access to this valuable content. For everyone else, now is the time to act if you want to avoid doubling your regret later: grab the presale deal before prices go up. We explore how non-deterministic ORDER BY clauses can lead to inconsistent query results and discuss best practices for ensuring stable outcomes in T-SQL queries. Through practical examples, I demonstrate the importance of incorporating unique columns as tiebreakers and highlight common pitfalls that developers might encounter.

Full Transcript

Erik Darling here with Darling Data, and in today’s video we are going to go over a little bit of the material from my T-SQL course. You know the usual spiel about this stuff: it is on presale right now. The course is completely written at this point, just finishing up the last bits of tech review and tidying up, and I’ve already started recording some of the already done content. So that’ll start appearing over on the training site in the next week or so. Of course, this is the companion material to the stuff Kendra and I will be teaching at PASS Data Summit. If you’re attending the pre-cons in Seattle, then you will get free access to the material. If you’re not attending the pre-cons in Seattle, you get nothing. You have to buy it just like everyone else. Once the material is fully published, of course, it will be going up to the full price, which of course will be double the presale price. So, get in now to have half the regret later on. Who knows what you could have done with the other half of the money that would have been good for you. You could have bought so many lottery tickets. Anyway, we’re going to talk a little bit about ORDER BY. And if you find yourself watching this material and saying, yeah, Erik, I know, well, congratulations. This is the beginner material and you just might not be a beginner, which is wonderful for you, right? But for those who have not yet progressed to your lofty place in the T-SQL world, this sort of education is necessary. So, we’re going to talk a little bit about ORDER BY, and specifically non-deterministic ORDER BYs.

And then, of course, expecting queries to return in a specific order when there is no ORDER BY, which is another thing that people tend to do. They observe a query running and returning results in a specific order. And then if it ever doesn’t return the results in that order, they freak out. They’re like, but it was always this way before. So, we’ll do that a little bit, too.

So, first, let’s look. Query plans are turned on. Look at that. Baked right into the demo. You’d think I’ve been practicing or something. If we run this query a few times, what we’re going to see is sort of in general, the results, like just over here in the ID column, are going to come back in all different orders. And, of course, the reason for that is that this particular sort order is non-deterministic.

There’s no index on reputation currently, and the values at the beginning of the reputation column are all ones. Right. And so, all we’re doing is ordering by the ones, but the ones can come back in any order that they want. Right. As long as they’re just ordered over here. So, if we keep running this a few times, we’re going to notice that the ID column flips around quite a bit.

Right. This looks different from the last one. We were on this and just keep running it. Now negative one is at the top. Negative one is still at the top with some other stuff switched around. All right. We just keep doing this over and over again. We’re going to see just like different results depending on like threads and timing and all sorts of things.

So, if we want a deterministic ORDER BY, if we want to always present results in a specific way, we need to incorporate a unique column as sort of a tiebreaker in our ORDER BY elements here. Right. So, we need to add in, in this case, the ID column from the users table, which is the clustered primary key. Right. So, if we run this, then not only are we ordered by reputation here, but to break ties and make sure the ordering is deterministic, we now order by the ID column as a secondary element.

So, we order first by reputation, but then for duplicates in reputation, we order by ID. And this will return absolutely stable results every single time. We can’t mess with this ordering, because of the uniqueness of the ID column.
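
The before-and-after, sketched against the Users table (the column list is illustrative):

```sql
/* Non-deterministic: rows that tie on Reputation can come
   back in any order, and that order can change run to run. */
SELECT TOP (100)
    u.Id,
    u.Reputation
FROM dbo.Users AS u
ORDER BY u.Reputation;

/* Deterministic: the unique Id column breaks every tie,
   so results are stable on every run. */
SELECT TOP (100)
    u.Id,
    u.Reputation
FROM dbo.Users AS u
ORDER BY
    u.Reputation,
    u.Id;
```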

Now, this behavior can be sort of misleading to a lot of people. If their data doesn’t have a lot of duplicates, just some duplicates, right? So, if we flip the ordering of reputation and we just say order by reputation descending, then, you know, there’s not a lot of overlapping values at the top of the food chain here.

The reputation column is like, largely like, once you get up this high, not a lot of people have like, duplicate reputations where you would need a sort of tiebreaker thing. So, you could observe behavior like this working quite well for a while. But then, after a certain point, you are going to start hitting duplicates and things might start flipping around on you.

So, just for example, the first reputation at the high end that I could find that had a duplicate value was 160303. And sometimes this takes a few runs to kind of get to work correctly. But here you see the reputation 160303 at the very top here.

I don’t think there are any other dupes down below. But you see IDs 206403 and 19679. If we just kind of run this enough times, you’ll see those two rows flip back and forth.

And that’s because, you know, they are duplicates. And so, SQL Server only has to return this thing in order to a certain point. But within duplicates in there, it can return those in any sort of flip-floppy way that it wants, right?

We don’t know which one we’re going to get back. This might not seem like a very big deal to you in this pretty narrow case. But I guarantee you, at some point in your T-SQL developer career, you are going to have someone file a bug report and say something like, Hey, I’m getting weird results back here.

And you’re going to have a non-deterministic ORDER BY somewhere in your query that is messing things up for everyone. So, let’s just look at a quick example of where you might see the results change if your database changes, right? So, let’s pretend that for years we had this table and this query.

And every time we ran this query... I’ve got a couple index hints on here just to show you what happens when we use different indexes. And so, this first query is always going to be hinted to use the clustered primary key on the Posts table. And this query is always going to be hinted to use the nonclustered index on LastEditorDisplayName on the Posts table.

That’s the index that we created right here on LastEditorDisplayName. It’s very, very explanatory. It’s almost nearly self-documenting, if you want to feel special about it.

But let’s just say that for years we had this query running. And every time this query ran, we got results in a specific order and everyone was happy. And at some point, someone was like, oh, well, there’s a missing index request or something.

We’re going to add this index and SQL Server is going to use this index and we’re all going to be thrilled. And if we run this, just run these back to back... ZoomIt is getting a little weird on me.

You’ll see that these two queries return data in very different orders, right? If we just kind of make that about half and half, we just get this stuff back. Like these two result sets are just way different, right?

5, 2, 2, 2, 2, 3, 4, 2, 2, whatever. This one, 2, 2, 2, 2, 3, 2, 2, 4, 18, 4, 2, 16. So we just got way different results back running the same query just using different indexes.
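
The two hinted queries look roughly like this. The aggregate shape is my assumption from the plans described next, and the hint targets are illustrative:

```sql
/* Hinted to the clustered primary key: a hash match aggregate,
   which carries no ordering guarantee at all. */
SELECT
    p.LastEditorDisplayName,
    records = COUNT_BIG(*)
FROM dbo.Posts AS p WITH (INDEX = 1)
GROUP BY p.LastEditorDisplayName;

/* Hinted to the nonclustered index: a stream aggregate over
   index-ordered data, so rows happen to come back in index order. */
SELECT
    p.LastEditorDisplayName,
    records = COUNT_BIG(*)
FROM dbo.Posts AS p WITH (INDEX = LastEditorDisplayName)
GROUP BY p.LastEditorDisplayName;
```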

If we look at the execution plans, we get some sense of why. Well, we do an index seek into the post table here and then we do a hash match aggregate. And of course, hashes, when you see a hash, there’s absolutely no ordering required for that, right?

Hashes just mush everything together and throw it along to the next thing. And then the second query, we do an index seek into the nonclustered index. And then we have a stream aggregate.

And stream aggregates do require sorted data. If we didn’t have an index on LastEditorDisplayName and SQL Server chose a stream aggregate, it would have to pre-sort the data for us.

In this case, we have the index, though, so we’re good there. So SQL Server just streams the results into there and shoots it along to the next operator, the filter operator. But in both of these cases, like neither one has an order by, but they both return results in slightly different order because the query plan is different and they use a different object to initially access data in the table.

So this is the kind of stuff that you’re going to have to deal with as a T-SQL developer, as you’re writing queries, as you’re troubleshooting bugs and results and stuff: non-deterministic ORDER BYs, or just expecting results to always return in the same order no matter what, without any ORDER BY whatsoever.

Two very, very big mistakes that people make quite a bit. Anyway, thank you for watching. I hope you enjoyed yourselves. I hope you learned something. I hope you will consider purchasing this course for the pre-sale price. The link is down in the video description.

If not, I will see everyone who is going to get this for free at PASS Data Summit, I guess, in November. Anyway, thank you for watching. Goodbye.


SQL Server Performance Office Hours Episode 16

To ask your questions, head over here.

I’ve heard you talk about your career path a few times, and it’s pretty weird. Do you have any regrets? Are you still happy with what you do?
Do you have differing approaches for performance tuning an OLAP system vs an OLTP system?
Do you know of any disadvantages of using a filtered index to filter NULL values? We have a very heavy transactional table, like 10k transactions/sec, with a clustered index and one nonclustered index. We don’t have any queries that select rows with NULL values from this table. The DBA team said we should avoid using a filtered index, without any proof. What do you think?
In all your demos you compress (page) your indexes. Do you default to that with all your client workloads? Do you see more benefit than negative impact in your experience? Thanks!
I’ve seen you suggest columnstore for paging and dynamic searches. How do you make your non-clustered columnstore indexes perform acceptably on tables where all of the data is hot?

Video Summary

In this video, I dive into a series of questions from viewers, covering topics ranging from my career path and the challenges it has brought to performance tuning strategies for OLAP versus OLTP systems. I share insights on filtered indexes, page compression, and columnstore indexes, addressing common concerns and providing practical advice based on real-world experiences. Whether you’re curious about the ups and downs of a SQL Server consultant’s life or looking for tips on optimizing your database performance, there’s something here for everyone. Additionally, I discuss my upcoming courses and events, including the new “Learn T-SQL With Erik” course and PASS Data Community Summit in Seattle, offering early access deals that you won’t want to miss. So, if you have any burning questions or need some expert guidance on SQL Server performance, this video is a must-watch!

Full Transcript

Erik Darling here with Darling Data. And today me and my pal Bats here are going to kick off an Office Hours episode where I’m going to answer five of your most burning pertinent questions about life, love, high finance, extreme fitness, and of course SQL Server performance. So that’s what we’re doing today. If you like this channel and you want to support it with money, you can sign up for a membership. There is a link in the video description. If you like this channel, but not in a way that is monetarily beneficial to me, you are free to (for this is absolutely free) like, comment, subscribe, and of course, ask me questions for these Office Hours episodes that I enjoy so very thoroughly. If you need help with SQL Server, perhaps in a way that goes beyond what a YouTube Q&A can do for you, I am available as a consultant to consult, to do these things live and, well, I mean, sort of in person on your SQL Servers. The usual stuff: health checks, performance analysis, hands-on tuning, responding to performance emergencies, and of course, training your developers so that you have fewer emergencies. No one likes heartburn, I guess. If you would like some performance tuning training from me, I have 24 hours of it available.

For 75% off, that brings the price down to about 150 USD, and that is a very good deal. I am also dropping a new course on T-SQL that is not part of that. It is a brand new thing. It is called Learn T-SQL With Erik. That’s me. It is in the works. It is up for tech review. I am recording things, and I’m going to be releasing it over the summer months, and it should all be complete by the time we get to PASS. There is going to be beginner and advanced material in it. If you are going to attend the pre-cons that I am doing at PASS, you will get free access to the material. But right now, the pre-sale price for the course is $250. Once all the material is fully built out and live, the price will go up to exactly $500. The price is going to double when the material is complete, because by then, I am probably going to want to make more money off this thing.

Upcoming events. Lots of fun. Well, by the time this goes live, SQL Saturday in New York City will have come and gone. I should probably delete that, huh? PASS is going on tour this summer. New York City, August 18th to 20th. Dallas, September 15th to 17th. And Amsterdam, October 1st to 3rd. And then, of course, all of those events presage the big one, PASS Data Community Summit, taking place in Seattle from November 17th to 21st.

So with that out of the way, let’s answer some questions over here. Let me just tidy this up a little bit. Let me bring that over a little so everything fits on the screen once ZoomIt decides the appropriate level of default ZoomIt-ness. Oh, so it’s a non-SQL Server question first. I’ve heard you talk about your career path a few times, and it’s pretty weird. Yes, it is. I forget when the last time I talked about it was. It was probably on that Simple Talk Redgate podcast.

Do you have any regrets? Are you still happy with what you do? Jeez, bare my soul, huh? So, yes, of course, there are regrets. I think most of them, though, are limited to me. I do a lot of consulting, and being on the phone a lot wears you down.

I always wish that I had more time to put into producing new training material for money, but it always seems like as soon as I’m like, wow, this week’s going to be nice and open for me to do all this stuff, there’s just like I start working on it. I’m like, yeah.

And then it’s like avalanche of new people need to have problems. Like, well, you know, I’m not getting any younger. You can’t say no to money, so that happens.

You know, I’m still mostly happy with what I do. Of course, there are ups and downs. Some days are more frustrating than others. But, you know, it has been weird.

And, like, you know, a lot of the weirdness with my career path was, like, prior to SQL. But there’s been plenty of weirdness with my career path, like, since then, too. You know, like, before I started working for Brent Ozar Unlimited back in 2015, I was, like, a relative nobody, right?

You know, I had presented a few times and I had, like, you know, done some stuff, but, like, I had a blog that I maintained lightly. But, you know, I, like, I didn’t know how weird parts of the SQL community were before I started working there. Like, I was not at the cool kids table at all.

Like, I had no idea that there was, like, so much, like, just crappy high school cliquey, like, stuff. So, like, when I started working there, you know, like, you find out about, like, all, like, the stuff that goes on, like, beneath the covers a little bit. And, you know, like, there are people, like, especially in the MVP community, who have just been pure nasty to me because I worked there, right?

Because, like, I have some, like, friendships and relationships with, like, Brent and people who work there. There are people who have just been awful to me throughout my career. And, like, they can all go fix cars for all I care.

But, you know, it’s one of the, like, like, like the weirdness didn’t stop when, like, you know, I stopped bouncing and got into databases. Like, it’s just been weird all throughout, right? It’s just, like, like, it’s strange stuff, right?

But, yeah, you know, no, like, no, like, giant regrets. Of course, you know, you know, there are things that I wish I had done differently and a bit more smartly when I first started my own consultancy up. And there’s stuff that I still wish I was, like, doing a little bit better at.

Like, you know, I’m not good at, like, SEO and marketing and all the other stuff. Like, you know, I can produce content. But, like, when it comes down to it, you know, I’m not, I am not a marketing master. So, you know, there’s stuff that I wish I was better at.

But, you know, my regrets are all my own. They’re not anyone else’s. So, anyway, let’s go on to the next question here, which is: do you have differing approaches for performance tuning an OLAP system versus an OLTP system? Well, yeah, of course.

You know, OLAP is all about throughput... sorry, OLTP is all about throughput, and OLAP is all about latency, right? You know, with OLTP, you need to be able to pound, pound, pound and get a whole bunch of stuff in and out very quickly. With OLAP, you need to have, like, big things happen faster, right?

So, you know, it certainly changes, you know, the things that you look at as far as, you know, like, which queries you go after. You know, like in an OLAP system, it might make total sense to go after things that take the, like, longest or use the most CPU to run. In OLTP, you do have to sort of balance that with, like, you know, what runs the most?

Is there anything we can do about this? Things like that. But, you know, there are, of course, differences, you know. Like, even, like, indexing strategies, OLAP, I’m going to push columnstore. OLTP, I’m going to push, like, narrow rowstore indexes, stuff like that.

You know, there are, of course, like, differences in those things. But, like, a lot of the environments I see are kind of mixed, right? Like, there’s, you know, there’s OLTP plus reporting, right?

Like, plus the OLAP stuff. So, you know, you do have to sort of balance out both, you know. OLAP is sort of, like, I don’t know. OLAP is interesting in a way because you have queries where, like, the expectation is that, yes, they’re going to take longer.

But, like, you also have to balance resource usage a lot differently, right? Because, like, you know, you might only have, like, four or five queries running at the same time. But, like, that’s where you really start running into, like, resource semaphore stuff.

You know, OLTP is typically where you start running into, like, thread pool stuff. Then when you mix them, you get both. It’s a real joy. But, yeah, there are different approaches to it. But, you know, like I said, a lot of the stuff that I see is kind of mixed.

So you do have to attack both sides of that coin when working on things. So let’s see here. The next question we have.

Oh, boy. It’s a lot of writing. Do you know of any disadvantages of using a filtered index to filter null values? We have a very heavy transactional table, like 10K transactions a second, with a clustered index and one nonclustered index.

We don’t have any queries that select rows with null values from this table. The DBA team said we should avoid using a filtered index without any proof. What do you think?

Well, what proof did they want you to present to them? That is the question. What proof? What are they looking for? How does one gather proof without being able to do an experiment in which evidence is gathered and some hypothesis is tested? So I don’t know what proof they want.

But, you know, the big thing with filtered indexes is they are really only terribly sensible if the filter is going to exclude a large number of rows. Usually you want to, you know, like if you’re only going to exclude like 25 or 30% of the rows, you’re not going to see a dramatic difference in query performance. Once you start getting to like the 50, 60, 75% range, that’s when you start to see bigger differences with things.

So, you know, first, you know, find out what proof they want. Second, figure out what your queries are doing. Figure out how many like null rows your filtered index would actually exclude and kind of go from there.

You know, take some of your queries... assuming you have some sort of development environment, create the filtered index that you care about and start testing queries against it. There’s not really a downside. The only thing that you must remember with your filtered indexes is: whatever column, like you said, you want to exclude NULL values on...

I assume there is a column, or maybe multiple columns, with NULL values that you are looking to exclude. Make sure that those columns are in the filtered index definition somewhere, not just in the filter, but as included columns too. So SQL Server doesn’t have to do as much guesswork.

It has those columns available to it, so it can evaluate whatever you want, lookup-free. There are a lot of peculiarities with the optimizer around filtered indexes, specifically with NULLs, that do sort of force you to have those columns in the index definition beyond just the filter.
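
Putting that advice together in a sketch, with a hypothetical table and columns:

```sql
CREATE INDEX ix_transactions_processed
ON dbo.Transactions
    (AccountId, TransactionDate)
INCLUDE
    (ProcessedDate) /* the filtered column, available lookup-free */
WHERE ProcessedDate IS NOT NULL;
```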

Let’s see what we got here. In your demos... hey, someone’s paying attention. Good for you. You compress (page) your indexes. Do you default to that with all your client workloads? Do you see more benefit than negative impact in your experience?

Yes, I do default to that. If you take a look at my new stored procedure, sp_IndexCleanup, there’s a part of the results where index merge statements are generated for you, to bring two indexes together so you can replace multiple indexes with a single index.

Any index create statement there gets created with page compression by default. There’s also a whole section of the results that scripts out adding page compression to your existing indexes. I use that.

I use the hell out of page compression. Most people who I work with have far more data than memory. And page compressing indexes... you know, aside from columnstore compression, but you can’t put columnstore on a lot of tables for a lot of reasons, which, actually, now that I’m looking ahead a little bit, I see is sort of in the next question.

But yeah, like, page compression makes your data smaller on disk and in the buffer pool. And you can make way better use of the hardware that you have. So if there is any IO-boundness to your workload, right, like you see a lot of PAGEIOLATCH_SH and PAGEIOLATCH_EX waits, you know, queries are just constantly going to disk.

You know, it could be when you look at wait stats as a whole, or when you run sp_WhoIsActive and you see all these queries bogged up waiting on reading pages from disk. Page compression can take some of the edge off that by having smaller objects to read, right? Like, having less data to bring from disk into memory is a faster process.

And then having that data be compressed in the buffer pool means that every object up there takes up less space in the buffer pool. So you have more space for more things, right? It’s sort of like those vacuum bags where the people pack their, like, you know, winter clothes in when summer rolls around and they suck them down.

You have these giant puffer jackets and blankets and stuff, and you bring them down to this nice tiny little compressed thing. It just gives you a lot more storage space. So it's the same basic idea there.

Okay, so the final question. What do we have here? I’ve seen you suggest columnstore for paging and dynamic searches. How do you make your non-clustered columnstore indexes perform acceptably on tables where all of the data is hot?

Hmm. Hmm. So if you pay very careful attention, one thing that I say is that columnstore indexes should generally be reserved for large tables. Usually, you know, a lot of the pamphlet material from Microsoft says at least a million rows.

I’m not sure that that’s where I start on. I think, you know, 5, 10, 15, 20 million are more sensible numbers there. So we’re going to run a question by you.

And that is, do you really have 5, 10, 15, 20 million row tables where all of the data is hot? You have a table with that many rows where people are just constantly updating all 5, 10, 15, 20 million rows. Do you really?

Honestly, truly have a table like that. That would be a very strange thing. Very, very strange thing indeed.

I think perhaps you’re misinterpreting where I suggest using these columnstore indexes. If you truly have a table like that, then that would be an interesting consulting engagement. Good Lord.

Yeah, so, you know, you are right that, you know, updating columnstore indexes does not quite always go as well as updating the rowstore indexes. But, boy, I think there is some attention that needs to get paid to the types of tables that make good candidates for columnstore indexes here. All right.

I’m out of breath. I’m winded. My allergies are terrible. My lungs are not at full capacity. So I’m going to go breathe for a little bit and then I don’t know what. But anyway, thank you for watching.

I hope you enjoyed yourselves. I hope you learned something and I will see you in whatever video we do next. All right. Cool. Thank you. Goodbye.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

Learn T-SQL With Erik: A Neat GROUP BY Trick



Video Summary

In this video, I share a quick and handy trick using `GROUP BY` in SQL Server, perfect for those Friday afternoons when you just want to get your work done efficiently before heading off to enjoy the weekend. This isn't a deep dive into the intricacies of `GROUP BY`; instead, it's a practical solution to an annoyance many of us face: how messy queries can become when grouping by expressions. I demonstrate how using `CROSS APPLY` with the `VALUES` construct can clean up your code and make it more readable without resorting to complex Common Table Expressions (CTEs) or other workarounds that can be a pain to remember. So, whether you're preparing for the pre-cons at PASS Data Summit or just looking for a way to streamline your SQL queries, this video has got you covered. Enjoy your weekend and happy coding!

Full Transcript

Erik Darling here with Darling Data. In today’s video, we’re going to do a quick little trick with GroupBy. The reason it’s a quick video is because this is getting published on a Friday, and golly and gosh, I don’t want to keep you anywhere waiting on a Friday. You should be gallivanting off to have a great weekend. So this is just a cool little trick. This is not going to be deep, fundamental, internal stuff about GroupBy and grouping tables and all the other stuff that’s in this session, whatever you want to call it. Of course, this is part of the companion material to Kendra Little and I’s Precons for Past Data Summit. If you are coming to the Precons, you will get free access to this material, since it’s companion material to the Precons. If you are not coming, you can get it for the presale price of $250 US dollars. That will be going up to $500 when the course is fully published. So, get in while the getting’s good, as they say. Eat while the chicken’s hot, or something. Smokey beans. Anyway, what are they going to say? Oh yeah, GroupBy. Fun stuff. So, one of my annoyances with GroupBy, at least in SQL Server, is how messy queries can get once you start grouping by some expression. Once you start writing things out, you’re like, I’ve got to write this over and over again. Gosh almighty. Even if you’re really good at copying and pasting, it gets very irritating.

And some code completion tools, I'm not going to name any names, have some issues in this area. So, this is what a typical, you know, I-had-to-type-too-much query to do this GROUP BY thing looks like: we're getting the date parts of the creation date, and then in the GROUP BY, we've got to write the whole expressions out again. Now, Oracle actually recently brought out an improvement to their SQL dialect where you can reference aliased columns in the GROUP BY, which I think is fantastic.

T-SQL should have such a thing. SQL Server 2025, where are you at, buddy? T-SQL should have such a thing. Now, of course, if you've been paying attention to my videos and the logical query processing stuff, you know you can actually reference aliased columns in the ORDER BY. That would be okay. So, if we look at this query down here, we have some squiggles in the GROUP BY. The GROUP BY is not going to work, but the ORDER BY is totally kosher. Like, this part's fine.

But if we try to run this, SQL Server is going to be like, I don't know, who are these columns? I've never heard of them. Where did they come from? From whence did they came? Which is annoying to me. But, and this is probably one of the most frequent things that you will hear said about T-SQL: don't worry, there's a workaround. Because it's never just a straightforward, hey, we can just do this. It's always: there's a workaround. Another one. Something else to memorize. And that is, of course, the CROSS APPLY with VALUES construct, clause, thing, whatever you want to call it.

So, we’re going to use cross-apply in here. And in cross-apply, we are going to say values. And within those values, we are going to get the two expressions that we care about. Creation date, year and quarter, right? There’s year, there’s quarter. Sorry, it’s very pink on here today. And then we are going to alias those columns coming out of the cross-apply is creation year and creation quarter. And, of course, now we can run this query with absolutely no problems. And I, like, you know, assuming that you can cope with a little cross-apply diversion in your query like this, I do find that this makes the code much sort of cleaner and easier to deal with.

And you don’t have to write any goofy CTE and get into arguments with LinkedIn on people about how goofy CTE are. They don’t actually make queries more readable. Because who has time for that? Not me. Anyway, that’s my quick video today. Like I said, this is going on. Well, this is Friday now for you.

So, please do have a great weekend. Please do drive safely. Even if you’re not drinking, drive safely. Not drinking is no excuse to drive worse. I hope you enjoyed yourselves. I hope you learned something.

And remember, this is the pre-sale thing for the course down here. So, if you’re enjoying this at all, imagine if you had, like, a couple days worth of T-SQL content like this to watch and learn from. It might be pretty good, right?

It’s a whole lot better than reading some stiff book or some terrible documentation or just not knowing what you’re doing and hoping that there’s a Stack Overflow answer for you. Or hoping that the LLM was right because, hoo-wee, you got a lot of hoping to do there. Anyway, thank you for watching. Goodbye.


Learn T-SQL With Erik: Stupid OUTPUT Stuff



Video Summary

In this video, I delve into some quirky aspects of T-SQL that might surprise even seasoned database professionals. Specifically, we explore the OUTPUT clause and its peculiar behaviors when used in certain scenarios. While working on my new T-SQL course, which is part of a series aimed at advanced topics beyond beginner material, I stumbled upon an interesting issue involving the OUTPUT clause with check constraints. Occasionally, attempting to insert values into a table with a check constraint fails with a constraint violation on the login time column, even though the same insert succeeds on other runs. This led me to experiment with using `DELETE` and `OUTPUT` together, expecting a clever workaround for session state management tasks. However, I found that even this approach didn't work as intended, resulting in primary key violations due to the way SQL Server handles these operations within the query plan. The video is a light-hearted exploration of these oddities, aimed at both entertaining and educating viewers about the nuances of T-SQL.

Full Transcript

Erik Darling here with Darling Data. And this video is, of course, part of my ongoing T-SQL course stuff. This is a little jump ahead, right? Because this is beyond what I would consider beginner material. So this is a little jump ahead to some of the more advanced stuff. But it was just one of those funny things that I came across while writing stuff, and me and Bats just got such a kick out of it that we wanted to talk about it now rather than later. So this is a little short video about some weird stuff that you might run into with the OUTPUT clause. Because it's amusing to me. So once again, I am working on my new T-SQL course. The T-SQL course is, of course, companion material to the pre-cons that Kendra Little and I will be delivering at PASS Data Community Summit this November. If you are attending that, you can get the material for free with your attendance to the pre-cons, of course. Now, if you go to someone else's pre-cons, screw you, nobody's going to give you anything free. If you're not going to attend, then you can get it for the presale price of $250, and I'm just waving at it down here in the video description. There is a link where you can purchase it. Anyway, let's get on with this funny little demo. Now, the first funny thing is this part, right, where we're just dropping a table if it exists, creating a table with a check constraint, and then trying to insert some values into that table. Every once in a while when you run this, you'll get a message that the statement conflicted with the check constraint on login time. I guess every once in a while SYSDATETIME() just works and that check constraint works well, and sometimes it doesn't. But let's make sure we get a row in there. Okay, one row affected. Great. We did it. And if we look at that one row in the table, this is what we get: we have spid 121, and this is our login time.
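The course's actual script isn't shown here, but the setup being described is roughly this shape. Every name below, and the exact constraint, is a guess:

```sql
-- Hypothetical reconstruction of the demo setup. A time-based check
-- like this one passes or fails depending on whether the clock ticks
-- between evaluating the VALUES list and checking the constraint,
-- which would explain the intermittent violation.
DROP TABLE IF EXISTS dbo.session_state;

CREATE TABLE dbo.session_state
(
    spid integer NOT NULL PRIMARY KEY,
    login_time datetime2 NOT NULL,
    CONSTRAINT ck_login_time
        CHECK (login_time >= SYSDATETIME())
);

INSERT dbo.session_state (spid, login_time)
VALUES (@@SPID, SYSDATETIME());
```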

What a beautiful SYSDATETIME() that is. Gorgeous. I've never seen anything so beautiful in my life. Now, the first sort of funny thing with OUTPUT is this: in my head, I was like, well, let's say you're dealing with a sort of session state table, and maybe you have a procedure that checks to see if a row already exists in there, and if it does, you delete it, and if it doesn't, you insert it. Well, I was thinking, you know what, I bet there's a real clever way to use OUTPUT to do that. And so I was like, I got this.
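The clever idea, reconstructed against the same hypothetical table: treat DELETE with OUTPUT as a table expression that feeds an INSERT.

```sql
-- Nested DML sketch (same hypothetical table as above). The syntax
-- is valid, but as described below, running it fails every time
-- with a primary key violation.
INSERT dbo.session_state (spid, login_time)
SELECT d.spid, SYSDATETIME()
FROM
(
    DELETE dbo.session_state
    OUTPUT Deleted.spid
    WHERE spid = @@SPID
) AS d (spid);
```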

I’m on this case. Like, I’m on the prowl here, right, and so what I figured I’d do is use delete with output, because when you use delete with output like this, you can nest DML. So notice that we have an insert select here, and the insert, and that, well, the select portion is coming from this, right, so we’re sort of treating delete with output like a subquery, you know, and a lot of the videos, like, one of the things I explain is that, like, in SQL Server, or in databases in general, but we’re using T-SQL and SQL Server, like, like, the result of everything is sort of tabular, right, so like, this is just sort of like a table value and result, right, it’s like a little derived table in there, it’s like a, it’s a table expression with a delete inside it, wild, but we can do that, it’s valid, but it doesn’t work, right, and I’ll show you what doesn’t work, is, you know, we have a delete, right, and in that delete, we are deleting the spid, where the spid equals my spid, so spid 121 should get blown away, and then in the insert, we are selecting, we are reselecting the spid with a new sys date time, right, so we can try to do this, but every time we run this, like, this one, this one never works, right, this one, every time we run this, we get this error message, that it’s a primary key violation, right, we just can’t do it, and it’s, it’s funny to me, too, is, because, like, when you look at the execution plan, you have a clustered index delete over here, right, so that, that happens, right, that’s the very first operator in the plan, clustered index delete, should get rid of that row, be gone, and then over here, we have the clustered index insert, right, so it’s like, hey, check it out, we’re, we’re, we’re going to delete this thing over here, and then we’re going to insert this thing later on, but no, every single time, every single time, primary key violation, and the, the primary key, I mean, granted, you could probably make the primary, you could probably expand the primary key to be both spit and login time, and, you know, not to, not to obviously foreshadow anything, but that, that was just a very surprising thing, but, you know, I thought maybe, maybe I’m being too clever, maybe I’m being just, you know, far, far too clever with things, so I thought maybe if I just did a delete with output into the table, that, that would work, right, because, like, maybe the nesting is just like, it’s too much for SQL servers, SQL servers is like, no, I can’t do, I can’t do it, I can’t figure this out, it’s too many operators in this query plan, it gets all, it’s all, it’s like, just cagey and weird, so I would just do delete, and I would output some stuff into here, but, like, like, I know this would give me a, like, a violation, a primary constraint, key constraint violation, but I just wanted to see if, like, like, it would, it would get past anything, because the, the query plan, you know, it’s, it’s, it’s, it’s, ah, right, so you can’t even get a query plan for this one, and when you try to execute it, you get this, you get a different error message, I’m just trying to find a good place to chop this off, because it’s a little bit longer than the first one, where, now, with, with that output clause, we get something different, you can’t, you can’t even get an estimated plan for it, because SQL Server’s like, screw you, like, violation right off the bat, the target table of the output clause, output into clause cannot have any enabled check constraints, or any enabled rules, found constraint 
or rule, CK nested blah blah blah, 1DD13137, okay, so it’s just, it’s funny to me, because with the, the one plan thing, you get the, the primary key violation, and it’s like, oh, well, clearly, we can’t, we can’t violate a primary key, but then when you do it a slightly different way, in which you actually would violate the primary key, right, like, there actually would be a primary key constraint violation here, you hit this other one on the check constraint, so I have not, I have not quite found the most clever way of doing this yet, but I’m working on it, so, um, if you’re into that sort of thing, I don’t know, perhaps the $250 that it costs to buy the pre-sale price, buy this course at the pre-sale price, uh, would be worth it to you, if not, you can just wait until it’s fully published and it costs $500, because maybe, maybe it would be so worth it to you that you’re like, no, no, I’m waiting for the price to go up, Eric Darlin deserves his flowers, anyway, uh, thank you for watching, I hope you enjoyed yourselves, I hope you are as annoyed with this as I am, and I will see you in the next video, all right, goodbye.


Learn T-SQL With Erik: DISTINCT Isn’t Always Evil



Video Summary

In this video, I delve into the nuances of using `DISTINCT` in T-SQL queries, addressing common misconceptions and providing practical examples to help you understand when and why `DISTINCT` is necessary. I start by explaining how `DISTINCT` can sometimes get a bad rap due to its impact on performance with large result sets, but emphasize that it's not always harmful or avoidable. The video covers scenarios where `DISTINCT` is crucial for getting accurate query results, such as in join operations and when dealing with many-to-many relationships. I also discuss alternative methods like using `EXISTS` to achieve similar outcomes without the performance overhead of `DISTINCT`. This content is part of my comprehensive T-SQL course, which you can pre-order now at a special price of $250; after completion, the price will increase to $500. If you're attending the PASS Data Community Summit in Seattle this November and are interested in these topics, be sure to check out the companion material for my pre-cons with Kendra Little.

Full Transcript

Erik Darling here with Darling Data, and in this video, which is again more sort of preview content from my T-SQL course, we are going to talk about DISTINCT. Now, if you would like to get in on the pre-sale price for this course, down in the video description there is a link. You can buy it right now for $250. That price, once all the material is finished, will be going up to $500. And of course, this is companion material for the pre-cons that Kendra Little and I will be, I don't know, presenting? Delivering? Performing? There are lots of P's you could use in there. At PASS Data Community Summit, this November in Seattle. So if you are attending those pre-cons, you will, of course, get this content free with your admission there. Otherwise, the link is down below. So this is a short excerpt from my material on DISTINCT.

And I hope you enjoy it. Anyway, I wanted to write specifically about DISTINCT because there are times when it's not harmful, and times when it is necessary. And, like, every time I open LinkedIn, the first three things I see are these terrible LLM-generated posts: top 10 things to avoid in SQL, and it's fire emoji, avoid SELECT star, and you're like, here we go. And it's rocket ship emoji, don't use DISTINCT. And you're like, hmm. But there's never an alternative.

All right, there’s never like, like, here’s what, here’s a better way of doing anything. It’s always just like, don’t do this stuff later. You’re like, oh, great. Oh, wonderful. You really put your heart and soul into that one. Good, good, good job there. But what we’re going to talk about in this video is sort of like why Distinct kind of ends up sneaking into things. And when you might actually need to distinctify some stuff in order to make SQL, make, get query, get the correct query results back. Who knows, right? Sometimes you got to make something distinct.

There are a lot more examples in the full content, but this is just something to give you an idea of what's in there, right? A little taste. What's that you're cooking? What's in the kitchen? Good stuff. Come for dinner. Whatever. Anyway, let's say that you're writing a query and you want to see all the users with a reputation of one that have posted an answer. Remember, in the inner join, outer join video, we talked about the different logical join semantics, the stuff that preserves rows on whatever side of the thing.

And so you write a query that looks something like this, right? We're selecting ID and display name from the users table, and we're just joining to the posts table to see, hey, did they post anything? Did they post the right kind of thing in there? Maybe they did. Maybe they didn't. But we're going to find out now.
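Reconstructed from the description, against the Stack Overflow database, and assuming answers are PostTypeId = 2:

```sql
-- One row comes back per matching post, so users with many answers
-- show up many times in the results.
SELECT
    u.Id,
    u.DisplayName
FROM dbo.Users AS u
JOIN dbo.Posts AS p
  ON p.OwnerUserId = u.Id
WHERE u.Reputation = 1
AND   p.PostTypeId = 2;
```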

The thing is, when you run this query, you get a lot of duplicates in here, right? You can see good old Amin Amini: that's duplicate rows. And as you go through, old Farmer Dave has a couple in there. And you realize that, well, we got like a hundred thousand rows back from this thing. If we run this and get the actual execution plan, about 99,000 rows come out of this, and we realize we got a lot of duplicates in there.

And we might not want a lot of duplicates in there. So what we do is slap an old DISTINCT on it. Now, the reason why DISTINCT gets sort of a bad rap is because it is like doing a big GROUP BY on all the columns that you are selecting, right? But it doesn't happen at the same place in logical query processing as GROUP BY.

It happens way at the end. If you remember the order that SQL processes queries in, DISTINCT happens way down at the end, with the SELECT. So it's like a GROUP BY after you've already potentially grouped by other stuff. Which is why there's very little reason to mix DISTINCT in with a GROUP BY as well.

So like, let’s say we just slap a distinct on this one, which is no different from saying group buy ID display name. And when we look at the execution plan, rather than 99,000 rows coming back, it’s only 79,000. So we got like 20,000 some odd rows from the results by putting a distinct on this.

Now every combination of ID and display name that we see in here is going to be unique. We see that now we only have one Amin Amini, and there's going to be one Farmer Dave, and everyone else, we just have one of them. This is now fully unique.

Now, the reason why this happens a lot is because people's view of SQL in general, like, as a language, how it fundamentally works, is often full of shortcomings, right? The goal of this course is to fill those shortcomings in. There's a need in the market. Stuff about the economy. I don't know.

But a lot of the time, people will use some form of join syntax to figure out if rows exist or don't exist. And that's not really the best way to do this sort of thing. So, if ZoomIt will wake up. Hello, ZoomIt. Come on, I paid good money for you.

Just kidding, Mark Russinovich, I downloaded warez. Just kidding, it's free. So with a join, if you have a one-to-many relationship, the way you do from the users table to the posts table, right?

Because one user could have many posts, joins preserve that, right? So we get all the rows back from a join that match. It doesn't matter if there's one or 10 million; all those rows come back if they match, right?

The joins preserve that sort of duplication stuff, the one-to-many relationship. Because, like, you might, you know, want it for some queries, right? There might be something that you actually need, like, all of the matches for.

Like, if you were summing up the total score for a user, you would need all of their scores. You wouldn’t want just one of their scores. It wouldn’t make sense, right? You need all their scores to sum up a user’s total score from the post table.

If you just need to figure out if data is there or not there, the best way to go about that is to use exists or not exists. Because these either find a row or they don’t. If you find a row, it existed.

If you don’t find a row, it not existed. But you don’t preserve one-to-many relationships. You just find a relationship or don’t find a relationship. And that’s what meets the logical requirements of your query, right?

You don’t mess around with a whole bunch of other stuff. So, I love summing is now mad that apparently I downloaded it for free. But one way that you can get around having to use distinct is to use instead of a join, just use exists. Because this means that we’ll only get one row back when there’s a match.

So, like, for this, this just says, hey, if this user is over here, send them to the results, right? It’s not like if this user is over here, get all their results, right? Get all their posts.

So, if we run this and we look at the execution plan, we will have gotten the same 79,000 rows back without having to slap a distinct on it, which would make the LLMs of LinkedIn happy, right? I wish there was a way to just report those people. No semicolon.

I should report myself. Anyway, there are times when distinct is the correct thing to do. Now, let’s take a query like this, which is sort of a many-to-many relationship. So, we’re selecting from the post table and we’re doing a left join to the comments table, right?

And we’re just looking at this for one single user. Now, since this is a left join to the comments table, if we wanted to get a count of how many comments a user had, we could do this in one of three ways. We could either just do a count, which would be wrong.

We could do a count of just the ID column in the comments table. Now, remember, count doesn’t count nulls. It only counts present values, right?

But it will count duplicate present values, right? We talked about this in the case expression video a bit. And then the other way of doing this is to say, give me a count of just distinct IDs from the comments table. Now, this is important because ID is the clustered primary key of the comments table.

So, if there are duplicates in here, it's not because we have some error in our table and corrupt data with duplicates in our primary key. That's not the case. It's just that the left join expresses a many-to-many relationship between posts and comments.

Right? Like, one user can have a whole mess of comments. So, this is one of those things.

I guess it’s one-to-many relationships and it’s just one user. But one user can have lots of comments, but they have, but there’s also, but, like, the owner user ID in the post table is not unique. Right?

Because one user can have lots of posts. So it is, like, many-to-many in that regard. So, if we were to run this query and look at what these different counts produce, we'd pretty quickly see that we do need DISTINCT here in order to get correct results. So, let's run this.
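A reconstruction of the query being run; the join condition and names are assumptions based on the description:

```sql
-- Three ways to count one user's comments through a left join.
SELECT
    COUNT_BIG(*) AS count_star,                     -- counts every joined row: wrong
    COUNT_BIG(c.Id) AS count_cid,                   -- skips NULLs, keeps duplicates: wrong
    COUNT_BIG(DISTINCT c.Id) AS count_distinct_cid  -- unique comment IDs: correct
FROM dbo.Posts AS p
LEFT JOIN dbo.Comments AS c
  ON c.UserId = p.OwnerUserId
WHERE p.OwnerUserId = 22656;
```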

And so, it’ll take a second to run because we’re doing a lot of work here. But let’s look at the difference in what these counts produce. Right?

When we just do count or when we do a count of the comment IDs, right? That’s these two up here. Right? We have the count big and the count big CID. We get gigantic numbers back.

Right? But when we do a count of distinct IDs here, we get a much smaller number back: 46,737. The gigantic one, I'm not sure how to pronounce. There are no commas in there, so we're just going to be terrified. But we can see where these counts came from. At least the two big wrong ones, right?

If we look at the execution plan, look at the number of rows that that join produces. That is the same number that we got from the count star and the count CID. Right?

So, the 27,901 rows in the post table that match owner user ID 22656 and the 46,737 rows in the comments table get joined together, and they produce 1,304,009,036 rows. It's a ten-digit number.

I might have run out of fingers if I did that. Right? So, that produces a big result set. Right? It’s not quite a Cartesian product, but it’s a big number. It produces a big number of rows.

Right? Because they all match. There’s a many to many relationship. It’s like 27,901 times 46,737. It’s a lot of rows. Right?

Big number of rows. But the count of distinct IDs over here produces the correct number of rows. And we can validate that by just hitting the comments table and getting a count for that specific user ID.
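The validation query is simple (same assumptions as above):

```sql
-- Count the user's comments directly, no join involved.
SELECT COUNT_BIG(*) AS records
FROM dbo.Comments AS c
WHERE c.UserId = 22656;
```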

So, if we run this and we look at how many rows come back from that, we’ll see user ID 22656 does indeed have 46,737 rows in there. So, while I totally get the advice to avoid distinct, you know, it comes from a good place because, you know, distinct does get stuck on very big queries that select a lot of columns.

A lot of them are probably strings. And, like, doing that sort of distinctification of really, really big result sets can be painful. But there are plenty of times when using distinct, A, provides you correct results.

And, B, you know, not something we’re going to explore today. But there are times when using distinct in a query gives the optimizer some guarantees around a unique set of data being produced by that query.

And that can actually help the optimizer quite a bit to produce better query plans. Using DISTINCT to produce fewer rows from something that's part of your query can be very useful at times.

So, don’t throw distinct out completely. Distinct can be useful. But, you know, there are times when, you know, just doing a select distinct on something can cause, you know, some performance pains.

But, you know, if you test it with and without distinct and it’s not that much different, maybe, I don’t know, maybe it’s okay. Maybe the LLMs aren’t always right.

Shocking, I know. Anyway, thank you for watching. I hope you enjoyed yourselves. I hope you learned something. And we will talk about something else wonderful and T-SQL-y in the next video. Anyway, goodbye.



Learn T-SQL With Erik: Solving Problems With APPLY



Video Summary

In this video, I delve into the beginner section of my upcoming T-SQL course, focusing on a fundamental yet powerful feature: APPLY. Starting with an engaging introduction to APPLY as a unique blend between JOINs and subqueries, I walk through common issues encountered when trying to solve top-N-per-group problems using CTEs or derived joins. By demonstrating how APPLY elegantly handles these scenarios, I aim to demystify its syntax and application in real-world queries. Through practical examples, including the Stack Overflow database, I show how APPLY can be used to find the top three badges earned by users with a reputation over 750,000, illustrating why it's such an essential tool for any SQL Server professional. The video also touches on alternative syntax and performance considerations, providing a comprehensive yet accessible overview of this powerful T-SQL feature.

Full Transcript

Erik Darling here with Darling Data, and we’re going to continue on with some of the preview content from my upcoming T-SQL course. I would like to thank everyone who has pre-purchased it so far. Your support, of course, means quite a bit to me. And my chief goal right now is to not disappoint you. Again, this content is going to be part of a larger paid course. If you would like to pre-purchase it, there is a link down in the video description for you to do that. The price is currently $250. That will be going up to $500 once the course is fully published. And this is companion content to the T-SQL pre-conferences that Kendra Little and I are doing at PASS Data Community Summit in Seattle this November. And so if you are attending those, don’t buy it because you will get access to the content. as part of your admission to those pre-conferences. So hold your horses for that. I mean, I’m going to tell you not to. If you decide to just because you’re like, Erik Darling, we love you. You’re worth $250. Cool, but you don’t have to. Save your money. Go drinking. Today’s video is going to be about apply. Now, this is the beginner section of the content. So there’s not going to be the advanced level of stuff that you would maybe normally find in content about apply that that is coming later. What we’re going to do today is solve a little problem with apply and sort of dissect how apply works. Now, we’ll also look at an alternative to the apply syntax because sometimes looking at alternative syntax helps people understand what we’re doing with apply a little bit better.

But APPLY is one of my favorite pieces of T-SQL syntax, sort of generally. I was going to see if ZoomIt is awake. It is. Wonderful. APPLY is sort of like if a derived join and a subquery had a baby. It's the beautiful bouncing baby APPLY. In other dialects of SQL, Postgres, Oracle, some other fancy ones, the CROSS APPLY syntax is fairly well equivalent to the lateral join syntax over there. Apparently it was in the SQL standard, but it wasn't fully codified in the SQL standard as the lateral join syntax until after Microsoft had added APPLY to SQL Server. So APPLY is not, like, the anti-SQL-standard way of doing it, but it is the sort of equivalent idea. Now, APPLY does have two flavors. There is CROSS APPLY, which is an inner join equivalent, and then there's OUTER APPLY, which is an outer join equivalent.

And the problem that we’re going to solve today is probably the easiest one to help most folks grasp what apply is good at. And that is the sort of top end per group problem. What we’re going to look at is in the context of the stack overflow database is a few of the higher reputation users in the three most recent badges that they’ve earned for being really good at overflowing stacks. But in your work, you might look at this in other ways, like maybe the top three selling products per category, like bed and bath, home and garden, kitchen, sneakers, whatever product categories you have, you want to find the top three selling products for that. Things like that.

So we’re going to look at sort of some query patterns that don’t work for that and then how apply works for that. So if you were to try this with a CTE, and I’m not saying that it’s not doable with a CTE, just you can’t do it like this. And looking at this sort of help set up the apply syntax a little bit.

So if we were to run this query, where we're getting the top three badges, you notice that this line right here is quoted out. And that's because, in the context of this CTE, the reference to the users table down here is not visible to it yet. We can't access the users table from the CTE up here.

We can reference it after we join to the CTE down here, but it doesn’t get us the result that we want. So like if we just run this and like we see what the query produces, we just produce three random badges, right, for three random users. These are just the three most recent badges earned period, right?

So this doesn’t get us what we want. We want the top three badges for the users that qualify for our query down here. So this query produces no results, right? We don’t get anything back because no one with a reputation over 750,000 was in the top three badges that got earned period for the badges table, right?

This is just like the whole thing. These people weren’t in there. The other part of it that doesn’t work is if we tried to do a similar thing with a derived join. You notice I have this where clause similarly quoted out in here because like the users table, the reference to the users table, isn’t available inside of the derived join.

We can’t see it until after the join is complete and we do the on clause. You end up with the exact same problem in here, right? Selecting the top three most recent badges, none of those users have a reputation over 750,000, right?

So this returns no results as well. This is why you can't just go stick things like TOP into a CTE or a derived join and expect it to match the results of the other stuff, right? It works for some things, but not for this particular task.

Like it works for some things, but not for this particular task. So if we use apply, what we’re able to do is a see like that reference to the users table inside of the apply. So this does end up getting a little bit more spread out than it would normally be, but it’s only because I tried to put some helpful commentary in there.

Inside, we select from the users table, and then we write our CROSS APPLY, right? So the CROSS APPLY is right after that. Like I said, the CROSS APPLY is like an inner join.

So it comes after that FROM, and in here we can select the top three badges. And because we're using TOP, it would be the same story with OFFSET-FETCH too, right? Like, you could just as easily use OFFSET-FETCH instead of TOP here, just OFFSET zero rows, FETCH NEXT three rows. But in here we're using TOP because it's faster to write.

No other reason. Most people in T-SQL will write TOP just sort of out of muscle memory, right? TOP, whatever.

But with TOP, we need an ORDER BY. And because this ORDER BY is on the date column, and the date column is not unique, right, we might have duplicates in here, we do need to have our unique tiebreaker here.

This ID column. The ID is the clustered primary key of the badges table. So this is our unique tiebreaker in case there are any duplicates in the date column, right? That’s a critical thing that you need when you are ordering by a date, when you’re ordering by a column that might have duplicates in it because it’s considered non-deterministic, right?

You get different results without that unique tiebreaker. But inside of the APPLY, we are able to see that outer reference to the users table, right? We can see this inside of the APPLY, which means that when we run this query, we'll actually get back the results that we want: the users with a reputation of over 750,000.
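Pieced together from the walkthrough, against the Stack Overflow schema; the exact select list is a guess:

```sql
-- For each qualifying user, apply the top-three-badges query.
SELECT
    u.DisplayName,
    u.Reputation,
    b.Name,
    b.Date
FROM dbo.Users AS u
CROSS APPLY
(
    SELECT TOP (3)
        b.Name,
        b.Date
    FROM dbo.Badges AS b
    WHERE b.UserId = u.Id   -- the outer reference the CTE and derived join couldn't see
    ORDER BY
        b.Date DESC,
        b.Id DESC           -- unique tiebreaker for duplicate dates
) AS b
WHERE u.Reputation > 750000
ORDER BY
    u.Reputation DESC,
    u.Id DESC;
```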

Let’s validate that looking in here. Everyone here, the lowest one is 754,947. And we get, as long as there are three badges, like they’re like, as long as they’ve earned three badges, you’ll see three badges for each person.

Now, this is the sort of interesting thing about APPLY: you can say you want the top however many, but if that however many don't exist, you only get back however many results actually do exist. So let's say we said TOP 10, but one person only earned nine badges. We would only get nine rows for them.

We can’t invent a row to send back for them. And we’re not going to send back this a random duplicate row because that, that would be insane. Right?

So you get the top however many are available, until your TOP is either met or we run out of rows to find. So this does give us the correct results, which is wonderful. So now let's dissect a little bit more how APPLY works, so that everyone can understand the majesty of this query.

Right? So the other queries didn't work because, again, it's sort of like saying, select the top three badges of whatever. Just changing this query to hit the users table and do a simple IN to figure out which users earned the top three most recent badges ever in here.

You’ll see that we get back three kind of random people and none of them have a reputation even close to 750,000. Right? Maybe they do now.

Maybe they’re a fantastic, like just amazing stack overflow database users at this point. I don’t know. I don’t follow their careers. I’ll be honest with you. I don’t, don’t care that much, but like, like those three people were not in, did not meet that 7,000.

So we’re going to have a 750,000 reputation goal. So the way that apply works is sort of like if we take that, like the initial starting query, right? Like let’s say that we know we want to find like what, like our basic starting point is we want to select.

We want to select these columns from the users table. And all that we care about are users who have a reputation of over 750,000. And when we supply our results back to, you know, whatever application web front end, whatever, whatever you want to call it.

So we want those results ordered by reputation descending, and then, if there are any duplicates, by the ID column descending, right? We need that unique tiebreaker to make sure the results come back in a reliable order every single time.

These are, these are the people we get back. So these six rows. All right.

We get, of course, Jon Skeet. We get BalusC. I'm not sure who that is. We get Darin Dimitrov. Sounds very smart. We get VonC. I've seen that name around. We get Hans Passant. Some sort of German-French name mashup there, perhaps a Belgian.

I don’t know. You can’t tell. And then we have Marc Gravel right here rounding out the list with 754,947. So the, this is our starting query, and these are the results from our starting query.

The way APPLY works is sort of like this: for every row that your starting query produces, apply the result of the CROSS APPLY query to that row. So that's why we get three rows back for Jon Skeet and three rows back for BalusC and so on down the list, because we apply that top three query to the results of this query. And just to go a little bit further with that: this is the query inside of the CROSS APPLY, just without the correlation in here.

So what we're saying is: for each user that qualifies for our where clause, apply this query result to that row. Now, you could absolutely do this with a derived join or a CTE. You just have to write the query a little bit differently.

Another common approach to this is to use the ROW_NUMBER window function: generate a row number over that result, and then filter on the row number after that result has been produced. So in here, remember that the alias for the row number inside of the derived join is n, right? That's this thing right here.
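Roughly, and with the same schema assumptions as before:

```sql
-- Number each user's badges from newest to oldest inside the derived
-- join, then keep only rows where n is three or less.
SELECT
    u.DisplayName,
    u.Reputation,
    b.Name,
    b.Date
FROM dbo.Users AS u
JOIN
(
    SELECT
        b.UserId,
        b.Name,
        b.Date,
        n = ROW_NUMBER() OVER
            (
                PARTITION BY b.UserId
                ORDER BY
                    b.Date DESC,
                    b.Id DESC
            )
    FROM dbo.Badges AS b
) AS b
  ON b.UserId = u.Id
WHERE u.Reputation > 750000
AND   b.n <= 3
ORDER BY
    u.Reputation DESC,
    u.Id DESC;
```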

So we’re saying where n is less than or equal to three. Now, just to kind of make this a little bit easier to understand, I’m just going to edit this query really quickly. And let’s grab, let’s look at this again.

And so these are our top six users. We’re just going to grab this number 22656. And then in here, we’re just going to change this for a minute where b.userid equals 22656. And we’re just going to run this result on its own because there’s no correlation inside of a derived join.

Again, you can’t access the stuff outside of a derived join within the derived join. So there’s no correlation to the user’s table inside of this. So we can just run this without an error and we can just see what the results are for one user, right?

So if we run this, we’re going to get back this result. And this is, these are going to be the top three most recent badges for this particular user ID, right? We had a good answer, a nice answer, and a great answer.

What a fantastic answer. And we have that ordered by date descending in here. And of course, we have this ordered by ID. If there were, I don’t think, no, I don’t see any duplicate dates in there, but, if there were any duplicate dates in here, then the ID would act as the tiebreaker to differentiate between those duplicate rows.

Like I said, when you order by a column with duplicates in it, the result is considered non-deterministic. So we get the same result back for this one. We just take on a slightly different query format with it.

So let’s get rid of this where clause, because we don’t need this anymore. That was purely a display model query. And let’s run this. And you’ll see that we get back the same thing.

Well, now we have this n column in here, right? And one, two, three, one, two, three, over and over again. So I guess this could be useful if you wanted to quickly eyeball anyone who may not have had a top three badges in there.

Maybe someone who only earned two badges. They somehow accumulated that much reputation without ever doing anything really spectacular, I guess. But we, we get all the same sort of information.

We get the same sort of set of information back from that. So that's a bit about how APPLY works. Hopefully it helps you understand the syntax for it, and at least one application for APPLY, where you can use it to solve interesting query riddles without breaking your back typing a whole lot more than you might if you start using windowing functions. There are some performance considerations around this stuff, but there's only so much time in the preview material for me to spend on things.

So you’re just, you’re just going to have to wait for the full video on that, but you only get the full video if you buy the training. But anyway, thank you for watching. I hope you enjoyed yourselves.

I hope you learned something and I will see you in the next video where we will do some more of this fun T-SQL learning stuff. Anyway, thank you for watching. I’m out of here. Goodbye.
