Erik Darling, Author at Darling Data

Starting SQL: What Happens When You Don’t Parameterize SQL Server Queries?

Posted on September 22, 2020 (updated May 17, 2022) by Erik Darling

Nothing, Nothing, Nothing

Aside from the obvious perils of SQL Injection, parameterizing queries can be helpful in other ways.

Not to downplay SQL Injection at all, it’s just that fixing code is only one part of the equation. Vendor applications often need to do some pretty administrative things, and require elevated permissions.

Just last week I was looking at an application that had a stored procedure which took a parameter that was the name of an executable. There were no checks on the name. You could put in quite not-figuratively anything you wanted, and xp_cmdshell would run it.

All the parameterization in the world won’t help that.

Well, you get what you get.

Of Parameters And Plans

Queries tend to end up without proper parameterization in one of three common ways:

You have an interface that accepts user-entered values
You have an ORM where you haven’t explicitly defined parameter types and precisions
You have dynamic SQL where values are concatenated into strings

Since I spend all my time in the database, I’m going to show you the third one. I wouldn’t normally format a query like this, but I’m trying to keep it web-friendly.

DECLARE @StartDate DATETIME = '20130101';
WHILE @StartDate < '20140101'
BEGIN
DECLARE @NoParams4u NVARCHAR(MAX) = 
N'
SELECT SUM(c.Score) AS TotalScore
FROM dbo.Comments AS c
WHERE c.CreationDate 
    BETWEEN CONVERT(DATETIME, ''' + RTRIM(@StartDate) + ''') 
    AND CONVERT(DATETIME, ''' + RTRIM(DATEADD(DAY, 11, @StartDate)) + ''')
AND 1 = (SELECT 1);
';

EXEC sys.sp_executesql @NoParams4u;
RAISERROR('%s', 0, 1, @NoParams4u);

SET @StartDate = DATEADD(DAY, 11, @StartDate);

END
GO

To make things interesting, I’ve created a non-covering index on the Comments table:

CREATE INDEX c ON dbo.Comments(CreationDate);

I’ve also had to use an oddly specific day increment in order to get some good plan variety because of that index. If you ever wonder why some blog posts take three hours to write, the most likely cause is finding the right number.

The first thing you should notice is that even using sp_executesql doesn’t help when our code isn’t parameterized. The second thing you’ll probably notice is 1 = (SELECT 1) at the end of the query. Past me has answers to all those questions, if you click on the links.

Bouquet

If I run that loop, I get back 34 results. Each query gets an individually compiled query plan, though there are only three “different” plans used.

Notice that each plan has different literal date values passed in to it, and different estimates. If we created a covering index, we’d get plan stability across executions, but then we’d still have to compile it when these literal values get passed in. I did it like this to reinforce my point.

We can validate that by looking in the plan cache and surrounding DMVs, using sp_BlitzCache.

Planimal Activist

If you have code like this, one good way to find culprits is by running sp_BlitzCache like this:

EXEC sp_BlitzCache @SortOrder = 'query hash';

A zoomed-in snapshot of the part of the results that proves my point looks about like this:

This is just the top 10 results. You can see a warning about multiple plans, and that each query has a single execution.
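If you don’t have sp_BlitzCache installed, you can get a rough version of the same signal straight from the plan cache. This is a minimal sketch against sys.dm_exec_query_stats, not what sp_BlitzCache runs internally; the TOP (10) and the column aliases are my own choices for illustration.

```sql
/* Find query hashes that have more than one cached plan.
   Lots of plans with few executions each usually points
   at unparameterized code. */
SELECT TOP (10)
    deqs.query_hash,
    COUNT_BIG(DISTINCT deqs.plan_handle) AS cached_plans,
    SUM(deqs.execution_count) AS total_executions
FROM sys.dm_exec_query_stats AS deqs
GROUP BY deqs.query_hash
HAVING COUNT_BIG(DISTINCT deqs.plan_handle) > 1
ORDER BY cached_plans DESC;
```

Keep in mind that the plan cache only knows about what is currently cached, so a recently restarted server may not show much.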

I mentioned before that if we add a covering index, we’ll get plan stability. That’s true, but specific to this demo on SQL Server 2019, there’s an alternate plan available for executions that qualify for Batch Mode On Rowstore (BMOR):

In prior versions of SQL Server, where plans were row mode only, we could only have a stream aggregate. But even getting the same plan most of the time, we still need to compile it every time. SQL Server still thinks each of these queries is “new” and needs to get a plan compiled.

Fixing It

There are two options for fixing this. If you need to do it quickly, at scale, the database level option FORCED PARAMETERIZATION can take care of most of these problems. Just make sure you read up on the limitations. To fix it for just a couple problem queries, you need to fix the dynamic SQL.
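For reference, the database-level setting looks something like this. It’s a sketch that assumes you’re connected to the database you want to change, and that you’ve tested the change somewhere safe first.

```sql
/* Turn on forced parameterization for the current database. */
ALTER DATABASE CURRENT SET PARAMETERIZATION FORCED;

/* Confirm the setting took. */
SELECT
    d.name,
    d.is_parameterization_forced
FROM sys.databases AS d
WHERE d.database_id = DB_ID();

/* To go back to the default behavior:
ALTER DATABASE CURRENT SET PARAMETERIZATION SIMPLE;
*/
```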

DECLARE @NoParams4u NVARCHAR(MAX) = 
N'
SELECT SUM(c.Score) AS TotalScore
FROM dbo.Comments AS c
WHERE c.CreationDate BETWEEN @StartDate AND DATEADD(DAY, 11, @StartDate)
AND 1 = (SELECT 1);
';

EXEC sys.sp_executesql @NoParams4u, N'@StartDate DATETIME', @StartDate;

I’m not concatenating values into the string anymore, and I’m passing the @StartDate value in when I execute the dynamic SQL.
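Put back into the loop from earlier, the whole thing might look like the sketch below. It assumes the same Stack Overflow dbo.Comments table and the same 11-day increment; since the query text never changes, the string only needs to be built once, outside the loop.

```sql
DECLARE @StartDate DATETIME = '20130101';

/* The query text is identical for every execution. */
DECLARE @NoParams4u NVARCHAR(MAX) =
N'
SELECT SUM(c.Score) AS TotalScore
FROM dbo.Comments AS c
WHERE c.CreationDate BETWEEN @StartDate AND DATEADD(DAY, 11, @StartDate)
AND 1 = (SELECT 1);
';

WHILE @StartDate < '20140101'
BEGIN
    /* Only the parameter value changes between executions. */
    EXEC sys.sp_executesql
        @NoParams4u,
        N'@StartDate DATETIME',
        @StartDate;

    SET @StartDate = DATEADD(DAY, 11, @StartDate);
END;
GO
```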

One point I want to make is that it’s generally safe to do date math on the parameter. I’m not doing date math on the column, which would generally be a bad idea.
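To make that distinction concrete, here’s a small sketch of the two patterns side by side. Both predicates ask the same logical question; the queries are simplified for illustration and aren’t part of the original demo.

```sql
DECLARE @sql NVARCHAR(MAX) =
N'
/* Date math on the parameter: the column is left alone,
   so an index on CreationDate can be used for a seek. */
SELECT SUM(c.Score) AS TotalScore
FROM dbo.Comments AS c
WHERE c.CreationDate < DATEADD(DAY, 11, @StartDate);

/* Date math on the column: every CreationDate has to go
   through DATEADD before it can be compared. */
SELECT SUM(c.Score) AS TotalScore
FROM dbo.Comments AS c
WHERE DATEADD(DAY, -11, c.CreationDate) < @StartDate;
';

EXEC sys.sp_executesql
    @sql,
    N'@StartDate DATETIME',
    '20130101';
```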

But anyway, now our plan gets used 34 times.

In this case, plan reuse works out well. Every query is looking at a sufficiently narrow range of data to have it not matter, and the longest running execution is around 31ms.

But what about when that doesn’t work out? When can parameterization backfire? We’ll find out tomorrow!

Thanks for reading!

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

Starting SQL: What’s The Difference Between Parameters And Local Variables In SQL Server?

Posted on September 21, 2020 (updated May 17, 2022) by Erik Darling

Parameter Positive

There are many good reasons to parameterize a query, but there are also trade-offs. There’s no such thing as a free parameter, as they say.

In this post, we’re going to discuss what is and isn’t a parameter, and some of the pros and cons.

What’s important to keep in mind is that good indexing can help avoid many of the cons, but not all. Bad indexing, of course, causes endless problems.

There are many good reasons to parameterize your queries, too. Avoiding SQL injection is a very good reason.

But then!

What’s Not A Parameter

It can be confusing to people who are getting started with SQL Server, because parameters and variables look exactly the same.

They both start with @, and feel pretty interchangeable. They behave the same in many ways, too, except when it comes to cardinality estimation.

To generalize a bit, though, something is a parameter if it belongs to an object. An object can be an instance of:

A stored procedure
A function
Dynamic SQL

Things that aren’t parameters are things that come into existence when you DECLARE them. Of course, you can pass things you declare to one of the objects above as parameters. For example, there’s a very big difference between these two blocks of code:

DECLARE @VoteTypeId INT = 7;

SELECT *
FROM dbo.Votes AS v
WHERE v.VoteTypeId = @VoteTypeId;

DECLARE @sql NVARCHAR(MAX) = N'
SELECT *
FROM dbo.Votes AS v
WHERE v.VoteTypeId = @VoteTypeId;
';
EXEC sp_executesql @sql, N'@VoteTypeId INT', @VoteTypeId;

But it’s not obvious until you look at the query plans, where the guess for the declared variable is god awful.

Then again, if you read the post I linked to up there, you already knew that. Nice how that works.

If you’re too lazy to click, I’m too lazy to repeat myself.

What’s the point? Variables, things you declare, are treated differently from parameters, things that belong to a stored procedure, function, or dynamic SQL.
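A stored procedure makes the split easy to see, because you can have both in the same module. This is a sketch with a made-up procedure name (dbo.VoteTypeDemo); it needs SQL Server 2016 SP1 or later for CREATE OR ALTER.

```sql
CREATE OR ALTER PROCEDURE dbo.VoteTypeDemo (@VoteTypeId INT)
AS
BEGIN
    SET NOCOUNT ON;

    /* @VoteTypeId belongs to the procedure: it's a parameter. */
    SELECT COUNT_BIG(*) AS records
    FROM dbo.Votes AS v
    WHERE v.VoteTypeId = @VoteTypeId;

    /* @LocalVoteTypeId is something we DECLARE: it's a local
       variable, even though it holds the very same value. */
    DECLARE @LocalVoteTypeId INT = @VoteTypeId;

    SELECT COUNT_BIG(*) AS records
    FROM dbo.Votes AS v
    WHERE v.VoteTypeId = @LocalVoteTypeId;
END;
GO

EXEC dbo.VoteTypeDemo @VoteTypeId = 7;
```

If you grab the actual plans, compare the estimated rows for the two statements. They return the same count, but the optimizer usually makes different guesses for them.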

Parameter Problems

The problem with parameterization is one of familiarity. It not only breeds contempt, but… sometimes data just grows apart.

Really far apart.

SELECT 
    v.VoteTypeId, 
    COUNT_BIG(*) AS records
FROM dbo.Votes AS v
GROUP BY v.VoteTypeId
ORDER BY records;

Natural Selection

When you parameterize queries, you give SQL Server permission to remember, and more importantly, to re-use.

What it re-uses is the execution plan, and what it remembers are cardinality estimates. If we do something like this, we don’t get two different execution plans, or even two different sets of guesses, even though the values that we’re feeding to each query have quite different distributions in our data.

The result is two query plans that look quite alike, but behave quite differently.

One takes 23 milliseconds. The other takes 1.5 seconds. Would anyone complain about this in real life?

Probably not, but it helps to illustrate the issue.

Leading Miss

Where this can get confusing is when you’re trying to diagnose a performance problem.

If you look in the plan cache, or in query store, you’ll see the plan that gets cached for the very first parameter. It’ll look simple and innocent, sure. But the problem is with a totally different parameter that isn’t logged anywhere.
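You can at least see which value a cached plan was compiled for by digging it out of the plan XML. Here’s a rough sketch; the LIKE filter on dbo.Votes is only there to narrow things down to this demo, and shredding plan XML across a big plan cache can be slow.

```sql
WITH XMLNAMESPACES
(
    DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan'
)
SELECT
    dest.text AS query_text,
    deqs.execution_count,
    pl.c.value('@Column', 'NVARCHAR(128)') AS parameter_name,
    pl.c.value('@ParameterCompiledValue', 'NVARCHAR(4000)') AS compiled_value
FROM sys.dm_exec_query_stats AS deqs
CROSS APPLY sys.dm_exec_sql_text(deqs.sql_handle) AS dest
CROSS APPLY sys.dm_exec_query_plan(deqs.plan_handle) AS deqp
/* One row per parameter in the plan's ParameterList. */
CROSS APPLY deqp.query_plan.nodes('//ParameterList/ColumnReference') AS pl(c)
WHERE dest.text LIKE N'%dbo.Votes%';
```

That gets you the compile-time value only. The values used by later executions aren’t stored in the cached plan.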

You might also face a different problem, where the query recompiles because you restarted the server, updated stats, rebuilt indexes, or enough rows in the table changed to trigger an automatic stats update. If any of those things happen, the optimizer will wanna come up with a new plan based on whatever value goes in first.

If the roles get reversed, the plan will change, but they’ll both take the same amount of time now.

DECLARE @VoteTypeId INT;

SET @VoteTypeId = 16;

DECLARE @sql NVARCHAR(MAX) = N'
SELECT *
FROM dbo.Votes AS v
WHERE v.VoteTypeId = @VoteTypeId;
';
EXEC sp_executesql @sql, N'@VoteTypeId INT', @VoteTypeId;

SET @VoteTypeId = 7;

SET @sql = N'
SELECT *
FROM dbo.Votes AS v
WHERE v.VoteTypeId = @VoteTypeId;
';
EXEC sp_executesql @sql, N'@VoteTypeId INT', @VoteTypeId;

Deal With It

In the next few posts, we’ll talk about what happens when you don’t parameterize queries, and different ways to deal with parameter sniffing.

A recompile hint can help, but it might not always be appropriate depending on execution frequency and plan complexity
Optimize for unknown hints will give you the bad variable guess we saw at the very beginning of this post
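In case you haven’t seen them written out, this is roughly what those two hints look like on the demo query. It’s a syntax sketch only, and I’ve swapped SELECT * for a count to keep the output small.

```sql
DECLARE @VoteTypeId INT = 7;

DECLARE @sql NVARCHAR(MAX) = N'
/* Compile a fresh plan for this value on every execution. */
SELECT COUNT_BIG(*) AS records
FROM dbo.Votes AS v
WHERE v.VoteTypeId = @VoteTypeId
OPTION (RECOMPILE);

/* Ignore the value passed in when estimating rows. */
SELECT COUNT_BIG(*) AS records
FROM dbo.Votes AS v
WHERE v.VoteTypeId = @VoteTypeId
OPTION (OPTIMIZE FOR UNKNOWN);
';

EXEC sys.sp_executesql
    @sql,
    N'@VoteTypeId INT',
    @VoteTypeId;
```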

We’re going to need more clever and current ways to fix the issue. If you’re stuck on those things recompiling or unknown-ing, you’re stuck not only on bad ideas, but outdated bad ideas.

Like duck à l’orange and Canadian whiskey.

Thanks for reading!


Starting SQL: Query Plan Round Up

Posted on September 18, 2020 (updated May 15, 2022) by Erik Darling

Put On

There are lots of things that can be fine in one query plan, and horrible in another.

Likewise, a query plan might be great for one set of parameters, and terrible for another. That’s called parameter sniffing, and that’s what we’re going to talk about next week.

What I want you to take away from the last couple weeks of posts is some of the improvements that have come into query plans over the years, and some of the shortcomings that still exist.

Despite the shortcomings, query plans offer an important view into things. When viewed alongside metrics in other DMVs, the text of the query, indexes that are available, and any parameters passed to the query, you can get a fairly complete view of things.

Major Pain

Of course, you can’t get rabbit-holed on this stuff. As much detail as gets captured in actual plans and in the other sources I mentioned, they don’t tell you about important things like blocking or resource contention.

You can infer this a bit by looking at waits on the server, or by comparing CPU to duration.

If you see lots of LCK_M_xx, THREADPOOL, or RESOURCE_SEMAPHORE waits, it might not be an issue of tuning a single query.

Likewise, if you see a lot of PAGEIOLATCH_xx waits, you may need to look at your RAM to data ratio (among other things).
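A quick way to check for those waits is sys.dm_os_wait_stats. This is a bare-bones sketch; the numbers are cumulative since the last restart (or since someone cleared them), so they’re most useful when you compare two samples taken some time apart.

```sql
SELECT
    dows.wait_type,
    dows.waiting_tasks_count,
    dows.wait_time_ms,
    dows.signal_wait_time_ms
FROM sys.dm_os_wait_stats AS dows
WHERE dows.wait_type LIKE N'LCK[_]M[_]%'      /* blocking */
OR    dows.wait_type = N'THREADPOOL'          /* out of worker threads */
OR    dows.wait_type = N'RESOURCE_SEMAPHORE'  /* waiting on memory grants */
OR    dows.wait_type LIKE N'PAGEIOLATCH[_]%'  /* reading pages from disk */
ORDER BY dows.wait_time_ms DESC;
```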

We’re at the point now where we’re ready to dive into a tougher subject, and start answering questions like “why did my query suddenly get slower?”

Thanks for reading!


Starting SQL: Things Your SQL Server Execution Plan Should Warn You About

Posted on September 17, 2020 (updated May 15, 2022) by Erik Darling

Mild Best Wishes

Microsoft has chosen some odd things to warn us about in query plans. For estimated and cached plans, I totally understand some of the limitations.

Though the DMVs for queries associated with query plans (both in the plan cache and query store) log some additional metrics about memory grants, spills, CPU, duration, reads, and more, none of that additional information ends up as feedback in the plan XML.

I’m not complaining, either. Modifying XML is stupid.

Bellyachin’

What could be in the XML is an extension of the current set of warnings, with maybe a bit more in Actual plans.

Query plans could warn if:

A non-SARGable predicate caused a scan
A non-SARGable/complex predicate caused a Filter
A residual predicate did a lot of reads
Eager Index Spools are over a certain number of rows
Performance Spools have an unfavorable rebind/rewind ratio
If modes switch between row and batch
Why an index wasn’t used
When estimated and actual rows or executions are way off
When joins have an OR clause
When parallel row distributions are uneven
Louder warnings for when something forces a query to run serially
Show multiple missing index requests when present
When operators execute more than once
When a statistic used has significant modifications
A roll up of per-operator I/O at the root node

Many of these things could be in both estimated and actual plans, and would really help people understand both why the things they do when writing queries can mess them up, and also when there might be a better way to do things.

A Professional With Standards

Is that a lot of stuff? Yes. But think about the warnings we get now, and the last time they were really helpful to you.

If they ever were, I can guarantee that they’ve been misleading more often.

All in all, people need a more informative query plan, and building this sort of instrumentation for end users also gives the robots in Azure, and future Intelligent Query Processing features some nice feedback.

Thanks for reading!


Starting SQL: Things Your SQL Server Execution Plan Might Warn You About

Posted on September 16, 2020 (updated May 15, 2022) by Erik Darling

Oh look, XML

One of my favorite pieces of SQL Server documentation is the XML schema for query plans.

Whenever I’ve needed to figure something out to query the plan cache, or when I’ve been looking for stuff that might be interesting to pull out of it, this is where I’ve gone.

Especially when I’m querying plan XML, it helps me rest a bit easier to know I’ve covered all the documented possibilities.

Of course, there’s no distinction in the documentation between what can be in the Estimated plan, and what’s reserved for Actual plans.

Such Document

While that’s inconvenient, you can figure most things out by separating plan elements into two groups:

Optimization time metrics
Runtime metrics

Here are the documented warnings available in query plans.

[Image: SQL Server Showplan XML]

Things like spills and memory usage can only be known at runtime, when the optimizer’s estimates are put into play.

For many other things, the optimizer likely knows about and has to account for the warnings while it’s coming up with a query plan.
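If you want to see which of these warnings show up on your own server, you can look for the Warnings element in cached plans. This is a rough sketch; cached plans won’t carry the runtime-only details, and searching plan XML across a large plan cache can take a while.

```sql
WITH XMLNAMESPACES
(
    DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan'
)
SELECT TOP (20)
    deqs.execution_count,
    deqs.total_worker_time,
    deqp.query_plan
FROM sys.dm_exec_query_stats AS deqs
CROSS APPLY sys.dm_exec_query_plan(deqs.plan_handle) AS deqp
/* Keep only plans with at least one Warnings element. */
WHERE deqp.query_plan.exist('//Warnings') = 1
ORDER BY deqs.total_worker_time DESC;
```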

Warnings About Warnings

One thing to keep in mind about many of these warnings, whether they’re optimization-time or run-time, is that they’re triggered for things that might not matter, or that might not even be true.

I’ve written in the past about silly execution plan warnings. I’m not going to re-write all those here.

But I do want to reinforce a couple points I’ve made over the course of the series:

Estimated and cached plans miss details that make troubleshooting easier
Not every metric and warning is a critical datapoint

For example, I’ve seen people focus on small spills in execution plans many times only to have them not be the cause of a performance problem.

Operator times make distinguishing this easier, of course. Prior to that, even actual plans could be misleading and unforgiving.

Thanks for reading!


Starting SQL: I Don’t Care How Many Rows Are In Your Table Variable, They Can Still Hurt Performance

Posted on September 15, 2020 (updated May 16, 2022) by Erik Darling

Well, well, well

So you’re that odd soul who has been listening to rumors about table variables. Perhaps things about them only being in memory, or that they’re okay to use if you only put fewer than some arbitrary number of rows in them.

Those things are both wrong. But of course, my favorite rumor is the one about arbitrary numbers of rows being safe.

Ouch! What a terrible performance

Let’s do everything in our power to help SQL Server make a good guess.

We’ll create a couple indexes:

CREATE INDEX free_food ON dbo.Posts(OwnerUserId);
CREATE INDEX sea_food ON dbo.Comments(UserId);

Those stats’ll be so fresh you could make tartare with them.

We’ll create our table variable with a primary key on it, which will also be the clustered index.

DECLARE @t TABLE( id INT PRIMARY KEY );
INSERT @t ( id )
VALUES(22656);

And finally, we’ll run the select query with a recompile hint. Recompile fixes everything, yeah?

SELECT AVG(p.Score * 1.) AS lmao
FROM   @t AS t
JOIN   dbo.Posts AS p
    ON p.OwnerUserId = t.id
JOIN   dbo.Comments AS c
    ON c.UserId = t.id
OPTION(RECOMPILE);
GO

How does the query do for time? Things start off okay, but keep the cardinality estimate in mind.

But quickly go downhill.

Fish are dumb, dumb, dumb

The whole problem here is that, even with just one row in the table variable, an index on the one column in the table variable, and a recompile hint on the query that selects from the table variable, the optimizer has no idea what the contents of that single row are.

That number remains a mystery, and the guess made ends up being wrong by probably more than one order of magnitude. Maybe even an order of manure.

Table variables don’t gather any statistical information about what’s in the column, and so the optimizer has no frame of reference to make a better cardinality estimate on the joins.

If we insert a value that gets far fewer hits in both the Posts and Comments tables (12550), the estimate doesn’t really hurt. But note that the guesses across all operators are exactly the same.

C’est la vie mon ami

You have a database. Data is likely skewed in that database, and there are already lots of ways that you can get bad guesses. Parameter sniffing, out of date stats, poorly written queries, and more.

Databases are hard.

The point is that if you use table variables outside of carefully tested circumstances, you’re just risking another bad guess.

All of this is tested on SQL Server 2019, with table variable deferred compilation enabled. All that allows for is the number of rows guessed to be accurate. It makes no attempt to get the contents of those rows correct.

So next time you’re sitting down to choose between a temp table and a table variable, think long and hard about what you’re going to be doing with it. If cardinality estimation might be important, you’re probably going to want a temp table instead.
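For comparison, here’s a sketch of the same demo with a temp table swapped in. Temp tables can have column statistics, so the optimizer gets a chance to see what value is actually in there. I’m not promising any particular plan or timing; run it and compare the estimates yourself.

```sql
CREATE TABLE #t ( id INT PRIMARY KEY );

INSERT #t ( id )
VALUES (22656);

/* Same query as before, pointed at the temp table. */
SELECT AVG(p.Score * 1.) AS lmao
FROM   #t AS t
JOIN   dbo.Posts AS p
    ON p.OwnerUserId = t.id
JOIN   dbo.Comments AS c
    ON c.UserId = t.id;

DROP TABLE #t;
```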

Thanks for reading!


Starting SQL: How Eager Index Spool Operators Hurt Query Performance In SQL Server

Posted on September 14, 2020 (updated May 15, 2022) by Erik Darling

Hot To Trot

Most of the time when the optimizer thinks an index will make a big enough difference, it’ll tell you.

Sure, it’s not perfect, but it can get you where you’re going fast enough. If you’re relying on that sort of feedback in query plans, or in the missing index DMVs, you’ll hate Eager Index Spools.

Not only does SQL Server create an index for you, it doesn’t really tell you about it. There’s no loud warning here.

It also throws that index away when the query is done executing. It only exists in tempdb while the query executes, and it’s only available to the query that builds it.

And boy, they sure can take a long time to build.

Let’s take a closer look!

Skidding Out

Eager index spools can occur on the inner side of Nested Loops joins to reduce the amount of work that needs to be done there, by creating a more opportune index for the loop to work off of.

That all sounds very good, but there are some problems:

The index gets created single-threaded
The way data is loaded into the spool is very inefficient
The spool is disposed of when the query is finished
It’ll get built over and over again on later executions
There’s no missing index request for the spool anywhere

I’m not suggesting that the query would be faster without the spool. Reliable sources tell me that this thing runs for over 6 hours without it. My suggestion is that when you see Eager Index spools, you should pay close attention.

Let’s talk about how you can do that.

Mountainous

In some cases, your only option is to look at the Eager Index Spool to see what it’s doing, and create an index on your own to mimic it.

If you’re the kind of person who likes free scripts, sp_BlitzCache will look for Eager Index Spools in your query plans and do that for you. You’re welcome.
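If you’d rather poke at the plan cache yourself, something like the sketch below can find cached plans that contain an Eager Index Spool. It is not the sp_BlitzCache logic, just a simple XML search, and it can be slow on servers with a big plan cache.

```sql
WITH XMLNAMESPACES
(
    DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan'
)
SELECT TOP (20)
    deqs.execution_count,
    deqs.total_elapsed_time,
    deqp.query_plan
FROM sys.dm_exec_query_stats AS deqs
CROSS APPLY sys.dm_exec_query_plan(deqs.plan_handle) AS deqp
/* An Eager Index Spool is an Index Spool operator
   with a logical operation of Eager Spool. */
WHERE deqp.query_plan.exist
      ('//RelOp[@PhysicalOp="Index Spool" and @LogicalOp="Eager Spool"]') = 1
ORDER BY deqs.total_elapsed_time DESC;
```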

If you’re a more manual type, here’s what you do: Look at the Eager Index Spool.

[Image: SQL Server query plan tool tip]

The Seek Predicate(s) are the key columns, and the Output List is the included columns.

CREATE INDEX spool_b_gone 
ON dbo.Posts(ParentId) INCLUDE (Score);

Since the Id column is the clustered index key, we don’t explicitly need it in the index definition. Remember that nonclustered indexes inherit the clustered index key columns. It’ll end up as a “hidden” key column, after all.

Human Touch

In most cases, this will be good enough. The performance difference will be night and day, if the build source for the Eager Index Spool was fairly large, or if your query built the same Eager Index Spool multiple times.

Though just like missing index requests, Eager Index Spools don’t always come up with the *best* index.

Thinking through our query, we may want to move Score up to the key of the index.

SELECT pq.OwnerUserId, pq.Score, pq.Title, pq.CreationDate,
       pa.OwnerUserId, pa.Score, pa.CreationDate
FROM dbo.Posts AS pa
INNER JOIN dbo.Posts AS pq
    ON pq.Id = pa.ParentId
WHERE pq.PostTypeId = 1
AND   pq.CommunityOwnedDate IS NULL
AND   pq.AnswerCount > 1
AND   pa.PostTypeId = 2
AND   pa.OwnerUserId = 22656
AND   pa.Score >
(
    SELECT MAX(ps.Score)
    FROM dbo.Posts AS ps
    WHERE ps.ParentId = pa.ParentId 
    AND   ps.Id <> pa.Id
)
ORDER BY pq.Id;

See that subquery at the very end, where we’re aggregating on Score? Having Score in the key of the index will put the data in order, which makes a Stream Aggregate pretty painless. Remember that Stream Aggregates expect sorted input.

That’s, like, how they stream.

CREATE INDEX spool_b_gone 
    ON dbo.Posts(ParentId, Score);

Letter To The Query Editor

There are some cases where changing a query is a lot less painful than changing or adding indexes.

You might already have a lot of indexes, or you might have a really big table, or you might be on Standard Edition, which is a lot like being on one of those airplanes where the entire thing is economy class.

In this case, we can rewrite the query in a way that avoids the Eager Index Spool entirely:

SELECT pq.OwnerUserId, pq.Score, pq.Title, pq.CreationDate,
        pa.OwnerUserId, pa.Score, pa.CreationDate
FROM dbo.Posts AS pa
INNER JOIN dbo.Posts AS pq
    ON pq.Id = pa.ParentId
WHERE pq.PostTypeId = 1
AND   pq.CommunityOwnedDate IS NULL
AND   pq.AnswerCount > 1
AND   pa.PostTypeId = 2
AND   pa.OwnerUserId = 22656
AND NOT EXISTS
(
    SELECT 1/0
    FROM dbo.Posts AS ps
    WHERE ps.ParentId = pa.ParentId
    AND   ps.Id <> pa.Id
    AND   ps.Score >= pa.Score
)
ORDER BY pq.Id;

Which gets us a different plan. And you can see why we’d want one.

Avoiding the need for an Eager Index Spool reduces the query time from over a minute to under a second.

Thanks for reading!


Starting SQL: Blind Spots In SQL Server’s Missing Index Requests

Posted on September 11, 2020 (updated May 15, 2022) by Erik Darling

Underfoot

There are all sorts of things that might artificially keep the optimizer from suggesting a helpful index while you’re writing or tuning a query.

But go figure, there are also totally sensible things that will make it less likely that you’ll get one, and there are definitely things that the requests don’t consider as heavily.

That doesn’t mean they’re awful and wrong, it just means you should practice looking at the requests that get generated with execution plans to see if there’s anything you can tweak to make things run a little bit more smoothly.

Anyway, let’s learn some important lessons about missing index requests.

Lesson One: Inequalities Are Iffy

In my copy of the Stack Overflow database, both of these queries return the same single row.

There is only one comment with 1270(!!!) upvotes, so either query works to locate the data.

/*Inequality predicate*/
SELECT c.CreationDate,
       c.PostId,
       c.Score,
       c.Text,
       c.UserId
FROM dbo.Comments AS c
WHERE c.Score >= 1270 --Hello I'm here
AND   c.CreationDate >= '20110101'
AND   c.CreationDate <  '20120101'
ORDER BY c.CreationDate DESC;

/*Equality predicate*/
SELECT c.CreationDate,
       c.PostId,
       c.Score,
       c.Text,
       c.UserId
FROM dbo.Comments AS c
WHERE c.Score = 1270 --Hello I'm here
AND   c.CreationDate >= '20110101'
AND   c.CreationDate <  '20120101'
ORDER BY c.CreationDate DESC;

Strangely, only one query qualifies for a missing index request, and it’s not the first one.

The index request looks like this:

CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[Comments] ([Score],[CreationDate])
INCLUDE ([PostId],[Text],[UserId])

Which is likely fine, as long as you’re okay with the Text column being in there. It’s an NVARCHAR(700), which makes me a little hesitant.

Not adding it, of course, means the optimizer would have to make a choice about Key Lookups vs. Clustered Index Scans, depending on predicates.
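If you decide the Text column is too wide to carry around, the slimmed-down version would look something like this. The index name is made up, and whether the trade-off is worth it depends on how many rows your queries need to look up.

```sql
/* Same key columns as the request, without Text in the includes.
   Queries that select Text will need a Key Lookup (or a scan of
   the clustered index) to get it. */
CREATE NONCLUSTERED INDEX comments_score_creationdate
ON dbo.Comments (Score, CreationDate)
INCLUDE (PostId, UserId);
```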

Databases are hard.

Lesson Two: Joins Don’t Get A Lot Of Love

If we add a join to the Posts table, the missing index request doesn’t change.

SELECT c.CreationDate,
       c.PostId,
       c.Score,
       c.Text,
       c.UserId
FROM dbo.Comments AS c
JOIN dbo.Posts AS p
    ON p.Id = c.PostId
    AND p.OwnerUserId = c.UserId
WHERE c.Score = 1270 --Hello I'm here
AND   c.CreationDate >= '20110101'
AND   c.CreationDate <  '20120101'
AND 1 = (SELECT 1)
ORDER BY c.CreationDate DESC;

We get the same missing index request as above, which might strike you as a little odd.

Sometimes it’s nice to have join columns in the key of an index. Orderly data tends to join more easily, and can give the optimizer more efficient ways to do it.

It probably doesn’t matter a ton for Hash Joins, but Merge Joins expect data in order. If it’s not, the optimizer will introduce a Sort to get it in order. Likewise, Nested Loops without indexes can be painful.

Having columns as includes means they’re not stored in an order that might be helpful. In fact, they’re not stored in any order at all.

Lesson Three: Neither Does Order By

In our original query, the CreationDate column, by coincidence, was both in the WHERE and ORDER BY clauses.

It ended up in the key columns of the index because of the WHERE clause, though. If we remove it from there, it’ll move to the included columns.

SELECT c.CreationDate,
       c.PostId,
       c.Score,
       c.Text,
       c.UserId
FROM dbo.Comments AS c
WHERE c.Score = 1270 --Hello I'm here
ORDER BY c.CreationDate DESC;

[Image: SQL Server missing index request]
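A human looking at that last query might choose differently. With an equality predicate on Score, keeping CreationDate as the second key column stores the matching rows in date order, which gives the optimizer a way to satisfy the ORDER BY without sorting. This is my own sketch with a made-up index name, not something the missing index request will hand you.

```sql
/* Score first for the equality predicate,
   CreationDate second to support the ORDER BY. */
CREATE NONCLUSTERED INDEX comments_score_then_date
ON dbo.Comments (Score, CreationDate)
INCLUDE (PostId, Text, UserId);
```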

Why Are We Talking About Indexes Again?

If I do enough drilling, hopefully you’ll understand how important they are to performance.

What I want you to take away from this post is that when you’re looking specifically at missing index requests, whether they’re in query plans or in the missing index DMVs:

They’re not going to consider columns outside of the WHERE clause to be in the key of the index
If your WHERE clause doesn’t have an equality predicate, it’s a lot harder to get missing index requests
Columns that would be helpful to have in index order won’t end up in the key if they’re not in the WHERE
Even the order of columns suggested for being in the key of the index isn’t scientific
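If you want to review what’s piled up in the missing index DMVs for the current database, a basic query looks something like the sketch below. The TOP (20) and the sort order are arbitrary choices on my part, and the contents get wiped out on restart.

```sql
SELECT TOP (20)
    mid.statement AS table_name,
    mid.equality_columns,
    mid.inequality_columns,
    mid.included_columns,
    migs.user_seeks,
    migs.avg_user_impact
FROM sys.dm_db_missing_index_details AS mid
JOIN sys.dm_db_missing_index_groups AS mig
    ON mig.index_handle = mid.index_handle
JOIN sys.dm_db_missing_index_group_stats AS migs
    ON migs.group_handle = mig.index_group_handle
WHERE mid.database_id = DB_ID()
ORDER BY migs.user_seeks DESC;
```

Treat the output as a list of things to think about, not a list of indexes to create as-is.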

I covered a lot of the “why” in these posts:

Without them, it may look weird to spend time talking about the limitations here. But this is why databases still need humans to care for them.

As much as I look forward to a day where they don’t, little things like which columns should be in the key of an index, and the order they should be in, are still important considerations.

You know. Little things.

Thanks for reading!


Starting SQL: Why Wouldn’t An Index Help Your SQL Server Query Go Faster?

Posted on September 10, 2020 (updated May 14, 2022) by Erik Darling

Pennies From Redmond

We’re spoiled over here in SQL Server land. When we run queries, the optimizer will suggest indexes that might help it run faster. I say might, because sometimes it’s wrong. Like I mentioned before, these suggestions are a bit hasty.

If you read the limitations, the first thing they say is “It is not intended to fine tune an indexing configuration”.

And they’re not. But you know what? If you don’t have anyone who is fine tuning an indexing configuration, they’re a heck of a lot better than a whole lot of nothing.

Are You There, Optimizer?

Starting with no nonclustered indexes, and a clustered index on an unrelated column, this query is a prime candidate for an index. And yet, the optimizer has forsaken us in our time of need.

SELECT COUNT_BIG(*) AS records
FROM dbo.Comments AS c
WHERE COALESCE(c.UserId, 0) = 0;

This isn’t specific to COALESCE, any ol’ function — built-in or otherwise — would do it. But the optimizer doesn’t think an index would make this query any faster. That’s a real bummer.

You may hear people say things like “you can’t use indexes with functions”, but that sentence is incomplete. What’s meant is “you can’t use indexes to seek with functions”.

Going Boing

This goes back to the concept of SARGability that I talked about before. If we do a little bit more typing, the optimizer is our friend again. I know, typing more sucks, but you’ll thank me later.

SELECT COUNT_BIG(*) AS records
FROM dbo.Comments AS c
WHERE ( c.UserId = 0 
        OR c.UserId IS NULL );

The optimizer is back to having good ideas for us. But notice that both of these queries end up taking the same amount of time and doing the same amount of work.

If you were just getting started writing queries, performance tuning, or looking at query plans, you might see this and assume that query performance would remain identical.

It’s sort of a matter of opportunity, here. Like I said before, when you wrap predicates in functions, you lose the ability to seek into indexes.

Sure, you could use an index on UserId in either query, but it doesn’t usually fix a whole lot if you need to scan the entire thing.

Copying In

If we add in the index on UserId and compare plans, the difference is obvious.
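If you want to follow along, the setup is roughly this. The index name is one I made up for the sketch; run both queries with actual plans turned on and compare them.

```sql
CREATE INDEX comments_userid ON dbo.Comments (UserId);

/* Function around the column: the index can be scanned, not seeked. */
SELECT COUNT_BIG(*) AS records
FROM dbo.Comments AS c
WHERE COALESCE(c.UserId, 0) = 0;

/* Column left alone: the index can be used to seek. */
SELECT COUNT_BIG(*) AS records
FROM dbo.Comments AS c
WHERE ( c.UserId = 0
        OR c.UserId IS NULL );
```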

Our clearly-expressed query does a lot better here. It’s able to seek, and even running with a serial plan is much faster than the parallel plan for the poorly-expressed query. These things do matter to performance. Functions, even built-in ones, have no relational meaning to the optimizer, and can’t be reasoned with.

When it comes to writing queries, it usually pays to over-communicate. When you take shortcuts, you might not get the best possible outcome.

Tomorrow, we’ll look at hidden missing index requests.

Thanks for reading!


Starting SQL: Why Your SQL Server Query Can’t Go Parallel, Scalar Functions Edition

Posted on September 9, 2020 (updated May 17, 2022) by Erik Darling

I Won’t Share You

Yesterday we looked at where table variables can have a surprising! impact on performance. We’ll talk more about them later, because that’s not the only way they can stink. Not by a long shot. Even with 1 row in them.

Anyway, look, today’s post is sort of like yesterday’s post, except I’ve had two more drinks.

What people seem to miss about scalar valued functions is that there’s no distinction between ones that touch data and ones that don’t. That might be some confusion with CLR UDFs, which cause parallelism issues when they access data.

Beans and Beans

What I want to show you in this post is that it doesn’t matter if your scalar functions touch data or not, they’ll still have similar performance implications for the queries that call them.

Now look, this might not always matter. You could just use a UDF to assign a value to a variable, or you could call it in the context of a query that doesn’t do much work anyway. That’s probably fine.

But if you’re reading this and you have a query that’s running slow and calling a UDF, it just might be why:

If the UDF queries table data and is inefficient
If the UDF forces the outer query to run serially

They can be especially difficult on reporting type queries. On top of forcing them to run serially, the functions also run once per row, unlike inline-able constructs.

Granted, this once-per-row thing is worse for UDFs that touch data, because they’re more likely to encounter the slings and arrows of relational data. The reads could be blocked, or the query in the function body could be inefficient for a dozen reasons. Or whatever.

I’m Not Touching You

Here’s a function that doesn’t touch anything at all.

CREATE OR ALTER FUNCTION dbo.little_function (@UserId INT)
RETURNS BIGINT
WITH SCHEMABINDING, 
     RETURNS NULL ON NULL INPUT
AS
BEGIN
DECLARE @d DATETIME = GETDATE();
RETURN
(
    ( 
      SELECT @UserId
    )
)
END
GO

I have the declared variable in there set to GETDATE() to disable UDF inlining in SQL Server 2019.

Yes, I know there’s a function definition option (INLINE = OFF) to do the same thing, but I want you to see just how fragile a feature it is right now. Again, I love where it’s going, but it can’t solve every single UDF problem.
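For reference, here's a sketch of the other route, assuming SQL Server 2019 or later. The function name dbo.little_function_no_inline is made up so it doesn't overwrite the demo function. The second query checks what SQL Server thinks about inlining for both functions: is_inlineable says whether the body meets the inlining requirements, and inline_type says whether inlining is currently turned on for the module.

```sql
/* Hypothetical copy of the demo function, with inlining switched off
   in the definition instead of by a GETDATE() variable */
CREATE OR ALTER FUNCTION dbo.little_function_no_inline (@UserId INT)
RETURNS BIGINT
WITH SCHEMABINDING, 
     RETURNS NULL ON NULL INPUT,
     INLINE = OFF
AS
BEGIN
RETURN
(
    ( 
      SELECT @UserId
    )
)
END
GO

/* Compare the inlining metadata for both versions */
SELECT
    function_name = OBJECT_NAME(sm.object_id),
    sm.is_inlineable,
    sm.inline_type
FROM sys.sql_modules AS sm
WHERE sm.object_id IN 
      ( OBJECT_ID(N'dbo.little_function'), 
        OBJECT_ID(N'dbo.little_function_no_inline') );
```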

Anyway, back to the story! Let’s call that function that doesn’t do anything in our query.

SELECT TOP (1000) 
    c.Id,
    dbo.little_function(c.UserId)
FROM dbo.Comments AS c
ORDER BY c.Score DESC;

The query plan looks like so, with the warning in properties about not being able to generate a valid parallel plan.

[Query plan screenshot: what’s so great about you?]

In this plan, we see the same slowdown as the insert to the table variable. There’s no significant overhead from the function, it’s just slower in this case because the query is forced to run serially by the function.

This is because of the presence of a scalar UDF, which can’t be inlined in 2019. The serial plan represents, again, a significant slowdown over the parallel plan.
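If you want to see the other side of that comparison yourself, one option is to run the same query with the function's logic written inline. Since dbo.little_function only hands back its input, a CONVERT stands in for it in this sketch. Whether you actually get a parallel plan depends on your cost threshold for parallelism and MAXDOP settings.

```sql
/* Same shape as the query above, minus the scalar UDF call,
   so the optimizer is free to consider a parallel plan */
SELECT TOP (1000) 
    c.Id,
    CONVERT(BIGINT, c.UserId) AS UserId
FROM dbo.Comments AS c
ORDER BY c.Score DESC;
```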

Bu-bu-bu-but wait it gets worse

Let’s look at a worse function.

CREATE OR ALTER FUNCTION dbo.big_function (@UserId INT)
RETURNS BIGINT
WITH SCHEMABINDING, 
     RETURNS NULL ON NULL INPUT
AS
BEGIN
DECLARE @d DATETIME = GETDATE();
RETURN
(
    ( 
      SELECT SUM(p.Score)
      FROM dbo.Posts AS p
      WHERE p.OwnerUserId = @UserId
    ) - 
    (
      SELECT SUM(c.Score)
      FROM dbo.Comments AS c
      WHERE c.UserId = @UserId
    )
)
END
GO

Not worse because it’s a different kind of function, just worse because it goes out and touches tables that don’t have any helpful indexes.

Getting to the point, if there were helpful indexes on the tables referenced in the function, performance wouldn’t behave as terribly. I’m intentionally leaving it without indexes to show you a couple funny things though.
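For reference, indexes along these lines would probably count as helpful for the two queries in the function body. The index names are made up, and these are not created for the demos in this post.

```sql
/* Hypothetical index names: key on the filtered column,
   include the column being summed */
CREATE INDEX ix_posts_owneruserid 
    ON dbo.Posts (OwnerUserId) INCLUDE (Score);

CREATE INDEX ix_comments_userid_score 
    ON dbo.Comments (UserId) INCLUDE (Score);
```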

Because this will run a very long time with a top 1000, I’m gonna shorten it to a top 1.

SELECT TOP (1) 
    c.Id,
    dbo.big_function(c.UserId)
FROM dbo.Comments AS c
ORDER BY c.Score DESC;

Notice that in this plan, the compute scalar takes up a more significant portion of query execution time. We don’t see what the compute scalar does, or what the function itself does in the actual query plan.

[Query plan screenshot: got yourself a function]

The compute scalar operator is what’s responsible for the scalar UDF being executed. In this case, it’s just once. If I had a top that asked for more than one row, it would be responsible for more executions.
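Since the plan won't show you how many times the function ran, one way to check is the function stats DMV. This is a sketch that assumes SQL Server 2016 or later, that you're in the database where the function lives, and that the function's plan is still in the cache.

```sql
/* Execution counts and resource totals for the scalar UDF */
SELECT
    function_name = OBJECT_NAME(fs.object_id, fs.database_id),
    fs.execution_count,
    fs.total_worker_time,
    fs.total_elapsed_time,
    fs.total_logical_reads
FROM sys.dm_exec_function_stats AS fs
WHERE fs.database_id = DB_ID()
AND   fs.object_id = OBJECT_ID(N'dbo.big_function');
```

Check execution_count before and after running the query to see how many calls a single run was responsible for.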

We don’t see the function’s query plan in the actual query, because it could generate a different query plan on each execution. Would you really want to see 1000 different query plans?

Anyway, it’s quite easy to observe with operator times where time is spent here. Most people read query plans from right to left, and that’s not wrong.

In that same spirit, we can add operator times up going from right to left. Each operator not only accounts for its own time, but for the time of all operators that come before it.

The clustered index scan takes 7.5 seconds, the Sort takes 3.3 seconds, and the compute scalar takes 24.9 seconds. Wee.
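For contrast, here's what an inline-able version of the same logic might look like, written as an inline table valued function and called with CROSS APPLY. This is only a sketch: the name dbo.big_function_inline and the TotalScore column alias are made up, and it still needs decent indexes to perform well.

```sql
/* Hypothetical inline table valued function with the same logic.
   The optimizer can expand this into the calling query's plan. */
CREATE OR ALTER FUNCTION dbo.big_function_inline (@UserId INT)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
SELECT
    TotalScore = 
        CONVERT
        (
            BIGINT,
            ( 
              SELECT SUM(p.Score)
              FROM dbo.Posts AS p
              WHERE p.OwnerUserId = @UserId
            ) - 
            (
              SELECT SUM(c.Score)
              FROM dbo.Comments AS c
              WHERE c.UserId = @UserId
            )
        );
GO

/* Same outer query, calling the inline version instead */
SELECT TOP (1) 
    c.Id,
    bf.TotalScore
FROM dbo.Comments AS c
CROSS APPLY dbo.big_function_inline(c.UserId) AS bf
ORDER BY c.Score DESC;
```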

Step Inside

If you get an actual plan for this query, you won’t see what the function does. If you get an estimated plan, you can get a picture of what the function is up to.
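If you'd rather not click the estimated plan button in SSMS, a T-SQL way to ask for one looks something like this. The SET statements have to sit in their own batches, and the query isn't executed while the setting is on.

```sql
SET SHOWPLAN_XML ON;
GO

/* Not executed while SHOWPLAN_XML is on;
   only the estimated plan XML comes back */
SELECT TOP (1) 
    c.Id,
    dbo.big_function(c.UserId)
FROM dbo.Comments AS c
ORDER BY c.Score DESC;
GO

SET SHOWPLAN_XML OFF;
GO
```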

This is what I meant by the function body being allowed to go parallel. This may lead to additional confusion when the calling query accrues parallel query waits but shows no parallel operators, and has a warning that a parallel plan couldn’t be generated.

It’s Not As Funny As It Sounds

If you look at a query plan’s properties and see a non-parallel plan reason, table variable modifications and scalar UDFs will be the most typical causes. They may not always be the cause of your query’s performance issues, and there are certainly many other local factors to consider.
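If you want to go hunting for these, a sketch like the one below pulls cached plans that carry a non-parallel plan reason. Shredding plan XML across the whole plan cache can be slow and resource hungry on a busy server, so treat this as something to try carefully, not a monitoring query.

```sql
/* Find cached plans with a NonParallelPlanReason attribute */
WITH XMLNAMESPACES 
    (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT TOP (20)
    reason = 
        qp.query_plan.value
        ('(//QueryPlan/@NonParallelPlanReason)[1]', 'varchar(200)'),
    query_text = st.text,
    qp.query_plan
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE qp.query_plan.exist('//QueryPlan[@NonParallelPlanReason]') = 1;
```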

It’s all a bit like a game of Clue. You might find the same body in the same room with the same bashed in head, but different people and blunt instruments may have caused the final trauma.

Morbid a bit, sure, but if query tuning were always a paint by numbers, no one would stay interested.

Anyway.

In the next posts, we’ll look at when SQL Server tells you it needs an index, and when it doesn’t.

Thanks for reading!