Software Vendor Mistakes With SQL Server: Writing And Optimizing Paging Queries

Paging Doctor Darling


Words I’ll never hear, at least in reference to me. I’m sure there’s a Darling out there who is a doctor. A real doctor, too. Not one of those “I went to college for 7 extra years” doctors who don’t know anything about the human body.

But I digress! Today we’re gonna learn about paging queries. Why? Because I see people screw them up all the gosh darn ding-a-ling time.

Far and away, I think the worst problem starts at the beginning — the SELECT list — where people try to get every single possible column. There’s a much better query pattern available to you, if you’re open to writing a little bit more code.

Unfortunately, I haven’t seen any ORMs able to handle this method natively. Hard to believe developers developed the same performance-killing method in C# that they did in T-SQL.

Okay, maybe not so hard to believe. But follow along to see a much better way of doing it in the video below. It’s just a small part of my paid training, and if you like what you see there’s a link at the end of the post for 75% off the whole package.

Optimizing Paging Queries Video


Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Software Vendor Mistakes With SQL Server: Handling Optional Parameters

Optionally Yours


You’ve got tables. So many tables And columns. So many columns.

Users — demanding as they are — might wanna see, search, and order by all sorts of things in those many columns in those many tables. Writing a query to do that is hard at first, especially if you’re afraid of dynamic SQL.

Over the years, developers have come up with all sorts of “workarounds” for handling optional search parameters. The problem is that none of them work, whether it’s:

  • column = @parameter or @parameter is null
  • column = isnull(@parameter, column)
  • column = coalesce(@parameter, column, ‘magic’)

Or any variation thereof. Doing this can (and will!) screw up query performance in all sorts of ways that sneak up on you.

  • Bad cardinality estimates
  • Scanning instead of Seeking in indexes
  • Using the “wrong” indexes

In the video below, I’ll show you how to use dynamic SQL the right way to handle optional parameter search scenarios. This video is a small part of my paid training. If you like what you see, there’s a link for 75% off the entire package at the bottom of the post.

Optional Parameters Video


Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Software Vendor Mistakes With SQL Server: Writing Unsafe Dynamic SQL

SQL Injection


The first thing that comes to mind when you mention dynamic SQL is SQL injection. That makes sense! It’s a scary thing that has messed up a lot of people, places, and things.

Imagine how scary it would be to wake up one day and find there had been a data breach, defacement, server takeover, or ransomeware-d.

All of these things are possible if you write the bad kind of dynamic SQL, or think that typing square brackets will save you.

If I had my way, there would be more way more invested in making safe dynamic SQL easy to write, debug, and maintain. But I don’t have my way, and Microsoft is more focused on defrauding you with shoddy cloud performance than anything else.

To define the problem:

  • You write code that accepts user input
  • The input is a string with a non-arbitrary length
  • Which is concatenated into another string for execution
  • And not parameterized, properly quoted, etc.

Building A String


This can be done in the application, in SQL Server, or a mix of the two.

Since I’m not an application developer, I’m going to scope this to how it looks in a T-SQL stored procedure.

CREATE OR ALTER PROCEDURE
    dbo.haxer
(
    @injectable nvarchar(100)
)
AS
BEGIN
   SET NOCOUNT, XACT_ABORT ON;

/*Our placeholder variable for the string we'll build*/
DECLARE    
    @sql nvarchar(MAX) = N'';

/*The start of our query-building*/
SELECT
    @sql = N'
SELECT
    H.*
FROM dbo.hax AS H
WHERE 1 = 1
';

/*
This is where things first go wrong,
concatenating input into the string
*/
IF @injectable IS NOT NULL
BEGIN
    SELECT
        @sql += N'AND H.user_name = ''' + @injectable + ''';';
END;

PRINT @sql;

/*
This is where things continue to go wrong, 
using EXEC against the variable
*/
EXEC(@sql);

END;
GO 

There’s not much complicated here. The comments tell the story:

  • Concatenating the @injectable parameter into the string means it can bring anything the user wants into your query
  • Using EXEC directly against the @sql variable means it can’t be parameterized

Before we can run this, though, we need a table and some dummy data.

Dummy Data And String Things


Here’s a simple table, with a single row in it. I don’t think this needs much explanation.

CREATE TABLE
    dbo.hax
(
    id int PRIMARY KEY,
    user_name nvarchar(100),
    points bigint,
    location nvarchar(250),
    creation_date datetime,
    last_login datetime
);

INSERT 
    dbo.hax
(
    id,
    user_name,
    points,
    location,
    creation_date,
    last_login
)
VALUES
(
    1,   
    N'Rick Ross',
    400000000,
    'Port Of Miami',
    '20060808',
    '20211210' 
);

As a note here, this is another good reason to reasonably scope string column lengths. The shorter they are, the shorter search parameters are, and that limits the amount of potentially malicious code that can be stuffed into them. A 10-byte length string is a lot harder to cause problems with than a string with a 50-byte length, and so on. I’ve oversized a couple columns here as examples.

Is there any value in a 100 character user name, or a 250 character location? I sort of doubt it.

Executed Normally


When we execute the stored procedure as-is, searching for our lone user, we get a single result back:

EXEC dbo.haxer
    @injectable = N'Rick Ross';

The current query result isn’t important, but the result of the PRINT statement is.

SELECT
    H.*
FROM dbo.hax AS H
WHERE 1 = 1
AND H.user_name = 'Rick Ross';

There are two issues, here because of the lack of parameterization

  • It can lead to generating multiple execution plans for the same query
  • Users can substitute the legitimate search with a malicious query

Unfortunately, neither the Optimize For Ad Hoc Workloads nor the Forced Parameterization settings can protect you from SQL injection. Forced parameterization can help you with the second problem, though.

Executed Abnormally


When we execute the stored procedure with a slightly different value for the parameter, we get into trouble:

EXEC dbo.haxer
    @injectable = N'%'' UNION ALL SELECT object_id,name,schema_id,type_desc,create_date,modify_date FROM sys.tables --';

We’re not trying to do anything too crazy here, like drop a table or a database, change a setting, or other thing that you might need a lot of control of the server, database, or schema to do. We’re querying a pretty standard system view.

Now, I’m not big on security, permissions, or any of that stuff. It has never captured my fancy, so I don’t have any great advice about it. I will make a few points though:

  • Most applications have a single login
  • A lot of the time that login is given sa-level permissions
  • Even when it’s not, that login may need to do routine database things:
    • Check if databases, tables, columns, or indexes exist (especially during upgrades)
    • Create or drop all of those things
    • Execute database maintenance procedures
    • Mess with settings or Agent jobs

Without really tight control of a whole bunch of permissions, you’re gonna end up with applications that can do a ton of crazy stuff, including a list of tables in the database from this query.

SELECT
    H.*
FROM dbo.hax AS H
WHERE 1 = 1
AND H.user_name = '%' UNION ALL SELECT object_id,name,schema_id,type_desc,create_date,modify_date FROM sys.tables--';

All of this fits well within the 100 characters available in the search parameter, and the entire query is executed as one.

From here, a mean person could take one of the tables names and list out the columns (using the sys.columns view), and then use that to grab data they wouldn’t normally be allowed to see.

This how user data ends up on the dark web. Passwords, credit cards, social security numbers, phone numbers, addresses, childhood fears. You name it.

Protecting Yourself


Sure, you can steam yourself up and encrypt the crap out of everything, not store certain personal or processing details in the database, or a number of other things.

But before you go and dump out the bag of golden hammers, let’s start by eliminating the risks of dynamic SQL. We’re gonna need to rewrite our stored procedure a bit to do that.

CREATE OR ALTER PROCEDURE
    dbo.no_hax
(
    @injectable nvarchar(100)
)
AS
BEGIN
   SET NOCOUNT, XACT_ABORT ON;

/*Our placeholder variable for the string we'll build*/
DECLARE    
    @sql nvarchar(MAX) = N'';

/*The start of our query-building*/
SELECT
    @sql = N'
SELECT
    H.*
FROM dbo.hax AS H
WHERE 1 = 1
';

/*
This changed: the parameter is embedded in the string
instead of being concatenated into it
*/
IF @injectable IS NOT NULL
BEGIN
    SELECT
        @sql += N'AND H.user_name = @injectable;';
END;

PRINT @sql;

/*
This also changed: we're using EXEC against
the system stored procedure sp_executesql,
and feeding it parameters 
*/
EXEC sys.sp_executesql
    @sql,
  N'@injectable nvarchar(100)',
    @injectable;

END;
GO

In this version, here’s what we’re doing different:

  • The parameter is inside of the string, instead of concatenated in from outside
  • We’re using the system stored procedure sp_executesql to execute the string we’ve built up

It’s important to note that just using sp_executesql won’t protect you from SQL injection attacks unless the strings are parameterized like they are here.

Any Execution


No matter how we execute this, the query that gets generated and executed will look the same.

EXEC dbo.no_hax
    @injectable = N'Rick Ross';

EXEC dbo.no_hax
    @injectable = N'%'' UNION ALL SELECT object_id,name,schema_id,type_desc,create_date,modify_date FROM sys.tables--';
SELECT
    H.*
FROM dbo.hax AS H
WHERE 1 = 1
AND H.user_name = @injectable;

The important thing here is that the second execution does not result in successful SQL injection, but the first query returns correct results.

The second query returns no results this time, because the entire search string is parameterized, and no names match the supplied value.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Software Vendor Mistakes With SQL Server: IF Branching In Stored Procedures

What Goes Wrong


I still see people calling into this trap when writing stored procedures. What no one seems to realize until it’s too late data grows past a toy data base size, is that SQL Server’s query optimizer doesn’t respect IF branches as boundaries in the way you’d expect.

Sure, the control-flow logic boundaries are respected, but the big surprise to most people comes from how the optimizer treats query plans in IF branches: It compiles them all no matter what.

In the video below, I’ll show you exactly what that means, and how it can really screw up performance. This video is a small part of my paid training offering, and if you enjoy it, there’s a link at the end of the post to get 75% off.

IF Branching Video!


Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Software Vendor Mistakes With SQL Server: Using CASE Expressions In JOIN Or WHERE Clauses

Just In Case


There are some circumstances where you can use this approach in a non-harmful way. In general it’s something you should avoid, though, especially if you’re not sure how to tell if it’s harmful or not.

This is one of those unfortunate query patterns that leads to you getting called nights and weekends because the server blew up. I’ve seen it do everything from generate a reliably bad query plan, to adding just enough of an element of cardinality uncertainty that slight plan variations felt like parameter sniffing — even with no parameters involved!

Uncertainty is something we’ve covered quite a bit in this series, because if you don’t know what you want, SQL Server’s optimizer won’t either.

Up The Bomb


Let’s start with a reasonable index on the Posts table:

CREATE INDEX p ON
    dbo.Posts
(
    PostTypeId,
    OwnerUserId
);

We don’t need a very complicated query to make things bad, even with that stunningly perfect index in place.

SELECT
    c = COUNT_BIG(*)
FROM dbo.Users AS u
JOIN dbo.Posts AS p
    ON 1 = CASE     
               WHEN p.PostTypeId = 1
               AND  p.OwnerUserId = u.Id
               THEN 1
               WHEN p.PostTypeId = 2
               AND  p.OwnerUserId = u.AccountId
               THEN 1
               ELSE -1
           END
WHERE u.Reputation > 500000;

This returns a count of 3,374, and runs for about a minute total.

Naughty Query Plan


The plan for this may look innocent enough, but it’s one of those cases where a Lazy Table Spool is a warning sign.

SQL Server Query Plan
unhappy ending

Operator times in query plans are generally a blessing, because they show you where execution time ratchets up. Unfortunately, it’s not always clear what you have to do to fix the problems they show you.

I suppose that’s what posts like this are for, eh?

Oh, what a feeling.

Rewriting The Query


A Useful Rewrite© of this query looks something like this, at least if you’re a fancy-pants blogger who cares a lot about formatting.

SELECT
    c = COUNT_BIG(*)
FROM
(
    SELECT
        u.Id,
        u.AccountId
    FROM dbo.Users AS u
    WHERE u.Reputation > 500000 
) AS u
CROSS APPLY
(

    SELECT
        p.*
    FROM dbo.Posts AS p
    WHERE p.PostTypeId = 1
    AND   p.OwnerUserId = u.Id

    UNION ALL

    SELECT
        p.*
    FROM dbo.Posts AS p
    WHERE p.PostTypeId = 2
    AND   p.OwnerUserId = u.AccountId
) AS c;

I’m using a derived table here because if I used a Common Table Expression, you’d think they have some ridiculous magic powers that they really don’t, and one wouldn’t make the query any more readable.

Caring about formatting makes queries more readable.

Good game.

Nice Query Plan


You’ll notice that this plan no longer features a Table Spool, no longer runs for over a minute, and makes me happy.

SQL Server Query Plan
push it to the limit

I’m not a math major, but 181 milliseconds seems like a great improvement over 60 seconds.

Suede Shoes


This is another case of more typing for us results in a faster query. Perhaps there’s some wisdom to learning how to clearly express oneself before starting a career talking to databases.

Database Communications Major, or something. Just don’t buy it from Sally Struthers.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Software Vendor Mistakes With SQL Server: Using Left Joins To Find Rows That Don’t Exist

Natural Language


In SQL Server’s T-SQL, and in the more general ANSI-Standard SQL, you’re supposed to write queries in a way that mimics how you’d ask the same question — just don’t call it a query — in English. Because of that, there are some ways to phrase a query that are more natural than others.

Some are more intuitive once you get them down, and others can bewitch you for decades. For example, I’m still not always awesome at reasoning out INTERSECT and EXCEPT queries, or even thinking of them first when writing a query where they’d be useful.

Maybe someday.

Dirty Pig Latin


Bad advice exists in many places in the world. I don’t mean to single out the internet, though it certainly has made the publishing and proliferation of bad advice much more accessible.

I do a lot of reading about databases in general, and SQL Server specifically, to see what other folks are out there writing about and teaching others. One of my favorite arrondissements of that world is the stuff that attracts beginners.

After all, that’s when you develop habits, good or bad. Much of this series focuses on the bad habits learned in that time, and how they muck up performance.

One of those bad habits I see over and over again is using LEFT JOINs to find rows that don’t exist. The reason I focus on these is because of the effect they have on query plans, due to the way that queries have to be logically processed.

Let’s look at that, first.

Who’s Not On First?


A quite common looking query for doing this (if you want to do it wrong) would look something like this:

SELECT
   c = COUNT_BIG(*)
FROM dbo.Users AS U
LEFT JOIN dbo.Comments AS C
    ON  C.UserId = U.Id
    AND C.Score > 0
WHERE C.Id IS NULL;

The important part of the query plan is right around here:

SQL Server Query Plan
slug

If you’re looking extra closely, without any leading or prompting by me whatsoever, you’ll notice that after the join operation bring the two tables we’re querying togethers — Users and Comments — which is expressed as a left outer join of course, then and only then do we filter out rows where the Id column in Comments is NULL.

The problem is that all this is after the join, and in some scenarios this is far less efficient. Both from the perspective that you have to join many more matching rows together, and from the perspective that the optimizer can sometimes have a weird time ordering outer joins, especially when there are a lot of them.

Note that, for various reasons, this query runs for around 4 seconds total.

Who Doesn’t Exist On First?


An often better way of expressing this sort of query is using the — wait for it — natural expression of the Structured Query Language.

SELECT
   c = COUNT_BIG(*)
FROM dbo.Users AS U
WHERE NOT EXISTS
      (
          SELECT
              1/0
          FROM dbo.Comments AS C
          WHERE  C.UserId = U.Id
          AND    C.Score > 0
      );

This query is logically equivalent to the last one. It may even be arguably better looking. More elegant, as the developers say when they want to describe well-formatted code that no one really understands.

The query plan looks like this now:

SQL Server Query Plan
but better

Now we get this crazy little thing called an Anti Semi Join. That means rows are filtered out at the join rather than at some point later on in an explicit Filter operator.

To highlight things a little further, look at the actual number of rows that pass through the filter in the original query and the join in the second query:

SQL Server Query Plan
a number!

See there? The filter in the LEFT JOIN query reduces he working row set to the same number as the NOT EXISTS query does at the join.

In most cases, you’re better off writing queries this way. It may depend a bit on available indexes, batch mode, and server settings. But writing a blog post that takes every single one of those things into account would likely leave you bored out of your gourd.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Software Vendor Mistakes With SQL Server: Joining On OR Conditions

Choose One Once


Everyone has that person in their life. I’m not gonna call them a “friend”, but they might be. Whoever they are, you know them by their indecisiveness. Because when you interact with them, you know you’re gonna have a whole bunch of decisions to make before you come up with a plan.

Those decisions are usually proposed to you in a series of “or we could” statements that you barely have time to assess before the next one hits. It makes life difficult.

Don’t be that person to SQL Server’s optimizer. I mean, don’t be that person in general, but especially don’t be that person to an expensive piece of software that makes business critical transactions and decisions possible for you.

Be Kind, Rewind


Remember a couple things that have been mentioned at various points in this series:

  • Anything that makes your job easier makes the optimizer’s job harder
  • Query data the way it’s stored, and store data the way you query it

What’s a real shame is that most of the things that fall into the first category involve you typing fewer characters once. Isn’t that a funny thing? Normally you type a bunch of code once, and after that it’s a matter of executing it.

Sometimes typing more once really pays off.

Let’s look at an example of that!

Orrible


In this query, we’re just going to use one OR predicate in a join, and as the saying goes, “this is Jackass”.

SELECT
   c = COUNT_BIG(*)
FROM dbo.Users AS U
JOIN dbo.Comments AS C
    ON  C.UserId = U.Id
    OR  C.UserId = U.AccountId;

Since I’d like to keep your attention here, I’m just gonna show you the end result query plan, and how long it takes.

You’re not misreading that it takes around 15 seconds.

SQL Server Query Plan
beretta

Most of that time is spent in ridiculous Nested Loops Joins. And the really nice part is that you can’t hint hash or merge joins; you get a query processor error.

Union Dues


Let’s compare that to a query where we keep things simple(r). Sure, we type more, and part of our brain had to wake up and do some thinking.

But think of the optimizer. If we save just one optimizer, people, it’s worth it.

SELECT
    c = SUM(x.c)
FROM 
(
    SELECT
       c = 1
    FROM dbo.Users AS U
    JOIN dbo.Comments AS C
        ON  C.UserId = U.Id

    UNION ALL 

    SELECT
       c = 1
    FROM dbo.Users AS U
    JOIN dbo.Comments AS C
        ON  C.UserId =  U.AccountId
    	AND C.UserId <> U.Id
) AS x;

To shortcut to the important part of the plan, this version runs in a little under two seconds.

SQL Server Query Plan
little under

More typing. Fewer problems. Good and good. You might be okay with accepting this rewrite and explanation, and be off to test how a similar rewrite might improve your queries. If so, great! Get to it, Dear Reader.

If you want to dig further into why, follow me along to the next section.

Why OR Conditions Turn Our Poorly


Here’s the “full” query plan for the OR condition join query. There are a couple operators at the far left side that I’ve omitted because they don’t help with the explanation.

SQL Server Query Plan
by the numbers
  1. We scan the Users table. There nothing we can Seek to, so that’s fine, but note the number of r0ws (2,465,713) that come out
  2. All rows from the Users table are emitted via a constant scan operator for the Id column
  3. All rows from the Users table are emitted via a constant scan operator for the AccountId column

I don’t think the optimizer makes better guesses for constant scans, so even though all rows in the table are escorted forth, none of the estimates reflect anything close to that.

The full set of these rows combined (4,931,426) get sorted together rather than as two separate sets, and then a meager attempt to reduce overlapping values is applied at the merge interval. That set is only reduced to 4,931,419; a whopping 7 rows removed.

All of those rows are fed into an Apply Nested Loops join that hits the Comments table and searches the UserId column for values that fall between lower and upper bounds.

SQL Server Query Plan
surely

Even with a good index to seek to these values in, the sheer number of executions gets us stuck pretty badly. In all, ~14 of the ~15 seconds of execution time is spent in this exercise in futility. This query pattern becomes even worse with less useful indexes on the join columns.

At the moment, SQL Server’s query optimizer isn’t able to unwind OR predicates like this to come up with a better execution plan.

A Stronger Union


In the UNION ALL query, you’ll see far less futility, and a more reasonable set of operators presented for the number of rows being processed.

SQL Server Query Plan
trees

There are of course circumstances where this might be less efficient, like if one or more tables isn’t able to fit in the buffer pool and disk I/O sucks (you’re in the cloud, for example). But in general, splitting queries into factions that express precise logic is a good thing.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Software Vendor Mistakes With SQL Server: Using Date Functions On Columns Or Local Variables

Reusable Logic


In most programming languages, it’s quite sensible to create a variable or parameter, use some predefined logic to assign it a value, and then keep reusing it to prevent having to execute the same code over and over again.

But those languages are all procedural, and have a bit of a different set of rules and whatnot. In SQL Server, there are certainly somewhat procedural elements.

  • Functions
  • Control-flow logic
  • Cursors
  • While loops
  • Maybe the inner side of Nested Loops joins

You may be able to name some more, if you really get to thinking about it. That should be a common enough list, though.

Reusable Problems


SQL Server has a wonderful optimizer. It’s capable of many things, but it also has some problems.

Many of those problems exist today for “backwards compatibility”. In other words: play legacy games, win legacy performance.

Lots of people have found “workarounds” that rely on exploiting product behavior, and taking that away or changing it would result in… something else.

That’s why so many changes (improvements?) are hidden behind trace flags, compatibility levels, hints, batch mode, and other “fences” that you have to specifically hop to see if the grass is greener.

One of those things is the use of local variables. The linked post details how lousy those can be.

In this post, I show how you’re better off using the date math expressions instead.

And in this post, I show how you’re better off doing date math on parameters rather than on columns.

Let’s bring all that together!

Reusable Solutions


In SQL Server, context is everything. By context, I mean the way different methods of query execution are able to accept arguments from others.

You’ll sometimes hear this referred to as scope, too. Usually people will say inner context/scope and outer context/scope, or something similar.

What that means is something like this, if we’re talking about stored procedures:

CREATE PROCEDURE
    dbo.InnerContext
(
    @StartDate datetime,
    @EndDate datetime
)
AS
BEGIN
SET NOCOUNT, XACT_ABORT ON;

    SELECT
        C.PostId,
    	Score = 
    	    SUM(C.Score)
    FROM dbo.Comments AS C
    JOIN dbo.Votes AS V
        ON C.PostId = V.PostId
    WHERE C.CreationDate >= @StartDate
    AND   c.CreationDate <  @EndDate
    GROUP BY c.PostId;

END;
GO 

CREATE PROCEDURE
    dbo.OuterContext
(
    @StartDate datetime,
    @EndDate datetime
)
AS
BEGIN
SET NOCOUNT, XACT_ABORT ON;

IF @StartDate IS NULL
BEGIN
   SELECT
       @StartDate = GETDATE();
END;

IF @EndDate IS NULL
BEGIN
   SELECT
       @EndDate = DATEADD(DAY, 30, GETDATE());
END;

EXEC dbo.InnerContext
    @StartDate = @StartDate,
    @EndDate = @EndDate;

END;

If you’re okay using dynamic SQL, and really, you should be because it’s awesome when you’re not bad at it, you can do something like this:

CREATE PROCEDURE
    dbo.OuterContext
(
    @StartDate datetime,
    @EndDate datetime
)
AS
BEGIN
SET NOCOUNT, XACT_ABORT ON;

IF @StartDate IS NULL
BEGIN
   SELECT
       @StartDate = GETDATE();
END;

IF @EndDate IS NULL
BEGIN
   SELECT
       @EndDate = DATEADD(DAY, 30, GETDATE());
END;

DECLARE 
    @sql nvarchar(MAX) = N'
    /*dbo.OuterContext*/
    SELECT
        C.PostId,
    	Score = 
    	    SUM(C.Score)
    FROM dbo.Comments AS C
    JOIN dbo.Votes AS V
        ON C.PostId = V.PostId
    WHERE C.CreationDate >= @StartDate
    AND   c.CreationDate <  @EndDate
    GROUP BY c.PostId;		
    ';

EXEC sys.sp_executesql
    @sql,
  N'@StartDate datetime, 
    @EndDate datetime',
    @StartDate,
    @EndDate;

END;

Which will achieve the same thing.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Software Vendor Mistakes With SQL Server: Using Functions In Join Or Where Clauses

Easy Thinking


To start with, let’s classify functions into two varietals:

  • Ones built-in to SQL Server
  • Ones that developers write

Of those, built-in functions are pretty much harmless when they’re in the select list. Classifying things a bit further for the ones user writes, we have:

Out of the four up there, only the last one doesn’t have a reference link. Why? Because I don’t write C# — I’m not that smart — if you need someone smart about that, go read my friend Josh’s blog. He’s quite capable.

If you’re too lazy to go read the three reference links:

  • Scalar User Defined Functions generally wreck performance
  • Multi-Statement Table Valued Functions stand a high chance of generally wrecking performance
  • Inline Table Valued Functions are okay as long as you don’t do anything awful in them

Smart Thinking


The important thing to understand is that using any of these functions, let’s call it below the belt, can really mess things up for query performance in new and profound ways compared to what they can do in just the select list.

To be more specific for you, dear developer reader, let’s frame below the belt as anything underneath the from clause. Things here get particularly troublesome, because much of the activity here is considered relational, whereas stuff up above is mostly just informational.

Why is the relational stuff a much bigger deal than the informational stuff? Because that’s where all the math happens in a query plan, and SQL Server’s optimizer decides on all sorts of things at compile-time, like:

  • Which indexes to use
  • Join order
  • Join types
  • Memory grants
  • Parallelism
  • Seeks and Scans
  • Aggregate types
  • Much, much more!

Those things are partially based on how well it’s able to estimate the number of rows that join and where conditions will produce.

Sticking functions in the way of those join and where conditions is a bit like putting a blindfold on SQL Server’s optimization and cardinality estimation process and asking it to swing a bowling ball sharp saber at a piece of confetti at 100 paces.

In other words, don’t complain when your query plans suck and your queries run slow. You’re doing the hobbling, you dirty bird.

Future Thinking


If you want your customers, users, or whatever you want to call them, to be reliably happy in the future, even as their database sizes grow beyond your wildest imagination, and your application gets used in ways that would make Caligula blush, you need to start by obeying the first law of database physics: thou shalt not get in the way of the optimizer.

Going back to a couple other laws of  database physics that cannot be ignored:

  • Anything that makes your job easier makes the optimizer’s job harder
  • Store data the way you query it, and query data the way you store it

If it makes you feel better, stick a few thous and shalls or shalt nots or whences or whenceforths in there. It might make you happier.

It will make your customers, users, or whatever you want to call them happier, if you listen to me.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Software Vendor Mistakes With SQL Server: String Splitting With Functions

Advent


Since SQL Server 2016, at least for databases in a similarly current compatibility level, the STRING_SPLIT function has been available. There were some problems with it initially that have recently been solved, too, like a lack of an “ordinal position” indicator, to show you when items occurred in a string.

But there’s still a bigger problem, since it’s sort of a Jack-Of-All-Splitters, you might not be getting the datatype back that you want. From the docs:

If the ordinal output column is not enabled, STRING_SPLIT returns a single-column table whose rows are the substrings. The name of the column is value. It returns nvarchar if any of the input arguments are either nvarchar or nchar. Otherwise, it returns varchar. The length of the return type is the same as the length of the string argument.

If you’re comparing the output to something that isn’t stringy, you might hit some weird implicit conversion issues.

Another significant problem is that SQL Server’s optimizer can’t see inside the string you pass in to generate any meaningful statistics information about the values in there.

This is generally true of many of the unpleasant T-SQL User Defined Functions I’ve seen written to split string prior to this, too.

Playtime


The big blaring headline here should be that if you’re going to do split any string and attempt to filter or join to it, you should put the parsed results into a #temp table first, with whatever indexing is appropriate. If you don’t do that, you’re gonna see all the screwy stuff that I mention below.

First, you only get a 50 row estimate no matter how many separate values are in the string. That might be fine for you, but it might not be fine for everyone.

Let’s start by throwing a real knuckleball to the function.

DECLARE 
    @x nvarchar(MAX) = N'';

SELECT 
    @x += 
        (
               SELECT DISTINCT
                   [text()] = 
                       ',' + RTRIM(P.OwnerUserId)
               FROM dbo.Posts AS P
               FOR XML 
                   PATH(''),
                   TYPE
        ).value
            (
                './text()[1]', 
                'nvarchar(max)'
            );

SELECT
    p.OwnerUserId,
    TotalScore = 
        SUM(p.Score * 1.)
FROM dbo.Posts AS P
WHERE P.OwnerUserId IN
(
    SELECT
        SS.value
    FROM STRING_SPLIT(@x, ',') AS SS
)
GROUP BY P.OwnerUserId
OPTION(RECOMPILE);

The code above takes every unique value in the OwnerUseId column in the Posts table and adds them to a comma separated list. Why didn’t I use the STRING_AGG function? I forgot. And after so many years, I have the XML equivalent memorized.

Even with a recompile hint on board, this is the query plan we get:

SQL Server Query Plan

 

Spending 8 seconds inside the splitter function feels pretty bad on its own, but then the entire query plan turns out crappy because of the 50 row estimate, and takes 11 minutes to finish in total. I spent 11 minutes of my life waiting for that to finish. Just for you.

Think about that.

Regularly


Under more normal circumstances, the lack of introspection (of the statistical variety) that you get from directly querying the splitter function can lead to a bad time.

I’m intentionally including a value with a fair amount of skew in the list-to-be-parsed, that leads to what might be a bad plan choice, based on that lack of introspection:

SELECT
    p.OwnerUserId,
    TotalScore = 
        SUM(p.Score * 1.)
FROM dbo.Posts AS P
WHERE P.OwnerUserId IN
(
    SELECT
        SS.value
    FROM STRING_SPLIT('1138,22656', ',') AS SS
)
GROUP BY P.OwnerUserId;
SQL Server Query Plan
zee problemmen!

We get a 50 row guess for two values, but one of those values has ~28k rows associated with it in the Posts table. Since there’s no stats on the internal elements of the string, we don’t get a good estimate for it at all

Big Awkward


If you, or one of your users, throws a malformed string into the mix, you might get some unexpected results.

On top of the values that we just passed in, I’m also going to pass in an empty string at the end after the final comma:

SELECT
    p.OwnerUserId,
    TotalScore = 
        SUM(p.Score * 1.)
FROM dbo.Posts AS P
WHERE P.OwnerUserId IN
(
    SELECT
        SS.value
    FROM STRING_SPLIT('1138,22656, ', ',') AS SS
)
GROUP BY P.OwnerUserId;

This time around, the results are a little different from above. The space at the end is implicitly converted from an empty string to the number 0, and things look like this:

SQL Server Query Plan
played

Most all of the bad guesses are retained as the previous plan, but since the number 0 accounts for a whole mess of rows too, things get a whole lot worse.

We go from 4.5 seconds to 30 seconds, and blah blah blah everything is horrible.

If you want to get around this, sure, you can add a filter to remove empty strings from the results of the splitter, but if you’re reading this, chances are you might not have the most defensive T-SQL coding practices in place. to begin with.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.