Why MAX Data Types Are Usually A Bad Choice For SQL Server Columns

Easy Rider


When you’re trying to figure out how to store string data, it often seems easiest to just choose an extra long — even MAX — data type to avoid future truncation errors.

Even if you’re storing strings with a known, absolute length, developers may choose to not enforce that in the application, either via a drop down menu or other form of validation.

And so to avoid errors when users try to put their oh-so-important data in their oh-so-expensive database, we get columns added to tables that can fit a galaxy of data in them, when we only need to store an ashtray worth of data.

While getting data into those columns is relatively easy — most application inserts are single rows — getting data out of those columns can be quite painful, whether it’s searching or just presenting in the select portion of a query.

Let’s look at a couple simple examples of how that happens.

Search Engine


Let’s take a query like this one:

SELECT TOP (20)
    p.Id,
    p.Title,
    p.Body
FROM dbo.Posts AS p
WHERE p.Body LIKE N'SQL Server%';

The Body column in the Posts table is nvarchar(max), but the same thing would happen with a varchar(max) column.

If you need a simple way to remember how to pronounce those data types, just remember to Pahk yah (n)vahcah in Hahvahd Yahd.

Moving on – while much has been written about leading wildcard searches (ones that start with a % sign), we don't do that here. Also, in general, using CHARINDEX or PATINDEX instead of a leading-wildcard LIKE search won't buy you much (if anything at all).

Anyway, since you can’t put a MAX datatype in the key of an index, part of the problem with them is that there’s no way to efficiently organize the data for searching. Included columns don’t do that, and so we end up with a query plan that looks some-such like this:

SQL Server Query Plan
ouch in effect

We spend ~13.5 seconds scanning the clustered index on the Posts table, then about two minutes and twenty seven seconds (minus the original 13.5) applying the predicate looking for posts that start with SQL Server.

That’s a pretty long time to track down and return 19 rows.
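If you're stuck with the MAX column but still need the prefix search to go faster, one workaround worth sketching (this isn't from the demo above, and the BodyPrefix column and index names are made up) is to index a short computed-column prefix of the data, and keep the original predicate around to mop things up:

ALTER TABLE dbo.Posts
    ADD BodyPrefix AS CONVERT(nvarchar(100), LEFT(Body, 100));

CREATE INDEX
    p_body_prefix
ON dbo.Posts
    (BodyPrefix);

SELECT TOP (20)
    p.Id,
    p.Title,
    p.Body
FROM dbo.Posts AS p
WHERE p.BodyPrefix LIKE N'SQL Server%' /* seek-friendly prefix column */
AND   p.Body LIKE N'SQL Server%';      /* original predicate, kept for correctness */

Since the search string is only a handful of characters, the 100-character prefix is plenty to decide the match, and the narrow computed column gives the optimizer something it can actually seek on.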

Let’s change the query a little bit and look at how else big string columns can cause problems.

Memory Bank


Rather than search on the Body column, let’s select some values from it ordered by the Score column.

Since Score isn’t indexed, it’s not sorted in the database. That means SQL Server needs to ask for memory to put the data we’re selecting in the order we’re asking for.

SELECT TOP (200)
    p.Body
FROM dbo.Posts AS p
ORDER BY p.Score DESC;

The plan for this query asks for a 5GB memory grant:

SQL Server Query Plan
quietly

I know what you’re thinking: the Body column probably has some pretty big data in it, and you’re right. In this case, it’s the right data type to use.

The bad news is that SQL Server will make the same memory grant estimate based on the declared size of the data we need to sort, whether or not the MAX data type was a good choice.

I talk more about that in this Q&A on Stack Exchange.
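One common pattern to shrink that grant, sketched here (my addition, not part of the post, and the optimizer is free to undo it), is to sort only the narrow columns first and fetch the Body column for just the 200 survivors:

SELECT
    p.Body
FROM
(
    SELECT TOP (200)
        p.Id,
        p.Score
    FROM dbo.Posts AS p
    ORDER BY p.Score DESC
) AS ids
JOIN dbo.Posts AS p
    ON p.Id = ids.Id
ORDER BY ids.Score DESC; /* the Sort only has to handle Id and Score, not Body */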

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.

Stop Making SQL Server Weird, Microsoft.

Ledgers And Tridents And Toads, Fido-hmy


With every release of SQL Server, Microsoft adds in a bunch of tadpole features. It’s up to those tadpoles to survive long enough to turn into beautiful uh… Frogs Toads.

The way they do that is by having enough paying customers use them — usually prodded by #MVPBuzz chasers — to get continued support and development.

All of this is subject to internal whims (like Microsoft rebranding, renaming, or rebasing a feature, take Mirroring >> Availability Groups), and external customer fancies (Tableau vs PowerBI, etc).

I feel bad for folks who latch onto these things with any sort of seriousness, like Big Data Clusters, or any of the dozens of other features that died before they could bump their butts on a log.

One marvels at the sheer number of “you better get on board” style session titles that have been molded around features where absolutely no one got on board.

Even Kubernetes (which I know isn’t a SQL Server feature) is sort of like K-Pop: Apparently millions of people like it, but I’ve never met a single person who actually uses it.

Maybe it’s more like a bot net. K-Bot? Not sure, here. I’m bad at creative writing.

They Don’t Like You Anyway


Microsoft has this weird perceived popularity issue.  Perhaps it’s the diminishing self-esteem that comes with age, with SQL Server turning 30-something and all. Microsoft keeps trying to make SQL Server appeal to swaths of people who will never like it, because it’s not the shiny new thing on the block.

They won’t love you like I do.

Adding in every passing fad to the product to try to stay young and hip does a great disservice to the product as a whole, which has many addressable flaws. Microsoft dedicates countless development cycles to goofy memes that will wither on the vine shortly after the first version.

Imagine if cars still came with cassette players, but not ones that can do anything cool like play both sides without flipping the tape over, or being able to fast forward or rewind to the start or end of a track.

You’d basically have Spatial Data And The Graph Nodes in your car trying to play Garage Days Re-Re-Revisited.

Duets


I do wish the Microsoft bean counters would dance with who they brung a bit more. You know, people who need a relational database with as few bugs and preventable performance issues as possible, with useful development features, and a coherent toolset?

Clearing the minefield of pitfalls that developers fall into (hello, MERGE!) would allow less effort to be spent on chasing down oddball problems and more time spent developing quality features.

It’s sort of like if I had a keyboard that fixed all my typos immediately, I’d have more time to write quality blog posts that don’t need several editorial passes.

Maybe someday.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.

SQL Server 2022 CTP 2.1 Improvements To Parameter Sensitive Plan Optimization

Hop On Your Good Foot


Several weeks back, I blogged about a missed opportunity with the new parameter sensitive plan feature. At the time, I thought that there was indeed sufficient skewness available to trigger the additional plan variants, and apparently some nice folks at Microsoft agreed.

If we step back through the old demo, we’ll get different results.

EXEC dbo.VoteSniffing 
    @VoteTypeId = 4;

EXEC dbo.VoteSniffing 
    @VoteTypeId = 2;

I’m not gonna go into a lot of depth here, because I’m writing this while traveling, but we get the (mostly) desired outcome of two different plans being generated. Each plan is more suitable to the amount of data that the query has to process.

SQL Server Query Plan
home improvement

I’d consider this a success, even if it adds more queries to the unidentifiable corpse-pile.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.

SQL Server 2022 Introduces IS [NOT] DISTINCT FROM; World Underwhelmed

Back In Time


Feeling quite old while remembering the first time I ever read this, and this note at the end:

This is the SQL language IS DISTINCT FROM feature —implemented in the query processor, but not yet available in the T-SQL language.

Groovy. Since at least 2011, this has been in the Query Processor, and here in 2022 we finally get the linguistic support.

The thing is, it’s pretty underwhelming, and I’m going to show you why I think so.

First, in the docs for SQL Server, all the examples use a single literal value, like so:

SELECT * FROM #SampleTempTable WHERE id IS DISTINCT FROM 17;

I went looking for other docs examples from vendors who have had the syntax around for 10+ years, and there wasn’t anything all that much more interesting.

Mostly case expressions and whatnot.

Big deal.
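For a bit of context, here's a sketch (mine, not from the docs) of what folks have been writing by hand all along to get the same null-safe behavior as that example:

/* id IS DISTINCT FROM 17: "different", treating NULL as a known value */
SELECT * FROM #SampleTempTable WHERE id <> 17 OR id IS NULL;

/* id IS NOT DISTINCT FROM 17: "equal", treating NULL as a known value */
SELECT * FROM #SampleTempTable WHERE EXISTS (SELECT id INTERSECT SELECT 17);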

Alignment


First, if I try to run either of these queries, I’ll get an error after about 6 seconds.

SELECT
    c = COUNT_BIG(*)
FROM dbo.Comments AS c
WHERE c.UserId IS DISTINCT FROM 
               (
                   SELECT
                       v.UserId
                   FROM dbo.Votes AS v
               );

SELECT
    c = COUNT_BIG(*)
FROM dbo.Comments AS c
WHERE c.UserId IS NOT DISTINCT FROM 
               (
                   SELECT
                       v.UserId
                   FROM dbo.Votes AS v
               );

Why does it take 6 seconds to get an error? Because a few parts of the query plan have to do some work, and then finally:

Msg 512, Level 16, State 1, Line 1
Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.

Sort of like how sometimes you try to SUM a column and after a while you get an error about arithmetic overflow.

This is a bit annoying, because that means we need a way to return a single value to evaluate.

So Yeah…


We can’t even rewrite the queries like this to get around the error, but I do want to show you the plans.

This is why we have to wait several seconds to get an error (unless you change it to IS DISTINCT FROM ALL/ANY):

SELECT
    c = COUNT_BIG(*)
FROM dbo.Comments AS c
WHERE c.UserId IS DISTINCT FROM 
               (
                   SELECT
                       v.UserId
                   FROM dbo.Votes AS v
                   WHERE v.UserId = c.UserId
               );

SELECT
    c = COUNT_BIG(*)
FROM dbo.Comments AS c
WHERE c.UserId IS NOT DISTINCT FROM 
               (
                   SELECT
                       v.UserId
                   FROM dbo.Votes AS v
                   WHERE v.UserId = c.UserId
               );

Adding a where clause inside the subquery doesn’t help.

But these query plans are total motherchucking disasters, anyway. We’ll get into indexing later, but right now they both have the same shape and operators, though slightly different semantics to deal with is/is not distinct.

SQL Server Query Plan
is not cool

Both plans run single threaded and use Nested Loops as the physical join type, which stinks because we're putting together two pretty big tables.

Not to mention that Eager Index Spool. What a filth.

Adding Indexes


We need these indexes to make things go any faster. Before we do anything else, let’s create these so we’re not just sitting around thumb-twiddling.

CREATE INDEX
    c
ON dbo.Comments
    (UserId)
WITH
(
    SORT_IN_TEMPDB = ON,
    DATA_COMPRESSION = PAGE
);

CREATE INDEX
    v
ON dbo.Votes
    (UserId)
WITH
(
    SORT_IN_TEMPDB = ON,
    DATA_COMPRESSION = PAGE
);

Thinking About It


Okay, so writing the query like we did up there isn’t going to get us anything. Perhaps my expectations are a bit too exotic.

Let’s try something a bit more austere:

SELECT
    c = COUNT_BIG(*)
FROM dbo.Comments AS c
JOIN dbo.Votes AS v
    ON c.UserId IS DISTINCT FROM v.UserId;

SELECT
    c = COUNT_BIG(*)
FROM dbo.Comments AS c
JOIN dbo.Votes AS v
    ON c.UserId IS NOT DISTINCT FROM v.UserId;

The first thing to be aware of here is that the IS DISTINCT FROM is an inequality predicate, so you’re stuck with Nested Loops as the physical join type:

SQL Server Query Plan
nightmare

I ran out of care-juice waiting for this to finish, so all you’re getting is an estimated plan. The lack of an equality predicate here means you don’t have Hash or Merge join as an option.

Following up on bad ideas, the IS NOT DISTINCT FROM is an equality predicate, but the plan chosen is a serial Merge Join variety, which drags on 14 seconds too long:

SQL Server Query Plan
change up

Drop An Index


If we simulate not having any useful indexes on one table or the other by hinting the clustered index, the performance outlook does not improve.

SELECT
    c = COUNT_BIG(*)
FROM dbo.Comments AS c WITH(INDEX = 1)
JOIN dbo.Votes AS v
    ON c.UserId IS NOT DISTINCT FROM v.UserId;

SELECT
    c = COUNT_BIG(*)
FROM dbo.Comments AS c
JOIN dbo.Votes AS v WITH (INDEX = 1)
    ON c.UserId IS NOT DISTINCT FROM v.UserId;

No useful parts of the first query happen in Batch Mode, but the second query is rescued by two hash aggregates happening in batch mode.

SQL Server Query Plan
ouching

An odd point to make on a blog focused on SQL Server performance tuning is that sometimes not having a useful index gets you a better plan.

Anyway, I’m going back to my vacation.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.

SQL Server 2022: Introduces the DATETRUNC Function, So You Can Truncate Your Dates And Stuff

Underscoring The Importance


When I first sat down to write about this, I made a funny mistake: I kept writing DATE_TRUNC over and over again.

In SQL Server it’s DATETRUNC.

Why did I keep doing that? Because DATE_TRUNC is the way it's spelled in Postgres and DB2. Oracle, of course, just calls it TRUNC.

So, while it’s nice to have (what appears to be) the same behavior, it doesn’t exactly help to not have a 1:1 calling equivalent to other platforms.

I assume most of these additions to T-SQL are for cross-platform development and migrations.

Of course, Microsoft being so gosh darn late to this game means folks have probably been rolling-their-own versions of these functions for years.

If they went and called their system function DATE_TRUNC or even TRUNC, they might have some object naming issues to contend with.

Well, okay. But how does it work?

Childish Games


Here are some quick examples of how you call it.

SELECT TOP (10)
    u.DisplayName,
    year = 
        DATETRUNC(YEAR, u.LastAccessDate),
    quarter = 
        DATETRUNC(QUARTER, u.LastAccessDate),
    month = 
        DATETRUNC(MONTH, u.LastAccessDate),
    dayofyear = 
        DATETRUNC(DAYOFYEAR, u.LastAccessDate),
    day = 
        DATETRUNC(DAY, u.LastAccessDate),
    week = 
        DATETRUNC(WEEK, u.LastAccessDate),
    iso_week = 
        DATETRUNC(ISO_WEEK, u.LastAccessDate),
    hour = 
        DATETRUNC(HOUR, u.LastAccessDate),
    minute = 
        DATETRUNC(MINUTE, u.LastAccessDate),
    second = 
        DATETRUNC(SECOND, u.LastAccessDate),
    millisecond = 
        DATETRUNC(MILLISECOND, u.LastAccessDate),
    microsecond = 
        DATETRUNC(MICROSECOND, u.LastAccessDate) /*Doesn't work with datetime because there are no microseconds*/
FROM dbo.Users AS u;

And here are the results:

SQL Server Query Results
workin’

The thing to note here is that there’s no rounding logic involved. You just go to the start of whatever unit of time you choose. Of course, this doesn’t seem to do anything to the millisecond portion of DATETIME, because it’s not precise enough.

But for anyone out there who was hoping for a SOMONTH function to complement the EOMONTH function, you get this instead.

Works well enough!
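Here's what I mean, sketched out quickly (my example, not from the post): DATETRUNC hands you the start of the month the same way EOMONTH hands you the end of it.

SELECT
    start_of_month = DATETRUNC(MONTH, GETDATE()),
    end_of_month = EOMONTH(GETDATE());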

But does it perform, Darling?

UnSARGable?


To make any test like this worthwhile, we need an index to make data searchable.

CREATE INDEX
    v
ON dbo.Votes
    (CreationDate)
WITH
(
    SORT_IN_TEMPDB = ON,
    DATA_COMPRESSION = PAGE
);

You know, because that’s what they do. To make searching faster. Hello.

So look, under these perfect circumstances, everything performs well. But we have to do a lot of typing.

SELECT
    c = COUNT_BIG(*)
FROM dbo.Votes AS v
WHERE v.CreationDate >= DATETRUNC(YEAR, CONVERT(datetime, '20130101 00:00:00.000'));

Note here that we’re working on a literal value, not a column value, and we have to tell the datetrunc function which type we want via the convert function so that we get a simple seek plan:

SQL Server Query Plan
grin

Without all that, we get a dynamic seek plan:

SELECT
    c = COUNT_BIG(*)
FROM dbo.Votes AS v
WHERE v.CreationDate >= DATETRUNC(YEAR, '20130101 00:00:00.000');

This has some… obvious performance issues compared to the above plan with correct data types.

SQL Server Query Plan
query pan

Query Pranks


Frequent blog readers will not be surprised that wrapping a table column in the new DATETRUNC function yields old performance problems:

SELECT
    c = COUNT_BIG(*)
FROM dbo.Votes AS v
WHERE DATETRUNC(YEAR, v.CreationDate) >= CONVERT(datetime, '20130101 00:00:00.000');

This is particularly annoying because we're truncating the column to the beginning of the year, which feels like it ought to be a searchable predicate, since the index is sorted by CreationDate anyway.

SQL Server Query Plan
shined out
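The usual fix still applies, sketched here (my addition, not from the post): leave the column alone and describe the same slice of time as a range on the bare column, which can use the index we created earlier.

/* unsargable: the function has to run against every row */
SELECT
    c = COUNT_BIG(*)
FROM dbo.Votes AS v
WHERE DATETRUNC(YEAR, v.CreationDate) = CONVERT(datetime, '20130101');

/* sargable: a half-open range on the bare column gets a seek */
SELECT
    c = COUNT_BIG(*)
FROM dbo.Votes AS v
WHERE v.CreationDate >= CONVERT(datetime, '20130101')
AND   v.CreationDate < CONVERT(datetime, '20140101');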

Like most functions, these are fine in the presentation layer, but terrible in the relational layer. There’s no warning about this performance degradation in the documentation, either at the example of using the function in a where clause, or in the final closing remarks.

But that’s par for the course with any of these built-in functions.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.

The Cost Of Complexity In SQL Server

Value Added


SQL Server can do a whole lot of things aside from return the results of your queries. The thing is, the more of those you add into the mix, the tougher it becomes to scale and maintain reliable performance.

Of course, there are other design considerations that introduce the same kind of complexity:

  • Lots of tables in your schema leads to more joins
  • Lots of nested views lead to more choice for the optimizer
  • Lots of databases lead to cross-database joins, which can lead to tricky issues if they’re in different compat levels, etc.
  • Lots of functions and triggers that have entire books of business logic in them make everything go to hell

But beyond that, consider some more internal design choices:

  • Change Tracking
  • Change Data Capture
  • Auditing
  • Temporal Tables

Now you're asking SQL Server to log a bunch of stuff every time you do something in your database. You have the action itself, and then the overhead of writing that action down.

On top of that, you’re also opening yourself up to more queries because people are going to want to access and report on that data. This isn’t even touching all the additional internal monitoring that goes into SQL Server to track everything going on in there.

Observer overhead is a real thing.

Now let's add in data protection and synchronization:

  • Availability Groups
  • Replication (God forgive me)

I'm leaving Log Shipping out of the equation because it doesn't really add much complexity on its own. You may add complexity to it by needing to copy files to strange places and choosing to report off shipped databases. I'm also leaving Failover Clusters out, because there's one set of data that sits in one place. Only the SQL Server instance moves between nodes, unless you decide to span Failover Clusters and synchronize the data using your SAN or something. That stuff usually has a big enough price tag on it to dissuade most folks from tramping down that Path Of Misery©.

Both Availability Groups and Replication can make life miserable. I work with some very nice people who can’t create indexes unless they switch their Availability Groups to asynchronous mode first.

I also work with some very nice people who use Replication and don’t get much sleep at night. Me? I won’t touch the stuff.

The point, though, is that if you decide to have SQL Server start doing a whole lot of extra stuff every time you run a query, or make your queries really complicated to come up with an execution plan for, performance might suffer.

It’s up to you to make sure that you plan accordingly for adding in new features.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Even If SQL Server Table Variables Were Always In Memory, It Wouldn’t Make Them Better Than Temp Tables

Past Perfect


You know all those pesky developer myths that never go away?

  • CTEs and Views can’t use indexes: WRONG
  • Subqueries are slower than joins: WRONG
  • Temp tables are bad for performance: WRONG
  • Table Variables are always in memory: WRONG

Okay, that’s not a complete list, but a complete list would fill the internet to its very brim and cause all sorts of overflow errors.

My job as a performance tuning consultant is to teach folks when they've been led astray.

To confuse things a bit further, Microsoft allows for in-memory @table variables, but you have to do a lot of work to get set up to use them.

They’re not the ones you get out of the box.

Future Mediocre


Let's say for the sake of argument that @table variables were always more “in-memory” than #temp tables (they're not). They'd still have a lot of the same fundamental problems:

  • No column-level statistics, so the optimizer has to guess at how your data is distributed
  • Fixed, low cardinality estimates unless you recompile; even with deferred compilation, only the row count improves
  • Modifications to them can't use parallelism
  • Indexing options are limited to what you can declare inline with the table definition

With all that stuff in mind: who cares if they’re only in memory? There are so many downsides for most query patterns, they could all happen in quantum space and performance would still stink.
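If you want to see one of those problems for yourself, here's a quick sketch (my example, not from the post) using the Stack Overflow tables from earlier: both objects hold the same rows, but only the temp table gets column statistics, so the join estimates come out very different.

CREATE TABLE #t (Id int NOT NULL PRIMARY KEY);
DECLARE @t table (Id int NOT NULL PRIMARY KEY);

INSERT #t (Id) SELECT TOP (100000) u.Id FROM dbo.Users AS u;
INSERT @t (Id) SELECT TOP (100000) u.Id FROM dbo.Users AS u;

/* temp table: column statistics, reasonable estimate */
SELECT
    c = COUNT_BIG(*)
FROM #t AS t
JOIN dbo.Posts AS p
    ON p.OwnerUserId = t.Id;

/* table variable: no column statistics, guessy estimate */
SELECT
    c = COUNT_BIG(*)
FROM @t AS t
JOIN dbo.Posts AS p
    ON p.OwnerUserId = t.Id;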

I know, they have their uses: fewer recompilations, table-valued parameters, places where temp tables aren't allowed.

But most of y’all aren’t using them that wisely.

And here’s the other thing: Most of y’all don’t have enough memory to begin with, and the memory you do have you misuse in a million ways. Why do you wanna make that worse?

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.

Some Requests For Authors Of SQL Server Books

Heal Thyself


I dislike criticism without introspection, so anything that I say here I’ll also be doing my best to apply to anything I produce.

But anyway, I'm writing this because I read a lot of SQL Server books, and there are some things about them that are pretty frustrating.

This goes for both print and digital copies, though some advice is specific to print copies because I think a lot of folks still buy them.

Anyway, here are some thoughts I have. You can show this post to your editors if you want. Maybe it’s their fault, not yours.

I have no idea.

Code Formatting


While I don’t expect anyone’s coding format to be as highly-evolved to near-perfection as mine, I do prefer consistency.

Many times while reading a book, the code formatting style changes from query to query, page to page, etc.

I understand that with multiple authors, things may change between chapters (but even then, agree to something, would ya?) but it’s jarring to see queries written in several different styles throughout a book.

Most code formatting tools out there will get you at least 70% of the way to code being formatted the way you want, even if you’re particularly persnickety about things.

Code Placement


For the absolute love of Crispin Glover, stop putting code in places where it will span multiple pages.

It sucks for everyone, no matter how they’re reading the book, but all those nice little SQL Server Scrubs who bought (hopefully bought) the digital copy of your book have an awkward time copying and pasting things over to run locally.

For the Print-cesses of the world, we just don’t need that kind of page-turning, cliff-hanging suspense.

I’ve seen this more than anything, and it drives me bonkers every time. Just keep writing and say the code example is on the next page.

If your code example takes up a whole page and then some… Well, look, I don’t have an easy answer for you. You want it there for digital folks to copy, but no one with a print edition is going to write that all out from scratch.

Maybe you should just get yourself a nice GitHub gist and link to longer code examples? Use a URL shortener or GO codes to make memorable URLs?

Either way, this particular issue is in every book I’ve ever read.

Image Quality


We really need to talk about this one.

Now, I get it, you’re probably not gonna talk anyone into printing a full color SQL Server book. At least no one smart.

If you can do that, great. You’re one step closer to your book not looking like a 90’s zine about anarchy or graffiti or whatever.

The next step: use high quality images. Please, please, please, use high quality images.

No one’s going to be able to read your book and follow along without wondering if they have cataracts if they’re staring at washed-out, fuzzy images the whole way through.

Far too often, it looks like someone compressed-to-death and watered down the printer ink for every image in a book.

Tech Review


Get someone you disagree with to tech review your book. Not someone who’s wrong, just someone who has a different point of view.

This isn’t a pitch to get me to do it. Unless I love you to death, I’m probably not gonna wanna read and review your book.

But far too many books seem to suffer from “I got my friend to do this”, which leads to groupthink.

There’s no alternate viewpoint considered or explored, and you end up with one-sided highways of thought.

Doing a tech review shouldn’t be about toeing a line, it should be about making a book more technically accurate.

Concept Introduction


I'm not sure if this is a book organization or a thought organization issue, but… if you're in an early chapter of a book, say the first three, and you keep seeing things like "we cover this more in chapter 12" or 13 or 14 or some other far off place, you probably shouldn't be doing it right then and there.

When you’re going to introduce a concept, you should either be covering it in that chapter, or covering it in the next one.

I see this quite a bit, and wonder if I need to skip way ahead in the book and read about this other thing, and then come back.

That's crappy, because it makes you feel lost in the material that's supposed to be bringing you clarity on a subject.

Try to contain yourself, here. Maybe you’re saying too much too soon.

Auto-Errata


It happens to me every time I write a blog post. I re-read it as I’m going through, I re-read it when I’m done, and I tend to check in on things a day or so before it goes live.

When it goes live, I read through the post, and notice half-a-hundred things that I wanna fix, tweak, or expand on.

I can only imagine what that’s like at Bookscale© where you have 300+ pages of blog posts publishing at once.

As you make changes to the book, you should notify folks who bought it. Just randomly posting corrections is hard to keep track of.

GitHub is, again, a decent way to get and track feedback. It also gives you a way to announce large releases to the content so everyone knows when a new version is available.

Hopefully you’re making digital copies available to folks who purchase a physical copy. Otherwise, you’re gonna have a hard time getting a better product in their hands.

Reinventing The Wheel


Rather than take up a whole page with some reheated DMV query, give some love to community scripts out there that do the same thing, but better.

Anyone who knows better is going to pass right over that stuff anyway.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.

SQL Server 2022’s GENERATE_SERIES Doesn’t Suck Anymore

Life Comes At You Fast


The summer intern at Microsoft was hard at work between CTPs. Last time around, there were some serious performance issues with our dear new friend GENERATE_SERIES.

With the release of CTP 2.1, the problems that we saw the first time around are all gone. But there are still a couple small caveats that you should be aware of.

There's also been a change in the way you call the function. You no longer need the START and STOP identifiers.

You just put in whatever numbers you’re into.

That’s probably a good thing.
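A quick sketch of the new calling convention (my example): positional start and stop values, with an optional step.

SELECT
    gs.value
FROM GENERATE_SERIES
     (
         1, 
         10, 
         2
     ) AS gs; /* returns 1, 3, 5, 7, 9 */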

Caveat #1: Parallelism Restrictions


Loading data into tables that have row store indexes on them can’t go parallel.

DROP TABLE IF EXISTS
    dbo.art_aux;

CREATE TABLE 
    dbo.art_aux
(
    id int NOT NULL PRIMARY KEY CLUSTERED
);

INSERT INTO
    dbo.art_aux WITH(TABLOCK)
(
    id
)
SELECT
    gs.value
FROM GENERATE_SERIES
     (
         1, 
         10000000
     ) AS gs
OPTION
(
    MAXDOP 8, 
    QUERYTRACEON 8649
);

But performance is still pretty good here. Much better than before.

SQL Server Query Plan
cooperation

I think most folks out there would be okay waiting a few seconds for 10 million rows to show up in a table for them.

So sequential. Much unique.

But, what if you want to load those rows in faster?

Caveat #2: Use A Heap Or SELECT INTO Instead


If you create a heap, or just use SELECT INTO instead, you'll get a parallel load into the table.

For 10 million rows, there’s an improvement of about 1.7 seconds on my sort of crappy travel VM with 4 available CPUs.

DROP TABLE IF EXISTS
    dbo.art_aux;

CREATE TABLE 
    dbo.art_aux
(
    id int NOT NULL
);

INSERT INTO
    dbo.art_aux WITH(TABLOCK)
(
    id
)
SELECT
    gs.value
FROM GENERATE_SERIES
     (
         1, 
         10000000
     ) AS gs;

Here’s the plan:

SQL Server Query Plan
hot swap
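And here's the SELECT INTO flavor mentioned above, sketched out (my addition, not from the post); it gets the same parallel load:

DROP TABLE IF EXISTS
    dbo.art_aux;

SELECT
    id = gs.value
INTO dbo.art_aux
FROM GENERATE_SERIES
     (
         1, 
         10000000
     ) AS gs;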

Yes, dear reader, I even blog on vacation. That’s how much I love and care about you.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.

Why You Can’t Always Rely On Estimated Query Plans In SQL Server

Simple Enough


Most of the time, if you get the estimated plan for a query, the actual execution plan will look about the same.

You’ll get the additional information of course, but the shape and most of the other attributes will match up.

Other times, things will change between the two.

Here’s a silly example I came across recently while working on demos for some other stuff.

SELECT TOP (100)
     p.Id,
     vs.UpVotes,
     vs.DownVotes
FROM dbo.VoteStats() AS vs
JOIN dbo.Posts AS p
    ON vs.PostId = p.Id
WHERE vs.DownVotes > vs.UpMultipier
AND   p.CommunityOwnedDate IS NULL
AND   p.ClosedDate IS NULL
ORDER BY vs.UpVotes DESC;
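For reference, here's one way to grab each flavor of plan straight from T-SQL if you'd rather not use the SSMS buttons (a quick sketch; SHOWPLAN_XML has to be the only statement in its batch, hence the GO separators):

SET SHOWPLAN_XML ON;    /* estimated plan: the query is compiled but not run */
GO
/* query goes here */
GO
SET SHOWPLAN_XML OFF;
GO

SET STATISTICS XML ON;  /* actual plan: the query runs and returns results */
GO
/* query goes here */
GO
SET STATISTICS XML OFF;
GO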

Estimated Execution Plan


Here’s the estimated plan I get for this query:

SQL Server Query Plan
estimations

SQL Server estimates that a serial execution plan will work best, here.

Actual Execution Plan


Here’s the actual execution plan for the same query:

SQL Server Query Plan
actuality

At runtime, SQL Server decides that a parallel plan is the better solution.

Wild Nights


SQL Server can totally estimate parallel plans. I’ve seen it a million times.

In this case, though, it doesn’t do that. If you get the estimated plan after executing the query, it’ll come back with the parallel version.

Why does this happen? Well, as SQL Server adds features to the Intelligent Query Processing family, more and more decisions are made when a query executes.

Stuff like this can make estimated and actual plans look much different from each other:

  • Adaptive joins
  • Interleaved execution
  • Table variable deferred compilation
  • Batch mode on row store

Wonders of modern query optimizers, eh?

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.