All The Performance Problems With Select * Queries In SQL Server

Posted on December 10, 2019 (updated May 16, 2022) by Erik Darling

This was originally posted by me as an answer here. I’m re-posting it locally for posterity.

The two reasons that I find the most compelling not to use SELECT * in SQL Server are:

Memory Grants
Index usage

Memory Grants

When queries need to Sort, Hash, or go Parallel, they ask for memory for those operations. The size of the memory grant is based on the size of the data, both row and column wise.

String data especially has an impact on this, since the optimizer guesses half of the defined length as the ‘fullness’ of the column. So for a VARCHAR(100) column, it assumes 50 bytes * the number of rows.

Using Stack Overflow as an example, if I run these queries against the Users table:

SELECT TOP 1000 
       u.DisplayName 
FROM dbo.Users AS u 
ORDER BY u.Reputation;


SELECT   TOP 1000
         u.DisplayName,
         u.Location
FROM     dbo.Users AS u
ORDER BY u.Reputation;

DisplayName is NVARCHAR 40, and Location is NVARCHAR 100.

Without an index on Reputation, SQL Server needs to sort the data on its own.

But the memory grant nearly doubles.

DisplayName:

DisplayName, Location:

This gets much worse with SELECT *, asking for 8.2 GB of memory:

It does this to cope with the larger amount of data it needs to pass through the Sort operator, including the AboutMe column, which has a MAX length.
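
If you want to watch these numbers yourself while a query runs, one option is the memory grant DMV. This is a minimal sketch, not part of the original answer. It assumes you have VIEW SERVER STATE permission and that you run it from a second session while the query is still executing.

```sql
-- Sketch: memory grants for queries that are executing right now.
-- Run from a separate session; finished queries will not show up here.
SELECT   deqmg.session_id,
         deqmg.dop,
         deqmg.requested_memory_kb,
         deqmg.granted_memory_kb,
         deqmg.used_memory_kb,
         deqmg.max_used_memory_kb
FROM     sys.dm_exec_query_memory_grants AS deqmg
ORDER BY deqmg.requested_memory_kb DESC;
```

A big gap between granted_memory_kb and max_used_memory_kb is usually the sign that the grant was sized for columns the query didn’t really need.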

Index Usage

If I have this index on the Users table:

CREATE NONCLUSTERED INDEX ix_Users ON dbo.Users ( CreationDate ASC, Reputation ASC, Id ASC );

And I have this query, with a WHERE clause that matches the index, but doesn’t cover/include all the columns the query is selecting…

SELECT   u.*,
         p.Id AS PostId
FROM     dbo.Users AS u
JOIN     dbo.Posts AS p
    ON p.OwnerUserId = u.Id
WHERE    u.CreationDate > '20171001'
AND      u.Reputation > 100
AND      p.PostTypeId = 1
ORDER BY u.Id;

The optimizer may choose not to use the narrow index with a key lookup, in favor of just scanning the clustered index.

You would either have to create a very wide index, or experiment with rewrites to get the narrow index chosen, even though using the narrow index results in a much faster query.

CX:

SQL Server Execution Times: CPU time = 6374 ms, elapsed time = 4165 ms.

NC:

SQL Server Execution Times: CPU time = 1623 ms, elapsed time = 875 ms.
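
One rewrite worth trying, assuming the application only needs a handful of columns, is to name just the columns the narrow index already carries. This is a sketch under that assumption; the column list is mine, not something from the original answer.

```sql
-- Sketch: every Users column referenced here is a key column of ix_Users,
-- so the index covers the Users side and no key lookup is needed.
SELECT   u.Id,
         u.CreationDate,
         u.Reputation,
         p.Id AS PostId
FROM     dbo.Users AS u
JOIN     dbo.Posts AS p
    ON p.OwnerUserId = u.Id
WHERE    u.CreationDate > '20171001'
AND      u.Reputation > 100
AND      p.PostTypeId = 1
ORDER BY u.Id;
```

If the application really does need every column, this doesn’t help, and you’re back to choosing between a wide index and coaxing the optimizer.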

Thanks for reading!

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

In SQL Server, Does Query Parallelism Change Query Memory Grants?

Posted on December 9, 2019 (updated May 16, 2022) by Erik Darling

This was originally posted by me as an answer here. I’m re-posting it locally for posterity.

Sup?

For SQL Server queries that require additional memory, grants are derived for serial plans. If a parallel plan is explored and chosen, memory will be divided evenly among threads.

Memory grant estimates are based on:

Number of rows (cardinality)
Size of rows (data size)
Number of concurrent memory consuming operators

If a parallel plan is chosen, there is some memory overhead to process parallel exchanges (distribute, redistribute, and gather streams). However, their memory needs are still not calculated the same way.

Memory Consuming Operators

The most common operators that ask for memory are:

Sorts
Hashes (joins, aggregates)
Optimized Nested Loops

Less common operators that require memory are inserts to column store indexes. These also differ in that memory grants are currently multiplied by DOP for them.

Memory needs for Sorts are typically much higher than for hashes. Sorts will ask for at least the estimated size of the data for a memory grant, since they need to sort all result columns by the ordering element(s). Hashes need memory to build a hash table, which does not include all selected columns.

Examples

If I run this query, intentionally hinted to DOP 1, it will ask for 166 MB of memory.

SELECT *
FROM 
     (  
        SELECT TOP (1000) 
               u.Id 
        FROM dbo.Users AS u
        ORDER BY u.Reputation
     ) AS u
OPTION(MAXDOP 1);

If I run this query (again, DOP 1), the plan will change, and the memory grant will go up slightly.

SELECT *
FROM (  
        SELECT TOP (1000) 
               u.Id
        FROM dbo.Users AS u
        ORDER BY u.Reputation
     ) AS u
JOIN (
        SELECT TOP (1000) 
               u.Id
        FROM dbo.Users AS u
        ORDER BY u.Reputation
     ) AS u2
ON u.Id = u2.Id
OPTION(MAXDOP 1);

There are two Sorts, and now a Hash Join. The memory grant bumps up a little bit to accommodate the hash build, but it does not double because the Sort operators cannot run concurrently.

If I change the query to force a nested loops join, the grant will double to deal with the concurrent Sorts.

SELECT *
FROM (  
        SELECT TOP (1000) 
               u.Id
        FROM dbo.Users AS u
        ORDER BY u.Reputation
     ) AS u
INNER LOOP JOIN ( --Force the loop join
        SELECT TOP (1000) 
               u.Id
        FROM dbo.Users AS u
        ORDER BY u.Reputation
     ) AS u2
ON u.Id = u2.Id
OPTION(MAXDOP 1);

The memory grant doubles because Nested Loop is not a blocking operator, and Hash Join is.

Size Of Data Matters

This query selects string data of different combinations. Depending on which columns I select, the size of the memory grant will go up.

The way size of data is calculated for variable string data is rows * 50% of the column’s declared length. This is true for VARCHAR and NVARCHAR, though NVARCHAR columns are doubled since they store double-byte characters. This does change in some cases with the new CE, but details aren’t documented.

Size of data also matters for hash operations, but not to the same degree that it does for Sorts.

SELECT *
FROM 
     (  
        SELECT TOP (1000) 
                 u.Id          -- 166MB (INT)
               , u.DisplayName -- 300MB (NVARCHAR 40)
               , u.WebsiteUrl  -- 900MB (NVARCHAR 200)
               , u.Location    -- 1.2GB (NVARCHAR 100)
               , u.AboutMe     -- 9GB   (NVARCHAR MAX)
        FROM dbo.Users AS u
        ORDER BY u.Reputation
     ) AS u
OPTION(MAXDOP 1);
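
To put rough numbers on the 50% rule described above, you can pull the declared lengths from the catalog. This is a sketch, not the optimizer’s actual code path. It relies on sys.columns.max_length being reported in bytes, so NVARCHAR columns already show up doubled. MAX columns report -1, and since the details for those aren’t covered here, the sketch leaves them NULL rather than guessing.

```sql
-- Sketch: per-row size guess for variable-length string columns,
-- using "half the declared length" as described in the text.
SELECT   c.name       AS column_name,
         t.name       AS type_name,
         c.max_length AS max_length_bytes,   -- bytes, not characters
         CASE WHEN c.max_length = -1 THEN NULL   -- MAX types: not estimated here
              ELSE c.max_length / 2
         END          AS assumed_bytes_per_row
FROM     sys.columns AS c
JOIN     sys.types AS t
    ON t.user_type_id = c.user_type_id
WHERE    c.object_id = OBJECT_ID(N'dbo.Users')
AND      t.name IN (N'varchar', N'nvarchar')
ORDER BY c.max_length;
```

Multiply assumed_bytes_per_row by the estimated row count going into the Sort and you get a feel for why each added string column moves the grant so much.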

But What About Parallelism?

If I run this query at different DOPs, the memory grant is not multiplied by DOP.

SELECT *
FROM (  
        SELECT TOP (1000) 
               u.Id
        FROM dbo.Users AS u
        ORDER BY u.Reputation
     ) AS u
INNER HASH JOIN (
        SELECT TOP (1000) 
               u.Id
        FROM dbo.Users AS u
        ORDER BY u.Reputation
     ) AS u2
ON u.Id = u2.Id
ORDER BY u.Id, u2.Id -- Add an ORDER BY
OPTION(MAXDOP ?);

There are slight increases to deal with more parallel buffers per exchange operator, and perhaps there are internal reasons that the Sort and Hash builds require extra memory to deal with higher DOP, but it’s clearly not a multiplying factor.
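
If you want to compare the grants across runs yourself, one way is to run the query at a few MAXDOP values and then read the last grant from the plan cache. This is a sketch. It assumes your version exposes the last_dop and last_grant_kb columns in sys.dm_exec_query_stats (older builds may not), and the LIKE filter is just a hypothetical way to find the demo query.

```sql
-- Sketch: most recent DOP and memory grant for cached statements
-- that look like the hash join demo above.
SELECT   TOP (10)
         qs.last_execution_time,
         qs.last_dop,
         qs.last_grant_kb,
         qs.last_used_grant_kb,
         st.text
FROM     sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE    st.text LIKE N'%INNER HASH JOIN%'
AND      st.text NOT LIKE N'%dm_exec_query_stats%'  -- keep this query out of its own results
ORDER BY qs.last_execution_time DESC;
```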

Thanks for reading!

SQL Saturday Washington, DC: One Week To Go!

Posted on December 6, 2019 (updated May 16, 2022) by Erik Darling

HELLO GUAM!

If you’re planning on attending SQL Saturday Washington, DC, why not play hooky from work and spend the day with me learning about all the atrocities SQL Server is capable of?

I’m going to be delivering my Total Server Tuning material, which has been a hit at a whole bunch of events this past year. It’s an eye-opening full day of training where you’ll find out all my favorite ways that things can go wrong with SQL Server hardware, queries, and indexes.

And of course, how you can outsmart SQL Server.

Which is pretty hard.

Like, doctors work on it and stuff.

Attendees have a choice to either follow along with me on their laptops, or just watch in horror as familiar events unfold before their very eyes.

If you want to follow along, grab a copy of the StackOverflow2013 database. It’s about a 10GB download, which turns into a ~60GB database.

Fair warning: if you’re gonna follow along, you’re gonna have a tough time on skimpy laptop hardware. My personal laptop is 64GB of RAM and some pretty fast cores. At least they were until Intel started patching things. Most demos are on SQL Server 2017, but I’m going to be showing you stuff from SQL Server 2019 as well.

See you there!

Why Functions Are Bad in SQL Server Where Clauses

Posted on December 5, 2019 (updated May 14, 2022) by Erik Darling

Raised Right

It seems like every time I check out a server, the query plans are a nightmare. Users are freaking out, and management’s coffee is more Irish than Colombian.

Many times, the issue is that people are using presentation layer functions for relational processes: the WHERE clause, joins, GROUP BY, and ORDER BY parts of a query.

Think about built-in string and date functions, wrapped around columns, and the problems they can cause.

These are things you should actively be targeting in existing code, and fighting to keep out of new code.
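
As a minimal sketch of what that looks like (my example, using the Stack Overflow Posts table), here is a date function wrapped around a column, followed by the same question asked as a range on the bare column.

```sql
-- Function wrapped around the column: every row's CreationDate has to go
-- through YEAR() before it can be compared, so an index can't be used to seek.
SELECT COUNT(*) AS records
FROM   dbo.Posts AS p
WHERE  YEAR(p.CreationDate) = 2013;

-- Same question as a half-open range on the column itself.
-- With an index leading on CreationDate, this can seek.
SELECT COUNT(*) AS records
FROM   dbo.Posts AS p
WHERE  p.CreationDate >= '20130101'
AND    p.CreationDate <  '20140101';
```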

Nooptional

When you’re trying to get rid of them, remember your better options:

Cleaning data on input, or via triggers: Better than wrapping everything in RTRIM/LTRIM
Using computed columns: Better than relying on runtime calculations like DATEADD/DATEDIFF
Breaking queries up: Use UNION ALL to query for either outcome (think ISNULL)
Using indexed views: If you need to calculate things in columns across tables
Creating reporting tables: Sometimes it’s easier to denormalize a bit to make writing and indexing easier
Using #temp tables: If you have data that you need to persist a calculation in and the query to generate it is complicated
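
To make the UNION ALL option a little more concrete, here’s a sketch of splitting an ISNULL predicate into its two outcomes. The starting predicate is a hypothetical one I made up for illustration; the two branches are mutually exclusive, so nothing gets counted twice.

```sql
-- Hypothetical starting point:
--   WHERE ISNULL(p.LastEditDate, p.CreationDate) >= '20130101'

SELECT COUNT_BIG(*) AS records
FROM
(
    -- Outcome 1: LastEditDate has a value, so it's the one being compared.
    SELECT p.Id
    FROM   dbo.Posts AS p
    WHERE  p.LastEditDate >= '20130101'

    UNION ALL

    -- Outcome 2: LastEditDate is NULL, so ISNULL would fall back to CreationDate.
    SELECT p.Id
    FROM   dbo.Posts AS p
    WHERE  p.LastEditDate IS NULL
    AND    p.CreationDate >= '20130101'
) AS x;
```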

Note the things I’m not suggesting here:

CTEs: Don’t materialize anything
@table variables: Cause more problems than they solve
Views: Don’t materialize unless indexed
Functions: Just no, thanks

More Work

Yes, finding and fixing this stuff is more work for you. But it’s a whole lot less work for the optimizer, and your server, when you’re done.

If that’s the kind of thing you need help with, drop me a line.

Thanks for reading!

Why DATEDIFF Returns Surprising Results In SQL Server

Posted on December 4, 2019 (updated May 16, 2022) by Erik Darling

All Day

If I sent you these three queries and asked you if they’d return 0 or 1, what would you guess?

SELECT DATEDIFF(YEAR, '2019-12-31', '2020-01-01');
SELECT DATEDIFF(MONTH, '2019-12-31', '2020-01-01');
SELECT DATEDIFF(DAY, '2019-12-31', '2020-01-01');

I’ll give you a second to think about it.

One.

There.

Good Job!

If you guessed that they’d all return 1, you were right. That’s because DATEDIFF isn’t always very smart about measuring time.

All it measures is that the unit of time you’re interested in has increased or decreased.

Even though January 1st is the day after December 31st, the year is different, so it says there’s a year difference between them. Same with the month query.

For day it makes total sense here, but if you wanted to see a 24 hour difference, it might not go so well.

Anyway, it may not be measuring what you think it’s measuring.
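
A quick sketch of the boundary-counting behavior with time portions added (my values, not from the original demo):

```sql
-- Two seconds apart, but a year, day, and hour boundary all sit in between.
SELECT DATEDIFF(YEAR,   '2019-12-31T23:59:59', '2020-01-01T00:00:01') AS year_diff,   -- 1
       DATEDIFF(DAY,    '2019-12-31T23:59:59', '2020-01-01T00:00:01') AS day_diff,    -- 1
       DATEDIFF(HOUR,   '2019-12-31T23:59:59', '2020-01-01T00:00:01') AS hour_diff,   -- 1
       DATEDIFF(SECOND, '2019-12-31T23:59:59', '2020-01-01T00:00:01') AS second_diff; -- 2

-- Almost a full day apart, but no day boundary is crossed.
SELECT DATEDIFF(DAY,  '2020-01-01T00:00:00', '2020-01-01T23:59:59') AS day_diff,  -- 0
       DATEDIFF(HOUR, '2020-01-01T00:00:00', '2020-01-01T23:59:59') AS hour_diff; -- 23
```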

Precision

If you want more precise measurements, you’re gonna have to get on that datemath post I wrote recently.

For example, to replicate DATEDIFF for this query:

SELECT COUNT(*)
FROM   dbo.Posts AS p
JOIN   dbo.Comments AS c
    ON  p.Id = c.PostId
WHERE DATEDIFF(YEAR, p.CreationDate, c.CreationDate) > 1
AND   p.PostTypeId = 1
AND   c.Score > 0;

You’d have to do something like this:

SELECT COUNT(*)
FROM   dbo.Posts AS p
JOIN   dbo.Comments AS c
    ON  p.Id = c.PostId
WHERE p.CreationDate < DATEADD(YEAR, -1, DATEADD(YEAR, DATEDIFF(YEAR, 0, c.CreationDate), 0))
AND   c.CreationDate > DATEADD(YEAR,  1, DATEADD(YEAR, DATEDIFF(YEAR, 0, p.CreationDate), 0))
AND   p.PostTypeId = 1
AND   c.Score > 0;

But all that tells you is that the creation dates have different years. It doesn’t tell you if those creation dates are fully a year apart, either by measuring 12 months or 365 days (I know, leap years. Can it, Smokey.).

If you want dates that are a year apart, you need to do something like this:

SELECT COUNT(*)
FROM   dbo.Posts AS p
JOIN   dbo.Comments AS c
    ON  p.Id = c.PostId
WHERE p.CreationDate < DATEADD(YEAR, -1, c.CreationDate)
AND   c.CreationDate > DATEADD(YEAR,  1, p.CreationDate)
AND   p.PostTypeId = 1
AND   c.Score > 0;

But to illustrate how inaccurate DATEDIFF can be, let’s look at the first few lines of this query:

SELECT DATEDIFF(YEAR, p.CreationDate, c.CreationDate) AS YearDiff,
       DATEDIFF(MONTH, p.CreationDate, c.CreationDate) AS MonthDiff,
       DATEDIFF(DAY, p.CreationDate, c.CreationDate) AS DayDiff
FROM   dbo.Posts AS p
JOIN   dbo.Comments AS c
    ON  p.Id = c.PostId
WHERE DATEDIFF(YEAR, p.CreationDate, c.CreationDate) = 1
AND   p.PostTypeId = 1
AND   c.Score > 0
ORDER BY YearDiff, MonthDiff, DayDiff;

The beginning of the results looks okay. But towards the end of the dates with “one year” difference, things look uh…

Admitting Is The First Step

If you need precise date measurements, you can’t always rely on DATEDIFF.

Especially for larger gaps, you can get some rather odd results depending on how you’re defining what qualifies for your requirements.
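
If what you actually want is completed years, one common pattern is to take the DATEDIFF result and knock one off when the anniversary hasn’t happened yet. This is a sketch with made-up variable names, not something from the post above, and you’d want to decide for yourself how February 29th should behave.

```sql
DECLARE @start_date datetime = '20191231',
        @end_date   datetime = '20200101';

SELECT DATEDIFF(YEAR, @start_date, @end_date) AS boundary_years,  -- 1
       DATEDIFF(YEAR, @start_date, @end_date)
       - CASE WHEN DATEADD(YEAR, DATEDIFF(YEAR, @start_date, @end_date), @start_date) > @end_date
              THEN 1   -- the anniversary is still in the future
              ELSE 0
         END AS full_years;                                        -- 0
```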

Thanks for reading!

SQL Server Index Spools When You Have An Index

Posted on December 3, 2019 (updated May 14, 2022) by Erik Darling

PAUL WHITE IS WRONG

Ha ha. Just kidding. He’s never wrong about anything.

But he did write about Eager Index Spools recently, and the post ended with the following statement:

Eager index spools are often a sign that a useful permanent index is missing from the database schema.

I’d like to show you a case where you may see an Eager Index Spool even when you have the index being spooled.

Funboy & Funboy & Funboy

Let’s say we’ve got a query that, for better or worse, was written like so:

SELECT SUM(records)
FROM dbo.Posts AS p 
CROSS APPLY 
(
    SELECT COUNT(p2.Id)
    FROM   dbo.Posts AS p2
    WHERE  p2.LastEditDate >= '20121231'
    AND    p.Id = p2.Id

    UNION ALL

    SELECT COUNT(p2.Id)
    FROM   dbo.Posts AS p2
    WHERE  p2.LastEditDate IS NULL
    AND    p.Id = p2.Id
) x (records);

Right now, we’ve got this index:

CREATE INDEX to_null_or_not_to_null ON dbo.Posts(LastEditDate);

Which means we’ve effectively got an index on (LastEditDate, Id), because of how clustered index key columns are inherited by nonclustered indexes.

The APPLY section of the query plan looks like so:

Each spool runs for nearly 53 seconds. The entire plan runs for 1:52.

There have been times when I’ve seen index spools created to effectively re-order existing indexes.

Perhaps that’s the case here? Let’s add this index.

CREATE /*UNIQUE*/ INDEX that_is_a_question ON dbo.Posts(Id, LastEditDate);

I’ve got UNIQUE in there in case you’re playing along at home. It makes no difference to the outcome.

I’d expect you to ask about that. I have high expectations of you, dear reader. I love you.

PLEASE DON’T LEAVE ME.

Get Out

The new execution plan looks uh.

That’s frustrating, isn’t it? Why would you do that?

When I asked Paul why the optimizer was wrong (I understand that many of you confuse Paul with the optimizer. To wit, they’ve never been seen together.), he said something along the lines of:

The issue is that you have a unique clustered index that prevents the index matching logic from finding the better nonclustered index.

Well okay yeah lemme just go drop that clustered index or something.

Workarounds

There are several workarounds, like using FORCESEEK inside the APPLY logic.
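
A sketch of what the FORCESEEK version might look like, with the hint added to both branches of the APPLY. I’m not showing a plan for it here, so treat it as something to test: whether the spool goes away, and which index gets the seek, is up to the optimizer on your build.

```sql
SELECT SUM(records)
FROM dbo.Posts AS p
CROSS APPLY
(
    SELECT COUNT(p2.Id)
    FROM   dbo.Posts AS p2 WITH (FORCESEEK) -- ask for a seek instead of a spool over a scan
    WHERE  p2.LastEditDate >= '20121231'
    AND    p.Id = p2.Id

    UNION ALL

    SELECT COUNT(p2.Id)
    FROM   dbo.Posts AS p2 WITH (FORCESEEK)
    WHERE  p2.LastEditDate IS NULL
    AND    p.Id = p2.Id
) AS x (records);
```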

Of course, the better method is just to write the query so there’s no need for the optimizer to join a table to itself a couple times.

SELECT SUM(x.records)
FROM (    
    SELECT COUNT(p.records)
    FROM 
    (
        SELECT 1 AS records
        FROM   dbo.Posts AS p2
        WHERE  p2.LastEditDate >= '20121231'
    ) AS p
    
    UNION ALL
    
    SELECT COUNT(p.records)
    FROM 
    (
        SELECT 1 AS records
        FROM   dbo.Posts AS p2
        WHERE  p2.LastEditDate IS NULL
    ) AS p
) AS x (records);

Which’ll finish in about 1.5 seconds.

But hey, nifty demo.

Thanks for reading!

Announcing My Precon For SQL Saturday Chicago 2020

Posted on December 2, 2019 (updated May 14, 2022) by Erik Darling

Deep Dish SQL

If you’re in Chicago and heading to SQL Saturday 2020, I’d love to see you at my full day precon on Friday, March 20.

I’ll be delivering a ton of awesome material that’ll help make your performance tuning life a whole lot easier. This session is totally focused on how to tune SQL Server performance, and teaching you how to do pain-free analysis of queries, indexes, and execution plans.

All of the material comes from the most common and most complex performance issues I’ve seen during my years as a consultant and DBA. We’ll be pulling on the rubber gloves and digging deep into places that’ll make the optimizer uncomfortable.

For a limited time, use the coupon code “newyorkpizza” for $50 off!

Thanks for reading, and see you there!

SQL Saturday Washington, DC: Two Weeks To Go!

Posted on November 29, 2019 (updated May 16, 2022) by Erik Darling

HELLO GUAM!

If you’re planning on attending SQL Saturday Washington, DC, why not play hooky from work and spend the day with me learning about all the atrocities SQL Server is capable of?

I’m going to be delivering my Total Server Tuning material, which has been a hit at a whole bunch of events this past year.

Attendees have a choice to either follow along with me on their laptops, or just watch in horror as familiar events unfold before their very eyes.

If you want to follow along, grab a copy of the StackOverflow2013 database. It’s about a 10GB download, which turns into a ~60GB database.

For a limited time, use the coupon code “votesql” for $50 off.

The Right Way To Compare Date Columns In Two Different SQL Server Tables

Posted on November 28, 2019 (updated May 16, 2022) by Erik Darling

A Certain Ratio

I have a lot to say about this demo on SQL Server 2019, which I’ll get to in a future post.

For now, I’m going to concentrate on ways to make this situation suck a little less for you wherever you are.

Let’s talk.

Pinky Blue

Let’s get a couple indexes going:

CREATE INDEX something ON dbo.Posts(PostTypeId, Id, CreationDate);
CREATE INDEX stuffy ON dbo.Comments(Score, PostId, CreationDate);

And look at a maybe kinda sorta stupid query.

SELECT COUNT(*)
FROM   dbo.Posts AS p
JOIN   dbo.Comments AS c
    ON  p.Id = c.PostId
WHERE DATEDIFF(YEAR, p.CreationDate, c.CreationDate) > 1
AND p.PostTypeId = 1
AND c.Score > 0;

We wanna find questions where a comment was left a year after they were posted, and the comment was upvoted.

BIG DATA!

What We Know

From yesterday’s post, we know that even if we put our date columns first in the index, we wouldn’t have anything to seek to.

Unlike yesterday’s post, these columns are in two different tables. We can’t make a good computed column to calculate that.

The indexes that I created help us focus on the SARGable parts of the where clause and the join conditions.

That query takes about 2 seconds.

Switch Hitter

You might be tempted to try something like this, but it won’t turn out much better unless you change your indexes.

SELECT DATEDIFF(YEAR, p.CreationDate, c.CreationDate) AS Diffo
INTO #t
FROM   dbo.Posts AS p
JOIN   dbo.Comments AS c
    ON  p.Id = c.PostId
WHERE p.CreationDate < DATEADD(YEAR, -1, c.CreationDate)
AND   c.CreationDate > DATEADD(YEAR,  1, p.CreationDate)
AND p.PostTypeId = 1
AND c.Score > 0;

SELECT COUNT(*)
FROM #t AS t
WHERE t.Diffo > 1;

DROP TABLE #t;

Moving CreationDate to the second column helps the first query quite a bit.

CREATE INDEX something_alt ON dbo.Posts(PostTypeId, CreationDate, Id);
CREATE INDEX stuffy_alt ON dbo.Comments(Score, CreationDate, PostId);

Opternatives

You could try an indexed view here, too.

CREATE VIEW dbo.Sneaky
WITH SCHEMABINDING
AS
SELECT COUNT_BIG(*) AS records
FROM   dbo.Posts AS p
JOIN   dbo.Comments AS c
    ON  p.Id = c.PostId
WHERE DATEDIFF(YEAR, p.CreationDate, c.CreationDate) > 1
AND p.PostTypeId = 1
AND c.Score > 0;
GO

CREATE UNIQUE CLUSTERED INDEX whatever ON dbo.Sneaky(records);

But, you know. That might be overkill.

Depending.
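
If you do go the indexed view route, a sketch of reading from it directly looks like this. Depending on your edition, the optimizer may expand the view back into its base tables unless you ask it not to, which is what the NOEXPAND hint is for.

```sql
-- Sketch: read the pre-aggregated count straight from the view's clustered index.
SELECT s.records
FROM   dbo.Sneaky AS s WITH (NOEXPAND);
```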

Thanks for reading!

The Right Way To Compare Date Columns In SQL Server Queries

Posted on November 27, 2019 (updated May 16, 2022) by Erik Darling

Tell It To The Judge

A common dilemma is when you have two date columns, and you need to judge the gap between them for something.

For instance, say you have a table of orders and you want to figure out how long on average it takes an ordered item to ship, or a shipped item to be delivered.

You’re not left with many good ways to write the query to take advantage of indexes.

Let’s have a look-see.

Iron Mask

We’re gonna skip the “aw shucks, this stinks without an index” part.

CREATE INDEX gloom ON dbo.Posts(CreationDate, LastActivityDate);

SELECT COUNT(*) AS records
FROM dbo.Posts AS p
WHERE DATEDIFF(YEAR, p.CreationDate, p.LastActivityDate) > 9;

Because it still stinks with an index. Check it out.

Though we have a predicate, and an index on both columns, we don’t have anything to seek to.

Why? Because our predicate isn’t on anything that the index is keeping track of.

Indexes don’t care how many years, months, days, hours, minutes, seconds, milliseconds, or microseconds of difference there are between these two columns.

That’d be a really cool kind of index to have for sure, but insert a shrug that fills your screen the way dark matter fills the universe here.

All we got is workarounds.

Another For Instance

We can use a computed column:

ALTER TABLE dbo.Posts 
    ADD despair AS DATEDIFF(YEAR, CreationDate, LastActivityDate);

CREATE INDEX sadness ON dbo.Posts(despair);

The result is something we can seek to.

Which is probably the type of plan that you’d prefer.
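
For reference, a sketch of a query that targets the computed column by name. The optimizer can sometimes match the original DATEDIFF expression to the computed column on its own, but referencing the column directly takes the guesswork out of it.

```sql
-- Sketch: filter on the computed column so the sadness index can be used for a seek.
SELECT COUNT(*) AS records
FROM   dbo.Posts AS p
WHERE  p.despair > 9;
```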

Thanks for reading!