Using Column Store Indexes To Improve Unpredictable User Search Queries

And Cough



Where Bitmaps Dare In SQL Server Query Plans

I AIN’T NO



Computed Column Follies In SQL Server

Darnit


While helping a client out with a performance problem recently, I ran into something kind of funny when creating a computed column.

They were experiencing performance problems because of a join involving a substring.

Weird, right? Like, if I tried to show you this in a presentation, you’d chase me out of the room.

But since they were nice enough to hire me, I went about fixing the problem.

Computer Magic


The “obvious” fix — and I apologize if this isn’t obvious to you, dear reader — was to add a computed column to work around the issue.

Adding a computed column captures the expression that you’re otherwise generating on the fly and trying to join on. That matters because manipulating column data while you’re joining or filtering on it is generally a bad idea, even if sometimes you can get away with it.

But here’s something that messed me up, a uh… seasoned database professional.

The query was doing something like this (not exactly, but it’s good enough to get us moving):

SELECT COUNT(*) AS records
FROM dbo.Users AS u
WHERE SUBSTRING(u.DisplayName, 1, LEN(u.DisplayName) - 4)
 = SUBSTRING(u.DisplayName, 1, LEN(u.DisplayName) - 4);

Matching strings from the beginning to the end minus four characters.

I wanted to look smart, so I did this:

ALTER TABLE dbo.Users
    ADD DisplayNameComputed
        AS SUBSTRING(DisplayName, 1, LEN(DisplayName) - 4);

I didn’t want to persist it right away, both because that can lock the table and take longer, and because I knew I was going to index it anyway.

The problem is that when I tried to index it:

CREATE INDEX dummy 
    ON dbo.Users(DisplayNameComputed);

I got this error:

Msg 537, Level 16, State 3, Line 21
Invalid length parameter passed to the LEFT or SUBSTRING function.

And when I tried to select data from the table, the same error.

In the real query, there was a predicate that avoided rows where the column had too few characters, but there was no way to apply that filter to the index.

There are also other restrictions on filtered index WHERE clauses: for example, you can’t use LIKE ‘____%’, or LEN(col) > 4, etc.
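
To make that concrete, here’s a sketch (not the client’s actual code) of the kind of filtered index definitions that get rejected, since filtered index predicates only allow simple column-to-constant comparisons:

/* Both of these are refused: no functions, and no LIKE, in a filtered index WHERE clause. */
CREATE INDEX dummy_len
    ON dbo.Users (DisplayName)
    WHERE LEN(DisplayName) > 4;

CREATE INDEX dummy_like
    ON dbo.Users (DisplayName)
    WHERE DisplayName LIKE '____%';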

Case Of Mace


Having done a lot of string splitting in my life, I should have been more defensive in my initial computed column definition.

What I ended up using was this:

ALTER TABLE dbo.Users
    ADD DisplayNameComputed
        AS SUBSTRING(DisplayName, 1, LEN(DisplayName)
            - CASE WHEN LEN(DisplayName) < 4 THEN LEN(DisplayName) ELSE 4 END);

A bit more verbose, but it allowed me to create my computed column, select from the table, and create my index.

AND THEY ALL LIVED HAPPILY EVER AFTER

Just kidding, there was still a lot of work to do.

Thanks for reading!


Implicit Conversion Is A SARGability Problem, That’s Why It’s A Performance Problem In SQL Server

Concerns


If you compare the things that non-SARGable queries cause issues with alongside the things that bad implicit conversions cause issues with, it’s an identical list.

  • Increased CPU
  • Inefficient use of indexes
  • Poor cardinality estimation
  • Maybe a bad memory grant based on that
  • Some “row by row” event

Though we often bucket the problems separately, they’re really the same thing.

That’s because, under the covers, something similar happens.

[Image: SQL Server query plan, captioned "Four letters"]

If you replace “CONVERT_IMPLICIT” with any other function, like ISNULL, COALESCE, DATEADD, DATEDIFF, etc., you may see the same performance degradation.
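
For a rough sketch of what that looks like in practice (the dbo.Customers table and its EmailAddress column below are made up for illustration, not from any demo database used in these posts):

/* EmailAddress is varchar here, the parameter is nvarchar. nvarchar wins
   datatype precedence, so the column gets wrapped in CONVERT_IMPLICIT
   and an index on it can't be used for a seek. */
DECLARE @email nvarchar(320) = N'someone@example.com';

SELECT COUNT(*) AS records
FROM dbo.Customers AS c
WHERE c.EmailAddress = @email; /* implicit conversion on the column */

/* Same problem, written out by hand: the function hides the column. */
SELECT COUNT(*) AS records
FROM dbo.Customers AS c
WHERE ISNULL(c.EmailAddress, '') = 'someone@example.com';

/* SARGable version: leave the column alone and convert the parameter instead. */
DECLARE @email_v varchar(320) = CONVERT(varchar(320), @email);

SELECT COUNT(*) AS records
FROM dbo.Customers AS c
WHERE c.EmailAddress = @email_v;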

Probably not the most thought provoking thing you’ve ever heard, but if you understand why one is bad and not the other, this may help you.

Thanks for reading!


Column Selectivity And SQL Server Index Design Patterns

The Memes On The Blogs Fall Mainly On The Aughs


Unless your column is unique, defined as unique, and people are searching for equality predicates on it — and I don’t mean column = column, I mean column = value — it might not be a great first column in your index.

Many unique columns I see are identity columns that don’t necessarily define a relationship or hold usable search values. They’re cool for keeping the clustered index sane, but no one’s looking at the values in them.

The problem with the advice that you should “always put the most selective column first” is that not many columns are uniformly selective. For some ranges they may be selective; for other ranges, they may not be.

Let’s look at some examples.

Usery


Let’s look at some tables in the Stack Overflow data dump. I realize this isn’t a perfect data set, but it has a lot of things in common with data sets I see out in the world.

  • The site has gotten more popular over time, so year over year dates become less selective
  • The site has definite groups of “power users” and “one and done” users
  • Certain site activities are more common than others: votes cast, types of posts made
  • Certain user attributes, like badges, are more common than others

All of these patterns are generally observable in real world data, too. Growth is a near constant, and with growth come lumpy patterns.
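
For reference, the screenshots below come from queries along these lines (a sketch; the exact queries may have differed, but the tables and columns are straight from the Stack Overflow data dump):

/* How selective is each vote type? */
SELECT v.VoteTypeId, COUNT_BIG(*) AS records
FROM dbo.Votes AS v
GROUP BY v.VoteTypeId
ORDER BY records DESC;

/* How selective is a year's worth of votes? */
SELECT YEAR(v.CreationDate) AS vote_year, COUNT_BIG(*) AS records
FROM dbo.Votes AS v
GROUP BY YEAR(v.CreationDate)
ORDER BY vote_year;

/* How skewed is Reputation? */
SELECT u.Reputation, COUNT_BIG(*) AS records
FROM dbo.Users AS u
GROUP BY u.Reputation
ORDER BY records DESC;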

[Image: SQL Server Management Studio query results, captioned "Not like the others"]

There are some significant differences in the numbers here: the top vote type (an upvote) has 37 million entries, while the next most popular one has 3.7 million.

Are either of those selective? No.

But when you get down to the bottom, you reach some selectivity.

The dates the votes were cast become less selective over time, too.

[Image: SQL Server Management Studio query results, captioned "Tinder"]

Within User Reputations, things become skewed towards the bottom end.

Though the site gets more users overall, Reputation is still largely skewed towards power users.

[Image: SQL Server Management Studio query results, captioned "Coconut colored"]

What Does This Mean For You?


  • Don’t assume that just because you search for something with an equality that it’s the most selective predicate.
  • Don’t assume that any search will always be selective (unless the column is unique).
  • Don’t assume that the most selective predicate should always be the first column in an index; there are other query operations that should be considered as well.

I can’t count the number of times that someone has told me something like “this query is fast, except when someone searches for X”, or “this query is fast, except when they ask for a year of data”, and the solution has been creating alternate indexes with key columns in a different order, or flipping current index key columns around.
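
As a sketch of what “flipping key columns around” means (the index names and column choices here are illustrative):

/* Same two key columns, different leading column: each order serves a different search pattern. */
CREATE INDEX ix_votes_type_date
    ON dbo.Votes (VoteTypeId, CreationDate); /* good when the vote type is rare */

CREATE INDEX ix_votes_date_type
    ON dbo.Votes (CreationDate, VoteTypeId); /* good for narrow date ranges */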

Thanks for reading!


Is RECOMPILE Really That Bad For SQL Server Query Performance?

To Re Or Not To Re


It’s been a while since SQL Server has had a real RECOMPILE problem. And if you put it up against the performance problems that you can hit with parameter sniffing, I’d have a hard time telling someone strapped for time and knowledge that it’s the worst idea for them.

Obviously, you can run into problems if you (“you” includes Entity Framework, AKA the Database Demolisher) author the kind of queries that take a very long time to compile. But as I list them out, I’m kinda shrugging.

Here are some problems you can hit with recompile. Not necessarily caused by recompile, but by not re-using plans.

  • Long compile times: Admittedly pretty rare, and plan guides or forced plans are likely a better option.
  • No plan history in the cache (only the most recent plan): Sucks if you’re looking at the plan cache. Sucks less if you have a monitoring tool or Query Store.
  • CPU spikes for high-frequency execution queries: Maybe time for caching some stuff, or getting away from the kind of code that executes like this (scalar functions, cursors, etc.).

But for everything in the middle: a little RECOMPILE probably won’t hurt that bad.

Thinking of the problems it can solve:

  • Parameter sniffing
  • Parameter embedding (lack of)
  • Local variable estimates
  • Catch all queries (see the sketch just after this list)
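
To make the catch-all case concrete, here’s a minimal sketch of the pattern where OPTION (RECOMPILE) earns its keep: with it, the predicates whose parameters are NULL get compiled away on each execution. The procedure is made up for illustration, built on the Stack Overflow Users table used elsewhere in these posts.

CREATE OR ALTER PROCEDURE dbo.search_users
    @Reputation int = NULL,
    @Location nvarchar(100) = NULL
AS
BEGIN
    SELECT TOP (1000)
           u.Id, u.DisplayName, u.Reputation, u.Location
    FROM dbo.Users AS u
    WHERE (u.Reputation >= @Reputation OR @Reputation IS NULL)
    AND   (u.Location = @Location OR @Location IS NULL)
    ORDER BY u.Reputation DESC
    OPTION (RECOMPILE); /* each execution gets a plan for the parameters actually supplied */
END;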

Those are very real problems that I see on client systems pretty frequently. And yeah, sometimes there’s a good tuning option for these, like changing or adding an index, moving parts of the query around, sticking part of the query in a temp table, etc.

But all that assumes that those options are immediately available. For third party vendors who have somehow developed software that uses SQL Server for decades without running into a single best practice even by accident, it’s often harder to get those changes through.

There’s More Than One Way To Recompile


Sure, you might be able to sneak a recompile hint somewhere in the mix even if it’d make the vendor upset. You can always yoink it out later.

But you have alternatives, too.

  • DBCC FREEPROCCACHE: No, not the whole cache. You can single out troublesome queries to remove specific plans.
  • Plan Guides: An often overlooked detail of plan guides is that you can attach hints to them, including recompile.

Using a plan guide doesn’t interfere with that precious vendor IP that makes SQL Server unresponsive every 15 minutes. Or whatever. I’m not mad.
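
Here’s roughly what those two alternatives look like. The query text and names below are placeholders, and a plan guide’s statement text has to match what the application actually sends, character for character.

/* Option 1: evict a single plan instead of the whole cache.
   Find the plan_handle for the troublesome query... */
SELECT qs.plan_handle, st.text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.text LIKE N'%some recognizable query text%';

/* ...then pass just that handle in:
   DBCC FREEPROCCACHE (0x0600...); -- paste the full plan_handle value from above */

/* Option 2: attach a RECOMPILE hint to a query you can't touch, via a plan guide. */
EXEC sys.sp_create_plan_guide
    @name = N'recompile_vendor_query',
    @stmt = N'SELECT * FROM dbo.VendorTable WHERE SomeColumn = 42;',
    @type = N'SQL',
    @module_or_batch = NULL,
    @params = NULL,
    @hints = N'OPTION (RECOMPILE)';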

And yeah, there are advances in SQL Server 2017 and 2019 that start to address some issues here, but they’re still imperfect.

I like ’em, but you know. They’re not quite there yet.

Thanks for reading!


SQL Server’s Partial Aggregate Query Plan Operators CAN Quit You

This Machine Makes Seltzer


Let’s say you have a parallel query running at DOP 4. The final logic of the query is some aggregate: COUNT, SUM, MIN, MAX, whatever.

Sure, the optimizer could gather all the streams, and then calculate one of those for all four of them, but why do that?

We have a Partial Aggregate operator that lets each thread aggregate its own rows locally, so a final global aggregate can be calculated more quickly from the four locally aggregated values.
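
For reference, this is the general shape of query being described, using the Stack Overflow tables from other posts here. Whether you actually get a Partial Aggregate, and whether it’s hash or stream based, is up to the optimizer.

/* A parallel plan whose final output is an aggregate: each of the four
   threads can aggregate its own rows first, and a global aggregate
   combines the per-thread results at the end. */
SELECT c.UserId, COUNT_BIG(*) AS records
FROM dbo.Posts AS p
JOIN dbo.Comments AS c
    ON c.PostId = p.Id
GROUP BY c.UserId
OPTION (MAXDOP 4);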

There are a couple odd things about Partial Aggregates though:

  • They ask for a fixed amount of memory, which is usually quite small
  • When they run out of memory, they don’t spill, they just stop aggregating

Which is why for identical executions of identical queries, you may see different numbers of rows come out of them.

[Image: SQL Server query plan, captioned "Everyone I know is sick to death of you."]

We start with the same number of rows coming out of the Hash Join, which is expected.

We ran the same query.

However, the Partial Aggregate emits different numbers of rows.

It doesn’t matter much, because the global aggregate later in the plan will still be able to figure things out, albeit slightly less efficiently.

[Image: SQL Server query plan, captioned "Appolonia"]

If we look at the spills in the Hash Match Aggregates from both of the above plans, the warnings are slightly different.

[Image: Hash Match Aggregate spill warnings, captioned "Ayyyyy"]

Hardly anything to worry about here, of course. But definitely something to be aware of.

No, SQL Server isn’t leaking memory, or full of bugs. It’s just sensitive.

Thanks for reading!


SQL Server’s Batch Mode On Row Store Is Fickle

Thanks Though


I’m excited about this feature. I’m not being negative, here. I just want you, dear reader, to have reasonable expectations about it.

This isn’t a post about it making a query slower, but I do have some demos of that happening. I want to show you an example of it not kicking in when it probably should. I’m going to use an Extended Events session that I first read about on Dmitry Pilugin’s blog here. It’ll look something like this.

CREATE EVENT SESSION heristix
    ON SERVER
    ADD EVENT sqlserver.batch_mode_heuristics
        ( ACTION( sqlserver.sql_text ))
    ADD TARGET package0.event_file
        ( SET filename = N'c:\temp\heristix' )
    WITH
        ( MAX_MEMORY = 4096KB,
          EVENT_RETENTION_MODE = ALLOW_SINGLE_EVENT_LOSS,
          MAX_DISPATCH_LATENCY = 1 SECONDS,
          MAX_EVENT_SIZE = 0KB,
          MEMORY_PARTITION_MODE = NONE,
          TRACK_CAUSALITY = OFF,
          STARTUP_STATE = ON );
GO
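
Once the session exists, you need to start it, and then you can read what it captured back out of the file target instead of watching live data in SSMS. Something like this works; the path matches the filename in the session definition above.

ALTER EVENT SESSION heristix ON SERVER STATE = START;

/* Read the captured events back out of the event_file target. */
SELECT CONVERT(xml, xe.event_data) AS event_data
FROM sys.fn_xe_file_target_read_file(N'c:\temp\heristix*.xel', NULL, NULL, NULL) AS xe;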

The Setup


Let’s start with some familiar indexes and a familiar query from other posts the last couple weeks.

CREATE INDEX something ON dbo.Posts(PostTypeId, Id, CreationDate);
CREATE INDEX stuffy ON dbo.Comments(PostId, Score, CreationDate);
SELECT COUNT(*)
FROM   dbo.Posts AS p
JOIN   dbo.Comments AS c
    ON  p.Id = c.PostId
WHERE DATEDIFF(YEAR, p.CreationDate, c.CreationDate) > 1
AND p.PostTypeId = 1
AND c.Score > 0;

The query plan is unimportant. It just doesn’t use any Batch Mode, and takes right about 2 seconds.

[Image: SQL Server query plan, captioned "Blech Mode"]

If we look at the entry for this query in our XE session, we can see that the optimizer considered the heck out of Batch Mode, but decided against it.

[Image: SQL Server Extended Events output, captioned "All The Heck"]

Curiouser


If we add a hash join hint to the query, it finishes in about 800ms.

SELECT COUNT(*)
FROM   dbo.Posts AS p
JOIN   dbo.Comments AS c
    ON  p.Id = c.PostId
WHERE DATEDIFF(YEAR, p.CreationDate, c.CreationDate) > 1
AND p.PostTypeId = 1
AND c.Score > 0
OPTION(HASH JOIN);

[Image: SQL Server query plan, captioned "Every Time"]

All the operators in this plan except Gather Streams are run in Batch Mode. Clearly it was helpful.

[Image: SQL Server Extended Events output, captioned "See Ghosts"]

And according to the XE session, we can see that decision in writing. Yay.

Alt Roq


If we modify our indexes slightly, we can get an Adaptive Join plan.

CREATE INDEX something_alt ON dbo.Posts(PostTypeId, CreationDate, Id);
CREATE INDEX stuffy_alt ON dbo.Comments(Score, CreationDate, PostId);

And, yes, this is about twice as fast now (compared to the last Batch Mode query), mostly because of the better indexing.

[Image: SQL Server query plan, captioned "Montage"]

Is There A Moral So Far?


Yes, don’t count on Batch Mode to kick in for every query where it would be helpful.

If you want queries to consistently use Batch Mode, you’ll need to do something like this.

SELECT COUNT(*)
FROM   dbo.Posts AS p
JOIN   dbo.Comments AS c
    ON  p.Id = c.PostId
LEFT JOIN dbo.t ON 1 = 0
WHERE DATEDIFF(YEAR, p.CreationDate, c.CreationDate) > 1
AND p.PostTypeId = 1
AND c.Score > 0;
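
The dbo.t table in that query isn’t defined anywhere in this post. The usual version of this trick is an empty table with a columnstore index on it, joined on a condition that can never be true, something along these lines (the name and shape are assumptions):

/* Empty dummy table whose only job is to carry a columnstore index, which
   makes batch mode available to the rest of the plan. The impossible join
   condition (1 = 0) keeps it from changing the results. */
CREATE TABLE dbo.t (id int NOT NULL);
CREATE NONCLUSTERED COLUMNSTORE INDEX ncci ON dbo.t (id);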

But you have to be careful there too.

[Image: SQL Server query plan, captioned "Mad Mad Mad Mad"]

You might lose your nice parallel plan and end up with a slower query.

Huh.

Thanks for reading!
