Just One Index
I want both of these queries to be fast.
SELECT TOP (5000) p.LastActivityDate, p.PostTypeId, p.Score, p.ViewCount FROM dbo.Posts AS p WHERE p.PostTypeId = 4 AND p.LastActivityDate >= '20120101' ORDER BY p.Score DESC; SELECT TOP (5000) p.LastActivityDate, p.PostTypeId, p.Score, p.ViewCount FROM dbo.Posts AS p WHERE p.PostTypeId = 1 AND p.LastActivityDate >= '20110101' ORDER BY p.Score DESC;
Get to work.
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.
Create an Index on PostTypeID(if it is not clustered index) and lastactivitydate
I don’t see the table design (sorry if I missed that)
No, you didn’t miss it, but if you’re interested it’s documented here.
Thanks!
Alright I’ll take a shot, I am guessing LastActivityDate is more selective than PostTypeId. And Score probably more so than LastActivityDate, although Score and PostTypeId are possibly skewed a lot more? Are your stats up to date? 🙂
CREATE NONCLUSTERED INDEX IX_ShotInTheDark ON [dbo].[Posts] ([LastActivityDate], [PostTypeId], [Score] DESC)
INCLUDE ([ViewCount]);
I have a feeling Score first may be the faster though.
I’d also add a filter to the index (where PostTypeId IN (4,1))
I update stats every 15 seconds, tyvm 😀
create index ix_forerik on posts ( Posttypeid, LastActivirtydate. Score ) include ( viewcount)
I appreciate the thoughtfulness of your naming conventions!
LastActivityDate (the way we’re using it) is not very selective; PostTypeId is pretty good. For PostTypeId = 1, there are 1825 distinct scores. For PostTypeId = 5, there is only 1. There are lots of duplicate scores, and the vast majority of them are low scores. The lowest score that would be included is 391, with about ~20 that could be chosen from arbitrarily.
SEDE link (don’t have the database on my machine and too lazy to download it): https://data.stackexchange.com/stackoverflow/query/1177637/erik-darling-design-an-index-together-v1?opt.withExecutionPlan=true#resultSets
We could maybe try a filtered index, although I strongly suspect that we couldn’t get one that would match these predicates (and we’d have to make it on the underlying table anyway, and then it definitely wouldn’t be used).
I think a CCI is actually going to be the best index, but if we want to avoid columnstore I’d probably guess something like this:
CREATE NONCLUSTERED INDEX PlaytimeWithErik
ON dbo.Posts ( PostTypeId, Score DESC, LastActivityDate DESC )
INCLUDE ( ViewCount );
PostTypeId is first because I think it is always going to be the most selective.
I’m uncertain about Score vs LastActivityDate; intuitively I could imagine SQL Server scanning the Score list in descending order, checking for LastActivityDate >= our filter, and quitting once 5000 rows are returned. I’m reasonably confident it won’t do that, in which case LastActivityDate should probably be second.
LastActivityDate DESC so we can scan forwards instead of backwards (again, intuition that SQL Server can figure this out, not empirical).
Score DESC to match the ORDER BY, but this only matters if it lets us avoid the sort. It’ll only avoid the sort if my intuition above works, because LastActivityDate is not equality but a range, so I’ll move it to be in the INCLUDE if we change the order.
ViewCount in the INCLUDE because we don’t need it for anything fancy.
Rubber ducky, you’re the one.
I would have created a nonclustered columnstore index.
I LOVE those! But that’s not where I’m going with this one, hahaha.
I would create an index
CREATE NONCLUSTERED INDEX IX_PostTypeID_LastActivityDate_Score_includes ON [dbo].[Posts] (PostTypeId, LastActivityDate, Score) INCLUDE (ViewCount);
This
You’re getting warm!
I guess I am the weirdo in the group, but putting score at the front of the index allows it to get rid of the sort and grab the necessary records much quicker:
CREATE INDEX IX_UR_MOM on posts(Score,PostTypeID,LastActivityDate ) INCLUDE(ViewCount)
You’re close, but you’re gonna wanna check out the next couple posts.
Also, according to good consulting practice and Brent O. I should ask what fast is?
FASTER!
I don’t really like designing indexes for specific queries only. I much prefer to look at the entire workload.
But assuming these are the CEO’s pet reports, I’ll give it a stab.
Equality filters are more effective then inequality, so the index leads with PostTypeId, followed by LastActivityDate. There is no need to put Score as in index column, after an inequality filter the remaining columns are no longer sorted in the result set. (Had the query used equality on LastActivityDate, my suggestion would have been different). So Score is not indexed, and neither is ViewCount. But they do need to be included, else we’ll need a lookup.
(If any of the two included columns is in the clustered index, then including them makes me look silly but doesn’t hurt).
CREATE INDEX DontTellEriksBossHeDidntDoThisHimself
ON dbo.Posts(PostTypeID, LastActivityDate)
INCLUDE(Score, ViewCount);
If you really want to go overboard (but I expect that it will save milliseconds at best, at the price of being an overall far less usable index), then add a filter too:
CREATE INDEX DontTellEriksBossHeDidntDoThisHimself
ON dbo.Posts(PostTypeID, LastActivityDate)
INCLUDE(Score, ViewCount)
WHERE PostTypeID IN (1,4) AND LastActivityDate >= ‘20110101’;
Jeez, Pinal is smarter than you 😉
It took you THIS long to figure that out? Guess where that places you in the smartness ranking?
Seriously: You are right that my answer was incorrect. I realized that after posting. But you are not correct if you claim that my answer is not the fastest possible index. It might be, and the question doesn’t contain enough data to know for sure.
Using the second query as example, let’s say the PostTypeID filter reduces the data to a million rows. If 990,000 of them are from before 2011, then I want “my” index to not have to read those 990,000 rows and I’ll happily pay the price of sorting the remaining 10,000 rows.
But if 990,000 of those posts are recent, then “your” index prevents a huge sort of almost a million rows, while only reading 1% more data.
So as usual, the only correct answer is “it depends”.
(Perhaps I should use ix_ItDependsButOnWhat as my future index name … for all indexes)
Hahaha, ix_it_worked_on_my_machine is definitely prescient.
Because we have:
– TOP(5000)
– Order by Score DESC
– WHERE p.PostTypeId =
i will create one index
create nonclustered index idx_Posts___PostType_Score_LastActivity_I ON dbo.Posts
(
PostTypeId
,Score DESC
,LastActivityDate
)
Include (ViewCount)
You are rocking and rolling and can’t no one stop you.