Parallelism – Page 5 – Darling Data

How Changing Max Degree Of Parallelism Can Change Query Plans In SQL Server

Posted on August 22, 2019May 14, 2022 by Erik Darling

Rop-A-Dop

After a while tuning a query, sometimes it’s fun to mess with the DOP it’s run at to see how things change.

I wouldn’t consider this a query tuning technique, more like a point of interest.

For a long time, when I’d look at a serial plan, and then a parallel plan for a query, the shape would be the same.

But that’s not always true.

DOP 1

At DOP 1, the plan looks like this:

DOP 2

At DOP 2, the plan looks like this:

Mo’ DOP

At DOP 3-8, the plan looks like this:

No DOP

The DOP 2 plan has a significantly different shape than the serial, or more parallel plans.

It also chooses different types of joins.

Of course, we can use a merge join hint to have it pick the same plan as higher DOPs, but where’s the fun in that?

Anyway, the reason I found this interesting is because I always thought the general optimization process was:

Come up with a serial plan
If the plan cost is > CTFP, look at the parallel version of the serial plan
If the parallel version is cheaper, go with it

Though it appears like there’s an extra step where the optimizer considers multiple parallel alternatives to the serial plan, and not just the parallel version of the serial plan.

The process is closer to:

Come up with a serial plan
If the plan cost is > CTFP, create a *NEW* plan using parallelism
If the parallel version is cheaper, go with it

In many cases, the *NEW* plan will be the “same” as the serial plan, just using parallelism. The optimizer is a creature of habit, and applies the same rules and transformations.

Thanks for reading!

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Spills Week: How Exchange Spills In Parallel Query Plans Crush Query Performance

Posted on August 2, 2019May 16, 2022 by Erik Darling

Many Have Entered, Few Have Left

For some background on Exchange Spills, check out this Great Post™ by Joe.

The root of this demo was trying to show people silly things about CTEs, how TOP can fence things off, and how TOP introduces a serial zone in plans unless it’s used inside the APPLY operator.

The result was this magnificent beast.

Long Mane

Why is this magnificent?

Because we have the trifecta. We have a spill on all three types of parallel exchanges.

Let’s take a closer look at those beauties.

Why Did That Happen?

This plan has a Merge Join, which requires ordered input.

That means the Repartition and Gather Streams operators preserve the order of the Id column in the Users table.

They don’t actually order by that column, they just keep it in order.

But what about Distribute Streams? GREAT QUESTION!

It has the same Partition Column as Repartition Streams. They both have to respect the same order going into the Merge Join, because it’s producing ordered output to the Gather Streams operator.

In short, there’s a whole lot of buffers filling up while waiting for the next ordered value.

Were Parallel Merge Joins A Mistake?

[Probably] not, but they always make me nervous.

Especially when exchange operators are the direct parent or child of an order preserving operator. This also goes for stream aggregates.

I realize that these things are “edge cases”. It says so in the documentation.

The Exchange Spill event class indicates that communication buffers in a parallel query plan have been temporarily written to the tempdb database. This occurs rarely and only when a query plan has multiple range scans… Very rarely, multiple exchange spills can occur within the same execution plan, causing the query to execute slowly. If you notice more than five spills within the same query plan’s execution, contact your support professional.

Well, shucks. We only have three spills. It looks like we don’t qualify for a support professional.

Thanks for reading!

Going Further

A Trick For Dealing With Table Variable Modification Performance Problems In SQL Server

Posted on July 17, 2019May 16, 2022 by Erik Darling

Don’t Want None

One of the many current downsides of @table variables is that modifying them inhibits parallelism, which is a problem #temp tables don’t have.

While updating and deleting from @table variables is fairly rare (I’ve seen it, but not too often), you at minimum need an insert to put some data in there.

No matter how big, bad, ugly, or costly your insert statement is, SQL Server can’t parallelize it.

Dimmo

Here’s our select statement.

SELECT DISTINCT
       u.Id
FROM   dbo.Users AS u
JOIN   dbo.Posts AS p
    ON p.OwnerUserId = u.Id
JOIN   dbo.Comments AS c
    ON  c.PostId = p.Id
    AND c.UserId = u.Id
WHERE  c.Score >= 5;

This goes parallel and runs for about 3 seconds.

But if we try to insert that into a @table variable, the plan will no longer go parallel, and will run for ~6 seconds.

DECLARE @icko TABLE (id INT);
INSERT @icko ( id )
SELECT DISTINCT
       u.Id
FROM   dbo.Users AS u
JOIN   dbo.Posts AS p
    ON p.OwnerUserId = u.Id
JOIN   dbo.Comments AS c
    ON  c.PostId = p.Id
    AND c.UserId = u.Id
WHERE  c.Score >= 5;

If we hit F4 to get the properties of the INSERT, well…

SQL Server Query Plan Properties — Chocolate Nonpareils Reason?

Let’s Say For Some Crazy Reason

You need to keep using a table variable.

Let’s say, I dunno, the crappy 1 row estimate gets you a better plan.

Or like… I dunno. Temp tables recompile too much.

I’m reaching, I know. But hey, that’s what consultants do. Have you read any blog posts lately?

If we change our insert to this, we get parallelism back:

DECLARE @icko TABLE (id INT);
INSERT @icko ( id )
EXEC(N'SELECT DISTINCT
              u.Id
       FROM   dbo.Users AS u
       JOIN   dbo.Posts AS p
           ON p.OwnerUserId = u.Id
       JOIN   dbo.Comments AS c
           ON  c.PostId = p.Id
           AND c.UserId = u.Id
       WHERE  c.Score >= 5;')

The dynamic SQL executes in a separate context, and the insert happens in… idk some magickal place.

But THAT’S PRETTY COOL, HUH?

This would also work if we put the query inside a stored procedure, or stuck the statement inside a variable and used sp_executesql.

In either case, the INSERT…EXEC pattern is your friend here.

Of course, you could just use a #temp table.

Sigh.

Going Further

The Difference Between Statistics Time And Plan Operator Times In SQL Server Query Plans

Posted on July 9, 2019May 14, 2022 by Erik Darling

Goal Posts

When you’re measuring query changes to see if your performance changes have made a difference, a common way to do that is to use STATISTICS TIME and IO.

They’re not perfect, but the barrier to entry is super low, and you can get a good enough feel for if you’re on the right track.

In a perfect world, people would only select the rows and columns they need.

Also in a perfect world: that really embarrassing thing you did in 3rd grade wouldn’t pop into your head every time you’re about to do something really important.

Durex

What can make judging differences tough is if you’re returning a lot of rows to SSMS.

Sometimes it feels like you can reduce reads and CPU time, but your overall query time hasn’t changed.

Now with query operator times, that becomes easier to see.

And Earl

Let’s take this query, which returns ~271k rows.

SET STATISTICS TIME, IO ON;

SELECT c.Score, c.UserId, c.Text 
FROM dbo.Comments AS c
WHERE c.Score BETWEEN 5 AND 30
ORDER BY c.Score DESC

In the Stack Overflow 2013 database, this runs for about 3 wall clock seconds.

It says so in the bottom corner of SSMS.

Since we turned on stats time, we can look in the messages window to see that information.

Here are the relevant details:

 SQL Server Execution Times:
   CPU time = 3516 ms,  elapsed time = 3273 ms.

What looks odd here is that CPU and elapsed time are near-equal, but the plan shows parallelism.

Thankfully, with operator times, the actual plan helps us out.

The query itself ran for <900ms.

The situation isn’t so dire.

More Ales

In stats time, elapsed time measures until results are done getting to SSMS.

It might look like this query “ran” for ~3 seconds, but it didn’t. The query finished processing data in under a second, but it took another couple seconds for SSMS to render the results.

You can do a mock test by doing something like this:

DECLARE @blob_eater VARCHAR(8000);

SELECT @blob_eater = c.Score, 
       @blob_eater = c.UserId, 
	   @blob_eater = c.Text 
FROM dbo.Comments AS c
WHERE c.Score BETWEEN 5 AND 30
ORDER BY c.Score DESC

Now when we run the query, stats time is much closer to the operator finish time:

 SQL Server Execution Times:
   CPU time = 2954 ms,  elapsed time = 897 ms.

Thanks for reading!

Going Further

Too Much Of A Harmless SQL Server Wait Type Can Be Harmful

Posted on July 5, 2019May 14, 2022 by Erik Darling

Absence Of Evidence

There’s a pinky-out saying about wine: don’t hate the grape.

People say it because the same kind of grape can be grown by different people in different places.

Due to many, ahem, local factors, wine made with that grape can taste miraculously different.

It’s with that in mind that I’m going to say this: don’t ignore the wait.

No matter what script you’re using to look at wait stats, try unquoting the ignoreable list and seeing what shows up.

Get curious. Poke around. You might find something interesting.

Twosifer

While experimenting with FROID, I came up with a function and query that generate some weird waits.

Those waits are EXECSYNC, and CXCONSUMER. Now, under normal circumstances, you might be able to ignore them.

But things are rarely normal when you’re experiencing performance problems, are they? If you ignore too much, you can miss big problems.

Going back to running this query, I can see the wait stats that get generated in sys.dm_exec_session_wait_stats when the query is finished.

SELECT u.DisplayName, 
       dbo.TotalScore(u.Id) AS TotalScore
FROM dbo.Users AS u
WHERE u.Reputation >= 200000
ORDER BY u.Id;

Here’s what those waits look like:

If one were to follow advice — even advice from Microsoft — one may miss important clues as to what happened.

CXCONSUMER waits being high is fairly tightly correlated to skewed parallelism, and this is no exception.

EXECSYNC represents a serial zone within a parallel plan, in this case building two Eager Index Spools:

When you spend a long time building indexes single threaded, you spend a long time waiting on CXCONSUMER (and not so much time at all waiting on CXPACKET).

Being able to put the waits together with the query plan can help you tune queries more efficiently.

This is especially true if you’re on earlier versions of SQL Server/SSMS where the kind of detail shown in query plans here doesn’t exist.

Thanks for reading!

A Trick For Working Around Scalar UDF Performance Issues In SQL Server

Posted on July 2, 2019May 14, 2022 by Erik Darling

Mr. Duster

Thanks for watching!

Going Further

CMEMTHREAD Waits Generated By Query Store In SQL Server

Posted on June 28, 2019May 14, 2022 by Erik Darling

Memethread

Thanks for watching!

Going Further

Thoughts On Setting MAXDOP Correctly In SQL Server

Posted on June 18, 2019May 16, 2022 by Joe Obbish

Microsoft recently published new guidance on setting server level MAXDOP. I hope to help the community by analyzing the new guidance and offering some of my own thoughts on query parallelism.

Line by line

Documentation is meant to be shared after all, so hopefully no one minds if I quote most of it:

Starting with SQL Server 2016 (13.x), during service startup if the Database Engine detects more than eight physical cores per NUMA node or socket at startup, soft-NUMA nodes are created automatically by default. The Database Engine places logical processors from the same physical core into different soft-NUMA nodes.

This is true and one of the bigger benefits of auto soft-NUMA as far as I’ve been able to tell.

The recommendations in the table below are aimed at keeping all the worker threads of a parallel query within the same soft-NUMA node.

SQL Server is not designed to keep all worker threads in a single soft-NUMA node. That might have been true in SQL Server 2008, but it changed in 2012. The only semi-official documentation that I know of is here and I looked into the behavior here. Read through both if you’re interested in how scheduling of parallel worker threads is performed by SQL Server, but I’ll provide a quick summary via example here.

Suppose you have two soft-NUMA nodes of 6 schedulers each and the server just restarted.NUMA node 0 has positions 0-5 and NUMA node 1 has positions 6-11. The global enumerator starts at position 0. If I run a MAXDOP 4 query then the enumerator advances by 4. The parallel workers are allowed in positions 0-3 which means that any four out of six schedulers can be chosen from NUMA node 0. All parallel worker threads are in NUMA node 0 for the first query. Suppose I run another MAXDOP 4 query. The enumerator advances by 4 and the allowed positions are 4-7. That means that any two schedulers can be chosen from NUMA node 0 and any two schedulers can be chosen from NUMA node 1. The worker threads are split over two soft-NUMA nodes even though query MAXDOP is less than the size of the soft-NUMA nodes.

Unless you’re on a server with a single soft-NUMA node it is difficult to guarantee that all worker threads end up on the same soft-NUMA node. I strongly recommend against aiming for that as a goal. There are more details in the “Preventing hard NUMA worker splits” section of this blog post.

This will improve the performance of the queries and distribution of worker threads across the NUMA nodes for the workload. For more information, see Soft-NUMA.

I’ve heard some folks claim that keeping all parallel workers on a single hard NUMA nodes can be important for query performance. I’ve even seen some queries experience reduced performance when thread 0 is on a different hard NUMA node than parallel worker threads. I haven’t heard of anything about the importance of keeping all of a query’s worker threads on a single soft-NUMA node. It doesn’t really make sense to say that query performance will be improved if all worker threads are on the same soft-NUMA node. Soft-NUMA is a configuration setting. Suppose I have a 24 core hard NUMA node and my goal is to get all of a parallel query’s worker threads on a single soft-NUMA node. To accomplish that goal the best strategy is to disable auto soft-NUMA because that will give me a NUMA node size of 24 as opposed to 8. So disabling auto soft-NUMA will increase query performance?

Starting with SQL Server 2016 (13.x), use the following guidelines when you configure the max degree of parallelism server configuration value:

Server with single NUMA node [and] Less than or equal to 8 logical processors: Keep MAXDOP at or below # of logical processors

I don’t understand this guidance at all. If MAXDOP is set to above the number of logical processors then the total number of logical processors is used. This is even mentioned earlier on the same page of documentation. This line is functionally equivalent to “Set MAXDOP to whatever you want”.

Server with single NUMA node [and] Greater than 8 logical processors: Keep MAXDOP at 8

This configuration is only possible with a physical core count between 5 and 8 and with hyperthreading enabled. Setting MAXDOP above the physical core count isn’t recommended by some folks, but I suppose there could be some scenarios where it makes sense. Keeping MAXDOP at 8 isn’t bad advice for many queries on a large enough server, but the documentation is only talking about small servers here.

Server with multiple NUMA nodes [and] Less than or equal to 16 logical processors per NUMA node: Keep MAXDOP at or below # of logical processors per NUMA node

I have never seen an automatic soft-NUMA configuration result in more than 16 schedulers per soft-NUMA node, so this covers all server configurations with more than 8 physical cores. Soft-NUMA scheduler counts per node can range from 4 to 16. If you accept this advice then in some scenarios you’ll need to lower MAXDOP as you increase the number of physical cores per socket. For example, if I have 24 schedulers per socket without hyperthreading then auto soft-NUMA gives me three NUMA nodes of 8 schedulers, so I might set MAXDOP to 8. But if the scheduler count is increased to 25, 26, or 27 then I’ll have at least one soft-NUMA node of 6 schedulers. So I should lower MAXDOP from 8 to 6 because the physical core count of the socket increased?

Server with multiple NUMA nodes [and] Greater than 16 logical processors per NUMA node: Keep MAXDOP at half the number of logical processors per NUMA node with a MAX value of 16

I have never seen an automatic soft-NUMA configuration result in more than 16 schedulers per soft-NUMA node. I believe that this is impossible. At the very least, if it possible I can tell you that it’s rare. This feels like an error in the documentation. Perhaps they were going for some kind of hyperthreading adjustment?

NUMA node in the above table refers to soft-NUMA nodes automatically created by SQL Server 2016 (13.x) and higher versions.

I suspect that this is a mistake and that some “NUMA node” references are supposed to refer to hard NUMA. It’s difficult to tell.

Use these same guidelines when you set the max degree of parallelism option for Resource Governor workload groups.

There are two benefits to using MAXDOP at the Resource Governor workload group level. The first benefit is that it allows different workloads to have different MAXDOP without changing lots of application code. The guidance here doesn’t allow for that benefit. The second benefit is that it acts as a hard limit on query MAXDOP as opposed to the soft limit provided with server level MAXDOP. It may also be useful to know that the query optimizer takes server level MAXDOP into account when creating a plan. It does not do so for MAXDOP set via Resource Governor.

I haven’t seen enough different types of workloads in action to provide generic MAXDOP guidance, but I can share some of the issues that can occur with query parallelism being too low or too high.

What are some of the problems with setting MAXDOP too low?

Better query performance may be achieved with a higher MAXDOP. For example, a well-written MAXDOP 8 query on a quiet server may simply run eight times as quickly as the MAXDOP 1 version. In some scenarios this is highly desired behavior.
There may not be enough concurrent queries to get full value out of the server’s hardware without increasing query MAXDOP. Unused schedulers can be a problem for batch workloads that aim to get a large, fixed amount of work done as quickly as possible.
Row mode bitmap operators associated with hash joins and merge joins only execute in parallel plans. MAXDOP 1 query plans lose out on this optimization.

What are some of the problems with setting MAXDOP too high?

At some point, throwing more and more parallel queries at a server will only slow things down. Imagine adding more and more cars to an already gridlocked traffic situation. Depending on the workload you may not want to have many active workers per scheduler.
It is possible to run out of worker threads with many concurrent parallel queries that have many parallel branches each. For example, a MAXDOP 8 query with 20 branches will ask for 160 parallel workers. When this happens parallel queries can get downgraded all the way to MAXDOP 1.
Row mode exchange operators need to move rows between threads and do not scale well with increased query MAXDOP.
Some types of row mode exchange operators evenly divide work among all parallel worker threads. This can degrade query performance if even one worker thread is on a busy scheduler. Consider a server with 8 schedulers. Scheduler 0 has two active workers and all other schedulers have no workers. Suppose there is 40 seconds of CPU work to do, the query scales with MAXDOP perfectly, and work is evenly distributed to worker threads. A MAXDOP 4 query can be expected to run in 40/4 = 10 seconds since SQL Server is likely to pick four of the seven less busy schedulers. However, a MAXDOP 8 query must put one of the worker threads on scheduler 0. The work on schedulers 1 – 7 will finish in 40/8 = 5 seconds but the worker thread on scheduler 0 has to yield to the other worker threads. It may take 5 * 3 = 15 seconds if CPU is shared evenly, so in this example increasing MAXDOP from 4 to 8 increases query run time from 10 seconds to 15 seconds.
The query memory grant for parallel inserts into columnstore indexes increases with MAXDOP. If MAXDOP is too high then memory pressure can occur during compression and the SELECT part of the query may be starved for memory.
The query memory grant for memory-consuming operators on the inner side of a nested loop is often not increased with MAXDOP even though the operator may execute concurrently once on each worker thread. In some uncommon query patterns, increasing MAXDOP will increase the amount of data spilled to tempdb.
Increasing MAXDOP increases the number of queries that will have parallel workers spread across multiple hard NUMA nodes. If MAXDOP is greater than the number of schedulers in a hard NUMA node then the query is guaranteed to have split workers. This can degrade query performance for some types of queries.
Worker threads may need to wait on some type of shared resource. Increasing MAXDOP can increase contention without improving query performance. For example, there’s nothing stopping me from running a MAXDOP 100 SELECT INTO, but I certainly do not get 100X of the performance of a MAXDOP 1 query. The problem with the below query is the NESTING_TRANSACTION_FULL latch:

SQL Server Query Plan Properties

Preventing hard NUMA worker splits

It generally isn’t possible to prevent worker splits over hard NUMA nodes without changing more than server level and query level MAXDOP. Consider a server with 2 hard NUMA nodes of 10 schedulers for each. To avoid a worker split, an administrator might try setting server level MAXDOP to 10, with the idea being that each parallel query spreads its workers over NUMA node 0 or NUMA node 1. This plan won’t work if any of the following occur:

Any query runs with a query level MAXDOP hint other than 0, 1, 10, or 20.
Any query is downgraded in MAXDOP but still runs in parallel.
A parallel stats update happens. The last time I checked these run with a query level MAXDOP hint of 16.
Something else unexpected happens.

In all cases the enumerator will be shifted and any MAXDOP 10 queries that run after will split their workers. TF 2467 can help, but it needs to be carefully tested with the workload. With the trace flag, as long as MAXDOP <= 10 and automatic soft-NUMA is disabled then the parallel workers will be sent to a single NUMA node based on load. Note that execution context 0 thread can still be on a different hard NUMA node. If you want to prevent that then you can try Resource Governor CPU affinity at the Resource Pool level. Create one pool for NUMA node 0 and one pool for NUMA node 1. You may experience interesting consequences when doing that.

The most reliable method by far is to have a single hard NUMA node, so if you have a VM that fits into a single socket of a VM host and you care about performance then ask your friendly VM administrator for some special treatment.

Final thoughts

I acknowledge that it’s difficult to create MAXDOP guidance that works for all scenarios and I hope that Microsoft continues to try to improve their documentation on the subject. Thanks for reading!

How Bad Cardinality Estimates Lead To Bad Query Plan Choices: Does Batch Mode Help?

Posted on May 22, 2019May 16, 2022 by Erik Darling

Notes and Notability

In the last post, I showed you a query where a bad plan was chosen because of bad guesses.

In this post, I’m going to see if Batch Mode will help anything.

To do that, we’re going to use an empty temp table with a clustered columnstore index on it.

    CREATE TABLE #t (id INT NOT NULL, 
	                 INDEX c CLUSTERED COLUMNSTORE);

First Up, No Hints

    SELECT   p.*
    FROM     dbo.Posts AS p
    JOIN     dbo.Votes AS v
        ON p.Id = v.PostId
	LEFT JOIN #t ON 1 = 0
    WHERE    p.PostTypeId = 2
    AND      p.CreationDate >= '20131225'
    ORDER BY p.Id;

Though this plan is ~6 seconds faster than the Merge Join equivalent in the last post, that’s not the kind of improvement I’m shooting for.

Remember than the Hash Join plan in Row Mode ran in 2.6 seconds.

The only operator to run in Batch Mode here is the Sort. To be fair, it’s really the only one eligible with the trick we used.

Forcing this plan to run in parallel, we go back to a 27 second runtime with no operators in Batch Mode.

Next Up, Hash Hint

Disappointingly, this gets worse. The Row Mode only plan was 2.6 seconds, and this is 6.8 seconds.

The answer to why the Batch Mode plan is 3x slower lies in our Row Mode plan. Let’s look at them head to head.

See that Repartition Streams operator? It literally saves the entire query.

The Batch Mode plan doesn’t get one. Because of that, Bad Things Happen™

Skew Job

Both plans start out with an unfortunate scan of the Posts table.

But in the Row Mode plan, Repartition Streams does exactly what it sounds like it does, and balances things out. The number of rows on each thread is even because of this. Crazy to think about, but threads dividing work up evenly is, like, a good thing.

In the Batch Mode plan, that… doesn’t happen.

We have the dreaded serial parallel query. Aside from the Scan of the Votes table, only one thread is ever active across the rest of the query.

This isn’t a limitation of Batch Mode generally, though I suspect it has something to do with why we don’t see Repartition Streams show up.

One limitation of Batch Mode is with Sorts — they are single threaded — a point this particular demo obfuscates, unless they’re a child operator of a Window Aggregate.

Well, Darn

I was expecting some fireworks there.

Maybe 2019 will help?

In the next post, we’ll look at that.

Thanks for reading!

Going Further

SQL Server Query Plan Operators That Hide Performance Problems

Posted on February 12, 2019May 14, 2022 by Erik Darling

In A Row?

When you’re reading query plans, you can be faced with an overwhelming amount of information, and some of it is only circumstantially helpful.

Sometimes when I’m explaining query plans to people, I feel like a mechanic (not a Machanic) who just knows where to go when the engine makes a particular rattling noise.

That’s not the worst thing. If you know what to do when you hear the rattle next time, you learned something.

One particular source of what can be a nasty rattle is query plan operators that execute a lot.

Busy Killer Bees

I’m going to talk about my favorite example, because it can cause a lot of confusion, and can hide a lot of the work it’s doing behind what appears to be a friendly little operator.

Something to keep in mind is that I’m looking at the actual plans. If you’re looking at estimated/cached plans, the information you get back may be inaccurate, or may only be accurate for the cached version of the plan. A query plan reused by with parameters that require a different amount of work may have very different numbers.

Nested Loops

Let’s look at a Key Lookup example, because it’s easy to consume.

CREATE INDEX ix_whatever ON dbo.Votes(VoteTypeId);

SELECT v.VoteTypeId, v.BountyAmount
FROM dbo.Votes AS v
WHERE v.VoteTypeId = 8
AND v.BountyAmount = 100;

You’d think with “loops” in the name, you’d see the number of executions of the operator be the number of loops SQL Server thinks it’ll perform.

But alas, we don’t see that.

In a parallel plan, you may see the number of executions equal to the number of threads the query uses for the branch that the Nested Loops join executes in.

For instance, the above query runs at MAXDOP four, and coincidentally uses four threads for the parallel nested loops join. That’s because with parallel nested loops, each thread executes a serial version of the join independently. With stuff like a parallel scan, threads work more cooperatively.

If we re-run the same query at MAXDOP 1, the number of executions drops to 1 for the nested loops operator, but remains at 71,048 for the key lookup.

But here we are at the very point! It’s the child operators of the nested loops join that show how many executions there were, not the nested loops join itself.

Weird, right?

Thanks for reading!