Skip to content

Conversation

@timcassell
Copy link
Collaborator

@timcassell timcassell commented Jan 16, 2025

Fixes #1120.

  1. Refactored engine stages to make the stack size constant for each benchmark invocation.
  2. Applied AggressiveOptimization to engine methods to eliminate tiered-JIT as a potential variable in the engine when running iterations. (See also Apply AggressiveOptimization to clocks AndreyAkinshin/perfolizer#19)

I tested this on Ryzen 7 9800X3D, and got the same results on both master and this PR. I could not repro the measurement spikes from the issue (the graph was smooth).

I tested on Apple M3 and got these results.

Master:

image

PR:

image

Observations to note

  • The after results are much less spikey for longer.
    • I can't say why the last bit of the graph is also spikey, it could have to do with the fact that I ran on a MacBook Air and it could have throttled, but I can't prove it (the benchmark took a very long time to run, so I only ran it once). I also did not include the changes in Apply AggressiveOptimization to clocks AndreyAkinshin/perfolizer#19 in this test, which could also be a factor.
  • The after results are larger, at the upper end of the spikes from the results with master.
    • I suspect that the benchmarks are sensitive to the alignment of the stack when it is invoked. I'm not sure if BDN should take this into account by default, and if so, what it should do about it (we already have [MemoryRandomization] that randomizes the stack for each iteration).

@timcassell
Copy link
Collaborator Author

timcassell commented Feb 13, 2025

Another curiosity I found while working on #2336 is, unrolling the calls affects performance strangely on ARM architecture (Apple M3) (while it does what you'd expect on x86-64).

Benchmark of just _field++ (M3):

Unroll x16

overhead:  0.911, workload:  0.635, diff: - 0.275

NoUnroll

overhead:  0.900, workload:  1.240, diff:  0.339

timcassell and others added 3 commits February 15, 2025 15:09
Apply AggressiveOptimization to engine methods.
Copy link
Member

@AndreyAkinshin AndreyAkinshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@timcassell I really love this approach! Maintaining a constant stack size should help achieve more stable measurements. I also like the idea of a unified method for running iterations in the engine instead of delegating it to the *Stage.Run*() methods.

Now, we need to develop a better design. The IEngineStageEvaluator seems unnecessary since we already have IStoppingCriteria; merging these entities would be beneficial. We can also simplify the existing solution by eliminating EngineWarmupStage, EnginePilotStage, and EngineActualStage, as their purpose is to provide running and stopping strategies. Since we have a sealed hierarchy of stages that should not be extended externally, we can introduce an advanced dispatcher that builds an IStoppingCriteria/IEngineStageEvaluator based on IterationStage, IterationMode, and other state properties without calling overridden Engine*Stage methods.

What do you think?

@timcassell
Copy link
Collaborator Author

timcassell commented May 1, 2025

Now, we need to develop a better design. The IEngineStageEvaluator seems unnecessary since we already have IStoppingCriteria; merging these entities would be beneficial.

I initially looked at re-using IStoppingCriteria, but it wouldn't quite work for this purpose due to how the pilot stage builds up the count. I implemented it this way to prevent any breaking changes, since stages and IStoppingCriteria are public. But if we're cool with breaking those public types (or removing them from public surface entirely), then we can certainly adjust the implementation.

We can also simplify the existing solution by eliminating EngineWarmupStage, EnginePilotStage, and EngineActualStage, as their purpose is to provide running and stopping strategies. Since we have a sealed hierarchy of stages that should not be extended externally, we can introduce an advanced dispatcher that builds an IStoppingCriteria/IEngineStageEvaluator based on IterationStage, IterationMode, and other state properties without calling overridden Engine*Stage methods.

I'm not sure what you have in mind there. It seems to me that the EnumerateStages() method is that dispatcher. You just want to remove the stages altogether and move all that logic to that iterator?

@AndreyAkinshin
Copy link
Member

but it wouldn't quite work for this purpose due to how the pilot stage builds up the count

We can rework how the pilot stage exposes the chosen invocation count; changing the API is not a problem.

But if we're cool with breaking those public types (or removing them from public surface entirely)

Initially, I made it public to enable experiments with stopping criteria without needing to build a new version of BenchmarkDotNet. However, I don't feel that anyone is actually using it. Therefore, I believe it's safe to completely rework the API and reduce the public surface.

I'm not sure what you have in mind there.

Currently, the Engine*Stage classes serve primarily as wrappers for two methods (Auto/Specific) that implement the running and stopping strategy. What if we extracted each strategy into a separate class that manages the stopping behavior and includes additional features, such as exposing the chosen invocation count? EnumerateStages could return instances of this class instead of tuples (IterationStage stage, IterationMode mode, IEngineStageEvaluator evaluator). This change would allow us to eliminate the existing Engine*Stage classes, resulting in a design that is more conducive to testing.

@timcassell
Copy link
Collaborator Author

timcassell commented May 1, 2025

@AndreyAkinshin Since we're ok with breaking changes, what do you think of making the measurement iteration indices 0-based? I noticed they are 1-based while I was working on this, and I couldn't figure out why. Maybe just an off-by-one bug that was never caught? There are no tests for the measurement iteration index.

@AndreyAkinshin
Copy link
Member

@timcassell The measurement iteration indices are primarily intended for presentation to the user, rather than for internal logic. From a usability perspective, 1-based numbering tends to be more intuitive and user-friendly. For example, we typically refer to the first iteration as “Iteration 1” rather than “Iteration 0”—just as spreadsheet rows in Excel start from 1 to match user expectations.

Using 0-based indices in this context can easily lead to confusion. Consider a case where the output shows:

Iteration 0  
Iteration 1  
Iteration 2  

If someone refers to “the second iteration,” it’s ambiguous whether they mean Iteration 1 (which is technically the second) or Iteration 2 (which has the number 2). With 1-based numbering, this ambiguity is avoided.

Since the main purpose of these indices is to improve readability and clarity for the end user, we chose 1-based indexing intentionally.

@timcassell
Copy link
Collaborator Author

timcassell commented May 4, 2025

Initially, I made it public to enable experiments with stopping criteria without needing to build a new version of BenchmarkDotNet. However, I don't feel that anyone is actually using it. Therefore, I believe it's safe to completely rework the API and reduce the public surface.

Well, it looks like there wasn't any way to inject custom stopping criteria. Users would have to implement their own IEngine(Factory) anyway.

@timcassell timcassell requested a review from AndreyAkinshin May 4, 2025 01:06
Copy link
Member

@AndreyAkinshin AndreyAkinshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks much better now. Thanks for implementing this!

@timcassell timcassell merged commit 8b14cbb into dotnet:master May 12, 2025
8 checks passed
@timcassell timcassell deleted the engine-stages branch May 12, 2025 00:21
@timcassell timcassell added this to the v0.15.0 milestone May 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Large Spike from WorkloadWarmup to WorkloadActual

3 participants