
arm: add ROIAlign NEON implementation #6781

Draft
devin-lai wants to merge 1 commit into Tencent:master from devin-lai:arm-roialign-neon

Summary

This PR adds an ARM NEON pack4 implementation for ROIAlign.

The main change is in the bilinear accumulation helper. The previous helper accumulated all four interpolation corners into one _sum, which made the inner loop latency-bound as the sample count grew. The NEON path now keeps the four corner accumulators independent and reduces them at the end.
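To illustrate the difference, here is a minimal scalar sketch of the two accumulation strategies. It is not the code in this PR: `BilinearSample`, `accumulate_chained`, and `accumulate_independent` are hypothetical names, and the real pack4 path works on `float32x4_t` registers with NEON multiply-accumulate intrinsics rather than plain floats.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical layout for one bilinear sample: four corner offsets and weights.
struct BilinearSample
{
    int pos[4]; // element offsets of the four corners
    float w[4]; // bilinear weights of the four corners
};

// Single accumulator: every multiply-add depends on the previous one.
float accumulate_chained(const float* data, const BilinearSample* samples, size_t count)
{
    assert(data != nullptr && (samples != nullptr || count == 0));
    float sum = 0.f;
    for (size_t i = 0; i < count; i++)
    {
        const BilinearSample& s = samples[i];
        sum += data[s.pos[0]] * s.w[0];
        sum += data[s.pos[1]] * s.w[1];
        sum += data[s.pos[2]] * s.w[2];
        sum += data[s.pos[3]] * s.w[3];
    }
    return sum;
}

// Four accumulators: one dependency chain per corner, reduced once at the end.
float accumulate_independent(const float* data, const BilinearSample* samples, size_t count)
{
    assert(data != nullptr && (samples != nullptr || count == 0));
    float sum0 = 0.f, sum1 = 0.f, sum2 = 0.f, sum3 = 0.f;
    for (size_t i = 0; i < count; i++)
    {
        const BilinearSample& s = samples[i];
        sum0 += data[s.pos[0]] * s.w[0];
        sum1 += data[s.pos[1]] * s.w[1];
        sum2 += data[s.pos[2]] * s.w[2];
        sum3 += data[s.pos[3]] * s.w[3];
    }
    return (sum0 + sum1) + (sum2 + sum3);
}
```

The two functions add the same products in a different order, so their results can differ in the last bits for general inputs.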

For ROIAlign version 0, this also hoists repeated per-cell metadata out of the channel loop. The bounds, coordinate sets, bin grid, and pixel area are computed once per ROI instead of once per channel.
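The hoisting can be sketched roughly as follows. This is a simplified illustration, not the PR's code: `Sample`, `make_sample`, and `average_all_channels` are made-up names, the feature map is assumed to be a plain channel-major float array, and the whole sample table is averaged into one value per channel instead of one value per bin.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical per-sample metadata: depends only on ROI geometry, not on the channel.
struct Sample
{
    int pos[4]; // offsets of the four corners inside one channel plane
    float w[4]; // bilinear weights of the four corners
};

// Build the metadata for one sampling point, clamping it into the feature map.
Sample make_sample(float y, float x, int width, int height)
{
    assert(width > 0 && height > 0);
    y = std::min(std::max(y, 0.f), (float)(height - 1));
    x = std::min(std::max(x, 0.f), (float)(width - 1));
    const int y0 = (int)y;
    const int x0 = (int)x;
    const int y1 = std::min(y0 + 1, height - 1);
    const int x1 = std::min(x0 + 1, width - 1);
    const float ly = y - y0, lx = x - x0;
    const float hy = 1.f - ly, hx = 1.f - lx;
    Sample s;
    s.pos[0] = y0 * width + x0; s.w[0] = hy * hx;
    s.pos[1] = y0 * width + x1; s.w[1] = hy * lx;
    s.pos[2] = y1 * width + x0; s.w[2] = ly * hx;
    s.pos[3] = y1 * width + x1; s.w[3] = ly * lx;
    return s;
}

// The sample table is built once by the caller; the channel loop only reuses it.
void average_all_channels(const float* feat, int channels, int width, int height,
                          const std::vector<Sample>& samples, float* out)
{
    assert(feat != nullptr && out != nullptr && !samples.empty());
    for (int c = 0; c < channels; c++)
    {
        const float* ptr = feat + (size_t)c * width * height;
        float sum = 0.f;
        for (const Sample& s : samples)
        {
            for (int k = 0; k < 4; k++)
                sum += ptr[s.pos[k]] * s.w[k];
        }
        out[c] = sum / (float)samples.size();
    }
}
```

The point of the split is that `make_sample` runs once per sampling point of the ROI, while the channel loop touches only precomputed offsets and weights.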

Performance

Benchmark setup:

  • Apple M-series, arm64
  • ncnn Release build
  • Single thread, num_threads=1
  • fp32 pack4 path
  • Baseline: 6e359b6f
  • Optimized: 66efebb5
  • Each result is the minimum over 7 runs of 1000 iterations each
  • Time is measured per ROI, lower is better
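
For reference, a timing loop of this shape could look like the sketch below. It assumes "7 x 1000" means 7 runs of 1000 iterations with the fastest per-iteration average reported; `min_avg_ms` is a hypothetical helper, not the harness used for the numbers in this PR.

```cpp
#include <cassert>
#include <chrono>
#include <limits>

// Run fn() iters times per run, repeat for the given number of runs,
// and return the smallest per-iteration average in milliseconds.
template<typename F>
double min_avg_ms(F&& fn, int runs, int iters)
{
    assert(runs > 0 && iters > 0);
    double best = std::numeric_limits<double>::max();
    for (int r = 0; r < runs; r++)
    {
        const auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; i++)
            fn();
        const auto t1 = std::chrono::steady_clock::now();
        const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count() / iters;
        if (ms < best)
            best = ms;
    }
    return best;
}
```

Taking the minimum rather than the mean filters out runs disturbed by scheduling or frequency changes, which matters for per-ROI times in the tens of microseconds.
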

| Input | Pooled | sampling_ratio | version | Original | Optimized | Speedup |
|---|---|---|---|---|---|---|
| 100x100x256 | 7x7 | 2 | 0 | 0.0152 ms | 0.0123 ms | 1.24x |
| 100x100x256 | 7x7 | 2 | 1 | 0.0120 ms | 0.0119 ms | 1.01x |
| 64x64x256 | 7x7 | 2 | 0 | 0.0146 ms | 0.0112 ms | 1.30x |
| 64x64x256 | 7x7 | 2 | 1 | 0.0108 ms | 0.0105 ms | 1.03x |
| 200x200x256 | 14x14 | 2 | 0 | 0.0616 ms | 0.0512 ms | 1.20x |
| 200x200x256 | 14x14 | 2 | 1 | 0.0490 ms | 0.0485 ms | 1.01x |
| 100x100x512 | 7x7 | 0 | 0 | 0.1260 ms | 0.0735 ms | 1.71x |
| 100x100x512 | 7x7 | 0 | 1 | 0.1201 ms | 0.0729 ms | 1.65x |
| 50x50x256 | 7x7 | 4 | 0 | 0.0617 ms | 0.0356 ms | 1.73x |
| 50x50x256 | 7x7 | 4 | 1 | 0.0598 ms | 0.0354 ms | 1.69x |

The common sampling_ratio=2 cases improve by about 1.2x to 1.3x for version 0, mainly from avoiding repeated metadata work inside the channel loop. Version 1 is mostly neutral there (1.01x to 1.03x) because it does not have the same redundant per-channel setup, and with only 2x2 samples per bin the accumulation dependency chain is short.

When the sample count is higher, either from sampling_ratio=0 on larger feature maps or explicit sampling_ratio=4, the independent accumulators matter more and both versions improve by about 1.65x to 1.73x.
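As a rough guide to how the sample count grows, the sketch below assumes the usual ROIAlign convention: a positive sampling_ratio fixes the per-bin grid, and sampling_ratio=0 derives it from the ROI size as ceil(roi_size / pooled_size). The helper names `bin_grid` and `samples_per_bin` are illustrative and not taken from the ncnn source.

```cpp
#include <cassert>
#include <cmath>

// Sampling points along one axis of a bin.
// Assumed rule: explicit ratio if positive, otherwise ceil(roi_size / pooled_size).
int bin_grid(int sampling_ratio, float roi_size, int pooled_size)
{
    assert(pooled_size > 0);
    return sampling_ratio > 0 ? sampling_ratio : (int)std::ceil(roi_size / pooled_size);
}

// Total bilinear samples accumulated for one output bin.
int samples_per_bin(int sampling_ratio, float roi_w, float roi_h, int pooled_w, int pooled_h)
{
    return bin_grid(sampling_ratio, roi_h, pooled_h) * bin_grid(sampling_ratio, roi_w, pooled_w);
}
```

Under this rule sampling_ratio=2 gives 4 samples per bin and sampling_ratio=4 gives 16, while sampling_ratio=0 scales with the ROI, so a large ROI on a 7x7 output produces a much longer accumulation loop per bin.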

No tested configuration regressed.

Correctness

tests/test_roialign.cpp passes, including added pack4 coverage for sampling_ratio=0.

The scalar path is unchanged. The pack4 result matches the original implementation within normal floating-point reassociation differences; the maximum absolute difference observed across 60,544 sampled outputs was 2.4e-7.
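
A comparison of this kind can be sketched as below. `max_abs_diff` is a hypothetical helper shown only to make the metric concrete; it is not the code in tests/test_roialign.cpp.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Largest element-wise absolute difference between two output buffers.
float max_abs_diff(const float* a, const float* b, size_t n)
{
    assert((a != nullptr && b != nullptr) || n == 0);
    float worst = 0.f;
    for (size_t i = 0; i < n; i++)
    {
        const float d = std::fabs(a[i] - b[i]);
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

Because the optimized path sums the same products in a different order, a small nonzero value from this metric is expected rather than a sign of a bug.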
