arm: add ROIAlign NEON implementation by devin-lai · Pull Request #6781 · Tencent/ncnn

devin-lai · 2026-06-17T03:45:57Z

Summary

This PR adds an ARM NEON pack4 implementation for ROIAlign.

The main change is in the bilinear accumulation helper. The previous helper accumulated all four interpolation corners into one _sum, which made the inner loop latency-bound as the sample count grew. The NEON path now keeps the four corner accumulators independent and reduces them at the end.
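The difference between the two accumulation strategies can be sketched in portable C++. This is an illustration only, not the code in this PR: `Pack4` stands in for a NEON `float32x4_t`, and `Sample`, `accumulate_single`, and `accumulate_split` are hypothetical names. In the single-accumulator form every multiply-add waits on the previous one; in the split form the four chains are independent and are only combined once at the end.

```cpp
#include <cstddef>

// Hypothetical stand-in for one pack4 element (four channels side by side).
struct Pack4
{
    float v[4];
};

// acc += x * w, lane by lane (what a vector multiply-add would do).
static inline Pack4 madd4(Pack4 acc, const Pack4& x, float w)
{
    for (int i = 0; i < 4; i++) acc.v[i] += x.v[i] * w;
    return acc;
}

// Hypothetical per-sample record: four corner offsets and their bilinear weights.
struct Sample
{
    int i00, i01, i10, i11;
    float w00, w01, w10, w11;
};

// One accumulator: each multiply-add depends on the result of the previous one.
Pack4 accumulate_single(const Pack4* data, const Sample* s, std::size_t n)
{
    Pack4 sum = {};
    for (std::size_t k = 0; k < n; k++)
    {
        sum = madd4(sum, data[s[k].i00], s[k].w00);
        sum = madd4(sum, data[s[k].i01], s[k].w01);
        sum = madd4(sum, data[s[k].i10], s[k].w10);
        sum = madd4(sum, data[s[k].i11], s[k].w11);
    }
    return sum;
}

// Four accumulators: the four chains do not depend on each other inside the loop.
Pack4 accumulate_split(const Pack4* data, const Sample* s, std::size_t n)
{
    Pack4 s00 = {}, s01 = {}, s10 = {}, s11 = {};
    for (std::size_t k = 0; k < n; k++)
    {
        s00 = madd4(s00, data[s[k].i00], s[k].w00);
        s01 = madd4(s01, data[s[k].i01], s[k].w01);
        s10 = madd4(s10, data[s[k].i10], s[k].w10);
        s11 = madd4(s11, data[s[k].i11], s[k].w11);
    }
    // Reduce the four partial sums once, after the loop.
    Pack4 sum;
    for (int i = 0; i < 4; i++)
        sum.v[i] = (s00.v[i] + s01.v[i]) + (s10.v[i] + s11.v[i]);
    return sum;
}
```

The two functions add the same terms in a different order, so their results can differ in the last bits of a float.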

For ROIAlign version 0, this also hoists repeated per-cell metadata out of the channel loop. The bounds, coordinate sets, bin grid, and pixel area are computed once per ROI instead of once per channel.
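The hoisting can be pictured as building a table of bilinear taps once per ROI and letting the channel loop only consume it. The sketch below is a simplified scalar illustration under assumptions: `BilinearTap`, `make_tap`, `build_taps`, and `roi_align_channels` are hypothetical names, the border handling is a plain clamp rather than the layer's exact rules, and the grid is square.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical per-sample record: four corner offsets within one channel plane
// and the matching bilinear weights.
struct BilinearTap
{
    int idx[4];
    float w[4];
};

// Build the taps for one sample point (x, y) on a w x h plane.
// Assumption: coordinates are simply clamped to the plane.
BilinearTap make_tap(float x, float y, int w, int h)
{
    x = std::min(std::max(x, 0.f), (float)(w - 1));
    y = std::min(std::max(y, 0.f), (float)(h - 1));
    int x0 = (int)x, y0 = (int)y;
    int x1 = std::min(x0 + 1, w - 1), y1 = std::min(y0 + 1, h - 1);
    float fx = x - x0, fy = y - y0;
    BilinearTap t;
    t.idx[0] = y0 * w + x0; t.w[0] = (1.f - fx) * (1.f - fy);
    t.idx[1] = y0 * w + x1; t.w[1] = fx * (1.f - fy);
    t.idx[2] = y1 * w + x0; t.w[2] = (1.f - fx) * fy;
    t.idx[3] = y1 * w + x1; t.w[3] = fx * fy;
    return t;
}

// Computed once per ROI: taps for every sample of every output bin.
std::vector<BilinearTap> build_taps(float x_start, float y_start, float bin_w, float bin_h,
                                    int pooled_w, int pooled_h, int grid, int w, int h)
{
    std::vector<BilinearTap> taps;
    taps.reserve((std::size_t)pooled_w * pooled_h * grid * grid);
    for (int ph = 0; ph < pooled_h; ph++)
        for (int pw = 0; pw < pooled_w; pw++)
            for (int iy = 0; iy < grid; iy++)
                for (int ix = 0; ix < grid; ix++)
                {
                    float y = y_start + (ph + (iy + 0.5f) / grid) * bin_h;
                    float x = x_start + (pw + (ix + 0.5f) / grid) * bin_w;
                    taps.push_back(make_tap(x, y, w, h));
                }
    return taps;
}

// The channel loop reuses the table; no coordinate math is repeated per channel.
void roi_align_channels(const float* feat, int w, int h, int channels,
                        const std::vector<BilinearTap>& taps,
                        int bins, int samples_per_bin, float* out)
{
    const float inv_area = 1.f / samples_per_bin;
    for (int c = 0; c < channels; c++)
    {
        const float* plane = feat + (std::size_t)c * w * h;
        for (int b = 0; b < bins; b++)
        {
            float sum = 0.f;
            for (int s = 0; s < samples_per_bin; s++)
            {
                const BilinearTap& t = taps[(std::size_t)b * samples_per_bin + s];
                for (int k = 0; k < 4; k++) sum += plane[t.idx[k]] * t.w[k];
            }
            out[c * bins + b] = sum * inv_area;
        }
    }
}
```

With this layout the cost of `build_taps` is paid once per ROI, while the per-channel work shrinks to table lookups and multiply-adds.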

Performance

Benchmark setup:

Apple M-series, arm64
ncnn Release build
Single thread, num_threads=1
fp32 pack4 path
Baseline: 6e359b6f
Optimized: 66efebb5
Each result is the minimum of 7 x 1000 iterations
Time is measured per ROI; lower is better
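A minimum-of-repeats timing loop along these lines could look like the sketch below. It is not the harness used for the numbers in this PR: `min_time_ms_per_call` is a hypothetical helper, and it assumes that "7 x 1000" means 7 repeats of 1000 calls each, with the fastest repeat reported.

```cpp
#include <chrono>
#include <limits>

// Hypothetical helper: run `fn` iters times per repeat and return the
// smallest average time per call, in milliseconds, across all repeats.
template<typename F>
double min_time_ms_per_call(F&& fn, int repeats = 7, int iters = 1000)
{
    using clock = std::chrono::steady_clock;
    double best = std::numeric_limits<double>::max();
    for (int r = 0; r < repeats; r++)
    {
        const clock::time_point t0 = clock::now();
        for (int i = 0; i < iters; i++) fn();
        const clock::time_point t1 = clock::now();
        const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count() / iters;
        if (ms < best) best = ms;
    }
    return best;
}
```

Taking the minimum rather than the mean reduces the influence of scheduler noise and frequency changes on short kernels.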

Input	pooled	sampling_ratio	version	original	optimized	speedup
100x100x256	7x7	2	0	0.0152 ms	0.0123 ms	1.24x
100x100x256	7x7	2	1	0.0120 ms	0.0119 ms	1.01x
64x64x256	7x7	2	0	0.0146 ms	0.0112 ms	1.30x
64x64x256	7x7	2	1	0.0108 ms	0.0105 ms	1.03x
200x200x256	14x14	2	0	0.0616 ms	0.0512 ms	1.20x
200x200x256	14x14	2	1	0.0490 ms	0.0485 ms	1.01x
100x100x512	7x7	0	0	0.1260 ms	0.0735 ms	1.71x
100x100x512	7x7	0	1	0.1201 ms	0.0729 ms	1.65x
50x50x256	7x7	4	0	0.0617 ms	0.0356 ms	1.73x
50x50x256	7x7	4	1	0.0598 ms	0.0354 ms	1.69x

The common sampling_ratio=2 cases improve by about 1.2x to 1.3x for version 0, mainly from avoiding repeated metadata work inside the channel loop. Version 1 is mostly neutral there (1.01x to 1.03x) because it does not have the same redundant per-channel setup and, with only a few samples per bin, the dependency chain is short.

When the sample count is higher, either from sampling_ratio=0 on larger feature maps or explicit sampling_ratio=4, the independent accumulators matter more and both versions improve by about 1.65x to 1.73x.
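To see why the sample count grows, the sketch below uses the usual ROIAlign convention in which a non-positive sampling_ratio selects an adaptive grid of ceil(roi_size / pooled_size) points per axis. That rule is assumed here rather than quoted from this PR, and `bin_grid` and `samples_per_bin` are hypothetical helper names. The ROI sizes used in the benchmark are not stated, so the ROI spanning a full 100x100 map in the usage note is only an example.

```cpp
#include <cmath>

// Sampling points per axis inside one output bin.
// Assumption: sampling_ratio <= 0 selects the adaptive grid.
int bin_grid(int sampling_ratio, float roi_extent, int pooled)
{
    return sampling_ratio > 0 ? sampling_ratio
                              : (int)std::ceil(roi_extent / pooled);
}

// Bilinear samples accumulated for one output bin.
int samples_per_bin(int sampling_ratio, float roi_w, float roi_h,
                    int pooled_w, int pooled_h)
{
    return bin_grid(sampling_ratio, roi_w, pooled_w)
           * bin_grid(sampling_ratio, roi_h, pooled_h);
}
```

Under these assumptions, a 100x100 ROI pooled to 7x7 yields 4 samples per bin with sampling_ratio=2, 16 with sampling_ratio=4, and 15 x 15 = 225 with sampling_ratio=0, which is where a serial accumulation chain becomes long enough to dominate.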

No tested configuration regressed.

Correctness

tests/test_roialign.cpp passes, including added pack4 coverage for sampling_ratio=0.

The scalar path is unchanged. The pack4 result matches the original implementation within normal floating-point reassociation differences; the maximum absolute difference observed across 60,544 sampled outputs was 2.4e-7.
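A comparison of this kind can be expressed as a maximum absolute difference over two output buffers. The helper below is a minimal sketch with a hypothetical name, `max_abs_diff`; it is not taken from tests/test_roialign.cpp.

```cpp
#include <cmath>
#include <cstddef>

// Largest |a[i] - b[i]| over n outputs; returns 0 when n is 0.
float max_abs_diff(const float* a, const float* b, std::size_t n)
{
    float worst = 0.f;
    for (std::size_t i = 0; i < n; i++)
    {
        const float d = std::fabs(a[i] - b[i]);
        if (d > worst) worst = d;
    }
    return worst;
}
```

Because the independent accumulators add the same terms in a different order, a small nonzero result is expected; bit-exact equality is not.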

arm: add ROIAlign NEON implementation

66efebb

github-actions bot added the test and arm labels on Jun 17, 2026

devin-lai wants to merge 1 commit into Tencent:master from devin-lai:arm-roialign-neon
