Zhenming-Lin
Contributor

No description provided.

@Zhenming-Lin
Contributor Author

Some u80 and u128 integer operations are not yet supported by the self-hosted aarch64 backend.

@andrewrk
Member

andrewrk commented Oct 2, 2025

Can you share a short summary of what improvements you made, and how you made them?

@Zhenming-Lin
Contributor Author

Zhenming-Lin commented Oct 2, 2025

Both impls (previous and current) are based on the libc; the current one follows a newer version. The previous version computed the result bit by bit and was slow, while the current version uses lookup tables combined with Goldschmidt iterations, which significantly improves speed.

  • The Zig impls of sqrtf, sqrt and sqrtq are almost identical to the C impls.
  • The Zig impl of __sqrth is based on the impl of sqrt, with the number of iterations reduced to match the precision requirements of f16, which improves speed.
  • The Zig impl of __sqrtx is based on the impl of sqrtq, with the number of fractional bits in the last iteration changed to 80 to match the precision requirements of f80, which improves speed.
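
For readers unfamiliar with the technique, the "lookup table plus Goldschmidt iterations" idea can be sketched roughly as below. This is an illustrative sketch in C, not the code from this PR or from the libc: the function name `sqrt_goldschmidt`, the 12-entry seed table, the iteration count, and the use of plain `double` arithmetic are all assumptions made for readability.

```c
#include <assert.h>
#include <math.h>   /* only for NAN, isnan, isinf */

/* Hypothetical seed table: approximate 1/sqrt(mid) for the midpoints of
 * twelve intervals of width 0.25 covering [1, 4). Three digits of accuracy
 * are enough, since the iterations below refine the estimate. */
static const double rsqrt_seed[12] = {
    0.94281, 0.85280, 0.78446, 0.73030, 0.68599, 0.64889,
    0.61721, 0.58977, 0.56569, 0.54433, 0.52523, 0.50800,
};

double sqrt_goldschmidt(double x) {
    if (x < 0.0) return NAN;
    if (x == 0.0 || isnan(x) || isinf(x)) return x;

    /* Range reduction: write x = m * scale^2 with m in [1, 4).
     * Scaling by powers of two is exact in binary floating point. */
    double m = x, scale = 1.0;
    while (m >= 4.0) { m *= 0.25; scale *= 2.0; }
    while (m < 1.0)  { m *= 4.0;  scale *= 0.5; }

    /* Table lookup gives the initial estimate r ~ 1/sqrt(m). */
    int idx = (int)((m - 1.0) * 4.0);
    assert(idx >= 0 && idx < 12);
    double r = rsqrt_seed[idx];
    double s = m * r;               /* s ~ sqrt(m) */

    /* Goldschmidt iterations: s*r approaches 1 while s approaches sqrt(m)
     * and r approaches 1/sqrt(m). The error roughly squares each step, so
     * four steps from a ~6% seed error reach double precision. */
    for (int i = 0; i < 4; i++) {
        double u = 0.5 * (3.0 - s * r);
        r *= u;
        s *= u;
    }
    return s * scale;
}
```

Because each step only multiplies, and the number of correct bits roughly doubles per step, the iteration count can be tuned to the target precision. That is the kind of adjustment the bullets above describe for f16 and f80.
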
Benchmark 1 (7605 runs): prev sqrtf
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           579us ±  140us     505us … 7.40ms        645 ( 8%)        0%
Benchmark 2 (10000 runs): sqrtf
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           102us ± 64.2us    77.3us … 4.44ms        879 ( 9%)        ⚡- 82.3% ±  0.5%
Benchmark 1 (2040 runs): prev sqrt
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          2.35ms ±  318us    2.05ms … 4.88ms        124 ( 6%)        0%
Benchmark 2 (10000 runs): sqrt
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           143us ± 53.7us     117us … 1.34ms        839 ( 8%)        ⚡- 93.9% ±  0.3%
Benchmark 1 (10000 runs): __sqrth
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           101us ± 49.6us    79.5us … 2.04ms        451 ( 5%)        0%
Benchmark 2 (10000 runs): sqrtf
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          94.8us ± 55.1us    76.6us … 2.41ms        955 (10%)        ⚡-  5.8% ±  1.4%
Benchmark 3 (10000 runs): sqrt
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           145us ± 62.7us     117us … 2.34ms        882 ( 9%)        💩+ 43.8% ±  1.6%
Benchmark 4 (9413 runs): __sqrtx
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           458us ±  142us     359us … 5.09ms        575 ( 6%)        💩+355.2% ±  2.9%
Benchmark 5 (6763 runs): sqrtq
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           664us ±  136us     568us … 3.70ms        476 ( 7%)        💩+560.0% ±  2.9%
