Skip to content

Conversation

seiko2plus
Copy link
Member

@seiko2plus seiko2plus commented Jun 4, 2025

  • Add sin/cos functions with ≤1 ULP and ≤4 ULP accuracy variants
  • Support large argument ranges with proper range reduction
  • Standalone implementation doesn't require libm
  • Written using Google Highway
  • implemented for both float32 and float64 data types

@seiko2plus seiko2plus force-pushed the impl-sincos branch 2 times, most recently from a3356f1 to 18d9307 Compare June 4, 2025 18:34
@seiko2plus seiko2plus changed the title algo: implement sine/cosine using Intel SVML for single/double precision algo: implement sine/cosine based on Intel SVML for single/double precision Jun 4, 2025
@r-devulap r-devulap self-assigned this Jun 6, 2025
@r-devulap r-devulap requested a review from Copilot June 6, 2025 15:49
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements highly accurate sine and cosine functions using Intel SVML techniques and extended precision range reduction, with separate implementations for small and large argument cases. Key changes include:

  • New implementations of sin/cos functions that guarantee ≤1 ULP (single precision) and ≤4 ULP (double precision) accuracy.
  • Support for large argument range reduction using multi-precision techniques based on a precomputed lookup table.
  • Integration with the Google Highway library for platform-specific SIMD optimizations.

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
npsr/utils-inl.h Adds utility functions with conditional compilation and lookup support.
npsr/trig/small-inl.h Provides polynomial approximations for small-angle sin/cos calculations.
npsr/trig/lut-inl.h.py Implements a Python script to generate lookup tables for trigonometric reductions and approximations.
npsr/trig/large-inl.h Implements the extended precision algorithm for large argument reduction.
npsr/trig/inl.h Integrates the small and large argument implementations into a unified sin/cos API.
npsr/npsr.h Wraps the trigonometric headers ensuring single inclusion per target.
npsr/common.h Provides common macros, type definitions, and an RAII class for floating‑point environment control.
Comments suppressed due to low confidence (2)

npsr/trig/small-inl.h:338

  • Cosine-specific adjustments are marked as TODO and not yet implemented. Please complete the phase adjustment logic (i.e. correctly apply the π/2 shift) to ensure cosine accuracy matches the sine pathway.
    // TODO: Implement cosine-specific adjustments

npsr/trig/large-inl.h:147

  • [nitpick] Consider adding more inline comments to explain the rationale behind the magic numbers used for fractional shifts to improve code clarity and maintainability.
constexpr int kFracMidShift = kIsSingle ? 18 : 24;

npsr/common.h Outdated
*
* Precise precise = {kLowAccuracy, kNoSpecialCases, kNoLargeArgument};
* const ScalableTag<float> d;
* typename V = Vec<DFromV<SclableTag>>;
Copy link

Copilot AI Jun 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a typo in the documentation comment: 'SclableTag' should be corrected to 'ScalableTag'.

Suggested change
* typename V = Vec<DFromV<SclableTag>>;
* typename V = Vec<DFromV<ScalableTag>>;

Copilot uses AI. Check for mistakes.

@Mousius Mousius mentioned this pull request Jun 9, 2025
Copy link
Member

@Mousius Mousius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I think this is a good starting point. Do you want to land this with #if 0 and other various things incomplete @seiko2plus? We can be a bit relaxed as we setup the repo.

npsr/utils-inl.h Outdated
#ifdef NPSR_UTILS_INL_H_
#undef NPSR_UTILS_INL_H_
#else
#define NPSR_UTILS_INL_H_
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have a lut-inl.h, so we can haev LutX2, LutX4 etc?

A generic utils-inl.h is going to get pretty big I think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, mainly to optimize 128-bit simd extensions, transpose and shuffles to reduce cache/memory access

npsr/common.h Outdated
struct _Zero {};
static constexpr auto kForce = _Force{};
static constexpr auto kNearest = _Nearest{};
#if 0 // not used yet
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add these when we come across use cases for them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, this good point .. I'm gonna remove them

npsr/trig/inl.h Outdated

namespace npsr::HWY_NAMESPACE::sincos {
template <bool IS_COS, typename V, typename Prec>
HWY_API V SinCos(Prec &prec, V x) {
Copy link
Member

@Mousius Mousius Jun 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially worth creating a new macro for HWY_API so it doesn't change behaviour for Highway?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest a name? HWY_API just combine HWY_ATTR and HWY_INLINE.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NPSR_INLINE which expands to HWY_ATTR HWY_INLINE ?

I feel like a lot of things will eventually be using this, so we don't want it to break if Highway's API changes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correction: HWY_API does not include HWY_ATTR, since all definitions are placed between the BEFORE/AFTER macros, which emit target attributes through #pragma.

NPSR_INLINE might cause confusion with HWY_INLINE. I’ve updated the PR to use NPSR_INTRIN instead — feel free to rename it later if you prefer. Please see:

// This macro is used to define intrinsics that are:
// NOTE: equals to HWY_API.
// - always inlined
// - flattened (no separate stack frame)
// - marked maybe unused to suppress warnings when they are not used
// NOTE: we do not need to use HWY_ATTR because we wrap Highway intrinsics in
// HWY_BEFORE_NAMESPACE()/HWY_AFTER_NAMESPACE()
// which implies the nessessary target attributes via #pargma.
#define NPSR_INTRIN static HWY_INLINE HWY_FLATTEN HWY_MAYBE_UNUSED

Also, please check the latest push for the full set of changes.

npsr/trig/inl.h Outdated
#include "npsr/sincos/small-inl.h"

// clang-format off
#if defined(NPSR_TRIG_INL_H_) == defined(HWY_TARGET_TOGGLE) // NOLINT
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be trig-inl.h?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed

@seiko2plus
Copy link
Member Author

In general I think this is a good starting point. Do you want to land this with #if 0 and other various things incomplete @seiko2plus? We can be a bit relaxed as we setup the repo.

I was putting all my time into the testing unit, also improve luts generations.. so I left any extra optimizations off until I make sure that both functions sin/cos are well implemented.

@seiko2plus
Copy link
Member Author

All stress tests passed localy on x86, just need a CI and nice cleanup and then it will be ready for review ... maybe I should split it up into multiple prs

@Mousius
Copy link
Member

Mousius commented Jun 11, 2025

All stress tests passed localy on x86, just need a CI and nice cleanup and then it will be ready for review ... maybe I should split it up into multiple prs

Yep, this is unreviewable as-is, so incremental PRs would be much appreciated 😸

@r-devulap
Copy link
Member

maybe I should split it up into multiple prs

Yes please, I am having a hard time reviewing this.

@seiko2plus
Copy link
Member Author

Yep, this is unreviewable as-is, so incremental PRs would be much appreciated
Yes please, I am having a hard time reviewing this.

Working on it.

- Add sin/cos functions with ≤1 ULP and ≤4 ULP accuracy variants
- Support large argument ranges with proper range reduction
- Standalone implementation doesn't require libm
- Written using Google Highway for SIMD optimization
- Optimize for both float32 and float64 data types
- Add Python-based generator tool for creating C++ headers and Python templates
- Include sollya wrapper for mathematical computations
- Add utilities for formatting C/Python arrays and header generation
- Include comprehensive .gitignore for Python/C++ development
- Configure spin commands for building, testing, and development
- Add custom generate command for sollya-based file generation
- Set up IPython integration with pre-imported numpy_sr module
- Configure project metadata and build requirements
- Implement Python bindings for Highway SIMD intrinsics
- Add type system for scalar/vector data with proper conversions
- Create dynamic dispatch system for multiple SIMD targets
- Include helper functions for ULP distance calculations
- Set up pytest infrastructure for comprehensive testing
Highway provides portable SIMD intrinsics required by the testing framework
Initialize build configuration for npsr module
- Generate test vectors using Sollya for high-precision reference values
- Test challenging cases including large arguments, pi multiples, and edge cases
- Cover both float32 and float64 precision with ULP accuracy validation
- Include tests for standard and low-accuracy modes
Clean up empty PreciseBind function in preparation for proper implementation
Enable trigonometric functions in the Python testing framework
- Correct polynomial coefficients for double precision cosine
- Fix table indexing and data layout for lookup tables
- Improve accuracy for large argument range reduction
- Implement efficient SVML-style linear approximation for sin/cos
- Fix precision issues in Cody-Waite range reduction
- Add proper sign handling for both sine and cosine
- Improve polynomial evaluation with correction terms
- Document mathematical equivalence between traditional and optimized approaches
- Move Precise class and related types to npsr/precise.h
- Add deduction guides for cleaner template instantiation
- Improve modularity by separating precision control from common utilities
…ures

- Change parameter order to Precise<Args...>, V for consistency
- Update include paths from sincos/ to trig/
- Fix constant types and naming conventions (kException -> kExceptions)
Add meson configuration to include trigonometric test data in installation
Define Python dependencies for building and testing:
- meson/meson-python for build system
- ninja for compilation
- spin for development workflow
- pytest for test execution
This patch removes the Python dependency by rewriting the generator using pure Sollya scripting.
It also applies general code cleanups and adds support for non-FMA targets.
- Define `arch_pkgs` mapping for native, aarch64, s390x, ppc64le, and armhf
- Allow selection via `NPSR_ARCH` env var (defaults to native)
- Configure per-arch virtualenv under `.direnv/py/<arch>`
- Bootstrap Python venv with `spin==0.13`
- Completely rewrote Sollya integration with a new Python CLI tool (tools/sollya/cli.py)
- Replaced the old generate.py script with a more robust implementation
- Added comprehensive Sollya core library (tools/sollya/core.sol) with utility functions
- Created new SIMD-optimized lookup table implementation (npsr/lut-inl.h)
- Significantly enhanced the Precise class for floating-point precision control
- Refactored trigonometric data generation scripts with better documentation
- Updated constants generation with improved Cody-Waite and Payne-Hanek reduction
- Added kpi16-inl.h.sol for high-precision π/16 table generation
- Removed deprecated utils-inl.h and old trigonometric data files
- Improved project configuration in pyproject.toml and .spin/cmds.py
- Enhanced code documentation throughout the codebase
@seiko2plus
Copy link
Member Author

Now this PR contains the implementation without data helpers (neither the Sollya scripts nor the generated headers). They can be found in PR #4.

…on dispatch

- Replace IS_COS boolean template parameters with Operation enum class (kSin/kCos)
- Introduce NPSR_INTRIN macro for consistent intrinsic function attributes
- Update trigonometric function implementations to use new enum-based dispatch
- Improve code readability and maintainability by eliminating boolean parameter confusion
- Update function signatures and documentation to reflect new API design
- Replace HWY_ATTR with HWY_INLINE for better consistency in LUT implementation
Copy link
Member

@Mousius Mousius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, let's get things moving and land this and refactor further afterwards 😸

Thanks for your patience @seiko2plus!

@Mousius Mousius merged commit 78aa3f5 into numpy:main Oct 7, 2025
1 check failed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants