Domestic Accelerator Backend PoC #196

Description

@Dayuxiaoshui

This page scopes a community discussion for domestic accelerator support in
RL-Kernel. It tracks the P3 roadmap item for domestic accelerator and research
expansion, and should be used with:

The goal is not to promise first-class support for every device immediately. The
goal is to define a reproducible path for adding new vendor platforms after the
main CUDA, ROCm, and PyTorch fallback paths are correct, integrated, and measurable.

Target Scope

The first scoping pass should cover these vendor families:

| Vendor family | Example platform | Proposed platform key | Initial status |
| --- | --- | --- | --- |
| MetaX | C500 / MACA software stack | maca | Local PoC data available |
| Huawei Ascend | CANN / NPU stack | ascend | Community scoping needed |
| Cambricon | MLU stack | cambricon | Community scoping needed |
| Other CUDA-compatible stacks | Vendor-specific CUDA bridge | vendor key required | Community scoping needed |

Each platform should begin with a narrow operator target, a safe fallback story,
and explicit validation hardware. Avoid broad "support this chip" issues that do
not identify the runtime, compiler, operator, and correctness baseline.

Recommended Platform Model

RL-Kernel should distinguish logical vendor platforms from PyTorch device strings.
Several non-NVIDIA runtimes expose tensors through torch.device("cuda"), so
torch.cuda.is_available() is not enough to infer NVIDIA CUDA semantics.

Recommended structure:

  • Add a vendor platform detector under rl_engine/platforms/.
  • Keep the existing CUDA and ROCm paths intact.
  • Add explicit platform keys such as maca, ascend, and cambricon.
  • Add backend priorities per platform in KernelRegistry.
  • Add vendor-specific operator wrappers only after a fallback path is validated.

For MetaX/MACA, a future code layout could include:

rl_engine/platforms/maca.py
rl_engine/kernels/ops/maca/
tests/platforms/test_maca_detection.py

For Ascend and Cambricon, use the same pattern once maintainers or community
contributors can provide the runtime facts and validation hardware.
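
A minimal sketch of what such a detector could look like, written as a pure function over runtime metadata so it can be unit-tested without hardware. The `RuntimeFacts` type, the `detect_platform` name, and the matching heuristics (a `+metax` suffix in the PyTorch version string, `MetaX` in the device name) are assumptions for illustration, not existing RL-Kernel code. Ascend and Cambricon are left out because no runtime facts for them are available yet.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RuntimeFacts:
    """Metadata a detector might read; every field is a plain string or None."""
    torch_version: str            # e.g. torch.__version__
    cuda_version: Optional[str]   # torch.version.cuda
    hip_version: Optional[str]    # torch.version.hip
    device_name: Optional[str]    # torch.cuda.get_device_name(0), if a device exists


def detect_platform(facts: RuntimeFacts) -> str:
    """Return a logical platform key (hypothetical heuristics, not a vendor API)."""
    name = (facts.device_name or "").lower()
    version = facts.torch_version.lower()

    # Vendor checks run first: a MACA build also reports a CUDA version,
    # so testing cuda_version first would classify it as NVIDIA CUDA.
    if "metax" in name or "+metax" in version:
        return "maca"
    if facts.hip_version is not None:
        return "rocm"
    if facts.cuda_version is not None and facts.device_name is not None:
        return "cuda"
    return "cpu"


# Values taken from a MetaX C500 node running a MACA PyTorch build.
c500 = RuntimeFacts("2.8.0+metax3.5.3.9", "11.6", None, "MetaX C500")
print(detect_platform(c500))  # maca
```

Keeping the detector free of direct `torch` calls means `tests/platforms/test_maca_detection.py` could run on any CI machine, with a thin adapter filling `RuntimeFacts` from the live runtime.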

MetaX C500 / MACA Observations

Local experiments on a four-card MetaX C500 node showed:

torch                  2.8.0+metax3.5.3.9
torch.version.cuda     11.6
torch.version.hip      None
device name            MetaX C500
device count           4
compute capability     (8, 0)
warp_size              64
CUDA_HOME              /opt/maca/tools/cu-bridge
compiler               /opt/maca/tools/cu-bridge/bin/cucc

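As a result, the current runtime dispatch treats the C500 as a CUDA platform: the
MACA PyTorch build reports torch.version.hip as None and exposes devices through
the CUDA API. That does not imply NVIDIA CUDA compatibility for every kernel. The
reported warp_size of 64, for example, differs from the 32-thread warps used by
NVIDIA GPUs, and extensions are built with cucc rather than nvcc.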
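
To make these observations reproducible on other nodes, a small collection helper could look like the sketch below. The function name `collect_platform_facts` and the `torch_module` parameter are hypothetical; the parameter exists only so the helper can be exercised with a stand-in object on machines without PyTorch. `warp_size` is read defensively because not every PyTorch build exposes it on the device properties.

```python
import os


def collect_platform_facts(torch_module=None) -> dict:
    """Gather runtime fields for a scoping report; returns a plain dict."""
    if torch_module is None:
        # Imported lazily so the helper itself does not require PyTorch to load.
        import torch as torch_module

    facts = {
        "torch": torch_module.__version__,
        "torch.version.cuda": torch_module.version.cuda,
        "torch.version.hip": torch_module.version.hip,
        "CUDA_HOME": os.environ.get("CUDA_HOME"),
        "device count": 0,
    }
    if torch_module.cuda.is_available():
        props = torch_module.cuda.get_device_properties(0)
        facts["device count"] = torch_module.cuda.device_count()
        facts["device name"] = torch_module.cuda.get_device_name(0)
        facts["compute capability"] = tuple(
            torch_module.cuda.get_device_capability(0)
        )
        # Assumption: builds without this attribute simply report None.
        facts["warp_size"] = getattr(props, "warp_size", None)
    return facts
```

On a machine with PyTorch installed, calling `collect_platform_facts()` with no arguments and printing the items gives a listing that can be pasted into an issue. The compiler path is not covered here, since how to discover it is vendor-specific.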

Initial operator findings:

| Area | Result | Suggested default |
| --- | --- | --- |
| PyTorch native LogP | Correct | Safe fallback |
| Generic fused_logp_kernel.cu | Compiles and passes small correctness probes | Candidate for MACA fused LogP |
| CUDA prefix-shared attention | Fails to compile due to NVIDIA inline asm | Disable on MACA |
| SM90/TMA kernels | NVIDIA Hopper-specific | Disable on MACA |
| FlashAttention package | Direct flash_attn_func call works on C500 in a small SDPA comparison | Candidate attention backend |
| RL-Kernel CUDA FlashAttention wrapper | Currently requires _C before trying external flash_attn | Wrapper should be relaxed |
| FlashInfer top-k | Works in a small probe | Candidate sampling backend |
| FlashInfer top-p | API requires uniform_samples with shape [max_rounds, batch] and returns (samples, success) | Wrapper update required |
| Triton ratio_kl / grpo_loss | Forward and backward smoke tests passed | Candidate backend with tests |
| Triton linear_logp | Forward produced incorrect values on fp32 and bf16 probes | Disable by default on MACA |

These results are not a support guarantee. They are a starting point for a
reproducible community PoC.
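
A correctness probe of the kind used for the table above could be sketched as follows. This assumes LogP means the log-probability of a selected token under a softmax over logits, which this issue does not spell out. The function names and the tolerance values are illustrative, and the pure-Python reference stands in for the PyTorch native path only so the sketch runs anywhere.

```python
import math
from typing import Sequence


def reference_logp(logits: Sequence[float], token: int) -> float:
    """Log-probability of `token` under softmax(logits), computed stably."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return logits[token] - log_norm


def max_abs_error(candidate: Sequence[float], reference: Sequence[float]) -> float:
    """Largest element-wise absolute difference; a NaN candidate counts as infinite."""
    if len(candidate) != len(reference):
        raise ValueError("candidate and reference must have the same length")
    worst = 0.0
    for got, want in zip(candidate, reference):
        if math.isnan(got):
            return math.inf
        worst = max(worst, abs(got - want))
    return worst


def passes_probe(candidate, reference, atol: float) -> bool:
    """A backend stays disabled unless the probe is within tolerance."""
    return max_abs_error(candidate, reference) <= atol


rows = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
tokens = [0, 2]
reference = [reference_logp(r, t) for r, t in zip(rows, tokens)]
print(passes_probe(reference, reference, atol=1e-6))                     # True
print(passes_probe([x + 0.1 for x in reference], reference, atol=1e-3))  # False
```

Treating NaN as an automatic failure matters on new hardware, where a miscompiled kernel can return NaN rather than a slightly wrong number.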

Fallback Contract

Every new domestic accelerator platform should define fallback behavior before
adding optimized kernels.

Minimum fallback contract:

  • logp: PyTorch native must be available; fused backend is optional.
  • attention: PyTorch SDPA/native attention must be available; vendor flash
    attention is optional.
  • linear_logp: PyTorch native must be available; Triton/vendor fused backend is
    optional and must pass forward accuracy before dispatch is enabled.
  • ratio_kl and grpo_loss: PyTorch native must be available; Triton/vendor
    fused backend must pass forward and backward tests.
  • sampling: PyTorch multinomial/top-k/top-p fallback must be available; vendor
    sampling must handle failure flags or retry semantics explicitly.
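
The sampling item can be illustrated with a wrapper that honors per-row success flags. The sketch below does not call FlashInfer. `vendor_sampler` is a hypothetical callable that returns `(samples, success)` in the shape reported for the top-p API, and `top_p_fallback` is a standard-library stand-in for the PyTorch fallback.

```python
import random
from typing import Callable, List, Sequence, Tuple

Row = Sequence[float]
VendorSampler = Callable[[Sequence[Row]], Tuple[List[int], List[bool]]]


def top_p_fallback(row: Row, top_p: float = 0.9, rng=random) -> int:
    """Nucleus sampling on one probability row using only the standard library."""
    order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
    kept, mass = [], 0.0
    for index in order:
        kept.append(index)
        mass += row[index]
        if mass >= top_p:
            break
    # Sample among the kept tokens, weighted by their original probabilities.
    return rng.choices(kept, weights=[row[i] for i in kept], k=1)[0]


def sample_with_fallback(
    probs: Sequence[Row],
    vendor_sampler: VendorSampler,
    fallback_sampler: Callable[[Row], int] = top_p_fallback,
) -> List[int]:
    """Keep vendor samples where success is True; resample failed rows."""
    samples, success = vendor_sampler(probs)
    if len(samples) != len(probs) or len(success) != len(probs):
        raise ValueError("vendor sampler must return one sample and one flag per row")
    return [
        sample if ok else fallback_sampler(row)
        for row, sample, ok in zip(probs, samples, success)
    ]
```

Resampling only the failed rows is one possible policy. Retrying the vendor call with fresh uniform samples is another, and the choice should be stated in the wrapper's contract either way.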

Optimized kernels should never be selected only because tensors report
device.type == "cuda". They should be selected because the platform detector
and operator-specific capability check both pass.
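
That selection rule could be sketched as a lookup keyed by platform key and operator, with a capability check per candidate. The names below (`select_backend`, `FALLBACK`, and the backend labels) are hypothetical. The real `KernelRegistry` interface is not shown in this issue, so this illustrates only the rule, not the registry's API.

```python
from typing import Callable, Dict, List, Tuple

# (backend name, capability check) pairs in priority order,
# keyed by (platform key, operator). Note: no device.type anywhere.
Candidates = Dict[Tuple[str, str], List[Tuple[str, Callable[[], bool]]]]

FALLBACK = "torch_native"


def select_backend(platform: str, op: str, candidates: Candidates) -> str:
    """Pick the first candidate whose capability check passes, else the fallback."""
    for name, is_capable in candidates.get((platform, op), []):
        try:
            if is_capable():
                return name
        except Exception:
            # A crashing probe means "not capable", never a dispatch error.
            continue
    return FALLBACK


candidates: Candidates = {
    ("maca", "logp"): [("fused_logp", lambda: True)],
    ("maca", "linear_logp"): [("triton_linear_logp", lambda: False)],
}
print(select_backend("maca", "logp", candidates))         # fused_logp
print(select_backend("maca", "linear_logp", candidates))  # torch_native
print(select_backend("cuda", "logp", candidates))         # torch_native
```

Because entries are keyed by platform key, a MACA-only candidate is never offered to an NVIDIA or ROCm run, even though all three may report `cuda` tensors.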

Scoping Template

Use this checklist when opening or updating a domestic accelerator issue.

## Platform

- Vendor:
- Device model:
- Number of cards:
- Driver/runtime version:
- PyTorch version:
- `torch.version.cuda`:
- `torch.version.hip`:
- Device name from `torch.cuda.get_device_name(0)`:
- Device capability:
- Warp/wave size if exposed:
- Compiler path:

## Target Operator

- Operator:
- Current fallback:
- Proposed optimized backend:
- Required dtype:
- Required shapes:
- Expected unsupported shapes:

## Correctness Baseline

- Reference implementation:
- Tolerance:
- Test command:
- Result:

## Build / Runtime Notes

- Extension compiler:
- Required environment variables:
- Known compile blockers:
- Known runtime blockers:

## Dispatch Proposal

- Platform key:
- Registry priority:
- Conditions for enabling optimized backend:
- Conditions for falling back:

## Validation Hardware

- Who can run validation:
- Hardware access constraints:
- Reproducible command log:

Proposed MetaX/MACA First Milestone

The first MACA milestone should be intentionally small:

  1. Detect MetaX/MACA as a separate maca platform instead of generic CUDA.
  2. Keep PyTorch fallback enabled for every operator.
  3. Enable external flash_attn only when its import and smoke test pass.
  4. Update FlashInfer sampling wrapper for the installed API contract.
  5. Disable SM90, prefix-shared attention, and Triton linear_logp by default.
  6. Build only the generic fused LogP extension path for MACA.
  7. Add MACA-specific CI or manual validation commands before claiming support.

This gives the community a correctness-first base while preserving room for
vendor-specific optimized kernels later.
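
As a rough illustration of steps 3 and 5, the defaults and the import-plus-smoke-test gate could be expressed as below. The `MACA_DEFAULTS` table and both function names are assumptions made for this sketch. The backend labels are shorthand for the items in the list above, not identifiers from the codebase.

```python
import importlib.util
from typing import Callable, Dict

# Hypothetical defaults for the "maca" platform, mirroring the milestone list.
# True = may be enabled once its probe passes; False = disabled by default.
MACA_DEFAULTS: Dict[str, bool] = {
    "torch_native": True,
    "fused_logp": True,
    "external_flash_attn": True,
    "flashinfer_sampling": True,
    "sm90_tma": False,
    "prefix_shared_attention": False,
    "triton_linear_logp": False,
}


def optional_backend_ready(module_name: str, smoke_test: Callable[[], bool]) -> bool:
    """True only if the top-level module is importable and its smoke test passes."""
    try:
        if importlib.util.find_spec(module_name) is None:
            return False
    except (ImportError, ValueError):
        return False
    try:
        return bool(smoke_test())
    except Exception:
        # Any smoke-test failure keeps the backend off and the fallback on.
        return False


def backend_enabled(name: str, module_name: str, smoke_test: Callable[[], bool]) -> bool:
    """Default table first, then the import and smoke-test gate."""
    return MACA_DEFAULTS.get(name, False) and optional_backend_ready(
        module_name, smoke_test
    )
```

A call such as `backend_enabled("external_flash_attn", "flash_attn", my_smoke_test)` would then return False on any node where the package is missing or the smoke test fails, leaving the PyTorch fallback in place. `my_smoke_test` is a placeholder for a small SDPA comparison.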

Community Questions

Feedback is especially useful on:

  • Whether maca, ascend, and cambricon should be platform keys or grouped
    under a broader domestic namespace.
  • Which operator should be the first optimized target for each vendor.
  • Which vendor runtimes expose CUDA-compatible APIs and which require separate
    extension toolchains.
  • Whether platform detection should rely on PyTorch metadata, device name,
    environment variables, or a combination of all three.
  • Who can provide reproducible validation hardware for Ascend and Cambricon.
  • What minimum benchmark and correctness matrix should be required before a
    backend is marked active instead of experimental.

Please keep discussion tied to concrete hardware, command output, and operator
results so the PoC can move from platform detection to reliable dispatch.
