# Domestic Accelerator Backend PoC (#196)

This page scopes a community discussion on domestic accelerator support in
RL-Kernel. It tracks the P3 roadmap item for domestic accelerator and research
expansion and should be read alongside:

- [Issue #83](https://github.com/RL-Align/RL-Kernel/issues/83): roadmap tracking.
- [Issue #175](https://github.com/RL-Align/RL-Kernel/issues/175): domestic accelerator
  backend PoC tracking.

The goal is not to promise first-class support for every device immediately. The
goal is to define a reproducible path for adding new vendor platforms once the
main CUDA, ROCm, and PyTorch fallback paths are correct, integrated, and measurable.

## Target Scope

The first scoping pass should cover these vendor families:

| Vendor family | Example platform | Proposed platform key | Initial status |
| --- | --- | --- | --- |
| MetaX | C500 / MACA software stack | `maca` | Local PoC data available |
| Huawei Ascend | CANN / NPU stack | `ascend` | Community scoping needed |
| Cambricon | MLU stack | `cambricon` | Community scoping needed |
| Other CUDA-compatible stacks | Vendor-specific CUDA bridge | vendor key required | Community scoping needed |

Each platform should begin with a narrow operator target, a safe fallback story,
and explicit validation hardware. Avoid broad "support this chip" issues that do
not identify the runtime, compiler, operator, and correctness baseline.

## Recommended Platform Model

RL-Kernel should distinguish logical vendor platforms from PyTorch device strings.
Several non-NVIDIA runtimes expose tensors through `torch.device("cuda")`, so
`torch.cuda.is_available()` is not enough to infer NVIDIA CUDA semantics.

Recommended structure:

- Add a vendor platform detector under `rl_engine/platforms/`.
- Keep the existing CUDA and ROCm paths intact.
- Add explicit platform keys such as `maca`, `ascend`, and `cambricon`.
- Add backend priorities per platform in `KernelRegistry`.
- Add vendor-specific operator wrappers only after a fallback path is validated.
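
As a rough illustration of the first bullet, a detector could map runtime facts to a logical platform key instead of trusting the device string. This is a minimal sketch, not the RL-Kernel implementation: `RuntimeInfo`, `detect_platform`, and the `"metax"` name hint are hypothetical, and real detection may need stronger signals than a device-name substring.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RuntimeInfo:
    """Facts a detector might read from the PyTorch build (hypothetical container)."""

    cuda_available: bool          # torch.cuda.is_available()
    cuda_version: Optional[str]   # torch.version.cuda
    hip_version: Optional[str]    # torch.version.hip
    device_name: str              # torch.cuda.get_device_name(0), or "" without a device


# Assumed device-name hints; only MetaX is listed because it is the only
# vendor with local PoC data on this page.
_VENDOR_NAME_HINTS = {"metax": "maca"}


def detect_platform(info: RuntimeInfo) -> str:
    """Return a logical platform key; never infer it from the device string alone."""
    if not info.cuda_available:
        return "cpu"
    name = info.device_name.lower()
    # Vendor hints are checked first, because a vendor build can look like
    # plain CUDA (cuda_version set, hip_version None).
    for hint, key in _VENDOR_NAME_HINTS.items():
        if hint in name:
            return key
    if info.hip_version is not None:
        return "rocm"
    return "cuda"
```

With the values observed on the C500 node, `detect_platform(RuntimeInfo(True, "11.6", None, "MetaX C500"))` returns `"maca"`, while the same metadata with an unrecognized device name still resolves to `"cuda"`.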

For MetaX/MACA, a future code layout could include:

```text
rl_engine/platforms/maca.py
rl_engine/kernels/ops/maca/
tests/platforms/test_maca_detection.py
```

For Ascend and Cambricon, use the same pattern once maintainers or community
contributors can provide the runtime facts and validation hardware.

## MetaX C500 / MACA Observations

Local experiments on a four-card MetaX C500 node showed:

```text
torch                  2.8.0+metax3.5.3.9
torch.version.cuda     11.6
torch.version.hip      None
device name            MetaX C500
device count           4
compute capability     (8, 0)
warp_size              64
CUDA_HOME              /opt/maca/tools/cu-bridge
compiler               /opt/maca/tools/cu-bridge/bin/cucc
```

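This means the current runtime dispatch treats the C500 as a CUDA platform: the
MACA PyTorch build reports `torch.version.hip` as `None` and exposes devices
through the CUDA API. That does not imply NVIDIA CUDA compatibility for every
kernel. For example, the reported `warp_size` of 64 differs from the 32-thread
warps on NVIDIA GPUs, so kernels that assume 32 lanes need review before they
are enabled.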
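
To make reports like the one above easier to reproduce, a small probe along these lines could collect the same fields. It is a sketch under assumptions: `collect_runtime_facts` is a hypothetical helper, `CUDA_HOME` is read from the environment (the original measurement may have used a different source), and `warp_size` is read with `getattr` because not every PyTorch build exposes it on the device properties.

```python
import os
from typing import Any, Dict


def collect_runtime_facts() -> Dict[str, Any]:
    """Gather the fields from the report above; values stay None when unavailable."""
    facts: Dict[str, Any] = {
        "torch": None,
        "torch.version.cuda": None,
        "torch.version.hip": None,
        "device name": None,
        "device count": 0,
        "compute capability": None,
        "warp_size": None,
        "CUDA_HOME": os.environ.get("CUDA_HOME"),
    }
    try:
        import torch
    except ImportError:
        return facts  # PyTorch is not installed; report only the environment.

    facts["torch"] = torch.__version__
    facts["torch.version.cuda"] = torch.version.cuda
    facts["torch.version.hip"] = getattr(torch.version, "hip", None)
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        facts["device name"] = torch.cuda.get_device_name(0)
        facts["device count"] = torch.cuda.device_count()
        facts["compute capability"] = torch.cuda.get_device_capability(0)
        # Not exposed by every PyTorch build, hence getattr with a default.
        facts["warp_size"] = getattr(props, "warp_size", None)
    return facts


if __name__ == "__main__":
    for key, value in collect_runtime_facts().items():
        print(f"{key:<22} {value}")
```

The compiler path is left out on purpose, since how to locate it is vendor-specific.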

Initial operator findings:

| Area | Result | Suggested default |
| --- | --- | --- |
| PyTorch native LogP | Correct | Safe fallback |
| Generic `fused_logp_kernel.cu` | Compiles and passes small correctness probes | Candidate for MACA fused LogP |
| CUDA prefix-shared attention | Fails to compile due to NVIDIA inline asm | Disable on MACA |
| SM90/TMA kernels | NVIDIA Hopper-specific | Disable on MACA |
| FlashAttention package | Direct `flash_attn_func` call works on C500 in a small SDPA comparison | Candidate attention backend |
| RL-Kernel CUDA FlashAttention wrapper | Currently requires `_C` before trying external `flash_attn` | Wrapper should be relaxed |
| FlashInfer top-k | Works in a small probe | Candidate sampling backend |
| FlashInfer top-p | API requires `uniform_samples` with shape `[max_rounds, batch]` and returns `(samples, success)` | Wrapper update required |
| Triton `ratio_kl` / `grpo_loss` | Forward and backward smoke tests passed | Candidate backend with tests |
| Triton `linear_logp` | Forward produced incorrect values on fp32 and bf16 probes | Disable by default on MACA |

These results are not a support guarantee. They are a starting point for a
reproducible community PoC.
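
One way to keep these findings actionable is to record them as a deny-by-default policy table that dispatch code can consult. The sketch below is illustrative only: the backend names and status strings are hypothetical labels for the rows of the table above, not identifiers from the RL-Kernel codebase.

```python
from typing import Dict

# Status per backend on MACA, mirroring the "Suggested default" column.
# "candidate" means "may be probed", not "enabled".
MACA_FINDINGS: Dict[str, str] = {
    "torch_native_logp": "fallback",
    "fused_logp_generic": "candidate",
    "prefix_shared_attention_cuda": "disabled",
    "sm90_tma_kernels": "disabled",
    "flash_attn_external": "candidate",
    "flashinfer_top_k": "candidate",
    "flashinfer_top_p": "needs_wrapper_update",
    "triton_ratio_kl": "candidate",
    "triton_grpo_loss": "candidate",
    "triton_linear_logp": "disabled",
}


def may_be_probed(backend: str) -> bool:
    """True if a backend may be smoke-tested on MACA; unknown names are denied."""
    return MACA_FINDINGS.get(backend) in ("fallback", "candidate")


def enabled_without_probe(backend: str) -> bool:
    """Only the validated fallback is usable without a passing probe."""
    return MACA_FINDINGS.get(backend) == "fallback"
```

Treating unknown backends as disabled keeps a newly added kernel from being dispatched on MACA before someone has recorded a result for it.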

## Fallback Contract

Every new domestic accelerator platform should define fallback behavior before
adding optimized kernels.

Minimum fallback contract:

- `logp`: PyTorch native must be available; fused backend is optional.
- `attention`: PyTorch SDPA/native attention must be available; vendor flash
  attention is optional.
- `linear_logp`: PyTorch native must be available; Triton/vendor fused backend is
  optional and must pass forward accuracy before dispatch is enabled.
- `ratio_kl` and `grpo_loss`: PyTorch native must be available; Triton/vendor
  fused backend must pass forward and backward tests.
- `sampling`: PyTorch multinomial/top-k/top-p fallback must be available; vendor
  sampling must handle failure flags or retry semantics explicitly.
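
For the `sampling` item, explicit handling of failure flags could look roughly like the following. This is a sketch under assumptions: the vendor sampler is modeled as returning a `(samples, success)` pair, as reported for the FlashInfer top-p probe, plain Python lists stand in for tensors, and `sample_with_fallback` and both callables are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Probs = Sequence[Sequence[float]]
VendorSampler = Callable[[Probs], Tuple[List[int], List[bool]]]
FallbackSampler = Callable[[Sequence[float]], int]


def sample_with_fallback(
    probs: Probs,
    vendor_sample: VendorSampler,
    fallback_sample: FallbackSampler,
) -> List[int]:
    """Use vendor samples where success is True; resample failed rows with the fallback."""
    samples, success = vendor_sample(probs)
    if len(samples) != len(probs) or len(success) != len(probs):
        raise ValueError("vendor sampler returned a result with the wrong batch size")
    return [
        sample if ok else fallback_sample(row)
        for sample, ok, row in zip(samples, success, probs)
    ]


def argmax(row: Sequence[float]) -> int:
    """Deterministic stand-in for a real fallback sampler, used only in this example."""
    return max(range(len(row)), key=lambda i: row[i])


def fake_vendor(batch: Probs) -> Tuple[List[int], List[bool]]:
    """Pretend the second row failed to produce a sample."""
    return [0, -1], [True, False]


result = sample_with_fallback([[0.9, 0.1], [0.2, 0.8]], fake_vendor, argmax)
```

Here `result` is `[0, 1]`: the first row keeps the vendor sample, and the second row, flagged as failed, is replaced by the fallback's choice instead of leaking the invalid `-1`.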

Optimized kernels should never be selected only because tensors report
`device.type == "cuda"`. They should be selected because the platform detector
and operator-specific capability check both pass.
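
The selection rule above can be sketched as a lookup that needs both a platform-specific priority list and a passing capability check. All names here (`select_backend`, the key layout, `"torch_native"`) are hypothetical and are not taken from `KernelRegistry`.

```python
from typing import Callable, Dict, List, Tuple

CapabilityCheck = Callable[[], bool]
Priorities = Dict[Tuple[str, str], List[str]]             # (platform, op) -> backends
Checks = Dict[Tuple[str, str, str], CapabilityCheck]      # (platform, op, backend)


def select_backend(
    platform: str,
    op: str,
    priorities: Priorities,
    checks: Checks,
    fallback: str = "torch_native",
) -> str:
    """Return the first optimized backend whose check passes, else the fallback."""
    for backend in priorities.get((platform, op), []):
        check = checks.get((platform, op, backend))
        if check is None:
            continue  # No registered check means the backend is never selected.
        try:
            if check():
                return backend
        except Exception:
            continue  # A crashing probe counts as "not capable".
    return fallback


priorities: Priorities = {
    ("maca", "logp"): ["fused_logp_generic"],
    ("maca", "linear_logp"): ["triton"],
}
checks: Checks = {
    ("maca", "logp", "fused_logp_generic"): lambda: True,
    ("maca", "linear_logp", "triton"): lambda: False,  # forward accuracy failed
}
```

With these tables, `select_backend("maca", "logp", priorities, checks)` returns `"fused_logp_generic"`, while `linear_logp` on `maca`, and any operator on a platform without an entry, resolves to `"torch_native"`.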

## Scoping Template

Use this checklist when opening or updating a domestic accelerator issue.

```markdown
## Platform

- Vendor:
- Device model:
- Number of cards:
- Driver/runtime version:
- PyTorch version:
- `torch.version.cuda`:
- `torch.version.hip`:
- Device name from `torch.cuda.get_device_name(0)`:
- Device capability:
- Warp/wave size if exposed:
- Compiler path:

## Target Operator

- Operator:
- Current fallback:
- Proposed optimized backend:
- Required dtype:
- Required shapes:
- Expected unsupported shapes:

## Correctness Baseline

- Reference implementation:
- Tolerance:
- Test command:
- Result:

## Build / Runtime Notes

- Extension compiler:
- Required environment variables:
- Known compile blockers:
- Known runtime blockers:

## Dispatch Proposal

- Platform key:
- Registry priority:
- Conditions for enabling optimized backend:
- Conditions for falling back:

## Validation Hardware

- Who can run validation:
- Hardware access constraints:
- Reproducible command log:
```
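
For the "Correctness Baseline" fields, stating the tolerance rule explicitly helps make results comparable across reports. The sketch below uses the common `|actual - reference| <= atol + rtol * |reference|` criterion on plain Python floats. The function name and default tolerances are placeholders, not project requirements.

```python
import math
from typing import Sequence


def within_tolerance(
    actual: Sequence[float],
    reference: Sequence[float],
    atol: float = 1e-5,
    rtol: float = 1e-4,
) -> bool:
    """Elementwise check of |actual - reference| <= atol + rtol * |reference|."""
    if len(actual) != len(reference):
        return False
    for a, r in zip(actual, reference):
        if a == r:
            continue  # Also covers matching infinities.
        if not (math.isfinite(a) and math.isfinite(r)):
            return False  # NaN or mismatched infinity is always a failure.
        if abs(a - r) > atol + rtol * abs(r):
            return False
    return True
```

A report would then list the reference implementation, the `atol`/`rtol` pair per dtype, and the exact command that produced the comparison.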

## Proposed MetaX/MACA First Milestone

The first MACA milestone should be intentionally small:

1. Detect MetaX/MACA as a separate `maca` platform instead of generic CUDA.
2. Keep PyTorch fallback enabled for every operator.
3. Enable external `flash_attn` only when its import and smoke test pass.
4. Update FlashInfer sampling wrapper for the installed API contract.
5. Disable SM90, prefix-shared attention, and Triton `linear_logp` by default.
6. Build only the generic fused LogP extension path for MACA.
7. Add MACA-specific CI or manual validation commands before claiming support.

This gives the community a correctness-first base while preserving room for
vendor-specific optimized kernels later.
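
Step 3 above (enable an external package only when its import and smoke test pass) could be gated roughly as follows. `try_enable` is a hypothetical helper; in practice the module name would be `flash_attn` and the smoke test would compare a small `flash_attn_func` call against SDPA, but the example uses a standard-library module so it runs anywhere.

```python
import importlib
from types import ModuleType
from typing import Callable, Optional


def try_enable(
    module_name: str,
    smoke_test: Callable[[ModuleType], bool],
) -> Optional[ModuleType]:
    """Return the module only if it imports and its smoke test passes, else None."""
    try:
        module = importlib.import_module(module_name)
    except Exception:
        # ImportError, or a vendor build that fails while loading.
        return None
    try:
        return module if smoke_test(module) else None
    except Exception:
        return None  # A crashing smoke test means the backend stays disabled.


# Stand-in usage with a standard-library module.
enabled = try_enable("math", lambda m: m.sqrt(4.0) == 2.0)
missing = try_enable("surely_not_installed_module_xyz", lambda m: True)
```

Returning `None` instead of raising lets the caller fall through to the PyTorch fallback without special-casing each failure mode.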

## Community Questions

Feedback is especially useful on:

- Whether `maca`, `ascend`, and `cambricon` should be platform keys or grouped
  under a broader `domestic` namespace.
- Which operator should be the first optimized target for each vendor.
- Which vendor runtimes expose CUDA-compatible APIs and which require separate
  extension toolchains.
- Whether platform detection should rely on PyTorch metadata, device name,
  environment variables, or a combination of all three.
- Who can provide reproducible validation hardware for Ascend and Cambricon.
- What minimum benchmark and correctness matrix should be required before a
  backend is marked active instead of experimental.

Please keep discussion tied to concrete hardware, command output, and operator
results so the PoC can move from platform detection to reliable dispatch.

