This page scopes a community discussion for domestic accelerator support in
RL-Kernel. It tracks the P3 roadmap item for domestic accelerator and research
expansion, and should be used with:
The goal is not to promise first-class support for every device immediately. The
goal is to define a reproducible path for adding new vendor platforms once the
main CUDA, ROCm, and PyTorch fallback paths are correct, integrated, and measurable.
Target Scope
The first scoping pass should cover these vendor families:
| Vendor family | Example platform | Proposed platform key | Initial status |
| --- | --- | --- | --- |
| MetaX | C500 / MACA software stack | maca | Local PoC data available |
| Huawei Ascend | CANN / NPU stack | ascend | Community scoping needed |
| Cambricon | MLU stack | cambricon | Community scoping needed |
| Other CUDA-compatible stacks | Vendor-specific CUDA bridge | vendor key required | Community scoping needed |
Each platform should begin with a narrow operator target, a safe fallback story,
and explicit validation hardware. Avoid broad "support this chip" issues that do
not identify the runtime, compiler, operator, and correctness baseline.
Recommended Platform Model
RL-Kernel should distinguish logical vendor platforms from PyTorch device strings.
Several non-NVIDIA runtimes expose tensors through torch.device("cuda"), so
torch.cuda.is_available() is not enough to infer NVIDIA CUDA semantics.
Recommended structure:
- Add a vendor platform detector under rl_engine/platforms/.
- Keep the existing CUDA and ROCm paths intact.
- Add explicit platform keys such as maca, ascend, and cambricon.
- Add backend priorities per platform in KernelRegistry.
- Add vendor-specific operator wrappers only after a fallback path is validated.
For MetaX/MACA, a future code layout could include:
rl_engine/platforms/maca.py
rl_engine/kernels/ops/maca/
tests/platforms/test_maca_detection.py
For Ascend and Cambricon, use the same pattern once maintainers or community
contributors can provide the runtime facts and validation hardware.
MetaX C500 / MACA Observations
Local experiments on a four-card MetaX C500 node showed:
torch 2.8.0+metax3.5.3.9
torch.version.cuda 11.6
torch.version.hip None
device name MetaX C500
device count 4
compute capability (8, 0)
warp_size 64
CUDA_HOME /opt/maca/tools/cu-bridge
compiler /opt/maca/tools/cu-bridge/bin/cucc
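This means the current runtime dispatch treats the C500 as a CUDA platform: the
MACA PyTorch build reports torch.version.hip as None and exposes devices through
the CUDA API. That does not imply NVIDIA CUDA compatibility for every kernel. For
example, the reported warp size of 64 differs from the 32 used by NVIDIA GPUs, so
kernels that hard-code a 32-lane warp cannot be assumed correct.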
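The observations above suggest that a detector has to look past the CUDA API surface. The following is a minimal sketch, not RL-Kernel's detector: `RuntimeFacts`, `detect_platform`, and the substring heuristics are hypothetical, and the facts are passed in as plain values so the logic can be tested without vendor hardware. In practice they would be read from `torch.__version__`, `torch.version.cuda`, `torch.version.hip`, `torch.cuda.get_device_name(0)`, and the `CUDA_HOME` environment variable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RuntimeFacts:
    """Runtime metadata gathered once at startup (hypothetical field names)."""
    torch_version: str
    cuda_version: Optional[str]  # value of torch.version.cuda
    hip_version: Optional[str]   # value of torch.version.hip
    device_name: str             # e.g. torch.cuda.get_device_name(0)
    cuda_home: str = ""          # CUDA_HOME environment variable, if set


def detect_platform(facts: RuntimeFacts) -> str:
    """Map runtime facts to a logical platform key.

    Vendor checks run before the generic CUDA check because a MACA build
    reports a CUDA version and no HIP version. The substring matches are
    assumptions based on one C500 node, not a vendor-documented contract.
    """
    if (
        "metax" in facts.torch_version.lower()
        or "metax" in facts.device_name.lower()
        or "/maca/" in facts.cuda_home
    ):
        return "maca"
    if facts.hip_version is not None:
        return "rocm"
    if facts.cuda_version is not None:
        return "cuda"
    return "cpu"


# Values copied from the C500 observations listed above.
c500 = RuntimeFacts(
    torch_version="2.8.0+metax3.5.3.9",
    cuda_version="11.6",
    hip_version=None,
    device_name="MetaX C500",
    cuda_home="/opt/maca/tools/cu-bridge",
)
print(detect_platform(c500))  # maca
```

Combining several signals keeps the detector from depending on any single string that a future vendor release might change.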
Initial operator findings:
| Area | Result | Suggested default |
| --- | --- | --- |
| PyTorch native LogP | Correct | Safe fallback |
| Generic fused_logp_kernel.cu | Compiles and passes small correctness probes | Candidate for MACA fused LogP |
| CUDA prefix-shared attention | Fails to compile due to NVIDIA inline asm | Disable on MACA |
| SM90/TMA kernels | NVIDIA Hopper-specific | Disable on MACA |
| FlashAttention package | Direct flash_attn_func call works on C500 in a small SDPA comparison | Candidate attention backend |
| RL-Kernel CUDA FlashAttention wrapper | Currently requires _C before trying external flash_attn | Wrapper should be relaxed |
| FlashInfer top-k | Works in a small probe | Candidate sampling backend |
| FlashInfer top-p | API requires uniform_samples with shape [max_rounds, batch] and returns (samples, success) | Wrapper update required |
| Triton ratio_kl / grpo_loss | Forward and backward smoke tests passed | Candidate backend with tests |
| Triton linear_logp | Forward produced incorrect values on fp32 and bf16 probes | Disable by default on MACA |
These results are not a support guarantee. They are a starting point for a
reproducible community PoC.
Fallback Contract
Every new domestic accelerator platform should define fallback behavior before
adding optimized kernels.
Minimum fallback contract:
- logp: PyTorch native must be available; fused backend is optional.
- attention: PyTorch SDPA/native attention must be available; vendor flash
attention is optional.
- linear_logp: PyTorch native must be available; Triton/vendor fused backend is
optional and must pass forward accuracy before dispatch is enabled.
- ratio_kl and grpo_loss: PyTorch native must be available; Triton/vendor
fused backend must pass forward and backward tests.
- sampling: PyTorch multinomial/top-k/top-p fallback must be available; vendor
sampling must handle failure flags or retry semantics explicitly.
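For the sampling item, explicit handling of failure flags could look like the sketch below. It assumes the vendor sampler returns one success flag per batch row alongside the samples, which is an assumption about the flag granularity. `resolve_samples` and its `fallback` callback are hypothetical names, and plain Python sequences stand in for tensors.

```python
from typing import Callable, List, Sequence


def resolve_samples(
    samples: Sequence[int],
    success: Sequence[bool],
    fallback: Callable[[int], int],
) -> List[int]:
    """Keep vendor samples that succeeded; resample failed rows via fallback.

    `fallback` receives the batch row index and must return a token id,
    for example from a PyTorch multinomial path.
    """
    if len(samples) != len(success):
        raise ValueError("samples and success must have the same length")
    return [
        sample if ok else fallback(row)
        for row, (sample, ok) in enumerate(zip(samples, success))
    ]
```

The point of the design is that a failed row is never silently returned: it is either resampled by the fallback or the mismatch raises.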
Optimized kernels should never be selected only because tensors report
device.type == "cuda". They should be selected because the platform detector
and operator-specific capability check both pass.
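A minimal sketch of that selection rule follows. It is not the KernelRegistry API: `select_backend`, the `(platform, op)` lookup table, and the backend names are hypothetical, and each capability probe is reduced to a zero-argument callable that returns a bool.

```python
from typing import Callable, Dict, List, Tuple

# (backend name, capability probe); both are illustrative placeholders.
Candidate = Tuple[str, Callable[[], bool]]


def select_backend(
    platform: str,
    op: str,
    optimized: Dict[Tuple[str, str], List[Candidate]],
    fallback: str = "torch_native",
) -> str:
    """Pick the first optimized backend whose capability probe passes.

    Candidates are keyed by (platform, op), so a tensor that merely reports
    device.type == "cuda" never selects an optimized backend on its own.
    """
    for name, probe in optimized.get((platform, op), []):
        try:
            if probe():
                return name
        except Exception:
            # A crashing probe counts as "not capable"; try the next candidate.
            continue
    return fallback


# Example table mirroring the MACA findings: ratio_kl passed its probes,
# linear_logp failed forward accuracy.
table: Dict[Tuple[str, str], List[Candidate]] = {
    ("maca", "ratio_kl"): [("triton", lambda: True)],
    ("maca", "linear_logp"): [("triton", lambda: False)],
}
print(select_backend("maca", "ratio_kl", table))     # triton
print(select_backend("maca", "linear_logp", table))  # torch_native
```

Because the fallback is the default return value, an unknown platform or an operator with no registered candidates always resolves to the PyTorch path.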
Scoping Template
Use this checklist when opening or updating a domestic accelerator issue.
## Platform
- Vendor:
- Device model:
- Number of cards:
- Driver/runtime version:
- PyTorch version:
- `torch.version.cuda`:
- `torch.version.hip`:
- Device name from `torch.cuda.get_device_name(0)`:
- Device capability:
- Warp/wave size if exposed:
- Compiler path:
## Target Operator
- Operator:
- Current fallback:
- Proposed optimized backend:
- Required dtype:
- Required shapes:
- Expected unsupported shapes:
## Correctness Baseline
- Reference implementation:
- Tolerance:
- Test command:
- Result:
## Build / Runtime Notes
- Extension compiler:
- Required environment variables:
- Known compile blockers:
- Known runtime blockers:
## Dispatch Proposal
- Platform key:
- Registry priority:
- Conditions for enabling optimized backend:
- Conditions for falling back:
## Validation Hardware
- Who can run validation:
- Hardware access constraints:
- Reproducible command log:
Proposed MetaX/MACA First Milestone
The first MACA milestone should be intentionally small:
- Detect MetaX/MACA as a separate maca platform instead of generic CUDA.
- Keep PyTorch fallback enabled for every operator.
- Enable external flash_attn only when its import and smoke test pass.
- Update FlashInfer sampling wrapper for the installed API contract.
- Disable SM90, prefix-shared attention, and Triton linear_logp by default.
- Build only the generic fused LogP extension path for MACA.
- Add MACA-specific CI or manual validation commands before claiming support.
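The "import and smoke test" gate for an external package can be sketched as follows. `optional_backend` is a hypothetical helper, and the demonstration uses the standard-library `json` module so the example runs anywhere. For MACA the module name would be `flash_attn` and the smoke test a small comparison against SDPA.

```python
import importlib
from types import ModuleType
from typing import Callable, Optional


def optional_backend(
    module_name: str,
    smoke_test: Callable[[ModuleType], bool],
) -> Optional[ModuleType]:
    """Return the imported module only if import and smoke test both succeed."""
    try:
        module = importlib.import_module(module_name)
    except Exception:
        # Covers ImportError and loader failures such as a missing shared library.
        return None
    try:
        return module if smoke_test(module) else None
    except Exception:
        # A crashing smoke test is treated the same as a failing one.
        return None


# Stand-in demonstration with a stdlib module and a module that does not exist.
json_mod = optional_backend("json", lambda m: m.loads("[1]") == [1])
missing = optional_backend("not_a_real_module_xyz", lambda m: True)
print(json_mod is not None, missing is None)  # True True
```

Returning `None` instead of raising lets dispatch fall through to the PyTorch path without special-casing each optional dependency.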
This gives the community a correctness-first base while preserving room for
vendor-specific optimized kernels later.
Community Questions
Feedback is especially useful on:
- Whether maca, ascend, and cambricon should be platform keys or grouped
under a broader domestic namespace.
- Which operator should be the first optimized target for each vendor.
- Which vendor runtimes expose CUDA-compatible APIs and which require separate
extension toolchains.
- Whether platform detection should rely on PyTorch metadata, device name,
environment variables, or a combination of all three.
- Who can provide reproducible validation hardware for Ascend and Cambricon.
- What minimum benchmark and correctness matrix should be required before a
backend is marked active instead of experimental.
Please keep discussion tied to concrete hardware, command output, and operator
results so the PoC can move from platform detection to reliable dispatch.