baybridge is an in-progress AMD-focused kernel DSL and portability layer.

Today it provides:
- a Python frontend with an example-friendly `import baybridge as cute` surface
- a portable kernel IR
- textual lowering backends (`mlir_text`, `gpu_text`, `gpu_mlir`, `waveasm_ref`, `aster_ref`, and `hipkittens_ref`)
- executable HIP lowering for a narrow validated subset
- an optional executable HipKittens backend for a narrow BF16 GEMM subset
- a reference runtime for basic examples and `@jit` wrapper execution
It does not yet provide:
- HSACO production or HIP launch
- PTX-to-AMDGCN translation
- broad executable lowering for the full DSL surface
These paths are usable now:
- basic `@kernel` and `@jit` authoring
- topology builtins such as `arch.thread_idx()`, `arch.block_idx()`, `block_dim()`, `grid_dim()`, `lane_id()`
- runtime tensors
- direct scalar indexing in traced kernels: `g_c[row, col] = g_a[row, col] + g_b[row, col]`
- `compile(...)` from explicit `TensorSpec` annotations
- `compile(...)` from sample runtime arguments
- `@jit` wrappers that launch a single traceable kernel
- runtime fallback for simple `@jit` wrappers that cannot be traced yet
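The topology builtins above are typically combined into the standard linear element index. This plain-Python emulation (an illustration only, not baybridge code) shows the mapping each thread would compute for a 1D elementwise add:

```python
# Plain-Python emulation of the canonical linear indexing pattern
# block_idx.x * block_dim.x + thread_idx.x (illustration, not baybridge code).
x = [float(i) for i in range(8)]
y = [10.0 * i for i in range(8)]
out = [0.0] * 8

grid_x, block_dim_x = 2, 4  # 2 blocks of 4 threads cover all 8 elements

for block_idx in range(grid_x):
    for thread_idx in range(block_dim_x):
        i = block_idx * block_dim_x + thread_idx
        out[i] = x[i] + y[i]

print(out)  # [0.0, 11.0, 22.0, 33.0, 44.0, 55.0, 66.0, 77.0]
```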
Current boundary:
- Python control flow on traced dynamic values is not supported.
- If a launched kernel uses `if thread_idx(...) == ...` style control flow, `compile(...)` falls back to the reference runtime for `@jit` wrappers instead of producing IR.
- Direct `@kernel` compilation fails in that case rather than silently degrading.
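Why this boundary exists can be shown without baybridge at all: a tracer records operations on symbolic values, so Python's `if` has no concrete boolean to branch on. A minimal sketch, where the `Traced` class is a hypothetical stand-in rather than baybridge's actual tracer:

```python
# Minimal sketch of why dynamic control flow breaks tracing.
# `Traced` is a hypothetical stand-in, not baybridge's real tracer.
class Traced:
    """A symbolic value recorded during tracing; it has no concrete value."""

    def __init__(self, expr):
        self.expr = expr

    def __eq__(self, other):
        # Comparisons stay symbolic: they build more IR instead of a bool.
        return Traced(f"(eq {self.expr} {other!r})")

    def __bool__(self):
        # Python's `if` demands a concrete bool, which tracing cannot supply.
        raise TypeError(
            f"cannot branch on traced value {self.expr}; "
            "dynamic Python control flow is not traceable"
        )

tidx = Traced("thread_idx.x")
cond = tidx == 0      # fine: produces another symbolic value
try:
    if cond:          # fails: forces a concrete truth value
        pass
except TypeError as err:
    print(err)
```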
baybridge is currently tested with a repo-local `.venv`:

```sh
git clone <repo>
cd baybridge
python3 -m venv .venv
. .venv/bin/activate
python -m pip install -U pip
python -m pip install -e '.[dev]'
```

Run the test suite:

```sh
PYTHONDONTWRITEBYTECODE=1 PYTHONPATH=src .venv/bin/python -m pytest -q -p no:cacheprovider
```

This runs immediately through the reference runtime.
```python
import baybridge as cute

@cute.kernel
def hello_world_kernel():
    tidx, _, _ = cute.arch.thread_idx()
    if tidx == 0:
        cute.printf("Hello world")

@cute.jit
def hello_world():
    cute.printf("hello world")
    hello_world_kernel().launch(grid=(1, 1, 1), block=(32, 1, 1))

hello_world()
compiled = cute.compile(hello_world)
compiled()
```

Notes:
- `hello_world()` executes directly on the reference runtime.
- `compile(hello_world)` succeeds, but because the launched kernel uses dynamic Python control flow, the compiled artifact executes through the runtime path rather than producing portable IR.
This kernel is traceable because it avoids dynamic Python `if`.

```python
import baybridge as cute

@cute.kernel(launch=cute.LaunchConfig(grid=(1, 1, 1), block=(4, 1, 1)))
def indexed_add_kernel(
    g_a: cute.Tensor,
    g_b: cute.Tensor,
    g_c: cute.Tensor,
):
    tidx, _, _ = cute.arch.thread_idx()
    _, n = g_a.shape
    col = tidx % n
    row = tidx // n
    g_c[row, col] = g_a[row, col] + g_b[row, col]

a = cute.tensor([[1, 2], [3, 4]], dtype="f32")
b = cute.tensor([[10, 20], [30, 40]], dtype="f32")
c = cute.zeros((2, 2), dtype="f32")

artifact = cute.compile(indexed_add_kernel, a, b, c, backend="gpu_text")
print(artifact.ir.to_dict())
print(artifact.lowered_module.text)
```

This produces:
- portable IR containing `thread_idx`, `constant`, `mod`, `floordiv`, `load`, `add`, and `store`
- `gpu_text` output containing `memref.load` and `memref.store`
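The row/column mapping the traced kernel performs can be checked with plain Python. This emulation (an illustration, not how baybridge executes kernels) runs the same `tidx // n` / `tidx % n` arithmetic for each of the 4 threads in the launch:

```python
# Plain-Python emulation of the indexed_add_kernel thread mapping
# (illustration only; this is not how baybridge executes kernels).
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[10.0, 20.0], [30.0, 40.0]]
c = [[0.0, 0.0], [0.0, 0.0]]

m, n = 2, 2
block_x = 4  # matches block=(4, 1, 1) in the example above

for tidx in range(block_x):
    col = tidx % n   # column cycles fastest
    row = tidx // n  # row advances every n threads
    c[row][col] = a[row][col] + b[row][col]

print(c)  # [[11.0, 22.0], [33.0, 44.0]]
```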
If a `@jit` wrapper only specializes and launches a traceable kernel, `compile(...)` returns the launched kernel IR.

```python
import baybridge as cute

@cute.jit
def indexed_add(m_a: cute.Tensor, m_b: cute.Tensor, m_c: cute.Tensor):
    indexed_add_kernel(m_a, m_b, m_c).launch(grid=(1, 1, 1), block=(4, 1, 1))

compiled = cute.compile(indexed_add, a, b, c, backend="portable")
print(compiled.ir.name)  # indexed_add_kernel
print(compiled.ir.metadata["wrapped_by"])  # indexed_add
```

Key concepts:
- the `kernel` decorator
  - device-style kernel definition
  - direct `compile(...)` produces kernel IR when the body is traceable
- the `jit` decorator
  - host-style specialization/wrapper function
  - can execute directly on the reference runtime
  - can compile either to launched kernel IR or to a runtime fallback artifact
You can supply tensor information in two ways.

Explicit annotations:

```python
@cute.kernel
def copy_kernel(
    src: cute.TensorSpec(shape=(128,), dtype="f16"),
    dst: cute.TensorSpec(shape=(128,), dtype="f16"),
):
    cute.copy(src, dst)
```

Sample runtime arguments:

```python
a = cute.tensor([[1, 2], [3, 4]], dtype="f32")
b = cute.tensor([[10, 20], [30, 40]], dtype="f32")
c = cute.zeros((2, 2), dtype="f32")
artifact = cute.compile(indexed_add_kernel, a, b, c)
```

The reference runtime uses lightweight CPU-side tensors:
```python
import baybridge as cute

a = cute.tensor([[1, 2, 3], [4, 5, 6]], dtype="f32")
b = cute.zeros((2, 3), dtype="f32")
print(a.shape)    # (2, 3)
print(a.dtype)    # f32
print(a.tolist())  # [[1, 2, 3], [4, 5, 6]]
```

Useful helpers:
- `baybridge.tensor(data, dtype=...)`
- `baybridge.zeros(shape, dtype=...)`
- `baybridge.clone(tensor)`
- `baybridge.size(value, mode=...)`
For a current backend inventory, validation matrix, and benchmark notes, see docs/backend-status.md.
`compile(...)` supports:
- `backend="portable"` - stores portable IR only
- `backend="mlir_text"` - lowers portable IR to a baybridge-specific textual form
- `backend="gpu_text"` - lowers portable IR to GPU/ROCDL-flavored textual IR
- `backend="gpu_mlir"` - lowers portable IR to a stricter MLIR GPU/module form suitable for external backend consumption
- `backend="waveasm_ref"` - lowers supported kernels to WaveASM-oriented MLIR plus tool invocation hints
  - this is a reference backend, not an executable backend
  - emits a `*.waveasm_repro/` bundle next to the lowered artifact with `kernel.mlir`, `repro.sh`, and `manifest.json`
  - the bundle path is exposed on the compiled artifact as `artifact.debug_bundle_dir`
  - you can also generate it directly with `baybridge.emit_waveasm_repro(...)` or `tools/emit_waveasm_repro.py`
  - `tools/compare_backends.py` can compile the same kernel through multiple backends, optionally execute supported ones, and report bundle/debug paths in JSON
  - pass `--include-env` to `tools/compare_backends.py` to include Python/platform/tool/version metadata for cross-machine diffs
  - sample factories used with `tools/compare_backends.py` may optionally accept `backend_name` and `target_arch` so they can return backend-specific compile/run args
  - `BAYBRIDGE_WAVEASM_ROOT` can point at a Wave checkout to improve tool discovery hints
- `backend="waveasm_exec"` - experimental backend behind `BAYBRIDGE_EXPERIMENTAL_WAVEASM_EXEC=1`
  - lowers a narrow standard-MLIR subset to WaveASM-consumable MLIR and launches the resulting HSACO through HIP's module API
  - current experimental family:
    - single-global-tensor pointwise kernels
    - single-global-tensor pointwise math kernels
    - single-global-tensor shared-memory staging
  - not enabled by default because current upstream WaveASM execution still has correctness issues for Baybridge kernels, including the scalar global `memref.load`/`store` SRD aliasing bug tracked in iree-org/wave#1117
  - does not yet cover reductions, multi-buffer copy, GEMM, or Baybridge custom tensor-SSA ops
  - also emits a `*.waveasm_repro/` bundle so the exact MLIR/toolchain path can be reproduced outside Baybridge
  - the bundle path is exposed on the compiled artifact as `artifact.debug_bundle_dir`
  - requires `waveasm-translate`, `clang++` or `clang`, and prefers `ld.lld` for final HSACO linking
- `backend="aster_ref"` - lowers supported kernels to ASTER-oriented reference MLIR plus tool/runtime hints
  - this is a reference backend, not an executable backend
  - emits a `*.aster_repro/` bundle next to the lowered artifact with `kernel.mlir`, `repro.sh`, and `manifest.json`
  - `BAYBRIDGE_ASTER_ROOT` can point at an ASTER checkout or install root to improve tool and package discovery
- `backend="flydsl_ref"` - lowers a supported Baybridge subset to a FlyDSL-oriented reference module
  - this is a reference backend, not an executable backend
  - if `BAYBRIDGE_FLYDSL_ROOT` points at a FlyDSL checkout, the artifact includes a direct root/build hint
- `backend="flydsl_exec"` - lowers a validated Baybridge tensor subset to a FlyDSL-flavored Python module and launcher scaffold
  - current supported family:
    - pointwise tensor kernels
    - broadcasted tensor binary ops
    - tensor factories via `fill`
    - simple reductions
    - shared-memory staging and whole-tensor copy
  - requires a built and importable FlyDSL environment, not just a source checkout
  - `BAYBRIDGE_FLYDSL_ROOT` should point at a checkout with `build-fly/python_packages` or `build/python_packages`
  - fake/local harness execution can launch with:
    - `baybridge.from_dlpack(...)` tensor handles
    - raw DLPack-capable tensors such as `torch.Tensor`
    - plain `baybridge.RuntimeTensor` values when `torch` is importable
  - real upstream FlyDSL execution is currently validated for a narrow subset:
    - 1D `f32` pointwise binary ops through the copy-atom/register path
      - hardware-validated today: add, sub, mul, div
    - canonical linear indexed pointwise binary ops of the form `block_idx.x * block_dim.x + thread_idx.x` are also hardware-validated today: add, sub, mul, div
    - 1D `f32` copy
    - 1D `f32` unary math bundle through the copy-atom/register path
      - hardware-validated today: exp, log, cos, erf, exp2, log2, log10, sqrt, sin, rsqrt
    - 2D `f32` broadcast binary through the row-slice/copy-atom path when `grid == block == (1, 1, 1)`
      - hardware-validated today: add, sub, mul, div
    - 2D `f32` reduction bundle through the row-slice/copy-atom path when `grid == block == (1, 1, 1)`
      - hardware-validated today:
        - add: full reduction to `(1,)` plus row reduction to `(M,)`
        - mul: full reduction to `(1,)` plus row reduction to `(M,)`
    - 2D `f32` unary math bundle through the row-slice/copy-atom path when `grid == block == (1, 1, 1)`
      - hardware-validated today: exp, log, cos, erf
    - 2D `f32` tensor-factory bundle through the row-slice/copy-atom path when `grid == block == (1, 1, 1)`
    - 1D `f32` shared-memory staging copy when the traced kernel is exactly a shared-memory round-trip and `block.x == extent`
    - `acos`, `asin`, `atan`, and `atan2` remain outside the validated real subset because the current upstream FlyDSL pipeline does not lower the corresponding scalar libcalls cleanly on the active AMD environments
  - broader real upstream FlyDSL execution is still gated behind `BAYBRIDGE_EXPERIMENTAL_REAL_FLYDSL_EXEC=1`
    - reason: Baybridge's current generic lowering still does not match FlyDSL's real buffer/tile access model; the validated subset uses specialized upstream-compatible lowerings
    - `tools/compare_backends.py --execute` reports that gate as `skipped_unvalidated_real_flydsl_exec`
  - intended as the first executable FlyDSL bridge, not full FlyDSL coverage yet
- `backend="hipkittens_ref"` - lowers matched GEMM and attention-family kernels to a HipKittens-oriented C++ reference artifact
  - this is a reference backend, not an executable backend
  - if `BAYBRIDGE_HIPKITTENS_ROOT` points at a HipKittens checkout, the artifact includes a direct include-path build hint
- `backend="hipcc_exec"` - builds and runs a narrow executable HIP subset through `hipcc`
- `backend="hipkittens_exec"` - builds and runs a narrow executable HipKittens subset
  - current supported kernel family: pure BF16/F16 GEMM composed from HipKittens micro tiles
  - currently validated for `gfx950`
  - `gfx942` executable support is currently limited to the BF16 slice; F16 exec is not wired yet
  - transposed F16 GEMM stays on `hipkittens_ref` on `gfx950`; current upstream HipKittens gfx950 MMA templates reject that family
  - RMSNorm stays on `hipkittens_ref` on `gfx950`; current upstream HipKittens gfx950 headers do not compile the generated RMSNorm kernel
  - supported micro tiles:
    - `f16/f16 -> f32`
    - `bf16/bf16 -> f32`
    - `(32, 16) x (16, 32) -> (32, 32)`
    - `(16, 32) x (32, 16) -> (16, 16)`
  - full tensor shapes may be larger as long as they are exact multiples of one supported micro tile and the reduction dimension is tiled accordingly
  - requires `BAYBRIDGE_HIPKITTENS_ROOT` to point at a HipKittens checkout
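The micro-tile shape constraint can be checked mechanically. This hedged sketch (a hypothetical helper, not part of baybridge's backend) tests whether an `(M, K) x (K, N)` GEMM is an exact multiple of one of the listed micro tiles:

```python
# Hypothetical shape check for the hipkittens_exec constraint described above;
# baybridge's real matcher lives in its backend, this is only an illustration.
MICRO_TILES = [
    ((32, 16), (16, 32)),  # (32, 16) x (16, 32) -> (32, 32)
    ((16, 32), (32, 16)),  # (16, 32) x (32, 16) -> (16, 16)
]

def matches_micro_tile(m, k, n):
    """True if an (M, K) x (K, N) GEMM tiles exactly by a supported micro tile."""
    for (tm, tk), (tk2, tn) in MICRO_TILES:
        assert tk == tk2  # each tile pair shares one reduction extent
        if m % tm == 0 and k % tk == 0 and n % tn == 0:
            return True
    return False

print(matches_micro_tile(64, 32, 64))   # True: exact multiples of (32,16)x(16,32)
print(matches_micro_tile(100, 32, 64))  # False: M=100 is not a tile multiple
```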
Default backend behavior:
- if `compile(...)` is called without an explicit backend, the traced kernel matches `hipkittens_exec` on `gfx950`, and `BAYBRIDGE_HIPKITTENS_ROOT` points at a usable checkout, Baybridge auto-prefers `hipkittens_exec`
- the same auto-preference path can apply to `gfx942`, but only when the local ROCm toolchain satisfies the HipKittens `cdna3` header requirements
- otherwise, if the traced kernel matches a HipKittens tensorop GEMM or attention family, Baybridge auto-prefers `hipkittens_ref`
- otherwise, if the traced kernel matches the validated `flydsl_exec` subset and the active FlyDSL environment is ready, Baybridge auto-prefers `flydsl_exec` only when:
  - the inputs match the validated path
  - and the kernel falls into either:
    - the currently validated real-upstream subset
    - or the broader experimental subset with `BAYBRIDGE_EXPERIMENTAL_REAL_FLYDSL_EXEC=1`
- otherwise it falls back to the normal default textual backend
- if `BAYBRIDGE_EXEC_ARCH` is set, Baybridge uses that architecture as the default compile target
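The selection order above amounts to a first-match chain. This sketch makes the precedence explicit with hypothetical predicate arguments (none of these helper names are baybridge APIs, and `gpu_text` is only a stand-in for "the normal default textual backend"):

```python
# Hedged sketch of the default-backend precedence described above.
# The boolean predicates are hypothetical stand-ins, not baybridge APIs.
def choose_default_backend(
    matches_hipkittens_exec,    # kernel + arch + checkout all usable
    matches_hipkittens_family,  # tensorop GEMM or attention family
    matches_flydsl_exec_subset, # inputs match the validated flydsl path
    flydsl_ready,               # built, importable FlyDSL environment
):
    if matches_hipkittens_exec:
        return "hipkittens_exec"
    if matches_hipkittens_family:
        return "hipkittens_ref"
    if matches_flydsl_exec_subset and flydsl_ready:
        return "flydsl_exec"
    return "gpu_text"  # stand-in for the normal default textual backend

print(choose_default_backend(False, True, False, False))  # hipkittens_ref
print(choose_default_backend(False, False, True, True))   # flydsl_exec
```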
Example:

```python
artifact = cute.compile(indexed_add_kernel, a, b, c, backend="gpu_text")
print(artifact.lowered_module.text)
```

HipKittens reference example:

```python
artifact = cute.compile(gemm_kernel, a, b, c, backend="hipkittens_ref")
print(artifact.lowered_module.dialect)  # hipkittens_cpp
print(artifact.lowered_module.text)
```

HipKittens executable example:

```python
artifact = cute.compile(
    bf16_gemm_kernel,
    a,
    b,
    c,
    backend="hipkittens_exec",
    target=cute.AMDTarget(arch="gfx950"),
)
artifact(a, b, c)
```

These tests are the best executable documentation today:
- `tests/test_examples_runtime.py`
  - hello world
  - naive elementwise add
- `tests/test_examples_tracing.py`
  - direct indexed kernel tracing
  - traced launch wrapper behavior
- `tests/test_backend_gpu.py`
  - `gpu_text` lowering surface
- `tests/test_backend_hipkittens_ref.py`
  - HipKittens reference backend family matching
- `tests/test_backend_hipkittens_exec.py`
  - HipKittens executable BF16 GEMM path
- `tests/test_validation.py`
  - explicit unsupported cases and diagnostics
Not implemented yet:
- `zipped_divide`
- composition-based tiled tensor views
- slice/view objects with `.load()`/`.store()` semantics
- executable ROCm lowering
- PTX translation

Important limitations in the current tracer:
- dynamic Python `if`, `while`, and other control flow on traced scalars is rejected
- runtime fallback is only used for `@jit` wrappers
- direct kernel compilation expects the kernel body itself to be traceable
Source layout:
- `src/baybridge/frontend.py` - user-facing DSL surface
- `src/baybridge/runtime.py` - reference runtime tensors and launch execution
- `src/baybridge/tracing.py` - portable IR builder and traced values
- `src/baybridge/compiler.py` - compile/cache flow and runtime fallback policy
- `src/baybridge/backends/mlir_text.py` - portable textual lowering
- `src/baybridge/backends/gpu_text.py` - GPU/ROCDL-flavored textual lowering
- `src/baybridge/backends/hipkittens_ref.py` - HipKittens-oriented reference lowering for matched kernel families
- `src/baybridge/backends/hipkittens_exec.py` - narrow executable HipKittens lowering for supported BF16 GEMM tiles
baybridge is past the scaffolding stage and is now useful for:
- trying the frontend
- validating portability decisions
- exercising a reference runtime
- inspecting the portable IR and textual AMD-oriented lowerings
It is not yet a drop-in ROCm execution backend for full upstream DSL programs.