In this hands-on exercise, we will guide you through the process of specifying a new accelerator ISA using the Python-based API of TAIDL (Tensor Accelerator ISA Definition Language). The API directly mirrors the formal ISA constructs introduced in the ACT paper: the memory hierarchy (data models), instruction sets with operational semantics, and other essential components of the ISA. You will learn how to describe a domain-specific tensor accelerator from scratch, using the QKV accelerator as our running example.
The Query-Key-Value (QKV) attention mechanism is the fundamental building block of transformer models, which have revolutionized natural language processing, computer vision, and generative AI. The attention operation can be expressed as:
Attention(Q, K, V) = softmax(Q × K^T) × V
where Q (query), K (key), and V (value) are typically matrices of shape [seq_len, hidden_dim]. This operation is compute-intensive, especially for long sequences, making it a prime target for hardware acceleration.
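For reference, here is a minimal NumPy sketch of this computation (the names, shapes, and the stability shift are illustrative; they are not part of the accelerator ISA):

```python
import numpy as np

def attention(Q, K, V):
    """Reference attention: softmax(Q @ K^T) @ V with row-wise softmax."""
    scores = Q @ K.T                                      # [seq_len, seq_len]
    scores = scores - scores.max(axis=-1, keepdims=True)  # shift for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                                    # [seq_len, hidden_dim]

# Illustrative sizes matching the accelerator's tiles: seq_len=128, hidden_dim=64
Q, K, V = (np.random.randn(128, 64).astype(np.float32) for _ in range(3))
out = attention(Q, K, V)   # shape (128, 64)
```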
The QKV accelerator is a specialized tensor accelerator designed specifically for efficient attention computation. Its key architectural features include:
- Dual On-chip Scratchpads: Two scratchpad memories (`d1` and `d2`) optimized for different access patterns:
  - `d1`: Primary buffer for input/output data (128 rows × 64 columns)
  - `d2`: Intermediate computation buffer (64 rows × 64 columns)
- BF16 Datatype: Uses Brain Floating Point (BF16) for efficient mixed-precision computation
- Specialized Instructions: Hardware support for matrix multiplication, softmax, and efficient data movement
- Row-major and Column-major Support: Flexible memory access patterns to minimize data layout transformations
This accelerator exemplifies the coarse-grained, tensor-oriented ISA design philosophy commonly observed in commercial tensor accelerators like Google's TPU and AWS Trainium -- instructions operate on entire tensor blocks rather than individual elements.
Before diving into the QKV accelerator specification, let's understand how memory hierarchies evolve from classical scalar processors to modern tensor accelerators.
1. Scalar Register Files (x86-64 GPRs)
```python
# Classical CPU: 16 general-purpose 64-bit registers
# Each register holds a single scalar value
acc.add_data_model("gpr", [16], [], "i64")
```

- Granularity: Individual 64-bit integers
- Example: `rax`, `rbx`, ..., `r15`
2. Vector Register Files (AVX-512 ZMM)
```python
# SIMD extension: 32 vector registers, each 512-bit wide
# Each register holds 16 float32 or 8 float64 values
acc.add_data_model("zmm", [32], [16], "f32")
```

- Granularity: Vectors of 16 elements
- Operations: Element-wise SIMD operations
3. Tile Register Files (Intel AMX)
```python
# Matrix extension: 8 tile registers, each 16×64 bytes
# Each tile holds a 2D matrix for tensor operations
acc.add_data_model("tiles", [8], [16, 64], "s8")
```

- Granularity: 2D tiles (16 rows × 64 columns)
- Operations: Matrix multiplication, tile loads/stores
4. Scratchpad Memories (Gemmini, TPU-style)
```python
# On-chip SRAM: Large capacity, software-managed
# Organized as a 2D array of elements
acc.add_data_model("spad", [1024 * 16], [16], "s8")  # Main scratchpad
acc.add_data_model("acc", [64 * 16], [16], "s32")    # Accumulator buffer
```

- Granularity: Configurable rows of 16 elements
- Management: Explicit software-controlled data movement
5. QKV Accelerator Memory Hierarchy
```python
# Specialized for attention: Two scratchpad levels
# d1: Primary I/O buffer, d2: Computation buffer
qkv.add_data_model("d1", [128], [64], "bf16")
qkv.add_data_model("d2", [64], [64], "bf16")
```

- Granularity: Rows of 64 BF16 elements
- Design: Optimized for matrix tiles and attention patterns
As we move from scalar to tensor accelerators, the fundamental trend is:
- Higher dimensionality: 0D (scalar) → 1D (vector) → 2D+ (tile/scratchpad)
- Coarser granularity: Single values → Fixed-width SIMD lanes → Fixed-shape Tiles → Variable-size Tensor blocks
This motivates the data model abstraction in TAIDL -- a unified formalism to describe memory hierarchies across different accelerator designs.
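Across all five examples, the `add_data_model` call follows a single pattern. The annotated signature below is inferred from the examples above, not taken from official API documentation:

```python
# acc.add_data_model(name, capacity_dims, unit_shape, dtype)
#   name:          buffer identifier used in instruction operands ("gpr", "spad", "d1", ...)
#   capacity_dims: number of addressable units ([16] registers, [128] rows, ...)
#   unit_shape:    shape of one addressable unit ([] scalar, [16] vector, [16, 64] tile)
#   dtype:         element datatype ("i64", "f32", "s8", "s32", "bf16")
```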
Let's begin by preparing the boilerplate ISA specification file.
```bash
# From the tutorial root directory
cd ~/tutorials-splash25

# Copy boilerplate files to code/
./copy.sh exercise1
```

The `copy.sh` script copies the boilerplate `QKV.py` file to the `code/` directory and cleans any previous outputs.
Your QKV.py file should begin with:
"""QKV Accelerator ISA Definition"""
from taidlv2 import Accelerator
qkv = Accelerator("QKV")This imports the TAIDL framework and creates an accelerator object named "QKV". The Accelerator class provides methods to define data models, instructions, and generate programming infrastructure.
You can now edit this file in your preferred editor on your host machine.
The data model specifies the on-chip memory hierarchy: the number of buffers, their capacity, dimensionality, and element types. This directly corresponds to the formal data model construct introduced in the ACT paper.
```python
# Create an accelerator named "QKV"
qkv = Accelerator("QKV")
```

The QKV accelerator has two scratchpad buffers:
```python
# d1: Primary scratchpad for input/output tensors
# - 128 rows (first dimension)
# - 64 columns (second dimension)
# - BF16 element type
qkv.add_data_model("d1", [128], [64], "bf16")

# d2: Secondary scratchpad for intermediate computations
# - 64 rows
# - 64 columns
# - BF16 element type
qkv.add_data_model("d2", [64], [64], "bf16")
```

Interpretation:

- `"d1"`: Buffer name (used in instruction operands)
- `[128]`: Capacity in the first dimension (addressable rows)
- `[64]`: Shape of each addressable unit (64 BF16 elements per row)
- `"bf16"`: Element datatype (Brain Floating Point 16-bit)
Key Design Decisions:

- Why BF16? Offers a good balance between precision and hardware efficiency for neural network computations
- Why 64 columns? Matches common attention head dimensions and enables efficient matrix tiling
- Why two buffers? `d1` for I/O (larger), `d2` for computation (sized for 64×64 matrix intermediates)
In addition to these on-chip buffers, there's an implicit off-chip memory (d0) representing HBM/DRAM, which is always available as the source/sink for load/store operations.
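A quick capacity check on the two on-chip buffers (BF16 is 2 bytes per element):

```python
d1_bytes = 128 * 64 * 2   # 16,384 bytes = 16 KiB primary buffer
d2_bytes = 64 * 64 * 2    #  8,192 bytes =  8 KiB intermediate buffer
```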
Attributes (Constraints): Unlike classical ISAs where operands are just register numbers, tensor accelerator instructions have both:

- Addressing attributes: Buffer addresses (analogous to register indices)
- Computational attributes: Configuration parameters (e.g., `n` = number of rows to process)
Read/Write Locations: Specify which data models are accessed, at what addresses, and how many addressable units:

```python
instr.set_inputs([[buffer_name, [address_attr], [size_of_input]]])
instr.set_outputs([[buffer_name, [address_attr], [size_of_output]]])
```

Operational Semantics: A functional description in XLA-HLO syntax, using SSA (Static Single Assignment) form and tensor operations.
XLA-HLO (Accelerated Linear Algebra - High Level Operations) is a compiler IR for expressing tensor computations. Key concepts:
SSA Form:

```
%variable_name = operation(operands)
ROOT %output = final_operation
```

Tensor Types:

```
type[dimensions]   # e.g., bf16[64,64], u8[128], s32[16,16,4]
```
Common Operations:

- Element-wise: `add`, `multiply`, `exponential`, `divide`, `convert`, `bitcast_convert`
- Structural: `reshape`, `transpose`, `broadcast`, `copy`
- Reductions: `reduce_add`, `reduce_max`, `reduce_min`
- Contractions: `dot` (matrix multiplication), `convolution`
Template Variables:
Use `@c.attr_name` to reference computational attributes in HLO:
```
%In1 = bf16[`@c.n`,64] parameter(0);   # Parameterized by attribute n
```
For complete HLO documentation, see: OpenXLA HLO Operational Semantics and StableHLO Specification
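Putting these pieces together, here is a small illustrative HLO module in the dialect used throughout this exercise. It is not one of the QKV instructions, just a demonstration of SSA form, tensor types, and template variables:

```
ENTRY example {
  %In1 = bf16[`@c.n`,64] parameter(0);                         # input tile, parameterized by n
  %a = bf16[`@c.n`,64] exponential(%In1);                      # element-wise e^x
  %r = bf16[`@c.n`] reduce_add(%a), dimensions={1};            # sum across columns
  ROOT %Out0 = bf16[`@c.n`,64] broadcast(%r), dimensions={0};  # back to tile shape
}
```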
Loads data from off-chip memory (d0) into the primary scratchpad (d1) in row-major layout.
instr = qkv.add_instruction("load_rm", ["n"], ["addr_in", "addr_out"])- Name:
load_rm - Computational attributes:
["n"]- number of rows to load - Addressing attributes:
["addr_in", "addr_out"]- source (d0) and destination (d1) addresses
instr.set_inputs([["d0", ["@a.addr_in"], ["@c.n * 128"]]])- Read from
d0(off-chip memory) - Starting at address
@a.addr_in - Total bytes:
@c.n * 128(each row is 64 BF16 = 128 bytes)
instr.set_outputs([["d1", ["@a.addr_out"], ["@c.n"]]])- Write to
d1(scratchpad) - Starting at row
@a.addr_out - Number of rows:
@c.n
instr.add_semantics("""
ENTRY load_rm {
%In1 = u8[`@c.n * 128`] parameter(0); # Raw bytes from memory
%a = u8[`@c.n`,64,2] reshape(%In1); # Reshape to rows of 64 × 2-byte elements
ROOT %Out0 = bf16[`@c.n`,64] bitcast_convert(%a); # Interpret as BF16 matrix
}
""")Semantics Explanation:
- Input: Flat byte array from d0
- Reshape: Organize into rows of 64 elements (2 bytes each for BF16)
- Bitcast: Reinterpret raw bytes as BF16 values without computation
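Once the programming API is generated later in this exercise, this instruction becomes a Python call. Given the generated signature `load_rm(n, addr_in, addr_out)` shown in the next section, a call might look like:

```python
# Load 4 rows (4 × 128 bytes) from d0 address 0 into d1 starting at row 0.
# Addresses are illustrative.
api.load_rm(4, 0, 0)
```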
Loads data and simultaneously transposes it to column-major layout.
instr = qkv.add_instruction("load_cm", ["n"], ["addr_in", "addr_out"])
instr.set_inputs([["d0", ["@a.addr_in"], ["@c.n * 128"]]])
instr.set_outputs([["d1", ["@a.addr_out"], ["@c.n"]]])
instr.add_semantics("""
ENTRY load_cm {
%In1 = u8[`@c.n * 128`] parameter(0);
%a = u8[`@c.n`,64,2] reshape(%In1);
%b = bf16[`@c.n`,64] bitcast_convert(%a);
ROOT %Out0 = bf16[64,`@c.n`] transpose(%b), dimensions={1,0}; # Transpose on load
}
""")Why Column-Major? In attention computation, we need K^T. Loading K directly in transposed form avoids a separate transpose operation.
Mirror the load operations, moving data from scratchpad back to off-chip memory.
```python
# Row-major store
instr = qkv.add_instruction("store_rm", ["n"], ["addr_in", "addr_out"])
instr.set_inputs([["d1", ["@a.addr_in"], ["@c.n"]]])
instr.set_outputs([["d0", ["@a.addr_out"], ["@c.n * 128"]]])
instr.add_semantics("""
ENTRY store_rm {
  %In1 = bf16[`@c.n`,64] parameter(0);
  %a = u8[`@c.n`,64,2] bitcast_convert(%In1);
  ROOT %Out0 = u8[`@c.n * 128`] reshape(%a);
}
""")

# Column-major store (with transpose)
instr = qkv.add_instruction("store_cm", ["n"], ["addr_in", "addr_out"])
instr.set_inputs([["d1", ["@a.addr_in"], ["@c.n"]]])
instr.set_outputs([["d0", ["@a.addr_out"], ["@c.n * 128"]]])
instr.add_semantics("""
ENTRY store_cm {
  %In1 = bf16[`@c.n`,64] parameter(0);
  %a = bf16[64,`@c.n`] transpose(%In1), dimensions={1,0};
  %b = u8[64,`@c.n`,2] bitcast_convert(%a);
  ROOT %Out0 = u8[`@c.n * 128`] reshape(%b);
}
""")
```

Copies data from d2 to d1 (or conceptually between any buffers).
instr = qkv.add_instruction("mov", ["n"], ["addr_in", "addr_out"])
instr.set_inputs([["d2", ["@a.addr_in"], ["@c.n"]]])
instr.set_outputs([["d1", ["@a.addr_out"], ["@c.n"]]])
instr.add_semantics("""
ENTRY mov {
%In1 = bf16[`@c.n`,64] parameter(0);
ROOT %Out0 = bf16[`@c.n`,64] copy(%In1);
}
""")Use Case: Moving intermediate results between computation stages.
The core computational instruction implementing General Matrix Multiply.
instr = qkv.add_instruction("gemm", [], ["addr_1", "addr_2", "addr_out"])- No computational attributes (operates on fixed 64×64 tiles)
- Three address attributes: two inputs, one output
instr.set_inputs([["d1", ["@a.addr_1"], ["64"]],
["d1", ["@a.addr_2"], ["64"]]])
instr.set_outputs([["d2", ["@a.addr_out"], ["64"]]])instr.add_semantics("""
ENTRY gemm {
%In1 = bf16[64,64] parameter(0);
%In2 = bf16[64,64] parameter(1);
ROOT %Out0 = bf16[64,64] dot(%In1, %In2),
lhs_contracting_dims={1},
rhs_contracting_dims={0};
}
""")Semantics Explanation:
dot: Matrix multiplication operatorlhs_contracting_dims={1}: Contract along dimension 1 of left operand (columns)rhs_contracting_dims={0}: Contract along dimension 0 of right operand (rows)- Implements:
Out0[i,j] = Σ_k In1[i,k] * In2[k,j]
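This contraction is exactly a standard matrix multiply. A quick NumPy check of the equivalence (illustrative, using float32 in place of BF16):

```python
import numpy as np

In1 = np.random.randn(64, 64).astype(np.float32)
In2 = np.random.randn(64, 64).astype(np.float32)

# Contracting dim 1 of the left operand against dim 0 of the right operand:
out = np.einsum("ik,kj->ij", In1, In2)
assert np.allclose(out, In1 @ In2)   # same result as a plain matmul
```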
Applies the softmax function along each row: softmax(x)_i = exp(x_i) / Σ_j exp(x_j)
instr = qkv.add_instruction("softmax", ["n"], ["addr"])
instr.set_inputs([["d2", ["@a.addr"], ["@c.n"]]])
instr.set_outputs([["d2", ["@a.addr"], ["@c.n"]]]) # In-place operation
instr.add_semantics("""
ENTRY softmax {
%In1 = bf16[`@c.n`,64] parameter(0);
%a = bf16[`@c.n`,64] exponential(%In1); # e^x for each element
%reduced = bf16[`@c.n`] reduce_add(%a), dimensions={1}; # Sum along rows
%b = bf16[`@c.n`,64] broadcast(%reduced), dimensions={0}; # Broadcast sums
ROOT %Out0 = bf16[`@c.n`,64] divide(%a, %b); # Normalize
}
""")Semantics Explanation:
- Compute
exp(x)element-wise - Sum along dimension 1 (across columns within each row)
- Broadcast the row sums back to matrix shape
- Divide element-wise to normalize
Numerical Note: Production implementations use softmax(x - max(x)) for stability, but this simplified version illustrates the concept.
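For comparison, a numerically stable variant in NumPy; an ISA-level version would additionally need a `reduce_max` and an element-wise subtraction before the exponential:

```python
import numpy as np

def softmax_rows(x):
    """Numerically stable row-wise softmax: exp(x - max(x)) / sum."""
    shifted = x - x.max(axis=1, keepdims=True)  # subtract each row's max first
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)
```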
The final step in the ISA specification is to add the generation command. This will compile your ISA definition into usable programming infrastructure.
Add the following line at the end of your QKV.py file:
```python
# Generate kernel programming APIs and functional simulator (TAIDL-TO)
qkv.generate_oracle()
```

What gets generated?

- Kernel Programming API: Python functions for each instruction (`api.load_rm()`, `api.gemm()`, etc.)
- Test Oracle (Functional Simulator): A fast, auto-generated simulator for correctness testing (TAIDL-TO)
- Kernel Framework: Decorators and utilities for writing and testing accelerator kernels
Here's the full QKV.py file with all components:
"""QKV Accelerator ISA Definition"""
from taidlv2 import Accelerator
qkv = Accelerator("QKV")
# Define Data Models
qkv.add_data_model("d1", [128], [64], "bf16")
qkv.add_data_model("d2", [64], [64], "bf16")
# Load instructions
instr = qkv.add_instruction("load_rm", ["n"], ["addr_in", "addr_out"])
instr.set_inputs([["d0", ["@a.addr_in"], ["@c.n * 128"]]])
instr.set_outputs([["d1", ["@a.addr_out"], ["@c.n"]]])
instr.add_semantics("""
ENTRY load_rm {
%In1 = u8[`@c.n * 128`] parameter(0);
%a = u8[`@c.n`,64,2] reshape(%In1);
ROOT %Out0 = bf16[`@c.n`,64] bitcast_convert(%a);
}
""")
instr = qkv.add_instruction("load_cm", ["n"], ["addr_in", "addr_out"])
instr.set_inputs([["d0", ["@a.addr_in"], ["@c.n * 128"]]])
instr.set_outputs([["d1", ["@a.addr_out"], ["@c.n"]]])
instr.add_semantics("""
ENTRY load_cm {
%In1 = u8[`@c.n * 128`] parameter(0);
%a = u8[`@c.n`,64,2] reshape(%In1);
%b = bf16[`@c.n`,64] bitcast_convert(%a);
ROOT %Out0 = bf16[64,`@c.n`] transpose(%b), dimensions={1,0};
}
""")
# Store instructions
instr = qkv.add_instruction("store_rm", ["n"], ["addr_in", "addr_out"])
instr.set_inputs([["d1", ["@a.addr_in"], ["@c.n"]]])
instr.set_outputs([["d0", ["@a.addr_out"], ["@c.n * 128"]]])
instr.add_semantics("""
ENTRY store_rm {
%In1 = bf16[`@c.n`,64] parameter(0);
%a = u8[`@c.n`,64,2] bitcast_convert(%In1);
ROOT %Out0 = u8[`@c.n*128`] reshape(%a);
}
""")
instr = qkv.add_instruction("store_cm", ["n"], ["addr_in", "addr_out"])
instr.set_inputs([["d1", ["@a.addr_in"], ["@c.n"]]])
instr.set_outputs([["d0", ["@a.addr_out"], ["@c.n * 128"]]])
instr.add_semantics("""
ENTRY store_cm {
%In1 = bf16[`@c.n`,64] parameter(0);
%a = bf16[64,`@c.n`] transpose(%In1), dimensions={1,0};
%b = u8[64,`@c.n`,2] bitcast_convert(%a);
ROOT %Out0 = u8[`@c.n*128`] reshape(%b);
}
""")
# Move instruction
instr = qkv.add_instruction("mov", ["n"], ["addr_in", "addr_out"])
instr.set_inputs([["d2", ["@a.addr_in"], ["@c.n"]]])
instr.set_outputs([["d1", ["@a.addr_out"], ["@c.n"]]])
instr.add_semantics("""
ENTRY mov {
%In1 = bf16[`@c.n`,64] parameter(0);
ROOT %Out0 = bf16[`@c.n`,64] copy(%In1);
}
""")
# Compute instructions
instr = qkv.add_instruction("gemm", [], ["addr_1", "addr_2", "addr_out"])
instr.set_inputs([["d1", ["@a.addr_1"], ["64"]], ["d1", ["@a.addr_2"], ["64"]]])
instr.set_outputs([["d2", ["@a.addr_out"], ["64"]]])
instr.add_semantics("""
ENTRY gemm {
%In1 = bf16[64,64] parameter(0);
%In2 = bf16[64,64] parameter(1);
ROOT %Out0 = bf16[64,64] dot(%In1, %In2), lhs_contracting_dims={1}, rhs_contracting_dims={0};
}
""")
instr = qkv.add_instruction("softmax", ["n"], ["addr"])
instr.set_inputs([["d2", ["@a.addr"], ["@c.n"]]])
instr.set_outputs([["d2", ["@a.addr"], ["@c.n"]]])
instr.add_semantics("""
ENTRY softmax {
%In1 = bf16[`@c.n`,64] parameter(0);
%a = bf16[`@c.n`,64] exponential(%In1);
%reduced = bf16[`@c.n`] reduce_add(%a), dimensions={1};
%b = bf16[`@c.n`,64] broadcast(%reduced), dimensions={0};
ROOT %Out0 = bf16[`@c.n`,64] divide(%a, %b);
}
""")
# Generate programming APIs and test oracle (functional simulator)
qkv.generate_oracle()Now that you've completed the ISA specification, it's time to generate the test oracle. This step requires running inside the Docker container.
```bash
# From your host machine in the tutorials-splash25 directory, launch Docker
./docker.sh

# Inside the Docker container, you'll be at /workspace (which maps to code/)
# Execute the ISA specification
python QKV.py
```

Expected Output:
```
Copied generic oracle structure to /workspace/targets/QKV/oracle
Generated api.py
Oracle API generation complete for QKV
Building oracle for QKV
Oracle build complete for QKV
```

The oracle API is located at `/workspace/targets/QKV/oracle/`.
What gets generated?
The generator creates a kernel programming interface along with its functional simulator in targets/QKV/oracle/:
- `api.py`: Python functions for each instruction, with HLO semantics compilation:

```python
def load_rm(n, addr_in, addr_out):
    """Load n rows from HBM to scratchpad d1 (row-major)

    Compiles HLO: u8[n*128] -> reshape -> bitcast -> bf16[n,64]
    """
    ...

def gemm(addr_1, addr_2, addr_out):
    """Matrix multiply: d1[addr_1] × d1[addr_2] -> d2[addr_out]

    Compiles HLO: dot(bf16[64,64], bf16[64,64]) -> bf16[64,64]
    """
    ...
```
- `decorator.py`: `@kernel` decorator framework
  - Defines kernel metadata (HBM layout, inputs/outputs, constants)
  - Compiles instruction sequences into HLO modules
  - Provides simulation and debugging utilities
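As a rough preview of Exercise 2, a kernel written against this generated infrastructure might look like the sketch below. The decorator usage and import paths are assumptions based on the directory layout shown later; the actual interface is introduced in the next exercise:

```python
# Hypothetical sketch -- exact decorator arguments and module paths may differ.
from targets.QKV.oracle import api
from targets.QKV.oracle.decorator import kernel

@kernel  # compiles the instruction sequence below into an HLO module
def copy_tile():
    api.load_rm(64, 0, 0)    # HBM -> d1, 64 rows, row-major
    api.store_rm(64, 0, 0)   # d1 -> HBM, 64 rows, row-major
```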
The auto-generated TAIDL-TO (TAIDL Test Oracle) is a functional simulator that executes your ISA semantics. Key features:
- Fast Execution: Uses JAX/XLA compilation for hardware-accelerated simulation (orders of magnitude faster than existing functional simulators)
- Correctness Testing: Helps validate kernel implementations against golden reference
- Debugging Support: Utilities for inspecting on-chip data, tracing instruction execution
- Automatic Generation: Derived directly from your HLO semantics -- zero manual implementation
- Scalability: Simulates large-scale models efficiently (e.g., I-BERT on Gemmini simulates in a few seconds, compared to almost an hour with Gemmini's Spike functional simulator)
The simulator compiles and executes the XLA-HLO semantics you wrote for each instruction, providing a golden reference for correctness validation.
Generated directory structure:
```
code/
├── QKV.py                  # Your ISA specification
├── targets/QKV/            # Target-specific generated tools
│   └── oracle/             # Test oracle (functional simulator)
│       ├── __init__.py     # Package initialization
│       ├── api.py          # Instruction API functions
│       ├── decorator.py    # Kernel compilation framework
│       ├── utils.py        # Utility functions
│       └── build/          # JAX/XLA compiled artifacts
```
In this exercise, you've learned:
- Formal ISA Specification: How to describe accelerator ISAs using data models, instructions with constraints, operand specifications, and operational semantics
- Memory Hierarchy Design: The evolution from scalar registers to tensor scratchpads, and how to formalize different memory architectures
- Instruction Semantics: Expressing coarse-grained tensor operations using XLA-HLO, including:
  - Data movement with layout transformations
  - Matrix computations (GEMM)
  - Complex operations (softmax)
The QKV accelerator specification demonstrates key design principles of TAIDL:
- Declarative Approach: Describe what the hardware does, not how to compile for it
  - Instruction semantics specify computational intent, not compilation strategy
  - Enables automatic generation of both test oracles and compiler backends
- Compositionality: Complex operations built from primitive tensor operators
  - Softmax decomposes into exponential, reduction, broadcast, and division
  - Matrix operations expressed as high-level dot products, not scalar loops
- Semantic Precision: Bit-precise operational semantics using XLA-HLO
  - Exact type conversions (e.g., u8 to bf16 via reshape + bitcast)
  - Layout transformations (row-major ↔ column-major) precisely specified
Now that you've specified the QKV accelerator ISA and generated the programming APIs and test oracle, you're ready to:
- Write accelerator kernels using the generated APIs
- Test kernels using the auto-generated functional simulator (TAIDL-TO)
- Debug on-chip data with instrumentation utilities
- Implement end-to-end attention using QKV accelerator instructions
Proceed to Exercise 2: Writing Accelerator Kernels where you'll write kernels ranging from simple data movement to complete QKV attention computation!
- XLA-HLO Operation Semantics: https://openxla.org/xla/operation_semantics
- TAIDL Paper: Section 3-5 for ISA formalization and test oracle generation
- ACT Paper: See §3 (Problem Formulation) for theoretical foundations
- Gemmini Architecture: https://github.com/ucb-bar/gemmini
