This guide walks through setting up a new CUDA kernel for optimization with CUCo.
Three commands to go from seed kernel to running evolution:
# 1. One-time cluster setup (auto-detects CUDA, NCCL, MPI, GPU)
cuco_init --setup
# 2. Scaffold a new workload from your seed kernel
cuco_init /path/to/my_kernel.cu
# 3. Run evolution
cuco_run my_kernel --generations 50cuco_init creates a ready-to-run workload directory with all required files, auto-detects your cluster environment, validates the seed file, and prints the exact run command. See cuco_init --help for all options.
- Creates
workloads/<name>/with your seed file - Generates
evaluate.pypre-configured with your cluster's CUDA/NCCL/MPI paths (from~/.cuco/site.yaml) - Copies
run_evo.py,run_transform.py,nccl_api_docs.pywith correct defaults - Generates
.gitignore - Validates that your seed prints
Time: X.XXXX msandVerification: PASS - Prints a review checklist and the exact run command
If you prefer not to use cuco_init, you can copy and customize directly:
cp -r workloads/ds_v3_moe workloads/my_workload
cp /path/to/my_kernel.cu workloads/my_workload/
cd workloads/my_workload
python run_evo.py --init_program my_kernel.cu --results_dir results_my_workload --num_generations 18A CUCo workload directory contains:
workloads/my_workload/
├── my_kernel.cu # Seed CUDA kernel (host-driven or device-initiated)
├── evaluate.py # Build, run, correctness check, fitness scoring
├── run_evo.py # Evolution launcher with prompt customization
├── run_transform.py # Fast-path transformation launcher (optional)
├── nccl_api_docs.py # API documentation / reference material for LLM context
└── .gitignore # Excludes build artifacts, results, __pycache__
When using cuco_init, all of these files are generated automatically. When copying from ds_v3_moe, they come with the directory.
The seed kernel is the starting point for evolution. It can be either:
- Host-driven (standard NCCL collectives called from CPU) — CUCo's fast-path agent will convert it to device-initiated code before evolution begins.
- Device-initiated (already using GIN or LSA) — Evolution starts directly.
- The kernel must be a single
.cufile that compiles withnvcc. - It must run via
mpirun -np Nacross the target GPU configuration. - The kernel must print a timing line that
evaluate.pycan parse. The default regex expects:If your kernel prints timing in a different format, either add aTime: X.XXXX msprintf("Time: %.4f ms\n", elapsed);line, or updateTIME_PATTERNinevaluate.py(see Step 3). - The kernel must print
"Verification: PASS"when results are correct.
Requirements 3 and 4 are not hard constraints — the evaluation logic is entirely defined in evaluate.py, so you can implement any scoring scheme (timing, accuracy, throughput, memory usage, or a combination). The kernel just needs to produce output that your evaluate.py knows how to parse.
If your kernel is already device-initiated or you want to skip the fast-path, add EVOLVE-BLOCK markers manually:
// EVOLVE-BLOCK-START
__global__ void myCommKernel(...) {
// Communication logic — CUCo can modify this
}
__global__ void myComputeKernel(...) {
// Compute logic — CUCo can modify this
}
// EVOLVE-BLOCK-END
// FROZEN: MPI/NCCL initialization, main(), verification
int main(int argc, char** argv) {
// ... MPI_Init, ncclCommInitRank, etc.
// EVOLVE-BLOCK-START
// Pipeline: stream creation, memory allocation, kernel launches, timing
cudaEventRecord(start, stream);
myCommKernel<<<...>>>();
myComputeKernel<<<...>>>();
cudaEventRecord(stop, stream);
// EVOLVE-BLOCK-END
// FROZEN: verification, cleanup
printf("Verification: PASS\n");
printf("Time: %.4f ms\n", elapsed);
}Rules followed by Fast-Path Agent for EVOLVE-BLOCK placement:
| Region | Mutable? | Why |
|---|---|---|
Kernel definitions (__global__, __device__) |
Yes | Core optimization target |
| Pipeline logic (streams, launches, events) | Yes | Architecture exploration |
| MPI/NCCL initialization | No | Breaking this breaks everything |
main() signature and MPI setup |
No | Fixed infrastructure |
| Verification and output formatting | No | Evaluation depends on exact output format |
| Warmup section | No | Removing it inflates timing by 10–50 ms |
The purpose of this file is to supply the LLM with all the reference material it needs — API documentation, usage examples, header snippets, or any other context — so it can generate correct code for APIs or libraries that may be poorly represented in its training data. The file name is arbitrary (e.g., nccl_api_docs.py, cuda_graph_docs.py, my_lib_reference.py); what matters is that your run_evo.py imports the variables and injects them into the prompt.
The included workloads/ds_v3_moe/nccl_api_docs.py is one such file, providing NCCL device-initiated API documentation. It exports these variables:
| Variable | Content |
|---|---|
NCCL_DEVICE_API_REFERENCE |
Overview of device-initiated communication (GIN, LSA, teams, thread groups) |
NCCL_GIN_API_DOC |
GIN-specific API: ncclGin, put, flush, signals, barriers |
NCCL_LSA_API_DOC |
LSA-specific API: barriers, ncclGetLsaPointer, peer access |
NCCL_GIN_PURE_EXAMPLE |
Complete working GIN AlltoAll example |
NCCL_THREAD_GROUPS_DOC |
ncclCoopThread, ncclCoopWarp, ncclCoopCta |
NCCL_TEAMS_DOC |
ncclTeamWorld, ncclTeamLsa, ncclTeamRail |
NCCL_HOST_TO_DEVICE_COOKBOOK |
GIN kernel signature and host launch pattern |
NCCL_HEADER_* |
Raw C++ header snippets (gin.h, core.h, coop.h, etc.) |
For a new workload that also uses NCCL device APIs, you can typically reuse this file unchanged (it's included when you copy the directory). For workloads that use entirely different libraries or custom APIs, create your own docs file exporting string variables with the relevant documentation, then import and inject them into the prompt in run_evo.py.
evaluate.py is the bridge between CUCo and your hardware. The framework calls it automatically for every candidate program:
python evaluate.py --program_path gen_5/main.cu --results_dir gen_5/resultsIt handles the full pipeline: build the .cu file, run the binary, check correctness, parse timing, and write metrics.json + correct.json.
The source and binary filenames are derived automatically from --program_path — you do not need to change them per workload. However, you may need to review and adapt the following sections for your environment.
When using cuco_init, the generated evaluate.py imports all cluster-specific paths from cuco.site_config, which reads from ~/.cuco/site.yaml (auto-detected during cuco_init --setup). You do not need to edit any paths manually.
To change cluster settings (e.g., when moving to a different machine), either:
- Re-run
cuco_init --setup, or - Edit
~/.cuco/site.yamldirectly
If using the manual setup (copied from ds_v3_moe), the original evaluate.py has hardcoded paths that you'll need to update.
Key build flags:
-rdc=true— Required for device-side NCCL (relocatable device code)- Static NCCL linking (
libnccl_static.a) is required for device-initiated APIs (GIN, LSA)
MPI_NP controls how many ranks mpirun launches. It must match the NUM_RANKS in your seed kernel:
MPI_NP = 2When using cuco_init, the generated evaluate.py uses build_mpirun_cmd() from cuco.site_config, which constructs the mpirun command dynamically based on ~/.cuco/site.yaml:
- Single-node (default): Simple
mpirun -np Nwith no hostfile or IB flags. - Multi-node: Includes
--hostfile, NCCL network flags, and MCA TCP settings.
To switch between single-node and multi-node, run cuco_init --setup and answer the multi-node prompt, or edit ~/.cuco/site.yaml directly:
multi_node: true
hostfile: /path/to/hostfile
network:
nccl_socket_ifname: enp75s0f1np1
nccl_ib_hca: mlx5_1
nccl_ib_gid_index: 3
btl_tcp_if_include: enp75s0f1np1
oob_tcp_if_include: enp75s0f1np1All workloads on the same cluster share this config — change it once, and all workloads pick up the new settings.
evaluate.py parses program output using these regexes:
TIME_PATTERN = re.compile(r"^Time:\s*([\d.]+)\s*ms", re.MULTILINE)
VERIFICATION_PASS_STR = "Verification: PASS"Your kernel must print output that matches these patterns. The simplest approach is to add a matching printf to your kernel:
printf("Time: %.4f ms\n", elapsed_ms);
printf("Verification: PASS\n");If your kernel prints timing in a different format, update the regex instead. For example, if your kernel prints "Elapsed: 5.23 ms":
TIME_PATTERN = re.compile(r"^Elapsed:\s*([\d.]+)\s*ms", re.MULTILINE)For multi-rank workloads with unequal token counts, the evaluator also supports token-weighted averaging via:
RANK_HEADER_PATTERN = re.compile(
r"RESULTS \(Rank (\d+).*?receives (\d+) tokens\).*?\nTime:\s*([\d.]+)\s*ms",
re.DOTALL,
)The default fitness function maps lower time to higher score:
def score_from_time_ms(time_ms: float) -> float:
return 10000.0 / (1.0 + time_ms)This works for most latency-focused workloads. For multi-objective optimization (e.g., balancing latency and memory), modify this function and add sub-metrics to the public dict in metrics.json so the LLM can see the breakdown:
def compute_score(time_ms, memory_mb, accuracy):
latency_score = 10000.0 / (1.0 + time_ms)
memory_score = 1000.0 / (1.0 + memory_mb)
return 0.7 * latency_score + 0.2 * memory_score + 0.1 * accuracyevaluate.py writes exactly two files:
metrics.json:
{
"combined_score": 83.85,
"public": {
"time_ms": 118.26,
"throughput_gb_s": 42.5,
"build_duration_sec": 6.72
},
"private": {
"peak_memory_mb": 1024,
"nvcc_warnings": 2
},
"text_feedback": "Optional LLM-generated suggestions"
}correct.json:
{
"correct": true
}Or on failure:
{
"correct": false,
"error": "Build failed: undefined reference to ncclGin"
}See Evaluation for the full reference.
If your seed kernel uses host-side NCCL collectives, write a transformation launcher:
from cuco.transform import CUDAAnalyzer, HostToDeviceTransformer
from cuco.transform.transformer import TransformConfig
config = TransformConfig(
rewrite_model="bedrock/us.anthropic.claude-sonnet-4-6",
nvcc_path="/usr/local/cuda-13.1/bin/nvcc",
nccl_include="/usr/local/nccl_2.28.9-1+cuda13.0_x86_64/include",
nccl_static_lib="/usr/local/nccl_2.28.9-1+cuda13.0_x86_64/lib/libnccl_static.a",
num_mpi_ranks=2,
hostfile="build/hostfile",
api_type="gin",
two_stage=True,
reference_code=open("reference_gin_example.cu").read(),
nccl_api_docs=NCCL_API_FULL,
)
transformer = HostToDeviceTransformer(config)
result = transformer.transform("my_kernel.cu", work_dir="_transform_work")When copying from ds_v3_moe, the run_transform.py is already included. Pass your kernel via CLI:
python run_transform.py --source my_kernel.cuSee Fast-Path Agent for details on the two-stage transformation.
This is the main evolution launcher. When copying from ds_v3_moe, pass workload-specific values via CLI:
python run_evo.py \
--init_program my_kernel.cu \
--results_dir results_my_workload \
--num_generations 18 \
--api ginKey decisions if you need to customize beyond CLI arguments:
Build a task system message that tells the LLM what to optimize:
_COMMON_CONSTRAINTS = """
## Evolve Block Structure
The file contains EVOLVE-BLOCK regions. You have FULL AUTONOMY over code inside these blocks.
## Hard Constraints
- Only modify code between EVOLVE-BLOCK-START and EVOLVE-BLOCK-END markers.
- Correctness: the program must print "Verification: PASS".
- Do NOT use external libraries (cuBLAS, cuDNN, etc.).
"""
_STRATEGIES = """
## Choosing the Right Fusion Level
### Kernel-Level Fusion
Fuse compute and communication into a single persistent kernel when...
### Stream-Level Overlap
Use separate CUDA streams when...
### Split Communication Kernels
Break into separate PUT and WAIT launches when...
"""Include API-specific knowledge (GIN rules, LSA rules) and hardware context.
_PHASE_CONFIGS = {
"explore": {
"patch_type_probs": [0.15, 0.70, 0.15], # diff, full, cross
"temperatures": [0.2, 0.5, 0.8],
},
"exploit": {
"patch_type_probs": [0.25, 0.60, 0.15],
"temperatures": [0.0, 0.2, 0.5],
},
}from cuco.core import EvolutionRunner, EvolutionConfig
from cuco.database import DatabaseConfig
from cuco.launch import LocalJobConfig
evo_config = EvolutionConfig(
task_sys_msg=task_msg,
patch_types=["diff", "full", "cross"],
patch_type_probs=[0.15, 0.70, 0.15],
num_generations=18,
max_parallel_jobs=1,
language="cuda",
llm_models=["bedrock/us.anthropic.claude-opus-4-6-v1"],
llm_kwargs=dict(temperatures=[0.2, 0.5, 0.8], max_tokens=32768),
meta_rec_interval=8,
meta_max_recommendations=5,
init_program_path="my_kernel.cu",
results_dir="results_my_workload",
embedding_model="bedrock-amazon.titan-embed-text-v1",
use_text_feedback=True,
)
db_config = DatabaseConfig(
db_path="evolution_db.sqlite",
num_islands=2,
archive_size=60,
migration_interval=8,
parent_selection_strategy="weighted",
)
job_config = LocalJobConfig(eval_program_path="evaluate.py")
runner = EvolutionRunner(
evo_config=evo_config,
job_config=job_config,
db_config=db_config,
)
runner.run()See Configuration Reference for all parameters.
CUCo requires LLM API credentials for evolution, feedback, and transformation. Create a .env file in the CUCo root directory:
# For AWS Bedrock (default)
AWS_ACCESS_KEY_ID=your_key
AWS_SECRET_ACCESS_KEY=your_secret
AWS_REGION_NAME=us-east-1The .env file is loaded automatically by evaluate.py, run_evo.py, and run_transform.py.
Before running a full evolution, verify each component:
cd workloads/my_workload
# Build
nvcc -o my_kernel my_kernel.cu -I... -rdc=true -arch=sm_80 ...
# Run (single-node)
CUDA_VISIBLE_DEVICES=0,1 mpirun -np 2 ./my_kernel
# Run (multi-node)
mpirun --hostfile build/hostfile -np 2 ./my_kernel
# Should print: "Verification: PASS" and "Time: X.XXXX ms"mkdir -p test_results
python evaluate.py --program_path my_kernel.cu --results_dir test_results
cat test_results/metrics.json # Should show combined_score > 0
cat test_results/correct.json # Should show "correct": truepython run_evo.py --init_program my_kernel.cu --results_dir results_test --num_generations 3Check results_test/gen_0/ for the first generation's output.
Before running a full evolution, verify:
-
cuco_init --setuphas been run (or~/.cuco/site.yamlexists with correct paths) - Seed kernel compiles with your nvcc command
- Seed kernel runs correctly via mpirun on your target GPUs
- Seed kernel prints
Time: X.XXXX ms(or you've updatedTIME_PATTERN) - Seed kernel prints
Verification: PASS -
evaluate.pycorrectly parses timing and writesmetrics.json/correct.json - EVOLVE-BLOCK markers are placed around mutable regions only
- Frozen regions (init, main, verification) are outside EVOLVE-BLOCK
-
.envfile has valid LLM API credentials -
nccl_api_docs.pyis present and importable
-
Timing format mismatch: The default
evaluate.pyregex expects^Time:\s*([\d.]+)\s*ms. If your kernel prints timing differently (e.g.,"Elapsed: 5.23 ms"or"Total Pipeline Time: 3.14 ms"), the evaluator will silently fail to parse it and report score 0. Either add a matchingprintfto your kernel or updateTIME_PATTERN. -
Missing
-rdc=true: Device-side NCCL requires relocatable device code. Without it, linking fails silently or produces wrong results. -
Wrong NCCL version: Device-initiated APIs (GIN, LSA) require NCCL >= 2.28.9. Older versions will compile but crash at runtime.
-
Timing includes warmup: Always include a warmup section before the timed region. The first GIN/NCCL call triggers lazy RDMA initialization (10-50 ms).
-
EVOLVE-BLOCK too broad: If the entire file is mutable, the LLM may break initialization or verification code. Keep EVOLVE-BLOCKs focused on kernels and pipeline logic.
-
EVOLVE-BLOCK too narrow: If only a single kernel is mutable, the LLM cannot explore architectural changes (e.g., switching from sequential to pipelined execution).
-
Network configuration: Inter-node GIN requires correct
NCCL_SOCKET_IFNAME,NCCL_IB_HCA, andNCCL_IB_GID_INDEX. Wrong values cause silent hangs. -
Static NCCL linking: Device APIs require static linking (
libnccl_static.a), not dynamic (-lnccl). -
Single-node vs multi-node mismatch: If running on a single node, make sure
~/.cuco/site.yamlhasmulti_node: false. When using the manual setup (withoutcuco_init), remove the--hostfile,--map-by node, and InfiniBand flags from_run_binaryinevaluate.py.