This guide walks you through installing CUCo and running the included DeepSeek-V3 MoE example end-to-end.
- GPUs: NVIDIA A100 (or later) with NVLink (intra-node) or RoCE/InfiniBand (inter-node)
- Multi-GPU: At least 2 GPUs, potentially across nodes depending on the workload
| Dependency | Version | Notes |
|---|---|---|
| Python | >= 3.10 | 3.10, 3.11, or 3.12 |
| CUDA | >= 13.1 | nvcc compiler |
| NCCL | >= 2.28.9 | Must include device-side API headers (gin.h, etc.) |
| MPI | OpenMPI | For mpirun-based multi-GPU execution |
| Git | Any | For cloning the repository |
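As a quick preflight before installing, you can check that the required tools are discoverable and that the interpreter is new enough. A minimal sketch using only the standard library (it checks presence on PATH, not versions; nvcc may instead live at a hard-coded path, as described later in this guide):

```python
import shutil
import sys

def preflight(tools=("nvcc", "mpirun", "git")):
    """Report which required tools are on PATH and whether Python is >= 3.10."""
    report = {t: shutil.which(t) is not None for t in tools}
    report["python>=3.10"] = sys.version_info >= (3, 10)
    return report

for name, ok in preflight().items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```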
CUCo requires access to at least one LLM provider. The default configuration uses Anthropic Claude via AWS Bedrock. See LLM Backends for all supported providers.
```bash
git clone https://github.com/UT-InfraAI/cuco.git
cd cuco
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
```

Or with uv (faster):

```bash
uv venv
source .venv/bin/activate
uv sync
```

Create a .env file in the repository root:
```
# AWS Bedrock (default provider)
AWS_DEFAULT_REGION=us-east-1
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION_NAME=us-east-1
```

CUCo loads this file automatically via python-dotenv. See LLM Backends for other providers (OpenAI, Gemini, DeepSeek, etc.).
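Before kicking off a long run, it can be worth confirming the credentials are actually visible in the environment. A minimal sketch, using the variable names from the .env above (the script itself is illustrative, not part of CUCo):

```python
import os

# Variable names from the .env example above
REQUIRED = (
    "AWS_DEFAULT_REGION",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_REGION_NAME",
)

def missing_vars(env=os.environ):
    """Return the names of required credentials that are unset or empty."""
    return [k for k in REQUIRED if not env.get(k)]

if missing_vars():
    print("Missing:", ", ".join(missing_vars()))
else:
    print("All Bedrock credentials present.")
```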
The example evaluate.py expects these paths (edit them if your installation differs):

```python
NVCC = "/usr/local/cuda-13.1/bin/nvcc"
NCCL_INCLUDE = "/usr/local/nccl_2.28.9-1+cuda13.0_x86_64/include"
NCCL_STATIC_LIB = "/usr/local/nccl_2.28.9-1+cuda13.0_x86_64/lib/libnccl_static.a"
```

For inter-node experiments, create a hostfile at workloads/ds_v3_moe/build/hostfile:

```
node1 slots=1
node2 slots=1
```
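Since these paths are hard-coded, a quick existence check can save a failed build later. A small sketch whose constants mirror the ones in evaluate.py (the helper itself is illustrative):

```python
from pathlib import Path

# Same constants as in evaluate.py; adjust to your installation.
PATHS = {
    "NVCC": "/usr/local/cuda-13.1/bin/nvcc",
    "NCCL_INCLUDE": "/usr/local/nccl_2.28.9-1+cuda13.0_x86_64/include",
    "NCCL_STATIC_LIB": "/usr/local/nccl_2.28.9-1+cuda13.0_x86_64/lib/libnccl_static.a",
}

def check_paths(paths):
    """Return the names whose paths do not exist on this machine."""
    return [name for name, p in paths.items() if not Path(p).exists()]

for name in check_paths(PATHS):
    print(f"{name} not found; edit evaluate.py")
```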
The included example is a DeepSeek-V3 MoE (Mixture of Experts) dispatch-compute-combine workload running across 2 GPUs over inter-node RoCE links.
If starting from a host-driven NCCL program, the fast-path agent converts it to device-initiated GIN:
```bash
cd workloads/ds_v3_moe
python run_transform.py
```

This runs the three-step pipeline (analyze, host-to-device, evolve-block annotation) and writes the transformed kernel to _transform_host_output/. The included ds_v3_moe.cu seed already has EVOLVE-BLOCK markers, so this step is optional for the provided example.
Options:
| Flag | Default | Description |
|---|---|---|
| --source | ds_v3_moe.cu | Input CUDA source file |
| --no-agent | off | Use structured LLM loop instead of Claude Code agent |
| --model | sonnet | Claude model for agent mode |
| --max_iterations | 5 | Max iterations in no-agent mode |
Run the slow-path agent to optimize the kernel:
```bash
cd workloads/ds_v3_moe
python run_evo.py --num_generations=18 --api=gin
```

This runs a two-phase evolution:
- Phase 1 (explore): First 40% of generations. High-temperature full rewrites to discover diverse architectures.
- Phase 2 (exploit): Remaining 60%. Low-temperature diff patches to refine the best designs.
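The split between the two phases can be sketched as a simple function of the generation index. This is my own illustration of the 40/60 split described above, not CUCo's actual scheduling code; in particular, whether the boundary is floored or rounded is an assumption here:

```python
def phase_for(gen, num_generations=60, explore_fraction=0.4):
    """Map a 0-based generation index to its evolution phase.

    The first explore_fraction of the budget uses high-temperature
    full rewrites; the remainder uses low-temperature diff patches.
    """
    explore_gens = int(num_generations * explore_fraction)
    return "explore" if gen < explore_gens else "exploit"

# With the example run above (--num_generations=18, default 0.4 split),
# generations 0-6 explore and 7-17 exploit:
print([phase_for(g, num_generations=18) for g in range(18)])
```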
Options:
| Flag | Default | Description |
|---|---|---|
| --num_generations | 60 | Total generation budget |
| --results_dir | results_ds_v3_moe | Output directory for results |
| --api | gin | Communication API: gin or lsa |
| --explore_fraction | 0.4 | Fraction of budget for explore phase |
| --init_program | ds_v3_moe.cu | Seed program path |
| --gin_ref | None | Reference GIN example for prompts |
| --lsa_ref | None | Reference LSA example for prompts |
During evolution, watch the console for per-generation output:
```
Generation 3 — Score: 83.86 (best: 83.86)
Time: 118.26 ms | Correct: True | Patch: full | Model: claude-opus-4-6
```
Results are saved to results_ds_v3_moe/ as they complete.
Launch the interactive web UI:
```bash
cuco_visualize --db workloads/ds_v3_moe/results_ds_v3_moe/evolution_db.sqlite --open
```

This opens a browser with:
- Tree view: Lineage tree showing parent-child relationships and scores
- Programs table: Sortable table of all candidates with metrics
- Embeddings: Similarity heatmap and clustering
- Meta scratchpad: Cross-generation optimization recommendations
See Visualization for details.
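The same SQLite database can also be inspected directly with Python's built-in sqlite3 module. Its table schema is not documented in this guide, so listing the tables first is the safe move (an illustrative snippet, not part of CUCo):

```python
import sqlite3

def list_tables(db_path):
    """Return the names of all tables in an SQLite database."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [name for (name,) in rows]

# e.g. list_tables("workloads/ds_v3_moe/results_ds_v3_moe/evolution_db.sqlite")
```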
After evolution, results_ds_v3_moe/ contains:
```
results_ds_v3_moe/
├── evolution_db.sqlite        # SQLite database of all candidates
├── experiment_config.yaml     # Configuration snapshot
├── meta_memory.json           # Meta-learning state
├── best/                      # Symlink to best generation
│   ├── ds_v3_moe.cu           # Best evolved kernel
│   └── results/
│       └── metrics.json       # Best score and timing
├── gen_0/                     # Generation 0 (seed)
│   ├── ds_v3_moe.cu           # Evolved program
│   ├── original.cu            # Parent code
│   ├── main.cu                # Copy used for evaluation
│   ├── edit.diff              # Mutation diff
│   ├── rewrite.txt            # LLM output
│   └── results/
│       ├── metrics.json       # Score, timing, feedback
│       ├── correct.json       # Correctness result
│       ├── build.log          # Compiler output
│       └── run.log            # Runtime output
├── gen_1/
│   └── ...
└── meta_8.txt                 # Meta-summary at generation 8
```
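A short helper can pull the headline numbers out of the best candidate. The best/results/metrics.json layout is taken from the tree above; the helper itself is illustrative:

```python
import json
from pathlib import Path

def best_metrics(results_dir):
    """Load the best candidate's metrics.json from a results directory."""
    path = Path(results_dir) / "best" / "results" / "metrics.json"
    with path.open() as f:
        return json.load(f)

# e.g.:
# m = best_metrics("workloads/ds_v3_moe/results_ds_v3_moe")
# print(m["combined_score"])
```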
Each generation's results/metrics.json looks like this:

```json
{
  "combined_score": 83.85,
  "public": {
    "time_ms": 118.26,
    "rank0_time_ms": 141.83,
    "rank0_tokens": 6144,
    "rank1_time_ms": 47.54,
    "rank1_tokens": 2048,
    "all_run_times_ms": [118.58, 118.26]
  },
  "text_feedback": "LLM suggestions: ..."
}
```

- combined_score: 10000 / (1 + time_ms), so higher is better
- time_ms: Token-weighted average time across ranks
- text_feedback: LLM-generated optimization suggestions
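The numbers in the sample are internally consistent, so the scoring is easy to reproduce. A sketch of the arithmetic as described above (the exact weighting code in CUCo may differ):

```python
def token_weighted_time(rank_times_ms, rank_tokens):
    """Token-weighted average time across ranks."""
    total_tokens = sum(rank_tokens)
    return sum(t * n for t, n in zip(rank_times_ms, rank_tokens)) / total_tokens

def combined_score(time_ms):
    """Higher is better: 10000 / (1 + time_ms)."""
    return 10000 / (1 + time_ms)

# Values from the sample metrics.json above:
t = token_weighted_time([141.83, 47.54], [6144, 2048])
print(round(t, 2), round(combined_score(t), 2))  # → 118.26 83.85
```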
- Adding a New Workload — Adapt CUCo for your own kernels
- Configuration Reference — Full parameter documentation
- Fast-Path Agent — Deep dive into the transformation pipeline
- Slow-Path Agent — Deep dive into the evolutionary search