This guide walks you through installing CUCo and running the included DeepSeek-V3 MoE example end-to-end.
- GPUs: NVIDIA A100 (or later) with NVLink (intra-node) or RoCE/InfiniBand (inter-node)
- Multi-GPU: At least 2 GPUs, potentially across nodes depending on the workload
| Dependency | Version | Notes |
|---|---|---|
| Python | >= 3.10 | 3.10, 3.11, or 3.12 |
| CUDA | >= 13.1 | nvcc compiler |
| NCCL | >= 2.28.9 | Must include device-side API headers (gin.h, etc.) |
| MPI | OpenMPI | For mpirun-based multi-GPU execution |
| Git | Any | For cloning the repository |
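As a quick preflight before installing, you can check that the required tools are discoverable and that the interpreter is new enough. A minimal sketch using only the standard library (it checks presence on PATH, not versions; nvcc may instead live at a hard-coded path, as described later in this guide):

```python
import shutil
import sys

def preflight(tools=("nvcc", "mpirun", "git")):
    """Report which required tools are on PATH and whether Python is >= 3.10."""
    report = {t: shutil.which(t) is not None for t in tools}
    report["python>=3.10"] = sys.version_info >= (3, 10)
    return report

for name, ok in preflight().items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```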
CUCo requires access to at least one LLM provider. The default configuration uses Anthropic Claude via AWS Bedrock. See LLM Backends for all supported providers.
```bash
git clone https://github.com/UT-InfraAI/cuco.git
cd cuco
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
```

Or with uv (faster):

```bash
uv venv
source .venv/bin/activate
uv sync
```

Create a .env file in the repository root:
```
# AWS Bedrock (default provider)
AWS_DEFAULT_REGION=us-east-1
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION_NAME=us-east-1
```

CUCo loads this file automatically via python-dotenv. See LLM Backends for other providers (OpenAI, Gemini, DeepSeek, etc.).
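Before kicking off a long run, it can be worth confirming the credentials are actually visible in the environment. A minimal sketch, using the variable names from the .env above (the script itself is illustrative, not part of CUCo):

```python
import os

# Variable names from the .env example above
REQUIRED = (
    "AWS_DEFAULT_REGION",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_REGION_NAME",
)

def missing_vars(env=os.environ):
    """Return the names of required credentials that are unset or empty."""
    return [k for k in REQUIRED if not env.get(k)]

if missing_vars():
    print("Missing:", ", ".join(missing_vars()))
else:
    print("All Bedrock credentials present.")
```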
The example evaluate.py expects these paths (edit them if your installation differs):

```python
NVCC = "/usr/local/cuda-13.1/bin/nvcc"
NCCL_INCLUDE = "/usr/local/nccl_2.28.9-1+cuda13.0_x86_64/include"
NCCL_STATIC_LIB = "/usr/local/nccl_2.28.9-1+cuda13.0_x86_64/lib/libnccl_static.a"
```

For inter-node experiments, create a hostfile at workloads/ds_v3_moe/build/hostfile:

```
node1 slots=1
node2 slots=1
```
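Since these paths are hard-coded, a quick existence check can save a failed build later. A small sketch whose constants mirror the ones in evaluate.py (the helper itself is illustrative):

```python
from pathlib import Path

# Same constants as in evaluate.py; adjust to your installation.
PATHS = {
    "NVCC": "/usr/local/cuda-13.1/bin/nvcc",
    "NCCL_INCLUDE": "/usr/local/nccl_2.28.9-1+cuda13.0_x86_64/include",
    "NCCL_STATIC_LIB": "/usr/local/nccl_2.28.9-1+cuda13.0_x86_64/lib/libnccl_static.a",
}

def check_paths(paths):
    """Return the names whose paths do not exist on this machine."""
    return [name for name, p in paths.items() if not Path(p).exists()]

for name in check_paths(PATHS):
    print(f"{name} not found; edit evaluate.py")
```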
The included example is a DeepSeek-V3 MoE (Mixture of Experts) dispatch-compute-combine workload running across 2 GPUs over inter-node RoCE links.
If starting from a host-driven NCCL program, the fast-path agent converts it to device-initiated GIN:
```bash
cd workloads/ds_v3_moe
python run_transform.py
```

This runs the three-step pipeline (analyze, host-to-device, evolve-block annotation) and writes the transformed kernel to _transform_host_output/. The included ds_v3_moe.cu seed already has EVOLVE-BLOCK markers, so this step is optional for the provided example.
Options:
| Flag | Default | Description |
|---|---|---|
| --source | ds_v3_moe.cu | Input CUDA source file |
| --no-agent | off | Use structured LLM loop instead of Claude Code agent |
| --model | sonnet | Claude model for agent mode |
| --max_iterations | 5 | Max iterations in no-agent mode |
Run the slow-path agent to optimize the kernel:
```bash
cd workloads/ds_v3_moe
python run_evo.py --num_generations=18 --api=gin
```

This runs a two-phase evolution:
- Phase 1 (explore): First 40% of generations. High-temperature full rewrites to discover diverse architectures.
- Phase 2 (exploit): Remaining 60%. Low-temperature diff patches to refine the best designs.
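The split between the two phases can be sketched as a simple function of the generation index. This is my own illustration of the 40/60 split described above, not CUCo's actual scheduling code; in particular, whether the boundary is floored or rounded is an assumption here:

```python
def phase_for(gen, num_generations=60, explore_fraction=0.4):
    """Map a 0-based generation index to its evolution phase.

    The first explore_fraction of the budget uses high-temperature
    full rewrites; the remainder uses low-temperature diff patches.
    """
    explore_gens = int(num_generations * explore_fraction)
    return "explore" if gen < explore_gens else "exploit"

# With the example run above (--num_generations=18, default 0.4 split),
# generations 0-6 explore and 7-17 exploit:
print([phase_for(g, num_generations=18) for g in range(18)])
```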
Options:
| Flag | Default | Description |
|---|---|---|
| --num_generations | 60 | Total generation budget |
| --results_dir | results_ds_v3_moe | Output directory for results |
| --api | gin | Communication API: gin or lsa |
| --explore_fraction | 0.4 | Fraction of budget for explore phase |
| --init_program | ds_v3_moe.cu | Seed program path |
| --gin_ref | None | Reference GIN example for prompts |
| --lsa_ref | None | Reference LSA example for prompts |
During evolution, watch the console for per-generation output:
```
Generation 3 — Score: 83.86 (best: 83.86)
Time: 118.26 ms | Correct: True | Patch: full | Model: claude-opus-4-6
```
Results are saved to results_ds_v3_moe/ as they complete.
Launch the interactive web UI:
```bash
cuco_visualize --db workloads/ds_v3_moe/results_ds_v3_moe/evolution_db.sqlite --open
```

This opens a browser with:
- Tree view: Lineage tree showing parent-child relationships and scores
- Programs table: Sortable table of all candidates with metrics
- Embeddings: Similarity heatmap and clustering
- Meta scratchpad: Cross-generation optimization recommendations
See Visualization for details.
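The same SQLite database can also be inspected directly with Python's built-in sqlite3 module. Its table schema is not documented in this guide, so listing the tables first is the safe move (an illustrative snippet, not part of CUCo):

```python
import sqlite3

def list_tables(db_path):
    """Return the names of all tables in an SQLite database."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [name for (name,) in rows]

# e.g. list_tables("workloads/ds_v3_moe/results_ds_v3_moe/evolution_db.sqlite")
```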
After evolution, results_ds_v3_moe/ contains:
```
results_ds_v3_moe/
├── evolution_db.sqlite        # SQLite database of all candidates
├── experiment_config.yaml     # Configuration snapshot
├── meta_memory.json           # Meta-learning state
├── best/                      # Symlink to best generation
│   ├── ds_v3_moe.cu           # Best evolved kernel
│   └── results/
│       └── metrics.json       # Best score and timing
├── gen_0/                     # Generation 0 (seed)
│   ├── ds_v3_moe.cu           # Evolved program
│   ├── original.cu            # Parent code
│   ├── main.cu                # Copy used for evaluation
│   ├── edit.diff              # Mutation diff
│   ├── rewrite.txt            # LLM output
│   └── results/
│       ├── metrics.json       # Score, timing, feedback
│       ├── correct.json       # Correctness result
│       ├── build.log          # Compiler output
│       └── run.log            # Runtime output
├── gen_1/
│   └── ...
└── meta_8.txt                 # Meta-summary at generation 8
```
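A short helper can pull the headline numbers out of the best candidate. The best/results/metrics.json layout is taken from the tree above; the helper itself is illustrative:

```python
import json
from pathlib import Path

def best_metrics(results_dir):
    """Load the best candidate's metrics.json from a results directory."""
    path = Path(results_dir) / "best" / "results" / "metrics.json"
    with path.open() as f:
        return json.load(f)

# e.g.:
# m = best_metrics("workloads/ds_v3_moe/results_ds_v3_moe")
# print(m["combined_score"])
```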
Each generation's results/metrics.json looks like this:

```json
{
  "combined_score": 83.85,
  "public": {
    "time_ms": 118.26,
    "rank0_time_ms": 141.83,
    "rank0_tokens": 6144,
    "rank1_time_ms": 47.54,
    "rank1_tokens": 2048,
    "all_run_times_ms": [118.58, 118.26]
  },
  "text_feedback": "LLM suggestions: ..."
}
```

- combined_score: 10000 / (1 + time_ms), so higher is better
- time_ms: Token-weighted average time across ranks
- text_feedback: LLM-generated optimization suggestions
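The numbers in the sample are internally consistent, so the scoring is easy to reproduce. A sketch of the arithmetic as described above (the exact weighting code in CUCo may differ):

```python
def token_weighted_time(rank_times_ms, rank_tokens):
    """Token-weighted average time across ranks."""
    total_tokens = sum(rank_tokens)
    return sum(t * n for t, n in zip(rank_times_ms, rank_tokens)) / total_tokens

def combined_score(time_ms):
    """Higher is better: 10000 / (1 + time_ms)."""
    return 10000 / (1 + time_ms)

# Values from the sample metrics.json above:
t = token_weighted_time([141.83, 47.54], [6144, 2048])
print(round(t, 2), round(combined_score(t), 2))  # → 118.26 83.85
```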
- Adding a New Workload — Adapt CUCo for your own kernels
- Configuration Reference — Full parameter documentation
- Fast-Path Agent — Deep dive into the transformation pipeline
- Slow-Path Agent — Deep dive into the evolutionary search