[WS5] RL-Kernel to vime integration and benchmark plan #158

### Goal

Integrate RL-Kernel into vime as an optional GRPO/logprob acceleration layer, then produce reproducible benchmarks comparing vime baseline against vime + RL-Kernel.

Position RL-Kernel as an accelerator for vime's GRPO training path, specifically the selected-logprob, ratio/KL, and GRPO loss computations. For MoE workloads, this issue only validates compatibility with vime R3 and measures end-to-end pipeline speed; it does not claim that RL-Kernel replaces R3 or accelerates MoE expert/router kernels.
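For orientation, the three computations named above can be written as a plain, unfused reference, which is also useful as a correctness oracle for the operator-level microbenchmark. This is a minimal sketch of a common GRPO formulation, not vime's or RL-Kernel's actual code: the function names, the k3-style KL estimator, and the `clip_eps` / `kl_coef` defaults are all assumptions chosen for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def selected_logprob(logits, token_id):
    """Log-probability of the sampled token only (the 'selected logprob')."""
    return log_softmax(logits)[token_id]

def ratio_and_kl(logp_new, logp_old, logp_ref):
    """Importance ratio against the rollout policy, plus a per-token KL
    estimate against the reference policy (k3-style, always >= 0)."""
    ratio = math.exp(logp_new - logp_old)
    delta = logp_ref - logp_new
    kl = math.exp(delta) - delta - 1.0
    return ratio, kl

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize rewards within one prompt group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

def grpo_token_loss(ratio, advantage, kl, clip_eps=0.2, kl_coef=0.01):
    """Clipped surrogate objective with a KL penalty, for a single token.
    clip_eps and kl_coef are illustrative defaults, not vime settings."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    return -surrogate + kl_coef * kl
```

A fused kernel would compute the same quantities over whole batches without materializing the full log-softmax; the scalar version here only pins down the expected values.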

### Scope

1. Map vime training-side hook points for:
   - selected logprob
   - reference logprob / KL
   - GRPO loss
   - metric reporting and profiling

2. Add an optional RL-Kernel integration path:
   - enable it through an environment flag such as `VIME_RL_KERNEL=1`, or a config / CLI flag such as `--enable-rl-kernel`
   - fall back to vime's original path when the flag is disabled or the operator is unsupported

3. Prioritize RL-Kernel operators already available on main:
   - `fused logp`
   - `ratio_kl`
   - `grpo_loss`

4. Add module-level profiling for:
   - rollout time
   - logprob time
   - ratio/KL time
   - GRPO loss time
   - update time
   - weight sync time
   - full step time

5. Run benchmark tiers:
   - operator-level microbenchmark
   - vime module-level benchmark
   - vime end-to-end benchmark

6. For MoE workloads:
   - compare `vime + R3` vs `vime + R3 + RL-Kernel`
   - verify RL-Kernel does not regress `train_rollout_logprob_abs_diff`
   - verify `raw_reward` does not regress
   - report step-time improvement if present
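The opt-in switch and fallback from scope item 2 could look roughly like the sketch below. `VIME_RL_KERNEL` is the example flag name from this issue; everything else (the `resolve_op` helper, the per-op dictionaries, and the use of `NotImplementedError` to signal an unsupported input) is a hypothetical design, not an existing vime or RL-Kernel API.

```python
import os

def rl_kernel_requested(cli_flag=False, env=None):
    """True when RL-Kernel is requested via CLI/config flag or VIME_RL_KERNEL=1."""
    env = os.environ if env is None else env
    return bool(cli_flag) or env.get("VIME_RL_KERNEL", "0") == "1"

def resolve_op(name, baseline_ops, kernel_ops, requested):
    """Pick the implementation for one op.

    Returns (callable, label). The label records which path was selected at
    setup time; an individual call may still fall back to the baseline.
    """
    if requested and name in kernel_ops:
        fast, slow = kernel_ops[name], baseline_ops[name]

        def with_fallback(*args, **kwargs):
            try:
                return fast(*args, **kwargs)
            except NotImplementedError:
                # Assumed convention: the kernel raises NotImplementedError
                # for unsupported inputs, so vime's original path is used.
                return slow(*args, **kwargs)

        return with_fallback, "rl_kernel"
    # Disabled, or no accelerated version exists for this op.
    return baseline_ops[name], "baseline"
```

Logging the returned label per op gives direct evidence of which path each run used, which keeps baseline and accelerated logs unambiguous.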

### Non-goals

- Do not integrate MoE expert/router/GEMM kernels in this issue.
- Do not claim RL-Kernel replaces vime R3.
- Do not claim RL-Kernel itself solves MoE train-inference consistency.
- Do not modify vime's R3 routing replay logic.
- Do not rewrite vLLM or Megatron internals.

### Benchmark Matrix

Recommended runs:

1. Qwen3-4B / A100 / GRPO / gsm8k
   - purpose: smoke, correctness, reward/logprob stability

2. Qwen3-30B-A3B / GB200 or H200 / GRPO / dapo-math-17k
   - purpose: main end-to-end speed benchmark

3. Qwen3-30B-A3B MoE / vime R3 / dapo-math-17k
   - purpose: R3-compatible MoE validation

4. GLM-4.5-Air or another non-Qwen model
   - purpose: generalization check
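One way to keep the matrix machine-readable is to encode each row as data and derive the baseline / RL-Kernel pair from it, so both runs share every other setting by construction. This is a sketch only: the `BenchRun` type and its field names are hypothetical, and fields the matrix leaves unspecified stay `None` rather than being guessed.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class BenchRun:
    model: str
    purpose: str
    hardware: Optional[str] = None   # None = not specified in the matrix
    algorithm: Optional[str] = None
    dataset: Optional[str] = None
    r3: bool = False                 # vime R3 enabled
    rl_kernel: bool = False          # RL-Kernel path enabled

MATRIX = [
    BenchRun("Qwen3-4B", "smoke, correctness, reward/logprob stability",
             hardware="A100", algorithm="GRPO", dataset="gsm8k"),
    BenchRun("Qwen3-30B-A3B", "main end-to-end speed benchmark",
             hardware="GB200 or H200", algorithm="GRPO", dataset="dapo-math-17k"),
    BenchRun("Qwen3-30B-A3B", "R3-compatible MoE validation",
             dataset="dapo-math-17k", r3=True),
    BenchRun("GLM-4.5-Air", "generalization check"),
]

def expand_pairs(matrix):
    """Return (baseline, accelerated) pairs that differ only in rl_kernel."""
    return [(run, replace(run, rl_kernel=True)) for run in matrix]
```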

### Metrics

Required metrics:

- mean step time
- p50 / p90 / p99 step time
- tokens/s
- peak VRAM
- logprob time
- ratio/KL time
- GRPO loss time
- raw_reward
- train_rollout_logprob_abs_diff
- KL / loss
- hardware and software version metadata
- vime commit and RL-Kernel commit
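The step-time and throughput metrics above can be derived from per-step logs with a small helper like the following. It is a sketch under stated assumptions: the function names and output keys are made up for illustration, percentiles use the nearest-rank definition, and the `warmup` parameter (steps to discard before summarizing) is an added convenience not mentioned in this issue.

```python
import math
import statistics

def percentile(sorted_values, q):
    """Nearest-rank percentile for an integer q in [1, 100]."""
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]

def summarize_steps(step_times_s, tokens_per_step, warmup=0):
    """Summarize step time and throughput, skipping the first `warmup` steps."""
    times = list(step_times_s[warmup:])
    tokens = list(tokens_per_step[warmup:])
    if not times or len(times) != len(tokens):
        raise ValueError("need equally long, non-empty step and token logs")
    ordered = sorted(times)
    return {
        "mean_step_s": statistics.fmean(times),
        "p50_step_s": percentile(ordered, 50),
        "p90_step_s": percentile(ordered, 90),
        "p99_step_s": percentile(ordered, 99),
        # Throughput over the whole window, not the mean of per-step rates.
        "tokens_per_s": sum(tokens) / sum(times),
    }
```

With short runs, p99 collapses to the slowest step, so it is only informative once a run has on the order of a hundred or more measured steps.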

### Deliverables

- vime hook-point mapping
- integration design note
- optional RL-Kernel integration switch
- one minimal end-to-end PoC
- operator/module/end-to-end benchmark logs
- plots for:
  - mean step time
  - step time over training steps
  - module time breakdown
  - peak VRAM
  - raw_reward
  - train_rollout_logprob_abs_diff
- reproducibility notes with commands, commits, hardware, and hyperparameters
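The module time breakdown plot and the module-level benchmark logs both need per-section timings (rollout, logprob, ratio/KL, GRPO loss, update, weight sync). A minimal sketch of a section timer follows; the `SectionTimer` class is hypothetical, and the optional `sync` callable is an assumption to account for asynchronous GPU execution (for example, passing `torch.cuda.synchronize` when running on CUDA).

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class SectionTimer:
    """Accumulates wall-clock time per named section of a training step."""

    def __init__(self, clock=time.perf_counter, sync=None):
        self._clock = clock
        self._sync = sync  # optional device barrier called around each section
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def section(self, name):
        if self._sync is not None:
            self._sync()  # do not charge earlier queued work to this section
        start = self._clock()
        try:
            yield
        finally:
            if self._sync is not None:
                self._sync()  # wait for this section's work to finish
            self.totals[name] += self._clock() - start
            self.counts[name] += 1

    def breakdown(self):
        """Mean seconds per call for every section seen so far."""
        return {n: self.totals[n] / self.counts[n] for n in self.totals}
```

Usage is `with timer.section("logprob"): ...` around each stage. Note that synchronizing around every section removes overlap between stages, so the full step time should come from a separate run without per-section barriers.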

### Exit Criteria

- vime runs with RL-Kernel both disabled and enabled.
- At least one of `fused logp`, `ratio_kl`, or `grpo_loss` is invoked through the vime path.
- Baseline and RL-Kernel runs use identical model, dataset, hardware, seed, and hyperparameters.
- End-to-end benchmark results are reproducible from saved logs.
- MoE benchmark, if included, is reported as `vime + R3` vs `vime + R3 + RL-Kernel`.
- Public wording avoids MoE-kernel and R3-replacement claims.
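The "identical model, dataset, hardware, seed, and hyperparameters" criterion can be checked mechanically by diffing the saved metadata of the two runs. The sketch below assumes each run writes a flat metadata dictionary; the key names (`rl_kernel`, `rl_kernel_commit`) and the helper itself are hypothetical.

```python
_MISSING = object()  # sentinel so a missing key differs from a stored None

def config_mismatches(baseline_meta, kernel_meta,
                      allowed=("rl_kernel", "rl_kernel_commit")):
    """Return the sorted keys on which the two runs differ.

    Keys listed in `allowed` are expected to differ between a baseline run
    and an RL-Kernel run and are ignored.
    """
    keys = (set(baseline_meta) | set(kernel_meta)) - set(allowed)
    return sorted(
        k for k in keys
        if baseline_meta.get(k, _MISSING) != kernel_meta.get(k, _MISSING)
    )
```

An empty result means the pair is comparable; any returned key names a setting to fix before the benchmark numbers are reported.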
