Goal
Integrate RL-Kernel into vime as an optional GRPO/logprob acceleration layer, then produce reproducible benchmarks comparing vime baseline against vime + RL-Kernel.
RL-Kernel should be positioned as accelerating vime's GRPO training path, especially selected logprob, ratio/KL, and GRPO loss computation. For MoE workloads, this issue only validates compatibility with vime R3 and measures end-to-end pipeline speed. It does not claim that RL-Kernel replaces R3 or accelerates MoE expert/router kernels.
Scope
- Map vime training-side hook points for:
  - selected logprob
  - reference logprob / KL
  - GRPO loss
  - metric reporting and profiling
- Add an optional RL-Kernel integration path:
  - environment flag such as VIME_RL_KERNEL=1, or
  - config / CLI flag such as --enable-rl-kernel
  - fallback to vime's original path when disabled or unsupported
- Prioritize RL-Kernel operators already available on main:
  - fused logp
  - ratio_kl
  - grpo_loss
- Add module-level profiling for:
  - rollout time
  - logprob time
  - ratio/KL time
  - GRPO loss time
  - update time
  - weight sync time
  - full step time
- Run benchmark tiers:
  - operator-level microbenchmark
  - vime module-level benchmark
  - vime end-to-end benchmark
- For MoE workloads:
  - compare vime + R3 vs vime + R3 + RL-Kernel
  - verify RL-Kernel does not regress train_rollout_logprob_abs_diff
  - verify raw_reward does not regress
  - report step-time improvement if present
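The optional integration path and its fallback can be sketched as a small dispatch helper. This is a minimal sketch, not vime's or RL-Kernel's actual API: `rl_kernel_requested`, `select_impl`, `baseline_op`, and `accelerated_op` are hypothetical names, and only the `VIME_RL_KERNEL=1` flag comes from the list above.

```python
import os
from typing import Callable, Optional


def rl_kernel_requested(cli_flag: bool = False) -> bool:
    """True when the switch is on, via VIME_RL_KERNEL=1 or a config/CLI flag."""
    return cli_flag or os.environ.get("VIME_RL_KERNEL", "0") == "1"


def select_impl(
    baseline: Callable,
    accelerated: Optional[Callable],
    supported: bool = True,
    cli_flag: bool = False,
) -> Callable:
    """Pick the accelerated operator only when it is requested, present, and supported.

    In every other case, return the baseline so the original path is untouched.
    """
    if rl_kernel_requested(cli_flag) and accelerated is not None and supported:
        return accelerated
    return baseline


# Hypothetical stand-ins for one hook point (for example, the ratio/KL step).
def baseline_op() -> str:
    return "baseline"


def accelerated_op() -> str:
    return "rl-kernel"


if __name__ == "__main__":
    op = select_impl(baseline_op, accelerated_op)
    print(f"selected path: {op()}")
```

Resolving the choice once per hook point, rather than branching inside the training loop, keeps the disabled path identical to the baseline and makes it easy to log which implementation each run actually used.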
Non-goals
- Do not integrate MoE expert/router/GEMM kernels in this issue.
- Do not claim RL-Kernel replaces vime R3.
- Do not claim RL-Kernel itself solves MoE train-inference consistency.
- Do not modify vime's R3 routing replay logic.
- Do not rewrite vLLM or Megatron internals.
Benchmark Matrix
Recommended runs:
- Qwen3-4B / A100 / GRPO / gsm8k
  - purpose: smoke, correctness, reward/logprob stability
- Qwen3-30B-A3B / GB200 or H200 / GRPO / dapo-math-17k
  - purpose: main end-to-end speed benchmark
- Qwen3-30B-A3B MoE / vime R3 / dapo-math-17k
  - purpose: R3-compatible MoE validation
- GLM-4.5-Air or another non-Qwen model
  - purpose: generalization check
Metrics
Required metrics:
- mean step time
- p50 / p90 / p99 step time
- tokens/s
- peak VRAM
- logprob time
- ratio/KL time
- GRPO loss time
- raw_reward
- train_rollout_logprob_abs_diff
- KL / loss
- hardware and software version metadata
- vime commit and RL-Kernel commit
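The per-module timings and the step-time percentiles above could be collected along these lines. This is a sketch under stated assumptions: `ModuleTimer`, `percentile`, and `summarize` are hypothetical helpers, the percentile uses the nearest-rank definition (other definitions interpolate and give slightly different values), and the timer measures host wall-clock time only.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import mean
from typing import Dict, List


class ModuleTimer:
    """Collects wall-clock samples per named section (logprob, ratio_kl, ...)."""

    def __init__(self) -> None:
        self.samples: Dict[str, List[float]] = defaultdict(list)

    @contextmanager
    def section(self, name: str):
        # GPU kernels launch asynchronously, so a real hook would need to
        # synchronize the device before each clock read; omitted here.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.samples[name].append(time.perf_counter() - start)


def percentile(values: List[float], q: int) -> float:
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    rank = max(1, -(-q * len(ordered) // 100))  # ceil(q * n / 100)
    return ordered[rank - 1]


def summarize(values: List[float]) -> Dict[str, float]:
    """Mean plus the p50 / p90 / p99 values listed in the required metrics."""
    return {
        "mean": mean(values),
        "p50": percentile(values, 50),
        "p90": percentile(values, 90),
        "p99": percentile(values, 99),
    }


if __name__ == "__main__":
    timer = ModuleTimer()
    for _ in range(5):
        with timer.section("full_step"):
            with timer.section("logprob"):
                time.sleep(0.001)  # placeholder for real work
    for name, values in timer.samples.items():
        print(name, summarize(values))
```

Reporting percentiles alongside the mean matters because a few slow steps (for example, a weight sync stall) can move the mean without changing typical step time.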
Deliverables
- vime hook-point mapping
- integration design note
- optional RL-Kernel integration switch
- one minimal end-to-end PoC
- operator/module/end-to-end benchmark logs
- plots for:
  - mean step time
  - step time over training steps
  - module time breakdown
  - peak VRAM
  - raw_reward
  - train_rollout_logprob_abs_diff
- reproducibility notes with commands, commits, hardware, and hyperparameters
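One way to make the reproducibility notes machine-readable is to save a small JSON record next to each benchmark log. The sketch below is illustrative only: `git_commit` and `build_run_record` are hypothetical helpers, the field names are assumptions, and the hardware description and command are supplied by the caller rather than detected.

```python
import json
import platform
import subprocess
import sys
from typing import Any, Dict, Optional


def git_commit(repo_dir: str) -> Optional[str]:
    """Return the checked-out commit hash of repo_dir, or None if unavailable."""
    try:
        result = subprocess.run(
            ["git", "-C", repo_dir, "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def build_run_record(
    vime_dir: str,
    rl_kernel_dir: str,
    hardware: str,
    command: str,
    hyperparameters: Dict[str, Any],
) -> Dict[str, Any]:
    """Gather the commits, hardware, command, and hyperparameters for one run."""
    return {
        "vime_commit": git_commit(vime_dir),
        "rl_kernel_commit": git_commit(rl_kernel_dir),
        "hardware": hardware,  # caller-supplied description of the machine
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "command": command,
        "hyperparameters": hyperparameters,
    }


if __name__ == "__main__":
    record = build_run_record(
        vime_dir=".",
        rl_kernel_dir=".",
        hardware="<GPU model and count>",
        command="<training command>",
        hyperparameters={"seed": 0},
    )
    print(json.dumps(record, indent=2))
```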
Exit Criteria
- vime can run with RL-Kernel disabled and enabled.
- fused logp, ratio_kl, or grpo_loss is invoked through the vime path.
- Baseline and RL-Kernel runs use identical model, dataset, hardware, seed, and hyperparameters.
- End-to-end benchmark results are reproducible from saved logs.
- MoE benchmark, if included, is reported as vime + R3 vs vime + R3 + RL-Kernel.
- Public wording avoids MoE-kernel and R3-replacement claims.
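The criterion that baseline and RL-Kernel runs share the same model, dataset, hardware, seed, and hyperparameters can be checked mechanically before results are compared. A minimal sketch, assuming each run's settings are available as a flat dictionary; `config_diff` and the `enable_rl_kernel` key are hypothetical names:

```python
from typing import Any, Dict, Iterable, Tuple


def config_diff(
    baseline: Dict[str, Any],
    candidate: Dict[str, Any],
    allowed: Iterable[str] = ("enable_rl_kernel",),
) -> Dict[str, Tuple[Any, Any]]:
    """Return every key whose value differs, ignoring keys that may differ."""
    skip = set(allowed)
    keys = (set(baseline) | set(candidate)) - skip
    return {
        key: (baseline.get(key), candidate.get(key))
        for key in sorted(keys)
        if baseline.get(key) != candidate.get(key)
    }


if __name__ == "__main__":
    base = {"model": "Qwen3-4B", "dataset": "gsm8k", "seed": 0,
            "enable_rl_kernel": False}
    cand = {"model": "Qwen3-4B", "dataset": "gsm8k", "seed": 1,
            "enable_rl_kernel": True}
    # Only the seed mismatch is reported; the switch itself may differ.
    print(config_diff(base, cand))
```

An empty result means the two runs differ only in the RL-Kernel switch, which is the condition under which a step-time comparison is meaningful.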