[WS5] RL-Kernel to vime integration and benchmark plan #158

@inaniloquentee

Goal

Integrate RL-Kernel into vime as an optional GRPO/logprob acceleration layer, then produce reproducible benchmarks comparing vime baseline against vime + RL-Kernel.

RL-Kernel should be positioned as accelerating vime's GRPO training path, especially selected logprob, ratio/KL, and GRPO loss computation. For MoE workloads, this issue only validates compatibility with vime R3 and measures end-to-end pipeline speed. It does not claim that RL-Kernel replaces R3 or accelerates MoE expert/router kernels.
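For orientation, the three quantities named above can be written as a small per-token reference in plain Python. This is a sketch under assumptions: it follows the commonly used GRPO formulation (clipped ratio surrogate plus a non-negative KL estimate against the reference policy). The function names and the `clip_eps` / `kl_coef` defaults are hypothetical, and the exact formulas used by vime or RL-Kernel may differ.

```python
import math

def selected_logprob(logits, token_id):
    """Log-probability of token_id under softmax(logits), using a stable log-sum-exp."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[token_id] - lse

def ratio_and_kl(logp_new, logp_old, logp_ref):
    """Importance ratio against the rollout policy and a KL estimate against the reference."""
    ratio = math.exp(logp_new - logp_old)
    delta = logp_ref - logp_new
    kl = math.exp(delta) - delta - 1.0  # non-negative, zero when the two logprobs match
    return ratio, kl

def grpo_token_loss(ratio, advantage, kl, clip_eps=0.2, kl_coef=0.04):
    """Clipped surrogate with a KL penalty; clip_eps and kl_coef are illustrative values."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    return -surrogate + kl_coef * kl
```

A reference like this is mainly useful as a correctness oracle: a fused operator can be compared against it on small inputs before any timing is done.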

Scope

  1. Map vime training-side hook points for:

    • selected logprob
    • reference logprob / KL
    • GRPO loss
    • metric reporting and profiling
  2. Add an optional RL-Kernel integration path:

    • an opt-in switch: an environment flag such as VIME_RL_KERNEL=1, or a config / CLI flag such as --enable-rl-kernel
    • fallback to vime's original path when disabled or unsupported
  3. Prioritize RL-Kernel operators already available on main:

    • fused logp
    • ratio_kl
    • grpo_loss
  4. Add module-level profiling for:

    • rollout time
    • logprob time
    • ratio/KL time
    • GRPO loss time
    • update time
    • weight sync time
    • full step time
  5. Run benchmark tiers:

    • operator-level microbenchmark
    • vime module-level benchmark
    • vime end-to-end benchmark
  6. For MoE workloads:

    • compare vime + R3 vs vime + R3 + RL-Kernel
    • verify RL-Kernel does not regress train_rollout_logprob_abs_diff
    • verify raw_reward does not regress
    • report step-time improvement if present
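The opt-in switch and fallback from scope item 2 could look roughly like the following. This is a minimal sketch, not vime's or RL-Kernel's actual API: `rl_kernel_requested`, `select_impl`, and `with_fallback` are hypothetical helpers, the operator tables are plain dicts, and the use of `NotImplementedError` to signal an unsupported input is an assumption.

```python
import os

def rl_kernel_requested(env=None, cli_flag=False):
    """True when either the CLI flag or VIME_RL_KERNEL=1 asks for the accelerated path."""
    env = os.environ if env is None else env
    return bool(cli_flag) or env.get("VIME_RL_KERNEL", "0") == "1"

def with_fallback(fast, slow):
    """Call fast; if it reports an unsupported case, run the original path instead."""
    def call(*args, **kwargs):
        try:
            return fast(*args, **kwargs)
        except NotImplementedError:
            return slow(*args, **kwargs)
    return call

def select_impl(name, baseline, accelerated, enabled):
    """Return (callable, label) for operator `name`, e.g. 'ratio_kl' or 'grpo_loss'."""
    if enabled and name in accelerated:
        return with_fallback(accelerated[name], baseline[name]), "rl_kernel"
    return baseline[name], "baseline"
```

Returning a label alongside the callable makes it easy to log which path actually ran, which helps when checking that an accelerated operator is invoked through the vime path.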

Non-goals

  • Do not integrate MoE expert/router/GEMM kernels in this issue.
  • Do not claim RL-Kernel replaces vime R3.
  • Do not claim RL-Kernel itself solves MoE train-inference consistency.
  • Do not modify vime's R3 routing replay logic.
  • Do not rewrite vLLM or Megatron internals.

Benchmark Matrix

Recommended runs:

  1. Qwen3-4B / A100 / GRPO / gsm8k

    • purpose: smoke, correctness, reward/logprob stability
  2. Qwen3-30B-A3B / GB200 or H200 / GRPO / dapo-math-17k

    • purpose: main end-to-end speed benchmark
  3. Qwen3-30B-A3B MoE / vime R3 / dapo-math-17k

    • purpose: R3-compatible MoE validation
  4. GLM-4.5-Air or another non-Qwen model

    • purpose: generalization check

Metrics

Required metrics:

  • mean step time
  • p50 / p90 / p99 step time
  • tokens/s
  • peak VRAM
  • logprob time
  • ratio/KL time
  • GRPO loss time
  • raw_reward
  • train_rollout_logprob_abs_diff
  • KL / loss
  • hardware and software version metadata
  • vime commit and RL-Kernel commit
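The timing metrics above can be collected with a small section timer and summarized with nearest-rank percentiles. The sketch below is illustrative only: `SectionTimer`, `percentile`, and `summarize` are hypothetical names, and the timer measures host wall-clock time. For GPU work, which is launched asynchronously, a device synchronization (for example `torch.cuda.synchronize()`) would be needed before each clock read for the numbers to reflect kernel time.

```python
import math
import time
from collections import defaultdict
from contextlib import contextmanager

class SectionTimer:
    """Accumulates wall-clock durations per named section (logprob, ratio_kl, ...)."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.samples = defaultdict(list)

    @contextmanager
    def section(self, name):
        start = self.clock()
        try:
            yield
        finally:
            self.samples[name].append(self.clock() - start)

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(values):
    """Mean and p50 / p90 / p99 of a list of durations."""
    return {
        "mean": sum(values) / len(values),
        "p50": percentile(values, 50),
        "p90": percentile(values, 90),
        "p99": percentile(values, 99),
    }
```

Usage would be along the lines of `with timer.section("grpo_loss"): ...` inside the training step, followed by `summarize(timer.samples["grpo_loss"])` at the end of the run.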

Deliverables

  • vime hook-point mapping
  • integration design note
  • optional RL-Kernel integration switch
  • one minimal end-to-end PoC
  • operator/module/end-to-end benchmark logs
  • plots for:
    • mean step time
    • step time over training steps
    • module time breakdown
    • peak VRAM
    • raw_reward
    • train_rollout_logprob_abs_diff
  • reproducibility notes with commands, commits, hardware, and hyperparameters
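One way to keep the reproducibility notes machine-readable is to write a small JSON manifest next to each benchmark log. The field names and the `build_manifest` helper below are assumptions for illustration, and the commit values are placeholders to be filled from the actual checkouts.

```python
import json
import platform

def build_manifest(command, vime_commit, rl_kernel_commit, hardware,
                   hyperparameters, rl_kernel_enabled):
    """Collect what is needed to rerun a benchmark: command, commits, hardware, settings."""
    return {
        "command": command,
        "commits": {"vime": vime_commit, "rl_kernel": rl_kernel_commit},
        "hardware": hardware,
        "software": {
            "python": platform.python_version(),
            "platform": platform.platform(),
        },
        "hyperparameters": dict(hyperparameters),
        "rl_kernel_enabled": bool(rl_kernel_enabled),
    }

def dump_manifest(manifest):
    """Serialize with sorted keys so that manifests from two runs diff cleanly."""
    return json.dumps(manifest, indent=2, sort_keys=True)
```

Sorted keys keep the baseline and RL-Kernel manifests line-aligned, so a plain text diff shows exactly which settings differ between the two runs.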

Exit Criteria

  • vime can run with RL-Kernel disabled and enabled.
  • At least one of fused logp, ratio_kl, or grpo_loss is invoked through the vime path.
  • Baseline and RL-Kernel runs use identical model, dataset, hardware, seed, and hyperparameters.
  • End-to-end benchmark results are reproducible from saved logs.
  • MoE benchmark, if included, is reported as vime + R3 vs vime + R3 + RL-Kernel.
  • Public wording avoids MoE-kernel and R3-replacement claims.
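The "identical model, dataset, hardware, seed, and hyperparameters" criterion can be checked mechanically before two runs are compared. The sketch below assumes each run's settings are available as a plain dict. The key names and the `parity_mismatches` helper are hypothetical.

```python
PARITY_KEYS = ("model", "dataset", "hardware", "seed", "hyperparameters")

def parity_mismatches(baseline, candidate, keys=PARITY_KEYS):
    """Return the settings that differ between a baseline run and an RL-Kernel run."""
    return [k for k in keys if baseline.get(k) != candidate.get(k)]

def assert_comparable(baseline, candidate):
    """Refuse to compare runs whose settings differ in anything but the RL-Kernel switch."""
    diff = parity_mismatches(baseline, candidate)
    if diff:
        raise ValueError("runs are not comparable, differing keys: " + ", ".join(diff))
```

The RL-Kernel switch itself is deliberately left out of `PARITY_KEYS`, since it is the one setting that is expected to differ.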

Metadata

Labels

  • platform: cuda (Specific optimizations or bugs in NVIDIA graphics cards, such as FlashInfer, TMA optimizations)
  • priority: high (Severe congestion issues require the highest priority for resolution.)
  • type: performance (Performance optimization tasks aimed at increasing throughput and reducing latency etc.)
