Checklist
Background
In on-policy distillation, the teacher is used only for log-prob scoring (teacher_logp) and does not need an optimizer, gradients, or training state.
Using a full TrainEngine for the teacher is memory-heavy and increases resource pressure.
We want a teacher path based on RolloutEngine / InferenceEngine to reduce GPU memory usage.
Current distillation workflows instantiate the teacher as a train-style engine, which can allocate unnecessary training components (optimizer states, train-time buffers, etc.) even though the teacher is only used for inference.
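To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It assumes bf16 weights and gradients plus fp32 Adam state (master weights and two moments), which is a common mixed-precision setup but not necessarily what AReaL uses. It ignores activations, KV cache, and sharding. The function name and the 7B model size are hypothetical.

```python
def estimate_state_memory_gib(num_params: float, training: bool) -> float:
    """Rough per-replica memory for model state, ignoring activations and KV cache.

    Assumption: bf16 weights/gradients and fp32 Adam state. Real numbers
    depend on the engine, precision, and sharding strategy.
    """
    bytes_per_param = 2  # bf16 weights, needed in both modes
    if training:
        bytes_per_param += 2   # bf16 gradients
        bytes_per_param += 12  # fp32 master weights + Adam first and second moments
    return num_params * bytes_per_param / 2**30


teacher_params = 7e9  # hypothetical 7B-parameter teacher
train_gib = estimate_state_memory_gib(teacher_params, training=True)
infer_gib = estimate_state_memory_gib(teacher_params, training=False)
print(f"train-style: {train_gib:.1f} GiB, inference-only: {infer_gib:.1f} GiB")
```

Under these assumptions the train-style teacher carries about eight times more model state than an inference-only one, before any sharding is applied.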
Potential Solution
- The teacher is configured as an inference rollout engine (vLLM/SGLang).
- RLTrainer calls teacher.compute_logp(...) on rollout batches.
- The teacher model path/config is independent of the actor rollout model path.
- The teacher lifecycle uses rollout/controller semantics (init/offload/onload/destroy) without train-engine overhead.
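The interface implied by the points above could look roughly like the following sketch. Only `compute_logp` and the init/offload/onload/destroy lifecycle come from this proposal. The `TeacherScorer` protocol, the `UniformTeacher` stand-in, the `score_rollouts` helper, and the batch format (lists of token ids) are hypothetical and are not AReaL's actual API.

```python
from __future__ import annotations

import math
from typing import Protocol, Sequence


class TeacherScorer(Protocol):
    """Hypothetical inference-only teacher interface (illustrative names)."""

    def initialize(self) -> None: ...
    def compute_logp(self, batch: Sequence[Sequence[int]]) -> list[list[float]]: ...
    def offload(self) -> None: ...
    def onload(self) -> None: ...
    def destroy(self) -> None: ...


class UniformTeacher:
    """Toy stand-in that assigns uniform probability over a fixed vocabulary."""

    def __init__(self, vocab_size: int) -> None:
        self.vocab_size = vocab_size
        self.on_device = False

    def initialize(self) -> None:
        self.on_device = True

    def compute_logp(self, batch: Sequence[Sequence[int]]) -> list[list[float]]:
        if not self.on_device:
            raise RuntimeError("teacher is offloaded; call onload() first")
        logp = -math.log(self.vocab_size)
        # One log-prob per token; no optimizer or gradient state is involved.
        return [[logp for _ in seq] for seq in batch]

    def offload(self) -> None:
        self.on_device = False

    def onload(self) -> None:
        self.on_device = True

    def destroy(self) -> None:
        self.on_device = False


def score_rollouts(teacher: TeacherScorer, batch: Sequence[Sequence[int]]) -> list[list[float]]:
    """Bring the teacher onto the device only for the scoring step."""
    teacher.onload()
    try:
        return teacher.compute_logp(batch)
    finally:
        teacher.offload()


teacher = UniformTeacher(vocab_size=4)
teacher.initialize()
teacher_logp = score_rollouts(teacher, [[1, 2, 3], [0]])
```

Wrapping the scoring call in onload/offload is one possible way to keep the teacher out of GPU memory while the student trains. Whether offloading per step is worth the transfer cost depends on the backend.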
Benefits
- Lower peak GPU memory for distillation runs.
- Better stability on limited-memory hardware.
- Better separation of concerns (teacher scoring vs student training).
Additional Information
Minimal config example
teacher:
  path: <teacher-model-path>
  rollout:
    backend: "vllm:d1p1t1"  # or sglang:d...
    offload: true
  rl_loss_weight: 1.0
  distill_loss_weight: 0.005
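As a sketch of how the two weights might interact, the snippet below assumes the total loss is a weighted sum of the RL loss and a distillation term. It further assumes the distillation term is the mean of student_logp - teacher_logp over sampled tokens, a single-sample estimate of the reverse KL. Neither assumption is confirmed by this issue, and `combined_loss` is a hypothetical helper.

```python
from typing import Sequence


def combined_loss(
    rl_loss: float,
    student_logp: Sequence[float],
    teacher_logp: Sequence[float],
    rl_loss_weight: float = 1.0,
    distill_loss_weight: float = 0.005,
) -> float:
    """Weighted sum of the RL loss and a per-token distillation term (assumed form)."""
    if len(student_logp) != len(teacher_logp):
        raise ValueError("student and teacher log-probs must align token by token")
    if not student_logp:
        raise ValueError("need at least one token")
    # Assumption: distillation term is the mean log-ratio on tokens sampled by the student.
    distill_loss = sum(s - t for s, t in zip(student_logp, teacher_logp)) / len(student_logp)
    return rl_loss_weight * rl_loss + distill_loss_weight * distill_loss


total = combined_loss(0.2, student_logp=[-1.0, -2.0], teacher_logp=[-1.5, -2.5])
```

With the example weights, the distillation term acts as a small regularizer on top of the RL objective rather than the dominant signal.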