Official repository of RLP: Reinforcement as a Pretraining Objective.
A verifier‑free, information‑gain objective that teaches models to “think before predicting” during pre‑training.
Ali Hatamizadeh, Syeda Nahida Akter, Shrimai Prabhumoye, Jan Kautz, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Yejin Choi.
Teach models to think during pretraining, not just after.

We introduce RLP (Reinforcement Learning Pre‑training): treat chain‑of‑thought (CoT) as an action taken before next‑token prediction, and reward it by the information gain it provides on the observed next token. This yields a verifier‑free, dense reward that can be applied to ordinary pre‑training text. On Qwen3‑1.7B‑Base, RLP improves the overall math+science average by ≈ +19% over the base model and ≈ +17% over compute‑matched continuous pre‑training; after identical post‑training the gains compound. On a 12B hybrid Mamba‑Transformer (NeMo‑12B), the overall average rises from 42.81 → 61.32 (+18.51 points), with large science reasoning gains.
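RLP's reward is computed directly on ordinary pretraining text: sample a short chain-of-thought, then score it by how much it raises the log-likelihood of the observed next token relative to a no-think baseline. Below is a minimal sketch of that reward, assuming HuggingFace-style causal LMs for the policy and the no-think baseline; the function and variable names here are illustrative, not the repository's API.

```python
# Minimal sketch of RLP's information-gain reward (illustrative, not the official implementation).
# Assumptions: `policy` and `no_think_baseline` are HuggingFace-style causal LMs whose forward
# pass returns an object with `.logits`; `prefix_ids`, `cot_ids`, and `target_id` come from
# ordinary pretraining text plus a thought sampled from the policy.

import torch
import torch.nn.functional as F


@torch.no_grad()
def next_token_logprob(model, context_ids: torch.Tensor, target_id: int) -> float:
    """Log-probability the model assigns to `target_id` as the token following `context_ids` ([1, T])."""
    logits = model(context_ids).logits[:, -1, :]               # distribution over the next token
    return F.log_softmax(logits, dim=-1)[0, target_id].item()


def information_gain_reward(policy, no_think_baseline,
                            prefix_ids: torch.Tensor, cot_ids: torch.Tensor,
                            target_id: int) -> float:
    """Verifier-free dense reward: r = log p_policy(x_t | x_<t, c) - log p_baseline(x_t | x_<t),
    i.e. how much the sampled chain-of-thought c raises the likelihood of the observed next token."""
    with_thought = torch.cat([prefix_ids, cot_ids], dim=-1)    # condition on prefix + sampled thought
    lp_with = next_token_logprob(policy, with_thought, target_id)
    lp_without = next_token_logprob(no_think_baseline, prefix_ids, target_id)
    return lp_with - lp_without                                 # information gain on x_t
```

Because the reward is a log-probability difference rather than a verifier's pass/fail signal, it is dense and available at every position of the text stream.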

Setup (Qwen3-1.7B):
- We compare RLP against both the base model (BASE) and a compute-matched Continuous Pretraining (CPT) baseline.
- All models go through the same SFT + RLVR post-training pipeline for a fair comparison.

Pretraining Gains:
- RLP outperforms BASE by +19% and CPT by +17% on average across math and science benchmarks.
- These improvements come without extra compute, showing the gains stem from the training objective rather than raw FLOPs.

Post-Training Synergy:
After identical SFT + RLVR post-training, RLP compounds its advantage, achieving:
- +8% relative over BASE+Post
- +7% relative over CPT+Post

This shows that RLP builds durable reasoning foundations that are strengthened, not erased, by downstream alignment.

Takeaway:
- Unlike next-token prediction or continuous pretraining, RLP instills reasoning during pretraining itself.
- These early advantages persist through post-training, giving models stronger and more robust reasoning capabilities.

Setup (Nemotron-Nano-12B-v2):
- We take an intermediate checkpoint of Nemotron-Nano-12B-v2-Base trained on 19.8T tokens and apply RLP for only 250M additional tokens.
- The BASE model, in contrast, is trained on the full 20T tokens.

Pretraining Gains:
- RLP substantially outperforms BASE across all domains despite using ~200B fewer tokens.
- On average, RLP is +35% better than BASE, highlighting both efficiency and scalability.

Domain-Specific Improvements:
- Math performance improves moderately.
- The largest gains are in science reasoning, where Science Avg improves by +23 absolute points.

Takeaway:
- The benefits of RLP not only persist but amplify at larger model scales.
- RLP generalizes effectively across architectures, yielding robust reasoning improvements even in hybrid models like Nemotron.

If you find RLP useful for your work, please consider citing our paper:
@article{hatamizadeh2025rlp,
  title={RLP: Reinforcement as a Pretraining Objective},
  author={Hatamizadeh, Ali and Akter, Syeda Nahida and Prabhumoye, Shrimai and Kautz, Jan and Patwary, Mostofa and Shoeybi, Mohammad and Catanzaro, Bryan and Choi, Yejin},
  journal={arXiv preprint arXiv:2510.01265},
  year={2025}
}
Copyright © 2025, NVIDIA Corporation. All rights reserved.