Official repository of RLP: Reinforcement as a Pretraining Objective.
A verifier‑free, information‑gain objective that teaches models to “think before predicting” during pre‑training.
Ali Hatamizadeh, Syeda Nahida Akter, Shrimai Prabhumoye, Jan Kautz, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Yejin Choi.
Teach models to think during pretraining, not just after.

We introduce RLP (Reinforcement Learning Pre‑training): treat chain‑of‑thought (CoT) as an action taken before next‑token prediction, and reward it by the information gain it provides on the observed next token. This yields a verifier‑free, dense reward that can be applied to ordinary pre‑training text. On Qwen3‑1.7B‑Base, RLP improves the overall math+science average by ≈ +19% over the base model and ≈ +17% over compute‑matched continuous pre‑training; after identical post‑training the gains compound. On a 12B hybrid Mamba‑Transformer (NeMo‑12B), the overall average rises from 42.81 → 61.32 (+18.51 points), with large science reasoning gains.
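RLP's reward is computed directly on ordinary pretraining text: sample a short chain-of-thought, then score it by how much it raises the log-likelihood of the observed next token relative to a no-think baseline. Below is a minimal sketch of that reward, assuming HuggingFace-style causal LMs for the policy and the no-think baseline; the function and variable names here are illustrative, not the repository's API.

```python
# Minimal sketch of RLP's information-gain reward (illustrative, not the official implementation).
# Assumptions: `policy` and `no_think_baseline` are HuggingFace-style causal LMs whose forward
# pass returns an object with `.logits`; `prefix_ids`, `cot_ids`, and `target_id` come from
# ordinary pretraining text plus a thought sampled from the policy.

import torch
import torch.nn.functional as F


@torch.no_grad()
def next_token_logprob(model, context_ids: torch.Tensor, target_id: int) -> float:
    """Log-probability the model assigns to `target_id` as the token following `context_ids` ([1, T])."""
    logits = model(context_ids).logits[:, -1, :]               # distribution over the next token
    return F.log_softmax(logits, dim=-1)[0, target_id].item()


def information_gain_reward(policy, no_think_baseline,
                            prefix_ids: torch.Tensor, cot_ids: torch.Tensor,
                            target_id: int) -> float:
    """Verifier-free dense reward: r = log p_policy(x_t | x_<t, c) - log p_baseline(x_t | x_<t),
    i.e. how much the sampled chain-of-thought c raises the likelihood of the observed next token."""
    with_thought = torch.cat([prefix_ids, cot_ids], dim=-1)    # condition on prefix + sampled thought
    lp_with = next_token_logprob(policy, with_thought, target_id)
    lp_without = next_token_logprob(no_think_baseline, prefix_ids, target_id)
    return lp_with - lp_without                                 # information gain on x_t
```

Because the reward is a log-probability difference rather than a verifier's pass/fail signal, it is dense and available at every position of the text stream.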

Setup (Qwen3-1.7B):
- We compare RLP against both the base model (BASE) and a compute-matched Continuous Pretraining (CPT) baseline.
- All models go through the same SFT + RLVR post-training pipeline for a fair comparison.

Pretraining Gains:
- RLP outperforms BASE by +19% and CPT by +17% on average across math and science benchmarks.
- These improvements come without extra compute, showing the gains stem from the training objective rather than raw FLOPs.

Post-Training Synergy:
After identical SFT + RLVR post-training, RLP compounds its advantage, achieving:
- +8% relative over BASE+Post
- +7% relative over CPT+Post

This shows that RLP builds durable reasoning foundations that are strengthened, not erased, by downstream alignment.

Takeaway:
- Unlike next-token prediction or continuous pretraining, RLP instills reasoning during pretraining itself.
- These early advantages persist through post-training, giving models stronger and more robust reasoning capabilities.

Setup (Nemotron-Nano-12B-v2):
- We take an intermediate checkpoint of Nemotron-Nano-12B-v2-Base trained on 19.8T tokens and apply RLP for only 250M additional tokens.
- The BASE model, in contrast, is trained on the full 20T tokens.

Pretraining Gains:
- RLP substantially outperforms BASE across all domains despite using ~200B fewer tokens.
- On average, RLP is +35% better than BASE, highlighting both efficiency and scalability.

Domain-Specific Improvements:
- Math performance improves moderately.
- The largest gains are in science reasoning, where Science Avg improves by +23 absolute points.

Takeaway:
- The benefits of RLP not only persist but amplify at larger model scales.
- RLP generalizes effectively across architectures, yielding robust reasoning improvements even in hybrid models like Nemotron.

If you find RLP useful for your work, please consider citing our paper:
@article{hatamizadeh2025rlp,
  title={RLP: Reinforcement as a Pretraining Objective},
  author={Hatamizadeh, Ali and Akter, Syeda Nahida and Prabhumoye, Shrimai and Kautz, Jan and Patwary, Mostofa and Shoeybi, Mohammad and Catanzaro, Bryan and Choi, Yejin},
  journal={arXiv preprint arXiv:2510.01265},
  year={2025}
}
Copyright © 2025, NVIDIA Corporation. All rights reserved.