RLP: Reinforcement as a Pretraining Objective

Official repository of RLP: Reinforcement as a Pretraining Objective.

A verifier‑free, information‑gain objective that teaches models to “think before predicting” during pre‑training.

Paper

Ali Hatamizadeh¹, Syeda Nahida Akter¹, Shrimai Prabhumoye¹, Jan Kautz, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Yejin Choi.

Teach models to think during pretraining, not just after.

[Figure: RLP framework overview]

We introduce RLP (Reinforcement Learning Pre‑training): treat chain‑of‑thought (CoT) as an action taken before next‑token prediction, and reward it by the information gain it provides on the observed next token. This yields a verifier‑free, dense reward that can be applied to ordinary pre‑training text. On Qwen3‑1.7B‑Base, RLP improves the overall math+science average by ≈ +19% over the base model and ≈ +17% over compute‑matched continuous pre‑training; after identical post‑training the gains compound. On a 12B hybrid Mamba‑Transformer (NeMo‑12B), the overall average rises from 42.81 → 61.32 (+18.51 points), with large science reasoning gains.
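
To make the objective concrete, here is a minimal, hypothetical sketch of the information-gain reward described above: score each observed next token under the model conditioned on the prefix plus a sampled CoT, score it again under a no-think baseline conditioned on the prefix alone, and use the log-likelihood difference as the dense, verifier-free reward. All names here (`information_gain_reward`, `logits_with_cot`, `logits_no_think`) are illustrative assumptions, not the repository's actual API.

```python
# Minimal sketch (assumption, not this repo's code): per-token information-gain reward.
# r_t = log p(x_t | prefix, CoT) - log p(x_t | prefix), for each observed next token x_t.
import torch
import torch.nn.functional as F


def information_gain_reward(
    logits_with_cot: torch.Tensor,   # [T, V] next-token logits conditioned on prefix + sampled CoT
    logits_no_think: torch.Tensor,   # [T, V] next-token logits conditioned on the prefix alone
    target_tokens: torch.Tensor,     # [T]    observed next tokens from ordinary pretraining text
) -> torch.Tensor:
    """Dense, verifier-free reward: how much the CoT raises the log-likelihood of each observed token."""
    logp_cot = F.log_softmax(logits_with_cot, dim=-1)
    logp_base = F.log_softmax(logits_no_think, dim=-1)
    # Log-probability assigned to the actual next token under each conditioning.
    lp_cot = logp_cot.gather(-1, target_tokens.unsqueeze(-1)).squeeze(-1)
    lp_base = logp_base.gather(-1, target_tokens.unsqueeze(-1)).squeeze(-1)
    return lp_cot - lp_base  # positive wherever "thinking first" actually helped prediction


if __name__ == "__main__":
    T, V = 8, 32000  # toy sequence length and vocabulary size
    torch.manual_seed(0)
    reward = information_gain_reward(
        torch.randn(T, V), torch.randn(T, V), torch.randint(0, V, (T,))
    )
    print(reward.shape, reward.mean().item())  # one scalar reward per predicted token
```

In the full method this per-token signal is what rewards the sampled chain-of-thought; see the paper for the exact policy-update objective and the choice of no-think baseline.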


[Figure: Next-token prediction comparison]

Key results

🔹 Qwen3 1.7B Base

  • Setup:

    • We compare RLP against both the base model (BASE) and a compute-matched continuous pretraining (CPT) baseline.
    • All models use the same SFT + RLVR post-training pipeline for a fair comparison.
  • Pretraining Gains:

    • RLP outperforms BASE by +19% and CPT by +17% on average across math and science benchmarks.
    • These improvements come without extra compute, showing that the gains stem from the objective itself rather than additional FLOPs.
  • Post-Training Synergy:

    • After identical SFT + RLVR, RLP compounds its advantage, achieving:

      • +8% relative over BASE+Post
      • +7% relative over CPT+Post
    • This shows that RLP builds durable reasoning foundations that are strengthened, not erased, by downstream alignment.

  • Takeaway:

    • Unlike next-token prediction or continuous pretraining, RLP instills reasoning during pretraining itself.
    • These early advantages persist through post-training, giving models stronger and more robust reasoning capabilities.

🔹 Nemotron Nano 12B v2 Base

  • Setup:

    • We compare an intermediate checkpoint of Nemotron-Nano-12B-v2-Base trained on 19.8T tokens with RLP applied for only 250M tokens.
    • The BASE model, in contrast, is trained fully on 20T tokens.
  • Pretraining Gains:

    • RLP substantially outperforms BASE across all domains despite using ~200B fewer tokens.
    • On average, RLP is +35% better than BASE, highlighting both efficiency and scalability.
  • Domain-Specific Improvements:

    • Math performance improves moderately.
    • The largest gains are in science reasoning, where Science Avg improves by +23 absolute points.
  • Takeaway:

    • The benefits of RLP not only persist but amplify at larger model scales.
    • RLP generalizes effectively across architectures, yielding robust reasoning improvements even in hybrid models like Nemotron.

Citation

If you find RLP to be useful for your work, please consider citing our paper:

@article{hatamizadeh2025rlp,
  title={RLP: Reinforcement as a Pretraining Objective},
  author={Hatamizadeh, Ali and Akter, Syeda Nahida and Prabhumoye, Shrimai and Kautz, Jan and Patwary, Mostofa and Shoeybi, Mohammad and Catanzaro, Bryan and Choi, Yejin},
  journal={arXiv preprint arXiv:2510.01265},
  year={2025}
}

Licenses

Copyright © 2025, NVIDIA Corporation. All rights reserved.

Footnotes

  1. Equal Contribution
