CRUX is a parameter-efficient (27M parameters) looped text-generation Transformer using discrete diffusion. It uses nested weight sharing: an outer loop of shared Transformer layers contains an inner loop of gated recursive processing.
Two versions of CRUX were built: CRUX-Heavy (36M parameters) uses only the inner loop and disables outer weight sharing, while CRUX-Light (27M parameters) uses the full nested-loop structure. CRUX-Light matches CRUX-Heavy's performance with ~23% fewer weights.
- Looped Sandwich Architecture: Features distinct "Entry" and "Exit" layers surrounding a core "Shared Middle" block that is executed 8 times per forward pass.
- Nested Gated Recursion: Each shared middle block itself contains an internal 4-step recursive loop, providing further computational depth (32 effective refinement steps) with minimal parameter overhead.
- Discrete Diffusion Training: Predicts clean tokens from masked/noisy states using an SNR-based schedule, providing a fast, non-autoregressive generation alternative.
- Optimization: Integrates Muon with Cautious Weight Decay for stable training.
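The nested loop structure above can be sketched as follows. This is an illustrative sketch only: the layer types, dimensions, and the exact gating form are assumptions, not the actual CRUX implementation.

```python
import torch
import torch.nn as nn

class NestedLoopSketch(nn.Module):
    """Sketch of CRUX-Light's nested weight sharing (hypothetical layers/dims)."""
    def __init__(self, d_model=256, n_heads=4, outer_steps=8, inner_steps=4):
        super().__init__()
        self.outer_steps, self.inner_steps = outer_steps, inner_steps
        # One shared Transformer layer, reused for every outer step.
        self.shared = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Gated recursive transform, reused for every inner step.
        self.transform = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(d_model, d_model)

    def forward(self, x):
        for _ in range(self.outer_steps):        # outer loop: shared middle block, 8x
            x = self.shared(x)
            for _ in range(self.inner_steps):    # inner loop: gated recursion, 4x
                g = torch.sigmoid(self.gate(x))  # gate controls how much to update
                x = g * torch.tanh(self.transform(x)) + (1 - g) * x
        return x  # 8 outer x 4 inner = 32 effective refinement steps

x = torch.randn(2, 16, 256)
y = NestedLoopSketch()(x)
```

Because both loops reuse the same weights, depth grows with iteration count rather than parameter count.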
| Feature | 36M "Heavy" Model (`model_heavy.py`) | 27M "Light" Model (`model.py`) |
|---|---|---|
| Global structure | 10 unique, sequential Transformer layers | Three distinct stages: Entry, Shared Middle, and Exit |
| Weight sharing | None across Transformer layers; only the internal recursive processing block shares weights with itself via recursion | Shared Middle layer is reused 8 times per forward pass |
| Recursion | Traversed once (4 inner recursive steps) at the midpoint | Nested inside each of the 8 middle-loop iterations (8 × 4 = 32 recursive steps) |
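A back-of-the-envelope view of where the savings come from. The per-layer size below is a made-up placeholder, and the layer-level saving is much larger than the overall ~23% because components shared by both variants (embeddings, heads) dilute it:

```python
# Illustrative parameter arithmetic (hypothetical per-layer size, not CRUX's real counts).
per_layer = 3.2e6                  # assume ~3.2M params per Transformer layer

heavy = 10 * per_layer             # Heavy: 10 unique sequential layers
light = 3 * per_layer              # Light: entry + one shared middle (reused 8x) + exit

savings = 1 - light / heavy
print(f"heavy: {heavy/1e6:.1f}M, light: {light/1e6:.1f}M, layer savings: {savings:.0%}")
# -> heavy: 32.0M, light: 9.6M, layer savings: 70%
```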
```mermaid
flowchart TB
    subgraph Core["Looped Core (Shared Weights)"]
        direction TB
        L_In["x_in"] --> SharedLayer["Shared Transformer Layer"]
        subgraph Inner["Nested Recursion (4x)"]
            direction TB
            R_In["Gated Transform"] --> R_Loop["Recurrent Refinement"]
        end
        SharedLayer --> R_In
        R_Loop --> L_Out["x_next"]
        L_Out -- "Repeat 8x" --> L_In
    end
    Input["Tokens + Time Embed"] --> Entry["Unique Entry Layer"]
    Entry --> L_In
    L_Out --> Exit["Unique Exit Layer"]
    Exit --> Head["Diffusion Head & Tied LM Head"]
```
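The diffusion head is trained to predict clean tokens from masked states. The sketch below shows only the basic masking/denoising setup with a uniform per-token mask probability; the mask id is a hypothetical placeholder, and CRUX's actual SNR-based schedule is not reproduced here:

```python
import torch

MASK_ID = 0  # hypothetical [MASK] token id

def mask_tokens(tokens, t):
    """Mask each token independently with probability t (t in [0, 1])."""
    noise = torch.rand_like(tokens, dtype=torch.float)
    return torch.where(noise < t, torch.full_like(tokens, MASK_ID), tokens)

# Toy demo: higher t -> more tokens replaced by the mask id.
tokens = torch.arange(1, 11).unsqueeze(0)  # one sequence of token ids 1..10
t = torch.rand(1).item()                   # sample a diffusion time
noisy = mask_tokens(tokens, t)
# The model is trained to predict `tokens` (clean) from `noisy` plus the time
# embedding, with cross-entropy on the masked positions; generation then runs
# this denoising step repeatedly, non-autoregressively, from a fully masked state.
```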
```shell
git clone https://github.com/TheronAI/crux.git
cd crux
pip install -r requirements.txt
```

CRUX can be trained with "micro-datasets" to study scaling efficiency:
```shell
python src/train.py \
    --dataset Marcus2112/minipile_density-proportioned_pico \
    --num-steps 62500 \
    --batch-size 16 \
    --recursive-depth 4
```

Alternatively, run `python src/train.py` without arguments to apply the settings in `config.yaml`.
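For the default path, `config.yaml` would hold the same settings; the keys below are hypothetical, mirroring the CLI flags above rather than the repository's actual schema:

```yaml
# Hypothetical config.yaml mirroring the CLI flags; actual key names may differ.
dataset: Marcus2112/minipile_density-proportioned_pico
num_steps: 62500
batch_size: 16
recursive_depth: 4
```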
Generate text with real-time denoising visualization:
```shell
python generate.py "My name is" --num-tokens 5
```

- Xiao, G., Tian, Y., Chen, B., Han, S., & Lewis, M. (2023). Efficient Streaming Language Models with Attention Sinks (arXiv:2309.17453). arXiv.
- Takehi, R., Clavié, B., Lee, S., & Shakir, A. (2025). Fantastic (small) Retrievers and How to Train Them: mxbai-edge-colbert-v0 Tech Report (arXiv:2510.14880). arXiv.
- MK2112. (2025). MiniPile Density Dataset Series, derived using the MiniCorpus Framework. GitHub.
- Zhang, A. L., Kraska, T., & Khattab, O. (2025). Recursive Language Models (arXiv:2512.24601). arXiv.
- von Rütte, D., Fluri, J., Pooladzandi, O., Schölkopf, B., Hofmann, T., & Orvieto, A. (2025). Scaling Behavior of Discrete Diffusion Language Models (arXiv:2512.10858). arXiv.
- Arriola, M., Gokaslan, A., Chiu, J. T., Yang, Z., Qi, Z., Han, J., ... & Kuleshov, V. (2025). Block diffusion: Interpolating between autoregressive and diffusion language models (arXiv:2503.09573). arXiv.
- Jordan, K. (2024). Muon: An optimizer for hidden layers in neural networks. Keller Jordan Blog.
```bibtex
@software{theron2026crux,
  author  = {TheronAI},
  title   = {CRUX: A Nested Recursive Language Model with Discrete Diffusion},
  year    = {2026},
  url     = {https://github.com/TheronAI/crux},
  version = {0.0.2},
  license = {MIT}
}
```