CRUX is a parameter-efficient (27M parameters) looped text-generation Transformer using discrete diffusion. It uses nested weight sharing: an outer loop of shared Transformer layers contains an inner loop of gated recursive processing.
Two versions of CRUX were built: CRUX-Heavy (36M parameters) uses only the inner loop and disables outer weight sharing, while CRUX-Light (27M parameters) uses the full nested-loop structure. CRUX-Light matches CRUX-Heavy's performance with ~23% fewer weights.
- Looped Sandwich Architecture: Features distinct "Entry" and "Exit" layers surrounding a core "Shared Middle" block that is executed 8 times per forward pass.
- Nested Gated Recursion: Each shared middle block itself contains an internal 4-step recursive loop, providing further computational depth (32 effective refinement steps) with minimal parameter overhead.
- Discrete Diffusion Training: Predicts clean tokens from masked/noisy states using an SNR-based schedule, providing a fast, non-autoregressive generation alternative.
- Optimization: Integrates Muon with Cautious Weight Decay for stable training.
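The nested loop structure above can be sketched as follows. This is an illustrative sketch only: the layer types, dimensions, and the exact gating form are assumptions, not the actual CRUX implementation.

```python
import torch
import torch.nn as nn

class NestedLoopSketch(nn.Module):
    """Sketch of CRUX-Light's nested weight sharing (hypothetical layers/dims)."""
    def __init__(self, d_model=256, n_heads=4, outer_steps=8, inner_steps=4):
        super().__init__()
        self.outer_steps, self.inner_steps = outer_steps, inner_steps
        # One shared Transformer layer, reused for every outer step.
        self.shared = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Gated recursive transform, reused for every inner step.
        self.transform = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(d_model, d_model)

    def forward(self, x):
        for _ in range(self.outer_steps):        # outer loop: shared middle block, 8x
            x = self.shared(x)
            for _ in range(self.inner_steps):    # inner loop: gated recursion, 4x
                g = torch.sigmoid(self.gate(x))  # gate controls how much to update
                x = g * torch.tanh(self.transform(x)) + (1 - g) * x
        return x  # 8 outer x 4 inner = 32 effective refinement steps

x = torch.randn(2, 16, 256)
y = NestedLoopSketch()(x)
```

Because both loops reuse the same weights, depth grows with iteration count rather than parameter count.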
| Feature | 36M "Heavy" Model (`model_heavy.py`) | 27M "Light" Model (`model.py`) |
|---|---|---|
| Global structure | 10 unique, sequential Transformer layers | Three distinct stages: Entry, Shared Middle, and Exit |
| Weight sharing | None across Transformer layers; only the internal recursive processing block shares weights with itself via recursion | Shared Middle layer is reused 8 times per forward pass |
| Recursion | Traversed once (4 inner recursive steps) at the midpoint | Nested inside each of the 8 middle-loop iterations (8 × 4 = 32 recursive steps) |
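A back-of-the-envelope view of where the savings come from. The per-layer size below is a made-up placeholder, and the layer-level saving is much larger than the overall ~23% because components shared by both variants (embeddings, heads) dilute it:

```python
# Illustrative parameter arithmetic (hypothetical per-layer size, not CRUX's real counts).
per_layer = 3.2e6                  # assume ~3.2M params per Transformer layer

heavy = 10 * per_layer             # Heavy: 10 unique sequential layers
light = 3 * per_layer              # Light: entry + one shared middle (reused 8x) + exit

savings = 1 - light / heavy
print(f"heavy: {heavy/1e6:.1f}M, light: {light/1e6:.1f}M, layer savings: {savings:.0%}")
# -> heavy: 32.0M, light: 9.6M, layer savings: 70%
```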
```mermaid
flowchart TB
    subgraph Core["Looped Core (Shared Weights)"]
        direction TB
        L_In["x_in"] --> SharedLayer["Shared Transformer Layer"]
        subgraph Inner["Nested Recursion (4x)"]
            direction TB
            R_In["Gated Transform"] --> R_Loop["Recurrent Refinement"]
        end
        SharedLayer --> R_In
        R_Loop --> L_Out["x_next"]
        L_Out -- "Repeat 8x" --> L_In
    end
    Input["Tokens + Time Embed"] --> Entry["Unique Entry Layer"]
    Entry --> L_In
    L_Out --> Exit["Unique Exit Layer"]
    Exit --> Head["Diffusion Head & Tied LM Head"]
```
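The diffusion head is trained to predict clean tokens from masked states. The sketch below shows only the basic masking/denoising setup with a uniform per-token mask probability; the mask id is a hypothetical placeholder, and CRUX's actual SNR-based schedule is not reproduced here:

```python
import torch

MASK_ID = 0  # hypothetical [MASK] token id

def mask_tokens(tokens, t):
    """Mask each token independently with probability t (t in [0, 1])."""
    noise = torch.rand_like(tokens, dtype=torch.float)
    return torch.where(noise < t, torch.full_like(tokens, MASK_ID), tokens)

# Toy demo: higher t -> more tokens replaced by the mask id.
tokens = torch.arange(1, 11).unsqueeze(0)  # one sequence of token ids 1..10
t = torch.rand(1).item()                   # sample a diffusion time
noisy = mask_tokens(tokens, t)
# The model is trained to predict `tokens` (clean) from `noisy` plus the time
# embedding, with cross-entropy on the masked positions; generation then runs
# this denoising step repeatedly, non-autoregressively, from a fully masked state.
```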
```shell
git clone https://github.com/TheronAI/crux.git
cd crux
pip install -r requirements.txt
```

CRUX can be trained with "micro-datasets" to study scaling efficiency:
```shell
python src/train.py \
    --dataset Marcus2112/minipile_density-proportioned_pico \
    --num-steps 62500 \
    --batch-size 16 \
    --recursive-depth 4
```

Alternatively, run `python src/train.py` without arguments to apply the settings in `config.yaml`.
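For the default path, `config.yaml` would hold the same settings; the keys below are hypothetical, mirroring the CLI flags above rather than the repository's actual schema:

```yaml
# Hypothetical config.yaml mirroring the CLI flags; actual key names may differ.
dataset: Marcus2112/minipile_density-proportioned_pico
num_steps: 62500
batch_size: 16
recursive_depth: 4
```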
Generate text with real-time denoising visualization:
```shell
python generate.py "My name is" --num-tokens 5
```

- Xiao, G., Tian, Y., Chen, B., Han, S., & Lewis, M. (2023). Efficient Streaming Language Models with Attention Sinks (arXiv:2309.17453). arXiv.
- Takehi, R., Clavié, B., Lee, S., & Shakir, A. (2025). Fantastic (small) Retrievers and How to Train Them: mxbai-edge-colbert-v0 Tech Report (arXiv:2510.14880). arXiv.
- MK2112. (2025). MiniPile Density Dataset Series, derived using the MiniCorpus Framework. GitHub.
- Zhang, A. L., Kraska, T., & Khattab, O. (2025). Recursive Language Models (arXiv:2512.24601). arXiv.
- von Rütte, D., Fluri, J., Pooladzandi, O., Schölkopf, B., Hofmann, T., & Orvieto, A. (2025). Scaling Behavior of Discrete Diffusion Language Models (arXiv:2512.10858). arXiv.
- Arriola, M., Gokaslan, A., Chiu, J. T., Yang, Z., Qi, Z., Han, J., ... & Kuleshov, V. (2025). Block diffusion: Interpolating between autoregressive and diffusion language models (arXiv:2503.09573). arXiv.
- Jordan, K. (2024). Muon: An optimizer for hidden layers in neural networks. Keller Jordan Blog.
```bibtex
@software{theron2026crux,
  author  = {TheronAI},
  title   = {CRUX: A Nested Recursive Language Model with Discrete Diffusion},
  year    = {2026},
  url     = {https://github.com/TheronAI/crux},
  version = {0.0.2},
  license = {MIT}
}
```