Complete step-by-step pipeline. Entire run was done on a single RTX 3060 12GB in bfloat16. No QLoRA/bitsandbytes — manual layer freezing only.
# Tested on:
# GPU: NVIDIA RTX 3060 12GB
# VRAM: 12 GB (11.6 GB usable)
# OS: Ubuntu 22.04 / Linux
# Python: 3.10+
# CUDA: 12.x
pip install -r requirements.txtBase model:
state-spaces/mamba-2.8b-hf(~5.5 GB download from HuggingFace)
Pulls from UltraChat (chat), GSM8K (math), HumanEval (code) and reformats to:
[LOGIC] question ==== answer / [CHAT] ... / [CODE] ...
python pipeline/phase1_build_dataset.py
# Output: data/latent_dataset.jsonl (~10,164 rows)Verify:
python eval/test_phase1.pyTrains only x_proj, dt_proj, embed_tokens layers (frozen core). BF16.
Loss 17.3 → 10.5 over 500 steps.
python pipeline/phase2_sft_trainer.py
# Checkpoint: checkpoints/mamba-2.8b-sft/
# ~45 min on RTX 3060Verify:
python eval/eval_phase2_sft.pyExtracts hidden states from Phase 2 checkpoint and trains a 3-layer MLP probe on fractional ramp targets conditioned on loop depth.
python pipeline/phase3_train_halting_head.py
# Output: checkpoints/mamba-2.8b-sft/halting_head.pt
# Target: MAE < 0.06 (achieved: 0.052)Verify:
python eval/eval_phase3_halting_head.py
# Expect: ~88% halt-step accuracy on 2000-sample probe72-row dataset with <TOOL: BASH> and <RESULT> tags.
Loss 13.7 → 0.9 in 200 steps.
python pipeline/phase4_build_tool_dataset.py
# Output: data/tool_dataset.jsonl
python pipeline/phase4_tool_sft_trainer.py
# Checkpoint: checkpoints/mamba-2.8b-tool/Verify:
python eval/eval_phase4_tool_use.py
# Expect: 3/3 tool emission tests passMerges Phase 4 checkpoint + HaltingHead probe into a single deployable directory.
python pipeline/phase5_merge_and_export.py
# Output: checkpoints/mamba-2.8b-latent/
# ├── config.json / tokenizer.json / model.safetensors
# ├── halting_head.pt
# └── engine_manifest.jsonNo training required — purely inference-time. Test it:
python session_memory.py --new-session demo
# /save /load demo /history /clearpython agent_loop.py "How much disk space is available?"
python agent_loop.py "What Python version is installed?"Run the 4-proof scientific test harness:
python eval/the_crucible.py
# Proof 1: ACT loop proportionality
# Proof 2: O(1) VRAM flatline
# Proof 3: Lobotomy ablation (W=8 vs W=4)
# Proof 4: Tool execution
# Expected ACT results:
# ARC-Challenge: ~5.9 loops avg
# HellaSwag: ~2.0 loops avg
# Ratio: ~3x ← emergent scaling# Generative eval — all 4 tasks with reasoning loops active
python eval/generative_benchmark.py
# Format-free content eval — answer text matching, no A/B/C/D
python eval/content_benchmark.py
# ARC-Challenge only, fastest (200 samples, ~7 min)
python eval/eval_latent_arc.py
# Log-likelihood lobotomy baseline (for comparison)
lm_eval --model hf \
--model_args pretrained=checkpoints/mamba-2.8b-latent,trust_remote_code=True,dtype=bfloat16 \
--tasks arc_challenge,hellaswag,piqa,winogrande \
--num_fewshot 0 \
--output_path benchmark_results/If generative scores show "Verbose Genius" failure (correct reasoning, wrong letter format):
python pipeline/mc_format_patch.py
# Trains embeddings only, 300 steps, LR=5e-4
# Output: checkpoints/mamba-2.8b-latent-mc/| Checkpoint | Size | Note |
|---|---|---|
mamba-2.8b-sft/ |
~5.55 GB | After Phase 2 |
mamba-2.8b-tool/ |
~5.55 GB | After Phase 4 |
mamba-2.8b-latent/ |
~5.55 GB | Final engine |
halting_head.pt |
~5 MB | Co-located with engine |
sessions/demo.pt |
~32 KB | Per-session state |
Intermediate checkpoints can be deleted after merging. Only
mamba-2.8b-latent/is needed for inference.