Do dense transformers, without routers, develop sparse, modular structure that becomes more specialized as model size grows? We:
- Measure activation sparsity (AS), feature Specialization Index (SI), and graph modularity (Q) across a consistent scaling suite (a sketch of the AS metric appears below the TL;DR).
- Explain features via Sparse Autoencoders (SAEs) to reveal monosemantic circuits.
- Exploit the structure using dynamic-k MLP execution for real FLOPs savings at fixed quality.
TL;DR — Specialization scales with size; you can cash it out for speed.
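For concreteness, here is a minimal sketch of the AS metric from the first bullet, assuming AS is the fraction of post-nonlinearity MLP activations whose magnitude falls below a small threshold, averaged over tokens. The function name `activation_sparsity` and the threshold `tau` are illustrative, not this repo's API.

```python
# Illustrative AS computation (not the repo's implementation): the fraction
# of post-GELU MLP hidden activations with magnitude below a threshold tau.
import torch

def activation_sparsity(h: torch.Tensor, tau: float = 1e-3) -> float:
    """h: [tokens, d_ff] hidden activations from one MLP layer."""
    return (h.abs() < tau).float().mean().item()

h = torch.nn.functional.gelu(torch.randn(8, 2048))  # stand-in for captured activations
print(f"AS = {activation_sparsity(h):.3f}")
```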
```bash
python -m venv .venv && source .venv/bin/activate
pip install -U pip
pip install -e .
```
```bash
# list MLP-ish layers in a checkpoint
python - <<'PY'
from sdlms.activations import list_layers
print(list_layers("EleutherAI/pythia-410m-deduped"))
PY
```
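The heredoc above uses the repo's `sdlms.activations.list_layers`. If you only want to see which submodules count as "MLP-ish" in a Hugging Face checkpoint, a plain `named_modules()` walk gives the same picture; the snippet below is an illustrative standalone equivalent, not the repo's code, and `list_layers` may filter or normalize names differently.

```python
# Standalone sketch: print every submodule whose qualified name ends in ".mlp".
# (The repo's list_layers may normalize these names differently.)
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-410m-deduped")
for name, _ in model.named_modules():
    if name.endswith(".mlp"):
        print(name)
```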
```bash
# 1) Capture activations on small probe tasks
python scripts/run_capture.py --model EleutherAI/pythia-410m-deduped --task-id ioi_minimal --layers model.layers.10.mlp
# 2) Train SAEs (separate tool) and export features
# 3) Compute metrics (AS, SI, Q)
python scripts/run_metrics.py
# 4) Dynamic-k eval (throughput vs perplexity)
python scripts/run_dynamick_eval.py --k 0.35
```
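Step 2 above hands SAE training off to a separate tool. As a rough sketch of what a sparse autoencoder over captured MLP activations looks like (the architecture, widths, and L1 coefficient below are assumptions, not that tool's defaults):

```python
# Minimal SAE sketch: reconstruct MLP activations through an overcomplete
# ReLU bottleneck, with an L1 penalty encouraging sparse, monosemantic features.
import torch
import torch.nn as nn

class TinySAE(nn.Module):
    def __init__(self, d_in: int, d_feat: int):
        super().__init__()
        self.enc = nn.Linear(d_in, d_feat)
        self.dec = nn.Linear(d_feat, d_in)

    def forward(self, x):
        f = torch.relu(self.enc(x))          # sparse feature activations
        return self.dec(f), f

sae = TinySAE(d_in=2048, d_feat=8192)
x = torch.randn(16, 2048)                    # captured activations would go here
x_hat, f = sae(x)
loss = ((x_hat - x) ** 2).mean() + 1e-3 * f.abs().mean()  # reconstruction + L1
```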
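Step 4's `--k 0.35` keeps roughly the top 35% of MLP hidden units per token. Below is a minimal sketch of the dynamic-k idea, assuming top-k-by-magnitude selection on the post-nonlinearity hidden state; the function and weight names are illustrative, and the real `scripts/run_dynamick_eval.py` may select and fuse the computation differently.

```python
# Dynamic-k MLP sketch: per token, keep only the top k-fraction of hidden
# units by |activation| and zero the rest, so the down-projection only needs
# the surviving rows. A fused kernel that skips the zeroed rows is where the
# FLOPs savings at (approximately) fixed quality come from.
import torch

def dynamic_k_mlp(x, w_in, w_out, k=0.35):
    """x: [tokens, d_model]; w_in: [d_model, d_ff]; w_out: [d_ff, d_model]."""
    h = torch.nn.functional.gelu(x @ w_in)              # [tokens, d_ff]
    n_keep = max(1, int(k * h.shape[-1]))
    idx = h.abs().topk(n_keep, dim=-1).indices          # per-token top-k units
    mask = torch.zeros_like(h).scatter_(-1, idx, 1.0)
    return (h * mask) @ w_out

x = torch.randn(4, 512)
w_in, w_out = torch.randn(512, 2048), torch.randn(2048, 512)
print(dynamic_k_mlp(x, w_in, w_out).shape)              # torch.Size([4, 512])
```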
```bash
# install dependencies through uv (preferred)
uv sync --all-groups
# measure activation sparsity with a prompt (writes CSV to artifacts/sparsity)
uv run sparsity --model EleutherAI/pythia-70m-deduped --probe-manifest data/probe_tasks.jsonl --task-id toy_arithmetic
# launch a notebook to inspect results
uvx jupyter lab
```

- Deterministic seeds where possible (a minimal seeding sketch follows this list)
- Configs + exact prompts for probe tasks
- All figures generated from notebooks/
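For the seeding bullet above, a minimal sketch of what "deterministic where possible" usually amounts to in PyTorch; the repo's actual seeding may be handled elsewhere.

```python
# Illustrative seeding helper; not necessarily how this repo sets seeds.
import os
import random

import numpy as np
import torch

def set_seed(seed: int = 0) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)      # no-op on CPU-only machines
    os.environ["PYTHONHASHSEED"] = str(seed)
```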
MIT — see LICENSE.
