Scaling Specialization in Dense LMs

Do dense transformers, without routers, develop sparse, modular structure that becomes more specialized as model size grows? We:

  1. Measure activation sparsity (AS), feature Specialization Index (SI), and graph modularity (Q) across a consistent scaling suite (an AS sketch follows below).
  2. Explain features via Sparse Autoencoders (SAEs) to reveal monosemantic circuits.
  3. Exploit the structure using dynamic-k MLP execution for real FLOPs savings at fixed quality.

TL;DR — Specialization scales with size; you can cash it out for speed.
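
For concreteness, a minimal sketch of the AS measurement from step 1: hook one MLP layer, capture its hidden activations on a short prompt, and count the fraction that are near zero. It assumes the Hugging Face GPT-NeoX implementation that backs the Pythia checkpoints; the module path, threshold, and prompt are illustrative and are not the sdlms API.

# Illustrative only; not the sdlms capture pipeline.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-70m-deduped"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

captured = {}

def grab(_module, _inputs, output):
    # Keep the pre-activation output of the MLP up-projection for one pass.
    captured["pre_act"] = output.detach()

# GPT-NeoX naming: dense_h_to_4h is the MLP up-projection (an assumption
# about this checkpoint's architecture, not an sdlms layer name).
handle = model.gpt_neox.layers[3].mlp.dense_h_to_4h.register_forward_hook(grab)
with torch.no_grad():
    model(**tok("The capital of France is", return_tensors="pt"))
handle.remove()

hidden = F.gelu(captured["pre_act"])   # Pythia MLPs use a GELU nonlinearity
threshold = 1e-2                       # illustrative "near zero" cutoff
as_value = (hidden.abs() < threshold).float().mean().item()
print(f"activation sparsity ≈ {as_value:.3f}")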

Figure: high-level sdlms flow.

Quickstart

python -m venv .venv && source .venv/bin/activate
pip install -U pip
pip install -e .

# list MLP-ish layers in a checkpoint
python - <<'PY'
from sdlms.activations import list_layers
print(list_layers("EleutherAI/pythia-410m-deduped"))
PY

Minimal workflow

# 1) Capture activations on small probe tasks
python scripts/run_capture.py --model EleutherAI/pythia-410m-deduped --task-id ioi_minimal --layers model.layers.10.mlp

# 2) Train SAEs with a separate tool, then export the learned features

# 3) Compute metrics (AS, SI, Q)
python scripts/run_metrics.py

# 4) Dynamic-k eval (throughput vs perplexity)
python scripts/run_dynamick_eval.py --k 0.35
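
To make step 4 concrete, here is a minimal sketch of the idea behind dynamic-k execution: per token, keep only the top-k fraction of MLP hidden units by magnitude and drop the rest. This version just masks the dropped units to show the quality side of the trade-off; real FLOPs savings come from skipping the corresponding rows of the down-projection. The function name, shapes, and defaults are illustrative assumptions, not what scripts/run_dynamick_eval.py does internally.

import torch
import torch.nn.functional as F

def dynamic_k_mlp(x, w_up, b_up, w_down, b_down, k_frac=0.35):
    """GELU MLP that keeps only the top-k fraction of hidden units per token."""
    hidden = F.gelu(F.linear(x, w_up, b_up))        # [batch, seq, d_hidden]
    k = max(1, int(k_frac * hidden.shape[-1]))
    idx = hidden.abs().topk(k, dim=-1).indices
    mask = torch.zeros_like(hidden).scatter_(-1, idx, 1.0)
    return F.linear(hidden * mask, w_down, b_down)  # back to [batch, seq, d_model]

Sweeping k_frac traces the throughput-versus-perplexity curve; the --k 0.35 flag above presumably sets this kept fraction.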

CLI quickstart

# install dependencies through uv (preferred)
uv sync --all-groups

# measure activation sparsity on a probe task (writes a CSV to artifacts/sparsity)
uv run sparsity --model EleutherAI/pythia-70m-deduped --probe-manifest data/probe_tasks.jsonl --task-id toy_arithmetic

# launch a notebook to inspect results
uvx jupyter lab
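
A first notebook cell can simply pull in whatever the sparsity CLI wrote; the file layout under artifacts/sparsity is an assumption here, so adjust the glob to match your run.

from pathlib import Path
import pandas as pd

# Concatenate every sparsity CSV the CLI produced and take a first look.
csv_paths = sorted(Path("artifacts/sparsity").glob("*.csv"))
df = pd.concat((pd.read_csv(p) for p in csv_paths), ignore_index=True)
print(df.head())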

Reproducibility

  • Deterministic seeds where possible (see the sketch after this list)
  • Configs + exact prompts for probe tasks
  • All figures generated from notebooks/
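
As a sketch of what "deterministic seeds where possible" can mean in practice (the helper name and defaults are illustrative, not necessarily what sdlms exposes):

import os
import random

import numpy as np
import torch

def set_global_seed(seed: int = 0) -> None:
    """Seed Python, NumPy, and PyTorch RNGs and prefer deterministic kernels."""
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # needed for deterministic cuBLAS
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Some ops have no deterministic implementation, hence "where possible".
    torch.use_deterministic_algorithms(True, warn_only=True)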

License

MIT — see LICENSE.
