Conversation

@bryce13950
Collaborator

Description

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

Fixes # (issue)

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Screenshots

Please attach before and after screenshots of the change if applicable.

Checklist:

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have not rewritten tests relating to key interfaces which would affect backward compatibility

bryce13950 and others added 8 commits September 21, 2025 17:57
* created individual processing functions

* extracted state dict and inserted back into instance after processing

* created weight processing shared class

* added test coverage for new functions

* updated hooked transformer to use new shared functions

* created test

* moved over weight processing

* replaced keys

* used the correct function

* created test for making sure path translation works correctly

* fixed weight processing

* added additional tests

* formatted tests a bit

* cleaned up

* fixed unit test

* fixed indentation

* fixed doc string

* fixed unit test

* fixed type

* fixed some tests

* fixed test

* fixed setup of tests

* cleaned up test

* started working through individual matches

* added test coverage

* tested function a bit

* integrated weight conversion into weight processing

* simplified functions

* identified individual problem lines

* identified divergences more clearly

* brought back error lines
* improved accuracy a bit

* got models to match

* removed forward pass stuff

* cleaned up weight processing a bit

* removed working attention

* restored files

* created loop to verify weight conversion

* finished compatibility layer

* finished testing hugging face weights

* setup correct init

* added some tests

* removed separate component

* fixed some integration tests
@bryce13950 changed the title from "Dev 3.x folding" to "Transformer bridge layer norm folding" on Sep 27, 2025
bryce13950 and others added 21 commits September 29, 2025 22:33

* fixed typing issue

* fixed typing and format issues

* fixed ci issues

* ran format

* fixed mypy issues

* removed extra file

* removed old scripts

* tested format

* fixed some tests

* ran format

* fixed tests

* fixed acceptance tests

* fixed some more tests

* synced functionality completely

* reduced old references

* removed remaining references

* moved forward functions

* removed forward

* tested various forwards

* worked on getting original forwards back into place

* added more coverage

* cleaned up model

* git status

* Fix automatic weight extraction to use reference HookedTransformer

This restores the working weight extraction mechanism that creates a reference
HookedTransformer internally and extracts exact processed weights for perfect
compatibility with ablation studies.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
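
The extraction mechanism described above can be sketched as follows. `HookedTransformer.from_pretrained` and its processing flags (`fold_ln`, `center_writing_weights`, `center_unembed`) are real TransformerLens APIs; the wrapper function itself is an illustrative assumption, not the actual bridge code.

```python
# Sketch of the reference-model idea: build a HookedTransformer with weight
# processing enabled, then hand its processed state dict to the bridge.
def extract_processed_weights(model_name: str) -> dict:
    """Load a reference HookedTransformer with standard weight processing
    applied and return its processed state dict."""
    # Imported lazily so the sketch can be read without loading the library.
    from transformer_lens import HookedTransformer

    reference = HookedTransformer.from_pretrained(
        model_name,
        fold_ln=True,                 # fold LayerNorm into adjacent weights
        center_writing_weights=True,  # center weights writing to the residual stream
        center_unembed=True,          # center the unembedding matrix
    )
    # Clone so the extracted tensors do not alias the reference model.
    return {k: v.clone() for k, v in reference.state_dict().items()}
```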

* moved embed stuff from bridge

* moved MLP stuff

* cleaned up a bit

* cleaned up a bit

* removed extra block

* created pos embed bridge

* fixed unembed

---------

Co-authored-by: Claude <[email protected]>
* moved final layer norm

* moved layer norm forward

* cleaned up more things

* updated attention weight loading

* fixed function names
* fixed some ci issues

* fixed type issues

* ran format

* fixed test

* fixed type issues

* fixed type issue

* fixed type issue

* fixed test

* fixed test

* fixed issues

* ran format

* fixed typing

* fixed tests

* fixed tests

* simplified test

* sped up tests

* added check for kv cache

* ran format

* skipped some tests

* marked a couple tests to skip

* ran some more optimizations

* ran poetry lock

* regenerated lock

* fixed commands

* set random seed

* updated parallelism prop

* updated command

* reverted some changes

* updated notebook settings

* updated verbosity

* removed extra test

* cleaned up tests some more

* marked test as skipped

* fixed more tests

* sped up CI

* reverted CI changes

* reverted actions changes

* improved cache

* sped up some tests

* optimized more tests

* sped up some more tests

* made more speed improvements

* fixed error

* fixed typing
* cleaned up some debug points

* fixed attention hooks

* enabled hooks in test
* split out some tasks into their own jobs

* removed bad file

* updated name
* fixed batch dimension

* removed log point

* fixed potential error

* sped up load

* ran format

* improved hf cache handling

* fixed bridge

* fixed cache again

* added more checks

* removed parallel execution
* fixed cache hooks

* fixed test and typing

* fixed test
* fixed bias displaying

* fixed ablation issue

* fixed type issue
* setup new hooks properly

* fixed type checks
* fixed alias hook props

* ran format
* made all hooks show properly

* ran format

* fixed type checks
* updated loading in main demo to use transformers bridge

* updated model name

* updated imports

* updated some cells

* reran demo

* updated some cells

* reran some cells

* reran demo

* ran demo again

* finished generating new cells
* Update README.md (#957)

Update link to Streamlit tutorial and guide.

Co-authored-by: Bryce Meyer <[email protected]>

* improve model properties table in docs (#769)

* add static to gitignore

* making a meaningless change to see if tests pass at all

* making a meaningless change to see if tests pass at all

* add interactive table static html only

adding things one at a time to see what causes things to break

* run poetry update with no changes to deps

* revert lockfile change

* add tiktoken >=0.7.0 to group docs

* add dep muutils >=0.6.15 to group docs

* add improved interactive table generation

we still generate a plain markdown table

code is from the old PR: https://github.com/mivanit/TransformerLens/blob/add-better-model-properties-table/docs/make_docs.py
which is in turn a modified version of https://github.com/mivanit/transformerlens-model-table

* fix format -- missing trailing newline

* fix type hints for compatibility

* fix torch device meta in make docs script, also improved hot reload

* TEMPORARY: allow_except when getting models to deal with mixtral HF_TOKEN issue

* added simple test for get_model_info

* context manager for controlling device, tests were breaking due to default device meta

* formatted with wrong version of black, oops

* fix path to generated model_properties_table

* fix md table header, add title in yaml frontmatter

* add line to frontmatter yaml, re-run tests bc huggingface down?

* do not allow exceptions when getting models

* re-run poetry lock

* attempt fix lockfile

* re-run poetry lock

---------

Co-authored-by: Bryce Meyer <[email protected]>

* switch pyproject from toml to uv, generate lockfile

also update tiktoken dep for 3.13 compatibility

* update makefile to use uv

* update actions

* hack to get version to work

* wip

* make dep

* update contributing.md to reflect switch from poetry to uv

* add type hints to supported_models

* fix paths in make_docs.py

* docs group not in default, update install instructions for docs

* POETRY_PYPI_TOKEN_PYPI -> PYPI_TOKEN_PYPI

* make format

* fix default groups, re-add docs

* add some deps needed in notebooks

* removed use of torchtyping in othello_GPT.ipynb and deps

- torchtyping causes various issues if it's imported
- presumably jaxtyping should be used instead??
- othello GPT notebook doesn't actually use the imported TT
  - shouldn't a linter/formatter catch this sort of unused import?

* fix: add pythonpath "." to pytest config for test imports

Configure pytest to include project root in Python path, enabling
`from tests.foo import bar`
style imports, which were broken by switching to uv
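
The fix can be expressed as a small config entry; a sketch assuming the pytest options live in `pyproject.toml` (the `pythonpath` ini option requires pytest >= 7.0):

```toml
[tool.pytest.ini_options]
# Add the project root to sys.path so `from tests.foo import bar` resolves
# without an editable install.
pythonpath = ["."]
```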

* attempt jupyter issue fix

* issue ref explaining ipython version restriction

* updated ci commands after recent work

* fixed more setup items

* added tabulate dependency

* updated make docs command

* updated dependencies

* fixed docs

---------

Co-authored-by: jmole <[email protected]>
Co-authored-by: Bryce Meyer <[email protected]>
* setup tests for hooks

* ran format

* merged legacy hooks tests

* ran format

* enabled compatibility mode

* added remaining hooks

* fixed type issue

* added main demo cached output

* removed debug items

* reran notebook

* marked cell for skipping

* reran notebook

* regenerated demo

* regenerated notebook
* updated loading in arena content demo to use transformer bridge

* updated install reference

* removed extra params

* ran some cells

* updated arena notebook

---------

Co-authored-by: Bryce Meyer <[email protected]>
* regenerated with new hooks

* ran first cell
* added test coverage for ensuring compatibility

* ran format

* fixed unit tests

* resolved type issue

* added init files

* added init file

* fixed tokenize function

* fixed attention mask issues

* reverted invalid change to test
bryce13950 and others added 29 commits November 6, 2025 04:02
* finalized benchmark logic

* ran format
* improved various models

* improved llama

* fixed benchmark utils now

* fixed test

* fixed opt adapter

* fixed phi-3

* fixed format

* fixed type issues

* added line break

* fixed phi-3
* improved benchmarking tools

* added individual benchmarks

* ran format

* ran format

* revised gpt2 again
* resolved experts mapping issues

* ran format

* Fix GPT-OSS JointGateUpMLPBridge hook alias resolution and add tests

This commit addresses hook alias resolution issues for GPT-OSS MoE models
and adds comprehensive unit tests.

Changes:
1. Fixed JointGateUpMLPBridge hook_aliases to use gate.hook_out instead of
   in.hook_out/input.hook_out, which don't exist in this bridge type
2. Added 7 comprehensive unit tests in test_gpt_oss_moe.py that verify:
   - Model loads without downloading weights (using meta device)
   - Bridge creation works correctly
   - MLP uses JointGateUpMLPBridge (not regular MLPBridge)
   - Compatibility mode hooks are accessible
   - Experts structure is correct (batched tensors, not iterable modules)
   - Hook aliases resolve correctly
   - No incorrect BlockBridge wrapper around experts

Root cause:
- JointGateUpMLPBridge inherits from MLPBridge which has hook_aliases
  expecting in.hook_out or input.hook_out submodules
- JointGateUpMLPBridge creates gate and up submodules instead, causing
  AttributeError when resolving aliases
- Solution: Override hook_aliases at class level to use gate.hook_out

Testing:
All 7 tests pass, verifying GPT-OSS loads correctly and hooks work in
compatibility mode without downloading the full 20B parameter model.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
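
The class-level override described in the root cause can be sketched with simplified stand-ins; the real bridge classes live in TransformerLens, and these only mirror the alias-resolution shape of the fix.

```python
class MLPBridge:
    # Parent maps the compatibility-mode hook name to an `in` submodule hook.
    hook_aliases = {"hook_pre": "in.hook_out"}


class JointGateUpMLPBridge(MLPBridge):
    # This bridge has `gate`/`up` submodules instead of `in`, so resolving the
    # inherited alias would raise AttributeError. Override at class level to
    # point at a hook that actually exists on this bridge.
    hook_aliases = {"hook_pre": "gate.hook_out"}


print(JointGateUpMLPBridge.hook_aliases["hook_pre"])  # gate.hook_out
```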

* added model name to available models

* added eps config

* updated to rotary bridge

* passed through to parent

* fixed tuple passing

* changed oss bridge

* fixed gpt oss activations

* ran format

* removed colab compat from checks

* decoupling weight processing completely from hooked transformer

* fixed weight processing issues

* fixed test

* fixed tests

* fixed forward tests

* updated test

* fixed last test

* fixed format

* ran format

* added whitespace

* finished weight processing generalization

* fixed type check

* fixed test

* fixed format

* made checks continue after first failure

* fixed test g

---------

Co-authored-by: Claude <[email protected]>
* fixed tensor storing

* fixed type issue
* disabled patched forwards

* fixed mlp linear for gpt2 for simpler implementation

* rearranged some forward pass items

* fixed type issue and updated demo run sequence

* reran arena notebook

* reran main demo

* regenerated demo

* fixed mid hook issue

* regenerated demo

* fixed test

* regenerated demo
* disabled patched forwards

* fixed mlp linear for gpt2 for simpler implementation

* rearranged some forward pass items

* got gemma 3 a lot closer

* updated gemma 3

* got gemma 3 to pass perfectly

* finished gemma 3 match

* fixed type issue and updated demo run sequence

* reran arena notebook

* reran main demo

* regenerated demo

* fixed mid hook issue

* regenerated demo

* fixed test

* regenerated demo

* fixed type checks

* fixed check

* fixed format

* fixed test

* fixed precision

* updated test

* updated window
* setup real aliases

* fixed format

* set submodules as props

* fixed submodule setting

* cut out the last bit of extras

* fixed type check

* removed startup alias function

* fixed test
* disabled patched forwards

* fixed mlp linear for gpt2 for simpler implementation

* rearranged some forward pass items

* got gemma 3 a lot closer

* updated gemma 3

* got gemma 3 to pass perfectly

* finished gemma 3 match

* fixed type issue and updated demo run sequence

* reran arena notebook

* reran main demo

* regenerated demo

* fixed mid hook issue

* regenerated demo

* matched os

* fixed test

* regenerated demo

* fixed type checks

* fixed check

* fixed format

* fixed test

* fixed precision

* updated test

* updated window

* ran format

* fixed type check

* fixed test

* updated weight processing to extract properly

* started moving around weight processing

* fixed v bias folding

* updated weight loading

* debugged joint qkv a bit further

* fixed gemma 3 init

* fixed hook_z

* fixed mypy

* ran format

* removed extra markdown

* updated key checking to use unified structure

* matched shapes in hooks

* got gpt2 to pass completely again

* ran format

* fixed some shapes

* fixed typing

* updated shapes

* fixed tests

* fixed test

* fixed kv cache issue

* made weight processing more generic

* fixed test issue

* removed oss

* updated test

* fixed test

* got tests to run again

* cleaned up component

* fixed test

* fixed tests and regenerated arena notebook

* regenerated demo
* trimmed memory a bit

* fixed type checks
…1123)

* created benchmark suite for unsupported models in hooked transformer

* fixed function registration

* Add hook_structure.py module for cross-model validation

This module was missing from the repository, causing CI failures.
It provides structure-only validation of hooks that can work across
different model architectures.

Key functions:
- validate_hook_shape_compatibility() - Cross-model shape validation
- benchmark_forward_hooks_structure() - Forward hook structure tests
- benchmark_backward_hooks_structure() - Backward hook structure tests
- benchmark_activation_cache_structure() - Cache structure tests

These enable validation of models without HookedTransformer references
by comparing their hook structure against GPT-2.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
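
The structure-only validation idea can be illustrated as below. The function name matches the commit message, but the body is an assumption, not the repository code: it compares hook names and tensor ranks between a reference model (e.g. GPT-2) and a candidate, without requiring identical dimensions.

```python
def validate_hook_shape_compatibility(
    reference_hooks: dict[str, tuple[int, ...]],
    candidate_hooks: dict[str, tuple[int, ...]],
) -> list[str]:
    """Return a list of mismatch descriptions; an empty list means the
    candidate's hook structure is compatible with the reference."""
    problems = []
    for name, ref_shape in reference_hooks.items():
        if name not in candidate_hooks:
            problems.append(f"missing hook: {name}")
        elif len(candidate_hooks[name]) != len(ref_shape):
            # Compare ranks only, so models of different widths still pass.
            problems.append(f"rank mismatch at {name}")
    return problems


ref = {"blocks.0.hook_resid_pre": (1, 8, 64)}
cand = {"blocks.0.hook_resid_pre": (1, 8, 128)}
print(validate_hook_shape_compatibility(ref, cand))  # []
```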

* fixed memory issues

* ran format

* reran main demo

---------

Co-authored-by: Claude <[email protected]>
* fixed remaining gemma 3 benchmarks

* fixed type check
* got a bunch of models closer

* fixed trust remote issue

* fixed phi after merge

* fixed bos token

* fixed arena issue

* fixed typing

* fixed gqa

* fixed issue

* ran format

* fixed benchmark util and demo

* fixed some hooks

* added bos support

* fixed gemma again

* fixed llama

* skipped t5 in ci

* fixed qwen loading

* removed phi-2 from ci

* fixed type check

* fixed type issues

* fixed some more attention things

* fixed no processing check

* fixed type and ran format

* reverted generalized components changes

* restored tests

* reverted more changes

* reverted transformers changes

* skipped test

* fixed skip if check
* set up benchmark suite, and trimmed out extra tests

* ran format
* fixed remaining gemma 3 benchmarks

* fixed type check

* fixed adapter

* cleaned up attention component

* cleaned up processed weight setting

* ran format

* fixed type issues
* made cross model comparisons more explicit

* finished updating benchmarks for cross model

* finished testing new benchmarks

* fixed typing

* synced attention params
* made cross model comparisons more explicit

* finished updating benchmarks for cross model

* finished testing new benchmarks

* fixed typing

* synced attention params

* wrapped up initial oss work

* skipped irrelevant tests for moe

* ran format
* removed legacy private functions

* cleaned up adapter

* expanded benchmarks

* got gemma 2 closer

* added processing step to architecture adapter

* setup rotary to sub component properly

* cleaned up block

* added recursive test of models

* got gemma 2 closer

* removed extras

* ran format

* restored files

* removed some dead code

* fixed test

* fixed typing

* fixed test

* fixed more tests

* removed more dead code

* added xet to cache

* fixed potential recursion issue

* reverted bad change

* bumped cache

* added verbose output

* Add --tb=short to coverage-report-test for better error visibility

* added more verbose output

* configured output to display errors right away

* skipped problematic tests

* removed extra flags

* skipped hooked encoder decoder

* skipped failing test
* finished matching 2 & 3

* cleaned up components

* cleaned up weight processing

* fixed embed key mapping

* cleaned up key processing

* removed extra functions

* cleaned up a bit more

* cleaned up bridge a bit

* generalized weight processing a bit more

* allowed attention to be passed through

* restored gemma 3

* fixed gpt 2 again

* ran format

* cleaned up a couple things

* refactored weight processing to not use split weights

* fixed granular tests

* added more granular tests

* removed invalid import

* cleaned up a couple things

* removed extra function call

* moved weight integration

* cleaned up a bit more

* cleaned up some more

* added verbose mode

* ran format

* fixed doc string and mypy issues

* fixed tests

* fixed more tests

* fixed issue

* added function calls again

* updated format

* restored last ln step

* moved around stop at layer

* retested some tests
* updated joint component

* cleaned up a bit

* added working bridge

* removed some detection

* fixed state dict filtering

* removed extra function

* cleaned up a bit

* updated process to use main keys

* Revert "updated process to use main keys"

This reverts commit cacf71a.

* cleaned up some things

* removed extra junk

* tested more models

* fixed gemma 3 ln issue

* refactored conversions

* integrated conversion throughout

* configured gpt2 some more

* cleaned up conversion and reversions

* removed extra param

* untied embed from state dict directly

* broke tying

* validated ship

* cleaned up components

* cleaned up a bit

* resolved some gpt2 items

* fixed unembed bias

* fixed final check

* ran format

* fixed gemma 3

* cleaned up extra stuff

* cleaned up linear bridge a bit

* fixed keys

* fixed unit tests

* fixed tests

* removed extra function

* moved function

* regenerated demo

* ran format

* regenerated demo
* removed merge file

* cleaned up imports

* cleaned up some adapters

* fixed up some adapters

* added ability to use parent paths when mapping components

* cleaned up gpt

* cleaned up more architecture adapters

* ran format

* fixed type issues
* Removed all attributes of  which directly mapped keys. These attributes are now handled by the component mapping Bridge classes

* Formatting update

* Removed additional missed key
* Removed all attributes of  which directly mapped keys. These attributes are now handled by the component mapping Bridge classes

* Remove source keys where they have been made redundant by the bridges

* Formatting update

* Remove source keys where they have been made redundant by the bridges

* created qwen 3 adapter

---------

Co-authored-by: jlarson <[email protected]>

---------

Co-authored-by: Bryce Meyer <[email protected]>
* cleaned up a lot of things

* removed extra function

* fixed typing

* fixed index bug

* removed extra stuff

* fixed main demo

* removed bad chunk

* removed attention check

* fixed cache

* fixed type check

* fixed demo issue

* fixed test

* fixed typing

* updated type

* restored patched function

* fixed gemma 3

* fixed test

* fixed typing

* continued working through gemma compat

* fixed more issues

* got closer

* ran format

* fixed extra config

* grouped results by phase

* cleaned up adapter

* fixed weight processing issue

* revised hooks

* fixed phase 2

* set flags correctly

* revised benchmarks for granularity

* improved gemma compatibility

* cleaned up memory

* revised architecture adapters

* cleaned up memory

* improved some models

* used correct component

* verified more architectures

* fixed more models

* ran format

* fixed typing

* fixed typing

* fixed test

* fixed t5

* fixed format

* fixed test
@bryce13950 merged commit e4d9784 into dev-3.x on Dec 7, 2025 (25 of 30 checks passed)