Fix kohya FLUX CLIP text-encoder LoRA loading under transformers>=5 (#13984) #14029

Open
Jefsky wants to merge 1 commit into huggingface:main from Jefsky:fix/kohya-flux-clip-te-lora-transformers5

@Jefsky Jefsky commented Jun 21, 2026

Fixes #13984

Root cause

transformers>=5 flattened CLIPTextModel: the text_model. wrapper module was removed, so text_encoder.named_modules() now yields names like encoder.layers.0.self_attn.k_proj instead of text_model.encoder.layers.0.self_attn.k_proj.

The kohya→diffusers conversion still emits text-encoder keys prefixed with text_model. (e.g. text_encoder.text_model.encoder.layers.0.self_attn.k_proj.lora_B.weight). In _load_lora_into_text_encoder, the rank dict is built by matching named_modules() against the converted state-dict keys. Under transformers>=5 nothing matches, so rank stays empty, and get_peft_kwargs then evaluates list(rank_dict.values())[0], which raises IndexError.
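
The failure mode can be reproduced without either library. The sketch below is a simplified stand-in, not the diffusers source: `build_rank` is a hypothetical helper, tensor shapes are modelled as plain tuples, and the keys are assumed to be already relative to the text encoder.

```python
# Simplified stand-in for the rank-dict construction; NOT the diffusers source.
# Shapes are plain tuples so the example needs neither torch nor transformers.

def build_rank(module_names, state_dict):
    """Hypothetical helper: record the LoRA rank for every module that has a lora_B key."""
    rank = {}
    for name in module_names:
        key = f"{name}.lora_B.weight"
        if key in state_dict:
            # lora_B has shape (out_features, r), so the rank is dimension 1.
            rank[key] = state_dict[key][1]
    return rank


# Converted kohya key, still carrying the stale "text_model." prefix.
converted = {
    "text_model.encoder.layers.0.self_attn.k_proj.lora_B.weight": (768, 4),
}

# Module names as the description says named_modules() reports them.
legacy_names = ["text_model.encoder.layers.0.self_attn.k_proj"]  # transformers<5
flat_names = ["encoder.layers.0.self_attn.k_proj"]               # transformers>=5

legacy_rank = build_rank(legacy_names, converted)  # one match
flat_rank = build_rank(flat_names, converted)      # no match, empty dict

error = None
try:
    # Same access pattern the description attributes to get_peft_kwargs.
    first_rank = list(flat_rank.values())[0]
except IndexError as exc:
    error = exc
```

With the legacy names the rank dict has one entry; with the flattened names it is empty, and indexing the empty list raises the reported error.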

The PEFT-side fix (#3212) doesn't help here because the crash happens before any PEFT state-dict injection (confirmed by the issue reporter).

Fix

In _load_lora_into_text_encoder, after the convert_state_dict_to_peft call, strip the stale text_model. prefix from the converted state-dict keys when the text encoder doesn't have the text_model submodule (i.e. the transformers>=5 layout). The check uses hasattr(text_encoder, "text_model"), not a version number, so it's forward-compatible.
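
A minimal sketch of that key rewrite, under assumptions: `strip_stale_text_model_prefix` and the two stand-in encoder classes are hypothetical names used for illustration, and the actual code in the diff may differ.

```python
# Illustrative sketch of the prefix stripping; the helper name and the
# stand-in encoder classes are hypothetical, not taken from the diff.

def strip_stale_text_model_prefix(state_dict, text_encoder, stale_prefix="text_model."):
    """Drop the leading 'text_model.' from keys when the encoder has no such submodule."""
    if hasattr(text_encoder, "text_model"):
        # Traditional layout: keys already line up with named_modules().
        return dict(state_dict)
    return {
        (key[len(stale_prefix):] if key.startswith(stale_prefix) else key): value
        for key, value in state_dict.items()
    }


class LegacyEncoder:
    """Stand-in for the transformers<5 layout (has a text_model attribute)."""
    text_model = object()


class FlatEncoder:
    """Stand-in for the flattened transformers>=5 layout."""


keys = {"text_model.encoder.layers.0.self_attn.k_proj.lora_B.weight": "B"}

unchanged = strip_stale_text_model_prefix(keys, LegacyEncoder())
stripped = strip_stale_text_model_prefix(keys, FlatEncoder())
```

Because the branch is keyed on the encoder's structure rather than a version string, the same call is a no-op on the traditional layout and only rewrites keys on the flattened one.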

Test

Added FluxLoRATests::test_kohya_clip_text_encoder_flattened_compat — a fast CPU regression test that:

  1. Builds the dummy FLUX pipeline
  2. Removes the text_model attribute from text_encoder (simulating transformers>=5)
  3. Passes a synthetic kohya-style state dict with the stale text_model. prefix
  4. Asserts the adapter is correctly injected (no IndexError)

Verification

  • Before fix: the synthetic state dict with text_model. prefix causes IndexError: list index out of range in get_peft_kwargs
  • After fix: the adapter loads correctly on both the flattened layout (transformers>=5) and the traditional layout (transformers<5, where hasattr(text_encoder, "text_model") is True)

…uggingface#13984)

Under transformers>=5, CLIPTextModel was flattened: the text_model. wrapper
module was removed, so named_modules() returns unprefixed names like
'encoder.layers.0.self_attn.k_proj' instead of
'text_model.encoder.layers.0.self_attn.k_proj'.

Kohya-sourced LoRA state dict keys still carry the stale 'text_model.' prefix
after conversion, causing _load_lora_into_text_encoder to build an empty rank
dict (nothing matches) and crash with IndexError in get_peft_kwargs.

Fix: after the PEFT state dict conversion, strip 'text_model.' from state dict
keys when the encoder doesn't have the text_model submodule (the
transformers>=5 layout), so they align with named_modules() output.

Added a regression test test_kohya_clip_text_encoder_flattened_compat that
simulates the flattened CLIPTextModel layout and passes synthetic kohya-style
keys.

@BenjaminBossan BenjaminBossan left a comment

Member

Thanks for the PR. I agree that it should fix the specific issue that was mentioned and that it should be safe to apply. My concern is that it is very specialized to this case and not a general solution. The PR covers this exact case:

https://github.com/huggingface/transformers/blob/bfd3604d83e84d7ff8bbc18bc09c21e8282d31f9/src/transformers/conversion_mapping.py#L608

But there could be other entries in the conversion mapping, now or added in the future, which are not covered by this patch. So I wonder if we should instead call get_model_conversion_mapping from Transformers on the model (if it's a Transformers model) and apply the conversions from there.

I'll leave it up to the Diffusers maintainers to decide how to deal with this.

Development

Successfully merging this pull request may close these issues.

FLUX LoRA with CLIP text-encoder weights fails (empty rank -> IndexError) under transformers>=5