Ideogram4 lora training by apolinario · Pull Request #13861 · huggingface/diffusers

apolinario · 2026-06-03T16:01:00Z

DreamBooth LoRA training script + Ideogram4 LoRA loader mixin.

LoRA targets the conditional transformer only (asymmetric CFG: the unconditional branch is the CFG prior).
Timestep sampling uses Ideogram 4's resolution-aware logit-normal schedule via the standard --weighting_scheme / --logit_mean / --logit_std args (defaults set to the model's schedule).
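
For readers unfamiliar with logit-normal timestep sampling, here is a minimal sketch, assuming the usual formulation in which a normal draw is passed through a sigmoid. The `logit_mean` and `logit_std` values are placeholders, not Ideogram 4's defaults, and the function name is hypothetical. The resolution-aware part of the schedule is left out because its exact form is not described in this thread.

```python
import math
import random


def sample_logit_normal_timesteps(n, logit_mean=0.0, logit_std=1.0, seed=0):
    """Draw n timesteps in (0, 1) by applying a sigmoid to normal samples."""
    rng = random.Random(seed)
    timesteps = []
    for _ in range(n):
        u = rng.gauss(logit_mean, logit_std)  # u ~ N(logit_mean, logit_std^2)
        timesteps.append(1.0 / (1.0 + math.exp(-u)))  # sigmoid maps the real line into (0, 1)
    return timesteps


# Placeholder parameters: samples cluster around t = 0.5 and thin out near 0 and 1.
timesteps = sample_logit_normal_timesteps(1000)
```

Raising `logit_mean` moves the bulk of the samples toward t = 1, and raising `logit_std` spreads them toward both ends.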

Stacked on #13859.

HuggingFaceDocBuilderDev · 2026-06-03T16:10:52Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

bghira · 2026-06-04T01:47:42Z

Mean loss: 1.3919, min: 0.852, max: 2.25.

intentionally super high loss? it's like training the audio branch with LTX2

bghira · 2026-06-04T11:57:09Z

    def fuse_qkv_projections(self):
        # The attention already uses a single fused `qkv` projection, so there is nothing to fuse.
        raise NotImplementedError(
            "Ideogram4Transformer2DModel already uses a fused QKV projection (`attention.qkv`), "
            "so `fuse_qkv_projections()` is not applicable."
        )

    def unfuse_qkv_projections(self):
        raise NotImplementedError(
            "Ideogram4Transformer2DModel uses a fused QKV projection that cannot be split, "
            "so `unfuse_qkv_projections()` is not applicable."
        )

these were removed, despite qkv now being split. can you re-add?

bghira · 2026-06-04T18:08:51Z

@joangava did you test the script? it's not working.

bghira · 2026-06-04T18:32:55Z

finally identified the issues:

- the fp8 weights are having the scale discarded by this script's loader; it doesn't actually load the quantised weights properly, which causes the NaN loss and black images
- the hf accelerate library seems to have a bug: unwrap_model isn't removing the forward wrapper that Accelerate adds during model prepare, which causes collapsed outputs on step 1. disabling autocast is actually the better move for ideogram (that's how simpletuner works)
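
To make the first point concrete, here is a toy sketch of scaled quantization, assuming a single per-tensor scale with integer rounding standing in for fp8. It is not the loader in this PR or any library's checkpoint format. It only shows why weights read back without their scale come out with the wrong magnitude.

```python
def quantize(weights, qmax=127):
    """Toy per-tensor quantization: store small integers plus one float scale."""
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights, scale):
    """Recover approximate weights; the scale is required."""
    return [q * scale for q in q_weights]


weights = [0.02, -0.01, 0.005, -0.03]
q_weights, scale = quantize(weights)

restored = dequantize(q_weights, scale)  # close to the original weights
scale_dropped = [float(q) for q in q_weights]  # what a loader gets if it ignores the scale

# Rounding error stays within half a quantization step when the scale is applied.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
# Without the scale, magnitudes are off by a factor of 1 / scale.
blowup = max(abs(v) for v in scale_dropped) / max(abs(w) for w in weights)
```

In this toy case the unscaled values are several thousand times too large, which is the kind of error that could plausibly surface as NaN losses in a real network.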

linoytsaban · 2026-06-05T06:28:29Z

re: fp8 compatibility: not yet available. diffusers-compatible weights are coming soon; this is also related to the PR "Incorporate safetensors support to TorchAO" (#13719).
which accelerate version are you using @bghira?

bghira · 2026-06-05T12:04:44Z

well their Fp8Linear sucks anyway, it's not using scaled mm and it's upcasting to bf16 on every forward pass
accelerate version: 1.13.0

linoytsaban · 2026-06-15T12:57:07Z

On documenting fp8 training: diffusers doesn't currently support training directly from the fp8 checkpoint (see my earlier note: #13861 (comment)). A working path atm can be to dequantize the SDNQ fp8 checkpoint to bf16 (via the sdnq package) and train on that, documented as a small dequantize-at-load wrapper. wdyt about documenting this alternative in the README vs leaving the fp8 checkpoint out of the examples for now until native fp8 loading is merged?

bghira · 2026-06-15T13:02:32Z

but, guys, the split qkv question is still up in the air. it's in an inconsistent state right now, and "pin to a revision if you want to use the fused projections" isn't a great workaround. i'm not sure why the original weights had to be modified compulsively by the Diffusers team. multiple developers have asked about why this keeps being done, with no response given. it made LoRAs incompatible between Diffusers' version of ideogram4 weights/model code and literally everybody else's. why? if "it's just what we do", why wasn't flux.2 modified the same way? it's not even done reliably / consistently across model families.

dxqb · 2026-06-15T13:14:19Z

adding to that, please see https://github.com/Comfy-Org/ComfyUI/blob/7d4194d984abbfcd49ec93a615b95327c031ac69/comfy/utils.py#L652 for an example of an inference tool working around the qkv split. it has to be implemented again and again for each model.
splitting and fusing is trivial for full model weights, but LoRAs (and especially other peft types) cannot be converted from split to fused mathematically (within the same rank), so the inference tool has to support both formats.

renaming layer keys is inconvenient but can be handled. However I agree that diffusers should not change the tensor shape from the officially released model by its creator
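
A small worked example of the rank argument above, as a sketch on toy matrices rather than real checkpoint tensors. It assumes the standard LoRA form delta_W = B @ A and a fused weight that stacks q, k, v along the output dimension. All shapes and values are made up for illustration.

```python
from fractions import Fraction


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rank(matrix):
    """Rank via exact Gaussian elimination over fractions."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    rank_found = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(rank_found, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank_found], rows[pivot] = rows[pivot], rows[rank_found]
        for r in range(rank_found + 1, len(rows)):
            factor = rows[r][col] / rows[rank_found][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[rank_found])]
        rank_found += 1
    return rank_found


# Three independent rank-1 LoRAs on split q, k, v projections.
# Shapes: B is (out=2, r=1), A is (r=1, in=3).
a_q, a_k, a_v = [[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]
b_q, b_k, b_v = [[1], [1]], [[1], [2]], [[2], [1]]

# The equivalent update on a fused qkv weight stacks the three deltas row-wise.
fused_delta = matmul(b_q, a_q) + matmul(b_k, a_k) + matmul(b_v, a_v)
split_to_fused_rank = rank(fused_delta)  # 3, so no rank-1 fused LoRA can reproduce it

# The other direction is exact: a rank-1 fused LoRA splits by slicing B and sharing A.
b_fused, a_fused = [[1], [2], [3], [4], [5], [6]], [[1, 2, 3]]
delta_from_fused = matmul(b_fused, a_fused)
delta_from_slices = (
    matmul(b_fused[0:2], a_fused) + matmul(b_fused[2:4], a_fused) + matmul(b_fused[4:6], a_fused)
)
```

The stacked update has rank 3 while each adapter has rank 1, so an exact split-to-fused conversion needs up to three times the rank. Fused-to-split works at the same rank because the three slices of B can share one A.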

bghira · 2026-06-15T13:16:26Z

as for fp8 quantised loading w/ scaled mm it's rather simple, already implemented elsewhere, and reduces the compute required by almost half for mem-bw constrained cards like L40S and 4090. i don't think bf16 or nf4 are qualified replacements, nf4 is upcasting to bf16 matmuls. bf16 itself consumes more vram. even on a 5090 with fp8 scaled matmul for training, it's not a huge speed gain (more mem bw there) but we do see about 10gb drop in vram vs bf16 which is the difference between a bad batch size and a good one.

Disty0 · 2026-06-15T15:03:42Z

SDNQ FP8 model (or the 4bit model too) can be trained on directly with a simple and fast in-place conversion to SDNQ Training format:

    from sdnq.training import convert_sdnq_model_to_training

    quantized_model = convert_sdnq_model_to_training(
        quantized_model,
        quantized_matmul_dtype="float8_e4m3fn",  # overrides the quantized matmul dtype to be different than the weights_dtype format
        use_grad_ckpt=True,  # disable this if you are not using gradient checkpointing
        use_quantized_matmul=False,  # use quantized matmul on the forward and backward pass (False means no quantized matmul at all)
        use_stochastic_rounding=True,  # only used when you update the quantized model weights
        dequantize_fp32=True,  # keeps the quant scales in FP32 and computes the de-quant steps in FP32; highly recommended
    )

One downside with this is that the SDNQ Training model cannot be saved with Safetensors because of the custom SDNQTensor used with training. This shouldn't matter for LoRa training, it is only relevant for full finetuning.

Add examples/dreambooth/train_dreambooth_lora_ideogram4.py + README + requirements:
- DreamBooth LoRA training for Ideogram 4 (flow-matching, dual transformer, Qwen3-VL TE)
- nf4 QLoRA and SDNQ fp8 bases: --do_fp8_training trains the fp8 checkpoint in place (scaled matmul), or omit it to dequantize the SDNQ base to bf16
- --disable_training_autocast (Ideogram4's forward is corrupted by autocast)
- structured JSON caption support + --upsample_prompt
- model card with a validation-image gallery

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

dg845 · 2026-06-20T00:08:46Z

@bot /style

github-actions · 2026-06-20T00:10:26Z

Style bot fixed some files and pushed the changes.

dg845 · 2026-06-20T07:36:08Z

Hi @bghira and @dxqb, thanks for your feedback. On the question of diffusers' split Q,K,V projections vs an original checkpoint with a fused QKV projection, would the following hypothetical diffusers PEFT workflow address (or at least mitigate) your concerns?

1. Fuse QKV projections to match the original checkpoint
2. Checkpoint weight name remapping
   - After this, the diffusers checkpoint and original PEFT checkpoint should have the same semantics
3. Inject original checkpoint-compatible adapter with peft

If so, do you think (1) and/or (2) should be exposed as diffusers utilities, or would it be fine if the workflow was contained within a method similar to load_lora_weights?
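
The fuse and rename steps in the workflow above can be sketched on a plain dict of nested lists. This assumes weights are stored as (out_features, in_features) and fused in q, k, v order. The `to_q`/`to_k`/`to_v` key names, the `blocks.0.attention` prefix, and the helper itself are hypothetical; only the fused name `attention.qkv` comes from the code quoted earlier in this thread.

```python
def fuse_qkv_state_dict(state_dict, prefix, fused_name="qkv"):
    """Return a copy where <prefix>.to_q/to_k/to_v.weight are replaced by one fused weight."""
    split_keys = [f"{prefix}.{name}.weight" for name in ("to_q", "to_k", "to_v")]
    fused = {k: v for k, v in state_dict.items() if k not in split_keys}
    # Fusing concatenates the rows (output features) of q, then k, then v.
    fused[f"{prefix}.{fused_name}.weight"] = [row for key in split_keys for row in state_dict[key]]
    return fused


def linear(weight, x):
    """Bias-free linear layer on plain lists: y = W @ x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


# Hypothetical key names and toy 2x2 weights.
split_sd = {
    "blocks.0.attention.to_q.weight": [[1, 0], [0, 1]],
    "blocks.0.attention.to_k.weight": [[2, 0], [0, 2]],
    "blocks.0.attention.to_v.weight": [[1, 1], [1, -1]],
    "blocks.0.attention.out.weight": [[1, 0], [0, 1]],
}
fused_sd = fuse_qkv_state_dict(split_sd, "blocks.0.attention")

x = [3, 4]
fused_out = linear(fused_sd["blocks.0.attention.qkv.weight"], x)
split_out = [
    y
    for name in ("to_q", "to_k", "to_v")
    for y in linear(split_sd[f"blocks.0.attention.{name}.weight"], x)
]
```

Here `fused_out` and `split_out` are identical, which is the sense in which fusing full weights is lossless. The adapter injected in the last step would then target the fused layer directly.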

If not, what do you think a good diffusers PEFT workflow would look like? Are there any other problems with split Q,K,V vs fused QKV besides PEFT support that you'd like to see addressed?

bghira · 2026-06-20T08:34:20Z

i'd asked for improvements to fusion of qkv projections before and was told they're just experimental and eg. not meant for training. in SimpleTuner, there's a lot of changes made to this so that they became useful for training; the original split projections are entirely dropped from the model instead of left lying around like in the original helper logic. i only found out they're left around because when i first implemented a fused qkv training pipeline, the original matrices were still receiving LoRA target layers and being updated through those. it was quite confusing. then i found out the attn processors are splitting the fused layer, running the attn calc, and then re-fusing.

basically, that whole process is fragile and seemingly pointless enough in diffusers that it either should never be modified from how the original weights were distributed (eg. Flux2) or the whole ecosystem needs to change to match how Diffusers wants to split something.

for ideogram4 it means you shouldn't have changed things, and the Diffusers weights should be reverted on the hub back to their original state. i don't see why we want to do 6 calculations instead of just 2 (for sequential cond/uncond passes), or better yet, batching cond + uncond for models that are capable of it (eg. qwen-image) because there's about 10% of speed left off the table with them split just for that model on an H100 SXM5, even when the varlen FA3 kernel has to be used.

for qwen image we see 2400 compiled region launches over 60 blocks at 20 steps (2 passes per step), and that's a lot of calls. this is reduced to just 20 calls with batched cond+uncond forward calls through a block stack function that stays compiled, with RoPE left out of the mix (there's an awful cosine kernel that gets pulled into the graph and slows things down). and then there's the use of complex tensors that conflict with torch inductor which get left around and ... i figured you guys were testing for torch compile compat. but that's a separate thing.
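
The launch counts quoted above follow from simple multiplication. The sketch below assumes one compiled region per block per pass, which is one reading of the comment rather than a measured breakdown, and the helper name is hypothetical.

```python
def compiled_region_launches(steps, blocks, passes_per_step):
    """Launch count when each block is its own compiled region and passes run separately."""
    return steps * blocks * passes_per_step


# 60 blocks, 20 steps, separate cond and uncond passes.
per_block_launches = compiled_region_launches(steps=20, blocks=60, passes_per_step=2)

# Batching cond + uncond into one forward and compiling the whole block stack
# as a single region leaves one call per step.
batched_stack_launches = compiled_region_launches(steps=20, blocks=1, passes_per_step=1)
```

That gives 2400 launches in the per-block case and 20 in the batched, whole-stack case, matching the figures in the comment.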

dg845 · 2026-06-20T09:11:36Z

I have opened an issue for supporting original checkpoint-compatible PEFT adapters at #14002, we can discuss that issue there.

dg845 · 2026-06-20T09:30:58Z

Similarly, I opened an issue about split Q,K,V vs fused QKV projections at #14003. We can continue the discussion of that issue there.

github-actions Bot added the lora, size/L (PR with diff > 200 LOC), pipelines, examples, and loaders labels Jun 3, 2026

Base automatically changed from add-ideogram-4 to main June 3, 2026 22:03

linoytsaban force-pushed the ideogram4-lora-training branch from 6f8d6e9 to 0128816 Compare June 4, 2026 09:53

joangava approved these changes Jun 4, 2026

linoytsaban force-pushed the ideogram4-lora-training branch from e7c1205 to b94cd4c Compare June 8, 2026 10:02

linoytsaban requested a review from sayakpaul June 9, 2026 10:00

github-actions Bot added the documentation (Improvements or additions to documentation) label Jun 9, 2026

This was referenced Jun 11, 2026

Support loading non-diffusers Ideogram4 LoRA checkpoints #13919

Merged

Add Ideogram4LoraLoaderMixin (LoRA loading for Ideogram4) #13921

Merged

linoytsaban force-pushed the ideogram4-lora-training branch from 575d8e3 to fd6a858 Compare June 11, 2026 12:57

linoytsaban changed the base branch from main to ideogram4-lora-loader June 11, 2026 12:59

github-actions Bot removed the documentation (Improvements or additions to documentation) and lora labels Jun 16, 2026

linoytsaban requested review from dg845 and removed request for Copilot June 17, 2026 13:27

Copilot started reviewing on behalf of linoytsaban June 17, 2026 13:28

linoytsaban force-pushed the ideogram4-lora-training branch from a425b47 to 0c38f7e Compare June 17, 2026 15:10

github-actions Bot added the documentation (Improvements or additions to documentation), lora, models, tests, pipelines, and loaders labels Jun 17, 2026

linoytsaban changed the base branch from ideogram4-lora-loader to main June 17, 2026 15:11

Merge branch 'main' into ideogram4-lora-training

e9c1fe1

github-actions Bot removed the documentation (Improvements or additions to documentation), lora, models, tests, pipelines, and loaders labels Jun 19, 2026

Merge branch 'main' into ideogram4-lora-training

b2d52ca

Apply style fixes

c2b68ad

dg845 mentioned this pull request Jun 20, 2026

Support Original Checkpoint-Compatible PEFT Adapters #14002

Open

dg845 mentioned this pull request Jun 20, 2026

Split Q,K,V vs Fused QKV Projections in diffusers Discussion #14003

Open

Merge branch 'main' into ideogram4-lora-training

1b26554
