Ideogram4 lora training#13861

Open
apolinario wants to merge 5 commits into main from ideogram4-lora-training

@apolinario

Collaborator

DreamBooth LoRA training script + Ideogram4 LoRA loader mixin.

  • LoRA targets the conditional transformer only (asymmetric CFG: the unconditional branch is the CFG prior).
  • Timestep sampling uses Ideogram 4's resolution-aware logit-normal schedule via the standard --weighting_scheme / --logit_mean / --logit_std args (defaults set to the model's schedule).
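
As a rough illustration of the logit-normal part of that schedule, here is a minimal stdlib sketch. It only shows the generic mechanism (a Gaussian sample pushed through a sigmoid). The function name, the default values, and the `seed` argument are assumptions for illustration, and the resolution-aware adjustment is not modelled because the thread does not describe it.

```python
import math
import random


def sample_logit_normal_timesteps(batch_size, logit_mean=0.0, logit_std=1.0, seed=None):
    """Draw timesteps in (0, 1) by pushing Gaussian samples through a sigmoid.

    Hypothetical helper: the defaults are placeholders, not Ideogram 4's schedule.
    """
    rng = random.Random(seed)
    timesteps = []
    for _ in range(batch_size):
        u = rng.gauss(logit_mean, logit_std)  # u ~ N(logit_mean, logit_std^2)
        timesteps.append(1.0 / (1.0 + math.exp(-u)))  # sigmoid maps u into (0, 1)
    return timesteps


timesteps = sample_logit_normal_timesteps(batch_size=4, logit_mean=0.0, logit_std=1.0, seed=0)
```

Raising `logit_mean` moves the sampled timesteps toward 1, and lowering it moves them toward 0, which is the knob `--logit_mean` exposes.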

Stacked on #13859.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Base automatically changed from add-ideogram-4 to main June 3, 2026 22:03
@bghira

bghira commented Jun 4, 2026

Contributor

Mean loss: 1.3919, min: 0.852, max: 2.25.

intentionally super high loss? it's like training the audio branch with LTX2

@linoytsaban linoytsaban force-pushed the ideogram4-lora-training branch from 6f8d6e9 to 0128816 Compare June 4, 2026 09:53
@bghira

bghira commented Jun 4, 2026

Contributor
    def fuse_qkv_projections(self):
        # The attention already uses a single fused `qkv` projection, so there is nothing to fuse.
        raise NotImplementedError(
            "Ideogram4Transformer2DModel already uses a fused QKV projection (`attention.qkv`), "
            "so `fuse_qkv_projections()` is not applicable."
        )

    def unfuse_qkv_projections(self):
        raise NotImplementedError(
            "Ideogram4Transformer2DModel uses a fused QKV projection that cannot be split, "
            "so `unfuse_qkv_projections()` is not applicable."
        )

these were removed, despite qkv now being split. can you re-add?

@bghira

bghira commented Jun 4, 2026

Contributor

@joangava did you test the script? it's not working.

@bghira

bghira commented Jun 4, 2026

Contributor

finally identified the issues:

  • the fp8 weights have their scale discarded by this script's loader, so the quantised weights aren't actually loaded properly. this causes the NaN loss and black images
  • the hf accelerate library seems to have a bug: unwrap_model isn't removing the forward wrapper that Accelerate adds during model prepare, which causes collapsed outputs on step 1. disabling autocast is actually the better move for ideogram (that's how simpletuner works)
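
To make the first point concrete, here is a toy sketch of per-tensor scaled quantization in plain Python. It is not the loader from this PR: the helper names are hypothetical, and integer rounding stands in for real fp8 rounding. The only grounded constant is 448, the largest finite value of `float8_e4m3fn`. It shows that codes read back without their scale come out orders of magnitude too large, which is one plausible route to overflow and NaNs.

```python
def quantize_per_tensor(weights, max_code=448.0):
    """Map weights onto a limited code range and return (codes, scale)."""
    scale = max(abs(w) for w in weights) / max_code
    codes = [round(w / scale) for w in weights]  # integer rounding as a stand-in for fp8
    return codes, scale


def dequantize(codes, scale=1.0):
    """Recover approximate weights; scale=1.0 mimics a loader that dropped the scale."""
    return [c * scale for c in codes]


weights = [0.02, -0.015, 0.007, -0.001]
codes, scale = quantize_per_tensor(weights)

with_scale = dequantize(codes, scale)  # close to the original weights
without_scale = dequantize(codes)      # raw codes interpreted as weights
```

Here `with_scale` stays within one quantization step of `weights`, while `without_scale` has magnitudes up to 448 for weights that were at most 0.02.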

@linoytsaban

linoytsaban commented Jun 5, 2026

Collaborator

@bghira

bghira commented Jun 5, 2026

Contributor
  • well their Fp8Linear sucks anyway, it's not using scaled mm and it's upcasting to bf16 on every forward pass
  • 1.13.0

@linoytsaban linoytsaban force-pushed the ideogram4-lora-training branch from e7c1205 to b94cd4c Compare June 8, 2026 10:02
@linoytsaban linoytsaban requested a review from sayakpaul June 9, 2026 10:00
@github-actions github-actions Bot added the documentation label Jun 9, 2026
@linoytsaban linoytsaban force-pushed the ideogram4-lora-training branch from 575d8e3 to fd6a858 Compare June 11, 2026 12:57
@linoytsaban linoytsaban changed the base branch from main to ideogram4-lora-loader June 11, 2026 12:59
@linoytsaban

Collaborator

On documenting fp8 training: diffusers doesn't currently support training directly from the fp8 checkpoint (see my earlier note: #13861 (comment)). A working path atm can be to dequantize the SDNQ fp8 checkpoint to bf16 (via the sdnq package) and train on that, documented as a small dequantize-at-load wrapper. wdyt about documenting this alternative in the README vs leaving the fp8 checkpoint out of the examples for now until native fp8 loading is merged?

@bghira

bghira commented Jun 15, 2026

Contributor

but, guys, the split qkv question is still up in the air. it's in an inconsistent state right now, and "pin to a revision if you want to use the fused projections" isn't a great workaround. i'm not sure why the original weights had to be modified compulsively by the Diffusers team. multiple developers have asked about why this keeps being done, with no response given. it made LoRAs incompatible between Diffusers' version of ideogram4 weights/model code and literally everybody else's. why? if "it's just what we do", why wasn't flux.2 modified the same way? it's not even done reliably / consistently across model families.

@dxqb

dxqb commented Jun 15, 2026

Contributor

> but, guys, the split qkv question is still up in the air. it's in an inconsistent state right now, and "pin to a revision if you want to use the fused projections" isn't a great workaround. i'm not sure why the original weights had to be modified compulsively by the Diffusers team. multiple developers have asked about why this keeps being done, with no response given. it made LoRAs incompatible between Diffusers' version of ideogram4 weights/model code and literally everybody else's. why? if "it's just what we do", why wasn't flux.2 modified the same way? it's not even done reliably / consistently across model families.

adding to that, please see https://github.com/Comfy-Org/ComfyUI/blob/7d4194d984abbfcd49ec93a615b95327c031ac69/comfy/utils.py#L652 for an example of an inference tool working around the qkv split. it has to be implemented again and again for each model.
splitting and fusing is trivial for full model weights, but LoRAs (and especially other peft types) cannot be converted from split to fused mathematically (within the same rank), so the inference tool has to support both formats.

renaming layer keys is inconvenient but can be handled. However, I agree that diffusers should not change tensor shapes from those of the model officially released by its creator.
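
The rank argument can be checked on a toy case. The sketch below uses made-up 3x3 projections with rank-1 LoRA factors (all values and names are illustrative, not taken from any real checkpoint). Stacking the three rank-1 updates gives a fused update of rank 3, so no rank-1 fused LoRA can reproduce it. An exact fused form does exist, but only at rank 3, with a block-diagonal `B`.

```python
from fractions import Fraction


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matrix_rank(matrix):
    """Rank via Gaussian elimination over exact fractions."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(len(rows)):
            if r != rank and rows[r][col] != 0:
                factor = rows[r][col] / rows[rank][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[rank])]
        rank += 1
        if rank == len(rows):
            break
    return rank


# Rank-1 LoRA factors per split projection: delta_x = B[x] @ A[x] (toy values).
A = {"q": [[1, 2, 0]], "k": [[0, 1, 1]], "v": [[1, 0, 1]]}
B = {"q": [[1], [2], [3]], "k": [[4], [5], [6]], "v": [[7], [8], [9]]}

# Update a fused qkv layer would need: the three deltas stacked along the output dim.
fused_delta = [row for name in "qkv" for row in matmul(B[name], A[name])]

# Exact fused LoRA at rank 3: stacked A, block-diagonal B.
fused_A = [A[name][0] for name in "qkv"]
fused_B = [
    [B[name][i][0] if name == col_name else 0 for col_name in "qkv"]
    for name in "qkv"
    for i in range(3)
]

split_rank = matrix_rank(matmul(B["q"], A["q"]))
fused_rank = matrix_rank(fused_delta)
```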

@bghira

bghira commented Jun 15, 2026

Contributor

as for fp8 quantised loading w/ scaled mm it's rather simple, already implemented elsewhere, and reduces the compute required by almost half for mem-bw constrained cards like L40S and 4090. i don't think bf16 or nf4 are qualified replacements, nf4 is upcasting to bf16 matmuls. bf16 itself consumes more vram. even on a 5090 with fp8 scaled matmul for training, it's not a huge speed gain (more mem bw there) but we do see about 10gb drop in vram vs bf16 which is the difference between a bad batch size and a good one.
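
For readers unfamiliar with the term, the identity behind a scaled matmul can be sketched in plain Python. This is only the algebra with per-tensor scales, under the assumption that both operands carry a single scale. The variable names are hypothetical, and nothing here reflects the actual kernels or hardware paths discussed above. Multiplying the low-precision codes first and applying the combined scale once to the output gives the same result as dequantizing (upcasting) both operands before the matmul.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def scale_matrix(matrix, scale):
    """Multiply every entry by a scalar."""
    return [[value * scale for value in row] for row in matrix]


x_codes = [[8, -4], [2, 16]]    # quantized activations (toy values)
w_codes = [[1, 3], [-2, 5]]     # quantized weights (toy values)
x_scale, w_scale = 0.5, 0.25    # powers of two, so the float arithmetic is exact

# Path 1: dequantize both operands first, then multiply in high precision.
upcast = matmul(scale_matrix(x_codes, x_scale), scale_matrix(w_codes, w_scale))

# Path 2: multiply the codes, then apply the combined scale once to the output.
scaled = scale_matrix(matmul(x_codes, w_codes), x_scale * w_scale)
```

Path 2 never materializes a high-precision copy of the weights, which is the property the comment above is asking for.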

@Disty0

Disty0 commented Jun 15, 2026

Contributor

The SDNQ FP8 model (or the 4-bit model) can be trained on directly with a simple and fast in-place conversion to the SDNQ Training format:

from sdnq.training import convert_sdnq_model_to_training

quantized_model = convert_sdnq_model_to_training(
    quantized_model,
    quantized_matmul_dtype="float8_e4m3fn", # overrides the quantized matmul dtype to be different than weights_dtype format.
    use_grad_ckpt=True, # disable this if you are not using gradient checkpointing
    use_quantized_matmul=False, # use quantized matmul on the forward pass and the backward pass (False means no quantized matmul at all)
    use_stochastic_rounding=True, # This is only used when you update the quantized model weights
    dequantize_fp32=True, # keeps the quant scales in FP32 and compute the de-quant steps in FP32. Highly recommended to enable this option
)

One downside with this is that the SDNQ Training model cannot be saved with Safetensors because of the custom SDNQTensor used with training. This shouldn't matter for LoRA training; it is only relevant for full finetuning.

@github-actions github-actions Bot removed the documentation and lora labels Jun 16, 2026
@linoytsaban linoytsaban requested review from dg845 and removed request for Copilot June 17, 2026 13:27
Add examples/dreambooth/train_dreambooth_lora_ideogram4.py + README + requirements:
- DreamBooth LoRA training for Ideogram 4 (flow-matching, dual transformer, Qwen3-VL TE)
- nf4 QLoRA and SDNQ fp8 bases: --do_fp8_training trains the fp8 checkpoint in place
  (scaled matmul), or omit it to dequantize the SDNQ base to bf16
- --disable_training_autocast (Ideogram4's forward is corrupted by autocast)
- structured JSON caption support + --upsample_prompt
- model card with a validation-image gallery

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@linoytsaban linoytsaban force-pushed the ideogram4-lora-training branch from a425b47 to 0c38f7e Compare June 17, 2026 15:10
@github-actions github-actions Bot added the documentation, lora, models, tests, pipelines, and loaders labels Jun 17, 2026
@linoytsaban linoytsaban changed the base branch from ideogram4-lora-loader to main June 17, 2026 15:11
@github-actions github-actions Bot removed the documentation, lora, models, tests, pipelines, and loaders labels Jun 19, 2026
@dg845

dg845 commented Jun 20, 2026

Collaborator

@bot /style

@github-actions

github-actions Bot commented Jun 20, 2026

Contributor

Style bot fixed some files and pushed the changes.

@dg845

dg845 commented Jun 20, 2026

Collaborator

Hi @bghira and @dxqb, thanks for your feedback. On the question of diffusers' split Q,K,V projections vs an original checkpoint with a fused QKV projection, would the following hypothetical diffusers PEFT workflow address (or at least mitigate) your concerns?

  1. Fuse QKV projections to match the original checkpoint
  2. Checkpoint weight name remapping
    1. After this, the diffusers checkpoint and original PEFT checkpoint should have the same semantics
  3. Inject original checkpoint-compatible adapter with peft

If so, do you think (1) and/or (2) should be exposed as diffusers utilities, or would it be fine if the workflow was contained within a method similar to load_lora_weights?

If not, what do you think a good diffusers PEFT workflow would look like? Are there any other problems with split Q,K,V vs fused QKV besides PEFT support that you'd like to see addressed?
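
Steps (1) and (2) of the workflow above could look roughly like the sketch below, which uses nested lists in place of tensors. Everything here is an assumption for illustration: the helper name, the split key names (`to_q`, `to_k`, `to_v`), the `blocks.0.attention` prefix, and the q, k, v ordering inside the fused weight. Only the fused name `qkv` is taken from the `attention.qkv` mentioned earlier in this thread.

```python
def fuse_qkv_state_dict(state_dict, prefix, split_names=("to_q", "to_k", "to_v"), fused_name="qkv"):
    """Return a new state dict with split q/k/v weights stacked into one fused weight.

    Hypothetical helper: key names and q, k, v ordering are assumptions.
    """
    fused = dict(state_dict)  # shallow copy so the input mapping is left untouched
    blocks = [fused.pop(f"{prefix}.{name}.weight") for name in split_names]
    # Each weight is [out_features][in_features]; stacking rows concatenates along out_features.
    fused[f"{prefix}.{fused_name}.weight"] = [row for block in blocks for row in block]
    return fused


split = {
    "blocks.0.attention.to_q.weight": [[1, 0], [0, 1]],
    "blocks.0.attention.to_k.weight": [[2, 0], [0, 2]],
    "blocks.0.attention.to_v.weight": [[3, 0], [0, 3]],
    "blocks.0.attention.out.weight": [[4, 0], [0, 4]],
}
fused = fuse_qkv_state_dict(split, prefix="blocks.0.attention")
```

This covers base weights only. As noted earlier in the thread, the same concatenation does not carry over to LoRA factors at the same rank.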

@bghira

bghira commented Jun 20, 2026

Contributor

i'd asked for improvements to fusion of qkv projections before and was told they're just experimental and eg. not meant for training. in SimpleTuner, a lot of changes were made to this so that they became useful for training; the original split projections are entirely dropped from the model instead of left lying around like in the original helper logic. i only found out they're left around because when i first implemented a fused qkv training pipeline, the original matrices were still receiving LoRA target layers and being updated through those. it was quite confusing. then i found out the attn processors are splitting the fused layer and running attn calc and then re-fusing.

basically, that whole process is fragile and seemingly pointless enough in diffusers that it either should never be modified from how the original weights were distributed (eg. Flux2) or the whole ecosystem needs to change to match how Diffusers wants to split something.

for ideogram4 it means you shouldn't have changed things, and the Diffusers weights should be reverted on the hub back to their original state. i don't see why we want to do 6 calculations instead of just 2 (for sequential cond/uncond passes), or better yet, batching cond + uncond for models that are capable of it (eg. qwen-image) because there's about 10% of speed left on the table with them split just for that model on an H100 SXM5, even when the varlen FA3 kernel has to be used.

for qwen image we see 2400 compiled region launches over 60 blocks at 20 steps (2 passes per step) and that's a lot of calls. this is reduced down to just 20 calls with batched cond+uncond forward calls through a block stack function that stays compiled and RoPE left out of the mix (there's an awful cosine kernel that gets pulled into the graph and slows things down). and then there's the use of complex tensors that conflict with torch inductor which get left around and ... i figured you guys were testing for torch compile compat. but that's a separate thing.
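
The launch counts quoted above follow from simple arithmetic. The sketch below assumes one compiled region per block per pass in the unbatched case, and one compiled call per step for the whole block stack once cond and uncond are batched. That reading of the comment is an interpretation, not something measured here.

```python
blocks = 60
steps = 20
passes_per_step = 2  # sequential cond and uncond passes

# One compiled region launched per block, per pass, per step.
per_block_launches = blocks * steps * passes_per_step

# Batched cond+uncond through a single compiled block-stack call per step.
batched_stack_launches = steps * 1
```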

@dg845

dg845 commented Jun 20, 2026

Collaborator

I have opened an issue for supporting original checkpoint-compatible PEFT adapters at #14002, we can discuss that issue there.

@dg845

dg845 commented Jun 20, 2026

Collaborator

Similarly, I opened an issue about split Q,K,V vs fused QKV projections at #14003. We can continue the discussion of that issue there.

Labels

examples size/L PR with diff > 200 LOC

8 participants