Ideogram4 lora training #13861
|
Mean loss: 1.3919, min: 0.852, max: 2.25. is the loss intentionally this high? it's like training the audio branch with LTX2 |
6f8d6e9 to 0128816
def fuse_qkv_projections(self):
    # The attention already uses a single fused `qkv` projection, so there is nothing to fuse.
    raise NotImplementedError(
        "Ideogram4Transformer2DModel already uses a fused QKV projection (`attention.qkv`), "
        "so `fuse_qkv_projections()` is not applicable."
    )

def unfuse_qkv_projections(self):
    raise NotImplementedError(
        "Ideogram4Transformer2DModel uses a fused QKV projection that cannot be split, "
        "so `unfuse_qkv_projections()` is not applicable."
    )

these were removed, despite qkv now being split. can you re-add? |
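For readers following the fused vs. split question, here is a minimal sketch of why the two layouts are interchangeable at the weight level: stacking the q, k and v weights along the output dimension gives one projection whose output is the concatenation of the three separate outputs. Plain Python lists stand in for tensors, and the helper names (`linear`, `fuse_qkv`, `split_qkv`) are hypothetical, not diffusers API.

```python
# Illustrative only: lists stand in for tensors, helper names are hypothetical.

def linear(weight, x):
    """Compute y = W @ x for a weight stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

def fuse_qkv(w_q, w_k, w_v):
    """Stack the three weights along the output dimension (rows)."""
    return w_q + w_k + w_v

def split_qkv(w_qkv):
    """Inverse of fuse_qkv, assuming q, k and v have equal output sizes."""
    d = len(w_qkv) // 3
    return w_qkv[:d], w_qkv[d:2 * d], w_qkv[2 * d:]

w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[2.0, 1.0], [1.0, 2.0]]
w_v = [[0.5, -1.0], [3.0, 0.0]]
x = [1.0, 2.0]

w_qkv = fuse_qkv(w_q, w_k, w_v)

# One matmul on the fused weight ...
fused_out = linear(w_qkv, x)
# ... matches three matmuls on the split weights, concatenated.
split_out = linear(w_q, x) + linear(w_k, x) + linear(w_v, x)
```

The weights round-trip losslessly in either direction; what differs between the layouts is the parameter names, the tensor shapes, and the number of matmuls per attention call.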
|
@joangava did you test the script? it's not working. |
|
finally identified the issues;
|
|
|
e7c1205 to b94cd4c
575d8e3 to fd6a858
|
On documenting fp8 training: diffusers doesn't currently support training directly from the fp8 checkpoint (see my earlier note: #13861 (comment)). A working path atm can be to dequantize the SDNQ fp8 checkpoint to bf16 (via the |
|
but, guys, the split qkv question is still up in the air. it's in an inconsistent state right now, and "pin to a revision if you want to use the fused projections" isn't a great workaround. i'm not sure why the original weights had to be modified compulsively by the Diffusers team. multiple developers have asked about why this keeps being done, with no response given. it made LoRAs incompatible between Diffusers' version of ideogram4 weights/model code and literally everybody else's. why? if "it's just what we do", why wasn't flux.2 modified the same way? it's not even done reliably / consistently across model families. |
adding to that, please see here https://github.com/Comfy-Org/ComfyUI/blob/7d4194d984abbfcd49ec93a615b95327c031ac69/comfy/utils.py#L652 for an example in an inference tool to work around the qkv split. it has to be implemented again and again for each model: renaming layer keys is inconvenient but can be handled. However I agree that diffusers should not change the tensor shape from the officially released model by its creator |
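To make the LoRA compatibility point concrete, here is a rough sketch (not the linked ComfyUI code) of mapping a LoRA trained against a fused `qkv` layer onto split q/k/v layers. It assumes the usual low-rank form, where the weight delta is `B @ A`, and equal q/k/v output sizes. The function name and dictionary keys are hypothetical.

```python
# Illustrative only: lists stand in for tensors, names are hypothetical.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_fused_lora(lora_a, lora_b):
    """Map a fused-qkv LoRA onto q, k, v: A is shared, B is chunked by rows."""
    d = len(lora_b) // 3
    names = ("q", "k", "v")
    return {n: (lora_a, lora_b[i * d:(i + 1) * d]) for i, n in enumerate(names)}

# Rank-1 LoRA on a fused layer with input size 2 and q/k/v output size 2 each.
lora_a = [[1.0, 2.0]]                                     # shape (r, in)
lora_b = [[1.0], [0.0], [2.0], [-1.0], [0.5], [3.0]]      # shape (3 * d, r)

fused_delta = matmul(lora_b, lora_a)                      # shape (3 * d, in)

parts = split_fused_lora(lora_a, lora_b)
split_delta = []
for name in ("q", "k", "v"):
    a, b = parts[name]
    split_delta += matmul(b, a)                           # per-projection delta
```

The opposite direction is less tidy: three independent rank-r LoRAs on split q/k/v generally need a rank-3r fused adapter (stacked A matrices and a block-diagonal B) to be represented exactly, which is part of why the layout change is not free for the ecosystem.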
|
as for fp8 quantised loading w/ scaled mm it's rather simple, already implemented elsewhere, and reduces the compute required by almost half for mem-bw constrained cards like L40S and 4090. i don't think bf16 or nf4 are qualified replacements, nf4 is upcasting to bf16 matmuls. bf16 itself consumes more vram. even on a 5090 with fp8 scaled matmul for training, it's not a huge speed gain (more mem bw there) but we do see about 10gb drop in vram vs bf16 which is the difference between a bad batch size and a good one. |
|
The SDNQ FP8 model (or the 4-bit model) can be trained on directly with a simple and fast in-place conversion to the SDNQ Training format:

from sdnq.training import convert_sdnq_model_to_training

quantized_model = convert_sdnq_model_to_training(
    quantized_model,
    quantized_matmul_dtype="float8_e4m3fn",  # overrides the quantized matmul dtype to be different from the weights_dtype format
    use_grad_ckpt=True,  # disable this if you are not using gradient checkpointing
    use_quantized_matmul=False,  # use quantized matmul on the forward pass and the backward pass (False means no quantized matmul at all)
    use_stochastic_rounding=True,  # only used when you update the quantized model weights
    dequantize_fp32=True,  # keeps the quant scales in FP32 and computes the de-quant steps in FP32. Highly recommended to enable this option
)

One downside with this is that the SDNQ Training model cannot be saved with Safetensors because of the custom SDNQTensor used with training. This shouldn't matter for LoRA training; it is only relevant for full finetuning. |
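As background for the `use_stochastic_rounding` option, here is a generic sketch of stochastic rounding (not SDNQ's implementation). When a weight lives on a coarse grid, round-to-nearest discards any update smaller than half a grid step, while stochastic rounding keeps the update in expectation. The grid step, the update size and the function names are made up for illustration.

```python
import math
import random

def stochastic_round(value, step, rng):
    """Round to a multiple of `step`, rounding up with probability equal to
    the fractional distance to the next grid point."""
    scaled = value / step
    lower = math.floor(scaled)
    frac = scaled - lower
    return (lower + (1 if rng.random() < frac else 0)) * step

def nearest_round(value, step):
    """Plain round-to-nearest on the same grid."""
    return round(value / step) * step

rng = random.Random(0)          # fixed seed so the sketch is repeatable
weight, update, step = 1.0, 0.01, 0.25

# Round-to-nearest snaps the updated weight straight back to 1.0.
nearest = nearest_round(weight + update, step)

# Stochastic rounding lands on 1.0 or 1.25, averaging close to 1.01.
samples = [stochastic_round(weight + update, step, rng) for _ in range(10000)]
mean_stochastic = sum(samples) / len(samples)
```

This matters only when the quantized weights themselves are updated, which matches the note that it is relevant for full finetuning and not for LoRA training.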
Add examples/dreambooth/train_dreambooth_lora_ideogram4.py + README + requirements:
- DreamBooth LoRA training for Ideogram 4 (flow-matching, dual transformer, Qwen3-VL TE)
- nf4 QLoRA and SDNQ fp8 bases: --do_fp8_training trains the fp8 checkpoint in place (scaled matmul), or omit it to dequantize the SDNQ base to bf16
- --disable_training_autocast (Ideogram4's forward is corrupted by autocast)
- structured JSON caption support + --upsample_prompt
- model card with a validation-image gallery

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
a425b47 to 0c38f7e
|
Hi @bghira and @dxqb, thanks for your feedback. For the question of
If so, do you think (1) and/or (2) should be exposed as If not, what do you think a good |
|
i'd asked for improvements to fusion of qkv projections before and was told they're just experimental and eg. not meant for training. in SimpleTuner, there's a lot of changes made to this so that they became useful for training; the original split projections are entirely dropped from the model instead of left lying around like in the original helper logic. i only found out they're left around because when i did first implement fused qkv training pipeline, the original matrices were still receiving LoRA target layers and being updated through those. it was quite confusing. then i found out the attn processors are splitting the fused layer and running attn calc and then re-fusing.

basically, that whole process is fragile and seemingly pointless enough in diffusers that it either should never be modified from how the original weights were distributed (eg. Flux2) or the whole ecosystem needs to change to match how Diffusers wants to split something. for ideogram4 it means you shouldn't have changed things, and the Diffusers weights should be reverted on the hub back to their original state.

i don't see why we want to do 6 calculations instead of just 2 (for sequential cond/uncond passes), or better yet, batching cond + uncond for models that are capable of it (eg. qwen-image) because there's about 10% of speed left off the table with them split just for that model on an H100 SXM5, even when the varlen FA3 kernel has to be used.

for qwen image we see 2400 compiled region launches over 60 blocks at 20 steps (2 passes per step) and that's a lot of calls. for qwen image we see this reduced down to just 20 calls with batched cond+uncond forward calls through a block stack function that stays compiled and RoPE left out of the mix (there's an awful cosine kernel that gets pulled into the graph and slows things down). and then there's the use of complex tensors that conflict with torch inductor which get left around and ...
i figured you guys were testing for torch compile compat. but that's a separate thing. |
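The launch counts quoted above follow from simple multiplication. A quick check, assuming one compiled region per transformer block in the per-block case and a single compiled block-stack call per step in the batched case:

```python
blocks = 60            # transformer blocks, each its own compiled region
steps = 20             # sampling steps
passes_per_step = 2    # sequential cond + uncond forward passes

# Per-block compilation with sequential cond/uncond passes.
per_block_launches = blocks * steps * passes_per_step

# One compiled block stack and one batched cond+uncond pass per step.
stacked_launches = 1 * steps * 1

reduction_factor = per_block_launches // stacked_launches
```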
|
I have opened an issue for supporting original checkpoint-compatible PEFT adapters at #14002, we can discuss that issue there. |
|
Similarly, I opened an issue about split Q,K,V vs fused QKV projections at #14003. We can continue the discussion of that issue there. |
DreamBooth LoRA training script + Ideogram4 LoRA loader mixin.
--weighting_scheme/--logit_mean/--logit_std args (defaults set to the model's schedule).
Stacked on #13859.
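For context on the `--logit_mean`/`--logit_std` args, here is a minimal sketch of logit-normal timestep sampling as commonly used in flow-matching training: a normal sample is passed through a sigmoid to get a value in (0, 1). This is a standard-library illustration, not the script's code, and the values 0.0 and 1.0 below are placeholders, not the script's defaults.

```python
import math
import random

def sample_logit_normal(logit_mean, logit_std, rng):
    """Draw u in (0, 1) by applying a sigmoid to a normal sample."""
    z = rng.gauss(logit_mean, logit_std)
    return 1.0 / (1.0 + math.exp(-z))

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Placeholder parameters: mean 0.0 centres the samples around u = 0.5.
centred = [sample_logit_normal(0.0, 1.0, rng) for _ in range(5000)]

# A positive mean shifts the mass toward u = 1.
shifted = [sample_logit_normal(1.0, 1.0, rng) for _ in range(5000)]

mean_centred = sum(centred) / len(centred)
mean_shifted = sum(shifted) / len(shifted)
```

Raising the mean moves the sampled values toward 1 and a smaller standard deviation concentrates them around the centre, so these two args control which part of the schedule the training loss emphasises.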