Add Boogu-Image generation, editing, and turbo pipelines by Boogu-Team · Pull Request #14040 · huggingface/diffusers

Boogu-Team · 2026-06-22T15:55:13Z

What does this PR do?

Adds the Boogu-Image family of pipelines to diffusers:

BooguImagePipeline — text-to-image generation and instruction-based image editing.
BooguImageTurboPipeline — few-step DMD distilled generation.
fp8 quantized inference examples for both, targeting the published fp8 checkpoints.

The integration is purely additive: it introduces new files only and does not modify any existing upstream module.

Published checkpoints (Hugging Face Hub, Boogu/…): Boogu-Image-0.1-Base, Boogu-Image-0.1-Edit, Boogu-Image-0.1-Turbo, and their -fp8 variants.

Pipelines & model

| Component | File |
| --- | --- |
| Generation / edit pipeline | `src/diffusers/pipelines/boogu/pipeline_boogu.py` |
| Turbo (DMD few-step) pipeline | `src/diffusers/pipelines/boogu/pipeline_boogu_turbo.py` |
| Transformer backbone | `src/diffusers/models/transformers/transformer_boogu.py` |
| Attention processors | `src/diffusers/models/attention_processor_boogu.py` |
| RoPE | `src/diffusers/models/transformers/rope_boogu.py` |
| Image processor | `src/diffusers/pipelines/boogu/image_processor.py` |

Docs: docs/source/en/api/pipelines/boogu.md. Runnable examples: examples/boogu/ (base / edit / turbo + fp8, with README.md).

Convention compliance

Implemented against the repo's .ai rules:

Attention routed through dispatch_attention_fn (no direct F.scaled_dot_product_attention in the forward path); attention masks always materialized as [B, 1, 1, L] bool to stay bit-exact with the trained checkpoints under the native bf16 backend.
BooguImageTurboPipeline is a standalone DiffusionPipeline (not a subclass of BooguImagePipeline), with shared methods carried via # Copied from and kept in sync by make fix-copies.
No dead code paths, no silent except Exception fallbacks, no unused "API-consistency" parameters — training/ablation/prompt-tuning code from the research repo was removed; only the inference path is integrated.
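
To illustrate the mask convention from the first item, here is a minimal sketch that builds a `[B, 1, 1, L]` boolean padding mask from per-sample token counts. It uses plain nested lists so it runs without dependencies; in the pipeline this would be a `torch.bool` tensor. The function name `build_padding_mask` and the lengths-based input are assumptions made for illustration and are not taken from the PR.

```python
def build_padding_mask(lengths, max_len):
    """Return a nested list shaped [B][1][1][L]; True marks a real (non-padded) token."""
    if any(n < 0 or n > max_len for n in lengths):
        raise ValueError("each length must lie in [0, max_len]")
    # The two singleton axes let the mask broadcast over heads and query positions.
    return [[[[pos < n for pos in range(max_len)]]] for n in lengths]


mask = build_padding_mask([3, 5], max_len=5)
# The second sample has no padding, yet it still gets an all-True row
# rather than the mask being replaced by None.
print(mask[0][0][0])
print(mask[1][0][0])
```

The point of the sketch is the second sample: even when nothing is padded, a full mask is still produced, which is the behavior the PR describes as necessary to match the trained checkpoints.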

Verification

All 6 examples/boogu/ scripts run end-to-end and produce correct images (base / edit / turbo + fp8).
Every refactor verified bit-exact against the pre-refactor reference: CPU and GPU single-forward and end-to-end maxdiff = 0; checkpoints load strict (no missing/unexpected keys).
CI gates pass locally: ruff check, ruff format --check, make fix-copies (no diff), check_dummies.
Test suite under tests/pipelines/boogu/ and tests/models/transformers/test_models_transformer_boogu.py.

Notes for reviewers

BOG (Boosted Orthogonal Guidance) is kept inline as a documented public kwarg (use_boosted_orthogonal_guidance, default False) rather than moved into src/diffusers/guiders/, since the guiders framework is oriented at modular pipelines. Happy to relocate it if preferred.
_keep_in_fp32_modules is intentionally not set: forcing time_caption_embed to fp32 changes bf16 inference numerics, so we left it to the model author / reviewer's call.
The fp8 examples include a small DeepGEMM-disable shim with version branches that are load-bearing for transformers 5.10.x (the env var alone does not disable DeepGEMM there).

Required Hub-side checkpoint changes (no code change in this PR)

Important for reviewers: the published Boogu/Boogu-Image-0.1-* checkpoints on the Hub are still packaged for the old custom-remote-code loading path and will not load correctly against this PR's native diffusers classes until the Hub repos are updated. We will push these Hub-repo edits to land together with the PR. Diff below is Hub main → required state.

1. model_index.json — point component classes at diffusers, not the bundled remote-code modules.
On Hub main it currently is:

"scheduler":   ["scheduling_flow_match_euler_discrete_time_shifting", "FlowMatchEulerDiscreteScheduler"],
"transformer": ["transformer_boogu", "BooguImageTransformer2DModel"],

Both must become library refs so from_pretrained resolves the in-tree classes added by this PR:

"scheduler":   ["diffusers", "FlowMatchEulerDiscreteScheduler"],
"transformer": ["diffusers", "BooguImageTransformer2DModel"],

2. Delete the two remote-code shim modules (they are thin re-export stubs that import boogu … from the private research package and raise ModuleNotFoundError: boogu for any external user):

transformer/transformer_boogu.py
scheduler/scheduling_flow_match_euler_discrete_time_shifting.py

3. scheduler/scheduler_config.json — drop the legacy custom-scheduler keys. Hub main has:

{ "_class_name": "FlowMatchEulerDiscreteScheduler", "_diffusers_version": "0.33.1",
  "do_shift": true, "dynamic_time_shift": false, "time_shift_version": "v1",
  "seq_len": 4096, "num_train_timesteps": 1000 }

Remove do_shift, dynamic_time_shift, and time_shift_version: these drove the old custom scheduler's time shift, and the official scheduler ignores them. Keep seq_len: the pipeline's time-shift adapter reads scheduler.config.seq_len to compute the shift mu, so it is load-bearing. The final config therefore keeps _class_name, _diffusers_version, seq_len, and num_train_timesteps.

4. transformer/config.json — remove prompt_tuning_configs (the prompt-tuning subsystem was dropped from this integration; the key is otherwise ignored with a warning).

The same four edits apply to all six repos (Base, Edit, Turbo and their -fp8 variants). The mllm/, processor/, vae/ and weight files are unchanged.
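
The config edits in steps 1, 3, and 4 can be sketched as plain dictionary transformations. This is an illustrative sketch, not a script shipped with the PR: the helper names `patch_model_index` and `prune_config` are hypothetical, the input dicts contain only the keys quoted above (a real model_index.json has more entries), and the transformer config is a placeholder because its other keys are not shown here. Step 2 (deleting the two shim modules) is a file removal and is not covered.

```python
import copy
import json

# Keys that drove the old custom scheduler and are listed for removal in step 3.
LEGACY_SCHEDULER_KEYS = ("do_shift", "dynamic_time_shift", "time_shift_version")


def patch_model_index(model_index):
    """Point the scheduler and transformer entries at the diffusers library (step 1)."""
    patched = copy.deepcopy(model_index)
    for component in ("scheduler", "transformer"):
        _old_module, class_name = patched[component]
        patched[component] = ["diffusers", class_name]
    return patched


def prune_config(config, keys_to_drop):
    """Return a copy of a config dict without the given keys (steps 3 and 4)."""
    return {key: value for key, value in config.items() if key not in keys_to_drop}


# Values copied from the Hub-main snippets quoted above.
model_index = {
    "scheduler": ["scheduling_flow_match_euler_discrete_time_shifting", "FlowMatchEulerDiscreteScheduler"],
    "transformer": ["transformer_boogu", "BooguImageTransformer2DModel"],
}
scheduler_config = {
    "_class_name": "FlowMatchEulerDiscreteScheduler",
    "_diffusers_version": "0.33.1",
    "do_shift": True,
    "dynamic_time_shift": False,
    "time_shift_version": "v1",
    "seq_len": 4096,
    "num_train_timesteps": 1000,
}

new_model_index = patch_model_index(model_index)
new_scheduler_config = prune_config(scheduler_config, LEGACY_SCHEDULER_KEYS)
# Placeholder transformer config: only the key to be removed is known from the PR text.
new_transformer_config = prune_config({"prompt_tuning_configs": None}, ("prompt_tuning_configs",))

print(json.dumps(new_model_index, indent=2))
print(json.dumps(new_scheduler_config, indent=2))
```

Both helpers return new dicts and leave their inputs untouched, which makes it easy to diff the before and after state for each of the six repos.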

Fixes # (issue)

Before submitting

Did you use an AI agent (Claude Code, Codex, Cursor, etc.) to help with this PR? If so:
- Did you read the Coding with AI agents guide?
- Did you self-review the diff against .ai/review-rules.md?
Did you read the contributor guideline?
Did you read our philosophy doc? (important for complex PRs)
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?
Are you the author (or part of the team) of the model/pipeline (only applicable for model/pipeline related PRs)?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Integrate the Boogu-Image model into diffusers:
- Models: BooguImageTransformer2DModel, PromptEmbedding, Boogu attention processors, Lumina2 blocks, and rotary embeddings.
- Pipelines: BooguImagePipeline (text-to-image and instruction editing) and BooguImageTurboPipeline (DMD few-step text-to-image).
- Scheduler: flow-match Euler scheduler with training-aligned time shifting.
- Internal utils: TaylorSeer cache, TeaCache params, DPM cache helpers, and optional Triton fused RMSNorm.
- Loading: resolve published checkpoints' custom module names to the integrated classes via module aliases, so from_pretrained needs no trust_remote_code.
- Docs and runnable examples under docs/ and examples/boogu/.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Drop the Boogu-only TaylorSeer caching feature, which was only half-removed in the working tree (left dangling `enable_taylorseer` references that raised NameError, and collaterally deleted the TeaCache `__init__` setup so the transformer raised AttributeError on `enable_teacache`).
- transformer_boogu.py: remove the remaining TaylorSeer branches; restore the TeaCache init block (enable_teacache, enable_teacache_for_all_layers, teacache_rel_l1_thresh, teacache_params, rescale_func) and the numpy / TeaCacheParams imports it needs.
- pipeline_boogu.py: drop the cache_init import, the enable_taylorseer plumbing and per-condition cache_dic/current branches, collapsing each `if enable_taylorseer / elif enable_teacache` into a plain `if enable_teacache`.
- Delete cache_functions/ and taylorseer_utils/ (Boogu-added, TaylorSeer-only, now unreferenced). The upstream hooks-based TaylorSeerCacheConfig is untouched.
- Remove BOOGU_INTEGRATION.md (ephemeral integration notes); add an environment install link to examples/boogu/README.md.

The pipeline uses the official FlowMatchEulerDiscreteScheduler via the thin BooguFlowMatchEulerDiscreteScheduler subclass (reuses the parent step).

Tests: test_models_transformer_boogu (15 passed) and test_boogu (20 passed) green; check_copies and check_dummies pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… adapter

Replace the BooguFlowMatchEulerDiscreteScheduler subclass with the official FlowMatchEulerDiscreteScheduler plus a standalone set_flow_match_timesteps adapter that applies Boogu's training-aligned static v1 time shift and 0->1 sigma schedule, reusing the parent's exponential shift formula.
- Add pipelines/boogu/flow_match_boogu.py with set_flow_match_timesteps
- Route the flow-match branch of retrieve_timesteps through the adapter (annotated "# Adapted from" to reflect the intentional divergence)
- Update pipeline/test type hints and imports to the official scheduler
- Drop the scheduler subclass and its registrations (schedulers/__init__, top-level __init__, dummy_pt_objects)

Numerically equivalent to the old subclass (max diff ~6e-08). The boogu test suite shows no regression vs the pre-change tree (same 11 pre-existing MLLM device-placement failures, 19 passed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
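
For readers unfamiliar with the exponential shift mentioned in this commit, the sketch below applies the formula exp(mu) / (exp(mu) + (1/t - 1)^sigma) to an increasing 0 -> 1 grid. It is not the PR's adapter: the helper names are hypothetical, the value mu = 1.0 is made up (the way the pipeline derives mu from seq_len is not shown in this page), and the grid skips t = 0 because 1/t is undefined there.

```python
import math


def time_shift_exponential(mu, sigma, t):
    """Exponential time shift: exp(mu) / (exp(mu) + (1/t - 1) ** sigma)."""
    return math.exp(mu) / (math.exp(mu) + (1.0 / t - 1.0) ** sigma)


def shifted_schedule(num_steps, mu):
    """Shift an increasing grid on (0, 1]; t = 0 is skipped because 1/t is undefined."""
    grid = [i / num_steps for i in range(1, num_steps + 1)]
    return [time_shift_exponential(mu, 1.0, t) for t in grid]


# Hypothetical mu, chosen only to make the effect of the shift visible.
schedule = shifted_schedule(num_steps=4, mu=1.0)
print(schedule)
```

With mu = 0 the shift is the identity, and a positive mu pushes every interior point toward 1 while leaving the endpoint t = 1 fixed.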

…writer

Reduce the boogu pipeline package from 7 files to 4 by removing dead and misplaced code, keeping the default T2I/TI2I inference path unchanged.
- Inline set_flow_match_timesteps into pipeline_boogu.py (single caller) and delete flow_match_boogu.py, per the "inline single-caller helpers" rule.
- Replace the image_processor.preprocess override (which duplicated the parent VaeImageProcessor wholesale) with a thin override that only derives the Boogu max_pixels/max_side_length target size, then delegates to the parent. Verified bit-identical output across sizes/constraints (max diff 0.0).
- Remove BooguImageLoraLoaderMixin / lora_pipeline.py: LoRA is unused on the inference path, and the mixin belongs in loaders/ by diffusers convention.
- Remove the instruction-rewriter feature entirely (static_skills.py, instruct_reasoner_static_skills.py, and ~1100 lines of rewriter methods, state, and public kwargs). It was gated by use_rewrite_text_instruction (default False) and unused by every example/test; the skills files were its only consumers.

Net: -2255 / +74 lines. End-to-end TI2I inference reproduces the standalone reference (mean pixel diff 8.8, unchanged from before), and the boogu test suite shows the same pre-existing baseline (11 failed / 19 passed / 4 skipped, the 11 being unrelated MLLM device-placement failures). check_copies and check_dummies pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Both edit examples ran with no negative prompt. At text_guidance_scale=4.0 the model guides away from the negative instruction, so omitting it left the output oversaturated and under-stylized (style transfer barely applied). Add the standard negative prompt used by the reference inference so the colored-pencil style conversion comes through. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…tions)

Self-review against .ai/{AGENTS,models,pipelines,review-rules}.md surfaced a batch of mechanical issues fixed here (no behavior change on the default path; boogu test suite unchanged at 16/53/7, identical failure set; the remaining failures are a pre-existing MLLM cpu/cuda device-placement issue).

pipeline_boogu.py:
- Remove dead helpers: _project, _sigmoid_kernel, _softmax_kernel, the non-newton-schulz bog_norm branches, MomentumRollingSum._append_and_save (+ now-unused pathlib import).
- Drop unused __call__ params verbose and callback_on_step_end_tensor_inputs, a bare `latents.shape[0]` expression, and several commented-out code blocks.
- Replace all print() with module logger; drop emoji/blank-line prints.

pipeline_boogu_turbo.py:
- Add module logger; replace the inference print() with logger.info.

transformer_boogu.py:
- Default attention to the SDPA processor instead of selecting it from an os.getenv("device") read at __init__ (non-standard, and forced flash in fp32); drop the now-unused Flash2Varlen imports and the single-stream block alias.
- Replace np.poly1d TeaCache rescale with inline Horner eval; drop numpy import.
- Fix _no_split_modules / _repeated_blocks (remove the alias string that never matched __class__.__name__ and the invalid "nn.Embedding" entry).
- Give PromptEmbedding flat @register_to_config kwargs so from_pretrained round-trips; remove its non-standard from_config override.
- Remove dead self.layers, enable_teacache_for_all_layers, a commented-out param, a discarded dict lookup, and a stale section comment.

attention_processor_boogu.py:
- Remove no-op `layer = layer.to(device)` loops (rebind a local, never move the module) plus the bare shape expressions and commented debug lines above them.

image_processor.py:
- Guard get_new_height_width against None max_pixels / max_side_length (previously TypeError / UnboundLocalError when called with defaults); output is bit-identical when both constraints are set. Sync the class docstring to the actual __init__ signature.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
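
The "inline Horner eval" that replaces np.poly1d in this commit can be illustrated with a short dependency-free sketch. The function name `horner_eval` and the coefficients below are hypothetical; the actual TeaCache rescale coefficients are not shown in this page. Coefficients are ordered from the highest degree to the constant term, the same ordering np.poly1d uses.

```python
def horner_eval(coefficients, x):
    """Evaluate a polynomial at x; coefficients run from highest degree to constant term."""
    result = 0.0
    for coefficient in coefficients:
        # Each step multiplies the running value by x and adds the next coefficient.
        result = result * x + coefficient
    return result


# Hypothetical coefficients for 2x^3 - 3x^2 + 0.5x + 1 (not the PR's TeaCache values).
coefficients = [2.0, -3.0, 0.5, 1.0]
value = horner_eval(coefficients, 2.0)
print(value)  # 6.0
```

Horner's scheme needs one multiply and one add per coefficient, so it removes the numpy dependency without changing the polynomial being evaluated.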

No released Boogu checkpoint ships a PromptEmbedding / prompt-tuning subfolder, so the prompt-tuning path is never exercised by a published model. Per .ai/AGENTS.md ("only keep the inference path you are actually integrating"), remove it entirely:
- Delete PromptEmbedding (transformer_boogu.py), BooguImagePromptTuningPipeline (pipeline_boogu.py), and BooguImagePromptTuningRotaryPosEmbed (rope_boogu.py).
- Drop the model's unused prompt_tuning_configs config arg, the pipeline's prompt_embedding attribute + set_prompt_embedding(), and the use_prompt_tuning_embedding branch of _get_instruction_feature_embeds (the normal VLM-encoding path is unchanged). The now-orphaned has_offload_strategy / _module_execution_device helpers go with it.
- Remove the PromptEmbedding registrations (lazy import structure, top-level export, dummy object).

Removing BooguImagePromptTuningPipeline also drops 2 of the 4 except-Exception fallback blocks (the other 2, in BooguImagePipeline, are handled separately).

Verified: cached checkpoint transformer loads with no missing/unexpected keys (prompt_tuning_configs in config.json is now harmlessly ignored); import + ruff clean; no orphaned references remain.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

_get_instruction_feature_embeds wrapped the single-layer MLLM call in try output_hidden_states=False / except -> output_hidden_states=True and hidden_states[-1]. Both paths return the same tensor (.last_hidden_state == .hidden_states[-1]), so the except branch only masked real errors behind a UserWarning. Per .ai/AGENTS.md ("raise a concise error for unsupported cases rather than adding complex fallback logic"), call the single path unconditionally and let genuine failures surface. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Per .ai/models.md, attention processors must use dispatch_attention_fn rather than calling F.scaled_dot_product_attention / flash_attn_varlen_func directly. Rewrite the two live processors (single-stream BooguImageAttnProcessor and double-stream BooguImageDoubleStreamSelfAttnProcessor) to feed (B, L, H, D) tensors to dispatch_attention_fn with _attention_backend / _parallel_config, and delete the two dead *Flash2Varlen classes and their _upad_input helpers (no longer instantiated; varlen unpadding is handled inside the dispatcher). File shrinks 1128 -> 383 lines. State_dict keys are unchanged: the double-stream QKV/out projections stay on the processor module (...processor.img_to_q / instruct_to_q / img_out / instruct_out), so published checkpoints load strictly with no remapping. The attention mask is always materialized as a [B, 1, 1, L] bool mask (never dropped to None when no token is padded): the native backend rounds bf16 differently on its masked vs no-mask paths, and matching the trained behavior keeps output bit-identical to the pre-refactor pipeline. Verified bit-exact (maxdiff 0.0): CPU tiny-model forward, GPU bf16 single forward, and GPU end-to-end base / edit / turbo. Checkpoint loads strict; pytest suite unchanged at 16/53/7. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Per .ai/pipelines.md gotcha #4, a pipeline variant must be its own class with a duplicated __call__ rather than subclassing another pipeline in core src/ (the flux / sdxl / wan / qwenimage convention). BooguImageTurboPipeline previously subclassed BooguImagePipeline and overrode processing() with a DMD branch.

Reparent it to DiffusionPipeline and give it its own pure-T2I DMD __call__: the setup (device management, encode_instruction, prepare_image, prepare_latents, RoPE) mirrors the parent's T2I path, then runs the DMD predict/renoise loop and decode directly, which is byte-for-byte the same computation the old processing() DMD branch performed. The DMD path takes no scheduler, reference images, or classifier-free guidance, so the negative / empty / BOG / cfg kwargs are dropped from the turbo signature.

Shared utilities (encode_instruction, prepare_latents, prepare_image, predict, device management, the guidance-scale properties, …) are carried as `# Copied from diffusers.pipelines.boogu.pipeline_boogu.BooguImagePipeline.<method>` so make fix-copies keeps them in sync.

Verified: end-to-end turbo output is bit-identical to the pre-change subclass (maxdiff 0.0); base / edit unaffected (also 0.0); check_copies consistent; ruff clean; pytest suite unchanged at 16/53/7.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Self-review round 2 against .ai rules, after the four structural refactors. No numerical change: CPU and GPU end-to-end (base/edit/turbo) A/B stay bit-identical (maxdiff 0.0); pytest suite unchanged at 16/53/7.

Dead code removed:
- MASK_VISION_TOKENS_FEATURE / VISION_TOKEN_IDs and their truncation branch (no public API ever sets them) plus the now-unused input_ids local.
- base_sequence_length parameter and its proportional-attention branch from both attention processors (never passed by the transformer); drops the math import.
- BooguImageRotaryPosEmbed reduced to the only thing used, the static get_freqs_cis, dropping its dead __init__/_get_freqs_cis/forward (the transformer uses BooguImageDoubleStreamRotaryPosEmbed; the pipeline only calls the static method).
- Commented-out guidance formula and the `+ +` unary-plus typos in the triple guidance combination; stale docstrings (a "LoRA loading" mention with no LoRA, a reference to an internal training dataset class, a "may not be actually used" development note).

Correctness / convention:
- assert -> raise ValueError in the transformer / rope / attention forward paths (asserts are stripped under python -O).
- _validate_device_format now relies on the validator's own raise instead of returning an ignored bool.
- MomentumRollingSum states are only constructed when boosted orthogonal guidance is enabled.
- encode_instruction return annotation corrected (it returns six values).
- BooguImageTransformerTesterConfig inherits BaseModelTesterConfig (gives it model_split_percents etc., matching the other transformer tests).
- examples: edit / edit_fp8 raise a clear error if base.png is missing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
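
The assert -> raise ValueError change follows a general Python rule: assert statements are removed when the interpreter runs with -O, so input validation written as an assert silently disappears. A minimal sketch of the pattern follows; the function `check_head_dim` and its check are hypothetical examples and are not code from the PR.

```python
def check_head_dim(hidden_size, num_heads):
    """Return the per-head dimension, raising ValueError on an invalid split.

    An equivalent `assert hidden_size % num_heads == 0` would be stripped under
    `python -O`; an explicit raise is always executed.
    """
    if hidden_size % num_heads != 0:
        raise ValueError(
            f"hidden_size ({hidden_size}) must be divisible by num_heads ({num_heads})"
        )
    return hidden_size // num_heads


head_dim = check_head_dim(64, 8)
print(head_dim)  # 8
```

Raising ValueError also gives callers a specific exception type to catch, where a failed assert would raise AssertionError.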

Collapse statements that fit on one line after the previous cleanup, so `make style` / `ruff format --check` is clean for the PR. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The instruct_reasoner_static_skills.py prompt-template module was removed during cleanup; its per-file ruff ignore in pyproject.toml pointed at a file that no longer exists. Remove the dead entry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

yiyixuxu

thanks a lot for the PR and the very thoughtful self-review :)
i left some comments/questions

The triton fused-RMSNorm / flash-attn SwiGLU paths were gated behind an `os.getenv("device")` guard that defaulted to "cpu", so the published inference path always fell back to torch.nn.RMSNorm and a torch SwiGLU. Remove the unused ops/triton kernels (1261 lines) and ops/simple_layer_norm, drop the dead env-guard in block_lumina2, and the now-unused is_triton_available helper. Numerically identical to the default path; addresses reviewer feedback (single-file convention prep + perf-path removal). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

diffusers follows a one-model-one-file convention. Merge the Boogu model's helper modules into transformer_boogu.py:
- rope_boogu.py -> RoPE section
- block_lumina2.py -> norm / feed-forward / embedding section
- attention_processor_boogu.py -> attention-processor section

Update the two pipelines and the transformer test to import BooguImageRotaryPosEmbed from transformer_boogu. Pure code relocation: the class bodies are unchanged, so checkpoints load identically and base/edit/turbo remain bit-exact (verified end-to-end on GPU). Addresses reviewer single-file convention feedback.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

A pipeline subclass should only carry pipeline-specific steps; device placement, offloading, and component registration belong to DiffusionPipeline. Remove the custom devices_manager / set_mllm / set_transformer / set_processor / set_scheduler / _validate_device_format / _check_device_strategy_validity methods, the enable_*_offload_flag / user_set_pipe_device state, and the now-unused validator_utils helper. __call__ resolves the device via the base class's _execution_device and drops its redundant `device=` kwarg; the mllm lm_head stripping stays in __init__. This also makes the inherited to()/enable_*_offload tests pass (previously 16-17 device/offload failures, now 0). Addresses reviewer feedback on pipeline-subclass responsibilities. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Boogu-Team · 2026-06-23T07:12:55Z

> thanks a lot for the PR and the very thoughtful self-review :)
> i left some comments/questions

Thank you for the quick and thoughtful review, @yiyixuxu — much appreciated! 🙏

I've addressed all the comments and pushed the changes (commits 9e672c2, d202a23, 5cef903):

single-file convention: merged the model's helper modules (rope / blocks / attention processors) into transformer_boogu.py.
triton fused RMSNorm: removed; the published path always fell back to torch.nn.RMSNorm, so the result is numerically identical.
pipeline-subclass responsibilities: dropped all the custom device / offload / component-setter infrastructure from both pipelines; they now rely on DiffusionPipeline (.to() / _execution_device / enable_*_offload). This also made the previously failing device/offload tests pass.

I've replied inline on each thread with the specifics and resolved them.
Happy to iterate further — thanks again!

Boogu-Team and others added 13 commits June 18, 2026 15:51

Boogu: apply ruff format

8952e6d

github-actions bot added the size/L (PR with diff > 200 LOC), documentation (Improvements or additions to documentation), models, tests, utils, pipelines, examples, and schedulers labels, and removed the size/L label, on Jun 22, 2026

yiyixuxu reviewed Jun 23, 2026

yiyixuxu mentioned this pull request Jun 23, 2026

[.ai] document single-file model layout and "don't reimplement Diffus… #14048

Merged

Boogu-Team and others added 3 commits June 23, 2026 03:12

github-actions Bot added the size/L PR with diff > 200 LOC label Jun 23, 2026

Boogu-Team wants to merge 16 commits into huggingface:main from Boogu-Team:feat/integrate-boogu
