Add Boogu-Image generation, editing, and turbo pipelines#14040

Open
Boogu-Team wants to merge 16 commits into
huggingface:mainfrom
Boogu-Team:feat/integrate-boogu

Boogu-Team commented Jun 22, 2026

What does this PR do?

Adds the Boogu-Image family of pipelines to diffusers:

  • BooguImagePipeline — text-to-image generation and instruction-based image editing.
  • BooguImageTurboPipeline — few-step DMD distilled generation.
  • fp8 quantized inference examples for both, targeting the published fp8 checkpoints.

The integration is purely additive: it introduces new files only and does not modify any existing upstream module.

Published checkpoints (Hugging Face Hub, Boogu/…): Boogu-Image-0.1-Base, Boogu-Image-0.1-Edit, Boogu-Image-0.1-Turbo, and their -fp8 variants.

Pipelines & model

| Component | File |
| --- | --- |
| Generation / edit pipeline | src/diffusers/pipelines/boogu/pipeline_boogu.py |
| Turbo (DMD few-step) pipeline | src/diffusers/pipelines/boogu/pipeline_boogu_turbo.py |
| Transformer backbone | src/diffusers/models/transformers/transformer_boogu.py |
| Attention processors | src/diffusers/models/attention_processor_boogu.py |
| RoPE | src/diffusers/models/transformers/rope_boogu.py |
| Image processor | src/diffusers/pipelines/boogu/image_processor.py |

Docs: docs/source/en/api/pipelines/boogu.md. Runnable examples: examples/boogu/ (base / edit / turbo + fp8, with README.md).

Convention compliance

Implemented against the repo's .ai rules:

  • Attention routed through dispatch_attention_fn (no direct F.scaled_dot_product_attention in the forward path); attention masks always materialized as [B, 1, 1, L] bool to stay bit-exact with the trained checkpoints under the native bf16 backend.
  • BooguImageTurboPipeline is a standalone DiffusionPipeline (not a subclass of BooguImagePipeline), with shared methods carried via # Copied from and kept in sync by make fix-copies.
  • No dead code paths, no silent except Exception fallbacks, no unused "API-consistency" parameters — training/ablation/prompt-tuning code from the research repo was removed; only the inference path is integrated.

Verification

  • All 6 examples/boogu/ scripts run end-to-end and produce correct images (base / edit / turbo + fp8).
  • Every refactor verified bit-exact against the pre-refactor reference: CPU and GPU single-forward and end-to-end maxdiff = 0; checkpoints load strict (no missing/unexpected keys).
  • CI gates pass locally: ruff check, ruff format --check, make fix-copies (no diff), check_dummies.
  • Test suite under tests/pipelines/boogu/ and tests/models/transformers/test_models_transformer_boogu.py.
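
The bit-exact criterion used in the list above (maxdiff = 0) can be sketched without any tensor library. This is an illustrative helper, not the script used for this PR: `max_abs_diff` and `is_bit_exact` are hypothetical names, and flat Python lists stand in for flattened tensors.

```python
import struct


def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(reference) != len(candidate):
        raise ValueError("sequences must have the same length")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def is_bit_exact(reference, candidate):
    """True only if every pair of values has the same float64 bit pattern."""
    if len(reference) != len(candidate):
        return False
    return all(
        struct.pack("<d", r) == struct.pack("<d", c)
        for r, c in zip(reference, candidate)
    )


reference = [0.25, -1.5, 3.0]
same = [0.25, -1.5, 3.0]
drifted = [0.25, -1.5, 3.0 + 2.0**-20]  # one element perturbed on purpose

print(max_abs_diff(reference, same), is_bit_exact(reference, same))
print(max_abs_diff(reference, drifted), is_bit_exact(reference, drifted))
```

A maxdiff of exactly 0.0 is the stricter of the two common acceptance criteria; a small nonzero value such as 1e-7 would pass a tolerance check but would not count as bit-exact. Comparing bit patterns is stricter still, since it also distinguishes 0.0 from -0.0.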

Notes for reviewers

  • BOG (Boosted Orthogonal Guidance) is kept inline as a documented public kwarg (use_boosted_orthogonal_guidance, default False) rather than moved into src/diffusers/guiders/, since the guiders framework is oriented toward modular pipelines. Happy to relocate it if preferred.
  • _keep_in_fp32_modules is intentionally not set: forcing time_caption_embed to fp32 changes bf16 inference numerics, so we left the choice to the model author or reviewer.
  • The fp8 examples include a small DeepGEMM-disable shim with version branches that are load-bearing for transformers 5.10.x (the env var alone does not disable DeepGEMM there).

Required Hub-side checkpoint changes (no code change in this PR)

Important for reviewers: the published Boogu/Boogu-Image-0.1-* checkpoints on the Hub are still packaged for the old custom-remote-code loading path and will not load correctly against this PR's native diffusers classes until the Hub repos are updated. We will push these Hub-repo edits so that they land together with the PR. Each item below shows the current state on Hub main followed by the required state.

1. model_index.json — point component classes at diffusers, not the bundled remote-code modules.
On Hub main it currently is:

"scheduler":   ["scheduling_flow_match_euler_discrete_time_shifting", "FlowMatchEulerDiscreteScheduler"],
"transformer": ["transformer_boogu", "BooguImageTransformer2DModel"],

Both must become library refs so from_pretrained resolves the in-tree classes added by this PR:

"scheduler":   ["diffusers", "FlowMatchEulerDiscreteScheduler"],
"transformer": ["diffusers", "BooguImageTransformer2DModel"],

2. Delete the two remote-code shim modules (they are thin re-export stubs that import boogu … from the private research package and raise ModuleNotFoundError: boogu for any external user):

  • transformer/transformer_boogu.py
  • scheduler/scheduling_flow_match_euler_discrete_time_shifting.py

3. scheduler/scheduler_config.json — drop the legacy custom-scheduler keys. Hub main has:

{ "_class_name": "FlowMatchEulerDiscreteScheduler", "_diffusers_version": "0.33.1",
  "do_shift": true, "dynamic_time_shift": false, "time_shift_version": "v1",
  "seq_len": 4096, "num_train_timesteps": 1000 }

Remove do_shift, dynamic_time_shift, and time_shift_version: these drove the old custom scheduler's time shift, and the official scheduler ignores them. Keep seq_len, which is load-bearing: the pipeline's time-shift adapter reads scheduler.config.seq_len to compute the shift mu. The final config therefore keeps only _class_name, _diffusers_version, seq_len, and num_train_timesteps.

4. transformer/config.json — remove prompt_tuning_configs (the prompt-tuning subsystem was dropped from this integration; the key is otherwise ignored with a warning).

The same four edits apply to all six repos (Base, Edit, Turbo and their -fp8 variants). The mllm/, processor/, vae/ and weight files are unchanged.
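
The JSON edits in items 1, 3, and 4 above can be sketched as plain dictionary transforms. This is a minimal illustration, not the tooling the team will use: the function names are hypothetical, and item 2 (deleting the two shim modules) is a file deletion that is not covered here.

```python
import copy
import json

LEGACY_SCHEDULER_KEYS = ("do_shift", "dynamic_time_shift", "time_shift_version")


def patch_model_index(model_index):
    """Item 1: point the scheduler and transformer entries at the diffusers library."""
    patched = copy.deepcopy(model_index)
    for component in ("scheduler", "transformer"):
        _old_module, class_name = patched[component]
        patched[component] = ["diffusers", class_name]
    return patched


def patch_scheduler_config(config):
    """Item 3: drop the legacy time-shift keys and keep everything else, including seq_len."""
    return {k: v for k, v in config.items() if k not in LEGACY_SCHEDULER_KEYS}


def patch_transformer_config(config):
    """Item 4: drop prompt_tuning_configs if present."""
    return {k: v for k, v in config.items() if k != "prompt_tuning_configs"}


# Values copied from the Hub main state quoted above.
model_index = {
    "scheduler": ["scheduling_flow_match_euler_discrete_time_shifting", "FlowMatchEulerDiscreteScheduler"],
    "transformer": ["transformer_boogu", "BooguImageTransformer2DModel"],
}
scheduler_config = {
    "_class_name": "FlowMatchEulerDiscreteScheduler",
    "_diffusers_version": "0.33.1",
    "do_shift": True,
    "dynamic_time_shift": False,
    "time_shift_version": "v1",
    "seq_len": 4096,
    "num_train_timesteps": 1000,
}

print(json.dumps(patch_model_index(model_index), indent=2))
print(json.dumps(patch_scheduler_config(scheduler_config), indent=2))
```

Each function returns a new dictionary and leaves its input untouched, so the same script can be run against all six repos and the before/after states compared.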


Fixes # (issue)

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Boogu-Team and others added 13 commits June 18, 2026 15:51
Integrate the Boogu-Image model into diffusers:

- Models: BooguImageTransformer2DModel, PromptEmbedding, Boogu attention
  processors, Lumina2 blocks, and rotary embeddings.
- Pipelines: BooguImagePipeline (text-to-image and instruction editing) and
  BooguImageTurboPipeline (DMD few-step text-to-image).
- Scheduler: flow-match Euler scheduler with training-aligned time shifting.
- Internal utils: TaylorSeer cache, TeaCache params, DPM cache helpers, and
  optional Triton fused RMSNorm.
- Loading: resolve published checkpoints' custom module names to the integrated
  classes via module aliases, so from_pretrained needs no trust_remote_code.
- Docs and runnable examples under docs/ and examples/boogu/.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
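
The commit above mentions resolving the checkpoints' custom module names "via module aliases" without showing the mechanism. One generic way to alias a module name in Python is to register an existing module object under a second key in `sys.modules`. The sketch below shows only that general technique; whether the commit did exactly this is an assumption, and `register_alias` and the stand-in module are hypothetical.

```python
import importlib
import sys
import types

# Stand-in for the in-tree module that defines the integrated class.
in_tree = types.ModuleType("in_tree_transformer_boogu")


class BooguImageTransformer2DModel:
    """Placeholder class; the real model lives in diffusers."""


in_tree.BooguImageTransformer2DModel = BooguImageTransformer2DModel


def register_alias(legacy_name, module):
    """Make `import legacy_name` return `module`, refusing to clobber a different module."""
    existing = sys.modules.get(legacy_name)
    if existing is not None and existing is not module:
        raise ValueError(f"{legacy_name!r} is already bound to another module")
    sys.modules[legacy_name] = module


register_alias("transformer_boogu", in_tree)

# The import system consults sys.modules first, so the legacy name now resolves.
resolved = importlib.import_module("transformer_boogu")
print(resolved.BooguImageTransformer2DModel.__name__)
```

Aliasing of this kind keeps old configs loadable but mutates global interpreter state, which is one reason the PR description instead asks for the Hub configs to reference the library classes directly.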
Drop the Boogu-only TaylorSeer caching feature, which was only half-removed
in the working tree (left dangling `enable_taylorseer` references that raised
NameError, and collaterally deleted the TeaCache `__init__` setup so the
transformer raised AttributeError on `enable_teacache`).

- transformer_boogu.py: remove the remaining TaylorSeer branches; restore the
  TeaCache init block (enable_teacache, enable_teacache_for_all_layers,
  teacache_rel_l1_thresh, teacache_params, rescale_func) and the numpy /
  TeaCacheParams imports it needs.
- pipeline_boogu.py: drop the cache_init import, the enable_taylorseer plumbing
  and per-condition cache_dic/current branches, collapsing each
  `if enable_taylorseer / elif enable_teacache` into a plain `if enable_teacache`.
- Delete cache_functions/ and taylorseer_utils/ (Boogu-added, TaylorSeer-only,
  now unreferenced). The upstream hooks-based TaylorSeerCacheConfig is untouched.
- Remove BOOGU_INTEGRATION.md (ephemeral integration notes); add an environment
  install link to examples/boogu/README.md.

The pipeline uses the official FlowMatchEulerDiscreteScheduler via the thin
BooguFlowMatchEulerDiscreteScheduler subclass (reuses the parent step).

Tests: test_models_transformer_boogu (15 passed) and test_boogu
(20 passed) green; check_copies and check_dummies pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… adapter

Replace the BooguFlowMatchEulerDiscreteScheduler subclass with the official
FlowMatchEulerDiscreteScheduler plus a standalone set_flow_match_timesteps
adapter that applies Boogu's training-aligned static v1 time shift and 0->1
sigma schedule, reusing the parent's exponential shift formula.

- Add pipelines/boogu/flow_match_boogu.py with set_flow_match_timesteps
- Route the flow-match branch of retrieve_timesteps through the adapter
  (annotated "# Adapted from" to reflect the intentional divergence)
- Update pipeline/test type hints and imports to the official scheduler
- Drop the scheduler subclass and its registrations
  (schedulers/__init__, top-level __init__, dummy_pt_objects)

Numerically equivalent to the old subclass to within float32 rounding (max
diff ~6e-08), though not strictly bit-identical. The boogu test suite shows no
regression vs the pre-change tree (same 11 pre-existing MLLM device-placement
failures, 19 passed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
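
The exponential shift referred to in the commit above can be sketched in plain Python. The formula is the exponential time shift as I understand it from the official FlowMatchEulerDiscreteScheduler, exp(mu) / (exp(mu) + (1 / t - 1) ** sigma), rewritten so the endpoints need no special-casing. How Boogu derives mu from seq_len is not shown in this PR, so the mu value below is a made-up placeholder, and `shifted_schedule` is a hypothetical helper rather than the adapter itself.

```python
import math


def time_shift_exponential(mu, sigma, t):
    """Exponential time shift for t in [0, 1].

    For 0 < t < 1 this equals exp(mu) / (exp(mu) + (1 / t - 1) ** sigma);
    multiplying through by t ** sigma avoids dividing by zero at t = 0.
    """
    scaled = math.exp(mu) * t**sigma
    return scaled / (scaled + (1.0 - t) ** sigma)


def shifted_schedule(num_steps, mu, sigma=1.0):
    """Evenly spaced points from 0 to 1 (inclusive), each passed through the shift."""
    return [time_shift_exponential(mu, sigma, i / num_steps) for i in range(num_steps + 1)]


unshifted = shifted_schedule(4, mu=0.0)  # mu = 0 leaves the grid unchanged
shifted = shifted_schedule(4, mu=1.0)    # hypothetical mu, for illustration only

print(unshifted)
print([round(value, 4) for value in shifted])
```

With mu = 0 and sigma = 1 the shift is the identity, which makes a convenient sanity check; any other mu bends the interior points while keeping the endpoints fixed at 0 and 1.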
…writer

Reduce the boogu pipeline package from 7 files to 4 by removing dead and
misplaced code, keeping the default T2I/TI2I inference path unchanged.

- Inline set_flow_match_timesteps into pipeline_boogu.py (single caller) and
  delete flow_match_boogu.py, per the "inline single-caller helpers" rule.
- Replace the image_processor.preprocess override (which duplicated the parent
  VaeImageProcessor wholesale) with a thin override that only derives the
  Boogu max_pixels/max_side_length target size, then delegates to the parent.
  Verified bit-identical output across sizes/constraints (max diff 0.0).
- Remove BooguImageLoraLoaderMixin / lora_pipeline.py: LoRA is unused on the
  inference path, and the mixin belongs in loaders/ by diffusers convention.
- Remove the instruction-rewriter feature entirely (static_skills.py,
  instruct_reasoner_static_skills.py, and ~1100 lines of rewriter methods,
  state, and public kwargs). It was gated by use_rewrite_text_instruction
  (default False) and unused by every example/test; the skills files were its
  only consumers.

Net: -2255 / +74 lines. End-to-end TI2I inference reproduces the standalone
reference (mean pixel diff 8.8, unchanged from before), and the boogu test
suite shows the same pre-existing baseline (11 failed / 19 passed / 4 skipped,
the 11 being unrelated MLLM device-placement failures). check_copies and
check_dummies pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Both edit examples ran with no negative prompt. At text_guidance_scale=4.0
the model guides away from the negative instruction, so omitting it left
the output oversaturated and under-stylized (style transfer barely applied).
Add the standard negative prompt used by the reference inference so the
colored-pencil style conversion comes through.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tions)

Self-review against .ai/{AGENTS,models,pipelines,review-rules}.md surfaced a
batch of mechanical issues fixed here (no behavior change on the default path;
boogu test suite unchanged at 16 failed / 53 passed / 7 skipped with an
identical failure set; the remaining failures are a pre-existing MLLM cpu/cuda
device-placement issue).

pipeline_boogu.py:
- Remove dead helpers: _project, _sigmoid_kernel, _softmax_kernel, the
  non-newton-schulz bog_norm branches, MomentumRollingSum._append_and_save
  (+ now-unused pathlib import).
- Drop unused __call__ params verbose and callback_on_step_end_tensor_inputs,
  a bare `latents.shape[0]` expression, and several commented-out code blocks.
- Replace all print() with module logger; drop emoji/blank-line prints.

pipeline_boogu_turbo.py:
- Add module logger; replace the inference print() with logger.info.

transformer_boogu.py:
- Default attention to the SDPA processor instead of selecting it from an
  os.getenv("device") read at __init__ (non-standard, and forced flash in fp32);
  drop the now-unused Flash2Varlen imports and the single-stream block alias.
- Replace np.poly1d TeaCache rescale with inline Horner eval; drop numpy import.
- Fix _no_split_modules / _repeated_blocks (remove the alias string that never
  matched __class__.__name__ and the invalid "nn.Embedding" entry).
- Give PromptEmbedding flat @register_to_config kwargs so from_pretrained
  round-trips; remove its non-standard from_config override.
- Remove dead self.layers, enable_teacache_for_all_layers, a commented-out
  param, a discarded dict lookup, and a stale section comment.
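
The np.poly1d replacement mentioned above amounts to Horner's rule. A minimal sketch follows, assuming coefficients are ordered from the highest degree down to the constant term (the ordering np.poly1d uses); the coefficients in the example are made up and are not the TeaCache rescale coefficients.

```python
def horner_eval(coefficients, x):
    """Evaluate a polynomial with coefficients ordered from highest degree to constant."""
    result = 0.0
    for coefficient in coefficients:
        # Each step folds in one more coefficient: result = result * x + c.
        result = result * x + coefficient
    return result


# 2x^2 - 3x + 1 at x = 4 is 2*16 - 12 + 1 = 21.
print(horner_eval([2.0, -3.0, 1.0], 4.0))
```

Horner's rule needs one multiply and one add per coefficient and no array allocation, which is why it can stand in for a polynomial object when only scalar evaluation is needed.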

attention_processor_boogu.py:
- Remove no-op `layer = layer.to(device)` loops (rebind a local, never move the
  module) plus the bare shape expressions and commented debug lines above them.

image_processor.py:
- Guard get_new_height_width against None max_pixels / max_side_length
  (previously TypeError / UnboundLocalError when called with defaults); output
  is bit-identical when both constraints are set. Sync the class docstring to
  the actual __init__ signature.
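
The None guard described above can be illustrated with a small sizing helper. This is a hypothetical sketch, not the body of get_new_height_width: the name `fit_within_limits`, the shrink-only policy, and the truncating rounding are all assumptions made for the example.

```python
import math


def fit_within_limits(height, width, max_pixels=None, max_side_length=None):
    """Shrink (never enlarge) a size so it satisfies whichever limits are set.

    A limit of None means "no constraint", so calling with defaults is valid.
    """
    if height <= 0 or width <= 0:
        raise ValueError(f"height and width must be positive, got {height}x{width}")

    scale = 1.0  # start from "no change" so both limits are optional
    if max_pixels is not None:
        scale = min(scale, math.sqrt(max_pixels / (height * width)))
    if max_side_length is not None:
        scale = min(scale, max_side_length / max(height, width))

    return max(1, int(height * scale)), max(1, int(width * scale))


print(fit_within_limits(768, 1024))
print(fit_within_limits(1024, 2048, max_side_length=1024))
print(fit_within_limits(2048, 2048, max_pixels=1024 * 1024))
```

Initializing the scale before either branch is what removes the UnboundLocalError class of bug: every path through the function assigns the value that the return statement reads.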

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
No released Boogu checkpoint ships a PromptEmbedding / prompt-tuning subfolder,
so the prompt-tuning path is never exercised by a published model. Per
.ai/AGENTS.md ("only keep the inference path you are actually integrating"),
remove it entirely:

- Delete PromptEmbedding (transformer_boogu.py), BooguImagePromptTuningPipeline
  (pipeline_boogu.py), and BooguImagePromptTuningRotaryPosEmbed (rope_boogu.py).
- Drop the model's unused prompt_tuning_configs config arg, the pipeline's
  prompt_embedding attribute + set_prompt_embedding(), and the
  use_prompt_tuning_embedding branch of _get_instruction_feature_embeds (the
  normal VLM-encoding path is unchanged). The now-orphaned has_offload_strategy
  / _module_execution_device helpers go with it.
- Remove the PromptEmbedding registrations (lazy import structure, top-level
  export, dummy object).

Removing BooguImagePromptTuningPipeline also drops 2 of the 4 except-Exception
fallback blocks (the other 2, in BooguImagePipeline, are handled separately).

Verified: cached checkpoint transformer loads with no missing/unexpected keys
(prompt_tuning_configs in config.json is now harmlessly ignored); import +
ruff clean; no orphaned references remain.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
_get_instruction_feature_embeds wrapped the single-layer MLLM call in
try output_hidden_states=False / except -> output_hidden_states=True and
hidden_states[-1]. Both paths return the same tensor (.last_hidden_state ==
.hidden_states[-1]), so the except branch only masked real errors behind a
UserWarning. Per .ai/AGENTS.md ("raise a concise error for unsupported cases
rather than adding complex fallback logic"), call the single path
unconditionally and let genuine failures surface.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per .ai/models.md, attention processors must use dispatch_attention_fn rather
than calling F.scaled_dot_product_attention / flash_attn_varlen_func directly.
Rewrite the two live processors (single-stream BooguImageAttnProcessor and
double-stream BooguImageDoubleStreamSelfAttnProcessor) to feed (B, L, H, D)
tensors to dispatch_attention_fn with _attention_backend / _parallel_config,
and delete the two dead *Flash2Varlen classes and their _upad_input helpers
(no longer instantiated; varlen unpadding is handled inside the dispatcher).
File shrinks 1128 -> 383 lines.

State_dict keys are unchanged: the double-stream QKV/out projections stay on
the processor module (...processor.img_to_q / instruct_to_q / img_out /
instruct_out), so published checkpoints load strictly with no remapping.

The attention mask is always materialized as a [B, 1, 1, L] bool mask (never
dropped to None when no token is padded): the native backend rounds bf16
differently on its masked vs no-mask paths, and matching the trained behavior
keeps output bit-identical to the pre-refactor pipeline.

Verified bit-exact (maxdiff 0.0): CPU tiny-model forward, GPU bf16 single
forward, and GPU end-to-end base / edit / turbo. Checkpoint loads strict;
pytest suite unchanged at 16/53/7.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per .ai/pipelines.md gotcha #4, a pipeline variant must be its own class with a
duplicated __call__ rather than subclassing another pipeline in core src/ (the
flux / sdxl / wan / qwenimage convention). BooguImageTurboPipeline previously
subclassed BooguImagePipeline and overrode processing() with a DMD branch.

Reparent it to DiffusionPipeline and give it its own pure-T2I DMD __call__: the
setup (device management, encode_instruction, prepare_image, prepare_latents,
RoPE) mirrors the parent's T2I path, then runs the DMD predict/renoise loop and
decode directly — byte-for-byte the same computation the old processing() DMD
branch performed. The DMD path takes no scheduler, reference images, or
classifier-free guidance, so the negative / empty / BOG / cfg kwargs are dropped
from the turbo signature.

Shared utilities (encode_instruction, prepare_latents, prepare_image, predict,
device management, the guidance-scale properties, …) are carried as `# Copied
from diffusers.pipelines.boogu.pipeline_boogu.BooguImagePipeline.<method>` so
make fix-copies keeps them in sync.

Verified: end-to-end turbo output is bit-identical to the pre-change subclass
(maxdiff 0.0); base / edit unaffected (also 0.0); check_copies consistent;
ruff clean; pytest suite unchanged at 16/53/7.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
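
The "# Copied from" arrangement described in the commit above can be sketched with two unrelated classes. The classes and the `_latent_shape` method are hypothetical placeholders (the real shared methods and their signatures are not reproduced here), and `bodies_match` is only a toy drift check, not what make fix-copies runs.

```python
class BasePipelineSketch:
    """Stand-in for BooguImagePipeline."""

    def _latent_shape(self, batch_size, channels, height, width, downscale=8):
        return (batch_size, channels, height // downscale, width // downscale)


class TurboPipelineSketch:
    """Stand-in for BooguImageTurboPipeline: a separate class, not a subclass."""

    # Copied from diffusers.pipelines.boogu.pipeline_boogu.BooguImagePipeline._latent_shape
    # (hypothetical method name, used here only to show the annotation format)
    def _latent_shape(self, batch_size, channels, height, width, downscale=8):
        return (batch_size, channels, height // downscale, width // downscale)


def bodies_match(first, second):
    """Toy drift check: compare the compiled bytecode and constants of two functions."""
    return (
        first.__code__.co_code == second.__code__.co_code
        and first.__code__.co_consts == second.__code__.co_consts
    )


print(issubclass(TurboPipelineSketch, BasePipelineSketch))
print(bodies_match(BasePipelineSketch._latent_shape, TurboPipelineSketch._latent_shape))
```

The duplication is deliberate: each pipeline file stays readable on its own, and the annotation lets tooling detect when a copy drifts from its source.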
Self-review round 2 against .ai rules, after the four structural refactors.
No numerical change: CPU and GPU end-to-end (base/edit/turbo) A/B stay
bit-identical (maxdiff 0.0); pytest suite unchanged at 16/53/7.

Dead code removed:
- MASK_VISION_TOKENS_FEATURE / VISION_TOKEN_IDs and their truncation branch
  (no public API ever sets them) plus the now-unused input_ids local.
- base_sequence_length parameter and its proportional-attention branch from
  both attention processors (never passed by the transformer); drops the math
  import.
- BooguImageRotaryPosEmbed reduced to the only thing used — the static
  get_freqs_cis — dropping its dead __init__/_get_freqs_cis/forward (the
  transformer uses BooguImageDoubleStreamRotaryPosEmbed; the pipeline only calls
  the static method).
- Commented-out guidance formula and the `+ +` unary-plus typos in the triple
  guidance combination; stale docstrings (a "LoRA loading" mention with no LoRA,
  a reference to an internal training dataset class, a "may not be actually
  used" development note).

Correctness / convention:
- assert -> raise ValueError in the transformer / rope / attention forward paths
  (asserts are stripped under python -O).
- _validate_device_format now relies on the validator's own raise instead of
  returning an ignored bool.
- MomentumRollingSum states are only constructed when boosted orthogonal
  guidance is enabled.
- encode_instruction return annotation corrected (it returns six values).
- BooguImageTransformerTesterConfig inherits BaseModelTesterConfig (gives it
  model_split_percents etc., matching the other transformer tests).
- examples: edit / edit_fp8 raise a clear error if base.png is missing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
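
The assert-to-ValueError change listed above follows a simple pattern. The check below is a hypothetical example of that pattern (the actual conditions validated in the transformer, rope, and attention code are not shown in this PR text).

```python
def head_dim_for(hidden_size, num_heads):
    """Return the per-head dimension, validating inputs with an explicit exception.

    An `assert hidden_size % num_heads == 0` would vanish under `python -O`;
    raising ValueError keeps the check active in every run mode.
    """
    if num_heads <= 0:
        raise ValueError(f"num_heads must be positive, got {num_heads}.")
    if hidden_size % num_heads != 0:
        raise ValueError(
            f"hidden_size ({hidden_size}) must be divisible by num_heads ({num_heads})."
        )
    return hidden_size // num_heads


print(head_dim_for(64, 8))
```

An explicit exception also lets the message name the offending values, which is harder to keep readable inside a one-line assert.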
Collapse statements that fit on one line after the previous cleanup, so
`make style` / `ruff format --check` is clean for the PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The instruct_reasoner_static_skills.py prompt-template module was removed
during cleanup; its per-file ruff ignore in pyproject.toml pointed at a
file that no longer exists. Remove the dead entry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
github-actions bot added the size/L (PR with diff > 200 LOC), documentation, models, tests, utils, pipelines, examples, and schedulers labels and removed the size/L label on Jun 22, 2026

@yiyixuxu yiyixuxu left a comment

Collaborator

thanks a lot for the PR and the very thoughtful self-review :)
i left some comments/questions

Comment thread src/diffusers/ops/triton/layer_norm.py Outdated
Comment thread src/diffusers/models/transformers/transformer_boogu.py Outdated
Comment thread src/diffusers/pipelines/boogu/pipeline_boogu_turbo.py Outdated
Comment thread src/diffusers/pipelines/boogu/pipeline_boogu_turbo.py Outdated
Comment thread src/diffusers/pipelines/boogu/pipeline_boogu_turbo.py Outdated
Comment thread src/diffusers/pipelines/boogu/pipeline_boogu_turbo.py Outdated
Boogu-Team and others added 3 commits June 23, 2026 03:12
The triton fused-RMSNorm / flash-attn SwiGLU paths were gated behind an
`os.getenv("device")` guard that defaulted to "cpu", so the published
inference path always fell back to torch.nn.RMSNorm and a torch SwiGLU.
Remove the unused ops/triton kernels (1261 lines) and ops/simple_layer_norm,
drop the dead env-guard in block_lumina2, and the now-unused
is_triton_available helper. Numerically identical to the default path;
addresses reviewer feedback (single-file convention prep + perf-path removal).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
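
The dead-path problem described in the commit above can be reproduced in a few lines. This is a hypothetical reconstruction of the pattern, not the removed code: `select_norm_backend` and its return labels are invented for illustration, and only the `os.getenv("device")` read with a "cpu" default comes from the commit message.

```python
import os


def select_norm_backend(environ=None):
    """Pick a normalization backend from a `device` environment variable.

    Because the variable defaults to "cpu", the fused branch is unreachable
    unless the user exports `device` themselves, which no published example does.
    """
    environ = os.environ if environ is None else environ
    device = environ.get("device", "cpu")
    return "fused" if device != "cpu" else "torch"


print(select_norm_backend({}))                  # the default path
print(select_norm_backend({"device": "cuda"}))  # only reachable with the variable set
```

Passing the environment in as a mapping keeps the sketch testable; the broader point is that behavior keyed on an undocumented environment variable is effectively dead code on the default path.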
diffusers follows a one-model-one-file convention. Merge the Boogu model's
helper modules into transformer_boogu.py:
  - rope_boogu.py            -> RoPE section
  - block_lumina2.py         -> norm / feed-forward / embedding section
  - attention_processor_boogu.py -> attention-processor section
Update the two pipelines and the transformer test to import
BooguImageRotaryPosEmbed from transformer_boogu. Pure code relocation: the
class bodies are unchanged, so checkpoints load identically and base/edit/turbo
remain bit-exact (verified end-to-end on GPU). Addresses reviewer single-file
convention feedback.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A pipeline subclass should only carry pipeline-specific steps; device
placement, offloading, and component registration belong to DiffusionPipeline.
Remove the custom devices_manager / set_mllm / set_transformer / set_processor
/ set_scheduler / _validate_device_format / _check_device_strategy_validity
methods, the enable_*_offload_flag / user_set_pipe_device state, and the
now-unused validator_utils helper. __call__ resolves the device via the base
class's _execution_device and drops its redundant `device=` kwarg; the mllm
lm_head stripping stays in __init__. This also makes the inherited
to()/enable_*_offload tests pass (previously 16-17 device/offload failures,
now 0). Addresses reviewer feedback on pipeline-subclass responsibilities.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
github-actions bot added the size/L (PR with diff > 200 LOC) label on Jun 23, 2026
@Boogu-Team

Author

> thanks a lot for the PR and the very thoughtful self-review :)
> i left some comments/questions

Thank you for the quick and thoughtful review, @yiyixuxu — much appreciated! 🙏

I've addressed all the comments and pushed the changes (commits 9e672c2, d202a23, 5cef903):

  • single-file convention: merged the model's helper modules (rope / blocks / attention processors) into transformer_boogu.py.
  • triton fused RMSNorm: removed; the published path always fell back to torch.nn.RMSNorm, so the result is numerically identical.
  • pipeline-subclass responsibilities: dropped all the custom device / offload / component-setter infrastructure from both pipelines; they now rely on DiffusionPipeline (.to() / _execution_device / enable_*_offload). This also made the previously failing device/offload tests pass.

I've replied inline on each thread with the specifics and resolved them.
Happy to iterate further — thanks again!

Labels

documentation, examples, models, pipelines, schedulers, size/L, tests, utils

2 participants