Add Boogu-Image generation, editing, and turbo pipelines#14040
Open
Boogu-Team wants to merge 16 commits into
Integrate the Boogu-Image model into diffusers:
- Models: BooguImageTransformer2DModel, PromptEmbedding, Boogu attention processors, Lumina2 blocks, and rotary embeddings.
- Pipelines: BooguImagePipeline (text-to-image and instruction editing) and BooguImageTurboPipeline (DMD few-step text-to-image).
- Scheduler: flow-match Euler scheduler with training-aligned time shifting.
- Internal utils: TaylorSeer cache, TeaCache params, DPM cache helpers, and optional Triton fused RMSNorm.
- Loading: resolve published checkpoints' custom module names to the integrated classes via module aliases, so from_pretrained needs no trust_remote_code.
- Docs and runnable examples under docs/ and examples/boogu/.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drop the Boogu-only TaylorSeer caching feature, which was only half-removed in the working tree (left dangling `enable_taylorseer` references that raised NameError, and collaterally deleted the TeaCache `__init__` setup so the transformer raised AttributeError on `enable_teacache`).
- transformer_boogu.py: remove the remaining TaylorSeer branches; restore the TeaCache init block (enable_teacache, enable_teacache_for_all_layers, teacache_rel_l1_thresh, teacache_params, rescale_func) and the numpy / TeaCacheParams imports it needs.
- pipeline_boogu.py: drop the cache_init import, the enable_taylorseer plumbing and per-condition cache_dic/current branches, collapsing each `if enable_taylorseer / elif enable_teacache` into a plain `if enable_teacache`.
- Delete cache_functions/ and taylorseer_utils/ (Boogu-added, TaylorSeer-only, now unreferenced). The upstream hooks-based TaylorSeerCacheConfig is untouched.
- Remove BOOGU_INTEGRATION.md (ephemeral integration notes); add an environment install link to examples/boogu/README.md.
The pipeline uses the official FlowMatchEulerDiscreteScheduler via the thin BooguFlowMatchEulerDiscreteScheduler subclass (reuses the parent step).
Tests: test_models_transformer_boogu (15 passed) and test_boogu (20 passed) green; check_copies and check_dummies pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… adapter
Replace the BooguFlowMatchEulerDiscreteScheduler subclass with the official FlowMatchEulerDiscreteScheduler plus a standalone set_flow_match_timesteps adapter that applies Boogu's training-aligned static v1 time shift and 0->1 sigma schedule, reusing the parent's exponential shift formula.
- Add pipelines/boogu/flow_match_boogu.py with set_flow_match_timesteps
- Route the flow-match branch of retrieve_timesteps through the adapter (annotated "# Adapted from" to reflect the intentional divergence)
- Update pipeline/test type hints and imports to the official scheduler
- Drop the scheduler subclass and its registrations (schedulers/__init__, top-level __init__, dummy_pt_objects)
Numerically equivalent to the old subclass (max diff ~6e-08). The boogu test suite shows no regression vs the pre-change tree (same 11 pre-existing MLLM device-placement failures, 19 passed).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
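For readers unfamiliar with the time shift this commit refers to, here is a minimal sketch of an exponential flow-match shift applied to an ascending grid. It is an illustration only: `shifted_schedule`, the grid layout, and the `mu` value are assumptions made for the example, not the adapter's actual code, and how Boogu derives `mu` is not shown in this PR text.

```python
import math


def time_shift_exponential(mu: float, sigma: float, t: float) -> float:
    """Exponential time shift for a single t in (0, 1]."""
    return math.exp(mu) / (math.exp(mu) + (1.0 / t - 1.0) ** sigma)


def shifted_schedule(num_steps: int, mu: float) -> list[float]:
    """Ascending grid on (0, 1], warped by a static shift.

    Hypothetical helper: the grid skips t=0 to avoid dividing by zero.
    """
    grid = [(i + 1) / num_steps for i in range(num_steps)]
    return [time_shift_exponential(mu, 1.0, t) for t in grid]


if __name__ == "__main__":
    # mu is an arbitrary illustrative value here.
    print(shifted_schedule(4, mu=1.0))
```

With `mu = 0` and `sigma = 1` the shift reduces to the identity, which is a quick sanity check when comparing two implementations.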
…writer
Reduce the boogu pipeline package from 7 files to 4 by removing dead and misplaced code, keeping the default T2I/TI2I inference path unchanged.
- Inline set_flow_match_timesteps into pipeline_boogu.py (single caller) and delete flow_match_boogu.py, per the "inline single-caller helpers" rule.
- Replace the image_processor.preprocess override (which duplicated the parent VaeImageProcessor wholesale) with a thin override that only derives the Boogu max_pixels/max_side_length target size, then delegates to the parent. Verified bit-identical output across sizes/constraints (max diff 0.0).
- Remove BooguImageLoraLoaderMixin / lora_pipeline.py: LoRA is unused on the inference path, and the mixin belongs in loaders/ by diffusers convention.
- Remove the instruction-rewriter feature entirely (static_skills.py, instruct_reasoner_static_skills.py, and ~1100 lines of rewriter methods, state, and public kwargs). It was gated by use_rewrite_text_instruction (default False) and unused by every example/test; the skills files were its only consumers.
Net: -2255 / +74 lines. End-to-end TI2I inference reproduces the standalone reference (mean pixel diff 8.8, unchanged from before), and the boogu test suite shows the same pre-existing baseline (11 failed / 19 passed / 4 skipped, the 11 being unrelated MLLM device-placement failures). check_copies and check_dummies pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Both edit examples ran with no negative prompt. At text_guidance_scale=4.0 the model guides away from the negative instruction, so omitting it left the output oversaturated and under-stylized (style transfer barely applied). Add the standard negative prompt used by the reference inference so the colored-pencil style conversion comes through. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
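To make the "guides away from the negative instruction" point concrete, here is a sketch of the standard classifier-free guidance combination on plain Python lists. This is the textbook formula, assumed here for illustration; Boogu's own guidance combination is not reproduced, and `apply_text_guidance` is a hypothetical name.

```python
def apply_text_guidance(cond, neg, scale):
    """Standard CFG: start from the negative branch and extrapolate toward cond.

    `cond` and `neg` are per-element predictions for the positive and
    negative instruction; plain lists stand in for tensors.
    """
    return [n + scale * (c - n) for c, n in zip(cond, neg)]


if __name__ == "__main__":
    # Where the two branches agree, the scale has no effect;
    # where they differ, the difference is amplified by `scale`.
    print(apply_text_guidance([1.0, 2.0], [0.0, 2.0], 4.0))
```

Under this formula the result depends on what the negative branch predicts, which is why the choice of negative prompt matters at a scale of 4.0.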
…tions)
Self-review against .ai/{AGENTS,models,pipelines,review-rules}.md surfaced a
batch of mechanical issues fixed here (no behavior change on the default path;
boogu test suite unchanged at 16/53/7, identical failure set — the remaining
failures are a pre-existing MLLM cpu/cuda device-placement issue).
pipeline_boogu.py:
- Remove dead helpers: _project, _sigmoid_kernel, _softmax_kernel, the
non-newton-schulz bog_norm branches, MomentumRollingSum._append_and_save
(+ now-unused pathlib import).
- Drop unused __call__ params verbose and callback_on_step_end_tensor_inputs,
a bare `latents.shape[0]` expression, and several commented-out code blocks.
- Replace all print() with module logger; drop emoji/blank-line prints.
pipeline_boogu_turbo.py:
- Add module logger; replace the inference print() with logger.info.
transformer_boogu.py:
- Default attention to the SDPA processor instead of selecting it from an
os.getenv("device") read at __init__ (non-standard, and forced flash in fp32);
drop the now-unused Flash2Varlen imports and the single-stream block alias.
- Replace np.poly1d TeaCache rescale with inline Horner eval; drop numpy import.
- Fix _no_split_modules / _repeated_blocks (remove the alias string that never
matched __class__.__name__ and the invalid "nn.Embedding" entry).
- Give PromptEmbedding flat @register_to_config kwargs so from_pretrained
round-trips; remove its non-standard from_config override.
- Remove dead self.layers, enable_teacache_for_all_layers, a commented-out
param, a discarded dict lookup, and a stale section comment.
attention_processor_boogu.py:
- Remove no-op `layer = layer.to(device)` loops (rebind a local, never move the
module) plus the bare shape expressions and commented debug lines above them.
image_processor.py:
- Guard get_new_height_width against None max_pixels / max_side_length
(previously TypeError / UnboundLocalError when called with defaults); output
is bit-identical when both constraints are set. Sync the class docstring to
the actual __init__ signature.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
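The `np.poly1d` to Horner replacement listed under transformer_boogu.py above can be sketched as follows. The function name and the coefficients are placeholders for illustration; the actual TeaCache rescale coefficients are not given in this commit message.

```python
def horner_eval(coeffs, x):
    """Evaluate a polynomial with Horner's rule.

    `coeffs` are ordered highest degree first, the same ordering
    np.poly1d uses, so the two give the same value.
    """
    result = 0.0
    for c in coeffs:
        result = result * x + c
    return result


if __name__ == "__main__":
    # Placeholder polynomial: 1*x^2 + 2*x + 3 evaluated at x = 2.
    print(horner_eval([1.0, 2.0, 3.0], 2.0))
```

This avoids the numpy dependency while keeping the same coefficient ordering.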
No released Boogu checkpoint ships a PromptEmbedding / prompt-tuning subfolder,
so the prompt-tuning path is never exercised by a published model. Per
.ai/AGENTS.md ("only keep the inference path you are actually integrating"),
remove it entirely:
- Delete PromptEmbedding (transformer_boogu.py), BooguImagePromptTuningPipeline
(pipeline_boogu.py), and BooguImagePromptTuningRotaryPosEmbed (rope_boogu.py).
- Drop the model's unused prompt_tuning_configs config arg, the pipeline's
prompt_embedding attribute + set_prompt_embedding(), and the
use_prompt_tuning_embedding branch of _get_instruction_feature_embeds (the
normal VLM-encoding path is unchanged). The now-orphaned has_offload_strategy
/ _module_execution_device helpers go with it.
- Remove the PromptEmbedding registrations (lazy import structure, top-level
export, dummy object).
Removing BooguImagePromptTuningPipeline also drops 2 of the 4 except-Exception
fallback blocks (the other 2, in BooguImagePipeline, are handled separately).
Verified: cached checkpoint transformer loads with no missing/unexpected keys
(prompt_tuning_configs in config.json is now harmlessly ignored); import +
ruff clean; no orphaned references remain.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
_get_instruction_feature_embeds wrapped the single-layer MLLM call in
try output_hidden_states=False / except -> output_hidden_states=True and
hidden_states[-1]. Both paths return the same tensor (.last_hidden_state ==
.hidden_states[-1]), so the except branch only masked real errors behind a
UserWarning. Per .ai/AGENTS.md ("raise a concise error for unsupported cases
rather than adding complex fallback logic"), call the single path
unconditionally and let genuine failures surface.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per .ai/models.md, attention processors must use dispatch_attention_fn rather than calling F.scaled_dot_product_attention / flash_attn_varlen_func directly. Rewrite the two live processors (single-stream BooguImageAttnProcessor and double-stream BooguImageDoubleStreamSelfAttnProcessor) to feed (B, L, H, D) tensors to dispatch_attention_fn with _attention_backend / _parallel_config, and delete the two dead *Flash2Varlen classes and their _upad_input helpers (no longer instantiated; varlen unpadding is handled inside the dispatcher). File shrinks 1128 -> 383 lines. State_dict keys are unchanged: the double-stream QKV/out projections stay on the processor module (...processor.img_to_q / instruct_to_q / img_out / instruct_out), so published checkpoints load strictly with no remapping. The attention mask is always materialized as a [B, 1, 1, L] bool mask (never dropped to None when no token is padded): the native backend rounds bf16 differently on its masked vs no-mask paths, and matching the trained behavior keeps output bit-identical to the pre-refactor pipeline. Verified bit-exact (maxdiff 0.0): CPU tiny-model forward, GPU bf16 single forward, and GPU end-to-end base / edit / turbo. Checkpoint loads strict; pytest suite unchanged at 16/53/7. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
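A rough illustration of the "always materialized" mask described above, using nested Python lists in place of a tensor. `build_attention_mask` and its arguments are hypothetical; the real processors build a bool tensor and pass it to `dispatch_attention_fn`.

```python
def build_attention_mask(lengths, max_len):
    """Return a [B][1][1][L] nested list of bools (True = real token).

    The mask is built even when no sequence is padded, instead of
    being collapsed to None.
    """
    return [[[[pos < n for pos in range(max_len)]]] for n in lengths]


if __name__ == "__main__":
    mask = build_attention_mask([2, 3], max_len=3)
    print(mask[0][0][0])  # padded sequence
    print(mask[1][0][0])  # unpadded sequence still gets an explicit mask
```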
Per .ai/pipelines.md gotcha huggingface#4, a pipeline variant must be its own class with a duplicated __call__ rather than subclassing another pipeline in core src/ (the flux / sdxl / wan / qwenimage convention). BooguImageTurboPipeline previously subclassed BooguImagePipeline and overrode processing() with a DMD branch. Reparent it to DiffusionPipeline and give it its own pure-T2I DMD __call__: the setup (device management, encode_instruction, prepare_image, prepare_latents, RoPE) mirrors the parent's T2I path, then runs the DMD predict/renoise loop and decode directly — byte-for-byte the same computation the old processing() DMD branch performed. The DMD path takes no scheduler, reference images, or classifier-free guidance, so the negative / empty / BOG / cfg kwargs are dropped from the turbo signature. Shared utilities (encode_instruction, prepare_latents, prepare_image, predict, device management, the guidance-scale properties, …) are carried as `# Copied from diffusers.pipelines.boogu.pipeline_boogu.BooguImagePipeline.<method>` so make fix-copies keeps them in sync. Verified: end-to-end turbo output is bit-identical to the pre-change subclass (maxdiff 0.0); base / edit unaffected (also 0.0); check_copies consistent; ruff clean; pytest suite unchanged at 16/53/7. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Self-review round 2 against .ai rules, after the four structural refactors. No numerical change: CPU and GPU end-to-end (base/edit/turbo) A/B stay bit-identical (maxdiff 0.0); pytest suite unchanged at 16/53/7.
Dead code removed:
- MASK_VISION_TOKENS_FEATURE / VISION_TOKEN_IDs and their truncation branch (no public API ever sets them) plus the now-unused input_ids local.
- base_sequence_length parameter and its proportional-attention branch from both attention processors (never passed by the transformer); drops the math import.
- BooguImageRotaryPosEmbed reduced to the only thing used, the static get_freqs_cis, dropping its dead __init__/_get_freqs_cis/forward (the transformer uses BooguImageDoubleStreamRotaryPosEmbed; the pipeline only calls the static method).
- Commented-out guidance formula and the `+ +` unary-plus typos in the triple guidance combination; stale docstrings (a "LoRA loading" mention with no LoRA, a reference to an internal training dataset class, a "may not be actually used" development note).
Correctness / convention:
- assert -> raise ValueError in the transformer / rope / attention forward paths (asserts are stripped under python -O).
- _validate_device_format now relies on the validator's own raise instead of returning an ignored bool.
- MomentumRollingSum states are only constructed when boosted orthogonal guidance is enabled.
- encode_instruction return annotation corrected (it returns six values).
- BooguImageTransformerTesterConfig inherits BaseModelTesterConfig (gives it model_split_percents etc., matching the other transformer tests).
- examples: edit / edit_fp8 raise a clear error if base.png is missing.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
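The assert -> raise ValueError change in this round follows a common pattern, sketched below. `check_head_dim` and its arguments are hypothetical stand-ins, not code from the transformer.

```python
def check_head_dim(hidden_size: int, num_heads: int) -> int:
    """Validate a shape constraint with an explicit raise.

    An `assert` here would vanish under `python -O`, letting a bad
    configuration through; a ValueError is raised in every mode.
    """
    if hidden_size % num_heads != 0:
        raise ValueError(
            f"hidden_size ({hidden_size}) must be divisible by num_heads ({num_heads})."
        )
    return hidden_size // num_heads


if __name__ == "__main__":
    print(check_head_dim(64, 8))
```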
Collapse statements that fit on one line after the previous cleanup, so `make style` / `ruff format --check` is clean for the PR. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The instruct_reasoner_static_skills.py prompt-template module was removed during cleanup; its per-file ruff ignore in pyproject.toml pointed at a file that no longer exists. Remove the dead entry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
yiyixuxu (Collaborator) reviewed on Jun 23, 2026 and left a comment:
thanks a lot for the PR and the very thoughtful self-review :)
i left some comments/questions
The triton fused-RMSNorm / flash-attn SwiGLU paths were gated behind an
`os.getenv("device")` guard that defaulted to "cpu", so the published
inference path always fell back to torch.nn.RMSNorm and a torch SwiGLU.
Remove the unused ops/triton kernels (1261 lines) and ops/simple_layer_norm,
drop the dead env-guard in block_lumina2, and the now-unused
is_triton_available helper. Numerically identical to the default path;
addresses reviewer feedback (single-file convention prep + perf-path removal).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
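Why the env guard made the fast path unreachable can be shown with a small sketch. `select_norm_impl` and the `"cuda"` check are assumptions made for illustration; only the `os.getenv("device")` read with a "cpu" default comes from the commit message.

```python
import os


def select_norm_impl(env=None):
    """Pick a norm implementation from an env-style mapping.

    Hypothetical reconstruction: unless the caller exports `device`,
    the lookup falls back to "cpu" and the torch path is always chosen.
    """
    env = os.environ if env is None else env
    device = env.get("device", "cpu")
    return "triton" if device.startswith("cuda") else "torch"


if __name__ == "__main__":
    print(select_norm_impl({}))                  # variable unset
    print(select_norm_impl({"device": "cuda"}))  # only reachable if exported
```

Since no published example sets that variable, removing the kernels leaves the default path unchanged.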
diffusers follows a one-model-one-file convention. Merge the Boogu model's helper modules into transformer_boogu.py:
- rope_boogu.py -> RoPE section
- block_lumina2.py -> norm / feed-forward / embedding section
- attention_processor_boogu.py -> attention-processor section
Update the two pipelines and the transformer test to import BooguImageRotaryPosEmbed from transformer_boogu. Pure code relocation: the class bodies are unchanged, so checkpoints load identically and base/edit/turbo remain bit-exact (verified end-to-end on GPU). Addresses reviewer single-file convention feedback.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A pipeline subclass should only carry pipeline-specific steps; device placement, offloading, and component registration belong to DiffusionPipeline. Remove the custom devices_manager / set_mllm / set_transformer / set_processor / set_scheduler / _validate_device_format / _check_device_strategy_validity methods, the enable_*_offload_flag / user_set_pipe_device state, and the now-unused validator_utils helper. __call__ resolves the device via the base class's _execution_device and drops its redundant `device=` kwarg; the mllm lm_head stripping stays in __init__. This also makes the inherited to()/enable_*_offload tests pass (previously 16-17 device/offload failures, now 0). Addresses reviewer feedback on pipeline-subclass responsibilities. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Author
Thank you for the quick and thoughtful review, @yiyixuxu — much appreciated! 🙏 I've addressed all the comments and pushed the changes (commits 9e672c2, d202a23, 5cef903):
I've replied inline on each thread with the specifics and resolved them.
What does this PR do?
Adds the Boogu-Image family of pipelines to `diffusers`:
- `BooguImagePipeline`: text-to-image generation and instruction-based image editing.
- `BooguImageTurboPipeline`: few-step DMD distilled generation.
The integration is purely additive: it introduces new files only and does not modify any existing upstream module.
Published checkpoints (Hugging Face Hub, `Boogu/…`): `Boogu-Image-0.1-Base`, `Boogu-Image-0.1-Edit`, `Boogu-Image-0.1-Turbo`, and their `-fp8` variants.
Pipelines & model
- `src/diffusers/pipelines/boogu/pipeline_boogu.py`
- `src/diffusers/pipelines/boogu/pipeline_boogu_turbo.py`
- `src/diffusers/models/transformers/transformer_boogu.py`
- `src/diffusers/models/attention_processor_boogu.py`
- `src/diffusers/models/transformers/rope_boogu.py`
- `src/diffusers/pipelines/boogu/image_processor.py`
Docs: `docs/source/en/api/pipelines/boogu.md`. Runnable examples: `examples/boogu/` (base / edit / turbo + fp8, with `README.md`).
Convention compliance
Implemented against the repo's `.ai` rules:
- Attention via `dispatch_attention_fn` (no direct `F.scaled_dot_product_attention` in the forward path); attention masks always materialized as `[B, 1, 1, L]` bool to stay bit-exact with the trained checkpoints under the native bf16 backend.
- `BooguImageTurboPipeline` is a standalone `DiffusionPipeline` (not a subclass of `BooguImagePipeline`), with shared methods carried via `# Copied from` and kept in sync by `make fix-copies`.
- No `except Exception` fallbacks, no unused "API-consistency" parameters; training/ablation/prompt-tuning code from the research repo was removed, and only the inference path is integrated.
Verification
- `examples/boogu/` scripts run end-to-end and produce correct images (base / edit / turbo + fp8).
- … `maxdiff = 0`; checkpoints load strict (no missing/unexpected keys).
- `ruff check`, `ruff format --check`, `make fix-copies` (no diff), `check_dummies`.
- `tests/pipelines/boogu/` and `tests/models/transformers/test_models_transformer_boogu.py`.
Notes for reviewers
- … `use_boosted_orthogonal_guidance`, default `False`) rather than moved into `src/diffusers/guiders/`, since the guiders framework is oriented at modular pipelines. Happy to relocate it if preferred.
- `_keep_in_fp32_modules` is intentionally not set: forcing `time_caption_embed` to fp32 changes bf16 inference numerics, so we left it to the model author / reviewer's call.
- … `transformers` 5.10.x (the env var alone does not disable DeepGEMM there).
Required Hub-side checkpoint changes (no code change in this PR)
1. `model_index.json`: point component classes at `diffusers`, not the bundled remote-code modules.
   On Hub `main` it currently is:
   Both must become library refs so `from_pretrained` resolves the in-tree classes added by this PR:
2. Delete the two remote-code shim modules (they are thin re-export stubs that `import boogu …` from the private research package and raise `ModuleNotFoundError: boogu` for any external user):
   - `transformer/transformer_boogu.py`
   - `scheduler/scheduling_flow_match_euler_discrete_time_shifting.py`
3. `scheduler/scheduler_config.json`: drop the legacy custom-scheduler keys. Hub `main` has:
   { "_class_name": "FlowMatchEulerDiscreteScheduler", "_diffusers_version": "0.33.1", "do_shift": true, "dynamic_time_shift": false, "time_shift_version": "v1", "seq_len": 4096, "num_train_timesteps": 1000 }
   Remove `do_shift`, `dynamic_time_shift`, `time_shift_version` (these drove the old custom scheduler's time-shift; the official scheduler ignores them). Keep `seq_len`: the pipeline's time-shift adapter reads `scheduler.config.seq_len` to compute the shift `mu`, so it is load-bearing. The final config keeps `_class_name` / `_diffusers_version` / `seq_len` / `num_train_timesteps`.
4. `transformer/config.json`: remove `prompt_tuning_configs` (the prompt-tuning subsystem was dropped from this integration; the key is otherwise ignored with a warning).
The same four edits apply to all six repos (`Base`, `Edit`, `Turbo` and their `-fp8` variants). The `mllm/`, `processor/`, `vae/` and weight files are unchanged.
Fixes # (issue)
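The scheduler-config edit in step 3 of the Hub-side changes could be scripted roughly as below. This is a sketch only: `strip_legacy_scheduler_keys` is a hypothetical helper, and the JSON is the Hub `main` config quoted in this description, inlined rather than read from a repo.

```python
import json

# Keys that drove the old custom scheduler's time shift.
LEGACY_KEYS = ("do_shift", "dynamic_time_shift", "time_shift_version")


def strip_legacy_scheduler_keys(config: dict) -> dict:
    """Return a copy of the config without the legacy keys (seq_len is kept)."""
    return {k: v for k, v in config.items() if k not in LEGACY_KEYS}


hub_config = json.loads(
    """
    {
      "_class_name": "FlowMatchEulerDiscreteScheduler",
      "_diffusers_version": "0.33.1",
      "do_shift": true,
      "dynamic_time_shift": false,
      "time_shift_version": "v1",
      "seq_len": 4096,
      "num_train_timesteps": 1000
    }
    """
)
cleaned = strip_legacy_scheduler_keys(hub_config)
print(json.dumps(cleaned, indent=2))
```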
Before submitting
… `.ai/review-rules.md`?
… documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.