159 changes: 158 additions & 1 deletion examples/community/README.md
@@ -88,7 +88,7 @@ PIXART-α Controlnet pipeline | Implementation of the controlnet model for pixar
| FaithDiff Stable Diffusion XL Pipeline | Implementation of [(CVPR 2025) FaithDiff: Unleashing Diffusion Priors for Faithful Image Super-resolutionUnleashing Diffusion Priors for Faithful Image Super-resolution](https://huggingface.co/papers/2411.18824) - FaithDiff is a faithful image super-resolution method that leverages latent diffusion models by actively adapting the diffusion prior and jointly fine-tuning its components (encoder and diffusion model) with an alignment module to ensure high fidelity and structural consistency. | [FaithDiff Stable Diffusion XL Pipeline](#faithdiff-stable-diffusion-xl-pipeline) | [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/jychen9811/FaithDiff) | [Junyang Chen, Jinshan Pan, Jiangxin Dong, IMAG Lab, (Adapted by Eliseu Silva)](https://github.com/JyChen9811/FaithDiff) |
| Stable Diffusion 3 InstructPix2Pix Pipeline | Implementation of Stable Diffusion 3 InstructPix2Pix Pipeline | [Stable Diffusion 3 InstructPix2Pix Pipeline](#stable-diffusion-3-instructpix2pix-pipeline) | [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/BleachNick/SD3_UltraEdit_freeform) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/CaptainZZZ/sd3-instructpix2pix) | [Jiayu Zhang](https://github.com/xduzhangjiayu) and [Haozhe Zhao](https://github.com/HaozheZhao)|
| Flux Kontext multiple images | A modified version of the `FluxKontextPipeline` that supports calling Flux Kontext with multiple reference images.| [Flux Kontext multiple input Pipeline](#flux-kontext-multiple-images) | - | [Net-Mist](https://github.com/Net-Mist) |
| Lumina-DiMOO Pipeline | Implementation of Lumina-DiMOO, an omni-foundational model for unified multimodal generation and understanding. | [Lumina-DiMOO Pipeline](#lumina-dimoo) | [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/Alpha-VLLM/Lumina-DiMOO) | [Yi Xin](https://synbol.github.io/) and [Qi Qin](https://github.com/ChinChyi)|

To load a custom pipeline, pass the `custom_pipeline` argument to `DiffusionPipeline`, set to the name of one of the files in `diffusers/examples/community`, as in the sketch below. Feel free to send a PR with your own pipelines; we will merge them quickly.
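
For example, a minimal sketch (the model id below is a placeholder; `lumina_dimoo` is the community pipeline file documented later in this README, and each community pipeline lists the checkpoints and extra components it actually expects):

```python
import torch

from diffusers import DiffusionPipeline

# Sketch only: `custom_pipeline` takes the file name (without ".py") of a pipeline
# in diffusers/examples/community; the model id below is a placeholder.
pipe = DiffusionPipeline.from_pretrained(
    "your/checkpoint-or-model-id",      # hypothetical checkpoint
    custom_pipeline="lumina_dimoo",     # any file name from examples/community
    torch_dtype=torch.bfloat16,
)
```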

@@ -5527,3 +5527,160 @@ images = pipe(
).images
images[0].save("pizzeria.png")
```


# Lumina-DiMOO
[Project](https://synbol.github.io/Lumina-DiMOO/) / [GitHub](https://github.com/Alpha-VLLM/Lumina-DiMOO/) / [Model](https://huggingface.co/Alpha-VLLM/Lumina-DiMOO)

Lumina-DiMOO is a discrete-diffusion omni-modal foundation model that unifies generation and understanding. This community pipeline exposes Lumina-DiMOO through a single `task` switch covering text-to-image generation (T2I), image-to-image editing (I2I), and multimodal understanding (MMU).

#### Key features

- **Unified Discrete Diffusion Architecture**: Employs a fully discrete diffusion framework to process inputs and outputs across diverse modalities.
- **Versatile Multimodal Capabilities**: Supports a wide range of multimodal tasks, including text-to-image generation (arbitrary and high-resolution), image-to-image generation (e.g., image editing, subject-driven generation, inpainting), and advanced image understanding.
- **Higher Sampling Efficiency**: Outperforms previous autoregressive (AR) or hybrid AR-diffusion models with significantly faster sampling. A custom caching mechanism further boosts sampling speed by up to 2× (see the sketch after this list).
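
A rough sketch of how the cache is enabled is shown below; the argument names mirror the text-to-image example later in this section, and their exact semantics are defined by the pipeline rather than by this snippet.

```python
# Assumes `pipe` was created as in the "Example Usage" section below.
images = pipe(
    prompt="A photo of a corgi surfing a wave",  # example prompt, not from the paper
    task="text_to_image",
    num_inference_steps=64,
    cfg_scale=4.0,
    use_cache=True,      # turn the caching mechanism on
    cache_ratio=0.9,     # cache tuning knobs; see the pipeline/repository for details
    warmup_ratio=0.3,
    refresh_interval=5,
).images
```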


### Example Usage

The Lumina-DiMOO pipeline supports three core tasks: text-to-image generation (T2I), image-to-image editing (I2I), and multimodal understanding (MMU).
For detailed implementation examples and creative applications, please visit the [GitHub repository](https://github.com/Alpha-VLLM/Lumina-DiMOO).
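
All three tasks go through the same pipeline object and are selected via the `task` argument. The sketch below assumes `pipe` and `input_image` have already been created as in the full examples that follow; only the `task` value and task-specific inputs change between calls.

```python
# One pipeline, three tasks: only `task` (and task-specific kwargs such as `image`) changes.
t2i = pipe(prompt="a cozy cabin at dusk", task="text_to_image")
i2i = pipe(prompt="turn the cabin into a lighthouse", image=input_image, task="image_to_image")
mmu = pipe(prompt="Please describe the image.", image=input_image, task="multimodal_understanding")
```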


#### Text-to-Image
**prompt** | **image**
:-------------------------:|:-------------------------:
| "A striking photograph of a glass of orange juice on a wooden kitchen table, capturing a playful moment. The orange juice splashes out of the glass and forms the word \"Smile\" in a whimsical, swirling script just above the glass. The background is softly blurred, revealing a cozy, homely kitchen with warm lighting and a sense of comfort." | <img src="https://github-production-user-asset-6210df.s3.amazonaws.com/73575386/500095460-5490df4b-5ed9-4db7-bbca-3768a32ac840.png?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20251011%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20251011T015852Z&X-Amz-Expires=300&X-Amz-Signature=c9dc21e9414ed9069d35c72e21c4226aaea86142ab18dc9571aefb9bbdce9642&X-Amz-SignedHeaders=host">

```python
import torch

from diffusers import VQModel, DiffusionPipeline
from transformers import AutoTokenizer

# Load the VQ-VAE image tokenizer and the text tokenizer that the pipeline expects as components.
vqvae = VQModel.from_pretrained("Alpha-VLLM/Lumina-DiMOO", subfolder="vqvae").to(device='cuda', dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("Alpha-VLLM/Lumina-DiMOO", trust_remote_code=True)

pipe = DiffusionPipeline.from_pretrained(
    "Alpha-VLLM/Lumina-DiMOO",
    vqvae=vqvae,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    custom_pipeline="lumina_dimoo",
)
pipe.to("cuda")

prompt = '''A striking photograph of a glass of orange juice on a wooden kitchen table, capturing a playful moment. The orange juice splashes out of the glass and forms the word \"Smile\" in a whimsical, swirling script just above the glass. The background is softly blurred, revealing a cozy, homely kitchen with warm lighting and a sense of comfort.'''

img = pipe(
    prompt=prompt,
    task="text_to_image",
    height=768,
    width=1536,
    num_inference_steps=64,
    cfg_scale=4.0,
    use_cache=True,      # enable the caching mechanism for faster sampling
    cache_ratio=0.9,
    warmup_ratio=0.3,
    refresh_interval=5,
).images[0]

img.save("t2i_test_output.png")
```

#### Image-to-Image
**prompt** | **image_before** | **image_after**
:-------------------------:|:-------------------------:|:-------------------------:
| "A functional wooden printer stand.Nestled next to a brick wall in a bustling city street, it stands firm as pedestrians hustle by, illuminated by the warm glow of vintage street lamps." | <img src="https://github-production-user-asset-6210df.s3.amazonaws.com/73575386/500095462-7b451c2f-ec15-4cb9-8a6a-4eab9d2f5760.jpg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20251011%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20251011T015605Z&X-Amz-Expires=300&X-Amz-Signature=a0e4be32af521b555747b794e660eb0ab116a7ced1cded55448f324fca9fa901&X-Amz-SignedHeaders=host"> | <img src="https://github-production-user-asset-6210df.s3.amazonaws.com/73575386/500095459-1d6631ef-8286-4402-a74b-85fb1ea684c3.png?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20251011%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20251011T015729Z&X-Amz-Expires=300&X-Amz-Signature=4fb8f2e5eb9645b5e8efb1022e01c329f4a54e793a9e3c4e36fdd357e076964a&X-Amz-SignedHeaders=host"> |

```python
import torch

from diffusers import VQModel, DiffusionPipeline
from transformers import AutoTokenizer
from diffusers.utils import load_image

vqvae = VQModel.from_pretrained("Alpha-VLLM/Lumina-DiMOO", subfolder="vqvae").to(device='cuda', dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("Alpha-VLLM/Lumina-DiMOO", trust_remote_code=True)

pipe = DiffusionPipeline.from_pretrained(
    "Alpha-VLLM/Lumina-DiMOO",
    vqvae=vqvae,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    custom_pipeline="lumina_dimoo",
)
pipe.to("cuda")

input_image = load_image(
    "https://raw.githubusercontent.com/Alpha-VLLM/Lumina-DiMOO/main/examples/example_2.jpg"
).convert("RGB")

prompt = "A functional wooden printer stand.Nestled next to a brick wall in a bustling city street, it stands firm as pedestrians hustle by, illuminated by the warm glow of vintage street lamps."

img = pipe(
    prompt=prompt,
    image=input_image,
    edit_type="depth_control",  # this example uses depth-controlled editing
    num_inference_steps=64,
    temperature=1.0,
    cfg_scale=2.5,
    cfg_img=4.0,
    task="image_to_image",
).images[0]

img.save("i2i_test_output.png")
```


#### Multimodal Understanding
**question** | **image** | **answer**
:-------------------------:|:-------------------------:|:-------------------------:
| "Please describe the image." | <img src="https://github-production-user-asset-6210df.s3.amazonaws.com/73575386/500095461-9ae63fc0-b992-4652-9af5-b87be647048f.png?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20251011%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20251011T015701Z&X-Amz-Expires=300&X-Amz-Signature=acd3b53d43c81324b2135fde46a0ad25d04dfb71f987456a0b02d1b9af2f2e2e&X-Amz-SignedHeaders=host"> | "The image shows a vibrant orange sports car parked in a showroom. The car has a sleek, aerodynamic design with a prominent front grille and side vents. The body is adorned with black and orange racing stripes, creating a striking contrast against the orange paint. The car is equipped with black alloy wheels and a low-profile body style. The background features a white wall with a large emblem that reads "BREITZEN" and includes a silhouette of a horse and text. The floor is tiled with dark tiles, and the showroom is well-lit, highlighting the car. The overall setting suggests a high-end, possibly luxury, automotive environment."|


```python
import torch

from diffusers import VQModel, DiffusionPipeline
from transformers import AutoTokenizer
from diffusers.utils import load_image

vqvae = VQModel.from_pretrained("Alpha-VLLM/Lumina-DiMOO", subfolder="vqvae").to(device='cuda', dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("Alpha-VLLM/Lumina-DiMOO", trust_remote_code=True)

pipe = DiffusionPipeline.from_pretrained(
    "Alpha-VLLM/Lumina-DiMOO",
    vqvae=vqvae,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    custom_pipeline="lumina_dimoo",
)
pipe.to("cuda")

question = "Please describe the image."

input_image = load_image(
    "https://raw.githubusercontent.com/Alpha-VLLM/Lumina-DiMOO/main/examples/example_8.png"
).convert("RGB")

out = pipe(
    prompt=question,
    image=input_image,
    task="multimodal_understanding",
    num_inference_steps=128,
    gen_length=128,
    block_length=32,
    temperature=0.0,
    cfg_scale=0.0,
)

# The MMU task returns generated text; fall back to the raw output if a `.text` field is absent.
text = str(getattr(out, "text", out))
with open("mmu_answer.txt", "w", encoding="utf-8") as f:
    f.write(text.strip() + "\n")
```



