feat: add JoyImage edit plus #14032

Open
tangyanf wants to merge 1 commit into huggingface:main from tangyanf:add-joyimage-edit-plus

@tangyanf

Description

We are the JoyAI Team, and this is the Diffusers implementation for the JoyAI-Image-Edit-Plus model.

GitHub Repository: [https://github.com/jd-opensource/JoyAI-Image]
Hugging Face Model: [https://huggingface.co/jdopensource/JoyAI-Image-Edit-Plus-Diffusers]
Original open-source weights: [https://huggingface.co/jdopensource/JoyAI-Image-Edit-Plus]

Model Overview

JoyAI-Image-Edit-Plus extends JoyAI-Image-Edit with multi-image editing capabilities. While JoyAI-Image-Edit operates on a single reference image, Edit-Plus accepts multiple reference
images as input and performs instruction-guided editing across them — enabling tasks such as subject composition, style transfer from multiple sources, and multi-view consistent editing.

It combines an 8B Multimodal Large Language Model (MLLM) with a 16B Multimodal Diffusion Transformer (MMDiT), supporting variable-resolution reference images that are independently
encoded and jointly denoised.

Key Features

  • Multi-Image Input: Accepts multiple reference images with different resolutions, enabling complex editing scenarios that require information from multiple visual sources.
  • Subject Composition: Combine elements from separate images into a coherent output guided by text instructions (e.g., "Let the person lovingly play with the dog" given separate person
    and dog images).
  • Cross-Image Style Transfer: Apply style or attributes from one reference image to subjects in another.
  • Variable Resolution Support: Each reference image is independently resized and encoded at its optimal resolution, preserving fine-grained details regardless of input size.
  • Instruction-Guided Generation: Natural language prompts control how multiple reference images are composed and edited in the final output.

github-actions bot added the models, pipelines, and size/L (PR with diff > 200 LOC) labels on Jun 22, 2026
@github-actions

Hi @tangyanf, thanks for the PR! It does not appear to link an issue it fixes. If this PR addresses an existing issue, please add a closing keyword (e.g. Fixes #1234) to the PR description so the issue is linked. See the contribution guide for more details. If this PR intentionally does not fix a tracked issue, a maintainer can add the no-issue-needed label to silence this reminder.

yiyixuxu added the no-issue-needed (for PRs that do not require link to an issue) label on Jun 22, 2026

sergereview bot left a comment

🤗 Serge says:

This PR adds the JoyImage Edit Plus model and pipeline. There are several blocking issues that need to be addressed before merging.

Blocking — Debug artifacts left in production code

Multiple torch.save() calls, a print() statement, and a commented-out exit(0) are left in pipeline_joyimage_edit_plus.py. These will write files to the user's working directory and print to stdout during every inference call.

Blocking — einops dependency

Per .ai/models.md: "No new mandatory dependency without discussion (e.g. einops). Optional deps guarded with is_X_available() and a dummy in utils/dummy_*.py." The pipeline imports rearrange directly (from einops import rearrange), which is the only non-comment usage of einops in src/diffusers/. The rearrange calls should be rewritten with native PyTorch (reshape, permute, unflatten).

Blocking — sglang integration code in model forward

The transformer's forward method contains sglang-specific code: list-unwrapping for "SglangXvideo CFG branches" (lines 272-276) and a try: from sglang... fallback (lines 279-287). Per .ai/AGENTS.md: "No defensive code, unused code paths, or legacy stubs — do not add fallback paths, safety checks, or configuration options 'just in case'." This code doesn't belong in the diffusers model — the pipeline always passes the required arguments.

Blocking — Missing dummy objects

JoyImageEditPlusTransformer3DModel, JoyImageEditPlusPipeline, and JoyImageEditPlusPipelineOutput are not registered in dummy_pt_objects.py / dummy_torch_and_transformers_objects.py. This will cause ImportError when torch/transformers are not installed.
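For context, a dummy object is a placeholder class that can be imported without its backends and only raises once it is used. The following is a simplified, self-contained sketch of that pattern. The `requires_backends` helper below is a stand-in written for this illustration rather than the library's own implementation, and the backend name is deliberately made up so that the error path triggers.

```python
import importlib.util


def requires_backends(obj, backends):
    """Raise ImportError if any of the named backend packages is not installed."""
    name = obj.__name__ if hasattr(obj, "__name__") else obj.__class__.__name__
    missing = [b for b in backends if importlib.util.find_spec(b) is None]
    if missing:
        raise ImportError(f"{name} requires the following backends: {', '.join(missing)}")


class JoyImageEditPlusPipeline:
    """Placeholder exposed when the real pipeline's backends are unavailable."""

    # Hypothetical backend name, chosen so the check fails in this sketch.
    _backends = ["hypothetical_missing_backend"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, self._backends)

    @classmethod
    def from_pretrained(cls, *args, **kwargs):
        requires_backends(cls, cls._backends)


try:
    JoyImageEditPlusPipeline.from_pretrained("some/checkpoint")
except ImportError as err:
    message = str(err)
```

The import of the placeholder itself succeeds; the error names the class and the missing backend only when the user calls into it.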

Blocking — Missing tests

No test files were added for the new model or pipeline.

Blocking — Hardcoded device_type="cuda" in torch.autocast

torch.autocast(device_type="cuda", ...) is hardcoded in two places in the pipeline. This will fail on MPS, XPU, and other non-CUDA devices.

Non-blocking — Inlined scheduler sigma math

Per .ai/pipelines.md gotcha #3, the pipeline manually computes shifted sigmas and temporarily overrides self.scheduler.shift — this is exactly what FlowMatchEulerDiscreteScheduler does with its shift config. The scheduler should own this logic.

Non-blocking — Unused imports and parameters

  • import inspect in transformer_joyimage_edit_plus.py is unused.
  • enable_denormalization parameter is declared in prepare_latents and __call__ but never read.
  • retrieve_timesteps is duplicated from the existing pipeline without a # Copied from annotation.

serge v0.1.0 · model: claude-opus-4-6 · 29 LLM turns · 50 tool calls · 190.2s · 1602502 in / 7369 out tokens

max_sequence_length=max_sequence_length,
)

torch.save(prompt_embeds, "prompt_embeds.pt")

Debug artifact. torch.save(prompt_embeds, "prompt_embeds.pt") will write a file to the user's working directory on every inference call. Remove this and the other torch.save calls (lines 550, 582, 583).

torch.save(prompt_embeds, "prompt_embeds.pt")
# Encode negative prompt for CFG
if self.do_classifier_free_guidance:
    print(f"negative_prompt: {negative_prompt}")

Debug artifact. print(f"negative_prompt: {negative_prompt}") — remove this debug print statement.
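If the value is still worth inspecting during development, a debug-level logger call is one alternative to a bare print, since it stays silent at the default log level. A minimal sketch using the standard library's logging module; the logger name and the helper `log_negative_prompt` are hypothetical and exist only for this illustration.

```python
import logging

# Hypothetical logger name, for illustration only.
logger = logging.getLogger("joyimage_edit_plus_example")


def log_negative_prompt(negative_prompt):
    """Record the negative prompt at DEBUG level instead of printing it."""
    logger.debug("negative_prompt: %s", negative_prompt)
    return negative_prompt


# Nothing is emitted unless the caller opts in to DEBUG output.
logger.setLevel(logging.WARNING)
log_negative_prompt("blurry, low quality")
```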

)
torch.save(padded_latents, "padded_latents.pt")
torch.save(target_mask, "target_mask.pt")
# exit(0)

Debug artifact. Remove the commented-out # exit(0).


import numpy as np
import torch
from einops import rearrange

Forbidden dependency. Per .ai/models.md: "No new mandatory dependency without discussion (e.g. einops)." This is the only real einops import in src/diffusers/. Rewrite the two rearrange calls (lines 339, 662-665) with native PyTorch.

For example, line 339:

# einops: rearrange(item, "c (t pt) (h ph) (w pw) -> (t h w) c pt ph pw", pt=pt, ph=ph, pw=pw)
# Below, t, h, w are the full sizes of item's last three axes, i.e. item.shape == (c, t, h, w).
patches = item.unflatten(1, (t//pt, pt)).unflatten(3, (h//ph, ph)).unflatten(5, (w//pw, pw))
patches = patches.permute(1, 3, 5, 0, 2, 4, 6).reshape(-1, c, pt, ph, pw)
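As a sanity check on this kind of rewrite, the sketch below compares the unflatten/permute/reshape chain against a slow loop that slices out every patch explicitly. Here `T`, `H`, `W` are the full input sizes (what the suggestion writes as `t`, `h`, `w`). The helper names and the small test shape are assumptions made for illustration and are not taken from the PR.

```python
import torch


def patchify(item: torch.Tensor, pt: int, ph: int, pw: int) -> torch.Tensor:
    """Native equivalent of 'c (t pt) (h ph) (w pw) -> (t h w) c pt ph pw'."""
    c, T, H, W = item.shape
    # Split each axis into (num_patches, patch_size): c, t, pt, h, ph, w, pw
    patches = item.unflatten(1, (T // pt, pt)).unflatten(3, (H // ph, ph)).unflatten(5, (W // pw, pw))
    # Move the patch indices (t, h, w) to the front, then merge them into one axis
    return patches.permute(1, 3, 5, 0, 2, 4, 6).reshape(-1, c, pt, ph, pw)


def patchify_reference(item: torch.Tensor, pt: int, ph: int, pw: int) -> torch.Tensor:
    """Slow reference: slice every patch explicitly in (t, h, w) order."""
    _, T, H, W = item.shape
    out = []
    for t0 in range(0, T, pt):
        for h0 in range(0, H, ph):
            for w0 in range(0, W, pw):
                out.append(item[:, t0:t0 + pt, h0:h0 + ph, w0:w0 + pw])
    return torch.stack(out)


# Illustrative shape: 3 channels, 2 frames, 4x6 spatial grid, 1x2x2 patches.
item = torch.arange(3 * 2 * 4 * 6, dtype=torch.float32).reshape(3, 2, 4, 6)
fast = patchify(item, pt=1, ph=2, pw=2)
slow = patchify_reference(item, pt=1, ph=2, pw=2)
```

Both paths should yield 12 patches of shape (3, 1, 2, 2) with identical contents, which can be confirmed with `torch.equal(fast, slow)`.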

batch_size, max_num_patches, channels, pt, ph, pw = hidden_states.shape
device = hidden_states.device

# Unwrap list inputs (SglangXvideo passes these as lists from CFG branches)

Defensive / framework-specific code. Per .ai/AGENTS.md: "No defensive code, unused code paths, or legacy stubs." These list-unwrapping guards for "SglangXvideo" don't belong in the diffusers model. The pipeline always passes tensors. Remove lines 272-276.

Suggested change
# Unwrap list inputs (SglangXvideo passes these as lists from CFG branches)

ref_tensor = torch.from_numpy(np.array(ref_img_pil.convert("RGB"))).to(device=device, dtype=dtype)
ref_tensor = (ref_tensor / 127.5 - 1.0).permute(2, 0, 1).unsqueeze(1).unsqueeze(0)

with torch.autocast(device_type="cuda", dtype=torch.float32):

Hardcoded CUDA device type. torch.autocast(device_type="cuda", ...) will fail on non-CUDA devices (MPS, XPU, etc.). Use the device from the tensor:

Suggested change
- with torch.autocast(device_type="cuda", dtype=torch.float32):
+ with torch.autocast(device_type=device.type, dtype=torch.float32):

Same issue on line 670.
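A small sketch of deriving the autocast device type from a tensor rather than hardcoding it. The helper name `autocast_like` is hypothetical, and bfloat16 is used here only so that the effect is observable on a CPU tensor; the PR itself passes dtype=torch.float32.

```python
import torch


def autocast_like(tensor: torch.Tensor, dtype: torch.dtype) -> torch.autocast:
    """Build an autocast context for whichever device `tensor` lives on."""
    return torch.autocast(device_type=tensor.device.type, dtype=dtype)


# A CPU tensor in this sketch; the helper reads the device type from the tensor,
# so nothing here is specific to CUDA.
x = torch.randn(4, 4)
with autocast_like(x, torch.bfloat16):
    y = x @ x  # matmul is on the CPU autocast list
```

The same helper can be passed a tensor on another device, provided the installed PyTorch build supports autocast for that device type.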

device: torch.device,
generator: Optional[Union[torch.Generator, List[torch.Generator]]],
reference_images: Optional[List[List[Image.Image]]] = None,
enable_denormalization: bool = True,

Unused parameter. enable_denormalization is declared but never read inside prepare_latents. Either use it or remove it from both prepare_latents and __call__.


# Prepare timesteps — compute sigmas with single shift to match original scheduler
if timesteps is None and sigmas is None:
    shift = getattr(self.scheduler.config, "shift", 1.0)

Inlined scheduler math. Per .ai/pipelines.md gotcha #3, this manually computes shifted sigmas and temporarily overrides self.scheduler.shift — this is exactly what FlowMatchEulerDiscreteScheduler does with its shift config. Consider letting the scheduler own this:

self.scheduler.set_timesteps(num_inference_steps, device=device)
timesteps = self.scheduler.timesteps
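For reference, the static shift that FlowMatchEulerDiscreteScheduler applies when dynamic shifting is disabled maps each sigma to shift * sigma / (1 + (shift - 1) * sigma). Below is a plain-Python sketch of that mapping. The function name `shifted_sigmas` and the linear base schedule are assumptions for illustration, and the sketch does not claim to reproduce the PR's inlined computation line for line.

```python
def shifted_sigmas(num_steps: int, shift: float) -> list[float]:
    """Apply the flow-matching time shift to a linear sigma schedule."""
    # Assumed base schedule for illustration: 1.0 down to 1/num_steps, evenly spaced.
    sigmas = [1.0 - i / num_steps for i in range(num_steps)]
    return [shift * s / (1.0 + (shift - 1.0) * s) for s in sigmas]


unshifted = shifted_sigmas(4, 1.0)  # shift = 1 leaves the schedule unchanged
shifted = shifted_sigmas(4, 3.0)    # shift > 1 moves every sigma below 1 upward
```

With the shift stored in the scheduler config, the pipeline would only need the set_timesteps call shown above and would not have to touch self.scheduler.shift.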

Output class for JoyImage Edit Plus multi-image editing pipelines.
"""

images: Union[List[PIL.Image.Image], np.ndarray]
\ No newline at end of file

Missing newline at end of file.

Suggested change
images: Union[List[PIL.Image.Image], np.ndarray]
images: Union[List[PIL.Image.Image], np.ndarray]

self._pad_sequence(negative_prompt_embeds_mask, max_seq_len),
self._pad_sequence(prompt_embeds_mask, max_seq_len),
])
torch.save(prompt_embeds, 'prompt_embeds_2.pt')

Debug artifact. Remove torch.save(prompt_embeds, 'prompt_embeds_2.pt').
