Fix group offloading for quanto-quantized models and the use_stream path for quantized tensor subclasses by Sunt-ing · Pull Request #14038 · huggingface/diffusers

Sunt-ing · 2026-06-22T14:21:21Z

What does this PR do?

Group offloading moves a group's parameters between CPU and the accelerator by reassigning param.data:

param.data = source_tensor.to(device)

This is correct for plain tensors but wrong for tensor subclasses (quantized weights), whose real payload lives in internal sub-tensors (quanto WeightQBytesTensor: _data/_scale; torchao AffineQuantizedTensor: qdata/scale/...). Reassigning .data only swaps the outer wrapper and leaves the inner tensors on the source device, so the next matmul fails with mat2 is on cpu, different from cuda:0.
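The failure mode can be sketched without torch. The toy model below is an illustration only: `ToyTensor`, `ToyQuantizedWeight`, and `inner_devices` are hypothetical names, and the classes merely track a device string. The inner attribute names `_data` / `_scale` are borrowed from the quanto description above, and `__tensor_flatten__` mimics the shape of the real subclass protocol (a list of inner attribute names plus metadata).

```python
from dataclasses import dataclass


@dataclass
class ToyTensor:
    """Stand-in for a plain tensor: only tracks which device it lives on."""

    device: str = "cpu"

    def to(self, device: str) -> "ToyTensor":
        return ToyTensor(device=device)


class ToyQuantizedWeight:
    """Stand-in for a quantized tensor subclass with inner payload tensors."""

    def __init__(self) -> None:
        self.data = ToyTensor("cpu")    # outer wrapper storage
        self._data = ToyTensor("cpu")   # quantized values (name borrowed from quanto)
        self._scale = ToyTensor("cpu")  # scales (name borrowed from quanto)

    def __tensor_flatten__(self):
        # Same shape as the real protocol: (inner attribute names, metadata).
        return ["_data", "_scale"], None


def inner_devices(w):
    """Report the device of every inner tensor named by __tensor_flatten__."""
    names, _ = w.__tensor_flatten__()
    return {name: getattr(w, name).device for name in names}


w = ToyQuantizedWeight()
w.data = w.data.to("cuda")  # wrapper-only move, as in `param.data = ...`
print(w.data.device, inner_devices(w))
# prints: cuda {'_data': 'cpu', '_scale': 'cpu'}
```

The wrapper reports the target device while both inner payloads stay on the source device, which is the situation that produces the device mismatch in a real matmul.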

#13276 fixed this for torchao by swapping the whole subclass via torch.utils.swap_tensors and restoring inner attributes one by one. Two gaps remained:

- quanto was never handled (#12610, "Quanto + Group Offload causes device mismatch error (weights on cpu, mat1 on gpu)"). Any quanto-quantized model with enable_group_offload hits the wrapper-only .data = path and crashes with a device mismatch on the first forward pass, for both leaf_level and block_level.
- The streamed path was still broken for both subclasses (#13281, "Group offloading with use_stream=True breaks torchao quantized models (device mismatch) in qwen image"). When use_stream=True, _to_cpu / _pinned_memory_tensors call pin_memory() / is_pinned(), which neither subclass supports: quanto silently loses the subclass identity, and torchao raises NotImplementedError: ... aten.is_pinned. So torchao + use_stream=True crashes even though its non-stream path was already fixed.

Changes (`src/diffusers/hooks/group_offloading.py`)

- Add _is_quanto_tensor plus quanto helpers, and handle quanto next to the existing torchao branch in _transfer_tensor_to_device (onload), _offload_to_memory (restore / offload), and the record_stream path. Inner tensor names come from the standard subclass protocol __tensor_flatten__(); quanto onload uses torch.utils.swap_tensors instead of .data =.
- In _to_cpu and _pinned_memory_tensors, skip pin_memory() / is_pinned() for quanto and torchao subclasses.
- Plain tensors and the torchao non-stream path are untouched (zero behavior change).
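As a rough illustration of the two changes above, here is a torch-free sketch. Everything in it is a hypothetical stand-in (`ToyTensor`, `ToyQuantizedWeight`, `is_quantized_subclass`, `move_inner_tensors`, `to_cpu`), not the code in group_offloading.py. It moves inner tensors by the names that `__tensor_flatten__()` reports and skips pinning for the subclass. Note that it moves attributes in place for simplicity, whereas the PR states that quanto onload goes through torch.utils.swap_tensors.

```python
class ToyTensor:
    """Stand-in for a plain tensor: tracks device and pinned state only."""

    def __init__(self, device="cpu", pinned=False):
        self.device = device
        self.pinned = pinned

    def to(self, device):
        return ToyTensor(device=device)

    def pin_memory(self):
        return ToyTensor(device=self.device, pinned=True)


class ToyQuantizedWeight:
    """Stand-in for a quantized subclass that does not support pinning."""

    def __init__(self):
        self._data = ToyTensor()
        self._scale = ToyTensor()

    def __tensor_flatten__(self):
        # Same shape as the real protocol: (inner attribute names, metadata).
        return ["_data", "_scale"], None

    def pin_memory(self):
        raise NotImplementedError("toy: pinning unsupported on the subclass")


def is_quantized_subclass(t):
    # Hypothetical check; the PR adds _is_quanto_tensor for the real quanto case.
    return isinstance(t, ToyQuantizedWeight)


def move_inner_tensors(t, device):
    """Move every inner tensor named by __tensor_flatten__ to `device`."""
    names, _ = t.__tensor_flatten__()
    for name in names:
        setattr(t, name, getattr(t, name).to(device))
    return t


def to_cpu(t, pin):
    """Offload to CPU, pinning only tensors that support it."""
    if is_quantized_subclass(t):
        return move_inner_tensors(t, "cpu")  # pinning is skipped entirely
    t = t.to("cpu")
    return t.pin_memory() if pin else t
```

With these definitions, `to_cpu(ToyQuantizedWeight(), pin=True)` returns without ever calling `pin_memory()` on the subclass, while a plain `ToyTensor` is still pinned as before.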

Tests

Added test_group_offloading to the quanto and torchao quantization suites. Each loads a quantized tiny Flux transformer, offloads it across leaf_level / block_level and non-stream / use_stream, and asserts the output matches the non-offloaded quantized baseline.

- tests/quantization/quanto/test_quanto.py (int8 and float8): both fail on main with the device mismatch and pass here.
- tests/quantization/torchao/test_torchao.py::TorchAoTest::test_group_offloading: the use_stream=True cases fail on main with the aten.is_pinned error and pass here.

Reproduction and before/after

Environment: NVIDIA RTX 4090, torch==2.8.0+cu128, diffusers @ 2d0110f, optimum-quanto==0.2.7, torchao==0.17.0.

Minimal standalone repro for #12610 (quanto):

import torch
from diffusers import UNet2DConditionModel
from diffusers.hooks import apply_group_offloading
from optimum.quanto import freeze, qint8, quantize

# Load a tiny UNet and quantize its weights to int8 with quanto.
m = UNet2DConditionModel.from_pretrained(
    "hf-internal-testing/tiny-stable-diffusion-pipe", subfolder="unet"
).to(torch.float32).eval()
quantize(m, weights=qint8)
freeze(m)

# Offload at leaf level: weights live on CPU and are onloaded to CUDA on demand.
apply_group_offloading(
    m,
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",
)

# Dummy inputs: latents, timesteps, and encoder hidden states.
x = torch.randn(2, m.config.in_channels, m.config.sample_size, m.config.sample_size, device="cuda")
t = torch.tensor([10, 10], device="cuda")
e = torch.randn(2, 4, m.config.cross_attention_dim, device="cuda")
with torch.no_grad():
    m(x, t, e)  # main: RuntimeError: mat2 is on cpu, different from cuda:0

Running the new tests (RUN_NIGHTLY=1 RUN_SLOW=1):

# on main (fix reverted, tests kept)
quanto  FluxTransformerInt8WeightsTest::test_group_offloading    FAILED  (mat2 is on cpu, different from cuda:0)
quanto  FluxTransformerFloat8WeightsTest::test_group_offloading  FAILED  (mat2 is on cpu, different from cuda:0)
torchao TorchAoTest::test_group_offloading                       FAILED  (NotImplementedError: ... aten.is_pinned)

# with this PR
quanto  FluxTransformerInt8WeightsTest::test_group_offloading    PASSED
quanto  FluxTransformerFloat8WeightsTest::test_group_offloading  PASSED
torchao TorchAoTest::test_group_offloading                       PASSED

Across leaf_level / block_level × non-stream / use_stream / record_stream, the offloaded output is bit-identical (max abs diff = 0.0) to the fully-on-accelerator quantized baseline. A non-quantized group-offload equivalence sweep stays at 0.0 (plain-tensor path unchanged).

Relationship to other work

- The streamed-pin half of #13281 ("Group offloading with use_stream=True breaks torchao quantized models (device mismatch) in qwen image") was also pursued upstream in torchao (pytorch/ao#4158, "Feature Request: Support Async-Stream Transfer for [AffineQuantizedTensor] (Fix diffusers group_offload device mismatch)", and pytorch/ao#4192, "support pinning for mx and nvfp4 tensors"), but that only added pinning for mx / nvfp4 tensors. The Int8WeightOnlyConfig AffineQuantizedTensor still raises the aten.is_pinned error on torchao==0.17.0, so the streamed path is still broken for the common int8 case. Skipping pinning on the diffusers side fixes it regardless of the torchao version, and is also required for quanto, whose subclass tensors do not implement torch pinning at all.
- #13875 ("Add TorchAO disk group offload support") and #13721 ("torchao: safetensors save/load + disk group offload (closes #13713)") (open) refactor the same torchao offload helpers (_to_cpu, _pinned_memory_tensors, _swap_torchao_tensor) to add disk offload. They are orthogonal in intent (disk offload there versus the in-memory device-mismatch and stream-pin crashes here) but touch the same region, so this PR will need a rebase around whichever lands first.

Who can review?

cc @sayakpaul

Before submitting

Did you use an AI agent (Claude Code, Codex, Cursor, etc.) to help with this PR? If so:
- Did you read the Coding with AI agents guide?
- Did you self-review the diff against .ai/review-rules.md?
Did you read the contributor guideline?
Did you write any new necessary tests?

