Summary
Add configurable cache support for phyai's DiT / diffusion transformer inference path. The goal is to reuse selected intermediate results across denoising steps, reduce repeated computation, and improve end-to-end generation latency.
DiT inference runs the same stack of transformer blocks (attention and MLP layers) at every denoising step. Hidden states and residual updates often change little between adjacent steps, so much of this computation is redundant. Existing acceleration methods exploit this through block-level caching, residual caching, first-block caching, timestep-level skipping, or Taylor-style approximation, trading a small, controllable quality change for lower latency.
This issue should focus on inference-time, training-free cache support. Cache should be disabled by default and must not change existing generation behavior unless explicitly enabled. Once enabled, users should be able to control the cache strategy, target range, active steps, thresholds, and quality/speed trade-off through explicit configuration.
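As a rough illustration of what "explicit configuration" could look like, here is a minimal sketch of a config object. Every name in it (DiTCacheConfig, its fields, the strategy strings, the default values) is hypothetical and chosen only for this example; none of it is an existing phyai API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical strategy names, used only for this sketch.
_STRATEGIES = ("first_block", "block")


@dataclass(frozen=True)
class DiTCacheConfig:
    """Hypothetical cache configuration; not an existing phyai class."""

    enabled: bool = False  # cache stays off unless explicitly requested
    strategy: str = "first_block"
    # Half-open [start, end) range of block indices eligible for reuse; None means all.
    block_range: Optional[Tuple[int, int]] = None
    warmup_steps: int = 4  # always fully compute the first N denoising steps
    max_consecutive_skips: int = 2  # bound on back-to-back reused steps
    threshold: float = 0.08  # larger values reuse more often (faster, lower fidelity)

    def __post_init__(self) -> None:
        if self.strategy not in _STRATEGIES:
            raise ValueError(f"unknown cache strategy: {self.strategy!r}")
        if self.threshold < 0:
            raise ValueError("threshold must be non-negative")
        if self.warmup_steps < 0 or self.max_consecutive_skips < 0:
            raise ValueError("step counts must be non-negative")
        if self.block_range is not None:
            start, end = self.block_range
            if not 0 <= start < end:
                raise ValueError("block_range must satisfy 0 <= start < end")

    def is_active(self, step: int) -> bool:
        """Return True when cache reuse may be attempted at this step index."""
        return self.enabled and step >= self.warmup_steps
```

Validating eagerly in the constructor means a bad strategy name or range fails before any denoising step runs, and the default-constructed object leaves the cache off, matching the requirement that behavior does not change unless the user opts in.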
Motivation
- DiT-based models are expensive to run, especially for image and video generation where denoising requires many steps and deep transformer stacks.
- Users may accept a small quality shift in exchange for lower latency or higher throughput.
- Ecosystem projects such as diffusers, Cache-DiT, and SGLang already expose related acceleration mechanisms that can inform phyai's implementation and API design.
- Adding DiT cache support gives phyai a unified entry point for future inference optimizations across DiT-based pipelines such as FLUX, Wan, HunyuanVideo, Qwen-Image, and similar models.
Goals
- Provide a unified DiT cache configuration entry point, for example enable_dit_cache(...) / disable_dit_cache(), or a dit_cache field in the pipeline / model config.
- Support at least one basic cache strategy for the MVP:
  - first-block / residual cache;
  - or block-level cache;
  - or an adapter around existing Cache-DiT / diffusers cache APIs.
- Keep cache disabled by default. Enabling cache should not change the original pipeline's prompt arguments, seed handling, scheduler behavior, dtype, or device placement semantics.
- Support a per-run cache lifecycle to prevent cache pollution across prompts, batches, shapes, or devices.
- Expose useful debug and metric information:
  - skipped block / step count;
  - latency;
  - peak memory;
  - quality regression comparison.
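To make the strategy, lifecycle, and metrics goals above more concrete, the following is a minimal sketch of a first-block cache with a per-run scope and simple counters. It is an assumption-laden toy, not a proposed implementation: FirstBlockCache, CacheStats, and dit_cache_scope are hypothetical names, hidden states are plain lists of floats instead of tensors, the reuse test compares first-block outputs with a relative L1 difference (real implementations may compare residuals or use another metric), and latency and peak-memory tracking are left out.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional, Sequence

Vector = List[float]
Block = Callable[[Vector], Vector]


@dataclass
class CacheStats:
    computed_steps: int = 0  # steps where every block ran
    skipped_steps: int = 0   # steps where the cached residual was reused


class FirstBlockCache:
    """Toy first-block cache (hypothetical; plain floats stand in for tensors)."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.stats = CacheStats()
        self._ref_first: Optional[Vector] = None  # first-block output at last full step
        self._residual: Optional[Vector] = None   # (final output - first-block output)

    def reset(self) -> None:
        # Drop cached state but keep the counters so they can be read after the run.
        self._ref_first = None
        self._residual = None

    def _can_reuse(self, first_out: Vector) -> bool:
        ref = self._ref_first
        if ref is None or self._residual is None or len(ref) != len(first_out):
            return False
        diff = sum(abs(c - p) for c, p in zip(first_out, ref))
        scale = sum(abs(p) for p in ref) or 1e-12
        return diff / scale < self.threshold

    def run_step(self, hidden: Vector, first_block: Block,
                 remaining_blocks: Sequence[Block]) -> Vector:
        # The first block always runs; its output decides whether the rest can be skipped.
        first_out = first_block(hidden)
        if self._can_reuse(first_out):
            self.stats.skipped_steps += 1
            return [f + r for f, r in zip(first_out, self._residual)]
        out = first_out
        for block in remaining_blocks:
            out = block(out)
        # Refresh the reference and residual only on fully computed steps.
        self._ref_first = first_out
        self._residual = [o - f for o, f in zip(out, first_out)]
        self.stats.computed_steps += 1
        return out


@contextmanager
def dit_cache_scope(threshold: float) -> Iterator[FirstBlockCache]:
    """Create a fresh cache for one generation run and clear it on exit."""
    cache = FirstBlockCache(threshold)
    try:
        yield cache
    finally:
        cache.reset()
```

A caller would wrap a single generation in `with dit_cache_scope(threshold) as cache:` and call `cache.run_step(...)` once per denoising step. Because each run gets a fresh object that is cleared on exit, cached state cannot carry over to the next prompt, batch, shape, or device, while `cache.stats` remains readable afterwards for debug output.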