[phyai] DiT cache support #15

@chenghuaWang

Summary

Add configurable cache support for phyai's DiT / diffusion transformer inference path. The goal is to reuse selected intermediate results across denoising steps, reduce repeated computation, and improve end-to-end generation latency.

DiT inference runs the same stack of transformer blocks (attention and MLP layers) at every denoising step, and hidden states or residual updates often change little between adjacent steps. Existing acceleration methods exploit this redundancy through block-level cache, residual cache, first-block cache, timestep-level skipping, or Taylor-style approximation, trading a small, controllable quality change for lower latency.
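As a rough illustration of the first-block / residual cache idea mentioned above, the sketch below always runs the first block, compares its residual with the one recorded at the last fully computed step, and reuses the cached residual of the remaining blocks when the relative change is under a threshold. It uses plain Python lists instead of tensors so it stays self-contained. The names `FirstBlockCache`, `relative_l1`, and `threshold` are hypothetical and are not part of phyai or any existing library; this is one common variant of the technique, not a proposed implementation.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]
Block = Callable[[Vector, int], Vector]  # (hidden_states, step) -> hidden_states


def relative_l1(current: Sequence[float], previous: Sequence[float]) -> float:
    """Sum of absolute differences, normalized by the magnitude of `previous`."""
    diff = sum(abs(c - p) for c, p in zip(current, previous))
    norm = sum(abs(p) for p in previous)
    return diff / norm if norm > 0 else float("inf")


class FirstBlockCache:
    """Hypothetical first-block cache; a threshold of 0.0 never skips."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.ref_first_residual: Optional[Vector] = None
        self.cached_tail_residual: Optional[Vector] = None
        self.skipped_steps = 0

    def forward(self, blocks: Sequence[Block], x: Vector, step: int) -> Vector:
        # The first block always runs; its residual is the cheap change probe.
        h = blocks[0](x, step)
        first_residual = [a - b for a, b in zip(h, x)]

        can_reuse = (
            self.ref_first_residual is not None
            and self.cached_tail_residual is not None
            and relative_l1(first_residual, self.ref_first_residual) < self.threshold
        )
        if can_reuse:
            # Skip blocks[1:] and apply the residual they produced last time.
            self.skipped_steps += 1
            return [a + r for a, r in zip(h, self.cached_tail_residual)]

        # Full computation: run the remaining blocks and refresh both caches.
        out = h
        for block in blocks[1:]:
            out = block(out, step)
        self.ref_first_residual = first_residual
        self.cached_tail_residual = [a - b for a, b in zip(out, h)]
        return out
```

In this variant the reference residual is refreshed only on fully computed steps, so the comparison is always made against the last step whose output was not approximated.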

This issue should focus on inference-time, training-free cache support. Cache should be disabled by default and must not change existing generation behavior unless explicitly enabled. Once enabled, users should be able to control the cache strategy, target range, active steps, thresholds, and quality/speed trade-off through explicit configuration.
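One way the explicit configuration described above could look is sketched below. Every name here (`DiTCacheConfig`, `strategy`, `block_range`, `active_steps`, `threshold`) is a hypothetical placeholder chosen for illustration, and the default threshold is an arbitrary value rather than a tuned recommendation. The only property taken directly from this issue is that the cache is disabled by default.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical strategy names for illustration only.
SUPPORTED_STRATEGIES = ("first_block", "block")


def _check_range(name: str, value: Optional[Tuple[int, int]]) -> None:
    """Validate a half-open [start, end) range; None means 'no restriction'."""
    if value is None:
        return
    start, end = value
    if start < 0 or end <= start:
        raise ValueError(f"{name} must satisfy 0 <= start < end, got {value}")


@dataclass(frozen=True)
class DiTCacheConfig:
    enabled: bool = False                           # cache is off unless requested
    strategy: str = "first_block"                   # which cache strategy to use
    block_range: Optional[Tuple[int, int]] = None   # target blocks, [start, end)
    active_steps: Optional[Tuple[int, int]] = None  # denoising steps, [start, end)
    threshold: float = 0.08                         # arbitrary placeholder default

    def __post_init__(self) -> None:
        if self.strategy not in SUPPORTED_STRATEGIES:
            raise ValueError(f"unknown strategy: {self.strategy!r}")
        if self.threshold < 0:
            raise ValueError("threshold must be non-negative")
        _check_range("block_range", self.block_range)
        _check_range("active_steps", self.active_steps)

    def is_step_active(self, step: int) -> bool:
        """True only when caching is enabled and `step` is inside active_steps."""
        if not self.enabled:
            return False
        if self.active_steps is None:
            return True
        start, end = self.active_steps
        return start <= step < end
```

With these assumptions, `DiTCacheConfig()` leaves generation untouched, while `DiTCacheConfig(enabled=True, active_steps=(2, 8))` would allow reuse only on steps 2 through 7.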

Motivation

  • DiT-based models are expensive to run, especially for image and video generation where denoising requires many steps and deep transformer stacks.
  • Users may accept a small quality shift in exchange for lower latency or higher throughput.
  • Ecosystem projects such as diffusers, Cache-DiT, and SGLang already expose related acceleration mechanisms that can inform phyai's implementation and API design.
  • Adding DiT cache support gives phyai a unified entry point for future inference optimizations across DiT-based pipelines such as FLUX, Wan, HunyuanVideo, Qwen-Image, and similar models.

Goals

  • Provide a unified DiT cache configuration entry point, for example:
    • enable_dit_cache(...)
    • disable_dit_cache()
    • or a dit_cache field in pipeline / model config.
  • Support at least one basic cache strategy for the MVP:
    • first-block / residual cache;
    • or block-level cache;
    • or an adapter around existing Cache-DiT / diffusers cache APIs.
  • Keep cache disabled by default. Enabling cache should not change the original pipeline's prompt arguments, seed handling, scheduler behavior, dtype, or device placement semantics.
  • Support a per-run cache lifecycle to prevent cache pollution across prompts, batches, shapes, or devices.
  • Expose useful debug and metric information:
    • skipped block / step count;
    • latency;
    • peak memory;
    • quality regression comparison.
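The per-run lifecycle and metric goals above could be combined in a small context manager, sketched here under stated assumptions. `DiTCacheRun`, `DiTCacheStats`, and `dit_cache_run` are hypothetical names, not existing phyai APIs. The sketch covers skip counts and wall-clock latency only; peak memory and quality comparison depend on the backend and evaluation setup and are left out.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator


@dataclass
class DiTCacheStats:
    total_steps: int = 0
    skipped_steps: int = 0
    skipped_blocks: int = 0
    latency_s: float = 0.0

    @property
    def skip_ratio(self) -> float:
        """Fraction of denoising steps served from cache."""
        return self.skipped_steps / self.total_steps if self.total_steps else 0.0


@dataclass
class DiTCacheRun:
    # Cached tensors live here; a new dict per run avoids cross-run pollution.
    buffers: Dict[str, Any] = field(default_factory=dict)
    stats: DiTCacheStats = field(default_factory=DiTCacheStats)


@contextmanager
def dit_cache_run() -> Iterator[DiTCacheRun]:
    """Create fresh cache state for one generation call and drop it afterwards."""
    run = DiTCacheRun()
    start = time.perf_counter()
    try:
        yield run
    finally:
        run.stats.latency_s = time.perf_counter() - start
        run.buffers.clear()  # nothing survives into the next prompt or batch
```

Because each `with dit_cache_run() as run:` block starts from empty buffers, a change of prompt, batch size, shape, or device between calls cannot pick up stale entries, and `run.stats` remains readable after the block exits.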

Metadata

Assignees: No one assigned
Labels: help wanted
Status: Todo
Milestone: No milestone