[ICLR 2026 ORAL] Let Features Decide Their Own Solvers: Hybrid Feature Caching for Diffusion Transformers

1Shanghai Jiao Tong University 2South China University of Technology 3Tsinghua University

Introduction

Diffusion Transformers (DiTs) offer state-of-the-art fidelity in image and video synthesis, but their iterative sampling process remains a major bottleneck due to the high cost of transformer forward passes at each timestep. To mitigate this, feature caching has emerged as a training-free acceleration technique that reuses or forecasts hidden representations. However, existing methods typically apply a single, uniform caching strategy to all feature dimensions, ignoring their heterogeneous dynamics. We therefore adopt a new perspective, modeling hidden-feature evolution as a mixture of ODEs across dimensions, and introduce HyCa, a hybrid, ODE-solver-inspired caching framework that applies dimension-wise caching strategies. HyCa achieves near-lossless acceleration across diverse domains and models without any retraining: 5.56× speedup on FLUX and HunyuanVideo, and 6.24× speedup on Qwen-Image and Qwen-Image-Edit.
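The core idea, treating each hidden dimension's trajectory as its own ODE and forecasting it with a per-dimension update rule, can be sketched as follows. This is a minimal illustration, not the paper's exact formulation; the solver names and forms (zeroth-, first-, and second-order extrapolation) are assumptions chosen for clarity:

```python
import numpy as np

# Candidate "solvers" for forecasting a cached feature one step ahead.
# h_hist holds the last few actually-computed values of one dimension.
def reuse(h_hist):
    # zeroth-order: reuse the most recent value unchanged
    return h_hist[-1]

def euler(h_hist):
    # first-order: linear extrapolation from the last finite difference
    return h_hist[-1] + (h_hist[-1] - h_hist[-2])

def second_order(h_hist):
    # second-order: also extrapolate the change in the difference
    d1 = h_hist[-1] - h_hist[-2]
    d2 = d1 - (h_hist[-2] - h_hist[-3])
    return h_hist[-1] + d1 + 0.5 * d2

SOLVERS = [reuse, euler, second_order]

def predict(h_hist, solver_id):
    """Forecast each feature dimension with its assigned solver.

    h_hist: (T, D) array of past features; solver_id: (D,) int array."""
    out = np.empty(h_hist.shape[1])
    for d in range(h_hist.shape[1]):
        out[d] = SOLVERS[solver_id[d]](h_hist[:, d])
    return out

# toy example: dim 0 is constant, dim 1 grows linearly
h_hist = np.array([[1.0, 0.0],
                   [1.0, 1.0],
                   [1.0, 2.0]])
solver_id = np.array([0, 1])  # reuse for dim 0, Euler step for dim 1
print(predict(h_hist, solver_id))  # -> [1. 3.]
```

On a timestep where the transformer pass is skipped, each dimension is forecast by its own rule instead of all dimensions sharing one; which rule suits which dimension is what the offline stage decides.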

HyCa Poster
Figure 1: Images generated on Qwen-Image with HyCa at 6.24× acceleration.
Feature Trajectory Clusters
Figure 2: Feature trajectory clusters and stability of assignments. (a-b) Cluster 1 shows oscillatory trajectories while Cluster 2 shows smooth ones. (c-d) ARI distributions on HunyuanVideo and Qwen-Image exceed 0.8 in most cases (a level indicating strong agreement), confirming stable and consistent cluster assignments across prompts and timesteps.
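The Adjusted Rand Index (ARI) used in Figure 2 is the standard measure of agreement between two clusterings, invariant to how the cluster labels are numbered. A minimal example with scikit-learn (illustrative only; the labels below are made up, not from the paper's experiments):

```python
from sklearn.metrics import adjusted_rand_score

# Two cluster assignments of the same six feature dimensions,
# e.g. obtained from two different prompts.
labels_prompt_a = [0, 0, 1, 1, 2, 2]
labels_prompt_b = [1, 1, 0, 0, 2, 2]  # same partition, labels permuted

# ARI is 1.0 for identical partitions (up to label permutation),
# near 0 for random assignments, and can go negative for worse-than-random.
print(adjusted_rand_score(labels_prompt_a, labels_prompt_b))  # -> 1.0
```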
HyCa Framework
Figure 4: HyCa Framework. (a) Offline Preprocessing: feature dimensions are first analyzed and clustered using temporal indicators (e.g., differences, curvature). For each cluster, candidate solvers generate predicted features, which are then compared against the actually computed features; the solver with the minimum error is assigned to that cluster. (b) Inference: once assigned, each cluster consistently reuses its solver, enabling efficient prediction that skips redundant computation while maintaining accuracy.
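The offline stage described in Figure 4(a) can be sketched roughly as below. This is a simplified illustration under assumptions: the indicators (mean absolute first and second differences), the two candidate solvers, and the one-step error criterion stand in for the paper's actual choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def assign_solvers(traj, n_clusters=2):
    """Illustrative offline step: cluster feature dimensions by simple
    temporal indicators, then give each cluster the candidate solver
    with the smallest one-step forecast error.

    traj: (T, D) array of computed features over T timesteps (T >= 4)."""
    # indicators per dimension: mean |first difference| and
    # mean |second difference| (a crude curvature proxy)
    d1 = np.diff(traj, axis=0)
    d2 = np.diff(traj, n=2, axis=0)
    feats = np.stack([np.abs(d1).mean(0), np.abs(d2).mean(0)], axis=1)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(feats)

    solvers = {
        "reuse": lambda h: h[-1],                       # zeroth-order
        "euler": lambda h: h[-1] + (h[-1] - h[-2]),     # first-order
    }
    assignment = {}
    for c in range(n_clusters):
        dims = labels == c
        errs = {}
        for name, f in solvers.items():
            # predict step t+1 from steps <= t, compare to the real feature
            pred = np.stack([f(traj[:t + 1, dims])
                             for t in range(2, traj.shape[0] - 1)])
            errs[name] = np.abs(pred - traj[3:, dims]).mean()
        assignment[c] = min(errs, key=errs.get)  # min-error solver wins
    return labels, assignment

# toy trajectory: dims 0-1 constant, dims 2-3 drifting linearly
t = np.linspace(0, 1, 8)[:, None]
traj = np.hstack([np.ones((8, 2)), 5 * np.tile(t, (1, 2))])
labels, assignment = assign_solvers(traj)
```

In this toy run the constant dimensions end up in a cluster assigned "reuse" and the drifting ones in a cluster assigned "euler"; at inference each cluster then keeps its solver fixed, as in Figure 4(b).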
Clustering Results
Figure 6: Clustering Results. Top row: clustering results from FLUX.1-dev; bottom row: clustering results from HunyuanVideo. The clustering assignments remain highly consistent across prompts, resolutions, and timesteps, suggesting a stable and robust geometric structure in the feature space.

Samples

Explore our generated samples across different domains below.

Image Cases

FLUX · FLUX Schnell

Image Editing Cases


Video Cases


Contact Us

Feel free to contact us with any questions, or for collaboration and discussion.


BibTeX

@misc{zheng2025letfeaturesdecidesolvers,
      title={Let Features Decide Their Own Solvers: Hybrid Feature Caching for Diffusion Transformers}, 
      author={Shikang Zheng and Guantao Chen and Qinming Zhou and Yuqi Lin and Lixuan He and Chang Zou and Peiliang Cai and Jiacheng Liu and Linfeng Zhang},
      year={2025},
      eprint={2510.04188},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.04188}, 
}