How does the proposed CAT method compare to fixed tokenization approaches in terms of end-to-end inference time and token count scaling across different domain shifts in image complexity?
Description
Transformer-based video diffusion models rely on 3D attention over spatial and temporal tokens, which incurs quadratic time and memory complexity and makes end-to-end training for ultra-high-resolution videos prohibitively expensive. To overcome this bottleneck, we propose a pure image adaptation framework that upgrades a video Diffusion Transformer pre-trained at its native scale to synthesize higher-resolution videos. Unfortunately, naively fine-tuning with high-resolution images alone often introduces noticeable noise due to the image-video modality gap. To address this, we decouple the lea
Research goal: How does the proposed CAT method compare to fixed tokenization approaches in terms of end-to-end inference time and token count scaling across different domain shifts in image complexity?
Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.5/10.
Notes
Files
| Name | Size |
|---|---|
| paper.pdf (md5:00c3f55f9c524c9abd2fdbaffdf69268) | 83.8 kB |