Computer science
Artificial intelligence
Computer vision
Video tracking
Key (lock)
Projectile
Inference
Video compression picture types
Generator (circuit theory)
Video processing
Computer security
Quantum mechanics
Physics
Power (physics)
Organic chemistry
Chemistry
Authors
Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shi, Xiaohu Qie, Mike Zheng Shou
Source
Journal: Cornell University - arXiv
Date: 2022-12-22
Identifier
DOI:10.48550/arxiv.2212.11565
Abstract
To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, such a paradigm is computationally expensive. In this work, we propose a new T2V generation setting, One-Shot Video Tuning, where only one text-video pair is presented. Our model is built on state-of-the-art T2I diffusion models pre-trained on massive image data. We make two key observations: 1) T2I models can generate still images that represent verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we introduce Tune-A-Video, which involves a tailored spatio-temporal attention mechanism and an efficient one-shot tuning strategy. At inference, we employ DDIM inversion to provide structure guidance for sampling. Extensive qualitative and numerical experiments demonstrate the remarkable ability of our method across various applications.
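The DDIM inversion mentioned in the abstract can be illustrated with a toy sketch: the deterministic DDIM update is run "backwards" (clean to noisy) to obtain a latent whose forward sampling trajectory reconstructs the input, which is what supplies structure guidance. Everything below (the schedule, the constant stand-in noise predictor, the variable names) is an illustrative assumption, not the paper's actual model.

```python
import numpy as np

def ddim_step(x, eps, a_from, a_to):
    """One deterministic DDIM update from alpha_bar `a_from` to `a_to`."""
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps

rng = np.random.default_rng(0)
alpha_bars = np.linspace(0.9999, 0.1, 10)  # toy noise schedule, clean -> noisy
fixed_eps = rng.standard_normal(4)         # x-independent stand-in for a UNet

def eps_model(x, t):
    # A trained noise predictor depends on x and t, which makes inversion only
    # approximate; this constant stand-in keeps the round trip exact.
    return fixed_eps

x = rng.standard_normal(4)  # "clean" toy latent

# Inversion: walk the schedule from clean to noisy.
lat = x.copy()
for i in range(len(alpha_bars) - 1):
    lat = ddim_step(lat, eps_model(lat, i), alpha_bars[i], alpha_bars[i + 1])

# Sampling: walk back from the inverted latent, recovering x.
rec = lat.copy()
for i in reversed(range(len(alpha_bars) - 1)):
    rec = ddim_step(rec, eps_model(rec, i), alpha_bars[i + 1], alpha_bars[i])

print(np.allclose(rec, x, atol=1e-8))  # True: the round trip is (near-)exact here
```

With a real UNet the noise prediction changes between the inversion and sampling passes, so the reconstruction is approximate rather than exact; the inverted latent nonetheless preserves the source video's structure while the edited prompt steers appearance.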