Computer science
Computer vision
Video tracking
Artificial intelligence
Motion compensation
Zoom
Motion (physics)
Object (grammar)
Flexibility (engineering)
Camera auto-calibration
Controllability
Movement (music)
Focus (optics)
Camera resectioning
Optics
Philosophy
Engineering
Physics
Lens (geology)
Aesthetics
Statistics
Petroleum engineering
Mathematics
Applied mathematics
Authors
Shiyuan Yang, Liang Hou, Haibin Huang, Chongyang Ma, Pengfei Wan, Di Zhang, Xiaodong Chen, Jing Liao
Identifier
DOI: 10.1145/3641519.3657481
Abstract
Recent text-to-video diffusion models have achieved impressive progress. In practice, users often desire the ability to control object motion and camera movement independently for customized video creation. However, current methods lack the focus on separately controlling object motion and camera movement in a decoupled manner, which limits the controllability and flexibility of text-to-video models. In this paper, we introduce Direct-a-Video, a system that allows users to independently specify motions for multiple objects as well as camera's pan and zoom movements, as if directing a video. We propose a simple yet effective strategy for the decoupled control of object motion and camera movement. Object motion is controlled through spatial cross-attention modulation using the model's inherent priors, requiring no additional optimization. For camera movement, we introduce new temporal cross-attention layers to interpret quantitative camera movement parameters. We further employ an augmentation-based approach to train these layers in a self-supervised manner on a small-scale dataset, eliminating the need for explicit motion annotation. Both components operate independently, allowing individual or combined control, and can generalize to open-domain scenarios. Extensive experiments demonstrate the superiority and effectiveness of our method. Project page and code are available at https://direct-a-video.github.io/.
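The abstract states that object motion is steered through spatial cross-attention modulation using the model's inherent priors, with no extra optimization. Below is a minimal illustrative sketch of that general idea, not the paper's exact implementation: attention between an object's text token and the latent positions inside its user-specified per-frame box is boosted, and suppressed elsewhere. The function name `modulate_cross_attention`, the additive `strength` value, and the toy trajectory are all assumptions made for illustration.

```python
# A minimal sketch (not the authors' exact method) of spatial cross-attention
# modulation for object motion control in a text-to-video diffusion model.
import torch

def modulate_cross_attention(attn, boxes, token_idx, h, w, strength=8.0):
    """
    attn:      (frames, h*w, num_tokens) raw cross-attention logits
    boxes:     list of (x0, y0, x1, y1) in [0, 1], one box per frame
    token_idx: index of the text token naming the object
    strength:  additive boost applied inside the box (hypothetical value)
    """
    frames = attn.shape[0]
    for f in range(frames):
        x0, y0, x1, y1 = boxes[f]
        mask = torch.zeros(h, w, dtype=torch.bool)
        mask[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = True
        mask = mask.flatten()                      # (h*w,)
        # Encourage the object token to attend inside its per-frame box...
        attn[f, mask, token_idx] += strength
        # ...and discourage it outside, so the object follows the trajectory.
        attn[f, ~mask, token_idx] -= strength
    return attn

# Toy usage: 16 frames, 32x32 latent grid, 77 text tokens, object token at
# index 5, with a box drifting left-to-right to describe the desired motion.
if __name__ == "__main__":
    F, H, W, T = 16, 32, 32, 77
    logits = torch.randn(F, H * W, T)
    traj = [(0.1 + 0.04 * f, 0.4, 0.3 + 0.04 * f, 0.7) for f in range(F)]
    out = modulate_cross_attention(logits, traj, token_idx=5, h=H, w=W)
    print(out.shape)  # torch.Size([16, 1024, 77])
```

The modulated logits would then pass through the usual softmax, so the adjustment only re-weights where each object token attends per frame; camera pan/zoom control, by contrast, is handled in the paper by separately trained temporal cross-attention layers.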