TA2V: Text-Audio Guided Video Generation

Computer science, Multimedia, Audio signal processing, Speech recognition, Audio signal, Speech coding
Authors
Minglu Zhao,Wenmin Wang,Tongbao Chen,Rui Zhang,Ruochen Li
Source
Journal: IEEE Transactions on Multimedia [Institute of Electrical and Electronics Engineers]
Volume: 26, pages 7250-7264
Identifier
DOI:10.1109/tmm.2024.3362149
Abstract

Recent conditional and unconditional video generation tasks have been accomplished mainly based on generative adversarial network (GAN), diffusion, and autoregressive models. However, in some circumstances, using only one modality cannot provide enough semantic information. Therefore, in this paper, we propose text-audio to video (TA2V) generation, a new task for generating realistic videos from two different guided modalities, text and audio, which has not been explored much thus far. Compared to image generation, video generation is a harder task because of the complexity of processing higher-dimensional data and scarcer suitable datasets, especially for multimodal video generation. To overcome these limitations, (i) we propose the Text&Audio-guided-Video-Maker (TAgVM) model, which consists of two modules: a text-guided video generator and a text&audio-guided video modifier. (ii) This model uses a 3D VQ-GAN to compress high-dimension video data to a low-dimension discrete sequence, followed by an autoregressive model to guide text-conditional generation in the latent space. Then, we apply a text&audio-guided diffusion model to the generated video scenes, providing additional semantic details corresponding to the audio and text. (iii) We introduce a newly produced music performance video dataset, the University of Rochester Multimodal Music Performance with Video-Audio-Text (URMP-VAT), and a landscape dataset, Landscape with Video-Audio-Text (Landscape-VAT), both of which include three modalities (text, audio, and video) that are aligned with each other. The results demonstrate that our model can create videos with satisfactory quality and semantic information. The source code and datasets are available at https://github.com/Minglu58/TA2V.
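
The abstract mentions compressing video into a "discrete sequence" with a 3D VQ-GAN before autoregressive modelling. As a rough illustration of only the vector-quantization idea behind that step, the sketch below maps continuous latent vectors to the indices of their nearest codebook entries. It is not the TAgVM implementation: the function names, the toy 2-D codebook, and the latent values are all hypothetical, and a real VQ-GAN learns its encoder, decoder, and codebook jointly.

```python
# Hypothetical illustration of nearest-neighbour vector quantization.
# This is NOT code from the TA2V repository; names and values are made up.
from typing import List, Sequence


def quantize(latents: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry."""
    tokens = []
    for z in latents:
        # Squared Euclidean distance from z to every codebook vector.
        dists = [sum((a - b) ** 2 for a, b in zip(z, c)) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def dequantize(tokens: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Replace each token index with its codebook vector."""
    return [list(codebook[t]) for t in tokens]


# Toy codebook with three 2-D entries (assumed values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
# Toy "encoder outputs" standing in for flattened video latents.
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7], [0.6, 0.6]]

tokens = quantize(latents, codebook)
print(tokens)
print(dequantize(tokens, codebook))
```

The resulting list of integer tokens is the kind of discrete sequence an autoregressive model can be trained on; conditioning that model on text, as the abstract describes, is outside the scope of this sketch.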