Authors
Fatemeh Nazarieh, Zhenhua Feng, Muhammad Awais, Wenwu Wang, Josef Kittler
Source
Journal: IEEE Transactions on Circuits and Systems for Video Technology
[Institute of Electrical and Electronics Engineers]
Date: 2024-01-09
Volume/Issue: 34 (8): 6814-6832
Identifier
DOI: 10.1109/tcsvt.2024.3351601
Abstract
Cross-modal content generation has become very popular in recent years, and a variety of methods have been proposed to generate high-quality, realistic content. Among these approaches, visual content generation has attracted significant attention from academia and industry due to its vast potential in various applications. This survey provides an overview of recent advances in visual content generation conditioned on other modalities, such as text, audio, speech, and music, with a focus on their key contributions to the community. In addition, we summarize the existing publicly available datasets that can be used for training and benchmarking cross-modal visual content generation models. We provide an in-depth exploration of the datasets used for audio-to-visual content generation, filling a gap in the existing literature. Various evaluation metrics are also introduced along with the datasets. Furthermore, we discuss the challenges and limitations encountered in the area, such as modality alignment and semantic coherence. Lastly, we outline possible future directions for synthesizing visual content from other modalities, including the exploration of new modalities and the development of multi-task multi-modal networks. This survey serves as a resource for researchers interested in quickly gaining insights into this burgeoning field.