Computer science
Benchmark (surveying)
Video game
Multimedia
Coding (set theory)
Frame (networking)
Artificial intelligence
Information retrieval
Natural language processing
Human–computer interaction
Geodesy
Telecommunications
Set (abstract data type)
Programming language
Geography
Authors
Thomas Hayes, Songyang Zhang, Xi Yin, Guan Pang, Sasha Sheng, Harry Yang, Songwei Ge, Qiyuan Hu, Devi Parikh
Identifier
DOI: 10.1007/978-3-031-20074-8_25
Abstract
Multimodal video-audio-text understanding and generation can benefit from datasets that are narrow but rich. The narrowness allows bite-sized challenges that the research community can make progress on. The richness ensures we are making progress along the core challenges. To this end, we present a large-scale video-audio-text dataset MUGEN, collected using the open-sourced platform game CoinRun. We made substantial modifications to make the game richer by introducing audio and enabling new interactions. We trained RL agents with different objectives to navigate the game and interact with 13 objects and characters. This allows us to automatically extract a large collection of diverse videos and associated audio. We sample 375K video clips (3.2 s each) and collect text descriptions from human annotators. Each video has additional annotations that are extracted automatically from the game engine, such as accurate semantic maps for each frame and templated textual descriptions. Altogether, MUGEN can help progress research in many tasks in multimodal understanding and generation. We benchmark representative approaches on tasks involving video-audio-text retrieval and generation. Our dataset and code are released at: https://mugen-org.github.io/ .
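The abstract describes 3.2 s clips paired with audio, human-written text descriptions, and automatically extracted annotations. Below is a minimal Python sketch of how such a clip manifest might be consumed; the file name, field names (`video`, `audio`, `text_descriptions`), and JSON layout are illustrative assumptions, not the dataset's released format (see https://mugen-org.github.io/ for the official data and code).

```python
# Minimal sketch (not the official MUGEN API): iterate over a hypothetical
# JSON manifest listing video/audio paths and human text descriptions.
import json
from pathlib import Path


def load_clips(manifest_path: str):
    """Yield (video_path, audio_path, descriptions) triples from a JSON manifest.

    Assumes the manifest is a JSON array of records with "video", "audio",
    and "text_descriptions" fields; these names are placeholders.
    """
    records = json.loads(Path(manifest_path).read_text())
    for rec in records:
        yield rec["video"], rec["audio"], rec["text_descriptions"]


if __name__ == "__main__":
    # Hypothetical manifest file name used for illustration only.
    for video, audio, texts in load_clips("mugen_train_manifest.json"):
        print(video, audio, texts[:1])  # inspect the first description per clip
```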