Computer science
Pattern
Modality (human–computer interaction)
Optics (focusing)
Benchmark (surveying)
Event (particle physics)
Information retrieval
Bridging (networking)
Representation (politics)
Artificial intelligence
Natural language processing
Information extraction
Multimedia
Computer network
Social science
Geodesy
Sociology
Politics
Law
Political science
Optics
Geography
Physics
Quantum mechanics
Authors
Jian Liu, Yufeng Chen, Jinan Xu
Identifier
DOI:10.1145/3503161.3548132
Abstract
Extracting events from news has many benefits for downstream applications. Today's event extraction (EE) systems, however, usually focus on a single modality --- either text or images --- and such methods suffer from incomplete information because a news document is typically presented in a multimedia format. In this paper, we propose a new method for multimedia EE that bridges the textual and visual modalities with a unified contrastive learning framework. Our central idea is to create a shared space for texts and images so that semantically related inputs receive similar representations. This is accomplished by training on generic text-image pairs, and we demonstrate that this framework can boost learning for one modality by exploiting the complementary information of the other. On the benchmark dataset, our approach establishes a new state-of-the-art performance, with a 3% improvement in F1. Furthermore, we demonstrate that it achieves cutting-edge performance for visual EE even in a zero-shot scenario with no annotated data in the visual modality.
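The shared text-image space described above is typically learned with a symmetric contrastive (InfoNCE-style) objective: each text is pulled toward its paired image and pushed away from the other images in the batch, and vice versa. A minimal NumPy sketch of such an objective is below; the function name, temperature value, and embedding shapes are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def info_nce_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of text-image pairs.

    Row i of text_emb is assumed to describe row i of image_emb; every other
    combination in the batch serves as a negative pair.
    """
    # L2-normalize so that dot products become cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature  # (batch, batch) similarity matrix
    n = logits.shape[0]

    def cross_entropy(l):
        # Row-wise log-softmax; the correct "class" for row i is column i.
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_prob[np.arange(n), np.arange(n)].mean()

    # Average the text-to-image and image-to-text directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimizing this loss drives matched text and image embeddings toward the same region of the shared space, which is what lets supervision in one modality transfer to the other, including in the zero-shot visual setting.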