Authors
Zhenguo Yang, Jiale Xiang, Jiuxiang You, Qing Li, Wenyin Liu
Abstract
Visual question answering (VQA) is a challenging task that requires reasoning over questions about images with knowledge. A prerequisite for VQA is the availability of annotated datasets, but existing datasets have several limitations. 1) The diversity of questions and answers is limited to a few question categories and certain concepts (e.g., objects, relations, actions) with somewhat mechanical answers. 2) The availability of background knowledge or context information has been disregarded, with only images, questions, and answers being provided. 3) The timeliness of knowledge has not been examined, even though some works introduce factual or commonsense knowledge bases, e.g., ConceptNet and DBpedia. In this paper, we present an Event-oriented Visual Question Answering (E-VQA) dataset containing free-form questions and answers about real-world event concepts, which provides context information of events as domain knowledge in addition to images. E-VQA consists of 2,690 social media images, 9,088 questions, 5,479 answers, and 1,157 news media articles as references, annotated with 182 real-world events covering a wide range of topics, such as armed conflicts and attacks, disasters and accidents, and law and crime. For comparison, we investigate 10 state-of-the-art VQA methods as benchmarks.