Keywords
question answering, computer science, artificial intelligence, natural language processing, discriminative features, tokens, embeddings, image, word, representation, vocabulary, pattern recognition, information retrieval, test set
Authors
Himanshu Sharma, Swati Srivastava
Identifier
DOI: 10.1080/13682199.2022.2153489
Abstract
Scene Text Visual Question Answering (VQA) requires understanding both the visual content and the text in an image to answer an image-related question. Existing Scene Text VQA models predict an answer by choosing a word from a fixed vocabulary or from the extracted text tokens. In this paper, we strengthen the representational power of the text tokens by combining FastText embeddings with appearance, bounding-box, and PHOC features. Our model employs two-way co-attention, using self-attention and guided-attention mechanisms, to obtain discriminative image features. We compute each text token's position and combine this information with the predicted answer embedding for final answer generation. Our model achieves accuracies of 51.27% and 52.09% on the TextVQA validation and test sets, and ANLS scores of 0.698 and 0.686 on the ST-VQA validation and test sets.
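As a concrete illustration of the pipeline the abstract describes, below is a minimal PyTorch sketch of the token-feature fusion and the two-way co-attention (self-attention plus guided attention). The layer sizes (300-d FastText, 2048-d appearance, 4-d bounding box, 604-d PHOC), the additive fusion, and all module and variable names are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch: token-feature fusion + two-way co-attention for Scene Text VQA.
# Dimensions and fusion scheme are assumptions; only the overall structure
# (fused OCR-token features, self- and guided attention) follows the abstract.
import torch
import torch.nn as nn


class TokenFusion(nn.Module):
    """Project and fuse four OCR-token feature types into one vector:
    FastText (300-d), appearance (2048-d), bounding box (4-d), PHOC (604-d)."""

    def __init__(self, d_model=768):
        super().__init__()
        self.fasttext = nn.Linear(300, d_model)
        self.appearance = nn.Linear(2048, d_model)
        self.bbox = nn.Linear(4, d_model)
        self.phoc = nn.Linear(604, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, ft, app, box, phoc):
        # Additive fusion of projected features; concatenation or a learned
        # gate would be plausible alternatives.
        return self.norm(self.fasttext(ft) + self.appearance(app)
                         + self.bbox(box) + self.phoc(phoc))


class TwoWayCoAttention(nn.Module):
    """Self-attention within each modality, then guided (cross) attention in
    both directions: question -> image and image -> question."""

    def __init__(self, d_model=768, n_heads=8):
        super().__init__()
        self.self_q = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.guide_q2v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.guide_v2q = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, q, v):
        q, _ = self.self_q(q, q, q)         # intra-question attention
        v, _ = self.self_v(v, v, v)         # intra-image attention
        v_att, _ = self.guide_q2v(v, q, q)  # image features guided by question
        q_att, _ = self.guide_v2q(q, v, v)  # question features guided by image
        return q_att, v_att


if __name__ == "__main__":
    fuse = TokenFusion()
    coatt = TwoWayCoAttention()
    B, T = 2, 12                            # batch size, number of OCR tokens
    tokens = fuse(torch.randn(B, T, 300), torch.randn(B, T, 2048),
                  torch.randn(B, T, 4), torch.randn(B, T, 604))
    q_att, v_att = coatt(torch.randn(B, 20, 768), tokens)
    print(q_att.shape, v_att.shape)         # (2, 20, 768), (2, 12, 768)
```

In this sketch the fused OCR-token vectors play the role of the image-side sequence; in the full model they would be combined with region features and fed, together with the attended question features, into the answer-generation head.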