Authors
Yonghong Tan, Yingying Chen, Jinqiao Wang
Identifier
DOI: 10.1109/ieir59294.2023.10391251
Abstract
Traditional scene text recognition (STR) is usually treated as a purely visual, unimodal recognition task, and has made steady progress under the encoder-decoder framework. Introducing a language model (LM) that exploits semantic contextual relationships has significantly advanced the task from the language modality. However, in existing works the LM relies heavily on the output of the decoder in the vision model (VM), while the vision decoder itself lacks semantic and global context awareness. In this paper, we explore the capability of the vision decoder, which is generally overlooked in previous works. We propose a Visual-Semantic Refinement Network (VSRN) that provides contextual and semantic guidance to the decoder, fully supporting its recognition capability. Through the semantic refinement module, the recognition results of the LM can in turn be fed back to the VM, providing semantic information while further tightening the union of the two modalities. In the visual refinement module, we propose an adaptive mask strategy and exploit the global contextual relationships among visual features to further assist the VM. These two complementary cues jointly strengthen the VM and iteratively improve recognition performance. Experimental results on several scene text recognition benchmarks show that our proposed method is effective and achieves state-of-the-art performance.
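The iterative loop the abstract describes — a vision model proposes character distributions, a language model refines them semantically, and the refined predictions are fed back to the vision stream for another pass — can be sketched as below. This is a minimal illustrative sketch, not the authors' implementation: `vision_model`, `language_model`, `semantic_refine`, and the fusion weight `alpha` are all hypothetical stand-ins for the paper's actual modules.

```python
# Hypothetical sketch of iterative visual-semantic refinement for STR.
# All functions below are illustrative stand-ins, not the VSRN modules.

def vision_model(image_feats):
    # Stand-in VM decoder: one character distribution per position
    # (3-class toy alphabet).
    return [[0.6, 0.3, 0.1] for _ in image_feats]

def language_model(logits):
    # Stand-in LM: smooths each distribution toward uniform, mimicking
    # context-based semantic reweighting.
    return [[0.9 * p + 0.1 / len(row) for p in row] for row in logits]

def semantic_refine(vm_logits, lm_logits, alpha=0.5):
    # Semantic refinement: fuse LM predictions back into the VM stream,
    # as the abstract's feedback path from LM to VM.
    return [[alpha * v + (1 - alpha) * l for v, l in zip(vr, lr)]
            for vr, lr in zip(vm_logits, lm_logits)]

def recognize(image_feats, iterations=3):
    # Iteratively alternate VM output and LM refinement.
    logits = vision_model(image_feats)
    for _ in range(iterations):
        lm_out = language_model(logits)
        logits = semantic_refine(logits, lm_out)
    # Greedy decode: argmax character index per position.
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

print(recognize([None] * 4))  # one predicted index per text position
```

In the paper the loop is run for a fixed number of iterations, with the visual refinement module (adaptive masking over global visual context) providing the second, complementary cue alongside this semantic feedback.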