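Computer science
Language model
Feature (linguistics)
Artificial intelligence
Block (permutation group theory)
Representation (politics)
Encoding (set theory)
Noise (video)
Natural language processing
Pattern recognition (psychology)
Speech recognition
Image (mathematics)
Programming language
Philosophy
Law
Set (abstract data type)
Geometry
Politics
Linguistics
Mathematics
Political science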
Authors
Shancheng Fang,Hongtao Xie,Yuxin Wang,Zhendong Mao,Yongdong Zhang
Identifier
DOI:10.1109/cvpr46437.2021.00702
Abstract
Linguistic knowledge is of great benefit to scene text recognition. However, how to effectively model linguistic rules in end-to-end deep networks remains a research challenge. In this paper, we argue that the limited capacity of language models comes from: 1) implicit language modeling; 2) unidirectional feature representation; and 3) language models with noisy input. Correspondingly, we propose ABINet, an autonomous, bidirectional and iterative network for scene text recognition. First, the autonomous principle blocks gradient flow between the vision and language models to enforce explicit language modeling. Second, a novel bidirectional cloze network (BCN) is proposed as the language model, based on bidirectional feature representation. Third, we propose iterative correction as the execution manner for the language model, which effectively alleviates the impact of noisy input. Additionally, based on an ensemble of iterative predictions, we propose a self-training method that can learn effectively from unlabeled images. Extensive experiments indicate that ABINet is superior on low-quality images and achieves state-of-the-art results on several mainstream benchmarks. In addition, ABINet trained with ensemble self-training shows promising improvement toward human-level recognition. Code is available at https://github.com/FangShancheng/ABINet.
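The iterative correction loop described in the abstract can be illustrated with a toy sketch. This is not the ABINet implementation: the lexicon-based `cloze_predict` is a stand-in for the bidirectional cloze network, the probability product in `fuse` is a swapped-in simplification of the model's learned fusion, and every name here (`LEXICON`, `make_dist`, `iterative_correction`) is hypothetical. Gradient blocking is a training-time detail and is not represented; the sketch only shows that the language step sees the current text prediction, masks one position at a time, and uses context on both sides to refine it.

```python
# Toy illustration only; NOT the ABINet code. All names are hypothetical.
from typing import Dict, List

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
LEXICON = ["house", "horse", "mouse"]  # assumed tiny vocabulary
Dist = Dict[str, float]


def normalize(scores: Dist) -> Dist:
    total = sum(scores.values())
    return {c: v / total for c, v in scores.items()}


def make_dist(peaks: Dist) -> Dist:
    """Build a per-character distribution; leftover mass is spread uniformly."""
    rest = (1.0 - sum(peaks.values())) / (len(ALPHABET) - len(peaks))
    return {c: peaks.get(c, rest) for c in ALPHABET}


def cloze_predict(text: str, pos: int) -> Dist:
    """Stand-in for a bidirectional cloze model: predict the character at
    `pos` from the characters on BOTH sides, with `pos` itself masked."""
    scores = {c: 1e-3 for c in ALPHABET}  # smoothing so no character is impossible
    for word in LEXICON:
        if len(word) != len(text):
            continue
        matches = sum(
            1 for i, (a, b) in enumerate(zip(word, text)) if i != pos and a == b
        )
        scores[word[pos]] += 2.0 ** matches
    return normalize(scores)


def fuse(vision: Dist, language: Dist) -> Dist:
    """Simplified fusion (assumption): renormalized product of the two distributions."""
    return normalize({c: vision[c] * language[c] for c in ALPHABET})


def decode(dists: List[Dist]) -> str:
    return "".join(max(d, key=d.get) for d in dists)


def iterative_correction(vision_dists: List[Dist], n_iters: int = 3) -> List[str]:
    """Feed the current prediction back into the language step n_iters times."""
    text = decode(vision_dists)
    history = [text]
    for _ in range(n_iters):
        lang = [cloze_predict(text, i) for i in range(len(text))]
        fused = [fuse(v, l) for v, l in zip(vision_dists, lang)]
        text = decode(fused)
        history.append(text)
    return history


# Noisy "vision" output for the word "house": position 1 slightly prefers 'c'.
vision = [
    make_dist({"h": 0.9}),
    make_dist({"c": 0.40, "o": 0.35}),
    make_dist({"u": 0.9}),
    make_dist({"s": 0.9}),
    make_dist({"e": 0.9}),
]
print(iterative_correction(vision, n_iters=2))
```

In this toy run the vision-only reading is "hcuse", and the first correction pass already yields "house", which the second pass leaves unchanged.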