Computer science
Transformer
Encoder
Convolutional neural network
Artificial intelligence
Pattern recognition (psychology)
End-to-end principle
Speech recognition
Voltage
Electrical engineering
Engineering
Operating systems
Authors
Zhang Zhang, Yibo Zhang
Identifier
DOI:10.1007/978-3-031-21648-0_13
Abstract
The attention-based encoder-decoder (AED) model is increasingly used in handwritten mathematical expression recognition (HMER) tasks. Given the recent success of the Transformer in computer vision and the variety of attempts to combine the Transformer with convolutional neural networks (CNNs), in this paper we study three ways of leveraging Transformer and CNN designs to improve AED-based HMER models: 1) the Tandem way, which feeds CNN-extracted features to a Transformer encoder to capture global dependencies; 2) the Parallel way, which adds a Transformer encoder branch that takes raw image patches as input and concatenates its output with the CNN's as the final feature; 3) the Mixing way, which replaces the convolution layers of the CNN's last stage with multi-head self-attention (MHSA). We compare these three methods on the CROHME benchmark. On CROHME 2016 and 2019, the Tandem way attains ExpRates of 54.85% and 58.56%, respectively; the Parallel way attains 55.63% and 57.39%; and the Mixing way achieves 53.93% and 55.64%. These results indicate that the Parallel and Tandem ways outperform the Mixing way, while differing little from each other.
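To make the three designs concrete, below is a minimal PyTorch-style sketch of each encoder variant. All module names, dimensions, and the single-channel (grayscale) input are illustrative assumptions: the abstract does not specify the backbone CNN, the Transformer configuration, or the exact fusion details (e.g., whether the Parallel way concatenates along the token or the channel dimension), and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

def to_tokens(feat):
    """Flatten a (B, C, H, W) feature map into a (B, H*W, C) token sequence."""
    return feat.flatten(2).transpose(1, 2)

class TandemEncoder(nn.Module):
    """Tandem way: CNN features are serialized and passed through a Transformer encoder."""
    def __init__(self, cnn, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        self.cnn = cnn  # assumed to map (B, 1, H, W) -> (B, d_model, H', W')
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x):
        # Self-attention captures global dependencies over the CNN feature map.
        return self.encoder(to_tokens(self.cnn(x)))

class ParallelEncoder(nn.Module):
    """Parallel way: a patch-based Transformer branch runs beside the CNN;
    the two outputs are concatenated as the final feature."""
    def __init__(self, cnn, d_model=256, patch=16, nhead=8, num_layers=4):
        super().__init__()
        self.cnn = cnn
        self.patch_embed = nn.Conv2d(1, d_model, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x):
        cnn_tokens = to_tokens(self.cnn(x))
        vit_tokens = self.encoder(to_tokens(self.patch_embed(x)))
        # Concatenating along the token dimension is one plausible reading of
        # the abstract; channel-wise fusion would be another.
        return torch.cat([cnn_tokens, vit_tokens], dim=1)

class MixingBlock(nn.Module):
    """Mixing way: MHSA stands in for the convolutions of the CNN's last stage."""
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, feat):                  # feat: (B, C, H, W) from earlier CNN stages
        tokens = to_tokens(feat)
        normed = self.norm(tokens)
        out, _ = self.attn(normed, normed, normed)
        return tokens + out                   # residual connection, as in a Transformer block
```

In all three variants the encoder output is a sequence of visual tokens, which an attention-based decoder can then attend to while emitting the recognized expression.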