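Annotation
Computer science
Artificial intelligence
Natural language processing
Symbol (formal)
Convolutional neural network
Deep learning
Process (computing)
Word (group theory)
Information retrieval
Speech recognition
Linguistics
Operating system
Philosophy
Programming language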
Authors
Hiqmat Nisa, Vic Ciesielski, James A. Thom, Ruwan Tennakoon
Source
Journal: Australasian Document Computing Symposium
Date: 2021-12-09
Volume/Issue: not listed; pages 1-7
Citations: 2
Identifier
DOI:10.1145/3503516.3503532
Abstract
Annotating handwritten documents for training deep learning models is a major issue in handwritten text recognition. It requires manual effort to annotate each word in a document to specify the ground truth. Often documents contain struck-out text which needs to be ignored by the recognition process. In preparing training data, struck-out text needs to be represented in a way that can help deep learning models to learn to deal appropriately with the strike-outs. The question is how to do this. In this paper, we have investigated two approaches for struck-out text annotation: (1) provide no annotation, thus reducing the annotation burden, and (2) mark the struck-out text with a special symbol, we have used the symbol #. We have trained two models on a synthetically generated dataset using a convolutional neural network and LSTM. We obtained 8.8% and 9.0% character error rates for models one and two respectively. There was no statistically significant difference in the performance of the two models. This indicates that a model trained with minimal annotations can perform as well as a model trained with extra annotations for struck-out text.
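The two annotation approaches and the character error rate named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The function names (`ground_truth`, `edit_distance`, `cer`) and the sample words are hypothetical, and two points are assumptions, since the abstract does not spell them out: that "no annotation" means struck-out words are simply left out of the transcription, and that `#` replaces each struck-out word as a whole. The character error rate is computed in the usual way, as Levenshtein distance divided by reference length.

```python
def ground_truth(words, scheme):
    """Build a line-level label from (text, is_struck_out) pairs.

    scheme "none": struck-out words are left out of the label (assumed
                   reading of approach 1 in the abstract).
    scheme "hash": each struck-out word is replaced by "#" (assumed
                   reading of approach 2; the granularity is a guess).
    """
    if scheme not in ("none", "hash"):
        raise ValueError("scheme must be 'none' or 'hash'")
    tokens = []
    for text, struck in words:
        if not struck:
            tokens.append(text)
        elif scheme == "hash":
            tokens.append("#")
    return " ".join(tokens)


def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical sample line with one struck-out word.
words = [("the", False), ("quick", True), ("brown", False), ("fox", False)]
label_none = ground_truth(words, "none")
label_hash = ground_truth(words, "hash")
```

Under these assumptions the same handwritten line yields two different training labels, and a model's output for either scheme can be scored against its own label with `cer`, which is how per-model figures such as the 8.8% and 9.0% reported above would be compared.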