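Computer science
Dropout (neural networks)
Synonym (taxonomy)
Word (group theory)
Artificial intelligence
Sentiment analysis
Natural language processing
Machine learning
Focus (optics)
Task (project management)
Data quality
Genus
Biology
Economics
Optics
Management
Operations management
Physics
Botany
Philosophy
Linguistics
Metric (unit)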
Authors
Guoqing Chao,Jingyao Liu,Mingyu Wang,Dianhui Chu
Identifier
DOI:10.1016/j.knosys.2023.111038
Abstract
Data augmentation is a commonly used technique to avoid over-fitting in deep learning. However, the mechanism behind effective data augmentation methods is unclear. To address this issue, we explore and identify two critical factors, semantic preservation and diversity, to assess the quality of data augmentation in natural language processing. Our study focuses on text sentiment classification and examines these two factors on two commonly used data augmentation methods: synonym replacement and random deletion. Based on these findings, we propose two new augmentation methods: TF-IDF word dropout and adaptive synonym replacement. Experimental results demonstrate that these two new data augmentation methods are effective. Moreover, with further experiments, we summarize three strategies for improving data augmentation methods in the sentiment classification task: employing online augmentation, introducing word importance into the word sampling process, and filtering augmented data based on the current model state. We hope that our study will inspire new perspectives on the underlying principles of data augmentation’s effectiveness and contribute to a systematic study of data augmentation methods in the future.
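The abstract does not spell out how TF-IDF word dropout works, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It contrasts plain random deletion with a hypothetical variant in which a token's drop probability shrinks as its TF-IDF score grows, so that sentiment-bearing words are more likely to survive. The function names, the smoothed IDF formula, and the linear scaling of the drop probability are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def idf_table(corpus):
    """Smoothed inverse document frequency for each token in a tokenized corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(tok for doc in corpus for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def random_deletion(tokens, p, rng):
    """Baseline: drop each token independently with probability p."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    # Never return an empty sentence: fall back to one random token.
    return kept or [rng.choice(tokens)]


def tfidf_word_dropout(tokens, idf, p, rng):
    """Assumed variant: low-TF-IDF tokens are dropped more often.

    The linear scaling below is an illustrative choice, not taken
    from the paper.
    """
    if not tokens:
        return []
    tf = Counter(tokens)
    scores = [tf[t] / len(tokens) * idf.get(t, 1.0) for t in tokens]
    max_score = max(scores)
    kept = []
    for tok, score in zip(tokens, scores):
        # Drop probability falls to 0 for the highest-scoring token(s).
        drop_prob = p * (1.0 - score / max_score)
        if rng.random() >= drop_prob:
            kept.append(tok)
    return kept


corpus = [
    ["the", "movie", "was", "great"],
    ["the", "plot", "was", "dull"],
    ["the", "acting", "was", "superb"],
]
idf = idf_table(corpus)
rng = random.Random(0)
augmented = tfidf_word_dropout(corpus[0], idf, p=0.9, rng=rng)
```

Calling `tfidf_word_dropout` afresh for every training batch, rather than precomputing a fixed augmented set, would be one way to realize the online augmentation strategy the abstract mentions.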