Computer science
Pruning
Benchmark (surveying)
Distillation
Differentiable function
Weighting
Machine learning
Artificial intelligence
Language model
Scale (ratio)
Mathematics
Mathematical analysis
Geography
Chemistry
Organic chemistry
Agronomy
Radiology
Physics
Biology
Medicine
Quantum mechanics
Geodesy
Authors
Zhou Zhang, Yang Lu, Tengfei Wang, Xing Wei, Zhen Wei
Source
Journal: Neural Networks
[Elsevier BV]
Date: 2024-02-09
Volume/Issue: 173: 106164-106164
Citations: 3
Identifier
DOI: 10.1016/j.neunet.2024.106164
Abstract
Large-scale pre-trained models such as BERT have demonstrated outstanding performance in Natural Language Processing (NLP). However, their large number of parameters increases the demand for hardware storage and computational resources and poses a challenge for practical deployment. In this article, we propose a method that combines model pruning and knowledge distillation to compress and accelerate large-scale pre-trained language models. Specifically, we introduce DDK, a dynamic structured pruning method based on differentiable search and recursive knowledge distillation that automatically prunes the BERT model. We define the search space for network pruning as all feed-forward channels and self-attention heads at each layer of the network, and use differentiable methods to determine their optimal number. In addition, we design a recursive knowledge distillation method that employs adaptive weighting to extract the most important features from multiple intermediate layers of the teacher model and fuse them to supervise the learning of the student network. Experimental results on the GLUE benchmark and an ablation analysis demonstrate that the proposed method outperforms other advanced methods in terms of average performance.
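Below is a minimal PyTorch-style sketch of the two ingredients the abstract describes: differentiable gates over attention heads and feed-forward channels used to search the pruned architecture, and an adaptively weighted fusion of several teacher intermediate layers used as a distillation target. This is an illustrative reconstruction, not the authors' DDK implementation: the module and function names (`DifferentiableGates`, `fused_teacher_target`), the sigmoid-gate relaxation, the MSE distillation objective, and all shapes and hyperparameters are assumptions made for the example.

```python
# Hedged sketch (not the paper's DDK code): differentiable pruning gates over
# attention heads / FFN channels, plus an adaptively weighted multi-layer
# distillation loss. Names, shapes, and the sigmoid relaxation are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DifferentiableGates(nn.Module):
    """Learnable soft masks for one layer's attention heads and FFN channels."""

    def __init__(self, num_heads: int, ffn_channels: int):
        super().__init__()
        # One logit per prunable unit; a sigmoid turns it into a gate in (0, 1).
        self.head_logits = nn.Parameter(torch.zeros(num_heads))
        self.ffn_logits = nn.Parameter(torch.zeros(ffn_channels))

    def head_mask(self) -> torch.Tensor:
        return torch.sigmoid(self.head_logits)   # (num_heads,)

    def ffn_mask(self) -> torch.Tensor:
        return torch.sigmoid(self.ffn_logits)    # (ffn_channels,)

    def sparsity_loss(self) -> torch.Tensor:
        # Penalize the expected number of kept units so the differentiable
        # search is pushed toward a smaller architecture.
        return self.head_mask().sum() + self.ffn_mask().sum()


def fused_teacher_target(teacher_hiddens, layer_logits):
    """Fuse several teacher intermediate layers with adaptive (learned) weights.

    teacher_hiddens: list of tensors, each (batch, seq_len, hidden)
    layer_logits:    parameter of shape (len(teacher_hiddens),)
    """
    weights = F.softmax(layer_logits, dim=0)          # adaptive layer weights
    stacked = torch.stack(teacher_hiddens, dim=0)     # (L, B, S, H)
    return (weights[:, None, None, None] * stacked).sum(dim=0)


def distillation_loss(student_hidden, teacher_hiddens, layer_logits):
    """MSE between a student hidden state and the fused teacher representation."""
    target = fused_teacher_target([h.detach() for h in teacher_hiddens], layer_logits)
    return F.mse_loss(student_hidden, target)


if __name__ == "__main__":
    # Toy sizes: 12 heads, 3072 FFN channels, 3 teacher layers fused into one target.
    gates = DifferentiableGates(num_heads=12, ffn_channels=3072)
    layer_logits = nn.Parameter(torch.zeros(3))

    student_hidden = torch.randn(2, 16, 768)
    teacher_hiddens = [torch.randn(2, 16, 768) for _ in range(3)]

    loss = distillation_loss(student_hidden, teacher_hiddens, layer_logits)
    loss = loss + 1e-4 * gates.sparsity_loss()    # distillation loss + sparsity penalty
    loss.backward()
    print(float(loss))
```

In a scheme of this kind, once the differentiable search converges, heads and channels whose gate values fall below a threshold would be removed to obtain the pruned student architecture; how DDK selects and fixes the final structure is described in the paper itself.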