Computer science
Abstract syntax tree
Security token
Coding (set theory)
Encoding
Source code
Transformer
Programming language
Artificial intelligence
Syntax
Operating system
Engineering
Gene
Electrical engineering
Voltage
Set (abstract data type)
Chemistry
Biochemistry
Authors
Aiping Zhang, Liming Fang, Chunpeng Ge, Piji Li, Zhe Liu
Identifier
DOI:10.1016/j.jss.2022.111557
Abstract
Deep learning techniques have achieved promising results in code clone detection over the past decade. Unfortunately, current deep learning-based methods rarely model long code explicitly. Worse, code length keeps growing as the demand for complex functionality increases. Modeling the relationships between code tokens to capture their long-range dependencies is therefore crucial to comprehensively representing a code fragment. In this work, we resort to the Transformer to capture long-range dependencies within a code fragment, which, however, incurs a huge computational cost on long inputs. To apply the Transformer efficiently, we propose a code token learner that automatically and substantially reduces the number of feature tokens. Besides, considering the tree structure of the abstract syntax tree, we present a tree-based position embedding that encodes the position of each token in the input. Apart from the Transformer, which captures dependencies within a single code fragment, we further leverage a cross-code attention module to capture the similarities between two code fragments. Our method reduces the computational cost of the Transformer by 97% while achieving performance superior to state-of-the-art methods. Our code is available at https://github.com/ArcticHare105/Code-Token-Learner.
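The abstract's "code token learner" reduces N input tokens to M << N learned tokens before self-attention, which is what yields the quoted cost reduction (attention is quadratic in token count). The paper's exact design is not given here; the following is a minimal TokenLearner-style sketch of the idea, with all module and parameter names (`CodeTokenLearner`, `num_learned_tokens`) being illustrative assumptions:

```python
import torch
import torch.nn as nn

class CodeTokenLearner(nn.Module):
    """Sketch of a token learner: compress N token features into M << N
    learned tokens via predicted attention maps. Hypothetical design,
    not necessarily the paper's exact module."""

    def __init__(self, dim: int, num_learned_tokens: int = 16):
        super().__init__()
        # Predict one attention map per learned token from the inputs.
        self.attn = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, num_learned_tokens),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, dim) token features, e.g. embedded code tokens
        weights = self.attn(x).softmax(dim=1)            # (batch, N, M)
        # Each of the M maps pools over all N input tokens.
        return torch.einsum("bnm,bnd->bmd", weights, x)  # (batch, M, dim)

# Usage: 512 tokens -> 16 tokens, so self-attention FLOPs shrink roughly
# by a factor of (512 / 16)^2 = 1024.
x = torch.randn(2, 512, 128)
print(CodeTokenLearner(128, 16)(x).shape)  # torch.Size([2, 16, 128])
```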
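The tree-based position embedding replaces flat sequence positions with positions derived from the abstract syntax tree. One common way to realize this, sketched below under the assumption that each node is described by its root-to-node path of child indices (the paper's actual scheme may differ), is a per-depth embedding table whose entries are summed along the path:

```python
import torch
import torch.nn as nn

class TreePositionEmbedding(nn.Module):
    """Illustrative tree position embedding: an AST node is encoded by
    the child indices on its root-to-node path, one embedding table per
    depth, summed. An assumed scheme, not the paper's confirmed one."""

    def __init__(self, dim: int, max_depth: int = 16, max_children: int = 32):
        super().__init__()
        # Index 0 is reserved as padding for paths shorter than max_depth.
        self.level = nn.ModuleList(
            nn.Embedding(max_children + 1, dim, padding_idx=0)
            for _ in range(max_depth)
        )

    def forward(self, paths: torch.Tensor) -> torch.Tensor:
        # paths: (batch, N, max_depth) 1-based child indices, 0 = padding
        out = 0
        for d, emb in enumerate(self.level):
            out = out + emb(paths[:, :, d])
        return out  # (batch, N, dim), added to the token embeddings
```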
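Finally, the cross-code attention module lets the two fragments interact rather than being encoded independently. A minimal sketch, assuming standard multi-head cross-attention in both directions followed by pooling and a similarity head (the names `CrossCodeAttention` and `score` are hypothetical):

```python
import torch
import torch.nn as nn

class CrossCodeAttention(nn.Module):
    """Sketch of cross-code attention: tokens of fragment A attend to
    fragment B and vice versa, so the clone score is computed from
    interaction-aware features. Illustrative only."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        # Shared attention weights for both directions (a design choice
        # of this sketch, not a claim about the paper).
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # a, b: (batch, M, dim) learned tokens of the two code fragments
        a2b, _ = self.attn(a, b, b)   # A queries B
        b2a, _ = self.attn(b, a, a)   # B queries A
        # Mean-pool each side and predict a clone-similarity logit.
        fused = torch.cat([a2b.mean(dim=1), b2a.mean(dim=1)], dim=-1)
        return self.score(fused).squeeze(-1)  # (batch,)
```

Feeding the M learned tokens (rather than all N raw tokens) into this module keeps the pairwise attention cheap, consistent with the efficiency claim in the abstract.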