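Code smell
Computer science
Source code
KPI-driven code analysis
Coding (social sciences)
Transformer
Code (set theory)
Open source
Code generation
Code review
Static program analysis
Computer security
Programming language
Software
Engineering
Software development
Software quality
Statistics
Mathematics
Set (abstract data type)
Voltage
Key (lock)
Electrical engineering
Authors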
Mohammed Latif Siddiq,Shafayat Hossain Majumder,Maisha R. Mim,Sourov Jajodia,Joanna C. S. Santos
Identifier
DOI:10.1109/scam55253.2022.00014
Abstract
Prior work has developed transformer-based language learning models that automatically generate source code for a task without compilation errors. The datasets used to train these techniques include samples from open-source projects, which may not be free of security flaws, code smells, and violations of standard coding practices. Therefore, we investigate to what extent code smells are present in the datasets of code generation techniques and verify whether they leak into the output of these techniques. To conduct this study, we used Pylint and Bandit to detect code smells and security smells in three widely used training sets (CodeXGlue, APPS, and Code Clippy). We observed that Pylint caught 264 code smell types, whereas Bandit located 44 security smell types in these three datasets. By analyzing the output from ten different configurations of the open-source, fine-tuned, transformer-based GPT-Neo model with 125M parameters, we observed that this model leaked the smells and non-standard practices into the generated source code. When analyzing the suggestions of GitHub Copilot, a closed-source code generation tool, we observed that they contained 18 types of code smells, including substandard coding patterns, and 2 security smell types.
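The counting step described in the abstract (how many distinct smell types each tool reports) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the reports were produced beforehand with `pylint --output-format=json` and `bandit -f json`, the function names are hypothetical, and the sample reports are hand-written placeholders rather than data from the study.

```python
import json
from collections import Counter


def pylint_smell_counts(report_json: str) -> Counter:
    """Count findings per Pylint message symbol.

    Assumes the JSON list produced by `pylint --output-format=json`,
    where each entry carries a "symbol" field.
    """
    messages = json.loads(report_json)
    return Counter(message["symbol"] for message in messages)


def bandit_smell_counts(report_json: str) -> Counter:
    """Count findings per Bandit test ID.

    Assumes the JSON object produced by `bandit -f json`, whose
    "results" list holds entries with a "test_id" field.
    """
    report = json.loads(report_json)
    return Counter(result["test_id"] for result in report.get("results", []))


# Hand-written sample reports for illustration only (not from the paper).
PYLINT_SAMPLE = json.dumps([
    {"path": "a.py", "line": 1, "symbol": "unused-import", "message-id": "W0611"},
    {"path": "b.py", "line": 2, "symbol": "unused-import", "message-id": "W0611"},
    {"path": "a.py", "line": 1, "symbol": "missing-module-docstring", "message-id": "C0114"},
])
BANDIT_SAMPLE = json.dumps({
    "results": [
        {"filename": "a.py", "line_number": 3, "test_id": "B102", "test_name": "exec_used"},
        {"filename": "b.py", "line_number": 7, "test_id": "B101", "test_name": "assert_used"},
    ]
})

code_smells = pylint_smell_counts(PYLINT_SAMPLE)
security_smells = bandit_smell_counts(BANDIT_SAMPLE)

# The number of distinct keys is the number of smell *types*;
# the counter values are the number of occurrences of each type.
print(len(code_smells), "code smell types:", dict(code_smells))
print(len(security_smells), "security smell types:", dict(security_smells))
```

Keying on the Pylint symbol and the Bandit test ID is one plausible way to define a "smell type"; the paper may group findings differently.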