Abstract
Randomization is a technique that can be used with programming assessments to discourage academic misconduct by making it unlikely that two colluding students receive exactly the same questions. Previous research has shown randomization to be an effective tool for addressing academic misconduct, but this work often treats randomization broadly, with few studies considering specific techniques. In contrast, we consider different randomization techniques and the contexts they are best suited to. In addition, we investigate the effectiveness of randomization techniques against emerging AI technologies. We do this by exploring randomization in the context of an online quiz system that evaluates student responses to programming challenges, specifically the CodeRunner system for the Moodle learning management system. We provide a classification of techniques and discuss the benefits of each. This classification starts with simpler techniques, such as shuffling question order, shuffling multiple-choice question options, and question pooling. We then move on to more advanced techniques, including simple substitution, altering expected output, switching logic, and steganography. We also investigate two approaches to generating randomized questions, considering the benefits and drawbacks of each: generating the questions beforehand (pre-generation) and generating the questions when the quiz is started (on-the-fly generation). We then identify four categories of assessment, based on whether the assessment is formative or summative and proctored or non-proctored, and identify which randomization techniques are suited to each category. Finally, we test randomized questions against OpenAI's Codex to see whether these techniques can prevent this new opportunity for academic dishonesty. We found that there are some types of questions that Codex currently performs poorly on, such as program reasoning and creating complex classes, but overall randomization was not effective in defeating it: Codex scored 79.7% on questions that were created after it was trained, and 85.3% on questions that could have been available to it when it was trained.
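As a brief illustration of the simple-substitution technique mentioned above, the sketch below shows one possible way a question template could be randomized on the fly. It is a minimal sketch only: the template text, parameter names, and the use of Python are our own assumptions for illustration, not the CodeRunner implementation.

    import random

    # Illustrative sketch (assumed, not taken from CodeRunner): randomize a
    # programming question by substituting names and values into a template.
    TEMPLATE = ("Write a function {func}({arg}) that returns the sum of the "
                "integers from 1 to {arg}, multiplied by {factor}.")

    def generate_variant(seed):
        rng = random.Random(seed)  # seed per student, so each variant is reproducible
        func = rng.choice(["total", "accumulate", "tally"])   # substituted function name
        arg = rng.choice(["n", "limit", "count"])             # substituted parameter name
        factor = rng.randint(2, 9)                            # substituted constant
        question = TEMPLATE.format(func=func, arg=arg, factor=factor)
        # The reference answer is derived from the same parameters for auto-marking.
        reference = lambda n, f=factor: sum(range(1, n + 1)) * f
        return question, reference

Seeding the generator per student means the same variant can be regenerated for re-marking, which is one way the pre-generation and on-the-fly approaches discussed in the paper can be reconciled.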