ORCID

0009-0002-1252-2263

Keywords

Code Authorship, Machine Learning, LLMs, Security, Software Engineering

Abstract

Code authorship attribution is the task of identifying the author(s) of source code written in a specific programming language. Methods developed for this task rely on manually crafted, automatically crafted, and deep learning-generated features. These techniques use the stylistic patterns that distinguish one programmer's code from another's, such as structure, comments, variable names, and function names, to attribute authorship. The field has practical applications in software forensics, cybersecurity, and code plagiarism detection, supporting investigations into piracy, intellectual property violations, and malware attribution. In educational settings, it helps detect plagiarism in programming assignments. However, traditional methods are challenged by code transformation techniques that alter stylistic patterns and so make attribution difficult. The rise of AI programming tools like ChatGPT adds further complexity, because these tools generate code in various styles and may evade detection. AI-generated code also often meets lower security standards, and its use raises concerns about copyright infringement, cheating, and vulnerabilities. To address these challenges, our research investigates whether existing attribution techniques can identify AI-generated code. Initial findings suggest they cannot, underscoring the need for novel approaches. Using a feature-based method built on pretrained models, we accurately classify ChatGPT and non-ChatGPT code and create a jointly trained model for reliable attribution. We also explore ChatGPT's ability to generate diverse code styles, akin to code transformation, and evaluate the resilience of attribution techniques against evasion attempts. Additionally, we propose SCAE, a machine learning-based Seq2Seq code transformation technique that mitigates the limitations of Monte Carlo Tree Search (MCTS). SCAE achieves efficient processing while maintaining transformation quality, offering a robust solution for code authorship obfuscation. Our work advances the field by addressing the challenges that AI-generated and transformed code pose for accurate and secure authorship attribution.

Completion Date

2025

Semester

Spring

Committee Chair

Mohaisen, David

Degree

Doctor of Philosophy (Ph.D.)

College

College of Engineering and Computer Science

Department

Computer Science

Identifier

DP0029278

Document Type

Dissertation/Thesis

Campus Location

Orlando (Main) Campus
