Abstract
Successful software authorship identification has both software forensics applications and privacy implications. However, the process requires efficient extraction of high-quality authorship attributes. Extracting such attributes is challenging due to several factors, such as the variety of software formats, the number of available samples, and possible obfuscation or adversarial manipulation. We focus on software authorship identification from three central perspectives: large-scale single-authored software, real-world multi-authored software, and the robustness of code authorship identification methods against adversarial attacks. First, we propose DL-CAIS, a deep learning-based approach to software authorship attribution that facilitates large-scale, format-independent, language-oblivious, and obfuscation-resilient software authorship identification. DL-CAIS learns deep authorship attributes using a recurrent neural network and identifies programmers using a random forest ensemble. We demonstrate the effectiveness of DL-CAIS under different experimental settings and scenarios for identifying programmers of both source code and software binaries. Second, we propose Multi-X, a fine-grained system for identifying multiple programmers within a single code file. Multi-X incorporates code segmentation, code representation, authorship verification, code integration, and authorship identification. We evaluate Multi-X on several GitHub projects (Caffe, Facebook's Folly, TensorFlow, etc.) and show remarkable accuracy. We examine the performance of Multi-X across multiple dimensions and design choices, and demonstrate its effectiveness. Finally, we propose Author-SHIELD to examine the robustness of six state-of-the-art code authorship attribution approaches against adversarial examples. We define three adversarial attacks on attribution techniques (confidence reduction, programmer imitation, and evasion) and realize them through targeted and non-targeted adversarial code perturbations. Our experiments demonstrate the vulnerability of current authorship attribution methods to adversarial attacks.
Notes
If this is your thesis or dissertation and you want to learn how to access it, or want more information about readership statistics, contact us at STARS@ucf.edu
Graduation Date
2020
Semester
Summer
Advisor
Mohaisen, David
Degree
Doctor of Philosophy (Ph.D.)
College
College of Engineering and Computer Science
Department
Computer Science
Degree Program
Computer Science
Format
application/pdf
Identifier
CFE0008568; DP0024244
URL
https://purls.library.ucf.edu/go/DP0024244
Language
English
Release Date
February 2022
Length of Campus-only Access
1 year
Access Status
Doctoral Dissertation (Open Access)
STARS Citation
Abuhamad, Mohammed, "Towards Large-Scale and Robust Code Authorship Identification with Deep Feature Learning" (2020). Electronic Theses and Dissertations, 2020-2023. 597.
https://stars.library.ucf.edu/etd2020/597