Abstract

Successful software authorship identification has both software forensics applications and privacy implications. However, the process requires efficient extraction of high-quality authorship attributes. Extracting such attributes is challenging due to several factors, such as the variety of software formats, the number of available samples, and possible obfuscation or adversarial manipulation. We study software authorship identification from three central perspectives: large-scale single-authored software, real-world multi-authored software, and the robustness of code authorship identification methods against adversarial attacks. First, we propose DL-CAIS, a deep learning-based approach for software authorship attribution that facilitates large-scale, format-independent, language-oblivious, and obfuscation-resilient software authorship identification. DL-CAIS learns deep authorship attributes using a recurrent neural network and identifies programmers using a random forest ensemble. We demonstrate the effectiveness of DL-CAIS under different experimental settings and scenarios for identifying programmers of both source code and software binaries. Second, we propose Multi-X, a fine-grained system for identifying multiple programmers within a single code file. Multi-X incorporates code segmentation, code representation, authorship verification, code integration, and authorship identification. We evaluate Multi-X on several GitHub projects (Caffe, Facebook's Folly, TensorFlow, etc.) and show remarkable accuracy. We examine the performance of Multi-X along multiple dimensions and design choices, and demonstrate its effectiveness. Finally, we propose Author-SHIELD to examine the robustness of six state-of-the-art code authorship attribution approaches against adversarial examples. We define three adversarial attacks on attribution techniques (confidence reduction, programmer imitation, and evasion) and realize them as targeted and non-targeted adversarial code perturbations. Our experiments demonstrate the vulnerability of current authorship attribution methods to adversarial attacks.

Notes

If this is your thesis or dissertation and you want to learn how to access it, or want more information about readership statistics, contact us at STARS@ucf.edu

Graduation Date

2020

Semester

Summer

Advisor

Mohaisen, David

Degree

Doctor of Philosophy (Ph.D.)

College

College of Engineering and Computer Science

Department

Computer Science

Degree Program

Computer Science

Format

application/pdf

Identifier

CFE0008568; DP0024244

URL

https://purls.library.ucf.edu/go/DP0024244

Language

English

Release Date

February 2022

Length of Campus-only Access

1 year

Access Status

Doctoral Dissertation (Open Access)

STARS Citation

Abuhamad, Mohammed, "Towards Large-Scale and Robust Code Authorship Identification with Deep Feature Learning" (2020). Electronic Theses and Dissertations, 2020-2023. 597.
https://stars.library.ucf.edu/etd2020/597

Included in

Software Engineering Commons
