Electronic Theses and Dissertations

Data Representation in Machine Learning Methods with its Application to Compilation Optimization and Epitope Prediction

Yevgeniy Sher, University of Central Florida

Abstract

In this dissertation we explore the application of machine learning algorithms to compilation phase order optimization and epitope prediction. The common thread running through these two disparate domains is the type of data being dealt with. Both domains involve categorical data, whose representation plays a significant role in the performance of classification algorithms. We first present a neuroevolutionary approach which orders optimization phases to generate compiled programs with performance superior to those compiled using LLVM's -O3 optimization level. Performance improvements, measured as the execution speed of the compiled program, ranged from 27% for the ccbench program to 40.8% for bzip2. This dissertation then explores the problem of data representation of 3D biological data, such as amino acids. A new approach for distributed representation of 3D biological data through the process of embedding is proposed and explored. Analogously to word embedding, we developed a system that uses atomic and residue coordinates to generate distributed representations of residues, which we call 3D Residue BioVectors. Preliminary results are presented which demonstrate that even low-dimensional 3D Residue BioVectors can be used to predict conformational epitopes and protein-protein interactions with promising proficiency. The generation of such 3D BioVectors, and the proposed methodology, open the door to substantial future improvements and new application domains. The dissertation then explores the problem domain of linear B-cell epitope prediction, which deals with predicting epitopes based strictly on the protein sequence. We present the DRREP system, which demonstrates how an ensemble of shallow neural networks can be combined with string kernels and an analytical learning algorithm to produce state-of-the-art epitope prediction results.
DRREP was tested on the SARS subsequence; the HIV, Pellequer, and AntiJen datasets; and the standard SEQ194 test dataset. AUC improvements achieved over the state of the art ranged from 3% to 8%. Finally, we present the SEEP epitope classifier, a multi-resolution SVM ensemble classifier that uses conjoint triad feature representation and produces state-of-the-art classification results. SEEP leverages the domain-specific, knowledge-based protein sequence encoding developed within the protein-protein interaction research domain. Using an ensemble of multi-resolution SVMs and a sliding-window-based pre- and post-processing pipeline, SEEP achieves an AUC of 91.2 on the standard SEQ194 test dataset, a 24% improvement over the state of the art.
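To make the conjoint triad encoding and the sliding window mentioned in the abstract concrete, the following is a minimal sketch and not the SEEP implementation. It assumes the seven-class residue grouping commonly used for conjoint triads in the protein-protein interaction literature; the abstract does not state the grouping, window widths, or normalization used by SEEP, and the names `conjoint_triad` and `sliding_windows` are hypothetical.

```python
from itertools import product

# ASSUMPTION: seven residue classes commonly used for conjoint triads in the
# protein-protein interaction literature; the abstract does not specify them.
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
CLASS_OF = {aa: i for i, group in enumerate(CLASSES) for aa in group}

# Every ordered triple of classes gets one position in the feature vector.
TRIADS = list(product(range(len(CLASSES)), repeat=3))  # 7**3 = 343 triads
TRIAD_INDEX = {triad: i for i, triad in enumerate(TRIADS)}


def conjoint_triad(seq):
    """Return a 343-dimensional vector of raw triad counts for a sequence."""
    vec = [0] * len(TRIADS)
    classes = [CLASS_OF.get(aa) for aa in seq.upper()]
    for i in range(len(classes) - 2):
        triad = tuple(classes[i:i + 3])
        if None in triad:
            continue  # skip triads containing a non-standard residue
        vec[TRIAD_INDEX[triad]] += 1
    return vec


def sliding_windows(seq, width):
    """Yield (start, subsequence) for every window of the given width."""
    for start in range(len(seq) - width + 1):
        yield start, seq[start:start + width]


if __name__ == "__main__":
    protein = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up sequence for illustration
    # Hypothetical window widths standing in for "multi-resolution".
    for width in (8, 12):
        for start, window in sliding_windows(protein, width):
            features = conjoint_triad(window)
            print(width, start, sum(features))
```

Each window yields one fixed-length categorical feature vector that a classifier such as an SVM could consume; using several window widths is one plausible reading of "multi-resolution", offered here only as an assumption.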

Notes

If this is your thesis or dissertation and you want to learn how to access it, or want more information about readership statistics, contact us at STARS@ucf.edu

Graduation Date

2017

Semester

Summer

Advisor

Zhang, Shaojie

Degree

Doctor of Philosophy (Ph.D.)

College

College of Engineering and Computer Science

Department

Computer Science

Degree Program

Computer Science

Format

application/pdf

Identifier

CFE0006793

URL

http://purl.fcla.edu/fcla/etd/CFE0006793

Language

English

Release Date

August 2020

Length of Campus-only Access

3 years

Access Status

Doctoral Dissertation (Open Access)

STARS Citation

Sher, Yevgeniy, "Data Representation in Machine Learning Methods with its Application to Compilation Optimization and Epitope Prediction" (2017). Electronic Theses and Dissertations. 5608.
https://stars.library.ucf.edu/etd/5608

Included in

Computer Sciences Commons
