In this dissertation we explore the application of machine learning algorithms to compilation phase order optimization and to epitope prediction. The common thread running through these two disparate domains is the type of data involved: both deal with categorical data, whose representation plays a significant role in the performance of classification algorithms. We first present a neuroevolutionary approach that orders optimization phases to generate compiled programs that outperform those compiled using LLVM's -O3 optimization level. Performance improvements, measured as the execution speed of the compiled program, ranged from 27% for the ccbench program to 40.8% for bzip2. This dissertation then explores the problem of data representation of 3D biological data, such as amino acids. A new approach for distributed representation of 3D biological data through the process of embedding is proposed and explored. Analogously to word embedding, we developed a system that uses atomic and residue coordinates to generate distributed representations for residues, which we call 3D Residue BioVectors. Preliminary results are presented which demonstrate that even low-dimensional 3D Residue BioVectors can be used to predict conformational epitopes and protein-protein interactions with promising proficiency. The generation of such 3D BioVectors, and the proposed methodology, opens the door to substantial future improvements and new application domains. The dissertation then explores the problem domain of linear B-cell epitope prediction, which deals with predicting epitopes based strictly on the protein sequence. We present the DRREP system, which demonstrates how an ensemble of shallow neural networks can be combined with string kernels and an analytical learning algorithm to produce state-of-the-art epitope prediction results. 
DRREP was tested on the SARS subsequence, the HIV, Pellequer, and AntiJen datasets, and the standard SEQ194 test dataset. AUC improvements over the state of the art ranged from 3% to 8%. Finally, we present the SEEP epitope classifier, a multi-resolution SVM ensemble classifier that uses the conjoint triad feature representation and produces state-of-the-art classification results. SEEP leverages the domain-specific, knowledge-based protein sequence encoding developed within the protein-protein interaction research domain. Using an ensemble of multi-resolution SVMs and a sliding-window-based pre- and post-processing pipeline, SEEP achieves an AUC of 91.2 on the standard SEQ194 test dataset, a 24% improvement over the state of the art.
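As a rough illustration of the conjoint triad representation and sliding windows mentioned above, the following minimal sketch encodes a protein subsequence as a 343-dimensional vector. It is not SEEP's implementation: the seven residue classes follow the grouping commonly used for conjoint triads in protein-protein interaction work, and the frequency normalization, window handling, and all function names are assumptions made for this example.

```python
from itertools import product

# Seven residue classes commonly used for conjoint triad encoding
# (assumed here; the dissertation may use a different grouping).
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
RESIDUE_TO_CLASS = {aa: i for i, group in enumerate(CLASSES) for aa in group}

# Every ordered triple of classes gets one slot: 7 * 7 * 7 = 343 features.
TRIAD_INDEX = {t: i for i, t in enumerate(product(range(7), repeat=3))}


def conjoint_triad(seq):
    """Return a 343-dimensional triad frequency vector for a sequence.

    Triads containing a non-standard residue are skipped. Normalizing by
    the total triad count is an assumption chosen for this sketch.
    """
    counts = [0] * len(TRIAD_INDEX)
    classes = [RESIDUE_TO_CLASS.get(aa) for aa in seq.upper()]
    for i in range(len(classes) - 2):
        triad = tuple(classes[i:i + 3])
        if None in triad:
            continue
        counts[TRIAD_INDEX[triad]] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


def sliding_windows(seq, size):
    """Split a sequence into overlapping windows of a fixed size."""
    return [seq[i:i + size] for i in range(len(seq) - size + 1)]


def encode_multi_resolution(seq, sizes):
    """Encode every window at each window size (hypothetical helper)."""
    return {
        size: [conjoint_triad(w) for w in sliding_windows(seq, size)]
        for size in sizes
    }
```

In a setup like the one described, each window size would feed a separate classifier, and the per-window scores would then be combined across resolutions; that combination step is left out here because the abstract does not specify it.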
Doctor of Philosophy (Ph.D.)
College of Engineering and Computer Science
Doctoral Dissertation (Campus-only Access)
Sher, Yevgeniy, "Data Representation in Machine Learning Methods with its Application to Compilation Optimization and Epitope Prediction" (2017). Electronic Theses and Dissertations. 5608.