The ever-increasing availability of large-scale, high-dimensional data offers unprecedented opportunities for data-driven studies across widely differing domains, ranging from marketing and web mining to bioinformatics and space exploration. Yet it also poses formidable challenges in the face of storing, organizing, and analyzing such data. Often, though, high-dimensional data tend to lie close to low-dimensional structures. Aside from substantial computational and storage savings, discovering such structures also improves learning and inference performance. The majority of existing work assumes linear data models, which fail to effectively capture the complex patterns of real-world problems. Non-linear models, on the other hand, typically hinge upon local neighborhood regions and are unable to fully capture the global geometry. Furthermore, the problem of high-dimensional contamination is notably harder for non-linear manifolds due to their more complex patterns. This doctoral dissertation studies the manifold learning setting, in which we explore the low-dimensional non-linear structures hidden in high-dimensional data in the presence of various types of data contamination, such as outliers, gross corruptions, missing entries, and noise. We tackle several problems in this vein, including non-linear outlier detection, manifold clustering, and representative selection from manifolds, addressing important shortcomings of existing methods. We first develop a remarkably simple, yet powerful, outlier identification method for manifold data. A novel measure is introduced to distinguish inliers from outliers, and theoretical guarantees are provided for their effective segregation. This framework is further adapted for manifold clustering. Our method captures a global understanding of the underlying structure and outperforms the state-of-the-art in the presence of numerous and structured outliers, noise components, and intersecting manifolds.
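The abstract does not specify the measure itself, so purely as a generic illustration of the inlier/outlier separation problem for manifold data (not the dissertation's method), a simple neighborhood-density baseline can be sketched: points lying on a low-dimensional structure have dense neighborhoods, while scattered outliers do not.

```python
import numpy as np

def knn_outlier_scores(X, k=10):
    """Score each point by its mean distance to its k nearest neighbors;
    points lying off the low-dimensional structure tend to score higher.
    A generic baseline for illustration only."""
    # Pairwise squared distances; exclude self-distances on the diagonal.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    # Mean distance to the k nearest neighbors per point.
    idx = np.argsort(d2, axis=1)[:, :k]
    return np.sqrt(np.take_along_axis(d2, idx, axis=1)).mean(axis=1)

# Usage: a noisy circle (a 1-D manifold in 2-D) plus uniform outliers.
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 200)
inliers = np.c_[np.cos(t), np.sin(t)] + 0.02 * rng.normal(size=(200, 2))
outliers = rng.uniform(-1.5, 1.5, size=(20, 2))
X = np.vstack([inliers, outliers])
scores = knn_outlier_scores(X, k=10)
```

A local score like this is exactly the kind of neighborhood-based heuristic the dissertation moves beyond: it captures no global picture of the manifold and degrades with structured outliers or intersecting manifolds.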
Next, we study the problem of representative selection from large-scale, high-dimensional data. Existing techniques share important limitations, which motivate us to develop two different approaches to this problem. In both settings, we consider the non-linear structures of the underlying data and multiple key criteria for the representative subset, including descriptiveness and conciseness. However, the two models differ in the type of data contamination they address: the first tackles the presence of outlying points, while the second considers gross corruptions. The different nature of these contamination types gives rise to two fundamentally different problem formulations. Aside from showing that the convex optimization program involved in each problem enjoys a unique solution, we develop two novel scalable implementations that bring about substantial speedups. The first algorithm is built upon an adaptive refinement scheme over a randomized sketch, while the second exploits an explicit feature mapping and the special structure of the matrices involved. Theoretical analyses provide a geometrical characterization of the chosen representatives, enabling insightful interpretations of the proposed methods. Furthermore, we establish probabilistic guarantees on the quality of the employed approximations. Both sampling schemes outperform the state-of-the-art, as demonstrated on challenging generative deep learning frameworks as well as a variety of downstream problems, including classification, clustering, graph community detection, and outlier detection.
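The abstract leaves the explicit feature mapping unspecified; a standard instance of the idea (shown here only as a generic illustration, not necessarily the mapping used in the dissertation) is random Fourier features, which replace an implicit kernel with an explicit finite-dimensional map whose inner products approximate the kernel.

```python
import numpy as np

def random_fourier_features(X, n_features=256, gamma=1.0, seed=0):
    """Explicit map z(x) approximating the RBF kernel
    k(x, y) = exp(-gamma * ||x - y||^2), so that z(x) @ z(y) ~= k(x, y).
    A generic illustration of explicit feature mapping."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies drawn from the kernel's spectral density (Gaussian with
    # per-dimension variance 2*gamma), plus random phase shifts.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Usage: compare the explicit-map kernel to the exact RBF kernel.
X = np.random.default_rng(1).normal(size=(50, 5))
Z = random_fourier_features(X, n_features=4096, gamma=0.5, seed=0)
K_approx = Z @ Z.T
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-0.5 * sq)
max_err = np.abs(K_approx - K_exact).max()
```

Working with `Z` directly lets downstream computations scale linearly in the number of points instead of quadratically through the full kernel matrix, which is the kind of structure a scalable implementation can exploit.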
Doctor of Philosophy (Ph.D.)
College of Engineering and Computer Science
Electrical and Computer Engineering
Doctoral Dissertation (Open Access)
Sedghi, Mahlagha, "In Pursuit of Low-dimensional Manifold Structures in High-dimensional Data" (2020). Electronic Theses and Dissertations, 2020-. 814.