Deep learning has achieved tremendous success on various computer vision tasks. However, deep learning methods and models are usually computationally expensive, making them hard to train and deploy, especially on resource-constrained devices. In this dissertation, we explore how to improve the efficiency and effectiveness of deep learning methods from various perspectives. We first propose a new learning method to learn computationally adaptive representations. Whereas traditional neural networks are static, our method trains adaptive neural networks that can adjust their computational cost at runtime, avoiding the need to train and deploy multiple networks for dynamic resource budgets. Next, we extend our method to learn adaptive spatiotemporal representations for video understanding tasks such as video recognition and action detection. Then, inspired by the proposed adaptive learning method, we propose a new regularization method to learn better representations for the full network. Our method regularizes the full network by ensuring that its predictions align with those of its sub-networks when they are fed differently transformed input data. This approach helps the full network learn more generalized and robust representations. Besides learning methods, designing a good network architecture is also critical to learning good representations. Neural architecture search (NAS) has shown great potential in designing novel network structures, but its high computational cost is a significant limitation. To address this issue, we present a new short-training-based NAS method that achieves superior performance compared to previous methods at a significantly lower search cost. Finally, with the recent advancements in large-scale image foundation models, we present an efficient fine-tuning method to adapt pre-trained image foundation models for video understanding. Our method significantly reduces training costs compared to traditional full fine-tuning, while delivering competitive performance across multiple video benchmarks. It is both simple and versatile, making it easy to leverage stronger image foundation models in the future.
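As an illustration only, the prediction-alignment idea behind the regularization method can be sketched as a divergence between the class distributions predicted by the full network and by one of its sub-networks. The abstract does not specify the loss, so the choice of KL divergence, its direction, and the names `softmax`, `kl_divergence`, and `alignment_loss` are all assumptions made for this sketch; the logits are hypothetical values, not results from the dissertation.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def alignment_loss(full_logits, sub_logits):
    """Assumed alignment term: KL between sub-network and full-network predictions.

    The divergence and its direction are illustrative choices, not taken
    from the dissertation.
    """
    return kl_divergence(softmax(sub_logits), softmax(full_logits))


# Hypothetical logits: the full network sees one transformed view of an image,
# the sub-network sees a differently transformed view of the same image.
full_logits = [2.0, 0.5, -1.0]
sub_logits = [1.5, 0.7, -0.8]

loss = alignment_loss(full_logits, sub_logits)
print(f"alignment loss: {loss:.6f}")
```

In a training setup, a term of this kind would typically be added to the ordinary supervised loss, so that the full network is penalized when its predictions drift from those of its sub-networks.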
Doctor of Philosophy (Ph.D.)
College of Engineering and Computer Science
Doctoral Dissertation (Open Access)
Yang, Taojiannan, "Towards Efficient and Effective Representation Learning for Image and Video Understanding" (2023). Electronic Theses and Dissertations, 2020-. 1738.