ORCID

https://orcid.org/0009-0008-4555-7677

Keywords

Representation Learning, Metric Learning, Efficient Transfer Learning

Abstract

Representation learning and transfer learning are fundamental challenges in deep learning, particularly for downstream computer vision tasks. Existing research focuses predominantly on loss functions and network architectures. This dissertation instead explores how modeling distance metrics and data distributions can improve representation learning and the efficiency of transfer learning in computer vision, with a focus on image captioning and image retrieval.

We make contributions in three main areas. First, we develop an image captioning system that captures geometric relationships among objects through a novel object graph structure, which enriches global visual representations and yields more accurate captions. We further use adversarial learning to model data distributions, generating diverse yet coherent captions by matching feature discrepancies before and after the adversarial discriminator. Second, we improve image retrieval by aligning cross-modal distributions between the visual and textual domains. Our approach adapts representations across the vision and language modalities while minimizing the distributional gaps among different classes. We further improve retrieval performance through proxy-based domain adaptation within deep metric learning, which aligns data distributions to proxy feature points. Finally, we improve transfer learning efficiency by introducing learnable visual prompts that accompany the input images during fine-tuning of pre-trained models, enabling effective knowledge transfer to downstream image retrieval tasks. We show that learning deep metrics between visual prompts and aligning class-based information substantially improves the effectiveness of transfer learning.

Comprehensive experiments demonstrate that explicitly modeling distance metrics and data distributions substantially improves performance across a range of computer vision tasks. The proposed methods advance the state of the art in both representation learning and transfer learning, particularly for image captioning and image retrieval.

Completion Date

2025

Semester

Spring

Committee Chair

Hua, Kien

Degree

Doctor of Philosophy (Ph.D.)

College

College of Engineering and Computer Science

Department

Computer Science

Identifier

DP0029374

Document Type

Dissertation/Thesis

Campus Location

Orlando (Main) Campus