Keywords

Perception, Video Segmentation, Detection, Tracking, LiDAR, Open-Vocabulary

Abstract

Perception is the cornerstone of autonomy, enabling agents to interpret complex environments through segmentation, detection, and tracking. These capabilities underpin intelligent systems, whether in self-driving vehicles navigating urban roads, aerial robots mapping diverse terrains, or embodied AI agents interacting naturally with people and objects. Despite sustained research effort, current perception models remain constrained by fragmented architectures, heavy annotation dependence, and poor generalization to unseen viewpoints, all of which hinder scalability and reliability. This dissertation advances scene understanding through innovations in segmentation, detection, and tracking, with contributions spanning four directions that emphasize scalability through end-to-end formulations in 2D and 3D, label efficiency through self-supervision, and open-vocabulary generalization across modalities. The first contribution introduces a unified bottom-up approach to video instance segmentation that formulates object association as spatio-temporal tag propagation: each pixel is assigned a learned embedding, referred to as a tag, which serves as a soft instance identifier across frames. Using a tagging loss and tag-based attention to dynamically group pixels into objects over time, this design achieves temporally coherent segmentation without region proposals or heuristic tracking. To eliminate reliance on dense annotations, the second contribution, CT-VOS, is a self-supervised framework for video object segmentation that learns object-centric representations from unlabeled videos by reconstructing occluded regions and enforcing temporal tag consistency, capturing motion and boundary cues without ground-truth labels. Extending perception to 3D, the third contribution, 3DMODT, is a transformer-based framework for joint detection and tracking in LiDAR point clouds that learns spatial localization and temporal association through attention-guided refinement, maintaining object identities under occlusion and motion variation. The final contribution establishes a contrastive alignment framework that connects aerial, ground-view, and textual representations in a shared embedding space, enabling open-vocabulary detection and cross-view generalization. Together, these contributions establish unified, label-efficient, and generalizable perception frameworks capable of operating reliably in dynamic real-world settings.
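
As a concrete illustration of the tag-based grouping idea in the first contribution, the following is a minimal PyTorch sketch of a pull/push tagging loss: per-pixel tags of the same instance are pulled toward a shared reference tag, while reference tags of different instances are pushed apart. The tensor shapes, margin value, and function name are illustrative assumptions, not the dissertation's exact formulation.

    import torch


    def tagging_loss(tags: torch.Tensor, instance_masks: torch.Tensor, margin: float = 1.0):
        """tags: (N, D) per-pixel tag embeddings; instance_masks: (K, N) boolean masks,
        one row per ground-truth instance. Pixels of the same instance are pulled toward
        their instance mean tag; mean tags of different instances are pushed apart."""
        means, pull = [], tags.new_zeros(())
        for mask in instance_masks:
            inst_tags = tags[mask]                     # tags of pixels in one instance
            if inst_tags.shape[0] == 0:
                continue
            mu = inst_tags.mean(dim=0)                 # reference tag for the instance
            pull = pull + ((inst_tags - mu) ** 2).sum(dim=1).mean()
            means.append(mu)
        if len(means) < 2:
            return pull
        means = torch.stack(means)                     # (K', D)
        dists = torch.cdist(means, means)              # pairwise distances between instances
        off_diag = ~torch.eye(len(means), dtype=torch.bool, device=tags.device)
        push = torch.clamp(margin - dists[off_diag], min=0).pow(2).mean()
        return pull / len(means) + push


    # Toy example: 2 instances over 6 pixels with 4-dimensional tags.
    tags = torch.randn(6, 4, requires_grad=True)
    masks = torch.tensor([[1, 1, 1, 0, 0, 0],
                          [0, 0, 0, 1, 1, 1]], dtype=torch.bool)
    tagging_loss(tags, masks).backward()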
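
Similarly, the cross-modal alignment in the final contribution can be sketched as a symmetric InfoNCE objective applied pairwise to aerial, ground-view, and text embeddings that share one space. The encoders, batch pairing, and temperature below are assumptions for illustration, not the dissertation's exact training setup.

    import torch
    import torch.nn.functional as F


    def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07):
        """a, b: (B, D) paired embeddings; row i of a corresponds to row i of b."""
        a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
        logits = a @ b.t() / temperature               # (B, B) cosine-similarity logits
        targets = torch.arange(a.shape[0], device=a.device)
        # Symmetric cross-entropy: match a -> b and b -> a against the diagonal.
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


    # Toy batch: aerial, ground-view, and text embeddings for the same 8 scenes.
    aerial, ground, text = (torch.randn(8, 256) for _ in range(3))
    loss = info_nce(aerial, text) + info_nce(ground, text) + info_nce(aerial, ground)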

Completion Date

2025

Semester

Fall

Committee Chair

Shah, Mubarak

Degree

Doctor of Philosophy (Ph.D.)

College

College of Engineering and Computer Science

Department

Computer Science

Format

PDF

Identifier

DP0029774

Document Type

Dissertation

Campus Location

Orlando (Main) Campus
