Keywords
Perception, Video Segmentation, Detection, Tracking, LiDAR, Open-Vocabulary
Abstract
Perception is the cornerstone of autonomy, enabling agents to interpret complex environments through segmentation, detection, and tracking. These capabilities are fundamental to intelligent systems, whether they guide self-driving vehicles along urban roads, help aerial robots map diverse terrains, or allow embodied AI agents to interact naturally with people and objects. Despite sustained research effort, current perception models remain constrained by fragmented architectures, heavy annotation dependence, and poor generalization to unseen viewpoints, all of which hinder scalability and reliability. This dissertation advances scene understanding through innovations in segmentation, detection, and tracking, with contributions spanning four directions that emphasize scalability through end-to-end formulations in 2D and 3D, label efficiency through self-supervision, and open-vocabulary generalization across modalities. The first contribution introduces a unified bottom-up approach to video instance segmentation that formulates object association as spatio-temporal tag propagation: each pixel is assigned a learned embedding, termed a tag, that serves as a soft instance identifier across frames. Using a tagging loss and tag-based attention to dynamically group pixels into objects over time, this design achieves temporally coherent segmentation without region proposals or heuristic tracking. To eliminate reliance on dense annotations, the second contribution, CT-VOS, is a self-supervised framework for video object segmentation that learns object-centric representations from unlabeled videos by reconstructing occluded regions and enforcing temporal tag consistency, capturing motion and boundary cues without ground-truth labels. Extending perception to 3D, the third contribution, 3DMODT, presents a transformer-based framework for joint detection and tracking in LiDAR point clouds that learns spatial localization and temporal association through attention-guided refinement, maintaining object identities under occlusion and motion variation. The final contribution develops a contrastive alignment framework that connects aerial, ground-view, and textual representations within a shared embedding space, enabling open-vocabulary detection and cross-view generalization. Together, these contributions establish unified, label-efficient, and generalizable perception frameworks capable of operating reliably in dynamic real-world settings.
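To make the first contribution's tagging idea concrete, the sketch below shows a minimal pull/push loss over per-pixel tag embeddings, in the spirit of associative embeddings: pixels of the same instance are pulled toward their instance's mean tag, and distinct instance means are pushed apart. The function name, margin, and exact formulation are assumptions for illustration, not the dissertation's actual tagging loss.

```python
import torch
import torch.nn.functional as F

def tagging_loss(tags, instance_ids, margin=1.0):
    # tags: (N, D) per-pixel tag embeddings; instance_ids: (N,) integer labels.
    # Pull pixels toward their instance's mean tag; push distinct instance
    # means at least `margin` apart (hypothetical associative-embedding style).
    centers, pull_terms = [], []
    for inst in instance_ids.unique():
        members = tags[instance_ids == inst]        # (M, D) pixels of one instance
        center = members.mean(dim=0)
        pull_terms.append(((members - center) ** 2).sum(dim=1).mean())
        centers.append(center)
    centers = torch.stack(centers)                  # (K, D) instance means
    K = centers.size(0)
    if K > 1:
        off_diag = ~torch.eye(K, dtype=torch.bool, device=tags.device)
        push = F.relu(margin - torch.cdist(centers, centers))[off_diag].mean()
    else:
        push = tags.new_zeros(())
    return torch.stack(pull_terms).mean() + push

# Toy usage: 500 pixels, 16-dim tags, 5 ground-truth instances.
tags = torch.randn(500, 16, requires_grad=True)
ids = torch.randint(0, 5, (500,))
tagging_loss(tags, ids).backward()
```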
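For the second contribution, the abstract describes learning from unlabeled video by reconstructing occluded regions. The following is a generic masked-reconstruction proxy for that idea, assuming random pixel occlusion and an MSE loss on the hidden region; CT-VOS's actual occlusion modeling and temporal consistency terms are not reproduced here.

```python
import torch
import torch.nn as nn

class MaskedRecon(nn.Module):
    # Hypothetical stand-in: reconstruct randomly occluded regions of a
    # frame from visible context, a generic self-supervised proxy for the
    # occlusion-reconstruction objective summarized in the abstract.
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, 3, 3, padding=1))

    def forward(self, frame):
        mask = (torch.rand_like(frame[:, :1]) > 0.5).float()   # hide ~half the pixels
        recon = self.net(frame * mask)
        return ((recon - frame) ** 2 * (1 - mask)).mean()      # loss only on hidden pixels

loss = MaskedRecon()(torch.randn(2, 3, 64, 64))  # batch of 2 RGB frames
loss.backward()
```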
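The third contribution couples detection with temporal association via attention. As a minimal illustration of attention-style association (not 3DMODT's actual module), a scaled dot-product affinity between per-object features of consecutive frames can act as a soft assignment of new detections to existing tracks; the function below assumes such per-object features are already available.

```python
import torch

def associate(prev_feats, curr_feats):
    # prev_feats: (M, D) features of M active tracks; curr_feats: (N, D)
    # features of N current detections. Softmax over scaled dot-product
    # affinities gives a soft assignment of detections to tracks.
    affinity = curr_feats @ prev_feats.t() / prev_feats.size(1) ** 0.5
    assign = affinity.softmax(dim=1)     # (N, M) soft track assignment
    return assign.argmax(dim=1)          # hard track index per detection

prev = torch.randn(5, 32)   # 5 active tracks
curr = torch.randn(7, 32)   # 7 new detections
print(associate(prev, curr))  # track index chosen for each detection
```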
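Finally, the cross-modal alignment in the fourth contribution is contrastive. A standard way to realize such alignment is a CLIP-style symmetric InfoNCE objective over paired embeddings, sketched below; the function name, temperature, and pairwise scheme over aerial, ground, and text embeddings are assumptions, and the dissertation's objective may differ in its details.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(a, b, temperature=0.07):
    # a, b: (B, D) L2-normalized embeddings of paired samples, e.g.
    # aerial crop <-> caption or aerial <-> ground view. Matching pairs
    # lie on the diagonal of the (B, B) similarity matrix.
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: align three views/modalities pairwise into one shared space.
B, D = 8, 256
aerial = F.normalize(torch.randn(B, D), dim=-1)
ground = F.normalize(torch.randn(B, D), dim=-1)
text = F.normalize(torch.randn(B, D), dim=-1)
loss = symmetric_infonce(aerial, text) + symmetric_infonce(aerial, ground)
```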
Completion Date
2025
Semester
Fall
Committee Chair
Shah, Mubarak
Degree
Doctor of Philosophy (Ph.D.)
College
College of Engineering and Computer Science
Department
Computer Science
Identifier
DP0029774
Document Type
Thesis
Campus Location
Orlando (Main) Campus
STARS Citation
Kini, Jyoti, "Exploring Segmentation, Detection and Tracking in Videos" (2025). Graduate Thesis and Dissertation post-2024. 466.
https://stars.library.ucf.edu/etd2024/466