Temporal Video Segmentation, Semantic Linking, Spatiotemporal Video Attention


In this fast-paced digital age, a vast number of videos are produced every day, including movies, TV programs, personal home videos, and surveillance video. This creates a high demand for effective video data analysis and management techniques. In this dissertation, we have developed new techniques for the segmentation, linking, and understanding of video scenes. First, we have developed a video scene segmentation framework that segments the video content into story units. Then, a linking method is designed to find the semantic correlation between video scenes/stories. Finally, to better understand the video content, we have developed a spatiotemporal attention detection model for videos.

Our general framework for temporal scene segmentation, which is applicable to several video domains, is formulated statistically and uses the Markov chain Monte Carlo (MCMC) technique to determine the boundaries between video scenes. In this approach, a set of arbitrary scene boundaries is initialized at random locations and then automatically updated using two types of updates: diffusion and jumps. The posterior probability of the target distribution over the number of scenes and their corresponding boundary locations is computed from the model priors and the data likelihood. Model parameter updates are controlled by the MCMC hypothesis ratio test, and samples are collected to generate the final scene boundaries. The major contribution of the proposed framework is twofold: (1) it is able to find weak boundaries as well as strong boundaries, i.e., it does not rely on a fixed threshold; (2) it can be applied to different video domains. We have tested the proposed method on two video domains: home videos and feature films. On both domains we have obtained very accurate results, achieving on average 86% precision and 92% recall for home video segmentation, and 83% precision and 83% recall for feature films.
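The boundary search described above can be illustrated with a toy sketch. It is not the dissertation's model: it assumes each shot is summarized by a single number, uses within-scene scatter plus a per-scene penalty as a stand-in for the negative log-posterior, and omits the proposal-ratio correction that a full reversible-jump sampler would apply to split and merge moves. The names `energy`, `propose`, `segment`, `penalty`, and `temperature` are hypothetical, and the sketch returns the lowest-energy sample seen, which is only one possible way to turn collected samples into final boundaries.

```python
import math
import random

def energy(features, bounds, penalty):
    """Stand-in for the negative log-posterior: scatter within each scene
    (data term) plus a fixed penalty per scene (prior on scene count)."""
    cuts = [0] + sorted(bounds) + [len(features)]
    total = 0.0
    for lo, hi in zip(cuts, cuts[1:]):
        seg = features[lo:hi]
        mean = sum(seg) / len(seg)
        total += sum((x - mean) ** 2 for x in seg)
    return total + penalty * (len(cuts) - 1)

def propose(bounds, n, rng):
    """Return a candidate boundary set from one diffusion or jump move."""
    new = set(bounds)
    free = [i for i in range(1, n) if i not in new]
    move = rng.choice(["diffuse", "split", "merge"])
    if move == "diffuse" and new:
        # Diffusion: shift one existing boundary by a single shot.
        b = rng.choice(sorted(new))
        target = b + rng.choice([-1, 1])
        if 1 <= target < n and target not in new:
            new.remove(b)
            new.add(target)
    elif move == "split" and free:
        # Jump: add a boundary, increasing the number of scenes.
        new.add(rng.choice(free))
    elif move == "merge" and new:
        # Jump: remove a boundary, decreasing the number of scenes.
        new.remove(rng.choice(sorted(new)))
    return new

def segment(features, penalty=2.0, iters=5000, temperature=1.0, seed=0):
    """Run the chain from random boundaries; return the best sample seen."""
    rng = random.Random(seed)
    n = len(features)
    bounds = set(rng.sample(range(1, n), min(3, n - 1)))
    cur = energy(features, bounds, penalty)
    best, best_e = set(bounds), cur
    for _ in range(iters):
        cand = propose(bounds, n, rng)
        e = energy(features, cand, penalty)
        # Metropolis-style test: always accept improvements, and accept
        # worse candidates with probability exp(-(e - cur) / temperature).
        if e <= cur or rng.random() < math.exp((cur - e) / temperature):
            bounds, cur = cand, e
            if cur < best_e:
                best, best_e = set(bounds), cur
    return sorted(best)
```

For example, `segment([0.0] * 10 + [5.0] * 10 + [0.0] * 10)` searches a 30-shot sequence with two clear scene changes. Because no similarity threshold is involved, lowering `penalty` lets the same procedure keep weaker boundaries as well.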
The video scene segmentation process divides videos into meaningful units. These segments (or stories) can be further organized into clusters based on their content similarities. In the second part of this dissertation, we have developed a novel concept tracking method, which links news stories that focus on the same topic across multiple sources. The semantic linkage between news stories is reflected in the combination of their visual content and speech content. Visually, each news story is represented by a set of key frames, which may or may not contain human faces. The facial key frames are linked based on analysis of the extended facial regions, and the non-facial key frames are correlated using global matching. The textual similarity of the stories is expressed as the normalized similarity between the keywords in the speech content of the stories. The developed framework has also been applied to the task of story ranking, which computes the interestingness of the stories. The proposed semantic linking framework and the story ranking method have both been tested on 60 hours of open-benchmark video data (CNN and ABC news) from the TRECVID 2003 evaluation forum organized by NIST. System precision above 90% has been achieved for the story linking task. The combination of visual and speech cues has boosted the un-normalized recall by 15%. We have also developed PEGASUS, a content-based video retrieval system with fast speech and visual feature indexing and search. The system is available on the web.

Given a video sequence, one important task is to understand what is present or what is happening in its content. To achieve this goal, target objects or activities need to be detected, localized, and recognized in the spatial domain, the temporal domain, or both.
In the last portion of this dissertation, we present a visual attention detection method, which automatically generates the spatiotemporal saliency maps of input video sequences. The saliency map is later used in the detection of interesting objects and activities in videos by significantly narrowing the search range. Our spatiotemporal visual attention model generates the saliency maps based on both the spatial and temporal signals in the video sequences. In the temporal attention model, motion contrast is computed based on the planar motions (homographies) between images, which are estimated by applying RANSAC to point correspondences in the scene. To compensate for the non-uniform spatial distribution of interest points, the spanning areas of motion segments are incorporated in the motion contrast computation. In the spatial attention model, we have developed a fast method for computing pixel-level saliency maps using color histograms of images. Finally, a dynamic fusion technique is applied to combine the temporal and spatial saliency maps, where temporal attention dominates the spatial model when large motion contrast exists, and vice versa. The proposed spatiotemporal attention framework has been extensively applied to multiple video sequences to highlight interesting objects and motions present in the sequences. We have achieved an 82% user satisfaction rate on point-level attention detection and over 92% on object-level attention detection.
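The spatial model and the fusion step can be sketched roughly as follows. This is an illustration under stated assumptions, not the dissertation's formulation: it uses 8-bit grayscale intensities instead of color, scores each pixel by its histogram-weighted distance to all other intensities, and blends the two maps with a hypothetical weight rule (`fuse_saliency` and its `tau` parameter are invented here, since the abstract does not give the actual fusion formula).

```python
def histogram_saliency(gray):
    """Pixel-level saliency from an intensity histogram.

    gray is a list of rows of integers in 0..255. Every pixel with the
    same intensity gets the same score, so the work is done once per
    occupied histogram bin rather than once per pixel pair.
    """
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    used = [(level, count) for level, count in enumerate(hist) if count]
    # Score of level a: sum over levels b of count(b) * |a - b|.
    per_level = {
        a: sum(count * abs(a - b) for b, count in used) for a, _ in used
    }
    peak = max(per_level.values())
    scale = 1.0 / peak if peak > 0 else 0.0
    return [[per_level[v] * scale for v in row] for row in gray]

def fuse_saliency(spatial, temporal, tau=0.5):
    """Hypothetical dynamic fusion of two maps with values in [0, 1].

    The temporal weight grows with the peak motion contrast, so the
    temporal map dominates when motion contrast is large and the
    spatial map dominates when it is small.
    """
    peak = max(max(row) for row in temporal)
    w_t = peak / (peak + tau)
    return [
        [w_t * t + (1.0 - w_t) * s for s, t in zip(s_row, t_row)]
        for s_row, t_row in zip(spatial, temporal)
    ]
```

With a mostly uniform frame and one outlier pixel, `histogram_saliency` assigns the outlier the highest score. When the temporal map is all zeros, `fuse_saliency` returns the spatial map unchanged.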








Shah, Mubarak


Doctor of Philosophy (Ph.D.)


College of Engineering and Computer Science


Electrical Engineering and Computer Science

Degree Program

Computer Science










Access Status

Masters Thesis (Open Access)