Cross-view images, i.e., images taken from aerial and street views, contain drastically different representations of the same scene at a given location. Because of the difference in camera viewpoints between ground and aerial images, the same semantic concepts look very different in the two views, which makes relating them very challenging. It therefore becomes crucial to explore cross-view relations and learn representations under which images from the two domains can be associated. In this dissertation, we explore the relationship between ground and aerial views through synthesis and matching. First, we explore supervised approaches to the cross-view image synthesis problem: generating realistic images in a target (e.g., ground) view, given an image from a source (e.g., aerial) view. We solve this problem by using Generative Adversarial Networks (GANs) to synthesize the target-view images together with an auxiliary output, the target-view segmentation maps, from source-view images. We enforce that the networks correctly align and orient the different semantics in the scene by jointly penalizing them on the quality of both the target-view images and the semantic segmentation maps. Next, we exploit the geometric cues between aerial and ground images and attempt to preserve pixels from the aerial images when synthesizing the ground images. We use a homography to transform the aerial image to the street view and preserve the pixels in the overlapping field of view, then inpaint the remaining regions of the ground image. Using geometrically transformed images as input eases the network's burden in synthesizing cross-view images. Following the cross-view image synthesis problem, we address the cross-view image matching problem.
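The homography step above can be illustrated with a minimal sketch. This is not the dissertation's actual pipeline; it only shows how a plane-to-plane homography is estimated from point correspondences via the Direct Linear Transform and then applied to map an aerial-plane point to its street-view location. The four correspondences are hypothetical stand-ins for points that would, in practice, come from the camera geometry.

```python
import numpy as np

def estimate_homography(src, dst):
    """Direct Linear Transform: solve for the 3x3 H mapping src -> dst
    from four (or more) point correspondences, up to scale."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the null vector of A, i.e. the last right
    # singular vector from the SVD.
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

# Hypothetical correspondences: corners of a 512x512 aerial patch and
# their locations on the ground (street-view) image plane.
src = [(0, 0), (511, 0), (511, 511), (0, 511)]
dst = [(100, 200), (411, 200), (511, 511), (0, 511)]
H = estimate_homography(src, dst)

# Warping a source corner with H (in homogeneous coordinates) should
# reproduce its ground-plane location, here approximately (100, 200).
p = H @ np.array([0.0, 0.0, 1.0])
p = p / p[2]
```

Pixels that fall outside the overlapping field of view after such a warp are exactly the regions the inpainting stage described above would have to fill in.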
We propose a novel framework that uses the synthesized images to bridge the domain gap between images from the two (aerial and ground) viewpoints and helps learn better features for cross-view image matching. Finally, the last part of the dissertation addresses the problem of matching the frames of a video against geo-tagged reference images for the purpose of geo-localization. We develop a novel method that learns coherent features for the individual frames of a query video by attending to all frames of the video. We conduct extensive evaluations to validate that the proposed approach performs better than methods that learn image features independently.
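Cross-view matching of this kind is typically framed as metric learning: ground queries and aerial references are embedded into a shared feature space, and retrieval reduces to a nearest-neighbor search. The sketch below shows only that final retrieval step with random stand-in embeddings (the embedding networks themselves are not part of this example), ranking references by cosine similarity to a query.

```python
import numpy as np

def cosine_retrieve(query, references):
    """Rank reference embeddings by cosine similarity to a query,
    most similar first."""
    q = query / np.linalg.norm(query)
    R = references / np.linalg.norm(references, axis=1, keepdims=True)
    scores = R @ q                      # cosine similarity per reference
    return np.argsort(-scores)          # descending order of similarity

rng = np.random.default_rng(0)
refs = rng.normal(size=(5, 32))                  # stand-in aerial embeddings
query = refs[3] + 0.01 * rng.normal(size=32)     # ground embedding near ref 3

ranking = cosine_retrieve(query, refs)
# ranking[0] should recover reference 3 as the top match
```

In a full system the query embedding would come from a ground image (or an attended video frame) and the references from geo-tagged aerial images, so the top-ranked reference yields the predicted location.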



Graduation Date

Advisor

Shah, Mubarak

Degree

Doctor of Philosophy (Ph.D.)

College

College of Engineering and Computer Science

Department

Computer Science

Degree Program

Computer Science

Release Date

August 2021

Length of Campus-only Access

Access Status

Doctoral Dissertation (Open Access)