## Keywords

compression, pattern matching, JPEG-LS, LZW, information retrieval, compressed pattern matching

## Abstract

The amount of information that we are dealing with today is being generated at an ever-increasing rate. On one hand, data compression is needed to efficiently store, organize the data and transport the data over the limited-bandwidth network. On the other hand, efficient information retrieval is needed to speedily find the relevant information from this huge mass of data using available resources. The compressed pattern matching problem can be stated as: given the compressed format of a text or an image and a pattern string or a pattern image, report the occurrence(s) of the pattern in the text or image with minimal (or no) decompression. The main advantages of compressed pattern matching versus the naïve decompress-then-search approach are: First, reduced storage cost. Since there is no need to decompress the data or there is only minimal decompression required, the disk space and the memory cost is reduced. Second, less search time. Since the size of the compressed data is smaller than that of the original data, a searching performed on the compressed data will result in a shorter search time. The challenge of efficient compressed pattern matching can be met from two inseparable aspects: First, to utilize effectively the full potential of compression for the information retrieval systems, there is a need to develop search-aware compression algorithms. Second, for data that is compressed using a particular compression technique, regardless whether the compression is search-aware or not, we need to develop efficient searching techniques. This means that techniques must be developed to search the compressed data with no or minimal decompression and with not too much extra cost. Compressed pattern matching algorithms can be categorized as either for text compression or for image compression. Although compressed pattern matching for text compression has been studied for a few years and many publications are available in the literature, there is still room to improve the efficiency in terms of both compression and searching. None of the search engines available today make explicit use of compressed pattern matching. Compressed pattern matching for image compression, on the other hand, has been relatively unexplored. However, it is getting more attention because lossless compression has become more important for the ever-increasing large amount of medical images, satellite images and aerospace photos, which requires the data to be losslessly stored. Developing efficient information retrieval techniques from the losslessly compressed data is therefore a fundamental research challenge. In this dissertation, we have studied compressed pattern matching problem for both text and images. We present a series of novel compressed pattern matching algorithms, which are divided into two major parts. The first major work is done for the popular LZW compression algorithm. The second major work is done for the current lossless image compression standard JPEG-LS. Specifically, our contributions from the first major work are: 1. We have developed an "almost-optimal" compressed pattern matching algorithm that reports all pattern occurrences. An earlier "almost-optimal" algorithm reported in the literature is only capable of detecting the first occurrence of the pattern and the practical performance of the algorithm is not clear. We have implemented our algorithm and provide extensive experimental results measuring the speed of our algorithm. We also developed a faster implementation for so-called "simple patterns". The simple patterns are patterns that no unique symbol appears more than once. The algorithm takes advantage of this property and runs in optimal time. 2. We have developed a novel compressed pattern matching algorithm for multiple patterns using the Aho-Corasick algorithm. The algorithm takes O(mt+n+r) time with O(mt) extra space, where n is the size of the compressed file, m is the total size of all patterns, t is the size of the LZW trie and r is the number of occurrences of the patterns. The algorithm is particularly efficient when being applied on archival search if the archives are compressed with a common LZW trie. All the above algorithms have been implemented and extensive experiments have been conducted to test the performance of our algorithms and to compare with the best existing algorithms. The experimental results show that our compressed pattern matching algorithm for multiple patterns is competitive among the best algorithms and is practically the fastest among all approaches when the number of patterns is not very large. Therefore, our algorithm is preferable for general string matching applications. LZW is one of the most efficient and popular compression algorithms used extensively and both of our algorithms require no modification on the compression algorithm. Our work, therefore, has great economical and market potential Our contributions from the second major work are: 1 We have developed a new global context variation of the JPEG-LS compression algorithm and the corresponding compressed pattern matching algorithm. Comparing to the original JPEG-LS, the global context variation is search-aware and has faster encoding and decoding speeds. The searching algorithm based on the global-context variation requires partial decompression of the compressed image. The experimental results show that it improves the search speed by about 30% comparing to the decompress-then-search approach. Based on our best knowledge, this is the first two-dimensional compressed pattern matching work for the JPEG-LS standard. 2 We have developed a two-pass variation of the JPEG-LS algorithm and the corresponding compressed pattern matching algorithm. The two-pass variation achieves search-awareness through a common compression technique called semi-static dictionary. Comparing to the original algorithm, the compression of the new algorithm is equally well but the encoding takes slightly longer. The searching algorithm based on the two-pass variation requires no decompression at all and therefore works in the fully compressed domain. It runs in time O(nc+mc+nm+m^2) with extra space O(n+m+mc), where n is the number of columns of the image, m is the number of rows and columns of the pattern, nc is the compressed image size and mc is the compressed pattern size. The algorithm is the first known two-dimensional algorithm that works in the fully compressed domain.

## Notes

If this is your thesis or dissertation, and want to learn how to access it or for more information about readership statistics, contact us at STARS@ucf.edu

## Graduation Date

2005

## Semester

Spring

## Advisor

Mukherjee, Amar

## Degree

Doctor of Philosophy (Ph.D.)

## College

College of Engineering and Computer Science

## Department

Computer Science

## Degree Program

Computer Science

## Format

application/pdf

## Identifier

CFE0000471

## URL

http://purl.fcla.edu/fcla/etd/CFE0000471

## Language

English

## Release Date

May 2005

## Length of Campus-only Access

None

## Access Status

Doctoral Dissertation (Open Access)

## STARS Citation

Tao, Tao, "Compressed Pattern Matching For Text And Images" (2005). *Electronic Theses and Dissertations*. 402.

https://stars.library.ucf.edu/etd/402