Title
A Dictionary-Based Multi-Corpora Text Compression System
Keywords
Computer science; Data compression; Data structures; Decoding; Dictionaries; Encoding; Engines; Frequency; Sun
Abstract
Summary form only given. StarZip, a multi-corpora text compression system, is introduced together with its transform engine, StarNT. A key feature of StarZip is its use of domain-specific dictionaries, along with tools for developing such dictionaries. StarNT was chosen because it achieves a better compression ratio than almost all other recent efforts based on BWT and PPM. StarNT is a fast, dictionary-based, lossless text transform. The main idea is to recode each English word with a representation of no more than three symbols. The transform preserves most of the original context information at the word level and provides an "artificial" strong context. It ultimately reduces the size of the transformed text, which is in turn passed to a backend compressor. The dictionary data structure provides very fast transform encoding with low storage overhead. StarNT also treats each transformed codeword as an offset of a word in the transform dictionary, so the transform decoder can look up a word in the dictionary in constant time. Experimental results show that the average compression time improves by orders of magnitude compared with the previous dictionary-based transform, LIPT. The complexity and compression performance of bzip2 used in conjunction with this transform are better than those of both gzip and PPMD. Results from five corpora show that StarZip achieves an average improvement in compression performance (in terms of BPC) of 13% over bzip2-9, 19% over gzip-9, and 10% over PPMD.
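The idea of replacing each dictionary word with a codeword of at most three symbols, and of reading a codeword back as an offset into the dictionary, can be illustrated with a minimal sketch. This is not the authors' implementation. The 52-letter codeword alphabet, the frequency-ordered word list, the `*` escape marker for out-of-dictionary words, the space-based tokenization, and the use of a Python dict for encoder lookup are all assumptions made for illustration; the abstract does not specify any of them.

```python
import string

# Assumption: codeword symbols are the 52 ASCII letters (not stated in the abstract).
ALPHABET = string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)
SYMBOL_VALUE = {symbol: value for value, symbol in enumerate(ALPHABET)}
MAX_LEN = 3    # "no more than three symbols", per the abstract
ESCAPE = "*"   # hypothetical marker for words missing from the dictionary


def index_to_codeword(index: int) -> str:
    """Map a dictionary offset to a codeword of 1 to MAX_LEN symbols."""
    if index < 0:
        raise ValueError("dictionary offset must be non-negative")
    for length in range(1, MAX_LEN + 1):
        block = BASE ** length  # number of codewords with this length
        if index < block:
            symbols = []
            for _ in range(length):
                index, digit = divmod(index, BASE)
                symbols.append(ALPHABET[digit])
            return "".join(reversed(symbols))
        index -= block
    raise ValueError("dictionary offset exceeds the three-symbol codeword space")


def codeword_to_index(codeword: str) -> int:
    """Recover the dictionary offset from a codeword by arithmetic alone."""
    # Skip past all codewords that are shorter than this one.
    offset = sum(BASE ** k for k in range(1, len(codeword)))
    value = 0
    for symbol in codeword:
        value = value * BASE + SYMBOL_VALUE[symbol]
    return offset + value


def build_tables(words_by_frequency):
    """Assumption: the most frequent words come first and get the shortest codewords."""
    decode_table = list(words_by_frequency)
    encode_table = {w: index_to_codeword(i) for i, w in enumerate(decode_table)}
    return encode_table, decode_table


def encode(text: str, encode_table) -> str:
    """Replace dictionary words with codewords; escape everything else."""
    return " ".join(encode_table.get(w, ESCAPE + w) for w in text.split(" "))


def decode(text: str, decode_table) -> str:
    """Invert encode(): each codeword is an offset into decode_table."""
    words = []
    for token in text.split(" "):
        if token.startswith(ESCAPE):
            words.append(token[len(ESCAPE):])
        else:
            words.append(decode_table[codeword_to_index(token)])
    return " ".join(words)
```

Under these assumptions, the codeword space holds 52 + 52² + 52³ = 143,364 dictionary entries, and decoding a codeword needs only a few arithmetic operations followed by one list index, which is what makes the decoder lookup independent of dictionary size. The transformed text would then be handed to a backend compressor such as bzip2.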
Publication Date
1-1-2003
Publication Title
Data Compression Conference Proceedings
Volume
2003-January
Number of Pages
448-
Document Type
Article; Proceedings Paper
Personal Identifier
scopus
DOI Link
https://doi.org/10.1109/DCC.2003.1194067
Copyright Status
Unknown
Scopus ID
84942245883 (Scopus)
Source API URL
https://api.elsevier.com/content/abstract/scopus_id/84942245883
STARS Citation
Sun, Weifeng; Zhang, Nan; and Mukherjee, Amar, "A Dictionary-Based Multi-Corpora Text Compression System" (2003). Scopus Export 2000s. 1958.
https://stars.library.ucf.edu/scopus2000/1958