A Dictionary-Based Multi-Corpora Text Compression System


Computer science; Data compression; Data structures; Decoding; Dictionaries; Encoding; Engines; Frequency; Sun


Summary form only given. StarZip, a multi-corpora text compression system, is introduced together with its transform engine, StarNT. A key feature of StarZip is its use of domain-specific dictionaries, along with tools for developing them. StarNT was chosen because it achieves a better compression ratio than almost all other recent efforts based on BWT and PPM. StarNT is a fast, dictionary-based lossless text transform. The main idea is to recode each English word with a representation of no more than three symbols. The transform maintains most of the original context information at the word level and provides an "artificial" strong context. It ultimately reduces the size of the transformed text, which is then passed to a backend compressor. The data structure used for the transform dictionary provides very fast transform encoding with low storage overhead. StarNT also treats each transformed codeword as the offset of a word in the transform dictionary, so the transform decoder can look a word up directly rather than searching for it. Experimental results show that the average compression time improves by orders of magnitude compared with the previous dictionary-based transform, LIPT. The complexity and compression performance of bzip2 in conjunction with this transform are better than those of both gzip and PPMD. Results from five corpora show that StarZip achieves an average improvement in compression performance (in terms of bits per character, BPC) of 13% over bzip2-9, 19% over gzip-9, and 10% over PPMD.
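The word-to-codeword idea in the abstract can be illustrated with a minimal sketch. It is not the StarNT implementation: the 52-letter codeword alphabet, the `*` escape for out-of-dictionary words, the single-space tokenization, the tiny word list, and all function names are assumptions made here for illustration. It only shows how a codeword of at most three symbols can double as an offset into the dictionary, so that decoding becomes a direct list index.

```python
import string

# Assumed codeword alphabet: 52 letters, so 52 + 52**2 + 52**3 codewords fit in <= 3 symbols.
ALPHABET = string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)
MAX_LEN = 3
ESCAPE = "*"  # hypothetical marker for words that are not in the dictionary


def index_to_codeword(index):
    """Map a dictionary offset to a codeword; smaller offsets get shorter codewords."""
    for length in range(1, MAX_LEN + 1):
        block = BASE ** length  # number of codewords of this length
        if index < block:
            symbols = []
            for _ in range(length):
                index, digit = divmod(index, BASE)
                symbols.append(ALPHABET[digit])
            return "".join(reversed(symbols))
        index -= block
    raise ValueError("offset does not fit in three symbols")


def codeword_to_index(code):
    """Inverse of index_to_codeword: recover the dictionary offset."""
    if not 1 <= len(code) <= MAX_LEN:
        raise ValueError("codeword must have one to three symbols")
    offset = sum(BASE ** n for n in range(1, len(code)))  # skip all shorter codewords
    value = 0
    for ch in code:
        value = value * BASE + ALPHABET.index(ch)
    return offset + value


def encode(text, word_to_index):
    """Replace dictionary words by codewords; escape everything else."""
    out = []
    for token in text.split(" "):
        if token in word_to_index:
            out.append(index_to_codeword(word_to_index[token]))
        else:
            out.append(ESCAPE + token)
    return " ".join(out)


def decode(transformed, words):
    """Undo encode(); a codeword is resolved by indexing the word list directly."""
    out = []
    for token in transformed.split(" "):
        if token.startswith(ESCAPE):
            out.append(token[1:])
        else:
            out.append(words[codeword_to_index(token)])
    return " ".join(out)


# Toy dictionary, assumed to be ordered from most to least frequent word.
words = ["the", "of", "and", "compression"]
word_to_index = {w: i for i, w in enumerate(words)}

transformed = encode("the art of compression", word_to_index)
restored = decode(transformed, words)
```

In this sketch a Python `dict` stands in for the encoder-side dictionary structure, and the transformed text would then be handed to a backend compressor such as bzip2.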

Publication Title

Data Compression Conference Proceedings



Document Type

Article; Proceedings Paper

Scopus ID

84942245883 (Scopus)
