Title

A Dictionary-Based Multi-Corpora Text Compression System

Keywords

Computer science; Data compression; Data structures; Decoding; Dictionaries; Encoding; Engines; Frequency; Sun

Abstract

Summary form only given. StarZip, a multi-corpora text compression system, was introduced together with its transform engine StarNT. One of the key features of the StarZip compression system is its use of domain-specific dictionaries and the tools it provides to develop such dictionaries. StarNT was utilized because it achieves a better compression ratio than almost all other recent efforts based on BWT and PPM. StarNT is a dictionary-based fast lossless text transform. The main idea is to represent each English word with a codeword of no more than three symbols. This transform maintains most of the original context information at the word level and provides an "artificial" strong context. It ultimately reduces the size of the transformed text that, in turn, is provided to a backend compressor. The dictionary data structure provides very fast transform encoding with low storage overhead. StarNT also treats the transformed codewords as offsets into the transform dictionary, so the transform decoder can locate a word in the dictionary in constant time. Experimental results have shown that the average compression time improved by orders of magnitude compared to the previous dictionary-based transform LIPT. The complexity and compression performance of bzip2, in conjunction with this transform, are better than those of both gzip and PPMD. Results from five corpora have shown that StarZip achieved an average improvement in compression performance (in terms of BPC) of 13% over bzip2-9, 19% over gzip-9, and 10% over PPMD.
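The offset-based codeword scheme described above can be sketched as follows. This is a minimal, hypothetical reconstruction for illustration only: it assumes a codeword alphabet of 26 lowercase letters and codewords of one to three symbols, which yields 26 + 26² + 26³ addressable dictionary entries; the actual StarNT alphabet, dictionary ordering, and escape mechanism for out-of-dictionary words are not specified here.

```python
# Hypothetical sketch of a StarNT-style offset/codeword mapping.
# Assumption: codewords are 1-3 lowercase letters; shorter codewords
# are assigned to lower dictionary offsets (e.g. more frequent words).
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def index_to_codeword(i: int) -> str:
    """Map a dictionary offset to a codeword of 1-3 letters (encoder side)."""
    if i < 26:                      # offsets 0..25: one-letter codewords
        return LETTERS[i]
    i -= 26
    if i < 26 * 26:                 # next 676 offsets: two-letter codewords
        return LETTERS[i // 26] + LETTERS[i % 26]
    i -= 26 * 26                    # remaining offsets: three-letter codewords
    return LETTERS[i // 676] + LETTERS[(i // 26) % 26] + LETTERS[i % 26]

def codeword_to_index(cw: str) -> int:
    """Inverse mapping (decoder side): a direct arithmetic computation,
    so the decoder finds the dictionary slot in constant time."""
    v = 0
    for c in cw:
        v = v * 26 + LETTERS.index(c)
    if len(cw) == 1:
        return v
    if len(cw) == 2:
        return 26 + v
    return 26 + 676 + v

def transform(text: str, dictionary: list[str]) -> str:
    """Replace each dictionary word by its codeword; other tokens pass
    through unchanged (a real system would need an escape convention)."""
    word_to_cw = {w: index_to_codeword(i) for i, w in enumerate(dictionary)}
    return " ".join(word_to_cw.get(w, w) for w in text.split())
```

The transformed text, now built from short, highly regular codewords, would then be handed to a backend compressor such as bzip2, which is where the reported BPC gains come from.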

Publication Date

1-1-2003

Publication Title

Data Compression Conference Proceedings

Volume

2003-January

Number of Pages

448-

Document Type

Article; Proceedings Paper

Personal Identifier

scopus

DOI Link

https://doi.org/10.1109/DCC.2003.1194067

Scopus ID

84942245883 (Scopus)

Source API URL

https://api.elsevier.com/content/abstract/scopus_id/84942245883
