Share Email Print

Proceedings Paper

Natural language insensitive short textual string compression
Author(s): Cornel Constantinescu; Jennifer Q. Trelewicz; Ronald B. Arps
Format Member Price Non-Member Price
PDF $14.40 $18.00

Paper Abstract

There are applications (such as Internet search engines) where short textual strings, for example abstracts or pieces of Web pages, need to be compressed independently of each other. The usual adaptive compression algorithms perform poorly on these short strings due to the lack of necessary data to learn. In this manuscript, we introduce a compression algorithm targeting short text strings; e.g., containing a few hundred symbols. We also target natural language insensitivity, to facilitate its robust compression and fast implementation. The algorithm is based on the following findings. Applying the move-to-front transform (MTFT) after the Burrows-Wheeler transform (BWT) brings the short textual strings to a "normalized form" where the distribution of the resulting "ranks" has a shape similar over the set of natural language strings. This facilitates the use of a static coding method with few variations, which we call shortBWT, where no on-line learning is needed, to encode the ranks. Finally, for short strings, shortBWT runs very fast because the strings fit into the cache of most current computers. The introduction for this paper will review the mathematical bases of BWT and MTF, it will also review our recently published metric for rapidly pre-characterizing the compressibility of such short textual strings when using these transforms.

Paper Details

Date Published: 28 January 2004
PDF: 10 pages
Proc. SPIE 5208, Mathematics of Data/Image Coding, Compression, and Encryption VI, with Applications, (28 January 2004); doi: 10.1117/12.507758
Show Author Affiliations
Cornel Constantinescu, IBM Almaden Research Ctr. (United States)
Jennifer Q. Trelewicz, IBM Almaden Research Ctr. (United States)
Ronald B. Arps, IBM Almaden Research Ctr. (United States)

Published in SPIE Proceedings Vol. 5208:
Mathematics of Data/Image Coding, Compression, and Encryption VI, with Applications
Mark S. Schmalz, Editor(s)

© SPIE. Terms of Use
Back to Top