Need faster text searches on compressed data? This paper introduces a fast compression technique for natural language texts in which arbitrary portions of the text can be decompressed very efficiently, direct search for words and phrases is possible, and approximate search can also be done efficiently without any decoding. The compression scheme uses a semistatic word-based model and a Huffman code whose coding alphabet is byte-oriented rather than bit-oriented. It compresses typical English texts to about 30% of their original size, against 40% and 35% for *Compress* and *Gzip*, respectively. The paper presents three algorithms for searching the compressed text. The experiments show that running these algorithms on a compressed text is twice as fast as running the best existing software on the uncompressed version of the same text, and up to 8 times faster when searching for complex or approximate patterns. As a result, the text can be kept compressed all the time and decompressed only for display purposes.
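To make the idea of a semistatic word-based model with a byte-oriented Huffman code more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the tokenizer (words and separators as distinct symbols), the function names, and the `degree` parameter are all hypothetical, and `count_word` is a naive codeword-aligned scan that stands in for the paper's three search algorithms, which are not reproduced here.

```python
import heapq
import re
from collections import Counter
from itertools import count


def tokenize(text):
    # Assumption: words and separator runs are both treated as symbols.
    return re.findall(r"\w+|\W+", text)


def build_byte_huffman(freqs, degree=256):
    """Return {symbol: codeword as bytes} for a degree-ary Huffman code."""
    tie = count()  # unique tie-breaker so heap entries never compare nodes
    heap = [(f, next(tie), sym) for sym, f in freqs.items()]
    # Pad with zero-frequency dummies so every merge takes exactly `degree` nodes.
    while len(heap) < 2 or (len(heap) - 1) % (degree - 1) != 0:
        heap.append((0, next(tie), None))
    heapq.heapify(heap)
    while len(heap) > 1:
        children = [heapq.heappop(heap) for _ in range(degree)]
        total = sum(c[0] for c in children)
        heapq.heappush(heap, (total, next(tie), [c[2] for c in children]))
    codes = {}
    stack = [(heap[0][2], b"")]
    while stack:
        node, prefix = stack.pop()
        if isinstance(node, list):  # internal node: one byte per branch
            for i, child in enumerate(node):
                stack.append((child, prefix + bytes([i])))
        elif node is not None:  # real symbol (dummies are None)
            codes[node] = prefix
    return codes


def compress(text, degree=256):
    tokens = tokenize(text)
    codes = build_byte_huffman(Counter(tokens), degree)  # pass 1: build the model
    return codes, b"".join(codes[t] for t in tokens)     # pass 2: encode


def decompress(codes, data):
    decode = {c: s for s, c in codes.items()}
    out, cur = [], b""
    for b in data:
        cur += bytes([b])
        if cur in decode:  # prefix code: the first match is a whole codeword
            out.append(decode[cur])
            cur = b""
    return "".join(out)


def count_word(codes, data, word):
    """Count occurrences of `word` by comparing codewords, without rebuilding text."""
    target = codes.get(word)
    if target is None:
        return 0  # not in the vocabulary, so it cannot occur
    valid = set(codes.values())
    hits, start = 0, 0
    for end in range(1, len(data) + 1):
        if data[start:end] in valid:  # reached a codeword boundary
            hits += data[start:end] == target
            start = end
    return hits


text = "the cat and the dog and the bird"
codes, data = compress(text)
print(len(text), len(data), count_word(codes, data, "the"))
```

The two passes in `compress` are what makes the model semistatic: word frequencies are gathered over the whole text first, and the resulting code stays fixed during encoding. Because every codeword is a whole number of bytes, decoding and comparison work on bytes rather than on individual bits. The vocabulary (`codes`) would have to be stored alongside the compressed data, a cost this sketch does not account for.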
Published in ACM Transactions on Information Systems, this paper fits within the journal's focus on information retrieval and data management techniques. The research on fast and flexible word searching in compressed text enhances efficiency in information processing, aligning with the journal's core interests.