1 Introduction
1.1 Preceding History
A major milestone in data compression was the Lempel-Ziv algorithm, a form of dictionary compression, since it consists of backward references to data slices that have already been seen. The field of statistical data compression was at first stationary; here Huffman's optimal prefix codes can be mentioned for entropy coding before arithmetic coding was known. The basic idea is to use few bits for frequent symbols but more for infrequent symbols. Entropy coding developed further with adaptive coding from the mid-80s. One major innovation, PPM (Prediction by Partial Matching), usually works on order-N byte contexts and predictions, whereas DMC (Dynamic Markov Compression) predicts bits. Adaptive Huffman coding was developed as well. A decade later, CTW (Context Tree Weighting) also utilizes bit prediction but mixes the learned distributions of the order-N byte contexts.
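The entropy-coding principle mentioned above (short codes for frequent symbols, longer codes for rare ones) can be illustrated with a small Huffman construction. This is an illustrative sketch only, not code from any of the compressors discussed here:

```python
import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    """Return the Huffman code length for each symbol.

    Frequent symbols receive shorter codes, which is exactly the
    entropy-coding principle described in the text.
    """
    # Heap entries: (total weight, tie-breaker, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level down.
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

freqs = Counter("abracadabra")     # 'a' occurs 5 times, 'c' only once
lengths = huffman_code_lengths(freqs)
print(lengths)                     # 'a' gets a shorter code than 'c'
```

Arithmetic coding, in contrast, is not restricted to an integer number of bits per symbol, which is why it later displaced prefix codes in statistical compressors.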
The predecessors of the series of PAQ compressors, called P5, P6 and P12, are based on a 2-layer neural network with online training that predicts one bit for context inputs up to order-5. At the time they were on par with variants of PPM in terms of speed and compression ratio.
In PAQ1 the neural network is no longer used; instead the probabilities and confidences of the various models are combined through weighted averaging. There is a non-stationary n-gram model up to order-8, a match model for occurrences longer than 8 bytes within the last 4 MB, a cyclic model for patterns in rows and a whole-word model up to order-2. As statistical state the models mostly hold the bit counts of the contexts as one byte in a hash table. At its release it produced the best results for a concatenation of the Calgary corpus (a data compression benchmark).
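The combination step can be pictured as follows. The weighting scheme is a minimal sketch assuming explicit per-model confidence values, not PAQ1's exact formula:

```python
def combine(predictions):
    """Combine per-model bit probabilities by confidence-weighted averaging.

    `predictions` holds (p, confidence) pairs, where p is a model's
    estimate that the next bit is 1. This is a hypothetical scheme for
    illustration; PAQ1 derives its weights from the stored bit counts.
    """
    total = sum(c for _, c in predictions)
    if total == 0:
        return 0.5                 # no evidence: assume bits are fair
    return sum(c * p for p, c in predictions) / total

# Three hypothetical models; the confident first one dominates the mix.
print(combine([(0.9, 10.0), (0.5, 1.0), (0.2, 1.0)]))
```

The mixed probability then drives an arithmetic coder, and the per-context counts are updated after each coded bit.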
The models and their semi-stationary update were improved, and PAQ2 introduced SSE (Secondary Symbol Estimation), which improves a given probability by mapping it to a bit history. PAQ4 changed to adaptive linear weighting and PAQ5 added another mixer. Many other versions brought model improvements and added special models or transforms. PAQAR, a fork of PAQ6, reworked the mixing architecture and could turn models off. PAsQDa uses a transformation to map words to dictionary codes. PAQ6 variants led the Calgary challenge one after the other.
PAQ7 is a rewrite using neural network logistic mixing. Specifically, it includes a JPEG model that predicts the Huffman codes based on the DCT coefficients. PAQ8 variants come with more file models, dictionary or x86 call address preprocessors, and a DMC model. The PAQ8HP series won the Hutter Prize for Lossless Compression of Human Knowledge, which is about compressing a part of an English Wikipedia dump. Many achievements were made mainly through special models for various file formats. The simpler LPAQ provides faster but weaker compression, and later versions aim at text compression. PAQ9A has LZ (Lempel-Ziv)
precompression and cascaded ISSE (Indirect Secondary Symbol Estimation). The detailed history, including source code, is tracked on the PAQ history website (http://mattmahoney.net/dc/paq.html) and in M. Mahoney's book on data compression.

2 The PAQ Data Compression Programs