Freie Universität Berlin
Department of Mathematics and Computer Science
Institute of Computer Science
Bachelor Thesis
Design of a Python-subset Compiler in Rust targeting ZPAQL
Kai Lüke
kailueke@riseup.net
Supervisors: Prof. Dr. Günter Rote
Dipl.-Inform. Till Zoppke
Berlin, August 23, 2016
Abstract
The compressed data container format ZPAQ embeds decompression algorithms as ZPAQL bytecode in the
archive. This work contributes a Python-subset compiler written in Rust for the assembly language ZPAQL
and discusses design decisions and improvements. Along the way it explains ZPAQ and some theoretical and
practical properties of context mixing compression by the example of compressing digits of π. As use cases for
the compiler it shows a lossless compression algorithm for PNM image data, an LZ77 variant ported to Python
from ZPAQL to measure compiler overhead and, as the most complex case, an implementation of the Brotli
algorithm. It aims to make the development of algorithms for ZPAQ more accessible and to stimulate the
discussion of whether the current specification limits the suitability of ZPAQ as a universal standard for
compressed archives.
Contents
1 Introduction
1.1 Preceding History
1.2 Motivation
1.3 Research Question
2 The ZPAQ Standard Format for Compressed Data
2.1 ZPAQ Specification and Working Principle
2.2 Context Mixing: Components and Context Data
2.3 ZPAQL: Virtual Machine and Instruction Set
2.4 Examples: A simple Context Model and LZ1 with a Context Model
3 Notes on Data Compression
4 Compilers, Rust and Python
4.1 Classical Compiler Architecture
4.2 The Rust Programming Language
4.3 The Python Programming Language
5 Considerations and Challenges with ZPAQL as Target Language
6 Selection of the Python-subset as Source Language
6.1 Decision on the Feature Set
6.2 Grammar and Semantics
7 Design of the Compiler in Rust
7.1 Exposed API
7.2 Parser and Tokenizer
7.3 Grammar of the Intermediate Representation
7.4 IR Generation and the Stack
7.5 Optimizations in the Intermediate Representation
7.6 ZPAQL Generation and Register Assignment
7.7 Debugging
8 Exemplary Configurations and Programs
8.1 Compression of PNM Image Data using a Context Model
8.2 Bringing the Brotli Algorithm to ZPAQ
9 Evaluation
9.1 Compiler Construction
9.2 Performance of the Compiler
9.3 Analysis of the generated Code and Comparison with handwritten Code of LZ1
9.4 Suitability of ZPAQ as universal Standard
10 Conclusion
Bibliography
A Tutorial
B Visualizations
B.1 Arithmetic Coding
B.2 Context Mixing
B.3 ZPAQ Format
B.4 ZPAQL Virtual Machine
B.5 Compiler Pipeline
1 Introduction
It takes time for new and incompatible data compression algorithms to become widely available in software.
Moreover, different input data should often be handled with different compression techniques that make use
of the best knowledge about the data.
The ZPAQ standard format for compressed data is a container format which also holds the needed decompression
algorithms. They can be specified through a context mixing model of several predictors together with a bytecode
which computes the context data for them (used for arithmetic coding), a bytecode for postprocessing (used
for transformations or stand-alone algorithms) or any combination of both.
Arithmetic coding spreads symbols over a number range partitioned according to the probability distribution,
so that a symbol which is likely to be encoded gets a bigger part. The whole message is encoded at once from the
beginning. Every time a symbol has been chosen, the (possibly modified) probability distribution is applied
again to partition that symbol's part of the number range, and for the next symbol this becomes the number
range partition to choose from. When the last symbol has been processed, the number range is very narrow
compared to the beginning, and every number in it now represents the whole message. So the number with the
shortest binary representation can be selected, and a decoder can repeat the same steps by choosing the symbol
which has this number in its range and then partitioning this range again according to the probability
distribution. In practice, one either has to use a special end-of-message symbol or to specify the message length
beforehand to define an end to this process.
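To make the narrowing of the number range concrete, here is a minimal Python sketch of a non-adaptive arithmetic coder (an illustration, not part of ZPAQ; it uses floating-point intervals and a fixed symbol distribution, so it is only exact for short messages). The decoder is given the message length, which corresponds to the second option mentioned above:

# Illustrative non-adaptive arithmetic coder (floating point, so only
# suitable for short messages; real coders work on integer ranges).
def encode(message, probs):
    low, high = 0.0, 1.0
    for sym in message:
        span = high - low
        cum = 0.0
        for s, p in probs.items():          # partition the current range
            if s == sym:
                low, high = low + span * cum, low + span * (cum + p)
                break
            cum += p
    return (low + high) / 2                 # any number in the final range works

def decode(code, length, probs):
    out = []
    low, high = 0.0, 1.0
    for _ in range(length):
        span = high - low
        cum = 0.0
        for s, p in probs.items():          # pick the symbol whose sub-range holds the code
            if code < low + span * (cum + p):
                out.append(s)
                low, high = low + span * cum, low + span * (cum + p)
                break
            cum += p
    return "".join(out)

probs = {"a": 0.6, "b": 0.3, "c": 0.1}
assert decode(encode("abac", probs), 4, probs) == "abac"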
The history which led to the development of ZPAQ is briefly recounted in this chapter, followed by the
motivation for writing a compiler for ZPAQL and the research question of this work. In the following chapter
the whole picture of ZPAQ compression is visualized and the building blocks, i.e. context mixing and the
bytecode, are explained in detail. General theory of data compression and its limits are briefly covered
afterwards. The main part is preceded by a short introduction to compiler construction, the implementation
language Rust and the source language Python. One chapter outlines the conditions and difficulties of
ZPAQL as a compiler target. The following chapters cover the chosen Python-subset (6), the developed API
and compiler internals (7), example programs (8) and finally the evaluation (9).
References to compressors, project websites, more detailed documentation and short remarks are
located in footnotes. All websites were last accessed on August 19th 2016. Academic publications are
listed in the bibliography.
1.1 Preceding History
A big milestone in data compression was the Lempel-Ziv algorithm, a form of dictionary compression
since it consists of backward references to data slices which have already been seen. The area of statistical
data compression was at first stationary, and Huffman's optimal prefix codes can be mentioned for entropy
coding before arithmetic coding was known. Basically, few bits should be used for frequent symbols but more
for infrequent symbols. Entropy coding developed further with adaptive coding from the mid-80s. One major
innovation called PPM (Prediction by Partial Matching) usually works on order-N byte contexts and
predictions [1] whereas DMC (Dynamic Markov Compression) predicts bits [2]. Adaptive Huffman coding was
developed as well. A decade later, CTW (Context Tree Weighting) also utilizes bit prediction but mixes the
learned distributions of the order-N byte contexts [3].
The predecessors to the series of PAQ compressors, called P5, P6, P12, are based on a 2-layer neural network
with online training that predicts one bit for context inputs up to order-5. At that time they were on par
with variants of PPM concerning speed and compression ratio [4].
In PAQ1 the neural network is not used anymore as the probability and confidence of various models are
combined through weighted averaging. There is a non-stationary n-gram model up to order-8, a match model
for occurrences longer than 8 bytes in the last 4 MB, a cyclic model for patterns in rows and a whole-word
model up to order-2. As statistical data the models mostly hold the bit counts of the contexts as one byte
in a hash table. At its release it could produce the best results for a concatenation of the Calgary corpus (a data
compression benchmark) [5].
The models and their semi-stationary update were improved and PAQ2 introduced SSE (Secondary Symbol
Estimation) which improves a given probability by mapping it to a bit history. PAQ4 changed to adaptive
linear weighting and PAQ5 added another mixer. Many other versions have model improvements and added
special models or transforms. PAQAR, a fork of PAQ6, reworked the mixing architecture and could turn
models off. PAsQDa uses a transformation to map words to dictionary codes. PAQ6 variants were one after
the other leading in the Calgary challenge [6].
PAQ7 is a rewrite using neural network logistic mixing. Specifically, it includes a JPEG model that predicts
the Huffman codes based on the DCT coefficients. PAQ8 variants come with more file models, dictionary
or x86 call address preprocessors or a DMC model. The PAQ8HP series won the Hutter Prize for Lossless
Compression of Human Knowledge¹ about compressing a part of an English Wikipedia dump. Many
achievements were mainly made through special models for various file formats. The simpler LPAQ provides
faster but less compression and later versions aim at text compression. PAQ9A has LZ (Lempel-Ziv)
precompression and cascaded ISSE (Indirect Secondary Symbol Estimation). The detailed history including
source code is tracked on the PAQ history website² and in M. Mahoney's book on data compression [7].
1 http://prize.hutter1.net/
2 The PAQ Data Compression Programs http://mattmahoney.net/dc/paq.html
Parts of PAQ variants made it into the recent context mixing compressor cmix, which leads the Large Text
Compression Benchmark³ and the Silesia Open Source Compression Benchmark⁴, but with around 30 GB of
RAM usage⁵.
»ZPAQ is intended to replace PAQ and its variants (PAQ8, PAQ9A, LPAQ, LPQ1, etc) with similar or better
compression in a portable, standard format. Current versions of PAQ break archive compatibility with
each compression improvement. ZPAQ is intended to fix that.«⁶
The development of the ZPAQ archiver started in early 2009 and defined the first version of The ZPAQ
Open Standard Format for Highly Compressed Data [8] after some months. The main idea is to move
the layout of the context mixing tree as well as the algorithm for context computation and the one for
postprocessing into the archive before each block of compressed data. There are nine components⁷ to
choose from for the tree, mostly context models coming from PAQ. In addition the algorithms are provided
as bytecode for a minimal virtual machine. Hence the algorithm implementation is mostly independent of the
decompressor implementation and compatibility is preserved when improvements are made. Also, depending
on the input data an appropriate compression method can be chosen. The main program using libzpaq is
an incremental journaling backup utility which supports deduplication and encryption and provides various
compression levels using LZ77, BWT (Burrows-Wheeler Transform) and context models [9]. But there are
also the reference decoders unzpaq and tiny_unzpaq, a simple pipeline application zpipe and a development
tool zpaqd⁸.
The fastqz compressor [10] for Sanger FASTQ format DNA strings and quality scores also uses the ZPAQ
format and was submitted to the Pistoia Alliance Sequence Squeeze Competition⁹.
1.2 Motivation
The development of ZPAQ continued, especially for the use case of the incremental archiver program. But
the appearance of new algorithms for ZPAQ remained at a rather low level, as did the number of authors¹⁰,
despite the fact that it offers a good environment for research on context mixing methods and for crafting
special solutions for use cases with a known type of data, because one can build upon existing parts. The
main reason could be that the assembly language ZPAQL with its few registers is not very accessible and
programs are hard to grasp or easily become unmanageable when complex tasks like parsing a header have to
be accomplished. Therefore, the first question is whether another language can support ZPAQ's popularity.
3 http://mattmahoney.net/dc/text.html
4 http://mattmahoney.net/dc/silesia.html
5 http://www.byronknoll.com/cmix.html
6 http://mattmahoney.net/dc/zpaq.html
7 See chapter 2.2 or http://mattmahoney.net/dc/dce.html#Section_437
8 http://mattmahoney.net/dc/zpaqutil.html
9 http://www.pistoiaalliance.org/projects/sequence-squeeze/
10 The majority of available configurations is listed on http://mattmahoney.net/dc/zpaqutil.html
If that is the case, then a compiler for a well-known programming language could help to overcome the
obstacle of learning ZPAQL for implementing new algorithms.
The second question is whether the design decisions of the ZPAQ specification allow for arbitrary compression
algorithms to be used with ZPAQ, even if they bring their own encoder or a dictionary, or whether ZPAQ
is rather meant to be a platform for algorithms that use the predefined prediction components like (I)CM
and (I)SSE¹¹ together with the built-in arithmetic coder.
1.3 Research Question
As an approach towards the questions mentioned above, this work contributes a compiler that
emits ZPAQL for a subset of the popular Python programming language. The subset should be oriented to
the data types of the ZPAQL virtual machine. The architecture and operation of the Python code should
still maintain similarity to the ZPAQL execution workflow to avoid abstractions. That means only integers,
no objects, floating-point numbers or strings, but the ability to address the two memory sections of the VM
as arrays. Running the Python code should be possible as usual, which helps to keep debugging out of the
ZPAQL VM.
All this should ease the development of new algorithms and reveal limitations of the current ZPAQ
standard. An example of a new algorithm should be provided by developing a model for uncompressed PNM
image data which should be compared to other lossless image compressors.
As an example of a complex all-purpose algorithm, the recent Brotli compression algorithm [11] will be
brought to ZPAQ. It was proposed for HTTP/2 compression, includes a dictionary and combines LZ77, Huffman
coding and order-2 context modeling. This is a challenge because it has to be implemented solely in
the postprocessing step and also needs memory management.
The quality of the compiler will be evaluated by comparing against an existing hand-written implementation of
an LZ compressor. Design decisions of the compiler and difficulties faced will be discussed.
11 (Indirect) context model and (indirect) secondary symbol estimation, four of the nine ZPAQ components.
2 The ZPAQ Standard Format for Compressed Data
This chapter gives an introduction to the ZPAQ archive specification in the current version 2.06. A small use
case is explained as well as the needed commands. Then the context mixing components and the language
ZPAQL are presented. Finally, two real-world examples are explained.
2.1 ZPAQ Specification and Working Principle
The Level 1 specification of this container format required the data always to be encoded with the arithmetic
coder and at least one predictor. A backwards-incompatible change leading to Level 2 was to allow raw data
to be stored with no prediction component being used, as needed for incompressible data or stand-alone
postprocessor algorithms, e.g. a fast LZ variant [8]. It also defines a standard encryption for the whole archive
file and introduces an append-only journaling mode for incremental backups and deduplication on top of the
streaming format, which is still the basis of the archive. These features are out of scope for this work and
therefore the focus is on the compression architecture and the virtual machine.
The specification only defines what a valid archive is and how decompression takes place (nevertheless, probability
and thus context computation have to be symmetric for compression and decompression with an arithmetic
coder). In case of ambiguity in the specification there is the reference implementation unzpaq¹, a tiny
version² of it and the more mature library libzpaq that is used by the zpaq archiver³, the zpaqd development
tool⁴ or as plug-in for file archivers, and which incorporates an x86/64 JIT compiler for ZPAQL.
1 http://mattmahoney.net/dc/unzpaq206.cpp
2 http://mattmahoney.net/dc/tiny_unzpaq.cpp
3 http://mattmahoney.net/dc/zpaq715.zip
4 http://mattmahoney.net/dc/zpaqd715.zip
»The ZPAQ open standard specifies a compressed representation for one or more byte (8 bit value)
sequences. A ZPAQ stream consists of a sequence of blocks that can be decompressed independently. A
block consists of a sequence of segments that must be decompressed sequentially from the beginning of
the block. Each segment might represent an array of bytes in memory, a file, or a contiguous portion of a
file.« [8]
A block header has no information on the length of the block because, like for segments, an end marker is
used. Memory requirements for the two bytecodes hcomp and pcomp are defined in the header. It is noted
whether arithmetic coding is used and what predicting components make up the context mixing tree. Each
component has arguments which also determine memory use. If needed, the bytecode hcomp is embedded to
compute context data for the components of the context mixing tree for each byte. All components give
bit predictions for the partially decoded byte (these are passed upwards in the tree) and are trained afterwards
with the correct bit which was decoded based on the root (i.e. the last) node's probability for each bit.
The optionally arithmetic-coded data which comes from all segment content (not the segment filename or
comment) in the block can start with an embedded pcomp bytecode or declare that no pcomp bytecode
is present. Therefore, the hcomp section can already be used for context computation to compress the
pcomp bytecode (0 for empty or 1 followed by the length and the bytecode). The pcomp code is used for
postprocessing, be it a simple transform or the decompression of LZ codes. It gets each decoded byte
as input and outputs a number of bytes not necessarily equal to the input.
That means there are in total four combinations for a block: no compression, only context mixing with
arithmetic coding, only postprocessing of the stored data, or context mixing with subsequent postprocessing
(from the decompressor's perspective). The chosen selection applies for all (file) segments in the block.
The following charts illustrate the named parts and their relation to each other for a sample compression use
case. The transform for x86 machine code enhances compressibility by converting relative to absolute addresses
after each CALL and JMP instruction (0xE8 and 0xE9). It is applied to the two input files, an x86 executable
binary and a shared library. Therefore, a ZPAQL pcomp program needs to be supplied in the archive block
to revert that transform. Encoding takes place based on the probability distribution of 1 and 0 for each bit
of the current byte, as provided as prediction by the root node of the simple context mixing tree.
The hcomp program is loaded into the ZPAQL VM and computes contexts for the two components. The
ISSE maps the context to a bit history which is used as context for a learning mixer that should improve the
probability provided by the first component, a CM (Context Model) which should learn good predictions for
the given context. The whole model and the hcomp bytecode are also embedded into the archive block.
The two files are stored as two segments in the block (like a "solid" archive). Because the preprocessor might
be any external program or also included in the compressing archiver and is of no use for decompression,
it is not mentioned in the archive anymore.
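As an illustration of such a transform, the following Python sketch shows the idea of the E8E9 conversion (a simplified, assumption-based version with 4-byte little-endian operands and no range checks; the transform built into libzpaq differs in details):

def e8e9_forward(data):
    """Replace the relative 32-bit operand after each CALL/JMP opcode by an
    absolute target address, so that calls to the same target look identical."""
    out = bytearray(data)
    i = 0
    while i + 5 <= len(out):
        if out[i] in (0xE8, 0xE9):
            rel = int.from_bytes(out[i + 1:i + 5], "little", signed=True)
            absolute = (rel + i + 5) & 0xFFFFFFFF
            out[i + 1:i + 5] = absolute.to_bytes(4, "little")
            i += 5
        else:
            i += 1
    return bytes(out)

def e8e9_inverse(data):
    """The reverse transform, which is what a pcomp program has to perform."""
    out = bytearray(data)
    i = 0
    while i + 5 <= len(out):
        if out[i] in (0xE8, 0xE9):
            absolute = int.from_bytes(out[i + 1:i + 5], "little")
            rel = (absolute - (i + 5)) & 0xFFFFFFFF
            out[i + 1:i + 5] = rel.to_bytes(4, "little")
            i += 5
        else:
            i += 1
    return bytes(out)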
Figure 2.1: Possible compression scheme
Decompression takes place in the reverse manner: hcomp is loaded into the ZPAQL VM to compute the
context data for the components of the model. They supply the predictions to the arithmetic coder and are
corrected afterwards. For the reverse transform of each segment, pcomp is read from the decoded stream and
loaded into another VM. Then the two segments follow in the decoding step and go through the postprocessing
transform before they are written out as files.
Figure 2.2: Accompanying decompression
The tool zpaqd only supports the streaming format and can be used to construct this example setup by writing
a configuration file for it and then adding the two files as segments into a block. But besides the algorithms that are
already defined for compression in libzpaq for the levels 1 to 5 (LZ77, BWT and context mixing) it also offers
the ability to specify a customized model (E8E9 or LZ77 transformations and also word models are supported)
given as argument, so that the above configuration can also be brought to life with something like zpaq
a [archive] [files] -method s8.4c22.0.255.255i3 (the notation is documented in libzpaq.h and the
zpaq man page⁵). Here the first 8 accounts for 2^8 MB blocks, so that both segments should fit into
one block (yet the zpaq application uses the API in a way that creates an additional block), then an order-2
CM and an order-3 ISSE are chained. The resulting configuration including the two ZPAQL programs stored
in the archive can be listed with zpaqd l [archive].
5 http://mattmahoney.net/dc/zpaqdoc.html
For a more general view cf. the compression workflow of the zpaq end-user archiver as described in the
article mentioned above [9]. It selects one of its predefined algorithms based on their performance for the data and
uses deduplication through the journaling format.
2.2 Context Mixing: Components and Context Data
The way the mixing takes place has evolved from the neural network approach in P5 over weighted averaging
in PAQ1 and adaptive linear weighting to logistic mixing in PAQ7 (like in a neural network, but instead of
back propagation the weights are updated). In ZPAQ the probabilities are also handled in the logistic domain
as $t = \operatorname{stretch}(p) = \ln\frac{p}{1-p}$ as output of the predicting components and can be averaged or refined by other
components before the output $t$ of the last component is transformed back with $p = \operatorname{squash}(t) = \frac{1}{1+e^{-t}}$ to
be used in the arithmetic coder to code one bit [7].
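A small Python sketch may illustrate this logistic mixing (the learning rate and weight handling are assumptions for illustration, not the exact update used by ZPAQ's MIX component):

import math

def stretch(p):                  # probability -> logistic domain
    return math.log(p / (1.0 - p))

def squash(x):                   # logistic domain -> probability
    return 1.0 / (1.0 + math.exp(-x))

def mix(predictions, weights):
    # average the component outputs in the logistic domain
    return squash(sum(w * stretch(p) for p, w in zip(predictions, weights)))

def update(predictions, weights, bit, rate=0.002):
    # after coding one bit, move the weights to reduce the prediction error
    error = bit - mix(predictions, weights)
    return [w + rate * error * stretch(p) for p, w in zip(predictions, weights)]

weights = [0.3, 0.3]
p = mix([0.9, 0.6], weights)                  # combined prediction for the next bit being 1
weights = update([0.9, 0.6], weights, bit=1)  # train with the actually coded bit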
ZPAQ denes nine components. Up to 255 of them can be used in a model but only linear trees can be
constructed as mixing network i.e. when they are listed, they can only refer to predictions of preceding
components. The last one listed is the top node [8]. They get their context data input through computation
in hcomp. To supply the component with a higher order context, a rotating history buffer needs to be
maintained by hcomp. Now the components are listed without their arguments and internals which can be
found in the specication [8].
CONST is the simplest and gives a fixed prediction. It does not receive context data.
CM is a direct context model that holds counts of zeros/ones and a prediction for each of the eight bits
expected in the specified context. The counts are used to update the prediction to reduce the error. The
given context data is often a hash (and collisions can hurt compression). Also, the partially decoded byte is
expanded to nine bits and XORed with the context after coding each bit, so that the next bit's prediction is
accessed in the table.
ICM is an indirect context model consisting of a hash table that maps the given context and the partially
decoded byte to a bit history which is then mapped to a prediction like in a direct context model. Histories
are counts of 1 and 0 and represented by states (explained in [7], chapter 4.1.3, Indirect Models) and
adjusted after the prediction to a better fitting state. The prediction is also updated.
MATCH is a model that takes the following byte of a found string match as prediction until a mismatch
happens. Therefore it keeps its own history buffer and the prediction is also varied depending on the length
of the match. The match is found based on the (higher order) context hash input. This works because the
search does not take place in the history buffer, but a table maps each context hash input to the last element
in the history buffer.
AVG outputs a non-adaptive weighted average of the predictions of two other components. It does not
receive context data.
MIX maps a context and the masked partially decoded byte to weights for producing an averaged prediction
of some other components. Afterwards the weights are updated to reduce the prediction error.
MIX2 is a simpler MIX and can only take two predictions as input.
SSE stands for secondary symbol estimation (Dmitry Shkarin; also known as Adaptive Probability Map in
PAQ7 or LPAQ). It receives a context as input and takes the prediction of another component, which
is then quantized. For the given context this quantized prediction and the next closest quantization value
are mapped to predictions, which results in an interpolation of both. Initially this map is an identity, but
the update step corrects the prediction of the closer-to-original quantized value in the same way as in the CM
update phase.
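The following Python sketch illustrates this interpolation and update scheme (bucket count, value range and update rate are assumed values, not the constants of the ZPAQ SSE component):

import math

def stretch(p): return math.log(p / (1.0 - p))
def squash(x): return 1.0 / (1.0 + math.exp(-x))

class SSE:
    """Illustrative secondary symbol estimation (adaptive probability map)."""
    def __init__(self, num_contexts, buckets=33):
        self.buckets = buckets
        # initially an identity map: bucket j returns the probability it stands for
        row = [squash(-8.0 + 16.0 * j / (buckets - 1)) for j in range(buckets)]
        self.table = [row[:] for _ in range(num_contexts)]

    def predict(self, ctx, p):
        # quantize the incoming prediction (in the stretched domain) and
        # interpolate between the two closest buckets of the context's row
        x = (stretch(p) + 8.0) * (self.buckets - 1) / 16.0
        x = min(max(x, 0.0), self.buckets - 1.000001)
        j = int(x)
        w = x - j
        self.last = (ctx, j, w)
        row = self.table[ctx]
        return (1.0 - w) * row[j] + w * row[j + 1]

    def update(self, bit, rate=0.02):
        # move the bucket closer to the original prediction towards the coded bit
        ctx, j, w = self.last
        k = j + 1 if w > 0.5 else j
        self.table[ctx][k] += rate * (bit - self.table[ctx][k])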
»A typical place for SSE is to adjust the output of a mixer using a low order (0 or 1) context. SSE components
may be chained in series with contexts typically in increasing order. Or they may be in parallel with
independent contexts, and the results mixed or averaged together.« [7]
ISSE is an indirect secondary symbol estimator that, like an SSE, refines the prediction of another component
based on a context, which is mapped to a bit history as in an ICM. That bit history is set as context for an
adaptive MIX which selects the weights to combine the original prediction with a fixed prediction.
»Generally, the best compression is obtained when each ISSE context contains the lower order context of
its input.« [7]
2.3 ZPAQL: Virtual Machine and Instruction Set
At the beginning of each block the two bytecodes hcomp and pcomp, if needed, are each loaded into their own
virtual machine. It consists of the program counter PC, the 32-bit registers A, B, C and D, a 1-bit condition
flag F, 256 32-bit registers R0 to R255 and the arrays M (8-bit elements) and H (32-bit elements). Initially, all
registers and arrays hold 0. The sizes of M and H are defined in the block header along with the information
which context-mixing components are used.
For each encoded or decoded byte, hcomp is run to set the contexts for the components. The context
data for component i has to be stored in H[i]. As its first input hcomp sees whether a postprocessor
is present and, if yes, its length and bytecode. Afterwards the data of the segments follows, without
any separation. Except for PC, which is set to 0, and the A register, which is used for the input, all state
is preserved between the calls.
The postprocessor pcomp is run for each decoded byte of the block to revert the preprocessing and puts out
the data via an OUT instruction. After each segment it is invoked with 2^32 - 1 as input to mark the end.
There is an assembly language for the bytecode which is also used in the table describing the corresponding
opcodes in the specification [8]. In this assembly language whitespace can only occur between opcode bytes
in order to visualize that a=b is a 1-byte opcode while a= 123 is a 2-byte opcode. Comments are written in
brackets.
Operations on the 32-bit registers and elements of H are mod 2^32 and interpreted as positive numbers in
comparisons. Index access into M and H is mod |M| resp. mod |H| and denoted as *B for M[B], *C for M[C]
and *D for H[D]. Because M holds bytes, operations on *B and *C are mod 256 and swapping via
*B<>A or *C<>A alters only the lower byte of A [8].
Instruction          Semantics and Constraints
error                cause execution to fail
X++                  increment X by 1 (X is one of A, B, C, D, *B, *C, *D)
X--                  decrement X by 1
X!                   flip all bits of X
X=0                  set X to 0
X<>A                 swap X and A (X is not A)
X=Y                  set X to Y (Y is one of A, B, C, D, *B, *C, *D)
X= N                 set X to N (0 <= N <= 255)
A+=X                 add X to A
A-=X                 subtract X from A
A*=X                 multiply A by X
A/=X                 divide A by X (set A = 0 if X = 0)
A%=X                 A = A mod X (set A = 0 if X = 0)
A&=X                 binary AND with X
A&~=X                binary AND with flipped bits of X
A|=X                 binary OR with X
A^=X                 binary XOR with X
A<<=X                bitwise left shift of A by X mod 32 bits
A>>=X                bitwise right shift of A by X mod 32 bits
A+= N, A-= N, A*= N, A/= N, A%= N, A&= N,
A&~= N, A|= N, A^= N, A<<= N, A>>= N
                     same as the previous instructions but with N (0 <= N <= 255)
A==X                 F = 1 if A = X, otherwise F = 0
A<X                  F = 1 if A < X, otherwise F = 0
A>X                  F = 1 if A > X, otherwise F = 0
A== N, A< N, A> N    same as above but with N (0 <= N <= 255)
A=R N, B=R N, C=R N, D=R N
                     set A, B, C or D to R[N]
R=A N                set R[N] to A
HALT                 ends the current execution of the program
OUT                  output A mod 256 in pcomp, ignored in hcomp
HASH                 A = (A + *B + 512) * 773
HASHD                *D = (*D + A + 512) * 773
JMP I                add I (-128 <= I <= 127) to PC relative to the following instruction, so
                     JMP 0 has no effect and JMP -2 is an endless loop (in the specification a
                     positive N is used, i.e. I = N - 256 for N > 127)
JT N                 jump if F = 1
JF N                 jump if F = 0
LJ L                 set PC to L with 0 <= L < 65536 (in the specification written as LJ N M
                     because it is a 3-byte instruction with L = N + 256 * M)
Besides the opcodes, libzpaq also supports helper macros like if(not)? … (else)? … endif, the
long-jump versions if(not)?l … (elsel)? … endifl and do … (while|until|forever), which are
converted to conditional jump opcodes. An if-block is executed if F = 1, a while-jump is taken if
F = 1 and an until-jump is taken if F = 0. That means the test for the condition has to be written before
the macro: a> 255 if … endif. The statements can also be interwoven, e.g. to write a do-while-loop or a
continue-jump as do … if … forever endif.
In a ZPAQ config file the two sections for hcomp and pcomp are written after the context mixing model
configuration. The pcomp section is optional. Comments can appear everywhere within brackets.
Syntax: (where I < N and HH, HM, PH and PM define the sizes of H and M for each section)
comp HH HM PH PM N
(I COMPONENT (ARG)+ )*
hcomp
(ZPAQL_INSTR)*
halt
(pcomp (PREPROCESSOR_COMMAND)? ;
(ZPAQL_INSTR)*
halt
)?
end
2.4 Examples: A simple Context Model and LZ1 with a Context Model
The following example configuration is based on fast.cfg from the utility site⁶ and can be used for text
compression. It adaptively combines (independently of contexts, just based on the success of the last
prediction) the prediction of a direct order-1 context model with the prediction of an order-4 ISSE which
refines the prediction of an order-2 ICM. The arguments for the components are documented in the
specification [8].
6 http://mattmahoney.net/dc/zpaqutil.html, config files with a "post" instead of "pcomp" are in the old format of the Level 1 specification
comp 2 2 0 0 4 (hh hm ph pm n)
  (where H gets the size of 2^hh in hcomp or 2^ph in pcomp,
   M 2^hm or 2^pm and n is the number of context-mixing components)
  0 cm 19 4 (will get an order 1 context)
  1 icm 16 (order 2, chained to isse)
  2 isse 19 1 (order 4, has reference to ICM component 1)
  3 mix2 0 0 2 24 0 (moderate adapting mixer between CM and ISSE
    based on which predicts better, no contexts even for bits)
  (ICM and ISSE part adapted from fast.cfg)
hcomp
  r=a 2 (R2 = A, input byte in R2)
  d=0
  a<<= 9 *d=a (H[D] = A) (set context to actual byte)
  (leaving first 9 bits free for the partially decoded byte)
  a=r 2 (A = R2)
  *b=a (M[B] = A) (save input byte in rotating buffer)
  (full M is used with pointer b)
  a=0 hash (shortcut for A = (A + M[B] + 512) * 773)
  b-- hash
  d= 1 *d=a (order 2 hash for H[1])
  b-- hash b-- hash
  d= 2 *d=a (order 4 hash for H[2])
  (H[3] stays 0 as fixed context for MIX2)
  halt (execution stops here for this input byte)
end
Listing 2.1: Example mfast.cfg without a pcomp section
To demonstrate the compression phases and parts involved in detail the LZ1 configuration from the utility
site is chosen, but the BWT.1 examples are also worth a look.
The LZ1 configuration relies on a preprocessor lzpre.cpp which turns the input data into a compressed
LZ77-variant representation of codes for match copies and literal strings. This is further compressed through
arithmetic coding with probabilities provided by an ICM (indirect context model).
The contexts are always computed as a hash of two values as follows. For the first byte
of an offset number of a match, the length of the match and the current state (2…4, i.e. 1…3 bytes to follow
as offset number) are used as context. For the remaining bytes of an offset number (or a new code if no
bytes are remaining) the previous context and the current state (previous state − 1, i.e. 0…2 bytes to follow)
are used as context. For the first literal of a literal string the number of literals and the state 5 are used as
context. For the following literals the current literal and the state 5 are used as context. For a new code
after a literal string, instead of a hash of the first value just 0 and the current state (1) are used as context.
The bytecode of pcomp is not specially handled.
To revert the LZ1 compression, pcomp parses the literal and match codes and maintains a 16 MB (2^24 byte)
output buffer in M.
(lz1.cfg
 (C) 2011 Dell Inc. Written by Matt Mahoney
 Licensed under GPL v3, http://www.gnu.org/copyleft/gpl.html)
comp 0 0 0 24 1
  0 icm 12 (sometimes 0 cm 20 48 will compress better)
hcomp
  (c=state: 0=init, 1=expect LZ77 literal or match code,
    2..4=expect n-1 offset bytes,
    5..68=expect n-4 literals)
  b=a (save input)
  a=c a== 1 if (expect code ccxxxxxx as input)
    (cc is number of offset bytes following)
    (00xxxxxx means x+1 literal bytes follow)
    a=b a>>= 6 a&= 3 a> 0 if
      a++ c=a (high 2 bits is code length)
      *d=0 a=b a>>= 3 hashd
    else
      a=b a&= 63 a+= 5 c=a (literal length)
      *d=0 a=b hashd
    endif
  else
    a== 5 if (end of literal)
      c= 1 *d=0
    else
      a== 0 if (init)
        c= 124 *d=0 (5+length of postprocessor)
      else (literal or offset)
        c--
        (model literals in order 1 context, offset order 0)
        a> 5 if *d=0 a=b hashd endif
      endif
    endif
  endif
  (model parse state as context)
  a=c a> 5 if a= 5 endif hashd
  halt
pcomp ./lzpre c ; (code below is equivalent to lzpre d)
  a> 255 if (end of segment)
    b=0 d=0 (reset, is last command before halt)
  else
    (LZ77 decoder: b=i, c=c d=state r1=len r2=off
      state = d = 0 = expect literal or match code
                  1 = decoding a literal with len bytes left
                  2 = expecting last offset byte of a match
                  3,4 = expecting 2,3 match offset bytes
      i = b = position in 16M output buffer
      c = c = input byte
      len = r1 = length of match or literal
      off = r2 = offset of match back from i
    Input format:
      00llllll: literal of length lllllll=1..64 to follow
      01lllooo oooooooo: length lll=5..12, offset o=1..2048
      10llllll oooooooo oooooooo: l=1..64 offset=1..65536
      11llllll oooooooo oooooooo oooooooo: 1..64, 1..2^24)
    c=a a=d a== 0 if
      a=c a>>= 6 a++ d=a
      a== 1 if (state?)
        a+=c r=a 1 a=0 r=a 2 (literal len=c+1 off=0)
      else
        a== 2 if a=c a&= 7 r=a 2 (short match: off=c&7)
          a=c a>>= 3 a-= 3 r=a 1 (len=(c>>3)-3)
        else (3 or 4 byte match)
          a=c a&= 63 a++ r=a 1 a=0 r=a 2 (off=0, len=(c&63)-1)
        endif
      endif
    else
      a== 1 if (writing literal)
        a=c *b=a b++ out
        a=r 1 a-- a== 0 if d=0 endif r=a 1 (if (--len==0) state=0)
      else
        a> 2 if (reading offset)
          a=r 2 a<<= 8 a|=c r=a 2 d-- (off=off<<8|c, --state)
        else (state==2, write match)
          a=r 2 a<<= 8 a|=c c=a a=b a-=c a-- c=a (c=i-off-1)
          d=r 1 (d=len)
          do (copy and output d=len bytes)
            a=*c *b=a out c++ b++
            d-- a=d a> 0 while
          (d=state=0. off, len dont matter)
        endif
      endif
    endif
  endif
  halt
end
Listing 2.2: lz1.cfg
For comparison, a Python port for use with the zpaqlpy compiler can be found in test/lz1.py. It differs
in processing the pcomp bytecode with the current opcode byte as context for the next.
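To make the code format documented in the comments of Listing 2.2 concrete, a plain Python decoder for the four code types could look as follows (an illustrative sketch based on those comments, not the actual test/lz1.py):

def lz1_decode(codes):
    """Decode the LZ77 code stream described in the comments of lz1.cfg:
       00llllll                            l+1 literal bytes follow
       01lllooo oooooooo                   match, length l+5, offset o+1 (1..2048)
       10llllll oooooooo oooooooo          match, length l+1, 16-bit offset + 1
       11llllll oooooooo oooooooo oooooooo match, length l+1, 24-bit offset + 1"""
    out = bytearray()
    i = 0
    while i < len(codes):
        c = codes[i]; i += 1
        kind = c >> 6
        if kind == 0:                       # literal run
            n = (c & 63) + 1
            out += codes[i:i + n]; i += n
        else:
            if kind == 1:                   # short match: 3 + 8 offset bits
                length = ((c >> 3) & 7) + 5
                offset = ((c & 7) << 8 | codes[i]) + 1; i += 1
            else:                           # 2 or 3 more offset bytes follow
                length = (c & 63) + 1
                offset = 0
                for _ in range(kind):
                    offset = offset << 8 | codes[i]; i += 1
                offset += 1
            for _ in range(length):         # copy the match from the output buffer
                out.append(out[-offset])
    return bytes(out)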
3 Notes on Data Compression
Not all data can be compressed. If we build a necessarily bijective¹ compression scheme $C\colon \Sigma^* \rightarrow \Sigma^*$
for strings, starting from the identity, then every time we remap an input $x$ (with $C(x) = x$) to a shorter
compressed representation $y$ (with $|y| < |x|$ and $C(y) = y$), so that now $C(x) = y$ and $|C(x)| < |x|$, we
have to map $y$ to $x$, i.e. $C(y) = x$, and indeed make this swap in order to maintain the bijection, ending up
with $|C(y)| > |y|$ because $|x| > |y|$. So while $C$ compresses $x$ it expands $y$, and this holds for each iteration
when we change our compression scheme.
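The counting variant of this argument mentioned in the footnote can be stated in one line: there are fewer strictly shorter binary strings than strings of length $n$,
$$\sum_{k=0}^{n-1} 2^{k} \;=\; 2^{n}-1 \;<\; 2^{n},$$
so no injective (lossless) mapping can shorten all $2^{n}$ strings of length $n$ at once.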
Luckily, most data which is relevant to us and interesting for compression has patterns, and other data where
we do not understand the patterns appears to be random and is out of scope for compression algorithms. If
we do not have any knowledge about the data except that its $n$ symbols are equally distributed with probability
$p(x_i) = \frac{1}{n}$ for each symbol $x_i$, the best we can do is to use an optimal code reaching the Shannon entropy as coding
length per symbol:
$$H = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i) = \log_2 n$$
For $n = 256$ byte values $H$ would be 8 bit as usual and we can simply store the data instead of encoding it again.
In general, given that the distribution is known, we can choose e.g. a non-adaptive arithmetic encoder to
almost reach the limit of $H$ bits per symbol. But adaptive arithmetic coding with PPM and others could
even give a better average because the distribution is adjusted by exploiting patterns. Therefore, to craft a
well performing algorithm for the expected input data, knowledge about the patterns is needed in order to
give good predictions and go beyond the Shannon entropy as average size.
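A quick numeric check of these limits in Python (illustrative only):

import math

def shannon_entropy(probs):
    # average optimal code length in bits per symbol
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([1 / 256] * 256))              # 8.0 bits per byte
print(shannon_entropy([1 / 10] * 10))                # ~3.32 bits per decimal digit
print(10000 * shannon_entropy([1 / 10] * 10) / 8)    # ~4153 bytes for 10^4 digits of pi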
To dene the lower limit that can be reached the concept of algorithmic information or Kolmogorov complex-
ity of a string is developed. Basically it is the length of the shortest program in a xed language that produces
this string. The language choice only inuences a constant variation because an interpreter could be written.
When comparing different compressors in a benchmark it is common to include the decompressor size in
the measurement because it could also hold data or generate it via computation.
1 We want to allow every string to be compressed, resulting in the same number of compressed strings (because they have to differ) which also should be decompressible. Another common approach to prove that there is no universal lossless compression is a counting argument which uses the pigeonhole principle.
Using this principle with ZPAQ and its pcomp section, the first million digits of π in form of the ~1 MB text
file pi.txt from the Canterbury Miscellaneous Corpus² can be compressed to a 114-byte ZPAQ archive
which consists of no stored data and a postprocessing step which computes π to the given precision and
outputs it as text³. This extreme case of the earlier mentioned knowledge about the data can serve as a
bridge between Kolmogorov complexity and adaptive arithmetic coding. For faster execution of the ZPAQ
model we only take the first ten thousand digits of π. Normally the limit most compressors would stay
above (because the digits are equally distributed) is $10000 \cdot \log_2(10) / 8 \approx 4153$ bytes
instead of 10 KB.
2 http://corpus.canterbury.ac.nz/descriptions/#misc
3 http://mattmahoney.net/dc/pi.cfg
With a ZPAQ context model we can, instead of generating the digits in the pcomp phase, also use the next
expected digit as context so that the predictor will quickly learn that e.g. character 3 comes in context
3. But the prediction cannot be 100 % for one symbol as other symbols could occur and there has to be a
probability greater than zero assigned to them. Also, whether the end of the message is reached is encoded as a
special symbol. So the range of the arithmetic encoder still gets narrowed when a perfectly predicted digit
is encoded, but on such a small level that still only 121 bytes are needed for the ZPAQ archive consisting
of the CM model configuration, the hcomp bytecode and the arithmetically coded ten thousand digits of π.
That shows that we can go much beyond the entropy limit down to Kolmogorov complexity by using
context modeling and adaptive arithmetic coding. And still the context model is usable for all input data, in
contrast to computing π in pcomp. The overhead of the fact that 100 % cannot be used when predicting
seems to be linear in the message size and can be observed when compressing 50 or 200 MB of zeros with a
CM, resulting in around 0.0000022 bytes per input byte.
(pi10k.cfg 2016 Kai Lüke and Matt Mahoney
 instead of generating the digits in pcomp phase, use the next expected digit as context
 To compress: zpaqd cinst pi10k.cfg pi10k.zpaq pi10000.txt)
comp 0 14 0 0 1 (2^14 > 10000)
  0 cm 13 0
hcomp
  ifnot (only first run)
    (Compute pi to 10000 digits in M using the formula:
     pi=4; for (d=r1*20/3;d>0;--d) pi=pi*d/(2*d+1)+2;
     where r1 is the number of base 100 digits.
     The precision is 1 bit per iteration so 20/3
     is slightly more than the log2 (100) we need.)
    a= 100 a*=a r=a 1 (r1 = digits base 100)
    a*= 7 d=a (d = n iterations)
    *b= 4 (M=4)
    do
      (multiply M *= d, carry in c)
      b=r 1 c=0
      do
        b--
        a=*b a*=d a+=c c=a a%= 10 *b=a
        a=c a/= 10 c=a
        a=b a> 0 while
      (divide M /= (2d+1), remainder in c)
      a=d a+=d a++ d=a
      do
        a=c a*= 10 a+=*b c=a a/=d *b=a
        a=c a%=d c=a
        a=r 1 b++ a>b while
      a=d a>>= 1 d=a
      (add 2)
      b=0 a= 2 a+=*b *b=a
      d-- a=d a== 0 until
    halt
  endif
  a=*b a<<= 9 *d=a b++ (set context for expected digit taken from M)
  halt
end
Listing 3.1: pi10k.cfg
Even though this configuration is usable for input other than π, it does not give good compression. It can be
merged with the general text model from Listing 2.1 so that it changes between using the CM for order-1 contexts
and for the expected-digit contexts every time the start of π is detected, until a mismatch is found. This
way all occurrences of π are coded with only a few bits.
(mixed_pi2.cfg
 use the next expected digit as context for CM or a general text model of fast.cfg
 a MIX2 will select between them
 To compress: zpaqd c mixed_pi2.cfg text_pi.zpaq text_with_appearance_of_pi.txt
)
comp 2 14 0 0 4 (2^14 > 10000)
  0 cm 18 0 (order 1 or digits of pi)
  1 icm 16 (order 2, chained to isse)
  2 isse 19 1 (order 4)
  3 mix2 0 0 2 24 0 (moderate adapting mixer between CM and ISSE based on which predicts better)
hcomp
  r=a 2
  a=r 0
  a== 0 if (only first run)
    (Compute pi to 10000 digits using the formula:
     pi=4; for (d=r1*20/3;d>0;--d) pi=pi*d/(2*d+1)+2;
     where r1 is the number of base 100 digits.
     The precision is 1 bit per iteration so 20/3
     is slightly more than the log2 (100) we need.)
    a= 100 a*=a r=a 1 (r1 = digits base 100)
    a*= 7 d=a (d = n iterations)
    *b= 4 (M=4)
    do
      (multiply M *= d, carry in c)
      b=r 1 c=0
      do
        b--
        a=*b a*=d a+=c c=a a%= 10 *b=a
        a=c a/= 10 c=a
        a=b a> 0 while
      (divide M /= (2d+1), remainder in c)
      a=d a+=d a++ d=a
      do
        a=c a*= 10 a+=*b c=a a/=d *b=a
        a=c a%=d c=a
        a=r 1 b++ a>b while
      a=d a>>= 1 d=a
      (add 2)
      b=0 a= 2 a+=*b *b=a
      d-- a=d a== 0 until
    c= 2 (point to 4 of 3.14)
    a= 1
    r=a 0
    a<<= 14 a-- (last element of ring buffer)
    b=a
    a-= 4 (first element of ring buffer, pointer in r3)
    r=a 3
    halt (input 0 came from pcomp, also to restart c=2 is enough)
  endif
  (CM part)
  d=0
  a=r 2 a-= 48
  c--
  a==*c
  c++
  if (pi: set context for expected digit)
    a=*c c++ a<<= 1 a++ a<<= 9 *d=a (distinguish between pi number context and character context by 1 bit for sure)
  else (other:)
    a=r 2 a<<= 10 *d=a c= 2 (set context to actual byte)
  endif
  (a in r2, lower border of ring buffer in r3)
  (ICM and ISSE part adapted from fast.cfg)
  a=r 2
  *b=a a=0 (save in rotating buffer M)
  hash b--
  d=a (save hash) a=r 3 a>b if b++ b++ b++ b++ endif a=d
  hash d= 1 *d=a
  b--
  d=a (save hash) a=r 3 a>b if b++ b++ b++ b++ endif a=d
  hash b--
  d=a (save hash) a=r 3 a>b if b++ b++ b++ b++ endif a=d
  hash d= 2 *d=a
  halt
end
Listing 3.2: mixed_pi2.cfg
For real-life use cases it is often not possible to give perfect predictions. Good contexts can help to bring
order into the statistics about previous data. Besides the manual approach, heuristic context models can be
generated for data by calculating the data's autocorrelation function [12] or, as done in PAQ, by recognizing
two-dimensional strides for tabular or graphical data.
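A minimal sketch of the autocorrelation heuristic might look like this in Python (normalization and peak selection are assumptions; a practical implementation would use an FFT):

def stride_candidates(data, max_lag=1024):
    """Return the lags with the strongest autocorrelation, which are candidate
    record lengths (strides) usable as contexts for table-like data.
    Illustrative only: O(n * max_lag)."""
    n = len(data)
    mean = sum(data) / n
    var = sum((b - mean) ** 2 for b in data) or 1.0
    scores = []
    for lag in range(1, min(max_lag, n)):
        c = sum((data[i] - mean) * (data[i + lag] - mean) for i in range(n - lag))
        scores.append((c / var, lag))
    return [lag for _, lag in sorted(scores, reverse=True)[:3]]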
Even compressed JPEG photos can be further compressed by 10%-30% by predicting the Huffman-coded
DCT coefficients when using the decoded values as contexts (done in PAQ7-8, Stuffit, PackJPG, WinZip [7]).
The ZPAQ configuration jpg_test2.cfg uses a preprocessor to expand Huffman codes to DCT coefficients
and later uses them as contexts⁴. The PackJPG approach continues to be developed by Dropbox under the
name lepton⁵ and supports progressive besides baseline JPEGs.
Overall, modeling and prediction are an AI problem because e.g. for a given sentence start a likely following
word has to be provided, or how a picture with a missing area is going to continue. Remarkable results
have been accomplished by using PAQ8 as a machine learning tool, e.g. for building a game AI with it serving
as a classifier, for interactive text input prediction, text classification, shape recognition and lossy image
compression [13].
4 http://mattmahoney.net/dc/zpaqutil.html
5 https://github.com/dropbox/lepton
4 Compilers, Rust and Python
Compilers and the parts covering their subproblems are presented here. Then the programming language
Rust, which was used for the compiler development, is briefly introduced, as well as Python as the source
language.
4.1 Classical Compiler Architecture
The task of translating a program from a source language into an equivalent program in the target language is
usually split into independent phases where each phase passes its result to the next phase (see [14], Figure
1.6). In most cases the target language is very low level, like machine instructions of a CPU. One variation
of a compiler structure including an intermediate representation language will be explained.
The tokenizer or lexer cuts the input stream into tokens which represent words/units of the source language
and are normally the allowed terminal symbols of the grammar. The parser reads the tokens in sequence
and fits them into nonterminals of the source language grammar by following the production rules until
all tokens are consumed and the start production is completed. On the way it generates an AST (Abstract
Syntax Tree) as semantic representation of the program in the source language. This completes the analysis
done in the frontend.
Synthesis takes place in the backend. The AST is traversed to generate IR (intermediate representation)
instructions. This step usually keeps track of variable names and their assignments in a symbol table. An
optimization pass can be applied to the IR. The code generator produces instructions in the target language
and they can be passed through optimization steps as well.
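The structure can be illustrated with a toy compiler for arithmetic expressions targeting a small stack machine (an illustrative sketch of the frontend and code generation only; the IR and optimization passes described above are left out):

import re

def tokenize(src):
    # lexer: cut the input into number and operator tokens
    return re.findall(r"\d+|[-+*/()]", src)

def parse(tokens):
    # recursive-descent parser producing a nested-tuple AST
    def expr(i):
        node, i = term(i)
        while i < len(tokens) and tokens[i] in "+-":
            op, (rhs, i) = tokens[i], term(i + 1)
            node = (op, node, rhs)
        return node, i
    def term(i):
        node, i = factor(i)
        while i < len(tokens) and tokens[i] in "*/":
            op, (rhs, i) = tokens[i], factor(i + 1)
            node = (op, node, rhs)
        return node, i
    def factor(i):
        if tokens[i] == "(":
            node, i = expr(i + 1)
            return node, i + 1          # skip ")"
        return ("num", int(tokens[i])), i + 1
    return expr(0)[0]

def generate(ast, code=None):
    # code generator for a tiny stack machine (the "target language")
    code = [] if code is None else code
    if ast[0] == "num":
        code.append(("PUSH", ast[1]))
    else:
        generate(ast[1], code)
        generate(ast[2], code)
        code.append(("OP", ast[0]))
    return code

print(generate(parse(tokenize("1 + 2 * (3 + 4)"))))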
Checking for errors has to be done in every analysis step to find lexical errors in the tokenizer, syntactical
errors while parsing, and type errors and others during semantic analysis.
If the target language is not the end product but only an assembly language, then an assembler produces
object files which will be combined by a linker into a final executable program.
There are also preprocessors for the source language, multi-pass compilers and JIT (just-in-time) compilers
which are used in interpreters and translate only an often-used section of the source program during runtime
for a performance gain.
4.2 The Rust Programming Language
»Rust is a systems programming language focused on three goals: safety, speed, and concurrency. It
maintains these goals without having a garbage collector, making it a useful language for a number of use
cases other languages aren't good at: embedding in other languages, programs with specific space and
time requirements, and writing low-level code, like device drivers and operating systems. It improves on
current languages targeting this space by having a number of compile-time safety checks that produce no
runtime overhead, while eliminating all data races. Rust also aims to achieve 'zero-cost abstractions' even
though some of these abstractions feel like those of a high-level language. Even then, Rust still allows
precise control like a low-level language would.«¹
The rst stable 1.0 release of the Mozilla-sponsored language was in May 2015, the current version is 1.11
from August 2016. Pattern matching, denition of data structures and traits together with strong types bring
Rust close to functional languages like Haskell. A program can in most cases only be compiled if it is free
from crashes and undened behavior unless explicitly a panic is allowed when an alternative result case of
a function is not handled. So sometimes there are many iterations until a program compiles, but then as a
gain there are assurances for the execution like memory safety instead of buffer overows. But on the other
side one has to be aware of the ow and reference of data through variables and has to comply to the rules
imposed by the move semantics of the language, because the ownership model is the core for the safety
guarantees.
Part of the language is the denition of test functions for software testing and inline documentation which can
include test cases. As tooling the build system and package manager Cargo comes along with the compiler
rustc which is invoked by Cargo. The inline documentation can be extracted to HTML.
Because of the noted advantages it seems that Rust ts well for compiler construction and offers a pleasing
workow once accustomed with its type system because of the more rare runtime errors leading to less
debugging.
4.3 The Python Programming Language
Python has been around for two decades now and is still representative of an easy-to-learn imperative
language with a simple syntax. It features dynamic types and a pragmatic approach to object orientation
with functional elements and closures. Freeing memory is done by reference counting and the language
is tightly coupled with the way it is interpreted. In order to fulfill a protocol/trait it relies on functions an
object needs to implement.
1 Book Introduction ”The Rust Programming Language”: https://doc.rust-lang.org/book/index.html
5 Considerations and Challenges with ZPAQL as
Target Language
The ZPAQL bytecode that runs on the virtual machine is executed for every input byte, from the beginning,
i.e. the first instruction. Then A is set to the input byte while all other state is maintained between the
executions. Also, when running as the hcomp section, the first elements of H are not general purpose memory
but serve as input for the context mixing components. Execution in a ZPAQL VM is sequential with no
parallelism. Besides the defined sizes of H and M there is no heap memory allocation or stack available. If
dynamic memory allocation is wanted it has to be implemented on top of H or M.
It is Turing-complete but the bytecode size is restricted to around 64 KB (in hcomp a few bytes less as the
context model needs to be defined before it). Also there is no keyword for a data section defined in the
ZPAQL assembly language, and it would be of no use as the bytecode content itself is not addressable and
also unmodifiable. But program data could be prepended before the stored data and retrieved by reading it
in hcomp or pcomp and discarded while pcomp generates the output. If data should just be accessed by
hcomp it could also be stored as a skipped part of the pcomp bytecode as this is seen by hcomp before the
decoded data. This could also be used if an interpreter is written in ZPAQL.
The jump instructions only support a fixed and not a variable destination. That means that instead of jumping to
a return address which the caller passes to the function, a return can only be achieved through an identifier
into a jump table with predefined jump points.
The instruction set is rather simple and aimed at 32-bit integers. To keep expensive abstractions away, the
source language should e.g. not include floating-point numbers or 64-bit integers. Most calculations have
to be done using register A as accumulator because the others lack advanced instructions.
There is no instruction for reading the next byte or for signaling that context computation has finished - the
program needs to halt and is then executed again from the beginning. Therefore, additional runtime code is
needed to continue execution where it stopped, if e.g. convenient input reading should be allowed without
disrupting the program flow or the finished computation of context data should be signaled.
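One way to picture the needed runtime code is a dispatch on a saved resume point, so that every fresh invocation continues where reading should go on (an illustrative Python model, not the compiler's actual runtime):

class Hcomp:
    """Models a program that is restarted for every input byte (like hcomp)
    but wants to 'read' several bytes inside its own control flow."""
    def __init__(self):
        self.resume = 0          # saved continuation point, kept across invocations
        self.first = 0

    def run(self, byte):         # invoked once per input byte
        if self.resume == 0:     # dispatch to where execution stopped last time
            self.first = byte    # first byte of a 16-bit value
            self.resume = 1
            return None
        else:
            self.resume = 0
            return (self.first << 8) | byte   # both bytes available, continue work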
It seems to be a viable plan to keep the organizational structure of a source program close to the resulting
ZPAQL program but with added convenience through a runtime API which is needed for input and
output anyway. Instead of introducing special functions to set the context data as the first elements in H, both
H and M can just be fully exposed as data structures in the source language and also be used as arrays for other
purposes with low abstraction costs. This keeps the structure of handwritten ZPAQL programs close to those
in the source language. But in order to keep variables on a stack, R with its 256 elements is not enough, so
expanding H seems to be a good solution. To model the repetitive execution of hcomp and pcomp they
could be defined as functions in the source program (think main function) and it would also be possible to
pass the input byte as an argument, which also keeps the similarity to a handwritten ZPAQL source.
As the runtime code abstractions for providing the mentioned read-API are not too high, the similarity to
original ZPAQL files is more of a cosmetic design decision. And if a context-set API which halts and continues
execution through runtime code is present, then the hcomp and pcomp functions could be replaced by main
functions which are entered only once and thus hide the fact that execution starts from the beginning for
every input byte. Still, dynamic memory management on top of H and M seems to be costly, and thus
departing too far from ZPAQL and adding more complicated data structures could hurt performance too
much.
It would be helpful if the source program were executable standalone without being compiled to ZPAQL, to
ease debugging by staying outside the ZPAQL VM as long as possible.
Before ZPAQL instructions are generated by the compiler, it would be helpful if the most complicated operations
were already solved at the IR level, like saving, restoring and other memory management.
6 Selection of the Python-subset as Source
Language
The earlier considerations are briefly taken up again before the chosen subset is presented together with the API
functions and the organizational source code constraints. Then the grammar is listed with a short note on the
behavior of its elements.
6.1 Decision on the Feature Set
Based on the preceding thoughts, the input should be a single Python file containing both the code for hcomp
and pcomp as functions, which can also call other functions, but no imports are allowed. The compiler
should not deal with strings, lists, arbitrarily big numbers, classes, closures and (function) objects but only
support a reduced set with variables holding 32-bit integer values. The context model should be defined
like in a ZPAQ configuration, and the predefined arrays H and M should be exposed as well as the size
exponents hh, hm, ph, pm and n as the number of components used.
API functions out(a) for the pcomp out instruction and read_b() to read the next input byte by halting
and returning to the current state of execution should be provided, as well as helpers if dynamic memory
management for additional arrays is implemented in Python on top of H and M. This API needs to be
included as a corresponding ZPAQL runtime in the output of the compiler. The Python runtime should also
allow execution as a standalone program with an interface similar to the zpaqd development tool, which can
run pcomp and hcomp on a given input.
All the code segments need to be separated, namely the context model definition and other common definitions/functions, the pcomp and hcomp code with their variables and functions, then the runtime API and finally the code for standalone execution. Thus a template approach was chosen which uses comment lines as separation marks. The compiler then extracts only the sections it needs to compile.
The sections of the source.py template and whether they are editable:
  Definition of the ZPAQ configuration header data (memory size, context mixing components) and optionally functions and variables used by both hcomp and pcomp (editable: yes)
  API functions for input and output, initialization of memory (editable: no)
  Function hcomp and associated global variables and functions (editable: yes)
  Function pcomp and associated global variables and functions (editable: yes)
  Code for standalone execution of the Python file, analog to running a ZPAQL configuration with zpaqd r [cfg] p|h (editable: no)
Of course it could be organized in a different, more appealing way based on the alternatives mentioned in the previous chapter, e.g. with two more idiomatic Python files for hcomp and pcomp without special entry functions and the runtime in a different file which either imports the two files or vice versa.
6.2 Grammar and Semantics
The Python grammar as specified in the language reference 1 has been simplified where possible. It still allows dictionaries and strings, however not for program code but just for the definition of the context mixing components. Tuples, unpacking, lists and list comprehensions, for-loops, with-blocks, support for async-coroutines, import, try/raise/except, decorators, lambda expressions and named arguments have been removed.
Not everything that is parsed is also allowed as source code, e.g. nonlocal, dicts, strings or the @-operator for matrix multiplication, but this way a better error message can be provided than just a parser error. The names of the productions have been kept even where they are simplified. It would be best if the grammar were expanded to parse full Python again and the code generator decided what to support.
Grammar with NUMBER, NAME, ”symbols”, NEWLINE, INDENT, DEDENT or STRING as terminals
Prog (NEWLINE* stmt)* ENDMARKER?
funcdef ”def” NAME Parameters ”:” suite
Parameters ”(” Typedargslist? ”)”
Typedargslist Tfpdef (”=” test)? (”,” Tfpdef (”=” test)?)* (”,” (”**” Tfpdef)?)?
Tfpdef NAME (”:” test)?
stmt simple_stmt | compound_stmt
simple_stmt small_stmt (”;” small_stmt)* ”;”? NEWLINE
small_stmt expr_stmt | pass_stmt | flow_stmt | global_stmt | nonlocal_stmt
expr_stmt (store_assign augassign test) | ((store_assign ”=”)? test)
1 https://docs.python.org/3/reference/grammar.html
store_assign NAME (”[” test ”]”)?
augassign ”+=” | ”-=” | ”*=” | ”@=” | ”//=” | ”/=” | ”%=”
”&=” | ”|=” | ”^=” | ”<<=” | ”>>=” | ”**=”
pass_stmt ”pass”
flow_stmt break_stmt | continue_stmt | return_stmt
break_stmt ”break”
continue_stmt ”continue”
return_stmt ”return” test
global_stmt ”global” NAME (”,” NAME)*
nonlocal_stmt ”nonlocal” NAME (”,” NAME)*
compound_stmt if_stmt | while_stmt | funcdef
if_stmt ”if” test ”:” suite (”elif” test ”:” suite)* (”else” ”:” suite)?
while_stmt ”while” test ”:” suite (”else” ”:” suite)?
suite simple_stmt | NEWLINE INDENT stmt+ DEDENT
test or_test
test_nocond or_test
or_test and_test (”or” and_test)*
and_test not_test (”and” not_test)*
not_test comparison | (”not” not_test)
comparison expr (comp_op expr)*
comp_op ”<” | ”>” | ”==” | ”>=” | ”<=” | ”!=” | ”in” | ”not” ”in” | ”is” | ”is” ”not”
expr xor_expr (”|” xor_expr)*
xor_expr and_expr (”^” and_expr)*
and_expr shift_expr (”&” shift_expr)*
shift_expr arith_expr | (arith_expr (shift_op arith_expr)+)
shift_op ”<<” | ”>>”
arith_expr term | (term (t_op term)+)
t_op ”+” | ”-”
term factor (f_op factor)*
f_op ”*” | ”@” | ”/” | ”%” | ”//”
factor (”+” factor) | (”-” factor) | (”~” factor) | power
power atom_expr (”**” factor)?
atom_expr (NAME ”(” arglist? ”)”) | (NAME ”[” test ”]”) | atom
atom (”(” test ”)”) | (”{” dictorsetmaker? ”}”) | NUMBER | STRING+ | ”...” |
”None” | ”True” | ”False” | NAME
dictorsetmaker dictorsetmaker_t (”,” dictorsetmaker_t)* ”,”?
dictorsetmaker_t test ”:” test
arglist test (”,” test)* ”,”?
The semantics of the language elements as described in the reference 2 stay mostly the same, even if the usable feature set is reduced as stated before. In particular, one has to be aware of integer overflows, which are absent in Python but present in ZPAQL, and thus all computations are carried out modulo 2**32, i.e. in the range 0 to 2**32 - 1. An exception are the bit shift operations with a shift of 32 or more bits: in this case the Python-subset will do a shift by the amount modulo 32 bits. To achieve the semantics of (v << X) % 2**32 or (v >> X) % 2**32 with X >= 32, the resulting value should directly be set to 0 instead. Also / 0 and % 0 do not fail in ZPAQL but result in a 0 value.
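When the same code is run as plain Python, the wrap-around has to be made explicit to obtain identical results; a small illustration with placeholder variables a, b, x and d:

# plain Python integers do not overflow, so mask to match the compiled result:
v = (a * b) & 0xFFFFFFFF             # multiplication modulo 2**32 as in ZPAQL
# shifts of 32 or more bits behave differently, so set the result to 0 explicitly:
s = 0 if x >= 32 else (v << x) & 0xFFFFFFFF
# division and modulo by zero do not fail in ZPAQL but yield 0:
q = 0 if d == 0 else v // d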
For a wrong Python input the current compiler might behave in a different way and even accept it without a failure. Therefore it is required that the input is a valid Python program which runs without exceptions.
This requirement is also important because the current compiler does not check array boundaries, so index % len(hH) or index & ((1 << hh) - 1) should be used e.g. for a ring buffer, because after the original size of hH comes the stack. If run as a plain Python file, an exception is thrown anyway in that case because Python checks array boundaries.
2 https://docs.python.org/3/reference/index.html
7 Design of the Compiler in Rust
The source code for the zpaqlpy compiler is located at https://github.com/pothos/zpaqlpy and includes the test data and other files mentioned. This chapter is about the developed API the source file uses and about the process of developing the compiler and its internals.
7.1 Exposed API
The 32- or 8-bit memory areas H and M are available as the arrays hH, pH, hM, pM, depending on being in a hcomp or pcomp section, with the sizes 2**hh, 2**hm, 2**ph, 2**pm defined in the header and available as the constants hh, hm, ph, pm. There is support for len(hH), len(pH), len(hM), len(pM) instead of calculating 2**hh etc. But in general, instead of using len() for dynamically allocated arrays as well, special functions like len_hH() are used to visibly expose their types and do runtime checks already in Python. NONE is a shortcut for the value 2**32 - 1 (i.e. 4294967295).
Other functions and their description:
  c = read_b(): Read one input byte; might leave VM execution to get the next input byte before returning
  push_b(c): Put read byte c back; overwrites if one is already present (no buffer)
  c = peek_b(): Read but do not consume the next byte; might leave VM execution to get the next input byte before returning
  out(c): In pcomp: write c to the output stream
  error(): Execution fails with ”Bad ZPAQL opcode”
  aref = alloc_pH(asize), …: Allocate an array of size asize on pH/pM/hH/hM
  aref = array_pH(intaddr), …: Cast an integer address back to a reference
  len_pH(aref), …: Get the length of an array in pH/pM/hH/hM
  free_pH(aref), …: Free the memory in pH/pM/hH/hM again by destructing the array
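For instance, a helper in the pcomp code could read a 16-bit little-endian value with these functions (read_u16 is a hypothetical helper, not part of the runtime):

def read_u16():
    lo = read_b()       # may halt the VM until the next input byte is available
    hi = read_b()
    return (hi << 8) | lo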
If the backend implementations addr_alloc_pH(size), addr_free_pH(addr), … are defined, then dynamic memory management is available through the API functions alloc_pH, free_pH and their counterparts for the other memory areas. The cast
array_pH(numbervar) can be used to save a type check in ZPAQL at runtime. Also in plain Python the cast from an address is needed after an array reference was itself stored into H or M, thus became an address number and is then retrieved as a number again instead of a reference. In general, there are no boxed types, but by context a variable is used as an address.
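A hypothetical usage sketch inside pcomp code, assuming that allocated references can be indexed like the predefined arrays and that pH is large enough:

buf = alloc_pM(256)      # new byte array allocated on pM
buf[0] = 65
pH[0] = buf              # storing the reference turns it into a plain address number
ref = array_pM(pH[0])    # cast the number back into a reference before using it again
out(ref[0])
free_pM(ref)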
The last addressable starting point for any list is 2**31 - 1 because the compiler uses the 32nd bit to distinguish between pointers to H and M.
The provided implementations of addr_alloc_pM, addr_free_pM, … can be found in the template (run ./zpaqlpy --emit-template or see src/template.rs). The returned pointer is expected to point at the first element of the array. One entry before the first element is used to store whether this memory section is free or not. Before that the length of the array is stored, i.e. a single entry for arrays in H and the four bytes of the 32-bit length for arrays in M.
Besides these constraints the implementations are free in how they find a free region. The example implementation uses getter and setter functions for the 32-bit length value stored as four bytes in M. For allocation it skips over the blocks from the beginning until a sufficiently sized block is found. If this block is bigger, the rest of it is kept free and might be merged with the next block if that one is also free. The same happens when a block is freed again; then it is even merged with the previous block if that is free.
7.2 Parser and Tokenizer
Tokens are known Python keywords and symbols like operators or brackets, names or strings and numbers as converted values. Initially the Python module tokenize was used by piping source code into python3 -m tokenize -e and processing its output to build tokens as a Rust data structure (see src/tok.rs). Later on, the code commonly residing in /usr/lib/python3.5/tokenize.py and /usr/lib/python3.5/token.py was ported to Rust (see src/rtok.rs) in order to be independent from external calls. It is assumed that the input is UTF-8 without a BOM (Byte Order Mark).
The parser is built with a parser generator from the grammar specification and constructs the AST production by production. The parser library lalrpop 1 was chosen in LALR(1) recursive ascent mode for that purpose.
The elements of the produced AST are based on the abstract grammar of the Python module ast 2 but simplified (see src/ast.rs) and give a structured representation of the source program. In fact it is not a tree but a list of statements. They can be one of FunctionDef which holds the function body as a list of statements, Return, Assign and AugAssign which hold the expressions concerned, While and If which hold their body and else-part as lists of statements and the test as an expression, Global, Pass, Break, Continue and Expr which encapsulates an expression. These expressions can be one of BoolOpE to evaluate and and or
1 https://github.com/nikomatsakis/lalrpop
2 https://docs.python.org/3.5/library/ast.html#abstract-grammar
over expressions, BinOp for the evaluation of an arithmetic expression over two expressions, UnaryOpE for a unary operation, Compare for comparisons over expressions, Call for a function call including arguments as expressions, Num for an integer, NameConstant for True and False, Name for a variable or Subscript for index access to an array variable. The precedence of and and or is resolved during parsing into a binary tree of BoolOpE elements. Contrary to that, one has to be aware that with Compare the semantics of a == b == c and (a == b) == c differ: the middle operand is split up and passed to the next comparison. The results are evaluated and merged with and until the result can not be True anymore, so this was better left to the IR generator.
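A short reminder of the chained comparison semantics the IR generator therefore has to reproduce:

x = (2 == 2 == 1)      # False: evaluated as (2 == 2) and (2 == 1)
y = ((2 == 2) == 1)    # True: the boolean result True compares equal to 1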
7.3 Grammar of the Intermediate Representation
The IR was chosen to be close to ZPAQL but easier to write. While ZPAQL has the registers A, B, C, D and R0…R255 and the arrays H and M, the IR gives only control over H, M and the temporary variables and no other registers. There are no new data structures added. So the other registers can be used for address computation and temporary calculations when the IR is converted to ZPAQL. That means the temporary variables t0…t255 of the IR are a direct mapping to the registers R0…R255. Because the input byte is in A at the beginning of hcomp/pcomp execution, the IR relies on the guarantee that t255 holds this input byte before the first instruction.
stmt var ”=” var (op var)?
var ”=” uop var
”if” var ”goto” label
”ifN” var ”goto” label
”ifEq” var var ”goto” label (to be used for optimizations)
”ifNeq” var var ”goto” label
”goto” label
”:” label ”:”
”halt”
”error”
”out” var
var t | ”H[” t ”]” | ”H[t0+” x ”]” | ”H[t252+” x ”]” | ”H[” x ”]” | ”M[” t ”]” | ”M[” x ”]” | x
op ”+” | ”-” | ”*” | ”/” | ”//” | ”%” | ”**” | ”<<” | ”>>” | ”|” | ”^”
”&” | ”or” | ”and” | ”==” | ”!=” | ”<” | ”<=” | ”>” | ”>=”
uop ”!” | ”~” | ”-”
t ”t0” | … | ”t255”
x ”0” | … | ”4294967295”
label [a-z_0-9~A-Z]+
comment ”#…\n”
The var on the left side of an assignment can not be a number. The operators or and and differ from the binary versions | and & as they represent the semantics of Python's or and and, i.e. they evaluate to the original value and not simply to the boolean choices 1 and 0 for True and False, while the binary operators work bitwise. The operator !v tests against v == 0 while ~v inverts the bits.
The choices made might not be the best and a totally different IR is possible which could introduce variables or stack operations. For efficiency it might be interesting to go into the direction of static single assignment with linear scan register allocation or an algorithm with graph-coloring register allocation. Currently, local Python variables are on the stack and temporary variables are in R because together they could be more than 256 and local variables need to be stored on the stack before a call anyway. But they could be joined in a common pool and the current limit of at most 256 temporary variables in R could be widened by using elements of H as needed. Maybe even LLVM as a very popular IR could be used with its many optimization passes available.
7.4 IR Generation and the Stack
Due to the reduced feature set of the subset, the compilation of the Python source can be done in a similar way as a C source would be compiled. The local variables of a function are held on the stack which is produced by expanding H beyond its defined size for the configuration. Global variables are held at the beginning of the stack. The temporary variables t0…t251 are used for intermediate or address computations, whereas t0 is the base pointer for the stack and t252 is a copy of the global base pointer. t255 holds the last input byte, t254 the reading state and t253 the read byte for the API function read_b(), which stops the execution and returns to the caller with the newly acquired byte when the bytecode is run again.
The original size 2**hh of H is extended to hold the stack. To calculate a valid resulting size 2**hh', the smallest exponent hh' with 2**hh' >= 2**hh + stacksize is used.
There are no IR instructions for calling a function or for stack operations on H, but there are helping meta instructions. These will be converted to simple IR instructions and have been defined for handling blocks, saving and loading variables on the stack in H, calls and returns that comply with a calling convention, predefined IR code sections for the runtime API and the jump table to continue execution after a return. The initial IR code is responsible for either calling the Python functions hcomp(c)/pcomp(c) with the new input byte or continuing execution if it was interrupted through a read_b() API call. It also sets the base pointer for the first run and defines the API function read_b.
The function traverse() in src/gen_ir.rs generates IR instructions for the given (part of an) AST. It is used recursively and consults the symbol table for free temporary variables, the position of local variables and the mapping of global variables. It also uses the recursive function evaluate(), which takes only an expression part of an AST and returns IR instructions and the IR variable that holds the value after these instructions have been executed. Returning the instructions makes testing easier. If pcomp(c)/hcomp(c) only contains a pass then the whole pcomp/hcomp section is omitted.
The temporary variables have to be saved on the stack before a call. Also the current base pointer and then the return ID for the jump table need to be saved there as part of the calling convention. The new base pointer in t0 points at the return ID. Passed arguments come afterwards and the called function will address them via H[t0+x]. On return, the previous base pointer is restored to t0 and the return ID is copied to t2 for the jump table while the return value is in t1, before the jump to the code for the jump table is done in order to return to just after the call instruction.
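The resulting frame layout in H could be pictured as follows; the exact offsets are an assumption derived from the convention described above:

# ... caller's frame and saved temporary variables ...
# H[t0-1]   saved base pointer of the caller
# H[t0]     return ID for the jump table      <- new base pointer t0
# H[t0+1]   first argument
# H[t0+2]   second argument
# ...       local variables of the called function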
It should be possible to treat local variables like temporary variables to reduce code size and stack usage via a compiler flag. Also the calling convention could be changed to avoid the stack, and globals could be stored in temporary variables tX.
For now the compiler passes IR code around as vectors (lists), but for a more idiomatic Rust style iterators
could be used.
7.5 Optimizations in the Intermediate Representation
Unused functions occupy space in the bytecode and hence are removed: the compiler searches for calls to a function and if none is found the function is removed. This is repeated until no more functions can be removed. Assignments which are unused or could be merged into one assignment are not yet recognized.
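A minimal sketch of this repeated elimination, where functions maps names to bodies and calls_in lists the called names (both are hypothetical helpers, not the compiler's actual data structures):

removed = True
while removed:                       # repeat until no function gets removed anymore
    removed = False
    for name in list(functions):
        if name in ("hcomp", "pcomp"):
            continue
        if all(name not in calls_in(body) for body in functions.values()):
            del functions[name]      # no caller found, drop the function
            removed = True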
During IR generation it is not known whether a temporary variable really needs to be saved on the stack before a call or whether it is not used afterwards anyway. Therefore the helper macro instructions MarkTempVarStart and MarkTempVarEnd are inserted around each function scope. Saving and restoring temporary variables is done through the macros StoreTempVars{identifiers} and LoadTempVars{identifiers}. The optimization pass goes through the IR instructions in reverse order to reason about the lifetimes and removes from the load and store macro instructions those identifiers which are not live, i.e. not referred to after the load.
Compiler books have much more to offer and there are many classical optimizations which could be looked at for further improvements. Caching could be used and constant expressions could be evaluated during IR generation. This is currently only done for a row of assignments to allocated arrays in M.
7.6 ZPAQL Generation and Register Assignment
The IR was chosen to be easily convertible to ZPAQL by using the registers A, B, C and D to calculate and move values between H and M. Thus for the beginning a very simple but inefficient solution was
taken. The function emit_zpaql() in src/gen_zpaql.rs takes IR code and yields ZPAQL code by calling assign_var_to_a(var) for the operands (if needed saving one value from A in another register) and applying the operator on them. The result is moved from A to the target by calling assign_a_to_var(var). In order to load a number into A, calc_number(value) is used, and for values greater than 255 multiple bit shifts are needed. All of these helper functions emit ZPAQL code which is combined as the result of each function, in a similar way to the IR generation.
Then a more advanced solution was chosen with pattern matching to avoid the ubiquitous transfer over A and to detect an augmented assignment like += 1 and simply increase the value at its location. Therefore the helper function gen_loc_for_var() was introduced to get a location handle, and the other helpers were widened to operate with the notion of a location, meaning registers or memory positions accessed through pointers instead of only A. Tracking recent values of variables in the registers or memory locations for reuse was also helpful in order to produce less code output. Yet it is now a more delicate issue to make changes, because the cache entries concerning the location and the variable need - at least - to be invalidated for a correct output.
After the code generation a special optimization takes place to simplify byte assignments on arrays in M. They have been produced in the IR generation if a row of successive assignments, like in an initialization, was detected. Thus the pointer variable is always just increased and not even saved back to the stack until all assignments are finished.
Resolving labels to positions for the jump destinations is done as the last step before the ZPAQL assembly code is written out. The opcode size of each instruction is known and thus the label positions can be held in a hash table. A second pass through the code can then replace the virtual GoTo placeholder instruction with the real long jump.
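This corresponds to a classic two-pass resolution; a sketch in Python, where the tuple representation of instructions and the size lookup opcode_size are only stand-ins for the compiler's internal Rust types:

# labels as ("label", name), placeholders as ("goto", name), other instructions as opcode tuples
positions = {}
pc = 0
for instr in code:                    # first pass: record the byte position of every label
    if instr[0] == "label":
        positions[instr[1]] = pc
    else:
        pc += opcode_size(instr)

resolved = []
for instr in code:                    # second pass: replace goto placeholders with long jumps
    if instr[0] == "goto":
        resolved.append(("ljmp", positions[instr[1]]))
    elif instr[0] != "label":
        resolved.append(instr)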
7.7 Debugging
It is important to run and test the plain Python version, which contains assertions, before trying to compile it to ZPAQL, to stay outside the ZPAQL VM as long as possible. For the Python runtime a compare option was introduced to test the correctness of the pcomp code, which should restore the preprocessor input. Each output byte is compared to the expected output and a debugger shell is spawned if a mismatch occurs.
Because compiler optimizations are likely to introduce bugs, there is a small test suite in the Makefile. It compares the output of the Python code for pcomp and hcomp to the output of running the compiled ZPAQL code. Mostly zpaqd r CFG p is used, but also a ZPAQ VM implementation in Rust included in the compiler (option --run-hcomp), because zpaqd r CFG h does not print the computed context data in H.
8 Exemplary Configurations and Programs
As a showcase of the benets of having a compiler for a high-level language like Python this chapter presents
and evaluates two new ZPAQ congurations. The rst is a new image compression model which can parse
a header and combine various context predictors. And the second is an implementation of the Brotli decom-
pression scheme in the pcomp phase.
8.1 Compression of PNM Image Data using a Context Model
PNM is a lossless image format 1 which, in variant P6 with a maximum color value of 255, stores the 8-bit RGB color channels as bytes after a short header containing width and height. To predict the value of a color channel the values of the neighboring pixels can be used as context as far as they have been seen in the stream. L refers to the left pixel, LU to the upper left, U to the pixel above and UR to the upper right.
For grayscale images in a custom format such a context model has already been developed 2 for ZPAQ and performed better than PNG compression. Even if it does not match the ZPAQ configuration bmp_j4c.cfg, it is in turn much faster because its context mixing network is simpler.
The PNM model test/pnm.py parses the header in Python to define the history buffer for one row and takes a similar approach to the grayscale model, but with separate context models for each channel. They are mixed together along with an averaged-value context model and indirect models. This mixed prediction is again combined by a mixer with the prediction of a chain of ISSE (with a hash of various contexts and the last byte as context) refining the prediction of an ICM (with an averaged-value context).
Color transforms like the subtract green transform in WebP can improve compression by decorrelating colors, because one color component is often also a good predictor for the others 3, and that link should be reduced to improve prediction for the same color. The PNM model can use it as pre- and postprocessor, mapping (R, G, B) to (R-G, G, B-G) and back. The preprocessor is in the separate Python file test/subtract_green.
1 http://netpbm.sourceforge.net/doc/ppm.html
2 http://www.modejong.com/blog/post15_zpaql_grayscale/index.html
3 https://developers.google.com/speed/webp/docs/webp_lossless_bitstream_specification#subtract_
green_transform
To measure the compression ratio, four image files were converted to PNM in test/ and compressed (the 512 pixel high peppers.pnm, monarch.pnm and kodim23.pnm 4 and the 1020 pixel high rafale.pnm 5). FLIF is a new lossless format 6.
Figure 8.1: Images of the benchmark: (a) kodim23, (b) peppers, (c) monarch, (d) rafale
Method Total le size in byte
ZPAQ archive with bmp_j4c.cfg (in BMP format, uses special color transform) 1785405
ZPAQ archive with pnm.cfg 1908258
FLIF 1914297
WebP 2256458
PNG 2938153
For now it does not use e.g. delta coding or other predictors besides the average of neighbor values. It could also provide another predicted color value estimated from the recent development. Furthermore, the color transform should depend on the image, or a way needs to be found to move its advantage out of the preprocessor into the predictors. Like the mentioned grayscale model it uses an additional context for the mixer depending on image noise.
By using Python the model can easily be improved, whereas writing it in ZPAQL would have been a bigger and more bug-prone effort. The bytecode overhead as of writing is around 3 KB.
For the large 53 MB photo canon24.pnm 7 it would reach the 6th place with 21,326,964 bytes (89 seconds for (de)compression) in the benchmark for lossless picture compression published in a paper about BWT image compression [15]. For kodim23.pnm it would reach the 4th place with an archive size of 369,519 bytes. An overall benchmark for a huge number of files like in 8 or 9 has not been done up to now.
4 Commonly used for image compression benchmarks, available under http://r0k.us/graphics/kodak/
5 http://www.maximumcompression.com/data/bmp.php
6 http://flif.info/
7 https://web.archive.org/web/20140702040431/http://www.squeezechart.com/canon24.pnm
8 http://www.squeezechart.com/bitmap.html
9 http://imagecompression.info/gralic/LPCB.html
8.2 Bringing the Brotli Algorithm to ZPAQ
The algorithm has to be implemented as a postprocessor in pcomp and uses the Brotli compressor 10 as preprocessor. Because the dictionary alone is almost twice the maximum allowed bytecode size, it has to be prepended before the first segment of the compressed Brotli data in order to be read into memory. Following Brotli streams as additional segments in the same block can use a byte different from the first dictionary byte as prefix instead of prepending the whole dictionary again.
Instead of porting the C library or writing a new implementation based on the Brotli specification [11], it was easier to port a decompressor written in Rust 11 to the Python-subset (see test/brotli.py). It works as a standalone Python program test/brotli.py and can be compiled to ZPAQL with a bytecode size of 64870 bytes (ZPAQ config file test/brotli.cfg).
This means the size of an empty archive is already 184 KB if no context mixing is used for the dictionary
and bytecode. A simple order-1 ICM model brings it down to 90 KB. The slightly more advanced model of
mfast.cfg and its hcomp code reach 64 KB in total.
The overhead is less signicant for bigger archives and trying to reduce it by applying context mixing to the
whole block might not always improve the result because the Brotli data could be expanded in coding size.
For the four sample PNM les it saves 133 KB which is almost the size of the dictionary and bytecode (from
3615 KB with no arithmetic coding down to 3482 KB in total). For the enwik8 benchmark
12
(100 MB
Wikipedia dump) however this would not even make sense because mfast.cfg alone performs better by
reaching 24720 KB instead of 30303 KB when used together with Brotli data and the dictionary/bytecode
overhead. The same Brotli data without arithmetic coding through context model can be stored in around
30416 KB, depending on the Brotli compressor options it might go down to around 26000 KB.
Decompression is much slower than with the C or Rust Brotli implementations. The ZPAQL implementation needs 118 seconds for enwik8 instead of 0.5 or 2.1 seconds for the reference C implementation and the one of Mark Adler 13. The memory management is not very efficient, especially dealing with byte arrays in M is expensive, and the compiler could sometimes produce better code by avoiding intermediate variables in R. Besides that, ZPAQL is JIT-compiled by libzpaq in a rather simple way, so an advanced JIT backend on top of LLVM could help.
But it shows that ZPAQ can be used with a very different algorithm while still maintaining backwards compatibility. A ZPAQ decompressor could also detect whether the embedded pcomp bytecode is in fact a Brotli implementation and skip its execution by using one of the faster C or Rust implementations. That way new algorithms become instantly available while new versions of decompressors can gain a speed advantage by replacing the bytecode execution with the equivalent library call.
10 https://github.com/google/brotli
11 https://github.com/ende76/brotli-rs
12 Used for the Hutter Prize compression challenge, comparison here: http://mattmahoney.net/dc/text.html
13 https://github.com/madler/brotli
9 Evaluation
This chapter reviews the outcome of the work on the compiler so far and attempts to provide one possible answer to the question whether, or how well, ZPAQ is suited to be a general standard for data compression.
9.1 Compiler Construction
Rust provided a good environment for preventing many bugs by detecting them at compile time. Yet it takes up to four minutes for an iterative build, which is also due to the big parser generated by lalrpop. Recently there has been development on a table-driven mode which could provide faster compile times by producing less parser code. Also the upcoming MIR intermediate language will allow the Rust compiler to cache compiled code segments if they have not been altered. This way test-driven development would be easier because of fast iterations.
Even though Rust provides built-in test support, it has not been used, as the compiled code output was mainly tested with zpaqd. This could be improved, especially with the small ZPAQL VM implementation that is already included and used for testing the correctness of computations through the Makefile. The code lacks inline documentation and the structure is sometimes not idiomatic Rust. Error handling currently uses panics instead of propagating success or failure via result types.
9.2 Performance of the Compiler
The choice of the Python-subset and the IR seems to have supported a fast development of the compiler, even if a lot of optimization passes are still missing. There might be compiler bugs which have not been discovered yet, as the test coverage is not too extensive.
While the original ZPAQL file for test/lz1.orig.cfg needs 251 bytes for an empty archive, the port test/lz1.py compiled to bytecode needs 1255 bytes for an empty archive, which is an overhead of around 1000 bytes compared to the handwritten LZ1.
Runtime benchmarks of the LZ1 configuration in ZPAQL and Python for the four PNMs:
  zpaqlpy lz1.cfg: compression time 1.6 s, archive size 3350016 bytes, decompression time 1.85 s
  handwritten lz1.orig.cfg: compression time 1.3 s, archive size 3349062 bytes, decompression time 1.53 s
9.3 Analysis of the generated Code and Comparison with handwritten Code of LZ1
The original test/lz1.orig.cfg is very dense and tries to hold values only in the registers if possible. Only in the pcomp code are two variables used, along with M as history buffer. Sometimes values are moved between the registers because certain operations are only available on A. Without comments it would be difficult to guess what it does and to make changes.
On the other hand, the compiled test/lz1.cfg from the Python version is rather large, especially when all comments are embedded, which is the default. This does not harm the bytecode size but is useful for debugging and partly allows extraction of the original source as well as the IR code by invoking:
egrep ’\( *[0-9]+: ’ test/lz1.cfg # Python code
egrep ’\( +’ test/lz1.cfg | egrep -v ’\( *[0-9]+: ’ # IR code
The compiled le starts with the runtime code which decides whether it is the rst run. For the rst run
it sets the base pointer and starts the global denitions before calling the pcomp/hcomp function with the
input. When initialization did take place already then the code decides whether a read was interrupted to
get a new input byte or if just the pcomp/hcomp function should be called again with the new input. As
stated in the previous chapters all Python variables are in the stack and thus it takes more opcodes to do
calculations on them. The code does not use custom arrays but instead used as history buffer as the
original code. Everytime when the function returns the jump table is used to jump behind the call and then
the execution halts.
It is an advantage that changes can be made more quickly in Python and that the code is more comprehensible. But the compiled code uses a stack even though it is not really necessary, and instead of using the jump table for the return it could just use the halt instruction. So for simple programs like this it would be desirable if the compiler offered a mode without a stack that holds variables in registers.
9.4 Suitability of ZPAQ as universal Standard
ZPAQ serves well for research on context mixing algorithms and as a development platform for custom algorithms. But whether it will replace general compression implementations in various libraries, like deflate and newer variants, is open to question. Because a huge memory allocation or an endless loop can be unsafe when decompressing untrusted archives, at least configurable limits need to be implemented in libzpaq. It might also not be perfect for universal use as the mixing type (logarithmic) is quite fixed. In addition, the definition of an own encoder can not use the built-in context mixing. A component which takes something like a value computed in H directly as stretched prediction value would be interesting (in a valid range to leave range space for the other bit). Also it is not possible to define new components, which maybe could be moved into an additional section with its identifier mentioned. While pcomp gets notified about segments, hcomp does not, which could also be of use.
As a compiler target, the ZPAQL instruction set lacks a variable jump instruction like jmpa, which would take the value of A as destination, or getpc, which sets A to the current program counter. It would be useful to set 32-bit values directly in one instruction like a=L 3210987654. Memory management on top of H and M is possible, but a new data structure which allows using the OS heap would be more efficient. These proposals could be considered for the next level of the ZPAQ specification.
10 Conclusion
It was shown that it is possible to have a compiler targeting ZPAQL which is also helpful for the development of new algorithms. To write the compiler, large parts of the Python grammar have been ported to the format of the LALR(1) parser library lalrpop. The core functionality of the Python module tokenize has been ported to Rust. As a goal for a source input which should be supported, a Rust implementation of Brotli decompression has been ported to Python. A compilation scheme from a Python-subset to ZPAQL, including an IR, has been planned and implemented. The compiler is platform independent as long as a Rust compiler is available. It features an implementation of the ZPAQ VM to print the calculated context data for each input byte, which can then be compared with the calculated values of the Python source to expose ZPAQL-specific semantics or even compiler bugs.
As a result it can be said that more and different approaches should be tried to reach an acceptable performance for a large and complex code base. The ZPAQ specification does not, at its current point, offer various features which might be considered for future improvements, from additional instructions up to ways of being even more flexible in how prediction takes place. Most important, however, would be a variable jump instruction.
Compilation can be congured with command line arguments to some extend. The pcomp or hcomp part
can be disabled so no changes to the input le are needed to vary between using only context mixing, only
a preprocessor or both. Documentation can be printed via arguments as well. Four example source les are
provided: A small context mixing compression for run length encoded data (see appendix tutorial), the LZ1
port and the PNM model which already showed good results with less efforts. An extreme case is the Brotli
algorithm which needed many compiler optimizations to t under the 64 KB bytecode limit and utilizes
dynamic memory allocation.
While the compiler was developed, two bugs in ZPAQ tools were found and have already been resolved through two new releases. One was a simple crash in zpaqd and the more serious one a wrong instruction in the x86 JIT code of libzpaq which caused a miscomputation in the Brotli decompressor.
For most results there are measurements which can be repeated because all needed tools are published as free/libre software.
While PAQ and its internals were covered and modified in publications, ZPAQ was often only used as just another compressor instead of a platform for compression algorithms. This work exposed the crucial part of ZPAQ, which is the embedding of the two bytecodes for hcomp and pcomp in the archive.
Bibliography
[1] J. Cleary and I. Witten. Data compression using adaptive coding and partial string matching. IEEE
Transactions on Communications, 32(4):396–402, Apr 1984. ISSN 0090-6778. doi: 10.1109/
TCOM.1984.1096090. http://dx.doi.org/10.1109/TCOM.1984.1096090.
[2] G. V. Cormack and R. N. S. Horspool. Data compression using dynamic markov modelling. Comput.
J., 30(6):541–550, December 1987. ISSN 0010-4620. doi: 10.1093/comjnl/30.6.541. http:
//dx.doi.org/10.1093/comjnl/30.6.541.
[3] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens. The context-tree weighting method: basic prop-
erties. IEEE Transactions on Information Theory, 41(3):653–664, May 1995. ISSN 0018-9448. doi:
10.1109/18.382012. http://dx.doi.org/10.1109/18.382012.
[4] Matthew V. Mahoney. Fast text compression with neural networks. In James N. Etheredge and Bill Z. Manaris, editors, Proceedings of the Thirteenth International Florida Artificial Intelligence Research Society Conference, May 22-24, 2000, Orlando, Florida, USA, pages 230–234. AAAI Press, 2000. ISBN 1-57735-113-4. http://www.aaai.org/Library/FLAIRS/2000/flairs00-044.php.
[5] Matthew V. Mahoney. The PAQ1 data compression program. 2002. https://cs.fit.edu/
~mmahoney/compression/paq1.pdf.
[6] Matthew V. Mahoney. Adaptive weighing of context models for lossless data compression, Florida
Tech. Technical Report CS-2005-16. 2005. https://cs.fit.edu/~mmahoney/compression/
cs200516.pdf.
[7] Matthew V. Mahoney. Data compression explained, 2010. http://mattmahoney.net/dc/dce.
html.
[8] Matthew V. Mahoney. The ZPAQ open standard format for highly compressed data - Level 2. 2016.
http://mattmahoney.net/dc/zpaq206.pdf.
[9] Matthew V. Mahoney. The ZPAQ compression algorithm. 2015. http://mattmahoney.net/dc/
zpaq_compression.pdf.
[10] James K. Boneld and Matthew V. Mahoney. Compression of FASTQ and SAM format sequencing
data. PLoS ONE, 8(3):1–10, 03 2013. doi: 10.1371/journal.pone.0059190. http://dx.doi.org/
10.1371%2Fjournal.pone.0059190.
[11] Jyrki Alakuijala and Zoltan Szabadka. Brotli Compressed Data Format. RFC 7932, July 2016. https:
//rfc-editor.org/rfc/rfc7932.txt.
[12] John Scoville. Fast autocorrelated context models for data compression. CoRR, abs/1305.5486,
2013. http://arxiv.org/abs/1305.5486.
[13] Byron Knoll and Nando de Freitas. A machine learning perspective on predictive coding with PAQ.
CoRR, abs/1108.3298, 2011. http://arxiv.org/abs/1108.3298.
[14] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques,
and Tools (2nd Edition). Addison Wesley, 2006. ISBN 978-0321486813.
[15] Aftab Khan and Ashfaq Khan. Lossless colour image compression using RCT for bi-level BWCA.
Signal, Image and Video Processing, 10(3):601–607, 2016. ISSN 1863-1711. doi: 10.1007/
s11760-015-0783-3. http://dx.doi.org/10.1007/s11760-015-0783-3.
A Tutorial
A context mixing model with a preprocessor for run-length encoding is written. Three components are used to form the network. Create a new template which will then be modified at the beginning and in the pcomp/hcomp sections:
$ ./zpaqlpy --emit-template > rle_model.py
$ chmod +x rle_model.py
First the size of the arrays H and M for each section, hcomp and pcomp, needs to be specified:
hh = 2  # i.e. size is 2**2 = 4, because hH[0], hH[1], hH[2] are the inputs for the components
Listing A.1: rle_model.py
One component should give predictions based on the byte value and the second component based on the run length; both give predictions for the next count and the next value. Then the context-mixing components are combined to a network:
n = len({
    0: "cm 19 22",   # context table size 2**19 with partly decoded byte
                     # as 9 bit hash xored with the context, count limit 22
    1: "cm 19 22",
    2: "mix2 1 0 1 30 0",
    # will mix 0 and 1 together, context table size 2**1 with and-0 masking of the
    # partly decoded byte which is added to the context, learning rate 30
})
Listing A.2: rle_model.py
Each component gets its context input from its entry in hH after each run of the hcomp function, which is called for each input byte of the preprocessed data. That data either is to be stored through arithmetic coding in the compression phase or is retrieved through decoding in the decompression phase, with the following postprocessing done by calls of the pcomp function.
The context-mixing network is written to the archive in byte representation as well as the bytecode for hcomp and pcomp (if they are used). The preprocessor command is needed when the compiled file is used with zpaqd if a pcomp section is present. As the preprocessor might be any external program, or might be included in the compressing archiver, and is of no use for decompression, it is not mentioned in the archive anymore. This way we specify a preprocessor:
pcomp_invocation = "./simple_rle"
Listing A.3: rle_model.py
$ chmod +x simple_rle # create the preprocessor as executable file and fill it as follows
#!/usr/bin/env python3
import sys
input = sys.argv[1]
output = sys.argv[2]
with open(input, mode="rb") as fi:
    with open(output, mode="wb") as fo:
        last = None
        count = 0
        data = []
        for a in fi.read():
            if a != last or count == 255:  # count only up to 255 to use one byte
                if last != None:  # write out the pair
                    data.append(last)
                    data.append(count)
                last = a  # start counting
                count = 1
            else:
                count += 1  # continue counting
        if last != None:
            data.append(last)
            data.append(count)
        fo.write(bytes(data))
Listing A.4: simple_rle
Then we need code in the pcomp section to undo this transform:
case_loading = False
last = NONE

def pcomp(c):
    global case_loading, last
    if c == NONE:  # start of a new segment, so restart our code
        case_loading = False
        last = NONE
        return
    if not case_loading:  # c is the byte to load
        case_loading = True
        last = c
    else:  # write out the content of last c times
        case_loading = False
        while c > 0:
            c -= 1
            out(last)
Listing A.5: rle_model.py
So now it should produce the same file as the input file:
$ ./simple_rle INPUTFILE input.rle
$ ./rle_model.py pcomp input.rle input.norle
$ cmp INPUTFILE input.norle
We can already try it, even if hcomp does not compute the context data yet (so compression is not good):
$ ./zpaqlpy rle_model.py
$ ./zpaqd c rle_model.cfg archive.zpaq FILE FILE FILE
Now we can add hcomp code to improve compression by adaptive prediction:
at_counter = False  # if false, then c is a byte, otherwise c is a counter
last_value = 0
last_counter = 0

def hcomp(c):  # pcomp bytecode is passed first (or 0 if there is none)
    global at_counter, last_value, last_counter
    if at_counter:
        last_counter = c
    else:
        last_value = c
    # first part of the context for the first CM is the byte replicated and
    # the second is whether we are at a counter (then we predict for a byte) or vice versa
    hH[0] = (last_value << 1) + at_counter  # at_counter will occupy 1 bit, therefore shift
    hH[0] <<= 9  # again shift to the side because of the xor with the partially decoded byte
    # second CM same but uses the counter for prediction
    hH[1] = (last_counter << 1) + at_counter
    hH[1] <<= 9
    hH[2] = at_counter + 0  # context for mixer: is at counter (1) or not (0)
    at_counter = not at_counter
Listing A.6: rle_model.py
We need to compile again before we run the final ZPAQ configuration file:
$ ./zpaqlpy rle_model.py
$ ./zpaqd c rle_model.cfg archive.zpaq FILE FILE FILE
zpaqd needs to have simple_rle in the same folder because we specified pcomp_invocation = "./simple_rle".
B Visualizations
B.1 Arithmetic Coding
Figure B.1: Encoding and decoding steps without End-of-Message symbol
B.2 Context Mixing
Figure B.2: A possible text model
B.3 ZPAQ Format
Figure B.3: Structure of a block header
Figure B.4: Structure of segments
B.4 ZPAQL Virtual Machine
Figure B.5: Architecture with its building blocks
B.5 Compiler Pipeline
Figure B.6: The parts from source reading to writing a cong le
1 Introduction
It takes time until new and incompatible data compression algorithms become distributed in software. Also
different input data should often be handled with different compression techniques, utilizing best knowledge
about the data.
The ZPAQ standard format for compressed data is a container format which also holds the needed decompression algorithms. They can be specified through a context mixing model of several predictors with a bytecode
which computes the context data for them (used for arithmetic coding), a bytecode for postprocessing (used
for transformations or stand-alone algorithms) or any combination of both.
Arithmetic coding spreads symbols over a number range partitioned according to the probability distribution.
That way a likely to be encoded symbol can get a bigger part. The whole message is encoded at once from the
beginning. And every time a symbol was chosen the – possibly modified – probability distribution is applied
again to segment its part of the number range. Then for the next symbol this is the number range partition
to choose from. When the last symbol has been processed, the number range is very narrow compared to
the beginning and every number of it now represents the whole message. So the shortest in terms of binary
representation can be selected and a decoder can do the same steps again by choosing the symbol which has
this number in its range and applying then the partitioning to this range again according to the probability
distribution. In practice, one has either to use a special end-of-message symbol or specify the message length
before to define an end to this process.
The history which led to the development of ZPAQ shall shortly be explained in this chapter, followed by the
motivation of writing a compiler for ZPAQL and the research question of this work. In the following chapter
the whole picture of ZPAQ compression is visualized and the building blocks, i.e. context mixing and the
bytecode, are explained in detail. General theory of data compression and its limits are shortly noted on
afterwards. The main part is preceded by a short introduction to compiler construction, the implementation
language Rust and the source language Python. One chapter outlines the conditions and difficulties of
ZPAQL as a compiler target. The following chapters cover the chosen Python-subset (6), the developed API
and compiler internals (7), example programs (8) and finally the evaluation (9).
References to compressors and project websites or more detailed documentation and short remarks are
located in footnotes. All websites have been accessed until August 19th 2016. Academic publications are
listed in the bibliography.

2

1 Introduction

1.1 Preceding History
A big milestone being reached in data compression has been the Lempel-Ziv algorithm which is a form of
dictionary compression due to its consistence of backward references to data slices which have already been
seen. The area of statistical data compression was first stationary and Huffman optimal prefix codes can be
mentioned for entropy coding before arithmetic coding was known. Basically, few bits should be used for
frequent symbols but more for infrequent symbols. Entropy coding developed further with adaptive coding
from the mid-80s. One major innovation called PPM (Prediction by Partial Matching) usually works on
order-N byte contexts and predictions [1] whereas DMC (Dynamic Markov Compression) predicts bits [2].
Adaptive Huffman coding was developed as well. A decade later CTW (Context Tree Weighting) also utilizes
bit prediction but mixes the learned distributions of the order-N byte contexts [3].
The predecessors to the series of PAQ compressors, called P5, P6, P12, are based on a 2-layer neural network
with online training that predicts one bit for context inputs up to order-5. At that time they were on par
with variants of PPM concerning speed and compression ratio [4].
In PAQ1 the neural network is not used anymore as the probability and confidence of various models are
combined through weighted averaging. There is a non-stationary n-gram model up to order-8, a match model
for occurrences longer than 8 bytes in the last 4 MB, a cyclic model for patterns in rows and a whole word
model up to order-2. As statistical data the models mostly hold the bit counts of the contexts as one byte
in a hash table. At its release it could produce best results for a concatenation of the Calgary corpus (a data
compression benchmark) [5].
The models and their semi-stationary update were improved and PAQ2 introduced SSE (Secondary Symbol
Estimation) which improves a given probability by mapping it to a bit history. PAQ4 changed to adaptive
linear weighting and PAQ5 added another mixer. Many other versions have model improvements and added
special models or transforms. PAQAR, a fork of PAQ6, reworked the mixing architecture and could turn
models off. PAsQDa uses a transformation to map words to dictionary codes. PAQ6 variants were one after
the other leading in the Calgary challenge [6].
PAQ7 is a rewrite using neural network logistic mixing. Specifically it includes a JPEG model that predicts
the Huffman codes based on the DCT coefficients. PAQ8 variants come with more file models, dictionary
or x86 call address preprocessors or a DMC model. The PAQ8HP series won the Hutter Prize for Lossless Compression of Human Knowledge 1 about compressing a part of an English Wikipedia dump. Many
achievements were mainly made through special models for various file formats. The simpler LPAQ provides faster but less compression and later versions aim on text compression. PAQ9A has LZ (Lempel-Ziv)
precompression and cascaded ISSE (Indirect Secondary Symbol Estimation). The detailed history including
source code is tracked on the PAQ history website2 and in M. Mahoney’s book on data compression [7].

1

http://prize.hutter1.net/

2

The PAQ Data Compression Programs http://mattmahoney.net/dc/paq.html

1.2 Motivation

3

Parts of PAQ variants made it into the recent context mixing compressor cmix that leads the Large Text
Compression Benchmark 3 and the Silesia Open Source Compression Benchmark 4 but with around 30 GB
RAM usage5.
»ZPAQ is intended to replace PAQ and its variants (PAQ8, PAQ9A, LPAQ, LPQ1, etc) with similar or better
compression in a portable, standard format. Current versions of PAQ break archive compatibility with
each compression improvement. ZPAQ is intended to fix that.« 6
The development of the ZPAQ archiver started from early 2009 and defined the first version of The ZPAQ
Open Standard Format for Highly Compressed Data [8] after some months. The main idea is to move
the layout of the context mixing tree as well as the algorithm for context computation and the one for
postprocessing into the archive before each block of compressed data. There are nine components7 to
choose from for the tree, mostly context models coming from PAQ. In addition the algorithms are provided
as bytecode for a minimal virtual machine. Hence the algorithm implementation is mostly independent of the
decompressor implementation and compatibility is preserved when improvements are made. Also depending
on the input data an appropriate compression method can be chosen. The main program using libzpaq is
an incremental journaling backup utility which supports deduplication and encryption and provides various
compression levels using LZ77, BWT (Burrows-Wheeler Transform) and context models [9]. But there are
also reference decoders unzpaq and tiny_unzpaq, a simple pipeline application zpipe and a development
tool zpaqd8.
The fastqz compressor [10] for Sanger FASTQ format DNA strings and quality scores also uses the ZPAQ
format and was submitted to the Pistoia Alliance Sequence Squeeze Competition9.

1.2 Motivation
The development of ZPAQ continued specially for the use case of the incremental archiver program. But
the appearance of new algorithms for ZPAQ reached a rather low level, as well as the number of authors10
despite the fact that it offers a good environment for research about context mixing methods and crafting
of special solutions for use cases with a known type of data because one can build upon existing parts. The
leading reason could be that the assembly language ZPAQL with its few registers is not very accessible and
programs are hard to grasp or get easily unmanageable when complex tasks like parsing a header should be
accomplished. Therefore, the first question is whether another language can support ZPAQ’s popularity.

3 http://mattmahoney.net/dc/text.html
4 http://mattmahoney.net/dc/silesia.html
5 http://www.byronknoll.com/cmix.html
6 http://mattmahoney.net/dc/zpaq.html
7 See chapter 2.2 or http://mattmahoney.net/dc/dce.html#Section_437
8 http://mattmahoney.net/dc/zpaqutil.html
9 http://www.pistoiaalliance.org/projects/sequence-squeeze/
10 The majority of available configurations is listed on http://mattmahoney.net/dc/zpaqutil.html

If that is the case, then a compiler for a well-known programming language could help to overcome the
obstacle of learning ZPAQL for implementing new algorithms.
The second question is whether the design decisions of the ZPAQ specification allow arbitrary compression
algorithms to be used with ZPAQ, even if they bring their own encoder or a dictionary, or whether ZPAQ
is rather meant to be a platform for algorithms that use the predefined prediction components like (I)CM
and (I)SSE11 together with the built-in arithmetic coder.

1.3 Research Question
As an approach towards these questions mentioned above this work wants to contribute a compiler that
emits ZPAQL for a subset of the popular Python programming language. The subset should be oriented to
the data types of the ZPAQL virtual machine. The architecture and operation of the Python code should
still maintain similarity to the ZPAQL execution workflow to avoid abstractions. That means only integers,
no objects, floating-point numbers or strings, but the ability to address the two memory sections of the VM
as arrays. Running the Python code should be possible as usual which helps to keep debugging out of the
ZPAQL VM.
All this would ease the development of new algorithms and help to discover limitations of the current ZPAQ
standard. An example for a new algorithm should be provided by developing a model for uncompressed PNM
image data which should then be compared to other lossless image compressors.
As an example for a complex all-purpose algorithm the recent Brotli compression algorithm[11] will be
brought to ZPAQ. It was proposed for HTTP2 compression, includes a dictionary and combines LZ77, Huffman coding and order-2 context modeling. This is a challenge because it has to be solely implemented in
the postprocessing step and also needs memory management.
The quality of the compiler will be evaluated by comparing against an existing hand-written implementation
of an LZ compressor. Design decisions of the compiler and difficulties faced will be discussed.

11 (Indirect) context model and (indirect) secondary symbol estimation, four of the nine ZPAQ components.

2 The ZPAQ Standard Format for Compressed Data
This chapter gives an introduction to the ZPAQ archive specification in the current version 2.06. A small use
case is explained as well as the needed commands. Then the context mixing components and the language
ZPAQL are presented. Finally, two real-world examples are explained.

2.1 ZPAQ Specification and Working Principle
The Level 1 specification of this container format required the data always to be encoded with the arithmetic
coder and at least one predictor. A backwards incompatible change leading to Level 2 was to allow raw data
to be stored and thus no prediction component to be used, as needed for incompressible data or standalone
postprocessor algorithms e.g. a fast LZ variant [8]. It also defines a standard encryption for the whole archive
file and introduces an append-only journaling mode for incremental backups and deduplication on top of the
streaming format which is still the basis of the archive. These features are out of scope for this work and
therefore the focus is on the compression architecture and the virtual machine.
It is only specified what a valid archive is and how decompression takes place (nevertheless, probability
and thus context computation have to be symmetric for compression and decompression with an arithmetic
coder). In case of ambiguity in the specification there is the reference implementation unzpaq1, a tiny
version2 of it and the more mature library libzpaq that is used by the zpaq archiver3, the zpaqd development
tool4 or as a plug-in for file archivers, and it incorporates an x86/64 JIT compiler for ZPAQL.
»The ZPAQ open standard specifies a compressed representation for one or more byte (8 bit value) sequences. A ZPAQ stream consists of a sequence of blocks that can be decompressed independently. A
block consists of a sequence of segments that must be decompressed sequentially from the beginning of
the block. Each segment might represent an array of bytes in memory, a file, or a contiguous portion of a
file.« [8]
A block header has no information on the length of the block because, as for segments, an end marker is
used. Memory requirements for the two bytecodes hcomp and pcomp are defined in the header.

1 http://mattmahoney.net/dc/unzpaq206.cpp
2 http://mattmahoney.net/dc/tiny_unzpaq.cpp
3 http://mattmahoney.net/dc/zpaq715.zip
4 http://mattmahoney.net/dc/zpaqd715.zip

It is noted whether arithmetic coding is used and what predicting components make up the context mixing tree. Each
component has arguments also determining memory use. If needed the bytecode hcomp is embedded to
compute context data for the components of the context mixing tree for each byte. All components give
bit predictions for the partially decoded byte (these are passed upwards the tree) and are trained afterwards
with the correct bit which was decoded based on the root (i.e. the last) node probability for each bit.
The optionally arithmetic coded data which comes from all segment content (not the segment filename or
comment) in the block can start with an embedded pcomp bytecode or declare that no pcomp bytecode
is present. Therefore, the hcomp section can already be used for context computation to compress the
pcomp bytecode (0 for empty or 1 followed by the length and the bytecode). The pcomp code is used for
postprocessing, may it be a simple transform or the decompression of LZ codes. It gets each decoded byte
as input and outputs a number of bytes not necessarily equal to the input.
That means there are four combinations for a block (seen from the decompressor perspective): no compression,
only context mixing with arithmetic coding, only postprocessing of the stored data, or context mixing with
subsequent postprocessing. The chosen combination applies to all (file) segments in the block.
The following charts illustrate the named parts and their relation to each other for a sample compression use
case. The transform for x86 machine code enhances compressibility by converting relative to static addresses
after each CALL and JMP instruction (0xE8 and 0xE9). It is applied on the two input files, a x86 executable
binary and shared library. Therefore a ZPAQL pcomp program needs to be supplied in the archive block
to revert that transform. Encoding takes place based on the probability distribution of 1 and 0 for each bit
of the current byte as they are provided as prediction by the root node of the simple context mixing tree.
The hcomp program is loaded into the ZPAQL VM and computes contexts for the two components. The
ISSE maps the context to a bit history which is used as context for a learning mixer that should improve the
probability provided by the first component, a CM (Context Model) which should learn good predictions for
the given context. The whole model and the hcomp bytecode are also embedded into the archive block.
The two files are stored as two segments in the block (like a ”solid” archive). Because the preprocessor might
be any external program or be included in the compressing archiver, and is of no use for decompression,
it is not mentioned in the archive at all.
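The idea of such an address transform can be sketched in a few lines of Python (an illustration only; real implementations differ in the exact address base and bounds handling):

def e8e9_forward(data: bytearray) -> bytearray:
    # rewrite relative CALL/JMP targets (after opcode 0xE8/0xE9) to absolute addresses,
    # so that repeated references to the same target produce identical byte patterns
    i = 0
    while i + 5 <= len(data):
        if data[i] in (0xE8, 0xE9):
            rel = int.from_bytes(data[i + 1:i + 5], "little")
            absolute = (rel + i + 5) & 0xFFFFFFFF
            data[i + 1:i + 5] = absolute.to_bytes(4, "little")
            i += 5
        else:
            i += 1
    return data

The pcomp program stored in the archive has to perform the inverse of this rewrite to restore the original bytes.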

Figure 2.1: Possible compression scheme

Decompression takes place in reverse manner and hcomp is loaded into the ZPAQL VM to compute the
context data for the components of the model. They supply the predictions to the arithmetic coder and are
corrected afterwards. For the reverse transform of each segment pcomp is read from the decoded stream and
loaded in another VM. Then the two segments follow in the decoding step and go through the postprocessing
transform before they are written out as files.

Figure 2.2: Accompanying decompression

The tool zpaqd only supports the streaming format and can be used to construct this example setup by writing
a configuration file for it and then adding two files as segments into a block. But besides the algorithms that are
already defined for compression in libzpaq for the levels 1 to 5 (LZ77, BWT and context mixing) it also offers
the ability to specify a customized model (E8E9, LZ77 transformations or also word models are supported)
given as argument, so that the above configuration can also be brought to life with something like zpaq
a [archive] [files] -method s8.4c22.0.255.255i3 (denotation documented in libzpaq.h and the

zpaq man page5). Here the first 8 accounts for 2^8 = 256 MB blocks, so that both segments should fit into
one block (yet the zpaq application uses the API in a way that creates an additional block), then an order-2
CM and an order-3 ISSE are chained. The resulting configuration including the two ZPAQL programs stored
in the archive can be listed with zpaqd l [archive].
For a more general view cf. the compression workflow of the zpaq end-user archiver as described in the
article mentioned [9]. It selects one of its predefined algorithms based on their performance for the data and
uses deduplication through the journaling format.

2.2 Context Mixing: Components and Context Data
The way the mixing takes place has evolved from the neural network approach in P5 over weighted averaging
from PAQ1 and adaptive linear weighting to logistic mixing in PAQ7 (like in a neural network, but instead of
back propagation the weights are updated). In ZPAQ the probabilities are also handled in the logistic domain
stretch(p) = ln(p/(1 − p)) as output of the predicting components and can be averaged or refined by other
components before the output of the last component is transformed back with squash(x) = stretch⁻¹(x) = 1/(1 + e^(−x))
to be used in the arithmetic coder to code one bit [7].
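The following Python sketch illustrates these formulas and the weight update of a logistic mixer (an illustrative model only; libzpaq works with fixed-point arithmetic and its exact update rule differs):

import math

def stretch(p):
    return math.log(p / (1.0 - p))

def squash(x):
    return 1.0 / (1.0 + math.exp(-x))

def mix(predictions, weights):
    # combine the stretched component predictions with adaptive weights
    return squash(sum(w * stretch(p) for w, p in zip(weights, predictions)))

def train(predictions, weights, bit, rate=0.002):
    # adjust the weights towards the bit (0 or 1) that was actually coded
    error = bit - mix(predictions, weights)
    for i, p in enumerate(predictions):
        weights[i] += rate * stretch(p) * error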
ZPAQ defines nine components. Up to 255 of them can be used in a model but only linear trees can be
constructed as mixing network i.e. when they are listed, they can only refer to predictions of preceding
components. The last one listed is the top node [8]. They get their context data input through computation
in hcomp. To supply the component with a higher order context, a rotating history buffer needs to be
maintained by hcomp. Now the components are listed without their arguments and internals which can be
found in the specification [8].
CONST is the most simple and gives a fixed prediction. It does not receive context data.
CM is a direct context model that holds counts of zeros/ones and a prediction for each of the eight bits
expected in the specified context. The counts are used to update the prediction to reduce the error. The
given context data is often a hash (and collisions can hurt compression). Also the partially decoded byte is
expanded to nine bits and XORed with the context after coding each bit, so that the next bit’s prediction is
accessed in the table.
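As an illustration, such an update could look roughly like the following Python sketch (a simplification assumed for explanation; the actual CM works with fixed-point values and a count limit given as a component argument):

def cm_predict_and_update(table, context, bit):
    # table maps a context to [prediction between 0 and 1, count]
    p, n = table.setdefault(context, [0.5, 0])
    table[context][0] = p + (bit - p) / (n + 1.5)  # move the prediction towards the coded bit
    table[context][1] = min(n + 1, 255)            # bounded count: large early steps, small later ones
    return p                                       # prediction that was used to code this bit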
ICM is an indirect context model consisting of a hash table to map the given context and the partially
decoded byte to a bit history which is then mapped to a prediction like in a direct context model. Histories
are counts of 1 and 0 and represented by states (explained in [7], chapter 4.1.3. Indirect Models) and
adjusted after the prediction to a better fitting state. The prediction is also updated.

5 http://mattmahoney.net/dc/zpaqdoc.html

MATCH is a model that takes the following byte of a found string match as prediction until a mismatch
happens. Therefore it keeps its own history buffer and the prediction is also varied depending on the length
of the match. The match is found based on the (higher order) context hash input. This works because the
search does not take place in the history buffer, but a table maps each context hash input to the last element
in the history buffer.
AVG outputs a non-adaptive weighted average of the predictions of two other components. It does not
receive context data.
MIX maps a context and the masked partially decoded byte to weights for producing an averaged prediction
of some other components. Afterwards the weights are updated to reduce the prediction error.
MIX2 is a simpler MIX and can only take two predictions as input.
SSE stands for secondary symbol estimation (Dmitry Shkarin, also known as Adaptive Probability Map in
PAQ7 or LPAQ) because it receives a context as input and takes the prediction of another component which
is then quantized. For the given context this quantized prediction and the next closest quantization value
are mapped to predictions which result in an interpolation of both. Initially this map is an identity, but
the update step corrects the prediction of the closer-to-original quantized value the same way as in the CM
update phase.
»A typical place for SSE is to adjust the output of a mixer using a low order (0 or 1) context. SSE components
may be chained in series with contexts typically in increasing order. Or they may be in parallel with
independent contexts, and the results mixed or averaged together.« [7]
ISSE is an indirect secondary symbol estimator that, as an SSE, refines the prediction of another component
based on a context which is mapped to a bit history like in an ICM. That bit history is set as context for an
adaptive MIX to select the weights to combine the original prediction with a fixed prediction.
»Generally, the best compression is obtained when each ISSE context contains the lower order context of
its input.« [7]

2.3 ZPAQL: Virtual Machine and Instruction Set
At the beginning of each block the two bytecodes hcomp and pcomp, if needed, are loaded to a virtual
machine for each. It consists of the program counter 𝑃𝐶, the 32-bit registers 𝐴, 𝐵, 𝐶 and 𝐷, a 1-bit condition
flag 𝐹, 256 32-bit registers 𝑅0 …𝑅255 and the arrays 𝑀 (8-bit elements) and 𝐻 (32-bit elements). Initially, all
registers and arrays hold 0. The size of 𝐻 and 𝑀 is defined in the block header along with the information
which context-mixing components are used.
For each encoded or decoded byte hcomp is run to set the contexts for the components. The context
data for a component 𝑖 < 256 has to be stored in 𝐻[𝑖]. As first input hcomp sees whether a postprocessor

is present and if yes, then its length and bytecode. Afterwards the data of the segments follows, without
any separation. Except for the 𝑃𝐶, which is set to 0, and the 𝐴 register, which is used for the input, all state
is preserved between the calls.
The postprocessor pcomp is run for each decoded byte of the block to revert the preprocessing and puts out
the data via an instruction. After each segment it is invoked with 2^32 − 1 as input to mark the end.
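A simplified model of this calling convention (an assumption for illustration, not the reference implementation) can be written down as follows:

class HcompState:
    # state of the ZPAQL VM that survives between two hcomp invocations
    def __init__(self, hh, hm):
        self.H = [0] * (1 << hh)   # 32-bit array, H[0..n-1] are the component contexts
        self.M = [0] * (1 << hm)   # 8-bit array
        self.R = [0] * 256
        self.A = self.B = self.C = self.D = 0
        self.F = 0

def run_hcomp_per_byte(state, data, hcomp_step):
    # hcomp_step stands for the interpreted bytecode: it is entered at the first
    # instruction for every byte and only receives the new input in A
    for byte in data:
        state.A = byte
        hcomp_step(state)          # runs until HALT, contexts end up in state.H

def order1_example(state):
    # stand-in bytecode: order-1 context for component 0
    state.H[0] = state.A << 9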
There is an assembly language for the bytecode which is also used in the table describing the corresponding
opcodes in the specification [8]. In this assembly language whitespace can only occur between opcode bytes
in order to visualize that a=b is a 1-byte opcode while a= 123 is a 2-byte opcode. Comments are written in
brackets.
Operations on the 32-bit registers and elements of 𝐻 are 𝑚𝑜𝑑 2^32 and interpreted as positive numbers in
comparisons. Index access into 𝑀 and 𝐻 is 𝑚𝑜𝑑 𝑚𝑠𝑖𝑧𝑒 or 𝑚𝑜𝑑 ℎ𝑠𝑖𝑧𝑒 and denoted as *B for 𝑀[𝐵], *C for
𝑀[𝐶] and *D for 𝐻[𝐷]. Because 𝑀 holds bytes, operations on *B and *C are 𝑚𝑜𝑑 256 and swapping via
*B<>A or *C<>A alters only the lower byte of 𝐴 [8].

Instructions                    Semantics and Constraints

error                           cause execution to fail
X++                             increment X by 1 (X is one of A, B, C, D, *B, *C, *D)
X--                             decrement X by 1
X!                              flip all bits
X=0                             set X to 0
X<>A                            swap (X is not A)
X=X                             set X to X
X= N                            set X to 0 <= 𝑁 <= 255
A+=X                            add X on A
A-=X                            subtract X from A
A*=X                            multiply A by X
A/=X                            divide A by X (set 𝐴 = 0 if 𝑋 = 0)
A%=X                            𝐴 = 𝐴 𝑚𝑜𝑑 𝑋 (set 𝐴 = 0 if 𝑋 = 0)
A&=X                            binary AND with X
A&~=X                           binary AND with flipped bits of X
A|=X                            binary OR with X
A^=X                            binary XOR with X
A<<=X                           bitwise left shift of A by 𝑋 𝑚𝑜𝑑 32 bits
A>>=X                           bitwise right shift of A by 𝑋 𝑚𝑜𝑑 32 bits
A+= N, A-= N, A*= N, A/= N,     same as previous instructions but with 0 <= 𝑁 <= 255
A%= N, A&= N, A&~= N, A|= N,
A^= N, A<<= N, A>>= N
A==X                            𝐹 = 1 if 𝐴 = 𝑋 otherwise 𝐹 = 0
A<X                             𝐹 = 1 if 𝐴 < 𝑋 otherwise 𝐹 = 0
A>X                             𝐹 = 1 if 𝐴 > 𝑋 otherwise 𝐹 = 0
A== N, A< N, A> N               same as above but with 0 <= 𝑁 <= 255
A=R N, B=R N, C=R N, D=R N      set A, B, C or D to 𝑅𝑁
R=A N                           set 𝑅𝑁 to A
HALT                            ends current execution of the program
OUT                             output 𝐴 𝑚𝑜𝑑 256 in pcomp, ignored in hcomp
HASH                            𝐴 = (𝐴 + 𝑀[𝐵] + 512) ⋅ 773
HASHD                           𝐻[𝐷] = (𝐻[𝐷] + 𝐴 + 512) ⋅ 773
JMP I                           add −128 <= 𝐼 <= 127 to PC relative to the following instruction, so 𝐼 = 0 has
                                no effect and 𝐼 = −1 is an endless loop (in the specification a positive N is used,
                                so PC_following += ((𝑁 + 128) 𝑚𝑜𝑑 256) − 128)
JT N                            jump if 𝐹 = 1
JF N                            jump if 𝐹 = 0
LJ L                            PC_following = 𝐿 with 0 <= 𝐿 < 2^16 (in the specification written as LJ N M because
                                it is a 3-byte instruction with PC_following = 𝑁 + 256 ⋅ 𝑀)
Beside the opcodes libzpaq also supports helper macros like if(not)? … (else)? … endif, the
long jump versions if(not)?l … (elsel)? … endifl and do … (while|until|forever) which will
be converted to conditional jump opcodes. An if-block is executed if 𝐹 = 1, a while-jump is taken if
𝐹 = 1 and an until-jump is taken if 𝐹 = 0. That means the test for the condition has to be written before
the macro: a> 255 if … endif. The statements can also be interwoven, e.g. a do-while-loop or a
continue-jump can be written as do … if … forever endif.
In a ZPAQ config file the two sections for hcomp and pcomp are written behind the context mixing model
configuration. The pcomp section is optional. Comments can appear everywhere within brackets.

Syntax: (where 𝐼 < 𝑁 and 2^HH, 2^HM, 2^PH and 2^PM define the size of 𝐻 and 𝑀 for each section)
comp HH HM PH PM N
(I COMPONENT (ARG)+ )*
hcomp
(ZPAQL_INSTR)*
halt
(pcomp (PREPROCESSOR_COMMAND)? ;
(ZPAQL_INSTR)*
halt
)?
end

2.4 Examples: A simple Context Model and LZ1 with a Context Model
The following example configuration is based on fast.cfg from the utility site6 and can be used for text
compression. It adaptively combines (independently of contexts, just based on the success of the last prediction)
the prediction of a direct order-1 context model with the prediction of an order-4 ISSE which refines
the prediction of an order-2 ICM. The arguments for the components are documented in the specification [8].

comp 2 2 0 0 4 (hh hm ph pm n)
  (where H gets the size of 2^hh in hcomp or 2^ph in pcomp,
   M 2^hm or 2^pm and n is the number of context-mixing components)
  0 cm 19 4   (will get an order 1 context)
  1 icm 16    (order 2, chained to isse)
  2 isse 19 1 (order 4, has reference to ICM component 1)
  3 mix2 0 0 2 24 0 (moderate adapting mixer between CM and ISSE
                     based on which predicts better, no contexts even for bits)
  (ICM and ISSE part adapted from fast.cfg)
hcomp
  r=a 2 (R2 = A, input byte in R2)
  d=0
  a<<= 9 *d=a (H[D] = A) (set context to actual byte)
  (leaving first 9 bits free for the partially decoded byte)
  a=r 2 (A = R2)
  *b=a (M[B] = A) (save input byte in rotating buffer)
  (full M is used with pointer b)
  a=0 hash (shortcut for A = (A + M[B] + 512) * 773)
  b-- hash
  d= 1 *d=a (order 2 hash for H[1])
  b-- hash b-- hash
  d= 2 *d=a (order 4 hash for H[2])
  (H[3] stays 0 as fixed context for MIX2)
  halt (execution stops here for this input byte)
end

Listing 2.1: Example mfast.cfg without a pcomp section

6 http://mattmahoney.net/dc/zpaqutil.html, config files with a ”post” instead of ”pcomp” are in the old format of the Level 1 specification

To demonstrate the compression phases and the parts involved in detail, the LZ1 configuration from the utility
site is chosen, but the BWT.1 examples are also worth a look.
The LZ1 configuration relies on a preprocessor lzpre.cpp which turns the input data into a compressed
LZ77-variant representation of codes for match copies and literal strings. This is further compressed through
arithmetic coding with probabilities provided by an ICM (indirect context model).
The contexts are always hashed as ℎ𝑎𝑠ℎ(ℎ𝑎𝑠ℎ(𝑎) + 𝑏) from two values 𝑎 and 𝑏 as follows. For the first byte
of an offset number of a match the length of the match and the current state (2…4, i.e. 1…3 bytes to follow
as offset number) are used as context. For the remaining bytes of an offset number (or a new code if no
bytes are remaining) the previous context and the current state (previous state - 1, i.e. 0…2 bytes to follow)
are used as context. For the first literal of a literal string the number of literals and the state 5 are used as
context. For the following literals the current literal and the state 5 are used as context. For a new code
after a literal string instead of a hash of the first value just 0 and the current state (1) are used as context.
The bytecode of pcomp is not specially handled.
To revert the LZ1 compression pcomp parses the literal and match codes and maintains a 16 MB = 2^24
byte buffer in 𝑀.
(lz1.cfg
(C) 2011 Dell Inc. Written by Matt Mahoney
Licensed under GPL v3, http://www.gnu.org/copyleft/gpl.html)
comp 0 0 0 24 1
  0 icm 12 (sometimes ”0 cm 20 48” will compress better)
hcomp
  (c=state: 0=init, 1=expect LZ77 literal or match code,
   2..4=expect n-1 offset bytes,
   5..68=expect n-4 literals)
  b=a (save input)
  a=c a== 1 if (expect code ccxxxxxx as input)
    (cc is number of offset bytes following)
    (00xxxxxx means x+1 literal bytes follow)
    a=b a>>= 6 a&= 3 a> 0 if
      a++ c=a (high 2 bits is code length)
      *d=0 a=b a>>= 3 hashd
    else
      a=b a&= 63 a+= 5 c=a (literal length)
      *d=0 a=b hashd
    endif
  else
    a== 5 if (end of literal)
      c= 1 *d=0
    else
      a== 0 if (init)
        c= 124 *d=0 (5+length of postprocessor)
      else (literal or offset)
        c-- (model literals in order 1 context, offset order 0)
        a> 5 if *d=0 a=b hashd endif
      endif
    endif
  endif
  (model parse state as context)
  a=c a> 5 if a= 5 endif hashd
  halt
pcomp ./lzpre c ; (code below is equivalent to ”lzpre d”)
  a> 255 if (end of segment)
    b=0 d=0 (reset, is last command before halt)
  else
    (LZ77 decoder: b=i, c=c d=state r1=len r2=off
     state = d = 0 = expect literal or match code
               1 = decoding a literal with len bytes left
               2 = expecting last offset byte of a match
             3,4 = expecting 2,3 match offset bytes
     i = b = position in 16M output buffer
     c = c = input byte
     len = r1 = length of match or literal
     off = r2 = offset of match back from i

     Input format:
       00llllll: literal of length lllllll=1..64 to follow
       01lllooo oooooooo: length lll=5..12, offset o=1..2048
       10llllll oooooooo oooooooo: l=1..64 offset=1..65536
       11llllll oooooooo oooooooo oooooooo: 1..64, 1..2^24)
    c=a a=d a== 0 if
      a=c a>>= 6 a++ d=a
      a== 1 if (state?)
        a+=c r=a 1 a=0 r=a 2 (literal len=c+1 off=0)
      else
        a== 2 if a=c a&= 7 r=a 2 (short match: off=c&7)
          a=c a>>= 3 a-= 3 r=a 1 (len=(c>>3)-3)
        else (3 or 4 byte match)
          a=c a&= 63 a++ r=a 1 a=0 r=a 2 (off=0, len=(c&63)-1)
        endif
      endif
    else
      a== 1 if (writing literal)
        a=c *b=a b++ out
        a=r 1 a-- a== 0 if d=0 endif r=a 1 (if (--len==0) state=0)
      else
        a> 2 if (reading offset)
          a=r 2 a<<= 8 a|=c r=a 2 d-- (off=off<<8|c, --state)
        else (state==2, write match)
          a=r 2 a<<= 8 a|=c c=a a=b a-=c a-- c=a (c=i-off-1)
          d=r 1 (d=len)
          do (copy and output d=len bytes)
            a=*c *b=a out c++ b++
          d-- a=d a> 0 while
          (d=state=0. off, len don’t matter)
        endif
      endif
    endif
  endif
  halt
end

Listing 2.2: lz1.cfg

For comparison, a Python port for use with the zpaqlpy compiler can be found in test/lz1.py. It differs
in using the current opcode byte as context for the next one while processing the pcomp bytecode.

3 Notes on Data Compression
Not all data can be compressed. Because if we build a – necessarily bijective1 – compressing scheme

𝑓 : ⋃_{0<𝑖≤𝑛∈ℕ} 𝛴^𝑖 → ⋃_{0<𝑖≤𝑛∈ℕ} 𝛴^𝑖

for strings 𝑚 ∈ 𝛴*, |𝑚| ≤ 𝑛 to strings 𝑧 ∈ 𝛴*, |𝑧| ≤ 𝑛, starting from an identity, every time we remap an
input 𝑚1, where 𝑓(𝑚1) = 𝑧1, to a shorter compressed representation 𝑧2, where 𝑓⁻¹(𝑧2) = 𝑚2 and
|𝑧2| < |𝑧1|, so that now 𝑓′(𝑚1) = 𝑧2 and |𝑓′(𝑚1)| < |𝑓(𝑚1)|, we have to map 𝑚2 to 𝑧1 so that 𝑓′(𝑚2) = 𝑧1
and indeed make this swap in order to maintain the bijection, ending up with |𝑓′(𝑚2)| > |𝑓(𝑚2)| because
|𝑧1| > |𝑧2|. So while 𝑓′ compresses 𝑚1 it expands 𝑚2, and this holds for each iteration when we change our
compression scheme.
Luckily, most data which is relevant to us and interesting for compression has patterns, and other data where
we do not understand the patterns appears to be random and is out of scope for compression algorithms. If
we do not have any knowledge about the data except that its symbols are equally distributed with probability
𝑝𝑖 = 1/|𝛴| for each symbol 𝑖, the best we can do is to use an optimal code reaching the Shannon entropy as coding
length per symbol:

𝐻 = − ∑𝑖 𝑝𝑖 log2 𝑝𝑖

For |𝛴| = 256, 𝐻 would be 8 bit as usual and we can simply store the data instead of encoding it again.
In general, given that the distribution is known, we can choose e.g. a non-adaptive arithmetic encoder to
almost reach the limit of 𝐻 bits per symbol. But adaptive arithmetic coding with PPM and others could
even give a better average because the distribution is adjusted by exploiting patterns. Therefore, to craft a
well performing algorithm for the expected input data, knowledge about the patterns is needed in order to
give good predictions and go beyond the Shannon entropy as average size.
To define the lower limit that can be reached, the concept of algorithmic information or Kolmogorov complexity
of a string is used. Basically it is the length of the shortest program in a fixed language that produces
this string. The choice of language only influences a constant term because an interpreter could be written.
When comparing different compressors in a benchmark it is common to include the decompressor size in
the measurement because it could also hold data or generate it via computation.

1 We want to allow for every string to be compressed, resulting in the same number – because they have to differ – of compressed strings which also should be decompressible. Another common approach to prove that there is no universal lossless compression is a counting argument which uses the pigeonhole principle.

Using this principle with ZPAQ and its pcomp section, the first million digits of 𝜋 in form of the ~1 MB text
file pi.txt from the Canterbury Miscellaneous Corpus2 can be compressed to a 114 byte ZPAQ archive
which consists of no stored data and a postprocessing step which computes 𝜋 to the given precision and
outputs it as text3. This extreme case of the earlier mentioned knowledge about the data can serve as a
bridge between Kolmogorov complexity and adaptive arithmetic coding. For faster execution of the ZPAQ
model we only take the first ten thousand digits of 𝜋. Normally the limit most compressors would stay
above (because the digits are equally distributed) is

𝐻/8 ⋅ 10000 = (−10 ⋅ (0.1 ⋅ log2(0.1)))/8 ⋅ 10000 ≈ 4153 byte

instead of 10 KB.
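This bound can be verified with a small calculation in Python (illustrative only):

import math

probabilities = [0.1] * 10                           # ten equally likely digits
H = -sum(p * math.log2(p) for p in probabilities)    # about 3.32 bits per digit
print(math.ceil(H / 8 * 10000))                      # 4153 bytes for ten thousand digits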
With a ZPAQ context model we can, instead of generating the digits in the pcomp phase, also use the next
expected digit as context so that the predictor will quickly learn that e.g. character 3 comes in context
3. But the prediction cannot be 100 % for one symbol as other symbols could occur and a probability greater
than zero has to be assigned to them. Also whether the end of the message is reached is encoded as a
special symbol. So the range of the arithmetic encoder still gets narrowed when a perfectly predicted digit
is encoded, but on such a small level that only 121 bytes are needed for the ZPAQ archive consisting
of the CM model configuration, the hcomp bytecode and the arithmetically coded ten thousand digits of 𝜋.
That shows that we can go far beyond the entropy limit down towards Kolmogorov complexity by using
context modeling and adaptive arithmetic coding. And still the context model is usable for all input data, in
contrast to computing 𝜋 in pcomp. The overhead caused by the fact that 100 % cannot be used when predicting
seems to be linear in the message size and can be observed when compressing 50 or 200 MB of zeros with a
CM, resulting in around 0.0000022 bytes per input byte.

(pi10k.cfg 2016 Kai Lüke and Matt Mahoney
instead of generating the digits in pcomp phase, use the next expected digit as context
To compress: zpaqd cinst pi10k.cfg pi10k.zpaq pi10000.txt)
comp 0 14 0 0 1 (2^14 > 10000)
  0 cm 13 0
hcomp
  ifnot (only first run)
    (Compute pi to 10000 digits in M using the formula:
     pi=4; for (d=r1*20/3;d>0;--d) pi=pi*d/(2*d+1)+2;
     where r1 is the number of base 100 digits.
     The precision is 1 bit per iteration so 20/3
     is slightly more than the log2(100) we need.)
    a= 100 a*=a r=a 1 (r1 = digits base 100)
    a*= 7 d=a (d = n iterations)
    *b= 4 (M=4)
    do
      (multiply M *= d, carry in c)
      b=r 1 c=0
      do
        b--
        a=*b a*=d a+=c c=a a%= 10 *b=a
        a=c a/= 10 c=a
      a=b a> 0 while
      (divide M /= (2d+1), remainder in c)
      a=d a+=d a++ d=a
      do
        a=c a*= 10 a+=*b c=a a/=d *b=a
        a=c a%=d c=a
      a=r 1 b++ a>b while
      a=d a>>= 1 d=a
      (add 2)
      b=0 a= 2 a+=*b *b=a
    d-- a=d a== 0 until
    halt
  endif
  a=*b a<<= 9 *d=a b++ (set context for expected digit taken from M)
  halt
end

Listing 3.1: pi10k.cfg

2 http://corpus.canterbury.ac.nz/descriptions/#misc
3 http://mattmahoney.net/dc/pi.cfg

Even if this configuration is usable for other input than 𝜋, it does not give good compression. It can be
merged with the general text model from Listing 2.1 to switch between using the CM for order-1 contexts
and for the expected-digit contexts every time the start of 𝜋 is detected, until a mismatch is found. This
way all occurrences of 𝜋 are coded with only a few bits.

(mixed_pi2.cfg
use the next expected digit as context for CM or a general text model of fast.cfg
a MIX2 will select between them
To compress: zpaqd c mixed_pi2.cfg text_pi.zpaq text_with_appearance_of_pi.txt
)
comp 2 14 0 0 4 (2^14 > 10000)
  0 cm 18 0   (order 1 or digits of pi)
  1 icm 16    (order 2, chained to isse)
  2 isse 19 1 (order 4)
  3 mix2 0 0 2 24 0 (moderate adapting mixer between CM and ISSE based on which predicts better)
hcomp
  r=a 2
  a=r 0
  a== 0 if (only first run)
    (Compute pi to 10000 digits using the formula:
     pi=4; for (d=r1*20/3;d>0;--d) pi=pi*d/(2*d+1)+2;
     where r1 is the number of base 100 digits.
     The precision is 1 bit per iteration so 20/3
     is slightly more than the log2(100) we need.)
    a= 100 a*=a r=a 1 (r1 = digits base 100)
    a*= 7 d=a (d = n iterations)
    *b= 4 (M=4)
    do
      (multiply M *= d, carry in c)
      b=r 1 c=0
      do
        b--
        a=*b a*=d a+=c c=a a%= 10 *b=a
        a=c a/= 10 c=a
      a=b a> 0 while
      (divide M /= (2d+1), remainder in c)
      a=d a+=d a++ d=a
      do
        a=c a*= 10 a+=*b c=a a/=d *b=a
        a=c a%=d c=a
      a=r 1 b++ a>b while
      a=d a>>= 1 d=a
      (add 2)
      b=0 a= 2 a+=*b *b=a
    d-- a=d a== 0 until
    c= 2 (point to 4 of 3.14)
    a= 1
    r=a 0
    a<<= 14 a-- (last element of ring buffer)
    b=a
    a-= 4 (first element of ring buffer, pointer in r3)
    r=a 3
    halt (input 0 came from pcomp, also to restart c=2 is enough)
  endif
  (CM part)
  d=0
  a=r 2
  a-= 48
  c--
  a==*c
  c++
  if (pi: set context for expected digit)
    a=*c c++ a<<= 1 a++ a<<= 9 *d=a (distinguish between pi number context and character context by 1 bit for sure)
  else (other:)
    a=r 2 a<<= 10 *d=a c= 2 (set context to actual byte)
  endif
  (a in r2, lower border of ring buffer in r3)
  (ICM and ISSE part adapted from fast.cfg)
  a=r 2
  *b=a a=0 (save in rotating buffer M)
  hash b-- d=a (save hash) a=r 3 a>b if b++ b++ b++ b++ endif a=d
  hash d= 1 *d=a
  b-- d=a (save hash) a=r 3 a>b if b++ b++ b++ b++ endif a=d
  hash b-- d=a (save hash) a=r 3 a>b if b++ b++ b++ b++ endif a=d
  hash d= 2 *d=a
  halt
end

Listing 3.2: mixedpi2.cfg

For real-life use cases it is often not possible to give perfect predictions. Good contexts can help to bring
order into the statistics about previous data. Beside the manual approach, heuristic context models can be
generated for data by calculating the data’s autocorrelation function [12] or, as done in PAQ, by recognizing
two-dimensional strides for table or graphical data.
Even compressed JPEG photos can be further compressed by 10% - 30% by predicting the Huffman-coded
DCT coefficients when using the decoded values as contexts (done in PAQ7-8, Stuffit, PackJPG, WinZIP [7]).
The ZPAQ configuration jpg_test2.cfg uses a preprocessor to expand Huffman codes to DCT coefficients
and later uses them as contexts4. The PackJPG approach continues to be developed by Dropbox under the
name lepton5 and supports progressive as well as baseline JPEGs.
Overall, modeling and prediction are an AI problem because e.g. for a given sentence start a likely following
word has to be provided, or it has to be predicted how a picture with a missing area continues. Remarkable results
have been accomplished by using PAQ8 as a machine learning tool, e.g. for building a game AI with it serving
as a classifier, for interactive input text prediction, text classification, shape recognition and lossy image
compression [13].

4 http://mattmahoney.net/dc/zpaqutil.html
5 https://github.com/dropbox/lepton

4 Compilers, Rust and Python
Compilers and the parts covering their subproblems are mentioned here. Then the programming language
Rust, which was used for the compiler development, is briefly presented, as well as Python as the source
language.

4.1 Classical Compiler Architecture
The task to translate a program from a source language into an equivalent program in the target language is
usually split into independent phases where each phase passes its result to the next phase (see [14], Figure
1.6). In most cases the target language is very low level, like machine instructions of a CPU. One variation
of a compiler structure including an intermediate representation language will be explained.
The tokenizer or lexer cuts the input stream into tokens which represent words/units of the source language
which are normally the allowed terminal symbols of the grammar. The parser reads the tokens in sequence
and fits them into nonterminals of the source language grammar by following the production rules until
all tokens are consumed and the start production is completed. On the way it generates an AST (Abstract
Syntax Tree) as semantic representation of the program in the source language. This completes the analysis
done in the frontend.
Synthesis takes place in the backend. The AST is traversed to generate IR (intermediate representation)
instructions. This step usually keeps track of variable names and their assignments in a symbol table. An
optimization pass can be applied to the IR. The code generator produces instructions in the target language
and they can be passed through optimization steps as well.
Checking for errors has to be done in every analysis step to find lexical errors in the tokenizer, syntactical
errors while parsing, type errors and others during semantic analysis.
If the target language is not the end product but only an assembly language, then an assembler produces
object files which will be combined by a linker into a final executable program.
There are also preprocessors for the source language, multi-pass compilers and JIT (just-in-time) compilers
which are used in interpreters and translate only an often-used section of the source program during runtime
for a performance gain.

4.2 The Rust Programming Language
»Rust is a systems programming language focused on three goals: safety, speed, and concurrency. It
maintains these goals without having a garbage collector, making it a useful language for a number of use
cases other languages aren’t good at: embedding in other languages, programs with specific space and
time requirements, and writing low-level code, like device drivers and operating systems. It improves on
current languages targeting this space by having a number of compile-time safety checks that produce no
runtime overhead, while eliminating all data races. Rust also aims to achieve ‘zero-cost abstractions’ even
though some of these abstractions feel like those of a high-level language. Even then, Rust still allows
precise control like a low-level language would.« 1
The first stable 1.0 release of the Mozilla-sponsored language was in May 2015, the current version is 1.11
from August 2016. Pattern matching, definition of data structures and traits together with strong types bring
Rust close to functional languages like Haskell. A program can in most cases only be compiled if it is free
from crashes and undefined behavior, unless a panic is explicitly allowed when an alternative result case of
a function is not handled. So sometimes many iterations are needed until a program compiles, but as a
gain there are assurances for the execution like memory safety instead of buffer overflows. On the other
hand, one has to be aware of the flow and referencing of data through variables and has to comply with the
rules imposed by the move semantics of the language, because the ownership model is the core of the safety
guarantees.
Part of the language is the definition of test functions for software testing and inline documentation which can
include test cases. As tooling the build system and package manager Cargo comes along with the compiler
rustc which is invoked by Cargo. The inline documentation can be extracted to HTML.
Because of the noted advantages it seems that Rust fits well for compiler construction and offers a pleasing
workflow once one is accustomed to its type system, because runtime errors become rarer and thus less
debugging is needed.

4.3 The Python Programming Language
Python has been around for two decades now and is still representative of an easy-to-learn imperative
language with a simple syntax. It features dynamic types and a pragmatic approach to object orientation
with functional elements and closures. Freeing memory is done by reference counting and the language
is tightly coupled with the way it is interpreted. In order to fulfill a protocol/trait it relies on the functions an
object needs to implement.

1 Book Introduction ”The Rust Programming Language”: https://doc.rust-lang.org/book/index.html

5 Considerations and Challenges with ZPAQL as
Target Language
The ZPAQL bytecode that runs on the virtual machine is executed for every input byte, from the beginning
i.e. the first instruction. Then 𝐴 is set to the input byte while all other state is maintained between the
executions. Also if running as hcomp section then the first elements of 𝐻 are not general purpose memory
but serve as input for the context mixing components. Execution in a ZPAQL VM is sequential with no
parallelism. Beside the defined size of 𝐻 and 𝑀 there is no heap memory allocation or stack available. If
dynamic memory allocation is wanted it has to be implemented on top of 𝐻 or 𝑀.
It is Turing-complete but the bytecode size is restricted to around 64 KB (in hcomp a few bytes less, as the
context model needs to be defined before). Also there is no keyword for a data section defined in the
ZPAQL assembly language and it would be of no use as the bytecode content itself is not addressable and
also unmodifiable. But program data could be prepended before the stored data and retrieved by reading
in hcomp or pcomp and discarded while pcomp generates the output. If data should just be accessed by
hcomp it could also be stored as a skipped part of the pcomp bytecode as this is seen by hcomp before the
decoded data. This could also be used if an interpreter is written in ZPAQL.
The jump instructions only support a fixed and not a variable destination. That means that, instead of jumping
to a return address which the caller passes to the function, a return can only be achieved through an identifier
used with a jump table of predefined jump points.
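A sketch of this calling convention in Python terms (a hypothetical illustration of the scheme, not actual compiler output):

def function_body(return_id):
    # ... the actual work of the called function ...
    return dispatch(return_id)            # "returning" means jumping into the dispatcher

def dispatch(return_id):
    # central jump table: all possible continuation points are fixed in advance
    if return_id == 1:
        return continue_after_first_call_site()
    elif return_id == 2:
        return continue_after_second_call_site()
    raise RuntimeError("bad return identifier")

def continue_after_first_call_site():
    return "code following the first call site"

def continue_after_second_call_site():
    return "code following the second call site"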
The instruction set is rather simple and aimed at 32-bit integers. To keep expensive abstractions away, the
source language should e.g. not include floating point numbers or 64-bit integers. Most calculations have
to be done using register 𝐴 as accumulator because the others lack advanced instructions.
There is no instruction for reading the next byte or saying that context computation has finished - the
program needs to halt and is then executed again from the beginning. Therefore, additional runtime code is
needed to continue execution where it stopped if e.g. convenient input reading should be allowed without
disrupting the program flow or the finished computation of context data should be signaled.
It seems to be a viable plan to keep the organizational structure of a source program close to the resulting
ZPAQL program but with added convenience through a runtime API which is anyway needed for input and
output. Instead of introducing special functions to set the context data as first elements in 𝐻 , 𝐻 and also

𝑀 can just be fully exposed as data structure in the source language and also be used as arrays for other

means with low abstraction costs. This keeps the structure of handwritten ZPAQL programs close to those
in the source language. But in order to keep variables in a stack, 𝑅 with its 256 elements is not enough, so
expanding 𝐻 seems to be a good solution. To model the repetitive execution of hcomp and pcomp they
could be defined as functions in the source program (think main function) and it would also be possible to
pass the input byte as argument which also keeps the similarity to a handwritten ZPAQL source.
As the costs of the runtime-code abstractions for providing the mentioned read API are not too high, the similarity to
original ZPAQL files is more a cosmetic design decision. And if a context-set API which halts and continues
execution through runtime code is present, then the hcomp and pcomp functions could be replaced by main
functions which are entered only once and thus hide the fact that execution starts from the beginning for
every input byte. Still, dynamic memory management on top of 𝐻 and 𝑀 seems to be costly, and thus
departing too far from ZPAQL by adding more complicated data structures could hurt performance too
much.
It would help if the source program were executable standalone, without being compiled to ZPAQL, to
ease debugging by staying outside the ZPAQL VM as long as possible.
Before ZPAQL instructions are generated by the compiler, it is helpful if most complicated operations are
already solved on the IR level, like the saving, restoring and other memory management.

6 Selection of the Python-subset as Source
Language
Considerations are mentioned again before the chosen subset is presented together with the API functions
and organizational source code constraints. Then the grammar is listed with a short note on the behavior of
elements.

6.1 Decision on the Feature Set
Based on the preceding thoughts the input should be a single Python file containing both the code for hcomp
and pcomp as functions which could also call other functions but no imports are allowed. The compiler
should not deal with strings, lists, arbitrary big numbers, classes, closures and (function) objects but only
support a reduced set with variables having 32-bit integer values. The context model should be defined
like in a ZPAQL configuration and the predefined arrays 𝐻 and 𝑀 should be exposed as well as the size
exponents ℎℎ, ℎ𝑚, 𝑝ℎ, 𝑝𝑚 and 𝑛 as number of components used.
API functions out(a) for the pcomp out instruction and read_b() for reading the next input byte by halting
and returning to the current state of execution should be provided, as well as helpers if dynamic memory
management for additional arrays is implemented in Python on top of 𝐻 and 𝑀. This API needs to be
included as a corresponding ZPAQL runtime in the output of the compiler. The Python runtime should also
allow execution as a standalone program with an interface similar to the zpaqd development tool, which can
run pcomp and hcomp on a given input.
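For illustration, a minimal hcomp in such a subset could look like the following sketch (hypothetical; the exact entry-point signature and the way hH is provided follow the template described in chapter 7):

hH = [0] * (1 << 2)        # in a real source file this array is provided by the template (size 2**hh)
last_byte = 0

def hcomp(c):              # c: the current input byte passed in by the runtime
    global last_byte
    hH[0] = last_byte << 9 # order-1 context for component 0, low bits left for the partially decoded byte
    last_byte = c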
All the code segments need to be separated, namely the context model definition and other common definitions/functions, pcomp and hcomp code with their variables and functions and then the runtime API and
finally the code for standalone execution. Thus a template approach was chosen which uses comment lines
as separation marks. Then the compiler extracts only the sections it needs to compile.


source.py template sections, in order, and whether they may be edited:
- Definition of the ZPAQ configuration header data (memory size, context mixing components) and optionally functions and variables used by both hcomp and pcomp (editable)
- API functions for input and output, initialization of memory (not editable)
- function hcomp and associated global variables and functions (editable)
- function pcomp and associated global variables and functions (editable)
- code for standalone execution of the Python file analog to running a ZPAQL configuration with zpaqd r [cfg] p|h (not editable)
Of course it could be organized in a different way that is more appealing based on the alternatives mentioned
in the previous chapter e.g. with two more idiomatic Python files for hcomp and pcomp without special
entry functions and the runtime in a different file which either imports the two files or vice versa.

6.2 Grammar and Semantics
The Python grammar as specified in the language reference1 has been simplified where possible. It still
allows dictionaries and strings, however not for program code but just for the definition of the context
mixing components. Tuples, unpacking, lists and list comprehensions, for-loops, with-blocks, support for
async-coroutines, import, try/raise/except, decorators, lambda expressions or named arguments have been
removed.
Not everything that is parsed is allowed as source code, e.g. nonlocal, dicts, strings or the @-operator for matrix
multiplication, but this way a better error message can be provided than just a parser error. The names of
the productions have been kept even if they are simplified. It would be best if the grammar were expanded to
parse full Python again and the code generator decided what to support.
Grammar with NUMBER, NAME, ”symbols”, NEWLINE, INDENT, DEDENT or STRING as terminals:

Prog               (NEWLINE* stmt)* ENDMARKER?
funcdef            ”def” NAME Parameters ”:” suite
Parameters         ”(” Typedargslist? ”)”
Typedargslist      Tfpdef (”=” test)? (”,” Tfpdef (”=” test)?)* (”,” (”**” Tfpdef)?)?
Tfpdef             NAME (”:” test)?
stmt               simple_stmt | compound_stmt
simple_stmt        small_stmt (”;” small_stmt)* ”;”? NEWLINE
small_stmt         expr_stmt, pass_stmt, flow_stmt, global_stmt, nonlocal_stmt
expr_stmt          (store_assign augassign test) | ((store_assign ”=”)? test)
store_assign       NAME (”[” test ”]”)?
augassign          ”+=” | ”-=” | ”*=” | ”@=” | ”//=” | ”/=” | ”%=” |
                   ”&=” | ”|=” | ”^=” | ”<<=” | ”>>=” | ”**=”
pass_stmt          ”pass”
flow_stmt          break_stmt | continue_stmt | return_stmt
break_stmt         ”break”
continue_stmt      ”continue”
return_stmt        ”return” test
global_stmt        ”global” NAME (”,” NAME)*
nonlocal_stmt      ”nonlocal” NAME (”,” NAME)*
compound_stmt      if_stmt | while_stmt | funcdef
if_stmt            ”if” test ”:” suite (”elif” test ”:” suite)* (”else” ”:” suite)?
while_stmt         ”while” test ”:” suite (”else” ”:” suite)?
suite              simple_stmt, NEWLINE INDENT stmt+ DEDENT
test               or_test
test_nocond        or_test
or_test            and_test (”or” and_test)*
and_test           not_test (”and” not_test)*
not_test           comparison | (”not” not_test)
comparison         expr (comp_op expr)*
comp_op            ”<” | ”>” | ”==” | ”>=” | ”<=” | ”!=” | ”in” | ”not” ”in” | ”is” | ”is” ”not”
expr               xor_expr (”|” xor_expr)*
xor_expr           and_expr (”^” and_expr)*
and_expr           shift_expr (”&” shift_expr)*
shift_expr         arith_expr | (arith_expr (shift_op arith_expr)+)
shift_op           ”<<” | ”>>”
arith_expr         term | (term (t_op term)+)
t_op               ”+” | ”-”
term               factor (f_op factor)*
f_op               ”*” | ”@” | ”/” | ”%” | ”//”
factor             (”+” factor) | (”-” factor) | (”~” factor) | power
power              atom_expr (”**” factor)?
atom_expr          (NAME ”(” arglist? ”)”) | (NAME ”[” test ”]”) | atom
atom               (”(” test ”)”) | (”{” dictorsetmaker? ”}”) | NUMBER | STRING+ | ”...” |
                   ”None” | ”True” | ”False” | NAME
dictorsetmaker     dictorsetmaker_t (”,” dictorsetmaker_t)* ”,”?
dictorsetmaker_t   test ”:” test
arglist            test (”,” test)* ”,”?

1 https://docs.python.org/3/reference/grammar.html

The semantics of the language elements as described in the reference2 stay mostly the same, even if the
usable feature set is reduced as stated before. In particular, one has to be aware of integer overflows,
which are absent in Python but present in ZPAQL, and thus all computations are in ℤ_4294967296, i.e.
𝑚𝑜𝑑 2^32. An exception are the bit shift operations with a shift of more than 32 bits: in this case the Python
subset will do a shift by 𝑋 𝑚𝑜𝑑 32 bits. To achieve the semantics of (v << X) % 2**32 or (v >> X) % 2**32
with 𝑋 > 31 the resulting value should directly be set to 0 instead. Also / 0 and % 0 do not fail in ZPAQL
but result in a 0 value.
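The following helpers show how code can be written so that it behaves identically when run as plain Python and when compiled (an illustrative sketch):

def shl32(v, x):
    # ZPAQL would shift by x mod 32, so the case x > 31 has to be handled explicitly
    if x > 31:
        return 0
    return (v << x) % 2**32    # all other arithmetic is mod 2**32 in ZPAQL anyway

def div0(a, b):
    # ZPAQL yields 0 on division by zero; mirror that instead of raising an exception
    if b != 0:
        return a // b
    return 0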
For a wrong Python input the current compiler might behave in a different way than the Python interpreter
and might even accept it without a failure. Therefore it is required that the input is a valid Python program
which runs without exceptions.
This requirement is also important because the current compiler does not check array boundaries, so
index%len(hH) or index&((1<<hh)-1) should be used e.g. for a ring buffer, because after the original
size of 𝐻 the stack follows. If run as a plain Python file, an exception is thrown anyway because Python
checks array boundaries.

2 https://docs.python.org/3/reference/index.html

7 Design of the Compiler in Rust
The source code for the zpaqlpy compiler is located at https://github.com/pothos/zpaqlpy and includes
the test data and the other files mentioned. This chapter describes the API exposed to the source file
and the process of developing the compiler and its internals.

7.1 Exposed API
The 32- or 8-bit memory areas 𝐻 and 𝑀 are available as the arrays hH, pH, hM, pM, depending on being in the
hcomp or pcomp section, with sizes 2^hh, 2^hm, 2^ph, 2^pm defined in the header and available as the constants hh, hm,
ph, pm. There is support for len(hH), len(pH), len(hM), len(pM) instead of calculating 2^hh. But
in general, instead of using len() for dynamically allocated arrays as well, special functions like len_hH()
are used to visibly expose their types and to do runtime checks already in Python. NONE is a shortcut for
0 − 1 = 4294967295.

Other functions                 Description

c = read_b()                    Read one input byte; might leave VM execution to get the next input byte before returning
push_b(c)                       Put read byte c back, overwrites if already present (no buffer)
c = peek_b()                    Read but do not consume the next byte; might leave VM execution to get the next input byte before returning
out(c)                          In pcomp: write c to the output stream
error()                         Execution fails with ”Bad ZPAQL opcode”
aref = alloc_pH(asize), …       Allocate an array of size asize on pH/pM/hH/hM
aref = array_pH(intaddr), …     Cast an integer address back to a reference
len_pH(aref), …                 Get the length of an array in pH/pM/hH/hM
free_pH(aref), …                Free the memory in pH/pM/hH/hM again by destructing the array

If backend implementations addr_alloc_pH(size), addr_free_pH(addr), … are defined, then dynamic memory
management is available through the API functions alloc_pM and free_pM. The cast

array_pH(numbervar) can be used to save a type check in ZPAQL at runtime. Also in plain Python the

cast from an address is needed after an array reference was itself stored into 𝐻 and thus became an address
number and is then retrieved as a number again instead of a reference. In general, there are no boxed types
but by context a variable is used as address.
The last addressable starting point for any list is 2147483647 == (1<<31) − 1 because the compiler uses
the 32nd bit to distinguish between pointers to 𝑀 and 𝐻 .
The provided implementations of addr_alloc_pM, addr_free_pM, … can be found in the template (run
./zpaqlpy --emit-template or see src/template.rs). The returned pointer is expected to point at the

first element of the array. One entry before the first element is used to store whether this memory section
is free or not. Before that the length of the array is stored, i.e. 𝐻[𝑎𝑟𝑟𝑎𝑦𝑝𝑜𝑖𝑛𝑡𝑒𝑟 − 2] for arrays in H and the
four bytes 𝑀[𝑎𝑟𝑟𝑎𝑦𝑝𝑜𝑖𝑛𝑡𝑒𝑟 − 5]…𝑀[𝑎𝑟𝑟𝑎𝑦𝑝𝑜𝑖𝑛𝑡𝑒𝑟 − 2] of the 32-bit length for arrays in 𝑀.
Beside these constraints the implementations are free how to find a free region. The example uses getter
and setter functions for the 32-bit length value as four bytes in 𝑀. For allocation it skips over the blocks
from the beginning until a sufficiently sized block is found. If this block is bigger then the rest of it is kept
free and might be merged with the next block if it is also free. That also happens when a block is freed
again, then it is even merged with the previous block if that is free.
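The header layout described above can be modelled with a few lines of Python (an assumption-based sketch, not the template code itself):

def alloc_in_H(H, start, size):
    # layout: [length][used flag][element 0 ... element size-1]
    H[start] = size          # length, later read as H[pointer - 2]
    H[start + 1] = 1         # 1 = in use, 0 = free, read as H[pointer - 1]
    return start + 2         # the returned pointer addresses the first element

H = [0] * 64
p = alloc_in_H(H, 0, 10)
assert H[p - 2] == 10 and H[p - 1] == 1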

7.2 Parser and Tokenizer
Tokens are known Python keywords and symbols like operators or brackets, names or strings, and numbers
as converted values. Initially the Python module tokenize was used by piping source code
through python3 -m tokenize -e and processing its output to build tokens as a Rust data structure
(see src/tok.rs). Later on, the code commonly residing in /usr/lib/python3.5/tokenize.py and
/usr/lib/python3.5/token.py was ported to Rust (see src/rtok.rs) in order to be independent of
external calls. It is assumed that the input is UTF-8 without a BOM (Byte Order Mark).
The parser is built with a parser generator from the grammar specification and constructs the AST with each production. The parser library lalrpop1 was chosen in LALR(1) recursive ascent mode for that purpose. The elements of the produced AST are based on the abstract grammar of the Python module ast2, but simplified (see src/ast.rs), and give a structured representation of the source program. In fact it is not a tree but a list of statements. A statement can be one of FunctionDef, which holds the function body as a list of statements; Return, Assign and AugAssign, which hold the expressions concerned; While and If, which hold their body and else-part as lists of statements and the test as an expression; Global, Pass, Break, Continue; and Expr, which encapsulates an expression. An expression can be one of BoolOpE to evaluate and and or over expressions, BinOp for an arithmetic operation over two expressions, UnaryOpE for a unary operation, Compare for comparisons over expressions, Call for a function call including arguments as expressions, Num for an integer, NameConstant for True and False, Name for a variable, or Subscript for index access to an array variable. The precedence of and and or is resolved during parsing into a binary tree of BoolOpE elements. One has to be aware, however, that with Compare the semantics of a == b == c and (a == b) == c differ: in the chained comparison the middle operand is split up and passed on to the next comparison, and the results are combined with and, stopping as soon as the overall result can no longer be True. This was therefore better left to the IR generator.

1 https://github.com/nikomatsakis/lalrpop
2 https://docs.python.org/3.5/library/ast.html#abstract-grammar
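The chained-comparison behaviour that the IR generator has to reproduce can be seen directly in plain Python:

a, b, c = 2, 2, 1
print(a == b == c)    # False: treated as (a == b) and (b == c), with b evaluated once
print((a == b) == c)  # True:  (a == b) is True and True == 1 holds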

7.3 Grammar of the Intermediate Representation
The IR was chosen to be close to ZPAQL but easier to write. While ZPAQL has the registers 𝐴, 𝐵, 𝐶, 𝐷, 𝐹, 𝑅0…255 and the arrays 𝐻 and 𝑀, the IR only gives control over 𝑅, 𝐻 and 𝑀 and no other registers; no new data structures are added. The remaining registers can therefore be used for address computation and temporary calculations when the IR is converted to ZPAQL. The temporary variables t0…t255 of the IR map directly to the registers 𝑅𝑖. Because the input byte is in 𝐴 at the beginning of hcomp/pcomp execution, the IR relies on the guarantee that 𝑅255 = 𝐴 before the first instruction.

stmt    ::= var "=" var (op var)?
          | var "=" uop var
          | "if" var "goto" label
          | "ifN" var "goto" label
          | "ifEq" var var "goto" label      (to be used for optimizations)
          | "ifNeq" var var "goto" label
          | "goto" label
          | ":" label ":"
          | "halt"
          | "error"
          | "out" var

var     ::= t | "H[" t "]" | "H[t0+" x "]" | "H[t252+" x "]" | "H[" x "]" | "M[" t "]" | "M[" x "]" | x

op      ::= "+" | "-" | "*" | "/" | "//" | "%" | "**" | "<<" | ">>" | "|" | "^"
          | "&" | "or" | "and" | "==" | "!=" | "<" | "<=" | ">" | ">="

uop     ::= "!" | "~" | "-"

t       ::= "t0" | … | "t255"

x       ::= "0" | … | "4294967295"

label   ::= [a-z_0-9~A-Z]+

comment ::= "#…\n"


The var on the left side of an assignment cannot be a number. The operators or and and differ from the binary versions | and & in that they keep the semantics of Python's or and and, i.e. they evaluate to the original value and not simply to the boolean choices 1 and 0 for True and False, while the binary operators perform bitwise OR and AND. The operator !v tests against v == 0 while ~v inverts the bits.
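The following plain-Python lines show the value-preserving behaviour that the IR operators mirror (shown with a 32-bit mask for ~, since the subset operates on 32-bit unsigned values):

print(5 or 0)            # 5  -- "or" yields the original value, not just 1
print(0 or 7)            # 7
print(5 and 7)           # 7  -- "and" yields the last evaluated operand
print(5 | 3)             # 7  -- "|" works on the bit patterns instead
print(int(not 5))        # 0  -- "!v" in the IR: a test against v == 0
print(~5 & 0xFFFFFFFF)   # 4294967290  -- "~v": all 32 bits inverted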
The choices made might not be the best, and an entirely different IR that introduces variables or stack operations would be possible. For efficiency it might be interesting to move towards static single assignment form with linear-scan or graph-coloring register allocation. Currently local Python variables live on the stack and temporary variables in 𝑅, because together they could exceed 256 and local variables need to be stored on the stack before a call anyway. They could however be joined in a common pool, and the current limit of at most 256 temporary variables in 𝑅 could be lifted by using elements of 𝐻 as needed. Even LLVM, as a very popular IR, could be used with its many available optimization passes.

7.4 IR Generation and the Stack
Due to the reduced feature set of the subset, the Python source can be compiled in a similar way as a C source would be. The local variables of a function are held on the stack, which is created by expanding 𝐻 beyond the size defined for the configuration. Global variables are held at the beginning of the stack. The temporary variables t0…t251 are used for intermediate or address computations, whereas t0 is the base pointer for the stack and t252 is a copy of the global base pointer. t255 holds the last input byte, t254 the reading state and t253 the read byte for the API function read_b(), which stops the execution and returns to the caller with the newly acquired byte when the bytecode is run again.
The original size 𝑥 = 2^hh of 𝐻 is extended to hold the stack of size 𝑦. To calculate a valid resulting size with 2^⌈𝑧⌉ ≥ 𝑥 + 𝑦, the formula 𝑧 = log2(𝑥 + 𝑦) = log2(𝑥) + log2(1 + 𝑦/𝑥) is used.
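For illustration, the new exponent can be computed as in the following sketch (the function name is hypothetical, not taken from the compiler sources):

import math

def extended_hh(hh, y):
    """Smallest exponent z so that 2**z holds both the configured H (x = 2**hh)
    and a stack of y additional cells."""
    x = 2 ** hh
    z = math.log2(x + y)          # = log2(x) + log2(1 + y/x)
    return math.ceil(z)

# e.g. hh = 2 (x = 4) and a stack of 1000 cells need 2**10 = 1024 cells
assert 2 ** extended_hh(2, 1000) >= 2 ** 2 + 1000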

There are no IR instructions for calling a function or for stack operations on 𝐻, but there are helper meta instructions. They are later expanded to plain IR instructions and exist for handling blocks, for saving and loading variables on the stack in 𝐻, for calls and returns that comply with the calling convention, for the predefined IR code sections of the runtime API and for the jump table that continues execution after a return. The initial IR code is responsible for either calling the Python function hcomp(c)/pcomp(c) with the new input byte or continuing execution if it was interrupted through a read_b() API call. It also sets the base pointer on the first run and defines the API function read_b.
The function traverse() in src/gen_ir.rs generates IR instructions for the given (part of an) AST. It is used recursively and consults the symbol table for free temporary variables, the positions of local variables and the mapping of global variables. It also uses the recursive function evaluate(), which takes only an expression part of an AST and returns IR instructions together with the IR variable that holds the value after these instructions have been executed. Returning the instructions makes testing easier. If pcomp(c)/hcomp(c) contains only a pass then the whole pcomp/hcomp section is omitted.
The temporary variables have to be saved on the stack before a call. As part of the calling convention the current base pointer and then the return ID for the jump table are saved there as well. The new base pointer in t0 points at the return ID. The passed arguments follow, and the called function addresses them via 𝐻[𝑡0 + 𝑥]. On return the previous base pointer is restored to t0, the return ID is copied to t2 for the jump table and the return value is placed in t1 before jumping to the jump-table code, which transfers control back to just after the call instruction.
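Put together, the frame for a call f(a, b) could be pictured as in the following sketch; the exact placement of the saved caller state is an assumption derived from the description above, not taken from the compiler sources:

# Hypothetical stack frame in H, offsets relative to the callee's base pointer t0:
#
#   H[t0 - n ...]    temporaries of the caller saved by StoreTempVars
#   H[t0 - 1]        saved base pointer of the caller
#   H[t0]            return ID consumed by the jump table after the callee returns
#   H[t0 + 1]        first argument a
#   H[t0 + 2]        second argument b
#   H[t0 + 3 ...]    local variables of the callee
#
# On return the saved base pointer is restored to t0, the return ID is copied to
# t2, the return value is placed in t1 and control jumps to the jump-table code.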
A compiler flag could allow treating local variables like temporary variables to reduce code size and stack usage. The calling convention could also be changed to avoid the stack, and globals could be stored in t-variables.
For now the compiler passes IR code around as vectors (lists), but for a more idiomatic Rust style iterators
could be used.

7.5 Optimizations in the Intermediate Representation
Unused functions occupy space in the bytecode; therefore every function is searched for calls to it, and if none is found it is removed. This is repeated until no further function can be removed. Assignments that are unused or that could be merged into one assignment are not yet recognized.
During IR generation it is not known whether a temporary variable really needs to be saved on the stack before a call or whether it is not used afterwards anyway. Therefore the helper macro instructions MarkTempVarStart and MarkTempVarEnd are inserted around each function scope. Saving and restoring temporary variables is done through the macros StoreTempVars{identifiers} and LoadTempVars{identifiers}. The optimization pass walks the IR instructions in reverse order to reason about lifetimes and removes from the load and store macros those identifiers that are not live, i.e. not referred to after the load.
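A simplified sketch of this reverse pass could look as follows; the instruction objects with .kind, .ids, .reads() and .writes() are a hypothetical model, not the compiler's actual data structures:

def prune_temp_saves(ir):
    live = set()        # temporaries that are still read later in program order
    pruned_loads = []   # id sets of already pruned LoadTempVars (reverse order)
    for instr in reversed(ir):
        if instr.kind == "LoadTempVars":
            instr.ids = [t for t in instr.ids if t in live]   # keep only temps needed later
            pruned_loads.append(set(instr.ids))
        elif instr.kind == "StoreTempVars":
            keep = pruned_loads.pop() if pruned_loads else set()
            instr.ids = [t for t in instr.ids if t in keep]   # mirror the matching load
        else:
            live -= set(instr.writes())   # a definition ends the live range (walking backwards)
            live |= set(instr.reads())    # a read makes the temp live
    return ir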
Compiler books have much more to offer, and many classical optimizations could be considered for further improvements. Caching could be used and constant expressions could be evaluated during IR generation; currently this is only done for a row of assignments on allocated arrays in 𝑀.

7.6 ZPAQL Generation and Register Assignment
The IR was chosen to be easily convertible to ZPAQL by using the registers 𝐴, 𝐵, 𝐶 and 𝐷 to calculate and move values between 𝐻 and 𝑅. For the beginning a very simple but inefficient approach was therefore taken. The function emit_zpaql() in src/gen_zpaql.rs takes IR code and yields ZPAQL code by calling assign_var_to_a(var) for the operands (if needed saving one value from 𝐴 in 𝐶) and applying the operator on them. The result is moved from 𝐴 to the target by calling assign_a_to_var(var). To load a number into 𝐴, calc_number(value) is used; for values greater than 255 multiple bit shifts are needed. All of these helper functions emit ZPAQL code which is combined as the result of each function, similar to the IR generation.
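For illustration (this is not the code of calc_number(), just the idea behind it), a 32-bit constant can be assembled from byte-sized pieces by repeated shifting:

def build_constant(value):
    """Assemble a 32-bit value from single bytes, mirroring how a large
    immediate has to be built up in ZPAQL with shifts and additions."""
    a = 0
    for shift in (24, 16, 8, 0):
        a = (a << 8) & 0xFFFFFFFF     # make room for the next byte
        a += (value >> shift) & 0xFF  # add one byte of the target value
    return a

assert build_constant(3210987654) == 3210987654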
Later a more advanced solution with pattern matching was chosen to avoid the ubiquitous transfer through 𝐴, e.g. to detect an augmented assignment like += 1 and simply increase the value at its location. For this the helper function gen_loc_for_var() was introduced to obtain a location handle, and the other helpers were extended to operate on the notion of a location, meaning registers or memory positions reached through pointers, instead of only 𝐴. Tracking recent values of variables in registers or memory locations for reuse also helped to produce less code. Making changes has become a more delicate matter, however, because the cache entries concerning the location and the variable need at least to be invalidated for a correct output.
After code generation a special optimization takes place to simplify byte assignments on arrays in 𝑀. They are produced during IR generation when a row of successive assignments, as in an initialization, is detected. The pointer variable is then only increased and not even saved to 𝑅 until all assignments are finished.
Resolving labels to positions for the jump destinations is done as the last step before the ZPAQL assembly code is written out. The opcode size of each instruction is known, so the positions of the labels can be held in a hash table. A second pass through the code then replaces the virtual GoTo placeholder instructions with real long jumps.
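A minimal two-pass sketch of this label resolution could look as follows, using a hypothetical instruction model (the compiler's real data structures and the exact long-jump encoding differ):

from dataclasses import dataclass

@dataclass
class Label:
    name: str

@dataclass
class GoTo:           # virtual placeholder produced by the code generator
    target: str

@dataclass
class Op:             # any real instruction together with its known opcode size
    text: str
    size: int

LONG_JUMP_SIZE = 3    # assumed size of the emitted long jump

def resolve_labels(code):
    positions, pc = {}, 0
    for instr in code:                    # first pass: record the position of each label
        if isinstance(instr, Label):
            positions[instr.name] = pc
        else:
            pc += LONG_JUMP_SIZE if isinstance(instr, GoTo) else instr.size
    out = []
    for instr in code:                    # second pass: replace the placeholders
        if isinstance(instr, GoTo):
            out.append(Op("lj %d" % positions[instr.target], LONG_JUMP_SIZE))
        elif not isinstance(instr, Label):
            out.append(instr)
    return out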

7.7 Debugging
It is important to run and test the plain Python version, which contains assertions, before trying to compile it to ZPAQL. For the Python runtime a compare option was introduced to test the correctness of the pcomp code, which should restore the preprocessor input. Each output byte is compared to the expected output and a debugger shell is spawned if a mismatch occurs.
Because compiler optimizations are likely to introduce bugs, there is a small test suite in the Makefile. It compares the output of the Python code for pcomp and hcomp with the output of running the compiled ZPAQL code. Mostly zpaqd r CFG p is used, but also a ZPAQ VM implementation in Rust that is included in the compiler (option --run-hcomp), because zpaqd r CFG h does not print 𝐻[0], …, 𝐻[𝑛 − 1].

8 Exemplary Configurations and Programs
As a showcase of the benefits of having a compiler for a high-level language like Python, this chapter presents and evaluates two new ZPAQ configurations. The first is a new image compression model which can parse a header and combine various context predictors. The second is an implementation of the Brotli decompression scheme in the pcomp phase.

8.1 Compression of PNM Image Data using a Context Model
PNM is a lossless image format which in variant P6 with a maximum color value of 255 stores the 8-bit RGB color channels as bytes after a short header containing height and width1. To predict the value of a color channel, the values of the neighboring pixels can be used as context as far as they have already been seen in the stream. L refers to the left pixel, LU to the upper left, U to the pixel above and UR to the upper right.
For grayscale images in a custom format such a context model has already been developed2 for ZPAQ and performed better than PNG compression. Even though it does not reach the compression of the ZPAQ configuration bmp_j4c.cfg, it is much faster because its context-mixing network is simpler.
The PNM model test/pnm.py parses the header in Python to define the history buffer for one row and takes a similar approach as the grayscale model, but with separate context models for each channel. These are mixed together with an averaged-value context model and indirect models. This mixed prediction is then combined by a mixer with the prediction of a chain of ISSE (with a hash of various contexts and the last byte as context) refining the prediction of an ICM (with an averaged-value context).
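As an illustration of the header handling (this is not the code of test/pnm.py, and comment lines in the header are ignored for brevity), the relevant fields can be obtained like this:

import re

def parse_p6_header(raw):
    """Extract width, height and maxval from a P6 header so that the history
    buffer for one pixel row (3 bytes per pixel) can be sized."""
    m = re.match(rb"P6\s+(\d+)\s+(\d+)\s+(\d+)\s", raw)
    if m is None:
        raise ValueError("not a P6 header")
    return tuple(int(g) for g in m.groups())

width, height, maxval = parse_p6_header(b"P6\n512 512\n255\n")
row_buffer_len = 3 * width   # one byte per RGB channel for the previous row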
Color transforms like the subtract-green transform in WebP can improve compression by decorrelating the colors: one color component is often also a good predictor for the others3, and reducing that link improves the prediction within each color. The PNM model can use it as pre- and postprocessor, mapping (𝑟, 𝑔, 𝑏) to (𝑟 − 𝑔, 𝑔, 𝑏 − 𝑔) and back. The preprocessor resides in the separate Python file test/subtract_green.
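A minimal sketch of this transform and its inverse on a single pixel (byte arithmetic modulo 256; test/subtract_green is assumed to apply the same mapping to the pixel data after the header):

def subtract_green(r, g, b):
    return ((r - g) % 256, g, (b - g) % 256)   # preprocessor direction

def add_green(r_minus_g, g, b_minus_g):
    return ((r_minus_g + g) % 256, g, (b_minus_g + g) % 256)   # pcomp undoes it

assert add_green(*subtract_green(12, 200, 34)) == (12, 200, 34)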

1 http://netpbm.sourceforge.net/doc/ppm.html
2 http://www.modejong.com/blog/post15_zpaql_grayscale/index.html
3 https://developers.google.com/speed/webp/docs/webp_lossless_bitstream_specification#subtract_green_transform


To measure the compression ratio, four image files were converted to PNM in test/ and compressed (the 512-pixel-high peppers.pnm, monarch.pnm and kodim23.pnm4 and the 1020-pixel-high rafale.pnm5). FLIF, a new lossless format6, is included for comparison.

Figure 8.1: Images of the benchmark: (a) kodim23, (b) peppers, (c) monarch, (d) rafale

Method                                                                          Total file size in bytes
ZPAQ archive with bmp_j4c.cfg (in BMP format, uses special color transform)     1785405
ZPAQ archive with pnm.cfg                                                        1908258
FLIF                                                                             1914297
WebP                                                                             2256458
PNG                                                                              2938153

For now it does not use delta coding or other predictors besides the average of the neighboring values. It could also provide another predicted color value estimated from the recent development of the values. Beyond that, the choice of color transform should depend on the image, or a way needs to be found to move its advantage out of the preprocessor and into the predictors. Like the mentioned grayscale model it uses an additional context for the mixer depending on image noise.
By using Python the model can easily be improved; writing it in ZPAQL would have been a bigger and more bug-prone effort. The bytecode overhead as of writing is around 3 KB.
For the large 53 MB photo canon24.pnm7 it would reach 6th place with 21,326,964 bytes (89 seconds for (de)compression) in the benchmark for lossless picture compression published in a paper about BWT image compression [15]. For kodim23.pnm it would reach 4th place with an archive size of 369,519 bytes. An overall benchmark over a large number of files as in 8 or 9 has not been done so far.

4 Commonly used for image compression benchmarks, available under http://r0k.us/graphics/kodak/
5 http://www.maximumcompression.com/data/bmp.php
6 http://flif.info/
7 https://web.archive.org/web/20140702040431/http://www.squeezechart.com/canon24.pnm
8 http://www.squeezechart.com/bitmap.html
9 http://imagecompression.info/gralic/LPCB.html


8.2 Bringing the Brotli Algorithm to ZPAQ
The algorithm has to be implemented as postprocessor in pcomp and uses the Brotli compressor10 as preprocessor. Because the dictionary alone is almost twice the maximum allowed bytecode size, it has to be prepended to the first segment of the compressed Brotli data so that it can be read into memory. Subsequent Brotli streams stored as additional segments in the same block can use a byte different from the first dictionary byte as prefix instead of prepending the whole dictionary again.
Instead of porting the C library or writing a new implementation based on the Brotli specification [11], it was easier to port a decompressor written in Rust11 to the Python-subset (see test/brotli.py). It works as a standalone Python program and can be compiled to ZPAQL with a bytecode size of 64870 bytes (ZPAQ config file test/brotli.cfg).
This means the size of an empty archive is already 184 KB if no context mixing is used for the dictionary and bytecode. A simple order-1 ICM model brings it down to 90 KB; the slightly more advanced model of mfast.cfg and its hcomp code reach 64 KB in total.

The overhead is less significant for bigger archives, and trying to reduce it by applying context mixing to the whole block might not always improve the result because the Brotli data could expand in coding size. For the four sample PNM files it saves 133 KB, which is almost the size of the dictionary and bytecode (from 3615 KB with no arithmetic coding down to 3482 KB in total). For the enwik8 benchmark12 (100 MB Wikipedia dump), however, this would not even make sense, because mfast.cfg alone performs better, reaching 24720 KB instead of the 30303 KB when used together with the Brotli data and the dictionary/bytecode overhead. The same Brotli data stored without arithmetic coding through a context model takes around 30416 KB; depending on the Brotli compressor options it might go down to around 26000 KB.
Decompression is much slower than with the C or Rust Brotli implementations. The ZPAQL implementation needs 118 seconds for enwik8 instead of the 0.5 or 2.1 seconds of the reference C implementation and the one by Mark Adler13. The memory management is not very efficient, especially dealing with byte arrays in 𝑀 is expensive, and the compiler could sometimes produce better code by avoiding the intermediate variables in 𝑅. Besides that, ZPAQL is JIT-compiled by libzpaq in a rather simple way, so an advanced JIT backend on top of LLVM could help.
Still, this shows that ZPAQ can be used with a very different algorithm while maintaining backwards compatibility. A ZPAQ decompressor could also detect whether the embedded pcomp bytecode is in fact a Brotli implementation and skip its execution by using one of the faster C or Rust implementations. That way new algorithms become instantly available, while newer versions of decompressors can gain a speed advantage by replacing the bytecode execution with an equivalent library call.

10 https://github.com/google/brotli
11 https://github.com/ende76/brotli-rs
12 Used for the Hutter Prize compression challenge, comparison here: http://mattmahoney.net/dc/text.html
13 https://github.com/madler/brotli

9 Evaluation
This chapter reviews the outcome of the work on the compiler so far and attempts to provide one possible answer to the question whether, or how well, ZPAQ is suited to be a general standard for data compression.

9.1 Compiler Construction
Rust provided a good environment for preventing many bugs by detecting them at compile time. Yet an iterative build takes up to four minutes, which is also due to the big parser generated by lalrpop. Recently there has been development of a table-driven mode which could provide faster compile times by producing less parser code. The upcoming MIR intermediate language will also allow the Rust compiler to cache compiled code segments if they have not been altered. This would make test-driven development easier because of faster iterations.
Even though Rust provides built-in test support, it has not been used, as the compiled code output was mainly tested with zpaqd. This could be improved, especially with the small ZPAQL VM implementation that is already included and used for testing the correctness of computations through the Makefile. The code lacks inline documentation and the structure is sometimes not idiomatic Rust. Error handling currently uses panics instead of propagating success or failure via result types.

9.2 Performance of the Compiler
The choice of the Python-subset and the IR seems to have supported a fast development of the compiler, even though a lot of optimization passes are still missing. There might be compiler bugs that have not been discovered yet, as the test coverage is not very extensive.
While the original ZPAQL file for test/lz1.orig.cfg needs 251 bytes for an empty archive, the port test/lz1.py compiled to bytecode needs 1255 bytes, an overhead of around 1000 bytes compared to the handwritten LZ1.


Runtime Benchmarks of the LZ1 Configuration in ZPAQL and Python for the four PNMs

ZPAQ config                  Compression time (sec)   Archive size (bytes)   Decompression time (sec)
zpaqlpy lz1.cfg              1.6                      3350016                1.85
handwritten lz1.orig.cfg     1.3                      3349062                1.53

9.3 Analysis of the generated Code and Comparison with handwritten Code of LZ1
The original test/lz1.orig.cfg is very dense and tries to hold values in the registers whenever possible. Only in the pcomp code are two 𝑅 variables used, along with 𝑀 as history buffer. Sometimes variables are moved between the registers because certain operations are only available on 𝐴. Without comments it would be difficult to guess what it does and to make changes.
The compiled test/lz1.cfg from the Python version, on the other hand, is rather large, especially when all comments are embedded, which is the default. This does not affect the bytecode size but is useful for debugging and partly allows extraction of the original source as well as the IR code by invoking:

egrep '\( *[0-9]+: ' test/lz1.cfg                     # Python code
egrep '\( +' test/lz1.cfg | egrep -v '\( *[0-9]+: '   # IR code

The compiled file starts with the runtime code which decides whether this is the first run. On the first run it sets the base pointer and executes the global definitions before calling the pcomp/hcomp function with the input. If initialization has already taken place, the code decides whether a read was interrupted and the new input byte has to be delivered, or whether the pcomp/hcomp function should simply be called again with the new input. As stated in the previous chapters, all Python variables are on the stack and thus calculations on them take more opcodes. The code does not use custom arrays but uses 𝑀 as history buffer like the original code. Every time the function returns, the jump table is used to jump behind the call, and then the execution halts.
It is an advantage that changes can be made more quickly in Python and that the code is more comprehensible. But the compiled code uses a stack where it is not really necessary, and instead of using the jump table for the return it could simply use the halt instruction. For simple programs like this it would therefore be desirable if the compiler offered a mode without a stack that holds variables in 𝑅.


9.4 Suitability of ZPAQ as universal Standard
ZPAQ serves well for research on context-mixing algorithms and as a development platform for custom algorithms. Whether it will replace the general compression implementations of deflate and newer variants in various libraries is open to question. Because a huge memory allocation or an endless loop can be unsafe when decompressing untrusted archives, at least configurable limits need to be implemented in libzpaq. It might also not be perfect for universal use, as the mixing type (logarithmic) is quite fixed. In addition, the definition of a custom encoder cannot use the built-in context mixing. A component which takes something like 𝐻[𝑖] − 2048 as a stretched prediction value would be interesting (in a valid range to leave range space for the other bit). It is also not possible to define new components, which perhaps could be placed in an additional section with their identifiers mentioned. While pcomp is notified about segments, hcomp is not, which could also be of use.
As a compiler target, the ZPAQL instruction set lacks a variable jump instruction like a hypothetical jmpa, which would take the value of 𝐴 as destination, or getpc, which would set 𝐴 to the current program counter. It would also be useful to set 32-bit values directly in one instruction like a=L 3210987654. Memory management on top of 𝐻/𝑀 is possible, but a new data structure which allows using the OS heap would be more efficient. These proposals could be considered for the next level of the ZPAQ specification.

10 Conclusion
It was shown that it is possible to have a compiler for ZPAQL, which is also helpful for the development of new algorithms. To write the compiler, large parts of the Python grammar were ported to the format of the LALR(1) parser library lalrpop, and the core functionality of the Python module tokenize was ported to Rust. As a goal for a source input that should be supported, a Rust implementation of Brotli decompression was ported to Python. A compilation scheme from a Python-subset to ZPAQL, including an IR, was planned and implemented. The compiler is platform independent as long as a Rust compiler is available. It features an implementation of the ZPAQ VM to print the calculated context data for each input byte, which can then be compared with the calculated values of the Python source to expose ZPAQL-specific semantics or even compiler bugs.
As a result it can be said that more and different approaches should be tried to reach an acceptable performance for a large and complex code base. The ZPAQ specification at its current point does not offer various features which might be considered for future improvements, from additional instructions up to ways of being even more flexible about how prediction takes place. Most important, however, would be a variable jump instruction.
Compilation can be configured with command-line arguments to some extent. The pcomp or hcomp part can be disabled, so no changes to the input file are needed to switch between using only context mixing, only a preprocessor, or both. Documentation can be printed via arguments as well. Four example source files are provided: a small context-mixing compression for run-length-encoded data (see the tutorial in the appendix), the LZ1 port and the PNM model, which already showed good results with little effort, and as an extreme case the Brotli algorithm, which needed many compiler optimizations to fit under the 64 KB bytecode limit and which utilizes dynamic memory allocation.
While the compiler was developed, two bugs in ZPAQ tools were found and have already been resolved through two new releases. One was a simple crash in zpaqd; the more serious one was a wrong instruction in the x86 JIT code of libzpaq which caused a miscomputation in the Brotli decompressor.
For most results there are measurements which can be repeated because all needed tools are published as free/libre software.
While PAQ and its internals have been covered and modified in publications, ZPAQ was often only used as just another compressor instead of as a platform for compression algorithms. This work exposed the crucial part of ZPAQ, which is the embedding of the two bytecodes for hcomp and pcomp in the archive.

Bibliography
[1] J. Cleary and I. Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32(4):396–402, Apr 1984. ISSN 0090-6778. doi: 10.1109/TCOM.1984.1096090. http://dx.doi.org/10.1109/TCOM.1984.1096090.
[2] G. V. Cormack and R. N. S. Horspool. Data compression using dynamic markov modelling. Comput. J., 30(6):541–550, December 1987. ISSN 0010-4620. doi: 10.1093/comjnl/30.6.541. http://dx.doi.org/10.1093/comjnl/30.6.541.

[3] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens. The context-tree weighting method: basic properties. IEEE Transactions on Information Theory, 41(3):653–664, May 1995. ISSN 0018-9448. doi:
10.1109/18.382012. http://dx.doi.org/10.1109/18.382012.
[4] Matthew V. Mahoney. Fast text compression with neural networks. In James N. Etheredge
and Bill Z. Manaris, editors, Proceedings of the Thirteenth International Florida Artificial Intelligence Research Society Conference, May 22-24, 2000, Orlando, Florida, USA, pages 230–234.
AAAI Press, 2000. ISBN 1-57735-113-4. http://www.aaai.org/Library/FLAIRS/2000/
flairs00-044.php.

[5] Matthew V. Mahoney. The PAQ1 data compression program. 2002. https://cs.fit.edu/
~mmahoney/compression/paq1.pdf.

[6] Matthew V. Mahoney. Adaptive weighing of context models for lossless data compression, Florida
Tech. Technical Report CS-2005-16. 2005. https://cs.fit.edu/~mmahoney/compression/
cs200516.pdf.

[7] Matthew V. Mahoney. Data compression explained, 2010. http://mattmahoney.net/dc/dce.
html.

[8] Matthew V. Mahoney. The ZPAQ open standard format for highly compressed data - Level 2. 2016.
http://mattmahoney.net/dc/zpaq206.pdf.

[9] Matthew V. Mahoney. The ZPAQ compression algorithm. 2015. http://mattmahoney.net/dc/
zpaq_compression.pdf.


[10] James K. Bonfield and Matthew V. Mahoney. Compression of FASTQ and SAM format sequencing
data. PLoS ONE, 8(3):1–10, 03 2013. doi: 10.1371/journal.pone.0059190. http://dx.doi.org/
10.1371%2Fjournal.pone.0059190.

[11] Jyrki Alakuijala and Zoltan Szabadka. Brotli Compressed Data Format. RFC 7932, July 2016. https://rfc-editor.org/rfc/rfc7932.txt.

[12] John Scoville. Fast autocorrelated context models for data compression. CoRR, abs/1305.5486,
2013. http://arxiv.org/abs/1305.5486.
[13] Byron Knoll and Nando de Freitas. A machine learning perspective on predictive coding with PAQ.
CoRR, abs/1108.3298, 2011. http://arxiv.org/abs/1108.3298.
[14] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques,
and Tools (2nd Edition). Addison Wesley, 2006. ISBN 978-0321486813.
[15] Aftab Khan and Ashfaq Khan. Lossless colour image compression using RCT for bi-level BWCA. Signal, Image and Video Processing, 10(3):601–607, 2016. ISSN 1863-1711. doi: 10.1007/s11760-015-0783-3. http://dx.doi.org/10.1007/s11760-015-0783-3.

A Tutorial
In this tutorial a context-mixing model with a preprocessor for run-length encoding is written. Three components are used to form the network. Create a new template, which will then be modified at the beginning and in the pcomp/hcomp sections:
$ ./zpaqlpy --emit-template > rle_model.py
$ chmod +x rle_model.py

First the size of the arrays 𝐻 and 𝑀 for each section, hcomp and pcomp, needs to be specified:
hh = 2  # i.e. size is 2**2=4, because hH[0], …, hH[2] are the inputs for the components

Listing A.1: rle_model.py

One component should give predictions based on the byte value and the second based on the run length; both give predictions for the next count and the next value. Then the context-mixing components are combined into a network:
n = len({
    0: "cm 19 22",  # context table size 2**19 with partly decoded byte as 9 bit hash
                    # xored with the context, count limit 22
    1: "cm 19 22",
    2: "mix2 1 0 1 30 0",  # will mix 0 and 1 together, context table size 2**1 with and-0
                           # masking of the partly decoded byte which is added to the context,
                           # learning rate 30
})

Listing A.2: rle_model.py

Each component 𝑖 gets its context input from the entry in 𝐻[𝑖] after each run of the hcomp function, which is called for each input byte of the preprocessed data. That data is either stored through arithmetic coding in the compression phase or retrieved through decoding in the decompression phase, with the subsequent postprocessing done by calls of the pcomp function.
The context-mixing network is written to the archive in byte representation, as well as the bytecode for hcomp and pcomp (if they are used). The preprocessor command is needed when the compiled file is used with zpaqd if a pcomp section is present. As the preprocessor might be any external program, or might be included in the compressing archiver, and is of no use for decompression, it is not mentioned in the archive anymore. This way we specify a preprocessor:
pcomp_invocation = "./simple_rle"

Listing A.3: rle_model.py

$ chmod +x simple_rle  # create the preprocessor as executable file and fill it as follows

#!/usr/bin/env python3
import sys
input = sys.argv[1]
output = sys.argv[2]
with open(input, mode='rb') as fi:
    with open(output, mode='wb') as fo:
        last = None
        count = 0
        data = []
        for a in fi.read():
            if a != last or count == 255:  # count only up to 255 to use one byte
                if last != None:
                    # write out the pair
                    data.append(last)
                    data.append(count)
                last = a  # start counting
                count = 1
            else:
                count += 1  # continue counting
        if last != None:
            data.append(last)
            data.append(count)
        fo.write(bytes(data))

Listing A.4: simple_rle

Then we need code in the pcomp section to undo this transform:
case_loading = False
last = NONE

def pcomp(c):
    global case_loading, last
    if c == NONE:  # start of new segment, so restart our code
        case_loading = False
        last = NONE
        return
    if not case_loading:  # c is byte to load
        case_loading = True
        last = c
    else:  # write out content of last c times
        case_loading = False
        while c > 0:
            c -= 1
            out(last)

Listing A.5: rle_model.py

So now it should produce the same file as the input file:
$ ./simple_rle INPUTFILE input.rle
$ ./rle_model.py pcomp input.rle input.norle
$ cmp INPUTFILE input.norle

We can already try it, even if hcomp does not compute the context data yet (so compression is not good):
$ ./zpaqlpy rle_model.py
$ ./zpaqd c rle_model.cfg archive.zpaq FILE FILE FILE

Now we can add hcomp code to improve compression by adaptive prediction:
at_counter = False  # if false, then c is byte, otherwise c is a counter
last_value = 0
last_counter = 0

def hcomp(c):  # pcomp bytecode is passed first (or 0 if there is none)
    global at_counter, last_value, last_counter
    if at_counter:
        last_counter = c
    else:
        last_value = c
    # first part of the context for the first CM is the byte replicated and
    # the second is whether we are at a counter (then we predict for a byte) or vice versa
    hH[0] = (last_value << 1) + at_counter  # at_counter will occupy 1 bit, therefore shift
    hH[0] <<= 9  # again shift to side because of the xor with the partially decoded byte
    # second CM same but uses the counter for prediction
    hH[1] = (last_counter << 1) + at_counter
    hH[1] <<= 9
    hH[2] = at_counter + 0  # context for mixer: is at counter (1) or not (0)
    at_counter = not at_counter

Listing A.6: rle_model.py

We need to compile again before we run the final ZPAQ configuration file:
$ ./zpaqlpy rle_model.py
$ ./zpaqd c rle_model.cfg archive.zpaq FILE FILE FILE

zpaqd needs to have simple_rle in the same folder because we specified pcomp_invocation = "./simple_rle".

B Visualizations
B.1 Arithmetic Coding

Figure B.1: Encoding and decoding steps without End-of-Message symbol

B.2 Context Mixing

Figure B.2: A possible text model

B.3 ZPAQ Format

Figure B.3: Structure of a block header

Figure B.4: Structure of segments

B.4 ZPAQL Virtual Machine

Figure B.5: Architecture with its building blocks

B.5 Compiler Pipeline

Figure B.6: The parts from source reading to writing a config file