Freie Universität Berlin
Department of Mathematics and Computer Science
Institute of Computer Science
Bachelor Thesis
Design of a Python-subset Compiler in Rust targeting ZPAQL
Kai Lüke
kailueke@riseup.net
Supervisors: Prof. Dr. Günter Rote
Dipl.-Inform. Till Zoppke
Berlin, August 23, 2016
Abstract
The compressed data container format ZPAQ embeds decompression algorithms as ZPAQL bytecode in the
archive. This work contributes a Python-subset compiler written in Rust that targets the assembly language
ZPAQL, and discusses design decisions and improvements. Along the way it explains ZPAQ and some theoretical
and practical properties of context mixing compression by the example of compressing digits of π. As use cases
for the compiler it shows a lossless compression algorithm for PNM image data, an LZ77 variant ported from
ZPAQL to Python to measure compiler overhead and, as the most complex case, an implementation of the Brotli
algorithm. It aims to make the development of algorithms for ZPAQ more accessible and to stimulate the discussion
of whether the current specification limits the suitability of ZPAQ as a universal standard for compressed
archives.
Contents
1 Introduction 1
1.1 Preceding History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Research Question . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 The ZPAQ Standard Format for Compressed Data 5
2.1 ZPAQ Specication and Working Principle . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Context Mixing: Components and Context Data . . . . . . . . . . . . . . . . . . . . . . 8
2.3 ZPAQL: Virtual Machine and Instruction Set . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Examples: A simple Context Model and LZ1 with a Context Model . . . . . . . . . . . . . 12
3 Notes on Data Compression 16
4 Compilers, Rust and Python 21
4.1 Classical Compiler Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.2 The Rust Programming Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.3 The Python Programming Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5 Considerations and Challenges with ZPAQL as Target Language 23
6 Selection of the Python-subset as Source Language 25
6.1 Decision on the Feature Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
6.2 Grammar and Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
7 Design of the Compiler in Rust 29
7.1 Exposed API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
7.2 Parser and Tokenizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
7.3 Grammar of the Intermediate Representation . . . . . . . . . . . . . . . . . . . . . . . . 31
7.4 IR Generation and the Stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
7.5 Optimizations in the Intermediate Representation . . . . . . . . . . . . . . . . . . . . . . 33
7.6 ZPAQL Generation and Register Assignment . . . . . . . . . . . . . . . . . . . . . . . . . 33
7.7 Debugging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
8 Exemplary Configurations and Programs 35
8.1 Compression of PNM Image Data using a Context Model . . . . . . . . . . . . . . . . . . 35
8.2 Bringing the Brotli Algorithm to ZPAQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
9 Evaluation 38
9.1 Compiler Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
9.2 Performance of the Compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
9.3 Analysis of the generated Code and Comparison with handwritten Code of LZ1 . . . . . . . 39
9.4 Suitability of ZPAQ as universal Standard . . . . . . . . . . . . . . . . . . . . . . . . . . 40
10 Conclusion 41
Bibliography 42
A Tutorial 44
B Visualizations 47
B.1 Arithmetic Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
B.2 Context Mixing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
B.3 ZPAQ Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
B.4 ZPAQL Virtual Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
B.5 Compiler Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
1 Introduction
It takes time until new and incompatible data compression algorithms become widely available in software.
Moreover, different kinds of input data should often be handled with different compression techniques that make
use of the best knowledge about the data.
The ZPAQ standard format for compressed data is a container format which also holds the needed decompression
algorithms. They can be specified through a context mixing model of several predictors with a bytecode
which computes the context data for them (used for arithmetic coding), a bytecode for postprocessing (used
for transformations or stand-alone algorithms), or any combination of both.
Arithmetic coding spreads symbols over a number range partitioned according to the probability distribution,
so that a symbol which is likely to be encoded gets a bigger part. The whole message is encoded at once from
the beginning. Every time a symbol has been chosen, the possibly modified probability distribution is applied
again to partition that symbol's part of the number range, and for the next symbol this becomes the number
range partition to choose from. When the last symbol has been processed, the number range is very narrow
compared to the beginning and every number in it now represents the whole message. So the number with the
shortest binary representation can be selected, and a decoder can repeat the same steps by choosing the symbol
which has this number in its range and then partitioning this range again according to the probability
distribution. In practice, one has either to use a special end-of-message symbol or to specify the message length
beforehand to define an end to this process.
The history which led to the development of ZPAQ is briefly explained in this chapter, followed by the
motivation for writing a compiler for ZPAQL and the research question of this work. In the following chapter
the whole picture of ZPAQ compression is visualized and the building blocks, i.e. context mixing and the
bytecode, are explained in detail. The general theory of data compression and its limits is briefly noted
afterwards. The main part is preceded by a short introduction to compiler construction, the implementation
language Rust and the source language Python. One chapter outlines the conditions and difficulties of
ZPAQL as a compiler target. The following chapters cover the chosen Python-subset (6), the developed API
and compiler internals (7), example programs (8) and finally the evaluation (9).
References to compressors, project websites, more detailed documentation and short remarks are
located in footnotes. All websites were last accessed on August 19, 2016. Academic publications are
listed in the bibliography.
1.1 Preceding History
A big milestone in data compression was the Lempel-Ziv algorithm, a form of dictionary compression, as it
consists of backward references to data slices which have already been seen. The field of statistical data
compression was first stationary, and Huffman's optimal prefix codes can be mentioned for entropy coding
before arithmetic coding was known. Basically, few bits should be used for frequent symbols but more for
infrequent symbols. Entropy coding developed further with adaptive coding from the mid-80s. One major
innovation called PPM (Prediction by Partial Matching) usually works on order-N byte contexts and
predictions [1], whereas DMC (Dynamic Markov Compression) predicts bits [2]. Adaptive Huffman coding was
developed as well. A decade later, CTW (Context Tree Weighting) also utilizes bit prediction but mixes the
learned distributions of the order-N byte contexts [3].
The predecessors to the series of PAQ compressors, called P5, P6, P12, are based on a 2-layer neural network
with online training that predicts one bit for context inputs up to order-5. At that time they were on par
with variants of PPM concerning speed and compression ratio [4].
In PAQ1 the neural network is not used anymore, as the probability and confidence of various models are
combined through weighted averaging. There is a non-stationary n-gram model up to order-8, a match model
for occurrences longer than 8 bytes in the last 4 MB, a cyclic model for patterns in rows and a whole-word
model up to order-2. As statistical data the models mostly hold the bit counts of the contexts as one byte
in a hash table. At its release it could produce the best results for a concatenation of the Calgary corpus (a data
compression benchmark) [5].
The models and their semi-stationary update were improved and PAQ2 introduced SSE (Secondary Symbol
Estimation) which improves a given probability by mapping it to a bit history. PAQ4 changed to adaptive
linear weighting and PAQ5 added another mixer. Many other versions have model improvements and added
special models or transforms. PAQAR, a fork of PAQ6, reworked the mixing architecture and could turn
models off. PAsQDa uses a transformation to map words to dictionary codes. PAQ6 variants were one after
the other leading in the Calgary challenge [6].
PAQ7 is a rewrite using neural network logistic mixing. Specifically, it includes a JPEG model that predicts
the Huffman codes based on the DCT coefficients. PAQ8 variants come with more file models, dictionary
or x86 call address preprocessors or a DMC model. The PAQ8HP series won the Hutter Prize for Lossless
Compression of Human Knowledge¹ about compressing a part of an English Wikipedia dump. Many
achievements were mainly made through special models for various file formats. The simpler LPAQ provides
faster but weaker compression, and later versions aim at text compression. PAQ9A has LZ (Lempel-Ziv)
precompression and cascaded ISSE (Indirect Secondary Symbol Estimation). The detailed history including
source code is tracked on the PAQ history website² and in M. Mahoney's book on data compression [7].
1 http://prize.hutter1.net/
2 The PAQ Data Compression Programs http://mattmahoney.net/dc/paq.html
Parts of PAQ variants made it into the recent context mixing compressor cmix, which leads the Large Text
Compression Benchmark³ and the Silesia Open Source Compression Benchmark⁴, but with around 30 GB of
RAM usage⁵.
»ZPAQ is intended to replace PAQ and its variants (PAQ8, PAQ9A, LPAQ, LPQ1, etc) with similar or better
compression in a portable, standard format. Current versions of PAQ break archive compatibility with
each compression improvement. ZPAQ is intended to fix that.«⁶
The development of the ZPAQ archiver started in early 2009 and defined the first version of The ZPAQ
Open Standard Format for Highly Compressed Data [8] after some months. The main idea is to move
the layout of the context mixing tree as well as the algorithm for context computation and the one for
postprocessing into the archive before each block of compressed data. There are nine components⁷ to
choose from for the tree, mostly context models coming from PAQ. In addition, the algorithms are provided
as bytecode for a minimal virtual machine. Hence the algorithm implementation is mostly independent of the
decompressor implementation and compatibility is preserved when improvements are made. Also, depending
on the input data an appropriate compression method can be chosen. The main program using libzpaq is
an incremental journaling backup utility which supports deduplication and encryption and provides various
compression levels using LZ77, BWT (Burrows-Wheeler Transform) and context models [9]. But there are
also the reference decoders unzpaq and tiny_unzpaq, a simple pipeline application zpipe and a development
tool zpaqd⁸.
The fastqz compressor [10] for Sanger FASTQ format DNA strings and quality scores also uses the ZPAQ
format and was submitted to the Pistoia Alliance Sequence Squeeze Competition⁹.
1.2 Motivation
The development of ZPAQ continued, especially for the use case of the incremental archiver program. But
the appearance of new algorithms for ZPAQ remained at a rather low level, as did the number of authors¹⁰,
despite the fact that it offers a good environment for research on context mixing methods and for crafting
special solutions for use cases with a known type of data, because one can build upon existing parts. The
leading reason could be that the assembly language ZPAQL with its few registers is not very accessible, and
programs are hard to grasp or easily become unmanageable when complex tasks like parsing a header should be
accomplished. Therefore, the first question is whether another language can support ZPAQ's popularity.
3 http://mattmahoney.net/dc/text.html
4 http://mattmahoney.net/dc/silesia.html
5 http://www.byronknoll.com/cmix.html
6 http://mattmahoney.net/dc/zpaq.html
7 See chapter 2.2 or http://mattmahoney.net/dc/dce.html#Section_437
8 http://mattmahoney.net/dc/zpaqutil.html
9 http://www.pistoiaalliance.org/projects/sequence-squeeze/
10 The majority of available configurations is listed on http://mattmahoney.net/dc/zpaqutil.html
If that is the case, then a compiler for a well-known programming language could help to overcome the
obstacle of learning ZPAQL for implementing new algorithms.
The second question is whether the design decisions of the ZPAQ specification allow for arbitrary compression
algorithms to be used with ZPAQ, even if they bring their own encoder or a dictionary, or whether ZPAQ
is rather meant to be a platform for algorithms that use the predefined prediction components like (I)CM
and (I)SSE¹¹ together with the built-in arithmetic coder.
1.3 Research Question
As an approach towards the questions mentioned above, this work contributes a compiler that
emits ZPAQL for a subset of the popular Python programming language. The subset should be oriented towards
the data types of the ZPAQL virtual machine. The architecture and operation of the Python code should
still maintain similarity to the ZPAQL execution workflow to avoid abstractions. That means only integers,
no objects, floating-point numbers or strings, but the ability to address the two memory sections of the VM
as arrays. Running the Python code should still be possible as usual, which helps to keep debugging out of the
ZPAQL VM.
All this should ease the development of new algorithms and uncover limitations of the current ZPAQ
standard. An example for a new algorithm should be provided by developing a model for uncompressed PNM
image data which should be compared to other lossless image compressors.
As an example for a complex all-purpose algorithm, the recent Brotli compression algorithm [11] will be
brought to ZPAQ. It was proposed for HTTP/2 compression, includes a dictionary and combines LZ77,
Huffman coding and order-2 context modeling. This is a challenge because it has to be implemented solely in
the postprocessing step and also needs memory management.
The quality of the compiler will be evaluated by comparing against an existing hand-written implementation of
an LZ compressor. Design decisions of the compiler and difficulties faced will be discussed.
11 (Indirect) context model and (indirect) secondary symbol estimation, four of the nine ZPAQ components.
2 The ZPAQ Standard Format for Compressed Data
This chapter gives an introduction to the ZPAQ archive specification in the current version 2.06. A small use
case is explained as well as the needed commands. Then the context mixing components and the language
ZPAQL are presented. Finally, two real-world examples are explained.
2.1 ZPAQ Specification and Working Principle
The Level 1 specication of this container format required the data always to be encoded with the arithmetic
coder and at least one predictor. A backwards incompatible change leading to Level 2 was to allow raw data
to be stored and thus no prediction component to be used, as needed for incompressible data or standalone
postprocessor algorithms e.g. a fast LZ variant [8]. It also denes a standard encryption for the whole archive
le and introduces an append-only journaling mode for incremental backups and deduplication on top of the
streaming format which is still the basis of the archive. These features are out of scope for this work and
therefore the focus is on the compression architecture and the virtual machine.
It is only specied what a valid archive is and how decompression takes place (nevertheless, probability
and thus context computation have to be symmetric for compression and decompression with an arithmetic
coder). In case of ambiguity in the specication there is the reference implementation unzpaq
1
, a tiny
version
2
of it and the more mature library libzpaq that is used by the zpaq archiver
3
, the zpaqd development
tool
4
or as plug-in for le archivers and incorporates a x86/64 JIT compiler for ZPAQL.
»The ZPAQ open standard specifies a compressed representation for one or more byte (8 bit value) sequences.
A ZPAQ stream consists of a sequence of blocks that can be decompressed independently. A
block consists of a sequence of segments that must be decompressed sequentially from the beginning of
the block. Each segment might represent an array of bytes in memory, a file, or a contiguous portion of a
file.« [8]
A block header has no information on the length of the block because, as for segments, an end marker is
used. Memory requirements for the two bytecodes hcomp and pcomp are defined in the header. It is noted
1 http://mattmahoney.net/dc/unzpaq206.cpp
2 http://mattmahoney.net/dc/tiny_unzpaq.cpp
3 http://mattmahoney.net/dc/zpaq715.zip
4 http://mattmahoney.net/dc/zpaqd715.zip
whether arithmetic coding is used and what predicting components make up the context mixing tree. Each
component has arguments also determining memory use. If needed the bytecode hcomp is embedded to
compute context data for the components of the context mixing tree for each byte. All components give
bit predictions for the partially decoded byte (these are passed upwards the tree) and are trained afterwards
with the correct bit which was decoded based on the root (i.e. the last) node probability for each bit.
The optionally arithmetic coded data, which comes from all segment content (not the segment filename or
comment) in the block, can start with an embedded pcomp bytecode or declare that no pcomp bytecode
is present. Therefore, the hcomp section can already be used for context computation to compress the
pcomp bytecode (0 for empty or 1 followed by the length and the bytecode). The pcomp code is used for
postprocessing, be it a simple transform or the decompression of LZ codes. It gets each decoded byte
as input and outputs a number of bytes not necessarily equal to the input.
That means the four possible combinations for a block are: no compression, only context mixing with arithmetic
coding, only postprocessing of the stored data, or context mixing with subsequent postprocessing (from
the decompressor's perspective). The chosen combination applies to all (file) segments in the block.
The following charts illustrate the named parts and their relation to each other for a sample compression use
case. The transform for x86 machine code enhances compressibility by converting relative to static addresses
after each CALL and JMP instruction (0xE8 and 0xE9). It is applied to the two input files, an x86 executable
binary and a shared library. Therefore a ZPAQL pcomp program needs to be supplied in the archive block
to revert that transform. Encoding takes place based on the probability distribution of 1 and 0 for each bit
of the current byte, as provided as prediction by the root node of the simple context mixing tree.
The hcomp program is loaded into the ZPAQL VM and computes contexts for the two components. The
ISSE maps the context to a bit history which is used as context for a learning mixer that should improve the
probability provided by the first component, a CM (Context Model) which should learn good predictions for
the given context. The whole model and the hcomp bytecode are also embedded into the archive block.
The two files are stored as two segments in the block (like a "solid" archive). Because the preprocessor might
be any external program or also included in the compressing archiver and is of no use for decompression,
it is not mentioned in the archive at all.
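To make the mentioned transform more concrete, the following Python sketch shows the general idea of an
E8/E9 preprocessing step; the exact bounds checks, block limits and decoding direction used by ZPAQ's
built-in transform are not reproduced here and are assumptions:

def e8e9_encode(data: bytes) -> bytearray:
    out = bytearray(data)
    i = 0
    while i + 4 < len(out):
        if out[i] in (0xE8, 0xE9):                       # CALL or JMP with 32-bit relative target
            rel = int.from_bytes(out[i+1:i+5], "little")
            absolute = (rel + i + 5) & 0xFFFFFFFF        # turn the relative into an absolute address
            out[i+1:i+5] = absolute.to_bytes(4, "little")
            i += 5
        else:
            i += 1
    return out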
2.1 ZPAQ Specication and Working Principle 7
Figure 2.1: Possible compression scheme
Decompression takes place in the reverse manner: hcomp is loaded into the ZPAQL VM to compute the
context data for the components of the model. They supply the predictions to the arithmetic coder and are
corrected afterwards. For the reverse transform of each segment, pcomp is read from the decoded stream and
loaded into another VM. Then the two segments follow in the decoding step and go through the postprocessing
transform before they are written out as files.
Figure 2.2: Accompanying decompression
The tool zpaqd only supports the streaming format and can be used to construct this example setup by writing
a configuration file for it and then adding the two files as segments into one block. But besides the algorithms that are
already defined for compression in libzpaq for the levels 1 to 5 (LZ77, BWT and context mixing), it also offers
the ability to specify a customized model (E8E9, LZ77 transformations or also word models are supported)
given as argument, so that the above configuration can also be brought to life with something like zpaq
a [archive] [files] -method s8.4c22.0.255.255i3 (the notation is documented in libzpaq.h and the
zpaq man page⁵). Here the first 8 accounts for 2^8 MB blocks, so that both segments should fit into
one block (yet the zpaq application uses the API in a way that creates an additional block), then an order-2
CM and an order-3 ISSE are chained. The resulting configuration including the two ZPAQL programs stored
in the archive can be listed with zpaqd l [archive].
For a more general view, cf. the compression workflow of the zpaq end-user archiver as described in the
article mentioned above [9]. It selects one of its predefined algorithms based on their performance on the data and
uses deduplication through the journaling format.
2.2 Context Mixing: Components and Context Data
The way the mixing takes place has evolved from the neural network approach in P5 over weighted averaging
from PAQ1 and adaptive linear weighting to logistic mixing in PAQ7 (like in a neural network, and instead of
back propagation the weights are updated). In ZPAQ the probabilities are also handled in the logistic domain,
stretch(p) = ln(p/(1 − p)), as output of the predicting components and can be averaged or refined by other
components before the output x of the last component is transformed back by squash(x) = 1/(1 + e^−x) to
be used in the arithmetic coder to code one bit [7].
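A sketch of such logistic mixing in Python (the weight update shown is a plain gradient step and the names are
made up; ZPAQ uses fixed-point arithmetic and its own update rule):

import math

def stretch(p):
    return math.log(p / (1.0 - p))

def squash(x):
    return 1.0 / (1.0 + math.exp(-x))

def mix(predictions, weights):
    # weighted sum in the logistic domain, transformed back to a probability
    return squash(sum(w * stretch(p) for w, p in zip(weights, predictions)))

def update(predictions, weights, bit, rate=0.01):
    error = bit - mix(predictions, weights)   # train towards the actually coded bit
    return [w + rate * error * stretch(p) for w, p in zip(weights, predictions)]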
ZPAQ denes nine components. Up to 255 of them can be used in a model but only linear trees can be
constructed as mixing network i.e. when they are listed, they can only refer to predictions of preceding
components. The last one listed is the top node [8]. They get their context data input through computation
in hcomp. To supply the component with a higher order context, a rotating history buffer needs to be
maintained by hcomp. Now the components are listed without their arguments and internals which can be
found in the specication [8].
CONST is the simplest and gives a fixed prediction. It does not receive context data.
CM is a direct context model that holds counts of zeros/ones and a prediction for each of the eight bits
expected in the specified context. The counts are used to update the prediction to reduce the error. The
given context data is often a hash (and collisions can hurt compression). Also the partially decoded byte is
expanded to nine bits and XORed with the context after coding each bit, so that the next bit's prediction is
accessed in the table.
ICM is an indirect context model consisting of a hash table to map the given context and the partially
decoded byte to a bit history which is then mapped to a prediction like in a direct context model. Histories
are counts of 1 and 0 and represented by states (explained in [7], chapter 4.1.3 Indirect Models) and
adjusted after the prediction to a better fitting state. The prediction is also updated.
5 http://mattmahoney.net/dc/zpaqdoc.html
MATCH is a model that takes the following byte of a found string match as prediction until a mismatch
happens. Therefore it keeps its own history buffer and the prediction is also varied depending on the length
of the match. The match is found based on the (higher order) context hash input. This works because the
search does not take place in the history buffer, but a table maps each context hash input to the last element
in the history buffer.
AVG outputs a non-adaptive weighted average of the predictions of two other components. It does not
receive context data.
MIX maps a context and the masked partially decoded byte to weights for producing an averaged prediction
of some other components. Afterwards the weights are updated to reduce the prediction error.
MIX2 is a simpler MIX and can only take two predictions as input.
SSE stands for secondary symbol estimation (Dmitry Shkarin, also known as Adaptive Probability Map in
PAQ7 or LPAQ) because it receives a context as input and takes the prediction of another component which
is then quantized. For the given context this quantized prediction and the next closest quantization value
are mapped to predictions which result in an interpolation of both. Initially this map is an identity, but
the update step corrects the prediction of the closer-to-original quantized value the same way as in the CM
update phase.
»A typical place for SSE is to adjust the output of a mixer using a low order (0 or 1) context. SSE components
may be chained in series with contexts typically in increasing order. Or they may be in parallel with
independent contexts, and the results mixed or averaged together.« [7]
ISSE is an indirect secondary symbol estimator that, like an SSE, refines the prediction of another component
based on a context, which is mapped to a bit history as in an ICM. That bit history is set as context for an
adaptive MIX to select the weights to combine the original prediction with a fixed prediction.
»Generally, the best compression is obtained when each ISSE context contains the lower order context of
its input.« [7]
2.3 ZPAQL: Virtual Machine and Instruction Set
At the beginning of each block the two bytecodes hcomp and pcomp, if needed, are each loaded into their own
virtual machine. It consists of the program counter PC, the 32-bit registers A, B, C and D, a 1-bit condition
flag F, 256 32-bit registers R0…R255 and the arrays M (8-bit elements) and H (32-bit elements). Initially, all
registers and arrays hold 0. The sizes of H and M are defined in the block header along with the information
which context mixing components are used.
For each encoded or decoded byte hcomp is run to set the contexts for the components. The context
data for component i has to be stored in H[i]. As first input hcomp sees whether a postprocessor
is present and, if yes, its length and bytecode. Afterwards the data of the segments follows, without
any separation. Except for the PC, which is set to 0, and the register A, which is used for the input, all state
is preserved between the calls.
The postprocessor pcomp is run for each decoded byte of the block to revert the preprocessing and puts out
the data via the OUT instruction. After each segment it is invoked with 2^32 − 1 as input to mark the end.
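Schematically, the interplay of the two bytecodes during decompression looks as follows (Python pseudocode;
the function names and the segments-as-byte-sequences interface are made up, and the arithmetic decoding as
well as the embedded pcomp bytecode prefix are omitted):

def decompress_block(hcomp, pcomp, segments):
    for segment in segments:
        for byte in segment:          # each byte as it comes out of the arithmetic decoder
            hcomp(byte)               # sets H[0..n-1] as contexts for the components
            if pcomp is not None:
                pcomp(byte)           # postprocessing emits the real output via OUT
        if pcomp is not None:
            pcomp(2**32 - 1)          # end-of-segment marker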
There is an assembly language for the bytecode which is also used in the table describing the corresponding
opcodes in the specification [8]. In this assembly language whitespace can only occur between opcode bytes
in order to visualize that a=b is a 1-byte opcode while a= 123 is a 2-byte opcode. Comments are written in
brackets.
Operations on the 32-bit registers and elements of H are mod 2^32 and the values are interpreted as positive
numbers in comparisons. Indexed access into M and H is mod the size of M or H and denoted as *B for M[B],
*C for M[C] and *D for H[D]. Because M holds bytes, operations on *B and *C are mod 256, and swapping via
*B<>A or *C<>A alters only the lower byte of A [8].
Instructions Semantics and Constraints
error cause execution to fail
X++ increment X by 1 (X is one of A, B, C, D, *B, *C, *D)
X-- decrement X by 1
X! ip all bits
X=0 set X to 0
X<>A swap (X is not A)
X=X set X to X
X= N set X to N (0 ≤ N ≤ 255)
A+=X add X on A
A-=X subtract X from A
A*=X multiply A by X
A/=X divide A by X (set A = 0 if X = 0)
A%=X A = A mod X (set A = 0 if X = 0)
A&=X binary AND with X
A&~=X binary AND with flipped bits of X
A|=X binary OR with X
A^=X binary XOR with X
A<<=X bitwise left shift of A by X mod 32 bits
A>>=X bitwise right shift of A by X mod 32 bits
A+= N, A-= N,
A*= N, A/= N,
A%= N, A&= N,
A&~= N, A|= N,
A^= N, A<<= N,
A>>= N
same as the previous instructions but with N (0 ≤ N ≤ 255)
A==X F = 1 if A == X otherwise F = 0
A<X F = 1 if A < X otherwise F = 0
A>X F = 1 if A > X otherwise F = 0
A== N, A< N,
A> N
same as above but with N (0 ≤ N ≤ 255)
A=R N, B=R N,
C=R N, D=R N
set A, B, C or D to R[N]
R=A N set R[N] to A
HALT ends current execution of the program
OUT output A mod 256 in pcomp, ignored in hcomp
HASH A = (A + *B + 512) * 773
HASHD *D = (*D + A + 512) * 773
JMP I add I (−128 ≤ I ≤ 127) to PC relative to the following instruction, so JMP 0 has
no effect and JMP -2 is an endless loop (in the specification a positive operand byte N is
used to encode I)
JT N jump if F = 1
JF N jump if F = 0
LJ L set PC to L with 0 ≤ L < 2^16 (in the specification written as LJ N M because
it is a 3-byte instruction with L = N + 256 · M)
Besides the opcodes, libzpaq also supports helper macros like if(not)? … (else)? … endif, the
long jump versions if(not)?l … (elsel)? … endifl and do … (while|until|forever), which are
converted to conditional jump opcodes. An if-block is executed if F = 1, a while-jump is taken if
F = 1 and an until-jump is taken if F = 0. That means the test for the condition has to be written before
the macro: a> 255 if … endif. The statements can also be interweaved, e.g. to write a do-while-loop or a
continue-jump as do … if … forever endif.
In a ZPAQ cong le the two sections for hcomp and pcomp are written behind the context mixing model
conguration. The pcomp section is optional. Comments can appear everywhere within brackets.
Syntax (where 0 ≤ N ≤ 255 and HH, HM, PH and PM define the sizes of H and M as 2^HH, 2^HM, 2^PH, 2^PM for each section):
comp HH HM PH PM N
(I COMPONENT (ARG)+ )*
hcomp
(ZPAQL_INSTR)*
halt
(pcomp (PREPROCESSOR_COMMAND)? ;
(ZPAQL_INSTR)*
halt
)?
end
2.4 Examples: A simple Context Model and LZ1 with a Context Model
The following example configuration is based on fast.cfg from the utility site⁶ and can be used for text
compression. It adaptively combines (independently of contexts, just based on the success of the last
prediction) the prediction of a direct order-1 context model with the prediction of an order-4 ISSE which refines
the prediction of an order-2 ICM. The arguments for the components are documented in the specification [8].
comp 2 2 0 0 4 (hh hm ph pm n)
(where H gets the size of 2^hh in hcomp or 2^ph in pcomp,
M 2^hm or 2^pm and n is the number of
context-mixing components)
0 cm 19 4 (will get an order 1 context)
1 icm 16 (order 2, chained to isse)
2 isse 19 1 (order 4, has reference to ICM component 1)
3 mix2 0 0 2 24 0 (moderate adapting mixer between CM and ISSE
based on which predicts better, no contexts even for bits)
(ICM and ISSE part adapted from fast.cfg)
hcomp
r=a 2 (R2 = A, input byte in R2)
d=0
a<<= 9 *d=a (H[D] = A) (set context to actual byte)
(leaving first 9 bits free for the partially decoded byte)
a=r 2 (A = R2)
*b=a (M[B] = A) (save input byte in rotating buffer)
(full M is used with pointer b)
a=0 hash (shortcut for A = (A + M[B] + 512) * 773)
b-- hash
d= 1 *d=a (order 2 hash for H[1])
b-- hash b-- hash
d= 2 *d=a (order 4 hash for H[2])
(H[3] stays 0 as fixed context for MIX2)
halt (execution stops here for this input byte)
end
Listing 2.1: Example mfast.cfg without a pcomp section
6 http://mattmahoney.net/dc/zpaqutil.html, config files with a ”post” instead of ”pcomp” are in the old format of the Level 1 specification
To demonstrate the compression phases and the parts involved in detail, the LZ1 configuration from the utility
site is chosen; the BWT.1 examples are also worth a look.
The LZ1 configuration relies on a preprocessor lzpre.cpp which turns the input data into a compressed
LZ77-variant representation of codes for match copies and literal strings. This is further compressed through
arithmetic coding with probabilities provided by an ICM (indirect context model).
The contexts are always hashed as H[0] = ((x + 512) · 773 + y + 512) · 773 from two values x and y as follows.
For the first byte of the offset number of a match, the length of the match and the current state (2…4, i.e. 1…3 bytes
to follow as offset number) are used as context. For the remaining bytes of an offset number (or a new code if no
bytes are remaining), the previous context and the current state (previous state − 1, i.e. 0…2 bytes to follow)
are used as context. For the first literal of a literal string, the number of literals and the state 5 are used as
context. For the following literals, the current literal and the state 5 are used as context. For a new code
after a literal string, instead of a hash of the first value just 0 and the current state (1) are used as context.
The bytecode of pcomp is not specially handled.
To revert the LZ1 compression, pcomp parses the literal and match codes and maintains a 16 MB (2^24 bytes)
output buffer in M.
(lz1.cfg
(C) 2011 Dell Inc. Written by Matt Mahoney
Licensed under GPL v3, http://www.gnu.org/copyleft/gpl.html)
comp 0 0 0 24 1
0 icm 12 (sometimes 0 cm 20 48 will compress better)
hcomp
(c=state: 0=init, 1=expect LZ77 literal or match code,
2..4=expect n-1 offset bytes,
5..68=expect n-4 literals)
b=a (save input)
a=c a== 1 if (expect code ccxxxxxx as input)
(cc is number of offset bytes following)
(00xxxxxx means x+1 literal bytes follow)
a=b a>>= 6 a&= 3 a> 0 if
a++ c=a (high 2 bits is code length)
*d=0 a=b a>>= 3 hashd
else
a=b a&= 63 a+= 5 c=a (literal length)
*d=0 a=b hashd
endif
else
a== 5 if (end of literal)
c= 1 *d=0
else
a== 0 if (init)
c= 124 *d=0 (5+length of postprocessor)
else (literal or offset)
c--
(model literals in order 1 context , offset order 0)
a> 5 if *d=0 a=b hashd endif
endif
endif
endif
(model parse state as context)
a=c a> 5 if a= 5 endif hashd
halt
pcomp ./lzpre c ; (code below is equivalent to lzpre d)
a> 255 if (end of segment)
b=0 d=0 (reset , is last command before halt)
43 else
(LZ77 decoder: b=i, c=c d=state r1=len r2=off
state = d = 0 = expect literal or match code
1 = decoding a literal with len bytes left
2 = expecting last offset byte of a match
3,4 = expecting 2,3 match offset bytes
i = b = position in 16M output buffer
c = c = input byte
len = r1 = length of match or literal
off = r2 = offset of match back from i
Input format:
00llllll: literal of length lllllll=1..64 to follow
01lllooo oooooooo: length lll=5..12, offset o=1..2048
10llllll oooooooo oooooooo: l=1..64 offset =1..65536
11llllll oooooooo oooooooo oooooooo: 1..64, 1..2^24)
c=a a=d a== 0 if
a=c a>>= 6 a++ d=a
a== 1 if (state?)
a+=c r=a 1 a=0 r=a 2 (literal len=c+1 off=0)
else
a== 2 if a=c a&= 7 r=a 2 (short match: off=c&7)
a=c a>>= 3 a-= 3 r=a 1 (len=(c>>3)-3)
else (3 or 4 byte match)
a=c a&= 63 a++ r=a 1 a=0 r=a 2 (off=0, len=(c&63) -1)
endif
endif
else
a== 1 if (writing literal)
a=c *b=a b++ out
a=r 1 a-- a== 0 if d=0 endif r=a 1 (if (--len==0) state=0)
else
a> 2 if (reading offset)
a=r 2 a<<= 8 a|=c r=a 2 d-- (off=off<<8|c, --state)
else (state==2, write match)
a=r 2 a<<= 8 a|=c c=a a=b a-=c a-- c=a (c=i-off -1)
d=r 1 (d=len)
do (copy and output d=len bytes)
a=*c *b=a out c++ b++
d-- a=d a> 0 while
(d=state=0. off, len dont matter)
endif
endif
endif
endif
halt
end
Listing 2.2: lz1.cfg
For comparison a Python port for usage with the zpaqlpy compiler can be found in test/lz1.py. It differs
in processing the pcomp bytecode with the current opcode byte as context for the next.
3 Notes on Data Compression
Not all data can be compressed. If we build a necessarily bijective¹ compression scheme f: {0,1}* → {0,1}*
for strings, starting from the identity, then every time we remap an input a with f(a) = a to a shorter compressed
representation b with f(b) = b and |b| < |a|, so that now f(a) = b and |f(a)| < |a|, we have to map b to a,
i.e. f(b) = a, and indeed make this swap in order to maintain the bijection, ending up with |f(b)| > |b| because
|a| > |b|. So while f compresses a it expands b, and this holds for each iteration when we change our
compression scheme.
Luckily, most data which is relevant to us and interesting for compression has patterns, and other data, where
we do not understand the patterns, appears to be random and is out of scope for compression algorithms. If
we do not have any knowledge about the data except that its symbols are equally distributed with probability
p_i for each symbol i, the best we can do is to use an optimal code reaching the Shannon entropy as coding
length per symbol:
H = − Σ_i p_i · log2 p_i
For uniformly distributed bytes (p_i = 1/256) H would be 8 bit as usual and we can simply store the data instead of encoding it again.
In general, given that the distribution is known, we can choose e.g. a non-adaptive arithmetic encoder to
almost reach the limit of H bits per symbol. But adaptive arithmetic coding with PPM and others can
even give a better average because the distribution is adjusted by exploiting patterns. Therefore, to craft a
well performing algorithm for the expected input data, knowledge about the patterns is needed in order to
give good predictions and go beyond the Shannon entropy as average size.
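For comparison with actual compressor output, the empirical Shannon entropy of a byte string can be computed
with a small Python helper like this (an illustration, not part of ZPAQ):

import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    # average optimal code length in bits per symbol for the empirical distribution
    return -sum(c / n * math.log2(c / n) for c in counts.values())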
To dene the lower limit that can be reached the concept of algorithmic information or Kolmogorov complex-
ity of a string is developed. Basically it is the length of the shortest program in a xed language that produces
this string. The language choice only inuences a constant variation because an interpreter could be written.
When comparing different compressors in a benchmark it is common to include the decompressor size in
the measurement because it could also hold data or generate it via computation.
1 We want to allow every string to be compressed, resulting in the same number of compressed strings (because they have to
differ), which also should be decompressible. Another common approach to prove that there is no universal lossless
compression is a counting argument which uses the pigeonhole principle.
Using this principle with ZPAQ and its pcomp section, the first million digits of π in form of the ~1 MB text
file pi.txt from the Canterbury Miscellaneous Corpus² can be compressed to a 114-byte ZPAQ archive
which consists of no stored data and a postprocessing step which computes π to the given precision and
outputs it as text³. This extreme case of the earlier mentioned knowledge about the data can serve as a
bridge between Kolmogorov complexity and adaptive arithmetic coding. For faster execution of the ZPAQ
model we only take the first ten thousand digits of π. Normally the limit most compressors would stay
above (because the digits are equally distributed) is 10000 · log2(10) / 8 ≈ 4153 byte
instead of 10 KB.
With a ZPAQ context model we can, instead of generating the digits in the pcomp phase, also use the next
expected digit as context, so that the predictor will quickly learn that e.g. character 3 comes in context
3. But the prediction cannot be 100 % for one symbol, as other symbols could occur and a probability
greater than zero has to be assigned to them. Also whether the end of the message is reached is encoded as a
special symbol. So the range of the arithmetic encoder still gets narrowed when a perfectly predicted digit
is encoded, but on such a small level that still only 121 bytes are needed for the ZPAQ archive consisting
of the CM model configuration, the hcomp bytecode and the arithmetically coded ten thousand digits of π.
That shows that we can go far beyond the entropy limit down to Kolmogorov complexity by using
context modeling and adaptive arithmetic coding. And still the context model is usable for all input data, in
contrast to computing π in pcomp. The overhead caused by the fact that 100 % cannot be used when predicting
seems to be linear in the message size and can be observed when compressing 50 or 200 MB of zeros with a
CM, resulting in around 0.0000022 byte per input byte.
2 http://corpus.canterbury.ac.nz/descriptions/#misc
3 http://mattmahoney.net/dc/pi.cfg
(pi10k.cfg 2016 Kai Lüke and Matt Mahoney
instead of generating the digits in pcomp phase, use the next expected digit as context
To compress: zpaqd cinst pi10k.cfg pi10k.zpaq pi10000.txt)
comp 0 14 0 0 1 (2^14 > 10000)
0 cm 13 0
hcomp
ifnot (only first run)
(Compute pi to 10000 digits in M using the formula:
pi=4; for (d=r1*20/3;d>0;--d) pi=pi*d/(2*d+1)+2;
where r1 is the number of base 100 digits.
The precision is 1 bit per iteration so 20/3
is slightly more than the log2 (100) we need.)
a= 100 a*=a r=a 1 (r1 = digits base 100)
a*= 7 d=a (d = n iterations)
*b= 4 (M=4)
do
(multiply M *= d, carry in c)
b=r 1 c=0
do
b--
a=*b a*=d a+=c c=a a%= 10 *b=a
a=c a/= 10 c=a
a=b a> 0 while
(divide M /= (2d+1), remainder in c)
a=d a+=d a++ d=a
do
a=c a*= 10 a+=*b c=a a/=d *b=a
a=c a%=d c=a
a=r 1 b++ a>b while
a=d a>>= 1 d=a
(add 2)
b=0 a= 2 a+=*b *b=a
d-- a=d a== 0 until
halt
endif
a=*b a<<= 9 *d=a b++ (set context for expected digit taken from M)
halt
end
Listing 3.1: pi10k.cfg
Even though this configuration is usable for input other than π, it does not give good compression. It can be
merged with the general text model from Listing 2.1 to change between using the CM for order-1 contexts
and for the expected-digit contexts every time the start of π is detected, until a mismatch is found. This
way all occurrences of π are coded with only a few bits.
(mixed_pi2.cfg
use the next expected digit as context for CM or a general text model of fast.cfg
a MIX2 will select between them
To compress: zpaqd c mixed_pi2.cfg text_pi.zpaq text_with_appearance_of_pi.txt
)
comp 2 14 0 0 4 (2^14 > 10000)
0 cm 18 0 (order 1 or digits of pi)
1 icm 16 (order 2, chained to isse)
2 isse 19 1 (order 4)
3 mix2 0 0 2 24 0 (moderate adapting mixer between CM and ISSE based on which predicts
better)
hcomp
r=a 2
a=r 0
a== 0 if (only first run)
(Compute pi to 10000 digits using the formula:
pi=4; for (d=r1*20/3;d>0;--d) pi=pi*d/(2*d+1)+2;
where r1 is the number of base 100 digits.
The precision is 1 bit per iteration so 20/3
is slightly more than the log2 (100) we need.)
a= 100 a*=a r=a 1 (r1 = digits base 100)
a*= 7 d=a (d = n iterations)
*b= 4 (M=4)
do
(multiply M *= d, carry in c)
b=r 1 c=0
do
b--
a=*b a*=d a+=c c=a a%= 10 *b=a
a=c a/= 10 c=a
a=b a> 0 while
(divide M /= (2d+1), remainder in c)
a=d a+=d a++ d=a
do
a=c a*= 10 a+=*b c=a a/=d *b=a
a=c a%=d c=a
a=r 1 b++ a>b while
a=d a>>= 1 d=a
(add 2)
b=0 a= 2 a+=*b *b=a
d-- a=d a== 0 until
c= 2 (point to 4 of 3.14)
a= 1
r=a 0
a<<= 14 a-- (last element of ring buffer)
b=a
a-= 4 (first element of ring buffer, pointer in r3)
r=a 3
halt (input 0 came from pcomp, also to restart c=2 is enough)
endif
(CM part)
d=0
a=r 2 a-= 48
c--
a==*c
c++
if (pi: set context for expected digit)
a=*c c++ a<<= 1 a++ a<<= 9 *d=a (distinguish between pi number context and character
context by 1 bit for sure)
else (other:)
a=r 2 a<<= 10 *d=a c= 2 (set context to actual byte)
endif
(a in r2, lower border of ring buffer in r3)
(ICM and ISSE part adapted from fast.cfg)
a=r 2
*b=a a=0 (save in rotating buffer M)
hash b--
d=a (save hash) a=r 3 a>b if b++ b++ b++ b++ endif a=d
hash d= 1 *d=a
b--
d=a (save hash) a=r 3 a>b if b++ b++ b++ b++ endif a=d
hash b--
d=a (save hash) a=r 3 a>b if b++ b++ b++ b++ endif a=d
hash d= 2 *d=a
halt
end
Listing 3.2: mixed_pi2.cfg
For real-life use cases it is often not possible to give perfect predictions. Good contexts can help to bring
order into the statistics about previous data. Besides the manual approach, heuristic context models can be
generated for data by calculating the data's autocorrelation function [12] or, as done in PAQ, by recognizing
two-dimensional strides for tabular or graphical data.
Even compressed JPEG photos can be further compressed by 10 % - 30 % by predicting the Huffman-coded
DCT coefficients using the decoded values as contexts (done in PAQ7-8, Stuffit, PackJPG, WinZIP [7]).
The ZPAQ configuration jpg_test2.cfg uses a preprocessor to expand Huffman codes to DCT coefficients
and later uses them as contexts⁴. The PackJPG approach continues to be developed by Dropbox under the
name lepton⁵ and supports progressive besides baseline JPEGs.
Overall, modeling and prediction are an AI problem, because e.g. for a given sentence start a likely following
word has to be provided, or it has to be guessed how a picture with a missing area is going to continue. Remarkable
results have been accomplished by using PAQ8 as a machine learning tool, e.g. for building a game AI with it serving
as a classifier, for interactive input text prediction, text classification, shape recognition and lossy image
compression [13].
4 http://mattmahoney.net/dc/zpaqutil.html
5 https://github.com/dropbox/lepton
4 Compilers, Rust and Python
This chapter mentions compilers and the parts covering their subproblems. Then the programming language
Rust, which was used for the compiler development, is shortly presented, as well as Python as the source
language.
4.1 Classical Compiler Architecture
The task to translate a program from a source language into an equivalent program in the target language is
usually split into independent phases where each phase passes its result to the next phase (see [14], Figure
1.6). In most cases the target language is very low level, like machine instructions of a CPU. One variation
of a compiler structure including an intermediate representation language will be explained.
The tokenizer or lexer cuts the input stream into tokens which represent the words/units of the source language,
normally the allowed terminal symbols of the grammar. The parser reads the tokens in sequence
and fits them into nonterminals of the source language grammar by following the production rules until
all tokens are consumed and the start production is completed. On the way it generates an AST (Abstract
Syntax Tree) as a semantic representation of the program in the source language. This completes the analysis
done in the frontend.
Synthesis takes place in the backend. The AST is traversed to generate IR (intermediate representation)
instructions. This step usually keeps track of variable names and their assignments in a symbol table. An
optimization pass can be applied to the IR. The code generator produces instructions in the target language
and they can be passed through optimization steps as well.
Checking for errors has to be done in every analysis step to find lexical errors in the tokenizer, syntactical
errors while parsing, and type errors and others during semantic analysis.
If the target language is not the end product but only an assembly language, then an assembler produces
object files which will be combined by a linker into a final executable program.
There are also preprocessors for the source language, multi-pass compilers and JIT (just-in-time) compilers
which are used in interpreters and translate only an often-used section of the source program during runtime
for a performance gain.
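As a toy illustration of these phases, the following self-contained Python snippet runs an arithmetic expression
through a tokenizer, a parser building an AST and a code generator emitting IR for a stack machine (the zpaqlpy
compiler described later follows the same overall structure, but in Rust and for the Python subset):

import re

def tokenize(src):
    return re.findall(r"\d+|[+*]", src)

def parse(tokens):                       # grammar: expr: term ("+" term)*, term: NUM ("*" NUM)*
    def term(i):
        node, i = ("num", tokens[i]), i + 1
        while i < len(tokens) and tokens[i] == "*":
            node, i = ("*", node, ("num", tokens[i + 1])), i + 2
        return node, i
    node, i = term(0)
    while i < len(tokens) and tokens[i] == "+":
        rhs, i = term(i + 1)
        node = ("+", node, rhs)
    return node

def gen_ir(ast, out):                    # post-order AST walk emits stack-machine IR
    if ast[0] == "num":
        out.append(("push", int(ast[1])))
    else:
        gen_ir(ast[1], out)
        gen_ir(ast[2], out)
        out.append(("add" if ast[0] == "+" else "mul",))
    return out

ir = gen_ir(parse(tokenize("1+2*3")), [])  # [('push', 1), ('push', 2), ('push', 3), ('mul',), ('add',)]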
4.2 The Rust Programming Language
»Rust is a systems programming language focused on three goals: safety, speed, and concurrency. It
maintains these goals without having a garbage collector, making it a useful language for a number of use
cases other languages aren’t good at: embedding in other languages, programs with specific space and
time requirements, and writing low-level code, like device drivers and operating systems. It improves on
current languages targeting this space by having a number of compile-time safety checks that produce no
runtime overhead, while eliminating all data races. Rust also aims to achieve ‘zero-cost abstractions’ even
though some of these abstractions feel like those of a high-level language. Even then, Rust still allows
precise control like a low-level language would.«
1
The rst stable 1.0 release of the Mozilla-sponsored language was in May 2015, the current version is 1.11
from August 2016. Pattern matching, denition of data structures and traits together with strong types bring
Rust close to functional languages like Haskell. A program can in most cases only be compiled if it is free
from crashes and undened behavior unless explicitly a panic is allowed when an alternative result case of
a function is not handled. So sometimes there are many iterations until a program compiles, but then as a
gain there are assurances for the execution like memory safety instead of buffer overows. But on the other
side one has to be aware of the ow and reference of data through variables and has to comply to the rules
imposed by the move semantics of the language, because the ownership model is the core for the safety
guarantees.
Part of the language is the definition of test functions for software testing and inline documentation which can
include test cases. As tooling, the build system and package manager Cargo comes along with the compiler
rustc, which is invoked by Cargo. The inline documentation can be extracted to HTML.
Because of the noted advantages it seems that Rust fits well for compiler construction and offers a pleasing
workflow once one is accustomed to its type system, because the rarer runtime errors lead to less
debugging.
4.3 The Python Programming Language
Python has been around for two decades now and is still representative of an easy to learn imperative
language with a simple syntax. It features dynamic types and a pragmatic approach to object orientation
with functional elements and closures. Freeing memory is done by reference counting and the language
is tightly coupled with the way it is interpreted. In order to fulfill a protocol/trait it relies on functions an
object needs to implement.
1 Book Introduction ”The Rust Programming Language”: https://doc.rust-lang.org/book/index.html
5 Considerations and Challenges with ZPAQL as
Target Language
The ZPAQL bytecode that runs on the virtual machine is executed for every input byte, from the beginning,
i.e. the first instruction. Then A is set to the input byte while all other state is maintained between the
executions. Also, if running as the hcomp section, the first elements of H are not general purpose memory
but serve as input for the context mixing components. Execution in a ZPAQL VM is sequential with no
parallelism. Besides the defined sizes of H and M there is no heap memory allocation or stack available. If
dynamic memory allocation is wanted, it has to be implemented on top of H or M.
It is Turing-complete, but the bytecode size is restricted to around 64 KB (in hcomp a few bytes less, as the
context model needs to be defined before). Also there is no keyword for a data section defined in the
ZPAQL assembly language, and it would be of no use as the bytecode content itself is not addressable and
also unmodifiable. But program data could be prepended before the stored data and retrieved by reading
in hcomp or pcomp and discarded while pcomp generates the output. If data should just be accessed by
hcomp, it could also be stored as a skipped part of the pcomp bytecode, as this is seen by hcomp before the
decoded data. This could also be used if an interpreter is written in ZPAQL.
The jump instructions only support a fixed and not a variable destination. That means instead of jumping to
a return address which the caller passes to the function, a return can only be achieved through an identifier
for a jump table with predefined jump points.
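Conceptually, such a return mechanism can be pictured with the following Python fragment: the caller stores an
identifier and the callee "returns" by dispatching on it over predefined jump points (the concrete calling
convention generated by the compiler may differ):

def run(value):
    ret = 0                                    # identifier of the predefined return point
    state = "call_double_from_main"            # current fixed jump point
    while True:
        if state == "call_double_from_main":
            ret, state = 1, "double"           # remember return point 1, jump to the function
        elif state == "double":
            value *= 2
            state = {1: "after_first_call"}[ret]   # jump table indexed by the identifier
        elif state == "after_first_call":
            return value

print(run(21))   # 42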
The instruction set is rather simple and aimed at 32-bit integers. To keep expensive abstractions away, the
source language should e.g. not include floating-point numbers or 64-bit integers. Most calculations have
to be done using register A as the accumulator because the other registers lack advanced instructions.
There is no instruction for reading the next byte or for saying that context computation has finished - the
program needs to halt and is then executed again from the beginning. Therefore, additional runtime code is
needed to continue execution where it stopped if e.g. convenient input reading should be allowed without
disrupting the program flow, or if the finished computation of context data should be signaled.
It seems to be a viable plan to keep the organizational structure of a source program close to the resulting
ZPAQL program, but with added convenience through a runtime API which is needed anyway for input and
output. Instead of introducing special functions to set the context data as the first elements in H, H and also M
can just be fully exposed as data structures in the source language and also be used as arrays for other
means with low abstraction costs. This keeps the structure of handwritten ZPAQL programs close to those
in the source language. But in order to keep variables in a stack, R with its 256 elements is not enough, so
expanding H seems to be a good solution. To model the repetitive execution of hcomp and pcomp they
could be defined as functions in the source program (think of a main function) and it would also be possible to
pass the input byte as an argument, which also keeps the similarity to a handwritten ZPAQL source.
As the runtime code abstractions for providing the mentioned read-API are not too high, the similarity to
original ZPAQL files is more a cosmetic design decision. And if a context-set-API which halts and continues
execution through runtime code is present, then the hcomp and pcomp functions could be replaced by main
functions which are entered only once and thus hide the fact that execution starts from the beginning for
every input byte. Still, dynamic memory management on top of H and M seems to be costly, and thus
departing too far from ZPAQL and adding more complicated data structures could hurt performance too
much.
It would help if the source program were executable standalone, without being compiled to ZPAQL, to
ease debugging by staying outside the ZPAQL VM as long as possible.
Before ZPAQL instructions are generated by the compiler it would be helpful if most complicated operations
are solved on the IR level already, like the saving, restoring and other memory management.
6 Selection of the Python-subset as Source
Language
Considerations are mentioned again before the chosen subset is presented together with the API functions
and organizational source code constraints. Then the grammar is listed with a short note on the behavior of
elements.
6.1 Decision on the Feature Set
Based on the preceding thoughts, the input should be a single Python file containing both the code for hcomp
and pcomp as functions, which may also call other functions, but no imports are allowed. The compiler
should not deal with strings, lists, arbitrarily big numbers, classes, closures and (function) objects but only
support a reduced set with variables having 32-bit integer values. The context model should be defined
like in a ZPAQL configuration, and the predefined arrays H and M should be exposed as well as the size
exponents hh, hm, ph, pm and n as the number of components used.
API functions out(a) for the pcomp OUT instruction and read_b() to read the next input byte, by halting
and returning to the current state of execution, should be provided, as well as helpers if a dynamic memory
management for additional arrays is implemented in Python on top of H and M. This API needs to be
included as a corresponding ZPAQL runtime in the output of the compiler. The Python runtime should also
allow execution as a standalone program with an interface similar to the zpaqd development tool, which can
run pcomp and hcomp on a given input.
All the code segments need to be separated, namely the context model definition and other common definitions/functions,
the pcomp and hcomp code with their variables and functions, then the runtime API and
finally the code for standalone execution. Thus a template approach was chosen which uses comment lines
as separation marks (a sketch follows after the table below). The compiler then extracts only the sections it needs to compile.
source.py Template Sections | Editable?
Definition of the ZPAQ configuration header data (memory size, context mixing components) and optionally functions and variables used by both hcomp and pcomp | yes
API functions for input and output, initialization of memory | no
Function hcomp and associated global variables and functions | yes
Function pcomp and associated global variables and functions | yes
Code for standalone execution of the Python file, analog to running a ZPAQL configuration with zpaqd r [cfg] p|h | no
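A hypothetical miniature of such a template is sketched below; the real zpaqlpy template uses dedicated comment
lines as section markers and ships its own runtime section, so the markers, example values and stub definitions
shown here are placeholders only:

# --- ZPAQ configuration header and shared definitions (editable)
hh, hm, ph, pm = 2, 0, 0, 0                  # size exponents of H and M (example values)
n = 1                                         # number of context mixing components

# --- runtime API (in the real template a fixed section, stubbed here)
hH = [0] * (2 ** hh)                          # H of the hcomp VM
NONE = 2 ** 32 - 1
def out(a): print(chr(a & 255), end="")       # pcomp OUT instruction

# --- hcomp and associated globals/functions (editable)
def hcomp(c):
    hH[0] = (c << 9) % 2 ** 32                # e.g. an order-1 context for component 0

# --- pcomp and associated globals/functions (editable)
def pcomp(c):
    if c != NONE:                             # NONE marks the end of a segment
        out(c)                                # identity postprocessing

# --- standalone execution, analog to zpaqd r [cfg] p|h (fixed section)
for byte in b"example input":
    hcomp(byte)
    pcomp(byte)
pcomp(NONE)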
Of course it could be organized in a different, more appealing way based on the alternatives mentioned
in the previous chapter, e.g. with two more idiomatic Python files for hcomp and pcomp without special
entry functions, and the runtime in a different file which either imports the two files or vice versa.
6.2 Grammar and Semantics
The Python grammar as specified in the language reference¹ has been simplified where possible. It still
allows dictionaries and strings, however not for program code but just for the definition of the context
mixing components. Tuples, unpacking, lists and list comprehensions, for-loops, with-blocks, support for
async coroutines, import, try/raise/except, decorators, lambda expressions and named arguments have been
removed.
Not everything that is parsed is allowed as source code, e.g. nonlocal, dicts, strings or the @-operator for matrix
multiplication, but this way a better error message can be provided than just a parser error. The names of
the productions have been kept even if they are simplified. It would be best if the grammar were expanded to
parse full Python again and the code generator decided what to support.
Grammar with NUMBER, NAME, ”symbols”, NEWLINE, INDENT, DEDENT or STRING as terminals
Prog (NEWLINE* stmt)* ENDMARKER?
funcdef ”def” NAME Parameters ”:” suite
Parameters ”(” Typedargslist? ”)”
Typedargslist Tfpdef (”=” test)? (”,” Tfpdef (”=” test)?)* (”,” (”**” Tfpdef)?)?
Tfpdef NAME (”:” test)?
stmt simple_stmt | compound_stmt
simple_stmt small_stmt (”;” small_stmt)* ”;”? NEWLINE
small_stmt expr_stmt | pass_stmt | flow_stmt | global_stmt | nonlocal_stmt
expr_stmt (store_assign augassign test) | ((store_assign ”=”)? test)
1 https://docs.python.org/3/reference/grammar.html
store_assign NAME (”[” test ”]”)?
augassign ”+=” | ”-=” | ”*=” | ”@=” | ”//=” | ”/=” | ”%=” |
”&=” | ”|=” | ”^=” | ”<<=” | ”>>=” | ”**=”
pass_stmt ”pass”
flow_stmt break_stmt | continue_stmt | return_stmt
break_stmt ”break”
continue_stmt ”continue”
return_stmt ”return” test
global_stmt ”global” NAME (”,” NAME)*
nonlocal_stmt ”nonlocal” NAME (”,” NAME)*
compound_stmt if_stmt | while_stmt | funcdef
if_stmt ”if” test ”:” suite (”elif” test ”:” suite)* (”else” ”:” suite)?
while_stmt ”while” test ”:” suite (”else” ”:” suite)?
suite simple_stmt | (NEWLINE INDENT stmt+ DEDENT)
test or_test
test_nocond or_test
or_test and_test (”or” and_test)*
and_test not_test (”and” not_test)*
not_test comparison | (”not” not_test)
comparison expr (comp_op expr)*
comp_op ”<” | ”>” | ”==” | ”>=” | ”<=” | ”!=” | ”in” | ”not” ”in” | ”is” | ”is” ”not”
expr xor_expr (”|” xor_expr)*
xor_expr and_expr (”^” and_expr)*
and_expr shift_expr (”&” shift_expr)*
shift_expr arith_expr | (arith_expr (shift_op arith_expr)+)
shift_op ”<<” | ”>>”
arith_expr term | (term (t_op term)+)
t_op ”+” | ”-”
term factor (f_op factor)*
f_op ”*” | ”@” | ”/” | ”%” | ”//”
factor (”+” factor) | (”-” factor) | (”~” factor) | power
power atom_expr (”**” factor)?
atom_expr (NAME ”(” arglist? ”)”) | (NAME ”[” test ”]”) | atom
atom (”(” test ”)”) | (”{” dictorsetmaker? ”}”) | NUMBER | STRING+ | ”...” |
”None” | ”True” | ”False” | NAME
dictorsetmaker dictorsetmaker_t (”,” dictorsetmaker_t)* ”,”?
dictorsetmaker_t test ”:” test
arglist test (”,” test)* ”,”?
The semantics of the language elements as described in the reference² stay mostly the same, even if the
usable feature set is reduced as stated before. In particular, one has to be aware of integer overflows,
which are absent in Python but present in ZPAQL, and thus all computations are mod 2^32, i.e. in the range
0 … 2^32 − 1. An exception are the bit shift operations with a shift of more than 32 bits: in this case the
Python-subset will do a shift by X mod 32 bits. To achieve the semantics of (v << X) % 2**32 or (v >> X) % 2**32
with X ≥ 32, the resulting value should directly be set to 0 instead. Also / 0 and % 0 do not fail in ZPAQL
but result in a 0 value.
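When an algorithm is run as plain Python for debugging, these ZPAQL semantics can be mirrored with small
helpers like the following (an illustration, not part of the compiler's runtime):

M32 = 2 ** 32

def shl(a, x): return (a << (x % 32)) % M32      # ZPAQL shifts by x mod 32
def shr(a, x): return (a % M32) >> (x % 32)
def div(a, b): return a // b if b != 0 else 0    # A/=0 yields 0 in ZPAQL
def mod(a, b): return a % b if b != 0 else 0     # A%=0 yields 0 in ZPAQL

assert shl(1, 33) == 2   # unlike plain Python, where (1 << 33) % M32 == 0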
For a wrong Python input the current compiler might behave in a different way and even accept
it without a failure. Therefore it is required that the input is a valid Python program which runs without
exceptions.
This requirement is also important because the current compiler does not check array boundaries, so
index%len(hH) or index&((1<<hh)-1) should be used e.g. for a ring buffer, because after the original
size of H the stack follows. If run as a plain Python file, an exception is thrown anyway in that case because
Python checks array boundaries.
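For example, a rotating history buffer in hH then has to wrap explicitly, because the region behind the original
size of H is used as the stack (stand-in definitions for hh and hH are added here so the fragment runs on its own):

hh = 2; hH = [0] * (1 << hh)          # stand-ins for the template's constant and array
pos = 0

def store_byte(c):
    global pos
    hH[pos & ((1 << hh) - 1)] = c     # wrap inside the original size of H
    pos += 1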
2 https://docs.python.org/3/reference/index.html
7 Design of the Compiler in Rust
The source code for the zpaqlpy compiler is located at https://github.com/pothos/zpaqlpy and includes
the test data and other files mentioned. This chapter is about the developed API the source file uses
and about the process of developing the compiler and its internals.
7.1 Exposed API
The 32-bit or 8-bit memory areas H and M are available as the arrays hH, pH, hM, pM, depending on being in the
hcomp or pcomp section, with sizes 2^hh, 2^hm, 2^ph, 2^pm defined in the header and available as the constants
hh, hm, ph, pm. There is support for len(hH), len(pH), len(hM), len(pM) instead of calculating the power of two.
But in general, instead of using len() for dynamically allocated arrays as well, special functions like len_hH()
are used to visibly expose their types and do runtime checks already in Python. NONE is a shortcut for
2^32 − 1.
Other functions Description
c = read_b() Read <