- data compression means encoding into less bits
- lossless means without loosing any information
- trade-off between communication throughput, computation, and memory
- can compress if some symbols are more likely than others
- better symbol prediction => lower cross-entropy => higher compression

## Dictionary Coder

- Also called Substitution coder
- Replaces long frequently occurring strings with a shorter references in a dictionary.
- Applied in Gzip and in natural language processing tokenization (e.g. BPE) compression

## Morse Code Uses Huffman Coding Compression

- compresses 26 characters of english alphabet, not compressing white space
- character mapped to sequence of dots and dashes, space mapped to space
- more frequent characters mapped to fewer dots and dashes
- this is Huffman coding with a static tree
- encoding and decoding requires minimal compute and memory

## GZip’s Deflate Data Compression

- GZip uses Deflate compression format (ZIP also)
- a sliding widow of 32k bytes is used to detect duplicate strings
- duplicate strings are referenced back with length and distance symbols (similar to dictionary coding)
- this along with byte literals defines custom alphabet of symbols

- Huffman coding maps frequent symbols to shorter bit sequences
- bit-mapping trees stored in the output and sometimes refreshed

- a sliding widow of 32k bytes is used to detect duplicate strings

## Arithmetic Coding vs Huffman Coding

- arithmetic coding has higher compression ratio than Huffman, but slower
- maps symbol of probability \( q \) to length \( -log_2 q \) in contrast to Huffman
- defined by split of \( (0, 1) \) into subintervals of the probability size, sorted by the size.
- encodings are numbers within the subintervals in binary format
- transmit enough digits so all fractions that fall within interval (prefix code)

## Entropy and Cross-Entropy in Compression

- Let true next symbol probability given previous symbols: \( p(x \mid x_i, x_{i-1}, …) \)
- estimated next symbol probability given previous symbols: \( q(x \mid x_i ,… ) \)
- arithmetic coding encodes to length \( - \log_2 q(x) \)

- then average compressed message bit-length is cross-entropy: \( - \sum_x p(x) \log_2 q(x) \)
- and optimal minimum bits for next symbol is entropy: \( - \sum_x p(x) \log_2 p(x) \)
- optimal due to “Shanon’s 1948 source coding theorem”

- Cross-Entropy minimization equals likelihood maximization: \( \frac{1}{N} \log ( \prod_k q_k^{N_k p_k} ) \)

## Bits-per-byte (bpb) and Bits-per-Character (bpc)

- compression ratio is defined as \( \mathrm{cmpRatio} = \mathrm{unCompressedBytes} / \mathrm{compressedBytes} \)
- Bits-per-byte is defined as \( \mathrm{compressedBits} / \mathrm{unCompressedBytes} \)
- Bits-per-byte (bpb) metric is inverse compression ratio divided by 8: \( 1 bpb = 1 / (8 \mathrm{cmpRatio}) \).
- Bits-per-character (bpc) metric for ASCII Extended characters equals bits-per-byte (bpb).
- Cross-entropy loss (log2) for a character-level language model averaged over a dataset equals bpc.
- Perplexity is cross-entropy (log2) to the second power \( PP = 2^{\mathrm{crossEnropy}} \)
- Gzip compresses enwik8 2.92 bpb, Morse code approximately 10.8 bpc
- SRU++ model achieves 1.02 bpc - approximately compression ratio of 8

## Compression by Predicting Next Symbol

- Huffman coding predicts next symbol cheaply with symbol frequency
- can trade more memory and computation with complex probability modeling with ML
- ML model can be trained on already compressed data stream deterministically
- common benchmarks are enwik8, and enwik9 datasets with bits-per-byte (bpb)
- bpb not comparable to language modeling results: single-pass, extra overhead, compressing entire dataset

## NNCP: Lossless Data Compression with Neural Networks

- model is not stored in the output. It is deterministically derived based on decompressed output
- model is regularly retrained during compression
- creates a tokenization dictionary (16k symbols like Cmix) during the first pass
- tokenization is dictionary coding
- NNCP 1 uses multi-layer LSTM to predict next symbol
- NNCP v2 Transformer beats Cmix on enwik9

## NNCP v2 Results vs Cmix

- results of NNCP v2 on enwik9 below. LSTM (large2) is …
- 10,000 times slower than GZip for 2.8x less bits-per-byte
- faster, simpler, better than Cmix (complex w/ LSTM) on enwik9
- worse than Cmix on enwik8

## TRACE Model 1-layer Transformer

- TRACE is 1-layer transformer compression
- predicts the next byte instead of dictionary symbol (token)
- vocabulary size is 256 too little compared to the hidden dimension
- so 4 consecutive embeddings concatenated before input

- retrained less often (adaptive) than NNCP, but starts randomly initialized

## TRACE Model Results Faster Data Compression Than NNCP

- 3x speedup with competitive compression with NNCP, but still 1000x slower than GZip
- worse on text but on-par on other due to not using tokenization dictionary
- result on enwik9 below for Cmix, NNCP, Dzip required GPU

## Deep Neural Network Lossless Compression Applications

- other compression algos: Cmix (HutterPrice 2021 SoTA), Dzip
- as of 2022-05 seems unpractical - slow, small compression improvement
- could be practical as a side effect of other computation

- note that lossy compression of images and video seem more likely applied
- more overview of the field in this paper

## Abstraction is Lossy Compression

- abstraction is general rules and concepts derived from the usage and classification
- transformation from concrete to abstract is lossy compression - taxonomy
- from properties of the abstract, can derive properties for all concrete
- not have to repeat this for each concrete - compression

## Autoencoders and Variational Autoencoders are Lossy Compressors

- if finite precision encoding into less latent dimensions
- compressed representations may be used for interpolation
- interpretable features may be disentangled