# Neural Data Compression

Lossless bit reduction with machine learning by minimizing cross-entropy. Examples: NNCP and TRACE models.
• data compression means encoding into fewer bits
• lossless means without losing any information
• trade-off between communication throughput, computation, and memory
• can compress if some symbols are more likely than others
• better symbol prediction => lower cross-entropy => higher compression

## Morse Code Uses Huffman Coding Compression

• compresses the 26 characters of the English alphabet; white space is not compressed
• character mapped to sequence of dots and dashes, space mapped to space
• more frequent characters mapped to fewer dots and dashes
• this is Huffman coding with a static tree (see the sketch below)
• encoding and decoding require minimal compute and memory
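
Below is a minimal Python sketch of static Huffman coding, assuming an illustrative sample string and `heapq`-based tree merging (not Morse's actual code table):

```python
import heapq
from collections import Counter

def huffman_code(freqs):
    """Build a prefix code: frequent symbols get shorter bit strings."""
    # Heap entries: (total frequency, tie-breaker, {symbol: code so far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)               # two least frequent subtrees
        f2, i, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}  # extend codes by one bit
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (f1 + f2, i, merged))
    return heap[0][2]

text = "this is huffman coding with a static tree"
code = huffman_code(Counter(text))
encoded = "".join(code[c] for c in text)
print(code)
print(f"{len(encoded)} bits vs {8 * len(text)} bits uncompressed")
```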

## GZip’s Deflate Data Compression

• GZip uses the Deflate compression format (so does ZIP)
1. a sliding window of 32 kB is used to detect duplicate strings (sketch below)
• duplicate strings are referenced back with length and distance symbols (similar to dictionary coding)
• this, along with the byte literals, defines a custom alphabet of symbols
2. Huffman coding maps frequent symbols to shorter bit sequences
• the bit-mapping trees are stored in the output and sometimes refreshed
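
A toy Python sketch of step 1 above, the sliding-window back-references; real Deflate uses a 32 kB window with hash-based matching and then Huffman-codes the length/distance symbols, so this naive search is only illustrative:

```python
def lz77_tokens(data: bytes, window: int = 32 * 1024, min_match: int = 3):
    """Greedy sketch: emit (distance, length) back-references or literal bytes."""
    i, out = 0, []
    while i < len(data):
        best_len, best_dist = 0, 0
        # Naive search for the longest earlier match inside the sliding window.
        for j in range(max(0, i - window), i):
            length = 0
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            out.append(("match", best_dist, best_len))  # "copy length bytes from distance back"
            i += best_len
        else:
            out.append(("literal", data[i]))
            i += 1
    return out

print(lz77_tokens(b"abcabcabcabc"))
```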

## Arithmetic Coding vs Huffman Coding

• arithmetic coding has a higher compression ratio than Huffman coding, but is slower
• maps a symbol of probability $$q$$ to length $$-\log_2 q$$ bits, which need not be a whole number, in contrast to Huffman's integer-length codes
• defined by a split of $$(0, 1)$$ into subintervals of the probabilities' sizes, sorted by size
• the encoding is a number within the final subinterval, written in binary (sketch below)
• transmit enough binary digits that every fraction with that prefix falls within the interval (prefix code)
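
A float-based Python sketch of the interval narrowing; production arithmetic coders use integer ranges with renormalization, and the probability table here is an assumed model:

```python
import math

def arithmetic_encode(symbols, probs):
    """Narrow (low, high); any number inside the final interval identifies the whole message."""
    low, high = 0.0, 1.0
    for s in symbols:
        width = high - low
        cum = 0.0
        for sym, p in probs.items():          # fixed symbol order defines the subintervals
            if sym == s:
                low, high = low + cum * width, low + (cum + p) * width
                break
            cum += p
    return low, high

probs = {"a": 0.7, "b": 0.2, "c": 0.1}        # assumed model probabilities q
low, high = arithmetic_encode("aabac", probs)
print(low, high)
# The interval width is the product of the symbol probabilities,
# so about -log2(width) bits of the binary fraction suffice (prefix code).
print(math.ceil(-math.log2(high - low)) + 1, "bits")
```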

## Entropy and Cross-Entropy in Compression

• let the true next-symbol probability given previous symbols be $$p(x \mid x_i, x_{i-1}, \dots)$$
• let the estimated next-symbol probability given previous symbols be $$q(x \mid x_i, x_{i-1}, \dots)$$
• arithmetic coding encodes to length $$- \log_2 q(x)$$
• then the average compressed message bit-length per symbol is the cross-entropy: $$- \sum_x p(x) \log_2 q(x)$$ (numeric check below)
• cross-entropy minimization equals likelihood maximization: the average log-likelihood is $$\frac{1}{N} \log_2 \prod_k q_k^{N_k} = \sum_k p_k \log_2 q_k$$ when symbol $$k$$ occurs $$N_k = N p_k$$ times
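
A small numeric check of the statements above, with made-up distributions p and q:

```python
import math

p = {"a": 0.5, "b": 0.3, "c": 0.2}   # assumed true distribution p
q = {"a": 0.6, "b": 0.2, "c": 0.2}   # assumed model estimate q

# Average code length when each symbol x costs -log2 q(x) bits.
cross_entropy = -sum(p[x] * math.log2(q[x]) for x in p)
entropy = -sum(p[x] * math.log2(p[x]) for x in p)
print(f"entropy       {entropy:.3f} bits/symbol")
print(f"cross-entropy {cross_entropy:.3f} bits/symbol (>= entropy, equal when q = p)")
```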

## Bits-per-byte (bpb) and Bits-per-Character (bpc)

• compression ratio is defined as $$\mathrm{cmpRatio} = \mathrm{unCompressedBytes} / \mathrm{compressedBytes}$$
• Bits-per-byte is defined as $$\mathrm{compressedBits} / \mathrm{unCompressedBytes}$$
• bits-per-byte (bpb) is 8 divided by the compression ratio: $$\mathrm{bpb} = 8 / \mathrm{cmpRatio}$$
• Bits-per-character (bpc) metric for ASCII Extended characters equals bits-per-byte (bpb).
• Cross-entropy loss (log2) for a character-level language model averaged over a dataset equals bpc.
• perplexity is 2 raised to the power of the cross-entropy (in bits): $$PP = 2^{\mathrm{crossEntropy}}$$
• GZip compresses enwik8 to 2.92 bpb; Morse code yields approximately 10.8 bpc
• the SRU++ model achieves 1.02 bpc, approximately a compression ratio of 8 (see the snippet below)
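
The same relations in Python, plugged with the gzip and SRU++ numbers quoted above:

```python
def bpb(uncompressed_bytes, compressed_bytes):
    """Bits of compressed output per byte of input."""
    return 8 * compressed_bytes / uncompressed_bytes

def compression_ratio(bits_per_byte):
    """cmpRatio = unCompressedBytes / compressedBytes = 8 / bpb."""
    return 8 / bits_per_byte

print(compression_ratio(2.92))  # gzip on enwik8          -> ~2.7x
print(compression_ratio(1.02))  # SRU++ (bpc ~ bpb here)  -> ~7.8x, "approximately 8"
print(2 ** 1.02)                # perplexity of a 1.02 bpc character model -> ~2.03
```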

## Compression by Predicting Next Symbol

• Huffman coding predicts next symbol cheaply with symbol frequency
• can trade more memory and computation for better probability modeling with ML (sketch below)
• the ML model can be trained deterministically on the already compressed part of the data stream
• common benchmarks are enwik8, and enwik9 datasets with bits-per-byte (bpb)
• bpb is not directly comparable to language-modeling results: single pass, extra overhead, and the entire dataset is compressed
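
A Python sketch of the general scheme: any model exposing next-symbol probabilities can drive an entropy coder, and the ideal coded size is the sum of -log2 q over the stream; `FrequencyModel` and its `predict_proba` interface are illustrative stand-ins, not any particular compressor's API:

```python
import math

def compressed_size_bits(data, model):
    """Ideal coded length when each symbol x_t costs -log2 q(x_t | x_<t) bits."""
    total, context = 0.0, []
    for symbol in data:
        q = model.predict_proba(context)      # dict: symbol -> probability
        total += -math.log2(q[symbol])        # what an entropy coder would pay
        context.append(symbol)                # decoder rebuilds the same context
    return total

class FrequencyModel:
    """Order-0 baseline: probabilities from counts seen so far (Laplace smoothed).
    A neural network predictor could be dropped in behind the same interface."""
    def __init__(self, alphabet):
        self.alphabet = alphabet
    def predict_proba(self, context):
        counts = {s: 1 for s in self.alphabet}
        for s in context:
            counts[s] += 1
        total = sum(counts.values())
        return {s: c / total for s, c in counts.items()}

data = "abracadabra"
model = FrequencyModel(sorted(set(data)))
print(f"{compressed_size_bits(data, model):.1f} bits vs {8 * len(data)} bits raw")
```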

## NNCP: Lossless Data Compression with Neural Networks

• the model is not stored in the output; it is deterministically derived from the already decompressed output
• model is regularly retrained during compression
• creates a tokenization dictionary (16k symbols like Cmix) during the first pass
• tokenization is dictionary coding
• NNCP v1 uses a multi-layer LSTM to predict the next symbol
• NNCP v2 Transformer beats Cmix on enwik9

## NNCP v2 Results vs Cmix

• results of NNCP v2 on enwik9 below. LSTM (large2) is …
• 10,000 times slower than GZip for 2.8x fewer bits per byte
• faster, simpler, and better than Cmix (complex, LSTM-based) on enwik9
• worse than Cmix on enwik8

## TRACE Model: 1-Layer Transformer

• TRACE is a 1-layer transformer compression model
• predicts the next byte instead of a dictionary symbol (token)
• the vocabulary size of 256 is too small compared to the hidden dimension
• so 4 consecutive byte embeddings are concatenated before input (sketch below)
• retrained less often (adaptively) than NNCP, but starts randomly initialized
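
A PyTorch sketch of that input trick, with assumed dimensions rather than TRACE's published hyperparameters:

```python
import torch
import torch.nn as nn

class ByteGroupEmbedding(nn.Module):
    """Embed each byte, then concatenate groups of 4 consecutive embeddings
    so the sequence fed to the transformer is 4x shorter and 4x wider."""
    def __init__(self, emb_dim=128, group=4):
        super().__init__()
        self.group = group
        self.embed = nn.Embedding(256, emb_dim)        # byte vocabulary is only 256

    def forward(self, byte_ids):                       # (batch, seq_len), seq_len % group == 0
        e = self.embed(byte_ids)                       # (batch, seq_len, emb_dim)
        b, t, d = e.shape
        return e.reshape(b, t // self.group, self.group * d)

x = torch.randint(0, 256, (2, 16))
print(ByteGroupEmbedding()(x).shape)                   # torch.Size([2, 4, 512])
```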

## TRACE Model Results: Faster Data Compression Than NNCP

• 3x speedup with compression competitive with NNCP, but still 1000x slower than GZip
• worse on text but on par on other data types, due to not using a tokenization dictionary
• results on enwik9 below; Cmix, NNCP, and Dzip required a GPU

## Deep Neural Network Lossless Compression Applications

• other compression algorithms: Cmix (Hutter Prize 2021 SoTA), Dzip
• as of 2022-05 this seems impractical: slow, with only a small compression improvement
• could be practical as a side effect of other computation
• note that lossy compression of images and video seems more likely to be applied
• a broader overview of the field is in this paper

## Abstraction is Lossy Compression

• abstraction is general rules and concepts derived from the usage and classification of specific examples
• the transformation from concrete to abstract is lossy compression, e.g., a taxonomy
• from the properties of the abstract, we can derive properties of all its concrete instances
• we do not have to repeat this for each concrete instance, hence the compression

## Autoencoders and Variational Autoencoders are Lossy Compressors

• if encoding with finite precision into fewer latent dimensions (sketch below)
• compressed representations may be used for interpolation
• interpretable features may be disentangled
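
A minimal PyTorch autoencoder sketch with arbitrary sizes; encoding into fewer latent dimensions at finite precision discards information, which is what makes it lossy:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Compress to a low-dimensional latent code and reconstruct; the bottleneck is lossy."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, input_dim))

    def forward(self, x):
        z = self.encoder(x)            # compressed representation (e.g., usable for interpolation)
        return self.decoder(z), z

x = torch.rand(8, 784)
recon, z = Autoencoder()(x)
print(x.shape, "->", z.shape)          # 784 dims compressed to 32
```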

Created on 14 May 2022. Updated on 21 Jul 2022.