Bits-per-byte (BPB) measures how many bits a compression program needs, on average, to encode the next symbol. If the program predicts perfectly, the next symbol is already obvious to it and it needs 0 bits, so \( \mathrm{bpb} = 0 \). If it is the worst possible, it must be handed the exact next symbol out of the vocabulary, so it needs \( \mathrm{bpb} = \log_2(\mathrm{vocabularySize}) \).
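As a quick illustration of the definition (the helper name `bits_per_byte` and the toy probabilities below are my own, not from any particular library), the cost of each symbol is \( -\log_2 \) of the probability the model assigned to the symbol that actually occurred, and bpb is the average of those costs:

```python
import math

def bits_per_byte(probs):
    """Average bits per symbol, given the model's predicted probability
    for each symbol that actually occurred in the data."""
    return sum(-math.log2(p) for p in probs) / len(probs)

# Perfect predictor: every next byte gets probability 1 -> 0 bpb
print(bits_per_byte([1.0, 1.0, 1.0]))   # 0.0

# Clueless predictor over a 256-symbol byte vocabulary -> log2(256) = 8 bpb
print(bits_per_byte([1 / 256] * 3))     # 8.0
```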
How does BPB relate to compression ratio and cross-entropy?
Bits-per-byte (BPB) and bits-per-character (BPC) are metrics used in both compression and language modeling, and they relate directly to compression ratio and cross-entropy: for extended ASCII text, BPC equals BPB, and the cross-entropy loss of a character-level model, taken in log2, equals BPC.
- Compression ratio is defined as \( \mathrm{cmpRatio} = \mathrm{unCompressedBytes} / \mathrm{compressedBytes} \)
- Bits-per-byte is defined as \( \mathrm{compressedBits} / \mathrm{unCompressedBytes} \)
- Bits-per-byte (bpb) is the inverse compression ratio multiplied by 8: \( \mathrm{bpb} = 8 / \mathrm{cmpRatio} \) (see the conversion sketch after this list)
- For extended ASCII characters, the bits-per-character (bpc) metric equals bits-per-byte (bpb)
- Cross-entropy loss (log2) for a character-level language model averaged over a dataset equals bpc.
- Perplexity is 2 raised to the power of the cross-entropy (in log2): \( PP = 2^{\mathrm{crossEntropy}} \)
- Gzip compresses enwik8 to 2.92 bpb; Morse code is approximately 10.8 bpc
- The SRU++ model achieves 1.02 bpc, a compression ratio of approximately 8
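To make the conversions in the list concrete, here is a minimal sketch of the arithmetic (the function names are mine, not from any source):

```python
def bpb_from_ratio(compression_ratio):
    """bpb = compressedBits / unCompressedBytes = 8 / cmpRatio."""
    return 8.0 / compression_ratio

def ratio_from_bpb(bpb):
    """Inverse of the relation above: cmpRatio = 8 / bpb."""
    return 8.0 / bpb

def perplexity_from_bpc(bpc):
    """For a character-level model, cross-entropy in bits equals bpc,
    so per-character perplexity is PP = 2 ** bpc."""
    return 2.0 ** bpc

# Gzip on enwik8 reaches 2.92 bpb, i.e. a compression ratio of roughly 2.7
print(ratio_from_bpb(2.92))        # ~2.74

# A model at 1.02 bpc (e.g. SRU++) corresponds to a compression ratio of ~8
print(ratio_from_bpb(1.02))        # ~7.84
print(perplexity_from_bpc(1.02))   # ~2.03 per-character perplexity

# And back again: a compression ratio of 8 is exactly 1 bpb
print(bpb_from_ratio(8.0))         # 1.0
```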
In short: BPC corresponds to BPB for extended ASCII characters, and when log2 is used in a character-level model, the cross-entropy loss is equivalent to BPC.
Neural Data Compression
Data compression relies on the ability to predict the next symbol. Read more on neural data compression and its applications in machine learning here.
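As a rough, hypothetical illustration of how prediction drives compression (the bigram model and example text below are mine, not from the linked article): an entropy coder such as an arithmetic coder can spend about \( -\log_2 P(\mathrm{nextChar} \mid \mathrm{context}) \) bits per character, so a model that predicts the next character well needs far fewer bits than the \( \log_2(\mathrm{alphabetSize}) \) of a uniform guess.

```python
import math
from collections import Counter, defaultdict

def bigram_bpc(text):
    """Estimate bits-per-character if each character were coded with
    -log2 P(char | previous char) from bigram counts over the text.
    Laplace smoothing over the text's alphabet avoids zero probabilities.
    Note: counting and measuring on the same text makes this an
    optimistic estimate; it is only meant to show the effect."""
    alphabet = sorted(set(text))
    counts = defaultdict(Counter)
    for prev, cur in zip(text, text[1:]):
        counts[prev][cur] += 1

    total_bits = 0.0
    for prev, cur in zip(text, text[1:]):
        context = counts[prev]
        p = (context[cur] + 1) / (sum(context.values()) + len(alphabet))
        total_bits += -math.log2(p)
    return total_bits / (len(text) - 1)

text = "the quick brown fox jumps over the lazy dog " * 20
print(bigram_bpc(text))           # well below the uniform-guess cost
print(math.log2(len(set(text))))  # bits per character with no prediction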