Bits-Per-Byte and Bits-Per-Character

Bits-per-byte (bpb) and bits-per-character (bpc) are metrics used in data compression and language modelling. Both are closely related to the compression ratio.
  • compression ratio is defined as \( \mathrm{cmpRatio} = \mathrm{unCompressedBytes} / \mathrm{compressedBytes} \)
  • Bits-per-byte is defined as \( \mathrm{compressedBits} / \mathrm{unCompressedBytes} \)
  • Equivalently, bits-per-byte (bpb) is 8 divided by the compression ratio, because each compressed byte holds 8 bits: \( \mathrm{bpb} = 8 / \mathrm{cmpRatio} \).
  • Bits-per-character (bpc) equals bits-per-byte (bpb) when every character is stored as a single byte, as in extended ASCII.
  • The cross-entropy loss of a character-level language model, computed with log2 and averaged over a dataset, equals its bpc on that dataset.
  • Gzip compresses enwik8 to 2.92 bpb; Morse code needs approximately 10.8 bpc.
  • The SRU++ model achieves 1.02 bpc, which corresponds to a compression ratio of approximately 8 (8 / 1.02 ≈ 7.8).
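
The definitions above can be checked numerically. The following is a minimal sketch, assuming Python's standard zlib module as a stand-in for gzip (both use Deflate, but zlib writes a smaller header, so the numbers will differ slightly from the gzip command-line tool). The function name compression_stats and the sample text are hypothetical and chosen only for illustration.

```python
import zlib


def compression_stats(data: bytes, level: int = 9) -> dict:
    """Return compression ratio and bits-per-byte for zlib (Deflate) on data."""
    if not data:
        raise ValueError("data must not be empty")
    compressed = zlib.compress(data, level)
    # cmpRatio = unCompressedBytes / compressedBytes
    ratio = len(data) / len(compressed)
    # bpb = compressedBits / unCompressedBytes, with 8 bits per compressed byte
    bpb = 8 * len(compressed) / len(data)
    return {"ratio": ratio, "bpb": bpb}


# Illustrative, highly repetitive sample; real corpora compress far less well.
sample = b"the quick brown fox jumps over the lazy dog. " * 200
stats = compression_stats(sample)
print(f"compression ratio: {stats['ratio']:.2f}")
print(f"bits per byte:     {stats['bpb']:.3f}")
print(f"8 / ratio:         {8 / stats['ratio']:.3f}")
```

The last two printed values agree, which is the identity bpb = 8 / cmpRatio. Incompressible data gives a ratio near 1 and therefore close to 8 bpb.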

Deflate algorithm illustration with LZ77 and Huffman coding

Neural Data Compression

Data compression relies on the ability to predict the next symbol. Read more on neural data compression and its applications in machine learning here.
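
To illustrate the link between prediction and compression, here is a minimal sketch of a very simple predictor: an adaptive character-frequency model with add-one smoothing. It is an assumption made for illustration, not a neural model and not the method used by any compressor mentioned above. The average of -log2(p) over the text is the model's cross-entropy in bits per character, which is roughly the size an ideal entropy coder driven by this model would reach. The function name adaptive_bits_per_char and the alphabet size of 256 are hypothetical choices.

```python
import math
from collections import Counter


def adaptive_bits_per_char(text: str, alphabet_size: int = 256) -> float:
    """Average -log2 p(char) under an adaptive order-0 model with add-one smoothing.

    Assumes the text contains at most alphabet_size distinct characters.
    """
    if not text:
        raise ValueError("text must not be empty")
    counts = Counter()
    seen = 0
    total_bits = 0.0
    for ch in text:
        # Predict the next character from the counts seen so far.
        p = (counts[ch] + 1) / (seen + alphabet_size)
        total_bits += -math.log2(p)
        # Update the model only after the character has been "encoded".
        counts[ch] += 1
        seen += 1
    return total_bits / len(text)


predictable = "a" * 10_000
varied = "".join(chr(32 + i % 64) for i in range(10_000))
print(f"predictable text: {adaptive_bits_per_char(predictable):.3f} bpc")
print(f"varied text:      {adaptive_bits_per_char(varied):.3f} bpc")
```

The better the model predicts the next character, the fewer bits per character it needs. A stronger predictor, such as a neural language model, plays the same role with lower cross-entropy.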

Created on 20 May 2022.