Vaclav Kosar
Software, Machine Learning, & Business

Bits-Per-Byte and Bits-Per-Character

Bits-per-byte (bpb) and bits-per-character (bpc) are metrics used in data compression and language modelling. Both are closely related to the compression ratio.
  • Compression ratio is defined as \( \mathrm{cmpRatio} = \mathrm{unCompressedBytes} / \mathrm{compressedBytes} \)
  • Bits-per-byte is defined as \( \mathrm{compressedBits} / \mathrm{unCompressedBytes} \)
  • Bits-per-byte (bpb) is the inverse compression ratio multiplied by 8, because each compressed byte holds 8 bits: \( \mathrm{bpb} = 8 / \mathrm{cmpRatio} \).
  • Bits-per-character (bpc) equals bits-per-byte (bpb) for extended ASCII text, where every character is encoded as exactly one byte.
  • The cross-entropy loss of a character-level language model, computed with log2 and averaged over a dataset, equals bpc.
  • Gzip compresses enwik8 to 2.92 bpb; Morse code needs approximately 10.8 bpc.
  • The SRU++ model achieves 1.02 bpc, which corresponds to a compression ratio of approximately 8.
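
The definitions above can be checked numerically. The following is a minimal sketch that assumes Python's standard `zlib` module (a Deflate implementation) as the compressor; the helper name `compression_stats` and the sample text are hypothetical and chosen only for illustration. It computes the compression ratio and bits-per-byte directly from byte counts, so the relation bpb = 8 / cmpRatio can be verified.

```python
import zlib


def compression_stats(data: bytes) -> tuple[float, float]:
    """Return (compression ratio, bits-per-byte) for zlib-compressed data."""
    compressed = zlib.compress(data, 9)  # level 9 = strongest zlib setting
    # cmpRatio = unCompressedBytes / compressedBytes
    cmp_ratio = len(data) / len(compressed)
    # bpb = compressedBits / unCompressedBytes
    bpb = 8 * len(compressed) / len(data)
    return cmp_ratio, bpb


# Illustrative, highly repetitive input; real text compresses far less.
sample = b"the quick brown fox jumps over the lazy dog. " * 200
cmp_ratio, bpb = compression_stats(sample)
print(f"compression ratio: {cmp_ratio:.2f}, bits-per-byte: {bpb:.3f}")
```

Because the sample is repetitive, the numbers it prints are not comparable to the enwik8 figures; the point is only that the two quantities are tied together by the factor of 8.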

Deflate algorithm illustration with LZ77 and Huffman coding

Neural Data Compression

Data compression relies on the ability to predict the next symbol: the better the prediction, the fewer bits are needed to encode it. Read more on neural data compression and its applications in machine learning here.
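
To illustrate the link between prediction and bits-per-character, here is a rough sketch using two count-based character models. This is an assumption made for illustration, not a neural model. The function names `unigram_bpc` and `bigram_bpc` are hypothetical. Both models are fitted and evaluated on the same text, so the numbers are optimistic training-set estimates. Each function averages \( -\log_2 p \) over the characters, which is the cross-entropy in bits per character.

```python
import math
from collections import Counter


def unigram_bpc(text: str) -> float:
    """Average -log2 p(c), with p estimated from character frequencies."""
    counts = Counter(text)
    total = len(text)
    return sum(-math.log2(counts[ch] / total) for ch in text) / total


def bigram_bpc(text: str) -> float:
    """Average -log2 p(c | previous char), with p estimated from pair counts."""
    pairs = list(zip(text, text[1:]))
    pair_counts = Counter(pairs)
    context_counts = Counter(text[:-1])  # how often each char acts as context
    return sum(
        -math.log2(pair_counts[(a, b)] / context_counts[a]) for a, b in pairs
    ) / len(pairs)


text = "data compression relies on the ability to predict the next symbol " * 20
print(f"unigram model: {unigram_bpc(text):.3f} bpc")
print(f"bigram model:  {bigram_bpc(text):.3f} bpc")
```

The bigram model conditions on the previous character, so on the same text it never needs more bits per character than the unigram model. A stronger predictor pushes the figure lower still.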

Created on 20 May 2022.
