Bits-Per-Byte and Bits-Per-Character

Bits-per-byte (bpb) and bits-per-character (bpc) are metrics used in data compression and language modelling. Both are closely related to the compression ratio.
  • Compression ratio is defined as \( \mathrm{cmpRatio} = \mathrm{unCompressedBytes} / \mathrm{compressedBytes} \)
  • Bits-per-byte is defined as \( \mathrm{compressedBits} / \mathrm{unCompressedBytes} \)
  • Because compressed bits are 8 times the compressed bytes, the bits-per-byte (bpb) metric is 8 times the inverse compression ratio: \( \mathrm{bpb} = 8 / \mathrm{cmpRatio} \).
  • For extended ASCII text, where each character occupies exactly one byte, bits-per-character (bpc) equals bits-per-byte (bpb).
  • The cross-entropy loss of a character-level language model, computed with log2 and averaged over a dataset, equals bpc.
  • Gzip compresses enwik8 to 2.92 bpb; Morse code needs approximately 10.8 bpc.
  • The SRU++ model achieves 1.02 bpc, which corresponds to a compression ratio of approximately 8.
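
The definitions above can be checked numerically. The following is a minimal sketch, assuming Python's standard `zlib` module (a Deflate implementation) as the compressor and a made-up repetitive sample string; the helper names `compression_ratio`, `bits_per_byte`, and `ratio_from_bpb` are hypothetical and chosen only for illustration. The figures it prints for the sample string are not comparable to the enwik8 numbers quoted above.

```python
import zlib


def compression_ratio(uncompressed: bytes, compressed: bytes) -> float:
    """Uncompressed size divided by compressed size (both in bytes)."""
    return len(uncompressed) / len(compressed)


def bits_per_byte(uncompressed: bytes, compressed: bytes) -> float:
    """Compressed size in bits divided by uncompressed size in bytes."""
    return 8 * len(compressed) / len(uncompressed)


def ratio_from_bpb(bpb: float) -> float:
    """Invert bpb = 8 / cmpRatio to recover the compression ratio."""
    return 8 / bpb


# Hypothetical sample input: highly repetitive ASCII text, so it compresses well.
data = ("the quick brown fox jumps over the lazy dog " * 200).encode("ascii")
compressed = zlib.compress(data, 9)

ratio = compression_ratio(data, compressed)
bpb = bits_per_byte(data, compressed)

# The two printed bpb values agree, because bpb = 8 / cmpRatio.
print(f"ratio = {ratio:.2f}, bpb = {bpb:.3f}, 8 / ratio = {8 / ratio:.3f}")

# Converting the figures quoted in the list above:
print(f"{ratio_from_bpb(1.02):.2f}")  # 7.84, i.e. a ratio of roughly 8
print(f"{ratio_from_bpb(2.92):.2f}")  # 2.74
```

Because the sample is ASCII, one character is one byte, so the bpb value printed here is also the bpc of the sample.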

Deflate algorithm illustration with LZ77 and Huffman coding

Neural Data Compression

Data compression relies on the ability to predict the next symbol. Neural data compression and its applications in machine learning are covered in more detail in a separate post.
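
To illustrate the link between prediction and compression: an ideal entropy coder spends about \( -\log_2 p \) bits on a symbol that the model predicted with probability \( p \), so the average of that quantity over a text is the model's bpc. The sketch below assumes the simplest possible predictor, a unigram model that ignores context and uses character frequencies counted from the same text it is scored on. This is an illustrative assumption, not a neural model, and the function name `unigram_bpc` is hypothetical.

```python
import math
from collections import Counter


def unigram_bpc(text: str) -> float:
    """Average of -log2 p(c) over the characters of `text`.

    p(c) is the relative frequency of character c in `text` itself,
    so this is the cross-entropy (in bits per character) of a
    context-free model fitted to the very text it is evaluated on.
    """
    counts = Counter(text)
    total = len(text)
    return -sum(n * math.log2(n / total) for n in counts.values()) / total


# Two equally likely characters cost 1 bit each.
print(unigram_bpc("abab"))  # 1.0
# Four equally likely characters cost 2 bits each.
print(unigram_bpc("abcd"))  # 2.0
```

A model that conditions on preceding characters assigns higher probability to the character that actually follows, which lowers the average \( -\log_2 p \) and therefore the bpc.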

Created on 20 May 2022.
Copyright © Vaclav Kosar. All rights reserved. Not investment, financial, medical, or any other advice. No guarantee of information accuracy.