Wav2vec is fascinating in that it combines several neural network architectures and methods: CNNs, transformers, quantization, and GAN training. I bet you’ll enjoy this guide through the Wav2vec papers, which tackle the problem of converting speech to text.
There are many languages
- want to convert audio to text
- 7000 languages spoken today
- 195 sovereign states
- ~150 language groups
- lack labelled data
- humans learn without labels
Wav2vec 2.0
- “A Framework for Self-Supervised Learning of Speech Representations”
- Facebook AI
- on arXiv 22 Oct 2020
- pretrain on ~800h unlabeled data
- fine-tune ~100h labeled data
- SoTA in the low-resource Libri-light setting
- (all SoTA claims are as of the paper’s publication)
- by a large margin on clean test WER with 100h labeled: ~2.5 vs. ~4 for others
- WER = word-level, word-count-normalized edit distance (see the sketch after this list)
- SoTA on high-resource noisy data (3.3 vs. 3.4)
- close to SoTA on clean data
- uses quantization as an inductive bias for phonemes
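To make the WER definition concrete, here is a minimal sketch in plain Python; the helper names are mine, not from the paper:

```python
# WER sketch: word-level Levenshtein distance divided by reference word count.
def edit_distance(ref, hyp):
    # Classic dynamic programming over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words ≈ 0.33
```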

Phoneme
- a unit of sound in spoken languages
- for example in IPA: /sɪn/ (sin) and /sɪŋ/ (sing)
- English ~40 phonemes
Quantization
- replaces a continuous vector with one from a finite set
- the set of vectors is the “codebook”
- forward pass selects a single quantization vector
- backward pass uses Gumbel softmax over the codebook
- product quantization:
- concatenation of several quantizations
- then a linear transformation
How Wav2vec 2.0 quantization works
- codewords: the product of 2 codebooks with 320 entries each gives 320 × 320 ≈ 100k combinations
- codeword dimension of 256 (128 from each sub-codebook); see the sketch below
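A minimal sketch of this product quantization with a straight-through Gumbel softmax, in PyTorch. The group/entry/dimension numbers come from the bullets above; the input feature size of 512 and the temperature are my assumptions:

```python
# Product quantization sketch: G=2 codebooks, V=320 entries each,
# 128-dim per group -> 256-dim concatenated codeword, then a linear layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProductQuantizer(nn.Module):
    def __init__(self, in_dim=512, groups=2, entries=320, code_dim=128):
        super().__init__()
        self.groups, self.entries = groups, entries
        # One codebook per group: entries x code_dim vectors.
        self.codebooks = nn.Parameter(torch.randn(groups, entries, code_dim))
        # Project encoder features to one logit per (group, entry) pair.
        self.to_logits = nn.Linear(in_dim, groups * entries)
        # Final linear transformation over the concatenated codewords.
        self.out = nn.Linear(groups * code_dim, groups * code_dim)

    def forward(self, z):  # z: (batch, time, in_dim)
        b, t, _ = z.shape
        logits = self.to_logits(z).view(b, t, self.groups, self.entries)
        # hard=True: the forward pass picks a single codeword per group,
        # the backward pass flows through the soft Gumbel distribution.
        onehot = F.gumbel_softmax(logits, tau=2.0, hard=True, dim=-1)
        # Look up codewords and concatenate the groups: (b, t, 256).
        q = torch.einsum("btgv,gvd->btgd", onehot, self.codebooks)
        return self.out(q.reshape(b, t, -1))

q = ProductQuantizer()(torch.randn(4, 50, 512))
print(q.shape)  # torch.Size([4, 50, 256])
```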

Wav2vec 2.0 Architecture

Wav2vec 2.0 Implementation
- a 7-layer convolution encodes raw audio into latents
- mask spans of the latents
- contextualize via 12-block transformer
- the transformer output at each masked position predicts the quantized input there
- contrastive learning on quantized targets
- ablations showed quantization helps
- pretrained unsupervised, then fine-tuned on supervised data
- fine-tuning
- add output layer to predict characters
- uses CTC loss
- original source
- HuggingFace (pretraining not possible as of 2021-06; inference sketch below)
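For reference, transcribing audio with a fine-tuned checkpoint through the HuggingFace API takes only a few lines; a sketch, where the model choice and the silent dummy audio are just for illustration:

```python
# Inference sketch with HuggingFace transformers (fine-tuned CTC model).
# Requires: pip install transformers torch
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Raw 16 kHz mono samples; one second of silence as a stand-in.
audio = np.zeros(16000, dtype=np.float32)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits  # (batch, time, vocab)

# Greedy CTC decoding: argmax per frame, then collapse repeats and blanks.
ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(ids))
```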
Connectionist Temporal Classification (CTC) Loss
- between an unsegmented time series and a target sequence
- CTCLoss sums probability of all possible alignments of input to target
- differentiable with respect to each input node
- pytorch docs
- original CTC paper (Graves et al. 2006)
- network returns probabilities of phonemes and blanks for each position
- collapse repeated labels, then remove blanks, in each possible sequence
- for example \( B(a{-}ab{-}) = B({-}aa{-}{-}abb) = aab \)
- this maps many paths \( \pi \) to one output sequence \( l \): \( \pi \in B^{-1}(l) \)
- probability of label \( l \) is the sum over all matching paths:
- \( p(l | x) = \sum_{\pi \in B^{-1}(l)} p(\pi | x) \)
- efficiently calculated with dynamic programming (the forward–backward algorithm); see the sketch below
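A small sketch of the collapse map \( B \) and of pytorch’s `nn.CTCLoss`, which computes \( -\log p(l|x) \) via that dynamic program; all shapes here are arbitrary:

```python
# CTC sketch: the collapse map B, then PyTorch's built-in CTC loss.
import torch
import torch.nn as nn

def B(path, blank="-"):
    # Merge repeated labels, then drop blanks: B("-aa--abb") -> "aab".
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return "".join(out)

print(B("a-ab-"), B("-aa--abb"))  # aab aab

# nn.CTCLoss sums p(pi|x) over all paths pi with B(pi) = target, using
# the forward-backward algorithm; it is differentiable w.r.t. each input.
T, C, N, S = 50, 20, 4, 10  # time steps, classes, batch size, target length
log_probs = torch.randn(T, N, C).log_softmax(-1).requires_grad_()
targets = torch.randint(1, C, (N, S))  # index 0 is reserved for blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
loss.backward()
print(loss.item())
```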
Wav2vec 2.0 vs previous version
- previous version vq-wav2vec
- learns quantization jointly with the rest of the model instead of in a separate stage
- contrastive loss:
- from transformer output to the codebook
- uses cosine similarity
- distractors are other masked time steps
- \( -\log \frac{\exp(\mathrm{sim}(c_t, q_t) / \kappa)}{\sum_{\tilde{q} \in Q_t} \exp(\mathrm{sim}(c_t, \tilde{q}) / \kappa)} \)
- diversity loss:
- encourage even use of the codebook
- entropy of average softmax for the batch over the codebook
- reduced word error rate (WER) by ~33% compared to vq-wav2vec (both losses are sketched below)
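A sketch of both losses in PyTorch; the shapes, temperature \( \kappa \), and distractor count are illustrative, not the paper’s exact values:

```python
# Contrastive loss: cosine similarity between transformer output c_t and
# the true quantized latent q_t, against distractors drawn from other
# masked time steps; kappa is the temperature.
import torch
import torch.nn.functional as F

def contrastive_loss(c_t, q_t, distractors, kappa=0.1):
    # c_t: (dim,)  q_t: (dim,)  distractors: (K, dim)
    candidates = torch.cat([q_t.unsqueeze(0), distractors], dim=0)  # (K+1, dim)
    sims = F.cosine_similarity(c_t.unsqueeze(0), candidates, dim=-1) / kappa
    # Cross-entropy with the true codeword at index 0.
    return -F.log_softmax(sims, dim=0)[0]

def diversity_loss(probs):
    # probs: (batch*time, entries), softmax over one codebook.
    # Encourage even codebook use: maximize the entropy of the average
    # distribution, i.e. minimize its negative entropy.
    avg = probs.mean(dim=0)
    return (avg * torch.log(avg + 1e-7)).sum()

l_c = contrastive_loss(torch.randn(256), torch.randn(256), torch.randn(100, 256))
l_d = diversity_loss(torch.randn(640, 320).softmax(dim=-1))
print(l_c.item(), l_d.item())
```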
Wav2vec-U
- “Unsupervised Speech Recognition”
- on arXiv 24 May 2021
- trains without any labeled data
- inspired by other adversarial approaches
- SoTa in unsupervised setting
- not competitive with current supervised models
- but perhaps comparable to supervised models from around 2018
Wav2vec-U Architecture

- segment representations: wav2vec 2.0 features mean-pooled within k-means cluster segments
- generator is a single-layer CNN (see the parameter-count check after this list)
- ~90k params
- kernel size 4
- 512-dim input
- generative adversarial network (GAN) training involves only these small CNNs; the wav2vec 2.0 model stays frozen
- the discriminator is also a CNN
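The ~90k figure checks out for a single `Conv1d` from 512-dim segment features to phoneme logits; `n_phonemes = 44` is my illustrative guess, not the paper’s number:

```python
# Sanity check on the generator size: one 1-D convolution from 512-dim
# segment features to phoneme logits, kernel size 4.
import torch.nn as nn

n_phonemes = 44  # hypothetical phoneme inventory size
generator = nn.Conv1d(in_channels=512, out_channels=n_phonemes, kernel_size=4)
# 512 * 4 * 44 weights + 44 biases = 90,156 ≈ 90k
print(sum(p.numel() for p in generator.parameters()))
```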
Wav2vec-U Training
- amazing: no labels needed!
- discriminator
- fed phonemized natural text and generator output
- tries to recognize which input is which
- the generator wins over time
- it is easier to generate a correct transcription
- than to hallucinate an incorrect one (see the sketch below)
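A bare-bones sketch of one such adversarial step in PyTorch. This is only the core GAN objective; the paper adds further penalty terms, and all shapes and sizes here are illustrative:

```python
# One GAN training step: the discriminator sees phonemized real text and
# generated transcriptions; the generator tries to fool it.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_phonemes, dim, T = 44, 512, 30
generator = nn.Conv1d(dim, n_phonemes, kernel_size=4)
discriminator = nn.Sequential(nn.Conv1d(n_phonemes, 64, 4), nn.ReLU(),
                              nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                              nn.Linear(64, 1))
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

segments = torch.randn(8, dim, T)  # unlabeled speech segment features
real = F.one_hot(torch.randint(0, n_phonemes, (8, T - 3)),
                 n_phonemes).float().transpose(1, 2)  # phonemized text

# Discriminator step: tell real phonemized text from generated output.
fake = generator(segments).softmax(dim=1)
d_loss = (F.binary_cross_entropy_with_logits(discriminator(real), torch.ones(8, 1))
          + F.binary_cross_entropy_with_logits(discriminator(fake.detach()),
                                               torch.zeros(8, 1)))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: make its transcriptions look like real text.
g_loss = F.binary_cross_entropy_with_logits(discriminator(fake), torch.ones(8, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
print(d_loss.item(), g_loss.item())
```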
Discussions
Still not sure how the transformer model really works?
The transformer architecture stormed the ML world, including computer vision, thanks to its generality and GPU parallelizability on shorter sequences. Finally understand it over here, and if you still don’t get it, ask me a question!