Wav2vec: Semi-supervised and Unsupervised Speech Recognition

Word2vec for audio from Facebook AI: quantizes phonemes, applies a transformer, and trains a GAN on text and audio.

Wav2vec is fascinating in that it combines several neural network architectures and methods: CNN, transformer, quantization, and GAN training. I bet you’ll enjoy this guide through Wav2vec papers solving the problem of speech to text.

There are many languages

  • want to convert audio to text
  • 7000 languages spoken today
    • 195 sovereign states
    • ~150 language groups
  • lack labelled data
  • humans learn without labels

Wav2vec 2.0

  • paper "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations" from Facebook AI, 2020
  • pre-train on ~800h of unlabeled data and fine-tune on ~100h of labeled data
  • SOTA in low-resource setting Libri-light
    • (all SOTA info is as of the paper discussed)
    • by a lot on WER clean test 100h labeled: others ~4 vs theirs ~2.5
    • WER = word-level, word-count normalized edit distance
  • SOTA on high-resource noisy data (3.3 vs 3.4)
    • close to SOTA on clean data
  • uses quantization as inductive bias for phonemes
Wav2vec 2.0 results on 100h-labels Libri-Light (source).
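
The WER definition above (word-level edit distance divided by the reference word count) can be sketched in a few lines of plain Python. The function name and the example sentences are made up for illustration, and real evaluations usually also normalize text (casing, punctuation) first, which this sketch skips.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between the reference prefix processed so far
    # and the first j hypothesis words
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substituted word and one dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {wer:.3f}")
```

Because insertions count as errors too, WER can exceed 1.0 when the hypothesis is much longer than the reference.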


Phoneme

  • a unit of sound in spoken languages, approximately 100ms long
  • for example in IPA: /sɪn/ (sin) and /sɪŋ/ (sing)
  • English ~40 phonemes


Quantization

  • related to tokenization in that it outputs a finite number of items from a dictionary
  • is used in Wav2vec, DALL-E 1, and VQ-VAE
  • replaces the input vector with the closest vector from a finite dictionary of vectors called a codebook
  • during training, the backward pass uses Gumbel softmax over the codebook to propagate the gradient
  • product quantization: concatenation of several quantizations followed by a linear transformation
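
The two ingredients above, nearest-codeword lookup and Gumbel softmax, can be sketched in plain Python. This is a toy illustration and not the Wav2vec code: the function names, the tiny codebook, and the temperature value are all assumptions made for the example, and a real model would do this on tensors with automatic differentiation.

```python
import math
import random


def nearest_codeword(vector, codebook):
    """Return (index, codeword) of the codebook entry closest to vector."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)),
                key=lambda i: squared_distance(vector, codebook[i]))
    return index, codebook[index]


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Sample soft, differentiable selection weights over codebook entries."""
    noisy = []
    for logit in logits:
        # Clamp the uniform sample away from 0 and 1 so both logs are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        noisy.append((logit + gumbel_noise) / temperature)
    largest = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(value - largest) for value in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 2-dimensional codebook with three entries.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
index, codeword = nearest_codeword([0.9, 0.1], codebook)
weights = gumbel_softmax([2.0, 0.5, -1.0], temperature=0.5, rng=random.Random(0))
print(index, codeword, weights)
```

A lower temperature pushes the weights closer to a one-hot choice, while the softmax keeps the selection differentiable with respect to the logits.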

Wav2vec Quantization works

  • codewords = product of 2 codebooks of 320 entries each: 320 × 320 = 102,400 ≈ 100k
  • codeword dimension of 256 (128 for each sub-codebook)
  • there is a high co-occurrence of certain codebook items and phoneme sounds
Co-occurrence between phonemes on y-axis and quantizations on x-axis (source). Each discrete representation is active mostly in the presence of a single phoneme.
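
The codeword arithmetic above can be checked with a small product quantization sketch. The splitting of the input vector into equal parts is the generic textbook variant and an assumption here (Wav2vec 2.0 selects entries through Gumbel softmax logits instead), the toy codebooks are invented, and the final linear transformation is left out.

```python
def product_quantize(vector, codebooks):
    """Split vector into len(codebooks) equal parts, quantize each, concatenate."""
    group_size = len(vector) // len(codebooks)
    indices, output = [], []
    for group, codebook in enumerate(codebooks):
        part = vector[group * group_size:(group + 1) * group_size]
        # Nearest entry of this group's codebook by squared Euclidean distance.
        index = min(
            range(len(codebook)),
            key=lambda i: sum((x - y) ** 2 for x, y in zip(part, codebook[i])),
        )
        indices.append(index)
        output.extend(codebook[index])
    return tuple(indices), output


# Numbers from the notes above: 2 groups, 320 entries each, 128 dimensions each.
NUM_GROUPS, ENTRIES_PER_GROUP, DIM_PER_GROUP = 2, 320, 128
num_codewords = ENTRIES_PER_GROUP ** NUM_GROUPS
codeword_dim = NUM_GROUPS * DIM_PER_GROUP
print(num_codewords, codeword_dim)

# Toy example with two hypothetical sub-codebooks of two 2-dimensional entries.
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
toy_indices, toy_codeword = product_quantize([0.9, 0.8, 0.1, 0.7], toy_codebooks)
print(toy_indices, toy_codeword)
```

The point of the product structure is that only 2 × 320 vectors are stored, yet the number of distinct concatenated codewords grows multiplicatively.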

Wav2vec 2.0 Architecture

Wav2vec-U architecture (source)

Wav2vec 2.0 Training

Connectionist Temporal Classification (CTC) Loss

  • between an unsegmented time series and a target sequence
  • CTCLoss sums probability of all possible alignments of input to target
  • differentiable with respect to each input node
  • Original CTC paper (Graves et al. 2006), pytorch docs
    • network returns probabilities of phonemes and blanks for each position
    • the mapping \( B \) collapses repeated labels, then removes all blanks from a path
    • for example \( B(a{-}ab{-}) = B({-}aa{-}{-}abb) = aab \)
    • this maps many paths \( \pi \in B^{-1}(l) \) to one output sequence \( l \)
    • probability of labeling \( l \) is the sum over all matching paths \( \pi \in B^{-1}(l) \)
    • \( p(l | x) = \sum_{\pi \in B^{-1}(l)} p(\pi | x) \)
    • efficiently calculated with dynamic programming (Forward–backward algorithm)
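
To make the CTC steps above concrete, here is a pure-Python sketch, not the PyTorch implementation. It defines the collapse mapping \( B \), computes \( p(l | x) \) by brute force over all paths, and computes it again with the forward dynamic program. The function names, the `-` blank symbol, and the toy probabilities are assumptions for illustration.

```python
from itertools import groupby, product

BLANK = "-"


def collapse(path):
    """The mapping B: merge repeated symbols, then drop blanks."""
    return "".join(symbol for symbol, _ in groupby(path) if symbol != BLANK)


def ctc_probability_brute_force(probs, alphabet, label):
    """Sum p(path | x) over every path that collapses to label (exponential cost)."""
    total = 0.0
    for path in product(range(len(alphabet)), repeat=len(probs)):
        if collapse(alphabet[k] for k in path) == label:
            p = 1.0
            for t, k in enumerate(path):
                p *= probs[t][k]
            total += p
    return total


def ctc_probability_forward(probs, alphabet, label):
    """Same quantity via the forward recursion over the blank-extended label."""
    blank = alphabet.index(BLANK)
    extended = [blank]  # blank, l1, blank, l2, ..., blank
    for ch in label:
        extended += [alphabet.index(ch), blank]
    size = len(extended)
    alpha = [0.0] * size
    alpha[0] = probs[0][extended[0]]
    if size > 1:
        alpha[1] = probs[0][extended[1]]
    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # A blank may be skipped only between two different labels.
            if s >= 2 and extended[s] != blank and extended[s] != extended[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][extended[s]]
        alpha = new
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


# Toy per-step distributions over the alphabet "-ab" for 4 time steps.
alphabet = "-ab"
probs = [
    [0.1, 0.6, 0.3],
    [0.5, 0.2, 0.3],
    [0.2, 0.3, 0.5],
    [0.4, 0.1, 0.5],
]
print(ctc_probability_brute_force(probs, alphabet, "ab"))
print(ctc_probability_forward(probs, alphabet, "ab"))
```

The brute-force version enumerates every path of length T, so it is only useful for checking the dynamic program on tiny inputs. The CTC loss is the negative logarithm of this probability.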

Wav2vec 2.0 vs vq-wav2vec

  • learns quantizations jointly with the rest of the model, in contrast to vq-wav2vec, which learns them separately
  • contrastive loss for quantizations:
    • transformer output compared to the embeddings in the codebook
    • contrastive distractors are quantizations from other masked time steps
    • \( - \log \frac{\exp(sim(c_t, q_t) / \kappa)}{\sum_{q \in Q_t} \exp(sim(c_t, q) / \kappa)} \)
  • diversity loss for codebook:
    • encourage even use of the whole codebook
    • loss is the negative entropy of the batch-averaged softmax over the codebook, so minimizing it maximizes entropy
  • reduced word error rate (WER) ~33% compared to vq-wav2vec
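
Both losses above fit in a short plain-Python sketch. Several things are assumptions for illustration: the function names, the toy vectors, cosine similarity as \( sim \), and the default temperature `kappa=0.1`. The diversity term is shown for a single codebook group and without the paper's normalization constants.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def contrastive_loss(context, target, distractors, kappa=0.1):
    """-log of the softmax score of the true quantization among all candidates."""
    candidates = [target] + list(distractors)
    logits = [cosine_similarity(context, q) / kappa for q in candidates]
    largest = max(logits)  # log-sum-exp with max subtraction for stability
    log_denominator = largest + math.log(sum(math.exp(v - largest) for v in logits))
    return log_denominator - logits[0]


def diversity_loss(batch_softmax):
    """Negative entropy of the batch-averaged distribution over codebook entries."""
    rows = len(batch_softmax)
    entries = len(batch_softmax[0])
    average = [sum(row[v] for row in batch_softmax) / rows for v in range(entries)]
    return sum(p * math.log(p) for p in average if p > 0.0)


# Context vector close to the true quantization gives a small loss.
good = contrastive_loss([1.0, 0.1], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
# Context vector close to a distractor gives a large loss.
bad = contrastive_loss([0.1, 1.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])

uniform_use = diversity_loss([[0.25, 0.25, 0.25, 0.25]] * 2)
collapsed_use = diversity_loss([[1.0, 0.0, 0.0, 0.0]] * 2)
print(good, bad, uniform_use, collapsed_use)
```

The diversity term is lowest when every codebook entry is used equally often, which is what pushes the model away from collapsing onto a few codewords.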


Wav2vec-U

  • paper “Unsupervised Speech Recognition”
  • published on arXiv on 24 May 2021
  • trains without any labeled data
  • inspired by other adversarial approaches
  • SOTA in unsupervised setting
  • not competitive with current supervised models
    • perhaps with models from 2018

Wav2vec-U Architecture

  • segments the representations with k-means clustering and mean-pools each segment into a single phoneme-unit embedding
  • generator is a single-layer CNN: ~90k params, kernel size 4, 512 dimensions
  • generative adversarial (GAN) training involves only the CNN
  • discriminator is also a CNN
Wav2vec-U architecture (source)
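
The segment pooling step can be sketched as follows, assuming each audio frame already has a k-means cluster id and that a segment is a run of consecutive frames with the same id. The function name and the toy data are hypothetical, and this shows only the core idea rather than the paper's full preprocessing pipeline.

```python
from itertools import groupby


def mean_pool_segments(frames, cluster_ids):
    """Average each run of consecutive frames that share a cluster id."""
    if len(frames) != len(cluster_ids):
        raise ValueError("need exactly one cluster id per frame")
    pooled = []
    position = 0
    for _, run in groupby(cluster_ids):
        length = len(list(run))
        segment = frames[position:position + length]
        # Column-wise mean over the frames of this segment.
        pooled.append([sum(column) / length for column in zip(*segment)])
        position += length
    return pooled


# Five hypothetical 2-dimensional frames; ids 7, 7 | 2 | 7, 7 form three segments.
frames = [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0], [0.0, 1.0], [2.0, 3.0]]
cluster_ids = [7, 7, 2, 7, 7]
segments = mean_pool_segments(frames, cluster_ids)
print(segments)
```

Pooling shortens the sequence from one vector per frame to roughly one vector per phoneme-like unit, which is what the small generator then maps to phonemes.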

Wav2vec-U Training

  • amazing! no labels needed
  • discriminator
    • fed phonemized natural text and generator output
    • tries to recognize which input is which
    • generator wins over time
    • generating a correct transcription is easier than hallucinating an incorrect one
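
The game between the two networks can be written as the standard GAN objective. This is a sketch in assumed notation: \( G \) is the generator, \( C \) the discriminator, \( S \) the segment representations of the audio, and \( P^r \) the phonemized real text. Any extra regularization terms used in the paper are left out.

\[ \min_G \max_C \; \mathbb{E}_{P^r}\left[ \log C(P^r) \right] + \mathbb{E}_{S}\left[ \log\left(1 - C(G(S))\right) \right] \]

The discriminator raises this value by scoring real phonemized text high and generator output low; the generator lowers it by producing phoneme sequences the discriminator cannot tell apart from real text.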



Created on 21 Jun 2021. Updated on: 17 Sep 2022.
Copyright © Vaclav Kosar. All rights reserved. Not investment, financial, medical, or any other advice. No guarantee of information accuracy.