Vaclav Kosar
Software, Machine Learning, & Business

Wav2vec: Semi and Unsupervised Speech Recognition

Audio Word2vec Guide - Quantizes phonemes, transforms, GAN trains on text and audio.

Wav2vec is fascinating in that it combines several neural network architectures and methods: CNN, transformer, quantization, and GAN training. I bet you’ll enjoy this guide through Wav2vec papers solving the problem of speech to text.

There are many languages

  • want to convert audio to text
  • 7000 languages spoken today
    • 195 sovereign states
    • ~150 language groups
  • lack labelled data
  • humans learn without labels

Wav2vec 2.0

  • “A Framework for Self-Supervised Learning of Speech Representations”
  • Facebook AI
  • On Arxiv 22 Oct 2020
  • pretrain on ~800h unlabeled data
  • fine-tune ~100h labeled data
  • SoTa in low-resource setting Libri-light
    • (all SoTa info is as of the paper discussed)
    • by a lot on WER clean test 100h labeled: others ~4 vs theirs ~2.5
    • WER = word-level, word-count normalized edit distance
  • SoTa on high-resource noisy data (3.3 vs 3.4)
    • close to SoTa on clean data
  • uses quantization as inductive bias for phonemes
Wav2vec 2.0 results on 100h-labels Libri-Light (source).
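To make the WER definition above concrete, here is a minimal sketch that computes a word-level edit distance and normalizes it by the reference word count. The function names `edit_distance` and `wer` are illustrative, not taken from the paper or any library, and real evaluation scripts usually also normalize casing and punctuation first.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference word
                            curr[j - 1] + 1,      # insert a hypothesis word
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
score = wer("the cat sat on the mat", "the cat sit on mat")
```

A WER of ~2.5 in the results above means 2.5%, so roughly one word error per 40 reference words.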


Phoneme

  • a unit of sound in spoken languages
  • for example in IPA: /sɪn/ (sin) and /sɪŋ/ (sing)
  • English ~40 phonemes


Quantization

  • replaces an input vector with a vector from a finite set
  • the set of vectors is “codebook”
  • forward pass selects single quantization vector
  • backward pass uses Gumbel softmax over the codebook
  • product quantization:
    • concatenation of several quantizations
    • then linear transformation
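The forward pass described above can be sketched in plain Python. This is only an illustration under assumptions: the toy codebooks, the identity `projection` matrix, the temperature `tau=2.0`, and all function names are made up here, and the backward pass (gradients flowing through the soft Gumbel-softmax probabilities) needs an autograd framework and is not shown.

```python
import math
import random


def gumbel(rng):
    """Sample standard Gumbel noise via the inverse-CDF trick."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
    return -math.log(-math.log(u))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def quantize_group(logits, codebook, tau, rng):
    """Pick one codebook vector; `soft` is what a backward pass would use."""
    soft = softmax([(l + gumbel(rng)) / tau for l in logits])
    index = max(range(len(soft)), key=soft.__getitem__)  # hard selection
    return codebook[index], index, soft


def product_quantize(group_logits, codebooks, projection, tau=2.0, seed=0):
    """Quantize each group, concatenate the picks, apply a linear map."""
    rng = random.Random(seed)
    concat, indices = [], []
    for logits, codebook in zip(group_logits, codebooks):
        vector, index, _ = quantize_group(logits, codebook, tau, rng)
        concat.extend(vector)
        indices.append(index)
    projected = [sum(w * x for w, x in zip(row, concat)) for row in projection]
    return projected, indices


# Toy setup: 2 codebooks with 3 entries of dimension 2 (hypothetical values).
codebooks = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]],
]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
group_logits = [[100.0, 0.0, 0.0], [0.0, 0.0, 100.0]]
quantized, chosen = product_quantize(group_logits, codebooks, identity)
```

With G codebooks of V entries each, the number of distinct concatenations is V to the power G, which is why two small codebooks are enough to cover a large discrete space.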

Wav2vec Quantization works

  • codewords = product of 2 codebooks of 320 entries each gives 320 × 320 = 102,400 (~100k)
  • codewords dimension of 256 (128 for both sub-codebooks)
Co-occurrence between phonemes on y-axis and quantizations on x-axis (source). Discrete representation is coded in presence of one phoneme most of the time.

Wav2vec 2.0 Architecture

Wav2vec-U architecture (source)

Wav2vec 2.0 Implementation

Connectionist Temporal Classification (CTC) Loss

  • between an unsegmented time series and a target sequence
  • CTCLoss sums probability of all possible alignments of input to target
  • differentiable with respect to each input node
  • pytorch docs
  • Original CTC paper (Graves et al., 2006)
    • network returns probabilities of phonemes and blanks for each position
    • \( B \) merges repeated labels and then removes all blanks from a path
    • for example \( B(a{-}ab{-}) = B({-}aa{-}{-}abb) = aab \)
    • many paths \( \pi \in B^{-1}(l) \) therefore map to one output sequence \( l \)
    • probability of label \( l \) is the sum over all matching paths \( \pi \in B^{-1}(l) \)
    • \( p(l | x) = \sum_{\pi \in B^{-1}(l)} p(\pi | x) \)
    • efficiently calculated with dynamic programming (Forward–backward algorithm)
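The CTC bullets above can be checked on a toy example. The sketch below implements the collapsing map \( B \), a brute-force sum over all paths, and the forward half of the dynamic program; both should give the same \( p(l | x) \). The names `collapse`, `ctc_prob_bruteforce`, `ctc_prob_forward` and the example probabilities are invented for illustration, and a real implementation (such as PyTorch's `CTCLoss`) works in log space for numerical stability.

```python
from itertools import product

BLANK = "-"


def collapse(path):
    """B: merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return "".join(out)


def ctc_prob_bruteforce(probs, alphabet, label):
    """Sum p(path | x) over every path that collapses to `label`."""
    total = 0.0
    for path in product(alphabet, repeat=len(probs)):
        if collapse(path) == label:
            p = 1.0
            for t, s in enumerate(path):
                p *= probs[t][s]
            total += p
    return total


def ctc_prob_forward(probs, label):
    """Same quantity via the forward recursion over the blank-extended label."""
    ext = [BLANK]
    for ch in label:
        ext += [ch, BLANK]
    alpha = [0.0] * len(ext)
    alpha[0] = probs[0][ext[0]]
    if len(ext) > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * len(ext)
        for s in range(len(ext)):
            a = alpha[s]                      # stay on the same symbol
            if s >= 1:
                a += alpha[s - 1]             # advance by one position
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[s - 2]             # skip the blank between labels
            new[s] = a * probs[t][ext[s]]
        alpha = new
    return alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)


# Hypothetical per-frame output distributions over {a, b, blank}.
frames = [
    {"a": 0.6, "b": 0.1, "-": 0.3},
    {"a": 0.2, "b": 0.3, "-": 0.5},
    {"a": 0.5, "b": 0.2, "-": 0.3},
    {"a": 0.1, "b": 0.7, "-": 0.2},
    {"a": 0.1, "b": 0.2, "-": 0.7},
]
p_slow = ctc_prob_bruteforce(frames, "ab-", "aab")
p_fast = ctc_prob_forward(frames, "aab")
```

The brute-force version enumerates every path, so its cost grows exponentially with the number of frames, while the forward recursion is linear in frames times label length.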

Wav2vec 2.0 vs previous version

  • previous version vq-wav2vec
  • jointly learn quantizations instead of separately
  • contrastive loss:
    • from transformer output to the codebook
    • uses similarity
    • distractors are other masked time steps
    • \( - \log \frac{\exp(sim(c_t, q_t) / \kappa)}{\sum_{q \in Q_t} \exp(sim(c_t, q) / \kappa)} \)
  • diversity loss:
    • encourage even use of the codebook
    • entropy of average softmax for the batch over the codebook
  • reduced word error rate (WER) ~33% compared to vq-wav2vec
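The two losses above can be sketched on plain Python lists. Assumptions to note: cosine similarity stands in for `sim`, the temperature `kappa=0.1` is an illustrative default, \( Q_t \) is taken to be the true quantized vector plus the distractors, and the function names are invented here. The sketch returns the entropy itself; training would maximize it (or minimize its negative) to encourage even codebook use.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(c_t, q_t, distractors, kappa=0.1):
    """-log softmax score of the true q_t among q_t plus the distractors."""
    candidates = [q_t] + list(distractors)
    logits = [cosine(c_t, q) / kappa for q in candidates]
    m = max(logits)  # subtract the max for a stable log-sum-exp
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_denominator)


def diversity_entropy(batch_probs):
    """Entropy of the batch-averaged softmax over codebook entries."""
    n = len(batch_probs)
    size = len(batch_probs[0])
    avg = [sum(p[v] for p in batch_probs) / n for v in range(size)]
    return -sum(p * math.log(p) for p in avg if p > 0.0)


# Context vector aligned with the true target gives a small loss ...
good = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
# ... while a context vector aligned with a distractor gives a large one.
bad = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

The entropy is largest when the averaged distribution is uniform over the codebook and zero when every item in the batch picks the same entry.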


Wav2vec-U

  • “Unsupervised Speech Recognition”
  • On Arxiv on 24 May 2021
  • trains without any labeled data
  • inspired by other adversarial approaches
  • SoTa in unsupervised setting
  • not competitive with current supervised models
    • perhaps with models from 2018

Wav2vec-U Architecture

Wav2vec-U architecture (source)
  • segment representations mean pooled clusters
  • Generator is single layer CNN
    • ~90k params
    • kernel size 4
    • 512 dimension
    • Generative adversarial (GAN) training involves only the CNN
  • discriminator is also a CNN
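As a rough sanity check on the generator size quoted above, the weight count of a single 1D convolution is input channels × output channels × kernel size. The sketch below assumes 512 input channels, kernel size 4, no bias, and a hypothetical output size of 44 phoneme classes. The 44 is an assumption chosen for illustration, not a figure from this post.

```python
def conv1d_params(in_channels, out_channels, kernel_size, bias=False):
    """Number of trainable parameters in one 1D convolution layer."""
    weights = in_channels * out_channels * kernel_size
    return weights + (out_channels if bias else 0)


# 512-dim inputs, kernel size 4, assumed 44 output phoneme classes.
generator_params = conv1d_params(in_channels=512, out_channels=44, kernel_size=4)
```

Under these assumptions the count lands close to the ~90k quoted above, which shows how small the adversarially trained part is compared to the pretrained wav2vec 2.0 encoder.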

Wav2vec-U Training

  • amazing! no labels needed
  • discriminator
    • fed phonemized natural text and generator output
    • tries to recognize which input is which
    • generator wins over time
    • easier to generate correct transcription
    • compared to hallucinating incorrect transcription


Still not sure how the transformer model really works?

The transformer architecture took the ML world by storm, including computer vision, thanks to its generality and GPU parallelizability on shorter sequences. Finally understand it over here, and if you still don’t get it, ask me a question!

Created on 21 Jun 2021.
