
Wav2vec: Semi and Unsupervised Speech Recognition

Audio Word2vec - Quantize phonemes, transform, GAN the text.

There are many languages

  • want to convert audio to text
  • 7000 languages spoken today
    • 195 sovereign states
    • ~150 language groups
  • lack labelled data
  • humans learn without labels

Wav2vec 2.0

  • “A Framework for Self-Supervised Learning of Speech Representations”
  • Facebook AI
  • on arXiv 22 Oct 2020
  • pretrain on ~800h unlabeled data
  • fine-tune ~100h labeled data
  • SoTA in the low-resource Libri-light setting
    • (all SoTA info is as of the paper discussed)
    • by a large margin on clean test WER with 100h labeled: others ~4 vs theirs ~2.5
    • WER = word-level, word-count normalized edit distance (see the sketch below the figure)
  • SoTA on high-resource noisy data (3.3 vs 3.4 WER)
    • close to SoTA on clean data
  • uses quantization as an inductive bias for phonemes
Wav2vec 2.0 results on 100h-labels Libri-Light (source).
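
WER is the word-level Levenshtein distance normalized by the reference word count. A minimal sketch of the metric in plain Python (illustrative only):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-array dynamic-programming edit distance over words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution
    return d[len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat down"))  # 1 error / 3 words ≈ 0.33
```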

Phoneme

  • a unit of sound in spoken languages
  • for example in IPA: /sɪn/ (sin) and /sɪŋ/ (sing)
  • English ~40 phonemes

Quantization

  • replaces a continuous vector with a vector from a finite set
  • the set of vectors is the “codebook”
  • forward pass selects a single quantization vector
  • backward pass uses Gumbel softmax over the codebook
  • product quantization:
    • concatenation of several quantizations
    • then linear transformation
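
A minimal PyTorch sketch of that selection, hard in the forward pass and relaxed via Gumbel softmax in the backward pass (sizes here are illustrative, not the paper's):

```python
import torch
import torch.nn.functional as F

num_codewords, dim = 16, 8                 # illustrative sizes
codebook = torch.nn.Parameter(torch.randn(num_codewords, dim))

def quantize(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    # Forward: hard one-hot pick of a single codeword.
    # Backward: gradients flow through the Gumbel-softmax relaxation.
    one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)
    return one_hot @ codebook

logits = torch.randn(1, num_codewords, requires_grad=True)  # one time step
q = quantize(logits)  # (1, dim): exactly one row of the codebook
```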

How Wav2vec Quantization Works

  • codewords: the product of 2 codebooks of 320 entries gives 320² ≈ 100k (sketch below the figure)
  • codewords have dimension 256 (128 from each sub-codebook)
Co-occurrence between phonemes on y-axis and quantizations on x-axis (source). Each discrete representation co-occurs with a single phoneme most of the time.
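
With the sizes above, product quantization can be sketched as two independent Gumbel-softmax picks that are concatenated and linearly mixed; treating 256 as the projection output width is my assumption:

```python
import torch
import torch.nn.functional as F

groups, entries, sub_dim = 2, 320, 128     # sizes from the bullets above
codebooks = torch.nn.Parameter(torch.randn(groups, entries, sub_dim))
project = torch.nn.Linear(groups * sub_dim, 256)  # output width assumed

def product_quantize(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    # logits: (batch, groups, entries), one distribution per sub-codebook
    one_hot = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)
    # pick one 128-dim codeword per group: (batch, groups, sub_dim)
    picked = torch.einsum('bge,ged->bgd', one_hot, codebooks)
    # concatenate the two picks and mix them linearly
    return project(picked.flatten(start_dim=1))

print(entries ** 2)  # 102400 possible codewords, the "~100k" above
```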

Wav2vec 2.0 Architecture

Wav2vec 2.0 architecture (source)
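
A rough PyTorch paraphrase of the pipeline: a convolutional feature encoder produces latents, masked spans go through a Transformer context network, and the latents are also quantized into training targets. Layer counts and sizes below are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

class Wav2Vec2Sketch(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # Feature encoder: temporal convolutions turn raw audio into latents z.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=3, stride=2), nn.GELU(),
        )
        # Context network: Transformer turns masked latents into context c.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.context = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, waveform: torch.Tensor, mask: torch.Tensor):
        # waveform: (batch, 1, samples); mask: (batch, frames) boolean
        z = self.encoder(waveform).transpose(1, 2)  # (batch, frames, dim)
        # The paper uses a learned mask embedding; zeroing is a simplification.
        c = self.context(z.masked_fill(mask.unsqueeze(-1), 0.0))
        # Training pulls c_t toward the quantized target q_t at masked steps
        # via the contrastive loss shown further below.
        return z, c
```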

Wav2vec 2.0 Implementation

Connectionist Temporal Classification (CTC) Loss

  • between an unsegmented time series and a target sequence
  • CTCLoss sums probability of all possible alignments of input to target
  • differentiable with respect to each input node
  • pytorch docs
  • Original CTC paper (Graves et al. 2006)
    • network returns probabilities of phonemes and blanks for each position
    • remove all blanks and repeated labels from the possible sequences
    • for example \( B(a{-}ab{-}) = B({-}aa{-}{-}abb) = aab \)
    • this maps many paths \( \pi \in B^{-1}(l) \) to one output sequence \( l \)
    • probability of labelling \( l \) is the sum over all matching paths \( \pi \in B^{-1}(l) \)
    • \( p(l | x) = \sum_{\pi \in B^{-1}(l)} p(\pi | x) \)
    • efficiently calculated with dynamic programming (Forward–backward algorithm)
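
A minimal usage example of the built-in PyTorch loss, with shapes following the PyTorch docs (all sizes illustrative):

```python
import torch
import torch.nn as nn

T, N, C = 50, 2, 20   # input length, batch size, classes (index 0 = blank)
S = 10                # target sequence length
log_probs = torch.randn(T, N, C).log_softmax(dim=2).requires_grad_()
targets = torch.randint(low=1, high=C, size=(N, S))   # 0 reserved for blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)  # sums over all alignments by dynamic programming
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()            # differentiable w.r.t. every input position
```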

Wav2vec 2.0 vs previous version

  • previous version: vq-wav2vec
  • learns quantization jointly instead of in a separate step
  • contrastive loss:
    • from transformer output to the codebook
    • uses similarity
    • distractors are other masked time steps
    • \( -\log \frac{\exp(\mathrm{sim}(c_t, q_t) / \kappa)}{\sum_{\tilde{q} \in Q_t} \exp(\mathrm{sim}(c_t, \tilde{q}) / \kappa)} \)
  • diversity loss:
    • encourage even use of the codebook
    • entropy of average softmax for the batch over the codebook
  • reduced word error rate (WER) by ~33% compared to vq-wav2vec
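
A sketch of both terms under the definitions above, with cosine similarity as \( \mathrm{sim} \) and \( \kappa \) as the temperature (shapes and the epsilon are my choices):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(c_t, q_t, distractors, kappa: float = 0.1):
    # c_t: (dim,) transformer output at a masked step
    # q_t: (dim,) quantized target for the same step
    # distractors: (K, dim) quantized vectors from other masked steps
    candidates = torch.cat([q_t.unsqueeze(0), distractors])  # true target first
    sims = F.cosine_similarity(c_t.unsqueeze(0), candidates) / kappa
    # negative log-probability of the true target among the candidates
    return -F.log_softmax(sims, dim=0)[0]

def diversity_loss(probs):
    # probs: (batch * time, codewords) softmax over the codebook
    avg = probs.mean(dim=0)                      # average codeword usage
    entropy = -(avg * (avg + 1e-7).log()).sum()  # high when usage is even
    return -entropy                              # minimized when usage is even
```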

Wav2vec-U

  • “Unsupervised Speech Recognition”
  • on arXiv on 24 May 2021
  • trains without any labeled data
  • inspired by other adversarial approaches
  • SoTA in the unsupervised setting
  • not competitive with current supervised models
    • roughly on par with supervised models from 2018

Wav2vec-U Architecture

Wav2vec-U architecture (source)
  • segment representations are mean-pooled clustered audio features
  • Generator is single layer CNN
    • ~90k params
    • kernel size 4
    • 512 dimension
    • generative adversarial (GAN) training updates only this CNN
  • discriminator is also a CNN
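
A hedged sketch of the two small CNNs with the sizes above; the phoneme vocabulary size and the discriminator width are assumptions:

```python
import torch.nn as nn

dim, num_phonemes = 512, 40  # 512 from above; phoneme count assumed

# Generator: a single 1-D convolution (kernel size 4, as above) mapping
# segment representations to scores over phonemes; ~82k parameters here,
# in the stated ~90k ballpark.
generator = nn.Conv1d(dim, num_phonemes, kernel_size=4, padding='same')

# Discriminator: also a small CNN; it scores whole phoneme sequences,
# whether from real phonemized text or from the generator.
discriminator = nn.Sequential(
    nn.Conv1d(num_phonemes, 256, kernel_size=4, padding='same'), nn.GELU(),
    nn.Conv1d(256, 1, kernel_size=4, padding='same'),
)
```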

Wav2vec-U Training

  • amazing: no labels needed
  • discriminator
    • fed phonemized natural text and generator output
    • tries to recognize which input is which
    • generator wins over time
    • easier to generate a correct transcription
    • than to hallucinate an incorrect one
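
A minimal adversarial step reusing the generator and discriminator sketched above; a plain binary real/fake objective is my simplification (the paper adds regularizers such as a gradient penalty):

```python
import torch
import torch.nn.functional as F

def gan_step(segments, real_phonemes, g_opt, d_opt):
    # segments: (batch, dim, time) mean-pooled segment representations
    # real_phonemes: (batch, num_phonemes, time) one-hot phonemized text
    batch = segments.size(0)
    fake = F.softmax(generator(segments), dim=1)

    # Discriminator: tell real phonemized text apart from generator output.
    d_loss = (
        F.binary_cross_entropy_with_logits(
            discriminator(real_phonemes).mean(dim=(1, 2)), torch.ones(batch))
        + F.binary_cross_entropy_with_logits(
            discriminator(fake.detach()).mean(dim=(1, 2)), torch.zeros(batch))
    )
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator: win by producing transcriptions the discriminator accepts.
    g_loss = F.binary_cross_entropy_with_logits(
        discriminator(fake).mean(dim=(1, 2)), torch.ones(batch))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```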


21 Jun 2021