Wav2vec is fascinating in that it combines several neural network architectures and methods: CNNs, transformers, quantization, and GAN training. I bet you’ll enjoy this guide through the Wav2vec papers, which tackle the problem of converting speech to text.
There are many languages
- want to convert audio to text
- 7000 languages spoken today
- 195 sovereign states
- ~150 language groups
- lack labelled data
- humans learn without labels
Wav2vec 2.0
- paper “A Framework for Self-Supervised Learning of Speech Representations” from Facebook AI, 2020
- pretrain on ~800h of unlabeled data and fine-tune on ~100h of labeled data
- SOTA in low-resource setting Libri-light
- (all SOTA info is as of the paper discussed)
- by a lot on WER on the clean test set with 100h labeled: others ~4 vs. theirs ~2.5
- WER = word-level edit distance, normalized by the reference word count (tiny example after this list)
- SOTA on high-resource noisy data (3.3 vs 3.4)
- close to SOTA on clean data
- uses quantization as inductive bias for phonemes
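Since WER is the number quoted throughout, here is a minimal sketch of a word-level, word-count-normalized edit distance; the `wer` helper and the example sentences are my own illustration, not from the paper.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 0.33
```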
Phoneme
- a unit of sound in spoken language, approximately 100ms long
- for example in IPA: /sɪn/ (sin) and /sɪŋ/ (sing)
- English ~40 phonemes
Quantization
- related to tokenization in that it outputs a finite number of items from a dictionary
- is used in Wav2vec, DALL-E 1, and VQ-VAE
- replaces the input vector with the closest vector from a finite dictionary of vectors called the codebook
- during training, the backward pass uses a Gumbel softmax over the codebook to propagate gradients (sketch after this list)
- product quantization: concatenation of several quantizations then linear transformation
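A minimal PyTorch sketch of that idea: the forward pass makes a hard codeword choice, while the backward pass flows through a Gumbel softmax over the codebook. The codebook size, dimension, and temperature below are illustrative, not the paper’s exact configuration.

```python
import torch
import torch.nn.functional as F

codebook = torch.nn.Parameter(torch.randn(320, 128))  # 320 codewords of 128 dims (illustrative)

def quantize(logits: torch.Tensor, tau: float = 2.0) -> torch.Tensor:
    """Replace each input by one codebook vector, keeping gradients via Gumbel softmax.

    logits: (batch, 320) scores over the codebook produced by the encoder.
    """
    # hard=True makes a one-hot codeword choice in the forward pass,
    # but the backward pass uses the soft Gumbel-softmax distribution.
    one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)
    return one_hot @ codebook  # (batch, 128): the selected codewords

logits = torch.randn(4, 320, requires_grad=True)
quantized = quantize(logits)
quantized.sum().backward()  # gradient reaches `logits` despite the hard choice
print(quantized.shape, logits.grad.shape)
```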
Wav2vec Quantization works
- codewords = product of 2 codebooks with 320 entries each: 320 × 320 = 102,400 ≈ 100k (shape sketch after this list)
- codeword dimension of 256 (128 for each sub-codebook)
- there is a high co-occurrence between certain codebook items and phoneme sounds
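To make the codebook arithmetic concrete, a tiny sketch of the shapes; the indices and the final projection layer are made up for illustration.

```python
import torch

G, V, d = 2, 320, 128        # 2 sub-codebooks, 320 entries each, 128 dims per entry
print(V ** G)                # 102400 possible codewords (~100k)

codebooks = torch.randn(G, V, d)
choices = torch.tensor([[17, 254]])               # one index per sub-codebook for one time step
parts = [codebooks[g, choices[:, g]] for g in range(G)]
codeword = torch.cat(parts, dim=-1)               # concatenation -> 256 dims
print(codeword.shape)                             # torch.Size([1, 256])

project = torch.nn.Linear(G * d, G * d)           # the linear transformation after concatenation
print(project(codeword).shape)                    # torch.Size([1, 256])
```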
Wav2vec 2.0 Architecture
- pre-trained unsupervised, then fine-tuned on a supervised speech transcription task
- raw audio is tokenized by splitting it into ~25ms pieces that are fed into a 7-layer convolutional network
- the output is quantized against a fixed-size codebook
- embeddings are contextualized via a 12-block transformer
- original source here, HuggingFace (pretraining not possible as of 2021-06)
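A minimal usage sketch via HuggingFace transformers, assuming the public facebook/wav2vec2-base-960h checkpoint and that `speech` is a 1-D 16 kHz float array you have already loaded.

```python
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# `speech` is assumed to be raw audio sampled at 16 kHz.
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits    # (batch, time, vocab) character logits

predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))      # greedy CTC decoding to text
```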
Wav2vec 2.0 Training
- unsupervised pre-training:
- mask spans of the latent embeddings (masking sketch after this list)
- predict the masked quantized targets via contrastive learning
- ablations showed quantization helps
- fine-tuning
- add an output layer to predict characters
- uses CTC loss
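A rough sketch of the span masking over latent time steps; the defaults below are meant to roughly match the paper’s reported span length and start probability, but treat the helper itself as an illustration.

```python
import torch

def mask_spans(num_steps: int, span: int = 10, p_start: float = 0.065) -> torch.Tensor:
    """Return a boolean mask where True marks masked latent time steps.

    Each step is chosen as a span start with probability `p_start`, and the
    following `span` steps are masked; spans are allowed to overlap.
    """
    starts = torch.rand(num_steps) < p_start
    mask = torch.zeros(num_steps, dtype=torch.bool)
    for t in torch.nonzero(starts).flatten().tolist():
        mask[t : t + span] = True
    return mask

mask = mask_spans(200)
print(mask.sum().item(), "of 200 latent steps masked")
```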
Connectionist Temporal Classification (CTC) Loss
- defined between an unsegmented time series and a target sequence
- CTC loss sums the probability of all possible alignments of the input to the target
- differentiable with respect to each input node
- original CTC paper (Graves et al., 2006), PyTorch docs
- network returns probabilities of phonemes and blanks for each position
- remove all blanks and repeated labels from the candidate sequences (small example after this list)
- for example \( B(a{-}ab{-}) = B({-}aa{-}{-}abb) = aab \), where \( - \) is the blank
- this maps many paths \( \pi \in B^{-1}(l) \) to one output sequence \( l \)
- the probability of a label \( l \) is the sum over all matching paths:
- \( p(l | x) = \sum_{\pi \in B^{-1}(l)} p(\pi | x) \)
- efficiently calculated with dynamic programming (Forward–backward algorithm)
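As a concrete illustration, here is the collapse function \( B \) written by hand plus a call to torch.nn.CTCLoss from the linked PyTorch docs; the shapes and vocabulary size are arbitrary.

```python
import torch

def B(path: str, blank: str = "-") -> str:
    """Collapse a CTC path: merge repeated labels, then drop blanks."""
    out, prev = [], None
    for symbol in path:
        if symbol != prev and symbol != blank:
            out.append(symbol)
        prev = symbol
    return "".join(out)

print(B("a-ab-"), B("-aa--abb"))  # both collapse to "aab"

# CTCLoss sums the probability of all alignments that collapse to the target.
ctc = torch.nn.CTCLoss(blank=0)
logits = torch.randn(50, 1, 28, requires_grad=True)   # (time, batch, classes), class 0 = blank
log_probs = logits.log_softmax(dim=-1)
target = torch.randint(1, 28, (1, 10))                # a target of 10 labels
loss = ctc(log_probs, target, torch.tensor([50]), torch.tensor([10]))
loss.backward()                                       # differentiable w.r.t. the network outputs
```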
Wav2vec 2.0 vs vq-wav2vec
- jointly learns the quantization, in contrast to vq-wav2vec which learns it separately
- contrastive loss for quantizations:
- transformer output compared to the embeddings in the codebook
- contrastive distractors are quantized latents from other masked time steps
- \( -\log \frac{\exp(\mathrm{sim}(c_t, q_t) / \kappa)}{\sum_{\tilde{q} \in Q_t} \exp(\mathrm{sim}(c_t, \tilde{q}) / \kappa)} \)
- diversity loss for codebook:
- encourage even use of the whole codebook
- the loss is the negative entropy of the softmax averaged over the batch, pushing towards even codebook use (loss sketch after this list)
- reduced word error rate (WER) ~33% compared to vq-wav2vec
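A rough PyTorch sketch of both losses as I read them; the tensor shapes, the temperature, and the distractor sampling are simplified stand-ins, not the paper’s exact recipe.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(c_t, q_t, distractors, kappa: float = 0.1):
    """-log softmax of cosine similarity with the true quantized target q_t,
    against distractors drawn from other masked time steps."""
    candidates = torch.cat([q_t.unsqueeze(0), distractors], dim=0)  # true target first
    sims = F.cosine_similarity(c_t.unsqueeze(0), candidates, dim=-1) / kappa
    return -F.log_softmax(sims, dim=0)[0]

def diversity_loss(codebook_logits):
    """Negative entropy of the softmax averaged over a batch of time steps:
    minimizing it pushes towards even use of the whole codebook."""
    avg_probs = F.softmax(codebook_logits, dim=-1).mean(dim=0)      # (codebook_size,)
    return (avg_probs * torch.log(avg_probs + 1e-9)).sum()

c_t = torch.randn(256)               # transformer output at a masked step
q_t = torch.randn(256)               # its quantized target
distractors = torch.randn(100, 256)  # quantized latents from other masked steps
print(contrastive_loss(c_t, q_t, distractors))
print(diversity_loss(torch.randn(32, 320)))
```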
Wav2vec-U
- “Unsupervised Speech Recognition”
- on arXiv on 24 May 2021
- trains without any labeled data
- inspired by other adversarial approaches
- SOTA in unsupervised setting
- not competitive with current supervised models
- perhaps on par with supervised models from 2018
Wav2vec-U Architecture
- segment representations via k-means clustering and mean-pool each segment into a single phoneme-unit embedding (segmentation sketch after this list)
- Generator is single layer CNN: ~90k params, kernel size 4, 512 dimensions
- generative adversarial network (GAN) training involves only this small CNN
- the discriminator is also a CNN
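A rough sketch of the segmentation step using scikit-learn k-means; the cluster count, the feature dimension, and the run-based grouping of frames are simplifications of the paper’s procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

# Frame-level representations from the frozen wav2vec 2.0 encoder (random stand-ins here).
frames = np.random.randn(300, 512)

# Cluster frames; a segment is a run of consecutive frames with the same cluster id.
ids = KMeans(n_clusters=128, n_init=10).fit_predict(frames)

segments, start = [], 0
for t in range(1, len(ids) + 1):
    if t == len(ids) or ids[t] != ids[start]:
        # Mean-pool the frames of the segment into one phoneme-unit embedding.
        segments.append(frames[start:t].mean(axis=0))
        start = t

segment_embeddings = np.stack(segments)  # (num_segments, 512), fed to the generator
print(segment_embeddings.shape)
```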
Wav2vec-U Training
- amazing! no labels needed
- discriminator
- fed phonemized natural text and generator output
- tries to recognize which input is which
- the generator wins over time (rough training-loop sketch after this list)
- it is easier to generate a correct transcription
- than to hallucinate an incorrect one that fools the discriminator
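A heavily simplified sketch of the adversarial loop; the two Conv1d modules, the one-hot “phonemized text”, and the loss form are placeholders for the paper’s actual setup.

```python
import torch
import torch.nn.functional as F

# Placeholder 1-D CNNs: segment embeddings -> phoneme logits, phoneme sequence -> real/fake score.
generator = torch.nn.Conv1d(512, 40, kernel_size=4, padding=2)
discriminator = torch.nn.Conv1d(40, 1, kernel_size=4, padding=2)
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

segments = torch.randn(8, 512, 50)  # unlabeled audio segment embeddings
real_phonemes = F.one_hot(torch.randint(0, 40, (8, 50)), 40).float().permute(0, 2, 1)  # phonemized text
fake_phonemes = F.softmax(generator(segments), dim=1)  # generator's "transcriptions"

# Discriminator step: tell phonemized real text apart from generator output.
real_score = discriminator(real_phonemes).mean(dim=(1, 2))
fake_score = discriminator(fake_phonemes.detach()).mean(dim=(1, 2))
d_loss = F.binary_cross_entropy_with_logits(real_score, torch.ones(8)) + \
         F.binary_cross_entropy_with_logits(fake_score, torch.zeros(8))
d_opt.zero_grad()
d_loss.backward()
d_opt.step()

# Generator step: make the discriminator accept its transcriptions as real text.
g_loss = F.binary_cross_entropy_with_logits(discriminator(fake_phonemes).mean(dim=(1, 2)), torch.ones(8))
g_opt.zero_grad()
g_loss.backward()
g_opt.step()
```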
Discussions
Still not sure how the transformer model really works?
The transformer architecture stormed the ML world, including computer vision, thanks to its generality and GPU parallelizability on shorter sequences. Finally understand it over here, and if you still don’t get it, ask me a question!