Wav2vec is fascinating in that it combines several neural network architectures and methods: CNNs, transformers, quantization, and GAN training. I bet you’ll enjoy this guide through the Wav2vec papers, which tackle the problem of converting speech to text.
There are many languages
- want to convert audio to text
- 7000 languages spoken today
- 195 sovereign states
- ~150 language groups
- lack labelled data
- humans learn without labels
Wav2vec 2.0
- paper “A Framework for Self-Supervised Learning of Speech Representations” from Facebook AI, 2020
- pretrain on ~800h of unlabeled data and fine-tune on ~100h of labeled data
- SOTA in low-resource setting Libri-light
- (all SOTA info is as of the paper discussed)
- by a lot on WER on the clean test set with 100h labeled: others ~4 vs. theirs ~2.5
- WER = word-level edit distance, normalized by the reference word count (tiny example after this list)
- SOTA on high-resource noisy data (3.3 vs 3.4)
- close to SOTA on clean data
- uses quantization as inductive bias for phonemes
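Since WER is the number quoted throughout, here is a minimal sketch of a word-level, word-count-normalized edit distance; the `wer` helper and the example sentences are my own illustration, not from the paper.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 0.33
```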
Phoneme
- a unit of sound in spoken language, approximately 100ms long
- for example in IPA: /sɪn/ (sin) and /sɪŋ/ (sing)
- English ~40 phonemes
Quantization
- related to tokenization in that it outputs a finite number of items from a dictionary
- is used in Wav2vec, DALL-E 1, and VQ-VAE
- replaces the input vector with the closest vector from a finite dictionary of vectors called the codebook
- during training, the backward pass uses a Gumbel softmax over the codebook to propagate gradients (sketch after this list)
- product quantization: concatenation of several quantizations then linear transformation
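A minimal PyTorch sketch of that idea: the forward pass makes a hard codeword choice, while the backward pass flows through a Gumbel softmax over the codebook. The codebook size, dimension, and temperature below are illustrative, not the paper’s exact configuration.

```python
import torch
import torch.nn.functional as F

codebook = torch.nn.Parameter(torch.randn(320, 128))  # 320 codewords of 128 dims (illustrative)

def quantize(logits: torch.Tensor, tau: float = 2.0) -> torch.Tensor:
    """Replace each input by one codebook vector, keeping gradients via Gumbel softmax.

    logits: (batch, 320) scores over the codebook produced by the encoder.
    """
    # hard=True makes a one-hot codeword choice in the forward pass,
    # but the backward pass uses the soft Gumbel-softmax distribution.
    one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)
    return one_hot @ codebook  # (batch, 128): the selected codewords

logits = torch.randn(4, 320, requires_grad=True)
quantized = quantize(logits)
quantized.sum().backward()  # gradient reaches `logits` despite the hard choice
print(quantized.shape, logits.grad.shape)
```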
Wav2vec Quantization works
- codewords = product of 2 codebooks with 320 entries each: 320 × 320 = 102,400 ≈ 100k (shape sketch after this list)
- codeword dimension of 256 (128 for each sub-codebook)
- there is a high co-occurrence between certain codebook items and phoneme sounds
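To make the codebook arithmetic concrete, a tiny sketch of the shapes; the indices and the final projection layer are made up for illustration.

```python
import torch

G, V, d = 2, 320, 128        # 2 sub-codebooks, 320 entries each, 128 dims per entry
print(V ** G)                # 102400 possible codewords (~100k)

codebooks = torch.randn(G, V, d)
choices = torch.tensor([[17, 254]])               # one index per sub-codebook for one time step
parts = [codebooks[g, choices[:, g]] for g in range(G)]
codeword = torch.cat(parts, dim=-1)               # concatenation -> 256 dims
print(codeword.shape)                             # torch.Size([1, 256])

project = torch.nn.Linear(G * d, G * d)           # the linear transformation after concatenation
print(project(codeword).shape)                    # torch.Size([1, 256])
```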
Wav2vec 2.0 Architecture
- pre-trained unsupervised, then fine-tuned on a supervised speech transcription task
- raw audio is tokenized by splitting it into ~25ms pieces that are fed into a 7-layer convolutional network
- the output is quantized against a fixed-size codebook
- embeddings are contextualized via a 12-block transformer
- original source here, HuggingFace (pretraining not possible as of 2021-06)
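A minimal usage sketch via HuggingFace transformers, assuming the public facebook/wav2vec2-base-960h checkpoint and that `speech` is a 1-D 16 kHz float array you have already loaded.

```python
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# `speech` is assumed to be raw audio sampled at 16 kHz.
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits    # (batch, time, vocab) character logits

predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))      # greedy CTC decoding to text
```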
Wav2vec 2.0 Training
- unsupervised pre-training:
- mask spans of the latent embeddings (masking sketch after this list)
- predict the masked quantized targets via contrastive learning
- ablations showed quantization helps
- fine-tuning
- add an output layer to predict characters
- uses CTC loss
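A rough sketch of the span masking over latent time steps; the defaults below are meant to roughly match the paper’s reported span length and start probability, but treat the helper itself as an illustration.

```python
import torch

def mask_spans(num_steps: int, span: int = 10, p_start: float = 0.065) -> torch.Tensor:
    """Return a boolean mask where True marks masked latent time steps.

    Each step is chosen as a span start with probability `p_start`, and the
    following `span` steps are masked; spans are allowed to overlap.
    """
    starts = torch.rand(num_steps) < p_start
    mask = torch.zeros(num_steps, dtype=torch.bool)
    for t in torch.nonzero(starts).flatten().tolist():
        mask[t : t + span] = True
    return mask

mask = mask_spans(200)
print(mask.sum().item(), "of 200 latent steps masked")
```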
Connectionist Temporal Classification (CTC) Loss
- defined between an unsegmented time series and a target sequence
- CTC loss sums the probability of all possible alignments of the input to the target
- differentiable with respect to each input node
- original CTC paper (Graves et al., 2006), PyTorch docs
- network returns probabilities of phonemes and blanks for each position
- remove all blanks and repeated labels from the candidate sequences (small example after this list)
- for example \( B(a{-}ab{-}) = B({-}aa{-}{-}abb) = aab \), where \( - \) is the blank
- this maps many paths \( \pi \in B^{-1}(l) \) to one output sequence \( l \)
- the probability of a label \( l \) is the sum over all matching paths:
- \( p(l | x) = \sum_{\pi \in B^{-1}(l)} p(\pi | x) \)
- efficiently calculated with dynamic programming (Forward–backward algorithm)
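As a concrete illustration, here is the collapse function \( B \) written by hand plus a call to torch.nn.CTCLoss from the linked PyTorch docs; the shapes and vocabulary size are arbitrary.

```python
import torch

def B(path: str, blank: str = "-") -> str:
    """Collapse a CTC path: merge repeated labels, then drop blanks."""
    out, prev = [], None
    for symbol in path:
        if symbol != prev and symbol != blank:
            out.append(symbol)
        prev = symbol
    return "".join(out)

print(B("a-ab-"), B("-aa--abb"))  # both collapse to "aab"

# CTCLoss sums the probability of all alignments that collapse to the target.
ctc = torch.nn.CTCLoss(blank=0)
logits = torch.randn(50, 1, 28, requires_grad=True)   # (time, batch, classes), class 0 = blank
log_probs = logits.log_softmax(dim=-1)
target = torch.randint(1, 28, (1, 10))                # a target of 10 labels
loss = ctc(log_probs, target, torch.tensor([50]), torch.tensor([10]))
loss.backward()                                       # differentiable w.r.t. the network outputs
```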
Wav2vec 2.0 vs vq-wav2vec
- jointly learns the quantization, in contrast to vq-wav2vec which learns it separately
- contrastive loss for quantizations:
- transformer output compared to the embeddings in the codebook
- contrastive distractors are quantized latents from other masked time steps
- \( -\log \frac{\exp(\mathrm{sim}(c_t, q_t) / \kappa)}{\sum_{\tilde{q} \in Q_t} \exp(\mathrm{sim}(c_t, \tilde{q}) / \kappa)} \)
- diversity loss for codebook:
- encourage even use of the whole codebook
- the loss is the negative entropy of the softmax averaged over the batch, pushing towards even codebook use (loss sketch after this list)
- reduced word error rate (WER) ~33% compared to vq-wav2vec
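A rough PyTorch sketch of both losses as I read them; the tensor shapes, the temperature, and the distractor sampling are simplified stand-ins, not the paper’s exact recipe.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(c_t, q_t, distractors, kappa: float = 0.1):
    """-log softmax of cosine similarity with the true quantized target q_t,
    against distractors drawn from other masked time steps."""
    candidates = torch.cat([q_t.unsqueeze(0), distractors], dim=0)  # true target first
    sims = F.cosine_similarity(c_t.unsqueeze(0), candidates, dim=-1) / kappa
    return -F.log_softmax(sims, dim=0)[0]

def diversity_loss(codebook_logits):
    """Negative entropy of the softmax averaged over a batch of time steps:
    minimizing it pushes towards even use of the whole codebook."""
    avg_probs = F.softmax(codebook_logits, dim=-1).mean(dim=0)      # (codebook_size,)
    return (avg_probs * torch.log(avg_probs + 1e-9)).sum()

c_t = torch.randn(256)               # transformer output at a masked step
q_t = torch.randn(256)               # its quantized target
distractors = torch.randn(100, 256)  # quantized latents from other masked steps
print(contrastive_loss(c_t, q_t, distractors))
print(diversity_loss(torch.randn(32, 320)))
```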
Wav2vec-U
- “Unsupervised Speech Recognition”
- on arXiv on 24 May 2021
- trains without any labeled data
- inspired by other adversarial approaches
- SOTA in unsupervised setting
- not competitive with current supervised models
- perhaps on par with supervised models from 2018
Wav2vec-U Architecture
- segment representations via k-means clustering and mean-pool each segment into a single phoneme-unit embedding (segmentation sketch after this list)
- Generator is single layer CNN: ~90k params, kernel size 4, 512 dimensions
- generative adversarial network (GAN) training involves only this small CNN
- the discriminator is also a CNN
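A rough sketch of the segmentation step using scikit-learn k-means; the cluster count, the feature dimension, and the run-based grouping of frames are simplifications of the paper’s procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

# Frame-level representations from the frozen wav2vec 2.0 encoder (random stand-ins here).
frames = np.random.randn(300, 512)

# Cluster frames; a segment is a run of consecutive frames with the same cluster id.
ids = KMeans(n_clusters=128, n_init=10).fit_predict(frames)

segments, start = [], 0
for t in range(1, len(ids) + 1):
    if t == len(ids) or ids[t] != ids[start]:
        # Mean-pool the frames of the segment into one phoneme-unit embedding.
        segments.append(frames[start:t].mean(axis=0))
        start = t

segment_embeddings = np.stack(segments)  # (num_segments, 512), fed to the generator
print(segment_embeddings.shape)
```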
Wav2vec-U Training
- amazing! no labels needed
- discriminator
- fed phonemized natural text and generator output
- tries to recognize which input is which
- the generator wins over time (rough training-loop sketch after this list)
- it is easier to generate a correct transcription
- than to hallucinate an incorrect one that fools the discriminator
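A heavily simplified sketch of the adversarial loop; the two Conv1d modules, the one-hot “phonemized text”, and the loss form are placeholders for the paper’s actual setup.

```python
import torch
import torch.nn.functional as F

# Placeholder 1-D CNNs: segment embeddings -> phoneme logits, phoneme sequence -> real/fake score.
generator = torch.nn.Conv1d(512, 40, kernel_size=4, padding=2)
discriminator = torch.nn.Conv1d(40, 1, kernel_size=4, padding=2)
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

segments = torch.randn(8, 512, 50)  # unlabeled audio segment embeddings
real_phonemes = F.one_hot(torch.randint(0, 40, (8, 50)), 40).float().permute(0, 2, 1)  # phonemized text
fake_phonemes = F.softmax(generator(segments), dim=1)  # generator's "transcriptions"

# Discriminator step: tell phonemized real text apart from generator output.
real_score = discriminator(real_phonemes).mean(dim=(1, 2))
fake_score = discriminator(fake_phonemes.detach()).mean(dim=(1, 2))
d_loss = F.binary_cross_entropy_with_logits(real_score, torch.ones(8)) + \
         F.binary_cross_entropy_with_logits(fake_score, torch.zeros(8))
d_opt.zero_grad()
d_loss.backward()
d_opt.step()

# Generator step: make the discriminator accept its transcriptions as real text.
g_loss = F.binary_cross_entropy_with_logits(discriminator(fake_phonemes).mean(dim=(1, 2)), torch.ones(8))
g_opt.zero_grad()
g_loss.backward()
g_opt.step()
```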
Discussions
Still not sure how the transformer model really works?
The transformer architecture stormed the ML world, including computer vision, thanks to its generality and GPU parallelizability on shorter sequences. Finally understand it over here, and if you still don’t get it, ask me a question!