# There are many languages

- want to convert audio to text
- 7000 languages spoken today
- 195 sovereign states
- ~150 language groups

- most languages lack labelled data
- yet humans learn language without labels

# Wav2vec 2.0

- “A Framework for Self-Supervised Learning of Speech Representations”
- Facebook AI
- on arXiv 22 Oct 2020
- pretrain on ~800h unlabeled data
- fine-tune ~100h labeled data
- SotA in the low-resource setting (Libri-light)
- (all SotA claims are as of the paper under discussion)
- by a large margin on clean test WER with 100h labeled data: ~4 for others vs ~2.5 here
- WER = word-level edit distance, normalized by the reference word count (see the sketch after this list)

- SotA on high-resource noisy data (3.3 vs 3.4 WER)
- close to SotA on clean data

- uses quantization as inductive bias for phonemes
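
As a concrete reference for the metric, here is a minimal WER sketch (a hypothetical helper, not code from the paper):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

assert wer("the cat sat", "the bat sat") == 1 / 3
```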

# Phoneme

- a unit of sound in spoken languages
- for example in IPA: /sɪn/ (sin) and /sɪŋ/ (sing)
- English ~40 phonemes

# Quantization

- replaces a continuous vector with one from a finite set
- the set of vectors is the “codebook”
- forward pass selects a single quantization vector
- backward pass uses Gumbel softmax over the codebook (straight-through estimator)
- product quantization:
- concatenation of several quantizations
- then linear transformation
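
A minimal PyTorch sketch of this scheme, assuming one codebook with 320 entries of dimension 128 (sizes borrowed from the next section); `gumbel_quantize` is a hypothetical name:

```python
import torch
import torch.nn.functional as F

def gumbel_quantize(logits, codebook):
    """Forward pass picks a single codeword (hard one-hot); backward pass
    flows gradients through the soft Gumbel softmax (straight-through)."""
    one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True)  # (B, V)
    return one_hot @ codebook                               # (B, D)

codebook = torch.randn(320, 128, requires_grad=True)
logits = torch.randn(4, 320, requires_grad=True)
q = gumbel_quantize(logits, codebook)  # (4, 128)
q.sum().backward()                     # gradients reach logits and codebook
```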

# How Wav2vec 2.0 Quantization Works

- codewords: product of 2 codebooks with 320 entries each gives 320² ≈ 102k combinations
- codeword dimension is 256 (128 from each sub-codebook, concatenated)
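
The combinatorics above as a quick PyTorch sketch (shapes only, not the paper's implementation):

```python
import torch

G, V, d = 2, 320, 128   # 2 codebooks, 320 entries each, 128 dims per entry
print(V ** G)           # 102400 possible codewords (~100k)

codebooks = [torch.randn(V, d) for _ in range(G)]
# select one entry per codebook and concatenate -> 256-dim quantized vector
idx = [torch.randint(V, (1,)) for _ in range(G)]
q = torch.cat([cb[i] for cb, i in zip(codebooks, idx)], dim=-1)
print(q.shape)          # torch.Size([1, 256])
```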

## Wav2vec 2.0 Architecture and Implementation

- 7-layer convolutional feature encoder applied to raw audio
- mask spans of the resulting latents
- contextualize via a 12-block transformer
- each masked position's transformer output predicts the quantized input at that step
- contrastive learning against quantized targets
- ablations showed quantization helps
- unsupervised pretraining, then fine-tuning on supervised data
- fine-tuning:
- add an output layer to predict characters
- trained with CTC loss

- original fairseq source
- HuggingFace implementation (pretraining not possible as of 2021-06)
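
A minimal inference sketch with the HuggingFace implementation; `facebook/wav2vec2-base-960h` is a published fine-tuned checkpoint, and the input audio here is a silent stand-in:

```python
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

audio = torch.zeros(16_000).numpy()  # stand-in for 1s of real 16 kHz speech
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits  # (batch, time, vocab)

pred_ids = torch.argmax(logits, dim=-1)         # greedy CTC decoding
print(processor.batch_decode(pred_ids))
```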

## Connectionist Temporal Classification (CTC) Loss

- between an unsegmented time series and a target sequence
- CTCLoss sums probability of all possible alignments of input to target
- differentiable with respect to each input node
- PyTorch docs: `torch.nn.CTCLoss`
- original CTC paper (Graves et al., 2006)
- network returns probabilities of phonemes and blanks for each position
- a collapsing function \( B \) removes all blanks and repeated labels from a path, e.g. \( B(a{-}ab{-}) = B({-}aa{-}{-}abb) = aab \)
- this maps many paths \( \pi \in B^{-1}(l) \) to one output sequence \( l \)
- the probability of labelling \( l \) is the sum over its matching paths:
- \( p(l \mid x) = \sum_{\pi \in B^{-1}(l)} p(\pi \mid x) \)
- efficiently calculated with dynamic programming (the forward–backward algorithm)
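
A sketch of both ideas: a toy implementation of the collapsing function \( B \), and `torch.nn.CTCLoss`, which performs the dynamic-programming sum; all shapes are made up:

```python
import torch
import torch.nn as nn

def collapse(path: str, blank: str = "-") -> str:
    """The collapsing function B: merge repeated labels, then drop blanks."""
    out, prev = [], None
    for ch in path:
        if ch != prev:
            out.append(ch)
        prev = ch
    return "".join(c for c in out if c != blank)

assert collapse("a-ab-") == collapse("-aa--abb") == "aab"

# torch.nn.CTCLoss sums p(pi | x) over all alignments via dynamic programming.
# Toy shapes: T=50 frames, C=20 classes (blank = index 0), batch N=2.
T, C, N = 50, 20, 2
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, C, (N, 12), dtype=torch.long)  # padded targets
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.tensor([10, 7])

loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # differentiable with respect to each input node
```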

## Wav2vec 2.0 vs previous version

- previous version: vq-wav2vec
- learns quantization jointly with the rest of the model instead of in a separate step
- contrastive loss:
- from transformer output to the codebook
- uses cosine similarity
- distractors are quantized latents sampled from other masked time steps
- \( -\log \frac{\exp(\mathrm{sim}(c_t, q_t)/\kappa)}{\sum_{\tilde{q} \in Q_t} \exp(\mathrm{sim}(c_t, \tilde{q})/\kappa)} \)
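
A minimal PyTorch sketch of this loss for a single masked time step (`contrastive_loss` is a hypothetical helper; the paper's batching and distractor sampling are omitted):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(c_t, q_t, distractors, kappa=0.1):
    """c_t: (D,) transformer output at the masked step;
    q_t: (D,) true quantized latent; distractors: (K, D)."""
    candidates = torch.cat([q_t.unsqueeze(0), distractors], dim=0)    # (K+1, D)
    sims = F.cosine_similarity(c_t.unsqueeze(0), candidates) / kappa  # (K+1,)
    # the true target sits at index 0; cross-entropy gives -log softmax prob
    return F.cross_entropy(sims.unsqueeze(0), torch.zeros(1, dtype=torch.long))

loss = contrastive_loss(torch.randn(256), torch.randn(256), torch.randn(100, 256))
```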

- diversity loss:
- encourages even use of the codebook
- maximizes the entropy of the batch-averaged softmax over the codebook
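
A sketch of such a penalty (hypothetical helper; the sign is chosen so that minimizing the loss maximizes entropy):

```python
import torch

def diversity_loss(codebook_logits):
    """codebook_logits: (B, V) logits over the V codebook entries."""
    avg_probs = codebook_logits.softmax(dim=-1).mean(dim=0)  # (V,)
    entropy = -(avg_probs * (avg_probs + 1e-7).log()).sum()
    return -entropy  # minimizing this encourages even codebook usage

loss = diversity_loss(torch.randn(64, 320))
```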

- reduced word error rate (WER) by ~33% compared to vq-wav2vec

# Wav2vec-U

- “Unsupervised Speech Recognition”
- on arXiv 24 May 2021
- trains without any labeled data
- inspired by other adversarial approaches
- SotA in the unsupervised setting
- not competitive with current supervised models
- but perhaps comparable to supervised models from 2018

## Wav2vec-U Architecture

- segment representations: wav2vec 2.0 features clustered with k-means, mean-pooled per segment
- the generator is a single-layer CNN
- ~90k params
- kernel size 4
- 512 dimension
- generative adversarial (GAN) training involves only the CNNs; the wav2vec 2.0 model stays frozen

- the discriminator is also a CNN
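
A sketch of a generator with these hyperparameters; the phoneme inventory size of 40 is an assumption (it varies per language):

```python
import torch
import torch.nn as nn

n_phonemes = 40  # assumed inventory size
generator = nn.Conv1d(in_channels=512, out_channels=n_phonemes, kernel_size=4)

feats = torch.randn(1, 512, 100)        # (batch, dim, segments)
phoneme_probs = generator(feats).softmax(dim=1)

# ~82k parameters with these numbers, the same ballpark as the ~90k above
print(sum(p.numel() for p in generator.parameters()))
```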

## Wav2vec-U Training

- amazing: no labels needed
- discriminator
- fed phonemized natural text and generator output
- tries to recognize which input is which
- the generator wins over time
- it is easier to generate correct transcriptions
- than to hallucinate incorrect ones consistently
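
A generic GAN training loop illustrating this setup; all architectures and dimensions here are stand-ins, not the paper's exact objective:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_phonemes = 40  # assumed, as above
generator = nn.Conv1d(512, n_phonemes, kernel_size=4)
discriminator = nn.Sequential(  # hypothetical small CNN critic
    nn.Conv1d(n_phonemes, 64, kernel_size=4), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, 1),
)
opt_g = torch.optim.Adam(generator.parameters(), lr=5e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=5e-4)

def step(segment_feats, real_phonemes):
    """One alternating GAN update: real text -> 1, generator output -> 0."""
    fake_phonemes = generator(segment_feats).softmax(dim=1)

    # discriminator step: learn to tell the two inputs apart
    d_real = discriminator(real_phonemes)
    d_fake = discriminator(fake_phonemes.detach())
    loss_d = (
        F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
        + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    )
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # generator step: make its output look like real phonemized text
    d_fake = discriminator(fake_phonemes)
    loss_g = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

# toy batch: 8 utterances of 50 segments; "real text" as one-hot phonemes
real = F.one_hot(torch.randint(n_phonemes, (8, 47)), n_phonemes).float()
step(torch.randn(8, 512, 50), real.transpose(1, 2))
```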