I had a look at the norms of FastText embeddings and wrote up the results as a paper-style post. Full code is here.
Word embeddings trained on large unlabeled corpora are useful for many natural language processing tasks. FastText (Bojanowski et al., 2016), in contrast to the Word2vec model, accounts for sub-word information by also embedding sub-word n-grams. The FastText word representation is the word embedding vector plus the sum of the vectors of the n-grams contained in the word. Word2vec vector norms have been shown (Schakel & Wilson, 2015) to be correlated with word significance. This blog post visualizes the vector norms of FastText embeddings and evaluates the use of the FastText word vector norm multiplied by the number of word n-grams for detecting non-English OOV words.
FastText embeds a word by adding the vectors of the word’s n-grams to the word embedding and then normalizing by the total token count, i.e. $\mathrm{fastText}(\mathrm{word}) = \left(v_{\mathrm{word}} + \sum_{g \in \mathrm{ngrams}(\mathrm{word})} v_g\right) / \left(1 + |\mathrm{ngrams}(\mathrm{word})|\right)$. However, if the word is not present in the dictionary (OOV), only the n-grams are used, i.e. $\left(\sum_{g \in \mathrm{ngrams}(\mathrm{word})} v_g\right) / |\mathrm{ngrams}(\mathrm{word})|$. For the purpose of studying OOV words, this asymmetry between in-vocabulary and out-of-vocabulary words is removed by using only the word’s n-grams, regardless of whether the word is OOV or not.
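The two formulas above can be sketched in a few lines of plain Python. This is a toy illustration, not the FastText or Gensim implementation: the function names, the dictionary-based lookup tables, and the 2-dimensional vectors are all made up for the example. Real FastText hashes n-grams into a fixed number of buckets rather than storing them in a dictionary; the n-gram length range of 3 to 6 follows Bojanowski et al. (2016).

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of `word` wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def fasttext_vector(word, word_vectors, ngram_vectors,
                    ngrams_only=False, min_n=3, max_n=6):
    """Average the n-gram vectors, plus the word's own vector if it is in the vocabulary.

    With ngrams_only=True the word's own vector is skipped, so in-vocabulary
    and OOV words are treated the same way.
    """
    parts = [ngram_vectors[g] for g in char_ngrams(word, min_n, max_n)]
    if not ngrams_only and word in word_vectors:
        parts.append(word_vectors[word])
    return mean_vector(parts)


# Hypothetical toy lookup tables, restricted to 3-grams to keep them small.
toy_ngram_vectors = {"<ca": [1.0, 0.0], "cat": [0.0, 1.0], "at>": [1.0, 1.0]}
toy_word_vectors = {"cat": [2.0, 2.0]}

in_vocab = fasttext_vector("cat", toy_word_vectors, toy_ngram_vectors,
                           min_n=3, max_n=3)
ngram_only = fasttext_vector("cat", toy_word_vectors, toy_ngram_vectors,
                             ngrams_only=True, min_n=3, max_n=3)
print(in_vocab, ngram_only)
```

The `ngrams_only=True` call corresponds to the symmetric treatment used in the rest of this post.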
To study the contrast between common English words, e.g. “apple”, and noise-words (usually parsing artifacts or unusual tokens with a very specific meaning), e.g. “wales-2708” or “G705”, the MIT 10K Common words dataset is used.
Standard Vector Norm
The standard vector norm as defined in the Gensim implementation is used in this section. Common words are located mostly on the right of the term-frequency spectrum and are clustered in three different areas of the norm spectrum. Taking both axes together, common words are clustered in approximately four areas. It would be interesting to investigate what those clusters correspond to.
From the samples below, it is not clear what the clusters correspond to:
- bottom left cluster: now, three, month, News, Big, picked, votes, signature, Challenge, Short, trick, Lots, 68, priorities, upgrades
- bottom right cluster: our, home, game, won, control, law, common, Street, speed, Tuesday, direct, helped, passed, condition, Date, signed
- middle right cluster: via, companies, necessary, straight, menu, kinds, Championship, relief, periods, Prize, minimal, Rated, 83, wears
- top right cluster: position, wonderful, shooting, switch, â, Atlantic, ladies, vegetables, tourist, HERE, prescription, upgraded, Evil
No N-Gram Norm
As mentioned above, each FastText vocabulary word has its own vector representation, regardless of its size. The norms of those vectors are plotted in this section. The shape of the distribution seems to closely match the shape of the same plot for Word2Vec (Schakel & Wilson, 2015). The use of the vector norm as a measure of word significance seems to hold for FastText as well in terms of this norm, as can be seen from the labeled samples in the scatter plot.
NG_Norm (N-Grams Times Count Norm)
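As mentioned above, FastText averages the vectors it uses. However, for the detection of noise-words, the number of n-grams seems to be useful. For that purpose NG_Norm is defined as $\mathrm{ng\_norm}(\mathrm{word}) = \left\| \sum_{g \in \mathrm{ngrams}(\mathrm{word})} v_g \right\|$, which equals the norm of the n-gram average multiplied by the n-gram count. Using this norm, common words are clustered in a narrower band on the ng_norm axis.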
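To make the difference between the averaged norm and NG_Norm concrete, here is a minimal sketch in plain Python. The helper names and the toy vectors are made up for the example and are not taken from the full code; the point is only that summing instead of averaging scales the norm by the number of n-grams.

```python
import math


def l2_norm(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def vector_sum(vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(column) for column in zip(*vectors)]


def ng_norm(ngram_vecs):
    """Norm of the summed (not averaged) n-gram vectors."""
    return l2_norm(vector_sum(ngram_vecs))


def mean_ngram_norm(ngram_vecs):
    """Norm of the averaged n-gram vectors, as in the n-gram-only FastText vector."""
    count = len(ngram_vecs)
    return l2_norm([x / count for x in vector_sum(ngram_vecs)])


# Hypothetical toy data: one word with a single n-gram, one with four.
short_word_ngrams = [[3.0, 4.0]]
long_word_ngrams = [[3.0, 4.0]] * 4

for vecs in (short_word_ngrams, long_word_ngrams):
    print(len(vecs), mean_ngram_norm(vecs), ng_norm(vecs))
```

Both toy words have the same averaged norm, but their NG_Norm values differ by a factor equal to the ratio of their n-gram counts. That count is the extra signal a long token such as “wales-2708” carries.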
The distribution aggregated along the ng_norm axis is plotted in the histogram below.
The probability of a given FastText vocabulary word being a common word is plotted below. The distribution is well approximated by a t-distribution.
The ability to detect noise-words is evaluated below on the simple task of splitting two concatenated words back apart. For example, let’s split the concatenation ‘inflationlithium’:
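A minimal sketch of one way such a split could be scored, under stated assumptions: every split point is tried, and the one where both halves look most like common words wins. The scoring rule (the product of the two scores), the function names, and the toy probabilities are all assumptions made for illustration. In the actual experiment the score would come from the ng_norm-based probability of a word being common, which is replaced here by a hard-coded lookup.

```python
def best_split(text, score):
    """Return the (left, right) split that maximizes score(left) * score(right)."""
    candidates = [(text[:i], text[i:]) for i in range(1, len(text))]
    return max(candidates, key=lambda pair: score(pair[0]) * score(pair[1]))


# Hypothetical stand-in for "probability that a token is a common word".
# The numbers are invented for this example and are not measured values.
TOY_SCORES = {"inflation": 0.9, "lithium": 0.8, "in": 0.7}


def toy_common_probability(token):
    """Look up the toy score, with a low default for anything unknown."""
    return TOY_SCORES.get(token, 0.05)


print(best_split("inflationlithium", toy_common_probability))
```

The accuracy figure that follows comes from the full code linked at the top of the post, not from this toy sketch.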
The above approach yielded around 48% accuracy on 3000 random two-word samples from the MIT 10K common words. In this specific case, searching the vocabulary would be more efficient than calculating vector norms. A more appropriate comparison, however, would be on a more general task involving OOV words, e.g. against edit distance applied to OOV words and words with typos.
FastText vector norms and their relation to term frequency were visualized and investigated in this post.
The Standard Norm vs. term-frequency plot revealed potentially interesting clustering of common-word vectors in three to four main areas.
The No-N-Gram Norm has a Norm-TF distribution very similar to the one shown for Word2Vec in (Schakel & Wilson, 2015). The correlation with word significance does seem to hold for FastText embeddings as well in terms of the No-N-Gram Norm.
NG_Norm shows that the n-gram count could be a useful feature and that simple averaging over n-gram vectors may not be optimal. Perhaps some approach akin to (Zhelezniak et al., 2019) could be used.
- Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.
- Adriaan M. J. Schakel and Benjamin J. Wilson. 2015. Measuring Word Significance using Distributed Representations of Words. http://arxiv.org/abs/1508.02297.
- Vitalii Zhelezniak, Aleksandar Savkov, April Shen, Francesco Moramarco, Jack Flann, and Nils Y. Hammerla. 2019. Don’t settle for average, go for the max: Fuzzy sets and max-pooled word vectors. In International Conference on Learning Representations.