
FastText Vector Norms And OOV Words

30 Jun 2019

I had a look at the norms of FastText embeddings and wrote up the results in this paper-style post. Full code is here.

Abstract

Word embeddings trained on large unlabeled corpora are useful for many natural language processing tasks. In contrast to the Word2vec model, FastText (Bojanowski et al., 2016) accounts for sub-word information by also embedding sub-word n-grams. A FastText word representation combines the word embedding vector with the embeddings of the n-grams contained in the word. Word2vec vector norms have been shown (Schakel & Wilson, 2015) to be correlated with word significance. This blog post visualizes vector norms of FastText embeddings and evaluates the use of the FastText word vector norm multiplied by the number of word n-grams for detecting non-English OOV words.

Introduction

FastText embeds a word by adding the word's n-gram vectors to the word's own vector and then normalizing by the total token count, i.e. fastText(word) = (v_word + Σ_{g ∈ ngrams(word)} v_g) / (1 + |ngrams(word)|). However, if the word is not present in the dictionary (OOV), only the n-grams are used, i.e. fastText(word) = (Σ_{g ∈ ngrams(word)} v_g) / |ngrams(word)|. For the purpose of studying OOV words, this asymmetry between in-vocabulary and out-of-vocabulary words is removed by using only the word's n-grams, regardless of whether the word is OOV or not.
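To make the n-gram-only representation concrete, here is a minimal sketch using the official fasttext Python bindings rather than the Gensim code in the post's repository; it assumes `get_subwords` and `get_input_vector` behave as in recent fasttext releases:

```python
import numpy as np
import fasttext

# Load the same 2M-vocabulary English model used in this post.
model = fasttext.load_model("cc.en.300.bin")

def ngram_only_vector(model, word):
    """Average only the n-gram vectors, ignoring the word's own vector,
    so in-vocabulary and OOV words are treated symmetrically."""
    subwords, ids = model.get_subwords(word)
    # For an in-vocabulary word the first subword is the word itself;
    # keep only the character n-grams.
    ngram_ids = [i for s, i in zip(subwords, ids) if s != word]
    if not ngram_ids:
        return np.zeros(model.get_dimension())
    return np.mean([model.get_input_vector(i) for i in ngram_ids], axis=0)

print(np.linalg.norm(ngram_only_vector(model, "apple")))
print(np.linalg.norm(ngram_only_vector(model, "wales-2708")))
```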

To contrast common English words (e.g. "apple") with noise-words (usually parsing artifacts or unusual tokens with a very specific meaning, e.g. "wales-2708" or "G705"), the MIT 10K Common Words dataset is used.

The entire code for this post is available in this repository in the file "main.py". The FastText model used is the 5-gram, 2M-vocabulary English model "cc.en.300.bin".

Standard Vector Norm

The standard vector norm, as defined in the Gensim implementation, is used in this section. Common words are located mostly on the right of the term-frequency spectrum and are clustered in three distinct areas of the norm spectrum. Taking both axes together, common words fall into roughly four clusters. It would be interesting to investigate what those clusters correspond to.

[Figure: standard vector norm vs. term frequency]
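A sketch of how the points of this plot could be computed, again assuming the fasttext Python bindings (in particular that `get_words(include_freq=True)` returns vocabulary words together with their corpus counts):

```python
import numpy as np
import fasttext

model = fasttext.load_model("cc.en.300.bin")

# Vocabulary words with their term frequencies (assumed API behavior).
words, counts = model.get_words(include_freq=True)
# Limit to the most frequent words to keep the sketch quick.
words, counts = words[:50000], counts[:50000]

# Standard norm: plain L2 norm of the full FastText word representation.
norms = np.array([np.linalg.norm(model.get_word_vector(w)) for w in words])

# Each (count, norm) pair is one point of the scatter plot above.
for w, c, n in list(zip(words, counts, norms))[:5]:
    print(f"{w}\t{c}\t{n:.3f}")
```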

From sampled words alone it is not clear what these clusters correspond to.

No N-Gram Norm

As mentioned above, each FastText vocabulary word has its own vector representation regardless of its size. Norms of those vectors are plotted in this section. The shape of the distribution seems to closely match the shape of the corresponding Word2vec plot in (Schakel & Wilson, 2015). The labeled samples in the scatter plot suggest that the vector norm as a measure of word significance holds for FastText as well, in terms of this norm.

[Figure: no-n-gram norm vs. term frequency]
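A sketch of the no-n-gram norm under the same assumptions as above (the word's own vector is read from the input matrix via `get_word_id` and `get_input_vector`):

```python
import numpy as np
import fasttext

model = fasttext.load_model("cc.en.300.bin")

def no_ngram_norm(model, word):
    """Norm of the word's own vector, ignoring all of its n-grams."""
    word_id = model.get_word_id(word)
    if word_id < 0:  # OOV words have no word-level vector
        return 0.0
    return float(np.linalg.norm(model.get_input_vector(word_id)))

print(no_ngram_norm(model, "apple"))
```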

NG_Norm (N-Grams Times Count Norm)

As mentioned above, FastText averages the word's n-gram vectors. However, for detecting noise-words the number of n-grams itself seems to be useful. For that purpose NG_Norm is defined as ng_norm(word) = || Σ_{g ∈ ngrams(word)} v_g ||, i.e. the average n-gram vector multiplied by the n-gram count. Using this norm, common words are clustered in a narrower band on the ng_norm axis.

[Figure: ng_norm vs. term frequency]
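A sketch of ng_norm under the same assumptions as the earlier snippets, summing the n-gram vectors instead of averaging them:

```python
import numpy as np
import fasttext

model = fasttext.load_model("cc.en.300.bin")

def ng_norm(model, word):
    """Norm of the sum of the word's n-gram vectors (word vector excluded)."""
    subwords, ids = model.get_subwords(word)
    ngram_ids = [i for s, i in zip(subwords, ids) if s != word]
    if not ngram_ids:
        return 0.0
    total = np.sum([model.get_input_vector(i) for i in ngram_ids], axis=0)
    return float(np.linalg.norm(total))

print(ng_norm(model, "inflation"), ng_norm(model, "wales-2708"))
```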

The distribution along the ng_norm axis, explicitly aggregated, is plotted in the histogram below.

[Figure: ng_norm histogram]

The probability of a given FastText vocabulary word being a common word, as a function of ng_norm, is plotted below. The distribution is well approximated by a t-distribution.

[Figure: density of common words over ng_norm]
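The t-distribution fit itself can be sketched with SciPy; here `common_word_norms` stands in for the ng_norm values of the MIT 10K common words, and the values shown are placeholders:

```python
import numpy as np
from scipy import stats

# Placeholder ng_norm values of common words (see the ng_norm sketch above).
common_word_norms = np.array([2.87, 2.74, 2.56, 3.10, 2.29])

# Fit a Student's t-distribution (degrees of freedom, location, scale).
df, loc, scale = stats.t.fit(common_word_norms)

# Density of the common-word distribution at a given ng_norm value.
print(stats.t.pdf(2.8, df, loc=loc, scale=scale))
```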

The ability to detect noise-words is evaluated below on the simple task of splitting two concatenated words back apart. For example, let's split the concatenation 'inflationlithium' back into its parts:

word1            word2            norm1    norm2    prob1     prob2     prob
i                nflationlithium  0        4.20137  0.000000  0.000397  0.000000e+00
in               flationlithium   0        4.40944  0.000000  0.000519  0.000000e+00
inf              lationlithium    1.88772  3.86235  0.010414  0.000741  7.721472e-06
infl             ationlithium     2.29234  4.04391  0.053977  0.000428  2.308942e-05
infla            tionlithium      2.24394  4.74456  0.052467  0.000000  0.000000e+00
inflat           ionlithium       2.55929  3.45802  0.048715  0.002442  1.189513e-04
inflati          onlithium        3.10228  3.55187  0.007973  0.001767  1.408828e-05
inflatio         nlithium         3.34667  3.26616  0.003907  0.003159  1.234263e-05
inflation        lithium          2.87083  2.73886  0.017853  0.035389  6.318213e-04
inflationl       ithium           3.36933  2.35156  0.002887  0.053333  1.539945e-04
inflationli      thium            3.73344  2.21766  0.001283  0.052467  6.730259e-05
inflationlit     hium             4.16165  1.66477  0.000096  0.004324  4.139165e-07
inflationlith    ium              4.40217  1.59184  0.000519  0.002212  1.147982e-06
inflationlithi   um               4.71089  0        0.000000  0.000000  0.000000e+00
inflationlithiu  m                4.91263  0        0.000213  0.000000  0.000000e+00

The above approach yielded around 48% accuracy on 3000 random two-word samples from the MIT 10K common words. In this specific case a more efficient method would be to search the vocabulary instead of calculating vector norms. A more appropriate comparison, however, would involve a more general task with OOV words, e.g. using edit distance on OOV words and words with typos.
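A sketch of the split search: each split position is scored by the product of the two parts' common-word probability scores. The `common_word_prob` function and its parameters are hypothetical stand-ins for the t-distribution fitted above:

```python
import numpy as np
from scipy import stats

def common_word_prob(ng_norm_value, df=3.0, loc=2.8, scale=0.5):
    """Hypothetical common-word score based on ng_norm, using placeholder
    t-distribution parameters instead of the actual fitted values."""
    return stats.t.pdf(ng_norm_value, df, loc=loc, scale=scale)

def best_split(word, ng_norm):
    """Try every split position and keep the one whose parts are jointly
    most likely to be common words (product of the two scores)."""
    candidates = []
    for i in range(1, len(word)):
        part1, part2 = word[:i], word[i:]
        prob = common_word_prob(ng_norm(part1)) * common_word_prob(ng_norm(part2))
        candidates.append((prob, part1, part2))
    return max(candidates)

# Usage, with ng_norm as defined in the earlier sketch:
# print(best_split("inflationlithium", lambda w: ng_norm(model, w)))
```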

Conclusion

FastText vector norms and their relation to term frequency were visualized and investigated in this post.

The Standard Norm vs. term-frequency plot revealed potentially interesting clustering of common word vectors in three to four main areas.

The No-N-Gram Norm has a Norm-TF distribution very similar to that of Word2vec shown in (Schakel & Wilson, 2015). The word-significance correlation does seem to hold for FastText embeddings as well, in terms of the No-N-Gram Norm.

NG_Norm shows that the n-gram count could be a useful feature and that simple averaging over n-gram vectors may not be optimal. Perhaps an approach akin to (Zhelezniak et al., 2019) could be used.

References

Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016). Enriching Word Vectors with Subword Information. arXiv:1607.04606.

Schakel, A. M. J., & Wilson, B. J. (2015). Measuring Word Significance using Distributed Representations of Words.

Zhelezniak, V., Savkov, A., Shen, A., Moramarco, F., Flann, J., & Hammerla, N. Y. (2019). Don't Settle for Average, Go for the Max: Fuzzy Sets and Max-Pooled Word Vectors. ICLR 2019.