Word Alignment for Sentence Similarity

Semantic similarity grows as the monolingual word alignment matches more semantic units that have similar meaning and similar context.

In 2015, LSTMs weren’t widely used and the Transformer didn’t exist yet. How did people even do sentence similarity back then? This post is about applying word alignment to semantic similarity. It covers a very simple word aligner, based solely on dependency parsing and a word database, that took 1st place in 2014 and 5th place in 2015 in SemEval STS. In 2017, the aligner itself dropped to 10th place, but it still lived on as a subsystem in the winning system in an autoencoder system, while the other spots were taken by LSTM models.

Word Alignment

In word alignment we take two similar sentences and look for a mapping between the words that carry the same meaning within their context. To evaluate this task we need a labelled corpus. Word alignment is related to Word Mover’s Distance (WMD) in that both first build a mapping between the words. The difference is that an alignment decision has to be zero-or-one, while WMD can distribute the word weights in a fuzzy way.

word alignment example

Word Alignment vs Semantic Similarity

In 2015, the top positions in the sentence similarity task were occupied by corpus-based word-alignment models that combined simple algorithms with word databases or word embeddings, e.g. word2vec. How does word alignment relate to semantic similarity? Semantic similarity increases when the alignment pairs more semantic units that are similar and occur in similar semantic contexts.

Now, to say how similar two word-aligned sentences are, we need to calculate a similarity score. The score for the similarity of sentence A to sentence B is the number of aligned words in sentence A divided by the total number of words in sentence A. This measure is made symmetric by taking the harmonic mean of both directions. Stop word alignments are not used for the sentence similarity task.
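The scoring rule can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the names `aligned_fraction` and `sts_score`, the tiny stop-word list, and the representation of an alignment as a set of index pairs are all assumptions made for illustration. Counting only content words is one reading of "stop word alignment is not used".

```python
# Hypothetical stop-word list for illustration; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "on", "in", "of", "to"}


def aligned_fraction(words, aligned_indices):
    """Fraction of content words in `words` that take part in the alignment."""
    content = [i for i, w in enumerate(words) if w.lower() not in STOP_WORDS]
    if not content:
        return 0.0
    return sum(1 for i in content if i in aligned_indices) / len(content)


def sts_score(words_a, words_b, alignment):
    """Harmonic mean of the aligned fractions of both sentences.

    `alignment` is a set of (index_in_a, index_in_b) pairs.
    """
    frac_a = aligned_fraction(words_a, {i for i, _ in alignment})
    frac_b = aligned_fraction(words_b, {j for _, j in alignment})
    if frac_a + frac_b == 0:
        return 0.0
    return 2 * frac_a * frac_b / (frac_a + frac_b)


a = "A man is playing a guitar".split()
b = "The man plays the guitar on stage".split()
# Hand-written content-word alignment: man-man, playing-plays, guitar-guitar.
alignment = {(1, 1), (3, 2), (5, 4)}
print(sts_score(a, b, alignment))
```

In this toy example all three content words of the first sentence are aligned, but only three of the four content words of the second, so the harmonic mean lands below 1.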

But how do we align the words?

The Sultan 2014 Aligner Algorithm

alignment pipeline diagram

Each step below aligns additional words:

  1. align identical word sequences (high precision)
  2. align named entities before other content words to enable alignment of entity mentions of different lengths
  3. align similar words with similar dependency-tree context (higher precision than the next step)
  4. align similar words with a similar textual neighborhood (3 words to the left and 3 to the right)
  5. align stop words depending on existing content word alignments

Identical Word Sequences

This step aligns identical word sequences of length n that contain at least one content word. This simple heuristic shows high precision (≈ 97%) on the MSR alignment dev set for n ≥ 2.
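A minimal sketch of this step, assuming lower-cased token comparison, a toy stop-word list, and a longest-sequence-first order. The function name `align_identical_sequences` and these choices are illustrative assumptions, not details taken from the paper.

```python
STOP_WORDS = {"a", "an", "the", "is", "are", "on", "in", "of", "to"}  # illustrative only


def align_identical_sequences(words_a, words_b, min_n=2):
    """Align identical word n-grams (n >= min_n) with at least one content word.

    Longer sequences are tried first; each word is aligned at most once.
    Returns a sorted list of (index_in_a, index_in_b) pairs.
    """
    a = [w.lower() for w in words_a]
    b = [w.lower() for w in words_b]
    used_a, used_b, pairs = set(), set(), []
    for n in range(min(len(a), len(b)), min_n - 1, -1):
        for i in range(len(a) - n + 1):
            for j in range(len(b) - n + 1):
                if a[i:i + n] != b[j:j + n]:
                    continue
                if all(w in STOP_WORDS for w in a[i:i + n]):
                    continue  # need at least one content word
                idx_a, idx_b = range(i, i + n), range(j, j + n)
                if used_a.intersection(idx_a) or used_b.intersection(idx_b):
                    continue  # keep the alignment one-to-one
                used_a.update(idx_a)
                used_b.update(idx_b)
                pairs.extend(zip(idx_a, idx_b))
    return sorted(pairs)


a = "The cat sat on the mat".split()
b = "Yesterday the cat sat outside".split()
print(align_identical_sequences(a, b))
```

Here the shared sequence "the cat sat" is aligned as a whole, including its stop word, because the sequence contains content words.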

Named Entities

The algorithm uses the GNU-licensed Stanford Named Entity Recognizer (Finkel et al., 2005) to recognize named entities, and it aligns all first-character acronyms in the texts.
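The acronym part can be illustrated with a small sketch. The entity spans are written by hand here instead of calling an NER tool, and the names `first_letter_acronym` and `align_acronyms`, as well as pairing an acronym with every word of the mention, are assumptions of this example.

```python
def first_letter_acronym(entity_words):
    """First-character acronym of an entity: its words' initials in upper case."""
    return "".join(w[0].upper() for w in entity_words if w)


def align_acronyms(words_a, entities_b):
    """Align acronym tokens of sentence A with entity mentions of sentence B.

    `entities_b` maps the start index of each entity mention in sentence B
    to the list of its words. Returns (index_in_a, index_in_b) pairs; an
    acronym is paired with every word of the mention it abbreviates
    (an assumption of this sketch).
    """
    pairs = []
    for i, token in enumerate(words_a):
        for start, entity_words in entities_b.items():
            if len(entity_words) > 1 and token == first_letter_acronym(entity_words):
                pairs.extend((i, start + k) for k in range(len(entity_words)))
    return pairs


a = "The EU signed the treaty".split()
b = "The European Union signed the treaty".split()
entities_b = {1: ["European", "Union"]}  # hand-written stand-in for NER output
print(align_acronyms(a, entities_b))
```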

Content Words

  • word similarity is computed via the Paraphrase Database (PPDB)
  • an exact word or lemma match returns a similarity score of 1
  • a match found in PPDB returns a similarity score ppdbSim = 0.9
    • ppdbSim is a tuned parameter, 0 <= ppdbSim <= 1
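The rules above amount to a small lookup function. The sketch below uses hypothetical stand-ins for the lemmatizer and for PPDB (`LEMMAS`, `PPDB_PAIRS`); returning 0 when nothing matches is an assumption, since the list above does not state the fallback.

```python
PPDB_SIM = 0.9  # tuned parameter, 0 <= ppdbSim <= 1

# Hypothetical stand-ins for a lemmatizer and for PPDB lookups.
LEMMAS = {"plays": "play", "playing": "play", "cars": "car"}
PPDB_PAIRS = {frozenset(("car", "automobile")), frozenset(("buy", "purchase"))}


def lemma(word):
    """Toy lemmatizer: dictionary lookup with the lower-cased word as fallback."""
    return LEMMAS.get(word.lower(), word.lower())


def word_sim(w1, w2, ppdb_sim=PPDB_SIM):
    """1 for an exact or lemma match, ppdb_sim for a PPDB match, else 0 (assumed)."""
    l1, l2 = lemma(w1), lemma(w2)
    if l1 == l2:
        return 1.0
    if frozenset((l1, l2)) in PPDB_PAIRS:
        return ppdb_sim
    return 0.0


print(word_sim("plays", "playing"), word_sim("cars", "automobile"), word_sim("cat", "guitar"))
```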

Dependency-based Alignment Process

Dependency context alignment is limited by the accuracy of the dependency parser. Without the dependency alignment the model performed almost the same (see ablations). To align the dependency context, dependency types were matched using custom lists of equivalent types, so that only similar syntactic patterns are found.

Operation:

  • for each potentially alignable pair, the dependency-based context is extracted, and the context similarity is calculated as the sum of the word similarities of the context word pairs
  • the alignment score is a weighted sum of word similarity and contextual similarity
  • pairs with non-zero evidence are then aligned greedily in decreasing order of this score
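The operation above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: dependency trees are given as hand-written `(head, dependent, label)` triples, the weight `W`, the groups in `EQUIVALENT_TYPES`, and the toy `PARAPHRASES` set are hypothetical, and a context is just the direct parents and children of a word.

```python
W = 0.9  # hypothetical weight of word similarity vs. contextual similarity

# Hypothetical groups of dependency labels treated as equivalent.
EQUIVALENT_TYPES = [{"nsubj", "nsubjpass"}, {"dobj", "iobj"}]
PARAPHRASES = {frozenset(("bought", "purchased"))}  # toy stand-in for PPDB


def word_sim(w1, w2):
    w1, w2 = w1.lower(), w2.lower()
    if w1 == w2:
        return 1.0
    return 0.9 if frozenset((w1, w2)) in PARAPHRASES else 0.0


def same_type(t1, t2):
    return t1 == t2 or any(t1 in g and t2 in g for g in EQUIVALENT_TYPES)


def context(index, edges):
    """Tree neighbors of a word as (neighbor_index, label, direction)."""
    ctx = [(dep, label, "child") for head, dep, label in edges if head == index]
    ctx += [(head, label, "parent") for head, dep, label in edges if dep == index]
    return ctx


def align_by_dependencies(words_a, edges_a, words_b, edges_b):
    candidates = []
    for i, wa in enumerate(words_a):
        for j, wb in enumerate(words_b):
            sim = word_sim(wa, wb)
            if sim == 0:
                continue
            # Sum word similarities over context pairs with matching relations.
            ctx_sim = sum(
                word_sim(words_a[k], words_b[l])
                for k, t1, d1 in context(i, edges_a)
                for l, t2, d2 in context(j, edges_b)
                if d1 == d2 and same_type(t1, t2)
            )
            if ctx_sim > 0:  # require non-zero contextual evidence
                candidates.append((W * sim + (1 - W) * ctx_sim, i, j))
    aligned, used_a, used_b = [], set(), set()
    for _, i, j in sorted(candidates, reverse=True):  # greedy, best score first
        if i not in used_a and j not in used_b:
            aligned.append((i, j))
            used_a.add(i)
            used_b.add(j)
    return sorted(aligned)


words_a = ["John", "bought", "a", "car"]
edges_a = [(1, 0, "nsubj"), (1, 3, "dobj"), (3, 2, "det")]  # (head, dependent, label)
words_b = ["John", "purchased", "the", "car"]
edges_b = [(1, 0, "nsubj"), (1, 3, "dobj"), (3, 2, "det")]
print(align_by_dependencies(words_a, edges_a, words_b, edges_b))
```

The greedy loop never revisits a decision, which keeps the procedure simple at the cost of possibly missing a globally better assignment.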

Alignment Based on Similarities in The Textual Neighborhood

  • extract the context, which is a set of neighboring content word pairs (3 left, 3 right)
  • the contextual similarity is the sum of the similarities of these pairs
  • the alignment score is a weighted sum of word similarity and contextual similarity
  • the alignment score is then used to make one-to-one word alignment decisions

Datasets

MSR Brockett 2007 Corpus example

SemEval 2014 STS Task 10 dataset examples

Results

  • state of the art on word alignment in 2014

Monolingual Word Alignment for Sentence Similarity results

Winner of SemEval 2014 STS (sentence similarity):

SemEval 2014 STS (sentence similarity) results

Winner of SemEval 2015 STS (DLS@CU) with a mean Pearson correlation of 0.8015. By comparison, the Word Mover’s Embeddings paper reports 64.2.

Unfortunately, the other papers report Spearman correlation, so their results are not directly comparable. SentenceBert achieved 0.8099. The top score as of writing is 0.8863 from Trans-Encoder-RoBERTa-large-cross.
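To see why Pearson and Spearman numbers cannot be compared directly, here is a small pure-Python sketch on made-up scores. The data is invented for illustration and has nothing to do with the results above. Predictions that rank every pair correctly but are not linear in the gold scores get a perfect Spearman correlation and a noticeably lower Pearson correlation.

```python
def pearson(xs, ys):
    """Pearson correlation: strength of the linear relationship."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def ranks(values):
    """Average 1-based ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg
        pos = end + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


gold = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]  # made-up human similarity scores
pred = [0.1, 0.2, 0.3, 0.5, 1.5, 5.0]  # made-up system scores, monotone but non-linear
print(pearson(gold, pred), spearman(gold, pred))
```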

Created on 02 Apr 2022.