In 2015, LSTMs weren't widely used and Transformers didn't exist yet. So how did people do sentence similarity back then? This post is about applying word alignment to semantic similarity. It covers a very simple word aligner, based solely on dependency parsing and a word database, that took 1st place in SemEval STS in 2014 and 5th place in 2015. In 2017, the aligner itself dropped to 10th place, but it lived on as a subsystem in the winning system and in an autoencoder system, while the other spots were taken by LSTM models.
- The monolingual aligner paper: Back to Basics for Monolingual Alignment (2014)
- The aligner-based sentence similarity paper: DLS@CU: Sentence Similarity from Word Alignment (2014)
- The aligner source: Sultan et al. 2014 aligner source code
- State-of-the-art 2014 on sentence word alignment task
- Winner (DLS@CU) of SemEval 2014 STS (sentence similarity), results only
- Fifth place in SemEval 2015 STS
- 2020 overview of the sentence similarity evolution
In word alignment we take two similar sentences and look for a mapping between the words that carry the same meaning in context. To evaluate this task we need a labelled corpus. Word alignment is related to Word Mover's Distance (read more): both start from a mapping between words, but an alignment is zero-or-one for each word pair, while WMD can distribute word weights across several words in a fuzzy way.
Word Alignment vs Semantic Similarity
In 2015, the top positions in the sentence similarity task were occupied by corpus-based word-alignment models that combined simple algorithms with word databases or word embeddings such as word2vec. How does word alignment relate to semantic similarity? Two sentences are more similar the more of their semantic units can be aligned with similar units that appear in similar semantic contexts.
To say how similar two word-aligned sentences are, we need to calculate a similarity score. The score for the similarity of sentence A to sentence B is the number of aligned words in A divided by the total number of words in A. The measure is made symmetric by taking the harmonic mean of both directions. Stop word alignments are not used for the sentence similarity task.
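The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the index-pair representation of an alignment, and the idea of passing content word indices explicitly are all assumptions made for the example.

```python
def directional_score(aligned_indices, content_indices):
    """Fraction of the content words of one sentence that are aligned."""
    content = set(content_indices)
    if not content:
        return 0.0
    return len(content & set(aligned_indices)) / len(content)


def sts_score(alignment, content_a, content_b):
    """Harmonic mean of the two directional scores.

    alignment: iterable of (index_in_a, index_in_b) pairs (hypothetical format).
    content_a, content_b: indices of the non-stop words in each sentence.
    """
    pairs = list(alignment)
    score_a = directional_score([i for i, _ in pairs], content_a)
    score_b = directional_score([j for _, j in pairs], content_b)
    if score_a + score_b == 0:
        return 0.0
    return 2 * score_a * score_b / (score_a + score_b)


# Sentence A has 4 content words, 3 of them aligned (0.75).
# Sentence B has 3 content words, all aligned (1.0).
score = sts_score([(0, 0), (1, 1), (3, 2)], [0, 1, 2, 3], [0, 1, 2])
print(round(score, 4))
```

The harmonic mean penalizes the case where one sentence is fully covered but the other has many unaligned words, which a plain average would hide.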
But how do we align the words?
The Sultan 2014 Aligner Algorithm
Each step below aligns progressively more words:
- align identical word sequences (high precision)
- align named entities before other content words to enable alignment of entity mentions of different lengths
- align similar words with similar dependency-tree context (higher precision than the next step)
- align similar words with similar textual neighborhoods (3 words to the left and 3 to the right)
- align stop words depending on existing content word alignments
Identical Word Sequences
Align identical word sequences of length n that contain at least one content word. This simple heuristic demonstrates high precision (≈ 97%) on the MSR alignment dev set for n ≥ 2.
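A rough sketch of this step, under some assumptions: tokens are already lower-cased, `STOP_WORDS` is a toy list rather than the one the aligner ships with, and longer sequences are matched before shorter ones (the description above does not state an order).

```python
# Toy stop word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "is", "was", "and"}


def align_identical_sequences(a, b, min_n=2):
    """Align identical n-grams (n >= min_n) that contain a content word.

    a, b: lists of lower-cased tokens.
    Returns a sorted list of (index_in_a, index_in_b) pairs.
    """
    aligned = []
    used_a, used_b = set(), set()
    # Assumption: try the longest sequences first.
    for n in range(min(len(a), len(b)), min_n - 1, -1):
        for i in range(len(a) - n + 1):
            gram = a[i:i + n]
            if all(w in STOP_WORDS for w in gram):
                continue  # needs at least one content word
            if used_a & set(range(i, i + n)):
                continue  # overlaps an earlier alignment
            for j in range(len(b) - n + 1):
                if b[j:j + n] == gram and not used_b & set(range(j, j + n)):
                    aligned.extend((i + k, j + k) for k in range(n))
                    used_a.update(range(i, i + n))
                    used_b.update(range(j, j + n))
                    break
    return sorted(aligned)


a = "the cat sat on the mat".split()
b = "yesterday the cat sat quietly".split()
print(align_identical_sequences(a, b))  # [(0, 1), (1, 2), (2, 3)]
```

Note that "on the" is never considered, because it consists only of stop words.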
The algorithm uses the GNU-licensed Stanford Named Entity Recognizer (Finkel et al., 2005) to identify named entities, and aligns acronyms with the first characters of the words in the corresponding full entity name.
- word similarity: via Paraphrase Database (PPDB)
- exact word or lemma match: returns a similarity score of 1
- if found as a match in the PPDB: returns a similarity score of ppdbSim, a tuned parameter with 0 <= ppdbSim <= 1
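The word similarity rules above fit in one small function. In this sketch the PPDB is replaced by a tiny hand-written set of pairs, `lemma` is a stand-in that only lower-cases, the value 0.9 for `PPDB_SIM` is a placeholder for the tuned parameter, and the fallback score of 0 for unrelated words is an assumption.

```python
PPDB_SIM = 0.9  # placeholder for the tuned ppdbSim parameter

# Toy stand-in for PPDB lookups: unordered pairs of paraphrases.
TOY_PPDB = {
    frozenset(("buy", "purchase")),
    frozenset(("car", "automobile")),
}


def word_sim(w1, w2, lemma=str.lower, ppdb=TOY_PPDB, ppdb_sim=PPDB_SIM):
    """Similarity of two words: 1 for identical lemmas, ppdb_sim for PPDB pairs."""
    l1, l2 = lemma(w1), lemma(w2)
    if l1 == l2:
        return 1.0
    if frozenset((l1, l2)) in ppdb:
        return ppdb_sim
    return 0.0  # assumption: no evidence means zero similarity


print(word_sim("Car", "car"))        # 1.0
print(word_sim("buy", "purchase"))   # 0.9
print(word_sim("car", "tree"))       # 0.0
```

A real setup would plug in a proper lemmatizer and a lookup against the PPDB files instead of the toy set.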
Dependency-based Alignment Process
Dependency context alignment is limited by the accuracy of the dependency parser. Without the dependency alignment step the model performed almost the same (see ablations). To compare dependency contexts, dependency types were matched against custom lists of equivalent types, so that only similar syntactic patterns are found.
- for each potentially alignable pair, the dependency-based context is extracted, and context similarity is calculated as the sum of the word similarities of the context word pairs
- the alignment score is a weighted sum of word similarity and contextual similarity
- pairs with non-zero contextual evidence are then aligned greedily, in decreasing order of this score
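The three steps above could look roughly like this. It is a simplified sketch: dependency contexts are given by hand as `(relation, neighbor_word)` lists instead of coming from a parser, relations are compared by exact label rather than through the custom equivalence lists, and the `sim` function, the weight `w=0.9`, and all names are assumptions for illustration.

```python
PARAPHRASES = {frozenset(("buy", "purchase"))}  # toy stand-in for PPDB


def sim(w1, w2):
    """Toy word similarity: 1 for identical words, 0.9 for paraphrases."""
    if w1 == w2:
        return 1.0
    return 0.9 if frozenset((w1, w2)) in PARAPHRASES else 0.0


def dependency_context_sim(deps_a, deps_b):
    """Sum of word similarities over neighbor pairs with the same relation."""
    return sum(sim(wa, wb)
               for ra, wa in deps_a
               for rb, wb in deps_b
               if ra == rb)  # simplification: exact relation match


def greedy_align(words_a, words_b, deps_a, deps_b, w=0.9):
    """Greedy one-to-one alignment by decreasing weighted score."""
    candidates = []
    for i, wa in enumerate(words_a):
        for j, wb in enumerate(words_b):
            ws = sim(wa, wb)
            cs = dependency_context_sim(deps_a[i], deps_b[j])
            if ws > 0 and cs > 0:  # similar words with non-zero evidence
                candidates.append((w * ws + (1 - w) * cs, i, j))
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    aligned, used_a, used_b = [], set(), set()
    for _, i, j in candidates:
        if i not in used_a and j not in used_b:
            aligned.append((i, j))
            used_a.add(i)
            used_b.add(j)
    return sorted(aligned)


words_a = ["she", "buy", "car"]
words_b = ["she", "purchase", "car"]
deps_a = [[("nsubj_of", "buy")], [("nsubj", "she"), ("dobj", "car")], [("dobj_of", "buy")]]
deps_b = [[("nsubj_of", "purchase")], [("nsubj", "she"), ("dobj", "car")], [("dobj_of", "purchase")]]
print(greedy_align(words_a, words_b, deps_a, deps_b))  # [(0, 0), (1, 1), (2, 2)]
```

Here "buy" and "purchase" get the highest score, because both of their dependents match exactly, so they are aligned first.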
Alignment Based on Similarities in the Textual Neighborhood
- extract the context, which is a set of neighboring content word pairs (3 left, 3 right)
- The contextual similarity is the sum of the similarities of these pairs
- the alignment score is a weighted sum of word similarity and contextual similarity
- The alignment score is then used to make one-to-one word alignment decisions
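As a sketch of the neighborhood step: the code below reads "3 left, 3 right" as a window of three token positions on each side, from which stop words are dropped. That reading, the toy `STOP` list, the exact-match `sim`, the weight `w=0.9`, and the greedy one-to-one decision are all assumptions for illustration.

```python
STOP = {"the", "a", "on", "in", "of"}  # toy stop word list


def sim(w1, w2):
    """Toy word similarity: exact match only."""
    return 1.0 if w1 == w2 else 0.0


def neighborhood(words, i, window=3):
    """Content words within `window` positions to the left and right of i."""
    left = words[max(0, i - window):i]
    right = words[i + 1:i + 1 + window]
    return [w for w in left + right if w not in STOP]


def neighborhood_sim(words_a, i, words_b, j, window=3):
    """Sum of similarities over all pairs of neighboring content words."""
    return sum(sim(x, y)
               for x in neighborhood(words_a, i, window)
               for y in neighborhood(words_b, j, window))


def align_by_neighborhood(words_a, words_b, w=0.9, window=3):
    """One-to-one alignment of content words by decreasing weighted score."""
    scored = []
    for i, wa in enumerate(words_a):
        for j, wb in enumerate(words_b):
            if wa in STOP or wb in STOP or sim(wa, wb) == 0:
                continue
            cs = neighborhood_sim(words_a, i, words_b, j, window)
            scored.append((w * sim(wa, wb) + (1 - w) * cs, i, j))
    scored.sort(key=lambda s: (-s[0], s[1], s[2]))
    aligned, used_a, used_b = [], set(), set()
    for _, i, j in scored:
        if i not in used_a and j not in used_b:
            aligned.append((i, j))
            used_a.add(i)
            used_b.add(j)
    return sorted(aligned)


a = ["river", "bank", "flooded", "city", "bank", "closed"]
b = ["city", "bank", "closed"]
print(align_by_neighborhood(a, b))  # [(3, 0), (4, 1), (5, 2)]
```

The word "bank" occurs twice in the first sentence. The second occurrence wins because its neighbors ("city", "closed") match the neighbors of "bank" in the other sentence.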
- SemEval 2014 STS (sentence similarity):
- state-of-the-art 2014 on word alignment
Winner of SemEval 2014 STS (sentence similarity):
Unfortunately, the other papers report Spearman correlation, so their scores are not directly comparable. Sentence-BERT achieved 0.8099. The top score as of writing is 0.8863, from Trans-Encoder-RoBERTa-large-cross.