Vaclav Kosar's face photo
Vaclav Kosar
Software And Machine Learning Blog

Submodularity in Ranking, Summarization, and Self-attention

Diminishing returns with a budget constraint in problems of coverage and results diversification.
Submodularity in Ranking, Summarization, and Self-attention

Submodular function models diminishing returns

A submodular function is a set function whose value, has the property that the difference in the incremental value of the function that a single element makes when added to an input set decreases as the size of the input set increases.

\( S_1 \subset S_2 \), \(i \notin S_2 \), \( F(\lbrace i \rbrace \cup S_1) - F(S_1) \geq F(\lbrace i \rbrace \cup S_2) - F(S_2) \)


We encounter this property during shopping in form of quantity discount. The discount of addition of on more item to the purchase cart, decreases with the size of the cart. This priciple is called diminishing returns.

For example the price per product price as a function of set of purchased products \( S \) could be dependent on the product count \( | S | \). From following, we then observe diminishing discount of adding an extra product into the cart.

\( \mathrm{price}(S) = \exp(-|S|) \)
\( S_1 \subset S_2 \implies | S_1 | \leq | S_2 | \)
\( \mathrm{price}(S_1 + \lbrace x \rbrace) - \mathrm{price}(S_1) \geq \) \( \mathrm{price}(S_2 + \lbrace x \rbrace) - \mathrm{price}(S_2) \),
because \( \exp(- | S_1 | )(e^{-1} - 1) \geq \exp(- | S_2 | )(e^{-1} - 1) \)


Submodular functions appear also in maximization problems where they model notions of diversity, information and coverage. For example one would like a document summary to cover all important and relevant concepts in the document, but avoid topic repetition at the same time as repeating a fact has a diminishing return.

Submodular functions can be seen as discrete convexity e.g.

  • polynomial time algo for finding minimum-set.
  • for non-negative submodular function maximum-set under a budget constraint is approximable with polynomial time greedy algorithm
  • Submodular functions can be combined, and the resulting function is still submodular.

Maximum-Marginal-Relevance re-ranks for diversity of results

Maximum-Marginal-Relevance for re-ranking (1998) trades off redundancy of query results for small decrease relevance of results for already ranked documents or phrases. It make the trade off using the following greedy algorithm.

  • \( V \) = a set of documents
  • \( S \) = a subset of documents in V already selected
  • \( \lambda \) = a constant in range of \( ( 0, 1 ) \)
\( \mathrm{nextDoc}(S) = \) \( \mathrm{argmax}_{i \in V \setminus S } ( \lambda \mathrm{sim}_1 (i, Q) - (1 - \lambda) \mathrm{max}_{j \in S} \mathrm{sim}_2 (i, j)) \),
  • The first part of equation growths with document similarity with the query.
  • The second part decreases with the document similarity with already selected documents.

Summarization as budgeted submodularity maximization

Multi-document Summarization via Budgeted Maximization of Submodular Functions formulates summarization as the problem of maximizing a submodular function under a budget constraint, where \( V \) is set of all sentences, \( S \) are selected sentences, \( c_i \) is word count, \( f \) is submodular function, \( B \) is the word budget, then:

\( S_{max} = \max_{S \subset V} f(S) : \sum_{i \in S} c_i \leq B \)

The paper proposes MMR alternative that is submodular if the similarity is non-negative:

\( f = \sum_{i \in V \setminus S} \sum_{j \in S} \mathrm{sim}(i,j) - \lambda \sum_{i,j \in S: i \neq j} \mathrm{sim}(i,j) \)

  • The first part grows as selected sentences \( S \) become more similar to all sentences \( V \).
  • The second part decreases as selected sentences become more similar between each other.

For which they present greedy algorithm with near-optimality guarantee. The algo repeatedely finds documents maximizing:

\( \mathrm{nextDoc}(S) = \mathrm{argmax_i} \frac{f(S \cup \lbrace i \rbrace) - f(S)}{c_i^r} \)

until the budget is reached. The algo Details are presented in the paper.

How is this related exactly to the MMR? If the costs for each item is the same e.g. \( c_i = 1 \), then we get:

\( f(S \cup \lbrace k \rbrace ) - f(S) = \) \( \sum_{i \in V \setminus \lbrace k \rbrace} \mathrm{sim} (i, k) - 2 (1 + \lambda) \sum_{i \in S} \mathrm{sim}(i, k) \).


Proof:
\( f_1 := \sum_{i \in V \setminus S, j \in S } \mathrm{sim} (i, j) \),
\( f_1(S \cup \lbrace k \rbrace ) - f_1(S) = - \sum_{j \in S} \mathrm{sim}(k, j) + \sum_{i \in V \setminus (S \cup \lbrace k \rbrace )} \mathrm{sim}(i, k) = \) \( \sum_{j \in V \setminus \lbrace k \rbrace} \mathrm{sim}(k, j) - 2 \sum_{j \in S} \mathrm{sim}(k, j) \)
\( f_2 := \sum_{i,j \in S : i \neq j } \mathrm{sim} (i, j) \),
\( f_2(S \cup \lbrace k \rbrace ) - f_2(S) = 2 \sum_{j \in S} sim(k, j) \),
\( f = f_1 - \lambda f_2 \)


The equation is very much like the Maximum-Marginal-Relevance equation, except \( \lambda \) the diversity term has always non-zero weight. The cost \( c_k \), can then be added back to rescale according to the sentence length.

  • The first part grows with the sentence \(k \) similarity to all sentences in \( V \) i.e. with sentence centrality in the whole document.
  • The second part decreases with sentence similarity to the already selected sentences.

This setup with idf-modified-cosine similarity matrix achieved SoTA outperforming PageRank based centrality summarization algorithms like LexRank. LexRank applies PageRank to the idf-modified-cosine similarity matrix to get the centrality scores for the sentences. The LexRank paper mentions use MMR as a re-ranker, making the comparison with the submodularity solution fair.

Wiki term disambiguation for wiki using topic hierarchy coverage

Summarization of Multi-Document Topic Hierarchies using Submodular Mixtures (2015) suggested generating Wikipedia disambiguation pages using custom crafted submodularity function maximizing disambiguation topics specificity, clarity, relevance, and more.

Wiki disambiguation problem and submodularity
Wiki disambiguation problem and submodularity (source)

Wiki articleas are organized into a topic hierarchy. The article looks for the best small subset of topics for an informative disambiguation page. The submodular function is constructed as optimal linear combination of multiple hand-constructed submodular functions.

  • \( V \) all topics organized into a hierarchy
  • Collection of D documents associated with one or more topics
  • \( S \) selected topics
  • \( B \) topic count budget
  • \( f_i \) custom monotone submodular functions for selection of the best topics. For example:
    • Weighted set cover function \( f(S) = \sum_d \omega_d \) with weights based on relative importance.
    • topic specificity - distance from root
    • topic clarity: fraction of descendats that cover one or more docs
    • topic relevance: min number of hops needed
  • \( \omega_i \geq 0 \) the best linear combination of the custom functions

We solve:

\( S_{best} = \mathrm{argmax}_{S \subset V : |S| \leq B} \sum_i \omega_i f_i(S) \)


Diminishing self-attention improves summarization coverage

The paper Resurrecting Submodularity for Neural Text Generation optimizes summarization and translation coverage with diminishing attentions with submodular functions.

End-to-end trained neural encoder-decoder summarizers suffer from repetition and insufficient information coverage. Previous papers added coverage requirements into the loss. This paper instead modifies the attention weights using the principle of diminishing returns using a submodular function.

Given decoder step (position) \( t \), encoder sequence position \( i \), attention matrix \( a_{i, t} \), concave non-decreasing function \( F \) with \( F(0) = 0 \) e.g. , then diminishing attention is defined as:

\( DimAtt_{i,t} = F(\sum_{\tau = 0}^{t} a_{i, \tau}) - F(\sum_{\tau = 0}^{t - 1} a_{i, \tau}) \)

Diminishing attention via submodularity
Diminishing attention via submodularity (source)

This relatively increases attention to the parts of input text that were not yet used for decoding. With a slightly more sofisticated approach they call “Dynamic Dimishing Attention” they then achive SoTA with underlying summarization models Pointer-Generator and BART.

Results of diminishing attention
Results of diminishing attention (source)

Meet Other ML Enthusiasts One-on-One Online

Video-call each week an interesting person in the machine learning field and break out of your remote isolation. Network One-on-One Around Your Online Village with RandomMeets.

Join Machine Learning @ RandomMeets

07 Feb 2021