Earth mover’s distance

Earth mover’s distance

A semantic measure for document similarity in semantic search

·

10 min read

This is the part 2 of my initial Semantic search post. This one is about the Earth mover’s distance, which can be applied in semantic search.

Motivation

The way semantic search engines retrieve results from their ontology is by computing the closest / most similar documents to the query. In order to get the top most similar documents, the algorithm computes the distance between the query and many documents. When the similarities are obtained, the documents are ranked accordingly and the ones at the top are retrieved.

For example, imagine that the ontology has 5 documents:

  • Query: ‘how do I start a career in data science?’
  • Document 1: ‘The science of cooking: how to cook a splendid meal’
  • Document 2: ‘Data science breakthroughs in 2019
  • Document 3: ‘A marine biologist’s guide to getting a data science job’
  • Document 4: ‘Data about chimpanzees is the basis of behavioral science’
  • Document 5: ‘Data science is the top career in the 21st century’

First we compute the distance between the query and the 5 documents, and the one(s) with the highest similarity are returned to the user.

What is this distance between documents? Humans can quickly know which document is the most appropriate for this query, because language is inherent to us and we understand the meaning of each word and and also the sentence as a whole, we can even predict what each document is about. Simple grammatical mistakes or a slightly different word order doesn’t affect our thinking process.

In a matter of seconds, we reach the conclusion that document #3 is the closest one: even if the user may not have a marine biology background, the marine biologist’s experience might prove useful.

But how does a computer calculate the distance between two sentences? A sentence isn’t just a sequence of words: each word has its place, there’s an invisible underlying grammatical structure that we intuitively understand because we’ve been exposed to language since birth, but computers have historically had trouble gaining this ‘intuition’.

A simple document similarity measure would be counting how many words there are in common between the two sentences, regardless of the position:

  • d(query, doc1) = 3 (science, how, a)
  • d(query, doc2) = 2 (data, science)
  • d(query, doc3) = 3 (a, data, science)
  • d(query, doc4) = 2 (data, science)
  • d(query, doc5) = 4 (data, science, career, in)

According to this metric, document #5 is the most similar one to the query. But we quickly realize that some words are irrelevant, ‘a’. ‘how’, ‘in’ and such words don’t add meaning to the sentence. These words are called stopwords and are removed in most text processing algorithms.

Slight variations in words aren’t included in this metric either. Verbs in present vs. past tense, singular vs. plural words, adverbs vs. adjectives, etc. are regarded as different words when in reality, they refer to almost the same term. Stemming, lemmatizing and tokenization are methods for obtaining the root form of the word.

Lastly, document #3 should get points for including the word ‘job’, which in this context, is a synonym of ‘career’. ‘career’ and ‘job’ should be processed as if they were equivalent., the same as the word ‘career’ in document #5. Including synonym coverage in the algorithm can greatly improve performance.

And if we dig deep, one may argue that ‘how do I do X’ is similar in meaning to ‘someone’s guide to doing X’, although implementing that is another story.

As a summary, the motivation of using semantic document similarity measures is to improve existing measures by using semantic features such as synonyms, hyponyms, etc.

There are many document similarity measures, lexical and semantic. You can read about many of them, more centered in theory or more practical.

This post is about the Earth mover’s distance (EMD), one of the semantic approaches, where the meaning of the words and the intent of the user are considered.

Earth mover’s distance (EMD)

In the case of EMD, the distance between documents is based on semantic distances between words, where words are stored in an electronical lexical database called WordNet. Once the semantic distances between words are obtained, the EMD calculates the document similarity with a many-to-many matching between words. That is, three words in one document can be equivalent in meaning to just one word in the other document. [1]

1. Definition

Let’s abstract from sentences, words and documents and let’s imagine a document is a distribution of weighted points. The distributions are tuples of positions and weights.

The EMD between two distributions is proportional to the minimum amount of work required to convert one distribution into the other.

1 unit of work is the amount of work necessary to move one unit of weight by one unit of distance.

2. Intuition

Intuitively, the weight flows from one distribution to the other until they are identical, similar to filing holes with piles of dirt. One distribution acts as a group of holes, while the points of the other distribution are dirt. [2]

Example 1

From an initial point where the two distributions are displayed in the space, one distribution is assigned the role of dirt and the other holes, each point of dirt is assigned to a point of hole (not necessarily one point of dirt for each point of hole), and iteratively the dirt ‘moves’ towards the hole, until it fills it.

The cost/work of moving the dirt depends on the weight/amount of dirt and the distance it needs to cover.

The ground distance metric between points can be Euclidean, Manhattan… but don’t confuse this distance with the distance between words. This ground distance is merely the distance between points in the dimensional space.

Example 2

In example #2, the distance that the red distribution has to cover is greater than in example #1, so it will take more work to ‘move the dirt’.

The goal of the EMD algorithm is to optimize how to distribute the weights so that all of the dirt covers all of the holes while moving the weights through the minimum distance possible.

Note: in this scenario, the two distributions are balanced, that is, the amount of dirt equals the amount of hole. There’s no dirt left once all the holes are covered, or there’s no extra holes once all the dirt has been placed. If you want to see imbalanced distributions, consider reading [2].

There are many ways to distribute the weights, the leftmost dirt point in the red distribution could be matched to the lowest hole point in the blue distribution, but this wouldn’t be very efficient. One step in calculating the EMD is obtaining the most feasible flow between the two distributions.

3. Notation

Notation and formulas can be found in the slides of my presentation [0]. I’ve tried to keep this post more theoretical and intuitive, without getting too deep in the mathematics.

4. Flow

Given two distributions, x and y, a flow between them is any matrix F that defines a weight distribution among x and y. Each point in the matrix F, f_ij, represents the amount of weight at x_i matched to y_j.

Note: emphasis on the any, a flow can be any weight distribution, efficient or not. Finding an efficient/feasible flow is an optimization problem in which the EMD is based.

Finding a feasible flow: the flow F is feasible if and only if it fulfills the following constraints for each f_ij: [2]

  1. f_ij must be positive, the matched weight cannot be negative.
  2. The amount of weight moved to hole y_j cannot exceed the amount of weight present in x_i. At most, all of the dirt in the dirt point can be moved.
  3. The amount of weight moved from the dirt point x_i cannot exceed the amount of ‘hole’ present in y_j. There can be no dirt overflow in the hole.
  4. All the weight moved throughout all the points must be the total amount of weight in the distribution. All the dirt must cover all the holes, exactly. No dirt floating around and no holes left uncovered.

The flows that fulfill these requirements are considered feasible.

5. Work

The work done by a feasible flow in matching x and y is the amount of matched weight, multiplied by the distance, for each point in x and y.

6. EMD

The Earth mover’s distance is the minimum amount of work to match x and y, normalized by the total weight of the lighter distribution, but in this case both distributions have the same total weight so there is no lighter distribution. The work is simply divided by the total weight of one distribution.

The optimization problem is finding the flow that minimizes the work. The weight, all of it, has to be moved. But which weight to which point, and in what quantity, depending on the distance between the points, so that the weights don’t have to move too far, that’s the optimization problem.

7. Some examples [2]

Example 3

A non-optimal vs. optimal flow in regards of the work. In the left scenario, the distance covered by the 0.26 weight (d=316.3) causes the work to be so high. This example is quite intuitive, the optimal flow is relatively easy to obtain.

Example 4

A less intuitive example, in this case the total amount of weight in the distributions is different. So either there will be dirt overflow, or the holes won’t be fully covered.

8. Summary

The Earth mover’s distance is the distance it takes to move/transform one distribution into the other. The two characteristics of these distributions are that the points are in a space, 2D in the examples, and each point has a certain weight. By calculating the most efficient way to distribute these weights, we reach a number denoting the EMD.

9. EMD as a semantic measure for document similarity

How is this related to a semantic measure for document similarity that this post is supposed to be about? As mentioned in the beginning, the distance between documents is based on semantic distances between words, where words are stored in an electronical lexical database called WordNet. Let’s try to map this WordNet representation into a distribution of points and weights: [3][4]

  • WordNet is a large graph or semantic network, where each node of the network represents a real world concept (house, teacher, art).
  • Each node consists of:
    • A synset, a set of synonyms that represent the same concept, and
    • a gloss, a short definition or description of the real world concept.
  • Each link between synsets describes a semantic relationship between the real world concept represented by the synsets (hypernym, hyponym, meronym, holonym, etc).

A sentence X can be described as a distribution of points and weights, where:

  • the points are the WordNet representations of the words (in vectors of length N, we can visualize that as an N-dimensional space)
  • the weights are the TF-IDF values of the words (a metric showing how important a word is in respect to a document). [5]

The EMD is calculated by finding the flow that minimizes the work to match x and y, normalized to 1, and the document similarity is obtained as follows:

Conclusion

The Earth mover’s distance is a type of distance where the position and weight of the points in an N-dimensional space is critical.

By turning words into word vectors and weights into TF-IDF values, the EMD can be used as a semantic measure for document similarity.

EMD can be used to calculate the semantic similarity between an query and a document, and thus play a role in the ontology matching stage of semantic search.

Word mover’s distance (WMD)

WMD is a similar, more recent variation of EMD but using word2vec instead of WordNet. It was released in 2015 [6], and there are several implementation tutorials available, very useful to learn the implementation side of the measure.

I researched this topic quite intensely for my presentation and I hope I summarized it in a clear and concise way. If anything is not explained well enough, I’d be happy to give it another try.

I would really appreciate any feedback, incomplete or erroneous information I might have included. Thank you for reading!

References

[0]: Semantic search, slides from a presentation I gave in my Master’s degree at LMU, 2018

[1]: Towards Named-Entity-based similarity measures: challenges and opportunities, De Nies et al, 2014

[2]: Finding color and shape patterns in images, Cohen, Stanford University, 1999

[3]: Incorporating Dictionary and Corpus Information into a Context Vector Measure of Semantic Relatedness, Pathwardan, University of Minnesota, 2003

[4]: Automatic word sense discrimination, Schütze, 1998

[5]: The Earth Mover’s Distance as a Semantic Measure for Document Similarity, Wan and Peng, Peking University, 2005

[6]: From Word Embeddings To Document Distances, Kusner et al, 2015.

Did you find this article valuable?

Support Ane Berasategi: data science and cloud by becoming a sponsor. Any amount is appreciated!