Similarity Metrics for Short Texts

· 07.17.2015 · etc

Computing similarity between short texts can be very difficult. It's the main challenge in developing Geiger. The problem is that text similarity metrics typically rely on exact overlap of terms (called "surface matching" because you match surface forms1 of words), and short texts are sparse in their terms. Longer documents offer more opportunity for overlap simply because they contain more words.

Say you have two news articles about employment. If these are longer documents, they may mention words like "work" or "jobs" or "employment". Similarity metrics reliant on common terms will work fine here.

Now say you have two comments about employment, both of which are fairly short, say ~400 characters each.

Here are two examples:

C1: Nobody thinks the fossil fuel industry opposition to clean air has anything to do with jobs. Establish new companies in these areas, and retrain miners to do those jobs. Or put them to work fixing our roads and bridges, as in FDR's day. Americans should be outraged that this is even an issue for the courts. What a waste! It's the air, stupid.

C2: This attack on another base of employment by democrats is why what began in the last elections will be finished by 2016. I can't wait to participate.

One comment mentions "employment" explicitly; the other only talks about "jobs" and "work". They are both talking about the same thing, but there is no exact overlap of these key terms, so common-term similarity metrics will fail to recognize that. This is referred to as a problem of synonyms.
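To make that concrete, here's a minimal sketch (not Geiger's actual implementation) of a common-term metric - Jaccard similarity over token sets - applied to abbreviated versions of the two comments:

```python
# A minimal sketch of a common-term metric: Jaccard similarity over
# token sets, applied to abbreviated versions of the two comments.
import re

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

c1 = ("Nobody thinks the fossil fuel industry opposition to clean air "
      "has anything to do with jobs. Or put them to work fixing our "
      "roads and bridges, as in FDR's day.")
c2 = ("This attack on another base of employment by democrats is why "
      "what began in the last elections will be finished by 2016.")

print(jaccard(c1, c2))  # very low: the only shared tokens here are
                        # the function words "the" and "in"
```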

The converse of the synonym problem is that of polysemy - where one term can mean different things in different contexts. The typical example is "bank", which can mean a financial institution or the side of a river (or act as a verb). Maybe a comment says "good job" and really isn't saying anything at all about employment, but common-term similarity metrics won't recognize that.

There's quite a bit of literature on this problem (see below for a short list) - the popularity of Twitter as a dataset has spurred a lot of interest here. Most of the approaches turn to some external source of knowledge - variously referred to as "world knowledge", "background knowledge", "auxiliary data", "additional semantics", "external semantics", and perhaps by other names as well.

This external knowledge is usually another corpus of longer texts related to the short texts. Often this is Wikipedia or some subset of Wikipedia pages, but it could be something more domain-specific as well. You can use this knowledge to relate terms by co-occurrence, e.g. maybe you see that "job", "work", and "employment" occur together often in the Wikipedia page for "Employment", so you know that the terms have some relation.
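Here's a rough sketch of that idea; the handful of "background" documents below are made up for illustration, standing in for real Wikipedia pages:

```python
# A rough sketch of relating terms by co-occurrence in an external
# corpus. The "background_docs" list is a hand-made stand-in for real
# Wikipedia pages; real systems would also weight the counts, e.g.
# with pointwise mutual information (PMI).
import re
from collections import Counter
from itertools import combinations

background_docs = [
    "employment is a relationship between two parties where work is paid for",
    "a job or occupation is a person's role in society through regular work",
    "unemployment measures people without a job who are seeking work",
]

cooccur = Counter()
for doc in background_docs:
    terms = set(re.findall(r"[a-z]+", doc.lower()))
    for a, b in combinations(sorted(terms), 2):
        cooccur[a, b] += 1

# "job" and "work" co-occur in two documents, "employment" and "work"
# in one, so the terms can be related despite never overlapping on
# the surface in the comments themselves.
print(cooccur["job", "work"])         # 2
print(cooccur["employment", "work"])  # 1
```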

Alternatively, with Wikipedia (or other corpora with some explicit structure) you could look at the pagelink or redirect graph for this information. Terms that redirect to one another could be considered synonymous, and path length on the pagelink graph could be interpreted as a degree of similarity between two terms. Wikipedia's disambiguation pages can help with the polysemy problem, leaning on term co-occurrence in the disambiguated pages (e.g. a comment contains "bank" and "finance"; only one of the Wikipedia pages for "Bank" also has both those terms).
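As a sketch of the pagelink idea, here's a toy link graph in networkx where inverse path length serves as a similarity score - the pages and links are invented for illustration:

```python
# A toy stand-in for the Wikipedia pagelink graph; edges represent
# pagelinks. Shorter paths between pages are read as more similar.
import networkx as nx

pagelinks = nx.Graph()
pagelinks.add_edges_from([
    ("Employment", "Job"),
    ("Job", "Work (human activity)"),
    ("Employment", "Labour economics"),
    ("Labour economics", "Bank"),
])

def pagelink_similarity(a, b):
    """Inverse path length: 1.0 for the same page, smaller for
    pages that are farther apart in the link graph."""
    try:
        return 1.0 / (1 + nx.shortest_path_length(pagelinks, a, b))
    except nx.NetworkXNoPath:
        return 0.0

print(pagelink_similarity("Employment", "Job"))   # 0.5
print(pagelink_similarity("Employment", "Bank"))  # 0.333...
```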

I am still in the process of trying these methods, but they have some intuitive appeal. When we make sense of short documents – or any text for that matter – we always rely on background knowledge orders of magnitude greater than the text we are currently looking at. It's sensible to try to emulate that in machine text processing as well.


Side note: You can also approximately resolve synonyms using word embeddings (vector representations of single terms), such as those derived from a Word2Vec model. I say "approximately" because word embeddings don't represent how similar terms are in terms of semantics, but how "swappable" they are. That is, two terms have similar embeddings if one can take the place of the other.

For example, consider the sentences "Climate change will be devastating" and "Global warming will be devastating". The terms aren't technically synonymous, but are often used as such, so we'll say that they practically are. A well-trained Word2Vec model will pick up that they often appear in similar contexts, so they will be considered similar.

But also consider "You did a good job" and "You did a bad job". Here the terms "good" and "bad" appear in similar contexts, and so the Word2Vec model would call them similar. But we would not call them synonymous.
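You can see both effects with pretrained embeddings. A quick sketch, assuming gensim 4.x and its downloader (the GloVe vectors it loads aren't from a Word2Vec model, but they're the same kind of embedding):

```python
# Near-synonyms score high, but so do antonyms that share contexts.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # downloads on first use

# Terms used in similar contexts score high...
print(vectors.similarity("jobs", "employment"))
print(vectors.similarity("climate", "warming"))

# ...but so do "good" and "bad", because they are swappable in
# context even though they mean opposite things.
print(vectors.similarity("good", "bad"))
```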


Referenced papers:

  • Yih, W., Meek, C. Improving Similarity Measures for Short Segments of Text. 2007.
  • Hu, X., Sun, N., Zhang, C., Chua, T. Exploiting Internal and External Semantics for the Clustering of Short Texts Using World Knowledge. 2009.
  • Petersen, H., Poon, J. Enhancing Short Text Clustering with Small External Repositories. 2011.
  • Hu, X., Zhang, X., Lu, C., Park, E., Zhou, X. Exploiting Wikipedia as External Knowledge for Document Clustering. 2009.
  • Jin, O., Liu, N., Zhao, K., Yu, Y., Yang, Q. Transferring Topical Knowledge from Auxiliary Long Texts for Short Text Clustering. 2011.
  • Seifzadeh, S., Farahat, A., Kamel, M. Short-Text Clustering using Statistical Semantics. 2015.

  1. A surface form of a word is its form as it appears in the text. For example, "run", "ran", and "running" are all surface forms of the lemma "run".