Friday, June 21, 2013

Semantic distance between words in a large text

I've had an algorithm rattling around in my head for years that can be used to find words of similar meaning in large texts.  I've implemented it from scratch several times, but I doubt I will ever have a chance to use it in real life work.

The basic idea is this:
Part of the semantic meaning of a word is shared with the words that are syntactically close to it in a text.
For example, in the sentence "I eat cake", the words "eat" and "cake" are close to each other, and they share a common component of meaning.  A cake is something that you eat, and eating is something you do with cake.

Add the sentences "I eat bread", "I eat bananas", "I bake bread", "I bake cake" to the mix.  When you compare their contexts, you find that "bread", "cake" and "bananas" share common contexts (they are all eaten), but "bread" and "cake" have more in common with each other than they do with "bananas" because they are baked.  Furthermore, "eat" and "bake" have something in common: they are both actions that are applied to food.

In a large text you will get many accidental pairs, but if they are truly accidental then they won't be statistically distinctive.

Here's how I usually implement this algorithm:

1.  Build a table of all words in the text with their frequencies.

2.  Build a table of all pairs in the text with their frequencies.

3.  For any two words W1 and W2, find all pairs of pairs ((W1, x), (W2, x)) and all pairs of pairs ((x, W1), (x, W2)).

4.  For each pair of pairs, calculate the sum S1 of the absolute difference between the relative frequencies of the left and right members of the pair.  The relative frequency of a pair (W1, x) is the frequency of the pair divided by the frequency of W1.

5.  For each pair of pairs, calculate the sum S2 of the relative frequencies of the left and right members of each pair.

6.  Calculate the semantic distance between words W1 and W2 as: 2 - S2 + S1.  The distance is a number between 0 and 2, where 0 is complete similarity and 2 is complete difference.

If the words are the same, then S1 will be 0 and S2 will be 2, so the distance will be 2 - 2 + 0 = 0.  If the words are completely different, then S1 will be 0, but S2 will also be zero, so 2 - 0 + 0 = 2.

The more data you have to throw at this calculation, the better, to reduce the impact of noise on the result.