It turns out there is an established algorithm for measuring semantic distance between words in a text that is somewhat similar to the algorithm I have developed. It is called Second-Order Co-occurrence Pointwise Mutual Information (SOC-PMI), and there is a paper about it on the University of Ottawa website.
However, my algorithm is a little bit simpler than SOC-PMI, and it treats context words differently based on whether they occur before or after the subject word. I like mine better.
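As a rough illustration of what direction-sensitive context can mean, here is a minimal Python sketch. It is not the algorithm used for this post, whose code is not published; the function names, the window size, and the use of cosine similarity are all assumptions made for the example. Each word gets a vector of counts keyed by (side, neighbor), so a neighbor seen before the word is a different feature from the same neighbor seen after it.

```python
from collections import Counter, defaultdict
from math import sqrt


def directional_context_vectors(tokens, window=2):
    """Map each word to counts of (side, neighbor) pairs within `window` tokens.

    side is "L" for neighbors before the word and "R" for neighbors after it,
    so the same neighbor on different sides counts as a different feature.
    (Hypothetical helper; the window size is an assumption.)
    """
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j == i:
                continue
            side = "L" if j < i else "R"
            vectors[word][(side, tokens[j])] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


tokens = "the cat sat on the mat and the dog sat on the rug".split()
vectors = directional_context_vectors(tokens, window=1)
print(cosine(vectors["cat"], vectors["dog"]))  # same neighbors on the same sides
print(cosine(vectors["cat"], vectors["on"]))   # same neighbors, opposite sides
```

In this toy text, "cat" and "dog" both have "the" before them and "sat" after them, so they come out as near-identical. "on" has the same two neighbors but on opposite sides, so it shares no features with "cat" at all. A measure that ignored direction would rate all three words as alike.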
Using my algorithm (which I may someday put online, if I ever have time to tidy up the code), I have generated a series of images. These images show the semantic relationship between words in the Voynich manuscript and several other texts that are approximately the size of the Voynich manuscript. In these images, each point along the x and y axes represents a word in the lexicon of the text, and the brightness at any point (x, y) represents the semantic similarity between the words x and y. There is a bright line at x = y because words are completely similar to themselves, and clusters of brightness represent clusters of words with common meanings.
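The mapping from similarity to brightness described above can be sketched in a few lines of Python. This is only an illustration under assumptions: it takes a ready-made square matrix of similarities in the range 0 to 1, uses a hypothetical function name, and writes a plain-text PGM grayscale image. How the words are ordered along the axes is not covered here, and that ordering is what makes clusters visible as blocks.

```python
def similarity_to_pgm(matrix):
    """Render a square matrix of similarities in [0, 1] as a plain-text PGM image.

    (Hypothetical helper for illustration; 0 maps to black and 1 to white.)
    """
    size = len(matrix)
    lines = ["P2", f"{size} {size}", "255"]
    for row in matrix:
        # Clamp to [0, 1] so stray values cannot produce invalid pixels.
        pixels = [round(255 * min(1.0, max(0.0, value))) for value in row]
        lines.append(" ".join(str(p) for p in pixels))
    return "\n".join(lines) + "\n"


# Made-up 3-word lexicon: words 0 and 1 are similar, word 2 is unrelated.
similarity = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
pgm = similarity_to_pgm(similarity)
print(pgm)
```

The diagonal comes out at full brightness because each word is compared with itself, and the bright 2-by-2 block in the corner is the small-scale version of an island of similarity.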
First, here is a control text created from 2000 words arranged in random order. The text is meaningless, so the image is basically blank except for the line x = y.
The following image is generated from a text in Vulgar Latin. Because Latin is an inflected language, and my algorithm doesn't recognize inflected forms as belonging to the same word, there are many small islands of similarity.
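A control text of this kind is easy to produce. The sketch below is one possible reading, not the procedure actually used: it assumes the 2000 words are a lexicon of made-up tokens and that the text is built by drawing from it uniformly at random. The function name, the token spelling, and the text length are all hypothetical.

```python
import random


def random_control_text(lexicon, length, seed=0):
    """Draw `length` tokens uniformly at random from `lexicon`.

    Because every position is independent of its neighbors, no word has a
    characteristic context, which is what makes the text meaningless.
    """
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    return [rng.choice(lexicon) for _ in range(length)]


# Assumed sizes: a 2000-word lexicon and a 20000-token text.
lexicon = [f"w{i}" for i in range(2000)]
control = random_control_text(lexicon, length=20000)
print(" ".join(control[:10]))
```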
The next image is taken from an English text. Due to the more analytic nature of English, the islands of similarity are larger.
The next image is taken from a text in Wampanoag. I expected this text to be more like the Latin text, but I think the source text is actually smaller in terms of the number of words (though similar in terms of the number of graphemes).
Now, for the cool one. This is the Voynich Manuscript.
I wonder if this shows a text written in two languages. The light gray square in the upper left corner, together with the two gray bands in the same rows and columns, represents a subset of words that exist in their own context, a context not shared with the rest of the text. One possible explanation would be that the "common" language of the text contains within it segments of text in another "minority" language.
About 25% of the lexicon of the text belongs to the minority language, and the remaining 75% belongs to the common language. The minority language doesn't contain any significant islands of brightness, so it may be noise of some kind--either something like cryptographic nulls, or perhaps something like litanies of names. If I have time, I'll try to split the manuscript into two separate texts, one in each language, and see what the analysis looks like at that point.
Good night.