ere yamji bi ilan hūntahan sain nure be omire jakade...
There is a problem with the Voynich manuscript, and the problem is on the level of the apparent words, not the graphemes.
I mentioned in a post a while back that I have an algorithm that roughly measures the contextual similarity between words in a large text. Over the years I have gathered samples of the Book of Genesis in many languages, because it is approximately the same size as the Voynich manuscript and has been widely translated.
Since I have recently been working on another (unrelated) cipher, I have developed a set of tools for measuring solutions, and I thought I could use them in combination with my semantic tools to look at the VM.
What I am finding right now is interesting. I'm pulling the top 14 words from sample texts and creating a 14x14 grid of similarities between them. The most frequent words tend to be more functional and less content-bearing, so I am really looking for groups like prepositions, pronouns, and so on.
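To make the grid concrete, here is a minimal sketch in Python. The similarity algorithm itself is not described in this post, so everything below is a stand-in chosen for illustration: it assumes that "contextual similarity" can be approximated by the cosine similarity of neighbour-count vectors, and the function names (top_words, context_vectors, cosine, similarity_grid) and the window parameter are hypothetical.

```python
from collections import Counter
import math


def top_words(tokens, n=14):
    """Return the n most frequent tokens, most frequent first."""
    return [w for w, _ in Counter(tokens).most_common(n)]


def context_vectors(tokens, targets, window=1):
    """Count the words seen within `window` positions of each target word."""
    vectors = {w: Counter() for w in targets}
    for i, w in enumerate(tokens):
        if w in vectors:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    # Tag each neighbour with its relative position so that
                    # "before" and "after" contexts are kept apart.
                    vectors[w][(j - i, tokens[j])] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def similarity_grid(tokens, n=14, window=1):
    """Assumed stand-in for the grid: pairwise cosine scores of the top n words."""
    words = top_words(tokens, n)
    vecs = context_vectors(tokens, words, window)
    grid = [[cosine(vecs[a], vecs[b]) for b in words] for a in words]
    return words, grid
```

Calling similarity_grid(text.split()) on a whitespace-tokenised sample returns the 14 words and a 14x14 list of scores between 0 and 1. The absolute numbers depend heavily on the window size and on how context is counted, so they should not be expected to match the ranges quoted in this post.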
In highly inflected languages, the top 14 words are not especially similar to each other. The reason for this is that prepositions in inflected languages tend to go with different cases, and my algorithm treats each inflected form of a noun as an entirely different word. So, for example, in Latin every pair among the top 14 words has a similarity score below 0.2. In Wampanoag, the majority are under 0.1. In less inflected languages, like Middle English, Early Modern English, Italian and Czech, scores more commonly fall in the range 0.2-0.3, with a few in the 0.3-0.4 range.
The scores from the Voynich manuscript, however, are disconcertingly high. All of the top 14 words are very similar to each other, with scores falling in the 0.3-0.5 range. This suggests that the most common words in the Voynich manuscript are all used in roughly similar contexts. The scores average just a little under the numbers that I get for a completely random text.
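One way to picture the random-text comparison is a shuffled baseline, sketched below in Python. This is an assumption rather than the procedure used for the numbers above: the post does not say how the random text was produced, so the sketch simply shuffles the tokens of the real text, which keeps word frequencies intact and destroys word order. The names mean_off_diagonal and shuffled_baseline are hypothetical, and score_fn stands for any function that reduces a token list to a single number, such as the average of the off-diagonal cells of the 14x14 grid.

```python
import random
from statistics import mean


def mean_off_diagonal(grid):
    """Average of all cells of a square grid, excluding the diagonal."""
    n = len(grid)
    cells = [grid[i][j] for i in range(n) for j in range(n) if i != j]
    return mean(cells) if cells else 0.0


def shuffled_baseline(tokens, score_fn, trials=10, seed=0):
    """Average score_fn over several random reorderings of the same tokens.

    Assumption: shuffling the real text is an acceptable proxy for
    "a completely random text" with the same word frequencies.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    scores = []
    for _ in range(trials):
        shuffled = list(tokens)  # copy, so the caller's list is untouched
        rng.shuffle(shuffled)
        scores.append(score_fn(shuffled))
    return mean(scores)
```

Comparing score_fn(tokens) with shuffled_baseline(tokens, score_fn) shows how far a text sits from its own scrambled version. On the reading given here, a natural-language sample should score well below its baseline, while the Voynich text would land only slightly below it.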
Something is going on here that makes the words look as though they are in random order.