Saturday, November 20, 2021

Word Transposition in the Voynich Manuscript

This is a follow-on to my last post, but I don't want to bother recapping the argument from last time, so I'll just start over fresh and go a different direction.

Summary
While a text in an unknown language may look random, there are two "forces" that govern the appearance of words in the text. One of those "forces" is absent in the Voynich Manuscript, and I think that indicates that a word transposition step has taken place.

Argument
The following graph shows how often a word recurs in a text at a given distance after it has occurred once.


The graph above shows the likelihood that a given word will appear a second time after it has appeared a first time. For example, in a Latin medieval prose text (red line), if a word appears once then there is an extremely low chance (0.03%) that the next word will be the same word. That rises to an almost 1% chance that it will appear six words later, then slowly drops off to a 0.76% chance that it will appear 30 words later, and a roughly 0.6% chance that it will appear 100 words later.
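A curve like this can be approximated with a short script. The sketch below is an assumption about how such a measurement might be made, not the script behind the graph: it treats the text as a flat list of tokens and, for each distance, counts how often the token at position i + distance equals the token at position i. The function names and the toy sentence are made up for illustration.

```python
def recurrence_rate(tokens, distance):
    """Fraction of positions i where tokens[i + distance] == tokens[i]."""
    pairs = len(tokens) - distance
    if pairs <= 0:
        return 0.0
    hits = sum(1 for i in range(pairs) if tokens[i] == tokens[i + distance])
    return hits / pairs


def recurrence_curve(tokens, max_distance=100):
    """Map each distance 1..max_distance to its recurrence rate."""
    return {d: recurrence_rate(tokens, d) for d in range(1, max_distance + 1)}


# Toy input only; a real run would use a tokenized transcription.
sample = "the cat saw the dog and the dog saw the cat".split()
curve = recurrence_curve(sample, max_distance=3)
```

On a real text the values at each distance would be plotted against distance to give one line of the graph. Line and paragraph boundaries are ignored here, which is a simplification.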

A similar phenomenon can be seen in the early modern French novel Pantagruel (blue line).

This curve could be described as the product of the interaction of two "forces":
  1. Strong repulsive force: Instances of the same word have a very low likelihood of being found in very close vicinity to each other. Perhaps languages are naturally structured in a way that avoids close repetition.
  2. Weak attractive force: Instances of the same word have a higher likelihood of being found in the same broad area of the text as each other. Intuitively it seems like this should not apply to high-frequency words with low semantic content (articles, prepositions, etc.), since there is no reason for these to be grouped together in the same area of a text. Instead, this ought to apply primarily to lower-frequency words with high semantic content, since these words will be tied to the topic of discourse, and will therefore be clustered in areas of the text where the topic relates to their semantic domain. (I should have proved this out, but I didn't.)

Interestingly, Latin syllables (orange line) respond to the same strong repulsive force as words, but not the weak attractive one. This makes sense if the weak attractive force relates to semantic content, because syllables themselves have no semantic content and are therefore not tied to the topic of the text. Instead, with Latin syllables we see a strong tendency for syllables not to repeat in close vicinity to each other, but then the curve just rises to a plateau.

So what do we see in the VM?


Words in the Voynich Manuscript demonstrate the effects of the weak attractive force more or less like Latin words. This suggests they have a semantic component, and there is some kind of topicalization going on. However, the VM shows no evidence of the strong repulsive force. What could cause that?

The strong repulsive force works over a very short distance, generally less than five words. If words were shuffled around so they were separated from their neighbors by a distance of five words or more, then this would conceal the effect of the strong repulsive force.
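To make the idea concrete, here is a minimal sketch of one possible word transposition. The columnar scheme is purely a hypothetical illustration, not a claim about how the VM was actually produced: words are written row by row into a grid `cols` wide and then read off column by column, so words that were neighbors end up roughly one column-height apart, and words that were `cols` apart become neighbors.

```python
def transpose_words(tokens, cols):
    """Write tokens row by row into a grid `cols` wide, read column by column."""
    return [tokens[i] for c in range(cols) for i in range(c, len(tokens), cols)]


def adjacent_repeats(tokens):
    """Count positions where a token is immediately followed by itself."""
    return sum(1 for a, b in zip(tokens, tokens[1:]) if a == b)


# Toy text with no immediate repetition: 'x' recurs every third word.
plain = "x a b x c d x e f".split()
shuffled = transpose_words(plain, cols=3)
# shuffled == ['x', 'x', 'x', 'a', 'c', 'e', 'b', 'd', 'f']
```

In the toy text no word is ever next to itself, but after transposition the three instances of `x` sit side by side. The short-range avoidance of repetition has been hidden, while the overall word counts are unchanged.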

In other words, perhaps there is a transposition step in the VM cipher, operating on words. This could solve a lot of problems.

For example, such a transposition could also explain why the text does not exhibit line-breaking features that make it clear whether it runs right-to-left or left-to-right.

It might also explain why the last lines of some paragraphs (especially in the Currier A sections) have gaps in them. Perhaps these gaps are slots that were simply not filled in by the transposition algorithm.


Indeed, if we suppose that the transposition works on the level of the paragraph, then that could explain why so many paragraphs begin with a word containing an ornate gallows letter. If the transposition algorithm resets for each paragraph, then the reader would need a visual cue to indicate where to start over again.

I can even imagine algorithms that could produce the phenomenon of head words, body words and tail words, though this is a bit more of a stretch since that would mean there is some connection between what a word is and where the transposition algorithm puts it.

Lastly, this could explain why the VM has no punctuation. In manuscripts of this era punctuation was common (though not universal). Since punctuation marks stand between words in connected linear text, if the words are shuffled through some kind of transposition algorithm then it might no longer be clear where to put the punctuation marks.

So...how would one prove or disprove the existence of word transposition in the VM?

2 comments:

  1. I'm not sure this is as straightforward as you think. daiin daiin and qokedy qokedy (etc.) might be distorting your distance==1 counts. There's a strong case that they may each be concealing other things in the plaintext.

    1. This is interesting. Of course, word transposition would create a certain number of high-frequency pairs naturally, so to tease out whether pairs like 'daiin daiin' and 'qokedy qokedy' could be the product of word transposition or some other process, I decided to look at their actual frequency compared to the frequency that random order would predict.

      The results were not what I expected.

      In Currier A, the pair 'daiin daiin' occurs *less* frequently than random chance would predict, and the pair 'chol chol' occurs *more* frequently.

      In Currier B, the pair 'daiin daiin' occurs *less* frequently than random chance would predict, and the qok- pairs (like 'qokedy qokedy') occur *more* frequently.

      This is apparent even if I account for the differences in head, body and tail word frequency.

      So you're right, something else is affecting the frequency of identical pairs. Word transposition doesn't really explain this.
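The comparison described in this reply can be sketched as follows. This is an assumed reconstruction, not the calculation actually used: it treats the text as one flat sequence and ignores line and paragraph breaks and head, body and tail positions. If a word occurs n times among N tokens, a uniformly random ordering is expected to contain n(n-1)/N adjacent identical pairs of that word. The function names and the toy token list are illustrative only.

```python
from collections import Counter


def observed_identical_pairs(tokens, word):
    """Count adjacent positions where `word` is followed by itself."""
    return sum(1 for a, b in zip(tokens, tokens[1:]) if a == b == word)


def expected_identical_pairs(tokens, word):
    """Expected count of adjacent `word word` pairs under random ordering."""
    n = Counter(tokens)[word]
    total = len(tokens)
    return n * (n - 1) / total if total else 0.0


# Toy token list, not real transcription data.
toy = "daiin daiin chol daiin".split()
observed = observed_identical_pairs(toy, "daiin")
expected = expected_identical_pairs(toy, "daiin")
```

A word whose observed count sits well above or below the expected value is one whose doubling is not explained by random ordering alone.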
