Thursday, October 28, 2021

Is there unusual context dependency in the VM?

Back in 2015 Torsten Timm wrote a paper titled "How the Voynich Manuscript was Created", in which he argued that the Voynich Manuscript (VM) was created by a process of copying and altering glyph groups that had already been written. One piece of evidence Timm used to support this argument was the measurable fact that a word found on one line of the manuscript is more likely to be found again on the three lines immediately above it than further away in the text.

In this post I'll dig into a specific problem with Timm's argument, and I will argue that what Timm observed is a natural language feature, but that following his approach reveals another interesting feature of the VM text.

Let's start with Timm's argument. The following graph is taken from his paper, and it shows the likelihood that a word on one line of the VM will be found on a line before it:


Here you see that a word found on any given line has almost a 7% chance of being found elsewhere in the same line (position 0), a roughly 6.5% chance of being found one line higher (position 1), a 6% chance of being found two lines higher (position 2), and so forth. The further back you go, the lower the likelihood of finding your word repeated, with the curve flattening out at around 4%.
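
For readers who want to try this kind of count themselves, here is a minimal Python sketch of one way to compute it. It rests on assumptions of mine and is not necessarily Timm's exact procedure: the function name repeat_rates and the parameter max_offset are made up for illustration, words are split on whitespace, position 0 counts a word as repeated if it occurs more than once on its own line, and position k counts it as repeated if it occurs anywhere on the line k lines above.

```python
from collections import Counter

def repeat_rates(lines, max_offset=3):
    """For each offset k in 0..max_offset, return the fraction of word tokens
    that also occur k lines higher (k = 0: elsewhere on the same line).
    Assumption: this definition is a guess at the measure, not Timm's own code."""
    tokenized = [line.split() for line in lines]
    line_sets = [set(words) for words in tokenized]
    hits = [0] * (max_offset + 1)
    trials = [0] * (max_offset + 1)
    for i, words in enumerate(tokenized):
        counts = Counter(words)
        for word in words:
            # Position 0: does the word occur a second time on this line?
            trials[0] += 1
            if counts[word] > 1:
                hits[0] += 1
            # Positions 1..max_offset: look at the lines above, if they exist.
            for k in range(1, max_offset + 1):
                if i - k < 0:
                    break
                trials[k] += 1
                if word in line_sets[i - k]:
                    hits[k] += 1
    return [h / t if t else 0.0 for h, t in zip(hits, trials)]

# Toy input with made-up tokens, only to show the shape of the result.
toy = ["a b a", "a c", "d a"]
print([round(r, 3) for r in repeat_rates(toy, max_offset=2)])  # [0.286, 0.5, 0.5]
```

A real run would pass in the lines of a transcription and a larger max_offset, then plot the returned list against the offset to get a curve like the one above.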

Does the curve in this graph represent a natural feature or an unnatural one? On the face of it, it seems a phenomenon like this ought to be perfectly natural. If we have an herbal, for example, we would expect a word like "verbena" or "pigroot" to be localized to the section of the herbal that discusses it. But would that be sufficient to move the line on the graph?

Timm argued that this is an unnatural feature, and supported this argument by carrying out the same exercise with a Latin text (the Aeneid) and an English text (Dryden's translation of the Aeneid). Here is what he found:


Here you can see that the Voynich manuscript has a curve that is dramatically different from both the Latin and English texts that Timm chose for comparison.

Note that the English line dips down at the far left, a phenomenon which Timm attributed to Dryden's rhyme scheme. More on that below.

Here's the problem: The Aeneid is not normal in terms of its repetitiveness. If you conduct this same exercise with all of the medieval prose and poetry in the LatinISE corpus, you find the following:


As you can see from this graph:
  • The Aeneid is less repetitive than Medieval Latin poetry. This is probably in part due to Vergil's style (maybe he preferred to avoid repetition) and probably also in part due to the fact that Medieval Latin made more use of low-content high-frequency words than Classical Latin.
  • Latin poetry is less repetitive than Latin prose, but this is partly due to a difference in line length. The curve in the graph above resulted from breaking prose texts down into lines of up to 40 characters in length. If I had used an 80-character limit instead, the curve would have peaked at 7.8%. If I had used a 23-character limit, the prose curve would have come close to matching the poetry curve.
  • In both Medieval Latin poetry and prose there is a lower tendency for a word to be found in its own line (position 0) than in the next line (position 1). This same phenomenon appeared in Timm's graph for the "English" line, and he explained it as a product of Dryden's rhyme scheme. However, since there is no rhyme scheme in play in Latin prose, there must be another explanation.
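
The line-length point in the second bullet is easy to experiment with. The sketch below shows one possible way to break running prose into lines of up to a given number of characters, using Python's standard textwrap module. It is an assumption that the breaks fall greedily at word boundaries, the helper name wrap_prose is made up for illustration, and the sample text is a placeholder rather than anything from the LatinISE corpus.

```python
import textwrap

def wrap_prose(text, max_chars=40):
    """Greedily break running prose into lines of at most max_chars characters,
    splitting only at whitespace so that words stay intact.
    Assumption: a single word longer than max_chars is left on its own line."""
    return textwrap.wrap(text, width=max_chars,
                         break_long_words=False, break_on_hyphens=False)

# Placeholder Latin sample, for illustration only.
sample = ("in principio creavit deus caelum et terram terra autem erat "
          "inanis et vacua et tenebrae super faciem abyssi")

for limit in (23, 40, 80):
    lines = wrap_prose(sample, limit)
    words_per_line = sum(len(line.split()) for line in lines) / len(lines)
    print(limit, len(lines), round(words_per_line, 1))
```

A higher character limit puts more words on each line, so any given word has more candidates to match against on a neighbouring line. That is consistent with the prose curve rising at 80 characters and falling toward the poetry curve at 23.
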
In my opinion, the interesting thing Timm's analysis reveals is actually this: in the VM, a word is more likely to be found again on its own line (position 0) than on the lines above it, which is the reverse of what the Medieval Latin poetry and prose show. I have an idea of what this could mean, but I don't want to make this post unnecessarily long, so I will dig into it in a separate post.
