Thursday, October 28, 2021

Is there unusual context dependency in the VM?

Back in 2015 Torsten Timm wrote a paper titled How the Voynich Manuscript was Created, in which he argued that the VM was created by a process of copying and altering glyph groups that had already been written. One piece of evidence Timm used to support this argument was the measurable fact that words found on one line of the manuscript are more likely to be found on the three lines immediately above it than further away in the text.

In this post I'll dig into a specific problem with Timm's argument, and I will argue that what Timm observed is a natural language feature, but that following his approach reveals another interesting feature of the VM text.

Let's start with Timm's argument. The following graph is taken from his paper, and it shows the likelihood that a word on one line of the VM will be found on a line before it:


Here you see that a word found on any given line has almost a 7% chance of being found elsewhere in the same line (position 0), a roughly 6.5% chance of being found one line higher (position 1), a 6% chance of being found two lines higher (position 2), and so forth. The further back you go, the lower the likelihood of finding your word repeated, with the curve flattening out at around 4%.
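
For readers who want to try this kind of measurement themselves, here is a minimal sketch in Python. It is my own reading of the statistic, not Timm's actual code: for each word token, it checks whether the same word appears again on the same line (position 0) or anywhere on the line d lines above (position d). The function name repeat_rate_by_distance and the max_distance parameter are made up for illustration.

```python
from collections import Counter

def repeat_rate_by_distance(lines, max_distance=5):
    """Return {distance: share of word tokens that recur `distance` lines above}.

    Distance 0 means the word occurs at least twice on its own line.
    This is an assumed definition; the original paper may count differently.
    """
    tokenized = [line.split() for line in lines]
    hits = Counter()
    totals = Counter()
    for i, words in enumerate(tokenized):
        same_line_counts = Counter(words)
        for d in range(max_distance + 1):
            if i - d < 0:
                continue  # no line that far above
            earlier = set(tokenized[i - d])
            for w in words:
                totals[d] += 1
                if d == 0:
                    # Same line: the token must have a twin elsewhere on the line.
                    if same_line_counts[w] > 1:
                        hits[0] += 1
                elif w in earlier:
                    hits[d] += 1
    return {d: hits[d] / totals[d] for d in range(max_distance + 1) if totals[d]}

# Toy input only; these three lines are not taken from the manuscript.
toy_lines = ["daiin chol daiin", "chol shey", "qokeedy daiin"]
rates = repeat_rate_by_distance(toy_lines, max_distance=2)
```

Plotting the returned rates against distance should give a curve of the same general kind as the one above, though the exact values will depend on the transcription and on how words are tokenized.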

Does the curve in this graph represent a natural feature or an unnatural one? On the face of it, it seems a phenomenon like this ought to be perfectly natural. If we have an herbal, for example, we would expect a word like "verbena" or "pigroot" to be localized to the section of the herbal that discusses it. But would that be sufficient to move the line on the graph?

Timm argued that this is an unnatural feature, and supported this argument by carrying out the same exercise with a Latin text (the Aeneid) and an English text (Dryden's translation of the Aeneid). Here is what he found:


Here you can see that the Voynich manuscript has a curve that is dramatically different from both the Latin and English texts that Timm chose for comparison.

Note that the English line dips down at the far left, a phenomenon which Timm attributed to Dryden's rhyme scheme. More on that below.

Here's the problem: The Aeneid is not normal in terms of its repetitiveness. If you conduct this same exercise with all of the medieval prose and poetry in the LatinISE corpus, you find the following:


As you can see from this graph:
  • The Aeneid is less repetitive than Medieval Latin poetry. This is probably in part due to Vergil's style (maybe he preferred to avoid repetition) and probably also in part due to the fact that Medieval Latin made more use of low-content high-frequency words than Classical Latin.
  • Latin poetry is less repetitive than Latin prose, but this is partly due to a difference in line length. The curve in the graph above resulted from breaking prose texts down into lines of up to 40 characters in length. If I had used an 80-character limit instead, the curve would have peaked at 7.8%. If I had used a 23-character limit, the prose curve would have come close to matching the poetry curve.
  • In both Medieval Latin poetry and prose there is a lower tendency for a word to be found in its own line (position 0) than in the next line (position 1). This same phenomenon appeared in Timm's graph for the "English" line, and he explained it as a product of Dryden's rhyme scheme. However, since there is no rhyme scheme in play in Latin prose, there must be another explanation.
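
The line-length effect in the second bullet depends on how prose is cut into lines. Here is a minimal sketch of one way to do it, assuming a greedy word wrap with Python's standard textwrap module. This is an assumption about the method, and prose_to_lines is a hypothetical helper name.

```python
import textwrap

def prose_to_lines(text, max_chars=40):
    """Break running prose into lines of at most `max_chars` characters.

    Words are never split, so a single word longer than `max_chars`
    would still end up on a line of its own.
    """
    return textwrap.wrap(
        text,
        width=max_chars,
        break_long_words=False,
        break_on_hyphens=False,
    )

# Opening words of the Aeneid, lower-cased and unpunctuated for the example.
sample = ("arma virumque cano troiae qui primus ab oris "
          "italiam fato profugus laviniaque venit litora")
lines_40 = prose_to_lines(sample, max_chars=40)
lines_23 = prose_to_lines(sample, max_chars=23)
```

Changing max_chars to 23, 40 or 80 makes it possible to see how much of the curve is an artifact of line length alone.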
In my opinion, the interesting thing Timm's analysis reveals is actually this: In the VM, a word is more likely to be found again on its own line than to be found on the lines above it. I have an idea of what this could mean, but I don't want to make this post unnecessarily long, so I will dig into it in a separate post.

Tuesday, October 26, 2021

Are there nulls in the VM?

Renaissance cryptographers used nulls to break up repeated sequences of characters and to alter the frequency statistics of a text. Did the author of the VM do anything similar? If so, how could we detect it?

Null characters increase the entropy of a text. The more randomly a null character is employed, the greater the entropy it adds to the text. For example, the Latin text Historia rerum in partibus transmarinis gestarum written by William of Tyre conveys an average of 2.711 bits per character when measured using third-order entropy. If we randomly insert a null character "@" into the text at an average interval of every five characters, the entropy increases to 3.044 bits per character.
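
To make the measurement concrete, here is a minimal Python sketch. It assumes "third-order entropy" means the conditional entropy of a character given the two characters before it; the post does not spell out the definition, so treat that as my reading. The function names, the seed and the per-character insertion probability of 0.2 (roughly one null every five characters) are illustrative choices.

```python
import math
import random
from collections import Counter

def conditional_entropy(text, order=3):
    """Bits per character: entropy of a character given the previous order-1 characters."""
    if len(text) < order:
        return 0.0
    ngrams = Counter(text[i:i + order] for i in range(len(text) - order + 1))
    contexts = Counter()
    for gram, n in ngrams.items():
        contexts[gram[:-1]] += n  # how often each (order-1)-character context occurs
    total = sum(ngrams.values())
    h = 0.0
    for gram, n in ngrams.items():
        # p(gram) * log2 p(last character | context)
        h -= (n / total) * math.log2(n / contexts[gram[:-1]])
    return h

def insert_nulls(text, null="@", rate=0.2, seed=0):
    """After each character, insert `null` with probability `rate`."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        out.append(ch)
        if rng.random() < rate:
            out.append(null)
    return "".join(out)

# Toy text: perfectly periodic, so every character is predictable from the two before it.
plain = "abc" * 100
with_nulls = insert_nulls(plain)
h_plain = conditional_entropy(plain)
h_nulls = conditional_entropy(with_nulls)
```

On the toy text, h_plain is zero and h_nulls is greater than zero, which is the same direction of change described above for the Historia.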

In cases like this, it seems like we ought to be able to ferret out the null character by identifying the character which, when removed, causes a significant drop in the entropy of the text.

The procedure we'll use is:

1. Calculate the average bits per character of a text which we think may be a cipher. Call this value B_null.

2. Make a copy of the cipher text and remove from the copy a character C which we think might be a null.

3. Calculate the average bits per character of the text with C removed. Call this value B_C.

4. Calculate the "nullitude" of the character C using N_C = B_C / B_null.
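
The procedure above can be sketched in Python as follows. This is a minimal illustration, not the code behind the charts in this post: it again assumes that bits per character means the entropy of a character given the two before it, and the names entropy3, nullitude and rank_null_candidates are hypothetical.

```python
import math
from collections import Counter

def entropy3(text):
    """Assumed measure: entropy (bits) of a character given the two before it."""
    if len(text) < 3:
        return 0.0
    trigrams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    contexts = Counter()
    for gram, n in trigrams.items():
        contexts[gram[:2]] += n
    total = sum(trigrams.values())
    return -sum((n / total) * math.log2(n / contexts[g[:2]])
                for g, n in trigrams.items())

def nullitude(text, char, entropy=entropy3):
    """Entropy with `char` removed, divided by the entropy of the full text."""
    baseline = entropy(text)
    if baseline == 0:
        raise ValueError("baseline entropy is zero; the ratio is undefined")
    return entropy(text.replace(char, "")) / baseline

def rank_null_candidates(text, entropy=entropy3):
    """Return (character, nullitude) pairs, most null-like (lowest ratio) first."""
    scores = [(c, nullitude(text, c, entropy)) for c in set(text)]
    return sorted(scores, key=lambda pair: pair[1])

# Toy cipher: a periodic text with "@" dropped in after every third character.
base = "abcd" * 50
toy = "".join(ch + ("@" if i % 3 == 0 else "") for i, ch in enumerate(base))
ranking = rank_null_candidates(toy)
```

In this toy case "@" comes first in the ranking, because removing it restores a perfectly predictable text. Real texts will not separate this cleanly, which is why the charts are read by eye for a character that stands out from the rest.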

First, we should know what the results look like with a text that does not contain nulls. Here are the nullitude values for characters in the Historia when no null character is inserted:

Here we do not see a significant drop below the value of 1, indicating that no single character, when removed from the text, causes a noticeable decrease in entropy. This is what we expected to see.

Now, looking at a copy of the Historia into which a null "@" has been inserted randomly about every five characters, we see a different result:


Here the null character stands out clearly, causing a significant drop in the entropy of the text once it is removed. Again, this is what we expect.

We can apply the same test to the Voynich Manuscript. Here is what we find with Currier A:


The result here is interesting because, at the left side of the chart, we see that removal of the character "e" causes a drop in entropy. It isn't a huge drop, but it does stand out from the rest of the characters. This suggests the possibility that Currier A words like cheol and cheor might be synonyms of the more common words chol and chor.

Though "e" looks slightly nullish in Currier A, we do not see the same phenomenon in Currier B:


In Currier B the distribution is more like the Latin plaintext above, with no particular character looking more like a null than the others.

Friday, October 1, 2021

Head words, body words and tail words

Lines of Voynichese text can be divided into a head (the first word), a tail (the last word) and a body (all the words in the middle).

Words can be classified according to where they tend to fall in the line:

  • Head words tend to be the first word in a line.
    • The first word of a paragraph seems to be its own special kind of head word.
  • Tail words tend to be the last word in a line.
    • Words ending in -m and -g tend to be tail words.
    • Words ending in -aly tend to be tail words.
  • Body words tend to fall in the middle of a line.
  • Free words can appear anywhere.

These categories aren't strict, so head words can sometimes be found in the body of a line, but are almost never found at the end. Similarly, tail words can be found in the body, but almost never at the head. There is a lot more work to be done looking at the relationship between the structure of a word and its classification.
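
Here is a minimal Python sketch of how such a classification could be computed from a transcription with one line of text per entry. The 60% threshold and the minimum count of 5 are arbitrary values chosen for illustration, and the function names are hypothetical; none of this is the exact method behind the observations above.

```python
from collections import Counter, defaultdict

def position_profile(lines):
    """Count, for every word, how often it occurs as head, body or tail of a line."""
    profile = defaultdict(Counter)
    for line in lines:
        words = line.split()
        if len(words) < 3:
            continue  # too short to have a distinct head, body and tail
        profile[words[0]]["head"] += 1
        profile[words[-1]]["tail"] += 1
        for w in words[1:-1]:
            profile[w]["body"] += 1
    return profile

def classify(counts, threshold=0.6, min_count=5):
    """Label a word by its dominant position, or "free" if none dominates.

    Caveat: a line has many body slots but only one head and one tail,
    so raw shares lean toward "body"; a fuller analysis would correct for that.
    """
    total = sum(counts.values())
    if total < min_count:
        return "unknown"  # too few occurrences to say anything
    for position in ("head", "body", "tail"):
        if counts[position] / total >= threshold:
            return position
    return "free"

# Toy lines built from placeholder tokens, not real Voynichese.
toy = ["h x y t"] * 5 + ["f x t"] * 2 + ["h f t"] * 2 + ["h x f"] * 2
labels = {word: classify(counts) for word, counts in position_profile(toy).items()}
```

Because the categories are tendencies rather than rules, a threshold below 100% is needed; where to set it is a judgment call.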

This explains why I failed to find a clear direction of text when I was looking at line breaks. I was looking for common pairs that were broken across lines, but since the head and the tail of the line are drawn from statistically different sets of words than the body, those pairs turn out to be very rare in the rest of the text. (Thanks to Nick Pelling for the observation that initial and final letters of Voynichese probably screwed up my test!)

Interestingly, page F81R (where the text is laid out like a poem) follows these head-and-tail tendencies. That is, the words at the heads of the lines tend to be head words elsewhere, and the words at the ends of lines tend to be tail words elsewhere. This suggests that the ragged line lengths on this page are intentional, and adds weight back to the hypothesis that this page contains a poem.