Thursday, August 26, 2021

A Voynich-like Code

I've just read an article titled The Linguistics of the Voynich Manuscript by Claire L. Bowern and Luke Lindemann, which summarizes previous scholarship on the manuscript and concludes that "the character-level metrics show Voynichese to be unusual, while the word- and line-level metrics show it to be regular natural language."

Reading the article reminded me to finish this post, which I started several weeks ago. Here, I'll outline a cipher built on ideas from my previous posts. I believe a late medieval or early Renaissance scholar might plausibly have created it, and I think it would produce some of the features of the Voynich manuscript.

I'll walk through the cipher steps with an English phrase and a Latin phrase: Can you read these words? Potesne legere haec verba?

Step 1: Remove the vowels from all of the words. This is what causes the words of the cipher text to carry less information than they would in the source language. In this example I'm treating the letter v as a vowel in Latin, but w as a consonant in English, because these are the historical conventions in these languages.

English: cn y rd ths wrds?

Latin: ptsn lgr hc rb?
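
Step 1 can be sketched in a few lines of Python. This is only an illustration: the function name devowel and its vowels parameter are my own choices, and the only rule it encodes is the one stated above (v counts as a vowel in Latin, w as a consonant in English).

```python
def devowel(text, vowels="aeiou"):
    """Delete every vowel from text; spaces and punctuation are kept."""
    return "".join(ch for ch in text if ch.lower() not in vowels)

# English: w is kept as a consonant.
english = devowel("can you read these words?")

# Latin: v is treated as a vowel, following the convention described above.
latin = devowel("potesne legere haec verba?", vowels="aeiouv")

print(english)  # cn y rd ths wrds?
print(latin)    # ptsn lgr hc rb?
```

A word made up entirely of vowels would disappear under this rule, so a real implementation would have to decide what to do with such words.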

It is an open question for me whether it is feasible to reverse this step in Latin. In English I know it is reasonable if you are familiar with the general content of the text, because a similar approach was used to create mnemonics describing Masonic ceremonies:


Step 2: Encipher each word using a substitution cipher that replaces each letter with a syllable, with special syllables reserved for the last letters in each word, to create the appearance of an inflected language. This is what creates the low second-order character entropy.

In this case, I have created the key using the first and last syllables from polysyllabic words at the beginning of Virgil's Aeneid. I haven't bothered to create a complete key; it only covers the letters needed for this example.


A partial key

Using this key, the two example sentences become:

English: viam tum prono vepria liprocaa?

Latin: favelaam otrogus prirum proma?
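
Step 2 can be sketched in Python as follows. The two tables here are an assumption: they are inferred by lining up the devoweled words from Step 1 with the enciphered words above, so they cover only the letters in these two sentences and may not match the full key in every detail. The names NON_FINAL, FINAL, encipher_word and encipher are made up for this sketch.

```python
# Syllables for letters that are NOT the last letter of a word.
# (Assumed: inferred from the two worked examples, not copied from a key table.)
NON_FINAL = {"c": "vi", "r": "pro", "t": "ve", "h": "pri", "w": "li",
             "d": "ca", "p": "fa", "s": "la", "l": "o", "g": "tro"}

# Syllables reserved for the last letter of a word, which act like inflections.
# (Assumed: inferred in the same way.)
FINAL = {"n": "am", "y": "tum", "d": "no", "s": "a",
         "r": "gus", "c": "rum", "b": "ma"}

def encipher_word(word):
    """Replace each letter with a syllable, using FINAL for the last letter."""
    if not word:
        return word
    head = "".join(NON_FINAL[ch] for ch in word[:-1])
    return head + FINAL[word[-1]]

def encipher(text):
    """Encipher space-separated, devoweled words (punctuation removed)."""
    return " ".join(encipher_word(word) for word in text.split())

print(encipher("cn y rd ths wrds"))  # viam tum prono vepria liprocaa
print(encipher("ptsn lgr hc rb"))    # favelaam otrogus prirum proma
```

Any letter missing from the partial tables raises a KeyError, which is a reminder that the key is incomplete. Because each plaintext letter always maps to the same syllable in a given position, the cipher text gets longer without gaining any extra information per word.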

One of the neat things about this cipher approach is that one could hypothetically train oneself to speak the cipher. 

Step 3: Write the cipher in a secret alphabet. This changes very little about the cipher, and might be considered more of a cultural requirement of the era.

To be clear, I don't think the Voynich cipher worked in exactly this way. For example, the frequency of daiin in the Currier A pages is almost exactly the frequency of t (representing et, ut, te, tu, etc.) in a long devoweled Latin text, but it isn't clear how daiin could be used to encode initial, medial or final t in other, longer words. If the underlying language of the VM is Latin, and it is encoded using a system like this, then it is likely that there is some additional complexity in step 2. For example, there might be a set of words (like daiin, chol, chor) that encode single letters, then another set of prefixes and suffixes to encode letters in longer words.

Friday, August 6, 2021

Old Cryptography and Entropy

In my last two posts, I first suggested that Voynich Manuscript 81R might contain a poem in Latin dactylic hexameter, but then I argued that the lines only convey about half of the information necessary to encode such a poem. In this post I'll try to reconcile those two arguments by showing that a late medieval/early Renaissance cipher system could have produced this effect.

The pages of the VM have been carbon-dated to between 1404 and 1438. If the text is not a hoax, and it was written within a century or so of the production of the vellum, then what cryptographic techniques might the author plausibly have known, and how would they impact the total bits per line of an enciphered poem?

According to David Kahn's The Codebreakers, the following methods might have been available to someone in Europe during that period. For most of these, I have created simulations using the Aeneid as a plaintext, and measured the effect on bits per line using the formula for Pbc from my last post.

  • Writing backwards (0.2% increase)
  • Substituting dots for vowels (28.5% decrease)
  • Foreign alphabets (little or no change, depending on how well the foreign alphabet maps to the plaintext alphabet)
  • Simple substitution (no change)
  • Writing in consonants only (45.6% - 49% decrease, depending on whether v and j are treated as vowels)
  • Figurate expressions (impractical to test, but likely to increase bits per line)
  • Exotic alphabets (no change, same as simple substitution)
  • Shorthand (impractical to test, but likely to decrease bits per line)
  • Abbreviations (impractical to test, but certain to decrease bits per line)
  • Word substitutions (did not test, but likely to cause moderate increase or decrease to bits per line)
  • Homophones for vowels (increase bits per line, but the exact difference depends on the number of homophones per vowel. With two homophones for each vowel, there was a 19.5% increase)
  • Nulls (increase bits per line, but the exact difference depends on the number of distinct nulls used and the number of nulls inserted per line)
  • Homophones for consonants (increase bits per line, but the exact difference depends on the number of homophones per consonant)
  • Nomenclators (impact depends on the type of nomenclator. I tested with a large nomenclator and got a 44.5% decrease in bits per line)
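
A simulation of the simpler methods in the list above might be set up along these lines. This is a rough sketch, not the code behind the percentages: the Pbc formula is not reproduced here, so the function unigram_bits_per_line is a stand-in measure of my own (mean line length times single-character entropy), and all function names are made up for the example. The sample is just the first two lines of the Aeneid.

```python
import math
from collections import Counter

VOWELS = "aeiou"

def write_backwards(line):
    return line[::-1]

def dots_for_vowels(line):
    return "".join("." if ch in VOWELS else ch for ch in line)

def consonants_only(line):
    return "".join(ch for ch in line if ch not in VOWELS)

def unigram_bits_per_line(lines):
    """Stand-in measure (NOT Pbc): unigram entropy times mean line length."""
    counts = Counter("".join(lines))
    total = sum(counts.values())
    entropy = -sum(n / total * math.log2(n / total) for n in counts.values())
    return entropy * total / len(lines)

def percent_change(lines, transform):
    """Percent change in the stand-in measure after applying transform."""
    before = unigram_bits_per_line(lines)
    after = unigram_bits_per_line([transform(line) for line in lines])
    return 100 * (after - before) / before

sample = ["arma virumque cano troiae qui primus ab oris",
          "italiam fato profugus laviniaque venit"]

for transform in (write_backwards, dots_for_vowels, consonants_only):
    print(transform.__name__, round(percent_change(sample, transform), 1))
```

Because the stand-in measure ignores character order, it reports no change at all for writing backwards, so it cannot reproduce the small increase listed above. A measure that takes context into account, and a much longer text, would be needed to get comparable numbers.
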
If 81R contains a poem in Latin dactylic hexameter, then it appears the encoding system caused something like a 47.9% decrease in the number of bits per line. Only two of the encoding methods above have a similar effect:
  • Writing in consonants only
  • Using a large nomenclator
The first of these options is intriguing, because removing the vowels from a Latin text causes a significant number of lexical collisions, especially if v and j are treated as vowels. If this is one of the steps in the Voynich cipher process, then the appearance of repeated sequences like daiin daiin daiin in the VM could result from sequences like ita ut tu, ut vitae tuae, etc.
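
The collisions are easy to see with a small Python sketch. The word list is a hand-picked illustration rather than a corpus, and the names devowel and collisions are made up for the example.

```python
from collections import defaultdict

def devowel(word, vowels="aeiouvj"):
    """Remove vowels; by default v and j are also treated as vowels."""
    return "".join(ch for ch in word if ch not in vowels)

def collisions(words, vowels="aeiouvj"):
    """Group words that become identical once their vowels are removed."""
    groups = defaultdict(set)
    for word in words:
        groups[devowel(word, vowels)].add(word)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}

words = "ita ut tu vitae tuae et te".split()

# With v and j treated as vowels, every word in the list collapses to "t".
print(collisions(words))

# The phrases mentioned above both become the same repeated sequence.
print(" ".join(devowel(w) for w in "ita ut tu".split()))      # t t t
print(" ".join(devowel(w) for w in "ut vitae tuae".split()))  # t t t
```

If v is kept as a consonant (vowels="aeiou"), vitae becomes vt and drops out of the group, which is why the choice of vowel set affects how many collisions occur.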

That, of course, cannot be the only story here. If the VM is written in a cipher that removes all of the vowels, then it must also be written in a cipher that encodes single Latin consonants as strings of multiple Voynich letters in order to account for the length of Voynichese words. This must also be done in a way that increases the lengths of words without significantly increasing the number of bits per line.

I think this is quite possible to do. In my next post I'll try to demonstrate this with a proof-of-concept cipher that creates a cipher text like the VM from a Latin plaintext.