Sunday, September 12, 2021
Lexical vs. Textual Frequency
Currier B has the high-frequency words qokeedy, qokaiin, qokedy, qokeey and qokain, which raises the textual frequency of the prefix qo- relative to its lexical frequency. (Textual frequency counts every occurrence of a word in the running text; lexical frequency counts each distinct word only once.)
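As a quick illustration of the difference, here is a small Python sketch; the helper function and the toy word list are mine, not actual transcription data.

from collections import Counter

def qo_frequencies(words):
    """Share of word tokens (textual) vs. distinct word types (lexical)
    that begin with qo-."""
    counts = Counter(words)
    textual = sum(n for w, n in counts.items() if w.startswith("qo")) / sum(counts.values())
    lexical = sum(1 for w in counts if w.startswith("qo")) / len(counts)
    return textual, lexical

# Toy word list: a few very frequent qo- words push the textual share of
# qo- well above its lexical share.
words = ["qokeedy"] * 50 + ["daiin"] * 30 + ["chedy", "shedy", "qokain"] * 5
textual, lexical = qo_frequencies(words)
print(f"textual: {textual:.2f}, lexical: {lexical:.2f}")  # textual: 0.58, lexical: 0.40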
Tuesday, September 7, 2021
Latin Contractions
My efforts to get a copy of The Curse of the Voynich are themselves apparently cursed. The first time I tried to order this book I was at a vacation rental for a month, and discovered that the postal service would not deliver it because the rental had no mailbox. The second time my order was canceled because the book was out of stock. I am hopeful that my third effort will meet with a better outcome.
While I wait for it to arrive, however, I've been looking at one of Nick Pelling's ideas. Did the Voynich cipher employ contraction and abbreviation as part of its process? If so, it seems like this could explain the relatively low amount of information conveyed by Voynichese words. It would be a lossy compression process similar to the removal of vowels, but perhaps more culturally appropriate to the 15th century.
I looked at the 1901 German translation of Adriano Cappelli's Lexicon abbreviaturarum, and it seems that conventions for contraction and abbreviation evolved over time such that by the 14th or 15th centuries scribes were using a number of methods in conjunction, including the use of a small set of symbols borrowed from Tironian notes. In order to understand these processes better, I took thirty random entries from the lexicon and looked at what the scribes chose to keep from the full written word and what they felt they were able to do away with. In general, I found that words could be divided into three parts:
Prefix: The prefix is made of consecutive letters from the start of the word, including at minimum the first letter. In my samples, the prefix is one character long about 53% of the time, two characters about 23% of the time, and three characters about 6% of the time.
Infix: The infix is made of letters that are generally not consecutive, chosen from the middle of the word. Presumably these are letters that differentiate one contracted word from another. There is roughly a 12% chance that a given letter from the middle of the word will appear in the infix.
Suffix: The suffix is made of consecutive letters from the end of the word, except the -m of accusative endings, which is sometimes dropped. The last letter was included in the suffix about 63% of the time, the second-to-last about 30% of the time, and the third-to-last about 6% of the time.
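To make that structure concrete, here is a rough Python sketch of a contraction generator that keeps a prefix, infix and suffix in roughly the proportions above. The probabilities are the ones from my samples; the function and the test word are mine, and of course real scribes chose letters by convention rather than at random.

import random

def contract(word, seed=None):
    """Contract a word by keeping a prefix, a sparse infix and a suffix,
    with roughly the proportions observed in the sampled entries."""
    rng = random.Random(seed)

    # Prefix: 1 letter ~53%, 2 letters ~23%, 3 letters ~6% of the time
    # (4 letters stands in for the remaining ~18%).
    roll = rng.random()
    prefix_len = 1 if roll < 0.53 else 2 if roll < 0.76 else 3 if roll < 0.82 else 4

    # Suffix: consecutive letters from the end.  The last letter is kept
    # ~63% of the time, the second-to-last ~30%, the third-to-last ~6%,
    # so draw a length whose tail probabilities match those rates.
    roll = rng.random()
    suffix_len = 3 if roll < 0.06 else 2 if roll < 0.30 else 1 if roll < 0.63 else 0

    # Infix: each letter between prefix and suffix is kept with ~12% probability.
    middle = word[prefix_len:len(word) - suffix_len]
    infix = "".join(ch for ch in middle if rng.random() < 0.12)

    suffix = word[len(word) - suffix_len:] if suffix_len else ""
    return word[:prefix_len] + infix + suffix

print(contract("dominus"))  # prints one possible contraction, e.g. "ds" or "dmus"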
What is interesting about this, to me, is that the first letter of each word is always retained. That means, if the Voynich cipher employs abbreviations and/or contractions, and the subsequent steps are only forms of substitution (and not, for example, transposition), then it might be possible to crack the first letters of Voynichese words.
It would be hard to know if you had gotten it right, though!
Thursday, August 26, 2021
A Voynich-like Code
I've just read an article titled The Linguistics of the Voynich Manuscript by Claire L. Bowern and Luke Lindemann, which summarizes previous scholarship on the manuscript and concludes that "the character-level metrics show Voynichese to be unusual, while the word- and line-level metrics show it to be regular natural language."
Reading the article reminded me to finish this post, which I started several weeks ago. Here, I'll outline a cipher built from ideas in my previous posts, one that I believe a late medieval or early Renaissance scholar could plausibly have created and that I think would produce some of the features of the Voynich manuscript.
I'll walk through the cipher steps with an English phrase and a Latin phrase: Can you read these words? Potesne legere haec verba?
Step 1: Remove the vowels from all of the words. This is what causes the words of the cipher text to carry less information than they would in the source language. In this example I'm treating the letter v as a vowel in Latin, but w as a consonant in English, because these are the historical conventions in these languages.
English: cn y rd ths wrds?
Latin: ptsn lgr hc rb?
It is an open question for me whether it is feasible to reverse this step in Latin. In English I know it is feasible if you are familiar with the general content of the text, because a similar approach was used to create mnemonics describing Masonic ceremonies.
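As a minimal sketch of this step in Python (the function name and vowel sets are mine, chosen to match the conventions described above):

def devowel(text, vowels):
    """Strip the vowels from every word, keeping spaces and punctuation."""
    return "".join(ch for ch in text if ch.lower() not in vowels)

print(devowel("Can you read these words?", "aeiou"))    # Cn y rd ths wrds?
print(devowel("Potesne legere haec verba?", "aeiouv"))  # Ptsn lgr hc rb?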
Step 2: Encipher each word using a substitution cipher that replaces each letter with a syllable, with special syllables reserved for the last letters in each word, to create the appearance of an inflected language. This is what creates the low second-order character entropy.
In this case, I have created the key using the first and last syllables from polysyllabic words at the beginning of Virgil's Aeneid. I haven't bothered to create a complete key; it only covers the letters needed for this example.
Using this key, the two example sentences become:
English: viam tum prono vepria liprocaa?
Latin: favelaam otrogus prirum proma?
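As a sketch of this step in Python, here is the partial key that can be reconstructed from the two example sentences (it only covers the letters they need, and punctuation is ignored); a complete key would give every letter both a medial and a final syllable.

# Partial key reconstructed from the examples: one syllable for a letter
# in non-final position, another for the same letter at the end of a word.
MEDIAL = {"c": "vi", "r": "pro", "d": "ca", "w": "li", "p": "fa",
          "t": "ve", "s": "la", "l": "ot", "g": "ro", "h": "pri"}
FINAL  = {"n": "am", "y": "tum", "d": "no", "s": "a", "r": "gus",
          "c": "rum", "b": "ma"}

def encipher_word(word):
    """Replace each letter with its medial syllable, except the last
    letter, which takes its word-final syllable."""
    return "".join(MEDIAL[ch] for ch in word[:-1]) + FINAL[word[-1]]

def encipher(devoweled):
    return " ".join(encipher_word(w) for w in devoweled.split())

print(encipher("cn y rd ths wrds"))  # viam tum prono vepria liprocaa
print(encipher("ptsn lgr hc rb"))    # favelaam otrogus prirum proma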
One of the neat things about this cipher approach is that one could hypothetically train oneself to speak the cipher.
Step 3: Write the cipher in a secret alphabet. This changes very little about the cipher, and might be considered more of a cultural requirement of the era.
To be clear, I don't think the Voynich cipher worked in exactly this way. For example, the frequency of daiin in the Currier A pages is nearly exactly the frequency of t (representing et, ut, te, tu, etc.) in a long devoweled Latin text, but it isn't clear how daiin could be used to encode initial, medial or final t in other, longer words. If the underlying language of the VM is Latin, and it is encoded using a system like this, then it is likely that there is some additional complexity in step 2. For example, there might be a set of words (like daiin, chol, chor) that encode single letters, and another set of prefixes and suffixes that encode letters in longer words.
Friday, August 6, 2021
Old Cryptography and Entropy
In my last two posts, I first suggested that Voynich Manuscript 81R might contain a poem in Latin dactylic hexameter, but then I argued that the lines only convey about half of the information necessary to encode such a poem. In this post I'll try to reconcile those two arguments by showing that a late medieval/early Renaissance cipher system could have produced this effect.
The pages of the VM have been carbon-dated to between 1404 and 1438. If the text is not a hoax, and it was written within a century or so of the production of the vellum, then what cryptographic techniques might the author plausibly have known, and how would they impact the total bits per line of an enciphered poem?
According to David Kahn's The Codebreakers, the following methods might have been available to someone in Europe during that period. For most of these, I have created simulations using the Aeneid as a plain text, and measured the effect on bits per line using the formula for Pbc from my last post (a sketch of one such simulation appears at the end of this post).
- Writing backwards (0.2% increase)
- Substituting dots for vowels (28.5% decrease)
- Foreign alphabets (little or no change, depending on how well the foreign alphabet maps to the plaintext alphabet)
- Simple substitution (no change)
- Writing in consonants only (45.6% - 49% decrease, depending on whether v and j are treated as vowels)
- Figurate expressions (impractical to test, but likely to increase bits per line)
- Exotic alphabets (no change, same as simple substitution)
- Shorthand (impractical to test, but likely to decrease bits per line)
- Abbreviations (impractical to test, but certain to decrease bits per line)
- Word substitutions (did not test, but likely to cause moderate increase or decrease to bits per line)
- Homophones for vowels (increase bits per line, but the exact difference depends on the number of homophones per vowel. With two homophones for each vowel, there was a 19.5% increase)
- Nulls (increase bits per line, but the exact difference depends on the number of distinct nulls used and the number of nulls inserted per line)
- Homophones for consonants (increase bits per line, but the exact difference depends on the number of homophones per consonant)
- Nomenclators (impact depends on the type of nomenclator. I tested with a large nomenclator and got a 44.5% decrease in bits per line)
Of these methods, only two come close to producing the roughly 50% decrease needed to account for the information carried by the VM's lines:
- Writing in consonants only
- Using a large nomenclator
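For what it's worth, here is a rough Python sketch of how one of these simulations could be run, using the pair-based formula for Pbc (defined in the post below) and charging the first character of each line its unigram information; the exact treatment of line starts, punctuation and casing is my assumption, so the percentages will only approximate the figures above.

import math
from collections import Counter

def bits_per_line(lines):
    """Average bits per line: the first character of a line is charged its
    unigram information (Sc), every later character the incremental
    information it adds after its predecessor (Pbc)."""
    text = "\n".join(lines)
    unigrams = Counter(text)
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = len(text)
    line_bits = []
    for line in lines:
        if not line:
            continue
        bits = math.log2(total / unigrams[line[0]])
        for b, c in zip(line, line[1:]):
            bits += math.log2(unigrams[b] / bigrams[b + c])
        line_bits.append(bits)
    return sum(line_bits) / len(line_bits)

def consonants_only(line, vowels="aeiouv"):
    # "Writing in consonants only", here treating v as a vowel.
    return "".join(ch for ch in line if ch not in vowels)

# aeneid_lines is assumed to hold one lower-cased, punctuation-stripped
# verse per line.
# before = bits_per_line(aeneid_lines)
# after = bits_per_line([consonants_only(l) for l in aeneid_lines])
# print(f"{100 * (after - before) / before:+.1f}% change in bits per line")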
Monday, July 26, 2021
Entropy in Voynichese
It has often been observed that Voynich characters have relatively low entropy (cf. this discussion on René Zandbergen's site). This is a serious problem for the proposal I made in my last post, where I suggested that page 81R of the Voynich Manuscript might contain a poem in Latin dactylic hexameter.
Suppose you calculate the bits of information conveyed by a character c of a text T using a formula like the following:
Sc = (ln(fT) - ln(fc)) / ln(2)
where
Sc is the number of bits conveyed by the single character c
fT is the number of characters in the text
fc is the number of times the character c appears in the text
Using this formula we find that the lines on 81R carry, on average, 121.4 bits of information. In contrast, lines of the Aeneid carry an average of 156.1 bits of information. This is a real problem, which becomes even more severe if you look at the incremental information conveyed by the second character in a pair. That is, for a character c appearing immediately after a character b:
Pbc = (ln(fb) - ln(fbc)) / ln(2)
where
Pbc is the number of bits conveyed by character c when it appears in the pair bc
fb is the number of times the character b appears in the text
fbc is the number of times the sequence bc appears in the text, which may also be expressed as the number of times that the character c appears immediately after b.
This second approach to measuring information tells us, for example, that the character "u" in a Latin text conveys no additional information when it follows "q". Since the total frequency of "qu" is the same as the total frequency of "q", the numerator is zero, and the total bits are likewise zero.
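Both formulas are easy to compute directly from character and character-pair counts. Here is a small Python illustration on a toy string (the snippet is mine; in a real text "u" would of course also appear after letters other than "q"):

import math
from collections import Counter

text = "quid quaeris quantum"  # toy snippet; spaces count as characters here
unigrams = Counter(text)
bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))

def sc(c):
    """Sc: bits conveyed by the single character c."""
    return math.log2(len(text) / unigrams[c])

def pbc(b, c):
    """Pbc: bits conveyed by c when it appears immediately after b."""
    return math.log2(unigrams[b] / bigrams[b + c])

print(sc("u"))        # about 2.3 bits on its own
print(pbc("q", "u"))  # 0.0, because every "q" in the text is followed by "u"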
Saturday, July 24, 2021
Linguistic Information from Voynich 81R
Page 81R of the Voynich Manuscript has a block of text with an interesting property that is different from other text in the VM. While most lines of text in the VM continue to a page margin or the boundary of an image, the text on 81R is ragged on the right side. In other texts, both printed and manuscript, this type of raggedness can be a property of poetry, wherein the breaks between lines are guided by metrical considerations rather than the need to use space on the page efficiently. Nick Pelling has a post that digs into this page, and he notes that the poem-like layout of this page was observed by Gabriel Landini on the Voynich mailing list in 1996.
So, if 81R contains a poem, then what kind of information could we derive from it?
Generally speaking, a line of poetry is broken down into feet, and feet have some relationship to syllables, though the exact nature of that relationship varies. In Latin and Greek dactylic hexameter, for example, a foot is made of two poetically long syllables (a spondee) or else a long syllable and two short ones (a dactyl), and there are six feet per line. In iambic pentameter a foot is made of one unstressed and one stressed syllable (an iamb) and there are five feet per line. Other styles of poetry use other definitions of feet and other numbers of feet per line.
Whatever definition there is to a foot and a line, however, there is going to be some natural relationship between the length of the line and the number of syllables in it. For a given language and a given metrical form, that will lead to a certain average number of words per line, with a certain standard deviation.
These values are different for different languages and metrical forms. In the graph below, I have taken multiple 31-line samples from five epic poems and graphed the average number of words per line (x axis) against the standard deviation in number of words per line (y axis).
In the graph above, you can see that each of these epic poems has a different average number of words per line. Chaucer is by far the highest, while the Serbian epic poem Strahinja Banović is at the extreme other end.
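The statistics behind a graph like this are simple to compute. Here is a rough Python sketch; whether the samples should be consecutive, non-overlapping 31-line windows (as assumed here, to match the 31 lines on 81R) or drawn some other way is my guess.

import statistics

def words_per_line_stats(lines, sample_size=31):
    """Mean and standard deviation of words per line for consecutive,
    non-overlapping 31-line samples of a poem."""
    samples = []
    for start in range(0, len(lines) - sample_size + 1, sample_size):
        counts = [len(line.split()) for line in lines[start:start + sample_size]]
        samples.append((statistics.mean(counts), statistics.stdev(counts)))
    return samples

# aeneid_lines is assumed to be a list of verse strings, one per line.
# for mean_wpl, sd_wpl in words_per_line_stats(aeneid_lines):
#     print(f"{mean_wpl:.2f} words per line, sd {sd_wpl:.2f}")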
If the number of words in a line of Voynich text is equal to the number of words in a line of the underlying plain text, and the text on 81R is a poem, then where does it fall on this graph?
There is not universal agreement on where all of the word breaks are on 81R, so we get a range of answers, but the range is narrow and the result is fairly clear. Among the five sample epic poems, the most similar in this respect is the Aeneid. The three red bubbles in the graph below demonstrate the range of values for page 81R, and the blue bubbles are the sample values from the Aeneid.
The next most similar poem is the Anglo-Norman Voyage de Brendan, the right edge of which touches the left edge of the Voynich range.
This suggests the possibility that 81R is written in Latin dactylic hexameter, or else possibly something like the Anglo-Norman octosyllables of Voyage de Brendan.
The argument for Latin dactylic hexameter is strengthened over something like Old French by the fact that there are 31 lines on 81R. Old French poetry (like Middle English) was built on rhyming couplets, and to have an odd number of lines would mean having a line dangling at the end with no rhyme.
Of course, the VM is not a simple substitution cipher, and it's always possible that a Voynichese word does not correspond to a plaintext word, but this is a direction I will hopefully expand on more in my next post.
Wednesday, July 21, 2021
A domain-specific language for representing morphology
Whenever I learn a new language, I instinctively want to model the morphology in code. It's inefficient to write grammars in generic programming languages, though, and that's where I always get stuck.
This month I developed a domain-specific language for representing morphology. The interpreter is written in Javascript, but could easily be rewritten in almost any other language.
A project in this language starts out with a declaration of the types of graphemes used in the language. (It works on the level of graphemes instead of phonemes, but phonemic systems are a subset of graphemic systems, so there is nothing lost by doing it this way.)
Here is an example, which defines vowels (V) and consonants (C) in a system with five vowels, phonemic vowel length, and certain digraphs (such as 'hw', 'qu', 'hl').
classes: {
V: '[aeiouáéíóú]',
C: '[ghn]w|[hnrt]y|qu|h[rl]|[bdfghklmnpqrstvwy]'
}
The values on the left are identifiers (V, C) and the values on the right are regular expressions.
These identifiers can then be used in transformations like the following, which will append -n to a word ending in a vowel, or -en to a word ending in a consonant.
append_n: [
'* V -> * V n',
'* C -> * C en'
]
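To show how such a rule might be interpreted, here is a rough Python sketch (my guess at the semantics, not the actual Javascript interpreter): '*' matches any string, a class name matches one grapheme of that class, and captured pieces are reused in order on the right-hand side, with anything else copied literally.

import re

# Grapheme classes copied from the declaration above.
classes = {
    "V": "[aeiouáéíóú]",
    "C": "[ghn]w|[hnrt]y|qu|h[rl]|[bdfghklmnpqrstvwy]",
}

def apply_rule(rule, word):
    """Apply one 'pattern -> replacement' rule to a word, or return None."""
    pattern, replacement = (side.split() for side in rule.split("->"))
    regex = "".join("(.*?)" if t == "*" else f"({classes[t]})" for t in pattern)
    match = re.fullmatch(regex, word)
    if not match:
        return None
    captured = list(match.groups())
    # Reuse captured groups in order; copy literal graphemes (n, en) as-is.
    return "".join(captured.pop(0) if t == "*" or t in classes else t
                   for t in replacement)

def append_n(word):
    for rule in ["* V -> * V n", "* C -> * C en"]:
        result = apply_rule(rule, word)
        if result is not None:
            return result
    return word

print(append_n("lindo"))  # lindon (hypothetical word ending in a vowel)
print(append_n("adan"))   # adanen (hypothetical word ending in a consonant)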