Sunday, September 19, 2021

I was Wrong about F81R
How Line Breaks and Word Breaks Behave in Currier A and B

This is going to be a long and boring post, so here's the summary:
  • Line breaks in the VM do not act like line breaks in a natural text, in that they do not provide evidence of whether the text runs left-to-right or right-to-left.
  • Word breaks in Currier A act like word breaks in a natural text, but in Currier B they do not.
  • Since my analysis of F81R as a poem was based on the assumption that line breaks and word breaks were natural, yet they turn out not to be natural at all, there remains nothing to support the idea that this page contains a poem.
Here are the tests I conducted whose results led me to that conclusion.

1. Direction of Text

Question: Text in the VM is laid out on the page in a way that suggests left-to-right text, but does the content of the text support that? How do we know the layout of the text isn't intentionally misleading?

Test: In a traditional European text, line breaks are governed by the width of the text column, and have an arbitrary relationship to the underlying text. Therefore we should expect that high frequency pairs of words W1 and W2 will occasionally be broken across lines, so W1 will appear on one end of one line and W2 will appear on the other end of the next line. If W1 appears at the right end of one line and W2 appears at the left end of the next line, then the text behaves like a left-to-right text. If they appear on the left and right ends, respectively, then it behaves like a right-to-left text.
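A minimal Python sketch of this test follows. It assumes the text is supplied as a list of lines with words separated by spaces, and that repeated pairs are collected from words adjacent within a line. The function name `count_direction_breaks` and the `min_count` threshold are illustrative, not taken from the original analysis.

```python
from collections import Counter

def count_direction_breaks(lines, min_count=2):
    """Count line breaks that split a repeated word pair (W1, W2).

    Returns (ltr, rtl):
      ltr: W1 ends one line and W2 starts the next line.
      rtl: W1 starts one line and W2 ends the next line.
    """
    # Assumption: blank lines carry no information and are skipped.
    rows = [line.split() for line in lines if line.split()]

    # Collect pairs of adjacent words inside each line.
    pair_counts = Counter()
    for words in rows:
        pair_counts.update(zip(words, words[1:]))
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}

    ltr = rtl = 0
    for upper, lower in zip(rows, rows[1:]):
        if (upper[-1], lower[0]) in frequent:
            ltr += 1
        if (upper[0], lower[-1]) in frequent:
            rtl += 1
    return ltr, rtl
```

Calling `count_direction_breaks(lines)` on a transcription gives the two counts that are compared in the results below.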

Demonstration: I applied the test to De natura rerum ad Sisebutum regem liber, by Isidorus Hispalensis Episcopus, which is roughly the size of the Currier A section of the VM. The sample text contained 366 distinct pairs of words that were repeated at least twice, for a total of 966 instances of repeated pairs. In 68 cases a pair was found broken across lines in a way that indicated left-to-right text; in 13 cases it was found broken in a way that indicated right-to-left text.

Conclusion: With more than five times as many left-to-right breaks, the evidence pointed strongly to a left-to-right text, as was expected.

Currier A: I found 491 distinct pairs repeated at least twice, for a total of 1389 instances of repeated pairs. In 32 cases a pair was broken across lines in a way that indicated left-to-right text; in 47 cases it was found broken in a way that indicated right-to-left text.

Conclusion: The number of left-to-right breaks is not significantly different from the number of right-to-left breaks. This is not obviously a natural text running in either direction.

Currier B: I found 1701 distinct pairs repeated at least twice, for a total of 5313 instances of repeated pairs. In 69 cases a pair was broken across lines in a way that indicated left-to-right text; in 94 cases it was found broken in a way that indicated right-to-left text.

Conclusion: The number of left-to-right breaks is not significantly different from the number of right-to-left breaks. This is not obviously a natural text running in either direction.
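The post does not say which significance test lies behind these conclusions. One simple option is an exact two-sided binomial (sign) test, which asks how likely a split at least this lopsided would be if both directions were equally probable. The sketch below is only that assumption made concrete; the function name is illustrative.

```python
from math import comb

def two_sided_binomial_p(k, n):
    """Exact two-sided p-value for k successes in n trials with p = 0.5."""
    pmf = [comb(n, i) / 2**n for i in range(n + 1)]
    observed = pmf[k]
    # Sum the probability of every outcome at least as unlikely as the observed one.
    total = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, total)

# Left-to-right and right-to-left break counts reported above.
for label, ltr, rtl in [("Isidore", 68, 13), ("Currier A", 32, 47), ("Currier B", 69, 94)]:
    print(label, two_sided_binomial_p(ltr, ltr + rtl))
```

A small p-value means the two counts are unlikely to be a chance split of equally probable outcomes.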

2. Word Breaks

Question: Text in the VM appears to be broken into words by spaces, but do these spaces really act like word breaks within the text?

Test: Word breaks should divide the text into a relatively productive lexicon. A productive lexicon is one that can produce the text in question with a relatively small number of words used at relatively high frequencies. We should find that true word breaks divide the text into a productive lexicon better than any other character in the text.

Treat each character in the text as a potential word-break character and measure the frequency of the most frequent word in the resulting lexicon. Use that frequency as a proxy for the productivity of the lexicon. If the word break character results in the most productive lexicon, then it acts like a true word break.
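The scoring procedure can be sketched in a few lines of Python. This assumes the text has been flattened into a single string, with line breaks already converted to the ordinary word-break character; `separator_scores` is an illustrative name.

```python
from collections import Counter

def separator_scores(text):
    """For each character, split the text on it and score the resulting lexicon.

    The score is the count of the most frequent token, used as a proxy
    for how productive the resulting lexicon is.
    """
    scores = {}
    for ch in set(text):
        tokens = [t for t in text.split(ch) if t]  # drop empty tokens
        if tokens:
            scores[ch] = Counter(tokens).most_common(1)[0][1]
    return scores
```

Sorting the result, for example with `sorted(scores.items(), key=lambda kv: -kv[1])`, ranks the candidate separators against each other.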

Demonstration: I applied the test to De natura rerum ad Sisebutum regem liber. The word break character resulted in a score of 320, while the next best character (s) resulted in a score of 184.

Conclusion: The lexicon created by the word break character is nearly twice as productive as the next best candidate. The word break character acts like a true word break, as expected.

Currier A: The word break character resulted in a score of 512, while the next best character (o) resulted in a score of 266.

Conclusion: The lexicon created by the word break character in Currier A is nearly twice as productive as the next best candidate. The word break character acts like a true word break.

Currier B: The word break character resulted in a score of 499, but the character producing the most productive lexicon was actually 'e', which yielded a score of 514. The character 'a' was third in rank, with a score of 482.

Conclusion: The lexicon created by the word break character in Currier B is not significantly more productive than the lexicon created by other high-frequency characters. In Currier B, the word break character does not act like a word break.

Thursday, September 16, 2021

Currier B aiin and Latin in

In the Latin ISE corpus, the word 'in' is the second most frequent word, and it would be surprising if this word were not among the top ten words of any Latin text of significant length. The word 'in' also has the property that it is rarely followed by another high-frequency word. The reason for this is that 'in' is a preposition, and is therefore usually followed by a noun with high semantic content, and those words are generally lower in frequency than function words.

Despite the fact that it is rarely followed by another high-frequency word, 'in' is commonly preceded by another high-frequency word, particularly 'et', 'est' or 'ut'. This can be seen in the frequencies by which the top ten most frequent words appear together:


In Currier B the word 'aiin' has similar properties. It is the fourth most frequent word, and has the property that it is rarely followed by another high-frequency word, but is commonly preceded by one. This can be seen in the frequencies by which the top ten most frequent words appear together:

The word 'in' appears not only in Latin, but also in Tuscan and Spanish, though with somewhat lower frequency. In Dante's Divina Commedia, for example, it is the 11th most common word. I assume the drop in frequency between Latin and Tuscan was due to the loss of case markers on nouns, which would have required a corresponding increase in the number of prepositions (since otherwise distinctions such as in urbe / in urbem were lost).

The situation with Latin and Tuscan could be compared to the situation with Currier B and Currier A. The word 'aiin' also appears in Currier A, though with a lower frequency, being the 17th most common word.
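The preceded-by and followed-by counts among the top-ranked words, as described in this post, could be tabulated along the following lines. This is a sketch that assumes the text is already a flat list of words in reading order; the function names are illustrative.

```python
from collections import Counter

def top_word_adjacency(words, n=10):
    """Count adjacent pairs in which both words are among the n most frequent."""
    top = [w for w, _ in Counter(words).most_common(n)]
    top_set = set(top)
    pairs = Counter(
        (a, b) for a, b in zip(words, words[1:])
        if a in top_set and b in top_set
    )
    return top, pairs

def preceded_and_followed(words, target, n=10):
    """How often target is preceded / followed by another top-n word."""
    _, pairs = top_word_adjacency(words, n)
    preceded = sum(c for (a, b), c in pairs.items() if b == target)
    followed = sum(c for (a, b), c in pairs.items() if a == target)
    return preceded, followed
```

For a word that behaves like Latin 'in', the first number should be clearly larger than the second.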

Wednesday, September 15, 2021

qokeedy qokeedy

In this post I'll look at similarities between high-frequency qok- words in Currier B and high-frequency qu- words in Latin.

1. Textual Frequency

The prefix qu- is the most frequent two-letter prefix in the Latin ISE corpus, and the prefix qok- is the most frequent three-letter prefix in Currier B.

2. Zipf Rank

The most frequent qok- words in Currier B occupy similar Zipf ranks to the most frequent qu- words in Latin (though the Currier B words have a tendency to have lower Zipf ranks).


3. Reduplication

Some of the qu- words in Latin may be reduplicated, as may some of the qok- words in Currier B:









Sunday, September 12, 2021

Lexical vs. Textual Frequency

There are two ways you can look at the frequency of a letter or sequence of letters: One is the frequency of the letter in a text (textual frequency, the way we usually look at it); the other is the frequency of the letter in the lexicon (lexical frequency).

Since there is no reason for the frequency of a word to have any connection to the letters in it, we would expect to find that the relationship between textual frequency and lexical frequency is roughly linear. And, generally speaking, this is the case. However, there are usually a small number of outliers.
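The two measures are easy to compute side by side. The sketch below assumes a flat list of words and takes the first `k` letters of each word as its prefix; the name `prefix_frequencies` is illustrative.

```python
from collections import Counter

def prefix_frequencies(words, k=2):
    """Return (textual, lexical) frequency counters for k-letter prefixes.

    textual: counts every occurrence of a word in the text (tokens).
    lexical: counts each distinct word once (types).
    """
    textual = Counter(w[:k] for w in words if len(w) >= k)
    lexical = Counter(w[:k] for w in set(words) if len(w) >= k)
    return textual, lexical
```

Plotting one counter against the other for every prefix should give the roughly linear relationship, with prefixes far from the trend line being the outliers.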

For example, looking at initial pairs of letters in Dante's Divina Commedia, we find two initial pairs that stand out:


The prefix ch- appears in the high-frequency word che (and its contracted form ch'), which drives up its textual frequency relative to its lexical frequency. The prefix ri- is a derivational prefix used to create a relatively large number of words of low frequency, driving up its lexical frequency relative to its textual frequency.

Latin has a different (but etymologically related) outlier:

The prefix qu- in Latin appears in such relatively high-frequency words as qui, quod, quae, quam, quid, quo, quem and quoque, which drives up its textual frequency relative to its lexical frequency.

Currier A and B have different patterns from each other:
Currier A has the high-frequency words daiin and dain, which raises the textual frequency of the prefix da- relative to its lexical frequency.

Currier B has the high-frequency words qokeedy, qokain, qokedy and qokeey, which raises the textual frequency of the prefix qo- relative to its lexical frequency.

In the case of the VM there may be multiple reasons for these outliers. There are probably lexical anomalies in the underlying language, but then the cipher itself could introduce its own odd behaviors through the use of homophones, multi-letter symbols, and so forth.

Tuesday, September 7, 2021

Latin Contractions

My efforts to get a copy of The Curse of the Voynich are themselves apparently cursed. The first time I tried to order this book I was at a vacation rental for a month, and discovered that the postal service would not deliver it because the rental had no mailbox. The second time my order was canceled because the book was out of stock. I am hopeful that my third effort will meet with a better outcome.

While I wait for it to arrive, however, I've been looking at one of Nick Pelling's ideas. Did the Voynich cipher employ contraction and abbreviation as part of its process? If so, it seems like this could explain the relatively low amount of information conveyed by Voynichese words. It would be a lossy compression process similar to the removal of vowels, but perhaps more culturally appropriate to the 15th century.

I looked at the 1901 German translation of Adriano Cappelli's Lexicon abbreviaturarum, and it seems that conventions for contraction and abbreviation evolved over time such that by the 14th or 15th centuries scribes were using a number of methods in conjunction, including the use of a small set of symbols borrowed from Tironian notes. In order to understand these processes better, I took thirty random entries from the lexicon and looked at what the scribes chose to keep from the full written word and what they felt they were able to do away with. In general, I found that words could be divided into three parts:

Prefix: The prefix is made of consecutive letters from the start of the word, including at minimum the first letter. In my samples, the prefix is one character long about 53% of the time, two characters about 23% of the time, and three characters about 6% of the time.

Infix: The infix is made of letters that are generally not consecutive, chosen from the middle of the word. Presumably these are letters that differentiate between one contracted word and another. There is roughly a 12% chance that a given letter from the middle of the word will appear in the infix.

Suffix: The suffix is made of consecutive letters from the end of the word, except the -m of accusative endings, which is sometimes dropped. The last letter was included in the suffix about 63% of the time, the second-to-last about 30% of the time, the third-to-last about 6% of the time.
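To get a feel for how lossy such a scheme is, the three-part description can be turned into a toy random model. Everything beyond the quoted percentages is an assumption: the prefix lengths are drawn in proportion to the three stated figures (which do not sum to 100%), the suffix figures are read as the chance that the suffix reaches back at least one, two or three letters, and the dropped accusative -m is ignored. The name `contract` is illustrative.

```python
import random

PREFIX_WEIGHTS = {1: 0.53, 2: 0.23, 3: 0.06}  # relative weights for prefix length
SUFFIX_AT_LEAST = [0.63, 0.30, 0.06]          # chance the suffix has >= 1, 2, 3 letters
INFIX_KEEP = 0.12                             # chance a middle letter is kept

def contract(word, rng):
    """Return one random contraction of word under the toy model."""
    n = len(word)
    # Prefix: always keeps at least the first letter.
    p_len = rng.choices(list(PREFIX_WEIGHTS), weights=list(PREFIX_WEIGHTS.values()))[0]
    p_len = min(p_len, n)
    # Suffix: consecutive letters from the end, possibly none.
    u = rng.random()
    s_len = sum(1 for p in SUFFIX_AT_LEAST if u < p)
    s_len = min(s_len, n - p_len)
    # Infix: each remaining middle letter is kept independently.
    middle = word[p_len:n - s_len]
    infix = "".join(c for c in middle if rng.random() < INFIX_KEEP)
    return word[:p_len] + infix + word[n - s_len:]

rng = random.Random(0)
print([contract("dominus", rng) for _ in range(5)])
```

Whatever the random draws, the output always begins with the first letter of the input, which is the property the next paragraph relies on.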

What is interesting about this, to me, is that the first letter of each word is always retained. That means, if the Voynich cipher employs abbreviations and/or contractions, and the subsequent steps are only forms of substitution (and not, for example, transposition), then it might be possible to crack the first letters of Voynichese words.

It would be hard to know if you had gotten it right, though!