Monday, July 26, 2021

Entropy in Voynichese

It has often been observed that Voynich characters have relatively low entropy (cf. this discussion on René Zandbergen's site). This is a serious problem for the proposal I made in my last post, where I suggested that page 81R of the Voynich Manuscript might contain a poem in Latin dactylic hexameter.

Suppose you calculate the bits of information conveyed by a character c of a text T using a formula like the following:

Sc = (ln(fT) - ln(fc)) / ln(2)

where

Sc is the number of bits conveyed by the single character c

fT is the number of characters in the text

fc is the number of times the character c appears in the text
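As a rough sketch, the formula above could be written in JavaScript as follows. The function names are illustrative only. The sketch also assumes that every character (spaces included) is counted, and that the bits for a line are the sum of the bits of its characters; neither detail is spelled out above.

```javascript
// Build character counts for a whole text (assumption: every character,
// including spaces, is counted).
function buildModel(text) {
  const counts = new Map();
  let total = 0;
  for (const ch of text) {
    counts.set(ch, (counts.get(ch) || 0) + 1);
    total += 1;
  }
  return { counts, total };
}

// Sc = (ln(fT) - ln(fc)) / ln(2), for a character c that occurs in the text.
function bitsForChar(c, model) {
  return (Math.log(model.total) - Math.log(model.counts.get(c))) / Math.LN2;
}

// Assumption: the bits for a line are the sum of Sc over its characters.
function bitsForLine(line, model) {
  let bits = 0;
  for (const ch of line) {
    bits += bitsForChar(ch, model);
  }
  return bits;
}

// Toy text: 'b' is 1 character out of 4, so it carries about 2 bits.
const toyModel = buildModel('aaab');
console.log(bitsForChar('b', toyModel));
```

In practice the model would be built from the whole text (all of 81R, or the whole Aeneid sample) and bitsForLine would then be averaged over the lines.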

Using this formula we find that the lines on 81R carry, on average, 121.4 bits of information. In contrast, lines of the Aeneid carry an average of 156.1 bits of information. This is a real problem, which becomes even more severe if you look at the incremental information conveyed by the second character in a pair. That is, for a character c appearing immediately after a character b:

Pbc = (ln(fb) - ln(fbc)) / ln(2)

where

Pbc is the number of bits conveyed by character c when it appears in the pair bc

fb is the number of times the character b appears in the text

fbc is the number of times the sequence bc appears in the text, which may also be expressed as the number of times that the character c appears immediately after b.

This second approach to measuring information tells us, for example, that the character "u" in a Latin text conveys no additional information when it follows "q". Since the total frequency of "qu" is the same as the total frequency of "q", the numerator is zero, and the total number of bits is likewise zero.
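The pair measure can be sketched the same way. Again the names are illustrative, and the sketch assumes pairs are counted across word boundaries (spaces are treated as ordinary characters), which may not match how the figures in this post were produced.

```javascript
// Count single characters and adjacent character pairs in a text.
function countPairs(text) {
  const chars = Array.from(text);
  const single = new Map();
  const pair = new Map();
  for (let i = 0; i < chars.length; i++) {
    single.set(chars[i], (single.get(chars[i]) || 0) + 1);
    if (i + 1 < chars.length) {
      const key = chars[i] + chars[i + 1];
      pair.set(key, (pair.get(key) || 0) + 1);
    }
  }
  return { single, pair };
}

// Pbc = (ln(fb) - ln(fbc)) / ln(2), for a pair bc that occurs in the text.
function bitsForPair(b, c, counts) {
  const fb = counts.single.get(b);
  const fbc = counts.pair.get(b + c);
  return (Math.log(fb) - Math.log(fbc)) / Math.LN2;
}

// In this toy text every "q" is followed by "u", so fq equals fqu.
const latinCounts = countPairs('quis quid aqua');
console.log(bitsForPair('q', 'u', latinCounts)); // 0
console.log(bitsForPair('u', 'i', latinCounts)); // greater than 0: "u" is not always followed by "i"
```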

When you apply this measure to the lines on 81R and the Aeneid, the average amount of information conveyed by the lines of 81R drops to 66.9 bits, while the information conveyed by the average line of the Aeneid drops only to 128.5 bits.

This is a serious challenge to the idea that the plaintext on 81R is a Latin poem in dactylic hexameter, because it suggests that these lines simply don't contain enough information to encode such a poem. In my next post I will look at historically and culturally plausible enciphering schemes that could produce this effect.

Saturday, July 24, 2021

Linguistic Information from Voynich 81R

Page 81R of the Voynich Manuscript has a block of text with an interesting property that is different from other text in the VM. While most lines of text in the VM continue to a page margin or the boundary of an image, the text on 81R is ragged on the right side. In other texts, both printed and manuscript, this type of raggedness can be a property of poetry, wherein the breaks between lines are guided by metrical considerations rather than the need to use space on the page efficiently. Nick Pelling has a post that digs into this page, and he notes that the poem-like layout of this page was observed by Gabriel Landini on the Voynich mailing list in 1996.

So, if 81R contains a poem, then what kind of information could we derive from it? 

Generally speaking, a line of poetry is broken down into feet, and feet have some relationship to syllables, though the exact nature of that relationship varies. In Latin and Greek dactylic hexameter, for example, a foot is made of two poetically long syllables (a spondee) or else a long syllable and two short ones (a dactyl), and there are six feet per line. In iambic pentameter a foot is made of one unstressed and one stressed syllable (an iamb) and there are five feet per line. Other styles of poetry use other definitions of feet and other numbers of feet per line.

Whatever definition there is to a foot and a line, however, there is going to be some natural relationship between the length of the line and the number of syllables in it. For a given language and a given metrical form, that will lead to a certain average number of words per line, with a certain standard deviation.

These values are different for different languages and metrical forms. In the graph below, I have taken multiple 31-line samples from five epic poems and graphed the average number of words per line (x axis) against the standard deviation in number of words per line (y axis).


In the graph above, you can see that each of these epic poems has a different average number of words per line. Chaucer has by far the highest average, while the Serbian epic poem Strahinja Banović is at the other extreme.
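The two values plotted for each sample could be computed along the following lines. This is only a sketch under some assumptions of mine: words are split on whitespace, samples are consecutive non-overlapping blocks of 31 lines, and the standard deviation is the population form (dividing by n rather than n - 1). The function names are made up for the example.

```javascript
// Number of whitespace-separated words in one line.
function wordCount(line) {
  const trimmed = line.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

// Mean and (population) standard deviation of words per line for one sample.
function lineStats(lines) {
  const counts = lines.map(wordCount);
  const mean = counts.reduce((sum, n) => sum + n, 0) / counts.length;
  const variance =
    counts.reduce((sum, n) => sum + (n - mean) ** 2, 0) / counts.length;
  return { mean, sd: Math.sqrt(variance) };
}

// Cut a poem into consecutive samples of `size` lines (31 to match 81R).
// Leftover lines at the end that do not fill a whole sample are dropped.
function sampleLines(lines, size = 31) {
  const out = [];
  for (let i = 0; i + size <= lines.length; i += size) {
    out.push(lines.slice(i, i + size));
  }
  return out;
}

// Two lines of the Aeneid: 3 words and 5 words, so mean 4 and sd 1.
console.log(lineStats(['arma virumque cano', 'Troiae qui primus ab oris']));
```

Each sample then yields one (mean, sd) point per poem, which is what the bubbles in the graph represent.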

If the number of words in a line of Voynich text is equal to the number of words in a line of the underlying plain text, and the text on 81R is a poem, then where does it fall on this graph?

There is not universal agreement on where all of the wordbreaks are on 81R, so we have a range of answers, but it is a relatively narrow range, and the answer is relatively clear. Among the five sample epic poems, the most similar in this respect is the Aeneid. The three red bubbles in the graph below demonstrate the range of values for page 81R, and the blue bubbles are the sample values from the Aeneid.

The next most similar poem is the Anglo-Norman Voyage de Brendan, the right edge of which touches the left edge of the Voynich range:

This suggests the possibility that 81R is written in Latin dactylic hexameter, or else possibly something like the Anglo-Norman octosyllables of Voyage de Brendan.

The argument for Latin dactylic hexameter is strengthened over something like Old French by the fact that there are 31 lines on 81R. Old French poetry (like Middle English) was built on rhyming couplets, and to have an odd number of lines would mean having a line dangling at the end with no rhyme.

Of course, the VM is not a simple substitution cipher, and it's always possible that a Voynichese word does not correspond to a plaintext word, but this is a direction I will hopefully expand on more in my next post.

Wednesday, July 21, 2021

A domain-specific language for representing morphology

Whenever I learn a new language, I instinctively want to model the morphology in code. It's inefficient to write grammars in generic programming languages, though, and that's where I always get stuck.

This month I developed a domain-specific language for representing morphology. The interpreter is written in JavaScript, but could easily be rewritten in almost any other language.

A project in this language starts out with a declaration of the types of graphemes used in the language. (It works on the level of graphemes instead of phonemes, but phonemic systems are a subset of graphemic systems, so there is nothing lost by doing it this way.)

Here is an example, which defines vowels (V) and consonants (C) in a system with five vowels, phonemic vowel length, and certain digraphs (such as 'hw', 'qu', 'hl').

  classes: {
    V: '[aeiouáéíóú]',
    C: '[ghn]w|[hnrt]y|qu|h[rl]|[bdfghklmnpqrstvwy]'
  }

The values on the left are identifiers (V, C) and the values on the right are regular expressions.

These identifiers can then be used in transformations like the following, which will append -n to a word ending in a vowel, or -en to a word ending in a consonant.

    append_n: [
      '* V -> * V n',
      '* C -> * C en'
    ]

This transformation is composed of two rules, which are turned into regular expressions like the following:

/(^.*)([aeiouáéíóú])$/
/(^.*)([ghn]w|[hnrt]y|qu|h[rl]|[bdfghklmnpqrstvwy])$/

Each of the rules also has a map describing the way that the parts of the input are transformed into an output, like this:

[1, 2, "n"]
[1, 2, "en"]

A candidate word, such as 'arat', is tested against each regular expression from top to bottom. In this case, it will be matched against the second expression, and the match results will look like this:

["arat", "ara", "t"]

The transformation will then assemble the answer using the mapping [1, 2, "en"]. The numbers in the mapping refer to elements in the zero-indexed match results, so the result will be "ara" + "t" + "en" = "araten".
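To make the forward application concrete, here is a minimal sketch of how a rule string might be compiled and applied. It is not the actual interpreter: compileRule and applyTransformation are hypothetical names. The sketch assumes that left-hand tokens are either * or a class name, and that a right-hand token which also appears on the left refers to that capture group.

```javascript
const classes = {
  V: '[aeiouáéíóú]',
  C: '[ghn]w|[hnrt]y|qu|h[rl]|[bdfghklmnpqrstvwy]'
};

// Compile one rule such as '* V -> * V n' into { regex, map }.
function compileRule(rule, classes) {
  const [lhs, rhs] = rule.split('->').map(s => s.trim().split(/\s+/));
  // Each left-hand token becomes one capture group. The ^ and $ anchors sit
  // outside the groups so that they apply to every alternative in a class.
  const groups = lhs.map(tok => (tok === '*' ? '(.*)' : `(${classes[tok]})`));
  const regex = new RegExp('^' + groups.join('') + '$');
  // Right-hand tokens that also appear on the left become group numbers;
  // anything else is kept as a literal string.
  const map = rhs.map(tok => {
    const i = lhs.indexOf(tok);
    return i >= 0 ? i + 1 : tok;
  });
  return { regex, map };
}

// Apply the first rule that matches; return null when no rule matches.
function applyTransformation(rules, word) {
  for (const { regex, map } of rules) {
    const m = word.match(regex);
    if (m) {
      return map.map(part => (typeof part === 'number' ? m[part] : part)).join('');
    }
  }
  return null;
}

const appendN = ['* V -> * V n', '* C -> * C en'].map(r => compileRule(r, classes));
console.log(applyTransformation(appendN, 'arat')); // "araten"
console.log(applyTransformation(appendN, 'ara'));  // "aran"
```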

In addition to preparing regular expressions and mappings for applying transformations, the system also prepares reversing versions. In this case, we have the following reverse expressions and mappings:

/(^.*)([aeiouáéíóú])(n$)/
/(^.*)([ghn]w|[hnrt]y|qu|h[rl]|[bdfghklmnpqrstvwy])(en$)/

[1, 2]
[1, 2]

In reverse application, instead of using only the first rule that matches, the system applies every rule that matches and returns an array of answers. So, if we reverse "araten" then we will match against both rules, and get the answers ["arat", "arate"].
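Reverse application can be sketched in the same spirit. As before, this is an illustration rather than the real interpreter: compileReverse and reverseTransformation are hypothetical names, and the sketch assumes that every left-hand token also appears on the right-hand side of the rule.

```javascript
const classes = {
  V: '[aeiouáéíóú]',
  C: '[ghn]w|[hnrt]y|qu|h[rl]|[bdfghklmnpqrstvwy]'
};

// Escape a literal so it can be embedded in a regular expression.
function escapeRegex(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build the reversing regex and map for a rule such as '* C -> * C en'.
function compileReverse(rule, classes) {
  const [lhs, rhs] = rule.split('->').map(s => s.trim().split(/\s+/));
  // The reverse regex is built from the right-hand (output) side of the rule.
  const groups = rhs.map(tok => {
    if (tok === '*') return '(.*)';
    if (Object.prototype.hasOwnProperty.call(classes, tok)) return `(${classes[tok]})`;
    return `(${escapeRegex(tok)})`; // literal suffix such as "en"
  });
  const regex = new RegExp('^' + groups.join('') + '$');
  // The stem is rebuilt from the groups that correspond to left-hand tokens.
  const map = lhs.map(tok => rhs.indexOf(tok) + 1);
  return { regex, map };
}

// Apply every rule that matches and collect all candidate stems.
function reverseTransformation(rules, word) {
  const stems = [];
  for (const { regex, map } of rules) {
    const m = word.match(regex);
    if (m) stems.push(map.map(i => m[i]).join(''));
  }
  return stems;
}

const appendNReverse = ['* V -> * V n', '* C -> * C en'].map(r => compileReverse(r, classes));
console.log(reverseTransformation(appendNReverse, 'araten')); // [ 'arate', 'arat' ]
```

The sketch returns the candidates in rule order, so "arate" (from the vowel rule) comes before "arat" (from the consonant rule); the set of answers is the same as the one given above.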

The value of reverse application is that we can take inflected words from a text and reverse the inflections to arrive at a set of possible stems.

There is much more to it, of course, because morphology is complex.