Friday, October 1, 2021

Head words, body words and tail words

Lines of Voynichese text can be divided into a head (the first word), a tail (the last word) and a body (all the words in the middle).

Words can be classified according to where they tend to fall in the line:

  • Head words tend to be the first word in a line.
    • The first word of a paragraph seems to be its own special kind of head word.
  • Tail words tend to be the last word in a line.
    • Words ending in -m and -g tend to be tail words.
    • Words ending in -aly tend to be tail words.
  • Body words tend to fall in the middle of a line.
  • Free words can appear anywhere.

These categories aren't strict, so head words can sometimes be found in the body of a line, but are almost never found at the end. Similarly, tail words can be found in the body, but almost never at the head. There is a lot more work to be done looking at the relationship between the structure of a word and its classification.
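The classification above can be sketched roughly in code. This is a minimal illustration, not the method actually used for this post: it assumes the text is available as a list of transliterated lines with space-separated words, and the function names, the `min_count` cutoff and the `threshold` value are all arbitrary choices made for the example.

```python
from collections import Counter, defaultdict


def position_counts(lines):
    """Count how often each word occurs at the head, body and tail of a line."""
    counts = defaultdict(Counter)
    for line in lines:
        words = line.split()
        if len(words) < 2:
            continue  # a one-word line has no distinct head and tail
        last = len(words) - 1
        for i, word in enumerate(words):
            if i == 0:
                slot = "head"
            elif i == last:
                slot = "tail"
            else:
                slot = "body"
            counts[word][slot] += 1
    return counts


def classify(counter, min_count=5, threshold=0.6):
    """Label a word by its dominant slot; both cutoffs are assumed values."""
    total = sum(counter.values())
    if total < min_count:
        return "unknown"  # too few occurrences to say anything
    slot, n = counter.most_common(1)[0]
    return slot if n / total >= threshold else "free"


# Invented toy lines, not a real transcription.
toy_lines = ["sol bak dam", "sol rin bak dam", "sol bak rin dam", "bak sol dam"]
toy_counts = position_counts(toy_lines)
toy_labels = {w: classify(c, min_count=3) for w, c in toy_counts.items()}
```

Note that a line has only one head and one tail but usually several body slots, so raw proportions like these lean towards "body". A more careful version would compare each word's counts against what its overall frequency would predict for each slot.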

This explains why I failed to find a clear direction of text when I was looking at line breaks. I was looking for common pairs that were broken across lines, but since the head and the tail of the line are drawn from statistically different sets of words than the body, those pairs turn out to be very rare in the rest of the text. (Thanks to Nick Pelling for the observation that initial and final letters of Voynichese probably screwed up my test!)
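A rough sketch of that kind of line-break test (not the original code) might look like the following. It assumes the same list-of-lines input, treats an empty line as a paragraph break, and assumes left-to-right, top-to-bottom reading; the function names are made up for the example. It collects the word pairs that straddle a line break and asks what fraction of them are also seen inside a line.

```python
from collections import Counter


def pair_counts(lines):
    """Return (within, across): adjacent word pairs inside lines and over line breaks."""
    within = Counter()
    across = Counter()
    prev_tail = None
    for line in lines:
        words = line.split()
        if not words:
            prev_tail = None  # assumed convention: blank line ends a paragraph
            continue
        within.update(zip(words, words[1:]))
        if prev_tail is not None:
            across[(prev_tail, words[0])] += 1
        prev_tail = words[-1]
    return within, across


def attested_fraction(within, across):
    """Fraction of line-straddling pairs that also occur somewhere inside a line."""
    total = sum(across.values())
    if total == 0:
        return 0.0
    hits = sum(n for pair, n in across.items() if pair in within)
    return hits / total
```

If head and tail words come from different sets than body words, this fraction would be expected to be low whichever reading direction is assumed, which is why the test cannot separate the directions.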

Interestingly, page f81r (where the text is laid out like a poem) follows these head-and-tail tendencies. That is, the words at the heads of the lines tend to be head words elsewhere, and the words at the ends of lines tend to be tail words elsewhere. This suggests that the ragged line lengths on this page are intentional, and adds weight back to the hypothesis that this page contains a poem.

3 comments:

  1. Separating out the three groups of words does rather beg the question of whether there is any way of testing the hypothesis that line-final -m is encoding a Gutenberg-like '//' word-continuation (i.e. where we now use a hyphen).

    One thing I've wondered is whether the reason line-initial words might have an extra glyph at the start is to conceal the follow-on letter from the (preceding) line-final word.

    So there may well be a direct connection between head words and tail words. Just a thought! :-)

  2. I have been wondering about the absence of punctuation in the cipher text, and whether punctuation might be encoded in the text.

    I suppose it's possible that -m could act like a word-continuation character, but there are some oddities about its appearance that would need to be explained. For example, the last line on f31v ends in -m, but the line does not go all the way to the end of the page, and there is room enough below for one or two more lines of text. If the scribe broke a word at the end of that page, it wasn't because there was a lack of space.

    There is also the fact that words ending in -m do not appear exclusively at the end of the line. They have a strong tendency to do so, but they may also appear anywhere else. If you look at line 2 of f33v, there are four words ending in -m, and while the line is broken up as it crosses an image, the words ending in -m do not fall immediately before the breaks.

    I will see what happens statistically when we strip the first character off the head of a line.

  3. It's possible that the continuation character reuses a rare plaintext character in the cipher alphabet (e.g. 'x'). Or instances of -m within a line might be copying errors (e.g. for -r). Or there might have been line-internal hyphens inside the plaintext that have been faithfully reproduced in the ciphertext.

    I could probably think of another five explanations if pushed, so I'm not too concerned just yet. :-)
