Monday, January 28, 2019

How much information is in the Rohonc Codex?

I'm getting back to the Rohonc Codex after a few years away, and I'm wondering how much information is stored in the text. That is, if the text were efficiently translated into another language, how long would the translation be?

Since I don't know where word breaks fall, I can only base the calculation on symbols in the text. Using Shannon's information entropy formula, I can calculate the entropy of some sample texts, in bits per symbol, as follows:

King James Bible (letters): 4.1
King James Bible (words): 8.9
Latin Vulgate book of Genesis (letters): 4.1
Latin Vulgate book of Genesis (words): 10.3
Voynich Manuscript (letters): 4.0
Voynich Manuscript (words): 10.5
Rohonc Codex (glyphs): 5.9


These values make intuitive sense to me. A word in Latin conveys more information than a word in English because it carries more inflectional morphemes. (The Voynich numbers are just thrown in for fun, since we don't know what they convey.)
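For anyone who wants to reproduce figures like these, here is a minimal sketch of the calculation in Python. It assumes a simple unigram estimate, where each symbol's probability is its relative frequency in the text. The function name `shannon_entropy` and the sample sentence are mine, chosen for illustration, and choices such as whether to count spaces or fold case are assumptions that will shift the result.

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Return the unigram Shannon entropy, in bits per symbol, of a sequence."""
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = -sum(p * log2(p)) over the observed symbol frequencies
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Illustrative sample only; a real estimate needs the full text.
sample = "in the beginning god created the heaven and the earth"

# Assumption: letters only, with spaces and punctuation dropped.
letter_entropy = shannon_entropy(c for c in sample if c.isalpha())

# Assumption: words are whitespace-delimited tokens.
word_entropy = shannon_entropy(sample.split())

print(f"letters: {letter_entropy:.2f} bits, words: {word_entropy:.2f} bits")
```

The same function works for letters, words, or transcribed glyphs, since it only needs a sequence of hashable symbols.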

From these values, it looks like a glyph from the Rohonc Codex conveys more information than a character in English or Latin. That seems like the right answer given the number of symbols in the system. But how do we use that information to calculate the size of the text?

Letters convey phonological information, but words convey meaning, so the size of a text in letters means something different from the size in words, and we have to do some conversion to get from one to the other.

If we assume the Rohonc script is primarily phonological and roughly as efficient as English and Latin, then the size of the text in phonological terms would be about:

    5.9 bits/glyph × 60,142 glyphs = 354,837.8 bits

This would be equivalent to a Latin text of length:

    354,837.8 bits ÷ 4.1 bits/letter = 86,545.8 letters

If the underlying language were Latin, then given the number of letters per Latin word (5.2), we could calculate the expected number of Latin words. That would be:

    86,545.8 letters ÷ 5.2 letters/word = 16,643.4 words

Given this, if the language were Latin, then each word would consume about 3.6 Rohonc glyphs: 60,142 glyphs ÷ 16,643.4 words = 3.6 glyphs/word. If the language were English, each word would consume about 2.8 glyphs. It is probably safe to assume Rohonc words are 3-4 glyphs long, on average.
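The chain of conversions above can be written as a short Python sketch, using the Latin figures from this post. The function name `glyphs_per_word` is hypothetical, and the whole estimate rests on the assumption that the Rohonc script is about as efficient as the comparison language.

```python
def glyphs_per_word(glyph_entropy, glyph_count, letter_entropy, letters_per_word):
    """Estimate average glyphs per word, assuming equal efficiency per bit."""
    total_bits = glyph_entropy * glyph_count           # bits in the whole text
    equivalent_letters = total_bits / letter_entropy   # same bits, as letters
    equivalent_words = equivalent_letters / letters_per_word
    return glyph_count / equivalent_words

# Latin figures from the post: 4.1 bits/letter, 5.2 letters/word.
latin = glyphs_per_word(5.9, 60_142, 4.1, 5.2)
print(f"{latin:.1f} glyphs/word")  # prints "3.6 glyphs/word"
```

Note that the glyph count cancels out of the result: the estimate reduces to (bits/letter × letters/word) ÷ bits/glyph, which for Latin is (4.1 × 5.2) ÷ 5.9, or about 3.6. The length of the codex affects the estimated word count but not the estimated word length.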