Sunday, September 12, 2021

Lexical vs. Textual Frequency

There are two ways you can look at the frequency of a letter or sequence of letters: One is the frequency of the letter in a text (textual frequency, the way we usually look at it); the other is the frequency of the letter in the lexicon (lexical frequency).
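The two measures can be computed side by side. The following is a minimal sketch, not necessarily the method used for the figures in this post: it assumes the text has already been split into word tokens, and the function name prefix_frequencies and the toy sample are invented for illustration.

```python
from collections import Counter

def prefix_frequencies(tokens, n=2):
    """Return (textual, lexical) relative frequencies of word-initial n-grams.

    Hypothetical helper: textual frequency counts every occurrence of a word,
    lexical frequency counts each distinct word once.
    """
    tokens = [t for t in tokens if len(t) >= n]      # skip words shorter than n
    textual = Counter(t[:n] for t in tokens)         # one count per token
    lexical = Counter(t[:n] for t in set(tokens))    # one count per distinct word
    total_tokens = sum(textual.values())
    total_types = sum(lexical.values())
    textual_freq = {p: c / total_tokens for p, c in textual.items()}
    lexical_freq = {p: c / total_types for p, c in lexical.items()}
    return textual_freq, lexical_freq

# Toy sample, invented for illustration (not a real line of any text).
sample = "che la che ride che rima chi".split()
textual_freq, lexical_freq = prefix_frequencies(sample)
```

In this toy sample, ch- begins 4 of the 7 tokens but only 2 of the 5 distinct words, so its textual frequency is higher than its lexical frequency.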

Since there is no obvious reason for the frequency of a word to depend on the letters in it, we would expect the relationship between textual frequency and lexical frequency to be roughly linear. And, generally speaking, this is the case. However, there are usually a small number of outliers.
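One simple way to pick out such outliers is to compare the two frequencies as a ratio. The sketch below assumes both measures are relative frequencies over the same set of prefixes, so that a prefix behaving "as expected" has a ratio near 1; the function name ratio_outliers, the threshold of one bit, and the numbers are all assumptions made up for illustration, not data from any real text.

```python
import math

def ratio_outliers(textual_freq, lexical_freq, threshold=1.0):
    """Return {prefix: log2(textual / lexical)} for prefixes whose
    log-ratio is further than `threshold` from zero.

    Hypothetical helper: positive values mean the prefix is more common
    in running text than in the lexicon, negative values the reverse.
    """
    outliers = {}
    for prefix, lex in lexical_freq.items():
        tex = textual_freq.get(prefix, 0.0)
        if tex <= 0 or lex <= 0:
            continue                      # log-ratio undefined, skip
        log_ratio = math.log2(tex / lex)
        if abs(log_ratio) > threshold:
            outliers[prefix] = log_ratio
    return outliers

# Invented toy frequencies, for illustration only.
textual_freq = {"ch": 0.40, "ri": 0.05, "la": 0.30, "so": 0.25}
lexical_freq = {"ch": 0.10, "ri": 0.30, "la": 0.30, "so": 0.30}
outliers = ratio_outliers(textual_freq, lexical_freq)
```

With these made-up numbers, ch- comes out with a positive log-ratio (textually over-represented) and ri- with a negative one (lexically over-represented), while la- and so- fall within the threshold.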

For example, looking at initial pairs of letters in Dante's Divina Commedia, we find two initial pairs that stand out:


The prefix ch- appears in the high-frequency word che (and its contracted form ch'), which drives up its textual frequency relative to its lexical frequency. The prefix ri- is a derivational prefix used to create a relatively large number of words of low frequency, driving up its lexical frequency relative to its textual frequency.

Latin has a different (but etymologically related) outlier:

The prefix qu- in Latin appears in such relatively high-frequency words as qui, quod, quae, quam, quid, quo, quem, and quoque, which drives up its textual frequency relative to its lexical frequency.

Currier A and B have different patterns from each other:

Currier A has the high-frequency words daiin and dain, which raise the textual frequency of the prefix da- relative to its lexical frequency.

Currier B has the high-frequency words qokeedy, qokain, qokedy, and qokeey, which raise the textual frequency of the prefix qo- relative to its lexical frequency.

In the case of the Voynich Manuscript (VM), there may be multiple reasons for these outliers. There are probably lexical anomalies in the underlying language, but the cipher itself could also introduce its own odd behaviors through the use of homophones, multi-letter symbols, and so forth.
