Tuesday, April 18, 2023

Toyfl, finished at last

When I started this blog ten (!) years ago, I was creating a programming language called Toyl (Toy Language).

In the intervening years I have thrown away and rebuilt my language several times, and I now consider it essentially finished. It is now called Toyfl (Toy Functional Language).

Syntactically it is somewhat inspired by JavaScript, with lambda expressions like the following:

(a, b, c) => a * b + c

And anonymous objects like this:

{a = 10; b = 32; c = 11; f = (x) => a}

And a dot operator for accessing members:

x = {a = 10; b = 32};

y = x.a

And lots of other familiar things.

There are only a few features of the language that are worth remarking on, but since the purpose of this blog was originally connected to developing the language, I'll remark on them here:

1) Extreme Laziness. If you have an expression like if(a == b, a + 37, b + 22), Toyfl will evaluate only the condition a == b. It then returns the expression from the then or else branch without actually evaluating it. Nothing gets evaluated until it is really, really needed.
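The idea can be sketched in Python using zero-argument thunks. This is just an illustration of the evaluation strategy, not Toyfl's actual implementation:

```python
def lazy_if(cond, then_thunk, else_thunk):
    # Only the condition is evaluated; the chosen branch comes back
    # as an unevaluated thunk.
    return then_thunk if cond() else else_thunk

a, b = 5, 5
branch = lazy_if(lambda: a == b, lambda: a + 37, lambda: b + 22)
# Nothing in either branch has run yet; evaluation happens only on demand.
value = branch()  # 42
```

Note that the losing branch is never forced, so even an expression that would raise an error (say, a division by zero in the else case) costs nothing as long as the condition picks the other branch.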

That's connected to a second weird feature of Toyfl:

2) Currying by Refactoring. Since Toyfl doesn't evaluate an expression until it is really, really needed, an expression returned from within a function needs to be context-independent. That is, it needs to be able to remember what relevant parameters were passed into the function when it was invoked. I handle that by creating a refactored copy of the body of the function. So, if you have a function like this:

f = (x, y) => x * y

and you invoke it like this:

g = f(a, b + c)

then what you get for g is actually a structure representing the expression a * (b + c).
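A minimal Python sketch of this substitution idea follows. The Var/BinOp classes are hypothetical stand-ins for illustration, not Toyfl's real internals:

```python
# Hypothetical expression-node classes standing in for Toyfl's internals.
class Var:
    def __init__(self, name):
        self.name = name
    def subst(self, env):
        # A variable is replaced if the environment binds it, else kept as-is.
        return env.get(self.name, self)
    def __repr__(self):
        return self.name

class BinOp:
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right
    def subst(self, env):
        # Substitution builds a refactored copy; the original body is untouched.
        return BinOp(self.op, self.left.subst(env), self.right.subst(env))
    def __repr__(self):
        return f"({self.left} {self.op} {self.right})"

# f = (x, y) => x * y
body = BinOp("*", Var("x"), Var("y"))

# g = f(a, b + c): substitute the argument expressions into a copy of the body
g = body.subst({"x": Var("a"), "y": BinOp("+", Var("b"), Var("c"))})
print(g)  # (a * (b + c))
```

The result g is itself an unevaluated expression structure, so it can in turn be substituted into or forced later, which is exactly what makes the extreme laziness above workable.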

Originally it rubbed me the wrong way to do this, because somewhere deep inside I have this feeling that code should remain static, but the effort required to carry out refactoring is actually pretty slight, and in a functional context it is very reliable.

However, this behavior means that recursive functions actually build large in-memory expressions. Memory is cheap these days, so I don't worry too much about that for ordinary use-cases, but I thought it would be interesting to solve the problem anyway. So I created:

3) Special syntax for tail recursion. I noticed that the simplest functional implementations of certain solutions required significantly more calculation than the simplest imperative implementations. I wanted a way to bring the efficiency of imperative solutions to my functional language, so I created the following syntax to bridge the gap.

fibonacci = (n) => (
    for {
        a = 1;
        b = 1;
        i = 1
    }
    let {
        nextA = b;
        b = a + b;
        a = nextA;
        i = i + 1
    }
    until i == n
).a

This looks like a loop in an imperative language, but effectively it turns into something like this:

fibonacci = (n) => innerFibonacci(1, 1, n)
    where innerFibonacci = (a, b, i) => if (i == n, a, innerFibonacci(b, a + b, i + 1))

Normally Toyfl would build a huge structure out of this, but the special syntax tells Toyfl to fully evaluate each of the parameters a, b, i before the recursive invocation, so we can take advantage of tail recursion.
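In ordinary Python terms, the for/let/until construct above computes something like the following loop, where the whole state is fully evaluated on every step:

```python
def fibonacci(n):
    a, b, i = 1, 1, 1          # the "for" block: initial state
    while i != n:              # loop runs until i == n
        a, b = b, a + b        # the "let" block; tuple assignment plays the
                               # role of the nextA temporary
        i += 1
    return a                   # ".a" selects the a field of the final state

print([fibonacci(n) for n in range(1, 8)])  # [1, 1, 2, 3, 5, 8, 13]
```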

So that's Toyfl. It's been fun.

Saturday, November 27, 2021

A mistake in the VM?

One of the notable things about the VM is that there are no obvious mistakes. The scribe(s) do not appear to have erased, scraped out or overlined any text.

On f105r, however, there are four words written in an odd location, and I will argue that these were accidentally omitted from the end of the first line of the text, but the mistake was caught before the scribe was done and they were written in above the line. These words make up line 10 in the Landini-Stolfi transliteration, and fall between the second paragraph and the third one.


On René Zandbergen's site the description for this page has the following note: "There is a break between the third and fourth paragraph, and it appears as if the end of the third paragraphs was written above it." I think the "break" referred to here is the fact that the ink of the fourth paragraph is slightly fainter than the third, and the letters are neater and smaller, suggesting that paragraphs 1-3 were written in one sitting, and paragraphs 4 and onward were written later.

Observations

We can observe the following things about the physical appearance of these words:
  • They are set lower than the last line of the second paragraph, though there was ample space to place them in line with it, suggesting that they are not intended to be part of paragraph 2.
  • The gallows letters of the first line of the third paragraph are interposed between the oddball words, suggesting that the oddball words were written after the first line of paragraph 3 was completed, and written around the gallows letters.
  • The color and shape of the letters in these words is similar to those in paragraphs 1-3, so not obviously written later or in a different hand.
The words have the following statistical properties:
  • sairy elsewhere only appears as the last word of the first line of a paragraph
  • ore does not appear elsewhere
  • daiindy appears once in Currier A as part of a label and twice in Currier B (including this instance), in neither case at the end of a line
  • ytam appears both in Currier A and B, usually at the end of the line, but not always
Conclusions

The color and style of the letters, together with their placement, suggest that they belong to paragraph 3, and they were written in at roughly the same time that the other lines of paragraph 3 were written. The statistical properties of the words suggest that they belong to the end of a line, but which line?
  • Line 11: This line ends in dyaiin, which is a word not found elsewhere
  • Line 12: This line ends in ry, which elsewhere is only a line-final word
  • Line 13: This line ends with ot, which elsewhere does not appear at the end of a line. There is blank space at the end of the paragraph, sufficient to write about two words.
Given that line 12 already ends in a word which is elsewhere only line-final (and so appears complete), and that line 13 leaves enough blank space that at least two of the oddball words could simply have been written there, the best explanation is that these words belong at the end of line 11, the line immediately below them.

I have seen omissions like this in manuscripts in the past, and the cause is often that the eye skips from one word to a later similar word. In this case, perhaps the scribe's eye skipped from sairy to yaiir, which starts line 12. That opens up two possibilities:
  • If the text was enciphered first on a wax tablet (or something similar) and then copied to the vellum, and the eye-skip occurred during the copying process, then line-breaks on the wax tablet were not the same as the line breaks on the vellum.
  • If the eye-skip occurred during the encipherment process, then the plaintext for sairy could be similar to (or even identical to) the plaintext for yaiir.

Saturday, November 20, 2021

Word Transposition in the Voynich Manuscript

This is a follow-on to my last post, but I don't want to bother recapping the argument from last time, so I'll just start over fresh and go a different direction.

Summary
While a text in an unknown language may look random, there are two "forces" that govern the appearance of words in the text. One of those "forces" is absent in the Voynich Manuscript, and I think that indicates that a word transposition step has taken place.

Argument
The following graph shows the relative frequency that a word will recur in a text after having once occurred.


The graph above shows the likelihood that a given word will appear a second time after it has appeared a first time. For example, in a Latin medieval prose text (red line), if a word appears once then there is an extremely low chance (0.03%) that the next word will be the same word. That rises to an almost 1% chance that it will appear six words later, then slowly drops off to a 0.76% chance that it will appear 30 words later, and a roughly 0.6% chance that it will appear 100 words later.
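A curve like this can be computed with a few lines of Python. This is a sketch of the general method only; the texts and any smoothing used for the actual graph are not reproduced here:

```python
def recurrence_curve(words, max_gap=30):
    # For each gap d, the fraction of positions i where the word at
    # position i reappears exactly d words later.
    curve = {}
    for d in range(1, max_gap + 1):
        hits = sum(1 for i in range(len(words) - d) if words[i] == words[i + d])
        curve[d] = hits / (len(words) - d)
    return curve

text = "the cat sat on the mat and the dog sat by the door".split()
curve = recurrence_curve(text, max_gap=5)
```

A text with a strong repulsive force will show near-zero values at the smallest gaps, then a rise; a text without it will start high immediately.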

A similar phenomenon can be seen in the early modern French novel Pantagruel (blue line).

This curve could be described as the product of the interaction of two "forces":
  1. Strong repulsive force: Instances of the same word have a very low likelihood of being found in very close vicinity to each other. Perhaps languages are naturally structured in a way that avoids close repetition.
  2. Weak attractive force: Instances of the same word have a higher likelihood of being found in the same broad area of the text as each other. Intuitively it seems like this should not apply to high-frequency words with low semantic content (articles, prepositions, etc.), since there is no reason for these to be grouped together in the same area of a text. Instead, this ought to apply primarily to lower-frequency words with high semantic content, since these words will be tied to the topic of discourse, and will therefore be clustered in areas of the text where the topic relates to their semantic domain. (I should have proved this out, but I didn't.)

Interestingly, Latin syllables (orange line) respond to the same strong repulsive force as words, but not the weak attractive one. This makes sense if the weak attractive force relates to semantic content, because syllables themselves have no semantic content and are therefore not tied to the topic of the text. Instead, with Latin syllables we see a strong tendency for syllables not to repeat in close vicinity to each other, but then the curve just rises to a plateau.

So what do we see in the VM?


Words in the Voynich Manuscript demonstrate the effects of the weak attractive force more or less like Latin words.  This suggests they have a semantic component, and there is some kind of topicalization going on. However, the VM shows no evidence of the strong repulsive force. What could cause that?

The strong repulsive force works over a very short distance, generally less than five words. If words were shuffled around so they were separated from their neighbors by a distance of five words or more, then this would conceal the effect of the strong repulsive force.

In other words, perhaps there is a transposition step in the VM cipher, operating on words. This could solve a lot of problems.

For example, such a transposition could also explain why the text does not exhibit line-breaking features that make it clear whether it runs right-to-left or left-to-right.

It might also explain why the last lines of some paragraphs (especially in the Currier A sections) have gaps in them. Perhaps these gaps are slots that were simply not filled in by the transposition algorithm.


Indeed, if we suppose that the transposition works on the level of the paragraph, then that could explain why so many paragraphs begin with a word containing an ornate gallows letter. If the transposition algorithm resets for each paragraph, then the reader would need a visual cue to indicate where to start over again.

I can even imagine algorithms that could produce the phenomenon of head words, body words and tail words, though this is a bit more of a stretch since that would mean there is some connection between what a word is and where the transposition algorithm puts it.

Lastly, this could explain why the VM has no punctuation. In manuscripts of this era punctuation was common (though not universal). Since punctuation marks stand between words in connected linear text, if the words are shuffled through some kind of transposition algorithm then it might no longer be clear where to put the punctuation marks.

So...how would one prove or disprove the existence of word transposition in the VM?

Thursday, October 28, 2021

Is there unusual context dependency in the VM?

Back in 2015 Torsten Timm wrote a paper titled How the Voynich Manuscript was Created, in which he argued that the VM was created by a process of copying and altering glyph groups that had already been written. One of the pieces of evidence Timm used in supporting this argument was the measurable fact that words found on one line of the manuscript had a higher likelihood of being found on the three lines immediately above it than being found further away in the text.

In this post I'll dig into a specific problem with Timm's argument, and I will argue that what Timm observed is a natural language feature, but that following his approach reveals another interesting feature of the VM text.

Let's start with Timm's argument. The following graph is taken from his paper, and it shows the likelihood that a word on one line of the VM will be found on a line before it:


Here you see that a word found on any given line has almost a 7% chance of being found elsewhere in the same line (position 0), a roughly 6.5% chance of being found one line higher (position 1), 6% chance of being found two lines higher, and so forth. The further back you go, the lower the likelihood of finding your word repeated, with the curve flattening out at around 4%.
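A measurement along these lines can be sketched in Python. This is my reconstruction of the general idea, not Timm's actual code, and the three-line toy input is illustrative only:

```python
def line_repetition(lines, max_back=10):
    # For each offset k, the fraction of word tokens that also occur
    # k lines earlier (k = 0: elsewhere on the same line).
    probs = {}
    for k in range(max_back + 1):
        hits = total = 0
        for n in range(k, len(lines)):
            for j, w in enumerate(lines[n]):
                total += 1
                if k == 0:
                    # same line: the word must occur at another position
                    if w in lines[n][:j] or w in lines[n][j + 1:]:
                        hits += 1
                elif w in lines[n - k]:
                    hits += 1
        probs[k] = hits / total if total else 0.0
    return probs

lines = [l.split() for l in ["daiin shol chol",
                             "shol daiin qokedy",
                             "otedy daiin chol"]]
probs = line_repetition(lines, max_back=2)
```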

Does the curve in this graph represent a natural feature or an unnatural one? On the face of it, it seems a phenomenon like this ought to be perfectly natural. If we have an herbal, for example, we would expect a word like "verbena" or "pigroot" to be localized to the section of the herbal that discusses it. But would that be sufficient to move the line on the graph?

Timm argued that this is an unnatural feature, and supported this argument by carrying out the same exercise with a Latin text (the Aeneid) and an English text (Dryden's translation of the Aeneid). Here is what he found:


Here you can see that the Voynich manuscript has a curve that is dramatically different from both the Latin and English texts that Timm chose for comparison.

Note that the English line dips down at the far left, a phenomenon which Timm attributed to Dryden's rhyme scheme. More on that below.

Here's the problem: The Aeneid is not normal in terms of its repetitiveness. If you conduct this same exercise with all of the medieval prose and poetry in the LatinISE corpus, you find the following:


As you can see from this graph:
  • The Aeneid is less repetitive than Medieval Latin poetry. This is probably in part due to Vergil's style (maybe he preferred to avoid repetition) and probably also in part due to the fact that Medieval Latin made more use of low-content high-frequency words than Classical Latin.
  • Latin poetry is less repetitive than Latin prose, but this is partly due to a difference in line length. The curve in the graph above resulted from breaking prose texts down into lines of up to 40 characters in length. If I had used an 80-character limit instead, the curve would have peaked at 7.8%. If I had used a 23-character limit, the prose curve would have come close to matching the poetry curve.
  • In both Medieval Latin poetry and prose there is a lower tendency for a word to be found in its own line (position 0) than in the next line (position 1). This same phenomenon appeared in Timm's graph for the "English" line, and he explained it as a product of Dryden's rhyme scheme. However, since there is no rhyme scheme in play in Latin prose, there must be another explanation.
In my opinion, the interesting thing Timm's analysis reveals is actually this: In the VM, a word is more likely to be found again on its own line than to be found on the lines above it. I have an idea of what this could mean, but I don't want to make this post unnecessarily long, so I will dig into it in a separate post.

Tuesday, October 26, 2021

Are there nulls in the VM?

Renaissance cryptographers used nulls to break up repeated sequences of characters and to alter the frequency statistics of a text. Did the author of the VM do anything similar? If so, how could we detect it?

Null characters increase the entropy of a text. The more randomly a null character is employed, the greater the entropy it adds to the text. For example, the Latin text Historia rerum in partibus transmarinis gestarum written by William of Tyre conveys an average of 2.711 bits per character when measured using third-order entropy. If we randomly insert a null character "@" into the text at an average interval of every five characters, the entropy increases to 3.044 bits per character.

In cases like this, it seems like we ought to be able to ferret out the null character by identifying the character which, when removed, causes a significant drop in the entropy of the text.

The procedure we'll use is:

1. Calculate the average bits per character of a text which we think may be a cipher. Call this value Bnull.

2. Make a copy of the cipher text and remove from the copy a character C which we think might be a null.

3. Calculate the average bits per character of the text with C removed. Let this be BC.

4. Calculate the "nullitude" of the character C using NC = BC / Bnull.
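Here is a rough Python sketch of the whole procedure, interpreting "third-order entropy" as the entropy of a character conditioned on the two preceding characters (that interpretation, and the short synthetic plaintext, are my assumptions for the demonstration):

```python
import math
import random
from collections import Counter

def cond_entropy(text, order=3):
    # Average bits per character: entropy of a character conditioned on
    # the preceding (order - 1) characters.
    ngrams = Counter(text[i:i + order] for i in range(len(text) - order + 1))
    total = sum(ngrams.values())
    contexts = Counter()
    for g, c in ngrams.items():
        contexts[g[:-1]] += c
    return -sum((c / total) * math.log2(c / contexts[g[:-1]])
                for g, c in ngrams.items())

def nullitude(text, ch, order=3):
    # NC = BC / Bnull: values well below 1 mark a null-like character.
    return cond_entropy(text.replace(ch, ""), order) / cond_entropy(text, order)

# Demonstration: sprinkle a null "@" into a plaintext roughly every five
# characters, then score some candidate characters.
random.seed(0)
plain = "in principio creavit deus caelum et terram " * 50
noisy = "".join(c + ("@" if random.random() < 0.2 else "") for c in plain)
scores = {c: nullitude(noisy, c) for c in "@aeit"}
# "@" should score lowest: removing it causes the largest entropy drop
```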

First, we should know what the results look like with a text that does not contain nulls. Here are the nullitude values for characters in the Historia when no null character is inserted:

Here we do not see a significant drop below the value of 1, indicating that no single character, when removed from the text, causes a noticeable decrease in entropy. This is what we expected to see.

Now, looking at a copy of the Historia into which a null "@" has been inserted randomly about every five characters, we see a different result:


Here the null character stands out clearly, causing a significant drop in the entropy of the text once it is removed. Again, this is what we expect.

We can apply the same test to the Voynich Manuscript. Here is what we find with Currier A:


The result here is interesting because, at the left side of the chart, we see that removal of the character "e" causes a drop in entropy. It isn't a huge drop, but it does stand out from the rest of the characters. This suggests the possibility that Currier A words like cheol and cheor might be synonyms of the more common words chol and chor.

Though "e" looks slightly nullish in Currier A, we do not see the same phenomenon in Currier B:


In Currier B the distribution is more like the Latin plaintext above, with no particular character looking more like a null than the others.

Friday, October 1, 2021

Head words, body words and tail words

Lines of Voynichese text can be divided into a head (the first word), a tail (the last word) and a body (all the words in the middle).

Words can be classified according to where they tend to fall in the line:

  • Head words tend to be the first word in a line.
    • The first word of a paragraph seems to be its own special kind of head word.
  • Tail words tend to be the last word in a line.
    • Words ending in -m and -g tend to be tail words.
    • Words ending in -aly tend to be tail words.
  • Body words tend to fall in the middle of a line.
  • Free words can appear anywhere.

These categories aren't strict, so head words can sometimes be found in the body of a line, but are almost never found at the end. Similarly, tail words can be found in the body, but almost never at the head. There is a lot more work to be done looking at the relationship between the structure of a word and its classification.
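A crude version of this classification can be sketched in Python. The 50% threshold is an arbitrary choice for illustration, free words are not separately modeled, and the toy input is invented:

```python
from collections import Counter

def classify_words(lines, threshold=0.5):
    # Tally, for each word, how often it is line-initial, line-final, and total.
    head, tail, total = Counter(), Counter(), Counter()
    for line in lines:
        for j, w in enumerate(line):
            total[w] += 1
            if j == 0:
                head[w] += 1
            if j == len(line) - 1:
                tail[w] += 1

    def label(w):
        if head[w] / total[w] > threshold:
            return "head"
        if tail[w] / total[w] > threshold:
            return "tail"
        return "body"

    return {w: label(w) for w in total}

lines = [l.split() for l in ["ykal daiin qokam",
                             "ykal chol otam",
                             "shol ykal qokam"]]
labels = classify_words(lines)
```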

This explains why I failed to find a clear direction of text when I was looking at line breaks. I was looking for common pairs that were broken across lines, but since the head and the tail of the line are drawn from statistically different sets of words than the body, those pairs turn out to be very rare in the rest of the text. (Thanks to Nick Pelling for the observation that initial and final letters of Voynichese probably screwed up my test!)

Interestingly, page F81R (where the text is laid out like a poem) follows these head-and-tail tendencies. That is, the words at the heads of the lines tend to be head words elsewhere, and the words at the ends of lines tend to be tail words elsewhere. This suggests that the ragged line lengths on this page are intentional, and adds weight back to the hypothesis that this page contains a poem.

Sunday, September 19, 2021

I was Wrong about F81R
How Line Breaks and Word Breaks Behave in Currier A and B

This is going to be a long and boring post, so here's the summary:
  • Line breaks in the VM do not act like line breaks in a natural text, in that they do not provide evidence of whether the text runs left-to-right or right-to-left.
  • Word breaks in Currier A act like word breaks in a natural text, but in Currier B they do not.
  • Since my analysis of F81R as a poem was based on the assumption that line breaks and word breaks were natural, yet they turn out not to be natural at all, there remains nothing to support the idea that this page contains a poem.
Here are the tests I conducted whose results led me to that conclusion.

1. Direction of Text

Question: Text in the VM is laid out on the page in a way that suggests left-to-right text, but does the content of the text support that? How do we know the layout of the text isn't intentionally misleading?

Test: In a traditional European text, line breaks are governed by the width of the text column, and have an arbitrary relationship to the underlying text. Therefore we should expect that high frequency pairs of words W1 and W2 will occasionally be broken across lines, so W1 will appear on one end of one line and W2 will appear on the other end of the next line. If W1 appears at the right end of one line and W2 appears at the left end of the next line, then the text behaves like a left-to-right text. If they appear on the left and right ends, respectively, then it behaves like a right-to-left text.
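The test can be sketched in Python, with lines stored as word lists in visual left-to-right order. The toy input below is illustrative, not from the VM or the Latin sample:

```python
from collections import Counter

def direction_evidence(lines, min_count=2):
    # Frequent adjacent pairs, counted in visual left-to-right order.
    pairs = Counter()
    for line in lines:
        pairs.update(zip(line, line[1:]))
    frequent = {p for p, c in pairs.items() if c >= min_count}

    ltr = rtl = 0
    for prev, nxt in zip(lines, lines[1:]):
        if not prev or not nxt:
            continue
        # W1 at the right end of one line, W2 at the left end of the next:
        # evidence of left-to-right text.
        if (prev[-1], nxt[0]) in frequent:
            ltr += 1
        # The mirror split, which is how a frequent pair would straddle a
        # line break if the text actually ran right-to-left.
        if (nxt[-1], prev[0]) in frequent:
            rtl += 1
    return ltr, rtl

lines = [l.split() for l in ["a b", "c a", "b d", "a b", "e"]]
```

Here the pair (a, b) occurs twice within lines, and is split once across the "c a" / "b d" break in the left-to-right direction.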

Demonstration: I applied the test to De natura rerum ad Sisebutum regem liber, by Isidorus Hispalensis Episcopus, which is roughly the size of the Currier A section of the VM. The sample text contained 366 distinct pairs of words that were repeated at least twice, for a total of 966 instances of repeated pairs. In 68 cases a pair was found broken across lines in a way that indicated left-to-right text, and in 13 cases it was found broken in a way that indicated right-to-left text.

Conclusion: With more than five times as many left-to-right breaks, the evidence pointed strongly to a left-to-right text, as was expected.

Currier A: I found 491 distinct pairs repeated at least twice, for a total of 1389 instances of repeated pairs. In 32 cases a pair was broken across lines in a way that indicated left-to-right text, and in 47 cases it was found broken in a way that indicated right-to-left text.

Conclusion: The number of left-to-right breaks is not significantly different from the number of right-to-left breaks. This is not obviously a natural text running in either direction.

Currier B: I found 1701 distinct pairs repeated at least twice, for a total of 5313 instances of repeated pairs. In 69 cases a pair was broken across lines in a way that indicated left-to-right text, and in 94 cases it was found broken in a way that indicated right-to-left text.

Conclusion: The number of left-to-right breaks is not significantly different from the number of right-to-left breaks. This is not obviously a natural text running in either direction.

2. Word Breaks

Question: Text in the VM appears to be broken into words by spaces, but do these spaces really act like word breaks within the text?

Test: Word breaks should divide the text into a relatively productive lexicon. A productive lexicon is one that can produce the text in question with a relatively small number of words used at relatively high frequencies. We should find that true word breaks divide the text into a productive lexicon better than any other character in the text.

Treat each character in the text as a potential word-break character and measure the frequency of the most frequent word in the resulting lexicon. Use that frequency as a proxy for the productivity of the lexicon. If the word break character results in the most productive lexicon, then it acts like a true word break.
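The scoring step can be sketched in Python. The toy text below is illustrative; real scores like the 320 and 184 reported next come from the full texts:

```python
from collections import Counter

def productivity_score(text, sep):
    # Split on the candidate word-break character and return the count of
    # the most frequent resulting "word" -- the proxy for lexicon
    # productivity described above.
    words = [w for w in text.split(sep) if w]
    return Counter(words).most_common(1)[0][1]

text = "shol.chol.daiin.shol.qokedy.chol.shol"
scores = {c: productivity_score(text, c) for c in set(text)}
# a true word-break character should yield the most productive lexicon
```

In this toy example "." scores 3 (from the three occurrences of shol), while every other character yields a lexicon whose most frequent word occurs only once.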

Demonstration: I applied the test to De natura rerum ad Sisebutum regem liber. The word break character resulted in a score of 320, while the next best character (s) resulted in a score of 184.

Conclusion: The lexicon created by the word break character is nearly twice as productive as the next best candidate. The word break character acts like a true word break, as expected.

Currier A: The word break character resulted in a score of 512, while the next best character (o) resulted in a score of 266.

Conclusion: The lexicon created by the word break character in Currier A is nearly twice as productive as the next best candidate. The word break character acts like a true word break.

Currier B: The word break character resulted in a score of 499, but the character producing the most productive lexicon was actually 'e', which yielded a score of 514. The character 'a' was third in rank, with a score of 482.

Conclusion: The lexicon created by the word break character in Currier B is not significantly more productive than the lexicon created by other high-frequency characters. In Currier B, the word break character does not act like a word break.