Wednesday, June 26, 2013

The Roasted Cake Song

There is a Classical Chinese text called the Roasted Cake Song (燒餅歌), supposedly composed by the Ming official Liu Ji (劉基), better known by his courtesy name Liu Bowen (劉伯溫).  The song is cryptic and is traditionally held to be prophetic, but it is often considered a recent hoax.

I can't comment on whether it is prophetic or not, but I am pretty sure it is not a recent hoax.  Here is a page from the Old Manchu text Jiu Manzhou Dang, detailing a letter sent by the Manchu chieftain Nurhaci to the Khalkha of the Five Encampments in 1620.  I translate the highlighted section below.


The highlighted section says (in Old Manchu):
te geli liobe uwen-i gisunde latunahabi.  tere liobe uwen serengge dūleke julgei [n]iyalma kai. ganio joriha gese. liobe uwen-i gisun ehe sain ojibe emgeli jorime waciha kai.
Which I translate as:
Now again [events] adhere to the words of Liu Bowen.  That Liu Bowen was a person of the ancient past.  It is like he indicated supernatural things.  Regardless of whether his words were of good or evil, what he indicated came to pass.
So, at the very least, we can say that in 1620 there was some text attributed to Liu Bowen that was considered to be strange and prophetic.  It is possible that the Roasted Cake Song that survives to the present day is a hoax, but unless there is something in the content of the surviving text that is clearly anachronistic, we can probably assume that it dates at least to 1620.

Monday, June 24, 2013

The Rohonc Codex

The Rohonc Codex is an undeciphered manuscript that was kept in the city of Rohonc in Western Hungary (now Rechnitz, Austria), until 1838.  The paper on which the codex was written bears a watermark that identifies it as having originated in Venice in the 1530s.  Generally it is supposed to be in Hungarian, Romanian, or else a hoax.

First, I think there is no question at all that the content of the codex is biblical.  Consider the following image:


This depicts the scene described in Matthew 21, where Jesus rides an ass into Jerusalem, and the people spread their garments in the way, and cut the branches from trees and spread them in the way.

But obviously this is not written in a (known) traditional liturgical language of a Christian tradition, nor is it written in a (known) vulgar language of a Christian population.  I think it is very possible that it was intended as a secret text, either because the Christian sect was considered heretical, or else because the practice of the Christian religion was circumscribed.

One of the interesting things about the imagery of the codex is how the Romans are depicted.  Here, you can see them crucifying Christ:


The Romans appear to have tall, pointed caps with something like a tassel on it.  I believe this text was composed within the domain of the Ottoman empire, and the author was depicting the Romans as Janissaries.  You can see what I mean in this picture of a Janissary:


The text uses a double-dash as a hyphen at the end of a line to indicate a break in a word.  Interestingly, these hyphens normally occur on the left side of each page, demonstrating that the direction of the text was from right to left, perhaps influenced by Arabic or Hebrew writing.  This usage of the hyphen came into being with moveable type, and so we should consider primarily areas of the Ottoman empire that were in the European sphere.

So, let's say this text originated within the Ottoman empire, somewhere not too far from the general trajectory between Venice and Rechnitz.  At the time the paper was manufactured, the empire included most of the Balkan peninsula (including Albania).  Hungary, Transylvania, Wallachia and Moldova were quickly added.  Theoretically, the language could have been Hungarian, Romanian, Serbian, or even Macedonian or Albanian.

It's getting late, and I'm tired, so I'll just throw one more thing out there.

Check out this page of the text (which reads right-to-left).  It has what appears to be a numbered list, wherein the numbers are prefixed with a curly-V looking shape, followed by a repeated formula:


The symbols for the smaller numbers are just slashes.  The first element on this page is number 4, so there are four slashes, then five slashes, then the number six is represented by a semicircle with an angle in it, then seven is that plus a slash, eight is that plus two slashes, nine is a bent t, ten is a cross +, eleven is a slash with a dot, like a lower-case i.

Suppose the curly-V represents an ordinal prefix, so these entries read something like "Fourth, it sayeth in XYZ...", "Fifth, it sayeth in XYZ...", and so forth.  If that's the case, we should focus on languages where the ordinals are indicated by a prefix.

In Romanian and Albanian, the ordinals are prefixed by the definite article.  Romanian is closer to Rechnitz, but my money is tentatively on Albanian, because several scripts were invented for Albanian in the 18th century, including one (the Todhri script) that used many of the symbols used here for numbers.

Saturday, June 22, 2013

The Insulating Void

As far as we know, our universe radiates heat into an infinite and empty void.  For this reason, we believe that it may eventually grow cold and die the death of entropy.

But think about what happens if a photon radiates from a particle in our universe and departs into the void.  First, a probability radiates out from the originating particle, representing the chance that the photon will strike another particle within the radius of ct, where t is the time since radiation.

Suppose the probability wave grows so large that it entirely escapes the universe, reducing to zero the chance that the photon will ever strike a particle and energize it.

If that were to happen, the end result would be functionally equivalent to a simple loss of energy in the originating particle--something that is not supposed to happen, because mass/energy is supposed to be conserved.

Not only that, but there would be a problem with the time symmetry of the photon, because if time ran backwards what you would see is photons radiating in from an empty void, looking like an increase in the mass/energy of the universe.

So what if the universe is so ordered that this cannot happen?  What if part of what it means for all of these particles to exist in the same universe is that, if one particle loses energy through radiation, that energy must be picked up by another particle in the same universe?

In that case, the void would act like an insulator.

Friday, June 21, 2013

Semantic distance between words in a large text

I've had an algorithm rattling around in my head for years that can be used to find words of similar meaning in large texts.  I've implemented it from scratch several times, but I doubt I will ever have a chance to use it in real-life work.

The basic idea is this:
Part of the semantic meaning of a word is shared with the words that are syntactically close to it in a text.
For example, in the sentence "I eat cake", the words "eat" and "cake" are close to each other, and they share a common component of meaning.  A cake is something that you eat, and eating is something you do with cake.

Add the sentences "I eat bread", "I eat bananas", "I bake bread", "I bake cake" to the mix.  When you compare their contexts, you find that "bread", "cake" and "bananas" share common contexts (they are all eaten), but "bread" and "cake" have more in common with each other than they do with "bananas" because they are baked.  Furthermore, "eat" and "bake" have something in common: they are both actions that are applied to food.

In a large text you will get many accidental pairs, but if they are truly accidental then they won't be statistically distinctive.

Here's how I usually implement this algorithm:

1.  Build a table of all words in the text with their frequencies.

2.  Build a table of all pairs in the text with their frequencies.

3.  For any two words W1 and W2, find all pairs of pairs ((W1, x), (W2, x)) and all pairs of pairs ((x, W1), (x, W2)).

4.  Over all of the pairs of pairs, calculate the sum S1 of the absolute differences between the relative frequencies of the left and right members of each pair of pairs.  The relative frequency of a pair (W1, x) is the frequency of the pair divided by the frequency of W1.

5.  Over all of the pairs of pairs, calculate the sum S2 of the relative frequencies of the left and right members of each pair of pairs.

6.  Calculate the semantic distance between words W1 and W2 as: 2 - S2 + S1.  The distance is a number between 0 and 2, where 0 is complete similarity and 2 is complete difference.

If the words are the same, then S1 will be 0 and S2 will be 2, so the distance will be 2 - 2 + 0 = 0.  If the words are completely different, then S1 will be 0, but S2 will also be zero, so 2 - 0 + 0 = 2.

The more data you have to throw at this calculation, the better, to reduce the impact of noise on the result.
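The steps above can be sketched in Python roughly as follows.  This is only an illustration, not the implementation I usually write: the function names are made up for the example, pairs are taken as adjacent words within a sentence, and (as an assumption) each relative frequency is divided by the total number of neighbours observed for the word rather than by the raw word frequency, so that the distance stays between 0 and 2 when both left and right neighbours are counted.

```python
from collections import Counter, defaultdict


def context_profiles(sentences):
    """Map each word to a Counter of its tagged neighbours.

    A context is ("L", x) when x appears immediately before the word
    and ("R", x) when x appears immediately after it.
    """
    profiles = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for left, right in zip(words, words[1:]):
            profiles[left][("R", right)] += 1
            profiles[right][("L", left)] += 1
    return profiles


def semantic_distance(profiles, w1, w2):
    """Return 2 - S2 + S1, summed over the contexts shared by w1 and w2."""
    c1 = profiles.get(w1, Counter())
    c2 = profiles.get(w2, Counter())
    n1, n2 = sum(c1.values()), sum(c2.values())
    s1 = s2 = 0.0
    for context in c1.keys() & c2.keys():
        # Assumption: normalise by the number of observed contexts.
        f1 = c1[context] / n1
        f2 = c2[context] / n2
        s1 += abs(f1 - f2)
        s2 += f1 + f2
    return 2 - s2 + s1


sentences = ["I eat cake", "I eat bread", "I eat bananas",
             "I bake bread", "I bake cake"]
profiles = context_profiles(sentences)
print(semantic_distance(profiles, "bread", "cake"))     # 0.0
print(semantic_distance(profiles, "bread", "bananas"))  # 1.0
print(semantic_distance(profiles, "eat", "cake"))       # 2.0
```

On the five toy sentences, "bread" and "cake" come out identical, "bananas" sits halfway, and "eat" and "cake" share no contexts at all.  With so little data the numbers mean nothing, of course; they only show the arithmetic.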

A measure of the distance of phonemic change

It is common in the software world to use a metric called Levenshtein Distance to measure the distance between two strings.  The metric reflects the smallest number of insertions, deletions and substitutions that would be needed to change one string into another.  I am wondering if a variation on this theme could be used to measure the magnitude of a phonemic change.

For example, say you have a proto-language L0 whose inventory includes /p/ and /f/, where a sound change occurs that merges the two by means of p > f in child language L1.  Then you have a second sound change in L1 that turns f > h in grandchild language L11.  Meanwhile, in parallel, p > b has occurred in child language L2, and b > v / V_V in grandchild language L21.

Suppose you have the following samples from L11 and L21:

L11   huku, apple < L0 *puku
L11   toho, sand < L0 *tofo
L11   aka, turtle < L0 *aka

L21   buku, apple < L0 *puku
L21   tovo, sand < L0 *tofo
L21   ama, turtle < L0 *ama, crocodile

What I'm looking for is something that will say that the two words for "apple" are close, and the two words for "sand" are close, but the two words for "turtle" are not.  In terms of Levenshtein distance, each pair is equally distant (one substitution), but you would have to jump through more hoops to propose a proto-form that would account for aka and ama as cognates.
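To make the point about Levenshtein distance concrete, here is a minimal sketch of the standard dynamic-programming version in Python (the function name is just for this example).  It treats each letter as one symbol, which happens to work for these toy words.

```python
def levenshtein(a, b):
    """Smallest number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


for left, right in [("huku", "buku"), ("toho", "tovo"), ("aka", "ama")]:
    print(left, right, levenshtein(left, right))  # each pair scores 1
```

All three pairs score 1, so the plain metric cannot tell the cognates from the non-cognates.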

If you had a network of plausible sound changes, you could chart the shortest distance from one sound to the other, like this:

h < f < p > b = 4
h < f > v = 3
m < mb < mp < mk > nk > k = 6

If you had a large enough vocabulary sample, you could score each proposed path of change in terms of how often it could explain the observed evidence, then eliminate those that were excessively implausible or inconsistent with other changes.
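Finding the shortest route through such a network is an ordinary breadth-first search.  Here is a rough Python sketch; the edge list is only the handful of changes mentioned in this post (not a real catalogue of sound changes), the names are invented for the example, and the score is the number of sounds on the path, to match the figures above.

```python
from collections import deque

# Illustrative only: the changes used in the examples in this post.
# Each edge can be walked in either direction, since a path may run
# up to a common ancestor sound and back down again.
CHANGES = [("p", "f"), ("f", "h"), ("p", "b"), ("b", "v"), ("f", "v"),
           ("mk", "mp"), ("mp", "mb"), ("mb", "m"), ("mk", "nk"), ("nk", "k")]


def build_graph(changes):
    """Turn a list of (sound, sound) changes into an adjacency map."""
    graph = {}
    for a, b in changes:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of sounds from start to
    goal, or None if the two sounds are not connected."""
    if start == goal:
        return [start]
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        for nxt in sorted(graph.get(path[-1], ())):
            if nxt in seen:
                continue
            if nxt == goal:
                return path + [nxt]
            seen.add(nxt)
            queue.append(path + [nxt])
    return None


graph = build_graph(CHANGES)
for start, goal in [("h", "b"), ("h", "v"), ("m", "k")]:
    path = shortest_path(graph, start, goal)
    print(start, goal, len(path))  # 4, 3 and 6 sounds respectively
```

A real version would presumably want weights on the edges, so that common changes cost less than rare ones.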

What kind of magmas (magmae?) are cryptographic systems?

Suppose you've got a cryptographic system in which keys, plain texts and encrypted messages all exist in the same set (e.g. the set of arrays of bits).

In a situation like that, you could say that the encryption operation E, which takes a plain text block p and a key block k, forms a magma together with the set of messages and keys.  What kind of magma is it?

First, there is ideally not going to be a left-identity element I such that E(I, k) = k, because under certain circumstances you could trick an automated system into revealing the key by feeding it the identity element.  You probably don't want a right-identity element either, because you wouldn't want to accidentally use it for the key and leave your plain text unencrypted.

Ideally, you would want inverse elements to exist, because you would want the encrypted message to be dependent on every bit of the plain text and every bit of the key, and you would want the encryption function to be invertible.  However, if the message space is infinite (i.e. we're talking about all possible messages and keys of all possible lengths) then there is no guarantee that inverse elements would exist.

If I have that right, then this type of magma is a quasigroup if the inverse elements exist.
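For a small finite set you can simply check these properties by brute force.  The Python sketch below does that for 2-bit blocks.  It assumes that "inverse elements exist" can be read as "every row and every column of the operation table is a permutation", which is the usual condition for a finite quasigroup.  Both operations are toys made up for the example and are in no way secure ciphers.

```python
def is_quasigroup(elements, op):
    """True if every row and every column of the operation table
    contains each element exactly once (a Latin square)."""
    universe = set(elements)
    rows_ok = all({op(p, k) for k in elements} == universe for p in elements)
    cols_ok = all({op(p, k) for p in elements} == universe for k in elements)
    return rows_ok and cols_ok


def left_identities(elements, op):
    """Elements I with op(I, k) == k for every k, as defined above."""
    return [i for i in elements if all(op(i, k) == k for k in elements)]


def right_identities(elements, op):
    """Elements I with op(p, I) == p for every p."""
    return [i for i in elements if all(op(p, i) == p for p in elements)]


BLOCKS = range(4)  # all 2-bit blocks, used for both messages and keys


def xor_cipher(p, k):
    return p ^ k


def toy_cipher(p, k):
    # Hypothetical operation chosen only because it has no identity.
    return (3 * p + 3 * k + 1) % 4


for name, op in [("xor", xor_cipher), ("toy", toy_cipher)]:
    print(name, is_quasigroup(BLOCKS, op),
          left_identities(BLOCKS, op), right_identities(BLOCKS, op))
```

XOR is a quasigroup, but it has 0 as both a left and a right identity, which is exactly the weakness described above.  The toy operation is a quasigroup with no identity on either side.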

Thursday, June 20, 2013

Hacking a Trojan Horse

I get all kinds of Trojan Horses by email, and sometimes I wonder what they do.  This morning I had a few extra minutes, so I decided to start taking a look at one.

First, I installed binutils and configured it with the --enable-targets=all option, because these Trojan Horses are inevitably in PE/COFF format, and I don't do this kind of thing on Windows.

Then I disassembled my most recent Trojan Horse.  The executable section boils down to a mere 1781 lines of x86 assembly, so not really very large at all, with a bunch of small routines.  I haven't had a chance to look too closely at it yet, but it looks like it has some obfuscated chunk of something (executable) as an embedded resource.  I'll have to see if I can sort that out.

Wednesday, June 19, 2013

Magma

Magma is a big deal in functional programming.

I know this, because I have attended two study groups on functional programming, and each time someone has pulled up the Wikipedia page on Magma.

The day after the last such meeting, I spent the morning wondering whether I had made a huge mistake trying to learn a programming paradigm that required a Ph.D. in an obscure branch of mathematics in order to get started.  But then it dawned on me (I think) what it's all about:
Groups are the higher order types that higher order functions operate on.
So, a magma is a set of things (M), together with a function (.) that takes two of those things as parameters and returns a third one.  We could start with the basic magma and build up to groups, but I would rather go backwards from abelian groups.

Abelian Groups

  • The function is commutative,  so a.b = b.a
  • There is an identity element, so a.I = a
  • The set contains inverse elements, so for any a, there is a b such that a.b = I
  • The function is associative, so a.(b.c) = (a.b).c
An example of an abelian group would be the set of integers and the addition operation.  So,

  • The function is commutative,  so a + b = b + a
  • There is an identity element 0, so a + 0 = a
  • The set contains inverse elements, so for any a, there is a b (in this case, -a) such that a + b = 0
  • The function is associative, so a + (b + c) = (a + b) + c
These are handy for higher order functions like fold.  If you wanted to fold the addition operation across a tree of integers, you just need to start with the identity (0) and add in all of the integers in any order.

But the higher order fold function cannot be used to apply any old binary operation to any old tree of values from the set that the operation applies to.  If there is no identity element in the set, then the fold function needs an additional parameter to start with.  If the function is not associative, then you will need more memory to fold over a tree, because you have to fully process each branch.
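Here is a small Python sketch of what I mean (Python rather than a functional language, just to keep it readable).  The tree is a nested list and the helper names are invented for the example.

```python
from functools import reduce
import operator


def leaves(tree):
    """Yield the values in a nested-list tree, left to right."""
    for node in tree:
        if isinstance(node, list):
            yield from leaves(node)
        else:
            yield node


def fold(op, tree, start):
    """Fold a binary operation across every value in the tree."""
    return reduce(op, leaves(tree), start)


tree = [1, [2, 3], [[4], 5]]

# Addition on the integers: the identity 0 is the natural start,
# and the order of the values makes no difference.
print(fold(operator.add, tree, 0))                           # 15
print(reduce(operator.add, reversed(list(leaves(tree))), 0))  # 15

# max has no identity on the integers, so the caller has to supply
# a starting value (here, the first value in the tree).
values = list(leaves(tree))
print(reduce(max, values[1:], values[0]))                    # 5

# Subtraction is neither associative nor commutative, so the result
# depends on the order in which the values are visited.
print(fold(operator.sub, tree, 0))                           # -15
print(reduce(operator.sub, reversed(values), 0))             # -15
print(reduce(operator.sub, values[1:], values[0]))           # -13
```

The flattening step is where associativity is quietly being used: it throws away the shape of the tree, which is only safe when the grouping of the operations does not matter.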

And on it goes.

Tuesday, June 4, 2013

It's a virus! It's a Trojan Horse! It's...the EPL IDE!

I tried to download the EPL 5.1.1 IDE, and it set off alarms with Chrome and Norton.  Nothing specific, just "Insight Network Threat.  There are many indications that this file is untrustworthy and therefore not safe".

A friend of mine was talking about "food fear" in China.  You can't trust the system to produce safe food, but then what choice do you have?  I think the same thing applies to software.  Who will trust software made in China?

Luckily, I don't need to play with EPL.

Sunday, June 2, 2013

易语言, a Chinese-lexified programming language

I just found out about Yìyǔyán (易语言, a.k.a. EPL), a Chinese-lexified programming language, which claims to be "The best development environment in China, and the best Chinese programming language".

I've often been curious what it would be like to work with a programming language that was not lexified from English.  I'm tempted to try it out, but first I'd like to take a look at "Teach yourself EPL in Ten Days".

Day 1 looks like an introduction to the IDE.  Here is a simple snippet of code pulled from Day 2:

变量1 = “我爱”
变量2 = “易语言”
变量3 = 变量1 + 变量2
编辑框1.内容 = 到文本 (变量3)

The syntax is generally familiar.  We have assignment (A = B), quoted strings, string concatenation with the plus sign (A + B), dot-notation for accessing object members (A.B) and parentheses to enclose function call parameters (A(B)). We can translate this snippet as follows:

variable1 = "I love" 
variable2 = "EPL" 
variable3 = variable1 + variable2 
textbox1.content = tostring(variable3)

There are several things that are incredibly interesting about this.  First, the act of translating the code from EPL to...whatever Englishy thing I translated it into...isn't a mechanical act.  There is real interpretation going on here, just as there is in translating natural language.

Translating 变量 as "variable" is no big deal, and translating the content of the strings ("I love" + "EPL") is also no big deal.  But rendering 编辑框 as "textbox" is a matter of interpretation.  Is the EPL 编辑框 the same thing as a textbox in the languages I am familiar with?  The "Textbox" I know and love in C# has a Text attribute, but the 编辑框 has a 内容 attribute.

There is something else here, too, that is really interesting.  In line 3, variable3 appears to be a concatenation of variable1 and variable2.  Normally you would expect that the result of the concatenation would be a string (文本), but then why would it need to be cast into a string (到文本) before being assigned to the content of the textbox?

One possible reason is that, when you concatenate strings in EPL, you create lists.  We'll have to keep an eye out.

This is cool!