
Tuesday, April 18, 2023

Toyfl, finished at last

When I started this blog ten (!) years ago, I was creating a programming language called Toyl (Toy Language).

In the intervening years I have thrown away and rebuilt my language several times, and I now consider it essentially finished. It is now called Toyfl (Toy Functional Language).

Syntactically it is somewhat inspired by JavaScript, with lambda expressions like the following:

(a, b, c) => a * b + c

And anonymous objects like this:

{a = 10; b = 32; c = 11; f = (x) => a}

And a dot operator for accessing members:

x = {a = 10; b = 32};

y = x.a

And lots of other familiar things.

There are only a few features of the language worth remarking on, but since the purpose of this blog was originally connected to developing the language, I'll describe them here:

1) Extreme Laziness. If you have an expression like if(a == b, a + 37, b + 22), Toyfl will only evaluate the condition a == b. It will then return one of the two expressions in the then or else case without actually evaluating them. Nothing gets evaluated until it is really, really needed.
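
To illustrate the idea (this is only a sketch in Python, not Toyfl's actual machinery), branches can be represented as thunks that are forced only on demand:

class Thunk:
    # A deferred expression: evaluated at most once, and only when forced.
    def __init__(self, expr):
        self.expr = expr              # a zero-argument callable
        self.forced = False
        self.value = None

    def force(self):
        if not self.forced:
            self.value = self.expr()
            self.forced = True
        return self.value

def lazy_if(cond, then_branch, else_branch):
    # Evaluate only the condition; hand back the chosen branch unevaluated.
    return then_branch if cond.force() else else_branch

a, b = 5, 5
result = lazy_if(Thunk(lambda: a == b),
                 Thunk(lambda: a + 37),
                 Thunk(lambda: b + 22))
# Neither branch has been evaluated yet; force() only when really needed:
print(result.force())                 # 42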

That's connected to a second weird feature of Toyfl:

2) Currying by Refactoring. Since Toyfl doesn't evaluate an expression until it is really, really needed, an expression returned from within a function needs to be context-independent. That is, it needs to be able to remember what relevant parameters were passed into the function when it was invoked. I handle that by creating a refactored copy of the body of the function. So, if you have a function like this:

f = (x, y) => x * y

and you invoke it like this:

g = f(a, b + c)

then what you get for g is actually a structure representing the expression a * (b + c).
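
Here is a minimal sketch of that refactoring step in Python (my own illustration, representing expressions as nested tuples):

def substitute(expr, env):
    # Replace parameter names in an expression tree with the argument
    # expressions that were passed in at the call site.
    if isinstance(expr, tuple):
        op, *args = expr
        return (op, *(substitute(arg, env) for arg in args))
    return env.get(expr, expr)        # a name: swap in its argument, if any

body = ("mul", "x", "y")              # f = (x, y) => x * y
g = substitute(body, {"x": "a", "y": ("add", "b", "c")})
print(g)   # ('mul', 'a', ('add', 'b', 'c')), i.e. a * (b + c)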

Originally it rubbed me the wrong way to do this, because somewhere deep inside I have this feeling that code should remain static, but the effort required to carry out refactoring is actually pretty slight, and in a functional context it is very reliable.

However, this behavior means that recursive functions actually build large in-memory expressions. Memory is cheap these days, so I don't worry too much about that for ordinary use-cases, but I thought it would be interesting to solve the problem anyway. So I created:

3) Special syntax for tail recursion. I noticed that the simplest functional implementations of certain solutions required significantly more calculation than the simplest imperative implementation. I wanted a way to bring the efficiency of imperative solutions to my functional language, so I created the following syntax to bridge this gap.

fibonacci = (n) => (
    for {
        a = 1;
        b = 1;
        i = 1
    }
    let {
        nextA = b;
        b = a + b;
        a = nextA;
        i = i + 1
    }
    until i == n
).a

This looks like a loop in an imperative language, but effectively it turns into something like this:

fibonacci = (n) => innerFibonacci(1, 1, n)
    where innerFibonacci = (a, b, i) => if (i == n, a, innerFibonacci(b, a + b, i + 1))

Normally Toyfl would build a huge structure out of this, but the special syntax tells Toyfl to fully evaluate each of the parameters a, b, i before the recursive invocation, so we can take advantage of tail recursion.
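
In an imperative setting the same strict-accumulator idea is just a loop; here is a rough Python equivalent for comparison:

def fibonacci(n):
    a, b, i = 1, 1, 1                 # for { ... }
    while i != n:                     # until i == n
        a, b, i = b, a + b, i + 1     # let { ... }, all bindings at once
    return a                          # ).a

print(fibonacci(10))                  # 55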

So that's Toyfl. It's been fun.

Monday, March 23, 2015

A functional database

I mentioned in an earlier post that I was again working on a functional programming language. Probably the reason it is going so well is that the language itself is not my main goal. What I really want to do is build a functional database.

On the face of it, this seems like a ridiculous idea, because in order to be purely functional you would need to pass the whole database in as a parameter to any query, and return the updated database whenever you make a change to it.

But put aside the ridiculousness of it for a moment and imagine the following:

1. Your database is represented as a linked list of data-changing transactions, so changing the database is just a matter of consing the new transaction onto the transaction chain.

2. Every transaction in the database carries a timestamp and a checksum, where the checksum is built from the data in the transaction and the checksum of the last transaction.

3. The data storage is trusted and the checksum has a vanishingly low rate of collision, so the checksum at time T can act as a reliable proxy for the whole database at time T.
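
Put concretely, here is a minimal sketch of such a hash-chained transaction log in Python (the class and function names are my own illustration, not a real implementation):

import hashlib
import time

class Transaction:
    # One immutable link in the chain of data-changing transactions.
    def __init__(self, data, prev=None):
        self.data = data
        self.timestamp = time.time()
        self.prev = prev
        prev_checksum = prev.checksum if prev else ""
        # The checksum covers this transaction plus the previous checksum,
        # so it stands in for the entire history up to this point.
        self.checksum = hashlib.sha256(
            (prev_checksum + repr(self.timestamp) + data).encode()
        ).hexdigest()

def commit(db, data):
    # "Changing" the database is just consing a new transaction on.
    return Transaction(data, db)

db = commit(None, "INSERT user alice")
fork = commit(db, "INSERT user bob")     # an alternate universe
main = commit(db, "INSERT user carol")   # the main chain continues
print(fork.checksum != main.checksum)    # True: the forks are distinguishable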

What good would this thing be? I think it could be very powerful for certain applications. Consider the following:

  • All changes to the database would be audited. In an important sense, the database itself is the audit trail.
  • It would be trivial to add a date/time parameter to any query and get back the result of the query as it would have been at any time in the past.
  • The checksum could be used to verify whether the results of a query did indeed come from a given database at a given time.
  • Brute-force backdoor changes to the transaction chain would show up in the checksums.
  • You could allow "alternate universes" to be forked off of the main database at any time, to experiment with different scenarios. These would exist in parallel time with the main database, but would be distinguished by their own checksums.
This database would be useful in environments where data needs to be reliable and auditable.

Wednesday, March 18, 2015

Nearly full circle: A functional programming language

My mind is like an attic, stuffed with everything I can't bear to get rid of, and occasionally I bring something out and work on it (only to put it back later).

The earliest posts that I wrote on this blog related to my plan to create a functional programming language. Over the last few weeks, I have actually done that. It's nothing pretty, but it really does work, with lambda expressions and everything.

I worked backwards, starting with how the language would work on the inside, without worrying about what it would look like on the outside. The final language isn't finished yet, but I created a kind of intermediate testing language to use for checking the architecture.

The testing language is not supposed to be pretty. My main consideration here was to create a language that could be parsed by a very simple parser (with no lookahead), so that meant leaving out a lot of syntactic sugar and salt.

Here is my first program in the testing language.

using System
using FBaseUnitTests
let plus
    method Program _test006plus [ Int32 , Int32 ]
let ifzero
    lazyMethod Program _test006ifzero [ Int32 , Int32 , Int32 ]
let func
    lambda [ Int32 ] [ x ]
        apply variable ifzero [
            variable x ,
            value Int32 1 ,
            apply variable plus [
                variable x ,
                apply variable func [
                    apply variable plus [
                        variable x ,
                        value Int32 -1
                        ]
                ]
            ]
        ]
let r
    apply variable func [ value Int32 16 ]

The most important thing here is that functions are first-class citizens. The variables plus and ifzero hold native methods (one of which has some lazy parameters), while the variable func holds a lambda expression.
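
For reference, assuming ifzero(x, a, b) returns a when x is zero and b otherwise (its two branch parameters being the lazy ones), the program above amounts to this Python:

def func(x):
    return 1 if x == 0 else x + func(x - 1)

print(func(16))   # r = 137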

Tuesday, July 22, 2014

Zipf's Law in the Rohonc Codex

I've added a column showing frequencies to the catalog of glyphs that accompanies my in-process transcription of the Rohonc codex.

Of course, the first thing one wants to do with glyph frequencies is to see if Rohoncian obeys Zipf's Law. At first blush, it would seem not, because we have the following distribution for the top ten glyphs:


Glyph   Frequency   Frequency * Rank
C           4952        4952
I           4902        9804
D           3816       11448
CO          2983       11932
N           2588       12940
O           2572       15432
H           1657       11599
IX          1538       12304
CX          1402       12618
CX1Q         899        8990

However, this distribution supports something I have suspected already: CO, C and CX are probably the same glyph. I separated them in my transcription because I decided it would be easier to merge glyphs later. But, I suspected that they might be the same because they are apparently interchangeable in the Holy Noun.

If CO, C and CX are merged, then the distribution appears as follows:

Glyph       Frequency   Frequency * Rank
C, CO, CX       9337        9337
I               4902        9804
D               3816       11448
N               2588       10352
O               2572       12860
H               1657        9942
IX              1538       10766
CX1Q             899        8990

It's still not perfect, but it is much closer to a classic Zipf distribution, under which frequency * rank should be roughly constant across ranks.
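
The check is easy to script. Here is a small sketch using the merged counts above:

freqs = {"C, CO, CX": 9337, "I": 4902, "D": 3816, "N": 2588,
         "O": 2572, "H": 1657, "IX": 1538, "CX1Q": 899}

# Under Zipf's Law, frequency * rank should hover near a single constant.
ranked = sorted(freqs.items(), key=lambda kv: -kv[1])
for rank, (glyph, freq) in enumerate(ranked, start=1):
    print(f"{glyph:10} {freq:6} {freq * rank:6}")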

Monday, July 21, 2014

Rohonc Transcription online (as-is)

My transcription is still only 90% complete. I estimate about 20 more hours of work would be required to finish the remaining 10%, but it is hard for me to find that kind of time.

So rather than let the perfect be the enemy of the good, I have put the transcription online as-is, with the hope that I can continue to refine and update it as I go along.

I have thrown together a website: http://quint.us/Roho. This site is ugly as sin, because I wrote it with an emphasis on function over form. There may well be bugs, but hopefully they are rare. The site provides four basic pieces of functionality:

Download: Download my current revision of the transcription.

Search: Search the transcription.

Browse: Browse the page images and transcription.

Glyphs: View the catalog of glyphs.

As I refine the transcription system and the transcription, I will also work on the site to make it less ugly and more functional. Every time I update the transcription, I will increment the minor revision number. Every time I update the transcription system, I will increment the major revision number.

My system of transcription represents glyphs as strings of capital letters and numbers. The primary goal of the transcription system was to uniquely identify apparently unique glyphs. The secondary goal was to make transcriptions for similar glyphs similar in form. (For example, glyphs whose transcriptions begin with C have shapes that begin with the same semicircular stroke).

When I am less tired, I'll write up a better description of the transcription system.

Happy hunting. I welcome any kind of feedback on the site. Please leave a comment or email me at rst140720@quint.us if you have a suggestion.

Thursday, April 10, 2014

Dividing work between man and machine

This project to write a text recognizer for the Rohonc codex has been really rewarding. The quality of the text is so poor that the solutions have to be really clever, and that is what makes it so fun. I've learned a massive amount about image manipulation and text recognition, and I have a whole slew of projects I want to undertake when I'm done with this one.

I've probably spent four hours training my glyph recognition algorithm, and I think it identifies glyphs correctly about 80-90% of the time. Right now, I can process a single line of text in under a second, but it takes me about 15-30 seconds to manually verify the transcription and fix any errors that crop up. That seems pretty fast, but when you multiply it out by 4285 lines, it comes to about 25 hours of manual work. I need to pare that down, because it'll take me forever to scrape together 25 hours of free time.

A lot of this project has involved dividing labor between me and the machine, making the most of what the machine can do without my intervention, and making the best use of my feedback on good and bad matches. The code has been very fluid but very stable, basically organized around building a powerful set of core functionality, but using the simplest and most ergonomic user interface for each task.

Friday, March 21, 2014

Starting the final phase of Rohonc transcription (I think)

I think I've finally settled on a workable process for machine-assisted transcription of the Rohonc Codex.

I tried several approaches before landing on the current one. One approach was to analyze each page in a top-down way, first identifying the areas that contained text, then splitting those into lines, and splitting the lines into glyphs. The other approach was bottom-up: First, identify glyphs, then identify lines.

No single automated process was able to correctly split the pages up 100% of the time, so I have adopted a top-down automated approach with manual overrides. I have now broken all of the pages down to the line level, and I have some code that picks out glyphs from a line with great accuracy.

Now comes the fun part: writing (and training) the glyph-recognition algorithm.

I've decided to use the "mark" as my fundamental unit of text. A mark is a single, contiguous, dark shape on the page within the bounds of a text line. Many Rohonc glyphs consist of a single mark, but many consist of a core mark with one or more satellite marks. Most satellite marks are single dots above or to the left of the core mark, but some are dashes, and some are haloes that surround the core mark.

My glyph-recognition algorithm will start out by finding the best match between the glyphs on a new line and any that have been previously identified. This will be followed by a manual intervention step where I can correct any incorrect automated matches, or reject a mark as being non-text. When the line has been completely treated, constellations of marks will be matched to known glyphs.

I have wrestled with several different approaches to matching marks. One approach is to simply overlay one mark upon another, determine the total number of dark points that differ, and calculate the ratio between that number and the full number of points.

Another approach that I am toying with is to use a two-dimensional version of Levenshtein distance. One way to do that would be to treat each row and column of the mark bitmaps as a string, calculate the individual Levenshtein distances, and sum them up to come up with a total distance.
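
To make the two candidate scores concrete, here is a rough Python sketch, assuming each mark is stored as a 0/1 bitmap and compared at equal sizes (the function names are mine):

def overlay_score(a, b):
    # Fraction of points that differ when one mark is laid over the other.
    total = diff = 0
    for row_a, row_b in zip(a, b):
        for p, q in zip(row_a, row_b):
            total += 1
            diff += (p != q)
    return diff / total

def levenshtein(s, t):
    # Classic edit distance between two sequences.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

def two_d_levenshtein(a, b):
    # Sum the row-wise and column-wise edit distances between two bitmaps.
    rows = sum(levenshtein(ra, rb) for ra, rb in zip(a, b))
    cols = sum(levenshtein(ca, cb) for ca, cb in zip(zip(*a), zip(*b)))
    return rows + cols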

But some calibration would be needed to make an apples-to-apples comparison between different match scores.

Sunday, March 9, 2014

Interesting features of Rohonc script and character recognition

I've been working on code that can scan the images of the Rohonc Codex and help me transcribe it. Hopefully I will be able to complete the transcription relatively quickly with the assistance of some code that can recognize and categorize graphemes (and remember what wacky name I decided to give each character).

In the process, I have unearthed a wealth of interesting detail and challenges.

Regarding the grapheme recognition process, the challenges are many. The script is hand-written, the lines are irregular, and the scanned pages are not necessarily orthogonal to the images. My approach is to identify individual marks, place them in a network together with other similar marks, and differentiate them based on the local density of the area of the network in which they appear. Then, I think I can recognize constellations of marks as graphemes, and start training the program to do the transcription.

It is clear to me at this point that this is going to be more of a computer-assisted transcription project than a pure computer transcription, but even so the work should go much more quickly with the aid of a machine whose eyes never tire.

One of the challenges I have had to overcome is distinguishing between stray dots on the page and the dots that are intended to be part of a grapheme. Unless I am mistaken, it appears that the dots that accompany a grapheme always appear above or to the left of the main shape of the grapheme. I suspect this is related to the right-to-left direction of the text.

In categorizing the graphemes, I am running into a problem I have wrestled with for years, ever since I first started thinking about ways to automate the recognition of patterns. I call it the "cloud-within-the-network" problem, and I need to find out what the proper answer to it is.

The "cloud-within-the-network" problem works like this: Suppose you have some dense networks, and you loosely connect them to each other in a larger network. How do you computationally recognize the existence of the dense networks within the larger loose network?

It seems like it should be relatively simple, but every solution I think up seems to have a problem with it. In the case of this transcription project, I have a workaround, but some day I would like to find the right solution.

Monday, March 3, 2014

Rohonc Transcriber (stage 1)

In a recent post, I said I would write a program to transcribe the Rohonc Codex.

Tonight I did the first part. I wrote some code to identify lines of text and graphemes. The image below shows a page of text, with the first-pass graphemes marked by green rectangles.


This is just a first pass. Some of these rectangles enclose multiple graphemes, and they will need to be split apart.

Next, I think I'll build a database of all of the grapheme images, then compare them to each other to identify image families.

Sunday, March 2, 2014

Transcription of the Rohonc Codex

It's time for a free and open transcription of the Rohonc Codex.

Unfortunately, I don't have the time to do it by hand, and we do not yet have a universal crowd-sourcing platform where I can set it up. So I'll write a program to scan the images and do an initial pass at automating the transcription.

When it is done, I will release it for free.

Edit: I did get a long way on the transcription, but life intervened so it is not yet complete. The latest revision of the transcription is here: http://quint.us/Roho/.

Tuesday, February 18, 2014

A preamble to a computer worm

I recently downloaded the source-code for 30 Chinese computer worms and Trojan horses. The code makes for interesting reading, but the comments are all in the GB2312 character set, so I have to convert to UTF-8 in order to read them.

When these things first appeared in the wild, they had a deliberate anonymity. Their original developers had given them names like Golden Pig and Chinese Vampire, and adorned their code with comments to describe and explain their effects. But before releasing them, the developers stripped them of all of their identifying and explanatory information, and sent them out into the world nameless and unexplained.

Those who discovered and analyzed them gave them new names. They disassembled their code, but they couldn't recreate the comments and non-semantic details that the original developers created.

It is interesting to look at the original source code for some of these things, for the subtle details you would not see in disassembled code. In this post, I will just give the preamble that appears at the top of one source file that is part of something the author called the Chinese Vampire, written in 2008. Reading this feels kind of like reading the mummy's curse.

Chinese Vampire Source Code
Author: God of the Black Net
After you buy the source code, please do not casually distribute it. Please treasure the product of the author's labor.
If you get lost in the code, the coding style and comments are not generally to blame. Those that I have already changed are very good, quite clear and easy to understand.

It does not use any C++, just simple C code, but edit it using VC++6.0. Once you edit it, you can use it. It has already passed hundreds of tests, so it is quite perfect, and there is no need to edit it very much.
If you can't get rid of it, contact the author and ask for a special killer.

This comment reveals a couple of interesting details about the Chinese Hacker world, at least as it was in 2008 (six years ago, now). First, the Chinese Vampire was for sale, a stock tool that could be purchased and customized. Second, there was an expectation that the author should be remunerated for his hard work.

A neat footnote on function difficulty

In my last post, I argued that the difficulty of the most difficult functions in F increases in proportion to N^2 (where F is the set of binary functions that operate on values in [0..N-1]).

I was trying to think of a way to test this conclusion for a specific case, when I remembered an earlier post where I argued that these functions can all be represented as polynomials, as long as N is prime. This gives us a perfect test-case, because polynomials can be described as programs in the context of the previous post, where the instruction set consists of addition and multiplication.

So let's take the case where N = 23. The most complex polynomial representing a function in F will have 23^2 terms. In reality, the most efficient way to evaluate the function would be to preprocess x^2..x^22 and y^2..y^22. This wasn't something I accounted for in my model, but let's run with it anyway. Preprocessing would require 22 operations for x and 22 for y.

There would be 23^2 = 529 terms in the polynomial, and for each term we would need to perform two multiplications, so 1058 multiplications. We would need 528 additions to add the terms together.

In total, that would mean 22 + 22 + 1058 + 528 = 1630 operations would be required for the most complex function in F.

How does that compare to our prediction? For N = 23 and A = 2, the lower limit for the difficulty of functions in F would be (23^2 ln 23 - ln (23+2)) / (ln 4 + ln 2 + ln (23+2)) = 312.45.

So, in this test case, it seems the conclusion is right, since 1630 > 312.45. The general case for any prime N can be easily shown, where the difficulty for this particular case is d' = 2(N-1) + 3N^2 - 1.
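
Both numbers are quick to check in a few lines of Python (a simple verification sketch using the formulas above):

from math import log

N, A = 23, 2
ops = 2 * (N - 1) + 2 * N**2 + (N**2 - 1)   # preprocessing + multiplications + additions
bound = (N**2 * log(N) - log(N + 2)) / (log(4) + log(A) + log(N + 2))
print(ops, round(bound, 2))                  # 1630 312.45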

The other thing that is interesting about this is that, for this particular instruction set, we can say something about the makeup of the function space in terms of difficulty. There are N^(N^2) functions in the set, of which (N-1)^(N^2) fall into the "most difficult" category. In other words, difficult functions vastly outnumber easy ones, with the ratio of "most difficult" functions to all others being 1 - ((N - 1)/N)^(N^2).

Monday, February 17, 2014

Difficulty of functions

This weekend I have been thinking about abstract ways to measure how difficult it is to evaluate functions given a hypothetical computer with a specific instruction set.

In particular, I want to see if I can find a lower limit of difficulty for the most difficult functions of a particular type. I thought I had it nailed last night, but as soon as I went to bed I realized I was wrong.

Imagine a computer with the following properties:

1. The computer's only data type is an integer in the set [0..N-1], called an int.

2. The computer's programs represent functions that take two ints and return an int. The set of all such functions is F.

3. The computer has an instruction set composed of A <= N+2 functions drawn from F. The computer can easily evaluate every function in its instruction set.

4. The computer's programs are finite function trees, so each program has a root function drawn from the instruction set, which takes two parameters. The two parameters of the root function may be function trees, constant ints, or the special symbols x and y which evaluate to the program's two input parameters. There is no recursion.

5. The difficulty of a program is the average number of operations required to evaluate the program for any possible input values x and y.

For simplicity, we initially assume that there is no lazy evaluation.

Given a computer like this, can we put a lower limit on the difficulty of the most difficult function in F?

If there is no lazy evaluation, then the difficulty of a function is proportional to the number of instructions. Given that there are N^(N^2) functions in F, if you could write maximally efficient programs to represent each of those functions, how big would the largest such efficient program need to be?

There are p0 = N+2 distinct programs that could be written with zero instructions. These would be the programs that returned a constant value, or else x or y with no modification.

The number of distinct programs that can be written with one instruction is p1 = A*(p0 * p0) = A(N+2)^2. That is, there are A possible instructions, which must take a program with zero instructions for each parameter.

The number of distinct programs that can be written with two instructions is p2 = A*(p0 * p1 + p1 * p0) = 2A^2(N+2)^3. That is, the first parameter may be a zero-instruction tree, in which case the second parameter must be a 1-instruction tree, or vice-versa.

The number of distinct programs that can be written with three instructions is p3 = A*(p0 * p2 + p1 * p1 + p2 * p0) = 5A^3(N+2)^4. The progression we are seeing here is that pn = sn A^n (N+2)^(n+1), where sn is the number of different trees that can be formed with n binary functions (these are the Catalan numbers).

There will be a lot of overlap in the program space, meaning there will be multiple programs that can evaluate to the same function. This means we can't say that n instructions can always represent pn functions, but we can say that they will represent no more than pn functions. Thus, for pn = |F| = N^(N^2), we can be certain that the most complex program representing a function in F can be no smaller than n instructions.

So the lower limit on the difficulty of the most difficult function in F is calculated as follows:

pd > N^(N^2).
sd A^d (N+2)^(d+1) > N^(N^2).

For large values of d, I think 3^d < sd < 4^d (sd is in fact the d-th Catalan number, which grows like 4^d / d^(3/2), so both bounds do hold for large d). If so, we can fudge a bit and say

4^d A^d (N+2)^(d+1) > N^(N^2).
d ln 4 + d ln A + d ln (N+2) + ln (N+2) > N^2 ln N.
d ln 4 + d ln A + d ln (N+2) > N^2 ln N - ln (N+2).
d > (N^2 ln N - ln (N+2)) / (ln 4 + ln A + ln (N+2)).
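
As a numeric sanity check (my own sketch, not part of the original argument), the smallest d where the cumulative program count first reaches |F| can be computed from the exact recurrence:

def min_program_size(N, A):
    # Smallest n such that programs of at most n instructions could,
    # even with zero overlap, cover all N**(N*N) functions in F.
    target = N ** (N * N)
    p = [N + 2]                  # p0: N constants plus the symbols x and y
    covered = p[0]
    n = 0
    while covered < target:
        n += 1
        # pn = A * sum of p_i * p_j over i + j = n - 1
        p.append(A * sum(p[i] * p[n - 1 - i] for i in range(n)))
        covered += p[n]
    return n

print(min_program_size(23, 2))   # lands a little above the ~312 analytic bound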

Of course, there is a lot more to gnaw on here, like operation laziness and pre-evaluation, which might be real game-changers. But as we have it now, the difficulty of the most difficult functions in F increases in proportion to N^2.

Tuesday, February 11, 2014

A brief history of Chinese hacking: Part III

(The following draws extensively from an online text titled "The Record of X on the Rise of the Chinese Hacker", supplemented from other sources.)

In the last two posts, I have mentioned two galvanizing events for the Red Hacker movement: Violence against ethnic Chinese in Indonesia; and the NATO bombing of the Chinese embassy in Belgrade.

Two months after the bombing of the embassy in Belgrade, the government of Taiwan announced a 'Two States' policy, which undermined the long-held idea that China and Taiwan were a single country suffering a temporary disunion. Seasoned by the 1998 action against Indonesia and the May 1999 action against the United States, the Red Hacker apparatus was ready to turn and defend the honor of the motherland on the battlefield of Taiwan's networks.

They attacked the website of the Executive Yuan of Taiwan, as well as many other websites, deploying newly developed tools like Glacier (冰河, a trojan horse) for the first time, and NetSpy (a tool for uploading and downloading files from a server, apparently).

In 2000, the number of internet cafes mushroomed, and the hacker spectrum broadened. The old Black Hackers were still around, but the ready availability of technology led to a large number of careless, headstrong and unskilled teenagers pursuing the black hacker path. These "script kiddies" were nicknamed the Little Blacks (小黑黑) by an influential female hacker of the time named Wollf.

Alongside the Black and Red hackers, there also arose Blue Hackers (蓝客, lán kè), who were relatively unconcerned with cheap tricks and politics, and intensely passionate about computer security.

In 2001, after the South China Sea collision incident, a small American hacker group called PoizonBOx defaced at least a hundred Chinese websites, and reportedly 80,000 Chinese hackers returned fire beginning on May 4. Most of these were unskilled script kiddies, so the damage done did not reflect their large numbers, and some considered the action to be a farce. As far as I can tell, 100-600 websites were vandalized, and the White House website suffered a DoS attack that blocked access from May 4 to May 8.

In the years between 2000 and 2002, Chinese hackers created and released the Code Red, Code Blue and Nimda computer worms. But many also undertook a serious discussion of the ethical dimensions of hacking, and of hacking culture. They began to discover and publish their own findings on network and software vulnerabilities, which have since been picked up by international security research organizations.

Sunday, February 9, 2014

A brief history of Chinese hacking: Part II

(The following draws extensively from an online text titled "The Record of X on the Rise of the Chinese Hacker", supplemented from other sources.)

I ended the last post with the emergence of the Chinese hacktivist alliance in response to violence against ethnic Chinese in Indonesia in 1998. This era also saw the emergence of the Green Corps and the Chinese Green League. (I'm not sure what the significance of the color "green" is in these names, but I wonder if it doesn't relate to the color of CRT screens).

Webpages discussing the technical details of hacking began to proliferate, and Chinese hackers eagerly undertook to study the relevant technologies. The most famous hacker of this period may have been Xiǎo Róng (小榕), creator of tools like Stream of Light (流光, a vulnerability scanner), Tracing Snow (溯雪, a password cracker) and Chaos Knife (乱刀).

1999 saw a dramatic increase in the number of internet users in China, and it also saw the NATO bombing of the Chinese embassy in Belgrade, which many Chinese saw as a deliberate act of retribution on the part of the United States for China's criticism of NATO action in Yugoslavia.

The second day after the bombing of the Chinese embassy in Belgrade, the first Red Hacker website was born, initially called the Chinese Hacker's Rallying Point for the Motherland (中国红客之祖国团结阵线), and later renamed the Chinese Hacker's United Front for the Motherland (中国红客之祖国统一战线).

This site drew intense interest from Chinese citizens around the world, and the Red Hackers carried out widespread attacks on American websites and email servers.

Hacking tools created in this period included NetSpy (inspired by Cult of the Dead Cow's Back Orifice), Glacier (冰河, a trojan horse), Black Hole (黑洞), Network Thief (网络神偷), Gray Dove (灰鸽子), XSan and YAI.

Glacier, Black Hole and Network Thief are still considered by many to be essential tools for the Chinese hacker. "Official" development of Glacier has ceased, but users have forked off many versions of their own.


A brief history of Chinese hacking: Part I

(The following draws extensively from an online text titled "The Record of X on the Rise of the Chinese Hacker", supplemented from other sources.)

China's earliest online community arose in the mid-1990s, with a small number of people using PCs and dial-ups to interact with each other on bulletin-board systems. Between 1994 and 1996, BBS servers proliferated in major Chinese cities, and interest in copying software and breaking license controls on software also grew, creating the first generation of Chinese hackers.

Internet access came to China in 1996, and the BBS culture moved from dial-ups and isolated servers to the internet. It is interesting to me that the BBS format is incredibly prevalent on Chinese websites today, while it has been basically replaced by social networks in America. It was during this period that a man named Gao Chunhui created the first personal website in China, and it is said that his personal site at that time was dedicated to the topic of breaking software registration controls.

This era also saw a brief period of phreaking (电话飞客), but advances in telecom technology rapidly put an end to that.

In 1998, a Taiwanese student named Chen Ing-Hau released the Chernobyl virus, which caused billions in economic damage in mainland China. Because the author was a Taiwanese student, some Chinese users perceived the damage done by the Chernobyl virus as a politically motivated attack.

Also in 1998, amid the deepening Asian Financial Crisis, there was widespread violence against ethnic Chinese in Indonesia. Chinese internet users formed teams that flooded Indonesian government email accounts, and they tried to bring down Indonesian websites with ping-based DoS attacks. In order to coordinate these attacks, a group was formed called the Chinese Hacker Emergency Meeting Center (中国黑客紧急会议中心). This might be considered the first Chinese hacktivist alliance.

So, from the very beginning, Chinese hacking has been closely tied to nationalist sentiments.


Friday, February 7, 2014

The Chinese Hacker's Code

In 1984, Steven Levy suggested that there was a commonly understood but unwritten [American/European] "hacker code of ethics", that encompassed the values of sharing, openness, decentralization, free access to computers, and world improvement.

On many Chinese hacker sites I have found a written code of conduct, which is attributed to an influential Taiwanese hacker named CoolFire, who has his roots in the computer culture of the late 1990s. I will present that code of conduct below, but first I want to write out something about the connotations of the word "hacker" in Chinese.

The most common word for "hacker" in Chinese is 黑客, hēikè, derived phonetically from the English word "hacker". These two characters literally mean "black guest", which I think is a great way to describe a hacker's presence on your system. Unlike the English word "hacker", however, the Chinese hēikè seems to have a less negative, perhaps more ambiguous connotation.

The less common word is 骇客, hàikè, also derived phonetically from English "hacker", but with a literal meaning of "terrifying guest". This seems to be a more negative term, maybe more like cyber-criminal.

There is a strong association between hacking (hēikè) and patriotism in China, dating back to the earliest organizations and activities of hackers in the 1990s. This has given rise to another term, 红客, hóngkè, meaning "red guest". This is sometimes translated as "honker", but I'll render it as Red Hacker for now. (Not only does "honker" also mean someone from Hong Kong, but it sounds pejorative to me.)

Without further ado, here is a composite of the Chinese Hacker Code, drawn from several similar versions.

1. Do not sabotage any system. It will only bring you trouble.

2. Do not modify any system files. If you must do so to access a system, please restore them to their original state after you are done.

3. Do not casually hack a website and then tell friends whom you do not trust.

4. Do not talk about what you have hacked in a BBS or forum.

5. Do not use your real name when you post an article.

6. Do not leave your computer while you are actively engaged in invasion.

7. Do not invade or attack telecom/government organization servers.

8. Do not talk about what you have hacked over the phone.

9. Keep your notes in a safe place.

10. Read everything related to system security or vulnerabilities (learn English quickly!)

11. Do not delete or alter accounts on the systems you invade.

12. Do not modify system files, unless it is necessary to conceal your intrusion. In any case, maintain the security of the system; do not invade and then disable its original security.

13. Do not share the accounts you have cracked with your friends.

14. Do not invade or destroy government organization servers.

15. If you can't program you can't be a good hacker.

16. Hackers are not "pirates".


Thursday, February 6, 2014

Motherlode

Last month, I idled away some hours trying to get some idea of the Chinese hacker culture. I assumed that, like the hackers I knew when I was younger, Chinese hackers would have some kind of specialized vocabulary, like a Chinese version of 1337. I figured if I could find a few terms in that specialized vocabulary, I might be able to do some narrow internet searches that would give me a general outline of the Chinese hacker world.

It didn't really work. I did get a really interesting look into how the Chinese government is handling cybersecurity, but I never found anything that really looked like a hacker site.

Today, by chance, I hit the motherlode. I found the type of site I was looking for, and the wealth of information available is a bit overwhelming. My experience with Chinese government sites has been that, after I access them a few times, they may drop off the net, especially if they are very interesting. I fear I won't have enough time to learn what I want to learn before this site goes away too.

If my luck holds, I will soon have much new and interesting information to blog about.

Monday, January 6, 2014

New Chinese information security terminology

The subject of information security is always interesting to me because it involves emergent behavior in complex systems and requires experimental research. In fact, I recently downloaded and have been playing with some vulnerability analysis tools. (I'm only working on my own network, no intention to engage in malicious behavior, etc).

I'm also interested in Chinese software and technological innovations. This afternoon I decided to put these two things together and see what I could find about information security in Chinese. This brought me to a Chinese website describing how hackers operate, including screenshots of some exploitation tools that appear to be Chinese innovations.

One of these is called The Struts2 Ultimate Loophole Exploitation Utility. It takes advantage of weaknesses in the Apache Struts2 framework to execute code on the server. The title of the window in the screenshot includes not only the utility name, but also the names of two of the developers and a phone number.

The names of the developers were unique enough that I was able to find their Weibo accounts, as well as their accounts on a Chinese social site for those interested in IT security. The site lists an ungodly number of software vulnerabilities--19 added today alone. It seems this "white hat" site rewards users for reporting vulnerabilities, which are then passed on to manufacturers. Clever!

Reading through these, I've started to update my "Chinese Programming Terminology" page with new vocabulary related to information security. I've also found a Chinese translation of the manual for the Metasploit pen-testing utility--another goldmine for this type of stuff.

Monday, December 2, 2013

Quickly: How to turn a number into a function

In these recent posts, I have been talking about hunting for functions with certain properties. I decided to try an exhaustive search of a particular function space to find the functions I was looking for, by identifying a 1:1 relationship between numbers and functions, then iterating through all of the numbers and testing the corresponding function.

In brief, I treated the number as an efficient encoding of the truth table for the function. By "efficient", I mean I used an encoding system in which every number in the range represented a valid and distinct truth table, and all possible functions were represented in the range. Since I'm talking about functions that operate on discrete, finite sets of values, the number of truth tables is finite.

First, I used a system I came up with a while ago to represent permutations as numbers. I originally devised this as a way to encode every possible shuffled state of a deck of cards, but it really works for any permutation. Encode the permutation fa as follows:

Rp(fa) = N! (fa:0(0)/N! + fa:1(1)/(N-1)! + fa:2(2)/(N-2)! + ... + fa:N-1(N-1)/1!)

Where fa:n returns the zero-based index of fa(n) within the ordered set of values returned by fa(x) where x >= n.

So, for example, the permutation fa(x) = x + 2 (modulo 4) can be represented as the number 10.  Here's how the encoding works:

fa:0 returns the zero-based index of fa(0) in the set [0, 1, 2, 3].  That is, it returns 2, since fa(0) = 2, and 2 is found at position 2 in the set.

fa:1 returns the zero-based index of fa(1) in the set [0, 1, 3].  That is, it returns 2, since fa(1) = 3, and 3 is found at position 2 in the set.

fa:2 returns the zero-based index of fa(2) in the set [0, 1].  That is, it returns 0, since fa(2) = 0, and 0 is found at position 0 in the set.

fa:3 returns the zero-based index of fa(3) in the set [1].  That is, it returns 0, since fa(3) = 1, and 1 is found at position 0 in the set.

So:

Rp(λx → x + 2) = 4! (2/4! + 2/3! + 0/2! + 0/1!) = 10

In my last post, I treated the function f(a, b) as selecting permutation fa from an indexed set, and applying it to value b. The N permutations required for distinct values of a can all be encoded together in a single number, as follows:

Rf(f) = Rp(f0)(N!)^0 + Rp(f1)(N!)^1 + ... + Rp(fN-1)(N!)^(N-1)

So the function f(x, y) = x + y (modulo 4) can be encoded as follows, given that Rp(f0) = 0, Rp(f1) = 17, Rp(f2) = 10, Rp(f3) = 3. (For f1, the permutation 1, 2, 3, 0, the indices are 1, 1, 1, 0, so Rp(f1) = 4! (1/4! + 1/3! + 1/2! + 0/1!) = 1 + 4 + 12 = 17.)

Rf(λx, y → x + y) = 0(4!)^0 + 17(4!)^1 + 10(4!)^2 + 3(4!)^3 = 0 + 408 + 5760 + 41472 = 47640

For my function-space exploration, I simply reversed this process for each number to generate the truth table, and I used the truth table to test for the properties I was looking for.  However, for the fun of it, here is how you would apply the function to a set of values.  Given the function above, you would apply it to the values 2 and 3 as follows:

Rp(f2) = floor(Rf(f) / (4!)^2) % 4! = floor(47640 / 576) % 24 = 10

f2:0(0) = floor(4! Rp(f2) / 4!) % 4 = 2, therefore f2(0) = 2 (i.e. zero-indexed value 2 from [0, 1, 2, 3])
f2:1(1) = floor(3! Rp(f2) / 4!) % 3 = 2, therefore f2(1) = 3 (i.e. zero-indexed value 2 from [0, 1, 3])
f2:2(2) = floor(2! Rp(f2) / 4!) % 2 = 0, therefore f2(2) = 0 (i.e. zero-indexed value 0 from [0, 1])
f2:3(3) = floor(1! Rp(f2) / 4!) % 1 = 0, therefore f2(3) = 1 (i.e. zero-indexed value 0 from [1])

So 2 + 3 (modulo 4) = 1.

The looooooong way.
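
And, for comparison, the short way: the whole scheme as a compact Python sketch (my code, following the definitions above):

from math import factorial

def encode_perm(perm):
    # Position i contributes index_i * N!/(N-i)!, where index_i is the
    # zero-based index of perm(i) among the values returned for x >= i.
    n = len(perm)
    remaining = sorted(perm)
    code = 0
    for i, v in enumerate(perm):
        code += remaining.index(v) * (factorial(n) // factorial(n - i))
        remaining.remove(v)
    return code

def decode_perm(code, n):
    # Invert the encoding: peel off one index per position.
    remaining = list(range(n))
    perm = []
    for i in range(n):
        idx = (code * factorial(n - i) // factorial(n)) % (n - i)
        perm.append(remaining.pop(idx))
    return perm

N = 4
rows = [[(a + b) % N for b in range(N)] for a in range(N)]  # f_a(b) = a + b mod 4
codes = [encode_perm(r) for r in rows]
print(codes)                                   # [0, 17, 10, 3]
Rf = sum(c * factorial(N) ** a for a, c in enumerate(codes))
print(Rf)                                      # 47640
a, b = 2, 3
Rp = (Rf // factorial(N) ** a) % factorial(N)
print(decode_perm(Rp, N)[b])                   # 1, i.e. 2 + 3 (modulo 4)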