I've been working on code that can scan the images of the Rohonc Codex and help me transcribe it. Hopefully I will be able to complete the transcription relatively quickly with the assistance of some code that can recognize and categorize graphemes (and remember what wacky name I decided to give each character).
In the process, I have unearthed a wealth of interesting detail and challenges.
Regarding the grapheme recognition process, the challenges are many. The script is handwritten, the lines are irregular, and the scanned pages are not necessarily aligned squarely with the image frame. My approach is to identify individual marks, place them in a network together with other similar marks, and differentiate them based on the local density of the region of the network in which they appear. Then, I think I can recognize constellations of marks as graphemes and start training the program to do the transcription.
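To make the first step concrete, here is a minimal sketch of how individual marks might be pulled out of a page, assuming the scan has already been thresholded into a grid of 0s (paper) and 1s (ink). It uses a plain flood fill over touching pixels, which is a standard technique and not necessarily what the actual program does. The names `find_marks` and `bounding_box` are made up for illustration.

```python
from collections import deque

def find_marks(bitmap):
    """Group touching ink pixels (value 1) into marks.

    bitmap is a non-empty list of equal-length rows of 0/1 values.
    Returns a list of marks, each a list of (row, col) pixels.
    """
    rows, cols = len(bitmap), len(bitmap[0])
    seen = [[False] * cols for _ in range(rows)]
    marks = []
    for r in range(rows):
        for c in range(cols):
            if bitmap[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill over the 8 surrounding pixels.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and bitmap[ny][nx] == 1
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            marks.append(pixels)
    return marks

def bounding_box(pixels):
    """Return (top, left, bottom, right) for one mark."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    return min(ys), min(xs), max(ys), max(xs)

# A tiny hypothetical "page" with three separate marks.
page = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0],
]
marks = find_marks(page)
print(len(marks))              # 3
print(bounding_box(marks[0]))  # (0, 0, 1, 1)
```

Each mark's pixel list or bounding box could then be turned into whatever features are used to compare it with other marks in the network.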
It is clear to me at this point that this is going to be more of a computer-assisted transcription project than a pure computer transcription, but even so the work should go much more quickly with the aid of a machine whose eyes never tire.
One of the challenges I have had to overcome is distinguishing between stray dots on the page and the dots that are intended to be part of a grapheme. Unless I am mistaken, it appears that the dots that accompany a grapheme always appear above or to the left of the main shape of the grapheme. I suspect this is related to the right-to-left direction of the text.
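The observation above can be turned into a simple filter. The sketch below is only an illustration of the idea: it assumes image coordinates in which rows grow downward and columns grow rightward, and it assumes a dot belongs to a grapheme only if it sits within some small distance above or to the left of the main shape's bounding box. The function name `is_attached_dot` and the `max_gap` parameter are hypothetical, and the default of 10 pixels is an arbitrary placeholder.

```python
def is_attached_dot(dot, box, max_gap=10):
    """Guess whether a dot belongs to a grapheme.

    dot is the (row, col) centre of the dot.
    box is the (top, left, bottom, right) bounding box of the main shape.
    max_gap is the largest allowed distance, in pixels, from the box.
    """
    y, x = dot
    top, left, bottom, right = box
    # Above: higher on the page than the box, roughly over its width.
    above = top - max_gap <= y < top and left - max_gap <= x <= right
    # Left: left of the box, roughly alongside its height.
    to_left = left - max_gap <= x < left and top - max_gap <= y <= bottom
    return above or to_left

box = (20, 30, 40, 50)
print(is_attached_dot((15, 35), box))  # True: just above the shape
print(is_attached_dot((30, 25), box))  # True: just to the left
print(is_attached_dot((45, 40), box))  # False: below the shape
print(is_attached_dot((30, 55), box))  # False: to the right
```

Anything that fails the test would be treated as a stray dot, although a dot that happens to land in the right spot by accident would still slip through.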
In categorizing the graphemes, I am running into a problem I have wrestled with for years, ever since I first started thinking about ways to automate the recognition of patterns. I call it the "cloud-within-the-network" problem, and I need to find out what the proper answer to it is.
The "cloud-within-the-network" problem works like this: Suppose you have some dense networks, and you loosely connect them to each other in a larger network. How do you computationally recognize the existence of the dense networks within the larger loose network?
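In graph theory this kind of question is usually discussed under the name community detection (or graph clustering). As a minimal sketch of one simple heuristic, and not a claim about the right answer, the code below drops every edge whose two endpoints share few neighbours and then reads off the connected pieces that remain. The function name `dense_clusters` and the `threshold` value of 0.5 are assumptions chosen for this toy example, and the result is sensitive to that threshold.

```python
from collections import defaultdict

def dense_clusters(edges, threshold=0.5):
    """Split a graph into dense groups by pruning weak edges.

    edges is a list of (u, v) pairs. An edge is kept only if the
    Jaccard overlap of its endpoints' neighbourhoods (each including
    the node itself) is at least threshold.
    """
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].update((u, v))
        neighbours[v].update((u, v))

    # Keep only edges whose endpoints have mostly the same neighbours.
    kept = defaultdict(set)
    for u, v in edges:
        nu, nv = neighbours[u], neighbours[v]
        if len(nu & nv) / len(nu | nv) >= threshold:
            kept[u].add(v)
            kept[v].add(u)

    # Connected components of the pruned graph are the clusters.
    clusters, seen = [], set()
    for start in neighbours:
        if start in seen:
            continue
        seen.add(start)
        stack, cluster = [start], set()
        while stack:
            node = stack.pop()
            cluster.add(node)
            for nxt in kept[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(cluster)
    return clusters

# Two tightly connected groups of four, joined by a single loose edge.
group_a = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
group_b = [(4, 5), (4, 6), (4, 7), (5, 6), (5, 7), (6, 7)]
bridge = [(3, 4)]
print(dense_clusters(group_a + group_b + bridge))
# [{0, 1, 2, 3}, {4, 5, 6, 7}]
```

In this example the bridge edge has an overlap of 2/8 and is pruned, while every edge inside a group scores 0.8 or higher. Real data is rarely this clean, which is where a heuristic like this starts to run into trouble.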
It seems like it should be relatively simple, but every solution I think up seems to have a problem with it. In the case of this transcription project, I have a workaround, but some day I would like to find the right solution.