Friday, March 21, 2014

Starting the final phase of Rohonc transcription (I think)

I think I've finally settled on a workable process for machine-assisted transcription of the Rohonc Codex.

I tried several approaches before landing on the current one. One approach was to analyze each page top-down: first identify the areas that contain text, then split those into lines, then split the lines into glyphs. The other approach was bottom-up: first identify glyphs, then group them into lines.

No single automated process was able to correctly split the pages up 100% of the time, so I have adopted a top-down automated approach with manual overrides. I have now broken all of the pages down to the line level, and I have some code that picks out glyphs from a line with great accuracy.

Now comes the fun part: writing (and training) the glyph-recognition algorithm.

I've decided to use the "mark" as my fundamental unit of text. A mark is a single, contiguous, dark shape on the page within the bounds of a text line. Many Rohonc glyphs consist of a single mark, but many consist of a core mark with one or more satellite marks. Most satellite marks are single dots above or to the left of the core mark, but some are dashes, and some are haloes that surround a core mark.
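To make the definition concrete, here is a minimal sketch of pulling marks out of a line by flood-filling connected dark pixels. It is an illustration, not the code used for the transcription. It assumes the line image has already been thresholded to a grid of 0s and 1s (1 = dark), and that diagonal neighbors count as "contiguous" (8-connectivity). The name find_marks is made up for this example.

```python
from collections import deque


def find_marks(bitmap):
    """Return the marks in a binarized line image.

    bitmap is a list of rows of 0/1 values (1 = dark). Each mark is
    returned as a sorted list of (row, col) dark pixels.
    Assumption: pixels touching diagonally belong to the same mark.
    """
    rows = len(bitmap)
    cols = len(bitmap[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    marks = []
    for r in range(rows):
        for c in range(cols):
            if not bitmap[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited dark pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and bitmap[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            marks.append(sorted(pixels))
    return marks


# A tiny made-up line: one satellite dot and one three-pixel core mark.
line = [
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
]
marks = find_marks(line)
```

In this toy line the dot and the larger shape come out as two separate marks. Grouping them into one glyph is a later step.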

My glyph-recognition algorithm will start out by finding the best match between the marks on a new line and any that have been previously identified. This will be followed by a manual intervention step where I can correct any incorrect automated matches, or reject a mark as being non-text. When the line has been completely treated, constellations of marks will be matched to known glyphs.

I have wrestled with several different approaches to matching marks. One approach is to simply overlay one mark upon another, count the points where the two differ (dark in one but not in the other), and calculate the ratio between that count and the full number of points.
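A rough sketch of the overlay comparison might look like the following. Two things here are assumptions rather than settled decisions: the marks are aligned at their top-left corners and padded with light pixels to a common size, and "the full number of points" is taken to mean the area of that common frame. The name overlay_difference is invented for the example.

```python
def overlay_difference(a, b):
    """Fraction of points at which two mark bitmaps differ.

    a and b are non-empty lists of rows of 0/1 values (1 = dark).
    Assumptions: top-left alignment, and the denominator is the area
    of the smallest frame that contains both bitmaps.
    """
    rows = max(len(a), len(b))
    cols = max(len(a[0]), len(b[0]))

    def at(bitmap, r, c):
        # Anything outside a bitmap's own bounds counts as light.
        if r < len(bitmap) and c < len(bitmap[0]):
            return bool(bitmap[r][c])
        return False

    differing = sum(
        1
        for r in range(rows)
        for c in range(cols)
        if at(a, r, c) != at(b, r, c)
    )
    return differing / (rows * cols)


mark_a = [[1, 1],
          [1, 0]]
mark_b = [[1, 0],
          [1, 0]]
score = overlay_difference(mark_a, mark_b)  # one of four points differs
```

A score of 0 means the two marks are identical under this alignment, and 1 means they disagree everywhere. A different alignment (for example, by center of mass) would give different scores for the same pair.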

Another approach that I am toying with is to use a two-dimensional version of Levenshtein distance. One way to do that would be to treat each row and column of the mark bitmaps as a string, calculate the individual Levenshtein distances, and sum them up to come up with a total distance.
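Here is one way the row-and-column idea could be sketched. It assumes rows and columns are paired by index, and that a row or column missing from the smaller bitmap is compared as an empty string, so it costs its full length. Both function names are hypothetical.

```python
from itertools import zip_longest


def levenshtein(s, t):
    """Classic edit distance between two sequences (strings or tuples)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        curr = [i]
        for j, b in enumerate(t, 1):
            curr.append(min(
                prev[j] + 1,             # delete a from s
                curr[j - 1] + 1,         # insert b into s
                prev[j - 1] + (a != b),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def row_column_distance(a, b):
    """Sum of Levenshtein distances over paired rows and paired columns.

    a and b are lists of rows of 0/1 values. Assumption: rows and
    columns are paired by index, and an unpaired one is compared
    against an empty sequence.
    """
    rows_a = [tuple(row) for row in a]
    rows_b = [tuple(row) for row in b]
    cols_a = list(zip(*a))
    cols_b = list(zip(*b))
    row_total = sum(levenshtein(x, y)
                    for x, y in zip_longest(rows_a, rows_b, fillvalue=()))
    col_total = sum(levenshtein(x, y)
                    for x, y in zip_longest(cols_a, cols_b, fillvalue=()))
    return row_total + col_total


wide = [[1, 1, 0],
        [0, 1, 0]]
narrow = [[1, 0],
          [0, 1]]
distance = row_column_distance(wide, narrow)
```

Unlike the overlay ratio, this total is a raw edit count that grows with the size of the bitmaps, not a value between 0 and 1.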

But some calibration would be needed to make an apples-to-apples comparison between different match scores.
