So far I have come up with two ways to identify word breaks in the RC. One is based on the relative frequency of initial and final glyphs as inferred from the presence or absence of hyphens at the ends of lines, and the other is based on the space between glyphs on the page.
The image below shows the identified word breaks on page 50, where the blue lines indicate word breaks based on the space between glyphs and the red lines indicate word breaks based on the formula above. The placement of some of the blue lines is odd, and I think that probably goes back to problems with how the glyph recognition algorithm measures the spaces between glyphs.
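To make the spacing-based idea concrete, here is a minimal sketch rather than the code that produced the image. It assumes each glyph on a line is available as a (left, right) pair of pixel coordinates, sorted by horizontal position. The function name, the factor of 1.5, and the min_gap floor are all illustrative assumptions.

```python
from statistics import median

def gap_breaks(boxes, factor=1.5, min_gap=1.0):
    """Return indices i such that a word break falls between glyph i and glyph i+1.

    boxes   : list of (left, right) pixel coordinates, sorted by position on the line
    factor  : hypothetical multiplier; a gap wider than factor * median gap is a break
    min_gap : hypothetical floor so touching glyphs do not drive the threshold to zero
    """
    # Horizontal distance between the right edge of one glyph and the left edge of the next
    gaps = [boxes[i + 1][0] - boxes[i][1] for i in range(len(boxes) - 1)]
    if not gaps:
        return []
    # Use the median gap on this line as the "ordinary" spacing between glyphs
    threshold = max(factor * median(gaps), min_gap)
    return [i for i, gap in enumerate(gaps) if gap > threshold]

# Five glyphs with one visibly wider gap after the third glyph
line = [(0, 10), (12, 20), (22, 30), (40, 50), (52, 60)]
print(gap_breaks(line))
```

A threshold relative to the line's own median gap is one way to cope with lines written more tightly or loosely than others. If the recognition step merges or splits glyphs, the measured gaps shift with it, which would be one source of oddly placed breaks.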
It's a good sign that the two algorithms often identify the same word breaks, giving weight to the idea that spaces do indicate word breaks in the text. Many of these word divisions seem intuitively right, while others are surprising and worth investigating.
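How often the two methods agree can be put into numbers. The sketch below assumes each method yields a list of break positions for the same line, expressed as the index of the glyph that precedes the break. The function name and the choice of the Jaccard ratio as a summary are assumptions made for illustration.

```python
def break_agreement(breaks_a, breaks_b):
    """Compare two sets of word-break positions found on the same line.

    breaks_a, breaks_b : iterables of glyph indices after which a break was found
    Returns the shared breaks, the breaks unique to each method, and the
    Jaccard ratio (shared / all distinct breaks) as a rough agreement score.
    """
    a, b = set(breaks_a), set(breaks_b)
    shared = a & b
    union = a | b
    return {
        "shared": sorted(shared),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        # Two empty sets agree trivially, so report 1.0 in that case
        "jaccard": len(shared) / len(union) if union else 1.0,
    }

# Hypothetical break positions from the two methods on one line
result = break_agreement([2, 5, 9], [2, 9, 11])
print(result)
```

The breaks listed under only_a and only_b are the surprising ones, and those are the places worth a closer look on the page.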