Solutions to this problem span a spectrum. At one end is a "fat" solution: a massive database containing every possible inflected form of every word in the language. At the other end is a "lean" solution: a smaller database of stems and inflectional information, plus some complex code to use the two in combination.
The fat solution has the advantage of using mature indexing technology to efficiently search the massive database for the subject term. Supposing you have a binary tree index, the cost of executing a query with the fat solution might be:
Tfat = Q + N log2 (si)
where Q is the fixed cost of opening a query to the database, N is the incremental cost of checking one node in the binary tree, and log2(si) is the depth of the binary tree over all s·i entries, s being the number of stems and i the number of inflections.
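To get a feel for the scale, here is a quick back-of-the-envelope calculation; the sizes (10,000 stems, 1,000 inflections) are invented purely for illustration.

```python
import math

# Hypothetical sizes: 10,000 stems crossed with 1,000 inflections
# yields a 10-million-row table of fully inflected forms, but the
# binary tree index only deepens logarithmically.
s, i = 10_000, 1_000
rows = s * i
depth = math.log2(rows)
print(f"{rows:,} rows, tree depth = log2(si) = {depth:.1f}")
# -> 10,000,000 rows, tree depth = log2(si) = 23.3
```

So even a ten-million-row table costs only about 23 node checks per lookup, which is why the fat solution is hard to beat on raw query speed.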
However, for agglutinative languages this is a nightmare. Consider the standard Uyghur expression used when meeting someone new:
tonušqanliqimizdin xošalmen
The stem of the first word is ton, and the rest is all inflection (uš - qan - liq - imiz - din). A database containing all inflected forms of Uyghur verbs would be huge.
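To make the stem/inflection split concrete, here is a toy segmentation of that word; the five-suffix inventory and one-entry stem set are hypothetical stand-ins for the real databases.

```python
# A minimal sketch of factoring a word into stem + inflections.
# The suffix list is taken from the greeting above and is nowhere
# near a complete inventory of Uyghur morphology.
SUFFIXES = ["din", "imiz", "liq", "qan", "uš"]
STEMS = {"ton"}  # stand-in for the stem database

def segment(word):
    """Strip known suffixes off the end until a known stem remains."""
    parts = []
    while word not in STEMS:
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix):
                parts.insert(0, suffix)
                word = word[: -len(suffix)]
                break
        else:
            return None  # no analysis found
    return word, parts

print(segment("tonušqanliqimizdin"))
# -> ('ton', ['uš', 'qan', 'liq', 'imiz', 'din'])
```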
The leanest solution searches the stems and inflections separately. Since the inflections are a fixed set, it makes sense to factor them out first, then do a simple search of the stems. The cost of the lean solution is the sum of the time required to query all of the inflectional data (Q + Ni), the time required to parse all possible inflections (P), and the time required to query the stems (Q + N log2 s):
Tlean = (Q + Ni) + P + (Q + N log2 s)
which, since log2(si) = log2 s + log2 i and therefore Q + N log2 s = Tfat - N log2 i, can be rewritten in terms of Tfat as
Tlean = (Q + Ni) + P + Tfat - N log2 i
For the lean solution to beat the fat one, we would therefore need
Q + Ni + P < N log2 i
That simply isn't achievable: Q, P and N are all positive, and i > log2 i for any i ≥ 1, so the Ni term alone already exceeds the right-hand side. To make the lean solution work, we've got to pre-fetch the inflectional data. In fact, we could write a routine that uses the inflectional data to build a parser, in which case we're down to this:
Tlean = P + Tfat - N log2 i
P < N log2 i
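It's worth putting rough numbers on that budget. Both figures below are hypothetical: i = 1,000 inflections and N = 1 microsecond per node check.

```python
import math

# Back-of-the-envelope check of the condition P < N log2 i.
# With hypothetical values i = 1,000 and N = 1 microsecond per
# node comparison, the parser must finish in under ~10 microseconds
# per word to pay for itself.
i = 1_000
N = 1e-6  # seconds per node comparison
budget = N * math.log2(i)
print(f"parse budget per word: {budget * 1e6:.1f} microseconds")
# -> parse budget per word: 10.0 microseconds
```

A tight budget, but one an in-memory parser with no database round trips can plausibly meet.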
The fastest parser would probably be a finite state transducer (FST). The FST would likely traverse about the same number of states as the index search traverses nodes, but it has the advantage of running in memory, while the index search must contend with the necessary evils of relational database management (query preparation, I/O buffering, and so forth) that are hiding in the value N.
Going back to the challenge requirement that this be implemented in a conventional programming language: the best marriage between conventional programming and an FST is probably regular expression processing, so that would be the tool to use for the parser.
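As a sketch of the regex-as-FST idea, the whole inflection inventory can be compiled into one alternation, so a single in-memory match factors a word into stem plus suffix chain. The five-suffix inventory below is a hypothetical one drawn from the greeting example above.

```python
import re

# Compile the (hypothetical) inflection inventory into one pattern.
# The non-greedy stem group keeps the stem as short as possible,
# forcing the suffix group to absorb every trailing inflection.
SUFFIX = r"(?:uš|qan|liq|imiz|din)"
PATTERN = re.compile(rf"^(?P<stem>\w+?)(?P<infl>{SUFFIX}*)$")

def analyze(word):
    """Return (stem, [suffixes]) for a word, or None if no match."""
    m = PATTERN.match(word)
    if not m:
        return None
    return m.group("stem"), re.findall(SUFFIX, m.group("infl"))

print(analyze("tonušqanliqimizdin"))
# -> ('ton', ['uš', 'qan', 'liq', 'imiz', 'din'])
```

The regex engine's backtracking here plays the role of the FST's state traversal, and since the compiled pattern lives entirely in memory, there is no per-word database overhead at all.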