Friday, November 8, 2013

FST via RDBMS

I was stuck in an airport a couple days ago with nothing to do, so I started a little project to pass the time. The challenge is this: Using a conventional database and a conventional programming language, what is the fastest program I can write that can take an inflected word in some given language and identify the root and the inflection(s)?

There are two ends of the spectrum in solutions. On one end, you have a "fat" solution involving a massive database containing every possible inflected form of every word in the language. On the other end, you have a "lean" solution with a smaller database of stems and inflectional information, and some complex code to use the two in combination.

The fat solution has the advantage of using mature indexing technology to efficiently search the massive database for the subject term.  Supposing you have a binary tree index, the cost of executing a query with the fat solution might be:

Tfat = Q + N log(si)

where Q is the fixed cost of opening a query to the database, N is the incremental cost of checking one node in the binary tree, and log(si) is the depth of the binary tree, based on the number of stems (s) and inflections (i).

However, for agglutinative languages this is a nightmare. Consider the standard Uyghur expression used when meeting someone new:

tonušqanliqimizdin xošalmen

The stem of the first word is ton, and the rest is all inflection (uš - qan - liq - imiz - din). A database containing all inflected forms of Uyghur verbs would be huge.

The leanest solution searches the stems and inflections separately. Since the inflections are a fixed set, it makes sense to factor them out first, then do a simple search of the stems.  The speed of the lean solution is the time required to query for all of the inflectional data (Ni), the time required to parse all possible inflections (P), and the time required to query roots (N logs):

Tlean = (Q + Ni) + P + (N logs)

which can be written as

Tlean = (Ni) + P + Tfat - N logi

In order for the lean solution to beat the fat solution, this has to be true:

Ni + P < N logi

That simply isn't doable, because Q, P and N are positive, and i > logi.  In order to get the lean solution to work, we've got to pre-fetch the inflectional data. In fact, we could write a routine to use the inflectional data to build a parser, in which case we're down to this:

Tlean = P + Tfat N logi

And now all we need is to make this work:

P < N logi

The fastest parser would probably be a finite state transducer. It seems to me you would end up with the FST traversing about the same number of states as the nodes traversed by the index search, but the FST has the advantage of running in memory, while the index search would have to deal with some of the necessary evils of relational database management (like query preparation, I/O buffering, and so forth) that are hiding in the value N.

Going back to the challenge requirement that this has to be implemented in a conventional programming language, the best marriage between conventional programming and an FST is probably found in regular expression processing, so that would probably be the tool to use for the parser.

No comments:

Post a Comment