Wednesday, July 21, 2021

A domain-specific language for representing morphology

Whenever I learn a new language, I instinctively want to model the morphology in code. It's inefficient to write grammars in generic programming languages, though, and that's where I always get stuck.

This month I developed a domain-specific language for representing morphology. The interpreter is written in JavaScript, but it could be ported to almost any other language.

A project in this language starts with a declaration of the types of graphemes used in the language being modeled. (It works on the level of graphemes instead of phonemes, but phonemic systems are a subset of graphemic systems, so nothing is lost by doing it this way.)

Here is an example, which defines vowels (V) and consonants (C) in a system with five vowels, phonemic vowel length, and certain digraphs (such as 'hw', 'qu', 'hl').

    classes: {
      V: '[aeiouáéíóú]',
      C: '[ghn]w|[hnrt]y|qu|h[rl]|[bdfghklmnpqrstvwy]'
    }

The values on the left are identifiers (V, C) and the values on the right are regular expressions.
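To make the role of the digraphs concrete, here is a minimal sketch of how class patterns like these could split a word into graphemes. The `segment` function is hypothetical and not taken from the actual interpreter; it only assumes that the class values are ordinary regular expression sources, as described above.

```javascript
// Class table copied from the example above.
const classes = {
  V: '[aeiouáéíóú]',
  C: '[ghn]w|[hnrt]y|qu|h[rl]|[bdfghklmnpqrstvwy]'
};

// Hypothetical helper: split a word into graphemes by trying every class
// pattern at each position. Inside C the digraphs are listed before the
// single letters, so 'hw' is read as one grapheme rather than 'h' + 'w'.
// Characters that belong to no class are silently skipped.
function segment(word, classes) {
  const pattern = Object.values(classes)
    .map(source => '(?:' + source + ')')
    .join('|');
  return word.match(new RegExp(pattern, 'g')) || [];
}

console.log(segment('hwaqu', classes)); // [ 'hw', 'a', 'qu' ]
```

Because regular expression alternation tries its alternatives from left to right, the order of alternatives inside a class matters: moving the single letters ahead of the digraphs would split 'hw' in two.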

These identifiers can then be used in transformations like the following, which will append -n to a word ending in a vowel, or -en to a word ending in a consonant.

    append_n: [
      '* V -> * V n',
      '* C -> * C en'
    ]

This transformation is composed of two rules, which are turned into regular expressions like the following:

/(^.*)([aeiouáéíóú])$/
/(^.*)([ghn]w|[hnrt]y|qu|h[rl]|[bdfghklmnpqrstvwy])$/

Each of the rules also has a map describing the way that the parts of the input are transformed into an output, like this:

[1, 2, "n"]
[1, 2, "en"]
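The compilation step can be sketched roughly as follows. This is an illustration under assumptions, not the interpreter's real code: `compileRule` is a hypothetical name, rule tokens are assumed to be separated by spaces, and the anchors are placed outside the capture groups so that `$` applies to every alternative of a class.

```javascript
// Class table copied from the example above.
const classes = {
  V: '[aeiouáéíóú]',
  C: '[ghn]w|[hnrt]y|qu|h[rl]|[bdfghklmnpqrstvwy]'
};

// Hypothetical helper: turn a rule such as '* V -> * V n' into a regular
// expression for the input side and a map describing the output side.
function compileRule(rule, classes) {
  const [lhs, rhs] = rule.split('->').map(side => side.trim().split(/\s+/));

  // Each input token becomes one capture group: '*' matches any prefix,
  // a class name expands to its pattern, anything else is a literal.
  const groups = lhs.map(token => {
    if (token === '*') return '(.*)';
    if (Object.prototype.hasOwnProperty.call(classes, token)) {
      return '(' + classes[token] + ')';
    }
    return '(' + token.replace(/[.*+?^${}()|[\]\\]/g, '\\$&') + ')';
  });
  const regex = new RegExp('^' + groups.join('') + '$');

  // Output tokens that repeat an input token become group numbers
  // (consumed left to right); all other tokens stay literal strings.
  const used = new Set();
  const map = rhs.map(token => {
    const index = lhs.findIndex((t, i) => t === token && !used.has(i));
    if (index === -1) return token;
    used.add(index);
    return index + 1;
  });

  return { regex, map };
}

console.log(compileRule('* C -> * C en', classes).map); // [ 1, 2, 'en' ]
```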

A candidate word, such as 'arat', is tested against each regular expression from top to bottom. In this case it fails the first expression, because it does not end in a vowel, and matches the second. The match results look like this:

["arat", "ara", "t"]

The transformation then assembles the answer using the mapping [1, 2, "en"]. The numbers in the mapping are indexes into the match results: index 0 holds the whole match, so 1 and 2 refer to the two captured groups. The result is "ara" + "t" + "en" = "araten".
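Forward application can be sketched in a few lines. The rule table below is written out by hand from the expressions and mappings shown above (with the anchors moved outside the groups), and `applyTransformation` is a hypothetical name rather than the interpreter's actual function.

```javascript
// The two compiled rules of append_n, written out by hand.
const appendN = [
  { regex: /^(.*)([aeiouáéíóú])$/, map: [1, 2, 'n'] },
  { regex: /^(.*)([ghn]w|[hnrt]y|qu|h[rl]|[bdfghklmnpqrstvwy])$/, map: [1, 2, 'en'] }
];

// Hypothetical helper: try the rules from top to bottom and use only the
// first one that matches. Numbers in the map are looked up in the match
// results; strings are copied into the output as they are.
function applyTransformation(rules, word) {
  for (const { regex, map } of rules) {
    const match = regex.exec(word);
    if (match === null) continue;
    return map
      .map(part => (typeof part === 'number' ? match[part] : part))
      .join('');
  }
  return null; // no rule matched
}

console.log(applyTransformation(appendN, 'arat')); // araten
console.log(applyTransformation(appendN, 'ara'));  // aran
```

Returning null when nothing matches is a choice made for this sketch; a real system might return the word unchanged or report an error instead.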

In addition to preparing regular expressions and mappings for applying transformations, the system also prepares reversing versions. In this case, we have the following reverse expressions and mappings:

/(^.*)([aeiouáéíóú])(n$)/
/(^.*)([ghn]w|[hnrt]y|qu|h[rl]|[bdfghklmnpqrstvwy])(en$)/

[1, 2]
[1, 2]

In reverse application, instead of using only the first rule that matches, the system applies every rule that matches and returns an array of answers. So, if we reverse "araten", both rules match: the first strips -n to give "arate" and the second strips -en to give "arat", so the answers are ["arat", "arate"].
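Here is a minimal sketch of reverse application, again with hand-written rules and a hypothetical function name (`reverseTransformation`). It collects one candidate stem per matching rule, in rule order, so the order of the answers may differ from the order shown above.

```javascript
// The reverse rules of append_n, written out by hand.
const appendNReverse = [
  { regex: /^(.*)([aeiouáéíóú])(n)$/, map: [1, 2] },
  { regex: /^(.*)([ghn]w|[hnrt]y|qu|h[rl]|[bdfghklmnpqrstvwy])(en)$/, map: [1, 2] }
];

// Hypothetical helper: apply every rule that matches and collect the
// distinct candidate stems.
function reverseTransformation(rules, word) {
  const stems = [];
  for (const { regex, map } of rules) {
    const match = regex.exec(word);
    if (match === null) continue;
    const stem = map
      .map(part => (typeof part === 'number' ? match[part] : part))
      .join('');
    if (!stems.includes(stem)) stems.push(stem);
  }
  return stems;
}

console.log(reverseTransformation(appendNReverse, 'araten')); // [ 'arate', 'arat' ]
```

Both candidates are legitimate as far as the rules can tell; deciding which stem is real needs outside knowledge such as a lexicon.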

The value of reverse application is that we can take inflected words from a text and reverse the inflections to arrive at a set of possible stems.

There is much more to it, of course, because morphology is complex.
