Workshop on the Computational Modeling and Lexical Acquisition
July 2005, Split, CroatiaSteLemMin – a Generic Minimal Stem Algorithm for Word Conflation and Lemmatization
Miroslav Martinovic, L. Rofrano
TOPIC AREA: Information Retrieval, Question Answering, NLP Tools and Resources
ABSTRACT
This paper introduces SteLemMin (Stemming and Lemmatization Minimizer), an algorithm for transforming any sequential word conflation or lemmatization algorithm into an algorithm whose final product is guaranteed to be a minimal stem. The algorithm is presented in its generic form applicable to any lemmatization or stemming method of a sequential nature. Later in the paper, we demonstrate how SteLemMin can be employed to transform Porter’s stemming method.
SteLemMin method is based on an equivalence relation which partitions the sets of conflated word forms into mutually exclusive and exhaustive equivalence classes. The shortest length word of the union of the element sets for each such equivalence class is proved to be the sought minimal stem. In addition to expressing the commonality among the related words, the minimal stem does it in a most succinct manner. It clearly contributes to reducing the number of terms in an IR system that would use it, but also does it in a space-wise most efficient manner. Word conflation by automatic means is a process particularly beneficial in the field of information retrieval. Terms with a common root or stem are considered likely to have similar meanings (i.e. COLLECT, COLLECTED, COLLECTING, COLLECTION and COLLECTIONS). Many word conflation strategies reported in the literature (Lovins, 1968; Mason, 2000; Porter, 1980; Paice, 1996; Xu, 1998) include a large number of word conflation schemes depending on: (i) whether a dictionary is being used, (ii) whether a pre/in/suf-fix list is being used, (iii) the purpose of the conflation process. The main merits of a successful algorithm are contemplated as small size, efficient performance, and reasonable simplicity. Characterized informally, the best qualification for the word conflation of two words W1 and W2 producing a single stem S is if one can say that there appears to be no difference between the two statements “a document is about W1” and “a document is about W2”. To summarize, word conflation algorithms under the consideration here are simple, computationally inexpensive and successful techniques that bring together the words conveying the same or similar meaning and treat them as the same content contributors. It has been observed that with some algorithms, related words are not conflated into a same common stem ( i.e. RELATEDNESS conflated to RELATED, while RELATED transformed into RELAT by the Porter’s algorithm. Also, DEEPENINGS conflated to DEEPEN, while DEEP stayed DEEP). An obvious reason for these deficiencies is that the conflation procedural steps get applied sequentially and an appropriate step can miss its suffix if it’s not at the end of the word at the time when the module dealing with that suffix gets to execute. As a consequence we have that : (i) The number of stems on the system stays greater (because RELATED and RELAT likes are considered different and unrelated terms), (ii) The system fails to record the smallest (and space-wise least expensive) related terms (i.e. RELAT rather than RELATED likes), (iii) The semantic congruity between some words gets unnoticed.
Our simple and computationally inexpensive transformation converts any sequential conflation algorithm into its minimal stem variant. For each of the sequential steps of an algorithm in question we define a corresponding relation which is defined on the sets of words.
The transformation converting a conflation algorithm of described properties into one that generates the minimal stem is rather obvious. Once the sequential phases of the original algorithm have been identified, they are applied not once but until all words in a corresponding Algtuple are identical. When an Alg tuple consists of only identical words, no transformation of the given algorithm can further conflate the word and the word assumes its minimal stem form. This process is guaranteed to converge to a condition in which the Algtuple will contain only one copies of one single word because of the assumption that no sequential phase increases the length of the word.
In addition and as a useful computational linguistics tool, we put forward an implementation of SteLemMin fully capable of optimizing and interpreting an arbitrary sequential conflation or lemmatization algorithm. The word form transformation rules are expected to be introduced through a simple text file, external to this implementation. The representation of the rules is expected to use the traditional regular expressions terminology. As such, the algorithm demonstrates NO dependency on a particular natural language and can be tailored to very different morphological systems. We illustrate this feature by providing examples from English and German. For instance, a Porter stemming rule 52 (`52 m>1 .*[ST]ION -1 3 "" 60 53`) defines stripping of the `SION` or `TION` suffix in a word form (i.e. CONSTITUTION or INTRUSION). First label (52) is the ordinal number of this rule. Next, `m>1` means that the word form must have a measure (length) greater than one (see Porter, 1980 for details) in order for the transformation to be applicable. `.*[ST]ION` is a regular expression defining that the word form must end with `TION` or `SION`. `-1` specifies that the transformation is taking place at the end of the word, while the next number `3` specifies that three characters need to be removed from there and replaced by `””`, an empty string. At the end, 60 and 53 are labels of rules which are applied next if this rule has or has not been successfully applied, respectively. Similarly, the rule `87 m>=0 FROZEN$ 3 4 "EEZE" 88 88` of an English lemmatizer specifies that the root infinitive form `FREEZE` is obtained from `FROZEN` by removing `4` characters starting from the position `3` in the word and replacing them by `EEZE`.
A transformation of `GEWUSST` into `WISSEN` in a German lemmatizer, has to be described by two rules. '55 m>1 .*WUSST$ -1 4 “ISSEN” 84 182` rule transforms `GEWUSST` into `GEWISSEN` and `84 m>1 GE.* 1 2 “” 88 190` rule transforms `GEWISSEN` into the root infinitive form `WISSEN`. Clearly, this approach with transformation rules being fed as input into the algorithm emerges as ‘multi-lingual’. As such, it could be tailored to account for very different morphological features (compound words (i.e. German, Dutch), strong inflection (i.e. Finish) or even different accents depending on the form (i.e. Spanish)).