To Appear
Miroslav Martinovic
TOPIC AREA: Word Conflation, Information Retrieval, Stem Dictionaries, NLP Tools and Resources
ABSTRACT
This paper introduces an algorithm for transforming any sequential word conflation algorithm into a conflation algorithm whose final product is guaranteed to be a minimal stem. We first present the algorithm in its generic form and then demonstrate how to implement it for the well-known Porter stemming method. The proposed transformation is based on an equivalence relation that partitions the sets of conflated word forms into mutually exclusive and exhaustive equivalence classes. For each equivalence class, the shortest word in the union of its member sets is the sought minimal stem. The minimal stem not only expresses the commonality among related words, but does so in the most succinct manner. It therefore reduces the number of terms in an IR system that uses it, and does so in the most space-efficient way.
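The idea can be illustrated with a minimal sketch, given here under stated assumptions rather than as the implementation developed in the paper. The function `toy_chain` is a hypothetical stand-in for a sequential conflation algorithm (it is not Porter's method) that records every intermediate form it produces. The equivalence relation is assumed, for illustration only, to be "two sets of forms share a word", closed transitively; `minimal_stems` and the tie-breaking rule (alphabetical order among equally short words) are likewise assumptions of this sketch.

```python
from collections import defaultdict


def toy_chain(word):
    """Hypothetical sequential conflation: strip at most one suffix per
    step and return the set of all forms seen, including the word itself."""
    forms = [word]
    for suffix in ("s", "ing", "ion", "e"):
        current = forms[-1]
        # Only strip when at least three characters would remain.
        if current.endswith(suffix) and len(current) - len(suffix) >= 3:
            forms.append(current[: -len(suffix)])
    return set(forms)


def minimal_stems(words, chain=toy_chain):
    """Map each word to the shortest form in its equivalence class.

    Assumption: sets of conflated forms that share a form belong to the
    same class (transitive closure, computed with union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Merge every word with all forms in its conflation set.
    for word in words:
        for form in chain(word):
            union(word, form)

    # Collect the union of forms for each equivalence class.
    classes = defaultdict(set)
    for form in list(parent):
        classes[find(form)].add(form)

    # The minimal stem is the shortest form in the class
    # (ties broken alphabetically, an assumption of this sketch).
    stem_of = {}
    for members in classes.values():
        stem = min(members, key=lambda s: (len(s), s))
        for member in members:
            stem_of[member] = stem

    return {word: stem_of[word] for word in words}


vocabulary = ["connections", "connecting", "connect",
              "relate", "relating", "relations"]
result = minimal_stems(vocabulary)
```

With this toy suffix list, the six vocabulary words fall into two classes, so an index built on `result` would store two terms instead of six, each as short as any form in its class.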