NLDB (Natural Language for Databases) Conference, June 2002, Stockholm, Sweden
Miroslav Martinovic, G. Sampath
TOPIC AREA: Information Retrieval, Text Summarization, NLP Tools and Resources
ABSTRACT
We present a multilevel model of discussions in USENET newsgroups that uses statistical and linguistic methods to obtain lexical, semantic, and discourse characteristics of the text. In document mining, the amorphous, unstructured nature of text makes it difficult to extract and summarize information in useful ways. Newsgroup discussions, by contrast, are subject to several constraints that make information extraction and summarization more amenable to analysis at different levels. The model exploits several characteristics of newsgroup discussions, such as posting structure, times of posting, time spans, and the length and depth of a thread, and uses them to derive higher-level information on subject matter, interest level, topicality, and discussion trends. Techniques for summarizing discussions and detecting their characteristics are also described.
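As an illustration of the structural characteristics named above (length and depth of a thread, time span of posting), the following is a minimal sketch in Python. It is not the model described in this paper. The `Post` record, its field names, and the `thread_features` function are hypothetical, and the sketch assumes each post carries an identifier, an optional parent identifier, and a timestamp, with no cycles among the reply links.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class Post:
    """Hypothetical record for one newsgroup posting (names are assumptions)."""
    post_id: str
    parent_id: Optional[str]  # None for the post that starts the thread
    posted_at: datetime


def thread_features(posts: List[Post]) -> Dict[str, float]:
    """Compute simple structural features of one thread.

    Assumes the reply links are acyclic. A parent_id that does not
    match any post in the list is treated as the top of its chain.
    """
    if not posts:
        raise ValueError("a thread needs at least one post")

    by_id = {p.post_id: p for p in posts}

    def depth(post: Post) -> int:
        # Count reply links from this post up to the top of its chain.
        steps = 0
        while post.parent_id is not None and post.parent_id in by_id:
            post = by_id[post.parent_id]
            steps += 1
        return steps

    times = [p.posted_at for p in posts]
    span = max(times) - min(times)
    return {
        "length": len(posts),                      # number of postings
        "depth": max(depth(p) for p in posts),     # longest reply chain
        "time_span_hours": span.total_seconds() / 3600.0,
    }


if __name__ == "__main__":
    example = [
        Post("a", None, datetime(2002, 6, 1, 10, 0)),
        Post("b", "a", datetime(2002, 6, 1, 12, 0)),
        Post("c", "b", datetime(2002, 6, 2, 10, 0)),
        Post("d", "a", datetime(2002, 6, 1, 11, 0)),
    ]
    print(thread_features(example))
```

For the four example posts, the sketch reports a length of 4, a depth of 2, and a time span of 24.0 hours. Features of this kind could feed higher-level estimates such as interest level or discussion trends, though how the model combines them is not specified in this abstract.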