QASTIIR - Question Answering System Through Intelligent Information Retrieval


 


 
Project Summary

 
    The QASTIIR project addresses the multi-faceted problems of question answering and personalization. Its pivotal idea is an on-line, dynamic, on-demand integration of NLP methodologies and techniques with statistical ones into a highly modular, flexible, “plug-in, pull-out” system. The focus of this work is on creating a robust, aggregated representation of a search query in order to target responses more accurately. An initial query to the system is preprocessed both semantically, with respect to its content, and at the metadata level, based on a search context profile of the user submitting the query. The information obtained is then used to decide which NLP components of the system are adequate to “plug in” and employ in processing and which are to be “pulled out” and not used. A distributed collaborative architecture provides the overarching decision-making process for selecting appropriate components. The on-line, dynamic nature of the process is novel and significantly different from what has been implemented in contemporary systems. It represents an attempt to address the adequacy and efficiency problems of such systems. Often, they use NLP tools that significantly impede performance without a proportional gain in the quality of the returned documents and answers. At the other extreme, there are cases when too few NLP resources are used to process a sophisticated input, which produces poor-quality results.
    A secondary point of intellectual merit is a methodological one. A highly modular system is required for this endeavor, one based on a distributed as well as hierarchical architecture that facilitates collaboration among the constituent processes. Because of the emphasis on modularity and system integration, QASTIIR blends naturally with the collaborative WHAT and HOPEWELL projects. Together they can be viewed either as one integrated, multifaceted project or as three interwoven but distinct research agendas. In that light, QASTIIR work focuses on the design, selection and application of NLP techniques, while WHAT work focuses on the development of a user profile and search context that drives that selection. HOPEWELL work focuses on the nature of collaborative distributed architectures that support non-linear application of techniques.
    The proposed research activity is expected to have a far-reaching theoretical impact on (1) new theory development in natural language processing systems, (2) highly private user profile development, and (3) the application of NLP technology to practical domains. The significance of this impact rests on the fact that the work addresses the transfer of knowledge from academic research (QASTIIR) to applied domains (WHAT and HOPEWELL) and, reciprocally, provides insight into the relevance of basic research questions based on the immediate needs of the more practical domains. Furthermore, methodological protocols developed within each of the QASTIIR, WHAT and HOPEWELL projects can corroborate research results developed in the others. These techniques range from a highly formalized, NIST-based TREC evaluation (QASTIIR), to established user protocol methodologies (WHAT), to ethnographic techniques (HOPEWELL).
    A secondary impact is on the constituency of this work: undergraduate students and community outreach. The College of New Jersey (TCNJ) is primarily an undergraduate institution where research moves forward solely through undergraduate efforts. The highly disciplined modular system design allows undergraduate researchers to focus on their assigned task without mastering the entire research agenda. This approach can serve as a model for other institutions. Furthermore, the planned work involves a collaboration with a consortium of 14 school districts, providing direct support for their immediate needs while generalized solutions are developed. Students hand-code solutions to real NLP problems posed through the outreach network, and gain insight into theory that can automate those solutions. This model of a software development cycle can foster a greater appreciation for computer science research among K-12 teachers, while providing substantive community outreach.



Project Description
 

    Question Answering System Through Intelligent Information Retrieval (QASTIIR) is a research undertaking at The College of New Jersey conducted by Dr. Miroslav Martinovic and a group of talented undergraduate Computer Science students ([BloodgoodMartinovic], [LocastoHulmeMartinovic], [Gibson], [GibsonGrapeMartinovic], [MartinovicSampathWagnerBriening], [TrnkaMartinovic]). The project started in September 2000 with the goal of developing a highly modular question answering (QA) system, suitable for testing and evaluating different integrative approaches that use an on-demand, online incorporation of NLP modules into a statistical information retrieval system. QASTIIR is also pursued in close collaboration with the WHAT and HOPEWELL projects directed by Dr. Ursula Wolz of The College of New Jersey and Dr. Lilian N. (Boots) Cassel of Villanova University. These three multi-year projects are jointly funded (under the name "Purpose-Driven Natural Language Processing") by the National Science Foundation under an EIA grant.
    A problem with purely statistical QA systems is that they can have trouble distinguishing between similar but significantly distinct queries that share the key terms (e.g. “List congressmen voting for the 1999 Energy Cost Bill” vs. “List congressmen voting for all but the 1999 Energy Cost Bill”). Moreover, queries asking for exactly opposite information can share the key terms, which often leads to the retrieval of the same documents from the collection (e.g. “List congressmen voting for the 1999 Energy Cost Bill” vs. “List congressmen not voting for the 1999 Energy Cost Bill”). While purely statistical approaches have problems resolving subtle distinctions in meaning, purely linguistic (NLP) systems become unworkable when dealing with large text corpora. Recognizing the inadequacy of exclusive approaches, most current research concentrates on hybrid systems that combine linguistic and statistical techniques. The statistical and linguistic components of our hybrid system QASTIIR cooperate closely, each acting only on the subtasks for which it is found to be most suitable for the given user request. The statistical component is an IR system (SIRS), based on the well-known SMART system ([Salton]). It is responsible for handling the large text corpora and for fast retrieval of the documents, paragraphs and sentences that contain the query terms in high frequencies. QASTIIR’s NLP component consists of modules that can perform semantic parsing of the query and of the relevant paragraphs and sentences retrieved by the statistical component (the answer candidates). The final outcome of the processing is a semantic representation of the query together with semantic representations of all of the answer candidates retrieved earlier by SIRS. Subsequently, QASTIIR applies a metric that measures the proximity in meaning between the query’s semantics and that of each answer candidate, establishing a refined ranking.
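    The two-stage cooperation described above can be sketched roughly as follows. This is a minimal illustration and not the SIRS or QASTIIR implementation: the function names are hypothetical, the first stage is a plain term-count cosine ranking standing in for a SMART-style retriever, and the second stage is a toy polarity check standing in for a real semantic proximity metric. It only handles simple negation words and would not resolve the “all but” example.

```python
import math
import re
from collections import Counter

# Assumption: a tiny negation list stands in for real semantic analysis.
NEGATIONS = {"not", "no", "never"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def statistical_retrieve(query, sentences, top_k=3):
    """Stage 1 (hypothetical): rank sentences by term-vector similarity."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(s))), s) for s in sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for score, s in scored[:top_k] if score > 0]


def toy_semantic_score(query, candidate):
    """Stage 2 stand-in: 1.0 if query and candidate agree on negation."""
    q_neg = bool(NEGATIONS & set(tokenize(query)))
    c_neg = bool(NEGATIONS & set(tokenize(candidate)))
    return 1.0 if q_neg == c_neg else 0.0


def answer(query, sentences, top_k=3):
    """Retrieve candidates statistically, then re-rank them by meaning."""
    candidates = statistical_retrieve(query, sentences, top_k)
    # sorted() is stable, so ties keep their statistical order.
    return sorted(candidates,
                  key=lambda c: toy_semantic_score(query, c),
                  reverse=True)
```

    With a few made-up sentences about the bill, both the “voting for” and the “not voting for” queries pull the same candidates out of the first stage, and only the second stage separates them, which is the division of labor the paragraph describes.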
    QASTIIR’s architecture is highly modular and dynamic, and as such it facilitates testing alternative cooperative approaches between the statistical and linguistic components. If a query is found to be grammatically deficient and its syntax cannot be parsed successfully, it can be treated as a simple “bag of words” query. Consequently, its semantics will be based on the meanings of the individual phrases in it, and the meaning representations for the answer candidates will have to be of the same kind. Accordingly, the NLP tools involved in processing will not include (deep) parsing, and the context of a phrase will be defined as a “bag of words” consisting of the n words preceding and following the phrase. Furthermore, the similarity metric for meanings will include little more than a vector space analysis. On the other hand, if a query is grammatical and its syntax can be parsed successfully, full semantic parsing will be employed and different (tree-like) semantic representations will be produced for both the query and the answer candidates. The similarity metric will then include a structural analysis of the semantics and syntax, in addition to the vector analysis of the previous case. The whole process of deciding which NLP submodules to employ, where to employ them, and which to omit is carried out dynamically at execution time, after the user query has arrived. In addition, the WHAT tool for user profiling can be incorporated into query analysis when evaluating what level of NLP sophistication seems appropriate for the current user.
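    The run-time choice between the shallow and the full pipeline, and the “n words before and after” context, might look something like the sketch below. Everything named here is an assumption made for illustration: the module names, the `looks_parseable` check (a crude word-list test standing in for a real parser reporting success or failure), and the token-index interface of `context_window` do not come from QASTIIR itself.

```python
from collections import Counter

# Hypothetical module names; the source does not name QASTIIR's components.
SHALLOW_PIPELINE = ["bag_of_words_context", "vector_similarity"]
FULL_PIPELINE = ["semantic_parser", "structural_similarity", "vector_similarity"]

# Toy word list used only by the stand-in parse check below.
TOY_VERBS = {"is", "are", "was", "were", "did", "does", "list", "voted", "signed"}


def looks_parseable(tokens):
    """Stand-in for a real parser's verdict: needs 3+ tokens and a known verb."""
    return len(tokens) >= 3 and any(t.lower() in TOY_VERBS for t in tokens)


def select_modules(tokens, parse_check=looks_parseable):
    """Decide, after the query arrives, which modules to plug in."""
    return FULL_PIPELINE if parse_check(tokens) else SHALLOW_PIPELINE


def context_window(tokens, start, end, n):
    """Bag of the n words before and the n words after tokens[start:end]."""
    before = tokens[max(0, start - n):start]
    after = tokens[end:end + n]
    return Counter(before + after)
```

    Passing the parse check in as a parameter keeps the selection logic independent of any particular parser, so a user-profile signal of the kind WHAT provides could be folded into the same decision without changing the callers.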
    The QASTIIR system is built to facilitate the inclusion of new heuristics related to “plug-in, pull-out” techniques, and it presents a novel approach that welcomes experimentation with system components. Additionally, we are devoting significant effort to developing a truly semantic parser, one that moves beyond what traditionally goes by that name but is actually a syntactic parser with a lexicon-based analysis of individual phrase semantics. As a by-product, a corresponding similarity metric for meaning representations (briefly mentioned in the examples above) is also being developed.
    QASTIIR uses established techniques for assessment through the Text REtrieval Conference (TREC) and will be registered for TREC 2005. TREC is an initiative co-sponsored by the National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA) whose purpose is to support research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies. It is overseen by a program committee consisting of representatives from government, industry, and academia. For each TREC, NIST provides a test set of documents and questions. Participants run their own retrieval systems on the data and return to NIST a list of the top-ranked retrieved documents. NIST pools the individual results, judges the retrieved documents for correctness, and evaluates the results. The TREC cycle ends with a workshop that serves as a forum for participants to share their experiences. In addition to this formal evaluation, integrating QASTIIR modules with WHAT tools and applying them to accomplish HOPEWELL task goals will yield valuable real-world feedback from another, significantly different domain.
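    The pooling step in the TREC cycle can be illustrated with a short sketch. The run names and document identifiers in the usage note are invented, and precision at k is used here only as a simple example of a score computed from the assessors' judgments; it is not claimed to be the official measure for any particular TREC track.

```python
def build_pool(runs, depth):
    """Union of the top-`depth` documents from every submitted run.

    `runs` maps a (hypothetical) run name to its ranked list of document ids.
    """
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool


def precision_at_k(ranked_docs, relevant, k):
    """Fraction of the top-k positions filled by documents judged relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc in ranked_docs[:k] if doc in relevant)
    return hits / k
```

    For instance, with two invented runs `{"A": ["d1", "d2", "d3"], "B": ["d2", "d4", "d5"]}` and a pool depth of 2, only d1, d2 and d4 would be sent to the assessors. Documents outside the pool are never judged, so each run is scored against the judged set only.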
    To date, work on the QASTIIR project has produced a number of publications, invited talks and conference presentations ([Martinovic], [SampathMartinovic], [MartinovicSampathWagnerBriening]), and the project has been registered with TREC. It has also earned equipment grant support from the NSF (Grant EIA 0130798), as well as additional support from TCNJ. It is fair to say that QASTIIR has moved forward to an important degree through the efforts of undergraduate independent study researchers. Four student poster presentations were made at the Annual Student Research Symposium at St. Joseph University in Philadelphia in April 2002 ([BloodgoodMartinovic], [GibsonGrapeMartinovic], [LocastoHulmeMartinovic], [TrnkaMartinovic]), and a student researcher has been accepted to NASA's Summer Information Technology Intelligence Task Program at the "Information System Technology Intelligence Center" at Goddard Space Flight Center. Two students working on QASTIIR were also awarded a 2001-2002 CREW Grant for their work on this project. In addition, two of our current students working on QASTIIR have co-authored with the principal investigator a paper just accepted for publication at the NLDB 2003 conference.



Progress and Results


 

    Our efforts have resulted in a number of published papers: “A Multilevel Text Processing Model of Newsgroup Dynamics : Implementation and Results”, to be presented at the 8th International Conference on Applications of Natural Language to Information Systems (NLDB 2003, Cottbus, Germany); “A Multilevel Text Processing Model of Newsgroup Dynamics”, presented at the 7th International Conference on Applications of Natural Language to Information Systems (NLDB 2002, Stockholm, Sweden); and “Integrating Statistical and Linguistic Approaches in Building Intelligent Question Answering Systems”, presented at the SSGRR Winter 2002 International Conference on Advances in Infrastructure for e-Business, e-Education, e-Science, and e-Medicine on the Internet (L’Aquila, Italy). In addition, the paper “Transforming A Word Conflation Algorithm into A Minimal Stem Algorithm”, with its accompanying Morphological Normalizer software module, is being prepared for publication. Our QA system will be registered for evaluation at the TREC 2005 conference, and our present efforts are concentrated on incorporating a semantic analysis component into it and testing it before the TREC 2005 deadlines. Further and deeper experimentation with the semantic component remains a long-term rather than a short-term goal.
    In addition, a great amount of student research has been done under the guidance of Dr. Martinovic, all geared towards investigating and implementing different components and aspects of QASTIIR. Work on the morphological component of QASTIIR earned Emily Gibson and Christina Grape a one-year CREW (Collaborative Research Experience for Women in Undergraduate Computer Science and Engineering) 2001-2002 Grant. Emily Gibson was also accepted to NASA's Summer Program for an Information Technology Intelligence Task at the "Information System Technology Intelligence Center" at Goddard Space Flight Center. Michael Bloodgood, another member of our group, conducted research on representation systems for semantics and state-of-the-art semantic parsers. His work earned him another local award, the Phi Kappa Phi Student-Faculty Research Scholarship. The presentations of Emily Gibson, Christina Grape, Michael Hulme, Michael Locasto, Keith Trnka and Michael Bloodgood were all accepted for the 2002 Saint Joseph University Student Research Symposium and published in the Proceedings.


Publications

[CasselWolz1]
    Cassel, L., Wolz, U.,
    "Client-side Personalization",
    Proceedings of the Joint DELOS-NSF Workshop on Personalization and Recommender Systems  in Digital Libraries,
    June 18-21, 2001, Dublin Ireland.

[CasselWolz2]
    Cassel, L., Wolz, U.,
    "Individual user support in context-aware web searching",
    presented at the Workshop on User Modeling for context-aware applications, in conjunction with UM2001,
    July 13-17, 2001, Sonthofen, Germany.

[Martinovic]
    Martinovic, M.
    “Integrating Statistical and Linguistic Approaches in Building Intelligent Question Answering Systems”,
    Proceedings of the SSGRR 2002 International Conference on Advances in Infrastructure for e-Business,
        e-Education, e-Science, and e-Medicine on the Internet,
    L’Aquila, Italy, January 2002.

[MartinovicSampathWagnerBriening]
    Martinovic, M., Sampath, G., Wagner, R., Briening, S.
    “A Multilevel Text Processing Model of Newsgroup Dynamics : An Implementation”
    In Proceedings of NLDB 2003 Conference, Cottbus, Germany, June 2003.

[SampathMartinovic]
    Sampath, G., Martinovic, M.
    “A Multilevel Text Processing Model of Newsgroup Dynamics”,
    Proceedings of NLDB 2002 Conference, Stockholm
    (Lecture Notes in Computer Science, Volume 2553, Springer-Verlag, 2002).


Student Work References

[BloodgoodMartinovic]
    Bloodgood, M.E., Martinovic, M.
    “Semantic Structures and Natural Language Semantic Parsers: A Case Study”,
    In Proceedings of 13th Annual Student Research Symposium: 7, St. Joseph University, Philadelphia, April 2002.

[GibsonGrapeMartinovic]
    Gibson, E., Grape, C., Martinovic, M.
    “Design and Development of a Word Conflation Module and its Evaluation by Integration into an Information Retrieval System”
    In Proceedings of 13th Annual Student Research Symposium: 33, St. Joseph University, Philadelphia, April 2002.

[LocastoHulmeMartinovic]
    Locasto, M.E., Hulme, M.J., Martinovic, M.
    “QASTIIR: Designing and Developing a Dynamic IR Engine”
    In Proceedings of 13th Annual Student Research Symposium: 60, St. Joseph University, Philadelphia, April 2002.

[MartinovicSampathWagnerBriening]
    Martinovic, M., Sampath, G., Wagner, R., Briening, S.
    “A Multilevel Text Processing Model of Newsgroup Dynamics : An Implementation”
    In Proceedings of NLDB 2003 Conference, Cottbus, Germany, June 2003.

[TrnkaMartinovic]
    Trnka, K., Martinovic, M.
    “Statistical Document Retrieval in QASTIIR”
    In Proceedings of 13th Annual Student Research Symposium: 116, St. Joseph University, Philadelphia, April 2002.


 

Work Samples and Documentation
 

       SIRS (Statistical Information Retrieval System)

       QASTIIR: Dynamic IR Engine
       QASTIIR How To Document
       QASTIIR Releases
 



Barriers and Opportunities
 

    Our efforts were partially hampered by (1) a lack of funds for financing summer research activities, (2) inadequate release time for research activities during regular semesters, and (3) the slow pace at which the acquired equipment was made available for use. We have addressed the first two points by applying for an additional grant, and the third by promoting improvements in local logistics.

    More broadly, the lack of shared, practical tools for semantic analysis of text has proved to be a major obstacle at the present stage of our research.


 


Further Access

The College of New Jersey Home Page
TCNJ Information Management Home Page
TCNJ Computer Science Department
Villanova University

 


 

Send e-mail to mmmartin@TCNJ.EDU