Semantic Structures and Natural Language Parsers: A Case Study

 

Michael Bloodgood

Dr. Miroslav Martinovic

Department of Computer Science, The College of New Jersey

 

 

Abstract—QASTIIR is part of an NSF-funded, collaborative project between The College of New Jersey and Villanova University. The ultimate goal of this project is the design, development, experimentation, and evaluation of a hybrid Question Answering (QA) system. An important role within this project is played by the semantic analysis of both queries and potential answers. It is within this context that we are conducting this case study. One goal of this study is the identification and evaluation of state-of-the-art semantic parsers that are candidates for implementation as a component of QASTIIR. The meaning of natural language must be converted to a formal structure in the linguistic component of QASTIIR. A second goal of this research is to acquire an understanding of the mathematical and logical techniques and structures that are used to represent meaning. The theoretical basis for using formal structures to represent meaning is considered. Examples of structures that can be used to represent meaning include First Order Logic, Instant Tense Logic, Period Structures, and Event Structures. The Online Lexical Database WordNet is described and its potential for use in QASTIIR is identified.

 

 

I. INTRODUCTION

 

QASTIIR (Question Answering System Through Intelligent Information Retrieval) is part of an NSF-funded, collaborative project between The College of New Jersey and Villanova University. The ultimate goal of this project is the design, development, experimentation, and evaluation of a hybrid Question Answering (QA) system. See Appendix A for a diagram of the architecture of QASTIIR. The architecture is divided into three phases: Statistical Document Selection, Statistical Answer Selection, and Semantics Based Answer Selection.

 

It is essential in the Semantics Based Answer Selection phase that both the query and the potential answers be converted from natural language into a structured, formal representation of meaning. This conversion would enable the implementation of a metrics component that measures the proximity in meaning between sentences. The answer candidate with the highest proximity in meaning to the posed query would be selected as the answer.
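As a rough sketch of the role such a component would play, the following minimal Java fragment shows one possible shape for the interfaces involved. The type and method names (SemanticRepresentation, ProximityMetric, selectAnswer) are hypothetical placeholders of ours, not existing QASTIIR code.

import java.util.List;

/* Hypothetical marker type for a formal representation of meaning. */
interface SemanticRepresentation { }

interface ProximityMetric {
    /* Returns a score; higher means closer in meaning. */
    double proximity(SemanticRepresentation query, SemanticRepresentation candidate);
}

class AnswerSelector {
    /* Selects the candidate whose meaning is closest to the query's. */
    static SemanticRepresentation selectAnswer(ProximityMetric metric,
            SemanticRepresentation query, List<SemanticRepresentation> candidates) {
        SemanticRepresentation best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (SemanticRepresentation c : candidates) {
            double score = metric.proximity(query, c);
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }
}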

 

II. FORMAL STRUCTURES FOR SEMANTICS

 

There are several different formal structures that can be used to represent semantics. Examples include First Order Predicate Logic, Instant Tense Logic, Period Structures, and Event Structures. It is important to recognize that meaning can be formalized in any of a multitude of ways. The difficulty is to formalize the representation of meaning in a way that captures all of our intuitions correctly.

 

First Order Predicate Logic is a popular formal logic. A great advantage of using First Order Predicate Logic is that it satisfies the Completeness Theorem. This theorem states that a first order theory delta derives a first order sentence p if and only if delta entails p. An equivalent formulation is that a first order theory delta is consistent if and only if delta has a model. It is common to split the Completeness Theorem into two parts: soundness and completeness. Soundness tells us that the chosen logic and proof system are correct with respect to the given semantics: every derivation the system can make is semantically valid. Completeness tells us that the logic system is complete with respect to the given semantics: every semantically valid argument can be derived within the system.
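In standard notation, with \vdash denoting derivability and \models denoting entailment, the two halves of the theorem can be written as follows (a standard formulation, added here for reference):

\begin{align*}
\text{Soundness:} \quad & \Delta \vdash p \;\Longrightarrow\; \Delta \models p \\
\text{Completeness:} \quad & \Delta \models p \;\Longrightarrow\; \Delta \vdash p
\end{align*}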

 

A disadvantage of First Order Logic is its lack of expressiveness. We call a class of models M first order definable if and only if M = MOD(p) for some first order sentence p, where MOD(p) is the class of all models on which p is true. We call M generalized first order definable if and only if M = MOD(delta) for some first order theory delta. The two notions coincide when delta is finite, because delta can then be replaced by the conjunction of its sentences.
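These definitions, and the observation about finite theories, can be written out in symbols (with lowercase m ranging over individual models):

\begin{align*}
\mathrm{MOD}(p) &= \{\, m : m \models p \,\} \\
\mathrm{MOD}(\Delta) &= \{\, m : m \models q \text{ for every } q \in \Delta \,\} \\
\mathrm{MOD}(\{p_1, \dots, p_n\}) &= \mathrm{MOD}(p_1 \wedge \dots \wedge p_n)
\end{align*}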

 

Some facts concerning the lack of expressiveness of First Order Logic: finiteness is not first order definable; finiteness is not even generalized first order definable; infinity is not first order definable (though it is generalized first order definable); and the quantifier 'most' is not first order definable.
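To see why infinity is generalized first order definable, consider the theory Delta = {lambda_2, lambda_3, ...}, where each sentence lambda_n asserts that at least n distinct elements exist:

\[
\lambda_n \;=\; \exists x_1 \cdots \exists x_n \bigwedge_{1 \le i < j \le n} x_i \neq x_j
\]

A model satisfies all of Delta exactly when its domain is infinite. By contrast, the compactness theorem rules out any first order sentence or theory defining finiteness.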

 

Second order logic is much more expressive than First Order Logic. For example, finiteness, infinity, and 'most' are all second order definable. A notable disadvantage of second order logic is that it is incomplete: there are sentences in the logic that are true but cannot be proved. The classic example is a sentence asserting "I am not provable in this proof theory." Such a sentence must be true, because if it were provable, the theory would be inconsistent; and it cannot be derived, because then it would be false.
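To illustrate the added expressive power, finiteness can be defined in second order logic by quantifying over functions: a domain is finite exactly when every injective function from the domain into itself is also surjective (Dedekind's characterization, which agrees with ordinary finiteness given the axiom of choice):

\[
\mathrm{FIN} \;=\; \forall f \,\bigl( \forall x \forall y \, (f(x) = f(y) \rightarrow x = y) \;\rightarrow\; \forall y \exists x \, f(x) = y \bigr)
\]

The quantification \forall f over functions is precisely what first order logic lacks.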

 

We can see that higher order logic is attractive because it is more expressive. For computational purposes, however, first order logic is preferred because it allows all semantic phenomena to be dealt with at the level of logical form.

 

 

III. NATURAL LANGUAGE PARSERS

 

A crucial piece of software for performing semantic analysis on natural language is the natural language parser. Currently, the field of broad-coverage natural language parsing is in transition: rigorous, objective, and verifiable evaluation procedures have not yet become established practice, although a beginning has been made. For more information on the evaluation of broad-coverage natural language parsers, see http://cslu.cse.ogi.edu/HLTsurvey/ch13node6.html.

 

Until recently, objective evaluation was essentially not practiced at all. Consequently, even the author of a parsing system had no real idea of how accurate, and hence how useful, the system was. In 1991, the PARSEVAL scheme for syntactically evaluating broad-coverage English-language parsers was introduced, and the next year seven creators of such parsers applied PARSEVAL to their systems, all using the same test data.

 

The PARSEVAL metrics compare a proposed parse P with the corresponding correct treebank parse T as follows:

Precision = (# correct constituents in P) / (# constituents in P)

Recall = (# correct constituents in P) / (# constituents in T)
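Treating constituents as labeled spans over the input sentence, a minimal Java sketch of these two metrics might look as follows; the Constituent type and the set-based representation of parses are our own simplifications:

import java.util.HashSet;
import java.util.Set;

/* A constituent is a labeled span over the input sentence. */
record Constituent(String label, int start, int end) { }

class Parseval {
    /* Precision: fraction of proposed constituents that are correct. */
    static double precision(Set<Constituent> proposed, Set<Constituent> gold) {
        return (double) intersectionSize(proposed, gold) / proposed.size();
    }

    /* Recall: fraction of treebank constituents that the parser found. */
    static double recall(Set<Constituent> proposed, Set<Constituent> gold) {
        return (double) intersectionSize(proposed, gold) / gold.size();
    }

    private static int intersectionSize(Set<Constituent> a, Set<Constituent> b) {
        Set<Constituent> copy = new HashSet<>(a);
        copy.retainAll(b);
        return copy.size();
    }
}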

 

Good semantic parsers that are free for academic research are not widely available. However, there are several excellent syntactic parsers that are free for academic researchers to use. The following are the three natural language parsers that we are most seriously considering for implementation as a component of QASTIIR.

 

1. Proximity Technology, Inc. / CL Research Parser

 

The source code for this parser is available for both PC and Unix systems. The parser is around 8,000 lines of C code, and it has been used in the TREC-8 QA and TREC-9 QA tracks of the Text REtrieval Conference (TREC). TREC, co-sponsored by the National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA), was started in 1992 as part of the TIPSTER Text program. TREC’s purpose is to support research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies. For more information about TREC, see http://trec.nist.gov/overview.html.

 

The parser that CL Research has made available was provided by Proximity Technology, Inc.  Ken Litkowski currently maintains the parser.  The parser is a grammar checker that uses a context-sensitive, augmented transition network grammar of 350 rules, each consisting of a start state, a condition to be satisfied (either a non-terminal or a lexical category), and an end state. Satisfying a condition may result in an annotation (such as number and case) being added to the growing parse tree. Nodes (and possibly further annotations, such as potential attachment points for prepositional phrases) are added to the parse tree when reaching some end states. The parser is accompanied by an extensible dictionary containing the parts of speech (and frequently other information) associated with each lexical entry. The dictionary information allows for the recognition of phrases (as single entities) and uses 36 different verb government patterns to create dynamic parsing goals and to recognize particles and idioms associated with the verbs (the context-sensitive portion of the parser).
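To make the rule structure concrete, the following minimal Java sketch models one arc of such a grammar exactly as described above (a start state, a condition that is either a non-terminal or a lexical category, an end state, and an optional annotation). The AtnRule type is our own illustration, not CL Research's C code.

/* Our hedged model of one ATN grammar rule, following the description above. */
record AtnRule(String startState,
               String condition,       // a non-terminal or lexical category
               String endState,
               String annotation) { }  // e.g., number or case; may be null

class GrammarExample {
    public static void main(String[] args) {
        // A hypothetical rule: from state NP0, consuming a determiner,
        // move to state NP1, annotating the growing parse tree with number.
        AtnRule rule = new AtnRule("NP0", "DET", "NP1", "number");
        System.out.println(rule);
    }
}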

The parser output consists of bracketed parse trees, with leaf nodes describing the part of speech and lexical entry for each sentence word. Annotations, such as number and tense information, may be included at any node. The parser does not always produce a correct parse, but it is very robust: since the parse tree is constructed bottom-up from the leaf nodes, it is possible to examine the local context of a word even when the parse is incorrect. In TREC-9, parsing exceptions occurred for only 69 sentences out of 422,562 (0.0002, down from 0.008), with another 131 "sentences" (usually tabular data) not submitted to the parser. Usable output was available even though 33,467 sentences (7.9 percent) contained at least one word unknown to the parsing dictionary.

 

Source: http://www.clres.com/trec9.html

 

2. Eugene Charniak Parsers

 

Eugene Charniak’s parsers are available to academic researchers for free by FTP. His 1997 parser is a probabilistic bottom-up parser; trained on the Penn Treebank Wall Street Journal (WSJ) corpus, it outperformed the first Magerman and Collins parsers.

The PARSEVAL performance statistics for this parser, where LR denotes labeled recall and LP denotes labeled precision, are as follows:

Sentences <= 40 words: LR = 87.5, LP = 87.4

Sentences <= 100 words: LR = 86.7, LP = 86.6

 

Charniak’s 2000 parser is an improvement over his 1997 parser. The major changes are that it is maximum-entropy-inspired and that it uses Markov grammars instead of the Penn Treebank grammar, with the third-order Markov grammar performing best. This parser also outperforms Collins’ 1999 parser.
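Roughly, a Markov grammar generates the children of a constituent one at a time instead of expanding an entire treebank rule at once, so that rules never seen in training can still be assigned probabilities. In a schematic and simplified form, a third-order Markov grammar estimates

\[
P(X \rightarrow c_1 \cdots c_n) \;\approx\; \prod_{i=1}^{n+1} P(c_i \mid c_{i-1}, c_{i-2}, c_{i-3}, X)
\]

where c_{n+1} is a distinguished stop symbol and out-of-range c_j are boundary markers. Charniak's actual model conditions on further context, such as the head word, so this formula is only an approximation of the idea.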

The PARSEVAL performance statistics for this parser are as follows:

Sentences <= 40 words: LR = 90.1, LP = 90.1

Sentences <= 100 words: LR = 89.6, LP = 89.5

 

Source: http://www.cs.brown.edu/people/ec/

 

3. Michael Collins Parsers

 

Michael Collins’ parsers are available for free to academic researchers.  His parser from 1999 achieved the following PARSEVAL scores. 

PARSEVAL performance statistics for the Collins 1999 (Coll99) parser:

Sentences <= 40 words: LR = 88.5, LP = 88.7

Sentences <= 100 words: LR = 88.1, LP = 88.3

 

This parser takes Part-Of-Speech (POS)-tagged input and outputs a bracketed parse tree. In the output, word/tag pairs are separated by ‘/’, and punctuation marks have their POS tag preceded by “PUNC”. Non-terminals have the following format:

non-term-label~headword~total#_of_children~constituent_#

where constituent_# is the number of the child from which the headword is taken, using 1-based indexing in which punctuation marks are not counted as children.
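For example, under this scheme a hypothetical noun phrase whose headword ‘dog’ is taken from the second of its two children would carry the label:

NP~dog~2~2

that is, an NP with headword ‘dog’, two children, and the head taken from child 2.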

 

Charniak’s parser has limited documentation and support, so it is unlikely that we will use it. We are currently in communication with Ken Litkowski and Michael Collins concerning the use of their natural language parsers.

 

IV. INCORPORATION OF AN ONLINE LEXICAL DATABASE: WORDNET

 

The parsers mentioned above perform excellent syntactic analysis of the text under consideration. However, they do not perform extensive semantic analysis. In order to perform Word Sense Disambiguation (WSD) and other semantic analyses, it will be helpful to utilize an Online Lexical Database.

 

WordNet is an Online Lexical Database that was developed by the Cognitive Science Laboratory at Princeton University under the direction of Professor George A. Miller. The noun hierarchy used by WordNet is straightforward for a computer scientist to understand because it is essentially an inheritance hierarchy like those found in object-oriented programming. A hyponym is a subordinate and a hypernym is a superordinate: for example, ‘tree’ is a hyponym of ‘plant’, and ‘plant’ is a hypernym of ‘tree’. Word forms are grouped into sets of synonyms called synsets. A lexical matrix has word forms as column headers and word meanings as row headers; mappings between word forms and word meanings are many to many. Synonymy occurs when a word meaning has more than one word form, and polysemy occurs when a word form has more than one word meaning. Meronymy is the part-whole relation. For example, ‘beak’ is a meronym of ‘bird’ because a beak is part of a bird; ‘wing’ is also a meronym of ‘bird’; and ‘bird’ is a holonym of both ‘beak’ and ‘wing’.
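The many-to-many structure of the lexical matrix is easy to model in code. The following self-contained Java sketch (our own toy illustration, independent of WordNet's actual data structures) keeps the form-to-meaning mapping in both directions, so that polysemy and synonymy fall out as the two directions of lookup:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/* A toy lexical matrix: word forms x word meanings, many-to-many. */
class LexicalMatrix {
    private final Map<String, Set<String>> formToMeanings = new HashMap<>();
    private final Map<String, Set<String>> meaningToForms = new HashMap<>();

    void add(String form, String meaning) {
        formToMeanings.computeIfAbsent(form, k -> new HashSet<>()).add(meaning);
        meaningToForms.computeIfAbsent(meaning, k -> new HashSet<>()).add(form);
    }

    /* Polysemy: one word form, several meanings. */
    Set<String> meaningsOf(String form) {
        return formToMeanings.getOrDefault(form, Set.of());
    }

    /* Synonymy: one meaning, several word forms (a synset). */
    Set<String> synsetOf(String meaning) {
        return meaningToForms.getOrDefault(meaning, Set.of());
    }

    public static void main(String[] args) {
        LexicalMatrix m = new LexicalMatrix();
        m.add("board", "plank");      // 'board' is polysemous
        m.add("board", "committee");
        m.add("plank", "plank");      // 'plank' and 'board' share a meaning
        System.out.println(m.meaningsOf("board")); // {plank, committee}, order may vary
        System.out.println(m.synsetOf("plank"));   // {board, plank}, order may vary
    }
}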

 

Adjectives are organized in a different manner than nouns are. They are organized into clusters that are related to each other by antonymy. Adjectives can be descriptive, reference-modifying, color, or relational. Descriptive adjectives are those that ascribe a value of an attribute to a noun. For example, in the sentence ‘The package is heavy.’, ‘heavy’ is a descriptive adjective: it assigns a value to the attribute weight of the package. Adjectives like ‘former’, ‘present’, ‘alleged’, and ‘likely’ are examples of reference-modifying adjectives. Color adjectives (e.g., ‘blue’, ‘red’) must be organized differently because the pattern of direct and indirect antonymy observed for other descriptive adjectives does not hold for them. Relational adjectives are those like ‘fraternal’, ‘dental’, ‘musical’, ‘atomic’, ‘criminal’, ‘civil’, and ‘chemical’; examples of their use include ‘civil engineer’ and ‘electrical engineer’. Since relational adjectives do not have antonyms, they cannot be incorporated into the clusters that characterize descriptive adjectives. Also, because their syntactic and semantic properties are a mixture of those of adjectives and those of nouns used as noun modifiers, WordNet maintains a separate file of relational adjectives with pointers to the corresponding nouns rather than attempting to integrate them into either structure.

 

Lexical inheritance underlies the semantic relations between nouns, and bipolar oppositions organize the adjectives in WordNet. Similarly, the different relations that organize the verbs can be cast in terms of one overarching principle, lexical entailment. Just as hyponymy serves to organize the nouns, troponymy (from the Greek tropos, meaning manner or fashion) serves to organize the verbs. The troponymy relation between two verbs can be expressed by the formula: to V1 is to V2 in some particular manner. For example, to march is to walk in a particular manner, so ‘march’ is a troponym of ‘walk’.

The WordNet system was developed on a network of Sun-4 workstations.  The software programs and tools are written using the C programming language, Unix utilities, and shell scripts.  To date, WordNet has been ported to the following computer systems: Sun-3;  DECstation; NeXT; IBM PC and PC clones; and Macintosh. 

 

For each syntactic category, two files represent the WordNet database: index.pos and data.pos, where pos is one of noun, verb, adj, or adv. The database is in an ASCII format that is both human- and machine-readable, which also makes it easily accessible to those who wish to use it in their own applications.

 

A typical line of data in an index file contains the following information: the sense count from the on-line dictionary; a list of the relational pointer types used in all synsets containing the word; and a list of indices which are byte offsets into the corresponding data file, one for each occurrence of the word form in a synset. 

 

A typical line of data in a data file contains the following information: the byte offset of the synset; an integer, corresponding to the location in the list of file names of the file from which the synset originated; a list of word forms, relational pointers, and verb frames; and an optional textual gloss. 

 

To assist with the creation of programs that use WordNet to automatically process natural language texts, the WordNet software suite includes functions that give WordNet some intelligence about English morphology. Morphy, the set of WordNet morphological processing functions, handles a wide range of morphological transformations; for example, given an inflected form such as ‘dogs’, it can recover the base form ‘dog’ found in the database.

 

There are many different interfaces to WordNet, including Python, Lisp, Perl, and Java interfaces. QASTIIR is being implemented in Java, so it is most feasible for us to use a Java interface to WordNet. Many of the Java interfaces are open source and are documented on http://sourceforge.net. QASTIIR could use one of these Java interfaces to extract semantic information (POS, definitions, number of senses, etc.) about words from WordNet. JWordNet is a pure Java, standalone, object-oriented interface to WordNet that can be used to write portable Java applications that use a local copy of the WordNet files. Appendix B gives an example of using JWordNet to retrieve the senses of the noun ‘dog’.

 

V. CONCLUSIONS AND FUTURE WORK

 

Conversion of natural language to a formal representation of meaning is a critical factor in determining the successful performance of the linguistic components of QASTIIR.  An understanding of formal structures that can be used for representing semantics will be extremely valuable in developing linguistic components for QA systems.  First Order Logic is preferred over higher order logic for computational tasks because everything can be dealt with at the level of logical form. 

 

A semantic parser would be an ideal software component for converting natural language to a structured language. However, objectively verified semantic parsers are not widely available for free. There are many excellent, objectively verified syntactic parsers available to academic researchers, although these perform only very limited semantic analysis. The two that we are most interested in at this point are the CL Research parser and Collins’ 1999 parser.

 

To perform more sophisticated semantic analysis, it will be necessary to incorporate an Online Lexical Database such as WordNet.  There are many Java interfaces to WordNet that are open source and can be used with QASTIIR.  JWordNet is one of the most promising Java interfaces that we have examined so far. 

 

Our next steps are to get at least one of the syntactic parsers compiled and running. We will also use one of the Java interfaces to WordNet to help perform more sophisticated semantic analysis, such as Word Sense Disambiguation (WSD). Finally, we will design, develop, and implement a metrics component that will measure the proximity in meaning between sentences.

 


APPENDIX A: The Three-Phase Architecture of QASTIIR

 

Phase One: Statistical Document Selection

[architecture diagram not reproduced]

Phase Two: Statistical Answer Selection

[architecture diagram not reproduced]

Phase Three: Semantics Based Answer Selection

[architecture diagram not reproduced]

APPENDIX B: Retrieving the Senses of the Noun ‘dog’ Using JWordNet

 

The following is an example of the code one would use to print the senses of the noun ‘dog’ to the console using the JWordNet interface to WordNet. 

 

Example:

 

import edu.brandeis.cs.steele.wn.*;

/* This prints the senses of the noun 'dog' to the console. */
public class Main {
    public static void main(String[] args) {
        // Open the database from its default location (as specified
        // by the WNHOME and WNSEARCHDIR properties).
        // To specify a pathname for the database directory, use
        //   new FileBackedDictionary(searchDir);
        // To use a remote server via RMI, use
        //   new FileBackedDictionary(RemoteFileManager.lookup(hostname));
        DictionaryDatabase dictionary = new FileBackedDictionary();
        IndexWord word = dictionary.lookupIndexWord(POS.NOUN, "dog");
        Synset[] senses = word.getSenses();
        int taggedCount = word.getTaggedSenseCount();
        System.out.print("The " + word.getPOS().getLabel() + " "
                + word.getLemma() + " has " + senses.length + " sense"
                + (senses.length == 1 ? "" : "s") + " ");
        System.out.print("(");
        if (taggedCount == 0) {
            System.out.print("no senses from tagged texts");
        } else {
            System.out.print("first " + taggedCount + " from tagged texts");
        }
        System.out.print(")\n\n");
        for (int i = 0; i < senses.length; ++i) {
            Synset sense = senses[i];
            System.out.println("" + (i + 1) + ". " + sense.getLongDescription());
        }
    }
}

 

 


ACKNOWLEDGEMENT

 

The Phi Kappa Phi Honor Society Chapter of The College of New Jersey generously supported this research through their Spring/2002 Student-Faculty Research Scholarship. 

 

REFERENCES

 

Beckwith, Richard, Miller, George A., and Tengi, Randee.  “Design and Implementation of the WordNet Lexical Database and Searching Software”.  1993. 

 

Black, Ezra.  Evaluation of Broad-Coverage Natural-Language Parsers website:  http://cslu.cse.ogi.edu/HLTsurvey/ch13node6.html.

 

Charniak, Eugene.  “A Maximum-Entropy-Inspired Parser”.  Proceedings of NAACL-2000. 

 

Charniak, Eugene.  “Statistical Parsing with a Context-free Grammar and Word Statistics”.  Proceedings of the Fourteenth National Conference on Artificial Intelligence.  AAAI Press/MIT Press, Menlo Park, CA, 1997, 598-603.

 

CL Research website: http://www.clres.com

 

Eugene Charniak’s website: http://www.cs.brown.edu/people/ec

 

Fellbaum, Christiane.  “English Verbs as a Semantic Net”.  1993.

 

Fellbaum, Christiane, Gross, Derek, and Miller, Katherine.  “Adjectives in WordNet”.  1993. 

 

JWordNet website: http://jwn.sourceforge.net

 

Landman, Fred.  Structures for Semantics.  Kluwer Academic Publishers: Boston, 1991. 

 

Litkowski, Kenneth C.  “Syntactic Clues and Lexical Resources in Question-Answering”. 

 

Martinovic, Miroslav.  “Integrating Statistical and Linguistic Approaches in Building Intelligent Question Answering Systems”.  SSGRR2002w Proceedings.

 

Michael Collins’ web site: http://www.cis.upenn.edu/~mcollins

 

Miller, George A., Beckwith, Richard, Fellbaum, Christiane, Gross, Derek, and Miller, Katherine.  “Introduction to WordNet: An On-line Lexical Database”.  1993. 

 

Miller, George A.  “Nouns in WordNet: A Lexical Inheritance System”.  1993.

 

TREC website: http://trec.nist.gov

 

WordNet website: http://www.cogsci.princeton.edu/~wn