An Introduction to
Computational Linguistics





 
 
 
 
Miroslav Martinovic

Computer Science Department
The College of New Jersey
mmmartin@tcnj.edu


 
 
 




Computational Linguistics (CL) is an interdisciplinary field centered around the use of computers to process or produce human language. It is a discipline between linguistics, computer science and mathematics. In it, mathematics and linguistics contribute an understanding of special properties of language data, as well as theories and descriptions of language structure and use, while computer science provides theories and techniques for designing and implementing computer systems. CL is largely an applied field.


What
        is the ultimate goal of the CL field (HOLY GRAIL of the CL) :
                Building a computer system that could understand and produce human language
                as well as humans can. Its applications would be endless.


Why
        to use computers and not humans to process or produce language (MAJOR REASONS) :
                Humans are often unavailable, too slow, too expensive or too busy doing tasks they do better
                than machines.


When and where
        to use computers to process or produce language (MAJOR APPLICATION AREAS)  :
                Translating from one language to another (Machine Translation)
                Finding releveant documents in large collections of text (Information Retrieval)
                Answering questions about a subject area (Expert Systems with Natural Language Interfaces)
                Helping humans in learning languages spoken by other humans (Computer-Assisted Language Learning).


How
        to get computers to process or produce language (MAJOR APPROACHES) :
        Simulation
                        Build a psychologically valid working model of human thinking that includes language
                        understanding and use. Then, use computer to implement the model in the most
                        efficient manner possible. This proves to be a vast and very far from solved problem,
                        currently broken into many subproblems addressed in Cognitive Science, Artificial
                        Intelligence, Biology, Psychology and Linguistics.
        Emulation
                        Doesn't need human brain understanding or any resemblance of how it works. Attempts
                        to do as well as a human on tasks involving language. Still an enormous task that needs to
                                handle reasoning,
                                understand how the language works, and
                                provide an encyclopedic knowledge.


The enormousity of the problem at hand suggests to either :
            break it down into
    subproblems with simpler computational models
                        (i.e. into phonetics, phonology, lexicon, morphology, syntax, semantics, pragmatics, etc.)
                        and focus on computational models for them, or
            build the
    systems that learn.
                        Instead of trying to emulate the linguistic capabilities of an adult (with an adult's
                        encyclopedic knowledge), model the knowledge of an infant and work on
                        designing systems that learn, or
            focus on
    specific tasks in restricted domains
    Instead of a wide variety of situations and wide variety of tasks, concentrate on
                        more concrete problems (i.e. system summarizing bank telexes or system summarizing
                        radiology reports), or
            look at
    shallow approaches
     where useful tasks with language data are acceptable results (i.e. automatic correction
                        of OCR errors or identifying the topics of documents for information retrieval and
                        document classification tasks).


Computational linguists are involved in all of the above areas of research. They're expected to have an
understanding of the language and language data properties, as well as an understanding of design,
implementation and computational techniques and issues.



 

Appeals of the CL as a field :
        - Human language is a most exciting and demanding puzzle,
        - Computer science is a most growing and expanding field,
        - Language processing using computers is an important part of CS applications,
        - Prospects look good for the future.
 

    CL References and Professional Associations :
 

see also:






 
Natural Language Processing (NLP)

  Goal : Take some text (speech) input and extract meaning from it or given the meaning, express it in a textual (speech) form.

Aspects of a language addressed : Lexics, Syntax, Semantics, Discourse, Pragmatics, etc...
 



 

A starting point for most of the nowadays NLP systems is to build an electronic Lexicon. Usually, the

    minimal requirement is that :
        each word in the lexicon has a part of speech (POS) tag associated with it ( i.e. the / DT for
        the determiner the with the tag DT, or point / NN for the singular noun point with the tag NN.)).
        Often, what is also included are

    additional requirements to :
        add morphological information (i.e. cacti / NNS / 1 us indicating that the word cacti is a plural
        noun (tag NNS) with the morphological information 1 us indicating that its nominative singular form
        (cactus) is obtained by deleting 1 of the last characters from the inflected form (cacti) and adding
        characters us) and
        add probability (statistical) information (i.e.  heat / NN / _  / 0.83 indicating that heat could be
        a singular noun (NN) (like in "This heat is unbearable."). Its probability of appearing in a text is 0.83.
        Or as in heat / VB / _  / 0.05, heat is a verb in its base form (VB) (like in "Heat oil in a large pot!")
        with its probability of  0.05. The probability information is something that normally appears only in
        electronic lexicons and is derived from a very large text corpora used for training.

In the next phase, once the lexicon is built and tested, programs called POS Taggers are designed
and developed in order to be used in determining parts of speech in a text to be analyzed. They most often
are built to include :

    (i) lexical tagging
        where words are initially tagged with their most likely POS regardless of context (merely a lexicon
        look up and assignment of the tag with the highest probability),
    (ii) unknown word tagging
       where words not found in the dictionary are assigned a POS based on their ending or other simple
        characteristics,
    (iii) contextual rules
        where the set of simple rules are applied to the previously obtained POS sequence. They modify
        the POS of each word based on the POS of the surrounding words. In the following example :
 
 
 I  took a left turn  .
pro verb det noun verb
adjective noun
verb
adverb

        the POS tags for each word are listed (top to bottom) in the order from their highest to lowest
        probabilities (as found in the dictionary). The correct choices are listed in the bold case. The
        example shows that lexical tagging made two errors (on left and turn words) because it dealt
        only with the individual words and not the context. This was corrected in the phase (iii) using the
        contextual rule that changes verb tag into noun  tag if one of the previous two tags is a det tag
        (turn), and another contextual rule that changes noun tag into adjective  tag if the next tag is a
        noun tag (left). The contextual rules are often used for the POS disambiguation like in the example
        above.

In a real life, all three modules of a POS TAGGER are learned from a training corpus of correctly tagged
sentences (i.e. 500K words).


In the syntactic analysis phase of an NLP system, a program called Parser is built. This program,
written in some computer language, will use a formal grammatical descritpion of a natural language to
analyze its input. Tools and theories needed in this phase are :
        (i)            a theory of grammar and a grammar formalism
        (ii)           a grammar of the language to be analyzed
        (iii)          a lexicon, containing the words of the language
        (iv)          a parser which interprets the grammar and uses it to assign a syntactic description to the input.

Grammar formalisms (generally descended from Transformational Grammar) to choose from :
        (GB)         Government-Binding Theory
        (MT)        Minimalist Theory
        (GPSG)    Generalized Phrase Structure Grammar
        (HPSG)    Head-Driven Phrase Structure Grammar
        (LFG)       Lexical Functional Grammar
        (DCG)      Definite Clause Grammar (unification-based CFGs with non-terminals allowed to have arguments)
        etc., etc...

There is a vast literature on parsing theory and the design and implementation of parsers for computer programming languages. This is what we need to do in NLP as well, but the syntax of natural language input is much more difficult to describe than the syntax of (say) Perl statements, and designing efficient and 'robust' parsers for natural language input is not an easy task. However, a variety of parsers have been developed, and one can  identify some general types:
        (TD)        top-down,
        (BU)        bottom-up,
        (CP)        chart parsers,
to name a few (for a useful introduction, see James Allen, "Natural Language Understanding"). In terms of
the programming languages in which parsers for natural language are implemented, the languages of choice
in the past have been LISP and Prolog. Of course it would be possible to write a parser in any other main
stream programming language (C++ or Java), but it is easier to develop a parser in a higher-level language
like LISP or Prolog, languages which are commonly used in Artificial Intelligence research. We'll show
here some segments from Prolog, a logic-programming language.

To make the discussion more concrete, let's consider a very simple TG-style grammar for a fragment of
English. Here are some phrase-structure rules: rules that describe the structure of phrases of English.

A grammar for a fragment of English

1. S -> NP VP
2. NP -> DET N
3. NP -> N
4. VP -> V
5. VP -> V NP
6. DET -> the
7. N -> dog
8. V -> eats

These rules can be read: 'A sentence is composed of a noun phrase followed by a verb phrase'; 'A noun
phrase can be a determiner and a noun', 'A determiner can be the word the', and so on. Rules (6-8) are
known as lexical rules, since they expand lexical categories. Since we are assuming that there is a
separate lexicon in which words are stored with their part of speech, we will replace these with rules
that refer to the lexicon.

Same grammar for a fragment of English

1. S -> NP VP
2. NP -> DET N
3. NP -> N
4. VP -> V
5. VP -> V NP
6. DET -> (any lexical item that is in the lexicon as a determiner)
7. N -> (any lexical item that is a noun)
8. V -> (any lexical item that is a verb)

In linguistics we often think of such rules as generating grammatical sentences of a language. So to
generate a sentence of English, we apply the rules one at a time until we have a nothing left but
sequence of words, and this process also gives us a parse tree. For example, we begin with a S,
and expand that as NP and VP; we apply the first NP rule, to get a determiner and a noun; we
apply the determiner rule, to get the, and so on. We can represent this process of applying the
rules as a parse tree:
 

                                                            S
 

                                    NP                                            VP
 
 

                   DET                        N                                V
 
 

           the                        dog                            eats
 
 

There is a linear representation of a parse tree called a 'labelled bracketing': this is similar to the
representation used in the Penn Treebank article. The Penn Treebank project built a parsed
corpus: a corpus of text in which words are tagged for part of speech, and phrases are
bracketed and labelled for syntactic category. A traditional labelled bracketing for The dog
eats, corresponding to the parse tree above, would be:

[    s    [    np    [    det the    ]    [    n dog    ]    ]    [    vp    [    v eats    ]    ]    ]     or
 

[
    s
            [
                np
                        [    det the    ]    [    n dog    ]
                                                                          ]
                                                                                [    vp
                                                                                             [    v eats    ]
                                                                                                                    ]
                                                                                                                        ]

Or using SGML and the Penn Treebank tagset, the representation will look something like this:

<s> <np> <DT>the</DT> <NN>dog</NN> </np><vp> <VBZ> eats </VBZ> </vp> </s>

We usually think of a grammar as generating the sentences of a language, but obviously a grammar is also used for analyzing sentences of a language. So we can write a parser that will use our grammar to process the sentence The dog eats (not a sentence that is likely to occur in the Wall Street Journal corpus, but it will do for illustration).

In designing a parser, we must decide on a systematic strategy for applying the grammar rules to the input. One such strategy is very similar to the method we used for generating the sentence: begin with the rule for the type of constitutent we expect to end up with (S). Take the left-most constituent (NP), and pick the first rule that expands that constituent. Keep choosing the first applicable rule and expanding the left-most constitutent until a lexical node is reached; at that point, check to see if the next word in the input is of the right type. If it is, process the next unexpanded constituent. The general strategy is to apply the rules in order, from top to bottom, moving from left to right.

Parsing 'the dog eats' using a top-down left-to-right strategy

           Input: the dog eats.

  1. apply rule S -> NP VP.

  2. Choose the left-most constituent that has not yet been processed (NP).
  3. apply the first NP rule, NP -> DET N. Choose the left-most constituent (DET).
  4. apply the first DET rule. This is a lexical node.

  5. Is the next word on the input (the) a determiner? Yes.
    Remove this word from the input.
     

    Input: dog eats

    Choose the next unexpanded constituent from the last phrase (N).

  6. apply the first N rule. This is a lexical node.

  7. Is the next word on the input (dog) a noun? Yes.
    Remove this word from the input.

    Input: eats

    Since we have completely expanded all constituents in the NP rule, get the next constituent from the higher phrase (VP).

  8. apply the first VP rule, VP -> V.

  9. Choose the left-most constituent that has not yet been processed (V).
  10. apply the first V rule. This is a lexical node.

  11. Is the next word on the input (eats) a verb? Yes.
    Remove this word from the input.

    Input: [nothing left]

    There are no more constituents to expand in the VP rule.
    There are no more constituents to expand in the S rule.
    There are no words left to process in the input.
    Done!

We have described a a top-down left-to-right parsing strategy, and we have walked through the parse of the sentence The dog eats.

Jumping a one step further and using the conveniencies (built in interpreter for Definite Clause Grammars
(DCGs)) of Prolog, we can write a DCG that will besides accepting or rejecting a sentence also build its
derivation tree (parse tree).

First we rewrite the same grammar as a DC grammar :

s --> np, vp.
np --> det, n.
np --> n.
vp --> v.
vp --> v, np.
det --> [Word], {lex(Word,'DT')}.
n --> [Word], {lex(Word,'NN')}.
v --> [Word], {lex(Word,'VBZ')}.

As it's obvious, each DC grammar rule ends in a period, and constituents within a rule are separated by commas. The rewrite symbol, '->', becomes '-->' in a DCG. Syntactic categories in a DCG must be in lowercase letters because they will become Prolog predicates, and in Prolog initial caps are reserved for variable names. With these exceptions, the DCG rules are almost identical to the phrase structure rules for non-lexical nodes.

In the lexical nodes (det, n, v), the square brackets in a DCG represent an instruction to take a word off the input. Curly braces are used to add Prolog code to a DCG: in this case, the code inside the curly braces tests whether the current word is in the lexicon with the expected part of speech. To see how these rules are translated, you can ask for a listing of the predicates at the Prolog prompt, e.g.:

?- listing(np).

Here's how the Prolog interpreter translates the first NP rule and the DET rule:

np(A,B) :-
    det(A,C),
    n(C,B).

det(A,B) :-
    'C'(A,C,B),
    lex(C,'DT').

As one can see, each of the syntactic categories in the DCG translates into a predicate with two arguments. The first argument is the list of words to be parsed, and the second is what's left over after parsing. If we were parsing the sentence The dog eats, the input to the NP rule would be [the,dog,eats], and what is left over would be [eats] :

?- np([the,dog,eats],X).
        X = [eats]

If we are parsing a complete sentence, we want nothing left over. So we would call the S rule with input [the,dog,eats], and we would specify that the output should be the empty list :

?- s([the,dog,eats],[]).
        Yes

To summarize: Prolog provides a top-down left-to-right parser, in the form of an inference engine. It also provides a facility that automatically converts grammar rules into Prolog statements. To parse a sentence, we first load the grammar rules and lexicon into Prolog. We then call the appropriate grammar rule with a list of words as input. Prolog returns Yes or No.

However, in computational linguistics, a question is usually not 'Is this sequence of words a sentence? (yes or no)', but rather 'What is the syntactic analysis of this sentence?' or to rephrase it "the goal is to Produce a Parse Tree".

The desired outcome would look somehow along these lines:

?- s(Tree,[the,dog,eats],[]).
        Tree = '<s> <np> <DT>the</DT> <NN>dog</NN>
</np> <vp> <VBZ> eats </VBZ> </vp> </s>'

The simplest way to accomplish this is to build up the parse tree as we go along, and pass it around as a list. Each rule can contribute a piece of the parse tree. As you might imagine, to get a parse tree back from our grammar, we need to add another argument to each rule. For example:

Before, we had :
s --> np, vp.

After adding parse tree arguments, we got:
s(S) --> np(NP),vp(VP),
{S = ['<s>',NP,VP,'</s>']}.

If we modify the np rules and the vp rules to return their parse trees, then at the end we need only put these two trees together to get the syntactic representation for the whole sentence, as above. Since the basic Prolog data structure is a list, what we will end up with is a list of lists, like this :

?- s(Tree,[the,dog,eats],[]).
        Tree =
[<s>,[<np>,[<DT>,the,</DT>][<NN>,dog,</NN>],</np>],[<vp>,[<VBZ>,eats,</VBZ>],</vp>],</s>]

With a little work, we could write some Prolog code to display this as a tree diagram. Since it is a list of lists, and since lists are the basic data structure in Prolog, we can use one of many Prolog utilities to display a list, e.g. pretty_print (in Gazdar and Mellish,  "Natural Language Processing in Prolog", along with many other useful utilities).


The semantic analysis phase of a NLP system has as its objective to obtain the meaning
(semantics) or in other words to determine what a sentence means. Here, the researchers
find it desireable (or even necessary) to devise a formal language with a simple semantics into which
natural language can conveniently be mapped. The formal language should
    (i)        be unambiguous,
    (ii)       have simple rules of interpretation and inference, and in particular
    (iii)      have a logical structure determined by the form of the sentence.
The predicate logic, predicate logic with restricted quantification, and semantic nets are (just to name a few) some of the systems often used to represent the meaning. However, notions of modality, tense and belief, as well as presupposition and fuzziness are not captured by the formalism of the predicate calculus and its equivalents. Sentences like "I believe the price of a Big Mac is $10", "John is young" fall into the uncaptured categories. Adding modal (possible(F), necessary(F)), epistemic (believe(x,F)) and temporal (true_at_some_time_in_the_future(F)...) operators covers some of the mentioned insufficiencies.

So far, we seemed to have suggested that two levels of representation - syntactic and semantic - are processed at different phases in a NLP system. First, a syntactic structure is created (with or without the syntactic and sometimes also semantical (selectional) constraints). Then, it is translated into a semantic representation. However, the largest "school" of computational linguists (started by R. Schank) adopts the approach that "since the ultimate objective is the generation of a semantic representation, it should be created directly using the syntactic constraints only to guide the process". Thus, very often the grammar describing the language handles both the syntax and semantics in its rules. The segment of a DCG for English given below describes a group of wh_ questions (here used to illustrate the relationship between the sentence Who wrote this, and its corresponding semantic representation wrote(who, this)). Each of the logic grammar symbols has an argument (shown here italicized and bold) representing its semantics (WhQues_Sem argument in whques  logic grammar symbol, WhSubj_Sem in whsubj, WhObj_Sem in whobj, WhPred_Sem in whpred).

.........
(1)        whques ( WhQues_Sem, WhQues_Sen, WhQues_SenRest )        -->
                                whsubj ( Number, WhSubj_Sem, WhQues_Sen, Rest1 ),
                                whpred ( Number, Tense, [ WhSubj_Sem, WhObj_Sem ], WhQues_Sem, Rest1, Rest2 ),
                                whobj ( WhObj_Sem, Rest2, WhQues_SenRest  ).
.........
(2)        whsubj ( X, who, [ who|WhSubjRest ], WhSubjRest ).
(3)        whsubj ( X, what, [ what|WhSubjRest ], WhSubjRest ).
.........
(4)        whpred ( sing, perf, [ Subj, Obj ], wrote ( who, this ), [ wrote | WhPredRest ], WhPredRest ).
.........
(5)        whobj ( this, [ this | WhObjRest ], WhObjRest ).
.........

Not incidentally, grammars like this (incorporating both syntactic and semantic information) are frequently used for both parsing (given a sentence, derive its semantics) and generation (given a semantics, derive the corresponding sentence). They're said to be reversible and are subject of an interested and ongoing research in CL.


However, the complexities of a natural language are far beyond those of individual sentences in it and include the ones resolvable only through an entire discourse analysis. As a simple (!) example of the problems we face, consider the following brief description of a naval encounter :

        Just before the dawn, the Valiant sighted the Zwiebel and fired two torpedoes.
        It sank swiftly, leaving few survivors.

The most evident linguistic problem faced is finding an antecedent for "it". There are for candidates in the first sentence : "dawn", "Valiant", "Zwiebel", and "torpedoes". Semantic classifications should enable us to exclude "dawn" ( * "dawn sinks" ), and number agreement will exclude "torpedoes". However, our semantic and syntactic tools will not enable us to choose between the "Valiant" and "Zwiebel" (both presumably ships of some sort). In order to decide that it was the Zwiebel  which sank, we must recognize the causal relationship between the firing of the torpedoes and sinking of the ship: that the Valiant fired at the Zwiebel (since one normally doesn't fire at oneself), and hence that it was probably the Zwiebel which sank. Such facts and connections are not bothered to be included into narratives since they can readily be inferred by the reader.

Discovering the implicit information is exactly the role of the discourse analysis component in NLP systems. Some of the approaches used for this include
        grouping facts by topic,
        designing situational frames,
        designing situational scripts, and
        designing situational plans.

Also, any NLP system that handles a discourse that includes a dialog situation must also be able to reason about beliefs, intentions, desires and other propositional attitudes of the agents in the dialog, as well as agents referred to in the discourse. This, in short is what is referred to as the pragmatics of the natural language utterances. Most currently implemented systems use only the simplest dialog structure, with unilateral initiative in which one party completely controls the conversation, giving commands or asking questions. The other party simply responds to the last command or question. Such systems are realtively easy to implement, since they need not analyze  the preceeding dialog at length but they are also rather constraining for the user. Mixed initiative systems with relaxed constraints are also being implemented. In them, a great majority follows the idea of maintaining multiple goals in a conversation (users' and system (clarification) goals).



 
 
Evaluation Systems

for

Grammar Processing Algorithms 
(Parsers and Generators)




Logical grammars (and logic programming in a broader sense) have been widely accepted as a paradigm for parser and generator design and prototyping. Thanks to the related theoretical background, it became possible to develop formal systems in order to compare various approaches of the competing grammar processing algorithms (parsers and generators). The need for their formal evaluation  has been felt very strongly in Computational Linguistics because much of the work done has been predominantly empirical and it became difficult to measure the actual progress. A formal apparatus has been developed for ranking the grammar processing algorithms with respect to the following criteria :
 
 
 
completeness soundness efficiency reversibility

Completenes of a parser (or a generator)  is measured based on how much of the given language is parseable (or generable) by the given algorithm.

Soundness (overproducing) of a parser (or a generator) is judged based on the quantity of correct meanings  (or sentences) is obtained from incorrect sentences (or meanings) by the given algorithm.

Efficiency (optimality) of a parser (or a generator) is measured based on the number of production rules traversed during the derivation process.

Reversibility of a parser (or a generator) is judged based on the quantity of corresponding sentencial and semantic structures that could be derived one from another by the given algorithm.

 
 

The evaluation system based on the general principle of the traversal of derivation trees showed  most promise because of its independence of a particular grammar, grammar formalism or execution strategy.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
GPA Evaluation System Development
Formal Language Apparatus
Grammar - related (DCG)
Derivation Trees - related
Derivation Techniques - related
Comparison Apparatus
Criteria
completeness
soundness
efficiency
reversibility
Tools
Guides
Proper
Universal
Set 
theory
 
Graph theory STAS (Same To A Subtree) relation for tree traversals

The following excerpts from the authors research on the subject illustrate the flavor of the efforts taken in this direction.

Going back to the who wrote this  DCG example, let's try to illustrate one way to formally look into the derivations and derivation tree traversals. Below is what is referred to as an oracle derivation tree , a tree in which all critical variable assignments are somehow (oracle) known.
 
 
 
 
 
 
 
 
 
 

What follows are various traversals of the tree. In them, each node contains the predicate symbol, the set of variables uninstantiated at the moment of the visit of the given node, the set of variables uninstantiated at the moment after the given node was just departed, and the ordinal number indicating how many nodes in the tree were visited before the given one. The first two correspond to two versions of Essential Arguments driven traversals, and the third one is the Semantic-Head Driven Generation (SHDG) traversal. 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

The essential arguments algorithm (EAA) rearranges grammar productions at compile time in such a way that the evaluation (even one like a simple top-down left-to-right evaluation) will follow an optimal (or nearly optimal) path. By optimal path we mean a path in which no unsuccessful expansions will ever occur, or in terms of traversing of an analysis tree, no backtracking ever happens. In its preprocessing phase, EAA computes the so called minimal sets of essential arguments (abbreviated as msea's) for each of grammar symbols. If variables from a minimal set of essential arguments of a grammar symbol are all instantiated, a finite deterministic expansion of the grammar symbol is possible. Since we assume a finite deterministic expansion, no wrong moves (and backtracking) is allowed. An expansion of that kind can end with a failure or a success, but it is an expansion with no guessing. Thus, when used for generation, EAA will assume that the semantics of the topmost grammar symbol is instantiated and will try to rearrange rhs grammar symbols in rules where the topmost symbol is on the left-hand side in such a manner that whenever a symbol is next for expansion, at least one of its msea's is instantiated. If such ordering(s) of grammar symbols exists, it is (they are) adopted and EAA actually creates a (or several) grammar(s) equivalent to the original one and executable in a top-down, left-to-right fashion. If such reordering of grammar symbols is not possible, EAA will apply so called inter-clausal reordering (abbreviated as ICR). It will lift up a grammar rule having a rhs symbol of the currently processed rule on its left-hand side, by replacing the symbol in the current rule with the right-hand side symbols of its rule.

Semantic-Head-Driven Generation Algorithm (SHDGA), uses information about semantic heads in grammar rules to obtain the best possible traversal of the generation tree, using a mixed top-down/bottom-up strategy. In its preprocessing phase SHDGA compiles grammar rules in one of two ways, depending on whether the rule in question is a "chain rule" or not. A chain rule is one in which the semantics of the left-hand side (abbreviated here as lhs) is identical to the semantics of some right-hand side (rhs) constituent, the rule's "semantic head". Here, the algorithm needs an explicit indication of what the semantics is, for a given grammar symbol. In two major papers on SHDGA,  this indication is given by introducing "/" symbol, that separates syntactic from semantic part of a grammar symbol (SYMSYN/SYMSEM). Thus, what follows "/" symbol is considered semantics of a given grammar symbol. In addition, the "chained" relation, the transitive closure of the syntax part of the semantic head relation is computed. That is, if HEADSYN/HEADSEM is the semantic head of LHSSYN/LHSSEM, then HEADSYN and LHSSYN are in the "chained" relation. Therefore, preprocessing phase of SHDGA consists of introducing a notation that explicitly indicates what semantics is for a given symbol, aligning the grammar rules into "chain rules", or "non-chain rules" group and computing the relation "chained". We can talk of this preprocessing phase as of compilation phase of the algorithm. In the evaluation phase of the algorithm, SHDGA is given a grammar symbol (goal) with its semantics fully instantiated and is expected to discover corresponding surface string(s), if such exists. SHDGA selects a "non-chain" rule, whose left-hand side symbol (pivot) has semantic component unifiable with the semantics of the given goal and pivot can be connected to the goal by a sequence of "chain rules". That rule is expanded and the same procedure is applied to the grammar symbols on the right-hand side of it, if there are right-hand side symbols. Since the rule is "non-chain", no right-hand side symbol can have the same semantics as pivot. After the expansion of the "non-chain" rule is completed, its left-hand side symbol will get connected with the goal, by applying a sequence of "chained rules". Thus, next, a "chain rule" is selected whose left-hand side symbol has the same semantics as pivot's semantics and it is possible to connect it to the goal by a sequence of "chain rules". The same procedure is applied to the pivot's siblings on the right-hand side of this rule. The application of "chain rules" is continued in the same manner until at one moment, the left-hand side symbol of such a "chain rule" is actually the goal symbol.

For more charts and illustrations on these two algorithms, see : Appendix A : SHDGA and EAA. 

The two variable sets shown on the previous tree traversals are the basis of what's known as universal guide, a concept introduced by the author that could be used for monitoring derivations, as well as the performance of parsers and generators. They could be added to a logic grammar as a piece of redundancy, and conveniently exploited for tighter control of the computational process. 

The notion of universal guides is a generalization of well known proper guides (introduced by M. Dymetman and P. Isabelle). Unlike proper guides, universal guides allow consumption of guides at any level, not only lexical, they're introduced independently of the algorithm and the underlaying grammar and the related the main results (concerning finite derivations) are stated not only with respect to one (the depth-first-search, left-to-right (DFLR)) algorithm, but with respect to any grammar processing algorithm. Also universal guides treat parsing and generation in a symmetrical way, as it was done by proper guides, but unlike with the proper guides, universal guides need not to be instantiated and created differently for parsing and generation. 

For more on universal guides see : Appendix B : Universal guides

The relation STAS (Same To A Subtree) is a mathematical relation defined on the set of all possible tree traversals, useful in estimating the quality of a grammar processing algorithm based on the criteria of completeness, soundness, efficiency and reversibility. Intuitively, two tree traversals T1 and T2 are STAS ( STAS(T1,T2) ) if at any given moment during the traversal, the choice of the next subtree to be visited is the same for T1 and T2. STAS does not imply that the order in which the nodes are visited will necessarily be the same. In other words, two traversals can be STAS and their orderings of nodes quite different, as it is shown on the next couple of examples. STAS is an equivalence relation and it partitions the set of all possible traversals into its mutually exclusive equivalence classes. Those classes as we'll illustrate often describe powers of various grammar processing algorithms.
 
 
 
 
 
 
 
 
 
 
 

  The figure from the Appendix C : STAS Tree Traversals represents an example of two tree traversals which are STAS. The nodes are visited by the first algorithm in the order indicated by the arabic numbers (A, E, J, K, F, B, G, H, L, C, I, M, D). This traversal, as it will become clear later in the study, can be produced by semantic-head-first algorithm. The second algorithm, marked here by roman numbers, is depth-first search algorithm, performed in a top-down, left-right fashion, producing the following ordering: A, B, E, F, J, K, F, C, G, H, L, D, I, M. As it can be verified easily, the choice of the subtrees for the traversals at any node is same for both algorithms. For instance, at the node A, both algorithms first traverse the leftmost subtree, rooted at the node B, then the middle one (rooted at C) and at the end the rightmost (rooted at D). However, it should be observed that the order in which the nodes in the subtrees are visited by the two algorithms is not the same. Same type of verification can be done for the remaining nodes in the tree. Therefore, these two traversals are STAS, although the order in which the nodes are visited is different.

The examples from the Appendix C : STAS Tree Traversal Relation illustrate the way STAS could be used to compare SHDG and EA algorithms.
 


 




 
 
Machine Translation (MT)

 Goal :                 Devise a system capable of translating from one human language to another (!!!).

Problems :        Lack of empirical study of translation.
                            Linguistics (in a narrow sense) not as useful as expected.
                            No such thing as a universal MT system.
                            Developers' side :        Huge investment vs. Small market
                            Users' side :                High expectations vs. Moderate or low performance
                            Researchers' side :      Big dreams vs. Serious deadlocks

Directions :      More empirical studies of translation
                            Diversification of systems
                            Customization techniques

Aproaches :
                      BI-LINGUAL CORPORA approaches
                              Example-based MT (EBMT)
                                        This is a translation by retrieval in which previous translations
                                        are used as templates for translation of new text. A conservative
                                        text to use is the Bible which has been so widely translated.
                              Statistical-based MT (SBMT)
                                  This approach capitalizes on the idea that translation examples
                                        implicitely conatin rules of translation. Statistical techniques
                                        are used to reveal rules.

                      LINGUISTICS BASED approaches (LBMT)
                              REDUCTIONIST approach
            Monolingual theory of L1    <===========>        Monolingual theory of L2

                                                        systematic
                                                        correspondence
                                                        theory

            Language L1                                                             Language L2
            (infinite set of expressions)                                         (infinite set of expressions)
 

                                    INTER-LINGUAL approach
                                                        [Pivot]
                                                        Inter-lingual
                                                        expressions
 

            Analysis component                                                    Generation component

            L1 input sentences                                                        L2 translations

                      HYBRID approaches
                                    Attempt to integrate the best of the above in order to resolve most
                                    difficult issues in MT by :
                                        adding a model of the discourse,
                                        rules of  pragmatic interpretation and generation,
                                        a representation of the information structure of the input,
                                        etc.... (lots of good thesis topics here).


MT References :

To get a sense of various problems involved in a MT system, try the Globalink translator at http://www.globalink.com/xlate.html.

Try a real machine translation system: Altavista + Systran

Universities and Organizations working on MT

On-Line Books, Journals, Lists, etc.

Products and other software




 
 
Computer-Assisted Language Learning (CALL)

 Much of what is called 'CALL' does not involve computational linguistics, but rather linguistic computing (generally, doing things with language data on a computer). In any case, CALL is one of the most challenging application areas considered to date. Unlike general NLP, which requires a knowledge of linguistic theory and of computer science, CALL requires an understanding of:

CALL that is uninformed by Applied Linguistics is just another computer program, and there are many such CALL programs. Why should we think that they should lead to language acquisition? It is more likely that they will either have no effect or even be detrimental. Similarly, it is insufficient for CALL design to be informed by Applied Linguistics: a badly designed user interface, poor system design, bug-laden implementation, and so on can also be noxious.

The way the topic of computer-assisted language learning (CALL) is usually approached first is by looking at 'book-assisted language learning'.

To start with, let's consider the aspects of book technology, as it has developed over the last 500 or so years, that we take for granted, e.g. :

        (a) the content and structure of the printed page (can contain both text and graphics;
             may contain glosses and notes in at various positions on the page),
        (b) ways that learners can interact with books (e.g. they can be owned; they are
            highly portable; the student can write in the margins, highlight areas of the text,
            turn down pages, and insert Post-it notes, tape flags, paper-clips within the page).

Also, let's identify :
        (a) a number of variables such as the learner's profiency level and goals, whether the
            language is being learned in a foreign-language or second-language setting
        (b) aspects of the language that are often taught with the assistance of books
        (c) areas that books are and are not useful for (e.g. + reading; - phonology).

Now, turning to computer-assisted language learning, we can observe that to make good use of the computer as a tool in language learning, it is important to know about :
        (a) language acquisition theory
        (b) learning theory
        (c) applications and design of computer programs.
There is really  a great deal in CALL that represents the transfer of book technology and traditional methods of language teaching to the computer: e.g. the printed dictionary becomes the electronic dictionary; the traditional presentation of grammar and grammar exercises becomes screens of grammar lessons and grammar exercises. In fact, drill and memorization play a minor role in learning a language (though they may be useful for learning about the language).

A good deal of current CALL is simple PC or Mac programs and very little current CALL applies computational linguistics to the problems of language learning. Here are some possible reasons for this situation: programmers who become interested in CALL may not be informed by language acquisition theory and advances in language pedagogy; those who are may not be in a position to develop computer programs; both groups may be unfamiliar with relevant theory and techniques in computational linguistics. The informed CALL designer (the CALL designer of the future) is ideally knowledgeable about both Applied Linguistics and Computational Linguistics.

Sound and Text

In looking at some areas of language learning that are and are not addressed by book technology, we identified two obvious areas where the computer can play a useful role: acquisition of the sound system, and reading. Books are not good for learning the phonology of a language, but computers (since they can record and play sound) could be. Surprisingly, this is a little-developed area of CALL, although there are 'multi-media' programs (e.g. on CD-ROM and videodisk) that feature recorded speech.

A very simple combination of sound and text on the computer is the 'audio-dictionary' that has a recording associated with each word; these are becoming available in commercial editions. Since digitized sound takes up a great deal of space, large collections of sound recordings are more practical on CD-ROM than on diskette. However, we mustn't forget the potential of the Internet: disk drives associated with mini-computers and workstations can store vast amounts of data, which can be made accessible to large groups of users over the Web. Text (such as words in a dictionary) can be linked to sound recordings with hypertext links. To get a feeling about the currents, we could check out several examples of this in various applications on the Web: Sounds of the World's Animals (which has recordings of animal sounds, for comparison with how these sounds are expressed in human languages); Web French lessons (which have some pronunciation information); and Hwæt! Old English in Context, which has some recordings linked to words, and also an 'audio-visual glossary'.

In terms of reading, this is an obvious area that books are good for: it is in fact their primary use. But reading can be enhanced on the computer, most obviously through the use of hypertext (or more generally, hypermedia). This appears to be a very fruitful area for further development, especially with the advent of the Web and the tools that are available for developing hypermedia on the Web. These Web tools are in fact authoring software and very suitable for academics to develop new language-learning materials with: unlike commercial authoring software, which may cost hundreds or thousands of dollars, many of these tools are low-cost or free. There is also extensive free documentation on developing Web 'pages' (as well as dozens of books on the subject). It is well known how easy it is to learn basic HyperText Markup Language (HTML), how to incorporate text and graphics and link your documents to related ones.

Programming vs. authoring software

The discussion so far has been in terms of things that can be done with sound and text on the computer which involve no programming at all. This is an important point: while we tend to think of CALL in terms of computer programs, it always makes sense to develop programs only when necessary. Programs must be designed, developed, tested, documented, and maintained, ideally by computer professionals; programming is expensive, time-consuming, and error-prone.

Much of the programming that we see associated with traditional CALL can be divided into two areas: (a) handling the user interface, and (b) processing user input and other data. But now consider, by contrast, Web software: the user interface is handled by general-purpose software such as Netscape and Mosaic; the Web CALL designer, instead of writing programs to collect user input and display information on the screen, concentrates on design. Web browsers have been designed, developed, tested and documented by professionals, and run on a variety of platforms (PCs, Macs, Suns, etc.). Much the same is true of commercial authoring software (though this is often limited to specific hardware and operating systems, unlike Web browsers). It would be foolish, in general, to write programs to do what professionally-developed software already does (and probably better).

More CALL with less programming !?

Applications of computational linguistics in CALL can generally be considered under the heading of 'intelligent CALL', and as we have seen in previous segments, it is quite difficult to provide intelligent behavior in computer programs. However, adding life and intelligence to one's HTML-based CALL software by enbedding JAVA applets (ideally reasoning-equipped) is where this field could turn to in the future.

The fact however remains that the humans are far superior in language processing and reasoning. Consequently, the ways in which the intelligence of the learner, rather than artificial intelligence can be exploited in designing language learning situations that involve the use of computers remains the preferred approach in CALL technology nowadays. To consider how these learning situations might be enhanced with some limited artificial intelligence one should visit sites and characters listed below in the reference section (Elmo, the MOO softbot, etc...). Have fun !
 



 

CALL References and Glossary :

MUDs, MOOs, MUSHs

MUDs = Multi-user Domains
MOOs = MUD, Object Oriented
MUSHs = Multi-user Shared Hallucination
 

MOOs are internet accessible, text mediated virtual environments well suited for distance learning. MOOs of interest include ...





Educational MOOs (and a MUSH):

Cafe MOOlano, Berkeley's new multi-language educational MOO, used and developed by a variety of humanities courses from UC Berkeley and beyond.
telnet moolano.berkeley.edu 8888.
Visit our Cafe MOOlano web site--simply sumptuous graphics!
TAPPED IN, The Teacher Professional Development Institute, or TAPPED IN, is a shared teacher professional development (TPD) workplace patterned after a real-world conference center. Teachers with diverse interests, backgrounds, and skills can share experiences, engage in mentoring and collaborative work, or simply meet their colleagues.
telnet tappedin.sri.com 7777
TAPPED IN web site
MooWP, an educational MOO from the folks at the University of Wisconsin, Parkside.
telnet cs.uwp.edu 7777
Diversity University
telnet moo.du.org 8888
MediaMOO
telnet purple-crayon.media.mit.edu 8888
Tecfa MOO, a virtual space for Educational Technology, Education, Research and Life at TECFA, School of Psychology and Education, University of Geneva, Switzerland
telnet tecfamoo.unige.ch 7777
CollegeTown
telnet galaxy.bvu.edu 7777
CollegeTown web page.
Virtual Online University: "VOU MOO"
telnet brazos.iac.net 8888
Bio MOO
telnet bioinformatics.weizmann.ac.il 8888
PMC-MOO(PostModern Culture)
telnet hero.village.virginia.edu 7777
MOOtiny (WWW + MOO environment)
telnet spsyc.nott.ac.uk 8888
WriteMUX
telnet writemux.visi.com 6250
Brown Hypertext Hotel
telnet 128.148.37.8 8888

Foreign Language and EFL/ESL MOOs

MundoHispano
telnet europa.syr.edu 8888
Visit the MundoHispano home page, a veritable wealth of information about MundoHispano and MOO related topics.
ArdaMOO, a Spanish language social MOO.
telnet lince.las.es 7777
VCR MUSH: Virtual Classrooms MUSH
This MUSH consists of German, French, Spanish and English sections.
telnet risky1.ssg.comp.uvic.ca 6250
Le MOO Francais, French langauge MOO
telnet moo.syr.edu 7777
"LittleItaly", Italian language MOO
telnet little.usr.dsi.unimi.it 4444
Visit the LittleItaly home page
MorgenGrauen LPmud, German language Mud
telnet mud.uni-muenster.de 23
Check out the MorgenGrauen WWW Site

 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

UNItopia, German language Mud
telnet infosgi.rus.uni-stuttgart.de:3333
MOOsaico, Portuguese language MOO
telnet moo.di.uminho.pt 7777
SvenskMud, Swedish language Mud
telnet svmud.lysator.liu.se 2043
schMOOze University (an EFL/ESL MOO)
telnet schmooze.hunter.cuny.edu 8888
The schMOOze University home page
Links to Foreign Language WWW sites.

Social MOOs

Paradise MOO
telnet kamber.cobra.net 8888
AquaMOO
telnet aqua.king.net 2345
RiverMOO
telnet river.honors.indiana.edu 8888
Strange Brew
telnet 206.161.77.144 8888
BushMOO, a trip to the Australian outback ...
telnet bushnet.qld.edu.au 7777
BushMOO web site
BayMOO
telnet mud.crl.com 8888
LambdaMOO, the mother of all MOOs!
telnet lambda.parc.xerox.com 8888
ChibaMOO
telnet chiba.picosof.com 7777
SenseMedia Snow A MOO universe based on Neal Stephenson's novel Snow Crash.
telnet sapporo.sensemedia.net 9030
UNM-Moo
telnet forsythia.acomp.usf.edu 7777
UNM-Moo web page
idMOO. idMOO is an experimental MOO, encouraging wild creativity and unusual programming projects of all kinds.
telnet id.sva.edu 7777

Using MOOs: technical and hands-on information

MOO Teacher's Tip Sheet: This document (brought to you by the Daedalus folks) is valuable for both new and seasoned instructors using (or planning to use) MOOs.

What's a MOO?

MUD Clients! Visit Kathleen MacMahon's comprehensive MOO clients home page, with links and information for both Mac and PC users. I highly recommend a visit to this page, and generally, for MOO-ing through a client.

Basic MOO commands

More advanced MOO commands

MOO Quick Start and Summary of Commands
 


Papers, Web pages and documents concerning MOOs

The Composition in Cyberspace home page, put up by Leslie Harris, focuses on the use of MOOs (primarily Diversity University MOO) and Internet discussion lists in English Composition teaching, and includes great course and MOO research materials as well as links to other MOO sites.
For a plethora MOO information, visit Jerome McDonough's formidable MOO/MU* Document Library. Included are links to manuals, tutorials, FAQs, MOO research papers, links to other MOO paper collections, and links to general MOO/MU* resource pages.
'MUDs in Education: New Environments, New Pedagogies', a must read by Tari Lin Fanderclai.
For lots of general MOO information, check out Jeff Galen's MOOcentral home page.
The Journal of Computer-mediated Communication (JCMC) includes articles on a variety of CmC topics, including MU*s.
Visit the Multimedia MOOs page for information on the graphical future of MOOs and rich links to general MOO information.

Some MOO history ...

The first developer of a MOO server was Stephen White. Pavel Curtis of Xerox built upon White's basic design and code and supplemented it with added features, which culminated in the first LambdaMOO core. MOOs offer an alternative to MUDs -- which are fixed virtual environments controlled by an oligarchy of programmers. The LambdaMOO source code is publicly available from Xerox parc (other MOO source codes have also been developed).

MOOs offer every participant the opportunity to construct spaces and objects and to write code that in some way augments or increases the functionality with these virtual spaces. In this sense, MOOs are constructed social spaces in a dynamic process of continual evolution.

For users, MOOs can be described as constellations of spaces, or "rooms", within which multiple individuals can congregate and interact. Movement is possible room to room, by typing in cardinal directions, or via "teleporting", which allows immediate transport to rooms not adjacent to one's present location.

How to Log-on to a MOO:

at the % prompt (the unix shell prompt), type a MOO address, for example:

telnet moo.du.org 8888 [for Diversity University]

Log in as a guest, [type: co guest > (for 'connect guess']

To "speak" in a MOO:

Type " > and what you say will be heard. Example: "hi. This appears on everyone's screen as Kaloo says, 'hi.' >

Here are some other MOO commands that might will you as you explore:

Type : > and this will show an action: an example would be for me to Type :bows gracefully to David > and others see Kaloo bows gracefully to David >

To page a character, enter:

page example: page David Help--I'm stuck in the Coat Closet!

Kaloo pages David with "Help--I'm stuck in the Coat Closet!

Looking at Characters, Rooms, and Objects:

To see the self-description of a character in the room with you,

Type: look 
Example: look Kaloo

A description of Kaloo will appear on your screen.

You can also "look" at yourself, by typing look me >.

To review the description of a room you're in, enter: look >

If in the room you're in you see an object you want to examine, enter this:
@examine 
Example: @examine here (to see what's in the room you're in)




 
Information Retrieval (IR)

Goal :                                                 Find the information requested by the user.

Difficulty :                                        Imprecision of  the goal.

Important Considerations :
                                                            Who is the user ?
                                                                    Naive user vs. trained analyst
                                                            What is the data ?
                                                                    (i)    Growing archive of text
                                                                    (ii)    Volatile corpus outside the control of the IR system
                                                                            -- Internet
                                                                    (iii)    Message traffic
                                                                            -- routing


To begin with experiencing the issues involved in IR, let's select one of the Web search engines, and search for information about (a) Old English (b) Julia the software robot (c) information retrieval. Analyze the results for
    precision (P - proportion of the relevant information among what's retrieved), and
    recall (R proportion of all relevant information retrieved).

Here, one has to bear in mind that Web search databases are always slightly out of date and never complete.


Types of IR Systems

    Boolean keyword search
Usually indexed by humans
    Boolean full text search
Search is not robust
Requires much work from the user
    Relevance ranked full text search
    Conceptual searching

       A user searching for the word "car" wants to retrieve documents that mention automobiles.
        Query Expansion

IR Technologies

    Reverse Index
For each word, list all documents that contain it.
Frequently omit common "stopwords".
    Document surrogates
Compact reverse index.
Relies on statistics.
    Vector space models
Represent any string as a vector

Issues


What does (computational) linguistics have to offer to Information Retrieval?

Information retrieval involves human language. As language scientists, linguists should have much to contribute to this field. Obvious areas of contribution include the linguistic analysis of user queries and the analysis of text, e.g. through 'deep' natural language processing. Analysis of meaning is ultimately what seems to be required, although whether deep NLP is necessary to achieve this remains to be seen. But let's begin at a higher level.

Perspectives on the IR speech situation

When we look at IR in this way, it seems that Interactional Sociolinguistics and Linguistic Anthropology are very relevant to improving the quality and success of IR interactions. But the field of Pragmatics (broadly, the interpretation of utterances in context) also has much to contribute. Pragmatics includes the study of reference (pronominal reference, reference of full NPs, etc.), which is clearly relevant. To the extent that documents are retrieved based on the relative frequency with which some search expression is satisfied, then the reliable interpretation of referring expressions should increase the accuracy of the search. Thus, if we are searching for documents about People's Savings Bank of Canada, we would want to detect subsequent references to this entity as People's Savings Bank, the bank, it, etc. Pragmatics (and Computational Pragmatics) provides a theoretical basis for reference tracking.

Pragmatics is also concerned with higher-level principles of utterance interpretation, and the work of the British philosopher H. P. Grice has been highly influential in this regard (Gricean pragmatics). In this lecture, the focus is on an application of Grice's Cooperative Principle and Maxims of Conversation to IR, specifically with respect to precision and recall. Let's elaborate on the mentioned in more details.
 

Information Retrieval: Precision and Recall

Two main parameters of retrieval effectiveness have been used over the years, defined as the proportion of relevant materials retrieved, or recall (R), and the proportion of retrieved materials that are relevant, or precision (P). (Salton 1989:248)
Pragmatics: Grice (1975)
Cooperative Principle (Grice 1975:45)
Make your conversational contribution such as is required, at the stage at which it occurs, by the accepted purpose or direction of the talk exchange in which you are engaged.
Maxims of Conversation (Grice 1975:45-46)
Quantity
1. Make your contribution as informative as is required (for the current purposes of the exchange).
2. Do not make your contribution more informative than is required.
Quality
1. Do not say what you believe to be false.
2. Do not say that for which you lack adequate evidence.
Relation
1. Be relevant.
Manner
1. Avoid obscurity of expression.
2. Avoid ambiguity.
3. Be brief (avoid unnecessary prolixity).
4. Be orderly.
(Grice, H. P. 1975. Logic and conversation. In Peter Cole and Jerry Morgan (eds.), Syntax and Semantics: Volume 3, Speech Acts. New York: Academic Press)
Grice's maxims govern human interaction involving language, in the sense that we assume by default that they are being adhered to. On the basis of this assumption we regularly draw certain non-logical (i.e. plausible) inferences from what is said or written, which Grice calls implicatures. These implicatures contribute to conveyed meaning, beyond the literal meaning of what was said. One of Grice's illustrations involves a letter of recommendation which patently violates the first submaxim of Quantity (Make your contribution as informative as is required):
A is writing a testimonial about a pupil who is a candidate for a philosophy job, and his letter reads as follows: 'Dear Sir, Mr. X's command of English is excellent, and his attendance at tutorials has been regular. Yours, etc.' (Gloss: A cannot be opting out, since if he wishes to be uncooperative, why write at all? He cannot be unable, through ignorance, to say more, since the man is his pupil; moreover, he knows that more information than this is wanted. He must, therefore, be wishing to impart information that he is reluctant to write down. This supposition is tenable only on the assumption that he thinks Mr. X is no good at philosophy. This, then, is what he is implicating.) (Grice 1975:52)
How does this relate to IR? Although humans are known to adapt to computer systems, any kind of communication involving human language seems to evoke principles governing human-human interaction. When considered in this light, we can see that users will expect an IR system, by default, to be 'cooperative' (in the Gricean sense) and subject to the Maxims of Conversation. In particular, the user may expect IR system responses to be
        (a) informative (Quantity),
        (b) true (Quality),
        (c) relevant (Relation), and
        (d) clear, unambiguous, brief, and orderly (Manner).
These observations provide a basis for understanding and predicting some common problems with IR interactions.

For example, by the maxims of Quantity and Relation, the user will expect that what has been retrieved is relevant, and that the display of what has been retrieved is as informative as is required. This means that by default, novice users will expect perfect precision (Relation: everything retrieved is relevant) and Recall (Quantity and Relation: everything relevant has been retrieved). Now, it is reasonably easy to see precision errors, and users respond to these by reformulating (often narrowing) their queries. An assumption of perfect precision is often contradicted by the evidence. But the assumption of perfect recall is not contradicted by the evidence. Users cannot see what has been missed; even experienced users may believe that their search results are retrieving everything that is relevant (or at least, a good percentage of what is relevant).

The pitfalls for the user are illustrated by the initially mentioned assignment, which involved using Web search engines to find information about Old English, information retrieval, and Julia (the softbot). The assignment asked to report precision and recall for a selected search engine. We cautioned that Web search engines search databases of Web pages that are always out of date, since new Web pages come into being every second. It follows from this fact alone that recall cannot be measured accurately with respect to the Web.

Furthermore, a little thought should show that Web search engines look for the words in the query, not the underlying concepts. Thus, Web queries for Old English retrieved pages which included the word old and the word English: e.g. Old English sheepdogs, Old English taverns, and even an article on recognition of English vowels by Old World monkeys. The poor precision of these searches is what everybody using the Internet notices. But what about recall?

There are in fact many ways of expressing the concepts that underly the expressions 'Old English', 'Information Retrieval', and even 'Julia' (e.g. 'that softbot developed by Michael Mauldin, whose name I can't remember except it begins with a J'). For Old English, I gave a hint that was intended to spark reflection: an alternative term is Anglo-Saxon. Even seasoned searchers can easily fail to consider the full range of possibilities for the linguistic expression of concepts, resulting in decreased recall.

The heart of the problem is that texts and queries are not just assemblages of words: humans express meanings using language, and languages provide a multitude of ways of expressing meanings. We are accustomed to interacting with other humans, who attend to the sense of what we are saying. But current search technology is largely focused on words. In a Web search, it is not possible to ask the question underlying a search for information about Old English, e.g. 'I am interested in information about Old English, the language spoken in England between roughly 450 and 1100.' Without an analysis of meaning (not to speak of user goals), there is no reliable way for search engines to retrieve all and only what is relevant to the user. 


Miscellania

Good reference:

            Salton, Gerard. 1989. Automatic Text Processing: The Transformation,
                                    Analysis, and Retrieval of Information by Computer.
                                    Reading, MA: Addison-Wesley.
Public domain software
Originally WAIS (Wide Area Information Server), from Thinking Machines Corporation
Now FreeWAIS, maintained by Department of Labor