![]() |
An Introduction to
Computational Linguistics |
|
Computer Science Department
|
Computational Linguistics (CL) is an interdisciplinary field centered around the use of computers to process or produce human language. It is a discipline between linguistics, computer science and mathematics. In it, mathematics and linguistics contribute an understanding of special properties of language data, as well as theories and descriptions of language structure and use, while computer science provides theories and techniques for designing and implementing computer systems. CL is largely an applied field.
What
is
the ultimate goal of the CL field (HOLY GRAIL of the
CL) :
Building a computer system that could understand and produce human language
as well as humans can. Its applications would be endless.
Why
to
use computers and not humans to process or produce language (MAJOR
REASONS) :
Humans are often unavailable, too slow, too expensive or too busy doing
tasks they do better
than machines.
When and where
to
use computers to process or produce language (MAJOR
APPLICATION AREAS) :
Translating from one language to another (Machine Translation)
Finding releveant documents in large collections of text (Information
Retrieval)
Answering questions about a subject area (Expert Systems with Natural
Language Interfaces)
Helping humans in learning languages spoken by other humans (Computer-Assisted
Language Learning).
How
to
get computers to process or produce language (MAJOR
APPROACHES) :
Simulation
Build a psychologically valid working model of human thinking that includes
language
understanding and use. Then, use computer to implement the model in the
most
efficient manner possible. This proves to be a vast and very far from solved
problem,
currently broken into many subproblems addressed in Cognitive Science,
Artificial
Intelligence, Biology, Psychology and Linguistics.
Emulation
Doesn't need human brain understanding or any resemblance of how it works.
Attempts
to do as well as a human on tasks involving language. Still an enormous
task that needs to
handle reasoning,
understand how the language works, and
provide an encyclopedic knowledge.
The enormousity of the problem
at hand suggests to either :
break it down into
subproblems
with simpler computational models
(i.e. into phonetics, phonology, lexicon, morphology, syntax, semantics,
pragmatics, etc.)
and focus on computational models for them, or
build the
systems
that learn.
Instead of trying to emulate the linguistic capabilities of an adult (with
an adult's
encyclopedic knowledge), model the knowledge of an infant and work on
designing systems that learn, or
focus on
specific
tasks in restricted domains
Instead
of a wide variety of situations and wide variety of tasks, concentrate
on
more concrete problems (i.e. system summarizing bank telexes or system
summarizing
radiology reports), or
look at
shallow
approaches
where useful tasks with language data are acceptable results
(i.e. automatic correction
of OCR errors or identifying the topics of documents for information retrieval
and
document classification tasks).
Computational linguists are involved in all of
the above areas of research. They're expected to have an
understanding of the language and language data
properties, as well as an understanding of design,
implementation and computational techniques and
issues.
Appeals of the CL as a field :
-
Human language is a most exciting and demanding puzzle,
-
Computer science is a most growing and expanding field,
-
Language processing using computers is an important part of CS applications,
-
Prospects look good for the future.
CL References
and Professional Associations :
![]() |
Natural Language Processing (NLP) |
Goal : Take some text (speech) input and extract meaning from it or given the meaning, express it in a textual (speech) form.
Aspects of a language
addressed : Lexics, Syntax, Semantics, Discourse,
Pragmatics,
etc...
A starting point for most of the nowadays NLP systems is to build an electronic Lexicon. Usually, the
minimal requirement is that
:
each
word in the lexicon has a part of speech (POS) tag associated with
it ( i.e. the / DT for
the
determiner the with the tag DT, or point
/ NN for the singular noun point with the tag NN.)).
Often,
what is also included are
additional requirements to
:
add
morphological
information (i.e. cacti / NNS / 1 us indicating
that the word cacti is a plural
noun
(tag NNS) with the morphological information 1 us indicating
that its nominative singular form
(cactus)
is obtained by deleting 1 of the last characters from the
inflected form (cacti) and adding
characters
us)
and
add
probability
(statistical) information (i.e. heat / NN / _ / 0.83
indicating
that heat
could be
a
singular noun (NN) (like in "This heat is unbearable.").
Its probability of appearing in a text is 0.83.
Or
as in heat / VB / _ / 0.05, heat
is a verb in its base form (VB) (like in "Heat
oil in a large pot!")
with
its probability of 0.05. The probability information
is something that normally appears only in
electronic
lexicons and is derived from a very large text corpora used for training.
In the next phase, once the lexicon is built and
tested, programs called POS Taggers
are designed
and developed in order to be used in determining
parts of speech in a text to be analyzed. They most often
are built to include :
(i) lexical tagging
where
words are initially tagged with their most likely POS regardless of context
(merely a lexicon
look
up and assignment of the tag with the highest probability),
(ii) unknown word tagging
where
words not found in the dictionary are assigned a POS based on their ending
or other simple
characteristics,
(iii) contextual rules
where
the set of simple rules are applied to the previously obtained POS sequence.
They modify
the
POS of each word based on the POS of the surrounding words. In the following
example :
| I | took | a | left | turn | . |
| pro | verb | det | noun | verb | |
| adjective | noun | ||||
| verb | |||||
| adverb |
the
POS tags for each word are listed (top to bottom) in the order from their
highest to lowest
probabilities
(as found in the dictionary). The correct choices are listed in the bold
case. The
example
shows that lexical tagging made two errors (on left
and turn words) because it dealt
only
with the individual words and not the context. This was corrected in the
phase (iii) using the
contextual
rule that changes verb tag into noun tag if one of
the previous two tags is a det tag
(turn),
and another contextual rule that changes noun tag into adjective
tag if the next tag is a
noun tag (left). The contextual rules are often used
for the POS disambiguation like in the example
above.
In a real life, all three modules of a POS TAGGER
are learned from a training corpus of correctly tagged
sentences (i.e. 500K words).
In the syntactic analysis phase of an NLP system,
a program called Parser is built. This
program,
written in some computer language, will use a
formal grammatical descritpion of a natural language to
analyze its input. Tools and theories needed
in this phase are :
(i)
a theory of grammar and a grammar formalism
(ii)
a grammar of the language to be analyzed
(iii)
a lexicon, containing the words of the language
(iv)
a parser which interprets the grammar and uses it to assign
a syntactic description to the input.
Grammar formalisms (generally descended from Transformational
Grammar) to choose from :
(GB)
Government-Binding Theory
(MT)
Minimalist Theory
(GPSG)
Generalized Phrase Structure Grammar
(HPSG)
Head-Driven Phrase Structure Grammar
(LFG)
Lexical Functional Grammar
(DCG)
Definite Clause Grammar (unification-based CFGs with non-terminals allowed
to have arguments)
etc.,
etc...
There is a vast literature on parsing theory and
the design and implementation of parsers for computer programming languages.
This is what we need to do in NLP as well, but the syntax of natural language
input is much more difficult to describe than the syntax of (say) Perl
statements, and designing efficient and 'robust' parsers for natural language
input is not an easy task. However, a variety of parsers have been developed,
and one can identify some general types:
(TD)
top-down,
(BU)
bottom-up,
(CP)
chart parsers,
to name a few (for a useful introduction, see
James Allen, "Natural Language Understanding"). In terms of
the programming languages in which parsers for
natural language are implemented, the languages of choice
in the past have been LISP and Prolog. Of course
it would be possible to write a parser in any other main
stream programming language (C++ or Java), but
it is easier to develop a parser in a higher-level language
like LISP or Prolog, languages which are commonly
used in Artificial Intelligence research. We'll show
here some segments from Prolog, a logic-programming
language.
To make the discussion more concrete, let's consider
a very simple TG-style grammar for a fragment of
English. Here are some phrase-structure rules:
rules that describe the structure of phrases of English.
A grammar for a fragment of English
1. S -> NP VP
2. NP -> DET N
3. NP -> N
4. VP -> V
5. VP -> V NP
6. DET -> the
7. N -> dog
8. V -> eats
These rules can be read: 'A sentence is composed
of a noun phrase followed by a verb phrase'; 'A noun
phrase can be a determiner and a noun', 'A determiner
can be the word the', and so on. Rules (6-8) are
known as lexical rules, since they expand lexical
categories. Since we are assuming that there is a
separate lexicon in which words are stored with
their part of speech, we will replace these with rules
that refer to the lexicon.
Same grammar for a fragment of English
1. S -> NP VP
2. NP -> DET N
3. NP -> N
4. VP -> V
5. VP -> V NP
6. DET -> (any lexical item that is in the lexicon
as a determiner)
7. N -> (any lexical item that is a noun)
8. V -> (any lexical item that is a verb)
In linguistics we often think of such rules as
generating grammatical sentences of a language. So to
generate a sentence of English, we apply the
rules one at a time until we have a nothing left but
sequence of words, and this process also gives
us a parse tree. For example, we begin with a S,
and expand that as NP and VP; we apply the first
NP rule, to get a determiner and a noun; we
apply the determiner rule, to get the,
and so on. We can represent this process of applying the
rules as a parse tree:
S
NP
VP
DET
N
V
the
dog
eats
There is a linear representation of a parse tree
called a 'labelled bracketing': this is similar to the
representation used in the Penn Treebank article.
The Penn Treebank project built a parsed
corpus: a corpus of text in which words are tagged
for part of speech, and phrases are
bracketed and labelled for syntactic category.
A traditional labelled bracketing for The dog
eats, corresponding to the parse
tree above, would be:
[ s [
np [ det the ]
[ n dog ] ]
[ vp [ v eats
] ] ] or
[
s
[
np
[ det the ] [
n dog ]
]
[ vp
[ v eats ]
]
]
Or using SGML and the Penn Treebank tagset, the representation will look something like this:
<s> <np> <DT>the</DT> <NN>dog</NN> </np><vp> <VBZ> eats </VBZ> </vp> </s>
We usually think of a grammar as generating the sentences of a language, but obviously a grammar is also used for analyzing sentences of a language. So we can write a parser that will use our grammar to process the sentence The dog eats (not a sentence that is likely to occur in the Wall Street Journal corpus, but it will do for illustration).
In designing a parser, we must decide on a systematic strategy for applying the grammar rules to the input. One such strategy is very similar to the method we used for generating the sentence: begin with the rule for the type of constitutent we expect to end up with (S). Take the left-most constituent (NP), and pick the first rule that expands that constituent. Keep choosing the first applicable rule and expanding the left-most constitutent until a lexical node is reached; at that point, check to see if the next word in the input is of the right type. If it is, process the next unexpanded constituent. The general strategy is to apply the rules in order, from top to bottom, moving from left to right.
Parsing 'the dog eats' using a top-down left-to-right strategy
Input: the dog eats.
Input: dog eats
Choose the next unexpanded constituent from the last phrase (N).
Input: eats
Since we have completely expanded all constituents in the NP rule, get the next constituent from the higher phrase (VP).
Input: [nothing left]
There are no more constituents to expand in the
VP rule.
There are no more constituents to expand in the
S rule.
There are no words left to process in the input.
Done!
Jumping a one step further and using the conveniencies
(built in interpreter for Definite Clause Grammars
(DCGs)) of Prolog, we can write a DCG that will
besides accepting or rejecting a sentence also build its
derivation tree (parse tree).
First we rewrite the same grammar as a DC grammar :
s --> np, vp.
np --> det, n.
np --> n.
vp --> v.
vp --> v, np.
det --> [Word], {lex(Word,'DT')}.
n --> [Word], {lex(Word,'NN')}.
v --> [Word], {lex(Word,'VBZ')}.
As it's obvious, each DC grammar rule ends in a period, and constituents within a rule are separated by commas. The rewrite symbol, '->', becomes '-->' in a DCG. Syntactic categories in a DCG must be in lowercase letters because they will become Prolog predicates, and in Prolog initial caps are reserved for variable names. With these exceptions, the DCG rules are almost identical to the phrase structure rules for non-lexical nodes.
In the lexical nodes (det, n, v), the square brackets in a DCG represent an instruction to take a word off the input. Curly braces are used to add Prolog code to a DCG: in this case, the code inside the curly braces tests whether the current word is in the lexicon with the expected part of speech. To see how these rules are translated, you can ask for a listing of the predicates at the Prolog prompt, e.g.:
?- listing(np).
Here's how the Prolog interpreter translates the first NP rule and the DET rule:
np(A,B) :-
det(A,C),
n(C,B).
det(A,B) :-
'C'(A,C,B),
lex(C,'DT').
As one can see, each of the syntactic categories in the DCG translates into a predicate with two arguments. The first argument is the list of words to be parsed, and the second is what's left over after parsing. If we were parsing the sentence The dog eats, the input to the NP rule would be [the,dog,eats], and what is left over would be [eats] :
?- np([the,dog,eats],X).
X
= [eats]
If we are parsing a complete sentence, we want nothing left over. So we would call the S rule with input [the,dog,eats], and we would specify that the output should be the empty list :
?- s([the,dog,eats],[]).
Yes
To summarize: Prolog provides a top-down left-to-right parser, in the form of an inference engine. It also provides a facility that automatically converts grammar rules into Prolog statements. To parse a sentence, we first load the grammar rules and lexicon into Prolog. We then call the appropriate grammar rule with a list of words as input. Prolog returns Yes or No.
However, in computational linguistics, a question is usually not 'Is this sequence of words a sentence? (yes or no)', but rather 'What is the syntactic analysis of this sentence?' or to rephrase it "the goal is to Produce a Parse Tree".
The desired outcome would look somehow along these lines:
?- s(Tree,[the,dog,eats],[]).
Tree
= '<s> <np> <DT>the</DT> <NN>dog</NN>
</np> <vp> <VBZ> eats </VBZ> </vp>
</s>'
The simplest way to accomplish this is to build up the parse tree as we go along, and pass it around as a list. Each rule can contribute a piece of the parse tree. As you might imagine, to get a parse tree back from our grammar, we need to add another argument to each rule. For example:
Before, we had :
s --> np, vp.
After adding parse tree arguments, we got:
s(S) --> np(NP),vp(VP),
{S = ['<s>',NP,VP,'</s>']}.
If we modify the np rules and the vp rules to return their parse trees, then at the end we need only put these two trees together to get the syntactic representation for the whole sentence, as above. Since the basic Prolog data structure is a list, what we will end up with is a list of lists, like this :
?- s(Tree,[the,dog,eats],[]).
Tree
=
[<s>,[<np>,[<DT>,the,</DT>][<NN>,dog,</NN>],</np>],[<vp>,[<VBZ>,eats,</VBZ>],</vp>],</s>]
With a little work, we could write some Prolog code to display this as a tree diagram. Since it is a list of lists, and since lists are the basic data structure in Prolog, we can use one of many Prolog utilities to display a list, e.g. pretty_print (in Gazdar and Mellish, "Natural Language Processing in Prolog", along with many other useful utilities).
The semantic analysis phase of a NLP system has
as its objective to obtain the meaning
(semantics) or
in other words to determine what a sentence means. Here, the researchers
find it desireable (or even necessary) to devise
a formal language with a simple semantics into which
natural language can conveniently be mapped.
The formal language should
(i)
be unambiguous,
(ii)
have simple rules of interpretation and inference, and in particular
(iii)
have a logical structure determined by the form of the sentence.
The predicate logic, predicate logic with restricted
quantification, and semantic nets are (just to name a few) some of the
systems often used to represent the meaning. However, notions of modality,
tense and belief, as well as presupposition and fuzziness are not captured
by the formalism of the predicate calculus and its equivalents. Sentences
like "I believe the price of a Big Mac is $10", "John is young" fall into
the uncaptured categories. Adding modal (possible(F), necessary(F)),
epistemic (believe(x,F)) and temporal (true_at_some_time_in_the_future(F)...)
operators covers some of the mentioned insufficiencies.
So far, we seemed to have suggested that two levels of representation - syntactic and semantic - are processed at different phases in a NLP system. First, a syntactic structure is created (with or without the syntactic and sometimes also semantical (selectional) constraints). Then, it is translated into a semantic representation. However, the largest "school" of computational linguists (started by R. Schank) adopts the approach that "since the ultimate objective is the generation of a semantic representation, it should be created directly using the syntactic constraints only to guide the process". Thus, very often the grammar describing the language handles both the syntax and semantics in its rules. The segment of a DCG for English given below describes a group of wh_ questions (here used to illustrate the relationship between the sentence Who wrote this, and its corresponding semantic representation wrote(who, this)). Each of the logic grammar symbols has an argument (shown here italicized and bold) representing its semantics (WhQues_Sem argument in whques logic grammar symbol, WhSubj_Sem in whsubj, WhObj_Sem in whobj, WhPred_Sem in whpred).
.........
(1)
whques ( WhQues_Sem, WhQues_Sen, WhQues_SenRest )
-->
whsubj ( Number, WhSubj_Sem, WhQues_Sen, Rest1 ),
whpred ( Number, Tense, [ WhSubj_Sem, WhObj_Sem ], WhQues_Sem,
Rest1, Rest2 ),
whobj ( WhObj_Sem, Rest2, WhQues_SenRest ).
.........
(2)
whsubj ( X, who, [ who|WhSubjRest ], WhSubjRest ).
(3)
whsubj ( X, what, [ what|WhSubjRest ], WhSubjRest ).
.........
(4)
whpred ( sing, perf, [ Subj, Obj ], wrote ( who, this ),
[ wrote | WhPredRest ], WhPredRest ).
.........
(5)
whobj ( this, [ this | WhObjRest ], WhObjRest ).
.........
Not incidentally, grammars like this (incorporating both syntactic and semantic information) are frequently used for both parsing (given a sentence, derive its semantics) and generation (given a semantics, derive the corresponding sentence). They're said to be reversible and are subject of an interested and ongoing research in CL.
However, the complexities of a natural language are far beyond those of individual sentences in it and include the ones resolvable only through an entire discourse analysis. As a simple (!) example of the problems we face, consider the following brief description of a naval encounter :
Just
before the dawn, the Valiant sighted the Zwiebel and fired
two torpedoes.
It
sank swiftly, leaving few survivors.
The most evident linguistic problem faced is finding an antecedent for "it". There are for candidates in the first sentence : "dawn", "Valiant", "Zwiebel", and "torpedoes". Semantic classifications should enable us to exclude "dawn" ( * "dawn sinks" ), and number agreement will exclude "torpedoes". However, our semantic and syntactic tools will not enable us to choose between the "Valiant" and "Zwiebel" (both presumably ships of some sort). In order to decide that it was the Zwiebel which sank, we must recognize the causal relationship between the firing of the torpedoes and sinking of the ship: that the Valiant fired at the Zwiebel (since one normally doesn't fire at oneself), and hence that it was probably the Zwiebel which sank. Such facts and connections are not bothered to be included into narratives since they can readily be inferred by the reader.
Discovering the implicit information is exactly
the role of the discourse analysis component in NLP systems. Some of the
approaches used for this include
grouping facts by topic,
designing situational frames,
designing situational scripts, and
designing situational plans.
Also, any NLP system that handles a discourse that includes a dialog situation must also be able to reason about beliefs, intentions, desires and other propositional attitudes of the agents in the dialog, as well as agents referred to in the discourse. This, in short is what is referred to as the pragmatics of the natural language utterances. Most currently implemented systems use only the simplest dialog structure, with unilateral initiative in which one party completely controls the conversation, giving commands or asking questions. The other party simply responds to the last command or question. Such systems are realtively easy to implement, since they need not analyze the preceeding dialog at length but they are also rather constraining for the user. Mixed initiative systems with relaxed constraints are also being implemented. In them, a great majority follows the idea of maintaining multiple goals in a conversation (users' and system (clarification) goals).
|
for Grammar
Processing Algorithms
Logical grammars (and logic
programming in a broader sense) have been widely accepted as a paradigm
for parser and generator design and prototyping. Thanks to the related
theoretical background, it became possible to develop formal systems in
order to compare various approaches of the competing grammar processing
algorithms (parsers and generators). The need for their formal evaluation
has been felt very strongly in Computational Linguistics because much of
the work done has been predominantly empirical and it became difficult
to measure the actual progress. A formal apparatus has been developed for
ranking the grammar processing algorithms with respect to the following
criteria :
The evaluation system based
on the general principle of the traversal of derivation trees showed
most promise because of its independence of a particular grammar, grammar
formalism or execution strategy.
The following excerpts from the authors research on the subject illustrate the flavor of the efforts taken in this direction. Going back to the who
wrote this DCG example, let's try to illustrate one way to formally
look into the derivations and derivation tree traversals. Below is what
is referred to as an oracle derivation tree , a tree in which all
critical variable assignments are somehow (oracle) known.
What follows are various traversals
of the tree. In them, each node contains the predicate symbol, the set
of variables uninstantiated at the moment of the visit of the given node,
the set of variables uninstantiated at the moment after the given node
was just departed, and the ordinal number indicating how many nodes in
the tree were visited before the given one. The first two correspond to
two versions of Essential Arguments driven traversals, and the third one
is the Semantic-Head Driven Generation (SHDG) traversal.
The essential arguments algorithm (EAA) rearranges grammar productions at compile time in such a way that the evaluation (even one like a simple top-down left-to-right evaluation) will follow an optimal (or nearly optimal) path. By optimal path we mean a path in which no unsuccessful expansions will ever occur, or in terms of traversing of an analysis tree, no backtracking ever happens. In its preprocessing phase, EAA computes the so called minimal sets of essential arguments (abbreviated as msea's) for each of grammar symbols. If variables from a minimal set of essential arguments of a grammar symbol are all instantiated, a finite deterministic expansion of the grammar symbol is possible. Since we assume a finite deterministic expansion, no wrong moves (and backtracking) is allowed. An expansion of that kind can end with a failure or a success, but it is an expansion with no guessing. Thus, when used for generation, EAA will assume that the semantics of the topmost grammar symbol is instantiated and will try to rearrange rhs grammar symbols in rules where the topmost symbol is on the left-hand side in such a manner that whenever a symbol is next for expansion, at least one of its msea's is instantiated. If such ordering(s) of grammar symbols exists, it is (they are) adopted and EAA actually creates a (or several) grammar(s) equivalent to the original one and executable in a top-down, left-to-right fashion. If such reordering of grammar symbols is not possible, EAA will apply so called inter-clausal reordering (abbreviated as ICR). It will lift up a grammar rule having a rhs symbol of the currently processed rule on its left-hand side, by replacing the symbol in the current rule with the right-hand side symbols of its rule. Semantic-Head-Driven Generation Algorithm (SHDGA), uses information about semantic heads in grammar rules to obtain the best possible traversal of the generation tree, using a mixed top-down/bottom-up strategy. In its preprocessing phase SHDGA compiles grammar rules in one of two ways, depending on whether the rule in question is a "chain rule" or not. A chain rule is one in which the semantics of the left-hand side (abbreviated here as lhs) is identical to the semantics of some right-hand side (rhs) constituent, the rule's "semantic head". Here, the algorithm needs an explicit indication of what the semantics is, for a given grammar symbol. In two major papers on SHDGA, this indication is given by introducing "/" symbol, that separates syntactic from semantic part of a grammar symbol (SYMSYN/SYMSEM). Thus, what follows "/" symbol is considered semantics of a given grammar symbol. In addition, the "chained" relation, the transitive closure of the syntax part of the semantic head relation is computed. That is, if HEADSYN/HEADSEM is the semantic head of LHSSYN/LHSSEM, then HEADSYN and LHSSYN are in the "chained" relation. Therefore, preprocessing phase of SHDGA consists of introducing a notation that explicitly indicates what semantics is for a given symbol, aligning the grammar rules into "chain rules", or "non-chain rules" group and computing the relation "chained". We can talk of this preprocessing phase as of compilation phase of the algorithm. In the evaluation phase of the algorithm, SHDGA is given a grammar symbol (goal) with its semantics fully instantiated and is expected to discover corresponding surface string(s), if such exists. SHDGA selects a "non-chain" rule, whose left-hand side symbol (pivot) has semantic component unifiable with the semantics of the given goal and pivot can be connected to the goal by a sequence of "chain rules". That rule is expanded and the same procedure is applied to the grammar symbols on the right-hand side of it, if there are right-hand side symbols. Since the rule is "non-chain", no right-hand side symbol can have the same semantics as pivot. After the expansion of the "non-chain" rule is completed, its left-hand side symbol will get connected with the goal, by applying a sequence of "chained rules". Thus, next, a "chain rule" is selected whose left-hand side symbol has the same semantics as pivot's semantics and it is possible to connect it to the goal by a sequence of "chain rules". The same procedure is applied to the pivot's siblings on the right-hand side of this rule. The application of "chain rules" is continued in the same manner until at one moment, the left-hand side symbol of such a "chain rule" is actually the goal symbol. For more charts and illustrations on these two algorithms, see : Appendix A : SHDGA and EAA. The two variable sets shown on the previous tree traversals are the basis of what's known as universal guide, a concept introduced by the author that could be used for monitoring derivations, as well as the performance of parsers and generators. They could be added to a logic grammar as a piece of redundancy, and conveniently exploited for tighter control of the computational process. The notion of universal guides is a generalization of well known proper guides (introduced by M. Dymetman and P. Isabelle). Unlike proper guides, universal guides allow consumption of guides at any level, not only lexical, they're introduced independently of the algorithm and the underlaying grammar and the related the main results (concerning finite derivations) are stated not only with respect to one (the depth-first-search, left-to-right (DFLR)) algorithm, but with respect to any grammar processing algorithm. Also universal guides treat parsing and generation in a symmetrical way, as it was done by proper guides, but unlike with the proper guides, universal guides need not to be instantiated and created differently for parsing and generation. For more on universal guides see : Appendix B : Universal guides The relation STAS (Same To
A Subtree) is a mathematical relation defined on the set
of all possible tree traversals, useful in estimating the quality of a
grammar processing algorithm based on the criteria of completeness, soundness,
efficiency and reversibility. Intuitively, two tree traversals T1 and T2
are STAS ( STAS(T1,T2) ) if at any given moment during the traversal, the
choice of the next subtree to be visited is the same for T1 and T2. STAS
does not imply that the order in which the nodes are visited will necessarily
be the same. In other words, two traversals can be STAS and their orderings
of nodes quite different, as it is shown on the next couple of examples.
STAS is an equivalence relation and it partitions the set of all possible
traversals into its mutually exclusive equivalence classes. Those classes
as we'll illustrate often describe powers of various grammar processing
algorithms.
The figure from the Appendix C : STAS Tree Traversals represents an example of two tree traversals which are STAS. The nodes are visited by the first algorithm in the order indicated by the arabic numbers (A, E, J, K, F, B, G, H, L, C, I, M, D). This traversal, as it will become clear later in the study, can be produced by semantic-head-first algorithm. The second algorithm, marked here by roman numbers, is depth-first search algorithm, performed in a top-down, left-right fashion, producing the following ordering: A, B, E, F, J, K, F, C, G, H, L, D, I, M. As it can be verified easily, the choice of the subtrees for the traversals at any node is same for both algorithms. For instance, at the node A, both algorithms first traverse the leftmost subtree, rooted at the node B, then the middle one (rooted at C) and at the end the rightmost (rooted at D). However, it should be observed that the order in which the nodes in the subtrees are visited by the two algorithms is not the same. Same type of verification can be done for the remaining nodes in the tree. Therefore, these two traversals are STAS, although the order in which the nodes are visited is different. The examples from the Appendix C : STAS Tree Traversal
Relation illustrate the way STAS could be used to compare SHDG and EA algorithms.
|
![]() |
Machine Translation (MT) |
Goal : Devise a system capable of translating from one human language to another (!!!).
Problems :
Lack of empirical study of translation.
Linguistics (in a narrow sense) not as useful as expected.
No such thing as a universal MT system.
Developers' side : Huge investment
vs. Small market
Users' side :
High expectations vs. Moderate or low performance
Researchers' side : Big dreams vs. Serious
deadlocks
Directions :
More empirical studies of translation
Diversification of systems
Customization techniques
Aproaches :
BI-LINGUAL CORPORA approaches
Example-based MT (EBMT)
This is a translation by retrieval in which previous translations
are used as templates for translation of new text. A conservative
text to use is the Bible which has been so widely translated.
Statistical-based MT (SBMT)
This approach capitalizes on the idea that translation examples
implicitely conatin rules of translation. Statistical techniques
are used to reveal rules.
LINGUISTICS BASED approaches (LBMT)
REDUCTIONIST approach
Monolingual theory of L1 <===========>
Monolingual theory of L2
systematic
correspondence
theory
Language L1
Language L2
(infinite set of expressions)
(infinite set of expressions)
INTER-LINGUAL approach
[Pivot]
Inter-lingual
expressions
Analysis component Generation component
L1 input sentences L2 translations
HYBRID approaches
Attempt to integrate the best of the above in order to resolve most
difficult issues in MT by :
adding a model of the discourse,
rules of pragmatic interpretation and generation,
a representation of the information structure of the input,
etc.... (lots of good thesis topics here).
MT References :
To get a sense of various problems involved in a MT system, try the Globalink translator at http://www.globalink.com/xlate.html.
Try a real machine translation system: Altavista + Systran
![]() |
Computer-Assisted Language Learning (CALL) |
Much of what is called 'CALL' does not involve computational linguistics, but rather linguistic computing (generally, doing things with language data on a computer). In any case, CALL is one of the most challenging application areas considered to date. Unlike general NLP, which requires a knowledge of linguistic theory and of computer science, CALL requires an understanding of:
The way the topic of computer-assisted language learning (CALL) is usually approached first is by looking at 'book-assisted language learning'.
To start with, let's consider the aspects of book technology, as it has developed over the last 500 or so years, that we take for granted, e.g. :
(a)
the content and structure of the printed page (can contain both text and
graphics;
may contain glosses and notes in at various positions on the page),
(b)
ways that learners can interact with books (e.g. they can be owned; they
are
highly portable; the student can write in the margins, highlight areas
of the text,
turn down pages, and insert Post-it notes, tape flags, paper-clips within
the page).
Also, let's identify :
(a)
a number of variables such as the learner's profiency level and goals,
whether the
language is being learned in a foreign-language or second-language setting
(b)
aspects of the language that are often taught with the assistance of books
(c)
areas that books are and are not useful for (e.g. + reading; - phonology).
Now, turning to computer-assisted language learning,
we can observe that to make good use of the computer as a tool in language
learning, it is important to know about :
(a)
language acquisition theory
(b)
learning theory
(c)
applications and design of computer programs.
There is really a great deal in CALL that
represents the transfer of book technology and traditional methods of language
teaching to the computer: e.g. the printed dictionary becomes the electronic
dictionary; the traditional presentation of grammar and grammar exercises
becomes screens of grammar lessons and grammar exercises. In fact, drill
and memorization play a minor role in learning a language (though they
may be useful for learning about the language).
A good deal of current CALL is simple PC or Mac programs and very little current CALL applies computational linguistics to the problems of language learning. Here are some possible reasons for this situation: programmers who become interested in CALL may not be informed by language acquisition theory and advances in language pedagogy; those who are may not be in a position to develop computer programs; both groups may be unfamiliar with relevant theory and techniques in computational linguistics. The informed CALL designer (the CALL designer of the future) is ideally knowledgeable about both Applied Linguistics and Computational Linguistics.
A very simple combination of sound and text on the computer is the 'audio-dictionary' that has a recording associated with each word; these are becoming available in commercial editions. Since digitized sound takes up a great deal of space, large collections of sound recordings are more practical on CD-ROM than on diskette. However, we mustn't forget the potential of the Internet: disk drives associated with mini-computers and workstations can store vast amounts of data, which can be made accessible to large groups of users over the Web. Text (such as words in a dictionary) can be linked to sound recordings with hypertext links. To get a feeling about the currents, we could check out several examples of this in various applications on the Web: Sounds of the World's Animals (which has recordings of animal sounds, for comparison with how these sounds are expressed in human languages); Web French lessons (which have some pronunciation information); and Hwæt! Old English in Context, which has some recordings linked to words, and also an 'audio-visual glossary'.
In terms of reading, this is an obvious area that books are good for: it is in fact their primary use. But reading can be enhanced on the computer, most obviously through the use of hypertext (or more generally, hypermedia). This appears to be a very fruitful area for further development, especially with the advent of the Web and the tools that are available for developing hypermedia on the Web. These Web tools are in fact authoring software and very suitable for academics to develop new language-learning materials with: unlike commercial authoring software, which may cost hundreds or thousands of dollars, many of these tools are low-cost or free. There is also extensive free documentation on developing Web 'pages' (as well as dozens of books on the subject). It is well known how easy it is to learn basic HyperText Markup Language (HTML), how to incorporate text and graphics and link your documents to related ones.
Much of the programming that we see associated with traditional CALL can be divided into two areas: (a) handling the user interface, and (b) processing user input and other data. But now consider, by contrast, Web software: the user interface is handled by general-purpose software such as Netscape and Mosaic; the Web CALL designer, instead of writing programs to collect user input and display information on the screen, concentrates on design. Web browsers have been designed, developed, tested and documented by professionals, and run on a variety of platforms (PCs, Macs, Suns, etc.). Much the same is true of commercial authoring software (though this is often limited to specific hardware and operating systems, unlike Web browsers). It would be foolish, in general, to write programs to do what professionally-developed software already does (and probably better).
The fact however remains that the humans are far
superior in language processing and reasoning. Consequently, the ways in
which the intelligence of the learner, rather than artificial intelligence
can be exploited in designing language learning situations that involve
the use of computers remains the preferred approach in CALL technology
nowadays. To consider how these learning situations might be enhanced with
some limited artificial intelligence one should visit sites and characters
listed below in the reference section (Elmo, the MOO softbot, etc...).
Have fun !
CALL References and Glossary :
MOOs are internet accessible, text mediated virtual environments well suited for distance learning. MOOs of interest include ...
MUD
Clients! Visit Kathleen MacMahon's comprehensive MOO clients home page,
with links and information for both Mac and PC users. I highly recommend
a visit to this page, and generally, for MOO-ing through a client.
MOO
Quick Start and Summary of Commands
MOOs offer every participant the opportunity to construct spaces and objects and to write code that in some way augments or increases the functionality with these virtual spaces. In this sense, MOOs are constructed social spaces in a dynamic process of continual evolution.
For users, MOOs can be described as constellations of spaces, or "rooms", within which multiple individuals can congregate and interact. Movement is possible room to room, by typing in cardinal directions, or via "teleporting", which allows immediate transport to rooms not adjacent to one's present location.
telnet moo.du.org 8888 [for Diversity University]
Log in as a guest, [type: co guest > (for 'connect guess']
Kaloo pages David with "Help--I'm stuck in the Coat Closet!
Type: look A description of Kaloo will appear on your screen.
You can also "look" at yourself, by typing look
me >.
To review the description of a room you're in,
enter: look >
If in the room you're in you see an object you
want to examine, enter this:
Goal :
Find the information requested by the user.
Difficulty :
Imprecision of the goal.
Important Considerations :
To begin with experiencing the issues involved
in IR, let's select one of the Web search engines, and search for information
about (a) Old English (b) Julia the software robot (c) information retrieval.
Analyze the results for
Here, one has to bear in mind that Web search
databases are always slightly out of date and never complete.
Issues
What does (computational) linguistics have
to offer to Information Retrieval?
Information retrieval involves human language.
As language scientists, linguists should have much to contribute to this
field. Obvious areas of contribution include the linguistic analysis of
user queries and the analysis of text, e.g. through 'deep' natural language
processing. Analysis of meaning is ultimately what seems to be required,
although whether deep NLP is necessary to achieve this remains to be seen.
But let's begin at a higher level.
Perspectives on the IR speech situation
Pragmatics is also concerned with higher-level
principles of utterance interpretation, and the work of the British philosopher
H. P. Grice has been highly influential in this regard (Gricean pragmatics).
In this lecture, the focus is on an application of Grice's Cooperative
Principle and Maxims of Conversation to IR, specifically with respect to
precision
and recall. Let's elaborate on the mentioned in more details.
Information Retrieval: Precision and Recall
For example, by the maxims of Quantity and Relation,
the user will expect that what has been retrieved is relevant, and that
the display of what has been retrieved is as informative as is required.
This means that by default, novice users will expect perfect precision
(Relation: everything retrieved is relevant) and Recall (Quantity and Relation:
everything relevant has been retrieved). Now, it is reasonably easy to
see precision errors, and users respond to these by reformulating (often
narrowing) their queries.
An assumption of perfect precision is often
contradicted by the evidence. But the assumption of perfect recall is not
contradicted by the evidence. Users cannot see what has been missed;
even experienced users may believe that their search results are retrieving
everything that is relevant (or at least, a good percentage of what is
relevant).
The pitfalls for the user are illustrated by the
initially mentioned assignment, which involved using Web search engines
to find information about Old English, information retrieval,
and Julia (the softbot). The assignment asked to report precision
and recall for a selected search engine. We cautioned that Web search engines
search databases of Web pages that are always out of date, since new Web
pages come into being every second. It follows from this fact alone that
recall cannot be measured accurately with respect to the Web.
Furthermore, a little thought should show that
Web search engines look for the words in the query, not the underlying
concepts. Thus, Web queries for Old English retrieved pages which
included the word old and the word English: e.g. Old English
sheepdogs, Old English taverns, and even an article on recognition
of English vowels by Old World monkeys. The poor precision
of these searches is what everybody using the Internet notices. But what
about recall?
There are in fact many ways of expressing the
concepts that underly the expressions 'Old English', 'Information Retrieval',
and even 'Julia' (e.g. 'that softbot developed by Michael Mauldin, whose
name I can't remember except it begins with a J'). For Old English,
I gave a hint that was intended to spark reflection: an alternative term
is Anglo-Saxon. Even seasoned searchers can easily fail to consider
the full range of possibilities for the linguistic expression of concepts,
resulting in decreased recall.
The heart of the problem is that texts and queries
are not just assemblages of words: humans express meanings using
language, and languages provide a multitude of ways of expressing meanings.
We are accustomed to interacting with other humans, who attend to the sense
of what we are saying. But current search technology is largely focused
on words. In a Web search, it is not possible to ask the question
underlying a search for information about Old English, e.g. 'I am interested
in information about Old English, the language spoken in England between
roughly 450 and 1100.' Without an analysis of meaning (not to speak of
user goals), there is no reliable way for search engines to retrieve all
and only what is relevant to the user.
Example: look Kaloo
@examine
Example: @examine here (to see what's in the
room you're in)

Information Retrieval
(IR)
Who is the user ?
Naive user vs. trained analyst
What is the data ?
(i) Growing archive of text
(ii) Volatile corpus outside the control of the IR system
-- Internet
(iii) Message traffic
-- routing
precision (P
- proportion of the relevant information among what's retrieved), and
recall (R
proportion
of all relevant information retrieved).
Types of IR Systems
A
user searching for the word "car" wants to retrieve documents that mention
automobiles.
Query
Expansion
IR Technologies
When we look at IR in this way, it seems that Interactional
Sociolinguistics and Linguistic Anthropology are very relevant to improving
the quality and success of IR interactions. But the field of Pragmatics
(broadly, the interpretation of utterances in context) also has much to
contribute. Pragmatics includes the study of reference (pronominal reference,
reference of full NPs, etc.), which is clearly relevant. To the extent
that documents are retrieved based on the relative frequency with which
some search expression is satisfied, then the reliable interpretation of
referring expressions should increase the accuracy of the search. Thus,
if we are searching for documents about People's Savings Bank of Canada,
we would want to detect subsequent references to this entity as People's
Savings Bank, the bank, it, etc. Pragmatics (and Computational
Pragmatics) provides a theoretical basis for reference tracking.
Pragmatics: Grice (1975)
Grice's maxims govern human interaction involving
language, in the sense that we assume by default that they are being adhered
to. On the basis of this assumption we regularly draw certain non-logical
(i.e. plausible) inferences from what is said or written, which Grice calls
implicatures.
These implicatures contribute to conveyed meaning, beyond the literal meaning
of what was said. One of Grice's illustrations involves a letter of recommendation
which patently violates the first submaxim of Quantity (Make your contribution
as informative as is required):
(Grice, H. P. 1975. Logic and conversation. In Peter
Cole and Jerry Morgan (eds.), Syntax and Semantics: Volume 3, Speech
Acts. New York: Academic Press)
How does this relate to IR? Although humans are known
to adapt to computer systems, any kind of communication involving human
language seems to evoke principles governing human-human interaction. When
considered in this light, we can see that users will expect an IR system,
by default, to be 'cooperative' (in the Gricean sense) and subject to the
Maxims of Conversation. In particular, the user may expect IR system responses
to be
(a)
informative (Quantity),
(b)
true (Quality),
(c)
relevant (Relation), and
(d)
clear, unambiguous, brief, and orderly (Manner).
These observations provide a basis for understanding
and predicting some common problems with IR interactions.
Miscellania
Salton, Gerard. 1989. Automatic Text Processing: The Transformation,
Analysis, and Retrieval of Information by Computer.
Reading, MA: Addison-Wesley.