-*- Outline -*- * c11b ** egrep *** Design for an egrep program egrep input: a regular expression E and a set of strings egrep output: print each string that is in L(E) Steps- 1 build a NFA accepting the language of the regular expression 2 build a DFA accepting the same language as the NFA 3 run each of the strings through the DFA Some versions of grep-like programs skip step 2 and run the strings directly through the NFA, but that requires a sometimes very expensive graph search. If there are a lot of strings (big file ), it's better to build the DFA, because repeatedly running a DFA is much more efficient than repeatedly running a NFA. We need a module for each of the steps. Will only discuss Step 2 here. *** Building DFA Pseudocode for our algorithm given in C11a.Kleene's Theorem.NFA to DFA: Set qqt, qq0 // sets of NFA states/int Set DFAstateSet // set of DFA states . . . . . . . . // each element is a set of ints . queue <- emptyQueue() . qq0 <- lambda({NFA.startState}) . queue.add(qq0) . DFAstateSet.add(qq0) . repeat . . . . qq <- queue.remove() . . . . . . for each alphabet symbol s . . . . . . qqt <- TD(qq, s) // TD() is DFA transition fn . . . . . . DFAedgeSet.addEdge( qq, s, qqt) . . . . . . if qqt not -in- DFAstateSet . . . . . . . . . queue.add(qqt) . . . . . . . . . DFAstateSet.add(qqt) . until (queue.isEmpty()) *** NFA representation states; 0, 1, 2, ..., nq symbols: 0, 1, 2, .., na class NFA{ int nq, na int startState Set finalStates Set T[][] //non-lambda transitions T[stateix, symbolix] Set E[] // lambda transitions E[stateix] } For example T[2,1] is the set of states reached from state 2 along one edge labeled with symbol 1. If T[2, 1]= {1, 5}, then there are edges labeled "1" from state 2->state 1 and from state 2->state 5. *** DFA Transition function //kernel(qq,s)= UNION{ T(q, s) | q -in- qq } //T(q,s) is given by the NFA transition matrix T[][] Set kernel(Set qq, int s){ //qq is set of NFA states, ie ints . Set answer = new Set() // answer = {} . Iterator isource = qq.iterator() . while (isource.hasNext()) . . int q = isource.next() // Pick one q -in- qq . . // answer = answer UNION T[q,s] . . // Union by adding in elts of set T[q,s] one by one: . . Iterator itarget = T[q,s].iterator() . . while (itarget.hasNext()) . . answer.add(itarget.next()) . return answer } Set TD(Set qq, int s){ //qq is set of NFA states . return lambda(kernel(qq, s)) } *** Set Data Structures for Building DFA **** Set Interface Interface Set Set() // return empty set boolean contains(Object o) // Is o an element of set? boolean add(Object o) if o is already an element, return false if o is new, add o the set , return true int size() // size of the Set Iterator iterator() Example //test whether Set S is a subset of Set T Iterator is = S.iterator(); while ( is.hasNext()) if (!T.contains(is.next()) ) return false; return true; Note: Set equality test is almost the same **** Set Implementations boolean contains(Object o) // Is o an element of set? boolean add(Object o) int size() // size of the Set Iterator iterator() If the set elements are integers (NFA states), there are several good implementations Boolean (or "bit") array Linked list Array of elements Sorted array of elements TreeSets for more, see http://java.sun.com/j2se/1.4.2/docs/api/java/util/Collection.html ***** Boolean array Suboptimal: iterator() requires looking at all array positions, true AND false. ***** Linked lists "contains" has linear cost pointers use memory ***** Array of elements "contains" has linear cost ***** Sorted array of elements "contains" has logarithmic cost "add" has linear cost ***** TreeSet set is represented as a binary search tree with parent links contains() and add() are O(log N) iterator is efficient if there are parent links bonus--iterator presents elements in order **** Choice of Set Implementation for DFA->NFA Conversion Algorithm The algorithm, again: Set qqt, qq0 // sets of NFA states/int Set DFAstateSet // set of DFA states . . . . . . . . // each element is a set of ints . queue <- emptyQueue() . qq0 <- lambda({NFA.startState}) . queue.add(qq0) . DFAstateSet.add(qq0) . repeat . . . . qq <- queue.remove() . . . . . . for each alphabet symbol s . . . . . . qqt <- lambda( kernel(qq, s)) . . . . . . DFAedgeSet.addEdge( qq, s, qqt) . . . . . . if qqt not -in- DFAstateSet . . . . . . . . . queue.add(qqt) . . . . . . . . . DFAstateSet.add(qqt) . until (queue.isEmpty()) Two kinds of sets occur in the algorithm: Kind 1. A set like qq3 = { 2,5,8,9 } , which is one DFA state. The elements of such a set are single NFA states (represented as integers). The number of NFA states is moderate if the length of the regular expression is moderate. So the NFA states can be represented as integers over a moderate range. Each of these "Set"'s can be represented as a bit or boolean array of moderate dimension. Kind 2. The DFAstateSet containing all known DFA states, eg DFAstateSet = { {2,5,8,9}, {2,6,7,1}, {4,5}, .. } The elements of such a set are themselves sets. Boolean array is out. Why? We will use Boolean array for Sets of integers, but need a different representation for Sets of Sets of integers. Some candidates- linked list of bit arrays, linked list of bit arrays, said arrays in sorted (is this possible?) order array of bit arrays array of bit arrays in sorted order a TreeSet of bit arrays Note: The good news is that our implementation for DFAStateSet doesn't have to support all of the operations in the Set interface. Examining the algorithm, the only operations needed for Sets of Sets are "contains" and "add". . . . . . . . DFAstateSet.add(qqt) done once for each DFA state . . . . . . if qqt not -in- DFAstateSet done once for each DFA edge Which is best? For each candidate, work out the total time and space cost, assuming the DFA has N states and k*N edges. time cost = |QQ| * ( cost of add) + |QQ| * |A|* (cost of "contains") space cost = |QQ| * (space per DFA state) + |QQ| *|A| * (space per DFA edge) **** Orderings The better choices for the more complex Set implementation require that the elements, the basic integer Sets, be maintained in some kind of "order". What does this mean, and can we do it if the basic Sets are Boolean arrays? -------------------------------------------------------------- PARTIAL ORDER A binary relation is a partial order if it is antisymmetric, ie for all x, y, if (x R y) and (y R x) then x = y and transitive, ie for all x,y, and z, if (x R y) and (y R z) then (x R z) -------------------------------------------------------------- Note: "antisymmetric" is not the same as "not symmetric" Example 1. On N, the equality relation is antisymmetric, ie if (x=y) and (y=x) then of course x = y. But equality is also symmetric, ie if x=y then y = x. Example 2. On N, the relation <= is antisymmetric and also not symmetric. Example 3. On N, the relation < is antisymmetric and also not symmetric. Antisymmetry holds because the condition "(x R y) and (y R x)" is impossible, hence no counterexample can exist. -------------------------------------------------------------- TOTAL ORDER A binary relation is a total order if it is a partial order, and for all x and y at least one of the following is true: x=y x R y y R x -------------------------------------------------------------- Example. On N the relations = , <=, and < all are total orders. Note. Under a total order, if x -notequal- y, exactly one of (x R y) and (y R x) is true. Q. Is there a total order for Power({0, 1, .., n } )? This the same as aking for an ordering of all possible sets of NFA states, i.e. ordering the DFA states. *** Lambda closure /Breadth first traversal For NFA node q, the lambda moves leaving q are edges to the node set E[q] // Inductive definition of lambda(S) // Base: If s -in- S, then s -in- lambda(S) // Induction: If q -in- lambda(S), then for all t -in- E[q], // t -in- lambda(S) The Java Collections also include Lists, which can handle ops needed for a Queue. Queue initially holds S (Base). Repeatedly, remove item from Queue, place the item in closure and also apply Induction to it, to get more items for Queue... Set lambda(Set s){ . Set answer = new Set() . Queue qu = new Queue().addAll(s); . while (!qu.empty()){ . . int q = qu.qremove() . . answer.add(q) . . // new inductively generated elts added to Queue: . . Iterator i = E[q].iterator() . . while (i.hasNext()){ . . . t = i.next() // q -> t . . . if (!answer.contains(t)) . . . . qu.qadd(t) . . } . } . return answer } More efficient/complicated: precompute and save closures for {0}, {1},... and union as needed for lambda( s). ** Minimization *** Introduction It's possible to "compress" any DFA to a smallest possible DFA accepting the same language. Before showing how to construct the "minimum state" DFA, we have to do some general work with equivalence relations. *** Relations -------------------------------------------------- Cartesian product of two sets A x B = { (a, b)| a -in- A and b -in- B } Binary relation A binary relation on set A is a subset of A x A -------------------------------------------------- Example A = { 1, 2, 3} A x A = { (1,1), (1,2), (1,3), (2,1), (2,2), (2,3), (3,1), (3,2), (3,3) } R1 = { (1,2), (2,3) } R2 = {} R3 = A x A -------------------------------------------------- Notations: (1, 2) -in- R1 same as R1(1, 2) same as 1 R1 2 ------------------------------------------------- *** Equivalence relations defined -------------------------------------------------------------------- EQUIVALENCE RELATION A binary relation =R= is an "equivalence relation" if, for all x, y, and z, all of the following hold: 1. ( x =R= x ) 2. If (x =R= y), then (y =R= x) 3. If (x =R= y) and (y =R= z), then (x =R= z) --------------------------------------------------------------------- Property 1 is "reflexivity". For relation "<" on N, it is false, because, for example 1 < 1 is false. Property 2 is "symmetry". For relation "<" on N, it is false, because 1 < 2 is true, but 2 < 1 isn't. For relation != on N (not equal), it is true: whwnever x != y, it's is also true that y != x. Property 3 is "transitivity". For relation "<" on N, it is true, For relation != on N (not equal), it is false: for example with x = 1, y = 2, z = 1 (x != y) and (y != z), but (x != z) is not true Relation "<" on N is not an equivalence relation, because properties 1 and 2 fail. Relation "!=" on N is not an equivalence relation, because properties 1 and 3 fail. *** Equivalence relation and partition ------------------------------------------------------- Equivalence class Let =R= be an equivalence relation on set A. For each member a -in- A, the "equivalence class of a" is the set of elements of A related to a by =R=: [a] = { x -in- A | x =R= a } ------------------------------------------------------- Example A = { 0, 1, 2 , -1, -2 } ( x =R= y ) if x*x = y * y Can prove that =R= has the necessary 3 properties: it is an equivalence relation. (will skip proof) Equivalence class of 1 x =R= 1 ? x x*x = 1*1 ? ----------------- 0 False 1 True 2 False -1 True -2 False Equivalence class of 1 = [1] = { 1, -1 } [1] = { 1, -1 } [-1] = ? [2] = ... [-2] = [0] = Partition Different equivalence classes never overlap. The set A is "partitioned" A = { 1, -1 } -union- { 2, -2 } -union- { 0 } or A = -union- {{ 1, -1 }, { 2, -2 } , { 0 }} *** 0-equivalence Now we are beginning the minimum state DFA construction. Given a DFA, the relation 0-equivalence is defined as (q =0= r) if states q and r are both final states, or q and r are both non-final Example Q = {0,1,2,3,4,5,6}, qo = 0, F = { 2, 5} [0] = { 0,1,3, 4 ,6} [2] = { 2, 5} A0 = { 0,1,3, 4 ,6} B0= { 2, 5} *** 1-equivalence ------------------------------------------------------ 1-equivalence (q =1= r) if (q =0= r) and (for all s -in- A)(T(q,s) =0= T(r,s)) ------------------------------------------------------ Example: s t a t e 0 1 2 3 4 5 6 a 1 2 3 2 5 6 5 b 2 3 4 1 4 2 6 Annotate each column header with its 0-equivalence class, and replace each table entry with its 0-equivalence class: (A0) (A0) (B0) (A0) (A0) (B0) (A0) _0 _1 _2 _3 _4 _5 _6 a A0 B0 A0 B0 B0 A0 B0 b B0 A0 A0 A0 A0 B0 A0 (3 =1= 6) because columns 3 and 6 above have equal 0-equivalence headers and elements, A0 B0 A0 1-equivalence partition: A1 = [0] = { 0} B1 = [1] = { 1,3,4,6 } C1 = [2] = { 2 } D1 = [5] = {5} To form partition: ECS <- {} //set of equiv classes for each q -in- Q foundClass <- false for each E -in- ECS let e -in- E if (q =1= e) {E.add(q) foundClass <- true break;} if (not foundClass) ECS.add({ q }) *** k-equivalence ------------------------------------------------------ k-equivalence (q =k= r) if (q =k-1= r) and (for all s -in- A)(T(q,s) =k-1= T(r,s)) ------------------------------------------------------ k-equivalence of states q and r means, if you take any string of length k or less, and start a path consuming the string, the result is the same, whether you start a state q or at state r. 2-equivalence 0 1 2 3 4 5 6 a 1 2 3 2 5 6 5 b 2 3 4 1 4 2 6 Annotate each column header with its 1-equivalence class, and replace each table entry with its 1-equivalence class: (A1) (B1) (C1) (B1) (B1) (D1) (B1) _0 _1 _2 _3 _4 _5 _6 a B1 C1 B1 C1 D1 B1 D1 b C1 B1 B1 B1 B1 C1 B1 2-equivalence partition: A2 = [0] = { 0} B2 = [1] = {1, 3 } C2 = [2] = { 2 } D2 = [4] = {4, 6} E2 = [5] = {5} This is a "refinement" of the 1-equivalence partition. A2 = A1, C2 = C1, E2=D1, but B2 and D2 are new, resulting from a partition of the old B1. 3- equivalence 0 1 2 3 4 5 6 a 1 2 3 2 5 6 5 b 2 3 4 1 4 2 6 Annotate each column header with its 2-equivalence class, and replace each table entry with its 2-equivalence class: (A2) (B2) (C2) (B2) (D2) (E2) (D2) _0 _1 _2 _3 _4 _5 _6 a B2 C2 B2 C2 A2 D2 A2 b C2 B2 D2 B2 D2 C2 D2 A3 = [0] = { 0 } B3 = [1] = {1, 3 } C3 = [2] = {2} D3 = [4] = {4, 6} E3 = [5] = { 5 } Note that this =3= partition is the same as the partition associated with =2=. *** infinite-equivalence Once we achieve a partition that equals the prior partition, continuing the process to higher k will only produce that same partition. In this example, =3= and higher are the same as =2=. For example, from B2, 1 =k= 3 , so for a string s of ANY length, it doesn't matter whether we start at state 1 or state 3. In the minimum state DFA, states 1 and 3 will merged into one state. ------------------------------------------------------ infinite-equivalence (q =infinite= r) if (for all k -in- N)( q =k= r) ------------------------------------------------------ This example has "index" 2. Every DFA has some index value for which all higher =k= relations are the same as =index=. In general, =index= is the same as =infinity= *** Minimum state DFA Minimized DFA Transitions Unminimized DFA: 0 1 2 3 4 5 6 a 1 2 3 2 5 6 5 b 2 3 4 1 4 2 6 =infinity= partition: A2 = { 0 }, B2 = {1, 3}, C2 = {2}, D2 = {4, 6}, E2 = {5} Replace each state with its infinite-equivalence class: A2 B2 C2 B2 D2 E2 D2 a B2 C2 B2 C2 E2 D2 E2 b C2 B2 D2 B2 D2 C2 D2 remove duplicate columns A2 B2 C2 D2 E2 ------------------------------------------- a B2 C2 B2 E2 D2 b C2 B2 D2 D2 C2 Minimized DFA states = { A2, B2, C2, D2, E2 } Minimized DFA Start state = [ Unminimized DFA start state ] = [ 0 ] = A4 Minimized DFA Final states = { [ q ] | q is a final state of unminimized DFA } = { [ 2 ], [5] } = { C2, E2} *OR* QM = { [q] | q -in- Q} // the set of =infinity= equivalence classes = { A2, B2, C2, D2 } start state qm0 = [q0] = [0] = A2 final states FM = { [q]| q -in- F} = {[2], [5]} = { C2, E2 } transitions TM([x], s) = [ T(x, s) ] TM(B2, a) pick any original state in B2, say, 1 TM(B2, a) = [ T(1, a) ] = [ 2] = C2 TM(B2, b) = [T(1, b)] = [ 3 ] = { 4,6} = B2 ** Beyond regular languages *** Introduction L1 = { (ab)^n | n -in- N } L2 = { a^mb^n | m,n -in- N } L3 = { a^nb^n | n -in- N } L4 = { a^nb^nc^n | n -in- N } L5 = { w | w is a valid (no syntax error) C program } L6 = { w | w is a valid (no syntax error) C program that terminates for all inputs } L7 = { w | w is a valid (no grammatical error) sentence of English } *** Not all languages are regular L = { a^nb^n | n -in- N } = { "", ab, aabb, aaabbb, ... } There is no regular expression for this language Proof by contradiction: Assume there is a regular expression for L By Kleene's Theorem, there is a DFA M that accepts L Let nq be the number of states -in- M ... Contradiction. *** Pumping lemma A tool for proving that languages are not regular **** Statement of Pumping Lemma If L is an infinite regular language over A, then THERE IS AN m > 0 such that: for any string s in L having length at least m, there are strings x, y, and z over A with 1. s = xyz 2. y is not "" 3. xy is of length at most m 4. xy^kz in L for all k Proof of pumping lemma is omitted. Note: Pumping lemma's "m" depends on L only. Once m is fixed, if you specify any string "s" -in- L (of length at least m), the "x", "y", and "z" are then determined. **** Template for using Pumping Lemma To prove that language L is NOT regular: Proof by contradiction. Assume L is regular. Let m be as provided by the pumping lemma. (#1 you supply a string in L of length at least m, that will lead to contradiction *!* ) ( #2 You show that the fact that the xy^kz are in L forces a contradiction) **** Example Example. Use the pumping lemma to prove that L = {a^nb^n | n -in- N } is not regular. Proof by contradiction: Assume L is regular. Let m be as provided by the pumping lemma. Let s = a^mb^m (it's in L, and its length is 2*m) Let x, y, and z be as provided by the pumping lemma a a a ... a b b b ... b = s 1 2 m m+1..... 2m -------- ----------------------- xy z Since s = xyz and xy is of length at most m, we know that x and y contain only a's. The strings {xz, xyz, xyyz , ..} are in L and y isn't "". Contradiction is that, among these, only "xyz" has the number of a's equal to the number of b's. So, for example, xz has fewer a's than b's, and so is NOT in L = {a^nb^n}. *** Grammars **** Introduction We've seen that some important languages cannot be handled by regular expressions. CONTEXT FREE GRAMMARS are a more powerful way to define languages. They can handle all regular languages and some, but not all, non regular languages. Computer languages like Java, etc. can "almost" be specified by grammars. For natural languages, you need much more powerful models than context free grammars. **** Example: Grammar for {a^nb^n} Start with an inductive definition of L ={a^nb^n | n -in- N} BASIS "" -in- L INDUCTION If s -in- L , then asb -in- L Grammar: S -> "" S -> aSb **** Grammars Example: S -> AB A -> "" A -> aA B -> bB B -> ba Generates L = ? In this example, the grammar contains "nonterminal symbols" { S, A, B } "start symbol" S "terminal symbols" {a, b} "productions" {S -> AB, A -> "", A -> aA, B -> bB, B -> ba } It's a shorthand for an inductive definition of three kinds of strings: those generated starting with the A nonterminal, the B nonterminal, and the S nonterminal. The most important strings (the one language defined) are the ones based on the start symbol S. In this example, the production A -> "" says that empty string is in the language derived from A. Similarly, the production B -> ba says "ba" is in the language derived from B. Productions whose right sides contain no nonterminals correspond to the basis clauses of an inductive definition. The other productions contain some nonterminal symbols on their right sides, and correspond to induction clauses in an inductive definition. For example, production A -> aA says, for all strings s, if s is derivable from A then as is derivable from A **** Derivations Grammar G {S -> AB, A -> "", A -> aA, B -> bB, B -> ba } ...1........2........3........4........5..... Derivation of "aaba": .. S -> AB -> aAB -> aaAB -> aaB -> aaba .. 1....3......3.......2.......5........ Derivation of string s * begin with start symbol * at each step, rewrite one nonterminal according to some production * last step produces s Language of G L(G) = { s -in- (terminals)* | there is a derivation of s}