Building Finite-State Machines

Post on 22-Jan-2016

55 views 0 download

description

Building Finite-State Machines. Finite-State Toolkits. In these slides, we’ll use Xerox’s regexp notation Their tool is XFST – free version is called FOMA Usage: Enter a regular expression; it builds FSA or FST Now type in input string FSA: It tells you whether it’s accepted - PowerPoint PPT Presentation

Transcript of Building Finite-State Machines

600.465 - Intro to NLP - J. Eisner 1

Building Finite-State Machines

600.465 - Intro to NLP - J. Eisner 2

Finite-State Toolkits

In these slides, we’ll use Xerox’s regexp notation Their tool is XFST – free version is called FOMA Usage:

Enter a regular expression; it builds FSA or FST Now type in input string

FSA: It tells you whether it’s accepted FST: It tells you all the output strings (if any) Can also invert FST to let you map outputs to inputs

Could hook it up to other NLP tools that need finite-state processing of their input or output

There are other tools for weighted FSMs (Thrax, OpenFST)

600.465 - Intro to NLP - J. Eisner 3

Common Regular Expression Operators (in XFST notation)

concatenation EF* + iteration E*, E+| union E | F& intersection E & F~ \ - complementation, minus ~E, \x, F-E.x. crossproduct E .x. F.o. composition E .o. F.u upper (input) language E.u “domain”

.l lower (output) language E.l “range”

600.465 - Intro to NLP - J. Eisner 4

Common Regular Expression Operators (in XFST notation)

concatenation EF

EF = {ef: e E, f F}

ef denotes the concatenation of 2 strings.EF denotes the concatenation of 2 languages.

To pick a string in EF, pick e E and f F and concatenate them.

To find out whether w EF, look for at least one way to split w into two “halves,” w = ef, such that e E and f F.

A language is a set of strings. It is a regular language if there exists an FSA that accepts all

the strings in the language, and no other strings.If E and F denote regular languages, than so does EF.

(We will have to prove this by finding the FSA for EF!)

600.465 - Intro to NLP - J. Eisner 5

Common Regular Expression Operators (in XFST notation)

concatenation EF* + iteration E*, E+

E* = {e1e2 … en: n0, e1 E, … en E}

To pick a string in E*, pick any number of strings in E and concatenate them.

To find out whether w E*, look for at least one way to split w into 0 or more sections, e1e2 … en, all of which are in E.

E+ = {e1e2 … en: n>0, e1 E, … en E} =EE*

600.465 - Intro to NLP - J. Eisner 6

Common Regular Expression Operators (in XFST notation)

concatenation EF* + iteration E*, E+| union E | F

E | F = {w: w E or w F} = E F

To pick a string in E | F, pick a string from either E or F. To find out whether w E | F, check whether w E or w

F.

600.465 - Intro to NLP - J. Eisner 7

Common Regular Expression Operators (in XFST notation)

concatenation EF* + iteration E*, E+| union E | F& intersection E & F

E & F = {w: w E and w F} = E F

To pick a string in E & F, pick a string from E that is also in F.

To find out whether w E & F, check whether w E and w F.

600.465 - Intro to NLP - J. Eisner 8

Common Regular Expression Operators (in XFST notation)

concatenation EF* + iteration E*, E+| union E | F& intersection E & F~ \ - complementation, minus ~E, \x, F-

E

~E = {e: e E} = * - EE – F = {e: e E and e F} = E & ~F\E = - E (any single character not in E)

is set of all letters; so * is set of all strings

600.465 - Intro to NLP - J. Eisner 9

Regular ExpressionsA language is a set of strings. It is a regular language if there exists an FSA that

accepts all the strings in the language, and no other strings.

If E and F denote regular languages, than so do EF, etc.

Regular expression: EF*|(F & G)+Syntax:

E F

*

F G

concat

&

+

| Semantics: Denotes a regular language. As usual, can build semantics compositionally bottom-up.E, F, G must be regular languages. As a base case, e denotes {e} (a language containing a single string), so ef*|(f&g)+ is regular.

600.465 - Intro to NLP - J. Eisner 10

Regular Expressionsfor Regular RelationsA language is a set of strings. It is a regular language if there exists an FSA that

accepts all the strings in the language, and no other strings.

If E and F denote regular languages, than so do EF, etc.

A relation is a set of pairs – here, pairs of strings.It is a regular relation if here exists an FST that accepts

all the pairs in the language, and no other pairs.If E and F denote regular relations, then so do EF, etc.

EF = {(ef,e’f’): (e,e’) E, (f,f’) F}Can you guess the definitions for E*, E+, E | F, E & F

when E and F are regular relations?Surprise: E & F isn’t necessarily regular in the case of relations; so not

supported.

600.465 - Intro to NLP - J. Eisner 11

Common Regular Expression Operators (in XFST notation)

concatenation EF* + iteration E*, E+| union E | F& intersection E & F~ \ - complementation, minus ~E, \x, F-

E.x. crossproduct E .x. F

E .x. F = {(e,f): e E, f F} Combines two regular languages into a regular relation.

600.465 - Intro to NLP - J. Eisner 12

Common Regular Expression Operators (in XFST notation)

concatenation EF* + iteration E*, E+| union E | F& intersection E & F~ \ - complementation, minus ~E, \x, F-E.x. crossproduct E .x. F.o. composition E .o. F

E .o. F = {(e,f): m. (e,m) E, (m,f) F} Composes two regular relations into a regular relation. As we’ve seen, this generalizes ordinary function

composition.

600.465 - Intro to NLP - J. Eisner 13

Common Regular Expression Operators (in XFST notation)

concatenation EF* + iteration E*, E+| union E | F& intersection E & F~ \ - complementation, minus ~E, \x,

F-E.x. crossproduct E .x. F.o. composition E .o. F.u upper (input) language E.u “domain”

E.u = {e: m. (e,m) E}

600.465 - Intro to NLP - J. Eisner 14

Common Regular Expression Operators (in XFST notation)

concatenation EF* + iteration E*, E+| union E | F& intersection E & F~ \ - complementation, minus ~E, \x, F-E.x. crossproduct E .x. F.o. composition E .o. F.u upper (input) language E.u “domain”

.l lower (output) language E.l “range”

Function from strings to ...

a:x/.5

c:z/.7

:y/.5.3

Acceptors (FSAs) Transducers (FSTs)

a:x

c:z

:y

a

c

Unweighted

Weighted a/.5

c/.7

/.5.3

{false, true} strings

numbers (string, num) pairs

600.465 - Intro to NLP - J. Eisner 16

How to implement?

concatenation EF* + iteration E*, E+| union E | F~ \ - complementation, minus ~E, \x, E-F& intersection E & F.x. crossproduct E .x. F.o. composition E .o. F.u upper (input) language E.u “domain”

.l lower (output) language E.l “range”

slide courtesy of L. Karttunen (modified)

600.465 - Intro to NLP - J. Eisner 17

Concatenation

==

example courtesy of M. Mohri

r

r

600.465 - Intro to NLP - J. Eisner 18

||

Union

==

example courtesy of M. Mohri

r

600.465 - Intro to NLP - J. Eisner 19

Closure (this example has outputs too)

==

**

example courtesy of M. Mohri

The loop creates (red machine)+ . Then we add a state to get do | (red machine)+ .Why do it this way? Why not just make state 0 final?

600.465 - Intro to NLP - J. Eisner 20

Upper language (domain)

.u.u

==

similarly construct lower language .lalso called input & output languages

example courtesy of M. Mohri

600.465 - Intro to NLP - J. Eisner 21

Reversal

.r.r

==

example courtesy of M. Mohri

600.465 - Intro to NLP - J. Eisner 22

Inversion

.i.i

==

example courtesy of M. Mohri

600.465 - Intro to NLP - J. Eisner 23

Complementation

Given a machine M, represent all strings not accepted by M

Just change final states to non-final and vice-versa

Works only if machine has been determinized and completed first (why?)

600.465 - Intro to NLP - J. Eisner 24

Intersectionexample adapted from M. Mohri

fat/0.5

10 2/0.8pig/0.3 eats/0

sleeps/0.6

fat/0.210 2/0.5

eats/0.6

sleeps/1.3

pig/0.4

&&

0,0fat/0.7

0,1 1,1pig/0.7

2,0/0.8

2,2/1.3

eats/0.6

sleeps/1.9

==

600.465 - Intro to NLP - J. Eisner 25

Intersectionfat/0.5

10 2/0.8pig/0.3 eats/0

sleeps/0.6

0,0fat/0.7

0,1 1,1pig/0.7

2,0/0.8

2,2/1.3

eats/0.6

sleeps/1.9

==

fat/0.210 2/0.5

eats/0.6

sleeps/1.3

pig/0.4

&&

Paths 0012 and 0110 both accept fat pig eats So must the new machine: along path 0,0 0,1 1,1 2,0

600.465 - Intro to NLP - J. Eisner 26

fat/0.5

fat/0.2

Intersection

10 2/0.5

10 2/0.8pig/0.3 eats/0

sleeps/0.6

eats/0.6

sleeps/1.3

pig/0.4

0,0fat/0.7

0,1==

&&

Paths 00 and 01 both accept fatSo must the new machine: along path 0,0 0,1

600.465 - Intro to NLP - J. Eisner 27

pig/0.3

pig/0.4

Intersection

fat/0.5

10 2/0.8eats/0

sleeps/0.6

fat/0.210 2/0.5

eats/0.6

sleeps/1.3

0,0fat/0.7

0,1pig/0.7

1,1==

&&

Paths 00 and 11 both accept pigSo must the new machine: along path 0,1 1,1

600.465 - Intro to NLP - J. Eisner 28

sleeps/0.6

sleeps/1.3

Intersection

fat/0.5

10 2/0.8pig/0.3 eats/0

fat/0.210

eats/0.6

pig/0.4

0,0fat/0.7

0,1 1,1pig/0.7

sleeps/1.9 2,2/1.3

2/0.5

==

&&

Paths 12 and 12 both accept fatSo must the new machine: along path 1,1 2,2

600.465 - Intro to NLP - J. Eisner 29

eats/0.6

eats/0

sleeps/0.6

sleeps/1.3

Intersection

fat/0.5

10 2/0.8pig/0.3

fat/0.210

pig/0.4

0,0fat/0.7

0,1 1,1pig/0.7

sleeps/1.9

2/0.5

2,2/1.3

eats/0.6 2,0/0.8

==

&&

600.465 - Intro to NLP - J. Eisner 30

What Composition Means

ab?d abcd

abed

abjd

3

2

6

4

2

8

...

f

g

600.465 - Intro to NLP - J. Eisner 31

What Composition Means

ab?d

...

Relation composition: f g

3+4

2+2

6+8

600.465 - Intro to NLP - J. Eisner 32

Relation = set of pairs

ab?d abcd

abed

abjd

3

2

6

4

2

8

...

f

g

does not contain any pair of the form abjd …

ab?d abcdab?d abedab?d abjd …

abcd abed abed …

600.465 - Intro to NLP - J. Eisner 33

Relation = set of pairs

ab?d abcdab?d abedab?d abjd …

abcd abed abed …

ab?d

4

2

...

8

ab?d ab?d ab?d …

f g = {xz: y (xy f and yz g)}where x, y, z are strings

f g

600.465 - Intro to NLP - J. Eisner 34

Wilbur:pink/0.7

Intersection vs. Composition

pig/0.310

pig/0.4

1 0,1pig/0.7

1,1==&&

Intersection

Composition

Wilbur:pig/0.310

pig:pink/0.4

1 0,1 1,1==.o..o.

600.465 - Intro to NLP - J. Eisner 35

Wilbur:gray/0.7

Intersection vs. Composition

pig/0.310

elephant/0.4

1 0,1pig/0.7

1,1==&&

Intersection mismatch

Composition mismatch

Wilbur:pig/0.310

elephant:gray/0.4

1 0,1 1,1==.o..o.

Composition example courtesy of M. Mohri

.o..o. ==

Composition

.o..o. ==

aa:b .o. b::b .o. b:bb = = aa::bb

Composition

.o..o. ==

aa:b .o. b::b .o. b:aa = = aa::aa

Composition

.o..o. ==

aa:b .o. b::b .o. b:aa = = aa::aa

Composition

.o..o. ==

bb:b .o. b::b .o. b:a a = = bb::aa

Composition

.o..o. ==

aa:b .o. b::b .o. b:aa = = aa::aa

Composition

.o..o. ==

aa:a .o. a::a .o. a:bb = = aa::bb

Composition

.o..o. ==

bb:b .o. a::b .o. a:bb = nothing = nothing(since intermediate symbol doesn’t (since intermediate symbol doesn’t match)match)

Composition

.o..o. ==

bb:b .o. b::b .o. b:aa = = bb::aa

Composition

.o..o. ==

aa:b .o. a::b .o. a:bb = = aa::bb

Composition in Dyna

600.465 - Intro to NLP - J. Eisner 46

start = &pair( start1, start2 ).

final(&pair(Q1,Q2)) :- final1(Q1), final2(Q2).

edge(U, L, &pair(Q1,Q2), &pair(R1,R2)) min= edge1(U, Mid, Q1, R1) + edge2(Mid, L, Q2, R2).

600.465 - Intro to NLP - J. Eisner 47

Relation = set of pairs

ab?d abcdab?d abedab?d abjd …

abcd abed abed …

ab?d

4

2

...

8

ab?d ab?d ab?d …

f g = {xz: y (xy f and yz g)}where x, y, z are strings

f g

600.465 - Intro to NLP - J. Eisner 48

3 Uses of Set Composition:

Feed string into Greek transducer: {abedabed} .o. Greek = {abed,abed} {abed} .o. Greek = {abed,abed} [{abed} .o. Greek].l = {}

Feed several strings in parallel: {abcd, abed} .o. Greek

= {abcd,abed,abed}

[{abcd,abed} .o. Greek].l = {,,} Filter result via No = {,,…}

{abcd,abed} .o. Greek .o. No = {abcd,abed}

600.465 - Intro to NLP - J. Eisner 49

What are the “basic” transducers?

The operations on the previous slides combine transducers into bigger ones

But where do we start?

a: for a :x for x

Q: Do we also need a:x? How about ?

a:

:x

600.465 - Intro to NLP - J. Eisner 50

Some Xerox Extensions

$ containment=> restriction-> @-> replacement

Make it easier to describe complex languages and relations without extending the formal power of finite-state systems.

slide courtesy of L. Karttunen (modified)

600.465 - Intro to NLP - J. Eisner 51

Containment

$[ab*c]$[ab*c]

““Must contain a substringMust contain a substringthat matches that matches ab*cab*c.”.”

Accepts Accepts xxxacyyxxxacyyRejects Rejects bcbabcba

?* [ab*c] ?* ?* [ab*c] ?*

Equivalent expression Equivalent expression

bb

cc

aa

a,b,c,?a,b,c,?

a,b,c,?a,b,c,?

Warning: ?? in regexps means “any character at all.”

But ?? in machines means “any character not explicitly

mentioned anywhere in the machine.”

600.465 - Intro to NLP - J. Eisner 52

Restriction

??cc

bb

bb

cc?? aa

cc

a => b _ ca => b _ c

““AnyAny aa must be preceded bymust be preceded by bband followed byand followed by cc.”.”

Accepts Accepts bacbbacdebacbbacdeRejects Rejects bacbacaa

~[~[~[?* b] a ?*~[?* b] a ?*]] & & ~[~[?* a ~[c ?*]?* a ~[c ?*]]] Equivalent expression Equivalent expression

slide courtesy of L. Karttunen (modified)

contains aa not preceded by bb

contains aa not followed by cc

600.465 - Intro to NLP - J. Eisner 53

Replacement

a:ba:b

bb

aa

??

??

b:ab:a

aa

a:ba:b

a b -> b a

““Replace ‘ab’ by ‘ba’.”Replace ‘ab’ by ‘ba’.”

Transduces Transduces abcdbaba to to bacdbbaa

[[~$[a b] ~$[a b] [[a b] .x. [b a]][[a b] .x. [b a]]]*]* ~$[a b]~$[a b] Equivalent expression Equivalent expression

slide courtesy of L. Karttunen (modified)

600.465 - Intro to NLP - J. Eisner 54

Replacement is Nondeterministic

a b -> b a | x

““Replace ‘ab’ by ‘ba’ or ‘x’, nondeterministically.”Replace ‘ab’ by ‘ba’ or ‘x’, nondeterministically.”

Transduces Transduces abcdbaba to { to {bacdbbaa,, bacdbxa,, xcdbbaa,, xcdbxa}}

600.465 - Intro to NLP - J. Eisner 55

Replacement is Nondeterministic

[ a b -> b a | x ] .o. [ x => _ c ]

““Replace ‘ab’ by ‘ba’ or ‘x’, nondeterministically.”Replace ‘ab’ by ‘ba’ or ‘x’, nondeterministically.”

Transduces Transduces abcdbaba to { to {bacdbbaa,, bacdbxa,, xcdbbaa,, xcdbxa}}

600.465 - Intro to NLP - J. Eisner 56

Replacement is Nondeterministic

a b | b | b a | a b a -> x

applied to “aba”

Four overlapping substrings match; we haven’t told it which one to replace so it chooses nondeterministically

a b a a b a a b a a b a

a x a a x x a x

slide courtesy of L. Karttunen (modified)

600.465 - Intro to NLP - J. Eisner 57

More Replace Operators

Optional replacement: a b (->) b a

Directed replacement guarantees a unique result by

constraining the factorization of the input string by Direction of the match (rightward or

leftward) Length (longest or shortest)

slide courtesy of L. Karttunen

600.465 - Intro to NLP - J. Eisner 58

@-> Left-to-right, Longest-match Replacement

a b | b | b a | a b a @-> x

applied to “aba”

a b a a b a a b a a b a

a x a a x x a x

slide courtesy of L. Karttunen

@-> left-to-right, longest match @> left-to-right, shortest match ->@ right-to-left, longest match >@ right-to-left, shortest match

600.465 - Intro to NLP - J. Eisner 59

Using “…” for marking

0:[0:[

[[

0:]0:]

??

aa

ee

iioo

uu]]

a|e|i|o|u -> [ ... ]

p o t a t op o t a t op[o]t[a]t[o]p[o]t[a]t[o]

Note: actually have to write as Note: actually have to write as -> %[ ... %] or or -> “[” ... “]”

since [] are parens in the regexp languagesince [] are parens in the regexp language

slide courtesy of L. Karttunen (modified)

600.465 - Intro to NLP - J. Eisner 60

Using “…” for marking

0:[0:[

[[

0:]0:]

??

aa

ee

iioo

uu]]

a|e|i|o|u -> [ ... ]

p o t a t op o t a t op[o]t[a]t[o]p[o]t[a]t[o]

Which way does the FST transduce potatoe?

slide courtesy of L. Karttunen (modified)

How would you change it to get the other answer?

p o t a t o ep o t a t o ep[o]t[a]t[o][e]p[o]t[a]t[o][e]

p o t a t o ep o t a t o ep[o]t[a]t[o e]p[o]t[a]t[o e]

vs.

600.465 - Intro to NLP - J. Eisner 61

Example: Finnish Syllabification

define C [ b | c | d | f ...define C [ b | c | d | f ...define V [ a | e | i | o | u ];define V [ a | e | i | o | u ];

s t r u k t u r a l i s m s t r u k t u r a l i s m iis t r u k - t u - r a - l i s - m s t r u k - t u - r a - l i s - m ii

[C* V+ C*] @-> ... "-" || _ [C V][C* V+ C*] @-> ... "-" || _ [C V]

““Insert a hyphen after the longest instance of theInsert a hyphen after the longest instance of the

C* V+ C*C* V+ C* pattern in front of a pattern in front of a C VC V pattern.” pattern.”

slide courtesy of L. Karttunen

why?why?

600.465 - Intro to NLP - J. Eisner 62

Conditional Replacement

The relation that replaces A by B between L and R leaving everything else unchanged.

A -> BA -> B

Replacement

L _ RL _ R

Context

Sources of complexity:

Replacements and contexts may overlap

Alternative ways of interpreting “between left and right.”

slide courtesy of L. Karttunen

600.465 - Intro to NLP - J. Eisner 63

Hand-Coded Example: Parsing Dates

Today is [Tuesday, July 25, 2000].

Today is Tuesday, [July 25, 2000].

Today is [Tuesday, July 25], 2000.

Today is Tuesday, [July 25], 2000.

Today is [Tuesday], July 25, 2000.

Best result

Bad results

Need left-to-right, longest-match constraints.

slide courtesy of L. Karttunen

600.465 - Intro to NLP - J. Eisner 64

Source code: Language of Dates

Day = Monday | Tuesday | ... | Sunday

Month = January | February | ... | December

Date = 1 | 2 | 3 | ... | 3 1

Year = %0To9 (%0To9 (%0To9 (%0To9))) - %0?* from 1 to 9999

AllDates = Day | (Day “, ”) Month “ ” Date (“, ” Year))

slide courtesy of L. Karttunen

600.465 - Intro to NLP - J. Eisner 65

actually represents 7 arcs, each labeled by a string

Object code: All Dates from 1/1/1 to 12/31/9999

, ,

FebJan

Mar

MayJunJul

Apr

Aug

OctNovDec

Sep

3

,

,

123456789

0123456789

0

123456789

0123456789

123456789

0

10

21

TueMon

Wed

FriSatSun

Thu 456789

MayJan Feb Mar Apr Jun

Jul Aug Oct Nov DecSep

13 states, 96 arcs13 states, 96 arcs29 760 007 date expressions29 760 007 date expressions

slide courtesy of L. Karttunen

600.465 - Intro to NLP - J. Eisner 66

Parser for Dates

AllDates @-> “[DT ” ... “]”Compiles into an

unambiguous transducer (23 states,

332 arcs).

Today is Today is [DT Tuesday, July 25, 2000][DT Tuesday, July 25, 2000] because yesterday because yesterday

was was [DT Monday][DT Monday] and it was and it was [DT July 24][DT July 24] so tomorrow must so tomorrow must

be be [DT Wednesday, July 26][DT Wednesday, July 26] and not and not [DT July 27][DT July 27] as it says as it says

on the program.on the program.

Xerox left-to-right replacement operator

slide courtesy of L. Karttunen (modified)

600.465 - Intro to NLP - J. Eisner 67

Problem of Reference

Valid dates

Tuesday, July 25, 2000

Tuesday, February 29, 2000

Monday, September 16, 1996Invalid dates

Wednesday, April 31, 1996

Thursday, February 29, 1900

Tuesday, July 26, 2000

slide courtesy of L. Karttunen

600.465 - Intro to NLP - J. Eisner 68

Refinement by IntersectionAllDatesAllDates

ValidValidDatesDates

WeekdayDateWeekdayDate

MaxDaysMaxDaysIn MonthIn Month

“ 31” => Jan|Mar|May|… _“ 30” => Jan|Mar|Apr|… _

LeapYearsLeapYears

Feb 29, => _ …

slide courtesy of L. Karttunen (modified)

Xerox contextual restriction operator

Q: Why do these rulesstart with spaces?(And is it enough?)

Q: Why does this ruleend with a comma?Q: Can we write the whole rule?

600.465 - Intro to NLP - J. Eisner 69

Defining Valid Dates

AllDates

&

MaxDaysInMonth

&

LeapYears

&

WeekdayDates

= ValidDates

AllDates: 13 states, 96 arcs29 760 007 date expressions

ValidDates: 805 states, 6472 arcs7 307 053 date expressions

slide courtesy of L. Karttunen

600.465 - Intro to NLP - J. Eisner 70

Parser for Valid and Invalid Dates

[AllDates - ValidDates] @-> “[ID ” ... “]”,

ValidDates @-> “[VD ” ... “]”

Today is [VD Tuesday, July 25, 2000],

not [ID Tuesday, July 26, 2000].

valid date

invalid date

2688 states,20439 arcs

slide courtesy of L. Karttunen

Comma creates a single FST that does left-to-right

longest match against either pattern