Finite-State Methods in Natural Language Processing
Lauri Karttunen
LSA 2005 Summer Institute
July 18, 2005
Course Outline

July 18: Intro to computational morphology; XFST
Readings:
  Lauri Karttunen, “Finite-State Constraints”, The Last Phonological Rule. J. Goldsmith (ed.), pages 173-194, University of Chicago Press, 1993.
  Karttunen and Beesley, “25 Years of Finite-State Morphology”
  Chapter 1: “Gentle Introduction” (B&K)

July 20: Regular expressions; more on XFST
Readings:
  Chapter 2: “Systematic Introduction”
  Chapter 3: “The XFST interface”

July 25: Concatenative morphotactics; constraining non-local dependencies
Readings:
  Chapter 4. “The LEXC Language”
  Chapter 5. “Flag Diacritics”

July 27: Non-concatenative morphotactics (reduplication, interdigitation)
Readings:
  Chapter 8. “Non-Concatenative Morphotactics”

August 1: Realizational morphology
Readings:
  Gregory T. Stump. Inflectional Morphology. A Theory of Paradigm Structure. Cambridge U. Press. 2001. (An excerpt)
  Lauri Karttunen, “Computing with Realizational Morphology”, Lecture Notes in Computer Science, Volume 2588, Alexander Gelbukh (ed.), 205-216, Springer Verlag. 2003.

August 3: Optimality theory
Readings:
  Paul Kiparsky, “Finnish Noun Inflection”, Generative Approaches to Finnic and Saami Linguistics, Diane Nelson and Satu Manninen (eds.), pp. 109-161, CSLI Publications, 2003.
  Nine Elenbaas and René Kager. “Ternary rhythm and the lapse constraint”. Phonology 16. 273-329.
Getting credit for LSA 207
There will be three assignments, one given each Wednesday. The first two are to be turned in by the following Monday, the last one by the following Friday.
You will get credit for the course if you solve at least two of the three assignments. The solutions will involve programming in the xfst scripting language. The problems will be easy to solve if you have attended the class.
If you have any problems in doing the assignments, Michael Wagner and I will be happy to help you.
Textbook
Copies will arrive in the Linguistics Department tomorrow afternoon.
You can purchase a copy there tomorrow as soon as the books have arrived.
Starting Wednesday, books can be purchased from our TA, Michael Wagner.
The price is $35.
With the book comes a software CD for Solaris, Linux, MacOSX and Windows operating systems.
LSA 207 Web site
http://lsa.dlp.mit.edu/Class/207
You can use this username and password to access materials:
Username: LSA207
Password: seunsehi207
You are free to copy, modify and use the slides for whatever purpose provided that you give appropriate credit to the original source.
The readings for Wednesday’s class (“Finite-State Constraints”, “25 Years of Finite-State Morphology” and “Gentle Introduction”, Chapter 1 of the B&K book) are posted on the web site.
Software
The software on the Book CD dates back to the Spring of 2003. For an update, point your browser to http://www.stanford.edu/~laurik/.lsa207/
Please read the README file and the License Agreement before downloading the software.
The updated software supports UTF-8 encoded Unicode input/output. The Book version supports only Latin-1 (ISO-8859-1).
The XFST application will be available locally on some computers (ask Michael).
Check out the web site for the Book: http://www.fsmbook.com/
Finite-State Methods in NLP

Domains of Application
  Tokenization
  Sentence breaking
  Spelling correction
  Morphology (analysis/generation)
  Phonological disambiguation (Speech Recognition)
  Morphological disambiguation (“Tagging”)
  Pattern matching (“Named Entity Recognition”)
  Shallow Parsing

Types of Finite-State Systems
  Classical (non-weighted) automata
  Weighted (associated with weights in a semi-ring)
  Binary relations (simple transducers)
  N-ary relations (multi-tape transducers)
Computational morphology

Analysis
  leaves
    leaf N Pl
    leave N Pl
    leave V Sg3

Generation
  hang V Past
    hanged
    hung
Two challenges

Morphotactics
Words are composed of smaller elements that must be combined in a certain order:
  piti-less-ness is English
  piti-ness-less is not English

Phonological alternations
The shape of an element may vary depending on the context:
  pity is realized as piti in pitilessness
  die becomes dy in dying
Morphology is regular (=rational)
The relation between the surface forms of a language and the corresponding lexical forms can be described as a regular relation.
A regular relation consists of ordered pairs of strings.
  leaf+N+Pl : leaves
  hang+V+Past : hung
Any finite collection of such pairs is a regular relation.
Regular relations are closed under operations such as concatenation, iteration, union, and composition.
Complex regular relations can be derived from simple relations.
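The closure properties above can be tried out on small finite relations. The following is a minimal Python sketch, assuming that a relation is represented as a plain set of (lexical, surface) string pairs. The helper names `union`, `concat` and `compose`, and the intermediate forms `leafs` and `hanged`, are illustrative assumptions and not part of any finite-state toolkit.

```python
# A finite relation is modeled as a set of (upper, lower) string pairs.
# All helper names here are illustrative assumptions, not a toolkit API.

def union(r, s):
    """Pairs that belong to either relation."""
    return r | s

def concat(r, s):
    """Concatenate every pair of r with every pair of s, side by side."""
    return {(u1 + u2, l1 + l2) for (u1, l1) in r for (u2, l2) in s}

def compose(r, s):
    """Map x to z whenever r maps x to y and s maps y to z."""
    return {(x, z) for (x, y1) in r for (y2, z) in s if y1 == y2}

# Simple relations: stems map to themselves, tags map to suffixes.
noun_plural = {("+N+Pl", "s")}
verb_past = {("+V+Past", "ed")}

# A more complex relation derived by concatenation and union.
regular = union(concat({("leaf", "leaf")}, noun_plural),
                concat({("hang", "hang")}, verb_past))

# A second relation from (assumed) intermediate forms to surface forms.
repair = {("leafs", "leaves"), ("hanged", "hung"), ("hanged", "hanged")}

lexical_to_surface = compose(regular, repair)
print(sorted(lexical_to_surface))
```

Each result is again a finite set of pairs, so it is again a regular relation. Real systems represent such relations as networks rather than as enumerated sets, which is what makes infinite relations tractable.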
Morphology is finite-state
A regular relation can be defined using the metalanguage of regular expressions.
[{talk} | {walk} | {work}]
[%+Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:{ed}];
A regular expression can be compiled into a finite-state transducer that implements the relation computationally.
Compilation

Regular expression:
[{talk} | {walk} | {work}]
[%+Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:{ed}];

[Figure: the finite-state transducer compiled from the regular expression. It has an initial state, a final state, arcs for the stem letters t, a, l, k, w, o, r, and suffix arcs +Base:0, +3rdSg:s, +Past:e 0:d, +Progr:i 0:n 0:g.]

Generation

work+3rdSg --> works

[Figure: the same transducer with identity arcs t:t, a:a, l:l, k:k, w:w, o:o, r:r, showing the path that maps work+3rdSg to works.]

Analysis

talked --> talk+Past

[Figure: the same transducer, showing the path that maps talked to talk+Past.]
XFST Demo 1

Start xfst:
% xfst
xfst[0]:

Compile a regular expression:
xfst[0]: regex
[{talk} | {walk} | {work}]
[%+Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:{ed}];

Apply the result:
xfst[1]: apply up walked
walk+Past
xfst[1]: apply down talk+SgGen3
talks
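The two directions of application in the demo can be imitated with a hand-built network. This is a rough Python sketch, not xfst's implementation: the arc layout, the function names `build_network` and `apply`, and the use of the empty string for epsilon are all assumptions made for illustration, and the network is not minimized.

```python
# Arcs are (source, upper, lower, target); "" stands for epsilon (xfst writes 0).
EPS = ""

def build_network():
    """Hand-build a transducer for the three stems and four suffixes."""
    arcs = []
    state_count = 1                      # state 0 is the initial state

    def new_state():
        nonlocal state_count
        state_count += 1
        return state_count - 1

    stem_end = new_state()
    final = new_state()

    def add_path(src, pairs, dst):
        """Add a chain of arcs from src to dst, one arc per symbol pair."""
        for i, (up, lo) in enumerate(pairs):
            nxt = dst if i == len(pairs) - 1 else new_state()
            arcs.append((src, up, lo, nxt))
            src = nxt

    for stem in ("talk", "walk", "work"):
        add_path(0, [(c, c) for c in stem], stem_end)
    add_path(stem_end, [("+Base", EPS)], final)
    add_path(stem_end, [("+SgGen3", "s")], final)
    add_path(stem_end, [("+Progr", "i"), (EPS, "n"), (EPS, "g")], final)
    add_path(stem_end, [("+Past", "e"), (EPS, "d")], final)
    return arcs, final

def apply(arcs, final, symbols, side):
    """Collect the other side of every path whose chosen side spells `symbols`.

    side=0 matches the upper side (apply down);
    side=1 matches the lower side (apply up).
    """
    results = set()

    def walk(state, pos, out):
        if state == final and pos == len(symbols):
            results.add("".join(out))
        for src, up, lo, dst in arcs:
            if src != state:
                continue
            match, emit = (up, lo) if side == 0 else (lo, up)
            if match == EPS:
                walk(dst, pos, out + [emit])        # consume no input
            elif pos < len(symbols) and symbols[pos] == match:
                walk(dst, pos + 1, out + [emit])

    walk(0, 0, [])
    return results

arcs, final = build_network()
print(apply(arcs, final, list("walked"), side=1))              # {'walk+Past'}
print(apply(arcs, final, list("talk") + ["+SgGen3"], side=0))  # {'talks'}
```

The input is passed as a list of symbols because tags such as +SgGen3 are single multicharacter symbols, not sequences of letters.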
Lexical transducer

[Figure: a finite-state transducer relating the inflected form veut to the citation form plus inflection codes vouloir +IndP +SG +P3. The two sides are shown as aligned symbol sequences: v o u l o i r +IndP +SG +P3 above, v e u t below.]

Bidirectional: generation or analysis
Compact and fast
Comprehensive systems have been built for over 40 languages: English, German, Dutch, French, Italian, Spanish, Portuguese, Finnish, Russian, Turkish, Japanese, Korean, Basque, Greek, Arabic, Hebrew, Bulgarian, …
How lexical transducers are made

[Figure: a lexicon regular expression (morphotactics) and rule regular expressions (alternations) go through the compiler, producing a lexicon FST and rule FSTs. Composition of these yields the lexical transducer (a single FST). Example path: lexical side f a t +Adj +Comp, surface side f a t t e r.]
Sequential Model

An ordered sequence of rewrite rules (Chomsky & Halle ‘68) can be modeled by a cascade of finite-state transducers (Johnson ‘72, Kaplan & Kay ‘81).

[Figure: the lexical form is mapped by fst 1 to an intermediate form, which is mapped by fst 2 to the next intermediate form, and so on until fst n produces the surface form.]
Discovery and Rediscovery
C. Douglas Johnson (1972) showed that
– phonological rewrite rules are interpreted in a way that makes them less powerful than they appear
– rewrite rules can be modeled by finite transducers
– for any two finite transducers applied in a sequence there exists an equivalent single transducer (Schützenberger 1961).
Johnson’s result was ignored and forgotten, then rediscovered by Ronald M. Kaplan and Martin Kay at Xerox around 1980.
Application constraint
Phonological rewrite rules are not as powerful as they appear because of the constraint that a rule does not apply to its own output (Johnson 1972, Kaplan & Kay 1980).
Sequential application

k a N p a n
    N -> m / _ p
k a m p a n
    p -> m / m _
k a m m a n
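The two-step derivation can be replayed with ordinary string rewriting. This is a minimal Python sketch that uses the standard `re` module with lookahead and lookbehind to express the rule contexts; the function names are illustrative, and the code stands in for the rules only on this example, not for a general rewrite-rule compiler.

```python
import re

def rule_n_to_m(s):
    """N -> m / _ p : rewrite N as m when p follows."""
    return re.sub(r"N(?=p)", "m", s)

def rule_p_to_m(s):
    """p -> m / m _ : rewrite p as m when m precedes."""
    return re.sub(r"(?<=m)p", "m", s)

lexical = "kaNpan"
intermediate = rule_n_to_m(lexical)      # kampan
surface = rule_p_to_m(intermediate)      # kamman
print(lexical, "->", intermediate, "->", surface)

# Order matters: applied first, the second rule finds no "mp" to act on.
print(rule_n_to_m(rule_p_to_m(lexical)))  # kampan
```

The first rule creates the context that the second rule needs, which is why the rules have to be applied in this order.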
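Sequential application in detail

[Figure: two transducers. The first, for N -> m / _ p, has states 0, 1, 2 and arcs labeled ?, N, N:m, m, p. The second, for p -> m / m _, has states 0, 1 and arcs labeled ?, m, p, p:m.]

k a N p a n
0 0 0 2 0 0 0    (states of the first transducer)
k a m p a n
0 0 0 1 0 0 0    (states of the second transducer)
k a m m a n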
Composition

[Figure: the single transducer obtained by composing the two rule transducers. It has states 0, 1, 2, 3 and arcs labeled ?, N, N:m, m, p, p:m.]

k a N p a n
0 0 0 3 0 0 0
k a m m a n
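Composition of two rule transducers into one can be sketched with a product construction. The Python below is a simplified illustration under stated assumptions: the alphabet is restricted to the six symbols k, a, N, p, m, n instead of using the "other" symbol ?, both transducers are epsilon-free, the arc tables are hand-written from the two rules rather than compiled, and the names `RULE1`, `RULE2`, `compose` and `apply_down` are hypothetical.

```python
# Each transducer: start state, final states, and arcs[state] = [(in, out, next)].
# Assumption: a closed toy alphabet {k, a, N, p, m, n}, no epsilon arcs.

# Rule 1: N -> m / _ p. State 2 (non-final) demands a following p;
# state 1 is reached after an unchanged N and forbids a following p.
RULE1 = {
    "start": 0, "finals": {0, 1},
    "arcs": {
        0: [(c, c, 0) for c in "kapmn"] + [("N", "N", 1), ("N", "m", 2)],
        1: [(c, c, 0) for c in "kamn"] + [("N", "N", 1), ("N", "m", 2)],
        2: [("p", "p", 0)],
    },
}

# Rule 2: p -> m / m _. State 1 is reached after an m; there p must become m.
RULE2 = {
    "start": 0, "finals": {0, 1},
    "arcs": {
        0: [(c, c, 0) for c in "kaNpn"] + [("m", "m", 1)],
        1: [(c, c, 0) for c in "kaNn"] + [("m", "m", 1), ("p", "m", 0)],
    },
}

def compose(f, g):
    """Product construction: pair an arc a:b of f with an arc b:c of g to get a:c."""
    start = (f["start"], g["start"])
    arcs, finals, todo = {}, set(), [start]
    while todo:
        state = todo.pop()
        if state in arcs:
            continue
        p, q = state
        if p in f["finals"] and q in g["finals"]:
            finals.add(state)
        arcs[state] = [(a, c, (p2, q2))
                       for (a, b, p2) in f["arcs"][p]
                       for (b2, c, q2) in g["arcs"][q] if b == b2]
        todo.extend(target for (_, _, target) in arcs[state])
    return {"start": start, "finals": finals, "arcs": arcs}

def apply_down(fst, word):
    """Return every output string for `word` along accepting paths."""
    results = set()

    def walk(state, pos, out):
        if pos == len(word):
            if state in fst["finals"]:
                results.add(out)
            return
        for (a, b, nxt) in fst["arcs"][state]:
            if a == word[pos]:
                walk(nxt, pos + 1, out + b)

    walk(fst["start"], 0, "")
    return results

single = compose(RULE1, RULE2)
print(apply_down(RULE1, "kaNpan"))   # {'kampan'}
print(apply_down(RULE2, "kampan"))   # {'kamman'}
print(apply_down(single, "kaNpan"))  # {'kamman'}
```

The composed transducer maps the lexical form to the surface form in one pass, so the intermediate form never has to be built.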
Parallel Model

A set of parallel two-level rules (constraints) compiled into finite-state automata interpreted as transducers (Koskenniemi ‘83).

[Figure: fst 1, fst 2, ..., fst n apply in parallel between the lexical form and the surface form.]
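Sequential vs. parallel rules

[Figure, two panels. Chomsky & Halle 1968: rule 1 maps the lexical form to an intermediate form, and further rules apply in sequence until rule n produces the surface form; the rules are combined into a single FST by composition. Koskenniemi 1983: rule 1, rule 2, ..., rule n apply in parallel between the lexical form and the surface form; the rules are combined into a single FST by intersection.]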
Rewrite rules

? u: t y ? A s
    Epenthesis
? u: t I y ? A s
    Harmony
? u: t u y ? a s
    Lowering
? o: t u y ? a s

Yawelmani Vowel Harmony (Kisseberth 1969)
Two-level constraints

? u: t 0 y ? A s
? o: t u y ? a s

Underlying representation controls all three alternations.
Epenthesis: Insert u or i (underspecification)
Harmony: Rounding next to a round V of the same height.
Lowering: Long u always realized as long o.
Rewrite Rules vs. Constraints
• Two different ways of decomposing the complex relation between lexical and surface forms into a set of simpler relations that can be more easily understood and manipulated.
• One approach may be more convenient than the other for particular applications.
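The Big Picture

[Figure: a triangle relating three objects. A regular expression describes a language or relation. The regular expression compiles into a finite-state network. The network encodes the language or relation. Example: the expression a, the language {a}, and a network with a single arc labeled a.]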
XFST Demo 2

xfst[0]: define Cat {cat} | {tiger} | {lion};
defined Cat: 640 bytes. 11 states, 12 arcs, 3 paths.
...
xfst[0]: set verbose off
xfst[0]: define Dog {dog} | {spaniel} | {poodle};
xfst[0]: regex Cat | Dog ;
xfst[1]: apply up
apply up> dog
dog
apply up> panther
apply up>
apply up> END;
xfst[1]: define Animal
xfst[0]: regex Cat & Dog;
xfst[1]: print net
Sigma: a c d e g i l n o p r s t
Size: 13, Label Map: Default
Net:
Flags: deterministic, pruned, minimized, epsilon_free, ...
s0: (no arcs)
xfst[1]: pop
xfst[0]: regex Animal - Dog;
xfst[1]: push Cat
xfst[2]: test equivalent
1, (0=NO,1=YES)
xfst[2]: clear
xfst[0]:
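Because the languages in this demo are finite, the same operations can be checked with ordinary Python sets. This sketch only mirrors the logic of the session and says nothing about how xfst represents networks; the variable names are illustrative.

```python
# Finite languages modeled as Python sets of strings (an illustration only).
cat = {"cat", "tiger", "lion"}
dog = {"dog", "spaniel", "poodle"}

animal = cat | dog                 # regex Cat | Dog
print("dog" in animal)             # True: "apply up dog" finds the word
print("panther" in animal)         # False: "apply up panther" finds nothing

print(cat & dog)                   # set(): Cat & Dog is the empty language
print(animal - dog == cat)         # True: Animal - Dog is equivalent to Cat
```

The empty intersection corresponds to the network printed in the session, which has a single non-final state and no arcs.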
Compiling networks from words

xfst[0]: read text
clear
clever
ear
ever
fat
father
^D
432 bytes. 10 states, 12 arcs, 6 paths.

read text < file

read regex {clear}|{clever}|{ear}|{ever}|{fat}|{father} ;

[Figure: the resulting network, in which the six words share arcs for common prefixes and suffixes.]
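The state and arc counts reported for the word list can be reproduced with a small experiment. The sketch below builds a trie and then merges states that accept the same set of continuations, which is one standard way to minimize an acyclic automaton. It is an assumption that this matches what xfst does internally; the function names are hypothetical.

```python
def build_trie(words):
    """One state per distinct prefix; a state is a dict with a final flag and arcs."""
    root = {"final": False, "arcs": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["arcs"].setdefault(ch, {"final": False, "arcs": {}})
        node["final"] = True
    return root

def minimize(node, registry):
    """Bottom-up merge of states with identical finality and outgoing arcs."""
    for ch, child in list(node["arcs"].items()):
        node["arcs"][ch] = minimize(child, registry)
    # Children are already canonical, so their identities can serve as keys.
    key = (node["final"],
           tuple(sorted((ch, id(t)) for ch, t in node["arcs"].items())))
    return registry.setdefault(key, node)

def count_states_and_arcs(root):
    """Count the distinct states reachable from root and their outgoing arcs."""
    seen, stack = {}, [root]
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen[id(node)] = node
            stack.extend(node["arcs"].values())
    return len(seen), sum(len(n["arcs"]) for n in seen.values())

def count_paths(node):
    """Number of accepted words reachable from this state."""
    return int(node["final"]) + sum(count_paths(t) for t in node["arcs"].values())

words = ["clear", "clever", "ear", "ever", "fat", "father"]
network = minimize(build_trie(words), {})
states, arcs = count_states_and_arcs(network)
print(states, "states,", arcs, "arcs,", count_paths(network), "paths")
```

For this word list the merged network has 10 states, 12 arcs and 6 paths, the same figures as in the session above.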
Regular Expression Calculus

Symbols
  Simple symbols vs. symbol pairs
  Special symbols: ANY, EPSILON

Common regular expression operators
  concatenation, union, intersection, negation, composition

Xerox operators
  contains, restriction, replacement
Symbols and Labels

Single and multicharacter symbols
  a, b, c, … , +Adj, +SG, ^Fin

Special symbols
  0    EPSILON
  ?    ANY

Symbols vs. symbol pairs
In general, no distinction is made between
  a      the language {“a”}
  a:a    the identity relation {<“a”, “a”>}

[Figure: a network with a single arc labeled a.]
Common RE Operators

           concatenation
  * +      iteration
  |        union
  &        intersection*
  ~ \ -    complementation*, minus*
  .x. :    crossproduct
  .o.      composition

* = not applicable to regular relations because the result may not be encodable by a finite-state network.
Iteration

A*    zero or more concatenations of A
A+    one or more concatenations of A
?*    the universal language/the universal identity relation

[Figure: a one-state network with a loop labeled ?.]

[a:A | b:B | c:C | d:D | … ]*

[Figure: a one-state network with loops labeled a:A, b:B, c:C, d:D.]
Negation

\A    any single symbol that is not in A
\?    the null language
~A    any string that is not in A

[Figure: networks for a, for \a (Sigma: a, ?), and for ~a, with arcs labeled a and ?.]
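The difference between the two kinds of negation is easy to see on a toy example. This Python sketch assumes a two-symbol alphabet and cuts the universal language off at length 2 so that it can be enumerated; both choices are arbitrary simplifications, since the real ~A is an infinite language.

```python
from itertools import product

SIGMA = ("a", "b")   # assumed toy alphabet
MAX_LEN = 2          # assumed cut-off so the universe is finite

def all_strings(sigma, max_len):
    """Every string over sigma of length 0..max_len."""
    return {"".join(p) for n in range(max_len + 1) for p in product(sigma, repeat=n)}

universe = all_strings(SIGMA, MAX_LEN)
A = {"a"}

term_complement = {s for s in SIGMA if s not in A}   # \a : single symbols only
language_complement = universe - A                   # ~a : strings of any length

print(sorted(term_complement))       # ['b']
print(sorted(language_complement))   # ['', 'aa', 'ab', 'b', 'ba', 'bb']
```

The term complement contains only one-symbol strings, while the language complement also contains the empty string and every longer string, including those that contain a.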
Crossproduct

A .x. B    The relation that maps every string in A to every string in B, and vice versa
A:B        Same as [A .x. B].

Three equivalent expressions:
  a b c .x. x y
  [a b c] : [x y]
  {abc}:{xy}

[Figure: the resulting network, a path with arcs a:x, b:y, c:0.]
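A short Python sketch can show both readings of the crossproduct: as a set of string pairs, and as the symbol-by-symbol path a:x b:y c:0. The left-to-right alignment with epsilon padding at the end is an assumption chosen to match that path, and the function names are illustrative.

```python
from itertools import product, zip_longest

def crossproduct(upper_language, lower_language):
    """Pair every string of the upper language with every string of the lower one."""
    return set(product(upper_language, lower_language))

def align(upper, lower, epsilon="0"):
    """Pair symbols left to right, padding the shorter string with epsilon."""
    return [f"{u}:{l}" for u, l in zip_longest(upper, lower, fillvalue=epsilon)]

print(crossproduct({"abc"}, {"xy"}))            # {('abc', 'xy')}
print(align("abc", "xy"))                       # ['a:x', 'b:y', 'c:0']
print(len(crossproduct({"a", "b"}, {"x", "y"})))  # 4
```

With more than one string on either side, every upper string is paired with every lower string, which is what "every string in A to every string in B" means.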
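Composition

A .o. B    The relation C such that if A maps x to y and B maps y to z, C maps x to z.

{abc} .o. [a:A | b:B | c:C | d:D]*

[Figure: the network for {abc}, a path with arcs a, b, c, is composed with a one-state network with loops a:A, b:B, c:C, d:D. The result is a path with arcs a:A, b:B, c:C.]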
Xerox RE Operators

  $         containment
  =>        restriction
  -> @->    replacement

Make it easier to describe complex languages and relations without extending the formal power of finite-state systems.
Containment

$a    same as    [?* a ?*]

[Figure: a two-state network with arcs labeled a and ?.]
Restriction

a => b _ c

“Any a must be preceded by b and followed by c.”

Equivalent expression:
~[~[?* b] a ?*] & ~[?* a ~[c ?*]]

[Figure: the corresponding network, with arcs labeled ?, a, b, c.]
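The equivalence between the restriction and the expression built from complements can be checked by brute force on short strings. This Python sketch is an illustration under assumptions: the alphabet is limited to a, b, c, x, strings are enumerated up to length 5, and the two inner languages [~[?* b] a ?*] and [?* a ~[c ?*]] are translated by hand into Python `re` patterns.

```python
import re
from itertools import product

def satisfies_restriction(s):
    """a => b _ c : every a is immediately preceded by b and followed by c."""
    return all(s[i - 1:i] == "b" and s[i + 1:i + 2] == "c"
               for i, ch in enumerate(s) if ch == "a")

# [~[?* b] a ?*] : some a whose left side does not end in b.
BAD_LEFT = re.compile(r"(?:|.*[^b])a.*")
# [?* a ~[c ?*]] : some a whose right side does not start with c.
BAD_RIGHT = re.compile(r".*a(?:|[^c].*)")

def satisfies_equivalent(s):
    """~[~[?* b] a ?*] & ~[?* a ~[c ?*]] : the string is in neither bad language."""
    return not BAD_LEFT.fullmatch(s) and not BAD_RIGHT.fullmatch(s)

# Compare the two definitions on every string up to the assumed length bound.
ALPHABET = "abcx"
mismatches = [
    "".join(p)
    for n in range(6)
    for p in product(ALPHABET, repeat=n)
    if satisfies_restriction("".join(p)) != satisfies_equivalent("".join(p))
]
print(len(mismatches))
```

An empty list of mismatches is evidence for the equivalence on this sample, not a proof for all strings.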