Finite State Machinery - I Fundamentals Recognisers and Transducers.
Finite State Machinery - I
• Fundamentals
• Recognisers and Transducers
Reference Outline
• Websites
– Xerox: www.xrce.xerox.com/research/mltt/fst/
– Groningen: grid.let.rug.nl/~vannoord/FSA/fsa.html
– AT&T: www.research.att.com/sw/tools/fsm
• Books/Collections
– Karttunen & Oflazer (2000)
– Jurafsky & Martin (2000)
– Hopcroft and Ullman (1979)
– Roche and Schabes (1997)
• Classic Articles
– Kaplan and Kay (1994)
– Koskenniemi (1983)
– Johnson (1972)
• Tools
– Van Noord et al.
– Mohri et al.
– Daciuk
– Karttunen & Beesley
Acknowledgements to
• Lauri Karttunen, Ken Beesley and colleagues at Xerox.
• Most materials in this tutorial are from their website.
• Forthcoming book: Finite State Morphology – Xerox Tools and Techniques.
FS Motivation
• Chomsky hierarchy of language classes is based on classes of descriptive notation, and also on associated classes of machine.
• Chomsky (1957) dismissed FS grammars, and associated machinery, as fundamentally inadequate for the description of NL.
Embedding
• The basic problem is not that sentences can grow to arbitrary length; it is that the description of a syntactic constituent may embed other constituents, including the sentence itself.
The dog bit the cat.
The dog that the man saw bit the cat.
The dog that the man that the horse kicked saw bit the cat.
etc
On the other hand …
• Plenty of language just ain't like that.
• Words
– Orthographic spelling.
– Phonological spelling.
– Morphology.
• Fixed expression types (e.g. dates).
• Gross constituent structures (e.g. the big, bad, blue wolf).
Recent Application Areas for FS Technology Include
• POS Tagging
• Spell Checking
• Information Extraction
• Speech Recognition
• Text to Speech
• Spoken Dialogue
• Parsing
Recognition of Italian Words
• The coke machine recognises words in the coke machine language.
• The following machine recognises two words in Italian.
• Recognition mechanism is language independent.
[Figure: recogniser network with arc labels C A S A and I N Q U E]
The Process of Analysis
• Start in the initial state and at the first symbol of the word.
• If there is an arc labelled with that symbol, the machine transitions to the next state, and the symbol is consumed.
• The process continues with successive symbols until .....
The Process of Analysis
One or more of these conditions holds:
• A. A final state is reached.
• B. All symbols are consumed.
• C. There are no transitions out of a state for the current symbol.
– If both A and B, analysis succeeds and the word is recognised.
– Otherwise recognition fails.
Success and Failure
[Figure: recogniser network with arc labels C A S A, I N Q U E and L E N T E]
LE; CASA; CINQUANTA; LENTEMENTE
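The recognition procedure and its success conditions can be sketched in a few lines of Python. This is a minimal illustration, not the Xerox implementation. The dictionary encoding, the function names and the trie-shaped construction are assumptions made here. So is the word list CASA, CINQUE, LENTE, which is a guess at the network in the figure. The four test words are the ones listed above.

```python
def build_network(words):
    """Build a trie-shaped network; returns (transitions, final_states).

    transitions maps state -> {symbol: next_state}; state 0 is the start state.
    (Hypothetical encoding chosen for this sketch.)
    """
    transitions = {0: {}}
    finals = set()
    for word in words:
        state = 0
        for symbol in word:
            if symbol not in transitions[state]:
                new_state = len(transitions)
                transitions[new_state] = {}
                transitions[state][symbol] = new_state
            state = transitions[state][symbol]
        finals.add(state)
    return transitions, finals


def recognise(word, transitions, finals, start=0):
    """Return True if the word is recognised, False otherwise."""
    state = start
    for symbol in word:
        arcs = transitions[state]
        if symbol not in arcs:
            return False          # condition C: no transition for this symbol
        state = arcs[symbol]      # take the arc; the symbol is consumed
    # Condition B holds here; succeed only if condition A holds as well.
    return state in finals


# Assumed network: the words below are a guess at the figure's contents.
transitions, finals = build_network(["CASA", "CINQUE", "LENTE"])
for word in ["LE", "CASA", "CINQUANTA", "LENTEMENTE"]:
    print(word, recognise(word, transitions, finals))
```

Under these assumptions only CASA is recognised. LE runs out of input in a non-final state, while CINQUANTA and LENTEMENTE reach a state with no arc for the current symbol.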
Transducers
• Recognisers either accept or reject a word.
• Although this is useful, networks can actually return more substantial information.
• This is achieved by providing networks with the ability to write as well as to read.
Basic Transducer
• Each transition of a transducer is labelled with a pair of symbols rather than with a single symbol.
• Analysis proceeds as before, except that input symbols are matched against the lower-side symbols on transitions.
• If analysis succeeds, return the string of upper-side symbols on the path to the final state.
Confusing Terminology
• Lower side = surface side.
• Upper side = "deep" side.
• Analysis proceeds from lower to upper.
• Synthesis (generation) proceeds from upper to lower.
Lexical Transducers
• In common parlance, a transducer is a device which converts one form of energy into another, e.g. a microphone converts from sound to electrical signals.
• Next we look at lexical transducers which convert one string of symbols into another.
Lexical Transducer Example
[Figure: transducer path with upper side C A S A (lexical string) and lower side C A S E (surface string)]
• Input: CASE
• Output: CASA
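The CASE to CASA example can be sketched as follows. The arc encoding `(upper, lower, next_state)`, the state numbers and the function name `analyse` are assumptions made for this illustration. The sketch also assumes that at most one arc per state matches a given lower-side symbol.

```python
# state -> list of (upper, lower, next_state); hypothetical encoding.
ARCS = {
    0: [("C", "C", 1)],
    1: [("A", "A", 2)],
    2: [("S", "S", 3)],
    3: [("A", "E", 4)],   # upper A paired with lower E
    4: [],
}
FINALS = {4}


def analyse(surface, arcs=ARCS, finals=FINALS, start=0):
    """Match input against lower-side symbols; return the upper-side string or None."""
    state, output = start, []
    for symbol in surface:
        for upper, lower, next_state in arcs[state]:
            if lower == symbol:
                output.append(upper)   # write the upper-side symbol
                state = next_state     # read (consume) the lower-side symbol
                break
        else:
            return None                # no arc for this symbol: analysis fails
    return "".join(output) if state in finals else None


print(analyse("CASE"))   # CASA
print(analyse("CASA"))   # None: the last arc reads E, not A, on the lower side
```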
Morphological Analysis
[Figure: transducer network mapping lower-side CONTO to upper-side CONTARE +V +1P +SG]
• Input: CONTO
• Output: CONTARE +V +1P +SG
Remarks
• ε stands for "epsilon". During analysis, epsilon transitions are taken freely without consuming any input.
• Note also single symbols with multi-character print names (e.g. +SG).
• The order of these symbols, and the choice of infinitive as baseform, is determined by linguists.
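A sketch of analysis with epsilon transitions and multi-character symbols, using the CONTO example, might look like this. The exact alignment of upper and lower symbols is not recoverable from the slide, so the one used here (A paired with O, every later upper symbol paired with epsilon) is an assumption. So are the encoding and the names `analyse` and `show`. The recursion assumes the network has no cycle of epsilon arcs.

```python
EPS = ""  # epsilon, encoded as the empty string (a choice made for this sketch)

# Assumed alignment of upper:lower pairs along a single path.
PATH = [("C", "C"), ("O", "O"), ("N", "N"), ("T", "T"), ("A", "O"),
        ("R", EPS), ("E", EPS), ("+V", EPS), ("+1P", EPS), ("+SG", EPS)]
ARCS = {i: [(upper, lower, i + 1)] for i, (upper, lower) in enumerate(PATH)}
ARCS[len(PATH)] = []
FINALS = {len(PATH)}


def analyse(surface, state=0, pos=0, output=()):
    """Yield each upper-side symbol sequence whose lower side spells `surface`."""
    if pos == len(surface) and state in FINALS:
        yield list(output)
    for upper, lower, next_state in ARCS[state]:
        out = output if upper == EPS else output + (upper,)
        if lower == EPS:
            # Epsilon on the lower side: follow the arc without consuming input.
            yield from analyse(surface, next_state, pos, out)
        elif pos < len(surface) and surface[pos] == lower:
            yield from analyse(surface, next_state, pos + 1, out)


def show(symbols):
    """Print tags such as +SG as single symbols, separated by a space."""
    return "".join(" " + s if s.startswith("+") else s for s in symbols)


for symbols in analyse("CONTO"):
    print(show(symbols))   # CONTARE +V +1P +SG
```

Each tag is a single element of the output sequence, which is what "single symbols with multi-character print names" amounts to in this encoding.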
Exercise
• The word "conto" in Italian is also a masculine noun meaning (a) story and (b) bank account.
• Draw the corresponding 2-level networks.
• How can the different meanings be incorporated into the same network?
Conto +N +SG
[Figure: transducer network mapping lower-side CONTO to upper-side CONTO +N +SG]
• Input: CONTO
• Output: CONTO +N +SG
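One possible way to hold the verb and noun readings in a single network is to let the two paths share their common prefix and branch where they differ. The sketch below is an illustration under assumptions, not the intended answer to the exercise. The helper `add_path`, the alignment that pads the shorter side with epsilon, and the encoding are all made up here. Analysis then returns every reading whose lower side matches the input.

```python
from itertools import zip_longest

EPS = ""  # epsilon, encoded as the empty string (a choice made for this sketch)

arcs = {0: []}    # state -> list of (upper, lower, next_state)
finals = set()


def add_path(upper_symbols, lower_symbols):
    """Add one aligned path, reusing arcs that already carry the same pair."""
    state = 0
    for upper, lower in zip_longest(upper_symbols, lower_symbols, fillvalue=EPS):
        for u, l, next_state in arcs[state]:
            if (u, l) == (upper, lower):
                state = next_state
                break
        else:
            new_state = len(arcs)
            arcs[new_state] = []
            arcs[state].append((upper, lower, new_state))
            state = new_state
    finals.add(state)


add_path(list("CONTARE") + ["+V", "+1P", "+SG"], list("CONTO"))   # verb reading
add_path(list("CONTO") + ["+N", "+SG"], list("CONTO"))            # noun reading


def analyse(surface, state=0, pos=0, output=""):
    """Yield every upper-side string whose lower side spells `surface`."""
    if pos == len(surface) and state in finals:
        yield output
    for upper, lower, next_state in arcs[state]:
        out = output + (" " + upper if upper.startswith("+") else upper)
        if lower == EPS:
            yield from analyse(surface, next_state, pos, out)
        elif pos < len(surface) and surface[pos] == lower:
            yield from analyse(surface, next_state, pos + 1, out)


for reading in analyse("CONTO"):
    print(reading)
```

The two paths share the arcs for C, O, N and T and split at the fifth symbol, where the verb path has A over O and the noun path has O over O.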
Synthesis
• Transducers are reversible: the same network can be used to perform the inverse transduction.
• The process of synthesis is the inverse of analysis.
The Process of Synthesis
• Start at the start state and at the beginning of the input string.
• Match the input symbols against the upper-side symbols of the arcs, consuming symbols until a final state is reached.
• If successful, return the string of lower-side symbols (else nothing).
Morphological Synthesis
[Figure: the same transducer network as for morphological analysis, read from the upper side to the lower side]
• Input: CONTARE +V +1P +SG
• Output: CONTO
• N.B. ε symbols are ignored on output.
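Synthesis can be sketched by matching the input against the upper side and collecting the lower side. As before, the alignment of symbol pairs, the encoding and the name `synthesise` are assumptions made for illustration. The input is given as a list of symbols so that each tag counts as one symbol.

```python
EPS = ""  # epsilon, encoded as the empty string (a choice made for this sketch)

# Assumed alignment of upper:lower pairs along a single path.
PATH = [("C", "C"), ("O", "O"), ("N", "N"), ("T", "T"), ("A", "O"),
        ("R", EPS), ("E", EPS), ("+V", EPS), ("+1P", EPS), ("+SG", EPS)]
ARCS = {i: [(upper, lower, i + 1)] for i, (upper, lower) in enumerate(PATH)}
ARCS[len(PATH)] = []
FINALS = {len(PATH)}


def synthesise(lexical, state=0, pos=0, output=""):
    """Yield each lower-side string whose upper side spells the symbol list `lexical`."""
    if pos == len(lexical) and state in FINALS:
        yield output
    for upper, lower, next_state in ARCS[state]:
        # Appending EPS (the empty string) adds nothing: epsilon is ignored on output.
        if upper == EPS:
            yield from synthesise(lexical, next_state, pos, output + lower)
        elif pos < len(lexical) and lexical[pos] == upper:
            yield from synthesise(lexical, next_state, pos + 1, output + lower)


lexical = list("CONTARE") + ["+V", "+1P", "+SG"]
print(list(synthesise(lexical)))   # ['CONTO']
```

The only difference from analysis is which side of each pair is matched and which is written.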
Analysis and Synthesis
• Upper Side Language (Lexical Strings).
• Lower Side Language (Surface Strings).
• Transducer maps between the two.
• However large the lexical transducer may become, analysis and synthesis are performed by the same language-independent matching techniques.