Finite State Machinery - I Fundamentals Recognisers and Transducers.

Post on 03-Jan-2016


Finite State Machinery - I

• Fundamentals

• Recognisers and Transducers


Reference Outline

• Websites
– Xerox: www.xrce.xerox.com/research/mltt/fst/
– Groningen: grid.let.rug.nl/~vannoord/FSA/fsa.html
– AT&T: www.research.att.com/sw/tools/fsm

• Books/Collections
– Karttunen & Oflazer (2000)
– Jurafsky & Martin (2000)
– Hopcroft and Ullman (1979)
– Roche and Schabes (1997)

• Classic Articles
– Kaplan and Kay (1994)
– Koskenniemi (1983)
– Johnson (1972)

• Tools
– Van Noord et al.
– Mohri et al.
– Daciuk
– Karttunen & Beesley


Acknowledgements to

• Lauri Karttunen, Ken Beesley and colleagues at Xerox.

• Most materials in this tutorial are from their website.

• Forthcoming book: Finite State Morphology – Xerox Tools and Techniques.

FS Motivation

• Chomsky hierarchy of language classes based on classes of descriptive notation, and also on associated classes of machine.

• Chomsky (1957) dismissed FS grammars, and associated machinery, as fundamentally inadequate for the description of NL.

Embedding

• The basic problem is not that sentences can grow to arbitrary length; it is that the description of a syntactic constituent may embed any other constituent, including the sentence itself.

The dog bit the cat.

The dog that the man saw bit the cat.

The dog that the man that the horse kicked saw bit the cat.

etc

On the other hand …

• Plenty of language just ain't like that.

• Words
– Orthographic spelling.
– Phonological spelling.
– Morphology.

• Fixed expression types (e.g. dates).

• Gross constituent structures (e.g. the big, bad, blue wolf).

Recent Application Areas for FS Technology Include

• POS Tagging

• Spell Checking

• Information Extraction

• Speech Recognition

• Text to Speech

• Spoken Dialogue

• Parsing


Recognition of Italian Words

• The coke machine recognises words in the coke machine language.

• The following machine recognises two words in Italian.

• Recognition mechanism is language independent.

[Figure: recogniser network; arc labels C A S A and I N Q U E]


The Process of Analysis

• Start in the initial state and at the first symbol of the word.

• If there is an arc labelled with that symbol, the machine transitions to the next state, and the symbol is consumed.

• The process continues with successive symbols until .....


The Process of Analysis

One or more of these conditions holds:

• A. A final state is reached.

• B. All symbols are consumed.

• C. There are no transitions out of a state for the current symbol.

– If both A and B, analysis succeeds and the word is recognised.
– Otherwise recognition fails.
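The analysis procedure can be sketched in a few lines of Python. This is only an illustration, assuming a deterministic network stored as a dictionary and assuming the two Italian words are CASA and CINQUE sharing an initial C. The state numbers and the names `ARCS`, `FINAL` and `recognise` are invented for this sketch and are not part of any Xerox tool.

```python
# Hypothetical network for the words CASA and CINQUE.
# Keys are (state, symbol) pairs; values are the next state.
ARCS = {
    (0, "C"): 1,
    (1, "A"): 2, (2, "S"): 3, (3, "A"): 4,      # C-A-S-A
    (1, "I"): 5, (5, "N"): 6, (6, "Q"): 7,
    (7, "U"): 8, (8, "E"): 9,                   # C-I-N-Q-U-E
}
FINAL = {4, 9}


def recognise(word, arcs=ARCS, final=FINAL, start=0):
    """Return True if the network accepts the word, else False."""
    state = start
    for symbol in word:
        nxt = arcs.get((state, symbol))
        if nxt is None:
            # Condition C: no transition for the current symbol.
            return False
        state = nxt  # the symbol is consumed
    # Condition B holds here; succeed only if condition A holds too.
    return state in final


for w in ["CASA", "CINQUE", "CAS", "CASAS", "LE"]:
    print(w, recognise(w))
```

Nothing in `recognise` refers to Italian: swapping in a different `ARCS` table changes the language recognised without touching the matching code.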


Success and Failure

[Figure: recogniser network; legible arc labels: C A S A, I N Q U E, L, E, N, T, E]

Test words: LE; CASA; CINQUANTA; LENTEMENTE


Transducers

• Recognisers either accept or reject a word.

• Although this is useful, networks can actually return more substantial information.

• This is achieved by providing networks with the ability to write as well as to read.


Basic Transducer

• Each transition of a transducer is labelled with a pair of symbols rather than with a single symbol.

• Analysis proceeds as before, except that input symbols are matched against the lower-side symbols on transitions.

• If analysis succeeds, return the string of upper-side symbols on the path to the final state.
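As a rough sketch of this idea, each arc below carries an upper and a lower symbol, and analysis matches the input against the lower side while collecting the upper side. The arc layout, the state numbers and the name `analyse` are assumptions made for illustration; the network pairs upper CASA with lower CASE and is assumed to be deterministic on its lower side.

```python
# Each arc is (source state, upper symbol, lower symbol, target state).
# Hypothetical network pairing upper-side CASA with lower-side CASE.
ARCS = [
    (0, "C", "C", 1),
    (1, "A", "A", 2),
    (2, "S", "S", 3),
    (3, "A", "E", 4),
]
FINAL = {4}


def analyse(word, arcs=ARCS, final=FINAL, start=0):
    """Match word against lower-side symbols; return the upper-side string.

    Returns None when analysis fails.
    """
    state, output = start, []
    for symbol in word:
        for src, upper, lower, dst in arcs:
            if src == state and lower == symbol:
                output.append(upper)  # write the upper-side symbol
                state = dst           # consume the input symbol
                break
        else:
            return None  # no arc out of this state for the symbol
    return "".join(output) if state in final else None


print(analyse("CASE"))  # CASA
print(analyse("CASA"))  # None
```

The only change from a recogniser is the `output` list: the control flow is the same, but a successful path now yields a string instead of a yes/no answer.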

Confusing Terminology

• Lower side = surface side.

• Upper side = "deep" side.

• Analysis proceeds from lower to upper.

• Synthesis (generation) proceeds from upper to lower.


Lexical Transducers

• In common parlance, a transducer is a device which converts one form of energy into another, e.g. a microphone converts from sound to electrical signals.

• Next we look at lexical transducers which convert one string of symbols into another.


Lexical Transducer Example

[Figure: transducer whose upper side spells the lexical string C A S A and whose lower side spells the surface string C A S E]

• Input: CASE

• Output: CASA


Morphological Analysis

[Figure: transducer whose upper side spells C O N T A R E +V +1P +SG and whose lower side spells C O N T O]

• Input: CONTO

• Output: CONTARE +V +1P +SG


Remarks

• ε stands for "epsilon". During analysis, epsilon transitions are taken freely without consuming any input.

• Note also single symbols with multi-character print names (e.g. +SG).

• The order of these symbols, and the choice of infinitive as baseform, is determined by linguists.
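These remarks can be illustrated with a small sketch in which epsilon is represented by `None` and multi-character symbols such as `+SG` are single list items. The exact alignment of upper and lower symbols below is an assumption (the original figure is not recoverable), as are the names `EPS`, `ARCS` and `analyse`. The search is nondeterministic and assumes the network has no epsilon cycles.

```python
EPS = None  # stands for epsilon

# (source state, upper symbol, lower symbol, target state).
# Assumed alignment: CONTARE +V +1P +SG over CONTO plus epsilons.
ARCS = [
    (0, "C", "C", 1),
    (1, "O", "O", 2),
    (2, "N", "N", 3),
    (3, "T", "T", 4),
    (4, "A", "O", 5),
    (5, "R", EPS, 6),
    (6, "E", EPS, 7),
    (7, "+V", EPS, 8),
    (8, "+1P", EPS, 9),
    (9, "+SG", EPS, 10),
]
FINAL = {10}


def analyse(symbols, arcs=ARCS, final=FINAL, start=0):
    """Return every upper-side symbol list whose lower side matches symbols."""
    results = []

    def walk(state, pos, output):
        if pos == len(symbols) and state in final:
            results.append(output)
        for src, upper, lower, dst in arcs:
            if src != state:
                continue
            emitted = output + ([upper] if upper is not EPS else [])
            if lower is EPS:
                # Epsilon on the lower side: taken without consuming input.
                walk(dst, pos, emitted)
            elif pos < len(symbols) and lower == symbols[pos]:
                walk(dst, pos + 1, emitted)

    walk(start, 0, [])
    return results


for analysis in analyse(list("CONTO")):
    print(" ".join(analysis))  # C O N T A R E +V +1P +SG
```

Keeping symbols in a list rather than a string is what lets `+1P` count as one symbol; returning a list of results leaves room for networks in which one surface form has several analyses.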


Exercise

• The word "conto" in Italian is also a masculine noun meaning (a) story and (b) bank account.

• Draw the corresponding 2-level networks.

• How can the different meanings be incorporated into the same network?


Conto +N +SG

[Figure: transducer whose upper side spells C O N T O +N +SG and whose lower side spells C O N T O]

• Input: CONTO

• Output: CONTO +N +SG


Synthesis

• Transducers are reversible. This means that the same network can be used to perform the inverse transduction.

• The process of synthesis is the inverse of analysis.


The Process of Synthesis

• Start at the start state and at the beginning of the input string.

• Match the input symbols against the upper-side symbols of the arcs, consuming symbols until a final state is reached.

• If successful, return the string of lower-side symbols (else nothing).
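The synthesis steps can be sketched the same way as analysis, with the roles of the two sides swapped. As before, the arc alignment, state numbers and the names `EPS`, `ARCS` and `synthesise` are assumptions for illustration only, and the network is assumed to be deterministic on its upper side.

```python
EPS = None  # stands for epsilon

# (source state, upper symbol, lower symbol, target state).
# Assumed alignment: CONTARE +V +1P +SG over CONTO plus epsilons.
ARCS = [
    (0, "C", "C", 1),
    (1, "O", "O", 2),
    (2, "N", "N", 3),
    (3, "T", "T", 4),
    (4, "A", "O", 5),
    (5, "R", EPS, 6),
    (6, "E", EPS, 7),
    (7, "+V", EPS, 8),
    (8, "+1P", EPS, 9),
    (9, "+SG", EPS, 10),
]
FINAL = {10}


def synthesise(symbols, arcs=ARCS, final=FINAL, start=0):
    """Match symbols against the upper side; return the lower-side string.

    Returns None when synthesis fails.
    """
    state, output = start, []
    for symbol in symbols:
        for src, upper, lower, dst in arcs:
            if src == state and upper == symbol:
                if lower is not EPS:  # epsilons are ignored on output
                    output.append(lower)
                state = dst
                break
        else:
            return None  # no arc with this upper-side symbol
    return "".join(output) if state in final else None


lexical = list("CONTARE") + ["+V", "+1P", "+SG"]
print(synthesise(lexical))  # CONTO
```

The arc table is unchanged from what analysis would use; only the side being matched and the side being returned have been exchanged.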


Morphological Synthesis

[Figure: transducer whose upper side spells C O N T A R E +V +1P +SG and whose lower side spells C O N T O]

• Input: CONTARE +V +1P +SG

• Output: CONTO

• N.B. ε symbols are ignored on output.


Analysis and Synthesis

• Upper Side Language (Lexical Strings).

• Lower Side Language (Surface Strings).

• Transducer maps between the two.

• However large the lexical transducer may become, analysis and synthesis are performed by the same language-independent matching techniques.