Natural Language Processing >> Morphology


Page 1: Natural Language Processing >>  Morphology

Natural Language Processing >> Morphology <<

Prof. Dr. Bettina Harriehausen-Mühlbauer
Univ. of Applied Science, Darmstadt, Germany
www.fbi.h-da.de/~harriehausen

[email protected] [email protected]

winter / fall 2010/2011
41.4268

Page 2: Natural Language Processing >>  Morphology

WS 2010/2011 NLP - Harriehausen 2

content

1 morphemes
2 compounds / concatenation
3 idiomatic phrases
4 multiple word entries (MWE)
5 spell aid
6 regular expressions
7 Finite State Automata (FSA)


Page 4: Natural Language Processing >>  Morphology

Morphemes: definition

morpheme = the smallest unit of a language that carries meaning

• lexemes (man, house, dog, ...)
• inflectional affixes (dog-s, want-ed, ...)
• other affixes (prefixes, infixes, suffixes): un-wanted, a-typical, anti-pathetic, ...

Affixes are especially frequent in technical language (-itis = "inflammation", gastro- = stomach, ... gastroenteritis).


Page 6: Natural Language Processing >>  Morphology

morphemes

free morphemes: can stand alone and carry lexical and morphological meaning (e.g. house = singular, neuter, nominative, i.e. number / gender / case)

bound morphemes: form a legal word form only in combination with another morpheme; they carry lexical or morphological meaning but cannot stand alone (e.g. un-happy, gastro-enter-itis)

Page 7: Natural Language Processing >>  Morphology

morphemes

inflectional morphemes: create word forms of the same word and carry morphological (grammatical) meaning (e.g. dogs, laughed, going)

derivational morphemes: create new words, often of a different word class (e.g. happily, intellectually, instruction, instructor, insulator, the pounding, limpness, blindness, ...)

Question: which strings (~ morphemes) do we include in our dictionary?


Page 9: Natural Language Processing >>  Morphology

compounds / concatenation

In addition to single morphemes, we need to consider "multiple morpheme strings / multi-word expressions" (fixed phrases):

• independent of the context: dog, cat, ...
• compounding: a combination of the lexical meanings: carseat, houseboat, ...
• compounding: not a combination of the lexical meanings: nosebag, nosedive, paperback, ladybug, ...
• depending on the context: bite the dust, lose face, kick the bucket, ...

[Figure label alongside the list: increasing the formal complexity = increasing the idiomatic rigidity]

Page 10: Natural Language Processing >>  Morphology

Samples for long compounds in German

• die Armbrust
• die Mehrzweckhalle
• das Mehrzweckkirschentkerngerät
• die Gemeindegrundsteuerveranlagung
• die Nummernschildbedruckungsmaschine
• der Mehrkornroggenvollkornbrotmehlzulieferer
• der Schifffahrtskapitänsmützenmaterialhersteller
• die Verkehrsinfrastrukturfinanzierungsgesellschaft
• die Feuerwehrrettungshubschraubernotlandeplatzaufseherin
• der Oberpostdirektionsbriefmarkenstempelautomatenmechaniker
• das Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
• die Donaudampfschifffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

Page 11: Natural Language Processing >>  Morphology

compounds / concatenation

decompounding: principles / rules

FANO rule: "the analysis is unambiguous when no morpheme is the beginning of another morpheme"
(in practice: principle of longest match), e.g. but / butter

Segmentation has to be done recursively in order to find all possibilities:

horseshoe: horses - hoe (?) vs. horse - shoe

Staubecken: Stau - Becken ("retention basin") vs. Staub - Ecken ("dust corners")
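The recursive segmentation described above could be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `segmentations` and the tiny `LEXICON` are hypothetical, and a real decompounder would work on a full morpheme dictionary.

```python
def segmentations(word, lexicon):
    """Return every way to split `word` into a sequence of lexicon entries."""
    if not word:
        return [[]]  # the empty string has exactly one split: no parts
    results = []
    # Try every prefix; if it is a known morpheme, recurse on the rest.
    for i in range(1, len(word) + 1):
        head, rest = word[:i], word[i:]
        if head in lexicon:
            for tail in segmentations(rest, lexicon):
                results.append([head] + tail)
    return results


# Hypothetical toy lexicon (lower-cased), just large enough for the slide's examples.
LEXICON = {"horse", "horses", "shoe", "hoe", "stau", "staub", "becken", "ecken"}

horseshoe = segmentations("horseshoe", LEXICON)
staubecken = segmentations("staubecken", LEXICON)
# horseshoe  -> [['horse', 'shoe'], ['horses', 'hoe']]
# staubecken -> [['stau', 'becken'], ['staub', 'ecken']]
```

Both readings are returned for each word; choosing between them (e.g. by longest match or by frequency) is a separate step that this sketch leaves out.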

Page 12: Natural Language Processing >>  Morphology

concatenation

Problem: not all morphemes can be concatenated.


Page 14: Natural Language Processing >>  Morphology

idiomatic phrases (http://www.geo.de/GEOlino/mensch/redewendungen/englisch)

• Out of the blue
• To be on Cloud Nine
• A leopard cannot change its spots
• Head over heels
• Fair Play
• As cool as a cucumber
• The early bird catches the worm
• An apple a day keeps the doctor away
• As fit as a fiddle
• Beat about the bush
• The Big Apple
• The apple of my eye
• Wet behind the ears
• A bird in the hand is worth two in the bush
• It's raining cats and dogs
• A friend in need is a friend indeed
• It's all Greek to me

Page 15: Natural Language Processing >>  Morphology

idiomatic phrases (http://www.geo.de/GEOlino/mensch/redewendungen/deutsch)

• Wie bei Hempels unterm Sofa
• Schmetterlinge im Bauch
• Jemanden übers Ohr hauen
• Ein Bäuerchen machen
• Mit jemandem durch dick und dünn gehen
• Seine Pappenheimer kennen
• Jemandem die Würmer aus der Nase ziehen
• Die Arschkarte ziehen
• Mit jemandem Pferde stehlen können
• Sich aus dem Staub machen
• Hummeln im Hintern haben
• Im siebten Himmel sein
• Viele Wege führen nach Rom
• Mit einem lachenden und einem weinenden Auge
• Nah am Wasser gebaut haben
• Da ist der Bär los
• Nachtigall, ick hör dir trapsen
• Mein lieber Scholli!

Page 16: Natural Language Processing >>  Morphology

idiomatic phrases (http://www.geo.de/GEOlino/mensch/redewendungen/deutsch)

• Jemandem einen Denkzettel verpassen
• Sich auf den Schlips getreten fühlen
• Alles für die Katz
• Wo drückt denn der Schuh?
• Gegen den Strich gehen
• Den Faden verlieren
• Etwas ausbaden müssen
• Einen Stein im Brett haben
• Bahnhof verstehen
• Der springende Punkt
• Der Sündenbock sein
• Einen Ohrwurm haben
• Das ist doch zum Mäusemelken!
• Schmiere stehen
• Den Teufel an die Wand malen
• Auf dem Holzweg sein
• Eselsbrücke
• In der Kreide stehen

Page 17: Natural Language Processing >>  Morphology

idiomatic phrases (http://www.geo.de/GEOlino/mensch/redewendungen/deutsch)

• Die Ohren steif halten
• Auf Vordermann bringen
• Um die Ecke bringen
• Hals- und Beinbruch
• Auf dem Kerbholz haben
• Eine Schlappe einstecken
• Frosch im Hals
• Es zieht wie Hechtsuppe
• Jemandem einen Bärendienst erweisen
• Damoklesschwert
• Tomaten auf den Augen haben
• Jemandem raucht der Kopf
• Für 'n Appel und 'n Ei
• Etwas an die große Glocke hängen
• Das ist Jacke wie Hose
• Etwas aus dem Ärmel schütteln
• Ein X für ein U vormachen
• Jemandem nicht das Wasser reichen können

Page 18: Natural Language Processing >>  Morphology

idiomatic phrases (http://www.geo.de/GEOlino/mensch/redewendungen/deutsch)

• Alles im grünen Bereich
• Die Hand ins Feuer legen
• Auf Draht sein
• Sein blaues Wunder erleben
• Der hat es faustdick hinter den Ohren
• Mein Name ist Hase, ich weiß von nichts
• Aus dem Stegreif
• Der Groschen ist gefallen
• Einen Vogel haben
• Den Kürzeren ziehen
• Bis in die Puppen
• Etwas hinter die Ohren schreiben
• Ins Fettnäpfchen treten
• Beleidigte Leberwurst
• Jemanden auf dem Kieker haben
• Ich verstehe immer nur Bahnhof!
• Die Katze im Sack kaufen
• Das kann kein Schwein lesen!

Page 19: Natural Language Processing >>  Morphology

idiomatic phrases (http://www.geo.de/GEOlino/mensch/redewendungen/deutsch)

• Bekannt wie ein bunter Hund
• Den Kopf in den Sand stecken
• Mit dem ist nicht gut Kirschen essen
• Aller guten Dinge sind drei
• Lampenfieber
• Das kommt mir spanisch vor
• Schwein haben
• Das hast du dir selbst eingebrockt
• Seinen Senf dazugeben
• Jemandem ist eine Laus über die Leber gelaufen
• Kalte Füße bekommen
• Im Stich lassen
• Schwedische Gardinen
• Alles in Butter
• Geld auf den Kopf hauen
• Das Handtuch werfen
• Sich mit fremden Federn schmücken


Page 21: Natural Language Processing >>  Morphology

multiple word entries (MWE)

In addition to single morphemes, we need to consider "multiple morpheme strings" (fixed phrases). This is important for:

• electronic dictionaries
• all NLP applications
• machine translation

Types of entries:

• independent of the context: dog, cat, ...
• compounding (a): a combination of the lexical meanings: carseat, houseboat, ...
• compounding (b): not a combination of the lexical meanings: nosebag, nosedive, paperback, ladybug, soap opera, ...
• depending on the context: bite the dust, lose face, kick the bucket, ...

Page 22: Natural Language Processing >>  Morphology

multiple word entries (MWE)

Problem: the relationships among the components change

the "Schnitzel" problem:

• sirloin steak (made from a certain part of ...)
• soy steak (made out of the material ...)
• "Wiener Schnitzel" (prepared according to a certain recipe)
• pepper steak (served with ...)
• ...

Even though the individual lexical meanings remain untouched in the compound, the relationship between the components varies tremendously!

Page 23: Natural Language Processing >>  Morphology

multiple word entries (MWE)

The 3 main relationships (default?) between the parts of a compound word (the role of global knowledge in decompounding):

     compound      meaning                      relationship
1    doorknob      knob of the door             is-a / is-part-of / genitive
     carseat       seat of the car
2    glassdoor     door made of glass           made from / material
     nutbread      ≠ bread of the nut
3    waterglass    glass filled with water      used for
     oiltruck      truck that carries oil
                   ≠ truck made of oil

Page 24: Natural Language Processing >>  Morphology

multiple word entries (MWE)

decompounding: the orange bowl problem

Can you please bring me the orange bowl?

• bowl filled with oranges?
• bowl having the shape of an orange?
• bowl with an orange pattern?
• bowl of orange colour?
• bowl that was formerly / usually filled with oranges?


Page 26: Natural Language Processing >>  Morphology

spell aid

In NLP, decompounding algorithms are essential for spell-checking / spell aid.

How do we define a lexical error in NLP terms?

An error is a string that cannot be found in / matched with a dictionary entry.

It is not necessarily an incorrect word (esp. in the case of neologisms).

Page 27: Natural Language Processing >>  Morphology

spell aid

Spell-checking algorithms are based on the following types of mistakes (statistics!):

• phonetic similarities (ph - f: telephone - telefone)
• deletion of repeated letters (mouuse - mouse)
• wrong order of letters (from - form; mouse - muose)
• substitution of neighbouring letters on the keyboard (miuse - mouse)
• insertion of missing letters, e.g. vowels between consonants (telephne - telephone)
• typos tend to occur towards the end of a word (assumption: the first letter is correct)
• segmentation / decomposition into substrings (horeshoe - horseshoe)
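Several of the error types listed above (repeated letters, wrong order, substituted letters, missing letters) can be reversed by trying every single-character edit and keeping the results that are dictionary words. The following Python sketch illustrates that idea; it is not the lecture's algorithm. The function name `candidates` and the toy `DICTIONARY` are assumptions, and phonetic similarity, keyboard neighbourhood and error statistics are not modelled.

```python
import string


def candidates(word, dictionary):
    """Dictionary words reachable from `word` by one single-character edit."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # mouuse -> mouse: delete one letter
    deletes = {a + b[1:] for a, b in splits if b}
    # muose -> mouse: swap two adjacent letters
    swaps = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    # miuse -> mouse: replace one letter
    replaces = {a + c + b[1:] for a, b in splits if b for c in letters}
    # telephne -> telephone: insert one letter
    inserts = {a + c + b for a, b in splits for c in letters}
    return sorted((deletes | swaps | replaces | inserts) & dictionary)


# Hypothetical toy dictionary.
DICTIONARY = {"mouse", "telephone", "horseshoe", "form", "from"}

suggestions = {w: candidates(w, DICTIONARY)
               for w in ["mouuse", "muose", "miuse", "telephne", "horeshoe"]}
```

Here each misspelling from the slide yields exactly one suggestion; with a realistic dictionary the candidate set would have to be ranked, for example by the error statistics the slide mentions.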


Page 29: Natural Language Processing >>  Morphology

spell aid

• insertion of missing letters, e.g. vowels between consonants (telephne)

Certain rules apply, e.g. in German: never follow "l", "n" or "r" with "tz" or "ck". Illegal sequences:

_ltz_ (*Holtz)   _lck_   _ntz_   _nck_   _rtz_   _rck_
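A rule of this kind is easy to express as a pattern check. The sketch below is an illustration only: the function name `violates_rule` is hypothetical, and the rule is applied exactly as stated on the slide, so proper names such as "Hertz" would be flagged as well.

```python
import re

# "l", "n" or "r" directly followed by "tz" or "ck"
ILLEGAL_CLUSTER = re.compile(r"[lnr](tz|ck)")


def violates_rule(word):
    """True if the word contains one of the illegal letter sequences."""
    return ILLEGAL_CLUSTER.search(word.lower()) is not None


flagged = [w for w in ["Holtz", "Holz", "Katze", "Bank", "Banck"] if violates_rule(w)]
# flagged -> ['Holtz', 'Banck']
```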

Page 30: Natural Language Processing >>  Morphology

spell aid

• include missing letters: www.dositey.com/language/spelling/Mislet3.htm

Page 31: Natural Language Processing >>  Morphology

spell aid

How does spell checking work (w.r.t. grammar checking)?

Various degrees of "intelligence":

System A: no match found in the dictionary -> mark the entry as incorrect.

System B: no match found in the dictionary. Initiate a rudimentary parse (left-to-right search). Try to identify the word class, i.e. limit the possibilities, and continue the sentential analysis, e.g. the ... man (statistics: DET + ADJ + NOUN).

System C: no match found in the dictionary. Initiate a segmentation of the word to identify the word class, e.g. look for typical endings (-ly = adverb, capital letter = proper noun, ...). This way newly created words can be identified (e.g. any word ending in -ness = noun).
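The "System C" idea can be illustrated with a few lines of Python. This is a rough sketch, not a description of any actual spell checker: the function name `guess_word_class`, the two-entry suffix table and the example dictionary are assumptions, and the heuristics are only the ones named on the slide.

```python
# Suffix heuristics taken from the slide: -ness = noun, -ly = adverb.
SUFFIX_CLASSES = [("ness", "noun"), ("ly", "adverb")]


def guess_word_class(word, dictionary):
    """Guess the word class of a string that may be missing from the dictionary."""
    lower = word.lower()
    if lower in dictionary:
        return "known"
    # Typical endings are checked first ...
    for suffix, word_class in SUFFIX_CLASSES:
        if lower.endswith(suffix) and len(lower) > len(suffix):
            return word_class
    # ... then capitalisation as a hint for proper nouns.
    if word[:1].isupper():
        return "proper noun"
    return "unknown"


# Hypothetical one-word dictionary.
KNOWN_WORDS = {"google"}
guesses = [guess_word_class(w, KNOWN_WORDS)
           for w in ["google", "googleness", "blorply", "Darmstadt", "xyzzy"]]
# guesses -> ['known', 'noun', 'adverb', 'proper noun', 'unknown']
```

A sentence-initial capital letter would need extra handling, which this sketch ignores.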


Page 33: Natural Language Processing >>  Morphology

regular expressions (Jurafsky, section 2.1)

• In order to figure out whether something is an incorrect word, the machine has to match the string (= any sequence of symbols: letters, numbers, spaces, tabs, punctuation) to an entry in the dictionary.
• Other kinds of matching: e.g. information retrieval in web search engines (Google, AltaVista, ...)
• The standard notation for characterizing text sequences is the regular expression.
• Regular expressions are used in languages and tools such as Perl and grep (Global Regular Expression Print).
• Formally, regular expressions are an algebraic notation for characterizing a set of strings.
• A regular expression search requires a pattern that we want to search for and a corpus of text to search through (text mining!).

Page 34: Natural Language Processing >>  Morphology

regular expressions (Jurafsky, section 2.1)

Example: search for the pattern "linguistics".

• You also want to find documents with "Linguistics" and "LINGUISTICS" (remember: the computer does EXACTLY what you tell it to do ...).
• The regular expression /linguistics/ matches any string in any document containing exactly the substring "linguistics".
• Regular expressions are case-sensitive.
• Samples (Jurafsky, p. 23):

regular expression    example pattern matched
/woodchucks/          "interesting links to woodchucks and lemurs"
/a/                   "Mary Ann stopped by Mona's"
/Claire says,/        ""Dagmar, my gift please," Claire says,"
/song/                "all our pretty songs"
/!/                   "You've left the burglar behind again!" said Nori
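The slides use Perl-style notation; the same behaviour can be tried out with Python's standard `re` module, which is used here purely as an assumption of convenience. The example text is made up.

```python
import re

text = "Computational Linguistics is fun. LINGUISTICS rocks. I like linguistics."

# A plain pattern is case-sensitive: only the lower-case spelling is found.
exact = re.findall(r"linguistics", text)

# Case-insensitive search has to be requested explicitly.
any_case = re.findall(r"linguistics", text, flags=re.IGNORECASE)

# exact    -> ['linguistics']
# any_case -> ['Linguistics', 'LINGUISTICS', 'linguistics']
```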

Page 35: Natural Language Processing >>  Morphology

regular expressions (Jurafsky, section 2.1)

linguistics - Linguistics - LINGUISTICS

To search for alternative characters, e.g. "l" or "L", we use square brackets: [lL]

regular expression    match                         sample pattern
/[lL]inguistics/      Linguistics or linguistics    "computational linguistics is fun"
/[1234567890]/        any single digit              "this is Linguistics 5981"

Page 36: Natural Language Processing >>  Morphology

regular expressions (Jurafsky, section 2.1)

To search for a character in a range we use the dash inside square brackets: [-]

regular expression    match                   sample pattern
/[A-Z]/               any uppercase letter    "this is Linguistics 5981"
/[0-9]/               any single digit        "this is Linguistics 5981"
/[1234567890]/        any single digit        "this is Linguistics 5981"
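Square brackets and ranges can be checked against the slide's sample text with Python's `re` module (an assumption of convenience; the slides themselves use Perl-style notation).

```python
import re

text = "this is Linguistics 5981"

# Square brackets: either "l" or "L" as the first letter.
word = re.search(r"[lL]inguistics", text).group()

# Ranges inside brackets: any uppercase letter, any digit.
first_upper = re.search(r"[A-Z]", text).group()
digits_range = re.findall(r"[0-9]", text)

# The range [0-9] is shorthand for listing every digit.
digits_listed = re.findall(r"[1234567890]", text)

# word -> 'Linguistics', first_upper -> 'L'
# digits_range -> ['5', '9', '8', '1'], identical to digits_listed
```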

Page 37: Natural Language Processing >>  Morphology

regular expressions (Jurafsky, section 2.1)

To search for negation, i.e. a character that we do NOT want to find, we use the caret as the first symbol inside square brackets: [^ ]

regular expression    match                      sample pattern
/[^A-Z]/              not an uppercase letter    "this is Linguistics 5981"
/[^Ll]/               neither L nor l            "this is Linguistics 5981"
/[^\.]/               not a period               "this is Linguistics 5981"

Special characters:

\*    an asterisk        "L*I*N*G*U*I*S*T*I*C*S"
\.    a period           "Dr.Doolittle"
\?    a question mark    "Is this Linguistics 5981 ?"
\n    a newline
\t    a tab
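Negated brackets and escaped special characters behave as follows in Python's `re` module (used here as an assumption of convenience; the variable names are illustrative).

```python
import re

text = "this is Linguistics 5981"

# [^A-Z] matches the first character that is not an uppercase letter.
first_non_upper = re.search(r"[^A-Z]", text).group()

# Deleting everything that is neither "L" nor "l" leaves only those letters.
only_l = re.sub(r"[^Ll]", "", text)

# A backslash turns a special character into a literal one.
literal_periods = re.findall(r"\.", "Dr.Doolittle")
any_characters = re.findall(r".", "Dr.")   # unescaped "." matches every character
asterisks = re.findall(r"\*", "L*I*N")

# first_non_upper -> 't', only_l -> 'L'
# literal_periods -> ['.'], any_characters -> ['D', 'r', '.'], asterisks -> ['*', '*']
```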

Page 38: Natural Language Processing >>  Morphology

regular expressions (Jurafsky, section 2.1)

To mark the preceding character as optional we use the question mark: ?

regular expression    match              sample pattern
/colou?r/             colour or color    "beautiful colour"

To search for any number of occurrences of a character we use the Kleene star: *

regular expression    match
/a*/                  any string of zero or more "a"s
/aa*/                 one "a" followed by zero or more "a"s, i.e. at least one "a"
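The optional marker and the Kleene star can be tested with Python's `re` module (an assumption of convenience). The helper `matches_fully` is hypothetical and only checks whether a pattern covers a whole string.

```python
import re

# "u?" makes the "u" optional, so both spellings are found.
spellings = re.findall(r"colou?r", "beautiful colour, beautiful color")


def matches_fully(pattern, s):
    """True if the pattern matches the entire string."""
    return re.fullmatch(pattern, s) is not None


star_empty = matches_fully(r"a*", "")        # zero "a"s are allowed
star_many = matches_fully(r"a*", "aaaa")
aa_star_empty = matches_fully(r"aa*", "")    # at least one "a" is required
aa_star_one = matches_fully(r"aa*", "a")

# spellings -> ['colour', 'color']
# star_empty, star_many, aa_star_one -> True; aa_star_empty -> False
```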

Page 39: Natural Language Processing >>  Morphology

regular expressions (Jurafsky, section 2.1)

Any combination is possible:

regular expression    match
/[ab]*/               zero or more "a"s or "b"s
/[0-9][0-9]*/         any integer (= a string of digits)

To look for at least one character of a type we use the Kleene plus: +

regular expression    match
/[0-9]+/              a sequence of one or more digits
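That /[0-9][0-9]*/ and /[0-9]+/ describe the same strings can be checked with Python's `re` module (an assumption of convenience; the sample text is made up).

```python
import re

text = "room 7, Linguistics 5981, winter 2010"

# "one digit followed by zero or more digits" ...
with_star = re.findall(r"[0-9][0-9]*", text)
# ... is the same as "one or more digits".
with_plus = re.findall(r"[0-9]+", text)

# Brackets combined with the star: strings made up only of "a"s and "b"s.
abba_ok = re.fullmatch(r"[ab]*", "abba") is not None
abc_ok = re.fullmatch(r"[ab]*", "abc") is not None

# with_star and with_plus -> ['7', '5981', '2010']
# abba_ok -> True, abc_ok -> False
```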

Page 40: Natural Language Processing >>  Morphology

regular expressions (Jurafsky, section 2.1)

The "." is a very special character: the so-called wildcard

regular expression    match                              sample patterns
/b.ll/                any character between b and ll     ball, bell, bull, bill

Will the search find "Bill"?
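The question about "Bill" can be answered by experiment, here with Python's `re` module (an assumption of convenience; the word list is made up).

```python
import re

words = ["ball", "bell", "bull", "bill", "Bill", "bll", "b ll"]

# "." stands for exactly one arbitrary character (here even a space).
matched = [w for w in words if re.fullmatch(r"b.ll", w)]

# The pattern is case-sensitive, so "Bill" needs an explicit flag.
bill_default = re.fullmatch(r"b.ll", "Bill") is not None
bill_ignorecase = re.fullmatch(r"b.ll", "Bill", flags=re.IGNORECASE) is not None

# matched -> ['ball', 'bell', 'bull', 'bill', 'b ll']
# bill_default -> False, bill_ignorecase -> True
```

So the search does not find "Bill" unless case is ignored or the pattern is written as /[bB].ll/.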

Page 41: Natural Language Processing >>  Morphology

regular expressions (Jurafsky, section 2.1)

Anchors (start of line: "^", end of line: "$")

regular expression    match                                        sample pattern
/^Linguistics/        "Linguistics" at the beginning of a line     "Linguistics is fun."
/linguistics\.$/      "linguistics." at the end of a line          "We like linguistics."

Anchors (word boundary: "\b", non-boundary: "\B")

regular expression    match                            sample pattern
/\bthe\b/             "the" as a word on its own       "This is the place."
/\Bthe\B/             "the" inside a longer word       "This is my mother."
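The four anchors can be tried out with Python's `re` module (an assumption of convenience). The combined sample sentence is made up so that both kinds of "the" occur in it.

```python
import re

# Line anchors
starts = re.search(r"^Linguistics", "Linguistics is fun.") is not None
not_at_start = re.search(r"^Linguistics", "We like Linguistics.") is not None
ends = re.search(r"linguistics\.$", "We like linguistics.") is not None

# Word-boundary anchors
sentence = "This is the place, mother."
alone = re.search(r"\bthe\b", sentence).start()    # the separate word "the"
inside = re.search(r"\Bthe\B", sentence).start()   # the "the" inside "mother"

# starts -> True, not_at_start -> False, ends -> True
# alone -> 8, inside -> 21
```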

Page 42: Natural Language Processing >>  Morphology

regular expressions (Jurafsky, section 2.1)

More on alternatives: the pipe symbol "|" (disjunction)

regular expression    match                   sample pattern
/colou?r/             colour or color         "beautiful colour"
/progra(m|mme)/       program or programme    "linguistics program"
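The disjunction can be tried out with Python's `re` module (an assumption of convenience). One detail worth knowing: in Python, as in Perl, the alternatives are tried from left to right, so a search inside "programme" stops at "program" unless the longer alternative is listed first.

```python
import re

pattern = r"progra(m|mme)"

# Both spellings are accepted as complete words.
short_ok = re.fullmatch(pattern, "program") is not None
long_ok = re.fullmatch(pattern, "programme") is not None

# Searching inside a text returns the first alternative that succeeds.
found = re.search(pattern, "linguistics programme").group()
found_reordered = re.search(r"progra(mme|m)", "linguistics programme").group()

# short_ok, long_ok -> True
# found -> 'program', found_reordered -> 'programme'
```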

Page 43: Natural Language Processing >>  Morphology

regular expressions (Jurafsky, section 2.1)

What does the following expression match?

/student [0-9]+ */

Will it match "student 1 student 2 student 3"?
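One way to explore the question is to run the pattern, here with Python's `re` module (an assumption of convenience). The spacing of the expression on the slide is ambiguous; the sketch assumes it is meant as /student [0-9]+ */, i.e. "student", a space, one or more digits, and any number of trailing spaces.

```python
import re

text = "student 1 student 2 student 3"
pattern = r"student [0-9]+ *"

# The star only applies to the space before it, so each match covers one student.
pieces = re.findall(pattern, text)

# As a whole, the string is therefore not matched by a single application ...
whole = re.fullmatch(pattern, text) is not None

# ... unless the expression is grouped and repeated.
whole_grouped = re.fullmatch(r"(student [0-9]+ *)+", text) is not None

# pieces -> ['student 1 ', 'student 2 ', 'student 3']
# whole -> False, whole_grouped -> True
```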

Page 44: Natural Language Processing >>  Morphology

regular expressions (Jurafsky, section 2.1)

Perl expressions are also used for string substitution (used in ELIZA):

s/man/men/                         man -> men

Perl expressions are also used for string repetition via memory (the number operator):

s/(linguistics)/wonderful \1/      linguistics -> wonderful linguistics

ELIZA:

s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1 ?/
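The Perl substitutions have a direct counterpart in Python's `re.sub`, which is used here as an assumption of convenience. The function name `eliza_reply` and the input sentences are made up; the patterns are the ones from the slide.

```python
import re

# s/man/men/
plural = re.sub(r"man", "men", "man")

# s/(linguistics)/wonderful \1/ : \1 repeats what the first group matched
praise = re.sub(r"(linguistics)", r"wonderful \1", "I study linguistics")


def eliza_reply(utterance):
    """Apply the first ELIZA rule from the slide; other input is returned unchanged."""
    return re.sub(r".* YOU ARE (depressed|sad) .*",
                  r"I AM SORRY TO HEAR YOU ARE \1",
                  utterance)


reply = eliza_reply("I THINK YOU ARE sad TODAY")

# plural -> 'men'
# praise -> 'I study wonderful linguistics'
# reply  -> 'I AM SORRY TO HEAR YOU ARE sad'
```

Note that the pattern requires a space before "YOU" and after the adjective, so an utterance that ends directly in "sad" would not be rewritten.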


Page 46: Natural Language Processing >>  Morphology

Finite State Automata (FSA)

The regular expression is more than just a convenient metalanguage for text searching.

• First, a regular expression is one way of describing a finite-state automaton (FSA). Finite-state automata are the theoretical foundation of a good deal of the computational work we will describe and look at in this lecture. Any regular expression can be implemented as a finite-state automaton*. Symmetrically, any finite-state automaton can be described with a regular expression.

• Second, a regular expression is one way of characterizing a particular kind of formal language called a regular language. Both regular expressions and finite-state automata can be used to describe regular languages. The relation among these three theoretical constructions is sketched out in the following figure.

* Except regular expressions that use the memory feature; more on that later.

Page 47: Natural Language Processing >>  Morphology

Finite State Automata (FSA)

[Figure: diagram linking regular expressions, finite automata and regular languages]

The relationship between finite-state automata, regular expressions, and regular languages*

* as suggested by Martin Kay in:
Kay, M. (1987). Nonconcatenative finite-state morphology. In Proceedings of the Third Conference of the European Chapter of the ACL (EACL-87), Copenhagen, Denmark, pp. 2-10. ACL.

Page 48: Natural Language Processing >>  Morphology

Finite State Automata (FSA)

Examples:

• Introduction to finite-state automata for regular expressions
• Mapping from regular expressions to automata

Page 49: Natural Language Processing >>  Morphology

Finite State Automata (FSA)

Using an FSA to recognize sheeptalk

"After a while, with the parrot's help, the Doctor got to learn the language of the animals so well that he could talk to them himself and understand everything they said."

Hugh Lofting, The Story of Doctor Dolittle

Page 50: Natural Language Processing >>  Morphology

Finite State Automata (FSA)

Using an FSA to recognize sheeptalk

Sheep language can be defined as any string from the following (infinite) set:

baa!
baaa!
baaaa!
baaaaa!
baaaaaa!
...

Page 51: Natural Language Processing >>  Morphology

Finite State Automata (FSA)

baa!   baaa!   baaaa!   baaaaa!   baaaaaa!   ...

The regular expression for this kind of sheeptalk is

/baa+!/

All regular expressions can be represented as finite-state automata (FSA):

Page 52: Natural Language Processing >>  Morphology

Finite State Automata (FSA)

A finite-state automaton (FSA) for the regular expression /baa+!/

q0 --b--> q1 --a--> q2 --a--> q3 --!--> q4
q3 --a--> q3 (loop)

q0 = start state, q4 = final state / accepting state

Page 53: Natural Language Processing >>  Morphology

Finite State Automata (FSA)

A tape with cells:

... | a | b | a | ! | b | ...

[Figure: the machine in state q0 reading the tape]

Example of ending in a non-final state = rejection of the input

Page 54: Natural Language Processing >>  Morphology

Finite State Automata (FSA)

The state-transition table for the previous FSA (∅ = null, i.e. no transition; ":" marks the final state)

           Input
State    b    a    !
0        1    ∅    ∅
1        ∅    2    ∅
2        ∅    3    ∅
3        ∅    3    4
4:       ∅    ∅    ∅

Page 55: Natural Language Processing >>  Morphology

Finite State Automata (FSA)

An algorithm for deterministic recognition of FSAs

function D-RECOGNIZE(tape, machine) returns accept or reject
    index <- Beginning of tape
    current-state <- Initial state of machine
    loop
        if End of input has been reached then
            if current-state is an accept state then
                return accept
            else
                return reject
        elsif transition-table[current-state, tape[index]] is empty then
            return reject
        else
            current-state <- transition-table[current-state, tape[index]]
            index <- index + 1
    end
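For readers who want to run the algorithm, here is one possible Python rendering of D-RECOGNIZE for the sheeptalk automaton. It is a sketch, not the lecture's code: the names `TRANSITIONS`, `ACCEPT_STATES` and `d_recognize` are assumptions, the table is stored as nested dictionaries, and a missing entry plays the role of an empty cell.

```python
# State-transition table for /baa+!/: state -> {input symbol: next state}.
TRANSITIONS = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
    4: {},
}
ACCEPT_STATES = {4}


def d_recognize(tape, transitions=TRANSITIONS, start=0, accept=ACCEPT_STATES):
    """Return True (accept) or False (reject) for the input string `tape`."""
    state = start
    for symbol in tape:
        # An empty table cell means the input is rejected immediately.
        if symbol not in transitions[state]:
            return False
        state = transitions[state][symbol]
    # End of input: accept only if the machine stopped in an accept state.
    return state in accept


results = {s: d_recognize(s) for s in ["baa!", "baaaa!", "ba!", "baa", "baa!b"]}
# results -> {'baa!': True, 'baaaa!': True, 'ba!': False, 'baa': False, 'baa!b': False}
```

The `for` loop takes over the role of the index variable in the pseudocode.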

Page 56: Natural Language Processing >>  Morphology

Finite State Automata (FSA)

Tracing the execution of the FSA on some sheeptalk

... | b | a | a | a | ! | ...

[Figure: the state of the machine, starting in q0, shown beneath each tape cell]

Page 57: Natural Language Processing >>  Morphology

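Finite State Automata (FSA)

Regular expressions can be represented as FSAs:

[Figure: the sheeptalk automaton (states q0 to q4, transitions b, a, a, ! and the a-loop) extended with a fail state qf, which is reached on any input symbol that has no transition in the original automaton]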

Page 58: Natural Language Processing >>  Morphology

Finite State Automata (FSA)

[Figure: states q0 to q4 with transitions b, a, a, ! and an a-loop placed so that the next state on input "a" is not uniquely determined]

A non-deterministic finite-state automaton for talking sheep

Page 59: Natural Language Processing >>  Morphology

Finite State Automata (FSA)

[Figure: states q0 to q4 with transitions b, a, a, ! and an additional ε-transition]

A non-deterministic finite-state automaton (NFSA) for the sheep language, having an ε-transition
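Recognition with an ε-transition can be sketched in Python by tracking the set of all states the machine could be in. The figure itself did not survive in this transcript, so the automaton below is an assumption: the ε-arc is taken to run from q3 back to q2. The names `NFSA`, `EPS`, `closure` and `nd_recognize` are illustrative as well.

```python
EPS = ""  # label used in this sketch for ε-arcs (consumes no input)

# Assumed NFSA for the sheep language: state -> {symbol: set of next states}.
NFSA = {
    0: {"b": {1}},
    1: {"a": {2}},
    2: {"a": {3}},
    3: {"!": {4}, EPS: {2}},  # assumption: the ε-arc leads from q3 back to q2
    4: {},
}
ACCEPT = {4}


def closure(states, nfsa):
    """All states reachable from `states` using ε-arcs only."""
    stack, seen = list(states), set(states)
    while stack:
        state = stack.pop()
        for target in nfsa[state].get(EPS, ()):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def nd_recognize(tape, nfsa=NFSA, start=0, accept=ACCEPT):
    """Accept if at least one path through the NFSA consumes the whole tape."""
    current = closure({start}, nfsa)
    for symbol in tape:
        moved = set()
        for state in current:
            moved |= nfsa[state].get(symbol, set())
        current = closure(moved, nfsa)
        if not current:
            return False  # no path can continue
    return bool(current & accept)


results = {s: nd_recognize(s) for s in ["baa!", "baaa!", "ba!", "baa"]}
# results -> {'baa!': True, 'baaa!': True, 'ba!': False, 'baa': False}
```

Following all alternatives in parallel is only one way of handling non-determinism; backtracking over a single path at a time is another.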