Morphology from a computational point of view March 2001.

Today

Minimal Edit Distance, and Viterbi more generally

Letter to Sound

What is morphology?

Finite-state automata

Finite-state phonological rules

1. What is morphology?

Study of the internal structure of words: morph-ology, word-s, jump-ing

Why?

1. For some purposes, we need to know what the internal pieces are.

2. Knowledge of the words of a language can’t be summarized in a finite list: we need to know the principles of word-formation.

Some resources

Richard Sproat: Morphology and Computation (MIT Press, 1992)

Excellent overview of computational morphology and phonology by Harald Trost at

http://www.ai.univie.ac.at/~harald/handbook.html

2. What applications need knowledge of words?

Any high-level linguistic analysis: syntactic parser

machine translation

speech recognition, text-to-speech (TTS)

information retrieval (IR)

dictionary, spell-checker

3. A list is not enough

An empirical fact:

AP newswire: mid-Feb – Dec 30 1988

Nearly 300,000 words.

“New” words that appeared on Dec 31 1988:

compounds: prenatal-care, publicly-funded, channel-switching, owner-president, logic-loving, part-Vulcan, signal-emitting, landsite, government-aligned, armhole...

...new words...

dumbbells, groveled, fuzzier, oxidized

ex-presidency, puppetry, boulderlike, over-emphasized, hydrosulfite, outclassing, non-passengers, racialist, counterprograms, antiprejudice, re-unification, traumatological, refinancings, instrumenting, ex-critters, mega-lizard

ex-presidency: prefix ex-

boulder-like: suffix -like

over-emphasized: prefix over-

antiprejudice: prefix anti-

This is often called the OOV problem (“out of vocabulary”).

If we work out the principles of word-formation, we will simultaneously:

1. compress the size of our internalized list of words;

2. become able to deal with new words on the fly.
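As a rough illustration of how word-formation principles let a short list cover unseen words, the sketch below strips a listed prefix or suffix from an out-of-vocabulary word and checks whether what remains is a known stem. The tiny `STEMS`, `PREFIXES` and `SUFFIXES` lists and the function name `analyze` are assumptions made for this example; they are not drawn from the AP newswire data.

```python
# Hypothetical miniature lexicon; a real system would hold far more entries.
STEMS = {"presidency", "boulder", "emphasized", "prejudice"}
PREFIXES = ["ex-", "over-", "anti"]
SUFFIXES = ["like"]


def analyze(word):
    """Return (prefix, stem, suffix) if the word is a known stem with at most
    one listed prefix and at most one listed suffix; otherwise return None."""
    for prefix in [""] + PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in [""] + SUFFIXES:
            if suffix and not rest.endswith(suffix):
                continue
            stem = rest[:len(rest) - len(suffix)]
            if stem in STEMS:
                return (prefix, stem, suffix)
    return None
```

With four stems and four affixes, the list already covers words such as ex-presidency and boulderlike that were never stored as whole words.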

4. Overview of applications that need knowledge of words

Speech generation: text to speech (TTS)

Speech recognition

Text to speech

Problem: take text, in standard spelling, and produce a sequence of phonemes which can be synthesized by the “backend”.

Severe problems: Proper names (persons, places), OOV words

boathouse B OW1 T H AU2 S

Speech recognition

Take a sound file (e.g., *.wav) and produce a list of words in standard orthography.

Bill Clinton is a recent ex-president.

If someone says it, we need to figure out what the word was.

Do we know what a word is?

This is actually not an easy question! – especially if we turn to Asian languages, without a tradition of putting in “white space” between “words”, as we do in the West.

German writes more compounds without white space than English does.

Basic principles of morphology

For some purposes, we need to think about phonemes, while for others it’s more convenient to talk about letters.

For our purposes, I’ll talk about letters whenever we don’t need to specifically focus on phonemes.

Morpheme

It is convenient to be able to talk about the pieces into which words may be broken, and linguists call these pieces morphemes: the smallest parts of a language that can be regularly assigned a meaning.

Morphemes

Uncontroversial morphemes:

door, dog, jump, -ing, -s, to

More controversial morphemes

sing/sang: s-ng + i/a

cut/cut: cut + PAST

Classic distinctions in morphology:

Analytic (isolating) languages:

– no morphology of derivational or inflectional sort.

Synthetic (inflecting) languages:

– Agglutinative: 1 function per morpheme

– Fusional: > 1 function per morpheme

Agglutinative: Finnish Nominal Declension

talo           'the-house'        kaup-pa          'the-shop'
talo-ni        'my house'         kaup-pa-ni       'my shop'
talo-ssa       'in the-house'     kaup-a-ssa       'in the-shop'
talo-ssa-ni    'in my house'      kaup-a-ssa-ni    'in my shop'
talo-i-ssa     'in the-houses'    kaup-o-i-ssa     'in the-shops'
talo-i-ssa-ni  'in my houses'     kaup-o-i-ssa-ni  'in my shops'

Courtesy of Bucknell Univ. web page

Fusional: Latin

Latin Declension of hortus 'garden'

                          Singular   Plural
Nominative (Subject)      hort-us    hort-i
Genitive (of)             hort-i     hort-orum
Dative (for/to)           hort-o     hort-is
Accusative (Direct Obj)   hort-um    hort-os
Vocative (Call)           hort-e     hort-i
Ablative (from/with)      hort-o     hort-is

Morphemes vs. morphs

Some analysts distinguish between “morphemes” and “morphs”.

Morphemes are motivated by an analysis, and include “plural” and “past”

Morphs are strings of letters or phones that “realize” or “manifest” a morpheme.

Free and bound morphemes

Free morphemes can form (free-standing) words; bound morphemes are only found in combination with other morphemes.

Examples?

Functions of morphology

Derivational morphology: creates one lexeme from another

compute > computer > computerize > computerization

Inflectional morphology: creates the form of a lexeme that’s right for a sentence:

the nominative singular form of a noun; or the past 3rd person singular form of a verb.

Word: an identifiable string of letters (or phonemes), e.g. sing

Word-form: a word with a specific set of syntactic and morphological features. The sing in I sing is 1st person sg, and is a distinct word-form from the sing in you sing.

Lexeme: a complete set of inflectionally related word-forms, including sing, sings, and sang

Lemma: a complete set of morphologically related lexemes: sing, sings, song, sang.

A lexeme’s stem

In many languages (unlike English), constellations of word-forms forming a lexeme demand the recognition of a basic stem which does not stand freely as a word:

Italian ragazzo, ragazza (boy, girl); ragazzi, ragazze (boys, girls)

ragazz-

Compounds

Compounds are composed of 2 (or more) words or stems

Compounds: hot dog, White House, bookstore, cherry-covered

Languages vary in the amount of morphology they have and use

English has a lot of derivational morphology and relatively little inflectional morphology

English verb’s inflectional forms:

bare stem, -s, -ed, -ing

European languages

Not uncommon for a verb to have 30 to 50+ forms:

marking tense, person and number of the subject

Derivation

Derivational morphology usually consists of adding a prefix or suffix to a base (= stem).

The base has a lexical category (it is a noun, verb, adjective), and the suffix typically assigns a different category to the whole word.

[[sad]Adj -ness]Noun

-ness: suffix that takes an adjective, & makes a noun.
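The idea that a suffix takes a base of one category and yields a word of another can be sketched as a small lookup table. Only the -ness entry comes from the slide; the other entries, the name `SUFFIX_RULES`, and the function `derive` are illustrative assumptions.

```python
# Each suffix maps the category it attaches to onto the category it produces.
# Only "ness" is taken from the slide; the rest are illustrative guesses.
SUFFIX_RULES = {
    "ness": ("Adj", "Noun"),
    "er": ("Verb", "Noun"),
    "ize": ("Noun", "Verb"),
    "ation": ("Verb", "Noun"),
}


def derive(base_category, suffixes):
    """Return the category of the word built by adding the suffixes in order.

    Raises ValueError if a suffix is attached to the wrong category."""
    category = base_category
    for suffix in suffixes:
        needs, gives = SUFFIX_RULES[suffix]
        if category != needs:
            raise ValueError(f"-{suffix} needs {needs}, got {category}")
        category = gives
    return category
```

Under these assumed entries, `derive("Adj", ["ness"])` gives `"Noun"` for sad-ness, and the chain compute, computer, computerize, computerization ends as a noun.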

[un- [[interest]Verb -ing]Adj]Adj

Distinct from contractions…

English (and some other languages) permit the collapsing together of common words. In some extremely rare cases, only the collapsed form exists (English possessive ’s).

He will arrive tonight > he’ll arrive…

The [King of England]’s children

Some basics of English morphology

Inflectional morphology

Nouns: -NULL, -s, -’s

Verbs: -NULL, -s, -ed, -ing (so-called weak verbs)

Strong verbs: 3 major groups

a. Internal vowel change (sing/sang, drive/drove/driven, dive/dove)

b. –t suffix, typically with vowel-shortening: dream/dreamt, sleep/slept

c. –aught replacement: catch, teach, seek

Derivational morphology is complex

This morphology creates new words, by adding prefixes or suffixes.

It is helpful to divide them into two groups, depending on whether they leave the pronunciation of the base unchanged or not.

There are, as always, some fuzzy cases.

Level 1

ize, ization, al, ity, ic, ion, y (nominalizing), ate, ous, ive, ation

Can attach to non-word stems (fratern-al, patern-al; parent-al)

Typically change stress and vowel quality of stem

Level 2

Never precede Level 1 suffixes

Never change stress pattern or vowel quality

Almost always attach to words that already exist

hood, ness, ly, s, ing, ish, ful, ize, less, y (adj.)

Combinations of Class 1,2

Class 1 + Class 1: histor-ic-al, illumin-at-ion, indetermin-at-y;

Class 1 + Class 2: fratern-al-ly, transform-at-ion-less;

Class 2 + Class 2: weight-less-ness

?? Class 2 + Class 1: *weight-less-ity, fatal-ism-al
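The ordering restriction (Level 2 suffixes never precede Level 1 suffixes) can be checked mechanically. This is a minimal sketch: the `LEVEL` table holds only a handful of suffixes from the lists above, and -ize is left out because it appears in both lists. The name `respects_level_ordering` is an assumption for the example.

```python
# Level assignments for a few of the suffixes listed above.
LEVEL = {
    "ic": 1, "al": 1, "ity": 1, "ion": 1, "ate": 1,
    "ly": 2, "less": 2, "ness": 2, "hood": 2, "ful": 2,
}


def respects_level_ordering(suffixes):
    """Return True unless some Level 1 suffix follows a Level 2 suffix."""
    seen_level_2 = False
    for suffix in suffixes:
        if LEVEL[suffix] == 2:
            seen_level_2 = True
        elif seen_level_2:
            return False
    return True
```

The check accepts the suffix sequences of histor-ic-al, fratern-al-ly and weight-less-ness, and rejects that of *weight-less-ity.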

Signature

Set of suffixes (or prefixes) that occurs in a corpus with a set of stems.
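One naive way to collect signatures from a word list is sketched below. It assumes a fixed candidate suffix list and keeps only stems seen with at least two suffixes; this is a simplification for illustration, not the procedure that produced the lists in these slides. The function name `find_signatures` and the default suffix list are assumptions.

```python
from collections import defaultdict


def find_signatures(words, suffixes=("", "s", "ed", "ing")):
    """Group stems by the set of candidate suffixes they occur with.

    The empty suffix "" stands for NULL. Returns {signature: sorted stems}."""
    stem_to_suffixes = defaultdict(set)
    for word in set(words):
        for suffix in suffixes:
            # NULL always applies; other suffixes must leave a non-empty stem.
            if suffix == "" or (word.endswith(suffix) and len(word) > len(suffix)):
                stem = word[:len(word) - len(suffix)]
                stem_to_suffixes[stem].add(suffix)

    signatures = defaultdict(list)
    for stem, found in stem_to_suffixes.items():
        if len(found) < 2:
            continue  # a stem with a single suffix tells us nothing
        name = ".".join(s or "NULL" for s in sorted(found))
        signatures[name].append(stem)
    return {name: sorted(stems) for name, stems in signatures.items()}
```

Given look, looks, looked, looking, paper and papers, the sketch groups look under NULL.ed.ing.s and paper under NULL.s.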

NULL.ed.ing.s

look interest add claim mark extend demand remain want succeed record offer represent cover return end explain follow help belong attempt talk fear happen assault account point award appeal train contract result request staff view fail kick visit confront attack comment sponsor

NULL.s

paper retain improvement missile song truth doctor indictment window conductor dick misunderstanding struggle stake tank belief cafeteria material mind operator bassi lot movement chain notion marriage dancer scholarship reservoir sweet right battalion hold mr shot cardinal athletic revenue duel confrontation solo talent guest shoe russian commitment average monk election street roger rifle worker area plane pinch-hitter dozen browning conclusion teacher narcotic appearance alternative dealer producer mile stock shrine sometime bag successor career mistake ankle weapon model front spotlight rhode pace debate payment requirement fairway consultation chip dollar employer thank mustang rocket-bomb hat string precinct robert employee action detective pressure measure spirit forbid hitter breast yankee partner floor member

NULL.d.s

increase tie hole associate reserve price fire receive challenge rate purchase propose feature celebrate decide suite single change sculpture combine privilege pledge issue frame indicate believe damage include use aide graduate surprise intervene practice trouble serve oppose promise charge note schedule continue raise decline cause operate emphasize relieve hope share judge birdie produce exchange

NULL.ed.er.ing.s
report turn walk park pick flow

NULL.d.ment
enforce announce engage arrange replace improve encourage

NULL.n.s
rose low take law drive rise undertake

NULL.al
intern profession logic fat tradition extern margin jurisdiction historic education promotion constitution addition sensation roy ration origin classic convention

NULL.man
sand news police states gross sun fresh sports boss sales 3- patrol bonds

ed.er.ing
slugg manag crush publish robb

NULL.ity.s
major senior moral hospital

NULL.ry
hung mason ave summit scene surge rival forest

NULL.a.s
indian kind american

Finite state morphology

FSA: finite-state automata

Consists of

1. a set of states

2. a starting state

3. a set of final or accepting states

4. a finite set of symbols

5. a set of transitions: each is defined by a from-state, a to-state, and a symbol
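The five components listed above map directly onto a small data structure. The sketch below is one possible encoding, not a reference implementation: the class name `FSA`, its field names, and the example automaton `SHEEP` (which accepts b, two or more a's, then !) are assumptions, as is the final-state name q4. Storing transitions in a dictionary keyed by (from-state, symbol) limits this version to the deterministic case.

```python
from dataclasses import dataclass


@dataclass
class FSA:
    states: set        # 1. a set of states
    start: str         # 2. a starting state
    finals: set        # 3. a set of final (accepting) states
    symbols: set       # 4. a finite set of symbols
    transitions: dict  # 5. (from_state, symbol) -> to_state

    def accepts(self, string):
        """Follow one transition per symbol; accept if we end in a final state."""
        state = self.start
        for symbol in string:
            key = (state, symbol)
            if key not in self.transitions:
                return False
            state = self.transitions[key]
        return state in self.finals


# Hypothetical example: b, then two or more a's, then "!".
SHEEP = FSA(
    states={"q0", "q1", "q2", "q3", "q4"},
    start="q0",
    finals={"q4"},
    symbols={"b", "a", "!"},
    transitions={
        ("q0", "b"): "q1",
        ("q1", "a"): "q2",
        ("q2", "a"): "q3",
        ("q3", "a"): "q3",
        ("q3", "!"): "q4",
    },
)
```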

It’s natural to think of this as describing an annotated directed graph.

[Figure: state diagram; state labels q0, q1, q2, q3; arc labels b, a, a, !, a]

An FSA can be thought of as judging (accepting) a string, or as generating one.

How does it judge? Find a start to finish path that matches the string.

How does it generate? Walk through any start-to-finish path.
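Generation by walking start-to-finish paths can be sketched as a bounded search. The function name `generate`, the `max_length` cutoff (needed because a loop yields infinitely many strings), and the example automaton are assumptions for illustration.

```python
def generate(transitions, start, finals, max_length):
    """Yield every string of at most max_length symbols that is spelled out
    by some path from the start state to a final state."""
    stack = [(start, "")]
    while stack:
        state, prefix = stack.pop()
        if state in finals:
            yield prefix
        if len(prefix) == max_length:
            continue  # do not extend paths beyond the length bound
        for (source, symbol), target in transitions.items():
            if source == state:
                stack.append((target, prefix + symbol))


# Hypothetical automaton: b, then two or more a's, then "!".
SHEEP_TRANSITIONS = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}
```

Calling `sorted(generate(SHEEP_TRANSITIONS, "q0", {"q4"}, 6))` walks every path of up to six symbols and keeps those that end in the final state.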

Deterministic and Non-deterministic FSAs

Just a little difference:

Deterministic case: For every state, there is a maximum of one transition associated with any given symbol. You can say that there’s a function from {states} × {symbols} → {states}.

Nondeterministic case: There is no such restriction; hence, given a state and a symbol, it is not necessarily certain which transition is to be taken.
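A minimal sketch of the difference, assuming arcs are stored as (from-state, symbol, to-state) triples with no duplicates: the determinism test looks for two arcs sharing a state and a symbol, and the recognizer copes with either kind of machine by tracking the set of states it could be in. The names `is_deterministic`, `accepts`, `DET` and `NONDET` are assumptions for the example.

```python
def is_deterministic(arcs):
    """True if no state has two outgoing arcs with the same symbol."""
    seen = set()
    for source, symbol, _target in arcs:
        if (source, symbol) in seen:
            return False
        seen.add((source, symbol))
    return True


def accepts(arcs, start, finals, string):
    """Recognize by tracking every state reachable after each symbol."""
    current = {start}
    for symbol in string:
        current = {t for (s, sym, t) in arcs if s in current and sym == symbol}
    return bool(current & finals)


# Two hypothetical machines for the same language (b, two or more a's, "!").
DET = [("q0", "b", "q1"), ("q1", "a", "q2"), ("q2", "a", "q3"),
       ("q3", "a", "q3"), ("q3", "!", "q4")]
NONDET = [("q0", "b", "q1"), ("q1", "a", "q2"), ("q2", "a", "q2"),
          ("q2", "a", "q3"), ("q3", "!", "q4")]
```

In `NONDET`, state q2 has two arcs labeled a, so after reading an a the recognizer must keep both q2 and q3 open.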

[Figure: state diagram; state labels q0, q1, q2, q3; arc labels b, a, a, !, a]

deterministic…

[Figure: state diagram; state labels q0, q1, q2, q3; arc labels b, a, a, !, a]

non-deterministic

The best things in life are non-deterministic.

Figure 3.4 p. 68

[Figure: state diagram with states q0, q1, q2, q3; arcs labeled un-, adj-root, and -er, -est, -ly]

Alternate notation: (un-) adj-roots (-er | -est | -ly)

Yet a third way: rows in an array (to-column can consist of pointers)

From  To  Output
0     1   un
0     1   NULL
1     2   adj-root-list
2     3   er;est;ly

Stop states: 2,3
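The array above can be used almost as-is to split a word into morphs. In this sketch the rows are (from, to, outputs) tuples, NULL is the empty string, and `ADJ_ROOTS` is a hypothetical three-word stand-in for adj-root-list; the function name `segment` is also an assumption.

```python
ADJ_ROOTS = ["clear", "happy", "real"]  # hypothetical stand-in for adj-root-list

ROWS = [
    (0, 1, ["un"]),
    (0, 1, [""]),                   # NULL: move on without consuming anything
    (1, 2, ADJ_ROOTS),
    (2, 3, ["er", "est", "ly"]),
]
STOP_STATES = {2, 3}


def segment(word, state=0):
    """Return a list of morphs spelling `word` along a path that ends in a
    stop state, or None if there is no such path."""
    if word == "" and state in STOP_STATES:
        return []
    for source, target, outputs in ROWS:
        if source != state:
            continue
        for morph in outputs:
            if word.startswith(morph):
                rest = segment(word[len(morph):], target)
                if rest is not None:
                    return ([morph] if morph else []) + rest
    return None
```

Because the table works on morphs, it would also accept happy + er spelled as happyer; spelling changes are a separate matter.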

Figure 3.5 p. 69

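[Figure: state diagram with states q0 to q5; arcs labeled un-, adj-root-1, adj-root-2, -er -est -ly (after adj-root-1), and -er -est (after adj-root-2)]

Alternate notation:

(un-) adj-roots-1 (-er | -est | -ly)

adj-roots-2 (-er | -est)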

Yet a third way: rows in an array (to-column can consist of pointers)

From  To  Output
0     1   un
0     3   NULL
1     2   adj-root-list-1
2     5   er;est;ly
3     2   adj-root-list-1
3     4   adj-root-list-2
4     5   er;est

Stop states: 2,4,5

From  To  Output
0     1   noun 1
1     2   ize
2     3   ation
2     4   er
2     5   able
0     5   -al “adj”
0     1   adj-al
5     6   ity;ness
0     7   verbs1
7     8   ive “adj”
8     6   ness
8     9   ly
0     10  verb2
10    8   ative
0     11  noun2
11    8   ful

Figure 3.6, p. 70

[Figure: simplified diagram with nodes for nouns, verbs, adjectives and adverbs; arcs labeled ize, ation, er, able, ative, ive, ful, ity, ness, ly]

We can simplify greatly (generating a bit more…)

Finite-State Transducers (FST)

The symbols of the FST are complex: they’re really pairs of symbols, one for each of two “tapes” or levels.

Recognizer: decides if a given pair of representations fits together “OK”

Generator: generates pairs of representations that fit together

Translator: takes a representation on one level and produces the appropriate representation on the other level

Finite state transducers can be inverted, or composed, and you get another FST.

Complex symbols

Usually of the form a:b, which means a can appear on the upper tape when b appears on the lower tape.

So a:b means that’s a permissible pairing up of symbols.

“a” alone means a:a, etc.

epsilon means null character.

Remember, “other” means “any feasible pair that is not in this transducer” (p. 78)
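To make the a:b pairing concrete, the sketch below stores each arc as (from-state, upper, lower, to-state), with the empty string standing for epsilon, and uses the machine as a translator from the upper tape to the lower tape. Inverting it is just swapping the two members of every pair. The toy machine `ARCS`, which pairs cat^s with cats, and the names `translate` and `invert` are assumptions; the search assumes the machine has no cycle of epsilon-upper arcs.

```python
EPS = ""  # epsilon: the null character


def translate(arcs, start, finals, upper):
    """Return the set of lower-tape strings that the transducer pairs with
    `upper`. Assumes there is no cycle of arcs whose upper symbol is EPS."""
    results = set()
    stack = [(start, 0, "")]
    while stack:
        state, pos, lower = stack.pop()
        if pos == len(upper) and state in finals:
            results.add(lower)
        for source, up, low, target in arcs:
            if source != state:
                continue
            if up == EPS:
                stack.append((target, pos, lower + low))
            elif pos < len(upper) and upper[pos] == up:
                stack.append((target, pos + 1, lower + low))
    return results


def invert(arcs):
    """Swap the upper and lower symbol of every pair."""
    return [(s, low, up, t) for (s, up, low, t) in arcs]


# Hypothetical toy transducer: c:c a:a t:t, then ^:epsilon, then s:s.
ARCS = [
    (0, "c", "c", 0), (0, "a", "a", 0), (0, "t", "t", 0),
    (0, "^", EPS, 1),
    (1, "s", "s", 2),
]
FINALS = {0, 2}
```

Running the inverted machine on cats recovers cat^s, which is the sense in which an inverted FST is again an FST. Composition is not shown here.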

Using FSTs for orthographic rules

ε → e / {x, s, z} ^ __ s #

[Figure: transducer with states q0 to q5, where Z! = z, s, x; arcs labeled Z!, ^:ε, ε:e, s, z,x, #, and other]

fox^s#…

we get to q1 with ‘x’

we get to q2 with ‘^’

we can get to q3 with ‘NULL’

we also get to q5 with ‘s’, but we don’t want to!

So why is this transition there? ?friend^ship, ?fox^s^s (= foxes’s)

q4 with s

q0 with # (accepting state)

Other transitions…

arizona: we leave q0 but return

m i s s ^ s
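The effect of the e-insertion rule on an intermediate form such as fox^s# can be imitated with a regular-expression substitution. This swaps in a regex for the transducer; it is not an implementation of the six-state machine walked through above. The function name `surface_form` is an assumption, with ^ as the morpheme boundary and # as the word boundary.

```python
import re


def surface_form(intermediate):
    """Apply e-insertion, then delete the remaining boundary symbols.

    Replacing the boundary '^' by 'e' has the same effect as inserting e
    and deleting '^' when x, s or z precedes and 's#' follows."""
    with_e = re.sub(r"(?<=[xsz])\^(?=s#)", "e", intermediate)
    return with_e.replace("^", "").replace("#", "")
```

Under this sketch, `surface_form("fox^s#")` gives foxes and `surface_form("miss^s#")` gives misses, while cat^s# and arizona# pass through with only the boundary symbols removed.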