Lecture 3, 7/27/2005: Natural Language Processing

CS60057 Speech & Natural Language Processing
Autumn 2005

Lecture 4
28 July 2005


Morphology

Study of the rules that govern the combination of morphemes.

Inflection: same word, different syntactic information
  run/runs/running, book/books

Derivation: new word, different meaning; often a different part of speech, but not always
  possible/possibly/impossible, happy/happiness

Compounding: new word, each part is a word
  blackboard, bookshelf; lAlakamala, banabAsa


Morphology Level: The Mapping

Formally: A+ → 2^(L × C1 × C2 × ... × Cn)

A is the alphabet of phonemes (A+ denotes any non-empty sequence of phonemes)

L is the set of possible lemmas, uniquely identified.

Ci are morphological categories, such as:
  grammatical number, gender, case
  person, tense, negation, degree of comparison, voice, aspect, ...
  tone, politeness, ...
  part of speech (not quite a morphological category, but ...)

A, L and Ci are obviously language-dependent
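The mapping above can be made concrete with a minimal Python sketch. The tiny lexicon, category names, and values below are illustrative, not from the lecture: each surface form maps to a set of (lemma, categories) analyses.

```python
# A minimal sketch of the morphological mapping A+ -> 2^(L x C1 x ... x Cn):
# each surface form (a string over the phoneme/letter alphabet) maps to
# zero or more analyses, each pairing a lemma with category values.
ANALYSES = {
    "runs":  [("run", {"pos": "V", "person": 3, "number": "sg", "tense": "pres"}),
              ("run", {"pos": "N", "number": "pl"})],
    "books": [("book", {"pos": "N", "number": "pl"}),
              ("book", {"pos": "V", "person": 3, "number": "sg", "tense": "pres"})],
}

def analyze(surface):
    """Return all (lemma, categories) analyses of a surface form."""
    return ANALYSES.get(surface, [])

print(analyze("runs"))
```

Note that the mapping is inherently ambiguous: "runs" has both a verbal and a nominal analysis, which is why the range of the mapping is a *set* of lemma-category tuples.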


Bengali/Hindi Inflectional Morphology

Certain languages encode more grammatical information in morphology than in syntax.

Some of the inflectional suffixes that nouns can take:
  singular/plural markers
  gender
  possessive markers
  case markers (the different karakas)

Inflectional suffixes that verbs can take:
  Hindi: tense, aspect, modality, person, gender, number
  Bengali: tense, aspect, modality, person

Order among inflectional suffixes (morphotactics): Chhelederke, baigulokei


Bengali/Hindi Derivational Morphology

Derivational morphology is very rich.


English Inflectional Morphology

Nouns have simple inflectional morphology:
  plural -- cat / cats
  possessive -- John / John's

Verbs have slightly more complex, but still relatively simple, inflectional morphology:
  past form -- walk / walked
  past participle form -- walk / walked
  gerund -- walk / walking
  singular third person -- walk / walks

Verbs can be categorized as:
  main verbs
  modal verbs -- can, will, should
  primary verbs -- be, have, do

Regular and irregular verbs: walk / walked vs. go / went


Regulars and Irregulars

Some words misbehave (refuse to follow the rules):

Mouse/mice, goose/geese, ox/oxen Go/went, fly/flew

The terms regular and irregular will be used to refer to words that follow the rules and those that don’t.


Regular and Irregular Verbs

Regulars:
  walk, walks, walking, walked, walked

Irregulars:
  eat, eats, eating, ate, eaten
  catch, catches, catching, caught, caught
  cut, cuts, cutting, cut, cut


Derivational Morphology

Quasi-systematicity
Irregular meaning change
Changes of word class

Some English derivational affixes:
  -ation : transport / transportation
  -er : kill / killer
  -ness : fuzzy / fuzziness
  -al : computation / computational
  -able : break / breakable
  -less : help / helpless
  un- : do / undo
  re- : try / retry

renationalizationability


Derivational Examples

Verb/Adj to Noun:
  -ation : computerize / computerization
  -ee : appoint / appointee
  -er : kill / killer
  -ness : fuzzy / fuzziness


Derivational Examples

Noun/Verb to Adj:
  -al : computation / computational
  -able : embrace / embraceable
  -less : clue / clueless


Compute

Many paths are possible… Start with compute

Computer -> computerize -> computerization
Computation -> computational
Computer -> computerize -> computerizable
Compute -> computee


Parts of A Morphological Processor

For a morphological processor, we need at least the following:

Lexicon: the list of stems and affixes together with basic information about them, such as their main categories (noun, verb, adjective, ...) and their sub-categories (regular noun, irregular noun, ...).

Morphotactics: the model of morpheme ordering that explains which classes of morphemes can follow other classes of morphemes inside a word.

Orthographic rules (spelling rules): used to model the changes that occur in a word, normally when two morphemes combine.


Lexicon

A lexicon is a repository for words (stems).

Words are grouped according to their main categories:
  noun, verb, adjective, adverb, ...
They may also be divided into sub-categories:
  regular nouns, irregular-singular nouns, irregular-plural nouns, ...

The simplest way to create a morphological parser is to put all possible words (together with their inflections) into the lexicon. We do not do this because the number of forms is huge (for Turkish, it is theoretically infinite).


Morphotactics

Which morphemes can follow which morphemes.

Lexicon:

  reg-noun   irreg-pl-noun   irreg-sg-noun   plural
  fox        geese           goose           -s
  cat        sheep           sheep
  dog        mice            mouse

Simple English nominal inflection (morphotactic rules), as a three-state FSA:

  state 0 --reg-noun-->       state 1 --plural (-s)--> state 2
  state 0 --irreg-sg-noun-->  state 1
  state 0 --irreg-pl-noun-->  state 2


Combine Lexicon and Morphotactics

[Letter-level FSA: the stems fox, cat, dog, sheep, goose/geese, and mouse/mice are spelled out letter by letter along the arcs, with a final -s arc for the regular plural.]

This only says yes or no; it does not give a lexical representation. It also accepts a wrong word (foxs).
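The recognizer just described can be sketched in Python. This is a word-level approximation of the letter-level FSA, using the slide's small lexicon; it returns only yes/no and, with no spelling rules, wrongly accepts "foxs".

```python
# A word-level sketch of the combined lexicon + morphotactics machine:
# state 0 reads a noun stem; a regular stem may then take the plural -s.
REG_NOUNS = {"fox", "cat", "dog"}
IRREG_SG = {"goose", "sheep", "mouse"}
IRREG_PL = {"geese", "sheep", "mice"}

def accepts(word):
    """Recognize a noun form: bare stem, irregular form, or stem + -s."""
    if word in REG_NOUNS or word in IRREG_SG or word in IRREG_PL:
        return True
    return word.endswith("s") and word[:-1] in REG_NOUNS

print(accepts("cats"))   # True
print(accepts("geese"))  # True
print(accepts("foxs"))   # True -- the slide's point: the naive machine
                         # accepts the misspelling
```

Notice that the same machine rejects the correct form "foxes", since "foxe" is not a listed stem; fixing both problems is exactly the job of the orthographic rules introduced later.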


FSAs and the Lexicon

This will actually require a new kind of FSA: the finite-state transducer (FST).

We will give a quick overview.

First we'll capture the morphotactics: the rules governing the ordering of affixes in a language.

Then we’ll add in the actual words


Simple Rules


Adding the Words


Derivational Rules


Parsing/Generation vs. Recognition

Recognition is usually not quite what we need. Usually, if we find some string in the language, we need to find the structure in it (parsing). Or we have some structure and we want to produce a surface form (production/generation).

Example: from "cats" to "cat +N +PL"


Why care about morphology?

`Stemming' in information retrieval: we might want to search for "aardvark" and find pages with both "aardvark" and "aardvarks".

Morphology in machine translation: we need to know that the Spanish words quiero and quieres are both related to querer `want'.

Morphology in spell checking: we need to know that misclam and antiundoggingly are not words despite being made up of word parts.


Can’t just list all words

Turkish word: Uygarlastiramadiklarimizdanmissinizcasina `(behaving) as if you are among those whom we could not civilize'

Uygar `civilized’ + las `become’ + tir `cause’ + ama `not able’ + dik `past’ + lar ‘plural’+ imiz ‘p1pl’ + dan ‘abl’ + mis ‘past’ + siniz ‘2pl’ + casina ‘as if’


Finite State Transducers

The simple story:
  Add another tape
  Add extra symbols to the transitions

On one tape we read "cats"; on the other we write "cat +N +PL".


Transitions

c:c means read a c on one tape and write a c on the other.
+N:ε means read the +N symbol on one tape and write nothing on the other.
+PL:s means read +PL and write an s.

c:c a:a t:t +N:ε +PL:s
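These transitions can be sketched as a single hard-coded path in Python, run in the generation direction (lexical to surface); the FST inversion property would give the parsing direction by swapping the two sides of each pair.

```python
# The cat +N +PL <-> cats transitions as (lexical, surface) symbol pairs.
# An empty output string plays the role of epsilon.
TRANSITIONS = [("c", "c"), ("a", "a"), ("t", "t"), ("+N", ""), ("+PL", "s")]

def generate(lexical_symbols):
    """Follow the single path from the lexical tape to the surface tape."""
    out = []
    for sym, (inp, outp) in zip(lexical_symbols, TRANSITIONS):
        if sym != inp:
            raise ValueError(f"no transition for {sym}")
        out.append(outp)
    return "".join(out)

print(generate(["c", "a", "t", "+N", "+PL"]))  # -> cats
```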


Lexical to Intermediate Level


FST Properties

FSTs are closed under union, inversion, and composition.

union: the union of two regular relations is also a regular relation.

inversion: the inversion of an FST simply switches the input and output labels. This means that the same FST can be used for both directions of a morphological processor.

composition: if T1 is an FST from I1 to O1 and T2 is an FST from O1 to O2, then the composition T1 ∘ T2 maps from I1 to O2.

We use these properties of FSTs in the creation of the FST for a morphological processor.


A FST for Simple English Nominals

  reg-noun      --> +N:ε --> +SG:#  or  +PL:^s#
  irreg-sg-noun --> +N:ε --> +SG:#
  irreg-pl-noun --> +N:ε --> +PL:#


FST for stems

A FST for stems, which maps roots to their root classes:

  reg-noun   irreg-pl-noun      irreg-sg-noun
  fox        g o:e o:e s e      goose
  cat        sheep              sheep
  dog        m o:i u:ε s:c e    mouse

Here fox stands for f:f o:o x:x.

When these two transducers are composed, we have a FST which maps lexical forms to intermediate forms of words for simple English noun inflection.

The next thing we should handle is to design the FSTs for the orthographic rules, and then combine all these transducers.


Multi-Level Multi-Tape Machines

A frequently used FST idiom, called a cascade, is to have the output of one FST read in as the input to a subsequent machine.

So, to handle spelling we use three tapes: lexical, intermediate, and surface.

We need one transducer to work between the lexical and intermediate levels, and a second (a bunch of FSTs) to work between the intermediate and surface levels to patch up the spelling.

  lexical:       d o g +N +PL
  intermediate:  d o g ^ s #
  surface:       d o g s


Lexical to Intermediate FST


Orthographic Rules

We need FSTs to map the intermediate level to the surface level. For each spelling rule we have a FST, and these FSTs run in parallel.

Some English spelling rules:
  consonant doubling -- a 1-letter consonant is doubled before ing/ed -- beg/begging
  E deletion -- silent e dropped before ing and ed -- make/making
  E insertion -- e added after s, z, x, ch, sh before s -- watch/watches
  Y replacement -- y changes to ie before s, and to i before ed -- try/tries
  K insertion -- for verbs ending in vowel + c, we add k -- panic/panicked

We represent these rules using two-level morphology rules:
  a => b / c __ d   (rewrite a as b when it occurs between c and d)
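The spelling rules above can be approximated with ordinary rewrite rules. A Python sketch using regular expressions; the contexts are simplified for illustration (e.g. the consonant-doubling rule ignores the stress condition, so it would over-apply to words like visit).

```python
import re

def add_suffix(stem, suffix):
    """Attach a suffix, applying simplified versions of the spelling rules."""
    if suffix == "s":
        if re.search(r"(s|z|x|ch|sh)$", stem):
            return stem + "es"               # E insertion: watch -> watches
        if re.search(r"[^aeiou]y$", stem):
            return stem[:-1] + "ies"         # Y replacement: try -> tries
    if suffix in ("ing", "ed"):
        if suffix == "ed" and re.search(r"[^aeiou]y$", stem):
            return stem[:-1] + "ied"         # Y replacement: try -> tried
        if re.search(r"[aeiou]c$", stem):
            return stem + "k" + suffix       # K insertion: panic -> panicked
        if re.search(r"[^aeiou][aeiou][^aeiouwxy]$", stem):
            return stem + stem[-1] + suffix  # consonant doubling: beg -> begging
        if stem.endswith("e"):
            return stem[:-1] + suffix        # E deletion: make -> making
    return stem + suffix

print(add_suffix("watch", "s"), add_suffix("try", "s"),
      add_suffix("make", "ing"), add_suffix("panic", "ed"),
      add_suffix("beg", "ing"), add_suffix("cat", "s"))
```

Note that the rules must be ordered: K insertion is checked before consonant doubling, since panic also ends in a consonant-vowel-consonant pattern.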


FST for E-Insertion Rule

E-insertion rule: є => e / {x,s,z}^ __ s#
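A sketch of this rule as an intermediate-to-surface string mapping, with ^ as the morpheme boundary and # as the word-end marker (a regex stand-in for the actual rule FST).

```python
import re

# E-insertion: epsilon => e / {x,s,z} ^ __ s#
# After applying the rule, the boundary symbols are erased to give the
# surface form.
def e_insertion(intermediate):
    surface = re.sub(r"([xsz])\^(s#)", r"\1e\2", intermediate)
    return surface.replace("^", "").replace("#", "")

print(e_insertion("fox^s#"))  # foxes
print(e_insertion("cat^s#"))  # cats
```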


Generating or Parsing with FST Lexicon and Rules


Accepting Foxes


Intersection

We can intersect all rule FSTs to create a single FST.

The intersection algorithm just takes the Cartesian product of states: for each state qi of the first machine and qj of the second machine, we create a new state qij. For an input symbol a, if the first machine would transition to state qn and the second machine would transition to qm, the new machine transitions to qnm.
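The product construction can be sketched in Python. The two machines below are illustrative toys, not the rule FSTs: M1 accepts strings with an even number of a's, M2 accepts strings ending in b, and their intersection enforces both conditions at once.

```python
from itertools import product

def intersect(d1, d2, start1, start2, finals1, finals2):
    """Product construction: state (qi, qj) steps on symbol a to
    (d1[qi, a], d2[qj, a]); a product state is final iff both parts are."""
    states1 = {q for q, _ in d1} | set(d1.values())
    states2 = {q for q, _ in d2} | set(d2.values())
    alphabet = {a for _, a in d1} & {a for _, a in d2}
    delta = {}
    for qi, qj in product(states1, states2):
        for a in alphabet:
            if (qi, a) in d1 and (qj, a) in d2:
                delta[((qi, qj), a)] = (d1[(qi, a)], d2[(qj, a)])
    return delta, (start1, start2), set(product(finals1, finals2))

def run(delta, start, finals, word):
    q = start
    for a in word:
        q = delta[(q, a)]
    return q in finals

# M1: even number of a's (final state 0); M2: ends in b (final state 1)
d1 = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}
d2 = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}

delta, start, finals = intersect(d1, d2, 0, 0, {0}, {1})
print(run(delta, start, finals, "aab"))  # True: two a's, ends in b
print(run(delta, start, finals, "ab"))   # False: odd number of a's
```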


Composition

A cascade can turn out to be somewhat of a pain:
  it is hard to manage all the tapes
  it fails to take advantage of the restricting power of the machines

So, it is better to compile the cascade into a single large machine.

Create a new state (x,y) for every pair of states x ∈ Q1 and y ∈ Q2. The transition function of the composition is defined as follows:

  δ((x,y), i:o) = (v,z) if there exists c such that δ1(x, i:c) = v and δ2(y, c:o) = z
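This definition can be sketched directly in Python for single-state toy transducers (the machines are illustrative, not the lexicon or rule FSTs): T1 uppercases vowels, T2 rewrites 'A' as '@', and their composition applies both in one pass, matching T1's output symbol c against T2's input symbol.

```python
# Transducers are dicts: (state, in_sym) -> (out_sym, next_state).
def compose(t1, t2):
    """Composed transition (x,y) --i:o--> (v,z) exists whenever
    T1 has x --i:c--> v and T2 has y --c:o--> z for some symbol c."""
    composed = {}
    for (x, i), (c, v) in t1.items():
        for (y, c2), (o, z) in t2.items():
            if c == c2:
                composed[((x, y), i)] = (o, (v, z))
    return composed

def transduce(t, start, word):
    q, out = start, []
    for i in word:
        o, q = t[(q, i)]
        out.append(o)
    return "".join(out)

# T1 uppercases vowels; T2 replaces 'A' with '@' (single-state machines).
t1 = {(0, ch): (ch.upper() if ch in "aeiou" else ch, 0)
      for ch in "abcdefghijklmnopqrstuvwxyz"}
t2 = {(0, ch): ("@" if ch == "A" else ch, 0)
      for ch in "abcdefghijklmnopqrstuvwxyzAEIOU"}

t = compose(t1, t2)
print(transduce(t, (0, 0), "cat"))  # c@t
```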


Intersect Rule FSTs

  lexical tape
      |  LEXICON-FST
  intermediate tape
      |  FST1 ... FSTn  (in parallel)
  surface tape

  => FSTR = FST1 ∧ ... ∧ FSTn


Compose Lexicon and Rule FSTs

  lexical tape
      |  LEXICON-FST
  intermediate tape
      |  FSTR = FST1 ∧ ... ∧ FSTn
  surface tape

  => LEXICON-FST ∘ FSTR  (maps the lexical tape directly to the surface tape)


Porter Stemming

Some applications (e.g., some information retrieval applications) do not need the whole morphological processor; they only need the stem of the word.

A stemming algorithm (such as the Porter stemming algorithm) is a lexicon-free FST: it is just a cascade of rewrite rules.

Stemming algorithms are efficient, but they may introduce errors because they do not use a lexicon.
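A toy sketch of the idea in Python: a handful of ordered suffix rewrite rules applied without any lexicon (a small subset in the spirit of the Porter algorithm, not the real thing). The last example shows exactly the kind of error a lexicon-free stemmer can make.

```python
import re

# Ordered cascade of suffix rewrite rules; the first matching rule fires.
RULES = [
    (r"sses$", "ss"),      # caresses -> caress
    (r"ies$", "i"),        # ponies -> poni
    (r"ational$", "ate"),  # relational -> relate
    (r"ing$", ""),         # hopping -> hopp (no lexicon to repair it)
    (r"s$", ""),           # cats -> cat
]

def stem(word):
    for pattern, repl in RULES:
        if re.search(pattern, word):
            return re.sub(pattern, repl, word)
    return word

print(stem("caresses"))    # caress
print(stem("relational"))  # relate
print(stem("hopping"))     # hopp -- a lexicon-free stemming error
```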