Frank Reichartz, Gerhard Paaß

Knowledge Discovery, Fraunhofer IAIS, St. Augustin, Germany

Estimating Supersenses with Conditional Random Fields

© Fraunhofer IAIS. Gerhard Paaß, 19 Sept 08, HLIE Workshop at ECML/PKDD’08

Agenda

1. Introduction

2. Models for Supersenses

3. Conditional Random Fields

4. Lumped Observations

5. Summary

Use Case Contentus

Digitize a Multimedia Collection of German National Library

• Music of former GDR

• Digitize

• Quality control

• Meta data collection

• Semantic indexing

• Semantic search engine

Target

• Provide content: text, score sheets, video, images, speech

• Generate meta data: composers, premiere, director, artists, …

• Extract entities: dates, places, relations, composers, pieces of music, …

• Assign meanings to words and phrases: use ontology

WordNet as Ontology

• WordNet is a fine-grained word sense hierarchy

• The same word may have different senses: bank = financial institution; bank = river boundary

• Defines senses (synsets) for

• Verbs

• Common & proper nouns

• Adjectives

• Adverbs

Target: assign each word to a synset

• Easy semantic indexing & retrieval

Fine-Grained Word Senses

• Example: senses of the noun "blow"

Very subtle differences between senses

Hierarchy of Hypernyms

• Supersense level

• Fewer distinctions

• Retains main differences

Target: assign verbs / nouns to a supersense

List of Supersenses

• 15 verb supersenses

• 26 noun supersenses

Supersenses discriminate between many synsets

Noun blow:

• 7 synsets

• 5 supersenses

Verb blow:

• 22 synsets

• 9 supersenses

Sufficient for coarse disambiguation

Training Data: SemCor Dataset

Input                         Output
Word        Lemma       POS   Synset    Supersense
A           A           DT
compromise  compromise  NN    1190419   noun.communication
will        will        MD
leave       leave       VB    2610151   verb.stative
both        both        DT
sides       side        NN    8294366   noun.group
without     without     IN
the         the         DT
glow        glow        NN    13864852  noun.state
of          of          IN
triumph     triumph     NN    7425691   noun.feeling
,           ,           PUNC
but         but         CC
it          it          PRP
will        will        MD
save        save        VB    2526596   verb.social
Berlin      location    NNP   26074     noun.location
.           .           PUNC
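The row layout of the training data can be read into a small record type. The following is a minimal sketch, assuming whitespace-separated fields in the order word, lemma, POS tag, and, for annotated tokens only, synset number and supersense. The names `Token` and `parse_line` and the placeholder label "0" for unannotated tokens are illustrative assumptions, not part of the talk.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    word: str
    lemma: str
    pos: str
    synset: Optional[int]       # synset number, present only for annotated tokens
    supersense: Optional[str]   # e.g. "noun.group", present only for annotated tokens


def parse_line(line: str) -> Token:
    """Parse 'word lemma POS [synset supersense]' into a Token."""
    fields = line.split()
    if len(fields) == 3:
        word, lemma, pos = fields
        return Token(word, lemma, pos, None, None)
    if len(fields) == 5:
        word, lemma, pos, synset, supersense = fields
        return Token(word, lemma, pos, int(synset), supersense)
    raise ValueError(f"unexpected number of fields: {line!r}")


sentence = [parse_line(line) for line in [
    "sides side NN 8294366 noun.group",
    "without without IN",
    "Berlin location NNP 26074 noun.location",
]]

# Output side of the model: one label per token ("0" is an assumed placeholder).
labels = [tok.supersense or "0" for tok in sentence]
```

The input columns (word, lemma, POS) feed the feature functions, while the supersense column is the label sequence the model has to predict.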

Prior Work: Classifier

• Bag-of-words is not sufficient: encode relative positions

• Use classifiers

• MaxEnt

• SVM

• Naive Bayes

• kNN

• Proc. SemEval 2007: Coarse-Grained English All-Words Task

Prior Work: Sequence Modelling

Ciaramita & Altun 06: Use perceptron-trained HMM

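$(x_j, y_j)$ is the $j$-th word / feature and $j$-th label in a sequence

$F(x, y; w) = \langle w, \Phi(x, y) \rangle$

$\Phi_i(x, y) = \sum_{j=1}^{|y|} \phi_i(y_{j-1}, y_j, x), \quad i = 1, \ldots, d$: feature counts for a sequence $(y, x)$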

• Maximize predictive performance on training set

• Ignore ambiguity: use only most frequent sense

Deschacht & Moens 06: Use Conditional Random Field

• Exploit hierarchy to model many classes

• Apply to fine grained word sense: good results

Definition of Conditional Random Fields

observed words / features X1,…,Xn

states Y1,…,Yn

each state Yt may be influenced by many of the X1,…,Xn features

Definition: Let $G=(V,E)$ be a graph such that $Y=(Y_t)_{t \in V}$, so that $Y$ is indexed by the vertices of $G$. Then $(X,Y)$ is a conditional random field in case, when conditioned on $X$, the random variables $Y_t$ obey the Markov property with respect to the graph:

$p(Y_t \mid Y_1, \ldots, Y_{t-1}, Y_{t+1}, \ldots, Y_n, X) = p(Y_t \mid Y_w, w \in \mathrm{neigh}(Y_t), X)$

Hammersley-Clifford theorem: the probability can be written as a product of potential functions

Variable sets are cliques in the neighborhood set

Simplification: Sequential Chain

observed words / features X1,…,Xn

states Y1,…,Yn

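$p(Y_1, \ldots, Y_n \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{t=1}^{n} \left[ \sum_{k=1}^{N_1} \lambda_k f_k(Y_{t-1}, Y_t, X) + \sum_{k=1}^{N_2} \mu_k g_k(Y_t, X) \right] \right)$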

features may involve two consecutive states and all observed words

examples

feature has value 1 if Yt-1="other" and Yt=location and Xt has pos tag "proper name" and (Xt-2,Xt-1)="arrived in". Otherwise the value is 0

estimated POS tags, noun phrase tags, weekday, amounts, etc.

prefixes, suffixes; matching regular expressions for capitalization, etc.

information from lexica, lists of proper names

[Lafferty, McCallum, Pereira 01]
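To make the chain model concrete, the sketch below scores label sequences with one transition feature and one state feature and normalizes with a forward recursion. It is a toy under stated assumptions: the two labels, both feature functions (the first is a simplified version of the "arrived in" example above), and the weights `LAMBDA` and `MU` are hypothetical and chosen only for illustration.

```python
import itertools
import math

LABELS = ["other", "location"]
LAMBDA, MU = 1.5, 0.8  # hypothetical feature weights


def f_transition(y_prev, y, x, t):
    # 1 if previous label is "other", current label is "location",
    # and the two preceding words are "arrived in"; otherwise 0.
    fires = (y_prev == "other" and y == "location"
             and t >= 2 and (x[t - 2], x[t - 1]) == ("arrived", "in"))
    return 1.0 if fires else 0.0


def g_state(y, x, t):
    # 1 if a capitalized word is labeled "location"; otherwise 0.
    return 1.0 if (y == "location" and x[t][:1].isupper()) else 0.0


def score(y, x):
    """Unnormalized log-score: weighted features summed over all positions."""
    total = 0.0
    for t in range(len(x)):
        y_prev = y[t - 1] if t > 0 else "<start>"
        total += LAMBDA * f_transition(y_prev, y[t], x, t) + MU * g_state(y[t], x, t)
    return total


def log_z(x):
    """log Z(X), computed with the forward recursion over consecutive labels."""
    alpha = {"<start>": 0.0}
    for t in range(len(x)):
        alpha = {
            y: math.log(sum(
                math.exp(a + LAMBDA * f_transition(yp, y, x, t) + MU * g_state(y, x, t))
                for yp, a in alpha.items()))
            for y in LABELS
        }
    return math.log(sum(math.exp(a) for a in alpha.values()))


def prob(y, x):
    return math.exp(score(y, x) - log_z(x))


x = ["He", "arrived", "in", "Berlin"]
# Sanity check: probabilities over all label sequences sum to one.
total = sum(prob(list(y), x) for y in itertools.product(LABELS, repeat=len(x)))
```

The forward recursion avoids enumerating all label sequences; the brute-force sum at the end is only there to check the normalization on this tiny example.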

Derivative of Likelihood

estimate the optimal parameter vector for the training set (Xd,Yd), d=1,…,D

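$\frac{\partial L}{\partial \lambda_k} = \sum_{d=1}^{D} f_k(Y_d, X_d) - \sum_{d=1}^{D} \sum_{Y} p(Y \mid X_d) \, f_k(Y, X_d)$

first term: observed feature value; second term: expected feature value

Here $f_k(Y, X)$ denotes feature $k$ summed over all positions $t$ of the sequence.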

how can we calculate the expected feature values?

need for every document d and state Yd,t the probability p(Ydt=i|Xd)

need for every d and states Yd,t , Yd,t+1 the probability p(Yd,t=i,Yd,t+1=j|Xd)

use forward - backward algorithm as for the hidden Markov model
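The forward-backward computation of these marginals can be sketched as follows. This is a minimal pure-Python version under simplifying assumptions: the potentials are passed in as log-values, `node[t][i]` for state `i` at position `t` and a position-independent `edge[i][j]` for the transition from `i` to `j`. These names and the example numbers are hypothetical.

```python
import math


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_backward(node, edge):
    """Return (unary, pairwise, log_z) for a linear chain.

    unary[t][i]       = p(Y_t = i | X)
    pairwise[t][i][j] = p(Y_t = i, Y_{t+1} = j | X)
    """
    n, k = len(node), len(node[0])
    # Forward pass: alpha[t][j] sums over all prefixes ending in state j.
    alpha = [list(node[0])]
    for t in range(1, n):
        alpha.append([
            node[t][j] + logsumexp([alpha[t - 1][i] + edge[i][j] for i in range(k)])
            for j in range(k)])
    # Backward pass: beta[t][i] sums over all suffixes following state i.
    beta = [[0.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        beta[t] = [
            logsumexp([edge[i][j] + node[t + 1][j] + beta[t + 1][j] for j in range(k)])
            for i in range(k)]
    log_z = logsumexp(alpha[-1])
    unary = [[math.exp(alpha[t][i] + beta[t][i] - log_z) for i in range(k)]
             for t in range(n)]
    pairwise = [[[math.exp(alpha[t][i] + edge[i][j] + node[t + 1][j]
                           + beta[t + 1][j] - log_z)
                  for j in range(k)] for i in range(k)] for t in range(n - 1)]
    return unary, pairwise, log_z


node = [[0.2, 1.0], [0.5, 0.1], [1.2, 0.3]]  # hypothetical log state potentials
edge = [[0.4, -0.3], [-0.6, 0.9]]            # hypothetical log transition potentials
unary, pairwise, log_z = forward_backward(node, edge)
```

The expected value of a feature is then a sum of its per-position values weighted by these unary or pairwise marginals.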

Optimization Procedure: Gradient Ascent

Regularization term: prefer small weight values where possible, which tends to give a smaller generalization error

Bayesian prior distribution: Gaussian, Laplace, etc.

use a gradient-based optimizer, e.g. conjugate gradient or BFGS (approximate quadratic optimization)

use stochastic gradient

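$\frac{\partial L}{\partial \lambda_k} = \sum_{d=1}^{D} f_k(Y_d, X_d) - \sum_{d=1}^{D} \sum_{Y} p(Y \mid X_d) \, f_k(Y, X_d) - \frac{\lambda_k}{\sigma^2}$

first term: observed feature value; second term: expected feature value; third term: regularization (written here for a Gaussian prior with variance $\sigma^2$)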

[Sutton, McCallum 06]
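One step of regularized gradient ascent might look like the sketch below. It assumes a Gaussian prior, so the penalty contributes the weight divided by the variance to each gradient component. The function names, the fixed learning rate, and the example counts are hypothetical; in a real CRF the expected counts would come from forward-backward rather than being given.

```python
def regularized_gradient(observed, expected, weights, sigma2):
    """Per-feature gradient: observed count - expected count - weight / sigma^2.

    The last term assumes a Gaussian prior with variance sigma2.
    """
    return [o - e - w / sigma2 for o, e, w in zip(observed, expected, weights)]


def ascent_step(weights, gradient, learning_rate):
    """Move the weights a small step in the direction of the gradient."""
    return [w + learning_rate * g for w, g in zip(weights, gradient)]


weights = [0.0, 0.5]
grad = regularized_gradient(
    observed=[3.0, 1.0],   # hypothetical feature counts in the training labels
    expected=[2.0, 1.5],   # hypothetical expected counts under the current model
    weights=weights,
    sigma2=10.0,
)
weights = ascent_step(weights, grad, learning_rate=0.1)
```

A feature seen more often in the data than the model expects gets a larger weight, while the penalty pulls every weight toward zero.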

Features

• Lemmas of verbs of previous position

• Part-of-Speech of lags -2|-1|0|1|2

• Coarse POS lags -2|-1|0|1|2

• Three letter prefixes lag -1,0,1

• Three letter suffixes lag -1,0,1

• INITCAP -1|0|1

• ALLDIGITS -1|0|1

• ALLCAPS -1|0|1

• MIXEDCAPS -1|0|1

• CONTAINSDASH -1|0|1

• Class-ID of unsupervised LDA topic model with 50 classes

SemCor training set: ~ 20k sentences, 5-fold cross validation
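Several of the surface features in the list above can be computed directly from the token window. The following is a rough sketch, not the authors' feature extractor: the feature names, the use of the first two POS letters as "coarse POS", and the exact definition of MIXEDCAPS are assumptions, and the lemma and LDA topic features are left out.

```python
import re


def token_features(words, pos_tags, t):
    """Features for position t from a window of neighbouring tokens."""
    feats = {}
    for lag in (-2, -1, 0, 1, 2):
        i = t + lag
        if 0 <= i < len(words):
            feats[f"pos[{lag}]"] = pos_tags[i]
            feats[f"cpos[{lag}]"] = pos_tags[i][:2]  # coarse POS (assumed: first two letters)
    for lag in (-1, 0, 1):
        i = t + lag
        if 0 <= i < len(words):
            w = words[i]
            feats[f"prefix3[{lag}]"] = w[:3].lower()
            feats[f"suffix3[{lag}]"] = w[-3:].lower()
            feats[f"INITCAP[{lag}]"] = w[:1].isupper()
            feats[f"ALLDIGITS[{lag}]"] = w.isdigit()
            feats[f"ALLCAPS[{lag}]"] = w.isalpha() and w.isupper()
            # Assumed definition: a lowercase letter plus an uppercase letter after the first.
            feats[f"MIXEDCAPS[{lag}]"] = (bool(re.search(r"[a-z]", w))
                                          and bool(re.search(r"[A-Z]", w[1:])))
            feats[f"CONTAINSDASH[{lag}]"] = "-" in w
    return feats


words = ["will", "save", "Berlin", "."]
pos_tags = ["MD", "VB", "NNP", "PUNC"]
feats = token_features(words, pos_tags, 2)  # features for "Berlin"
```

Positions outside the sentence simply produce no feature, so tokens near the sentence boundary have a smaller feature set.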

Results for nouns

Different F1 values: event 67.0%, Tops 98.2%

Micro-Average: 83.5%

Macro-Average: 77.9%

Different frequencies of examples: motive 133, artifact 8894

Comparison to Prior Result

Micro-Average:

Ciaramita & Altun 06: 77.18%

CRF: 83.5%

~ 28% reduction of error

Supersense   Ciaramita & Altun 06     CRF
             Rec.   Prec.  F          Rec.   Prec.  F
n.person     92.0   87.9   89.9       96.0   96.9   96.4
n.group      75.4   79.6   77.4       84.4   85.1   84.7
n.location   77.2   75.4   76.3       85.1   80.8   82.9
n.time       88.4   84.3   86.3       92.6   90.5   91.5
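The error reduction of roughly 28% can be checked from the two micro-averaged values, assuming that "error" here means 100% minus the micro-averaged F-value:

```latex
\frac{(100 - 77.18) - (100 - 83.5)}{100 - 77.18}
  = \frac{22.82 - 16.5}{22.82}
  = \frac{6.32}{22.82}
  \approx 0.277
```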

Need for More Data

WordNet covers more than 100,000 synsets

Few examples per supersense: higher training error

Many examples required to train each synset

SemCor: ~20k sentences

Manual labelling is costly

Exploit restrictions in WordNet

Each word has only a subset of possible supersenses

• Blow: n.act, n.event, n.phenomenon, n.artifact, n.act

Unlabeled data: assign possible supersenses to each word

Specialized CRF required

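Conditional Random Field with Lumped Supersenses

observed words / features X1,…,Xn

states Y1,…,Yn, Yt ∈ {…}

Observations: subsets of supersenses $\mathcal{Y}_t$ ⊆ {…}

An observation $(X_1, \ldots, X_n; \mathcal{Y}_1, \ldots, \mathcal{Y}_n)$ contains a large number of sequences

Adapt likelihood computation: sum over all sequences $(Y_1, \ldots, Y_n)$ with $Y_t \in \mathcal{Y}_t$

$p(\mathcal{Y}_1, \ldots, \mathcal{Y}_n \mid X) = \sum_{(Y_1, \ldots, Y_n) \in \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_n} \frac{1}{Z(X)} \exp\left( \sum_{t=1}^{n} \left[ \sum_{k=1}^{N_1} \lambda_k f_k(Y_{t-1}, Y_t, X) + \sum_{k=1}^{N_2} \mu_k g_k(Y_t, X) \right] \right)$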
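The adapted likelihood can be computed with the same forward recursion as before, restricted at each position to the observed subset of supersenses. The sketch below is one way to do this under simplifying assumptions: labels are integers, `node` and `edge` hold hypothetical log potentials, and the function names are illustrative rather than taken from the talk.

```python
import math


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_sum_over_paths(node, edge, allowed):
    """Log of the summed exp-score of all label sequences with y_t in allowed[t]."""
    alpha = {j: node[0][j] for j in allowed[0]}
    for t in range(1, len(node)):
        alpha = {
            j: node[t][j] + logsumexp([a + edge[i][j] for i, a in alpha.items()])
            for j in allowed[t]}
    return logsumexp(list(alpha.values()))


def lumped_log_likelihood(node, edge, lumped):
    """log p(lumped sets | X): constrained sum minus log Z over all sequences."""
    k = len(node[0])
    everything = [set(range(k))] * len(node)
    return (log_sum_over_paths(node, edge, lumped)
            - log_sum_over_paths(node, edge, everything))


node = [[0.2, 1.0, -0.5], [0.5, 0.1, 0.0], [1.2, 0.3, 0.7]]   # hypothetical
edge = [[0.4, -0.3, 0.0], [-0.6, 0.9, 0.2], [0.1, 0.0, 0.5]]  # hypothetical

fully_annotated = [{1}, {0}, {0}]   # one supersense per position
lumped = [{1}, {0, 2}, {0}]         # position 1 only restricted to a subset
unconstrained = [{0, 1, 2}] * 3     # every label allowed everywhere

ll_full = lumped_log_likelihood(node, edge, fully_annotated)
ll_lumped = lumped_log_likelihood(node, edge, lumped)
```

With singleton sets this reduces to the ordinary CRF log-likelihood; larger sets can only increase the value, because more sequences are counted as consistent with the observation.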

Training Data: SemCor Dataset

Word        Lemma       POS   Possible Supersenses
A           A           DT
compromise  compromise  NN    n.communication, n.act
will        will        MD
leave       leave       VB    v.stative, v.motion, v.cogn., v.change, v.social, v.possession
both        both        DT
sides       side        NN    n.group, n.location, n.body, n.artifact, n.cogn., n.food, n.communication, n.object, n.event
without     without     IN
the         the         DT
glow        glow        NN    n.state, n.attribute, n.phenomenon, n.feeling
of          of          IN
triumph     triumph     NN    n.feeling, n.event
,           ,           PUNC
but         but         CC
it          it          PRP
will        will        MD
save        save        VB    v.social, v.possession, v.change,
Berlin      location    NNP   n.location, n.person, n.artifact
.           .           PUNC
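Turning tokens into lumped observations amounts to a lexicon lookup. The sketch below assumes a small hand-built dictionary whose entries are copied from the example rows above; a real system would derive the candidate sets from WordNet. The names `POSSIBLE`, `NO_SENSE`, and `lumped_observation` are hypothetical.

```python
# Candidate supersenses per (lemma, POS), copied from the example rows.
POSSIBLE = {
    ("compromise", "NN"): {"n.communication", "n.act"},
    ("glow", "NN"): {"n.state", "n.attribute", "n.phenomenon", "n.feeling"},
    ("triumph", "NN"): {"n.feeling", "n.event"},
}
NO_SENSE = "0"  # assumed placeholder label for tokens without a supersense


def lumped_observation(lemma, pos, annotated=None):
    """Return the set of supersenses a token may take."""
    if annotated is not None:
        return {annotated}  # annotated token: exactly one supersense
    # Unannotated token: every candidate listed in the lexicon.
    return set(POSSIBLE.get((lemma, pos), {NO_SENSE}))


tokens = [
    ("compromise", "NN", "n.communication"),  # annotated
    ("will", "MD", None),                     # no supersense candidates
    ("glow", "NN", None),                     # unannotated: all candidates kept
]
observations = [lumped_observation(lemma, pos, ann) for lemma, pos, ann in tokens]
```

Mixing annotated and unannotated tokens in this way corresponds to varying the fraction of annotated versus possible supersenses.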

Results for Lumped Supersenses: SemCor

Simulate lumped supersenses

Determine possible supersenses for SemCor

Use different fractions of annotated / possible supersenses

Method                Fraction annotated - possible   Precision  Recall  F1
Ciaramita & Altun 06  3 - 0                           76.6       77.7    77.1
CRF                   3 - 0                           83.4       83.6    83.5
CRF                   2 - 0                           83.1       83.2    83.1
CRF                   2 - 1                           83.7       83.9    83.8
CRF                   0 - 3 (nothing observed)        82.6       82.7    82.7

Work in progress

Supersenses estimated without annotations: F-value drops by only 0.8 percentage points (83.5 → 82.7)

Agenda

1. Theseus Overview

2. Use Case Contentus

3. Core Technology Cluster

4. Supersense Tagging

5. Summary & Conclusions

Summary

• Sequence models are able to extract supersenses

• New features like topic models help

• We may use non-annotated texts by exploiting restrictions in the ontology

• Chance to improve classifiers considerably

• May enhance higher order IE and information retrieval

Todo

• Apply to lower levels of hierarchy

• Detect new senses / supersenses of words in WordNet