Frank Reichartz Gerhard Paaß Knowledge Discovery Fraunhofer IAIS St. Augustin, Germany

Estimating Supersenses with Conditional Random Fields

© Fraunhofer IAIS. Gerhard Paaß, 19 Sept 08, HLIE Workshop at ECML/PKDD’08

Agenda

1. Introduction

2. Models for Supersenses

3. Conditional Random Fields

4. Lumped Observations

5. Summary


Use Case Contentus

Digitize a multimedia collection of the German National Library

• Music of former GDR

• Digitize

• Quality control

• Meta data collection

• Semantic indexing

• Semantic search engine

Target

• Provide content: text, score sheets, video, images, speech

• Generate meta data: composers, premiere, director, artists, …

• Extract entities: dates, places, relations, composers, pieces of music, …

• Assign meanings to words and phrases: use ontology


Wordnet as Ontology

• WordNet is a fine-grained word sense hierarchy

• The same word may have different senses:

• bank = financial institution

• bank = river boundary

• Defines senses (synsets) for

• Verbs

• Common & proper nouns

• Adjectives

• Adverbs

Target: assign each word to a synset

• Easy semantic indexing & retrieval


Fine-Grained Word Senses

• Example: senses of noun "blow"

Very subtle differences between senses


Hierarchy of Hypernyms

• Supersense level

• Fewer distinctions

• Retains main differences

Target: assign verbs / nouns to a supersense


List of Supersenses

• 15 verb supersenses

• 26 noun supersenses


Supersenses discriminate between many synsets

Noun blow:

• 7 synsets

• 5 supersenses

Verb blow:

• 22 synsets

• 9 supersenses

Sufficient for coarse disambiguation



Training Data: SemCor Dataset

Input: Word, Lemma, POS. Output: Synset, Supersense.

Word        Lemma       POS   Synset    Supersense
A           A           DT
compromise  compromise  NN    1190419   noun.communication
will        will        MD
leave       leave       VB    2610151   verb.stative
both        both        DT
sides       side        NN    8294366   noun.group
without     without     IN
the         the         DT
glow        glow        NN    13864852  noun.state
of          of          IN
triumph     triumph     NN    7425691   noun.feeling
,           ,           PUNC
but         but         CC
it          it          PRP
will        will        MD
save        save        VB    2526596   verb.social
Berlin      location    NNP   26074     noun.location
.           .           PUNC
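A minimal sketch of how rows in this layout could be turned into input/output pairs for a sequence tagger. The function name `parse_semcor_rows` and the label `other` for words without a supersense are assumptions made for illustration (the label mirrors the "other" state used in the feature example later in the talk); the rows are taken from the slide.

```python
def parse_semcor_rows(rows, outside_label="other"):
    """Split whitespace-separated SemCor-style rows into inputs and output labels."""
    inputs, labels = [], []
    for row in rows:
        fields = row.split()
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        word, lemma, pos = fields[0], fields[1], fields[2]
        # Annotated rows carry a synset id and a supersense after the POS tag;
        # all other rows get the (assumed) outside label.
        supersense = fields[4] if len(fields) >= 5 else outside_label
        inputs.append((word, lemma, pos))
        labels.append(supersense)
    return inputs, labels


rows = [
    "A A DT",
    "compromise compromise NN 1190419 noun.communication",
    "will will MD",
    "leave leave VB 2610151 verb.stative",
]
inputs, labels = parse_semcor_rows(rows)
```

Here `inputs` holds the (word, lemma, POS) triples and `labels` the supersense sequence the tagger should predict.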


Prior Work: Classifier

• Bag-of-words is not sufficient: code relative positions

• Use classifiers

• MaxEnt

• SVM

• Naive Bayes

• kNN

• Proc. SemEval 2007: Coarse-Grained English All-Words Task


Prior Work: Sequence Modelling

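Ciaramita & Altun 06: Use perceptron-trained HMM

$$F(x, y; w) = \langle w, \Phi(x, y) \rangle = \sum_{j=1}^{d} w_j \, \Phi_j(x, y)$$

$$\Phi_j(x, y) = \sum_{i=1}^{|y|} \phi_j(y_{i-1}, y_i, x)$$

$\Phi(x, y)$: feature counts for a sequence $(y, x)$

$y_i, x_i$: the i-th label and the i-th word / feature in a sequence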

• Maximize predictive performance on training set

• Ignore ambiguity: use only most frequent sense

Deschacht & Moens 06: Use Conditional Random Field

• Exploit hierarchy to model many classes

• Apply to fine grained word sense: good results



Definition of Conditional Random Fields

observed words / features X1,…,Xn

states Y1,…,Yn

each state Yt may be influenced by many of the X1,…,Xn features

Definition: Let G=(V,E) be a graph such that $Y = (Y_t)_{t \in V}$, so that Y is indexed by the vertices of G. Then (X,Y) is a conditional random field if, when conditioned on X, the random variables Yt obey the Markov property with respect to the graph:

$$p(Y_t \mid Y_1, \ldots, Y_{t-1}, Y_{t+1}, \ldots, Y_n, X) = p(Y_t \mid Y_w, w \in \mathrm{neigh}(t), X)$$

Hammersley-Clifford theorem: the probability can be written as a product of potential functions

The variable sets of the potential functions are the cliques of the neighborhood graph


Simplification: Sequential Chain

observed words / features X1,…,Xn

states Y1,…,Yn

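$$p(Y_1, \ldots, Y_n \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{t=1}^{n} \left[ \sum_{k=1}^{N_1} \lambda_k f_k(Y_{t-1}, Y_t, X) + \sum_{k=1}^{N_2} \mu_k g_k(Y_t, X) \right] \right)$$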

features may involve two consecutive states and all observed words

examples

feature has value 1 if Yt-1="other" and Yt=location and Xt has pos tag "proper name" and (Xt-2,Xt-1)="arrived in". Otherwise the value is 0

estimated POS tags, noun phrase tags, weekday, amounts, etc.

prefixes, suffixes; matching regular expressions for capitalization, etc.

information from lexica, lists of proper names

[Lafferty, McCallum, Pereira 01]
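The binary feature described in the example above can be sketched as a small Python function. The function name, the argument layout, and the use of the tag `NNP` for "proper name" are assumptions for illustration; a real CRF toolkit would generate many such features from templates.

```python
def arrived_in_location_feature(y_prev, y_cur, words, pos_tags, t):
    """Return 1 if Y_{t-1} is 'other', Y_t is 'location', X_t is tagged as a
    proper name and the two preceding words are 'arrived in'; otherwise 0."""
    if t < 2:
        return 0  # there is no (X_{t-2}, X_{t-1}) context at the sequence start
    fires = (
        y_prev == "other"
        and y_cur == "location"
        and pos_tags[t] == "NNP"  # assumed tag for "proper name"
        and (words[t - 2].lower(), words[t - 1].lower()) == ("arrived", "in")
    )
    return 1 if fires else 0


words = ["She", "arrived", "in", "Berlin"]
pos_tags = ["PRP", "VBD", "IN", "NNP"]
value = arrived_in_location_feature("other", "location", words, pos_tags, 3)
```

The feature looks at two consecutive states and at arbitrary positions of the observed sequence, which is exactly the freedom the chain CRF allows.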


Derivative of Likelihood

estimate the optimal parameter vector for the training set (Xd,Yd), d=1,…,D

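$$\frac{\partial L}{\partial \lambda_k} = \sum_{d=1}^{D} f_k(Y_d, X_d) - \sum_{d=1}^{D} \sum_{Y_s} p(Y_s \mid X_d, \lambda) \, f_k(Y_s, X_d)$$

first term: observed feature value

second term: expected feature value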

how can we calculate the expected feature values?

need for every document d and state Yd,t the probability p(Yd,t=i|Xd)

need for every d and states Yd,t , Yd,t+1 the probability p(Yd,t=i,Yd,t+1=j|Xd)

use forward - backward algorithm as for the hidden Markov model
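A minimal pure-Python sketch of the forward-backward computation for a chain with a handful of states. It assumes the weighted features have already been summed into a score table `unary[t][i]` for state i at position t and a position-independent table `pairwise[i][j]` for the transition i → j; these names and the toy numbers are illustrative, and a real implementation would work in log space to avoid overflow.

```python
import math


def forward_backward(unary, pairwise):
    """Return node marginals p(Y_t=i|X), edge marginals p(Y_t=i,Y_{t+1}=j|X), log Z."""
    n, S = len(unary), len(unary[0])
    psi = [[math.exp(u) for u in row] for row in unary]     # state potentials
    phi = [[math.exp(p) for p in row] for row in pairwise]  # transition potentials

    # Forward pass: alpha[t][j] sums over all prefixes ending in state j at t.
    alpha = [psi[0][:]]
    for t in range(1, n):
        alpha.append([
            psi[t][j] * sum(alpha[t - 1][i] * phi[i][j] for i in range(S))
            for j in range(S)
        ])

    # Backward pass: beta[t][i] sums over all suffixes following state i at t.
    beta = [[1.0] * S for _ in range(n)]
    for t in range(n - 2, -1, -1):
        beta[t] = [
            sum(phi[i][j] * psi[t + 1][j] * beta[t + 1][j] for j in range(S))
            for i in range(S)
        ]

    Z = sum(alpha[n - 1])
    node = [[alpha[t][i] * beta[t][i] / Z for i in range(S)] for t in range(n)]
    edge = [
        [[alpha[t][i] * phi[i][j] * psi[t + 1][j] * beta[t + 1][j] / Z
          for j in range(S)] for i in range(S)]
        for t in range(n - 1)
    ]
    return node, edge, math.log(Z)


# Toy example: 3 positions, 2 states (scores are made up).
unary = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
pairwise = [[0.2, -0.1], [0.0, 0.4]]
node, edge, log_z = forward_backward(unary, pairwise)
```

The expected value of a feature is then obtained by weighting its value at each position with the corresponding node or edge marginal.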


Optimization Procedure: Gradient Ascent

Regularization term: use small weight values if possible, which gives a smaller generalization error

Bayesian prior distribution: Gaussian, Laplace, etc.

use gradient-based optimizer: e.g. conjugate gradient, BFGS (approximate quadratic optimization)

use stochastic gradient

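$$\frac{\partial L}{\partial \lambda_k} = \sum_{d=1}^{D} f_k(Y_d, X_d) - \sum_{d=1}^{D} \sum_{Y_s} p(Y_s \mid X_d, \lambda) \, f_k(Y_s, X_d) - \frac{\lambda_k}{\sigma^2}$$

first term: observed feature value

second term: expected feature value

third term: regularization (shown here for a Gaussian prior with variance $\sigma^2$)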

[Sutton, McCallum 06]
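A minimal sketch of one regularized gradient-ascent step, assuming the observed and expected feature values have already been computed (the expected values via forward-backward). The Gaussian prior with variance `sigma2`, the fixed `step_size`, and all function names are assumptions for illustration; the talk also mentions conjugate gradient, BFGS and stochastic gradient as alternatives to this plain update.

```python
def regularized_gradient(observed, expected, weights, sigma2):
    """Gradient per weight: observed - expected - weight / sigma2 (Gaussian prior)."""
    return [o - e - w / sigma2 for o, e, w in zip(observed, expected, weights)]


def gradient_ascent_step(weights, gradient, step_size):
    """Move each weight a small step in the direction of its gradient."""
    return [w + step_size * g for w, g in zip(weights, gradient)]


# Toy numbers for two features (made up for illustration).
weights = [0.2, -0.4]
observed = [3.0, 1.0]   # feature counts in the annotated training sequences
expected = [2.5, 1.5]   # feature counts expected under the current model
grad = regularized_gradient(observed, expected, weights, sigma2=10.0)
new_weights = gradient_ascent_step(weights, grad, step_size=0.1)
```

A feature that occurs more often in the data than the model expects gets a larger weight, while the prior term pulls every weight back toward zero.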


Features

• Lemmas of verbs of previous position

• Part-of-Speech of lags -2|-1|0|1|2

• Coarse POS lags -2|-1|0|1|2

• Three letter prefixes lag -1,0,1

• Three letter suffixes lag -1,0,1

• INITCAP -1|0|1

• ALLDIGITS -1|0|1

• ALLCAPS -1|0|1

• MIXEDCAPS -1|0|1

• CONTAINSDASH -1|0|1

• Class-ID of unsupervised LDA topic model with 50 classes

SemCor training set: ~ 20k sentences, 5-fold cross validation
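The lagged feature templates listed above can be sketched as follows. This is an illustrative subset only: the lemma and LDA topic features are left out, the feature-name strings are invented, the exact definitions of ALLCAPS and MIXEDCAPS are guesses, and the "coarse POS" is assumed to be the first two characters of the tag.

```python
def shape_flags(word):
    """Orthographic indicators; the exact definitions are assumptions."""
    return {
        "INITCAP": word[:1].isupper(),
        "ALLDIGITS": word.isdigit(),
        "ALLCAPS": word.isupper(),
        "MIXEDCAPS": any(c.isupper() for c in word[1:]) and any(c.islower() for c in word),
        "CONTAINSDASH": "-" in word,
    }


def token_features(words, pos_tags, t):
    """Collect string-valued features for position t from its neighbourhood."""
    feats = []
    n = len(words)
    for lag in (-2, -1, 0, 1, 2):          # POS and coarse POS at lags -2..2
        i = t + lag
        if 0 <= i < n:
            feats.append(f"POS[{lag}]={pos_tags[i]}")
            feats.append(f"CPOS[{lag}]={pos_tags[i][:2]}")
    for lag in (-1, 0, 1):                 # affixes and word shape at lags -1..1
        i = t + lag
        if 0 <= i < n:
            w = words[i]
            feats.append(f"PRE3[{lag}]={w[:3].lower()}")
            feats.append(f"SUF3[{lag}]={w[-3:].lower()}")
            feats.extend(f"{name}[{lag}]" for name, on in shape_flags(w).items() if on)
    return feats


words = ["it", "will", "save", "Berlin", "."]
pos_tags = ["PRP", "MD", "VB", "NNP", "PUNC"]
feats = token_features(words, pos_tags, 3)   # features for "Berlin"
```

Each string would become one binary observation feature that the CRF combines with the current state or state pair.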


Results for nouns

Different F1-values

• event: 67.0%

• Tops: 98.2%

Micro-Average: 83.5%

Macro-Average: 77.9%

Different frequencies of examples

• motive: 133

• artifact: 8894


Comparison to Prior Result

Micro-Average:

Ciaramita & Altun 06: 77.18%

CRF: 83.5%

~ 28% reduction of error
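The figure of roughly 28% can be checked with a short calculation, assuming that "error" here means 100% minus the micro-averaged F1 value:

```latex
\[
\frac{(100 - 77.18) - (100 - 83.5)}{100 - 77.18}
  = \frac{22.82 - 16.5}{22.82}
  = \frac{6.32}{22.82}
  \approx 0.277
\]
```

That is a relative error reduction of about 27.7%, which rounds to the 28% quoted above.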

Supersense   Ciaramita & Altun 06       CRF
             Rec.   Prec.  F            Rec.   Prec.  F
n.person     92.0   87.9   89.9         96.0   96.9   96.4
n.group      75.4   79.6   77.4         84.4   85.1   84.7
n.location   77.2   75.4   76.3         85.1   80.8   82.9
n.time       88.4   84.3   86.3         92.6   90.5   91.5



Need for More Data

WordNet covers more than 100000 synsets

Few examples per supersense: higher training error

Many examples required to train each synset

SemCor: ~20k sentences

Manual labelling is costly

Exploit restrictions in WordNet

Each word has only a subset of possible supersenses

• Blow: n.act, n.event, n.phenomenon, n.artifact, n.act

Unlabeled data: assign possible supersenses to each word

Specialized CRF required


Conditional Random Field with Lumped Supersenses

observed words / features X1,…,Xn

states Y1,…,Yn, Yt…

Observations: Subsets of supersenses Yt …

An observation (X1,…,Xn; Y1,…,Yn) contains a large number of sequences

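$$p(Y_1, \ldots, Y_n \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{t=1}^{n} \left[ \sum_{k=1}^{N_1} \lambda_k f_k(Y_{t-1}, Y_t, X) + \sum_{k=1}^{N_2} \mu_k g_k(Y_t, X) \right] \right)$$

Adapt likelihood computation to the lumped observation $(Y_1, \ldots, Y_n)$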
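One common way to score such set-valued observations is to sum the chain probability over every state sequence that stays inside the allowed subsets, which only needs a forward pass restricted to the allowed states. The sketch below illustrates that idea under stated assumptions: the slides do not spell out the authors' exact computation, the names `unary`, `pairwise` and `allowed` are invented, every subset is assumed to be non-empty, and the toy scores are made up.

```python
import math


def log_sum_over_allowed(unary, pairwise, allowed):
    """Log of the summed exp-scores of all sequences with Y_t in allowed[t]."""
    n, S = len(unary), len(unary[0])
    alpha = [math.exp(unary[0][i]) if i in allowed[0] else 0.0 for i in range(S)]
    for t in range(1, n):
        alpha = [
            math.exp(unary[t][j])
            * sum(alpha[i] * math.exp(pairwise[i][j]) for i in range(S))
            if j in allowed[t] else 0.0   # states outside the subset get no mass
            for j in range(S)
        ]
    return math.log(sum(alpha))


def lumped_log_likelihood(unary, pairwise, allowed):
    """log p(Y_1 in allowed[0], ..., Y_n in allowed[n-1] | X) for a chain model."""
    everything = [set(range(len(unary[0])))] * len(unary)
    constrained = log_sum_over_allowed(unary, pairwise, allowed)
    log_z = log_sum_over_allowed(unary, pairwise, everything)
    return constrained - log_z


# Toy example: 3 positions, 3 states.
unary = [[0.5, -0.2, 0.1], [0.1, 0.3, -0.6], [-0.4, 0.8, 0.0]]
pairwise = [[0.2, -0.1, 0.0], [0.0, 0.4, -0.3], [0.1, 0.1, 0.1]]
allowed = [{0, 2}, {1}, {0, 1, 2}]   # annotated word in the middle, lumped elsewhere
ll = lumped_log_likelihood(unary, pairwise, allowed)
```

With singleton subsets this reduces to the usual log-likelihood of a fully annotated sequence, and with all states allowed everywhere it is zero, so annotated and non-annotated words can be mixed in one sequence.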


Training Data: SemCor Dataset

Word        Lemma       POS   Possible Supersenses
A           A           DT
compromise  compromise  NN    n.communication, n.act
will        will        MD
leave       leave       VB    v.stative, v.motion, v.cogn., v.change, v.social, v.possession
both        both        DT
sides       side        NN    n.group, n.location, n.body, n.artifact, n.cogn., n.food, n.communication, n.object, n.event
without     without     IN
the         the         DT
glow        glow        NN    n.state, n.attribute, n.phenomenon, n.feeling
of          of          IN
triumph     triumph     NN    n.feeling, n.event
,           ,           PUNC
but         but         CC
it          it          PRP
will        will        MD
save        save        VB    v.social, v.possession, v.change
Berlin      location    NNP   n.location, n.person, n.artifact
.           .           PUNC


Results for Lumped Supersenses: SemCor

Simulate lumped supersenses

Determine possible supersenses for SemCor

Use different fractions of annotated / possible supersenses

Method                 Fraction annotated - possible   Precision   Recall   F1
Ciaramita & Altun 06   3 - 0                           76.6        77.7     77.1
CRF                    3 - 0                           83.4        83.6     83.5
CRF                    2 - 0                           83.1        83.2     83.1
CRF                    2 - 1                           83.7        83.9     83.8
CRF                    0 - 3 (nothing observed)        82.6        82.7     82.7

Work in progress

Supersenses estimated without annotations: F-value drops by only 0.8 percentage points (83.5 to 82.7)



Summary

• Sequence models are able to extract supersenses

• New features like topic models help

• We may use non-annotated texts by exploiting restrictions in the ontology

• Chance to improve classifiers considerably

• May enhance higher order IE and information retrieval

Todo

• Apply to lower levels of hierarchy

• Detect new senses / supersenses of words in WordNet
