Using bilingual LSA for FN annotation of French text from generic resources
Guillaume Pitel - LORIA/LED
FR.FrameNet Project
Funded by France-Berkeley Fund
Guillaume Pitel - LORIA - Nancy 2
Outline
The (small) FR.FrameNet project
The projection problem
Realizations
  French Frames database
  Annotated reference sub-corpus
  English semantic clusters from FEs
  Projection into French
Other potential applications
The (small) FR.FrameNet project
A Berkeley-Nancy collaboration, funded by the France-Berkeley Fund: ICSI, ATILF, LORIA
French participants: Susanne Alt, Benoît Crabbé, Christiane Jadelot, Guillaume Pitel, Laurent Romary
Setting the foundations for a cheap bootstrapping of a French FrameNet
  Reusing existing French lexical semantic resources
  Reusing any available resources
  Focus on automatic methods
The projection problem
Use a semantic lexicon in language A to annotate a corpus in language B
Resulting data is expected to be of much lower quality than a handcrafted lexicon
It is a bootstrapping process: it requires manual correction
Important question: does it really speed up the final production?
Pado & Lapata approach
Using a Source language/Target language parallel corpus
The Source side of the corpus must be FN-annotated
The roles are projected onto the Target corpus
Train a statistical semantic role parser for the Target language
Automatic annotation of any corpus in the Target language
Pado & Lapata approach
Problems
  Translation is not frame-conserving in many cases (20-30%)
  Parallel corpora are a rare resource
  Berkeley’s FrameNet is not built on the English side of a parallel corpus
But very useful with a resource like Europarl
The main bottleneck
Existence of parallel AND annotated corpora: rare and expensive to build
But…
  Annotated corpora are available
  Parallel, aligned corpora are available
The Semantic Space based approach (using LSA)
Pure semantic annotation
  no grammatical function
  no POS
Use a bilingual LSA space to make the projection
Preparation:
  Find the lexical units in the Target language that fit each frame
    Use an available resource
    Compute them automatically
  Compute the semantic clusters of each frame element
The Semantic Space based approach (using LSA)
Usage: automatic preannotation (or selection)
  For each sentence in the Target corpus
    Find potential frames from LUs
    Compare each word (or head of constituent) of the sentence with the computed semantic clusters of the (core) roles of the candidate frames (or the corresponding roles in parents if training data is missing)
    Candidate Frames and FEs are rated by semantic distance
What we can expect
  Can’t deal with anaphora
  Can’t deal with FEs that are not semantically narrow
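The preannotation loop described on this slide can be sketched as follows. This is only an illustration under stated assumptions: the lexicon `LU_TO_FRAMES`, the role centers in `ROLE_CLUSTERS`, the toy 3-dimensional `WORD_VECTORS` and the averaging used to rate a frame are all hypothetical and do not come from the project.

```python
import numpy as np

# Hypothetical toy data: none of these vectors or lexicon entries come from the slides.
WORD_VECTORS = {
    "eau": np.array([0.9, 0.1, 0.1]),
    "seau": np.array([0.1, 0.9, 0.1]),
    "marie": np.array([0.1, 0.1, 0.9]),
}
LU_TO_FRAMES = {"remplir": ["Filling"]}          # lemma -> frames it may evoke
ROLE_CLUSTERS = {                                # frame -> role -> cluster centers
    "Filling": {
        "Theme": [np.array([1.0, 0.0, 0.0])],
        "Goal": [np.array([0.0, 1.0, 0.0])],
        "Agent": [np.array([0.0, 0.0, 1.0])],
    }
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def preannotate(lemmas):
    """Return candidate (frame, score, roles) tuples, best score first."""
    candidates = []
    for lu in lemmas:
        for frame in LU_TO_FRAMES.get(lu, []):
            roles = {}
            for word in lemmas:
                if word == lu or word not in WORD_VECTORS:
                    continue
                # Best role for this word = role whose closest cluster is most similar.
                scored = [
                    (role, max(cosine(WORD_VECTORS[word], c) for c in centers))
                    for role, centers in ROLE_CLUSTERS[frame].items()
                ]
                roles[word] = max(scored, key=lambda pair: pair[1])
            sims = [sim for _role, sim in roles.values()]
            score = sum(sims) / len(sims) if sims else 0.0
            candidates.append((frame, score, roles))
    return sorted(candidates, key=lambda c: c[1], reverse=True)

# Lemmatized toy sentence: "Marie remplit le seau d'eau"
result = preannotate(["marie", "remplir", "le", "seau", "de", "eau"])
```

Each word keeps only its best-matching role here; a real preannotation step would also need a threshold below which no role is proposed.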
Subprojects
Convert frames to French
  Using the ISC Semantic Atlas (built from 2 synonym dictionaries + a minimal FR//EN corpus)
Annotation of reference subcorpus
  1000 sentences from Europarl
Projection using LSA
Convert Frames to French
English LUs to French LUs
For each Frame in Berkeley FrameNet
  For each LU, find potential translations in French
    Using Semantic ATLAS (Ploux & Ji, 2003) - other languages?
  Compute the French “profile” of the Frame
  Manually check that a lemma can actually evoke the frame (purely subjective judgment)
Frame-by-frame procedure
Must be validated later by corpus evidence
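A minimal sketch of the automatic part of this procedure is given below. The slides do not define the French “profile” precisely, so the sketch assumes it is simply a count of how many LUs of the frame translate to each French lemma. The function name `french_profile` is hypothetical; the translation lists are a small subset of those given for the Filling frame in these slides.

```python
from collections import Counter

# Subset of the Filling translations listed in the slides (lower-cased).
TRANSLATIONS = {
    "adorn": ["chamarrer", "embellir", "enjoliver", "orner", "parer", "revêtir"],
    "brush": ["badigeonner", "brosser", "effleurer"],
    "coat": ["empâter", "enduire", "enrober", "revêtir"],
    "daub": ["badigeonner", "barbouiller", "peinturlurer"],
    "embellish": ["broder", "embellir", "enjoliver", "orner"],
}

def french_profile(lus, translations):
    """Count, for each French lemma, how many English LUs translate to it."""
    profile = Counter()
    for lu in lus:
        # set() so that one LU contributes at most once per French lemma.
        profile.update(set(translations.get(lu, [])))
    return profile

profile = french_profile(TRANSLATIONS.keys(), TRANSLATIONS)
# Lemmas reached from several LUs are natural first candidates for the manual check.
shared = sorted(lemma for lemma, n in profile.items() if n > 1)
```

Ranking candidates by this count only orders the manual check; it does not replace the subjective judgment or the later corpus validation mentioned above.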
Lexical units in “Filling” Frame
adorn.v, anoint.v, asphalt.v, brush.v, butter.v, coat.v, cover.v, cram.v, crowd.v, dab.v, daub.v, douse.v, drape.v, drizzle.v, dust.v, embellish.v, fill.v, flood.v, gild.v, glaze.v, hang.v, heap.v, inject.v, jam.v, load.v, pack.v, paint.v, panel.v, pave.v, pile.v, plant.v, plaster.v, pump.v, scatter.v, seed.v, shower.v, smear.v, sow.v, spatter.v, splash.v, splatter.v, spray.v, spread.v, sprinkle.v, squirt.v, strew.v, stuff.v, suffuse.v, surface.v, tile.v, varnish.v, wallpaper.v, wrap.v
Translations 1/4
Adorn: Chamarrer, embellir, enjoliver, orner, parer, revêtir
Anoint: Oindre
Asphalt: Asphalter, bitumer
Brush: Badigeonner, brosser, effleurer
Butter: Beurrer
Coat: Empâter, enduire, enrober, revêtir
Cover: badigeonner, barbouiller, couvrir, franchir, gainer, garnir, habiller, monter, parcourir, quadriller, recouvrir, revêtir, saillir, se couvrir, subvenir, tapisser
Cram: bachoter, bâfrer, bûcher, chauffer, engraisser, lester, potasser
Crowd: foule (should also be peupler)
Dab: bassiner, tamponner, toucher
Daub: badigeonner, barbouiller, peinturlurer
Douse: ???
Drape: Draper
Drizzle: brouillasser, bruiner, crachiner, pleuvasser, pleuviner
Dust: enlever la poussière, essuyer, poussière, saupoudrer, épousseter
Embellish: broder, embellir, enjoliver, orner
Fill: appliquer un enduit, boucher, bourrer, calfeutrer, combler, devenir plein, emplir, enfler, fourrer, garnir, gonfler, gorger, imprégner, lester, mastiquer, meubler, obturer, occuper, peupler, plomber, pourvoir, pourvoir à, pénétrer, remplir, s'enfler, se gonfler, se peupler, se remplir
Manual selection 1/4
Adorn: Chamarrer, embellir, enjoliver, orner, parer, revêtir
Anoint: Oindre
Asphalt: Asphalter, bitumer
Brush: Badigeonner, brosser, effleurer
Butter: Beurrer
Coat: Empâter, enduire, enrober, revêtir
Cover: badigeonner, barbouiller, couvrir, franchir, gainer, garnir, habiller, monter, parcourir, quadriller, recouvrir, revêtir, saillir, se couvrir, subvenir, tapisser
Cram: bachoter, bâfrer, bûcher, chauffer, engraisser, lester, potasser
Crowd: foule (should also be peupler)
Dab: bassiner, tamponner, toucher
Daub: badigeonner, barbouiller, peinturlurer
Douse: ???
Drape: Draper
Drizzle: brouillasser, bruiner, crachiner, pleuvasser, pleuviner
Dust: enlever la poussière, essuyer, poussière, saupoudrer, épousseter
Embellish: broder, embellir, enjoliver, orner
Fill: appliquer un enduit, boucher, bourrer, calfeutrer, combler, devenir plein, emplir, enfler, fourrer, garnir, gonfler, gorger, imprégner, lester, mastiquer, meubler, obturer, occuper, peupler, plomber, pourvoir, pourvoir à, pénétrer, remplir, s'enfler, se gonfler, se peupler, se remplir
Frame building: Conclusion
Quite inexpensive compared to an approach based on introspection from scratch or on corpus evidence (Filling is a big frame with a lot of LUs; it took me ~30 min to select good instances, with manual color setting)
Probably far from perfect coverage, low precision
Need several annotators to duplicate the work
Our approach to cross-language semantic annotation
The goal:
  A lemma can be related to several Frames
  We want to disambiguate between the possible choices
  And also try to attribute roles (at least core roles) once we have made the choice
  All of this in French, while we have the training data in English
Bilingual LSA approach
Latent Semantic Analysis
Improvement of cooccurrence matrices
Reduce the number of dimensions
Example:
  A occurs in documents (or contexts) 1, 2, 3
  B in 2, 3, 4, 5
  C in 4, 5, 6
  A and C never occur in the same document
  LSA makes it possible to reduce documents 1-6 to one dimension
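The A/B/C example can be checked numerically. The sketch below assumes a plain binary term-document matrix and a truncated SVD computed with NumPy; the actual project used Infomap-nlp, whose weighting and preprocessing are not reproduced here.

```python
import numpy as np

# Rows are terms A, B, C; columns are documents 1..6 (1 = the term occurs).
M = np.array([
    [1, 1, 1, 0, 0, 0],  # A occurs in documents 1, 2, 3
    [0, 1, 1, 1, 1, 0],  # B occurs in documents 2, 3, 4, 5
    [0, 0, 0, 1, 1, 1],  # C occurs in documents 4, 5, 6
], dtype=float)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Truncated SVD: keep only the k strongest latent dimensions.
U, S, Vt = np.linalg.svd(M, full_matrices=False)
k = 1
term_coords = U[:, :k] * S[:k]   # position of each term in the reduced space

raw_ac = cosine(M[0], M[2])                      # A vs C on raw co-occurrence
lsa_ac = cosine(term_coords[0], term_coords[2])  # A vs C after reduction

print(f"raw similarity A/C: {raw_ac:.2f}")
print(f"LSA similarity A/C: {lsa_ac:.2f}")
```

A and C share no document, so their raw similarity is zero, but both co-occur with B and therefore end up on the same side of the single retained dimension.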
Evaluating the semantic position of Frame Elements in LSA
Computing an English LSA space
  Tools: Treetagger + Infomap-nlp
  Corpus: BNC + English part of Europarl + translation of Balzac
  POS+lemma: “NNyear”
  Keep only Verbs, Adjectives, Nouns, Adverbs
  Other combinations (no POS, all POS, raw form) don’t perform as well
Example
FE’s annotations for Filling.Theme
  with water.
  with a fungicide such as green or yellow sulphur.
  with a soft brush and malathion dust.
  with a little cayenne pepper.
  …
Terms used for the FE’s representation
  NNwater;NNfungicide;JJsuch;JJgreen;JJyellow;NNsulphur;JJsoft;NNbrush;NNmalathion;NNdust;JJlittle;NNcayenne;NNpepper
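The conversion from an annotated span to POS+lemma terms can be sketched as below. The input is assumed to be a list of (word, tag, lemma) triples such as a tagger like TreeTagger would produce. The set `KEPT_PREFIXES` and the function `to_terms` are hypothetical; in particular, the exact tag prefixes kept for verbs and adverbs are an assumption.

```python
# Two-letter tag prefixes kept as content words (assumed: nouns, adjectives,
# verbs, adverbs). Everything else (prepositions, determiners, ...) is dropped.
KEPT_PREFIXES = {"NN", "JJ", "VV", "RB"}

def to_terms(tagged_tokens):
    """Turn (word, tag, lemma) triples into terms like 'NNbrush'."""
    terms = []
    for _word, tag, lemma in tagged_tokens:
        prefix = tag[:2]
        if prefix in KEPT_PREFIXES:
            terms.append(prefix + lemma.lower())
    return terms

# "with a soft brush and malathion dust." as hand-written tagger output.
tagged = [
    ("with", "IN", "with"),
    ("a", "DT", "a"),
    ("soft", "JJ", "soft"),
    ("brush", "NN", "brush"),
    ("and", "CC", "and"),
    ("malathion", "NN", "malathion"),
    ("dust", "NN", "dust"),
    (".", "SENT", "."),
]
terms = to_terms(tagged)
print(";".join(terms))
```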
Evaluating FE’s semantic coherence
Compute the semantic center of the FE = center of the positions of the FE’s terms
Find the N nearest neighbors of this center
If the center is in a semantically coherent region, the average similarity between neighbors and center is high.
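A minimal sketch of this coherence measure follows, assuming cosine similarity and a center computed as the mean of length-normalized term vectors. The toy 3-dimensional vectors in `SPACE` and the function `fe_coherence` are hypothetical and only illustrate the computation.

```python
import numpy as np

# Hypothetical toy semantic space (term -> vector); not data from the project.
SPACE = {
    "powder": [1.0, 0.1, 0.0],
    "salt": [0.9, 0.2, 0.0],
    "spray": [0.8, 0.3, 0.1],
    "tom": [0.0, 1.0, 0.1],
    "jane": [0.1, 0.9, 0.0],
    "river": [0.0, 0.1, 1.0],
}

def unit(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def fe_coherence(fe_terms, space, n_neighbors):
    """Return (average similarity, neighbors) around the FE's semantic center."""
    center = unit(np.mean([unit(space[t]) for t in fe_terms], axis=0))
    sims = {term: float(unit(vec) @ center) for term, vec in space.items()}
    neighbors = sorted(sims.items(), key=lambda kv: kv[1], reverse=True)[:n_neighbors]
    average = sum(sim for _term, sim in neighbors) / len(neighbors)
    return average, neighbors

# A semantically narrow FE versus a scattered one.
coherent_avg, coherent_nn = fe_coherence(["powder", "salt"], SPACE, 3)
scattered_avg, scattered_nn = fe_coherence(["powder", "tom", "river"], SPACE, 3)
print(f"coherent: {coherent_avg:.3f}  scattered: {scattered_avg:.3f}")
```

The scattered FE has a center that lies between regions, so even its nearest neighbors are comparatively far away and the average drops.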
FEs of Filling
Frame.FE            Average    Std        Min       Max       Nb annot
Filling.Agent       0.604941   0.0413504  0.563591  0.717469  279
Filling.Cause
Filling.Degree      0.595513   0.0431123  0.552401  0.697830  4
Filling.Depictive   0.683302   0.0502735  0.633029  0.804053  1
Filling.Goal        0.6483     0.0510976  0.597202  0.793063  543
Filling.Instrument  0.646028   0.0715617  0.574466  0.844308  4
Filling.Manner      0.647012   0.0795992  0.567413  0.896142  25
Filling.Means       0.67356    0.0502949  0.623265  0.820630  1
Filling.Path        0.708096   0.069683   0.638413  0.925448  2
Filling.Place       0.562765   0.0364663  0.526299  0.683526  2
Filling.Purpose     0.631099   0.0585047  0.572594  0.761788  5
Filling.Result      0.734567   0.0585102  0.676057  0.825459  37
Filling.Source      0.611222   0.0447367  0.566485  1.000000  1
Filling.Subregion   0.782659   0.0756196  0.707039  0.944916  2
Filling.Theme       0.747146   0.0485786  0.698567  0.890307  450
Filling.Time        0.474269   0.0474972  0.426772  0.628049  16
Neighbors of Filling.Theme
powder 0.890307, spray 0.836283, dry 0.821666, crushed 0.820905, charcoal 0.813571, plastic 0.806768, copper 0.804459, paste 0.802643, foam 0.802201, brush 0.799847, …
Computed from: with fake diamonds. with pictures of cute white bunnies. with jewels and fine gowns. with one of these pegs. with pictures , flowers , and messages of peace. with wreaths of flowers and garlands of feathers. with the finest furniture from a firm in London 's New Bond Street. with a crown. with beautifully hooked melodies and harmonies. with chrism , the sacred ointment ,. with gel. with such a leaden armour of expectations. with the poison. with these substances. with vaseline. with his pungent urine. with holy oil. in bulb fibre. in whipped cream and honey. with a foot of topsoil. with her hand. …
Neighbors of Filling.Agent
oliver 0.717469, jack 0.696716, joe 0.691628, marie 0.686812, harry 0.684113, charlie 0.681887, billy 0.680378, tom 0.678887, jane 0.676179, rose 0.669748, …
Computed from: Your man. I. They. The priests. He. the wife of Cnut 's henchman Tofi the Proud. The Reclusiarch. she. What father. The Indians. Over 200 species of birds. He. He. Father Peter. Viktor. by ecclesiastics. We. One girl. She. she. he. the white gravel. the reluctant soldier. I. Eva. he. Two people. he. the good beachcombers. Sylvester. he. He. Two girls. you. Cecil Beaton. you. Larsen. you. He. you. you. He. he. she. Mina and K. She. you. she. the programme that turns the cameras on teenagers and let's them do the talking and the interviews. Baldwin. by Molly Fletcher. She. I. They. she. Endill. They. He. the BBC and official propaganda…
FEs’ clusters
Grouping terms of the FE by minimal distance (arbitrarily set) i.e. 0.8 = 74°
Keeping clusters with more than 5% of terms
http://guillaume.work.free.fr/Frames.en.3
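The grouping step can be sketched as follows. The slides give only the threshold (0.8) and the 5% size filter, not the clustering algorithm, so this sketch assumes single-link grouping: two terms land in the same cluster when a chain of pairwise cosine similarities of at least the threshold connects them. The function `cluster_terms` and the toy vectors are hypothetical.

```python
import numpy as np

def cluster_terms(vectors, threshold=0.8, min_share=0.05):
    """Single-link clusters of terms; keep clusters holding more than min_share of terms."""
    terms = sorted(vectors)
    unit = {t: np.asarray(vectors[t], dtype=float) / np.linalg.norm(vectors[t])
            for t in terms}
    parent = {t: t for t in terms}   # union-find forest

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]   # path halving
            t = parent[t]
        return t

    # Link every pair of terms whose cosine similarity reaches the threshold.
    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            if unit[a] @ unit[b] >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for t in terms:
        groups.setdefault(find(t), []).append(t)
    kept = [g for g in groups.values() if len(g) > min_share * len(terms)]
    return sorted(kept, key=lambda g: (-len(g), g))

# Hypothetical toy vectors: two tight groups and one isolated term.
VECTORS = {
    "powder": [1.0, 0.1, 0.0], "salt": [0.9, 0.2, 0.1], "herb": [0.8, 0.3, 0.0],
    "red": [0.0, 1.0, 0.1], "pink": [0.1, 0.9, 0.1], "grey": [0.2, 0.8, 0.2],
    "london": [0.0, 0.0, 1.0],
}
# With only 7 terms the 5% filter would keep singletons, so the demo uses 20%.
clusters = cluster_terms(VECTORS, threshold=0.8, min_share=0.2)
print(clusters)
```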
Clusters of Filling frame
Agent: 2 cluster(s)
Degree: 4 cluster(s)
Depictive: 6 cluster(s)
Goal: 2 cluster(s)
Instrument: 6 cluster(s)
Manner: 2 cluster(s)
Means: 2 cluster(s)
Path: 1 cluster(s)
Place: 5 cluster(s)
Purpose: 1 cluster(s)
Result: 2 cluster(s)
Source: 1 cluster(s)
Subregion: 1 cluster(s)
Theme: 2 cluster(s)
Time: 0 cluster(s)
Clusters Filling.Agent
rachel 0.867663, sara 0.863332, ellen 0.856612, lily 0.855513, sally 0.853933, alice 0.849205, emily 0.847480, dad 0.845598, jenny 0.844003, kate 0.839664, maggie 0.836391
tom 0.924026, john 0.908828, hugh 0.898049, michael 0.897622, scott 0.892861, sir 0.891623, david 0.889539, frank 0.889324, murray 0.879660, anthony 0.879149, geoffrey 0.876748
Clusters Filling.Goal
tin 0.924426, pot 0.908988, jar 0.908169, cake 0.893367, bottle 0.888083, bag 0.871596, jug 0.866099, bowl 0.860658, basket 0.858857, plastic 0.852992, dish 0.846176, peel 0.834313
wall 0.911646, wooden 0.864492, entrance 0.851708, front 0.846124, floor 0.834214, porch 0.834039, staircase 0.827131, roof 0.823297, rear 0.815847, corner 0.815765, rear 0.813187, front 0.813136
Clusters Filling.Theme
powder 0.913015, salt 0.907773, dry 0.900202, aromatic 0.886529, vegetable 0.870903, spray 0.867004, bean 0.860508, herb 0.858321, meat 0.852165, apple 0.848998, vinegar 0.848045, pea 0.845492
shiny 0.915945, red 0.908281, pink 0.905748, tint 0.900729, grey 0.899490, yellow 0.882565, blue 0.882097, white 0.877434, ribbon 0.876266, brown 0.875334, pale 0.875016, silk 0.865824
Projection
Compute French clusters from English clusters
Corpus collection
  Europarl (French-English)
  French-English Balzac from Project Gutenberg
  French//English: 50M lemmas
  Shakespeare, Hansard Corpus to be included
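The slides do not spell out how an English cluster is turned into a French one. One plausible reading, sketched below purely as an assumption, maps each English term of a cluster to its nearest French term in the shared bilingual space. The function `project_cluster`, the toy vectors and the `FRENCH_VOCAB` set are hypothetical.

```python
import numpy as np

# Hypothetical toy bilingual space; the real one is built from the parallel corpus.
SPACE = {
    "eat": np.array([1.0, 0.1]),
    "manger": np.array([0.95, 0.2]),
    "boire": np.array([0.7, 0.6]),
    "river": np.array([0.1, 1.0]),
    "fleuve": np.array([0.2, 0.95]),
    "rivière": np.array([0.4, 0.8]),
}
FRENCH_VOCAB = {"manger", "boire", "fleuve", "rivière"}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def project_cluster(english_terms, space, french_vocab):
    """Map each English term to its most similar French term in the shared space."""
    projected = []
    for term in english_terms:
        best = max(french_vocab, key=lambda fr: cosine(space[term], space[fr]))
        projected.append(best)
    return projected

french_cluster = project_cluster(["eat", "river"], SPACE, FRENCH_VOCAB)
print(french_cluster)
```

With this reading, any misplaced term in the bilingual space directly produces a wrong member in the French cluster.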
Training data
Lemmas interleaved on a sentence alignment basis
Training with a larger window
Only parallel corpus; experiments that introduce bits of pure monolingual corpus show a quality loss
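The interleaving can be sketched as below, under one possible reading of the slide: within each aligned sentence pair, English and French lemmas are alternated so that translations fall inside the same co-occurrence window. The functions `interleave` and `build_training_text` are hypothetical names, and the exact interleaving scheme used in the project is not stated.

```python
from itertools import zip_longest

def interleave(src_lemmas, tgt_lemmas):
    """Alternate lemmas of two aligned sentences; the longer one keeps its tail."""
    mixed = []
    for src, tgt in zip_longest(src_lemmas, tgt_lemmas):
        if src is not None:
            mixed.append(src)
        if tgt is not None:
            mixed.append(tgt)
    return mixed

def build_training_text(aligned_pairs):
    """One interleaved lemma sequence per aligned sentence pair."""
    return [interleave(en, fr) for en, fr in aligned_pairs]

pairs = [
    (["eat", "apple"], ["manger", "pomme", "rouge"]),
    (["river"], ["fleuve"]),
]
training = build_training_text(pairs)
print(training)
```

Because word order differs between the two languages, a translation is rarely adjacent to its source lemma, which is consistent with the slide's remark about training with a larger window.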
Similarity between translations in the Bilingual Semantic Space
Results:
  eat / manger: 0.98 (32°)
  fleuve / river: 0.94 (55°)
  green / vert: 0.83 (92°)
  bleu / blue: 0.87 (81°)
  eat / fleuve: 0.77 (107°)
  drink / écran: 0.82 (96°)
Neighborhood in Bilingual Semantic Space
Eat/Manger
eat: 0.976250
manger: 0.976250
consommer: 0.823532 (consume)
boire: 0.818577 (drink)
feed: 0.784077
fumeur: 0.777815 (smoker)
consume: 0.775385
fumer: 0.757367 (to smoke)
cream: 0.744859
Neighborhood in Bilingual Semantic Space
Fleuve/River
river: 0.938150
fleuve: 0.938150
coastline: 0.810345
rivière: 0.807991
alp: 0.801064
sea: 0.774821
lake: 0.771523
coast: 0.761910
littoral: 0.756541 (seashore)
bassin: 0.755235 (basin)
Neighborhood in Bilingual Semantic Space
Vert/Green
vert: 0.825634
green: 0.825634
green: 0.748835
biotechnology: 0.745683
mandelkern: 0.675176
hatch: 0.664682
taslima: 0.633138
cote: 0.628252
converter: 0.624423
orée: 0.616550 (forest border)
hydrogen: 0.611002
Projection: Conclusion
Projecting whole clusters gives variable results
Results in the projection are very disappointing
  Unusable in this state
  It seems that this may simply come from alignment mistakes
Can we improve the projected clusters with a bilingual dictionary?
Relating clusters to Synsets? Not necessarily a good idea: Champagne and Caviar are not related in WN
More generally, “simple” translation may cause undesired broadening of the cluster
Potential application
Statistical processing is interesting because it can capture “usage-based” regularities
Clusters built with LSA can be interesting information sources for the lexicographer
They can also more simply be used to automatically find new semantic types/selectional preferences emerging from the annotation of a new domain (metaphors occurring frequently, for instance)
In a multilingual, collaborative annotation task, this could be useful for transferring work between languages without requiring annotation of a parallel corpus.