![Page 1: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/1.jpg)
Katrin Erk
Distributional models
![Page 2: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/2.jpg)
Representing meaning through collections of words
Doc 1: Abdullah boycotting challenger commission dangerous election interest Karzai necessary runoff Sunday
Doc 2: animatronics are children’s igloo intimidating Jonze kingdom smashing Wild
Doc 3: applications documents engines information iterated library metadata precision query statistical web
Doc 4: cucumbers crops discoveries enjoyable fill garden leafless plotting Vermont you
![Page 3: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/3.jpg)
Representing meaning through collections of words
Doc 1: Abdullah boycotting challenger commission dangerous election interest Karzai necessary runoff Sunday
Source: Washington Post, Oct 24, 2009, on elections in Afghanistan
Doc 2: animatronics are children’s igloo intimidating Jonze kingdom smashing Wild
Source: Wikipedia (version Oct 24, 2009) on the movie “Where the Wild Things Are”
Doc 3: applications documents engines information iterated library metadata precision query statistical web
Source: Wikipedia (version Oct 24, 2009) on Information Retrieval
Doc 4: cucumbers crops discoveries enjoyable fill garden leafless plotting Vermont you
Source: garden.org, “Planning a Vegetable Garden”
![Page 4: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/4.jpg)
Representing meaning through a collection of words
What parts of the meaning of a document can you capture through an unordered collection of words?
How can you make use of such collections?
![Page 5: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/5.jpg)
Representing meaning through a collection of words
What parts of the meaning of a document can you capture through an unordered collection of words?
General topic information: What is the document about?
More specifically: things mentioned in the document
How can you make use of such collections?
Documents on similar topics contain similar words
Use in Information Retrieval (search)
![Page 6: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/6.jpg)
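Representing collections of words through tables of counts
Doc 2: animatronics are children’s igloo intimidating Jonze kingdom smashing Wild

| film | wild | max | that | things | as | [edit] | him | jonze | released |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 24 | 18 | 16 | 16 | 12 | 12 | 11 | 9 | 9 | 6 |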
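A table of counts like this one can be produced with a few lines of code. The following is a minimal sketch, not the procedure used for the slides: the tokenization rule and the sample text are assumptions made for illustration, and the sample text is not the Wikipedia article counted above.

```python
import re
from collections import Counter


def count_words(text):
    """Return a table of counts for the unordered collection of words in text."""
    # Assumed tokenization: lowercase, keep runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Hypothetical mini-document, invented for this example.
sample = "The film follows Max. Max sails to the wild things, and the wild things crown Max."
counts = count_words(sample)

# Print the most frequent words with their counts.
for word, count in counts.most_common(4):
    print(word, count)
```

Word order is discarded entirely: only the counts survive, which is exactly the "unordered collection of words" view of a document.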
![Page 7: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/7.jpg)
Representing collections of words through tables of counts
We can now compare documents by comparing tables of counts.
What can you tell about the second document below?

| | film | wild | max | that | things | as | [edit] | him | jonze | released |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| First document (Doc 2) | 24 | 18 | 16 | 16 | 12 | 12 | 11 | 9 | 9 | 6 |
| Second document | 17 | 0 | 0 | 9 | 0 | 36 | 8 | 7 | 0 | 3 |
![Page 8: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/8.jpg)
The “second document”: a more extensive list of words
the 167, and 58, of 58, to 56, a 49, in 37, as 36, is 33, victor 30, * 27, with 26, by 23, her 18, film 17, for 16, emily 15, was 15, corpse 14, bride 13, victoria 13, his 13, on 13, from 11
What movie is this?
![Page 9: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/9.jpg)
From tables to vectors
Interpret table as a vector:
Each entry is a dimension:
“film” is a dimension. Document’s coordinate: 24
“wild” is a dimension. Document’s coordinate: 18
…
Then this document is a point in 10-dimensional space

| film | wild | max | that | things | as | [edit] | him | jonze | released |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 24 | 18 | 16 | 16 | 12 | 12 | 11 | 9 | 9 | 6 |
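Reading a table as a vector only requires fixing an order for the dimensions. The sketch below uses the ten words and the counts from the table; the names `DIMENSIONS` and `to_vector` are hypothetical helpers introduced here for illustration.

```python
# The ten dimensions, in the order used in the table.
DIMENSIONS = ["film", "wild", "max", "that", "things",
              "as", "[edit]", "him", "jonze", "released"]


def to_vector(counts, dimensions):
    """Map a table of counts to coordinates; words never seen get coordinate 0."""
    return [counts.get(word, 0) for word in dimensions]


# Counts for "Where the Wild Things Are" as given in the table.
wild_things_counts = {"film": 24, "wild": 18, "max": 16, "that": 16, "things": 12,
                      "as": 12, "[edit]": 11, "him": 9, "jonze": 9, "released": 6}

vector = to_vector(wild_things_counts, DIMENSIONS)
print(vector)  # a point in 10-dimensional space
```

Using one shared dimension list for every document is what makes the resulting points comparable.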
![Page 10: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/10.jpg)
Documents as points in vector space
Viewing “Wild Things” and “Corpse Bride” as vectors/points in vector space: Similarity between them as proximity in space
[Figure: “Corpse Bride” and “Where the Wild Things Are” shown as points in vector space]
“Distributional model”, “vector space model”, “semantic space model” used interchangeably here
![Page 11: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/11.jpg)
What have we gained?
Representation of document in vector space can be computed completely automatically: Just count words
Similarity in vector space is a good predictor for similarity in topic
Documents that contain similar words tend to be about similar things
![Page 12: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/12.jpg)
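What do we mean by “similarity” of vectors?
Euclidean distance (a dissimilarity measure!):
$$d(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
[Figure: “Corpse Bride” and “Where the Wild Things Are” shown as points in vector space]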
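As a worked example, the Euclidean distance between the two ten-dimensional count vectors from the earlier tables can be computed as follows. This is a small sketch; the function and variable names are chosen here for illustration.

```python
import math


def euclidean_distance(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Count vectors over: film wild max that things as [edit] him jonze released
wild_things = [24, 18, 16, 16, 12, 12, 11, 9, 9, 6]
corpse_bride = [17, 0, 0, 9, 0, 36, 8, 7, 0, 3]

distance = euclidean_distance(wild_things, corpse_bride)
print(round(distance, 2))  # square root of 1501, roughly 38.74
```

A larger value means the documents are further apart, which is why this is a dissimilarity measure rather than a similarity measure.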
![Page 13: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/13.jpg)
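What do we mean by “similarity” of vectors?
Cosine similarity:
$$\cos(\vec{x}, \vec{y}) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$
[Figure: “Corpse Bride” and “Where the Wild Things Are” shown as points in vector space]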
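The cosine of the angle between the same two count vectors can be sketched like this. Returning 0.0 for an all-zero vector is a convention assumed here, not something the slides specify.

```python
import math


def cosine_similarity(x, y):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumed convention for empty documents
    return dot / (norm_x * norm_y)


# Count vectors over: film wild max that things as [edit] him jonze released
wild_things = [24, 18, 16, 16, 12, 12, 11, 9, 9, 6]
corpse_bride = [17, 0, 0, 9, 0, 36, 8, 7, 0, 3]

similarity = cosine_similarity(wild_things, corpse_bride)
print(round(similarity, 3))  # roughly 0.607
```

Unlike Euclidean distance, cosine ignores vector length: doubling every count in a document leaves its cosine to other documents unchanged, so long and short documents on the same topic still come out as similar.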
![Page 14: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/14.jpg)
What have we gained?
We can compute the similarity of documents through their Euclidean distance or through their cosine
We can also represent a query as a vector: Just count the words in the query
Now we can search for documents similar to the query
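Search by similarity to a query can be sketched as follows. The three document word collections are taken from the first slides (lowercased); the query string, the whitespace tokenization, and the function names are assumptions for illustration, and raw counts are used without any weighting.

```python
import math
from collections import Counter

# Word collections for three of the example documents (lowercased).
DOCS = {
    "Doc 1": "abdullah boycotting challenger commission dangerous election "
             "interest karzai necessary runoff sunday",
    "Doc 3": "applications documents engines information iterated library "
             "metadata precision query statistical web",
    "Doc 4": "cucumbers crops discoveries enjoyable fill garden leafless "
             "plotting vermont you",
}


def cosine(c1, c2):
    """Cosine between two Counters; missing words count as 0."""
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def search(query, docs):
    """Rank documents by cosine similarity to the query, best first."""
    query_counts = Counter(query.lower().split())
    scored = [(cosine(query_counts, Counter(text.split())), name)
              for name, text in docs.items()]
    return sorted(scored, reverse=True)


ranking = search("information library query", DOCS)
for score, name in ranking:
    print(name, round(score, 3))
```

The query is treated exactly like a very short document, so no separate machinery is needed for it.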
![Page 15: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/15.jpg)
From documents to words
Same holds for words as for documents:
Context words are a good indicator of meaning
Similar words tend to occur in similar contexts
What is a context? How do we count here?
Take all the occurrences of our target word in a large text
Take a context window, e.g. 10 words either side
Count all that occurs there
![Page 16: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/16.jpg)
Representing the meaning of a word through a collection of context words
Emerging from the earth is Emily, the "Corpse Bride," a beautiful undead girl in a moldy bridal gown who declares Victor her husband.
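
| a | the | corpse | emerging | from | is | undead | beautiful | moldy | bride | in | earth | girl |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

Counts for target “Emily”, 10 words context either side.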
![Page 17: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/17.jpg)
Representing the meaning of a word through a collection of context words
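Go through all occurrences of “Emily” in a large corpus
Count words in 10-word window for each occurrence, sum up

| a | the | corpse | emerging | from | is | undead | beautiful | moldy | bride | in | earth | girl |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |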
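The counting procedure can be sketched in a few lines. Applied to the single example sentence about Emily, it reproduces the counts in the table; on a real corpus the same function would sum over every occurrence of the target. The tokenization (lowercased runs of letters) and the function name are assumptions made for this example.

```python
import re
from collections import Counter


def context_counts(text, target, window=10):
    """Sum the counts of words within `window` tokens of each target occurrence."""
    # Assumed tokenization: lowercase, keep runs of letters only.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts


sentence = ('Emerging from the earth is Emily, the "Corpse Bride," a beautiful '
            'undead girl in a moldy bridal gown who declares Victor her husband.')
emily = context_counts(sentence, "emily")
print(emily.most_common())
```

Note that "bridal" and everything after it fall outside the window on the right, while the left context is cut short by the start of the sentence.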
![Page 18: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/18.jpg)
Some co-occurrences: “letter” in “Pride and Prejudice”
jane: 12, when: 14, by: 15, which: 16, him: 16, with: 16, elizabeth: 17, but: 17, he: 17, be: 18, s: 20, on: 20, not: 21, for: 21, mr: 22, this: 23, as: 23, you: 25, from: 28, i: 28, had: 32, that: 33, in: 34, was: 34, it: 35, his: 36, she: 41, her: 50, a: 52, and: 56, of: 72, to: 75, the: 102
This is not a large text!
Large = something like 100 million words at least
![Page 19: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/19.jpg)
From tables to vectors
Interpret table as a vector:
Each entry is a dimension:
“admirer” is a dimension. Coordinate of “letter”: 1. Coordinate of “surprise”: 0
“all” is a dimension. Coordinate of “letter”: 8. Coordinate of “surprise”: 7
…
Then each word is a point in n-dimensional space
Counts for “letter” and “surprise” from Pride and Prejudice
![Page 20: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/20.jpg)
What have we gained?
Representation of word in vector space can be computed completely automatically: Just count co-occurring words in all contexts
Similarity in vector space is a good predictor for meaning similarity
Words that occur in similar contexts tend to be similar in meaning
Synonyms are close together in vector space
Antonyms too
![Page 21: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/21.jpg)
Parameters of vector space models
W. Lowe (2001): “Towards a theory of semantic space”
A semantic space defined as a tuple (A, B, S, M)
B: base elements.
A: mapping from raw co-occurrence counts to something else, to correct for frequency effects
S: similarity measure.
M: transformation of the whole space to different dimensions
![Page 22: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/22.jpg)
B: base elements
We have seen: context words as base elements
Term x document matrix:
Represent document as vector of weighted terms
Represent term as vector of weighted documents
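The two readings of a term x document matrix can be made concrete with a toy example. The three mini-documents below are hypothetical, and raw counts stand in for the weights; a real system would typically apply some weighting scheme.

```python
from collections import Counter

# Hypothetical mini-collection, invented for this example.
docs = {
    "d1": "film wild max film",
    "d2": "film corpse bride",
    "d3": "garden crops garden",
}

doc_names = list(docs)
terms = sorted({word for text in docs.values() for word in text.split()})
counts = {name: Counter(text.split()) for name, text in docs.items()}

# Rows are terms, columns are documents.
matrix = [[counts[name][term] for name in doc_names] for term in terms]


def term_vector(term):
    """A row: the term described by the documents it occurs in."""
    return matrix[terms.index(term)]


def document_vector(name):
    """A column: the document described by the terms it contains."""
    column = doc_names.index(name)
    return [row[column] for row in matrix]


print(terms)
print("film ->", term_vector("film"))
print("d1   ->", document_vector("d1"))
```

The same matrix therefore serves both purposes: read by column it supports document search, read by row it gives a (document-based) representation of word meaning.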
![Page 23: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/23.jpg)
B: base elements
Dimensions: not words in a context window, but dependency paths starting from the target word (Pado & Lapata 07)
![Page 24: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/24.jpg)
A: transforming raw counts
Problem with vectors of raw counts: Distortion through frequency of target word
Weight counts: The count on dimension “and” will not be as informative as that on the dimension “angry”
For example, using Pointwise Mutual Information between target a and context word b:
$$PMI(a, b) = \log \frac{P(a, b)}{P(a)\,P(b)}$$
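To illustrate how such a weighting shifts importance from frequent to informative context words, here is a minimal sketch using the standard definition PMI(a, b) = log P(a, b) / (P(a) P(b)), with probabilities estimated from co-occurrence counts. The counts, the target words, and the choice of base-2 logarithm are all assumptions invented for this example.

```python
import math

# Hypothetical co-occurrence counts: target word -> context word -> count.
cooc = {
    "letter": {"and": 40, "angry": 8},
    "walk": {"and": 50, "angry": 2},
}


def pmi(cooc, target, context):
    """Pointwise mutual information estimated from co-occurrence counts."""
    total = sum(sum(row.values()) for row in cooc.values())
    joint = cooc[target].get(context, 0)
    if joint == 0:
        return float("-inf")  # never observed together
    p_joint = joint / total
    p_target = sum(cooc[target].values()) / total
    p_context = sum(row.get(context, 0) for row in cooc.values()) / total
    return math.log2(p_joint / (p_target * p_context))


pmi_and = pmi(cooc, "letter", "and")
pmi_angry = pmi(cooc, "letter", "angry")
print(round(pmi_and, 3), round(pmi_angry, 3))
```

In these made-up counts "and" co-occurs with "letter" five times as often as "angry" does, yet "angry" receives the higher PMI, because "and" occurs with everything while "angry" occurs disproportionately with "letter".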
![Page 25: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/25.jpg)
M: transforming the whole space
Dimensionality reduction:
Principal Component Analysis (PCA)
Singular Value Decomposition (SVD)
Latent Semantic Analysis, LSA (also called Latent Semantic Indexing, LSI): Do SVD on term x document representation to induce “latent” dimensions that correspond to topics that a document can be about
Landauer & Dumais 1997
![Page 26: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/26.jpg)
Using similarity in vector spaces
Search/information retrieval: Given query and document collection,
Use term x document representation: Each document is a vector of weighted terms
Also represent query as vector of weighted terms
Retrieve the documents that are most similar to the query
![Page 27: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/27.jpg)
Using similarity in vector spaces
To find synonyms:
Synonyms tend to have more similar vectors than non-synonyms: Synonyms occur in the same contexts
But the same holds for antonyms: In vector spaces, “good” and “evil” are the same (more or less)
So: vector spaces can be used to build a thesaurus automatically
![Page 28: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/28.jpg)
Using similarity in vector spaces
In cognitive science, to predict
human judgments on how similar pairs of words are (on a scale of 1-10)
“priming”
![Page 29: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/29.jpg)
An automatically extracted thesaurus
Dekang Lin 1998:
For each word, automatically extract similar words
vector space representation based on syntactic context of target (dependency parses)
similarity measure: based on mutual information (“Lin’s measure”)
Large thesaurus, used often in NLP applications
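A similarity measure of this kind can be sketched as follows. The formula used here is the one commonly cited for Lin (1998): the mutual-information weights of the features two words share, summed for both words, divided by the summed weights of all features of both words, counting only features with positive mutual information. The feature dictionaries and their weights are hypothetical, and this sketch is not the original implementation.

```python
def lin_similarity(features_1, features_2):
    """Shared mutual-information mass divided by total mutual-information mass."""
    # Keep only features with positive mutual information.
    t1 = {f: v for f, v in features_1.items() if v > 0}
    t2 = {f: v for f, v in features_2.items() if v > 0}
    shared = t1.keys() & t2.keys()
    numerator = sum(t1[f] + t2[f] for f in shared)
    denominator = sum(t1.values()) + sum(t2.values())
    return numerator / denominator if denominator else 0.0


# Hypothetical syntactic features (relation, word) with made-up MI weights.
letter = {("obj-of", "write"): 2.0, ("mod", "angry"): 1.0}
note = {("obj-of", "write"): 1.5, ("mod", "short"): 0.5}

similarity = lin_similarity(letter, note)
print(similarity)
```

Each dimension here is a syntactic relation paired with a word, rather than a plain context word, which matches the dependency-based representation described above.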
![Page 30: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/30.jpg)
Vectors for word senses
Up to now: one vector per word
Vector for “bank” conflates
financial contexts
fishing contexts
How to get to vectors for word senses?
![Page 31: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/31.jpg)
Automatically inducing word senses
Schütze 1998: one vector per sentence, or per occurrence (token) of “letter”
She wrote an angry letter to her niece.
He sprayed the word in big letters.
The newspaper gets 100 letters from readers every day.
Make token vector by adding up the vectors of all other (content) words in the sentence:
Cluster token vectors
Clusters = induced word senses
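The two steps, summing context word vectors into a token vector and then clustering the token vectors, can be sketched on the three example sentences. Everything numeric here is hypothetical: the two-dimensional word vectors are invented, and the greedy threshold clustering is a simple stand-in chosen for brevity, not the clustering algorithm used by Schütze.

```python
import math

# Hypothetical 2-d word vectors: (correspondence-like, typography-like).
WORD_VECTORS = {
    "wrote": [3.0, 1.0], "angry": [2.0, 0.0], "niece": [2.0, 0.0],
    "sprayed": [0.0, 2.0], "word": [1.0, 3.0], "big": [0.0, 1.0],
    "newspaper": [2.0, 1.0], "gets": [1.0, 0.0], "readers": [3.0, 1.0],
}


def token_vector(context_words):
    """Add up the vectors of the known content words around one occurrence."""
    total = [0.0, 0.0]
    for word in context_words:
        vec = WORD_VECTORS.get(word)
        if vec is not None:
            total = [a + b for a, b in zip(total, vec)]
    return total


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms if norms else 0.0


def cluster(vectors, threshold=0.8):
    """Greedy stand-in: join the first cluster whose seed is similar enough."""
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vectors[members[0]], vec) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Content words around "letter(s)" in the three example sentences.
occurrences = [
    ["wrote", "angry", "niece"],
    ["sprayed", "word", "big"],
    ["newspaper", "gets", "readers"],
]
token_vectors = [token_vector(words) for words in occurrences]
clusters = cluster(token_vectors)
print(clusters)
```

With these made-up vectors the first and third occurrences (mail) end up in one cluster and the second (written character) in another, which is the sense distinction the slide is after.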
![Page 32: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/32.jpg)
A vector for an individual occurrence of a word
Avoid having to define word senses
Sometimes hard to divide uses into senses: words like “leave”, or “paint”
Erk/Pado 2008: Modify vector of “bank” using its syntactic context:
[Figure: “bank” as object (obj) of “break”; “bank” attached to “fish” via “on”]
![Page 33: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/33.jpg)
Summary: vector space models
Representing meaning through counts
Represent document through content words
Represent word meaning through context words / parse tree snippets / documents
Context items as dimensions, target as vector/point in semantic space
Proximity in semantic space ~ similarity between words
![Page 34: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election.](https://reader030.fdocuments.in/reader030/viewer/2022032703/56649d155503460f949eae40/html5/thumbnails/34.jpg)
Summary: vector space models
Uses:
Search
Inducing ontologies
Modeling human judgments of word similarity
Represent word senses
Cluster sentence vectors
Compute vectors for individual occurrences