Posted: 21-Dec-2015
Concepts & Categorization
Measurement of Similarity
• Geometric approach
• Featural approach
Both approaches use vector representations.
Vector-representation for words
• Words represented as vectors of feature values
• Similar words have similar vectors
Word    Feature values
cat     1 2 3 4 8 … 3 4 6 1
dog     2 2 3 5 7 … 2 3 6 1
pet     1 2 4 5 8 … 2 4 6 2
radio   9 8 1 1 2 … 8 1 2 9
How to get vector representations
• Multidimensional scaling on similarity ratings
• Tversky’s (1977) contrast model
• Latent Semantic Analysis (Landauer & Dumais, 1997)
• Topics Model (e.g., Griffiths & Steyvers, 2004)
Multidimensional Scaling (MDS) Approach
• Suppose we have N stimuli
• Measure the (dis)similarity between every pair of stimuli (N x (N-1) / 2 pairs).
• Represent each stimulus as a point in a multidimensional space.
• Similarity is measured by geometric distance, e.g., the Minkowski distance metric:

$$d_{ij} = \left( \sum_{k=1}^{n} \left| x_{ik} - x_{jk} \right|^{r} \right)^{1/r}$$
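The Minkowski metric above can be sketched in a few lines of Python; the feature vectors below are made up for illustration.

```python
def minkowski(x, y, r=2):
    """Minkowski distance between feature vectors x and y.

    r = 2 gives Euclidean distance, r = 1 gives city-block distance.
    """
    return sum(abs(xi - yi) ** r for xi, yi in zip(x, y)) ** (1 / r)

# hypothetical feature vectors for two words
cat = [1, 2, 3, 4, 8]
dog = [2, 2, 3, 5, 7]

euclidean = minkowski(cat, dog, r=2)
cityblock = minkowski(cat, dog, r=1)
```

Varying r changes how strongly large single-dimension differences dominate the distance.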
Multidimensional Scaling
• Represent observed similarities in a multidimensional space; close neighbors should have high similarity
• Multidimensional Scaling: iterative procedure to place points in a (low) dimensional space to model observed similarities
Data: Matrix of (dis)similarity
MDS procedure: move points in space to best model observed similarity relations
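The iterative procedure can be sketched as a simple stress-minimizing descent; the dissimilarity matrix, step size, and iteration count below are made up, and real MDS software uses more careful optimization.

```python
import random

# hypothetical dissimilarity matrix for 4 stimuli laid out like a unit square
D = [
    [0.0, 1.0, 1.414, 1.0],
    [1.0, 0.0, 1.0, 1.414],
    [1.414, 1.0, 0.0, 1.0],
    [1.0, 1.414, 1.0, 0.0],
]
N = len(D)

def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def stress(X):
    """Sum of squared differences between model distances and the data."""
    return sum((dist(X[i], X[j]) - D[i][j]) ** 2
               for i in range(N) for j in range(i + 1, N))

rng = random.Random(0)
X = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(N)]
initial = stress(X)

step = 0.05
for _ in range(500):
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            d = dist(X[i], X[j]) or 1e-9
            g = (d - D[i][j]) / d          # pull points together or push apart
            for k in range(2):
                X[i][k] -= step * g * (X[i][k] - X[j][k])

final = stress(X)   # should be much smaller than `initial`
```

Each sweep nudges every point toward neighbors it is too far from and away from neighbors it is too close to, which is the "move points to best model observed similarity" idea in miniature.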
Example: 2D solution for bold faces
2D solution for fruit words
Critical Assumptions of Geometric Approach
• Psychological distance should obey three axioms
– Minimality: $d(a,b) \geq d(a,a) = d(b,b) = 0$
– Symmetry: $d(a,b) = d(b,a)$
– Triangle inequality: $d(a,b) + d(b,c) \geq d(a,c)$
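These axioms hold by construction for an ordinary Euclidean metric, which can be checked numerically; a minimal sketch with made-up random points:

```python
import itertools
import math
import random

def d(p, q):
    return math.dist(p, q)  # Euclidean distance (Python 3.8+)

rng = random.Random(1)
points = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(10)]

for a, b in itertools.product(points, repeat=2):
    assert d(a, a) == 0                    # minimality
    assert math.isclose(d(a, b), d(b, a))  # symmetry
for a, b, c in itertools.product(points, repeat=3):
    # triangle inequality (small epsilon for floating point)
    assert d(a, c) <= d(a, b) + d(b, c) + 1e-9
```

The point of the slides that follow is that human similarity judgments violate exactly these guarantees.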
For conceptual relations, violations of distance axioms often found
• Similarities can often be asymmetric
“North-Korea” is more similar to “China” than vice versa
“Pomegranate” is more similar to “Apple” than vice versa
• Violations of triangle inequality:
"Lemon" is similar to "Orange", and "Orange" is similar to "Apricot", but "Lemon" is not similar to "Apricot".
Triangle Inequality and similarity constraints on words with multiple meanings
In a Euclidean space, AC ≤ AB + BC. So if "FIELD" is close to both "MAGNETIC" and "SOCCER", then "MAGNETIC" and "SOCCER" are forced to be close to each other as well, contrary to intuition.
Nearest neighbor problem (Tversky & Hutchinson, 1986)
• In similarity data, “Fruit” is nearest neighbor in 18 out of 20 items
• In 2D solution, “Fruit” can be nearest neighbor of at most 5 items
• High-dimensional solutions might solve this but these are less appealing
Feature Contrast Model (Tversky, 1977)
• Represent stimuli with sets of discrete features
• Similarity is an
– increasing function of common features
– decreasing function of distinct features
$$\mathrm{Sim}(I,J) = a\,f(I \cap J) - b\,f(I - J) - c\,f(J - I)$$

where $f(I \cap J)$ measures the common features, $f(I - J)$ the features unique to I, and $f(J - I)$ the features unique to J; a, b, and c are weighting parameters.
Contrast model predicts asymmetries
Weighting parameter b > c
Pomegranate is more similar to apple than vice versa because pomegranate has fewer distinctive features.
Contrast model predicts violations of triangle inequality
Weighting parameters a > b > c (common features should be weighted most heavily)
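A minimal sketch of the contrast model with feature counting as f; the feature sets and weights are hypothetical, chosen so that the more familiar word (apple) has more distinctive features.

```python
def contrast_sim(I, J, a=1.0, b=0.7, c=0.3):
    """Tversky's contrast model: a*f(common) - b*f(unique to I) - c*f(unique to J).

    Here f simply counts features (each feature weighted equally).
    """
    I, J = set(I), set(J)
    return a * len(I & J) - b * len(I - J) - c * len(J - I)

# hypothetical feature sets; apple carries more distinctive features
apple = {"fruit", "round", "red", "sweet", "pie", "juice", "common"}
pomegranate = {"fruit", "round", "red", "seeds"}

# with b > c, similarity is asymmetric:
s_pa = contrast_sim(pomegranate, apple)  # pomegranate compared to apple
s_ap = contrast_sim(apple, pomegranate)  # apple compared to pomegranate
```

Because b > c, the subject's distinctive features are penalized more than the referent's, so `s_pa > s_ap`: pomegranate is judged more similar to apple than vice versa.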
Additive Tree solution
Latent Semantic Analysis (LSA) Landauer & Dumais (1997)
Assumptions
1) words similar in meaning occur in similar verbal contexts (e.g., magazine articles, book chapters, newspaper articles)
2) we can count the number of times words occur in documents and construct a large word x document matrix
3) this co-occurrence matrix contains a wealth of latent semantic information that can be extracted by statistical techniques
4) words can be represented as points in a multidimensional space
Latent Semantic Analysis (Landauer & Dumais, 1997)
[Figure: words such as FIELD, GRASS, CORN, MEADOW, BASEBALL, FOOTBALL, and MAJOR shown as points in a high-dimensional space]

Terms x documents matrix of counts:

            1    2   …   D
FIELD      12    5        2
MEADOW      4
BASEBALL   10
…
MAJOR       5
Information in the matrix is compressed; relationships between words are captured indirectly, through the other words they co-occur with.
Problem: LSA has to obey the triangle inequality
In a Euclidean space, AC ≤ AB + BC: if "FIELD" is close to both "MAGNETIC" and "SOCCER", then "MAGNETIC" and "SOCCER" cannot be placed far apart.
The Topics Model (Griffiths & Steyvers, 2002 & 2003)
• A probabilistic version of LSA: no spatial constraints.
• Each document (i.e., context) is a mixture of topics; each topic is a distribution over words
• Each word is chosen from a single topic:

$$P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j)\, P(z_i = j)$$

where $P(w_i \mid z_i = j)$ is the word probability in topic j and $P(z_i = j)$ is the probability of topic j in the document.
A toy example
Mixture components:
• Topic 1, P(w | z = 1): HEART 0.3, LOVE 0.2, SOUL 0.2, TEARS 0.1, MYSTERY 0.1, JOY 0.1
• Topic 2, P(w | z = 2): SCIENTIFIC 0.4, KNOWLEDGE 0.2, WORK 0.1, RESEARCH 0.1, MATHEMATICS 0.1, MYSTERY 0.1
Each word w_i is generated by first sampling a topic from the topic mixture P(z), then sampling the word from that topic's distribution.
Words can occur in multiple topics (e.g., MYSTERY appears in both toy topics)
All probability to topic 1: P(z = 1) = 1, P(z = 2) = 0, so every word is drawn from topic 1's distribution.
Document: HEART, LOVE, JOY, SOUL, HEART, …
All probability to topic 2: P(z = 1) = 0, P(z = 2) = 1, so every word is drawn from topic 2's distribution.
Document: SCIENTIFIC, KNOWLEDGE, SCIENTIFIC, RESEARCH, …
Mixing topics 1 and 2: P(z = 1) = 0.5, P(z = 2) = 0.5, so words are drawn from both distributions.
Document: LOVE, SCIENTIFIC, HEART, SOUL, KNOWLEDGE, RESEARCH, …
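The generative story above can be sketched directly, using the two toy topics from the slides; the document lengths and random seed are arbitrary.

```python
import random

# the two toy topic distributions from the slides
topics = {
    1: {"HEART": 0.3, "LOVE": 0.2, "SOUL": 0.2, "TEARS": 0.1,
        "MYSTERY": 0.1, "JOY": 0.1},
    2: {"SCIENTIFIC": 0.4, "KNOWLEDGE": 0.2, "WORK": 0.1, "RESEARCH": 0.1,
        "MATHEMATICS": 0.1, "MYSTERY": 0.1},
}

def sample_document(theta, n_words, rng):
    """For each word: sample a topic z from the mixture theta = P(z),
    then sample the word from P(w | z)."""
    doc = []
    for _ in range(n_words):
        z = rng.choices(list(theta), weights=list(theta.values()))[0]
        words = list(topics[z])
        probs = list(topics[z].values())
        doc.append(rng.choices(words, weights=probs)[0])
    return doc

rng = random.Random(0)
doc1 = sample_document({1: 1.0, 2: 0.0}, 8, rng)     # only topic-1 words
doc_mix = sample_document({1: 0.5, 2: 0.5}, 8, rng)  # words from either topic
```

Note that MYSTERY can be generated by either topic, which is exactly the "words can occur in multiple topics" point.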
Application to corpus data
• TASA corpus: text from first grade to college, a representative sample of text
• 26,000+ word types (stop words removed)
• 37,000+ documents
• 6,000,000+ word tokens
A selection from 500 topics:
• THEORY, SCIENTISTS, EXPERIMENT, OBSERVATIONS, SCIENTIFIC, EXPERIMENTS, HYPOTHESIS, EXPLAIN, SCIENTIST, OBSERVED, EXPLANATION, BASED, OBSERVATION, IDEA, EVIDENCE, THEORIES, BELIEVED, DISCOVERED, OBSERVE, FACTS
• SPACE, EARTH, MOON, PLANET, ROCKET, MARS, ORBIT, ASTRONAUTS, FIRST, SPACECRAFT, JUPITER, SATELLITE, SATELLITES, ATMOSPHERE, SPACESHIP, SURFACE, SCIENTISTS, ASTRONAUT, SATURN, MILES
• ART, PAINT, ARTIST, PAINTING, PAINTED, ARTISTS, MUSEUM, WORK, PAINTINGS, STYLE, PICTURES, WORKS, OWN, SCULPTURE, PAINTER, ARTS, BEAUTIFUL, DESIGNS, PORTRAIT, PAINTERS
• STUDENTS, TEACHER, STUDENT, TEACHERS, TEACHING, CLASS, CLASSROOM, SCHOOL, LEARNING, PUPILS, CONTENT, INSTRUCTION, TAUGHT, GROUP, GRADE, SHOULD, GRADES, CLASSES, PUPIL, GIVEN
• BRAIN, NERVE, SENSE, SENSES, ARE, NERVOUS, NERVES, BODY, SMELL, TASTE, TOUCH, MESSAGES, IMPULSES, CORD, ORGANS, SPINAL, FIBERS, SENSORY, PAIN, IS
• CURRENT, ELECTRICITY, ELECTRIC, CIRCUIT, IS, ELECTRICAL, VOLTAGE, FLOW, BATTERY, WIRE, WIRES, SWITCH, CONNECTED, ELECTRONS, RESISTANCE, POWER, CONDUCTORS, CIRCUITS, TUBE, NEGATIVE
• FIELD, MAGNETIC, MAGNET, WIRE, NEEDLE, CURRENT, COIL, POLES, IRON, COMPASS, LINES, CORE, ELECTRIC, DIRECTION, FORCE, MAGNETS, BE, MAGNETISM, POLE, INDUCED
• SCIENCE, STUDY, SCIENTISTS, SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, CHEMISTRY, TECHNOLOGY, MANY, MATHEMATICS, BIOLOGY, FIELD, PHYSICS, LABORATORY, STUDIES, WORLD, SCIENTIST, STUDYING, SCIENCES
• BALL, GAME, TEAM, FOOTBALL, BASEBALL, PLAYERS, PLAY, FIELD, PLAYER, BASKETBALL, COACH, PLAYED, PLAYING, HIT, TENNIS, TEAMS, GAMES, SPORTS, BAT, TERRY
• JOB, WORK, JOBS, CAREER, EXPERIENCE, EMPLOYMENT, OPPORTUNITIES, WORKING, TRAINING, SKILLS, CAREERS, POSITIONS, FIND, POSITION, FIELD, OCCUPATIONS, REQUIRE, OPPORTUNITY, EARN, ABLE
Polysemy: words with multiple meanings represented in different topics
No problem of triangle inequality
[Figure: FIELD appears in topic 1 together with MAGNETIC, and in topic 2 together with SOCCER]
Topic structure easily explains violations of the triangle inequality.
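A minimal sketch of why topics escape the metric constraint, using two hypothetical topics and a toy co-generation measure of similarity (not the model's actual similarity computation):

```python
# hypothetical topic-word probabilities for the polysemous word "FIELD"
topics = [
    {"FIELD": 0.5, "MAGNETIC": 0.5},   # a magnetism topic
    {"FIELD": 0.5, "SOCCER": 0.5},     # a sports topic
]

def sim(w1, w2):
    """Chance that a uniformly chosen topic generates both words."""
    return sum(t.get(w1, 0.0) * t.get(w2, 0.0) for t in topics) / len(topics)

# FIELD is similar to both MAGNETIC and SOCCER, yet MAGNETIC and SOCCER
# share no topic and hence no similarity at all; no spatial arrangement
# of three points could reproduce this pattern
```

Because similarity is mediated by shared topics rather than distances, there is no triangle inequality to obey.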
How to get vector representations
• Multidimensional scaling on similarity ratings
• Tversky’s (1977) contrast model
• Latent Semantic Analysis (Landauer & Dumais, 1997)
• Topics Model (e.g., Griffiths & Steyvers, 2004)