Concepts & Categorization. Measurement of Similarity Geometric approach Featural approach both are...

Date posted: 21-Dec-2015

Page 1: Concepts & Categorization. Measurement of Similarity Geometric approach Featural approach  both are vector representations.

Concepts & Categorization

Page 2:

Measurement of Similarity

• Geometric approach

• Featural approach

both are vector representations

Page 3:

Vector-representation for words

• Words represented as vectors of feature values

• Similar words have similar vectors

[Table: feature-value vectors for four example words; digit strings as extracted]

radio: 98112…8129

pet: 12458…2462

dog: 22357…2361

cat: 12348…3461
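The idea that similar words have similar vectors can be sketched with a cosine comparison. The feature values below are invented for illustration (the slide's own values cannot be recovered reliably), and cosine similarity is only one possible choice of measure.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical feature values, not taken from the slide.
vectors = {
    "cat":   [1, 6, 4, 3, 8],
    "dog":   [1, 6, 3, 2, 7],
    "radio": [9, 2, 1, 8, 1],
}

cat_dog = cosine_similarity(vectors["cat"], vectors["dog"])
cat_radio = cosine_similarity(vectors["cat"], vectors["radio"])
print(f"cat-dog: {cat_dog:.3f}, cat-radio: {cat_radio:.3f}")
```

With these made-up values, "cat" comes out closer to "dog" than to "radio".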

Page 4:

How to get vector representations

• Multidimensional scaling on similarity ratings

• Tversky’s (1977) contrast model

• Latent Semantic Analysis (Landauer & Dumais, 1997)

• Topics Model (e.g., Griffiths & Steyvers, 2004)

Page 5:

Multidimensional Scaling (MDS) Approach

• Suppose we have N stimuli

• Measure the (dis)similarity between every pair of stimuli (N x (N-1) / 2 pairs).

• Represent each stimulus as a point in a multidimensional space.

• Similarity is measured by geometric distance, e.g., Minkowski distance metric:

$$d_{ij} = \left[ \sum_{k=1}^{n} \left| x_{ik} - x_{jk} \right|^{r} \right]^{1/r}$$
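A minimal sketch of the Minkowski metric follows. The function name and the example points are illustrative; r = 1 gives the city-block distance and r = 2 the Euclidean distance.

```python
def minkowski_distance(x, y, r):
    """Minkowski distance between two points given as equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

# Two hypothetical stimuli in a 2D psychological space.
a = (0.0, 0.0)
b = (3.0, 4.0)

city_block = minkowski_distance(a, b, r=1)  # sum of coordinate differences
euclidean = minkowski_distance(a, b, r=2)   # straight-line distance
print(city_block, euclidean)
```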

Page 6:

Multidimensional Scaling

• Represent observed similarities by a multidimensional space – close neighbors should have high similarity

• Multidimensional Scaling: iterative procedure to place points in a (low) dimensional space to model observed similarities

Page 7:

Data: Matrix of (dis)similarity

Page 8:

MDS procedure: move points in space to best model observed similarity relations
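One way to picture "moving points to model the observed relations" is plain gradient descent on raw stress, the sum of squared differences between configuration distances and observed dissimilarities. This is a sketch under that assumption, not the specific algorithm behind the slides; the function names, learning rate, iteration count, and the unit-square test data are all illustrative.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distances between all rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(X, D):
    """Raw stress: squared error between configuration distances and D."""
    upper = np.triu_indices(len(D), k=1)
    return float(((pairwise_distances(X) - D)[upper] ** 2).sum())

def mds(D, n_dims=2, n_iter=500, lr=0.01, seed=0):
    """Move randomly placed points downhill on raw stress."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(len(D), n_dims))
    for _ in range(n_iter):
        diff = X[:, None, :] - X[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        np.fill_diagonal(dist, 1.0)   # avoid dividing by zero on the diagonal
        coef = (dist - D) / dist      # assumes no two distinct points coincide
        np.fill_diagonal(coef, 0.0)
        # Gradient of stress with respect to each point, then one descent step.
        X -= lr * 2.0 * (coef[:, :, None] * diff).sum(axis=1)
    return X

# Dissimilarities generated from four corners of a unit square (toy data).
true_points = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
D = pairwise_distances(true_points)
X = mds(D)
print("final stress:", stress(X, D))
```

The recovered configuration is only determined up to rotation, reflection, and translation, so it is the distances, not the coordinates, that should be compared with the data.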

Page 9:

Example: 2D solution for bold faces

Page 10:

2D solution for fruit words

Page 11:

Critical Assumptions of Geometric Approach

• Psychological distance should obey three axioms

– Minimality: $d(a,b) \geq d(a,a) = d(b,b) = 0$

– Symmetry: $d(a,b) = d(b,a)$

– Triangle inequality: $d(a,b) + d(b,c) \geq d(a,c)$
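The three axioms can be checked mechanically on any matrix of dissimilarities. The sketch below assumes a square matrix given as nested lists; the function name and the three-item example values are hypothetical.

```python
def check_distance_axioms(d, tol=1e-9):
    """Report which distance axioms a square dissimilarity matrix satisfies."""
    items = range(len(d))
    minimality = all(abs(d[i][i]) <= tol for i in items) and all(
        d[i][j] >= -tol for i in items for j in items
    )
    symmetry = all(
        abs(d[i][j] - d[j][i]) <= tol for i in items for j in items
    )
    triangle = all(
        d[i][j] + d[j][k] >= d[i][k] - tol
        for i in items for j in items for k in items
    )
    return {"minimality": minimality, "symmetry": symmetry, "triangle": triangle}

# Hypothetical judgments: items 0 and 1 are close, 1 and 2 are close,
# but 0 and 2 are rated as very dissimilar.
ratings = [
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 1.0],
    [5.0, 1.0, 0.0],
]
result = check_distance_axioms(ratings)
print(result)
```

Here the triangle inequality fails because 1 + 1 < 5, while minimality and symmetry hold.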

Page 12:

For conceptual relations, violations of the distance axioms are often found

• Similarities can often be asymmetric

“North-Korea” is more similar to “China” than vice versa

“Pomegranate” is more similar to “Apple” than vice versa

• Violations of triangle inequality:

[Figure: “Lemon”, “Orange”, and “Apricot”]

Page 13:

Triangle Inequality and similarity constraints on words with multiple meanings

Euclidean distance: AC ≤ AB + BC

[Figure: triangle over the words FIELD, MAGNETIC, and SOCCER, with sides AB, BC, and AC]

Page 14:

Nearest neighbor problem (Tversky & Hutchinson, 1986)

• In similarity data, “Fruit” is nearest neighbor in 18 out of 20 items

• In 2D solution, “Fruit” can be nearest neighbor of at most 5 items

• High-dimensional solutions might solve this but these are less appealing

Page 15:

Feature Contrast Model (Tversky, 1977)

• Represent stimuli with sets of discrete features

• Similarity is an

– increasing function of common features

– decreasing function of distinct features

$$\mathrm{Sim}(I, J) = a\,f(I \cap J) - b\,f(I - J) - c\,f(J - I)$$

where $I \cap J$ are the common features, $I - J$ the features unique to I, and $J - I$ the features unique to J

a, b, and c are weighting parameters
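A minimal sketch of the contrast model, assuming that f simply counts the features in a set (the model itself allows other salience functions). The feature sets and the default weights are invented for illustration.

```python
def contrast_similarity(i_features, j_features, a=1.0, b=0.5, c=0.5):
    """Tversky-style contrast similarity with f taken as set size (assumption)."""
    common = i_features & j_features   # features shared by I and J
    only_i = i_features - j_features   # features unique to I
    only_j = j_features - i_features   # features unique to J
    return a * len(common) - b * len(only_i) - c * len(only_j)

# Hypothetical feature sets.
dog = {"animal", "pet", "four_legs", "barks"}
cat = {"animal", "pet", "four_legs", "meows"}

sim_dog_cat = contrast_similarity(dog, cat)
print(sim_dog_cat)
```

With b equal to c, as in the defaults here, the measure is symmetric; unequal weights make it direction-dependent.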

Page 16:

Contrast model predicts asymmetries

Weighting parameter b > c

Pomegranate is more similar to apple than vice versa because pomegranate has fewer distinctive features
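A worked example makes the asymmetry concrete. The numbers are hypothetical: assume f counts features, the two concepts share 3 features, pomegranate has 1 distinctive feature, apple has 4, and the weights are a = 1, b = 0.8, c = 0.2 (so b > c).

$$\mathrm{Sim}(\text{pomegranate}, \text{apple}) = 1 \cdot 3 - 0.8 \cdot 1 - 0.2 \cdot 4 = 1.4$$

$$\mathrm{Sim}(\text{apple}, \text{pomegranate}) = 1 \cdot 3 - 0.8 \cdot 4 - 0.2 \cdot 1 = -0.4$$

Because the first argument's distinctive features carry the larger weight b, putting the feature-poor concept first yields the higher similarity.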

Page 17:

Contrast model predicts violations of triangle inequality

Weighting parameter a > b > c (common feature should be weighted more)

Page 18:

Additive Tree solution

Page 19:

Latent Semantic Analysis (LSA) Landauer & Dumais (1997)

Assumptions

1) words similar in meaning occur in similar verbal contexts (e.g., magazine articles, book chapters, newspaper articles)

2) we can count number of times words occur in documents and construct a large word x document matrix

3) this co-occurrence matrix contains a wealth of latent semantic information that can be extracted by statistical techniques

4) words can be represented as points in a multidimensional space
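The four assumptions can be sketched end to end on a toy matrix. LSA is usually implemented with a truncated singular value decomposition (SVD), which is what this sketch uses; the counts are invented, the choice of k = 2 dimensions is arbitrary, and the term weighting normally applied before the SVD is omitted.

```python
import numpy as np

words = ["field", "meadow", "grass", "baseball", "football"]

# Rows are words, columns are documents; the counts are made-up toy data.
counts = np.array([
    [2, 1, 1, 1],   # field
    [1, 2, 0, 0],   # meadow
    [2, 1, 0, 0],   # grass
    [0, 0, 2, 1],   # baseball
    [0, 0, 1, 2],   # football
], dtype=float)

def lsa_vectors(matrix, k):
    """Word coordinates in the k leading SVD dimensions."""
    U, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k] * s[:k]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

vecs = dict(zip(words, lsa_vectors(counts, k=2)))
meadow_grass = cosine(vecs["meadow"], vecs["grass"])
meadow_baseball = cosine(vecs["meadow"], vecs["baseball"])
print(f"meadow-grass: {meadow_grass:.3f}, meadow-baseball: {meadow_baseball:.3f}")
```

In this toy data "meadow" and "grass" share documents while "meadow" and "baseball" never co-occur, so the first pair ends up closer in the reduced space.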

Page 20:

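Latent Semantic Analysis (Landauer & Dumais, ’97)

[Figure: a TERMS x DOCUMENTS count matrix (rows FIELD, MEADOW, BASEBALL, …, MAJOR; columns 1, 2, …, D; visible entries: FIELD 12, 5, 2; MEADOW 4; BASEBALL 10; MAJOR 5) is mapped to points in a high dimensional space containing FIELD, GRASS, CORN, MEADOW, BASEBALL, MAJOR, and FOOTBALL]

Information in the matrix is compressed; relationships between words through other words are used.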

Page 21:

Problem: LSA has to obey triangle inequality

Euclidean distance: AC ≤ AB + BC

[Figure: triangle over the words FIELD, MAGNETIC, and SOCCER, with sides AB, BC, and AC]

Page 22:

The Topics Model (Griffiths & Steyvers, 2002 & 2003)

• A probabilistic version of LSA: no spatial constraints.

• Each document (i.e. context) is a mixture of topics. Each topic is a distribution over words

• Each word chosen from a single topic:

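$$P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j)\, P(z_i = j)$$

where $P(w_i \mid z_i = j)$ is the word probability in topic j and $P(z_i = j)$ is the probability of topic j in the document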
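The sum over topics can be computed directly. The sketch below uses the two topic distributions from the deck's toy example; the function name and data layout are illustrative choices.

```python
def word_probability(word, topic_word_probs, topic_probs):
    """P(w) = sum over topics j of P(w | z = j) * P(z = j)."""
    return sum(
        p_topic * p_word.get(word, 0.0)
        for p_word, p_topic in zip(topic_word_probs, topic_probs)
    )

# P(w | z) for the two topics of the toy example.
topics = [
    {"HEART": 0.3, "LOVE": 0.2, "SOUL": 0.2,
     "TEARS": 0.1, "MYSTERY": 0.1, "JOY": 0.1},
    {"SCIENTIFIC": 0.4, "KNOWLEDGE": 0.2, "WORK": 0.1,
     "RESEARCH": 0.1, "MATHEMATICS": 0.1, "MYSTERY": 0.1},
]

# A document that mixes both topics equally.
mixture = [0.5, 0.5]
p_heart = word_probability("HEART", topics, mixture)
p_mystery = word_probability("MYSTERY", topics, mixture)
print(p_heart, p_mystery)
```

MYSTERY receives probability from both topics, whereas HEART is supported only by the first.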

Page 23:

A toy example

TOPIC MIXTURE: P( z = 1 ), P( z = 2 )

MIXTURE COMPONENTS:

Topic 1, P( w | z = 1 ): HEART 0.3, LOVE 0.2, SOUL 0.2, TEARS 0.1, MYSTERY 0.1, JOY 0.1

Topic 2, P( w | z = 2 ): SCIENTIFIC 0.4, KNOWLEDGE 0.2, WORK 0.1, RESEARCH 0.1, MATHEMATICS 0.1, MYSTERY 0.1

Generated word: w_i

Words can occur in multiple topics

Page 24:

All probability to topic 1…

TOPIC MIXTURE: P( z = 1 ) = 1, P( z = 2 ) = 0

MIXTURE COMPONENTS:

Topic 1, P( w | z = 1 ): HEART 0.3, LOVE 0.2, SOUL 0.2, TEARS 0.1, MYSTERY 0.1, JOY 0.1

Topic 2, P( w | z = 2 ): SCIENTIFIC 0.4, KNOWLEDGE 0.2, WORK 0.1, RESEARCH 0.1, MATHEMATICS 0.1, MYSTERY 0.1

Document: HEART, LOVE, JOY, SOUL, HEART, ….

Page 25:

All probability to topic 2 …

TOPIC MIXTURE: P( z = 1 ) = 0, P( z = 2 ) = 1

MIXTURE COMPONENTS:

Topic 1, P( w | z = 1 ): HEART 0.3, LOVE 0.2, SOUL 0.2, TEARS 0.1, MYSTERY 0.1, JOY 0.1

Topic 2, P( w | z = 2 ): SCIENTIFIC 0.4, KNOWLEDGE 0.2, WORK 0.1, RESEARCH 0.1, MATHEMATICS 0.1, MYSTERY 0.1

Document: SCIENTIFIC, KNOWLEDGE, SCIENTIFIC, RESEARCH, ….

Page 26:

Mixing topic 1 and 2

TOPIC MIXTURE: P( z = 1 ) = 0.5, P( z = 2 ) = 0.5

MIXTURE COMPONENTS:

Topic 1, P( w | z = 1 ): HEART 0.3, LOVE 0.2, SOUL 0.2, TEARS 0.1, MYSTERY 0.1, JOY 0.1

Topic 2, P( w | z = 2 ): SCIENTIFIC 0.4, KNOWLEDGE 0.2, WORK 0.1, RESEARCH 0.1, MATHEMATICS 0.1, MYSTERY 0.1

Document: LOVE, SCIENTIFIC, HEART, SOUL, KNOWLEDGE, RESEARCH, ….
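The three toy documents can be reproduced by sampling from the generative process: for each word, first draw a topic from the document's topic mixture, then draw a word from that topic. This is a sketch of that process only (not of how topics are learned from data); the function name, seed, and document length are illustrative.

```python
import random

# P(w | z) for the two topics of the toy example.
TOPICS = [
    {"HEART": 0.3, "LOVE": 0.2, "SOUL": 0.2,
     "TEARS": 0.1, "MYSTERY": 0.1, "JOY": 0.1},
    {"SCIENTIFIC": 0.4, "KNOWLEDGE": 0.2, "WORK": 0.1,
     "RESEARCH": 0.1, "MATHEMATICS": 0.1, "MYSTERY": 0.1},
]

def generate_document(topic_probs, n_words, seed=0):
    """Sample words by drawing a topic, then a word from that topic."""
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(TOPICS)), weights=topic_probs)[0]
        vocab = list(TOPICS[z])
        weights = list(TOPICS[z].values())
        words.append(rng.choices(vocab, weights=weights)[0])
    return words

only_topic_1 = generate_document([1.0, 0.0], n_words=20)
mixed = generate_document([0.5, 0.5], n_words=20)
print(only_topic_1)
print(mixed)
```

With all probability on topic 1, every sampled word comes from the first component; with a 0.5/0.5 mixture, words from both components can appear in the same document.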

Page 27:

Application to corpus data

• TASA corpus: text from first grade to college

– representative sample of text

• 26,000+ word types (stop words removed)

• 37,000+ documents

• 6,000,000+ word tokens

Page 28:

A selection from 500 topics (one topic per line)

THEORY, SCIENTISTS, EXPERIMENT, OBSERVATIONS, SCIENTIFIC, EXPERIMENTS, HYPOTHESIS, EXPLAIN, SCIENTIST, OBSERVED, EXPLANATION, BASED, OBSERVATION, IDEA, EVIDENCE, THEORIES, BELIEVED, DISCOVERED, OBSERVE, FACTS

SPACE, EARTH, MOON, PLANET, ROCKET, MARS, ORBIT, ASTRONAUTS, FIRST, SPACECRAFT, JUPITER, SATELLITE, SATELLITES, ATMOSPHERE, SPACESHIP, SURFACE, SCIENTISTS, ASTRONAUT, SATURN, MILES

ART, PAINT, ARTIST, PAINTING, PAINTED, ARTISTS, MUSEUM, WORK, PAINTINGS, STYLE, PICTURES, WORKS, OWN, SCULPTURE, PAINTER, ARTS, BEAUTIFUL, DESIGNS, PORTRAIT, PAINTERS

STUDENTS, TEACHER, STUDENT, TEACHERS, TEACHING, CLASS, CLASSROOM, SCHOOL, LEARNING, PUPILS, CONTENT, INSTRUCTION, TAUGHT, GROUP, GRADE, SHOULD, GRADES, CLASSES, PUPIL, GIVEN

BRAIN, NERVE, SENSE, SENSES, ARE, NERVOUS, NERVES, BODY, SMELL, TASTE, TOUCH, MESSAGES, IMPULSES, CORD, ORGANS, SPINAL, FIBERS, SENSORY, PAIN, IS

CURRENT, ELECTRICITY, ELECTRIC, CIRCUIT, IS, ELECTRICAL, VOLTAGE, FLOW, BATTERY, WIRE, WIRES, SWITCH, CONNECTED, ELECTRONS, RESISTANCE, POWER, CONDUCTORS, CIRCUITS, TUBE, NEGATIVE

Page 29:

FIELD, MAGNETIC, MAGNET, WIRE, NEEDLE, CURRENT, COIL, POLES, IRON, COMPASS, LINES, CORE, ELECTRIC, DIRECTION, FORCE, MAGNETS, BE, MAGNETISM, POLE, INDUCED

SCIENCE, STUDY, SCIENTISTS, SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, CHEMISTRY, TECHNOLOGY, MANY, MATHEMATICS, BIOLOGY, FIELD, PHYSICS, LABORATORY, STUDIES, WORLD, SCIENTIST, STUDYING, SCIENCES

BALL, GAME, TEAM, FOOTBALL, BASEBALL, PLAYERS, PLAY, FIELD, PLAYER, BASKETBALL, COACH, PLAYED, PLAYING, HIT, TENNIS, TEAMS, GAMES, SPORTS, BAT, TERRY

JOB, WORK, JOBS, CAREER, EXPERIENCE, EMPLOYMENT, OPPORTUNITIES, WORKING, TRAINING, SKILLS, CAREERS, POSITIONS, FIND, POSITION, FIELD, OCCUPATIONS, REQUIRE, OPPORTUNITY, EARN, ABLE

Polysemy: words with multiple meanings represented in different topics

Page 30:

No Problem of Triangle Inequality

[Figure: the words SOCCER, MAGNETIC, and FIELD arranged under TOPIC 1 and TOPIC 2]

Topic structure easily explains violations of triangle inequality

Page 31:

How to get vector representations

• Multidimensional scaling on similarity ratings

• Tversky’s (1977) contrast model

• Latent Semantic Analysis (Landauer & Dumais, 1997)

• Topics Model (e.g., Griffiths & Steyvers, 2004)