Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive...

72
Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine oint work with: om Griffiths, UC Berkeley adhraic Smyth, UC Irvine ave Newman, UC Irvine haitanya Chemudugunta, UC Irvine
  • date post

    20-Jan-2016
  • Category

    Documents

  • view

    219
  • download

    0

Transcript of Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive...

Page 1: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Topic Models for Semantic Memory and Information Extraction

Mark SteyversDepartment of Cognitive Sciences

University of California, Irvine

Joint work with:Tom Griffiths, UC BerkeleyPadhraic Smyth, UC IrvineDave Newman, UC IrvineChaitanya Chemudugunta, UC Irvine

Page 2: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Human MemoryInformation

Retrieval

Finding relevant memories

Extracting meaning

Finding relevant documents

Extracting content

Page 3: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Semantic Representations

• What are suitable representations?

• How can it be learned from experience?

Page 4: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

MONEY

• Structured representations • Encoding of propositions• Hand coded examples• Not inferred from data

CASH

BILL

LOAN

MONEY

BANK

RIVER STREAM

Two approaches to semantic representation

Semantic networkse.g. Collins & Quillian 1969

Semantic Spacese.g. Landauer & Dumais, 1997

LOANCASH

RIVER

BANK

BILL

• Relatively unstructured representations• Can be learned from data

Page 5: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Overview

I Probabilistic Topic Models

II Associative Memory

III Episodic Memory

IV Applications

V Conclusion

Page 6: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Probabilistic Topic Models

• Extract topics from large text collections

unsupervised

Bayesian statistical techniques

• Topics provide quick summary of content / gist

Page 7: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

What are topics?

• A topic represents a probability distribution over words

– Related words get high probability in same topic

• Example topics extracted from psychology grant applications:

brainfmri

imagingfunctional

mrisubjects

magneticresonance

neuroimagingstructural

schizophreniapatientsdeficits

schizophrenicpsychosissubjects

psychoticdysfunction

abnormalitiesclinical

memoryworking

memoriestasks

retrievalencodingcognitive

processingrecognition

performance

diseasead

alzheimerdiabetes

cardiovascularinsulin

vascularblood

clinicalindividuals

Probability distribution over words. Most likely words listed at the top

Page 8: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Model Input

• Matrix of counts: number of times words occur in documents

• Note:

– word order is lost: “bag of words” approach

– Some function words are deleted: “the”, “a”, “in”

documents

wor

ds

1

16

0

FOOD

6190ITALIAN

2012PASTA

3034PIZZA

Doc3 … Doc2Doc1

Page 9: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Model Assumptions

• Each topic is a probability distribution over words

• Each document is modeled as a mixture of topics

Page 10: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Generative Model for Documents

TOPICS MIXTURE

TOPIC TOPIC

WORD WORD

...

...

1. for each document, choosea mixture of topics

2. sample a topic [1..T] from the mixture

3. sample a word from the topic

(“LDA” by Blei, Ng, & Jordan, 2002; “pLSI” by Hoffman, 1999; Griffith & Steyvers, 2002 & 2004)(“LDA” by Blei, Ng, & Jordan, 2002; “pLSI” by Hoffman, 1999; Griffith & Steyvers, 2002 & 2004)

Page 11: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Document = mixture of topics

brainfmri

imagingfunctional

mrisubjects

magneticresonance

neuroimagingstructural

schizophreniapatientsdeficits

schizophrenicpsychosissubjects

psychoticdysfunction

abnormalitiesclinical

memoryworking

memoriestasks

retrievalencodingcognitive

processingrecognition

performance

diseasead

alzheimerdiabetes

cardiovascularinsulin

vascularblood

clinicalindividuals

Document

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

80%20%

Document

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

100%

Page 12: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

The Generative Model

• Conditional probability of word in document:

1

| | |T

j

P w d P w z j P z j d

word probability in topic j

probability of topic jin document

Page 13: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Inverting the generative model

• Generative model gives procedure to obtain corpus from

topics and mixing proportions

• Inverting the model involves extracting topics and mixing

proportions per document from corpus

Page 14: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Inverting the generative model

• Estimate topic assignments

– Each occurrence of a word is assigned to one topic [ 1..T ]

• Large state space

– With T=300 topics and 6,000,000 words, the size of the

discrete state space is (300)6,000,000

• Need efficient sampling techniques

– Markov Chain Monte Carlo (MCMC) with Gibbs sampling

Page 15: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

INPUT: word-document counts (word order is irrelevant)

OUTPUT:topic assignments to each word P( zi )likely words in each topic P( w | z )likely topics in each document (“gist”) P( z | d )

Summary

Page 16: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

THEORYSCIENTISTS

EXPERIMENTOBSERVATIONS

SCIENTIFICEXPERIMENTSHYPOTHESIS

EXPLAINSCIENTISTOBSERVED

EXPLANATIONBASED

OBSERVATIONIDEA

EVIDENCETHEORIESBELIEVED

DISCOVEREDOBSERVE

FACTS

SPACEEARTHMOON

PLANETROCKET

MARSORBIT

ASTRONAUTSFIRST

SPACECRAFTJUPITER

SATELLITESATELLITES

ATMOSPHERESPACESHIPSURFACE

SCIENTISTSASTRONAUT

SATURNMILES

ARTPAINT

ARTISTPAINTINGPAINTEDARTISTSMUSEUM

WORKPAINTINGS

STYLEPICTURES

WORKSOWN

SCULPTUREPAINTER

ARTSBEAUTIFUL

DESIGNSPORTRAITPAINTERS

STUDENTSTEACHERSTUDENT

TEACHERSTEACHING

CLASSCLASSROOM

SCHOOLLEARNING

PUPILSCONTENT

INSTRUCTIONTAUGHTGROUPGRADE

SHOULDGRADESCLASSES

PUPILGIVEN

BRAINNERVESENSE

SENSESARE

NERVOUSNERVES

BODYSMELLTASTETOUCH

MESSAGESIMPULSES

CORDORGANSSPINALFIBERS

SENSORYPAIN

IS

CURRENTELECTRICITY

ELECTRICCIRCUIT

ISELECTRICAL

VOLTAGEFLOW

BATTERYWIRE

WIRESSWITCH

CONNECTEDELECTRONSRESISTANCE

POWERCONDUCTORS

CIRCUITSTUBE

NEGATIVE

A selection from 500 topics [P(w|z = j)]

Page 17: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

FIELDMAGNETICMAGNET

WIRENEEDLE

CURRENTCOIL

POLESIRON

COMPASSLINESCORE

ELECTRICDIRECTION

FORCEMAGNETS

BEMAGNETISM

POLEINDUCED

SCIENCESTUDY

SCIENTISTSSCIENTIFIC

KNOWLEDGEWORK

RESEARCHCHEMISTRY

TECHNOLOGYMANY

MATHEMATICSBIOLOGY

FIELDPHYSICS

LABORATORYSTUDIESWORLD

SCIENTISTSTUDYINGSCIENCES

BALLGAMETEAM

FOOTBALLBASEBALLPLAYERS

PLAYFIELD

PLAYERBASKETBALL

COACHPLAYEDPLAYING

HITTENNISTEAMSGAMESSPORTS

BATTERRY

JOBWORKJOBS

CAREEREXPERIENCE

EMPLOYMENTOPPORTUNITIES

WORKINGTRAINING

SKILLSCAREERS

POSITIONSFIND

POSITIONFIELD

OCCUPATIONSREQUIRE

OPPORTUNITYEARNABLE

Words can have high probability in multiple topics

Page 18: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Disambiguation

• Give example for field

Page 19: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Topics versus LSA

• Latent Semantic Analysis (LSI/LSA)– Projects words into a K-dimensional hidden space– Less interpretable– Not generalizable– Not as accurate

MONEY

LOANCASH

RIVER

BANK

BILL

(high dimensional space)

Page 20: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Modeling Word Association

Page 21: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Word Association(norms from Nelson et al. 1998)

CUE: PLANET

Page 22: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

people

EARTH STARS SPACE

SUN MARS

UNIVERSE SATURN GALAXY

associate number

12345678

Word Association(norms from Nelson et al. 1998)

CUE: PLANET

(vocabulary = 5000+ words)

Page 23: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Topic model for Word Association

• Association as a problem of prediction:

– Given that a single word is observed, predict what other

words might occur in that context

• Under a single topic assumption:

2 1 2 1| | |z

P w w P w z P z w

Response Cue

z

wzPzwPwwP )|()|()|( 1212

Page 24: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

people

EARTH STARS SPACE

SUN MARS

UNIVERSE SATURN GALAXY

model

STARS STAR SUN

EARTH SPACE

SKY PLANET

UNIVERSE

associate number

12345678

Word Association(norms from Nelson et al. 1998)

CUE: PLANET

First associate “EARTH” is in the set of 8 associates (from the model)

Page 25: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

P( set contains first associate )

100

101

102

103

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1P

(se

t co

nta

ins

first

ass

oci

ate

)

Set size

LSATOPICS

LSA

TOPICS

Page 26: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Why would LSA perform worse?

• Cosine similarity measure imposes unnecessary

constraints on representations, e.g.

– Symmetry

– Triangle inequality

Page 27: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Violation of triangle inequality

Can find associations that violate this:

A

B C

AC AB + BC

SOCCER FIELD MAGNETIC

Page 28: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

No Triangle Inequality with Topics

SOCCER

MAGNETICFIELD

TOPIC 1 TOPIC 2

Topic structure easily explains violations of triangle inequality

Page 29: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Modeling Episodic Memory

Page 30: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Semantic Isolation Effect

Study this list:PEAS, CARROTS, BEANS, SPINACH, LETTUCE, HAMMER, TOMATOES, CORN, CABBAGE, SQUASH

HAMMER,PEAS,

CARROTS,...

Page 31: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Semantic Isolation Effect

• Verbal explanations:

– Attention, surprise, distinctiveness

• Our approach: memory system is trading off two encoding

resources

– Storing specific words (e.g. “hammer”)

– Storing general theme of list (e.g. “vegetables”)

Page 32: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Computational Problem

• How to tradeoff specificity and generality?

– Remembering detail and gist

Special word topic model

Page 33: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Special word topic (SW) model

• Each word can be generated via one route

– Topics

– Special words distribution (unique to a document)

• Conditional prob. of a word under a document:

| ( 0 | ) ( | )

( 1 | ) ( | )

topics

special

P w d P x d P w d

P x d P w d

Page 34: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Study wordsPEAS

CARROTS BEANS

SPINACH LETTUCE HAMMER

TOMATOES CORN

CABBAGE SQUASH

Retrieval Probability

0.00 0.01 0.02

HAMMER

BEANS

CORN

PEAS

SPINACH

CABBAGE

LETTUCE

CARROTS

SQUASH

TOMATOES

Special Word Probability

0.00 0.05 0.10 0.15

HAMMER

PEAS

SPINACH

CABBAGE

CARROTS

LETTUCE

SQUASH

BEANS

TOMATOES

CORN

Switch Probability

0.0 0.5 1.0

Verbatim

Topic

Topic Probability

0.0 0.1 0.2

VEGETABLES

FURNITURE

TOOLS

ENCODING RETRIEVAL

Special

Page 35: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Hunt & Lamb (2001 exp. 1)

OUTLIER LIST

PEAS

CARROTS

BEANS

SPINACH

LETTUCE

HAMMER

TOMATOES

CORN

CABBAGE

SQUASH

CONTROL LIST

SAW

SCREW

CHISEL

DRILL

SANDPAPER

HAMMER

NAILS

BENCH

RULER

ANVIL

outlier list pure listP

rob.

of

Rec

all

0.0

0.2

0.4

0.6

0.8

1.0TargetBackground

DATA

Page 36: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Model Predictions

outlier list pure list

Pro

b. o

f R

ecal

l

0.0

0.2

0.4

0.6

0.8

1.0TargetBackground

DATA PREDICTED

outlier list pure list

Nee

d P

roba

bilit

y0.00

0.01

0.02

0.03

0.04

0.05TargetBackground

Page 37: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

False memory effects

MADFEARHATE

SMOOTHNAVYHEAT

SALADTUNE

COURTSCANDYPALACEPLUSHTOOTHBLIND

WINTER

Number of Associates

3 6 9

MADFEARHATERAGE

TEMPERFURY

SALADTUNE

COURTSCANDYPALACEPLUSHTOOTHBLIND

WINTER

MADFEARHATERAGE

TEMPERFURY

WRATHHAPPYFIGHT

CANDYPALACEPLUSHTOOTHBLIND

WINTER

(lure = ANGER)

Number of Associates Studied

3 6 9 12 15

Pro

b. o

f R

ecal

l

0.0

0.2

0.4

0.6

0.8

1.0Studied itemsNonstudied (lure)

DATA

Robinson & Roediger (1997)

Number of Associates Studied

3 6 9 12 15

Pro

b. o

f R

etri

eval

0.00

0.01

0.02

0.03Studied associatesNonstudied (lure)

PREDICTED

Page 38: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Relation to Dual Process Models

• Gist/Verbatim distinction (e.g. Reyna, Brainerd, 1995)

– Maps onto topics and special words

• Our approach specifies both encoding and retrieval

representations and processes

– Routes are not independent

– Model explains performance for actual word lists

Page 39: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Applications I

Page 40: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Topics provide quick summary of content

• What is in this corpus?

• What is in this document?

• What are the topical trends over time?

• Who writes on this topic?

Page 41: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

330,000 articles

2000-2002

Analyzing the New York Times

Page 42: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Three investigations began Thursday into the securities and exchange_commission's choice of william_webster to head a new board overseeing the accounting profession. house and senate_democrats called for the resignations of both judge_webster and harvey_pitt, the commission's chairman. The white_house expressed support for judge_webster as well as for harvey_pitt, who was harshly criticized Thursday for failing to inform other commissioners before they approved the choice of judge_webster that he had led the audit committee of a company facing fraud accusations. “The president still has confidence in harvey_pitt,” said dan_bartlett, bush's communications director …

Extracted Named Entities

• Used standard algorithms to extract named entities:

- People- Places- Organizations

Page 43: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Standard Topic Model with Entities

team 0.028 tour 0.039 holiday 0.071 award 0.026play 0.015 rider 0.029 gift 0.050 film 0.020game 0.013 riding 0.017 toy 0.023 actor 0.020season 0.012 bike 0.016 season 0.019 nomination 0.019final 0.011 team 0.016 doll 0.014 movie 0.015games 0.011 stage 0.014 tree 0.011 actress 0.011point 0.011 race 0.013 present 0.008 won 0.011series 0.011 won 0.012 giving 0.008 director 0.010player 0.010 bicycle 0.010 special 0.007 nominated 0.010coach 0.009 road 0.009 shopping 0.007 supporting 0.010playoff 0.009 hour 0.009 family 0.007 winner 0.008championship 0.007 scooter 0.008 celebration 0.007 picture 0.008playing 0.006 mountain 0.008 card 0.007 performance 0.007win 0.006 place 0.008 tradition 0.006 nominees 0.007LAKERS 0.062 LANCE-ARMSTRONG 0.021 CHRISTMAS 0.058 OSCAR 0.035SHAQUILLE-O-NEAL0.028 FRANCE 0.011 THANKSGIVING 0.018 ACADEMY 0.020KOBE-BRYANT 0.028 JAN-ULLRICH 0.003 SANTA-CLAUS 0.009 HOLLYWOOD 0.009PHIL-JACKSON 0.019 LANCE 0.003 BARBIE 0.004 DENZEL-WASHINGTON 0.006NBA 0.013 U-S-POSTAL-SERVICE 0.002 HANUKKAH 0.003 JULIA-ROBERT 0.005SACRAMENTO 0.007 MARCO-PANTANI 0.002 MATTEL 0.003 RUSSELL-CROWE 0.005RICK-FOX 0.007 PARIS 0.002 GRINCH 0.003 TOM-HANK 0.005PORTLAND 0.006 ALPS 0.002 HALLMARK 0.002 STEVEN-SODERBERGH 0.004ROBERT-HORRY 0.006 PYRENEES 0.001 EASTER 0.002 ERIN-BROCKOVICH 0.003DEREK-FISHER 0.006 SPAIN 0.001 HASBRO 0.002 KEVIN-SPACEY 0.003

Basketball Holidays OscarsTour de France

Page 44: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

computer 0.069 play 0.030technology 0.026 show 0.029system 0.015 stage 0.022digital 0.014 theater 0.022chip 0.013 director 0.017software 0.013 production 0.017machine 0.011 performance 0.016devices 0.010 dance 0.014machines 0.010 audience 0.014video 0.009 festival 0.013Companies 1.000 Theater 0.960

Music 0.040

IBM 0.074 BROADWAY 0.119 BACH 0.035APPLE 0.061 NEW_YORK 0.044 BEETHOVEN 0.026INTEL 0.059 SHAKESPEARE 0.029 LOUIS_ARMSTRONG 0.019MICROSOFT 0.053 THEATER 0.022 MOZART 0.019COMPAQ 0.041 LONDON 0.019 CARNEGIE_HALL 0.017SONY 0.029 GUINNESS 0.018 LATIN 0.017DELL 0.019 TONY 0.016HP 0.018 LINCOLN_CTR 0.015

ArtsComputers

MusicCompanies Theatre

Page 45: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Jan00 Jul00 Jan01 Jul01 Jan02 Jul02 Jan030

50

100

Jan00 Jul00 Jan01 Jul01 Jan02 Jul02 Jan030

5

10

15

Topic Trends

Tour-de-France

Anthrax

Jan00 Jul00 Jan01 Jul01 Jan02 Jul02 Jan030

10

20

30Quarterly Earnings

Proportion of words assigned to topic for that

time slice

Page 46: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Example of Extracted Entity-Topic Network

Muslim_Militance

Mid_East_Conflict

Palestinian_Territories

Pakistan_Indian_War

FBI_Investigation

Detainees

Mid_East_Peace

US_Military

Religion

Terrorist_Attacks

Afghanistan_War

AL_QAEDA

HAMID_KARZAIMOHAMMED

MOHAMMED_ATTA

NORTHERN_ALLIANCE

BIN_LADEN

TALIBAN

ZAWAHIRI

YASSER_ARAFAT

EHUD_BARAK

ARIEL_SHARON

HAMAS

AL_HAZMI

KING_HUSSEIN

Page 47: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Prediction of Missing Entities in Text

Shares of XXXX slid 8 percent, or $1.10, to $12.65 Tuesday, as major

credit agencies said the conglomerate would still be challenged in repaying

its debts, despite raising $4.6 billion Monday in taking its finance group

public. Analysts at XXXX Investors service in XXXX said they were

keeping XXXX and its subsidiaries under review for a possible debt

downgrade, saying the company “will continue to face a significant debt

burden,'' with large slices of debt coming due, over the next 18 months.

XXXX said …

Test article with entities

removed

Page 48: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Prediction of Missing Entities in Text

Shares of XXXX slid 8 percent, or $1.10, to $12.65 Tuesday, as major

credit agencies said the conglomerate would still be challenged in repaying

its debts, despite raising $4.6 billion Monday in taking its finance group

public. Analysts at XXXX Investors service in XXXX said they were

keeping XXXX and its subsidiaries under review for a possible debt

downgrade, saying the company “will continue to face a significant debt

burden,'' with large slices of debt coming due, over the next 18 months.

XXXX said …

fitch goldman-sachs lehman-brother moody morgan-stanley new-york-

stock-exchange standard-and-poor tyco tyco-international wall-street

worldco

Test article with entities

removed

Actual missing entities

Page 49: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Prediction of Missing Entities in Text

Shares of XXXX slid 8 percent, or $1.10, to $12.65 Tuesday, as major

credit agencies said the conglomerate would still be challenged in repaying

its debts, despite raising $4.6 billion Monday in taking its finance group

public. Analysts at XXXX Investors service in XXXX said they were

keeping XXXX and its subsidiaries under review for a possible debt

downgrade, saying the company “will continue to face a significant debt

burden,'' with large slices of debt coming due, over the next 18 months.

XXXX said …

fitch goldman-sachs lehman-brother moody morgan-stanley new-york-

stock-exchange standard-and-poor tyco tyco-international wall-street

worldco

wall-street new-york nasdaq securities-exchange-commission sec merrill-

lynch new-york-stock-exchange goldman-sachs standard-and-poor

Test article with entities

removed

Actual missing entities

Predicted entities given observed words (matches

in blue)

Page 50: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Applications II

Page 51: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Faculty Browser

• System collects PDF files from websites

– UCI

– UCSD

• Applies topic model on text extracted from PDFs

• Display faculty research with topics

• Demo:

http://yarra.calit2.uci.edu/calit2/

Page 52: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.
Page 53: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

one topic

most prolific researchers for this topic

Page 54: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

topics this researcher works on

other researchers with similar

topical interests

one researcher

Page 55: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Conclusions

Page 56: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Human MemoryInformation

Retrieval

Finding relevant memories

Finding relevant documents

Page 57: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Software

Public-domain MATLAB toolbox for topic modeling on the Web:

http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm

Page 58: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Hidden Markov Topic Model

Page 59: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Hidden Markov Topics Model

• Syntactic dependencies short range dependencies

• Semantic dependencies long-range

z z z z

w w w w

s s s s

Semantic state: generate words from topic model

Syntactic states: generate words from HMM

(Griffiths, Steyvers, Blei, & Tenenbaum, 2004)

Page 60: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

HEART 0.2 LOVE 0.2SOUL 0.2TEARS 0.2JOY 0.2

z = 1 0.4

SCIENTIFIC 0.2 KNOWLEDGE 0.2WORK 0.2RESEARCH 0.2MATHEMATICS 0.2

z = 2 0.6

THE 0.6 A 0.3MANY 0.1

OF 0.6 FOR 0.3BETWEEN 0.1

0.9

0.1

0.2

0.8

0.7

0.3

Transition between semantic state and syntactic states

Page 61: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

THE ………………………………

HEART 0.2 LOVE 0.2SOUL 0.2TEARS 0.2JOY 0.2

z = 1 0.4

SCIENTIFIC 0.2 KNOWLEDGE 0.2WORK 0.2RESEARCH 0.2MATHEMATICS 0.2

z = 2 0.6

x = 1

THE 0.6 A 0.3MANY 0.1

x = 3

OF 0.6 FOR 0.3BETWEEN 0.1

x = 2

0.9

0.1

0.2

0.8

0.7

0.3

Combining topics and syntax

Page 62: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

THE LOVE……………………

HEART 0.2 LOVE 0.2SOUL 0.2TEARS 0.2JOY 0.2

z = 1 0.4

SCIENTIFIC 0.2 KNOWLEDGE 0.2WORK 0.2RESEARCH 0.2MATHEMATICS 0.2

z = 2 0.6

x = 1

THE 0.6 A 0.3MANY 0.1

x = 3

OF 0.6 FOR 0.3BETWEEN 0.1

x = 2

0.9

0.1

0.2

0.8

0.7

0.3

Combining topics and syntax

Page 63: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

THE LOVE OF………………

HEART 0.2 LOVE 0.2SOUL 0.2TEARS 0.2JOY 0.2

z = 1 0.4

SCIENTIFIC 0.2 KNOWLEDGE 0.2WORK 0.2RESEARCH 0.2MATHEMATICS 0.2

z = 2 0.6

x = 1

THE 0.6 A 0.3MANY 0.1

x = 3

OF 0.6 FOR 0.3BETWEEN 0.1

x = 2

0.9

0.1

0.2

0.8

0.7

0.3

Combining topics and syntax

Page 64: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

THE LOVE OF RESEARCH ……

HEART 0.2 LOVE 0.2SOUL 0.2TEARS 0.2JOY 0.2

z = 1 0.4

SCIENTIFIC 0.2 KNOWLEDGE 0.2WORK 0.2RESEARCH 0.2MATHEMATICS 0.2

z = 2 0.6

x = 1

THE 0.6 A 0.3MANY 0.1

x = 3

OF 0.6 FOR 0.3BETWEEN 0.1

x = 2

0.9

0.1

0.2

0.8

0.7

0.3

Combining topics and syntax

Page 65: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

FOODFOODSBODY

NUTRIENTSDIETFAT

SUGARENERGY

MILKEATINGFRUITS

VEGETABLESWEIGHT

FATSNEEDS

CARBOHYDRATESVITAMINSCALORIESPROTEIN

MINERALS

MAPNORTHEARTHSOUTHPOLEMAPS

EQUATORWESTLINESEAST

AUSTRALIAGLOBEPOLES

HEMISPHERELATITUDE

PLACESLAND

WORLDCOMPASS

CONTINENTS

DOCTORPATIENTHEALTH

HOSPITALMEDICAL

CAREPATIENTS

NURSEDOCTORSMEDICINENURSING

TREATMENTNURSES

PHYSICIANHOSPITALS

DRSICK

ASSISTANTEMERGENCY

PRACTICE

BOOKBOOKS

READINGINFORMATION

LIBRARYREPORT

PAGETITLE

SUBJECTPAGESGUIDE

WORDSMATERIALARTICLE

ARTICLESWORDFACTS

AUTHORREFERENCE

NOTE

GOLDIRON

SILVERCOPPERMETAL

METALSSTEELCLAYLEADADAM

OREALUMINUM

MINERALMINE

STONEMINERALS

POTMININGMINERS

TIN

BEHAVIORSELF

INDIVIDUALPERSONALITY

RESPONSESOCIAL

EMOTIONALLEARNINGFEELINGS

PSYCHOLOGISTSINDIVIDUALS

PSYCHOLOGICALEXPERIENCES

ENVIRONMENTHUMAN

RESPONSESBEHAVIORSATTITUDES

PSYCHOLOGYPERSON

CELLSCELL

ORGANISMSALGAE

BACTERIAMICROSCOPEMEMBRANEORGANISM

FOODLIVINGFUNGIMOLD

MATERIALSNUCLEUSCELLED

STRUCTURESMATERIAL

STRUCTUREGREENMOLDS

Semantic topics

PLANTSPLANT

LEAVESSEEDSSOIL

ROOTSFLOWERS

WATERFOOD

GREENSEED

STEMSFLOWER

STEMLEAF

ANIMALSROOT

POLLENGROWING

GROW

Page 66: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

GOODSMALL

NEWIMPORTANT

GREATLITTLELARGE

*BIG

LONGHIGH

DIFFERENTSPECIAL

OLDSTRONGYOUNG

COMMONWHITESINGLE

CERTAIN

THEHIS

THEIRYOURHERITSMYOURTHIS

THESEA

ANTHATNEW

THOSEEACH

MRANYMRSALL

MORESUCHLESS

MUCHKNOWN

JUSTBETTERRATHER

GREATERHIGHERLARGERLONGERFASTER

EXACTLYSMALLER

SOMETHINGBIGGERFEWERLOWER

ALMOST

ONAT

INTOFROMWITH

THROUGHOVER

AROUNDAGAINSTACROSS

UPONTOWARDUNDERALONGNEAR

BEHINDOFF

ABOVEDOWN

BEFORE

SAIDASKED

THOUGHTTOLDSAYS

MEANSCALLEDCRIED

SHOWSANSWERED

TELLSREPLIED

SHOUTEDEXPLAINEDLAUGHED

MEANTWROTE

SHOWEDBELIEVED

WHISPERED

ONESOMEMANYTWOEACHALL

MOSTANY

THREETHIS

EVERYSEVERAL

FOURFIVEBOTHTENSIX

MUCHTWENTY

EIGHT

HEYOU

THEYI

SHEWEIT

PEOPLEEVERYONE

OTHERSSCIENTISTSSOMEONE

WHONOBODY

ONESOMETHING

ANYONEEVERYBODY

SOMETHEN

Syntactic classes

BEMAKE

GETHAVE

GOTAKE

DOFINDUSESEE

HELPKEEPGIVELOOKCOMEWORKMOVELIVEEAT

BECOME

Page 67: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

MODELALGORITHM

SYSTEMCASE

PROBLEMNETWORKMETHOD

APPROACHPAPER

PROCESS

ISWASHAS

BECOMESDENOTES

BEINGREMAINS

REPRESENTSEXISTSSEEMS

SEESHOWNOTE

CONSIDERASSUMEPRESENT

NEEDPROPOSEDESCRIBESUGGEST

USEDTRAINED

OBTAINEDDESCRIBED

GIVENFOUND

PRESENTEDDEFINED

GENERATEDSHOWN

INWITHFORON

FROMAT

USINGINTOOVER

WITHIN

HOWEVERALSOTHENTHUS

THEREFOREFIRSTHERENOW

HENCEFINALLY

#*IXTN-CFP

EXPERTSEXPERTGATING

HMEARCHITECTURE

MIXTURELEARNINGMIXTURESFUNCTION

GATE

DATAGAUSSIANMIXTURE

LIKELIHOODPOSTERIOR

PRIORDISTRIBUTION

EMBAYESIAN

PARAMETERS

STATEPOLICYVALUE

FUNCTIONACTION

REINFORCEMENTLEARNINGCLASSESOPTIMAL

*

MEMBRANESYNAPTIC

CELL*

CURRENTDENDRITICPOTENTIAL

NEURONCONDUCTANCE

CHANNELS

IMAGEIMAGESOBJECT

OBJECTSFEATURE

RECOGNITIONVIEWS

#PIXEL

VISUAL

KERNELSUPPORTVECTOR

SVMKERNELS

#SPACE

FUNCTIONMACHINES

SET

NETWORKNEURAL

NETWORKSOUPUTINPUT

TRAININGINPUTS

WEIGHTS#

OUTPUTS

NIPS Semantics

NIPS Syntax

Page 68: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Random sentence generation

LANGUAGE:[S] RESEARCHERS GIVE THE SPEECH[S] THE SOUND FEEL NO LISTENERS[S] WHICH WAS TO BE MEANING[S] HER VOCABULARIES STOPPED WORDS[S] HE EXPRESSLY WANTED THAT BETTER VOWEL

Page 69: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Nested Chinese Restaurant Process

Page 70: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Topic Hierarchies

• In regular topic model, no relations

between topicstopic 1

topic 2 topic 3

topic 4 topic 5 topic 6 topic 7

• Nested Chinese Restaurant Process

– Blei, Griffiths, Jordan, Tenenbaum (2004)

– Learn hierarchical structure, as well as

topics within structure

Page 71: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Example: Psych Review Abstracts

RESPONSESTIMULUS

REINFORCEMENTRECOGNITION

STIMULIRECALLCHOICE

CONDITIONING

SPEECHREADINGWORDS

MOVEMENTMOTORVISUALWORD

SEMANTIC

ACTIONSOCIALSELF

EXPERIENCEEMOTION

GOALSEMOTIONALTHINKING

GROUPIQ

INTELLIGENCESOCIAL

RATIONALINDIVIDUAL

GROUPSMEMBERS

SEXEMOTIONS

GENDEREMOTIONSTRESSWOMENHEALTH

HANDEDNESS

REASONINGATTITUDE

CONSISTENCYSITUATIONALINFERENCEJUDGMENT

PROBABILITIESSTATISTICAL

IMAGECOLOR

MONOCULARLIGHTNESS

GIBSONSUBMOVEMENTORIENTATIONHOLOGRAPHIC

CONDITIONINSTRESS

EMOTIONALBEHAVIORAL

FEARSTIMULATIONTOLERANCERESPONSES

AMODEL

MEMORYFOR

MODELSTASK

INFORMATIONRESULTSACCOUNT

SELFSOCIAL

PSYCHOLOGYRESEARCH

RISKSTRATEGIES

INTERPERSONALPERSONALITY

SAMPLING

MOTIONVISUAL

SURFACEBINOCULAR

RIVALRYCONTOUR

DIRECTIONCONTOURSSURFACES

DRUGFOODBRAIN

AROUSALACTIVATIONAFFECTIVEHUNGER

EXTINCTIONPAIN

THEOF

ANDTOINAIS

Page 72: Topic Models for Semantic Memory and Information Extraction Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work.

Generative Process

RESPONSESTIMULUS

REINFORCEMENTRECOGNITION

STIMULIRECALLCHOICE

CONDITIONING

SPEECHREADINGWORDS

MOVEMENTMOTORVISUALWORD

SEMANTIC

ACTIONSOCIALSELF

EXPERIENCEEMOTION

GOALSEMOTIONALTHINKING

GROUPIQ

INTELLIGENCESOCIAL

RATIONALINDIVIDUAL

GROUPSMEMBERS

SEXEMOTIONS

GENDEREMOTIONSTRESSWOMENHEALTH

HANDEDNESS

REASONINGATTITUDE

CONSISTENCYSITUATIONALINFERENCEJUDGMENT

PROBABILITIESSTATISTICAL

IMAGECOLOR

MONOCULARLIGHTNESS

GIBSONSUBMOVEMENTORIENTATIONHOLOGRAPHIC

CONDITIONINSTRESS

EMOTIONALBEHAVIORAL

FEARSTIMULATIONTOLERANCERESPONSES

AMODEL

MEMORYFOR

MODELSTASK

INFORMATIONRESULTSACCOUNT

SELFSOCIAL

PSYCHOLOGYRESEARCH

RISKSTRATEGIES

INTERPERSONALPERSONALITY

SAMPLING

MOTIONVISUAL

SURFACEBINOCULAR

RIVALRYCONTOUR

DIRECTIONCONTOURSSURFACES

DRUGFOODBRAIN

AROUSALACTIVATIONAFFECTIVEHUNGER

EXTINCTIONPAIN

THEOF

ANDTOINAIS