Tutorial: mining latent entity structures

Mining Latent Entity Structures
Jiawei Han and Chi Wang
Computer Science, University of Illinois at Urbana-Champaign
June 25, 2014

SIGMOD 2014

Transcript of Tutorial: mining latent entity structures

Page 1: Tutorial:   mining latent entity structures

Mining Latent Entity Structures Jiawei Han and Chi Wang

Computer Science, University of Illinois at Urbana-Champaign

June 25, 2014

1

Page 2: Tutorial:   mining latent entity structures

Outline

1. Introduction to mining latent entity structures

2. Mining latent topic hierarchies

3. Mining latent entity relations

4. Mining latent entity concepts

5. Trends and research problems

2

Page 3: Tutorial:   mining latent entity structures

Motivation of Mining Latent Entity Structures The prevalence of unstructured data

Structures are useful for knowledge discovery

3

Too expensive to be structured by humans: automated & scalable methods are needed

Up to 85% of all

information is

unstructured

-- estimated by

industry analysts

Vast majority of the CEOs

expressed frustration over

their organization’s inability to

glean insights from available

data

-- IBM study with 1,500+ CEOs

Page 4: Tutorial:   mining latent entity structures

Information Overload: A Critical Problem in Big Data Era

By 2020, information will double every 73 days

-- G. Starkweather (Microsoft), 1992

Unstructured or loosely structured data are prevalent

[Chart: information growth over time, 1700–2050]

4

Page 5: Tutorial:   mining latent entity structures

Example: Research Publications Every year, hundreds of thousands of papers are published

◦ Unstructured data: paper text

◦ Loosely structured entities: authors, venues

papers

venue

author

5

Page 6: Tutorial:   mining latent entity structures

Example: News Articles Every day, >90,000 news articles are produced

◦ Unstructured data: news content

◦ Loosely structured entities: persons, locations, organizations, …

news

person

location organization

6

Page 7: Tutorial:   mining latent entity structures

Example: Social Media Every second, >150K tweets are sent out

◦ Unstructured data: tweet content

◦ Loosely structured entities: Twitter users, hashtags, URLs, …

Darth Vader

The White House

#maythefourthbewithyou

tweets

twitter

hashtag

URL

7

Page 8: Tutorial:   mining latent entity structures

A Semi-Structured Framework (or Model) for Unstructured and Loosely-Structured Data

news papers tweets

venue

author person

location organization

twitter

URL

hashtag

text

entity

8

Page 9: Tutorial:   mining latent entity structures

Useful Latent Structures: Topics, Concepts, Relations Top 10 active politicians regarding healthcare issues?

Influential high-tech companies in Silicon Valley?

Top 10 researchers in data mining and their specializations?

Academic family on support vector machine?

Concepts (is-a)

Topics (hierarchical)

Relations text

entity

9

Page 10: Tutorial:   mining latent entity structures

Mining Latent Structures: Topics, Concepts, Relations

10

Concepts

Topics

Relations

text

entity

Unstructured text & linked entities -> structured hierarchies

Page 11: Tutorial:   mining latent entity structures

What Power Can We Gain if More Structures Can Be Discovered? Structured database queries

Information network analysis, …

11

Page 12: Tutorial:   mining latent entity structures

Structures Facilitate Multi-Dimensional Analysis: An EventCube Experiment

12

Page 13: Tutorial:   mining latent entity structures

Distribution along Multiple Dimensions Query ‘health care bill’ in news data

13

Page 14: Tutorial:   mining latent entity structures

Entity Analysis and Profiling: Topic distribution for “Stanford University”

Page 15: Tutorial:   mining latent entity structures

AMETHYST [Danilevsky et al. 13]

Page 16: Tutorial:   mining latent entity structures

Structures Facilitate Heterogeneous Information Network Analysis

Real-world data: Multiple object types and/or multiple link types

Venue Paper Author

DBLP Bibliographic Network The IMDB Movie Network

Actor

Movie

Director

Movie

Studio

The Facebook Network

16

Page 17: Tutorial:   mining latent entity structures

What Can Be Mined in Structured Information Networks

Example: DBLP: A Computer Science bibliographic database

Knowledge hidden in DBLP Network | Mining Functions
Who are the leading researchers on Web search? | Ranking
Who are the peer researchers of Jure Leskovec? | Similarity Search
Whom will Christos Faloutsos collaborate with? | Relationship Prediction
Which types of relationships are most influential for an author to decide her topics? | Relation Strength Learning
How has the field of Data Mining emerged and evolved? | Network Evolution
Which authors are rather different from their peers in IR? | Outlier/anomaly detection

17

Page 18: Tutorial:   mining latent entity structures

Similarity Search: Find Similar Objects in Networks Guided by Meta-Paths Who are very similar to Christos Faloutsos?

Meta-Path: Meta-level description of a path between two objects

Christos’s students or close collaborators Similar reputation at similar venues

Meta-Path: Author-Paper-Author (APA) Meta-Path: Author-Paper-Venue-Paper-Author (APVPA)

Schema of the DBLP Network Different meta-paths lead to very

different results!

18

Page 19: Tutorial:   mining latent entity structures

Similarity Search: PathSim Measure Helps Find Peer Objects in Long Tails

Anhai Doan
◦ CS, Wisconsin; Database area; PhD: 2002

Meta-Path: Author-Paper-Venue-Paper-Author (APVPA) finds peers:
• Jignesh Patel: CS, Wisconsin; Database area; PhD: 1998
• Amol Deshpande: CS, Maryland; Database area; PhD: 2004
• Jun Yang: CS, Duke; Database area; PhD: 2001

PathSim [Sun et al. 11]
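The slides cite PathSim without spelling the measure out; as a reference, here is a minimal sketch (the toy author-paper and paper-venue matrices are our own made-up illustration). For a symmetric meta-path with commuting matrix M, PathSim is s(x, y) = 2·M[x, y] / (M[x, x] + M[y, y]).

```python
import numpy as np

# Toy (hypothetical) incidence matrices: A[i, j] = 1 if author i wrote
# paper j; V[j, c] = 1 if paper j appeared at venue c.
A = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]])
V = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1]])

def pathsim(M):
    """PathSim for a symmetric meta-path with commuting matrix M:
    s(x, y) = 2 * M[x, y] / (M[x, x] + M[y, y])."""
    diag = np.diag(M)
    return 2 * M / (diag[:, None] + diag[None, :])

# Author-Paper-Venue-Paper-Author (APVPA): commuting matrix (A V)(A V)^T.
AV = A @ V
print(np.round(pathsim(AV @ AV.T), 2))

# Author-Paper-Author (APA) uses A A^T instead; different meta-paths
# lead to very different similarity results.
print(np.round(pathsim(A @ A.T), 2))
```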

19

Page 20: Tutorial:   mining latent entity structures

PathPredict: Meta-Path Based Relationship Prediction Meta path-guided prediction of links and relationships

Insight: Meta path relationships among similar typed links share similar semantics and are comparable and inferable

Bibliographic network: Co-author prediction (A—P—A)

[Figure: bibliographic network schema with relations write/write⁻¹ (author-paper), publish/publish⁻¹ (paper-venue), mention/mention⁻¹ (paper-topic), contain/contain⁻¹ and cite/cite⁻¹ (paper-paper)]

20

Page 21: Tutorial:   mining latent entity structures

Meta-Path Based Co-authorship Prediction Co-authorship prediction: Whether two authors start to collaborate

Co-authorship encoded in meta-path: Author-Paper-Author

Topological features encoded in meta-paths

[Table: candidate meta-paths and their semantic meanings]

21

The prediction power of each meta-path

Derived by logistic regression
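To illustrate how per-meta-path topological features can feed a logistic regression for co-authorship prediction (the feature matrix, labels, and meta-path choices below are invented placeholders, not PathPredict's data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: one row per author pair, one column per
# meta-path (e.g. counts of A-P-V-P-A, A-P-T-P-A, and citation paths),
# and a 0/1 label for whether the pair later co-authored.
X = np.array([[12, 3, 0],
              [0, 1, 0],
              [7, 9, 2],
              [1, 0, 0]], dtype=float)
y = np.array([1, 0, 1, 0])

model = LogisticRegression().fit(X, y)
print(model.coef_)                  # prediction power of each meta-path
print(model.predict_proba([[5., 2., 1.]])[:, 1])  # P(future co-authorship)
```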

Page 22: Tutorial:   mining latent entity structures

Heterogeneous Network Helps Personalized Recommendation

Users and items with limited feedback are connected by a variety of paths. Different users may require different models: relationship heterogeneity makes personalized recommendation models easier to define.

Avatar Titanic Aliens Revolutionary Road

James Cameron

Kate Winslet

Leonardo Dicaprio

Zoe Saldana

Adventure Romance

Collaborative filtering methods suffer from the data sparsity issue

[Chart: long-tail distribution of # of ratings over # of users or items; a small set of users & items have a large number of ratings, while most users and items have a small number of ratings]

Personalized recommendation with heterogeneous networks [Yu et al. 14a]

22

Page 23: Tutorial:   mining latent entity structures

Personalized Recommendation in Heterogeneous Networks. Datasets and methods to compare:

◦ Popularity: Recommend the most popular items to users

◦ Co-click: Conditional probabilities between items

◦ NMF: Non-negative matrix factorization on user feedback

◦ Hybrid-SVM: Use Rank-SVM to utilize both user feedback and information network

Winner: HeteRec personalized recommendation (HeteRec-p)

23

Page 24: Tutorial:   mining latent entity structures

Outline

1. Introduction to mining latent entity structures

2. Mining latent topic hierarchies

3. Mining latent entity relations

4. Mining latent entity concepts

5. Trends and research problems

24

Page 25: Tutorial:   mining latent entity structures

Mining Latent Topical Hierarchy

25

Topics

text

Unstructured text & linked entities -> structured hierarchies

entity

Page 26: Tutorial:   mining latent entity structures

Topic Hierarchy: Summarize the Data with Multiple Granularity

Top 10 researchers in data mining? ◦ And their specializations?

Important research areas in SIGIR conference?

[Figure: topic hierarchy over the paper collection (papers linked to venues and authors): Computer Science splits into Information technology & system, Theory of computation, …; Information technology & system splits into Database, Information retrieval, …]

26

Page 27: Tutorial:   mining latent entity structures

Methodologies of Topic Mining

27

C. An integrated framework: CATHY

i) Recursive topic discovery ii) Phrase mining iii) Phrase and entity ranking

B. Extension of topic modeling

i) Flat -> hierarchical ii) Unigrams -> ngrams iii) Text -> text + entity

A. Traditional bag-of-words topic modeling

Page 28: Tutorial:   mining latent entity structures

Methodologies of Topic Mining

28

C. An integrated framework: CATHY

i) Recursive topic discovery ii) Phrase mining iii) Phrase and entity ranking

B. Extension of topic modeling

i) Flat -> hierarchical ii) Unigrams -> ngrams iii) Text -> text + entity

A. Traditional bag-of-words topic modeling

Page 29: Tutorial:   mining latent entity structures

A. Bag-of-Words Topic Modeling Widely studied technique for text analysis

◦ Summarize themes/aspects

◦ Facilitate navigation/browsing

◦ Retrieve documents

◦ Segment documents

◦ Many other text mining tasks

Represent each document as a bag of words: all the words within a document are exchangeable

Probabilistic approach

29

Page 30: Tutorial:   mining latent entity structures

Topic: Multinomial Distribution over Words A document is modeled as a sample of mixed topics

How can we discover these topic word distributions from a corpus?

30

[ Criticism of government response to the

hurricane primarily consisted of criticism of its

response to the approach of the storm and its

aftermath, specifically in the delayed response ] to

the [ flooding of New Orleans. … 80% of the 1.3

million residents of the greater New Orleans

metropolitan area evacuated ] …[ Over seventy

countries pledged monetary donations or other

assistance]. …

Topic 1

Topic 3

Topic 2

government 0.3

response 0.2

...

donate 0.1

relief 0.05

help 0.02

...

city 0.2

new 0.1

orleans 0.05

...

EXAMPLE FROM CHENGXIANG ZHAI'S LECTURE NOTES

Page 31: Tutorial:   mining latent entity structures

Routine of Generative Models Model design: assume the documents are generated by a certain process

Model Inference: Fit the model with observed documents to recover the unknown parameters Θ

31

Generative process with unknown parameters Θ

Criticism of government

response to the hurricane …

corpus

Two representative models: pLSA and LDA

Page 32: Tutorial:   mining latent entity structures

Probabilistic Latent Semantic Analysis (PLSA) [Hofmann 99] 𝑘 topics: 𝑘 multinomial distributions over words

𝐷 documents: 𝐷 multinomial distributions over topics

32

Topic 𝝓𝟏 Topic 𝝓𝒌

… government 0.3

response 0.2

...

donate 0.1

relief 0.05

...

.4 .3 .3 Doc 𝜃1

.2 .5 .3 Doc 𝜃𝐷

Generative process: we will generate each token in each document 𝑑 according to 𝜙, 𝜃

Page 33: Tutorial:   mining latent entity structures

PLSA – Model Design 𝑘 topics: 𝑘 multinomial distributions over words

𝐷 documents: 𝐷 multinomial distributions over topics

33

Topic 𝝓𝟏 Topic 𝝓𝒌

… government 0.3

response 0.2

...

donate 0.1

relief 0.05

...

.4 .3 .3 Doc 𝜃1

.2 .5 .3 Doc 𝜃𝐷

To generate a token in document 𝑑: 1. Sample a topic label 𝑧 according to 𝜃𝑑 (e.g. z=1) 2. Sample a word w according to 𝜙𝑧 (e.g. w=government)

.4 .3 .3

Topic 𝝓𝒛

Page 34: Tutorial:   mining latent entity structures

PLSA – Model Inference

What parameters are most likely to generate the observed corpus?

34

To generate a token in document 𝑑: 1. Sample a topic label 𝑧 according to 𝜃𝑑 (e.g. z=1) 2. Sample a word w according to 𝜙𝑧 (e.g. w=government)

.4 .3 .3

Topic 𝝓𝒛

Criticism of government

response to the hurricane …

corpus Topic 𝝓𝟏 Topic 𝝓𝒌

.? .? .? Doc 𝜃1

.? .? .? Doc 𝜃𝐷

government ?

response ?

...

donate ?

relief ?

...

Page 35: Tutorial:   mining latent entity structures

PLSA – Model Inference using Expectation-Maximization (EM)

35

E-step: Fix 𝜙, 𝜃, estimate topic labels 𝑧 for every token in every document M-step: Use estimated topic labels 𝑧 to estimate 𝜙, 𝜃 Guaranteed to converge to a stationary point, but not guaranteed optimal

Criticism of government

response to the hurricane …

corpus

Exact max likelihood is hard => approximate optimization with EM

Topic 𝝓𝟏 Topic 𝝓𝒌

.? .? .? Doc 𝜃1

.? .? .? Doc 𝜃𝐷

government ?

response ?

...

donate ?

relief ?

...

Page 36: Tutorial:   mining latent entity structures

How the EM Algorithm Works

36

.4 .3 .3

Topic 𝝓𝟏 Topic 𝝓𝒌

Doc 𝜃1

Doc 𝜃𝐷

.2 .5 .3

government 0.3

response 0.2

...

donate 0.1

relief 0.05

...

response

criticism

government

hurricane

government

d1

dD Sum fractional

counts

response

M-step

E-step

E-step (Bayes rule):

p(z_{d,w} = j | d, w) = p(z = j | d) · p(w | z = j) / Σ_{j'=1}^{k} p(z = j' | d) · p(w | z = j')

M-step: sum the fractional counts n(d, w) · p(z_{d,w} = j | d, w) to re-estimate φ and θ.
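A minimal sketch of this EM loop in plain numpy (our own illustration: dense count matrix, fixed number of iterations, no convergence check):

```python
import numpy as np

def plsa(counts, k, iters=100, seed=0):
    """Minimal PLSA EM on a D x V document-word count matrix.
    Returns theta (D x k topic mixtures) and phi (k x V topic-word dists)."""
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    theta = rng.dirichlet(np.ones(k), size=D)      # p(z|d)
    phi = rng.dirichlet(np.ones(V), size=k)        # p(w|z)
    for _ in range(iters):
        # E-step: p(z|d,w) proportional to theta[d,z] * phi[z,w]
        post = theta[:, :, None] * phi[None, :, :]          # D x k x V
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate from fractional counts n(d,w) * p(z|d,w)
        frac = counts[:, None, :] * post                    # D x k x V
        phi = frac.sum(axis=0)
        phi /= phi.sum(axis=1, keepdims=True)
        theta = frac.sum(axis=2)
        theta /= theta.sum(axis=1, keepdims=True)
    return theta, phi
```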

Page 37: Tutorial:   mining latent entity structures

Analysis of pLSA PROS

Simple, only one hyperparameter k

Easy to incorporate prior in the EM algorithm

CONS

High model complexity -> prone to overfitting

The EM solution is neither optimal nor unique

37

Page 38: Tutorial:   mining latent entity structures

Latent Dirichlet Allocation (LDA) [Blei et al. 02] Impose Dirichlet prior to the model parameters -> Bayesian version of pLSA

38

Generative process: First generate 𝜙, 𝜃 with Dirichlet prior, then generate each token in each document 𝑑 according to 𝜙, 𝜃

Topic 𝝓𝟏 Topic 𝝓𝒌

… government 0.3

response 0.2

...

donate 0.1

relief 0.05

...

.4 .3 .3 Doc 𝜃1

.2 .5 .3 Doc 𝜃𝐷

𝛽

𝛼

Same as pLSA

To mitigate overfitting

Page 39: Tutorial:   mining latent entity structures

LDA – Model Inference MAXIMUM LIKELIHOOD

Aim to find parameters that maximize the likelihood

Exact inference is intractable

Approximate inference ◦ Variational EM [Blei et al. 03]

◦ Markov chain Monte Carlo (MCMC) – collapsed Gibbs sampler [Griffiths & Steyvers 04]

METHOD OF MOMENTS

Aim to find parameters that fit the moments (expectation of patterns)

Exact inference is tractable ◦ Tensor orthogonal decomposition

[Anandkumar et al. 12]

◦ Scalable tensor orthogonal decomposition [Wang et al. 14a]

39

Page 40: Tutorial:   mining latent entity structures

MCMC – Collapsed Gibbs Sampler [Griffiths & Steyvers 04]

40

response

criticism

government

hurricane

government

d1

dD

response

Iter 1 Iter 2 Iter 1000 …

… Topic 𝝓𝟏 Topic 𝝓𝒌

government 0.3

response 0.2

...

donate 0.1

relief 0.05

...

Sample each z_i conditioned on z_{-i}:

P(z_i = j | z_{-i}, w) ∝ (n_{-i,j}^{(w_i)} + β) / (n_{-i,j}^{(·)} + V β) · (n_{-i,j}^{(d_i)} + α) / (n_{-i,·}^{(d_i)} + k α)

The first factor is the estimated φ_{j,w_i}; the second is the estimated θ_{d_i,j}.
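A didactic collapsed Gibbs sampler implementing exactly this update (a sketch; real implementations use sparse count tables and many speedups):

```python
import numpy as np

def gibbs_lda(docs, k, V, alpha=0.1, beta=0.01, iters=1000, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.
    docs: list of token-id lists. Returns the estimated phi (k x V)."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), k))          # doc-topic counts
    nkw = np.zeros((k, V))                  # topic-word counts
    nk = np.zeros(k)                        # topic totals
    z = [rng.integers(k, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):          # initialize counts
        for i, w in enumerate(doc):
            ndk[d, z[d][i]] += 1; nkw[z[d][i], w] += 1; nk[z[d][i]] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                j = z[d][i]                 # remove current assignment
                ndk[d, j] -= 1; nkw[j, w] -= 1; nk[j] -= 1
                # P(z_i=j | z_-i, w) from the update above
                p = (nkw[:, w] + beta) / (nk + V * beta) * (ndk[d] + alpha)
                j = rng.choice(k, p=p / p.sum())
                z[d][i] = j
                ndk[d, j] += 1; nkw[j, w] += 1; nk[j] += 1
    return (nkw + beta) / (nk[:, None] + V * beta)   # estimated phi
```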

Page 41: Tutorial:   mining latent entity structures

Method of Moments [Anandkumar et al. 12, Wang et al. 14a]

What parameters are most likely to generate the observed corpus? Criticism of

government response to the

hurricane …

corpus Topic 𝝓𝟏 Topic 𝝓𝒌

… government ?

response ?

...

donate ?

relief ?

...

criticism government response: 0.001

government response hurricane: 0.005

criticism response hurricane: 0.004

:

criticism: 0.03

response: 0.01

government: 0.04

:

criticism response: 0.001

criticism government: 0.002

government response: 0.003

:

Moments: expectation of patterns

What parameters fit the empirical moments?

length 1 length 2 (pair) length 3 (triple)

41

Page 42: Tutorial:   mining latent entity structures

Guaranteed Topic Recovery Theorem. The patterns up to length 3 are sufficient for topic recovery

M₂ = Σ_{j=1}^{k} λ_j φ_j ⊗ φ_j,  M₃ = Σ_{j=1}^{k} λ_j φ_j ⊗ φ_j ⊗ φ_j

42

criticism government response: 0.001

government response hurricane: 0.005

criticism response hurricane: 0.004

:

criticism: 0.03

response: 0.01

government: 0.04

:

criticism response: 0.001

criticism government: 0.002

government response: 0.003

:

length 1 length 2 (pair) length 3 (triple)

(V: vocabulary size; k: topic number. M₂ is a V×V matrix; M₃ is a V×V×V tensor.)
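Estimating the empirical moments themselves is plain counting; a sketch for the length-1 and length-2 patterns (dense V and V×V arrays for clarity; length-3 triples for M₃ extend the same loop, and the scalable method discussed next avoids materializing these):

```python
import numpy as np
from itertools import permutations

def empirical_moments(docs, V):
    """Normalized pattern counts: M1[w] from single tokens, M2[u, v]
    from ordered pairs of distinct token positions in each document."""
    M1, M2 = np.zeros(V), np.zeros((V, V))
    n1 = n2 = 0
    for doc in docs:
        for w in doc:
            M1[w] += 1
            n1 += 1
        for i, j in permutations(range(len(doc)), 2):
            M2[doc[i], doc[j]] += 1
            n2 += 1
    return M1 / n1, M2 / max(n2, 1)

docs = [[0, 1, 2], [1, 2, 2], [0, 0, 1]]   # toy token-id documents
M1, M2 = empirical_moments(docs, V=3)
print(M1, M2, sep="\n")
```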

Page 43: Tutorial:   mining latent entity structures

Tensor Orthogonal Decomposition for LDA

43

A: 0.03 AB: 0.001 ABC: 0.001

B: 0.01 BC: 0.002 ABD: 0.005

C: 0.04 AC: 0.003 BCD: 0.004

: : :

Normalized pattern counts

𝑀2

𝑀3

(V: vocabulary size; k: topic number. M₂ is a V×V matrix, M₃ a V×V×V tensor; whitening with M₂ reduces M₃ to a k×k×k core tensor T, whose orthogonal decomposition recovers the topics.)

Input corpus

Topic 𝝓𝟏

Topic 𝝓𝒌

government 0.3

response 0.2

...

donate 0.1

relief 0.05

...

[ANANDKUMAR ET AL. 12]

Page 44: Tutorial:   mining latent entity structures

Tensor Orthogonal Decomposition for LDA – Not Scalable

44

A: 0.03 AB: 0.001 ABC: 0.001

B: 0.01 BC: 0.002 ABD: 0.005

C: 0.04 AC: 0.003 BCD: 0.004

: : :

Normalized pattern counts

𝑀2

𝑀3

(Same pipeline: M₂ is V×V, M₃ is V×V×V, and T is k×k×k.)

Input corpus

Topic 𝝓𝟏

Topic 𝝓𝒌

government 0.3

response 0.2

...

donate 0.1

relief 0.05

...

Prohibitive to compute

Time: O(V³k + Ll²); Space: O(V³)

V: vocabulary size; k: topic number L: # tokens; l: average doc length

Page 45: Tutorial:   mining latent entity structures

Scalable Tensor Orthogonal Decomposition

45

A: 0.03 AB: 0.001 ABC: 0.001

B: 0.01 BC: 0.002 ABD: 0.005

C: 0.04 AC: 0.003 BCD: 0.004

: : :

Normalized pattern counts

𝑀2

𝑀3

(M₂: V×V; M₃: V×V×V; T: k×k×k.)

Input corpus

Topic 𝝓𝟏

Topic 𝝓𝒌

government 0.3

response 0.2

...

donate 0.1

relief 0.05

...

Sparse & low rank

Decomposable

1st scan

2nd scan

Time: O(Lk² + km); Space: O(m), where m = # nonzero entries and m ≪ V²

[WANG ET AL. 14A]

Page 46: Tutorial:   mining latent entity structures

STOD is 20-3000 times faster

And the recovery error is low when the sample is large enough

Variance is almost 0

46

STOD – Scalable tensor orthogonal decomposition TOD – Tensor orthogonal decomposition

[Charts: runtime and recovery error on corpora with L = 19M and L = 39M tokens]

Page 47: Tutorial:   mining latent entity structures

Summary of LDA Model Inference MAXIMUM LIKELIHOOD

Approximate inference ◦ slow, scan data thousands of times

◦ large variance, no theoretic guarantee

Numerous follow-up work ◦ further approximation [Porteous et al. 08, Yao et al. 09, Hoffman et al. 12] etc.

◦ parallelization [Newman et al. 09] etc.

◦ online learning [Hoffman et al. 13] etc.

METHOD OF MOMENTS

STOD [Wang et al. 14a] ◦ fast, scan data twice

◦ robust recovery with theoretic guarantee

New and promising!

47

Page 48: Tutorial:   mining latent entity structures

Methodologies of Topic Mining

48

C. An integrated framework: CATHY

i) Recursive topic discovery ii) Phrase mining iii) Phrase and entity ranking

B. Extension of topic modeling

i) Flat -> hierarchical ii) Unigrams -> ngrams iii) Text -> text + entity

A. Traditional bag-of-words topic modeling

Page 49: Tutorial:   mining latent entity structures

B. i) Flat Topics -> Hierarchical Topics In PLSA and LDA, a topic is selected from a flat pool of topics

In hierarchical topic models, a topic is selected from a hierarchy

49

Topic 𝝓𝟏 Topic 𝝓𝒌

… government 0.3

response 0.2

...

donate 0.1

relief 0.05

...

To generate a token in document 𝑑: 1. Sample a topic label 𝑧 according to 𝜃𝑑 2. Sample a word w according to 𝜙𝑧

.4 .3 .3

Topic 𝝓𝒛

[Figure: topic tree with root o, children o/1 and o/2, and grandchildren o/1/1, o/1/2, o/2/1, o/2/2; e.g., CS → Information technology & system → {DB, IR}]

Page 50: Tutorial:   mining latent entity structures

Hierarchical Topic Models Topics form a tree structure

◦ nested Chinese Restaurant Process [Griffiths et al. 04]

◦ recursive Chinese Restaurant Process [Kim et al. 12a]

◦ LDA with Topic Tree [Wang et al. 14b]

Topics form a DAG structure ◦ Pachinko Allocation [Li & McCallum 06]

◦ hierarchical Pachinko Allocation [Mimno et al. 07]

◦ nested Chinese Restaurant Franchise [Ahmed et al. 13]

50

[Figures: topics forming a tree vs. a DAG over nodes o, o/1, o/2, o/1/1, o/1/2, o/2/1, o/2/2]

DAG: DIRECTED ACYCLIC GRAPH

Page 51: Tutorial:   mining latent entity structures

Hierarchical Topic Model Inference MAXIMUM LIKELIHOOD

Exact inference is intractable

Approximate inference: variational inference or MCMC

Non recursive – all the topics are inferred at once

METHOD OF MOMENTS

Scalable Tensor Recursive Orthogonal Decomposition [Wang et al. 14b] ◦ fast and robust recovery with theoretic

guarantee

Recursive method - only for LDA with Topic Tree model

51

Most popular

Page 52: Tutorial:   mining latent entity structures

Recursive Inference for LDA with Topic Tree A large tree subsumes a smaller tree with shared model parameters

52

Inference order

[WANG ET AL. 14B]

Flexible to decide when to terminate

Easy to revise the tree structure

Page 53: Tutorial:   mining latent entity structures

Scalable Tensor Recursive Orthogonal Decomposition

53

A: 0.03 AB: 0.001 ABC: 0.001

B: 0.01 BC: 0.002 ABD: 0.005

C: 0.04 AC: 0.003 BCD: 0.004

: : :

Normalized pattern counts for t

(T^(t): a k×k×k core tensor for the expanded node t)

Input corpus

Topic 𝝓𝒕/𝟏

Topic 𝝓𝒕/𝒌

government 0.3

response 0.2

...

donate 0.1

relief 0.05

...

[WANG ET AL. 14B]

+ Topic t

Page 54: Tutorial:   mining latent entity structures

B. ii) Unigrams -> N-Grams Motivation: unigrams can be difficult to interpret

54

learning

reinforcement

support

machine

vector

selection

feature

random

:

versus

learning

support vector machines

reinforcement learning

feature selection

conditional random fields

classification

decision trees

:

The topic that represents the area of Machine Learning

Page 55: Tutorial:   mining latent entity structures

Various Strategies Strategy 1: generate bag-of-words -> generate sequence of tokens

◦ Bigram topical model [Wallach 06], topical n-gram model [Wang et al. 07], phrase discovering topic model [Lindsey et al. 12]

Strategy 2: post bag-of-words model inference, visualize topics with n-grams ◦ Label topic [Mei et al. 07], TurboTopic [Blei & Lafferty 09], KERT [Danilevsky et al. 14]

Strategy 3: prior bag-of-words model inference, mine phrases and impose to the bag-of-words model ◦ Frequent pattern-enriched topic model [Kim et al. 12b], ToPMine [El-kishky et al. 14]

55

Page 56: Tutorial:   mining latent entity structures

Strategy 1 – Topical N-Gram Model & Phrase Discovering Topic Model

56

To generate a token in document 𝑑: 1. Sample a binary variable 𝑥 according to the previous token & topic label 2. Sample a topic label 𝑧 according to 𝜃𝑑 3. If 𝑥 = 0 (new phrase), sample a word w according to 𝜙𝑧; otherwise,

sample a word w according to 𝑧 and the previous token

[Figure: documents d₁ … d_D as token sequences with a per-token topic label z and phrase indicator x, e.g. “[white house] [reports …]”, “[black color]”]

High model complexity -> overfitting; high inference cost -> slow MCMC inference

[WANG ET AL. 07, LINDSEY ET AL. 12]

Page 57: Tutorial:   mining latent entity structures

Strategy 2 – Topical Keyphrase Extraction & Ranking (KERT)

57

learning

support vector machines

reinforcement learning

feature selection

conditional random fields

classification

decision trees

:

Topical keyphrase

extraction & ranking

knowledge discovery using least squares support vector machine classifiers

support vectors for reinforcement learning

a hybrid approach to feature selection

pseudo conditional random fields

automatic web page classification in a dynamic and hierarchical way

inverse time dependency in convex regularized learning

postprocessing decision trees to extract actionable knowledge

variance minimization least squares support vector machines

Unigram topic assignment: Topic 1 & Topic 2

[DANILEVSKY ET AL. 14]

Page 58: Tutorial:   mining latent entity structures

Framework of KERT 1. Run bag-of-words model inference, and assign topic label to each token

2. Extract candidate keyphrases within each topic

3. Rank the keyphrases in each topic

◦ Popularity: ‘information retrieval’ vs. ‘cross-language information retrieval’

◦ Discriminativeness: only frequent in documents about topic t

◦ Concordance: ‘active learning’ vs. ‘learning classification’

◦ Completeness: ‘vector machine’ vs. ‘support vector machine’

58

Frequent pattern mining

Comparability property: directly compare phrases of mixed lengths

Page 59: Tutorial:   mining latent entity structures

Comparison of phrase ranking methods

The topic that represents the area of Machine Learning:

kpRel [Zhao et al. 11]: learning, classification, selection, models, algorithm, features, decision, …
KERT (-popularity): effective, text, probabilistic, identification, mapping, task, planning, …
KERT (-discriminativeness): support vector machines, feature selection, reinforcement learning, conditional random fields, constraint satisfaction, decision trees, dimensionality reduction, …
KERT (-concordance): learning, classification, selection, feature, decision, bayesian, trees, …
KERT [Danilevsky et al. 14]: learning, support vector machines, reinforcement learning, feature selection, conditional random fields, classification, decision trees, …

Page 60: Tutorial:   mining latent entity structures

Strategy 3 – Phrase Mining + Topic Model (ToPMine)

60

Phrase mining and

document segmentation

knowledge discovery using least squares support vector machine classifiers…

Knowledge discovery and support vector machine should have coherent topic labels

[EL-KISHKY ET AL. 14]

Strategy 2: the tokens in the same phrase may be assigned to different topics

Solution: switch the order of phrase mining and topic model inference

[knowledge discovery] using [least squares] [support vector machine] [classifiers] …

Topic model inference with phrase constraints

More challenging than in strategy 2!

Page 61: Tutorial:   mining latent entity structures

Phrase Mining: Frequent Pattern Mining + Statistical Analysis

61

Significance score [Church et al. 91]:

α(A, B) = (|AB| − |A||B|/n) / √|AB|

Good phrases score high.
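The score in code (the counts in the example call are invented for illustration):

```python
from math import sqrt

def significance(freq_ab, freq_a, freq_b, n):
    """alpha(A, B) = (|AB| - |A||B|/n) / sqrt(|AB|): how far the
    observed phrase count exceeds the count expected if A and B were
    independent, in units of its standard deviation."""
    expected = freq_a * freq_b / n
    return (freq_ab - expected) / sqrt(freq_ab)

# Invented counts: 'support vector' co-occurring far above chance.
print(significance(freq_ab=90, freq_a=100, freq_b=95, n=100000))
```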

Page 62: Tutorial:   mining latent entity structures

Phrase Mining: Frequent Pattern Mining + Statistical Analysis

62

Phrase: raw freq -> true freq (after segmentation)
[support vector machine]: 90 -> 80
[vector machine]: 95 -> 0
[support vector]: 100 -> 20

[Markov blanket] [feature selection] for [support

vector machines]

[knowledge discovery] using [least squares]

[support vector machine] [classifiers]

…[support vector] for [machine learning]…

Significance score [Church et al. 91]: α(A, B) = (|AB| − |A||B|/n) / √|AB|

Page 63: Tutorial:   mining latent entity structures

Example Topical Phrases

63

ToPMine [El-kishky et al. 14] – Strategy 3 (67 seconds)
Topic 1: information retrieval, social networks, web search, search engine, information extraction, question answering, web pages, …
Topic 2: feature selection, machine learning, semi supervised, large scale, support vector machines, active learning, face recognition, …

PDLDA [Lindsey et al. 12] – Strategy 1 (3.72 hours)
Topic 1: social networks, web search, time series, search engine, management system, real time, decision trees, …
Topic 2: information retrieval, text classification, machine learning, support vector machines, information extraction, neural networks, text categorization, …

Page 64: Tutorial:   mining latent entity structures

ToPMine: Experiments on Associated Press News (1989)

Page 65: Tutorial:   mining latent entity structures

ToPMine: Experiments on Yelp Reviews

Page 66: Tutorial:   mining latent entity structures

66

Comparison of three strategies
◦ Runtime: strategy 3 is the fastest; strategies 1 and 2 are slower.
◦ Coherence of topics: strategy 3 > strategy 2 > strategy 1

Page 67: Tutorial:   mining latent entity structures

Summary of Topical N-Gram Mining Strategy 1: generate bag-of-words -> generate sequence of tokens

◦ integrated complex model; phrase quality and topic inference rely on each other

◦ slow and overfitting

Strategy 2: post bag-of-words model inference, visualize topics with n-grams ◦ phrase quality relies on topic labels for unigrams

◦ can be fast

Strategy 3: prior bag-of-words model inference, mine phrases and impose to the bag-of-words model ◦ topic inference relies on correct partition of documents, but not sensitive

◦ can be fast

67

Page 68: Tutorial:   mining latent entity structures

B. iii) Text Only -> Text + Entity

What should be the output?

How to use linked entity information?

68

text

Criticism of government

response to the hurricane …

Text-only corpus

entity

Topic 𝝓𝟏 Topic 𝝓𝒌

… government 0.3

response 0.2

...

donate 0.1

relief 0.05

...

.4 .3 .3 Doc 𝜃1

.2 .5 .3 Doc 𝜃𝐷

Page 69: Tutorial:   mining latent entity structures

Three Modeling Strategies RESEMBLE ENTITIES TO DOCUMENTS

An entity has a multinomial distribution over topics

RESEMBLE ENTITIES TO WORDS

A topic has a multinomial distribution over each type of entities

69

.3 .4 .3

.2 .5 .3 SIGMOD

Surajit Chaudhuri

… Topic 1

KDD 0.3

ICDM 0.2

...

Over venues

Jiawei Han 0.1

Christos Faloutsos 0.05

...

Over authors

RESEMBLE ENTITIES TO TOPICS

An entity has a multinomial distribution over words

SIGMOD database 0.3

system 0.2

...

Page 70: Tutorial:   mining latent entity structures

Resemble Entities to Documents Regularization - Linked documents or entities have similar topic distributions

◦ iTopicModel [Sun et al. 09a]

◦ TMBP-Regu [Deng et al. 11]

Use entities as additional sources of topic choices for each token ◦ Contextual focused topic model [Chen et al. 12] etc.

Aggregate documents linked to a common entity as a pseudo document ◦ Co-regularization of inferred topics under multiple views [Tang et al. 13]

70

Page 71: Tutorial:   mining latent entity structures

Resemble Entities to Documents Regularization - Linked documents or entities have similar topic distributions

71

iTopicModel [Sun et al. 09a] TMBP-Regu [Deng et al. 11]

Doc 𝜃2

Doc 𝜃3

Doc 𝜃1

θ₂ should be similar to θ₁ and θ₃ (linked documents); in TMBP-Regu, a document’s θ should likewise be similar to the θ of its linked entities (e.g., authors and venues).

Page 72: Tutorial:   mining latent entity structures

Resemble Entities to Documents Use entities as additional sources of topic choice for each token

◦ Contextual focused topic model [Chen et al. 12]

72

To generate a token in document 𝑑: 1. Sample a variable 𝑥 for the context type 2. Sample a topic label 𝑧 according to 𝜃 of the context type decided by 𝑥 3. Sample a word w according to 𝜙𝑧

.4 .3 .3 𝑥 = 1, sample 𝑧 from document’s topic distribution

𝑥 = 2, sample 𝑧 from author’s topic distribution .3 .4 .3

.2 .5 .3 𝑥 = 3, sample 𝑧 from venue’s topic distribution

On Random Sampling over Joins

Surajit Chaudhuri SIGMOD

Page 73: Tutorial:   mining latent entity structures

Resemble Entities to Documents Aggregate documents linked to a common entity as a pseudo document

◦ Co-regularization of inferred topics under multiple views [Tang et al. 13]

73

Document view: a single paper. Author view: all Surajit Chaudhuri’s papers. Venue view: all SIGMOD papers. (Topics φ₁ … φ_k are co-regularized across the views.)

Page 74: Tutorial:   mining latent entity structures

Three Modeling Strategies RESEMBLE ENTITIES TO DOCUMENTS

An entity has a multinomial distribution over topics

RESEMBLE ENTITIES TO WORDS

A topic has a multinomial distribution over each type of entities

74

.3 .4 .3

.2 .5 .3 SIGMOD

Surajit Chaudhuri

… Topic 1

KDD 0.3

ICDM 0.2

...

Over venues

Jiawei Han 0.1

Christos Faloutsos 0.05

...

Over authors

RESEMBLE ENTITIES TO TOPICS

An entity has a multinomial distribution over words

SIGMOD database 0.3

system 0.2

...

Page 75: Tutorial:   mining latent entity structures

Resemble Entities to Topics Entity-Topic Model (ETM) [Kim et al. 12c]

75

Topic 𝝓𝟏

… data 0.3

mining 0.2

...

SIGMOD

database 0.3

system 0.2

...

Surajit Chaudhuri

database 0.1

query 0.1

...

text venue author

To generate a token in document 𝑑: 1. Sample an entity 𝑒 2. Sample a topic label 𝑧 according to 𝜃𝑑 3. Sample a word w according to 𝜙𝑧,𝑒

𝜙𝑧,𝑒~𝐷𝑖𝑟(𝑤1𝜙𝑧 +𝑤2𝜙𝑒)

Paper text Surajit Chaudhuri SIGMOD

Page 76: Tutorial:   mining latent entity structures

Example topics learned by ETM

On a news dataset about Japan tsunami 2011

76

[Figure: example topics shown as entity-specific word distributions φ_{z,e} for several entities, alongside the shared topic distribution φ_z and the entity background distribution φ_e]

Page 77: Tutorial:   mining latent entity structures

Three Modeling Strategies RESEMBLE ENTITIES TO DOCUMENTS

An entity has a multinomial distribution over topics

RESEMBLE ENTITIES TO WORDS

A topic has a multinomial distribution over each type of entities

77

.3 .4 .3

.2 .5 .3 SIGMOD

Surajit Chaudhuri

… Topic 1

KDD 0.3

ICDM 0.2

...

Over venues

Jiawei Han 0.1

Christos Faloutsos 0.05

...

Over authors

RESEMBLE ENTITIES TO TOPICS

An entity has a multinomial distribution over words

SIGMOD database 0.3

system 0.2

...

Page 78: Tutorial:   mining latent entity structures

Resemble Entities to Words Entities as additional elements to be generated for each doc

◦ Conditionally independent LDA [Cohn & Hofmann 01]

◦ CorrLDA1 [Blei & Jordan 03]

◦ SwitchLDA & CorrLDA2 [Newman et al. 06]

◦ NetClus [Sun et al. 09b]

78

To generate a token/entity in document 𝑑: 1. Sample a topic label 𝑧 according to 𝜃𝑑 2. Sample a token w / an entity e according to φ_z or φ_z^e

Topic 1

KDD 0.3

ICDM 0.2

...

venues

Jiawei Han 0.1

Christos Faloutsos 0.05

...

authors

data 0.2

mining 0.1

...

words

Page 79: Tutorial:   mining latent entity structures

Comparison of Three Modeling Strategies for Text + Entity RESEMBLE ENTITIES TO DOCUMENTS

Entities regularize textual topic discovery

RESEMBLE ENTITIES TO WORDS

Entities enrich and regularize the textual representation of topics

79

.3 .4 .3

.2 .5 .3 SIGMOD

Surajit Chaudhuri

… Topic 1

KDD 0.3

ICDM 0.2

...

Over venues

Jiawei Han 0.1

Christos Faloutsos 0.05

...

Over authors

RESEMBLE ENTITIES TO TOPICS

Each entity has its own profile (e.g., SIGMOD: database 0.3, system 0.2, …); # params = k*E*V

(vs. # params = k*(E+V) when entities resemble words)

Page 80: Tutorial:   mining latent entity structures

Methodologies of Topic Mining

80

C. An integrated framework: CATHY

i) Recursive topic discovery ii) Phrase mining iii) Phrase and entity ranking

B. Extension of topic modeling

i) Flat -> hierarchical ii) Unigrams -> ngrams iii) Text -> text + entity

A. Traditional bag-of-words topic modeling

Page 81: Tutorial:   mining latent entity structures

Break

81

Page 82: Tutorial:   mining latent entity structures

C. An Integrated Framework

How to choose & integrate?

82

Recursive Non recursive Hierarchy

Sequence of tokens generative model

• Strategy 1

Post inference, visualize topics with n-grams

• Strategy 2

Prior inference, mine phrases and impose to the bag-of-words model

• Strategy 3

(axes: Phrase, Entity)

Resemble entities to documents

• Modeling strategy 1

Resemble entities to topics

• Modeling strategy 2

Resemble entities to words

• Modeling strategy 3

Page 83: Tutorial:   mining latent entity structures

C. An Integrated Framework

Compatible & effective

83

Recursive Non recursive Hierarchy

(axes: Phrase, Entity)

Resemble entities to documents

• Modeling strategy 1

Resemble entities to topics

• Modeling strategy 2

Resemble entities to words

• Modeling strategy 3

Sequence of tokens generative model

• Strategy 1

Post inference, visualize topics with n-grams

• Strategy 2

Prior model inference, mine phrases and impose to the bag-of-words model

• Strategy 3

Page 84: Tutorial:   mining latent entity structures

C. An Integrated Framework: Construct A Topical HierarchY (CATHY)

Hierarchy + phrase + entity

84

i) Hierarchical topic discovery with entities

ii) Phrase mining

iii) Rank phrases & entities per topic

Output hierarchy with phrases & entities

Input collection

text

o

o/1

o/1/1 o/1/2

o/2

o/2/1

entity

Page 85: Tutorial:   mining latent entity structures

Mining Framework – CATHY Construct A Topical HierarchY

85

i) Hierarchical topic discovery with entities

ii) Phrase mining

iii) Rank phrases & entities per topic

Output hierarchy with phrases & entities

Input collection

text

o

o/1

o/1/1 o/1/2

o/2

o/2/1

entity

Page 86: Tutorial:   mining latent entity structures

Hierarchical Topic Discovery with Multi-Typed Entities [Wang et al. 13a,b] Resemble entities to words – every topic has a multinomial distribution over each type of entities

86

Topic 1

KDD 0.3

ICDM 0.2

...

Jiawei Han 0.1

Christos Faloutsos 0.05

...

data 0.2

mining 0.1

...

Topic k

(each topic z = 1 … k has distributions φ_z¹ over venues, φ_z² over authors, φ_z³ over words)

SIGMOD 0.3

VLDB 0.3

...

Surajit Chaudhuri 0.1

Jeff Naughton 0.05

...

database 0.2

system 0.1

...

venues authors words

Page 87: Tutorial:   mining latent entity structures

Hierarchical Topic Discovery: Link Patterns

87

[Example: the paper “Computing machinery and intelligence” by A.M. Turing yields link patterns such as computing machinery–intelligence (word–word) and A.M. Turing–intelligence, A.M. Turing–computing machinery (author–word)]

Page 88: Tutorial:   mining latent entity structures

Hierarchical Topic Discovery: Link-Weighted Heterogeneous Network

88

[Figure: a link-weighted heterogeneous network built from text and entities, with word nodes (e.g., intelligence, system, database), author nodes (e.g., A.M. Turing), and venue nodes (e.g., SIGMOD)]

Page 89: Tutorial:   mining latent entity structures

Hierarchical Topic Discovery: Generative Model for Link Patterns A single link has a latent topic path z

89

o

o/1 o/2

o/2/1 o/1/1 o/1/2 o/2/2

Information technology & system

DB IR

To generate a link between type 𝑡1 and type 𝑡2: 1. Sample a topic label 𝑧 according to 𝜃

Suppose 𝑡1 = 𝑡2 = word

Page 90: Tutorial:   mining latent entity structures

Hierarchical Topic Discovery: Generative Model for Link Patterns

90

database

To generate a link between type 𝑡1 and type 𝑡2: 1. Sample a topic label 𝑧 according to 𝜃

2. Sample the first end node 𝑢 according to 𝜙𝑧𝑡1

Suppose 𝑡1 = 𝑡2 = word

Topic o/1/2

database 0.2

system 0.1

...

Page 91: Tutorial:   mining latent entity structures

Hierarchical Topic Discovery: Generative Model for Link Patterns

91

database system

To generate a link between type 𝑡1 and type 𝑡2: 1. Sample a topic label 𝑧 according to 𝜃

2. Sample the first end node 𝑢 according to 𝜙𝑧𝑡1

3. Sample the second end node 𝑣 according to 𝜙𝑧𝑡2

Suppose 𝑡1 = 𝑡2 = word

Topic o/1/2

database 0.2

system 0.1

...

Page 92: Tutorial:   mining latent entity structures

Hierarchical Topic Discovery: Generative Model for Link Patterns

92

Equivalently, we can generate the number of links between u and v:

e_{u,v} = e_{u,v}¹ + ⋯ + e_{u,v}^k,  e_{u,v}^z ~ Poisson(θ_z · φ_{z,u}^{t₁} · φ_{z,v}^{t₂})

Suppose t₁ = t₂ = word. For example, the 5 observed database-system links decompose as e^{o/1/2 (DB)}_{database,system} = 4 plus e^{o/1/1 (IR)}_{database,system} = 1.
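A toy EM for this Poisson link decomposition, restricted to a single node type and a flat set of topics (a hedged sketch of the idea; the actual CATHY inference is recursive and multi-typed):

```python
import numpy as np

def cathy_em(E, k, iters=200, seed=0):
    """Toy EM for the link model e_uv = sum_z e_uv^z with
    e_uv^z ~ Poisson(theta_z * phi[z, u] * phi[z, v]).
    E: dense N x N symmetric link-weight matrix over one node type."""
    rng = np.random.default_rng(seed)
    N = E.shape[0]
    theta = np.ones(k) / k
    phi = rng.dirichlet(np.ones(N), size=k)          # phi[z] over nodes
    for _ in range(iters):
        # E-step (Bayes rule): expected share of each link in topic z
        rate = theta[:, None, None] * phi[:, :, None] * phi[:, None, :]
        post = rate / (rate.sum(axis=0, keepdims=True) + 1e-12)
        frac = post * E[None, :, :]                  # fractional link counts
        # M-step: sum & normalize the fractional counts
        mass = frac.sum(axis=2) + frac.sum(axis=1)   # k x N node mass
        phi = mass / mass.sum(axis=1, keepdims=True)
        theta = frac.sum(axis=(1, 2))
        theta /= theta.sum()
    return theta, phi

# e.g. 5 database-system links plus 1 database-retrieval link:
E = np.array([[0., 5., 1.], [5., 0., 0.], [1., 0., 0.]])
theta, phi = cathy_em(E, k=2)
```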

Page 93: Tutorial:   mining latent entity structures

Recursive Model Inference Using Expectation-Maximization (EM)

93

[Figure: the 100 observed system-database links in topic o are split by EM into 95 links for topic o/1 and 5 for topic o/2]

Topic o/1

KDD 0.3

ICDM 0.2

...

Jiawei Han 0.1

Christos Faloutsos 0.05

...

data 0.2

mining 0.1

...

(each subtopic o/1 … o/k has distributions φ¹ over venues, φ² over authors, φ³ over words)

E-step: Bayes rule. M-step: sum & normalize the fractional counts.

Page 94: Tutorial:   mining latent entity structures

Top-Down Recursion

94

[Figure: top-down recursion. The 100 system-database links in topic o split into 95 links in topic o/1 plus 5 in topic o/2; the 95 links of topic o/1 split further into 65 in topic o/1/1 plus 30 in topic o/1/2]

Page 95: Tutorial:   mining latent entity structures

Extension: Learn Link Type Importance Different link types may have different importance in topic discovery

Introduce a link type weight α_{x,y}: original link weight e_{i,j}^{x,y,z} -> α_{x,y} · e_{i,j}^{x,y,z}

◦ α > 1 : more important
◦ 0 < α < 1 : less important

The EM solution is invariant to a constant scale-up of all the link weights, so we can assume w.l.o.g. ∏_{x,y} α_{x,y}^{n_{x,y}} = 1 (a rescaling).

95

Page 96: Tutorial:   mining latent entity structures

Optimal Weight

[Chart: KL-divergence of prediction from observation vs. average link weight]

96

Page 97: Tutorial:   mining latent entity structures

Learned Link Importance & Topic Coherence

97

Learned importance of different link types:

Level | Word-word | Word-author | Author-author | Word-venue | Author-venue
1 | .2451 | .3360 | .4707 | 5.7113 | 4.5160
2 | .2548 | .7175 | .6226 | 2.9433 | 2.9852

[Chart: average pointwise mutual information (PMI) of each topic, per link type and overall, comparing NetClus, CATHY (equal importance), and CATHY (learned importance)]

Page 98: Tutorial:   mining latent entity structures

Phrase Mining

Frequent pattern mining; no NLP parsing

Statistical analysis for filtering bad phrases

98

i) Hierarchical topic discovery with entities

ii) Phrase mining

iii) Rank phrases & entities per topic

Output hierarchy with phrases & entities

Input collection

text o

o/1

o/1/1 o/1/2

o/2

o/2/1

Page 99: Tutorial:   mining latent entity structures

Examples of Mined Phrases

Computer science: information retrieval, social networks, web search, search engine, information extraction, question answering, web pages, …; feature selection, machine learning, semi supervised, large scale, support vector machines, active learning, face recognition, …

News: energy department, environmental protection agency, nuclear weapons, acid rain, nuclear power plant, hazardous waste, savannah river, …; president bush, white house, bush administration, house and senate, members of congress, defense secretary, capital gains tax, …

Page 100: Tutorial:   mining latent entity structures

Phrase & Entity Ranking

Ranking criteria: popular, discriminative, concordant

100

1. Hierarchical topic discovery w/ entities

2. Phrase mining

3. Rank phrases & entities per topic

Output hierarchy w/ phrases & entities

Input collection

text o

o/1

o/1/1 o/1/2

o/2

o/2/1

entity

Page 101: Tutorial:   mining latent entity structures

Phrase & Entity Ranking – Estimate Topical Frequency

E.g.,

p(z = DB | “query processing”) = p(z = DB) · p(query | z = DB) · p(processing | z = DB) / Σ_t p(z = t) · p(query | z = t) · p(processing | z = t)
= θ_DB · φ_{DB,query} · φ_{DB,processing} / Σ_t θ_t · φ_{t,query} · φ_{t,processing}

Pattern | Total | ML | DB | DM | IR
support vector machines | 85 | 85 | 0 | 0 | 0
query processing | 252 | 0 | 212 | 27 | 12
Hui Xiong | 72 | 0 | 0 | 66 | 6
SIGIR | 2242 | 444 | 378 | 303 | 1117

(Total counts come from frequent pattern mining; the per-topic split is estimated by Bayes rule.)

Page 102: Tutorial:   mining latent entity structures

Phrase & Entity Ranking – Ranking Function

‘Popular’ indicator of phrase or entity A in topic t: p(A | t)

‘Discriminative’ indicator of A in topic t: log [ p(A | t) / p(A | T) ], where T is the topic for comparison

‘Concordance’ indicator of phrase A: α(A) = (|A| − E(|A|)) / std(|A|), the significance score used for phrase mining

r_t(A) = p(A | t) · log [ p(A | t) / p(A | T) ] + ω · α(A)

(pointwise KL-divergence plus weighted concordance)
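The ranking function in code form (the probabilities and concordance value in the example call are placeholders):

```python
from math import log

def rank_score(p_A_t, p_A_T, alpha_A, omega=1.0):
    """r_t(A) = p(A|t) * log[p(A|t) / p(A|T)] + omega * alpha(A):
    pointwise KL-divergence (popularity x discriminativeness) plus
    omega-weighted concordance."""
    return p_A_t * log(p_A_t / p_A_T) + omega * alpha_A

# Placeholder numbers: a phrase frequent in topic t and rare in the
# comparison topic T ranks high.
print(rank_score(p_A_t=0.02, p_A_T=0.001, alpha_A=5.0, omega=0.01))
```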

Page 103: Tutorial:   mining latent entity structures

Example topics: database & information retrieval

103

database system

query processing

concurrency control…

Divesh Srivastava

Surajit Chaudhuri

Jeffrey F. Naughton…

ICDE

SIGMOD

VLDB…

text categorization

text classification

document clustering

multi-document summarization…

relevance feedback

query expansion

collaborative filtering

information filtering…

……

……

information retrieval

retrieval

question answering…

W. Bruce Croft

James Allan

Maarten de Rijke…

SIGIR

ECIR

CIKM…

Evaluation: Intrusion detection [Chang et al. 09]

Page 104: Tutorial:   mining latent entity structures

Evaluation method - Intrusion detection

104

Topic Intrusion, Question 1/80:
Parent topic: database systems, data management, query processing, management system, data system
Child topic 1: web search, search engine, semantic web, search results, web pages
Child topic 2: data management, data integration, data sources, data warehousing, data applications
Child topic 3: query processing, query optimization, query databases, relational databases, query data
Child topic 4: database system, database design, expert system, management system, design system

Phrase Intrusion, Question 1/130: data mining, association rules, logic programs, data streams
Question 2/130: natural language, query optimization, data management, database systems

Topic intrusion: among the child topics shown for a given parent (one of them an intruder), which child topic does not belong to the given parent topic?
Phrase intrusion: among the phrases shown (one intruder among normal phrases), which phrase does not belong?

Page 105: Tutorial:   mining latent entity structures

Ranked phrases + entities > unigrams

105

[Chart: % of the hierarchy interpreted by people on CS and News topic-intrusion tasks, comparing 1. hPAM, 2. NetClus, 3. CATHY (unigram), and CATHY with phrases, entities, and phrases + entities; the reported values include 65% and 66% for the full method]

Page 106: Tutorial:   mining latent entity structures

Applications: entity, community profiling…

106

[Figure: mined researcher profiles in data mining.
Christos Faloutsos: 111.6 papers over data mining / data streams / time series / association rules / mining patterns, with subtopic phrases such as time series, nearest neighbor, association rules, mining patterns, data streams, high dimensional data (21.0 / 35.6 / 33.3 papers in finer subtopics).
Philip S. Yu: 67.8 papers over data mining / data streams / nearest neighbor / time series / mining patterns, with phrases such as selectivity estimation, sensor networks, nearest neighbor, time warping, large graphs, large datasets (16.7 / 16.4 / 20.0 papers in finer subtopics).
Related authors shown include Eamonn J. Keogh, Jessica Lin, Michail Vlachos, Michael J. Pazzani, Matthias Renz, Divesh Srivastava, Surajit Chaudhuri, Nick Koudas, Jeffrey F. Naughton, Yannis Papakonstantinou, Jiawei Han, Ke Wang, Xifeng Yan, Bing Liu, Mohammed J. Zaki, Charu C. Aggarwal, Graham Cormode, S. Muthukrishnan, Xiaolei Li]

Page 107: Tutorial:   mining latent entity structures

Important research areas in SIGIR conference ?

107

[Figure: topical decomposition of SIGIR (2,432 papers); estimated paper counts per area: ML 443.8, DB 377.7, DM 302.7, IR 1,117.4, with finer splits including 260.0, 583.0, 160.3, 127.3, 108.9. Representative phrase groups: information retrieval, question answering, relevance feedback, document retrieval, ad hoc; web search, search engine, search results, world wide web, web search results; word sense disambiguation, named entity, named entity recognition, domain knowledge, dependency parsing; support vector machines, collaborative filtering, text categorization, text classification, conditional random fields; matrix factorization, hidden markov models, maximum entropy, link analysis, non-negative matrix factorization; text categorization, text classification, document clustering, multi-document summarization, naïve bayes; information systems, artificial intelligence, distributed information retrieval, query evaluation; event detection, large collections, similarity search, duplicate detection, large scale]

Page 108: Tutorial:   mining latent entity structures

Outline

1. Introduction to mining latent entity structures

2. Mining latent topic hierarchies

3. Mining latent entity relations

4. Mining latent entity concepts

5. Trends and research problems

108

Page 109: Tutorial:   mining latent entity structures

Discovery of Hidden Entity Relations

109

Relations

text

entity

Topics

Unstructured text & linked entities -> structured hierarchies

Page 110: Tutorial:   mining latent entity structures

Example Hidden Relations Academic family from research publications

Social relationship from online social network

110

Alumni Colleague

Club friend

[Figure: academic family of Jeff Ullman, with advisees Surajit Chaudhuri (1991), Jeffrey Naughton (1987), and Joseph M. Hellerstein (1995)]

Page 111: Tutorial:   mining latent entity structures

Entity Relation Mining A. Representative mining paradigms

B. Patterns, rules and constraints for reasoning

C. Methodologies for dependency modeling

111

Page 112: Tutorial:   mining latent entity structures

Mining Paradigms A. Representative mining paradigms

◦ Similarity search of relationships

◦ Classify or cluster entity relationships

◦ Slot filling

B. Patterns, rules and constraints for reasoning

C. Methodologies for dependency modeling

112

Page 113: Tutorial:   mining latent entity structures

Similarity Search of Relationships Input: relation instance

Output: relation instances with similar semantics

113

(Jeff Ullman, Surajit Chaudhuri)

(Jeffrey Naughton, Joseph M. Hellerstein)

(Jiawei Han, Chi Wang)

Is advisor of

(Apple, iPad)

(Microsoft, Surface)

(Amazon, Kindle)

Produce tablet

Page 114: Tutorial:   mining latent entity structures

Classify or Cluster Entity Relationships Input: relation instances with unknown relationship

Output: predicted relationship or clustered relationship

114

(Jeff Ullman, Surajit Chaudhuri)

Is advisor of

(Jeff Ullman, Hector Garcia)

Is colleague of

Alumni Colleague

Club friend

Page 115: Tutorial:   mining latent entity structures

Slot Filling Input: relation instance with a missing element (slot)

Output: fill the slot

is advisor of (?, Surajit Chaudhuri) Jeff Ullman

produce tablet (Apple, ?) iPad

115

Model | Brand (missing) | Brand (filled)
S80 | ? | Nikon
A10 | ? | Canon
T1460 | ? | Benq

Page 116: Tutorial:   mining latent entity structures

Patterns, Rules & Constraints for Reasoning

A. Representative mining paradigms ◦ Similarity search of relationships

◦ Classify or cluster entity relationships

◦ Slot filling

B. Patterns, rules and constraints for reasoning ◦ Text patterns

◦ Linkage patterns

◦ Dependency rules and constraints

C. Methodologies for dependency modeling

116

Page 117: Tutorial:   mining latent entity structures

Text Patterns Syntactic patterns

◦ [Bunescu & Mooney 05b]

Dependency parse tree patterns ◦ [Zelenko et al. 03]

◦ [Culotta & Sorensen 04]

◦ [Bunescu & Mooney 05a]

Topical patterns ◦ [McCallum et al. 05] etc.

117

The headquarters of Google are situated in Mountain View

Jane says John heads XYZ Inc.

Emails between McCallum & Padhraic Smyth

Page 118: Tutorial:   mining latent entity structures

Linkage Patterns - Topology Edge sign prediction [Leskovec et al. 10]

◦ Epinions: Trust/Distrust

◦ Wikipedia: Support/Oppose

◦ Slashdot: Friend/Foe

Triad counts: counts of signed triads edge u->v takes part in

118

Page 119: Tutorial:   mining latent entity structures

Linkage Patterns - Traffic Manager-subordinate relationship from email exchange [Diehl et al. 07]

◦ # msgs

◦ quartiles for # recipients

119

Possible communication events for relationship (𝑛𝑎, 𝑛𝑏)

Page 120: Tutorial:   mining latent entity structures

Linkage Patterns – Traffic Dynamics Advisor-advisee relationship from research publications [Wang et al. 10]

120

[Figure: publication timelines (1999–2004) of authors Smith, Ada, and Ying, where A = papers by the candidate advisor and B = papers by the candidate advisee]

kulc(A, B) = (|A ∩ B| / 2) · (1/|A| + 1/|B|)  (Kulczynski measure)

IR(A, B) = (|A| − |B|) / |A ∪ B|  (imbalance ratio)

[Chart: co-authorship trend over time (2000–2004), with Kulczynski and imbalance-ratio curves indicating the inferred start and end times of the advising relationship]
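Both link-pattern measures in code (the paper-id sets are hypothetical):

```python
def kulczynski(A, B):
    """kulc(A, B) = (|A ∩ B| / 2) * (1/|A| + 1/|B|)."""
    return len(A & B) / 2 * (1 / len(A) + 1 / len(B))

def imbalance_ratio(A, B):
    """IR(A, B) = (|A| - |B|) / |A ∪ B|: high when the candidate
    advisor publishes far more than the candidate advisee."""
    return (len(A) - len(B)) / len(A | B)

advisor_papers = {1, 2, 3, 4, 5, 6}   # hypothetical paper ids
advisee_papers = {5, 6}
print(kulczynski(advisor_papers, advisee_papers))
print(imbalance_ratio(advisor_papers, advisee_papers))
```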

Page 121: Tutorial:   mining latent entity structures

Dependency Rules & Constraints (Advisor-Advisee Relationship)

E.g., role transition - one cannot be advisor before graduation

121

[Figure: candidate advisor-advisee assignments among Ada, Bob, and Ying with papers in 1999–2001 and known events (Bob graduated in 1998, Ada graduated in 2001, Ying started in 2000); assignments violating the role-transition rule are pruned]

Page 122: Tutorial:   mining latent entity structures

Dependency Rules & Constraints (Social Relationship) ATTRIBUTE-RELATIONSHIP

Friends of the same relationship type share the same value for only certain attributes

CONNECTION-RELATIONSHIP

The friends having different relationships are loosely connected

122

Page 123: Tutorial:   mining latent entity structures

Methodologies for Dependency Modeling Factor graph

◦ [Wang et al. 10, 11, 12]

◦ [Tang et al. 11]

Optimization framework ◦ [McAuley & Leskovec 12]

◦ [Li, Wang & Chang 14]

Graph-based ranking ◦ [Yakout et al. 12]

123

Page 124: Tutorial:   mining latent entity structures

Factor Graph [Wang et al. 10, 11, 12]

124

Factor graph – model the joint likelihood by multiple factors

Each factor is a potential function defined on a few variables:
◦ y_x : a_x’s advisor, e.g., y₂ ∈ {0, 1}, y₄ ∈ {0, 3}
◦ g_x(y_x) : traffic dynamics, the average of the Kulczynski and IR measures
◦ f_x(y_x, C_x) : role transition, 1 if all the constraints are satisfied and 0 otherwise (C_x: the candidate children’s y)

P = (1/Z) ∏_{author a_x} g_x(y_x) · f_x(y_x, C_x)

[Figure: factor graph over authors a₀ … a₅ with variables y₀ … y₅, singleton potentials g₀(y₀) … g₅(y₅), and factors f₀(y₀, y₁, y₂, y₃, y₄, y₅), f₁(y₁, y₂, y₃), f₃(y₃, y₄, y₅)]

Page 125: Tutorial:   mining latent entity structures

Factor Graph - Inference Traditional inference algorithm message passing [Koller & Friedman 09] ◦ Exact inference: sum-product + junction tree

◦ Approximate inference: loopy belief propagation

More optimized message passing algorithm [Wang et al. 10] ◦ Specialized schedule according to the DAG structure

◦ Eliminate function nodes and internal messages

◦ Reduce the message from a vector to a scalar

125

Standard message passing: each message is a marginal function (a vector); the flooding schedule causes repetitive information flow, is not scalable, and fails to finish for thousands of variables.
Our message passing: each message is reduced to a scalar.

Page 126: Tutorial:   mining latent entity structures

Our message passing algorithm

[Figure: the optimized message passing runs in two phases over the candidate DAG of authors a₀ … a₅. Phase 1 propagates scalar sent messages along the DAG; phase 2 propagates scalar recv messages back. Each node y_i maintains sent_{i,j} and recv_{i,j} for every candidate advisor a_j.]

r_{ij} = max P(y | y_i = j) = exp(sent_{ij} + recv_{ij})

After the message passing, we collect the messages: r_{ij} tells how likely a_j is a_i’s advisor, and the most likely advisor is selected.

126

Page 127: Tutorial:   mining latent entity structures

Experiment Results DBLP data: 654,628 authors, 1,076,946 publications, publishing time provided

Labeled data: Math Genealogy; AI Genealogy; Faculty Homepage

Accuracy of relationship prediction:

Datasets | RULE (heuristic rule) | SVM (support vector machine, no role transition) | Our method
TEST1 | 69.9% | 73.4% | 84.4%
TEST2 | 69.8% | 74.6% | 84.3%
TEST3 | 80.6% | 86.7% | 91.3%

127

Page 128: Tutorial:   mining latent entity structures

The proposed algorithm is efficient and effective

Advisee | Top ranked advisor | Time | Note
David M. Blei | 1. Michael I. Jordan | 01-03 | PhD advisor, 2004 grad
David M. Blei | 2. John D. Lafferty | 05-06 | Postdoc, 2006
Hong Cheng | 1. Qiang Yang | 02-03 | MS advisor, 2003
Hong Cheng | 2. Jiawei Han | 04-08 | PhD advisor, 2008
Sergey Brin | 1. Rajeev Motwani | 97-98 | “Unofficial advisor”

128

Page 129: Tutorial:   mining latent entity structures

Other examples of hierarchical relationship Reply relationship in online discussion [Wang et al. 11]

129

Page 130: Tutorial:   mining latent entity structures

Other examples of hierarchical relationship

Family tree

[Wang et al. 12]

130

Page 131: Tutorial:   mining latent entity structures

Solution in the generic setting

Step 1: create a candidate DAG according to certain expectation about a partial order among the objects

Step 2: model the dependency of relations with a factor graph

Step 3: learn the importance of different expectations by Maximum A Posterior (MAP) inference

Step 4: maximize the joint likelihood and predict true relationship

A large number of factors may exist

Their importance is unknown

[Figure: candidate DAG over authors a₀ … a₅ and the corresponding factor graph with singleton potentials g₀(y₀) … g₅(y₅) and factors f₀(y₀, …, y₅), f₁(y₁, y₂, y₃), f₃(y₃, y₄, y₅); the importance of the factors is learned with L2 regularization and L-BFGS]

131

Page 132: Tutorial:   mining latent entity structures

Categorization of potential functions See other types of potentials in [Wang et al. 12]

Type | Cognitive description | Potential definition
Homophily | Parent and child are similar | pairwise potentials, e.g., Kulczynski
Polarity | Parent is superior to child | singleton potentials, e.g., imbalance ratio
Constraints | Restrict certain patterns | f(y_i, y_j) = −|CP(v_i, v_j, v_{y_i}, v_{y_j})|, e.g., role transition

132

Page 133: Tutorial:   mining latent entity structures

Optimization Framework [Li et al. 14] Relationship in online social networks

133

Employer: Yahoo! College: Stanford

Employer: ? College:?

Employer: Yahoo! College: Berkeley

Employer: Twitter College: Berkeley

Employer: ? College:?

Employer: Twitter College: UIUC

Employer: ? College:?

Employer: JP Morgan College: UIUC

Employer: ? College:?

Employer: Google College: UIUC

Some users provide attributes in their online profiles Some users’ attributes are missing

Page 134: Tutorial:   mining latent entity structures

Optimization Framework to Model the Dual Dependency Design cost function to capture the dependences

Find the unknown variables that minimize the cost function

134

Unobserved Friends’ circles

Observed User Connections

Partially Observed User Attributes

Friends of the same relationship type share the same value for only certain attributes

The friends having different relationships are loosely connected

Page 135: Tutorial:   mining latent entity structures

Co-Profiling of Relationship & Attributes

λ₁ Σ_{t=1}^{K} Σ_{e_{ij} ∈ E; v_i, v_j ∈ C_t} ‖w_t ∘ (f_i − f_j)‖² + λ₂ Σ_{t=1}^{K} Σ_{v_i ∈ L ∩ C_t} (w_tᵀ f_i − 1)² + λ₃ Σ_{e_{ij} ∈ E′; x_i ≠ x_j} 1

135

[Figure: the same ego network over attribute dimensions ⟨Yahoo, Twitter, Google, Berkeley, Stanford, UIUC, …⟩. Attribute vectors: f1 = ⟨1, 0, 0, 1, 0, 0, 0.1⟩, f3 = ⟨1, 0, 0, 0, 1, 0, 0.1⟩, f4 = ⟨0, 1, 0, 0, 0, 1, 0.1⟩. Circle assignments: x1 = 1, x3 = 1, x4 = 2. Association vectors: w1 = ⟨1, 1, 1, 0, 0, 0, 0⟩ for circle 1 (friends likely to share an employer) and w2 = ⟨0, 0, 0, 1, 1, 1, 0⟩ for circle 2 (friends likely to share a college). The first two terms of the objective capture the attribute-relationship dependence; the third captures the connection-relationship dependence.]

Page 136: Tutorial:   mining latent entity structures

Inference Algorithm

Minimize the co-profiling objective above by alternating among three updates:

1. Update user attribute vectors f_i
   ◦ Only propagate values from friends in the same circle
   ◦ Only propagate the attribute values associated with that circle

2. Update user circle assignments x_i, C_i
   ◦ Consider both the user's attributes and connections

3. Update circle association vectors w_t
   ◦ Make the association vectors sparse
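A simplified sketch of these alternating updates follows. It is our own reading of the three steps; the data layout and the exact update rules are assumptions, not the implementation of [Li et al. 14].

```python
# Alternating updates for co-profiling. F: n x d attribute vectors
# (rows partially observed); labeled: n x d boolean mask of observed
# entries; edges: symmetric list of (i, j) friendships; W: K x d circle
# association vectors; X: circle assignment per friend (0..K-1).
import numpy as np

def coprofile(F, labeled, edges, W, X, iters=10):
    n, d = F.shape
    K = W.shape[0]
    for _ in range(iters):
        # 1. Update attribute vectors: propagate only from same-circle
        #    friends, and only along the circle's associated attributes.
        for i in range(n):
            nbrs = [j for (a, j) in edges if a == i and X[j] == X[i]]
            upd = (W[X[i]] > 0) & ~labeled[i]
            if nbrs and upd.any():
                F[i, upd] = F[nbrs][:, upd].mean(axis=0)
        # 2. Update circle assignments: pick the circle whose associated
        #    attributes disagree least with the user's neighbors.
        for i in range(n):
            nbrs = [j for (a, j) in edges if a == i]
            if nbrs:
                dis = [(W[t] * ((F[nbrs] - F[i]) ** 2).mean(axis=0)).sum()
                       for t in range(K)]
                X[i] = int(np.argmin(dis))
        # 3. Update association vectors: keep only the few attribute
        #    dimensions with the lowest within-circle disagreement.
        for t in range(K):
            pairs = [(i, j) for (i, j) in edges if X[i] == X[j] == t]
            if pairs:
                gap = np.mean([(F[i] - F[j]) ** 2 for i, j in pairs], axis=0)
                W[t] = (gap <= np.sort(gap)[min(2, d - 1)]).astype(float)
    return F, X, W
```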

136

Page 137: Tutorial:   mining latent entity structures

Co-profiling of relationships and attributes outperforms baselines for both tasks

137

[Bar charts: "Accuracy for College" and "Accuracy for Employer", comparing attribute-profiling baselines APw, APi, APc with CP; "Profiling college classmates" and "Profiling colleagues", comparing relationship-profiling baselines RPa, RPn, RPan with CP. CP (co-profiling) achieves the highest accuracy in each chart.]

LinkedIn data: 19K users, 110K connections; 8K connections are labeled.

Page 138: Tutorial:   mining latent entity structures

Methodologies for Dependency Modeling

Factor graph [Wang et al. 10, 11, 12], [Tang et al. 11]
◦ Suitable for discrete variables
◦ Probabilistic model with general inference algorithms

Optimization framework [McAuley & Leskovec 12], [Li, Wang & Chang 14]
◦ Handles both discrete and real variables
◦ Needs a special optimization algorithm

Graph-based ranking [Yakout et al. 12]
◦ Similar to PageRank
◦ Suitable when the problem can be modeled as ranking on graphs

138

Page 139: Tutorial:   mining latent entity structures

Outline

1. Introduction to mining latent entity structures

2. Mining latent topic hierarchies

3. Mining latent entity relations

4. Mining latent entity concepts

5. Trends and research problems

139

Page 140: Tutorial:   mining latent entity structures

Mining Entity Concepts for Typing

140

Concepts Relations

text

entity

Topics

Unstructured text & linked entities -> structured hierarchies

Page 141: Tutorial:   mining latent entity structures

Concepts as Entity Types
Top 10 active politicians regarding healthcare issues?
Influential high-tech companies in Silicon Valley?

141

Concept             Entity   Mention
politician          Obama    "Obama says more than 6M signed up for health care…"
high-tech company   Apple    "Apple leads in list of Silicon Valley's most-valuable brands…"

Page 142: Tutorial:   mining latent entity structures

Source of Concepts, Entities & Mentions

142

Concept (from a taxonomy / ad-hoc)   Entity   Mention (in text / in web tables)
politician                           Obama    "Obama says more than 6M signed up for health care…"
high-tech company                    Apple    "Apple leads in list of Silicon Valley's most-valuable brands…"

Page 143: Tutorial:   mining latent entity structures

Mining Entity Concepts for Typing
Large scale taxonomies
◦ Built from human-curated knowledgebases
◦ Constructed from free text
Type entities in text
Type entities in web tables and lists

143

Page 144: Tutorial:   mining latent entity structures

Large Scale Taxonomies

Name              Source                    # concepts   # entities   Hierarchy
DBpedia (v3.9)    Wikipedia infoboxes       529          3M           Tree
YAGO2s            Wiki, WordNet, GeoNames   350K         10M          Tree
Freebase          Miscellaneous             23K          23M          Flat
Probase (MS.KB)   Web text                  2M           5M           DAG

144


Page 145: Tutorial:   mining latent entity structures

Type Entities in Text
Relying on knowledgebases – entity linking
◦ Context similarity: [Bunescu & Pasca 06] etc.
◦ Topical coherence: [Cucerzan 07] etc.
◦ Context similarity + entity popularity + topical coherence: Wikifier [Ratinov et al. 11]
◦ Jointly linking multiple mentions: AIDA [Hoffart et al. 11] etc.
◦ …

145

Page 146: Tutorial:   mining latent entity structures

Limitation of Entity Linking
Low recall of knowledgebases
Sparse concept descriptors

Can we type entities without relying on knowledgebases?
Yes! Exploit the redundancy in the corpus
◦ Not relying on knowledgebases: targeted disambiguation of ad-hoc, homogeneous entities [Wang et al. 12a]
◦ Partially relying on knowledgebases: mining additional evidence in the corpus for disambiguation [Li et al. 13]

146

(Examples: only 82 of 900 shoe brands exist in Wikipedia; "Michael Jordan won the best paper award": which Michael Jordan?)

Page 147: Tutorial:   mining latent entity structures

Targeted Disambiguation [Wang et al. 12a]

147

Target entities:

Entity Id   Entity Name
e1          Microsoft
e2          Apple
e3          HP

d1: "Microsoft and Apple are the developers of three of the most popular operating systems"
d2: "Apple trees take four to five years to produce their first fruit…"
d3: "Microsoft's new operating system, Windows 8, is a PC operating system for the tablet age…"
d4: "CEO Meg Whitman said that HP is focusing on Windows 8 for its tablet strategy"
d5: "Audi is offering a racing version of its hottest TT model: a 380 HP, front-wheel…"

Page 148: Tutorial:   mining latent entity structures

Targeted Disambiguation

148

(Same target entities e1–e3 and documents d1–d5 as on the previous slide.)

Page 149: Tutorial:   mining latent entity structures

Insight – Context Similarity

149

(Documents d1–d5 as above; the contexts of the true mentions in d1 and d3, both about operating systems, are marked Similar.)

Page 150: Tutorial:   mining latent entity structures

Insight – Context Similarity

150

(Documents d1–d5 as above; the context of d2, about apple trees, is marked Dissimilar from the operating-system contexts.)

Page 151: Tutorial:   mining latent entity structures

Insight – Context Similarity

151

(Documents d1–d5 as above; the context of d5, where HP means horsepower, is marked Dissimilar from the others.)

Page 152: Tutorial:   mining latent entity structures

Insight – Leverage Homogeneity

Hypothesis: the contexts of two true mentions are more similar than the contexts of two false mentions of two distinct entities, and more similar than the contexts of a true mention and a false mention.

Caveat: the contexts of false mentions of the same entity can still be similar to one another.

152

[Figure: candidate senses per entity name. Sun: IT corp., Sunday, surname, newspaper. Apple: IT corp., fruit. HP: IT corp., horsepower, others.]
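A tiny illustration of this context-similarity signal (our own, not from the tutorial), using TF-IDF cosine similarity as one plausible similarity measure between mention contexts:

```python
# Contexts of true mentions (d1, d3: operating systems) score higher
# cosine similarity than a true/false pair (d1 vs. d2: apple trees).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

contexts = [
    "Microsoft and Apple are the developers of three of the most popular operating systems",
    "Apple trees take four to five years to produce their first fruit",
    "Microsoft's new operating system, Windows 8, is a PC operating system for the tablet age",
]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(contexts)
sim = cosine_similarity(tfidf)
print(sim[0, 2])  # d1 vs d3: relatively similar
print(sim[0, 1])  # d1 vs d2: dissimilar
```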

Page 153: Tutorial:   mining latent entity structures

Insight – Comention

153

(Documents d1–d5 as above; d1 co-mentions Microsoft and Apple, giving high confidence that both are true mentions.)

Page 154: Tutorial:   mining latent entity structures

Insight – Leverage Homogeneity

154

(Documents d1–d5 as above; the co-mentioned Microsoft and Apple in d1 are labeled True.)

Page 155: Tutorial:   mining latent entity structures

Insight – Leverage Homogeneity

155

(Documents d1–d5 as above; propagating through similar contexts, the mentions in d3 and d4 are also labeled True.)

Page 156: Tutorial:   mining latent entity structures

Insight – Leverage Homogeneity

156

(Documents d1–d5 as above; final labels: the mentions in d1, d3 and d4 are True, while d2, Apple the fruit, and d5, HP as horsepower, are False.)

Page 157: Tutorial:   mining latent entity structures

MentionRank

157

[Figure: a mention graph linking target entities e1 (Microsoft) and e2 (Apple) to documents d1–d3; each mention carries a confidence score (r11, r21, r22, r23), and edges connect mentions with similar contexts.]

Page 158: Tutorial:   mining latent entity structures

MentionRank

158

• r_ij ∈ [0,1] – confidence that (e_i, d_j) is a true mention
• π_ij ∈ [0,1] – prior estimate that (e_i, d_j) is a true mention (from co-mentions)
• μ_ij,i'j' ∈ [0,1] – weight between (e_i, d_j) and (e_i', d_j') (context similarity)
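The slides do not spell out the update rule, but given the PageRank analogy, a MentionRank-style propagation can be sketched as below; the damping factor and row normalization are our assumptions.

```python
# PageRank-style propagation: confidence r is anchored to the co-mention
# prior pi and diffused along context-similarity weights mu.
import numpy as np

def mention_rank(pi, mu, alpha=0.15, iters=50):
    """pi: prior confidence per mention, shape (m,);
    mu: pairwise context-similarity weights, shape (m, m)."""
    P = mu / np.maximum(mu.sum(axis=1, keepdims=True), 1e-12)  # row-normalize
    r = pi.copy()
    for _ in range(iters):
        r = alpha * pi + (1 - alpha) * (P @ r)
    return r

pi = np.array([0.9, 0.1, 0.5])            # e.g., d1 has a strong co-mention prior
mu = np.array([[0.0, 0.1, 0.8],
               [0.1, 0.0, 0.1],
               [0.8, 0.1, 0.0]])
print(mention_rank(pi, mu))               # d1 and d3 reinforce each other
```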

Page 159: Tutorial:   mining latent entity structures

Experiment results

159

Two example sets of target entities:
Programming Languages: Java, Python, Ruby, Perl, R…
Science Fiction Books: A Pleasure to Burn, Golden Reflections, Pathfinder…

Page 160: Tutorial:   mining latent entity structures

Type Entities in Web Tables & Lists [Limaye et al. 10]

160

Page 161: Tutorial:   mining latent entity structures

Type Entities in Web Tables & Lists Collective typing - Annotate the entire column of a table or a list

161

Paper                       Taxonomy             Method
[Limaye et al. 10]          YAGO                 Conditional random field for joint annotation (supervised)
[Venetis et al. 11]         Google's IsA DB      1) Maximum likelihood when 10%–50% of the entities hit the DB: score(c) = ∏_e p(e|c), where p(e|c) ∝ p(c|e)/p(c) is smoothed by the frequency of (e, c) in the DB; 2) majority voting when >50% hit: score(c) = Σ_e 1/Rank(c|e)
[Song et al. 11]            MS.KB                Naïve Bayes: p(c) ∏_e p(e|c), where p(e|c) is smoothed by the frequency of (e, c) in MS.KB
[Pimplikar & Sarawagi 12]   Ad-hoc column name   A dual problem: find the columns that best fit the query concepts; conditional random fields (supervised)
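As an illustration of the naive-Bayes-style scoring in the table above, here is a toy column-typing sketch. It is our own: the `isa_counts` database and the priors are hypothetical, and real systems smooth and scale this far more carefully.

```python
# Score candidate concepts for a whole column of cell entities with
# p(c) * prod_e p(e|c), where p(e|c) is estimated from a (hypothetical)
# IsA database and lightly smoothed.
import math

isa_counts = {  # (entity, concept) -> frequency in a toy IsA DB
    ("obama", "politician"): 90, ("merkel", "politician"): 80,
    ("obama", "person"): 10, ("merkel", "person"): 5,
    ("einstein", "person"): 1000,
}
concept_prior = {"politician": 0.01, "person": 0.05}

def score_column(cells, concept, smooth=1e-6):
    total = sum(f for (e, c), f in isa_counts.items() if c == concept) or 1
    logp = math.log(concept_prior[concept])
    for e in cells:
        p = isa_counts.get((e, concept), 0) / total + smooth  # smoothed p(e|c)
        logp += math.log(p)
    return logp

column = ["obama", "merkel"]
print(max(concept_prior, key=lambda c: score_column(column, c)))  # politician
```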

Page 162: Tutorial:   mining latent entity structures

Outline

1. Introduction to mining latent entity structures

2. Mining latent topic hierarchies

3. Mining latent entity relations

4. Mining latent entity concepts

5. Trends and research problems

162

Page 163: Tutorial:   mining latent entity structures

Open Problems on Mining Latent Entity Structures

What is the best way to organize information and interact with users?

163

Page 164: Tutorial:   mining latent entity structures

Understand the Data

System, architecture and database

Information quality and security

164

Coverage & Volatility

Utility

How do we design such a multi-layer organization system?

How do we control information quality and resolve conflicts?

Page 165: Tutorial:   mining latent entity structures

Understand the People

NLP, ML, AI

HCI, Crowdsourcing, Web search, domain experts

165

Understand & answer natural language questions

Explore latent structures with user guidance

Page 166: Tutorial:   mining latent entity structures

Mining Relations & Concepts from Multiple Sources
Sources: knowledgebases, taxonomies, web tables, web pages, domain text, social media, social networks

166

[Diagram: knowledgebases such as Freebase and Satori annotate and guide the mining of the other sources, which in turn enrich the knowledgebases]

Page 167: Tutorial:   mining latent entity structures

Integration of NLP & Data Mining Approaches

NLP: analyzing single sentences. Data mining: analyzing big data.

167

Page 168: Tutorial:   mining latent entity structures

Application to the slot filling task [Yu et al. 14b]

168

[Plots: "Enhance Individual SF Systems": F-measure (%) of about 20 slot-filling systems before vs. after validation, showing gains across systems. "Truth finding efficiency": number of truths found vs. total responses for (1) Baseline, (2) Voting, (3) Linguistic Indicator, (4) SVM, (5) MTM, (6) Oracle.]

Page 169: Tutorial:   mining latent entity structures

Mining Latent Entity Structures from Massive Unstructured & Linked Data

169

Concepts

Topics

Relations

text

entity

Frequent pattern mining + probabilistic analysis

Page 170: Tutorial:   mining latent entity structures

Biography

170

Chi Wang (http://cs.uiuc.edu/homes/chiwang1), Ph.D. candidate in Computer Science, UIUC. His Ph.D. research has focused on mining hidden entity structures from massive data and information networks, with over 30 research publications in refereed journals and conferences including KDD, WWW, SIGIR, ICDM, SDM, CIKM, ECMLPKDD, IJCAI, and SIGMOD. He is a winner of the prestigious Microsoft Research Graduate Research Fellowship, the KDDCUP'13 runner-up award, and best paper award candidacies at CIKM'13 and ICDM'13.

Jiawei Han (http://www.cs.uiuc.edu/hanj), Abel Bliss Professor, Computer Science, UIUC. He has been researching data mining, database systems, and information networks, with over 600 conference and journal publications. He is a Fellow of ACM and a Fellow of IEEE, the first author of the textbook "Data Mining: Concepts and Techniques", 3rd ed. (Morgan Kaufmann, 2011), and the Director of the Information Network Academic Research Center, supported by the U.S. Army Research Lab since 2009. He has delivered over 20 tutorials at major conferences including SIGMOD, VLDB, KDD, WWW, WSDM, ICDE, ICDM, SDM, ASONAM, and CIKM.

Page 171: Tutorial:   mining latent entity structures

References
1. [Wang et al. 14a] C. Wang, X. Liu, Y. Song, J. Han. Scalable Moment-based Inference for Latent Dirichlet Allocation, ECMLPKDD'14.

2. [Li et al. 14] R. Li, C. Wang, K. Chang. User Profiling in Ego Network: An Attribute and Relationship Type Co-profiling Approach, WWW’14.

3. [Danilevsky et al. 14] M. Danilevsky, C. Wang, N. Desai, X. Ren, J. Guo, J. Han. Automatic Construction and Ranking of Topical Keyphrases on Collections of Short Documents, SDM'14.

4. [Wang et al. 13b] C. Wang, M. Danilevsky, J. Liu, N. Desai, H. Ji, J. Han. Constructing Topical Hierarchies in Heterogeneous Information Networks, ICDM’13.

5. [Wang et al. 13a] C. Wang, M. Danilevsky, N. Desai, Y. Zhang, P. Nguyen, T. Taula, and J. Han. A Phrase Mining Framework for Recursive Construction of a Topical Hierarchy, KDD’13.

6. [Li et al. 13] Y. Li, C. Wang, F. Han, J. Han, D. Roth, and X. Yan. Mining Evidences for Named Entity Disambiguation, KDD’13.

171

Page 172: Tutorial:   mining latent entity structures

References
7. [Wang et al. 12a] C. Wang, K. Chakrabarti, T. Cheng, S. Chaudhuri. Targeted Disambiguation of Ad-hoc, Homogeneous Sets of Named Entities, WWW'12.

8. [Wang et al. 12b] C. Wang, J. Han, Q. Li, X. Li, W. Lin and H. Ji. Learning Hierarchical Relationships among Partially Ordered Objects with Heterogeneous Attributes and Links, SDM’12.

9. [Wang et al. 11] H. Wang, C. Wang, C. Zhai and J. Han. Learning Online Discussion Structures by Conditional Random Fields, SIGIR’11.

10. [Wang et al. 10] C. Wang, J. Han, Y. Jia, J. Tang, D. Zhang, Y. Yu and J. Guo. Mining Advisor-advisee Relationship from Research Publication Networks, KDD’10.

11. [Danilevsky et al. 13] M. Danilevsky, C. Wang, F. Tao, S. Nguyen, G. Chen, N. Desai, J. Han. AMETHYST: A System for Mining and Exploring Topical Hierarchies in Information Networks, KDD’13.

172

Page 173: Tutorial:   mining latent entity structures

References
12. [Sun et al. 11] Y. Sun, J. Han, X. Yan, P. S. Yu, T. Wu. PathSim: Meta path-based top-k similarity search in heterogeneous information networks, VLDB'11.

13. [Hofmann 99] T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis, UAI’99.

14. [Blei et al. 03] D. M. Blei, A. Y. Ng, M. I. Jordan. Latent Dirichlet allocation, Journal of Machine Learning Research, 2003.

15. [Griffiths & Steyvers 04] T. L. Griffiths, M. Steyvers. Finding scientific topics, Proc. of the National Academy of Sciences of USA, 2004.

16. [Anandkumar et al. 12] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, M. Telgarsky. Tensor decompositions for learning latent variable models, arXiv:1210.7559, 2012.

17. [Porteous et al. 08] I. Porteous, D. Newman, A. Ihler, A. Asuncion, P. Smyth, M. Welling. Fast collapsed gibbs sampling for latent dirichlet allocation, KDD’08.

173

Page 174: Tutorial:   mining latent entity structures

References
18. [Hoffman et al. 12] M. Hoffman, D. M. Blei, D. M. Mimno. Sparse stochastic inference for latent dirichlet allocation, ICML'12.

19. [Yao et al. 09] L. Yao, D. Mimno, A. McCallum. Efficient methods for topic model inference on streaming document collections, KDD’09.

20. [Newman et al. 09] D. Newman, A. Asuncion, P. Smyth, M. Welling. Distributed algorithms for topic models, Journal of Machine Learning Research, 2009.

21. [Hoffman et al. 13] M. Hoffman, D. Blei, C. Wang, J. Paisley. Stochastic variational inference, Journal of Machine Learning Research, 2013.

22. [Griffiths et al. 04] T. Griffiths, M. Jordan, J. Tenenbaum, and D. M. Blei. Hierarchical topic models and the nested chinese restaurant process, NIPS’04.

23. [Kim et al. 12a] J. H. Kim, D. Kim, S. Kim, and A. Oh. Modeling topic hierarchies with the recursive chinese restaurant process, CIKM’12.

174

Page 175: Tutorial:   mining latent entity structures

References
24. [Wang et al. 14b] C. Wang, X. Liu, Y. Song, J. Han. Scalable and Robust Construction of Topical Hierarchies, arXiv:1403.3460, 2014.

25. [Li & McCallum 06] W. Li, A. McCallum. Pachinko allocation: Dag-structured mixture models of topic correlations, ICML’06.

26. [Mimno et al. 07] D. Mimno, W. Li, A. McCallum. Mixtures of hierarchical topics with pachinko allocation, ICML’07.

27. [Ahmed et al. 13] A. Ahmed, L. Hong, A. Smola. Nested chinese restaurant franchise process: Applications to user tracking and document modeling, ICML’13.

28. [Wallach 06] H. M. Wallach. Topic modeling: beyond bag-of-words, ICML’06.

29. [Wang et al. 07] X. Wang, A. McCallum, X. Wei. Topical n-grams: Phrase and topic discovery, with an application to information retrieval, ICDM’07.

175

Page 176: Tutorial:   mining latent entity structures

References
30. [Lindsey et al. 12] R. V. Lindsey, W. P. Headden, III, M. J. Stipicevic. A phrase-discovering topic model using hierarchical Pitman-Yor processes, EMNLP-CoNLL'12.

31. [Mei et al. 07] Q. Mei, X. Shen, C. Zhai. Automatic labeling of multinomial topic models, KDD’07.

32. [Blei & Lafferty 09] D. M. Blei, J. D. Lafferty. Visualizing Topics with Multi-Word Expressions, arXiv:0907.1013, 2009.

33. [Danilevsky et al. 14] M. Danilevsky, C. Wang, N. Desai, J. Guo, J. Han. Automatic construction and ranking of topical keyphrases on collections of short documents, SDM’14.

34. [Kim et al. 12b] H. D. Kim, D. H. Park, Y. Lu, C. Zhai. Enriching Text Representation with Frequent Pattern Mining for Probabilistic Topic Modeling, ASIST’12.

35. [El-Kishky et al. 14] A. El-Kishky, Y. Song, C. Wang, C.R. Voss, J. Han. Scalable Topical Phrase Mining from Large Text Corpora, arXiv:1406.6312, 2014.

176

Page 177: Tutorial:   mining latent entity structures

References
36. [Zhao et al. 11] W. X. Zhao, J. Jiang, J. He, Y. Song, P. Achananuparp, E.-P. Lim, X. Li. Topical keyphrase extraction from twitter, HLT'11.

37. [Church et al. 91] K. Church, W. Gale, P. Hanks, D. Hindle. Chap. 6, Using statistics in lexical analysis, 1991.

38. [Sun et al. 09a] Y. Sun, J. Han, J. Gao, Y. Yu. itopicmodel: Information network-integrated topic modeling, ICDM’09.

39. [Deng et al. 11] H. Deng, J. Han, B. Zhao, Y. Yu, C. X. Lin. Probabilistic topic models with biased propagation on heterogeneous information networks, KDD’11.

40. [Chen et al. 12] X. Chen, M. Zhou, L. Carin. The contextual focused topic model, KDD’12.

41. [Tang et al. 13] J. Tang, M. Zhang, Q. Mei. One theme in all views: modeling consensus topics in multiple contexts, KDD’13.

177

Page 178: Tutorial:   mining latent entity structures

References
42. [Kim et al. 12c] H. Kim, Y. Sun, J. Hockenmaier, J. Han. ETM: Entity topic models for mining documents associated with entities, ICDM'12.

43. [Cohn & Hofmann 01] D. Cohn, T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity, NIPS'01.

44. [Blei & Jordan 03] D. Blei, M. I. Jordan. Modeling annotated data, SIGIR’03.

45. [Newman et al. 06] D. Newman, C. Chemudugunta, P. Smyth, M. Steyvers. Statistical Entity-Topic Models, KDD’06.

46. [Sun et al. 09b] Y. Sun, Y. Yu, J. Han. Ranking-based clustering of heterogeneous information networks with star network schema, KDD’09.

47. [Chang et al. 09] J. Chang, J. Boyd-Graber, C. Wang, S. Gerrish, D.M. Blei. Reading tea leaves: How humans interpret topic models, NIPS’09.

178

Page 179: Tutorial:   mining latent entity structures

References
48. [Bunescu & Mooney 05a] R. C. Bunescu, R. J. Mooney. A shortest path dependency kernel for relation extraction, HLT'05.

49. [Bunescu & Mooney 05b] R. C. Bunescu, R. J. Mooney. Subsequence kernels for relation extraction, NIPS’05.

50. [Zelenko et al. 03] D. Zelenko, C. Aone, A. Richardella. Kernel methods for relation extraction, Journal of Machine Learning Research, 2003.

51. [Culotta & Sorensen 04] A. Culotta, J. Sorensen. Dependency tree kernels for relation extraction, ACL’04.

52. [McCallum et al. 05] A. McCallum, A. Corrada-Emmanuel, X. Wang. Topic and role discovery in social networks, IJCAI’05.

53. [Leskovec et al. 10] J. Leskovec, D. Huttenlocher, J. Kleinberg. Predicting positive and negative links in online social networks, WWW’10.

179

Page 180: Tutorial:   mining latent entity structures

References
54. [Diehl et al. 07] C. Diehl, G. Namata, L. Getoor. Relationship identification for social network discovery, AAAI'07.

55. [Tang et al. 11] W. Tang, H. Zhuang, J. Tang. Learning to infer social ties in large networks, ECMLPKDD’11.

56. [McAuley & Leskovec 12] J. McAuley, J. Leskovec. Learning to discover social circles in ego networks, NIPS’12.

57. [Yakout et al. 12] M. Yakout, K. Ganjam, K. Chakrabarti, S. Chaudhuri. InfoGather: Entity Augmentation and Attribute Discovery By Holistic Matching with Web Tables, SIGMOD’12.

58. [Koller & Friedman 09] D. Koller, N. Friedman. Probabilistic Graphical Models: Principles and Techniques, 2009.

59. [Bunescu & Pasca 06] R. Bunescu, M. Pasca. Using encyclopedic knowledge for named entity disambiguation, EACL'06.

180

Page 181: Tutorial:   mining latent entity structures

References
60. [Cucerzan 07] S. Cucerzan. Large-scale named entity disambiguation based on wikipedia data, EMNLP-CoNLL'07.

61. [Ratinov et al. 11] L. Ratinov, D. Roth, D. Downey, M. Anderson. Local and global algorithms for disambiguation to wikipedia, ACL’11.

62. [Hoffart et al. 11] J. Hoffart, M. Yosef, I. Bordino, H. Fürstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater, G. Weikum. Robust disambiguation of named entities in text, EMNLP'11.

63. [Limaye et al. 10] G. Limaye, S. Sarawagi, S. Chakrabarti. Annotating and searching web tables using entities, types and relationships, VLDB’10.

64. [Venetis et al. 11] P. Venetis, A. Halevy, J. Madhavan, M. Pasca, W. Shen, F. Wu, G. Miao, C. Wu. Recovering semantics of tables on the web, VLDB’11.

65. [Song et al. 11] Y. Song, H. Wang, Z. Wang, H. Li, W. Chen. Short Text Conceptualization using a Probabilistic Knowledgebase, IJCAI’11.

181

Page 182: Tutorial:   mining latent entity structures

References
66. [Pimplikar & Sarawagi 12] R. Pimplikar, S. Sarawagi. Answering table queries on the web using column keywords, VLDB'12.

67. [Yu et al. 14a] X. Yu, X. Ren, Y. Sun, Q. Gu, B. Sturt, U. Khandelwal, B. Norick, J. Han. Personalized Entity Recommendation: A Heterogeneous Information Network Approach, WSDM’14.

68. [Yu et al. 14b] D. Yu, H. Huang, T. Cassidy, H. Ji, C. Wang, S. Zhi, J. Han, C. Voss. The Wisdom of Minority: Unsupervised Slot Filling Validation based on Multi-dimensional Truth-Finding with Multi-layer Linguistic Indicators, COLING’14.

182