Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011...

Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron Peter Organisciak Katrina Fenlon Graduate School of Library & Information Science University of Illinois, Urbana- Champaign Supported by IMLS LG-06- 07-0020. 1


Building Topic Models in a Federated Digital Library Through Selective Document Exclusion

ASIST 2011, New Orleans, LA, October 10, 2011

Miles Efron, Peter Organisciak, Katrina Fenlon

Graduate School of Library & Information Science

University of Illinois, Urbana-Champaign

Supported by IMLS LG-06-07-0020.


The Setting: IMLS DCC

[Diagram: data providers (IMLS NLG & LSTA) hold collections and expose their metadata via OAI-PMH; the DCC, acting as service provider, harvests that metadata and builds services on it.]


High-Level Research Interest

• Improve “access” to data harvested for federated digital libraries by:

– Enhancing the representation of documents

– Enhancing the representation of document aggregations

– Capitalizing on the relationship between aggregations and documents

• PS: By “document” I mean a single metadata (usually DC) record.


Motivation for our Work

• Most empirical approaches to this type of problem rely on some kind of analysis of term counts.

• Term counts are unreliable for our data:

– Vocabulary mismatch

– Poor probability estimates


The Problem: Supporting End-User Experience

• Full-text search

• Browse by “subject”

• Desired:

– Improved browsing

– Support for high-level aggregation understanding and resource discovery

• Approach: Empirically induced “topics” using established methods, e.g. latent Dirichlet allocation (LDA).


Research Question

• Can we improve induced models by mitigating the influence of noisy data, common in federated digital library settings?

• Hypothesis: Harvested records are not all useful for training a model of corpus-level topics.

• Approach: Identify and remove “weakly topical” documents during model training.


Latent Dirichlet Allocation

• Given a corpus of documents, C, and an empirically chosen integer k

• Assume that a generative process involving k latent topics generated word occurrences in C.

• End result: for a given word w and a given document D, for each topic T1 … Tk:

– Pr(w|Ti)

– Pr(D|Ti)

– Pr(Ti)


Generative process for each document:

1. Choose document length N ~ Poisson(mu).

2. Choose probability vector Theta ~ Dir(alpha).

3. For each word position n in 1:N:

a) Choose topic zn ~ Multinomial(Theta).

b) Choose word wn from P(wn | zn, Beta).
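The generative steps above can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the function name `generate_document`, the toy sizes (3 topics, 8 vocabulary items), and the parameter values are hypothetical and do not come from the talk.

```python
import numpy as np

def generate_document(rng, alpha, beta, mean_length):
    """Sample one document from the LDA generative process.

    alpha: Dirichlet parameter over topics, shape (k,)
    beta: per-topic word distributions, shape (k, vocab_size), rows sum to 1
    mean_length: Poisson mean for the document length (mu on the slide)
    """
    k, vocab_size = beta.shape
    n_words = rng.poisson(mean_length)              # 1. N ~ Poisson(mu)
    theta = rng.dirichlet(alpha)                    # 2. Theta ~ Dir(alpha)
    topics = rng.choice(k, size=n_words, p=theta)   # 3a. z_n ~ Multinomial(Theta)
    # 3b. draw each word from the word distribution of its assigned topic
    words = np.array([rng.choice(vocab_size, p=beta[z]) for z in topics], dtype=int)
    return theta, topics, words

# Toy setup (assumed values, for illustration only).
rng = np.random.default_rng(0)
alpha = np.full(3, 0.5)
beta = rng.dirichlet(np.ones(8), size=3)
theta, topics, words = generate_document(rng, alpha, beta, mean_length=20)
```

Each call returns the document's topic proportions, the per-word topic assignments, and the word ids themselves.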


Calculate estimates via iterative methods: MCMC / Gibbs sampling.
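As a rough illustration of what Gibbs sampling for LDA involves, here is a minimal collapsed Gibbs sampler. It is a textbook-style sketch, not the implementation used in the talk: the function name `gibbs_lda`, the hyperparameter values, and the toy corpus are all assumptions. Note that it returns Pr(w | T) and the per-document topic mixture Pr(T | D) rather than the slide's Pr(D | T).

```python
import numpy as np

def gibbs_lda(docs, vocab_size, k, alpha=0.1, eta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA over documents given as lists of word ids."""
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), k))      # tokens in doc d assigned to topic t
    topic_word = np.zeros((k, vocab_size))    # times word w is assigned to topic t
    topic_total = np.zeros(k)                 # total tokens assigned to topic t
    assignments = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z = rng.integers(k, size=len(doc))
        assignments.append(z)
        for w, t in zip(doc, z):
            doc_topic[d, t] += 1
            topic_word[t, w] += 1
            topic_total[t] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            z = assignments[d]
            for n, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                t = z[n]
                doc_topic[d, t] -= 1
                topic_word[t, w] -= 1
                topic_total[t] -= 1
                # Conditional distribution over topics for this token.
                p = (doc_topic[d] + alpha) * (topic_word[:, w] + eta)
                p = p / (topic_total + vocab_size * eta)
                t = rng.choice(k, p=p / p.sum())
                # Record the new assignment.
                z[n] = t
                doc_topic[d, t] += 1
                topic_word[t, w] += 1
                topic_total[t] += 1
    # Smoothed point estimates from the final counts.
    phi = (topic_word + eta) / (topic_total[:, None] + vocab_size * eta)
    theta = (doc_topic + alpha) / (doc_topic.sum(axis=1, keepdims=True) + k * alpha)
    return phi, theta

# Toy corpus (assumed): two groups of documents over disjoint word ids.
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 1], [5, 3, 4, 4]]
phi, theta = gibbs_lda(docs, vocab_size=6, k=2)
```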

[Diagram: the proposed algorithm reduces the full corpus to a reduced corpus; the model is trained on the reduced corpus, yielding Pr(w | T), Pr(D | T), and Pr(T); inference is then run over the full corpus to obtain Pr(w | T), Pr(D | T), and Pr(T) for every document.]


Sample Topics Induced from “Raw” Data


Documents’ Topical Strength

• Hypothesis: Harvested records are not all useful for training a model of corpus-level topics.

• Proposal: Improve the induced topic model by removing “weakly topical” documents during training.

• After training, use the inferential apparatus of LDA to assign topics to these “stop documents.”
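To make the last step concrete, the sketch below assigns a topic mixture to a held-out document while keeping the trained topic-word probabilities fixed. It uses a simple EM-style fold-in, which is a simplification and not LDA's full variational or Gibbs inference; the function name `fold_in`, the smoothing value, and the toy topic matrix are assumptions.

```python
import numpy as np

def fold_in(word_ids, phi, alpha=0.1, n_iter=50):
    """Estimate Pr(T | d) for one held-out document, with phi = Pr(w | T) held fixed.

    word_ids: list of word ids in the held-out document
    phi: topic-word probabilities, shape (k, vocab_size), assumed smoothed (no zeros)
    """
    k = phi.shape[0]
    theta = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: share of responsibility each topic takes for each token.
        resp = theta[:, None] * phi[:, word_ids]
        resp /= resp.sum(axis=0, keepdims=True)
        # M-step: re-estimate the topic mixture with light smoothing.
        theta = resp.sum(axis=1) + alpha
        theta /= theta.sum()
    return theta

# Toy trained model (assumed): topic 0 favors words 0-1, topic 1 favors words 2-3.
phi = np.array([[0.45, 0.45, 0.05, 0.05],
                [0.05, 0.05, 0.45, 0.45]])
stop_doc_theta = fold_in([0, 1, 0, 1, 1], phi)
```

A document excluded from training still receives a topic assignment this way, so excluding it costs nothing at browse time.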


Identifying “Stop Documents”

• The time at which documents enter a repository is often informative (e.g. bulk uploads).

• Score each document by log Pr(di | MC), where MC is the collection language model and di is the words comprising the ith document.
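A minimal sketch of this score, assuming MC is a maximum-likelihood unigram model over the whole collection (the slide does not say how the model is estimated or smoothed). The function names and toy records are hypothetical.

```python
import math
from collections import Counter

def collection_model(docs):
    """Maximum-likelihood unigram model over all documents in the collection."""
    counts = Counter(word for doc in docs for word in doc)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def log_likelihood(doc, model):
    """log Pr(d_i | M_C): sum of log word probabilities under the collection model."""
    # Every word of a collection document has nonzero probability under this model.
    return sum(math.log(model[word]) for word in doc)

# Toy metadata records (assumed), already tokenized.
docs = [
    ["photo", "of", "barn"],
    ["photo", "of", "church"],
    ["letter", "from", "lincoln"],
]
model = collection_model(docs)
scores = [log_likelihood(doc, model) for doc in docs]
```

Records built from the same template tend to receive nearly identical scores, which is what makes the score useful for spotting bulk uploads.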


Identifying “Stop Documents”

• Our paper outlines an algorithm for accomplishing this.

• Intuition:

– Given a document di, decide if it is part of a “run” of near-identical records.

– Remove all records that occur within a run.

– The amount of homogeneity required to identify a run is governed by a parameter tol, a cumulative normal probability (e.g. 95% or 99% confidence).
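The paper's actual algorithm is not reproduced in these slides, so the following is only a stand-in that illustrates the intuition. It assumes documents are scored by log-likelihood in accession order, treats two neighbors as near-identical when their score difference is small relative to the spread of all differences, and invents both the decision rule and the `min_run` parameter. Only `tol` comes from the slide, and its use here is a guess.

```python
from statistics import NormalDist, pstdev

def find_runs(scores, tol=0.95, min_run=3):
    """Return indices of documents inside a run of near-identical scores.

    scores: per-document scores in accession order (e.g. log-likelihoods)
    tol: confidence level; higher values demand more homogeneity (assumed rule)
    min_run: hypothetical minimum number of documents that counts as a run
    """
    if len(scores) < 2:
        return set()
    diffs = [b - a for a, b in zip(scores, scores[1:])]
    spread = pstdev(diffs)
    # Half-width of the central (1 - tol) mass of a standard normal.
    z = NormalDist().inv_cdf(0.5 + (1 - tol) / 2)
    limit = z * spread
    flagged = set()
    start = 0  # first document of the current candidate run
    # The infinite sentinel closes whatever run is open at the end.
    for i, diff in enumerate(diffs + [float("inf")]):
        if abs(diff) > limit:
            if i - start + 1 >= min_run:
                flagged.update(range(start, i + 1))
            start = i + 1
    return flagged

# Toy scores (assumed): documents 1-4 look like a bulk upload.
scores = [-40.0, -12.1, -12.1, -12.1, -12.1, -55.3, -31.0]
stop_documents = find_runs(scores)
```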


Sample Topics Induced from Groomed Data


Experimental Assessment

• Question: Are topics built from “sampled” corpora more coherent than topics induced from raw corpora?

• Intrusion detection:

– Find the 10 most probable words for topic Ti

– Replace one of these 10 with a word chosen from the corpus with uniform probability.

– Ask human assessors to identify the “intruder” word.
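Building one intrusion item might look like the sketch below. The function name and toy vocabulary are hypothetical, and excluding words already in the top-10 list from the intruder draw is an assumption the slide does not state.

```python
import random

def make_intrusion_item(top_words, vocabulary, rng):
    """Replace one of a topic's top words with a uniformly drawn intruder."""
    # Assumption: the intruder may not be one of the topic's own top words.
    candidates = [word for word in vocabulary if word not in top_words]
    intruder = rng.choice(candidates)
    shown = list(top_words)
    shown[rng.randrange(len(shown))] = intruder
    rng.shuffle(shown)  # hide the intruder's position from assessors
    return shown, intruder

# Toy topic and vocabulary (assumed).
top_words = ["war", "civil", "army", "soldier", "battle",
             "regiment", "union", "camp", "general", "infantry"]
vocabulary = top_words + ["quilt", "railroad", "orchid", "ledger"]
shown, intruder = make_intrusion_item(top_words, vocabulary, random.Random(0))
```

The intuition is that the more coherent the topic, the easier the intruder is to spot.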


Experimental Assessment

• For each topic Ti, have 20 assessors try to find an intruder (20 different intruders). Repeat for both the “sampled” and “raw” models.

– i.e. 20 * 2 * 100 = 4,000 assessments

• Asi is the percent of workers who correctly found the intruder in the ith topic of the sampled model, and Ari is the analogous figure for the raw model.

• Testing Asi > Ari (against the null hypothesis that the sampled model is no better) yields p < 0.001.
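The slide does not name the significance test, so the sketch below substitutes a one-sided paired sign-flip permutation test over per-topic accuracies. The function names and the accuracy values are invented for illustration and are not the study's data.

```python
import random

def accuracy_per_topic(judgments):
    """judgments[i] holds one boolean per assessor for topic i (True = found intruder)."""
    return [sum(votes) / len(votes) for votes in judgments]

def paired_sign_flip_test(sampled, raw, n_permutations=10000, seed=0):
    """One-sided p-value for mean(sampled - raw) > 0 under random sign flips."""
    rng = random.Random(seed)
    diffs = [s - r for s, r in zip(sampled, raw)]
    observed = sum(diffs) / len(diffs)
    hits = 0
    for _ in range(n_permutations):
        # Under the null, each paired difference is equally likely to have either sign.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if flipped >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)

# Made-up per-topic accuracies for eight topics.
sampled_acc = [0.90, 0.80, 0.85, 0.70, 0.95, 0.90, 0.75, 0.80]
raw_acc = [0.60, 0.55, 0.70, 0.65, 0.80, 0.60, 0.50, 0.70]
p_value = paired_sign_flip_test(sampled_acc, raw_acc)
```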


Experimental Assessment

• For each topic Ti have 20 workers subjectively assess the topic’s “coherence,” reporting on a 4-point Likert scale.


Current & Future Work

• Testing breadth of coverage

• Assessing the value of induced topics

• Topic information for document priors in the language modeling IR framework [next slide]

• Massive document expansion for improved language model estimation [under review]


Weak Topicality and Document Priors



Thank You

ASIST 2011, New Orleans, LA, October 10, 2011

Miles Efron, Peter Organisciak, Katrina Fenlon

Graduate School of Library & Information Science

University of Illinois, Urbana-Champaign