Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Transcript of Building Topic Models in a Federated Digital Library Through Selective Document Exclusion

Page 1: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Building Topic Models in a Federated Digital Library Through

Selective Document Exclusion

ASIST 2011, New Orleans, LA, October 10, 2011

Miles Efron, Peter Organisciak, Katrina Fenlon
Graduate School of Library & Information Science

University of Illinois, Urbana-Champaign

Supported by IMLS LG-06-07-0020.

Page 2: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


The Setting: IMLS DCC

[Diagram: data providers (IMLS NLG and LSTA collections) expose metadata, which the DCC harvests via OAI-PMH as the service provider and builds services upon.]

Page 3: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


High-Level Research Interest

• Improve “access” to data harvested for federated digital libraries by:
  – enhancing the representation of documents
  – enhancing the representation of document aggregations
  – capitalizing on the relationship between aggregations and documents
• PS: By “document” we mean a single metadata record (usually Dublin Core).

Page 4: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Motivation for our Work

• Most empirical approaches to this type of problem rely on some kind of analysis of term counts.

• Unreliable for our data:
  – vocabulary mismatch
  – poor probability estimates

Page 5: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


The Setting: IMLS DCC

Page 6: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


The Problem: Supporting End-User Experience

• Full-text search
• Browse by “subject”
• Desired:
  – improved browsing
  – support for high-level aggregation understanding and resource discovery
• Approach: empirically induced “topics” using established methods, e.g. latent Dirichlet allocation (LDA).

Page 7: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Page 8: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Page 9: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Page 10: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Page 11: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Research Question

• Can we improve induced models by mitigating the influence of noisy data, common in federated digital library settings?

• Hypothesis: Harvested records are not all useful for training a model of corpus-level topics.

• Approach: Identify and remove “weakly topical” documents during model training.

Page 12: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Latent Dirichlet Allocation

• Given: a corpus of documents C and an empirically chosen integer k.
• Assume that a generative process involving k latent topics generated the word occurrences in C.
• End result, for each topic Ti in T1 … Tk, a given word w, and a given document D:
  – Pr(w | Ti)
  – Pr(D | Ti)
  – Pr(Ti)

Page 13: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Latent Dirichlet Allocation

• Given: a corpus of documents C and an empirically chosen integer k.
• Assume that a generative process involving k latent topics generated the word occurrences in C.
• End result, for each topic Ti in T1 … Tk, a given word w, and a given document D:
  – Pr(w | Ti)
  – Pr(D | Ti)
  – Pr(Ti)

The assumed generative process for each document:
1. Choose the document length N ~ Poisson(mu).
2. Choose a topic-proportion vector Theta ~ Dir(alpha).
3. For each word position n in 1:N:
   a) Choose a topic zn ~ Multinomial(Theta).
   b) Choose the word wn from Pr(wn | zn, Beta).
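The following is a minimal Python sketch of this generative story. The toy vocabulary and the values of k, alpha, Beta, and mu are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["farm", "river", "church", "school", "map", "photograph"]  # toy vocabulary
k = 2                                        # number of latent topics
alpha = np.full(k, 0.1)                      # Dirichlet prior on per-document topic mixtures
Beta = rng.dirichlet(np.ones(len(vocab)), size=k)   # per-topic word distributions
mu = 8                                       # mean document length

def generate_document():
    N = rng.poisson(mu)                      # 1. choose document length N ~ Poisson(mu)
    Theta = rng.dirichlet(alpha)             # 2. choose topic proportions Theta ~ Dir(alpha)
    words = []
    for _ in range(N):                       # 3. for each word position n in 1:N
        z_n = rng.choice(k, p=Theta)         #    a) choose topic z_n ~ Multinomial(Theta)
        words.append(rng.choice(vocab, p=Beta[z_n]))  # b) choose word w_n from Pr(w_n | z_n, Beta)
    return words

print(generate_document())
```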

Page 14: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Latent Dirichlet Allocation

• Given: a corpus of documents C and an empirically chosen integer k.
• Assume that a generative process involving k latent topics generated the word occurrences in C.
• End result, for each topic Ti in T1 … Tk, a given word w, and a given document D:
  – Pr(w | Ti)
  – Pr(D | Ti)
  – Pr(Ti)

Calculate the estimates via iterative methods: MCMC / Gibbs sampling.
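As an illustration of the estimation step, here is a compact collapsed Gibbs sampler for LDA. It sketches the general MCMC / Gibbs sampling idea, not the specific sampler or settings used in this work; the corpus format (documents as lists of word ids) and the hyperparameters are assumptions.

```python
import numpy as np

def gibbs_lda(docs, vocab_size, k, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """docs: list of documents, each a list of integer word ids."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), k))           # document-topic counts
    nkw = np.zeros((k, vocab_size))          # topic-word counts
    nk = np.zeros(k)                         # tokens assigned to each topic
    z = []                                   # current topic assignment of every token
    for d, doc in enumerate(docs):           # random initialization
        zd = rng.integers(k, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(n_iter):                  # sweep all tokens, resampling each topic
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                  # withdraw the token's current assignment
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # collapsed full conditional Pr(z = t | all other assignments)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                t = rng.choice(k, p=p / p.sum())
                z[d][i] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    # point estimates analogous to the quantities on the slide; Pr(D | T)
    # follows from Pr(T | D) and Pr(T) via Bayes' rule.
    pr_w_given_t = (nkw + beta) / (nkw.sum(axis=1, keepdims=True) + vocab_size * beta)
    pr_t_given_d = (ndk + alpha) / (ndk.sum(axis=1, keepdims=True) + k * alpha)
    pr_t = nk / nk.sum()
    return pr_w_given_t, pr_t_given_d, pr_t
```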

Page 15: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Full Corpus

Page 16: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


[Diagram: the proposed algorithm is applied to the full corpus.]

Page 17: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


[Diagram: the model is trained on the reduced corpus, yielding Pr(w | T), Pr(D | T), and Pr(T).]

Page 18: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


[Diagram: LDA inference carries the trained estimates Pr(w | T), Pr(D | T), and Pr(T) back over the full corpus.]
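A minimal sketch of this train-then-infer pipeline, using scikit-learn purely for illustration (its LDA implementation uses variational inference rather than Gibbs sampling); reduced_texts and full_texts are hypothetical lists holding the text of the groomed and complete record sets.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# reduced_texts / full_texts: hypothetical lists of metadata-record text
k = 100                                               # e.g. a 100-topic model, as evaluated later

vectorizer = CountVectorizer(stop_words="english")
X_reduced = vectorizer.fit_transform(reduced_texts)   # "Train the Model": only retained records
X_full = vectorizer.transform(full_texts)             # but represent every harvested record

lda = LatentDirichletAllocation(n_components=k, random_state=0)
lda.fit(X_reduced)                                    # model trained on the reduced corpus

doc_topic_full = lda.transform(X_full)                # "Inference": topic mixtures for all records,
                                                      # including the excluded "stop documents"
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # Pr(w | T)
```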

Page 19: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Sample Topics Induced from “Raw” Data

Page 20: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Documents’ Topical Strength

• Hypothesis: Harvested records are not all useful for training a model of corpus-level topics.

Page 21: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Documents’ Topical Strength

• Hypothesis: Harvested records are not all useful for training a model of corpus-level topics.
• Proposal: Improve the induced topic model by removing “weakly topical” documents during training.
• After training, use the inferential apparatus of LDA to assign topics to these “stop documents.”

Page 22: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Identifying “Stop Documents”

• The time at which documents enter a repository is often informative (e.g. bulk uploads).

  log Pr(di | MC)

  where MC is the collection language model and di is the set of words comprising the ith document.
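A minimal sketch of computing log Pr(di | MC) with a smoothed unigram collection language model; the function and variable names are assumptions for illustration.

```python
import math
from collections import Counter

def collection_log_likelihoods(documents):
    """documents: list of token lists; returns log Pr(di | MC) for each document."""
    coll = Counter()
    for doc in documents:
        coll.update(doc)                     # MC: term counts over the whole collection
    total = sum(coll.values())
    vocab_size = len(coll)

    def log_prob(doc):
        # add-one smoothing so rare terms do not drive the score to -infinity
        return sum(math.log((coll[w] + 1) / (total + vocab_size)) for w in doc)

    return [log_prob(doc) for doc in documents]
```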

Page 23: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Identifying “Stop Documents”

• Our paper outlines an algorithm for accomplishing this.
• Intuition:
  – Given a document di, decide whether it is part of a “run” of near-identical records.
  – Remove all records that occur within a run.
  – The amount of homogeneity required to identify a run is governed by a parameter tol, a cumulative-normal confidence level (e.g. 95% or 99%).
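One way this run-detection intuition could be realized is sketched below. This is an illustrative reconstruction, not the paper's exact algorithm: consecutive records (in accession order) are flagged when their similarity is unusually high, with the cutoff taken from the cumulative normal at confidence tol.

```python
from statistics import NormalDist

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def flag_stop_documents(docs_in_accession_order, tol=0.95):
    """docs_in_accession_order: token lists, ordered by when records entered the repository."""
    sims = [jaccard(docs_in_accession_order[i - 1], docs_in_accession_order[i])
            for i in range(1, len(docs_in_accession_order))]
    if not sims:
        return set()
    mean = sum(sims) / len(sims)
    sd = (sum((s - mean) ** 2 for s in sims) / len(sims)) ** 0.5
    cutoff = mean + NormalDist().inv_cdf(tol) * sd   # tol as a cumulative-normal confidence
    flagged = set()
    for i, s in enumerate(sims, start=1):
        if s > cutoff:                       # near-identical to its predecessor: part of a run
            flagged.update({i - 1, i})       # exclude every record in the run from training
    return flagged                           # indices of candidate "stop documents"
```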

Page 24: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Page 25: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Page 26: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Sample Topics Induced from Groomed Data

Page 27: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Experimental Assessment

• Question: Are topics built from “sampled” corpora more coherent than topics induced from raw corpora?

• Intrusion detection:
  – Find the 10 most probable words for topic Ti.
  – Replace one of these 10 with a word chosen from the corpus with uniform probability.
  – Ask human assessors to identify the “intruder” word.
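A small sketch of building one intrusion item per topic as described above; pr_w_given_t (a k-by-V matrix of Pr(w | T)) and vocab (the matching word list) are assumed names.

```python
import random

def make_intrusion_item(pr_w_given_t, vocab, topic_index, rng=None):
    rng = rng or random.Random(0)
    row = pr_w_given_t[topic_index]
    # the 10 most probable words for topic Ti
    top10 = [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: row[i], reverse=True)[:10]]
    # an intruder drawn uniformly from the rest of the corpus vocabulary
    intruder = rng.choice([w for w in vocab if w not in top10])
    item = top10[:]
    item[rng.randrange(10)] = intruder       # replace one of the ten with the intruder
    rng.shuffle(item)
    return item, intruder                    # show `item` to assessors; score against `intruder`
```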

Page 28: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Experimental Assessment

• For each topic Ti, have 20 assessors try to find an intruder (20 different intruders). Repeat for both the “sampled” and “raw” models:
  – i.e. 20 * 2 * 100 = 4,000 assessments.
• Let A^s_i be the percentage of workers who correctly found the intruder in the ith topic of the sampled model, and A^r_i the analogous percentage for the raw model.
• A one-sided test of the hypothesis A^s_i > A^r_i yields p < 0.001.
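The slides do not state which significance test was used, so the sketch below simply illustrates one reasonable choice, a one-sided paired t-test over the 100 per-topic accuracies; acc_sampled and acc_raw are hypothetical arrays of A^s_i and A^r_i values.

```python
from scipy.stats import ttest_rel

def compare_models(acc_sampled, acc_raw):
    # alternative="greater": is per-topic accuracy higher under the sampled model?
    stat, p_value = ttest_rel(acc_sampled, acc_raw, alternative="greater")
    return stat, p_value
```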

Page 29: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Experimental Assessment

• For each topic Ti have 20 workers subjectively assess the topic’s “coherence,” reporting on a 4-point Likert scale.

Page 30: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Current & Future Work

• Testing breadth of coverage
• Assessing the value of induced topics

• Topic information for document priors in the language modeling IR framework [next slide]

• Massive document expansion for improved language model estimation [under review]

Page 31: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Weak Topicality and Document Priors

Page 32: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Weak Topicality and Document Priors

Page 33: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Thank You

ASIST 2011, New Orleans, LA, October 10, 2011

Miles Efron, Peter Organisciak, Katrina Fenlon
Graduate School of Library & Information Science

University of Illinois, Urbana-Champaign