Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Transcript of Building Topic Models in a Federated Digital Library Through Selective Document Exclusion

Page 1: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Building Topic Models in a Federated Digital Library Through

Selective Document Exclusion

ASIST 2011, New Orleans, LA, October 10, 2011

Miles Efron, Peter Organisciak, Katrina Fenlon
Graduate School of Library & Information Science

University of Illinois, Urbana-Champaign

Supported by IMLS LG-06-07-0020.

Page 2: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


The Setting: IMLS DCC

[Diagram: data providers (IMLS NLG and LSTA collections) expose metadata, which the DCC harvests via OAI-PMH as the service provider and builds services upon.]

Page 3: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


High-Level Research Interest

• Improve “access” to data harvested for federated digital libraries by:
  – enhancing the representation of documents
  – enhancing the representation of document aggregations
  – capitalizing on the relationship between aggregations and documents
• PS: By “document” we mean a single metadata record (usually Dublin Core).

Page 4: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Motivation for our Work

• Most empirical approaches to this type of problem rely on some kind of analysis of term counts.

• Unreliable for our data:
  – vocabulary mismatch
  – poor probability estimates

Page 5: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


The Setting: IMLS DCC

Page 6: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


The Problem: Supporting End-User Experience

• Full-text search
• Browse by “subject”
• Desired:
  – improved browsing
  – support for high-level aggregation understanding and resource discovery
• Approach: empirically induced “topics” using established methods, e.g. latent Dirichlet allocation (LDA).

Page 7: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Page 8: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Page 9: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Page 10: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Page 11: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Research Question

• Can we improve induced models by mitigating the influence of noisy data, common in federated digital library settings?

• Hypothesis: Harvested records are not all useful for training a model of corpus-level topics.

• Approach: Identify and remove “weakly topical” documents during model training.

Page 12: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Latent Dirichlet Allocation

• Given: a corpus of documents C and an empirically chosen integer k.
• Assume that a generative process involving k latent topics generated the word occurrences in C.
• End result, for each topic Ti in T1 … Tk, a given word w, and a given document D:
  – Pr(w | Ti)
  – Pr(D | Ti)
  – Pr(Ti)

Page 13: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Latent Dirichlet Allocation

• Given: a corpus of documents C and an empirically chosen integer k.
• Assume that a generative process involving k latent topics generated the word occurrences in C.
• End result, for each topic Ti in T1 … Tk, a given word w, and a given document D:
  – Pr(w | Ti)
  – Pr(D | Ti)
  – Pr(Ti)

The assumed generative process for each document:
1. Choose the document length N ~ Poisson(mu).
2. Choose a topic-proportion vector Theta ~ Dir(alpha).
3. For each word position n in 1:N:
   a) Choose a topic zn ~ Multinomial(Theta).
   b) Choose the word wn from Pr(wn | zn, Beta).
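The following is a minimal Python sketch of this generative story. The toy vocabulary and the values of k, alpha, Beta, and mu are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["farm", "river", "church", "school", "map", "photograph"]  # toy vocabulary
k = 2                                        # number of latent topics
alpha = np.full(k, 0.1)                      # Dirichlet prior on per-document topic mixtures
Beta = rng.dirichlet(np.ones(len(vocab)), size=k)   # per-topic word distributions
mu = 8                                       # mean document length

def generate_document():
    N = rng.poisson(mu)                      # 1. choose document length N ~ Poisson(mu)
    Theta = rng.dirichlet(alpha)             # 2. choose topic proportions Theta ~ Dir(alpha)
    words = []
    for _ in range(N):                       # 3. for each word position n in 1:N
        z_n = rng.choice(k, p=Theta)         #    a) choose topic z_n ~ Multinomial(Theta)
        words.append(rng.choice(vocab, p=Beta[z_n]))  # b) choose word w_n from Pr(w_n | z_n, Beta)
    return words

print(generate_document())
```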

Page 14: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Latent Dirichlet Allocation

• Given: a corpus of documents C and an empirically chosen integer k.
• Assume that a generative process involving k latent topics generated the word occurrences in C.
• End result, for each topic Ti in T1 … Tk, a given word w, and a given document D:
  – Pr(w | Ti)
  – Pr(D | Ti)
  – Pr(Ti)

Calculate the estimates via iterative methods: MCMC / Gibbs sampling.
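As an illustration of the estimation step, here is a compact collapsed Gibbs sampler for LDA. It sketches the general MCMC / Gibbs sampling idea, not the specific sampler or settings used in this work; the corpus format (documents as lists of word ids) and the hyperparameters are assumptions.

```python
import numpy as np

def gibbs_lda(docs, vocab_size, k, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """docs: list of documents, each a list of integer word ids."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), k))           # document-topic counts
    nkw = np.zeros((k, vocab_size))          # topic-word counts
    nk = np.zeros(k)                         # tokens assigned to each topic
    z = []                                   # current topic assignment of every token
    for d, doc in enumerate(docs):           # random initialization
        zd = rng.integers(k, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(n_iter):                  # sweep all tokens, resampling each topic
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                  # withdraw the token's current assignment
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # collapsed full conditional Pr(z = t | all other assignments)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                t = rng.choice(k, p=p / p.sum())
                z[d][i] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    # point estimates analogous to the quantities on the slide; Pr(D | T)
    # follows from Pr(T | D) and Pr(T) via Bayes' rule.
    pr_w_given_t = (nkw + beta) / (nkw.sum(axis=1, keepdims=True) + vocab_size * beta)
    pr_t_given_d = (ndk + alpha) / (ndk.sum(axis=1, keepdims=True) + k * alpha)
    pr_t = nk / nk.sum()
    return pr_w_given_t, pr_t_given_d, pr_t
```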

Page 15: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Full Corpus

Page 16: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


[Diagram: the proposed algorithm is applied to the full corpus.]

Page 17: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


[Diagram: the model is trained on the reduced corpus, yielding Pr(w | T), Pr(D | T), and Pr(T).]

Page 18: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


[Diagram: LDA inference carries the trained estimates Pr(w | T), Pr(D | T), and Pr(T) back over the full corpus.]
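A minimal sketch of this train-then-infer pipeline, using scikit-learn purely for illustration (its LDA implementation uses variational inference rather than Gibbs sampling); reduced_texts and full_texts are hypothetical lists holding the text of the groomed and complete record sets.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# reduced_texts / full_texts: hypothetical lists of metadata-record text
k = 100                                               # e.g. a 100-topic model, as evaluated later

vectorizer = CountVectorizer(stop_words="english")
X_reduced = vectorizer.fit_transform(reduced_texts)   # "Train the Model": only retained records
X_full = vectorizer.transform(full_texts)             # but represent every harvested record

lda = LatentDirichletAllocation(n_components=k, random_state=0)
lda.fit(X_reduced)                                    # model trained on the reduced corpus

doc_topic_full = lda.transform(X_full)                # "Inference": topic mixtures for all records,
                                                      # including the excluded "stop documents"
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # Pr(w | T)
```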

Page 19: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Sample Topics Induced from “Raw” Data

Page 20: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Documents’ Topical Strength

• Hypothesis: Harvested records are not all useful for training a model of corpus-level topics.

Page 21: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Documents’ Topical Strength

• Hypothesis: Harvested records are not all useful for training a model of corpus-level topics.
• Proposal: Improve the induced topic model by removing “weakly topical” documents during training.
• After training, use the inferential apparatus of LDA to assign topics to these “stop documents.”

Page 22: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Identifying “Stop Documents”

• The time at which documents enter a repository is often informative (e.g. bulk uploads).

  log Pr(di | MC)

  where MC is the collection language model and di is the set of words comprising the ith document.
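A minimal sketch of computing log Pr(di | MC) with a smoothed unigram collection language model; the function and variable names are assumptions for illustration.

```python
import math
from collections import Counter

def collection_log_likelihoods(documents):
    """documents: list of token lists; returns log Pr(di | MC) for each document."""
    coll = Counter()
    for doc in documents:
        coll.update(doc)                     # MC: term counts over the whole collection
    total = sum(coll.values())
    vocab_size = len(coll)

    def log_prob(doc):
        # add-one smoothing so rare terms do not drive the score to -infinity
        return sum(math.log((coll[w] + 1) / (total + vocab_size)) for w in doc)

    return [log_prob(doc) for doc in documents]
```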

Page 23: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Identifying “Stop Documents”

• Our paper outlines an algorithm for accomplishing this.
• Intuition:
  – Given a document di, decide whether it is part of a “run” of near-identical records.
  – Remove all records that occur within a run.
  – The amount of homogeneity required to identify a run is governed by a parameter tol, a cumulative-normal confidence level (e.g. 95% or 99%).
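One way this run-detection intuition could be realized is sketched below. This is an illustrative reconstruction, not the paper's exact algorithm: consecutive records (in accession order) are flagged when their similarity is unusually high, with the cutoff taken from the cumulative normal at confidence tol.

```python
from statistics import NormalDist

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def flag_stop_documents(docs_in_accession_order, tol=0.95):
    """docs_in_accession_order: token lists, ordered by when records entered the repository."""
    sims = [jaccard(docs_in_accession_order[i - 1], docs_in_accession_order[i])
            for i in range(1, len(docs_in_accession_order))]
    if not sims:
        return set()
    mean = sum(sims) / len(sims)
    sd = (sum((s - mean) ** 2 for s in sims) / len(sims)) ** 0.5
    cutoff = mean + NormalDist().inv_cdf(tol) * sd   # tol as a cumulative-normal confidence
    flagged = set()
    for i, s in enumerate(sims, start=1):
        if s > cutoff:                       # near-identical to its predecessor: part of a run
            flagged.update({i - 1, i})       # exclude every record in the run from training
    return flagged                           # indices of candidate "stop documents"
```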

Page 24: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Page 25: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Page 26: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Sample Topics Induced from Groomed Data

Page 27: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Experimental Assessment

• Question: Are topics built from “sampled” corpora more coherent than topics induced from raw corpora?

• Intrusion detection:
  – Find the 10 most probable words for topic Ti.
  – Replace one of these 10 with a word chosen from the corpus with uniform probability.
  – Ask human assessors to identify the “intruder” word.
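A small sketch of building one intrusion item per topic as described above; pr_w_given_t (a k-by-V matrix of Pr(w | T)) and vocab (the matching word list) are assumed names.

```python
import random

def make_intrusion_item(pr_w_given_t, vocab, topic_index, rng=None):
    rng = rng or random.Random(0)
    row = pr_w_given_t[topic_index]
    # the 10 most probable words for topic Ti
    top10 = [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: row[i], reverse=True)[:10]]
    # an intruder drawn uniformly from the rest of the corpus vocabulary
    intruder = rng.choice([w for w in vocab if w not in top10])
    item = top10[:]
    item[rng.randrange(10)] = intruder       # replace one of the ten with the intruder
    rng.shuffle(item)
    return item, intruder                    # show `item` to assessors; score against `intruder`
```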

Page 28: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Experimental Assessment

• For each topic Ti, have 20 assessors try to find an intruder (20 different intruders). Repeat for both the “sampled” and “raw” models:
  – i.e. 20 * 2 * 100 = 4,000 assessments.
• Let A^s_i be the percentage of workers who correctly found the intruder in the ith topic of the sampled model, and A^r_i the analogous percentage for the raw model.
• A one-sided test of the hypothesis A^s_i > A^r_i yields p < 0.001.
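The slides do not state which significance test was used, so the sketch below simply illustrates one reasonable choice, a one-sided paired t-test over the 100 per-topic accuracies; acc_sampled and acc_raw are hypothetical arrays of A^s_i and A^r_i values.

```python
from scipy.stats import ttest_rel

def compare_models(acc_sampled, acc_raw):
    # alternative="greater": is per-topic accuracy higher under the sampled model?
    stat, p_value = ttest_rel(acc_sampled, acc_raw, alternative="greater")
    return stat, p_value
```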

Page 29: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Experimental Assessment

• For each topic Ti have 20 workers subjectively assess the topic’s “coherence,” reporting on a 4-point Likert scale.

Page 30: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Current & Future Work

• Testing breadth of coverage
• Assessing the value of induced topics

• Topic information for document priors in the language modeling IR framework [next slide]

• Massive document expansion for improved language model estimation [under review]

Page 31: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Weak Topicality and Document Priors

Page 32: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Weak Topicality and Document Priors

Page 33: Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


Thank You

ASIST 2011, New Orleans, LA, October 10, 2011

Miles Efron, Peter Organisciak, Katrina Fenlon
Graduate School of Library & Information Science

University of Illinois, Urbana-Champaign