Building Topic Models in a Federated Digital Library Through
Selective Document Exclusion
ASIST 2011, New Orleans, LA, October 10, 2011
Miles Efron, Peter Organisciak, Katrina Fenlon
Graduate School of Library & Information Science, University of Illinois, Urbana-Champaign
Supported by IMLS LG-06-07-0020.
The Setting: IMLS DCC
[Architecture diagram: data providers (IMLS NLG & LSTA collections) expose metadata records, which the DCC service provider harvests via OAI-PMH and builds services over.]
High-Level Research Interest
• Improve “access” to data harvested for federated digital libraries by enhancing:
  – Representation of documents
  – Representation of document aggregations
  – Capitalizing on the relationship between aggregations and documents
• PS: By “document” I mean a single metadata (usually DC) record.
Motivation for our Work
• Most empirical approaches to this type of problem rely on some kind of analysis of term counts.
• Unreliable for our data:
  – Vocabulary mismatch
  – Poor probability estimates
The Setting: IMLS DCC
The Problem: Supporting End-User Experience
• Full-text search
• Browse by “subject”
• Desired:
  – Improved browsing
  – Support for high-level aggregation understanding and resource discovery
• Approach: empirically induced “topics” using established methods, e.g. latent Dirichlet allocation (LDA).
Research Question
• Can we improve induced models by mitigating the influence of noisy data, common in federated digital library settings?
• Hypothesis: Harvested records are not all useful for training a model of corpus-level topics.
• Approach: Identify and remove “weakly topical” documents during model training.
Latent Dirichlet Allocation
• Given a corpus of documents, C, and an empirically chosen integer k,
• assume that a generative process involving k latent topics generated word occurrences in C.
• End result, for a given word w and a given document D, for each topic T1 … Tk:
  – Pr(w | Ti)
  – Pr(D | Ti)
  – Pr(Ti)
Latent Dirichlet Allocation
1. Choose doc length N ~ Poisson(mu).
2. Choose probability vector Theta ~ Dir(alpha).
3. For each word wi, i in 1:N:
   a) Choose topic zi ~ Multinomial(Theta).
   b) Choose word wi from Pr(wi | zi, Beta).
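As a sketch, the generative process above can be simulated in a few lines of Python. This is a toy: K, V, the Dirichlet hyperparameters, and the mean length mu are illustrative values, not ones from the paper.

```python
import math
import random

random.seed(0)
K, V, MU = 3, 50, 20          # topics, vocabulary size, mean doc length (illustrative)
ALPHA, BETA = 0.5, 0.1        # Dirichlet hyperparameters (illustrative)

def poisson(mu):
    """Knuth's Poisson sampler (the stdlib random module has none)."""
    limit, k, p = math.exp(-mu), 0, 1.0
    while p > limit:
        k += 1
        p *= random.random()
    return k - 1

def dirichlet(conc, dim):
    """Symmetric Dirichlet draw via normalized Gamma samples."""
    g = [random.gammavariate(conc, 1.0) for _ in range(dim)]
    total = sum(g)
    return [x / total for x in g]

# Per-topic word distributions, drawn once for the whole corpus.
phi = [dirichlet(BETA, V) for _ in range(K)]

def generate_document():
    n = max(1, poisson(MU))            # 1. choose doc length N ~ Poisson(mu)
    theta = dirichlet(ALPHA, K)        # 2. choose topic mixture Theta ~ Dir(alpha)
    words = []
    for _ in range(n):                 # 3. for each word position:
        z = random.choices(range(K), weights=theta)[0]             # a) zi ~ Multinomial(Theta)
        words.append(random.choices(range(V), weights=phi[z])[0])  # b) wi ~ Pr(w | zi, Beta)
    return words

corpus = [generate_document() for _ in range(5)]
```

Note that the per-topic word distributions phi are drawn once for the corpus, while the topic mixture theta is drawn fresh for every document.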
Latent Dirichlet Allocation
Calculate estimates via iterative methods: MCMC / Gibbs sampling.
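A minimal collapsed Gibbs sampler shows what those iterative estimates look like in practice. This is an instructional toy, not the implementation behind the paper; the hyperparameters and the tiny corpus are invented.

```python
import random

random.seed(0)

def gibbs_lda(docs, k, vocab_size, alpha=0.1, beta=0.01, iters=50):
    """Collapsed Gibbs sampling for LDA over docs given as lists of word ids."""
    z = [[random.randrange(k) for _ in doc] for doc in docs]  # random init
    n_dk = [[0] * k for _ in docs]                 # doc-topic counts
    n_kw = [[0] * vocab_size for _ in range(k)]    # topic-word counts
    n_k = [0] * k                                  # per-topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            n_dk[d][t] += 1; n_kw[t][w] += 1; n_k[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                n_dk[d][t] -= 1; n_kw[t][w] -= 1; n_k[t] -= 1  # remove assignment
                # full conditional: (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                weights = [(n_dk[d][j] + alpha) * (n_kw[j][w] + beta)
                           / (n_k[j] + vocab_size * beta) for j in range(k)]
                t = random.choices(range(k), weights=weights)[0]  # resample zi
                z[d][i] = t
                n_dk[d][t] += 1; n_kw[t][w] += 1; n_k[t] += 1
    return z, n_dk, n_kw

# Toy corpus: two rough "themes" over a 6-word vocabulary.
docs = [[0, 1, 2, 0, 1], [1, 0, 2, 2], [3, 4, 5, 3], [4, 3, 5, 4, 5]]
z, n_dk, n_kw = gibbs_lda(docs, k=2, vocab_size=6)
```

After sampling, Pr(w | T), Pr(D | T), and Pr(T) are read off the smoothed count matrices n_kw and n_dk.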
[Pipeline diagram, built up across four slides: the proposed algorithm filters the Full Corpus down to a Reduced Corpus; the model, Pr(w | T), Pr(D | T), Pr(T), is trained on the Reduced Corpus; inference then applies the trained model back over the Full Corpus, assigning Pr(w | T), Pr(D | T), Pr(T) to every document.]
Sample Topics Induced from “Raw” Data
Documents’ Topical Strength
• Hypothesis: Harvested records are not all useful for training a model of corpus-level topics.
Documents’ Topical Strength
• Proposal: Improve the induced topic model by removing “weakly topical” documents during training.
• After training, use the inferential apparatus of LDA to assign topics to these “stop documents.”
Identifying “Stop Documents”
• Time at which documents enter a repository is often informative (e.g. bulk uploads).

log Pr(di | MC)

where MC is the collection language model and di is the words comprising the ith document.
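A sketch of scoring records by log Pr(di | MC). The unigram collection model with add-one smoothing is an assumption made for illustration, since the slide does not specify an estimator; the sample records are invented.

```python
import math
from collections import Counter

def collection_log_prob(doc_tokens, coll_counts, coll_total, vocab_size):
    """Score a document by log Pr(d_i | M_C): the sum of log unigram
    probabilities of its words under the collection language model M_C.
    Add-one smoothing (an assumption) keeps unseen words finite."""
    return sum(math.log((coll_counts[w] + 1) / (coll_total + vocab_size))
               for w in doc_tokens)

# Toy collection of harvested metadata records (invented).
records = [["map", "county", "survey"],
           ["map", "plat", "county"],
           ["oral", "history", "interview"]]
coll = Counter(w for rec in records for w in rec)
total = sum(coll.values())
scores = [collection_log_prob(rec, coll, total, len(coll)) for rec in records]
```

Records built from words common in the collection score higher; runs of near-identical scores in harvest order are the signal the paper's algorithm exploits.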
Identifying “Stop Documents”
• Our paper outlines an algorithm for accomplishing this.
• Intuition:
  – Given a document di, decide whether it is part of a “run” of near-identical records.
  – Remove all records that occur within a run.
  – The degree of homogeneity required to identify a run is governed by a parameter tol, a cumulative-normal confidence level: e.g. 95% or 99%.
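One way the run-detection intuition might be sketched in Python. The specific criterion below (z-scores of consecutive score differences, a minimum run length min_run) is an assumption of this sketch, not the paper's exact algorithm; tol plays the slide's role of a cumulative-normal confidence level.

```python
import statistics

def stop_document_runs(scores, tol=0.95, min_run=3):
    """Return indices of documents sitting inside a 'run' of near-identical
    log-likelihood scores (scores listed in harvest order).

    A consecutive-score difference counts as near-identical when its
    z-score is unusually small at confidence tol; min_run such differences
    in a row mark a run.  Both choices are assumptions for this sketch."""
    diffs = [abs(b - a) for a, b in zip(scores, scores[1:])]
    mu = statistics.mean(diffs)
    sd = statistics.stdev(diffs) or 1e-12
    z = statistics.NormalDist().inv_cdf(tol)          # e.g. 1.645 for tol = 0.95
    flat = [(d - mu) / sd < -z for d in diffs]        # unusually small gaps
    remove, i = set(), 0
    while i < len(flat):
        j = i
        while j < len(flat) and flat[j]:
            j += 1
        if j - i >= min_run:                          # a long-enough run
            remove.update(range(i, j + 1))            # docs spanning diffs i..j-1
        i = max(j, i + 1)
    return remove

# Toy harvest: varied records surrounding a bulk upload of identical ones.
scores = [0, 10] * 7 + [5] * 6 + [0, 10] * 7
removed = stop_document_runs(scores)
```

With tol = 0.95, the six identically scored records in the middle are flagged for removal, while isolated fluctuations are kept.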
Sample Topics Induced from Groomed Data
Experimental Assessment
• Question: Are topics built from “sampled” corpora more coherent than topics induced from raw corpora?
• Intrusion detection:
  – Find the 10 most probable words for topic Ti.
  – Replace one of these 10 with a word chosen from the corpus with uniform probability.
  – Ask human assessors to identify the “intruder” word.
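Constructing a single intrusion item can be sketched as follows; the word lists are invented for illustration.

```python
import random

random.seed(1)

def make_intrusion_item(topic_top_words, vocabulary):
    """Build one word-intrusion item: keep the topic's 10 most probable
    words, swap one for a word drawn uniformly from the rest of the
    corpus vocabulary, and record which word is the intruder."""
    shown = list(topic_top_words[:10])
    intruder = random.choice([w for w in vocabulary if w not in shown])
    shown[random.randrange(len(shown))] = intruder
    random.shuffle(shown)          # so position does not reveal the intruder
    return shown, intruder

# Invented example topic and vocabulary.
topic_words = ["map", "survey", "plat", "county", "township",
               "range", "section", "land", "atlas", "boundary"]
vocabulary = topic_words + ["banana", "quilt", "locomotive", "sermon"]
shown, intruder = make_intrusion_item(topic_words, vocabulary)
```

If a topic is coherent, assessors should spot the intruder easily; the fraction who do is the coherence measure used in the assessment.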
Experimental Assessment
• For each topic Ti, have 20 assessors try to find an intruder (20 different intruders). Repeat for both the “sampled” and “raw” models, i.e. 20 × 2 × 100 = 4,000 assessments.
• Asi is the percentage of workers who correctly found the intruder in the ith topic of the sampled model; Ari is the analogous figure for the raw model.
• A one-sided test of H0: Asi ≤ Ari against H1: Asi > Ari rejects the null with p < 0.001.
Experimental Assessment
• For each topic Ti have 20 workers subjectively assess the topic’s “coherence,” reporting on a 4-point Likert scale.
Current & Future Work
• Testing breadth of coverage
• Assessing the value of induced topics
• Topic information for document priors in the language-modeling IR framework [next slide]
• Massive document expansion for improved language model estimation [under review]
Weak Topicality and Document Priors
Thank You
ASIST 2011, New Orleans, LA, October 10, 2011
Miles Efron, Peter Organisciak, Katrina Fenlon
Graduate School of Library & Information Science, University of Illinois, Urbana-Champaign