Building Topic Models in a Federated Digital Library Through
Selective Document Exclusion
ASIST 2011, New Orleans, LA, October 10, 2011
Miles Efron, Peter Organisciak, Katrina Fenlon
Graduate School of Library & Information Science, University of Illinois, Urbana-Champaign
Supported by IMLS LG-06-07-0020.
The Setting: IMLS DCC
[Architecture diagram: data providers (IMLS NLG & LSTA collections) expose metadata records, which the DCC service provider harvests via OAI-PMH and builds services over.]
High-Level Research Interest
• Improve “access” to data harvested for federated digital libraries by enhancing:
  – Representation of documents
  – Representation of document aggregations
  – Capitalizing on the relationship between aggregations and documents
• PS: By “document” I mean a single metadata (usually DC) record.
Motivation for our Work
• Most empirical approaches to this type of problem rely on some kind of analysis of term counts.
• Unreliable for our data:
  – Vocabulary mismatch
  – Poor probability estimates
The Setting: IMLS DCC
The Problem: Supporting End-User Experience
• Full-text search
• Browse by “subject”
• Desired:
  – Improved browsing
  – Support for high-level aggregation understanding and resource discovery
• Approach: empirically induced “topics” using established methods, e.g. latent Dirichlet allocation (LDA).
Research Question
• Can we improve induced models by mitigating the influence of noisy data, common in federated digital library settings?
• Hypothesis: Harvested records are not all useful for training a model of corpus-level topics.
• Approach: Identify and remove “weakly topical” documents during model training.
Latent Dirichlet Allocation
• Given a corpus of documents, C, and an empirically chosen integer k,
• assume that a generative process involving k latent topics generated word occurrences in C.
• End result, for a given word w and a given document D, for each topic T1 … Tk:
  – Pr(w | Ti)
  – Pr(D | Ti)
  – Pr(Ti)
Latent Dirichlet Allocation
1. Choose doc length N ~ Poisson(mu).
2. Choose probability vector Theta ~ Dir(alpha).
3. For each word wi, i in 1:N:
   a) Choose topic zi ~ Multinomial(Theta).
   b) Choose word wi from Pr(wi | zi, Beta).
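As a sketch, the generative process above can be simulated in a few lines of Python. This is a toy: K, V, the Dirichlet hyperparameters, and the mean length mu are illustrative values, not ones from the paper.

```python
import math
import random

random.seed(0)
K, V, MU = 3, 50, 20          # topics, vocabulary size, mean doc length (illustrative)
ALPHA, BETA = 0.5, 0.1        # Dirichlet hyperparameters (illustrative)

def poisson(mu):
    """Knuth's Poisson sampler (the stdlib random module has none)."""
    limit, k, p = math.exp(-mu), 0, 1.0
    while p > limit:
        k += 1
        p *= random.random()
    return k - 1

def dirichlet(conc, dim):
    """Symmetric Dirichlet draw via normalized Gamma samples."""
    g = [random.gammavariate(conc, 1.0) for _ in range(dim)]
    total = sum(g)
    return [x / total for x in g]

# Per-topic word distributions, drawn once for the whole corpus.
phi = [dirichlet(BETA, V) for _ in range(K)]

def generate_document():
    n = max(1, poisson(MU))            # 1. choose doc length N ~ Poisson(mu)
    theta = dirichlet(ALPHA, K)        # 2. choose topic mixture Theta ~ Dir(alpha)
    words = []
    for _ in range(n):                 # 3. for each word position:
        z = random.choices(range(K), weights=theta)[0]             # a) zi ~ Multinomial(Theta)
        words.append(random.choices(range(V), weights=phi[z])[0])  # b) wi ~ Pr(w | zi, Beta)
    return words

corpus = [generate_document() for _ in range(5)]
```

Note that the per-topic word distributions phi are drawn once for the corpus, while the topic mixture theta is drawn fresh for every document.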
Latent Dirichlet Allocation
Calculate estimates via iterative methods: MCMC / Gibbs sampling.
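A minimal collapsed Gibbs sampler shows what those iterative estimates look like in practice. This is an instructional toy, not the implementation behind the paper; the hyperparameters and the tiny corpus are invented.

```python
import random

random.seed(0)

def gibbs_lda(docs, k, vocab_size, alpha=0.1, beta=0.01, iters=50):
    """Collapsed Gibbs sampling for LDA over docs given as lists of word ids."""
    z = [[random.randrange(k) for _ in doc] for doc in docs]  # random init
    n_dk = [[0] * k for _ in docs]                 # doc-topic counts
    n_kw = [[0] * vocab_size for _ in range(k)]    # topic-word counts
    n_k = [0] * k                                  # per-topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            n_dk[d][t] += 1; n_kw[t][w] += 1; n_k[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                n_dk[d][t] -= 1; n_kw[t][w] -= 1; n_k[t] -= 1  # remove assignment
                # full conditional: (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                weights = [(n_dk[d][j] + alpha) * (n_kw[j][w] + beta)
                           / (n_k[j] + vocab_size * beta) for j in range(k)]
                t = random.choices(range(k), weights=weights)[0]  # resample zi
                z[d][i] = t
                n_dk[d][t] += 1; n_kw[t][w] += 1; n_k[t] += 1
    return z, n_dk, n_kw

# Toy corpus: two rough "themes" over a 6-word vocabulary.
docs = [[0, 1, 2, 0, 1], [1, 0, 2, 2], [3, 4, 5, 3], [4, 3, 5, 4, 5]]
z, n_dk, n_kw = gibbs_lda(docs, k=2, vocab_size=6)
```

After sampling, Pr(w | T), Pr(D | T), and Pr(T) are read off the smoothed count matrices n_kw and n_dk.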
[Pipeline diagram, built up across four slides: the proposed algorithm filters the Full Corpus down to a Reduced Corpus; the model, Pr(w | T), Pr(D | T), Pr(T), is trained on the Reduced Corpus; inference then applies the trained model back over the Full Corpus, assigning Pr(w | T), Pr(D | T), Pr(T) to every document.]
Sample Topics Induced from “Raw” Data
Documents’ Topical Strength
• Hypothesis: Harvested records are not all useful for training a model of corpus-level topics.
Documents’ Topical Strength
• Proposal: Improve the induced topic model by removing “weakly topical” documents during training.
• After training, use the inferential apparatus of LDA to assign topics to these “stop documents.”
Identifying “Stop Documents”
• Time at which documents enter a repository is often informative (e.g. bulk uploads).

log Pr(di | MC)

where MC is the collection language model and di is the words comprising the ith document.
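A sketch of scoring records by log Pr(di | MC). The unigram collection model with add-one smoothing is an assumption made for illustration, since the slide does not specify an estimator; the sample records are invented.

```python
import math
from collections import Counter

def collection_log_prob(doc_tokens, coll_counts, coll_total, vocab_size):
    """Score a document by log Pr(d_i | M_C): the sum of log unigram
    probabilities of its words under the collection language model M_C.
    Add-one smoothing (an assumption) keeps unseen words finite."""
    return sum(math.log((coll_counts[w] + 1) / (coll_total + vocab_size))
               for w in doc_tokens)

# Toy collection of harvested metadata records (invented).
records = [["map", "county", "survey"],
           ["map", "plat", "county"],
           ["oral", "history", "interview"]]
coll = Counter(w for rec in records for w in rec)
total = sum(coll.values())
scores = [collection_log_prob(rec, coll, total, len(coll)) for rec in records]
```

Records built from words common in the collection score higher; runs of near-identical scores in harvest order are the signal the paper's algorithm exploits.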
Identifying “Stop Documents”
• Our paper outlines an algorithm for accomplishing this.
• Intuition:
  – Given a document di, decide whether it is part of a “run” of near-identical records.
  – Remove all records that occur within a run.
  – The degree of homogeneity required to identify a run is governed by a parameter tol, a cumulative-normal confidence level: e.g. 95% or 99%.
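One way the run-detection intuition might be sketched in Python. The specific criterion below (z-scores of consecutive score differences, a minimum run length min_run) is an assumption of this sketch, not the paper's exact algorithm; tol plays the slide's role of a cumulative-normal confidence level.

```python
import statistics

def stop_document_runs(scores, tol=0.95, min_run=3):
    """Return indices of documents sitting inside a 'run' of near-identical
    log-likelihood scores (scores listed in harvest order).

    A consecutive-score difference counts as near-identical when its
    z-score is unusually small at confidence tol; min_run such differences
    in a row mark a run.  Both choices are assumptions for this sketch."""
    diffs = [abs(b - a) for a, b in zip(scores, scores[1:])]
    mu = statistics.mean(diffs)
    sd = statistics.stdev(diffs) or 1e-12
    z = statistics.NormalDist().inv_cdf(tol)          # e.g. 1.645 for tol = 0.95
    flat = [(d - mu) / sd < -z for d in diffs]        # unusually small gaps
    remove, i = set(), 0
    while i < len(flat):
        j = i
        while j < len(flat) and flat[j]:
            j += 1
        if j - i >= min_run:                          # a long-enough run
            remove.update(range(i, j + 1))            # docs spanning diffs i..j-1
        i = max(j, i + 1)
    return remove

# Toy harvest: varied records surrounding a bulk upload of identical ones.
scores = [0, 10] * 7 + [5] * 6 + [0, 10] * 7
removed = stop_document_runs(scores)
```

With tol = 0.95, the six identically scored records in the middle are flagged for removal, while isolated fluctuations are kept.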
Sample Topics Induced from Groomed Data
Experimental Assessment
• Question: Are topics built from “sampled” corpora more coherent than topics induced from raw corpora?
• Intrusion detection:
  – Find the 10 most probable words for topic Ti.
  – Replace one of these 10 with a word chosen from the corpus with uniform probability.
  – Ask human assessors to identify the “intruder” word.
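Constructing a single intrusion item can be sketched as follows; the word lists are invented for illustration.

```python
import random

random.seed(1)

def make_intrusion_item(topic_top_words, vocabulary):
    """Build one word-intrusion item: keep the topic's 10 most probable
    words, swap one for a word drawn uniformly from the rest of the
    corpus vocabulary, and record which word is the intruder."""
    shown = list(topic_top_words[:10])
    intruder = random.choice([w for w in vocabulary if w not in shown])
    shown[random.randrange(len(shown))] = intruder
    random.shuffle(shown)          # so position does not reveal the intruder
    return shown, intruder

# Invented example topic and vocabulary.
topic_words = ["map", "survey", "plat", "county", "township",
               "range", "section", "land", "atlas", "boundary"]
vocabulary = topic_words + ["banana", "quilt", "locomotive", "sermon"]
shown, intruder = make_intrusion_item(topic_words, vocabulary)
```

If a topic is coherent, assessors should spot the intruder easily; the fraction who do is the coherence measure used in the assessment.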
Experimental Assessment
• For each topic Ti, have 20 assessors try to find an intruder (20 different intruders). Repeat for both the “sampled” and “raw” models, i.e. 20 × 2 × 100 = 4,000 assessments.
• Asi is the percentage of workers who correctly found the intruder in the ith topic of the sampled model; Ari is the analogous figure for the raw model.
• A one-sided test of H0: Asi ≤ Ari against H1: Asi > Ari rejects the null with p < 0.001.
Experimental Assessment
• For each topic Ti have 20 workers subjectively assess the topic’s “coherence,” reporting on a 4-point Likert scale.
Current & Future Work
• Testing breadth of coverage
• Assessing the value of induced topics
• Topic information for document priors in the language-modeling IR framework [next slide]
• Massive document expansion for improved language model estimation [under review]
Weak Topicality and Document Priors
Thank You
ASIST 2011, New Orleans, LA, October 10, 2011
Miles Efron, Peter Organisciak, Katrina Fenlon
Graduate School of Library & Information Science, University of Illinois, Urbana-Champaign