Building Topic Models in a Federated Digital Library Through Selective Document Exclusion


1

Building Topic Models in a Federated Digital Library Through Selective Document Exclusion

ASIST 2011, New Orleans, LA, October 10, 2011

Miles Efron, Peter Organisciak, Katrina Fenlon
Graduate School of Library & Information Science

University of Illinois, Urbana-Champaign

Supported by IMLS LG-06-07-0020.

2

The Setting: IMLS DCC

[Diagram: data providers (IMLS NLG and LSTA collections) expose metadata, which the DCC service provider harvests via OAI-PMH and builds services on.]

3

High-Level Research Interest

• Improve “access” to data harvested for federated digital libraries by enhancing:
– Representation of documents
– Representation of document aggregations
– The relationship between aggregations and documents
• PS: By “document” I mean a single metadata (usually Dublin Core) record.

4

Motivation for our Work

• Most empirical approaches to this problem rely on analysis of term counts.
• Term counts are unreliable for our data:
– Vocabulary mismatch
– Poor probability estimates

5

The Setting: IMLS DCC

6

The Problem: Supporting End-User Experience

• Full-text search
• Browse by “subject”
• Desired:
– Improved browsing
– Support high-level aggregation understanding and resource discovery
• Approach: Empirically induced “topics” using established methods, e.g. latent Dirichlet allocation (LDA).

7

8

9

10

11

Research Question

• Can we improve induced models by mitigating the influence of noisy data, common in federated digital library settings?

• Hypothesis: Harvested records are not all useful for training a model of corpus-level topics.

• Approach: Identify and remove “weakly topical” documents during model training.

12

Latent Dirichlet Allocation

• Given a corpus of documents C and an empirically chosen integer k.
• Assume that a generative process involving k latent topics generated the word occurrences in C.
• End result, for each topic Ti in T1 … Tk, given a word w and a document D:
– Pr(w | Ti)
– Pr(D | Ti)
– Pr(Ti)

13

Latent Dirichlet Allocation

• Given a corpus of documents C and an empirically chosen integer k.
• Assume that a generative process involving k latent topics generated the word occurrences in C.
• End result, for each topic Ti in T1 … Tk, given a word w and a document D:
– Pr(w | Ti)
– Pr(D | Ti)
– Pr(Ti)

The generative process:
1. Choose document length N ~ Poisson(mu).
2. Choose topic-proportion vector Theta ~ Dir(alpha).
3. For each word position n in 1:N:
a) Choose topic zn ~ Multinomial(Theta).
b) Choose word wn from P(wn | zn, Beta).
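A minimal sketch of this generative story in Python/NumPy; the vocabulary, alpha, Beta, and mean document length below are illustrative assumptions, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumed, not from the slides)
vocab = ["river", "map", "quilt", "photo", "census", "diary"]
k = 2                                    # number of latent topics
alpha = np.full(k, 0.1)                  # Dirichlet prior over topic mixtures
beta = rng.dirichlet(np.ones(len(vocab)), size=k)  # per-topic word distributions
mu = 8                                   # mean document length

def generate_document():
    n = rng.poisson(mu)                        # 1. length N ~ Poisson(mu)
    theta = rng.dirichlet(alpha)               # 2. Theta ~ Dir(alpha)
    words = []
    for _ in range(n):
        z = rng.choice(k, p=theta)             # 3a. topic z_n ~ Multinomial(Theta)
        w = rng.choice(len(vocab), p=beta[z])  # 3b. word w_n ~ P(w_n | z_n, Beta)
        words.append(vocab[w])
    return words

print(generate_document())
```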

14

Latent Dirichlet Allocation

• Given a corpus of documents C and an empirically chosen integer k.
• Assume that a generative process involving k latent topics generated the word occurrences in C.
• End result, for each topic Ti in T1 … Tk, given a word w and a document D:
– Pr(w | Ti)
– Pr(D | Ti)
– Pr(Ti)

Estimates are computed via iterative methods: MCMC / Gibbs sampling.
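For reference, a minimal sketch of fitting such a model with gensim; gensim's LdaModel estimates the same quantities with online variational Bayes rather than Gibbs sampling, and the toy records are assumptions, not DCC data:

```python
from gensim import corpora, models

# Toy stand-ins for harvested metadata records (assumed, not the DCC corpus)
texts = [
    ["quilt", "textile", "ohio", "museum"],
    ["river", "map", "survey", "illinois"],
    ["quilt", "pattern", "cotton", "museum"],
    ["census", "record", "county", "illinois"],
]

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

k = 2  # empirically chosen number of topics
lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=k,
                      passes=50, random_state=0)

# Per-topic word distributions, i.e. the most probable words under Pr(w | Ti)
for i in range(k):
    print(lda.show_topic(i, topn=4))

# Topic mixture inferred for a single document
print(lda.get_document_topics(bow_corpus[0]))
```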

15

Full Corpus

16

[Diagram: the proposed algorithm selects documents from the full corpus.]

17

[Diagram: the reduced corpus is used to train the model, yielding Pr(w | T), Pr(D | T), and Pr(T).]

18

[Diagram: the trained model’s estimates Pr(w | T), Pr(D | T), and Pr(T) are used to infer topic assignments for the full corpus.]
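A hedged sketch of this train-on-reduced, infer-on-full split, again with gensim as a stand-in; the "reduced" and "excluded" records are invented for illustration:

```python
from gensim import corpora, models

# Illustrative split: records kept for training vs. records excluded as
# weakly topical (e.g. a run of near-identical bulk uploads)
reduced_texts = [["quilt", "textile", "museum", "ohio"],
                 ["river", "map", "survey", "illinois"],
                 ["census", "record", "county", "illinois"]]
excluded_texts = [["postcard", "view", "river"],
                  ["postcard", "view", "river"]]

dictionary = corpora.Dictionary(reduced_texts)
train_bow = [dictionary.doc2bow(t) for t in reduced_texts]

# Train the model only on the reduced corpus ...
lda = models.LdaModel(train_bow, id2word=dictionary, num_topics=2,
                      passes=50, random_state=0)

# ... then use LDA's inference step to assign topics to every harvested
# record, including the "stop documents" excluded from training.
for text in reduced_texts + excluded_texts:
    bow = dictionary.doc2bow(text)
    print(lda.get_document_topics(bow, minimum_probability=0.0))
```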

19

Sample Topics Induced from “Raw” Data

20

Documents’ Topical Strength

• Hypothesis: Harvested records are not all useful for training a model of corpus-level topics.

21

Documents’ Topical Strength

• Hypothesis: Harvested records are not all useful for training a model of corpus-level topics.
• Proposal: Improve the induced topic model by removing “weakly topical” documents during training.
• After training, use the inferential apparatus of LDA to assign topics to these “stop documents.”

22

Identifying “Stop Documents”

• The time at which documents enter a repository is often informative (e.g. bulk uploads).
• Score each document under the collection language model:

log Pr(di | MC)

where MC is the collection language model and di is the words comprising the ith document.
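One way to compute this score is with a smoothed unigram collection model. The add-one smoothing and the toy records below are assumptions, not details taken from the paper:

```python
import math
from collections import Counter

# Toy harvested records (assumed); each "document" is one metadata record
docs = [["quilt", "textile", "museum"],
        ["river", "map", "survey"],
        ["postcard", "view", "postcard", "view"]]

# Collection language model M_C: smoothed unigram probabilities over all records
collection_counts = Counter(w for d in docs for w in d)
total = sum(collection_counts.values())
vocab_size = len(collection_counts)

def p_collection(word):
    # Add-one (Laplace) smoothing so unseen words do not zero out the score
    return (collection_counts[word] + 1) / (total + vocab_size)

def log_likelihood(doc):
    # log Pr(d_i | M_C) under a unigram independence assumption
    return sum(math.log(p_collection(w)) for w in doc)

for i, d in enumerate(docs):
    print(i, round(log_likelihood(d), 3))
```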

23

Identifying “Stop Documents”

• Our paper outlines an algorithm for accomplishing this.
• Intuition:
– Given a document di, decide whether it is part of a “run” of near-identical records.
– Remove all records that occur within a run.
– The amount of homogeneity required to identify a run is governed by a parameter tol, a cumulative-normal confidence level (e.g. 95% or 99%).
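The paper's algorithm is not reproduced in the slides, so the following is only a rough sketch of the stated intuition: flag consecutive records (in harvest order) whose collection-model scores are unusually close, with the cutoff derived from tol via the cumulative normal. The gap statistic and the robust scaling are assumptions:

```python
from statistics import NormalDist, median

def run_member_indices(scores, tol=0.95):
    """Flag documents that appear to sit inside a "run" of near-identical records.

    scores: log Pr(di | MC) for records in the order they entered the repository.
    tol:    cumulative-normal confidence level (e.g. 0.95 or 0.99).
    """
    z_cut = NormalDist().inv_cdf(tol)                  # e.g. 0.95 -> 1.645
    gaps = [abs(b - a) for a, b in zip(scores, scores[1:])]
    med = median(gaps)
    mad = median(abs(g - med) for g in gaps) or 1e-9   # robust spread, guard against 0
    flagged = set()
    for i, gap in enumerate(gaps):
        # An unusually small gap between consecutive records marks both
        # endpoints as near-identical, so both join the run.
        if (med - gap) / mad >= z_cut:
            flagged.update({i, i + 1})
    return sorted(flagged)

# Toy scores: records 2-4 look like a bulk upload of near-identical metadata
print(run_member_indices([-40.0, -25.0, -10.0, -10.0, -10.0, -32.0]))  # -> [2, 3, 4]
```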

24

25

26

Sample Topics Induced from Groomed Data

27

Experimental Assessment

• Question: Are topics built from “sampled” corpora more coherent than topics induced from raw corpora?
• Intrusion detection:
– Find the 10 most probable words for topic Ti.
– Replace one of these 10 with a word chosen from the corpus with uniform probability.
– Ask human assessors to identify the “intruder” word.
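A minimal sketch of constructing one such intrusion item; the helper name and the toy vocabulary are assumptions, not the study's materials:

```python
import random

def make_intrusion_item(topic_words, vocabulary, rng=None):
    """Build one word-intrusion item from a topic's 10 most probable words."""
    rng = rng or random.Random(0)
    # Intruder drawn uniformly from corpus terms outside the topic's top words
    intruder = rng.choice([w for w in vocabulary if w not in topic_words])
    item = topic_words[:]
    item[rng.randrange(len(item))] = intruder   # replace one of the 10 words
    rng.shuffle(item)                           # hide the intruder's position
    return item, intruder

top_words = ["quilt", "textile", "cotton", "pattern", "fabric",
             "stitch", "applique", "museum", "maker", "block"]
corpus_vocab = top_words + ["river", "census", "map", "survey", "diary", "photograph"]
print(make_intrusion_item(top_words, corpus_vocab))
```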

28

Experimental Assessment

• For each topic Ti, have 20 assessors try to find the intruder (20 different intruders). Repeat for both the “sampled” and “raw” models.
– i.e. 20 × 2 × 100 = 4,000 assessments
• Asi is the percentage of workers who correctly found the intruder in the ith topic of the sampled model; Ari is the analogous figure for the raw model.
• A one-sided test of the alternative Asi > Ari (against the null of no difference) yields p < 0.001.
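A sketch of how such a one-sided paired comparison can be computed. The per-topic accuracies below are randomly generated stand-ins, not the reported data, and the alternative= argument assumes SciPy 1.6 or later:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative per-topic accuracies (fraction of 20 workers finding the intruder)
k = 100
raw = rng.uniform(0.3, 0.7, size=k)                            # Ari, raw model
sampled = np.clip(raw + rng.normal(0.08, 0.05, size=k), 0, 1)  # Asi, sampled model

# One-sided paired test of "sampled > raw" across the k topics
t_stat, p_value = stats.ttest_rel(sampled, raw, alternative="greater")
print(f"t = {t_stat:.2f}, one-sided p = {p_value:.2g}")
```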

29

Experimental Assessment

• For each topic Ti, have 20 workers subjectively assess the topic’s “coherence,” reporting on a 4-point Likert scale.

30

Current & Future Work

• Testing breadth of coverage
• Assessing the value of induced topics

• Topic information for document priors in the language modeling IR framework [next slide]

• Massive document expansion for improved language model estimation [under review]

31

Weak Topicality and Document Priors

32

Weak Topicality and Document Priors

33

Thank You

ASIST 2011, New Orleans, LA, October 10, 2011

Miles Efron, Peter Organisciak, Katrina Fenlon
Graduate School of Library & Information Science

University of Illinois, Urbana-Champaign