Meow Hagedorn

Transcript of Meow Hagedorn

Page 1: Meow Hagedorn

Clustering, Classification, and Metadata Enhancement Techniques on OAI Records

meow::06

David Newman

Bill Landis, ex officio

Kat Hagedorn

Clustering, Classification, and Metadata Enhancement Techniques

July 24, 2006

Page 2: Meow Hagedorn

Clustering, Classification, and Metadata Enhancement Techniques on OAI Records

I. Preprocessing and Topic Modeling

II. The “Browser”

III. Lessons Learned and Next Steps

Page 3: Meow Hagedorn

Goals

• Evaluate topical/subject-based metadata enhancement
• Experiment on testbed of multiple OAI repositories
• Discuss lessons learned and refine testing
• Propose products and services

Page 4: Meow Hagedorn

What We Did

Cluster: OAI records → preprocess (vocabulary) → topic model (cluster/learn) → topics

Preprocessing & Topic Modeling >

Page 5: Meow Hagedorn

What We Did

Cluster: OAI records → preprocess (vocabulary) → topic model (cluster/learn) → topics

Classify: OAI records → preprocess (vocabulary) → topic model (classify) → 1. topics in records, 2. records in topics
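
The two classification outputs named on this slide can be read as two views of the same data. The following is a minimal Python sketch, assuming each record already carries a mapping from topic ID to that topic's share of the record. The function names, record IDs, and shares are illustrative assumptions, not taken from the project.

```python
from collections import defaultdict


def topics_in_records(record_topics):
    """View 1: for each record, its topic IDs ordered by decreasing share."""
    return {
        record_id: sorted(shares, key=shares.get, reverse=True)
        for record_id, shares in record_topics.items()
    }


def records_in_topics(record_topics):
    """View 2: for each topic, the record IDs ordered by decreasing share."""
    by_topic = defaultdict(list)
    for record_id, shares in record_topics.items():
        for topic_id, share in shares.items():
            by_topic[topic_id].append((share, record_id))
    return {
        topic_id: [record_id for _, record_id in sorted(pairs, reverse=True)]
        for topic_id, pairs in by_topic.items()
    }


# Hypothetical shares for two made-up record IDs.
example = {
    "oai:example:1": {"t201": 0.7, "t381": 0.3},
    "oai:example:2": {"t381": 0.9, "t201": 0.1},
}
```

Calling `records_in_topics(example)["t381"]` lists the second record first, because that topic accounts for more of it.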

Page 6: Meow Hagedorn

What We Did

Cluster: OAI records → preprocess (vocabulary) → topic model (cluster/learn) → topics
(clustering is learning the topics)

Classify: OAI records → preprocess (vocabulary) → topic model (classify) → 1. topics in records, 2. records in topics
(classification is using the learned topics)
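
To make the split between learning and using topics concrete, here is a small Python sketch. The slides say only "topic model"; this sketch assumes an LDA-style model fitted by collapsed Gibbs sampling, and the hyperparameters `alpha` and `beta`, the function names, and the toy corpus are all illustrative assumptions. `learn_topics` corresponds to the cluster step and `classify` reuses the learned counts without changing them.

```python
import random


def sample_topic(rng, word, doc_counts, n_wt, n_t, num_words, alpha, beta):
    """Draw a topic for one token from the collapsed Gibbs conditional."""
    weights = [
        (n_wt[word][k] + beta) / (n_t[k] + num_words * beta) * (doc_counts[k] + alpha)
        for k in range(len(n_t))
    ]
    return rng.choices(range(len(n_t)), weights)[0]


def learn_topics(docs, num_words, num_topics, iters=50, alpha=0.1, beta=0.01, seed=0):
    """Cluster/learn: estimate word-topic counts from a sample of records."""
    rng = random.Random(seed)
    n_wt = [[0] * num_topics for _ in range(num_words)]  # word-by-topic counts
    n_dt = [[0] * num_topics for _ in docs]              # record-by-topic counts
    n_t = [0] * num_topics                               # tokens per topic
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for w, t in zip(doc, z[d]):
            n_wt[w][t] += 1
            n_dt[d][t] += 1
            n_t[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]  # remove the token's current assignment
                n_wt[w][t] -= 1
                n_dt[d][t] -= 1
                n_t[t] -= 1
                t = sample_topic(rng, w, n_dt[d], n_wt, n_t, num_words, alpha, beta)
                z[d][i] = t  # add it back under the newly sampled topic
                n_wt[w][t] += 1
                n_dt[d][t] += 1
                n_t[t] += 1
    return n_wt, n_t


def classify(doc, n_wt, n_t, num_words, iters=20, alpha=0.1, beta=0.01, seed=0):
    """Classify: infer topic shares for one record; learned counts stay fixed."""
    rng = random.Random(seed)
    num_topics = len(n_t)
    z = [rng.randrange(num_topics) for _ in doc]
    doc_counts = [0] * num_topics
    for t in z:
        doc_counts[t] += 1
    for _ in range(iters):
        for i, w in enumerate(doc):
            doc_counts[z[i]] -= 1
            z[i] = sample_topic(rng, w, doc_counts, n_wt, n_t, num_words, alpha, beta)
            doc_counts[z[i]] += 1
    total = len(doc) + num_topics * alpha
    return [(doc_counts[k] + alpha) / total for k in range(num_topics)]


# Toy corpus: records are lists of vocabulary indices (0..3).
toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 0, 1], [3, 2, 2]]
word_topic, topic_totals = learn_topics(toy_docs, num_words=4, num_topics=2)
shares = classify([0, 1, 1], word_topic, topic_totals, num_words=4)
```

Because `classify` never writes to `word_topic` or `topic_totals`, new records can be processed independently of each other, which is one reason classification is much cheaper than learning.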

Page 7: Meow Hagedorn

Repository Selection

• Mix of cultural heritage repositories?
– UMich, Library of Congress, CDL, State Lib of Victoria (Aust), …
– Average of 15 words per record (excl. stopwords)
– Topics often specific to collection (e.g., State Lib of Victoria)
– Experience with CDL’s American West project

• Mix of scientific/research repositories?
– CiteSeer, arXiv, PubMed, …
– <description> is a reasonably reliable 200-word abstract
– Average of 75 words per record
– Topics more likely to span repositories

• For purposes of evaluation, used (mostly) English-language repositories

Page 8: Meow Hagedorn

Selected Repositories*

Short Name  Description                                                  Records  Records used for clustering (learning)
arxiv       arXiv.org Eprint Archive                                     368,000  1 in 3
caltech     Caltech Electronic Theses and Dissertations                  3,000    -
cern        CERN Document Server                                         45,000   1 in 2
citeseer    CiteSeer Scientific Literature Digital Library               717,000  1 in 3
doaj        Directory of Open Access Journals Articles                   29,000   1 in 2
iop         Institute of Physics                                         212,000  1 in 3
loc         Library of Congress Digitized Historical Collections         239,000  -
nsdl        The National Science Digital Library                         33,000   1 in 2
osti        Office of Scientific and Technical Information               131,000  1 in 3
pangaea     Publishing Network for Geoscientific and Environmental Data  370,000  -
pubmed      PubMed Central                                               625,000  1 in 3
repec       Research Papers in Economics                                 141,000  1 in 3

*Repositories harvested by UMich/OAIster, June 7, 2006.

Page 9: Meow Hagedorn

Usage of Dublin Core Fields

• Decided to use words from <title>, <description>, <subject> for clustering

• Idiosyncrasies
– CiteSeer: repeats <author> and <title> in <subject>
– CiteSeer: puts citations to other IDs in <description>
– arXiv: puts e.g., “Comment: 12 pages PostScript” in <description>
– RePEc: no <subject>, repeats ID in <description>
– etc.

• Approach: Process all repositories identically, no special treatment

Page 10: Meow Hagedorn

Preprocessing Example

<ID=oai:CiteSeerPSU:44072>

<title>Reinforcement Learning: A Survey

<description>This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." …

<subject>Leslie Pack Kaelbling, Michael Littman, Andrew Moore. Reinforcement Learning: A Survey

preprocess (vocabulary) ↓

<ID=oai:CiteSeerPSU:44072>

reinforcement learning survey

survey field reinforcement learning computer science perspective written accessible researcher familiar machine learning historical basis field broad selection current summarized reinforcement learning faced agent learn behavior trial error interaction dynamic environment resemblance psychology differ considerably detail word reinforcement …

leslie pack kaelbling littman andrew moore reinforcement learning survey

Page 11: Meow Hagedorn

Stopwords and Stemming

• Standard: and, the, …
• Research related: research, paper, data, system, method, result, …
• Repository specific: cern, citeseer, repec, Smith, …
• All tokens starting with a digit: 1996, 401k, …
• Produced stopword list of 500 words
• Applied very simple stemming (cars → car)
• Note: replacing collocations improves interpretability of topics, but not quality (los angeles → los_angeles)
• Don’t need to find and exclude all stopwords because topic model will help find these (e.g. des, les, une, …) -- suppress after the fact
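
The preprocessing rules listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the stopword set is a tiny stand-in for the 500-word list, the tokenizer keeps only ASCII letters and digits, and the plural-stripping rule is a guess at what "very simple stemming" means, since the slides do not spell it out.

```python
import re

# A tiny stand-in for the 500-word stopword list; these entries are illustrative.
STOPWORDS = {
    "a", "and", "the", "of", "to", "in", "is", "it", "this",
    "research", "paper", "data", "system", "method", "result",
    "cern", "citeseer", "repec",
}


def simple_stem(token):
    """Very simple stemming: strip one plural 's' (cars -> car)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text):
    """Lowercase, tokenize, drop digit-initial tokens and stopwords, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = []
    for token in tokens:
        if token[0].isdigit() or token in STOPWORDS:
            continue
        stemmed = simple_stem(token)
        if stemmed not in STOPWORDS:
            kept.append(stemmed)
    return kept
```

For example, `preprocess("Reinforcement Learning: A Survey")` keeps the three content words in lowercase. The stemming rule here is cruder than whatever the project used: it would turn "basis" into "basi", which the slide example leaves intact.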

Page 12: Meow Hagedorn

Building Vocabulary

• Preprocessed (sampled) repositories, excluded stopwords
• Only kept words that occurred in more than 10 records
• Result: a final vocabulary with ~ 90,000 words
• Most frequent words: cell, high, energy, protein, function, algorithm, field, theory, physics, …
• Resulting discussion point: When do we need to re-create the vocabulary? (When classifying, new documents will be filtered through existing vocabulary)
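
A short Python sketch of the vocabulary rule and of filtering new documents through an existing vocabulary. The function names are assumptions made for illustration; the threshold of 10 records is the one stated on the slide.

```python
from collections import Counter


def build_vocabulary(records, min_records=10):
    """Keep words that occur in more than `min_records` distinct records."""
    record_frequency = Counter()
    for tokens in records:
        record_frequency.update(set(tokens))  # count each word once per record
    return {word for word, n in record_frequency.items() if n > min_records}


def filter_through_vocabulary(tokens, vocabulary):
    """Drop tokens that are not in the existing vocabulary (used when classifying)."""
    return [token for token in tokens if token in vocabulary]
```

The second function shows why the discussion point matters: a word that is frequent in newly harvested records but absent from the vocabulary is silently dropped until the vocabulary is rebuilt.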

Page 13: Meow Hagedorn

• Average of 75 words per record

• Distribution of record lengths is bimodal because we used records with abstracts and records without abstracts

• Topic model isn’t adversely affected by very short records

Page 14: Meow Hagedorn

Computation

• Clustering (Learning)
D = 750,000 records
W = 90,000 word vocabulary
L = 75 words per record
T = 500 topics
iter = 500 iterations
memory = 3DL + T(D+W) = 3 GByte
time = D L T Iter = 3 days (3 GHz Xeon)

• Classification
D = 3,000,000 records total
iter = 40 iterations
max memory = 2 GByte
max time = 5 hours (but repositories can run in parallel)

Decision point: How many topics?
Decision point: How many iterations?
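
The clustering estimates can be checked with a few lines of Python arithmetic. The sketch plugs the slide's values into the two formulas; the 4 bytes per stored entry is an assumption (the slide does not state a storage width), and the throughput figure simply divides the operation count by the reported 3 days.

```python
D = 750_000   # records used for clustering
W = 90_000    # vocabulary size
L = 75        # average words per record
T = 500       # topics
ITER = 500    # iterations

BYTES_PER_ENTRY = 4  # assumption: 32-bit integers; not stated on the slide

# memory = 3DL + T(D+W), counted in stored entries
entries = 3 * D * L + T * (D + W)
memory_gb = entries * BYTES_PER_ENTRY / 1e9

# time = D L T Iter, counted in per-token, per-topic updates
token_topic_updates = D * L * T * ITER
seconds = 3 * 24 * 60 * 60  # the 3 days reported on the slide
updates_per_second = token_topic_updates / seconds

print(f"{entries:,} entries, about {memory_gb:.1f} GB at {BYTES_PER_ENTRY} bytes each")
print(f"{token_topic_updates:.3e} updates, about {updates_per_second:.1e} per second")
```

Under that storage assumption the formula gives roughly 2.4 GB, the same order as the 3 GByte on the slide, and the running time corresponds to about 5.4 × 10^7 updates per second. Both expressions grow linearly in T, which is why the number of topics is a decision point.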

Page 15: Meow Hagedorn

Broad Topical Categories

• 500 topics too many to look at
• Need to organize topics under broad topical categories
– Cluster the clusters (automatic)
– Use pre-defined categories
  • Classify group of keywords (manual + automatic)
  • Create hierarchy by hand (manual)

Page 16: Meow Hagedorn

Broad Topical Categories

Cluster: OAI records → preprocess (vocabulary) → topic model (cluster/learn) → topics

Cluster the clusters: topics → topic model (cluster/learn) → broad topical categories

Page 17: Meow Hagedorn

Broad Topical Categories

Cluster: OAI records → preprocess (vocabulary) → topic model (cluster/learn) → topics

Cluster the clusters: topics → topic model (cluster/learn) → broad topical categories

Classify group of keywords: group of keywords → preprocess (vocabulary) → topic model (classify) → topics organized under broad topical categories

Page 18: Meow Hagedorn

Clustering, Classification, and Metadata Enhancement Techniques on OAI Records

I. Preprocessing and Topic Modeling

II. The “Browser”

III. Lessons Learned and Next Steps

Page 19: Meow Hagedorn

The “Browser”

• PHP/MySQL browser of 3 million OAI records*
• Preserving transparency for this audience
• Browser not meant for end users
• No search, no information architecture, etc.
• http://yarra.calit2.uci.edu/meow/

*Based on 750,000 sampled records from 9 repositories, 500 topics

The Browser >

Page 20: Meow Hagedorn

The “Browser”: http://yarra.calit2.uci.edu/meow/

Page 21: Meow Hagedorn

Selected Topics: Useful

• [ t201 ] learning machine training learn algorithm task examples reinforcement inductive learned learner supervised unsupervised
• [ t482 ] labor worker employment wage market labour job unemployment wages earning panel find evidence individual participation
• [ t381 ] algebraic geometry mathematic conjecture varieties projective variety theory cohomology moduli curves prove genus rational give math
• [ t097 ] dark matter universe astrophysic cosmological cosmic background density inflation spectrum power scale cmb halo cosmology gravitational
• [ t027 ] hiv virus human immunodeficiency type envelope infection viral cd4 infected gag replication reverse aid tat gp120
• [ t365 ] waste radioactive wastes tank nuclear facilities management hanford disposal fuel storage material processing facility site level

> show all 500 sub-topics (to see all 500 topics)

Page 22: Meow Hagedorn

Selected Topics: Less Useful

• [ t255 ] journal author chapter vol notes editor publication issue special bibliography reader references appendix literature submitted topic
• [ t328 ] paul mark thank andrew scott stephen alan steven miller george martin obituaries thesis daniel prof ian
• [ t384 ] supported part grant author foundation partially contract science national nsf support advanced ccr provided center agency
• [ t112 ] look people difficult thing need want fact reason help understand think say alway try easy bad
• [ t496 ] increase increased increases decrease increasing decreased decreases observed change decreasing significant caused decline
• [ t012 ] des les dan une est par sur pour qui nous sont aux ces analyse pay cette

But junk topics alleviate the need to exhaustively find stopwords; many useless words cluster as topics, which can be suppressed.

[ t012 ] is very useful for filtering out French records.
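
Suppressing junk topics after the fact, and using the French-word topic as a language filter, could look like the following Python sketch. The topic IDs come from the slide; the function names, the 0.5 threshold, and the per-record shares used in the usage note are assumptions for illustration only.

```python
def suppress_topics(topic_shares, junk_topics):
    """Hide junk topics when displaying a record's topics."""
    return {t: s for t, s in topic_shares.items() if t not in junk_topics}


def dominated_by(topic_shares, topic_id, min_share=0.5):
    """True if one topic accounts for at least `min_share` of the record."""
    return topic_shares.get(topic_id, 0.0) >= min_share


JUNK_TOPICS = {"t255", "t328", "t384", "t112", "t496", "t012"}  # from the slide
FRENCH_TOPIC = "t012"
```

With hypothetical shares such as `{"t012": 0.8, "t482": 0.2}`, `dominated_by(..., FRENCH_TOPIC)` flags the record as probably French, while `suppress_topics` removes the junk entries from any record's displayed topic list.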

Page 23: Meow Hagedorn

Broad Topical Categories (BTCs)

• By clustering the clusters
– worked well
– mathematics, global energy resources, …
– can choose desired number of broad topical categories (e.g., 25) and thresholding

• By classifying groups of keywords
– worked well too

• Then review and manually edit
– include or exclude any subtopic
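
The slides do not say which algorithm was used to cluster the clusters. As one possible illustration, the Python sketch below groups topics by greedy agglomerative merging on cosine similarity of their word-weight vectors until a chosen number of categories remains. The algorithm choice, the function names, and the toy vectors are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_topics(topic_vectors, num_categories):
    """Merge the two most similar groups until `num_categories` groups remain."""
    groups = [[i] for i in range(len(topic_vectors))]
    centroids = [list(v) for v in topic_vectors]
    while len(groups) > num_categories:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                similarity = cosine(centroids[a], centroids[b])
                if best is None or similarity > best[0]:
                    best = (similarity, a, b)
        _, a, b = best
        # The merged group's vector is the sum of its members' vectors.
        centroids[a] = [x + y for x, y in zip(centroids[a], centroids[b])]
        groups[a] += groups[b]
        del groups[b]
        del centroids[b]
    return groups


# Toy word-weight vectors for four topics over a four-word vocabulary.
toy_topics = [[5, 4, 0, 0], [4, 5, 0, 0], [0, 0, 3, 6], [0, 0, 6, 3]]
categories = cluster_topics(toy_topics, num_categories=2)
```

The pairwise search is cubic in the number of topics, which is acceptable for a few hundred topics but is a design shortcut rather than a recommendation. The manual review step described above would then move individual topics in or out of each group.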

Page 24: Meow Hagedorn

BTCs: Clustering the clusters

Page 25: Meow Hagedorn

BTCs: Classifying group of keywords

>>> Aerospace Engineering
stars (15) space (18) aeronautics (20) astronautics (20) rocket (12) shuttle (12) exploration (15) lander (3) planets (7) black holes (7) quasars (7) pulsars (7) observatories (10) air traffic (10) aircraft (15) aerospace (20) airplanes (10) airports (10) heliports (10) helicopters (10) aviation (18) FAA (7) airlines (12) flight (18) comets (10) meteorites (12) spacecraft (15) air force (7) pilots (7) jets (7) air travel (15) flying (18)

domain expert specifies list of relevant keywords and (importance)

Page 26: Meow Hagedorn

BTCs: Classifying group of keywords

>>> Aerospace Engineering
[t192] (69%) vehicle flight vehicles engine car road speed nasa aircraft air
[t352] (13%) star solar planet mass astrophysic binary dwarf orbital sun companion
[t191] (8%) space spaces hilbert subspace dimensional subspaces defined exploration linear point

>>> Dermatology
[t388] (83%) infection skin disease tract respiratory fever burgdorferi caused wound arthritis
[t157] (8%) cancer tumor p53 breast carcinoma survival human tumour malignant prostate
[t071] (7%) growth tuberculosis mycobacterium growing grow igf factor bcg avium

>>> Geology and Earth Sciences
[t121] (73%) geothermal rock seismic energy mountain drilling fluid survey spring yucca
[t268] (12%) sea atmospheric climate ice ocean atmosphere cloud global wind aerosol

>>> Molecular, Cellular and Developmental Biology
[t276] (31%) molecular biological sciences molecules biology molecule quantitative biochemistry basic
[t417] (15%) cell apoptosis cellular death cultured bcl lines hela transfected mediated
[t355] (12%) brain neuron neuronal cortex synaptic cortical rat nervous cerebral dopamine
[t418] (9%) genes genome gene repeat chromosome sequences dna genomic sequence region
[t319] (7%) mice development mouse drosophila expression transgenic cell embryonic embryos gene

>>> Transportation
[t192] (85%) vehicle flight vehicles engine car road speed nasa aircraft air

Slide annotations:
in review, would delete this topic from this BTC
just found 1 topic relevant to transportation
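
One way to feed a weighted keyword group to the classifier is to expand it into a pseudo-record in which each keyword is repeated according to its importance, then report the topics that receive a meaningful share. The slides do not describe the mechanism, so the repetition scheme, the function names, the 5% cutoff, and the "other" entry in the usage note are assumptions.

```python
def keywords_to_record(weighted_keywords, vocabulary):
    """Expand {keyword phrase: importance} into a token list for the classifier."""
    tokens = []
    for phrase, importance in weighted_keywords.items():
        for word in phrase.lower().split():
            if word in vocabulary:  # words outside the vocabulary are dropped
                tokens.extend([word] * importance)
    return tokens


def top_topics(topic_shares, min_share=0.05):
    """Topics with at least `min_share` of the pseudo-record, largest first."""
    ranked = sorted(topic_shares.items(), key=lambda item: item[1], reverse=True)
    return [(topic_id, share) for topic_id, share in ranked if share >= min_share]
```

For instance, `top_topics({"t192": 0.69, "t352": 0.13, "t191": 0.08, "other": 0.02})` returns the three topics shown under Aerospace Engineering, where the "other" entry is a made-up remainder. In practice the keywords would presumably go through the same preprocessing as the records before the vocabulary check.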

Page 27: Meow Hagedorn

Browse Records in a Topic

Slide annotations:
nice mix of repositories
can navigate back to multiple BTCs

Page 28: Meow Hagedorn

Browse Records in a Topic: From one repository

Slide annotation: display records just from Library of Congress

Page 29: Meow Hagedorn

Sample Record

Murphy's Law in algebraic geometry: Badly-behaved deformation spaces

> preprocessed text

murphy law algebraic geometry badly behaved deformation spaces

consider question bad deformation space object answer priori reason deformation space bad moduli spaces precisely singularity finite type smooth parameter hilbert scheme curves projective space moduli spaces smooth projective type surfaces higher dimensional varieties plane curves nodes cusp stable sheaves isolated threefold singularities object pathological fact nice curves smooth surfaces ample canonical bundle stable sheaves torsion free rank singularities normal cohen macaulay justifies mumford philosophy moduli spaces behaved object arbitrarily bad priori reason construct smooth curve projective space deformation space component singularity type reduced behavior subschemes similarly give surface f_p lift course hold holomorphic category difficult compute deformation spaces directly obstruction theories circumvent relating tractable deformation spaces smooth morphism essential starting point mnev universality theorem mathematic algebraic geometry mathematic complex variables

> top topics

[ t381 ] algebraic geometry mathematic conjecture varieties projective variety theory cohomology moduli curves prove genus rational give math
[ t191 ] space spaces

(topics for this record)

oai:arXiv.org:math/0411469 (link to actual OAI record)

Page 30: Meow Hagedorn

Repository-specific Browsers

• Library of Congress (http://yarra.calit2.uci.edu/oai/loc/)
• University of Michigan (http://yarra.calit2.uci.edu/oai/umich/)
• University of Washington (http://yarra.calit2.uci.edu/oai/uwash/)
• African Journals Online (http://yarra.calit2.uci.edu/oai/africa/)
• and many more…

Page 31: Meow Hagedorn

Clustering, Classification, and Metadata Enhancement Techniques on OAI Records

I. Preprocessing and Topic Modeling

II. The “Browser”

III. Lessons Learned and Next Steps

Page 32: Meow Hagedorn

Evaluation

• Topic modeling worked well
– Most topics were useful
– Drain on computer resources was reasonable
– Human effort was relatively small
– All repositories processed identically, no special treatment

• Strategy worked well
– Clustering, then
– Classification, and
– Broad Topical Categories creation

Lessons Learned & Next Steps >

Page 33: Meow Hagedorn

Further Evaluation

• Current processing only for
– English-language repositories
– Science/research based repositories

• Need to test cultural heritage repositories and foreign-language records
– Less consistent descriptive language and length
– “On-the-horse” problem more prevalent
– Greater need to individually process repositories

• Also need usability testing to evaluate further
– Depends on criteria -- who are users?
  • Librarians?
  • End-users?
– Depends on products and services desired by users

Page 34: Meow Hagedorn

Discussion Point: When to Re-cluster?

• Need to re-cluster
– when collection changes significantly
– if there is a “hole” in topics
– but NOT just because you have another repository

• If you re-cluster
– all topics will be different
– have to discard hand-labeling
– Broad Topical Categories might be different

• However, classification is
– “cheap” and easy
– e.g., for OAIster, could re-classify every harvest…until spring clean

[Timeline figure: steps labeled “cluster” and “classify”, with “classify” occurring more often]

Page 35: Meow Hagedorn

Products and Services

• Depending on users…
• What kind of service is useful?
• What should interface to topics look/act like?
• What kind of use should we envision?
• We have some ideas…

Page 36: Meow Hagedorn

Archive of Topics

• Are the topics we created useful to anyone else?
• Scenario: librarian uses topics/classifier for local resources
• To use locally you need:
– the preprocessor (i.e. the preprocessing rules)
– the vocabulary (file of 90,000 words)
– the topic model classifier

Page 37: Meow Hagedorn

Subject Search/Browse for OAIster

• Integrate topics into OAIster
– add to records so can perform canned topic search
– add to interface so can browse BTCs to records

• Additionally, can allow users to find records similar to those retrieved
– e.g., retrieved records on cosmology and can find similar records on astrophysics, relativity, …

• How to do this?
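
The slide leaves "How to do this?" open. One possible approach, sketched here in Python, is to compare records by the cosine similarity of their topic-share vectors and return the closest ones. The similarity measure, the function names, and the toy vectors are assumptions, not something the slides propose.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(query_id, record_topics, top_n=3):
    """IDs of the records whose topic shares are closest to the query record's."""
    query = record_topics[query_id]
    scored = [
        (cosine(query, shares), record_id)
        for record_id, shares in record_topics.items()
        if record_id != query_id
    ]
    scored.sort(reverse=True)
    return [record_id for _, record_id in scored[:top_n]]


# Toy topic-share vectors over three topics; the record names are placeholders.
toy_records = {
    "cosmology": [0.9, 0.1, 0.0],
    "astrophysics": [0.8, 0.2, 0.0],
    "labor": [0.0, 0.1, 0.9],
}
```

Here `most_similar("cosmology", toy_records, top_n=1)` returns the astrophysics record, since the two share most of their topic mass. Comparing against every record is linear in collection size per query, so a collection of millions of records would need an index or precomputed neighbours.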

Page 38: Meow Hagedorn

How To Reach Us

• David Newman: University of California, Irvine

<[email protected]>

• Kat Hagedorn: University of Michigan

<[email protected]>

• Bill Landis: California Digital Library

<[email protected]>