1 Converting Existing Corpus to an OAI Compliant Repository J. Tang, K. Maly, and M. Zubair Department of Computer Science Old Dominion University February, 2005


Converting Existing Corpus to an OAI Compliant Repository

J. Tang, K. Maly, and M. Zubair

Department of Computer Science, Old Dominion University

February, 2005


Contents

Introduction

Background

Overall Architecture

Metadata Extraction Approach

Experiments

Screenshots


[Figure: Documents -> Scan & OCR -> Online Documents -> ?]


Introduction

Why do we need to go further?

- The lack of metadata for these resources hampers their discovery and dissemination over the Web.
- The lack of metadata also hampers interoperability between these resources and those of other organizations.

Benefits of using metadata

- Using metadata helps resource discovery. Using metadata in a company intranet may save about $8,200 per employee by reducing the time employees spend searching for, verifying, and organizing files (estimate by Mike Doane at the DCMI 2003 workshop).
- Using metadata helps make collections interoperable through OAI-PMH.


Introduction (cont.)

How can this metadata be obtained?

- Creating metadata manually for a large collection is expensive. It would take about 60 employee-years to create metadata for 1 million documents (estimate by Lou Rosenfeld at the DCMI 2003 workshop).
- The enormous cost of manual metadata creation creates great demand for automated metadata extraction tools.
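To put the 60 employee-year estimate in per-document terms, the following rough calculation may help. The figure of 2,000 working hours per employee-year is an assumption made here for illustration; it does not come from the slides.

```python
# Assumption (not from the slides): one employee-year is about 2,000 working hours.
HOURS_PER_EMPLOYEE_YEAR = 2000


def minutes_per_document(employee_years: float, documents: int) -> float:
    """Average manual effort per document, in minutes."""
    total_minutes = employee_years * HOURS_PER_EMPLOYEE_YEAR * 60
    return total_minutes / documents


# 60 employee-years spread over 1 million documents.
print(minutes_per_document(60, 1_000_000))  # about 7.2 minutes per document
```

Under that assumption, the estimate corresponds to roughly seven minutes of manual work per document.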


Introduction (cont.)

Our main objective is to automate the task of building an interoperable digital library, starting from a legacy collection of printed documents or of documents scanned into TIFF or PDF format. Specifically, we aim:

- To develop a flexible and adaptable approach for extracting metadata from physical collections, with a focus on DTIC (Defense Technical Information Center) collections.
- To develop efficient ways of integrating OCR and extraction processes with an interoperable digital library.
- To integrate the techniques and tools developed for metadata extraction into a test bed that converts the DTIC legacy collection into an interoperable digital library framework.
- To evaluate the effectiveness of the automation process.


Background

- OAI and Digital Library
- Metadata Extraction
  - Rule-based approach
  - Machine-learning approach
    - Hidden Markov Model
    - Support Vector Machine


Digital Library and OAI

Digital Library (DL)

- A DL is a network-accessible and searchable collection of digital information.
- A DL provides a way to store, organize, preserve, and share information.

Interoperability problem

- DLs are usually created separately, using different technologies and different metadata schemas.


Open Archives Initiative (OAI)

- The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a framework that provides interoperability among heterogeneous DLs.
- It is based on metadata harvesting: a service provider can harvest metadata from a data provider.
  - A data provider accepts OAI-PMH requests and provides metadata over the network.
  - A service provider issues OAI-PMH requests to get metadata and builds services on it.
- Each data provider can support its own metadata formats, but it must support at least the Dublin Core (DC) metadata set.
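As an illustration of what a service provider sends, here is a minimal sketch that builds an OAI-PMH request URL. The base URL `http://example.org/oai` and the helper name `build_oai_request` are hypothetical; `verb` and `metadataPrefix` are argument names defined by OAI-PMH, and `oai_dc` is the prefix for the mandatory Dublin Core format.

```python
from urllib.parse import urlencode


def build_oai_request(base_url: str, verb: str, **arguments: str) -> str:
    """Encode an OAI-PMH request as an HTTP GET URL."""
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# Hypothetical data provider; a real harvester would use the provider's base URL.
url = build_oai_request(
    "http://example.org/oai", "ListRecords", metadataPrefix="oai_dc"
)
print(url)
```

A harvester would then fetch this URL over HTTP and parse the XML response returned by the data provider.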


Dublin Core Metadata Set

- It supports 15 elements: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights
- All fields are optional
- http://dublincore.org/documents/dces/
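A minimal sketch of how a record could be encoded in the `oai_dc` XML format follows. The function name `to_oai_dc` and the dictionary layout are assumptions for illustration; the two namespace URIs are those used by OAI-PMH and Dublin Core. The sample values are taken from the header example used elsewhere in these slides.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def to_oai_dc(record):
    """Serialize {element name: [values]} as an oai_dc XML fragment.

    Every Dublin Core element is optional and may repeat.
    """
    ET.register_namespace("oai_dc", OAI_DC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for element, values in record.items():
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


print(to_oai_dc({
    "title": ["Challenges in Building Federation Services over Harvested Metadata"],
    "creator": ["Kurt Maly", "Mohammad Zubair"],
    "date": ["2003"],
}))
```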


Metadata Extraction: Rule-based

Basic idea

- Use a set of rules, based on human observation, to define how to extract metadata. For example, a rule may be "The first line is the title".

Advantages

- Straightforward to implement
- No need for training

Disadvantages

- Lack of adaptability (rules work only for similar documents)
- Difficult to work with a large number of features
- Difficult to tune the system when errors occur, because rules are usually fixed
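The example rule can be sketched in a few lines, which also shows why a fixed rule adapts poorly. The function name and the second sample page (with an invented report number) are hypothetical.

```python
from typing import Optional


def extract_title(page_text: str) -> Optional[str]:
    """Rule: the first non-empty line of the page is the title."""
    for line in page_text.splitlines():
        if line.strip():
            return line.strip()
    return None


# Layout the rule was written for: the title comes first.
typical = (
    "Converting Existing Corpus to an OAI Compliant Repository\n"
    "J. Tang, K. Maly, and M. Zubair\n"
)
# Hypothetical layout where a report number precedes the title.
other_layout = (
    "REPORT NO. 123\n"
    "Converting Existing Corpus to an OAI Compliant Repository\n"
)

print(extract_title(typical))       # the real title
print(extract_title(other_layout))  # the report number, mislabeled as title
```

The second call returns the report number, which illustrates the adaptability problem listed above: the rule only works for documents that share the layout it was written for.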


Metadata Extraction: Rule-based

Related work

- Automated labeling algorithms for biomedical document images (Kim J, 2003)
  - Extracts metadata from the first pages of biomedical journals
  - Accuracy: title 100%, author 95.64%, abstract 95.85%, affiliation 63.13% (76 articles were used for testing)
- Document Structure Analysis Based on Layout and Textual Features (Stefan Klink, 2000)
  - Extracts metadata from the U-Wash document corpus of 979 journal pages
  - Good results for some elements (page number: 90% recall and 98% precision) but poor results for others (abstract: 35% recall and 90% precision; biography: 80% recall and 35% precision)


Metadata Extraction: Machine-Learning Approach

Basic idea

- Learn the relationship between input and output from samples, and make predictions for new data.
- This approach has good adaptability, but it has to be trained on samples.


Hidden Markov Model - general

Overview

- The HMM was introduced by Baum in the late 1960s.
- HMMs are a dominant technology for speech recognition.
- They are widely used in other areas such as DNA segmentation and gene recognition.
- HMMs have recently been used in information extraction:
  - Address parsing (Borkar 2001, etc.)
  - Name recognition (Klein 2003, etc.)
  - Reference parsing (Borkar 2001)
  - Metadata extraction (Seymore 1999, Freitag 2000, etc.)


Hidden Markov Model - general

"Hidden Markov Modeling is a probabilistic technique for the study of observed items arranged in discrete-time series" -- Alan B. Poritz, Hidden Markov Models: A Guided Tour, ICASSP 1988

An HMM is a probabilistic finite state automaton:

- It transits from state to state
- It emits a symbol when it visits each state
- The states are hidden

[Figure: state diagram with states A, B, C, D]


HMM - Metadata Extraction

- A document is a sequence of words produced by some hidden states (title, author, etc.).
- The parameters of the HMM are learned from samples in advance.
- Metadata extraction is the task of finding the most probable sequence of states (title, author, etc.) for a given sequence of words.


Example header: Challenges in Building Federation Services over Harvested Metadata, Kurt Maly, Mohammad Zubair, 2003

Observed words:  Challenges  in     Building  Federation  …  Kurt    Maly    …  2003
Hidden states:   title       title  title     title       …  author  author  …  date
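Finding the most probable state sequence is commonly done with the Viterbi algorithm. The sketch below applies it to a shortened version of the header above. All probabilities are invented for illustration (in practice they would be learned from tagged samples), and the slides do not state which decoding procedure the authors used.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable hidden-state sequence for the observations."""
    def log(p):
        # Unseen words get a small floor probability instead of zero.
        return math.log(max(p, floor))

    # Each column maps state -> (best log-probability so far, previous state).
    columns = [{
        s: (log(start_p[s]) + log(emit_p[s].get(observations[0], 0.0)), None)
        for s in states
    }]
    for word in observations[1:]:
        column = {}
        for s in states:
            prev = max(states, key=lambda p: columns[-1][p][0] + log(trans_p[p][s]))
            score = columns[-1][prev][0] + log(trans_p[prev][s])
            column[s] = (score + log(emit_p[s].get(word, 0.0)), prev)
        columns.append(column)

    # Backtrack from the best final state.
    state = max(states, key=lambda s: columns[-1][s][0])
    path = [state]
    for column in reversed(columns[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Hypothetical toy model: every number below is an assumption.
FIELDS = ["title", "author", "date"]
START_P = {"title": 0.8, "author": 0.1, "date": 0.1}
TRANS_P = {
    "title": {"title": 0.8, "author": 0.19, "date": 0.01},
    "author": {"title": 0.01, "author": 0.7, "date": 0.29},
    "date": {"title": 0.01, "author": 0.01, "date": 0.98},
}
EMIT_P = {
    "title": {"challenges": 0.25, "in": 0.25, "building": 0.25, "federation": 0.25},
    "author": {"kurt": 0.5, "maly": 0.5},
    "date": {"2003": 1.0},
}

words = "challenges in building federation kurt maly 2003".split()
labels = viterbi(words, FIELDS, START_P, TRANS_P, EMIT_P)
print(list(zip(words, labels)))
```

With this toy model the first four words are labeled title, the next two author, and the last one date, matching the alignment shown above.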


HMM - Metadata Extraction

Related work

- K. Seymore, A. McCallum, and R. Rosenfeld. Learning hidden Markov model structure for information extraction.
- Result: an overall accuracy of 90.1% was reported.


Support Vector Machine - general

Overview

- It was introduced by Vapnik in the late 1970s.
- It is now receiving increasing attention.
- It is widely used in pattern recognition areas such as face detection, isolated handwritten digit recognition, and gene classification.
- A list of SVM applications is available at http://www.clopinet.com/isabelle/Projects/SVM/applist.html
- It is also used in text analysis (Joachims 1998, etc.) and metadata extraction (Han 2003).


Support Vector Machine - general

Binary classifier (classifies data into two classes)

- It represents data with pre-defined features.
- It finds, from samples, the plane with the largest margin that separates the two classes.
- It classifies data into two classes based on which side of the plane they are located.

[Figure: SVM example that classifies a line as title or not title using two features, font size and line number (1, 2, 3, etc.). Each dot represents a line (red: title; blue: not title). The separating hyperplane and its margin are marked.]
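The classification step of a trained linear SVM reduces to checking the sign of a decision function. The sketch below shows only that step for the two features in the figure; the weights and bias are invented values standing in for a trained model, and the training procedure that finds the maximum-margin plane is not shown.

```python
import math

# Hypothetical parameters of an already trained linear SVM:
# larger fonts push toward "title", later lines push away from it.
WEIGHTS = (1.0, -2.0)   # (font size, line number)
BIAS = -10.0


def decision_value(font_size: float, line_number: int) -> float:
    """Signed score w.x + b; its sign tells which side of the hyperplane."""
    return WEIGHTS[0] * font_size + WEIGHTS[1] * line_number + BIAS


def classify_line(font_size: float, line_number: int) -> str:
    return "title" if decision_value(font_size, line_number) > 0 else "not title"


def distance_to_hyperplane(font_size: float, line_number: int) -> float:
    """Geometric distance from the point to the separating hyperplane."""
    return abs(decision_value(font_size, line_number)) / math.hypot(*WEIGHTS)


print(classify_line(18, 1))    # large font on the first line
print(classify_line(10, 12))   # body-size font further down the page
```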


Multi-Class SVMs

Combining binary SVMs into a multi-class classifier

One-vs-rest

- Classes: in this class or not in this class
- Positive training samples: data in this class
- Negative training samples: the rest
- K binary SVMs (K is the number of classes)

One-vs-one

- Classes: in class one or in class two
- Positive training samples: data in this class
- Negative training samples: data in the other class
- K(K-1)/2 binary SVMs
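The two schemes differ in how the binary training tasks are enumerated, which the following sketch makes concrete. The function names are hypothetical, and the four class names are borrowed from the metadata elements used in the experiments.

```python
from itertools import combinations


def one_vs_rest_tasks(classes):
    """One binary task per class: that class against all the others."""
    return [(c, [other for other in classes if other != c]) for c in classes]


def one_vs_one_tasks(classes):
    """One binary task per unordered pair of classes."""
    return list(combinations(classes, 2))


classes = ["title", "author", "affiliation", "date"]
print(len(one_vs_rest_tasks(classes)))  # K = 4 classifiers
print(len(one_vs_one_tasks(classes)))   # K(K-1)/2 = 6 classifiers
```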


SVM - Metadata Extraction

Basic idea

- Classes correspond to metadata elements.
- Extracting metadata from a document means classifying each line (or block) into the appropriate class.
- For example, to extract the title from a document, classify each line to see whether it is part of the title or not.

Related work

- Automatic Document Metadata Extraction Using Support Vector Machine (H. Han, 2003)
- An overall accuracy of 92.9% was reported.


System Architecture

[Figure: architecture diagram with the components Documents, OCR/Converter, Metadata Extractor, Metadata (accessed via JDBC), OAI Layer (request/response), Search Engine, Cache, and User Interface (query/results).]


System Architecture (cont.)

Main components:

- Scan and OCR: commercial OCR software is used to scan the documents.
- Metadata Extractor: extracts metadata using rules and machine-learning techniques. The extracted metadata are stored in a local database. To support Dublin Core, it may be necessary to map the extracted metadata to the Dublin Core format.
- OAI layer: makes the digital collection interoperable. The OAI layer accepts all OAI requests, gets the information from the database, and encodes the metadata in XML as responses.
- Search Engine


Metadata Extraction (Cont.)

[Figure: metadata extraction data flow. Labels: Doc, OCR, Doc after OCR, Template, Rule-based module, Interactive Tool, Tagged Text, SVM Classifier, HMM, Models, Merger, Metadata.]


Metadata Extraction (cont.)

Rule-based module

- Classify documents into classes based on similarity
- For each document class, create a template, or a set of rules
- Decouple rules from code
  - A template is kept in a separate file
- Benefits
  - Easy to extend: for a new document class, just create a template
  - Rules are simpler
  - Rules can be refined easily

[Figure: documents Doc1, Doc2, and Doc3 are paired with templates (template1, template2, template2) and passed to metadata extraction, which outputs metadata.]
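Decoupling rules from code can be sketched by treating each template as plain data that a generic extractor interprets. The rule format below (a `line` index or a `regex` per field), the class names, and the sample document are all assumptions for illustration; the slides do not describe the actual template syntax. In practice each template would be loaded from its own file rather than defined inline.

```python
import re

# Hypothetical templates, one per document class.
TEMPLATES = {
    "title_first": {
        "title": {"line": 0},
        "creator": {"line": 1},
    },
    "report_number_first": {
        "identifier": {"line": 0},
        "title": {"line": 1},
        "date": {"regex": r"\b(?:19|20)\d{2}\b"},
    },
}


def apply_template(lines, template):
    """Extract metadata from the lines of a page using one template."""
    metadata = {}
    for field, rule in template.items():
        if "line" in rule and rule["line"] < len(lines):
            metadata[field] = lines[rule["line"]].strip()
        elif "regex" in rule:
            # Take the first match found anywhere on the page.
            for line in lines:
                match = re.search(rule["regex"], line)
                if match:
                    metadata[field] = match.group(0)
                    break
    return metadata


# Invented sample page for the second document class.
page = ["TR-0001", "A Study of Widgets", "J. Doe", "February 2005"]
print(apply_template(page, TEMPLATES["report_number_first"]))
```

Supporting a new document class then means adding one more template entry, with no change to `apply_template`.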


Metadata Extraction (cont.)

Machine-learning module: SVM with HMM

- An SVM is good at working with a large number of features, but it is not good at capturing correlated features, for example that a section before an author section is most likely a title section.
- An HMM is good at working with events in a sequence, but it is expensive for it to handle a large number of features.
- Integration
  - The SVM works with a large number of features to produce probabilistic results (title 54%, author 30%, abstract 16%).
  - The HMM combines the results from the SVM with the probabilities of transiting from one metadata element to another to produce the final results.
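One way to picture the integration is to feed the per-line SVM probabilities into a sequence decoder in place of HMM emission probabilities. The sketch below does this under stated assumptions: the transition and start probabilities are invented, only the middle line's SVM output (54%, 30%, 16%) comes from the slide, and using the SVM output directly as an emission score is a simplification rather than the authors' documented method.

```python
def decode_lines(svm_probs, trans_p, start_p):
    """Pick the label sequence maximizing SVM scores times transition probabilities."""
    states = list(start_p)
    # score[s]: probability of the best sequence ending in s; paths[s]: that sequence.
    score = {s: start_p[s] * svm_probs[0][s] for s in states}
    paths = {s: [s] for s in states}
    for probs in svm_probs[1:]:
        new_score, new_paths = {}, {}
        for s in states:
            prev = max(states, key=lambda p: score[p] * trans_p[p][s])
            new_score[s] = score[prev] * trans_p[prev][s] * probs[s]
            new_paths[s] = paths[prev] + [s]
        score, paths = new_score, new_paths
    return paths[max(states, key=lambda s: score[s])]


# Hypothetical model parameters.
START = {"title": 0.8, "author": 0.1, "abstract": 0.1}
TRANS = {
    "title": {"title": 0.3, "author": 0.6, "abstract": 0.1},
    "author": {"title": 0.05, "author": 0.45, "abstract": 0.5},
    "abstract": {"title": 0.05, "author": 0.05, "abstract": 0.9},
}
# SVM output per line; the second line uses the percentages from the slide.
SVM_OUTPUT = [
    {"title": 0.90, "author": 0.05, "abstract": 0.05},
    {"title": 0.54, "author": 0.30, "abstract": 0.16},
    {"title": 0.10, "author": 0.10, "abstract": 0.80},
]

svm_only = [max(p, key=p.get) for p in SVM_OUTPUT]
combined = decode_lines(SVM_OUTPUT, TRANS, START)
print(svm_only)   # ['title', 'title', 'abstract']
print(combined)   # ['title', 'author', 'abstract']
```

In this toy case the SVM alone labels the second line as title, while the transition probabilities tip the combined result to author, because a title line is more likely to be followed by an author line than by another title line.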


Metadata Extraction Approach (Cont.)

Integrating the rule-based approach with the machine-learning approach

- The machine-learning approach is integrated with our rule-based approach to overcome two drawbacks of a rule-based system:
  - Lack of auto-correction ability
  - Lack of a statistical foundation
- The results from the two modules are integrated directly


Experiments

- Performance measures
- SVM experiments with different data sets
- Pure rule-based experiment


Performance Measure

For an individual metadata element:

- Precision = TT / (TT + FT)
- Recall = TT / (TT + TF)
- Accuracy = (TT + FF) / (TT + TF + FT + FF)

Overall accuracy is the number of items classified correctly divided by the total number of items.

                          Classified
                          In class   Not in class
Original  In class        TT         TF
          Not in class    FT         FF
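The three measures translate directly into code. The sketch below uses the slide's notation, where the first letter refers to the original label and the second to the classifier's output; the counts in the example are invented and do not come from the experiments.

```python
def precision(tt: int, ft: int) -> float:
    """Share of items classified as in-class that really are in the class."""
    return tt / (tt + ft)


def recall(tt: int, tf: int) -> float:
    """Share of items really in the class that were classified as in-class."""
    return tt / (tt + tf)


def accuracy(tt: int, tf: int, ft: int, ff: int) -> float:
    """Share of all items that were classified correctly."""
    return (tt + ff) / (tt + tf + ft + ff)


# Hypothetical counts: 5 title lines found, 1 missed, no false alarms,
# and 14 non-title lines correctly rejected.
tt, tf, ft, ff = 5, 1, 0, 14
print(f"precision={precision(tt, ft):.2%}")
print(f"recall={recall(tt, tf):.2%}")
print(f"accuracy={accuracy(tt, tf, ft, ff):.2%}")
```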


SVM Experiments with different data sets

Objective

- Evaluate the performance of SVM on different data sets to see how well SVM works for metadata extraction.

Data sets

- Data Set 1: Seymore935
  - Downloaded from http://www-2.cs.cmu.edu/~kseymore/ie.html
  - 935 manually tagged document headers
  - The first 500 are used for training and the rest for testing
- Data Set 2: DTIC100
  - 100 PDF files selected from the DTIC website, based on the Z39.18 standard
  - The first pages were processed with OCR and converted to text format
  - The 100 document headers were tagged manually
  - The first 75 are used for training and the rest for testing


SVM Experiments with different data sets

- Data Set 3: DTIC33
  - A subset of DTIC100
  - 33 tagged document headers with identical layout
  - The first 24 are used for training and the rest for testing

[Figure: the data sets DTIC33, Seymore935, and DTIC100 compared by heterogeneity; the axis is labeled "More heterogeneous".]


SVM Experiments with different data sets

Overall accuracy of title, author, affiliation and date


Pure rule-based experiment

Objective

- Evaluate the performance of our rule-based approach, which defines a template for each class.

Experiment

- Uses data set DTIC100: 100 XML files with font size and bold information
- The data set is divided into 7 classes according to layout information
- For each class, a template is developed after checking the first one or two documents in the class. The template is then applied to the remaining documents to obtain performance data


Pure rule-based experiment

Template   Documents  Class Name   Precision  Recall
Afrl       5          Identifier   100%       100%
                      Type         100%       100%
                      Date         100%       100%
                      Title        100%       100%
                      Creator      100%       100%
                      Contributor  100%       100%
                      Publisher    100%       100%
Arl        5          Identifier   100%       100%
                      Date         100%       100%
                      Title        100%       83.33%
                      Creator      75.00%     100%
Edgewood   4          Identifier   100%       100%
                      Date         100%       100%
                      Title        100%       83.33%
                      Creator      85.71%     66.67%


Pure rule-based experiment

Template   Documents  Class Name   Precision  Recall
Nps        15         Creator      100%       93.33%
                      Date         100%       96.67%
                      Title        100%       86.67%
Usnce      5          Creator      100%       90.00%
                      Date         100%       100%
                      Title        100%       100%
Afit       6          Title        100%       100%
                      Creator      100%       100%
                      Contributor  100%       100%
                      Identifier   100%       100%
                      Right        100%       100%
Text       33         Title        100%       100%
                      Creator      100%       100%
                      Contributor  100%       100%
                      Date         100%       100%
                      Type         100%       100%


Screenshots – OAI


Screenshots – Search Engine


Thanks


DTIC Samples


What does it mean to make an existing digital library OAI-enabled?

- Exposing metadata to OAI service providers: DC and parallel metadata sets
- ONLY METADATA

[Figure labels: Digital Library, Storage, OAI Layer.]


OAI Request and OAI Response

- An OAI request for metadata is embedded in HTTP.
- The OAI response to an OAI request is encoded in XML.
- The XML Schema specification for OAI responses is provided in the OAI-PMH document.

RCDL 2003, St. Petersburg


OAI Mechanics

- The request is encoded in HTTP
- The response is encoded in XML
- XML Schemas for the responses are defined in the OAI-PMH document

Courtesy: Michael Nelson


Hidden Markov Model -general

A simple example: tossing coins

- Your friend is in a room tossing three coins (three states).
- You are outside the room and cannot see which coin he tossed (the states are hidden).
- You are shown the tossing result, a sequence of heads and tails, for example HTTHHHTT… (observation symbols).

The tossing result is affected by

- The probability of producing heads for each coin
- The transition probabilities from coin to coin
- Which coin is tossed first


Hidden Markov Model -general

A Hidden Markov Model consists of

- A set of hidden states (e.g. coin1, coin2, coin3)
- A set of observation symbols (e.g. H and T)
- Transition probabilities: the probabilities of moving from one state to another
- Emission probabilities: the probability of emitting each symbol in each state
- Initial probabilities: the probability of each state being chosen as the first state


Hidden Markov Model -general

Uses associated with HMMs

- Evaluation: given a number of HMMs describing different systems and a sequence of observations, determine which HMM most probably generated the sequence.
- Decoding: find the most probable sequence of hidden states given some observations.
- Learning: generate an HMM from a sequence of observations.

Information in this slide comes from http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/main.html
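The evaluation use can be sketched with the forward algorithm applied to the three-coin example. All probabilities below are invented for illustration. To compare several candidate HMMs, one would compute this probability under each model and pick the largest.

```python
# Hypothetical three-coin model; every number is an assumption.
COINS = ["coin1", "coin2", "coin3"]
INITIAL = {"coin1": 0.5, "coin2": 0.3, "coin3": 0.2}
TRANSITION = {
    "coin1": {"coin1": 0.6, "coin2": 0.3, "coin3": 0.1},
    "coin2": {"coin1": 0.2, "coin2": 0.5, "coin3": 0.3},
    "coin3": {"coin1": 0.3, "coin2": 0.3, "coin3": 0.4},
}
EMISSION = {
    "coin1": {"H": 0.5, "T": 0.5},
    "coin2": {"H": 0.9, "T": 0.1},
    "coin3": {"H": 0.2, "T": 0.8},
}


def sequence_probability(observations: str) -> float:
    """Forward algorithm: probability that the model produces the observations."""
    # alpha[s]: probability of the observations so far, ending in state s.
    alpha = {s: INITIAL[s] * EMISSION[s][observations[0]] for s in COINS}
    for symbol in observations[1:]:
        alpha = {
            s: EMISSION[s][symbol] * sum(alpha[p] * TRANSITION[p][s] for p in COINS)
            for s in COINS
        }
    return sum(alpha.values())


print(sequence_probability("HTTHHHTT"))
```

A useful sanity check on such a model is that the probabilities of all sequences of a given length sum to 1.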


Support Vector Machine - general

- Many decision boundaries can separate these two classes
- Which one should we choose?

[Figure: scatter plot of two groups of points, Class 1 and Class 2.]

Courtesy: Martin Law


Support Vector Machine - general

Basic idea

- Choose the boundary that separates the two classes with the largest margin

[Figure: Class 1 and Class 2 separated by a hyperplane, with the margin and a support vector marked.]


SVM Experiments with different data sets

[Figure: bar chart of precision (0% to 100%) for Title, Author, Affiliation, and Date on the dtic33, dtic100, and seymore935 data sets.]


SVM Experiments with different data sets

[Figure: bar chart of recall (0% to 100%) for Title, Author, Affiliation, and Date on the dtic33, dtic100, and seymore935 data sets.]


SVM Experiments with different data sets

[Figure: bar chart of accuracy (50% to 100%) for Title, Author, Affiliation, Date, and overall on the dtic33, dtic100, and seymore935 data sets.]


SVM with different feature sets

Overall accuracy by feature set (values in the order shown on the chart): text 84.94%, text+font 83.68%, text+font+bold 85.36%

[Figure: bar chart of overall accuracy (0% to 100%) by feature set.]


SVM with different feature sets

[Figure: bar chart of DTIC100 recall (0% to 100%) for Title, Creator, Affiliation, and Date with the feature sets text, text+font, and text+font+bold.]


SVM with different feature sets

[Figure: bar chart of DTIC100 precision (0% to 100%) for Title, Creator, Affiliation, and Date with the feature sets text, text+font, and text+font+bold.]


SVM with different feature sets

[Figure: bar chart of DTIC33 recall (0% to 100%) for Title, Creator, Affiliation, and Date with the feature sets text, text+font, and text+font+bold.]


HMM experiment result

- Data set: Seymore935
- One state per field (tag)
- The first 500 headers are used for training and the rest for testing
- Experimental result: overall accuracy = 93.0%