U of R eXtensible Catalog Team MetaCat. Problem Domain.

  • date post

    21-Dec-2015

Page 1: U of R eXtensible Catalog Team MetaCat. Problem Domain.

U of R eXtensible Catalog

Team MetaCat

Page 2: U of R eXtensible Catalog Team MetaCat. Problem Domain.

Problem Domain

Page 3: U of R eXtensible Catalog Team MetaCat. Problem Domain.

A Modern Library

• Card catalogs are stored on a computer

• Card catalogs store metadata about books
    • Subject
    • Author(s)

• Searching for a book is done via an OPAC (Online Public Access Catalog)
    • Example: http://albert.rit.edu/

Page 4: U of R eXtensible Catalog Team MetaCat. Problem Domain.

Card Catalog Metadata

• Two types of records
    • A bibliographic record represents a book, and is linked to multiple authority records.
    • An authority record represents a single author or subject.

• Metadata has been hand-typed by librarians across the country
    • MARC: MAchine Readable Cataloging (XML), specifies both bibliographic and authority record formats
    • Dublin Core: also an XML format, but only for bibliographic records

Page 5: U of R eXtensible Catalog Team MetaCat. Problem Domain.

Metadata Issues

• Since metadata has been hand-typed, it may be inconsistent

• An author could be:
    • “Mark Twain”
    • “Twain, Mark”
    • “M. Twain”
    • “Samuel Clemens”

• If a user searches for “Mark Twain”, the search may not return all related books

Page 6: U of R eXtensible Catalog Team MetaCat. Problem Domain.

Goals

• Bibliographic Record
    • Author field
        • Name
        • Date of Birth, Death

• Authority Record
    • Authorized Form
    • Alternate Forms: Alternate form 1, Alternate form 2, …
    • See Also: References to other authority records

Page 7: U of R eXtensible Catalog Team MetaCat. Problem Domain.

Sponsor’s Solution

Page 8: U of R eXtensible Catalog Team MetaCat. Problem Domain.

Iterative Process Flow

Requirement Elicitation

Requirement Analysis

Define Architecture

Update Release Plan

Produce SRS & acceptance tests

Subsystem Design

Identify Integration Tests

Implementation

Integration

Acceptance Testing

Delivery

For each release:

Update Documentation

Page 9: U of R eXtensible Catalog Team MetaCat. Problem Domain.

Metrics

• Effort by type of activity

• Test metrics (JUnit)

• Defects by type

Page 10: U of R eXtensible Catalog Team MetaCat. Problem Domain.

Effort by Type

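Hours per period:

Period           Meeting   Development   Documentation
Before           ~40       0             0
1/12-1/18        45        29            2
1/18-1/25        20        43            5
1/26-2/1 (R1)    24        41            4
2/2-2/8          20        31            7
2/9-2/15 (R2)    24        2             0
Total            133       146           18

Note: the Meeting total of 133 counts only the periods from 1/12 onward; it does not include the ~40 hours logged before then.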

Page 11: U of R eXtensible Catalog Team MetaCat. Problem Domain.

[Chart: Effort by Type. Hours spent on activities per period, from "Before" through 2/9-2/15 (R2). Y-axis: Hours (0 to 50). X-axis: Time. Series: Meeting, Development, Documentation.]

Page 12: U of R eXtensible Catalog Team MetaCat. Problem Domain.

Issuetracker

Initially, issues were not recorded properly.

Issue Tracker is used to track:
1. Issues (design, documentation, process)
2. Bugs
3. Discussions (new features, nice to have)

Page 13: U of R eXtensible Catalog Team MetaCat. Problem Domain.

Issuetracker

Page 14: U of R eXtensible Catalog Team MetaCat. Problem Domain.

Defects by Type

Page 15: U of R eXtensible Catalog Team MetaCat. Problem Domain.

Status

• 3.1 Import a record into database

(R1) FR-1.1: The system shall parse the XML record.

(R2) FR-1.2: The system shall store the information obtained from parsing the XML record in the MySQL database.

(R1) FR-1.3: The system shall be able to import multiple records at once. (Batch processing)

(R1) FR-1.4: The system shall normalize strings.
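FR-1.4 does not say what "normalize" means. As a rough sketch only, the following assumes normalization means stripping diacritics and punctuation, collapsing whitespace, and lower-casing. The class name `NameNormalizer` and these rules are hypothetical and are not taken from the team's SRS.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper: the normalization rules below are assumptions.
final class NameNormalizer {
    private NameNormalizer() {}

    static String normalize(String raw) {
        // Decompose accented characters into base letter + combining mark,
        // then drop the combining marks.
        String s = Normalizer.normalize(raw, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
        // Replace anything that is not a letter, digit, or whitespace with a space.
        s = s.replaceAll("[^\\p{L}\\p{N}\\s]+", " ");
        // Trim, collapse runs of whitespace, and lower-case.
        return s.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalize("  Twain,   Mark. "));
        System.out.println(normalize("Bront\u00eb, Charlotte"));
    }
}
```

Under these assumed rules, differences in spacing, punctuation, and accents no longer prevent two hand-typed forms of the same string from comparing equal. Word order is left untouched, so "Twain, Mark" and "Mark Twain" still normalize to different strings.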

Page 16: U of R eXtensible Catalog Team MetaCat. Problem Domain.

Status cont.

• 3.2 Matching records

(R1) FR-2.1: The system shall create a new authority record.

(R2) FR-2.2: The system shall match two strings and give a confidence level of the matching.

(R2) FR-2.3: The system shall store the results of the matching, including the degree of certainty and the link(s) to the matched authority record(s).

(R1) FR-2.4: The system shall identify all unprocessed records in the records database. The unprocessed records are the records that have not yet been matched against.

(R1) FR-2.5: The system shall create a new authority record, and store it in the database.

Page 17: U of R eXtensible Catalog Team MetaCat. Problem Domain.

Status cont.

(R1) FR-2.6: The system shall replace the data in authority-controlled fields with its authorized form and store the link to its authorized form if the degree of certainty is above auto-accept threshold.

(R2) FR-2.7: The system shall mark the record to be reviewed by a person if the degree of certainty is between auto-accept threshold and auto-reject threshold.

FR-2.8: The system shall create a new authority record using the information from the current record, and create a link between those two records if the degree of certainty is below auto-reject threshold.

(R1) FR-2.9: The system shall analyze unprocessed records on demand.

(R1) FR-2.10: The system shall attempt to match records first by comparing authority names.

(R2) FR-2.11: The system shall attempt to match records by comparing alternative names if the first attempt (FR-2.10) failed.
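The three-way split in FR-2.6 through FR-2.8 can be sketched as a small decision function. This is an illustration, not the team's code: the names `MatchTriage` and `Decision` are hypothetical, the threshold values in `main` are made up, and treating a confidence exactly equal to a threshold as "needs review" is an assumption, since the requirements only say "above", "between", and "below".

```java
// Hypothetical sketch of the auto-accept / review / auto-reject decision.
final class MatchTriage {
    enum Decision { AUTO_ACCEPT, NEEDS_REVIEW, CREATE_NEW_AUTHORITY }

    private MatchTriage() {}

    static Decision decide(double confidence, double autoAccept, double autoReject) {
        if (autoReject > autoAccept) {
            throw new IllegalArgumentException("auto-reject threshold must not exceed auto-accept threshold");
        }
        if (confidence > autoAccept) {
            return Decision.AUTO_ACCEPT;          // FR-2.6: replace with authorized form, store link
        }
        if (confidence < autoReject) {
            return Decision.CREATE_NEW_AUTHORITY; // FR-2.8: create and link a new authority record
        }
        return Decision.NEEDS_REVIEW;             // FR-2.7: mark for review by a person
    }

    public static void main(String[] args) {
        // Example thresholds only; the real values would come from configuration.
        double autoAccept = 0.90;
        double autoReject = 0.50;
        System.out.println(decide(0.95, autoAccept, autoReject));
        System.out.println(decide(0.70, autoAccept, autoReject));
        System.out.println(decide(0.20, autoAccept, autoReject));
    }
}
```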

Page 18: U of R eXtensible Catalog Team MetaCat. Problem Domain.

Status cont.

• 3.5 Review possible matches

(R2) FR-5.1: The system shall gather a collection of records that are marked for review from the database. The questionable matches have a degree of certainty between the auto-accept threshold and the auto-reject threshold.

(R2) FR-5.2: The system shall replace the data in authority-controlled fields with its authorized form and store the link to its authorized form if the user approves the matching.

(R2) FR-5.3: The system shall replace the data in authority-controlled fields with its authorized form and store the link to its authorized form if the user approves the matching.

Page 19: U of R eXtensible Catalog Team MetaCat. Problem Domain.

Our Solution

Page 20: U of R eXtensible Catalog Team MetaCat. Problem Domain.

Architecture

[Architecture diagram. Components: API, GUI, Exporter, DAO (Data Access Object), MySQL DB. Subsystems: «subsystem» Matcher, «subsystem» Import. One association is labeled "-match".]

Page 21: U of R eXtensible Catalog Team MetaCat. Problem Domain.

Matcher

• In NACM, we need to be able to match bibliographic records (books) to authority records (authors).

• The information in the records may not always match exactly, or may match multiple records!

Page 22: U of R eXtensible Catalog Team MetaCat. Problem Domain.

Matching Problems

• Different forms of the same name
    • Nate versus Nathan, typos

• Different authors with the same name
    • George Bush (41) versus George Bush (43)

• Aliases or pen names
    • Samuel Clemens versus Mark Twain

Page 23: U of R eXtensible Catalog Team MetaCat. Problem Domain.

Matching Problems

• To assist in matching different forms of an author’s name, Authority records have a list of alternate names in addition to the authorized form.

• Alternate names may not be distinct.

Page 24: U of R eXtensible Catalog Team MetaCat. Problem Domain.

Matcher Design

• We need a matching strategy that is easy to extend to add new matching rules, while still being fast.

Page 25: U of R eXtensible Catalog Team MetaCat. Problem Domain.

Matching Subsystem

Page 26: U of R eXtensible Catalog Team MetaCat. Problem Domain.

MatchStrategy

• Abstract class that defines the basics of a matching rule
    • Matching method
    • Match confidence

• All matching strategies extend this class

Page 27: U of R eXtensible Catalog Team MetaCat. Problem Domain.

StringTransformer

• Abstract class for string manipulation rules
    • String transform method
    • Transformation confidence

• All string manipulation rules extend this class
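A minimal sketch of what such a class and one concrete rule might look like. The method names `transform` and `confidence`, the `InvertedNameTransformer` rule, and its confidence value of 0.9 are assumptions made for illustration; the slides name only the class and its two responsibilities.

```java
// Hypothetical sketch; only the name StringTransformer comes from the slides.
final class TransformerSketch {
    private TransformerSketch() {}

    /** Base class for string manipulation rules. */
    abstract static class StringTransformer {
        /** Returns the transformed form of the input string. */
        abstract String transform(String input);

        /** How much a match found after this transformation should be trusted (0 to 1). */
        abstract double confidence();
    }

    /** Example rule: rewrites "Last, First" as "First Last". */
    static final class InvertedNameTransformer extends StringTransformer {
        @Override
        String transform(String input) {
            int comma = input.indexOf(',');
            if (comma < 0) {
                return input.trim(); // No comma: nothing to invert.
            }
            String last = input.substring(0, comma).trim();
            String first = input.substring(comma + 1).trim();
            return first.isEmpty() ? last : first + " " + last;
        }

        @Override
        double confidence() {
            return 0.9; // Made-up value for illustration.
        }
    }

    public static void main(String[] args) {
        StringTransformer t = new InvertedNameTransformer();
        System.out.println(t.transform("Twain, Mark"));
    }
}
```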

Page 28: U of R eXtensible Catalog Team MetaCat. Problem Domain.

MatchDriver

• Handles performing a match
    • Creates pairs of strategies & transformations
    • Sorts pairs based on overall confidence
    • Iterates through the pairs looking for matches
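The pair-sort-iterate flow described for MatchDriver could look roughly like the sketch below. Several details are assumptions rather than the team's design: the method signatures, computing a pair's overall confidence as the product of its two confidences, applying the transformer to both strings before comparing, and the example rules in `demoDriver`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.OptionalDouble;

// Hypothetical sketch of the driver; class names follow the slides, details are assumed.
final class MatchDriverSketch {
    private MatchDriverSketch() {}

    abstract static class MatchStrategy {
        abstract boolean matches(String candidate, String authorized);
        abstract double confidence();
    }

    abstract static class StringTransformer {
        abstract String transform(String input);
        abstract double confidence();
    }

    /** One strategy combined with one transformer. */
    static final class Pair {
        final MatchStrategy strategy;
        final StringTransformer transformer;

        Pair(MatchStrategy strategy, StringTransformer transformer) {
            this.strategy = strategy;
            this.transformer = transformer;
        }

        // Assumption: overall confidence is the product of the two confidences.
        double overallConfidence() {
            return strategy.confidence() * transformer.confidence();
        }
    }

    static final class MatchDriver {
        private final List<Pair> pairs = new ArrayList<>();

        MatchDriver(List<MatchStrategy> strategies, List<StringTransformer> transformers) {
            for (MatchStrategy s : strategies) {
                for (StringTransformer t : transformers) {
                    pairs.add(new Pair(s, t));
                }
            }
            // Highest overall confidence first.
            pairs.sort((x, y) -> Double.compare(y.overallConfidence(), x.overallConfidence()));
        }

        /** Confidence of the first pair that matches, or empty if no pair matches. */
        OptionalDouble match(String candidate, String authorized) {
            for (Pair p : pairs) {
                String c = p.transformer.transform(candidate);
                String a = p.transformer.transform(authorized);
                if (p.strategy.matches(c, a)) {
                    return OptionalDouble.of(p.overallConfidence());
                }
            }
            return OptionalDouble.empty();
        }
    }

    /** Builds a driver with one exact-match strategy and two example transformers. */
    static MatchDriver demoDriver() {
        MatchStrategy exact = new MatchStrategy() {
            @Override boolean matches(String c, String a) { return c.equals(a); }
            @Override double confidence() { return 1.0; }
        };
        StringTransformer identity = new StringTransformer() {
            @Override String transform(String s) { return s; }
            @Override double confidence() { return 1.0; }
        };
        StringTransformer lowerCase = new StringTransformer() {
            @Override String transform(String s) { return s.toLowerCase(Locale.ROOT); }
            @Override double confidence() { return 0.9; }
        };
        return new MatchDriver(List.of(exact), List.of(identity, lowerCase));
    }

    public static void main(String[] args) {
        MatchDriver driver = demoDriver();
        System.out.println(driver.match("Mark Twain", "Mark Twain"));
        System.out.println(driver.match("mark twain", "Mark Twain"));
        System.out.println(driver.match("M. Twain", "Mark Twain"));
    }
}
```

Because the pairs are tried in descending order of confidence, the first hit is also the most trusted one, so the loop can stop early. Adding a rule only means passing one more strategy or transformer to the constructor.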

Page 29: U of R eXtensible Catalog Team MetaCat. Problem Domain.

Matcher Extensibility

• Adding new rules
    • Extend MatchStrategy or StringTransformer
    • Implement new matching or transforming rules
    • Assign a confidence
    • Add to MatchDriver

• MatchDriver takes care of the rest

Page 30: U of R eXtensible Catalog Team MetaCat. Problem Domain.

Importer

• Takes in input streams and parses them to extract authority and bibliographic data

• Uses a SAX parser to build a Document Object Model (DOM) object

• Data is extracted from document, normalized, and inserted into the database
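As an illustration of the parse-then-extract step, the sketch below reads an input stream into a DOM with the JDK's standard `DocumentBuilder` and pulls out author names. It is not the team's importer: the slides do not say which parser library was used, the class name `ImportSketch` is hypothetical, and the sample record is a simplified, namespace-free MARCXML-style fragment in which field 100, subfield a, carries the personal name.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch of the extraction step; normalization and database insert are omitted.
final class ImportSketch {
    private ImportSketch() {}

    // Simplified sample record, for illustration only.
    static final String SAMPLE =
            "<record>"
          + "<datafield tag=\"100\"><subfield code=\"a\">Twain, Mark</subfield></datafield>"
          + "<datafield tag=\"245\"><subfield code=\"a\">Adventures of Huckleberry Finn</subfield></datafield>"
          + "</record>";

    /** Returns the text of every subfield "a" inside a datafield with tag "100". */
    static List<String> extractAuthors(InputStream in) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse record", e);
        }
        List<String> authors = new ArrayList<>();
        NodeList fields = doc.getElementsByTagName("datafield");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            if (!"100".equals(field.getAttribute("tag"))) {
                continue; // Not an author field.
            }
            NodeList subfields = field.getElementsByTagName("subfield");
            for (int j = 0; j < subfields.getLength(); j++) {
                Element sub = (Element) subfields.item(j);
                if ("a".equals(sub.getAttribute("code"))) {
                    authors.add(sub.getTextContent().trim());
                }
            }
        }
        return authors;
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(SAMPLE.getBytes(StandardCharsets.UTF_8));
        System.out.println(extractAuthors(in));
    }
}
```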

Page 31: U of R eXtensible Catalog Team MetaCat. Problem Domain.

Importer

Page 32: U of R eXtensible Catalog Team MetaCat. Problem Domain.

MySQL data model

record_types
    PK        id
              name

names
    PK        id
              orig_string
              nor_string

authority_records
    PK        id
              processed
              generated
              xml_hashcode
              orig_xml
    FK1       record_type_id
    FK2       authority_name_id

authority_records_alter_forms
    PK,FK1    name
    PK,FK2    record_id

authority_records_see_also
    PK,FK1    name
    PK,FK2    record_id

bib_records
    PK        id
              processed
              xml_hashcode
              orig_xml
    FK1       record_type_id

bib_records_titles
    PK,FK1    name
    PK,FK2    record_id

bib_records_authors
    PK,FK1    name
    PK,FK2    record_id

bib_records_subjects
    PK,FK1    name
    PK,FK2    record_id

authority_records_links
    PK        id
              approved
              flagged
              rejected
    FK1       auth_record_id
    FK2       bib_record_id
              evidence
              time_found
              time_verifed
              approvedby
              percent_confidence
    FK3       string_id

bib_records_author_links
    PK,FK1    bib_record_id
    PK,FK2    auth_link_id

bib_records_subjects_links
    PK,FK2    bib_record_id
    PK,FK1    auth_link_id

Page 33: U of R eXtensible Catalog Team MetaCat. Problem Domain.

Using Hibernate

• Transparent Data Persistence

• Manages relationships between entities

• Benefits
    • Query caching
    • Lazy-loading of associated entities
    • Automatic flagging of changes
    • Programmatic API for complex queries

Page 34: U of R eXtensible Catalog Team MetaCat. Problem Domain.

How it Works

• Define Schema

• Define Domain Model

• Use XML to map fields in classes to columns in tables
    • Define cascading behavior

Page 35: U of R eXtensible Catalog Team MetaCat. Problem Domain.

Hibernate Caveats

• Designed with transactions in mind
    • But, we use batch processing!

• Query language lacks some of the power of SQL

• Not 100% transparent
    • Design and use of domain model is affected

Page 36: U of R eXtensible Catalog Team MetaCat. Problem Domain.

Results Viewing GUI

[Class diagram, package gui.resultsGUI. Classes: ResultsTable (+refresh(), +sortBy(in column : int), +updateLinkCountLabel()), which creates and lays out a JTable and other GUI components; ResultsTableModel (+getValueAt(in row, in column)); FilterControls; PagingControls; SelectedLinkControls; Filter; AuthorityLinkDAO (+findAllWithFilter()); AuthorityLink. Association labels: -creates, -database.]

• A table displaying all created links

• Can be filtered, sorted, and paged

Page 37: U of R eXtensible Catalog Team MetaCat. Problem Domain.

Future Plans

• Verify that the matching algorithm is doing the right things

• Implement string transformers

• Create new XC records

• Merge and update records with new data upon import

• Configuration files for the system

Page 38: U of R eXtensible Catalog Team MetaCat. Problem Domain.

Demo!