Slinging Data: Data Loading and Cleanup in Evergreen

Post on 27-Jun-2015


Presentation for the 2010 Evergreen Conference on migrating data to the Evergreen open source ILS.


Growing Evergreen Conference

22 April 2010

To migrate data …

Extract from the old, map and load into the new, clean up along the way, and keep the auditor happy.

Whence

Extract data in a convenient form:

• Sometimes that means whatever you can get

• But better is

• MARC

• Flat text

• XML

All over the map

• Map entities

• Map fields

• Map values

• Map policies

All over the map

• Entities

• What is an item?

• What is a patron?

• Fields

• Where does the patron PIN come from?

All over the map

• Values

• Legacy item types

• 0

• 1

• 45

• 123

• 234

Quick: which is the one for journal loan?

All over the map

Legacy Item Type    Circ Modifier
0                   Regular
1                   Media
45                  AV
123                 Reference
234                 Reference
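A value map like the table above can be kept as a small lookup table in the migration script. What follows is a minimal Python sketch, not something from the original slides: the function name `map_item_type` and the decision to collect unmapped values for review (rather than load them silently) are assumptions made for illustration.

```python
# Legacy item type -> circ modifier, copied from the table above.
CIRC_MODIFIER_BY_LEGACY_TYPE = {
    "0": "Regular",
    "1": "Media",
    "45": "AV",
    "123": "Reference",
    "234": "Reference",
}


def map_item_type(legacy_type, unmapped):
    """Return the circ modifier for a legacy item type.

    Unknown values are recorded in `unmapped` (a set) and mapped to None,
    so they can be reviewed instead of being loaded silently.
    """
    key = str(legacy_type).strip()
    modifier = CIRC_MODIFIER_BY_LEGACY_TYPE.get(key)
    if modifier is None:
        unmapped.add(key)
    return modifier


# Hypothetical input: a mix of strings, integers, and padded values.
unmapped_types = set()
mapped = [map_item_type(t, unmapped_types) for t in ["0", 45, " 234 ", "999"]]
```

Anything left in `unmapped_types` after a run is a value the map does not cover yet, which is usually worth settling before loading.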

Cleaning up

What?

• Bad data

• Ancient data

• Data it is too expensive to deal with later

When?

• Extract

• Load

• Post-load

Don’t box me in!

• The case of the dreaded double-encoding

• The even more dreadful case of the duplicitous and multiplicitous character encoding
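The simplest form of double-encoding happens when UTF-8 bytes are read as Latin-1 and then written out as UTF-8 again. The Python sketch below undoes one such round trip. It is an illustration only: the function name `fix_double_encoding` is hypothetical, and it assumes the wrong intermediate encoding was Latin-1 (Windows-1252 is another common culprit and would need a different codec). Records that mix several encodings need field-by-field inspection that this sketch does not attempt.

```python
def fix_double_encoding(text):
    """Undo one round of UTF-8 text that was mis-read as Latin-1.

    If the text does not look double-encoded, it is returned unchanged.
    """
    try:
        repaired = text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Either the text has characters outside Latin-1, or its bytes are
        # not valid UTF-8; in both cases leave it alone.
        return text
    return repaired


# "café" after a bad round trip: the two UTF-8 bytes of "é" show up as "Ã©".
double_encoded = "caf\u00c3\u00a9"
repaired = fix_double_encoding(double_encoded)
```

A string that legitimately contains a sequence such as "Ã©" would also be rewritten, so it is safer to review a sample of changes before applying this to a whole extract.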

Yes, those fixed fields really matter

The purpose of every modern ILS and discovery layer …

Yes, those fixed fields really matter

… is to point out every fixed field coding error in a form convenient for catalogers to identify and fix.

Fixed fields

Oops!

create or replace function m_foo.set_leader (TEXT, INT, TEXT) RETURNS TEXT AS $$
    # Arguments: MARCXML record, leader position (0-based), replacement character
    my ($marcxml, $pos, $value) = @_;

    use MARC::Record;
    use MARC::File::XML;

    # Fall back to the unmodified input if the record cannot be parsed
    my $xml = $marcxml;
    eval {
        my $marc = MARC::Record->new_from_xml($marcxml, 'UTF-8');
        my $leader = $marc->leader();
        substr($leader, $pos, 1) = $value;
        $marc->leader($leader);
        $xml = $marc->as_xml_record;
        $xml =~ s/^<\?.+?\?>$//mo;   # strip the XML declaration
        $xml =~ s/\n//sgo;           # remove newlines
        $xml =~ s/>\s+</></sgo;      # remove whitespace between tags
    };
    return $xml;
$$ LANGUAGE PLPERLU STABLE;
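The core of a leader fix is a single-character replacement at a fixed position. The Python sketch below isolates that step so it can be tried outside the database. It is not the function from the slides: the name `set_leader_char` and the sample leader are hypothetical, and it works on the bare 24-character leader string rather than on a full MARCXML record.

```python
def set_leader_char(leader, pos, value):
    """Return a copy of a MARC leader with the character at `pos` replaced."""
    if len(leader) != 24:
        raise ValueError("a MARC 21 leader is 24 characters long")
    if not 0 <= pos < 24 or len(value) != 1:
        raise ValueError("pos must be 0-23 and value a single character")
    return leader[:pos] + value + leader[pos + 1:]


# Hypothetical leader; position 6 holds the type-of-record code.
leader = "00000nam a2200000 a 4500"
fixed = set_leader_char(leader, 6, "g")
```

Checking the length and position up front keeps a bad call from silently producing a leader of the wrong size.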

On stage

Postgres lets us create an elegant mechanism for staging data to be loaded into an Evergreen database:

• Table inheritance

• Sequences

On stage

We want to be able to

• Load and manipulate the data

• … using every tool on our belt

• … while ensuring that it doesn’t show up in production until it’s ready (and we’re ready)

On stage

• Make a separate schema

psql> create schema m_foo;

• Mirror a real table

create table m_foo.asset_copy …

On stage

• Use the sequence

…id bigint not null default nextval('asset.copy_id_seq'::regclass)…

On stage

• Make space for the legacy

create table m_foo.asset_copy_legacy (
    l_call_number TEXT
) inherits (m_foo.asset_copy);

On stage

• Munge

• Munge

• Munge some more, then …

• Insert into production:

insert into asset.copy

select * from m_foo.asset_copy;

Counting

Who is the auditor?

It is you … and your patrons … and maybe even an actual auditor.

Counting

• Count what matters

• Number of records

• Number of dollars

• Number of things you’ll have to fix manually

• Don’t count what doesn’t matter

• Header rows

• Junk

Counting

• Count early and often

• Conservation of library data is Newton’s 42nd law!
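Counting early and often can be as simple as counting the data rows in each extract and checking that every record is accounted for after the load. The Python sketch below is one way to do that. The function names, the CSV layout, and the sample data are all hypothetical, and it assumes every extracted record ends up either loaded or explicitly rejected.

```python
import csv
import io


def count_data_rows(handle, has_header=True):
    """Count rows in a delimited extract, skipping the header and blank rows."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # the header row is not a record
    return sum(1 for row in reader if any(cell.strip() for cell in row))


def reconcile(extracted, loaded, rejected):
    """Return how many records are unaccounted for (0 means the counts balance)."""
    return extracted - (loaded + rejected)


# Hypothetical three-record extract with a header row and a blank line.
sample = io.StringIO("barcode,title\n1001,Dune\n\n1002,Emma\n1003,Ulysses\n")
extracted = count_data_rows(sample)
missing = reconcile(extracted, loaded=2, rejected=1)
```

A nonzero result from `reconcile` means records went missing (or appeared) somewhere between extract and load, and it is cheaper to find out at that point than after go-live.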

Tools

• The usual suspects

• MARC::Record (or pymarc, or ruby-marc, or …)

• MarcEdit

• yaz-marcdump

• Spreadsheets

And now something new

Equinox Migration Tools

What?

MARC processing

Non-MARC processing

And more …

Where?

git://git.esilibrary.com/git/migration-tools.git

Thanks!

Galen Charlton

VP for Data Services, Equinox Software Inc.

gmc@esilibrary.com