Slinging Data: Data Loading and Cleanup in Evergreen

Slinging Data: Data Loading and Cleanup in Evergreen. Growing Evergreen Conference, 22 April 2010.


Presentation for the 2010 Evergreen Conference on migrating data to the Evergreen open source ILS.

Transcript of Slinging Data: Data Loading and Cleanup in Evergreen

Page 1: Slinging Data: Data Loading and Cleanup in Evergreen

Slinging Data: Data Loading and Cleanup in Evergreen

Growing Evergreen Conference

22 April 2010

Page 2: Slinging Data: Data Loading and Cleanup in Evergreen

To migrate data …

Extract from the old, map and load into the new, clean up along the way, and keep the auditor happy.

Page 3: Slinging Data: Data Loading and Cleanup in Evergreen

Whence

Extract data in a convenient form:

• Sometimes that means whatever you can get

• But better is

• MARC

• Flat text

• XML

Page 4: Slinging Data: Data Loading and Cleanup in Evergreen

All over the map

• Map entities

• Map fields

• Map values

• Map policies

Page 5: Slinging Data: Data Loading and Cleanup in Evergreen

All over the map

• Entities

• What is an item?

• What is a patron?

• Fields

• Where does the patron PIN come from?

Page 6: Slinging Data: Data Loading and Cleanup in Evergreen

All over the map

• Values

• Legacy item types

• 0

• 1

• 45

• 123

• 234

Quick: which is the one for journal loan?

Page 7: Slinging Data: Data Loading and Cleanup in Evergreen

All over the map

Legacy Item Type    Circ Modifier

0                   Regular
1                   Media
45                  AV
123                 Reference
234                 Reference
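A value map like this one is easy to apply in code, and just as easy to get wrong silently when the legacy data contains a code nobody listed. The sketch below is a minimal illustration, not part of any Evergreen tooling: the dictionary is copied from the table above, while the function name `map_item_types` and the choice to collect unmapped codes for later review are assumptions made for the example.

```python
# Mapping copied from the slide's table; keys are legacy item type codes.
CIRC_MODIFIER_BY_LEGACY_TYPE = {
    "0": "Regular",
    "1": "Media",
    "45": "AV",
    "123": "Reference",
    "234": "Reference",
}


def map_item_types(legacy_types):
    """Map legacy item type codes to circ modifiers.

    Returns (mapped, unmapped): `mapped` holds one circ modifier (or None)
    per input value, and `unmapped` counts each code that has no mapping,
    so those codes can be reviewed instead of being silently dropped.
    """
    mapped = []
    unmapped = {}
    for raw in legacy_types:
        code = raw.strip()  # legacy exports often pad fixed-width fields
        modifier = CIRC_MODIFIER_BY_LEGACY_TYPE.get(code)
        if modifier is None:
            unmapped[code] = unmapped.get(code, 0) + 1
        mapped.append(modifier)
    return mapped, unmapped


mapped, unmapped = map_item_types(["0", " 45", "234", "999", "999"])
print(mapped)
print(unmapped)
```

Reporting the unmapped codes with their counts gives the library something concrete to decide on before the load, rather than after.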

Page 8: Slinging Data: Data Loading and Cleanup in Evergreen

Cleaning up

What?

• Bad data

• Ancient data

• Data that would be too expensive to deal with later

When?

• Extract

• Load

• Post-load

Page 9: Slinging Data: Data Loading and Cleanup in Evergreen

Don’t box me in!

• The case of the dreaded double-encoding

• The even more dreadful case of the duplicitous and multiplicitous character encoding
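The simplest form of double-encoding can often be reversed mechanically. The sketch below assumes one specific failure: text that was correctly encoded as UTF-8, then misread as Latin-1 and encoded as UTF-8 a second time. The function name `fix_double_encoding` is hypothetical, and real legacy data may involve other encodings (Windows-1252 or MARC-8, for example) that this does not handle.

```python
def fix_double_encoding(text: str) -> str:
    """Undo one round of UTF-8-read-as-Latin-1 double-encoding.

    If the text does not look double-encoded (the round trip fails),
    it is returned unchanged.
    """
    try:
        # Recover the original bytes, then decode them as the UTF-8 they were.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Simulate the damage: UTF-8 bytes misread as Latin-1.
garbled = "Café".encode("utf-8").decode("latin-1")
print(garbled, "->", fix_double_encoding(garbled))
```

A round trip that succeeds is only evidence, not proof, that the text was double-encoded, so a fix like this is best applied to a sample and reviewed before it is run across a whole file. When a single file mixes several encodings, a per-record decision is usually needed.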

Page 10: Slinging Data: Data Loading and Cleanup in Evergreen

Yes, those fixed fields really matter

The purpose of every modern ILS and discovery layer …

Page 11: Slinging Data: Data Loading and Cleanup in Evergreen

Yes, those fixed fields really matter

… is to point out every fixed field coding error in a form convenient for catalogers to identify and fix.

Page 12: Slinging Data: Data Loading and Cleanup in Evergreen

Fixed fields

Page 13: Slinging Data: Data Loading and Cleanup in Evergreen

Oops!

create or replace function m_foo.set_leader (TEXT, INT, TEXT) RETURNS TEXT AS $$
    # Arguments: MARCXML record, zero-based leader position, replacement character.
    my ($marcxml, $pos, $value) = @_;

    use MARC::Record;
    use MARC::File::XML;

    # If anything inside the eval fails, the input is returned unchanged.
    my $xml = $marcxml;
    eval {
        my $marc = MARC::Record->new_from_xml($marcxml, 'UTF-8');
        my $leader = $marc->leader();
        substr($leader, $pos, 1) = $value;
        $marc->leader($leader);
        $xml = $marc->as_xml_record;
        $xml =~ s/^<\?.+?\?>$//mo;   # strip the XML declaration
        $xml =~ s/\n//sgo;           # remove newlines
        $xml =~ s/>\s+</></sgo;      # remove whitespace between tags
    };
    return $xml;
$$ LANGUAGE PLPERLU STABLE;
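For readers who do their cleanup outside the database, the same idea can be sketched in Python using only the standard library. This is an illustrative equivalent, not code from the presentation. It assumes the input is a single MARCXML `<record>` element in the standard MARCXML namespace. It returns the input unchanged when the XML cannot be parsed or has no usable leader, much as the eval block above does. The check that the replacement is exactly one character is an addition.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
# Serialize with MARCXML as the default namespace instead of an ns0: prefix.
ET.register_namespace("", MARC_NS)


def set_leader(marcxml: str, pos: int, value: str) -> str:
    """Replace the character at zero-based position `pos` of the leader."""
    if len(value) != 1:
        raise ValueError("value must be exactly one character")
    try:
        root = ET.fromstring(marcxml)
    except ET.ParseError:
        return marcxml  # unparseable input is passed through untouched
    leader = root.find(f"{{{MARC_NS}}}leader")
    if leader is None or leader.text is None:
        return marcxml
    if not 0 <= pos < len(leader.text):
        return marcxml
    leader.text = leader.text[:pos] + value + leader.text[pos + 1:]
    return ET.tostring(root, encoding="unicode")


record = (
    '<record xmlns="http://www.loc.gov/MARC21/slim">'
    "<leader>00000nam a2200000 a 4500</leader>"
    "</record>"
)
# Leader position 6 is the type of record; "g" is projected medium.
print(set_leader(record, 6, "g"))
```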

Page 14: Slinging Data: Data Loading and Cleanup in Evergreen

On stage

Postgres lets us create an elegant mechanism for staging data to be loaded into an Evergreen database:

• Table inheritance

• Sequences

Page 15: Slinging Data: Data Loading and Cleanup in Evergreen

On stage

We want to be able to

• Load and manipulate the data

• … using every tool on our belt

• … while ensuring that it doesn’t show up in production until it’s ready (and we’re ready)

Page 16: Slinging Data: Data Loading and Cleanup in Evergreen

On stage

• Make a separate schema

psql> create schema m_foo;

• Mirror a real table

create table m_foo.asset_copy …

Page 17: Slinging Data: Data Loading and Cleanup in Evergreen

On stage

• Use the sequence

…id bigint not null default nextval('asset.copy_id_seq'::regclass)…

Page 18: Slinging Data: Data Loading and Cleanup in Evergreen

On stage

• Make space for the legacy

create table m_foo.asset_copy_legacy (

l_call_number TEXT

) inherits (m_foo.asset_copy);

Page 19: Slinging Data: Data Loading and Cleanup in Evergreen

On stage

• Munge

• Munge

• Munge some more, then …

• Insert into production:

insert into asset.copy

select * from m_foo.asset_copy;
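The stage, munge, insert flow can be tried end to end without a Postgres server. The sketch below uses Python's built-in sqlite3 module as a stand-in, which is a deliberate swap: SQLite has no table inheritance and no shared sequences, so the staging table simply carries the production columns plus `l_`-prefixed legacy columns. All table and column names are hypothetical and far simpler than Evergreen's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-in for the production table (hypothetical, much simplified).
    CREATE TABLE copy (
        id            INTEGER PRIMARY KEY,
        barcode       TEXT NOT NULL UNIQUE,
        circ_modifier TEXT NOT NULL
    );
    -- Staging table: production columns plus l_* legacy columns.
    CREATE TABLE staged_copy (
        id            INTEGER PRIMARY KEY,
        barcode       TEXT,
        circ_modifier TEXT,
        l_barcode     TEXT,
        l_item_type   TEXT
    );
""")

# Load the raw legacy values exactly as extracted.
legacy_rows = [(" 30001 ", "0"), ("30002", "45"), ("", "123")]
conn.executemany(
    "INSERT INTO staged_copy (l_barcode, l_item_type) VALUES (?, ?)",
    legacy_rows,
)

# Munge: derive clean production values from the legacy columns.
conn.execute("UPDATE staged_copy SET barcode = NULLIF(TRIM(l_barcode), '')")
conn.execute("""
    UPDATE staged_copy SET circ_modifier = CASE l_item_type
        WHEN '0' THEN 'Regular'
        WHEN '45' THEN 'AV'
        WHEN '123' THEN 'Reference'
    END
""")

# Insert into production only the rows that are ready, in one transaction.
with conn:
    conn.execute("""
        INSERT INTO copy (id, barcode, circ_modifier)
        SELECT id, barcode, circ_modifier
        FROM staged_copy
        WHERE barcode IS NOT NULL AND circ_modifier IS NOT NULL
    """)

loaded = conn.execute("SELECT COUNT(*) FROM copy").fetchone()[0]
held_back = conn.execute(
    "SELECT COUNT(*) FROM staged_copy"
    " WHERE barcode IS NULL OR circ_modifier IS NULL"
).fetchone()[0]
print(f"loaded={loaded} held_back={held_back}")
```

Keeping the legacy values beside the cleaned ones means every production row can be traced back to what the old system actually said, which helps when a question comes up after go-live.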

Page 20: Slinging Data: Data Loading and Cleanup in Evergreen

Counting

Who is the auditor?

It is you … and your patrons … and maybe even an actual auditor.

Page 21: Slinging Data: Data Loading and Cleanup in Evergreen

Counting

• Count what matters

• Number of records

• Number of dollars

• Number of things you’ll have to fix manually

• Don’t count what doesn’t matter

• Header rows

• Junk
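As a rough illustration of counting what matters, the sketch below tallies records and dollars in a flat-file extract while keeping header rows and junk out of the totals. The two-column layout (patron barcode, balance owed) and the function name `count_extract` are assumptions made for the example, not a format from any particular legacy system.

```python
import csv
from decimal import Decimal, InvalidOperation


def count_extract(lines, header_rows=1):
    """Count usable records and total dollars in a two-column extract.

    Header rows are skipped entirely. Rows that are blank, malformed, or
    carry a non-numeric amount are counted as junk, not as records.
    """
    records = 0
    junk = 0
    dollars = Decimal("0")
    for index, row in enumerate(csv.reader(lines)):
        if index < header_rows:
            continue
        if len(row) != 2 or not row[0].strip():
            junk += 1
            continue
        try:
            amount = Decimal(row[1].strip())
        except InvalidOperation:
            junk += 1
            continue
        records += 1
        dollars += amount
    return {"records": records, "dollars": dollars, "junk": junk}


sample = [
    "barcode,balance\n",
    "P001,12.50\n",
    "\n",
    "P002,0.75\n",
    "garbage line\n",
]
print(count_extract(sample))
```

Decimal is used for the money column so that the total can be compared exactly against the figure reported by the legacy system.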

Page 22: Slinging Data: Data Loading and Cleanup in Evergreen

Counting

• Count early and often

• Conservation of library data is Newton’s 42nd law!
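The conservation rule can be turned into a check that runs after every stage: each record that went in must come out as loaded, rejected, or merged into another record. The sketch below is one minimal way to express that. The function names and the three outcome buckets are assumptions for illustration, and a real migration may need more categories.

```python
def unaccounted(extracted, loaded, rejected, merged=0):
    """Return how many extracted records have no recorded outcome."""
    return extracted - (loaded + rejected + merged)


def check_stage(name, extracted, loaded, rejected, merged=0):
    """Raise if a stage gained or lost records; otherwise return a summary."""
    gap = unaccounted(extracted, loaded, rejected, merged)
    if gap != 0:
        raise ValueError(f"{name}: {gap} records unaccounted for")
    return f"{name}: all {extracted} records accounted for"


print(check_stage("patrons", extracted=1000, loaded=990, rejected=7, merged=3))
```

A negative gap is as much a warning sign as a positive one, since it usually means something was loaded or counted twice.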

Page 23: Slinging Data: Data Loading and Cleanup in Evergreen

Tools

• The usual suspects

• MARC::Record (or pymarc, or ruby-marc, or …)

• MarcEdit

• yaz-marcdump

• Spreadsheets

Page 24: Slinging Data: Data Loading and Cleanup in Evergreen

And now something new

Page 25: Slinging Data: Data Loading and Cleanup in Evergreen

Equinox Migration Tools

What?

MARC processing

Non-MARC processing

And more …

Where?

git://git.esilibrary.com/git/migration-tools.git

Page 26: Slinging Data: Data Loading and Cleanup in Evergreen

Thanks!

Galen Charlton

VP for Data Services, Equinox Software Inc.
