SCLENDS dedupping project
SCLENDS Dedupping Project
Rogan Hamby, [email protected]
Obligatory Obligations*
This is a new version of the slides I used at a Cataloging Workgroup meeting on 2010-07-21. I’ve tried to clean them up to read better without me talking over them, but they still carry most of the faults of being meant as a speaking aid rather than being self-explanatory. Paradoxically, they are very text heavy as well. Some (few) tweaks have been contributed for clarity. All faults are purely mine.
* You’ll find scattered footnotes in here. I apologize in advance. Really, you can skip them if you want to.
On Made Up Words
When I say ‘dedupping’ I mean ‘MARC de-duplication’
Schrödinger’s MARC
MARC records are simultaneously perfect and horrible, and only acquire one state once we start using them.
‘Bad’ or ‘idiosyncratic’ records often exist due to valid decisions in the past that are now unviable in a strictly MARC-centric ILS with consortial cohabitation in the catalog.
It’s Dead Jim *
‘Idiosyncratic’ records and natural variety among MARC records hampered the
deduplication process during the original migrations and database merges.
* The slide title is a reference to Schrödinger’s cat: the MARC record has attained a single state now that it’s in use. If you don’t get it, that’s OK. I’m a geek and should get out more.
The Problem
The result is a messy database that is reflected in the catalog. Searching the OPAC felt more like an obscure, and maybe arcane, process than we were comfortable with.
Time for the Cleaning Gloves
In March 2009 we began discussing the issue with ESI. The low merging rate was due to the very precise and conservative fingerprinting of the dedupping process. In true open source spirit we decided to roll our own solution and start cleaning up the database.
A Disclaimer
The dedupping as it was originally performed was not incorrect or wrong in any way. It put a strong emphasis on avoiding wrong or imprecise (edition) matches, which are almost inevitable with looser fingerprinting. We decided that we had different priorities and were willing to make compromises.
Project Goals
Improve Searching
Faster Holds Filling
(maybe) Reduce ICL costs
Scope of Dedupping
2,048,936 bib records
Shasta & Lynn worked
with the CatWoG.
Rogan joined to look at doing some modeling and translating the project into production.
On Changes
I watch the ripples change their size / But never leave the stream
-David Bowie, Changes
The practical challenges meant that a lot changed from the early discussion to
development. We weighted decisions heavily on the side of needing to have a significant and
practical impact.
Two Types of Match Points *
Limiting Match Points – these create a basis for matches and
exclude potential matches.
Additive Match Points – these are not required but create
additional matches.
* These are terms I use to differentiate between two kinds of logistical match points you have to make decisions about. I have no idea if anyone else uses similar terms for the same principles.
Modeling the Data part 1
Determining match points determines the scope of the record set you may create mergers from.
Due to the lack of uniformity in records, the choice of match points became extremely important. Adding a single extra limiting match point caused large percentage drops in possible matches, reducing the effectiveness of the project.
Tilting at Windmills
We refused to believe that dedupping must be done to minimal effect, with minimizing bad merges as the highest priority.
Many said we were a bit mad. Fortunately, we took it as a compliment.*
* Cervantes was actually reacting against what he saw as prevailing custom when he wrote Don Quixote and ended up with brilliant literature. He was also bitter and jealous but we’ll gloss over that part. We were hoping to be more like the first part.
Modeling the Data part 2
We agreed upon only two match points: title and ISBN.
This excluded a large number of records by requiring both a valid title and ISBN entry.
Records with ISBNs and Titles accounted for ~1,200,000 of the over 2 million bib records in the system.
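The limiting match points above can be sketched as a fingerprint built from each record's title and ISBN: records missing either field never enter a match group at all, which is exactly what excluded the other ~800,000 bibs. This is a minimal Python illustration of the idea, not the actual SCLENDS code; the dictionary record shape and sample values are assumptions.

```python
def fingerprint(record, limiting_fields):
    """Build a match key from the limiting fields.

    Records missing any limiting field are excluded from matching
    entirely -- that is what makes these fields "limiting".
    """
    values = [record.get(f) for f in limiting_fields]
    if not all(values):
        return None  # no basis for a match; record is excluded
    return tuple(values)

records = [
    {"id": 1, "title": "dune", "isbn": "9780441172719"},
    {"id": 2, "title": "dune", "isbn": "9780441172719"},
    {"id": 3, "title": "dune"},  # no ISBN -> excluded from matching
]

# Group record ids by their (title, isbn) fingerprint.
groups = {}
for rec in records:
    key = fingerprint(rec, ["title", "isbn"])
    if key is not None:
        groups.setdefault(key, []).append(rec["id"])

print(groups)  # {('dune', '9780441172719'): [1, 2]}
```

An additive match point would work the opposite way: absent, it costs nothing; present and equal, it joins two groups that the limiting fields alone kept apart.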
What Was Left Behind
Excluded records include many that do not have valid ISBNs, including those with SUDOC numbers, ISSNs, pre-cats, etc.
Also excluded were a significant number of potential matches that might have been matched using additive match points.
The Importance of Being Earnest
We were absolutely confident that we could not achieve a high level of matching with extra limiting match points.
We chose not to include additional merging (additive) match points because we could easily overreach.
We estimated based on modeling a conservative ~300,000 merges or about 15% of our ISBNs.
The Wisdom of Crowds
Conventional wisdom said that MARC could not be generalized despite the presence of
supposedly unique information in the records.
We were taking risks and were very aware of it, but the need to create a large impact on our database drove us to disregard friendly warnings.
An Imperfect World
We knew that we would miss things that could potentially be merged.
We knew that we would create some bad merges when there were bad
records.*
10% wrong to get it 90% done.
* GIGO = Garbage In, Garbage Out
Next Step … Normalization
With matching decided, we needed to normalize the data. This was done to copies of the production MARC records, which were then used to make lists.
Normalization is needed because of variability in how data was entered. It allows us to get the most possible matches from the data.
Normalization Details
We normalized case, punctuation, numbers, non-Roman characters, trailing and leading spaces, some GMDs entered as parts of titles, and redacted fields; converted 10-digit ISBNs to 13-digit form; and lots, lots more.
This was not done to permanent records but to copies used to make the lists.
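Two of the normalizations above can be sketched concretely: stripping case, punctuation, and a bracketed GMD from a title, and converting a 10-digit ISBN to its 13-digit form (prefix 978, recompute the check digit). This is an illustrative Python sketch, not the code SCLENDS ran; the exact set of transformations used in production was longer.

```python
import re
import string

def isbn10_to_13(isbn10):
    """Convert a 10-digit ISBN to its 13-digit form.

    Prefix the first nine digits with 978, then recompute the
    EAN-13 check digit (alternating weights 1 and 3).
    """
    core = "978" + isbn10[:9]
    total = sum(int(d) * (1 if i % 2 == 0 else 3)
                for i, d in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)

def normalize_title(title):
    """Lowercase, drop a bracketed GMD like '[videorecording]',
    strip punctuation, and collapse whitespace."""
    title = re.sub(r"\[.*?\]", "", title.lower())
    title = title.translate(str.maketrans("", "", string.punctuation))
    return " ".join(title.split())

print(normalize_title("Dune [videorecording] :"))  # dune
print(isbn10_to_13("0441172717"))                  # 9780441172719
```

Running both sides of a comparison through the same transformations is what lets "Dune [videorecording] :" and "DUNE" land on the same fingerprint.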
Weighting
Finally, we had to weight the matched records to determine which should be the record to keep. To do this, each bib record is given a score representing its quality.
The Weighting Criteria
We looked at the presence, length, and number of entries in the 003, 02X, 24X, 300, 260$b, 100, 010, 500s, 440, 490, 830s, 7XX, 9XX and 59X to manipulate, add to, subtract from, bludgeon, poke and eventually determine a 24-digit number that would represent the quality of a bib record. *
* While not complete, this is mostly accurate.
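The "24-digit number" idea can be sketched as packing per-field counts into one sortable integer, so that richer records score higher and fields earlier in the list dominate the comparison. The field list, ordering, and cap below are illustrative assumptions, not the actual SCLENDS weighting formula.

```python
# Illustrative priority-ordered MARC tags (not the real 24-digit recipe).
FIELDS = ["003", "020", "245", "300", "260", "100", "500", "700"]

def quality_score(record):
    """Pack per-field entry counts (capped at 9) into one integer.

    Because the digits are positional, a record strong in an
    earlier (higher-priority) field beats one strong in a later
    field, mimicking a multi-digit quality number.
    """
    digits = [min(len(record.get(tag, [])), 9) for tag in FIELDS]
    return int("".join(str(d) for d in digits))

bibs = {
    "a": {"245": ["Dune"], "020": ["9780441172719"], "300": ["412 p."]},
    "b": {"245": ["Dune"]},
}
master = max(bibs, key=lambda k: quality_score(bibs[k]))
print(master)  # a -- the fuller record wins
```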
The Merging
Once the weighting is done, the highest-scored record in each group (of records that should be the same item) is made the master record, the copies from the other bibs are moved to it, and those bibs are marked deleted. Holds move with the copies and can then be retargeted, allowing backlogged holds to fill.
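The merge step above can be sketched as: pick the highest-scoring bib in each duplicate group as master, move the other bibs' copies (holds travel with them) onto it, and mark the losers deleted. This is a minimal Python sketch of that flow; the record shapes and score values are assumptions, not Evergreen's actual data model.

```python
def merge_group(bibs, scores):
    """Merge one duplicate group in place.

    bibs:   {bib_id: {"copies": [...], "deleted": False}}
    scores: {bib_id: quality score from the weighting step}
    Returns the id of the surviving master bib.
    """
    master = max(bibs, key=scores.get)
    for bib_id, bib in bibs.items():
        if bib_id == master:
            continue
        # Copies (and the holds attached to them) move to the master.
        bibs[master]["copies"].extend(bib["copies"])
        bib["copies"] = []
        bib["deleted"] = True  # losing bib is marked deleted, not purged
    return master

group = {
    1: {"copies": ["c1"], "deleted": False},
    2: {"copies": ["c2", "c3"], "deleted": False},
}
m = merge_group(group, {1: 5, 2: 9})
print(m, group[m]["copies"])  # 2 ['c2', 'c3', 'c1']
```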
The Coding
We proceeded to contract with Equinox to have them develop the code and run it against our test environment (and eventually production). Galen Charlton was our primary contact in this and aside from excellent work also provided us wonderful feedback about additional criteria to include in the weighting and normalization.
Test Server
Once the code had been run on the test server, we took our new batches of records and broke them into 50,000-record chunks. We then gave those chunks to member libraries and had them do random samples for five days.
Fixed As We Went
Lynn quickly found a problem with 13-digit ISBNs normalizing as 10-digit ISBNs. We quickly identified many parts of DVD sets and some shared-title publications that would be issues. Galen was responsive and helped us compensate for these issues as they were discovered.
In Conclusion
We don’t know how many bad matches were formed but it was below our threshold, perhaps
a few hundred. We are still gathering that feedback.
We were able to purge 326,098 bib records or about 27% of our ISBN based collection.
Evaluation
The catalog is visibly cleaner.
The cost per bib record was 1.5 cents.
Absolutely successful.
Future
This dedupping system will improve further.
There are still problems that need to be cleaned up – some manually and some by
automation.
New libraries that join SCLENDS will use our new dedupping algorithm, not the old one.
Challenges
One, how do we go forward with more cleanup? Treat AV materials separately? We need to look at repackaging standards more.
Two, how do we prevent adding new errors to the system (which is happening)?
Questions?