SCLENDS dedupping project
SCLENDS Dedupping Project
Rogan Hamby, [email protected]
Obligatory Obligations*
This is a new version of the slides I used at a Cataloging Workgroup meeting on 2010-07-21. I’ve tried to clean them up to read better without me talking over them, but they still carry most of the faults of being meant as a speaking aid rather than being self-explanatory. Paradoxically, they are very text heavy as well. Some (few) tweaks have been contributed for clarity. All faults are purely mine.
* You’ll find scattered footnotes in here. I apologize in advance. Really, you can skip them if you want to.
On Made Up Words
When I say ‘dedupping’ I mean ‘MARC de-duplication’
Schrödinger’s MARC
MARC records are simultaneously perfect and horrible, and only acquire one state once we start using them.
‘Bad’ or ‘idiosyncratic’ records often exist due to valid decisions in the past that are now unviable in a strictly MARC-centric ILS with consortial cohabitation in the catalog.
It’s Dead Jim *
‘Idiosyncratic’ records and natural variety among MARC records hampered the
deduplication process during the original migrations and database merges.
* The slide title is a reference to Schrödinger’s cat: the MARC record has attained a single state now that it’s in use. If you don’t get it, that’s OK. I’m a geek and should get out more.
The Problem
The result is a messy database that is reflected in the catalog. Searching the OPAC felt more like an obscure, and maybe arcane, process than we were comfortable with.
Time for the Cleaning Gloves
In March 2009 we began discussing the issue with ESI. The low merging rate was due to the very precise and conservative fingerprinting of the dedupping process. In true open source spirit we decided to roll our own solution and start cleaning up the database.
A Disclaimer
The dedupping as it was originally performed was not incorrect or wrong in any way. It put a strong emphasis on avoiding wrong or imprecise (edition) matches, which are almost inevitable with looser fingerprinting. We decided that we had different priorities and were willing to make compromises.
Project Goals
Improve Searching
Faster Holds Filling
(maybe) Reduce ICL costs
Scope of Dedupping
2,048,936 bib records
Shasta & Lynn worked
with the CatWoG.
Rogan joined to look at doing some modeling and translating the project into production.
On Changes
I watch the ripples change their size / But never leave the stream
-David Bowie, Changes
The practical challenges meant that a lot changed from the early discussion to
development. We weighted decisions heavily on the side of needing to have a significant and
practical impact.
Two Types of Match Points *
Limiting Match Points – these create a basis for matches and
exclude potential matches.
Additive Match Points – these are not required but create
additional matches.
* These are terms I use to differentiate between two kinds of logistical match points you have to make decisions about. I have no idea if anyone else uses similar terms for the same principles.
Modeling the Data part 1
Determining match points determines the scope of the record set you may create mergers from.
Due to the lack of uniformity in records, the choice of match points became extremely important. Adding a single extra limiting match point caused large percentage drops in possible matches, reducing the effectiveness of the project.
Tilting at Windmills
We refused to believe that dedupping must be done to minimal effect, with minimizing bad merges as the highest priority.
Many said we were a bit mad. Fortunately, we took it as a compliment.*
* Cervantes was actually reacting against what he saw as prevailing custom when he wrote Don Quixote and ended up with brilliant literature. He was also bitter and jealous but we’ll gloss over that part. We were hoping to be more like the first part.
Modeling the Data part 2
We agreed upon only two match points: title and ISBN.
This excluded a large number of records by requiring both a valid title and ISBN entry.
Records with ISBNs and Titles accounted for ~1,200,000 of the over 2 million bib records in the system.
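The limiting match points above can be sketched as a fingerprint built from each record's title and ISBN: records missing either field never enter a match group at all, which is exactly what excluded the other ~800,000 bibs. This is a minimal Python illustration of the idea, not the actual SCLENDS code; the dictionary record shape and sample values are assumptions.

```python
def fingerprint(record, limiting_fields):
    """Build a match key from the limiting fields.

    Records missing any limiting field are excluded from matching
    entirely -- that is what makes these fields "limiting".
    """
    values = [record.get(f) for f in limiting_fields]
    if not all(values):
        return None  # no basis for a match; record is excluded
    return tuple(values)

records = [
    {"id": 1, "title": "dune", "isbn": "9780441172719"},
    {"id": 2, "title": "dune", "isbn": "9780441172719"},
    {"id": 3, "title": "dune"},  # no ISBN -> excluded from matching
]

# Group record ids by their (title, isbn) fingerprint.
groups = {}
for rec in records:
    key = fingerprint(rec, ["title", "isbn"])
    if key is not None:
        groups.setdefault(key, []).append(rec["id"])

print(groups)  # {('dune', '9780441172719'): [1, 2]}
```

An additive match point would work the opposite way: absent, it costs nothing; present and equal, it joins two groups that the limiting fields alone kept apart.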
What Was Left Behind
Excluded records include many that do not have valid ISBNs, including those with SUDOC numbers, ISSNs, pre-cats, etc.
Also excluded were a significant number of potential matches that might have been matched using additive match points.
The Importance of Being Earnest
We were absolutely confident that we could not achieve a high level of matching with extra limiting match points.
We chose not to include additional merging (additive) match points because we could easily overreach.
We estimated based on modeling a conservative ~300,000 merges or about 15% of our ISBNs.
The Wisdom of Crowds
Conventional wisdom said that MARC could not be generalized despite the presence of
supposedly unique information in the records.
We were taking risks and were very aware of it, but the need to create a large impact on our database drove us to disregard friendly warnings.
An Imperfect World
We knew that we would miss things that could potentially be merged.
We knew that we would create some bad merges when there were bad
records.*
10% wrong to get it 90% done.
* GIGO = Garbage In, Garbage Out
Next Step … Normalization
With matching decided, we needed to normalize the data. This was done to copies of the production MARC records, which were then used to make lists.
Normalization is needed because of variability in how data was entered. It allows us to get the most possible matches from the data.
Normalization Details
We normalized case, punctuation, numbers, non-Roman characters, trailing and leading spaces, some GMDs entered as parts of titles, and redacted fields; converted 10-digit ISBNs to 13-digit form; and lots, lots more.
This was not done to permanent records but to copies used to make the lists.
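Two of the normalizations above can be sketched concretely: stripping case, punctuation, and a bracketed GMD from a title, and converting a 10-digit ISBN to its 13-digit form (prefix 978, recompute the check digit). This is an illustrative Python sketch, not the code SCLENDS ran; the exact set of transformations used in production was longer.

```python
import re
import string

def isbn10_to_13(isbn10):
    """Convert a 10-digit ISBN to its 13-digit form.

    Prefix the first nine digits with 978, then recompute the
    EAN-13 check digit (alternating weights 1 and 3).
    """
    core = "978" + isbn10[:9]
    total = sum(int(d) * (1 if i % 2 == 0 else 3)
                for i, d in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)

def normalize_title(title):
    """Lowercase, drop a bracketed GMD like '[videorecording]',
    strip punctuation, and collapse whitespace."""
    title = re.sub(r"\[.*?\]", "", title.lower())
    title = title.translate(str.maketrans("", "", string.punctuation))
    return " ".join(title.split())

print(normalize_title("Dune [videorecording] :"))  # dune
print(isbn10_to_13("0441172717"))                  # 9780441172719
```

Running both sides of a comparison through the same transformations is what lets "Dune [videorecording] :" and "DUNE" land on the same fingerprint.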
Weighting
Finally, we had to weight the matched records to determine which should be the record to keep. To do this, each bib record is given a score representing its quality.
The Weighting Criteria
We looked at the presence, length, and number of entries in the 003, 02X, 24X, 300, 260$b, 100, 010, 500s, 440, 490, 830s, 7XX, 9XX and 59X to manipulate, add to, subtract from, bludgeon, poke and eventually determine a 24-digit number that would represent the quality of a bib record. *
* While not complete, this is mostly accurate.
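The "24-digit number" idea can be sketched as packing per-field counts into one sortable integer, so that richer records score higher and fields earlier in the list dominate the comparison. The field list, ordering, and cap below are illustrative assumptions, not the actual SCLENDS weighting formula.

```python
# Illustrative priority-ordered MARC tags (not the real 24-digit recipe).
FIELDS = ["003", "020", "245", "300", "260", "100", "500", "700"]

def quality_score(record):
    """Pack per-field entry counts (capped at 9) into one integer.

    Because the digits are positional, a record strong in an
    earlier (higher-priority) field beats one strong in a later
    field, mimicking a multi-digit quality number.
    """
    digits = [min(len(record.get(tag, [])), 9) for tag in FIELDS]
    return int("".join(str(d) for d in digits))

bibs = {
    "a": {"245": ["Dune"], "020": ["9780441172719"], "300": ["412 p."]},
    "b": {"245": ["Dune"]},
}
master = max(bibs, key=lambda k: quality_score(bibs[k]))
print(master)  # a -- the fuller record wins
```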
The Merging
Once the weighting is done, the highest-scored record in each group (of records that should be the same item) is made the master record, the copies from the other bibs are moved to it, and those bibs are marked deleted. Holds move with the copies and can then be retargeted, allowing backlogged holds to fill.
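The merge step above can be sketched as: pick the highest-scoring bib in each duplicate group as master, move the other bibs' copies (holds travel with them) onto it, and mark the losers deleted. This is a minimal Python sketch of that flow; the record shapes and score values are assumptions, not Evergreen's actual data model.

```python
def merge_group(bibs, scores):
    """Merge one duplicate group in place.

    bibs:   {bib_id: {"copies": [...], "deleted": False}}
    scores: {bib_id: quality score from the weighting step}
    Returns the id of the surviving master bib.
    """
    master = max(bibs, key=scores.get)
    for bib_id, bib in bibs.items():
        if bib_id == master:
            continue
        # Copies (and the holds attached to them) move to the master.
        bibs[master]["copies"].extend(bib["copies"])
        bib["copies"] = []
        bib["deleted"] = True  # losing bib is marked deleted, not purged
    return master

group = {
    1: {"copies": ["c1"], "deleted": False},
    2: {"copies": ["c2", "c3"], "deleted": False},
}
m = merge_group(group, {1: 5, 2: 9})
print(m, group[m]["copies"])  # 2 ['c2', 'c3', 'c1']
```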
The Coding
We proceeded to contract with Equinox to have them develop the code and run it against our test environment (and eventually production). Galen Charlton was our primary contact in this and aside from excellent work also provided us wonderful feedback about additional criteria to include in the weighting and normalization.
Test Server
Once the code had been run on the test server, we took our new batches of records and broke them into 50,000-record chunks. We then gave those chunks to member libraries and had them do random samples for five days.
Fixed As We Went
Lynn quickly found a problem with 13-digit ISBNs normalizing as 10-digit ISBNs. We quickly identified many parts of DVD sets and some shared-title publications that would be issues. Galen was responsive and helped us compensate for these issues as they were discovered.
In Conclusion
We don’t know how many bad matches were formed but it was below our threshold, perhaps
a few hundred. We are still gathering that feedback.
We were able to purge 326,098 bib records or about 27% of our ISBN based collection.
Evaluation
The catalog is visibly cleaner.
The cost per bib record was 1.5 cents.
Absolutely successful.
Future
This dedupping system will improve further.
There are still problems that need to be cleaned up – some manually and some by
automation.
New libraries that join SCLENDS will use our new dedupping algorithm, not the old one.
Challenges
One, how do we go forward with more cleanup? Treat AV materials separately? We need to look at repackaging standards more.
Two, how do we prevent adding new errors to the system (which is happening)?
Questions?