Data Mining the Largest Library Database in the World
Roy Tennant, OCLC Research
Leveraging WorldCat
E U R O P E, M I D D L E E A S T & A F R I C A R E G I O N A L C O U N C I L
Worldcat.org/identities/
Algorithmically constructed from WorldCat records
Viaf.org
A Union database of authority records
The Responsible Party
Thom Hickey, Chief Scientist
OCLC Research
290+ million records
Language Coverage
Percentage of records for non-English materials
30 June 2012
60.2%
Total: 274 million
German: 36.5 million
French: 25.5 million
Spanish: 11.3 million
Italian: 4.7 million
Dutch: 4.3 million
Russian: 3.6 million
Latin: 3.5 million
Worldcat.org/identities/
(J.K. Rowling)
(Diana Gabaldon)
(Galileo)
Viaf.org
VIAF Participants
“Super” Authority File
Our Cataloging Future
“Moving from cataloging to catalinking”
Eric Miller, Zepheira
Some Lessons
• Widespread collaboration is essential
• Normalizing the data is essential
• Normalizing the data is complicated
• Everything is interrelated:
  – You can’t bring names together if titles don’t match
  – You can’t bring titles together if names don’t match
• Batch mode processing still rules (but we’re getting better and faster at it)
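To illustrate why names and titles depend on each other, here is a minimal sketch in Python. It is not the process OCLC actually runs. The record layout (`id`, `name`, `title`) and the functions `normalize` and `cluster` are hypothetical, invented for this example. It only groups records when both the normalized name and the normalized title agree.

```python
import re
import unicodedata
from collections import defaultdict

def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^\w\s]", " ", no_accents.lower())
    return " ".join(cleaned.split())

def cluster(records):
    """Group record ids whose normalized name AND title both match."""
    groups = defaultdict(list)
    for rec in records:
        key = (normalize(rec["name"]), normalize(rec["title"]))
        groups[key].append(rec["id"])
    return dict(groups)

# Hypothetical sample records, not taken from WorldCat.
records = [
    {"id": 1, "name": "Rowling, J. K.", "title": "Harry Potter and the Philosopher's Stone"},
    {"id": 2, "name": "ROWLING, J.K.", "title": "Harry Potter and the philosopher's stone."},
    {"id": 3, "name": "Galilei, Galileo", "title": "Sidereus Nuncius"},
]
clusters = cluster(records)
```

Records 1 and 2 end up in one group only because both fields agree after normalization. If either field is normalized badly, the key differs and the match is lost, which is the interdependence the slide describes.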
Conclusions
• Data mining isn’t just useful, it’s essential
• Extracting data from MARC that is useful in other contexts is possible, but will require sophisticated processing
• Only very large organizations (e.g., OCLC, national libraries) have the data and resources to do this work
• Thankfully, we are doing it, but there is much more to be done
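As a rough illustration of the extraction point, the Python sketch below pulls a name, life dates, and a title out of a simplified MARC-like record. The record structure (a list of tag and subfield-dictionary pairs) and the helpers `clean` and `extract` are assumptions made for this example, not a real MARC parser or an OCLC tool. The tags used are the standard MARC 21 ones: 100 for a personal-name main entry (`$a` name, `$d` dates) and 245 for the title statement (`$a` title).

```python
def clean(value):
    """Trim whitespace and trailing cataloging punctuation (, / : ;)."""
    return value.strip().rstrip(" ,/:;")

def extract(record):
    """Pull name, dates, and title from a simplified MARC-like record."""
    out = {}
    for tag, subfields in record:
        if tag == "100":    # main entry, personal name
            out["name"] = clean(subfields.get("a", ""))
            # Dates often end with a period; names may end with an
            # initial ("J. K."), so periods are stripped only here.
            out["dates"] = clean(subfields.get("d", "")).rstrip(".")
        elif tag == "245":  # title statement
            out["title"] = clean(subfields.get("a", ""))
    return out

# Hypothetical record, hand-written for illustration.
record = [
    ("100", {"a": "Galilei, Galileo,", "d": "1564-1642."}),
    ("245", {"a": "Sidereus nuncius /", "c": "Galileo Galilei."}),
]
fields = extract(record)
```

Even this toy case needs a field-specific rule for trailing periods. Real records add many more such exceptions, which hints at why the processing has to be sophisticated.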
Roy Tennant
[email protected]
@rtennant
roytennant.com