Large Scale Data Clean-ups & Challenges
for the Library
Ksenija Mincic-Obradovic
Asia Pacific Metadata Advisory Board Meeting, 3-4 August 2014, Pattaya, Thailand
“Data cleaning is considered as a main challenge in the era of big data, due to the increasing volume, velocity and variety of data.”
(Tang, 2014)
Data cleaning process:
• Identifying data errors
• Repairing data errors
• Preventing data errors
In libraries, we might have to clean up data to:
• Remove ceased e-titles
• Update changed URLs
• Enable DDA/PDA purchasing
• Perform gap analysis
• Enable system migrations
• Enable system integrations
• Improve display in the ILS
Main types of mistakes in e-book records
• MARC21 errors
  – E.g.: coding, wrong indicators, wrong characters…
  – Consequence: wrong indexing, records rejected…
• Wrong identifiers
  – 001, 010, 020/022, 035, 856 $z
  – Consequence: wrong matching, duplicates…
• Mistakes in description fields
  – E.g.: wrong title, wrong author
  – Consequence: bad display, faceting doesn't work…
• Lack of URLs
  – Consequence: e-book cannot be accessed
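Some of these mistakes can be caught with simple automated checks before a file is loaded. The following is a minimal sketch only: it assumes each record has already been parsed into a plain Python dictionary mapping a MARC tag to a list of values (a simplified, hypothetical representation, not a real MARC parser), and the function names are illustrative.

```python
import re


def isbn_is_valid(value):
    """Check the ISBN-10 or ISBN-13 check digit of an 020 value.

    Only the first whitespace-separated token is examined, so a trailing
    qualifier such as "(print)" is ignored.
    """
    parts = value.split()
    if not parts:
        return False
    isbn = parts[0].replace("-", "").upper()
    if re.fullmatch(r"\d{13}", isbn):
        # ISBN-13: weights alternate 1, 3, 1, 3, ...; the sum must be divisible by 10.
        total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(isbn))
        return total % 10 == 0
    if re.fullmatch(r"\d{9}[\dX]", isbn):
        # ISBN-10: weights run from 10 down to 1; "X" stands for 10.
        total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(isbn))
        return total % 11 == 0
    return False


def find_problems(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    if not record.get("001"):
        problems.append("missing 001 control number")
    for isbn in record.get("020", []):
        if not isbn_is_valid(isbn):
            problems.append(f"invalid ISBN in 020: {isbn}")
    if not record.get("856"):
        problems.append("no 856 field, so the e-book cannot be accessed")
    return problems


# Hypothetical record with a bad ISBN and no URL.
sample = {"001": ["ebk0001"], "020": ["9780306406158"]}
for problem in find_problems(sample):
    print(problem)
```

Checks like these only flag records for review; deciding how to repair them still needs a cataloguer or a dedicated tool.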
Example 1: Fixing MARC21 errors in vendor/publisher files of e-book records
• Use programmes such as MARC Report and MarcEdit to identify errors
• Use MARC Global and MarcEdit to fix the data
• Load the file into the local catalogue
Example 2: Updating the NUC (National Union Catalogue)
• New Zealand national-level project
• Started in 2008
• Automated way of reporting changes to the library holdings (additions and deletions) to the NUC
• Using OSMOSIS, a software tool developed by TMQ (Fla)
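The idea of reporting additions and deletions can be illustrated with a comparison of two snapshots of holdings. This is a sketch of the general principle only, not a description of how OSMOSIS works; the function name and the sample identifiers are hypothetical.

```python
def holdings_changes(previous_ids, current_ids):
    """Compare two snapshots of record identifiers held by the library.

    Returns the identifiers to report as added and as deleted.
    """
    previous = set(previous_ids)
    current = set(current_ids)
    return {
        "additions": sorted(current - previous),  # held now, not held before
        "deletions": sorted(previous - current),  # held before, no longer held
    }


# Hypothetical snapshots taken before and after a vendor file update.
last_report = ["ebk0001", "ebk0002", "ebk0003"]
this_report = ["ebk0002", "ebk0003", "ebk0004"]
changes = holdings_changes(last_report, this_report)
print(changes["additions"])
print(changes["deletions"])
```

The comparison is only as reliable as the identifiers it uses, which is why clean identifiers matter for this kind of project.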
Identifiers for Matching Bibliographic Data
• 001 - Control Number
• 010 - Library of Congress Number
• 020/022 - ISBN/ISSN
• 035 - System Control Number
• 856 $z e-book SpringerLink
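Matching an incoming record against the catalogue on these identifiers might be sketched as follows. The assumptions are loud ones: records are plain dictionaries mapping a MARC tag to a list of values, the order in which tags are tried is an arbitrary choice for illustration, and the normalisation (dropping spaces and hyphens, lower-casing) is deliberately crude.

```python
# Tags tried in this order; the order is an assumption, not a standard.
MATCH_TAGS = ("001", "010", "020", "022", "035")


def normalize(value):
    """Strip whitespace and hyphens and lower-case an identifier."""
    return "".join(value.split()).replace("-", "").lower()


def build_index(catalogue):
    """Map (tag, normalized identifier) to positions of catalogue records."""
    index = {}
    for position, record in enumerate(catalogue):
        for tag in MATCH_TAGS:
            for value in record.get(tag, []):
                index.setdefault((tag, normalize(value)), []).append(position)
    return index


def match_record(incoming, index):
    """Return (tag, positions) for the first identifier that matches, else None.

    More than one position means the catalogue holds possible duplicates.
    """
    for tag in MATCH_TAGS:
        for value in incoming.get(tag, []):
            hits = index.get((tag, normalize(value)))
            if hits:
                return tag, hits
    return None


# Hypothetical catalogue and incoming vendor record.
catalogue = [
    {"001": ["loc0001"], "020": ["978-0-306-40615-7"]},
    {"001": ["loc0002"], "020": ["0306406152"]},
]
index = build_index(catalogue)
incoming = {"001": ["vendor-77"], "020": ["9780306406157"]}
print(match_record(incoming, index))
```

A wrong or missing identifier in either record leads directly to the consequences listed earlier: a missed match produces a duplicate, and a false match overlays the wrong record.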
OSMOSIS Report (11/2014)
020 (ISBN)
Recommendations
• Check and clean data in vendor files before loading to your catalogue.
• Follow national and international standards in all aspects.
• Perform regular database maintenance.
• Encourage cooperation between libraries and vendors/publishers.
References
Beall, J. (2005). 10 ways to improve data quality: With a coordinated effort, your library can make significant progress in cleaning up its online catalog. American Libraries, 36(3), 36+. Retrieved from http://go.galegroup.com.ezproxy.auckland.ac.nz/ps/i.do?id=GALE%7CA139719467&v=2.1&u=learn&it=r&p=AONE&sw=w&asid=8bc9b1a0d979542543f18fc581b25da2
Rahm, E. (2004). Data cleaning: Problems and current approaches. In Galindo, F., Takizawa, M., & Traunmuller, R. (2004). Database and expert systems applications: 15th International Conference, DEXA 2004, Zaragoza, Spain, August 30-September 3, 2004: Proceedings (Lecture notes in computer science; 3180). Berlin; New York: Springer.
Tang, N. (2014). Big data cleaning. In Chen, L. (2014). Web technologies and applications: 16th Asia-Pacific Web Conference, APWeb 2014, Changsha, China, September 5-7, 2014: Proceedings (Lecture notes in computer science; 8709).
Image credits
• http://www.bluewolfconsulting.co.uk/blog/data-doesn-t-have-be-dirty-four-letter-word
• https://www.flickr.com/photos/epublicist/8718123610
• http://www.dreamstime.com/