Metadata Extraction & Web Archives: Automating the Record Creation Process
Metadata Extraction & Web Archives:
Automating the Record Creation Process
Abbie Grotke / [email protected] Jones / [email protected]
Library of Congress
Office of Strategic Initiatives
Web Capture Team
Library of Congress Web Archives
• Since 2000, 20+ thematic, event-based collections
• 100 TB+ of data collected
• 12,500+ URLs
http://www.loc.gov/lcwa
Web Archiving Tools
• Crawling:
– Heritrix
– WARC
• Access:
– Wayback Machine
– NutchWAX
International Internet Preservation Consortium
netpreserve.org
LC’s Web Archive Workflow
• Identify & select URLs (LS or LAW)
• Determine crawl strategy, create a seed list for crawling (OSI)
• Sites harvested by Internet Archive or in-house crawlers (OSI)
• Quality Review (OSI & curators)
• Create “catalogers list” (OSI) and XML MODS template (LS) for metadata extraction
Describing the Archives
• Collection-level MARC record in OPAC
• Item-level MODS records in LCWA
– One record per recommended URL for each distinct collection
• With so many thousands of URLs to process, how do we streamline the process?
XML MODS Template
Metadata Extraction
• For each URL that will be cataloged:
– Get archived web site metadata
– Combine with URL Nominations Database metadata
– If elections/campaign web site, metadata also pulled from our candidate Access database (used to create subject terms)
• Using XML template, we add collection and record level metadata
• Create a single file for delivery
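The steps above could be sketched roughly as follows. This is only an illustration, not the Library's actual tooling: the function names (`build_record`, `build_collection`, `write_collection`) and the dictionary keys are hypothetical, and the sketch builds records directly rather than filling in the LS-supplied template. The element names (`modsCollection`, `titleInfo`, `abstract`, `language`, `accessCondition`, `subject`, `location`) come from the MODS schema.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def q(tag):
    """Qualify a tag name with the MODS namespace."""
    return "{%s}%s" % (MODS_NS, tag)


def build_record(url, archived, nomination, candidate_terms=None):
    """Merge the per-URL sources into one item-level MODS record.

    archived        -- metadata taken from the archived site (hypothetical keys)
    nomination      -- row from the URL nominations database (hypothetical keys)
    candidate_terms -- extra subject terms for election/campaign sites
    """
    record = ET.Element(q("mods"))
    title_info = ET.SubElement(record, q("titleInfo"))
    # Fall back to the URL when the capture had no usable title.
    ET.SubElement(title_info, q("title")).text = archived.get("title") or url
    if archived.get("abstract"):
        ET.SubElement(record, q("abstract")).text = archived["abstract"]
    for lang in nomination.get("languages", []):
        language = ET.SubElement(record, q("language"))
        ET.SubElement(language, q("languageTerm")).text = lang
    if nomination.get("access_rights"):
        ET.SubElement(record, q("accessCondition")).text = nomination["access_rights"]
    # Subject terms from the nominations database, then candidate-derived ones.
    terms = list(nomination.get("subjects", [])) + list(candidate_terms or [])
    for term in terms:
        subject = ET.SubElement(record, q("subject"))
        ET.SubElement(subject, q("topic")).text = term
    location = ET.SubElement(record, q("location"))
    ET.SubElement(location, q("url")).text = url
    return record


def build_collection(rows):
    """Wrap all records in a single modsCollection element."""
    collection = ET.Element(q("modsCollection"))
    for row in rows:
        collection.append(build_record(**row))
    return collection


def write_collection(collection, path):
    """Write every record to one file for delivery."""
    ET.ElementTree(collection).write(path, encoding="utf-8", xml_declaration=True)
```

Each row passed to `build_collection` is a dictionary with `url`, `archived`, and `nomination` keys, plus an optional `candidate_terms` list for campaign sites.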
Data Sources for Metadata Extraction
URL Nominations Database
• URL
• Access Rights
• Language(s)
• Category
• Subject Terms
Election Candidate Metadata
• Name
• URL
• Party Affiliation
• State
• Race
• District (House)
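As a rough illustration of how the candidate fields listed above might be turned into subject terms, here is a minimal sketch. The `Candidate` class, the `subject_terms` function, and the wording of the generated terms are all hypothetical; the slides do not say what form the real subject terms take.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """One row of candidate metadata (field names are assumptions)."""
    name: str
    url: str
    party: str
    state: str
    race: str
    district: Optional[str] = None  # only meaningful for House races


def subject_terms(candidate: Candidate) -> List[str]:
    """Derive a flat list of subject terms from one candidate row."""
    terms = [candidate.name, candidate.party, candidate.state, candidate.race]
    # District is recorded for House races only.
    if candidate.race.lower() == "house" and candidate.district:
        terms.append("%s District %s" % (candidate.state, candidate.district))
    # Drop empty fields so blank database cells do not become terms.
    return [t for t in terms if t]
```

For a House candidate the list ends with a state-and-district term; for any other race the district is ignored.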
Archived Web Site Metadata
From 1st capture:
• Document Title
• Keywords
• Abstract
• Mime Types
From Wayback index:
• Capture Dates (First & Last)
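A minimal sketch of pulling these fields, using only the Python standard library. It assumes the abstract comes from the page's `description` meta tag and that capture dates are available as 14-digit Wayback-style timestamps (YYYYMMDDhhmmss). The names `HeadMetadataParser`, `extract_page_metadata`, and `capture_range` are illustrative only; mime types are not covered here.

```python
from datetime import datetime
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect the <title> text and selected <meta> tags from a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attr.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_page_metadata(html):
    """Return title, keywords, and abstract from the first capture's HTML."""
    parser = HeadMetadataParser()
    parser.feed(html)
    keywords = parser.meta.get("keywords", "")
    return {
        "title": parser.title.strip(),
        "keywords": [k.strip() for k in keywords.split(",") if k.strip()],
        # Assumption: the meta description stands in for the abstract.
        "abstract": parser.meta.get("description", "").strip(),
    }


def capture_range(timestamps):
    """Return (first, last) capture dates as ISO strings."""
    dates = sorted(datetime.strptime(ts, "%Y%m%d%H%M%S") for ts in timestamps)
    return dates[0].date().isoformat(), dates[-1].date().isoformat()
```

The two results can then be combined per URL before the record is filled in.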
Combined Data in Template