Big Data: Indexing ~50Tb of URIs
-
Upload
robert-sanderson -
Category
Technology
-
view
681 -
download
3
description
Transcript of Big Data: Indexing ~50Tb of URIs
1 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Towards Seamless Navigation of the Web of the Past
Using DISC to Reindex ~50Tb of URIs
Robert Sanderson [email protected] @azaroth42
Herbert Van de Sompel
[email protected] @hvdsomp
http://www.mementoweb.org/
2 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Memento Project Overview Participants The Plan … … and the Reality Technical Details Next Run
Summary
3 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Memento Project
Provide web citizens seamless access
to the web of the past Some accolades: • $1M funding from the Library of Congress • Winner of 2010 International Digital Preservation Award • Accepted by International Internet Preservation Consortium
as the way forwards for access to web archives • Nominated for Best Paper at JCDL 2010 • Best Poster at JCDL 2012 • Tim Berners-Lee @ LDOW10: “We absolutely need this” • Internet Draft 6 (final call) by May
“ ”
4 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Old Copies have New Names Everyone knows the URL for CNN is: http://www.cnn.com/ Everyone knows the URL for CNN was: http://www.cnn.com/
5 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Old Copies have New Names Copies of the past versions of to http://www.cnn.com/ exist … but you don’t go there to get them, they have new URLs … and there’s no way to automatically discover those new URLs
http://web.archive.org/web/20031013111028/http://www2.cnn.com/
http://web.archive.org/web/20031013111028/http://www2.cnn.com/
6 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Discovery is Hard People want to find, not search … and especially not searching by hand!
http://web.archive.org/web/*/http://www2.cnn.com/
7 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Navigating in the Past Links to the real resource bring you back out of the past … or trap you in a single incomplete archive.
http://web.archive.org/web/*/http://www2.cnn.com/
Dec 20 2001, 4:51:00 UTC current
Pentagon
8 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Memento: Time Travel for the Web TimeGate: Resource that knows the locations of archived copies Memento: An archived copy of previous state of a web resource
Link header to TimeGate
302 redirect to Memento
How does the TimeGate know where the appropriate Memento is?
9 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Project Overview
Goal: • To aggregate the metadata of the distributed archives
of the IIPC, and • To provide Memento based access to the holdings
of open archives • To provide knowledge of the holdings of restricted
archives • To provide knowledge to IIPC members of the
holdings of totally closed archives
10 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Experiment Participants
• Austrian National Library • Bibliothèque Nationale de France • British Library • Institut National de l'Audiovisuel • Internet Archive • Koninklijke Bibliotheek • Library of Congress • Netarchive.dk • Swiss National Library • University of North Texas
• Los Alamos National Laboratory • Old Dominion University
11 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Data
kosovakosovo.com/photo.php?id=5785
20090608161553
http://www.kosovakosovo.com/photo.php?id=5785
text/html
200
M36MRHSBVPLKMUN6PFOIEV3AH5ADITAN
-
563096
AIT-1068-20090608161511.warc.gz
Canonical URL
Datestamp
Request URL
MIME type HTTP Status Checksum
Redirect?
Bytes
Storage File
Multiplied by 6Tb compressed, ~50Tb uncompressed
12 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
The Plan…
• To provide fast access to distributed archives, LANL would merge the indexes of the holdings of multiple archives and provide Memento based access
• Step 1: Library of Congress gathers CDX files Step 2: LANL indexes (a.k.a. “…”) Step 3: Profit
• Data: 6T of gzipped CDX files (mostly from IA)
• Shipped on hard drives • Computing: 210 node DISC cluster at LANL
• 2x 2ghz processors, 2x 2T HDD, 8G RAM
13 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
… and the Reality
• Hardware failure killed one of the drives en route • Transferred remaining files via BagIt from LoC
• DISC has restricted access: • Had to transfer data over intranet • 2 weeks to sync (5Mb/sec) • And then 2 weeks to get the processed results off
• Compute cluster has faulty switch, unreliable nodes: • Ran original processing 15 times without success
due to hardware failures
14 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Processing Design
• For each CDX file, • For each URI + timestamps,
• Map URI to an appropriate database slice • Merge timestamps with those of previous CDXs
• Possible because: • No need to do truncated search • No need to walk through URIs in order • No need for time based access, only URI
• Problem is “Embarrassingly Parallel”
15 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Approach 1: Online Messaging
• 25 read nodes, 1 control node, 150 write nodes • Messages (1000 URIs) sent via control node to write
• Failed 15 times due to hardware issues
16 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Approach 1: Implementation
• Python
• MPI L • PyMPI LL • No failure detection, no auto restart, no zombie killing,
no … LLL • Several months of experiments to find issues, fine tune
parameters most likely to complete…
17 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Approach 2: No Interaction
• 43 read/split nodes • Phase 1: Read nodes split CDX files to 3000 slices
18 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Approach 2: No Interaction
• Phase 2a: Transfer CDX slices to Control node • Phase 2b: Transfer CDX slices to Write nodes
19 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Approach 2: No Interaction
• 50 write nodes (* 60 slices each = 3000 slices) • Phase 3: Merge slices from nodes to BerkeleyDBs
20 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Next Steps
• Re-index, using non-interactive approach
• New data: 14 Tb (~120Tb uncompressed?) • Use FileMap for better automation?
• Two eSata 8T LaCie cubes, 2 3T internal drives to install locally
• More intelligent data partitioning • Some sort of error detection and handling!
21 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Memento
http://mementoweb.org/
Robert Sanderson [email protected]
@azaroth42
Herbert Van de Sompel [email protected]
@hvdsomp