Warcbase: Building a Scalable Web Archiving Platform on Hadoop and HBase
Jimmy Lin, University of Maryland (@lintool)
IIPC 2015 General Assembly, Tuesday, April 28, 2015

Transcript of the slides:

  • Slide 1
  • Warcbase: Building a Scalable Web Archiving Platform on Hadoop and HBase Jimmy Lin University of Maryland @lintool IIPC 2015 General Assembly Tuesday, April 28, 2015
  • Slide 2
  • From the Ivory Tower... (Source: Wikipedia, All Souls College, Oxford)
  • Slide 3
  • ...to building sh*t that works (Source: Wikipedia, Factory)
  • Slide 4
  • I worked on analytics infrastructure to support data science and data products to surface relevant content to users
  • Slide 5
  • Gupta et al. WTF: The Who to Follow Service at Twitter. WWW 2013. Lin and Kolcz. Large-Scale Machine Learning at Twitter. SIGMOD 2012. Busch et al. Earlybird: Real-Time Search at Twitter. ICDE 2012. Mishne et al. Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture. SIGMOD 2013. Leibert et al. Automatic Management of Partitioned, Replicated Search Services. SoCC 2011. I worked on analytics infrastructure to support data science and data products to surface relevant content to users.
  • Slide 6
  • Source: https://www.flickr.com/photos/bongtongol/3491316758/
  • Slide 7
  • circa 2010: ~150 people total; ~60 Hadoop nodes; ~6 people use the analytics stack daily. circa 2012: ~1400 people total; 10s of Ks of Hadoop nodes across multiple DCs; 10s of PBs total Hadoop DW capacity; ~100 TB ingest daily; dozens of teams use Hadoop daily; 10s of Ks of Hadoop jobs daily.
  • Slide 8
  • And back! (Source: Wikipedia, All Souls College, Oxford)
  • Slide 9
  • Web archives are an important part of our cultural heritage, but they're underused. Source: http://images3.nick.com/nick-assets/shows/images/house-of-anubis/flipbooks/hidden-room/hidden-room-04.jpg
  • Slide 10
  • Why? Restrictive use regimes? But I don't think that's all.
  • Slide 11
  • Users can't do much with current web archives. Hard to develop tools for non-existent needs. Source: http://www.flickr.com/photos/cheryne/8417457803/
  • Slide 12
  • We need deep collaborations between users (e.g., archivists, journalists, historians, digital humanists, etc.) and tool builders (me and my colleagues). Goal: tools to support exploration and discovery in web archives, beyond browsing and beyond searching. Source: http://waterloocyclingclub.ca/wp-content/uploads/2013/05/Help-Wanted-Sign.jpg
  • Slide 13
  • What would a web archiving platform built on modern big data infrastructure look like? (Source: Google)
  • Slide 14
  • Desiderata, and some of what exists today: scalable storage of archived data (Petabox by the Internet Archive; NAS, SAN, etc. by others); efficient random access (OpenWayback, a monolithic Tomcat application); scalable processing and analytics (some work by the Internet Archive, Common Crawl, and others); scalable storage and access of derived data (ad hoc storage in flat-text WAT files).
  • Slide 15
  • Existing tools aren't adequate! Desiderata, mapped onto the Hadoop stack: scalable storage of archived data: HDFS; efficient random access: HBase; scalable processing and analytics: Hadoop; scalable storage and access of derived data: HBase.
  • Slide 16
  • [HDFS architecture diagram: the application's HDFS client asks the namenode, which holds the file namespace (e.g., /foo/bar, block 3df2), to map file name to block id and block id to block location, then requests (block id, byte range) from the datanodes and receives block data; datanodes store blocks on their local Linux file systems, and the namenode sends instructions to datanodes and receives datanode state.] Stores data blocks across commodity servers. Scales to 100s of PBs of data. Open-source implementation of the Google File System.
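To make the read path in the diagram concrete, here is a minimal sketch (my own illustration, not from the slides) of a client reading the file /foo/bar through the standard Hadoop FileSystem API; behind this call, the client resolves block locations via the namenode and streams block data directly from the datanodes.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);       // client-side handle; talks to the namenode
    Path path = new Path("/foo/bar");           // file from the diagram's namespace

    // open() consults the namenode for block locations; the returned stream
    // then reads block data directly from the datanodes holding the replicas.
    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(fs.open(path)))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```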
  • Slide 17
  • [MapReduce diagram: mappers transform input (key, value) pairs; the shuffle-and-sort stage aggregates values by key; reducers process each key's grouped values and emit results.] Suitable for batch processing on HDFS data. Open-source implementation of Google's framework.
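As an illustration of the map, shuffle-and-sort, and reduce stages sketched above, here is the canonical Hadoop word-count job in Java (a generic example, not code from the talk): mappers emit (word, 1) pairs, the framework groups them by key, and reducers sum the counts.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // map: emit (word, 1) for every token in the input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // reduce: the framework has already grouped values by key (shuffle and sort)
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));     // HDFS input path
    FileOutputFormat.setOutputPath(job, new Path(args[1]));   // HDFS output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```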
  • Slide 18
  • ~ Google's Bigtable: a collection of tables, each of which represents a sparse, distributed, persistent multidimensional sorted map. Source: Bonaldo "Big Table" by Alain Gilles.
  • Slide 19
  • HBase in a nutshell: (row key, column family, column qualifier, timestamp) → value. Row keys are lexicographically sorted; column families define locality groups. Client-side operations: gets, puts, deletes, range scans. Image source: Chang et al., OSDI 2006.
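A minimal sketch of those client-side operations using the HBase 1.x Java client; the table name, column family, and row keys below are made up for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("mytable"))) {  // hypothetical table

      // put: (row key, column family, column qualifier, timestamp) -> value
      Put put = new Put(Bytes.toBytes("row0042"));
      put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("q"),
          System.currentTimeMillis(), Bytes.toBytes("some value"));
      table.put(put);

      // get: fetch a single row by key
      Result row = table.get(new Get(Bytes.toBytes("row0042")));
      System.out.println(Bytes.toString(
          row.getValue(Bytes.toBytes("f"), Bytes.toBytes("q"))));

      // range scan: row keys are lexicographically sorted, so a
      // [startRow, stopRow) scan retrieves a contiguous slice of the table
      Scan scan = new Scan();
      scan.setStartRow(Bytes.toBytes("row0000"));
      scan.setStopRow(Bytes.toBytes("row1000"));
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
          System.out.println(Bytes.toString(r.getRow()));
        }
      }
    }
  }
}
```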
  • Slide 20
  • Warcbase: an open-source platform for managing web archives, built on Hadoop and HBase. http://warcbase.org/ (Source: Wikipedia, Archive)
  • Slide 21
  • [Architecture diagram: WARC data, ingestion, applications and services, processing and analytics.]
  • Slide 22
  • Scalability? "We got 99 problems but scalability ain't one." (after Jay-Z) The scalability of Warcbase is limited only by Hadoop/HBase; applications are lightweight clients.
  • Slide 23
  • [Architecture diagram, revisited: WARC data, ingestion, applications and services, processing and analytics (text analysis, link analysis, ...).]
  • Slide 24
  • Sample dataset: crawl of the 108th U.S. Congress. Monthly snapshots, January 2003 to January 2005. 1.15 TB of gzipped ARC files. 29 million captures of 7.8 million unique URLs; 23.8 million captures are HTML files. Hadoop/HBase cluster: 16 nodes, dual quad-core processors, 3 × 2 TB disks each.
  • Slide 25
  • OpenWayback + Warcbase integration. Implementation: OpenWayback frontend for rendering; storage management offloaded to HBase; transparently scales out with HBase.
  • Slide 26
  • Slide 27
  • Slide 28
  • Topic Model Explorer. Implementation: LDA on each temporal slice; adaptation of the Termite visualization.
  • Slide 29
  • Slide 30
  • Webgraph Explorer. Implementation: link extraction with Hadoop and site-level aggregation; computation of standard graph statistics; interactive visualization in d3. (A sketch of the extraction step follows below.)
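The slide does not show the extraction code itself, so here is a hedged, standalone sketch (my own illustration, not Warcbase's actual implementation) of pulling links out of one HTML page with Jsoup and rolling them up to site-level edges; inside a Hadoop job, this logic would sit in the mapper, with reducers summing the counts per edge.

```java
import java.net.URL;
import java.util.HashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SiteLinkExtractor {
  /** Extracts outgoing links from one page and counts them per site-level edge. */
  public static Map<String, Integer> siteLevelLinks(String html, String pageUrl)
      throws Exception {
    String sourceSite = new URL(pageUrl).getHost();
    Map<String, Integer> counts = new HashMap<>();
    Document doc = Jsoup.parse(html, pageUrl);          // baseUri resolves relative hrefs
    for (Element a : doc.select("a[href]")) {
      String target = a.attr("abs:href");               // absolute URL of the link target
      if (!target.startsWith("http")) {
        continue;                                       // skip mailto:, javascript:, etc.
      }
      String targetSite = new URL(target).getHost();
      String edge = sourceSite + " -> " + targetSite;   // aggregate to the site level
      counts.merge(edge, 1, Integer::sum);
    }
    return counts;
  }
}
```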
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • We need deep collaborations between users (e.g., archivists, journalists, historians, digital humanists, etc.) and tool builders (me and my colleagues).
  • Slide 35
  • Warcbase: http://warcbase.org/
  • Slide 36
  • Warcbase in a box. Wide Web scrape (2011): Internet Archive wide0002 crawl, sample of 100 WARC files (~100 GB). Warcbase running on the Mac Pro cylinder: ingestion in ~1 hour; extraction of the webgraph using Pig in ~55 minutes. Result: 17m links; .ca subset of 1.7m links. Visualization in Gephi.
  • Slide 37
  • Internet Archive, Wide Web Scrape from 2011
  • Slide 38
  • What's the big deal? Historians probably can't afford Hadoop clusters, but they can probably afford a Mac Pro. Visual graph analysis on longitudinal data; select subsets for further textual analysis; drill down to examine individual pages, all on your desktop! How will this change historical scholarship?
  • Slide 39
  • Warcbase: http://warcbase.org/ Bonus!
  • Slide 40
  • Raspberry Pi experiments. Columbia University's Human Rights Web Archive: 43 GB of crawls from 2008 (1.68 million records). Warcbase running on the Raspberry Pi (standalone mode): ingestion in ~27 hours (17.3 records/second). Random browsing (average over 100 pages): 2.1s page load; the same pages in the Internet Archive: 2.9s page load. Draws 2.4 watts. Jimmy Lin. Scaling Down Distributed Infrastructure on Wimpy Machines for Personal Web Archiving. Temporal Web Analytics Workshop 2015.
  • Slide 41
  • What's the big deal? Store every page you've ever visited in your pocket! Throw in search, lightweight analytics, and more. What will you do with the web in your pocket? How will this change how you interact with the web? When was the last time you searched to re-find something?
  • Slide 42
  • We need deep collaborations between users (e.g., archivists, journalists, historians, digital humanists, etc.) and tool builders (me and my colleagues). Warcbase is just the first step. Goal: tools to support exploration and discovery.
  • Slide 43
  • Questions? (Source: Wikipedia, Hammer)
  • Slide 44
  • Bigtable use case: storing web crawls! Row key: domain-reversed URL. Raw data: column family "contents"; column qualifier: (empty); value: raw HTML. Derived data: column family "anchor"; column qualifier: source URL; value: anchor text.
  • Slide 45
  • Warcbase HBase data model. Row key: domain-reversed URL. Column family: "c". Column qualifier: MIME type. Timestamp: crawl date. Value: raw source.
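To make that mapping concrete, here is a minimal sketch (my own illustration with hypothetical helper names, not the actual Warcbase ingestion code) of building an HBase Put for a single capture under this data model: the row key is the domain-reversed URL, the family is "c", the qualifier is the MIME type, the timestamp encodes the crawl date, and the value is the raw record.

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class CapturePut {
  private static final byte[] FAMILY = Bytes.toBytes("c");

  /** Reverses "www.example.org/index.html" into "org.example.www/index.html". */
  public static String reverseDomain(String url) {
    int slash = url.indexOf('/');
    String host = slash >= 0 ? url.substring(0, slash) : url;
    String rest = slash >= 0 ? url.substring(slash) : "";
    String[] parts = host.split("\\.");
    StringBuilder sb = new StringBuilder();
    for (int i = parts.length - 1; i >= 0; i--) {
      sb.append(parts[i]);
      if (i > 0) sb.append('.');
    }
    return sb.append(rest).toString();
  }

  /**
   * Builds a Put for one capture. The crawl date is passed as a long HBase
   * timestamp (how it is encoded, e.g. as yyyyMMddHHmmss, is an assumption here).
   */
  public static Put toPut(String url, String mimeType, long crawlDate, byte[] rawBytes) {
    Put put = new Put(Bytes.toBytes(reverseDomain(url)));   // row key: domain-reversed URL
    put.addColumn(FAMILY, Bytes.toBytes(mimeType),          // qualifier: MIME type
        crawlDate,                                          // timestamp: crawl date
        rawBytes);                                          // value: raw source
    return put;
  }
}
```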