Big Tools for Big Data

Big Tools for Big Data: Analytics and Management at web scale. IIPC General Assembly, Singapore, May 2010. Lewis Crawford, Web Archiving Programme Technical Lead, British Library.

Transcript of Big Tools for Big Data

Page 1: Big Tools for Big Data

Big Tools for Big Data: Analytics and Management at web scale

IIPC General Assembly, Singapore, May 2010

Lewis Crawford

Web Archiving Programme Technical Lead, British Library

Page 2: Big Tools for Big Data

Big Data: "the Petabyte age"

The Internet Archive stores about 2 petabytes of data and grows by 20TB a month

The Large Hadron Collider produces 15PB a year

At the BL:

The Selective Web Archive is growing at 200GB a month

A conservative estimate for the Domain Crawl is 100TB

Page 3: Big Tools for Big Data

The problem of big data

We can process data very quickly, but we can read and write it only slowly.

1990: a 1GB disk read at 4.4MB/s, so the whole disk could be read in about 5 minutes

2010: a 1TB disk reads at 100MB/s, so the whole disk takes about 2.5 hours

Page 4: Big Tools for Big Data

The solution!

Solution: parallel reads

1 HDD = 100MB/sec; 1,000 HDDs = 100GB/sec
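The arithmetic behind these figures is easy to check. A quick sketch, using decimal units and assuming sustained transfer rates (so the results differ slightly from the slide's rounded numbers):

```python
def read_time_seconds(capacity_gb: float, rate_mb_s: float) -> float:
    """Time to stream an entire disk at a sustained transfer rate."""
    return capacity_gb * 1000 / rate_mb_s  # decimal units: 1 GB = 1000 MB

# 1990: a 1 GB disk at 4.4 MB/s -> a few minutes
t_1990 = read_time_seconds(1, 4.4)

# 2010: a 1 TB disk at 100 MB/s -> a few hours
t_2010 = read_time_seconds(1000, 100)

# Parallel reads: 1,000 disks at 100 MB/s each give 100 GB/s in aggregate,
# so a terabyte can be scanned in seconds rather than hours
aggregate_gb_s = 1000 * 100 / 1000

print(f"1990 disk: {t_1990 / 60:.1f} minutes")
print(f"2010 disk: {t_2010 / 3600:.1f} hours")
print(f"1,000 disks: {aggregate_gb_s:.0f} GB/s")
```

The point is that single-disk bandwidth grew far more slowly than capacity, so the only way to keep scan times flat is to spread the data over many spindles and read them in parallel.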

Page 5: Big Tools for Big Data

Hadoop

2002: Nutch crawler (Doug Cutting)

2003: Google publishes the GFS paper http://labs.google.com/papers/gfs.html

2004: Google publishes the MapReduce paper http://labs.google.com/papers/mapreduce.html

2005: Nutch moves to the MapReduce model with NDFS

2006: NDFS and the MapReduce implementation become Hadoop, a subproject under Lucene

2008: Top-level project at Apache

2009: 17 clusters with 24,000 nodes at Yahoo!; 1TB sorted in 62 seconds, 100TB sorted in 173 minutes

Page 6: Big Tools for Big Data


Hadoop Users

Yahoo!

More than 100,000 CPUs in >25,000 computers running Hadoop

Our biggest cluster: 4000 nodes (2*4cpu boxes w 4*1TB disk & 16GB RAM) Used to support research for Ad Systems and Web Search Also used to do scaling tests to support development of Hadoop on larger clusters

Baidu - the leading Chinese language search engine

Hadoop used to analyze the log of search and do some mining work on web page database We handle about 3000TB per week Our clusters vary from 10 to 500 nodes

Facebook

Use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.

Currently we have 2 major clusters: A 1100-machine cluster with 8800 cores and about 12 PB raw storage. A 300-machine cluster with 2400 cores and about 3 PB raw storage. Each (commodity) node has 8 cores and 12 TB of storage.

http://wiki.apache.org/hadoop/PoweredBy

Page 7: Big Tools for Big Data

NutchWAX!

Page 8: Big Tools for Big Data


Hadoop@BL

Page 9: Big Tools for Big Data

IBM Digital Democracy for the BBC


Page 10: Big Tools for Big Data

BigSheets!

Page 11: Big Tools for Big Data

BigSheets and the open source stack

Top-level Apache project

Yahoo!-contributed open source

IBM Research licence

Insight engine, spreadsheet paradigm

SQL-like programming language

Distributed processing and file system
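The distributed processing layer in this stack is Hadoop's MapReduce model. A minimal local sketch of that model, with a single process standing in for the map, shuffle, and reduce phases a real cluster distributes across nodes:

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in a line of input."""
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    """Reduce phase: sum the counts emitted for one word."""
    return word, sum(counts)

def run_job(lines):
    """Run the classic word-count job locally."""
    grouped = defaultdict(list)  # shuffle phase: group values by key
    for line in lines:
        for word, n in mapper(line):
            grouped[word].append(n)
    return dict(reducer(w, c) for w, c in grouped.items())

counts = run_job(["big tools for big data", "big data at web scale"])
```

The same mapper and reducer, written against the real Hadoop APIs, would run unchanged over terabytes; the framework handles the distribution.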

Page 12: Big Tools for Big Data

Analytics: the meta tag example

Extract metadata tags from all HTML files in the 2005 General Election Collection

Extract 'keywords' from the meta tags

Record all HTML pages into three separate 'bags' where the keywords contained:

  Tory, Conservative
  Labour
  Liberal, Lib Dem, Liberal Democrat

Analyse single words and pairs of words in each of those 'bags' of data

Generate tag clouds from the 50 most common words
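The first steps above can be sketched with Python's stdlib HTML parser. The sample page and term lists here are hypothetical; the real job ran over the 2005 General Election Collection on Hadoop:

```python
from html.parser import HTMLParser

class KeywordExtractor(HTMLParser):
    """Collect the content of <meta name="keywords" ...> tags."""
    def __init__(self):
        super().__init__()
        self.keywords = []
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "keywords":
            self.keywords += [k.strip().lower() for k in a.get("content", "").split(",")]

# Hypothetical term lists for the three 'bags'
PARTY_TERMS = {
    "conservative": {"tory", "conservative"},
    "labour": {"labour"},
    "libdem": {"liberal", "lib dem", "liberal democrat"},
}

def bag_page(html):
    """Return the set of 'bags' whose terms appear in a page's meta keywords."""
    parser = KeywordExtractor()
    parser.feed(html)
    found = set(parser.keywords)
    return {bag for bag, terms in PARTY_TERMS.items() if terms & found}

page = '<html><head><meta name="keywords" content="Labour, election, 2005"></head></html>'
```

Counting single words and pairs within each bag and taking the 50 most common is then a straightforward aggregation of the kind shown in the word-count example.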

Page 13: Big Tools for Big Data

Data management


Page 14: Big Tools for Big Data

robots.txt example


Page 15: Big Tools for Big Data

Robots.txt continued…

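The robots.txt slides themselves are screenshots not captured in this transcript. As a stand-in illustration, Python's stdlib parser applied to a hypothetical policy of the kind a crawler must honour:

```python
import urllib.robotparser

# A hypothetical robots.txt: everything is crawlable except /private/
rules = """
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# The crawler checks each candidate URL against the policy before fetching
allowed = rp.can_fetch("*", "http://example.org/index.html")
blocked = rp.can_fetch("*", "http://example.org/private/report.html")
```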

Page 16: Big Tools for Big Data


Page 17: Big Tools for Big Data

Data management

High-level management tool: spreadsheet paradigm

Clean user interface

Straightforward programming model (UDFs)

Use cases: ARC to WARC migration; information package (SIP) generation; CDX indexes / Lucene indexes; JHOVE object validation and verification; object format migration
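The BigSheets API itself is not shown on these slides; the following is only a minimal sketch of the spreadsheet-with-UDFs idea: a 'sheet' of records plus a per-row user-defined function that derives a new column, here a crude format check of the kind a JHOVE-style validation step might feed:

```python
def apply_udf(sheet, column, udf):
    """Return a new sheet with `column` computed by `udf` for every row."""
    return [{**row, column: udf(row)} for row in sheet]

def looks_like_html(row):
    # Hypothetical UDF: trust a record only if the declared MIME type
    # and the start of the payload agree
    payload = row.get("payload", "").lstrip().lower()
    return row.get("mime") == "text/html" and payload.startswith(("<!doctype", "<html"))

# Hypothetical sample records of the kind a web archive holds
sheet = [
    {"url": "http://example.org/", "mime": "text/html",
     "payload": "<html><body>ok</body></html>"},
    {"url": "http://example.org/logo", "mime": "text/html",
     "payload": "\x89PNG..."},
]
checked = apply_udf(sheet, "valid_html", looks_like_html)
```

The appeal of the model is that a curator works with rows and columns while the platform runs each UDF over millions of records in parallel.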

Page 18: Big Tools for Big Data

Slash page crawl: election sites extraction

Slash pages (home pages) of known UK domains; data discarded after processing

Generate a list of election terms (political parties, MORI election tags)

Extract text from HTML pages using an HTML tag density algorithm

Identify all web pages that contain these terms

Identify sites that contain two or more of the terms
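The term-matching steps above can be sketched as follows. The term list and pages are hypothetical; the real job matched political party names and MORI election tags against text extracted from the slash pages:

```python
# Hypothetical election term list (the real list came from party names
# and MORI election tags)
ELECTION_TERMS = {"election", "labour", "conservative", "manifesto", "ballot"}

def matched_terms(text):
    """Terms from the list that occur in a page's extracted text."""
    return ELECTION_TERMS & set(text.lower().split())

def candidate_sites(pages):
    """Sites whose slash page contains two or more election terms."""
    return {url for url, text in pages.items() if len(matched_terms(text)) >= 2}

pages = {
    "http://party.example.org/": "read our manifesto for the 2005 election",
    "http://shop.example.org/": "buy shoes online",
}
```

Requiring two or more distinct terms is what keeps incidental single-word matches out of the candidate seed list before manual verification.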

Page 19: Big Tools for Big Data

Slash Page Data


Page 20: Big Tools for Big Data

Text Extracted Using Tag Density Algorithm


Page 21: Big Tools for Big Data

Election Key Terms


Page 22: Big Tools for Big Data

Results


Page 23: Big Tools for Big Data

Pie Chart Visualization


Page 24: Big Tools for Big Data

Seeds With 2 Or More Terms


Page 25: Big Tools for Big Data

Manual Verification


Page 26: Big Tools for Big Data

Other potential digital material

Digital Books

Datasets

19th Century Newspapers

Page 27: Big Tools for Big Data

Back to analytics and the next-generation access tools

Automatic classification: WebDewey, LoC Subject Headings, machine learning

Faceted Lucene indexes for advanced search functionality

Engage directly with the Higher Education community

Access tool with a researcher focus? BL three-year Research Behaviour Study

Page 28: Big Tools for Big Data

Thank you!

[email protected]

http://uk.linkedin.com/in/lewiscrawford

LinkedIn (from the same Hadoop PoweredBy page): 3×30 Nehalem-based node grids, with 2×4 cores, 16GB RAM, 8×1TB storage using ZFS in a JBOD configuration. Hadoop and Pig for discovering People You May Know and other fun facts.