Analyzing Hadoop with Hadoop

Post on 07-Jul-2015

description

Talk that I gave at Berlin Buzzwords 2012. It argues that Hive doesn't fit in the Hadoop No-SQL environment and shows some examples of the information we were able to extract from the Hadoop user mailing list and git logs.

Transcript of Analyzing Hadoop with Hadoop

Analyzing Hadoop with Hadoop

Monday, June 4, 2012

© sg@datameer.com, confidential - Do not distribute

Data Grows Faster Than Moore's Law!

Unstructured: 61.7% growth

Structured: 21.8% growth

http://www.emc.com/about/news/press/2011/20110628-01.htm

Data Warehouse: Static, ETL, Slow, Business Intelligence, Barrier

Hadoop: Dynamic, Raw Load, Fast, Analytics, Agile

30+ Years Workflow

SQL

Hadoop + Hive

NO-SQL Hadoop 10+MLOC

http://dearcomputer.nl/gir/?q=nerd+&s=4&b=Rip+Google!

http://thepage.time.com/2009/04/18/why-is-this-elephant-crying/

Evolution backward

http://chelseavose.wordpress.com/2012/01/26/is-evolution-real/

1970s: SEQUEL (Structured English Query Language)

ANSI SQL, ORM, JDO, NO-SQL, Hive

Unstructured + Structured

git log --numstat --pretty=format:%H,%ai,%cn,%ce%+B
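The output of that command can be reduced to the per-year figures shown on the later slides. What follows is a minimal sketch in plain Python, not the Hadoop-based analysis from the talk. It assumes each commit starts with a header line of the form `hash,author date,committer name,committer email` (which is what `%H,%ai,%cn,%ce` produces), followed by the commit message and then tab-separated `--numstat` lines. The function name `summarize` is hypothetical, and a commit message line that happens to look like a numstat line would be miscounted.

```python
import re
from collections import Counter

# Header line: 40-character hex hash, then the author date (%ai), e.g. "2012-06-04 10:00:00 +0200".
HEADER = re.compile(r"^([0-9a-f]{40}),(\d{4})-\d{2}-\d{2} ")
# --numstat line: added<TAB>deleted<TAB>path; binary files report "-" for both counts.
NUMSTAT = re.compile(r"^(\d+|-)\t(\d+|-)\t(.+)$")


def summarize(log_text):
    """Return (commits per year, changed lines per year) as two Counters."""
    commits = Counter()
    changed = Counter()
    year = None
    for line in log_text.splitlines():
        header = HEADER.match(line)
        if header:
            year = int(header.group(2))
            commits[year] += 1
            continue
        stat = NUMSTAT.match(line)
        if stat and year is not None:
            added, deleted = stat.group(1), stat.group(2)
            if added != "-":  # skip binary files, which have no line counts
                changed[year] += int(added) + int(deleted)
    return commits, changed


# Made-up sample input in the assumed format (not real Hadoop history).
sample = (
    "a" * 40 + ",2006-02-01 10:00:00 -0800,Jane Doe,jane@example.org\n"
    "Initial import\n"
    "\n"
    "10\t2\tsrc/Foo.java\n"
    "-\t-\tlogo.png\n"
    "\n"
    + "b" * 40 + ",2011-05-03 09:00:00 +0200,John Roe,john@example.org\n"
    "Fix bug\n"
    "\n"
    "5\t5\tsrc/Bar.java\n"
)
commits_per_year, loc_per_year = summarize(sample)
```

Only the hash and the year are read from the header line, so committer names that contain commas do not break the parse.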

Data Quality?

Results...

Commits per Year

200

LOC Changes per Year

7,000,000

Most Lines Added

1,500,000

2006 Emails vs Commits

72

Legend: commits, emails

2011 Emails vs Commits

Legend: commits, emails

559

Emails per Month

800

Most Discussed, Least Changed

Most Active Emailers

900

We’re hiring!

Emails with Most Replies

Longest Comment

35,000

Email Activity per Timezone
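One way such a breakdown could be computed is by bucketing each message by the UTC offset in its `Date` header. This is a hedged sketch in plain Python, not the method used in the talk. The function name `offset_histogram` is hypothetical, the input is assumed to be a list of raw `Date` header strings, and a UTC offset is only a rough proxy for a timezone.

```python
from collections import Counter
from email.utils import parsedate_to_datetime


def offset_histogram(date_headers):
    """Count messages per UTC offset (in hours) found in their Date headers."""
    counts = Counter()
    for raw in date_headers:
        try:
            sent = parsedate_to_datetime(raw)
        except (TypeError, ValueError):
            continue  # unparseable header: skip rather than guess
        if sent.tzinfo is None:
            continue  # no usable offset information in the header
        hours = sent.utcoffset().total_seconds() / 3600
        counts[hours] += 1
    return counts


# Made-up headers for illustration (not taken from the mailing list).
headers = [
    "Mon, 04 Jun 2012 10:00:00 +0200",
    "Mon, 04 Jun 2012 01:00:00 -0700",
    "Tue, 05 Jun 2012 09:30:00 +0200",
    "garbage",
]
by_offset = offset_histogram(headers)
```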

Follow us: @datameer
