Eric Baldeschwieler Keynote from Storage Developers Conference

© Hortonworks, Inc. All Rights Reserved. Apache Hadoop Today & Tomorrow Eric Baldeschwieler, CEO Hortonworks, Inc. twitter: @jeric14 (@hortonworks) www.hortonworks.com

description

Eric Baldeschwieler's keynote presentation from Storage Developer's Conference 2011. Eric is co-founder and CEO of Hortonworks.

Transcript of Eric Baldeschwieler Keynote from Storage Developers Conference

Page 1: Eric Baldeschwieler Keynote from Storage Developers Conference

© Hortonworks, Inc. All Rights Reserved.

Apache Hadoop Today & Tomorrow

Eric Baldeschwieler, CEO
Hortonworks, Inc.

twitter: @jeric14 (@hortonworks)
www.hortonworks.com

Page 2: Eric Baldeschwieler Keynote from Storage Developers Conference

Agenda

Brief Overview of Apache Hadoop

Where Apache Hadoop is Used

Apache Hadoop Core: Hadoop Distributed File System (HDFS), Map/Reduce

Where Apache Hadoop Is Going

Q&A

Page 3: Eric Baldeschwieler Keynote from Storage Developers Conference

What is Apache Hadoop?

A set of open source projects owned by the Apache Foundation that transforms commodity computers and networks into a distributed service
• HDFS – Stores petabytes of data reliably
• Map-Reduce – Allows huge distributed computations

Key Attributes
• Reliable and Redundant – Doesn’t slow down or lose data even as hardware fails
• Simple and Flexible APIs – Our rocket scientists use it directly!
• Very powerful – Harnesses huge clusters, supports best of breed analytics
• Batch processing centric – Hence its great simplicity and speed, not a fit for all use cases

Page 4: Eric Baldeschwieler Keynote from Storage Developers Conference

What is it used for?

Internet scale data
• Web logs – Years of logs at many TB/day
• Web Search – All the web pages on earth
• Social data – All message traffic on Facebook

Cutting edge analytics
• Machine learning, data mining…

Enterprise apps
• Network instrumentation, Mobile logs
• Video and Audio processing
• Text mining

And lots more!

Page 5: Eric Baldeschwieler Keynote from Storage Developers Conference

Apache Hadoop Projects

[Diagram: stack of Apache Hadoop projects. Layers: Programming Languages, Computation, Table Storage, Object Storage. Legend: Core Apache Hadoop, Related Apache Projects]
• HDFS (Hadoop Distributed File System)
• MapReduce (Distributed Programming Framework)
• Hive (SQL)
• Pig (Data Flow)
• HBase (Columnar Storage)
• HCatalog (Meta Data)
• Zookeeper (Coordination)
• HMS (Management)

Page 6: Eric Baldeschwieler Keynote from Storage Developers Conference


Where Hadoop is Used

Page 7: Eric Baldeschwieler Keynote from Storage Developers Conference

Everywhere!

[Timeline graphic: 2006, 2007, 2008, 2009, 2010; credit line: The Datagraph Blog]

Page 8: Eric Baldeschwieler Keynote from Storage Developers Conference


HADOOP @ YAHOO!

40K+ Servers

170 PB Storage

5M+ Monthly Jobs

1000+ Active users

© Yahoo 2011

Page 9: Eric Baldeschwieler Keynote from Storage Developers Conference

CASE STUDY
YAHOO! HOMEPAGE

Personalized for each visitor

Result: twice the engagement

+160% clicks vs. one size fits all

+79% clicks vs. randomly selected

+43% clicks vs. editor selected

Recommended links, News Interests, Top Searches

© Yahoo 2011

Page 10: Eric Baldeschwieler Keynote from Storage Developers Conference

CASE STUDY
YAHOO! HOMEPAGE

• Serving Maps
• Users - Interests

• Five Minute Production

• Weekly Categorization models

[Diagram labels: SCIENCE HADOOP CLUSTER, PRODUCTION HADOOP CLUSTER, SERVING SYSTEMS, USER BEHAVIOR, ENGAGED USERS, CATEGORIZATION MODELS (weekly), SERVING MAPS (every 5 minutes)]

» Identify user interests using Categorization models

» Machine learning to build ever better categorization models

Build customized home pages with latest data (thousands / second)

© Yahoo 2011

Page 11: Eric Baldeschwieler Keynote from Storage Developers Conference

CASE STUDY
YAHOO! MAIL

Enabling quick response in the spam arms race

• 450M mail boxes
• 5B+ deliveries/day
• Antispam models retrained every few hours on Hadoop

“40% less spam than Hotmail and 55% less spam than Gmail”

[Diagram labels: SCIENCE, PRODUCTION]

© Yahoo 2011

Page 12: Eric Baldeschwieler Keynote from Storage Developers Conference

Apache Hadoop
A Brief History

2006 – present: [logo], early adopters. Scale and productize Hadoop

2008 – present: Other Internet Companies. Add tools / frameworks, enhance Hadoop

2010 – present: Service Providers. Provide training, support, hosting

Nascent / 2011: Wide Enterprise Adoption. Funds further development, enhancements

Page 13: Eric Baldeschwieler Keynote from Storage Developers Conference

Traditional Enterprise Architecture
Data Silos + ETL

[Diagram labels: Serving Applications; Unstructured Systems (Serving Logs, Social Media, Sensor Data, Text Systems); Traditional ETL & Message buses; Traditional Data Warehouses, BI & Analytics (EDW, Data Marts, BI / Analytics)]

Page 14: Eric Baldeschwieler Keynote from Storage Developers Conference

Hadoop Enterprise Architecture
Connecting All of Your Big Data

[Diagram labels: Serving Applications; Unstructured Systems (Serving Logs, Social Media, Sensor Data, Text Systems); Traditional ETL & Message buses; Traditional Data Warehouses, BI & Analytics (EDW, Data Marts, BI / Analytics); Apache Hadoop: EsTsL (s = Store), Custom Analytics]

Page 15: Eric Baldeschwieler Keynote from Storage Developers Conference

Hadoop Enterprise Architecture
Connecting All of Your Big Data

80-90% of data produced today is unstructured

Gartner predicts 800% data growth over next 5 years

[Diagram labels: Serving Applications; Unstructured Systems (Serving Logs, Social Media, Sensor Data, Text Systems); Traditional ETL & Message buses; Traditional Data Warehouses, BI & Analytics (EDW, Data Marts, BI / Analytics); Apache Hadoop: EsTsL (s = Store), Custom Analytics]

Page 16: Eric Baldeschwieler Keynote from Storage Developers Conference

What is Driving Adoption?

Business drivers
• Identified high value projects that require use of more data
• Belief that there is great ROI in mastering big data

Financial drivers
• Growing cost of data systems as proportion of IT spend
• Cost advantage of commodity hardware + open source
• Enables departmental-level big data strategies

Technical drivers
• Existing solutions failing under growing requirements
• 3Vs - Volume, velocity, variety
• Proliferation of unstructured data

Page 17: Eric Baldeschwieler Keynote from Storage Developers Conference

Big Data Platforms
Cost per TB, Adoption

Source:

Size of bubble = cost effectiveness of solution

Page 18: Eric Baldeschwieler Keynote from Storage Developers Conference


Apache Hadoop Core

Page 19: Eric Baldeschwieler Keynote from Storage Developers Conference

Overview

Frameworks share commodity hardware
• Storage - HDFS
• Processing - MapReduce

[Diagram: racks of 1-2U servers, each with a Rack Switch; links labeled 2 * 10GigE]
• 20-40 nodes / rack
• 16 Cores
• 48G RAM
• 6-12 * 2TB disk
• 1-2 GigE to node

Page 20: Eric Baldeschwieler Keynote from Storage Developers Conference

Map/Reduce

Map/Reduce is a distributed computing programming model

It works like a Unix pipeline:
• cat input | grep | sort | uniq -c > output
• Input | Map | Shuffle & Sort | Reduce | Output

Strengths:
• Easy to use! Developer just writes a couple of functions
• Moves compute to data: schedules work on HDFS node with data if possible
• Scans through data, reducing seeks
• Automatic reliability and re-execution on failure
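The pipeline analogy on this slide can be made concrete with a small single-process sketch. It is not Hadoop's API: the function names (`map_phase`, `shuffle_and_sort`, `reduce_phase`, `run_job`) and the sample mapper and reducer are hypothetical, and nothing here is distributed or fault tolerant. It only shows the Input | Map | Shuffle & Sort | Reduce | Output flow and the "couple of functions" a developer supplies.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the user's mapper to every input record; it yields (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle_and_sort(pairs):
    """Group values by key and order the keys (the step between map and reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(groups, reducer):
    """Apply the user's reducer once per key."""
    return [reducer(key, values) for key, values in groups]

def run_job(records, mapper, reducer):
    return reduce_phase(shuffle_and_sort(map_phase(records, mapper)), reducer)

# The two functions a developer writes (hypothetical example,
# roughly `grep error | sort | uniq -c`).
def grep_mapper(line):
    if "error" in line:
        yield (line, 1)

def count_reducer(key, values):
    return (key, sum(values))
```

Calling `run_job(lines, grep_mapper, count_reducer)` on a list of log lines returns one `(line, count)` pair per distinct matching line, in sorted key order.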

Page 21: Eric Baldeschwieler Keynote from Storage Developers Conference

HDFS: Scalable, Reliable, Manageable

Scale IO, Storage, CPU
• Add commodity servers & JBODs
• 4K nodes in cluster, 80

Fault Tolerant & Easy management
• Built in redundancy
• Tolerate disk and node failures
• Automatically manage addition/removal of nodes
• One operator per 8K nodes!!

Storage server used for computation
• Move computation to data

Not a SAN
• But high-bandwidth network access to data via Ethernet

Immutable file system
• Read, Write, sync/flush
• No random writes

[Diagram: Core Switches connected to rack Switches]

Page 22: Eric Baldeschwieler Keynote from Storage Developers Conference

HDFS Use Cases

Petabytes of unstructured data for parallel, distributed analytics processing using commodity hardware

Solve problems that cannot be solved using traditional systems at a cheaper price
• Large storage capacity (>100PB raw)
• Large IO/Computation bandwidth (>4K servers)
• >4 Terabit bandwidth to disk! (conservatively)

Scale by adding commodity hardware
• Cost per GB ~= $1.5, includes MapReduce cluster
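One way to read the bandwidth and cost figures on this slide is as simple multiplication. The sketch below assumes roughly 1 Gbit/s of sustained disk throughput per server, which is an assumption for illustration and not a number stated in the talk. The helper names are hypothetical.

```python
SERVERS = 4_000            # ">4K servers" from the slide
GBIT_PER_SERVER = 1.0      # ASSUMPTION: ~1 Gbit/s disk throughput per server
COST_PER_GB_USD = 1.5      # "Cost per GB ~= $1.5" from the slide

def aggregate_tbit_per_s(servers, gbit_per_server):
    """Aggregate disk bandwidth in Tbit/s if every server streams in parallel."""
    return servers * gbit_per_server / 1_000

def cost_per_pb_usd(cost_per_gb):
    """Cost of one petabyte, using decimal units (1 PB = 1,000,000 GB)."""
    return cost_per_gb * 1_000_000
```

Under that assumption, 4,000 servers give about 4 Tbit/s in aggregate, and $1.5 per GB corresponds to about $1.5M per PB.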

Page 23: Eric Baldeschwieler Keynote from Storage Developers Conference

HDFS Architecture

Namenode
• Namespace State: Hierarchical Namespace, File Name => BlockIDs
• Block Map: Block ID => Block Locations
• Persistent Namespace Metadata & Journal (NFS)

Datanodes
• Block ID => Data
• Heartbeats & Block Reports sent to the Namenode

Horizontally Scale IO and Storage

[Diagram: Namespace layer (Namenode) above Block Storage layer; four Datanodes with JBOD disks holding replicated blocks (b1 b5 b3), (b2 b3 b1), (b3 b5 b2), (b1 b5 b2)]
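The two Namenode maps named on this slide (file name to block IDs, block ID to locations) can be modeled in a few lines. This is a toy in-memory sketch, not HDFS code: the class `Namenode`, its methods, and the datanode names are hypothetical, and persistence, journaling and heartbeats are left out.

```python
from collections import defaultdict

class Namenode:
    """Toy model of the Namespace State and Block Map."""

    def __init__(self):
        self.namespace = {}                # file name -> ordered list of block IDs
        self.block_map = defaultdict(set)  # block ID -> set of datanode names

    def add_file(self, name, block_ids):
        self.namespace[name] = list(block_ids)

    def block_report(self, datanode, block_ids):
        """Record which blocks a datanode says it holds."""
        for block_id in block_ids:
            self.block_map[block_id].add(datanode)

    def locate(self, name):
        """Return [(block_id, sorted datanode names)] for a file."""
        return [(b, sorted(self.block_map[b])) for b in self.namespace[name]]
```

Feeding in block reports that match the diagram (for example a datanode holding b1, b5 and b3) and calling `locate` on a file returns, per block, the datanodes a client could read from.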

Page 24: Eric Baldeschwieler Keynote from Storage Developers Conference

Client Read & Write Directly from Closest Server

Read: Client (1) open at the Namenode, (2) read from a Datanode
Write: Client (1) create at the Namenode, (2) write to a Datanode

End-to-end checksum

Horizontally Scale IO and Storage

[Diagram: Namenode (Namespace State, Block Map) above four Datanodes with JBOD disks holding blocks (b1 b5 b3), (b2 b3 b1), (b3 b5 b2), (b1 b5 b2)]
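The "end-to-end checksum" idea can be sketched as follows: the writer computes a checksum that travels with the data, and the reader recomputes it and falls back to another replica on a mismatch. This is an illustration only. The helpers `write_block` and `read_block` are hypothetical, and one CRC32 per whole block is a simplification chosen for brevity, not a description of HDFS's on-disk format.

```python
import zlib

def write_block(data: bytes):
    """Writer side: ship the data together with its checksum."""
    return {"data": data, "crc": zlib.crc32(data)}

def read_block(replicas):
    """Reader side: try replicas in order (closest first) and
    return the first one whose checksum verifies."""
    for replica in replicas:
        if zlib.crc32(replica["data"]) == replica["crc"]:
            return replica["data"]
    raise IOError("no replica passed checksum verification")
```

Because the check happens in the client, corruption introduced anywhere between the original write and the final read (disk, network, or datanode) is detected.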

Page 25: Eric Baldeschwieler Keynote from Storage Developers Conference

Actively maintain data reliability

Periodically check block checksums

When a bad/lost block replica is found:
1. replicate
2. copy
3. blockReceived

[Diagram: Namenode (Namespace State, Block Map) above four Datanodes with JBOD disks holding blocks (b1 b5 b3), (b2 b3 b4), (b3 b5 b2), (b1 b5 b2)]
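The replicate / copy / blockReceived loop can be approximated by a planning function that finds under-replicated blocks and picks a source and destination for each missing copy. This is a sketch under stated assumptions: the function name `plan_replication` is hypothetical, the target of 3 replicas is an assumption not given on the slide, and a real placement policy would also consider racks and load.

```python
def plan_replication(block_map, live_nodes, target=3):
    """Return [(block, source, destination)] copies needed to get each
    block back to `target` live replicas.

    block_map:  block ID -> set of datanodes believed to hold it
    live_nodes: set of datanodes currently alive
    target:     desired replica count (ASSUMPTION: 3)
    """
    plan = []
    for block, holders in sorted(block_map.items()):
        alive = sorted(h for h in holders if h in live_nodes)
        missing = target - len(alive)
        if not alive or missing <= 0:
            continue  # nothing to copy from, or already fully replicated
        candidates = sorted(n for n in live_nodes if n not in alive)
        for dest in candidates[:missing]:
            plan.append((block, alive[0], dest))
    return plan
```

Each tuple in the plan corresponds to one "replicate" instruction; the copy and the blockReceived acknowledgement would follow on the datanodes.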

Page 26: Eric Baldeschwieler Keynote from Storage Developers Conference

HBase

Hadoop ecosystem Database, based on Google BigTable

Goal: Hosting of very large tables (billions of rows X millions of columns) on commodity hardware.
• Multidimensional sorted Map: Table => Row => Column => Version => Value
• Distributed column-oriented store
• Scale – Sharding etc. done automatically
• No SQL, CRUD etc.
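The "Table => Row => Column => Version => Value" model can be pictured as nested maps with sorted row keys. The class below is a hypothetical in-memory toy, not the HBase client API; it has no sharding or persistence and only illustrates versioned cells and range scans over sorted rows.

```python
from collections import defaultdict

class ToyTable:
    """Row => Column => Version => Value, with rows scanned in sorted order."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, version, value):
        self.rows[row][column][version] = value

    def get(self, row, column, version=None):
        """Return the value at `version`, or the newest version if none is given."""
        versions = self.rows.get(row, {}).get(column, {})
        if not versions:
            return None
        key = max(versions) if version is None else version
        return versions.get(key)

    def scan(self, start_row, stop_row):
        """Return row keys in sorted order within [start_row, stop_row)."""
        return [r for r in sorted(self.rows) if start_row <= r < stop_row]
```

Writing two versions of the same cell and reading it back without a version returns the newer value, while older versions stay addressable by their version number.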

Page 27: Eric Baldeschwieler Keynote from Storage Developers Conference


What’s Next

Page 28: Eric Baldeschwieler Keynote from Storage Developers Conference

About Hortonworks – Basics

Founded – July 1st, 2011
• 22 Architects & committers from Yahoo!

Mission – Architect the future of Big Data
• Revolutionize and commoditize the storage and processing of Big Data via open source

Vision – Half of the world's data will be stored in Hadoop within five years

Page 29: Eric Baldeschwieler Keynote from Storage Developers Conference

Game Plan

Support the growth of a huge Apache Hadoop ecosystem
• Invest in ease of use, management, and other enterprise features
• Define APIs for ISVs, OEMs and others to integrate with Apache Hadoop
• Continue to invest in advancing the Hadoop core, remain the experts
• Contribute all of our work to Apache

Profit by providing training & support to the Hadoop community

Page 30: Eric Baldeschwieler Keynote from Storage Developers Conference


Lines of Code Contributed to Apache Hadoop

Page 31: Eric Baldeschwieler Keynote from Storage Developers Conference

Apache Hadoop Roadmap

Phase 1 – Making Apache Hadoop Accessible (2011)
• Release the most stable version of Hadoop ever (Hadoop 0.20.205)
• Frequent sustaining releases
• Release directly usable code via Apache (RPMs, .debs…)
• Improve project integration (HBase support)

Phase 2 – Next-Generation Apache Hadoop (2012, Alphas in Q4 2011)
• Address key product gaps (HA, Management…)
• Enable partner innovation via open APIs
• Enable community innovation via modular architecture

Page 32: Eric Baldeschwieler Keynote from Storage Developers Conference

Next-Generation Hadoop

Core
• HDFS Federation – Scale out and innovation via new APIs
  Will run on 6000 node clusters with 24TB disk / node = 144PB in next release
• Next Gen MapReduce – Support for MPI and many other programming models
• HA (no SPOF) and Wire compatibility

Data - HCatalog 0.3
• Pig, Hive, MapReduce and Streaming as clients
• HDFS and HBase as storage systems
• Performance and storage improvements

Management & Ease of use
• Ambari – An Apache Hadoop Management & Monitoring System
• Stack installation and centralized config management
• REST and GUI for user & administrator tasks
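The 144PB figure on this slide is raw capacity, and the arithmetic can be checked directly. The sketch below also shows what that might mean for usable space under 3x replication. The replication factor is an assumption for illustration and is not stated on the slide, and the function names are hypothetical.

```python
def raw_capacity_pb(nodes, tb_per_node):
    """Raw cluster capacity in PB, using decimal units (1 PB = 1,000 TB)."""
    return nodes * tb_per_node / 1_000

def usable_capacity_pb(raw_pb, replication=3):
    """Capacity left for unique data when every block is stored
    `replication` times (ASSUMPTION: 3 copies)."""
    return raw_pb / replication
```

With 6000 nodes at 24TB each, `raw_capacity_pb` gives 144 PB, matching the slide. Under the assumed 3 copies per block, that would hold about 48 PB of unique data.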

Page 33: Eric Baldeschwieler Keynote from Storage Developers Conference

Thank You!

Questions

Twitter: @jeric14 (@hortonworks)
www.hortonworks.com