Cloudera Intro Hadoop 111116 Slides

30
11/16/2011, Stanford EE380 Computer Systems Colloquium Introducing Apache Hadoop: The Modern Data Operating System Dr. Amr Awadallah | Founder, CTO, VP of Engineering [email protected], twitter: @awadallah

Transcript of Cloudera Intro Hadoop 111116 Slides

Page 1: Cloudera Intro Hadoop 111116 Slides

11/16/2011, Stanford EE380 Computer Systems Colloquium

Introducing Apache Hadoop:The Modern Data Operating System

Dr. Amr Awadallah | Founder, CTO, VP of [email protected], twitter: @awadallah

Page 2: Cloudera Intro Hadoop 111116 Slides

©2011 Cloudera, Inc. All Rights Reserved.

2

Storage Only Grid (original raw data)

Instrumentation

Collection

RDBMS (aggregated data)

BI Reports + Interactive Apps

Mostly Append

ETL Compute Grid

Moving Data ToCompute Doesn’t Scale

Can’t Explore OriginalHigh Fidelity Raw Data

Archiving =PrematureData Death

Limitations of Existing Data Analytics Architecture

Page 3: Cloudera Intro Hadoop 111116 Slides

So What is Apache Hadoop ?

• A scalable fault-tolerant distributed system fordata storage and processing (open sourceunder the Apache license).

• Core Hadoop has two main systems:

– Hadoop Distributed File System: self-healinghigh-bandwidth clustered storage.

– MapReduce: distributed fault-tolerant resourcemanagement and scheduling coupled with ascalable data programming abstraction.

©2011 Cloudera, Inc. All Rights Reserved.

3

Page 4: Cloudera Intro Hadoop 111116 Slides

The Key Benefit: Agility/Flexibility

©2011 Cloudera, Inc. All Rights Reserved.

4

Schema-on-Read (Hadoop):Schema-on-Write (RDBMS):

• Schema must be created beforeany data can be loaded.

• An explicit load operation has totake place which transformsdata to DB internal structure.

• New columns must be addedexplicitly before new data forsuch columns can be loadedinto the database.

• Read is Fast

• Standards/Governance

• Data is simply copied to the filestore, no transformation is needed.

• A SerDe (Serializer/Deserlizer) isapplied during read time to extractthe required columns (late binding)

• New data can start flowing anytimeand will appear retroactively oncethe SerDe is updated to parse it.

• Load is Fast

• Flexibility/AgilityProsPros

Page 5: Cloudera Intro Hadoop 111116 Slides

Innovation: Explore Original Raw Data

©2011 Cloudera, Inc. All Rights Reserved.

5

Data Committee Data Scientist

Page 6: Cloudera Intro Hadoop 111116 Slides

Flexibility: Complex Data Processing

1. Java MapReduce: Most flexibility and performance, but tediousdevelopment cycle (the assembly language of Hadoop).

2. Streaming MapReduce (aka Pipes): Allows you to develop inany programming language of your choice, but slightly lowerperformance and less flexibility than native Java MapReduce.

3. Crunch: A library for multi-stage MapReduce pipelines in Java(modeled After Google’s FlumeJava)

4. Pig Latin: A high-level language out of Yahoo, suitable for batchdata flow workloads.

5. Hive: A SQL interpreter out of Facebook, also includes a meta-store mapping files to their schemas and associated SerDes.

6. Oozie: A PDL XML workflow engine that enables creating aworkflow of jobs composed of any of the above.

6©2011 Cloudera, Inc. All Rights Reserved.

Page 7: Cloudera Intro Hadoop 111116 Slides

Scalability: Scalable Software Development

©2011 Cloudera, Inc. All Rights Reserved.

7

Grows without requiring developers tore-architect their algorithms/application.

AUTO SCALE

Page 8: Cloudera Intro Hadoop 111116 Slides

Scalability: Data Beats Algorithm

©2011 Cloudera, Inc. All Rights Reserved.

8

Smarter Algos More Data

A. Halevy et al, “The Unreasonable Effectiveness of Data”, IEEE Intelligent Systems, March 2009

Page 9: Cloudera Intro Hadoop 111116 Slides

Scalability: Keep All Data Alive Forever

©2011 Cloudera, Inc. All Rights Reserved.

9

Archive to Tape andNever See It Again

Extract Value FromAll Your Data

Page 10: Cloudera Intro Hadoop 111116 Slides

Use The Right Tool For The Right Job

©2011 Cloudera, Inc. All Rights Reserved.

10

Relational Databases: Hadoop:

Use when:

• Structured or Not (Flexibility)

• Scalability of Storage/Compute

• Complex Data Processing

Use when:

• Interactive OLAP Analytics (<1sec)

• Multistep ACID Transactions

• 100% SQL Compliance

Page 11: Cloudera Intro Hadoop 111116 Slides

HDFS: Hadoop Distributed File System

©2011 Cloudera, Inc. All Rights Reserved.

11

Block Replicas are distributedacross servers and racks.

Block Replication for:• Durability• Availability• Throughput

Optimized for:• Throughput• Put/Get/Delete• Appends

A given file is broken down into blocks(default=64MB), then blocks arereplicated across cluster (default=3).

Page 12: Cloudera Intro Hadoop 111116 Slides

MapReduce: Computational Framework

©2011 Cloudera, Inc. All Rights Reserved.

12

Split 1

Split i

Split N

Map 1(docid, text)

(docid, text) Map i

(docid, text) Map M

Reduce 1

OutputFile 1(sorted words,

sum of counts)

Reduce i

OutputFile i(sorted words,

sum of counts)

Reduce R

OutputFile R(sorted words,

sum of counts)

(words, counts)(sorted words, counts)

Map(in_key, in_value) => list of (out_key, intermediate_value) Reduce(out_key, list of intermediate_values) => out_value(s)

Shuffle

(words, counts) (sorted words, counts)

“To BeOr NotTo Be?”

Be, 5

Be, 12

Be, 7Be, 6

Be, 30

cat *.txt | mapper.pl | sort | reducer.pl > out.txt

Page 13: Cloudera Intro Hadoop 111116 Slides

MapReduce: Resource Manager / Scheduler

©2011 Cloudera, Inc. All Rights Reserved.

13

A given job is broken down into tasks,then tasks are scheduled to be asclose to data as possible.

Three levels of data locality:• Same server as data (local disk)• Same rack as data (rack/leaf switch)• Wherever there is a free slot (cross rack)

System detects laggard tasks andspeculatively executes parallel taskson the same slice of data.

Optimized for:• Batch Processing• Failure Recovery

Page 14: Cloudera Intro Hadoop 111116 Slides

But Networks Are Faster Than Disks!

©2011 Cloudera, Inc. All Rights Reserved.

14

Yes, however, core and disk density per serverare going up very quickly:

• 1 Hard Disk = 100MB/sec (~1Gbps)

• Server = 12 Hard Disks = 1.2GB/sec (~12Gbps)

• Rack = 20 Servers = 24GB/sec (~240Gbps)

• Avg. Cluster = 6 Racks = 144GB/sec (~1.4Tbps)

• Large Cluster = 200 Racks = 4.8TB/sec (~48Tbps)

• Scanning 4.8TB at 100MB/sec takes 13 hours.

Page 15: Cloudera Intro Hadoop 111116 Slides

Hadoop High-Level Architecture

©2011 Cloudera, Inc. All Rights Reserved.

15

Name NodeMaintains mapping of file names

to blocks to data node slaves.

Job TrackerTracks resources and schedulesjobs across task tracker slaves.

Data NodeStores and serves

blocks of data

Hadoop ClientContacts Name Node for dataor Job Tracker to submit jobs

Task TrackerRuns tasks (work units)

within a job

Share Physical Node

Page 16: Cloudera Intro Hadoop 111116 Slides

Changes for Better Availability/Scalability

©2011 Cloudera, Inc. All Rights Reserved.

16

Data NodeStores and serves

blocks of data

Hadoop ClientContacts Name Node for dataor Job Tracker to submit jobs

Task TrackerRuns tasks (work units)

within a job

Share Physical Node

Name Node Job Tracker

Federation partitionsout the name space,High Availability viaan Active Standby.

Each job has its ownApplication Manager,Resource Manager isdecoupled from MR.

Page 17: Cloudera Intro Hadoop 111116 Slides

CDH: Cloudera’s Distribution Including Apache Hadoop

Coordination

DataIntegration

FastRead/Write

Access

Languages / Compilers

Workflow Scheduling Metadata

APACHE ZOOKEEPER

APACHE FLUME,APACHE SQOOP APACHE HBASE

APACHE PIG, APACHE HIVE

APACHE OOZIE APACHE OOZIE APACHE HIVE

File System Mount UI Framework/SDK Data MiningFUSE-DFS HUE APACHE MAHOUT

©2011 Cloudera, Inc. All Rights Reserved.

17

Bu

ild

/Te

st:

AP

AC

HE

BIG

TO

P

SCM Express (Installation Wizard for CDH)

Page 18: Cloudera Intro Hadoop 111116 Slides

Books

©2011 Cloudera, Inc. All Rights Reserved.

18

Page 19: Cloudera Intro Hadoop 111116 Slides

Conclusion

• The Key Benefits of Apache Hadoop:

– Agility/Flexibility (Quickest Time to Insight).

– Complex Data Processing (Any Language, Any Problem).

– Scalability of Storage/Compute (Freedom to Grow).

– Economical Storage (Keep All Your Data Alive Forever).

• The Key Systems for Apache Hadoop are:

– Hadoop Distributed File System: self-healing high-bandwidth clustered storage.

– MapReduce: distributed fault-tolerant resourcemanagement coupled with scalable data processing.

19©2011 Cloudera, Inc. All Rights Reserved.

Page 20: Cloudera Intro Hadoop 111116 Slides

Appendix

©2011 Cloudera, Inc. All Rights Reserved.

20

BACKUP SLIDES

Page 21: Cloudera Intro Hadoop 111116 Slides

Unstructured Data is Exploding

©2011 Cloudera, Inc. All Rights Reserved.

21

Source: IDC White Paper - sponsored by EMC.As the Economy Contracts, the Digital Universe Expands. May 2009.

• 2,500 exabytes of new information in 2012 with Internet as primary driver

• Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2“zettabytes” this year

Relational

Complex, Unstructured

Page 22: Cloudera Intro Hadoop 111116 Slides

Hadoop Creation History

©2011 Cloudera, Inc. All Rights Reserved.

22

• Fastest sort of a TB,62secs over 1,460 nodes• Sorted a PB in 16.25hoursover 3,658 nodes

Page 23: Cloudera Intro Hadoop 111116 Slides

Hadoop in the Enterprise Data Stack

Logs Files Web Data

EnterpriseData

Warehouse

WebApplication

EnterpriseReporting

BI, Analytics

Analysts Business Users

Customers

IDEs

Data Scientists

RelationalDatabases

Low-LatencyServingSystems

ClouderaMgmt Suite

SystemOperators

Development Tools

ET

LT

oo

ls

Business Intelligence Tools

DataArchitects

©2011 Cloudera, Inc. All Rights Reserved.

23

Flume Flume Flume Sqoop

Sqoop

ODBC, JDBC,NFS, Native

Page 24: Cloudera Intro Hadoop 111116 Slides

Main idea is to split up the JobTracker functions:• Cluster resource management (for tracking and

allocating nodes)• Application life-cycle management (for MapReduce

scheduling and execution)

MapReduce Next Gen

©2011 Cloudera, Inc. All Rights Reserved.

24

Enables:• High Availability• Better Scalability• Efficient Slot Allocation• Rolling Upgrades• Non-MapReduce Apps

Page 25: Cloudera Intro Hadoop 111116 Slides

Two Core Use Cases Common Across Many Industries

©2011 Cloudera, Inc. All Rights Reserved.

25

AD

VA

NC

ED

AN

ALY

TIC

S

DA

TA

PR

OC

ES

SIN

G

Social Network Analysis

Content Optimization

Network Analytics

Loyalty & Promotions

Fraud Analysis

Entity Analysis

Clickstream Sessionization

Clickstream Sessionization

Mediation

Data Factory

Trade Reconciliation

SIGINT

Application ApplicationIndustry

Web

Media

Telco

Retail

Financial

Federal

Bioinformatics Genome MappingSequencing Analysis

Use CaseUse Case

ManufacturingProduct Quality Mfg Process Tracking

Page 26: Cloudera Intro Hadoop 111116 Slides

What is Cloudera Enterprise?

©2011 Cloudera, Inc. All Rights Reserved.

26

Simplify and Accelerate Hadoop Deployment

Reduce Adoption Costs and Risks

Lower the Cost of Administration

Increase the Transparency & Control of Hadoop

Leverage the Experience of Our Experts

Cloudera Enterprise makes opensource Apache Hadoop enterprise-easy

EFFECTIVENESS

Ensuring Repeatable Value fromApache Hadoop Deployments

EFFICIENCY

Enabling Apache Hadoop to beAffordably Run in Production

ClouderaManagement

Suite

ComprehensiveToolset for Hadoop

Administration

Production-Level Support

Our Team of ExpertsOn-Call to Help You

Meet Your SLAs

CLOUDERA ENTERPRISE COMPONENTS

3 of the top 5 telecommunications, mobile services, defense &intelligence, banking, media and retail organizations depend on Cloudera

Page 27: Cloudera Intro Hadoop 111116 Slides

Hive vs Pig Latin (count distinct values > 0)

• Hive Syntax:

SELECT COUNT(DISTINCT col1)

FROM mytable

WHERE col1 > 0;

• Pig Latin Syntax:

mytable = LOAD ‘myfile’ AS (col1, col2, col3);

mytable = FOREACH mytable GENERATE col1;

mytable = FILTER mytable BY col1 > 0;

mytable = DISTINCT col1;

mytable = GROUP mytable BY col1;

mytable = FOREACH mytable GENERATE COUNT(mytable);

DUMP mytable;

©2011 Cloudera, Inc. All Rights Reserved.

27

Page 28: Cloudera Intro Hadoop 111116 Slides

Apache Hive Key Features

• A subset of SQL covering the most common statements

• JDBC/ODBC support

• Agile data types: Array, Map, Struct, and JSON objects

• Pluggable SerDe system to work on unstructured files directly

• User Defined Functions and Aggregates

• Regular Expression support

• MapReduce support

• Partitions and Buckets (for performance optimization)

• Microstrategy/Tableau Compatibility (through ODBC)

• In The Works: Indices, Columnar Storage, Views, Explode/Collect

• More details: http://wiki.apache.org/hadoop/Hive

©2011 Cloudera, Inc. All Rights Reserved.

28

Page 29: Cloudera Intro Hadoop 111116 Slides

Hive Agile Data Types

• STRUCTS:– SELECT mytable.mycolumn.myfield FROM …

• MAPS (Hashes):– SELECT mytable.mycolumn[mykey] FROM …

• ARRAYS:– SELECT mytable.mycolumn[5] FROM …

• JSON:– SELECT get_json_object(mycolumn, objpath) FROM …

©2011 Cloudera, Inc. All Rights Reserved.

29

Page 30: Cloudera Intro Hadoop 111116 Slides