Data Science Day New York: The Platform for Big Data


The Platform for Big Data

Amr Awadallah | CTO, Founder, Cloudera, [email protected], twitter: @awadallah

©2012 Cloudera, Inc. All Rights Reserved.

The Problems with Current Data Systems

[Diagram: Instrumentation → Collection → Storage Only Grid (original raw data) → ETL Compute Grid → RDBMS (aggregated data) → BI Reports + Interactive Apps. Label: Mostly Append]

1. Moving Data To Compute Doesn’t Scale

2. Archiving = Premature Data Death

3. Can’t Explore Original High Fidelity Raw Data


The Solution: A Combined Storage/Compute Layer

[Diagram: Instrumentation → Collection → Hadoop: Storage + Compute Grid → RDBMS (aggregated data) → BI Reports + Interactive Apps. Label: Mostly Append]

1. Scalable Throughput For ETL & Aggregation (ETL Acceleration)

2. Keep Data Alive Forever (Active Archive)

3. Data Exploration & Advanced Analytics

So What Is Apache Hadoop?

• A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license).

• Core Hadoop has two main systems:
  • Hadoop Distributed File System: self-healing high-bandwidth clustered storage.
  • MapReduce: distributed fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction.

• Key business values:
  • Flexibility – Store any data, run any analysis.
  • Scalability – Start at 1TB/3-nodes, grow to petabytes/1000s of nodes.
  • Economics – Cost per TB at a fraction of traditional options.


The Hadoop Big Bang


• Fastest sort of a TB, 62 secs over 1,460 nodes

• Sorted a PB in 16.25 hours over 3,658 nodes

Hadoop World 2009, 500 attendees

The Key Benefit: Agility/Flexibility


Schema-on-Write (RDBMS):

• Schema must be created before any data can be loaded.

• An explicit load operation has to take place which transforms data to DB internal serialization format.

• New columns must be added explicitly before new data for such columns can be loaded into the database.

Pros:

• OLAP is Fast

• Standards/Governance

Schema-on-Read (Hadoop):

• Data is simply copied to the file store, no transformation is needed.

• A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding).

• New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it.

Pros:

• Load is Fast

• Flexibility/Agility
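The late-binding idea behind schema-on-read can be sketched in a few lines of Python. This is only an illustration, not Hive's SerDe API: the JSON sample records, the `read_with_schema` helper, and the column names are all assumptions made up for the example.

```python
import json

# Raw events are stored exactly as they arrived: no load step, no upfront schema.
# (Hypothetical sample data.)
RAW_LINES = [
    '{"user": "ann", "page": "/home"}',
    '{"user": "bob", "page": "/cart", "referrer": "ad-17"}',
    '{"user": "cat", "page": "/home", "referrer": "mail-3"}',
]


def read_with_schema(lines, columns):
    """Apply a schema at read time: parse each raw line, project the columns.

    Fields missing from a record come back as None (late binding).
    """
    for line in lines:
        record = json.loads(line)
        yield tuple(record.get(col) for col in columns)


# Version 1 of the reader only knows about two columns.
v1_rows = list(read_with_schema(RAW_LINES, ["user", "page"]))

# Version 2 adds "referrer". The stored data is untouched, yet the new
# column appears retroactively for records that already carried it.
v2_rows = list(read_with_schema(RAW_LINES, ["user", "page", "referrer"]))
```

In a schema-on-write system the equivalent change would mean altering the table and reloading; here only the reader changes.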

Scalability: Scalable Software Development


Grows without requiring developers to re-architect their algorithms/application.

AUTO SCALE
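One way to picture why the developer's code does not need re-architecting: in a MapReduce-style job, the map and reduce functions stay the same while only the number of partitions changes. The sketch below is a single-process toy, not Hadoop's API; `run_job` is a hypothetical driver and the input lines are invented.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def run_job(lines, num_partitions):
    """Hypothetical driver: split input, process partitions independently, merge."""
    partitions = [lines[i::num_partitions] for i in range(num_partitions)]
    partials = []
    for part in partitions:
        pairs = [pair for line in part for pair in map_words(line)]
        partials.append(reduce_counts(pairs))
    # Merging partial results is just another reduce over (word, count) pairs.
    merged = [(w, c) for partial in partials for w, c in partial.items()]
    return reduce_counts(merged)


LINES = ["big data big platform", "data beats algorithms", "big data"]

# Same user code, different degrees of parallelism, identical answer.
on_one_node = run_job(LINES, 1)
on_three_nodes = run_job(LINES, 3)
on_eight_nodes = run_job(LINES, 8)
```

Only the `num_partitions` argument varies between runs; `map_words` and `reduce_counts` are never touched.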

Economics: Return on Byte

• Return on Byte (ROB) = value to be extracted from that byte divided by the cost of storing that byte

• If ROB is < 1 then it will be buried into tape wasteland, thus we need more economical active storage.


Low ROB

High ROB
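The ROB definition above is a simple ratio, and a small Python sketch makes the threshold concrete. The value and cost figures are hypothetical and in arbitrary units; `return_on_byte` and `stays_active` are names invented for this example.

```python
def return_on_byte(value_per_byte, cost_per_byte):
    """ROB = value extracted from a byte divided by the cost of storing it."""
    if cost_per_byte <= 0:
        raise ValueError("storage cost must be positive")
    return value_per_byte / cost_per_byte


def stays_active(value_per_byte, cost_per_byte):
    """Data with ROB below 1 gets pushed to tape; otherwise it stays active."""
    return return_on_byte(value_per_byte, cost_per_byte) >= 1


# Hypothetical numbers: the same data, worth 4.0 units per byte,
# on an expensive store (10.0 per byte) versus a cheaper one (2.0 per byte).
VALUE = 4.0
rob_expensive = return_on_byte(VALUE, 10.0)
rob_cheap = return_on_byte(VALUE, 2.0)

archived_on_expensive = not stays_active(VALUE, 10.0)
active_on_cheap = stays_active(VALUE, 2.0)
```

The value of the data is unchanged between the two cases; lowering the storage cost alone moves it from low ROB to high ROB.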

The Big Data Platform: CDH4 – June 2012

• Hadoop Core Kernel: MapReduce, HDFS
• Coordination: APACHE ZOOKEEPER
• Data Integration: APACHE FLUME, APACHE SQOOP
• Fast Read/Write Access: APACHE HBASE
• Batch Processing Languages: APACHE PIG, APACHE HIVE
• Interactive SQL: Impala
• Web Console: HUE
• Job Workflow: APACHE OOZIE
• Metadata: APACHE HIVE MetaStore
• Data Mining Lib: APACHE MAHOUT
• Data Processing Lib: DataFu for Pig
• Cloud Deployment: APACHE WHIRR
• Build/Test: APACHE BIGTOP
• Connectivity: ODBC/JDBC/FUSE/HTTPS
• Cloudera Manager Free Edition (Installation Wizard)

CDH in the Enterprise Data Stack

[Diagram labels]

• Data sources: Logs, Files, Web Data, Relational Databases
• Data integration: Flume, Sqoop
• Tools: IDEs, BI / Analytics, Enterprise Reporting, Modeling Tools, Meta Data/ETL Tools
• Connected systems: Enterprise Data Warehouse, Online Serving Systems, Web/Mobile Applications
• Management: Cloudera Manager
• Connectivity: ODBC, JDBC, NFS, HTTP
• Users: SYSTEM OPERATORS, ENGINEERS, ANALYSTS, BUSINESS USERS, DATA SCIENTISTS, DATA ARCHITECTS, CUSTOMERS


HBase versus HDFS

HDFS:

Optimized For:

• Large Files

• Sequential Access (Hi Throughput)

• Append Only

Use For:

• Fact tables that are mostly append only and require sequential full table scans.

HBase:

Optimized For:

• Small Records

• Random Access (Lo Latency)

• Atomic Record Updates

Use For:

• Dimension tables which are updated frequently and require random low-latency lookups.

Not Suitable For:

• Low Latency Interactive OLAP.
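The two access patterns can be contrasted with in-memory stand-ins. The sketch below does not use the HDFS or HBase APIs: a Python list plays the append-only fact table, a dict plays the keyed dimension table, and the product and sales data are invented for the example.

```python
# Append-only fact "file": rows are only ever added at the end,
# and reads are sequential full scans (the HDFS-style pattern).
fact_sales = []


def append_sale(product_id, amount):
    fact_sales.append((product_id, amount))


# Keyed dimension "table": small records, random lookups by key,
# and in-place updates of a single record (the HBase-style pattern).
dim_product = {}


def put_product(product_id, name):
    dim_product[product_id] = name


def revenue_by_product_name():
    """Scan every fact row once, looking up each product by key."""
    totals = {}
    for product_id, amount in fact_sales:   # sequential full scan
        name = dim_product[product_id]      # random keyed lookup
        totals[name] = totals.get(name, 0) + amount
    return totals


put_product(1, "widget")
put_product(2, "gadget")
append_sale(1, 10)
append_sale(2, 5)
append_sale(1, 7)

# The dimension record is updated in place; the fact rows are never rewritten.
put_product(1, "widget-v2")

report = revenue_by_product_name()
```

The report picks up the renamed product without any change to the stored facts, which is the division of labour the two lists above describe.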


Use Case Examples

• Retail: Price Optimization

• Media: Content Targeting

• Finance: Fraud Detection

• Manufacturing: Diagnostics

• Info Services: Satellite Imagery

• Agriculture: Seed Optimization

• Power: Smart Consumption


Core Benefits of the Platform for Big Data

1. FLEXIBILITY
• STORE ANY DATA
• RUN ANY ANALYSIS
• KEEPS PACE WITH THE RATE OF CHANGE OF INCOMING DATA

2. SCALABILITY
• PROVEN GROWTH TO PBs/1,000s OF NODES
• NO NEED TO REWRITE QUERIES, AUTOMATICALLY SCALES
• KEEPS PACE WITH THE RATE OF GROWTH OF INCOMING DATA

3. ECONOMICS
• COST PER TB AT A FRACTION OF OTHER OPTIONS
• KEEP ALL OF YOUR DATA ALIVE IN AN ACTIVE ARCHIVE
• POWERING THE DATA BEATS ALGORITHM MOVEMENT

Thank you!

Amr Awadallah, CTO, Founder, Cloudera, Inc. <[email protected]> @awadallah
