10 Amazing Things To Do With a Hadoop-Based Data Lake

© 2014 Pivotal Software, Inc. All rights reserved.


Strata Conference New York 2014

Greg Chase, Director, Product Marketing, Pivotal Software


Pivotal Business Data Lake Architecture

[Architecture diagram]

Tiers: Sources, Ingestion Tier, Distillation Tier, Processing Tier, Insights Tier, Action Tier, Unified Operations Tier

Sources: clickstream, sensor data, weblogs, network data, CRM data, ERP data

Components shown: Command Center; Spring XD; Oozie; Sqoop; Flume; Pivotal HD (unstructured and structured data); GemFire XD; HAWQ/Greenplum; HBase; MapReduce; Hive; Pig; query interfaces; GemFire; RabbitMQ; Redis; Pivotal CF





1. Store Massive Data Sets

Scale-out: use commodity hardware and storage

Rack 1, Rack 2, Rack 3, … Rack n


2. Mix Disparate Data Sources

Schema flexibility: absorb different data types from data sources

Sensor data

CRM data

Website click streams
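
As a rough illustration of schema flexibility (a sketch, not Pivotal's implementation): records from different sources are stored as-is, and a structure is applied only when they are read. The field names and sample records below are hypothetical.

```python
import json

# Hypothetical raw records from three sources, stored without a shared schema.
raw_records = [
    '{"source": "sensor", "device_id": "d-17", "temp_c": 21.5}',
    '{"source": "crm", "customer_id": 42, "name": "Acme"}',
    '{"source": "clickstream", "customer_id": 42, "url": "/pricing"}',
]


def read_with_schema(lines, fields):
    """Schema-on-read: project each record onto the requested fields.

    Fields a record does not have come back as None instead of failing.
    """
    for line in lines:
        record = json.loads(line)
        yield {field: record.get(field) for field in fields}


rows = list(read_with_schema(raw_records, ["source", "customer_id"]))
for row in rows:
    print(row)
```

The same raw records can be read again later with a different field list, which is the point of deferring the schema.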





3. Ingest Bulk Data

Scalable open source tools for batch loading data

Load modes: microbatch, batch

Sqoop: bulk load; RDBMS

Spring XD: bulk load; with processing; with analytics; any source

Flume: event driven; any source
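
To make the microbatch idea concrete, here is a minimal sketch in plain Python. It does not use Sqoop, Spring XD, or Flume; the `micro_batches` helper and the event names are hypothetical, and a real loader would write each batch to HDFS instead of collecting it in a list.

```python
from itertools import islice


def micro_batches(events, batch_size):
    """Yield lists of up to batch_size events from any iterable."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical event stream standing in for a data source.
events = (f"event-{i}" for i in range(7))
batches = list(micro_batches(events, batch_size=3))
print([len(batch) for batch in batches])
```

A full batch load is the same loop with a batch size large enough to cover the whole source.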


4. Ingest High-Velocity Data

Capture all volatile data. Apply structure.

Streaming data

Spring XD: bulk load; real-time ingest; with processing; with analytics; any source

Pivotal GemFire XD: advanced DB operations; consistency; reliable persistence; convert to structured





5. Apply Structure to Unstructured / Semi-Structured Data

Flexible processing of different data types
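
One common case of applying structure is turning a raw weblog line into typed fields. The sketch below assumes the weblogs are in Common Log Format, which the slides do not state; the sample line and the `parse_line` helper are made up for illustration.

```python
import re

# Common Log Format pattern (an assumption about the log layout).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    row = match.groupdict()
    row["status"] = int(row["status"])
    row["size"] = 0 if row["size"] == "-" else int(row["size"])
    return row


sample = '203.0.113.9 - - [15/Oct/2014:10:02:31 -0400] "GET /pricing HTTP/1.1" 200 5123'
parsed = parse_line(sample)
print(parsed)
```

Lines that do not match return None, so they can be set aside rather than dropped silently.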



6. Make Data Available for MPP SQL Analysis

Fast processing for advanced analytics in many supported HDFS formats

Hadoop Cluster

Master services: Name Node, Resource Manager, HAWQ Master

Four nodes, each running: Data Node, Node Manager, HAWQ Segment(s)
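
The master-and-segments layout follows the general MPP pattern: each segment aggregates its own slice of the data, and the master merges the partial results. The sketch below imitates that pattern with Python threads; it is not how HAWQ is implemented, and the rows, `segment_sum`, and `master_query` are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rows (region, amount), already spread across three segments.
segments = [
    [("east", 10), ("west", 5)],
    [("east", 7), ("north", 3)],
    [("west", 8), ("north", 1)],
]


def segment_sum(rows):
    """Work done on one segment: a partial SUM(amount) GROUP BY region."""
    partial = Counter()
    for region, amount in rows:
        partial[region] += amount
    return partial


def master_query(all_segments):
    """Master role: run the partial aggregate on every segment, then merge."""
    with ThreadPoolExecutor(max_workers=len(all_segments)) as pool:
        partials = list(pool.map(segment_sum, all_segments))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


totals = master_query(segments)
print(totals)
```

Because each partial sum touches only local rows, adding segments spreads the scan work without changing the merge step.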


7. Achieve Data Integration

Create multi-dimensional analytical models.
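
A small sketch of what integration can look like once the sources sit side by side: clickstream facts are joined to a CRM dimension and rolled up along two dimensions. The tables, field names, and `roll_up` helper are hypothetical and stand in for whatever model an analyst would actually build.

```python
from collections import defaultdict

# Hypothetical data: a CRM dimension table and clickstream facts.
customers = {42: {"segment": "enterprise"}, 7: {"segment": "smb"}}
clicks = [
    {"customer_id": 42, "page": "/pricing"},
    {"customer_id": 42, "page": "/pricing"},
    {"customer_id": 7, "page": "/docs"},
]


def roll_up(facts, dimension, keys):
    """Join each fact to its customer row, then count by the given keys."""
    cube = defaultdict(int)
    for fact in facts:
        joined = {**fact, **dimension[fact["customer_id"]]}
        cube[tuple(joined[key] for key in keys)] += 1
    return dict(cube)


cube = roll_up(clicks, customers, keys=("segment", "page"))
print(cube)
```

Changing `keys` gives a different slice of the same joined data, for example counts by segment alone.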



8. Improve Machine Learning & Predictive Analytics

Richer, deeper data sets for accurate predictive analytics.

Diagram: one HAWQ Master, three HAWQ Segment(s) boxes


9. Deploy Real-Time Automation at Scale

Respond in real-time, at scale. Archive history in Hadoop.

Diagram: three Web App boxes, Pivotal GemFire XD, In-Memory
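
The pattern on this slide, serving current values from memory while history is archived, can be sketched with a toy key-value store. This is not the GemFire XD API; the `InMemoryStore` class, its `flush_size` parameter, and the keys are hypothetical, and the `archive` list stands in for files written to Hadoop.

```python
class InMemoryStore:
    """Toy key-value store: serve reads from memory, queue writes for archiving."""

    def __init__(self, flush_size):
        self.current = {}   # latest value per key, served to web apps
        self.pending = []   # changes not yet archived
        self.archive = []   # stand-in for batches written to Hadoop
        self.flush_size = flush_size

    def put(self, key, value):
        self.current[key] = value
        self.pending.append((key, value))
        if len(self.pending) >= self.flush_size:
            self.flush()

    def get(self, key):
        return self.current.get(key)

    def flush(self):
        """Move pending changes to the archive as one batch."""
        if self.pending:
            self.archive.append(list(self.pending))
            self.pending.clear()


store = InMemoryStore(flush_size=2)
store.put("cart:42", 1)
store.put("cart:42", 2)
store.put("cart:7", 5)
print(store.get("cart:42"), store.archive)
```

Reads see only the latest value, while the archive keeps every change in order, including the overwritten one.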


10. Achieve Continuous Innovation at Scale

Deploy automation at scale

Capture and store all data

Analyze to discover insights & algorithms


Increase Value Derived from Data With a Data Lake

Business Value

Store massive data sets
Mix disparate data
Ingest bulk data
Ingest high-velocity data
Apply structure
Enable MPP analysis
Achieve data integration
Improve predictive analytics
Deploy real-time automation at scale
Achieve continuous innovation


For more information on Pivotal Big Data Suite, visit Pivotal.io/big-data

http://www.pivotal.io/big-data/


