2014 july 24_what_ishadoop

EVERYONE LIKES ELEPHANTS Adam Muise [email protected] Principal Architect Hortonworks

description

Presentation for Silicon Peel group at Microsoft Canada HQ, July 24, 2014

Transcript of 2014 july 24_what_ishadoop

Page 1: 2014 july 24_what_ishadoop

EVERYONE LIKES ELEPHANTS

Adam Muise [email protected] Principal Architect Hortonworks

Page 2: 2014 july 24_what_ishadoop

Who am I?

Page 3: 2014 july 24_what_ishadoop

Who is Hortonworks?

Page 4: 2014 july 24_what_ishadoop

We do Hadoop

The leaders of Hadoop’s development

Community driven, Enterprise Focused

Drive Innovation in the platform – We lead the roadmap

100% Open Source – Democratized Access to Data

Page 5: 2014 july 24_what_ishadoop

We do Hadoop successfully.

Support

Professional Services

Training

Page 6: 2014 july 24_what_ishadoop

What is Hadoop? What is everyone talking about?

Page 7: 2014 july 24_what_ishadoop

Data

Page 8: 2014 july 24_what_ishadoop

“Big Data” is the marketing term of the decade in IT

Page 9: 2014 july 24_what_ishadoop

What lurks behind the hype is the democratization of Data, a move to aggregate disparate data silos into one shiny pile of analytic gold

Page 10: 2014 july 24_what_ishadoop

So what are the problems with Big Data?

Page 11: 2014 july 24_what_ishadoop

Let’s talk challenges…

Page 12: 2014 july 24_what_ishadoop

[Slide graphic: the word "Volume" shown four times]

Page 13: 2014 july 24_what_ishadoop

[Slide graphic: the word "Volume" repeated many more times]

Page 14: 2014 july 24_what_ishadoop

[Slide graphic: the word "Volume" repeated until it crowds the slide]

Page 15: 2014 july 24_what_ishadoop

[Slide graphic: the word "Volume" repeated until it fills the slide]

Page 16: 2014 july 24_what_ishadoop

Storage, Management, Processing all become challenges with Data at Volume

Page 17: 2014 july 24_what_ishadoop

Traditional technologies adopt a divide, drop, and conquer approach

Page 18: 2014 july 24_what_ishadoop

The solution?

[Diagram: Data divided among separate stores labeled EDW, Yet Another EDW, Analytical DB, OLTP, and Another EDW]

Page 19: 2014 july 24_what_ishadoop

Ummm…you dropped something

[Diagram: unlabeled piles of Data alongside the stores labeled EDW, Yet Another EDW, Analytical DB, OLTP, and Another EDW]

Page 20: 2014 july 24_what_ishadoop

Analyzing the data usually raises more interesting questions…

Page 21: 2014 july 24_what_ishadoop

…which leads to more data

Page 22: 2014 july 24_what_ishadoop

Wait, you’ve seen this before.

[Diagram: piles of Data, an "Analytics Sausage Factory", and still more Data]

Page 23: 2014 july 24_what_ishadoop

Data begets Data.

Page 24: 2014 july 24_what_ishadoop

What keeps us from our Data?

Page 25: 2014 july 24_what_ishadoop

“Prices, Stupid passwords, and Boring Statistics.”

- Hans Rosling

http://www.youtube.com/watch?v=hVimVzgtD6w

Page 26: 2014 july 24_what_ishadoop

Your data silos are lonely places.

[Diagram: separate Data silos labeled EDW, Accounts, Customers, and Web Properties]

Page 27: 2014 july 24_what_ishadoop

… Data likes to be together.

[Diagram: the Data sets labeled EDW, Accounts, Customers, and Web Properties]

Page 28: 2014 july 24_what_ishadoop

Data likes to socialize too.

[Diagram: Data sets labeled EDW, Accounts, Customers, Web Properties, Machine Data, Twitter, Facebook, CDR, and Weather Data]

Page 29: 2014 july 24_what_ishadoop

New types of data don’t quite fit into your pristine view of the world.

[Diagram: "My Little Data Empire" alongside Logs, Machine Data, and question marks]

Page 30: 2014 july 24_what_ishadoop

To resolve this, some people take hints from Lord Of The Rings...

Page 31: 2014 july 24_what_ishadoop

…and create One-Schema-To-Rule-Them-All…

[Diagram: an EDW holding Data under a single Schema]

Page 32: 2014 july 24_what_ishadoop

…but that has its problems too.

[Diagram: an EDW with a single Schema, surrounded by Data and multiple ETL steps]

Page 33: 2014 july 24_what_ishadoop

What if the data was processed and stored centrally? What if you didn’t need to force it into a single schema? We call it a Data Lake.

[Diagram: Data Sources, a Data Lake with Process steps and several Schemas, an EDW, and BI & Analytics]

Page 34: 2014 july 24_what_ishadoop

A Data Lake Architecture enables:

- Landing data without forcing a single schema
- Landing a variety and large volume of data efficiently
- Retaining data for a long period of time with a very low $/TB
- A platform to feed other Analytical DBs
- A platform to execute next gen data analytics and processing applications (SAS, Informatica, Graph Analytics, Machine Learning, SAP, etc…)
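Landing data without forcing a single schema is often described as schema-on-read. A minimal Python sketch of the idea follows. The records, field names, and the `read_with_schema` helper are all hypothetical and do not come from the deck; they only illustrate storing raw data untouched and letting each consumer apply its own schema when it reads.

```python
import json

# Hypothetical raw events, landed as-is. Nothing is validated or reshaped
# at write time, so very different record shapes can sit side by side.
RAW_LINES = [
    '{"user": "u1", "page": "/home", "ms": 120}',
    '{"user": "u2", "page": "/cart", "ms": 340, "basket": ["tea cozy"]}',
    '{"device": "sensor-7", "temp_c": 21.5}',
]


def read_with_schema(lines, fields):
    """Project each raw record onto the fields one consumer cares about.

    Records that lack a field get None for it instead of being rejected.
    """
    for line in lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


# Two consumers read the same raw data through two different schemas.
web_view = list(read_with_schema(RAW_LINES, ["user", "page"]))
sensor_view = list(read_with_schema(RAW_LINES, ["device", "temp_c"]))
```

The trade-off in this sketch is that a mismatch shows up at read time (as `None` values) rather than as a rejected load, which is what lets new kinds of data land without a redesign.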

Page 35: 2014 july 24_what_ishadoop

In most cases, more data is better. Work with the population, not just a sample.

Page 36: 2014 july 24_what_ishadoop

Your view of a client today.

Male

Female

Age: 25-30

Town/City

Middle Income Band

Product Category Preferences

Page 37: 2014 july 24_what_ishadoop

Your view with more data.

Male

Female

Age: 27 but feels old

GPS coordinates

$65-68k per year

Product recommendations

Tea Party

Hippie

Looking to start a business

Walking into Starbucks right now…

A depressed Toronto Maple Leafs fan

Products left in basket indicate drunk Amazon shopper

Gene Expression for Risk Taker

Thinking about a new house

Unhappy with his cell phone plan

Pregnant

Spent 25 minutes looking at tea cozies

Page 38: 2014 july 24_what_ishadoop

So what is the answer?

Page 39: 2014 july 24_what_ishadoop

Enter the Hadoop.

http://www.fabulouslybroke.com/2011/05/ninja-elephants-and-other-awesome-stories/


Page 40: 2014 july 24_what_ishadoop

Hadoop was created because traditional technologies never cut it for the Internet properties like Google, Yahoo, Facebook, Twitter, and LinkedIn

Page 41: 2014 july 24_what_ishadoop

Traditional architecture didn’t scale enough…

[Diagram: repeated stacks of App servers, DB servers, and a SAN]

Page 42: 2014 july 24_what_ishadoop

Traditional architectures cost too much at that volume…

$/TB

$pecial Hardware

$upercomputing

Page 43: 2014 july 24_what_ishadoop

So what is the answer?

Page 44: 2014 july 24_what_ishadoop

If you could design a system that would handle this, what would it look like?

Page 45: 2014 july 24_what_ishadoop

It would probably need a highly resilient, self-healing, cost-efficient, distributed file system…

[Diagram: 3 by 3 grid of Storage nodes]
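The resilient, self-healing behaviour named on this slide can be sketched as block replication plus re-replication after a failure. This is a toy Python illustration, not HDFS's actual placement policy. The node names, block names, and the `place_blocks` and `heal` helpers are hypothetical; the one borrowed detail is the copy count of three, which matches HDFS's default replication factor.

```python
import random

REPLICATION = 3  # copies kept per block (matches HDFS's default factor)


def place_blocks(block_ids, nodes, rng):
    """Assign each block to REPLICATION distinct nodes."""
    return {b: set(rng.sample(nodes, REPLICATION)) for b in block_ids}


def heal(placement, live_nodes, rng):
    """Drop replicas held by dead nodes, then copy each block back up to REPLICATION.

    Assumes at least one replica of every block survives; a real system
    can only re-copy from a surviving replica.
    """
    live = set(live_nodes)
    for holders in placement.values():
        holders &= live  # replicas on failed nodes are gone
        candidates = sorted(live - holders)
        missing = REPLICATION - len(holders)
        holders |= set(rng.sample(candidates, missing))
    return placement


rng = random.Random(0)
nodes = [f"node{i}" for i in range(9)]  # nine storage nodes, as in the grid
placement = place_blocks(["blk_a", "blk_b", "blk_c"], nodes, rng)

survivors = [n for n in nodes if n != "node0"]  # simulate one node failing
placement = heal(placement, survivors, rng)
```

Because every block already lives on several machines, losing one commodity node costs no data, and the repair is just more copying between the remaining nodes.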

Page 46: 2014 july 24_what_ishadoop

It would probably need a completely parallel processing framework that took tasks to the data…

[Diagram: the grid of Storage nodes paired with a matching grid of Processing]
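Taking tasks to the data can be sketched as a tiny map and reduce over text chunks that stay where they are stored. This Python example is an in-process illustration only, not the MapReduce or YARN API. The node layout and the `map_on_node` and `reduce_counts` helpers are hypothetical.

```python
from collections import Counter

# Hypothetical layout: which node stores which chunks of text.
CHUNKS_BY_NODE = {
    "node1": ["big data big", "elephants"],
    "node2": ["data lake"],
    "node3": ["big elephants like data"],
}


def map_on_node(chunks):
    """Run where the chunks live: count words locally, return a small summary."""
    counts = Counter()
    for chunk in chunks:
        counts.update(chunk.split())
    return counts


def reduce_counts(partials):
    """Merge the per-node summaries into one result."""
    total = Counter()
    for partial in partials:
        total += partial
    return total


# Only the small per-node Counters would cross the network, never the raw chunks.
partials = [map_on_node(chunks) for chunks in CHUNKS_BY_NODE.values()]
word_counts = reduce_counts(partials)
```

The design point is that the code moves and the data does not: each node ships back a summary far smaller than the chunks it scanned.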

Page 47: 2014 july 24_what_ishadoop

It would probably run on commodity hardware, virtualized machines, and common OS platforms

[Diagram: the grid of Storage nodes paired with a matching grid of Processing]

Page 48: 2014 july 24_what_ishadoop

It would probably be open source so innovation could happen as quickly as possible

Page 49: 2014 july 24_what_ishadoop

It would need a critical mass of users

Page 50: 2014 july 24_what_ishadoop

{Processing + Storage} = {YARN + HDFS}

Page 51: 2014 july 24_what_ishadoop

Want to get your hands dirty?

Page 52: 2014 july 24_what_ishadoop

To do this, we need to install Hadoop, right?

Page 53: 2014 july 24_what_ishadoop

Nope.

Page 54: 2014 july 24_what_ishadoop

Enter the

Sandbox.

Page 55: 2014 july 24_what_ishadoop

The Sandbox is ‘Hadoop in a Can’. It contains one copy of each of the Master and Worker node processes used in a cluster, only in a single virtual node.

[Diagram: the cluster's grid of Storage and Processing nodes next to a single Processing and Storage pair inside one Linux VM]

Page 56: 2014 july 24_what_ishadoop

Getting started with Sandbox VM:

- Pick your flavor of VM at… http://www.hortonworks.com/sandbox
- Start the sandbox VM
- Find the IP displayed
- Go to… http://172.16.130.137
- Register
- Click on ‘Start Tutorials’
- On the left hand nav, click on ‘HCatalog, Basic Pig & Hive Commands’

Page 57: 2014 july 24_what_ishadoop

http://hortonworks.com/hadoop-tutorial/how-to-use-hcatalog-basic-pig-hive-commands/

In this tutorial you can…

- Land files in HDFS
- Assign metadata with HCatalog
- Use SQL with Hive
- Learn to process data with Pig

Page 58: 2014 july 24_what_ishadoop

Hadoop has other open source projects…

Page 59: 2014 july 24_what_ishadoop

Apache Hadoop

Flume, Ambari, HBase, Falcon, MapReduce, HDFS, Sqoop, HCatalog, Pig, Hive, Storm, YARN, Knox, Tez

Page 60: 2014 july 24_what_ishadoop

Hortonworks Data Platform

Flume, Ambari, HBase, Falcon, MapReduce, HDFS, Sqoop, HCatalog, Pig, Hive, Storm, YARN, Knox, Tez

Page 61: 2014 july 24_what_ishadoop

What else are we working on?

hortonworks.com/labs/

Page 62: 2014 july 24_what_ishadoop

© Hortonworks Inc. 2012: DO NOT SHARE. CONTAINS HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION

There is NO second place

Hortonworks

We do Hadoop.