2014 july 24_what_ishadoop

Post on 27-Jan-2015

Presentation for Silicon Peel group at Microsoft Canada HQ, July 24, 2014

EVERYONE LIKES ELEPHANTS

Adam Muise

amuise@hortonworks.com

Principal Architect

Hortonworks

Who am I?

Who is Hortonworks?

We do Hadoop

The leaders of Hadoop’s development

Community driven, Enterprise Focused

Drive Innovation in the platform – We lead the roadmap

100% Open Source – Democratized Access to Data

We do Hadoop successfully.

Support

Professional Services

Training

What is Hadoop? What is everyone talking about?

Data

“Big Data” is the marketing term of the decade in IT

What lurks behind the hype is the democratization of Data, a move to aggregate disparate data silos into one shiny pile of analytic gold

So what are the problems with Big Data?

Let’s talk challenges…

[Slide graphic: the word “Volume” repeated many times over]

Storage, Management, Processing all become challenges with Data at Volume

Traditional technologies adopt a divide, drop, and conquer approach

The solution?

[Diagram: data divided among separate stores labeled EDW, Yet Another EDW, Analytical DB, OLTP, and Another EDW]

Ummm…you dropped something

[Diagram: loose piles of data alongside the stores labeled EDW, Yet Another EDW, Analytical DB, OLTP, and Another EDW]

Analyzing the data usually raises more interesting questions…

…which leads to more data

Wait, you’ve seen this before.

[Diagram: an “Analytics Sausage Factory” surrounded by data]

Data begets Data.

What keeps us from our Data?

“Prices, Stupid passwords, and Boring Statistics.”

- Hans Rosling

http://www.youtube.com/watch?v=hVimVzgtD6w

Your data silos are lonely places.

[Diagram: separate data silos labeled EDW, Accounts, Customers, and Web Properties]

… Data likes to be together.

[Diagram: the EDW, Accounts, Customers, and Web Properties data shown together]

Data likes to socialize too.

[Diagram: EDW, Accounts, Customers, and Web Properties data joined by Machine Data, Twitter, Facebook, CDR, and Weather Data]

New types of data don’t quite fit into your pristine view of the world.

[Diagram: “My Little Data Empire” next to Logs and Machine Data, with question marks between them]

To resolve this, some people take hints from Lord Of The Rings...

…and create One-Schema-To-Rule-Them-All…

[Diagram: an EDW holding all the data under a single Schema]

…but that has its problems too.

[Diagram: an EDW with a single Schema, fed by many separate ETL jobs]

What if the data was processed and stored centrally? What if you didn’t need to force it into a single schema? We call it a Data Lake.

[Diagram: Data Sources feeding a Data Lake, where data is Processed under several Schemas and passed on to the EDW and to BI & Analytics]

A Data Lake Architecture enables:

- Landing data without forcing a single schema
- Landing a variety and large volume of data efficiently
- Retaining data for a long period of time with a very low $/TB
- A platform to feed other Analytical DBs
- A platform to execute next gen data analytics and processing applications (SAS, Informatica, Graph Analytics, Machine Learning, SAP, etc…)
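A minimal Python sketch of what “landing data without forcing a single schema” can look like, an idea often called schema-on-read. Everything here is a hypothetical illustration: the sample records, the field names, and the `read_with_schema` helper are assumptions, not part of Hadoop or HCatalog.

```python
import json

# Raw records are landed as-is. Their shapes differ and nothing is rejected on write.
RAW_LANDING_ZONE = [
    '{"source": "web", "user": "u1", "url": "/pricing", "ms": 120}',
    '{"source": "cdr", "caller": "u1", "callee": "u2", "seconds": 64}',
    '{"source": "web", "user": "u2", "url": "/home"}',
]


def read_with_schema(raw_lines, source, fields):
    """Apply a schema at read time: keep one source and project the given fields."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        if record.get("source") != source:
            continue
        # A missing field becomes None instead of failing the whole load.
        rows.append({name: record.get(name) for name in fields})
    return rows


# Two different "schemas" over the same raw pile of data.
web_view = read_with_schema(RAW_LANDING_ZONE, "web", ["user", "url", "ms"])
call_view = read_with_schema(RAW_LANDING_ZONE, "cdr", ["caller", "seconds"])
print(web_view)
print(call_view)
```

The point of the sketch is only that each consumer picks its own view when it reads, so a new data source does not require reshaping what has already been landed.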

In most cases, more data is better. Work with the population, not just a sample.

Your view of a client today.

Male

Female

Age: 25-30

Town/City

Middle Income Band

Product Category Preferences

Your view with more data.

Male

Female

Age: 27 but feels old

GPS coordinates

$65-68k per year

Product recommendations

Tea Party

Hippie

Looking to start a business

Walking into Starbucks right now…

A depressed Toronto Maple Leafs fan

Products left in basket indicate drunk Amazon shopper

Gene Expression for Risk Taker

Thinking about a new house

Unhappy with his cell phone plan

Pregnant

Spent 25 minutes looking at tea cozies

So what is the answer?

Enter the Hadoop.

http://www.fabulouslybroke.com/2011/05/ninja-elephants-and-other-awesome-stories/

Hadoop was created because traditional technologies never cut it for Internet properties like Google, Yahoo, Facebook, Twitter, and LinkedIn

Traditional architecture didn’t scale enough…

[Diagram: repeated stacks of App servers over DB servers over a SAN]

Traditional architectures cost too much at that volume…

$/TB

$pecial Hardware

$upercomputing

So what is the answer?

If you could design a system that would handle this, what would it look like?

It would probably need a highly resilient, self-healing, cost-efficient, distributed file system…

[Diagram: a grid of Storage nodes]
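A toy Python sketch of the “resilient, self-healing” idea: each block of a file is kept on several nodes, and when a node dies the missing copies are rebuilt from the survivors. The replication factor of 3 matches the HDFS default; the node names, the round-robin placement, and the `place_blocks` and `heal` helpers are simplifying assumptions and not how HDFS actually chooses nodes.

```python
REPLICATION = 3  # HDFS's default replication factor; the rest is a toy model


def place_blocks(block_ids, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = {nodes[(i + k) % len(nodes)] for k in range(replication)}
    return placement


def heal(placement, live_nodes, replication=REPLICATION):
    """Drop replicas on dead nodes, then re-copy each block to live nodes
    until it is back at the target replica count."""
    live = sorted(live_nodes)
    healed = {}
    for block, holders in placement.items():
        survivors = holders & set(live)
        if not survivors:
            healed[block] = set()  # every replica lost; nothing to copy from
            continue
        for node in live:
            if len(survivors) >= replication:
                break
            survivors.add(node)  # adding an existing holder is a no-op
        healed[block] = survivors
    return healed


nodes = ["n1", "n2", "n3", "n4", "n5"]
placement = place_blocks(["b0", "b1", "b2", "b3"], nodes)

# Simulate losing node n3, then let the cluster repair itself.
after_failure = heal(placement, [n for n in nodes if n != "n3"])
print(after_failure)
```

Because every block already lives on more than one cheap machine, losing a machine costs some copying rather than any data, which is where the cost efficiency comes from.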

It would probably need a completely parallel processing framework that took tasks to the data…

[Diagram: a grid of Storage nodes, each with Processing alongside]
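A small Python sketch of “taking tasks to the data”, written in the map and reduce style that Hadoop popularized. It runs in a single process, so it only illustrates the shape of the idea: the node names, the sample lines, and the `map_task` and `reduce_task` helpers are hypothetical and this is not the Hadoop MapReduce API.

```python
from collections import Counter

# Each "node" holds its own slice of the data (hypothetical node names).
NODE_LOCAL_DATA = {
    "node1": ["big data big hype", "data lake"],
    "node2": ["data begets data"],
    "node3": ["hadoop stores data", "hadoop processes data"],
}


def map_task(lines):
    """Runs where the data lives: count words in one node's local lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def reduce_task(partials):
    """Merge the small per-node results into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Only the compact per-node counts travel, never the raw lines.
partials = [map_task(lines) for lines in NODE_LOCAL_DATA.values()]
word_counts = reduce_task(partials)
print(word_counts.most_common(3))
```

On a real cluster the map step would run in parallel on the machines that store each slice, so the network carries summaries instead of the full data set.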

It would probably run on commodity hardware, virtualized machines, and common OS platforms

[Diagram: a grid of Storage nodes, each with Processing alongside]

It would probably be open source so innovation could happen as quickly as possible

It would need a critical mass of users

{Processing + Storage} = {YARN + HDFS}

Want to get your hands dirty?

To do this, we need to install Hadoop right?

Nope.

Enter the Sandbox.

The Sandbox is ‘Hadoop in a Can’. It contains one copy of each of the Master and Worker node processes used in a cluster, only in a single virtual node.

[Diagram: the cluster's grid of Storage and Processing nodes collapsed into one Processing + Storage pair inside a Linux VM]

Getting started with Sandbox VM:

- Pick your flavor of VM at… http://www.hortonworks.com/sandbox
- Start the sandbox VM
- Find the IP displayed
- Go to that IP in a browser, for example http://172.16.130.137
- Register
- Click on ‘Start Tutorials’
- On the left hand nav, click on ‘HCatalog, Basic Pig & Hive Commands’

http://hortonworks.com/hadoop-tutorial/how-to-use-hcatalog-basic-pig-hive-commands/

In this tutorial you can…

- Land files in HDFS
- Assign metadata with HCatalog
- Use SQL with Hive
- Learn to process data with Pig

Hadoop has other open source projects…

Apache Hadoop

Flume, Ambari, HBase, Falcon, MapReduce, HDFS, Sqoop, HCatalog, Pig, Hive, Storm, YARN, Knox, Tez

Hortonworks Data Platform

Flume, Ambari, HBase, Falcon, MapReduce, HDFS, Sqoop, HCatalog, Pig, Hive, Storm, YARN, Knox, Tez

What else are we working on?

hortonworks.com/labs/

© Hortonworks Inc. 2012: DO NOT SHARE. CONTAINS HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION

There is NO second place

Hortonworks

We do Hadoop.