Introduction to Hadoop

Eric Mizell – Director, Solution Engineering, Hortonworks. We do Hadoop.
© Hortonworks Inc. 2011 – 2014. All Rights Reserved

Quick Audience Poll

Which best describes how your org is using Hadoop?
A. We’re using Hadoop
B. We’re in the process of getting Hadoop integrated
C. We don’t have Hadoop installed
D. What’s Hadoop?
E. I don’t know

Big Data, Hadoop, and the Modern Data Architecture

Big Data Explosion

Big Data Market Trends & Projections

• 20%: the margin by which organizations leveraging modern information management systems outperform peers by 2015
• 1 Zettabyte (ZB) = 1 Billion TBs
• 15x: growth rate of machine-generated data by 2020
• The US has 1/3 of the world’s data
• Big Data is 1 of 5 US GDP Game Changers: $325 billion incremental annual GDP from big data analytics in retail and manufacturing by 2020
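The zettabyte conversion on this slide can be checked with a few lines of Python. This is a minimal sketch that assumes decimal (SI) units, where 1 TB is 10^12 bytes and 1 ZB is 10^21 bytes; the constant and function names are illustrative and do not come from the original deck.

```python
# Decimal (SI) storage units are assumed here; binary units (TiB, ZiB) would differ.
BYTES_PER_TB = 10**12
BYTES_PER_ZB = 10**21


def zettabytes_to_terabytes(zb: int) -> int:
    """Convert a whole number of zettabytes to terabytes."""
    return zb * BYTES_PER_ZB // BYTES_PER_TB


if __name__ == "__main__":
    # One zettabyte expressed in terabytes.
    print(zettabytes_to_terabytes(1))
```

Under these assumptions the ratio is 10^9, which matches the "1 Billion TBs" figure quoted above.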

Existing Siloed Data Architectures Under Pressure

[Diagram: Applications (Business Analytics, Custom Applications, Packaged Applications) sit on a data system of siloed RDBMS, EDW, and MPP stores, fed by existing sources (CRM, ERP, Clickstream, Logs).]

Data growth: New Data Types
85% (Source: IDC)
• OLTP, ERP, CRM Systems
• Unstructured docs, emails
• Clickstream
• Server logs
• Social/Web Data
• Sensor, Machine Data
• Geolocation

• Can’t manage new data paradigm
• Constrains data to specific schema
• Siloed data
• Limited scalability
• Economically unfeasible
• Limited analytics

Hadoop is Driving the New Data-driven Era of IT

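[Figure: three eras of IT (1st Era, 2nd Era, 3rd Era) compared by goal, data, and technology. Recoverable goal labels: Processing Power, Automation + Efficiency, Real-time Data Driven. Recoverable technology labels: Mainframe, RDBMS.]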

Key Drivers of Hadoop

A. Unlock New Approach to Analytics
• Agile analytics via “Schema on Read” with ability to store all data in native format
• Create new apps from new types of data

B. Optimize Investments, Cut Costs
• Focus EDW on high value workloads
• Use commodity servers & storage to enable all data (original and historical) to be accessible for ongoing exploration

C. Enable a Modern Data Architecture
• Integrate new & existing data sets
• Make all data available for shared access and processing in multitenant infrastructure
• Batch, interactive & real-time use cases
• Integrated with existing tools & skills

[Diagram: Sources (existing systems, clickstream, web & social, geolocation, sensor & machine, server logs, unstructured) feed a data system made up of repositories (RDBMS, EDW, MPP) and Hadoop, shown as YARN: Data Operating System over HDFS: Hadoop Distributed File System with batch, interactive, and real-time workloads. Applications (Business Analytics, Custom Applications, Packaged Applications) sit on top, flanked by operations tools (Provision, Manage & Monitor) and dev & data tools (Build & Test).]

Shift to Data-driven Means Treating Data like Capital

Hadoop enables organizations to cost-effectively store and use all of the data available in a way that shifts the business from reactive to proactive:

• A shift in Advertising: from mass branding to 1x1 targeting
• A shift in Financial Services: from educated investing to automated algorithms
• A shift in Healthcare: from mass treatment to designer medicine
• A shift in Retail: from static branding to real-time personalization
• A shift in Manufacturing: from break then fix to repair before break

Enterprise Goals for the Modern Data Architecture

✓ Centrally manage new and existing data
✓ Data needs flexibility and lands in Hadoop without schema
✓ Prepare data with no predetermined questions
✓ User self-service, with no limit to questions
✓ Run batch, interactive & real-time analytic applications on shared datasets
✓ Leverage new and existing data center infrastructure investments
✓ Scalable and affordable; low cost per TB

[Diagram: Sources (existing systems such as CRM, ERP, and others, plus clickstream, web & social, geolocation, sensor & machine, server logs, unstructured) feed a data system of RDBMS, EDW, MPP, and Hadoop, shown as YARN: Data Operating System over HDFS (Hadoop Distributed File System) across nodes 1 to N with batch, interactive, and real-time workloads. Applications: Business Analytics, Custom Applications, Packaged Applications.]

YARN and HDP Enable the Modern Data Architecture

YARN is the architectural center of Hadoop and HDP
• YARN enables a common data set across all applications
• Batch, interactive & real-time workloads
• Support multi-tenant access & processing

HDP enables Apache Hadoop to become an enterprise-viable data platform with centralized services
• Security
• Governance
• Operations
• Productization

Enabled broad ecosystem adoption

Hortonworks drove this innovation of Hadoop through YARN

Hortonworks Data Platform 2.2

[Diagram: HDP 2.2 component stack]
• Governance (Data Workflow, Lifecycle & Governance): Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS
• Batch, Interactive & Real-Time Data Access: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Others (ISV Engines)
• Engine frameworks shown beneath the access layer: Tez (three times), Slider (twice)
• YARN: Data Operating System (Cluster Resource Management)
• HDFS (Hadoop Distributed File System)
• Security: Authentication, Authorization, Audit, Data Protection; Storage: HDFS; Resources: YARN; Access: Hive; Pipeline: Falcon; Cluster: Ranger; Cluster: Knox
• Operations: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
• Deployment Choice: Linux, Windows, On-Premises, Cloud

Modern Data Architecture

Deep Partnerships: Hortonworks engages in deep engineered relationships with the leaders in the data center, such as Microsoft, HP, Teradata, SAS, SAP & Red Hat.

Broad Partnerships: Over 600 partners work with us to certify their applications to work with Hadoop so they can extend big data to their users.

[Diagram: Sources (existing systems, clickstream, web & social, geolocation, sensor & machine, server logs, unstructured) feed a data system of RDBMS, EDW, HANA, and Hadoop, shown as YARN: Data Operating System over HDFS (Hadoop Distributed File System) across nodes 1 to N with batch, interactive, and real-time workloads. Applications are represented by BusinessObjects BI, alongside operational tools, dev & data tools, and infrastructure.]

Hadoop unlocks a new approach: Iterative Analytics

       

Current Reality: Apply schema on write
• Dependent on IT
• Repeatable Process: SQL Only
  1. Determine list of questions
  2. Design solutions
  3. Collect structured data
  4. Ask questions from list
  5. Detect additional questions

Augment w/ Hadoop: Apply schema on read
• Support range of access patterns to data stored in HDFS: polymorphic access
• Iterate over structure
• Transform and Analyze
• Right Engine, Right Job: batch, interactive, real-time, in-memory
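The contrast between schema on write and schema on read can be illustrated with a small, self-contained Python sketch. It is not Hadoop code and makes several assumptions: the raw events, the field names, and the helper `read_with_schema` are all invented for illustration, and a plain Python list stands in for files stored in HDFS. The point it shows is that the data is kept in its native form and each question brings its own schema at read time.

```python
import json
from typing import Callable, Dict, Iterable, Iterator

# Raw events kept exactly as they arrived; no schema was enforced on write.
# These records and field names are hypothetical.
RAW_EVENTS = [
    '{"user": "a1", "page": "/home", "ms": "120"}',
    '{"user": "b2", "page": "/cart", "ms": "340", "referrer": "ad"}',
    'not valid json',
    '{"user": "c3", "page": "/home"}',
]


def read_with_schema(
    lines: Iterable[str], schema: Dict[str, Callable]
) -> Iterator[dict]:
    """Apply a schema while reading: keep only the requested fields,
    cast each one, and use None for missing or uncastable values."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that cannot be parsed at all
        if not isinstance(record, dict):
            continue
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            try:
                row[field] = cast(value) if value is not None else None
            except (TypeError, ValueError):
                row[field] = None
        yield row


# Two different questions bring two different schemas to the same raw data.
latency_schema = {"page": str, "ms": int}
referrer_schema = {"user": str, "referrer": str}

if __name__ == "__main__":
    for row in read_with_schema(RAW_EVENTS, latency_schema):
        print(row)
    for row in read_with_schema(RAW_EVENTS, referrer_schema):
        print(row)
```

Because nothing was discarded or reshaped on ingest, a new question later only needs a new schema dictionary, not a reload of the source data.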

Hadoop delivers compelling economics

EDW Optimization

Current Reality (EDW workload: Operations 50%, ETL Process 30%, Analytics 20%)
• EDW at capacity: some usage from low value workloads
• Older data archived, unavailable for ongoing exploration
• Source data often discarded

Augment w/ Hadoop (EDW workload: Operations 50%, Analytics 50%; Hadoop: Parse, Cleanse, Apply Structure, Transform)
• Free up EDW resources from low value tasks
• Keep 100% of source data and historical data for ongoing exploration
• Mine data for value after loading it because of schema-on-read

Hadoop Enables Scalable Compute & Storage at a Compelling Cost Structure

[Chart: Fully-loaded Cost Per Raw TB of Data (Min–Max Cost), on a scale from $0 to $180,000, comparing MPP, SAN, Engineered System, NAS, Hadoop, and Cloud Storage. Hadoop is labeled Commodity Compute & Storage.]

How to Get Started with Hadoop

Try Hadoop Today

Download the Hortonworks Sandbox http://hortonworks.com/products/hortonworks-sandbox/

Learn Hadoop

Build a Proof of Concept

Test New Functionality

5 Reasons Hadoop Is Kicking Can and Taking Names

Hadoop’s momentum is unstoppable as its open source roots grow wildly into enterprises. Its refreshingly unique approach to data management is transforming how companies store, process, analyze, and share big data.

Forrester believes that Hadoop will become must-have infrastructure for large enterprises.

Here are five reasons firms should adopt Hadoop today:
1. Build a data lake with the Hadoop file system (HDFS)
2. Enjoy cheap, quick processing with MapReduce
3. Data scientists can wrangle big data faster
4. Even the POC can make you money
5. The future of Hadoop is real-time and transactional
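For readers new to the MapReduce model named in the second reason, the following is a minimal single-process sketch of its three steps (map, shuffle, reduce) applied to a word count. It is plain Python rather than the Hadoop API: the function names are illustrative, and a real Hadoop job would distribute the map and reduce tasks across cluster nodes and read its input from HDFS.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

Each map call looks at only one line and each reduce call at only one key, which is what lets a framework spread the work over many commodity machines.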

http://blogs.forrester.com/mike_gualtieri/13-10-22-5_reasons_hadoop_is_kicking_can_and_taking_names

Hadoop Summit 2015

Thank You! Eric Mizell - Director, Solutions Engineering [email protected]