Apache Hadoop Crash Course - HS16SJ

Post on 16-Apr-2017

2.005 views 3 download

Transcript of Apache Hadoop Crash Course - HS16SJ

ApacheHadoopCrashCourseRafaelCossDataEvangelist@racoss#FutureOfData

2 ©HortonworksInc.2011–2016.AllRightsReserved

AgendaFutureofData

TraditionalDataArchitectures

What’sApacheHadoop?

DataAccesswithHadoop

LabIntro

3 ©HortonworksInc.2011–2016.AllRightsReserved

CustomersarebuildingModernDataApplicationstotransformtheirindustries–renovatingtheirITarchitecturesandinnovatingwiththeirDatainMotionorDataatResttopoweractionableintelligence.

SocialMapping

PaymentTracking

FactoryYields

DefectDetection

CallAnalysis MachineData

ProductDesign M&A

DueDiligence

NextProductRecs

CyberSecurity

RiskModeling

AdPlacement

ProactiveRepair

DisasterMitigation

InvestmentPlanning

InventoryPredictions

CustomerSupport

SentimentAnalysis

SupplyChain

AdPlacement

BasketAnalysis Segments

Cross-Sell

CustomerRetention

VendorScorecards

OptimizeInventories

OPEXReduction

MainframeOffloads

HistoricalRecords

DataasaService

PublicData

Capture

FraudPrevention

DeviceDataIngest

RapidReporting

DigitalProtection

3 © HortonworksInc.2011–2016.AllRightsReserved

FutureofData

5 ©HortonworksInc.2011–2016.AllRightsReserved

INTERNETOF

ANYTHING

TheFutureofDataisaboutactionableintelligencederivedfromaconstantlyconnectedsocietywitheasysecureaccesstorichdatasetscomingfromtheInternetofAnything

DataPowersHighwaySafety

7 ©HortonworksInc.2011–2016.AllRightsReserved

TirePressure

Serverlog Mobile

Sensor

Location

Precipitation

Social

Click-stream

DataPowersHighwaySafety

8 ©HortonworksInc.2011–2016.AllRightsReserved

NewDataParadigmOpensUpNewOpportunity

2.8zettabytesin2012

44zettabytesin2020

N E W

1 zettabyte (ZB) = 1 million petabytes (PB); Sources: IDC, IDG Enterprise, and AMR Research

Clickstream

ERP,CRM,SCM

Web&social

Geolocation

InternetofThings

Server logs

Files, emails

Transformeveryindustryviafullfidelityofdataandanalytics

Opportunity

T R A D I T I O N A L

LAGGARDS

LEADERS

AbilitytoConsumeData

EnterpriseBlindSpot

9 ©HortonworksInc.2011–2016.AllRightsReserved

Whatdisruptedthedatacenter?

?

Data?

10 ©HortonworksInc.2011–2016.AllRightsReserved

ModernDataApplications

Polygot Persistence

SQLNoSQL

NewSQLSearch

Graph

At-Rest In-Motion

AnalyticsDataVariety

Integration

DataLake Federation

OptimizationStorage,Compute

DistributedComputing

CommodityHardware

Cloud

HybridDistributedComputing

11 ©HortonworksInc.2011–2016.AllRightsReserved

TheFutureofDataActionableIntelligence

D AT A I N M O T I O N

STORAG

ESTORAG

E

GROUP2GROUP1

GROUP4GROUP3

D A T A A T R E S T

INTERNETOF

ANYTHING

ConnectedDataPlatformsarepoweringActionableIntelligence

Anyandalldatafromsensors,machines,

geolocation,clicks,files,social.

Securepoint-to-pointandbi-directionaldataflows

Collectandcuratealldata.

12 ©HortonworksInc.2011–2016.AllRightsReserved

TraditionalDataArchitectures

13 ©HortonworksInc.2011–2016.AllRightsReserved

SystemsofIntelligence

SystemsofEngagements

SystemsofInteractions

DataSystems

13

SystemsofRecord

SystemsofInsight

EventsInGray

AnalyticsIn

Green

OperatorsDevelopers

14 ©HortonworksInc.2011–2016.AllRightsReserved

RDBMS

Sales

NoSQL

Unstructured

Visualization&Dashboards

BusinessAnalytics

DataMarts

DataMarts Archive

StatisticsOLAP

EDW

FileServer

ClickstreamLogs

Web&SocialLogs

AudioVideo

LogsLogs

Logs

Geolocation

JSON

ETL

POS CRM ERP

ECM

Filter

AppServer

MessageBus

Documents

15 ©HortonworksInc.2011–2016.AllRightsReserved

RDBMS

Sales

NoSQL

Unstructured

Visualization&Dashboards

BusinessAnalytics

DataMarts

DataMarts Archive

StatisticsOLAP

EDW

FileServer

ClickstreamLogs

Web&SocialLogs

AudioVideo

LogsLogs

Logs

Geolocation

JSON

ETL

POS CRM ERP

ECM

Filter

AppServer

MessageBus

Documents

à Tooexpensiveandslowasdatagrowthkeepsaccelerating

à Tooslowtogetthedatapreparedforanalytics

à Analyticsisonlyleveragingalimiteddataset

à Colddatabecomesarchivedandisnolongerusableforanalytics

à DataingestisrigidandslowfornewIoAT datatypes

à Limitedrealtimeinsights

TraditionalDataArchitectureChallengeswithBigData

16 ©HortonworksInc.2011–2016.AllRightsReserved

RDBMS

Sales

NoSQL

Unstructured

Visualization&Dashboards

BusinessAnalytics

DataMarts

DataMarts Archive

StatisticsOLAP

EDW

FileServer

ClickstreamLogs

Web&SocialLogs

AudioVideo

LogsLogs

Logs

Geolocation

JSON

ETL

POS CRM ERP

ECM

Filter

AppServer

MessageBus

Documents

17 ©HortonworksInc.2011–2016.AllRightsReserved

Next Generation AnalyticsIterative & ExploratoryData is the structure

IT TeamDelivers DataOn Flexible

Platform

BusinessUsers

Explore andAsk Any Question

Analyze ALL Available Information

Whole population analytics connects the dots

Traditional AnalyticsStructured & Repeatable

Structure built to store data

BusinessUsers

DetermineQuestions

IT TeamBuilds System

To AnswerKnown Questions

17

Available Information

AnalyzedInformation

Capacity constrained down sampling of available information

Carefully cleanse all information before any analysis

AnalyzedInformation

Analyze information as is & cleanse as needed

AnalyzedInformation

ModernDataApplications

18 ©HortonworksInc.2011–2016.AllRightsReserved

Next Generation AnalyticsIterative & ExploratoryData is the structure

Traditional AnalyticsStructured & Repeatable

Structure built to store data

18

?AnalyzedInformation

Question

DataAnswer

Hypothesis

StartwithhypothesisTestagainstselecteddata

Data leads the way Explore all data, identify correlations

Data

Correlation

All Information

Exploration

Actionable Insight

Analyzeafterlanding… Analyzeinmotion…

ModernDataApplicationsHasTwoThemes

What’sApacheHadoop?

20 ©HortonworksInc.2011–2016.AllRightsReserved

HadoopArchitecture

DataAccessEngines

DistributedReliableStorage

DistributedComputeFrameworkResourceManagement,DataLocalityDataOperatingSystem

Batch Interactive Real-time

Governance&

IntegrationSecurity

Applications

DeployAnywhere

21 ©HortonworksInc.2011–2016.AllRightsReserved

HadoopDataPlatformArchitecture

StoreandprocessallofyourCorporateDataAssets

YARN:DataOperatingSystem

DATA MANAGEMENT

Providelayeredapproachto

securitythroughAuthentication,Authorization,Accounting,andDataProtection

SECURITY

Access yourdatasimultaneously inmultiple ways(batch, interactive, real-time)

DATA ACCESS

Loaddataandmanage according

topolicy

GOVERNANCE & INTEGRATION

ENTERPRISEMGMT&SECURITY

Empowerexistingoperationsandsecuritytoolstomanage Hadoop

PRESENTATION&APPLICATION

Enablebothexistingandnewapplicationtoprovidevaluetotheorganization

Providedeploymentchoice acrosson-premise,appliance, virtualized,cloud

DEPLOYMENTOPTIONS

Deployandeffectivelymanage theplatform

OPERATIONS

22 ©HortonworksInc.2011–2016.AllRightsReserved runson

ETL

RDBMSImport/Export

DistributedStorage&ProcessingFramework

SecureNoSQL DB

SQLonHBase

NoSQL DB

WorkflowManagement

SQL

StreamingDataIngestion

ClusterSystemOperations

SecureGateway

DistributedRegistry

ETL

Search&Indexing

EvenFasterDataProcessing

DataManagement

MachineLearning

HadoopEcosystem

23 ©HortonworksInc.2011–2016.AllRightsReserved

OpenEnterpriseHadoopCapabilities

YARN : Data Operating System

DATA ACCESS SECURITYGOVERNANCE & INTEGRATION OPERATIONS

1 ° ° ° ° ° ° ° ° °

° ° ° ° ° ° ° ° ° °

°

N

Data Lifecycle & Governance

FalconAtlas

AdministrationAuthenticationAuthorizationAuditingData Protection

RangerKnoxAtlasHDFSEncryptionData Workflow

SqoopFlumeKafkaNFSWebHDFS

Provisioning, Managing, & Monitoring

AmbariCloudbreakZookeeper

Scheduling

Oozie

Batch

MapReduce

Script

Pig

Search

Solr

SQL

Hive

NoSQL

HBaseAccumuloPhoenix

Stream

Storm

In-memory

Spark

Others

ISV Engines

Tez Tez Slider Slider

DATA MANAGEMENT

HortonworksDataPlatform

DeploymentChoiceLinux Windows On-Premise Cloud

HDFS Hadoop Distributed File System

24 ©HortonworksInc.2011–2016.AllRightsReserved

HORTONWORKS DATAPLATFORM

DATAMGMT

HDP2.2Dec2014

HDP2.1April2014

HDP2.0Oct2013

HDP2.2Dec2014

HDP2.1April2014

HDP2.0Oct2013

2.2.0

2.4.0

2.6.0

OngoingInnovationinApache

HDFSYARNMapReduceHadoopCore

WhatisApacheHadoop?

Yahoo!2006

HortonworksOct2011

Yahoo!startfocusonmultipleHadoopapps&clustersContributesHadooptoApache

2008

HDP1.0Oct2012

ApacheHadoopv2YARN

GooglepublishesGFS&MapReduce papers2004-2005

HDP 2.4March2016

2.7.1

HDP2.2Dec2014HDP2.3July2015 2.7.1

25 ©HortonworksInc.2011–2016.AllRightsReserved

`

+ /directory/structure/in/memory.txt

Resource management + schedulingDisk, CPU, Memory

CoreNameNode

HDFS

ResourceManagerYARN

Hadoop daemon

User application

NN

RM

DataNodeHDFS

NodeManagerYARN

Worker Node

26 ©HortonworksInc.2011–2016.AllRightsReserved

HDFS:Scalable,ReliableandSecureStoragePlatformTheStoragePlatformforHadoop2.0ScalableHorizontallygrowasdatavolumesgrow,addingoneormultiple nodesatatime

ReliableHighlyavailable(HA)andfaulttoleranttoprotectagainstdatalossandcorruption

CostEffectiveLeverageCommodityHardwareCrossworkloadaccess

SecureStrongaccess controls,integratedwithauthenticationmechanisms

Granulardataaccesscontrolstodatasets acrossusersandgroupsProtectsdataoverthewireandatrest

HDFS

YARN: Data Operating System

C A B C B B A C

B A B A C A

Standards Based Data Interfaces

NFSSource /

Destination

REST

RPC

Source / Destination

Source / Destination

Ingestandstoreanydatainanyformat

Flexiblereadaccess enablesavarietyofworkloads

27 ©HortonworksInc.2011–2016.AllRightsReserved

Heterogeneous Storage

Before• DataNodeisasinglestorage• Storageisuniform-OnlystoragetypeDisk• Storagetypeshiddenfromthefilesystem

New Architecture• DataNodeisacollectionofstorages• Supportdifferenttypesofstorages

– Disk,SSDs,Memory

Alldisksasasinglestorage

S3SwiftSANFilers

Collectionoftieredstorages

28 ©HortonworksInc.2011–2016.AllRightsReserved

Hadoop Distributed File System (HDFS)

Fault Tolerant Distributed Storage• Dividefilesintobigblocksanddistribute3copiesrandomlyacrossthecluster• ProcessingDataLocality

• NotJuststoragebutcomputation

10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111010

0

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44

29 ©HortonworksInc.2011–2016.AllRightsReserved

Batch Processing in HadoopMapReduceBatch Access to DataOriginal data access mechanism for Hadoop

• FrameworkMadefordevelopingdistributed applications toprocessvastamountsofdatain-parallelonlargeclusters

• ProvenReliable interfacetoHadoopwhichworksfromGBtoPB.But,batchoriented– Speedisnotit’sstrongpoint.

• EcosystemPortedtoHadoop2torunonYARN.Supportsoriginalinvestments inHadoopbycustomersandpartnerecosystem.

DataNode1

Mapper

Dataisshuffledacrossthenetwork

&sorted

MapPhase Shuffle/Sort ReducePhase

MapReduce JobLifecycle

SayingthatMapReduce isdeadispreposterous- Wouldlimitsustoonlynewworkloads- ALLHadoop clustersusemapreduce

- ProvenatEnterpriseScale

DataNode2

Mapper

DataNode3

Mapper

DataNode1

Reducer

DataNode2

Reducer

DataNode3

Reducer

YARN:DataOperatingSystem

Interactive Real-TimeBatch

30 ©HortonworksInc.2011–2016.AllRightsReserved

What is MapReduce?Break a large problem into sub-solutionsMap

• Iterate over a large # of records

• Extract something of interest fromeach record

Shuffle

• Sort Intermediate results

Reduce

• Aggregate, summarize, filter or transform intermediate results

• Generate final output

MapProcess

MapProcess

MapProcess

MapProcess

Data

DataData

Data

DataData

DataData

DataData

Data

DataData MapProcess

ReduceProcess

ReduceProcess

Data

Read&ETL

Shuffle&Sort Aggregation

Data

DataData

Data

Data

Data

Data

Data

31 ©HortonworksInc.2011–2016.AllRightsReserved

1st GenHadoop:CostEffectiveBatchatScale

HADOOP1.0BuiltforWeb-ScaleBatchApps

SingleAppBATCH

HDFS

SingleAppINTERACTIVE

SingleAppBATCH

HDFS

SiloscreatedfordistinctusecasesSingleApp

BATCH

HDFS

SingleAppONLINE

32 ©HortonworksInc.2011–2016.AllRightsReserved

HadoopemergedasfoundationofnewdataarchitectureApacheHadoopisanopensourcedataplatformformanaginglargevolumesofhighvelocityandvarietyofdata

• BuiltbyYahoo!tobetheheartbeatofitsad&searchbusiness

• DonatedtoApacheSoftwareFoundationin2005withrapidadoptionbylargewebproperties&earlyadopterenterprises

• Incrediblydisruptivetocurrentplatformeconomics

TraditionalHadoopAdvantages

ü Managesnewdataparadigmü Handlesdataatscaleü Costeffectiveü Opensource

TraditionalHadoopHadLimitations

Batch-onlyarchitectureSinglepurposeclusters,specificdatasetsDifficulttointegratewithexistinginvestmentsNotenterprise-grade

Application

StorageHDFS

Batch ProcessingMapReduce

33 ©HortonworksInc.2011–2016.AllRightsReserved

YARNextendsHadoopintodatacenterleaders

YARNThe Architectural Center of Hadoop

• Common data platform, many applications

• Support multi-tenant access & processing

• Batch, interactive & real-time use cases

• Supports 3rd-party ISV tools

(ex. SAS, Syncsort, Actian, etc.)

YARN Ready Applications Facilitates ongoing innovation and enterprise adoption via ecosystem of new and existing “YARN Ready” solutions

YARN : Data Operating System

BATCH, INTERACTIVE & REAL-TIME DATA ACCESS

1 ° ° ° ° ° ° ° ° °

° ° ° ° ° ° ° ° ° °

°

N

HDFS Hadoop Distributed File System

DATA MANAGEMENT

Batch

MapReduce

Script

Pig

Search

Solr

SQL

Hive

NoSQL

HBaseAccumuloPhoenix

Stream

Storm

In-memory

Spark

Others

ISV Engines

Tez Tez Slider Slider

34 ©HortonworksInc.2011–2016.AllRightsReserved

WhatdoesiOS 6andWindows3.1haveincommon?

35 ©HortonworksInc.2011–2016.AllRightsReserved

HadoopBeyondBatchwithYARN

SingleUseSysztemBatchApps

MultiUseDataPlatformBatch,Interactive,Online,Streaming,…

Ashiftfromtheoldtothenew…

HADOOP 1

MapReduce(cluster resource management

& data processing)

Data FlowPig

SQLHive

Others

API,Engine,

andSystem

YARN(Data Operating System: resource management, etc.)

Data FlowPig

SQLHive

OtherISV

Apache Yarn as a Base

System

Engine

API’s

1 ° ° ° ° °

° ° ° ° ° N

HDFS (redundant, reliable storage)

1 ° ° ° ° ° ° ° ° °

° ° ° ° ° ° ° ° ° N

HDFS (redundant, reliable storage)

BatchMapReduce

Tez Tez

MapReduce as the BaseHADOOP 2

36 ©HortonworksInc.2011–2016.AllRightsReserved

ArchitectureEnabledbyYARNAsinglesetofdataacrosstheentireclusterwithmultipleaccessmethodsusing“zones”forprocessing

1 ° ° ° ° ° ° °

° ° ° ° ° ° ° °

° ° ° ° ° ° ° n

SQLHive

InteractiveSQLQueryforAnalytics

Pig

Script-basedETLAlgorithmexecutedinbatchtoreworkdatausedbyHiveandHBaseconsumers

• Maximize compute resources to lower TCO

• No standalone, silo’d clusters

• Simple management & operations

…all enabled by YARN

StreamProcessingStorm

Identify&actonreal-timeevents

NoSQLHbase

Accumulo

Low-latencyaccessservingupawebfrontend

37 ©HortonworksInc.2011–2016.AllRightsReserved

HadoopWorkloadEvolution

SingleUseSystemBatchApps

MultiUseDataPlatformBatch,Interactive,Online,Streaming,…

Ashiftfromtheoldtothenew… MultiUsePlatformData&Beyond

HADOOP 1

YARN

HADOOP 2

1 ° ° ° °

° ° ° ° N

HDFS (redundant, reliable storage)

1 ° ° °

° ° ° N

HDFS

MapReduce

HADOOP.Next

YARN ‘

1 ° ° ° ° ° °

° ° ° ° ° ° N

HDFS (redundant, reliable storage)

DATA ACCESS APPS

Docker

MySQLMR2 Others(ISV Engines)

Multiple(Script, SQL, NoSQL, …)

MR2 Others(ISV Engines)

Multiple(Script, SQL, NoSQL, …)

Docker

Tomcat

Docker

Other

38 ©HortonworksInc.2011–2016.AllRightsReserved

Gartner:WhatisHadoop?

à CommonApacheProjects– ALL=7(6)– Exceptfor1=3(5)– Exceptfor2=4(4)² About14CommonProjects

à UncommonProjects– Only1=9(1)– Only2=7 (2)– Only3=6 (3)² About22UncommonProjects

http://blogs.gartner.com/merv-adrian/2015/07/02/now-what-is-hadoop/

ODPi

ODPi

ODPi

ODPi

ODPi ODPi ODPi

Page 39 © Hortonworks Inc. 2011 – 2015. All Rights Reserved

HORTONWORKS DATA PLATFORM

Hado

op&

YARN

Flum

e

Ooz

ie

HDP 2.3 is Apache Hadoop; not “based on” Hadoop

Pig

Hive

Tez

Sqoo

p

Clou

dbre

ak

Amba

ri

Slid

er

Kafk

a

Knox

Solr

Zook

eepe

r

Spar

k

Falc

on

Rang

er

HBas

e

Atla

s

Accu

mul

o

Stor

m

Phoe

nix

4.10.2

DATA MGMT DATA ACCESS GOVERNANCE & INTEGRATION OPERATIONS SECURITY

HDP 2.2Dec 2014

HDP 2.1April 2014

HDP 2.0Oct 2013

HDP 2.2Dec 2014

HDP 2.1April 2014

HDP 2.0Oct 2013 0.12.0 0.12.0

0.12.1 0.13.0 0.4.0

1.4.4 1.4.4 3.3.23.4.5

0.4.00.5.0

0.14.0 0.14.0 3.4.6 0.5.0 0.4.00.9.30.5.2

4.0.04.7.2

1.2.1 0.60.0 0.98.4 4.2.0 1.6.1 0.6.0 1.5.21.4.5 4.1.02.0.0

1.4.0 1.5.1 4.0.0

1.3.1

1.5.1 1.4.4 3.4.5

2.2.0

2.4.0

2.6.0

2.7.1 1.4.6 1.0.0 0.6.0 0.5.02.1.00.8.2 3.4.61.5.25.2.1 0.80.0 0.5.01.7.04.4.0 0.10.0 0.6.10.7.01.2.10.15.0HDP 2.3Oct 2015 4.2.0

0.96.1

0.98.0 0.9.1

0.8.1

1.4.1 1.1.2

2.7.1 1.4.6 1.3.0 0.9.0 0.6.02.4.00.10.0 3.4.61.5.25.5.1 0.80.0 0.7.01.7.04.7.0 1.0.1 0.10.00.7.01.2.10.16.0HDP 2.5*2H2016 4.2.01.6.2 1.1.2

2.7.1 1.4.6 1.1.0 0.6.0 0.5.02.2.10.9.0 3.4.61.5.25.2.1 0.80.0 0.5.01.7.04.4.0 0.10.0 0.6.10.7.01.2.10.15.0HDP 2.4Mar 2016 4.2.01.6.0 1.1.2

Zepp

elin

Ongoing Innovation in Apache

0.6.0

* HDP 2.5 – Shows current Apache branches being used. Final component version subject to change based on Apache release process.

40 ©HortonworksInc.2011–2016.AllRightsReserved

NextGenerationDataVendorsInvestmentfortheEnterprise

VerticalIntegration with YARN and HDFSEnsure engines can run reliably and respectfully in a YARN based cluster

Provision, Manage & Monitor

AmbariZookeeper

Scheduling

Oozie

Loaddataandmanageaccordingtopolicy

Providelayeredapproachto

securitythroughAuthentication,Authorization,Accounting,andDataProtection

SECURITYGOVERNANCE

Deployandeffectivelymanage theplatform

° ° ° ° ° ° ° ° ° ° ° ° ° ° °

Script

Pig

SQL

Hive

JavaScala

Cascading

Stream

Storm

Search

Solr

NoSQL

HBaseAccumulo

BATCH, INTERACTIVE & REAL-TIME DATA ACCESS

In-Memory

Spark

Others

ISV Engines

1 ° ° ° ° ° ° ° ° ° ° ° ° ° °

YARN: Data Operating System(ClusterResourceManagement)

HDFS (Hadoop Distributed File System)

Tez Slider SliderTez Tez

OPERATIONS

Horizontal Integration for Enterprise ServicesEnsure consistent enterprise services are applied across the Hadoop stack

41 ©HortonworksInc.2011–2016.AllRightsReserved

Whatdodistributionsdo?

à Defineastackofcomponents• RichandlatestsetofApacheProjects(opensource&opencommunity)withoutlockin

à VerticalandHorizontalintegrationofcomponents• Vertical:BestSpeedandScale• Horizontal:OpenEnterpriseReady

à ProvisionandUpgradestack• Robust,EasyandAnywhere

à Acceleratetimetovalue(easyofuse)• NewFaceofHadoopwithUis fromAmbari,AmbariViews,Ranger,Falcon,Atlas

à PartnerEcosystem• RichandDeep

à Support• Industry’sbest,SmartSenseandinfluencecommunity

HadoopOperations&Tools

43 ©HortonworksInc.2011–2016.AllRightsReserved

How Do You Operate a Hadoop Cluster?

Apache™ Ambari isaplatformtoprovision,manageandmonitorHadoopclusters

44 ©HortonworksInc.2011–2016.AllRightsReserved

Ambari Core Features and Extensibility

Install&Configure

Operate,Manage&Administer

Develop

Optimize&Tune

Developer

DataArchitect

Ambariprovidescoreservicesforoperations,developmentandextensionspointsforboth

ExtensibilityFeatures

Stacks,Blueprints&RESTAPIs

CoreFeatures

InstallWizard&Web

Web,OperatorViews,Metrics&Alerts

UserViews

UserViews

ViewsFramework&RESTAPIs

ViewsFramework

ViewsFramework

How?

ClusterAdmin

45 ©HortonworksInc.2011–2016.AllRightsReserved

Newuserinterfaceenablesfast&easySQLdefinitionandexecution.

46 ©HortonworksInc.2011–2016.AllRightsReserved

New User Views for DevOps

CapacitySchedulerViewBrowseandmanageYARNqueues

Tez ViewViewinformationrelatedtoTez jobsthatareexecuting onthecluster

47 ©HortonworksInc.2011–2016.AllRightsReserved

NewUserViewsforDevelopment

PigViewAuthorandexecute PigScripts.

HiveViewAuthor,execute anddebugHive

queries.

FilesViewBrowseHDFSfilesystem.

48 ©HortonworksInc.2011–2016.AllRightsReserved

ApacheZeppelin

• Web-basednotebookfordataengineers,dataanalystsanddatascientists• Bringsinteractivedataingestion,data

exploration,visualization,sharingandcollaborationfeaturestoHadoopandSpark

• Moderndatasciencestudio• ScalawithSpark• PythonwithSpark• SparkSQL• ApacheHive,andmore.

HadoopDataAccess

50 ©HortonworksInc.2011–2016.AllRightsReserved

Access patterns enabled by YARN

YARN: Data Operating System

1 ° ° ° ° ° ° ° ° °

° ° ° ° ° ° ° ° °

°

°N

HDFS Hadoop Distributed File System

Interactive Real-TimeBatch

Applications BatchNeeds to happen but, no timeframe limitations

InteractiveNeeds to happen at Human time

Real-Time Needs to happen at Machine Execution time.

51 ©HortonworksInc.2011–2016.AllRightsReserved

Apache Hive: SQL in Hadoop

• Created by a team at Facebook

• Provides a standard SQL interface to data stored in Hadoop

• Quickly find value in raw data files

• Proven at petabyte scale

• Compatible with ALL major BI tools such as Tableau, Excel, MicroStrategy, Business Objects, etc…

SensorMobile

WeblogOperational

/MPP

SQLQueries

52 ©HortonworksInc.2011–2016.AllRightsReserved

Hive and the Stinger Initiative

BaseOptimizationsGeneratesimplifiedDAGsIn-memoryHashJoins

VectorQueryEngineOptimizedformodernprocessor

architectures

TezExpresstasksmoresimply

Eliminate diskwritesPre-warmedContainers

ORCFileColumnStore

HighCompressionPredicate/FilterPushdowns

YARNNext-genHadoopdataprocessing

framework

+ +

QueryPlannerIntelligentCost-BasedOptimizer

PerformanceOptimizations100x+fastertimetoinsightDeeperanalyticalcapabilities

53 ©HortonworksInc.2011–2016.AllRightsReserved

Stinger.next andSub-SecondSQL

Emergence of LLAP brings Sub-Second SQL response times within reach with Hive.

BATCH & INTERACTIVE BATCH & INTERACTIVE BATCH, INTERACTIVE & SUB-SECONDSPEED

DELIVERY

SQL

UPDATES

ENGINES

STINGERD E L I V E R E D

PROGRESSD E L I V E R E D FINALVERSION

HDP 2.1VERSION

0.13VERSION

HDP 2.3VERSION

1.2.1

SQL:2003+ SQL:2011 SUBSET

READ-ONLY SQL INSERT/UPDATE/DELETE

MR, TEZ MR, TEZ

F U T U R ES T I N G E R N E X T

COMPLETE ACID SUPPORT INCLUDING MERGE

COMPREHENSIVE SQL:2011 BASED ANALYTICS

MR, TEZ, LLAP

DELIVERED IN DEVELOPMENT

TieredDataStorage

Stinger.next Phase3

YARN:ContainerizedApplications

54 ©HortonworksInc.2011–2016.AllRightsReserved

DataTypes SQL Features File Formats Latest Additions…Numeric CoreSQLFeatures Columnar ScalableCrossProduct

FLOAT/DOUBLE Date,Time andArithmeticalFunctions ORCFile PrimaryKey/Foreign KeyDECIMAL INNER,OUTER,CROSSandSEMIJoins Parquet Non-EquijoinINT/TINYINT/SMALLINT/BIGINT DerivedTableSubqueries

TextTechPreview:Proc.Extensions(PL/SQL)

BOOLEAN Correlated+ UncorrelatedSubqueries CSV FutureString UNIONALL Logfile ACIDMERGE

CHAR/VARCHAR UDFs, UDAFs,UDTFs Nested/Complex MultiSubquerySTRING CommonTableExpressions Avro Comparison tosub-selectBINARY UNIONDISTINCT JSON INTERSECT andEXCEPT

Date, Time AdvancedAnalytics XMLDATE OLAPandWindowing Functions CustomFormatsTIMESTAMP CUBE andGrouping Sets OtherFeaturesIntervalTypes NestedDataAnalytics XPath Analytics

ComplexTypes NestedDataTraversalARRAY LateralViewsMAP ACIDTransactionsSTRUCT INSERT/UPDATE/DELETEUNION

ApacheHive:JourneytoSQL:2011Analytics

LegendExisting

Future

NewwithHive2.0

55 ©HortonworksInc.2011–2016.AllRightsReserved

Stor

age

Columnar Storage

ORCFile Parquet

Unstructured Data

JSON CSV

Text Avro

Custom

Weblog

Engi

ne

SQL Engines

RowEngine VectorEngineSQ

L

SQL Support

SQL:2011 Optimizer HCatalog HiveServer2

Cac

he

Block Cache

LinuxCache

Dis

tribu

ted

Exe

cutio

n

Hadoop 1

MapReduce

Hadoop 2

Tez Spark

Vector Cache

LLAP

Persistent Server

Historical

Current

In-development

Legend

Apache Hive: Modern Architecture

56 ©HortonworksInc.2011–2016.AllRightsReserved

ApacheTezisacriticalinnovationoftheStingerInitiative.

• Along with YARN, Tez not only improves Hive, but improvesallthingsbatchand interactiveforHadoop;Pig,Cascading…

• More Efficient Processing than MapReduce

• Reduceoperationsandcomplexityofbackendprocessing• AllowsforMapReduceReducewhichsavesharddiskoperations• Implementsa“service”whichisalwayson,decreasingstarttimesofjobs• AllowsCachingofDatainMemory

YARN

Dev

Cascading/ Scalding

WhyisTez Important?

°1 ° ° ° ° ° ° ° °

° ° ° ° ° ° ° ° °

°

°°

° ° ° ° ° ° °

° ° ° ° ° ° N

HDFS (Hadoop Distributed File

System)

Scripting

Pig

SQL

Hive

Tez Tez

Applications

Tez

YARN:DataOperatingSystem

Interactive Real-TimeBatch

57 ©HortonworksInc.2011–2016.AllRightsReserved

ApacheTez

Hive– MapReduce Hive– Tez

SELECT a.state, COUNT(*), AVG(c.price) FROM a

JOIN b ON (a.id = b.id)JOIN c ON (a.itemId = c.itemId)

GROUP BY a.state

SELECTa.state

JOIN(a,c)SELECTc.price

SELECTb.id

JOIN(a, b)GROUPBYa.state

COUNT(*)AVG(c.price)

M M M

R R

M M

R

M M

R

M M

R

HDFS

HDFS

HDFS

M M M

R R

R

M M

R

R

SELECTa.state,c.itemId

JOIN(a,c)

JOIN(a, b)GROUPBYa.state

COUNT(*)AVG(c.price)

SELECTb.id

Tez avoidsunneededwritestoHDFS

58 ©HortonworksInc.2011–2016.AllRightsReserved

Scripting Data Pipeline & ETLApache Pig

• Dataflowengineandscriptinglanguage(PigLatin)• Allowsyoutotransformdataanddatasets

Advantages over MapReduce• Reducestimetowritejobs• Communitysupport• PiggybankhasasignificantnumberofUDF’stohelpadoption• TherearealargenumberofexistingshopsusingPIG

YARN:DataOperatingSystem

Interactive Real-TimeBatch

59 ©HortonworksInc.2011–2016.AllRightsReserved

PigLatin

• Pigexecutesinauniquefashion:oDuringexecution,eachstatementisprocessedbythePiginterpretero Ifastatementisvalid,itgetsaddedtoalogicalplan builtbytheinterpreter

oThestepsinthelogicalplandonotactuallyexecuteuntilaDUMPorSTOREcommandisused

60 ©HortonworksInc.2011–2016.AllRightsReserved

WhyusePig?

• Maybewewanttojointwodatasets,fromdifferentsources,onacommonvalue,andwanttofilter,andsort,andgettop5sites

61 ©HortonworksInc.2011–2016.AllRightsReserved

ResourceManagement

Storage

Elegant Developer APIsDataFrames, Machine Learning, and SQL

Made for Data ScienceAll apps need to get predictive at scale and fine granularity

Democratize Machine LearningSpark is doing to ML on Hadoop what Hive did for SQL on Hadoop

CommunityBroad developer, customer and partner interest

Realize Value of Data Operating SystemA key tool in the Hadoop toolbox

ApacheSparkenthusiasm

Applications

SparkCoreEngine

ScalaJavaPythonlibraries

MLlib(Machinelearning)

SparkSQL*

SparkStreaming*

SparkCoreEngine

62 ©HortonworksInc.2011–2016.AllRightsReserved

Apache Spark & Apache Hadoop Perfect Together

General Purpose Data Access Engineforfast,large-scaledataprocessing

Designed for Iterative, In-Memorycomputationsandinteractivedatamining

Expressive Multi-Language APIsforJava,Scala,PythonandR

Built-in LibrariesEnabledataworkerstorapidlyiterateoverdatafor:ETL,MachineLearning,SQLandStreamprocessing

YARN

ScalaJava

PythonR

APIs

Spark Core Engine

Spark SQL

Spark StreamingMLlib GraphX

1 ° ° ° ° ° ° ° ° °

° ° ° ° ° ° ° ° ° °

°

NHDFS

63 ©HortonworksInc.2011–2016.AllRightsReserved

Apache Projects Enable Access Patterns

Various open source projects have incubated in order to meet these access pattern needs

Today, they can all run on a single cluster on a single set of data because of YARN

All powered by a broad open community

YARN: Data Operating System

1 ° ° ° ° ° ° ° ° °

° ° ° ° ° ° ° ° °

°

°N

HDFS Hadoop Distributed File System

InteractiveSolr

SparkHivePig

Real-TimeHBase

AccumuloStorm

BatchMapReduce

Applications

Kafka

64 ©HortonworksInc.2011–2016.AllRightsReserved

ConnectedDataPlatforms

ConnectedDataPlatformsEnableArchitecturalTransformations

DatainMotion(Cloud)

DatainMotion

(on-premises)

DataatRest

(on-premises)

EdgeData

DatainMotion

EdgeAnalytics

DataatRest(Cloud)

EdgeData

DataatRest

(on-premises)

ClosedLoop

Analytics

MachineLearning

DeepHistoricalAnalysis

Must-haveConsiderationsforTechnology

ContinuousDataLifeCycle

Real-timeinsightsfromorigintorest

EnterpriseReady

ManagementSecurity

Governance

DeploymentFlexibility

OnPremiseCloudHybrid

OpenInnovation

ArchitectureCommunityEcosystem

HandsonLabOverview

HDP2.4Sandbox

à ProvidesFreepreconfiguredHDP– RunsinaVirtualMachineor

AzureHortonworks.com/sandbox

à EasytoUse– Operations

• Ambari– DevandDevOps

• AmbariUserViews– WebNotebook

• Zeppelin

à Workswith60+FreetutorialHortonworks.com/tutorials

DataDiscoveryLab• ElefanteWineCompanyhasafleetofover100trucks.

• Thegeolocationdatacollectedfromthetruckscontainseventsgeneratedwhilethetruckdriversaredriving.

• Thecompany’sgoalwithHadoopistoMitigateRisk:o Understandcorrelationsbetweenmilesdrivenandeventso Computetheriskfactorforeachdriverbasedonmileage&events

o LabEnvo Sandbox2.4

o LabDoco URL:http://goo.gl/14OAato LoadDatao QueryDatao ProcessData

Elefante Wine Current Challenges

The CompanyElefante Wine is a boutique wine fulfillment company with a large fleet of trucks. It delivers wine in a highly-regulated industry with stringent transportation requirements.

The SituationRecently a number of driver violations led to fines and increased insurance rates

The Challenges• Rising Operational Costs• Driver Safety• Risk Management• Logistics Optimization

© Hortonworks Inc. 2012 Professional Services

ElefanteWineCompanyhasalargefleetoftrucksinUSA

Atruckgeneratesmillionsofeventsforagivenroute;aneventcouldbe:

§ 'Normal'events:starting/stoppingofthevehicle

§ ‘Violation’events:speeding,excessiveaccelerationandbreaking,unsafetaildistance

Companyusesanapplicationthatmonitorstrucklocationsandviolationsfromthetruck/driverinreal-timetocalculaterisk

Route?Truck?Driver?

Analystsqueryabroadhistorytounderstandiftoday’sviolations arepartofalargerproblemwithspecificroutes,trucks,ordrivers

Elefante Wine Risk and Driver Safety Challenges

Trucksoutfittedwithnewsensorsgeneratinglargevolumesofnewdata:

• Location

• Speed

• DriverViolations

Needtobeintegratereal-time&historicaldata

Increase safety and reduce liabilitiesAnticipate driver violations BEFORE they happen and take precautionary actionsFindpredictivecorrelationsindriverbehavioroverlargevolumesofreal-timedata

Difficult to deliver timely insights to the right people and systems to take action

Data DiscoveryUncover new findings

Predictive Analytics Identify your next best action

Better Understandingof the Past

Better Prediction of the Future

What’sourgoal?

à Solution:– CollectadditionaldataviasensorsintruckstobetterunderstandRiskFactors

à How:– Quicklystorenewsensordatainacommonrepository– Preparethedataforanalysis– Explorethedata– CalculateRisk– Generateareport

Move Data Into Hadoop

Geolocation.csv

trucks.csv

Geolocation_stage Geolocation

Trucks_stage Trucks

csv

csv ORC

ORCSQL

SQL

move

LOAD

Geolocation

Trucks

ORC

ORC

SQL

SQL

PIGorSparkRisk Calculation

Truck_mileage

ORC

Avg_mileage

ORC

DriverMileage

ORC

RiskFactor

ORC

Events

ORC

Trucking Risk Analysis – Hadoop ELT

CalculateRisk

GettingStartedResources

78 ©HortonworksInc.2011–2016.AllRightsReserved

developer.hortonworks.com

79 ©HortonworksInc.2011–2016.AllRightsReserved

HortonworksNourishestheCommunityHOR TONWORKS

C OMMUN I TY C ONNE C T I ONHOR TONWOR KS PA R TN ERWORKS

https://community.hortonworks.com

80 ©HortonworksInc.2011–2016.AllRightsReserved

Thankyou!

rafael@hortonworks.com

@racoss