Introduction to the Hadoop EcoSystem

Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Hortonworks – The Hadoop Ecosystem Fall 2014 Powering the Modern Data Architecture Shivaji Dutta – Sr. Partner Solutions Engineer

Transcript of Introduction to the Hadoop EcoSystem

Page 1: Introduction to the Hadoop EcoSystem


Hortonworks – The Hadoop Ecosystem

Fall 2014

Powering the Modern Data Architecture

Shivaji Dutta – Sr. Partner Solutions Engineer

Page 2: Introduction to the Hadoop EcoSystem


Agenda

• Apache Hadoop and Hortonworks Data Platform (HDP)

• HDP and Couchbase

• What’s new in HDP?

Page 3: Introduction to the Hadoop EcoSystem


What is Hadoop

Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.

Page 4: Introduction to the Hadoop EcoSystem


Projects in Hadoop

Hadoop Core

– Hadoop Common

– Hadoop Distributed File System

– Hadoop YARN

– Hadoop MapReduce

Other Hadoop Key Projects

• Hive

• HBase

• Spark

• Pig

• Tez

• ZooKeeper
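
As a rough illustration of the MapReduce model listed under Hadoop Core, the sketch below imitates the map, shuffle and reduce phases in plain Python on a single machine. It does not use Hadoop itself, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are invented for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

On a real cluster the map and reduce calls would run in parallel on many nodes, with the shuffle moving data between them over the network.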

Page 5: Introduction to the Hadoop EcoSystem


HDP delivers a comprehensive data management platform

Hortonworks Data Platform 2.2

[Diagram: HDFS (Hadoop Distributed File System) at the base; YARN: Data Operating System (Cluster Resource Management) above it. Batch, interactive & real-time data access engines on YARN: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Others (ISV Engines), with Tez and Slider as intermediate layers. Governance (data workflow, lifecycle & governance): Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS. Operations: provision, manage & monitor with Ambari and ZooKeeper; scheduling with Oozie. Security: authentication, authorization, accounting and data protection; storage: HDFS; resources: YARN; access: Hive, …; pipeline: Falcon; cluster: Knox, Ranger. Deployment choice: Linux, Windows, on-premises, cloud.]

YARN is the architectural center of HDP

Enables batch, interactive and real-time workloads

Provides comprehensive enterprise capabilities

The widest range of deployment options

Delivered Completely in the OPEN

Page 6: Introduction to the Hadoop EcoSystem


HDFS and YARN – The Core of Hadoop

The core components of HDP are YARN and Hadoop Distributed Filesystem (HDFS).

YARN is the architectural center of Hadoop that enables you to process data simultaneously in multiple ways. YARN provides the resource management and pluggable architecture for enabling a wide variety of data access methods.

HDFS provides the scalable, fault-tolerant, cost-efficient storage for big data.
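
To make the storage side concrete, here is a small back-of-the-envelope sketch of how a file maps onto HDFS blocks and replicas. It assumes the usual Hadoop 2.x defaults of a 128 MB block size (`dfs.blocksize`) and a replication factor of 3 (`dfs.replication`); a given cluster may be configured differently, and the helper name `hdfs_footprint` is made up for illustration.

```python
import math

BLOCK_SIZE_MB = 128  # assumption: default dfs.blocksize in Hadoop 2.x
REPLICATION = 3      # assumption: default dfs.replication


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block replicas, raw MB stored) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # The last block only occupies its actual size, so raw usage is size x replication.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb
```

Under these assumptions a 1000 MB file is split into 8 blocks, stored as 24 block replicas, and takes about 3000 MB of raw disk across the cluster. Losing one node leaves at least two copies of each block, which is where the fault tolerance comes from.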

Page 7: Introduction to the Hadoop EcoSystem


YARN extends Hadoop into data center leaders

YARN: The Architectural Center of Hadoop

• Common data platform, many applications

• Supports multi-tenant access & processing

• Batch, interactive & real-time use cases

• Supports 3rd-party ISV tools (e.g. SAS, Syncsort, Actian)

YARN Ready Applications: Facilitate ongoing innovation and enterprise adoption via an ecosystem of new and existing “YARN Ready” solutions

[Diagram: batch, interactive & real-time data access engines (Pig, Hive, Cascading, Storm, Solr, HBase, Accumulo, Spark, ISV engines) on YARN: Data Operating System (Cluster Resource Management) and HDFS, with Tez and Slider as intermediate layers.]

Page 8: Introduction to the Hadoop EcoSystem


Data Access

YARN provides the foundation for a versatile range of processing engines that empower you to interact with the same data in multiple ways, at the same time.

This means applications can interact with the data in the best way: from batch to interactive SQL or low latency access with NoSQL.

Emerging use cases for data science, search and streaming are also supported with Apache Spark, Solr and Storm.

Additionally, ecosystem partners provide even more specialized data access engines for YARN.

[Diagram: batch, interactive & real-time data access engines (Pig, Hive, Cascading, Storm, Solr, HBase, Accumulo, Spark, ISV engines) on YARN: Data Operating System (Cluster Resource Management) and HDFS, with Tez and Slider as intermediate layers.]

Page 9: Introduction to the Hadoop EcoSystem


Data Governance and Integration

• HDP extends data access and management with powerful tools for data governance and integration.

• They provide a reliable, repeatable, and simple framework for managing the flow of data in and out of Hadoop. This control structure, along with a set of tooling to ease and automate the application of schema or metadata on sources, is critical for successful integration of Hadoop into your modern data architecture.

• Apache Sqoop

• Apache Oozie

• Apache Falcon

• Apache Flume
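
As one example of moving data into Hadoop, the sketch below assembles the argument list for a `sqoop import` run that copies a relational table into an HDFS directory. It only builds the command and executes nothing. The JDBC URL, table, user and target directory are hypothetical placeholders, and the helper name `sqoop_import_command` is invented for this example.

```python
import shlex


def sqoop_import_command(jdbc_url, table, target_dir, username, mappers=4):
    """Build the argument list for a `sqoop import` run (nothing is executed)."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "-P",  # prompt for the password instead of embedding it
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(mappers),  # number of parallel map tasks
    ]


# Hypothetical source database and HDFS destination.
cmd = sqoop_import_command(
    "jdbc:mysql://db.example.com/sales", "orders", "/data/raw/orders", "etl_user"
)
print(" ".join(shlex.quote(part) for part in cmd))
```

In practice a command like this would typically be wrapped in an Oozie workflow or a Falcon feed so that it runs on a schedule rather than by hand.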

Page 10: Introduction to the Hadoop EcoSystem


Security

• Authentication/ Authorization and Encryption

• Kerberos

• SSL & SASL

• Apache Knox

• Apache Ranger

• HDFS File/Directory Encryption
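
To show how a perimeter gateway such as Apache Knox fits in, the following sketch builds the URL a client would call to list an HDFS directory through Knox instead of contacting the NameNode directly. It assumes Knox's commonly used defaults (port 8443 and the `gateway` context path); the host name and the topology name `default` are placeholders, and `knox_webhdfs_url` is a made-up helper.

```python
from urllib.parse import quote, urlencode


def knox_webhdfs_url(gateway_host, topology, hdfs_path, op="LISTSTATUS",
                     port=8443, gateway_path="gateway"):
    """Build a WebHDFS REST URL that is routed through a Knox gateway."""
    path = quote(hdfs_path.lstrip("/"))  # keep "/" separators, escape the rest
    query = urlencode({"op": op})
    return (f"https://{gateway_host}:{port}/{gateway_path}/{topology}"
            f"/webhdfs/v1/{path}?{query}")


# Hypothetical gateway host and topology.
print(knox_webhdfs_url("knox.example.com", "default", "/user/alice"))
```

The client authenticates once against the gateway over SSL, and Knox forwards the call to the cluster, so cluster-internal host names and ports are not exposed.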

Page 11: Introduction to the Hadoop EcoSystem


Operations – Apache Ambari

• Provision, manage and monitor Hadoop clusters

• A complete set of operational capabilities that provides both visibility into the health of your cluster and tooling to manage configuration and optimize performance across all data access methods.

• Apache Ambari provides APIs to integrate with existing management systems, for instance Microsoft System Center and Teradata Viewpoint
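
The integration point mentioned above is Ambari's REST API. As a minimal sketch, the code below prepares (but does not send) a request that would list the clusters an Ambari server manages. It assumes the server listens on port 8080 and uses HTTP basic authentication; the host name and credentials are placeholders, and `ambari_clusters_request` is a hypothetical helper.

```python
import base64
from urllib.request import Request


def ambari_clusters_request(host, user, password, port=8080):
    """Prepare (but do not send) a GET request for the clusters Ambari manages."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return Request(
        f"http://{host}:{port}/api/v1/clusters",
        headers={"Authorization": f"Basic {token}"},
        method="GET",
    )


# Placeholder host and credentials; pass the result to urllib.request.urlopen to send it.
req = ambari_clusters_request("ambari.example.com", "admin", "admin")
print(req.get_method(), req.full_url)
```

An external management system could poll endpoints like this one and surface the JSON response in its own dashboards.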

Page 12: Introduction to the Hadoop EcoSystem


Enterprise Hadoop: Central Set of Services


Enables Apache Hadoop to be an Enterprise Data Platform with centralized services for:

• Governance

• Operations

• Security

Everything that plugs into Hadoop inherits these services

Governance: Load data and manage according to policy

Operations: Deploy and effectively manage the platform

– Provision, manage & monitor: Ambari, ZooKeeper

– Scheduling: Oozie

Security: Provide a layered approach to security through authentication, authorization, accounting, and data protection

[Diagram: the governance, operations and security services surround the batch, interactive & real-time data access engines (Pig, Hive, Cascading, Storm, Solr, HBase, Accumulo, Spark, ISV engines), which run on YARN: Data Operating System (Cluster Resource Management) and HDFS, with Tez and Slider as intermediate layers.]

Page 13: Introduction to the Hadoop EcoSystem


HDP IS Apache Hadoop

There is ONE Enterprise Hadoop: everything else is a vendor derivation

Hortonworks Data Platform 2.2

[Figure: version matrix of the Apache components shipped in HDP 2.0 (October 2013), HDP 2.1 (April 2014) and HDP 2.2 (October 2014), grouped under Data Management, Data Access, Governance & Integration, Security and Operations. Components shown: Hadoop & YARN, Pig, Hive & HCatalog, HBase, Sqoop, Oozie, ZooKeeper, Ambari, Storm, Flume, Knox, Phoenix, Accumulo, Falcon, Ranger, Spark, Kafka, Tez, Slider, Solr.]

* version numbers are targets and subject to change at time of general availability in accordance with ASF release process

Page 14: Introduction to the Hadoop EcoSystem


The Partner Ecosystem

[Diagram: HDP 2.1 (Data Management, Data Access, Governance & Integration, Security, Operations, YARN) positioned among operational tools, dev & data tools and infrastructure. Sources: clickstream, web & social, geolocation, sensor & machine, server logs, unstructured data, and existing systems. Data systems: RDBMS, EDW, MPP, HANA. Applications: BusinessObjects BI.]

Deep Partnerships: Hortonworks engages in deep engineered relationships with the leaders in the data center, such as Microsoft, Teradata, Red Hat, HP, SAS & SAP

Broad Partnerships: Over 900 partners work with us to certify their applications to work with Hadoop so they can extend big data to their users

Page 15: Introduction to the Hadoop EcoSystem


Couchbase and HDP

• Couchbase is primarily an online operational NoSQL datastore: low latency, scalable

• It can be both a source of data and a sink

• Example source: pulling user profiles into Hadoop for deep analytics

• Example sink: training machine learning models that are then cached / served from Couchbase

Page 16: Introduction to the Hadoop EcoSystem


Couchbase and HDP

• HDP Certified Sqoop connector for batch mode export / import

• Couchbase Kafka connector enables both Producer and Consumer scenarios

• Community supported Storm spout to persist data by writing to Couchbase Server

• Developer Preview Spark Connector


Page 17: Introduction to the Hadoop EcoSystem


What’s New in HDP 2.2

New and Improved YARN Ready Engines

• Enterprise SQL at Hadoop Scale with Stinger.next

• Enterprise Ready Spark on YARN

• Deep YARN integration for real-time engines: HBase, Accumulo, Storm

• Enabling ISVs with a general SDK and API for direct YARN integration

• Only solution to provide real-time to micro batch for analyzing the internet of things

• Other engines/tools: Solr, Cascading

Continued Innovation of Central Enterprise Services

• Centralized security administration and policy enforcement

• Ease of use and operations agility features to speed cluster deployment

• 100% uptime target with cluster rolling upgrades

Expanded Deployment Options

• Enhanced business continuity with replication/archival across on-premises and cloud storage tiers (Azure Blob, S3)

• Simultaneous ship of Windows and Linux installs

• Expand Azure support beyond HDInsight Azure to include HDP for Windows or Linux in Azure VMs

HDP 2.2: Delivering Apache Hadoop for the Enterprise

Page 18: Introduction to the Hadoop EcoSystem


Stinger.next: Enterprise SQL at Hadoop Scale

A continuation of momentum built in Apache Hive Community to deliver Enterprise SQL at Hadoop scale

HDP Stinger/Hive Goals:

• Speed: Deliver sub-second query response times

• Scale: The only SQL interface to Hadoop designed for queries that scale from gigabytes to terabytes and petabytes

• SQL: Enable transactions and SQL:2011 analytics

Familiar three-phase delivery

Stinger delivered 390,000 lines of code to Apache Hive in 13 months from 44 companies, 145 developers

HDP 2.2 – Beyond Read Only

• Transactions with ACID, allowing insert, update & delete

• Temporary tables

• Cost Based Optimizer for star & bushy join queries

Phase 2 – Sub Second

• Sub-second queries with LLAP

• Hive-Spark Machine Learning integration

• Operational reporting w/ Hive streaming ingest & transactions

Phase 3 – Rich Analytics

• SQL:2011 Analytics

• Materialized views

• Cross-geo queries

• Workload management via YARN and LLAP integration


Page 19: Introduction to the Hadoop EcoSystem


Apache Spark

• Apache Spark is an open source project for fast and large-scale data processing.

– Simple and expressive programming model

– Machine learning, graph computation and streaming

– In-memory compute for iterative workloads

• It does most of the processing in memory

• It supports the programming languages

– Java, Scala and Python

• It provides high-level modules

– MLlib

– GraphX

– Spark Streaming

– Spark SQL

• Cluster managers

– YARN (recommended)

– Mesos

– Spark’s own (standalone)
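
To give a feel for the programming model, here is a minimal PySpark-style sketch that counts log lines per level using the RDD API (`parallelize`, `cache`, `map`, `reduceByKey`, `collect`). It assumes an existing `SparkContext` is passed in as `sc` and that the first token of each line is its log level. Both function names are invented for illustration, and `count_levels` only runs where PySpark is installed.

```python
from operator import add


def parse_level(line):
    """Return the log level (first whitespace-separated token), or 'UNKNOWN'."""
    parts = line.split()
    return parts[0] if parts else "UNKNOWN"


def count_levels(sc, lines):
    """Count lines per log level; `sc` is an existing pyspark.SparkContext."""
    rdd = sc.parallelize(lines).cache()  # keep the data in memory for reuse
    counts = rdd.map(parse_level).map(lambda level: (level, 1)).reduceByKey(add)
    return dict(counts.collect())
```

The `cache()` call is what makes repeated passes over the same data cheap, which is the in-memory advantage for iterative workloads noted above.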

Page 20: Introduction to the Hadoop EcoSystem


Spark Stack

Page 21: Introduction to the Hadoop EcoSystem


Enterprise Ready Spark for HDP 2.2 & beyond

HDP 2.2 – Spark on YARN

• Integrated: Hive 0.13 support

• Integrated: Basic ORC file support

Phase 2 – Spark for HDP 2.2

• Managed: Deployment best practices with YARN Node Labels

• Managed: Ambari Stack Definition: Install/Start/Stop/Config/Quick links to Spark UI

• Security: Spark certification on Kerberized cluster

• Security: Authentication in Spark UI against LDAP

Phase 3 – Beyond

• Managed: Enhanced workload management & improved debuggability

• Managed: Spark logs published to YARN Application Timeline

• Security: Wire encryption and authorization with XA/Argus

• Enhanced ORC support

Deliver a reliable and managed, enterprise-grade Apache Spark that will run alongside other workloads in Hadoop via YARN

HDP Spark Goals:

• Integrated: Enterprise-grade workload management & optimized multi-tenancy on YARN

• Secure: Extend comprehensive Hadoop security policy to Spark

• Managed: Provision, manage and monitor Spark along with other engines in Hadoop


Page 22: Introduction to the Hadoop EcoSystem


HDP 2.2 Delivers more YARN Ready Engines

Bringing more applications and services to YARN and making ISV adoption easier

• Complete work for Pig with Tez

• Cascading with Tez for Java and Scala apps

• Integration of Spark on YARN

• Kafka for inbound messaging to Storm & Spark – widest range from real-time to micro batch for internet of things


Slider introduces native YARN integration for applications with long running services

• HBase, Accumulo, Storm

• SDK for 3rd-party ISVs

Indicates “new to HDP” in 2.2. All engines have been updated

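[Diagram: YARN Ready engines in HDP 2.2 on YARN: Data Operating System (Cluster Resource Management) and HDFS (Hadoop Distributed File System): Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Kafka and other engines, with Tez and Slider as intermediate layers.]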

Page 23: Introduction to the Hadoop EcoSystem


Security in HDP 2.2

HDP 2.2 New Features

• Extend Authorization with Apache Ranger

• Breadth: Knox and Storm integrations

• Policy enforcement at depth: Hive, HDFS and HBase integrations

• Documentation to support community development and partner ecosystem

• Apache Hadoop Advances

– TP: HDFS Transparent Encryption (HDFS-6134)

– Key Management Server (HADOOP-10433)

– Key Provider API (HADOOP-10141)

Continued investment in central security policy for authentication, authorization, audit, and data protection

HDP Security Goals:

• Comprehensive Security: Meet all security requirements across authentication, authorization, audit & data protection for all HDP components.

• Central Administration: Provide central administration of security policy and of viewing and managing audit across the platform.

• Consistent Integration: Integrate with other security and identity management systems, for compliance with IT policies.


Page 24: Introduction to the Hadoop EcoSystem


Streamlining Operations in HDP 2.2

Apache Ambari 1.7.0 Delivers

• Views: A common, secure, and extensible approach for the user interface for Operators, System Administrators, Application Developers, Data Workers and ISVs

• Blueprints: Create and manage cluster templates for easy deployment
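
To illustrate what a cluster template looks like, the sketch below builds a deliberately small Ambari Blueprint as a Python dictionary and serializes it to JSON. The overall shape (`Blueprints` metadata plus `host_groups` with `components` and `cardinality`) follows the Blueprint format. The blueprint name, the two host groups and their sizes are made-up examples, and the component list is too short to deploy a working cluster as-is.

```python
import json


def minimal_blueprint(name, stack_version="2.2"):
    """Return an illustrative (not deployable) Ambari Blueprint as a dict."""
    return {
        "Blueprints": {
            "blueprint_name": name,
            "stack_name": "HDP",
            "stack_version": stack_version,
        },
        "host_groups": [
            {
                "name": "master",  # hypothetical group name
                "cardinality": "1",
                "components": [{"name": "NAMENODE"}, {"name": "RESOURCEMANAGER"}],
            },
            {
                "name": "workers",  # hypothetical group name
                "cardinality": "3",
                "components": [{"name": "DATANODE"}, {"name": "NODEMANAGER"}],
            },
        ],
    }


print(json.dumps(minimal_blueprint("dev-test"), indent=2))
```

Because the template names host groups rather than specific machines, the same document can be reused to stand up several clusters with an identical layout.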

Apache Ambari is advancing at light speed to enable the IT operator to more easily manage clusters

HDP Operations Goals:

• Open: Deliver a complete set of features for Hadoop operations, in public and with the community.

• Integrated: Ensure Hadoop operations integrate with existing IT tools, behind a single pane of glass.

• Intuitive: Make Hadoop’s most complex operational challenges easy to manage.


Ambari 2.0.0 delivers

• Ambari on Windows

• Native metrics and alerts

• Rolling upgrade automation

Page 25: Introduction to the Hadoop EcoSystem


Rolling Upgrades

Allow continuous operation and uptime for applications and services on the cluster while upgrading

• Single most critical feature for streamlining operations

• HDFS provides the ability to do this today… remaining components need to follow

• Leverages native operating system tools and scripting

• Allow jobs in-flight to complete

• Provides support for rapid rollback
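
The control flow behind a rolling upgrade can be sketched in a few lines: upgrade one node at a time, check its health, and undo everything touched so far if a check fails. This is a simplified model, not how HDP or Ambari implements it; `upgrade`, `is_healthy` and `rollback` are hypothetical hooks the caller supplies.

```python
def rolling_upgrade(nodes, upgrade, is_healthy, rollback):
    """Upgrade nodes one at a time; roll back all touched nodes on the first failure.

    `upgrade`, `is_healthy` and `rollback` are caller-supplied callables
    (hypothetical hooks). Returns the list of nodes left on the new version.
    """
    done = []
    for node in nodes:
        upgrade(node)
        if not is_healthy(node):
            # Undo in reverse order, starting with the node that just failed.
            for touched in reversed(done + [node]):
                rollback(touched)
            return []
        done.append(node)
    return done
```

Because only one node is out of service at any moment, the rest of the cluster keeps serving requests while the upgrade proceeds.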


Page 26: Introduction to the Hadoop EcoSystem


Vision: Maximize Hadoop Deployment Choice

Deployment Choice

• Linux, Windows

• On-Premises, Cloud, Hybrid

“Tethered” Clusters

• Compatible services

• An explicit “connection”

Synchronized Datasets

• Efficient sharing & access

• Governance & lineage

[Diagram: Development & POC cluster, production cluster, BI or ML cluster, and backup & archive cluster.]

Page 27: Introduction to the Hadoop EcoSystem


Cloudbreak with HDP

Cloudbreak

1. Pick a Blueprint

2. Choose a Cloud

3. Launch HDP!

Example Ambari Blueprints:

• IoT Apps (Storm, HBase, Hive)

• BI / Analytics (Hive)

• Data Science (Spark)

• Dev / Test (all HDP services)

Page 28: Introduction to the Hadoop EcoSystem


Periscope with HDP

Autoscaling Policy

Periscope

• Policies based on any Ambari metrics

• Coordinates with YARN to achieve elasticity based on the policies.

Cluster types shown: IoT Apps (Storm, HBase, Hive), BI / Analytics (Hive), Data Science (Spark), Dev / Test (all HDP services)
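
A metric-driven autoscaling policy of the kind described here can be modelled as two thresholds plus node-count limits. The sketch below is a generic illustration and not Periscope's actual policy format. The class, its field names and the metric name `pending_containers` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    metric: str              # name of a monitored metric (hypothetical)
    scale_up_above: float    # add nodes when the metric exceeds this value
    scale_down_below: float  # remove nodes when the metric falls below this value
    step: int = 1
    min_nodes: int = 1
    max_nodes: int = 10


def desired_nodes(policy, metric_value, current_nodes):
    """Return the node count the policy asks for, clamped to its limits."""
    if metric_value > policy.scale_up_above:
        target = current_nodes + policy.step
    elif metric_value < policy.scale_down_below:
        target = current_nodes - policy.step
    else:
        target = current_nodes
    return max(policy.min_nodes, min(policy.max_nodes, target))


policy = ScalingPolicy("pending_containers", scale_up_above=50,
                       scale_down_below=5, step=2, max_nodes=6)
print(desired_nodes(policy, metric_value=80, current_nodes=3))
```

Keeping a gap between the two thresholds avoids the cluster flapping between sizes when the metric hovers around a single cut-off.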

Page 29: Introduction to the Hadoop EcoSystem


Thank You