Post on 15-Jul-2015
Page 1 Hortonworks Confidential 2014
Enterprise Hadoop with Hortonworks and Nimble Storage
Ajay Singh, Director of Technical Alliance, Hortonworks
Ibrahim “Ibby” Rahmani, Product and Solutions Marketing, Nimble Storage
Agenda
• Hortonworks Overview
• Big Data Use Cases
• Hadoop Journey and Phases of Adoption
• Requirements of Enterprise Hadoop
• Key Trends
Our Mission: Power your Modern Data Architecture with HDP and Enterprise Apache Hadoop
Who we are
• June 2011: Original 24 architects, developers, operators of Hadoop from Yahoo!
• June 2014: An enterprise software company with 420+ Employees
Key Partners
Our model: Innovate and deliver Apache Hadoop as a complete enterprise data platform completely in the open, backed by a world class support organization
HDP IS Apache Hadoop
There is ONE Enterprise Hadoop: everything else is a vendor derivation
Hortonworks Data Platform
[Figure: component versions by release]
• Releases: HDP 2.0 (October 2013), HDP 2.1 (April 2014), HDP 2.2 (December 2014)
• Components: Hadoop & YARN, Pig, Hive & HCatalog, HBase, Sqoop, Oozie, Zookeeper, Ambari, Storm, Flume, Knox, Phoenix, Accumulo, Falcon, Ranger, Spark, Kafka, Tez, Slider, Solr
• Groupings: Data Management, Data Access, Governance & Integration, Security, Operations
* Version numbers are targets and subject to change at time of general availability in accordance with ASF release process
YARN: Data Operating System
• Script: Pig
• Search: Solr
• SQL: Hive/Tez, HCatalog
• NoSQL: HBase, Accumulo
• Stream: Storm
• Batch: MapReduce
HDFS (Hadoop Distributed File System)
Contributes more to the Apache Hadoop ecosystem in the ASF than any other vendor
Hadoop is a platform decision
• Open Source: fastest path to innovation for a platform technology
• Eliminate vendor lock in, no proprietary software
• Data center leaders have committed to the open source approach
Apache Project   Committers   PMC Members
Hadoop           27           20
Accumulo         2            2
Ambari           33           27
Falcon           5            3
Flume            1            0
HBase            6            4
Hive             17           4
Knox             12           3
Oozie            3            2
Pig              5            5
Sqoop            1            1
Storm            3            2
Tez              15           15
Zookeeper        2            1
TOTAL            132          89
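As a quick consistency check on the table above, the per-project counts can be summed and compared with the TOTAL row. This is only a sketch: the dictionary name `contributors` is an assumption introduced here, and the numbers are copied from the table.

```python
# (committers, PMC members) per Apache project, copied from the table above.
contributors = {
    "Hadoop": (27, 20),
    "Accumulo": (2, 2),
    "Ambari": (33, 27),
    "Falcon": (5, 3),
    "Flume": (1, 0),
    "HBase": (6, 4),
    "Hive": (17, 4),
    "Knox": (12, 3),
    "Oozie": (3, 2),
    "Pig": (5, 5),
    "Sqoop": (1, 1),
    "Storm": (3, 2),
    "Tez": (15, 15),
    "Zookeeper": (2, 1),
}

# Sum each column and compare with the TOTAL row of the table.
total_committers = sum(committers for committers, _ in contributors.values())
total_pmc = sum(pmc for _, pmc in contributors.values())
print(total_committers, total_pmc)
```

Both sums agree with the TOTAL row (132 committers, 89 PMC members).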
[Figure: HDP 2.1 stack: Governance & Integration, Security, Operations, Data Access, Data Management, YARN]
Community Leadership
Leading Hadoop Innovations; 100% Open Source
Proven By Customer Success
Customer Momentum
• 300+ customers in seven quarters, growing at 75+/quarter
• 30+ customers migrated from other distributions
• Two thirds of customers come from F1000
• 100% Renewal Rate
Experience at Scale: 80,000 nodes under contract
• Largest Cluster in North America: 32,000 Nodes
• Largest Cluster in Europe: 1,000 Nodes
• Largest Known Cluster in APAC: 400 Nodes
Fastest growing Fortune 1000 customer base
Market Leadership
Traditional systems under pressure
• APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
• DATA SYSTEM: RDBMS, EDW, MPP
• SOURCES: Existing Sources (CRM, ERP, …)
• New Data Types: Clickstream; Geolocation; Sentiment, Web Data; Sensor, Machine Data; Unstructured docs, emails; Server logs
LIMITATIONS
• Silos of Data
• Costly to Scale
• Constrained Schemas
• Silos & Expensive
• Single Purpose
…and difficult to manage new data
1. Unlock New Applications from New Types of Data
Columns: INDUSTRY, USE CASE, then one column per data type: Sentiment & Web, Clickstream & Behavior, Machine & Sensor, Geographic, Server Logs, Structured & Unstructured
Financial Services New Account Risk Screens ✔ ✔
Trading Risk ✔
Insurance Underwriting ✔ ✔ ✔
Telecom Call Detail Records (CDR) ✔ ✔
Infrastructure Investment ✔ ✔
Real-time Bandwidth Allocation ✔ ✔ ✔
Retail 360° View of the Customer ✔ ✔ ✔
Localized, Personalized Promotions ✔
Website Optimization ✔
Manufacturing Supply Chain and Logistics ✔
Assembly Line Quality Assurance ✔
Crowd-sourced Quality Assurance ✔
Healthcare Use Genomic Data in Medical Trials ✔ ✔ ✔
Monitor Patient Vitals in Real-Time ✔ ✔
Pharmaceuticals Recruit and Retain Patients for Drug Trials ✔ ✔
Improve Prescription Adherence ✔ ✔ ✔ ✔
Oil & Gas Unify Exploration & Production Data ✔ ✔ ✔ ✔
Monitor Rig Safety in Real-Time ✔ ✔ ✔
Government ETL Offload/Federal Budgetary Pressures ✔ ✔
Sentiment Analysis for Government Programs ✔
2. Or to realize a dramatic cost savings…
EDW Optimization
Current Reality (OPERATIONS 50%, ANALYTICS 20%, ETL PROCESS 30%)
• EDW at capacity: some usage from low value workloads
• Older data archived, unavailable for ongoing exploration
• Source data often discarded
Augment w/ Hadoop (OPERATIONS 50%, ANALYTICS 50%)
• Free up EDW resources from low value tasks
• Keep 100% of source data and historical data for ongoing exploration
• Mine data for value after loading it because of schema-on-read
Hadoop: Parse, Cleanse, Apply Structure, Transform
[Chart: Fully-loaded Cost Per Raw TB of Data (Min–Max Cost), axis from $0 to $180,000, for MPP, SAN, Engineered System, NAS, HADOOP, Cloud Storage]
Commodity Compute & Storage: Hadoop Enables Scalable Compute & Storage at a Compelling Cost Structure
Storage Costs/Compute Costs from $19/GB to $0.23/GB
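The schema-on-read point on this slide can be illustrated with a minimal sketch. It assumes raw records are kept as JSON lines and that structure is applied only when a consumer reads them. The names `raw_store` and `read_with_schema` and the sample records are hypothetical, and this is plain Python rather than any Hadoop API.

```python
import json

# Raw events are stored exactly as they arrive; no schema is enforced on write.
raw_store = [
    '{"user": "a1", "page": "/home", "ms": 120}',
    '{"user": "b2", "page": "/cart", "ms": 340, "coupon": "X9"}',
    '{"user": "a1", "page": "/cart"}',
]

def read_with_schema(lines, fields):
    """Project each raw record onto the requested fields at read time.

    `fields` maps a field name to the default used when a record lacks it.
    """
    for line in lines:
        record = json.loads(line)
        yield {name: record.get(name, default) for name, default in fields.items()}

# Two consumers apply two different schemas to the same untouched raw data.
page_views = list(read_with_schema(raw_store, {"user": None, "page": None}))
latencies = list(read_with_schema(raw_store, {"page": None, "ms": 0}))
```

Because the raw lines are never rewritten, a later analysis can ask for a field such as `coupon` that nobody planned for when the data was loaded.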
3. Data Lake: An architectural shift
Unlocking the Data Lake
[Figure: SCALE vs. SCOPE, placing RDBMS, MPP and EDW against the HDP 2.1 stack (Governance & Integration, Security, Operations, Data Access, Data Management, YARN); New Analytic Apps or IT Optimization]
Data Lake Enabled by YARN
• Single data repository, shared infrastructure
• Multiple biz apps accessing all the data
• Enable a shift from reactive to proactive interactions
• Gain new insight across the entire enterprise
Business Value from Hadoop: Flight Plan for a Journey in Four Phases
• Phases: 1. Potential Value; 2. Operational Value; 3. Strategic Value; 4. Data-Driven Organization
• Milestones: Awareness & Interest; Evaluation – Business Value; Evaluation – Technical; Point Deployment; Point Production; Enterprise Deployment; Enterprise Production; Industry Leadership
Flight plan – typical elapsed time* from start of phase 1 in months: 2-6, 9-15, 18-36
* Timeline varies by company size. Often smaller or focused online businesses achieve milestones at the shorter end of the range.
What Would You Like to Accomplish? Levels of Success with Hadoop
Levels: 1. Potential Value; 2. Operational Value; 3. Strategic Value; 4. Data-Driven Organization
CXO
1. Recognition of potential; Mandate to explore
2. Recognition of value realized; Sponsorship to expand use
3. Recognition of material value realized; Sponsorship to transform organization
4. Competitive advantage; CDO part of Exec Team
Line of Business
1. Basic understanding of the value of Hadoop to the business
2. Value realized in 1 area: Customer intimacy; Operational excellence; Risk, security, compliance; New business
3. Value realized and tracked in many areas: Customer intimacy; Operational excellence; Risk, security, compliance; New business
4. Data managed like capital; Intelligence at the front line; JIT decision making; Widespread value creation
Analytics & Applications
1. Basic understanding how Hadoop fits into existing landscape
2. BI and EDW access to Hadoop; Some new analytic apps, often batch; Few use cases and processing engines; Many sources and time periods; Mostly departmental silos; 10-50 enterprise users
3. Hadoop consumable by any department, both technically and process-wise; New apps natively on Hadoop, often transactional or real-time; Many use cases and processing engines; Multiple lenses into common data pool; Emerging data science team; 50-500 enterprise users
4. Data-driven culture; High-performing data science team; Use cases build on each other; 500-5000 enterprise users
Data Mgt. & Security
1. Basic understanding how Hadoop fits
2. Benefitting from schema on read; Professionalizing data definitions and models
3. Collaboration and granular security controls governing use of shared data
4. Incentives and process to encourage consumption of shared data
Infrastructure
1. Basic fluency with core technical concepts of Hadoop
2-3. 1 or more production environments; Multi-tenant shared service worldwide; Data Lake; Service Desk / CoE
4. Hadoop community participation and contribution
The 1st Generation of Hadoop: Batch
HADOOP 1.0: Built for Web-Scale Batch Apps
[Figure: separate Single App clusters (BATCH, INTERACTIVE, ONLINE), each on its own HDFS]
• All other usage patterns must leverage that same infrastructure
• Forces the creation of silos for managing mixed workloads
Timeline markers: 2006, 2009, October 23, 2013
Hadoop w/ MapReduce
• MapReduce on HDFS (Hadoop Distributed File System), nodes 1 … N
• Largely Batch Processing
• Silo’d clusters; Largely batch system; Difficult to integrate
MR-279: YARN
Hadoop 2 & YARN
• Hadoop 2 & YARN based Architecture: YARN: Data Operating System on HDFS (Hadoop Distributed File System), nodes 1 … N
• Interactive, Real-Time, Batch
Architected & led development of YARN to enable the Modern Data Architecture
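The contrast between silo'd clusters and a single YARN-managed cluster can be made concrete with a toy sizing model. This is a sketch under stated assumptions, not YARN's scheduling algorithm: the workload names, the hourly node demands, and the functions `siloed_nodes` and `shared_nodes` are all hypothetical.

```python
def siloed_nodes(peak_demand):
    """Each workload gets its own cluster, sized for that workload's own peak."""
    return sum(peak_demand.values())

def shared_nodes(demand_by_period):
    """One shared cluster, sized for the highest combined demand in any period."""
    return max(sum(period.values()) for period in demand_by_period)

# Hypothetical node demand per workload in three periods of the day.
demand_by_period = [
    {"batch": 80, "interactive": 10, "real_time": 20},  # night
    {"batch": 20, "interactive": 60, "real_time": 30},  # business hours
    {"batch": 40, "interactive": 30, "real_time": 50},  # evening
]

# Per-workload peak across all periods, used to size the silo'd clusters.
peak_demand = {
    name: max(period[name] for period in demand_by_period)
    for name in demand_by_period[0]
}

silo_total = siloed_nodes(peak_demand)
shared_total = shared_nodes(demand_by_period)
print(silo_total, shared_total)
```

With these made-up numbers the silo'd design needs 190 nodes while the shared cluster needs 120, because the workloads peak at different times. Real savings depend on how much the peaks actually overlap.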
A Blueprint for Enterprise Hadoop
• GOVERNANCE & INTEGRATION: Load data and manage according to policy
• OPERATIONS: Deploy and effectively manage the platform
• DATA MANAGEMENT: Store and process all of your Corporate Data Assets
• DATA ACCESS: Access your data simultaneously in multiple ways (batch, interactive, real-time)
• SECURITY: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
• PRESENTATION & APPLICATION: Enable both existing and new applications to provide value to the organization
• ENTERPRISE MGMT & SECURITY: Empower existing operations and security tools to manage Hadoop
• DEPLOYMENT OPTIONS: Provide deployment choice across physical, virtual, cloud
YARN: Data Operating System
HDP delivers a comprehensive data management platform
Hortonworks Data Platform 2.2
[Figure: Hortonworks Data Platform 2.2 architecture]
• BATCH, INTERACTIVE & REAL-TIME DATA ACCESS: Script (Pig), SQL (Hive), Java/Scala (Cascading), NoSQL (HBase, Accumulo), Stream (Storm), In-Memory (Spark), Search (Solr), Others (ISV Engines); Tez and Slider shown beneath the engines
• YARN: Data Operating System (Cluster Resource Management)
• HDFS (Hadoop Distributed File System)
• GOVERNANCE: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS)
• SECURITY: Authentication, Authorization, Accounting, Data Protection (Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox; Cluster: Ranger)
• OPERATIONS: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
• Deployment Choice: Linux, Windows, On-Premises, Cloud
YARN is the architectural center of HDP
Enables batch, interactive and real-time workloads
Provides comprehensive enterprise capabilities
The widest range of deployment options
Delivered Completely in the OPEN
Modern Data Architecture
• Enterprise Hadoop as single consolidated Data Lake
• Deep Integration with existing systems
• Accelerated Interactive & Real-Time Capabilities
• Central services for security, governance and operation
[Figure: Hadoop As Enterprise Data Lake]
• APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
• DATA SYSTEM: RDBMS, EDW, MPP, alongside YARN: Data Operating System (Interactive, Real-Time, Batch) on HDFS (Hadoop Distributed File System)
• SOURCES: EXISTING Systems (CRM, ERP, Other); Clickstream; Web & Social; Geolocation; Sensor & Machine; Server Logs; Unstructured
Multiple Deployment Choices
On-Premise & Cloud Deployments, Physical & Virtual Clusters
[Figure: Development & POC Cluster, Production Cluster, BI or ML Cluster, Backup & Archive Cluster]
Deployment Choice
• Linux, Windows
• On-Premises, Public/Private Cloud, Hybrid
“Tethered” Clusters
• Compatible services
• An explicit “connection”
Synchronized Datasets
• Efficient sharing & access
• Governance & lineage
Cloud Backup & Storage Tiering
Dataset Backup / Archival
• Deliver business continuity through replication across on-premises and cloud-based storage targets: Microsoft Azure and Amazon S3
• Lineage as a GA feature with supporting documentation and examples
Storage Tiers in HDFS
• HDFS heterogeneous storage tiering feature
• Allows the definition of hot/cold storage tiers within a cluster, with all data remaining in the cluster for the data lake
• Higher-density storage on machines with a lower CPU and memory footprint further drives down hardware costs for the cold storage tier
[Figure: Production Cluster and Backup & Archive Cluster]
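The hot/cold tiering idea above can be sketched as a simple placement rule. This is an illustration only, not the HDFS mechanism itself: the function `choose_tier`, the 90-day threshold, and the dataset paths are assumptions made for the example.

```python
from datetime import date

def choose_tier(last_access, today, cold_after_days=90):
    """Return "HOT" or "COLD" from the days since last access (illustrative rule)."""
    age_days = (today - last_access).days
    return "COLD" if age_days > cold_after_days else "HOT"

# Hypothetical datasets mapped to the date they were last read.
today = date(2015, 7, 15)
datasets = {
    "/data/clickstream/2015-07": date(2015, 7, 14),
    "/data/clickstream/2014-01": date(2014, 2, 1),
}

# Both datasets stay in the same cluster; only their storage tier differs.
tiers = {path: choose_tier(seen, today) for path, seen in datasets.items()}
print(tiers)
```

In a real cluster the resulting label would be applied through the HDFS storage policy tooling, so that cold data lands on the denser, cheaper machines while remaining queryable.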
Expanded Infrastructure Choices
Servers with Internal Storage
• High performance
• Low upfront cost
• Limited data movement
Servers with Shared Storage
• Ease of administration
• Independent scale out of compute & storage
• Shared storage infrastructure for Big Data and Legacy applications
Key Technology Trends
• Fast & cost effective networks
• SSD storage
• High memory servers
• Scale out shared storage sub systems
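A small sizing sketch shows what independent scale out of compute and storage means in practice. All capacities and requirements below are hypothetical, and the functions `coupled_servers` and `decoupled_sizing` are introduced here purely for illustration.

```python
import math

def coupled_servers(storage_tb, cores, tb_per_server, cores_per_server):
    """Servers with internal storage: one server count must cover both needs."""
    return max(
        math.ceil(storage_tb / tb_per_server),
        math.ceil(cores / cores_per_server),
    )

def decoupled_sizing(storage_tb, cores, tb_per_shelf, cores_per_server):
    """Shared storage: compute servers and storage shelves are sized separately."""
    return (
        math.ceil(cores / cores_per_server),
        math.ceil(storage_tb / tb_per_shelf),
    )

# Hypothetical requirement: 960 TB of data and 320 cores of processing.
coupled = coupled_servers(storage_tb=960, cores=320, tb_per_server=24, cores_per_server=16)
compute_servers, storage_shelves = decoupled_sizing(
    storage_tb=960, cores=320, tb_per_shelf=96, cores_per_server=16
)
print(coupled, compute_servers, storage_shelves)
```

With these numbers the storage requirement forces 40 internal-storage servers even though 20 would cover the compute need, while the shared-storage design uses 20 compute servers plus 10 storage shelves. Which option costs less depends on unit prices, which this sketch does not model.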