Discover HDP 2.2: Data Storage Innovations in Hadoop Distributed File System (HDFS)


description

Hortonworks Data Platform 2.2 includes HDFS for data storage. In this 30-minute webinar, we discussed data storage innovations, including heterogeneous storage, encryption, and operational security enhancements.

Transcript of Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)

Page 1: Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)

Page 1 © Hortonworks Inc. 2014

Discover HDP 2.2 Data Storage Innovations in Hadoop Distributed File System (HDFS)

Hortonworks. We do Hadoop.

Page 2: Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)


Speakers

Rohit Bakhshi

Hortonworks Senior Product Manager & PM for Apache Hadoop & Apache Solr in Hortonworks Data Platform

Jitendra Pandey

Hortonworks Senior Architect for HDFS

Page 3: Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)


Agenda

•  Overview of HDFS

•  New HDFS Innovations in HDP 2.2

–  Heterogeneous storage

–  Encryption

–  Operational security enhancements

•  Q & A

We’ll move quickly:

•  Attendee phone lines are muted

•  Text any questions to Jitendra using Webex chat

•  Questions will be answered at the end of the call

•  Unanswered questions and answers in upcoming FAQ/blog post

Page 4: Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)


Big Data, Hadoop & Data Center Re-platforming

Business Drivers

•  From reactive analytics to proactive interactions

•  Insights that drive competitive advantage & optimal returns

Financial Drivers

•  Cost of data systems, as % of IT spend, continues to grow

•  Cost advantages of commodity hardware & open source software

Technical Drivers

•  Data is growing exponentially & existing systems overwhelmed

•  Predominantly driven by NEW types of data that can inform analytics

There is an inequitable balance between vendor and customer in the market

Page 5: Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)


Hadoop Value: New Types of Data

•  Clickstream: Capture and analyze website visitors’ data trails and optimize your website

•  Sensors: Discover patterns in data streaming automatically from remote sensors and machines

•  Server Logs: Research logs to diagnose process failures and prevent security breaches

•  Sentiment: Understand how your customers feel about your brand and products, right now

•  Geographic: Analyze location-based data to manage operations where they occur

•  Unstructured: Understand patterns in files across millions of web pages, emails, and documents

Page 6: Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)


A Shift from Reactive to Proactive Interactions

HDP and Hadoop allow organizations to use data to shift interactions from Reactive (Post Transaction) to Proactive (Pre Decision):

•  A shift in Advertising: from mass branding to 1x1 Targeting

•  A shift in Financial Services: from Educated Investing to Automated Algorithms

•  A shift in Healthcare: from mass treatment to Designer Medicine

•  A shift in Retail: from static branding to Real-time Personalization

•  A shift in Telco: from break then fix to repair before break

Page 7: Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)


Enterprise Goals for the Modern Data Architecture

•  Consolidate siloed data sets, structured and unstructured

•  Central data set on a single cluster

•  Multiple workloads across batch, interactive, and real time

•  Central services for security, governance and operation

•  Preserve existing investment in current tools and platforms

•  Single view of the customer, product, supply chain

[Diagram: Modern Data Architecture. Applications: Business Analytics, Custom Applications, Packaged Applications. Data System: RDBMS, EDW, MPP, and YARN: Data Operating System (Interactive, Real-Time, Batch) on HDFS (Hadoop Distributed File System). Sources: Existing Systems (CRM, ERP, Other), Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured.]

Page 8: Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)


YARN Transformed Hadoop & Opened a New Era

YARN The Architectural Center of Hadoop

•  Common data platform, many applications

•  Support multi-tenant access & processing

•  Batch, interactive & real-time use cases

[Diagram: Batch, interactive & real-time data access engines running on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System). Engines: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Others (ISV Engines), with Tez and Slider layers beneath some engines.]

Page 9: Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)


YARN Extends Hadoop to Other Data Center Leaders

YARN The Architectural Center of Hadoop

•  Common data platform, many applications

•  Support multi-tenant access & processing

•  Batch, interactive & real-time use cases

•  Supports 3rd-party ISV tools (e.g., SAS, Syncsort, Actian)

YARN Ready Applications

Facilitates ongoing innovation and enterprise adoption via ecosystem of new and existing “YARN Ready” solutions


Page 10: Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)


Enterprise Hadoop: Central Set of Services


Enables Apache Hadoop to be an Enterprise Data Platform with centralized services for:

•  Governance

•  Operations

•  Security

Everything that plugs into Hadoop inherits these services

Governance: Load data and manage according to policy

Operations: Deploy and effectively manage the platform

•  Provision, Manage & Monitor: Ambari, Zookeeper

•  Scheduling: Oozie

Security: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection

[Diagram: Governance, Security, and Operations services surround the batch, interactive & real-time data access engines (Pig, Hive, Cascading, Storm, Solr, HBase, Accumulo, Spark, and other ISV engines, with Tez and Slider) running on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System).]

Page 11: Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)


Hortonworks Development Investment for the Enterprise

Vertical Integration with YARN and HDFS


•  Ensure engines can run reliably and respectfully in a YARN based cluster

•  Implement features throughout the stack to accommodate

Page 12: Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)


Hortonworks Development Investment for the Enterprise

Horizontal Integration for Enterprise Services


•  Ensure consistent enterprise services are applied across the entire Hadoop stack

•  Integrate with and extend existing data center solutions for these key requirements

Page 13: Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)


Hortonworks Data Platform 2.2

HDP Delivers Enterprise Hadoop

[Diagram: HDP 2.2 components]

•  Data access (batch, interactive & real-time) on YARN: Data Operating System (Cluster Resource Management) and HDFS (Hadoop Distributed File System): Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), with Tez and Slider

•  Governance (Data Workflow, Lifecycle & Governance): Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS

•  Security (Authentication, Authorization, Audit, Data Protection): Storage: HDFS; Resources: YARN; Access: Hive; Pipeline: Falcon; Cluster: Ranger; Cluster: Knox

•  Operations: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)

•  Deployment Choice: Linux, Windows, Cloud

YARN is the architectural center of HDP

•  Common data set across all applications

•  Batch, interactive & real-time workloads

•  Multi-tenant access & processing

Provides comprehensive enterprise capabilities

•  Governance

•  Security

•  Operations

Enables broad ecosystem adoption

•  ISVs can plug directly into Hadoop

The widest range of deployment options

•  Linux & Windows

•  On premises & cloud



Page 15: Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)


Overview of HDFS

Page 16: Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)


HDFS enables the Common Data Platform

HDFS Storage Platform for Modern Data Architecture

•  Common data platform across multiple application workloads

•  Reliable

•  Scalable

•  Cost Efficient


Page 17: Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)


HDFS Innovations on HDP 2.2

Page 18: Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)


HDFS in HDP 2.2: What’s New

Theme: Heterogeneous Storage

•  Archive and SSD tiers

•  Tech Preview: Enable intermediate data to be stored in memory

Theme: Security

•  Encryption (Tech Preview): Transparent Data Encryption

•  DataNode does not require root to start: HDFS services in a Kerberized cluster no longer need root to start

Page 19: Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)


New in HDP 2.2: Heterogeneous Storage

Page 20: Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)


Heterogeneous Storage

Before

•  DataNode is a single storage

•  Storage is uniform: the only storage type is Disk

•  Storage types hidden from the file system

New Architecture

•  DataNode is a collection of storages

•  Support different types of storages: Disk, SSDs, Memory

[Diagram: before, all disks as a single storage; now, a collection of tiered storages. Other labels in the diagram: S3, Swift, SAN, Filers.]
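To make the "collection of storages" idea concrete, here is a minimal Python sketch. It assumes that each DataNode directory is tagged with its storage type in a `dfs.datanode.data.dir`-style value such as `[SSD]/grid/ssd0/dn`, and that untagged entries count as DISK. That tagging syntax is my recollection of Apache Hadoop's configuration, not something the slides state, and the mount points are hypothetical.

```python
import re
from collections import defaultdict

# Storage type names that appear on the slides.
KNOWN_TYPES = {"DISK", "SSD", "ARCHIVE", "RAM_DISK"}


def parse_data_dirs(value):
    """Group the directories of a comma-separated data-dir value by storage type.

    Entries look like "[SSD]/grid/ssd0/dn"; an entry without a tag is
    treated as DISK (an assumption of this sketch).
    """
    storages = defaultdict(list)
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        match = re.match(r"^\[(\w+)\](.+)$", entry)
        if match:
            storage_type = match.group(1).upper()
            path = match.group(2).strip()
        else:
            storage_type, path = "DISK", entry
        if storage_type not in KNOWN_TYPES:
            raise ValueError(f"unknown storage type: {storage_type}")
        storages[storage_type].append(path)
    return dict(storages)


# Hypothetical mount points, for illustration only.
example = "[DISK]/grid/0/dn,[DISK]/grid/1/dn,[SSD]/grid/ssd0/dn,/grid/2/dn,[ARCHIVE]/grid/arc0/dn"
layout = parse_data_dirs(example)
```

In this model a DataNode is just a mapping from storage type to directories, which is the piece of information the older single-storage design did not expose to the file system.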

Page 21: Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)


HDFS Storage Architecture - Now

Page 22: Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)

Storage Policies: Archival

•  Hot: All replicas on DISK

•  Warm: 1 replica on DISK, others on ARCHIVE

•  Cold: All replicas on ARCHIVE

[Diagram: HDP Cluster whose DataNodes hold DISK and ARCHIVE storages]
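The three archival policies on this slide can be written down as a small lookup. The Python sketch below is only an illustration of the rule each policy states; the function name and the default replication factor of 3 are assumptions, not part of HDFS.

```python
def replica_storage_types(policy, replication=3):
    """Return the storage type for each replica under a slide-level policy.

    Hot:  all replicas on DISK
    Warm: one replica on DISK, the others on ARCHIVE
    Cold: all replicas on ARCHIVE
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    name = policy.upper()
    if name == "HOT":
        return ["DISK"] * replication
    if name == "WARM":
        return ["DISK"] + ["ARCHIVE"] * (replication - 1)
    if name == "COLD":
        return ["ARCHIVE"] * replication
    raise ValueError(f"unknown policy: {policy}")


hot = replica_storage_types("Hot")
warm = replica_storage_types("Warm")
cold = replica_storage_types("Cold")
```

With three replicas, Warm keeps a single copy on DISK for readers and moves the remaining two to the ARCHIVE tier.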

Page 23: Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)

Storage Policy: SSD

•  SSD: All replicas on SSD

[Diagram: HDP Cluster whose DataNodes each hold SSD and DISK storages; the replicas of DataSet A (marked A) sit on SSD]

Page 24: Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)


Store Intermediate Data in Memory

Application Process

Memory Tier

Write block to memory

Lazy persist block to disk

RAM_DISK

Tech Preview feature

For data writes that:

-  Need low latency

-  Produce data that can be regenerated
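The write path on this slide (write the block to memory, lazily persist it to disk) can be modelled with a toy store. This Python sketch is not the HDFS implementation; the class, method names, and file naming are hypothetical, and a temporary directory stands in for the disk tier.

```python
import os
import tempfile


class LazyPersistStore:
    """Toy model of the memory tier: writes land in RAM, disk copies follow later."""

    def __init__(self, disk_dir):
        self.disk_dir = disk_dir
        self.memory = {}    # block_id -> bytes held in the memory tier
        self.pending = []   # block ids not yet persisted to disk

    def write_block(self, block_id, data):
        # The write completes once the block is in memory (the low-latency path).
        self.memory[block_id] = data
        self.pending.append(block_id)

    def persist_pending(self):
        # Runs later, off the write path; returns how many blocks were saved.
        count = 0
        while self.pending:
            block_id = self.pending.pop(0)
            path = os.path.join(self.disk_dir, f"{block_id}.blk")
            with open(path, "wb") as f:
                f.write(self.memory[block_id])
            count += 1
        return count


with tempfile.TemporaryDirectory() as disk_dir:
    store = LazyPersistStore(disk_dir)
    store.write_block("blk_1", b"intermediate results")
    on_disk_before = os.path.exists(os.path.join(disk_dir, "blk_1.blk"))
    persisted = store.persist_pending()
    on_disk_after = os.path.exists(os.path.join(disk_dir, "blk_1.blk"))
```

Between `write_block` and `persist_pending` the block exists only in memory, which is why the slide limits the feature to data that can be regenerated if a node fails.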

Page 25: Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)


New in HDP 2.2: Encryption

Page 26: Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)


HDFS Transparent Data Encryption

• HDFS Encryption
– Transparent Encryption in HDFS – HDFS-6134
– Designate a dir as encryption zone, all files in the zone are encrypted
– Dependency on Key Management Server

• Key Management Server - HADOOP-10433
– The custodian for all encryption keys in Hadoop
– REST API for key CRUD operations

• Key Provider API - HADOOP-10141
– API to allow Hadoop code (NN, DN, DFS Clients) CRUD operations on key material

Page 27: Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)

HDFS Transparent Data Encryption

[Diagram: HDFS Client with a Crypto Stream (r/w with DEK), used by the Data Access, Data Management, Security, and YARN layers; Name Node holding Encryption Zone attributes (EZKey ID, version; HDFS-6134) and Encrypted File attributes (EDEK, IV); Key Management System (KMS, HADOOP-10433) holding DEKs and EZKs; KeyProvider API (HADOOP-10141) between these components; EDEK and DEK passed between them.]

Acronym   Description
EZ        Encryption Zone (an HDFS directory)
EZK       Encryption Zone Key; master key associated with all files in an EZ
DEK       Data Encryption Key; unique key associated with each file. EZ Key used to generate DEK
EDEK      Encrypted DEK; Name Node only has access to encrypted DEK
IV        Initialization Vector
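The relationship between EZK, DEK, and EDEK is a form of envelope encryption, sketched below in Python. Several things here are assumptions made so the example runs with the standard library alone: `ToyKMS` and its methods are hypothetical, the DEK is generated randomly and then wrapped with the EZK, one IV is reused for wrapping and for data, and the cipher is a SHA-256 keystream XOR standing in for a real cipher. It is not secure and it is not what HDFS uses.

```python
import hashlib
import os


def keystream_xor(key, iv, data):
    """Stand-in stream cipher: XOR data with a SHA-256 counter keystream.

    NOT secure; it only keeps the sketch runnable without third-party
    libraries. Applying it twice with the same key and IV returns the input.
    """
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + iv + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


class ToyKMS:
    """Hypothetical key server: the only component that holds EZKs."""

    def __init__(self):
        self._ezks = {}

    def create_zone_key(self, name):
        self._ezks[name] = os.urandom(32)

    def generate_edek(self, ezk_name):
        # New per-file DEK, returned only in encrypted form (the EDEK).
        dek = os.urandom(32)
        iv = os.urandom(16)
        return keystream_xor(self._ezks[ezk_name], iv, dek), iv

    def decrypt_edek(self, ezk_name, edek, iv):
        return keystream_xor(self._ezks[ezk_name], iv, edek)


kms = ToyKMS()
kms.create_zone_key("zone1-key")

# Name Node side: file metadata stores only the EDEK and IV, never the DEK.
edek, iv = kms.generate_edek("zone1-key")
file_meta = {"ezk": "zone1-key", "edek": edek, "iv": iv}

# Client side: have the KMS unwrap the EDEK, then read/write data with the DEK.
dek = kms.decrypt_edek(file_meta["ezk"], file_meta["edek"], file_meta["iv"])
plaintext = b"sensitive record"
ciphertext = keystream_xor(dek, file_meta["iv"], plaintext)
recovered = keystream_xor(dek, file_meta["iv"], ciphertext)
```

The point of the split is that the metadata holder never sees a usable key: it can hand out the EDEK, but only a caller the key server trusts can turn it back into the DEK.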

Page 28: Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)


New in HDP 2.2: Operational Security Enhancements

Page 29: Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)


DataNode does not require root

Enables Organizations to run services without utilizing root privilege

For Kerberized clusters

DataNode no longer needs to run as the Linux root user when starting

DataNode no longer needs to bind to privileged ports

DataNode utilizes SASL to transfer blocks between HDFS clients and DataNodes.
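As a rough illustration of what a non-root DataNode setup involves, the Python sketch below checks a configuration dictionary for the settings I understand Apache Hadoop's secure mode documentation to require for SASL data transfer. The property names (`dfs.data.transfer.protection`, `dfs.datanode.address`, `dfs.http.policy`) and the `HADOOP_SECURE_DN_USER` variable are taken from that documentation as I recall it, not from the slides, so verify them against your Hadoop version. The function name and the example port are hypothetical.

```python
def non_root_datanode_issues(conf, env):
    """List the reasons this configuration would still need a root-started DataNode."""
    issues = []

    # SASL protection level for the block transfer protocol.
    protection = conf.get("dfs.data.transfer.protection")
    if protection not in {"authentication", "integrity", "privacy"}:
        issues.append("dfs.data.transfer.protection is not set to a SASL level")

    # Ports below 1024 are privileged and need root to bind.
    address = conf.get("dfs.datanode.address", "")
    port = address.rsplit(":", 1)[-1]
    if not port.isdigit() or int(port) < 1024:
        issues.append("dfs.datanode.address does not use a non-privileged port")

    if conf.get("dfs.http.policy") != "HTTPS_ONLY":
        issues.append("dfs.http.policy is not HTTPS_ONLY")

    # This variable selects the older root-start mechanism.
    if env.get("HADOOP_SECURE_DN_USER"):
        issues.append("HADOOP_SECURE_DN_USER is still defined")

    return issues


good_conf = {
    "dfs.data.transfer.protection": "authentication",
    "dfs.datanode.address": "0.0.0.0:10019",
    "dfs.http.policy": "HTTPS_ONLY",
}
good = non_root_datanode_issues(good_conf, env={})

legacy_conf = {"dfs.datanode.address": "0.0.0.0:1004"}
legacy = non_root_datanode_issues(legacy_conf, env={"HADOOP_SECURE_DN_USER": "hdfs"})
```

An empty list means the checks in this sketch found nothing that would force a privileged port or a root start; it is not a complete security audit.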

Page 30: Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)


Q & A

Page 31: Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (HDFS)


Thank you! Learn more at: hortonworks.com/hadoop/hdfs/

Register for the remaining 4 Discover HDP 2.2 Webinars

Hortonworks.com/webinars