Debunking Common Myths of Hadoop Backup & Test Data Management

Debunking Common Myths About Hadoop Backup and Test Data Management
Hari Mankude, CTO
November 2016

Transcript of Debunking Common Myths of Hadoop Backup & Test Data Management

Page 1: Debunking Common Myths of Hadoop Backup & Test Data Management

Confidential and Proprietary


My Background


Why Bother With Backup and Test Data Mgmt?

• The average cost of a data loss incident is $900,000
• 90% of enterprises delay applications because of a lack of test data

• Source: EMC, Talena


Myth #1: Data Replicas Prevent Data Loss

[Diagram: one Name Node with five Data Nodes]


Myth #2: Hadoop Replication Prevents Data Loss

[Diagram: two clusters, each a Name Node with three Data Nodes, in Data Center #1 and Data Center #2, connected by DistCp]


Myth #3: Hadoop Snapshots Are An Effective Backup Strategy

• PROBLEM: Snapshots result in storage amplification
• PROBLEM: Need a scheduler to take timely snapshots and delete older restore points
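The scheduling problem above includes deciding which older restore points to delete. The following is a minimal sketch of one possible retention rule (keep the N most recent snapshots); the function name and the keep-latest policy are assumptions made for illustration, not something the slides prescribe.

```python
from datetime import datetime, timedelta


def snapshots_to_delete(snapshot_times, keep_latest):
    """Return the snapshot timestamps that fall outside the retention policy.

    snapshot_times: iterable of datetime objects, one per existing snapshot.
    keep_latest: number of most recent snapshots to retain (hypothetical policy).
    """
    if keep_latest < 0:
        raise ValueError("keep_latest must be non-negative")
    newest_first = sorted(snapshot_times, reverse=True)
    # Everything after the first keep_latest entries is expired;
    # return it oldest first, ready to be deleted in order.
    return sorted(newest_first[keep_latest:])


# Example: seven daily snapshots, keeping only the three most recent.
start = datetime(2016, 11, 1)
daily = [start + timedelta(days=d) for d in range(7)]
expired = snapshots_to_delete(daily, keep_latest=3)
print([t.strftime("%Y-%m-%d") for t in expired])
```

A real scheduler would also have to take the snapshots on time and issue the deletions against the cluster, which is the operational burden the slide points to.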


Myth #4: Restoring From Snapshots Is Trivial

• PROBLEM: Requires metadata and data to be restored in sync
• PROBLEM: Versioning complicates the restore process


Myth #5: DistCp Is Good Enough

• DistCp only copies file data, not application metadata (file attributes are preserved only when requested with -p)
• Very resource intensive: takes up MapReduce slots on production
• Error recovery is not robust and can lead to failed jobs
• No restore point management (i.e., no point-in-time recovery)
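To make the resource point concrete, here is a small sketch that assembles (but does not run) a DistCp command line with a cap on map tasks. The helper name, the cluster URIs, and the default of 20 maps are hypothetical; `-update`, `-p`, and `-m` are standard DistCp options.

```python
def build_distcp_command(source, target, max_maps=20):
    """Build a DistCp invocation as an argument list without executing it."""
    return [
        "hadoop", "distcp",
        "-update",            # copy files that are missing or differ on the target
        "-p",                 # preserve file status attributes
        "-m", str(max_maps),  # cap the number of simultaneous map tasks
        source,
        target,
    ]


# Hypothetical cluster addresses, for illustration only.
cmd = build_distcp_command(
    "hdfs://prod-nn:8020/data",
    "hdfs://dr-nn:8020/data",
)
print(" ".join(cmd))
```

Capping maps limits the load on production, but each run still brings the target toward the source's current state, so earlier versions are not retained unless they are managed separately.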


Myth #6: The traditional backup/restore process works

Weekly Fulls and Daily Incrementals
• 500 TB with 5% daily change = 650 TB moved per week

Agents
• Impact on CPU
• Management overhead of agents on 100s of nodes

Restores
• Involves going back to the last full backup and applying all the incrementals
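The 650 TB figure can be reproduced with a short calculation. This sketch assumes one full backup plus six daily incrementals per week, which is the schedule that yields the slide's number; the function names are illustrative.

```python
def weekly_data_moved_tb(dataset_tb, daily_change_pct, incrementals_per_week=6):
    """Data moved in one week: one full backup plus the daily incrementals."""
    incremental_tb = dataset_tb * daily_change_pct / 100
    return dataset_tb + incrementals_per_week * incremental_tb


def backups_needed_for_restore(days_since_full):
    """A restore replays the last full backup plus every incremental since it."""
    return 1 + days_since_full


print(weekly_data_moved_tb(500, 5))   # 650.0
print(backups_needed_for_restore(6))  # 7
```

With 500 TB and a 5% daily change, each incremental is 25 TB, so the week moves 500 + 6 x 25 = 650 TB, and a restore on the last day before the next full touches seven backup sets.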


Myth #7: Test Data Management Is A Simple Process

• Change Request - 1 week
• Provision Production Data - 1 week
• Create Test DB and Mask Data - 1 week
• Create Samples of Production Data - 2 days
• Push Production Data To Test - Hours
• Repeat Process - 3-4 weeks
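Adding up the individual steps shows where the 3-4 week cycle comes from. In this sketch the step durations are taken from the slide, except the final push, which the slide gives only as "Hours"; the 4-hour value is a placeholder assumption.

```python
from datetime import timedelta

# Durations as listed on the slide; the last one is an assumed placeholder.
steps = [
    ("Change request", timedelta(weeks=1)),
    ("Provision production data", timedelta(weeks=1)),
    ("Create test DB and mask data", timedelta(weeks=1)),
    ("Create samples of production data", timedelta(days=2)),
    ("Push production data to test", timedelta(hours=4)),  # slide says "Hours"
]

# Sum the durations, starting from a zero-length timedelta.
total = sum((duration for _, duration in steps), timedelta())
print(total.days)  # 23
```

Twenty-three days and a few hours is a little over three weeks, and the whole sequence repeats each time fresh test data is needed.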


The Evolution of Data Management

THE NEXT 25 YEARS

THE TRADITIONAL WORLD

Data Management, Data Platforms


Talena in Production

Test Cluster

Research Cluster

Talena GUI

Hadoop/Spark Cluster

Cassandra Cluster

Vertica Cluster

Couchbase Cluster

Talena Smart Storage Cluster


The Talena Architecture

• Deep de-duplication and compression with app-aware architecture
• Incremental-forever backup architecture
• High availability via erasure coding in distributed cluster architecture

Smart Storage Optimizer


The Talena Architecture

Native querying and analytics via active compute layer

Unbounded scale with a Hadoop-native architecture

Smart Storage Optimizer

Active Compute Services Distributed File System


The Talena Architecture

• Google-like catalog shortens data recovery time

• Automatic schema generation for mirroring and backups

• Granular recovery at an object level

• Recovery to multiple topologies

• Native integration with LDAP and Kerberos for authentication

• Role-based access control defines specific privileges

• Transparent data encryption

• Masking for PII data

Smart Storage Optimizer

Active Compute Services Distributed File System

Metadata Catalog, Data Orchestration Services, Security Services


Smart Storage Optimizer

The Talena Architecture

GUI CLI API

Active Compute Services Distributed File System

• ‘Single pane of glass’ for multiple use cases and data platforms
• Agentless architecture minimizes management overhead
• GUI, CLI, REST-based Talena API options

Metadata Catalog, Data Orchestration Services, Security Services


Hadoop Support

Supports and/or is certified against multiple distributions: Apache, Cloudera, Hortonworks, IBM BigInsights

Supports multiple applications: HDFS, Hive, HBase, Impala, Presto

Deployed either on-premises or in private/public clouds


Q&A

We’ll send you a link to our eBook “The Hadoop Backup Guide”

Additional resources: talena-inc.com/resources and talena-inc.com/blog

Ping us with any additional questions: [email protected]


Q and A