Debunking Common Myths of Hadoop Backup & Test Data Management
Confidential and Proprietary
Debunking Common Myths About Hadoop Backup and Test Data Management
Hari Mankude, CTO
November 2016
My Background
Why Bother With Backup and Test Data Mgmt?
The average cost of a data loss incident is $900,000
90% of enterprises delay applications because of a lack of test data
• Source: EMC, Talena
Myth #1: Data Replicas Prevent Data Loss
[Diagram: one Name Node and five Data Nodes]
Myth #2: Hadoop Replication Prevents Data Loss
[Diagram: Data Center #1 and Data Center #2, each with one Name Node and three Data Nodes, linked by DistCp]
Myth #3: Hadoop Snapshots Are An Effective Backup Strategy
PROBLEM: Snapshots result in storage amplification
PROBLEM: Need a scheduler to take timely snapshots and delete older restore points
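The scheduling burden described on this slide can be sketched roughly as follows. This is a minimal illustration, not a production scheduler: the function name `plan_snapshot_rotation`, the `snap-YYYYMMDD-HHMMSS` naming scheme, and the `keep` retention count are all assumptions made for the example. It only builds the `hdfs dfs -createSnapshot` and `hdfs dfs -deleteSnapshot` command lines and does not run them. It also assumes the directory has already been made snapshottable with `hdfs dfsadmin -allowSnapshot`.

```python
from datetime import datetime


def plan_snapshot_rotation(path, existing, now, keep=7):
    """Build the commands for one scheduler tick: take a snapshot, prune old ones.

    path     -- snapshottable HDFS directory (assumed already enabled)
    existing -- snapshot names already present, in the hypothetical
                'snap-YYYYMMDD-HHMMSS' format used by this sketch
    now      -- datetime used to name the new snapshot
    keep     -- number of restore points to retain (hypothetical policy)
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")

    new_name = now.strftime("snap-%Y%m%d-%H%M%S")
    commands = [["hdfs", "dfs", "-createSnapshot", path, new_name]]

    # Zero-padded timestamps sort chronologically as plain strings.
    all_names = sorted(set(existing) | {new_name})

    # Everything older than the newest `keep` names is scheduled for deletion.
    for old_name in all_names[:-keep]:
        commands.append(["hdfs", "dfs", "-deleteSnapshot", path, old_name])
    return commands


if __name__ == "__main__":
    existing = ["snap-20161101-000000", "snap-20161102-000000", "snap-20161103-000000"]
    for cmd in plan_snapshot_rotation("/data", existing, datetime(2016, 11, 4), keep=3):
        print(" ".join(cmd))
```

Even this small sketch leaves out failure handling, overlapping runs, and monitoring, which is part of the operational overhead the slide points to.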
Myth #4: Restoring From Snapshots Is Trivial
PROBLEM: Requires metadata and data to be restored in sync
PROBLEM: Versioning complicates the restore process
Myth #5: DistCp Is Good Enough
• DistCp only copies data, not metadata or attributes
• Very resource intensive – takes up MapReduce slots on production
• Error recovery is not robust and can lead to failed jobs
• No restore point management (aka no point-in-time recovery)
Myth #6: The Traditional Backup/Restore Process Works
Weekly Fulls and Daily Incrementals
• 500 TB with 5% daily change = 650 TB moved per week
Agents
• Impact on CPU
• Management overhead of agents on 100s of nodes
Restores
• Involves going back to last full backup and applying all the incrementals
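The 650 TB figure can be reproduced with a short calculation. The sketch below assumes one full backup per week plus six daily incrementals, each the size of the daily change; the slide does not state that breakdown, so treat it as one plausible reading. The function names are hypothetical.

```python
def weekly_backup_volume_tb(dataset_tb, daily_change_rate, incrementals_per_week=6):
    """Data moved in one week of 'weekly full + daily incremental' backups."""
    full_tb = dataset_tb                                  # one weekly full
    incremental_tb = dataset_tb * daily_change_rate       # one daily incremental
    return full_tb + incrementals_per_week * incremental_tb


def backups_needed_for_restore(days_since_full):
    """Backups that must be read to restore: the last full plus every incremental since."""
    return 1 + days_since_full


if __name__ == "__main__":
    # 500 TB full + 6 incrementals of 25 TB each, roughly 650 TB per week
    print(weekly_backup_volume_tb(500, 0.05))
    # A restore on the last day before the next full needs the full plus 6 incrementals
    print(backups_needed_for_restore(6))
```

Under these assumptions the weekly volume is about 1.3 times the dataset size, and a late-week restore has to replay the full backup and every incremental taken since.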
Myth #7: Test Data Management Is A Simple Process
Change Request - 1 week
Provision Production Data - 1 week
Create Test DB and Mask Data - 1 week
Create Samples of Production Data - 2 days
Push Production Data To Test - Hours
Repeat Process - 3-4 weeks
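As a rough check, the step durations on this slide add up to the quoted 3-4 week cycle. The sketch below assumes 7-day weeks and counts the "Hours" step as half a day; both are assumptions for illustration, and the names are hypothetical.

```python
# Step durations in days, taken from the slide; "Hours" is assumed to be 0.5 day.
STEPS_DAYS = [
    ("Change Request", 7),
    ("Provision Production Data", 7),
    ("Create Test DB and Mask Data", 7),
    ("Create Samples of Production Data", 2),
    ("Push Production Data To Test", 0.5),
]


def cycle_length_weeks(steps):
    """Total length of one provisioning cycle, in 7-day weeks."""
    return sum(days for _, days in steps) / 7


if __name__ == "__main__":
    print(f"One test-data cycle takes about {cycle_length_weeks(STEPS_DAYS):.1f} weeks")
```

The total lands between three and four weeks, which matches the "Repeat Process" figure on the slide.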
The Evolution of Data Management
THE NEXT 25 YEARS
THE TRADITIONAL WORLD
Data Management
Data Platforms
Talena in Production
[Diagram labels: Test Cluster, Research Cluster, Talena GUI, Hadoop/Spark Cluster, Cassandra Cluster, Vertica Cluster, Couchbase Cluster, Talena Smart Storage Cluster]
The Talena Architecture
• Deep de-duplication and compression with app-aware architecture
• Incremental-forever backup architecture
• High availability via erasure coding in distributed cluster architecture
Smart Storage Optimizer
The Talena Architecture
Native querying and analytics via active compute layer
Unbounded scale with a Hadoop-native architecture
Smart Storage Optimizer
Active Compute Services Distributed File System
The Talena Architecture
• Google-like catalog shortens data recovery time
• Automatic schema generation for mirroring and backups
• Granular recovery at an object level
• Recovery to multiple topologies
• Native integration with LDAP and Kerberos for authentication
• Role-based access control defines specific privileges
• Transparent data encryption
• Masking for PII data
Smart Storage Optimizer
Active Compute Services Distributed File System
Metadata Catalog Data Orchestration Services Security Services
The Talena Architecture
• ‘Single pane of glass’ for multiple use cases and data platforms
• Agentless architecture minimizes management overhead
• GUI, CLI, REST-based Talena API options
Smart Storage Optimizer
Active Compute Services Distributed File System
Metadata Catalog Data Orchestration Services Security Services
GUI CLI API
Hadoop Support
Supports and/or is certified against multiple distributions: Apache, Cloudera, Hortonworks, IBM BigInsights
Supports multiple applications: HDFS, Hive, HBase, Impala, Presto
Deployed either on-premises or in private/public clouds
Q&A
We’ll send you a link to our eBook “The Hadoop Backup Guide”
Additional resources: talena-inc.com/resources and talena-inc.com/blog
Ping us with any additional questions: [email protected]
Q and A