Tuning your Hadoop Analytics with Isilon Scale-Out NAS
Alexander Graf
Advisory Systems Engineer
Unstructured Data and Analytics
AGENDA
• Commonly seen first Hadoop use cases
• Isilon Scale-Out Data Lake concept
• Better Hadoop architecture with Isilon
• What about performance?
• How to find the right solution?
Common first Hadoop use cases
• Predictive maintenance
• Churn prediction / prevention
• Fraud detection
• Data warehouse offloading
© Copyright 2017 Dell Inc.
Isilon Scale Out NAS: Simplicity and Ease of Use
• Automation:
  NO manual intervention
  NO reconfiguration
  NO server or client mount point or application changes
  NO data migrations
  NO RAID
Single File System Spans All Nodes
Scales linearly to 33 PB
ISILON MOMENTUM
• >8000 customers
• >17% YoY customer growth
• >2000 analytics customers
• #1 in Scale-Out NAS, now in All-Flash
• >3.2 exabytes shipped in calendar year 2016
• Recognized leader
ISILON - THE RECOGNIZED LEADER
Isilon Workload Consolidation
Ethernet
HADOOP ARCHITECTURE – DAS VS ISILON
[Diagram: DAS layout with one NameNode and six combined Data Node + Compute Node servers; Isilon layout with six Compute Nodes on Ethernet and storage nodes labeled name node and data node]
TRADITIONAL “SHARE-NOTHING” HADOOP
[Diagram: Existing Virtualized Data Center with Existing Primary Storage holding Unstructured Data (copy 1), next to a SHARE-NOTHING Hadoop Infrastructure holding copies 2, 3 and 4 on each group of nodes]
• Hadoop on a Stick (R=3) means 5 data copies ($$$$)
• Data has to be copied to the Hadoop cluster before analysis can begin (Time to Results)
How will you maintain data consistency when a file changes on your primary storage?
Existing Virtualized Data Center
Existing Primary Storage
ISILON “SHARE-EVERYTHING” HADOOP
• Start using Hadoop NOW with unused processing and RAM available in your VMware environment
• No replication required (use your existing data)
• Access to the same data via NAS and HDFS protocols
• Time to results is extremely fast using already existing data, with NO COPIES or wasted $$$$
• Analysis can begin with the 1st VM
[Diagram: New Hadoop Compute Nodes reach the Unstructured Data over the Data Center Network using the native HDFS protocol]
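As a rough illustration of "Use Native HDFS Protocol": Hadoop clients locate their file system through the `fs.defaultFS` property in core-site.xml. The sketch below generates such a fragment in Python. The host name `isilon.example.com` and port 8020 are assumptions for illustration only; the real values depend on the cluster's DNS zone and HDFS settings.

```python
import xml.etree.ElementTree as ET


def core_site(default_fs: str) -> str:
    """Build a minimal core-site.xml fragment that sets fs.defaultFS."""
    conf = ET.Element("configuration")
    prop = ET.SubElement(conf, "property")
    ET.SubElement(prop, "name").text = "fs.defaultFS"
    ET.SubElement(prop, "value").text = default_fs
    return ET.tostring(conf, encoding="unicode")


# Hypothetical host name and port; replace with the values of your own cluster.
xml_text = core_site("hdfs://isilon.example.com:8020")
print(xml_text)
```

With a setting like this, compute nodes read and write the shared data in place instead of loading it into local HDFS disks first.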
TIME-TO-RESULTS
Data Copy Analysis vs. In-Place Analysis
Existing Primary Storage
Hadoop on a Stick
Have you ever copied 100TB from Primary Storage to a Hadoop system?
How long does it take to copy 100TB from one place to another over a 10Gb link?
>24 Hours
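The ">24 Hours" figure can be checked with a back-of-the-envelope calculation. The sketch below assumes decimal units (1 TB = 10^12 bytes, 10 Gb/s = 10^10 bits/s); the 80% link efficiency is an assumed illustrative value, not a measurement from the slides.

```python
def transfer_hours(size_tb: float, link_gbit_s: float, efficiency: float = 1.0) -> float:
    """Hours needed to move size_tb terabytes over a link of link_gbit_s Gbit/s."""
    bits = size_tb * 1e12 * 8                       # payload in bits
    seconds = bits / (link_gbit_s * 1e9 * efficiency)
    return seconds / 3600


ideal = transfer_hours(100, 10)            # perfect line rate, no overhead
realistic = transfer_hours(100, 10, 0.8)   # assumed 80% effective throughput

print(f"ideal: {ideal:.1f} h, at 80% efficiency: {realistic:.1f} h")
```

Even at full line rate the copy takes about 22 hours, so any protocol overhead or contention pushes it past a full day.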
Data Center Network
Existing Primary Storage
Hadoop Compute Nodes
Reading relevant data for analysis
Virtual Servers
HDFS, NFS, FTP, SMB
Support for Multiple Hadoop Landscapes (or even different versions/distros)
[Diagram: several groups of MapReduce compute nodes (e.g. Cloudera, IBM) sharing one DATA LAKE whose nodes are labeled name node and data node]
Increase Utilization to Control Costs
Hadoop 1
Hadoop 2
HBase
• Consolidated cluster has access to entire pool of physical resources
• Take advantage of multi-tenancy to increase utilization during non-peak hours
Source:
HDFS PERFORMANCE BENCHMARKS
DATANODE LOAD BALANCING
INTELLIGENTLY IMPROVE YOUR HADOOP PERFORMANCE
Key Features
• Intelligently provides the datanode with the least load to new HDFS clients
• Totally transparent to the client, no configuration required
Benefits
• Improves overall performance of Hadoop clients for analytics workloads
• Avoids overloading any specific OneFS node and increases cluster resilience
[Diagram: HDFS Client and cluster nodes Node 1, Node 2, Node 3]
1. Namenode: Where to write?
2. Write to Node 2.
3. Good, will write to Node 2.
Connection Count
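The selection step above can be sketched as a least-connections choice. This is a minimal illustration only, assuming the connection count is the sole load metric; it is not the actual OneFS implementation, and the `Node` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    connections: int = 0


def pick_datanode(nodes):
    """Return the node with the fewest active connections (ties: first in list)."""
    return min(nodes, key=lambda n: n.connections)


def assign_client(nodes):
    """Answer a client's 'where to write?' and count the new connection."""
    node = pick_datanode(nodes)
    node.connections += 1
    return node.name


# Assumed starting connection counts, chosen so Node 2 has the least load.
nodes = [Node("Node 1", 4), Node("Node 2", 1), Node("Node 3", 3)]
first = assign_client(nodes)
print(first)
```

Because the count is updated on every assignment, repeated requests spread across the nodes rather than piling onto one.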
HIBENCH – WORDCOUNT TESTS
DAS Results:
Type      Input_data_size  Duration(s)  Throughput(bytes/s)  Throughput/node
Tiny      36 KB            24.441       1478                 295
Large     3 GB             90.349       36358575             7271715
Huge      32 GB            136.893      239963008            47992601
Gigantic  328 GB           1429.692     229763783            45952756

Isilon Results:
Type      Input_data_size  Duration(s)               Throughput(bytes/s)  Throughput/node
Tiny      36 KB            23.446 (4.07% Faster)     1529                 305
Large     3 GB             62.457 (30.87% Faster)    52595796             10519159
Huge      32 GB            101.105 (26.14% Faster)   324901473            64980294
Gigantic  328 GB           574.295 (59.83% Faster)   571990421            114398084
Counts the occurrences of each word in the input data, which is generated using RandomTextWriter
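The "% Faster" figures follow from the two duration columns as the relative reduction in runtime. A short sketch that recomputes them from the durations reported in the tables (the variable and function names are illustrative):

```python
# Durations in seconds, copied from the DAS and Isilon result tables.
das = {"Tiny": 24.441, "Large": 90.349, "Huge": 136.893, "Gigantic": 1429.692}
isilon = {"Tiny": 23.446, "Large": 62.457, "Huge": 101.105, "Gigantic": 574.295}


def percent_faster(baseline_s: float, new_s: float) -> float:
    """Relative runtime reduction of new_s versus baseline_s, in percent."""
    return (baseline_s - new_s) / baseline_s * 100


faster = {size: percent_faster(das[size], isilon[size]) for size in das}
for size, pct in faster.items():
    print(f"{size}: {pct:.2f}% faster")
```

Note that this is a reduction in duration, not a throughput ratio: the Gigantic run finishing in about 40% of the DAS time corresponds to roughly 2.5 times the throughput.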
Even faster with all Flash: Isilon Generation 6
[Chart: Performance vs. Capacity, positioning the previous S-Series, X-Series, NL-Series and HD-Series against the Generation 6 models]
• F800: 250k ops, 15GB/s
• H600: 117k ops, 12GB/s
• H500: 40k ops, 5GB/s
• H400: 2GB/s, 480TB/chassis
• A200: 120TB-480TB/chassis
• A2000: 800TB/chassis
FINDING THE RIGHT SOLUTION
HADOOP DECISIONS
DAS
ECS
3 TRADITIONAL DISCOVERY QUESTIONS
1. What do you hope to achieve with Hadoop?
2. Why is this impactful to your business?
3. Which Hadoop distribution will you choose?
Experienced Partners
Data Thinking
• Consulting: Data, Algorithms, Compute, Mindset
• Guiding companies to data leadership and creatorship
Data Science
• Ideation & scoping of use cases
• Data analysis
• Development of machine learning algorithms
• Proofs of concept
Data Engineering
• Architecture design and concepts
• Engineering and deployment
• Testing and test management
• Application management
DataOps
• Managed, hybrid, cloud infrastructures
• DevOps application management
• Hadoop and beyond on scale solutions
• Security concepts and system design
*UM HADOOP-AS-A-SERVICE
1. Hadoop hardware on-prem at the customer data center or off-prem at the UM data center
2. *um provides fully managed platform services, including the Hadoop layer
3. Customer-specific analytics software (Tableau, SAS or others)
managed by
Compute nodes
Proven solutions for unstructured analytics
Dell EMC Unstructured Analytics Portfolio
• PowerEdge Solution accelerators
• Splunk Ready System
• Hadoop Ready Bundle
• QuickStart for Hadoop
• EDW Optimization Solutions
• Hadoop Backup Solutions
• SAS-Grid Solution with Isilon
• Streaming Analytics Solutions
Recap - Better Hadoop with Isilon
• No data loading, better performance => FASTER RESULTS
• Run pilots on existing infrastructure
• Run multiple Hadoop distributions
• Scale storage and compute independently
• Get enterprise storage features: Snapshots, DR-Replicas, Compliance
• Get the best possible capacity utilisation: 80%+ of raw capacity
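To put the "80%+ of raw" figure in context, the sketch below compares it with the usable share under plain three-way HDFS replication (R=3), where every block is stored three times. The 900 TB raw capacity is an arbitrary example value, and the calculation ignores temporary and scratch space on either side.

```python
def usable_with_replication(raw_tb: float, replicas: int = 3) -> float:
    """Usable capacity when every block is stored `replicas` times."""
    return raw_tb / replicas


def usable_with_efficiency(raw_tb: float, efficiency: float = 0.8) -> float:
    """Usable capacity at a given storage efficiency (0.8 = 80% of raw)."""
    return raw_tb * efficiency


raw = 900.0  # assumed raw capacity in TB, for illustration only
das_usable = usable_with_replication(raw)
isilon_usable = usable_with_efficiency(raw)

print(f"R=3 replication: {das_usable:.0f} TB usable ({das_usable / raw:.0%} of raw)")
print(f"80% efficiency:  {isilon_usable:.0f} TB usable ({isilon_usable / raw:.0%} of raw)")
```

Under these assumptions the same raw capacity holds about 2.4 times as much data at 80% efficiency as it does with three-way replication.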
Thank you