The Evolution of Storage to Support Data Analytics at Scale


Transcript of The Evolution of Storage to Support Data Analytics at Scale

Page 1: The Evolution of Storage to Support Data Analytics at Scale

Joe Moore, NetApp Array Products Group

The Evolution of Storage to Support Data Analytics at Scale


Page 2: The Evolution of Storage to Support Data Analytics at Scale

Solutions For Multiple Workloads

Deliver solutions built on open standards with best-in-class partnerships


FAS/V Family with Data ONTAP®

E-Series with Hadoop, Lustre, StorNext, StorageGRID, and many others

Agile Data Infrastructure
Infrastructure for a New Era of Enterprise Applications

Page 3: The Evolution of Storage to Support Data Analytics at Scale

What is “Big Data”?

(Diagram: the three dimensions of Big Data: Volume, Speed, Complexity)

“Big Data” refers to datasets whose volume, speed, and complexity are beyond the ability of typical tools to capture, store, manage, and analyze.


The term was coined in 2000 by Francis Diebold, Professor of Economics at the University of Pennsylvania.

Page 4: The Evolution of Storage to Support Data Analytics at Scale

Big Data Solution Portfolio

Insight from extremely large datasets

Performance for data intensive workloads

Secure boundless data storage



Page 5: The Evolution of Storage to Support Data Analytics at Scale

Analytics of Tomorrow

•  Traditional and Big Analytics will run side by side for years to come.
•  Hadoop moves to shared, virtualized infrastructure for better efficiency and ease of management, either:
   –  Logically distributed, shared nothing on physically shared everything, or
   –  Same as above, except Hadoop becomes logically shared everything, as HDFS is replaced by a parallel file system (e.g., Lustre, StorNext, or GPFS).
•  Enterprise-class resiliency (no SPoF) and reliability with HPC-like performance (no need for triplicas).
•  Use of a single copy of data for the map phase (higher storage utilization); see the sketch after this list.
•  Natural intersection with Cloud (Analytics as a Service).
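A minimal sketch of the single-copy idea: on a shared parallel file system, each map task reads only its own byte range of one shared input file instead of relying on per-node HDFS replicas. Plain Python with a hypothetical split size and a throwaway temp file standing in for the shared mount; it illustrates the access pattern, not NetApp's or Hadoop's actual implementation.

```python
import os
import tempfile

CHUNK = 64 * 1024 * 1024  # hypothetical 64 MB map split, mirroring HDFS-style blocks


def map_splits(path, chunk=CHUNK):
    """Yield (offset, length) splits over a single shared copy of the input."""
    size = os.path.getsize(path)
    for offset in range(0, size, chunk):
        yield offset, min(chunk, size - offset)


def run_map_task(path, offset, length):
    """Each task reads only its byte range from the shared file (no local replica)."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    return len(data)  # stand-in for real map logic


if __name__ == "__main__":
    # Toy stand-in for a file sitting on a shared parallel file system mount.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * (3 * 1024 * 1024))
        shared_input = tmp.name
    for off, ln in map_splits(shared_input, chunk=1024 * 1024):
        print(off, run_map_task(shared_input, off, ln))
    os.unlink(shared_input)
```

Raw capacity then tracks the logical dataset size (plus RAID or erasure-code parity) rather than three full copies.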


Page 6: The Evolution of Storage to Support Data Analytics at Scale

Application-Aware Storage (Hadoop example)


Hadoop (Terasort) Workload Example
•  Thin 6 GB slices of the LUN are active; sparse working set of 144 GB out of 4 TB.
•  Intermediate results are written, then read back within 20 minutes; cache only until first read.
•  64 MB reads are issued as a jumbled IO burst; chunk-aligned prefetch.
•  Map and Reduce phases.

Key Observations
•  Dominant workload trend: big-block IO with "pseudo-random" jumps (Lustre, StorNext, Teradata, Hadoop); prefetch only up to the FS block size.
•  Little short-term block reuse: LRU replacement is ineffective, and all cache hits come from prefetch. Evict-after-read?
•  The FS may split an IO into a random burst, defeating traditional stream prefetch logic; it appears as IO jitter within a stream as wide as the application block size.
•  Sub-LUN working sets have distinct IO characteristics, so a static LUN-grain caching policy is sub-optimal.
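A rough sketch of the cache policy these observations point to: prefetch whole chunk-aligned extents rather than streaming ahead, and drop each block as soon as it has been read once, since the trace shows almost no short-term reuse for LRU to exploit. This is hypothetical toy code, not the array's actual caching logic; the 16-block chunk size is an assumption.

```python
class ChunkPrefetchCache:
    """Toy read cache: chunk-aligned prefetch plus evict-after-read.

    Assumes little short-term block reuse (as in the Terasort trace above),
    so a block is dropped once it has been served instead of kept LRU-style.
    """

    def __init__(self, backend_read, chunk_blocks=16):
        self.backend_read = backend_read  # function: block_id -> data
        self.chunk_blocks = chunk_blocks  # prefetch unit, in blocks
        self.cache = {}                   # block_id -> data
        self.hits = 0
        self.misses = 0

    def read(self, block_id):
        if block_id in self.cache:
            self.hits += 1
            return self.cache.pop(block_id)  # evict-after-read
        self.misses += 1
        # Miss: prefetch the whole chunk-aligned extent containing this block.
        start = (block_id // self.chunk_blocks) * self.chunk_blocks
        for b in range(start, start + self.chunk_blocks):
            self.cache[b] = self.backend_read(b)
        return self.cache.pop(block_id)


if __name__ == "__main__":
    cache = ChunkPrefetchCache(backend_read=lambda b: f"block-{b}")
    # A "pseudo-random" burst inside one 16-block chunk: one miss, then all hits.
    for b in [5, 3, 12, 0, 9]:
        cache.read(b)
    print(cache.hits, cache.misses)  # 4 1
```

Every hit on the burst comes from the prefetch triggered by the first miss, which is exactly the behavior described above.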

Page 7: The Evolution of Storage to Support Data Analytics at Scale

Data Evolution: Scale, Structure and Storage

•  Unstructured data is increasingly the predominant format:
   –  Scale is a challenge for traditional database technologies.
   –  Innovative key-value stores (HBase, BigTable) sacrifice some structure (e.g., relational indexing) to achieve scalability.
   –  Large data sets for many analytics domains are not amenable to fixed tabular structures (see the sketch below):
      •  Adjacency lists for graphs/networks,
      •  Feature vectors for machine learning, which are typically a reduction from unstructured input.
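For a concrete picture of these non-tabular shapes, a toy sketch with hypothetical data: a graph held as adjacency lists, and a feature vector produced by reducing unstructured text to word counts.

```python
from collections import Counter

# Adjacency list: a graph/network stored as variable-length neighbor lists,
# not as fixed relational columns.
follows = {
    "alice": ["bob", "carol"],
    "bob": ["carol"],
    "carol": [],
}

# Feature vector: a reduction of unstructured input (raw text) into numeric
# features -- here a simple bag-of-words count.
doc = "big data needs big storage"
features = Counter(doc.split())

print(follows["alice"])  # ['bob', 'carol']
print(features)          # Counter({'big': 2, 'data': 1, 'needs': 1, 'storage': 1})
```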

•  Cluster file systems, with large blocks and write-once semantics, accommodate this evolution of data scale and structure:
   –  A block size of 64 MB is not optimal for all datasets:
      •  Google Colossus uses 1 MB blocks.
   –  Cluster-level erasure codes replace replica blocks for data protection (see the sketch below).
      •  System-level RAID is also a viable alternative to replicas.
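The efficiency argument behind erasure codes, and the metadata argument behind block size, come down to simple arithmetic. A hedged sketch with illustrative numbers only; the 10+4 code and the 1 TB file are examples, not figures from the talk.

```python
def raw_capacity(logical_tb, replicas=None, erasure=None):
    """Raw capacity needed to protect `logical_tb` of logical data.

    replicas: replication factor (HDFS defaults to 3, the "triplicas" above).
    erasure:  (data_blocks, parity_blocks), e.g. a Reed-Solomon 10+4 code.
    """
    if replicas is not None:
        return logical_tb * replicas
    k, m = erasure
    return logical_tb * (k + m) / k


dataset_tb = 100
print(raw_capacity(dataset_tb, replicas=3))       # 300 TB with triple replication
print(raw_capacity(dataset_tb, erasure=(10, 4)))  # 140.0 TB with a 10+4 code

# Block-size choice drives metadata scale: blocks to track for a 1 TB file.
file_bytes = 1 * 2**40
for block in (64 * 2**20, 1 * 2**20):             # 64 MB vs 1 MB blocks
    print(block // 2**20, "MB blocks ->", file_bytes // block, "blocks to track")
```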


Page 8: The Evolution of Storage to Support Data Analytics at Scale
