Introduction to Apache Spark


Transcript of Introduction to Apache Spark

Page 1: Introduction to Apache Spark

© Cloudera, Inc. All rights reserved.

Introduction to Apache Spark
Jordan Volz, Systems Engineer @ Cloudera

Page 2: Introduction to Apache Spark

Analyzing Data on Large Data Sets

• Python, R, etc. are popular tools among data scientists/analysts, statisticians, etc.
• Why are these tools popular?
  • Easy to learn and maximize productivity for data engineers, data scientists, statisticians
  • Build robust software and do interactive data analysis
  • Large, diverse open source development communities
  • Comprehensive libraries: data wrangling, ML, visualization, etc.

• Limitations do exist:
  • Largely confined to single-node analysis and smaller data sets
  • Requires sampling or aggregations for larger data
  • Distributed tools compromise in various ways – adds complexity and time
  • Restricts effectiveness in certain use cases

Page 3: Introduction to Apache Spark

MapReduce – Analysis on Large Data Sets (Hadoop)

Key Advances by MapReduce:

• Data Locality: Automatic split computation and launch of mappers appropriately
• Fault-Tolerance: Write-out of intermediate results and restartable mappers meant the ability to run on commodity hardware
• Linear Scalability: Combination of locality + a programming model that forces developers to write generally scalable solutions to problems

[Diagram: many parallel Map tasks feeding into a smaller set of Reduce tasks]

Page 4: Introduction to Apache Spark

MapReduce is Not Perfect

• Limited to the map-reduce paradigm
• Lots of I/O → slower jobs
• Iterative jobs (ML) → even slower
• Redundant joins with SQL tools

[Diagram: chains of Map → Reduce stages, each writing intermediate results to disk]

Page 5: Introduction to Apache Spark


MapReduce on YARN

Page 6: Introduction to Apache Spark


Death by Pinprick

Page 7: Introduction to Apache Spark

Apache Spark: Flexible, in-memory data processing for Hadoop

Easy Development
Flexible Extensible API
Fast Batch & Stream Processing

• Rich APIs for Scala, Java, and Python
• Interactive shell
• APIs for different types of workloads:
  • Batch (MR)
  • Streaming
  • Machine Learning
  • Graph
• In-memory processing and caching

Retains: Linear Scalability, Fault-Tolerance, Data Locality

Page 8: Introduction to Apache Spark

Spark Basics

• Distributed cluster framework (like MR), running tasks in parallel across a cluster
• Tasks operate in-memory, spilling to disk when memory is exceeded
• Resilient Distributed Datasets (RDD): read-only partitioned collection of records
• RDDs are actionable through parallel transformations and actions
• Lazy materialization optimizes resources
• RDD lineage from the storage layer to the compute and caching layer provides fault-tolerance
• Users control persistence and partitioning
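The RDD ideas above – an immutable, partitioned collection whose transformations are recorded lazily and only executed when an action runs – can be sketched in plain Python. This is a stdlib-only toy model for illustration, not Spark's actual API; the class and method names are hypothetical:

```python
# Toy sketch of an RDD (NOT Spark itself): read-only partitions, lazily
# recorded transformations, and actions that replay the lineage.
class ToyRDD:
    def __init__(self, partitions, ops=()):
        self.partitions = partitions          # read-only list of record lists
        self.ops = ops                        # recorded (lazy) transformations

    def map(self, f):                         # transformation: just record it
        return ToyRDD(self.partitions, self.ops + (("map", f),))

    def filter(self, f):                      # transformation: just record it
        return ToyRDD(self.partitions, self.ops + (("filter", f),))

    def _run(self, part):                     # replay the lineage on one partition
        for kind, f in self.ops:
            part = [f(x) for x in part] if kind == "map" else [x for x in part if f(x)]
        return part

    def collect(self):                        # action: materialize all partitions
        return [x for p in self.partitions for x in self._run(p)]

    def count(self):                          # action, built on collect
        return len(self.collect())

rdd = ToyRDD([[1, 2, 3], [4, 5, 6]])          # two "partitions"
evens_sq = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(evens_sq.collect())                     # [4, 16, 36] -- work happens here
```

Note that building `evens_sq` does no work at all; only the `collect()` call walks the partitions, which is the lazy-materialization point from the slide.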

Page 9: Introduction to Apache Spark

Fast Processing Using RAM, Operator Graphs

In-Memory Caching
• Data partitions read from RAM instead of disk

Operator Graphs
• Scheduling optimizations
• Fault tolerance

[Diagram: an RDD operator graph with stages of map, join, filter, groupBy, and take over RDDs A-F; cached partitions are held in memory]
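Why reading cached partitions from RAM matters for iterative jobs can be sketched with a small stdlib-only model (not Spark code; the class name is made up for illustration): the first action pays the cost of computing the data, and subsequent passes reuse the materialized result.

```python
# Toy sketch (not Spark) of caching: the first action computes the records,
# later iterations are served from the in-memory copy instead of recomputing.
class CachedDataset:
    def __init__(self, compute):
        self.compute = compute          # function producing the records
        self.cached = None
        self.compute_calls = 0          # how often we actually did the work

    def cache(self):
        return self                     # a hint: materialization is still lazy

    def collect(self):
        if self.cached is None:         # first action: compute and keep
            self.compute_calls += 1
            self.cached = self.compute()
        return self.cached              # later actions: reuse from memory

data = CachedDataset(lambda: [x * x for x in range(5)]).cache()
for _ in range(10):                     # "iterative job": 10 passes over the data
    total = sum(data.collect())
print(data.compute_calls)               # 1 -- computed once, reused 9 times
```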

Page 10: Introduction to Apache Spark

Logistic Regression Performance (Data Fits in Memory)

[Chart: running time (s) vs. # of iterations (1, 5, 10, 20, 30) for MapReduce and Spark]

MapReduce: 110 s/iteration
Spark: first iteration = 80 s; further iterations ≈ 1 s due to caching

Page 11: Introduction to Apache Spark


Spark on YARN

Page 12: Introduction to Apache Spark

Spark will replace MapReduce
To become the standard execution engine for Hadoop

Hadoop MapReduce:

    public static class WordCountMapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          output.collect(word, one);
        }
      }
    }

    public static class WordCountReduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }

Spark:

    val spark = new SparkContext(master, appName, [sparkHome], [jars])
    val file = spark.textFile("hdfs://...")
    val counts = file.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")
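The shape of the Spark word-count pipeline (flatMap → map → reduceByKey) can be re-expressed in plain Python, with no cluster involved. This is an illustrative stdlib-only analogue, not Spark code; `Counter` stands in for the map/reduceByKey pair:

```python
# The Spark word count, re-expressed with plain Python so the pipeline's
# shape is visible: flatMap (split lines) -> map/reduceByKey (count words).
from collections import Counter

lines = ["to be or not to be", "to see or not to see"]

# flatMap(line => line.split(" ")): one flat list of words
words = [w for line in lines for w in line.split(" ")]

# map(word => (word, 1)) + reduceByKey(_ + _): Counter does both at once
counts = Counter(words)

print(counts["to"])   # 4
```

The Spark version distributes exactly this computation: splitting happens per partition, and `reduceByKey` merges per-word counts across partitions.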

Page 13: Introduction to Apache Spark

The Future of Data Processing on Hadoop
Spark complemented by specialized fit-for-purpose engines

• General Data Processing w/ Spark: fast batch processing, machine learning, and stream processing
• Analytic Database w/ Impala: low-latency, massively concurrent queries
• Full-Text Search w/ Solr: querying textual data
• On-Disk Processing w/ MapReduce: jobs at extreme scale, extremely disk-IO intensive

Shared: data storage, metadata, resource management, administration, security, governance

Page 14: Introduction to Apache Spark

Easy Development: High Productivity Language Support

• Native support for multiple languages with identical APIs: Scala, Java, Python
• Use of closures, iterations, and other common language constructs to minimize code
• 2-5x less code

Python:

    lines = sc.textFile(...)
    lines.filter(lambda s: "ERROR" in s).count()

Scala:

    val lines = sc.textFile(...)
    lines.filter(s => s.contains("ERROR")).count()

Java:

    JavaRDD<String> lines = sc.textFile(...);
    lines.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("ERROR"); }
    }).count();

Page 15: Introduction to Apache Spark

Easy Development: Use Interactively

• Interactive exploration of data for data scientists
• No need to develop "applications"
• Developers can prototype applications on a live system

    percolateur:spark srowen$ ./bin/spark-shell --master local[*]
    ...
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 1.5.0-SNAPSHOT
          /_/

    Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_51)
    Type in expressions to have them evaluated.
    Type :help for more information.
    ...

    scala> val words = sc.textFile("file:/usr/share/dict/words")
    ...
    words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21

    scala> words.count
    ...
    res0: Long = 235886

    scala>

Page 16: Introduction to Apache Spark

Easy Development: Expressive API

• map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, sample, take, first, partitionBy, mapWith, pipe, save, …
• reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip
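Two entries from the list, groupByKey and reduceByKey, differ in a way worth seeing concretely. The sketch below mimics their semantics with the Python standard library (it is an illustration, not Spark's implementation): groupByKey gathers all values per key, while reduceByKey folds them down with a binary function; in real Spark, the latter can also combine values before shuffling them across the network.

```python
# stdlib illustration of groupByKey vs. reduceByKey semantics (not Spark code)
from itertools import groupby
from functools import reduce

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]
pairs.sort(key=lambda kv: kv[0])     # groupby needs its input sorted by key

# groupByKey: every value for a key is retained
grouped = {k: [v for _, v in g] for k, g in groupby(pairs, key=lambda kv: kv[0])}

# reduceByKey(_ + _): values for a key are folded into one result
reduced = {k: reduce(lambda x, y: x + y, vs) for k, vs in grouped.items()}

print(grouped)   # {'a': [1, 3], 'b': [2, 4]}
print(reduced)   # {'a': 4, 'b': 6}
```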

Page 17: Introduction to Apache Spark

Example: Logistic Regression

    data = spark.textFile(...).map(readPoint).cache()
    w = numpy.random.rand(D)
    for i in range(iterations):
        gradient = data \
            .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x) \
            .reduce(lambda x, y: x + y)
        w -= gradient
    print "Final w: %s" % w
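The same gradient loop can be run without Spark or numpy to see what each iteration computes. The version below is a stdlib-only analogue for illustration: the tiny hard-coded dataset stands in for `spark.textFile(...).map(readPoint)`, a fixed starting `w` replaces `numpy.random.rand(D)`, and the per-point term follows the standard logistic-loss gradient.

```python
# stdlib-only analogue of the logistic regression example (no Spark, no numpy)
from math import exp

# (x, y) points with labels y in {-1, +1}; a tiny linearly separable set
data = [((1.0, 0.5), 1), ((0.8, 1.2), 1), ((-1.0, -0.7), -1), ((-0.6, -1.1), -1)]
D = 2
w = [0.1, -0.2]                          # fixed start instead of a random vector

for _ in range(100):
    gradient = [0.0] * D
    for x, y in data:
        dot = sum(wi * xi for wi, xi in zip(w, x))
        # logistic-loss gradient term for one point: (sigmoid(y*w.x) - 1) * y * x
        coef = (1.0 / (1.0 + exp(-y * dot)) - 1.0) * y
        for j in range(D):
            gradient[j] += coef * x[j]
    for j in range(D):                   # w -= gradient
        w[j] -= gradient[j]

print("Final w: %s" % w)
```

In Spark the inner loop over points becomes the `map`/`reduce` over the cached RDD; because `data` is cached, each of the `iterations` passes reads it from memory, which is exactly the caching win shown on the performance slide.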

Page 18: Introduction to Apache Spark

The Spark Ecosystem & Hadoop

[Diagram, top to bottom:]
• Spark libraries: Spark Streaming, MLlib, SparkSQL, GraphX, DataFrames, SparkR
• Engines: Spark, Impala, MR, Search, others
• RESOURCE MANAGEMENT: YARN
• STORAGE: HDFS, HBase

Page 19: Introduction to Apache Spark

One Platform, Many Workloads

Batch, Interactive, and Real-Time. Leading performance and usability in one platform.

• End-to-end analytic workflows
• Access more data
• Work with data in new ways
• Enable new users

Process:
• Ingest: Sqoop, Flume, Kafka, Spark Streaming
• Transform: MapReduce, Hive, Pig, Spark

Discover:
• Analytic Database: Impala
• Search: Solr

Model:
• Machine Learning: SAS, R, Spark, Mahout

Serve:
• NoSQL Database: HBase
• Streaming: Spark Streaming

Unlimited Storage: HDFS, HBase
Security and Administration: YARN, Cloudera Manager, Cloudera Navigator

Page 20: Introduction to Apache Spark

Cloudera Customer Use Cases

Core Spark:
• Financial Services: portfolio risk analysis, ETL pipeline speed-up, 20+ years of stock data
• Health: identify disease-causing genes in the full human genome; calculate Jaccard scores on health care data sets
• ERP: Optical Character Recognition and bill classification
• Data Services: trend analysis, document classification (LDA), fraud analytics

Spark Streaming:
• Financial Services: online fraud detection
• Ad Tech: real-time ad performance analysis

Over 150 customers using Spark; Spark clusters as large as 800 nodes

Page 21: Introduction to Apache Spark

Uniting Spark and Hadoop: The One Platform Initiative
Investment Areas

• Management: Leverage Hadoop-native resource management.
• Security: Full support for Hadoop security and beyond.
• Scale: Enable 10k-node clusters.
• Streaming: Support for 80% of common stream processing workloads.

Page 22: Introduction to Apache Spark

Spark Resources

• Learn Spark
  • O'Reilly Advanced Analytics with Spark eBook (written by Clouderans)
  • Cloudera Developer Blog
  • cloudera.com/spark
• Get Trained
  • Cloudera Spark Training
• Try it Out
  • Cloudera Live Spark Tutorial

Page 23: Introduction to Apache Spark

Thank You
[email protected]
linkedin.com/in/jordan.volz