2. Spark essentials.pptx

Transcript of 2. Spark essentials.pptx

  • 8/19/2019 2. Spark essentials.pptx

    1/37

  • 8/19/2019 2. Spark essentials.pptx

    2/37

    Copyright © 2014 Oracle and/or its affiliates. All rights reserved.

    Spark essentials
    Alexey Filanovskiy
    Cloudera certified developer

  • 8/19/2019 2. Spark essentials.pptx

    3/37

    Architecture

  • 8/19/2019 2. Spark essentials.pptx

    4/37

    Architecture

    Storage Layer: Filesystem (HDFS), NoSQL Databases (Oracle NoSQL DB, HBase)
    Resource Management (YARN, cgroups)
    Processing Layer: MapReduce and Hive, Spark, Impala, Search, Big Data SQL

    Spark is one of the processing engines over HDFS.

  • 8/19/2019 2. Spark essentials.pptx

    5/37

    Architecture

    Spark consists of:
    - Spark Core: [...] processing data [...]
    - MLlib: extension for machine learning [...]
    - GraphX: extension of the core for graph [...] engine
    - Spark SQL: extension of the core that allows [...] programs with [...]
    - Spark Streaming: [...] could deal with [...]

  • 8/19/2019 2. Spark essentials.pptx

    6/37

    Architecture

    [...]es for writing programs for Spark:

  • 8/19/2019 2. Spark essentials.pptx

    7/37

    Architecture

    Spark automatically elects one node of the cluster to run the Driver Program (the main coordinator). It manages all other processing, which is distributed across the other nodes.
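    To make the driver's role concrete, here is a minimal sketch of a driver program. It assumes Spark 2.x or later on the classpath; the application name and the `local[2]` master are illustrative assumptions, not taken from the slides. The lines that build the SparkContext and call an action run in the driver, while the function passed to `map` is shipped to the worker side.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode with 2 threads stands in for a real cluster.
val conf = new SparkConf().setMaster("local[2]").setAppName("driver-sketch")
val sc = SparkContext.getOrCreate(conf) // lives in the driver process

// The driver only describes the work; the function inside map() runs in tasks.
val squares = sc.parallelize(List(1, 2, 3, 4)).map(x => x * x)

// collect() is an action: partitions are computed and the results return to the driver.
val result = squares.collect().toList
println(result.mkString(","))

sc.stop()
```

    In spark-shell the `sc` value already exists, so the first three lines and `sc.stop()` can be skipped there.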

     

  • 8/19/2019 2. Spark essentials.pptx

    8/37

    RDD

  • 8/19/2019 2. Spark essentials.pptx

    9/37

    RDD. Definition

    An RDD (Resilient Distributed Dataset) in Spark is simply an immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster.

    In other words, an RDD is the input for your Spark jobs.

    Spark provides two ways to create RDDs:
    - loading an external dataset
    - parallelizing a collection in your driver program
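    The two properties in the definition, partitioning and immutability, can be observed directly. The following is a small sketch under stated assumptions: local mode, Spark 2.x or later, and an explicit request for 3 partitions (all illustrative choices, not from the slides).

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("rdd-partitions-sketch"))

// The second argument of parallelize asks for 3 partitions explicitly.
val numbers = sc.parallelize(1 to 10, 3)

// glom() turns each partition into one array, which makes the split visible.
val perPartition = numbers.glom().collect().map(_.toList).toList
perPartition.zipWithIndex.foreach { case (part, i) => println(s"partition $i: $part") }

// Transformations never modify an RDD in place; they return a new RDD.
val doubled = numbers.map(_ * 2)
val original = numbers.collect().toList
val doubledValues = doubled.collect().toList

sc.stop()
```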

  • 8/19/2019 2. Spark essentials.pptx

    10/37

    RDD. Terminology

    {"custId":5972,"movieId"[...]

  • 8/19/2019 2. Spark essentials.pptx

    11/37

    RDD. Load dataset

    An RDD in Spark is simply an immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster.

    Spark provides two ways to create RDDs:

    - Loading an external dataset

    Example:
    [cloudera@quickstart ~]$ hadoop fs -cat hdfs[...]
    [...]
    9
    10
    ...
    scala> val inputRDD = sc.textFile("hdfs[...]
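    Because the HDFS path in the example is cut off in this transcript, the sketch below loads a temporary local file instead. The file name and its contents are hypothetical stand-ins; `sc.textFile` is the same call as in the slide. It assumes Spark 2.x or later in local mode.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("textfile-sketch"))

// Hypothetical input: a temporary local file standing in for the HDFS file.
val tmp = Files.createTempFile("numbers", ".txt")
Files.write(tmp, "1 2\n3 4 5\n6 7 8 9\n10\n".getBytes("UTF-8"))

// textFile() yields one RDD element per line of the file.
val inputRDD = sc.textFile(tmp.toUri.toString)
val lines = inputRDD.collect().toList
println(lines.mkString(" "))

sc.stop()
Files.deleteIfExists(tmp)
```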

  • 8/19/2019 2. Spark essentials.pptx

    12/37

    RDD. Define in program

    An RDD in Spark is simply an immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster.

    Spark provides two ways to create RDDs:

    - Loading an external dataset
    - Parallelizing a collection in your driver program

    Example:
    scala> val inputRDD = sc.parallelize(List("1 2", "3 4 5", "6 7 8 9", "10"))
    scala> println(inputRDD.collect().mkString(" "))
    ...
    1 2 3 4 5 6 7 8 9 10

  • 8/19/2019 2. Spark essentials.pptx

    13/37

    RDD transformation

  • 8/19/2019 2. Spark essentials.pptx

    14/37

    RDD transformation

    Key concepts:
    - Transformations are operations on RDDs that return a new RDD
    - Transformations on RDDs are lazily evaluated, meaning that Spark will not begin to execute [...]

    Example:
    val weblog = sc.textFile("hdfs[...]
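    Lazy evaluation can be made visible by counting how often the function passed to `map` actually runs. This is a sketch, assuming Spark 2.x or later (for `longAccumulator`) in local mode; the accumulator is only an illustrative probe and is not part of the slides.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("lazy-evaluation-sketch"))

// Accumulator used only to count how many times the map function is invoked.
val calls = sc.longAccumulator("map calls")

val input = sc.parallelize(List(1, 2, 3, 4))
// Transformation: this only records what to do, nothing is computed yet.
val squared = input.map { x => calls.add(1); x * x }

val callsBeforeAction: Long = calls.value // no action has been called so far

val total = squared.count()               // action: triggers the computation
val callsAfterAction: Long = calls.value

println(s"before action: $callsBeforeAction, after action: $callsAfterAction")

sc.stop()
```

    Accumulator updates made inside transformations can be applied more than once if a task is re-executed, so this probe is suitable for a local demonstration rather than for production counting.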

  • 8/19/2019 2. Spark essentials.pptx

    15/37

    RDD. Common pattern

    Pattern when a single RDD is used multiple times:

    scala> val input = sc.parallelize(List(1, 2, 3, 4))
    scala> val result1 = input.map(x => x * x)
    scala> val result2 = input.filter(x => x != 1)
    ...
    scala> println(result1.collect().mkString(","))
    1,4,9,16
    scala> println(result2.collect().mkString(","))
    2,3,4

  • 8/19/2019 2. Spark essentials.pptx

    16/37

    RDD. Common pattern. Caching

    scala> import org.apache.spark.storage.StorageLevel
    scala> val input = sc.parallelize(List(1, 2, 3, 4))
    scala> input.persist(StorageLevel.MEMORY_ONLY)
    scala> val result1 = input.map(x => x * x)
    scala> val result2 = input.filter(x => x != 1)
    ...
    scala> println(result1.collect().mkString(","))
    1,4,9,16
    scala> println(result2.collect().mkString(","))
    2,3,4

  • 8/19/2019 2. Spark essentials.pptx

    17/37

    RDD. Useful transformations (Map phase) for a single RDD

    1) map() - Apply a function to each element in the RDD and return an RDD of the result.
    scala> val input = sc.parallelize(List(1, 2, 3, 4))
    scala> val result = input.map(x => x * x)
    scala> println(result.collect().mkString(","))
    1,4,9,16

    2) flatMap() - Apply a function to each element in the RDD and return an RDD of the contents of the iterators returned.
    scala> val input = sc.parallelize(List("one", "one two", "one two three", "one two three four"))
    scala> val result = input.flatMap(x => x.split(" "))
    scala> println(result.collect().mkString(","))
    ...
    one,one,two,one,two,three,one,two,three,four

    3) filter() - Return an RDD consisting of only the elements that pass the condition passed to filter().
    scala> val input = sc.parallelize(List(1, 2, 3, 4))
    scala> val result = input.filter(line => line != 1)
    scala> println(result.collect().mkString(","))
    ...
    2,3,4

  • 8/19/2019 2. Spark essentials.pptx

    18/37

    RDD. Map in details

    map() function details. The input file (RDD) is:

    3
    4
    5
    6

    map(x => x * x)
    This function multiplies each line by itself: for x = 3, 3 * 3 => 9
    Output: 9, 16, 25, 36

    map(x => x + x)
    This function adds to each line its own value: for x = 3, 3 + 3 => 6
    Output: 6, 8, 10, 12

    map(x => x + 1)
    This function adds 1 to each line: for x = 3, 3 + 1 => 4
    Output: 4, 5, 6, 7

  • 8/19/2019 2. Spark essentials.pptx

    19/37

    RDD. Useful transformations (Map phase) for multiple RDDs

    1) union() - Produce an RDD containing elements from both RDDs.
    scala> val input1 = sc.parallelize(List(1, 2, 3, 4))
    scala> val input2 = sc.parallelize(List(3, 4, 5, 6))
    scala> println(input1.union(input2).collect().mkString(","))
    ...
    1,2,3,4,3,4,5,6

    2) intersection() - RDD containing only elements found in both RDDs.
    scala> val input1 = sc.parallelize(List(1, 2, 3, 4))
    scala> val input2 = sc.parallelize(List(3, 4, 5, 6))
    scala> println(input1.intersection(input2).collect().mkString(","))
    ...
    4,3

    3) cartesian() - Cartesian product with the other RDD.
    scala> val input1 = sc.parallelize(List(1, 2, 3, 4))
    scala> val input2 = sc.parallelize(List(3, 4, 5, 6))
    scala> println(input1.cartesian(input2).collect().mkString(","))
    ...
    (1,3),(1,4),(1,5),(1,6),(2,3),(2,4),(2,5),(2,6),(3,3),(3,4),(3,5),(3,6),(4,3),(4,4),(4,5),(4,6)

  • 8/19/2019 2. Spark essentials.pptx

    20/37

    RDD actions

  • 8/19/2019 2. Spark essentials.pptx

    21/37

    RDD. Useful actions (Reduce phase) for multiple RDD

    1) count() - Number of elements in the RDD.
    scala> val inputRDD = sc.parallelize(List(1, 2, 3, 4, 3, 4))
    scala> println(inputRDD.count())
    6

    2) countByValue() - Number of times each element occurs in the RDD.
    scala> val inputRDD = sc.parallelize(List(1, 2, 3, 4, 3, 4))
    scala> println(inputRDD.countByValue())
    ...
    Map(4 -> 2, 1 -> 1, 3 -> 2, 2 -> 1)

    3) reduce(func) - Combine the elements of the RDD together in parallel.
    scala> val inputRDD = sc.parallelize(List(1, 2, 3, 4, 3, 4))
    scala> println(inputRDD.reduce((x, y) => x + y))
    17
    scala> println(inputRDD.reduce((x, y) => x * y))
    288
    scala> println(inputRDD.reduce((x, y) => x - y))
    -15

  • 8/19/2019 2. Spark essentials.pptx

    22/37

    RDD. Reduce in details

    reduce(func) function in details. The input file is:

    3
    4
    5
    6

    reduce((x, y) => x - y)
    It goes down from the first element to the last. Example:
    Initially x=3, y=4 ⇒ 3 - 4 = -1
    Then x=-1, y=5 ⇒ -1 - 5 = -6
    Then x=-6, y=6 ⇒ -6 - 6 = -12, which is the result

    reduce((x, y) => x + y)
    It goes down from the first element to the last. Example:
    Initially x=3, y=4 ⇒ 3 + 4 = 7
    Then x=7, y=5 ⇒ 7 + 5 = 12
    Then x=12, y=6 ⇒ 12 + 6 = 18, which is the result
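    The step-by-step walk above can be reproduced with plain Scala collections, without a cluster. The sketch below is an illustration only: `traceReduce` is a hypothetical helper, and the two-partition split at the end is a hand-made simulation, not Spark's actual scheduling. It is included because Spark documents `reduce` as expecting a commutative and associative function, so a subtraction-based reduce can give a different answer once the data is spread over several partitions.

```scala
val lines = List(3, 4, 5, 6)

// Hypothetical helper: reduces left to right and prints every step.
def traceReduce(xs: List[Int], op: (Int, Int) => Int, symbol: String): Int =
  xs.tail.foldLeft(xs.head) { (x, y) =>
    val r = op(x, y)
    println(s"x=$x y=$y => $x $symbol $y = $r")
    r
  }

val diff = traceReduce(lines, _ - _, "-") // sequential subtraction
val sum = traceReduce(lines, _ + _, "+")  // sequential addition

// Simulation of two partitions (3, 4) and (5, 6) reduced separately and then merged.
val partA = List(3, 4).reduce(_ - _)
val partB = List(5, 6).reduce(_ - _)
val partitionedDiff = partA - partB // differs from the sequential result
val partitionedSum = List(3, 4).reduce(_ + _) + List(5, 6).reduce(_ + _)

println(s"sequential diff=$diff, partitioned diff=$partitionedDiff")
println(s"sequential sum=$sum, partitioned sum=$partitionedSum")
```

    Addition gives the same total either way, while subtraction does not, which is why sums and products are the safe choices for `reduce`.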

  • 8/19/2019 2. Spark essentials.pptx

    23/37

    Pair RDD

  • 8/19/2019 2. Spark essentials.pptx

    24/37

    Pair RDD

    Spark provides special operations on RDDs containing key/value pairs. These RDDs are called pair RDDs. They are very useful for group-by-key types of operations.

    Creating a Pair RDD, example:

    scala> val inputRDD = sc.parallelize(List("first string word some other", "second string hello"))
    scala> val pairs = inputRDD.map(x => (x.split(" ")(0), x))
    scala> println(pairs.collect().mkString(","))
    (first,first string word some other),(second,second string hello)

  • 8/19/2019 2. Spark essentials.pptx

    25/37

    Transformation of Pair RDD (over a single RDD)

    1) reduceByKey() - Combine values with the same key.
    scala> val inputRDD = sc.parallelize(List((1, 8), (1, 4), (2, 5), (2, 1)))
    scala> println(inputRDD.reduceByKey(_ + _).collect().mkString(","))
    (1,12),(2,6)

    2) groupByKey() - Group values with the same key.
    scala> val inputRDD = sc.parallelize(List((1, 8), (1, 4), (2, 5), (2, 1)))
    scala> println(inputRDD.groupByKey().collect().mkString(","))
    (1,CompactBuffer(8, 4)),(2,CompactBuffer(5, 1))
    Note: avoid this function when the values are going to be aggregated anyway. It always shuffles the data without a local reduce.

    3) mapValues(func) - Apply a function to each value of a pair RDD without changing the key.
    scala> val inputRDD = sc.parallelize(List((1, 8), (1, 4), (2, 5), (2, 1)))
    scala> println(inputRDD.mapValues(x => x * x).collect().mkString(","))
    (1,64),(1,16),(2,25),(2,1)

    4) sortByKey() - Return an RDD sorted by the key.
    scala> val inputRDD = sc.parallelize(List((1, 8), (2, 4), (1, 5), (2, 1)))
    scala> println(inputRDD.sortByKey().collect().mkString(","))
    (1,8),(1,5),(2,4),(2,1)
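    The note about groupByKey() can be illustrated by computing the same per-key sum both ways. This is a sketch, assuming Spark 2.x or later in local mode; the results match, but reduceByKey combines values inside each partition before the shuffle, while groupByKey moves every pair across the network first.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("reducebykey-vs-groupbykey"))

val inputRDD = sc.parallelize(List((1, 8), (1, 4), (2, 5), (2, 1)))

// reduceByKey: partial sums are built per partition, then only those are shuffled.
val viaReduce = inputRDD.reduceByKey(_ + _).collect().toMap

// groupByKey: every (key, value) pair is shuffled, and the sum happens afterwards.
val viaGroup = inputRDD.groupByKey().mapValues(_.sum).collect().toMap

println(viaReduce)
println(viaGroup)

sc.stop()
```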

  • 8/19/2019 2. Spark essentials.pptx

    26/37

    Transformation of Pair RDD (over multiple RDDs)

    1) join()
    scala> val inputRDD1 = sc.parallelize(List((1, 8), (2, 4), (3, 5), (4, 1)))
    scala> val inputRDD2 = sc.parallelize(List((4, 7)))
    scala> println(inputRDD1.join(inputRDD2).collect().mkString(","))
    ...
    (4,(1,7))

    2) leftOuterJoin()
    scala> val inputRDD1 = sc.parallelize(List((1, 8), (2, 4), (3, 5), (4, 1)))
    scala> val inputRDD2 = sc.parallelize(List((4, 7)))
    scala> println(inputRDD1.leftOuterJoin(inputRDD2).collect().mkString(","))
    ...
    (4,(1,Some(7))),(1,(8,None)),(3,(5,None)),(2,(4,None))

    3) rightOuterJoin()
    scala> val inputRDD1 = sc.parallelize(List((1, 8), (2, 4), (3, 5), (4, 1)))
    scala> val inputRDD2 = sc.parallelize(List((4, 7)))
    scala> println(inputRDD1.rightOuterJoin(inputRDD2).collect().mkString(","))
    ...
    (4,(Some(1),7))

  • 8/19/2019 2. Spark essentials.pptx

    27/37

    Average by key example

    Define the input:
    scala> val inputRDD = sc.parallelize(List(("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4)))

    Create a key-value structure for each value:
    scala> val kvRDD = inputRDD.mapValues(x => (x, 1))

    Sum the values and the counts per key (group by major key):
    scala> val sumRDD = kvRDD.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
    scala> println(sumRDD.collect().mkString(","))
    ...
    (panda,(1,2)),(pirate,(3,1)),(pink,(7,2))

    scala> println(sumRDD.mapValues(x => x._1 / x._2.toFloat).collect().mkString(","))
    ...
    (panda,0.5),(pirate,3.0),(pink,3.5)

    This is our result: the average value for each key.

  • 8/19/2019 2. Spark essentials.pptx

    28/37

    ReduceByKey. How it works.

  • 8/19/2019 2. Spark essentials.pptx

    29/37

    ReduceByKey. How it works.

  • 8/19/2019 2. Spark essentials.pptx

    30/37

    Parallel execution

  • 8/19/2019 2. Spark essentials.pptx

    31/37

    Parallel execution

    Key concepts:

    1) Every RDD has a fixed number of partitions that determine the degree of parallelism.
    - To find out how many partitions a given RDD contains, run:
    scala> bigRDD.partitions.size
    ...
    res115: Int = 102

    2) By default, the number of partitions of an RDD loaded from HDFS is equal to the number of blocks:
    [cloudera@quickstart ~]$ hdfs fsck /user/hive/warehouse/weblogs/ | grep "Total b[...]
    ...
    Total blocks (validated)[...]
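    The partition count can also be inspected and changed from code. The sketch below assumes Spark 2.x or later in local mode; the name `bigRDD` follows the slide, but its contents and the partition counts 8, 2 and 16 are hypothetical values chosen for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("partition-count-sketch"))

// Hypothetical data set with an explicitly requested number of partitions.
val bigRDD = sc.parallelize(1 to 1000, 8)
val before = bigRDD.partitions.size

// coalesce() lowers the partition count without a full shuffle.
val after = bigRDD.coalesce(2).getNumPartitions

// repartition() can raise the count, at the price of a shuffle.
val more = bigRDD.repartition(16).getNumPartitions

println(s"before=$before, after coalesce=$after, after repartition=$more")

sc.stop()
```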

  • 8/19/2019 2. Spark essentials.pptx

    32/37

    Spark Partitioning

    Data partitioning. Problem

  • 8/19/2019 2. Spark essentials.pptx

    33/37

    Data partitioning. Problem

    Use case:
    - We need to join two datasets periodically (every 10 minutes)
    - userData is a large immutable dataset
    - events is relatively small and new for each join operation

    Every time, the two datasets (the big one and the small one) will be distributed over the network.

  • 8/19/2019 2. Spark essentials.pptx

    34/37

    Data partitioning. Solution

    Solution:
    - Fix some distribution across [...] immutable dataset
    - Redistribute the small dataset [...] according to the distribution of the b[...] query

    To do this, just run over the big on[...]
    val userData = sc.sequenceFile("hdfs[...]
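    Since the slide's code is cut off, here is a hedged sketch of the idea using `partitionBy` and `persist` on an in-memory stand-in. The names `userData` and `events` follow the slides; the sample records, the partition count of 4 and the local master are assumptions made for illustration, and the real example reads `userData` from a sequence file instead.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("partitioned-join-sketch"))

// Hypothetical stand-in for the big, rarely changing dataset: (userId, userName).
val userData = sc.parallelize(List((1, "ann"), (2, "bob"), (3, "cid"), (4, "dee")))
  .partitionBy(new HashPartitioner(4)) // fix the distribution once
  .persist()                           // keep the partitioned copy for reuse

// Hypothetical small batch that is new for every join: (userId, pageVisited).
val events = sc.parallelize(List((1, "/home"), (3, "/cart"), (3, "/pay")))

// The join can reuse userData's partitioner, so only the events side is shuffled.
val joined = userData.join(events)
val samePartitioner = joined.partitioner == userData.partitioner
val result = joined.collect().toSet
println(result)

sc.stop()
```

    Without the `persist()` call the partitioning work would be repeated on every use of `userData`, which would cancel the benefit.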

  • 8/19/2019 2. Spark essentials.pptx

    35/37

    Data partitioning. Key concepts

    The trick explained above is called Spark partitioning:

    - Spark's partitioning is available on all RDDs of key/value pairs
    - Spark does not give explicit control of which worker node each key goes to
    - The program can ensure that a set of keys will appear together on some node
    - If a given RDD is scanned only once, there is no point in partitioning it in advance
    - It is useful only when a dataset is reused multiple times in key-oriented operations such as joins

    Example:

    scala> val pairs = sc.parallelize(List((1, 1), (2, 2), (3, 3)))
    scala> pairs.partitioner
    res132: Option[org.apache.spark.Partitioner] = None
    scala> import org.apache.spark.HashPartitioner
    scala> val partitioned = pairs.partitionBy(new HashPartitioner(2))
    scala> partitioned.partitioner
    res133: Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@2)

  • 8/19/2019 2. Spark essentials.pptx

    36/37

  • 8/19/2019 2. Spark essentials.pptx

    37/37