8/19/2019 2. Spark essentials.pptx
Copyright © 2014 Oracle and/or its affiliates. All rights reserved.
Spark essentials
Alexey Filanovskiy
Cloudera certified developer
Architecture
Storage Layer
- Filesystem (HDFS)
- NoSQL Databases (Oracle NoSQL DB, ...)
Resource Management (YARN, cgroups)
Processing Layer
- MapReduce and Hive
- Spark
- Impala
- Search
- Big Data SQL
[...] of the processing engine over HDFS
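Spark consists of [...]:
- Spark Core. [...] processing data [...]
- MLlib. Extension for machine learning [...]
- GraphX. Extension of core for graph engine [...]
- Spark SQL. Extension of core that allows [...] programs with [...]
- Spark Streaming. [...] could deal with [...]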
Languages for writing programs for Spark: [...]
Spark automatically selects one node of the cluster to run the Driver Program (the main coordinator).
It manages all other processing, which is distributed across the other nodes.
RDD
RDD. Definition
An RDD (Resilient Distributed Dataset) in Spark is simply an immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster.
In other words, an RDD is the input for your Spark jobs.
Spark provides two ways to create RDDs:
- loading an external dataset
- parallelizing a collection in your driver program
RDD. Terminology
[Figure: sample record, partially recovered: {"custId": ...5972, "movieId": ...}]
RDD. Load dataset
The first way to create an RDD is loading an external dataset.
Example:
[cloudera@quickstart ~]$ hadoop fs -cat hdfs...
9
10
...
scala> val inputRDD = sc.textFile("hdfs...
RDD. Define in program
The second way to create an RDD is parallelizing a collection in your driver program.
Example:
scala> val inputRDD = sc.parallelize(List("1 2", "3 4 5", "6 7 8 9", "10"))
scala> println(inputRDD.collect().mkString(" "))
...
1 2 3 4 5 6 7 8 9 10
RDD transformation
Key concepts:
- Transformations are operations on RDDs that return a new RDD.
- Transformations on RDDs are lazily evaluated, meaning that Spark will not begin to execute them until it sees an action.
Example:
val weblog = sc.textFile("hdfs...
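Lazy evaluation can be observed directly. The following is a minimal sketch, not taken from the slides: it assumes Spark 2.x or later on the classpath and a local master, and the object name `LazyEvalSketch` is hypothetical. A `LongAccumulator` counts how many times the function passed to `map()` has actually run, before and after an action.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch (not from the slides): map() does no work until an action runs.
object LazyEvalSketch {
  // Returns (calls before the action, calls after the action, collected result).
  def run(sc: SparkContext): (Long, Long, List[Int]) = {
    val calls = sc.longAccumulator("map calls")
    val input = sc.parallelize(List(1, 2, 3, 4))

    // Transformation only: Spark records the lineage, nothing is executed yet.
    val squared = input.map { x => calls.add(1); x * x }
    val before = calls.sum

    // Action: triggers the actual computation.
    val result = squared.collect().toList
    val after = calls.sum

    (before, after, result)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master with 2 threads is enough for the demo.
    val conf = new SparkConf().setMaster("local[2]").setAppName("lazy-eval-sketch")
    val sc = SparkContext.getOrCreate(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

In a local run without task retries, the counter should stay at 0 until `collect()` is called and read 4 afterwards, one call per element.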
RDD. Common pattern
Pattern when a single RDD is used multiple times:
scala> val input = sc.parallelize(List(1, 2, 3, 4))
scala> val result1 = input.map(x => x * x)
scala> val result2 = input.filter(x => x != 1)
...
scala> println(result1.collect().mkString(","))
1,4,9,16
scala> println(result2.collect().mkString(","))
2,3,4
Without persisting, each action recomputes input from its source.
RDD. Common pattern. Caching
scala> import org.apache.spark.storage.StorageLevel
scala> val input = sc.parallelize(List(1, 2, 3, 4))
scala> input.persist(StorageLevel.MEMORY_ONLY)
scala> val result1 = input.map(x => x * x)
scala> val result2 = input.filter(x => x != 1)
...
scala> println(result1.collect().mkString(","))
1,4,9,16
scala> println(result2.collect().mkString(","))
2,3,4
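The effect of `persist()` is easier to see when the reused RDD has some upstream work attached to it. The sketch below is an illustration under assumptions, not the author's code: the object name `CachingSketch` is hypothetical, Spark 2.x or later and a local master are assumed, and an accumulator counts how often the upstream `map()` function runs across two actions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical sketch (not from the slides): counts upstream recomputation with and without persist().
object CachingSketch {
  // Runs two actions over RDDs derived from `base` and returns how many times
  // the upstream map function was invoked in total.
  def upstreamCalls(sc: SparkContext, cache: Boolean): Long = {
    val calls = sc.longAccumulator("upstream calls")
    val base = sc.parallelize(List(1, 2, 3, 4)).map { x => calls.add(1); x }
    if (cache) base.persist(StorageLevel.MEMORY_ONLY)

    base.map(x => x * x).collect()     // first action
    base.filter(x => x != 1).collect() // second action

    if (cache) base.unpersist()
    calls.sum
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("caching-sketch")
    val sc = SparkContext.getOrCreate(conf)
    try {
      println(s"without persist: ${upstreamCalls(sc, cache = false)} upstream calls")
      println(s"with persist:    ${upstreamCalls(sc, cache = true)} upstream calls")
    } finally sc.stop()
  }
}
```

Assuming the cached partitions fit in memory and no task is retried, the uncached run should invoke the upstream function 8 times (4 elements, 2 actions) and the persisted run 4 times.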
RDD. Useful transformations (Map phase) for single RDD
1) map() - Apply a function to each element in the RDD and return an RDD of the results.
scala> val input = sc.parallelize(List(1, 2, 3, 4))
scala> val result = input.map(x => x * x)
scala> println(result.collect().mkString(","))
1,4,9,16
2) flatMap() - Apply a function to each element in the RDD and return an RDD of the contents of the iterators returned.
scala> val input = sc.parallelize(List("one", "one two", "one two three", "one two three four"))
scala> val result = input.flatMap(x => x.split(" "))
scala> println(result.collect().mkString(","))
...
one,one,two,one,two,three,one,two,three,four
3) filter() - Return an RDD consisting of only the elements that pass the condition passed to filter().
scala> val input = sc.parallelize(List(1, 2, 3, 4))
scala> val result = input.filter(line => line != 1)
scala> println(result.collect().mkString(","))
...
2,3,4
RDD. Map in details
map() function in details. The input file (RDD) contains:
3
4
5
6

map(x => x * x)
This function multiplies each line by itself: map(3 => 3 * 3) => 9
Output: 9, 16, 25, 36

map(x => x + x)
This function adds each line's own value to it: map(3 => 3 + 3) => 6
Output: 6, 8, 10, 12

map(x => x + 1)
This function adds 1 to each line: map(3 => 3 + 1) => 4
Output: 4, 5, 6, 7
RDD. Useful transformations (Map phase) for multiple RDDs
1) union() - Produce an RDD containing elements from both RDDs.
scala> val input1 = sc.parallelize(List(1, 2, 3, 4))
scala> val input2 = sc.parallelize(List(3, 4, 5, 6))
scala> println(input1.union(input2).collect().mkString(","))
...
1,2,3,4,3,4,5,6
2) intersection() - RDD containing only elements found in both RDDs.
scala> val input1 = sc.parallelize(List(1, 2, 3, 4))
scala> val input2 = sc.parallelize(List(3, 4, 5, 6))
scala> println(input1.intersection(input2).collect().mkString(","))
...
4,3
3) cartesian() - Cartesian product with the other RDD.
scala> val input1 = sc.parallelize(List(1, 2, 3, 4))
scala> val input2 = sc.parallelize(List(3, 4, 5, 6))
scala> println(input1.cartesian(input2).collect().mkString(","))
...
(1,3),(1,4),(1,5),(1,6),(2,3),(2,4),(2,5),(2,6),(3,3),(3,4),(3,5),(3,6),(4,3),(4,4),(4,5),(4,6)
RDD actions
RDD. Useful actions (Reduce phase) for multiple RDD
1) count() - Number of elements in the RDD.
scala> val inputRDD = sc.parallelize(List(1, 2, 3, 4, 3, 4))
scala> println(inputRDD.count())
6
2) countByValue() - Number of times each element occurs in the RDD.
scala> val inputRDD = sc.parallelize(List(1, 2, 3, 4, 3, 4))
scala> println(inputRDD.countByValue())
...
Map(4 -> 2, 1 -> 1, 3 -> 2, 2 -> 1)
3) reduce(func) - Combine the elements of the RDD together in parallel.
scala> val inputRDD = sc.parallelize(List(1, 2, 3, 4, 3, 4))
scala> println(inputRDD.reduce((x, y) => x + y))
17
scala> println(inputRDD.reduce((x, y) => x * y))
288
scala> println(inputRDD.reduce((x, y) => x - y))
-15
RDD. Reduce in details
reduce(func) function in details. The input file contains:
3
4
5
6

reduce((x, y) => x - y)
It goes down from the first element to the last. Example:
Initially x=3, y=4 ⇒ 3 - 4 = -1
Then x=-1, y=5 ⇒ -1 - 5 = -6
Then x=-6, y=6 ⇒ -6 - 6 = -12. This is the result.

reduce((x, y) => x + y)
It goes down from the first element to the last. Example:
Initially x=3, y=4 ⇒ 3 + 4 = 7
Then x=7, y=5 ⇒ 7 + 5 = 12
Then x=12, y=6 ⇒ 12 + 6 = 18. This is the result.
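The step-by-step trace above describes a single sequential pass. Spark's `reduce()` is documented as requiring a commutative and associative operator, because each partition is reduced separately and the partial results are merged afterwards. The sketch below is an illustration in plain Scala collections (no Spark needed); the object name `ReduceOrderSketch` and the two-element "partitions" are assumptions made for the demo.

```scala
// Hypothetical sketch (not from the slides): why the operator passed to reduce()
// should be associative. Partitions are simulated with plain Scala lists.
object ReduceOrderSketch {
  val data: List[Int] = List(3, 4, 5, 6)

  // Sequential, left-to-right reduction, as traced on the slide.
  def sequential(f: (Int, Int) => Int): Int = data.reduceLeft(f)

  // Simulated two-partition run: reduce [3, 4] and [5, 6] separately,
  // then merge the two partial results with the same function.
  def twoPartitions(f: (Int, Int) => Int): Int =
    data.grouped(2).map(_.reduceLeft(f)).reduceLeft(f)

  def main(args: Array[String]): Unit = {
    println(sequential(_ - _))    // -12, as on the slide
    println(twoPartitions(_ - _)) // 0: (3 - 4) - (5 - 6)
    println(sequential(_ + _))    // 18
    println(twoPartitions(_ + _)) // 18: addition gives the same answer either way
  }
}
```

Subtraction gives a different answer once the data is split, while addition does not, so results such as -12 or -15 for `x - y` should be read as single-partition outcomes rather than something to rely on in a cluster.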
Pair RDD
Spark provides special operations on RDDs containing key/value pairs. These RDDs are called pair RDDs. They are very useful for group-by-key types of operations.
Creating Pair RDD example:
scala> val inputRDD = sc.parallelize(List("first string word some other", "second string hello"))
scala> val pairs = inputRDD.map(x => (x.split(" ")(0), x))
scala> println(pairs.collect().mkString(","))
(first,first string word some other),(second,second string hello)
Transformation of Pair RDD (over single RDD)
1) reduceByKey() - Combine values with the same key.
scala> val inputRDD = sc.parallelize(List((1, 8), (1, 4), (2, 5), (2, 1)))
scala> println(inputRDD.reduceByKey(_ + _).collect().mkString(","))
(1,12),(2,6)
2) groupByKey() - Group values with the same key.
scala> val inputRDD = sc.parallelize(List((1, 8), (1, 4), (2, 5), (2, 1)))
scala> println(inputRDD.groupByKey().collect().mkString(","))
(1,CompactBuffer(8, 4)),(2,CompactBuffer(5, 1))
Note: avoid this function when the goal is aggregation. It always shuffles the data without a local reduce.
3) mapValues(func) - Apply a function to each value of a pair RDD without changing the key.
scala> val inputRDD = sc.parallelize(List((1, 8), (1, 4), (2, 5), (2, 1)))
scala> println(inputRDD.mapValues(x => x * x).collect().mkString(","))
(1,64),(1,16),(2,25),(2,1)
4) sortByKey() - Return an RDD sorted by the key.
scala> val inputRDD = sc.parallelize(List((1, 8), (2, 4), (1, 5), (2, 1)))
scala> println(inputRDD.sortByKey().collect().mkString(","))
(1,8),(1,5),(2,4),(2,1)
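The note on groupByKey() can be made concrete by computing the same per-key sum both ways. This is a minimal sketch under assumptions (Spark 2.x or later, local master, hypothetical object name `SumByKeySketch`); it only shows that the two approaches agree on the result, while the difference in shuffle volume would have to be checked in the Spark UI.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch (not from the slides): the same per-key sum via reduceByKey and groupByKey.
object SumByKeySketch {
  // Returns (sums via reduceByKey, sums via groupByKey + mapValues).
  def sums(sc: SparkContext): (Map[Int, Int], Map[Int, Int]) = {
    val inputRDD = sc.parallelize(List((1, 8), (1, 4), (2, 5), (2, 1)))

    // reduceByKey combines values inside each partition before shuffling.
    val viaReduce = inputRDD.reduceByKey(_ + _).collect().toMap

    // groupByKey ships every (key, value) pair across the shuffle, then sums.
    val viaGroup = inputRDD.groupByKey().mapValues(_.sum).collect().toMap

    (viaReduce, viaGroup)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("sum-by-key-sketch")
    val sc = SparkContext.getOrCreate(conf)
    try println(sums(sc)) finally sc.stop()
  }
}
```

Results are converted to a `Map` because `collect()` does not guarantee the order of keys after a shuffle.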
Transformation of Pair RDD (over multiple RDDs)
1) join()
scala> val inputRDD1 = sc.parallelize(List((1, 8), (2, 4), (3, 5), (4, 1)))
scala> val inputRDD2 = sc.parallelize(List((4, 7)))
scala> println(inputRDD1.join(inputRDD2).collect().mkString(","))
...
(4,(1,7))
2) leftOuterJoin()
scala> val inputRDD1 = sc.parallelize(List((1, 8), (2, 4), (3, 5), (4, 1)))
scala> val inputRDD2 = sc.parallelize(List((4, 7)))
scala> println(inputRDD1.leftOuterJoin(inputRDD2).collect().mkString(","))
...
(4,(1,Some(7))),(1,(8,None)),(3,(5,None)),(2,(4,None))
3) rightOuterJoin()
scala> val inputRDD1 = sc.parallelize(List((1, 8), (2, 4), (3, 5), (4, 1)))
scala> val inputRDD2 = sc.parallelize(List((4, 7)))
scala> println(inputRDD1.rightOuterJoin(inputRDD2).collect().mkString(","))
...
(4,(Some(1),7))
Average by key example
scala> val inputRDD = sc.parallelize(List(("panda", 0), ("pink", 3), ("pirate", 3), ("pan...
scala> val kvRDD = inputRDD.mapValues(x => (x, 1))
scala> val sumRDD = kvRDD.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
scala> println(sumRDD.collect().mkString(","))
...
(panda,(1,2)),(pirate,(3,1)),(pink,(7,2))
scala> println(sumRDD.mapValues(x => x._1 / x._2.toFloat).collect().mkString(","))
...
(panda,0.5),(pirate,3.0),(pink,3.5) - this is our result: the average value for each key
Callouts:
- mapValues: create a key-value structure for each value
- reduceByKey: sum the values and the counts (group by major key)
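The input list on this slide is cut off, so the following is a self-contained sketch of the same three steps rather than a reproduction of the author's session. The last two input pairs, `("panda", 1)` and `("pink", 4)`, are assumptions chosen to be consistent with the sums and averages printed above; the object name `AverageByKeySketch` is hypothetical, and Spark 2.x or later with a local master is assumed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch: average by key, following the steps shown on the slide.
object AverageByKeySketch {
  def averages(sc: SparkContext): Map[String, Float] = {
    // Assumption: the truncated input continues with ("panda", 1) and ("pink", 4).
    val inputRDD = sc.parallelize(
      List(("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4)))

    // Step 1: turn each value into a (value, 1) pair so counts can be summed too.
    val kvRDD = inputRDD.mapValues(x => (x, 1))

    // Step 2: per key, add up the values and the counts.
    val sumRDD = kvRDD.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))

    // Step 3: divide each sum by its count.
    sumRDD.mapValues(x => x._1 / x._2.toFloat).collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("average-by-key-sketch")
    val sc = SparkContext.getOrCreate(conf)
    try println(averages(sc)) finally sc.stop()
  }
}
```

Carrying the count alongside the sum lets the whole aggregation run through reduceByKey, which combines values locally before the shuffle.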
ReduceByKey. How it works.
Parallel execution
Key concepts:
1) Every RDD has a fixed number of partitions that determine the degree of parallelism [...]
- To find out how many partitions a given RDD contains, run:
scala> bigRDD.partitions.size
...
res115: Int = 102
2) By default, for an RDD loaded from HDFS, the number of partitions equals the number of blocks:
[cloudera@quickstart ~]$ hdfs fsck /user/hive/warehouse/weblogs/ | grep "Total b...
...
Total blocks (validated)...
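The partition count can also be inspected and changed without an HDFS file at hand. The sketch below is an illustration under assumptions (Spark 2.x or later, local master, hypothetical object name `PartitionCountSketch`); it uses `parallelize` with an explicit number of slices instead of the `bigRDD` from the slide.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch (not from the slides): inspecting and changing the number of partitions.
object PartitionCountSketch {
  // Returns (original count, count after repartition, count after coalesce).
  def counts(sc: SparkContext): (Int, Int, Int) = {
    val rdd = sc.parallelize(1 to 100, 4) // ask for 4 partitions explicitly
    val widened = rdd.repartition(8)      // full shuffle into 8 partitions
    val narrowed = rdd.coalesce(2)        // merge down to 2 without a full shuffle
    (rdd.partitions.length, widened.getNumPartitions, narrowed.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("partition-count-sketch")
    val sc = SparkContext.getOrCreate(conf)
    try println(counts(sc)) finally sc.stop()
  }
}
```

`partitions.length` and `getNumPartitions` report the same number; the second form is simply shorter.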
Spark Partitioning
Data partitioning. Problem
Case:
We need to join two datasets periodically (every 10 minutes [...]):
- userData is a large, immutable [...] dataset
- events is relatively small [...] and new for each join operation
Every time, the two datasets (the big one [...]) will be distributed over the network.
Data partitioning. Solution
Solution:
- Fix some distribution across [...] immutable dataset
- Redistribute the small dataset [...] according to the distribution of the big [...] query
To do this, just run over the big one [...]:
val userData = sc.sequenceFile("hdfs...
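Because the slide's code is cut off, the following is only a sketch of the general technique it points at: partition the large dataset once with a known partitioner, persist it, and join the small dataset against it. Everything concrete here is an assumption made for illustration: the in-memory stand-in data, the choice of `HashPartitioner(4)`, and the object name `PrePartitionedJoinSketch`. Spark 2.x or later and a local master are assumed.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical sketch (not the author's code): pre-partition and persist the big dataset before joining.
object PrePartitionedJoinSketch {
  // Returns (whether userData has a known partitioner, the join result).
  def run(sc: SparkContext): (Boolean, Map[Int, (String, String)]) = {
    // Stand-in for the large, rarely changing dataset (hypothetical values).
    val userData = sc.parallelize(List((1, "alice"), (2, "bob"), (3, "carol")))
      .partitionBy(new HashPartitioner(4)) // fix the key distribution once
      .persist()                           // keep the partitioned layout for reuse

    // Stand-in for the small batch that is new for each join.
    val events = sc.parallelize(List((1, "login"), (3, "click")))

    val joined = userData.join(events).collect().toMap
    (userData.partitioner.isDefined, joined)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("pre-partitioned-join-sketch")
    val sc = SparkContext.getOrCreate(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

Without `persist()`, the `partitionBy` shuffle would typically be repeated every time `userData` is used, which would defeat the purpose of partitioning it in advance.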
Data partitioning. Key concepts
The trick explained above is called Spark partitioning:
- Spark's partitioning is available on all RDDs of key/value pairs
- Spark does not give explicit control of which worker node each key goes to
- The program can ensure that a set of keys will appear together on some node
- If a given RDD is scanned only once, there is no point in partitioning it in advance
- It is useful only when a dataset is reused multiple times in key-oriented operations such as joins
Example:
scala> val pairs = sc.parallelize(List((1, 1), (2, 2), (3, 3)))
scala> pairs.partitioner
res132: Option[org.apache.spark.Partitioner] = None
scala> import org.apache.spark.HashPartitioner
scala> val partitioned = pairs.partitionBy(new HashPartitioner(2))
scala> partitioned.partitioner
res133: Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@2)