Other Map-Reduce (ish) Frameworks William Cohen 1.
![Page 1: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/1.jpg)
1
Other Map-Reduce (ish) Frameworks
William Cohen
![Page 2: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/2.jpg)
2
Outline
• More concise languages for map-reduce pipelines
• Abstractions built on top of map-reduce
– General comments
– Specific systems
• Cascading, Pipes
• PIG, Hive
• Spark, Flink
![Page 3: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/3.jpg)
3
Y: Y=Hadoop+X or Hadoop~=Y
• What else are people using?
– instead of Hadoop
– on top of Hadoop
![Page 4: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/4.jpg)
4
Issues with Hadoop
• Too much typing
– programs are not concise
• Too low level
– missing abstractions
– hard to specify a workflow
• Not well suited to iterative operations
– E.g., E/M, k-means clustering, …
– Workflow and memory-loading issues
![Page 5: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/5.jpg)
5
STREAMING AND MRJOB: MORE CONCISE MAP-REDUCE PIPELINES
![Page 6: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/6.jpg)
6
Hadoop streaming
• start with stream & sort pipeline
cat data | mapper.py | sort -k1,1 | reducer.py
• run with hadoop streaming instead
bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
  -file mapper.py -file reducer.py \
  -mapper mapper.py -reducer reducer.py \
  -input /hdfs/path/to/inputDir \
  -output /hdfs/path/to/outputDir \
  -mapred.map.tasks=20 -mapred.reduce.tasks=20
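The slide does not show the contents of mapper.py and reducer.py. A minimal sketch of what they might do for word count follows, assuming tab-separated key/value records on each line. The function names and the in-process `stream_and_sort` helper are illustrative stand-ins for the shell pipeline, not part of Hadoop.

```python
import itertools

def mapper(lines):
    """Emit one tab-separated (word, 1) record per token, as a mapper.py might."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"

def reducer(sorted_records):
    """Sum the counts of adjacent records that share a key, as a reducer.py might."""
    pairs = (rec.split("\t") for rec in sorted_records)
    for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

def stream_and_sort(lines):
    """Local stand-in for: cat data | mapper.py | sort -k1,1 | reducer.py"""
    by_key = sorted(mapper(lines), key=lambda rec: rec.split("\t")[0])
    return list(reducer(by_key))

counts = stream_and_sort(["a b a", "b c"])
```

The reducer relies only on equal keys being adjacent, which is what the sort step (or Hadoop's shuffle) guarantees.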
![Page 7: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/7.jpg)
7
mrjob word count
• Python level over map-reduce – very concise
• Can run locally in Python
• Allows a single job or a linear chain of steps
![Page 8: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/8.jpg)
8
mrjob most freq word
![Page 9: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/9.jpg)
9
MAP-REDUCE ABSTRACTIONS
![Page 10: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/10.jpg)
10
Abstractions On Top Of Hadoop
• MRJob and other tools to make Hadoop pipelines more concise (Dumbo, …) still keep the same basic language of map-reduce jobs
• How else can we express these sorts of computations? Are there some common special cases of map-reduce steps we can parameterize and reuse?
![Page 11: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/11.jpg)
11
Abstractions On Top Of Hadoop
• Some obvious streaming processes:
– for each row in a table
• Transform it and output the result
• Decide if you want to keep it with some boolean test, and copy out only the ones that pass the test

Example: stem words in a stream of word-count pairs:
(“aardvarks”,1) → (“aardvark”,1)
Proposed syntax:
table2 = MAP table1 TO λ row : f(row), where f(row) → row’

Example: apply stop words
(“aardvark”,1) → (“aardvark”,1)
(“the”,1) → deleted
Proposed syntax:
table2 = FILTER table1 BY λ row : f(row), where f(row) ∈ {true,false}
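The proposed MAP and FILTER operations can be sketched in plain Python as below. The helper names, the toy stemmer (which only strips a trailing "s"), and the stop-word list are assumptions made for illustration.

```python
def map_table(table, f):
    """table2 = MAP table1 TO f: exactly one output row per input row."""
    return [f(row) for row in table]

def filter_table(table, f):
    """table2 = FILTER table1 BY f: keep only rows for which f(row) is true."""
    return [row for row in table if f(row)]

def toy_stem(row):
    # Hypothetical stemmer: strips a trailing "s". A real stemmer does far more.
    word, count = row
    return (word[:-1] if word.endswith("s") else word, count)

STOP_WORDS = {"the", "a", "an"}  # illustrative stop list

table1 = [("aardvarks", 1), ("the", 1), ("zymurgy", 1)]
stemmed = map_table(table1, toy_stem)
kept = filter_table(stemmed, lambda row: row[0] not in STOP_WORDS)
```

Both operations look at one row at a time, so they need no sort or shuffle and can run as map-only streaming steps.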
![Page 12: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/12.jpg)
12
Abstractions On Top Of Hadoop
• A non-obvious(?) streaming process:
– for each row in a table
• Transform it to a list of items
• Splice all the lists together to get the output table (flatten)

Example: tokenizing a line
“I found an aardvark” → [“i”, “found”, “an”, “aardvark”]
“We love zymurgy” → [“we”, “love”, “zymurgy”]
…but the final table is one word per row:
“i”
“found”
“an”
“aardvark”
“we”
“love”
…
Proposed syntax:
table2 = FLATMAP table1 TO λ row : f(row), where f(row) → list of rows
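A sketch of FLATMAP under the same conventions: `f` returns a list for each row, and the lists are spliced into one flat table. The helper names are illustrative.

```python
from itertools import chain

def flatmap_table(table, f):
    """table2 = FLATMAP table1 TO f: f returns a list per row; splice the lists together."""
    return list(chain.from_iterable(f(row) for row in table))

def tokenize(line):
    # Simple whitespace tokenizer with lowercasing, matching the slide's example.
    return line.lower().split()

lines = ["I found an aardvark", "We love zymurgy"]
words = flatmap_table(lines, tokenize)
```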
![Page 13: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/13.jpg)
13
Abstractions On Top Of Hadoop
• Another example from the Naïve Bayes test program…
![Page 14: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/14.jpg)
NB Test Step (Can we do better?)

Event counts:

| Event | Count |
| --- | --- |
| X=w1^Y=sports | 5245 |
| X=w1^Y=worldNews | 1054 |
| X=.. | 2120 |
| X=w2^Y=… | 373 |
| X=… | … |

How:
• Stream and sort:
– for each C[X=w^Y=y]=n, print “w C[Y=y]=n”
– sort and build a list of values associated with each key w
Like an inverted index:

| w | Counts associated with w |
| --- | --- |
| aardvark | C[w^Y=sports]=2 |
| agent | C[w^Y=sports]=1027, C[w^Y=worldNews]=564 |
| … | … |
| zynga | C[w^Y=sports]=21, C[w^Y=worldNews]=4464 |
![Page 15: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/15.jpg)
NB Test Step 1 (Can we do better?)
The general case: we’re taking rows from a table
• In a particular format: (event,count)
Applying a function to get a new value
• The word for the event
And grouping the rows of the table by this new value

This is a grouping operation, a special case of a map-reduce.

Proposed syntax:
GROUP table BY λ row : f(row), where f(row) → field
Could define f via: a function, a field of a defined record structure, …
![Page 16: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/16.jpg)
NB Test Step 1 (Can we do better?)
Aside: you guys know how to implement this, right?
1. Output pairs (f(row),row) with a map/streaming process
2. Sort pairs by key – which is f(row)
3. Reduce and aggregate by appending together all the values associated with the same key
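The three steps above can be sketched in Python as follows. `group_table` and `word_of` are illustrative names, and the event strings follow the X=w^Y=y format used in the event-count table.

```python
from itertools import groupby
from operator import itemgetter

def group_table(table, f):
    """GROUP table BY f, implemented as map, sort, reduce."""
    # 1. Map: output (f(row), row) pairs.
    pairs = [(f(row), row) for row in table]
    # 2. Sort the pairs by key, which is f(row).
    pairs.sort(key=itemgetter(0))
    # 3. Reduce: append together all the rows that share a key.
    return [(key, [row for _, row in grp])
            for key, grp in groupby(pairs, key=itemgetter(0))]

def word_of(row):
    """Extract the word w from an event string such as 'X=agent^Y=sports'."""
    event, _count = row
    return event.split("^")[0].split("=", 1)[1]

event_counts = [
    ("X=agent^Y=sports", 1027),
    ("X=aardvark^Y=sports", 2),
    ("X=agent^Y=worldNews", 564),
]
by_word = group_table(event_counts, word_of)
```

Each output row is a key plus the list of input rows that map to it, which is the inverted-index shape shown in the table of counts per word.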
![Page 17: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/17.jpg)
17
Abstractions On Top Of Hadoop
• And another example from the Naïve Bayes test program…
![Page 18: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/18.jpg)
Request-and-answer

Test data:
id1 w1,1 w1,2 w1,3 …. w1,k1
id2 w2,1 w2,2 w2,3 ….
id3 w3,1 w3,2 ….
id4 w4,1 w4,2 …
id5 w5,1 w5,2 …
...

Record of all event counts for each word:

| w | Counts associated with w |
| --- | --- |
| aardvark | C[w^Y=sports]=2 |
| agent | C[w^Y=sports]=1027, C[w^Y=worldNews]=564 |
| … | … |
| zynga | C[w^Y=sports]=21, C[w^Y=worldNews]=4464 |

Step 2: stream through and for each test case
idi wi,1 wi,2 wi,3 …. wi,ki
request the event counters needed to classify idi from the event-count DB, then classify using the answers

Classification logic
![Page 19: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/19.jpg)
Request-and-answer
• Break down into stages
– Generate the data being requested (indexed by key, here a word)
• E.g., with group … by
– Generate the requests as (key, requestor) pairs
• E.g., with flatmap … to
– Join these two tables by key
• Join defined as (1) cross-product and (2) filter out pairs with different values for keys
• This replaces the step of concatenating two different tables of key-value pairs, and reducing them together
– Postprocess the joined result
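The stages above can be sketched end to end in Python. This is an assumption-laden toy: the counter values reuse numbers from the slides, the test documents are made up, and the join is written as a hash join, which gives the same result as the cross-product-then-filter definition without forming the cross-product.

```python
from collections import defaultdict

def join_by_key(left, right):
    """Inner join of two lists of (key, value) pairs on key."""
    index = defaultdict(list)
    for key, value in left:
        index[key].append(value)
    return [(key, lv, rv) for key, rv in right for lv in index.get(key, [])]

# Stage 1: the data being requested, indexed by word (e.g. built with GROUP ... BY).
counters = [("aardvark", {"sports": 2}),
            ("zynga", {"sports": 21, "worldNews": 4464})]

# Stage 2: requests as (key, requestor) pairs (e.g. built with FLATMAP ... TO).
test_docs = [("id1", ["found", "aardvark", "zynga"]), ("id2", ["zynga"])]
requests = [(w, doc_id) for doc_id, ws in test_docs for w in ws]

# Stage 3: join the two tables by key.
joined = join_by_key(counters, requests)

# Stage 4: postprocess, here by regrouping the answers under each requestor.
answers = defaultdict(dict)
for word, ctr, doc_id in joined:
    answers[doc_id][word] = ctr
```

Because this is an inner join, a word with no counters (here "found") produces no answer row; the classifier has to treat a missing answer as a zero count.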
![Page 20: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/20.jpg)
| w | Counters |
| --- | --- |
| aardvark | C[w^Y=sports]=2 |
| agent | C[w^Y=sports]=1027, C[w^Y=worldNews]=564 |
| … | … |
| zynga | C[w^Y=sports]=21, C[w^Y=worldNews]=4464 |

| w | Counters | Requests |
| --- | --- | --- |
| aardvark | C[w^Y=sports]=2 | ~ctr to id1 |
| agent | C[w^Y=sports]=… | ~ctr to id345 |
| agent | C[w^Y=sports]=… | ~ctr to id9854 |
| agent | C[w^Y=sports]=… | ~ctr to id345 |
| … | C[w^Y=sports]=… | ~ctr to id34742 |
| zynga | C[…] | ~ctr to id1 |
| zynga | C[…] | … |

| w | Request |
| --- | --- |
| found | ~ctr to id1 |
| aardvark | ~ctr to id1 |
| … | |
| zynga | ~ctr to id1 |
| … | ~ctr to id2 |
![Page 21: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/21.jpg)
| w | Counters |
| --- | --- |
| aardvark | C[w^Y=sports]=2 |
| agent | C[w^Y=sports]=1027, C[w^Y=worldNews]=564 |
| … | … |
| zynga | C[w^Y=sports]=21, C[w^Y=worldNews]=4464 |

| w | Counters | Requests |
| --- | --- | --- |
| aardvark | C[w^Y=sports]=2 | id1 |
| agent | C[w^Y=sports]=… | id345 |
| agent | C[w^Y=sports]=… | id9854 |
| agent | C[w^Y=sports]=… | id345 |
| … | C[w^Y=sports]=… | id34742 |
| zynga | C[…] | id1 |
| zynga | C[…] | … |

| w | Request |
| --- | --- |
| found | id1 |
| aardvark | id1 |
| … | |
| zynga | id1 |
| … | id2 |
![Page 22: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/22.jpg)
22
MAP-REDUCE ABSTRACTIONS: CASCADING, PIPES, SCALDING
![Page 23: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/23.jpg)
23
Y: Y=Hadoop+X
• Cascading
– Java library for map-reduce workflows
– Also some library operations for common mappers/reducers
![Page 24: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/24.jpg)
24
Cascading WordCount Example
Input format
Output format: pairs
Bind to HFS path
Bind to HFS path
A pipeline of map-reduce jobs
Append a step: apply function to the “line” field
Append step: group a (flattened) stream of “tuples”
Replace line with bag of words
Append step: aggregate grouped values
Run the pipeline
![Page 25: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/25.jpg)
Cascading WordCount Example
Is this inefficient? We explicitly form a group for each word, and then count the elements…?
We could be saved by careful optimization: we know we don’t need the GroupBy intermediate result when we run the assembly….
Many of the Hadoop abstraction levels have a similar flavor:
• Define a pipeline of tasks declaratively
• Optimize it automatically
• Run the final result
The key question: does the system successfully hide the details from you?
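To make the efficiency concern concrete, here is a small Python sketch contrasting the two strategies: materializing a group per word and then counting it, versus aggregating as records stream by (the effect an optimizer or combiner aims for). This illustrates the idea only; it is not Cascading code, and the function names are made up.

```python
from collections import Counter, defaultdict

def count_via_groups(words):
    """Naive plan: build the full group for each word, then count its elements."""
    groups = defaultdict(list)
    for w in words:
        groups[w].append(1)          # memory grows with the number of records
    return {w: len(ones) for w, ones in groups.items()}

def count_streaming(words):
    """Optimized plan: keep one running total per word, never the group itself."""
    return dict(Counter(words))      # memory grows with the number of distinct words

words = ["a", "b", "a"]
same_answer = count_via_groups(words) == count_streaming(words)
```

Both plans produce the same table, so a system that sees the whole pipeline before running it is free to pick the cheaper one.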
![Page 26: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/26.jpg)
26
Y: Y=Hadoop+X
• Cascading
– Java library for map-reduce workflows
• expressed as “Pipe”s, to which you add Each, Every, GroupBy, …
– Also some library operations for common mappers/reducers
• e.g. RegexGenerator
– Turing-complete since it’s an API for Java
• Pipes
– C++ library for map-reduce workflows on Hadoop
• Scalding
– More concise Scala library based on Cascading
![Page 27: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/27.jpg)
27
MORE DECLARATIVE LANGUAGES
![Page 28: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/28.jpg)
28
Hive and PIG: word count
• Declarative ….. Fairly stable
A PIG program is a bunch of assignments where every LHS is a relation.
No loops, conditionals, etc. allowed.
![Page 29: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/29.jpg)
29
More on Pig
• Pig Latin
– atomic types + compound types like tuple, bag, map
– execute locally/interactively or on hadoop
• can embed Pig in Java (and Python and …)
• can call out to Java from Pig
• Similar (ish) system from Microsoft: DryadLinq
![Page 30: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/30.jpg)
30
Tokenize – built-in function
Flatten – special keyword, which applies to the next step in the process, i.e., foreach is transformed from a MAP to a FLATMAP
![Page 31: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/31.jpg)
31
PIG Features
• LOAD ‘hdfs-path’ AS (schema)
– schemas can include int, double, bag, map, tuple, …
• FOREACH alias GENERATE … AS …, …
– transforms each row of a relation
• DESCRIBE alias / ILLUSTRATE alias -- debugging
• GROUP alias BY …
• FOREACH alias GENERATE group, SUM(….)
– GROUP/GENERATE … aggregate op together act like a map-reduce
• JOIN r BY field, s BY field, …
– inner join to produce rows: r::f1, r::f2, … s::f1, s::f2, …
• CROSS r, s, …
– use with care unless all but one of the relations are singleton
• User defined functions as operators
– also for loading, aggregates, …

PIG parses and optimizes a sequence of commands before it executes them.
It’s smart enough to turn GROUP … FOREACH … SUM … into a map-reduce.
![Page 32: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/32.jpg)
32
ANOTHER EXAMPLE: COMPUTING TFIDF IN PIG LATIN
![Page 33: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/33.jpg)
33
(docid,token) → (docid,token,tf(token in doc))
(docid,token,tf) → (docid,token,tf,length(doc))
(docid,token,tf,n) → (…,tf/n)
(docid,token,tf,n,tf/n) → (…,df)
ndocs.total_docs: relation-to-scalar casting
(docid,token,tf,n,tf/n) → (docid,token,tf/n * idf)
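The same chain of transformations can be sketched in Python. The Pig script itself is not in the transcript, so this follows only the annotations above, and it assumes idf = log(ndocs / df), which the slide does not spell out.

```python
import math
from collections import Counter

def tfidf(doc_tokens):
    """doc_tokens: list of (docid, token) pairs. Returns {(docid, token): weight}."""
    tf = Counter(doc_tokens)                             # (docid, token) -> tf
    doc_len = Counter(docid for docid, _ in doc_tokens)  # docid -> n = length(doc)
    df = Counter(token for _, token in tf)               # token -> number of docs containing it
    ndocs = len(doc_len)                                 # scalar, like ndocs.total_docs
    return {
        (docid, token): (count / doc_len[docid]) * math.log(ndocs / df[token])
        for (docid, token), count in tf.items()
    }

pairs = [("d1", "a"), ("d1", "a"), ("d1", "b"), ("d2", "a")]
weights = tfidf(pairs)
```

A token that occurs in every document gets weight 0 under this idf, which is the intended behavior for uninformative terms.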
![Page 34: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/34.jpg)
34
Other PIG features
• …
• Macros, nested queries,
• FLATTEN “operation”
– transforms a bag or a tuple into its individual elements
– this transform affects the next level of the aggregate
• STREAM and DEFINE … SHIP
DEFINE myfunc `python myfun.py` SHIP('myfun.py')
…
r = STREAM s THROUGH myfunc AS (…);
![Page 35: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/35.jpg)
35
TF-IDF in PIG - another version
![Page 36: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/36.jpg)
36
Issues with Hadoop
• Too much typing
– programs are not concise
• Too low level
– missing abstractions
– hard to specify a workflow
• Not well suited to iterative operations
– E.g., E/M, k-means clustering, …
– Workflow and memory-loading issues

First: an iterative algorithm in Pig Latin
![Page 37: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/37.jpg)
37
Julien Le Dem - Yahoo
How to use loops, conditionals, etc?
Embed PIG in a real programming language.
![Page 38: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/38.jpg)
38
![Page 39: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/39.jpg)
39
An example from Ron Bekkerman
![Page 40: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/40.jpg)
40
Example: k-means clustering
• An EM-like algorithm:
• Initialize k cluster centroids
• E-step: associate each data instance with the closest centroid
– Find expected values of cluster assignments given the data and centroids
• M-step: recalculate centroids as an average of the associated data instances
– Find new centroids that maximize that expectation
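A single-machine sketch of the E-step/M-step loop, in plain Python. It assumes squared Euclidean distance, points given as tuples, a fixed number of iterations, and that an empty cluster keeps its old centroid; none of these choices come from the slides.

```python
def closest(point, centroids):
    """Index of the centroid nearest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])))

def kmeans(points, centroids, iterations=10):
    """Run a fixed number of E/M iterations; iterations must be at least 1."""
    for _ in range(iterations):
        # E-step: associate each data instance with the closest centroid.
        assign = [closest(p, centroids) for p in points]
        # M-step: recalculate each centroid as the average of its instances.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                new_centroids.append(
                    tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                new_centroids.append(old)  # assumption: empty cluster stays put
        centroids = new_centroids
    return centroids, assign

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
final_centroids, final_assign = kmeans(points, [(0.0, 0.0), (10.0, 0.0)])
```

Every iteration scans all of the data, which is the property that matters once the data no longer fits on one machine.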
![Page 41: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/41.jpg)
41
k-means Clustering
centroids
![Page 42: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/42.jpg)
42
Parallelizing k-means
![Page 43: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/43.jpg)
43
Parallelizing k-means
![Page 44: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/44.jpg)
44
Parallelizing k-means
![Page 45: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/45.jpg)
45
k-means on MapReduce
• Mappers read data portions and centroids
• Mappers assign data instances to clusters
• Mappers compute new local centroids and local cluster sizes
• Reducers aggregate local centroids (weighted by local cluster sizes) into new global centroids
• Reducers write the new centroids
Panda et al, Chapter 2
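One iteration of this scheme can be sketched in Python, with each data partition handled by a "mapper" function and the results merged by a "reducer" function. The function names and in-memory data layout are assumptions for illustration; a real job would read partitions and centroids from HDFS.

```python
def kmeans_mapper(partition, centroids):
    """Return {cluster: (local_centroid, local_size)} for one data partition."""
    sums, sizes = {}, {}
    for point in partition:
        j = min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))
        prev = sums.get(j, (0.0,) * len(point))
        sums[j] = tuple(s + p for s, p in zip(prev, point))
        sizes[j] = sizes.get(j, 0) + 1
    return {j: (tuple(s / sizes[j] for s in sums[j]), sizes[j]) for j in sums}

def kmeans_reducer(local_results, centroids):
    """Combine local centroids, weighted by local cluster size, into global ones."""
    new_centroids = list(centroids)
    for j in range(len(centroids)):
        parts = [res[j] for res in local_results if j in res]
        total = sum(size for _, size in parts)
        if total:  # a cluster that received no points keeps its old centroid
            dim = len(centroids[j])
            new_centroids[j] = tuple(
                sum(c[d] * size for c, size in parts) / total for d in range(dim))
    return new_centroids

def kmeans_iteration(partitions, centroids):
    return kmeans_reducer([kmeans_mapper(p, centroids) for p in partitions], centroids)

partitions = [[(0.0,), (2.0,)], [(4.0,), (10.0,)]]
updated = kmeans_iteration(partitions, [(0.0,), (10.0,)])
```

Only k local centroids and sizes leave each mapper, so the shuffle is small; the expensive part is that every iteration rereads all the partitions.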
![Page 46: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/46.jpg)
46
k-means in Apache Pig: input data
• Assume we need to cluster documents
– Stored in a 3-column table D:

| Document | Word | Count |
| --- | --- | --- |
| doc1 | Carnegie | 2 |
| doc1 | Mellon | 2 |

• Initial centroids are k randomly chosen docs
– Stored in table C in the same format as above
![Page 47: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/47.jpg)
47
k-means in Apache Pig: E-step

For each document d, pick the centroid c given by

$$\arg\max_{c} \frac{\sum_{w} i_{w}^{d}\, i_{w}^{c}}{\sqrt{\sum_{w} \left(i_{w}^{c}\right)^{2}}}$$

(in the code, id is $i_{w}^{d}$ and ic is $i_{w}^{c}$)

D_C = JOIN C BY w, D BY w;
PROD = FOREACH D_C GENERATE d, c, id * ic AS idic;
PRODg = GROUP PROD BY (d, c);
DOT_PROD = FOREACH PRODg GENERATE d, c, SUM(idic) AS dXc;
SQR = FOREACH C GENERATE c, ic * ic AS ic2;
SQRg = GROUP SQR BY c;
LEN_C = FOREACH SQRg GENERATE c, SQRT(SUM(ic2)) AS lenc;
DOT_LEN = JOIN LEN_C BY c, DOT_PROD BY c;
SIM = FOREACH DOT_LEN GENERATE d, c, dXc / lenc;
SIMg = GROUP SIM BY d;
CLUSTERS = FOREACH SIMg GENERATE TOP(1, 2, SIM);
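For readers who want to trace the E-step without a Pig installation, here is a Python sketch that mirrors the same joins and groupings on small in-memory tables. The row layouts (d, w, id) for D and (c, w, ic) for C follow the column names used in the script; the sample rows are made up.

```python
import math
from collections import defaultdict

def e_step(D, C):
    """D: rows (d, w, i_d); C: rows (c, w, i_c). Returns {d: most similar c}."""
    # JOIN C BY w, D BY w; then GROUP BY (d, c) and SUM the products (DOT_PROD).
    c_by_word = defaultdict(list)
    for c, w, ic in C:
        c_by_word[w].append((c, ic))
    dot = defaultdict(float)
    for d, w, i_d in D:
        for c, ic in c_by_word[w]:
            dot[(d, c)] += i_d * ic
    # GROUP SQR BY c; SQRT(SUM(ic2)) gives each centroid's length (LEN_C).
    sq = defaultdict(float)
    for c, _w, ic in C:
        sq[c] += ic * ic
    # SIM = dXc / lenc; TOP(1, 2, SIM) keeps the best centroid per document.
    best = {}
    for (d, c), dxc in dot.items():
        sim = dxc / math.sqrt(sq[c])
        if d not in best or sim > best[d][1]:
            best[d] = (c, sim)
    return {d: c for d, (c, _sim) in best.items()}

D = [("doc1", "Carnegie", 2), ("doc1", "Mellon", 2), ("doc2", "Pig", 3)]
C = [("c1", "Carnegie", 1), ("c1", "Mellon", 1), ("c2", "Pig", 1), ("c2", "Mellon", 1)]
clusters = e_step(D, C)
```

As with the inner join in the script, a document that shares no word with any centroid does not appear in the output.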
![Page 53: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/53.jpg)
53
k-means in Apache Pig: M-step
D_C_W = JOIN CLUSTERS BY d, D BY d;
D_C_Wg = GROUP D_C_W BY (c, w);
SUMS = FOREACH D_C_Wg GENERATE c, w, SUM(id) AS sum;
D_C_Wgg = GROUP D_C_W BY c;
SIZES = FOREACH D_C_Wgg GENERATE c, COUNT(D_C_W) AS size;
SUMS_SIZES = JOIN SIZES BY c, SUMS BY c;
C = FOREACH SUMS_SIZES GENERATE c, w, sum / size AS ic;

Finally - embed in Java (or Python or ….) to do the looping
![Page 54: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/54.jpg)
54
The problem with k-means in Hadoop: I/O costs
![Page 55: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/55.jpg)
55
Data is read, and model is written, with every iteration
• Mappers read data portions and centroids
• Mappers assign data instances to clusters
• Mappers compute new local centroids and local cluster sizes
• Reducers aggregate local centroids (weighted by local cluster sizes) into new global centroids
• Reducers write the new centroids
Panda et al, Chapter 2
![Page 56: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/56.jpg)
56
SCHEMES DESIGNED FOR ITERATIVE HADOOP PROGRAMS: SPARK AND FLINK
![Page 57: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/57.jpg)
57
Spark word count example
• Research project, based on Scala and Hadoop
• Now APIs in Java and Python as well
• Familiar-looking API for abstract operations (map, flatMap, reduceByKey, …)
• Most API calls are “lazy”, i.e., counts is a data structure defining a pipeline, not a materialized table.
• Includes ability to store a sharded dataset in cluster memory as an RDD (resilient distributed dataset)
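The idea of a lazy API can be illustrated with a toy Python class. This is NOT Spark's API: `LazyPipeline`, `flat_map`, `reduce_by_key` and `collect` are made-up names that only imitate the shape of the calls. Each method records a step and returns a new pipeline; nothing runs until `collect()` is called.

```python
class LazyPipeline:
    """Toy stand-in for a lazily evaluated dataset. This is NOT Spark's API."""

    def __init__(self, source, steps=()):
        self.source, self.steps = source, tuple(steps)

    def _then(self, step):
        # Record the step; do not run it.
        return LazyPipeline(self.source, self.steps + (step,))

    def map(self, f):
        return self._then(lambda rows: (f(r) for r in rows))

    def flat_map(self, f):
        return self._then(lambda rows: (x for r in rows for x in f(r)))

    def reduce_by_key(self, f):
        def step(rows):
            acc = {}
            for k, v in rows:
                acc[k] = f(acc[k], v) if k in acc else v
            return iter(acc.items())
        return self._then(step)

    def collect(self):
        # Only here is the recorded pipeline actually executed.
        rows = iter(self.source)
        for step in self.steps:
            rows = step(rows)
        return list(rows)

counts = (LazyPipeline(["a b a", "b c"])
          .flat_map(str.split)
          .map(lambda w: (w, 1))
          .reduce_by_key(lambda x, y: x + y))
result = dict(counts.collect())
```

Because `counts` is only a description of the work, a system with this design can inspect or rearrange the steps before running them, and can choose to keep an intermediate result in memory for reuse across iterations.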
![Page 58: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/58.jpg)
58
Spark logistic regression example
![Page 59: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/59.jpg)
59
Spark logistic regression example• Allows caching data in memory
![Page 60: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/60.jpg)
60
Spark logistic regression example
![Page 61: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/61.jpg)
61
FLINK
• Recent Apache Project – just moved to top-level at 0.8 – formerly Stratosphere….
![Page 62: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/62.jpg)
62
FLINK
• Apache Project – just getting started….
Java API
![Page 63: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/63.jpg)
63
FLINK
![Page 64: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/64.jpg)
64
FLINK
• Like Spark, in-memory or on disk
• Everything is a Java object
• Unlike Spark, contains operations for iteration
– Allowing query optimization
• Very easy to use and install in local mode
– Very modular
– Only needs Java
![Page 65: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/65.jpg)
65
MORE EXAMPLES IN PIG
![Page 66: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/66.jpg)
66
Phrase Finding in PIG
![Page 67: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/67.jpg)
67
Phrase Finding 1 - loading the input
![Page 68: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/68.jpg)
68
…
![Page 69: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/69.jpg)
69
PIG Features
• comments -- like this /* or like this */
• ‘shell-like’ commands:
– fs -ls … -- any hadoop fs … command
– some shortcuts: ls, cp, …
– sh ls -al -- escape to shell
![Page 70: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/70.jpg)
70
…
![Page 71: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/71.jpg)
71
PIG Features
• comments -- like this /* or like this */
• ‘shell-like’ commands:
– fs -ls … -- any hadoop fs … command
– some shortcuts: ls, cp, …
– sh ls -al -- escape to shell
• LOAD ‘hdfs-path’ AS (schema)
– schemas can include int, double, …
– schemas can include complex types: bag, map, tuple, …
• FOREACH alias GENERATE … AS …, …
– transforms each row of a relation
– operators include +, -, and, or, …
– can extend this set easily (more later)
• DESCRIBE alias -- shows the schema
• ILLUSTRATE alias -- derives a sample tuple
![Page 72: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/72.jpg)
72
Phrase Finding 1 - word counts
![Page 73: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/73.jpg)
73
![Page 74: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/74.jpg)
74
PIG Features
• LOAD ‘hdfs-path’ AS (schema)
– schemas can include int, double, bag, map, tuple, …
• FOREACH alias GENERATE … AS …, …
– transforms each row of a relation
• DESCRIBE alias / ILLUSTRATE alias -- debugging
• GROUP r BY x
– like a shuffle-sort: produces relation with fields group and r, where r is a bag
![Page 75: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/75.jpg)
75
PIG parses and optimizes a sequence of commands before it executes them.
It’s smart enough to turn GROUP … FOREACH … SUM … into a map-reduce.
![Page 76: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/76.jpg)
76
PIG Features
• LOAD ‘hdfs-path’ AS (schema)
– schemas can include int, double, bag, map, tuple, …
• FOREACH alias GENERATE … AS …, …
– transforms each row of a relation
• DESCRIBE alias / ILLUSTRATE alias -- debugging
• GROUP alias BY …
• FOREACH alias GENERATE group, SUM(….)
– GROUP/GENERATE … aggregate op together act like a map-reduce
– aggregates: COUNT, SUM, AVG, MAX, MIN, …
– you can write your own
![Page 77: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/77.jpg)
77
PIG parses and optimizes a sequence of commands before it executes them.
It’s smart enough to turn GROUP … FOREACH … SUM … into a map-reduce.
![Page 78: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/78.jpg)
78
Phrase Finding 3 - assembling phrase- and word-level statistics
![Page 79: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/79.jpg)
79
![Page 80: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/80.jpg)
80
![Page 81: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/81.jpg)
81
PIG Features
• LOAD ‘hdfs-path’ AS (schema)
– schemas can include int, double, bag, map, tuple, …
• FOREACH alias GENERATE … AS …, …
– transforms each row of a relation
• DESCRIBE alias / ILLUSTRATE alias -- debugging
• GROUP alias BY …
• FOREACH alias GENERATE group, SUM(….)
– GROUP/GENERATE … aggregate op together act like a map-reduce
• JOIN r BY field, s BY field, …
– inner join to produce rows: r::f1, r::f2, … s::f1, s::f2, …
![Page 82: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/82.jpg)
82
Phrase Finding 4 - adding total frequencies
![Page 83: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/83.jpg)
83
![Page 84: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/84.jpg)
84
How do we add the totals to the phraseStats relation?
STORE triggers execution of the query plan… it also limits optimization.
![Page 85: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/85.jpg)
85
Comment: schema is lost when you store….
![Page 86: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/86.jpg)
86
PIG Features
• LOAD ‘hdfs-path’ AS (schema)
– schemas can include int, double, bag, map, tuple, …
• FOREACH alias GENERATE … AS …, …
– transforms each row of a relation
• DESCRIBE alias / ILLUSTRATE alias -- debugging
• GROUP alias BY …
• FOREACH alias GENERATE group, SUM(….)
– GROUP/GENERATE … aggregate op together act like a map-reduce
• JOIN r BY field, s BY field, …
– inner join to produce rows: r::f1, r::f2, … s::f1, s::f2, …
• CROSS r, s, …
– use with care unless all but one of the relations are singleton
– newer pigs allow singleton relation to be cast to a scalar
![Page 87: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/87.jpg)
87
Phrase Finding 5 - phrasiness and informativeness
![Page 88: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/88.jpg)
88
How do we compute some complicated function?
With a “UDF”
![Page 89: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/89.jpg)
89
![Page 90: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/90.jpg)
90
PIG Features
• LOAD ‘hdfs-path’ AS (schema)
– schemas can include int, double, bag, map, tuple, …
• FOREACH alias GENERATE … AS …, …
– transforms each row of a relation
• DESCRIBE alias / ILLUSTRATE alias -- debugging
• GROUP alias BY …
• FOREACH alias GENERATE group, SUM(….)
– GROUP/GENERATE … aggregate op together act like a map-reduce
• JOIN r BY field, s BY field, …
– inner join to produce rows: r::f1, r::f2, … s::f1, s::f2, …
• CROSS r, s, …
– use with care unless all but one of the relations are singleton
• User defined functions as operators
– also for loading, aggregates, …
![Page 91: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/91.jpg)
91
The full phrase-finding pipeline
![Page 92: Other Map-Reduce (ish) Frameworks William Cohen 1.](https://reader030.fdocuments.in/reader030/viewer/2022032710/56649ee65503460f94bf678f/html5/thumbnails/92.jpg)
92