Cloud Computing Mapreduce (2) Keke Chen. Outline Hadoop streaming example Hadoop java API...
Cloud Computing
MapReduce (2)
Keke Chen
Outline
- Hadoop streaming example
- Hadoop Java API
  - Framework
  - Important APIs
- Mini-project
A nice book
- Hadoop: The Definitive Guide
- You can read it online from the campus network: OhioLINK ebook center, Safari online
Hadoop streaming
- Simple and powerful interface for programming
- Application developers do not need to learn the Hadoop Java APIs
- Good for simple, ad hoc tasks
Note
- Map/Reduce uses the local Linux file system for processing and hosting temporary data
- HDFS is used to host application data
[Diagram labels: HDFS, Node, Local filesystem]
Hadoop streaming
- http://hadoop.apache.org/common/docs/current/streaming.html

    /usr/local/hadoop/bin/hadoop jar \
        /usr/local/hadoop/hadoop-streaming-1.0.3.jar \
        -input myInputDirs -output myOutputDir \
        -mapper myMapper -reducer myReducer

- Reducer can be empty: -reducer NONE
- myMapper and myReducer can be any executable
- Mapper and reducer read from stdin and write to stdout
- Files in myInputDirs are fed into the mapper as stdin
- The mapper's output will be the input of the reducer
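The stdin/stdout contract above can be tried out without a cluster. The following is a minimal sketch, not Hadoop itself: `mapper`, `reducer`, and `run_streaming_locally` are hypothetical names, and the in-memory sort stands in for the framework's sort of mapper output by key.

```python
import io

def mapper(stdin, stdout):
    """Hypothetical mapper: write one 'word TAB 1' line per input word."""
    for line in stdin:
        for word in line.split():
            stdout.write('%s\t1\n' % word)

def reducer(stdin, stdout):
    """Hypothetical reducer: sum the counts of each word."""
    counts = {}
    for line in stdin:
        word, count = line.rstrip('\n').split('\t', 1)
        counts[word] = counts.get(word, 0) + int(count)
    for word in sorted(counts):
        stdout.write('%s\t%d\n' % (word, counts[word]))

def run_streaming_locally(input_text):
    """Imitate the data flow: input -> mapper -> sort by key -> reducer."""
    map_out = io.StringIO()
    mapper(io.StringIO(input_text), map_out)
    # Stand-in for the framework's sort between the map and reduce phases.
    sorted_lines = sorted(map_out.getvalue().splitlines(keepends=True))
    reduce_out = io.StringIO()
    reducer(io.StringIO(''.join(sorted_lines)), reduce_out)
    return reduce_out.getvalue()

print(run_streaming_locally('a b a\nb c\n'))
```

Because both functions only see file-like objects, the same bodies could be pointed at `sys.stdin` and `sys.stdout` when run as real streaming executables.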
Packaging files with job submission

    /usr/local/hadoop/bin/hadoop jar \
        /usr/local/hadoop/hadoop-streaming-1.0.3.jar \
        -input "/user/hadoop/inputdata" \
        -output "/user/hadoop/outputdata" \
        -mapper "python myPythonScript.py myDictionary.txt" \
        -reducer "/bin/wc" \
        -file myPythonScript.py \
        -file myDictionary.txt

- myDictionary.txt in the -mapper command is an input parameter for the script
- -file is good for small files
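The slides do not show what myPythonScript.py does with myDictionary.txt. As a sketch only, assume the dictionary holds one word per line and the mapper keeps just the words found in it; `load_dictionary`, `map_lines`, and `main` are hypothetical names.

```python
def load_dictionary(path):
    """Read one word per line from the side file shipped with -file."""
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

def map_lines(lines, dictionary):
    """Yield 'word TAB 1' for every input word found in the dictionary."""
    for line in lines:
        for word in line.split():
            if word in dictionary:
                yield '%s\t1' % word

def main(argv, stdin, stdout):
    """Entry point; argv[1] is the file name given in the -mapper command."""
    dictionary = load_dictionary(argv[1])
    for record in map_lines(stdin, dictionary):
        stdout.write(record + '\n')

# In-memory demo. A real script would instead "import sys" and end with
# main(sys.argv, sys.stdin, sys.stdout).
print(list(map_lines(['a b c', 'b d'], {'b', 'd'})))
```

The script can open the dictionary by its bare file name because, per the slide, it is passed as a parameter and shipped with the job through -file.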
Using Hadoop library classes

    hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -D mapred.reduce.tasks=12 \
        -input myInputDirs \
        -output myOutputDir \
        -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
        -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
        -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
Large files and archives
- Upload large files to HDFS first
- Use the -files option in streaming, which will download the files to the local working directory

    -files hdfs://host:fs_port/user/testfile.txt#testlink
    -archives hdfs://host:fs_port/user/testfile.jar#testlink

- If cache1.txt and cache2.txt are in testfile.jar, then locally they are available as testlink/cache1.txt and testlink/cache2.txt
Wordcount
- Problem: counting the frequencies of words in a large document collection
- Implement the mapper and the reducer separately, using Python
- Some good Python tutorials at http://wiki.python.org/
mapper.py

    import sys

    # Emit "<word>\t1" for every word read from stdin.
    for line in sys.stdin:
        line = line.strip()
        words = line.split()
        for word in words:
            print('%s\t1' % word)
reducer.py

    import sys

    # Running total per word.
    word2count = {}

    for line in sys.stdin:
        line = line.strip()
        try:
            # Both the split and the int() raise ValueError on malformed lines.
            word, count = line.split('\t', 1)
            count = int(count)
            word2count[word] = word2count.get(word, 0) + count
        except ValueError:
            pass

    for word in word2count:
        print('%s\t%s' % (word, word2count[word]))
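The reducer above keeps every word in a dictionary until the input ends. Since streaming delivers the reducer's input sorted by key, an alternative is to sum one word at a time. This is a sketch of that variant, not the author's reducer; `parse` and `reduce_sorted` are hypothetical names.

```python
from itertools import groupby

def parse(lines):
    """Yield (word, count) pairs, skipping malformed lines."""
    for line in lines:
        try:
            word, count = line.rstrip('\n').split('\t', 1)
            yield word, int(count)
        except ValueError:
            continue

def reduce_sorted(lines):
    """Sum counts per word, relying on the input being sorted by word."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)

# In a real reducer script, "demo" would be sys.stdin.
demo = ['a\t1\n', 'a\t1\n', 'b\t1\n', 'bad line\n', 'c\t2\n']
for word, total in reduce_sorted(demo):
    print('%s\t%d' % (word, total))
```

Only the current word's running sum is held in memory, which matters when the number of distinct words is large. The trade-off is that the result is wrong if the input is not grouped by key.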
Running wordcount

    hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -mapper "python mapper.py" \
        -reducer "python reducer.py" \
        -input text -output output2 \
        -file /localpath/mapper.py -file /localpath/reducer.py
Running wordcount

    hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -mapper "python mapper.py" \
        -reducer "python reducer.py" \
        -input text -output output2 \
        -file mapper.py -file reducer.py \
        -jobconf mapred.reduce.tasks=2 \
        -jobconf mapred.map.tasks=4
If the mapper/reducer takes files as parameters

    hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -mapper "python mapper.py" \
        -reducer "python reducer.py myfile" \
        -input text -output output2 \
        -file /localpath/mapper.py -file /localpath/reducer.py \
        -file /localpath/myfile
Hadoop Java APIs
- hadoop.apache.org/common/docs/current/api/
- Benefits
  - Java code is more efficient than streaming
  - More parameters for control and tuning
  - Better for iterative MR programs
Important base classes
- Mapper<keyIn, valueIn, keyOut, valueOut>
  - Function map(Object, Writable, Context)
- Reducer<keyIn, valueIn, keyOut, valueOut>
  - Function reduce(WritableComparable, Iterable, Context)
- Combiner
- Partitioner
The framework

    public class Wordcount {
        public static class MapClass extends Mapper<Object, Text, Text, LongWritable> {
            public void setup(Mapper.Context context) {…}
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {…}
        }

        public static class ReduceClass extends Reducer<Text, LongWritable, Text, LongWritable> {
            public void setup(Reducer.Context context) {…}
            // The values parameter must be an Iterable to override Reducer.reduce.
            public void reduce(Text key, Iterable<LongWritable> values, Context context)
                    throws IOException, InterruptedException {…}
        }

        public static void main(String[] args) throws Exception {}
    }
The wordcount example in Java
- http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Example%3A+WordCount+v1.0
- Old/New framework
  - Old framework for versions prior to 0.20
Mapper of wordcount

    public static class WCMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split the input line into tokens and emit (token, 1) for each.
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }
WordCount Reducer

    public static class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum all counts emitted for this word.
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
Function parameters
- Define the map/reduce parameters according to your application
- You have to use the Writable classes in org.apache.hadoop.io
  - E.g. Text, LongWritable, IntWritable, etc.
- The template parameters and the function parameters should match
- Map's output and reduce's input parameters should match
Configuring map/reduce
- Passing global parameter settings to each map/reduce process
- In the main function, set the parameters in a Configuration object

    Configuration conf = new Configuration();
    Job job = new Job(conf, "cloudvista");
    job.setJarByClass(Wordcount.class);
    job.setOutputKeyClass(Text.class);
    // Must match the value type emitted by WCMapper and WCReducer.
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(WCMapper.class);
    //job.setCombinerClass(WCReducer.class);
    job.setReducerClass(WCReducer.class);
    //job.setPartitionerClass(WCPartitioner.class);
    job.setNumReduceTasks(num_reduce);
    FileInputFormat.setInputPaths(job, input);
    FileOutputFormat.setOutputPath(job, new Path(output_path));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
How to run your app
1. Compile to a jar file
2. Command line

    hadoop jar your_jar your_parameters

- Normally you need to pass in
  - Number of reducers
  - Input files
  - Output directory
  - Any other application-specific parameters
Access files in HDFS?
Example: in the map function

    public void setup(Mapper.Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        String filename = conf.get("yourfile");
        Path p = new Path(filename);           // Path is used for opening the file
        FileSystem fs = FileSystem.get(conf);  // determines local or HDFS
        FSDataInputStream file = fs.open(p);   // fs.open returns an FSDataInputStream
        while (file.available() > 0) {
            …
        }
        file.close();
    }
Combiner
- Apply the reduce function to the intermediate results locally, after the map generates its result
[Diagram: Map1 (key1 ... key n) -> combine, local to the map -> (Key1, value1), (Key2, value2), ..., (Keyn, valueN) -> reduces]
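To see what the combiner buys, the sketch below counts how many records leave the maps with and without local combining. It is plain Python rather than Hadoop code; `map_words`, `combine`, `reduce_all`, and the two input splits are made up for illustration.

```python
from collections import Counter

def map_words(text):
    """Map step: one (word, 1) pair per word in the split."""
    return [(word, 1) for word in text.split()]

def combine(pairs):
    """Combiner: apply the reduce function (a sum) to one map's output."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

def reduce_all(per_map_outputs):
    """Reduce step: sum the counts arriving from every map."""
    totals = Counter()
    for pairs in per_map_outputs:
        for word, count in pairs:
            totals[word] += count
    return dict(totals)

splits = ['a b a a', 'b b c']                 # two hypothetical input splits
raw = [map_words(s) for s in splits]          # what the maps emit
combined = [combine(p) for p in raw]          # after local combining

# Records sent to the reduces: without vs. with the combiner.
print(sum(len(p) for p in raw), sum(len(p) for p in combined))
print(reduce_all(raw) == reduce_all(combined))
```

The final counts are identical either way because addition is associative and commutative, which is the property a reduce function needs before it can be reused as a combiner.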
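Partitioner
- Suppose the map's output generates N keys (N > R, R: number of reduces)
- By default, the N keys are distributed to the R reduces by hashing each key (HashPartitioner), so the same key always goes to the same reduce
- You can use a partitioner to define how the keys are distributed to the reduces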
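The idea of mapping a key to one of R reduces can be sketched in a few lines of Python. This is an illustration under assumptions: `zlib.crc32` stands in for the key's Java `hashCode()` that Hadoop's HashPartitioner uses, and `first_letter_partition` is a hypothetical custom rule.

```python
import zlib

def hash_partition(key, num_reduces):
    """Hash partitioning: the same key always maps to the same reduce."""
    return zlib.crc32(key.encode('utf-8')) % num_reduces

def first_letter_partition(key, num_reduces):
    """Hypothetical custom rule: route by the key's first character."""
    return ord(key[0].lower()) % num_reduces

def assign(keys, num_reduces, partition):
    """Group keys into one bucket per reduce using the given partitioner."""
    buckets = {r: [] for r in range(num_reduces)}
    for key in keys:
        buckets[partition(key, num_reduces)].append(key)
    return buckets

keys = ['apple', 'avocado', 'banana', 'cherry', 'apple']
print(assign(keys, 2, hash_partition))
print(assign(keys, 2, first_letter_partition))
```

Whatever rule is used, it has to be a pure function of the key. Otherwise records with the same key could reach different reduces and be aggregated separately.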
Mini project 1
1. Learn to use HDFS
2. Read and run the wordcount example
   - http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html
3. Write a MR program for inverted index
   - /user/hadoop/prj1.txt
   - Implement two versions
     - Script/exe + streaming
     - Hadoop Java API
   - The file has "docID \t docContent" per line
   - Generate the inverted index
     - Word \t a list of "DocID:position"
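The input and output formats can be illustrated on two made-up documents. This is a single-process sketch, not a MapReduce solution, and it makes two assumptions the slides do not state: "position" is the zero-based token index within the document, and the postings in a list are joined by commas. `map_line`, `build_index`, and `format_index` are hypothetical names.

```python
from collections import defaultdict

def map_line(line):
    """Turn one 'docID TAB docContent' line into (word, 'docID:position') pairs."""
    doc_id, content = line.rstrip('\n').split('\t', 1)
    for position, word in enumerate(content.split()):
        yield word, '%s:%d' % (doc_id, position)

def build_index(lines):
    """Group the postings by word, as the reduce side would."""
    index = defaultdict(list)
    for line in lines:
        for word, posting in map_line(line):
            index[word].append(posting)
    return index

def format_index(index):
    """Render 'word TAB posting,posting,...' lines, sorted by word."""
    return ['%s\t%s' % (word, ','.join(index[word])) for word in sorted(index)]

docs = ['d1\thello world hello\n', 'd2\tworld of hadoop\n']
for row in format_index(build_index(docs)):
    print(row)
```

In a streaming version, `map_line` would correspond to the mapper's per-line work and the grouping in `build_index` to what the reducer does with the sorted mapper output.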