Solving real world problems with Hadoop


Transcript of Solving real world problems with Hadoop

Page 1: Solving real world problems with Hadoop

Solving Real World Problems with Hadoop and

SQL -> Hadoop

Masahji Stewart <[email protected]>

Tuesday, April 5, 2011

Page 2: Solving real world problems with Hadoop

Solving Real World Problems with Hadoop


Page 3: Solving real world problems with Hadoop

Word Count

MapReduce is a framework for processing huge datasets on certain kinds of distributable problems using a large number of computers (nodes), collectively referred to as a cluster...

Input


Page 4: Solving real world problems with Hadoop

Word Count

MapReduce is a framework for processing huge datasets on certain kinds of distributable problems using a large number of computers (nodes), collectively referred to as a cluster...

Input

Output

as 1, certain 1, collectively 1, datasets 1, framework 1, huge 1, number 1, on 1, referred 1, to 1

MapReduce 1, cluster 1, computers 1, distributable 1, for 1, kinds 1, of 2, problems 1

(nodes), 1, a 3, is 1, large 1, processing 1, using 1


Page 5: Solving real world problems with Hadoop

Word Count (Mapper)

public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}


Page 6: Solving real world problems with Hadoop

Word Count (Mapper)

Extract: word = "MapReduce", word = "is", word = "a", ...

Page 7: Solving real world problems with Hadoop

Word Count (Mapper)

Emit: ("MapReduce", 1), ("is", 1), ("a", 1), ...

Page 8: Solving real world problems with Hadoop

Word Count (Reducer)

public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}


Page 9: Solving real world problems with Hadoop

Word Count (Reducer)

Sum: key = "of", sum = 2

Page 10: Solving real world problems with Hadoop

Word Count (Reducer)

Emit: ("of", 2)
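The TokenizerMapper and IntSumReducer above are what the deck runs from hadoop-0.20-examples.jar on the next slide. The driver that wires them together is not shown; it is essentially the standard WordCount example that ships with Hadoop, roughly:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // Summing counts is associative, so the reducer doubles as a combiner.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}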


Page 11: Solving real world problems with Hadoop

Word Count (Running)

$ hadoop jar ./.versions/0.20/hadoop-0.20-examples.jar wordcount \
    -D mapred.reduce.tasks=3 \
    input_file out

11/04/03 21:21:27 INFO mapred.JobClient: Default number of map tasks: 2
11/04/03 21:21:27 INFO mapred.JobClient: Default number of reduce tasks: 3
11/04/03 21:21:28 INFO input.FileInputFormat: Total input paths to process : 1
11/04/03 21:21:29 INFO mapred.JobClient: Running job: job_201103252110_0659
11/04/03 21:21:30 INFO mapred.JobClient:  map 0% reduce 0%
11/04/03 21:21:37 INFO mapred.JobClient:  map 100% reduce 0%
11/04/03 21:21:49 INFO mapred.JobClient:  map 100% reduce 33%
11/04/03 21:21:52 INFO mapred.JobClient:  map 100% reduce 66%
11/04/03 21:22:05 INFO mapred.JobClient:  map 100% reduce 100%
11/04/03 21:22:08 INFO mapred.JobClient: Job complete: job_201103252110_0659
11/04/03 21:22:08 INFO mapred.JobClient: Counters: 17
...
11/04/03 21:22:08 INFO mapred.JobClient:     Map output bytes=286
11/04/03 21:22:08 INFO mapred.JobClient:     Combine input records=27
11/04/03 21:22:08 INFO mapred.JobClient:     Map output records=27
11/04/03 21:22:08 INFO mapred.JobClient:     Reduce input records=24


Page 12: Solving real world problems with Hadoop

Word Count (Output)

hadoop@ip-10-245-210-191:~$ hadoop fs -ls out
Found 3 items
-rw-r--r--   2 hadoop supergroup  90 2011-04-03 21:21 /user/hadoop/out/part-r-00000
-rw-r--r--   2 hadoop supergroup  80 2011-04-03 21:21 /user/hadoop/out/part-r-00001
-rw-r--r--   2 hadoop supergroup  49 2011-04-03 21:21 /user/hadoop/out/part-r-00002

hadoop@ip-10-245-210-191:~$ hadoop fs -cat out/part-r-00000
as	1
certain	1
collectively	1
datasets	1
framework	1
...

hadoop@ip-10-245-210-191:~$ hadoop fs -cat out/part-r-00001
MapReduce	1
cluster	1
computers	1
distributable	1
for	1
...

hadoop@ip-10-245-210-191:~$ hadoop fs -cat out/part-r-00002
(nodes),	1
a	3
is	1
large	1
processing	1
using	1

A file per reducer



Page 14: Solving real world problems with Hadoop

Word Count

[Diagram: the input sentence is split among five map tasks; their (word, 1) pairs are shuffled and sorted by key and routed to three reduce tasks, which write the three output partitions shown above. Stages: Input -> Split -> Map -> Shuffle/Sort -> Reduce -> Output.]


Page 15: Solving real world problems with Hadoop

Log Processing (Date IP COUNT)

67.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
189.186.9.181 - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-"
90.221.175.16 - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "http://www.wearelistening.org" "Mozilla/5.0"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "http://www.fuel.tv/Gear" "Mozilla/4.0"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
66.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0"
90.221.75.196 - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "http://www.fuel.tv/music" "Mozilla/5.0"

...

Input


Page 16: Solving real world problems with Hadoop

Log Processing (Date IP COUNT)

Input: the access-log lines from the previous slide.

Output
18/Jul/2010	189.186.9.181	1
18/Jul/2010	201.201.16.82	3
18/Jul/2010	66.195.114.59	1
18/Jul/2010	67.195.114.59	1
18/Jul/2010	90.221.175.16	1
19/Jul/2010	90.221.75.196	1
...


Page 17: Solving real world problems with Hadoop

Log Processing (Mapper)

public static final Pattern LOG_PATTERN = Pattern.compile(
    "^([\\d.]+) (\\S+) (\\S+) \\[(([\\w/]+):([\\d:]+)\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\"");

public static class ExtractDateAndIpMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text ip = new Text();

  public void map(Object key, Text value, Context context) throws IOException {
    String text = value.toString();
    Matcher matcher = LOG_PATTERN.matcher(text);
    while (matcher.find()) {
      try {
        ip.set(matcher.group(5) + "\t" + matcher.group(1));
        context.write(ip, one);
      } catch (InterruptedException ex) {
        throw new IOException(ex);
      }
    }
  }
}
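The regex does the heavy lifting here: group(1) is the client IP and group(5) is the date portion of the timestamp, which the mapper joins with a tab to form the key. A small standalone check (not part of the deck) makes the group numbering concrete:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone check of LOG_PATTERN against one of the sample access-log lines,
// printing the capture groups the mapper actually uses.
public class LogPatternCheck {
  public static void main(String[] args) {
    Pattern logPattern = Pattern.compile(
        "^([\\d.]+) (\\S+) (\\S+) \\[(([\\w/]+):([\\d:]+)\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\"");

    String line = "90.221.175.16 - - [18/Jul/2010:16:21:35 -0700] "
        + "\"GET /music HTTP/1.1\" 200 30151 \"http://www.wearelistening.org\" \"Mozilla/5.0\"";

    Matcher m = logPattern.matcher(line);
    if (m.find()) {
      System.out.println(m.group(1)); // 90.221.175.16  (client IP)
      System.out.println(m.group(5)); // 18/Jul/2010    (date part of the timestamp)
      System.out.println(m.group(8)); // 200            (HTTP status)
    }
  }
}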


Page 18: Solving real world problems with Hadoop

Log Processing (Mapper)

Extract: ip = "189.186.9.181", ip = "201.201.16.82", ip = "66.249.67.57", ...


Page 19: Solving real world problems with Hadoop

Log Processing (Mapper)

Emit: ("18/Jul/2010\t189.186.9.181", 1), ...


Page 20: Solving real world problems with Hadoop

Log Processing (main)

public class LogAggregator {
...
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: LogAggregator <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "LogAggregator");
    job.setJarByClass(LogAggregator.class);
    job.setMapperClass(ExtractDateAndIpMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}


Page 21: Solving real world problems with Hadoop

Log Processing (main)

Mapper

Page 22: Solving real world problems with Hadoop

Log Processing (main)

Reducer

Page 23: Solving real world problems with Hadoop

Log Processing (main)

Input/Output Settings

Page 24: Solving real world problems with Hadoop

Log Processing (main)

Run it!

Page 25: Solving real world problems with Hadoop

Log Processing (Running)

$ hadoop jar target/hadoop-recipes-1.0.jar com.synctree.hadoop.recipes.LogAggregator \
    -libjars hadoop-examples.jar data/access.log log_results

11/04/04 00:51:30 INFO jvm.JvmMetrics: Initializing JVM Metrics with
11/04/04 00:51:30 INFO input.FileInputFormat: Total input paths to process : 1
11/04/04 00:51:31 INFO filecache.TrackerDistributedCacheManager: Creating hadoop-examples.jar in /tmp/hadoop-masahji/mapred/local/archive/-8850340642758714312_382885124_516658918/file/Users/masahji/Development/hadoop-recipes-work--8125788655475885988 with rwxr-xr-x
11/04/04 00:51:31 INFO filecache.TrackerDistributedCacheManager: Cached file:///Users/masahji/Development/hadoop-recipes/hadoop-examples.jar as /tmp/hadoop-masahji/mapred/local/archive/-8850340642758714312_382885124_516658918/file/Users/masahji/Development/hadoop-recipes/hadoop-examples.jar
11/04/04 00:51:31 INFO filecache.TrackerDistributedCacheManager: Cached file:///Users/masahji/Development/hadoop-recipes/hadoop-examples.jar as /tmp/hadoop-masahji/mapred/local/archive/-8850340642758714312_382885124_516658918/file/Users/masahji/Development/hadoop-recipes/hadoop-examples.jar
11/04/04 00:51:32 INFO mapred.JobClient:  map 100% reduce 100%


Page 26: Solving real world problems with Hadoop

Log Processing (Running)

JAR placed into Distributed Cache

Page 27: Solving real world problems with Hadoop

Log Processing (Output)

$ hadoop fs -ls log_results
Found 2 items
-rwxrwxrwx   1 masahji  staff    0 2011-04-04 00:51 log_results/_SUCCESS
-rwxrwxrwx   1 masahji  staff  168 2011-04-04 00:51 log_results/part-r-00000

$ hadoop fs -cat log_results/part-r-00000
18/Jul/2010	189.186.9.181	1
18/Jul/2010	201.201.16.82	3
18/Jul/2010	66.195.114.59	1
18/Jul/2010	67.195.114.59	1
18/Jul/2010	90.221.175.16	1
19/Jul/2010	90.221.75.196	1
...


Page 28: Solving real world problems with Hadoop

Hadoop Streaming

[Diagram: the Task Tracker forks the user-supplied script as the mapper or reducer; input records are written to the script's STDIN and each line the script prints to STDOUT becomes an output record.]


Page 29: Solving real world problems with Hadoop

Basic grep

Input
...
搜索 搜索 [sou1 suo3] /to search/.../internet search/database search/
吉日 吉日 [ji2 ri4] /propitious day/lucky day/
吉祥 吉祥 [ji2 xiang2] /lucky/auspicious/propitious/
咄咄 咄咄 [duo1 duo1] /to cluck one's tongue/tut-tut/
喜鵲 喜鹊 [xi3 que4] /black-billed magpie, legendary bringer of good luck/
...


Page 30: Solving real world problems with Hadoop

Basic grep

Input: the dictionary entries shown on the previous slide.

Output
...
匯出 汇出 [hui4 chu1] /to export data (e.g. from a database)/
搜索 搜索 [sou1 suo3] /to search/.../internet search/database search/
數據庫 数据库 [shu4 ju4 ku4] /database/
數據庫軟件 数据库软件 [shu4 ju4 ku4 ruan3 jian4] /database software/
資料庫 资料库 [zi1 liao4 ku4] /database/
...


Page 31: Solving real world problems with Hadoop

$ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input data/cedict.txt.gz \
    -output streaming/grep_database_mandarin \
    -mapper 'grep database' \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer
...
11/04/04 05:27:58 INFO streaming.StreamJob:  map 100% reduce 100%
11/04/04 05:27:58 INFO streaming.StreamJob: Job complete: job_local_0001
11/04/04 05:27:58 INFO streaming.StreamJob: Output: streaming/grep_database_mandarin

Basic grep


Page 32: Solving real world problems with Hadoop

Scripts or Java Classes

Basic grep

Page 33: Solving real world problems with Hadoop


Basic grep

$ hadoop fs -cat streaming/grep_database_mandarin/part-00000

匯出 汇出 [hui4 chu1] /to remit (money)//to export data (e.g. from a database)/
搜索 搜索 [sou1 suo3] /to search/to look for sth/internet search/database search/
數據庫 数据库 [shu4 ju4 ku4] /database/
數據庫軟件 数据库软件 [shu4 ju4 ku4 ruan3 jian4] /database software/
資料庫 资料库 [zi1 liao4 ku4] /database/


Page 34: Solving real world problems with Hadoop

Ruby Example (ignore ip list)

Input
67.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
192.168.10.4 - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 96 "-" "Mozilla/4.0"
189.186.9.181 - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-"
90.221.175.16 - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "http://www.wearelistening.org" "Mozilla/5.0"
10.1.10.12 - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 51 "-" "Mozilla/5.0"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "http://www.fuel.tv/Gear" "Mozilla/4.0"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
10.1.10.4 - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 94 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
10.1.10.14 - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 24 "-" "Mozilla/4.0"
66.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0"
90.221.75.196 - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "http://www.fuel.tv/music" "Mozilla/5.0"
...

Output
189.186.9.181 - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla 4.0"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "http://www.fuel.tv/Gear" "Mozilla/4.0"
66.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0"
67.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
90.221.175.16 - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "http://www.wearelistening.org" "Mozilla/5.0"
90.221.75.196 - - [19/Jul/2010] "GET /music HTTP/1.1" 200 30151 "http://www.wearelistening.org" "Mozilla/5.0"
90.221.75.196 - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "http://www.fuel.tv/music" "Mozilla/5.0"
...


Page 35: Solving real world problems with Hadoop

Ruby Example (ignore ip list)

#!/usr/bin/env ruby

ignore = %w(127.0.0.1 192.168 10)
log_regex = /^([\d.]+)\s/

while (line = STDIN.gets)
  next unless line =~ log_regex
  ip = $1

  print line if ignore.reject { |ignore_ip| ip !~ /^#{ignore_ip}(\.|$)/ }.empty?
end

Read STDIN, write STDOUT



Page 37: Solving real world problems with Hadoop

$ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input data/access.log \
    -output out/streaming/filter_ips \
    -mapper './script/filter_ips' \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer
11/04/04 07:08:08 INFO jvm.JvmMetrics: Initializing JVM Metrics with
11/04/04 07:08:08 WARN mapred.JobClient: No job jar file set. User classes may not
11/04/04 07:08:08 INFO mapred.FileInputFormat: Total input paths to process : 1
11/04/04 07:08:09 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-masahji/
11/04/04 07:08:09 INFO streaming.StreamJob: Running job: job_local_0001
11/04/04 07:08:09 INFO streaming.StreamJob: Job running in-process (local Hadoop)
...

Ruby Example (ignore ip list)


Page 38: Solving real world problems with Hadoop


Ruby Example (ignore ip list)

$ hadoop fs -cat out/streaming/filter_ips/part-00000
...

189.186.9.181 - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "http://www.fuel.tv/Gear" "Mozilla/4.0"
66.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0"
67.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
90.221.175.16 - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "http://www.wearelistening.org" "Mozilla/5.0"
90.221.75.196 - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "http://www.fuel.tv/music" "Mozilla/5.0"


Page 39: Solving real world problems with Hadoop

SQL -> Hadoop


Page 40: Solving real world problems with Hadoop

Simple Query

Query
SELECT first_name, last_name
FROM people
WHERE first_name = 'John' OR favorite_movie_id = 2


Page 41: Solving real world problems with Hadoop

Simple Query

Input

id   first_name   last_name   favorite_movie_id
1    John         Mulligan    3
2    Samir        Ahmed       5
3    Royce        Rollins     2
4    John         Smith       2


Page 42: Solving real world problems with Hadoop

Simple Query

Output

first_name   last_name
John         Mulligan
Royce        Rollins
John         Smith


Page 43: Solving real world problems with Hadoop

Simple Query (Mapper)

public class SimpleQuery {
...
  public static class SelectAndFilterMapper
      extends Mapper<Object, Text, TextArrayWritable, Text> {
    ...
    public void map(Object key, Text value, Context context) throws IOException {

      String[] row = value.toString().split(DELIMITER);

      try {
        if (row[FIRST_NAME_COLUMN].equals("John") ||
            row[FAVORITE_MOVIE_ID_COLUMN].equals("2")) {

          columns.set(new String[] {
              row[FIRST_NAME_COLUMN],
              row[LAST_NAME_COLUMN]
          });

          context.write(columns, blank);
        }
      } catch (InterruptedException ex) {
        throw new IOException(ex);
      }
    }
  }
...
}
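SelectAndFilterMapper emits the projected columns as a TextArrayWritable key with an empty Text value. The deck never shows that helper class; a minimal sketch of what it might look like, inferred from how columns.set(...) and (later) getTextAt(...) are called, is:

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical reconstruction of the TextArrayWritable helper used in these slides:
// an ArrayWritable fixed to Text elements, comparable so it can act as a map output key.
public class TextArrayWritable extends ArrayWritable
    implements WritableComparable<TextArrayWritable> {

  public TextArrayWritable() {
    super(Text.class);
  }

  // Convenience overload matching columns.set(new String[] {...}) in the mapper.
  public void set(String[] strings) {
    Text[] texts = new Text[strings.length];
    for (int i = 0; i < strings.length; i++) {
      texts[i] = new Text(strings[i]);
    }
    set(texts);
  }

  // getTextAt(i), as called from the join reducer later in the deck.
  public Text getTextAt(int i) {
    return (Text) get()[i];
  }

  @Override
  public String toString() {
    // Tab-separated, so TextOutputFormat prints the columns as separate fields.
    StringBuilder sb = new StringBuilder();
    for (String s : toStrings()) {
      if (sb.length() > 0) sb.append('\t');
      sb.append(s);
    }
    return sb.toString();
  }

  @Override
  public int compareTo(TextArrayWritable other) {
    return toString().compareTo(other.toString());
  }

  @Override
  public int hashCode() {
    // Keep HashPartitioner consistent with compareTo.
    return toString().hashCode();
  }
}

Implementing WritableComparable (with a matching hashCode) is what would let it serve as a map output key, and its tab-joined toString is consistent with the tab-separated rows shown on the output slides.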


Page 44: Solving real world problems with Hadoop

Simple Query (Mapper)

Extract

Page 45: Solving real world problems with Hadoop

Simple Query (Mapper)

Extract

WHERE: first_name = 'John' OR favorite_movie_id = 2

Page 46: Solving real world problems with Hadoop

Simple Query (Mapper)

SELECT: first_name, last_name

Extract

WHERE: first_name = 'John' OR favorite_movie_id = 2

Page 47: Solving real world problems with Hadoop

Simple Query (Mapper)

SELECT: first_name, last_name

Extract

WHERE: first_name = 'John' OR favorite_movie_id = 2

Emit

Page 48: Solving real world problems with Hadoop

Simple Query (Running)

$ hadoop jar target/hadoop-recipes-1.0.jar com.synctree.hadoop.recipes.SimpleQuery \
    data/people.tsv out/simple_query

...
11/04/04 09:19:15 INFO mapred.JobClient:  map 100% reduce 100%
11/04/04 09:19:15 INFO mapred.JobClient: Job complete: job_local_0001
11/04/04 09:19:15 INFO mapred.JobClient: Counters: 13
11/04/04 09:19:15 INFO mapred.JobClient:   FileSystemCounters
11/04/04 09:19:15 INFO mapred.JobClient:     FILE_BYTES_READ=306296
11/04/04 09:19:15 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=398676
11/04/04 09:19:15 INFO mapred.JobClient:   Map-Reduce Framework
11/04/04 09:19:15 INFO mapred.JobClient:     Reduce input groups=3
11/04/04 09:19:15 INFO mapred.JobClient:     Combine output records=0
11/04/04 09:19:15 INFO mapred.JobClient:     Map input records=4
11/04/04 09:19:15 INFO mapred.JobClient:     Reduce shuffle bytes=0
11/04/04 09:19:15 INFO mapred.JobClient:     Reduce output records=3
11/04/04 09:19:15 INFO mapred.JobClient:     Spilled Records=6
11/04/04 09:19:15 INFO mapred.JobClient:     Map output bytes=54
11/04/04 09:19:15 INFO mapred.JobClient:     Combine input records=0
11/04/04 09:19:15 INFO mapred.JobClient:     Map output records=3
11/04/04 09:19:15 INFO mapred.JobClient:     SPLIT_RAW_BYTES=127
11/04/04 09:19:15 INFO mapred.JobClient:     Reduce input records=3
...


Page 49: Solving real world problems with Hadoop

Simple Query (Running)

$ hadoop fs -cat out/simple_query/part-r-00000

John	Mulligan
John	Smith
Royce	Rollins
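The deck only shows the mapper side of SimpleQuery, but the counters on the running slide (Reduce input groups=3, Reduce output records=3) point to a reduce step that writes each distinct projected row once. A hypothetical reducer with that shape, sitting next to SelectAndFilterMapper and reusing the TextArrayWritable sketch above, could be as simple as:

// Hypothetical reducer for SimpleQuery (not shown in the deck): the mapper emits the
// projected columns as the key and an empty value, so emitting each key once yields
// one output line per distinct (first_name, last_name) pair.
public static class DistinctRowsReducer
    extends Reducer<TextArrayWritable, Text, TextArrayWritable, Text> {

  private final Text blank = new Text("");

  public void reduce(TextArrayWritable key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    context.write(key, blank);
  }
}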


Page 50: Solving real world problems with Hadoop

Join Query

Query
SELECT first_name, last_name, movies.name name, movies.image
FROM people
JOIN movies ON (people.favorite_movie_id = movies.id)


Page 51: Solving real world problems with Hadoop

Join Query

Input

people
id   first_name   last_name   favorite_movie_id
1    John         Mulligan    3
2    Samir        Ahmed       5
3    Royce        Rollins     2
4    John         Smith       2

movies
id   name         image
2    The Matrix   http://bit.ly/matrix.jpg
3    Gatacca      http://bit.ly/g.jpg
4    AI           http://bit.ly/ai.jpg
5    Avatar       http://bit.ly/avatar.jpg


Page 52: Solving real world problems with Hadoop

Join Query

Output (people joined with movies on favorite_movie_id = movies.id)

first_name   last_name   name         image
John         Mulligan    Gatacca      http://bit.ly/g.jpg
Samir        Ahmed       Avatar       http://bit.ly/avatar.jpg
Royce        Rollins     The Matrix   http://bit.ly/matrix.jpg
John         Smith       The Matrix   http://bit.ly/matrix.jpg


Page 53: Solving real world problems with Hadoop

Join Query (Mapper)

public static class SelectAndFilterMapper
    extends Mapper<Object, Text, Text, TextArrayWritable> {
...
  public void map(Object key, Text value, Context context) throws IOException {

    String[] row = value.toString().split(DELIMITER);
    String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();

    try {
      if (fileName.startsWith("people")) {
        columns.set(new String[] {
            "people",
            row[PEOPLE_FIRST_NAME_COLUMN],
            row[PEOPLE_LAST_NAME_COLUMN]
        });
        joinKey.set(row[PEOPLE_FAVORITE_MOVIE_ID_COLUMN]);
      } else if (fileName.startsWith("movies")) {
        columns.set(new String[] {
            "movies",
            row[MOVIES_NAME_COLUMN],
            row[MOVIES_IMAGE_COLUMN]
        });
        joinKey.set(row[MOVIES_ID_COLUMN]);
      }

      context.write(joinKey, columns);

    } catch (InterruptedException ex) {
      throw new IOException(ex);
    }
...


Page 54: Solving real world problems with Hadoop

Join Query (Mapper)

Parse

Page 55: Solving real world problems with Hadoop

Join Query (Mapper)

Parse

Classify

Page 56: Solving real world problems with Hadoop

Join Query (Mapper)

Parse

Classify

Emit

Page 57: Solving real world problems with Hadoop

Join Query (Reducer)

public static class CombineMapsReducer
    extends Reducer<Text, TextArrayWritable, Text, TextArrayWritable> {
...
  public void reduce(Text key, Iterable<TextArrayWritable> values, Context context)
      throws IOException, InterruptedException {

    LinkedList<String[]> people = new LinkedList<String[]>();
    LinkedList<String[]> movies = new LinkedList<String[]>();

    for (TextArrayWritable val : values) {
      String dataset = val.getTextAt(0).toString();
      if (dataset.equals("people")) {
        people.add(new String[] {
            val.getTextAt(1).toString(),
            val.getTextAt(2).toString(),
        });
      }
      if (dataset.equals("movies")) {
        movies.add(new String[] {
            val.getTextAt(1).toString(),
            val.getTextAt(2).toString(),
        });
      }
    }

    for (String[] person : people) {
      for (String[] movie : movies) {
        columns.set(new String[] { person[0], person[1], movie[0], movie[1] });
        context.write(BLANK, columns);
      }
    }
...

Page 58: Solving real world problems with Hadoop

Join Query (Reducer)

Extract

Page 59: Solving real world problems with Hadoop

Join Query (Reducer)

people X movies

Extract

Page 60: Solving real world problems with Hadoop

Join Query (Reducer)

people X movies

SELECT: first_name, last_name, movies.name name, movies.image

Extract

Page 61: Solving real world problems with Hadoop

Join Query (Reducer)

Emit

people X movies

SELECT: first_name, last_name, movies.name name, movies.image

Extract
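The join job's driver is also not shown in the deck. A hypothetical main() for it, assuming people.tsv and movies.tsv sit in a single input directory (the mapper tells the two datasets apart by file name), might be wired like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver for the reduce-side join (not in the deck). The class and
// argument names are assumptions; SelectAndFilterMapper, CombineMapsReducer and
// TextArrayWritable are the classes from the preceding slides.
public class JoinQuery {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "JoinQuery");
    job.setJarByClass(JoinQuery.class);
    job.setMapperClass(SelectAndFilterMapper.class);
    job.setReducerClass(CombineMapsReducer.class);
    // No combiner here: the reducer builds a per-key cross product of the two
    // sides, which is not an associative aggregation like word counting.
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(TextArrayWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(TextArrayWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0])); // directory holding people.tsv and movies.tsv
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}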


Page 62: Solving real world problems with Hadoop

Hive


Page 63: Solving real world problems with Hadoop

What is Hive?

“Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools to enable easy data ETL, a mechanism to put structures on the data, and the capability to querying and analysis of large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with the MapReduce framework to be able to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.”


Page 64: Solving real world problems with Hadoop

Hive Features

SerDe

MetaStore

Query Processor

Compiler

Processor

Functions / UDFs, UDAFs, UDTFs
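To make the last item concrete: a minimal Hive UDF is just a Java class that extends UDF and exposes an evaluate() method. The example below is hypothetical (not from the deck); after registering it with a CREATE TEMPORARY FUNCTION statement, it can be called from QL like any built-in function.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hypothetical Hive UDF: pull the host part out of a referrer URL, e.g. for
//   SELECT to_domain(referrer), COUNT(1) FROM access_logs GROUP BY to_domain(referrer);
public class ToDomain extends UDF {
  private final Text result = new Text();

  public Text evaluate(Text url) {
    if (url == null) {
      return null;
    }
    String s = url.toString();
    int start = s.indexOf("://");
    start = (start >= 0) ? start + 3 : 0;          // skip the scheme if present
    int end = s.indexOf('/', start);
    result.set(end >= 0 ? s.substring(start, end) : s.substring(start));
    return result;
  }
}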


Page 65: Solving real world problems with Hadoop

Hive Demo


Page 67: Solving real world problems with Hadoop

Questions?


Page 68: Solving real world problems with Hadoop

Thanks
