Solving real world problems with Hadoop
Solving Real World Problems with Hadoop and
SQL -> Hadoop
Masahji Stewart <[email protected]>
Tuesday, April 5, 2011
Word Count

MapReduce is a framework for processing huge datasets on certain kinds of distributable problems using a large number of computers (nodes), collectively referred to as a cluster...
Input
Output

as            1
certain       1
collectively  1
datasets      1
framework     1
huge          1
number        1
on            1
referred      1
to            1

MapReduce     1
cluster       1
computers     1
distributable 1
for           1
kinds         1
of            2
problems      1

(nodes),      1
a             3
is            1
large         1
processing    1
using         1
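The counts above can be reproduced without a cluster. The sketch below (plain Java, no Hadoop dependencies; the class name is hypothetical) simulates the two phases on the sample sentence: the map phase emits a (token, 1) pair per word, and the reduce phase groups by key and sums.

```java
import java.util.*;

public class WordCountSketch {
    // map phase: emit one (token, 1) pair per whitespace-separated token
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.split("\\s+")) {
            if (!token.isEmpty()) pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
        }
        return pairs;
    }

    // shuffle/sort + reduce phase: group pairs by key and sum the values
    static SortedMap<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String input = "MapReduce is a framework for processing huge datasets "
            + "on certain kinds of distributable problems using a large number "
            + "of computers (nodes), collectively referred to as a cluster...";
        SortedMap<String, Integer> counts = reduce(map(input));
        System.out.println(counts.get("a"));  // prints 3
        System.out.println(counts.get("of")); // prints 2
    }
}
```

Note the sentence contains 27 tokens, matching the "Map output records=27" counter in the job log later in the deck.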
Word Count (Mapper)
public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
Extract: word = "MapReduce", word = "is", word = "a", ...
Emit: ("MapReduce", 1), ("is", 1), ("a", 1), ...
Word Count (Reducer)

public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
Sum: key = "of", sum = 2
Emit: ("of", 2)
Word Count (Running)

$ hadoop jar ./.versions/0.20/hadoop-0.20-examples.jar wordcount \
    -D mapred.reduce.tasks=3 \
    input_file out

11/04/03 21:21:27 INFO mapred.JobClient: Default number of map tasks: 2
11/04/03 21:21:27 INFO mapred.JobClient: Default number of reduce tasks: 3
11/04/03 21:21:28 INFO input.FileInputFormat: Total input paths to process : 1
11/04/03 21:21:29 INFO mapred.JobClient: Running job: job_201103252110_0659
11/04/03 21:21:30 INFO mapred.JobClient: map 0% reduce 0%
11/04/03 21:21:37 INFO mapred.JobClient: map 100% reduce 0%
11/04/03 21:21:49 INFO mapred.JobClient: map 100% reduce 33%
11/04/03 21:21:52 INFO mapred.JobClient: map 100% reduce 66%
11/04/03 21:22:05 INFO mapred.JobClient: map 100% reduce 100%
11/04/03 21:22:08 INFO mapred.JobClient: Job complete: job_201103252110_0659
11/04/03 21:22:08 INFO mapred.JobClient: Counters: 17
...
11/04/03 21:22:08 INFO mapred.JobClient: Map output bytes=286
11/04/03 21:22:08 INFO mapred.JobClient: Combine input records=27
11/04/03 21:22:08 INFO mapred.JobClient: Map output records=27
11/04/03 21:22:08 INFO mapred.JobClient: Reduce input records=24
Word Count (Output)

hadoop@ip-10-245-210-191:~$ hadoop fs -ls out
Found 3 items
-rw-r--r--  2 hadoop supergroup  90 2011-04-03 21:21 /user/hadoop/out/part-r-00000
-rw-r--r--  2 hadoop supergroup  80 2011-04-03 21:21 /user/hadoop/out/part-r-00001
-rw-r--r--  2 hadoop supergroup  49 2011-04-03 21:21 /user/hadoop/out/part-r-00002

hadoop@ip-10-245-210-191:~$ hadoop fs -cat out/part-r-00000
as            1
certain       1
collectively  1
datasets      1
framework     1
...

hadoop@ip-10-245-210-191:~$ hadoop fs -cat out/part-r-00001
MapReduce     1
cluster       1
computers     1
distributable 1
for           1
...

hadoop@ip-10-245-210-191:~$ hadoop fs -cat out/part-r-00002
(nodes),      1
a             3
is            1
large         1
processing    1
using         1

A file per reducer
Word Count

[Diagram: the input sentence "MapReduce is a framework for processing huge datasets on certain kinds of distributable problems using a large number of computers (nodes), collectively referred to as a cluster..." flows through Input → Split → Map → Shuffle/Sort → Reduce → Output; five MAP tasks feed three REDUCE tasks, producing the three count partitions (as 1 ... to 1; MapReduce 1 ... problems 1; (nodes), 1 ... using 1).]
Log Processing (Date IP COUNT)
67.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
189.186.9.181 - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-"
90.221.175.16 - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "http://www.wearelistening.org" "Mozilla/5.0"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "http://www.fuel.tv/Gear" "Mozilla/4.0"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
66.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0"
90.221.75.196 - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "http://www.fuel.tv/music" "Mozilla/5.0"
...
Input
Output

18/Jul/2010  189.186.9.181  1
18/Jul/2010  201.201.16.82  3
18/Jul/2010  66.195.114.59  1
18/Jul/2010  67.195.114.59  1
18/Jul/2010  90.221.175.16  1
19/Jul/2010  90.221.75.196  1
...
Log Processing (Mapper)

public static final Pattern LOG_PATTERN = Pattern.compile(
    "^([\\d.]+) (\\S+) (\\S+) \\[(([\\w/]+):([\\d:]+)\\s[+\\-]\\d{4})\\] " +
    "\"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\"");

public static class ExtractDateAndIpMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text ip = new Text();

  public void map(Object key, Text value, Context context) throws IOException {
    String text = value.toString();
    Matcher matcher = LOG_PATTERN.matcher(text);
    while (matcher.find()) {
      try {
        ip.set(matcher.group(5) + "\t" + matcher.group(1));
        context.write(ip, one);
      } catch (InterruptedException ex) {
        throw new IOException(ex);
      }
    }
  }
}
Extract: ip = "189.186.9.181", ip = "201.201.16.82", ip = "66.249.67.57", ...
Emit: ("18/Jul/2010\t189.186.9.181", 1), ...
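The regex groups can be checked without a cluster. This standalone sketch (hypothetical class name) applies the slide's LOG_PATTERN to one sample line and builds the same "date\tIP" composite key the mapper emits: group 1 is the client IP and group 5 is the date portion of the timestamp.

```java
import java.util.regex.*;

public class LogKeySketch {
    static final Pattern LOG_PATTERN = Pattern.compile(
        "^([\\d.]+) (\\S+) (\\S+) \\[(([\\w/]+):([\\d:]+)\\s[+\\-]\\d{4})\\] "
        + "\"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\"");

    // builds the "date\tip" map key, or returns null if the line doesn't parse
    static String extractKey(String logLine) {
        Matcher m = LOG_PATTERN.matcher(logLine);
        return m.find() ? m.group(5) + "\t" + m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "67.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "
            + "\"GET /friends HTTP/1.0\" 200 9894 \"-\" \"Mozilla/5.0\"";
        System.out.println(extractKey(line)); // prints "18/Jul/2010\t67.195.114.59"
    }
}
```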
Log Processing (main)

public class LogAggregator {
  ...
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: LogAggregator <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "LogAggregator");
    job.setJarByClass(LogAggregator.class);
    job.setMapperClass(ExtractDateAndIpMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Mapper
Reducer
Input/Output Settings
Run it!
Log Processing (Running)

$ hadoop jar target/hadoop-recipes-1.0.jar com.synctree.hadoop.recipes.LogAggregator \
    -libjars hadoop-examples.jar data/access.log log_results

11/04/04 00:51:30 INFO jvm.JvmMetrics: Initializing JVM Metrics with
11/04/04 00:51:30 INFO input.FileInputFormat: Total input paths to process : 1
11/04/04 00:51:31 INFO filecache.TrackerDistributedCacheManager: Creating hadoop-examples.jar in /tmp/hadoop-masahji/mapred/local/archive/-8850340642758714312_382885124_516658918/file/Users/masahji/Development/hadoop-recipes-work--8125788655475885988 with rwxr-xr-x
11/04/04 00:51:31 INFO filecache.TrackerDistributedCacheManager: Cached file:///Users/masahji/Development/hadoop-recipes/hadoop-examples.jar as /tmp/hadoop-masahji/mapred/local/archive/-8850340642758714312_382885124_516658918/file/Users/masahji/Development/hadoop-recipes/hadoop-examples.jar
11/04/04 00:51:32 INFO mapred.JobClient: map 100% reduce 100%
JAR placed into Distributed Cache
Log Processing (Output)
$ hadoop fs -ls log_results
Found 2 items
-rwxrwxrwx  1 masahji staff    0 2011-04-04 00:51 log_results/_SUCCESS
-rwxrwxrwx  1 masahji staff  168 2011-04-04 00:51 log_results/part-r-00000

$ hadoop fs -cat log_results/part-r-00000
18/Jul/2010  189.186.9.181  1
18/Jul/2010  201.201.16.82  3
18/Jul/2010  66.195.114.59  1
18/Jul/2010  67.195.114.59  1
18/Jul/2010  90.221.175.16  1
19/Jul/2010  90.221.75.196  1
...
Hadoop Streaming
[Diagram: the Task Tracker forks the Mapper/Reducer as a child process running the user's script; input records are piped to the script's STDIN, and its STDOUT is read back as key/value output.]
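The contract in the diagram is just lines on STDIN and tab-separated key/value lines on STDOUT, so a streaming mapper can be written in any language, including plain Java. A minimal sketch of a word-count mapper in that style (hypothetical class name, not from the talk):

```java
import java.io.*;

public class StreamingWordCountMapper {
    // reads input lines and emits one "word\t1" line per token,
    // exactly what the streaming framework expects on STDOUT
    static void run(BufferedReader in, PrintWriter out) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) out.println(token + "\t1");
            }
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        run(new BufferedReader(new InputStreamReader(System.in)),
            new PrintWriter(System.out));
    }
}
```

The streaming framework then sorts these lines by key before handing them to the reducer process, just as it does for the Java API.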
Basic grep

...
搜索 搜索 [sou1 suo3] /to search/.../internet search/database search/
吉日 吉日 [ji2 ri4] /propitious day/lucky day/
吉祥 吉祥 [ji2 xiang2] /lucky/auspicious/propitious/
咄咄 咄咄 [duo1 duo1] /to cluck one's tongue/tut-tut/
喜鵲 喜鹊 [xi3 que4] /black-billed magpie, legendary bringer of good luck/
...
Input
Output

...
匯出 汇出 [hui4 chu1] /to export data (e.g. from a database)/
搜索 搜索 [sou1 suo3] /to search/.../internet search/database search/
數據庫 数据库 [shu4 ju4 ku4] /database/
數據庫軟件 数据库软件 [shu4 ju4 ku4 ruan3 jian4] /database software/
資料庫 资料库 [zi1 liao4 ku4] /database/
...
Basic grep

$ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input data/cedict.txt.gz \
    -output streaming/grep_database_mandarin \
    -mapper 'grep database' \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer
...
11/04/04 05:27:58 INFO streaming.StreamJob: map 100% reduce 100%
11/04/04 05:27:58 INFO streaming.StreamJob: Job complete: job_local_0001
11/04/04 05:27:58 INFO streaming.StreamJob: Output: streaming/grep_database_mandarin

-mapper and -reducer accept scripts or Java classes.
$ hadoop fs -cat streaming/grep_database_mandarin/part-00000

匯出 汇出 [hui4 chu1] /to remit (money)//to export data (e.g. from a database)/
搜索 搜索 [sou1 suo3] /to search/to look for sth/internet search/database search/
數據庫 数据库 [shu4 ju4 ku4] /database/
數據庫軟件 数据库软件 [shu4 ju4 ku4 ruan3 jian4] /database software/
資料庫 资料库 [zi1 liao4 ku4] /database/
Ruby Example (ignore IP list)

Input

67.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
192.168.10.4 - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 96 "-" "Mozilla/4.0"
189.186.9.181 - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-"
90.221.175.16 - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "http://www.wearelistening.org" "Mozilla/5.0"
10.1.10.12 - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 51 "-" "Mozilla/5.0"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "http://www.fuel.tv/Gear" "Mozilla/4.0"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
10.1.10.4 - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 94 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
10.1.10.14 - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 24 "-" "Mozilla/4.0"
66.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0"
90.221.75.196 - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "http://www.fuel.tv/music" "Mozilla/5.0"
...

Output

189.186.9.181 - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "http://www.fuel.tv/Gear" "Mozilla/4.0"
66.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0"
67.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
90.221.175.16 - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "http://www.wearelistening.org" "Mozilla/5.0"
90.221.75.196 - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "http://www.fuel.tv/music" "Mozilla/5.0"
...
Ruby Example (ignore IP list)

#!/usr/bin/env ruby

ignore = %w(127.0.0.1 192.168 10)
log_regex = /^([\d.]+)\s/

while (line = STDIN.gets)
  next unless line =~ log_regex
  ip = $1

  print line if ignore.reject { |ignore_ip| ip !~ /^#{ignore_ip}(\.|$)/ }.empty?
end

Read STDIN / Write STDOUT
Ruby Example (ignore IP list)

$ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input data/access.log \
    -output out/streaming/filter_ips \
    -mapper './script/filter_ips' \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer

11/04/04 07:08:08 INFO jvm.JvmMetrics: Initializing JVM Metrics with
11/04/04 07:08:08 WARN mapred.JobClient: No job jar file set. User classes may not
11/04/04 07:08:08 INFO mapred.FileInputFormat: Total input paths to process : 1
11/04/04 07:08:09 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-masahji/
11/04/04 07:08:09 INFO streaming.StreamJob: Running job: job_local_0001
11/04/04 07:08:09 INFO streaming.StreamJob: Job running in-process (local Hadoop)
...
$ hadoop fs -cat out/streaming/filter_ips/part-00000
...
189.186.9.181 - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "http://www.fuel.tv/Gear" "Mozilla/4.0"
66.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0"
67.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
90.221.175.16 - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "http://www.wearelistening.org" "Mozilla/5.0"
90.221.75.196 - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "http://www.fuel.tv/music" "Mozilla/5.0"
SQL -> Hadoop
Simple Query

Query

SELECT first_name, last_name
FROM people
WHERE first_name = 'John' OR favorite_movie_id = 2
Input

id  first_name  last_name  favorite_movie_id
1   John        Mulligan   3
2   Samir       Ahmed      5
3   Royce       Rollins    2
4   John        Smith      2
Output

first_name  last_name
John        Mulligan
John        Smith
Royce       Rollins
Simple Query (Mapper)

public class SimpleQuery {
  ...
  public static class SelectAndFilterMapper
      extends Mapper<Object, Text, TextArrayWritable, Text> {
    ...
    public void map(Object key, Text value, Context context) throws IOException {
      String[] row = value.toString().split(DELIMITER);
      try {
        if (row[FIRST_NAME_COLUMN].equals("John") ||
            row[FAVORITE_MOVIE_ID_COLUMN].equals("2")) {
          columns.set(new String[] {
            row[FIRST_NAME_COLUMN],
            row[LAST_NAME_COLUMN]
          });
          context.write(columns, blank);
        }
      } catch (InterruptedException ex) {
        throw new IOException(ex);
      }
    }
  }
  ...
}
Extract
WHERE: first_name = 'John' OR favorite_movie_id = 2
SELECT: first_name, last_name
Emit
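Stripped of the Hadoop plumbing, this mapper is a per-row WHERE filter plus a SELECT projection. The self-contained sketch below (hypothetical class and column names, tab-delimited rows) shows the same logic in plain Java.

```java
import java.util.*;

public class SimpleQuerySketch {
    static final int FIRST_NAME = 1, LAST_NAME = 2, FAVORITE_MOVIE_ID = 3;

    // applies the WHERE clause, then projects the SELECT columns for one row
    static Optional<String[]> selectAndFilter(String tsvRow) {
        String[] row = tsvRow.split("\t");
        if (row[FIRST_NAME].equals("John") || row[FAVORITE_MOVIE_ID].equals("2")) {
            return Optional.of(new String[] { row[FIRST_NAME], row[LAST_NAME] });
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String[] people = {
            "1\tJohn\tMulligan\t3",
            "2\tSamir\tAhmed\t5",
            "3\tRoyce\tRollins\t2",
            "4\tJohn\tSmith\t2",
        };
        // prints John/Mulligan, Royce/Rollins, and John/Smith; Samir is filtered out
        for (String row : people) {
            selectAndFilter(row).ifPresent(
                cols -> System.out.println(String.join("\t", cols)));
        }
    }
}
```

Writing the projected columns as the map key (with a blank value, as the slide does) means the shuffle phase sorts and deduplicates the result rows for free.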
Simple Query (Running)

$ hadoop jar target/hadoop-recipes-1.0.jar com.synctree.hadoop.recipes.SimpleQuery \
    data/people.tsv out/simple_query

...
11/04/04 09:19:15 INFO mapred.JobClient: map 100% reduce 100%
11/04/04 09:19:15 INFO mapred.JobClient: Job complete: job_local_0001
11/04/04 09:19:15 INFO mapred.JobClient: Counters: 13
11/04/04 09:19:15 INFO mapred.JobClient:   FileSystemCounters
11/04/04 09:19:15 INFO mapred.JobClient:     FILE_BYTES_READ=306296
11/04/04 09:19:15 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=398676
11/04/04 09:19:15 INFO mapred.JobClient:   Map-Reduce Framework
11/04/04 09:19:15 INFO mapred.JobClient:     Reduce input groups=3
11/04/04 09:19:15 INFO mapred.JobClient:     Combine output records=0
11/04/04 09:19:15 INFO mapred.JobClient:     Map input records=4
11/04/04 09:19:15 INFO mapred.JobClient:     Reduce shuffle bytes=0
11/04/04 09:19:15 INFO mapred.JobClient:     Reduce output records=3
11/04/04 09:19:15 INFO mapred.JobClient:     Spilled Records=6
11/04/04 09:19:15 INFO mapred.JobClient:     Map output bytes=54
11/04/04 09:19:15 INFO mapred.JobClient:     Combine input records=0
11/04/04 09:19:15 INFO mapred.JobClient:     Map output records=3
11/04/04 09:19:15 INFO mapred.JobClient:     SPLIT_RAW_BYTES=127
11/04/04 09:19:15 INFO mapred.JobClient:     Reduce input records=3
...
Simple Query (Running)
$ hadoop fs -cat out/simple_query/part-r-00000
John	Mulligan
John	Smith
Royce	Rollins
Join Query

SELECT first_name, last_name, movies.name name, movies.image
FROM people
JOIN movies ON (people.favorite_movie_id = movies.id)
Join Query

Input

people
id  first_name  last_name  favorite_movie_id
1   John        Mulligan   3
2   Samir       Ahmed      5
3   Royce       Rollins    2
4   John        Smith      2

movies
id  name        image
2   The Matrix  http://bit.ly/matrix.jpg
3   Gatacca     http://bit.ly/g.jpg
4   AI          http://bit.ly/ai.jpg
5   Avatar      http://bit.ly/avatar.jpg

Output

first_name  last_name  name        image
Royce       Rollins    The Matrix  http://bit.ly/matrix.jpg
John        Smith      The Matrix  http://bit.ly/matrix.jpg
John        Mulligan   Gatacca     http://bit.ly/g.jpg
Samir       Ahmed      Avatar      http://bit.ly/avatar.jpg
Join Query (Mapper)

public static class SelectAndFilterMapper
    extends Mapper<Object, Text, Text, TextArrayWritable> {
  ...
  public void map(Object key, Text value, Context context) throws IOException {
    String[] row = value.toString().split(DELIMITER);
    String fileName =
        ((FileSplit) context.getInputSplit()).getPath().getName();
    try {
      if (fileName.startsWith("people")) {
        columns.set(new String[] {
            "people",
            row[PEOPLE_FIRST_NAME_COLUMN],
            row[PEOPLE_LAST_NAME_COLUMN]
        });
        joinKey.set(row[PEOPLE_FAVORITE_MOVIE_ID_COLUMN]);
      } else if (fileName.startsWith("movies")) {
        columns.set(new String[] {
            "movies",
            row[MOVIES_NAME_COLUMN],
            row[MOVIES_IMAGE_COLUMN]
        });
        joinKey.set(row[MOVIES_ID_COLUMN]);
      }
      context.write(joinKey, columns);
    } catch (InterruptedException ex) {
      throw new IOException(ex);
    }
  ...
Parse
Classify
Emit
Join Query (Reducer)

public static class CombineMapsReducer
    extends Reducer<Text, TextArrayWritable, Text, TextArrayWritable> {
  ...
  public void reduce(Text key, Iterable<TextArrayWritable> values, Context context)
      throws IOException, InterruptedException {
    LinkedList<String[]> people = new LinkedList<String[]>();
    LinkedList<String[]> movies = new LinkedList<String[]>();
    for (TextArrayWritable val : values) {
      String dataset = val.getTextAt(0).toString();
      if (dataset.equals("people")) {
        people.add(new String[] {
            val.getTextAt(1).toString(), val.getTextAt(2).toString()
        });
      }
      if (dataset.equals("movies")) {
        movies.add(new String[] {
            val.getTextAt(1).toString(), val.getTextAt(2).toString()
        });
      }
    }
    for (String[] person : people) {
      for (String[] movie : movies) {
        columns.set(new String[] { person[0], person[1], movie[0], movie[1] });
        context.write(BLANK, columns);
      }
    }
  ...
Emit
people X movies
SELECT first_name, last_name, movies.name name, movies.image
SELECT
Extract
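Stripped of the Hadoop types, the reducer is a bucketing pass followed by a cross product of the two buckets. The plain-Java sketch below models one reduce() call; the `joinGroup` helper is hypothetical, and each TextArrayWritable value is modeled as a String array whose first element is the dataset tag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class JoinSketch {
    // Mirrors one reduce() call: values sharing a join key arrive tagged with
    // their dataset name in element 0; bucket them, then emit the cross product.
    static List<String[]> joinGroup(List<String[]> taggedValues) {
        List<String[]> people = new ArrayList<String[]>();
        List<String[]> movies = new ArrayList<String[]>();
        for (String[] val : taggedValues) {
            if (val[0].equals("people")) {
                people.add(new String[] { val[1], val[2] });
            } else if (val[0].equals("movies")) {
                movies.add(new String[] { val[1], val[2] });
            }
        }
        List<String[]> joined = new ArrayList<String[]>();
        for (String[] person : people) {
            for (String[] movie : movies) {
                joined.add(new String[] { person[0], person[1], movie[0], movie[1] });
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Everything that would arrive at the reducer for join key "2".
        List<String[]> group = Arrays.asList(
            new String[] { "movies", "The Matrix", "http://bit.ly/matrix.jpg" },
            new String[] { "people", "Royce", "Rollins" },
            new String[] { "people", "John", "Smith" });
        for (String[] row : joinGroup(group)) {
            System.out.println(String.join("\t", row));
        }
    }
}
```

Because both people rows with favorite_movie_id = 2 meet The Matrix under the same key, the cross product emits two joined rows; when exactly one movie row exists per key, this degenerates to a plain inner join.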
Hive
What is Hive?

“Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools to enable easy data ETL, a mechanism to put structures on the data, and the capability to query and analyze large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.”
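The hand-written join job earlier is exactly the kind of plan Hive generates for you. Assuming the people and movies data sets have been registered as tables in the metastore with the columns shown earlier, the whole MapReduce join collapses to one QL statement (a sketch, not taken from the deck's demo):

```sql
SELECT p.first_name, p.last_name, m.name, m.image
FROM people p
JOIN movies m ON (p.favorite_movie_id = m.id);
```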
Hive Features
SerDe
MetaStore
Query Processor
Compiler
Processor
Functions / UDFs, UDAFs, UDTFs
Hive Demo
Links
http://hadoop.apache.org/
https://github.com/synctree/hadoop-recipes
http://hadoop.apache.org/common/docs/r0.20.2/streaming.html
http://developer.yahoo.com/blogs/hadoop/
http://wiki.apache.org/hadoop/Hive
Questions?
Thanks