Hadoop 101 - Kansas City Big Data Summit 2014
Page 1: Hadoop 101
Page 2: Scott Kahler (Twitter: boogabee), http://simpit.com, Community Engineer, Greenplum Database
Page 3: (image only)
Page 4: xkcd.com
Page 5: Primary Hadoop Use Case
Page 6: Data Lake, Active Archive, Staging Area
Page 7: DATA
Page 8: 2002
Page 9: Doug Cutting, Mike Cafarella: 2002, 2003, 2004
Page 10: Doug Cutting, 2006
Page 11: Apache Hadoop. The project includes these modules:
● Hadoop Common
● Hadoop Distributed File System (HDFS™)
● Hadoop MapReduce
● Hadoop YARN
Page 12: Apache Hadoop Ecosystem (http://hadoopecosystemtable.github.io/)
Distributed Filesystem: Red Hat GlusterFS, Quantcast File System (QFS), Ceph Filesystem, Lustre file system, Tachyon, GridGain
Distributed Programming: Apache Pig, JAQL, Apache Spark, Apache Flink, Netflix PigPen, AMPLab SIMR, Facebook Corona, Apache Twill, Damballa Parkour, Apache Hama, Datasalt Pangool, Apache Tez, Apache DataFu, Pydoop, Kangaroo
NoSQL Databases:
● Column Data Model: Apache HBase, Apache Cassandra, Hypertable, Apache Accumulo
● Document Data Model: MongoDB, RethinkDB, ArangoDB
● Stream Data Model: EventStore
● Key-Value Data Model: Redis, LinkedIn Voldemort, RocksDB, OpenTSDB
● Graph Data Model: ArangoDB, Neo4j
NewSQL Databases: TokuDB, HandlerSocket, Akiban Server, Drizzle, Haeinsa, SenseiDB, Sky, BayesDB, InfluxDB
SQL-on-Hadoop: Apache Hive, Apache HCatalog, AMPLab Shark, Apache Drill, Cloudera Impala, Facebook Presto, Datasalt Splout SQL, Apache Tajo, Apache Phoenix, Apache MRQL
Data Ingestion: Apache Flume, Apache Sqoop, Facebook Scribe, Apache Chukwa, Apache Storm, Apache Kafka, Netflix Suro, Apache Samza, Cloudera Morphlines, HIHO
Service Programming: Apache Thrift, Apache ZooKeeper, Apache Avro, Apache Curator, Apache Karaf, Twitter Elephant Bird, LinkedIn Norbert
Scheduling: Apache Oozie, LinkedIn Azkaban, Apache Falcon
Machine Learning: Apache Mahout, WEKA, Cloudera Oryx, MADlib, H2O, Sparkling Water
Benchmarking: Apache Hadoop Benchmarking, Yahoo GridMix3, PUMA Benchmarking, Berkeley SWIM Benchmark, Intel HiBench
Security: Apache Sentry, Apache Knox Gateway, Apache Ranger
System Deployment: Apache Ambari, Cloudera Hue, Apache Whirr, Apache Mesos, Myriad, Marathon, Brooklyn, Hortonworks HOYA, Apache Helix, Apache Bigtop, Buildoop, Deploop
Applications: Apache Nutch, Sphinx Search Server, Apache OODT, HIPI Library, PivotalR
Development Frameworks: Spring XD
Pending categorization: Twitter Summingbird, Apache Kiji, Yahoo S4, Metamarkets Druid, Concurrent Cascading, Concurrent Lingual, Concurrent Pattern, Apache Giraph, Talend, Akka Toolkit, Eclipse BIRT, SpagoBI, Jedox Palo, Twitter Finagle, Intel GraphBuilder, Apache Tika
Page 13: Apache Bigtop
Page 14: Hadoop Distributed File System (HDFS™): a distributed file system that provides high-throughput access to application data.
Page 15: hdfs dfs -copyFromLocal File.txt hdfs://nn.hadoopcluster.local/user/hadoop/
Page 16: Diagram: Name Node; Data Nodes 1-6; Client holding File.txt, split into blocks A, B, C. Client to Name Node: "I have File.txt and I want to write block A of it."
Page 17: Name Node to Client: "Write that to Data Nodes 2, 5 and 6."
Page 18: Client: "Setting up a pipeline to Nodes 2, 5, 6."
Page 19: Client: "Pushing block down the pipeline." Copies of block A appear on Data Nodes 2, 5 and 6 as the block moves node to node.
Page 20: Client: "It worked!" Data Nodes: "Got a block." Block A now lives on Data Nodes 2, 5 and 6.
Page 21: Repeat until all blocks are in the system.
Page 22: Final state. Name Node metadata for File.txt: A: 2,5,6; B: 1,3,4; C: 6,2,4. Each block is stored on three Data Nodes.
Page 23: hdfs dfs -copyToLocal hdfs://nn.hadoopcluster.local/user/hadoop/File.txt File2.txt
Page 24: Client to Name Node: "I want File.txt."
Page 25: Name Node to Client: "File.txt is A: 2,5,6; B: 1,3,4; C: 6,2,4."
Page 26: Client fetches blocks A, B and C from the listed Data Nodes and reassembles File.txt.
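The write and read flows on pages 16-26 can be sketched as a toy model. Everything here (the `NameNode` class, `allocate`, `locate`, the least-loaded placement policy) is illustrative, not the real HDFS API or its rack-aware placement logic:

```python
# Toy model of the HDFS write/read flow: the Name Node tracks only
# metadata (which blocks live on which Data Nodes); clients move the data.
REPLICATION = 3

class NameNode:
    def __init__(self, datanodes):
        self.datanodes = datanodes   # {node_id: set of block_ids held}
        self.block_map = {}          # {filename: {block_id: [node_ids]}}

    def allocate(self, filename, block_id):
        """Pick REPLICATION targets for a new block (least-loaded first)."""
        targets = sorted(self.datanodes,
                         key=lambda n: len(self.datanodes[n]))[:REPLICATION]
        self.block_map.setdefault(filename, {})[block_id] = targets
        return targets

    def locate(self, filename):
        """Answer a read request: where does each block of the file live?"""
        return self.block_map[filename]

def write_file(nn, filename, blocks):
    for block in blocks:
        # Client asks the Name Node for targets, then pipelines the
        # block to each of them in turn.
        for node in nn.allocate(filename, block):
            nn.datanodes[node].add(block)

nn = NameNode({n: set() for n in range(1, 7)})
write_file(nn, "File.txt", ["A", "B", "C"])
locations = nn.locate("File.txt")
print(locations)  # each of A, B, C mapped to 3 data nodes
```

The exact node choices differ from the slides (the toy policy is load-based), but the shape of the metadata matches the "A: 2,5,6; B: 1,3,4; C: 6,2,4" table on page 22.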
Page 27: System Health
Page 28: Each Data Node periodically sends a Block Report to the Name Node.
Page 29: Name Node: "No heartbeat from Node 4."
Page 30: Name Node: "Copies on Node 4 must be gone."
Page 31: Data Node 4 drops out of the cluster, leaving B: 1,3 and C: 6,2. Name Node: "Need to get B & C back up to 3 copies."
Page 32: The surviving nodes re-replicate: B is copied to Node 5 and C to Node 1, so the metadata reads A: 2,5,6; B: 1,3,5; C: 6,2,1. Name Node: "Okay, all good now."
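The failure-handling loop on pages 29-32 reduces to a simple invariant check. A toy sketch (illustrative names; the real Name Node's replication monitor is far more involved):

```python
# Toy re-replication: when a Data Node stops heartbeating, any block
# that drops below the target replica count gets copied to a live node.
REPLICATION = 3

def handle_dead_node(block_map, live_nodes, dead_node):
    """block_map: {block_id: [node_ids]}, mutated in place."""
    for block, nodes in block_map.items():
        if dead_node in nodes:
            nodes.remove(dead_node)
        while len(nodes) < REPLICATION:
            # Copy the block to the first live node that doesn't hold it yet.
            target = next(n for n in live_nodes if n not in nodes)
            nodes.append(target)

block_map = {"A": [2, 5, 6], "B": [1, 3, 4], "C": [6, 2, 4]}
live_nodes = [1, 2, 3, 5, 6]   # Node 4 missed its heartbeats
handle_dead_node(block_map, live_nodes, dead_node=4)
print(block_map)
```

Blocks that still have three replicas (A) are untouched; B and C are brought back up to three copies, exactly the repair shown on page 32.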
Page 33: Data Node 4 comes back online and rejoins the cluster.
Page 34: (image only)
Page 35: Map → Shuffle/Sort → Reduce
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
Page 36: Input text:
Peter Piper picked a peck of pickled peppers;
A peck of pickled peppers Peter Piper picked;
If Peter Piper picked a peck of pickled peppers,
Where's the peck of pickled peppers Peter Piper picked?
Page 37: Map phase. Each mapper emits (word → 1) for every word of its input: peter → 1, piper → 1, picked → 1, a → 1, peck → 1, of → 1, pickled → 1, peppers → 1 for the first line, and likewise for the other three (if, where's and the each appear once).
Page 38: Shuffle/Sort groups the map output by key:
a → 1,1,1
if → 1
of → 1,1,1,1
peck → 1,1,1,1
peppers → 1,1,1,1
peter → 1,1,1,1
picked → 1,1,1,1
pickled → 1,1,1,1
piper → 1,1,1,1
the → 1
where's → 1
Page 39: Reduce sums each group:
a → 3
if → 1
of → 4
peck → 4
peppers → 4
peter → 4
picked → 4
pickled → 4
piper → 4
the → 1
where's → 1
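The whole walkthrough fits in a few lines of Python, run locally as a simulation of the three phases (punctuation dropped for simplicity; this is not Hadoop code):

```python
from collections import defaultdict

lines = [
    "Peter Piper picked a peck of pickled peppers",
    "A peck of pickled peppers Peter Piper picked",
    "If Peter Piper picked a peck of pickled peppers",
    "Where's the peck of pickled peppers Peter Piper picked",
]

# Map: emit (word, 1) for every word.
mapped = [(word.lower(), 1) for line in lines for word in line.split()]

# Shuffle/Sort: group the values by key.
groups = defaultdict(list)
for key, value in sorted(mapped):
    groups[key].append(value)

# Reduce: sum each group.
counts = {key: sum(values) for key, values in groups.items()}
print(counts["peppers"], counts["a"])  # 4 3
```

The resulting counts match the reduce output on page 39: peck, peppers, peter, picked, pickled, piper and of at 4, a at 3, the rest at 1.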
Page 40:
$HADOOP_HOME/bin/hadoop jar wc.jar WordCount \
  /user/hadoop/wordcount/input /user/hadoop/wordcount/output

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -input myInputDirs \
  -output myOutputDir \
  -mapper myPythonScript.py \
  -reducer /bin/wc \
  -file myPythonScript.py
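The `-mapper myPythonScript.py` slot above takes any executable that reads lines on stdin and writes tab-separated key/value pairs on stdout; `myPythonScript.py` is just the placeholder name from the slide. One possible word-count mapper body, written so it can be exercised with an in-memory stream:

```python
import io
import sys

def run_mapper(stdin, stdout):
    """Hadoop Streaming mapper body: one 'word<TAB>1' line per input word."""
    for line in stdin:
        for word in line.split():
            stdout.write(word + "\t1\n")

# Local check with an in-memory stream. On the cluster the script would
# end with: run_mapper(sys.stdin, sys.stdout)
out = io.StringIO()
run_mapper(io.StringIO("Peter Piper picked\n"), out)
print(out.getvalue())
```

Streaming then shuffles on the tab-separated key exactly as it would for a Java mapper, which is why `/bin/wc` can serve as the reducer in the command above.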
Page 41:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.StringUtils;

public class WordCount2 {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    static enum CountersEnum { INPUT_WORDS }

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    private boolean caseSensitive;
    private Set<String> patternsToSkip = new HashSet<String>();

    private Configuration conf;
    private BufferedReader fis;

    @Override
    public void setup(Context context) throws IOException,
        InterruptedException {
      conf = context.getConfiguration();
      caseSensitive = conf.getBoolean("wordcount.case.sensitive", true);
      if (conf.getBoolean("wordcount.skip.patterns", false)) {
        URI[] patternsURIs = Job.getInstance(conf).getCacheFiles();
        for (URI patternsURI : patternsURIs) {
          Path patternsPath = new Path(patternsURI.getPath());
          String patternsFileName = patternsPath.getName().toString();
          parseSkipFile(patternsFileName);
        }
      }
    }

    private void parseSkipFile(String fileName) {
      try {
        fis = new BufferedReader(new FileReader(fileName));
        String pattern = null;
        while ((pattern = fis.readLine()) != null) {
          patternsToSkip.add(pattern);
        }
      } catch (IOException ioe) {
        System.err.println("Caught exception while parsing the cached file '"
            + fileName + "': " + StringUtils.stringifyException(ioe));
      }
    }

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = (caseSensitive) ?
          value.toString() : value.toString().toLowerCase();
      for (String pattern : patternsToSkip) {
        line = line.replaceAll(pattern, "");
      }
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
        Counter counter = context.getCounter(CountersEnum.class.getName(),
            CountersEnum.INPUT_WORDS.toString());
        counter.increment(1);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    GenericOptionsParser optionParser = new GenericOptionsParser(conf, args);
    String[] remainingArgs = optionParser.getRemainingArgs();
    if (remainingArgs.length != 2 && remainingArgs.length != 4) {
      System.err.println("Usage: wordcount <in> <out> [-skip skipPatternFile]");
      System.exit(2);
    }
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount2.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    List<String> otherArgs = new ArrayList<String>();
    for (int i = 0; i < remainingArgs.length; ++i) {
      if ("-skip".equals(remainingArgs[i])) {
        job.addCacheFile(new Path(remainingArgs[++i]).toUri());
        job.getConfiguration().setBoolean("wordcount.skip.patterns", true);
      } else {
        otherArgs.add(remainingArgs[i]);
      }
    }
    FileInputFormat.addInputPath(job, new Path(otherArgs.get(0)));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs.get(1)));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Page 42: Apache Pig
lines = LOAD '/user/hadoop/File.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
DUMP ordered_word_count;
Page 43: Apache Hive. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL.
Page 44: Resource Management
Job
TrackerTask Tracker1
Task Tracker6Task Tracker3
Task Tracker2 Task Tracker5
Task Tracker4
Client
M|M|M|M R|R
M|M|M|M R|R
M|M|M|M R|R
M|M|M|M R|R
M|M|M|M R|R
M|M|M|M R|R
Page 46: The Client submits the wordcount job to the Job Tracker.
Job
TrackerTask Tracker1
Task Tracker6Task Tracker3
Task Tracker2 Task Tracker5
Task Tracker4
Client
M|M|M|M R|R
M|M|M|M R|R
M|M|M|M R|R
M|M|M|M R|R
M|M|M|M R|R
M|M|M|M R|R
wordcount
![Page 48: Hadoop 101 - Kansas City Big Data Summit 2014](https://reader030.fdocuments.in/reader030/viewer/2022032419/55a2d8491a28ab8d7d8b4627/html5/thumbnails/48.jpg)
Job
TrackerTask Tracker1
Task Tracker6Task Tracker3
Task Tracker2 Task Tracker5
Task Tracker4
Client
M|M|M|M R|R
M|M|M|M R|R
M|M|M|M R|R
M|M|M|M R|R
M|M|M|M R|R
M|M|M|M R|R
HBASE SOLR
SOLRHBASE
SOLRHBASE
SOLRHBASE
SOLRHBASE
SOLRHBASE
![Page 49: Hadoop 101 - Kansas City Big Data Summit 2014](https://reader030.fdocuments.in/reader030/viewer/2022032419/55a2d8491a28ab8d7d8b4627/html5/thumbnails/49.jpg)
YARN
MapReduce v2
![Page 50: Hadoop 101 - Kansas City Big Data Summit 2014](https://reader030.fdocuments.in/reader030/viewer/2022032419/55a2d8491a28ab8d7d8b4627/html5/thumbnails/50.jpg)
Resource
ManagerNode
Manager1
Node
Manager6
Node
Manager3
Node
Manager2
Node
Manager5
Node
Manager4
Client
![Page 51: Hadoop 101 - Kansas City Big Data Summit 2014](https://reader030.fdocuments.in/reader030/viewer/2022032419/55a2d8491a28ab8d7d8b4627/html5/thumbnails/51.jpg)
Resource
ManagerNode
Manager1
Node
Manager6
Node
Manager3
Node
Manager2
Node
Manager5
Node
Manager4
Client
wordcount
Application
Master -
Wordcount
I need a
container to run
my wordcount
MR job
![Page 52: Hadoop 101 - Kansas City Big Data Summit 2014](https://reader030.fdocuments.in/reader030/viewer/2022032419/55a2d8491a28ab8d7d8b4627/html5/thumbnails/52.jpg)
Resource
ManagerNode
Manager1
Node
Manager6
Node
Manager3
Node
Manager2
Node
Manager5
Node
Manager4
Client
wordcount
Application
Master -
Wordcount
I need 4 Mapper
and 2 Reducer
containers
M
M
M
M
R
R
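The negotiation on pages 51-52 can be sketched as a toy scheduler. This is purely illustrative (the real exchange goes through YARN's ApplicationMaster/ResourceManager protocol, and real placement considers memory, CPU and data locality):

```python
# Toy YARN-style allocation: an application asks for a number of
# containers of each kind, and the scheduler spreads them round-robin
# across the Node Managers.
from itertools import cycle

def allocate_containers(node_managers, requests):
    """requests: e.g. {"mapper": 4, "reducer": 2} -> [(node, kind), ...]"""
    placements = []
    nodes = cycle(node_managers)
    for kind, count in requests.items():
        for _ in range(count):
            placements.append((next(nodes), kind))
    return placements

placements = allocate_containers(
    ["nm1", "nm2", "nm3", "nm4", "nm5", "nm6"],
    {"mapper": 4, "reducer": 2},
)
print(placements)
```

The point of the YARN design survives even in the toy: container counts and kinds come from the application's request, not from fixed per-node map/reduce slots as in MapReduce v1.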
![Page 53: Hadoop 101 - Kansas City Big Data Summit 2014](https://reader030.fdocuments.in/reader030/viewer/2022032419/55a2d8491a28ab8d7d8b4627/html5/thumbnails/53.jpg)
Emerging Hadoop Use
Case
![Page 54: Hadoop 101 - Kansas City Big Data Summit 2014](https://reader030.fdocuments.in/reader030/viewer/2022032419/55a2d8491a28ab8d7d8b4627/html5/thumbnails/54.jpg)
Application Container
Management