Introduction to the Hadoop Ecosystem (SEACON Edition)


Transcript of Introduction to the Hadoop Ecosystem (SEACON Edition)

Page 1: Introduction to the Hadoop Ecosystem (SEACON Edition)

Introduction to the Hadoop ecosystem

Page 2: Introduction to the Hadoop Ecosystem (SEACON Edition)

About me

Page 3: Introduction to the Hadoop Ecosystem (SEACON Edition)

About us

Page 4: Introduction to the Hadoop Ecosystem (SEACON Edition)

Why Hadoop?

Page 5: Introduction to the Hadoop Ecosystem (SEACON Edition)

Why Hadoop?

Page 6: Introduction to the Hadoop Ecosystem (SEACON Edition)

Why Hadoop?

Page 7: Introduction to the Hadoop Ecosystem (SEACON Edition)

Why Hadoop?

Page 8: Introduction to the Hadoop Ecosystem (SEACON Edition)

Why Hadoop?

Page 9: Introduction to the Hadoop Ecosystem (SEACON Edition)

Why Hadoop?

Page 10: Introduction to the Hadoop Ecosystem (SEACON Edition)

Why Hadoop?

Page 11: Introduction to the Hadoop Ecosystem (SEACON Edition)

How to scale data?

[Figure: work units w1, w2, w3 and results r1, r2, r3]

Page 12: Introduction to the Hadoop Ecosystem (SEACON Edition)

But…

Page 13: Introduction to the Hadoop Ecosystem (SEACON Edition)

But…

Page 14: Introduction to the Hadoop Ecosystem (SEACON Edition)

What is Hadoop?

Page 15: Introduction to the Hadoop Ecosystem (SEACON Edition)

What is Hadoop?

Page 16: Introduction to the Hadoop Ecosystem (SEACON Edition)

What is Hadoop?

Page 17: Introduction to the Hadoop Ecosystem (SEACON Edition)

What is Hadoop?

Page 18: Introduction to the Hadoop Ecosystem (SEACON Edition)

The Hadoop App Store

HDFS MapRed HCat Pig Hive HBase Ambari Avro Cassandra

Chukwa

Intel

Sync

Flume Hana HyperT Impala Mahout Nutch Oozie Sqoop

Scribe Tez Vertica Whirr ZooKee Horton Cloudera MapR EMC

IBM Talend TeraData Pivotal Informat Microsoft Pentaho Jasper

Kognitio Tableau Splunk Platfora Rack Karma Actuate MicStrat

Page 19: Introduction to the Hadoop Ecosystem (SEACON Edition)

Data Storage

Page 20: Introduction to the Hadoop Ecosystem (SEACON Edition)

Data Storage

Page 21: Introduction to the Hadoop Ecosystem (SEACON Edition)

Hadoop Distributed File System

Page 22: Introduction to the Hadoop Ecosystem (SEACON Edition)

Hadoop Distributed File System

Page 23: Introduction to the Hadoop Ecosystem (SEACON Edition)

HDFS Architecture

Page 24: Introduction to the Hadoop Ecosystem (SEACON Edition)

Data Processing

Page 25: Introduction to the Hadoop Ecosystem (SEACON Edition)

Data Processing

Page 26: Introduction to the Hadoop Ecosystem (SEACON Edition)

MapReduce

Page 27: Introduction to the Hadoop Ecosystem (SEACON Edition)

Typical large-data problem

Page 28: Introduction to the Hadoop Ecosystem (SEACON Edition)

MapReduce Flow

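Input:            (k1, v1) (k2, v2) (k3, v3) (k4, v4) (k5, v5) (k6, v6)

Map output:       (a, 1) (b, 2) (c, 3) (c, 6) (a, 3) (c, 2) (b, 7) (c, 8)

Combine output:   (a, 1) (b, 2) (c, 9) (a, 3) (c, 2) (b, 7) (c, 8)

Shuffle and sort: a -> [1, 3]   b -> [2, 7]   c -> [2, 8, 9]

Reduce output:    (a, 4) (b, 9) (c, 19)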
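The numbers on this slide can be traced in plain Java without any Hadoop classes. The sketch below is only an illustration: the class and method names are hypothetical, and the split of the map output into four map tasks is an assumption read off the combine step (c 3 and c 6 become c 9).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java walk-through of the flow on the slide; no Hadoop involved.
class MapReduceFlowSketch {

    // Output of each (assumed) map task, using the slide's numbers.
    static List<List<Map.Entry<String, Integer>>> mapOutputs() {
        return List.of(
            List.of(Map.entry("a", 1), Map.entry("b", 2)),
            List.of(Map.entry("c", 3), Map.entry("c", 6)),
            List.of(Map.entry("a", 3), Map.entry("c", 2)),
            List.of(Map.entry("b", 7), Map.entry("c", 8)));
    }

    // Combine: pre-aggregate values per key inside one map task's output.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> taskOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : taskOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    // Shuffle and sort: group the combined values by key across all tasks.
    static Map<String, List<Integer>> shuffle(List<Map<String, Integer>> combinedOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> taskOutput : combinedOutputs) {
            taskOutput.forEach((key, value) ->
                grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(value));
        }
        return grouped;
    }

    // Reduce: sum the grouped values of each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((key, values) ->
            result.put(key, values.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    static Map<String, Integer> run() {
        List<Map<String, Integer>> combined = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> taskOutput : mapOutputs()) {
            combined.add(combine(taskOutput));
        }
        return reduce(shuffle(combined));
    }

    public static void main(String[] args) {
        System.out.println(run()); // {a=4, b=9, c=19}
    }
}
```

The combine step is optional for correctness here because addition is associative and commutative; it only reduces how many pairs cross the network before the reduce step.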

Page 29: Introduction to the Hadoop Ecosystem (SEACON Edition)

Combined Hadoop Architecture

Page 30: Introduction to the Hadoop Ecosystem (SEACON Edition)

Word Count Mapper in Java

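import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    // Reused for every emitted pair to avoid allocating new objects per token.
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Called once per input line; emits (word, 1) for every token in the line.
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}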

Page 31: Introduction to the Hadoop Ecosystem (SEACON Edition)

Word Count Reducer in Java

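import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    // Called once per word with all counts emitted for it; emits (word, total).
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}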
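To see what the mapper and reducer compute together, the same logic can be imitated locally in plain Java. This is a rough sketch, not a Hadoop job: the class name LocalWordCount and the sample lines are made up for illustration, and the whole "cluster" is a single in-memory map.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Local, Hadoop-free imitation of the word count logic, for illustration only.
class LocalWordCount {

    // Counts whitespace-separated tokens across all input lines.
    static Map<String, Integer> count(String... lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // "Map" step: tokenize the line the same way the mapper does.
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                // "Reduce" step: add the emitted 1 to the running sum of the word.
                counts.merge(tokenizer.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("to be or not", "to be")); // {be=2, not=1, or=1, to=2}
    }
}
```

Like the mapper, this splits on whitespace only, so "Hadoop" and "hadoop," would be counted as different words.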

Page 32: Introduction to the Hadoop Ecosystem (SEACON Edition)

Scripting for Hadoop

Page 33: Introduction to the Hadoop Ecosystem (SEACON Edition)

Scripting for Hadoop

Page 34: Introduction to the Hadoop Ecosystem (SEACON Edition)

Apache Pig


Page 35: Introduction to the Hadoop Ecosystem (SEACON Edition)

Pig in the Hadoop ecosystem

Hadoop Distributed File System

Distributed Programming Framework

Metadata Management

Scripting

Page 36: Introduction to the Hadoop Ecosystem (SEACON Edition)

Pig Latin

users = LOAD 'users.txt' USING PigStorage(',') AS (name, age);
pages = LOAD 'pages.txt' USING PigStorage(',') AS (user, url);
filteredUsers = FILTER users BY age >= 18 AND age <= 50;
joinResult = JOIN filteredUsers BY name, pages BY user;
grouped = GROUP joinResult BY url;
summed = FOREACH grouped GENERATE group, COUNT(joinResult) AS clicks;
sorted = ORDER summed BY clicks DESC;
top10 = LIMIT sorted 10;
STORE top10 INTO 'top10sites';
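For comparison, here is roughly what the same pipeline looks like as plain, single-machine Java. It is a sketch under assumptions: the class TopSitesSketch, the User and Page types and the sample data are all hypothetical, user names are assumed to be unique, and ties are broken by URL only to make the output deterministic (the Pig script does not specify a tie-break).

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// In-memory imitation of the Pig Latin script; all names here are hypothetical.
class TopSitesSketch {

    static final class User {
        final String name;
        final int age;
        User(String name, int age) { this.name = name; this.age = age; }
    }

    static final class Page {
        final String user;
        final String url;
        Page(String user, String url) { this.user = user; this.url = url; }
    }

    // Returns up to `limit` (url, clicks) pairs, most-clicked first.
    static List<Map.Entry<String, Long>> topSites(List<User> users, List<Page> pages, int limit) {
        // FILTER users BY age >= 18 AND age <= 50 (assumes unique user names)
        Set<String> filteredUsers = users.stream()
            .filter(u -> u.age >= 18 && u.age <= 50)
            .map(u -> u.name)
            .collect(Collectors.toSet());

        // JOIN on name = user, then GROUP BY url and COUNT
        Map<String, Long> clicks = pages.stream()
            .filter(p -> filteredUsers.contains(p.user))
            .collect(Collectors.groupingBy(p -> p.url, Collectors.counting()));

        // ORDER BY clicks DESC (ties broken by url), then LIMIT
        Comparator<Map.Entry<String, Long>> order = (x, y) -> {
            int byClicks = Long.compare(y.getValue(), x.getValue());
            return byClicks != 0 ? byClicks : x.getKey().compareTo(y.getKey());
        };
        return clicks.entrySet().stream()
            .sorted(order)
            .limit(limit)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<User> users = List.of(
            new User("ann", 30), new User("bob", 17), new User("cid", 45));
        List<Page> pages = List.of(
            new Page("ann", "a.example"), new Page("cid", "a.example"),
            new Page("bob", "b.example"), new Page("cid", "c.example"));
        System.out.println(topSites(users, pages, 10)); // [a.example=2, c.example=1]
    }
}
```

This version only works while both inputs fit in memory on one machine; the Pig script expresses the same steps but leaves the distribution across the cluster to the framework.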

Page 37: Introduction to the Hadoop Ecosystem (SEACON Edition)

Pig Execution Plan

Page 38: Introduction to the Hadoop Ecosystem (SEACON Edition)

Try that with Java…

Page 39: Introduction to the Hadoop Ecosystem (SEACON Edition)

SQL for Hadoop

Page 40: Introduction to the Hadoop Ecosystem (SEACON Edition)

SQL for Hadoop

Page 41: Introduction to the Hadoop Ecosystem (SEACON Edition)

Apache Hive

Page 42: Introduction to the Hadoop Ecosystem (SEACON Edition)

Hive in the Hadoop ecosystem

Hadoop Distributed File System

Distributed Programming Framework

Metadata Management

Scripting Query

Page 43: Introduction to the Hadoop Ecosystem (SEACON Edition)

Hive Architecture

Page 44: Introduction to the Hadoop Ecosystem (SEACON Edition)

Hive Example

CREATE TABLE users(name STRING, age INT);
-- `user` is quoted because newer Hive releases treat it as a reserved word.
CREATE TABLE pages(`user` STRING, url STRING);

LOAD DATA INPATH '/user/sandbox/users.txt' INTO TABLE users;
LOAD DATA INPATH '/user/sandbox/pages.txt' INTO TABLE pages;

-- ORDER BY gives a total order; SORT BY only orders rows within each reducer.
SELECT pages.url, count(*) AS clicks
FROM users JOIN pages ON (users.name = pages.`user`)
WHERE users.age >= 18 AND users.age <= 50
GROUP BY pages.url
ORDER BY clicks DESC
LIMIT 10;

Page 45: Introduction to the Hadoop Ecosystem (SEACON Edition)

Bringing it all together…

Page 46: Introduction to the Hadoop Ecosystem (SEACON Edition)

Online AdServing

Page 47: Introduction to the Hadoop Ecosystem (SEACON Edition)

AdServing Architecture

Page 48: Introduction to the Hadoop Ecosystem (SEACON Edition)

Getting started…

Page 49: Introduction to the Hadoop Ecosystem (SEACON Edition)

Hortonworks Sandbox

Page 50: Introduction to the Hadoop Ecosystem (SEACON Edition)

Hadoop Training
