Hadoop ecosystem

Moscow, November 16th, 2011 The Hadoop Ecosystem Kai Voigt, Cloudera Inc.

Transcript of Hadoop ecosystem

Page 1: Hadoop ecosystem

Moscow, November 16th, 2011

The Hadoop Ecosystem

Kai Voigt, Cloudera Inc.

Page 2: Hadoop ecosystem

©2011 Cloudera, Inc. All Rights Reserved.

Cloudera

                         Hadoop                                          Linux
Licence                  Apache                                          GPL and others
Distribution Vendor      Cloudera                                        Red Hat
Free Distribution        Cloudera's Distribution Including Hadoop (CDH)  Fedora Core
Commercial Distribution  Cloudera Enterprise                             Red Hat Enterprise Linux (RHEL)

Page 3: Hadoop ecosystem

Hadoop Core

HDFS

MapReduce

Page 4: Hadoop ecosystem

HDFS

• Hadoop Distributed File System

• Redundancy

• Fault Tolerant

• Scalable

• Self Healing

• Write Once, Read Many Times

• Java API

• Command Line Tool

Page 5: Hadoop ecosystem

MapReduce

• Two Phases of Functional Programming (Map and Reduce)

• Redundancy

• Fault Tolerant

• Scalable

• Self Healing

• Java API
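The two phases are map and reduce. As a rough illustration of how they fit together, here is a word count in plain Python rather than the Hadoop Java API. The function names `map_phase`, `shuffle`, and `reduce_phase` are made up for this sketch, and the in-memory `shuffle` only stands in for the grouping that the framework performs between the phases across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because each map call looks at one line only and each reduce call looks at one key only, both phases can be spread over many nodes and rerun independently if a node fails.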

Page 6: Hadoop ecosystem

Hadoop Core

HDFS (Java)

MapReduce (Java)

Page 7: Hadoop ecosystem

HDFS-FUSE

/mnt/hdfs/

HDFS-FUSE

HDFS

Page 8: Hadoop ecosystem

HDFS-FUSE Examples

$ mount
...
fuse on /mnt/hdfs type fuse (rw,nosuid,nodev,user_id=0,group_id=0,default_permissions,allow_other)

$ cp /boot/vmlinuz-* /mnt/hdfs/user/cloudera/
$ hadoop fs -ls vmlinuz-*
-rw-r--r-- 3 cloudera supergroup 2107004 2011-11-08 16:14 /user/cloudera/vmlinuz-2.6.18-274.7.1.el5

Page 9: Hadoop ecosystem

Sqoop

RDBMS

Sqoop

HDFS

Page 10: Hadoop ecosystem

Sqoop

• Import & Export

• ODBC, JDBC Data Sources

• CSV Files in HDFS

Page 11: Hadoop ecosystem

Sqoop Examples

$ sqoop import --connect jdbc:mysql://localhost/world --username root --table City ...

$ hadoop fs -cat City/part-m-00000
1,Kabul,AFG,Kabol,1780000
2,Qandahar,AFG,Qandahar,237500
3,Herat,AFG,Herat,186800
4,Mazar-e-Sharif,AFG,Balkh,127800
5,Amsterdam,NLD,Noord-Holland,731200
...
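Since the import lands in HDFS as plain comma-separated text, any tool that reads CSV can consume it. A small sketch of parsing such rows in Python follows. The field names (`id`, `name`, `country_code`, `district`, `population`) are assumptions based on the usual layout of the `City` table in the MySQL `world` sample database, and the `sample` string simply repeats two rows from the output above.

```python
import csv
import io
from typing import NamedTuple


class City(NamedTuple):
    # Field names are assumed; the slide shows only the raw values.
    id: int
    name: str
    country_code: str
    district: str
    population: int


def parse_cities(text):
    """Parse comma-separated rows, as written by the import, into City records."""
    rows = csv.reader(io.StringIO(text))
    return [City(int(r[0]), r[1], r[2], r[3], int(r[4])) for r in rows if r]


sample = "1,Kabul,AFG,Kabol,1780000\n2,Qandahar,AFG,Qandahar,237500\n"
cities = parse_cities(sample)
```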

Page 12: Hadoop ecosystem

Hive

MapReduce

Hive

SQL

Page 13: Hadoop ecosystem

Hive

• Data Warehouse System for Hadoop

• Data Aggregation

• Ad-Hoc Queries

• SQL-like Language (HiveQL)

• Developed at Facebook

Page 14: Hadoop ecosystem

Hive Examples

CREATE TABLE newmovie (id INT, name STRING, year INT, numratings INT, avgrating FLOAT);
INSERT OVERWRITE TABLE newmovie
SELECT id, name, year, COUNT(1), AVG(rating)
FROM movie JOIN movierating
ON movie.id = movierating.movieid
GROUP BY id, name, year;
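To make the semantics of the query concrete, here is a rough Python equivalent of the join, grouping, and aggregation. It is only a single-machine sketch of what the query computes, not of how Hive executes it as MapReduce jobs. The function name `rate_movies` and the tuple layouts are assumptions: the slide names only the columns `id`, `name`, `year`, `movieid`, and `rating`.

```python
from collections import defaultdict


def rate_movies(movies, ratings):
    """Join movies with ratings on movie id, then count and average per movie.

    movies:  iterable of (id, name, year)
    ratings: iterable of (userid, movieid, rating)  # layout assumed
    Returns rows of (id, name, year, numratings, avgrating).
    """
    # Collect all ratings per movie id (the GROUP BY side of the query).
    by_movie = defaultdict(list)
    for _userid, movieid, rating in ratings:
        by_movie[movieid].append(rating)

    result = []
    for movie_id, name, year in movies:
        scores = by_movie.get(movie_id)
        if scores:  # inner join: movies without ratings are dropped
            avg = sum(scores) / len(scores)
            result.append((movie_id, name, year, len(scores), avg))
    return result
```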

Page 15: Hadoop ecosystem

Pig

MapReduce

Pig

Script

Page 16: Hadoop ecosystem

Pig

• Data Warehouse System for Hadoop

• Data Aggregation

• Ad-Hoc Queries

• High-Level Scripting Language (Pig Latin)

• Developed at Yahoo

Page 17: Hadoop ecosystem

Pig Examples

movierating = LOAD 'movierating' AS (userid, movieid, rating:INT);
groupmr = GROUP movierating BY movieid;
ratings = FOREACH groupmr GENERATE group AS movieid, COUNT(movierating.rating) AS numratings, AVG(movierating.rating) AS avgrating;
movie = LOAD 'movie' AS (id, name, year);
mr = JOIN movie BY id, ratings BY movieid;
result = FOREACH mr GENERATE id, name, year, numratings, avgrating;
STORE result INTO 'ratedmovie';

Page 18: Hadoop ecosystem

The Story So Far

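[Diagram: the components so far and the interface each one offers]

Sqoop: moves data between RDBMS (SQL) and HDFS
Hive: SQL on top of MapReduce
Pig: Script on top of MapReduce
FUSE: HDFS as a Posix file system (FS)
MapReduce: Java
HDFS: Java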

Page 19: Hadoop ecosystem

HBase

• Low Latency

• Random Reads And Writes

• Distributed Key/Value Store

• Simple API
  – PUT
  – GET
  – DELETE
  – SCAN

Page 20: Hadoop ecosystem

HBase Data Model

Key = (RowID, Column name, Timestamp)

RowID             Column name  Timestamp  Value
com.apple.www     Size         yesterday  1234
com.apple.www     Content      yesterday  <html>...
com.cloudera.www  Size         yesterday  2345
com.cloudera.www  Content      yesterday  <html>...
com.cloudera.www  Size         today      3456
com.cloudera.www  Content      today      <html>...
com.facebook.www  Size         yesterday  4567
com.facebook.www  Content      yesterday  <html>...
com.yahoo.www     Size         today      5678
com.yahoo.www     Content      today      <html>...
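The table above can be read as a sorted map from a composite key to a value. A toy in-memory sketch of that idea in Python follows. `TinyTable` and its methods are hypothetical names for illustration, not the HBase client API, and timestamps are plain integers here where the slide uses "yesterday" and "today".

```python
class TinyTable:
    """Toy model of cells keyed by (row id, column name, timestamp)."""

    def __init__(self):
        self.cells = {}

    def put(self, row, column, timestamp, value):
        """Store one versioned cell."""
        self.cells[(row, column, timestamp)] = value

    def get(self, row, column):
        """Return the value with the newest timestamp, or None if absent."""
        versions = [(ts, v) for (r, c, ts), v in self.cells.items()
                    if r == row and c == column]
        if not versions:
            return None
        return max(versions, key=lambda tv: tv[0])[1]

    def scan(self, start_row, stop_row):
        """Yield (key, value) in key order for start_row <= row < stop_row."""
        for key in sorted(self.cells):
            if start_row <= key[0] < stop_row:
                yield key, self.cells[key]
```

Keeping rows in key order is what makes a range scan cheap, and writing host names in reversed form (`com.cloudera.www`) places pages from the same domain next to each other.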

Page 21: Hadoop ecosystem

HBase Flow

GET/PUT/DELETE

MEMORY

HDFS Logfile
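One way to read this diagram: a write is first appended to a log file on HDFS, then applied to an in-memory buffer, and the buffer is later flushed to HDFS as a sorted file. The Python sketch below imitates that flow under those assumptions. `TinyStore`, `flush_threshold`, and the list-based "files" are invented for illustration and do not mirror HBase internals in any detail.

```python
class TinyStore:
    """Toy write path: log first, then memory, then flush to a sorted file."""

    def __init__(self, flush_threshold=2):
        self.log = []      # stands in for the log file on HDFS
        self.memory = {}   # stands in for the in-memory buffer
        self.files = []    # stands in for immutable sorted files on HDFS
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        """Record the write in the log before applying it in memory."""
        self.log.append((key, value))
        self.memory[key] = value
        if len(self.memory) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the buffer out as one sorted file and start a fresh buffer."""
        self.files.append(sorted(self.memory.items()))
        self.memory = {}

    def get(self, key):
        """Check memory first, then files from newest to oldest."""
        if key in self.memory:
            return self.memory[key]
        for stored in reversed(self.files):
            for k, v in stored:
                if k == key:
                    return v
        return None
```

The log makes it possible to rebuild the in-memory buffer after a crash, while reads and writes stay fast because they touch memory first.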

Page 22: Hadoop ecosystem

HBase Examples

hbase> create 'mytable', 'mycf'
hbase> list
hbase> put 'mytable', 'row1', 'mycf:col1', 'val1'
hbase> put 'mytable', 'row1', 'mycf:col2', 'val2'
hbase> put 'mytable', 'row2', 'mycf:col1', 'val3'
hbase> scan 'mytable'
hbase> disable 'mytable'
hbase> drop 'mytable'

Page 23: Hadoop ecosystem

Flume

• Many Servers with Many Log Files
  – Webserver
  – Mailserver
  – Syslog

• Store All Logs in One Place
  – Manageable
  – Extensible
  – Reliable

Page 24: Hadoop ecosystem

Flume Architecture

Log

Flume Node

Log

Flume Node

...

HDFS

Page 25: Hadoop ecosystem

Flume Sources and Sinks

• Local Files

• HDFS

• Stdin, Stdout

• Twitter

• IRC

• IMAP

Page 26: Hadoop ecosystem

Whirr

• Automatic Cluster Setup in the Cloud
  – Amazon
  – Rackspace

Page 27: Hadoop ecosystem

Whirr Example

$ cat hadoop.properties
whirr.cluster-name=myhadoopcluster
whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,7 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub

$ bin/whirr launch-cluster --config hadoop.properties

$ . ~/.whirr/myhadoopcluster/hadoop-proxy.sh

$ export HADOOP_CONF_DIR=~/.whirr/myhadoopcluster

$ bin/whirr destroy-cluster --config hadoop.properties

Page 28: Hadoop ecosystem

Oozie Concept

• crond for Hadoop

• Job Flow Control
  – Branching
  – Serial
  – Loops

• Triggered
  – Time
  – Data

Job 1

Job 3

Job 2

Job 4 Job 5
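The core of job flow control is running each job only after the jobs it depends on have finished. The Python sketch below shows that ordering idea using the standard library's `graphlib`. The dependencies between the five jobs are hypothetical, since the transcript does not preserve the arrows of the diagram, and real Oozie workflows are defined in XML rather than in code like this.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: each job maps to the set of jobs it waits for.
workflow = {
    "job1": set(),
    "job2": {"job1"},
    "job3": {"job1"},
    "job4": {"job2", "job3"},
    "job5": {"job4"},
}


def run_order(dependencies):
    """Return one valid order in which every job follows its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(workflow)
```

In this assumed layout, `job2` and `job3` form a branch that could run in parallel, and `job4` acts as the join that waits for both.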

Page 29: Hadoop ecosystem

Oozie Features

• Component Independent
  – MapReduce
  – Hive
  – Pig
  – Sqoop
  – Streaming

Page 30: Hadoop ecosystem

Mahout

• Machine Learning Library for Hadoop
  – Regression
  – Classification
  – Recommendations
  – Pattern Mining

Page 31: Hadoop ecosystem

Mahout Use Cases

• Yahoo: Spam Detection

• Foursquare: Recommendations

• SpeedDate.com: Recommendations

• Adobe: User Targeting

• Amazon: Personalization Platform


Page 32: Hadoop ecosystem

CDH4u2

• Cloudera's Distribution Including Hadoop

• http://www.cloudera.com/download/

• Linux Packages
  – Red Hat
  – Debian
  – Tar Archive

• Virtual Machines

• Cloud Installation with Whirr

Page 33: Hadoop ecosystem

CDH Components

Hadoop Hive

Pig HBase

Zookeeper Flume

Sqoop Whirr

Hue Oozie

FUSE-DFS Mahout

Page 34: Hadoop ecosystem

Thank you!

• Kai Voigt

[email protected]

• LinkedIn

• http://www.cloudera.com/
