Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation:...

72
Dr. Sandeep G. Deshmukh 1 Introduction to

Transcript of Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation:...

Page 1: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Dr. Sandeep G. Deshmukh

1

Introduction to

Page 2: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Contents

❑ Big Data Primers

❑ Hadoop

➢ Hadoop Distributed File System (HDFS)

➢ MapReduce

❑ Examples

2

Page 3: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Show of Hands

Page 4: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Big Data Primers: Source of Data

Page 5: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Big Data Primers : The four Vs

Page 6: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Big Data Primers: Size does matter

Page 7: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Big Data Primers: Vertical Vs Horizontal Scaling

Vertical Scaling Horizontal Scaling

Page 8: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Big Data Primers: The scale of infrastructure

Page 9: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Big Data Primers

Page 10: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Big Data Primers

Then what is

Distributed Programming:Even more chaos...

Page 11: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Big Data Primers

Network File System Distributed File System

Master

Slaves

Page 12: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

12

Page 13: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Motivation - Traditional Distributed systems ❑ Processor Bound

❑ Using multiple machines

❑ Developer is burdened with managing too many things

➢ Synchronization

➢ Failures

❑ Data moves from shared disk to compute node

❑ Cost of maintaining clusters

❑ Scalability as and when required not present

13

Page 14: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

What we need❑ Handling failure

➢ One computer = fails once in 1000 days

➢ 1000 computers = 1 per day

❑ Petabytes of data to be processed in parallel➢ 1 HDD= 100 MB/sec

➢ 1000 HDD= 100 GB/sec

❑ Easy scalability➢ Relative increase/decrease of performance depending on increase/decrease of nodes

14

Page 15: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

What we’ve got : Hadoop!

❑ Created by Doug Cutting

❑ Started as a module in nutch and then matured as an apache project

❑ Named it after his son's stuffed

elephant

15

Page 16: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

What we’ve got : Hadoop!❑ Fault-tolerant file system➢ Hadoop Distributed File System (HDFS)➢ Modeled on Google File system

❑ Takes computation to data➢ Data Locality

❑ Scalability: ➢ Program remains same for 10, 100, 1000,… nodes➢ Corresponding performance improvement

❑ Parallel computation using MapReduce❑ Other components – Pig, Hbase, HIVE, ZooKeeper

16

Page 17: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Hadoop: Myth Vs Truth

Myth Truth

Hadoop is a database HDFS is a Distributed File System

Hadoop is a database warehouse Compliments it, not a substitute

Hadoop is a complete, single product Ecosystem, not just a product. HDFS and MapReduce being the key components

Hadoop is used only for unstructured data, web analytics

Enables many types of analytics

Page 18: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Who is using Hadoop

Page 19: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

HDFSHadoop distributed File System

Page 20: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

HDFS - Contents

❑ Components❑ Architecture❑ Tasks / Services

20

Page 21: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Terminology

❑ HDFS: Hadoop Distributed File System

❑ Datanode: A DataNode stores data in HDFS.

❑ Namenode: The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.

➢ Active : Actively serving request

➢ Standby: Becomes Active if the current Active node fails

❑ Secondary Namenode:

➢ helper node for namenode

➢ Puts a checkpoint in filesystem which will help Namenode to function better

21

Page 22: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Components of HDFS

Active NameNode

DataNodes

Secondary NameNode

22

Standby NameNode

Page 23: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Architecture

Page 24: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Storing file on HDFS

Motivation: Reliability, Availability , Network Bandwidth

➢ The input file (say 1 TB) is split into smaller chunks/blocks of 128 MB➢ The chunks are stored on multiple nodes as independent files on slave nodes➢ To ensure that data is not lost, replicas are stored in the following way:

➢ One on local node ➢ One on remote rack (incase local rack fails)➢ One on local rack (incase local node fails) ➢ Other randomly placed➢ Default replication factor is 3

24

Page 25: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

B1

B1

B1

B2

B2B2

B3 Bn

Hub 1 Hub 2

Data n

od

esFile

Master Node

8 gigabit

1 gigabit

Blocks

25

Page 26: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Tasks of NameNode

❑ Manages File System- mapping files to blocks and blocks to data nodes

❑ Maintaining status of data nodes➢ Heartbeat

■ Datanode sends heartbeat at regular intervals

■ If heartbeat is not received, datanode is declared dead

➢ Blockreport■ DataNode sends list of blocks on it

■ Used to check health of HDFS

26

Page 27: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

NameNode Functions

❑ Replication

➢ On Datanode failure

➢ On Disk failure

➢ On Block corruption

❑ Data integrity

➢ Checksum for each block

➢ Stored in hidden file

27

❑ Rebalancing - balancer tool

➢ Addition of new nodes

➢ Decommissioning

➢ Deletion of some files

Page 28: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

HDFS Robustness

❑ Safemode➢ At startup: No replication possible

➢ Receives Heartbeats and Blockreports from Datanodes

➢ Only a percentage of blocks are checked for defined replication factor

❑ Replicate blocks wherever necessary

28

All is well ☺ ➔ Exit Safemode

Page 29: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

HDFS Summary

❑ Fault tolerant

❑ Scalable

❑ Reliable

❑ File are distributed in large blocks for

➢ Efficient reads

➢ Parallel access

29

Page 30: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Break Time

Questions?

30

Page 31: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

MapReduce

Page 32: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

MapReduce

❑ Origin❑ Example❑ Programming model

32

Page 33: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

What is MapReduce?

❑ It is a powerful paradigm for parallel computation

❑ Hadoop uses MapReduce to execute jobs on files in HDFS

❑ Hadoop will intelligently distribute computation over cluster

❑ Take computation to data

33

Page 34: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Origin: Functional Programming

map f [a, b, c] = [f(a), f(b), f(c)]

Returns a list constructed by applying a function (the first argument) to all items in a list passed as the second argument

Example:

map sq [1, 2, 3] = [sq(1), sq(2), sq(3)]

= [1,4,9]

34

Page 35: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Origin: Functional Programming

reduce f [a, b, c] = f(a, b, c)

Returns a list constructed by applying a function (the first argument) on the list passed as the second argument

Example:

reduce sum [1, 4, 9] = sum(1, 4, 9)

= 14

35

Page 36: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Example: Sum of squares

[1,2,3,4]

Sq (1) Sq (2) Sq (3) Sq (4)

16941

30

Input

Intermediate output

Output

MAPPER

REDUCER

M1 M2 M3 M4

R1

36

Page 37: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Example: Sum of squares of even and odd

[1,2,3,4]

Sq (1) Sq (2) Sq (3) Sq (4)

(even, 16)(odd, 9)(even, 4)(odd, 1)

(even, 20) (odd, 10)

Input

Intermediate output

Output

MAPPER

REDUCER

M1 M2 M3 M4

R1 R2

37

Page 38: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Example: Sum of squares of even and odd

[1,2,3,4]

Sq (1) Sq (2) Sq (3) Sq (4)

(odd, 9)(even, 4)

(even, 20) (odd, 10)

Input

Intermediate output

Output

R2R1

38

(odd, 1) (even, 16)

MAPPER

REDUCER

Page 39: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Programming model- key, value pairs

Format of input- output

(key, value)

Map: (k1, v

1) → list (k

2, v

2)

Reduce: (k2, list v

2) → list (k

3, v

3)

39

Page 40: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Sum of squares of even and odd and prime

[1,2,3,4]

Sq (1) Sq (2) Sq (3) Sq (4)

(odd, 9)(prime, 9)

(even, 4)(prime, 4)

(even, 20) (odd, 10)(prime, 13)

Input

Intermediate output

Output

R2R1 R3

40

(odd, 1) (even, 16)

Page 41: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Many keys, many values

Format of input- output

(key, value)

Map: (k1, v

1) → list (k

2, v

2)

Reduce: (k2, list v

2) → list (k

3, v

3)

41

Page 42: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Input: 1TB text file containing color names- Blue, Green, Yellow, Purple, Pink, Red, Maroon, Grey,

Desired output:

Occurrence of colors Blue and Green

Page 43: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

N1 f.001Blue PurpleBlue RedGreenBlueMaroonGreenYellow

N1 f.001Blue

Blue

GreenBlue

Green

grep Blue|Green

Nn f.00nGreen

Blue

BlueBlue

Green

Blue= 3000Green= 5500

Blue=500Green=200

Blue=420Green=200

sort |unique -c

awk ‘{arr[$1]+=$2;}END{print arr[Blue], arr[Green]}’

COMBINER

MAPPER

REDUCER

awk ‘{arr[$1]++;}END{print arr[Blue], arr[Green]}’

Nn f.00nBlue PurpleBlue RedGreenBlueMaroonGreenYellow

Page 44: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Input data

Map

Map

Map

Reduce

Reduce

Output

INPUT MAP SHUFFLE REDUCE OUTPUT

Works on a recordWorks on output of Map

44

MapReduce Overview

Page 45: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Input data

Combine

Combine

Combine

Map

Map

Map

Reduce

Reduce

Output

INPUT MAP REDUCE OUTPUT

Works on output of Map Works on output of Combiner

45

MapReduce Overview

Page 46: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Takes computation to data

MapReduce Overview

Page 47: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

47

MapReduce Overview

Page 48: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

MapReduce Overview

Page 49: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

❑ Mapper, reducer and combiner act on <key, value> pairs

❑ Mapper gets one record at a time as an input

❑ Combiner (if present) works on output of map

❑ Reducer works on output of map (or combiner, if present)

❑ Combiner can be thought of local-reducer

➢ Reduces output of maps that are executed on same node

49

MapReduce Summary

Page 50: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

What Hadoop is not..

❑ Not for interactive file accessing

❑ Not meant for a large number of small files - but for a small number of large files

❑ MapReduce cannot be used for any and all applications

50

Page 51: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Hadoop: Take Home

❑ Takes computation to data

❑ Suitable for large data centric operations

❑ Scalable on demand

❑ Fault tolerant and highly transparent

51

Page 52: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Questions?

52

❑ Coming up next …❑ First hadoop program

❑ Second hadoop program

Page 53: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Your first program in hadoop

Open up any tutorial on hadoop and first program you see will be of wordcount ☺

Task:➢ Given a text file, generate a list of words with the

number of times each of them appear in the fileInput:➢ Plain text file

Expected Output: ➢ <word, frequency> pairs for all words in the file

hadoop is a framework written in javahadoop supports parallel processing and is a simple framework

<hadoop, 2><is, 2> <a , 2> <java , 1>

<framework , 2> <written , 1> <in , 1> <and,1>

<supports , 1> <parallel , 1> <processing. , 1><simple,1>

53

Page 54: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Your second program in hadoop

Task:➢ Given a text file containing numbers, one per

line, count sum of squares of odd, even and prime

Input:➢ File containing integers, one per line

Expected Output: ➢ <type, sum of squares> for odd, even, prime

1253563794

<odd, 302><even, 278> <prime, 323 >

54

Page 55: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Your second program in hadoop

File on HDFS55

3

9

6

2

3

7

8Map: square

3 <odd,9>

7 <odd,49>

2

6 <even,36>

9 <odd,81>

3 <odd,9>

8 <even,64>

<prime,4>

<prime,9>

<prime,9>

<even,4>

Reducer: sum

prime:<9,4,9>

odd:<9,81,9,49>

even:<,36,4,64>

<odd,148>

<even,104>

<prime,22>

Input Value

Output (Key,Value)

Input (Key, List of Values)

Output (Key,Value)

Page 56: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Your second program in hadoop

56

Map(Invoked on a record)

Reduce(Invoked on a key)

void map (int x){ int sq = x * x; if(x is odd) print(“odd”,sq); if(x is even) print(“even”,sq); if(x is prime) print(“prime”,sq);}

void reduce(List l ){ for(y in List l){ sum += y; } print(Key, sum);}

Library functionsboolean odd(int x){ …}boolean even(int x){ …}boolean prime(int x){ …}

Page 57: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Your second program in hadoop

57

Map(Invoked on a record)

Map(Invoked on a record)void map (int x){ int sq = x * x; if(x is odd) print(“odd”,sq); if(x is even) print(“even”,sq); if(x is prime) print(“prime”,sq);}

Page 58: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Your second program in hadoop: Reduce

58

Reduce(Invoked on a key)

Reduce(Invoked on a

key)void reduce(List l ){ for(y in List l){ sum += y; } print(Key, sum);}

Page 59: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Your second program in hadooppublic static void main(String[] args) throws Exception {

Configuration conf = new Configuration();Job job = new Job(conf, “OddEvenPrime");

job.setJarByClass(OddEvenPrime.class);job.setMapperClass(OddEvenPrimeMapper.class);job.setCombinerClass(OddEvenPrimeReducer.class);job.setReducerClass(OddEvenPrimeReducer.class);job.setOutputKeyClass(Text.class);job.setOutputValueClass(IntWritable.class);job.setInputFormatClass(TextInputFormat.class);

FileInputFormat.addInputPath(job, new Path(args[0]));FileOutputFormat.setOutputPath(job, new Path(args[1]));

System.exit(job.waitForCompletion(true) ? 0 : 1);}

59

Page 60: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Questions?

60

❑ Coming up next …❑ More examples

❑ Hadoop Ecosystem

Page 61: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Example: Counting Fans

61

Page 62: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Example: Counting Fans

62

❑ Problem: Give Crowd statistics➢ Count fans

supporting India and Pakistan

Page 63: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

63

45882 67917

❑ Traditional Way

➢ Central Processing

➢ Every fan comes to the centre and presses India/Pak button

❑ Issues

➢ Slow/Bottlenecks

➢ Only one processor

➢ Processing time determined by the speed at which people(data) can move

Example: Counting Fans

Page 64: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

64

Hadoop Way

❑ Appoint processors per block (MAPPER)

❑ Send them to each block and ask them to send a signal for each person

❑ Central processor will aggregate the results (REDUCER)

❑ Can make the processor smart by asking him/her to aggregate locally and only send aggregated value (COMBINER)

55499003

7320

1280 5

8112

11289

629413109

8813

10676

3991

12225

Example: Counting Fans

Reducer

Combiner

Page 65: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Hadoop EcoSystem: Basic

65

Page 66: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Hadoop Distributions

66

Page 67: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Who all are using Hadoop

67

http://wiki.apache.org/hadoop/PoweredBy

Page 68: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

ReferencesFor understanding Hadoop

❑ Official Hadoop website- http://hadoop.apache.org/

❑ Hadoop presentation wiki- http://wiki.apache.org/hadoop/HadoopPresentations?action=AttachFile

❑ http://developer.yahoo.com/hadoop/

❑ http://wiki.apache.org/hadoop/

❑ http://www.cloudera.com/hadoop-training/

❑ http://developer.yahoo.com/hadoop/tutorial/module2.html#basics

68

Page 69: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

ReferencesSetup and Installation

❑ Installing on Ubuntu

➢ http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)

➢ http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)

❑ Installing on Debian

➢ http://archive.cloudera.com/docs/_apt.html

69

Page 70: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

❑ Hadoop: The Definitive Guide: Tom White

❑ http://developer.yahoo.com/hadoop/tutorial/

❑ http://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial.html

70

Further Reading

Page 71: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

Questions?

71

Page 72: Introduction to - files.meetup.com Program - Introduction … · Storing file on HDFS Motivation: Reliability, Availability , Network Bandwidth The input file (say 1 TB) is split

72

Acknowledgements

Surabhi Pendse

Sayali Kulkarni

Parth Shah