Hadoop Session 2 : HDFS

HDFS: HADOOP DISTRIBUTED FILE SYSTEM

Transcript of Hadoop Session 2 : HDFS

Page 1: Hadoop Session 2 : HDFS

HDFS: HADOOP DISTRIBUTED FILE SYSTEM

Page 2: Hadoop Session 2 : HDFS

Revision
• What is Big Data?
• 3 V's of Big Data
• What can we do with Big Data?
• What is Hadoop?
• Components of Hadoop

Page 4: Hadoop Session 2 : HDFS

Hadoop Components

• SELF-HEALING DISTRIBUTED STORAGE
• FAULT-TOLERANT DISTRIBUTED COMPUTING + PARALLEL-PROCESSING ABSTRACTION

Page 5: Hadoop Session 2 : HDFS

HDFS Overview
• Designed for a modest number of large files (millions instead of billions)
• Sequential access, not random access
• Write once, read many times
• Data is split into BIG chunks and stored across multiple nodes as blocks
• Blocks are replicated over multiple nodes

Page 6: Hadoop Session 2 : HDFS

HDFS Client-Server Architecture
• Server: Name Node
• Clients: Data Nodes
• A file is split into multiple blocks
• Multiple copies of each block
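The bookkeeping described above can be illustrated with a minimal Python sketch (this is a toy model, not real Hadoop code; all class and variable names here are hypothetical): the Name Node maps each file to its blocks, and each block to the Data Nodes that have reported storing it.

```python
from collections import defaultdict

class ToyNameNode:
    """Toy model of Name Node metadata: file -> blocks -> locations."""

    def __init__(self):
        self.file_blocks = {}                     # file path -> list of block ids
        self.block_locations = defaultdict(set)   # block id -> set of data nodes

    def add_file(self, path, block_ids):
        self.file_blocks[path] = list(block_ids)

    def report_block(self, datanode, block_id):
        # Data Nodes periodically report the blocks they store to the NN.
        self.block_locations[block_id].add(datanode)

    def locate(self, path):
        # A client asks the NN where each block of a file lives,
        # then reads the blocks directly from those Data Nodes.
        return {b: sorted(self.block_locations[b]) for b in self.file_blocks[path]}

nn = ToyNameNode()
nn.add_file("/data/purchases.txt", ["blk_1", "blk_2"])
for dn in ("dn1", "dn2", "dn3"):
    nn.report_block(dn, "blk_1")    # blk_1 replicated on three nodes
nn.report_block("dn2", "blk_2")
print(nn.locate("/data/purchases.txt"))
```

Note the division of labour this models: the Name Node only serves metadata; the block data itself flows between the client and the Data Nodes.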

Page 7: Hadoop Session 2 : HDFS

[Diagram: Name Node (NN) managing blocks 1-5, each block replicated across several Data Nodes]

Page 8: Hadoop Session 2 : HDFS

TOPOLOGY OF HADOOP CLUSTER

[Diagram: Name Node and Secondary Name Node at the top of the cluster, connected to multiple Data Nodes]

Page 9: Hadoop Session 2 : HDFS

Nodes in an HDFS cluster
• Name Node
• Secondary Name Node
• Data Node
• Job Tracker
• Task Tracker

Page 10: Hadoop Session 2 : HDFS

NAME NODE
• One NN per cluster
• Manages the file system namespace and metadata
• Single point of failure
• Runs on enterprise hardware, e.g. RAID machines

Page 11: Hadoop Session 2 : HDFS

SECONDARY NAME NODE
• NOT a backup node for the NN
• Does NOT automatically replace a failed NN
• The NN remains a single point of failure
• Also runs on enterprise hardware, e.g. RAID machines

Page 12: Hadoop Session 2 : HDFS

DATA NODE
• Many per cluster
• Manages blocks and serves them to clients
• Periodically reports to the NN the list of blocks it stores
• Uses inexpensive commodity hardware

Page 13: Hadoop Session 2 : HDFS

JOB TRACKER
• One per cluster
• Manages job requests submitted by clients
• Initial point of contact for the client; a job starts at the Job Tracker
• Single point of failure

Page 14: Hadoop Session 2 : HDFS

TASK TRACKER
• Many per cluster
• Executes Map and Reduce operations
• Reads input splits for a MapReduce job

Page 15: Hadoop Session 2 : HDFS

REPLICA PLACEMENT (Block Replication)

With the default replication factor of 3:
• 1st replica: on the client's node (or a randomly chosen node)
• 2nd replica: on a different rack than the first
• 3rd replica: on the same rack as the second
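The three placement rules above can be sketched as a small Python function. This is an illustrative model only, not Hadoop's actual placement code; the function and cluster names are hypothetical, and it assumes the remote rack has at least two Data Nodes.

```python
import random

def place_replicas(writer_node, nodes_by_rack, seed=0):
    """Pick 3 nodes for a block, following the slide's placement rules."""
    rng = random.Random(seed)
    # Find the rack holding the writer's node.
    local_rack = next(r for r, ns in nodes_by_rack.items() if writer_node in ns)
    # 1st replica: on the node where the writing client runs.
    first = writer_node
    # 2nd replica: on a node in a different rack than the first.
    remote_rack = rng.choice([r for r in nodes_by_rack if r != local_rack])
    second = rng.choice(nodes_by_rack[remote_rack])
    # 3rd replica: same rack as the second, but a different node.
    third = rng.choice([n for n in nodes_by_rack[remote_rack] if n != second])
    return [first, second, third]

cluster = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas("dn1", cluster))   # first entry is always 'dn1'
```

This layout trades a little write bandwidth (one cross-rack transfer) for rack-level fault tolerance: losing an entire rack still leaves at least one replica alive.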

Page 16: Hadoop Session 2 : HDFS

HDFS uses large blocks of 64 MB or 128 MB.

Example: a 150 MB file with a 64 MB block size is split into three blocks of 64 MB, 64 MB, and 22 MB.
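The split arithmetic is simple enough to write down directly. A short sketch (the function name is ours, not a Hadoop API): a file is cut into fixed-size blocks, with the final block holding whatever remains.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes of the blocks a file of the given size is cut into."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        # Each block is full-size except possibly the last one.
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks

print(split_into_blocks(150))   # [64, 64, 22]
```

Note that the last block occupies only its actual size on disk, not a full 64 MB.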

Page 17: Hadoop Session 2 : HDFS

HDFS CLI

Page 18: Hadoop Session 2 : HDFS

HDFS File Read/Write
hadoop fs -ls <path>
hadoop fs -mkdir <path>
hadoop fs -cp <Source> <Destination>
hadoop fs -cat <File Path>
hadoop fs -tail <File Path>
hadoop fs -mv <Source> <Destination>
hadoop fs -rm <path>

Page 19: Hadoop Session 2 : HDFS

HDFS File Ownership
sudo -u hdfs hadoop fs -chmod 600 hadoop/purchases.txt
sudo -u hdfs hadoop fs -chown root:root hadoop/purchases.txt
sudo -u hdfs hadoop fs -chgrp training hadoop/purchases.txt

Page 20: Hadoop Session 2 : HDFS

HDFS Administration Commands

hadoop version
hadoop classpath
hadoop fsck /
hadoop balancer
hadoop fs -du -s -h <path>
hadoop fs -setrep -w 2 <File Path>
hadoop fs -expunge
hadoop fs -df hdfs:/

Page 21: Hadoop Session 2 : HDFS

HDFS Read/Write between Local and HDFS

hadoop fs -copyFromLocal <Source - Local> <Destination - HDFS>
hadoop fs -copyToLocal <Source - HDFS> <Destination - Local>
hadoop fs -put <source> <destination>
hadoop fs -get <source> <destination>

Page 22: Hadoop Session 2 : HDFS

Most Important: Help?

hadoop fs -help