Apache Hadoop - A Deep Dive (Part 1 - HDFS)
HADOOP - A DEEP DIVE
Debarchan Sarkar
Sunil Kumar Chakrapani
The call will start soon; please stay on mute. Thanks for your time and patience.
AGENDA
Recap: What is Big Data?
Problems introduced
Traditional architecture
Cluster architecture
Where it all started
How does it work: a 50,000-foot overview (parts 1 & 2)
Hadoop distributed architecture
HDFS architecture
WHAT IS BIG DATA?
ERP / CRM: Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking
Web 2.0 / Mobile: Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations
Internet of Things: Audio/Video, Log Files, Text/Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis/Blogs, Click Stream, Sensors/RFID/Devices, Spatial & GPS Coordinates
Volume: Gigabytes (10^9) → Terabytes (10^12) → Petabytes (10^15) → Exabytes (10^18)
Plus velocity, variety, and variability
Storage cost per GB: 1980: $190,000 | 1990: $9,000 | 2000: $15 | 2010: $0.07
STORAGE CAPACITY VS ACCESS SPEED: 1990 vs 2010
1990: a drive stores 1370 MB, read at a 4.4 MB/s transfer rate; reading the whole drive takes about 5 minutes
2010: 1 TB is the norm, read at a 100 MB/s transfer rate; reading the whole drive takes about 2.5 hours
READ 1 TB OF DATA: 1 machine vs 10 machines
1 machine, 4 I/O channels, each channel at 100 MB/s: ~45 minutes
10 machines, 4 I/O channels each, 100 MB/s per channel: ~4.5 minutes
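The read-time figures above are simple throughput arithmetic: total data divided by aggregate channel bandwidth. A quick sketch, in the same JavaScript used later in this deck, that reproduces both this slide and the previous one:

```javascript
// Minutes to read `totalMB` of data given `machines`, each with
// `channels` I/O channels at `mbPerSec` MB/s per channel.
function readTimeMinutes(totalMB, machines, channels, mbPerSec) {
    var aggregateThroughput = machines * channels * mbPerSec; // MB/s
    return totalMB / aggregateThroughput / 60;                // minutes
}

// 1990 drive: 1370 MB over a single channel at 4.4 MB/s
console.log(readTimeMinutes(1370, 1, 1, 4.4).toFixed(1));     // ~5.2 minutes

// 1 TB on one machine, 4 channels at 100 MB/s each
console.log(readTimeMinutes(1000000, 1, 4, 100).toFixed(1));  // ~41.7 minutes

// The same 1 TB spread across 10 such machines
console.log(readTimeMinutes(1000000, 10, 4, 100).toFixed(1)); // ~4.2 minutes
```

This is the core argument for scaling out: adding machines multiplies aggregate I/O bandwidth, so read time drops roughly linearly with cluster size.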
HARDWARE FAILURE
A common way of avoiding data loss is through replication
TRADITIONAL ARCHITECTURE
Servers connected to shared SAN storage
CLUSTER ARCHITECTURE
A rack of commodity 1U servers
NUTCH IS WHERE IT ALL STARTED
Google File System → HDFS: Hadoop Distributed File System
Google MapReduce → Hadoop MapReduce
HOW DOES IT WORK - 1
HOW DOES IT WORK - 2
RUNTIME
// MapReduce functions in JavaScript
var map = function (key, value, context) {
    // Split the input line into words on any non-alphabetic character
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            // Emit (word, 1) for each word found
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    // Sum all counts emitted for this word
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next());
    }
    context.write(key, sum);
};
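The map and reduce functions above can be exercised locally with a tiny driver. The `context` object and the `hasNext()/next()` iterator below are minimal stand-ins written for illustration, not the real Hadoop runtime API:

```javascript
// Local word-count simulation of the map/reduce logic shown above.
// The context and values objects are simplified stand-ins.
function wordCount(lines) {
    var intermediate = {}; // word -> array of emitted 1s (shuffle phase)
    var mapContext = {
        write: function (k, v) {
            (intermediate[k] = intermediate[k] || []).push(v);
        }
    };
    var map = function (key, value, context) {
        var words = value.split(/[^a-zA-Z]/);
        for (var i = 0; i < words.length; i++) {
            if (words[i] !== "") {
                context.write(words[i].toLowerCase(), 1);
            }
        }
    };
    lines.forEach(function (line, n) { map(n, line, mapContext); });

    var result = {};
    var reduce = function (key, values, context) {
        var sum = 0;
        while (values.hasNext()) { sum += parseInt(values.next()); }
        context.write(key, sum);
    };
    Object.keys(intermediate).forEach(function (key) {
        var i = 0, vals = intermediate[key];
        // Iterator shim matching the hasNext()/next() style used above
        var values = {
            hasNext: function () { return i < vals.length; },
            next: function () { return vals[i++]; }
        };
        reduce(key, values, { write: function (k, v) { result[k] = v; } });
    });
    return result;
}

console.log(wordCount(["Hello Hadoop", "hello HDFS"]));
// { hello: 2, hadoop: 1, hdfs: 1 }
```

In a real cluster the grouping step in the middle (collecting all values for a key) is the shuffle, and map and reduce run on different machines.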
Reference: http://en.wikipedia.org/wiki/File:Hadoop_1.png
HADOOP DISTRIBUTED ARCHITECTURE: Master / Slave
MapReduce layer and HDFS layer
HDFS ARCHITECTURE
RACK 1 - DataNodes | RACK 2 - DataNodes
File metadata:
/user/kc/data01.txt – Blocks 1, 2, 3, 4
/user/apb/data02.txt – Blocks 5, 6
Block placement (blocks replicated across DataNodes on both racks), e.g.:
Block 1: R1DN01, R1DN02, R2DN01
Block 2: R1DN01, R1DN02, R2DN03
Block 3: R1DN02, R1DN03, R2DN01
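The placement pattern shown for Block 1 (first replica on the writer's node, second on another node in the same rack, third on a node in a different rack) can be sketched as follows. Exact policy details vary across Hadoop versions, and the node names and data structures here are hypothetical:

```javascript
// Simplified sketch of the rack-aware placement pattern above:
// replica 1 on the writer's node, replica 2 on another node in the
// same rack, replica 3 on a node in a different rack.
function placeReplicas(writerNode, racks) {
    // racks: { rackName: [nodeNames, ...] }
    var writerRack = Object.keys(racks).filter(function (r) {
        return racks[r].indexOf(writerNode) !== -1;
    })[0];
    var second = racks[writerRack].filter(function (n) {
        return n !== writerNode;
    })[0];
    var remoteRack = Object.keys(racks).filter(function (r) {
        return r !== writerRack;
    })[0];
    var third = racks[remoteRack][0];
    return [writerNode, second, third];
}

var racks = {
    rack1: ["R1DN01", "R1DN02", "R1DN03"],
    rack2: ["R2DN01", "R2DN02", "R2DN03"]
};
console.log(placeReplicas("R1DN01", racks));
// [ 'R1DN01', 'R1DN02', 'R2DN01' ]  -- matches Block 1 above
```

Keeping at least one replica on a different rack protects against the loss of an entire rack (e.g. a switch or power failure), while keeping two replicas rack-local limits cross-rack write traffic.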
BLOCK SIZE AND REPLICATION
<property>
    <name>dfs.block.size</name>
    <value>134217728</value> <!-- 128 MB -->
</property>
<property>
    <name>dfs.replication</name>
    <value>3</value>
</property>
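With dfs.block.size set to 134217728 bytes (128 MB), a file is carved into fixed-size blocks, with the last block possibly smaller. A quick sketch of the arithmetic:

```javascript
// How a file is split into HDFS blocks at the configured block size.
var BLOCK_SIZE = 134217728; // 128 MB, as in dfs.block.size above

function blockSizes(fileBytes, blockSize) {
    var sizes = [];
    while (fileBytes > 0) {
        sizes.push(Math.min(blockSize, fileBytes));
        fileBytes -= blockSize;
    }
    return sizes;
}

// A 300 MB file becomes two full 128 MB blocks plus a 44 MB tail.
var blocks = blockSizes(300 * 1024 * 1024, BLOCK_SIZE);
console.log(blocks.length);     // 3 blocks
console.log(blocks.length * 3); // 9 replicas stored with dfs.replication = 3
```

Note that the small tail block only occupies its actual size on disk; HDFS does not pad blocks to the full 128 MB.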
NAMENODE & SECONDARY NAMENODE
NameNode:
• A client application creates a new file in HDFS
• The NameNode logs that transaction in the edits file
• At checkpoint time, transactions in edits are merged with fsimage and edits is emptied
Secondary NameNode (checkpoint):
• Periodically creates checkpoints of the namespace
• Downloads fsimage and edits from the active NameNode
• Reads and merges fsimage and edits locally
• Uploads the new image back to the active NameNode
• Checkpoint timing is controlled by fs.checkpoint.period and fs.checkpoint.size
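The checkpoint flow above amounts to replaying the edits log against the fsimage and starting a fresh, empty log. A sketch with deliberately simplified stand-in data structures (the real fsimage and edits are binary files, not JavaScript objects):

```javascript
// Sketch of a checkpoint: merge the edits log into fsimage and
// return a fresh image plus an emptied edits log.
function checkpoint(fsimage, edits) {
    // fsimage: { path: blockCount }, edits: [{ op, path, blocks }]
    var merged = JSON.parse(JSON.stringify(fsimage)); // copy old image
    edits.forEach(function (tx) {
        if (tx.op === "create") { merged[tx.path] = tx.blocks; }
        if (tx.op === "delete") { delete merged[tx.path]; }
    });
    return { fsimage: merged, edits: [] }; // edits log is emptied
}

var result = checkpoint(
    { "/user/kc/data01.txt": 4 },
    [{ op: "create", path: "/user/apb/data02.txt", blocks: 2 }]
);
console.log(result.fsimage);
// { '/user/kc/data01.txt': 4, '/user/apb/data02.txt': 2 }
console.log(result.edits.length); // 0
```

Without periodic checkpoints the edits log grows without bound, and the next NameNode restart would have to replay every transaction since the last merge.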
SAFE MODE
During startup the NameNode loads the file system state from the fsimage and the edits log file, then waits for DataNodes to report their blocks. During this time the NameNode stays in Safemode. Safemode is essentially a read-only mode for the HDFS cluster: it does not allow any modifications to the file system or blocks. Normally the NameNode leaves Safemode automatically after the DataNodes have reported that most file system blocks are available.
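"Most blocks available" is a threshold check on the fraction of blocks reported by DataNodes; in Hadoop 1.x this is governed by dfs.safemode.threshold.pct, which defaults to 0.999. A sketch of the exit condition:

```javascript
// Sketch of the Safemode exit condition: the NameNode may leave
// Safemode once a sufficient fraction of blocks has been reported.
// dfs.safemode.threshold.pct defaults to 0.999 in Hadoop 1.x.
function canLeaveSafemode(reportedBlocks, totalBlocks, thresholdPct) {
    if (totalBlocks === 0) { return true; } // empty namespace
    return reportedBlocks / totalBlocks >= thresholdPct;
}

console.log(canLeaveSafemode(9990, 10000, 0.999)); // true
console.log(canLeaveSafemode(9989, 10000, 0.999)); // false
```

Administrators can also enter or leave Safemode manually, e.g. before maintenance, with the `hadoop dfsadmin -safemode enter|leave` command.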
HDFS WRITES
Step 1: The HDFS client caches the file data into a temporary local file
Steps 2-5: (write-pipeline diagram between the client, the NameNode, and the DataNodes)
FEEDBACK
Support Team's blog: http://blogs.msdn.com/b/bigdatasupport/
Facebook Page: https://www.facebook.com/MicrosoftBigData
Facebook Group: https://www.facebook.com/groups/bigdatalearnings/
Twitter: @debarchans
Read more:
http://en.wikipedia.org/wiki/Hadoop
http://en.wikipedia.org/wiki/Big_data
Next Session: Apache Hadoop – MapReduce