Hadoop Training by Keylabs

BigData

An Introduction by Keylabs

Need For A New Processing Platform (BigData)

What is BigData?
- Twitter (over ~7 TB/day)
- Facebook (over ~10 TB/day)
- Google (over ~20 PB/day)

Where does it come from?

Existing systems (vertical scalability)

Why Hadoop (horizontal scalability)?

Origin of Hadoop

Companies Using Hadoop

Yahoo, Google, Facebook, LinkedIn, IBM, Amazon, Hortonworks, Cloudera, NY Times … the list goes on.

What is Hadoop?

Flexible infrastructure for large-scale computation & data processing on a network of commodity hardware.

Written almost entirely in Java.

Open source & distributed under the Apache License 2.0.

Hadoop Core Components: HDFS & MapReduce.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
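To make the "simple programming models" idea concrete, here is a minimal sketch of the MapReduce pattern in plain Python. It does not use Hadoop or any Hadoop API. The function names (map_phase, shuffle, reduce_phase, word_count) and the sample lines are assumptions made up for illustration, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values seen for one key into a total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for lines of a large file.
    sample = ["big data needs big clusters", "hadoop stores big data"]
    print(word_count(sample))
```

In a real cluster the map and reduce steps would run on many machines in parallel and the framework would handle the grouping between them. This sketch only shows the shape of the model.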

What is Hadoop Not?

A File system

A database

An online transaction processing (OLTP) system

A replacement for all programming logic

Three Vs of BigData and counting…

Hadoop Introduction and Architecture

Hadoop High-Level Architecture

Hadoop Architecture

[Figure: Hadoop cluster diagram. Labels: Admin Node (JobTracker, NameNode); three worker nodes, each with a TaskTracker and a DataNode; MapReduce Engine; HDFS Cluster.]

Distributed File System Hadoop Distributed File System

Read 1 TB of Data

1 machine: 4 I/O channels, each channel 100 MB/s

10 machines: 4 I/O channels per machine, each channel 100 MB/s
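The arithmetic behind this comparison can be worked through in a few lines. The sketch below assumes decimal units (1 TB = 1,000,000 MB), data spread evenly across machines, and perfectly parallel reads with no network or coordination overhead. The function name read_time_seconds is made up for illustration.

```python
def read_time_seconds(data_mb, machines, channels_per_machine, mb_per_sec_per_channel):
    """Idealised time to read data_mb when all channels read in parallel."""
    total_throughput = machines * channels_per_machine * mb_per_sec_per_channel
    return data_mb / total_throughput


ONE_TB_IN_MB = 1_000_000  # assumption: decimal units, 1 TB = 10^6 MB

single = read_time_seconds(ONE_TB_IN_MB, 1, 4, 100)    # 400 MB/s in total
cluster = read_time_seconds(ONE_TB_IN_MB, 10, 4, 100)  # 4,000 MB/s in total

print(f"1 machine:   {single:.0f} s (about {single / 60:.1f} minutes)")
print(f"10 machines: {cluster:.0f} s (about {cluster / 60:.1f} minutes)")
```

Under these assumptions one machine needs 2,500 seconds (roughly 42 minutes) and ten machines need 250 seconds (roughly 4 minutes), which is the case for spreading data across many machines.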

What’s so Special About Open Source Hadoop?

HDFS - Hadoop Distributed File System

Design of HDFS

Where HDFS is not a good fit

Why is a block in HDFS so large?

Advantages of HDFS

HDFS is not for:

Low-latency data access

Large number of small files.

Multiple writers, arbitrary file modifications.

HDFS Architecture

Let us Zoom into HDFS

NameNode

Deeper Things about the NameNode (please note down these points)

DataNode

What is DataNode?

NameNode and DataNodes

Data Replication

What is Data Replication

Data Replication & Rack Awareness

File Write Operation

A client writing the data to HDFS

File Write Operation in Depth - 1

File Write Operation – Unhappy Path

File Read Operation

A client reading data from HDFS

File Read Operation in Depth - 1

File Read Operation - Unhappy Path

Secondary NameNode

Hadoop Cluster – A Typical Scenario

Hadoop Ecosystem

Data Loading Techniques and Analysis

When should we go for Hadoop?

Data is too huge

Processes are independent

Online analytical processing (OLAP)

Better scalability

Parallelism

Unstructured data

THANK YOU FOR YOUR ATTENTION!
