Hadoop training by keylabs


Transcript of Hadoop training by keylabs

Page 1: Hadoop training by keylabs

BigData

An Introduction by Keylabs

Page 2: Hadoop training by keylabs

Need For A New Processing Platform (BigData)

What is BigData?

- Twitter (over ~7 TB/day)
- Facebook (over ~10 TB/day)
- Google (over ~20 PB/day)

Where does it come from?

Existing systems (vertical scalability)

Why Hadoop (horizontal scalability)?

Page 3: Hadoop training by keylabs

Origin of Hadoop

Page 4: Hadoop training by keylabs

Companies Using Hadoop

Yahoo, Google, Facebook, LinkedIn, IBM, Amazon, HortonWorks, Cloudera, NY Times … the list goes on.

Page 5: Hadoop training by keylabs

What is Hadoop?

Flexible infrastructure for large-scale computation & data processing on a network of commodity hardware.

Written primarily in Java.

Open source & distributed under the Apache License.

Hadoop Core Components: HDFS & MapReduce.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
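To make the "simple programming models" point concrete, here is a minimal sketch of the MapReduce idea as a word count. It is plain Python that imitates the map, shuffle and reduce steps in a single process; it does not use the Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

In a real cluster the map calls would run in parallel on the nodes that hold the input blocks, and the grouped keys would be partitioned across reducers; the sketch only shows the shape of the computation.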

Page 6: Hadoop training by keylabs

What Hadoop is Not?

A File system

A database

An online transaction processing (OLTP) system

Replacement of all programming logic

Page 7: Hadoop training by keylabs

Three Vs of Hadoop and counting…

Page 8: Hadoop training by keylabs

Hadoop Introduction and Architecture

Page 9: Hadoop training by keylabs

Hadoop High-Level Architecture

Page 10: Hadoop training by keylabs

Hadoop Architecture

Admin Node: Job Tracker, Name Node

Three further nodes, each with a Task Tracker and a Data Node

MapReduce Engine: Job Tracker and Task Trackers

HDFS Cluster: Name Node and Data Nodes

Page 11: Hadoop training by keylabs

Hadoop Cluster

Page 12: Hadoop training by keylabs

Distributed File System Hadoop Distributed File System

Read 1TB Data

1 Machine: 4 I/O channels, 100 MB/s per channel

10 Machines: 4 I/O channels each, 100 MB/s per channel
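The comparison above can be worked through as simple arithmetic. The sketch below assumes 1 TB = 1,000,000 MB, that the data is spread evenly over the machines, and that all channels read in parallel with no network or coordination overhead; the function name `read_time_seconds` is hypothetical.

```python
def read_time_seconds(data_mb, machines, channels_per_machine, mb_per_sec_per_channel):
    """Time to read data_mb when every channel on every machine reads in parallel."""
    aggregate_mb_per_sec = machines * channels_per_machine * mb_per_sec_per_channel
    return data_mb / aggregate_mb_per_sec


ONE_TB_MB = 1_000_000  # assumption: decimal terabyte

single = read_time_seconds(ONE_TB_MB, 1, 4, 100)    # one machine
cluster = read_time_seconds(ONE_TB_MB, 10, 4, 100)  # ten machines

if __name__ == "__main__":
    print(f"1 machine:   {single:.0f} s ({single / 60:.1f} min)")
    print(f"10 machines: {cluster:.0f} s ({cluster / 60:.1f} min)")
```

Under these assumptions one machine needs 2500 s (about 42 minutes) and ten machines need 250 s (about 4 minutes), which is the motivation for spreading data across many nodes.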

Page 13: Hadoop training by keylabs

What’s so Special About Open Source Hadoop?

Page 14: Hadoop training by keylabs

HDFS - Hadoop Distributed File System

Design of HDFS

Where HDFS is not a good fit

Why Is a Block in HDFS So Large?

Advantage of HDFS?
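One commonly cited answer to the block-size question is that a large block keeps disk seek time small relative to transfer time. The sketch below works that out under assumed figures that are not from the slides: a 10 ms seek, a 100 MB/s transfer rate, and a target of seek time being 1% of transfer time. The function name `block_size_mb` is hypothetical.

```python
def block_size_mb(seek_time_s, transfer_mb_per_s, seek_fraction):
    """Block size at which seek time equals seek_fraction of the transfer time."""
    transfer_time_s = seek_time_s / seek_fraction
    return transfer_time_s * transfer_mb_per_s


if __name__ == "__main__":
    # Assumed figures: 10 ms seek, 100 MB/s transfer, seek limited to 1% of transfer time.
    print(block_size_mb(0.010, 100, 0.01))
```

With these inputs the block comes out at roughly 100 MB, which is in the same range as the 64 MB and 128 MB defaults used by Hadoop releases.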

Page 15: Hadoop training by keylabs

HDFS is not for:

Low-latency data access

Large number of small files

Multiple writers, arbitrary file modifications

Page 16: Hadoop training by keylabs

HDFS Architecture

Page 17: Hadoop training by keylabs

Let us Zoom into HDFS

Page 18: Hadoop training by keylabs

NameNode

Deeper Things about Name Node

Request to note down these points

Page 19: Hadoop training by keylabs

DataNode

What is DataNode?

Page 20: Hadoop training by keylabs

NameNode and DataNodes

Page 21: Hadoop training by keylabs

Data Replication

What is Data Replication?

Page 22: Hadoop training by keylabs

Data Replication & Rack Awareness
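The slide carries only its title, so the following is a hedged sketch of the default placement policy as described in the HDFS documentation for a replication factor of 3: the first replica goes on the writer's node, the second on a node in a different rack, and the third on another node in that same remote rack. Real HDFS chooses nodes randomly and considers load; this sketch picks the first candidates so the result is reproducible. The function name `place_replicas` and the topology layout are hypothetical.

```python
def place_replicas(writer_node, topology):
    """Pick three nodes for one block.

    topology maps a rack name to the list of node names in that rack.
    Assumes the writer runs on a cluster node and that some other rack
    has at least two nodes.
    """
    # Rack that contains the writer's node.
    local_rack = next(rack for rack, nodes in topology.items() if writer_node in nodes)
    # First other rack with room for two replicas (real HDFS picks at random).
    remote_rack = next(
        rack for rack, nodes in topology.items()
        if rack != local_rack and len(nodes) >= 2
    )
    second, third = topology[remote_rack][:2]
    return [writer_node, second, third]


if __name__ == "__main__":
    racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
    print(place_replicas("n1", racks))
```

Placing two replicas in one remote rack survives the loss of a whole rack while keeping cross-rack write traffic to a single transfer.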

Page 23: Hadoop training by keylabs

File Write Operation


Page 24: Hadoop training by keylabs

A client writing the data to HDFS

Page 25: Hadoop training by keylabs

File Write Operation in Depth - 1

Page 26: Hadoop training by keylabs

File Write Operation in Depth - 2

Page 27: Hadoop training by keylabs

File Write Operation in Depth - 3

Page 28: Hadoop training by keylabs

File Write Operation in Depth - 4

Page 29: Hadoop training by keylabs

File Write Operation in Depth - 4

Page 30: Hadoop training by keylabs

File Write Operation – Unhappy Path

Page 31: Hadoop training by keylabs

File Read Operation


Page 32: Hadoop training by keylabs

A client reading data from HDFS

Page 33: Hadoop training by keylabs

File Read Operation in Depth - 1

Page 34: Hadoop training by keylabs

File Read Operation in Depth - 2

Page 35: Hadoop training by keylabs

File Read Operation in Depth - 3

Page 36: Hadoop training by keylabs

File Read Operation - Unhappy Path

Page 37: Hadoop training by keylabs

Secondary NameNode

Page 38: Hadoop training by keylabs

Hadoop Cluster – A Typical Scenario

Page 39: Hadoop training by keylabs

Hadoop Ecosystem

Page 40: Hadoop training by keylabs

Data Loading Techniques and Analysis

Page 41: Hadoop training by keylabs

When should we go for Hadoop?

Data is too huge

Processes are independent

Online analytical processing (OLAP)

Better scalability

Parallelism

Unstructured data

Page 42: Hadoop training by keylabs

THANK YOU FOR YOUR ATTENTION!