Big Data and Hadoop Ecosystem

Big Data and Hadoop. Presenter: Rajkumar Singh (http://rajkrrsingh.blogspot.com / http://in.linkedin.com/in/rajkrrsingh)

A brief introduction to the components of the Hadoop ecosystem.

Big Data and Hadoop Introduction

• Volume: Facebook, Google Plus, Twitter, LinkedIn, Stock Exchange, Healthcare, Telecom

• Variety: Structured, Semi-structured, Unstructured

• Velocity: Facebook, Stock Exchange, Healthcare, Telecom, Mobile Devices, GPS, Security Infrastructure

The Problem

e.g. Stock Market

The Solution (Hadoop Evolution)

Traditional Approach

Data volumes grow from GB -> TB -> PB -> ZB, so processing with a traditional RDBMS becomes impractical.

Challenges in Big Data

• Storage: petabyte (PB) scale

• Processing: must complete in a timely manner

• Variety of data: structured, semi-structured, unstructured

• Cost

Hadoop evolved to overcome Big Data challenges

• Cost effective: commodity hardware

• Big cluster (1000 nodes): provides storage and processing

• Parallel processing: MapReduce

• Big storage: usable capacity = disk capacity per node * number of nodes / replication factor (RF)

• Failover mechanism: automatic failover

• Data distribution

• MapReduce framework

• Moving code to data

• Heterogeneous hardware (IBM, HP, AIX, Oracle machines of any memory and CPU configuration)

• Scalable
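
The "Big Storage" rule of thumb in the list above can be worked through in a few lines of Python. This is only a rough sketch: the function name `usable_capacity_tb` and the 4 TB-per-node figure are illustrative assumptions, not values from the slides, and the estimate ignores disk space reserved for the operating system or intermediate data.

```python
def usable_capacity_tb(disk_per_node_tb: float,
                       num_nodes: int,
                       replication_factor: int = 3) -> float:
    """Estimate usable cluster storage in TB.

    Every block is stored `replication_factor` times, so the usable
    capacity is the raw capacity divided by the replication factor.
    """
    if num_nodes <= 0 or replication_factor <= 0:
        raise ValueError("num_nodes and replication_factor must be positive")
    raw_capacity_tb = disk_per_node_tb * num_nodes
    return raw_capacity_tb / replication_factor


if __name__ == "__main__":
    # Assumed example: 1000 nodes with 4 TB of disk each, RF = 3.
    print(f"{usable_capacity_tb(4, 1000, 3):.1f} TB usable")
```

With these assumed numbers, 4000 TB of raw disk yields roughly a third of that as usable space, which is why the replication factor matters when sizing a cluster.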

Typical Hadoop Infrastructure

What is Hadoop

• Java framework to process enormous amounts of data

Hadoop Core

• HDFS

• Programming construct (MapReduce)

HDFS

Processing Framework (MapReduce)
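
The MapReduce programming model named on this slide can be illustrated with a word count, the usual introductory example. The sketch below is plain Python running in a single process, not Hadoop's Java API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only to mirror the three stages of the model.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce sequentially over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

In a real cluster the mappers would run in parallel on the nodes that hold the input blocks ("moving code to data"), and the framework would perform the shuffle across the network. Here everything runs sequentially to keep the flow visible.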

Hadoop Ecosystem

Hadoop Sub-Projects

• Hadoop Common: The common utilities that support the other Hadoop subprojects.

• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.

• Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.

Other Hadoop-related projects at Apache include:

• Avro™: A data serialization system.

• Cassandra™: A scalable multi-master database with no single points of failure.

• Chukwa™: A data collection system for managing large distributed systems.

• HBase™: A scalable, distributed database that supports structured data storage for large tables.

• Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.

• Mahout™: A scalable machine learning and data mining library.

• Pig™: A high-level data-flow language and execution framework for parallel computation.

• ZooKeeper™: A high-performance coordination service for distributed applications. 

HDFS

• A distributed file system (DFS) based on GFS

[Figure: a 1 TB file split into four 250 GB parts]

HDFS : Use Cases

Well suited for:

• Very large files.

• Streaming data access: data is read in large volumes, written once and read frequently.

Not required or not well suited:

• Expensive hardware (HDFS runs on commodity hardware).

• Low-latency access.

• Lots of small files.

• Parallel writes / arbitrary reads.

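HDFS Building Blocks

• Default block size: 64 MB or 128 MB, depending on the Hadoop version

• A 1 GB file = 1024 MB / 128 MB = 8 blocks

• Files smaller than the block size: a 100 MB file is smaller than the 128 MB block size, so HDFS optimizes for storage and keeps it as 1 block that occupies only 100 MB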
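
The block arithmetic on this slide can be checked with a short Python sketch. The helper name `split_into_blocks` is an assumption made for illustration. It only models how a file size divides into blocks and does not touch a real HDFS cluster.

```python
MB = 1024 * 1024  # bytes in one megabyte


def split_into_blocks(file_size_bytes, block_size_bytes=128 * MB):
    """Return the sizes (in bytes) of the blocks a file would occupy.

    All blocks are full-sized except possibly the last one, which only
    takes as much space as the remaining data.
    """
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    full_blocks, remainder = divmod(file_size_bytes, block_size_bytes)
    blocks = [block_size_bytes] * full_blocks
    if remainder:
        blocks.append(remainder)
    return blocks


if __name__ == "__main__":
    print(len(split_into_blocks(1024 * MB)))                    # 1 GB file
    print([size // MB for size in split_into_blocks(100 * MB)])  # small file
```

A 1 GB file divides evenly into eight 128 MB blocks, while the 100 MB file produces a single block of 100 MB rather than a padded 128 MB one, matching the slide.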

HDFS Daemon Services

• Name Node

• Secondary Name Node

• Data Node

Like GFS, HDFS uses a master/slave architecture.

HDFS Write

• Block size: 128 MB, replication factor (RF) = 3

• Data nodes: D1, D2, D3, D4

• File 1 replicas: D1, D2, D4

• File 2 replicas: D1, D2, D3
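
The replica layout on this slide (each block stored on 3 of the 4 data nodes) can be mimicked with a small Python sketch. It is deliberately simplified and the names `place_replicas` and `survives_failure` are hypothetical. It picks data nodes at random, whereas the real Name Node placement policy also takes rack topology into account.

```python
import random


def place_replicas(data_nodes, replication_factor=3, rng=None):
    """Pick `replication_factor` distinct data nodes for one block.

    Simplifying assumption: nodes are chosen uniformly at random.
    """
    if replication_factor > len(data_nodes):
        raise ValueError("not enough data nodes for the replication factor")
    rng = rng or random.Random()
    return sorted(rng.sample(list(data_nodes), replication_factor))


def survives_failure(placement, failed_node):
    """Return True if at least one replica remains after one node fails."""
    return any(node != failed_node for node in placement)


if __name__ == "__main__":
    nodes = ["D1", "D2", "D3", "D4"]
    placement = place_replicas(nodes, replication_factor=3)
    print("replicas on:", placement)
    print("readable if D1 fails:", survives_failure(placement, "D1"))
```

With RF = 3, losing any single data node still leaves at least two copies of every block, which is what makes automatic failover possible.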

HDFS File System Commands

HDFS Federation

High Availability

Copying Data from One Cluster to Another Cluster

UAT Cluster -> Prod Cluster

Parallel copying using distcp:

hadoop distcp hdfs://uat:54311/user/rajkrrsingh/input hdfs://prod:54311/user/rajkrrsingh/input
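
When the copy is scripted rather than typed by hand, the same distcp invocation can be assembled in Python. This is a sketch only: `build_distcp_command` is a hypothetical helper, it uses just the `hadoop distcp <source> <target>` form shown on the slide, and it builds the argument list without running it.

```python
import shlex


def build_distcp_command(source_uri, target_uri):
    """Build the argument list for `hadoop distcp <source> <target>`."""
    if not source_uri or not target_uri:
        raise ValueError("both source and target URIs are required")
    return ["hadoop", "distcp", source_uri, target_uri]


if __name__ == "__main__":
    command = build_distcp_command(
        "hdfs://uat:54311/user/rajkrrsingh/input",
        "hdfs://prod:54311/user/rajkrrsingh/input",
    )
    # Print the command for review instead of executing it.
    print(shlex.join(command))
```

On a machine with the Hadoop client installed, the list could be passed to `subprocess.run(command, check=True)`. Keeping the arguments as a list avoids shell-quoting problems with unusual paths.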