Hadoop presentation


Description

A short compilation about Hadoop from various books and other resources. This is just for learning.

Transcript of Hadoop presentation

BIG DATA and Hadoop

By Chandra Sekhar

Contents

Introduction to Big Data

What is Hadoop?

What Hadoop is used for and is not

Top-level Hadoop projects

Differences between RDBMS and HBase

Facebook server model

Big Data: The Data Age

Big data is a collection of datasets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.

The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.

The data generated by different companies has inherent value and can be used in many of their analytics and prediction use cases.

A new approach

As per Moore's Law, which has held for the past 40 years:

1) Processing power doubles every two years.

2) Processing speed is no longer the problem.

Getting the data to the processors is now the bottleneck: transferring 100 GB takes about 22 minutes at a disk transfer rate of 75 MB/s (100 GB is roughly 102,400 MB, and 102,400 MB / 75 MB/s is about 1,365 seconds).

So the new approach is to move processing to the data side, in a distributed way, while satisfying requirements such as data recoverability, component recovery, consistency, reliability, and scalability.

The answer is Google's File System (GFS) and MapReduce, which live on in Hadoop as HDFS and MapReduce.

What Hadoop is used for

Hadoop is recommended to coexist with your RDBMS, serving as a data warehouse.

It is not a replacement for any RDBMS.

Processing terabytes or petabytes of data can take hours with traditional methods; with Hadoop and its ecosystem it takes only a few minutes, thanks to the power of distribution.

Many related tools integrate with Hadoop: data analysis, data visualization, database integration, workflow management, and cluster management.

➲ Distributed file system and parallel processing for large-scale data operations, using HDFS and MapReduce.

➲ Plus the infrastructure needed to make them work, including filesystem utilities, job scheduling and monitoring, and a web UI.

There are many other projects built around the core components of Hadoop (Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.), collectively called the ecosystem.

A set of machines running HDFS and MapReduce is known as a Hadoop cluster.

Individual machines are known as nodes. A cluster can have as few as one node or as many as several thousand; it is horizontally scalable.

More nodes = better performance!

Hadoop and EcoSystem

Hadoop Components

HDFS and MapReduce – core

ZooKeeper – administration and coordination

Hive, Pig – SQL and scripts based on MapReduce

HBase – NoSQL datastore

Sqoop – imports data to and exports data from an RDBMS

Avro – serialization based on JSON; used for the metadata store

Hadoop Components: HDFS

HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster. It sits on top of a native filesystem such as ext3, ext4, or XFS.

HDFS is a filesystem designed for storing very large files with streaming data access (write once, read many times), running on clusters of commodity hardware.

Data is split into blocks and distributed across multiple nodes in the cluster.

Each block is typically 64 MB or 128 MB in size.

Each block is replicated multiple times. The default is to replicate each block three times, with replicas stored on different nodes. This ensures both reliability and availability.
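
To make this concrete, here is a minimal sketch of writing a file through the HDFS Java API. The NameNode address and path are made up; note that block splitting and replication are handled by HDFS itself, not by the client.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; normally picked up from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");

            FileSystem fs = FileSystem.get(conf);
            // The file is chunked into 64/128 MB blocks and each block is
            // replicated (default 3x) by HDFS, transparently to this code.
            try (FSDataOutputStream out = fs.create(new Path("/data/example.txt"))) {
                out.writeUTF("hello hdfs");
            }
            fs.close();
        }
    }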

HDFS and MapReduce

HDFS daemons: NameNode (master), Secondary NameNode, master failover node, DataNodes (slave nodes).

MapReduce daemons: JobTracker (manages jobs), TaskTrackers (run tasks).

MapReduce components: Mapper, Reducer, Combiner, Partitioner.
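
A minimal word-count sketch may help make these roles concrete. It is written against the org.apache.hadoop.mapreduce API; a driver class (not shown) would register TokenMapper as the mapper and SumReducer as both combiner and reducer, leaving the Partitioner at its key-hashing default.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Mapper: emits (word, 1) for every word in its input split.
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String tok : value.toString().split("\\s+")) {
                    if (tok.isEmpty()) continue;
                    word.set(tok);
                    ctx.write(word, ONE);
                }
            }
        }

        // Reducer (also usable as a combiner): sums the counts per word.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }
    }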

[Diagram slides: HDFS and Nodes; Architecture; MapReduce]

HDFS Access

• WebHDFS – REST API

• FuseDFS – mounting HDFS as a normal drive

• Direct access – the native HDFS client API
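
As a rough sketch of the WebHDFS route: the client issues a plain HTTP GET and the NameNode redirects it to a DataNode that streams the file. The hostname and path are made up; 50070 was the classic NameNode web port, and WebHDFS must be enabled on the cluster.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class WebHdfsRead {
        public static void main(String[] args) throws Exception {
            // Hypothetical NameNode web address; on unsecured clusters a
            // user.name query parameter may also be required.
            URL url = new URL(
                "http://namenode:50070/webhdfs/v1/data/example.txt?op=OPEN");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            // The NameNode answers OPEN with a redirect to a DataNode,
            // which HttpURLConnection follows automatically.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }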

Hive and Pig

Hive offers a powerful SQL-like language (HiveQL). Though not fully standard SQL, it can be used to perform joins on top of datasets in HDFS.

Hive is used for large batch programs; at the backend, Hive simply runs MapReduce jobs.

Pig is a powerful scripting layer built on top of MapReduce jobs; its language is called Pig Latin.
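
For a flavour of how a client might run such a Hive join, here is a sketch against the HiveServer2 JDBC driver. The endpoint, credentials, and table names are all hypothetical, and the hive-jdbc jar must be on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical HiveServer2 endpoint; 10000 is the usual default port.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "user", "");
                 Statement stmt = conn.createStatement();
                 // A join over two HDFS-backed tables; Hive compiles this
                 // into one or more MapReduce jobs behind the scenes.
                 ResultSet rs = stmt.executeQuery(
                     "SELECT o.id, c.name FROM orders o " +
                     "JOIN customers c ON o.customer_id = c.id")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getString(2));
                }
            }
        }
    }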

HBASE

A powerful NoSQL database, modeled on Google's BigTable, with support for multiple masters and automatic failover.

Supports columns and column families; its data model can hold many billions of rows and many millions of columns.

An excellent piece of architecture as far as scalability is concerned.

A NoSQL database with atomic row-level operations and very fast reads and writes, typically serving millions of queries per second across a cluster.
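
A minimal sketch of the HBase Java client, showing how cells are addressed by row key, column family, and qualifier. It assumes the HBase 1.x+ client API; the ZooKeeper quorum, table name, and column family are made up.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical ZooKeeper quorum; the client locates the
            // master and region servers through ZooKeeper.
            Configuration conf = HBaseConfiguration.create();
            conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");

            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {
                // Write one cell: (row1, info:name) = "Chandra".
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                              Bytes.toBytes("Chandra"));
                table.put(put);

                // Read the row back and print the cell value.
                Result r = table.get(new Get(Bytes.toBytes("row1")));
                System.out.println(Bytes.toString(
                    r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
            }
        }
    }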

HBASE, continued

An HBase deployment consists of the HBase Master, RegionServers, a ZooKeeper quorum, and HDFS.

ZooKeeper, Mahout

ZooKeeper is a distributed coordinator. It can also be used as an independent package for managing any set of distributed servers.
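
A small sketch of the ZooKeeper Java client (the ensemble address and znode paths are made up). The ephemeral znode it creates vanishes when the session ends, which is the building block for liveness tracking and leader election.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical ensemble address; 2181 is the default client port.
            ZooKeeper zk = new ZooKeeper("zk1:2181", 3000, event ->
                System.out.println("event: " + event.getType()));

            // Ensure the parent znode exists (persistent).
            if (zk.exists("/workers", false) == null) {
                zk.create("/workers", new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }

            // An ephemeral znode disappears when this session ends.
            zk.create("/workers/worker-1", "idle".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

            byte[] data = zk.getData("/workers/worker-1", false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }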

Mahout is a machine-learning library useful for various data-science techniques.

For example: data clustering, classification, and recommender systems, using supervised and unsupervised learning.

Flume

Flume is a near-real-time data ingestion mechanism that writes incoming data into a store such as HDFS.

Flume can move large volumes of streaming data into HDFS, where it can be used for further analysis.

Apart from this, real-time analysis of web-log data is also possible with Flume in the pipeline.

The logs of a group of web servers can be written to HDFS using Flume, as in the sketch below.
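
A minimal sketch of a Flume 1.x agent configuration for that web-log case (agent name, log path, and HDFS path are made up): an exec source tails the access log, a memory channel buffers events, and an HDFS sink writes them out.

    # flume-conf.properties -- hypothetical agent named "a1"
    a1.sources  = r1
    a1.channels = c1
    a1.sinks    = k1

    # Tail the web server's access log.
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /var/log/httpd/access_log
    a1.sources.r1.channels = c1

    # Buffer events in memory between source and sink.
    a1.channels.c1.type = memory

    # Write events into date-partitioned HDFS directories.
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = hdfs://namenode:8020/weblogs/%Y-%m-%d
    a1.sinks.k1.hdfs.fileType = DataStream
    a1.sinks.k1.hdfs.useLocalTimeStamp = true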

Sqoop and Oozie

Sqoop is a mechanism for importing data from an RDBMS into HDFS or Hive, and for exporting it back.

Various vendors have prepared free connectors for different RDBMSs; these make data transfer very fast, since Sqoop parallelizes the transfer, as in the sketch below.
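
A minimal sketch of such an import (connect string, credentials, table, and target directory are hypothetical; -m sets the number of parallel map tasks):

    # Copy the "orders" table from MySQL into HDFS using 4 parallel map tasks.
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username etl -P \
      --table orders \
      --target-dir /data/orders \
      -m 4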

Oozie is a workflow mechanism for executing a large sequence of MapReduce, Hive, Pig, or HBase jobs, as well as arbitrary Java programs. Oozie also supports an email action within a workflow.

RDBMS vs HBASE

A typical RDBMS scaling story runs this way:

Initial public launch.

Service becomes popular; too many reads hitting the database.

Service continues to grow in popularity; too many writes hitting the database.

New features increase query complexity; now there are too many joins.

Rising popularity swamps the server; things are too slow.

Some queries are still too slow.

Reads are OK, but writes are getting slower and slower.

With HBase

Enter HBase, which has the following characteristics:

No real indexes.

Automatic partitioning (sharding)

Scales linearly and automatically with new nodes

Commodity hardware

Fault tolerance

Batch processing

Facebook Server Architecture