Hadoop presentation

BIG DATA and Hadoop By Chandra Sekhar

Description

A short compilation about Hadoop from various books and other resources. This is just for learning.

Transcript of Hadoop presentation

Page 1: Hadoop presentation

BIG DATA and Hadoop

By Chandra Sekhar

Page 2: Hadoop presentation

Contents

Page 3: Hadoop presentation

Introduction to Big Data
What is Hadoop?
What Hadoop is and is not used for
Top-level Hadoop projects
Differences between RDBMS and HBase
Facebook server model

Page 4: Hadoop presentation

Big Data: The Data Age

Big data is a collection of datasets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.

The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.

The data generated by different companies has inherent value and can be used for a variety of analytics and prediction use cases.

Page 5: Hadoop presentation

A new approach

Moore's Law has held true for the past 40 years:

1) Processing power doubles every two years.

2) Processing speed is no longer the problem.

Getting the data to the processors becomes the bottleneck: transferring 100 GB of data takes about 22 minutes at a disk transfer rate of 75 MB/sec.
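A quick back-of-the-envelope check of that figure (a minimal sketch; the 100 GB size and 75 MB/sec rate are the numbers quoted above):

```java
public class TransferTime {
    public static void main(String[] args) {
        double dataGb = 100.0;        // data to move, in GB
        double rateMbPerSec = 75.0;   // disk transfer rate, in MB/sec
        double seconds = (dataGb * 1024.0) / rateMbPerSec;  // 100 GB = 102,400 MB
        System.out.printf("Transfer time: %.1f minutes%n", seconds / 60.0);
        // Prints about 22.8 minutes, matching the ~22 min figure above.
    }
}
```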

So the new approach is to move processing of the data to the data side in a distributed way, while satisfying requirements such as data recoverability, component recovery, consistency, reliability and scalability.

The answer is Google's File System (GFS) and MapReduce, which live on in Hadoop as HDFS and MapReduce.

Page 6: Hadoop presentation

What Hadoop is used for

Hadoop is recommended to coexist with your RDBMS as a data warehouse.

It is not a replacement for any RDBMS.

Processing terabytes or petabytes of data can take hours with traditional methods; with Hadoop and its ecosystem it can take a few minutes, thanks to the power of distribution.

Many related tools integrate with Hadoop: data analysis, data visualization, database integration, workflow management, cluster management.

Page 7: Hadoop presentation

➲ Distributed File system and parallel processing for large scale data operations using HDFS and MapReduce.

➲ Plus the infrastructure needed to make them work, including filesystem utilities, job scheduling and monitoring, and a web UI.

Many other projects are built around Hadoop's core components: Pig, Hive, HBase, Flume, Oozie, Sqoop, etc. Together these are called the ecosystem.

A set of machines running HDFS and MapReduce is known as a Hadoop cluster.

Individual machines are known as nodes. A cluster can have as few as one node or as many as several thousand; it is horizontally scalable.

More nodes = better performance!

Hadoop and EcoSystem

Page 8: Hadoop presentation

Hadoop Components

HDFS and MapReduce: core

ZooKeeper: administration and coordination

Hive, Pig: SQL and scripts based on MapReduce

HBase: NoSQL datastore

Sqoop: imports data to and exports data from an RDBMS

Avro: serialization based on JSON; used for the metadata store

Page 9: Hadoop presentation

Hadoop Components: HDFS

HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster. It sits on top of a native file system such as ext3/ext4 or XFS.

HDFS is a file system designed for storing very large files with streaming data access (write once, read many times), running on clusters of commodity hardware.

Data is split into blocks and distributed across multiple nodes in the cluster.

Each block is typically 64 MB or 128 MB in size.

Each block is replicated multiple times; the default is to replicate each block three times. Replicas are stored on different nodes. This ensures both reliability and availability.
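A minimal sketch of inspecting a file's block size, replication factor and replica locations through the HDFS Java API (the file path is hypothetical; cluster settings come from core-site.xml/hdfs-site.xml):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/data.csv");   // hypothetical file

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size: " + status.getBlockSize() + " bytes");
        System.out.println("Replication factor: " + status.getReplication());

        // One BlockLocation per block; getHosts() lists the DataNodes holding replicas.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block at offset " + loc.getOffset()
                    + " replicas on: " + String.join(", ", loc.getHosts()));
        }
        fs.close();
    }
}
```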

Page 10: Hadoop presentation

HDFS and MapReduce

NameNode (master)

SecondaryNameNode

Master failover node

DataNodes (slave nodes)

JobTracker (jobs)

TaskTracker (tasks)

Mapper, Reducer, Combiner, Partitioner (see the WordCount sketch below)
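A minimal WordCount sketch showing how the Mapper and Reducer fit together (classic Hadoop MapReduce API; class names are illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Mapper: emits (word, 1) for every word in an input line.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer (also usable as a Combiner): sums the counts for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```

The same SumReducer can be set as the job's Combiner to pre-aggregate on the map side, and by default the Partitioner hashes each key to decide which reduce task receives it.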

Page 11: Hadoop presentation

HDFS and Nodes

Page 12: Hadoop presentation

Architecture

Page 13: Hadoop presentation

MapReduce

Page 14: Hadoop presentation

HDFS Access

• WebHDFS: REST API
• Fuse-DFS: mounting HDFS as a normal drive
• Direct access: using the HDFS Java client API directly
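A minimal sketch of reading a file over WebHDFS with plain Java HTTP (the NameNode host, port and file path are assumptions; the HTTP port is 50070 on classic Hadoop and 9870 on newer versions):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsRead {
    public static void main(String[] args) throws Exception {
        // WebHDFS OPEN operation against the NameNode's HTTP port (placeholder host).
        URL url = new URL("http://namenode.example.com:50070/webhdfs/v1/user/demo/data.csv"
                + "?op=OPEN&user.name=demo");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setInstanceFollowRedirects(true);  // the NameNode redirects the read to a DataNode

        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```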

Page 15: Hadoop presentation

Hive and Pig

Hive provides a powerful SQL-like language (not fully standard SQL) that can be used to perform joins on top of datasets in HDFS.

It is used for large batch programming; behind the scenes, Hive runs MapReduce jobs.

Pig is a powerful scripting layer built on top of MapReduce jobs; its language is called Pig Latin.
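A minimal sketch of running a Hive query from Java over JDBC (requires the hive-jdbc driver on the classpath; the HiveServer2 host, user and table names are assumptions):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; host, port and database are placeholders.
        String url = "jdbc:hive2://hiveserver.example.com:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "demo", "");
             Statement stmt = conn.createStatement();
             // A join over two HDFS-backed tables; Hive turns this into MapReduce jobs.
             ResultSet rs = stmt.executeQuery(
                     "SELECT o.order_id, c.name " +
                     "FROM orders o JOIN customers c ON o.customer_id = c.id " +
                     "LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getString(2));
            }
        }
    }
}
```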

Page 16: Hadoop presentation

HBASE

One of the most powerful NoSQL databases around. Supports multiple masters for failover and is based on Google's BigTable.

Supports columns and column families, and its data model can hold many billions of rows and many millions of columns.

An excellent architectural piece of work as far as scalability is concerned.

A NoSQL database with atomic row-level operations and very fast reads and writes, typically millions of queries per second across a cluster.
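A minimal sketch of writing and reading a cell with the HBase Java client API (HBase 1.x-style API; the table name, column family and ZooKeeper quorum are assumptions):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk1.example.com");   // ZooKeeper quorum (placeholder)

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "user42", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Chandra"));
            table.put(put);

            // Read it back by row key.
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(value));
        }
    }
}
```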

Page 17: Hadoop presentation

HBASE-Continued

HBase Master
Region Servers
ZooKeeper
HDFS

Page 18: Hadoop presentation

ZooKeeper, Mahout

ZooKeeper is a distributed coordination service and can be used as an independent package for managing any set of distributed servers.
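A minimal sketch of using the ZooKeeper Java client to register an ephemeral znode, the usual building block for coordination (the connect string and znode path are assumptions; the /workers parent znode is assumed to exist):

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkRegister {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Connect to the ZooKeeper ensemble (placeholder host); 30 s session timeout.
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 30000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // An ephemeral znode disappears automatically when this process dies,
        // which is how liveness tracking and coordination are usually built.
        String path = zk.create("/workers/worker-1", "host1:9000".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        System.out.println("Registered at " + path);

        Thread.sleep(5000);  // keep the session alive briefly for the demo
        zk.close();
    }
}
```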

Mahout is a machine learning library useful for various data science techniques.

For example: data clustering, classification and recommender systems, using supervised and unsupervised learning.

Page 19: Hadoop presentation

Flume

Flume is a real-time data ingestion mechanism that writes to a data mart.

Flume can move large volumes of streaming data into HDFS, where it can be used for further analysis.

Apart from this, real-time analysis of web-log data is also possible with Flume.

Logs from a group of web servers can be written to HDFS using Flume.
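A minimal sketch of sending a log line to a Flume agent from Java using the Flume RPC client SDK (the agent's Avro source host and port are assumptions; the agent's sink, e.g. an HDFS sink, is configured separately on the agent side):

```java
import java.nio.charset.StandardCharsets;
import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeSender {
    public static void main(String[] args) throws Exception {
        // Connect to a Flume agent's Avro source (placeholder host/port).
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-agent.example.com", 41414);
        try {
            // One web-server log line as a Flume event; the agent's sink decides where it lands.
            Event event = EventBuilder.withBody("GET /index.html 200", StandardCharsets.UTF_8);
            client.append(event);
        } finally {
            client.close();
        }
    }
}
```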

Page 20: Hadoop presentation

Sqoop and Oozie

Sqoop is a mechanism for importing and exporting data between an RDBMS and HDFS or Hive.

There are many free connectors prepared by various vendors for different RDBMSs, which makes data transfer very fast, as Sqoop supports parallel transfers.
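A minimal sketch of kicking off a parallel Sqoop import from Java by invoking the sqoop CLI (the JDBC URL, credentials, table and target directory are assumptions; sqoop must be on the PATH):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class SqoopImportExample {
    public static void main(String[] args) throws Exception {
        // Import the "employees" table into HDFS using 4 parallel map tasks.
        ProcessBuilder pb = new ProcessBuilder(
                "sqoop", "import",
                "--connect", "jdbc:mysql://db.example.com/sales",   // placeholder JDBC URL
                "--username", "demo", "--password-file", "/user/demo/.dbpass",
                "--table", "employees",
                "--target-dir", "/user/demo/employees",
                "--num-mappers", "4");                              // parallel transfer
        pb.redirectErrorStream(true);
        Process proc = pb.start();
        try (BufferedReader out = new BufferedReader(new InputStreamReader(proc.getInputStream()))) {
            String line;
            while ((line = out.readLine()) != null) {
                System.out.println(line);
            }
        }
        System.out.println("sqoop exited with code " + proc.waitFor());
    }
}
```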

Oozie is a workflow mechanism for executing a large sequence of MapReduce, Hive, Pig and HBase jobs, as well as other Java programs. Oozie also supports an email action as part of a workflow.

Page 21: Hadoop presentation

RDBMS vs HBASE

A typical RDBMS scaling story runs this way:

Initial public launch.

Service becomes popular; too many reads hit the database.

Service continues to grow in popularity; too many writes hit the database.

New features increase query complexity; now we have too many joins.

Rising popularity swamps the server; things are too slow

Some queries are still too slow

Reads are OK, but writes are getting slower and slower

Page 22: Hadoop presentation

With Hbase

Enter HBase, which has the following characteristics:

No real indexes.

Automatic partitioning/Sharding

Scales linearly and automatically with new nodes

Commodity hardware

Fault tolerance

Batch processing

Page 23: Hadoop presentation

Facebook Server Architecture