Hadoop and big data

Post on 03-Dec-2014



A combination of information from disparate sources related to Big Data and Hadoop

Transcript of Hadoop and big data

Big Data and Hadoop Essentials

2

Agenda

• Brief History in Time
• How Big is Big Data?
• Why Hadoop?
• Hadoop Architecture
• Hadoop Ecosystem
• Map Reduce Algorithm Exemplified
• Demo

3

Brief History in Time

In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.

—Grace Hopper, American computer scientist

4

How Big is Big Data?

5

How Big is Big Data?

6

How Big is Big Data?

7

Why Hadoop?

8

The Problem

9

BIG DATA

Volume: Big Data comes in on a large scale, in terabytes and even petabytes (records, transactions, tables, files).

Veracity: the quality, consistency, reliability and provenance of data (good, bad, undefined, inconsistent, incomplete).

Variety: Big Data extends beyond structured data to include semi-structured and unstructured data of all kinds (text, logs, XML, audio, video, streams, flat files).

Velocity: data flows continuously and is time-sensitive (batch, real-time, streams, historic).

Challenges in managing Big Data

10

To overcome Big Data challenges, Hadoop evolved:

• Cost effective: commodity hardware
• Big cluster (1,000 nodes): provides storage and processing
• Parallel processing: MapReduce
• Big storage: memory per node * number of nodes / replication factor (RF)
• Failover mechanism: automatic failover
• Data distribution
• Moving code to the data
• Heterogeneous hardware (IBM, HP, AIX, Oracle machines of any memory and CPU configuration)
• Scalable
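The "big storage" bullet is effectively a formula: usable cluster capacity is roughly storage per node times the number of nodes, divided by the replication factor, since HDFS stores every block RF times (3 by default). A quick sanity check in Python; the per-node figure is an illustrative assumption, not from the slides:

```python
# Rough usable-capacity estimate for an HDFS cluster.
storage_per_node_tb = 10   # raw disk per node, in TB (illustrative assumption)
num_nodes = 1000           # the "Big Cluster" size from the slide
replication_factor = 3     # HDFS default: each block is stored 3 times

raw_capacity_tb = storage_per_node_tb * num_nodes
usable_capacity_tb = raw_capacity_tb / replication_factor

print(f"Raw: {raw_capacity_tb} TB, usable: ~{usable_capacity_tb:.0f} TB")
# Raw: 10000 TB, usable: ~3333 TB
```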

11

What Exactly is Hadoop?

12

What’s in a name?

13

Hadoop Vendors

14

Who uses Hadoop?

15

What is Hadoop used for?

16

Stop and Ponder

• Is Hadoop an alternative to an RDBMS?
• Hadoop is not replacing the traditional data systems used for building analytic applications (the RDBMS, EDW and MPP systems) but rather complements them, and works fine together with an RDBMS.
• Hadoop is being used to distill large quantities of data into something more manageable.

17

Stop and Ponder

• But don’t we know Coherence to be distributed too? Why Hadoop?

Coherence is the market-leading in-memory data grid. Hadoop works well for large batch-style processing over many terabytes of data; where the processing requirements are more real-time and the data volumes are smaller, Coherence is a better choice than HDFS for storing the data.

18

Hadoop vs. RDBMS

            RDBMS                  MapReduce
Data size   Gigabytes              Petabytes
Access      Interactive and batch  Batch
Structure   Fixed schema           Unstructured schema
Language    SQL                    Procedural (Java, C++, Ruby, etc.)
Integrity   High                   Low
Scaling     Nonlinear              Linear
Updates     Read and write         Write once, read many times
Latency     Low                    High

19

Using Hadoop in Enterprise

20

Hadoop Architecture

• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.

• Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.

[Diagram: Hadoop = HDFS (storage) + MapReduce (processing)]
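MapReduce jobs are not limited to Java: Hadoop Streaming runs any program that reads lines from stdin and writes key/value lines to stdout. As a hedged sketch (not from the slides), here is the classic word count with the Map and Reduce logic as plain Python functions, using a local sort to stand in for Hadoop's shuffle:

```python
from itertools import groupby

def mapper(lines):
    """Map: emit a (word, 1) pair for every word seen."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce: sum the counts per word. Assumes pairs arrive
    sorted by key, as Hadoop's shuffle/sort guarantees."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    lines = ["big data and hadoop", "hadoop and big clusters"]  # toy input
    shuffled = sorted(mapper(lines))   # local stand-in for Hadoop's shuffle
    for word, count in reducer(shuffled):
        print(word, count)
```

In a real job the mapper and reducer would be two separate scripts handed to the Hadoop Streaming jar, and the sort between them would happen across the cluster rather than in one process.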

21

Hadoop Distributed File System(HDFS)

22

HDFS Architecture (Master-Slave)

[Diagram: the NameNode (master) keeps the file system bookkeeping; a Secondary NameNode takes periodic checkpoints; DataNodes (slaves) store the data blocks]

23

The CORE

[Diagram: the client submits data analytics jobs to MapReduce and data storage jobs to HDFS; both run across the MASTER and SLAVE nodes]

24

Hadoop Ecosystem

25

MAP REDUCE Algorithm exemplified!

Calculate the yearly average temperature per state.

26

Group the city average temperatures by state

1

27

We don’t really care about the city names, so we will discard those and keep only the state names and city temperatures.

2

28

3

We’re going to get a list of temperatures for each state.

29

That was Map/Reduce!

4

All we have to do is calculate the average temperature for each state.

30

Let’s do it again…

• Map/Reduce has 3 stages: Map, Shuffle, Reduce.
• The Shuffle stage is done automatically by Hadoop; you just need to implement the Map and Reduce parts.
• You get input data as <Key, Value> pairs for the Map part.
• In this example, the Key is the city name, and the Value is the set of attributes: state and city yearly average temperature.

31

Map

• Since you want to group your temperatures by state, you get rid of the city name: the State becomes the Key, and the Temperature becomes the Value.
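In code, this Map step is just a projection from (city, (state, temperature)) to (state, temperature). A minimal sketch; the cities and temperatures are invented sample data, not from the slides:

```python
# Map input: city -> (state, yearly average temperature). Invented sample data.
city_records = [
    ("Sacramento", ("CA", 75)),
    ("Fresno", ("CA", 81)),
    ("Austin", ("TX", 85)),
    ("Dallas", ("TX", 83)),
]

def map_step(records):
    """Discard the city name; emit (state, temperature) pairs."""
    for _city, (state, temp) in records:
        yield state, temp

print(list(map_step(city_records)))
# [('CA', 75), ('CA', 81), ('TX', 85), ('TX', 83)]
```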

32

Shuffle

• Now the shuffle task runs on the output of the Map task. It groups all the values by Key, so for each Key you get a List<Value>.
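The shuffle can be mimicked locally with a dictionary that collects every value seen for a key. A sketch over invented (state, temperature) pairs:

```python
from collections import defaultdict

# Output of the Map step: (state, temperature) pairs (invented sample data).
mapped = [("CA", 75), ("TX", 85), ("CA", 81), ("TX", 83)]

def shuffle_step(pairs):
    """Group all values by Key, yielding Key -> List<Value>,
    which is what Hadoop's shuffle does automatically."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

print(shuffle_step(mapped))
# {'CA': [75, 81], 'TX': [85, 83]}
```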

33

Reduce

• The Reduce task is the one that applies the logic to the data; in our case, the calculation of each state’s yearly average temperature.
• And that is what we get as the final output.
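Putting the three stages together end to end, with the Reduce step averaging each state's list of temperatures (all data invented for illustration):

```python
from collections import defaultdict

# Map input: city -> (state, yearly average temperature). Invented sample data.
city_records = [
    ("Sacramento", ("CA", 75)),
    ("Fresno", ("CA", 81)),
    ("Austin", ("TX", 85)),
    ("Dallas", ("TX", 83)),
]

# Map: drop the city, keep (state, temperature).
mapped = [(state, temp) for _city, (state, temp) in city_records]

# Shuffle: group the temperatures by state (Hadoop does this for you).
grouped = defaultdict(list)
for state, temp in mapped:
    grouped[state].append(temp)

# Reduce: apply the per-state logic, here the yearly average.
averages = {state: sum(temps) / len(temps) for state, temps in grouped.items()}

print(averages)
# {'CA': 78.0, 'TX': 84.0}
```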

34

Hadoop AppStore

35

Ecosystem Matrix

36

Pig and HIVE in the Hadoop Ecosystem

37

Hadoop Ecosystem Development

38

Demo

39

References

• http://hadoop.apache.org/
• http://hadoop.apache.org/hive/
• Hadoop in Action (http://www.manning.com/lam/)
• Hadoop: The Definitive Guide, 2nd ed. (http://oreilly.com/catalog/0636920010388)
• Yahoo! Hadoop blog (http://developer.yahoo.net/blogs/hadoop/)
• Cloudera (http://www.cloudera.com/)

40

Q & A

41

Thank You