A gentle introduction to the world of BigData and Hadoop

A gentle introduction to the world of #BigData and #Hadoop, with a quick look at what you can do in Azure.


Page 1: A gentle introduction to the world of BigData and Hadoop

Hello! A “gentle” introduction to the world of Big Data and the Hadoop platform

Page 2: A gentle introduction to the world of BigData and Hadoop

Agenda

1. Introduction: the history, #BigData, a bit of the theory behind it…
2. What is Hadoop, part 1: introducing HDFS and MapReduce
3. What is Hadoop, part 2: the next generation (v2.x), real time, …
4. Microsoft and Big Data: the Lambda architecture and Windows Azure, WA Storage(s), WA HDInsight
5. Q&A

Page 3: A gentle introduction to the world of BigData and Hadoop

Who am I? (Who cares?)

Stefano Paluello
• Tech Lead @ SG Gaming
• All-around geek, passionate about architecture, Cloud and Data
• Co-founder of various start-ups

http://about.me/stefanop

Page 4: A gentle introduction to the world of BigData and Hadoop
Page 5: A gentle introduction to the world of BigData and Hadoop

How it all started…

Page 6: A gentle introduction to the world of BigData and Hadoop

Oops…

Page 7: A gentle introduction to the world of BigData and Hadoop

History

• 2002: Hadoop has its origins in Apache Nutch, an open-source web search engine created by Doug Cutting as part of the Lucene project (a full-text search engine).
• 2003: Google publishes a paper describing its own distributed file system, the Google File System (GFS).
• 2004: The first version of NDFS, the Nutch Distributed File System, implements the ideas of Google’s paper.

Page 8: A gentle introduction to the world of BigData and Hadoop

History

• 2004: Google publishes another paper, this time introducing the MapReduce algorithm.
• 2005: The first version of MapReduce is implemented in Nutch.
• 2005 (end): Nutch’s MapReduce is running on NDFS.
• 2006 (Feb): Nutch’s MapReduce and NDFS become the core of a new Lucene subproject: Hadoop.

Page 9: A gentle introduction to the world of BigData and Hadoop

History

• 2008: Yahoo! launches the world’s largest Hadoop PRODUCTION site.

Some Webmap size data:
• Number of links between pages in the index: roughly 1 trillion (10^12) links
• Size of the output: over 300 TB, compressed (!!!)
• Number of cores used to run a single MapReduce job: over 10,000
• Raw disk used in the production cluster: over 5 petabytes

Page 10: A gentle introduction to the world of BigData and Hadoop
Page 11: A gentle introduction to the world of BigData and Hadoop
Page 12: A gentle introduction to the world of BigData and Hadoop

OK, let’s start with…

… a bit of theory

Page 13: A gentle introduction to the world of BigData and Hadoop

Nooo, wait! Don’t run away!

Page 14: A gentle introduction to the world of BigData and Hadoop

What is #BigData?

Big Data is a definition, though for some it is just a buzzword (a keyword with no precise meaning, but that sounds interesting), trying to address the “new” (really?!?) need to process huge amounts of data.

We usually use the “three Vs” to define Big Data.

Page 15: A gentle introduction to the world of BigData and Hadoop

The 3 Vs of #BigData

• Volume: the size of the data we are dealing with
• Variety: the data comes from many different sources
• Velocity: the speed at which the data is generated

Page 16: A gentle introduction to the world of BigData and Hadoop

Source: www.wipro.com, July 2012

Page 17: A gentle introduction to the world of BigData and Hadoop

And the 4Vs of #BigData?


Source: Oracle.com

Page 18: A gentle introduction to the world of BigData and Hadoop

The 4Vs of #BigData (2)

Source: IBM.com

Page 19: A gentle introduction to the world of BigData and Hadoop

#BigData

It is predicted that between 2009 and 2020 the size of the “digital universe” will grow to around 35 zettabytes (!!!)

1 zettabyte (2^70 bytes) = 1K exabytes = 1M petabytes = 1G terabytes

The #BigData market analysis, and the 3Vs definition, were introduced by Gartner research about 13 years ago:
http://blogs.gartner.com/doug-laney/deja-vvvue-others-claiming-gartners-volume-velocity-variety-construct-for-big-data/

Source: www.wipro.com, July 2012

Page 20: A gentle introduction to the world of BigData and Hadoop
Page 21: A gentle introduction to the world of BigData and Hadoop

Big Data Lambda Architecture

What??? Lam…what???

Page 22: A gentle introduction to the world of BigData and Hadoop

I said LAMBDA !!!

Page 23: A gentle introduction to the world of BigData and Hadoop

Lambda Architecture

Solves the problem of computing arbitrary functions on arbitrary data by decomposing the problem into three layers:

• The batch layer
• The serving layer
• The speed layer

Page 24: A gentle introduction to the world of BigData and Hadoop

The Batch layer

Stores all the data in an immutable, constantly growing dataset

Accessing all the data is too expensive (even if possible)

Precompute “query” functions are created (aka “batch view”, high latency operations) allowing the results to be accessed quickly

Page 25: A gentle introduction to the world of BigData and Hadoop

The Batch layer

Source: “Big Data”, by Manning

Page 26: A gentle introduction to the world of BigData and Hadoop

The Serving layer

Indexes the batch views.

Loads the batch views and allows them to be accessed and queried efficiently.

It is usually a distributed database that loads the batch views and is updated by the batch layer.

It requires batch updates and random reads, but does NOT require random writes.

Page 27: A gentle introduction to the world of BigData and Hadoop

The Speed layer

Compensates for the high-latency updates of the serving layer.

Provides fast, incremental algorithms.

Updates the realtime views as new data arrives, without recomputing them from scratch like the batch layer does.
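To make the three layers concrete, here is a toy sketch in Python (all names and data are made up; in a real system the batch layer would be a Hadoop job and the speed layer something like Storm):

from collections import defaultdict

# Master dataset: an immutable, append-only record of raw events.
master_dataset = [
    ("pageview", "/home"), ("pageview", "/about"), ("pageview", "/home"),
]

def recompute_batch_view(events):
    """Batch layer: recompute the view from scratch over ALL the data (high latency)."""
    view = defaultdict(int)
    for kind, url in events:
        if kind == "pageview":
            view[url] += 1
    return dict(view)

batch_view = recompute_batch_view(master_dataset)  # exposed by the serving layer

# Speed layer: incrementally update a realtime view for events that arrived
# after the last batch run, without recomputing anything.
realtime_view = defaultdict(int)

def on_new_event(kind, url):
    if kind == "pageview":
        realtime_view[url] += 1

on_new_event("pageview", "/home")  # a fresh event the batch layer hasn't seen yet

def query(url):
    """A query merges the precomputed batch view with the realtime view."""
    return batch_view.get(url, 0) + realtime_view.get(url, 0)

print(query("/home"))  # -> 3 (2 from the batch view + 1 from the speed layer)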

Page 28: A gentle introduction to the world of BigData and Hadoop

The Speed layer

Source: “Big Data”, by Manning

Page 29: A gentle introduction to the world of BigData and Hadoop

Recap

Source: “Big Data”, by Manning

Page 30: A gentle introduction to the world of BigData and Hadoop
Page 31: A gentle introduction to the world of BigData and Hadoop

Distributed Data 101

Just a couple of reminders…

Page 32: A gentle introduction to the world of BigData and Hadoop

ACID

ACID is a set of properties that guarantee that database transactions are processed reliably. [Source: Wikipedia]

Atomicity: “all or nothing”. All the modifications in a transaction must happen successfully, or no changes are committed.

Consistency: the data will always be in a valid state after every transaction.

Isolation: transactions are isolated from each other, so a transaction won’t affect the data of other transactions.

Durability: once a transaction is committed, the related data is safely and durably stored, regardless of errors, crashes or other software malfunctions.
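As a tiny illustration of atomicity, here is a sketch using Python’s built-in sqlite3 module (the table and the values are made up): either both halves of the transfer happen, or neither does.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction: commit on success, rollback on error
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        raise RuntimeError("crash in the middle of the transfer")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")  # never reached
except RuntimeError:
    pass

# Atomicity: the failed transaction left no partial changes behind.
print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
# -> [('alice', 100), ('bob', 0)]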

Page 33: A gentle introduction to the world of BigData and Hadoop

CAP

The CAP theorem (or Brewer’s theorem) describes a set of basic requirements for a distributed system:

Consistency: all the servers in the system see the same data.

Availability: all the servers in the system stay available and return the data they hold (even if it may not be consistent across the system).

Partition tolerance: the system continues to operate as a whole despite arbitrary message loss or the failure of part of the system.

According to the theorem, a distributed system CANNOT satisfy all three requirements at the SAME time (the “two out of three” concept).

Page 34: A gentle introduction to the world of BigData and Hadoop

Here we are… Your “#BigData 101” degree!

Page 35: A gentle introduction to the world of BigData and Hadoop

What is Hadoop?

(Part 1)

Page 36: A gentle introduction to the world of BigData and Hadoop

Hadoop…

Where does the name come from? The “legend” says it comes from the toy elephant of the son of Doug Cutting (one of the founders of the project). Hence the logo: a yellow, smiling elephant.

Page 37: A gentle introduction to the world of BigData and Hadoop

Hadoop cluster

A Hadoop cluster consists of two main modules:

• A way to store distributed data: HDFS, the Hadoop Distributed File System (the storage layer)
• A way to process the data: MapReduce (the compute layer)

This is the core of Hadoop!

Page 38: A gentle introduction to the world of BigData and Hadoop

HDFS

The Hadoop Distributed File System

• From a developer’s point of view, it looks like a standard file system
• Runs on top of the OS file system (ext3, …)
• Designed to store very large amounts of data (petabytes and beyond) and to solve some of the problems that come with DFS and NFS
• Provides fast and scalable access to the data
• Stores data reliably

Page 39: A gentle introduction to the world of BigData and Hadoop
Page 40: A gentle introduction to the world of BigData and Hadoop

How does this… ?

Page 41: A gentle introduction to the world of BigData and Hadoop

HDFS under the hood

All the files loaded into Hadoop are split into chunks, called blocks. Each block has a fixed size of 64 MB (!!!). Yes, megabytes!

MyData (150 MB) becomes, in HDFS:
• Blk_01: 64 MB
• Blk_02: 64 MB
• Blk_03: 22 MB
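The arithmetic is easy to sketch (64 MB was the classic default; the block size is configurable, and newer versions default to 128 MB):

BLOCK_SIZE_MB = 64  # the classic HDFS default

def split_into_blocks(file_size_mb):
    """Return the sizes of the HDFS blocks for a file of the given size."""
    blocks = []
    while file_size_mb > 0:
        blocks.append(min(BLOCK_SIZE_MB, file_size_mb))
        file_size_mb -= BLOCK_SIZE_MB
    return blocks

print(split_into_blocks(150))  # -> [64, 64, 22]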

Page 42: A gentle introduction to the world of BigData and Hadoop

Datanode(s) and Namenode

A Datanode is a daemon (a “service”, in Windows terms) running on each cluster node, responsible for storing the blocks.

The Namenode is a dedicated node where the metadata of all the files (blocks) in the system is stored. It is the directory manager of the HDFS.

To access a file, a client contacts the Namenode to retrieve the list of locations of its blocks. With those locations, the client then contacts the Datanodes directly to read the data (possibly in parallel).

Page 43: A gentle introduction to the world of BigData and Hadoop

Data Redundancy

Hadoop replicates each block THREE times as it is stored in the HDFS.

The location of every block is managed by the Namenode.

If a block is under-replicated (due to some failure on a node), the Namenode is smart enough to create another replica, until each block has three replicas inside the cluster.

Yes… you did your homework! If I have 100 TB of data to store in Hadoop, I will need 300 TB of storage space.

Page 44: A gentle introduction to the world of BigData and Hadoop

Datanode(s) and Namenode

[Diagram: one Namenode (NN) coordinating several Datanodes (D)]

Page 45: A gentle introduction to the world of BigData and Hadoop

Namenode availability

If the Namenode fails, the WHOLE cluster becomes inaccessible.

In the early versions, the Namenode was a single point of failure.

A couple of solutions are now available:
• the Namenode stores its metadata on the network through NFS
• most production sites have two Namenodes: one Active and one Standby

Page 46: A gentle introduction to the world of BigData and Hadoop

HDFS Quick Reference

The HDFS commands are pretty easy to use and to remember (especially if you come from a *nix-like environment).

The commands usually have the “hadoop fs” prefix.

To list the contents of an HDFS folder:
> hadoop fs -ls

To load a file into the HDFS:
> hadoop fs -put <file>

To read a file loaded into the HDFS:
> hadoop fs -tail <file>

And so on…
> hadoop fs -mkdir <dir>
> hadoop fs -mv <sourcefile> <destfile>
> hadoop fs -rm <file>
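A typical first session might look like this (paths and file names are just examples):

> hadoop fs -mkdir input
> hadoop fs -put sales.log input
> hadoop fs -ls input
> hadoop fs -tail input/sales.log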

Page 47: A gentle introduction to the world of BigData and Hadoop

MapReduce

Page 48: A gentle introduction to the world of BigData and Hadoop
Page 49: A gentle introduction to the world of BigData and Hadoop

MapReduce

Processing large files serially can be a problem. MapReduce is designed as a highly parallel way of processing data:

• The data is split into many pieces
• Each piece is processed simultaneously, in isolation, by tasks called Mappers
• The results of the Mappers are then brought together (by a process called “Shuffle and Sort”) and fed into a second set of tasks, the Reducers

Page 50: A gentle introduction to the world of BigData and Hadoop

Mappers

Page 51: A gentle introduction to the world of BigData and Hadoop

Reducers

Page 52: A gentle introduction to the world of BigData and Hadoop

The MapReduce “HelloWorld”

All the MapReduce examples and tutorials start with one simple program: the WordCount. Let’s take a look at it.

Page 53: A gentle introduction to the world of BigData and Hadoop

Java code…

Page 54: A gentle introduction to the world of BigData and Hadoop
Page 55: A gentle introduction to the world of BigData and Hadoop

Using Hadoop Streaming

Hadoop Streaming allows you to write Mappers and Reducers in almost any language, rather than forcing you to use Java.

The command to run a streaming job is a bit “tricky”.
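The listings from the original slides are not in this transcript; as a rough sketch, the WordCount written for Hadoop Streaming in Python could look like this (file names and paths are made up):

mapper.py:

#!/usr/bin/env python3
# Emits "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

reducer.py:

#!/usr/bin/env python3
# The input arrives sorted by key, so equal words are adjacent.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, count))
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print("%s\t%d" % (current_word, count))

And the (slightly “tricky”) command to run it (the exact path of the streaming jar depends on your distribution):

> hadoop jar /path/to/hadoop-streaming.jar \
    -input input -output output \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py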

Page 56: A gentle introduction to the world of BigData and Hadoop
Page 57: A gentle introduction to the world of BigData and Hadoop

MapReduce on a “real” case

Retailer with many stores around the country

The data is written to a sequential log with date, store location, item, price and payment method:

2014-01-01 London Clothes 13.99£ Card
2014-01-01 NewCastle Music 05.69£ Bank
….

A really simple mapper will split every record and emit the store location together with the price.

The reducer will then calculate the total sales for every location.

Page 58: A gentle introduction to the world of BigData and Hadoop

Python code…
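The Python listing from the original slide is not in the transcript either; here is a minimal sketch of what such a Streaming pair might look like (the field layout is assumed from the sample log lines above):

mapper.py:

#!/usr/bin/env python3
# Emits "location<TAB>price" for every log record.
import sys

for line in sys.stdin:
    fields = line.strip().split()  # date, location, item, price, payment
    if len(fields) == 5:
        date, location, item, price, payment = fields
        print("%s\t%s" % (location, price.replace("£", "")))

reducer.py:

#!/usr/bin/env python3
# Sums the prices per location (the input arrives sorted by location).
import sys

current_location, total = None, 0.0
for line in sys.stdin:
    location, price = line.rstrip("\n").split("\t")
    if location != current_location:
        if current_location is not None:
            print("%s\t%.2f" % (current_location, total))
        current_location, total = location, 0.0
    total += float(price)
if current_location is not None:
    print("%s\t%.2f" % (current_location, total))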

Page 59: A gentle introduction to the world of BigData and Hadoop

How MapReduce works…

Page 60: A gentle introduction to the world of BigData and Hadoop

… and the Streaming

Page 61: A gentle introduction to the world of BigData and Hadoop

Hadoop related projects

Pig: a high-level language for analyzing large datasets. It works as a compiler that produces MapReduce jobs.

Hive: data warehouse software that facilitates querying and managing large datasets with a SQL-like language.

HBase: a scalable, distributed database that supports structured data storage for large tables.

Cassandra: a scalable multi-master database.

Page 62: A gentle introduction to the world of BigData and Hadoop

What is Hadoop?

(Part 2)

Page 63: A gentle introduction to the world of BigData and Hadoop

Hadoop v2.x

Hadoop is a pretty easy system to use, but a bit tricky to set up and manage.

The skills required lean more towards System Management than the Dev side.

Add to that the Apache documentation, which has never stood out for clarity and completeness.

So, to add a bit of mess, they decided to make v2, which actually changes a lot.

Page 64: A gentle introduction to the world of BigData and Hadoop

Hadoop v2.x

The new Hadoop now has FOUR modules (instead of two):

• Hadoop Common: common utilities supporting all the other modules
• HDFS: an evolution of the previous distributed file system
• Hadoop YARN: a framework for job scheduling and cluster resource management
• Hadoop MapReduce: a YARN-based system for parallel processing of large datasets

Page 65: A gentle introduction to the world of BigData and Hadoop

Hadoop v2.x

Hadoop v2, leveraging YARN, aims to become the new OS for data processing.

Page 66: A gentle introduction to the world of BigData and Hadoop

Hadoop and real time

Hadoop v2, using YARN together with Storm (a free and open-source distributed realtime computation system), can process your data in real time.

Some Hadoop distributions (like Hortonworks) are working on an effortless integration:

http://hortonworks.com/blog/stream-processing-in-hadoop-yarn-storm-and-the-hortonworks-data-platform/

Page 67: A gentle introduction to the world of BigData and Hadoop

Microsoft Azure and Hadoop

Page 68: A gentle introduction to the world of BigData and Hadoop

Microsoft Lambda Architecture support

Batch layer:
• WA HDInsight
• WA Blob storage
• MapReduce, Hive, Pig, …

Speed layer:
• Federation in WA SQL DB
• Azure Tables
• Memcached/MongoDB
• SQL Azure
• Reactive Extensions (Rx)

Serving layer:
• Azure Storage Explorer
• MS Excel (and the Office suite)
• Reporting Services
• LINQ to Hive
• Analysis Services

Page 69: A gentle introduction to the world of BigData and Hadoop

Yahoo, Hadoop and SQL Server

Batch layer: Apache Hadoop. Speed layer: SQL Server Analysis Services (SSAS). Serving layer: Microsoft Excel and PowerPivot, other BI tools and custom applications.

[Diagram: Hadoop data flows through the SQL Server Connector (Hadoop Hive ODBC) into a staging database and, together with third-party databases, feeds a SQL Server Analysis Services (SSAS) cube, which is then consumed by Microsoft Excel & PowerPivot for Excel and by custom applications]

Page 70: A gentle introduction to the world of BigData and Hadoop

MS .NET SDK for Hadoop

• .NET client libraries for Hadoop
• Write MapReduce jobs in Visual Studio, using C# or F#
• Debug against local data

[Diagram: the .NET Hadoop SDK submitting jobs to the JobTracker, which runs them on the slave nodes]

Page 71: A gentle introduction to the world of BigData and Hadoop

WebClient Libraries in .NET

• WebHDFS client library: works with files in HDFS and Windows Azure Blob storage
• WebHCat client library: manages the scheduling and execution of jobs in an HDInsight cluster

WebHDFS
• Scalable REST API
• Move files in and out of HDFS, and delete them
• Perform file and directory functions

WebHCat
• HDInsight job scheduling and execution
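Since WebHDFS is a plain REST API, you can also call it from any language; here is a sketch in Python with the requests library (host, port and paths are placeholders, and the default WebHDFS port differs between Hadoop versions):

import requests

NAMENODE = "http://namenode.example.com:50070"  # Namenode web port in Hadoop 1.x/2.x

# List a directory (op=LISTSTATUS).
r = requests.get(NAMENODE + "/webhdfs/v1/user/stefano?op=LISTSTATUS")
for f in r.json()["FileStatuses"]["FileStatus"]:
    print(f["pathSuffix"], f["type"], f["length"])

# Read a file (op=OPEN). The Namenode redirects to a Datanode holding the
# data, and requests follows the redirect automatically.
r = requests.get(NAMENODE + "/webhdfs/v1/user/stefano/sales.log?op=OPEN")
print(r.text[:200])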

Page 72: A gentle introduction to the world of BigData and Hadoop

Reactive Extensions (Rx): Pulling vs. Pushing Data

Interactive vs. Reactive

• In interactive programming, the application pulls data from a sequence that represents the source (IEnumerator)
• In reactive programming, the application subscribes to a data stream (called an observable sequence in Rx), and updates are handed to it by the source

Page 73: A gentle introduction to the world of BigData and Hadoop

Reactive Extensions (Rx): Pulling vs. Pushing Data

[Diagram: on the interactive side, the application pulls from IEnumerable<T>/IEnumerator<T> (“Got next?” / “Have next!”, MoveNext); on the reactive side, the environment pushes to the application through IObservable<T>/IObserver<T> (OnNext)]
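Stripped of the Rx types, the difference is easy to sketch in plain Python (the names are hypothetical; the generator plays the role of IEnumerable/IEnumerator, the Observable class that of IObservable/IObserver):

# Pull (interactive): the application asks "got next?" and the source answers.
def numbers():
    for n in [1, 2, 3]:
        yield n  # handed over only when the consumer asks

for n in numbers():  # the application drives the loop
    print("pulled", n)

# Push (reactive): the application registers an observer, and the source
# calls it back ("on next!") whenever new data shows up.
class Observable:
    def __init__(self):
        self.observers = []

    def subscribe(self, on_next):
        self.observers.append(on_next)

    def emit(self, value):  # the environment drives the flow
        for on_next in self.observers:
            on_next(value)

source = Observable()
source.subscribe(lambda v: print("pushed", v))
for n in [1, 2, 3]:
    source.emit(n)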

Page 74: A gentle introduction to the world of BigData and Hadoop