Hadoop on Windows Azure - an Introduction


Page 1: Hadoop on Windows Azure - an Introduction


Introducing Hadoop on Azure
M Sheik Uduman Ali, Technical Architect, Aditi Technologies

Instead of reinventing the wheel, Microsoft has made a strong and sensible move by integrating Hadoop into its flagship cloud computing PaaS stack. Although LINQ to HPC was embraced by many .NET developers, shipping a Hadoop distribution for Windows is also the safer move. This paper evaluates the early preview of Hadoop on Azure and covers the basics of using it. It would be helpful to read about MapReduce and Hadoop topology before learning about Hadoop on Azure.

For comments or questions regarding the content of this paper, please contact Sunny Neogi ([email protected]) or Arun Kumar ([email protected]).


Page 2: Hadoop on Windows Azure - an Introduction


Why do we need Hadoop?

The simple answer to this question is "big data analysis". Some examples of big data analysis are:

- Calculating consumer purchasing trends for particular product categories over big data that grows at a rate of 1 million transactions per hour
- Web application log analysis
- Internet search indexing
- Social network data analysis

Since relational databases and their ecosystem were designed around a "scale-up" strategy with centralized data processing, they are not well suited to the data warehousing space. Moreover, the data persisted by modern applications is a mix of relational, structured and unstructured content. Hence, we need a much more powerful system. Hadoop is one of the most successful open source platforms based on the MapReduce principle, which in turn follows the "making big by small" philosophy.

A big data processing task is called a "job" because it tends to be run frequently, periodically, once in a while or only once; it is not part of the day-to-day transactional business.

Page 3: Hadoop on Windows Azure - an Introduction

What is MapReduce?

Basically, the input data is processed on "n" small physical nodes in a clustered environment in two phases:

Map: The input data is first grouped into <k1, v1> key-value pairs. For example, if the input data resides in one or more files, k1 would be the file name and v1 the file content. The map phase therefore receives a list of <k1, v1> pairs and distributes them across the available map nodes in the cluster. On each node, the mapping function typically performs "filtering and transformation" and produces <k2, v2> pairs. For example, if you want to count the number of occurrences of words in a given set of documents, <filename, content> is the <k1, v1> pair, and the nodes in the mapping phase count the words in the given v1. This generates output like <"aditi", 1> as a <k2, v2> pair for every occurrence of the word "aditi" in a document. The output of the mapping phase is thus a list of <k2, v2> pairs; in this example it contains many <"aditi", 1> entries.

Reduce: All <k2, v2> pairs are aggregated into <k2, list(v2)>. In the word count example, the cluster may produce <"aditi", list(1, 1, 1, 1)> from the documents spread across different nodes. Each list(v2) for a given k2 is passed to a node for reducing, and the output is a list of <k3, v3> pairs. For example, if a node receives "aditi" as k2, it accumulates the list as 1+1+1+1 and produces 4 as v3; here k3 is again "aditi". Each reducer node does the same for different words.

The <k2, v2> grouping is done by the framework between the two phases, optionally helped by a component called the "combiner" that pre-aggregates map output locally. For now, let us keep the focus on the mapper and reducer.

See the figure below (figure 1).
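
To make the flow from <k1, v1> through <k2, v2> to <k3, v3> concrete, here is a minimal word count sketch against the standard Hadoop Java API. This code is illustrative only and not from the paper; the class names and the input/output paths are assumptions.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: for every word in an input split, emit <word, 1> as <k2, v2>.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken().toLowerCase());
                context.write(word, ONE); // e.g. <"aditi", 1>
            }
        }
    }

    // Reduce phase: receive <word, list(1, 1, ...)> and emit <word, total> as <k3, v3>.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum)); // e.g. <"aditi", 4>
        }
    }

    // Run with: hadoop jar wordcount.jar WordCount <input path> <output path>
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}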


Page 4: Hadoop on Windows Azure - an Introduction

Hadoop Cluster

A Hadoop cluster is an infrastructure with many physical nodes, where some are configured for "mapping" and some for "reducing", along with administrative, tracking and data persistence nodes called the "Name Node", "Job Tracker", "Task Tracker" and "Data Node" respectively. This is a master/slave architecture: the "Name Node" and "Job Tracker" are masters and the remaining nodes are slaves. This is shown in figure 2.

In order to handle big data storage and processing, Hadoop uses HDFS as its file system, which can handle even 100 TB of content as a single file.
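
As an illustration of how a client program talks to HDFS, here is a minimal round-trip sketch using the org.apache.hadoop.fs API; the file path and content are hypothetical, and the class name is ours.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster settings (core-site.xml etc.) from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf); // the configured default file system

        // Write a small file; the Name Node records the metadata while the
        // Data Nodes store the actual blocks.
        Path file = new Path("/demo/input/words.txt"); // hypothetical path
        FSDataOutputStream out = fs.create(file, true); // overwrite if present
        out.writeBytes("aditi azure hadoop aditi\n");
        out.close();

        // Read the file back.
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
        System.out.println(in.readLine());
        in.close();
        fs.close();
    }
}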


Page 5: Hadoop on Windows Azure - an Introduction

Hadoop Ecosystem on Azure

Since every task is a "job", you can rent the required nodes for your job, use them and then release them. Hence, the elastic compute and data storage (blob and table storage) in Azure is definitely a good choice for running your Hadoop job. Hadoop's home land is Java, so at this early stage on Azure the Hadoop Java SDK is one of the good options for your job. In addition, "Hadoop on Azure" leverages the elasticity of Azure storage through Hadoop streaming, by which you can write your job in C# or F# and use Azure blob storage for data persistence (this scheme is called ASV). The figure below shows the Hadoop ecosystem on Azure (figure 3).
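
With Hadoop streaming, the mapper and reducer are ordinary console programs that read lines on standard input and emit tab-separated key/value pairs on standard output; that contract is what makes C# or F# jobs possible. Here is a minimal streaming-style word count mapper, sketched in Java purely to show the contract; it is not taken from the paper, and the ASV path in the comment is hypothetical.

import java.io.BufferedReader;
import java.io.InputStreamReader;

// A streaming-style mapper: reads raw text lines from stdin and emits
// "word<TAB>1" pairs on stdout. Hadoop streaming sorts the pairs by key and
// pipes them to a reducer written against the same stdin/stdout contract,
// which is exactly what a C# or F# executable would follow as well. With
// Hadoop on Azure, the job's input and output locations can point at blob
// storage through the ASV scheme (e.g. asv://container/path, a hypothetical path).
public class StreamingWordCountMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    System.out.println(word.toLowerCase() + "\t1");
                }
            }
        }
    }
}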

To create directories, get and put files, and issue data processing commands on HDFS/ASV, Azure provides an interactive JavaScript console. (In the standard Hadoop distribution, Java is the main interface for this.) In addition, Azure supports Hive (a SQL-like language for Hadoop) and Pig Latin (a high-level data processing language).


Page 6: Hadoop on Windows Azure - an Introduction

The Web Portal for Hadoop on Azure

www.hadooponazure.com is the management portal for creating, releasing and renewing clusters for your job. The following are the steps you need to perform to run a job:

1. Develop the mapping and reducing functions, either in Java or on your preferred platform. On non-Windows platforms these could be shell scripts, Ruby, PHP, Python, etc.; on Azure, you can write the code in .NET.
2. Decide where the input data and the output of the job will be managed: either in HDFS or in Azure blob storage.
3. Request a cluster for the job in the portal.
4. Specify all the parameters for the job, including the job executable and the input and output details.
5. Run the job and get the output.
6. Release the cluster.

In this paper, let us look at step 3: how to create a cluster for a job.

Requesting a new Cluster

After you have entered the portal, you need to provide the following details for the new cluster environment, as shown in the figure below (figure 4):

- DNS name (<dnsname>.cloudapp.net)
- Cluster size, similar to Azure role sizes; for example, small = 4 nodes + 2 TB of disk space, extra large = 32 nodes + 16 TB
- Cluster login information


Page 7: Hadoop on Windows Azure - an Introduction


After entering these details, press the Request Cluster button. This creates the cluster environment for your job. The screen shows the progress of creating the new nodes for the cluster, as shown in the figure below (figure 5):

Page 8: Hadoop on Windows Azure - an Introduction


After provisioning, you will see a screen as shown below (figure 6). You can start creating a new job, and if you want to access the environment you can use either the "Interactive Console" or "Remote Desktop".

Page 9: Hadoop on Windows Azure - an Introduction


When you click on New Job, you will see the screen below (figure 7). The figure shows a Hadoop Streaming based job.

——————————————————————————————————

About the Author:

M Sheik Uduman Ali is a cloud architect at Aditi who is involved in the company's cloud practice. He is a blogger and has published an online book about "Domain Specific Languages in .NET".

ABOUT ADITI

Aditi helps product companies, web businesses and enterprises leverage the power of cloud, e-social and mobile to drive competitive advantage. We are Microsoft's cloud partner of the year, one of the top 3 Platform-as-a-Service solution providers globally and one of the top 5 Microsoft technology partners in the US. We are passionate about emerging technologies and are focused on custom development.