HDInsight on Azure and Map-Reduce Richard Conway Windows Azure MVP Elastacloud Limited.

Post on 27-Dec-2015


HDInsight on Azure and Map-Reduce

Richard Conway, Windows Azure MVP, Elastacloud Limited

Agenda

Introduction

Big Data with HDInsight

Introduction

Solving problems through distribution: some challenges become bound by hardware capacity; 24 hours on 1 machine can be 1 hour on 24 machines.

These 24 machines require orchestration: a job is divided into tasks, and the tasks are distributed across the cluster.

Software systems are required to facilitate this distribution; examples are Hadoop and HPC Server.

We will now provision a Hadoop cluster on Windows Azure.

Big Data vs Big Compute

Compute Bound: HPC Server, Open MPI

IO Bound: Hadoop

All distributed compute works on the basis of taking a large JOB and breaking it into many smaller TASKS, which are then parallelised.
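As a minimal sketch of this idea (function names are my own, purely illustrative): split a job's work items into tasks and estimate the ideal wall-clock time under perfectly linear scaling, ignoring orchestration overhead.

```javascript
// Illustrative only: round-robin a job's items into tasks and compute the
// ideal runtime under perfectly linear scaling (no orchestration cost).
function splitIntoTasks(jobItems, taskCount) {
    var tasks = [];
    for (var i = 0; i < taskCount; i++) tasks.push([]);
    jobItems.forEach(function (item, idx) {
        tasks[idx % taskCount].push(item); // round-robin assignment
    });
    return tasks;
}

function idealHours(singleMachineHours, machineCount) {
    return singleMachineHours / machineCount; // 24 hours on 1 machine -> 1 hour on 24
}

var tasks = splitIntoTasks([1, 2, 3, 4, 5, 6], 3);
console.log(tasks.length);       // 3
console.log(idealHours(24, 24)); // 1
```

Real schedulers must also handle stragglers, data locality and failures, which is exactly what frameworks like Hadoop provide.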

Hadoop

Name Node

Data Nodes

HPC

Head Node Broker Node

Worker Nodes

Understanding Big Data

KEY TRENDS

Cheap Storage
$100 buys ~3 million times more storage than 30 years ago
Storage/GB: 1980 $190,000 | 1990 $9,000 | 2000 $15 | 2010 $0.07

Inexpensive Computing
1980: 10 MIPS/$ | 2005: 10M MIPS/$

Device Explosion
>5.5 billion devices (70+% of global population)

Social Networks
>2 billion users

Ubiquitous Connection
Web traffic: 2010 130 exabytes (10^18) | 2015 1.6 zettabytes (10^21)

Sensor Networks
>10 billion (Internet of Things)

Data sources by volume (gigabytes, 10^9, up to exabytes, 10^18), growing in velocity, variety and variability:

ERP / CRM (gigabytes): Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking
WEB 2.0 / Mobile (terabytes): Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations, Click Stream
Internet of Things (petabytes to exabytes): Audio / Video, Log Files, Text / Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis / Blogs, Sensors / RFID / Devices, Spatial & GPS Coordinates

What is Big Data?

Big Data, BIG OPPORTUNITY

Big Data is a top priority for institutions

49% of CEOs and CIOs are planning big data projects1

Software Growth: $1.8B (2012), $2.5B (2013), $3.4B (2014), $4.6B (2015)
34% compound annual growth rate2

Services Growth: $2.7B (2012), $3.9B (2013), $5.1B (2014), $6.5B (2015)
39% compound annual growth rate2

1. McKinsey & Company, McKinsey Global Survey Results, Minding Your Digital Business, 2012
2. IDC Market Analysis, Worldwide Big Data Technology and Services 2012–2015 Forecast, 2012

Devices: Internet and Internet of things

Internet of Things
• Invisible devices; trillions of networked nodes
• Trillions of computer-enabled devices which are part of the IoT
• Low-bandwidth last-mile connection (~100 kbit/s)
• Mostly addressed by local schemes
• Machine-centric, sensing-focused

Internet
• Laptops / tablets / smartphones; billions of networked devices
• High-bandwidth access (cable: 10 Mb/s+; fiber: 50-100 Mb/s)
• Global addressing
• User-centric, communication-focused
• 6+ billion people, 1.5 billion use the net; US: 4.3 devices per adult

Big Data Scenarios

Short History of Hadoop

Seminal whitepapers by Google in 2004 on a new programming paradigm to handle data at internet scale.
Hadoop started as a part of the Nutch project.
In Jan 2006 Doug Cutting started working on Hadoop at Yahoo.
Factored out of Nutch in Feb 2006.
First release of Apache Hadoop in September 2007.
Jan 2008: Hadoop became a top-level Apache project.

Hadoop Distributed Architecture

FIRST, STORE THE DATA

Server

Server Server

MapReduce: Move Code to the Data

Files

Server

SECOND, TAKE THE PROCESSING TO THE DATA

So How Does It Work?

// Map Reduce function in JavaScript

var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "")
            context.write(words[i].toLowerCase(), 1);
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next());
    }
    context.write(key, sum);
};
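To see the word-count pair above in action without a cluster, here is a small local harness (my own sketch, not the HDInsight runtime: `runJob` and the `context`/iterator shapes are assumptions modelled on the callbacks the functions expect) that simulates Input > Map > Shuffle & Sort > Reduce.

```javascript
// Illustrative local harness simulating the MapReduce pipeline for the
// word-count map/reduce functions from the slide.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") context.write(words[i].toLowerCase(), 1);
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) { sum += parseInt(values.next()); }
    context.write(key, sum);
};

function runJob(lines) {
    // Map phase: collect (key, value) pairs, grouping values by key
    var groups = {};
    var mapContext = { write: function (k, v) { (groups[k] = groups[k] || []).push(v); } };
    lines.forEach(function (line, i) { map(i, line, mapContext); });

    // Shuffle & sort, then reduce each key's value list
    var output = {};
    var reduceContext = { write: function (k, v) { output[k] = v; } };
    Object.keys(groups).sort().forEach(function (k) {
        var vals = groups[k], pos = 0;
        var iter = {
            hasNext: function () { return pos < vals.length; },
            next: function () { return vals[pos++]; }
        };
        reduce(k, iter, reduceContext);
    });
    return output;
}

console.log(runJob(["Big Data", "big compute"])); // { big: 2, compute: 1, data: 1 }
```

On a real cluster the shuffle happens over the network between map and reduce nodes; the harness only mimics its grouping-by-key behaviour in memory.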

Server Server
Server Server

RUNTIME

Code

Traditional RDBMS vs. NoSQL

            TRADITIONAL RDBMS         HADOOP
Data Size   Gigabytes (Terabytes)     Petabytes (Exabytes)
Access      Interactive and Batch     Batch
Updates     Read / Write many times   Write once, Read many times
Structure   Static Schema             Dynamic Schema
Integrity   High (ACID)               Low
Scaling     Nonlinear                 Linear
DBA Ratio   1:40                      1:3000

Reference: Tom White’s Hadoop: The Definitive Guide

Windows Azure HDInsight Service

Creating an HDInsight Cluster Demo

MICROSOFT CONFIDENTIAL – INTERNAL ONLY

HDINSIGHT / HADOOP Eco-System

• Distributed Storage (HDFS)
• Distributed Processing (MapReduce)
• Query (Hive)
• Scripting (Pig)
• NoSQL Database (HBase)
• Metadata (HCatalog)
• Data Integration (ODBC / SQOOP / REST)
• Relational (SQL Server)
• Machine Learning (Mahout)
• Graph (Pegasus)
• Stats processing (RHadoop)
• Event Pipeline (Flume)
• Event Driven Processing
• Pipeline / workflow (Oozie)
• Active Directory (Security)
• Monitoring & Deployment (System Center)
• C#, F#, .NET and JavaScript
• Azure Storage Vault (ASV)
• PDW Polybase
• Business Intelligence (Excel, PowerView, SSAS)
• World's Data (Azure Data Marketplace)

Legend: Red = Core Hadoop; Blue = Data processing; Purple = Microsoft integration points and value adds; Orange = Data Movement; Green = Packages

Storing Data with HDInsight


HDFS on Azure: Tale of Two File Systems

DFS and Compute Cluster (1 Data Node per Worker Role):
Name Node and Data Nodes, exposed through the HDFS API

Azure Storage (ASV):
Front ends over a partition layer and a stream layer, backed by Azure Blob Storage


Azure Storage (ASV)
• Default file system for the HDInsight Service
• Provides sharable, persistent, highly scalable storage with high availability (Azure Blob Store)
• Azure Storage itself does not provide compute
• Fast access from compute nodes to data in the same data center
• Several file systems, addressable via:
  asv[s]://<container>@<account>.blob.core.windows.net/<path>
• Requires the storage key in core-site.xml:
  <property>
    <name>fs.azure.account.key.accountname</name>
    <value>enterthekeyvaluehere</value>
  </property>
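As a small illustration of the addressing scheme (the helper name `asvUri` is my own, not part of any SDK), the URI can be assembled from its parts:

```javascript
// Illustrative only: build an asv[s] URI of the shape shown on the slide.
// "asvs" selects the SSL variant of the scheme.
function asvUri(container, account, path, secure) {
    var scheme = secure ? "asvs" : "asv";
    return scheme + "://" + container + "@" + account +
           ".blob.core.windows.net/" + path.replace(/^\//, ""); // drop leading slash
}

console.log(asvUri("input", "elastastorage", "logs/2013/01.gz", true));
// asvs://input@elastastorage.blob.core.windows.net/logs/2013/01.gz
```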

Map Reduce

Examples in C#

Map/Reduce

Map/Reduce is a programming model for efficient distributed computing:
Input > Map > Shuffle & Sort > Reduce > Output

Efficiency comes from streaming through data, reducing seeks.
A good fit for a lot of applications:
• Log processing
• Web index building
• Data mining and machine learning

Hadoop SDK

• C# integration
• Remote Data & Jobs
• Hive in C#
• Serialization

http://hadoopsdk.codeplex.com

public class FrenchSessionsJob : HadoopJob<FrenchSessionsMapper, SessionsReducer>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        var config = new HadoopJobConfiguration()
        {
            InputPath = "\"/AllSessions/*.gz\"",
            OutputFolder = "/FrenchSessions/"
        };
        return config;
    }
}

Jobs

public class FrenchSessionsMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        if (inputLine.Contains("Country=France"))
        {
            context.IncrementCounter("FrenchSession");
            context.EmitKeyValue("FR", "1");
        }
    }
}

Mapper

public class SessionsReducer : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values, ReducerContext context)
    {
        // Count() requires System.Linq
        context.EmitKeyValue(key, values.Count().ToString());
    }
}

Reducer

Navigating the HDInsight portal Demo

C# and Map/Reduce Demo

https://elastastorage.blob.core.windows.net/hdinsight/Map-Reduce HDInsight Lab.pdf

Questions?