Iasi Code Camp, 20 April 2013: Mihai Nadaș, Hadoop on Azure
Big Data on Azure
What do I need to know as a developer to make it worthwhile?
Mihai Nadaș, Chief Technology Officer, Yonder; Most Valuable Professional, Microsoft
About myself
@mihainadas | blog.mihainadas.com
Agenda
Why Big Data?
Understanding the Basics
Microsoft and Hadoop
Two Big Data examples
1. Google Flu Trends
2. Farecast
Why Big Data?
Gartner’s Hype Cycle on Big Data
Key Technologies
• Accessible storage (non-relational) in the cloud: Amazon S3, Azure Blob & Table storage, Google Cloud Storage
• In-memory databases & grids: MemSQL, XAP (GigaSpaces), SAP HANA
• Parallel processing frameworks: Hadoop
• Online analytics frameworks: Google BigQuery, Hive
• Data stream processing: Twitter Storm
• Complex event processing: Oracle CEP Server, Microsoft StreamInsight
• Sentiment analysis: Radian6
It’s BIG
Example Scenario
OPERATIONAL DATA
Traditional E-Commerce Data Flow
NEW USER REGISTRY
NEW PURCHASE
NEW PRODUCT
Excess Data
Logs
ETL Some Data
Data Warehouse
OPERATIONAL DATA
New E-Commerce Big Data Flow
Raw Data → “Store it All” Cluster
NEW USER REGISTRY
NEW PURCHASE
NEW PRODUCT
Data Warehouse
Logs
How much do views for certain products increase when our TV ads run?
Viktor Mayer-Schönberger, Professor at Oxford
Kenneth Cukier, Editor, The Economist
Big Data Principles
1. More: store over trash
2. Messy: quantity over quality
3. Correlation: what over why
Understanding the Basics Move the Compute to the Data
Characteristics of Big Data
MapReduce
Think of the following problem...
What if we parallelize?
Welcome, MapReduce
Map Reduce
So How Does It Work?
MapReduce – Workflow
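The map/shuffle/reduce workflow on these slides can be sketched in plain Python with the classic word-count example. This is an illustration of the programming model only, not Hadoop's actual Java API; the function names are made up for clarity.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as Hadoop does
    # between the map and reduce phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data on azure", "hadoop on azure"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["azure"])  # "azure" appears once in each document → 2
```

In a real Hadoop job, the map and reduce functions run in parallel on many nodes and the shuffle happens over the network; the logic per key, however, is exactly this.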
Hadoop
The Hadoop Ecosystem: ETL Tools, BI Reporting, RDBMS
Reference: Tom White’s Hadoop: The Definitive Guide
Traditional RDBMS vs. MapReduce

              TRADITIONAL RDBMS       MAPREDUCE
Data Size     Gigabytes (Terabytes)   Petabytes (Exabytes)
Structure     Static schema           Dynamic schema
Integrity     High (ACID)             Low
Scaling       Nonlinear               Linear
DBA Ratio     1:40                    1:3000

Reference: Tom White’s Hadoop: The Definitive Guide
Microsoft and Hadoop
Microsoft Big Data Solution
(architecture diagram: Power View, Excel with PowerPivot, Embedded BI, and Predictive Analytics on top; LOB/CRM/ERP apps feeding a Microsoft EDW with SSAS and SSRS; data sources including devices, crawlers, sensors, and bots)
Hadoop On Windows Server
Hadoop On Windows Azure
Deploying and Interacting With a Hadoop Cluster on Azure
a step-by-step walkthrough
Objectives
1. Run a basic Java MapReduce program using a Hadoop jar file
2. Import data from the Windows Azure Marketplace into a Hadoop on Azure cluster using the Interactive Hive Console
Prerequisites
1. Access to a Hadoop on Azure account
2. Request an invitation to the Preview Feature
Creating a new HDInsight Cluster (I)
Creating a new HDInsight Cluster (II)
Cluster Management Interface
Hadoop Sample Gallery
Objective #1: Basic MapReduce Task
• We will use the Pi Estimator sample job
• Distributed Pi Estimator with 16 maps, each computing 10 million samples
Pi Estimator
• Uses the Monte Carlo simulation method to compute π
• Random points are sampled inside a square of side 2r enclosing a circle of radius r (with r = 1); the fraction of points that land inside the circle approximates π/4
Pi Estimator: Running the Job
Pi Estimator: And the result is...
• 160,000,000 random points
• 16 mappers
• 10,000,000 samples per map
• Computed in 65.108 seconds
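The Monte Carlo idea behind the Pi Estimator can be sketched in a few lines of Python. This is a simplified single-process illustration, not the Hadoop sample's actual implementation, which distributes the sampling across the 16 mappers; the function name and seed are illustrative choices.

```python
import random

def estimate_pi(samples, seed=42):
    # Sample points uniformly in the unit square and count how many
    # fall inside the quarter circle of radius 1 (x^2 + y^2 <= 1).
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # Area ratio: (quarter circle) / (unit square) = pi/4,
    # so pi is approximately 4 * inside / samples.
    return 4.0 * inside / samples

print(estimate_pi(200_000))  # close to 3.14159 for large sample counts
```

In the distributed version, each mapper computes its own `inside` count over its share of the samples and a single reducer sums them; that is what makes the problem a natural MapReduce fit.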
Objective #2: Import Data from the Windows Azure Marketplace into Hadoop on Azure
• Windows Azure Marketplace is a cloud one-stop shop for premium data and applications
• We will use the “2006 – 2008 Crime in the US” dataset to explore Hadoop using Hive
Windows Azure Marketplace
Apache Hive
• Data Warehouse infrastructure built on top of Hadoop
• Provides data summarization, query and analysis
• Initially developed by Facebook, now an Apache project
Apache Hive: Features
• Analysis of large datasets stored in Hadoop-compatible file-systems
• Provides a SQL-like language called HiveQL while maintaining full support for map/reduce
• By default, stores its metadata in an embedded Apache Derby database (the data itself lives in Hadoop-compatible storage)
Importing data to Hadoop on Azure
Querying huge datasets using Hive
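HiveQL expresses aggregations over the imported dataset in familiar SQL, which Hive compiles into MapReduce jobs behind the scenes. The same flavor of query can be demonstrated with Python's built-in sqlite3 on an in-memory table; the table and column names below are hypothetical stand-ins, not the actual schema of the Marketplace crime dataset.

```python
import sqlite3

# In-memory stand-in for a Hive table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crime (state TEXT, year INTEGER, incidents INTEGER)")
conn.executemany(
    "INSERT INTO crime VALUES (?, ?, ?)",
    [("Iowa", 2006, 120), ("Iowa", 2007, 95), ("Texas", 2006, 300)],
)

# A HiveQL query against the real table would look almost identical:
#   SELECT state, SUM(incidents) FROM crime GROUP BY state ORDER BY state;
rows = conn.execute(
    "SELECT state, SUM(incidents) FROM crime GROUP BY state ORDER BY state"
).fetchall()
print(rows)  # [('Iowa', 215), ('Texas', 300)]
```

The key difference is scale: Hive runs the GROUP BY as a distributed map/reduce over files in the cluster, so the same SQL works whether the table holds three rows or billions.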
Hadoop on Windows
• Insights for all users by activating new types of data
• Integration with Microsoft Business Intelligence
• Choice of deployment on Windows Server + Windows Azure
• Integration with Windows components (AD, System Center)
• Easy installation and configuration of Hadoop on Windows
• Simplified programming with .NET & JavaScript integration
• Integration with SQL Server data warehousing
Differentiation
Microsoft Big Data Roadmap
• To accelerate delivery of its Hadoop-based solution for Windows Server and service for Windows Azure, Microsoft is announcing a partnership with Hortonworks
• Microsoft is committed to broadening the accessibility and usage of Hadoop for end users, developers, and IT professionals in organizations of all sizes
• Microsoft is announcing an end-to-end roadmap for Big Data that embraces Apache Hadoop™ by distributing enterprise-class Hadoop-based solutions on both Windows Server and Windows Azure
• Microsoft is extending its leadership in business intelligence and data warehousing to provide insights to all users by activating new types of data of any size
Things to Do
1. Get a trial of Windows Azure
2. Subscribe to the Preview Program of Hadoop on Azure
3. Write your first MapReduce job
4. Give a talk this autumn at CodeCamp about your experience with Big Data
Thank you