Iasi Code Camp, 20 April 2013: Mihai Nadaș, Hadoop on Azure
Big Data on Azure
What do I need to know as a developer to make it worthwhile?
Mihai Nadaș, Chief Technology Officer, Yonder; Most Valuable Professional, Microsoft
About myself
@mihainadas | blog.mihainadas.com
Agenda
Why Big Data?
Understanding the Basics
Microsoft and Hadoop
Two Big Data examples
1. Google Flu Trends
2. Farecast
Why Big Data?
Gartner’s Hype Cycle on Big Data
Key Technologies
• Accessible storage (non-relational) in the cloud: Amazon S3, Azure Blob & Table storage, Google Cloud Storage
• In-memory databases & grids: MemSQL, XAP (GigaSpaces), SAP HANA
• Parallel processing frameworks: Hadoop
• Online analytics frameworks: Google BigQuery, Hive
• Data stream processing: Twitter Storm
• Complex event processing: Oracle CEP Server, Microsoft StreamInsight
• Sentiment analysis: Radian6
It’s BIG
Example Scenario
OPERATIONAL DATA
Traditional E-Commerce Data Flow
NEW USER REGISTRY
NEW PURCHASE
NEW PRODUCT
Excess Data
Logs
ETL Some Data
Data Warehouse
OPERATIONAL DATA
New E-Commerce Big Data Flow
Raw Data → “Store it All” Cluster
NEW USER REGISTRY
NEW PURCHASE
NEW PRODUCT
Data Warehouse
Logs
How much do views for certain products increase when our TV ads run?
Viktor Mayer-Schönberger, Professor at Oxford
Kenneth Cukier, Editor, The Economist
Big Data Principles
1. More: store over trash
2. Messy: quantity over quality
3. Correlation: what over why
Understanding the Basics Move the Compute to the Data
Characteristics of Big Data
MapReduce
Think of the following problem...
What if we parallelize?
Welcome, MapReduce
Map Reduce
So How Does It Work?
MapReduce – Workflow
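The map/shuffle/reduce workflow on these slides can be sketched in plain Python with the classic word-count example. This is an illustration of the programming model only, not Hadoop's actual Java API; the function names are made up for clarity.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as Hadoop does
    # between the map and reduce phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data on azure", "hadoop on azure"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["azure"])  # "azure" appears once in each document → 2
```

In a real Hadoop job, the map and reduce functions run in parallel on many nodes and the shuffle happens over the network; the logic per key, however, is exactly this.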
Hadoop
The Hadoop Ecosystem: ETL Tools, BI Reporting, RDBMS
Reference: Tom White’s Hadoop: The Definitive Guide
Traditional RDBMS vs. MapReduce

              TRADITIONAL RDBMS       MAPREDUCE
Data Size     Gigabytes (Terabytes)   Petabytes (Exabytes)
Structure     Static schema           Dynamic schema
Integrity     High (ACID)             Low
Scaling       Nonlinear               Linear
DBA Ratio     1:40                    1:3000

Reference: Tom White’s Hadoop: The Definitive Guide
Microsoft and Hadoop
Microsoft Big Data Solution
(architecture diagram: Power View, Excel with PowerPivot, Embedded BI, and Predictive Analytics on top; LOB/CRM/ERP apps feeding a Microsoft EDW with SSAS and SSRS; data sources including devices, crawlers, sensors, and bots)
Hadoop On Windows Server
Hadoop On Windows Azure
Deploying and Interacting With a Hadoop Cluster on Azure
a step-by-step walkthrough
Objectives
1. Run a basic Java MapReduce program using a Hadoop jar file
2. Import data from the Windows Azure Marketplace into a Hadoop on Azure cluster using the Interactive Hive Console
Prerequisites
1. Access to a Hadoop on Azure account
2. Request an invitation to the Preview Feature
Creating a new HDInsight Cluster (I)
Creating a new HDInsight Cluster (II)
Cluster Management Interface
Hadoop Sample Gallery
Objective #1: Basic MapReduce Task
• We will use the Pi Estimator sample job
• Distributed Pi Estimator with 16 maps, each computing 10 million samples
Pi Estimator
• Uses the Monte Carlo simulation method to compute π
• Random points are sampled inside a square of side 2r enclosing a circle of radius r (with r = 1); the fraction of points that land inside the circle approximates π/4
Pi Estimator: Running the Job
Pi Estimator: And the result is...
• 160,000,000 random points
• 16 mappers
• 10,000,000 samples per map
• Computed in 65.108 seconds
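The Monte Carlo idea behind the Pi Estimator can be sketched in a few lines of Python. This is a simplified single-process illustration, not the Hadoop sample's actual implementation, which distributes the sampling across the 16 mappers; the function name and seed are illustrative choices.

```python
import random

def estimate_pi(samples, seed=42):
    # Sample points uniformly in the unit square and count how many
    # fall inside the quarter circle of radius 1 (x^2 + y^2 <= 1).
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # Area ratio: (quarter circle) / (unit square) = pi/4,
    # so pi is approximately 4 * inside / samples.
    return 4.0 * inside / samples

print(estimate_pi(200_000))  # close to 3.14159 for large sample counts
```

In the distributed version, each mapper computes its own `inside` count over its share of the samples and a single reducer sums them; that is what makes the problem a natural MapReduce fit.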
Objective #2: Import Data from the Windows Azure Marketplace into Hadoop on Azure
• Windows Azure Marketplace is a cloud one-stop shop for premium data and applications
• We will use the “2006 – 2008 Crime in the US” dataset to explore Hadoop using Hive
Windows Azure Marketplace
Apache Hive
• Data Warehouse infrastructure built on top of Hadoop
• Provides data summarization, query and analysis
• Initially developed by Facebook, now an Apache project
Apache Hive: Features
• Analysis of large datasets stored in Hadoop-compatible file-systems
• Provides a SQL-like language called HiveQL while maintaining full support for map/reduce
• By default, stores its metadata in an embedded Apache Derby database (the data itself lives in Hadoop-compatible storage)
Importing data to Hadoop on Azure
Querying huge datasets using Hive
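HiveQL expresses aggregations over the imported dataset in familiar SQL, which Hive compiles into MapReduce jobs behind the scenes. The same flavor of query can be demonstrated with Python's built-in sqlite3 on an in-memory table; the table and column names below are hypothetical stand-ins, not the actual schema of the Marketplace crime dataset.

```python
import sqlite3

# In-memory stand-in for a Hive table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crime (state TEXT, year INTEGER, incidents INTEGER)")
conn.executemany(
    "INSERT INTO crime VALUES (?, ?, ?)",
    [("Iowa", 2006, 120), ("Iowa", 2007, 95), ("Texas", 2006, 300)],
)

# A HiveQL query against the real table would look almost identical:
#   SELECT state, SUM(incidents) FROM crime GROUP BY state ORDER BY state;
rows = conn.execute(
    "SELECT state, SUM(incidents) FROM crime GROUP BY state ORDER BY state"
).fetchall()
print(rows)  # [('Iowa', 215), ('Texas', 300)]
```

The key difference is scale: Hive runs the GROUP BY as a distributed map/reduce over files in the cluster, so the same SQL works whether the table holds three rows or billions.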
Hadoop on Windows
• Insights for all users by activating new types of data
• Integration with Microsoft Business Intelligence
• Choice of deployment on Windows Server + Windows Azure
• Integration with Windows components (AD, System Center)
• Easy installation and configuration of Hadoop on Windows
• Simplified programming with .NET & JavaScript integration
• Integration with SQL Server data warehousing
Differentiation
Microsoft Big Data Roadmap
• To accelerate delivery of its Hadoop-based solution for Windows Server and service for Windows Azure, Microsoft is announcing a partnership with Hortonworks
• Microsoft is committed to broadening the accessibility and usage of Hadoop for end users, developers, and IT professionals in organizations of all sizes
• Microsoft is announcing an end-to-end roadmap for Big Data that embraces Apache Hadoop™ by distributing enterprise-class Hadoop-based solutions on both Windows Server and Windows Azure
• Microsoft is extending its leadership in business intelligence and data warehousing to provide insights to all users by activating new types of data of any size
Things to Do
1. Get a trial of Windows Azure
2. Subscribe to the Preview Program of Hadoop on Azure
3. Write your first MapReduce job
4. Give a talk this autumn at CodeCamp about your experience with Big Data
Thank you