Windows Azure HDInsight Service
NEIL MACKENZIE
Windows Azure HDInsight Service
Hadoop on Windows Azure
Who Am I?
Neil Mackenzie
Windows Azure Architect @ Satory Global
Windows Azure MVP
Blog: http://convective.wordpress.com/
Twitter: @mknz
Book: Microsoft Windows Azure Development Cookbook
Goals and Agenda
Goals
  Introduce Windows Azure HDInsight Service to the Windows Azure developer
  Introduce Windows Azure to the Hadoop user
  Not a tutorial on how to use Hadoop features
Agenda
  Big Data
  Windows Azure
  Windows Azure HDInsight Service
Big Data
Problem: How do we create value from enormous amounts of
low-value data?
Solution: Analyze it using a lot of commodity hardware.
Three Vs of Big Data
Volume: How much data is there?
Variety: What are the sources of the data?
Velocity: How fast is the data being generated?
MapReduce
Distributed computational model for data analysis.
Map function: processes a key-value pair to generate intermediate key-value pairs.
Reduce function: merges all intermediate values with the same intermediate key.
Map and reduce functions are allocated to many compute nodes with data stored locally.
Raw MapReduce functions are written in Java.
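The map and reduce roles described above can be sketched in a few lines. This is a single-process illustration in Python, not a Hadoop job (those are written in Java, as noted). The names `map_words`, `reduce_counts`, and `run_mapreduce` and the sample lines are hypothetical, chosen only for this word-count example.

```python
from collections import defaultdict

def map_words(key, line):
    """Map: turn one (key, line) pair into intermediate (word, 1) pairs."""
    for word in line.split():
        yield word.lower(), 1

def reduce_counts(word, counts):
    """Reduce: merge all intermediate values that share the same key."""
    return word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical single-process driver; a real cluster spreads this over many nodes."""
    grouped = defaultdict(list)
    for key, value in records:
        for ikey, ivalue in map_fn(key, value):
            grouped[ikey].append(ivalue)  # the "shuffle": group values by intermediate key
    return dict(reduce_fn(k, v) for k, v in sorted(grouped.items()))

# Made-up input: (line number, line text) pairs.
lines = [(0, "big data big value"), (1, "data on commodity hardware")]
result = run_mapreduce(lines, map_words, reduce_counts)
print(result)
```

On a cluster, the grouping step is the shuffle that the framework performs between the map and reduce phases.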
Apache Hadoop
Modules:
  Hadoop Distributed File System (HDFS)
  MapReduce
Related projects:
  HBase – scalable, distributed database
  Hive – data warehouse infrastructure
  Mahout – scalable machine learning library
  Pig – high-level data-flow language
Other:
  Sqoop – import and export to relational databases
Windows Azure
Compute
  PaaS: Cloud Services, Windows Azure Web Sites
  IaaS: Virtual Machines
Storage
  Windows Azure Storage Service: blobs, tables, queues
  Windows Azure SQL Database
  IaaS: Microsoft SQL Server, MongoDB, Cassandra, etc.
Connectivity
  HTTP, TCP, UDP, Site-to-Site VPN
Administration
  Portal, Service Management API
Windows Azure HDInsight Service
Components:
  HadoopCore – v1.0.1
  HDFS & ASV
  Pig – v0.9.3
  Hive – v0.8.1
  Sqoop – v1.4.2
  Excel/Hive
Note: this was formerly known as Hadoop on Azure.
Hadoop Administration
Portal
  http://www.hadooponazure.com
  Apply to join preview
  Create and manage Hadoop cluster (3 nodes for 5 days)
  Access the Interactive console
    Hive: invoke Hive statements
    JavaScript: invoke HDFS commands; invoke Hive & Pig statements
Distributed File Systems
HDFS
  Contents deleted when cluster deleted
ASV
  Azure Storage Vault
  Data stored in Windows Azure Blob Storage
  Configured on Hadoop on Azure portal
  Contents survive deletion of Hadoop cluster
  Supports multi-level structure, e.g.:
    containername/input/file1
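A minimal sketch of how such a path breaks down, assuming the first segment names the blob container and the remainder is the blob name (blob names may themselves contain "/"). The helper `split_asv_path` is hypothetical and not part of any Hadoop or Azure library.

```python
def split_asv_path(path):
    """Split an ASV-style path into (container, blob name).

    Assumption: the segment before the first "/" is the container.
    """
    prefix = "asv://"
    if path.startswith(prefix):
        path = path[len(prefix):]
    container, _, blob_name = path.partition("/")
    return container, blob_name

print(split_asv_path("containername/input/file1"))
print(split_asv_path("asv://flightdata/input/flightdata.txt"))
```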
Pig
Hadoop feature to perform data-flow operations:
  Execution environment
  Language: Pig Latin
Execution Environment
  Local in local JVM or distributed on Hadoop cluster
Pig Latin
  High-level language
  Describes data-flow operations
  Automatically invokes MapReduce jobs
  Much simpler than using MapReduce directly
Pig Example
records = LOAD 'asv://flightdata/input/flightdata.txt'
    AS (year:int, month:int, day:int, carrier:chararray, origin:chararray,
        dest:chararray, depdelay:int, arrdelay:int);
modified_records = FOREACH records GENERATE origin, depdelay;
STORE modified_records INTO 'my_output' using PigStorage(',');
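For readers new to Pig Latin, the same data flow (load, project two fields, store comma-separated) can be sketched in plain Python. This is only an illustration of what the script computes, not how Pig executes it. It assumes tab-delimited input, which is the default delimiter when LOAD has no USING clause. The sample rows and the name `project_origin_delay` are made up.

```python
import csv
import io

# Field order taken from the AS (...) schema in the Pig script.
FIELDS = ["year", "month", "day", "carrier", "origin", "dest", "depdelay", "arrdelay"]

def project_origin_delay(lines, delimiter="\t"):
    """Equivalent of: FOREACH records GENERATE origin, depdelay."""
    for row in csv.reader(lines, delimiter=delimiter):
        record = dict(zip(FIELDS, row))
        yield record["origin"], int(record["depdelay"])

# Hypothetical sample rows standing in for flightdata.txt.
sample = [
    "2012\t1\t15\tAA\tSFO\tJFK\t12\t5",
    "2012\t1\t15\tUA\tSEA\tORD\t-3\t0",
]

# Equivalent of: STORE ... using PigStorage(',')
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(project_origin_delay(sample))
print(out.getvalue(), end="")
```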
Hive
Hadoop feature to perform data warehouse operations
HiveQL
  High-level, SQL-like language
  Supports equi-joins
  Schema on read, not schema on write
  Automatically invokes MapReduce jobs
  Much simpler than using MapReduce directly
Metadata store
  Contains descriptions of tables
Hive Example
FROM flightdata_asv
INSERT OVERWRITE TABLE origin_counts
    SELECT origin, COUNT(*) GROUP BY origin
INSERT OVERWRITE TABLE dest_counts
    SELECT dest, COUNT(*) GROUP BY dest
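The statement above reads the source table once and fills two result tables. A rough Python sketch of that single-pass, two-aggregate idea follows. The sample flights and the function name `count_origins_and_dests` are hypothetical, and this says nothing about how Hive plans the query.

```python
from collections import Counter

def count_origins_and_dests(flights):
    """One pass over (origin, dest) pairs, producing both GROUP BY counts."""
    origin_counts, dest_counts = Counter(), Counter()
    for origin, dest in flights:
        origin_counts[origin] += 1  # SELECT origin, COUNT(*) GROUP BY origin
        dest_counts[dest] += 1      # SELECT dest, COUNT(*) GROUP BY dest
    return origin_counts, dest_counts

# Made-up rows standing in for flightdata_asv.
flights = [("SFO", "JFK"), ("SFO", "ORD"), ("SEA", "JFK")]
origin_counts, dest_counts = count_origins_and_dests(flights)
print(dict(origin_counts))
print(dict(dest_counts))
```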
Sqoop
Feature allowing import and export from SQL databases
  Uses JDBC connector
  Works with Windows Azure SQL Database
  Table must exist before export
Sqoop Example
Exporting a table:
sqoop.cmd export
    --connect "jdbc:sqlserver://sql_database_server.database.windows.net:1433;database=sql_database_instance;user=sqoop_login@sql_database_server;password=sqoop_login_password"
    --table sql_database_table
    --export-dir "/user/hive/warehouse/hive_table"
    --input-fields-terminated-by "\001"
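The connection string is the part most easily mistyped, so here is a small sketch that assembles the same argument list from its pieces. It uses only the flags shown in the slide. The function name `build_sqoop_export_args` and its parameters are hypothetical, and the sketch builds the arguments without running Sqoop. The "\001" delimiter matches Hive's default field separator (Ctrl-A).

```python
def build_sqoop_export_args(server, database, login, password, table, export_dir):
    """Assemble the sqoop export arguments from the slide (illustrative only)."""
    connect = (
        f"jdbc:sqlserver://{server}.database.windows.net:1433;"
        f"database={database};"
        f"user={login}@{server};"
        f"password={password}"
    )
    return [
        "sqoop.cmd", "export",
        "--connect", connect,
        "--table", table,
        "--export-dir", export_dir,
        "--input-fields-terminated-by", r"\001",
    ]

# Placeholder values copied from the slide.
args = build_sqoop_export_args(
    server="sql_database_server",
    database="sql_database_instance",
    login="sqoop_login",
    password="sqoop_login_password",
    table="sql_database_table",
    export_dir="/user/hive/warehouse/hive_table",
)
print(" ".join(args))
```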
Excel and Hadoop on Azure
Example of Microsoft business intelligence strategy
  Expose Hadoop to existing tools
HiveODBC connector for Excel
  Create Hive queries from Excel
  Invoke them from Excel
More Information
Sign up for preview: http://www.hadooponazure.com
Support: http://social.msdn.microsoft.com/Forums/en-US/hdinsight
Avkash Chauhan’s blog: http://blogs.msdn.com/b/avkashchauhan/archive/tags/hadoop
Roger Jennings’ blog: http://oakleafblog.blogspot.com/2012/04/using-data-in-windows-azure-blobs-with.html
Summary
Hadoop: de facto solution to the Big Data problem
Windows Azure HDInsight Service
  Native Hadoop implementation
  Managed Hadoop service for Windows Azure
  Currently in preview