SQLRally Amsterdam 2013 - Hadoop
Hadoop
Henk van der Valk – Technical Sales Professional
Jan Pieter Posthuma – Microsoft BI Consultant
7/11/2013
JOIN THE PASS COMMUNITY
2
Become a PASS member for free and join the world’s biggest SQL Server Community.
• Access to online training content
• Access to events at discounted rates
• Join Local Chapters
• Join Virtual Chapters
• Personalize your PASS website experience
3
Agenda
• Introduction
• Hadoop
• HDFS
• Data access to HDFS
• Map/Reduce
• Hive
• Data access from HDFS
• SQL PDW PolyBase
• Wrap up
4
Introduction Henk
• 10 years of Unisys-EMEA Performance Center
• 2002 – Largest SQL DWH in the world (SQL 2000)
• Project Real (SQL 2005)
• ETL WR – loading 1TB within 30 mins (SQL 2008)
• Contributed to various SQL whitepapers
• Schuberg Philis – 100% uptime for mission critical applications
• Since April 1st, 2011 – Microsoft SQL PDW, Western Europe
• SQLPass speaker & volunteer since 2005
5
Introduction
[Architecture diagram: “Data, Insights, Value” – the Microsoft big data platform]
• Sources: ERP, CRM, LOB apps; big data sources (raw, unstructured); crawlers, bots, devices, sensors
• Source systems fast-load into the warehouse; historical data (beyond the active window) is summarized & loaded; data is integrated/enriched along the way
• 1. Data Warehousing: storing and analysis of structured data – SQL Server Parallel Data Warehouse, SQL Server FTDW data marts, enterprise ETL with SSIS, DQS, MDS
• 2. Map Reduce: storing and processing of unstructured data – HDInsight on Windows Azure and HDInsight on Windows Server; data & compute intensive applications
• 3. Streaming: real-time data processing (a.k.a. predictive maintenance) – SQL Server StreamInsight; alerts, notifications
• 4. Business Analytics: interaction with data – SQL Server Reporting Services, SQL Server Analysis Services, Azure Market Place; business insights, interactive reports, performance scorecards
6
Introduction Jan Pieter Posthuma
• Technical Lead Microsoft BI and Big Data consultant
• Inter Access, local consultancy firm in the Netherlands
• Architect role at multiple projects
• Analysis Services, Reporting Services, PerformancePoint Services, Big Data, HDInsight, Cloud BI
http://twitter.com/jppp
http://linkedin.com/jpp
[email protected]
7
Hadoop
• Hadoop is a collection of software to create a data-intensive distributed cluster running on commodity hardware
• Original idea by Google (2003)
• Widely accepted by database vendors as a solution for unstructured data
• Microsoft partners with Hortonworks and delivers their Hortonworks Data Platform as Microsoft HDInsight
• Available as an Azure service and on-premises
• Hortonworks Data Platform (HDP) is 100% open source!
8
Hadoop
• HDFS – distributed, fault-tolerant file system
• MapReduce – framework for writing/executing distributed, fault-tolerant algorithms
• Hive & Pig – SQL-like declarative languages
• Sqoop/PolyBase – packages for moving data between HDFS and relational DB systems
• + Others…
• Hadoop 2.0
[Stack diagram: HDFS at the base; Map/Reduce on top of it; Hive & Pig and Sqoop/PolyBase above that; Avro (serialization), HBase and ZooKeeper alongside the stack; ETL tools, BI reporting and RDBMS systems connect from the outside]
9
HDFS
[Diagram: a large file of 6,440MB is split into 64MB blocks – Block 1 through Block 100 at 64MB each, plus Block 101 holding the remaining 40MB; the blocks are color-coded for the following slides]
Files are composed of a set of blocks
• Typically 64MB in size (e.g., block size = 64MB)
• Each block is stored as a separate file in the local file system (e.g. NTFS)
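The block split above is simple arithmetic and can be reproduced in a few lines (a minimal sketch; the 6,440MB file and 64MB block size are the figures from the slide):

```python
# Split a file into fixed-size HDFS-style blocks (all sizes in MB).
def split_into_blocks(file_size_mb, block_size_mb=64):
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    if remainder:
        blocks.append(remainder)  # the last block holds the leftover data
    return blocks

blocks = split_into_blocks(6440)  # the 6,440MB file from the slide
print(len(blocks))                # 101 blocks
print(blocks[-1])                 # the last block is only 40MB
```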
HDFS
[Diagram: one NameNode, with a BackupNode holding namespace backups, coordinates five DataNodes (heartbeat, balancing, replication, etc.); the DataNodes write to local disk]
HDFS was designed with the expectation that failures (both hardware and software) would occur frequently.
Hadoop 2.0 is more decentralized
• Interaction between DataNodes
• Less dependent on primary NameNode
11
Data access to HDFS
• FTP – upload your data files
• Streaming – via Avro (RPC) or Flume
• Hadoop command – hadoop fs -copyFromLocal
• Windows Azure BLOB storage – HDInsight Service (Azure) uses BLOB storage instead of local VM storage. Data can be uploaded without a provisioned Hadoop cluster
• PolyBase – feature of PDW 2012. Direct read/write data access to the DataNodes
12
Data access
Hadoop command Demo
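The hadoop fs demo boils down to a handful of shell commands (a sketch; the paths and file names are illustrative, not taken from the demo itself, and a running Hadoop cluster is assumed):

```shell
# Copy a local file into HDFS (the command from the previous slide)
hadoop fs -copyFromLocal sales.csv /demo/sales.csv

# Verify the upload and inspect the first lines
hadoop fs -ls /demo
hadoop fs -cat /demo/sales.csv | head

# Copy data back out of HDFS to the local file system
hadoop fs -copyToLocal /demo/sales.csv sales_copy.csv
```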
13
14
Map/Reduce
• MR: all functions in a batch-oriented architecture
• Map: apply the logic to the data, e.g. count page hits
• Reduce: reduces (aggregates) the results of the Mappers to one result
• YARN: splits the JobTracker into a Resource Manager and Node Managers. MR in Hadoop 2.0 uses YARN as its JobTracker
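The page-hit count mentioned above can be simulated in a few lines (a sketch of the Map/Reduce programming model only, not of Hadoop’s actual Java API; the log lines are made up):

```python
from itertools import groupby
from operator import itemgetter

# Map: emit a (url, 1) pair for every page hit in the input.
def map_phase(log_lines):
    for line in log_lines:
        url = line.split()[0]  # assume the URL is the first field
        yield (url, 1)

# Shuffle/sort: group intermediate pairs by key, as the framework would.
def shuffle(pairs):
    for url, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield url, [count for _, count in group]

# Reduce: sum the counts per URL to get total page hits.
def reduce_phase(grouped):
    return {url: sum(counts) for url, counts in grouped}

hits = ["/home a", "/products b", "/home c", "/home d", "/products e"]
totals = reduce_phase(shuffle(map_phase(hits)))
print(totals)  # {'/home': 3, '/products': 2}
```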
15
Map/Reduce
Total page hits
16
Hive
• Built for easy data retrieval
• Uses Map/Reduce
• Created by Facebook
• HiveQL: SQL-like language
• Stores data in tables, which are stored as HDFS file(s)
• Only initial INSERT supported, no UPDATE or DELETE
• External tables possible on existing (CSV) file(s)
• Extra language options to use the benefits of Hadoop
• Stinger initiative: Phase 1 (0.11) and Phase 2 (0.12). Goal: improve Hive performance 100x
17
Hive
Star schema join (based on TPC-DS Query 27)

SELECT col5, avg(col6)
FROM store_sales_fact ssf
  JOIN item_dim ON (ssf.col1 = item_dim.col1)
  JOIN date_dim ON (ssf.col2 = date_dim.col2)
  JOIN custdmgrphcs_dim ON (ssf.col3 = custdmgrphcs_dim.col3)
  JOIN store_dim ON (ssf.col4 = store_dim.col4)
GROUP BY col5
ORDER BY col5
LIMIT 100;

Cluster: 6 nodes (2 name, 4 compute – dual core, 14GB)
Table sizes: 41 GB, 58 MB, 11 MB, 80 MB, 106 KB
18
Hive
File Type                                  # MR jobs  Input Size  # Mappers  Time
Text / Hive 0.10                           5          43.1 GB     179        21:00 min
Text / Hive 0.11                           1          38.0 GB     151        4:06 min
RC / Hive 0.11                             1          8.21 GB     76         2:16 min
ORC / Hive 0.11                            1          2.83 GB     38         1:44 min
RC / Hive 0.11 / Partitioned / Bucketed    1          1.73 GB     19         1:44 min
ORC / Hive 0.11 / Partitioned / Bucketed   1          687 MB      27         1:19 min

Data: ~64x less data; time: ~16x faster
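The two headline ratios follow directly from the first and last rows of the table (a quick check; sizes converted to MB, times to seconds):

```python
# Compare Text / Hive 0.10 against ORC / partitioned / bucketed.
text_hive_010_mb = 43.1 * 1024   # 43.1 GB input
orc_bucketed_mb  = 687           # 687 MB input

text_hive_010_s = 21 * 60        # 21:00 min
orc_bucketed_s  = 1 * 60 + 19    # 1:19 min

print(round(text_hive_010_mb / orc_bucketed_mb))  # ~64x less data
print(round(text_hive_010_s / orc_bucketed_s))    # ~16x faster
```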
19
Data access from Hadoop
• Excel
• FTP
• Hadoop command – hadoop fs -copyToLocal
• ODBC[1] – via Hive (HiveQL) data can be extracted
• Power Query[2] – capable of extracting data directly from HDFS or Azure BLOB storage
• PolyBase – feature of PDW 2012. Direct read/write data access to the DataNodes

[1] http://www.microsoft.com/en-us/download/details.aspx?id=40886
[2] Power BI Excel add-in – http://www.powerbi.com
20
Data access
Excel 2013 Demo
21
22
PDW – Polybase
[Diagram: the PDW cluster – a Control Node (SQL Server) in front of multiple Compute Nodes (each running SQL Server) – side by side with a Hadoop cluster – a NameNode (HDFS) in front of many DataNodes. Sqoop moves data between the two clusters; PolyBase lets PDW reach the DataNodes directly. Label: “This is PDW!”]
23
PDW – External Tables
• An external table is PDW’s representation of data residing in HDFS
• The “table” (metadata) lives in the context of a SQL Server database
• The actual table data resides in HDFS
• No support for DML operations
• No concurrency control or isolation level guarantees

CREATE EXTERNAL TABLE table_name
    ({<column_definition>} [,...n ])
{WITH (LOCATION = '<URI>',
    [FORMAT_OPTIONS = (<VALUES>)])}
[;]

LOCATION: required to indicate the location of the Hadoop cluster
FORMAT_OPTIONS: optional format options associated with parsing of data from HDFS (e.g. field delimiters & reject-related thresholds)
PDW – Hadoop use cases & examples
[1] Retrieve data from HDFS with a PDW query
• Seamlessly join structured and semi-structured data

SELECT Username FROM ClickStream c, User u
WHERE c.UserID = u.ID AND c.URL = 'www.bing.com';

[2] Import data from HDFS to PDW
• Parallelized CREATE TABLE AS SELECT (CTAS)
• External tables as the source
• PDW table, either replicated or distributed, as destination

CREATE TABLE ClickStreamInPDW
WITH DISTRIBUTION = HASH(URL)
AS SELECT URL, EventDate, UserID FROM ClickStream;

[3] Export data from PDW to HDFS
• Parallelized CREATE EXTERNAL TABLE AS SELECT (CETAS)
• External table as the destination; creates a set of HDFS files

CREATE EXTERNAL TABLE ClickStream2 (URL, EventDate, UserID)
WITH (LOCATION = 'hdfs://MyHadoop:5000/joe', FORMAT_OPTIONS (...))
AS SELECT URL, EventDate, UserID FROM ClickStreamInPDW;
25
SQL Server 2012 PDWPolybase demo
26
Wrap up
• Hadoop: ‘just another data source’ @ your fingertips!
• Batch processing large datasets before loading into your DWH
• Offloading DWH data, but keeping it accessible for analysis/reporting
• Integrate Hadoop via Sqoop, ODBC (Hive) or PolyBase
• Near future: deeper integration between Hadoop and SQL PDW

Try Hadoop / HDInsight yourself:
• Azure: http://www.windowsazure.com/en-us/pricing/free-trial/
• Web PI: http://www.microsoft.com/web/downloads/platform.aspx
27
Q&A
28
References
Microsoft Big Data
http://www.microsoft.com/bigdata

Windows Azure HDInsight Service (3-month free trial)
http://www.windowsazure.com/en-us/services/hdinsight/

SQL Server Parallel Data Warehouse (PDW) landing page
http://www.microsoft.com/PDW
http://www.upgradetopdw.com

Introduction to PolyBase
http://www.microsoft.com/en-us/sqlserver/solutions-technologies/data-warehousing/polybase.aspx
29
Thanks!