SQLRally Amsterdam 2013 - Hadoop
Hadoop
Henk van der Valk – Technical Sales Professional
Jan Pieter Posthuma – Microsoft BI Consultant
7/11/2013
JOIN THE PASS COMMUNITY
2
Become a PASS member for free and join the world’s biggest SQL Server Community.
• Access to online training content
• Access to events at discounted rates
• Join Local Chapters
• Join Virtual Chapters
• Personalize your PASS website experience
3
Agenda
• Introduction
• Hadoop
• HDFS
• Data access to HDFS
• Map/Reduce
• Hive
• Data access from HDFS
• SQL PDW PolyBase
• Wrap up
4
Introduction Henk
• 10 years of Unisys-EMEA Performance Center
• 2002 – Largest SQL DWH in the world (SQL 2000)
• Project Real (SQL 2005)
• ETL WR – loading 1TB within 30 mins (SQL 2008)
• Contributed to various SQL whitepapers
• Schuberg Philis – 100% uptime for mission critical applications
• Since April 1st, 2011 – Microsoft SQL PDW, Western Europe
• SQLPass speaker & volunteer since 2005
5
Introduction
[Architecture diagram: “Data, Insights, Value” – the Microsoft big data platform]
• Sources: ERP, CRM, LOB apps; big data sources (raw, unstructured); crawlers, bots, devices, sensors
• Source systems fast-load into the warehouse; historical data (beyond the active window) is summarized & loaded; data is integrated/enriched along the way
• 1. Data Warehousing: storing and analysis of structured data – SQL Server Parallel Data Warehouse, SQL Server FTDW data marts, enterprise ETL with SSIS, DQS, MDS
• 2. Map Reduce: storing and processing of unstructured data – HDInsight on Windows Azure and HDInsight on Windows Server; data & compute intensive applications
• 3. Streaming: real-time data processing (a.k.a. predictive maintenance) – SQL Server StreamInsight; alerts, notifications
• 4. Business Analytics: interaction with data – SQL Server Reporting Services, SQL Server Analysis Services, Azure Market Place; business insights, interactive reports, performance scorecards
6
Introduction Jan Pieter Posthuma
• Technical Lead Microsoft BI and Big Data consultant
• Inter Access, local consultancy firm in the Netherlands
• Architect role at multiple projects
• Analysis Services, Reporting Services, PerformancePoint Services, Big Data, HDInsight, Cloud BI
http://twitter.com/jppp
http://linkedin.com/jpp
[email protected]
7
Hadoop
• Hadoop is a collection of software to create a data-intensive distributed cluster running on commodity hardware
• Original idea by Google (2003)
• Widely accepted by database vendors as a solution for unstructured data
• Microsoft partners with Hortonworks and delivers their Hortonworks Data Platform as Microsoft HDInsight
• Available as an Azure service and on-premises
• Hortonworks Data Platform (HDP) is 100% open source!
8
Hadoop
• HDFS – distributed, fault-tolerant file system
• MapReduce – framework for writing/executing distributed, fault-tolerant algorithms
• Hive & Pig – SQL-like declarative languages
• Sqoop/PolyBase – packages for moving data between HDFS and relational DB systems
• + Others…
• Hadoop 2.0
[Stack diagram: HDFS at the base; Map/Reduce on top of it; Hive & Pig and Sqoop/PolyBase above that; Avro (serialization), HBase and ZooKeeper alongside the stack; ETL tools, BI reporting and RDBMS systems connect from the outside]
9
HDFS
[Diagram: a large file of 6,440MB is split into 64MB blocks – Block 1 through Block 100 at 64MB each, plus Block 101 holding the remaining 40MB; the blocks are color-coded for the following slides]
Files are composed of a set of blocks
• Typically 64MB in size (e.g., block size = 64MB)
• Each block is stored as a separate file in the local file system (e.g. NTFS)
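The block split above is simple arithmetic and can be reproduced in a few lines (a minimal sketch; the 6,440MB file and 64MB block size are the figures from the slide):

```python
# Split a file into fixed-size HDFS-style blocks (all sizes in MB).
def split_into_blocks(file_size_mb, block_size_mb=64):
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    if remainder:
        blocks.append(remainder)  # the last block holds the leftover data
    return blocks

blocks = split_into_blocks(6440)  # the 6,440MB file from the slide
print(len(blocks))                # 101 blocks
print(blocks[-1])                 # the last block is only 40MB
```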
HDFS
[Diagram: one NameNode, with a BackupNode holding namespace backups, coordinates five DataNodes (heartbeat, balancing, replication, etc.); the DataNodes write to local disk]
HDFS was designed with the expectation that failures (both hardware and software) would occur frequently.
Hadoop 2.0 is more decentralized
• Interaction between DataNodes
• Less dependent on primary NameNode
11
Data access to HDFS
• FTP – upload your data files
• Streaming – via Avro (RPC) or Flume
• Hadoop command – hadoop fs -copyFromLocal
• Windows Azure BLOB storage – HDInsight Service (Azure) uses BLOB storage instead of local VM storage. Data can be uploaded without a provisioned Hadoop cluster
• PolyBase – feature of PDW 2012. Direct read/write data access to the DataNodes
12
Data access
Hadoop command Demo
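The hadoop fs demo boils down to a handful of shell commands (a sketch; the paths and file names are illustrative, not taken from the demo itself, and a running Hadoop cluster is assumed):

```shell
# Copy a local file into HDFS (the command from the previous slide)
hadoop fs -copyFromLocal sales.csv /demo/sales.csv

# Verify the upload and inspect the first lines
hadoop fs -ls /demo
hadoop fs -cat /demo/sales.csv | head

# Copy data back out of HDFS to the local file system
hadoop fs -copyToLocal /demo/sales.csv sales_copy.csv
```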
13
14
Map/Reduce
• MR: all functions in a batch-oriented architecture
• Map: apply the logic to the data, e.g. count page hits
• Reduce: reduces (aggregates) the results of the Mappers to one result
• YARN: splits the JobTracker into a Resource Manager and Node Managers. MR in Hadoop 2.0 uses YARN as its JobTracker
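The page-hit count mentioned above can be simulated in a few lines (a sketch of the Map/Reduce programming model only, not of Hadoop’s actual Java API; the log lines are made up):

```python
from itertools import groupby
from operator import itemgetter

# Map: emit a (url, 1) pair for every page hit in the input.
def map_phase(log_lines):
    for line in log_lines:
        url = line.split()[0]  # assume the URL is the first field
        yield (url, 1)

# Shuffle/sort: group intermediate pairs by key, as the framework would.
def shuffle(pairs):
    for url, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield url, [count for _, count in group]

# Reduce: sum the counts per URL to get total page hits.
def reduce_phase(grouped):
    return {url: sum(counts) for url, counts in grouped}

hits = ["/home a", "/products b", "/home c", "/home d", "/products e"]
totals = reduce_phase(shuffle(map_phase(hits)))
print(totals)  # {'/home': 3, '/products': 2}
```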
15
Map/Reduce
Total page hits
16
Hive
• Built for easy data retrieval
• Uses Map/Reduce
• Created by Facebook
• HiveQL: SQL-like language
• Stores data in tables, which are stored as HDFS file(s)
• Only initial INSERT supported, no UPDATE or DELETE
• External tables possible on existing (CSV) file(s)
• Extra language options to use the benefits of Hadoop
• Stinger initiative: Phase 1 (0.11) and Phase 2 (0.12). Goal: improve Hive performance 100x
17
Hive
Star schema join (based on TPC-DS Query 27)

SELECT col5, avg(col6)
FROM store_sales_fact ssf
  JOIN item_dim ON (ssf.col1 = item_dim.col1)
  JOIN date_dim ON (ssf.col2 = date_dim.col2)
  JOIN custdmgrphcs_dim ON (ssf.col3 = custdmgrphcs_dim.col3)
  JOIN store_dim ON (ssf.col4 = store_dim.col4)
GROUP BY col5
ORDER BY col5
LIMIT 100;

Cluster: 6 nodes (2 name, 4 compute – dual core, 14GB)
Table sizes: 41 GB, 58 MB, 11 MB, 80 MB, 106 KB
18
Hive
File Type                                  # MR jobs  Input Size  # Mappers  Time
Text / Hive 0.10                           5          43.1 GB     179        21:00 min
Text / Hive 0.11                           1          38.0 GB     151        4:06 min
RC / Hive 0.11                             1          8.21 GB     76         2:16 min
ORC / Hive 0.11                            1          2.83 GB     38         1:44 min
RC / Hive 0.11 / Partitioned / Bucketed    1          1.73 GB     19         1:44 min
ORC / Hive 0.11 / Partitioned / Bucketed   1          687 MB      27         1:19 min

Data: ~64x less data; time: ~16x faster
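The two headline ratios follow directly from the first and last rows of the table (a quick check; sizes converted to MB, times to seconds):

```python
# Compare Text / Hive 0.10 against ORC / partitioned / bucketed.
text_hive_010_mb = 43.1 * 1024   # 43.1 GB input
orc_bucketed_mb  = 687           # 687 MB input

text_hive_010_s = 21 * 60        # 21:00 min
orc_bucketed_s  = 1 * 60 + 19    # 1:19 min

print(round(text_hive_010_mb / orc_bucketed_mb))  # ~64x less data
print(round(text_hive_010_s / orc_bucketed_s))    # ~16x faster
```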
19
Data access from Hadoop
• Excel
• FTP
• Hadoop command – hadoop fs -copyToLocal
• ODBC[1] – via Hive (HiveQL) data can be extracted
• Power Query[2] – capable of extracting data directly from HDFS or Azure BLOB storage
• PolyBase – feature of PDW 2012. Direct read/write data access to the DataNodes

[1] http://www.microsoft.com/en-us/download/details.aspx?id=40886
[2] Power BI Excel add-in – http://www.powerbi.com
20
Data access
Excel 2013 Demo
21
22
PDW – Polybase
[Diagram: the PDW cluster – a Control Node (SQL Server) in front of multiple Compute Nodes (each running SQL Server) – side by side with a Hadoop cluster – a NameNode (HDFS) in front of many DataNodes. Sqoop moves data between the two clusters; PolyBase lets PDW reach the DataNodes directly. Label: “This is PDW!”]
23
PDW – External Tables
• An external table is PDW’s representation of data residing in HDFS
• The “table” (metadata) lives in the context of a SQL Server database
• The actual table data resides in HDFS
• No support for DML operations
• No concurrency control or isolation level guarantees

CREATE EXTERNAL TABLE table_name
    ({<column_definition>} [,...n ])
{WITH (LOCATION = '<URI>',
    [FORMAT_OPTIONS = (<VALUES>)])}
[;]

LOCATION: required to indicate the location of the Hadoop cluster
FORMAT_OPTIONS: optional format options associated with parsing of data from HDFS (e.g. field delimiters & reject-related thresholds)
PDW – Hadoop use cases & examples
[1] Retrieve data from HDFS with a PDW query
• Seamlessly join structured and semi-structured data

SELECT Username FROM ClickStream c, User u
WHERE c.UserID = u.ID AND c.URL = 'www.bing.com';

[2] Import data from HDFS to PDW
• Parallelized CREATE TABLE AS SELECT (CTAS)
• External tables as the source
• PDW table, either replicated or distributed, as destination

CREATE TABLE ClickStreamInPDW
WITH DISTRIBUTION = HASH(URL)
AS SELECT URL, EventDate, UserID FROM ClickStream;

[3] Export data from PDW to HDFS
• Parallelized CREATE EXTERNAL TABLE AS SELECT (CETAS)
• External table as the destination; creates a set of HDFS files

CREATE EXTERNAL TABLE ClickStream2 (URL, EventDate, UserID)
WITH (LOCATION = 'hdfs://MyHadoop:5000/joe', FORMAT_OPTIONS (...))
AS SELECT URL, EventDate, UserID FROM ClickStreamInPDW;
25
SQL Server 2012 PDWPolybase demo
26
Wrap up
• Hadoop: ‘just another data source’ @ your fingertips!
• Batch processing large datasets before loading into your DWH
• Offloading DWH data, but keeping it accessible for analysis/reporting
• Integrate Hadoop via Sqoop, ODBC (Hive) or PolyBase
• Near future: deeper integration between Hadoop and SQL PDW

Try Hadoop / HDInsight yourself:
• Azure: http://www.windowsazure.com/en-us/pricing/free-trial/
• Web PI: http://www.microsoft.com/web/downloads/platform.aspx
27
Q&A
28
References
Microsoft Big Data
http://www.microsoft.com/bigdata

Windows Azure HDInsight Service (3-month free trial)
http://www.windowsazure.com/en-us/services/hdinsight/

SQL Server Parallel Data Warehouse (PDW) landing page
http://www.microsoft.com/PDW
http://www.upgradetopdw.com

Introduction to PolyBase
http://www.microsoft.com/en-us/sqlserver/solutions-technologies/data-warehousing/polybase.aspx
29
Thanks!