SQLRally Amsterdam 2013 - Hadoop

Hadoop – Henk van der Valk (Technical Sales Professional), Jan Pieter Posthuma (Microsoft BI Consultant) – 7/11/2013

Description

SQLRally Amsterdam 2013 presentation about Hadoop, including HDFS, Hive, and PolyBase.

Transcript of SQLRally Amsterdam 2013 - Hadoop

Page 1: SQLRally Amsterdam 2013 - Hadoop

Hadoop

Henk van der Valk – Technical Sales Professional

Jan Pieter Posthuma – Microsoft BI Consultant

7/11/2013

Page 2: SQLRally Amsterdam 2013 - Hadoop

JOIN THE PASS COMMUNITY

Become a PASS member for free and join the world’s biggest SQL Server Community.

• Access to online training content
• Access to events at discounted rates
• Join Local Chapters
• Join Virtual Chapters
• Personalize your PASS website experience

Page 3: SQLRally Amsterdam 2013 - Hadoop

Agenda

• Introduction
• Hadoop
• HDFS
• Data access to HDFS
• Map/Reduce
• Hive
• Data access from HDFS
• SQL PDW PolyBase
• Wrap up

Page 4: SQLRally Amsterdam 2013 - Hadoop

Introduction Henk

• 10 years of Unisys EMEA Performance Center
• 2002 – largest SQL DWH in the world (SQL 2000)
• Project REAL (SQL 2005)
• ETL world record (WR) – loading 1 TB within 30 minutes (SQL 2008)
• Contributed to various SQL whitepapers
• Schuberg Philis – 100% uptime for mission-critical applications
• Since April 1, 2011 – Microsoft SQL PDW, Western Europe
• SQLPass speaker & volunteer since 2005

Page 5: SQLRally Amsterdam 2013 - Hadoop

Introduction

[Architecture diagram: the Microsoft Big Data platform, from data to insights to value]

• Sources: ERP, CRM and LOB apps; big data sources (raw, unstructured) from crawlers, bots, devices and sensors; Azure Market Place; historical data (beyond the active window)
• Processing: SQL Server StreamInsight (alerts, notifications; data & compute intensive applications); enterprise ETL with SSIS, DQS, MDS (integrate/enrich); fast load into SQL Server Parallel Data Warehouse; summarize & load from HDInsight on Windows Azure / HDInsight on Windows Server; SQL Server FTDW data marts
• Consumption: SQL Server Reporting Services and SQL Server Analysis Services delivering business insights, interactive reports and performance scorecards

The four capabilities in the diagram:
1. Data warehousing: storing and analysis of structured data
2. Map/Reduce: storing and processing of unstructured data
3. Streaming: real-time data processing (a.k.a. predictive maintenance)
4. Business analytics: interaction with data

Page 6: SQLRally Amsterdam 2013 - Hadoop

Introduction Jan Pieter Posthuma

• Technical Lead Microsoft BI and Big Data consultant
• Inter Access, local consultancy firm in the Netherlands
• Architect role at multiple projects
• Analysis Services, Reporting Services, PerformancePoint Services, Big Data, HDInsight, Cloud BI

http://twitter.com/jppp
http://linkedin.com/[email protected]

Page 7: SQLRally Amsterdam 2013 - Hadoop

Hadoop

• Hadoop is a collection of software to create a data-intensive distributed cluster running on commodity hardware
• Original idea by Google (2003)
• Widely accepted by database vendors as a solution for unstructured data
• Microsoft partners with Hortonworks and delivers the Hortonworks Data Platform as Microsoft HDInsight
• Available as an Azure service and on-premises
• Hortonworks Data Platform (HDP) is 100% open source!

Page 8: SQLRally Amsterdam 2013 - Hadoop

Hadoop

• HDFS – distributed, fault-tolerant file system
• MapReduce – framework for writing/executing distributed, fault-tolerant algorithms
• Hive & Pig – SQL-like declarative languages
• Sqoop/PolyBase – packages for moving data between HDFS and relational DB systems
• + others…
• Hadoop 2.0

[Stack diagram: HDFS at the base, Map/Reduce on top of it, Hive & Pig and Sqoop/PolyBase above; Avro (serialization), HBase and ZooKeeper alongside; ETL tools, BI reporting and RDBMS systems connect from the outside]

Page 9: SQLRally Amsterdam 2013 - Hadoop

HDFS

[Diagram: a large file of 6440 MB split into HDFS blocks]

e.g., block size = 64 MB: the 6440 MB file is split into blocks 1–101; blocks 1–100 are 64 MB each, and block 101 holds the remaining 40 MB. (Let’s color-code them.)

Files are composed of a set of blocks:
• Typically 64 MB in size
• Each block is stored as a separate file in the local file system (e.g. NTFS)
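The block arithmetic above can be checked with a short sketch (pure Python; 6440 MB and 64 MB are the slide’s example values):

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the list of block sizes (in MB) a file is split into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])

blocks = split_into_blocks(6440)
print(len(blocks))   # 101 blocks in total
print(blocks[-1])    # the last block holds only 40 MB
```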

Page 10: SQLRally Amsterdam 2013 - Hadoop

HDFS

[Diagram: a NameNode and a BackupNode (namespace backups) above a row of DataNodes; heartbeat, balancing, replication, etc. flow between the NameNode and the DataNodes; the DataNodes write to local disk]

HDFS was designed with the expectation that failures (both hardware and software) would occur frequently.

Hadoop 2.0 is more decentralized:
• Interaction between DataNodes
• Less dependent on the primary NameNode
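To illustrate why frequent failures are survivable, here is a minimal sketch of replica placement (a simplification: real HDFS placement is rack-aware; a default replication factor of 3 is assumed):

```python
import random

def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes (simplified)."""
    return {block: random.sample(datanodes, replication)
            for block in range(num_blocks)}

datanodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
placement = place_replicas(101, datanodes, replication=3)

# With 3 replicas on distinct nodes, losing any single DataNode
# still leaves at least 2 live copies of every block.
assert all(len(set(nodes)) == 3 for nodes in placement.values())
```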

Page 11: SQLRally Amsterdam 2013 - Hadoop

Data access to HDFS

• FTP – upload your data files
• Streaming – via Avro (RPC) or Flume
• Hadoop command – hadoop fs -copyFromLocal
• Windows Azure BLOB storage – the HDInsight Service (Azure) uses BLOB storage instead of local VM storage; data can be uploaded without a provisioned Hadoop cluster
• PolyBase – feature of PDW 2012; direct read/write data access to the DataNodes

Page 12: SQLRally Amsterdam 2013 - Hadoop


Data access

Hadoop command Demo

Page 13: SQLRally Amsterdam 2013 - Hadoop


Page 14: SQLRally Amsterdam 2013 - Hadoop

Map/Reduce

• MR: all functions in a batch-oriented architecture
• Map: apply the logic to the data, e.g. counting page hits
• Reduce: reduces (aggregates) the results of the mappers to one result
• YARN: splits the JobTracker into a Resource Manager and Node Managers; MR in Hadoop 2.0 uses YARN as its JobTracker
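The map/reduce flow for the page-hit example can be sketched locally in plain Python (a simulation of the concept, not the Hadoop API):

```python
from itertools import groupby
from operator import itemgetter

log = ["/home", "/about", "/home", "/products", "/home", "/about"]

# Map: emit a (page, 1) pair for every hit.
mapped = [(page, 1) for page in log]

# Shuffle: group the intermediate pairs by key.
shuffled = groupby(sorted(mapped, key=itemgetter(0)), key=itemgetter(0))

# Reduce: aggregate the counts per page.
totals = {page: sum(count for _, count in pairs) for page, pairs in shuffled}
print(totals)  # {'/about': 2, '/home': 3, '/products': 1}
```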

Page 15: SQLRally Amsterdam 2013 - Hadoop

Map/Reduce

[Diagram: computing total page hits with Map/Reduce]

Page 16: SQLRally Amsterdam 2013 - Hadoop

Hive

• Built for easy data retrieval
• Uses Map/Reduce
• Created by Facebook
• HiveQL: SQL-like language
• Stores data in tables, which are stored as HDFS file(s)
• Only initial INSERT supported, no UPDATE or DELETE
• External tables possible on existing (CSV) file(s)
• Extra language options to use the benefits of Hadoop
• Stinger initiative – Phase 1 (0.11) and Phase 2 (0.12): improve Hive 100x

Page 17: SQLRally Amsterdam 2013 - Hadoop

Hive

Star schema join – based on TPC-DS Query 27:

SELECT col5, avg(col6)
FROM store_sales_fact ssf
  JOIN item_dim ON (ssf.col1 = item_dim.col1)
  JOIN date_dim ON (ssf.col2 = date_dim.col2)
  JOIN custdmgrphcs_dim ON (ssf.col3 = custdmgrphcs_dim.col3)
  JOIN store_dim ON (ssf.col4 = store_dim.col4)
GROUP BY col5
ORDER BY col5
LIMIT 100;

Cluster: 6 nodes (2 name, 4 compute – dual core, 14 GB)

Table sizes: 41 GB / 58 MB / 11 MB / 80 MB / 106 KB

Page 18: SQLRally Amsterdam 2013 - Hadoop

Hive

File Type                                  # MR jobs   Input Size   # Mappers   Time
Text / Hive 0.10                           5           43.1 GB      179         21:00 min
Text / Hive 0.11                           1           38.0 GB      151         4:06 min
RC / Hive 0.11                             1           8.21 GB      76          2:16 min
ORC / Hive 0.11                            1           2.83 GB      38          1:44 min
RC / Hive 0.11 / Partitioned / Bucketed    1           1.73 GB      19          1:44 min
ORC / Hive 0.11 / Partitioned / Bucketed   1           687 MB       27          1:19 min

Data: ~64x less; time: ~16x faster
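The headline factors follow from the first and last rows of the table (a rough check, mixing GB and MB):

```python
# Text / Hive 0.10: 43.1 GB input, 21:00 min.
# ORC / Hive 0.11 / partitioned / bucketed: 687 MB input, 1:19 min.
baseline_mb, baseline_sec = 43.1 * 1024, 21 * 60
best_mb, best_sec = 687, 1 * 60 + 19

print(round(baseline_mb / best_mb))    # data reduction: ~64x
print(round(baseline_sec / best_sec))  # speedup: ~16x
```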

Page 19: SQLRally Amsterdam 2013 - Hadoop

Data access from Hadoop

• Excel
• FTP
• Hadoop command – hadoop fs -copyToLocal
• ODBC [1] – via Hive (HiveQL) data can be extracted
• Power Query [2] – capable of extracting data directly from HDFS or Azure BLOB storage
• PolyBase – feature of PDW 2012; direct read/write data access to the DataNodes

[1] http://www.microsoft.com/en-us/download/details.aspx?id=40886
[2] Power BI Excel add-in – http://www.powerbi.com

Page 20: SQLRally Amsterdam 2013 - Hadoop


Data access

Excel 2013 Demo

Page 21: SQLRally Amsterdam 2013 - Hadoop


Page 22: SQLRally Amsterdam 2013 - Hadoop

PDW – PolyBase

[Diagram: a PDW cluster – one Control Node (SQL Server) in front of multiple Compute Nodes (SQL Server) – next to a Hadoop cluster with a NameNode (HDFS) and many DataNodes; Sqoop moves data between the two. “This is PDW!”]

Page 23: SQLRally Amsterdam 2013 - Hadoop

PDW – External Tables

• An external table is PDW’s representation of data residing in HDFS
• The “table” (metadata) lives in the context of a SQL Server database
• The actual table data resides in HDFS
• No support for DML operations
• No concurrency control or isolation level guarantees

CREATE EXTERNAL TABLE table_name
    ({<column_definition>} [ ,...n ])
{WITH (LOCATION = '<URI>'
    [, FORMAT_OPTIONS = (<VALUES>)])}
[;]

LOCATION (required) indicates the location of the Hadoop cluster. FORMAT_OPTIONS (optional) are format options associated with parsing of data from HDFS (e.g. field delimiters & reject-related thresholds).

Page 24: SQLRally Amsterdam 2013 - Hadoop

PDW – Hadoop use cases & examples

[1] Retrieve data from HDFS with a PDW query
• Seamlessly join structured and semi-structured data

SELECT Username
FROM ClickStream c, User u
WHERE c.UserID = u.ID AND c.URL = 'www.bing.com';

[2] Import data from HDFS to PDW
• Parallelized CREATE TABLE AS SELECT (CTAS)
• External tables as the source
• PDW table, either replicated or distributed, as destination

CREATE TABLE ClickStreamInPDW
WITH (DISTRIBUTION = HASH(URL))
AS SELECT URL, EventDate, UserID FROM ClickStream;

[3] Export data from PDW to HDFS
• Parallelized CREATE EXTERNAL TABLE AS SELECT (CETAS)
• External table as the destination; creates a set of HDFS files

CREATE EXTERNAL TABLE ClickStream2 (URL, EventDate, UserID)
WITH (LOCATION = 'hdfs://MyHadoop:5000/joe', FORMAT_OPTIONS (...))
AS SELECT URL, EventDate, UserID FROM ClickStreamInPDW;

Page 25: SQLRally Amsterdam 2013 - Hadoop

SQL Server 2012 PDW – PolyBase demo

Page 26: SQLRally Amsterdam 2013 - Hadoop

Wrap up

• Hadoop: ‘just another data source’ at your fingertips!
• Batch-process large datasets before loading into your DWH
• Offload DWH data while keeping it accessible for analysis/reporting
• Integrate Hadoop via Sqoop, ODBC (Hive) or PolyBase
• Near future: deeper integration between Hadoop and SQL PDW

Try Hadoop / HDInsight yourself:
• Azure: http://www.windowsazure.com/en-us/pricing/free-trial/
• Web PI: http://www.microsoft.com/web/downloads/platform.aspx

Page 27: SQLRally Amsterdam 2013 - Hadoop


Q&A

Page 28: SQLRally Amsterdam 2013 - Hadoop

References

Microsoft Big Data
http://www.microsoft.com/bigdata

Windows Azure HDInsight Service (3-month free trial)
http://www.windowsazure.com/en-us/services/hdinsight/

SQL Server Parallel Data Warehouse (PDW) landing pages
http://www.microsoft.com/PDW
http://www.upgradetopdw.com

Introduction to PolyBase
http://www.microsoft.com/en-us/sqlserver/solutions-technologies/data-warehousing/polybase.aspx

Page 29: SQLRally Amsterdam 2013 - Hadoop


Thanks!