Garrett Edmondson, Data Warehouse Architect, Blue Granite Inc.
GEdmondson@blue-granite.com
http://garrettedmondson.wordpress.com/

Page 2:

DW vs. OLTP

• Data Warehouse
  – Scan centric (sequential reads/writes), measured in MB/sec
  – Nonvolatile data – nightly loads
  – Index light
    • Few covering clustered indexes
  – Low concurrency

• OLTP
  – Seek centric (random reads/writes), measured in IOPS
  – Volatile data
  – Index heavy
  – Many heap tables
  – High concurrency

Page 3:

Traditional SQL DW Architecture: Shared Infrastructure

Enterprise Shared SAN Storage

Dedicated Network Bandwidth

Data Warehouse Speed = THROUGHPUT from storage to CPU

Page 5:

Potential Performance Bottlenecks

[Diagram: I/O path from disk to CPU. Components shown: disks, LUNs, storage controller (ports A and B, cache), FC switch, FC HBAs (ports A and B), server (cache, SQL Server, Windows, CPU cores).]

Rates labeled along the path: Disk Feed Rate, LUN Read Rate, SP Port Rate, Switch Port Rate, HBA Port Rate, SQL Server Read Ahead Rate, CPU Feed Rate

Page 6:

Balanced System IO Stack

[Diagram: two CPU sockets (4 cores each) at the top of the I/O stack]

Tune every IO interface!

SMP Scale-Up for Data Growth!!!
• Complex
• Costly
• Obsolete Hardware

Page 7:

CPU Maximum Consumption Rate (MCR)

• Theoretical Max (MCR)
  – Measured with a query served from cache
    • Set STATISTICS IO and STATISTICS TIME to ON
    • MAXDOP = 4
    • ( [Logical reads] / [CPU time in seconds] ) * 8KB / 1024 = MCR per core, in MB/sec

• Benchmark (BCR)
  – Measured with queries served from disk, not cache
    • DBCC dropcleanbuffers
    • Set STATISTICS IO and STATISTICS TIME to ON
    • MAXDOP 8+
    • ( [Logical reads] / [CPU time in seconds] ) * 8KB / 1024
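The formula above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, assuming the logical reads and CPU time have already been copied from the STATISTICS IO and STATISTICS TIME output; the function name and the sample numbers are hypothetical and are not measurements from the talk.

```python
PAGE_KB = 8  # SQL Server page size in KB, as used in the slide's formula


def consumption_rate_mb_per_sec(logical_reads: int, cpu_time_seconds: float) -> float:
    """(logical reads / CPU time in seconds) * 8 KB / 1024, giving MB/sec per core."""
    if cpu_time_seconds <= 0:
        raise ValueError("cpu_time_seconds must be positive")
    return logical_reads / cpu_time_seconds * PAGE_KB / 1024


# Hypothetical readings: 250,000 logical reads and 10 seconds of CPU time.
mcr = consumption_rate_mb_per_sec(logical_reads=250_000, cpu_time_seconds=10.0)
print(f"{mcr:.1f} MB/sec per core")
```

The same function applies to the BCR measurement; only the test conditions (cold cache, higher MAXDOP) differ.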

Page 8:

MB/sec

• Test storage throughput in MB/sec
  – SQLIO
  – Sequential reads and writes are important

Establish real, rather than rated, performance metrics for the key hardware components of the Fast Track reference architecture!

Page 9:

LUN Configuration

Page 10:

Fast Track Data Striping

[Diagram: one storage enclosure, RAID-10, disk pairs (1&2 + 3&4)]

Volumes: ARY01D1v01, ARY01D2v02, ARY02D1v03, ARY02D2v04, ARY03D1v05, ARY03D2v06, ARY04D1v07, ARY04D2v08, ARY05v09

Primary data files: DB1-1.ndf, DB1-2.ndf, DB1-3.ndf, DB1-4.ndf, DB1-5.ndf, DB1-6.ndf, DB1-7.ndf, DB1-8.ndf

Log file: DB1.ldf

Page 11:

Evaluating Page Fragmentation

Average Fragment Size in Pages – This metric is a reasonable measure of contiguous page allocations for a table.

Value should be >= 400 for optimal performance

select db_name(ps.database_id) as database_name
      ,object_name(ps.object_id) as table_name
      ,ps.index_id
      ,i.name
      ,cast(ps.avg_fragment_size_in_pages as int) as [Avg Fragment Size In Pages]
      ,ps.fragment_count as [Fragment Count]
      ,ps.page_count
      ,(ps.page_count * 8)/1024/1024 as [Size in GB]
from sys.dm_db_index_physical_stats
       (DB_ID()                     --NULL for all DBs else run in context of DB
       ,OBJECT_ID('dbo.lineitem')
       ,1
       ,NULL
       ,'SAMPLED') AS ps            --DETAILED, SAMPLED, NULL = LIMITED
inner join sys.indexes AS i
   on (ps.object_id = i.object_id AND ps.index_id = i.index_id)
where ps.database_id = db_id()
  and ps.index_level = 0

Page 12:

(c) 2011 Microsoft. All rights reserved.

Startup option -E !!!
Trace flag -T1117

Page 13:

SQL Server Configurations

• Sequential scan performance starts with database creation and extent allocation

• Recall that the –E startup option is used
  – Allocate 64 extents at a time (4MB)

• Pre-allocation of user databases is strongly recommended

• Autogrow should be avoided if possible
  – If used, always use 4MB increments

Page 14:

Storage Layout Best Practices for SQL Server

• Create a SQL data file per LUN, for every filegroup
• TempDB filegroups share same LUNs as other databases
• Log on separate disks, within each enclosure
  – Striped using SQL Striping
  – Log may share these LUNs with load files, backup targets

Page 15:

Storage Layout Best Practices for SQL Server

[Diagram: data files spread across LUN 1, LUN 2, LUN 3 and LUN 16, with logs on Log LUN 1 and Local Drive 1]

TempDB: TempDB.mdf (25GB), TempDB_02.ndf (25GB), TempDB_03.ndf (25GB), TempDB_16.ndf (25GB)

Permanent_DB, Permanent FG: Permanent_1.ndf, Permanent_2.ndf, Permanent_3.ndf, Permanent_16.ndf; Permanent DB Log

Stage Database, Stage FG: Stage_1.ndf, Stage_2.ndf, Stage_3.ndf, Stage_16.ndf; Stage DB Log

Page 16:

Techniques to Maximize Scan Throughput

• –E startup parameter (2MB Extents and not mixed extents)

• Minimize use of NonClustered indexes on Fact Tables

• Load techniques to avoid fragmentation

– Load in Clustered Index order (e.g. date) when possible

• Index Creation always MAXDOP 1, SORT_IN_TEMPDB

• Isolate volatile tables in separate filegroup

• Isolate staging tables in separate filegroup or DB

• Periodic maintenance

• Turn on SQL Server Compression

Page 17:

Conventional data loads lead to fragmentation

• Bulk Inserts into Clustered Index using a moderate ‘batchsize’ parameter
  – Each ‘batch’ is sorted independently, which causes fragmentation
• Overlapping batches lead to page splits

[Diagram: pages shown out of order relative to the key order of the index: 1:32, 1:31, 1:35, 1:34, 1:33, 1:36, 1:38, 1:37, 1:40, 1:39]

Page 18:

Best Practices for loading

• Use a heap
  – Practical if queries need to scan whole partitions
• or… Use a batchsize = 0
  – Fine if no parallelism is needed during load
• or… Use a Two-Step Load
  1. Load to a Staging Table (heap)
  2. INSERT-SELECT from Staging Table into Target CI
  – Resulting rows are not fragmented
  – Can use parallelism in step 1, which is essential for large data volumes
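The difference between batched loads and the two-step load can be illustrated with a toy model. The sketch below is written in Python rather than T-SQL and assumes a deliberately simplified picture: the physical layout is just the order in which keys are written, and "fragmentation" is counted as adjacent keys that are out of key order. The function names are hypothetical, and this is not how SQL Server allocates pages.

```python
import random


def out_of_order_pairs(keys):
    """Count adjacent positions where the physical order breaks key order."""
    return sum(1 for a, b in zip(keys, keys[1:]) if a > b)


def batched_load(keys, batch_size):
    """Each batch is sorted independently, then appended (conventional load)."""
    stored = []
    for i in range(0, len(keys), batch_size):
        stored.extend(sorted(keys[i:i + batch_size]))
    return stored


def two_step_load(keys):
    """Step 1: land rows in a heap in arrival order. Step 2: one ordered insert."""
    staging = list(keys)      # staging table (heap)
    return sorted(staging)    # INSERT-SELECT in clustered index order


random.seed(0)
incoming = random.sample(range(10_000), 1_000)

print("batched :", out_of_order_pairs(batched_load(incoming, batch_size=100)))
print("two-step:", out_of_order_pairs(two_step_load(incoming)))
```

In this model the batched load breaks key order at batch boundaries, while the two-step load never does, which mirrors the reasoning on the slides.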

Page 19:

Other fragmentation best practices

• Avoid Autogrow of filegroups
  – Pre-allocate filegroups to desired long-term size
  – Manually grow in large increments when necessary

• Keep volatile tables in a separate filegroup
  – Tables that are frequently rebuilt or loaded in small increments

• If historical partitions are loaded in parallel, consider separate filegroups for separate partitions to avoid extent fragmentation

Page 20:

Columnstore Indexes

Storage = Segments + Dictionaries

CPU = Batch Mode

Page 21:

Row Storage Layout

Customers Table

ID  Name  Address  City  State  Bal Due
1   Bob   …        …     …      3,000
2   Sue   …        …     …      500
3   Ann   …        …     …      1,700
4   Jim   …        …     …      1,500
5   Liz   …        …     …      0
6   Dave  …        …     …      9,000
7   Sue   …        …     …      1,010
8   Bob   …        …     …      50
9   Jim   …        …     …      1,300

[Diagram: whole rows are stored on pages; pages are grouped into an extent]

Page 22:

Column Storage Layout

Customers Table

ID  Name  Address  City  State  Bal Due
1   Bob   …        …     …      3,000
2   Sue   …        …     …      500
3   Ann   …        …     …      1,700
4   Jim   …        …     …      1,500
5   Liz   …        …     …      0
6   Dave  …        …     …      9,000
7   Sue   …        …     …      1,010
8   Bob   …        …     …      50
9   Jim   …        …     …      1,300

[Diagram: each column is stored separately]

ID: 1; 2; 3; 4; 5; 6; 7; 8; 9
Name: Bob; Sue; Ann; Jim; Liz; Dave; Sue; Bob; Jim
Address: …
City: …
State: …
Bal Due: 3,000; 500; 1,700; 1,500; 0; 9,000; 1,010; 50; 1,300

Segment = 1 million row chunks

Page 23:

Run Length Encoding (RLE)

Original columns (sample rows):

Quarter: Q1 Q1 Q1 Q1 Q1 Q1 Q2 Q2 Q2 Q2 Q2 Q2 Q2 Q2 Q2
ProdID: 1 1 1 1 1 2 2 2 1 1 1 1 1 2 2 2
Price: 100 120 315 100 315 198 450 320 320 150 256 450 192 184 310 251 266

After RLE:

Quarter  Start  Count
Q1       1      310
Q2       311    290
…        …      …

ProdID  Start  Count
1       1      5
2       6      3
…       …      …
1       51     5
2       56     3

Price (stored unchanged): 100 120 315 100 315 198 450 320 320 150 256 450 192 184 310 251 266

RLE compression is applied only when the size of the compressed data is smaller than the original.
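Run length encoding as shown on this slide can be sketched in a few lines. The following Python example is illustrative only: it assumes 1-based start positions as in the slide's tables, and the `worth_encoding` size check (three stored fields per run versus one per raw value) is a hypothetical stand-in for whatever size comparison the engine actually performs.

```python
from itertools import groupby


def rle_encode(values):
    """Return (value, start, count) runs; start is 1-based as on the slide."""
    runs = []
    start = 1
    for value, group in groupby(values):
        count = sum(1 for _ in group)
        runs.append((value, start, count))
        start += count
    return runs


def rle_decode(runs):
    """Expand (value, start, count) runs back into the original column."""
    out = []
    for value, _start, count in runs:
        out.extend([value] * count)
    return out


def worth_encoding(values):
    """Assumed heuristic: encode only if the runs take fewer fields than the raw column."""
    return len(rle_encode(values)) * 3 < len(values)


prod_id = [1] * 5 + [2] * 3 + [1] * 5 + [2] * 3
price = [100, 120, 315, 100, 315, 198, 450, 320, 320,
         150, 256, 450, 192, 184, 310, 251, 266]

print(rle_encode(prod_id))
print(worth_encoding(prod_id), worth_encoding(price))
```

The ProdID column collapses into four runs, while the Price column has almost no repeated neighbors, which is why the slide leaves it unencoded.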

Page 24:

xVelocity Store: Dictionary Encoding

Original column (sample rows):

Quarter: Q1 Q1 Q1 Q1 Q2 Q2 Q2 Q3 Q3 Q3 Q3 Q4 Q4 Q4 Q4

Dictionary (DISTINCT values):

Q.ID  Quarter
0     Q1
1     Q2
2     Q3
3     Q4

Only 4 values. 2 bits are enough to represent it.

Encoded column:

Q.ID: 1 1 1 1 2 2 2 3 3 3 3 4 4 4 4

After R.L.E.:

Q.ID  Start  Count
1     1      4
2     5      10
3     11     4
4     15     15
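Dictionary encoding can be sketched the same way. The Python example below is a simplified illustration, assuming IDs are assigned in order of first appearance and are zero-based as in the slide's dictionary table; the function names are hypothetical.

```python
def dictionary_encode(values):
    """Map each distinct value to a small integer ID and encode the column."""
    dictionary = {}
    ids = []
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary)  # next unused ID, zero-based
        ids.append(dictionary[value])
    return dictionary, ids


def bits_per_id(distinct_count):
    """Minimum number of bits needed to tell distinct_count IDs apart."""
    return max(1, (distinct_count - 1).bit_length())


quarter = ["Q1"] * 4 + ["Q2"] * 3 + ["Q3"] * 4 + ["Q4"] * 4
dictionary, ids = dictionary_encode(quarter)

print(dictionary)
print(ids)
print(bits_per_id(len(dictionary)), "bits per value")
```

The encoded ID column is then a natural input for run length encoding, which is the second step shown on the slide.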

Page 25:

CPU Architecture: “Batch” Mode

• Modern CPUs have many cores
• Cache hierarchies: RAM, L3, L2, L1
  – Small L1 and L2 per core; L3 shared by socket
  – L1 faster than L2, L2 faster than L3
  – CPUs stall when waiting for caches to load

• Batch Mode sizes instructions & data to fit into L2/L1 cache!

Page 26:

Parallel Data Warehouse Overview

Page 27:

Data Warehouse appliances

A prepackaged or pre-configured balanced set of hardware (servers, memory, storage and I/O channels), software (operating system, DBMS and management software), service and support, sold as a unit with built-in redundancy for high availability positioned as a platform for data warehousing.

Page 28:

Failover Protection:
• Redundant Control Node
• Redundant Compute Node
• Cluster Failover
• Redundant Array of Inexpensive Databases

[Diagram: Control Node, Spare Node]

Page 29:

Parallel Data Warehouse Appliance - Hardware Architecture

[Diagram: appliance hardware. Components shown: Control Nodes (Active / Passive), Management Servers, Landing Zone, Backup Node, Database Servers (each running SQL Server), Spare Database Server, Storage Nodes. Interconnects: Dual Infiniband, Dual Fiber Channel. External connections: Client Drivers, ETL Load Interface, Corporate Backup Solution, Data Center Monitoring. Networks: Corporate Network, Private Network.]

Page 30:

PDW Data Example

Time Dim: Date Dim ID, Calendar Year, Calendar Qtr, Calendar Mo, Calendar Day

Store Dim: Store Dim ID, Store Name, Store Mgr, Store Size

Product Dim: Prod Dim ID, Prod Category, Prod Sub Cat, Prod Desc

Mktg Campaign Dim: Mktg Camp ID, Camp Name, Camp Mgr, Camp Start, Camp End

Sales Facts: Date Dim ID, Store Dim ID, Prod Dim ID, Mktg Camp Id, Qty Sold, Dollars Sold

[Diagram: four PDW Compute Nodes, each running SQL Server]

Page 31:

PDW Data Example

[Diagram: the same star schema (Time Dim, Store Dim, Product Dim, Mktg Campaign Dim, Sales Facts) and four compute nodes. Each node holds its own copy of PD, TD, MD and SD.]

Smaller Dimension Tables are Replicated on Every Compute Node

Page 32:

PDW Data Example

[Diagram: the same star schema and four compute nodes. Each node holds its own copy of PD, TD, MD and SD, plus one slice of the Sales Facts table: SF-1, SF-2, SF-3, SF-4.]

Larger Fact Table is Hash Distributed Across All Compute Nodes
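The replicate-versus-distribute idea can be sketched as follows. This Python example is a toy model under stated assumptions: the hash function (CRC32 modulo the node count), the choice of `store_id` as distribution column, and all names are hypothetical, since the slides do not say which hash PDW uses.

```python
import zlib


def node_for(key, node_count):
    """Pick a compute node from a deterministic hash of the distribution key."""
    return zlib.crc32(str(key).encode("utf-8")) % node_count


def hash_distribute(rows, key_column, node_count):
    """Split fact rows across nodes (the SF-1 .. SF-n slices on the slide)."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        nodes[node_for(row[key_column], node_count)].append(row)
    return nodes


def replicate(rows, node_count):
    """Give every node a full copy of a small dimension table."""
    return [list(rows) for _ in range(node_count)]


sales_facts = [{"store_id": s, "qty_sold": q}
               for s, q in [(1, 5), (2, 8), (3, 11), (1, 12), (4, 7), (2, 10)]]
store_dim = [{"store_id": s, "store_name": f"Store {s}"} for s in range(1, 5)]

fact_slices = hash_distribute(sales_facts, "store_id", node_count=4)
dim_copies = replicate(store_dim, node_count=4)
```

Because every node has the full dimension table and rows with the same key always land on the same node, a join between the fact slice and the dimension can run locally on each node.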

Page 33:

SMP vs. MPP

SMP: Scale-Up
• Complex – Tune every IO interface
• Cost – exponential
• Obsolete Hardware

MPP: Scale-Out
• Simple – Buy more Processing Nodes
• Cost – linear
• Keep all hardware investments

Page 34:

SQL Server Parallel Data Warehouse: A quick look at MPP query execution

[Diagram: SQL Server PDW Appliance. Client connects to the Control Node, which fans out to Compute Node 1, Compute Node 2, ... Compute Node N]

The user connects to ‘the appliance’ like he would to a ‘normal’ SQL Server, and sends his request

The control node handles global query execution, and generates a distributed execution plan

The actual user data resides on compute nodes, and steps of the global execution plan are executed on each compute node

SQL Server PDW is a shared nothing MPP system, meaning user data is distributed across the nodes*. Data Movement Service is responsible for moving data around so that individual nodes can satisfy queries that need data from other nodes.

Page 35:

Dealing with Distributions - Shuffling

Example:
Select [color], SUM([qty]) from [Store Sales] group by [color];

Distributed Table: Store Sales

Compute Node 1
Ss_id  color   qty
1      Red     5
3      Blue    11
5      Red     12
7      Green   7

Compute Node 2
Ss_id  color   qty
2      Red     8
4      Blue    10
6      Yellow  12

Shuffle Movement: DMS redistributes the data by color values (hash) in parallel.

Temp_1 on Compute Node 1
color   qty
Red     5
Red     12
Red     8
Green   7

Temp_1 on Compute Node 2
color   qty
Blue    11
Yellow  12
Blue    10

Parallel Merge and Aggregate, then Return:

color   qty
Blue    21
Red     25
Green   7
Yellow  12
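The shuffle in this example can be traced in code using the slide's own rows. The Python sketch below is illustrative only: the hash (CRC32 modulo the node count) is an assumption, so colors may land on different nodes than in the slide's picture, but the final totals do not depend on that choice.

```python
import zlib
from collections import defaultdict

# Rows of [Store Sales] as distributed on the slide: (Ss_id, color, qty)
node_1 = [(1, "Red", 5), (3, "Blue", 11), (5, "Red", 12), (7, "Green", 7)]
node_2 = [(2, "Red", 8), (4, "Blue", 10), (6, "Yellow", 12)]


def target_node(color, node_count):
    """Assumed hash: every row with the same color goes to the same node."""
    return zlib.crc32(color.encode("utf-8")) % node_count


def shuffle(nodes):
    """Redistribute (color, qty) pairs by hash of color into per-node Temp_1 tables."""
    temp = [[] for _ in nodes]
    for rows in nodes:
        for _ss_id, color, qty in rows:
            temp[target_node(color, len(nodes))].append((color, qty))
    return temp


def local_aggregate(rows):
    """SUM(qty) GROUP BY color, run independently on one node."""
    totals = defaultdict(int)
    for color, qty in rows:
        totals[color] += qty
    return dict(totals)


temp_1 = shuffle([node_1, node_2])
partials = [local_aggregate(rows) for rows in temp_1]

# Merge: after the shuffle each color lives on exactly one node,
# so the per-node results can simply be combined.
result = {}
for partial in partials:
    result.update(partial)
print(result)
```

The key property is that the shuffle makes each group local to one node, so no further cross-node aggregation is needed before the results are returned.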

Page 36:

SQL Server Parallel Data Warehouse: Overall Architecture

[Diagram: Control Rack and Data Rack (up to 4)]

Control Rack
• Control Node: Client Interface (JDBC, ODBC, OLE-DB, ADO.NET), PDW Engine, DMS Manager, PDW Agent
• Landing Zone Node: ETL Interface, Bulk Data Loader, PDW Agent
• Management Node: Active Directory, PDW Agent

Data Rack (up to 4)
• Compute Node 1: DMS Core, PDW Agent
• Compute Node 2: DMS Core, PDW Agent
• …
• Compute Node 10: DMS Core, PDW Agent

Legend:
DMS = Data Movement Service
PDW = Parallel Data Warehouse
PDW service


Page 38:

SQL Server Parallel Data Warehouse AU3: Release Themes

Themes:
• Performance At Scale
• SQL Server Compatibility
• BI, Analytics, & ETL Integration

Supporting points shown on the slide:
• Less work for the same results
• Do the same work more efficiently
• Broader functionality
• Full Alignment
• Native Support for
  – Analysis Services
  – Reporting Services
  – PowerPivot
• Lay the foundation for broad connectivity support

Page 39:

SQL Server PDW Architecture: How did it work before?

• Problem
  – Basic RDBMS functionality, that already exists in SQL Server, was re-built in PDW

• Challenge for PDW AU3 release
  – Can we leverage SQL Server and focus on MPP related challenges?

[Diagram: Control Node]

Page 40:

SQL Server PDW AU3 Architecture with Shell Appliance and Cost-Based Query Optimizer

SQL Server runs a ‘Shell Appliance’

Every database exists as an empty ‘shell’
• All objects, no user data

DDL executes against both the shell and the compute nodes

Large parts of basic RDBMS functionality now provided by the shell
• Authentication and authorization
• Schema binding
• Metadata catalog

[Diagram: a SELECT arrives at the Control Node, which hosts the Shell Appliance (SQL Server) and the Engine Service; the Engine Service sends plan steps to three Compute Nodes (SQL Server)]

Page 41:

SQL Server PDW AU3 Architecture with Shell Appliance and Cost-Based Query Optimizer

1. User issues a query

2. Query is sent to the Shell through sp_showmemo_xml stored procedure
   – SQL Server performs parsing, binding, authorization
   – SQL optimizer generates execution alternatives

3. MEMO containing candidate plans, histograms, data types is generated

4. Parallel execution plan generated

5. Parallel plan executes on compute nodes

6. Result returned to the user

[Diagram: a SELECT arrives at the Control Node; the Shell Appliance (SQL Server) passes the MEMO to the Engine Service, which sends plan steps to three Compute Nodes (SQL Server); the result is returned to the client]

Page 42:

PDW Cost-Based Optimizer: Optimizer lifecycle…

1. Simplification and space exploration
   – Query standardization and simplification (e.g. column reduction, predicates push-down)
   – Logical space exploration (e.g. join re-ordering, local/global aggregation)
   – Space expansion (e.g. bushy trees – dealing with intermediate resultsets)
   – Physical space exploration
   – Serializing MEMO into binary XML (logical plans)
   – De-serializing binary XML into PDW Memo

2. Parallel optimization and pruning
   – Injecting data move operations (expansion)
   – Costing different alternatives
   – Pruning and selecting lowest cost distributed plan

3. SQL Generation
   – Generating SQL Statements to be executed

Page 43:

PDW Cost-Based Optimizer… And Cost Model Details

• PDW cost model assumptions:
  – Costing only data movement operations (relational operations excluded)
  – Sequential step execution (no pipelined and independent parallelism)

• Data movement operations consist of multiple tasks

• Each task has Fixed and Variable overhead

• Uniform data distribution assumed (no data skew)

Page 44:

PDW Sales Test Workload: AU2 to AU3

• 5x improvement in terms of total elapsed time out of the box

[Chart: elapsed time in seconds (axis 0 to 80) for queries 1 to 39, AU2 vs. AU3]

Page 45:

Theme: Performance at Scale
Zero data conversions in data movement

Goal
• Eliminate CPU utilization spent on data conversions
• Further parallelize operations during data moves

Functionality
• Using ODBC instead of ADO.NET for reading and writing data
• Minimizing appliance resource utilization for data moves

Benefits
• Better resource (CPU) utilization
• 6x or more faster move operations
• Increased concurrency
• Mixed workload (loads + queries)

[Chart: DMS CPU Utilization - TPCH. CPU (%) (axis 0 to 60) for queries Q1 to Q22, AU2 vs. AU3]

[Chart: Throughput improvement for data movements (axis 0% to 600%) for Broadcast, Trim, Replicate, Shuffle, Repl Table Load]

Page 46:

Theme: SQL Server Compatibility
SQL Server Security and Metadata

Security
• SQL Server security syntax and semantics
• Supporting user, roles and logins
• Fixed database roles
• Allows script re-use
• Allows well-known security methods

Metadata
• PDW metadata stored in SQL Server
• Existing SQL Server metadata tables/views (e.g. security views)
• PDW distribution info as extended properties in SQL Server metadata
• Existing means and technology for persisting metadata
• Improved 3rd party tool compatibility (BI, ETL)

Page 47:

Theme: SQL Server Compatibility
Support for SQL Server (Native) Client

Goal
• ‘Look’ just like a normal SQL Server
• Better integration with other BI tools

Functionality
• Use existing SQL Server drivers to connect to SQL Server PDW
• Implement SQL Server TDS protocol
• Named Parameter support
• SQLCMD connectivity to PDW

Benefits
• Use known tools and proven technology stack
• Existing SQL Server ‘eco-system’
• 2x performance improvement for return operations
• 5x reduction of connection time

[Diagram: SQL Server Clients (ADO.NET, ODBC, OLE-DB, JDBC) connect over TDS; SQL PDW Clients (ODBC, OLE-DB, ADO.NET) connect over SequeLink. Server endpoints shown: 10.217.165.13, 17001 and 10.217.165.13, 17000]

Page 48:

Theme: SQL Server Compatibility
Stored Procedure Support (Subset)

Goal
• Support common scenarios of code encapsulation and reuse in Reporting and ETL

Functionality
• System and user-defined stored procedures
• Invocation using RPC or EXECUTE
• Control flow logic, input parameters

Benefits
• Enables common logic re-use
• Big impact for Reporting Services scenarios
• Allows porting existing scripts
• Increases compatibility with SQL Server

Syntax

CREATE { PROC | PROCEDURE } [dbo.]procedure_name
    [ { @parameter data_type } [ = default ] ] [ ,...n ]
AS { [ BEGIN ] sql_statement [;] [ ...n ] [ END ] } [;]

ALTER { PROC | PROCEDURE } [dbo.]procedure_name
    [ { @parameter data_type } [ = default ] ] [ ,...n ]
AS { [ BEGIN ] sql_statement [;] [ ...n ] [ END ] } [;]

DROP { PROC | PROCEDURE } { [dbo.]procedure_name } [;]

[ { EXEC | EXECUTE } ]
    { { [database_name.][schema_name.]procedure_name }
      [{ value | @variable }] [ ,...n ] } [;]

{ EXEC | EXECUTE }
    ( { @string_variable | [ N ]'tsql_string' } [ + ...n ] ) [;]

Unsupported Functionality
• Stored Proc Nesting
• Output Params
• Return
• Try-Catch

Page 49:

Theme: SQL Server Compatibility
Collations

Goal
• Support local and international data

Functionality
• Fixed server level collation
• User-defined column level collation
• Supporting all Windows collations
• Allow COLLATE clauses in Queries and DML

Benefits
• Store all the data in PDW w/ additional querying flexibility
• Existing T-SQL DDL and Query scripts
• SQL Server alignment and functionality

Syntax

CREATE TABLE T (
    c1 varchar(3) COLLATE traditional_Spanish_ci_ai,
    c2 varchar(10) COLLATE …)

SELECT c1 COLLATE Latin1_General_Bin2 FROM T

SELECT * FROM T ORDER BY c1 COLLATE Latin1_General_Bin2

Unsupported Functionality
• Cannot specify DB collation during DB creation
• Cannot alter column collations for existing tables

Page 50:

Theme: Improved Integration
SQL Server PDW Connectors

Connector for Hadoop
• Bi-directional (import/export) interface between MSFT Hadoop and PDW
• Delimited file support
• Adapter uses existing PDW tools (bulk loader, dwsql)
• Low cost solution that handles all the data: structured and unstructured
• Additional agility, flexibility and choice

Connector for Informatica
• Connector providing PDW source and target (mappings, transformations)
• Informatica uses PDW bulk loader for fast loads
• Leverage existing toolset and knowledge

Connector for Business Objects

Page 51:

Agenda

• Trends in the DW space
• How does SQL Server PDW fit in?
• SQL Server PDW AU3 – What’s new?
• Building BI Solutions with SQL Server PDW
  – Customer Successes
  – Using SQL Server PDW with Microsoft BI solutions
  – Using SQL Server PDW with third party BI solutions
  – BI solutions leveraging Hadoop integration
• What’s coming next in SQL Server PDW?

Page 52:

PDW Retail POS Workload
Original Customer SMP solution vs. PDW AU3 (with cost-based query optimizer)

[Chart: elapsed time in seconds (axis 0 to 1600) for queries Q1 to Q7, Old SMP POS ODS vs. AU3]

Page 53:

Customer Successes: How are customers using PDW & BI?

CUSTOMER EXAMPLE: Stock Exchange in the US

Data Volume
• 80 TB data warehouse analyzing data from exchanges
• Existing system based on SQL SMP farm
  – 2 different clusters of 6 servers each

Requirement
• Linear scalability with additional hardware
• Support hourly loads with SSIS – 300GB/day
• BI Integration: SSRS, SSAS and PowerPivot

AU3 Feedback
• SP and increased T-SQL support was great
• Migrating SMP SSRS to PDW was painless
• 142x for scan heavy queries & no summary tables
• Enabled queries that do not run on existing system

[Diagram: Operational DB’s feed PDW through ETL; PDW serves Reports, Dashboards, Scorecards and a Portal]

Page 54:

Role of PDW within the BI stack

PDW role as fast ‘data hub’
• Fast and parallel feeding of data marts (DMs) via Infiniband
  – CREATE REMOTE TABLE AS SELECT

Aggregation abilities avoid ETL overhead in existing systems
• No need for indexes
• No need to maintain indexed/materialized views (summary tables)

[Diagram: PDW feeds three data marts (DM), each serving SSAS / SSRS, plus 3rd party BI; links are labeled Infiniband and GBit link]

Page 55:

SSAS with SQL Server PDW
Understanding the differences compared to ‘SMP world’

Specific to PDW
• PDW does not support foreign key constraints
• Shared nothing model requires careful data design and retrieval planning
• Design cubes for parallel processing – via MOLAP & ROLAP storage model

Specific to the nature of large data
• Parallel cube processing/deployment has its limits
  – Be cautious about parallel loads of SSAS (query timeout settings)
• Query design is crucial: only include required data
  – BI tools were traditionally not designed for handling huge amounts of data

Page 56:

New Challenges for Business Analytics

• Huge amount of data born ‘unstructured’
• Increasing demand for (near) real-time business analytics
• Pre-filtering of important from less relevant raw data required

Applications
• Sensor networks & RFID
• Social networks & Mobile Apps
• Biological & Genomics

[Diagram: Sensor/RFID Data, Blogs, Docs and Web Data flow into HADOOP]

Page 57:

Hadoop as a Platform Solution
In the context of ETL, BI, and DW

• Platform to accelerate ETL processes (not competing with current ETL software tools!)

• Flexible and fast development of ‘hand-written’ refining requests of raw data

• Active & cost effective data archive to let (historical) data ‘live forever’

• Co-existence with a relational DW (not completely replacing it!)

[Diagram: HADOOP at the center, labeled Fast ETL processing, Active Archive, Fast Refinery, Cost-Optimal storage]

Page 58:

Importing HDFS data into PDW for advanced BI

[Diagram: Sensor/RFID Data, Blogs, Docs and Web Data flow into HADOOP; SQOOP connects HADOOP to SQL Server PDW, which serves Interactive BI/Data Visualization. Roles shown: Application Programmers, DBMS Admin, Power BI Users]

Page 59:

Hadoop - PDW Integration via SQOOP (export)

[Diagram: Linux/Hadoop side (HDFS, PDW Hadoop Connector, PDW-configuration file) and Windows/PDW side (Landing Zone with FTP Server and Telnet Server, Control Node, Compute Node 1 to Compute Node 8). The flow is numbered 1. to 5.]

Labels on the flow:
• SQOOP export with source (HDFS path) & target (PDW DB & table)
• Read HDFS data via mappers
• FTP Server: copies incoming data on Landing Zone
• Telnet Server: invokes ‘DWLoader’

Page 60:

SQL Server PDW Roadmap: What is coming next?

[Timeline: CALENDAR YEAR 2011 (Q1 to Q4) and CALENDAR YEAR 2012 (Q1, Q2). Releases shown: Appliance Update 1 (Shipped), Appliance Update 2 (Shipped), Appliance Update 3 (Shipped), V-Next]

Earlier appliance updates, in the order shown:
• Improved node manageability
• Better performance and reduced overhead
• OEM requests
• Programmability
  – Batches
  – Control flow
  – Variables
• Temp tables
• QDR infiniband switch
• Onboard Dell

Appliance Update 3
• Cost based optimizer
• Native SQL Server drivers, including JDBC
• Collations
• More expressive query language
• Data Movement Services performance
• SCOM pack
• Stored procedures (subset)
• Half-rack
• 3rd party integration (Informatica, MicroStrategy, Business Objects, HADOOP)

V-Next
• Columnar store index
• Stored procedures
• Integrated Authentication
• PowerView integration
• Workload management
• LZ/BU redundancy
• Windows 8
• SQL Server 2012
• Hardware refresh