Garrett Edmondson Data Warehouse Architect Blue Granite Inc. [email protected]
Transcript of: Garrett Edmondson, Data Warehouse Architect, Blue Granite Inc.
[email protected]
http://garrettedmondson.wordpress.com/
DW vs. OLTP
• Data Warehouse
  – Scan centric (sequential reads/writes), measured in MB/sec
  – Nonvolatile data – nightly loads
  – Index light: few covering clustered indexes
  – Low concurrency
• OLTP
  – Seek centric (random reads/writes), measured in IOPs
  – Volatile data
  – Index heavy; many heap tables
  – High concurrency
Traditional SQL DW Architecture: Shared Infrastructure
• Enterprise shared SAN storage
• Dedicated network bandwidth
• Data Warehouse speed = THROUGHPUT from storage to CPU
The DW Trinity
• Star Schema (Fact & Dimension)
  – The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition)
• Balanced Architecture
  – Parallel Data Warehouse
• ColumnStore Indexes – SQL 2012 (see the SQL Server Columnstore Index FAQ)
  – "Batch" mode
  – Page compression
Potential Performance Bottlenecks
[Diagram: the IO path from spindle to core – DISK/LUN pairs → storage controller (A/B) cache → FC switch → FC HBA (A/B) → server cache → SQL Server → Windows → CPU cores – with a rate at every hop:]
• Disk feed rate
• LUN read rate
• SP port rate
• Switch port rate
• HBA port rate
• SQL Server read-ahead rate
• CPU feed rate
Balanced System IO Stack
[Diagram: two CPU sockets (4 cores each) at the top of the IO stack]
• SMP scale-up for data growth!!!
  – Complex
  – Costly
  – Obsolete hardware
• Tune every IO interface!
CPU Maximum Consumption Rate (MCR)
• Theoretical max (MCR) – from a cache query
  – SET STATISTICS IO and STATISTICS TIME to ON
  – MAXDOP = 4
  – ( [Logical reads] / [CPU time in seconds] ) * 8KB / 1024 = MCR in MB/sec per core
• Benchmark (BCR) – from disk, not cache
  – DBCC DROPCLEANBUFFERS
  – SET STATISTICS IO and STATISTICS TIME to ON
  – MAXDOP 8+
  – ( [Logical reads] / [CPU time in seconds] ) * 8KB / 1024 = BCR in MB/sec
• Test storage throughput in MB/sec
  – SQLIO
  – Sequential reads and writes are what matter
• Establish real, rather than rated, performance metrics for the key hardware components of the Fast Track reference architecture!
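The MCR/BCR arithmetic above can be checked with a small helper; the logical-read and CPU-time figures below are illustrative, not taken from the deck:

```python
def mcr_mb_per_sec(logical_reads, cpu_seconds, page_kb=8):
    """Maximum Consumption Rate: 8KB pages consumed per CPU-second,
    converted to MB/sec, per the formula on the slide."""
    return (logical_reads / cpu_seconds) * page_kb / 1024

# Hypothetical run: STATISTICS IO reports 1,600,000 logical reads,
# STATISTICS TIME reports 50 seconds of CPU time.
print(mcr_mb_per_sec(1_600_000, 50))  # 250.0 MB/sec per core
```

Divide total CPU time by MAXDOP first if you want the per-core figure from an aggregate measurement.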
LUN Configuration
Fast Track Data Striping
[Diagram: primary data files DB1-1.ndf through DB1-8.ndf striped one per data LUN (ARY01D1v01, ARY01D2v02, ARY02D1v03, ARY02D2v04, ARY03D1v05, ARY03D2v06, ARY04D1v07, ARY04D2v08); the log file DB1.ldf on its own LUN (ARY05v09). Each storage enclosure is built from RAID-10 disk pairs (1&2 + 3&4).]
Evaluating Page Fragmentation
Average Fragment Size in Pages – This metric is a reasonable measure of contiguous page allocations for a table.
Value should be >= 400 for optimal performance
select db_name(ps.database_id) as database_name
    ,object_name(ps.object_id) as table_name
    ,ps.index_id
    ,i.name
    ,cast(ps.avg_fragment_size_in_pages as int) as [Avg Fragment Size In Pages]
    ,ps.fragment_count as [Fragment Count]
    ,ps.page_count
    ,(ps.page_count * 8)/1024/1024 as [Size in GB]
from sys.dm_db_index_physical_stats
    (DB_ID()   -- NULL for all DBs, else run in the context of the DB
    ,OBJECT_ID('dbo.lineitem'), 1, NULL, 'SAMPLED') AS ps   -- DETAILED, SAMPLED, NULL = LIMITED
inner join sys.indexes AS i
    on (ps.object_id = i.object_id AND ps.index_id = i.index_id)
where ps.database_id = db_id()
    and ps.index_level = 0
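The threshold above is simply page_count divided by fragment_count; a tiny helper (with invented DMV numbers, not real output) makes the check explicit:

```python
def avg_fragment_size_ok(page_count, fragment_count, threshold=400):
    """Average fragment size in pages = page_count / fragment_count;
    per the slide, >= 400 contiguous pages is the target for scan speed."""
    return (page_count / fragment_count) >= threshold

# Hypothetical DMV output: a 1,000,000-page table split into 2,000 fragments
print(avg_fragment_size_ok(1_000_000, 2_000))  # True (500 pages per fragment)
```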
(c) 2011 Microsoft. All rights reserved.
Startup flag –E and trace flag -T1117!
SQL Server Configurations
• Sequential scan performance starts with database creation and extent allocation
• Recall that the –E startup option is used
  – Allocates 64 extents at a time (4MB)
• Pre-allocation of user databases is strongly recommended
• Autogrow should be avoided if possible
  – If used, always use 4MB increments
Storage Layout Best Practices for SQL Server
• Create a SQL data file per LUN, for every filegroup
  – Striped using SQL striping
• TempDB filegroups share the same LUNs as other databases
• Log on separate disks, within each enclosure
  – Log may share these LUNs with load files and backup targets
Storage Layout Best Practices for SQL Server
[Diagram: LUN 1, LUN 2, LUN 3 … LUN 16, plus Local Drive 1 and dedicated log LUNs]
• TempDB: TempDB.mdf (25GB), TempDB_02.ndf (25GB), TempDB_03.ndf (25GB) … TempDB_16.ndf (25GB) – one file per LUN
• Permanent DB – Permanent FG: Permanent_1.ndf, Permanent_2.ndf, Permanent_3.ndf … Permanent_16.ndf – one file per LUN; Permanent DB log on Log LUN 1
• Stage Database – Stage FG: Stage_1.ndf, Stage_2.ndf, Stage_3.ndf … Stage_16.ndf – one file per LUN; Stage DB log on its own log LUN
Techniques to Maximize Scan Throughput
• –E startup parameter (4MB extent allocations, no mixed extents)
• Minimize use of nonclustered indexes on fact tables
• Load techniques to avoid fragmentation
  – Load in clustered index order (e.g. date) when possible
• Index creation: always MAXDOP 1, SORT_IN_TEMPDB
• Isolate volatile tables in a separate filegroup
• Isolate staging tables in a separate filegroup or DB
• Periodic maintenance
• Turn on SQL Server compression
Conventional data loads lead to fragmentation
• Bulk inserts into a clustered index using a moderate 'batchsize' parameter
  – Each 'batch' is sorted independently, which causes fragmentation
  – Overlapping batches lead to page splits
[Diagram: pages 1:31–1:40 allocated out of the key order of the index]
Best Practices for loading
• Use a heap
  – Practical if queries need to scan whole partitions
• or… use batchsize = 0
  – Fine if no parallelism is needed during load
• or… use a two-step load:
  1. Load to a staging table (heap)
  2. INSERT…SELECT from the staging table into the target clustered index
  – Resulting rows are not fragmented
  – Can use parallelism in step 1 – essential for large data volumes
Other fragmentation best practices
• Avoid autogrow of filegroups
  – Pre-allocate filegroups to the desired long-term size
  – Manually grow in large increments when necessary
• Keep volatile tables in a separate filegroup
  – Tables that are frequently rebuilt or loaded in small increments
• If historical partitions are loaded in parallel, consider separate filegroups for separate partitions to avoid extent fragmentation
Columnstore Indexes
Storage = Segments + Dictionaries
CPU = Batch Mode
Row Storage Layout

Customers Table
ID  Name  Address  City  State  Bal Due
1   Bob   …        …     …      3,000
2   Sue   …        …     …      500
3   Ann   …        …     …      1,700
4   Jim   …        …     …      1,500
5   Liz   …        …     …      0
6   Dave  …        …     …      9,000
7   Sue   …        …     …      1,010
8   Bob   …        …     …      50
9   Jim   …        …     …      1,300

[Diagram: whole rows stored together on pages; pages grouped into an extent]
Column Storage Layout

Customers Table (same rows as above)
[Diagram: each column – ID, Name, Address, City, State, Bal Due – stored contiguously in its own pages; e.g. the Name column holds Bob, Sue, Ann, Jim, Liz, Dave, Sue, Bob, Jim and the Bal Due column holds 3,000; 500; 1,700; 1,500; 0; 9,000; 1,010; 50; 1,300]

Segment = 1 million row chunks
Run Length Encoding (RLE)

Quarter column (Q1, Q1, Q1, Q1, Q1, Q1, …, Q2, Q2, Q2, Q2, Q2, Q2, Q2, Q2, Q2, …) compresses to:

Quarter  Start  Count
Q1       1      310
Q2       311    290
…        …      …

ProdID column (1, 1, 1, 1, 1, 2, 2, 2, …, 1, 1, 1, 1, 1, 2, 2, 2) compresses to:

ProdID  Start  Count
1       1      5
2       6      3
…       …      …
1       51     5
2       56     3

Price column (100, 120, 315, 100, 315, 198, 450, 320, 320, 150, 256, 450, 192, 184, 310, 251, 266): few repeated runs, so it stays as-is.

RLE compression is applied only when the size of the compressed data is smaller than the original.
xVelocity Store
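The Quarter/ProdID tables above can be produced by a simple run-length encoder; this sketch (using the slide's ProdID values) emits the same (value, start, count) form with a 1-based Start:

```python
def rle_encode(column):
    """Run-length encode a column into (value, start, count) triples,
    mirroring the Quarter/ProdID tables on the slide."""
    runs = []
    start = 0
    for i in range(1, len(column) + 1):
        # Close the current run at end-of-column or when the value changes.
        if i == len(column) or column[i] != column[start]:
            runs.append((column[start], start + 1, i - start))
            start = i
    return runs

prod_id = [1, 1, 1, 1, 1, 2, 2, 2]
print(rle_encode(prod_id))  # [(1, 1, 5), (2, 6, 3)]
```

A real columnstore would additionally compare the encoded size against the raw size and keep whichever is smaller, as the slide notes.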
Dictionary Encoding

Quarter column: Q1, Q1, Q1, Q1, Q2, Q2, …, Q2, Q3, Q3, Q3, Q3, Q4, Q4, Q4, Q4, …

Only 4 distinct values – 2 bits are enough to represent them.

DISTINCT dictionary:
Q.ID  Quarter
0     Q1
1     Q2
2     Q3
3     Q4

Encoded column: 1, 1, 1, 1, 2, 2, …, 2, 3, 3, 3, 3, 4, 4, 4, 4, …

R.L.E. on the encoded column:
Q.ID  Start  Count
1     1      4
2     5      10
3     11     4
4     15     15
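A minimal sketch of the dictionary step described above; the sample quarters are invented for illustration, and IDs are assigned in order of first appearance:

```python
def dictionary_encode(column):
    """Map each distinct value to a small integer ID, as in the
    Quarter example: 4 distinct values need only 2 bits per entry."""
    dictionary = {}
    encoded = []
    for value in column:
        if value not in dictionary:
            dictionary[value] = len(dictionary)  # next unused ID
        encoded.append(dictionary[value])
    return dictionary, encoded

quarters = ["Q1", "Q1", "Q2", "Q2", "Q3", "Q4"]
dictionary, encoded = dictionary_encode(quarters)
print(dictionary)  # {'Q1': 0, 'Q2': 1, 'Q3': 2, 'Q4': 3}
print(encoded)     # [0, 0, 1, 1, 2, 3]
```

The encoded column can then be run-length encoded, exactly as the slide's R.L.E. table shows.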
CPU Architecture: "Batch" Mode
• Modern CPUs have many cores
• Cache hierarchies: RAM, L3, L2, L1
  – Small L1 and L2 per core; L3 shared by socket
  – L1 faster than L2, L2 faster than L3
  – CPUs stall when waiting for caches to load
• Batch mode sizes instructions & data to fit into the L2/L1 cache!
Parallel Data Warehouse Overview
Data Warehouse appliances
A prepackaged or pre-configured, balanced set of hardware (servers, memory, storage and I/O channels), software (operating system, DBMS and management software), service and support, sold as a unit with built-in redundancy for high availability, positioned as a platform for data warehousing.
Control Node
Failover Protection:
• Redundant Control Node
• Redundant Compute Node
• Cluster Failover
• Redundant Array of Inexpensive Databases
Spare Node
Parallel Data Warehouse Appliance – Hardware Architecture
[Diagram: database servers (SQL Server compute nodes) connected to storage nodes over dual Fibre Channel and to each other over dual InfiniBand; control nodes in an active/passive pair; Landing Zone; backup node; spare database server; management servers. Client drivers, the ETL load interface, the corporate backup solution and data center monitoring attach via the corporate network; the appliance internals use a private network.]
PDW Data Example
• Time Dim: Date Dim ID, Calendar Year, Calendar Qtr, Calendar Mo, Calendar Day
• Store Dim: Store Dim ID, Store Name, Store Mgr, Store Size
• Product Dim: Prod Dim ID, Prod Category, Prod Sub Cat, Prod Desc
• MktgCampaign Dim: Mktg Camp ID, Camp Name, Camp Mgr, Camp Start, Camp End
• Sales Facts: Date Dim ID, Store Dim ID, Prod Dim ID, Mktg Camp Id, Qty Sold, Dollars Sold
[Diagram: the star schema laid over the PDW compute nodes]
PDW Data Example
[Diagram: the same star schema – Time Dim, Store Dim, Product Dim and MktgCampaign Dim around Sales Facts. Each compute node holds a copy of every dimension table (PD, TD, MD, SD).]
Smaller dimension tables are replicated on every compute node.
PDW Data Example
[Diagram: the same star schema. Each compute node holds every dimension table (PD, TD, MD, SD) plus one slice of the fact table (SF-1, SF-2, SF-3, SF-4).]
The larger fact table is hash-distributed across all compute nodes.
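The hash distribution above can be sketched in a few lines; the four-node count, the choice of Store Dim ID as the distribution column, and the sample rows are illustrative, not from the deck:

```python
def distribute(rows, key, nodes=4):
    """Hash-distribute fact rows across compute nodes on a
    distribution column, as PDW does with the larger fact table."""
    shards = [[] for _ in range(nodes)]
    for row in rows:
        # The hash of the distribution column decides the owning node.
        shards[hash(row[key]) % nodes].append(row)
    return shards

# Hypothetical Sales Facts rows keyed on Store Dim ID:
sales_facts = [{"store_dim_id": i, "qty_sold": 10 + i} for i in range(1, 9)]
shards = distribute(sales_facts, "store_dim_id")
print([len(s) for s in shards])  # every row lands on exactly one node
```

Because rows with the same key always hash to the same node, joins and aggregations on the distribution column need no data movement.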
SMP vs. MPP
• SMP: Scale-Up
  – Complex – tune every IO interface
  – Cost – exponential
  – Obsolete hardware
• MPP: Scale-Out
  – Simple – buy more processing nodes
  – Cost – linear
  – Keep all hardware investments
SQL Server Parallel Data Warehouse
A quick look at MPP query execution
[Diagram: client → control node → compute nodes 1…N inside the SQL Server PDW appliance]
• The user connects to 'the appliance' as he would to a 'normal' SQL Server, and sends his request
• The control node handles global query execution and generates a distributed execution plan
• The actual user data resides on the compute nodes, and steps of the global execution plan are executed on each compute node
• SQL Server PDW is a shared-nothing MPP system, meaning user data is distributed across the nodes. The Data Movement Service is responsible for moving data around so that individual nodes can satisfy queries that need data from other nodes.
Dealing with Distributions – Shuffling
Example: SELECT [color], SUM([qty]) FROM [Store Sales] GROUP BY [color];

Distributed table [Store Sales]:

Compute Node 1           Compute Node 2
Ss_id  color   qty       Ss_id  color   qty
1      Red     5         2      Red     8
3      Blue    11        4      Blue    10
5      Red     12        6      Yellow  12
7      Green   7

Shuffle Movement: DMS redistributes the data by color values, in parallel.

Temp_1 on Node 1         Temp_1 on Node 2
color   qty              color   qty
Red     5                Blue    11
Red     12               Yellow  12
Red     8                Blue    10
Green   7

Parallel merge and aggregate, then return:

color   qty
Red     25
Blue    21
Green   7
Yellow  12
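The shuffle-then-aggregate flow above can be sketched end-to-end; the node contents follow the slide, while the two-node count and hash routing are illustrative stand-ins for what DMS does:

```python
from collections import defaultdict

def shuffle_and_aggregate(node_tables, nodes=2):
    """Sketch of a PDW shuffle: redistribute (color, qty) rows by
    hashing the GROUP BY column, aggregate locally on each node,
    then merge - as in SELECT color, SUM(qty) ... GROUP BY color."""
    # Shuffle: every row moves to the node that owns its color.
    temp = [defaultdict(int) for _ in range(nodes)]
    for table in node_tables:
        for color, qty in table:
            temp[hash(color) % nodes][color] += qty
    # Merge the per-node partial aggregates; each color lives on one node.
    result = {}
    for partial in temp:
        result.update(partial)
    return result

node1 = [("Red", 5), ("Blue", 11), ("Red", 12), ("Green", 7)]
node2 = [("Red", 8), ("Blue", 10), ("Yellow", 12)]
print(shuffle_and_aggregate([node1, node2]))
# {'Red': 25, 'Blue': 21, 'Green': 7, 'Yellow': 12} (key order may vary)
```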
SQL Server Parallel Data Warehouse
Overall Architecture
[Diagram:
• Control Rack – Control Node (client interface: JDBC, ODBC, OLE-DB, ADO.NET; DMS Manager; PDW Engine), Landing Zone Node (bulk data loader, ETL interface, PDW Agent), Management Node (Active Directory, PDW Agent)
• Data Rack (up to 4) – Compute Nodes 1…10, each running DMS Core and a PDW Agent
Legend: DMS = Data Movement Service; PDW = Parallel Data Warehouse]
SQL Server Parallel Data Warehouse AU3
Release Themes
• BI, Analytics, & ETL Integration – native support for Analysis Services, Reporting Services and PowerPivot
• Performance At Scale – less work for the same results; do the same work more efficiently
• Broader Functionality – lay the foundation for broad connectivity support
• SQL Server Compatibility – full alignment
SQL Server PDW Architecture
How did it work before?
• Problem
  – Basic RDBMS functionality that already exists in SQL Server was re-built in PDW
• Challenge for the PDW AU3 release
  – Can we leverage SQL Server and focus on MPP-related challenges?
[Diagram: Control Node]
SQL Server PDW AU3 Architecture
PDW AU3 Architecture with Shell Appliance and Cost-Based Query Optimizer
• SQL Server runs a 'Shell Appliance'
• Every database exists as an empty 'shell'
  – All objects, no user data
• DDL executes against both the shell and the compute nodes
• Large parts of basic RDBMS functionality are now provided by the shell
  – Authentication and authorization
  – Schema binding
  – Metadata catalog
[Diagram: Control Node – Shell Appliance (SQL Server) and Engine Service – sending plan steps to three Compute Nodes (SQL Server)]
SQL Server PDW AU3 Architecture
PDW AU3 Architecture with Shell Appliance and Cost-Based Query Optimizer
1. User issues a query
2. Query is sent to the Shell through the sp_showmemo_xml stored procedure
   – SQL Server performs parsing, binding, authorization
   – The SQL optimizer generates execution alternatives
3. A MEMO containing candidate plans, histograms and data types is generated
4. A parallel execution plan is generated
5. The parallel plan executes on the compute nodes
6. The result is returned to the user
[Diagram: Control Node – Shell Appliance (SQL Server), Engine Service and MEMO – sending plan steps to three Compute Nodes (SQL Server)]
PDW Cost-Based Optimizer
Optimizer lifecycle…
1. Simplification and space exploration
   – Query standardization and simplification (e.g. column reduction, predicate push-down)
   – Logical space exploration (e.g. join re-ordering, local/global aggregation)
   – Space expansion (e.g. bushy trees – dealing with intermediate resultsets)
   – Physical space exploration
   – Serializing the MEMO into binary XML (logical plans)
   – De-serializing the binary XML into a PDW MEMO
2. Parallel optimization and pruning
   – Injecting data-move operations (expansion)
   – Costing different alternatives
   – Pruning and selecting the lowest-cost distributed plan
3. SQL generation
   – Generating the SQL statements to be executed
PDW Cost-Based Optimizer
… And Cost Model Details
• PDW cost model assumptions:
  – Costing only data movement operations (relational operations excluded)
  – Sequential step execution (no pipelined or independent parallelism)
• Data movement operations consist of multiple tasks
• Each task has a fixed and a variable overhead
• Uniform data distribution assumed (no data skew)
PDW Sales Test Workload: AU2 to AU3
• 5x improvement in terms of total elapsed time, out of the box
[Chart: elapsed time in seconds (0–80) for queries 1–39, AU2 vs. AU3]
Theme: Performance at Scale
Zero data conversions in data movement

Goal
• Eliminate CPU utilization spent on data conversions
• Further parallelize operations during data moves

Functionality
• Using ODBC instead of ADO.NET for reading and writing data
• Minimizing appliance resource utilization for data moves

Benefits
• Better resource (CPU) utilization
• 6x or more faster move operations
• Increased concurrency
• Mixed workload (loads + queries)

[Chart: DMS CPU utilization (%) on TPC-H queries Q1–Q22, AU2 vs. AU3]
[Chart: throughput improvement (0%–600%) for data movements – Broadcast, Trim, Replicate, Shuffle, Replicated Table Load]
Theme: SQL Server Compatibility
SQL Server Security and Metadata

Security
• SQL Server security syntax and semantics
• Supporting users, roles and logins
• Fixed database roles
• Allows script re-use
• Allows well-known security methods

Metadata
• PDW metadata stored in SQL Server
• Existing SQL Server metadata tables/views (e.g. security views)
• PDW distribution info as extended properties in SQL Server metadata
• Existing means and technology for persisting metadata
• Improved 3rd-party tool compatibility (BI, ETL)
Theme: SQL Server Compatibility
Support for SQL Server (Native) Client

Goal
• 'Look' just like a normal SQL Server
• Better integration with other BI tools

Functionality
• Use existing SQL Server drivers to connect to SQL Server PDW
• Implement the SQL Server TDS protocol
• Named parameter support
• SQLCMD connectivity to PDW

Benefits
• Use known tools and a proven technology stack
• Existing SQL Server 'eco-system'
• 2x performance improvement for return operations
• 5x reduction of connection time

[Diagram: SQL PDW clients (ODBC, OLE-DB, ADO.NET) connecting via SequeLink (Server: 10.217.165.13, 17001), and SQL Server clients (ADO.NET, ODBC, OLE-DB, JDBC) connecting via TDS (Server: 10.217.165.13, 17000)]
Theme: SQL Server Compatibility
Stored Procedure Support (Subset)

Goal
• Support common scenarios of code encapsulation and reuse in Reporting and ETL

Functionality
• System and user-defined stored procedures
• Invocation using RPC or EXECUTE
• Control-flow logic, input parameters

Benefits
• Enables common logic re-use
• Big impact for Reporting Services scenarios
• Allows porting existing scripts
• Increases compatibility with SQL Server

Syntax
CREATE { PROC | PROCEDURE } [dbo.]procedure_name
    [ { @parameter data_type } [ = default ] ] [ ,...n ]
AS { [ BEGIN ] sql_statement [;] [ ...n ] [ END ] } [;]

ALTER { PROC | PROCEDURE } [dbo.]procedure_name
    [ { @parameter data_type } [ = default ] ] [ ,...n ]
AS { [ BEGIN ] sql_statement [;] [ ...n ] [ END ] } [;]

DROP { PROC | PROCEDURE } { [dbo.]procedure_name } [;]

[ { EXEC | EXECUTE } ] { { [database_name.][schema_name.]procedure_name } [ { value | @variable } ] [ ,...n ] } [;]

{ EXEC | EXECUTE } ( { @string_variable | [ N ]'tsql_string' } [ + ...n ] ) [;]

Unsupported Functionality
• Stored proc nesting
• Output params
• Return
• Try-Catch
Theme: SQL Server Compatibility
Collations

Goal
• Support local and international data

Functionality
• Fixed server-level collation
• User-defined column-level collation
• Supporting all Windows collations
• Allow COLLATE clauses in queries and DML

Benefits
• Store all the data in PDW with additional querying flexibility
• Existing T-SQL DDL and query scripts
• SQL Server alignment and functionality

Syntax
CREATE TABLE T (
    c1 varchar(3) COLLATE traditional_Spanish_ci_ai,
    c2 varchar(10) COLLATE …)

SELECT c1 COLLATE Latin1_General_Bin2 FROM T

SELECT * FROM T ORDER BY c1 COLLATE Latin1_General_Bin2

Unsupported Functionality
• Cannot specify DB collation during DB creation
• Cannot alter column collations for existing tables
Theme: Improved Integration
SQL Server PDW Connectors

Connector for Hadoop
• Bi-directional (import/export) interface between MSFT Hadoop and PDW
• Delimited-file support
• Adapter uses existing PDW tools (bulk loader, dwsql)
• Low-cost solution that handles all the data: structured and unstructured
• Additional agility, flexibility and choice

Connector for Informatica
• Connector providing PDW source and target (mappings, transformations)
• Informatica uses the PDW bulk loader for fast loads
• Leverage existing toolset and knowledge

Connector for Business Objects
Agenda
• Trends in the DW space
• How does SQL Server PDW fit in?
• SQL Server PDW AU3 – What's new?
• Building BI solutions with SQL Server PDW
  – Customer successes
  – Using SQL Server PDW with Microsoft BI solutions
  – Using SQL Server PDW with third-party BI solutions
  – BI solutions leveraging Hadoop integration
• What's coming next in SQL Server PDW?
PDW Retail POS Workload
Original customer SMP solution vs. PDW AU3 (with cost-based query optimizer)
[Chart: elapsed time in seconds (0–1600) for queries Q1–Q7, old SMP POS ODS vs. AU3]
Customer Successes
How are customers using PDW & BI?

CUSTOMER EXAMPLE: Stock Exchange in the US

Data Volume
• 80 TB data warehouse analyzing data from exchanges
• Existing system based on a SQL SMP farm
  – 2 different clusters of 6 servers each

Requirements
• Linear scalability with additional hardware
• Support hourly loads with SSIS – 300GB/day
• BI integration: SSRS, SSAS and PowerPivot

AU3 Feedback
• SP and increased T-SQL support was great
• Migrating SMP SSRS to PDW was painless
• 142x for scan-heavy queries & no summary tables
• Enabled queries that do not run on the existing system

[Diagram: operational DBs → ETL → PDW → reports, dashboards, scorecards, portal]
Role of PDW within the BI stack
• PDW's role as a fast 'data hub'
• Fast, parallel feeding of data marts (DMs) via InfiniBand
  – CREATE REMOTE TABLE AS SELECT
• Aggregation abilities avoid ETL overhead in existing systems
  – No need for indexes
  – No need to maintain indexed/materialized views (summary tables)
[Diagram: PDW feeding data marts over InfiniBand; SSAS/SSRS and 3rd-party BI attached over GBit links]
SSAS with SQL Server PDW
Understanding the differences compared to the 'SMP world'

Specific to PDW
• PDW does not support foreign-key constraints
• The shared-nothing model requires careful data design and retrieval planning
• Design cubes for parallel processing – via the MOLAP & ROLAP storage models

Specific to the nature of large data
• Parallel cube processing/deployment has its limits
  – Be cautious about parallel loads of SSAS – query timeout settings
• Query design is crucial – only include required data
  – BI tools traditionally were not designed for handling huge amounts of data
New Challenges for Business Analytics
• Huge amounts of data born 'unstructured'
• Increasing demand for (near) real-time business analytics
• Pre-filtering of important from less relevant raw data required

Applications
• Sensor networks & RFID
• Social networks & mobile apps
• Biological & genomics

[Diagram: sensor/RFID data, blogs, docs and web data flowing into HADOOP, which acts as fast ETL processing, an active archive, a fast refinery and cost-optimal storage]
Hadoop as a Platform Solution
In the context of ETL, BI, and DW
• Platform to accelerate ETL processes (not competing with current ETL software tools!)
• Flexible and fast development of 'hand-written' refining requests against raw data
• Active & cost-effective data archive to let (historical) data 'live forever'
• Co-existence with a relational DW (not completely replacing it!)
Importing HDFS data into PDW for advanced BI
[Diagram: sensor/RFID data, blogs, docs and web data land in HADOOP; SQOOP moves data into SQL Server PDW for interactive BI/data visualization. Actors: application programmers, DBMS admins, power BI users.]
Hadoop – PDW Integration via SQOOP (export)
[Diagram: SQOOP export is invoked with a source (HDFS path) and a target (PDW DB & table). On the Linux/Hadoop side, mappers read the HDFS data (step 2); on the Windows/PDW side, an FTP server copies the incoming data onto the Landing Zone (step 1), a Telnet server invokes 'DWLoader' (step 4), and the data flows through the Control Node to Compute Nodes 1…8. The PDW Hadoop Connector is driven by a PDW configuration file.]
SQL Server PDW Roadmap
What is coming next?

CALENDAR YEAR 2011
• Appliance Update 1 (shipped)
  – Improved node manageability
  – Better performance and reduced overhead
  – OEM requests
• Appliance Update 2 (shipped)
  – Programmability: batches, control flow, variables, temp tables
  – QDR InfiniBand switch
  – Onboard Dell
• Appliance Update 3 (shipped)
  – Cost-based optimizer
  – Native SQL Server drivers, including JDBC
  – Collations
  – More expressive query language
  – Data Movement Services performance
  – SCOM pack
  – Stored procedures (subset)
  – Half-rack
  – 3rd-party integration (Informatica, MicroStrategy, Business Objects, HADOOP)

CALENDAR YEAR 2012
• V-Next
  – Columnar store index
  – Stored procedures
  – Integrated authentication
  – PowerView integration
  – Workload management
  – LZ/BU redundancy
  – Windows 8
  – SQL Server 2012
  – Hardware refresh