Garrett Edmondson Data Warehouse Architect Blue Granite Inc. [email protected]
Transcript of: Garrett Edmondson, Data Warehouse Architect, Blue Granite Inc.
[email protected]
http://garrettedmondson.wordpress.com/
DW vs. OLTP
• Data Warehouse
  – Scan centric (sequential reads/writes), measured in MB/sec
  – Nonvolatile data – nightly loads
  – Index light: few covering clustered indexes
  – Low concurrency
• OLTP
  – Seek centric (random reads/writes), measured in IOPs
  – Volatile data
  – Index heavy; many heap tables
  – High concurrency
Traditional SQL DW Architecture: Shared Infrastructure
• Enterprise shared SAN storage
• Dedicated network bandwidth
• Data Warehouse speed = THROUGHPUT from storage to CPU
The DW Trinity
• Star Schema (Fact & Dimension)
  – The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition)
• Balanced Architecture
  – Parallel Data Warehouse
• ColumnStore Indexes – SQL 2012 (see the SQL Server Columnstore Index FAQ)
  – "Batch" mode
  – Page compression
Potential Performance Bottlenecks
[Diagram: the IO path from spindle to core – DISK/LUN pairs → storage controller (A/B) cache → FC switch → FC HBA (A/B) → server cache → SQL Server → Windows → CPU cores – with a rate at every hop:]
• Disk feed rate
• LUN read rate
• SP port rate
• Switch port rate
• HBA port rate
• SQL Server read-ahead rate
• CPU feed rate
Balanced System IO Stack
[Diagram: two CPU sockets (4 cores each) at the top of the IO stack]
• SMP scale-up for data growth!!!
  – Complex
  – Costly
  – Obsolete hardware
• Tune every IO interface!
CPU Maximum Consumption Rate (MCR)
• Theoretical max (MCR) – from a cache query
  – SET STATISTICS IO and STATISTICS TIME to ON
  – MAXDOP = 4
  – ( [Logical reads] / [CPU time in seconds] ) * 8KB / 1024 = MCR in MB/sec per core
• Benchmark (BCR) – from disk, not cache
  – DBCC DROPCLEANBUFFERS
  – SET STATISTICS IO and STATISTICS TIME to ON
  – MAXDOP 8+
  – ( [Logical reads] / [CPU time in seconds] ) * 8KB / 1024 = BCR in MB/sec
• Test storage throughput in MB/sec
  – SQLIO
  – Sequential reads and writes are what matter
• Establish real, rather than rated, performance metrics for the key hardware components of the Fast Track reference architecture!
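The MCR/BCR arithmetic above can be checked with a small helper; the logical-read and CPU-time figures below are illustrative, not taken from the deck:

```python
def mcr_mb_per_sec(logical_reads, cpu_seconds, page_kb=8):
    """Maximum Consumption Rate: 8KB pages consumed per CPU-second,
    converted to MB/sec, per the formula on the slide."""
    return (logical_reads / cpu_seconds) * page_kb / 1024

# Hypothetical run: STATISTICS IO reports 1,600,000 logical reads,
# STATISTICS TIME reports 50 seconds of CPU time.
print(mcr_mb_per_sec(1_600_000, 50))  # 250.0 MB/sec per core
```

Divide total CPU time by MAXDOP first if you want the per-core figure from an aggregate measurement.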
LUN Configuration
Fast Track Data Striping
[Diagram: primary data files DB1-1.ndf through DB1-8.ndf striped one per data LUN (ARY01D1v01, ARY01D2v02, ARY02D1v03, ARY02D2v04, ARY03D1v05, ARY03D2v06, ARY04D1v07, ARY04D2v08); the log file DB1.ldf on its own LUN (ARY05v09). Each storage enclosure is built from RAID-10 disk pairs (1&2 + 3&4).]
Evaluating Page Fragmentation
Average Fragment Size in Pages – This metric is a reasonable measure of contiguous page allocations for a table.
Value should be >= 400 for optimal performance
select db_name(ps.database_id) as database_name
    ,object_name(ps.object_id) as table_name
    ,ps.index_id
    ,i.name
    ,cast(ps.avg_fragment_size_in_pages as int) as [Avg Fragment Size In Pages]
    ,ps.fragment_count as [Fragment Count]
    ,ps.page_count
    ,(ps.page_count * 8)/1024/1024 as [Size in GB]
from sys.dm_db_index_physical_stats
    (DB_ID()   -- NULL for all DBs, else run in the context of the DB
    ,OBJECT_ID('dbo.lineitem'), 1, NULL, 'SAMPLED') AS ps   -- DETAILED, SAMPLED, NULL = LIMITED
inner join sys.indexes AS i
    on (ps.object_id = i.object_id AND ps.index_id = i.index_id)
where ps.database_id = db_id()
    and ps.index_level = 0
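The threshold above is simply page_count divided by fragment_count; a tiny helper (with invented DMV numbers, not real output) makes the check explicit:

```python
def avg_fragment_size_ok(page_count, fragment_count, threshold=400):
    """Average fragment size in pages = page_count / fragment_count;
    per the slide, >= 400 contiguous pages is the target for scan speed."""
    return (page_count / fragment_count) >= threshold

# Hypothetical DMV output: a 1,000,000-page table split into 2,000 fragments
print(avg_fragment_size_ok(1_000_000, 2_000))  # True (500 pages per fragment)
```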
(c) 2011 Microsoft. All rights reserved.
Startup flag –E and trace flag -T1117!
SQL Server Configurations
• Sequential scan performance starts with database creation and extent allocation
• Recall that the –E startup option is used
  – Allocates 64 extents at a time (4MB)
• Pre-allocation of user databases is strongly recommended
• Autogrow should be avoided if possible
  – If used, always use 4MB increments
Storage Layout Best Practices for SQL Server
• Create a SQL data file per LUN, for every filegroup
  – Striped using SQL striping
• TempDB filegroups share the same LUNs as other databases
• Log on separate disks, within each enclosure
  – Log may share these LUNs with load files and backup targets
Storage Layout Best Practices for SQL Server
[Diagram: LUN 1, LUN 2, LUN 3 … LUN 16, plus Local Drive 1 and dedicated log LUNs]
• TempDB: TempDB.mdf (25GB), TempDB_02.ndf (25GB), TempDB_03.ndf (25GB) … TempDB_16.ndf (25GB) – one file per LUN
• Permanent DB – Permanent FG: Permanent_1.ndf, Permanent_2.ndf, Permanent_3.ndf … Permanent_16.ndf – one file per LUN; Permanent DB log on Log LUN 1
• Stage Database – Stage FG: Stage_1.ndf, Stage_2.ndf, Stage_3.ndf … Stage_16.ndf – one file per LUN; Stage DB log on its own log LUN
Techniques to Maximize Scan Throughput
• –E startup parameter (4MB extent allocations, no mixed extents)
• Minimize use of nonclustered indexes on fact tables
• Load techniques to avoid fragmentation
  – Load in clustered index order (e.g. date) when possible
• Index creation: always MAXDOP 1, SORT_IN_TEMPDB
• Isolate volatile tables in a separate filegroup
• Isolate staging tables in a separate filegroup or DB
• Periodic maintenance
• Turn on SQL Server compression
Conventional data loads lead to fragmentation
• Bulk inserts into a clustered index using a moderate 'batchsize' parameter
  – Each 'batch' is sorted independently, which causes fragmentation
  – Overlapping batches lead to page splits
[Diagram: pages 1:31–1:40 allocated out of the key order of the index]
Best Practices for loading
• Use a heap
  – Practical if queries need to scan whole partitions
• or… use batchsize = 0
  – Fine if no parallelism is needed during load
• or… use a two-step load:
  1. Load to a staging table (heap)
  2. INSERT…SELECT from the staging table into the target clustered index
  – Resulting rows are not fragmented
  – Can use parallelism in step 1 – essential for large data volumes
Other fragmentation best practices
• Avoid autogrow of filegroups
  – Pre-allocate filegroups to the desired long-term size
  – Manually grow in large increments when necessary
• Keep volatile tables in a separate filegroup
  – Tables that are frequently rebuilt or loaded in small increments
• If historical partitions are loaded in parallel, consider separate filegroups for separate partitions to avoid extent fragmentation
Columnstore Indexes
Storage = Segments + Dictionaries
CPU = Batch Mode
Row Storage Layout

Customers Table
ID  Name  Address  City  State  Bal Due
1   Bob   …        …     …      3,000
2   Sue   …        …     …      500
3   Ann   …        …     …      1,700
4   Jim   …        …     …      1,500
5   Liz   …        …     …      0
6   Dave  …        …     …      9,000
7   Sue   …        …     …      1,010
8   Bob   …        …     …      50
9   Jim   …        …     …      1,300

[Diagram: whole rows stored together on pages; pages grouped into an extent]
Column Storage Layout

Customers Table (same rows as above)
[Diagram: each column – ID, Name, Address, City, State, Bal Due – stored contiguously in its own pages; e.g. the Name column holds Bob, Sue, Ann, Jim, Liz, Dave, Sue, Bob, Jim and the Bal Due column holds 3,000; 500; 1,700; 1,500; 0; 9,000; 1,010; 50; 1,300]

Segment = 1 million row chunks
Run Length Encoding (RLE)

Quarter column (Q1, Q1, Q1, Q1, Q1, Q1, …, Q2, Q2, Q2, Q2, Q2, Q2, Q2, Q2, Q2, …) compresses to:

Quarter  Start  Count
Q1       1      310
Q2       311    290
…        …      …

ProdID column (1, 1, 1, 1, 1, 2, 2, 2, …, 1, 1, 1, 1, 1, 2, 2, 2) compresses to:

ProdID  Start  Count
1       1      5
2       6      3
…       …      …
1       51     5
2       56     3

Price column (100, 120, 315, 100, 315, 198, 450, 320, 320, 150, 256, 450, 192, 184, 310, 251, 266): few repeated runs, so it stays as-is.

RLE compression is applied only when the size of the compressed data is smaller than the original.
xVelocity Store
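The Quarter/ProdID tables above can be produced by a simple run-length encoder; this sketch (using the slide's ProdID values) emits the same (value, start, count) form with a 1-based Start:

```python
def rle_encode(column):
    """Run-length encode a column into (value, start, count) triples,
    mirroring the Quarter/ProdID tables on the slide."""
    runs = []
    start = 0
    for i in range(1, len(column) + 1):
        # Close the current run at end-of-column or when the value changes.
        if i == len(column) or column[i] != column[start]:
            runs.append((column[start], start + 1, i - start))
            start = i
    return runs

prod_id = [1, 1, 1, 1, 1, 2, 2, 2]
print(rle_encode(prod_id))  # [(1, 1, 5), (2, 6, 3)]
```

A real columnstore would additionally compare the encoded size against the raw size and keep whichever is smaller, as the slide notes.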
Dictionary Encoding

Quarter column: Q1, Q1, Q1, Q1, Q2, Q2, …, Q2, Q3, Q3, Q3, Q3, Q4, Q4, Q4, Q4, …

Only 4 distinct values – 2 bits are enough to represent them.

DISTINCT dictionary:
Q.ID  Quarter
0     Q1
1     Q2
2     Q3
3     Q4

Encoded column: 1, 1, 1, 1, 2, 2, …, 2, 3, 3, 3, 3, 4, 4, 4, 4, …

R.L.E. on the encoded column:
Q.ID  Start  Count
1     1      4
2     5      10
3     11     4
4     15     15
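A minimal sketch of the dictionary step described above; the sample quarters are invented for illustration, and IDs are assigned in order of first appearance:

```python
def dictionary_encode(column):
    """Map each distinct value to a small integer ID, as in the
    Quarter example: 4 distinct values need only 2 bits per entry."""
    dictionary = {}
    encoded = []
    for value in column:
        if value not in dictionary:
            dictionary[value] = len(dictionary)  # next unused ID
        encoded.append(dictionary[value])
    return dictionary, encoded

quarters = ["Q1", "Q1", "Q2", "Q2", "Q3", "Q4"]
dictionary, encoded = dictionary_encode(quarters)
print(dictionary)  # {'Q1': 0, 'Q2': 1, 'Q3': 2, 'Q4': 3}
print(encoded)     # [0, 0, 1, 1, 2, 3]
```

The encoded column can then be run-length encoded, exactly as the slide's R.L.E. table shows.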
CPU Architecture: "Batch" Mode
• Modern CPUs have many cores
• Cache hierarchies: RAM, L3, L2, L1
  – Small L1 and L2 per core; L3 shared by socket
  – L1 faster than L2, L2 faster than L3
  – CPUs stall when waiting for caches to load
• Batch mode sizes instructions & data to fit into the L2/L1 cache!
Parallel Data Warehouse Overview
Data Warehouse appliances
A prepackaged or pre-configured, balanced set of hardware (servers, memory, storage and I/O channels), software (operating system, DBMS and management software), service and support, sold as a unit with built-in redundancy for high availability, positioned as a platform for data warehousing.
Control Node
Failover Protection:
• Redundant Control Node
• Redundant Compute Node
• Cluster Failover
• Redundant Array of Inexpensive Databases
Spare Node
Parallel Data Warehouse Appliance – Hardware Architecture
[Diagram: database servers (SQL Server compute nodes) connected to storage nodes over dual Fibre Channel and to each other over dual InfiniBand; control nodes in an active/passive pair; Landing Zone; backup node; spare database server; management servers. Client drivers, the ETL load interface, the corporate backup solution and data center monitoring attach via the corporate network; the appliance internals use a private network.]
PDW Data Example
• Time Dim: Date Dim ID, Calendar Year, Calendar Qtr, Calendar Mo, Calendar Day
• Store Dim: Store Dim ID, Store Name, Store Mgr, Store Size
• Product Dim: Prod Dim ID, Prod Category, Prod Sub Cat, Prod Desc
• MktgCampaign Dim: Mktg Camp ID, Camp Name, Camp Mgr, Camp Start, Camp End
• Sales Facts: Date Dim ID, Store Dim ID, Prod Dim ID, Mktg Camp Id, Qty Sold, Dollars Sold
[Diagram: the star schema laid over the PDW compute nodes]
PDW Data Example
[Diagram: the same star schema – Time Dim, Store Dim, Product Dim and MktgCampaign Dim around Sales Facts. Each compute node holds a copy of every dimension table (PD, TD, MD, SD).]
Smaller dimension tables are replicated on every compute node.
PDW Data Example
[Diagram: the same star schema. Each compute node holds every dimension table (PD, TD, MD, SD) plus one slice of the fact table (SF-1, SF-2, SF-3, SF-4).]
The larger fact table is hash-distributed across all compute nodes.
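The hash distribution above can be sketched in a few lines; the four-node count, the choice of Store Dim ID as the distribution column, and the sample rows are illustrative, not from the deck:

```python
def distribute(rows, key, nodes=4):
    """Hash-distribute fact rows across compute nodes on a
    distribution column, as PDW does with the larger fact table."""
    shards = [[] for _ in range(nodes)]
    for row in rows:
        # The hash of the distribution column decides the owning node.
        shards[hash(row[key]) % nodes].append(row)
    return shards

# Hypothetical Sales Facts rows keyed on Store Dim ID:
sales_facts = [{"store_dim_id": i, "qty_sold": 10 + i} for i in range(1, 9)]
shards = distribute(sales_facts, "store_dim_id")
print([len(s) for s in shards])  # every row lands on exactly one node
```

Because rows with the same key always hash to the same node, joins and aggregations on the distribution column need no data movement.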
SMP vs. MPP
• SMP: Scale-Up
  – Complex – tune every IO interface
  – Cost – exponential
  – Obsolete hardware
• MPP: Scale-Out
  – Simple – buy more processing nodes
  – Cost – linear
  – Keep all hardware investments
SQL Server Parallel Data Warehouse
A quick look at MPP query execution
[Diagram: client → control node → compute nodes 1…N inside the SQL Server PDW appliance]
• The user connects to 'the appliance' as he would to a 'normal' SQL Server, and sends his request
• The control node handles global query execution and generates a distributed execution plan
• The actual user data resides on the compute nodes, and steps of the global execution plan are executed on each compute node
• SQL Server PDW is a shared-nothing MPP system, meaning user data is distributed across the nodes. The Data Movement Service is responsible for moving data around so that individual nodes can satisfy queries that need data from other nodes.
Dealing with Distributions – Shuffling
Example: SELECT [color], SUM([qty]) FROM [Store Sales] GROUP BY [color];

Distributed table [Store Sales]:

Compute Node 1           Compute Node 2
Ss_id  color   qty       Ss_id  color   qty
1      Red     5         2      Red     8
3      Blue    11        4      Blue    10
5      Red     12        6      Yellow  12
7      Green   7

Shuffle Movement: DMS redistributes the data by color values, in parallel.

Temp_1 on Node 1         Temp_1 on Node 2
color   qty              color   qty
Red     5                Blue    11
Red     12               Yellow  12
Red     8                Blue    10
Green   7

Parallel merge and aggregate, then return:

color   qty
Red     25
Blue    21
Green   7
Yellow  12
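The shuffle-then-aggregate flow above can be sketched end-to-end; the node contents follow the slide, while the two-node count and hash routing are illustrative stand-ins for what DMS does:

```python
from collections import defaultdict

def shuffle_and_aggregate(node_tables, nodes=2):
    """Sketch of a PDW shuffle: redistribute (color, qty) rows by
    hashing the GROUP BY column, aggregate locally on each node,
    then merge - as in SELECT color, SUM(qty) ... GROUP BY color."""
    # Shuffle: every row moves to the node that owns its color.
    temp = [defaultdict(int) for _ in range(nodes)]
    for table in node_tables:
        for color, qty in table:
            temp[hash(color) % nodes][color] += qty
    # Merge the per-node partial aggregates; each color lives on one node.
    result = {}
    for partial in temp:
        result.update(partial)
    return result

node1 = [("Red", 5), ("Blue", 11), ("Red", 12), ("Green", 7)]
node2 = [("Red", 8), ("Blue", 10), ("Yellow", 12)]
print(shuffle_and_aggregate([node1, node2]))
# {'Red': 25, 'Blue': 21, 'Green': 7, 'Yellow': 12} (key order may vary)
```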
SQL Server Parallel Data Warehouse
Overall Architecture
[Diagram:
• Control Rack – Control Node (client interface: JDBC, ODBC, OLE-DB, ADO.NET; DMS Manager; PDW Engine), Landing Zone Node (bulk data loader, ETL interface, PDW Agent), Management Node (Active Directory, PDW Agent)
• Data Rack (up to 4) – Compute Nodes 1…10, each running DMS Core and a PDW Agent
Legend: DMS = Data Movement Service; PDW = Parallel Data Warehouse]
SQL Server Parallel Data Warehouse AU3
Release Themes
• BI, Analytics, & ETL Integration – native support for Analysis Services, Reporting Services and PowerPivot
• Performance At Scale – less work for the same results; do the same work more efficiently
• Broader Functionality – lay the foundation for broad connectivity support
• SQL Server Compatibility – full alignment
SQL Server PDW Architecture
How did it work before?
• Problem
  – Basic RDBMS functionality that already exists in SQL Server was re-built in PDW
• Challenge for the PDW AU3 release
  – Can we leverage SQL Server and focus on MPP-related challenges?
[Diagram: Control Node]
SQL Server PDW AU3 Architecture
PDW AU3 Architecture with Shell Appliance and Cost-Based Query Optimizer
• SQL Server runs a 'Shell Appliance'
• Every database exists as an empty 'shell'
  – All objects, no user data
• DDL executes against both the shell and the compute nodes
• Large parts of basic RDBMS functionality are now provided by the shell
  – Authentication and authorization
  – Schema binding
  – Metadata catalog
[Diagram: Control Node – Shell Appliance (SQL Server) and Engine Service – sending plan steps to three Compute Nodes (SQL Server)]
SQL Server PDW AU3 Architecture
PDW AU3 Architecture with Shell Appliance and Cost-Based Query Optimizer
1. User issues a query
2. Query is sent to the Shell through the sp_showmemo_xml stored procedure
   – SQL Server performs parsing, binding, authorization
   – The SQL optimizer generates execution alternatives
3. A MEMO containing candidate plans, histograms and data types is generated
4. A parallel execution plan is generated
5. The parallel plan executes on the compute nodes
6. The result is returned to the user
[Diagram: Control Node – Shell Appliance (SQL Server), Engine Service and MEMO – sending plan steps to three Compute Nodes (SQL Server)]
PDW Cost-Based Optimizer
Optimizer lifecycle…
1. Simplification and space exploration
   – Query standardization and simplification (e.g. column reduction, predicate push-down)
   – Logical space exploration (e.g. join re-ordering, local/global aggregation)
   – Space expansion (e.g. bushy trees – dealing with intermediate resultsets)
   – Physical space exploration
   – Serializing the MEMO into binary XML (logical plans)
   – De-serializing the binary XML into a PDW MEMO
2. Parallel optimization and pruning
   – Injecting data-move operations (expansion)
   – Costing different alternatives
   – Pruning and selecting the lowest-cost distributed plan
3. SQL generation
   – Generating the SQL statements to be executed
PDW Cost-Based Optimizer
… And Cost Model Details
• PDW cost model assumptions:
  – Costing only data movement operations (relational operations excluded)
  – Sequential step execution (no pipelined or independent parallelism)
• Data movement operations consist of multiple tasks
• Each task has a fixed and a variable overhead
• Uniform data distribution assumed (no data skew)
PDW Sales Test Workload: AU2 to AU3
• 5x improvement in terms of total elapsed time, out of the box
[Chart: elapsed time in seconds (0–80) for queries 1–39, AU2 vs. AU3]
Theme: Performance at Scale
Zero data conversions in data movement

Goal
• Eliminate CPU utilization spent on data conversions
• Further parallelize operations during data moves

Functionality
• Using ODBC instead of ADO.NET for reading and writing data
• Minimizing appliance resource utilization for data moves

Benefits
• Better resource (CPU) utilization
• 6x or more faster move operations
• Increased concurrency
• Mixed workload (loads + queries)

[Chart: DMS CPU utilization (%) on TPC-H queries Q1–Q22, AU2 vs. AU3]
[Chart: throughput improvement (0%–600%) for data movements – Broadcast, Trim, Replicate, Shuffle, Replicated Table Load]
Theme: SQL Server Compatibility
SQL Server Security and Metadata

Security
• SQL Server security syntax and semantics
• Supporting users, roles and logins
• Fixed database roles
• Allows script re-use
• Allows well-known security methods

Metadata
• PDW metadata stored in SQL Server
• Existing SQL Server metadata tables/views (e.g. security views)
• PDW distribution info as extended properties in SQL Server metadata
• Existing means and technology for persisting metadata
• Improved 3rd-party tool compatibility (BI, ETL)
Theme: SQL Server Compatibility
Support for SQL Server (Native) Client

Goal
• 'Look' just like a normal SQL Server
• Better integration with other BI tools

Functionality
• Use existing SQL Server drivers to connect to SQL Server PDW
• Implement the SQL Server TDS protocol
• Named parameter support
• SQLCMD connectivity to PDW

Benefits
• Use known tools and a proven technology stack
• Existing SQL Server 'eco-system'
• 2x performance improvement for return operations
• 5x reduction of connection time

[Diagram: SQL PDW clients (ODBC, OLE-DB, ADO.NET) connecting via SequeLink (Server: 10.217.165.13, 17001), and SQL Server clients (ADO.NET, ODBC, OLE-DB, JDBC) connecting via TDS (Server: 10.217.165.13, 17000)]
Theme: SQL Server Compatibility
Stored Procedure Support (Subset)

Goal
• Support common scenarios of code encapsulation and reuse in Reporting and ETL

Functionality
• System and user-defined stored procedures
• Invocation using RPC or EXECUTE
• Control-flow logic, input parameters

Benefits
• Enables common logic re-use
• Big impact for Reporting Services scenarios
• Allows porting existing scripts
• Increases compatibility with SQL Server

Syntax
CREATE { PROC | PROCEDURE } [dbo.]procedure_name
    [ { @parameter data_type } [ = default ] ] [ ,...n ]
AS { [ BEGIN ] sql_statement [;] [ ...n ] [ END ] } [;]

ALTER { PROC | PROCEDURE } [dbo.]procedure_name
    [ { @parameter data_type } [ = default ] ] [ ,...n ]
AS { [ BEGIN ] sql_statement [;] [ ...n ] [ END ] } [;]

DROP { PROC | PROCEDURE } { [dbo.]procedure_name } [;]

[ { EXEC | EXECUTE } ] { { [database_name.][schema_name.]procedure_name } [ { value | @variable } ] [ ,...n ] } [;]

{ EXEC | EXECUTE } ( { @string_variable | [ N ]'tsql_string' } [ + ...n ] ) [;]

Unsupported Functionality
• Stored proc nesting
• Output params
• Return
• Try-Catch
Theme: SQL Server Compatibility
Collations

Goal
• Support local and international data

Functionality
• Fixed server-level collation
• User-defined column-level collation
• Supporting all Windows collations
• Allow COLLATE clauses in queries and DML

Benefits
• Store all the data in PDW with additional querying flexibility
• Existing T-SQL DDL and query scripts
• SQL Server alignment and functionality

Syntax
CREATE TABLE T (
    c1 varchar(3) COLLATE traditional_Spanish_ci_ai,
    c2 varchar(10) COLLATE …)

SELECT c1 COLLATE Latin1_General_Bin2 FROM T

SELECT * FROM T ORDER BY c1 COLLATE Latin1_General_Bin2

Unsupported Functionality
• Cannot specify DB collation during DB creation
• Cannot alter column collations for existing tables
Theme: Improved Integration
SQL Server PDW Connectors

Connector for Hadoop
• Bi-directional (import/export) interface between MSFT Hadoop and PDW
• Delimited-file support
• Adapter uses existing PDW tools (bulk loader, dwsql)
• Low-cost solution that handles all the data: structured and unstructured
• Additional agility, flexibility and choice

Connector for Informatica
• Connector providing PDW source and target (mappings, transformations)
• Informatica uses the PDW bulk loader for fast loads
• Leverage existing toolset and knowledge

Connector for Business Objects
Agenda
• Trends in the DW space
• How does SQL Server PDW fit in?
• SQL Server PDW AU3 – What's new?
• Building BI solutions with SQL Server PDW
  – Customer successes
  – Using SQL Server PDW with Microsoft BI solutions
  – Using SQL Server PDW with third-party BI solutions
  – BI solutions leveraging Hadoop integration
• What's coming next in SQL Server PDW?
PDW Retail POS Workload
Original customer SMP solution vs. PDW AU3 (with cost-based query optimizer)
[Chart: elapsed time in seconds (0–1600) for queries Q1–Q7, old SMP POS ODS vs. AU3]
Customer Successes
How are customers using PDW & BI?

CUSTOMER EXAMPLE: Stock Exchange in the US

Data Volume
• 80 TB data warehouse analyzing data from exchanges
• Existing system based on a SQL SMP farm
  – 2 different clusters of 6 servers each

Requirements
• Linear scalability with additional hardware
• Support hourly loads with SSIS – 300GB/day
• BI integration: SSRS, SSAS and PowerPivot

AU3 Feedback
• SP and increased T-SQL support was great
• Migrating SMP SSRS to PDW was painless
• 142x for scan-heavy queries & no summary tables
• Enabled queries that do not run on the existing system

[Diagram: operational DBs → ETL → PDW → reports, dashboards, scorecards, portal]
Role of PDW within the BI stack
• PDW's role as a fast 'data hub'
• Fast, parallel feeding of data marts (DMs) via InfiniBand
  – CREATE REMOTE TABLE AS SELECT
• Aggregation abilities avoid ETL overhead in existing systems
  – No need for indexes
  – No need to maintain indexed/materialized views (summary tables)
[Diagram: PDW feeding data marts over InfiniBand; SSAS/SSRS and 3rd-party BI attached over GBit links]
SSAS with SQL Server PDW
Understanding the differences compared to the 'SMP world'

Specific to PDW
• PDW does not support foreign-key constraints
• The shared-nothing model requires careful data design and retrieval planning
• Design cubes for parallel processing – via the MOLAP & ROLAP storage models

Specific to the nature of large data
• Parallel cube processing/deployment has its limits
  – Be cautious about parallel loads of SSAS – query timeout settings
• Query design is crucial – only include required data
  – BI tools traditionally were not designed for handling huge amounts of data
New Challenges for Business Analytics
• Huge amounts of data born 'unstructured'
• Increasing demand for (near) real-time business analytics
• Pre-filtering of important from less relevant raw data required

Applications
• Sensor networks & RFID
• Social networks & mobile apps
• Biological & genomics

[Diagram: sensor/RFID data, blogs, docs and web data flowing into HADOOP, which acts as fast ETL processing, an active archive, a fast refinery and cost-optimal storage]
Hadoop as a Platform Solution
In the context of ETL, BI, and DW
• Platform to accelerate ETL processes (not competing with current ETL software tools!)
• Flexible and fast development of 'hand-written' refining requests against raw data
• Active & cost-effective data archive to let (historical) data 'live forever'
• Co-existence with a relational DW (not completely replacing it!)
Importing HDFS data into PDW for advanced BI
[Diagram: sensor/RFID data, blogs, docs and web data land in HADOOP; SQOOP moves data into SQL Server PDW for interactive BI/data visualization. Actors: application programmers, DBMS admins, power BI users.]
Hadoop – PDW Integration via SQOOP (export)
[Diagram: SQOOP export is invoked with a source (HDFS path) and a target (PDW DB & table). On the Linux/Hadoop side, mappers read the HDFS data (step 2); on the Windows/PDW side, an FTP server copies the incoming data onto the Landing Zone (step 1), a Telnet server invokes 'DWLoader' (step 4), and the data flows through the Control Node to Compute Nodes 1…8. The PDW Hadoop Connector is driven by a PDW configuration file.]
SQL Server PDW Roadmap
What is coming next?

CALENDAR YEAR 2011
• Appliance Update 1 (shipped)
  – Improved node manageability
  – Better performance and reduced overhead
  – OEM requests
• Appliance Update 2 (shipped)
  – Programmability: batches, control flow, variables, temp tables
  – QDR InfiniBand switch
  – Onboard Dell
• Appliance Update 3 (shipped)
  – Cost-based optimizer
  – Native SQL Server drivers, including JDBC
  – Collations
  – More expressive query language
  – Data Movement Services performance
  – SCOM pack
  – Stored procedures (subset)
  – Half-rack
  – 3rd-party integration (Informatica, MicroStrategy, Business Objects, HADOOP)

CALENDAR YEAR 2012
• V-Next
  – Columnar store index
  – Stored procedures
  – Integrated authentication
  – PowerView integration
  – Workload management
  – LZ/BU redundancy
  – Windows 8
  – SQL Server 2012
  – Hardware refresh