Post on 19-Jan-2016
Strategic sponsors
Silver sponsors
PolyBase – data beyond tables
Hubert Kobierzewski

• BI Consultant at Codec-dss (aka Codec Systems) – over 8 years
• Specialized in: Data Warehousing, ETL processes and Business Intelligence
• Ex-Developer
• MS SQL Server certified (MCDBA, MCTS, MCITP, MCSE – BI, ex-MCT)
• Member of Data Platform Advisors (internal MS group)
• Co-leader of the Warsaw PLSSUG Chapter
What is Big Data and why is it valuable to the business?
Evolution in the nature and use of data in the enterprise: as volumes grow from megabytes to petabytes and data complexity (variety and velocity) increases, so does the value to the business, moving through successive stages of analysis:
• Historical analysis
• Insight analysis
• Predictive analytics
• Predictive forecasting
Hadoop (some elements relevant in this presentation)
• HDFS – a distributed, scalable, fault-tolerant file system
• MapReduce – a framework for writing fault-tolerant, scalable, distributed applications
• Hive – a relational DBMS that stores its tables in HDFS and uses MapReduce as its target execution language
• Sqoop – a library and framework for moving data between HDFS and a relational DBMS
The traditional approach is to move data out of HDFS (Hadoop) into the warehouse via ETL before analysis. That forces teams to learn new skills beyond T-SQL, and to build, integrate, manage, maintain and support a separate Hadoop ecosystem – a steep learning curve that is slow and inefficient. Hadoop alone is not the answer to all Big Data challenges, particularly for the "new" data sources: devices, web, sensors, social media.
PolyBase in the Modern Data Warehouse – Background
Research done by the Gray Systems Lab, led by Technical Fellow David DeWitt.
High-level goals for PolyBase:
• Seamless integration with Hadoop via regular T-SQL
• Enhancing the MPP query engine to process data coming from the Hadoop Distributed File System (HDFS)
• Fully parallelized query processing for high-performance data import and export from HDFS
• Integration with various Hadoop implementations: Hadoop on Windows Server, Hortonworks, and Cloudera
Prerequisites for installing PolyBase
• 64-bit SQL Server Evaluation edition
• Microsoft .NET Framework 4.0
• Oracle Java SE Runtime Environment (JRE) version 7.51 or higher (NOTE: Java JRE version 8 does not work)
• Minimum memory: 4 GB
• Minimum hard disk space: 2 GB
Using the installation wizard for PolyBase
• Run SQL Server Installation Center (insert the SQL Server installation media and double-click Setup.exe).
• Click Installation, then click New standalone SQL Server installation or add features.
• On the Feature Selection page, select PolyBase Query Service for External Data.
• On the Server Configuration page, configure the PolyBase Engine Service and the PolyBase Data Movement Service to run under the same account.
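After installation, PolyBase must also be told which Hadoop distribution it will talk to. A minimal sketch using sp_configure; the connectivity value shown is an assumption (the value-to-distribution mapping varies by SQL Server build), so check the documentation for your environment:

-- Enable PolyBase connectivity to a specific Hadoop distribution.
-- The value 7 is illustrative; it maps to Hortonworks on Linux /
-- Azure blob storage in SQL Server 2016 builds.
EXEC sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
RECONFIGURE;
-- Restart SQL Server and both PolyBase services for the change to take effect.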
External tables

CREATE EXTERNAL TABLE table_name
    ( {<column_definition>} [ ,...n ] )
WITH (
    DATA_SOURCE = <data_source>,      -- references an external data source
    FILE_FORMAT = <file_format>,      -- references an external file format
    LOCATION = '<file_path>',         -- path of the Hadoop file/folder
    [ REJECT_VALUE = <value> ], …     -- (optional) reject parameters
) [;]

• Internal representation of data residing outside of the appliance
• Supports a wide array of data types
  o Excluding text, ntext and similar, but including binary and varbinary
• SQL permissions
  o CREATE TABLE and ALTER ANY SCHEMA
  o ALTER ANY EXTERNAL DATA SOURCE
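A concrete sketch of the syntax above. The data source ds_hdp and file format ff_pipe_text are hypothetical names (assumed to exist already), and the HDFS path is a placeholder; the column list mirrors the Twitter sentiment data shown later in this deck:

-- ds_hdp and ff_pipe_text are assumed, pre-existing objects.
CREATE EXTERNAL TABLE Twitter_Table (
    [User]    varchar(50),
    Product   varchar(50),
    Sentiment int
)
WITH (
    LOCATION = '/social_media/twitter/',  -- HDFS folder (placeholder path)
    DATA_SOURCE = ds_hdp,
    FILE_FORMAT = ff_pipe_text,
    REJECT_VALUE = 10                     -- tolerate up to 10 unparseable rows
);

Once created, the table can be queried with regular T-SQL (SELECT, JOIN, etc.) as if it were a local table.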
External data sources

CREATE EXTERNAL DATA SOURCE datasource_name
WITH (
    TYPE = <data_source_type>,                    -- type of external data source
    LOCATION = '<location>',                      -- location of the external data source
    [ JOB_TRACKER_LOCATION = '<jt_location>' ]    -- enabling/disabling of MapReduce job generation
) [;]

• Internal representation of an external data source
  o Supports Hadoop as a data source and Windows Azure Blob Storage (WASB, formerly known as ASV)
• Enabling and disabling of split-based query processing
  o Generation of MapReduce jobs on the fly [fully transparent to the end user]
• ALTER ANY EXTERNAL DATA SOURCE permission required
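A sketch of the data-source syntax pointing at a hypothetical Hadoop name node; the host, ports and the name ds_hdp are placeholders. Note that newer SQL Server 2016 builds name the job-tracker option RESOURCE_MANAGER_LOCATION instead:

CREATE EXTERNAL DATA SOURCE ds_hdp
WITH (
    TYPE = HADOOP,                                -- Hadoop as the source type
    LOCATION = 'hdfs://10.193.27.52:8020',        -- name node address (placeholder)
    JOB_TRACKER_LOCATION = '10.193.27.52:8032'    -- omit to disable MapReduce push-down
);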
External file format

CREATE EXTERNAL FILE FORMAT fileformat_name
WITH (
    FORMAT_TYPE = <type>,                      -- type of the external file format
    [ SERDE_METHOD = '<serde_method>' ],       -- (de)serialization method [Hive RCFile]
    [ DATA_COMPRESSION = '<compr_method>' ],   -- compression method
    [ FORMAT_OPTIONS (<format_options>) ]      -- (optional) format options [text files]
) [;]

• Internal representation of an external file format
  o Supports delimited text files, Hive RCFiles and Hive ORC
• Enabling and disabling of split-based query processing
  o Generation of MapReduce jobs on the fly
• ALTER ANY EXTERNAL FILE FORMAT permission required
Format options for delimited text files

<format_options> ::=
    [ , FIELD_TERMINATOR = 'Value' ]     -- indicates the column delimiter
    [ , STRING_DELIMITER = 'Value' ]     -- specifies the delimiter for string data type fields
    [ , DATE_FORMAT = 'Value' ]          -- specifies a particular date format
    [ , USE_TYPE_DEFAULT = 'Value' ]     -- specifies how missing entries in text files are treated
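Putting the file-format syntax and the text-file format options together, a sketch for a pipe-delimited text file; the name ff_pipe_text and all option values are illustrative choices, not requirements:

CREATE EXTERNAL FILE FORMAT ff_pipe_text
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',        -- columns are pipe-separated
        STRING_DELIMITER = '"',        -- string fields are quoted
        DATE_FORMAT = 'MM/dd/yyyy',    -- expected date layout in the files
        USE_TYPE_DEFAULT = TRUE        -- substitute type defaults for missing entries
    )
);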
HDFS file / directory:
//hdfs/social_media/twitter
//hdfs/social_media/twitter/Daily.log

Hadoop data is exposed with column filtering, row filtering and dynamic binding:

User    Location  Product  Sentiment  Rtwt  Hour  Date
Sean    CA        xbox     -1         5     2     1-8-14
Suz     WA        xbox      0         0     2     1-8-14
Audie   CO        excel     1         0     2     1-8-14
Tom     IL        sqls      1         8     2     1-8-14
Sanjay  MN        wp8       1         0     1     1-8-14
Roger   TX        ssas      1         0     23    1-8-14
Steve   AL        ssrs      1         0     23    1-7-14
PolyBase – Predicate pushdown

SELECT [User], Product, Sentiment
FROM Twitter_Table
WHERE Hour = Current - 1
  AND Date = Today
  AND Sentiment <= 0

Query Capabilities – Push-Down Computation
Compute is pushed either at the data source level, or on a per-query basis using new query hints:

SELECT DISTINCT C.FirstName, C.LastName, C.MaritalStatus
FROM Insurance_Customer_SQL -- table in SQL Server
…
OPTION (FORCE EXTERNALPUSHDOWN) -- push-down computation

CREATE EXTERNAL DATA SOURCE ds_hdp
WITH ( TYPE = HADOOP,
       LOCATION = 'hdfs://10.193.27.52:8020',
       RESOURCE_MANAGER_LOCATION = '10.193.27.52:8032' );
PolyBase Demo
Use cases where PolyBase simplifies using Hadoop data
Bringing islands of Hadoop data together
Running queries against Hadoop data
Archiving data warehouse data to Hadoop (move)
Exporting relational data to Hadoop (copy)
Importing Hadoop data into a data warehouse (copy)
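For the copy scenarios above, a sketch of export and import through an external table. Twitter_Archive, Twitter_Staging and Twitter_Local are placeholder names, and exporting to Hadoop in SQL Server 2016 additionally assumes the 'allow polybase export' configuration option is enabled:

-- Archive/export: copy warehouse rows into Hadoop through the external table
INSERT INTO Twitter_Archive
SELECT [User], Product, Sentiment
FROM dbo.Twitter_Staging;

-- Import: materialize Hadoop data inside the warehouse for repeated fast queries
SELECT *
INTO dbo.Twitter_Local
FROM Twitter_Archive;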
HDInsight – the MPP engine's integration method without PolyBase
[Diagram: the MPP DWH engine (a control node plus compute nodes) reaches the Hadoop cluster (a name node plus data nodes) through a Sqoop-based connector.]
HDInsight – the MPP engine's integration method with PolyBase
[Diagram: the control node and compute nodes of the MPP DWH engine each run a Data Movement Service (DMS) that reads the Hadoop cluster's data nodes directly, in parallel, with no Sqoop connector in between.]
Major Competitors
• Oracle since version 9i (ca. 2003)
• IBM PureData System
• Pivotal Greenplum
• Oracle BDA (Big Data Appliance)
Read and watch more…
• MSDN documentation: https://msdn.microsoft.com/en-ie/library/mt163689.aspx
• Brief introduction on Channel 9: https://channel9.msdn.com/Shows/Data-Exposed/PolyBase-in-SQL-Server-2016
• SQL Server blog on PolyBase in APS: http://blogs.technet.com/b/dataplatforminsider/archive/tags/polybase/
Questions