Microsoft Data Integration Pipelines: Azure Data Factory and SSIS
Automate pipelines: Tools for building ETL pipelines
Mark Kromer
Microsoft Sr. Program Manager
Information Management
@kromerbigdata
ETL Pipeline Objectives
• Consume hybrid disparate data (Extract)
  • On-prem + cloud
  • Structured, semi-structured, unstructured
• Calculate and format data for analytics (Transform)
  • Transform, aggregate, join, normalize
• Address large-scale Big Data requirements (Scale / Load)
  • Scale-up or scale-out data movement and transformation
• Operationalize (Automate)
  • Create pipeline orchestrations for different org requirements
  • Manage & monitor multiple pipelines
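The extract-transform-load objectives above can be sketched as a minimal routine (illustrative Python, not tied to any SSIS or ADF API; all source, table, and field names are made up for illustration):

```python
# Minimal illustrative ETL sketch: extract from two hypothetical hybrid
# sources, join and aggregate (transform), then load into a target list.
# All names and data are invented for illustration.

def extract():
    # Hybrid sources: an "on-prem" relational table and "cloud" records
    on_prem_orders = [
        {"order_id": 1, "customer": "alice", "amount": 120.0},
        {"order_id": 2, "customer": "bob", "amount": 75.5},
        {"order_id": 3, "customer": "alice", "amount": 30.0},
    ]
    cloud_customers = [{"customer": "alice", "region": "US"},
                       {"customer": "bob", "region": "EU"}]
    return on_prem_orders, cloud_customers

def transform(orders, customers):
    # Join orders to customers, then aggregate total spend per region
    region_of = {c["customer"]: c["region"] for c in customers}
    totals = {}
    for o in orders:
        region = region_of[o["customer"]]
        totals[region] = totals.get(region, 0.0) + o["amount"]
    return totals

def load(totals, target):
    # "Load" into an analytics-ready target (here, just a list of rows)
    for region, amount in sorted(totals.items()):
        target.append({"region": region, "total": amount})
    return target

warehouse = load(transform(*extract()), [])
```

A real pipeline would swap the in-memory lists for connectors and add the orchestration, scheduling, and monitoring the rest of this deck covers.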
Hybrid Data Integration Pattern 1: Analyze blog comments
Azure SQL Database
SQL Server
(on-premises)
Data Management Gateway required for ADF
Azure Data Factory (PaaS)
Capture blog comments via API
Drop into Blob Store
Copy & lookup
Power BI Dashboard
Visualize and analyze
SSIS (self-hosted)
Transform via SPROC (ELT)
Transform via Dataflow (ETL)
Hybrid Data Integration Pattern 2: Sentiment Analysis with Machine Learning
Azure Data Factory
Power BI
Blob Storage
Azure Functions
Hybrid Data Integration Pattern 3: Modern Data Warehouse
Daily flat files
OLTP DB Tables
Analytical Schemas
AML: Churn Model
Customer Call Details
Azure Data Factory (PaaS)
SSIS (self-hosted)
Social Media (un-/semi-structured)
SQL Server Integration Services (SSIS)
SSIS is a platform for building enterprise-grade data integration solutions
User-friendly, code-free authoring / management client tools: SQL Server Data Tools (SSDT) and SQL Server Management Studio (SSMS)
Wealth of connectors + rich transformations to Extract, Transform, and Load (ETL) data between various sources and destinations, on premises and in the cloud
Low Total Cost of Ownership (TCO)
Azure Feature Pack
Provides Azure connectivity components for SSIS
1. Move / transfer hybrid data between various sources and destinations, on premises and in Azure
2. Develop ELT workflows with Big Data transformations / processing in Azure
3. Combine traditional ETL and modern ELT workflows spanning on-premises and Azure
SSIS Azure Feature Pack Features
The SSIS Azure Feature Pack contains:
1. Connection Managers
   1. Azure Subscription Connection Manager
   2. Azure Storage Connection Manager
   3. ADLS Connection Manager (NEW)
2. Control Flows / Tasks
   1. Azure Blob Upload / Download Tasks
   2. Azure HDInsight Hive / Pig Tasks
   3. Azure HDInsight Create / Delete Cluster Tasks
   4. Azure SQL DW Upload Task (NEW)
3. Data Flows
   1. Azure Blob Source / Destination
   2. ADLS Source / Destination (NEW)
4. Azure Blob Enumerator
SQL DW + SSIS – Quick intro
SQL DW is Microsoft’s scale-out database in the cloud.
Built on a Massively Parallel Processing (MPP) architecture, it is capable of processing huge volumes of relational and non-relational data.
It divides data and processing capability across multiple nodes: a Control Node receives, optimizes, and distributes requests to Compute Nodes that work in parallel.
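The divide-and-conquer idea can be sketched in a few lines: rows are routed to compute nodes by hashing a distribution key, each node aggregates its own slice, and the control node merges the partial results (a toy Python sketch; SQL DW's real scheme, with 60 internal distributions, is far more sophisticated):

```python
# Toy sketch of MPP-style hash distribution: rows are routed to compute
# nodes by a hash of the distribution key, each "node" aggregates its own
# slice, and the "control node" merges partial results.
NUM_NODES = 4  # illustrative; real SQL DW uses 60 internal distributions

def node_for(key):
    # Simple explicit hash for reproducibility (Python's built-in hash()
    # is not stable across runs for strings)
    return sum(ord(ch) for ch in key) % NUM_NODES

rows = [("alice", 10), ("bob", 20), ("alice", 5), ("carol", 7)]

# Control node scatters rows to compute nodes
slices = {n: [] for n in range(NUM_NODES)}
for key, value in rows:
    slices[node_for(key)].append((key, value))

# Each compute node aggregates its slice in parallel (simulated serially)
partials = []
for n in range(NUM_NODES):
    local = {}
    for key, value in slices[n]:
        local[key] = local.get(key, 0) + value
    partials.append(local)

# Control node merges partials; a given key always lands on one node,
# so the merge is a simple union
result = {k: v for part in partials for k, v in part.items()}
```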
There are two ways to load data into SQL DW using SSIS:
• Front-loading through the Control Node with data flows
• Back-loading through the Compute Nodes with PolyBase
Azure SQL DW Upload Task – Typical scenario
[Diagram: SQL Server (on-premises) → flat file on SSIS machine → Azure Blob Storage → SQL DW (cloud)]
0. Export to a flat file
1. Upload to Azure Blob
2. Create an external table
3. Trigger PolyBase to load data
The Azure SQL DW Upload Task automates steps 1 – 2 – 3 above.
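For steps 2–3, the PolyBase side boils down to defining an external table over the staged blob file and then loading from it. A rough sketch of that kind of T-SQL, built as a string in Python (table, data-source, and file-format names are illustrative, not the task's exact output):

```python
# Sketch of the PolyBase-side T-SQL for steps 2-3: define an external table
# over the staged blob file, then load it into the destination table.
# All object names (ext.StagedSales, AzureBlobStage, etc.) are illustrative.
def build_polybase_script(ext_table, dest_table, blob_dir):
    return f"""
CREATE EXTERNAL TABLE {ext_table} (
    CustomerId INT,
    Amount     DECIMAL(18, 2)
)
WITH (
    LOCATION = '{blob_dir}',
    DATA_SOURCE = AzureBlobStage,      -- pre-created external data source
    FILE_FORMAT = Utf8DelimitedText    -- pre-created external file format
);

INSERT INTO {dest_table}
SELECT CustomerId, Amount FROM {ext_table};
""".strip()

script = build_polybase_script("ext.StagedSales", "dbo.Sales", "/staging/sales/")
```

The Upload Task generates a script of this general shape for you; as noted later in the deck, you can still edit the generated T-SQL by hand.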
Azure SQL DW Upload Task
In the Azure SQL DW Upload Task Editor, you can:
1. Name and describe a create / insert table task
2. Select and configure UTF-8-encoded text file(s) as your data source
3. Select and configure an Azure Storage Connection Manager + a new / existing blob container as your data staging area
4. Select and configure an ADO.NET Connection Manager for SQL DW + a new / existing table as your data destination
5. Map source and destination columns for the create / insert table task
6. Define metadata / data types for source columns
Azure SQL DW Upload Task
After you complete the configuration in the Azure SQL DW Upload Task Editor, a T-SQL script that triggers PolyBase to load data from your Azure Blob Storage into SQL DW is generated automatically.
You can still manually edit this auto-generated T-SQL script to customize it for your particular needs.
RAW DATA, DATA CHAOS → REFINED, ORGANIZED DATA → DATA CLARITY, BETTER DECISIONS
DATA MOVEMENT → DATA TRANSFORMATIONS → BUSINESS INTELLIGENCE
AZURE DATA FACTORY: HYBRID DATA INTEGRATION AT SCALE
Customer Profitability
Sentiment Analysis
Market Basket Analysis
Machine Learning, Big Data Analytics, SQL, NoSQL, Data Warehouse, Data Lake
ADF: Orchestrate data services at scale with a fully managed data integration cloud service
[Diagram: DATA SOURCES (SQL, NoSQL, files) → INGEST → PREPARE → TRANSFORM & ANALYZE]
• Create, schedule, orchestrate, and manage data pipelines
• Visualize data lineage
• Connect to on-premises and cloud data sources
• Monitor data pipeline health
• Automate cloud resource management
• Move relational data for Hadoop processing
• Transform with Hive, Pig, PySpark, SQL SPROC or custom code
ELT with Apache Spark Activity from ADF Pipeline
• Create a new pipeline and an HDI cluster Linked Service for Spark from the Azure Portal
• Invoke a Python script from the Spark activity to transform data at scale
• Schedule, monitor, and manage the pipeline from ADF
• Verify results and perform analytics from Jupyter notebooks / Power BI
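The shape of such a pipeline definition can be sketched as JSON. Property names below follow the general ADF activity pattern but are illustrative approximations, not guaranteed to match the exact schema; the linked service and script paths are hypothetical:

```python
import json

# Illustrative sketch of an ADF pipeline with a Spark activity that invokes
# a Python script on an HDInsight cluster. Property names approximate the
# ADF activity pattern; linked service and paths are invented examples.
pipeline = {
    "name": "SparkTransformPipeline",
    "properties": {
        "activities": [
            {
                "name": "TransformWithSpark",
                "type": "HDInsightSpark",
                "linkedServiceName": "MyHDInsightLinkedService",   # hypothetical
                "typeProperties": {
                    "rootPath": "adfspark",                  # storage folder
                    "entryFilePath": "scripts/transform.py"  # script to run
                }
            }
        ]
    }
}

definition = json.dumps(pipeline, indent=2)
```

From here, ADF handles the scheduling and monitoring described above.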
ELT with Azure Data Lake from ADF Pipeline
• Create a new pipeline and an Azure Data Lake Analytics Linked Service from the Azure Portal
• Perform data transformations at scale with a U-SQL script
• Schedule, monitor, and manage the pipeline from ADF
• Verify results and perform analytics from ADLA
Microsoft Data Integration Roadmap
• SSIS
  • SQL Server 2017
  • SSIS on Linux
  • Scale-out
• ADF
  • SSIS as a cloud “Integration Runtime”
  • Code-free web-based user experience
  • Control Flow orchestration + Data Flow steps
  • On-demand Spark cluster