SQL Server 2019 Big Data Clusters
The task of getting insights from ever-increasing data is tough
Barriers to insights are barriers to success
SQL Server 2019
Intelligence over all your data
Analyzing all data: build intelligent apps and AI with all your data
Managing all data: easily and securely manage data big and small
Integrating all data: unified access to all your data with unparalleled performance
Integrating all data
Two methods for data integration
Data virtualization: integrates data at query time, without replicating or moving the data
Data movement ("ETL/ELT"): moves or creates a copy of the data from one source to another to make it available for a query
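The difference between the two methods can be sketched with Python's built-in sqlite3 module. This is only an illustration, not how SQL Server implements either method: the attached in-memory database stands in for a remote source, and the table names are made up for the example.

```python
import sqlite3

# An attached in-memory database plays the role of a remote source (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS remote")
conn.execute("CREATE TABLE remote.orders (id INTEGER, amount REAL)")
conn.execute("INSERT INTO remote.orders VALUES (1, 10.0), (2, 25.0)")

# Data movement: copy the rows into a local table (a tiny ETL step).
conn.execute("CREATE TABLE main.orders_copy AS SELECT * FROM remote.orders")

# A new row arrives at the source after the copy was made.
conn.execute("INSERT INTO remote.orders VALUES (3, 5.0)")


def total(table):
    """Sum the amount column of the given table."""
    return conn.execute(f"SELECT SUM(amount) FROM {table}").fetchone()[0]


copied_total = total("main.orders_copy")  # reads the stale local copy
virtual_total = total("remote.orders")    # reads the source at query time
print(copied_total, virtual_total)
```

The copied table misses the late row until the pipeline runs again, while the query-time read sees it immediately.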
Data movement is a barrier to insights
Costs: duplicated storage costs; engineering effort to build and maintain data pipelines
Speed: delays while data is integrated before it can be used; increased data latency
Security: increased attack surface area; inconsistent security models
Quality: data quality issues can be created by ETL pipelines
Compliance: increased governance issues
[Chart: survey responses of 19%, 5%, and 76%]
3/4 of respondents say that untimely data has inhibited business opportunities*
*IDC 3rd Platform Information Management Requirements Survey, Oct 2016
Data virtualization creates solutions
Costs: lower storage costs; less dev time spent on integration
Speed: rapid iterations and prototypes; timely data
Security: smaller attack surface area; consistent security model
Quality: fresh and accurate data
Compliance: easier data governance
SQL Server 2019 is the hub for integrating data
[Diagram: Analytics and Apps use T-SQL against SQL Server; PolyBase external tables connect SQL Server to ODBC, NoSQL, relational database, and big data sources]
Scale out compute for performance
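As a rough illustration of what a PolyBase external table definition looks like, the Python helper below assembles the two T-SQL statements involved. The helper itself, the data source name, host, credential, and table names are all hypothetical, and the exact LOCATION format depends on the source type, so treat this as a sketch to check against the product documentation rather than a ready-to-run script.

```python
def external_table_ddl(source_name, location, credential, table, columns, remote_object):
    """Build T-SQL for an external data source and an external table over it.

    All argument values are supplied by the caller; nothing here is validated
    or escaped, so this is for illustration only.
    """
    cols = ",\n    ".join(f"[{name}] {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL DATA SOURCE [{source_name}]\n"
        f"WITH (LOCATION = '{location}', CREDENTIAL = [{credential}]);\n"
        f"CREATE EXTERNAL TABLE {table} (\n    {cols}\n)\n"
        f"WITH (LOCATION = '{remote_object}', DATA_SOURCE = [{source_name}]);"
    )


# Hypothetical Oracle source and table names.
ddl = external_table_ddl(
    source_name="OracleSales",
    location="oracle://oracle.example.com:1521",
    credential="OracleCredential",
    table="dbo.Orders",
    columns=[("order_id", "INT"), ("amount", "DECIMAL(10, 2)")],
    remote_object="XE.SALES.ORDERS",
)
print(ddl)
```

Once such a table exists, applications query dbo.Orders with ordinary T-SQL and the data stays in the source system.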
Unified relational and big data using data virtualization
Apache Hadoop Distributed File System (HDFS) storage and Spark are built into SQL Server big data clusters
HDFS provides a scale-out storage tier for storage capacity of 1,000x a traditional SQL Server
Spark provides a scale-out compute tier for data preparation, large data queries, and machine learning tasks
[Diagram: the SQL Server master instance uses T-SQL and PolyBase external tables to reach the storage pool; each storage pool node runs SQL Server, Spark, and an HDFS data node]
HDFS Tiering: unified data lakes, caching for faster queries
Directories from Azure Data Lake or CDH/HDP HDFS can be mounted as virtual directories in the big data cluster HDFS
SQL Server and Spark can virtually access files in remote file systems just as they do files in the big data cluster HDFS
Data is seamlessly cached from the remote file system for faster query performance by bringing the data to the compute
[Diagram: the SQL Server master instance uses T-SQL and PolyBase external tables to reach the storage pool; each storage pool node runs SQL Server, Spark, and an HDFS data node; the storage pool HDFS is linked to Azure Data Lake]
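The mount-and-cache idea behind HDFS tiering can be sketched as a small read-through cache. The class below is a toy model, not the product's implementation: the class name, the mount path, and the dictionary standing in for a remote store are all assumptions made for the example.

```python
class TieredFileSystem:
    """Toy model of HDFS tiering: remote directories mounted as virtual paths."""

    def __init__(self):
        self.mounts = {}       # virtual directory -> remote store (dict of path -> bytes)
        self.cache = {}        # virtual path -> bytes already fetched
        self.remote_reads = 0  # how many reads had to go to a remote store

    def mount(self, virtual_dir, remote_store):
        """Expose a remote store under a virtual directory."""
        self.mounts[virtual_dir.rstrip("/")] = remote_store

    def read(self, path):
        """Return file contents, fetching from the remote store only on a cache miss."""
        if path in self.cache:
            return self.cache[path]
        for virtual_dir, store in self.mounts.items():
            prefix = virtual_dir + "/"
            if path.startswith(prefix):
                data = store[path[len(prefix):]]
                self.remote_reads += 1
                self.cache[path] = data
                return data
        raise FileNotFoundError(path)


# A dict stands in for a remote data lake directory (hypothetical contents).
remote_lake = {"sales/2019.csv": b"id,amount\n1,10\n"}
fs = TieredFileSystem()
fs.mount("/mounts/lake", remote_lake)
first = fs.read("/mounts/lake/sales/2019.csv")
second = fs.read("/mounts/lake/sales/2019.csv")
print(fs.remote_reads)
```

The second read is served from the local cache, which is the effect the slide describes as bringing the data to the compute.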
Demo: Data virtualization over built-in HDFS
High performance with data pools
Cache and combine datasets from external sources without writing code
Offload query execution from source databases
Bring data to the big data cluster for faster query times
Data is automatically distributed across the data pool instances
Columnstore indexes are automatically applied to compress data storage and boost query performance by up to 10x
Filtering and local aggregations happen in parallel across the data pool instances
Scale the data pools as needed for better performance
[Diagram: PolyBase connectors bring data from Cosmos DB and Oracle into a scale-out data pool with shards 1 through n; the SQL Server master instance uses T-SQL and PolyBase external tables over the data pool and over storage pool nodes that each run SQL Server, Spark, and an HDFS data node]
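The data pool behavior described above (rows spread across instances, filtering and local aggregation in parallel, then a final combine) can be sketched in plain Python. Round-robin placement, the thread pool, and the sample rows are assumptions chosen for the example; they only mimic the pattern, not the engine.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def distribute(rows, shard_count):
    """Place rows round-robin across shards, one list per data pool instance."""
    shards = [[] for _ in range(shard_count)]
    for i, row in enumerate(rows):
        shards[i % shard_count].append(row)
    return shards


def local_sum(shard):
    """Filter and aggregate inside one shard (the local step)."""
    totals = defaultdict(float)
    for region, amount in shard:
        if amount > 0:  # filter is applied where the data lives
            totals[region] += amount
    return dict(totals)


def query(shards):
    """Run the local step on every shard in parallel, then combine the partial sums."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(local_sum, shards))
    combined = defaultdict(float)
    for partial in partials:
        for region, subtotal in partial.items():
            combined[region] += subtotal
    return dict(combined)


rows = [("east", 10.0), ("west", 5.0), ("east", 2.5), ("west", -1.0), ("north", 4.0)]
shards = distribute(rows, 3)
result = query(shards)
print(result)
```

Adding shards shrinks the work each instance does in the local step, which is the reason scaling the data pool can improve query times.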
Boost performance with compute pools
Compute pool instances query partitions from external data sources in parallel
Compute pool instances work together to perform cross-partition aggregations and sorts
Compute pool offloads query execution from the master instance and provides scale-out compute capacity
Compute pool works over data pool, storage pool, or external data sources
Compute pool is stateless and can be scaled up/down as needed
[Diagram: SQL Server uses T-SQL and PolyBase external tables through a scale-out compute pool of compute instances, which works over Cosmos DB and Oracle, a scale-out data pool with shards 1 through n, and storage pool nodes that each run SQL Server, Spark, and an HDFS data node]
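A cross-partition sort of the kind the compute pool performs can be sketched as a parallel local sort followed by a merge. The function names, the thread pool, and the sample partitions are assumptions for illustration; the real engine's execution plan is not described in the slides.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def scan_partition(partition):
    """One compute instance reads one partition and sorts it locally."""
    return sorted(partition)


def cross_partition_sort(partitions, instances=4):
    """Sort partitions in parallel, then merge the sorted runs into one result."""
    with ThreadPoolExecutor(max_workers=instances) as pool:
        sorted_runs = list(pool.map(scan_partition, partitions))
    return list(heapq.merge(*sorted_runs))


# Each inner list stands in for one partition of an external data source.
partitions = [[42, 7, 19], [3, 88], [15, 1, 60]]
ordered = cross_partition_sort(partitions)
print(ordered)
```

Because the workers keep no state between queries, the number of instances can change without moving any data, matching the stateless property listed above.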
Managing all data
Manage all your relational and big data in one unified system
Deploy big data and relational as a single, unified solution
Upgrade big data and relational at the same time and be assured of component compatibility
Provided, updated, and supported by Microsoft as part of SQL Server
Simplified deployment with containers & Kubernetes
A container is a standardized unit of software that includes everything needed to run it
Kubernetes is an open source container orchestration platform
Benefits of containers and Kubernetes:
1. Fast to deploy, easily scriptable
2. Upgrades are easy - just pull a new image
3. Self-contained – no installation/uninstallation required
4. Scalable, multi-tenant, and designed for elasticity
5. Consistently run anywhere: on-prem, Azure, other clouds
[Diagram: a Kubernetes pod containing SQL Server, an HDFS data node, and Spark]
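The "scriptable" and "just pull a new image" points can be illustrated with a generic Kubernetes Pod manifest built in Python. This is a sketch only: the pod name and image references are hypothetical, and a big data cluster is deployed with its own tooling rather than hand-written pod manifests.

```python
import copy
import json


def pod_manifest(name, image):
    """Return a minimal Kubernetes Pod manifest as a dictionary."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {"containers": [{"name": name, "image": image}]},
    }


def with_new_image(manifest, image):
    """Return a copy of the manifest that points at a different image."""
    upgraded = copy.deepcopy(manifest)
    upgraded["spec"]["containers"][0]["image"] = image
    return upgraded


# Hypothetical registry and tags.
current = pod_manifest("sql-demo", "registry.example.com/sql-demo:v1")
upgraded = with_new_image(current, "registry.example.com/sql-demo:v2")
print(json.dumps(upgraded, indent=2))
```

The upgrade changes a single field in a declarative document, which is why container-based upgrades are easy to automate and review.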
Big data clusters deployed on Kubernetes, OpenShift, AKS
[Architecture diagram: analytics, custom apps, and BI connect to the SQL Server master instance; a compute pool of SQL compute nodes; a data pool of SQL data nodes with persistent storage; a storage pool whose Kubernetes pods each run SQL Server, Spark, and an HDFS data node, with IoT data read directly from HDFS; an application pool; all hosted on Kubernetes nodes]
Unified experience
Azure Data Studio is a unified tool for DBAs, data engineers, and data scientists
Notebook experience for both T-SQL and Spark
HDFS file management
Built-in administration provides easy-to-use, cloud-style managed services for HA, monitoring, backup/recovery, security, and provisioning
REST API and command line tools simplify automation for SQL and big data
The development and management experience is consistent on-prem or in any cloud
Advanced Spark and YARN user experience
Integrated Big Data and SQL Server security model
Simple, single sign-on with Active Directory authentication
Unified connection experience in Azure Data Studio to connect to SQL Server, HDFS, Spark, and admin UIs at the same time
Manage data access with SQL Server security roles
Access auditing for compliance
Unified security and governance
[Diagram: apps, reports, users, and Azure Data Studio authenticate through Active Directory to the SQL Server master instance; impersonation carries the identity on to HDFS/Spark and to external data sources]
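Managing data access with SQL Server security roles usually comes down to a few T-SQL statements. The Python helper below assembles such a script as a sketch; the helper, the role name, the member names, and the table are hypothetical, and each member is assumed to already exist as a database user.

```python
def role_grant_script(role, members, readable_tables):
    """Build T-SQL that creates a role, grants it SELECT, and adds members.

    Assumes every member already exists as a database user. Values are not
    escaped, so this is for illustration only.
    """
    statements = [f"CREATE ROLE [{role}];"]
    for table in readable_tables:
        statements.append(f"GRANT SELECT ON OBJECT::{table} TO [{role}];")
    for member in members:
        statements.append(f"ALTER ROLE [{role}] ADD MEMBER [{member}];")
    return "\n".join(statements)


# Hypothetical role, domain accounts, and table.
script = role_grant_script(
    role="sales_readers",
    members=["CONTOSO\\alice", "CONTOSO\\bob"],
    readable_tables=["dbo.Orders"],
)
print(script)
```

Granting to a role instead of to individual users keeps the access rules in one place, which also simplifies auditing.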
Analyzing all data
Big data clusters: a complete AI platform
Simplified management and analysis through unified deployment, governance, and tooling
[Diagram: data from business/custom apps (structured), logs, files and media (unstructured), and sensors and IoT (unstructured) flows through four stages. Ingest: SQL Server Integration Services (SSIS). Store: HDFS, data pools. Prep & train: Spark, Spark ML, SQL Server ML Services. Model & serve: master instance, REST API containers for models. Results feed predictive apps and BI tools]
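The prep, train, and serve stages can be sketched end to end with a deliberately tiny model. Everything here is an assumption made for illustration: the sample readings, the least-squares line fit, and the JSON request shape are stand-ins, not the Spark ML or ML Services APIs.

```python
import json


def prep(raw_rows):
    """Drop rows with missing values and convert the rest to floats."""
    return [(float(x), float(y)) for x, y in raw_rows if x and y]


def train(points):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def serve(model, x):
    """Score one input with the stored model."""
    return model["slope"] * x + model["intercept"]


def handle_request(model, body):
    """Mimic a REST scoring endpoint: JSON in, JSON out."""
    x = json.loads(body)["x"]
    return json.dumps({"prediction": serve(model, x)})


# Ingested rows arrive as strings; one reading is missing.
raw = [("1", "3"), ("2", "5"), ("", "9"), ("3", "7")]
model = train(prep(raw))
response = handle_request(model, '{"x": 4.0}')
print(response)
```

Keeping the stored model as plain data is what lets a separate serving component load and score it without rerunning training.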
SQL Server 2019 big data clusters
Managed SQL Server, Spark, and data lake
Store high-volume data in a data lake and access it easily using either SQL or Spark
Management services, admin portal, and integrated security make it all easy to manage
Data virtualization
Combine data from many sources without moving or replicating it
Scale out compute and caching to boost performance
[Diagram: Analytics and Apps use T-SQL against SQL Server, which reaches open database connectivity (ODBC), NoSQL, relational database, and HDFS sources]
Complete AI platform
Easily feed integrated data from many sources to your model training
Ingest and prep data and then train, store, and operationalize your models all in one system
[Diagram: SQL Server external tables over external data sources; compute pools and data pools; Spark; scalable, shared storage (HDFS); admin portal and management services; integrated AD-based security; SQL Server ML Services; Spark and Spark ML; REST API containers for models]
Get started with SQL Server 2019 big data clusters
https://aka.ms/eapsignup
Apply to join the SQL Server Early Adoption Program
Q&A
Thank you for joining us.