Azure Stream Analytics : Analyse Data in Motion
Stream Analytics: Analyze your data in motion
Deepthi Anantharam
Technology Evangelist
@deananth
Ruhani Arora
Technology Evangelist
@infinitydlimit
The need for evolution – Identified 2 years ago
… data warehousing has reached the most significant tipping point since its inception. The biggest, possibly most elaborate data management system in IT is changing.
– Gartner, “The State of Data Warehousing in 2012”
The “Traditional” Data Warehouse
Data sources (OLTP, ERP, CRM, LOB) → ETL → Data warehouse → BI and analytics
1. Increasing data volumes
2. New data sources & types (non-relational data: devices, web, sensors, social)
3. Cloud-born data
4. Real-time data
Evolving Approaches to Analytics
[Diagram, built up over several slides]
Sources: OLTP, ERP, LOB, …; devices, web, sensors, social; streaming data
Traditional path: Extract (original data), Transform, Load (transformed data) with an ETL tool (SSIS, etc.) into an EDW (SQL Server, Teradata, etc.)
New path: Ingest (EL) original data into scale-out storage & compute (HDFS, Blob Storage, etc.), then Transform & Load
Consumers: BI tools, data marts, data lake(s), dashboards, apps
Real-time data analytics
Agenda
• ETL with new sources of data
  • Azure Data Factory
• Analytics with new sources of data
  • Azure Stream Analytics
Azure Data Factory Overview
• New Azure service for data developers & IT
• Compose data processing, storage and movement services to create & manage analytics pipelines
• Initially focused on Azure & hybrid movement to/from on-premises SQL Server. Over time it will expand to more storage & processing systems
• Rich, simple end-to-end pipeline monitoring and management
Operationalizing Information Production With Data Factory
Example Scenario: Customer Profiling (game usage analytics)
Customer Profiling – Game Usage Analytics
2277,2013-06-01 02:26:54.3943450,111,164.234.187.32,24.84.225.233,true,8,1,2058
2277,2013-06-01 03:26:23.2240000,111,164.234.187.32,24.84.225.233,true,8,1,2058-2123-2009-2068-2166
2277,2013-06-01 04:22:39.4940000,111,164.234.187.32,24.84.225.233,true,8,1,
2277,2013-06-01 05:43:54.1240000,111,164.234.187.32,24.84.225.233,true,8,1,2058-225545-2309-2068-2166
2277,2013-06-01 06:11:23.9274300,111,164.234.187.32,24.84.225.233,true,8,1,223-2123-2009-4229-9936623
2277,2013-06-01 07:37:01.3962500,111,164.234.187.32,24.84.225.233,true,8,1,
2277,2013-06-01 08:12:03.1109790,111,164.234.187.32,24.84.225.233,true,8,1,234322-2123-2234234-12432-344323
…
Log Files Snippet (10s of TBs per day in cloud storage)
User Table UserID FirstName LastName State …
2277 Pratik Patel Oregon
664432 Dave Nettleton Washington
8853 Mike Flasko California
New User Activity Per Week By Region
profileid  day       state       duration  rank  weaponsused  interactedwith
1148       6/2/2013  Oregon      216       33    1            5
1004       6/2/2013  Missouri    22        40    6            2
292        6/1/2013  Georgia     201       137   1            5
1059       6/2/2013  Oregon      27        104   5            2
675        6/2/2013  California  65        164   3            2
1348       6/3/2013  Nebraska    21        95    5            2
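A minimal Python sketch of the first profiling step: reading raw log records and attributing them to a region through the User Table. The slide does not label the log columns, so reading field 0 as the user ID and field 1 as the event timestamp is an assumption (suggested by 2277 also appearing as a UserID in the User Table); the remaining fields are left uninterpreted. `parse_record` and `records_per_state` are hypothetical helper names, not part of Data Factory.

```python
from collections import Counter
from datetime import datetime

# Two records copied from the log snippet. Column meanings are NOT given on the slide.
LOG_LINES = [
    "2277,2013-06-01 02:26:54.3943450,111,164.234.187.32,24.84.225.233,true,8,1,2058",
    "2277,2013-06-01 03:26:23.2240000,111,164.234.187.32,24.84.225.233,true,8,1,2058-2123-2009-2068-2166",
]

# UserID -> State, taken from the User Table sample rows.
USERS = {2277: "Oregon", 664432: "Washington", 8853: "California"}


def parse_record(line):
    """Return (user_id, timestamp) under the assumed column layout."""
    fields = line.split(",")
    user_id = int(fields[0])
    # The log carries 7 fractional digits; %f accepts at most 6, so trim the last one.
    timestamp = datetime.strptime(fields[1][:26], "%Y-%m-%d %H:%M:%S.%f")
    return user_id, timestamp


def records_per_state(lines, users):
    """Count log records per state by looking each user up in the user table."""
    counts = Counter()
    for line in lines:
        user_id, _ = parse_record(line)
        counts[users.get(user_id, "unknown")] += 1
    return dict(counts)


if __name__ == "__main__":
    print(records_per_state(LOG_LINES, USERS))
```

In the deck this join and aggregation run as pipeline activities on HDInsight; the sketch only illustrates the shape of the lookup on a couple of rows.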
Terminologies
• Linked Services
• Data Sets
• Pipeline
• Diagram View
Steps
• Create a Data Factory
• Add Data Sources
• Define Tables and Pipelines
• Deploy & Start
• Monitor and Manage
Example: Game Logs, Customer Profiling
[Diagram, built up over several slides, inside Azure Data Factory]
Sources: On-Premises SQL Server (New User View); Azure Blob Storage (1000s of log files, Game Usage)
Pipeline: Copy “NewUsers” to Blob Storage (output: Cloud New Users)
Pipeline: Mask & Geo-Code Game Usage with the Geo Dictionary (output: Geo Coded Game Usage); runs on HDInsight
Pipeline: Join & Aggregate New Users with Geo Coded Game Usage (output: New User Activity); runs on HDInsight
Step 3: Define Tables & Pipelines
“GeoCoded Game Usage” Table:
Step 3: Define Tables & Pipelines
Pipeline Definition: Activity, Activity
PowerShell
# Deploy Table
New-AzureDataFactoryTable -DataFactory "GameTelemetry" -File NewUserActivityPerRegion.json
# Deploy Pipeline
New-AzureDataFactoryPipeline -DataFactory "GameTelemetry" -File NewUserTelemetryPipeline.json
# Start Pipeline
Set-AzureDataFactoryPipelineActivePeriod -Name "NewUserTelemetryPipeline" -DataFactory "GameTelemetry" -StartTime "10/29/2014 12:00:00"
Incremental Data Production
[Diagram: Dataset2 and Dataset3 produced in slices. Hourly slices: 12-1, 1-2, 2-3. Daily slices: Monday, Tuesday, Wednesday.]
Hive Activity: GameUsage and GeoCodeDictionary in, Geo-CodedGameUsage out
Custom Actions
• Allows running any .NET code wrapped within an ADF activity
• Can be used to connect to new sources/destinations
• Can be used to create custom transformation activities
• Example: Invoke Azure ML model
• SDK for custom activity creation:
Data Factory – Available Today
Coordination:
• Rich scheduling
• Complex dependencies
• Incremental rerun
Authoring:
• JSON & PowerShell/C#
Management:
• Lineage
• Data production policies (late data, rerun, latency, etc.)
Hub: Azure Hub (HDInsight + Blob storage)
• Activities: Hive, Pig, C#
• Data Connectors: Blobs, Tables, Azure DB, On-Prem SQL Server, MDS [internal]
Analyze your data in motion
What is Streaming Data?
Data in Motion
Data at Rest
Azure Stream Analytics
Real-time stream processing Near infinite cloud scale
Managed real-time analytics
Mission-critical reliability and scale
Rapid development
Point of Service Devices
Self Checkout Stations
Kiosks
Smart Phones
Slates/Tablets
PCs/Laptops
Servers
Digital Signs
Diagnostic Equipment
Remote Medical Monitors
Logic Controllers
Specialized Devices
Thin Clients
Handhelds
Security
POS Terminals
Automation Devices
Vending Machines
Kinect
ATM
Stream Analytics
How do customers create a real-time streaming solution?
Customers using ASA?
Using Azure Analytic Service
Data Source, Collect, Process, Deliver, Consume
Event Inputs: Event Hub, Azure Blob
Reference Data: Azure Blob
Azure Stream Analytics: Transform (temporal joins, filter, aggregates, projections, windows, etc.), Enrich, Correlate
Outputs: SQL Azure, Azure Blobs, Event Hub, Table Storage
Consume: BI Dashboards, Predictive Analytics, Azure Storage
Sample Scenario: Toll Station
EntryStream - Data about vehicles entering toll stations
TollId  EntryTime                     LicensePlate  State  Make    Model    Type  Weight
1       2014-10-25T19:33:30.0000000Z  JNB 7001      NY     Honda   CRV      1     3010
1       2014-10-25T19:33:31.0000000Z  YXZ 1001      NY     Toyota  Camry    2     3020
3       2014-10-25T19:33:32.0000000Z  ABC 1004      CT     Ford    Taurus   2     3800
2       2014-10-25T19:33:33.0000000Z  XYZ 1003      CT     Toyota  Corolla  2     2900
1       2014-10-25T19:33:34.0000000Z  BNJ 1007      NY     Honda   CRV      1     3400
2       2014-10-25T19:33:35.0000000Z  CDE 1007      NJ     Toyota  4x4      1     3800
…       …                             …             …      …       …        …     …
ExitStream - Data about cars leaving toll stations
TollId  ExitTime                      LicensePlate
1       2014-10-25T19:33:40.0000000Z  JNB 7001
1       2014-10-25T19:33:41.0000000Z  YXZ 1001
3       2014-10-25T19:33:42.0000000Z  ABC 1004
2       2014-10-25T19:33:43.0000000Z  XYZ 1003
…       …                             …
ReferenceData - Commercial vehicle registration data
LicensePlate  RegistrationId  Expired
SVT 6023      285429838       1
XLZ 3463      362715656       0
QMZ 1273      876133137       1
RIV 8632      992711956       0
…             …               …
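The entry and exit samples can be correlated on toll ID and licence plate, which is the kind of temporal join listed among the Stream Analytics transforms. The Python sketch below only illustrates that idea on the sample rows; it is not how the service executes a join. The function name `time_in_toll` and the `max_seconds` bound are hypothetical, introduced here for illustration.

```python
from datetime import datetime


def parse_ts(text):
    """Parse the sample timestamps, keeping whole seconds only."""
    # The samples carry 7 fractional digits and a trailing Z; the first 19 chars suffice.
    return datetime.strptime(text[:19], "%Y-%m-%dT%H:%M:%S")


# (TollId, EntryTime, LicensePlate) from the EntryStream sample.
ENTRY = [
    (1, "2014-10-25T19:33:30.0000000Z", "JNB 7001"),
    (1, "2014-10-25T19:33:31.0000000Z", "YXZ 1001"),
    (3, "2014-10-25T19:33:32.0000000Z", "ABC 1004"),
    (2, "2014-10-25T19:33:33.0000000Z", "XYZ 1003"),
    (1, "2014-10-25T19:33:34.0000000Z", "BNJ 1007"),
    (2, "2014-10-25T19:33:35.0000000Z", "CDE 1007"),
]

# (TollId, ExitTime, LicensePlate) from the ExitStream sample.
EXIT = [
    (1, "2014-10-25T19:33:40.0000000Z", "JNB 7001"),
    (1, "2014-10-25T19:33:41.0000000Z", "YXZ 1001"),
    (3, "2014-10-25T19:33:42.0000000Z", "ABC 1004"),
    (2, "2014-10-25T19:33:43.0000000Z", "XYZ 1003"),
]


def time_in_toll(entries, exits, max_seconds=900):
    """Match each entry to an exit with the same toll and plate within max_seconds."""
    exit_times = {(toll, plate): parse_ts(ts) for toll, ts, plate in exits}
    matched = []
    for toll, ts, plate in entries:
        exit_time = exit_times.get((toll, plate))
        if exit_time is None:
            continue  # no exit seen yet for this vehicle
        seconds = (exit_time - parse_ts(ts)).total_seconds()
        if 0 <= seconds <= max_seconds:
            matched.append((toll, plate, seconds))
    return matched


if __name__ == "__main__":
    for row in time_in_toll(ENTRY, EXIT):
        print(row)
```

The time bound matters in a streaming setting: without it, a join would have to remember every entry event indefinitely while waiting for a possible exit.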
Query Language - Overview
DML Statements: SELECT, FROM, WHERE, GROUP BY, HAVING, CASE, JOINS, UNION
Scaling Functions: WITH, PARTITION BY
Date and Time Functions: DATENAME, DATEPART, DAY, MONTH, YEAR, DATETIMEFROMPARTS, DATEDIFF, DATEADD
Windowing Extensions: Tumbling Window, Hopping Window, Sliding Window
Aggregate Functions: SUM, COUNT, AVG, MIN, MAX
String Functions: LEN, CONCAT, SUBSTRING, CHARINDEX, PATINDEX
Tumbling Windows
Count the total number of vehicles entering each toll booth in every 10-second interval.
SELECT TollId, COUNT(*)
FROM EntryStream TIMESTAMP BY EntryTime
GROUP BY TollId, TumblingWindow(second, 10)
[Figure: A 10-second Tumbling Window; time axis in seconds, 0 to 30]
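To make the tumbling-window grouping concrete, here is a small Python simulation over the six EntryStream sample rows. It is a sketch of the semantics, not of the service's implementation. It assumes windows are aligned to multiples of the window size and include their end but not their start, which to my understanding is the convention Stream Analytics documents. `tumbling_counts` is a hypothetical helper name.

```python
import math
from collections import Counter


def tumbling_counts(events, size):
    """Count events per (window_end, toll_id) for non-overlapping windows.

    events: iterable of (toll_id, t) with t in seconds.
    Each window covers (window_end - size, window_end].
    """
    counts = Counter()
    for toll_id, t in events:
        window_end = math.ceil(t / size) * size
        counts[(window_end, toll_id)] += 1
    return dict(counts)


# (TollId, seconds past 19:33:00) for the six EntryStream sample rows.
EVENTS = [(1, 30), (1, 31), (3, 32), (2, 33), (1, 34), (2, 35)]

if __name__ == "__main__":
    for (window_end, toll_id), count in sorted(tumbling_counts(EVENTS, 10).items()):
        print(f"window ending at :{window_end}  TollId={toll_id}  count={count}")
```

Every event lands in exactly one window, so the counts across all windows add up to the number of events.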
Hopping Windows
Count the number of vehicles entering each toll booth over every 10-second interval; update results every 5 seconds.
SELECT COUNT(*), TollId
FROM EntryStream TIMESTAMP BY EntryTime
GROUP BY TollId, HoppingWindow(second, 10, 5)
[Figure: A 10-second Hopping Window with a 5-second “Hop”; time axis in seconds, 0 to 30]
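A hopping window differs from a tumbling one in that windows overlap, so one event can be counted more than once. The Python sketch below simulates a 10-second window with a 5-second hop over the EntryStream sample rows. It assumes window ends fall on multiples of the hop and that each window includes its end but not its start; `hopping_counts` is a hypothetical helper name.

```python
import math
from collections import Counter


def hopping_counts(events, size, hop):
    """Count events per (window_end, toll_id) for overlapping windows.

    events: iterable of (toll_id, t) with t in seconds.
    A window ends at every multiple of hop and covers (window_end - size, window_end].
    """
    counts = Counter()
    for toll_id, t in events:
        # First window end at or after t.
        window_end = math.ceil(t / hop) * hop
        # The event belongs to every window whose start is still before t.
        while window_end - size < t:
            counts[(window_end, toll_id)] += 1
            window_end += hop
    return dict(counts)


# (TollId, seconds past 19:33:00) for the six EntryStream sample rows.
EVENTS = [(1, 30), (1, 31), (3, 32), (2, 33), (1, 34), (2, 35)]

if __name__ == "__main__":
    for (window_end, toll_id), count in sorted(hopping_counts(EVENTS, 10, 5).items()):
        print(f"window ending at :{window_end}  TollId={toll_id}  count={count}")
```

With a window twice as long as the hop, each event falls into two windows; setting the hop equal to the size reduces this to a tumbling window.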
Sliding Windows
Find the toll booths that have served more than 10 vehicles in the last 10 seconds, and give their counts.
SELECT TollId, COUNT(*)
FROM EntryStream ES
GROUP BY TollId, SlidingWindow(second, 10)
HAVING COUNT(*) > 10
[Figure: A 10-second Sliding Window; time axis in seconds, 0 to 30]
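A sliding window has no fixed boundaries: it is re-evaluated as events move in and out of the last N seconds. The Python sketch below is a simplification that checks the count only when an event arrives, whereas Stream Analytics, as I understand it, also produces output when an event leaves the window. `sliding_alerts` and its `threshold` parameter are hypothetical names chosen for illustration.

```python
from collections import defaultdict, deque


def sliding_alerts(events, size, threshold):
    """Report (t, toll_id, count) whenever an arrival pushes a toll above threshold.

    events: iterable of (toll_id, t) sorted by t, with t in seconds.
    The window for an arrival at t covers (t - size, t].
    """
    recent = defaultdict(deque)  # toll_id -> arrival times still inside the window
    alerts = []
    for toll_id, t in events:
        window = recent[toll_id]
        window.append(t)
        # Drop arrivals that have fallen out of the last `size` seconds.
        while window[0] <= t - size:
            window.popleft()
        if len(window) > threshold:
            alerts.append((t, toll_id, len(window)))
    return alerts


if __name__ == "__main__":
    # Made-up arrivals at toll 1; a threshold of 2 keeps the example small.
    arrivals = [(1, 0), (1, 3), (1, 5), (1, 20), (1, 22)]
    print(sliding_alerts(arrivals, size=10, threshold=2))
```

The query on the slide corresponds to `size=10` and `threshold=10`; the demo uses a lower threshold only so that a handful of events is enough to trigger a result.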
Real-time analytics
• Intake millions of events per second
• Process data from connected devices/apps
• Integrated with highly scalable publish-subscribe ingestor
• Easy processing on continuous streams of data
• Transform, augment, correlate, temporal operations
• Detect patterns and anomalies in streaming data
• Correlate streaming with reference data
Programmatic Access with REST APIs
The full functionality of Azure Stream Analytics is available through REST APIs.
• Enables programmatic access
• Useful for automation through scripting
• Embed in other applications/tools
Jobs Management: Create Job, Delete Job, Start Job, Stop Job, List Jobs, Update Job
Input and Output Management: Create Input / Output, Delete Input / Output, List Input / Output, Update Input / Output
Transformations Management: Create Transformation, Delete Transformation, Get Transformation, Update Transformation
Demo: Scaling, Monitoring & Logging
Scaling Concepts – Partitions
[Diagram: Event Hub partitions (PartitionId = 1, 2, 3) flow into Stream Analytics, yielding Step Result 1, 2 and 3]
SELECT COUNT(*) AS Count, TollBoothId
FROM EntryStream PARTITION BY PartitionId
GROUP BY TumblingWindow(minute, 3), TollBoothId
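The point of PARTITION BY is that each input partition can be processed on its own, without coordinating with the others. The Python sketch below illustrates that under the same window convention as a tumbling window (aligned to multiples of the size, end inclusive). The event values are made up, and `partitioned_counts` is a hypothetical helper name; the real service distributes this work across compute resources rather than looping in one process.

```python
import math
from collections import Counter, defaultdict


def partitioned_counts(events, size):
    """Compute tumbling-window counts separately for each partition.

    events: iterable of (partition_id, toll_booth_id, t) with t in seconds.
    Returns {partition_id: {(window_end, toll_booth_id): count}}.
    """
    by_partition = defaultdict(list)
    for partition_id, toll_booth_id, t in events:
        by_partition[partition_id].append((toll_booth_id, t))

    results = {}
    for partition_id, partition_events in by_partition.items():
        # Each partition is an independent unit of work: no shared state is needed.
        counts = Counter()
        for toll_booth_id, t in partition_events:
            window_end = math.ceil(t / size) * size
            counts[(window_end, toll_booth_id)] += 1
        results[partition_id] = dict(counts)
    return results


if __name__ == "__main__":
    # Made-up events: (PartitionId, TollBoothId, seconds); 180 s = a 3-minute window.
    sample = [(1, 7, 10), (1, 7, 170), (2, 7, 20), (2, 8, 200), (3, 9, 180)]
    for partition_id, counts in sorted(partitioned_counts(sample, 180).items()):
        print(partition_id, counts)
```

Note that toll booth 7 appears in two partitions and therefore gets two separate counts. If one total per booth is needed, the events for a booth must arrive on a single partition, or a further unpartitioned step has to combine the per-partition results.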
ADF & ASA
• Preview services
• Offer the ability to deal with new-age problems in processing and analyzing data
• Scale, Speed, Economy
Recommended/related sessions
1. Inside Azure Storage – Options, Abstractions and Best Practices (Data, Sabha2, 11.00 AM – 11.55 AM tomorrow)
2. Choosing Right Platform for BigData (Data, Sabha2, 3.00 PM to 3.55 PM tomorrow)
3. Practical Machine Learning (Data, Sabha2, 4.15 to 5.10 today)
References
Related references for you to expand your knowledge on the subject
Azure Stream Analytics Documentation: http://azure.microsoft.com/en-in/documentation/services/stream-analytics/
Stream Analytics Query Language Reference: https://msdn.microsoft.com/en-us/library/azure/dn834998.aspx
Azure Portal: http://azure.microsoft.com
Azure Updates: http://azure.microsoft.com/blog/
Microsoft Virtual Academy: aka.ms/mva
Developer Network: msdn.microsoft.com/
Azure Support
Must-know resources to get online help for Azure.
Azure Support Options: http://azure.microsoft.com/en-us/support/options/
Azure Support Plans: http://azure.microsoft.com/en-us/support/plans/
Ask questions & get answers: post questions in the Azure forums. Tag questions with the keyword Azure.
Azure Vidyapeeth
A platform for learning: choose your topic, choose your time
• Register to attend Azure Vidyapeeth Live webinars at www.aka.ms/azure-vidyapeeth
• Collect a free $100 Azure gift pass by registering for our Azure Vidyapeeth series at the Expo zone!
• Point your mobile phone here to download the Azure Vidyapeeth Mobile App: www.aka.ms/av-app
Thank you
Twitter: @deananth @infinitydlimit
Follow us online
Pricing (Today)
Query Language
• You write declarative queries in SQL: no code compilation, easy to author and deploy
• Unified programming model: brings together event streams, reference data and machine learning extensions
• Temporal semantics: all operators respect, and some use, the temporal properties of events
• Built-in operators and functions: these should (mostly) look familiar if you know relational databases
• Filters, projections, joins, windowed (temporal) aggregates, text and date manipulation
Why Event Processing in the Cloud?
• Event data is already in the Cloud
• Event data is globally distributed
• Reduced TCO
• Scale
• Managed service, not infrastructure
Bring the processing to the data, not the data to the processing!
Streamed data is naturally non-local!
Application Components
Components of an Azure Stream Analytics Application
INPUT (source of events): Azure Blob Storage, Azure Event Hubs
Reference Data
Stream Analytics Query: runs continuously against the incoming stream of events
OUTPUT (result of query): Azure SQL DB, Azure Event Hubs, Azure Blob Storage
Events: have a defined schema and are temporal (sequenced in time)