Killer Scenarios with Data Lake in Azure with U-SQL
![Page 1: Killer Scenarios with Data Lake in Azure with U-SQL](https://reader036.fdocuments.in/reader036/viewer/2022062523/5871643f1a28ab58758b505d/html5/thumbnails/1.jpg)
Microsoft Data Science Summit
Sept 26 – 27 | Atlanta, GA
![Page 2: Killer Scenarios with Data Lake in Azure with U-SQL](https://reader036.fdocuments.in/reader036/viewer/2022062523/5871643f1a28ab58758b505d/html5/thumbnails/2.jpg)
Killer Scenarios with Data Lake in Azure with U-SQL
Michael Rys, Principal Program Manager, Big Data
@[email protected]
aka.ms/azuredatalake
![Page 3: Killer Scenarios with Data Lake in Azure with U-SQL](https://reader036.fdocuments.in/reader036/viewer/2022062523/5871643f1a28ab58758b505d/html5/thumbnails/3.jpg)
Agenda
Today (BR013): Killer extensibility in Azure Data Lake with U-SQL
• Custom rowset aggregation
• How to do JSON processing
• Image processing
• How to call R from U-SQL
Yesterday (BR014): Introduction to Azure Data Lake and U-SQL
• What is Azure Data Lake?
• Why U-SQL?
• Core concepts
  • Schema on read on file and file sets
  • C# extensibility
  • SQL with U-SQL
  • Script level execution and optimization
• Tool usage
![Page 4: Killer Scenarios with Data Lake in Azure with U-SQL](https://reader036.fdocuments.in/reader036/viewer/2022062523/5871643f1a28ab58758b505d/html5/thumbnails/4.jpg)
U-SQL extensibility
Extend U-SQL with C#/.NET
• Built-in operators, functions, aggregates
• C# expressions (in SELECT expressions)
• User-defined aggregates (UDAGGs)
• User-defined functions (UDFs)
• User-defined operators (UDOs)
![Page 5: Killer Scenarios with Data Lake in Azure with U-SQL](https://reader036.fdocuments.in/reader036/viewer/2022062523/5871643f1a28ab58758b505d/html5/thumbnails/5.jpg)
What are UDOs?
Custom operator extensions, scaled out by U-SQL
• User-Defined Extractors
• User-Defined Outputters
• User-Defined Processors
  • Take one row and produce one row
  • Pass-through versus transforming
• User-Defined Appliers
  • Take one row and produce 0 to n rows
  • Used with OUTER/CROSS APPLY
• User-Defined Combiners
  • Combine rowsets (like a user-defined join)
• User-Defined Reducers
  • Take n rows and produce m rows (normally m<n)
• Scaled out with explicit U-SQL syntax that takes a UDO instance (created as part of the execution):
  • EXTRACT
  • OUTPUT
  • PROCESS
  • COMBINE
  • REDUCE
![Page 6: Killer Scenarios with Data Lake in Azure with U-SQL](https://reader036.fdocuments.in/reader036/viewer/2022062523/5871643f1a28ab58758b505d/html5/thumbnails/6.jpg)
UDO model
• Marking UDOs
• Parameterizing UDOs
• UDO signature
• UDO-specific processing pattern
• Rowsets and their schemas in UDOs
• Setting results
  • By position
  • By name

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Text;
    using Microsoft.Analytics.Interfaces;

    [SqlUserDefinedExtractor]
    public class DriverExtractor : IExtractor
    {
        private byte[] _row_delim;
        private string _col_delim;
        private Encoding _encoding;

        // Define a non-default constructor since I want to pass in my own parameters
        public DriverExtractor(string row_delim = "\r\n", string col_delim = ",", Encoding encoding = null)
        {
            _encoding = encoding == null ? Encoding.UTF8 : encoding;
            _row_delim = _encoding.GetBytes(row_delim);
            _col_delim = col_delim;
        } // DriverExtractor

        // Converting text to target schema
        private void OutputValueAtCol_I(string c, int i, IUpdatableRow outputrow)
        {
            var schema = outputrow.Schema;

            if (schema[i].Type == typeof(int))
            {
                var tmp = Convert.ToInt32(c);
                outputrow.Set(i, tmp);
            }
            ...
        } // OutputValueAtCol_I

        public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow outputrow)
        {
            // Split the input into rows, then each row into columns
            foreach (var row in input.Split(_row_delim))
            {
                using (var s = new StreamReader(row, _encoding))
                {
                    int i = 0;
                    foreach (var c in s.ReadToEnd().Split(new[] { _col_delim }, StringSplitOptions.None))
                    {
                        OutputValueAtCol_I(c, i++, outputrow);
                    } // foreach
                } // using
                yield return outputrow.AsReadOnly();
            } // foreach
        } // Extract
    } // class DriverExtractor
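The extractor's text-to-column conversion only shows the `int` branch; the remaining types are elided on the slide. As a rough sketch of how such a conversion could be generalized outside the U-SQL runtime, the hypothetical helper below (`ColumnConverter.ConvertText` is not part of the talk or of the U-SQL SDK) converts a text value to a target .NET type. It assumes invariant-culture formatting and only handles types that `Convert.ChangeType` supports, plus their nullable forms.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration; not from the talk's sample code.
public static class ColumnConverter
{
    // Converts a column's text to the requested type.
    // Empty text maps to null for nullable targets.
    public static object ConvertText(string text, Type target)
    {
        Type underlying = Nullable.GetUnderlyingType(target);
        if (underlying != null)
        {
            if (string.IsNullOrEmpty(text)) return null;
            target = underlying;
        }

        if (target == typeof(string)) return text;

        // Works for IConvertible targets such as int, long, double, DateTime.
        return Convert.ChangeType(text, target, CultureInfo.InvariantCulture);
    }
}
```

In an extractor, the result would still have to be written to the output row with a typed `Set` call, so a per-type branch like the one on the slide may remain the simpler choice.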
![Page 7: Killer Scenarios with Data Lake in Azure with U-SQL](https://reader036.fdocuments.in/reader036/viewer/2022062523/5871643f1a28ab58758b505d/html5/thumbnails/7.jpg)
How to specify UDOs?
Code behind
![Page 8: Killer Scenarios with Data Lake in Azure with U-SQL](https://reader036.fdocuments.in/reader036/viewer/2022062523/5871643f1a28ab58758b505d/html5/thumbnails/8.jpg)
How to specify UDOs?
C# Class Project for U-SQL
![Page 9: Killer Scenarios with Data Lake in Azure with U-SQL](https://reader036.fdocuments.in/reader036/viewer/2022062523/5871643f1a28ab58758b505d/html5/thumbnails/9.jpg)
How to specify UDOs?
• Any .NET language is usable, but not first-class in tooling
  • Use U-SQL specific .NET DLLs
  • Compile DLL, upload to ADLS, register with script
![Page 10: Killer Scenarios with Data Lake in Azure with U-SQL](https://reader036.fdocuments.in/reader036/viewer/2022062523/5871643f1a28ab58758b505d/html5/thumbnails/10.jpg)
Managing Assemblies
• Create assemblies
  • CREATE ASSEMBLY db.assembly FROM @path;
  • CREATE ASSEMBLY db.assembly FROM byte[];
  • Can also include additional resource files
• Reference assemblies
  • REFERENCE ASSEMBLY db.assembly;
  • Referencing .NET Framework assemblies
    • Always accessible system namespaces:
      • U-SQL specific (e.g., for SQL.MAP)
      • All provided by system.dll, system.core.dll, system.data.dll, System.Runtime.Serialization.dll, mscorlib.dll (e.g., System.Text, System.Text.RegularExpressions, System.Linq)
    • Add all other .NET Framework assemblies with: REFERENCE SYSTEM ASSEMBLY [System.XML];
• Enumerate assemblies
  • PowerShell command
  • U-SQL Studio Server Explorer
• Drop assemblies
  • DROP ASSEMBLY db.assembly;

Visual Studio makes registration easy!
![Page 11: Killer Scenarios with Data Lake in Azure with U-SQL](https://reader036.fdocuments.in/reader036/viewer/2022062523/5871643f1a28ab58758b505d/html5/thumbnails/11.jpg)
USING clause
Allows shortening and disambiguating C# namespace and class names.

    'USING' csharp_namespace | Alias '=' csharp_namespace_or_class.

Examples:

    DECLARE @input string = "somejsonfile.json";

    REFERENCE ASSEMBLY [Newtonsoft.Json];
    REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

    USING Microsoft.Analytics.Samples.Formats.Json;

    @data0 =
        EXTRACT IPAddresses string
        FROM @input
        USING new JsonExtractor("Devices[*]");

    USING json = [Microsoft.Analytics.Samples.Formats.Json.JsonExtractor];

    @data1 =
        EXTRACT IPAddresses string
        FROM @input
        USING new json("Devices[*]");
![Page 12: Killer Scenarios with Data Lake in Azure with U-SQL](https://reader036.fdocuments.in/reader036/viewer/2022062523/5871643f1a28ab58758b505d/html5/thumbnails/12.jpg)
Overlapping Range Aggregation

Input:
Start Time - End Time - User Name
5:00 AM - 6:00 AM - ABC
5:00 AM - 6:00 AM - XYZ
8:00 AM - 9:00 AM - ABC
8:00 AM - 10:00 AM - ABC
10:00 AM - 2:00 PM - ABC
7:00 AM - 11:00 AM - ABC
9:00 AM - 11:00 AM - ABC
11:00 AM - 11:30 AM - ABC
11:40 PM - 11:59 PM - FOO
11:50 PM - 0:40 AM - FOO

Output:
Start Time - End Time - User Name
5:00 AM - 6:00 AM - ABC
5:00 AM - 6:00 AM - XYZ
7:00 AM - 2:00 PM - ABC
11:40 PM - 0:40 AM - FOO

https://blogs.msdn.microsoft.com/azuredatalake/2016/06/27/how-do-i-combine-overlapping-ranges-using-u-sql-introducing-u-sql-reducer-udos
![Page 13: Killer Scenarios with Data Lake in Azure with U-SQL](https://reader036.fdocuments.in/reader036/viewer/2022062523/5871643f1a28ab58758b505d/html5/thumbnails/13.jpg)
Overlapping Range Aggregation
U-SQL:

    @r =
        REDUCE @in
        PRESORT begin
        ON user
        PRODUCE begin DateTime,
                end DateTime,
                user string
        READONLY user
        USING new ReduceSample.RangeReducer();

• PRESORT: presort input rowset
• ON: partition and scale out
• READONLY: declare pass-through
• USING: user-defined reducer
![Page 14: Killer Scenarios with Data Lake in Azure with U-SQL](https://reader036.fdocuments.in/reader036/viewer/2022062523/5871643f1a28ab58758b505d/html5/thumbnails/14.jpg)
Overlapping Range Aggregation
Code behind:

    using System;
    using System.Collections.Generic;
    using Microsoft.Analytics.Interfaces;

    namespace ReduceSample
    {
        [SqlUserDefinedReducer(IsRecursive = true)]
        public class RangeReducer : IReducer
        {
            public override IEnumerable<IRow> Reduce(IRowset input, IUpdatableRow output)
            {
                // Init aggregation values
                int i = 0;
                var begin = DateTime.MaxValue;
                var end = DateTime.MinValue;

                foreach (var row in input.Rows)
                {
                    ...
                    begin = row.Get<DateTime>("begin");
                    end = row.Get<DateTime>("end");
                    ...
                    output.Set<DateTime>("begin", begin);
                    output.Set<DateTime>("end", end);
                    yield return output.AsReadOnly();
                    ...
                } // foreach
            } // Reduce
        } // class RangeReducer
    } // namespace ReduceSample

• IsRecursive = true: provides better scale, requires an associative operation
• Implement IReducer
• input.Rows: input rowset partition
• row.Get: get input column
• output.Set: set output column
• Accumulate rows
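The reducer's code-behind elides the actual merge logic with `...`. A minimal, runtime-independent sketch of the idea follows; it is not the author's implementation. The names `RangeMerger` and `Merge` are hypothetical, the input is assumed to hold one user's ranges already sorted by start time (which is what PRESORT and ON provide to the reducer), and ranges that merely touch are treated as overlapping.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not the talk's RangeReducer.
public static class RangeMerger
{
    // Merges overlapping (begin, end) ranges that are sorted by begin.
    public static IEnumerable<Tuple<DateTime, DateTime>> Merge(
        IEnumerable<Tuple<DateTime, DateTime>> sortedRanges)
    {
        bool first = true;
        DateTime begin = DateTime.MaxValue;
        DateTime end = DateTime.MinValue;

        foreach (var r in sortedRanges)
        {
            if (first)
            {
                // Start accumulating with the first range.
                begin = r.Item1;
                end = r.Item2;
                first = false;
            }
            else if (r.Item1 <= end)
            {
                // Overlaps (or touches) the current range: extend its end if needed.
                if (r.Item2 > end) end = r.Item2;
            }
            else
            {
                // Gap found: emit the accumulated range and start a new one.
                yield return Tuple.Create(begin, end);
                begin = r.Item1;
                end = r.Item2;
            }
        }

        // Emit the last accumulated range, if any input was seen.
        if (!first) yield return Tuple.Create(begin, end);
    }
}
```

Applied to user ABC's rows from the example data, sorted by start time, this yields two ranges: 5:00 AM to 6:00 AM and 7:00 AM to 2:00 PM. Because merging ranges is associative, the same logic can be applied again to partial results, which is the property that `IsRecursive = true` relies on.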
![Page 15: Killer Scenarios with Data Lake in Azure with U-SQL](https://reader036.fdocuments.in/reader036/viewer/2022062523/5871643f1a28ab58758b505d/html5/thumbnails/15.jpg)
JSON Processing
How do I extract data from JSON documents?
https://github.com/Azure/usql/tree/master/Examples/DataFormats
![Page 16: Killer Scenarios with Data Lake in Azure with U-SQL](https://reader036.fdocuments.in/reader036/viewer/2022062523/5871643f1a28ab58758b505d/html5/thumbnails/16.jpg)
JSON Processing
Architecture of Sample Format Assembly
• Assemblies: Microsoft.Analytics.Samples.Formats, Newtonsoft.Json, System.Xml
• Single JSON document per file: use JsonExtractor
• Multiple JSON documents per file:
  • Do not allow CR/LF (row delimiter) in JSON
  • Use built-in Text Extractor to extract
  • Use JsonTuple to schematize (with CROSS APPLY)
• Currently loads the full JSON document into memory
  • Better to use JSONReader processing if docs are large
![Page 17: Killer Scenarios with Data Lake in Azure with U-SQL](https://reader036.fdocuments.in/reader036/viewer/2022062523/5871643f1a28ab58758b505d/html5/thumbnails/17.jpg)
JSON Processing

    @json =
        EXTRACT personid int,
                name string,
                addresses string
        FROM @input
        USING new Json.JsonExtractor("[*].person");

    @person =
        SELECT personid,
               name,
               Json.JsonFunctions.JsonTuple(addresses)["address"] AS address_array
        FROM @json;

    @addresses =
        SELECT personid,
               name,
               Json.JsonFunctions.JsonTuple(address) AS address
        FROM @person
             CROSS APPLY
                 EXPLODE (Json.JsonFunctions.JsonTuple(address_array).Values) AS A(address);

    @result =
        SELECT personid,
               name,
               address["addressid"] AS addressid,
               address["street"] AS street,
               address["postcode"] AS postcode,
               address["city"] AS city
        FROM @addresses;

• EXTRACT column names (personid, name, addresses): key to field relative to objects in JsonExtractor
• JsonExtractor("[*].person"): JPath expression mapping objects to rows
• JsonTuple(addresses): generates 1-level key-value pairs as SqlMap
• ["address"]: gets value from map as string
• CROSS APPLY EXPLODE (JsonTuple(address_array).Values): convert string array into map and pivot all values into rows
• JsonTuple(address): get object map for array item
• address["addressid"], address["street"], ...: get desired keys from object map
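To make the "1-level key-value pairs" idea concrete, here is a small standalone sketch of what such a flattening step does. It is not the sample assembly's JsonTuple implementation: the names `JsonTupleSketch` and `OneLevel` are hypothetical, it returns a plain `Dictionary` rather than a SqlMap, it uses `System.Text.Json` from current .NET instead of the Newtonsoft.Json library the sample depends on, and keying array items by their index is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical illustration of one-level JSON flattening; not the sample's JsonTuple.
public static class JsonTupleSketch
{
    // Maps each top-level key (or array index) to its value as a string.
    // Nested objects and arrays are kept as raw JSON text, so they can be
    // flattened again in a later step.
    public static Dictionary<string, string> OneLevel(string json)
    {
        var map = new Dictionary<string, string>();
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            JsonElement root = doc.RootElement;
            if (root.ValueKind == JsonValueKind.Object)
            {
                foreach (JsonProperty p in root.EnumerateObject())
                {
                    map[p.Name] = AsText(p.Value);
                }
            }
            else if (root.ValueKind == JsonValueKind.Array)
            {
                int i = 0;
                foreach (JsonElement e in root.EnumerateArray())
                {
                    map[(i++).ToString()] = AsText(e);
                }
            }
        }
        return map;
    }

    // Strings are unquoted; everything else stays as raw JSON text.
    private static string AsText(JsonElement e)
    {
        return e.ValueKind == JsonValueKind.String ? e.GetString() : e.GetRawText();
    }
}
```

Keeping nested values as raw text mirrors the pattern in the script, where the result of one JsonTuple call is passed to another to go one level deeper.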
![Page 18: Killer Scenarios with Data Lake in Azure with U-SQL](https://reader036.fdocuments.in/reader036/viewer/2022062523/5871643f1a28ab58758b505d/html5/thumbnails/18.jpg)
Image Processing

Copyright - Camera Make - Camera Model - Thumbnail
Michael - Canon - 70D
Michael - Samsung - S7
https://github.com/Azure/usql/tree/master/Examples/ImageApp
![Page 19: Killer Scenarios with Data Lake in Azure with U-SQL](https://reader036.fdocuments.in/reader036/viewer/2022062523/5871643f1a28ab58758b505d/html5/thumbnails/19.jpg)
Image Processing
• Image processing assembly
  • Uses System.Drawing
  • Exposes:
    • Extractors
    • Outputter
    • Processor
    • User-defined Functions
• Trade-offs
  • Column memory limits: Image Extractor vs Feature Extractor
  • Main memory pressures in vertex: UDFs vs Processor vs Extractor
![Page 20: Killer Scenarios with Data Lake in Azure with U-SQL](https://reader036.fdocuments.in/reader036/viewer/2022062523/5871643f1a28ab58758b505d/html5/thumbnails/20.jpg)
R Processing
KMeans Centroids
![Page 21: Killer Scenarios with Data Lake in Azure with U-SQL](https://reader036.fdocuments.in/reader036/viewer/2022062523/5871643f1a28ab58758b505d/html5/thumbnails/21.jpg)
U-SQL Processing with R
Architecture
• R Programmer Assembly
  • KMeansRReducer
• R Engine (Runtime)
  • R to .NET interop (RDotNet.dll & RDotNet.NativeLib.dll)
  • R Runtime (R-bin.zip)
  • R Engine Manager Utility (RUtilities.dll)
• Similar approaches can be used for deploying other runtimes: Python, JavaScript, JVM
• No external access from UDOs
• Future work:
  • More generic samples
  • More automatic experiences (no user wrappers/deploys)
![Page 22: Killer Scenarios with Data Lake in Azure with U-SQL](https://reader036.fdocuments.in/reader036/viewer/2022062523/5871643f1a28ab58758b505d/html5/thumbnails/22.jpg)
Summary of U-SQL UDOs
![Page 23: Killer Scenarios with Data Lake in Azure with U-SQL](https://reader036.fdocuments.in/reader036/viewer/2022062523/5871643f1a28ab58758b505d/html5/thumbnails/23.jpg)
What are UDOs?
Custom operator extensions written in .NET (C#)
Scaled out by U-SQL
![Page 24: Killer Scenarios with Data Lake in Azure with U-SQL](https://reader036.fdocuments.in/reader036/viewer/2022062523/5871643f1a28ab58758b505d/html5/thumbnails/24.jpg)
UDO Tips and Warnings
• Tips when using UDOs:
  • READONLY clause to allow pushing predicates through UDOs
  • REQUIRED clause to allow column pruning through UDOs
  • PRESORT on REDUCE if you need global order
  • Hint cardinality if the optimizer chooses the wrong plan
• Warnings and better alternatives:
  • Use SELECT with UDFs instead of PROCESS
  • Use user-defined aggregators instead of REDUCE
  • Learn to use windowing functions (OVER expression)
• Good use cases for PROCESS/REDUCE/COMBINE:
  • The logic needs to dynamically access the input and/or output schema, e.g., to create a JSON doc for the data in the row where the columns are not known a priori.
  • Your UDF-based solution creates too much memory pressure and you can write your code more memory-efficiently in a UDO.
  • You need an ordered aggregator or need to produce more than one row per group.
![Page 25: Killer Scenarios with Data Lake in Azure with U-SQL](https://reader036.fdocuments.in/reader036/viewer/2022062523/5871643f1a28ab58758b505d/html5/thumbnails/25.jpg)
Additional Resources
• Blogs and community page:
  • http://usql.io (U-SQL GitHub)
  • http://blogs.msdn.microsoft.com/azuredatalake/
  • http://blogs.msdn.microsoft.com/mrys/
  • https://channel9.msdn.com/Search?term=U-SQL#ch9Search
• Documentation and articles:
  • http://aka.ms/usql_reference
  • https://azure.microsoft.com/en-us/documentation/services/data-lake-analytics/
  • https://msdn.microsoft.com/en-us/magazine/mt614251
• ADL forums and feedback:
  • http://aka.ms/adlfeedback
  • https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
  • http://stackoverflow.com/questions/tagged/u-sql
![Page 26: Killer Scenarios with Data Lake in Azure with U-SQL](https://reader036.fdocuments.in/reader036/viewer/2022062523/5871643f1a28ab58758b505d/html5/thumbnails/26.jpg)
© 2016 Microsoft Corporation. All rights reserved.