Data Liberty
![Page 1: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/1.jpg)
Data Liberty
Alternatives to the shackles of limited scale in data solutions
Andy Cross
Windows Azure MVP
Elastacloud
![Page 2: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/2.jpg)
Thank you, sponsors!
![Page 3: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/3.jpg)
The Cloud for Modern Business
Grab your benefit
aka.ms/azuretry
Deploy fast in the cloud, scale elastically and minimize test cost
Activate your Windows Azure MSDN benefit at no additional charge
aka.ms/msdnsubscr
![Page 4: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/4.jpg)
Tell everyone I’m awesome #cloudbrew
I’m @andybareweb
Social Media
![Page 5: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/5.jpg)
Data value at scale requires technology choices:
• often prioritising data read traversal over operational characteristics of create/update/delete
• embracing hybrid data platforms with varied technology partners over homogenous estates
• establishing alternative skillsets, augmented with entrenched languages, trusting cloud over maintenance
• following robust engineering processes to provide rigour in a deterministic world
![Page 6: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/6.jpg)
Bravery leads to rewards:
• the winners will have data which shows them that they’ve won
• the commoditised query turns energy sucking data silos into profit centres
• new data traversal mechanisms lead to new connotative data expression
• everything you already know is relevant and valid; the constraints on how it is applied are not
![Page 7: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/7.jpg)
Most developers have heard of Big Data. I’m going to show how Microsoft are increasingly relevant in this space.
My talk is about architecture and approach. Note we’re talking Big Data and not strictly Data Science.
But context is always worthwhile, so let’s start with the history.
WHAT’S A DATA SCIENTIST’S FAVOURITE LANGUAGE?
![Page 8: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/8.jpg)
IBM have been a leader in Big Data for years.
Wikimedia Commons
![Page 9: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/9.jpg)
We’re not as great as we’d hope; we’re often still bound by our ability to marshal our IO.
Just as the speed of loading punchcards was historically a limiting factor, we are now limited by our capacity to ingest data on individual machines.
This leads to ideas such as distributed file systems (DFS) and data locality.
![Page 10: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/10.jpg)
![Page 11: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/11.jpg)
As data evolved we eventually moved to client/server, which was a big step up from dBase and its contemporaries.
Fundamentally, however, the tabular, structured nature of data poses many challenges; not least the long-term effects of normalisation, which trades efficient storage in the short term for the compute that is later required to reconstruct sets.
This eventually leads to such ideas as NoSQL document and entity stores.
![Page 12: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/12.jpg)
![Page 13: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/13.jpg)
Modelling data is a constant challenge. Our world is highly connected and our brains are effective connectors of data. Real-world data fits poorly into highly structured data sets.
This leads to semi-structured and unstructured data formats, and to querying data by traversing relationships.
![Page 14: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/14.jpg)
The technologies shown today are primarily written in non-.NET and non-Microsoft languages and frameworks. Whenever we look at one, I’ll show examples ONLY in the .NET and Microsoft stacks.
There are obviously challenges beyond language in running the alternative stacks; but remember that in the Cloud you aren’t responsible for tuning a Linux cluster which has been running for 5 years. Provision a cluster only for as long as it is unlikely to need routine maintenance.
![Page 15: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/15.jpg)
Hadoop – KEY FACTS
Open Source; Apache Software Foundation.
Java.
MapReduce framework for job distribution; Distributed File System for file access.
In Windows Azure this is known as HDInsight.
![Page 16: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/16.jpg)
Hadoop is O(n)
It exhibits linear performance; when the dataset doubles, the time taken to execute the algorithm doubles.
![Page 17: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/17.jpg)
Let’s look at some scary Java
Any children should look away now.
![Page 18: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/18.jpg)
![Page 19: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/19.jpg)
![Page 20: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/20.jpg)
Hadoop SDK
C# integration
Remote Data & Jobs
Hive in C#
Serialization
![Page 21: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/21.jpg)
Jobs
public class SwedishSessionsJob : HadoopJob<SwedishSessionsMapper, SessionsReducer>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        // Read every gzipped session file and write the results to /SwedishSessions/.
        var config = new HadoopJobConfiguration()
        {
            InputPath = "\"/AllSessions/*.gz\"",
            OutputFolder = "/SwedishSessions/"
        };
        return config;
    }
}
![Page 22: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/22.jpg)
Mapper
public class SwedishSessionsMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        // Count and emit one record for every session line from Sweden.
        if (inputLine.Contains("Country=Sweden"))
        {
            context.IncrementCounter("SwedishSession");
            context.EmitKeyValue("SE", "1");
        }
    }
}
![Page 23: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/23.jpg)
Reducer
public class SessionsReducer : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values, ReducerContext context)
    {
        // Emit the count as a string; Count() needs "using System.Linq;".
        context.EmitKeyValue(key, values.Count().ToString());
    }
}
![Page 24: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/24.jpg)
Testing Hadoop Queries
var inputData = "Country=Sweden&Name=Magnus";
var result = StreamingUnit.Execute<Jobs.SwedishJob>(new[] { inputData });
Assert.AreEqual("SE\t1", result.ReducerResult.First());
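The mapper, reducer and test above can be imitated in plain C# to see what the framework does between the two steps. This is a minimal sketch that uses only LINQ and no Hadoop SDK types; the class name `InMemoryMapReduce`, its methods and the extra input lines are hypothetical, made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a MapReduce run; no Hadoop SDK types are used.
public static class InMemoryMapReduce
{
    // Map: emit ("SE", "1") for every line that mentions Sweden.
    public static IEnumerable<KeyValuePair<string, string>> Map(string inputLine)
    {
        if (inputLine.Contains("Country=Sweden"))
        {
            yield return new KeyValuePair<string, string>("SE", "1");
        }
    }

    // Reduce: count the values that were emitted for one key.
    public static string Reduce(string key, IEnumerable<string> values)
    {
        return key + "\t" + values.Count();
    }

    // Run: map every line, group by key (the "shuffle"), then reduce each group.
    public static List<string> Run(IEnumerable<string> inputLines)
    {
        return inputLines
            .SelectMany(line => Map(line))
            .GroupBy(pair => pair.Key, pair => pair.Value)
            .Select(group => Reduce(group.Key, group))
            .ToList();
    }
}

public static class MapReduceDemo
{
    public static void Main()
    {
        // Made-up sample input; only the first line comes from the slide above.
        var lines = new[]
        {
            "Country=Sweden&Name=Magnus",
            "Country=Belgium&Name=Maarten",
            "Country=Sweden&Name=Anna"
        };
        foreach (var line in InMemoryMapReduce.Run(lines))
        {
            Console.WriteLine(line); // prints SE and 2 separated by a tab
        }
    }
}
```

Every input line passes through `Map` exactly once, which is where the linear behaviour comes from: twice the lines means twice the map calls.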
![Page 25: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/25.jpg)
Skill reuse
• Express elegant solutions in C#
  Your existing development team can immediately realise value
• Familiar Unit Testing patterns
  The frameworks facilitate deterministic testing for highly reliable queries
• Concise programmatic terseness
  Complex logic is best expressed in programmatic form
![Page 26: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/26.jpg)
Commoditised query
[Figure: Value and Cost plotted against Time across three stages: Provision, Execute, De-provision. Series: Action Cost, Value of query.]
![Page 27: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/27.jpg)
HDInsight wins
Automated provisioning and job execution services.
Transient clusters limit exposure to the poorly tooled* Java estate.
Persistence with Windows Azure Blob Storage as an HDFS proxy, known as Azure Storage Vault (ASV).
Persistence in Windows Azure SQL Database for the Hive Metastore.
JavaScript console.
* Tools are great but not friendly
![Page 28: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/28.jpg)
NoSQL Document and Entity Stores
Examples in MongoDB
Entity stores are similar; you can find a great example in Windows Azure Table Storage
![Page 29: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/29.jpg)
What is a document database?
Relational Database | Document Database
{
  "_id" : ObjectId("51fccc57f82352d76653bdae"),
  "Name" : { "FirstName" : "Owen", "LastName" : "Grzegorek" },
  "Company" : "Howard Miller Co",
  "Address" : {
    "Line1" : "15410 Minnetonka Industrial Rd",
    "Line2" : "Minnetonka",
    "Line3" : "Hennepin",
    "Line4" : "MN",
    "Line5" : "55345"
  },
  "ContactDetails" : {
    "Phone" : "952-939-2973",
    "Fax" : "952-939-4663",
    "Email" : "[email protected]",
    "Web" : "http://www.owengrzegorek.com"
  }
}
{ "Name" : "Richard Conway", "Books Published" : "12", "Specialises in" : "Data Science" }
{ "Name" : "Andy Cross", "Hometown" : "Blackpool" }
{ "Name" : "Isaac Abraham", "Age" : "33", "Football Team" : "Tottenham", "Icon" : <image> }
![Page 30: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/30.jpg)
MongoDB Key Facts
• General Purpose Operational Database
  • Real-time updates, ad-hoc queries and batch processing
  • Maps nicely with popular programming models e.g. .NET
  • Schema-free documents – lightweight and quick to get up and running
• High Performance
  • Embedding documents – no expensive joins across tables
  • Indexes allow query optimization
  • High-speed saving of data (writes)
• High Availability
  • Built in replication
  • Built in failover
• Easy Scalability
  • “Sharding” allows easily spreading data across multiple databases
  • Replicated data can be spread throughout the cluster
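To make “spreading data across multiple databases” concrete, here is a minimal sketch of hash-based routing: hash a shard key, then take the remainder to pick a shard. It illustrates the general idea only and is not MongoDB's actual algorithm; `ShardRouter`, `Fnv1a` and `ShardFor` are hypothetical names, and the choice of the FNV-1a hash is an assumption made so the result is deterministic.

```csharp
using System;
using System.Text;

// Hypothetical illustration of hash-based shard routing; not MongoDB's real algorithm.
public static class ShardRouter
{
    // FNV-1a: a small deterministic string hash (string.GetHashCode can differ between runs).
    public static uint Fnv1a(string text)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (byte b in Encoding.UTF8.GetBytes(text))
            {
                hash ^= b;
                hash *= 16777619;
            }
            return hash;
        }
    }

    // Pick a shard index in the range [0, shardCount) for a given shard key.
    public static int ShardFor(string shardKey, int shardCount)
    {
        if (shardCount <= 0) throw new ArgumentOutOfRangeException("shardCount");
        return (int)(Fnv1a(shardKey) % (uint)shardCount);
    }
}

public static class ShardDemo
{
    public static void Main()
    {
        // The same key always routes to the same shard, so reads know where to look.
        foreach (var name in new[] { "Owen Grzegorek", "Richard Conway", "Andy Cross" })
        {
            Console.WriteLine(name + " -> shard " + ShardRouter.ShardFor(name, 4));
        }
    }
}
```

The design point is that routing needs no lookup table: any client that knows the key and the shard count can compute where a document lives.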
![Page 31: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/31.jpg)
MongoDB is O(log n)
It exhibits logarithmic performance; when the dataset doubles, the time taken to execute the algorithm increases by a fixed amount.
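The “fixed amount” is easy to see with a sorted lookup. The sketch below assumes the query is served from a sorted index and uses binary search over a sorted array as a stand-in for that index; `SortedIndexLookup` and `Find` are hypothetical names, and no MongoDB driver calls are involved.

```csharp
using System;
using System.Linq;

// Hypothetical stand-in for an index lookup: binary search over sorted keys.
public static class SortedIndexLookup
{
    // Returns the position of 'target' or -1; 'steps' reports how many keys were inspected.
    public static int Find(int[] sortedKeys, int target, out int steps)
    {
        steps = 0;
        int low = 0, high = sortedKeys.Length - 1;
        while (low <= high)
        {
            steps++;
            int mid = low + (high - low) / 2;
            if (sortedKeys[mid] == target) return mid;
            if (sortedKeys[mid] < target) low = mid + 1;
            else high = mid - 1;
        }
        return -1; // not found
    }
}

public static class LookupDemo
{
    public static void Main()
    {
        foreach (int size in new[] { 1024, 2048, 4096 })
        {
            int[] keys = Enumerable.Range(0, size).ToArray();
            int steps;
            // Search for a key larger than every element: the worst case.
            SortedIndexLookup.Find(keys, size, out steps);
            Console.WriteLine(size + " keys -> " + steps + " steps");
        }
        // prints:
        // 1024 keys -> 11 steps
        // 2048 keys -> 12 steps
        // 4096 keys -> 13 steps
    }
}
```

Each doubling of the key count adds exactly one step to the worst case.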
![Page 32: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/32.jpg)
Strengths of MongoDB
• Good fit for .NET developers
  • Low barrier to entry
  • Uses well-known .NET technologies e.g. LINQ
  • Good migration path from SQL-style development
• Flexible
  • Works well as operational data store
  • Batch processing capability for map reduce
• Designed for scalability
  • Massively scalable with well-defined replication model
  • Self-managing – easily add new nodes
  • High performance writes and eventually consistent reads
• Low cost
  • Database is free to use (tooling is not!)
  • Popular, so a relatively large community
![Page 33: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/33.jpg)
Mongo SDK
There are many different ways to connect to MongoDB from a .NET project.
Official
Wrapper
Alternative
Tool
![Page 34: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/34.jpg)
C# implementations
If your data is regularly structured, you can use domain classes:
public class Book
{
    public string Author { get; set; }
    public string Title { get; set; }
}

// "books" is the name of the collection
var books = database.GetCollection<Book>("books");
Book book = new Book
{
    Author = "Ernest Hemingway",
    Title = "For Whom the Bell Tolls"
};
books.Insert(book);
![Page 35: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/35.jpg)
C# implementations
If your data is irregularly structured or semi-structured, you can use a BSON object model:
BsonDocument person = new BsonDocument
{
    { "name", "John Doe" },
    { "address", new BsonDocument
        {
            { "street", "123 Main St." },
            { "city", "Centerville" },
            { "state", "PA" },
            { "zip", 12345 }
        }
    }
};
var people = database.GetCollection<BsonDocument>("people");
people.Insert(person);
![Page 36: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/36.jpg)
NoSQL Document Wins
Semi-structured data first class citizen
Built in MapReduce
Operational and interactive
Massively scalable
![Page 37: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/37.jpg)
Graph Databases, Neo4j KEY FACTS
Open Source; Neo Technology
Java
Runs equally well on Windows or Linux. In Windows Azure there are VM Depot images that can be deployed in a few simple steps. Additionally, the Azure Linux VMs are a good fit for this database engine.
There is an open source .NET SDK available through NuGet, actively maintained primarily by an Australian company, Readify.
![Page 38: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/38.jpg)
Neo4j is O(1)
It exhibits constant-time performance; that is, the algorithm takes the same time to execute irrespective of the size of the dataset.
![Page 39: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/39.jpg)
How O(1)?
• Graphs don’t have tables. They don’t have collections.
• They have nodes and relationships.
• Rather than having to select out a whole table, we can identify a point on the graph
  • A start point
• Follow the traversal of relationships from that point.
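The points above can be sketched with an in-memory graph where each node holds direct references to its neighbours. This is an illustration of the traversal idea only, not how Neo4j stores data; `PersonNode`, `GraphTraversal` and every name except magnus are hypothetical, and excluding the start node from the result is a choice made for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory graph: each node holds direct references to its neighbours.
public class PersonNode
{
    public string Name { get; private set; }
    public List<PersonNode> Friends { get; private set; }

    public PersonNode(string name)
    {
        Name = name;
        Friends = new List<PersonNode>();
    }
}

public static class GraphTraversal
{
    // Follow FRIENDS relationships two hops out from the start node.
    // The work done depends on the start node's neighbourhood, not on the total graph size.
    public static List<string> FriendsOfFriends(PersonNode start)
    {
        return start.Friends
            .SelectMany(friend => friend.Friends)
            .Where(candidate => candidate != start)
            .Select(candidate => candidate.Name)
            .Distinct()
            .ToList();
    }
}

public static class TraversalDemo
{
    public static void Main()
    {
        var magnus = new PersonNode("magnus");
        var anna = new PersonNode("anna");
        var erik = new PersonNode("erik");
        var lars = new PersonNode("lars");
        magnus.Friends.Add(anna);
        magnus.Friends.Add(erik);
        anna.Friends.Add(lars);
        erik.Friends.Add(lars);
        erik.Friends.Add(magnus);

        // prints "lars"
        Console.WriteLine(string.Join(", ", GraphTraversal.FriendsOfFriends(magnus)));
    }
}
```

Adding a million unrelated nodes elsewhere in the graph would not change the work done here, because the traversal never visits them.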
![Page 40: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/40.jpg)
http://www.apcjones.com/arrows/#
![Page 41: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/41.jpg)
Things we can do
• Find all the things formed in Sweden
START sweden = node:countryIdx(country = "Sweden")
MATCH sweden<-[:FORMED_IN]-something
RETURN something;
• Find friends of friends
START magnus = node:peopleIdx(name = "magnus")
MATCH magnus-[:FRIENDS]->friend-[:FRIENDS]->friendoffriend
RETURN friendoffriend;
![Page 42: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/42.jpg)
NEO4J Client
Open source Neo4j Client
![Page 43: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/43.jpg)
C# examples
var query = neo4Jclient.Cypher
    .Start(new { magnus = Node.ByIndexLookup("peopleIdx", "name", "magnus") })
    .Match("magnus-[:FRIENDS]->friend-[:FRIENDS]->friendoffriend")
    .Return<Node<Friend>>("friendoffriend");
![Page 44: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/44.jpg)
Graph Database Wins
• Modelled domains match cognitive processes
• Optimisation for traversal of relationships allows complex and “social” queries to emerge
  • LIKES of FRIENDS of COLLEAGUES
• O(1) performance characteristics due to the ability to START queries at arbitrary graph points.
![Page 45: Data Liberty](https://reader037.fdocuments.in/reader037/viewer/2022110215/568168d0550346895ddfbe3b/html5/thumbnails/45.jpg)
Summary
• HDInsight brings Hadoop to Azure
  • Suited to Data Volume, Variety, Variability etc
• MongoDB brings Document stores
  • Suited to Data Volume, Operational concerns
• Neo4j brings Graph databases
  • Suited to data relationship traversal