Microsoft's Big Play for Big Data - Visual Studio Live! NY 2012
![Page 1: Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012](https://reader033.fdocuments.in/reader033/viewer/2022061303/5492ec72ac7959182e8b473b/html5/thumbnails/1.jpg)
Microsoft's Big Play for Big Data
Andrew J. Brust
CEO and Founder, Blue Badge Insights
Level: Intermediate
![Page 2: Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012](https://reader033.fdocuments.in/reader033/viewer/2022061303/5492ec72ac7959182e8b473b/html5/thumbnails/2.jpg)
Meet Andrew
• CEO and Founder, Blue Badge Insights
• Big Data blogger for ZDNet
• Microsoft Regional Director, MVP
• Co-chair VSLive! and 17 years as a speaker
• Founder, Microsoft BI User Group of NYC
– http://www.msbinyc.com
• Co-moderator, NYC .NET Developers Group
– http://www.nycdotnetdev.com
• “Redmond Review” columnist for Visual Studio Magazine and Redmond Developer News
• brustblog.com, Twitter: @andrewbrust
![Page 3: Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012](https://reader033.fdocuments.in/reader033/viewer/2022061303/5492ec72ac7959182e8b473b/html5/thumbnails/3.jpg)
My New Blog (bit.ly/bigondata)
![Page 4: Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012](https://reader033.fdocuments.in/reader033/viewer/2022061303/5492ec72ac7959182e8b473b/html5/thumbnails/4.jpg)
Read All About It!
![Page 5: Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012](https://reader033.fdocuments.in/reader033/viewer/2022061303/5492ec72ac7959182e8b473b/html5/thumbnails/5.jpg)
What is Big Data?
• 100s of TB into PB and higher
• Involving data from: financial data, sensors, Web logs, social media, etc.
• Distributed/parallel processing often involved
– Hadoop is emblematic, but other technologies are Big Data too
• Processing of data sets too large for transactional databases
– Analyzing interactions, rather than transactions
– The three V’s: Volume, Velocity, Variety
• Big Data tech sometimes imposed on small data problems
![Page 6: Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012](https://reader033.fdocuments.in/reader033/viewer/2022061303/5492ec72ac7959182e8b473b/html5/thumbnails/6.jpg)
What’s MapReduce?
• “Big” input data as key-value pair series
• Partition the data and send to mappers (nodes in cluster)
• Mappers pre-aggregate by key, then all output for (a) given key(s) goes to a reducer
• Reducer completes aggregations; one output per key, with value
• Map and Reduce code natively written as Java functions
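The flow in these bullets can be sketched in a few lines of plain Python. This is a single-process illustration, not Hadoop code: the function names (`map_partition`, `shuffle`, `reduce_key`, `run_job`) and the word-count task are assumptions chosen for the example, and real Hadoop jobs are natively written in Java.

```python
from collections import Counter, defaultdict


def map_partition(lines):
    """Mapper: emit (word, count) pairs, pre-aggregated within one partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return list(counts.items())


def shuffle(mapper_outputs):
    """Group every mapper's output so all values for a key land together."""
    grouped = defaultdict(list)
    for pairs in mapper_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_key(key, values):
    """Reducer: finish the aggregation, one output per key."""
    return key, sum(values)


def run_job(lines, num_mappers=3):
    # Partition the input; each slice stands in for the data sent to one mapper.
    partitions = [lines[i::num_mappers] for i in range(num_mappers)]
    mapped = [map_partition(p) for p in partitions]
    grouped = shuffle(mapped)
    return dict(reduce_key(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    data = ["big data is big", "hadoop is emblematic", "data data data"]
    print(run_job(data))
```

On a real cluster the mappers would run in parallel on separate nodes; here they run one after another, which changes the speed but not the result.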
![Page 7: Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012](https://reader033.fdocuments.in/reader033/viewer/2022061303/5492ec72ac7959182e8b473b/html5/thumbnails/7.jpg)
MapReduce, in a Diagram
[Diagram: Input is partitioned across six mappers; mapper Output is grouped by key (K1, K2, K3) and routed as Input to three reducers, each of which emits Output.]
![Page 8: Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012](https://reader033.fdocuments.in/reader033/viewer/2022061303/5492ec72ac7959182e8b473b/html5/thumbnails/8.jpg)
What’s a Distributed File System?
• One where data gets distributed over commodity drives on commodity servers
• Data is replicated
• If one box goes down, no data lost
– Except the name node = SPOF!
• BUT: HDFS is immutable
– Files can only be written to once
– So updates require drop + re-write (slow)
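The replication idea can be illustrated with a toy model. This is a sketch under stated assumptions, not how HDFS actually places blocks: the round-robin placement, the replication factor of 3, and the names `place_blocks` and `lost_blocks` are all hypothetical choices made for the example.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumption: round-robin placement is for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block in range(num_blocks):
        start = block % len(nodes)
        placement[block] = {
            nodes[(start + r) % len(nodes)] for r in range(replication)
        }
    return placement


def lost_blocks(placement, failed_nodes):
    """Return the blocks whose every replica sat on a failed node."""
    failed = set(failed_nodes)
    return sorted(b for b, holders in placement.items() if holders <= failed)


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3", "node4", "node5"]
    placement = place_blocks(10, nodes)
    print(lost_blocks(placement, ["node2"]))                    # one box down
    print(lost_blocks(placement, ["node1", "node2", "node3"]))  # three boxes down
```

In this toy model the `placement` dictionary plays the role of the name node's metadata: the replicas survive a single node failure, but nothing can be located if that one map is lost.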
![Page 9: Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012](https://reader033.fdocuments.in/reader033/viewer/2022061303/5492ec72ac7959182e8b473b/html5/thumbnails/9.jpg)
Hadoop = MapReduce + HDFS
• Modeled after Google MapReduce + GFS
• Have more data? Just add more nodes to cluster.
– Mappers execute in parallel
– Hardware is commodity
– “Scaling out”
• Use of HDFS means data may well be local to mapper processing
• So, not just parallel, but minimal data movement, which avoids network bottlenecks
![Page 10: Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012](https://reader033.fdocuments.in/reader033/viewer/2022061303/5492ec72ac7959182e8b473b/html5/thumbnails/10.jpg)
What’s NoSQL?
• Databases that are non-relational (don’t let name fool you, some actually use SQL)
• Four kinds:
– Key-Value Store
  Schema-free
  FYI: Azure Table Storage is an example
– Document Store
  All data stored in JSON objects
– Wide-Column Store
  Define column families, but not columns
– Graph database
  Manage relationships between objects
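The four kinds differ mainly in the shape of the data they hold. The sketch below models each shape with ordinary Python structures; it is an illustration only, and every key, field, and name in it (including the `neighbors` helper) is hypothetical rather than taken from any particular product.

```python
import json

# Key-value store: an opaque value looked up by key; no schema is enforced.
key_value = {"user:42": b"\x00\x01binary-or-any-value"}

# Document store: each record is a self-describing JSON document.
document = json.loads('{"_id": 42, "name": "Ada", "tags": ["admin", "beta"]}')

# Wide-column store: column families are fixed up front,
# but the columns inside a family can differ from row to row.
wide_column = {
    "row-1": {"profile": {"name": "Ada"}, "activity": {"2012-05-01": "login"}},
    "row-2": {"profile": {"name": "Bob", "city": "NYC"}, "activity": {}},
}

# Graph database: nodes plus explicit relationships (edges) between them.
graph_edges = [("Ada", "FOLLOWS", "Bob"), ("Bob", "WORKS_AT", "Contoso")]


def neighbors(edges, node):
    """Return (relationship, target) pairs leaving `node`."""
    return [(rel, dst) for src, rel, dst in edges if src == node]


if __name__ == "__main__":
    print(neighbors(graph_edges, "Ada"))
```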
![Page 11: Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012](https://reader033.fdocuments.in/reader033/viewer/2022061303/5492ec72ac7959182e8b473b/html5/thumbnails/11.jpg)
What’s HBase?
• A Wide-Column Store
• Modeled after Google BigTable
• Born at Powerset in 2007
– Powerset acquired by Microsoft in 2008
– Adopted in 2010 by Facebook for messaging platform
• Uses HDFS
– Therefore, Hadoop-compatible
• Hadoop often used with HBase
– But you can use either without the other
![Page 12: Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012](https://reader033.fdocuments.in/reader033/viewer/2022061303/5492ec72ac7959182e8b473b/html5/thumbnails/12.jpg)
The Hadoop Stack
• Hadoop
– MapReduce, HDFS
• HBase
– Lesser extent: Cassandra, HyperTable
• Hive, Pig
– SQL-like “data warehouse” system
– Data transformation language
• Sqoop
– Import/export between HDFS, HBase, Hive and relational data warehouses
• Flume
– Log file integration
• Mahout
– Data Mining
![Page 13: Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012](https://reader033.fdocuments.in/reader033/viewer/2022061303/5492ec72ac7959182e8b473b/html5/thumbnails/13.jpg)
What’s Hive?
• Began as Hadoop sub-project
– Now top-level Apache project
• Provides a SQL-like (“HiveQL”) abstraction over MapReduce
• Has its own HDFS table file format (and it’s fully schema-bound)
• Can also work over HBase
• Acts as a bridge to many BI products which expect tabular data
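To show what a SQL-like abstraction buys you, the sketch below expresses a per-key count as a single declarative query instead of hand-written map and reduce functions. It uses Python's built-in `sqlite3` module purely as a stand-in so that the example runs anywhere; it does not connect to Hive, and the `weblogs` table, its columns, and the `count_hits` helper are hypothetical. A simple aggregate of this general shape is the kind of statement Hive would turn into MapReduce work.

```python
import sqlite3

# A plain GROUP BY aggregate; the table and column names are assumptions.
QUERY = """
SELECT url, COUNT(*) AS hits
FROM weblogs
GROUP BY url
ORDER BY url
"""


def count_hits(rows):
    """Load (url, status) rows into an in-memory table and aggregate by url."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE weblogs (url TEXT, status INTEGER)")
        conn.executemany("INSERT INTO weblogs VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [("/home", 200), ("/about", 200), ("/home", 404)]
    print(count_hits(sample))
```

The result comes back as rows and columns, which is the tabular form that BI products expect.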
![Page 14: Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012](https://reader033.fdocuments.in/reader033/viewer/2022061303/5492ec72ac7959182e8b473b/html5/thumbnails/14.jpg)
Hadoop Distributions
• Cloudera
• Hortonworks
– HCatalog: Hive/Pig/MR Interop
• MapR
– Network File System replaces HDFS
• IBM InfoSphere BigInsights
– HDFS<->DB2 integration
• And now Microsoft…
![Page 15: Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012](https://reader033.fdocuments.in/reader033/viewer/2022061303/5492ec72ac7959182e8b473b/html5/thumbnails/15.jpg)
Project “Isotope”
• Work with Hortonworks to create “distro” of Hadoop that runs on Windows Server and Windows Azure
– Hortonworks are ex-Yahoo FTEs who are Hadoop pioneers
• Create ODBC Driver for Hive
– And Excel Add-In that uses it
• Build JavaScript command line and MapReduce framework
• Contribute it all back to open source Apache project
![Page 16: Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012](https://reader033.fdocuments.in/reader033/viewer/2022061303/5492ec72ac7959182e8b473b/html5/thumbnails/16.jpg)
Hadoop on Azure
• Install onto your own Azure VMs and build a cluster, or…
• Provision a cluster in one step
– Give it a name
– Choose number of nodes and storage size in cluster
– Wait for it to provision
– Go!
![Page 17: Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012](https://reader033.fdocuments.in/reader033/viewer/2022061303/5492ec72ac7959182e8b473b/html5/thumbnails/17.jpg)
Provisioning a Cluster
![Page 18: Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012](https://reader033.fdocuments.in/reader033/viewer/2022061303/5492ec72ac7959182e8b473b/html5/thumbnails/18.jpg)
Submitting, Running and Monitoring Jobs
• Upload a JAR
• Use .NET
• Use the JavaScript Console
• Use the Hive Console
![Page 19: Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012](https://reader033.fdocuments.in/reader033/viewer/2022061303/5492ec72ac7959182e8b473b/html5/thumbnails/19.jpg)
Running MapReduce Jobs
![Page 20: Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012](https://reader033.fdocuments.in/reader033/viewer/2022061303/5492ec72ac7959182e8b473b/html5/thumbnails/20.jpg)
Hadoop on Azure Data Sources
• Files in HDFS
• Azure Blob Storage
• Amazon S3 Storage
• Hive Tables
• HBase?
![Page 21: Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012](https://reader033.fdocuments.in/reader033/viewer/2022061303/5492ec72ac7959182e8b473b/html5/thumbnails/21.jpg)
Review: ODBC Connection Types
• Registry-based
– User Data Source Name (DSN)
– System DSN
• File-based
– File DSN
• String-based
– DSN-less connection
• We need file-based
• Wizard obfuscates how to do this
• Don’t forget to open the ODBC port!
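The three connection types correspond to three shapes of ODBC connection string, built around the standard `DSN`, `FILEDSN`, and `DRIVER` keywords. The sketch below only assembles the strings; it opens no connection. The helper names, the driver name, the file path, and the `Host`/`Port` attributes are placeholders assumed for illustration, since the attributes a driver accepts are driver-specific.

```python
def user_or_system_dsn(name):
    """Registry-based: the DSN name is resolved from the user/system registry."""
    return f"DSN={name};"


def file_dsn(path):
    """File-based: settings are read from a .dsn file on disk."""
    return f"FILEDSN={path};"


def dsn_less(driver, **attributes):
    """String-based: the driver and every attribute are spelled out inline."""
    parts = [f"DRIVER={{{driver}}}"]
    parts += [f"{key}={value}" for key, value in attributes.items()]
    return ";".join(parts) + ";"


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    print(user_or_system_dsn("MyHiveDsn"))
    print(file_dsn(r"C:\dsn\hive.dsn"))
    print(dsn_less("Hypothetical Hive Driver", Host="example", Port=10000))
```

A file DSN keeps the settings in a portable file rather than in one machine's registry.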
![Page 22: Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012](https://reader033.fdocuments.in/reader033/viewer/2022061303/5492ec72ac7959182e8b473b/html5/thumbnails/22.jpg)
Hive ODBC Setup, Excel Add-In
![Page 23: Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012](https://reader033.fdocuments.in/reader033/viewer/2022061303/5492ec72ac7959182e8b473b/html5/thumbnails/23.jpg)
How Does SQL Server Fit In?
• RDBMS + PDW: Sqoop connectors
• RDBMS: Columnstore Indexes
– Enterprise Edition only
• Analysis Services: Tabular Mode
– Compatible with ODBC Driver
  Multidimensional mode is not
• RDBMS + SSAS Tabular: DirectQuery
• PowerPivot (as with SSAS Tabular)
• Power View
– Works against PowerPivot and SSAS Tabular
![Page 24: Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012](https://reader033.fdocuments.in/reader033/viewer/2022061303/5492ec72ac7959182e8b473b/html5/thumbnails/24.jpg)
Querying Hadoop from SQL Server BI
![Page 25: Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012](https://reader033.fdocuments.in/reader033/viewer/2022061303/5492ec72ac7959182e8b473b/html5/thumbnails/25.jpg)
The “Data-Refinery” Idea
• Use Hadoop to “on-board” unstructured data, then extract manageable subsets
• Load the subsets into conventional DW/BI servers and use familiar analytics tools to examine
• This is the current rationalization of Hadoop + BI tools’ coexistence
• Will it stay this way?
![Page 26: Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012](https://reader033.fdocuments.in/reader033/viewer/2022061303/5492ec72ac7959182e8b473b/html5/thumbnails/26.jpg)
Usability Impact
• PowerPivot makes analysis much easier, self-service
• Power View is great for discovery and visualization; also self-service
• Combine with the Hive ODBC driver and suddenly Hadoop is accessible to business users
• Caveats
– Someone has to write the HiveQL
– Can query Big Data, but must have smaller result
![Page 27: Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012](https://reader033.fdocuments.in/reader033/viewer/2022061303/5492ec72ac7959182e8b473b/html5/thumbnails/27.jpg)
Other Relevant MS Technologies
• SQL Server Components:
– SQL Server Parallel Data Warehouse
– StreamInsight
• Azure Components:
– Data Explorer
– DataMarket
• Deprecated MSR Project
– Dryad
![Page 28: Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012](https://reader033.fdocuments.in/reader033/viewer/2022061303/5492ec72ac7959182e8b473b/html5/thumbnails/28.jpg)
Resources
• Big On Data blog
– http://www.zdnet.com/blog/big-data
• Apache Hadoop home page
– http://hadoop.apache.org/
• Hive & Pig home pages
– http://hive.apache.org/
– http://pig.apache.org/
• Hadoop on Azure home page
– https://www.hadooponazure.com/
• SQL Server 2012 Big Data
– http://bit.ly/sql2012bigdata
![Page 29: Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012](https://reader033.fdocuments.in/reader033/viewer/2022061303/5492ec72ac7959182e8b473b/html5/thumbnails/29.jpg)
Thank you
• [email protected]
• @andrewbrust on twitter
• Want to get the free “Redmond Roundup Plus?”
– Text “bluebadge” to 22828