What Is Hadoop And Why Deploy It In the Cloud?
Agenda
• What Is Hadoop?
• Why Deploy To the Cloud?
• Microsoft’s Solution
• How Do I Get Started?
Breaking points of traditional approach
1. Increasing data volumes
• 50x data growth, 2010-2020
• 40 ZB digital universe, 2020
• 1 trillion web pages
2. Real-time data
• 204M emails sent every minute
• 340M tweets sent every day
• 231B US e-commerce in 2012
3. New data types
• 15x machine-generated data, 2020
• 1.3M hours on Skype per hour
• 2.4M Facebook content per minute
4. Cloud-born data
• $100B spend on cloud
• 50% of large orgs have hybrid by 2017
• 40% of CRM sold is SaaS
What if you could handle big data?
Figure: data volume (megabytes, gigabytes, terabytes, petabytes) against data complexity (variety and velocity), with three groups of data sources:
• ERP/CRM: payables, payroll, inventory, contacts, deal tracking, sales pipeline
• Web 2.0: web logs, digital marketing, search marketing, recommendations, advertising, mobile, collaboration, eCommerce
• Big Data: log files, spatial & GPS coordinates, data market feeds, eGov feeds, weather, text/image, click stream, wikis/blogs, sensors/RFID/devices, social sentiment, audio/video
Introducing Apache Hadoop
• Apache open source project
• Highly scalable distributed file system (HDFS)
• Distributed processing on data nodes
Data volume
Hadoop stores files in a distributed file system
• Storage and computation are distributed across many servers
• Files can be spread out over multiple nodes
Hadoop can store very large amounts of data
• Combined storage resource can grow with demand from a few nodes to thousands of nodes
• Scales out linearly
• Very large files supported, including those larger than the capacity of a single node
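To make the idea of spreading one file over many nodes concrete, here is a minimal Python sketch. It is not HDFS code: the 128 MB block size and the replication factor of 3 are assumed values (both are configurable in a real cluster), the round-robin placement is a simplification of HDFS's actual placement policy, and the function and node names are hypothetical.

```python
import math

BLOCK_SIZE = 128 * 1024**2   # assumed block size in bytes (128 MB)
REPLICATION = 3              # assumed number of copies kept of each block


def place_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Split a file into fixed-size blocks and assign each replica to a node.

    Placement is plain round-robin, a stand-in for a real placement
    policy. Returns one entry per block; each entry is the list of
    nodes that hold a replica of that block.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = math.ceil(file_size / block_size)
    placement = []
    for b in range(n_blocks):
        # Consecutive nodes, so replicas of one block never share a node.
        placement.append([nodes[(b + r) % len(nodes)] for r in range(replication)])
    return placement


if __name__ == "__main__":
    nodes = [f"node{i}" for i in range(1, 5)]   # hypothetical four-node cluster
    layout = place_blocks(1 * 1024**3, nodes)   # a 1 GB file
    print(len(layout), "blocks")
    print(layout[0])
```

Because each block is placed independently, a file can be larger than any single node's disk, and adding nodes adds both capacity and places to run computation.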
Data variety
Hadoop stores files (non-relational store)
• Files could have a variety of semi-structured or unstructured data
• Previously, these files may not have been seen as providing value or insights
• Today, new business questions and insights are being uncovered through data science
• Sentiment: Understand how your customers feel about your brand and products, right now
• Clickstream: Capture and analyze website visitors’ data trails and optimize your website
• Sensors: Discover patterns in data streaming automatically from remote sensors and machines
• Geographic: Analyze location-based data to manage operations where they occur
• Server logs: Research logs to diagnose process failures and prevent security breaches
• Unstructured: Understand patterns in files across millions of web pages, emails, and documents
Data velocity
Hadoop can stream live data and process it in real time
• Hadoop can serve as a scalable event-stream ingestion layer
• Hadoop can do near-real-time in-stream processing
Figure: data input (applications, devices, HTTP; incoming), event broker, stream processing, outgoing
Figure: Hadoop platform components by function
• Governance and integration (data workflow, lifecycle and governance): Falcon, Sqoop, Flume, NFS, WebHDFS
• Data access: Batch (MapReduce); Script (Pig); SQL (Hive/Tez, HCatalog); NoSQL (HBase, Accumulo); Stream (Storm); Search (Solr); Others (Spark, in-memory, ISV engines)
• YARN: data operating system
• Data management: HDFS (Hadoop Distributed File System), nodes 1 to N
• Security (authentication, authorization, accounting, data protection): Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox
• Operations: Provision, manage, and monitor (Ambari, ZooKeeper); Scheduling (Oozie)
Hadoop is a platform with a portfolio of projects
• Governed by the Apache Software Foundation (ASF)
• Comprises core services of MapReduce, HDFS, and YARN
• In addition to the core, includes functions across:
  • Data services, which allow you to manipulate and move data (Hive, HBase, Pig, Flume, Sqoop)
  • Operational services, which help manage the cluster (Ambari, Falcon, and Oozie)
A Hadoop distribution is a package of projects
• Tested for consistency across the entire package
Components: Knox; Tez; Pig; Hive and HCatalog; Phoenix; Accumulo; Storm; Mahout; Solr; Falcon; Sqoop; Flume; Ambari; Oozie; ZooKeeper; HBase; Hadoop and YARN
Categories: Data management; Data access; Governance and integration; Operations; Security
Component versions by HDP release:
HDP 2.0 October 2013 2.2.0 0.12.0 0.12.0 0.96.1 0.8.0 1.4.4 1.3.0 1.4.4 3.3.2 3.4.5 .0.4.0
HDP 1.3 May 2013 1.1.2 011.0 0.11.0 0.94.6 0.7.0 1.4.3 1.3.1 1.2.5 3.3.2 3.4.5 .0.4.0
HDP 2.1 April 2014 0.4.0 0.12.1 0.13.0 0.98.0 4.0.0 1.5.1 0.9.1 0.9.0 4.7.2 0.5.0 1.4.4 1.4.0 1.5.1 4.0.0 3.4.5 .0.4.02.4.0
With many contributors: 80 committers to the Hadoop core project
Business applications of Hadoop
• Retail: 360° view of the customer; analyze brand sentiment; localized, personalized promotions; website optimization; optimal store layout
• Financial services: new account risk screens; fraud prevention; trading risk; maximize deposit spread; insurance underwriting; accelerate loan processing
• Telecom: call detail records (CDRs); infrastructure investment; next product to buy (NPTB); real-time bandwidth allocation; new product development
• Utilities, oil, and gas: smart meter stream analysis; slow oil well decline curves; optimize lease bidding; compliance reporting; proactive equipment repair; seismic image processing
• Public sector: analyze public sentiment; protect critical networks; prevent fraud and waste; crowd-source reporting for repairs to infrastructure; fulfill open records requests
• Manufacturing: supplier consolidation; supply chain and logistics; assembly line quality assurance; proactive maintenance; crowd-source quality assurance
• Healthcare: genomic data for medical trials; monitor patient vitals; reduce re-admittance rates; store medical research data; recruit cohorts for pharmaceutical trials
New analytic applications from new data
Data types: Sentiment and web; Clickstream and behavior; Machine and sensor; Geographic; Server logs; Structured and unstructured
Industry use cases:
Financial services
• New account risk screens ✔ ✔
• Trading risk ✔
• Insurance underwriting ✔ ✔ ✔
Telecom
• Call detail records (CDR) ✔ ✔
• Infrastructure investment ✔ ✔
• Real-time bandwidth allocation ✔ ✔ ✔
Retail
• 360° view of the customer ✔ ✔ ✔
• Localized, personalized promotions ✔
• Website optimization ✔
Manufacturing
• Supply chain and logistics ✔
• Assembly line quality assurance ✔
• Crowd-sourced quality assurance ✔
Healthcare
• Use genomic data in medical trials ✔ ✔ ✔
• Monitor patient vitals in real-time
Pharmaceuticals
• Recruit and retain patients for drug trials ✔ ✔
• Improve prescription adherence ✔ ✔ ✔ ✔
Oil and gas
• Unify exploration and production data ✔ ✔ ✔ ✔
• Monitor rig safety in real-time ✔ ✔ ✔
Government
• ETL offload/federal budgetary pressures ✔ ✔
• Sentiment analysis for government programs ✔
Challenges with implementing Hadoop
• Up-front HW costs
• Capacity planning
• Hadoop expertise
Why Cloud + Big Data?
• Speed, scale, economics
• Always up, always on
• Open and flexible
• Time to value
• Data of all volume, variety, velocity
• Massive compute and storage
• Deployment expertise
Why Hadoop in the Cloud?
• No HW costs: $0
• Unlimited scale
• Pay what you need
• Deployed in minutes
Scenarios For Deploying Hadoop As Hybrid
• On-premises Hadoop: software, appliances
• Cloud: develop/POC
• Cloud: bursting
• Cloud: backup/archive
Introducing Azure HDInsight
Hadoop 2.2 and 2.4
Microsoft contributions to Hadoop
• Hadoop on Windows
• Hive 100x query speed-up
• 80% data compression with ORC
• HDFS in cloud (Azure)
• REEF for machine learning
• 30,000+ code line contributions
• 10,000+ engineering hours
• Committers to Hadoop
Microsoft + Hortonworks: Promoting Open Hadoop
• Engineering alignment
• Corporate alignment
• Field alignment
HDInsight Built for Windows or Linux
Customer choice
• Managed and supported by Microsoft
• Familiarity of Windows
• Re-use common tools, documentation, and samples from the Hadoop/Linux ecosystem
• Add Hadoop projects that were authored on Linux to HDInsight
• Easier transition from on-premises to cloud
HDInsight Supports Hive
SQL-like queries on Hadoop data in HDInsight
• HDInsight provides an easy-to-use graphical query interface for Hive
• HiveQL is a SQL-like language (subset of SQL)
• Hive structures include well-understood database concepts such as tables, rows, columns, partitions
• Compiled into MapReduce jobs that are executed on Hadoop
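As a rough illustration of what "compiled into MapReduce jobs" means, the Python sketch below evaluates a query of the shape `SELECT page, COUNT(*) FROM page_views GROUP BY page` as a map phase, a shuffle, and a reduce phase. The table, its columns, and the rows are hypothetical, and this is a conceptual model only, not the plan Hive actually generates.

```python
from collections import defaultdict

# Hypothetical rows of a "page_views" table: (user, page)
ROWS = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/cart"),
    ("carol", "/home"),
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed on the GROUP BY column."""
    for _user, page in rows:
        yield page, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Aggregate each key's values; here the aggregate is COUNT(*)."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_page(rows):
    """Run the three phases in order and return {page: count}."""
    return reduce_phase(shuffle(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_page(ROWS))
```

On a cluster, the map calls run in parallel on the nodes that hold the data and the shuffle moves each key's values to one reducer; the single-process version above only shows the data flow.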
Dramatic performance gains with Stinger/Tez
• Stinger is a Microsoft, Hortonworks, and OSS driven initiative to bring interactive queries to Hive
• Brings query execution engine technology from Microsoft SQL Server to Hive
• Performance gains up to 100x
Microsoft contribution to Apache code
Figure: sample query run time by release
• Hive 10: 1400s
• HDP 1.3 / Hive 11: 44.3s (32x speedup)
• HDP 2.0 (Hadoop 2.0): 35.1s (40x speedup)
• HDP 2.1: 15s (100x speedup)
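The speedup labels can be checked against the chart's timings (1400 s, 44.3 s, 35.1 s, and 15 s) with a few lines of Python. The pairing of each timing with a release is an assumption based on the order of the chart labels, and the names in the code are illustrative.

```python
# Timings in seconds as read from the chart; which release each one
# belongs to is an assumption based on the order of the labels.
BASELINE = 1400.0
TIMINGS = {
    "HDP 1.3 / Hive 11": 44.3,
    "HDP 2.0": 35.1,
    "HDP 2.1": 15.0,
}


def speedup(baseline, elapsed):
    """Return how many times faster `elapsed` is than `baseline`."""
    return baseline / elapsed


if __name__ == "__main__":
    for release, seconds in TIMINGS.items():
        print(f"{release}: {speedup(BASELINE, seconds):.1f}x")
```

Under that pairing, the first two ratios round to the 32x and 40x labels, while 1400 s against 15 s works out to roughly 93x, so the 100x figure reads as a rounded headline number.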
HDInsight Supports HBase
NoSQL database on data in HDInsight
• Columnar, NoSQL database
• Runs on top of the Hadoop Distributed File System (HDFS)
• Provides flexibility in that new columns can be added to column families at any time
Figure: Name Node and Job Tracker over four Data Nodes, each with a Task Tracker; HMaster and coordination over four Region Servers
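The flexibility of column families can be pictured with a toy in-memory model in Python. This is a sketch of the data model only, not the HBase client API; the class, the family names, and the values are hypothetical. It assumes, as in HBase, that column families are declared with the table while the columns inside a family are not.

```python
from collections import defaultdict


class ColumnFamilyTable:
    """Toy table: row key -> column family -> column qualifier -> value.

    Column families are declared when the table is created; column
    qualifiers inside a family can be introduced by any write.
    """

    def __init__(self, families):
        self.families = frozenset(families)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][family][qualifier] = value

    def get(self, row_key, family, qualifier, default=None):
        # Use .get throughout so that reads never create empty rows.
        return self.rows.get(row_key, {}).get(family, {}).get(qualifier, default)


if __name__ == "__main__":
    table = ColumnFamilyTable(families=["profile", "activity"])  # hypothetical names
    table.put("user1", "profile", "name", "Ada")
    # A later write adds a brand-new column without any schema change.
    table.put("user2", "profile", "twitter", "@example")
    print(table.get("user2", "profile", "twitter"))
    print(table.get("user1", "profile", "twitter"))
```

Rows that never wrote a given column simply have no entry for it, which is why sparse and evolving data fits this model well.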
HDInsight Supports Mahout
Machine learning library
• A library of machine learning algorithms to execute on data in HDFS
• Algorithms are not dependent on size of data and can scale with large datasets
• Library includes: collaborative filtering, classification, clustering, dimensionality reduction, topic models
HDInsight Supports Storm
Stream analytics for near-real-time processing
• Consumes millions of real-time events from a scalable event broker (e.g., Apache Kafka, Azure Event Hubs)
• Performs time-sensitive computation
• Outputs to persistent stores, dashboards, or devices
• Customizable with Java + .NET
• Deeply integrated with Visual Studio
Figure: event-processing pipeline
• Event producers: applications; devices; sensors; web and social; cloud gateways (web APIs); field gateways
• Collection (event queuing system): Event Hubs; Kafka / RabbitMQ / ActiveMQ
• Transformation (stream processing): Apache Storm on HDInsight; Azure Stream Analytics
• Long-term storage: storage adapters; HDFS; Azure DBs; Azure storage; HBase
• Presentation and action: search and query; data analytics (Excel); web/thick client dashboards; live dashboards; devices to take action
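The consume, compute, and output steps of stream processing can be sketched with plain Python generators. The names borrow Storm's spout and bolt vocabulary, but this is not the Storm API: the functions, the event shape, the window size, and the threshold are all hypothetical, and everything runs in a single process.

```python
from collections import Counter, deque


def sensor_spout(events):
    """Spout: the source of the stream. Here it replays a list of events."""
    for event in events:
        yield event


def window_count_bolt(stream, window_size):
    """Bolt: keep the last `window_size` events and emit per-device counts.

    Yields a Counter snapshot after each incoming event.
    """
    window = deque(maxlen=window_size)
    for event in stream:
        window.append(event["device"])   # the oldest entry drops out automatically
        yield Counter(window)


def alert_bolt(stream, threshold):
    """Bolt: emit a device id whenever its count in the window reaches threshold."""
    for counts in stream:
        for device, n in counts.items():
            if n >= threshold:
                yield device


if __name__ == "__main__":
    # Hypothetical events, standing in for messages read from an event broker.
    events = [{"device": d} for d in ["a", "b", "a", "a", "c", "b"]]
    topology = alert_bolt(
        window_count_bolt(sensor_spout(events), window_size=3), threshold=2
    )
    print(list(topology))
```

Each stage handles one event at a time and passes results downstream immediately, which is what separates this style from a batch job that waits for the whole input.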
HDInsight Supports Spark
In-memory processing on multiple workloads
• Single execution model for multiple tasks (SQL queries, streaming, machine learning, and graph)
• Up to 100x faster processing
• Developer friendly (Java, Python, Scala)
• BI tool of choice (Power BI, Tableau, Qlik, SAP)
• Notebook experience (Jupyter/iPython, Zeppelin)
Components: Spark SQL; Spark Streaming; Machine Learning (MLlib); Graph (GraphX); …
HDInsight Allows You To Add Hadoop Projects
Add Hadoop projects to HDInsight
• Modify HDInsight clusters with custom script
• Add Apache Hadoop projects to HDInsight
• Documented for Spark, R, Giraph, Solr
Microsoft Makes Hadoop Easier
Deep Visual Studio integration
• Debug Hive jobs through YARN logs or troubleshoot Storm topologies
• Visualize Hadoop clusters, tables, and storage
• Submit Hive queries, Storm topologies (C# or Java spouts/bolts)
• IntelliSense
Introducing Azure HDInsight
Why Microsoft Azure?
Azure services: Azure Storage; HDInsight; Data Factory; ML; Stream Analytics; Database; DocumentDB; Search; Event Hubs
On-premises Hadoop: software, appliances
Azure Facts
• >4 trillion objects in Azure
• 300,000-1M+ requests per second
• Double compute and storage every 6 months
No hardware challenges
HDInsight in the cloud bypasses hardware costs
• Hardware acquisition
• Hardware maintenance
• Performance tuning
HDInsight in the cloud bypasses capacity planning
• Spin up any number of Hadoop nodes on demand
• Go from tens of nodes to thousands of nodes
Deployed in minutes
HDInsight in the cloud bypasses deployment expertise
• Hadoop is non-trivial to install and get running on multiple nodes
• Education gap in the IT community regarding Hadoop
HDInsight is deployed in minutes
• Spin up any number of Hadoop nodes on demand
• Up and running in a few clicks (and within minutes)
Mission Critical Hadoop
Mission critical, enterprise ready
Managed Hadoop, backed by an SLA
• Three nines of availability: 99.9% uptime
HDInsight auto-replicates data
• Automatic geo-replication of data
• Data only replicates within the same geopolitical boundary (i.e., country, region)
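To put "three nines" in concrete terms, the short Python calculation below converts an availability figure into the downtime it permits. The 30-day month and the function name are illustrative assumptions; how the SLA is actually measured and credited is not covered here.

```python
def allowed_downtime_minutes(availability, period_hours):
    """Minutes of downtime permitted over `period_hours` at a given availability.

    `availability` is a fraction, e.g. 0.999 for "three nines".
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_hours * 60


if __name__ == "__main__":
    per_month = allowed_downtime_minutes(0.999, 30 * 24)   # assumes a 30-day month
    per_year = allowed_downtime_minutes(0.999, 365 * 24)
    print(f"{per_month:.1f} minutes per 30-day month")
    print(f"{per_year / 60:.2f} hours per year")
```

At 99.9%, that is about 43 minutes in a 30-day month, or just under 9 hours over a year.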
Maintenance done for you
Minimal IT resources for upgrades/patching
• OS upgrades, patching, and security updates done automatically
Minimal IT resources to update Hadoop versions
• New Hadoop versions are released rapidly throughout the year
• Always be on the latest version of Hadoop with no effort
HDInsight adds latest version of Hadoop for you
• Oct 2013: HDInsight on Hadoop 1.1.2
• April 2014: HDInsight on Hadoop 2.2
• June 2014: HDInsight on Hadoop 2.4
Low Cost
HDInsight is billed by usage
• Billed for usage
• Clusters can be deleted when no longer used
No additional price for support
• Azure Support includes Hadoop support
• What usually costs thousands of dollars per node is included
Introducing Azure HDInsight
Scalable, manageable, trusted
Bringing Hadoop to a billion people
• 1 billion Microsoft Office users
• Office 365 is our fastest-growing commercial product ever
• Excel as the BI tool for everyone: connect to HDInsight, analyze, visualize
• Power BI for collaboration and new experiences: share, ask, access
• Devices, applications, dashboards
Making advanced analytics accessible to Hadoop
Microsoft Azure Machine Learning
Figure: Azure Machine Learning workflow
• Data sources (cloud, desktop): HDInsight; SQL DB; blobs and tables; SQL Server VM
• ML Studio workspace: historical data; test model types; run and refine; results; easily make changes
• Microsoft Azure Portal and ML API service: publish API in minutes
• Web
Wu Feng, Professor of Computer Science, Virginia Tech
“What excites me about what I’m doing with HDInsight is the ability to accelerate discovery to the point that we may be able to find treatments for cancer.”
Virginia Tech captures data from DNA sequencers that generate 15 PB of genome data each year. Rather than spending millions of dollars to build a supercomputing center, Virginia Tech uses Azure and pays only for the compute it uses.
Blackball uses HDInsight to collect point-of-sale (POS) data and new types of data such as customer feedback via social media.
“Before, we thought that people would choose cold drinks and desserts in hot weather. But contrary to our assumptions, in certain outlets we saw an opposite trend.”
Andrew Cheong, Senior Manager, BlackBall
Get Started
• Read documentation: http://azure.microsoft.com/en-us/documentation/services/hdinsight/
• Learning Map: http://azure.microsoft.com/en-us/documentation/articles/hdinsight-learn-map/
• Microsoft Virtual Academy: http://www.microsoftvirtualacademy.com/training-courses/getting-started-with-microsoft-big-data
• Channel 9 Data Exposed Show: http://channel9.msdn.com/Shows/Data-Exposed
• Try 30 day trial: http://azure.microsoft.com/en-us/pricing/free-trial/
© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.