Big Data Analysis Starts with R
Revolution Analytics
December 20, 2011
The Big Data Analytics Revolution Starts with R
In Today’s Webinar:
About Revolution Analytics
Getting Value with Advanced Analytics
Implementing the Advanced Analytics Stack
Resources and Further Reading
The professor who invented analytic software for the experts now wants to take it to the masses
Most advanced statistical analysis software available
Half the cost of commercial alternatives
2M+ Users
4,000+ Applications
Statistics
Predictive Analytics
Data Mining
Visualization
Finance
Life Sciences
Manufacturing
Retail
Telecom
Social Media
Government
Power
Productivity
Enterprise Readiness
What is R?
An open-source software project
A community
Data analysis software
A programming language
An environment
What’s the Difference Between R and Revolution R Enterprise?
Revolution R is 100% R and More®
R Engine Language Libraries
4,000+ Community Packages
Technical Support
Web-Based GUI
Web Services API
Big Data Analysis
IDE / Developer GUI
Build Assurance
Parallel Tools
Multi-Threaded Math Libraries
Let’s Talk about Big Data
Extracting Value with Advanced Analytics
Missing the potential value of the data that is being collected
Need more than counts and averages
Advanced Analytics with Big Data
Predict the Future
Understand Risk and Uncertainty
Embrace Complexity
Identify the Unusual
Think Big
R: A Unique Platform for Extracting Value from Data
• R is superior at exploring data to find unexpected trends and relationships, finding the best predictive models, and identifying critical “outliers”, such as clusters of customers who are particularly profitable (or unprofitable!).
Data Exploration and Visualization
• Google, LinkedIn and Facebook rely on R and the skills of data scientists who are accustomed to hacking together large data sets from disparate sources, visualizing and exploring data to identify novel modeling techniques, and combining the results of several modeling strategies to optimize predictive power.
Data Science
• Other commercial programs push users through a pre-programmed procedure and discourage modeling innovation. R was created as a 4GL with the needs of modern data scientists in mind, with an interactive language that promotes data exploration, data visualization, and flexible data modeling.
Modeling Innovation
• R is creating a massive pool of talent because it is now the dominant tool of choice at universities.
Talent
Making It Work
Use Cases for Big Data Analytics Deployment
The Advanced Analytics Stack
Deployment / Consumption
Advanced Analytics
ETL
Data / Infrastructure
“Open Analytics Stack” White Paper: bit.ly/lC43Kw
Best Practices for Implementing an Advanced Analytics Stack for Big Data
Limit sampling
Reduce data movement and replication
Bring the analytics as close as possible to the data
Optimize computation speed – parallel algorithms
Big Data Computations
Computations are data intensive
To be effective, must rely on data parallelism
Data is distributed across compute nodes
Same task is run in parallel on each of the data partitions
Examples of distributed computing frameworks that support data parallelism
Traditional file-based analytics using on-premise clusters
Hadoop and MapReduce
In-Database Analytics using parallel hardware architectures
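The data-parallel pattern described on this slide can be sketched in plain R. This is only a minimal illustration under stated assumptions: the data are simulated, the four "nodes" are simulated with a list on one machine, and the names `delays`, `n_nodes` and `partial_summary` are hypothetical and do not come from the webinar.

```r
# Hypothetical toy data standing in for a large data set.
set.seed(42)
delays <- rnorm(10000, mean = 7, sd = 30)

# Distribute the data across 4 simulated "compute nodes".
n_nodes <- 4
partition_id <- rep(seq_len(n_nodes), length.out = length(delays))
partitions <- split(delays, partition_id)

# The same task runs on every partition and returns a small summary.
partial_summary <- function(x) c(sum = sum(x), n = length(x))
partials <- lapply(partitions, partial_summary)

# Combine the per-partition summaries into the global result.
totals <- Reduce(`+`, partials)
parallel_mean <- unname(totals["sum"] / totals["n"])
```

Because each partition returns only a sum and a count, the combined mean matches `mean(delays)` without any node ever needing the full data set. On a real cluster the `lapply` step would be replaced by whatever the chosen framework provides for running the task on each partition.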
Revolution R Enterprise: Big Data Statistics in R
www.revolutionanalytics.com/bigdata
Every US airline departure and arrival, 1987-2008
File: AirlineData87to08.xdf
Rows: 123.5 million
Variables: 29
Size on disk: 13.2 GB
arrDelayLm2 <- rxLinMod(ArrDelay ~ DayOfWeek:F(CRSDepTime), data = "AirlineData87to08.xdf", cube = TRUE)
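As a rough illustration of what a model of this shape estimates, the base R sketch below fits the same kind of interaction on a small simulated data set. It assumes that `F(CRSDepTime)` treats the integer part of the scheduled departure time as a factor; here `floor()` and `factor()` stand in for it. The data frame `flights` and the column `DepHour` are hypothetical, and nothing here uses RevoScaleR or the real airline file.

```r
# Simulated stand-in for the airline data (values are made up).
set.seed(7)
n <- 5000
days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
flights <- data.frame(
  DayOfWeek  = factor(sample(days, n, replace = TRUE), levels = days),
  CRSDepTime = runif(n, 0, 24),           # scheduled departure, decimal hours
  ArrDelay   = rnorm(n, mean = 5, sd = 20)
)

# Stand-in for F(): treat the integer hour as a categorical variable.
flights$DepHour <- factor(floor(flights$CRSDepTime))

# Mean arrival delay for every day-of-week / departure-hour cell.
cell_means <- tapply(flights$ArrDelay,
                     list(flights$DayOfWeek, flights$DepHour),
                     mean)

# A no-intercept linear model on the interaction estimates the same cell means.
fit <- lm(ArrDelay ~ 0 + DayOfWeek:DepHour, data = flights)
```

With one coefficient per cell, the fitted coefficients are simply the cell means, which is why a model like this can be summarized as a table of average delays by day and hour.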
RevoScaleR – Distributed Computing
[Diagram: a master node (RevoScaleR) connected to four compute nodes (RevoScaleR), each with its own data partition]
• Portions of the data source are made available to each compute node
• RevoScaleR on the master node assigns a task to each compute node
• Each compute node independently processes its data and returns its intermediate results to the master node
• The master node aggregates the intermediate results from all compute nodes and produces the final result
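The four steps above can be imitated in base R for a linear regression. This is a sketch of the general idea only, not RevoScaleR's actual implementation: it assumes each node returns the cross-products X'X and X'y for its partition and that the master sums them and solves the normal equations. The data are simulated and the names `node_task`, `dep_hour` and `distance` are hypothetical.

```r
# Simulated data standing in for a large data source.
set.seed(1)
n <- 2000
dat <- data.frame(
  dep_hour = sample(0:23, n, replace = TRUE),
  distance = runif(n, 100, 2500)
)
dat$arr_delay <- 2 + 0.5 * dat$dep_hour + 0.002 * dat$distance +
  rnorm(n, sd = 10)

# Step 1: a portion of the data is made available to each compute node.
partitions <- split(dat, rep(1:4, length.out = n))

# Step 2: the task that the master assigns to every node.
node_task <- function(part) {
  X <- model.matrix(~ dep_hour + distance, data = part)
  y <- part$arr_delay
  list(xtx = crossprod(X), xty = crossprod(X, y))
}

# Step 3: each node processes its own partition independently.
intermediate <- lapply(partitions, node_task)

# Step 4: the master aggregates the intermediate results and finishes the fit.
xtx <- Reduce(`+`, lapply(intermediate, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(intermediate, `[[`, "xty"))
coef_distributed <- drop(solve(xtx, xty))

# Reference fit on the full data, for comparison.
coef_full <- coef(lm(arr_delay ~ dep_hour + distance, data = dat))
```

The intermediate results are small matrices whose size depends on the number of model terms rather than the number of rows, so only a few numbers travel back to the master however large each partition is.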
R and Hadoop
[Diagram components: R Client, Job Tracker, Task Node (R, Map or Reduce), HDFS, HBase, Thrift]
Capabilities delivered as individual R packages
rhdfs - R and HDFS
rhbase - R and HBase
rmr - R and MapReduce
Downloads available from GitHub
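For readers new to the MapReduce model that these packages expose to R, the map, shuffle and reduce phases can be imitated in base R on one machine. This sketch deliberately does not call rmr, rhdfs or rhbase, whose APIs are not shown in the slides. The records and the functions `map_fn` and `reduce_fn` are hypothetical.

```r
# Hypothetical input records: "carrier,arrival delay in minutes".
records <- c("AA,12", "UA,-3", "AA,40", "DL,5", "UA,18", "AA,-7")

# Map: turn one record into a key/value pair.
map_fn <- function(record) {
  fields <- strsplit(record, ",", fixed = TRUE)[[1]]
  list(key = fields[1], value = as.numeric(fields[2]))
}

# Reduce: collapse all values that share a key into one result.
reduce_fn <- function(key, values) {
  mean(values)
}

# Map phase: applied independently to every record.
mapped <- lapply(records, map_fn)

# Shuffle phase: group the emitted values by key.
keys    <- vapply(mapped, function(kv) kv$key, character(1))
values  <- vapply(mapped, function(kv) kv$value, numeric(1))
grouped <- split(values, keys)

# Reduce phase: one call per key.
result <- mapply(reduce_fn, names(grouped), grouped)
```

In a Hadoop job the map and reduce calls would run on task nodes against data stored in HDFS. Here `lapply` and `mapply` only stand in for that distribution.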
Revolution Analytics with Netezza Appliance
Deployment with Revolution R Enterprise
RevoDeployR Web Services
Client libraries (JavaScript, Java, .NET)
Desktop Applications (e.g. Excel)
Business Intelligence (e.g. QlikView)
Interactive Web Applications
HTTP/HTTPS – JSON/XML
Session Management
Authentication
Data/Script Management
Administration
R Programmer
Application Developer
End User
Admin
Three final thoughts
Now enterprise-ready, R offers the innovation and flexibility needed to meet analytics challenges in a changing world
R-enabled advanced analytics are key to unlocking value in big data
Revolution Analytics optimizes R to take advantage of multiple data management paradigms and emerging best practices
Resources
Slides / Replay: bit.ly/r-big-data
“Open Analytics Stack” White Paper: bit.ly/lC43Kw
McKinsey Report on Big Data: bit.ly/jWyrFM
Conway, Data Science Intelligence: bit.ly/myMwak
“Big Analytics” White Paper by Norman H. Nie: bit.ly/biganalytics
Revolution R Enterprise: bit.ly/Enterprise-R
Questions: [email protected]
www.revolutionanalytics.com 650.330.0553 Twitter: @RevolutionR
The leading commercial provider of software and support for the popular open source R statistics language.
Thank you.