AE foyer: R and Hadoop, the perfect marriage for your analytics?

ae nv/sa, Interleuvenlaan 27b, B-3001 Heverlee

T +32 16 39 30 60 - F +32 16 39 30 70

www.ae.be

Bram Vanschoenwinkel, Principal Consultant BI & Analytics

#aeFoyer

@ae_nv

@bvschoen

R & Hadoop: The perfect marriage for your analytics?

WELCOME

R & Hadoop: The perfect marriage for your analytics?

By Michael Degrez, Sales Director - AE

19/02 Mobile by design: How to design, build, run for mobile first

23/04 R & Hadoop: The perfect marriage for your analytics?

18/06 From private cloud to hybrid cloud: How to benefit from a successful implementation

01/10 Prepare for the digital enterprise: Business driven enterprise architecture

26/11 Multi-device front-end engineering: How businesses benefit from applying this technical skill

7

Agenda

1. It’s a (R)evolution

2. Intelligent Decision Support in the Digital Age

3. The R Project for Statistical Computing

4. The World of Hadoop

5. Case: A Customer Intelligence Platform

6. Conclusions

8

It’s a (R)evolution

Chart: data volume over time (2000, 2010, 2015). Majority unstructured data.

9

Abundance of Data

Diagram: data sources from ERP and CRM to the web and beyond, with increasing volume on one axis and increasing data variety & complexity on the other.

ERP, CRM, WEB, BEYOND

Purchase detail, production, payment detail, planning, contact information, leads, offers, segmentation, prospects, click stream data, web shops, social media, video, images, text, online services, audio, open data, mobile devices, Internet of Things, RFID, GPS, sensors, user generated content, smart devices, remote monitoring, cloud, medical, wearables.

10

Opportunities

OPERATIONAL EXCELLENCE

INNOVATIVE BUSINESS MODELS

INSIGHTS, STRATEGY AND POLICY

11

Challenges

SHORT LIFESPAN OF THE DATA

FAST MOVING DATA

FAST DATA PROCESSING

HIGH VARIETY OF DATA

12

Intelligent Decision Support in the Digital Age

WHAT WE SEE
- ABUNDANCE OF HETEROGENEOUS DATA
- THE WAY WE INTERACT WITH THE WORLD HAS CHANGED

OPPORTUNITIES
- OPERATIONAL EXCELLENCE
- BETTER DECISION SUPPORT
- INNOVATING BUSINESS MODELS

CHALLENGES
- ANALYSIS GAP
- VOLUME, VARIETY, VELOCITY
- COMPETENCES

13

Decision Support in the Digital Age

Facing the Challenges and realizing the Opportunities

Business Analytics Big Data

14

Elements of a Holistic Information Management Framework

From Data to Information:
- Data Sources
- Internal & External

From Information to Knowledge:
- Improving data quality
- Integrality of data

From Knowledge to Intelligence (Intelligent Decision Support):
- Reporting
- Business Analytics

Data, Information, Knowledge, Intelligence, Wisdom/Insight

15

Decision Support in the Digital Age

“Business Analytics is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data.”

16

Business Analytics vs Business Intelligence

Business Intelligence: What happened? When did it happen? Who made it happen? Where did it happen? How many times did it happen?

Business Analytics: Why did it happen? Will it happen again? When will it happen? What will happen if…? What else could have happened?

17

New Insights

Figure labels: 8 stop, 132 stop, 10 stop, 53 stop, 64 stop, 14 stop, 4 stop, 11 stop.

18

Innovating Business Models

Front-end Application(s)

Security

Analytics (on Hadoop)

Web Click Streaming

Social Media

Connectivity

External Application Integration

Operational Data Processing on Hadoop

19

From Analytics…

Statistics, Algorithms, Biology, Psychology, Databases

Analytics (Data Mining)

20

…to Business Analytics

Business Analytics

Finance
• Fraud Detection
• Financial Risk Analysis
• Forecasting
• Financial Market Analysis

Process
• Process Mining
• Work Organization Analysis
• Web Analytics
• Forecasting
• Process Simulation

Customer
• Customer Segmentation
• Churn Prediction
• Customer Targeting
• Customer Lifetime Value Analysis
• Sentiment Analysis
• Market Basket Analysis

HR
• Talent Analytics
• Retention Analytics
• Recruitment Analytics
• HR Market Analytics
• Sentiment Analysis

21

Analytics Approach

Two Phase Approach

1. Analytics
- Incremental and iterative
- Think big, act small
- Proof-of-Concept
- Open source tools

2. Architecture & Deployment
- (Non-)functional requirements
- Information Architecture
- Technology
- Embedded into operations

22

Analytics Churn Prediction Example

Data sources: Invoicing, CRM, Call Center Application

Example records:
John Doe – 43 years – Antwerp – Man – 7 calls – 3 weeks – 30% down invoicing
Jane Dan – 32 years – Brussels – Woman – 2 calls – 12 weeks – 10% up invoicing
…

Diagram labels: Operations, Data Dump, Analytics Engine, Churn Scores, Data Warehouse, Region, Product, Time, Management Dashboard.

23

Big Data

“Big data is high-volume, high-velocity, high-complexity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” (Gartner)

24

Four V’s and a C

Volume alone does not make big data big. It is all about the three V’s: high Volume, Variety and Velocity. And a fourth one: high Value!

In addition, the data is very complex in nature and often unstructured:
- Text documents, emails, images and videos, etc.
- Click stream data, social media feed data, etc.

25

Innovative Forms of Information Processing

Traditional methods don’t suffice anymore. New forms of information processing have emerged.

DISTRIBUTE DATA STORAGE AND COMPUTATION

NoSQL DATA STORES

26

Innovative Forms of Information Processing

27

The R Project for Statistical Computing

- R is a dialect of the S language.
- S was developed by John Chambers and others at Bell Labs.
- S was initiated in 1976.
- S is now owned by TIBCO and sold under the name S-PLUS.

INTERACTIVE, NOT PROGRAMMING

GRADUALLY MOVING INTO PROGRAMMING WHEN SYSTEM ASPECTS BECOME IMPORTANT

28

Advantages of R

Most widely used data analysis software: created and used by 2M+ data scientists, statisticians and analysts.

Most powerful statistical programming language: flexible, extensible and comprehensive for productivity, 4800+ packages.

Create beautiful and unique data visualizations: as seen in New York Times, Twitter and Flowing Data.

Thriving open-source community: leading edge of analytics research.

Fills the talent gap: new graduates prefer R.

29

Drawbacks of R

Steep learning curve.
- Vibrant community to help you.

Objects must be stored in physical memory, little thought to memory management.
- Recent advancements to deal with this.

Functionality is based on consumer demand and user contributions.
- If a package is useful to many people, it will quickly evolve into a robust product.

Documentation is sometimes patchy and terse, and impenetrable to the non-statistician.
- Vibrant community to help you.

30

Exploding growth and Demand for R

R is the highest paid IT skill – Dice.com, Jan 2014

R most-used data science language after SQL – O’Reilly, Jan 2014

R is used by 70% of data miners – Rexer, Sep 2013

R is #15 of all programming languages – RedMonk, Jan 2014

R growing faster than any other data science language – KDnuggets, Aug 2013

More than 2 million users worldwide

31

Great Adoption of R by Many Companies

Commercial vendors offer general support and develop specific R-based products, e.g. Oracle and Revolution Analytics.

Companies using R for advanced statistics and analytics, e.g.: Thomas Cook, Google, Twitter.

Also in the AE customer base we see different companies looking into R as an alternative or complement to the traditional tools.

32

Example Packages

- twitteR: Provides an interface to the Twitter web API.
- tm: Provides text mining functionality like word stemming, stopword removal, etc.
- wordcloud: Provides methods for producing word clouds in different forms, shapes and colors.

33

Apache Hadoop

- Open-source software framework.
- Storage and large-scale processing of data on clusters of commodity hardware.
- Apache top-level project built and used by a global community.

Two core components:
1. Hadoop Distributed File System (HDFS)
2. MapReduce

34

Apache Hadoop

MapReduce/HDFS based on Google's MapReduce and Google File System.

Other components are:
- Hadoop Common – libraries and utilities needed by other Hadoop modules
- Hadoop YARN – a resource-management platform

The entire Apache Hadoop “platform” is now commonly considered to consist of a number of related projects as well: Pig, Hive, HBase, …

Created by Doug Cutting and Mike Cafarella in 2005, originally to support distribution for the Apache Nutch search engine project.

All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and thus should be automatically handled in software by the framework.

35

The World of Hadoop

36

Key Properties Apache Hadoop

Transforms commodity hardware into a service that:
- Stores petabytes of data reliably.
- Allows huge distributed computations.

Key Properties:
- Designed for batch processing.
- Write-once-read-many access model for files.
- Extremely powerful.
- Scalability:
  • Scales linearly with cores and disks.
  • Machines can be added and removed from the cluster.
  • Write code once, same program runs on 1, 1000, 4000 machines.
- Reliable and fault-tolerant:
  • Failed tasks/data transfers are automatically retried.
  • Data replication, redundancy.

Hadoop brings the computation to the data and not the data to the computation!

37

A Typical Hadoop Cluster

1. Client
2. Master Node: Name Node, Job Tracker
3. Slave Nodes: Data Nodes, Task Trackers, Map / Reduce

Diagram: a Client and a Master Node in front of slave nodes in Rack 1, Rack 2 and Rack 3; each slave node runs a Data Node and a Task Tracker with Map and Reduce tasks. Labels: data assignment to nodes, data read, data write, metadata for block info, job assignment, task assignment.

38

Hadoop File System (HDFS)

1. Client consults Name Node
2. Client writes block to Data Node
3. Data Node replicates block
4. Cycle repeats for next blocks

Diagram: a Client writes a FILE to Data Nodes 1 to 9, spread over Rack 1, Rack 2 and Rack 3. Labels: data assignment to nodes, data read, data write, metadata for block info.

Name Node metadata: Rack 1: Data Node 1, Data Node 2, …; Rack 2: Data Node 3, …
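
The four write steps on this slide can be mimicked in a few lines of plain Python. This is only a toy sketch and not the HDFS client API: the tiny block size, the replication factor of 3, the round-robin placement and the names `NameNode`, `DataNode` and `write_file` are all assumptions made for illustration. A real cluster places replicas with rack awareness, which is not modelled here.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class DataNode:
    name: str
    blocks: dict = field(default_factory=dict)  # block id -> bytes

    def store(self, block_id, data, replicas):
        """Store a block, then forward it to the remaining replica nodes (step 3)."""
        self.blocks[block_id] = data
        if replicas:
            replicas[0].store(block_id, data, replicas[1:])


class NameNode:
    """Keeps only metadata: which data nodes hold which block."""

    def __init__(self, data_nodes, replication=3):
        self.data_nodes = data_nodes
        self.replication = replication  # assumed replication factor
        self.block_map = {}             # block id -> list of node names
        self._next = cycle(range(len(data_nodes)))

    def allocate(self, block_id):
        """Step 1: pick target nodes for a new block (toy round-robin placement)."""
        start = next(self._next)
        targets = [self.data_nodes[(start + i) % len(self.data_nodes)]
                   for i in range(self.replication)]
        self.block_map[block_id] = [node.name for node in targets]
        return targets


def write_file(name_node, filename, data, block_size=4):
    """Split data into blocks and write them one by one (step 4: repeat per block)."""
    for offset in range(0, len(data), block_size):
        block_id = f"{filename}#{offset // block_size}"
        targets = name_node.allocate(block_id)           # step 1: consult name node
        block = data[offset:offset + block_size]
        targets[0].store(block_id, block, targets[1:])   # steps 2 and 3


nodes = [DataNode(f"Data Node {i}") for i in range(1, 10)]
name_node = NameNode(nodes)
write_file(name_node, "FILE", b"hello hadoop")
print(name_node.block_map)
```

Note that the client only talks to the first data node for each block; replication is pushed from node to node, which matches the "Data Node replicates block" step.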

39

MapReduce

Stages: Input, Splitting, Map, Shuffle/Sort, Reduce, Output

Map:
the, 1; quick, 1; brown, 1; fox, 1
the, 1; fox, 1; ate, 1; the, 1; mouse, 1
how, 1; now, 1; brown, 1; cow, 1

Shuffle/Sort:
the, 1; the, 1; the, 1
fox, 1; fox, 1
quick, 1
brown, 1; brown, 1
ate, 1
mouse, 1
how, 1
now, 1
cow, 1

Reduce:
the, 3
fox, 2
quick, 1
brown, 2
ate, 1
mouse, 1
how, 1
now, 1
cow, 1

Output:
the, 3; fox, 2; quick, 1; brown, 2; ate, 1; mouse, 1; how, 1; now, 1; cow, 1

The Map function processes one line at a time, splits it into tokens separated by whitespace and emits a key-value pair <word, 1> for each token.

The Reduce function just sums up the values, which are the occurrence counts for each key (i.e. the words in this example).
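
The word count flow can be sketched in plain Python to make the three stages concrete. This runs on a single machine and does not use Hadoop or any MapReduce library; the function names are illustrative, and the three input lines are an assumption, reconstructed from the map output shown on the slide.

```python
from collections import defaultdict


def map_line(line):
    """Map: split one line on whitespace and emit a (word, 1) pair per token."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle/sort: group all emitted values by key."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_counts(word, counts):
    """Reduce: sum the occurrence counts for one key."""
    return word, sum(counts)


# Input lines reconstructed from the map output on the slide (an assumption).
lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]

mapped = [pair for line in lines for pair in map_line(line)]
grouped = shuffle(mapped)
result = dict(reduce_counts(word, counts) for word, counts in grouped.items())
print(result)
```

On a cluster the map calls run in parallel on the nodes that hold the input blocks, and the framework performs the shuffle between nodes; only the map and reduce functions are written by the user.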

40

Hadoop Distributions

Fully equipped, scalable and flexible cloud solutions. Different on-premise solutions are offered as well. The choice depends on specific requirements:

Data Privacy, Scalability, Security, Data Mastership, Configuration, Flexibility, Price-Performance Ratio, Automation, …

How to get started?
- Free to download! The business model is based on training, consulting, support and additional “tooling” (Enterprise Editions).
- Many free trial cloud versions available to play around with.
- Many tutorials, trainings, blogs, user groups etc.

41

RHadoop

A collection of four R packages that allow users to manage and analyze data with Hadoop:
- rmr: Hadoop MapReduce functionality in R
- rhdfs: file management of the HDFS from within R
- rhbase: database management for the HBase distributed database
- plyrmr: a recently released package that provides a familiar interface while hiding many of the MapReduce details (as Hive, Pig and Mahout do)

R and all RHadoop packages should be installed on all nodes in the Hadoop cluster.

Combining the advantages of R with the power of Hadoop.

42

MapReduce Wordcount Example in R

Map function.

Reduce function.

Reading the input from HDFS: from.dfs().

Writing the results back to HDFS: to.dfs().

43

Case: A Customer Intelligence Platform - Example Energy Sector

Help companies to better understand their customers, interact with them at the right time, with the right message and through the right channel, with the aim to get in dialogue and increase sales through cross-selling and up-selling.

An intelligent Data Driven Customer Intelligence platform:
- Use of factual and observed data and socio-demographic statistics.
- Enable actionable customer insights.
- Provide personalized product offers based on customer preferences and interests.
- Visualize relevant data in infographics to get better insights.
- Use the gained insights for better customer experience.

Customer segments: GREEN SAVER, ENVIRONMENTAL PRAGMATIST, ENVIRONMENTALIST, ECO WARRIOR, SOCIAL ENVIRONMENTALIST

44

Case: A Customer Intelligence Platform - Analytics

- Development independently of architecture and technical setup.
- In the first phase only multiple logistic regression for deriving the customer profiling score, later also other (predictive) analytics:
  • Formula is “learned” from historical data.
  • Periodical processing in the backend, derived formula can be applied online.
- Environment for the Data Scientist to configure for new clients, batch processing, development and testing of new algorithms.
- Use of R for development and prototyping. RHadoop in production environment?

P (probability of conversion) = weighting1 * variable1 + weighting2 * variable2 + ... + constant
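
In logistic regression the weighted sum is usually passed through the logistic (sigmoid) function, so that the resulting score lies between 0 and 1. Below is a minimal sketch of applying an already learned formula to one customer record. The variable names, weightings and constant are hypothetical values chosen for illustration and do not come from the case.

```python
import math


def conversion_probability(variables, weightings, constant):
    """Apply a learned logistic regression formula to one customer record."""
    # Weighted sum: weighting1 * variable1 + weighting2 * variable2 + ... + constant
    score = constant + sum(weightings[name] * value
                           for name, value in variables.items())
    # The logistic function maps the score to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical weightings, as if "learned" from historical data.
weightings = {"site_visits": 0.8, "newsletter": 1.2}
constant = -2.0

customer = {"site_visits": 1.0, "newsletter": 1.0}
p = conversion_probability(customer, weightings, constant)
print(round(p, 3))
```

Because scoring is just a weighted sum followed by one function call, the derived formula is cheap enough to apply online, while learning the weightings stays a periodic backend job.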

45

Case: A Customer Intelligence Platform - Business Architecture

Key Business Requirements:
- Collect “customer related” data from any possible source.
- Generic framework that can be applied in different sectors.
- Process huge amounts of data quickly and accurately.
- The client should be master of all collected data.
- Scalability in order to handle various volumes.
- The use of analytics for insights (customer profile scores, segmentation, …).
- Sandbox environment for development and testing analytics.
- Visualisation and dimensional reporting with filtering.

46

Case: A Customer Intelligence Platform - Information Architecture

Unstructured data: Social Media / Web site

• Personal info: name, age, gender, country, location, email, education and employment history, user profile history

• Behavioral info:
  • Social Media: preferences, interests, photos, ‘likes’, favorites, followers, …
  • Web site: clicks (e.g. products), forms, geolocation, … up to the individual level

• Other info: web browser, IP address, group memberships, …

Relational data sources like CRM applications and others.

External sources, e.g. Open Data.

Atomic and aggregated data. (Application data.)

Dimensional data: cube for reporting and analysis.

47

Case: A Customer Intelligence Platform - Core Architecture (simplified view)

Interfaces

API (Access Layer)

Transportation Zone

Operational Data Processing Zone: Data Reception, Data Validation, Data Enrichment, Data Aggregation, Data Publication

Analytics Zone: Data Reception, Pre-processing, Model Building, Validation

48

Case: A Customer Intelligence Platform - Technical Architecture (simplified view)

Operational Data Processing Zone: Data Reception, Data Validation, Data Enrichment, Data Aggregation, Data Publication

API (Access Layer)

Interfaces: Snowplow & Janrain, JSON files

DATA DUMP: NoSQL - HBase. Very large volumes, JSON files. MongoDB in later versions?

CONFIGURATION REPOSITORY: RDBMS (SQL). Validation rules, insights (analytics).

VALIDATED DATA REPOSITORY: NoSQL - HBase. Very large volumes; scalability, flexibility. MongoDB, Cassandra, Couchbase in later versions?

AGGREGATES REPOSITORIES: NoSQL - HBase

SOURCE DATA REPOSITORY: NoSQL - HBase

Final results. Absolute basis for further analysis. + OLAP in later versions?

Hadoop: data intensive. Amazon EMR (Elastic MapReduce): scalability, flexibility.

Features

49

Case: A Customer Intelligence Platform - Technical Architecture (simplified view)

Periodic (batch) analytics processing to gain new insights. Three main scenarios have been considered:

1. Locally, making use of R:
• Small sample analysis
• E.g. on-site at the client

2. Hadoop, making use of RHadoop:
• Full/big sample analysis
• Computation intensive: Hadoop
• Bring up and down when needed: Amazon EMR

3. Other (Hadoop or locally):
• Mahout or other Analytics/BI tools

Analytics Zone: Data Reception, Pre-processing, Model Building, Validation

Hadoop or locally, making use of R. Periodic or ad hoc. Flexibility, cost-efficient. Many other possibilities.

50

Conclusions

The Digital Age brings many opportunities but also challenges.

Big Data and Analytics can help you face the challenges and realize the opportunities.

It is within anyone’s grasp: do it incrementally and iteratively.

R and Hadoop:
- Open source software, active user groups and support.
- A great way to start exploring!
- Combined power gives you the advantage of 1 + 1 = 3.
- Sometimes alternatives are better.

51

Conclusions

You don’t always need Big Data to do Analytics; it depends on the requirements.

Hadoop cloud solutions are scalable, flexible and cost-efficient, but sometimes limited in functionality (or not standardized).

Many differences between Hadoop distributions, constantly evolving (and getting better).

Need for good Data Scientists in a mixed team of competences to make the right choices.

52

What’s next?

Ask yourself the following questions:
- What opportunities do I see for myself?
- What strategic and competitive advantages can I realize?
- Is Analytics the right solution for me?
- Do I need Big Data?
- What about my Data Warehouse environment?
- And what about the quality of my operational data?
- Do I have the right infrastructure in place?
- Do I have the right competences in house?

Now you should know what’s in it for you, but also the challenges you will most probably be facing.

53

What’s next?

You have a case you would like to discuss…? You have any questions…?

Please feel free to contact me: Bram Vanschoenwinkel [email protected] +32(0)478741738

@bvschoen

be.linkedin.com/in/bramvanschoenwinkel/

54

CHECK OUT THESE UPCOMING AE FOYERS

23 April 2014: R and Hadoop - The perfect marriage for your analytics?

18 June 2014: From Private Cloud to Hybrid Cloud

1 October 2014: Digital Enterprise Architecture

26 November 2014: Multi-device front-end engineering

Thank you!

@bvschoen / @ae_nv

www.ae.be