Harvesting Business Value with Data Science

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Data Science Company

Harvesting Business Value with Data Science

InfoFarm - Seminar 18/03/2015

Agenda

• 09:30 About us

• 09:40 Introduction to data science

• 10:00 Data science in practice:

- Fictitious examples

- InfoFarm use-cases

- Big Data at Essent (Els Descheemaeker)

- Fraud detection: Gotch’All (KU Leuven)

• 11:30 Possibility to discuss your data science ideas

About us

Data Science

Big Data

Provide customers with new information by identifying, extracting and modeling data of all types and origins; exploring, correlating and using it in new and innovative ways in order to extract meaning and business value from it.

Java

PHP

E-Commerce

Web Development

Mobile

InfoFarm - Team

• Mixed skills team

- Data scientists

- Big Data developers

- Infrastructure specialist

• Complementary to the client’s domain expertise

• Certifications

– CCDH - Cloudera Certified Hadoop Developer

– CCAD - Cloudera Certified Hadoop Administrator

– OCJP - Oracle Certified Java Programmer

Visualization

Data science

Business Knowledge

Development

Introduction to data science

Being a Data Scientist

• Complementing business knowledge with figures

• “Getting meaning from data”: Finding patterns (data mining)

• Data Scientist: “A person who is better at statistics than any software engineer and better at software engineering than any statistician” - Josh Wills

Data Science = about asking the right question!

Data Science

• Relevance for business – use data to:

– Increase conversion

– Increase operational efficiency

– Understand your customers’ needs

– Make better offers

– Make better recommendations

– …

• The key point is spotting opportunities to outperform your competitors using any data available!

• Data science is very affordable for companies of all sizes

• Typical data science projects take tens of man-days of work

Business Knowledge vs Data Science (intuitive knowledge vs data-driven decisions)

Business Knowledge

Acquired by experience

(Assumed) insights

RISK: too strong a bias towards past experience and gut feeling

Data Science

Complementary to business knowledge

Confirmatory or new insights

Data-driven decision making

RISK: too naive data interpretation, disconnected from business

Data Science vs Business Intelligence

Aspect | Business Intelligence | Data Science
Basic concepts | Structure & query | Experimenting & discovering
Processes | DWH, OLAP, ETL | Avoid heavy ETL (loosely structured data and agile use of many sources)
Investment | Big investment; delivers exactly | Limited investment; might or might not deliver
Cycle | Development | Exploratory working
Perspective | Past | Future
Questions | What happened? | What will happen? What if?
Data | Warehouse, silo | Distributed, real-time, “unstructured”

Data Science vs Big Data

• What about the elephant in the room?

• Big Data allows:

– N=ALL (avoid sampling errors)

• Sampling issues can be overcome by just processing ALL available data (process massive data)

– N=1 (avoid issues with non-homogeneous datasets)

• Categorization becomes true personalisation: project towards ONE individual (calculate per item)

The Data Science maturity model

• Don’t run before you can walk: the Data Science maturity model. Each level builds on the quality of the underlying step. It’s science, not magic …

• The process is a scientific cycle, not a development cycle!

• Because it is a science, the outcome cannot be predicted

• Even without success, you have learned something

Collect

Describe

Discover

Predict

Advise

Data Science: Tools & Techniques

Tools

Techniques

Machine Learning

Classification

Clustering

Association Rules

Regression

Information extraction

Classification: Use Cases

• Incoming mail redirection

• Sentiment analysis

• Order picking optimization

Clustering: Use Cases

• Customer segmentation

• Product segmentation

Association Rule Learning: Use Cases

• Recommendations

• Data exploration

• Find connections between unrelated events
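Association rules are usually scored with support (how often items occur together) and confidence (how often the rule holds when its left-hand side is present). The sketch below illustrates both on pairs of items only. The baskets, the product names and the `min_support` value are made up for illustration and are not taken from the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets; all product names are illustrative only.
BASKETS = [
    {"paint", "brush", "primer"},
    {"paint", "brush"},
    {"paint", "primer"},
    {"brush", "tape"},
    {"paint", "brush", "tape"},
]


def pair_rules(baskets, min_support=0.4):
    """Return (antecedent, consequent, support, confidence) for frequent item pairs."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        # confidence(a -> b) = P(b in basket | a in basket)
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return sorted(rules, key=lambda rule: -rule[3])


RULES = pair_rules(BASKETS)
for antecedent, consequent, support, confidence in RULES:
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

Real projects typically use a dedicated algorithm such as Apriori or FP-Growth to handle larger item sets; this pairwise count only shows what the two measures mean.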

Regression: Use Cases

• Order Quantity Prediction

• Trend estimation
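As a rough illustration of trend estimation, the sketch below fits a straight line through weekly order quantities with ordinary least squares and extrapolates one week ahead. The numbers are invented for the example, and a straight line is an assumption; real order data may need seasonality or other features.

```python
from statistics import mean

# Hypothetical weekly order quantities (illustrative numbers only).
WEEKS = [1, 2, 3, 4, 5, 6]
ORDERS = [100, 104, 109, 113, 118, 121]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_mean, y_mean = mean(xs), mean(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


SLOPE, INTERCEPT = fit_line(WEEKS, ORDERS)
FORECAST_WEEK_7 = INTERCEPT + SLOPE * 7
print(f"trend: {SLOPE:.2f} orders/week, forecast for week 7: {FORECAST_WEEK_7:.1f}")
```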

Information Extraction

• Extract variables out of unstructured data like text.

• Named Entity Extraction

Data Science Examples

Data Science examples

• Market segmentation

• Impact analysis

• Recommendations

• Water treatment

• Damage type research

• Call center aid

• Personalized client mailing (Essent)

• What do people write about us

• Fraud detection: Gotch’All (KU Leuven)

#1 Market Segmentation

Market segmentation

• Business knowledge based approach

– “We know our segments: -25y, 25y-35y, 35y+ groups, and male/female”

– But is this (still) true?

– E.g.: do we really want to send an ad for the new iPhone to a long-time Android user because he’s a 30-something male customer?

Market segmentation

• Example: we want to send mailings about our new product

• Decisions to make:

– Which mail to send to which customers?

– We need customer segmentation!

• Risks of failing to do this correctly

– Missing opportunities (not informing customers)

– Annoying customers with irrelevant mailings (churn, reputation damage, …)

Market segmentation

WEB SERVER LOGS: Which customers looked at similar products?

ORDER HISTORY: Which complementary products does the customer own?

EXTERNAL DATA: Reviews or critics?

CRM INFORMATION: Typical profile of a customer responsive to campaigns for a similar product?

#2 Analysis of the impact of physical stores on your webshop

InfoFarm example

Impact physical store on online?

– Are online sales higher when a physical store is nearby?

– Where to open a new store?

– How to approach your customers to motivate them to buy (more) at your store?

Impact physical shops - example

• Analysis for a retailer: Physical shops vs online sales

• Impact of opening a physical shop on local online sales (brand awareness?)

• Simple statistical test
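The slides do not say which statistical test was used. As one possible illustration, the sketch below runs an exact one-sided permutation test on the difference in mean online orders between areas near a newly opened shop and areas without one. The function name, the grouping and all figures are hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact one-sided permutation test.

    Returns the share of all possible regroupings whose difference in means
    (A minus B) is at least as large as the observed difference.
    """
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    observed = mean(group_a) - mean(group_b)
    total = sum(pooled)
    hits = 0
    count = 0
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        diff = sum_a / n_a - (total - sum_a) / n_b
        count += 1
        if diff >= observed - 1e-12:  # tolerance for floating-point noise
            hits += 1
    return hits / count


# Hypothetical weekly online orders per postcode area (made-up numbers).
NEAR_NEW_SHOP = [31, 28, 35, 30, 33, 29]
NO_SHOP_NEARBY = [22, 25, 19, 24, 21, 27]

P_VALUE = permutation_p_value(NEAR_NEW_SHOP, NO_SHOP_NEARBY)
print(f"one-sided p-value: {P_VALUE:.4f}")
```

A small p-value only says the difference is unlikely to be chance; it does not prove that the shop caused it, which is why the slide hedges with "brand awareness?".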

Impact physical shops – now what?

• Use this correlation information:

– As extra input for determining new shop locations

– Use pop-up stores to build brand awareness

• Do these pop-up stores have the same lasting influence?

– Publish folders focusing on online in non-covered areas

– Discounts per region

– Google AdWords campaigns focusing on regions with limited brand presence

– Customer segmentation based on this information

#3 Recommendations of products to a customer

Recommendations – Why? How?

– Why?

• Attempt to cross-sell or up-sell

• Provide customers with alternatives that might please them even more

– Traditional approach

• No recommendations at all

• Products in the same category

• Manually managed cross-selling opportunities per product

– Why are these approaches fundamentally flawed?

• They all start from the seller perspective, not the customer!

• “We know what you should be buying”

• Manual recommendations are too costly and time-consuming to maintain – even impossible with large catalogs

Recommendations

– Product based recommendations

• Main focus on online, but why?

• Who knows best what products to recommend?

• Learn from your data; don’t make decisions based on a feeling.

– Time based recommendations

• Recommend or cross-sell different products depending on

– season?

– holiday?

– weather?

– Customer based recommendations

• Learn from your customers and their past.

• Android vs iOS smartphones.

Recommendations – Traditional approach

Current product

Related products: which related products to show?

• Which brush would be appropriate?

• Primer + paint combo?

• Traditionally: unavailable

Similar products: which similar products to show?

• Color alternatives?

• Glossy/matte alternatives?

• Cheaper/better?

• Traditionally: too similar

Recommendations – what does Amazon do?

Cross-selling as realized with other (similar?) customers

Starts from customer point of view!

Recommendations based on perceived customer journeys

Re-use the product comparisons that previous customers did!

DATA DRIVEN!

Recommendations – Other ideas

• Data Science ideas

– “x % of the people who looked at this item eventually bought product X or Y”

– Get cross-selling information from ERP in the physical shops and let this feed the webshop recommendations!

– Similar product in different price ranges (“best-buy alternative”, “deluxe alternative”)

– ...

• This is very achievable for a webshop of any size

– Just generate ideas, and test to see what actually increases sales!

• Secondary use of various kinds of non-structured data = Big Data!

– Weblogs of e-commerce site (use to deduce customer journeys)

– ERP info with bills and/or invoices (use to deduce cross-selling in physical shops)

– Product information (product categorization, …)
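The “x % of the people who looked at this item eventually bought product X or Y” idea can be sketched as a simple count over web sessions. The session structure and product names below are assumptions for illustration; in practice the sessions would first have to be reconstructed from the weblogs.

```python
from collections import Counter, defaultdict

# Hypothetical sessions: (products viewed, products bought in that session).
SESSIONS = [
    ({"paint_white", "paint_grey"}, {"paint_grey"}),
    ({"paint_white", "brush"}, {"paint_white", "brush"}),
    ({"paint_white", "paint_grey"}, {"paint_grey"}),
    ({"paint_white"}, set()),
]


def bought_after_viewing(sessions):
    """For every viewed item, the share of viewing sessions that bought each product."""
    views = Counter()
    purchases = defaultdict(Counter)
    for viewed, bought in sessions:
        for item in viewed:
            views[item] += 1
            purchases[item].update(bought)
    return {
        item: {product: count / views[item] for product, count in purchases[item].items()}
        for item in views
    }


SHARES = bought_after_viewing(SESSIONS)
for product, share in sorted(SHARES["paint_white"].items(), key=lambda kv: -kv[1]):
    print(f"{share:.0%} of sessions that viewed paint_white bought {product}")
```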

#4 Water treatment

InfoFarm example

Context

• Rainfall and wastewater entering the sewer system sometimes peak above the maximum capacity, requiring wastewater to be dumped into rivers. To be avoided as much as possible!

• Long-term question: can we arrive at better capacity management of the sewer system with the data currently available?

• Short-term action: Proof-of-Concept on the application of Data Science with Big Data tools (Hadoop)

Collect

Describe

Data quality – visual inspection

Describe – data quality

Lag analysis between 2 points

17 minutes (≈ 20 km/h, in line with the average wind direction & speed)
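One common way to estimate such a lag is to shift one series against the other and keep the shift with the highest cross-covariance. The sketch below does this on synthetic one-minute readings. The series, the 5.7 km distance between the two points and the function name are assumptions chosen only so that the result lines up with the figures on the slide; this is not the analysis InfoFarm actually ran.

```python
from statistics import mean


def best_lag(upstream, downstream, max_lag):
    """Return the shift (in samples) with the highest cross-covariance."""
    up_mean, down_mean = mean(upstream), mean(downstream)

    def score(lag):
        # zip stops at the shorter sequence, so only overlapping samples count.
        pairs = zip(upstream, downstream[lag:])
        return sum((u - up_mean) * (d - down_mean) for u, d in pairs)

    return max(range(max_lag + 1), key=score)


# Synthetic one-minute readings: the same rain peak arrives 17 samples later.
N = 60
UPSTREAM = [0.0] * N
UPSTREAM[5:8] = [2.0, 5.0, 3.0]
DOWNSTREAM = [0.0] * N
DOWNSTREAM[22:25] = [2.0, 5.0, 3.0]

LAG_MINUTES = best_lag(UPSTREAM, DOWNSTREAM, max_lag=30)

DISTANCE_KM = 5.7  # hypothetical distance between the two measuring points
SPEED_KMH = DISTANCE_KM / (LAG_MINUTES / 60)
print(f"lag: {LAG_MINUTES} min, implied speed: {SPEED_KMH:.1f} km/h")
```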

[Figure: wind rose with the 16 compass directions, North through NNW]

Predict

• Attempted predictions; very limited results due to

– data quality

– our limited business insights

– limited time (Data Science isn’t magic)

• Model predicting whether rain or only wastewater is in the sewer system, based on incoming water at the treatment plant

Observed vs predicted | Predicted No rain | Predicted Rain | % correct
Observed No rain | 4504 | 171 | 96%
Observed Rain | 836 | 602 | 42%
% correct | 84% | 78% | 84%
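The percentages in this table can be recomputed from the four counts. The sketch below does that and adds one extra figure that is not on the slide: the accuracy of always predicting "no rain", which is a useful baseline when one class dominates. The variable names are illustrative.

```python
# Counts taken from the table: rows are observed, columns are predicted.
NO_RAIN_AS_NO_RAIN, NO_RAIN_AS_RAIN = 4504, 171
RAIN_AS_NO_RAIN, RAIN_AS_RAIN = 836, 602

TOTAL = NO_RAIN_AS_NO_RAIN + NO_RAIN_AS_RAIN + RAIN_AS_NO_RAIN + RAIN_AS_RAIN

METRICS = {
    # Row percentages: share of each observed class that was predicted correctly.
    "no_rain_recall": NO_RAIN_AS_NO_RAIN / (NO_RAIN_AS_NO_RAIN + NO_RAIN_AS_RAIN),
    "rain_recall": RAIN_AS_RAIN / (RAIN_AS_NO_RAIN + RAIN_AS_RAIN),
    # Column percentages: share of each prediction that turned out to be right.
    "no_rain_precision": NO_RAIN_AS_NO_RAIN / (NO_RAIN_AS_NO_RAIN + RAIN_AS_NO_RAIN),
    "rain_precision": RAIN_AS_RAIN / (NO_RAIN_AS_RAIN + RAIN_AS_RAIN),
    # Bottom-right cell: overall share of correct predictions.
    "accuracy": (NO_RAIN_AS_NO_RAIN + RAIN_AS_RAIN) / TOTAL,
    # Not on the slide: accuracy of always answering "no rain".
    "baseline_accuracy": (NO_RAIN_AS_NO_RAIN + NO_RAIN_AS_RAIN) / TOTAL,
}

for name, value in METRICS.items():
    print(f"{name}: {value:.0%}")
```

Read this way, the model finds only 42% of the rain periods, and its overall accuracy is a few points above the always-"no rain" baseline, which fits the "very limited results" remark.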

#5 Damage type research

Future InfoFarm example

One damage causing another?

Damage type research

• Not limited to logistics:

– Telecom decoders

– Machinery

• Possible ideas:

– Which damage types occur most?

– Are certain damages restricted to certain types of machinery?

– Do certain damages cause others?

– Do certain damages occur more on certain lines/with certain users?

– Which damages cause early maintenance, and can we predict these occurrences in advance?

– …

#6 Call center aid + omnichannel

Future InfoFarm example

Proactive calling

• Proactive calling:

– List of people most likely to respond to calls

• In the omnichannel case: is it better to call, mail, …?

– List of items they might be interested in

Call center information

• Call center information

– Personal information on caller?

– What are they going to ask?

– What are they saying about you?

Omnichannel

• Are customers more likely to respond to:

– Internet-based contacts: mailings, webshop, …

– Paper brochures

– Calls

– Physical shop

#7 Personalized client mailing

Essent

• Belgian supplier of energy and natural gas to consumers and professional users

• 4th largest player in Belgium

• 350 000 customers, of which 24 000 professional

• Active since 2001

• 150 FTE

• For more information, contact [email protected]

It all started with…

Enjoy

If we know who is gonna call us…

We could give the answer before they call

3 A’s approach

“A”cquire Data

• Collection of data

• Quality check of data

• Descriptive, consumption behavior data, Call-data

SEEMS EASY, BUT IT ISN’T

“A”nalyze Data

• What is the profile of the calling customer

• Which parameters are important

• OUTPUT: algorithm made by data scientist

Make it “Actionable”

• Apply the defined profile to NEW customers with highest risk of calling.

• HOW? Send a personalized video via email with all “relevant” data, for which they normally call.

• WHEN? Before the customer receives the invoice

Send to those new customers with the highest probability of calling.

Example of e-mail with video link

Learning

• Guerrilla approach – no big project

• Mixed team on top of daily business

• Focused innovation, DQ (data quality) positioned as side-effect MUST

• “Guerrilla” led to attention for DQ among the right audience

• Engage employees for good DQ output

– Input of employees that generate the output

– Leads to a long-term commitment

• More impact than big DQ initiatives – part of daily process

#8 What do people write about us

InfoFarm example

How do we get in the media?

• Find news articles containing certain keywords/concerning certain topics

• First model: identifying relevant texts

• Second model: dividing relevant texts into topic clusters

• Third model: are they talking positively/negatively about these topics?

• Final idea, extract:

– Who is talking about you?

– To which organization do they belong?

– Can we confirm their figures?

#9 Fraud detection: Gotch’All

KU Leuven

Gotch’All

• Research (mini lecture: https://www.youtube.com/watch?v=6H5Lp3i05Cg)

– Prof. Dr. Bart Baesens

– Veronique Van Vlasselaer

– Prof. Dr. Tina Eliassi-Rad

– Prof. Dr. Leman Akoglu

– Prof. Dr. Monique Snoeck

• Social network analysis

– Is fraud a social phenomenon?

– Social security fraud

– Credit card fraud

Is fraud a social phenomenon?

Identity theft:

• Before: person calls his/her frequent contacts

• After: person also calls new contacts which coincidentally overlap with another person’s contacts.

Social security fraud:

• Companies are frequently associated with other companies that perpetrate suspicious/fraudulent activities.

Fraud?

• Anomalous behavior

– Outlier detection: abnormal behavior and/or characteristics in a data set might often indicate that the person involved perpetrates suspicious activities

– Behavior of a person/instance does not comply with overall behavior. E.g., illegal setup of a customer account
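A very small example of outlier detection is a robust z-score based on the median and the median absolute deviation (MAD). This is a generic technique chosen here for illustration, not the method used in the Gotch’All research. The amounts and the 3.5 threshold are assumptions.

```python
from statistics import median


def robust_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (median/MAD based) exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread at all, so nothing can be scored
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical transaction amounts for one account (made-up numbers).
AMOUNTS = [42, 38, 45, 40, 39, 41, 44, 400, 43, 37]

OUTLIERS = robust_outliers(AMOUNTS)
print(OUTLIERS)
```

An outlier is only a candidate for investigation; unusual behavior is not proof of fraud.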

Properties of fraud detection models

• Accuracy (AUC, precision and recall)

• Operational efficiency (e.g. 6 second rule in credit card fraud)

• Economic cost

• Interpretability (i.e. the model makes sense)
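The accuracy measures named above can be computed directly from labels and model scores. The sketch below is a minimal, dependency-free version; the labels, scores and the 0.5 threshold are invented for illustration.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall when every score >= threshold is flagged as fraud."""
    predicted = [score >= threshold for score in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    return tp / (tp + fp), tp / (tp + fn)


def auc(labels, scores):
    """Probability that a random fraud case scores higher than a random legitimate one."""
    frauds = [s for s, y in zip(scores, labels) if y]
    legit = [s for s, y in zip(scores, labels) if not y]
    wins = sum((f > l) + 0.5 * (f == l) for f in frauds for l in legit)
    return wins / (len(frauds) * len(legit))


# Hypothetical data: True marks a confirmed fraud case; scores come from some model.
LABELS = [True, False, True, False, False, True, False, False]
SCORES = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.2, 0.1]

PRECISION, RECALL = precision_recall(LABELS, SCORES, threshold=0.5)
AUC = auc(LABELS, SCORES)
print(f"precision={PRECISION:.2f} recall={RECALL:.2f} AUC={AUC:.3f}")
```

Precision and recall depend on the chosen threshold, while AUC summarises ranking quality across all thresholds, which is why the two are usually reported together.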

How to detect Mr. Hyde?

Social Network Analytics: Components

• Nodes (the objects of the network)

– People

– Computers

– Reviewers

– Companies

– Credit card holders

– …

• Links (the relationships between objects)

– Call record

– File sharing

– Product reviews

– Shared suppliers/buyers

– Merchant

– …
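Nodes and links map naturally onto an adjacency structure. As a simple illustration of a network-based feature, the sketch below computes for every node the share of its direct neighbours that are already known to be fraudulent. The companies, links and fraud labels are hypothetical, and this one feature is far simpler than what the Gotch’All research does.

```python
from collections import defaultdict

# Hypothetical links between companies (e.g. shared suppliers); names are made up.
LINKS = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
KNOWN_FRAUD = {"B", "C"}


def fraud_exposure(links, known_fraud):
    """Share of each node's direct neighbours that are known fraud cases."""
    neighbours = defaultdict(set)
    for a, b in links:
        # Links are treated as undirected relationships.
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {
        node: len(adjacent & known_fraud) / len(adjacent)
        for node, adjacent in neighbours.items()
    }


EXPOSURE = fraud_exposure(LINKS, KNOWN_FRAUD)
for node, share in sorted(EXPOSURE.items()):
    print(f"{node}: {share:.0%} of neighbours are known fraud cases")
```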

Social security institution

End of a company’s lifecycle:

(1) Regular suspension

No outstanding debts

(2) Regular bankruptcy

Outstanding debts

Cause: economic situation

(3) Fraudulent bankruptcy

Outstanding debts

Cause: intention

Goal: prevention of fraudulent bankruptcies (i.e., intentional bankruptcies to avoid contribution payments to the government)