Harvesting Business Value with Data Science

Post on 22-Jul-2015

Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be

Data Science Company

Harvesting Business Value with Data Science

InfoFarm - Seminar 18/03/2015

Agenda

• 09:30 About us

• 09:40 Introduction to data science

• 10:00 Data science in practice:

- Fictitious examples

- InfoFarm use-cases

- Big Data at Essent (Els Descheemaeker)

- Fraud detection: Gotch’All (KU Leuven)

• 11:30 Possibility to discuss your data science ideas

About us

Data Science

Big Data

Provide customers with new information by identifying, extracting and modeling data of all types and origins; exploring, correlating and using it in new and innovative ways in order to extract meaning and business value from it.

Java

PHP

E-Commerce

Web Development

Mobile

InfoFarm - Team

• Mixed skills team

- Data scientists

- Big Data developers

- Infrastructure specialist

• Complementary with client on domain expertise

• Certifications

– CCDH - Cloudera Certified Hadoop Developer

– CCAD - Cloudera Certified Hadoop Administrator

– OCJP - Oracle Certified Java Programmer

Visualization

Data science

Business Knowledge

Development

Introduction to data science

Being a Data Scientist

• Complementing business knowledge with figures

• “Getting meaning from data”: Finding patterns (data mining)

• Data Scientist: “A person who is better at statistics than any software engineer and better at software engineering than any statistician”

- Josh Wills

Data Science = about asking the right question!

Data Science

• Relevance for business – use data to:

– Increase conversion

– Increase operational efficiency

– Understand your customers’ needs

– Make better offers

– Make better recommendations

– …

• The key point is spotting opportunities to outperform your competitors using any data available!

• Data science is very affordable to companies of all sizes

• Typical data science projects are tens of man-days of work

Business Knowledge vs Data Science (Intuitive knowledge vs data driven decisions)

Business Knowledge

Acquired by experience

(assumed) insights

RISK: too high bias on past experience and gut feeling

Data Science

Complementary to business knowledge

Confirmative or new insights

Data-driven decision making

RISK: too naive data interpretation, disconnected from business

Data Science vs Business Intelligence

Aspect | Business Intelligence | Data Science
Basic concepts | Structure & query | Experimenting & discovering
Processes | DWH, OLAP, ETL | Avoid heavy ETL (loosely structured data and agile use of many sources)
Investment | Big investment; delivers exactly | Limited investment; might or might not deliver
Cycle | Development | Exploratory working
Perspective | Past | Future
Questions | What happened? | What will happen? What if?
Data | Warehouse, silo | Distributed, real-time, “unstructured”

Data Science vs Big Data

• What about the elephant in the room?

• Big Data allows:

– N=ALL (avoid sampling errors)

• Sampling issues can be overcome by just processing ALL available data (process massive data)

– N=1 (avoid issues with non-homogeneous datasets)

• Categorization becomes true personalisation: project towards ONE individual (calculate per item)

The Data Science maturity model

• Don’t run before you can walk: The Data Science Maturity model

Each level builds on the quality of the underlying step. It’s science, not magic …

• The process is a scientific cycle, not a development cycle!

• Because it is a science, the outcome cannot be predicted

• Even without success you learned something

Collect

Describe

Discover

Predict

Advise

Data Science: Tools & Techniques

Tools

Techniques

Machine Learning

Classification

Clustering

Association Rules

Regression

Information extraction

Classification: Use Cases

• Incoming mail redirection

• Sentiment analysis

• Order picking optimization
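A minimal sketch of incoming mail redirection as a classification problem, assuming a naive Bayes model on word counts. The training mails and the department labels are invented for illustration; they are not InfoFarm data, and a real project would use far more examples.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the highest naive Bayes log-score."""
    vocabulary = {word for counts in word_counts.values() for word in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # prior probability of the label
        for word in text.lower().split():
            # Laplace (+1) smoothing so unseen words do not rule out a label
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training mails and target departments
TRAINING = [
    ("invoice amount is wrong", "billing"),
    ("question about my invoice payment", "billing"),
    ("package arrived damaged", "logistics"),
    ("delivery of my package is late", "logistics"),
]

word_counts, label_counts = train(TRAINING)
department = classify("my invoice shows a wrong amount", word_counts, label_counts)
print(department)
```

The same structure applies to sentiment analysis: only the labels change (positive/negative instead of departments).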

Clustering: Use cases

• Customer segmentation

• Product segmentation
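A minimal sketch of customer segmentation with k-means clustering. The two features (orders per year, average basket value) and the customer values are assumptions made up for this example, and the start centroids are simply the first k points so that the run is repeatable.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
        for p in points
    ]


def kmeans(points, k, iterations=20):
    centroids = list(points[:k])  # deterministic start: the first k points
    for _ in range(iterations):
        labels = assign(points, centroids)
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assign(points, centroids), centroids


# Hypothetical customers: (orders per year, average basket in EUR)
CUSTOMERS = [(1, 20), (2, 25), (1, 22), (10, 200), (12, 220), (11, 210)]

labels, centroids = kmeans(CUSTOMERS, k=2)
print(labels)
print(centroids)
```

In practice the features would be scaled first, otherwise the feature with the largest numeric range dominates the distance.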

Association Rule Learning: Use Cases

• Recommendations

• Data exploration

• Find connections between unrelated events
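A minimal sketch of association rule learning on shopping baskets, limited to rules between pairs of items. It computes the usual support, confidence and lift measures; the baskets and the `min_support` value are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.3):
    """Return {(lhs, rhs): (support, confidence, lift)} for frequent item pairs."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in set(basket))
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(set(basket)), 2)
    )
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]  # P(rhs | lhs)
            lift = confidence / (item_counts[rhs] / n)  # > 1 means positive association
            rules[(lhs, rhs)] = (support, confidence, lift)
    return rules


# Hypothetical baskets from a paint shop
BASKETS = [
    {"paint", "brush"},
    {"paint", "brush", "primer"},
    {"paint", "primer"},
    {"brush"},
    {"paint", "brush"},
]

rules = pair_rules(BASKETS)
for (lhs, rhs), (support, confidence, lift) in sorted(rules.items()):
    print(f"{lhs} -> {rhs}: support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```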

Regression: Use Cases

• Order Quantity Prediction

• Trend estimation
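A minimal sketch of trend estimation with simple linear regression (ordinary least squares on one variable). The weekly order quantities are invented numbers; a real order quantity prediction would likely need seasonality and more explanatory variables.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: week number -> ordered quantity
WEEKS = [1, 2, 3, 4, 5]
ORDERS = [102, 108, 121, 129, 140]

slope, intercept = fit_line(WEEKS, ORDERS)
forecast_week_6 = slope * 6 + intercept
print(f"trend: {slope:.1f} extra orders per week, forecast week 6: {forecast_week_6:.1f}")
```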

Information Extraction

• Extract variables out of unstructured data like text.

• Named Entity Extraction
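A minimal sketch of extracting variables from free text, assuming simple regular expressions are enough for dates, amounts and e-mail addresses. The patterns and the sample message are illustrative assumptions; real named entity extraction (people, organizations, places) usually needs a trained language model rather than regexes.

```python
import re

# Illustrative patterns, not exhaustive
DATE = re.compile(r"\b(\d{2})/(\d{2})/(\d{4})\b")
AMOUNT = re.compile(r"(?:EUR|€)\s?(\d+(?:[.,]\d{2})?)")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract(text):
    """Turn a free-text message into a small record of structured variables."""
    return {
        "dates": ["/".join(parts) for parts in DATE.findall(text)],
        "amounts": [float(a.replace(",", ".")) for a in AMOUNT.findall(text)],
        "emails": EMAIL.findall(text),
    }


# Hypothetical customer message
MESSAGE = "On 18/03/2015 I was charged EUR 49,95 twice. Please reply to jan@example.com."

record = extract(MESSAGE)
print(record)
```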

Data Science Examples

Data Science examples

• Market segmentation

• Impact analysis

• Recommendations

• Water treatment

• Damage type research

• Call center aid

• Personalized client mailing (Essent)

• What do people write about us

• Fraud detection: Gotch’All (KU Leuven)

#1 Market Segmentation

Market segmentation

• Business knowledge based approach

– “We know our segments: -25y, 25y-35y, 35y+ groups, and male/female”

– But is this (still) true?

– E.g.: do we really want to send an ad for the new iPhone to a long-time Android user because he’s a 30-something male customer?

Market segmentation

• Example:

We want to send mailings about our new product

• Decisions to take:

– Which mail to send to which customers?

– We need customer segmentation!

• Risks in failing to do this correctly

– Missing opportunities (not informing customers)

– Annoying customers with irrelevant mailings (churn, reputation damage, …)

Market segmentation

WEB SERVER LOGS: Which customers looked at similar products?

ORDER HISTORY: Which complementary products does the customer own?

EXTERNAL DATA: Reviews or critics?

CRM INFORMATION: Typical profile of a customer responsive to campaigns for a similar product?

#2: Analysis of the impact of physical stores on your webshop

InfoFarm example

Impact physical store on online?

– Are online sales higher when physical store is nearby?

– Where to open a new store?

– How to approach your customers to motivate them to buy (more) at your store?

Impact physical shops - example

• Analysis for a retailer: Physical shops vs online sales

Impact physical shops - example

• Impact of opening a physical shop on local online sales (brand awareness?)

• Simple statistical test
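The slides do not say which statistical test was used. As one possible sketch, an exact one-sided permutation test can check whether online sales in regions near a physical shop are higher than in regions without one. The sales figures below are invented for illustration.

```python
from itertools import combinations


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b):
    """Exact one-sided permutation test on the difference in means.

    Returns the share of all possible regroupings whose mean difference
    is at least as large as the observed one.
    """
    observed = mean(group_a) - mean(group_b)
    pooled = group_a + group_b
    indices = range(len(pooled))
    extreme = total = 0
    for chosen in combinations(indices, len(group_a)):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        if mean(a) - mean(b) >= observed:
            extreme += 1
    return extreme / total


# Hypothetical monthly online orders per 10 000 inhabitants
NEAR_SHOP = [120, 135, 128, 140, 132, 125]
NO_SHOP = [110, 118, 122, 105, 115, 112]

p_value = permutation_p_value(NEAR_SHOP, NO_SHOP)
print(f"p-value: {p_value:.4f}")
```

A small p-value only shows that the difference is unlikely to be chance; it does not prove that the shop causes the extra online sales.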

Impact physical shops – now what?

• Use this correlation information:

– As extra input for determining new shop locations

– Use pop-up stores to get brand awareness

• Do these pop-up stores have the same lasting influence?

– Publish leaflets focusing on online in non-covered areas

– Discounts per region

– Google AdWords campaigns focusing on regions with limited brand presence

– Customer segmentation based on this information

#3 Recommendations of products to a customer

Recommendations – Why? How?

– Why?

• Attempt to cross-sell or up-sell

• Provide customers with alternatives that might please them even more

– Traditional approach

• No recommendations at all

• Products in the same category

• Manually managed cross-selling opportunities per product

– Why are these approaches fundamentally flawed?

• They all start from the seller perspective, not the customer!

• “We know what you should be buying”

• Manual recommendations are too costly and time-consuming to maintain – even impossible with large catalogs

Recommendations

– Product based recommendations

• Main focus on online, but why?

• Who knows best what products to recommend?

• Learn from your data, don’t take decisions based on a feeling.

– Time based recommendations

• Recommend or cross sell different products depending on

– season?

– holiday?

– weather?

– Customer based recommendations

• Learn from your customers and their past.

• Android vs iOS smartphones.

Recommendations – Traditional approach

Current product

Similar products

Related products

Which related products to show?

Which brush would be appropriate?

Primer + paint combo?

Traditionally: unavailable

Which similar products to show?

Color alternatives? Glossy/matte alternatives?

Cheaper/better?

Traditionally: too similar

Recommendations – what does Amazon do?

Cross-selling as realized with other (similar?) customers

Starts from customer point of view!

Recommendations based on perceived customer journeys

Re-use the product comparisons that previous customers made!

DATA DRIVEN!

Recommendations – Other ideas

• Data Science ideas

– “x % of the people who looked at this item eventually bought product X or Y”

– Get cross-selling information from ERP in the physical shops and let this feed the webshop recommendations!

– Similar product in different price ranges (“best-buy alternative”, “deluxe alternative”)

– ...

• This is very achievable for a webshop of any size

– Just generate ideas, and test to see what actually increases sales!

• Secondary use of various kinds of non-structured data = Big Data!

– Weblogs of e-commerce site (use to deduce customer journeys)

– ERP info with bills and/or invoices (use to deduce cross-selling in physical shops)

– Product information (product categorization, …)
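A minimal sketch of the “x % of the people who looked at this item eventually bought product X or Y” idea, assuming the weblogs have already been reduced to one record per customer session. The session structure (`viewed`, `bought`) and the product names are assumptions made for this example.

```python
from collections import Counter, defaultdict


def viewed_then_bought(sessions):
    """For each viewed item, the share of its viewers who bought each product."""
    viewers = Counter()
    purchases_after_view = defaultdict(Counter)
    for session in sessions:
        bought = set(session["bought"])
        for item in set(session["viewed"]):
            viewers[item] += 1
            for product in bought:
                purchases_after_view[item][product] += 1
    return {
        item: {product: count / viewers[item] for product, count in products.items()}
        for item, products in purchases_after_view.items()
    }


# Hypothetical sessions deduced from weblogs and orders
SESSIONS = [
    {"viewed": ["paint_white", "paint_grey"], "bought": ["paint_grey", "brush"]},
    {"viewed": ["paint_white"], "bought": ["paint_white"]},
    {"viewed": ["paint_white", "primer"], "bought": ["paint_grey"]},
    {"viewed": ["paint_white"], "bought": []},
]

shares = viewed_then_bought(SESSIONS)
for product, share in sorted(shares["paint_white"].items(), key=lambda kv: -kv[1]):
    print(f"{share:.0%} of the people who looked at paint_white bought {product}")
```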

#4 Water treatment

InfoFarm example

Context

• Rainfall and wastewater entering the sewer system sometimes peak above the maximum capacity, requiring wastewater to be dumped into rivers. To be avoided as much as possible!

• Long-term question: can we arrive at better capacity management of the sewer system with the data currently available?

• Short-term action: Proof-of-Concept on the application of Data Science with Big Data tools (Hadoop)

Collect

Describe

Data quality – visual inspection

Describe – data quality

Lag-analysis between 2 points

17 minutes (= approx. 20 km/h = average wind direction and speed)

[Figure: wind rose with 16 compass directions, North through NNW]
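A minimal sketch of a lag analysis between two measuring points, assuming both series are sampled at the same fixed interval. It shifts the second series against the first and keeps the shift with the highest Pearson correlation. The series and the 5-minute sampling interval are invented; the slides do not describe the method actually used.

```python
import math


def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat series carries no timing information
    return sxy / math.sqrt(sxx * syy)


def best_lag(upstream, downstream, max_lag):
    """Lag (in samples) at which downstream correlates best with upstream."""
    def correlation_at(lag):
        return pearson(upstream[:len(upstream) - lag], downstream[lag:])
    return max(range(max_lag + 1), key=correlation_at)


# Hypothetical rainfall intensity at two points, one sample every 5 minutes
SAMPLE_MINUTES = 5
POINT_A = [0, 0, 1, 5, 9, 4, 1, 0, 0, 0, 0, 0]
POINT_B = [0, 0, 0, 0, 0, 1, 5, 9, 4, 1, 0, 0]

lag = best_lag(POINT_A, POINT_B, max_lag=5)
print(f"best lag: {lag} samples = {lag * SAMPLE_MINUTES} minutes")
```

Dividing the distance between the two points by the lag gives a speed that can be compared with the measured wind speed.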

Predict

• Predictions (attempted): very limited results due to

– data quality

– our limited business insights

– limited time (Data Science isn’t magic)

• Model predicting whether rain or only wastewater is in the sewer system based on incoming water at treatment plant

 | Predicted No rain | Predicted Rain | % correct
Observed No rain | 4504 | 171 | 96%
Observed Rain | 836 | 602 | 42%
% correct | 84% | 78% | 84%
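The percentages in this table can be recomputed from the four counts. The sketch below does that, reading the last column as the share of each observed class that was predicted correctly, the bottom row as the share of each prediction that was correct, and the corner value as overall accuracy. The metric names are labels chosen here, not taken from the slides.

```python
# Counts from the table: rows are observed, columns are predicted
NO_RAIN_AS_NO_RAIN, NO_RAIN_AS_RAIN = 4504, 171
RAIN_AS_NO_RAIN, RAIN_AS_RAIN = 836, 602

total = NO_RAIN_AS_NO_RAIN + NO_RAIN_AS_RAIN + RAIN_AS_NO_RAIN + RAIN_AS_RAIN

metrics = {
    # share of observed cases that were recognised (recall per class)
    "observed_no_rain_correct": NO_RAIN_AS_NO_RAIN / (NO_RAIN_AS_NO_RAIN + NO_RAIN_AS_RAIN),
    "observed_rain_correct": RAIN_AS_RAIN / (RAIN_AS_NO_RAIN + RAIN_AS_RAIN),
    # share of predictions that turned out right (precision per class)
    "predicted_no_rain_correct": NO_RAIN_AS_NO_RAIN / (NO_RAIN_AS_NO_RAIN + RAIN_AS_NO_RAIN),
    "predicted_rain_correct": RAIN_AS_RAIN / (NO_RAIN_AS_RAIN + RAIN_AS_RAIN),
    # share of all cases predicted correctly
    "accuracy": (NO_RAIN_AS_NO_RAIN + RAIN_AS_RAIN) / total,
}

for name, value in metrics.items():
    print(f"{name}: {value:.0%}")
```

The 42% on the rain row is the weak spot: the model misses more than half of the observed rain periods, even though overall accuracy looks reasonable.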

#5 Damage type research

Future InfoFarm example

One damage triggering another?

Damage type research

• Not limited to logistics:

– Telecom decoders

– Machinery

• Possible ideas:

– Which damage types occur most?

– Are certain damages restricted to certain types of machinery?

– Do certain damages trigger others?

– Do certain damages occur more on certain lines/with certain users?

– Which damages cause early maintenance, and can we predict these occurrences in advance?

– …

#6 Call center aid + omnichannel

Future InfoFarm example

Proactive calling

• Proactive calling:

– List of people most likely to respond to calls

• In omnichannel case: better to call, mail, …

– List of items they might be interested in

Call center information

• Call center information

– Personal information on caller?

– What are they going to ask?

– What are they saying about you?

Omnichannel

• Are customers more likely to respond to:

– Internet based contacts: mailings, webshop, …

– Paper brochures

– Calls

– Physical shop

#7 Personalized client mailing

Essent

• Belgian supplier of energy and natural gas to consumers and professional users

• 4th largest player in Belgium

• 350 000 customers of which 24 000 professional

• Active since 2001

• 150 FTE

• For more information, contact els.descheemaeker@essent.be

It all started with…

Enjoy

If we know who is gonna call us…

We could give the answer before they give a call

3 A’s approach

“A”cquire Data

• Collection of data

• Quality check of data

• Descriptive, consumption behavior data, call data

SEEMS EASY, BUT IT ISN’T

“A”nalyze Data

• What is the profile of the calling customer

• Which parameters are important

• OUTPUT: algorithm made by data scientist

Make it “Actionable”

• Apply the defined profile to NEW customers with highest risk of calling.

• HOW? Send a personalized video via email with all “relevant” data, for which they normally call.

• WHEN? Before the customer receives the invoice

Send to those new customers with the highest probability of calling.

Example of e-mail with video link
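A minimal sketch of the "apply the profile to new customers" step, assuming the profile is nothing more than a historical call rate per customer category. The categories (payment method), the history and the customer ids are all invented; the slides do not describe the actual Essent algorithm.

```python
from collections import defaultdict


def call_rate_per_profile(history):
    """history: (profile, called) pairs -> share of callers per profile."""
    totals = defaultdict(int)
    callers = defaultdict(int)
    for profile, called in history:
        totals[profile] += 1
        callers[profile] += int(called)
    return {profile: callers[profile] / totals[profile] for profile in totals}


def most_likely_callers(new_customers, rates, top_n):
    """new_customers: (customer_id, profile) pairs, ranked by historical call rate."""
    ranked = sorted(new_customers, key=lambda c: rates.get(c[1], 0.0), reverse=True)
    return [customer_id for customer_id, _ in ranked[:top_n]]


# Hypothetical history: did a customer with this payment method call after the first invoice?
HISTORY = [
    ("transfer", True), ("transfer", True), ("transfer", False),
    ("direct_debit", False), ("direct_debit", False),
    ("direct_debit", True), ("direct_debit", False),
]
NEW_CUSTOMERS = [("c1", "direct_debit"), ("c2", "transfer"), ("c3", "transfer")]

rates = call_rate_per_profile(HISTORY)
send_video_to = most_likely_callers(NEW_CUSTOMERS, rates, top_n=2)
print(send_video_to)
```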

Learning

• Guerrilla approach – no big project

• Mixed team on top of daily business

• Focused innovation, with data quality (DQ) positioned as a side-effect MUST

• “Guerrilla” led to attention for DQ towards right audience

• Engage employees for good DQ output

– Input of employees that generate the output

– Leads to a long term commitment

• More impact than big DQ initiatives – part of daily process

#8 What do people write about us

InfoFarm example

How do we get in the media?

• Find news articles containing certain keywords/concerning certain topics

• First model: Identifying relevant texts

How do we get in the media?

• Second model: dividing relevant texts into topic clusters

How do we get in the media?

• Third model: are they talking positive/negative about these topics
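A minimal sketch of the positive/negative step, assuming a simple word-list (lexicon) approach. The word lists and example sentences are invented; the slides do not say which sentiment model was used, and a trained classifier would normally replace the lexicon.

```python
import re

# Hypothetical sentiment lexicons
POSITIVE = {"good", "great", "innovative", "reliable"}
NEGATIVE = {"bad", "poor", "late", "complaint", "fraud"}


def sentiment(text):
    """Label a text by counting positive and negative lexicon words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(word in POSITIVE for word in words) - sum(word in NEGATIVE for word in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


ARTICLES = [
    "Great and reliable service",
    "Poor support and late delivery",
    "The seminar takes place in Kontich",
]
labels = [sentiment(article) for article in ARTICLES]
print(labels)
```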

How do we get in the media?

• Final idea, extract:

– Who is talking about you?

– To which organization do they belong?

– Can we confirm their figures?

#9 Fraud detection: Gotch’All

KU Leuven

Gotch’All

• Research (mini lecture: https://www.youtube.com/watch?v=6H5Lp3i05Cg)

– Prof. Dr. Bart Baesens

– Veronique Van Vlasselaer

– Prof. Dr. Tina Eliassi-Rad

– Prof. Dr. Leman Akoglu

– Prof. Dr. Monique Snoeck

• Social network analysis

– Is fraud a social phenomenon?

– Social security fraud

– Credit card fraud

Is fraud a social phenomenon

Identity theft:

• Before: person calls his/her frequent contacts

• After: person also calls new contacts which coincidentally overlap with another person’s contacts.

Social security fraud:

• Companies are frequently associated with other companies that perpetrate suspicious/fraudulent activities.

Fraud?

• Anomalous behavior

– Outlier detection: abnormal behavior and/or characteristics in a data set often indicate that the person perpetrates suspicious activities

– Behavior of a person/instance does not comply with overall behavior. E.g., illegal set up of customer account
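A minimal sketch of outlier detection on a single variable, assuming a robust z-score based on the median and the median absolute deviation (MAD). The 3.5 threshold is a commonly used rule of thumb and the transaction amounts are invented; this is one of many possible techniques, not the method from the research presented here.

```python
import statistics


def robust_outliers(values, threshold=3.5):
    """Values whose modified z-score (median/MAD based) exceeds the threshold."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread at all, so nothing can be called abnormal
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]


# Hypothetical transaction amounts of one customer
AMOUNTS = [20, 22, 19, 21, 23, 20, 250]

outliers = robust_outliers(AMOUNTS)
print(outliers)
```

Median and MAD are used instead of mean and standard deviation because a single extreme value would otherwise inflate the spread and hide itself.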

Properties of fraud detection models

• Accuracy (AUC, precision and recall)

• Operational efficiency (e.g. 6 second rule in credit card fraud)

• Economic cost

• Interpretability (i.e. make sense)
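A minimal sketch of the three accuracy measures named above, computed on a tiny invented set of fraud labels and model scores. AUC is computed here as the share of (fraud, non-fraud) pairs in which the fraud case gets the higher score; the 0.5 threshold for precision and recall is an arbitrary choice for the example.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall when every score >= threshold is flagged as fraud."""
    flagged = [score >= threshold for score in scores]
    tp = sum(1 for y, f in zip(labels, flagged) if y == 1 and f)
    fp = sum(1 for y, f in zip(labels, flagged) if y == 0 and f)
    fn = sum(1 for y, f in zip(labels, flagged) if y == 1 and not f)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def auc(labels, scores):
    """Probability that a random fraud case outscores a random non-fraud case."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical cases: 1 = fraud, 0 = legitimate, with model scores
LABELS = [1, 0, 1, 0, 0, 1]
SCORES = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]

precision, recall = precision_recall(LABELS, SCORES, threshold=0.5)
area = auc(LABELS, SCORES)
print(f"precision={precision:.2f} recall={recall:.2f} AUC={area:.2f}")
```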

How to detect mister Hyde?

Social Network Analytics: Components

• Nodes (the objects of the network)

– People

– Computers

– Reviewers

– Companies

– Credit card holders

– …

• Links (the relationships between objects)

– Call record

– File sharing

– Product reviews

– Shared suppliers/buyers

– Merchant

– …
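A minimal sketch of how nodes and links can be turned into a fraud indicator, assuming an undirected network and a simple feature: the share of a node's direct neighbours that are already known to be fraudulent. The companies, links and fraud labels are invented, and this neighbourhood feature is only an illustration, not the Gotch’All method itself.

```python
from collections import defaultdict


def build_graph(links):
    """Undirected adjacency sets from (node, node) links."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def fraud_exposure(graph, known_fraud):
    """Share of each node's neighbours that are known to be fraudulent."""
    return {
        node: len(neighbours & known_fraud) / len(neighbours)
        for node, neighbours in graph.items()
    }


# Hypothetical companies linked by shared suppliers/buyers
LINKS = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
KNOWN_FRAUD = {"B", "D"}

graph = build_graph(LINKS)
exposure = fraud_exposure(graph, KNOWN_FRAUD)
for company, share in sorted(exposure.items()):
    print(f"{company}: {share:.0%} of neighbours are known fraud cases")
```

Such a score can be added as an extra input variable to a regular classification model, next to the company's own characteristics.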

Social security institution

End of a company’s lifecycle:

(1) Regular suspension

No outstanding debts

(2) Regular bankruptcy

Outstanding debts

Cause: economic situation

(3) Fraudulent bankruptcy

Outstanding debts

Cause: intention

Goal: prevention of fraudulent bankruptcies (i.e., intentional bankruptcies to avoid contribution payments to the government)