THE DATA MINING AUTOMATION COMPANY TM European Conference on Machine Learning and Principles and...

THE DATA MINING AUTOMATION COMPANY TM

European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases

15 - 19 September 2008, Antwerp, Belgium

ECML PKDD 2008

Industrializing Data Mining, Challenges and Perspectives

Françoise Fogelman Soulié [email protected]

ECML PKDD 2008 September 17, 2008Antwerp, Belgium

THE DATA MINING AUTOMATION COMPANY TM 2

• KXEN is an editor of a data mining software - The KXEN Analytic Framework™ is a suite

of predictive and descriptive modeling engines that create robust analytic

models fast and easily

- KXEN's products are based upon Vladimir Vapnik’s Statistical Learning Theory (Structural Risk Minimization)

• Our vision - KXEN wants to make predictive analytics part of everyday corporate

business decisions

• Our mission - Our mission is to embed advanced analytics into existing enterprise

applications and business processes

http://www.kxen.com/

KXEN


Data Mining industrial applications

Data Mining industrial applications are in

• Telecommunications

• Bank & Finance

• Retail

• … and increasingly in « Web » companies

For

• Marketing

• Risk, fraud

• Security

• On-line retail and services, advertising, key-word optimization …

• In my talk, I will rely upon examples taken from KXEN customers in 3 sectors

Banking & Finance Retail Telecommunications


Agenda

• Which world is this ?

• Data Mining in the real world

• Some examples


A little bit of history

What has happened ?

Andrew Moore


Data

The volume of data has explodedIn the 90s Today• Web transactions Fayyad, KDD 2007

- At Yahoo ! • Around 16 B events / day• 425 M visitors / month• 10 Tb data / day

• RFID Jiawei, Adma 2006

- A retailer with 3,000 stores, selling 10,000 items a day per store• 300 million events per day (after redundancy removal)

• Social network Kleinberg, KDD’07

- 4.4-million-node network of declared friendships on blogging community LiveJournal

- 240-million-node network of all IM communication over one month on Microsoft Instant Messenger

• Cellular networks- A telecom carrier generates hundreds of millions of CDRs / day- The network generates technical data : 40 M events / day in a large city


Data

Just how big are big data sets ?

• Depth- Up to 100 Million lines

- Or Billion ?

• Width- Thousands of attributes

- Or Million ?

If they’re big today, wait for to-morrow

• Size of databases - X2-3 every 2 years


Data

Many different

• Sources

• Types - Structured

- Unstructured• Text• Image • Video• Audio …

• Volumes- Web dominates !

• Lots more data out there- X 10 ? X 100 ?

Russom, TDWI 2007


Agenda



• Some examples


What are the issues in the real-world ?

• Data mining provides ways to define actions- A model not used for action is a useless cost

• Data volume grow exponentially : number of models must too

Herschel, Gartner 2006


Data mining in practice

The data mining process is not very efficient in practice

• It does not make use of all variables

• It spends a large part of modeling time on data manipulation

• And as a result, it still takes a long time to build a model

Time to build a model

Project breakdown

Number of variables used

Eckerson, TDWI, 2007


Challenges for the real-world

1. Challenge n°1 : IntegrationData mining is never THE solution : it only is a – small – part of it- In the real-world data mining needs to be integrated into a global system - Data mining needs to take inputs from/generate results to rest-of-the-world - Key words : openness, standards

2. Challenge n°2 : ProductivityData mining must bring value - Exploit all data available & Produce actionable results- At lowest possible cost- Be simple to use by non experts- Key words : Return On Investment

3. Challenge n° 3 : ScalabilityData mining must hold data volumes & number of models- Handle LARGE data sets- Produce AS MANY models as needed- Key words : time to produce a model as function of data set (width, depth)

4. Challenge n°4 : AutomatisationData mining should do all of the above (almost) automatically- Produce models, detect problems, retrain …- Key words : automatisation, control



• Scalability - Data volume is characterized by (width, depth)- Modeling has 2 phases : build & apply

• How does time scale with volume ? Number of models ?

- Is real-time possible ?• At apply• Integration, Productivity, Scalability &

Automatisation

• Productivity- About 40% of modeling time is spent in data

preparation, can this be cut ?- Can all data be used ?

• Volumes ?• Structured / Unstructured ?

• Automatisation- Can modeling be done by

• A machine ?• Non-experts ?

- Can a model be « turned on » and controlled by a machine ?

Russom, TDWI, 2007



The ability of Data Mining tools to satisfy the real-world challenges is critical for the wide deployment of data mining applications


What are the issues in the real-world ?

Algorithms & Theory is NOT central said Fayyad at KDD’07

• Yes- In real-world, central issue is $

• But - Only strong algorithms & theory

can bring the $

- Iff they are up to the issues in the real-world

20

Research

Researcher view

Algorithms andTheory

Database

Systems

22

Research

Practitioner view

Systems and integration

Database

Algorithms

Customer

23

Research

Business view

Systems

Database

Algorithms

Customer

$$$’s

Fayyad


Agenda



• Some examples


A large financial institution

The Dilemma

• Efficiency goals

• Limits on resources

• New data sources

• Time-to-market opportunities

• Skill specific dependencies

• Need for competitive advantage

• Unique opportunity to mitigate macroeconomic risks including mortgage and housing troubles

Example n° 1


Extreme Granularity Data

• Competitive advantage through enhanced analytics and unique data sources

• Numerous sources of granular data (transaction data, payment data, call data, etc.)

• Granularity and detail creates value if you can aggregate intelligently and extract knowledge

• Number of attributes grows exponentially as you consider time series, interactions, and transformations

Example n° 1


The Challenge to KXEN

• Aggregation yields tens of thousands of variables about customers

• How do we select the 10 best variables for predicting credit risk, such as, “Will the customer be delinquent in payment 12 months from now?”

• How do we develop and deploy models quickly and efficiently?

• How can we enable business analysts not statisticians to build predictive models?

• How do I regain flexibility as the business owner while ensuring production quality?

Example n° 1


Possible usage for enhancing variable selection

• Approach 1: use the same variables we used last year (common in resource constrained environment)

• Approach 2: based on experience and expertise, select the 500 variables that are most likely to be useful. Then use statistics to pick the 10 to 20 subset that is best (common in sophisticated analytic shops with heavy analyst presence)

• Approach 3: use all the variables and let the data tell you which are useful (rare where attributes >1000)

• KXEN POC focused on enabling approach 3 as new modeling process standard

Example n° 1


The KXEN Fit – Large telco operator

Rapid development and deployment utilizing large volumes of data

• Data Enhancements- SNA dataset added ~1000 extra variables to our existing analytical

datasets (external data plus additional info aggregated from SNA data)

- Analytical datasets average over 2000 attributes

• Short time lines- Model development/comparative analysis to pilot execution at times was

done over 1-2 week periods

- KXEN allowed for quick turnaround on model development

- Built standard models and standard + SNA models for piloting

Example n° 2


A large american Data provider

Target Source Consumer Database

• Largest response survey database in North America

• Precision data on 2 million Canadian households and 14 million US households

• 1000 data variables for targeting :- Behavioral, lifestyle, demographics and more

- Drives solutions for new customer acquisition, modeling, retention / growth and consumer insights

Example n° 3


A large american Data provider

The Challenge

• Build 252 models

• Score 167 models on 4 million records

• Score 85 models on 2 million records

• Do it all in a 5 day time period…

• And do it with just one analyst…

Example n° 3


Number of variablesSears

900Large Bank

1 200Vodafone D2

2 500Barclays

2 500Rogers Wireless

5 800HSBC

8 000Credit card

16 000

Why use more data ?

Some companies use lots of data

The goal

• Increase the performance of their models

The challenge

• Produce models with thousands of variables

• Exploiting all the available variables - 3 000, 5 000, 10 000 ?

• Creating new variables, also- Aggregates

- Behavioral variables

- Textual variables

- « Social network » variables …

The results- More lift, more returns, more €, $


Scoring Code

SQLPMML

Initial Model

Validated Model

Exploit all variables

The analysis process

• Build ADS (Analytic Data Set)- Extract data

- Transform, aggregate, …

- Create ADS

• Build model- Produce initial model

- Refine, select variables

- Produce final model

• Apply model- Extract data


- Create ADS

- Apply model

- Export results to data base

Analytic Data Set including new variables

D’après



• What-if ADS could contain ALL variables- Example : in Teradata ADS is a view. Data are not replicated or moved

ID FIELDS BEHAVIOR FIELDS DEMOGRAPHIC FIELDS MODEL SCORES CONTACT HISTORY

Model Description

Model ID

Model NameModel Date

Model Scores

Model ID (FK)Individual ID (FK)

Score

Individuals in Household

Individual ID

Household ID (FK)Individual attribute fieldsPrimary Householder Flag

Campaign Type Ref

Campaign ID

DescriptionStart Date

Contact History Detail

Campaign ID (FK)Individual ID (FK)

Promotion CodeTreatment CodeContact Date

Contact History Summary

Individual ID (FK)

Number of Mail ContactsNumber of eMail ContactsNumber of Telephone Contacts

Household

Household ID

Street AddressCityStateZipPhone

Demographics Lifestage Lifestage

Individual ID (FK)

Demographic Data Fields

Accounts

Account No

Account Information

Transactions

Transaction IDIndividual ID (FK)Account No (FK)

Transaction DetailsTransaction Timestamp

Products

Product ID

Product Details

Purchases

Transaction IDIndividual ID (FK)Product ID (FK)

Purchase Details

CustomerCustomerDataData

WarehouseWarehouse

D’après

HH-ID

CUST-ID

NAME

VALUE_SEG

BEHAV_SEG

LIFESTY_S

EG

LIFESTG_S

EG

EQUITY_1

2

EQUITY_2

4

LTVAGE

INCOM

E_CD

EDUCATION

2347387474 4797978698Gustavo 2 5 8 3 37.22 28.18 49.8 … 28 7 14 …7879973979 2439970274 Susan 3 3 6 5 18.88 28.97 154.32 … 42 9 18 …

9870908 879979 Andre 1 1 18 4 -1.38 -12.8 -48.76 … 61 5 12 …… … … … … … … … … … … … … … …

… …



The analysis process

• Build ADS (Analytic Data Set)- Extract data


- Create ADS

• Build model- Produce initial model

- Refine, select variables

- Produce final model

• Apply model- Extract data


- Create ADS

- Apply model

- Export results to data base

« In-database Mining »

Analytic Data Set including new variables

Initial Model

Validated Model

Scoring Code

SQLPMML

< 1 week

3 days

< 1 day

< 1 day


Using behavioral data

• From transactional data produce behavioral data - Transition from transaction A to transaction B

• Number of variables grows exponentially !

• But lift grows !!

Price

Time Jan Feb Mar Apr May Jun

40

30

20

10

0

Customer 2

AABB

CC


Using textual variables

• In many applications (surveys, emails, …) there is a textual field

• These fields can be exploited to enhance results of data mining analysis

Bag of words

Oil prices rebound to above $67 after fears Hurricane Rita, heading for the Gulf of Mexico, may cause more damage to US oil refineries. One of the refinery…

Textual field

Count stop lists

Stemming

Analytical Data Set

Enrich

THE DATA MINING AUTOMATION COMPANY TM 30Without text

With text

Using textual variables – DataMining Cup’06

With 1000 added textual variables

• Computing time : 6 seconds 43 seconds

• Most significant variables : textual

• Lift : goes up


Using « social network » variables

An example in telco

• Build the social network

• Extract « social network » variables - A few 10-100 additional variables


Using « social network » variables

• « social network » variables increase lift- Globally : 40%

- First decile : 67%

- Second decile : 47%

0

The rest


Why use more models ?

Some Companies produce lots of models

The goal

• Increase the performance of their models

The challenge

• Produce thousands of models

• For every campaign

• Refreshed often- Data distribution changes fast (Web)

• As « fine grain » as possible- Performance on a homogeneous population is better

The results- More lift, more returns, more €, $

- Ex : Response rate + 260%

Number of Models / yearVodafone D2 760Market research 9 600Cox Comm. 28 800Real estate 70 000Lower My Bills 460 000

Cable TV

High-speed InternetDigital TelephonePay-Per-View

Janu

ary

Digital Cable

April

Octob

er

July

TDWI 03-2005


On-going research projects – Call center supervision

• Key Performance Indicators- Number of emails received, on-hold,

processed • Per time period, category, agent …

- To make sure SLA are met

• Issues- 100s of KPI

- Aggregated in hierarchies

• Predictive models- Alerts

- Optimize resource (capacity, people)• Long term Time Series forecasting on

KPI

- Detect deviations (seasonal effects)• Short term Time Series forecasting

on KPI

• “Predictive cubes” for hundreds of indicators


On-going research projects – Recommendation

Models for recommendation

• The catalog can have 1 Million products

• Hundreds / thousands of models with thousands of attributes

www.hmv.co.jp


Conclusion

Data mining can bring more $$

But it needs to satisfy constraints

• Integration

• Productivity

• Scalability

• Automatization

Manipulating MORE data and producing MORE models requires

• Automatized data manipulation & coding

• Simple & robust algorithms

• Software modules open & in line with market standards

Productivity gains Rogers Wireless

7xVodafone D2

10xSears

8xBelgacom

12x


How does KXEN do it

• KXEN was designed for data miningindustrial applications

• KXEN is based upon Vapnik’s SRM – Structural Risk

Minimization - Strategy to control the trade-off

accuracy / robustness

• KXEN produces- An automatic encoding

• Non linear

- Then regression / classification• Polynomial

• Which allows- Integration

- Productivity

- Scalability

- Automatization

1

*h

n

VC dimension

Generalization error

Training error

empRGenR

h

Error......

......

21

21

khhhk

Ridge regression

Polynomials

KI (Gini index)


How does KXEN do it

• In practice, for one final model, KXEN builds many models (SRM)- Depending upon variables complexity, encoding requires 10 - 30 models

- Then about 100 models (for regression)

• KXEN uses « data streams » techniques

• There is no data duplication - A few sweeps are necessary

• Time to build a model- About linear in depth & width


Questions ?


• Location - Marriott Paris Rive Gauche Hotel & Conference Center

17 Boulevard St Jacques - 75014 Paris, France

• Dates- June 28 - July 1, 2009

• Key Submission Dates- Due January 19, 2009 Workshop Proposals

- Due February 2, 2009 Paper Abstracts

- Due February 6, 2009 Research/Industrial Track Papers

- Due February 23, 2009 Tutorials/Panel Proposals

http://www.kdd.org/kdd2009/index.html


For the first time, in 2009, KDD will leave North America for Europe

KDD’09 will be held in PARIS, France


Mark the dates

Conference

• June 28th – July 1st, 2009

Key Submission Dates

• January 19, 2009 Workshop Proposals

• February 2, 2009 Paper Abstracts

• February 6, 2009 Research/Industrial Track Papers

• February 23, 2009 Tutorials/Panel Proposals

• April 10, 2009 Notification

• April 27, 2009 Camera Ready

Visit the Conference site

http://www.kdd.org/kdd2009/


Contact us

General Chair [email protected]

• John Elder (Elder Research, Inc.)

• Francoise Fogelman Soulié (KXEN, Inc.)

Program Co-Chair [email protected]

• Peter Flach (University of Bristol)

• Mohammad Zaki (Rensselaer Polytechnic Institute

I personnally count on you for sending lots of good papers and show a very strong european attendance at KDD’09

See you there

THE DATA MINING AUTOMATION COMPANY TM European Conference on Machine Learning and Principles and...

Documents

Transcript of THE DATA MINING AUTOMATION COMPANY TM European Conference on Machine Learning and Principles and...