KDD Cup 2009: Fast Scoring on a Large Database
Presentation of the Results at the KDD Cup Workshop, June 28, 2009
The Organizing Team
KDD Cup 2009 Organizing Team
Project team at Orange Labs R&D:
• Vincent Lemaire
• Marc Boullé
• Fabrice Clérot
• Raphaël Féraud
• Aurélie Le Cam
• Pascal Gouzien
Beta testing and proceedings editor:
• Gideon Dror
Web site design: • Olivier Guyon (MisterP.net, France)
Coordination (KDD cup co-chairs):
• Isabelle Guyon
• David Vogel
Thanks to our sponsors…
Orange, ACM SIGKDD, Pascal, Unipen, Google, Health Discovery Corp, Clopinet, Data Mining Solutions, MPS
KDD Cup Participation By Year
[Bar chart: number of participating teams per year, 1997–2009; data in the table below]
Year # Teams
1997 45
1998 57
1999 24
2000 31
2001 136
2002 18
2003 57
2004 102
2005 37
2006 68
2007 95
2008 128
2009 453
Record KDD Cup Participation
Participation statistics:
– 1299 registered teams
– 7865 entries
– 46 countries:
Argentina, Australia, Austria, Belgium, Brazil, Bulgaria, Canada, Chile, China, Fiji, Finland, France, Germany, Greece, Hong Kong, Hungary, India, Iran, Ireland, Israel, Italy, Japan, Jordan, Latvia, Malaysia, Mexico, Netherlands, New Zealand, Pakistan, Portugal, Romania, Russian Federation, Singapore, Slovak Republic, Slovenia, South Africa, South Korea, Spain, Sweden, Switzerland, Taiwan, Turkey, Uganda, United Kingdom, United States, Uruguay
A worldwide operator
One of the main telecommunication operators in the world
– Providing services to more than 170 million customers over five continents
– Including 120 million under the Orange brand
KDD Cup 2009 organized by Orange
Customer Relationship Management (CRM)
Three marketing tasks: predict the propensity of customers
– to switch provider: Churn
– to buy new products or services: Appetency
– to buy upgrades or new options proposed to them: Up-selling
Objective: improve the return on investment (ROI) of marketing campaigns
– Increase the efficiency of the campaign for a given campaign cost
– Decrease the campaign cost for a given marketing objective
Better prediction leads to better ROI
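As a rough illustration of the ROI levers above (all figures hypothetical, not Orange data), the arithmetic can be sketched as:

```python
# Hypothetical campaign figures, chosen only to illustrate why a
# better propensity score improves campaign ROI.
def campaign_roi(n_contacted, cost_per_contact, precision, value_per_hit):
    """ROI = (gains - cost) / cost for a targeted campaign."""
    cost = n_contacted * cost_per_contact
    gains = n_contacted * precision * value_per_hit
    return (gains - cost) / cost

# Same budget, better model: precision among contacted customers rises.
baseline = campaign_roi(10_000, 2.0, 0.05, 100.0)  # 5% of targets respond
improved = campaign_roi(10_000, 2.0, 0.08, 100.0)  # 8% with a better score
print(baseline, improved)
```

With an identical campaign cost, raising precision from 5% to 8% doubles the ROI in this toy setting, which is exactly the first lever listed above.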
Data, constraints and requirements

Train and deploy requirements
– About one hundred models per month
– Fast data preparation and modeling
– Fast deployment

Model requirements
– Robust
– Accurate
– Understandable

Business requirement
– Return on investment for the whole process

Input data
– Relational databases
– Numerical or categorical
– Noisy
– Missing values
– Heavily unbalanced distribution

Train data
– Hundreds of thousands of instances
– Tens of thousands of variables

Deployment
– Tens of millions of instances
In-house system: from raw data to scoring models
[Figure: conceptual data model (Modèle Conceptuel de Données, model MCD PAC_v4, diagram "Tiers Services", author claudebe, 14/06/2005) — the relational schema with cardinalities, linking customers (Tiers), households (Foyer), addresses, offers (Offre), products & services, billing accounts (Compte Facturation), invoices and invoice lines (Facture, Ligne Facture), usage records (Compte Rendu Usage), usage functions, risk classes, operators, media, and related entities]
Source tables: Customer, Services, Products, Call details, …
Data warehouse
– Relational database

Data mart
– Star schema

Feature construction
– PAC technology
– Generates tens of thousands of variables

Data preparation and modeling
– Khiops technology
Example constructed variables: Id customer, zip code, Nb calls/month, Nb calls/hour, Nb calls/month × weekday × hour × service, …
Pipeline: data feeding → PAC (feature construction) → Khiops (data preparation and modeling) → scoring model
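PAC itself is proprietary, but the flavor of feature it generates from call-detail records — counts at several granularities of time and service — can be sketched as follows (records and field names are hypothetical):

```python
from collections import defaultdict

# Toy call-detail records: (customer_id, weekday, hour, service).
calls = [
    (1, "Mon", 9, "voice"), (1, "Mon", 18, "sms"),
    (1, "Tue", 9, "voice"), (2, "Sun", 22, "data"),
]

def construct_features(records):
    """Count calls per customer at several granularities, mimicking the
    'Nb calls/month, /hour, /weekday x hour x service' variables above."""
    feats = defaultdict(lambda: defaultdict(int))
    for cust, weekday, hour, service in records:
        feats[cust]["nb_calls"] += 1                         # coarse count
        feats[cust][f"nb_calls_h{hour}"] += 1                # per hour
        feats[cust][f"nb_{weekday}_{hour}_{service}"] += 1   # fine-grained
    return {c: dict(f) for c, f in feats.items()}

features = construct_features(calls)
print(features[1]["nb_calls"])  # 3
```

Crossing a few such dimensions (weekday × hour × service) is how a handful of raw tables explodes into the tens of thousands of variables mentioned above.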
Design of the challenge

Orange business objective
– Benchmark the in-house system against state-of-the-art techniques

Data
– Data store: not an option
– Data warehouse: confidentiality and scalability issues; relational data requires domain knowledge and specialized skills
– Tabular format: standard format for the data mining community; domain knowledge incorporated using feature construction (PAC); easy anonymization

Tasks
– Three representative marketing tasks

Requirements
– Fast data preparation and modeling (fully automatic)
– Accurate
– Fast deployment
– Robust
– Understandable
Data sets extraction and preparation

Input data
– 10 relational tables
– A few hundred fields
– One million customers

Instance selection
– Resampling given the three marketing tasks
– Keep 100 000 instances, with less unbalanced target distributions

Variable construction
– Using PAC technology
– 20 000 constructed variables to get a tabular representation
– Keep 15 000 variables (discard constant variables)
– Small track: subset of 230 variables related to classical domain knowledge

Anonymization
– Discard variable names, discard identifiers
– Randomize order of variables
– Rescale each numerical variable by a random factor
– Recode each categorical variable using random category names

Data samples
– 50 000 train and test instances sampled randomly
– 5 000 validation instances sampled randomly from the test set
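The anonymization steps listed above (drop names and identifiers, rescale numerical variables by a random factor, recode categories with random names, shuffle variable order) can be sketched as follows; the actual Orange pipeline is not public, so this is only an illustration:

```python
import random

def anonymize(columns, seed=0):
    """columns: dict mapping variable name -> list of values."""
    rng = random.Random(seed)
    out = []
    for values in columns.values():              # variable names discarded
        if all(isinstance(v, (int, float)) for v in values):
            factor = rng.uniform(0.5, 2.0)       # random rescaling factor
            out.append([v * factor for v in values])
        else:
            # Recode each category with a random, meaningless name.
            codes = {v: f"cat_{rng.randrange(10**6)}" for v in set(values)}
            out.append([codes[v] for v in values])
    rng.shuffle(out)                             # randomize variable order
    return out

anon = anonymize({"age": [30, 40], "region": ["north", "south"]})
print(anon)
```

Note that rescaling preserves ratios between values and recoding preserves the grouping of categories, so predictive information survives while the semantics are hidden.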
Scientific and technical challenge

Scientific objective
– Fast data preparation and modeling: within five days
– Large scale: 50 000 train and test instances, 15 000 variables
– Heterogeneous data
  – Numerical with missing values
  – Categorical with hundreds of values
  – Heavily unbalanced distribution

KDD social meeting objective
– Attract as many participants as possible
  – Additional small track and slow track
  – Online feedback on the validation dataset
  – Toy problem (only one informative input variable)
– Limit challenge protocol overhead
  – One month to explore descriptive data and test the submission protocol
– Attractive conditions
  – No intellectual property conditions
  – Money prizes
Business impact of the challenge

Bring Orange datasets to the data mining community
– Benefit for the community: access to challenging data
– Benefit for Orange: benchmark of numerous competing techniques; drive research efforts towards Orange needs

Evaluate the Orange in-house system
– High number of participants and high quality of the results
– Orange in-house results:
  – Improved by a significant margin when leveraging all business requirements
  – Almost Pareto optimal when other criteria are considered (automation, very fast train and deploy, robustness, understandability)
– Need to study the best challenge methods to gain more insight
KDD Cup 2009: Result Analysis
Figure legend:
– Best result (period considered in the figure)
– In-house system (downloadable: www.khiops.com)
– Baseline (Naïve Bayes)
Overall – Test AUC – Fast track
[Figure: submissions over time; good results came very quickly; best results on each dataset]
In-house (Orange) system:
• No parameters
• On one standard laptop (single processor)
• Treating the three tasks as separate problems
Overall – Test AUC – Fast track: very fast good results; small improvement after the first day (83.85 → 84.93)
Overall – Test AUC – Slow track: very small improvement after the 5th day (84.93 → 85.2). Improvement due to unscrambling?
Overall – Test AUC – Submissions (with AUC > 0.5):
– 23.24% below the baseline
– 15.25% above the in-house system
– 84.75% below the in-house system
Overall – Test AUC, 'correlation' test/valid and test/train
[Scatter plots; outliers flagged: random values submitted; boosting method or train target submitted → overfitting?]
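All of these figures rank submissions by test AUC, i.e. the probability that a randomly drawn positive is scored above a randomly drawn negative, with ties counted as half. A minimal sketch of the computation:

```python
def auc(labels, scores):
    """AUC via the rank-sum formulation; labels are 0/1."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count positive-negative pairs ranked correctly; ties count 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2]))  # 0.75
```

AUC is insensitive to the class imbalance emphasized earlier, which is one reason it suits these heavily unbalanced marketing targets.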
Overall – Test AUC
[Panels: Test AUC at 12 hours, 24 hours, 5 days, 36 days]
• time to adjust model parameters?
• time to train ensemble methods?
• time to find more processors?
• time to test more methods?
• time to unscramble?
• …

Difference between the best result at the end of the first day and the best result at the end of the 36 days: 1.35%
Test AUC = f(time)
Easier? Harder?
[Panels: Churn, Appetency, Up-selling – Test AUC – days 0–36]
Test AUC = f(time)
Easier? Harder?
Difference between the best result at the end of the first day and the best result at the end of the 36 days: Churn 1.84%, Appetency 1.38%, Up-selling 0.11%
[Panels: Churn, Appetency, Up-selling – Test AUC – days 0–36]
Correlation Test AUC / Valid AUC (5 days)
Easier? Harder?
[Panels: Churn, Appetency, Up-selling – Test/Valid – days 0–5]
Correlation Train AUC / Valid AUC (36 days)
Difficult to draw conclusions…
[Panels: Churn, Appetency, Up-selling – Test/Train – days 0–36]
Histogram Test AUC / Valid AUC (days [0:5] vs ]5:36])
Does knowledge (parameters?) found during the first 5 days help afterwards? YES!
[Panels: Churn, Appetency, Up-selling – Test AUC – days [0:36] and days ]5:36]]
Fact Sheets: Preprocessing & Feature Selection

PREPROCESSING (overall usage = 95%) [bar chart, percent of participants]:
– Principal Component Analysis
– Other preprocessing
– Grouping modalities
– Normalizations
– Discretization
– Replacement of the missing values

FEATURE SELECTION (overall usage = 85%) [bar chart, percent of participants]:
– Wrapper with search (forward/backward wrapper)
– Embedded method
– Other feature selection
– Filter method
– Feature ranking
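Discretization and grouping of modalities feature prominently in the chart above; equal-frequency binning is one common way to discretize a numerical variable, shown here as a generic sketch (not any particular team's method):

```python
def equal_freq_bins(values, n_bins):
    """Cut points that split the sorted values into n_bins equal-count bins."""
    s = sorted(values)
    return [s[len(s) * i // n_bins] for i in range(1, n_bins)]

def discretize(v, cuts):
    """Map a numeric value to its bin index given the cut points."""
    return sum(v >= c for c in cuts)

cuts = equal_freq_bins([5, 1, 9, 3, 7, 2, 8, 4], 4)
print(cuts, discretize(6, cuts))  # [3, 5, 8] 2
```

Equal-frequency cuts are robust to the skewed, heavy-tailed distributions typical of usage counts, which is why they are often preferred over equal-width bins on data like this.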
Fact Sheets: Classifier

CLASSIFIER (overall usage = 93%) [bar chart, percent of participants]:
– Bayesian Neural Network
– Bayesian Network
– Nearest neighbors
– Naïve Bayes
– Neural Network
– Other classifier
– Non-linear kernel
– Linear classifier
– Decision tree…
- About 30% logistic loss, >15% exponential loss, >15% squared loss, ~10% hinge loss.
- Less than 50% used regularization (20% 2-norm, 10% 1-norm).
- Only 13% used unlabeled data.
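For reference, the four losses cited above, written on the margin m = y·f(x) with y in {−1, +1}, can be stated directly (a generic sketch, not any team's code):

```python
import math

def logistic_loss(m): return math.log(1 + math.exp(-m))
def exp_loss(m):      return math.exp(-m)
def squared_loss(m):  return (1 - m) ** 2
def hinge_loss(m):    return max(0.0, 1 - m)

# All four penalize a violated margin (m < 1); hinge is exactly
# zero past the margin, while logistic and exp never reach zero.
print(hinge_loss(2.0), exp_loss(0.0))  # 0.0 1.0
```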
Fact Sheets: Model Selection

MODEL SELECTION (overall usage = 90%) [bar chart, percent of participants]:
– Bayesian
– Bi-level
– Penalty-based
– Virtual leave-one-out
– Other cross-validation
– Other model selection
– Bootstrap estimate
– Out-of-bag estimate
– K-fold or leave-one-out
– 10% test set
- About 75% ensemble methods (1/3 boosting, 1/3 bagging, 1/3 other).
- About 10% used unscrambling.
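K-fold or leave-one-out dominates the model-selection chart; the fold-splitting logic behind it can be sketched as follows (a generic illustration, not tied to any submission):

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds over n rows."""
    # First n % k folds get one extra row so all n rows are used exactly once.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
print([len(t) for _, t in folds])  # [4, 3, 3]
```

Each instance appears in exactly one test fold, so averaging the per-fold AUC gives a nearly unbiased estimate of generalization, which is what made it the default choice here.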
Fact Sheets: Implementation

Memory: <= 2 GB / <= 8 GB / > 8 GB / >= 32 GB
Operating System: Windows / Linux / Unix / Mac OS
Parallelism: none / multi-processor / run in parallel
Software Platform: Java / Matlab / C / C++ / Other (R, SAS)
Winning methods

Fast track:
- IBM Research, USA (+): Ensemble of a wide variety of classifiers. Effort put into coding (most frequent values coded with binary features, missing values replaced by the mean, extra features constructed, etc.)
- ID Analytics, Inc., USA (+): Filter + wrapper feature selection. TreeNet by Salford Systems, an additive boosting decision tree technology; bagging also used.
- David Slate & Peter Frey, USA: Grouping of modalities/discretization, filter feature selection, ensemble of decision trees.

Slow track:
- University of Melbourne: CV-based feature selection targeting AUC. Boosting with classification trees and shrinkage, using Bernoulli loss.
- Financial Engineering Group, Inc., Japan: Grouping of modalities, filter feature selection using AIC, gradient tree-classifier boosting.
- National Taiwan University (+): Average of 3 classifiers: (1) the joint multiclass problem solved with an l1-regularized maximum entropy model; (2) AdaBoost with a tree-based weak learner; (3) Selective Naïve Bayes.

(+: small dataset unscrambling)
Conclusion

Participation exceeded our expectations. We thank the participants for their hard work, our sponsors, and Orange, who offered:
– A problem of real industrial interest with challenging scientific and technical aspects
– Prizes

Lessons learned:
– Do not underestimate the participants: five days were given for the fast challenge, but only a few hours sufficed for some participants.
– Ensemble methods are effective.
– Ensembles of decision trees offer off-the-shelf solutions to problems with large numbers of samples and attributes, mixed variable types, and many missing values.
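As a closing illustration of that last lesson, the bagging idea behind such ensembles can be sketched with a stand-in weak learner (`fit_stump` is a hypothetical one-feature stump, not a real decision tree):

```python
import random

def bagging(fit, data, n_models=10, seed=0):
    """Train n_models learners, each on a bootstrap resample of data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]  # sample with replacement
        models.append(fit(sample))
    return models

def predict(models, x):
    """Soft vote: average the individual model scores."""
    return sum(m(x) for m in models) / len(models)

def fit_stump(sample):
    # Hypothetical weak learner standing in for a decision tree:
    # threshold on the single feature at the positive examples' mean.
    pos = [x for x, y in sample if y == 1]
    t = sum(pos) / len(pos) if pos else 0.0
    return lambda x: 1.0 if x >= t else 0.0

data = [(0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1)]
models = bagging(fit_stump, data)
print(predict(models, 0.85))  # averaged score in [0, 1]
```

Averaging over bootstrap resamples smooths out the instability of individual trees, which is a large part of why these ensembles worked so well out of the box on this competition's data.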