KDD Cup 2009: Fast Scoring on a Large Database
Presentation of the Results at the KDD Cup Workshop, June 28, 2009
The Organizing Team
KDD Cup 2009 Organizing Team
Project team at Orange Labs R&D:
• Vincent Lemaire
• Marc Boullé
• Fabrice Clérot
• Raphaël Féraud
• Aurélie Le Cam
• Pascal Gouzien
Beta testing and proceedings editor:
• Gideon Dror
Web site design: • Olivier Guyon (MisterP.net, France)
Coordination (KDD cup co-chairs):
• Isabelle Guyon
• David Vogel
Thanks to our sponsors…
Orange, ACM SIGKDD, Pascal, Unipen, Google, Health Discovery Corp, Clopinet, Data Mining Solutions, MPS
KDD Cup Participation By Year
[Bar chart: number of participating teams per year, 1997–2009; data in the table below]
Year # Teams
1997 45
1998 57
1999 24
2000 31
2001 136
2002 18
2003 57
2004 102
2005 37
2006 68
2007 95
2008 128
2009 453
Record KDD Cup Participation
Participation statistics:
– 1299 registered teams
– 7865 entries
– 46 countries:
Argentina, Australia, Austria, Belgium, Brazil, Bulgaria, Canada, Chile, China, Fiji, Finland, France, Germany, Greece, Hong Kong, Hungary, India, Iran, Ireland, Israel, Italy, Japan, Jordan, Latvia, Malaysia, Mexico, Netherlands, New Zealand, Pakistan, Portugal, Romania, Russian Federation, Singapore, Slovak Republic, Slovenia, South Africa, South Korea, Spain, Sweden, Switzerland, Taiwan, Turkey, Uganda, United Kingdom, United States, Uruguay
A worldwide operator
One of the main telecommunication operators in the world
– Providing services to more than 170 million customers over five continents
– Including 120 million under the Orange brand
KDD Cup 2009 organized by Orange
Customer Relationship Management (CRM)
Three marketing tasks: predict the propensity of customers
– to switch provider: Churn
– to buy new products or services: Appetency
– to buy upgrades or new options proposed to them: Up-selling
Objective: improve the return on investment (ROI) of marketing campaigns
– Increase the efficiency of the campaign for a given campaign cost
– Decrease the campaign cost for a given marketing objective
Better prediction leads to better ROI
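As a rough illustration of the ROI levers above (all figures hypothetical, not Orange data), the arithmetic can be sketched as:

```python
# Hypothetical campaign figures, chosen only to illustrate why a
# better propensity score improves campaign ROI.
def campaign_roi(n_contacted, cost_per_contact, precision, value_per_hit):
    """ROI = (gains - cost) / cost for a targeted campaign."""
    cost = n_contacted * cost_per_contact
    gains = n_contacted * precision * value_per_hit
    return (gains - cost) / cost

# Same budget, better model: precision among contacted customers rises.
baseline = campaign_roi(10_000, 2.0, 0.05, 100.0)  # 5% of targets respond
improved = campaign_roi(10_000, 2.0, 0.08, 100.0)  # 8% with a better score
print(baseline, improved)
```

With an identical campaign cost, raising precision from 5% to 8% doubles the ROI in this toy setting, which is exactly the first lever listed above.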
Data, constraints and requirements

Train and deploy requirements
– About one hundred models per month
– Fast data preparation and modeling
– Fast deployment

Model requirements
– Robust
– Accurate
– Understandable

Business requirement
– Return on investment for the whole process

Input data
– Relational databases
– Numerical or categorical
– Noisy
– Missing values
– Heavily unbalanced distribution

Train data
– Hundreds of thousands of instances
– Tens of thousands of variables

Deployment
– Tens of millions of instances
In-house system: from raw data to scoring models
[Figure: conceptual data model (Modèle Conceptuel de Données, model MCD PAC_v4, diagram "Tiers Services", author claudebe, 14/06/2005) — the relational schema with cardinalities, linking customers (Tiers), households (Foyer), addresses, offers (Offre), products & services, billing accounts (Compte Facturation), invoices and invoice lines (Facture, Ligne Facture), usage records (Compte Rendu Usage), usage functions, risk classes, operators, media, and related entities]
Source tables: Customer, Services, Products, Call details, …
Data warehouse
– Relational database

Data mart
– Star schema

Feature construction
– PAC technology
– Generates tens of thousands of variables

Data preparation and modeling
– Khiops technology
Example constructed variables: Id customer, zip code, Nb calls/month, Nb calls/hour, Nb calls/month × weekday × hour × service, …
Pipeline: data feeding → PAC (feature construction) → Khiops (data preparation and modeling) → scoring model
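PAC itself is proprietary, but the flavor of feature it generates from call-detail records — counts at several granularities of time and service — can be sketched as follows (records and field names are hypothetical):

```python
from collections import defaultdict

# Toy call-detail records: (customer_id, weekday, hour, service).
calls = [
    (1, "Mon", 9, "voice"), (1, "Mon", 18, "sms"),
    (1, "Tue", 9, "voice"), (2, "Sun", 22, "data"),
]

def construct_features(records):
    """Count calls per customer at several granularities, mimicking the
    'Nb calls/month, /hour, /weekday x hour x service' variables above."""
    feats = defaultdict(lambda: defaultdict(int))
    for cust, weekday, hour, service in records:
        feats[cust]["nb_calls"] += 1                         # coarse count
        feats[cust][f"nb_calls_h{hour}"] += 1                # per hour
        feats[cust][f"nb_{weekday}_{hour}_{service}"] += 1   # fine-grained
    return {c: dict(f) for c, f in feats.items()}

features = construct_features(calls)
print(features[1]["nb_calls"])  # 3
```

Crossing a few such dimensions (weekday × hour × service) is how a handful of raw tables explodes into the tens of thousands of variables mentioned above.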
Design of the challenge

Orange business objective
– Benchmark the in-house system against state-of-the-art techniques

Data
– Data store: not an option
– Data warehouse: confidentiality and scalability issues; relational data requires domain knowledge and specialized skills
– Tabular format: standard format for the data mining community; domain knowledge incorporated using feature construction (PAC); easy anonymization

Tasks
– Three representative marketing tasks

Requirements
– Fast data preparation and modeling (fully automatic)
– Accurate
– Fast deployment
– Robust
– Understandable
Data sets extraction and preparation

Input data
– 10 relational tables
– A few hundred fields
– One million customers

Instance selection
– Resampling given the three marketing tasks
– Keep 100 000 instances, with less unbalanced target distributions

Variable construction
– Using PAC technology
– 20 000 constructed variables to get a tabular representation
– Keep 15 000 variables (discard constant variables)
– Small track: subset of 230 variables related to classical domain knowledge

Anonymization
– Discard variable names, discard identifiers
– Randomize order of variables
– Rescale each numerical variable by a random factor
– Recode each categorical variable using random category names

Data samples
– 50 000 train and test instances sampled randomly
– 5 000 validation instances sampled randomly from the test set
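The anonymization steps listed above (drop names and identifiers, rescale numerical variables by a random factor, recode categories with random names, shuffle variable order) can be sketched as follows; the actual Orange pipeline is not public, so this is only an illustration:

```python
import random

def anonymize(columns, seed=0):
    """columns: dict mapping variable name -> list of values."""
    rng = random.Random(seed)
    out = []
    for values in columns.values():              # variable names discarded
        if all(isinstance(v, (int, float)) for v in values):
            factor = rng.uniform(0.5, 2.0)       # random rescaling factor
            out.append([v * factor for v in values])
        else:
            # Recode each category with a random, meaningless name.
            codes = {v: f"cat_{rng.randrange(10**6)}" for v in set(values)}
            out.append([codes[v] for v in values])
    rng.shuffle(out)                             # randomize variable order
    return out

anon = anonymize({"age": [30, 40], "region": ["north", "south"]})
print(anon)
```

Note that rescaling preserves ratios between values and recoding preserves the grouping of categories, so predictive information survives while the semantics are hidden.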
Scientific and technical challenge

Scientific objective
– Fast data preparation and modeling: within five days
– Large scale: 50 000 train and test instances, 15 000 variables
– Heterogeneous data
  – Numerical with missing values
  – Categorical with hundreds of values
  – Heavily unbalanced distribution

KDD social meeting objective
– Attract as many participants as possible
  – Additional small track and slow track
  – Online feedback on the validation dataset
  – Toy problem (only one informative input variable)
– Limit challenge protocol overhead
  – One month to explore descriptive data and test the submission protocol
– Attractive conditions
  – No intellectual property conditions
  – Money prizes
Business impact of the challenge

Bring Orange datasets to the data mining community
– Benefit for the community: access to challenging data
– Benefit for Orange: benchmark of numerous competing techniques; drive research efforts towards Orange needs

Evaluate the Orange in-house system
– High number of participants and high quality of the results
– Orange in-house results:
  – Improved by a significant margin when leveraging all business requirements
  – Almost Pareto optimal when other criteria are considered (automation, very fast train and deploy, robustness, understandability)
– Need to study the best challenge methods to gain more insight
KDD Cup 2009: Result Analysis
Figure legend:
– Best result (period considered in the figure)
– In-house system (downloadable: www.khiops.com)
– Baseline (Naïve Bayes)
Overall – Test AUC – Fast track
[Figure: submissions over time; good results came very quickly; best results on each dataset]
In-house (Orange) system:
• No parameters
• On one standard laptop (single processor)
• Treating the three tasks as separate problems
Overall – Test AUC – Fast track: very fast good results; small improvement after the first day (83.85 → 84.93)
Overall – Test AUC – Slow track: very small improvement after the 5th day (84.93 → 85.2). Improvement due to unscrambling?
Overall – Test AUC – Submissions (with AUC > 0.5):
– 23.24% below the baseline
– 15.25% above the in-house system
– 84.75% below the in-house system
Overall – Test AUC, 'correlation' test/valid and test/train
[Scatter plots; outliers flagged: random values submitted; boosting method or train target submitted → overfitting?]
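All of these figures rank submissions by test AUC, i.e. the probability that a randomly drawn positive is scored above a randomly drawn negative, with ties counted as half. A minimal sketch of the computation:

```python
def auc(labels, scores):
    """AUC via the rank-sum formulation; labels are 0/1."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count positive-negative pairs ranked correctly; ties count 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2]))  # 0.75
```

AUC is insensitive to the class imbalance emphasized earlier, which is one reason it suits these heavily unbalanced marketing targets.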
Overall – Test AUC
[Panels: Test AUC at 12 hours, 24 hours, 5 days, 36 days]
• time to adjust model parameters?
• time to train ensemble methods?
• time to find more processors?
• time to test more methods?
• time to unscramble?
• …

Difference between the best result at the end of the first day and the best result at the end of the 36 days: 1.35%
Test AUC = f(time)
Easier? Harder?
[Panels: Churn, Appetency, Up-selling – Test AUC – days 0–36]
Test AUC = f(time)
Easier? Harder?
Difference between the best result at the end of the first day and the best result at the end of the 36 days: Churn 1.84%, Appetency 1.38%, Up-selling 0.11%
[Panels: Churn, Appetency, Up-selling – Test AUC – days 0–36]
Correlation Test AUC / Valid AUC (5 days)
Easier? Harder?
[Panels: Churn, Appetency, Up-selling – Test/Valid – days 0–5]
Correlation Train AUC / Valid AUC (36 days)
Difficult to draw conclusions…
[Panels: Churn, Appetency, Up-selling – Test/Train – days 0–36]
Histogram Test AUC / Valid AUC (days [0:5] vs ]5:36])
Does knowledge (parameters?) found during the first 5 days help afterwards? YES!
[Panels: Churn, Appetency, Up-selling – Test AUC – days [0:36] and days ]5:36]]
Fact Sheets: Preprocessing & Feature Selection

PREPROCESSING (overall usage = 95%) [bar chart, percent of participants]:
– Principal Component Analysis
– Other preprocessing
– Grouping modalities
– Normalizations
– Discretization
– Replacement of the missing values

FEATURE SELECTION (overall usage = 85%) [bar chart, percent of participants]:
– Wrapper with search (forward/backward wrapper)
– Embedded method
– Other feature selection
– Filter method
– Feature ranking
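Discretization and grouping of modalities feature prominently in the chart above; equal-frequency binning is one common way to discretize a numerical variable, shown here as a generic sketch (not any particular team's method):

```python
def equal_freq_bins(values, n_bins):
    """Cut points that split the sorted values into n_bins equal-count bins."""
    s = sorted(values)
    return [s[len(s) * i // n_bins] for i in range(1, n_bins)]

def discretize(v, cuts):
    """Map a numeric value to its bin index given the cut points."""
    return sum(v >= c for c in cuts)

cuts = equal_freq_bins([5, 1, 9, 3, 7, 2, 8, 4], 4)
print(cuts, discretize(6, cuts))  # [3, 5, 8] 2
```

Equal-frequency cuts are robust to the skewed, heavy-tailed distributions typical of usage counts, which is why they are often preferred over equal-width bins on data like this.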
Fact Sheets: Classifier

CLASSIFIER (overall usage = 93%) [bar chart, percent of participants]:
– Bayesian Neural Network
– Bayesian Network
– Nearest neighbors
– Naïve Bayes
– Neural Network
– Other classifier
– Non-linear kernel
– Linear classifier
– Decision tree…
- About 30% logistic loss, >15% exponential loss, >15% squared loss, ~10% hinge loss.
- Less than 50% used regularization (20% 2-norm, 10% 1-norm).
- Only 13% used unlabeled data.
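For reference, the four losses cited above, written on the margin m = y·f(x) with y in {−1, +1}, can be stated directly (a generic sketch, not any team's code):

```python
import math

def logistic_loss(m): return math.log(1 + math.exp(-m))
def exp_loss(m):      return math.exp(-m)
def squared_loss(m):  return (1 - m) ** 2
def hinge_loss(m):    return max(0.0, 1 - m)

# All four penalize a violated margin (m < 1); hinge is exactly
# zero past the margin, while logistic and exp never reach zero.
print(hinge_loss(2.0), exp_loss(0.0))  # 0.0 1.0
```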
Fact Sheets: Model Selection

MODEL SELECTION (overall usage = 90%) [bar chart, percent of participants]:
– Bayesian
– Bi-level
– Penalty-based
– Virtual leave-one-out
– Other cross-validation
– Other model selection
– Bootstrap estimate
– Out-of-bag estimate
– K-fold or leave-one-out
– 10% test set
- About 75% ensemble methods (1/3 boosting, 1/3 bagging, 1/3 other).
- About 10% used unscrambling.
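K-fold or leave-one-out dominates the model-selection chart; the fold-splitting logic behind it can be sketched as follows (a generic illustration, not tied to any submission):

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds over n rows."""
    # First n % k folds get one extra row so all n rows are used exactly once.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
print([len(t) for _, t in folds])  # [4, 3, 3]
```

Each instance appears in exactly one test fold, so averaging the per-fold AUC gives a nearly unbiased estimate of generalization, which is what made it the default choice here.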
Fact Sheets: Implementation

Memory: <= 2 GB / <= 8 GB / > 8 GB / >= 32 GB
Operating System: Windows / Linux / Unix / Mac OS
Parallelism: none / multi-processor / run in parallel
Software Platform: Java / Matlab / C / C++ / Other (R, SAS)
Winning methods

Fast track:
- IBM Research, USA (+): Ensemble of a wide variety of classifiers. Effort put into coding (most frequent values coded with binary features, missing values replaced by the mean, extra features constructed, etc.)
- ID Analytics, Inc., USA (+): Filter + wrapper feature selection. TreeNet by Salford Systems, an additive boosting decision tree technology; bagging also used.
- David Slate & Peter Frey, USA: Grouping of modalities/discretization, filter feature selection, ensemble of decision trees.

Slow track:
- University of Melbourne: CV-based feature selection targeting AUC. Boosting with classification trees and shrinkage, using Bernoulli loss.
- Financial Engineering Group, Inc., Japan: Grouping of modalities, filter feature selection using AIC, gradient tree-classifier boosting.
- National Taiwan University (+): Average of 3 classifiers: (1) the joint multiclass problem solved with an l1-regularized maximum entropy model; (2) AdaBoost with a tree-based weak learner; (3) Selective Naïve Bayes.

(+: small dataset unscrambling)
Conclusion

Participation exceeded our expectations. We thank the participants for their hard work, our sponsors, and Orange, who offered:
– A problem of real industrial interest with challenging scientific and technical aspects
– Prizes

Lessons learned:
– Do not underestimate the participants: five days were given for the fast challenge, but only a few hours sufficed for some participants.
– Ensemble methods are effective.
– Ensembles of decision trees offer off-the-shelf solutions to problems with large numbers of samples and attributes, mixed variable types, and many missing values.
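As a closing illustration of that last lesson, the bagging idea behind such ensembles can be sketched with a stand-in weak learner (`fit_stump` is a hypothetical one-feature stump, not a real decision tree):

```python
import random

def bagging(fit, data, n_models=10, seed=0):
    """Train n_models learners, each on a bootstrap resample of data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]  # sample with replacement
        models.append(fit(sample))
    return models

def predict(models, x):
    """Soft vote: average the individual model scores."""
    return sum(m(x) for m in models) / len(models)

def fit_stump(sample):
    # Hypothetical weak learner standing in for a decision tree:
    # threshold on the single feature at the positive examples' mean.
    pos = [x for x, y in sample if y == 1]
    t = sum(pos) / len(pos) if pos else 0.0
    return lambda x: 1.0 if x >= t else 0.0

data = [(0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1)]
models = bagging(fit_stump, data)
print(predict(models, 0.85))  # averaged score in [0, 1]
```

Averaging over bootstrap resamples smooths out the instability of individual trees, which is a large part of why these ensembles worked so well out of the box on this competition's data.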