1 KDD Cup Survey Xinyue Liu. 2 Outline Nuts and Bolts of KDD Cup KDD Cup 97-99 KDD Cup 2000 ...
-
Upload
jessie-barber -
Category
Documents
-
view
235 -
download
5
Transcript of 1 KDD Cup Survey Xinyue Liu. 2 Outline Nuts and Bolts of KDD Cup KDD Cup 97-99 KDD Cup 2000 ...
1
KDD Cup Survey
Xinyue Liu
2
Outline
Nuts and Bolts of KDD Cup
KDD Cup 97-99
KDD Cup 2000
Summary
3
About KDD Cup
A knowledge discovery and data mining tools competition in conjunction with KDD conferences. It aims at:
showcase the best methods for discovering higher-level knowledge from data.
Helping to close the gap between research and industry
Stimulating further KDD research and development
4
Statistics
Access and Final Participation
0
50
100
150
200
NDA (access to data) Participants
Coun
t
Cup 97
Cup 98
Cup 99
Cup 2000
Participation in KDD Cup grew steadily, especially requests to access the data
Average person-hours per submission: 204Max person-hours per submission: 910
Commercial software grew from 44% (cup 97) to 52% (cup 98) to 77% (cup 2000)
5
AlgorithmsAlgorithms Tried vs Submitted
0
2
4
6
8
10
12
14
16
18
20
Algorithm
Entri
es Tried
Submitted
Decision trees most widely tried and by far the most commonly submitted
6
KDD Cup 97
A classification task – to predict Financial services industry direct mail response
Winners Charles Elkan, a PhD from UC-San Diego with his
Boosted Naive Bayesian (BNB) Silicon Graphics, Inc with their software MineSet Urban Science Applications, Inc. with their
software gain, Direct Marketing Selection System
7
BNB Boosting – to learn a series of classifiers, where
each classifier in the series pays more attention to the examples misclassified by its predecessor. Repeated T rounds.
BNB – representationally equivalent to a multilayer perceptron with a single hidden layer.
Complexity – O(ef)
e – examples f - attributes
8
MineSet A KDD tool that combines data access, transformation,
classification, and visualization.
9
KDD Cup 98
URL: www.kdnuggets.com/meetings/kdd98/kdd-cup-98.html
A classification task – to analyze fund raising mail responses to a non-profit organization
Winners Urban Science Applications, Inc. with their software
GainSmarts. SAS Institute, Inc. with their software Enterprise
Miner. Quadstone Limited with their software
Decisionhouse
10
GainSmarts GainSmarts – a feature selection expert
system First step - used Logistic Regression to assign
each prospect a probability of donation (Pi). Second step - used Linear Regression to
estimate a conditional donation amount of responding donors (Ai)
Result (<1% error) -
Prediction = Pi * Ai
11
Enterprise Miner A data mining solution that addresses the entire
data mining process SEMMA Process
Sample Explore Modify Model Assess
Algorithms Decision tree Neural network Regression
12
Decisionhouse
Decisionhouse – an integrated modelling software suite by Quadstone Data exploration using visualization modules. Use Decision trees and Scorecards to model more
complex tasks. Choose the final model by comparing a variety of
modeling approaches and looking at the difference in predicted net profitability (lift curve).
Results
$-
$5,000
$10,000
$15,000
$20,000
$25,000
$30,000
$35,000
$40,000
$45,000
$50,000
$55,000
$60,000
$65,000
$70,000
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%Maximum Possible Profit Line($72,776 in profits with 4,873 mailed)
GainSmarts
SAS/Enterprise Miner
Quadstone/Decisionhouse
Mail to Everyone Solution ($10,560 in profits with 96,367 mailed)
14
KDD Cup 99 URL:
www.cse.ucsd.edu/users/elkan/kdresults.html Problem
same data set as KDD Cup 98 Winners
SAS Institute Inc. with their software Enterprise Miner.
Amdocs with their Information Analysis Environment
15
Software SAS – using two-stage model which includes two
multi-layer perceptron (MLP) neural networks models.
Amdocs – using its own Information Analysis Environment, which allows modeling of the value and class membership simultaneously. Algorithms used is a hybrid logistic regression model
16
KDD Cup 2000
www.ecn.purdue.edu/KDDCUP/
Sponsored by
Purdue University
Blue Martini Software
17
Data Set
Data collected from Gazelle.com, a legwear and legcare web retailer Pre-processedTraining set: 2 months Test sets: one month Data collected includes:
Click streams Order information Registration form
18
Problems The goal – to design models to support web-site
personalization and to improve the profitability of the site by increasing customer response.
Questions - When given a set of page views,1. will the visitor view another page on the site or leave? 2. which product brand will the visitor view in the remainder of
the session?3. characterize heavy spenders4. characterize killer pages5. characterize which product brand a visitor will view in the
remainder of the session?
19
Evaluation Accuracy/score was measured for the two
questions with test sets Insight questions judged with help of retail experts
from Gazelle and Blue Martini
Created a list of insights from all participants Each insight was given a weigh Each participant was scored on all insights
Additional factors: Presentation quality Correctness
20
The Winners
Question 1 & 5 Winner: Amdocs
Question 2 & 3 Winner: Salford Systems
Question 4 Winner: e-steam
poster
21
Software (Amdocs)
Exploratory Data Analysis – SAS Classification Tree – Amdocs Business
Insight Tool Decision tree Rules Extraction Modeling Combining models
22
Scheme
No
YesScore Model
0 for all
Score Model5 rules (0 or 1)
Score Model1 for all
Yes
one-click
multi-click
No
BestRuleMergedHybrid
Score ModelEnsemble
FinalPrediction
Model
Main Model
PartialTestingPattern?
Bot /Crawler ?
All Data
23
Main Model
Decision Tree5 trees
built on 34000 cases
Decision Tree5 trees
built on 34000 cases
Decision Tree5 trees
built on 34000 cases
Rule Generator1466 rules
111 continue rules
Rule Generator1466 rules
111 continue rules
Rule Generator1466 rules
111 continue rules
BestRule
HybridModel
MergedRules
BestRule
HybridModel
MergedRules
BestRule
HybridModel
MergedRules
24
Sub-models
Best rule
Hybrid Model
Merged Rules
Each model captures a different aspect of
the overall behavior in
the data.Combining
or ensembling the models
provides the best
prediction results.
Logistic regression on rule set defines score for
each record as a combination of rules the
record satisfies
Logistic regression on rule set + raw field
values combine to define score for each record
Chooses most accurate rule satisfied by each
record
25
Software (Salford)
CART - a decision tree tool that automatically searching for and isolating significant patterns and relationships
MARS - a multivariate non-parametric regression procedure
HotSpotDetector
TreeNet
26
Cart Binary recursive partitioning. Key elements:
Splitting rules Brute force search all possible splits for all
variables Rank each splitting rule on the basis of a quality-
of-split criterion (default GINI) Recursion - split until further splitting is
impossible or stopped. Class assignment
Plurality rule Assign every node whether it is terminal or not.
Pruning Trees – does not stop in the middle Testing - best sub-tree is the one with the lowest
error
27
MARs Automatic variable search Automatic variable transformation Automatic limited interaction searches Variable nesting Built-in testing regimens model selection parameters.
28
Insights (Heavy Spenders)
Some of the Good insights Referrers - establish ad policy based on conversion
rates, not click-throughs Not an AOL user - browser window too small for layout Referring site traffic changed dramatically over time Came to site from print-ad or news, not friends & families Very high and very low income Geographic: Northeast U.S. states Repeat visitors
29
Insights (Who leaves?)
Some of the good insights Crawlers, bots accounted for 16% of sessions Long processing time (> 12 seconds) implies high
abandonment Referring sites: mycoupons have long sessions,
shopnow.com are prone to exit quickly Returning visitors' prob of continuing is double View of specific products (Oroblue,Levante) cause
abandonment Probability of leaving decreases with page views Free Gift and Welcome templates on first three
pages encouraged visitors to stay at site
30
Insights(Brand view)
Some good insights Referrer URL is great predictor:
Fashionmall.com and winnie-cooper are referrers for Hanes and Donna Karan
mycoupons.com, tripod, deal-finder are referrers for American Essentials
Previous views of a product imply later views
31
Summary Data mining requires background knowledge and
access to business users Successful data mining solutions combine
automated and manual analysis, integrating the power of the machine with expert knowledge and human insight
Web Mining is challenging: crawlers/bots, frequent site changes, etc.
KDD Cup is an excellent source to learn the state-of-art KDD techniques
KDD Cup data available for research and education
32
ReferencesElkan C. (1997). Boosting and Naive Bayesian Learning. Technical
Report No. CS97-557, September 1997, UCSD. Decisionhouse (1998). KDD Cup 98: Quadstone Take Bronze
Miner Award. Retrieved March 15, 2001 from http://www.kdnuggets.com/meetings/kdd98/quadstone/index.html
Urbane Science (1998). Urbane Science wins the KDD-98 Cup. Retrieved March 15, 2001 from http://www.kdnuggets.com/meetings/kdd98/gain-kddcup98-release.html
Georges, J. & Milley, A. (1999). KDD’99 Competition: Knowledge Discovery Contest. Retrieved March 15, 2001 from http://www.cse.ucsd.edu/users/elkan/saskdd99.pdf
Rosset, S. & Inger A. (1999). KDD-Cup 99 : Knowledge Discovery In a Charitable Organization’s Donor Database. Retrieved March 15, 2001 from http://www.cse.ucsd.edu/users/elkan/KDD2.doc
33
References (Cont.)
Sebastiani P., Ramoni M. & Crea A. (1999). Profiling your Customers using Bayesian Networks. Retrieved March 15, 2001 from http://bayesware.com/resources/tutorials/kddcup99/kddcup99.pdf
Inger A., Vatnik N., Rosset S. & Neumann E. (2000). KDD-Cup 2000: Question 1 Winner’s Report. Retrieved March 18, 2000 from
http://www.ecn.purdue.edu/KDDCUP/amdocs-slides-1.ppt
Neumann E., Vatnik N., Rosset S., Duenias M., Sasson I. & Inger A. (2000). KDD-Cup 2000: Question 5 Winner’s Report. Retrieved March 18, 2000 from http://www.ecn.purdue.edu/KDDCUP/amdocs-slides-5.ppt
Salford System white papers:
http://www.salford-systems.com/whitepaper.html
Summary talk presented at KDD (2000)
http://robotics.stanford.edu/~ronnyk/kddCupTalk.ppt