1 ACM KDD Cup A Survey: 1997-2011 Qiang Yang 杨强 (partly based on Xinyue Liu’s slides @SFU, and...


ACM KDD Cup A Survey: 1997-2011

Qiang Yang 杨强 (partly based on Xinyue Liu’s slides @SFU, and Nathan Liu’s slides @hkust)

Hong Kong University of Science and Technology (HKUST, 香港科大)


About KDD Cup (1997 – 2011)

Competition is a strong mover for science and engineering:
  ACM Programming Contest: world college level, programming skills
  ROBOCUP: World Robotics Competition


About ACM KDDCUP

ACM KDD: premier conference in knowledge discovery and data mining
ACM KDDCUP: worldwide competition held in conjunction with the ACM KDD conferences

It aims at:
  Showcasing the best methods for discovering higher-level knowledge from data
  Helping to close the gap between research and industry
  Stimulating further KDD research and development


Statistics

Participation in KDD Cup grew steadily

Average person-hours per submission: 204
Max person-hours per submission: 910

Year          97   98   99   2000   2005   2011
Submissions   16   21   24   30     32     1000+


Algorithms (up to 2000): Algorithms Tried vs Submitted

[Figure: bar chart of entries per algorithm, tried vs. submitted; vertical axis "Entries" from 0 to 20, horizontal axis "Algorithm".]


KDD Cup 97

A classification task: to predict direct mail response in the financial services industry

Winners:
  Charles Elkan, a professor at UC San Diego, with his Boosted Naive Bayesian (BNB) classifier
  Silicon Graphics, Inc. with their software MineSet
  Urban Science Applications, Inc. with their software gain, Direct Marketing Selection System


MineSet (Silicon Graphics Inc.)

A KDD tool that combines data access, transformation, classification, and visualization.


KDD Cup 98: CRM Benchmark

URL: www.kdnuggets.com/meetings/kdd98/kdd-cup-98.html

A classification task: to analyze fund raising mail responses to a non-profit organization

Winners:
  Urban Science Applications, Inc. with their software GainSmarts
  SAS Institute, Inc. with their software SAS Enterprise Miner ™
  Quadstone Limited with their software Decisionhouse ™


KDDCUP 1998 Results

[Figure: profit curves; vertical axis from $0 to $70,000 in profit, horizontal axis from 0% to 100%.]

Curves shown:
  Maximum Possible Profit Line ($72,776 in profits with 4,873 mailed)
  GainSmarts
  SAS/Enterprise Miner
  Quadstone/Decisionhouse
  Mail to Everyone Solution ($10,560 in profits with 96,367 mailed)
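
The two reference lines in the KDDCUP 1998 results (maximum possible profit and mail to everyone) can be reproduced with a simple profit rule: each mailed person contributes their donation minus the cost of the mailing. The sketch below is illustrative only. The cost of $0.68 per mail is an assumption that does not appear on the slide, and the function names and toy donations are hypothetical.

```python
# Minimal sketch of the direct-mail profit calculation.
# ASSUMPTION: each mailing costs $0.68 (not stated on the slide).
COST_PER_MAIL = 0.68


def mailing_profit(donations, mailed, cost_per_mail=COST_PER_MAIL):
    """Net profit: donation minus mailing cost, summed over mailed people."""
    return sum(d - cost_per_mail for d, m in zip(donations, mailed) if m)


def select_by_expected_value(expected_donations, cost_per_mail=COST_PER_MAIL):
    """Mail only people whose expected donation exceeds the mailing cost."""
    return [e > cost_per_mail for e in expected_donations]


# Hypothetical toy data: actual donation per person (0 means no response).
donations = [0.0, 0.0, 10.0, 0.0, 25.0]

# "Mail to everyone" baseline.
everyone = mailing_profit(donations, [True] * len(donations))

# "Maximum possible profit": an oracle that knows the true donations.
oracle = mailing_profit(donations, select_by_expected_value(donations))

# Consistency check against the slide's figures: total donations implied by
# each reference line is profit + cost * number mailed.
implied_total_everyone = 10_560 + COST_PER_MAIL * 96_367
implied_total_oracle = 72_776 + COST_PER_MAIL * 4_873

print(f"mail everyone: {everyone:.2f}, oracle: {oracle:.2f}")
print(f"implied totals: {implied_total_everyone:.2f}, {implied_total_oracle:.2f}")
```

Under the assumed cost, both reference lines imply roughly the same total of donations (about $76,090), which is what one would expect if the oracle mails exactly the donors and nobody else. A competing model sits between the two lines: it replaces the oracle's true donations with predicted ones in `select_by_expected_value`.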


ACM KDD Cup 1999

URL: www.cse.ucsd.edu/users/elkan/kdresults.html

Problem: to detect network intrusion and protect a computer network from unauthorized users, including perhaps insiders

Data: from DoD

Winners:
  SAS Institute Inc. with their software Enterprise Miner
  Amdocs with their Information Analysis Environment


KDDCUP 2000: Data Set and Goal

Data collected from Gazelle.com, a legwear and legcare Web retailer
  Pre-processed
  Training set: 2 months
  Test sets: one month
  Data collected includes click streams and order information

The goal: to design models to support web-site personalization and to improve the profitability of the site by increasing customer response.

Questions: when given a set of page views,
  characterize heavy spenders
  characterize killer pages
  characterize which product brand a visitor will view in the remainder of the session


KDDCUP 2000: The Winners

Question 1 & 5 Winner: Amdocs

Question 2 & 3 Winner: Salford Systems

Question 4 Winner: e-steam


KDD Cup 2001: 3 Bioinformatics Tasks

Dataset 1: Prediction of Molecular Bioactivity for Drug Design
  half a gigabyte when uncompressed

Dataset 2: Prediction of Gene/Protein Function (task 2) and Localization (task 3)
  smaller and easier to understand
  7 megabytes uncompressed

A total of 136 groups participated to produce a total of 200 submitted predictions over the 3 tasks: 114 for Thrombin, 41 for Function, and 45 for Localization.


2001 Winners

Task 1, Thrombin: Jie Cheng (Canadian Imperial Bank of Commerce)
  Bayesian network learner and classifier
Task 2, Function: Mark-A. Krogel (University of Magdeburg)
  Inductive logic programming
Task 3, Localization: Hisashi Hayashi, Jun Sese, and Shinichi Morishita (University of Tokyo)
  K nearest neighbor

Task 2: the genes of one particular type of organism

A gene/protein can have more than one function, but only one localization.


KDD Cup 2002: Molecular Biology, Two Tasks

Task 1: Document extraction from biological articles
Task 2: Classification of proteins based on gene deletion experiments

Winners:
  Task 1: ClearForest and Celera, USA (Yizhar Regev and Michal Finkelstein)
  Task 2: Telstra Research Laboratories, Australia (Adam Kowalczyk and Bhavani Raskutti)


2003 KDDCUP

Information retrieval/citation mining of scientific research papers, based on a very large archive of research papers

First Task: predict how many citations each paper will receive during the three months leading up to the KDD 2003 conference

Second Task: build a citation graph of a large subset of the archive from only the LaTeX sources

Third Task: estimate each paper's popularity based on partial download logs

Last Task: participants devise their own questions


2003 KDDCUP: Results

Task 1: Claudia Perlich, Foster Provost, Sofus Macskassy (New York University)

Task 2: 1st place: David Vogel (AI Insight Inc.)

Task 3: Janez Brank and Jure Leskovec (Jozef Stefan Institute, Slovenia)

Task 4: Amy McGovern, Lisa Friedland, Michael Hay, Brian Gallagher, Andrew Fast, Jennifer Neville, and David Jensen (University of Massachusetts Amherst, USA)


2004 Tasks and Results

Particle physics, plus protein homology prediction

The winners of the two subtasks were, respectively: David S. Vogel, Eric Gottschalk, and Morgan C. Wang; and Bernhard Pfahringer, Yan Fu (付岩), RuiXiang Sun, Qiang Yang (杨强), Simin He, Chunli Wang, Haipeng Wang, Shiguang Shan, Junfa Liu, Wen Gao.


Past KDDCUP Overview: 2005-2010

2005
  Host: Microsoft
  Task: Web query categorization
  Technique: Feature engineering, ensemble
  Winner: HKUST (Dou Shen 沈抖, Qiang Yang 杨强, et al.)

2006
  Host: Siemens
  Task: Pulmonary emboli detection
  Technique: Multi-instance, non-IID sample, cost sensitive, class imbalance, noisy data
  Winner: AT&T, Budapest University of Technology & Economics

2007
  Host: Netflix
  Task: Consumer recommendation
  Technique: Collaborative filtering, time series, ensemble
  Winner: IBM Research, Hungarian Academy of Sciences

2008
  Host: Siemens
  Task: Breast cancer detection from medical images
  Technique: Ensemble, class imbalance, score calibration
  Winner: IBM Research, National Taiwan University

2009
  Host: Orange
  Task: Customer relationship prediction in telecom
  Technique: Feature selection, ensemble
  Winner: IBM Research, University of Melbourne

2010
  Host: PSLC Data Shop
  Task: Student performance prediction in E-Learning
  Technique: Feature engineering, ensemble, collaborative filtering
  Winner: National Taiwan University (CJ Lin, S. Lin, etc.)


KDDCUP’11 Dataset

11 years of data
Rated items are: tracks, albums, artists, genres
Items arranged in a taxonomy
Two tasks

            Track 1   Track 2
#ratings    263M      63M
#items      625K      296K
#users      1M        249K
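
To make the scale of the KDDCUP'11 dataset concrete, the rounded counts from the table can be turned into a rating density: the fraction of all possible (user, item) pairs that actually carry a rating. This is a back-of-the-envelope sketch; the helper name `rating_density` is hypothetical, and the inputs are the rounded figures from the slide rather than exact counts.

```python
def rating_density(n_ratings, n_users, n_items):
    """Fraction of the user-by-item matrix that is filled with ratings."""
    return n_ratings / (n_users * n_items)


# Rounded counts taken from the dataset table on the slide.
track1 = rating_density(n_ratings=263e6, n_users=1e6, n_items=625e3)
track2 = rating_density(n_ratings=63e6, n_users=249e3, n_items=296e3)

print(f"Track 1 density: {track1:.4%}")
print(f"Track 2 density: {track2:.4%}")
```

Both densities come out well below one tenth of one percent, so nearly every entry of the rating matrix is missing.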


Items in a Taxonomy


Track 1 Details


Track 1 Highlights

Largest publicly available dataset
Large number of items (50 times more than Netflix)
Extreme rating sparsity (20 times more sparse than Netflix)
Taxonomy can help in combating sparsely rated items
Fine time stamps with both date and time allow sophisticated temporal modeling


Track 2 Details


Track 2 Highlights

Performance metric focuses on ranking/classification, which differs from traditional collaborative filtering.

No validation data provided; participants need to construct binary labeled data themselves from the rating data.

Unlike track 1, track 2 removed time stamps to focus more on long-term preferences than on short-term behaviors.
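
One way to picture the self-constructed binary data mentioned for Track 2 is sketched below: highly rated items become positives, and for each positive an item the user never rated is sampled as a negative, with popular items sampled more often. This is only an illustration, not the procedure used by the organizers or by any winning team. The rating threshold of 80, the popularity weighting, and the function name `build_binary_examples` are all assumptions.

```python
import random
from collections import Counter


def build_binary_examples(ratings, threshold=80, seed=0):
    """Turn (user, item, score) ratings into (user, item, label) examples.

    ASSUMPTIONS: a score >= threshold counts as a positive; negatives are
    drawn from items the user never rated, weighted by how often each item
    was rated highly by anyone.
    """
    rng = random.Random(seed)
    popularity = Counter(item for _, item, score in ratings if score >= threshold)
    rated_by_user = {}
    for user, item, _ in ratings:
        rated_by_user.setdefault(user, set()).add(item)

    examples = []
    for user, item, score in ratings:
        if score < threshold:
            continue
        examples.append((user, item, 1))
        # Candidate negatives: highly rated somewhere, but unrated by this user.
        candidates = [i for i in popularity if i not in rated_by_user[user]]
        if candidates:
            weights = [popularity[i] for i in candidates]
            negative = rng.choices(candidates, weights=weights, k=1)[0]
            examples.append((user, negative, 0))
    return examples


# Hypothetical toy ratings on a 0-100 scale.
toy_ratings = [
    ("u1", "a", 90), ("u1", "b", 30),
    ("u2", "a", 85), ("u2", "c", 100),
    ("u3", "c", 95),
]
examples = build_binary_examples(toy_ratings)
for example in examples:
    print(example)
```

Weighting negatives by popularity keeps a model from scoring well simply by learning that popular items are liked; whether that matches the actual test-set construction is not stated on the slide.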


Submission Stats


Winners

1st place
  Track 1: National Taiwan University
  Track 2: National Taiwan University

2nd place
  Track 1: Commendo (Netflix Prize winner)
  Track 2: Chinese Academy of Sciences, Hulu Labs

3rd place
  Track 1: Hong Kong University of Science and Technology, Shanghai Jiaotong University
  Track 2: Commendo (Netflix Prize winner)


Chinese Teams at KDDCUP (NTU, CAS, HKUST)


Key Techniques

Track 1:
  Blending of multiple techniques
  Matrix factorization models
  Nearest neighbor models
  Restricted Boltzmann machines
  Temporal modeling

Track 2:
  Importance sampling of negative instances
  Taxonomical modeling
  Use of pairwise ranking objective functions
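
As a concrete picture of the matrix factorization models listed for Track 1, the sketch below fits a rating as the global mean plus a dot product of user and item factors, trained by stochastic gradient descent on squared error. It is a generic textbook-style sketch, not any team's actual model. The hyperparameters, the function names, and the toy ratings on a 1-5 scale are assumptions chosen for illustration.

```python
import math
import random


def train_mf(ratings, n_users, n_items, n_factors=2, lr=0.05, reg=0.02,
             epochs=500, seed=0):
    """Fit r_ui ~ mean + p_u . q_i by SGD with L2 regularization."""
    rng = random.Random(seed)
    mean = sum(r for _, _, r in ratings) / len(ratings)
    P = [[rng.gauss(0.0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.gauss(0.0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    data = list(ratings)  # copy so the caller's list is not reordered
    for _ in range(epochs):
        rng.shuffle(data)
        for u, i, r in data:
            err = r - (mean + sum(pf * qf for pf, qf in zip(P[u], Q[i])))
            for f in range(n_factors):
                pf, qf = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qf - reg * pf)
                Q[i][f] += lr * (err * pf - reg * qf)
    return mean, P, Q


def predict(model, u, i):
    """Predicted rating for user u and item i."""
    mean, P, Q = model
    return mean + sum(pf * qf for pf, qf in zip(P[u], Q[i]))


def rmse(model, ratings):
    """Root mean squared error of the model on a list of ratings."""
    sq = sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(sq / len(ratings))


# Hypothetical toy ratings: (user index, item index, rating).
toy = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
model = train_mf(toy, n_users=3, n_items=3)

# Baseline that always predicts the global mean (all factors zero).
baseline = (model[0], [[0.0, 0.0]] * 3, [[0.0, 0.0]] * 3)

print(f"mean-only RMSE: {rmse(baseline, toy):.3f}")
print(f"factorized RMSE: {rmse(model, toy):.3f}")
```

The other Track 1 techniques on the slide (nearest neighbor models, restricted Boltzmann machines, temporal modeling) would each produce their own predictions, and blending combines those predictions into one.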


Summary

Placing at the top of KDDCUP requires:
  Team work
  Expertise in domain knowledge as well as mathematical tools
  Often done by world famous institutes and companies

Recent trends:
  Datasets are increasingly realistic
  Participants are increasingly professional
  Tasks are increasingly difficult


Summary

KDD Cup is an excellent source for learning state-of-the-art KDD techniques

KDDCUP datasets often become the standard benchmarks for future research, development and teaching

Top winners are highly regarded and respected

