Course on Data Mining Mika Klemettinen and Pirjo Moen University of Helsinki/Dept of CS Autumn 2001...


Page 1/62

Course on Data Mining (581550-4)

Mika Klemettinen and Pirjo Moen
University of Helsinki/Dept of CS
Autumn 2001

Course outline: Intro/Ass. Rules, Episodes, Text Mining, Clustering, KDD Process, Appl./Summary, Home Exam
Dates shown on the outline: 24./26.10., 30.10., 7.11., 14.11., 21.11., 28.11.

Page 2/62

Accepted to Autumn 2001 Course

Arkko Jouko
Asikainen Tomi
Aunimo Lili
Hyvönen Leena
Johansson Carl
Jokinen Sakari
Kerminen Antti
Kuokkanen Ville
Lehmussaari Kari
Lehtonen Miro
Löfström Jaakko
Malinen Johanna
Mäkelä Eetu
Ojala Petri
Palin Kimmo
Pasanen Janne
Pietilä Mikko
Pitkänen Esa
Rapiokallio Maarit
Roos Teemu
Sahlberg Mauri
Saikku Arja
Sundman Jonas
Tarvainen Tero
Tiihonen Sami
Tolvanen Juha
Uusitalo Petri
Vasankari Minna
Virtanen Otso

Page 3/62

Course Organization

• Lecturers
• Exercises
• Lectures
• Course Material
• Contents

Page 4/62

Course Organization: Dr. Mika Klemettinen

• PhD Mika Klemettinen:
  – Email: [email protected]
  – WWW: http://www.cs.helsinki.fi/u/mklemett/
  – Room: B356
  – Tel: 050-483 6661
• PhD in January 1999:
  – Thesis: A Knowledge Discovery Methodology for Telecommunication Network Alarm Databases
• Data mining and SGML/XML related research at UH/CS (1994-2000) and at Nokia (2000-)

Page 5/62

Course Organization: Dr. Pirjo Moen

• PhD Pirjo Moen:
  – Email: [email protected]
  – WWW: http://www.cs.helsinki.fi/pirjo.moen/
  – Room: B350
  – Tel: 191 44238
• PhD in February 2000:
  – Thesis: Attribute, Event Sequence, and Event Type Similarity Notions for Data Mining
• Data mining related research at UH/CS (1994-)

Page 6/62

Course Organization: DM/SGML/XML at UH/CS

• RATI (A structured text database system / Rakenteiset tekstitietokannat), 1988-91
• Data mining from telecommunication alarm data, 1994-97
• Structured and Intelligent Documents (SID), 1995-98
• From Data to Knowledge (FDK), 1995-
• Knowledge worker's workstation (TYTTI), 2000-02
• DM Group (99), DOREMI Group (00)

Linux was invented here!

Page 7/62

Course Organization: NRC in Short

• Nokia is the global leader in digital communication technologies with around 60 000 employees all over the world
• Nokia Research Center (NRC) has around 1 200 employees in Finland, USA, Japan, China, Germany, Hungary, UK, etc.
• NRC's role is to enhance Nokia's technological competitiveness by exploring and developing new technologies
• Strongly involved in many European Union and national research projects

Page 8/62

Course Organization: DM Group at NRC

• Background:
  – At the University of Helsinki, Department of Computer Science: data mining methods and theory of data mining since the late 80's
  – Association and episode rule mining, time series similarity, analysis of telecommunication alarm data and web logs, etc.
• Other members include:
  – Dr. Heikki Mannila (group leader)
  – Dr. Hannu Toivonen

Page 9/62

Course Organization: Lectures (1)

• 24.10.-30.11.2001 (12 lectures):
  – 7 normal lectures
  – 5 seminar-like lectures
• Wed 14-16, Fri 12-14 (A217):
  – Wed: normal lecture
  – Fri: seminar-like lecture (except for 26.10.)
• Lectures are obligatory:
  – Normal lectures: 5/7
  – Seminar-like lectures: 4/5
• Lists are circulated

Page 10/62

Course Organization: Lectures (2)

• Lecturing language is Finnish, slides are in English:
  – Students can also use English
  – A foreign student group can be established
• Normal lectures:
  – Basics, terminology, standard methods
  – Lecturer-driven teaching
• Seminar-like lectures:
  – Extensions to the basic methods
  – Lecturer gives an introduction
  – Student groups give short presentations

Page 11/62

Course Organization: Lectures (3)

• Group for seminar (and exercise) work:
  – 10 groups à 3 persons, 2 groups/lecture
  – Dates are agreed at the beginning of the course
  – Articles are given on the previous week's Wed
• Seminar presentations:
  – Written presentation (around 3-5 printed pages) due by the start of the seminar:
    • Can be either an HTML page or a printable document in PostScript/PDF format
  – 30 minutes of presentation
  – 5-15 minutes of discussion
  – Active participation

Page 12/62

Course Organization: Course Material

• Lecture slides
• Original articles
• Seminar presentations
• Book: "Data Mining: Concepts and Techniques" by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, August 2000. 550 pages. ISBN 1-55860-489-8
• Remember to check course website and folder for the material!

Page 13/62

Course Organization: Exercises

• Given by Pirjo Moen:
  – Email: [email protected]
  – Room: B350
  – Tel: 191 44238
• 1.11.-29.11.2001 (5 exercises)
• Thu 12-14 (A318)
• Exercises are obligatory:
  – Exercises: 4/5
• Lists are circulated
• Discussion is an essential part!

Page 14/62

Course Organization: Exercises

• Usually around 3-4 exercises:
  – 2-3 "normal" exercises (with subtasks):
    • Available due Thu mornings at 9
  – 1 group work:
    • A practical exercise
    • Available due Thu mornings at 9
    • A written report (not hand-written!) must be returned at the exercise session
    • Group = the seminar presentation group
• Foreign students:
  – Return all exercises in written format to Pirjo Moen

Page 15/62

Course Organization: Home Exam

• The home exam is given on 28.11.2001
• Must be returned by 21.12.2001 (printed version, not hand-written, not by email)
• Tentatively:
  – Course lectures, seminar presentations and exercises are the material for the exam
  – Questions contain both theoretical and practical issues
  – Around 4-6 smaller questions
  – Around 1-2 bigger questions

Page 16/62

Course Organization: Course Evaluation

• Scale: 1-/3 … 3/3 or rejected
• Grade = home exam + exercises + experiments + group presentations:
  – home exam: max 30 points
    • (4 × 5p) + (1 × 10p)
  – normal exercises (10): max 5 points
    • 2: 1p, 4: 2p, 6: 3p, 8: 4p, 10: 5p
  – experiments (5): max 15 points
    • max 3 points/experiment
  – group presentation: max 10 points
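The point scheme on this slide can be tallied mechanically. The following is a minimal Python sketch, not part of the original slides. The function names are hypothetical, and rounding odd exercise counts down is an assumption, since the slide lists points only for even counts.

```python
def exercise_points(n_done):
    """Points for normal exercises: 2 -> 1p, 4 -> 2p, ..., 10 -> 5p.

    Assumption: odd counts are rounded down (the slide does not say).
    """
    return min(n_done, 10) // 2


def total_points(home_exam, n_exercises, experiment_points, presentation):
    """Sum the four grade components, capping each at its slide maximum."""
    # At most 5 experiments, each worth at most 3 points (max 15 in total)
    experiments = sum(min(p, 3) for p in experiment_points[:5])
    return (min(home_exam, 30)
            + exercise_points(n_exercises)
            + experiments
            + min(presentation, 10))
```

Under these assumptions the maximum attainable total is 30 + 5 + 15 + 10 = 60 points.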

Page 17/62

Course Organization: Course Evaluation

• Passing the course: min 30 points
  – home exam: min 13 points (max 30 points)
  – exercises/experiments: min 8 points (max 20 points)
    • at least 3 returned and reported experiments
  – group presentation: min 4 points (max 10 points)
• Remember also the other requirements:
  – Attending the lectures (5/7)
  – Attending the seminars (4/5)
  – Attending the exercises (4/5)
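The passing rules combine a total threshold, per-component minimums, and attendance requirements. Note that the component minimums alone (13 + 8 + 4 = 25) do not reach the 30-point total. A minimal sketch of the check, with a hypothetical function name and argument layout that are not from the slides:

```python
def passes_course(home_exam, exercise_pts, experiment_pts, presentation,
                  lectures_attended, seminars_attended, exercises_attended):
    """Return True if all passing requirements on the slide are met.

    experiment_pts is a list with one entry per returned and reported
    experiment (hypothetical representation).
    """
    exercises_total = exercise_pts + sum(experiment_pts)
    total = home_exam + exercises_total + presentation
    return (total >= 30
            and home_exam >= 13
            and exercises_total >= 8
            and len(experiment_pts) >= 3
            and presentation >= 4
            and lectures_attended >= 5      # of 7 normal lectures
            and seminars_attended >= 4      # of 5 seminar-like lectures
            and exercises_attended >= 4)    # of 5 exercise sessions
```

For example, a student who meets every component minimum exactly (25 points) still fails, because the total stays below 30.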

Page 18/62

Course Organization: Course Contents (1)

• Module/Week 1:
  – What is Data Mining?
  – Association rules
  – 24.10. normal lecture by Mika
  – 26.10. normal lecture by Mika
• Module/Week 2:
  – Recurrent patterns
  – Episode rules, minimal occurrences
  – 31.10. normal lecture by Mika
  – 2.11. seminar-like lecture by Pirjo

Page 19/62

Course Organization: Course Contents (2)

• Module/Week 3:
  – Text mining
  – 7.11. normal lecture by Mika
  – 9.11. seminar-like lecture by Mika
• Module/Week 4:
  – Clustering
  – Classification
  – Similarity
  – 14.11. normal lecture by Pirjo
  – 16.11. seminar-like lecture by Mika

Page 20/62

Course Organization: Course Contents (3)

• Module/Week 5:
  – Knowledge discovery process
  – Pre- and postprocessing
  – 21.11. normal lecture by Pirjo
  – 23.11. seminar-like lecture by Pirjo
• Module/Week 6:
  – Data mining tools
  – Summary, future
  – 28.11. normal lecture by Pirjo
  – 30.11. seminar-like lecture by Pirjo

Page 21/62

Course Organization / Groups: Group Establishment

• Group is for both seminar and weekly group exercise work
• 10 groups à 3 persons

Get grouped!

Page 22/62

Course Organization / Groups

• Group presentation time allocation:
  – Fri 2.11.: Group 1, Group 2 (associations)
  – Fri 9.11.: Group 3, Group 4 (episodes)
  – Fri 16.11.: Group 5, Group 6 (text mining)
  – Fri 23.11.: Group 7, Group 8 (clustering)
  – Fri 30.11.: Group 9, Group 10 (KDD process)

Page 23/62

Course Organization / Groups

• Group 1:
  – Asikainen Tomi, Hyvönen Leena
• Group 2:
  – Löfström Jaakko, Pitkänen Esa, Tarvainen Tero
• Group 3:
  – Jokinen Sakari, Kuokkanen Ville, Tolvanen Juha
• Group 4:
  – Lehmussaari Kari, Pietilä Mikko, Uusitalo Petri
• Group 5:
  – Johansson Carl, Kerminen Antti, Sundman Jonas

Page 24/62

Course Organization / Groups

• Group 6:
  – Malinen Johanna, Sahlberg Mauri, Vasankari Minna
• Group 7:
  – Arkko Jouko, Ojala Petri, Rapiokallio Maarit
• Group 8:
  – Palin Kimmo, Pasanen Janne (, X)
• Group 9:
  – Aunimo Lili, Lehtonen Miro, Saikku Arja
• Group 10:
  – X, X, X

Page 25/62

Introduction to Data Mining (DM)

• What? Why?
• DM Views
• Applications
• KDD Process
• Major Issues

Page 26/62

Computers in 1940s (ENIAC)

Page 27/62

Personal Home Network in 2000s

[Figure: home network diagram with an Internet connection, several devices labelled "Storage", and screenshots of a disk-usage listing and a network traffic chart; details not recoverable]

Page 28/62

Evolution of Database Technology

• 1960s:
  – Data collection, database creation, IMS and network DBMS
• 1970s:
  – Relational data model, relational DBMS implementation
• 1980s:
  – RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
  – Data mining and data warehousing, multimedia databases, and Web technology

Page 29/62

Why Data Mining?

• Enormous amounts of data available:
  – Automated data collection tools and mature database technology lead to huge amounts of data stored in databases, data warehouses and other information repositories
  – Manual inspection is either tedious or just impossible

Page 30/62

What is Data Mining?

• Ultimately:
  – "Extraction of interesting (non-trivial, implicit, previously unknown, potentially useful) information or patterns from data in large databases"
• Often just:
  – "Tell something interesting about this data", "Describe this data"

Exploratory, semi-automatic data analysis on large data sets

Page 31/62

What is Data Mining?

• Rather established terminology:
  – Data mining
    • Usually DM is one part of the KDD process
  – Knowledge discovery in databases (KDD)
    • The general term that covers, e.g., data preprocessing, DM, and post-processing
• Not so often used terms:
  – Knowledge extraction, data archeology
• Newest hype:
  – Business intelligence, knowledge management

Page 32/62

What is DM Useful for?

[Figure labels: Marketing, Database Marketing, Data Warehousing, KDD & Data Mining]

Increase knowledge to base decisions upon

E.g., impact on marketing

The role and importance of KDD and DM have grown rapidly - and are still growing!

But DM is not just marketing...

Page 33/62

Potential Applications?

• Database analysis and decision support:
  – Market analysis and management
  – Risk analysis and management
  – Fraud detection and management
• Other applications:
  – Web mining
  – Text mining
  – etc.

Page 34/62

Example (1)

• You are a marketing manager for a cellular telephone company:
  – Customers receive a free phone (worth 150€) with one-year contract; you pay a sales commission of 250€ per contract
  – Problem: Turnover (after contract expires) is 25%
  – Giving a new phone to everyone whose contract is expiring is very expensive
  – Bringing back a customer after quitting is both difficult and expensive

Page 35/62

Example (1)

• Three months before a contract expires, predict which customers will leave:
  – If you want to keep a customer that is predicted to leave, offer them a new phone

Yippee! I won't leave!
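The economics of the cellular contract example can be made concrete with a small cost comparison. This is a minimal Python sketch, not from the slides. Only the 150€ phone, the 250€ commission and the 25% turnover come from the example. The predictor quality (`recall`, `precision`), the assumption that every leaver who receives an offer stays, and the assumption that replacing a lost customer costs one phone plus one commission are all hypothetical.

```python
PHONE_COST = 150    # EUR, from the example
COMMISSION = 250    # EUR per contract, from the example
CHURN_RATE = 0.25   # turnover after contract expiry, from the example


def replacement_cost(n_lost):
    """Cost of signing new customers to replace lost ones (assumption:
    each replacement costs one free phone plus one sales commission)."""
    return n_lost * (PHONE_COST + COMMISSION)


def blanket_cost(n_customers):
    """Give a new phone to everyone whose contract is expiring."""
    return n_customers * PHONE_COST


def targeted_cost(n_customers, recall, precision):
    """Offer phones only to customers predicted to leave.

    recall: share of true leavers the (hypothetical) predictor finds.
    precision: share of predicted leavers who really would leave.
    Assumes every leaver who receives an offer stays.
    """
    leavers = n_customers * CHURN_RATE
    caught = leavers * recall
    offers = caught / precision          # includes false alarms
    missed = leavers - caught
    return offers * PHONE_COST + replacement_cost(missed)
```

With 1 000 expiring contracts, doing nothing costs `replacement_cost(250)` = 100 000€ and the blanket offer costs 150 000€, while a predictor with recall 0.8 and precision 0.5 brings the targeted cost to about 80 000€ under these assumptions.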

Page 36/62

Example (2)

• You are an insurance officer and you should define a suitable monthly payment for an 18-year-old boy who has bought a Ferrari … what to do?

Oh, yes! I love my Ferrari!

Page 37/62

Example (2)

• Analyze all previous customer data and data on paid compensations
• What is the predicted accident probability based on…
  – Driver's gender (male/female) and age
  – Car model and age, place of living
  – etc.
• If the accident probability is higher than average, set the monthly payment accordingly!
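One simple way to read the insurance example is to estimate accident rates per customer group from historical records and raise the payment for groups above the average. The sketch below is an illustration only, not the method of the slides. The record layout, the `age_band` grouping, the toy data and the proportional scaling rule are all assumptions.

```python
from collections import defaultdict


def accident_rates(records, group_key):
    """Fraction of customers with a paid claim, per group."""
    counts = defaultdict(lambda: [0, 0])   # group -> [claims, customers]
    for rec in records:
        group = group_key(rec)
        counts[group][0] += 1 if rec["had_accident"] else 0
        counts[group][1] += 1
    return {g: claims / total for g, (claims, total) in counts.items()}


def monthly_payment(base_payment, group_rate, average_rate):
    """Scale the base payment when the group's rate exceeds the average
    (proportional scaling is an assumption of this sketch)."""
    if average_rate <= 0 or group_rate <= average_rate:
        return base_payment
    return base_payment * group_rate / average_rate


# Made-up toy records, for illustration only
records = [
    {"age_band": "18-25", "had_accident": True},
    {"age_band": "18-25", "had_accident": True},
    {"age_band": "18-25", "had_accident": False},
    {"age_band": "18-25", "had_accident": False},
    {"age_band": "26+", "had_accident": True},
    {"age_band": "26+", "had_accident": False},
    {"age_band": "26+", "had_accident": False},
    {"age_band": "26+", "had_accident": False},
]
rates = accident_rates(records, lambda r: r["age_band"])
average = sum(r["had_accident"] for r in records) / len(records)
```

A real model would combine several attributes (gender, car model, place of living) rather than one grouping key; the single key keeps the sketch short.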

Page38/62

Example (3)

• You are in a foreign country and somebody steals or duplicates your credit card or mobile phone …

• Credit card companies …

– use historical data to build models of fraudulent behaviour and use data mining to help identify similar instances

• Phone companies …

– analyze patterns that deviate from an expected norm (destination, duration, etc.)
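As a rough illustration of "patterns that deviate from an expected norm", the sketch below flags call durations that lie far from a customer's historical mean. The z-score rule, the threshold of 3 standard deviations, the function name `deviating_calls`, and the sample data are all assumptions made for this example; the slides do not say which method phone companies actually use.

```python
from statistics import mean, stdev

def deviating_calls(history, new_calls, threshold=3.0):
    """Return the new call durations that lie more than `threshold`
    sample standard deviations from the historical mean."""
    mu = mean(history)
    sigma = stdev(history)  # needs at least two historical values
    if sigma == 0:
        # No variation in the history: anything different deviates.
        return [d for d in new_calls if d != mu]
    return [d for d in new_calls if abs(d - mu) / sigma > threshold]

# Hypothetical call durations in minutes
past_minutes = [3, 5, 4, 6, 5, 4, 3, 5]
flagged = deviating_calls(past_minutes, [4, 6, 95])
print(flagged)
```

The same idea extends to other attributes mentioned on the slide, such as call destination, by keeping one profile per attribute.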


Example (4)

• Web access logs can be analyzed for …

– discovering customer preferences

– improving Web site organization

• Similarly …

– all kinds of log information analysis

– user interface/service adaptation

Excellent surfing experience!
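A very small sketch of Web log analysis is shown below: it counts how often each page is requested, which is one crude indicator of customer preferences. The log format (`<client> <path>`), the sample lines, and the function `page_popularity` are invented for illustration; real access logs carry more fields and need proper parsing.

```python
from collections import Counter

# Hypothetical, simplified log lines: "<client> <path>"
log_lines = [
    "10.0.0.1 /products/skis",
    "10.0.0.2 /products/skis",
    "10.0.0.1 /products/boots",
    "10.0.0.3 /contact",
    "10.0.0.2 /products/skis",
]

def page_popularity(lines):
    """Count requests per path and return (path, count) pairs, most requested first."""
    hits = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # skip malformed lines
            hits[fields[1]] += 1
    return hits.most_common()

ranking = page_popularity(log_lines)
print(ranking[0])
```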


Knowledge Discovery Process (1)

• Learning the domain

• Creating a target data set

• Data cleaning/preprocessing

• Data reduction/projection

• Choosing the DM task


Knowledge Discovery Process (2)

• Choosing the DM algorithm(s)

• Data mining: Search

• Pattern evaluation

• Knowledge presentation

• Use of discovered knowledge


Typical KDD Process

[Figure: a flow in three numbered phases from the operational database to utilization. Labels: Operational Database, Selection, Raw data, Time based selection, Preprocessing, Input data (cleaned, verified, focused), Data mining, Results, Postprocessing, Eval. of interestingness, Selected usable patterns, Utilization.]
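The three phases named in the figure (preprocessing, data mining, postprocessing) can be sketched as three small functions chained together. This is only an illustrative toy under stated assumptions: the mining step merely counts item frequencies, the interestingness measure is a minimum relative frequency of 0.5, and the function names and sample transactions are hypothetical.

```python
from collections import Counter

def preprocess(raw_rows):
    """Selection and cleaning: drop empty rows and blank items, normalise item names."""
    return [
        {item.strip().lower() for item in row if item.strip()}
        for row in raw_rows
        if row
    ]

def mine(transactions):
    """Data mining step: count in how many transactions each item occurs."""
    counts = Counter()
    for transaction in transactions:
        counts.update(transaction)
    return counts

def postprocess(counts, n_transactions, min_support=0.5):
    """Evaluation: keep only items whose relative frequency reaches min_support."""
    return {
        item: count / n_transactions
        for item, count in counts.items()
        if count / n_transactions >= min_support
    }

# Hypothetical raw data with noise (mixed case, stray spaces, empty entries)
raw = [["Milk", " bread "], [], ["milk", "Beer"], ["bread", "milk", ""]]
clean = preprocess(raw)
patterns = postprocess(mine(clean), len(clean))
print(patterns)
```

Keeping the phases as separate functions mirrors the figure: each phase can be replaced (for example, a real association rule miner in place of `mine`) without touching the others.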


Utilization

[Figure: layers with increasing potential to support business decisions, listed here from top to bottom. Roles shown beside the layers, from top to bottom: End User, Business Analyst, Data Analyst, DBA.]

• Making Decisions

• Data Presentation: Visualization Techniques

• Data Mining: Information Discovery

• Data Exploration: Statistical Analysis, Querying and Reporting

• Data Warehouses / Data Marts: OLAP, MDA

• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP


The Value Chain

Data
• Customer data
• Store data
• Demographical data
• Geographical data

Information
• X lives in Z
• S is Y years old
• X and S moved
• W has money in Z

Knowledge
• A quantity Y of product A is used in region Z
• Customers of class Y use x% of C during period D

Decision
• Promote product A in region Z
• Mail ads to families of profile P
• Cross-sell service B to clients C


Data Mining Views

• General approaches:

– Descriptive data mining:

• Describe what is interesting in this data!

• Explain this data to me!

– Predictive data mining:

• Based on this and previous data, tell me what will happen in the future!

• Show me the future trends!
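The difference between the two approaches can be shown on one small series. In the sketch below, the descriptive part summarises the existing data, while the predictive part fits a least-squares line and extrapolates one step ahead. The sales figures, the choice of a straight-line model, and the function `linear_forecast` are assumptions made for illustration only.

```python
from statistics import mean

monthly_sales = [100.0, 110.0, 120.0, 130.0]  # hypothetical figures

# Descriptive: summarise what the data contains.
summary = {
    "min": min(monthly_sales),
    "max": max(monthly_sales),
    "mean": mean(monthly_sales),
}

def linear_forecast(values):
    """Predictive: fit a least-squares line through (0, v0), (1, v1), ...
    and return its value at the next position. Needs at least two values."""
    n = len(values)
    xs = list(range(n))
    x_bar, y_bar = mean(xs), mean(values)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    return y_bar + slope * (n - x_bar)

next_month = linear_forecast(monthly_sales)
print(summary, next_month)
```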


Data Mining Views

• Views based on …

– Databases to be mined

– Knowledge to be discovered

– Techniques utilized

– Applications adapted

• Let's take a closer look at these views...


Data Mining Views

Databases to be mined

• Relational

• Transactional

• Object-oriented

• Object-relational

• Active

• Spatial

• Time-series

• Text, XML

• Multi-media

• Heterogeneous

• Legacy

• Inductive

• WWW

• etc.


Data Mining Views

Knowledge to be mined = tasks

• Characterization

• Discrimination

• Association

• Classification

• Clustering

• Trend

• Deviation analysis

• Outlier analysis

• etc.


Data Mining Views

Techniques utilized

• Database-oriented

• Data warehouse (OLAP)

• Machine learning

• Statistics

• Visualization

• Neural networks

• etc.


Data Mining Views

Applications adapted

• Retail (supermarkets etc.)

• Telecom

• Banking

• Fraud analysis

• DNA mining

• Stock market analysis

• Web mining

• Log data analysis

• etc.


Major Issues in Data Mining

• Mining methodologies and interaction:
– Mining different kinds of knowledge
– Interactive mining of knowledge
– Incorporation of background knowledge
– DM query languages and ad-hoc DM
– Visualization of DM results
– Handling noise and incomplete data
– The interestingness problem

• Performance and scalability:
– Efficiency and scalability of DM algorithms
– Parallel, distributed and incremental mining methods


Major Issues in Data Mining

• Diversity of data types:
– Handling complex types of data
– Mining information from heterogeneous databases (Web etc.)

• Application and integration of discovered knowledge:
– Domain-specific DM tools
– Intelligent query answering and decision making
– Integration of discovered knowledge with existing knowledge

• Protection of data …
– Security
– Integrity
– Privacy


Historical Data Mining Activities

• 1989 IJCAI Workshop

• 1991-1994 KDD Workshops

• 1995-1998 KDD Conferences

• 1998 ACM SIGKDD

• 1999- SIGKDD Conferences

• And many smaller/new DM conferences …

– PAKDD, PKDD

– SIAM-Data Mining, (IEEE) ICDM

– etc.


Useful References on Data Mining

“Standards”

• DM:
Conferences: KDD, PKDD, PAKDD, ...
Journals: Data Mining and Knowledge Discovery, CACM

• DM/DB:
Conferences: ACM-SIGMOD/PODS, VLDB, ...
Journals: ACM-TODS, J. ACM, IEEE-TKDE, JIIS, ...

• AI/ML:
Conferences: Machine Learning, AAAI, IJCAI, ...
Journals: Machine Learning, Artificial Intelligence, ...


Conclusions

• Data mining: semi-automatic discovery of interesting patterns from large data sets

• Knowledge discovery is a process:
– Preprocessing
– Data mining
– Postprocessing

• There are different … to be mined, used or utilized:
– Databases (relational, object-oriented, spatial, WWW, …)
– Knowledge (characterization, clustering, association, …)
– Techniques (machine learning, statistics, visualization, …)
– Applications (retail, telecom, Web mining, log analysis, …)


Conclusions

• Module/Week 1:
– What is Data Mining?
– Association rules
– 24.10. normal lecture by Mika
– 26.10. normal lecture by Mika

• Module/Week 2:
– Episode rules, minimal occurrences
– 31.10. normal lecture by Mika
– 2.11. seminar-like lecture by Pirjo

• Module/Week 3:
– Text mining
– 7.11. normal lecture by Mika
– 9.11. seminar-like lecture by Mika


Conclusions

• Module/Week 4:
– Clustering, Classification, Similarity
– 14.11. normal lecture by Pirjo
– 16.11. seminar-like lecture by Mika

• Module/Week 5:
– Knowledge discovery process
– Pre- and postprocessing
– 21.11. normal lecture by Pirjo
– 23.11. seminar-like lecture by Pirjo

• Module/Week 6:
– Data mining tools, Summary, Future
– 28.11. normal lecture by Pirjo
– 30.11. seminar-like lecture by Pirjo


Seminar Presentations

• Seminar presentations:

– Articles are given out on the previous week's Wednesday

– Written presentation (around 3-5 printed pages) due by the start of the seminar:

• Can be either an HTML page or a printable document in PostScript/PDF format

– 30 minutes of presentation

– 5-15 minutes of discussion

– Active participation


Seminar Presentations/Groups 1-2

• Quantitative Rules

• MINERULE


Seminar 1/2: Quantitative Rules

• R. Srikant, R. Agrawal: "Mining Quantitative Association Rules in Large Relational Tables", Proc. of the ACM-SIGMOD 1996 Conference on Management of Data, Montreal, Canada, June 1996.


Seminar 2/2: MINERULE

• Rosa Meo, Giuseppe Psaila, Stefano Ceri: "A New SQL-like Operator for Mining Association Rules". VLDB 1996: 122-133


Introduction to Data Mining (DM)

Thank you for your attention and have a nice course!

Thanks to Jiawei Han from Simon Fraser University for his slides which greatly helped in preparing this lecture! Also thanks to Fosca Giannotti and Dino Pedreschi from Pisa for their slides.