Course on Data Mining
Mika Klemettinen and Pirjo Moen, University of Helsinki/Dept of CS, Autumn 2001
Course on Data Mining (581550-4)
• Intro/Ass. Rules: 24./26.10.
• Episodes: 30.10.
• Text Mining: 7.11.
• Clustering: 14.11.
• KDD Process: 21.11.
• Appl./Summary: 28.11.
• Home Exam
Accepted to Autumn 2001 Course
Arkko Jouko
Asikainen Tomi
Aunimo Lili
Hyvönen Leena
Johansson Carl
Jokinen Sakari
Kerminen Antti
Kuokkanen Ville
Lehmussaari Kari
Lehtonen Miro
Löfström Jaakko
Malinen Johanna
Mäkelä Eetu
Ojala Petri
Palin Kimmo
Pasanen Janne
Pietilä Mikko
Pitkänen Esa
Rapiokallio Maarit
Roos Teemu
Sahlberg Mauri
Saikku Arja
Sundman Jonas
Tarvainen Tero
Tiihonen Sami
Tolvanen Juha
Uusitalo Petri
Vasankari Minna
Virtanen Otso
Course Organization
• Lecturers
• Exercises
• Lectures
• Course Material
• Contents
Dr. Mika Klemettinen
• PhD Mika Klemettinen:
– Email: [email protected]
– WWW: http://www.cs.helsinki.fi/u/mklemett/
– Room: B356
– Tel: 050-483 6661
• PhD in January 1999:
– Thesis: A Knowledge Discovery Methodology for Telecommunication Network Alarm Databases
• Data mining and SGML/XML related research at UH/CS (1994-2000) and at Nokia (2000-)
Dr. Pirjo Moen
• PhD Pirjo Moen:
– Email: [email protected]
– WWW: http://www.cs.helsinki.fi/pirjo.moen/
– Room: B350
– Tel: 191 44238
• PhD in February 2000:
– Thesis: Attribute, Event Sequence, and Event Type Similarity Notions for Data Mining
• Data mining related research at UH/CS (1994-)
DM/SGML/XML at UH/CS
• RATI (A structured text database system / Rakenteiset tekstitietokannat), 1988-91
• Data mining from telecommunication alarm data, 1994-97
• Structured and Intelligent Documents (SID), 1995-98
• From Data to Knowledge (FDK), 1995-
• Knowledge worker's workstation (TYTTI), 2000-02
• DM Group (99), DOREMI Group (00)
Linux was invented here!
NRC in Short
• Nokia is the global leader in digital communication technologies, with around 60 000 employees all over the world
• Nokia Research Center (NRC) has around 1 200 employees in Finland, USA, Japan, China, Germany, Hungary, UK, etc.
• NRC's role is to enhance Nokia's technological competitiveness by exploring and developing new technologies
• Strongly involved in many European Union and national research projects
DM Group at NRC
• Background:
– At the University of Helsinki, Dept of Computer Science: data mining methods and theory of data mining since the late 80's
– Association and episode rule mining, time series similarity, analysis of telecommunication alarm data and web logs, etc.
• Other members include:
– Dr. Heikki Mannila (group leader)
– Dr. Hannu Toivonen
Lectures (1)
• 24.10.-30.11.2001 (12 lectures):
– 7 normal lectures
– 5 seminar-like lectures
• Wed 14-16, Fri 12-14 (A217):
– Wed: normal lecture
– Fri: seminar-like lecture (except for 26.10.)
• Lectures are obligatory:
– Normal lectures: 5/7
– Seminar-like lectures: 4/5
• Lists are circulated
Lectures (2)
• Lecturing language is Finnish, slides are in English:
– Students can also use English
– A foreign student group can be established
• Normal lectures:
– Basics, terminology, standard methods
– Lecturer-driven teaching
• Seminar-like lectures:
– Extensions to the basic methods
– Lecturer gives an introduction
– Student groups give short presentations
Lectures (3)
• Group for seminar (and exercise) work:
– 10 groups à 3 persons, 2 groups/lecture
– Dates are agreed at the beginning of the course
– Articles are given on the previous week's Wed
• Seminar presentations:
– Written presentation (around 3-5 printed pages), due by the start of the seminar:
• Can be either an HTML page or a printable document in PostScript/PDF format
– 30 minutes of presentation
– 5-15 minutes of discussion
– Active participation
Course Material
• Lecture slides
• Original articles
• Seminar presentations
• Book: "Data Mining: Concepts and Techniques" by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, August 2000. 550 pages. ISBN 1-55860-489-8
• Remember to check the course website and folder for the material!
Exercises
• Given by Pirjo Moen:
– Email: [email protected]
– Room: B350
– Tel: 191 44238
• 1.11.-29.11.2001 (5 exercises)
• Thu 12-14 (A318)
• Exercises are obligatory:
– Exercises: 4/5
• Lists are circulated
• Discussion is an essential part!
Exercises
• Usually around 3-4 exercises:
– 2-3 "normal" exercises (with subtasks):
• Available due Thu mornings at 9
– 1 group work:
• A practical exercise
• Available due Thu mornings at 9
• A written report (not hand-written!) must be returned at the exercise session
• Group = the seminar presentation group
• Foreign students:
– Return all exercises in written format to Pirjo Moen
Home Exam
• The home exam is given on 28.11.2001
• Must be returned by 21.12.2001 (printed version, not hand-written, not by email)
• Tentatively:
– Course lectures, seminar presentations and exercises are the material for the exam
– Questions contain both theoretical and practical issues
– Around 4-6 smaller questions
– Around 1-2 bigger questions
Course Evaluation
• Scale: 1-/3 … 3/3 or rejected
• Grade = home exam + exercises + experiments + group presentations:
– home exam: max 30 points
• (4 X 5p) + (1 X 10p)
– normal exercises (10): max 5 points
• 2: 1p, 4: 2p, 6: 3p, 8: 4p, 10: 5p
– experiments (5): max 15 points
• max 3 points/experiment
– group presentation: max 10 points
Course Evaluation
• Passing the course: min 30 points
– home exam: min 13 points (max 30 points)
– exercises/experiments: min 8 points (max 20 points)
• at least 3 returned and reported experiments
– group presentation: min 4 points (max 10 points)
• Remember also the other requirements:
– Attending the lectures (5/7)
– Attending the seminars (4/5)
– Attending the exercises (4/5)
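The point rules on the two Course Evaluation slides can be checked with a small calculation. The following is only a sketch: the function names are made up for illustration, attendance requirements are not modelled, and rounding odd exercise counts down is an assumption (the slide lists even counts only).

```python
def exercise_points(exercises_done: int) -> int:
    """Slide mapping: 2 -> 1p, 4 -> 2p, ..., 10 -> 5p.
    Rounding odd counts down is an assumption, not stated on the slide."""
    return max(0, min(exercises_done, 10)) // 2


def course_result(exam, exercises_done, experiment_points, presentation):
    """Return (total points, passed?) under the slide's point limits.

    experiment_points holds one entry per returned and reported
    experiment, each worth at most 3 points (at most 5 experiments).
    """
    ex = exercise_points(exercises_done)
    exp = sum(min(p, 3) for p in experiment_points[:5])
    total = min(exam, 30) + ex + exp + min(presentation, 10)
    passed = (
        total >= 30                       # overall minimum
        and exam >= 13                    # home exam minimum
        and ex + exp >= 8                 # exercises/experiments minimum
        and len(experiment_points) >= 3   # at least 3 experiments returned
        and presentation >= 4             # group presentation minimum
    )
    return total, passed


# Hypothetical student: exam 20p, 8 exercises, three experiments, presentation 6p
print(course_result(20, 8, [3, 2, 2], 6))
```

Note that a high total is not enough on its own: each component minimum must also be met.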
Course Contents (1)
• Module/Week 1:
– What is Data Mining?
– Association rules
– 24.10. normal lecture by Mika
– 26.10. normal lecture by Mika
• Module/Week 2:
– Recurrent patterns
– Episode rules, minimal occurrences
– 31.10. normal lecture by Mika
– 2.11. seminar-like lecture by Pirjo
Course Contents (2)
• Module/Week 3:
– Text mining
– 7.11. normal lecture by Mika
– 9.11. seminar-like lecture by Mika
• Module/Week 4:
– Clustering
– Classification
– Similarity
– 14.11. normal lecture by Pirjo
– 16.11. seminar-like lecture by Mika
Course Contents (3)
• Module/Week 5:
– Knowledge discovery process
– Pre- and postprocessing
– 21.11. normal lecture by Pirjo
– 23.11. seminar-like lecture by Pirjo
• Module/Week 6:
– Data mining tools
– Summary, future
– 28.11. normal lecture by Pirjo
– 30.11. seminar-like lecture by Pirjo
Course Organization / Groups
Group Establishment
• Group is for both seminar and weekly group exercise work
• 10 groups à 3 persons
Get grouped!
• Group presentation time allocation:
– Fri 2.11.: Group 1, Group 2 (associations)
– Fri 9.11.: Group 3, Group 4 (episodes)
– Fri 16.11.: Group 5, Group 6 (text mining)
– Fri 23.11.: Group 7, Group 8 (clustering)
– Fri 30.11.: Group 9, Group 10 (KDD process)
• Group 1:
– Asikainen Tomi, Hyvönen Leena
• Group 2:
– Löfström Jaakko, Pitkänen Esa, Tarvainen Tero
• Group 3:
– Jokinen Sakari, Kuokkanen Ville, Tolvanen Juha
• Group 4:
– Lehmussaari Kari, Pietilä Mikko, Uusitalo Petri
• Group 5:
– Johansson Carl, Kerminen Antti, Sundman Jonas
• Group 6:
– Malinen Johanna, Sahlberg Mauri, Vasankari Minna
• Group 7:
– Arkko Jouko, Ojala Petri, Rapiokallio Maarit
• Group 8:
– Palin Kimmo, Pasanen Janne (, X)
• Group 9:
– Aunimo Lili, Lehtonen Miro, Saikku Arja
• Group 10:
– X, X, X
Introduction to Data Mining (DM)
• What? Why?
• DM Views
• Applications
• KDD Process
• Major Issues
Computers in 1940s (ENIAC)
Personal Home Network in 2000s
[Figure labels: Internet; Storage (repeated on many devices)]
Evolution of Database Technology
• 1960s:
– Data collection, database creation, IMS and network DBMS
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s:
– RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
– Data mining and data warehousing, multimedia databases, and Web technology
Why Data Mining?
• Enormous amounts of data available:
– Automated data collection tools and mature database technology lead to huge amounts of data stored in databases, data warehouses and other information repositories
– Manual inspection is either tedious or just impossible
What is Data Mining?
• Ultimately:
– "Extraction of interesting (non-trivial, implicit, previously unknown, potentially useful) information or patterns from data in large databases"
• Often just:
– "Tell something interesting about this data", "Describe this data"
Exploratory, semi-automatic data analysis on large data sets
What is Data Mining?
• Rather established terminology:
– Data mining
• Usually DM is one part of the KDD process
– Knowledge discovery in databases (KDD)
• The general term that covers, e.g., data preprocessing, DM, and post-processing
• Not so often used terms:
– Knowledge extraction, data archeology
• Newest hype:
– Business intelligence, knowledge management
What is DM Useful for?
[Figure labels: Marketing; Database Marketing; Data Warehousing; KDD & Data Mining]
• Increase knowledge to base decisions upon
• E.g., impact on marketing
• The role and importance of KDD and DM have grown rapidly, and are still growing!
• But DM is not just marketing...
Potential Applications?
• Database analysis and decision support:
– Market analysis and management
– Risk analysis and management
– Fraud detection and management
• Other applications:
– Web mining
– Text mining
– etc.
Example (1)
• You are a marketing manager for a cellular telephone company:
– Customers receive a free phone (worth 150€) with a one-year contract; you pay a sales commission of 250€ per contract
– Problem: Turnover (after contract expires) is 25%
– Giving a new phone to everyone whose contract is expiring is very expensive
– Bringing back a customer after quitting is both difficult and expensive
Example (1)
• Three months before a contract expires, predict which customers will leave:
– If you want to keep a customer that is predicted to leave, offer them a new phone
"Yippee! I won't leave!"
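To see why prediction pays off, the figures from Example (1) can be put into a rough cost comparison. This is a sketch under strong assumptions that are not on the slides: a lost customer is replaced by a new contract (free phone plus commission), a phone offer always keeps the customer, and the model's recall and precision values are invented for illustration.

```python
PHONE_COST = 150    # value of the free phone (from the slide), in euros
COMMISSION = 250    # sales commission per contract (from the slide)
CHURN_RATE = 0.25   # turnover after contract expiry (from the slide)


def replacement_cost(lost_customers: int) -> int:
    # Assumption: every lost customer is replaced by a new contract.
    return lost_customers * (PHONE_COST + COMMISSION)


def no_action_cost(customers: int) -> int:
    return replacement_cost(round(customers * CHURN_RATE))


def blanket_offer_cost(customers: int) -> int:
    # A new phone for everyone whose contract is expiring.
    return customers * PHONE_COST


def targeted_offer_cost(customers: int, recall: float, precision: float) -> int:
    # recall/precision describe a hypothetical churn prediction model.
    churners = round(customers * CHURN_RATE)
    caught = round(churners * recall)      # leavers the model finds
    flagged = round(caught / precision)    # everyone who gets an offer
    missed = churners - caught             # leavers the model misses
    return flagged * PHONE_COST + replacement_cost(missed)


for name, cost in [
    ("no action", no_action_cost(1000)),
    ("phone for everyone", blanket_offer_cost(1000)),
    ("targeted offer", targeted_offer_cost(1000, recall=0.8, precision=0.5)),
]:
    print(f"{name}: {cost} EUR")
```

With these invented model figures, the targeted offer is the cheapest of the three options; a weaker model would narrow or remove that advantage.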
Example (2)
• You are an insurance officer and you should define a suitable monthly payment for an 18-year-old boy who has bought a Ferrari … what to do?
"Oh, yes! I love my Ferrari!"
Example (2)
• Analyze all previous customer data and paid compensations data
• What is the predicted accident probability based on…
– Driver's gender (male/female) and age
– Car model and age, place of living
– etc.
• If the accident probability is higher than on average, set the monthly payment accordingly!
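One very simple way to approach Example (2) is to estimate accident frequencies per customer group from historical records and compare them with the overall frequency. The sketch below uses invented records and group labels, and the rule of scaling the payment by the ratio of the two frequencies is an assumption, not something the slides prescribe.

```python
from collections import defaultdict


def accident_rates(records):
    """Estimate the accident frequency for each customer group.

    records: iterable of (group, had_accident) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [accidents, customers]
    for group, had_accident in records:
        counts[group][0] += int(had_accident)
        counts[group][1] += 1
    return {group: acc / n for group, (acc, n) in counts.items()}


def monthly_payment(base, group_rate, overall_rate):
    # Assumed pricing rule: raise the payment in proportion to the excess
    # risk, and never go below the base payment.
    return base * max(1.0, group_rate / overall_rate)


# Hypothetical history: (driver age class, car class) -> accident yes/no
history = (
    [(("under 25", "sports car"), True)] * 3
    + [(("under 25", "sports car"), False)] * 1
    + [(("over 25", "family car"), True)] * 1
    + [(("over 25", "family car"), False)] * 5
)

rates = accident_rates(history)
overall = sum(had for _, had in history) / len(history)
risky = rates[("under 25", "sports car")]
print(round(monthly_payment(50.0, risky, overall), 2))
```

A real model would use many more attributes (gender, car age, place of living) and far more data, as the slide lists.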
Example (3)
• You are in a foreign country and somebody steals or duplicates your credit card or mobile phone …
• Credit card companies …
– use historical data to build models of fraudulent behaviour and use data mining to help identify similar instances
• Phone companies …
– analyze patterns that deviate from an expected norm (destination, duration, etc.)
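"Deviating from an expected norm" can be illustrated with a basic z-score check on call durations. This is only a minimal sketch: the call history, the threshold of three standard deviations, and the function name are assumptions for illustration, and real fraud detection combines many attributes.

```python
from statistics import mean, stdev


def is_suspicious(history, value, threshold=3.0):
    """Flag a value that lies far from the customer's usual behaviour.

    history: past observations (e.g. call durations in minutes).
    threshold: allowed distance from the mean, in standard deviations
               (3.0 is an assumed, commonly used default).
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical call durations (minutes) for one subscriber
usual_calls = [3, 5, 4, 6, 5, 4, 5, 3]
print(is_suspicious(usual_calls, 95))  # very long call
print(is_suspicious(usual_calls, 6))   # ordinary call
```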
Example (4)
• Web access logs can be analyzed for …
– discovering customer preferences
– improving Web site organization
• Similarly …
– all kinds of log information analysis
– user interface/service adaptation
"Excellent surfing experience!"
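As a first step towards discovering preferences from access logs, one can simply count how often each page is requested. The sketch below assumes a made-up, simplified log format of "user path" per line; real Web server logs carry more fields and need proper parsing.

```python
from collections import Counter


def page_popularity(log_lines):
    """Count requests per page and return pages from most to least visited.

    Assumes each line looks like "<user> <path>" (a hypothetical format).
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) >= 2:       # skip malformed lines
            counts[parts[1]] += 1
    return counts.most_common()


# Hypothetical access log
access_log = [
    "u1 /home",
    "u2 /products",
    "u1 /products",
    "u3 /products",
    "u2 /home",
    "u3 /contact",
]
print(page_popularity(access_log))
```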
Knowledge Discovery Process (1)
• Learning the domain
• Creating a target data set
• Data cleaning/preprocessing
• Data reduction/projection
• Choosing the DM task
Knowledge Discovery Process (2)
• Choosing the DM algorithm(s)
• Data mining: Search
• Pattern evaluation
• Knowledge presentation
• Use of discovered knowledge
Typical KDD Process
[Figure: Operational Database → Selection → Preprocessing → Input data → Data mining → Results → Postprocessing → Utilization. Other labels: raw data; cleaned, verified, focused; evaluation of interestingness; time-based selection; selected usable patterns; phases 1, 2, 3]
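The three phases of the process (preprocessing, data mining, postprocessing) can be sketched as a chain of small functions. This is an illustrative toy, not the course's method: the shopping-basket data is invented, the mining step only counts co-occurring item pairs, and the support threshold is an assumed interestingness criterion.

```python
from collections import Counter
from itertools import combinations


def preprocess(raw_rows):
    """Phase 1: clean the raw data (trim, lowercase, drop empty entries)."""
    cleaned = []
    for row in raw_rows:
        items = {item.strip().lower() for item in row if item and item.strip()}
        if items:
            cleaned.append(items)
    return cleaned


def mine_pairs(transactions):
    """Phase 2: search for patterns, here co-occurring item pairs."""
    counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts


def postprocess(pair_counts, n_transactions, min_support):
    """Phase 3: keep only patterns that are frequent enough to be usable."""
    return {
        pair: count / n_transactions
        for pair, count in pair_counts.items()
        if count / n_transactions >= min_support
    }


# Hypothetical raw data with typical quality problems
raw = [["Beer", "chips"], ["beer ", "chips", "salsa"], ["milk", ""], [], ["chips", "salsa"]]

transactions = preprocess(raw)
patterns = postprocess(mine_pairs(transactions), len(transactions), min_support=0.5)
print(patterns)
```

In practice each phase is iterated several times, and the postprocessing step usually needs domain knowledge rather than a single threshold.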
Utilization
[Figure: layers from top to bottom, with increasing potential to support business decisions towards the top]
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP, MDA
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles (top to bottom): End User, Business Analyst, Data Analyst, DBA
The Value Chain
• Data:
– Customer data
– Store data
– Demographical data
– Geographical data
• Information:
– X lives in Z
– S is Y years old
– X and S moved
– W has money in Z
• Knowledge:
– A quantity Y of product A is used in region Z
– Customers of class Y use x% of C during period D
• Decision:
– Promote product A in region Z
– Mail ads to families of profile P
– Cross-sell service B to clients C
Data Mining Views
• General approaches:
– Descriptive data mining:
• Describe what interesting things can be found in this data!
• Explain this data to me!
– Predictive data mining:
• Based on this and previous data, tell me what will happen in the future!
• Show me the future trends!
Data Mining Views
• Views based on …
– Databases to be mined
– Knowledge to be discovered
– Techniques utilized
– Applications adapted
• Let's take a closer look at these views...
Data Mining Views
Databases to be mined:
• Relational
• Transactional
• Object-oriented
• Object-relational
• Active
• Spatial
• Time-series
• Text, XML
• Multi-media
• Heterogeneous
• Legacy
• Inductive
• WWW
• etc.
Data Mining Views
Knowledge to be mined = tasks:
• Characterization
• Discrimination
• Association
• Classification
• Clustering
• Trend
• Deviation analysis
• Outlier analysis
• etc.
Data Mining Views
Techniques utilized:
• Database-oriented
• Data warehouse (OLAP)
• Machine learning
• Statistics
• Visualization
• Neural networks
• etc.
Data Mining Views
Applications adapted:
• Retail (supermarkets etc.)
• Telecom
• Banking
• Fraud analysis
• DNA mining
• Stock market analysis
• Web mining
• Log data analysis
• etc.
Major Issues in Data Mining
• Mining methodologies and interaction:
  – Mining different kinds of knowledge
  – Interactive mining of knowledge
  – Incorporation of background knowledge
  – DM query languages and ad-hoc DM
  – Visualization of DM results
  – Handling noise and incomplete data
  – The interestingness problem
• Performance and scalability:
  – Efficiency and scalability of DM algorithms
  – Parallel, distributed and incremental mining methods
Major Issues in Data Mining
• Diversity of data types:
  – Handling complex types of data
  – Mining information from heterogeneous databases (Web etc.)
• Application and integration of discovered knowledge:
  – Domain-specific DM tools
  – Intelligent query answering and decision making
  – Integration of discovered knowledge with existing knowledge
• Protection of data …
  – Security
  – Integrity
  – Privacy
Historical Data Mining Activities
• 1989 IJCAI Workshop
• 1991-1994 KDD Workshops
• 1995-1998 KDD Conferences
• 1998 ACM SIGKDD
• 1999- SIGKDD Conferences
• And many smaller/new DM conferences …
  – PAKDD, PKDD
  – SIAM-Data Mining, (IEEE) ICDM
  – etc.
Useful References on Data Mining
“Standards”
• DM:
  – Conferences: KDD, PKDD, PAKDD, ...
  – Journals: Data Mining and Knowledge Discovery, CACM
• DM/DB:
  – Conferences: ACM-SIGMOD/PODS, VLDB, ...
  – Journals: ACM-TODS, J. ACM, IEEE-TKDE, JIIS, ...
• AI/ML:
  – Conferences: Machine Learning, AAAI, IJCAI, ...
  – Journals: Machine Learning, Artific. Intell., ...
Conclusions
• Data mining: semi-automatic discovery of interesting patterns from large data sets
• Knowledge discovery is a process:
  – Preprocessing
  – Data mining
  – Postprocessing
• Many different things are mined, used or utilized:
  – Databases (relational, object-oriented, spatial, WWW, …)
  – Knowledge (characterization, clustering, association, …)
  – Techniques (machine learning, statistics, visualization, …)
  – Applications (retail, telecom, Web mining, log analysis, …)
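The three-step view of knowledge discovery (preprocessing, data mining, postprocessing) can be sketched on a toy example. This is only an illustrative sketch, not the method used in the course: the function names `preprocess`, `mine_pairs` and `postprocess`, the sample data and the `min_support` threshold are all assumptions, and simple co-occurrence counting of item pairs stands in for a real mining algorithm.

```python
from collections import Counter
from itertools import combinations


def preprocess(raw_rows):
    """Preprocessing: normalize item names, drop blanks and empty rows."""
    transactions = []
    for row in raw_rows:
        items = {item.strip().lower() for item in row if item and item.strip()}
        if items:
            transactions.append(items)
    return transactions


def mine_pairs(transactions):
    """Data mining: count how many transactions contain each item pair."""
    counts = Counter()
    for items in transactions:
        counts.update(combinations(sorted(items), 2))
    return counts


def postprocess(pair_counts, n_transactions, min_support):
    """Postprocessing: keep only pairs whose relative support is high enough."""
    return {
        pair: count / n_transactions
        for pair, count in pair_counts.items()
        if count / n_transactions >= min_support
    }


# Hypothetical raw data with inconsistent casing, stray spaces and an empty row.
raw = [
    ["Beer", "chips ", ""],
    ["beer", "Chips", "milk"],
    ["milk", "bread"],
    [],
]

transactions = preprocess(raw)
patterns = postprocess(mine_pairs(transactions), len(transactions), min_support=0.5)
print(patterns)
```

Here only the pair (beer, chips) survives postprocessing, because it occurs in two of the three cleaned transactions. Keeping the three steps as separate functions mirrors the process view: each step can be replaced (for example by a real association rule miner) without touching the others.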
Conclusions
• Module/Week 1:
  – What is Data Mining?
  – Association rules
  – 24.10. normal lecture by Mika
  – 26.10. normal lecture by Mika
• Module/Week 2:
  – Episode rules, minimal occurrences
  – 31.10. normal lecture by Mika
  – 2.11. seminar-like lecture by Pirjo
• Module/Week 3:
  – Text mining
  – 7.11. normal lecture by Mika
  – 9.11. seminar-like lecture by Mika
Conclusions
• Module/Week 4:
  – Clustering, Classification, Similarity
  – 14.11. normal lecture by Pirjo
  – 16.11. seminar-like lecture by Mika
• Module/Week 5:
  – Knowledge discovery process
  – Pre- and postprocessing
  – 21.11. normal lecture by Pirjo
  – 23.11. seminar-like lecture by Pirjo
• Module/Week 6:
  – Data mining tools, Summary, Future
  – 28.11. normal lecture by Pirjo
  – 30.11. seminar-like lecture by Pirjo
Seminar Presentations
• Seminar presentations:
  – Articles are handed out on the previous week's Wednesday
  – Written presentation (around 3-5 printed pages) due by the start of the seminar:
    • Can be either an HTML page or a printable document in PostScript/PDF format
  – 30 minutes of presentation
  – 5-15 minutes of discussion
  – Active participation
Seminar Presentations/Groups 1-2
Quantitative Rules
MINERULE
Seminar 1/2: Quantitative Rules
• R. Srikant, R. Agrawal: "Mining Quantitative Association Rules in Large Relational Tables", Proc. of the ACM-SIGMOD 1996 Conference on Management of Data, Montreal, Canada, June 1996.
Seminar 2/2: MINERULE
• Rosa Meo, Giuseppe Psaila, Stefano Ceri: "A New SQL-like Operator for Mining Association Rules". VLDB 1996: 122-133
Introduction to Data Mining (DM)
Thank you for your attention and have a nice course!
Thanks to Jiawei Han from Simon Fraser University for his slides which greatly helped in preparing this lecture! Also thanks to Fosca Giannotti and Dino Pedreschi from Pisa for their slides.