Course on Data Mining: Seminar Meetings Page 1/17 Course on Data Mining (581550-4): Seminar Meetings...


Page 1/17

Course on Data Mining (581550-4): Seminar Meetings

[Schedule diagram. Topics: Ass. Rules, Episodes, Text Mining, Clustering, KDD Process, Home Exam. Seminar dates: 02.11., 09.11., 16.11., 23.11., 30.11. Legend: M = Seminar by Mika, P = Seminar by Pirjo.]

Page 2/17

Today 16.11.2001

• R. Feldman, M. Fresko, H. Hirsh, et al.: "Knowledge Management: A Text Mining Approach", Proc. of the 2nd Int'l Conf. on Practical Aspects of Knowledge Management (PAKM98), 1998.

• B. Lent, R. Agrawal, R. Srikant: "Discovering Trends in Text Databases", Proc. of the 3rd Int'l Conference on Knowledge Discovery in Databases and Data Mining, 1997.

Page 3/17

Good to Read as Background

• Both papers refer to the Agrawal and Srikant paper we had last week:

Rakesh Agrawal and Ramakrishnan Srikant: Mining Sequential Patterns. Int'l Conference on Data Engineering, 1995.

Page 4/17

Knowledge Management: A Text Mining Approach

R. Feldman, M. Fresko, H. Hirsh, et al.

Bar-Ilan University and Instinct Software, ISRAEL; Rutgers University, USA; LIA-EPFL, Switzerland

Published in PAKM'98 (Int'l Conf. on Practical Aspects of Knowledge Management)

Data Mining course Autumn 2001/University of Helsinki

Summary by Mika Klemettinen

Page 5/17

KM: A Text Mining Approach

• Basic idea (see selected phases on the next slides):
1. Get input data in SGML (or XML) format. Select only the contents of desired elements! (title, abstract, etc.)
2. Do linguistic preprocessing:
2.1 Term extraction (use linguistic software for this)
2.2 Term generation (combine adjacent terms to morpho-syntactic patterns like "noun-noun", "adj.-noun", etc. by calculating association coefficients)
2.3 Term filtering (select only the top M most frequent ones)
3. Create taxonomies (there is a tool for this)
4. Generate associations (you may constrain the creation)
5. Visualize/explore the results
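Steps 2.2 and 2.3 can be illustrated with a minimal Python sketch. It assumes the terms have already been extracted and tagged with a part of speech (step 2.1). The slides do not say which association coefficient the authors use, so pointwise mutual information is used here as a stand-in. The names `DOCS`, `PATTERNS`, `generate_terms` and `filter_terms` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical tagged input: one list of (word, part-of-speech) pairs per document.
DOCS = [
    [("data", "noun"), ("mining", "noun"), ("finds", "verb"),
     ("useful", "adj"), ("patterns", "noun")],
    [("text", "noun"), ("mining", "noun"), ("uses", "verb"),
     ("data", "noun"), ("mining", "noun")],
    [("useful", "adj"), ("patterns", "noun"), ("in", "prep"), ("text", "noun")],
]

# Morpho-syntactic patterns allowed for combined terms (an assumed set).
PATTERNS = {("noun", "noun"), ("adj", "noun")}


def generate_terms(docs, min_score=0.0):
    """Step 2.2: combine adjacent words that match a pattern and score well."""
    words = Counter(w for doc in docs for w, _ in doc)
    pairs = Counter()
    for doc in docs:
        for (w1, t1), (w2, t2) in zip(doc, doc[1:]):
            if (t1, t2) in PATTERNS:
                pairs[(w1, w2)] += 1
    n_words = sum(words.values())
    n_pairs = sum(pairs.values())
    term_counts = {}
    for (w1, w2), count in pairs.items():
        # Pointwise mutual information as one possible association coefficient.
        expected = (words[w1] / n_words) * (words[w2] / n_words)
        score = math.log((count / n_pairs) / expected)
        if score >= min_score:
            term_counts[f"{w1} {w2}"] = count
    return term_counts


def filter_terms(term_counts, top_m):
    """Step 2.3: keep only the top M most frequent terms (ties broken by name)."""
    ranked = sorted(term_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_m]]


if __name__ == "__main__":
    print(filter_terms(generate_terms(DOCS), top_m=2))
```

On the toy input, "data mining" and "useful patterns" each occur twice and "text mining" once, so a cut-off of M = 2 keeps the first two.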

Page 6/17

2.1: Term Extraction

Page 7/17

3: Taxonomy Construction

Page 8/17

4: Association Rule Generation

Page 9/17

4: Association Rule Generation
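As a generic illustration of step 4 (generating associations, optionally under a constraint), the following Python sketch assumes that each document has been reduced to the set of terms it contains and uses the standard support and confidence measures. It is not the authors' tool: the data, the function name `association_rules` and the `allowed_lhs` constraint are hypothetical, and only rules between single terms are considered.

```python
from itertools import permutations

# Hypothetical input: each document reduced to the set of terms it contains.
DOC_TERMS = [
    {"data mining", "association rules", "databases"},
    {"data mining", "association rules"},
    {"text mining", "data mining"},
    {"text mining", "taxonomy"},
]


def association_rules(doc_terms, min_support, min_confidence, allowed_lhs=None):
    """Return (lhs, rhs, support, confidence) for single-term rules lhs -> rhs."""
    n_docs = len(doc_terms)
    vocabulary = sorted(set().union(*doc_terms))
    rules = []
    for lhs, rhs in permutations(vocabulary, 2):
        if allowed_lhs is not None and lhs not in allowed_lhs:
            continue  # user-supplied constraint on which rules are generated
        n_lhs = sum(1 for terms in doc_terms if lhs in terms)
        n_both = sum(1 for terms in doc_terms if lhs in terms and rhs in terms)
        support = n_both / n_docs       # share of documents containing both terms
        confidence = n_both / n_lhs     # share of lhs documents that also contain rhs
        if support >= min_support and confidence >= min_confidence:
            rules.append((lhs, rhs, support, confidence))
    return rules


if __name__ == "__main__":
    for rule in association_rules(DOC_TERMS, min_support=0.5, min_confidence=0.6):
        print(rule)
```

Passing `allowed_lhs={"data mining"}` restricts the output to rules whose left-hand side is that term, which is one simple way to constrain the creation of associations.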

Page 10/17

5.1: Visualization/Exploration

Page 11/17

5.2: Visualization/Exploration

Page 12/17

Discovering Trends in Text Databases

Brian Lent, Rakesh Agrawal and Ramakrishnan Srikant

IBM Almaden Research Center, USA

Published in KDD'97

Data Mining course Autumn 2001/University of Helsinki

Summary by Mika Klemettinen

Page 13/17

Discovering Trends in Text Databases

• Basic ideas:
  • Identify frequent phrases using sequential pattern mining (see the slides & summaries from the Agrawal et al. paper "Mining Sequential Patterns" (MSP))
  • Generate histories of phrases
  • Find phrases that satisfy a specified trend

• Definitions:
  • Phrase: phrase p is (w1)(w2) … (wn), where each wi is a word
  • 1-phrase: (IBM) (data)(mining)
  • 2-phrase: (IBM) (data)(mining) (Anderson)(Consulting) (decision)(support)
  • Itemset, sequence, is contained, etc.: as in MSP paper
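The "is contained" relation can be sketched in Python for the simplest case, where every element of a phrase is a single word, so that containment reduces to an ordered subsequence test. This is a minimal sketch under that assumption. The function names and the sample documents are hypothetical, and gap constraints are ignored here.

```python
def is_contained(phrase, words):
    """True if the words of `phrase` occur in `words` in the same order.

    The words need not be adjacent. Testing membership on an iterator
    consumes it, so each match must come after the previous one.
    """
    remaining = iter(words)
    return all(word in remaining for word in phrase)


def support(phrase, documents):
    """Fraction of documents that contain the phrase."""
    hits = sum(1 for doc in documents if is_contained(phrase, doc))
    return hits / len(documents)


if __name__ == "__main__":
    documents = [
        ["IBM", "data", "warehouse", "mining"],
        ["mining", "data"],
        ["data", "mining"],
    ]
    print(support(("data", "mining"), documents))
```

A phrase counts as frequent when its support reaches a user-chosen minimum, as in the MSP paper.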

Page 14/17

Discovering Trends in Text Databases

• Gaps: Minimum and maximum gaps between adjacent words: identify relations of words/phrases inside sentences/paragraphs, between words/phrases in different paragraphs, between words/phrases in different sections, etc.
  • Sentence boundary: 1,000
  • Paragraph boundary: 100,000
  • Section boundary: 10,000,000
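One plausible reading of the boundary values is that word positions are numbered consecutively and jump by the given amount whenever a sentence, paragraph or section boundary is crossed, so that a gap constraint alone tells which kind of boundary lies between two words. The Python sketch below works under that assumption. The nested-list document layout and the names `number_words` and `relation` are hypothetical, and it further assumes fewer than 1,000 words per sentence and fewer than 100 sentences per paragraph and paragraphs per section.

```python
SENTENCE, PARAGRAPH, SECTION = 1_000, 100_000, 10_000_000


def number_words(document):
    """Assign a position to every word.

    `document` is a list of sections, each a list of paragraphs,
    each a list of sentences, each a list of words.
    """
    numbered = []
    position = 0
    for s_idx, section in enumerate(document):
        if s_idx > 0:
            position += SECTION
        for p_idx, paragraph in enumerate(section):
            if p_idx > 0:
                position += PARAGRAPH
            for t_idx, sentence in enumerate(paragraph):
                if t_idx > 0:
                    position += SENTENCE
                for word in sentence:
                    position += 1
                    numbered.append((word, position))
    return numbered


def relation(pos_a, pos_b):
    """Classify two word positions by the size of the gap between them."""
    gap = abs(pos_b - pos_a)
    if gap < SENTENCE:
        return "same sentence"
    if gap < PARAGRAPH:
        return "same paragraph"
    if gap < SECTION:
        return "same section"
    return "different sections"


if __name__ == "__main__":
    document = [
        [[["IBM", "data", "mining"], ["decision", "support"]],
         [["text", "databases"]]],
        [[["trends"]]],
    ]
    for word, position in number_words(document):
        print(word, position)
```

With this numbering, a maximum gap below 1,000 restricts a phrase to one sentence, while a minimum gap of 100,000 forces its words into different paragraphs.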

• Phases:
  • Partition data/documents based on their time stamps, create phrases for each partition (Lent et al. have patent data documents)
  • Select the frequent phrases and save their frequencies
  • Define shape queries using SDL (Shape Definition Language)
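The three phases can be sketched in Python as follows. This is only an illustration: the documents are made up, phrases are matched as adjacent words, and a plain Python predicate (`is_upward_trend`) stands in for an SDL shape query, whose syntax is not covered here. All names are hypothetical.

```python
from collections import defaultdict

# Hypothetical input: (year, list of words) per document.
DOCS = [
    (1995, ["data", "mining", "tools"]),
    (1995, ["database", "systems"]),
    (1996, ["data", "mining", "methods"]),
    (1996, ["text", "data", "mining"]),
    (1996, ["database", "design"]),
    (1997, ["data", "mining", "trends"]),
    (1997, ["data", "mining", "text"]),
    (1997, ["data", "mining"]),
]


def contains(phrase, words):
    """True if `phrase` (a tuple of words) occurs in `words` as adjacent words."""
    n = len(phrase)
    return any(tuple(words[i:i + n]) == phrase for i in range(len(words) - n + 1))


def phrase_history(phrase, docs):
    """Support of the phrase in each time partition, ordered by time."""
    partitions = defaultdict(list)
    for year, words in docs:
        partitions[year].append(words)
    return [
        (year, sum(contains(phrase, words) for words in group) / len(group))
        for year, group in sorted(partitions.items())
    ]


def is_upward_trend(history):
    """Stand-in for a shape query: support strictly increases over time."""
    supports = [s for _, s in history]
    return all(a < b for a, b in zip(supports, supports[1:]))


if __name__ == "__main__":
    history = phrase_history(("data", "mining"), DOCS)
    print(history, is_upward_trend(history))
```

Other shapes (for example "rises, then falls") would be expressed as different predicates over the same history list.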

Page 15/17

Discovering Trends in Text Databases

Page 16/17

Discovering Trends in Text Databases

Page 17/17

Discovering Trends in Text Databases