Course on Data Mining (581550-4)


Transcript of Course on Data Mining (581550-4)

Page 1: Course on Data Mining (581550-4)

Course on Data Mining (581550-4)

Mika Klemettinen and Pirjo Moen, University of Helsinki/Dept of CS, Autumn 2001

Course schedule:

• Intro/Ass. Rules: 24./26.10.

• Episodes: 30.10.

• Text Mining: 7.11.

• Clustering: 14.11.

• KDD Process: 21.11.

• Appl./Summary: 28.11.

• Home Exam

Page 2: Course on Data Mining (581550-4)

Today 07.11.2001

• Today's subject:

– Text Mining, focus on maximal frequent phrases or maximal frequent sequences (MaxFreq)

• Next week's program:

– Lecture: Clustering, Classification, Similarity

– Exercise: Text Mining

– Seminar: Text Mining

Page 3: Course on Data Mining (581550-4)

Text Mining

• Background

• What is Text Mining?

• MaxFreq Sequences

• MaxFreq Algorithms

• MaxFreq Experiments

Page 4: Course on Data Mining (581550-4)

Text Databases and Information Retrieval

• Text databases (document databases)

– Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, Web pages, etc.

• Information retrieval (IR)

– Information is organized into (a large number of) documents

– Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents

Page 5: Course on Data Mining (581550-4)

Basic Measures for Text Retrieval

• Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses)

$$\text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}$$

• Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved

$$\text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}$$

[Figure: Venn diagram of the Relevant and Retrieved document sets within All documents; their overlap is labelled "Relevant & Retrieved"]
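The two measures can be computed directly from sets of document identifiers. The following is a minimal sketch; the function name `precision_recall` and the example identifiers are illustrative assumptions, not part of the slides.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # documents that are both relevant and retrieved
    # Guard against empty sets to avoid division by zero.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 relevant documents, 5 retrieved, 2 in common.
p, r = precision_recall({1, 2, 3, 4}, {3, 4, 5, 6, 7})
print(p, r)
```

In this made-up example two of the five retrieved documents are relevant, and two of the four relevant documents were retrieved.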

Page 6: Course on Data Mining (581550-4)

Keyword/Similarity-Based Retrieval

• A document is represented by a string, which can be identified by a set of keywords

• Find similar documents based on a set of common keywords

• The answer should be ranked by degree of relevance, based on the nearness of the keywords, the relative frequency of the keywords, etc.

• In the following, some basic techniques related to preprocessing and retrieval are briefly mentioned

Page 7: Course on Data Mining (581550-4)

Keyword/Similarity-Based Retrieval

• Basic techniques (1): Remove irrelevant words with a stop list

– Set of words that are deemed "irrelevant", even though they may appear frequently

– E.g., a, the, of, for, with, etc.

– Stop lists may vary when the document set varies

• Basic techniques (2): Take basic forms of words with word stemming

– Several words are small syntactic variants of each other since they share a common word stem (basic form)

– E.g., drug, drugs, drugged
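A toy sketch of both techniques follows. The stop list and the suffix rules are assumptions chosen only to reproduce the slide's examples (a, the, of, for, with; drug, drugs, drugged). A real system would use a proper stemmer or morphological analyser.

```python
# Illustrative stop list; real stop lists depend on the document set.
STOP_WORDS = {"a", "an", "the", "of", "for", "with"}


def naive_stem(word):
    """Strip a couple of English suffixes; a toy stand-in for a real stemmer."""
    word = word.lower()
    if word.endswith("ed") and len(word) > 4:
        word = word[:-2]
        # Undo consonant doubling: "drugg" -> "drug".
        if len(word) > 2 and word[-1] == word[-2]:
            word = word[:-1]
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    return word


def preprocess(text):
    """Tokenize on whitespace, drop stop words, and stem what remains."""
    tokens = [t.strip(".,;:!?") for t in text.split()]
    return [naive_stem(t) for t in tokens if t and t.lower() not in STOP_WORDS]


print([naive_stem(w) for w in ["drug", "drugs", "drugged"]])
print(preprocess("The drugs of the drugged"))
```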

Page 8: Course on Data Mining (581550-4)

Keyword/Similarity-Based Retrieval

• Basic techniques (3): Count occurrences of terms into a term frequency table

– Each entry frequent_table(i, j) = # of occurrences of the word t_i in document d_j (or just "0" or "1")

• Basic techniques (4): Similarity metrics: measure the closeness of a document to a query (a set of keywords)

– Cosine distance:

$$\text{sim}(v_1, v_2) = \frac{v_1 \cdot v_2}{|v_1|\,|v_2|}$$

– Relative term occurrences

• This is all nice to know, but where is the text mining and how does it relate to this?
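As a sketch of how a term frequency vector and the cosine measure fit together, the snippet below represents each document (or query) as a `Counter` of term counts. The function names and the example texts are illustrative assumptions.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """One column of the term frequency table: term -> number of occurrences."""
    return Counter(tokens)


def cosine_similarity(v1, v2):
    """sim(v1, v2) = (v1 . v2) / (|v1| |v2|) for sparse term-count vectors."""
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm1 * norm2)


# Hypothetical document and keyword query.
doc = term_frequencies("data mining technique for text data".split())
query = term_frequencies("data mining".split())
print(round(cosine_similarity(doc, query), 3))
```

Only terms shared by both vectors contribute to the dot product, so the measure can be computed without materialising the full table.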

Page 9: Course on Data Mining (581550-4)

Text Mining (outline): Background, What is Text Mining?, MaxFreq Sequences, MaxFreq Algorithms, MaxFreq Experiments

Page 10: Course on Data Mining (581550-4)

What is Text Mining?

• Data mining in text: find something useful and surprising from a text collection

• Text mining vs. information retrieval is like data mining vs. database queries

Page 11: Course on Data Mining (581550-4)

Different Views on Text

• For example, we might have the following text:

Documents are an interesting application field for data mining techniques.

• Remember the market basket data?

– The text can then be considered as a shopping transaction, i.e., a row in the database

– The words occurring in the text can be considered as items bought

Transaction ID | Items Bought
100 | A, B, C
200 | A, C

Document ID | Words occurring
100 | an, application, ...
200 | ...

Page 12: Course on Data Mining (581550-4)

Different Views on Text

• Recall the event sequence from episode rules:

[Figure: event sequence D C A B D A B C on a time axis from 0 to 90]

• Now we can consider the text as a sequence of words!

[Figure: the words Documents, are, an, interesting, application, field, for, data, mining, techniques placed in order on a position axis from 0 to 11]

Page 13: Course on Data Mining (581550-4)

Text Preprocessing

• So, suppose that we have the following example text:

Documents are an interesting application field for data mining techniques.

• To this text, we might apply the following preprocessing operations:

1. Find the basic forms of the words (stemming)

2. Use a stop list to remove uninteresting words

3. Select, e.g., the wanted word classes (e.g., nouns)

Page 14: Course on Data Mining (581550-4)

Text Preprocessing

The text as (word, position) pairs:

(Documents, 1)(are, 2)(an, 3)(interesting, 4)(application, 5)(field, 6)(for, 7)(data, 8)(mining, 9)(techniques, 10)(., 11)

After finding the basic forms, with morphological tags:

(document_N_PL, 1)(be_V_PRES_PL, 2)(an_DET, 3)(interesting_A_POS, 4)(application_N_SG, 5)(field_N_SG, 6)(for_PP, 7)(data_N_SG, 8)(mining_N_SG, 9)(technique_N_PL, 10)(STOP, 11)

Morphological information: N = noun, PL = plural, V = verb, PRES = present tense, DET = determiner, A = adjective, POS = positive, SG = singular, PP = preposition

Page 15: Course on Data Mining (581550-4)

Text Preprocessing

Before applying the stop list:

(document_N_PL, 1)(be_V_PRES_PL, 2)(an_DET, 3)(interesting_A_POS, 4)(application_N_SG, 5)(field_N_SG, 6)(for_PP, 7)(data_N_SG, 8)(mining_N_SG, 9)(technique_N_PL, 10)(STOP, 11)

After applying the stop list:

(document_N_PL, 1)(interesting_A_POS, 4)(application_N_SG, 5)(field_N_SG, 6)(data_N_SG, 8)(mining_N_SG, 9)(technique_N_PL, 10)

Morphological information: N = noun, PL = plural, V = verb, PRES = present tense, DET = determiner, A = adjective, POS = positive, SG = singular, PP = preposition

Page 16: Course on Data Mining (581550-4)

Text Preprocessing

Before selecting the wanted word classes:

(document_N_PL, 1)(interesting_A_POS, 4)(application_N_SG, 5)(field_N_SG, 6)(data_N_SG, 8)(mining_N_SG, 9)(technique_N_PL, 10)

After selecting the nouns:

(document_N_PL, 1)(application_N_SG, 5)(field_N_SG, 6)(data_N_SG, 8)(mining_N_SG, 9)(technique_N_PL, 10)

Morphological information: N = noun, PL = plural, V = verb, PRES = present tense, DET = determiner, A = adjective, POS = positive, SG = singular, PP = preposition

Page 17: Course on Data Mining (581550-4)

Text Preprocessing

[Figure: the words document (1), application (5), field (6), data (8), mining (9), technique (10) on a position axis from 0 to 11]

• Now we have a preprocessed sequence of words

• We might also just throw away the stop words etc., and put the words in consecutive "time slots" (1, 2, 3, …)

• Preprocessing can be applied to transaction-based text data in a similar fashion
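The preprocessing steps on the tagged example can be sketched as follows. The tagged tokens are taken from the slides; the stop list contents, the helper names, and the renumbering into consecutive slots are assumptions made for illustration.

```python
# Tagged tokens as shown on the slides: (baseform_TAGS, position).
TAGGED = [
    ("document_N_PL", 1), ("be_V_PRES_PL", 2), ("an_DET", 3),
    ("interesting_A_POS", 4), ("application_N_SG", 5), ("field_N_SG", 6),
    ("for_PP", 7), ("data_N_SG", 8), ("mining_N_SG", 9),
    ("technique_N_PL", 10), ("STOP", 11),
]
STOP_LIST = {"be", "an", "for", "STOP"}  # assumed stop list for this example


def split_tag(token):
    """'document_N_PL' -> ('document', 'N'); 'STOP' -> ('STOP', None)."""
    parts = token.split("_")
    return parts[0], (parts[1] if len(parts) > 1 else None)


def preprocess(tagged, stop_list, keep_classes=None):
    """Drop stop words and, optionally, every word class not in keep_classes."""
    result = []
    for token, position in tagged:
        base, word_class = split_tag(token)
        if base in stop_list:
            continue
        if keep_classes is not None and word_class not in keep_classes:
            continue
        result.append((base, position))
    return result


after_stop_list = preprocess(TAGGED, STOP_LIST)
nouns_only = preprocess(TAGGED, STOP_LIST, keep_classes={"N"})
# Alternative: put the remaining words in consecutive "time slots" 1, 2, 3, ...
consecutive = [(word, slot) for slot, (word, _) in enumerate(nouns_only, start=1)]
print(nouns_only)
print(consecutive)
```

Keeping the original positions preserves the gaps left by removed words, while renumbering makes the remaining words adjacent; which is preferable depends on how the maximal gap is meant to be interpreted.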

Page 18: Course on Data Mining (581550-4)

Types of Text Mining

• Keyword (or term) based association analysis

• Automatic document classification

• Similarity detection

– Cluster documents by a common author

– Cluster documents containing information from a common source

• Sequence analysis: predicting a recurring event, discovering trends

• Anomaly detection: find information that violates usual patterns

Page 19: Course on Data Mining (581550-4)

Term-Based Assoc. Analysis

• Collect sets of keywords or terms that occur frequently together and then find the association relationships among them

• First preprocess the text data by parsing, stemming, removing stopwords, etc.

• Then invoke association mining algorithms

– Consider each document as a transaction

– View a set of keywords/terms in the document as a set of items in the transaction

Page 20: Course on Data Mining (581550-4)

Term-Based Assoc. Analysis

• For example, we might find frequent sets such as:

2%: application, field

5%: data, mining

• …and association rules like:

application ⇒ field (2%, 52%)

data ⇒ mining (5%, 75%)

• Frequent sets of this kind might help in expanding user queries or in describing the documents better than simple keywords

• Sometimes it would be nice to discover new descriptive phrases directly from the actual text - what then?
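The two numbers attached to each rule are its support and confidence. A minimal sketch of how they could be computed when every document is treated as a transaction of terms is given below; the four example documents are made up, so the resulting figures differ from the percentages on the slide.

```python
def support(docs, terms):
    """Fraction of documents that contain every term in `terms`."""
    terms = set(terms)
    return sum(1 for d in docs if terms <= d) / len(docs)


def confidence(docs, lhs, rhs):
    """support(lhs and rhs together) / support(lhs) for the rule lhs => rhs."""
    lhs_support = support(docs, lhs)
    if lhs_support == 0:
        return 0.0
    return support(docs, set(lhs) | set(rhs)) / lhs_support


# Hypothetical preprocessed documents, each reduced to its set of terms.
docs = [
    {"data", "mining", "technique"},
    {"data", "mining", "application", "field"},
    {"data", "base"},
    {"application", "field"},
]
print(support(docs, {"data", "mining"}))
print(confidence(docs, {"data"}, {"mining"}))
```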

Page 21: Course on Data Mining (581550-4)

Term-Based Episode Analysis

• Now, we want to find words/terms that occur frequently close to each other in the actual text

• Take the preprocessed sequential text data and then find relationships among the words/terms by invoking episode mining algorithms (WINEPI or MINEPI)

• For example, we might find frequent episodes such as:

data, mining, knowledge, discovery

• …and MINEPI style episode rules like:

data, mining [4] ⇒ knowledge, discovery [8] (2%, 81%)

Page 22: Course on Data Mining (581550-4)

Problems

• Quite often, it could be interesting to try to find very long descriptive phrases to describe the documents…

• …but discovery of long descriptive phrases might be tedious, especially if you have to create all the shorter phrases in order to get the longest ones

• One answer can be maximal frequent sequences or maximal frequent phrases (note: by the concepts "sequence" and "phrase" we mean basically the same)

Page 23: Course on Data Mining (581550-4)

Text Mining (outline): Background, What is Text Mining?, MaxFreq Sequences, MaxFreq Algorithms, MaxFreq Experiments

Page 24: Course on Data Mining (581550-4)

Frequent Word Sequences

• Assume: S is a set of documents; each document consists of a sequence of words

• A phrase is a sequence of words

• A sequence p occurs in a document d if all the words of p occur in d, in the same order as in p

• A sequence p is frequent in S if p occurs in at least σ documents of S, where σ is a given frequency threshold

• A maximal gap n can be given: the original locations of any two consecutive words of a sequence can have at most n words between them

Page 25: Course on Data Mining (581550-4)

Frequent Word Sequences

1: (The,70) (Congress,71) (subcommittee,72) (backed,73) (away,74) (from,75) (mandating,76) (specific,77) (retaliation,78) (against,79) (foreign,80) (countries,81) (for,82) (unfair,83) (foreign,84) (trade,85) (practices,86)

2: (He,105) (urged,106) (Congress,107) (to,108) (reject,109) (provisions,110) (that,111) (would,112) (mandate,113) (U.S.,114) (retaliation,115) (against,116) (foreign,117) (unfair,118) (trade,119) (practices,120)

3: (Washington,407) (charged,408) (France,409) (West,410) (Germany,411) (the,412) (U.K.,413) (Spain,414) (and,415) (the,416) (EC,417) (Commission,418) (with,419) (unfair,420) (practices,421) (on,422) (behalf,423) (of,424) (Airbus,425)

Page 26: Course on Data Mining (581550-4)

Frequent Word Sequences

Examples from the previous slides:

• The phrase (retaliation, against, foreign, unfair, trade, practices) occurs in the first two documents, in the locations (78, 79, 80, 83, 85, 86) and (115, 116, 117, 118, 119, 120).

• The phrase (unfair, practices) occurs in all the documents, namely in the locations (83, 86), (118, 120), and (420, 421).

Note that we only count one occurrence of a sequence per document!
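The occurrence test with a maximal gap can be sketched as below, using the three example documents from the slides. The function names are illustrative, and the maximal gap of 2 is an assumption consistent with the listed locations (e.g. 80 to 83 leaves two words in between). The sketch tracks every position where a partial match can end, because simply taking the first match of each word can miss a valid occurrence when a gap limit applies.

```python
def occurs(phrase, doc, max_gap):
    """True if the words of `phrase` appear in `doc` in order, with at most
    `max_gap` other words between consecutive phrase words."""
    if not phrase:
        return True
    # Positions at which the prefix matched so far can end.
    ends = {i for i, w in enumerate(doc) if w == phrase[0]}
    for word in phrase[1:]:
        ends = {
            j for j, w in enumerate(doc)
            if w == word and any(0 < j - i <= max_gap + 1 for i in ends)
        }
        if not ends:
            return False
    return bool(ends)


def document_frequency(phrase, docs, max_gap):
    """Number of documents containing the phrase (one occurrence per document)."""
    return sum(1 for d in docs if occurs(phrase, d, max_gap))


docs = [text.split() for text in (
    "The Congress subcommittee backed away from mandating specific retaliation "
    "against foreign countries for unfair foreign trade practices",
    "He urged Congress to reject provisions that would mandate U.S. retaliation "
    "against foreign unfair trade practices",
    "Washington charged France West Germany the U.K. Spain and the EC Commission "
    "with unfair practices on behalf of Airbus",
)]
long_phrase = ("retaliation", "against", "foreign", "unfair", "trade", "practices")
print(document_frequency(long_phrase, docs, max_gap=2))
print(document_frequency(("unfair", "practices"), docs, max_gap=2))
```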

Page 27: Course on Data Mining (581550-4)

Maximal Frequent Sequences

• Maximal frequent sequence:

– A sequence p is a maximal frequent (sub)sequence in S if there does not exist any other sequence p' in S such that p is a subsequence of p' and p' is frequent in S

• In short, a maximal frequent sequence is a sequence of words that

– appears frequently in the document collection

– is not included in another longer frequent sequence

Page 28: Course on Data Mining (581550-4)

Maximal Frequent Sequences

• Usually, it makes sense to concentrate on the maximal frequent sequences or maximal frequent phrases

– Subsequences or subphrases usually do not have a meaning of their own

– However, sometimes subsequences or subphrases may also be interesting, if they are much more frequent

Page 29: Course on Data Mining (581550-4)

A Maximal Seq. with Subseq.s

• Example (maximal sequence + subsequences):

dow jones industrial average

dow jones

dow industrial

dow average

jones industrial

jones average

industrial average

dow jones industrial

dow jones average

jones industrial average
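A short sketch of how the subsequences of a maximal sequence can be enumerated: every order-preserving selection of at least two, but not all, of its words. The function name is an assumption, and no gap constraint is applied here, so the enumeration also yields "dow industrial average", which is not in the list above.

```python
from itertools import combinations


def proper_subsequences(phrase, min_len=2):
    """All order-preserving word selections shorter than the phrase itself."""
    words = phrase.split()
    return [
        " ".join(combo)
        for k in range(min_len, len(words))
        for combo in combinations(words, k)
    ]


subs = proper_subsequences("dow jones industrial average")
print(len(subs))
print(subs)
```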

Page 30: Course on Data Mining (581550-4)

Examples of Meaningful Subseqs

• Interesting subsequences can be distinguished by the characteristic that they are more frequent than the maximal sequences

– A subsequence has its OWN occurrences in the text

– A subsequence might be common to MANY maximal sequences

– A TOO FREQUENT subsequence might NOT be interesting

Page 31: Course on Data Mining (581550-4)

Examples of Meaningful Subseqs

• Maximal sequences:

prime minister Lionel Jospin

prime minister Paavo Lipponen

• Subsequences:

prime minister

Lionel Jospin

Paavo Lipponen

Page 32: Course on Data Mining (581550-4)

Text Mining (outline): Background, What is Text Mining?, MaxFreq Sequences, MaxFreq Algorithms, MaxFreq Experiments

Page 33: Course on Data Mining (581550-4)

Discovery of Frequent Sequences

• Frequency of a sequence cannot be decided locally: all the instances in the collection have to be counted

• However: already a document of length 20 (words) contains over one million sequences

• Only a small fraction of sequences are frequent

– There are many sequences that have only very few occurrences
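The "over one million" figure can be checked with a short count, assuming no gap limit and counting each choice of word positions as its own sequence. A document of 20 words then has one subsequence per non-empty subset of its positions:

```latex
\sum_{k=1}^{20} \binom{20}{k} \;=\; 2^{20} - 1 \;=\; 1\,048\,575
```

With a maximal gap in force the number is smaller, but it still grows quickly with document length.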

Page 34: Course on Data Mining (581550-4)

Naïve Discovery Approach

• Basic idea: the "standard" bottom-up approach

– Collect all the pairs from the documents, count them, and select the frequent ones

– Build sequences of length p+1 from frequent sequences of length p

– Select sequences that are frequent

– Iterate

• Finally: select maximal sequences (by checking for each phrase, whether it is contained in some other phrase)

Page 35: Course on Data Mining (581550-4)

Problems in the Naïve Approach

• Problem: frequent sequences in text can be long

– In our experiments: longest phrase 22 words (Reuters-21578 newswire data, 19000 documents, frequency threshold 15, max gap 2)

– Processing all the subphrases of all lengths is not possible

– The straightforward bottom-up approach does not work

– Restricting the length would produce a large number of slightly differing subphrases of any phrase that is longer than the length limit

Page 36: Course on Data Mining (581550-4)

Combining Bottom-Up and Greedy Approaches: MaxFreq

• Initial phase: First, frequent pairs are collected

• Discovery phase: Longer sequences are constructed from shorter sequences (k-grams) as in the bottom-up approach

• Expansion step: Maximal sequences are discovered directly, starting from a k-gram that is not a subsequence of any known maximal sequence

Page 37: Course on Data Mining (581550-4)

Combining Bottom-Up and Greedy Approaches: MaxFreq

• Each maximal sequence has at least one unique subsequence that distinguishes it from the other maximal sequences. A maximal sequence is discovered, at the latest, on the level k, where k is the length of the shortest unique subsequence.

• Pruning step: Grams that cannot be used to construct any new maximal sequences are pruned away after each level, before the length of grams is increased

• Let's take a closer look at these phases and steps!

Page 38: Course on Data Mining (581550-4)

Algorithm: Initial Phase

Input: a set of documents S, a frequency threshold, and a maximal gap

Output: a gram set Grams2 containing the frequent pairs

For all the documents d ∈ S

    collect all the ordered pairs of words (A, B) within d such that A and B occur in this order (wrt maximal gap)

Grams2 = all the ordered pairs that are frequent in the set S (wrt frequency threshold)

Return Grams2

Page 39: Course on Data Mining (581550-4)

Algorithm: Initial Phase

Document 1: (A,11) (B,12) (C,13) (D,14) (E,15)

Document 2: (P,21) (B,22) (C,23) (D,24) (K,25)

Document 3: (A,31) (B,32) (C,33) (H,34) (D,35) (K,36)

Document 4: (P,41) (B,42) (C,43) (D,44) (E,45) (N,46)

Document 5: (P,51) (B,52) (C,53) (K,54) (E,55) (L,56) (M,57)

Document 6: (R,61) (H,62) (K,63) (L,64) (M,65)

Page 40: Course on Data Mining (581550-4)

Course on Data MiningCourse on Data Mining

Mika Klemettinen and Pirjo MoenMika Klemettinen and Pirjo Moen University of Helsinki/Dept of CS Autumn 2001University of Helsinki/Dept of CS Autumn 2001

40Page40/70

AB 2 BE 3 CK 3 EL 1 HM 1 PC 3

AC 2 BH 1 CL 1 EM 1 KE 1 PD 2

AD 1 BK 2 CN 1 EN 1 KL 2 PK 1

AH 1 CD 4 DE 2 HD 1 KM 2 RH 1

BC 5 CE 3 DK 2 HK 2 LM 2 RK 1

BD 4 CH 1 DN 1 HL 1 PB 3 RL 1

Algorithm: Initial Phase

• The following pairs of words could be found (with max gap = 2; the number after each pair is the number of documents in which it occurs). E.g., AB occurs in doc 1 ([11-12]) and in doc 3 ([31-32]), while AE is infrequent ([11-15] > max gap).
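The initial phase can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `frequent_pairs` is hypothetical, each document is written as a string of one-letter words, "max gap = 2" is read as "at most two other words between A and B" (which reproduces the counts in the table above), and the frequency threshold of 2 is inferred from the example rather than stated on the slide.

```python
from collections import defaultdict
from itertools import combinations

# The six example documents from the slides, one letter per word.
DOCS = ["ABCDE", "PBCDK", "ABCHDK", "PBCDEN", "PBCKELM", "RHKLM"]


def frequent_pairs(docs, min_freq, max_gap):
    """Return {(A, B): document frequency} for the frequent ordered pairs."""
    doc_ids = defaultdict(set)
    for doc_id, words in enumerate(docs):
        for (i, a), (j, b) in combinations(enumerate(words), 2):
            # At most `max_gap` other words may stand between A and B.
            if j - i - 1 <= max_gap:
                doc_ids[(a, b)].add(doc_id)
    # A pair is counted once per document, however often it occurs there.
    return {pair: len(ids) for pair, ids in doc_ids.items() if len(ids) >= min_freq}


grams2 = frequent_pairs(DOCS, min_freq=2, max_gap=2)
print(sorted("".join(pair) for pair in grams2))
```

With these settings the result contains 18 pairs, among them BC with frequency 5, while AD (frequency 1) is dropped.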

Page 41: Course on Data Mining (581550-4)


Input: a gram set Grams2 containing the frequent pairs (A, B)
Output: the set Max of maximal frequent phrases

k := 2; Max := ∅
While Grams_k is not empty
    For all grams g ∈ Grams_k
        If a gram g is not a subphrase of some m ∈ Max
            If a gram g is frequent
                max := Expand(g)
                Max := Max ∪ {max}
                If max = g
                    Remove {g} from Grams_k
            Else
                Remove {g} from Grams_k
    Prune(Grams_k)
    Join the grams of Grams_k to form Grams_k+1
    k := k + 1
Return Max

Algorithm: Discovery Phase
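The discovery loop can be tried out on the six example documents with the following sketch. It is a simplified illustration, not the original implementation: the Prune step is left out (it only reduces work), frequency is recomputed from the documents every time instead of from stored occurrences, and all function names as well as the threshold of 2 are assumptions made for this example. The greedy `expand` adds one word at a time, so on other data it could stop at a phrase that is not truly maximal.

```python
DOCS = ["ABCDE", "PBCDK", "ABCHDK", "PBCDEN", "PBCKELM", "RHKLM"]
MIN_FREQ, MAX_GAP = 2, 2
VOCAB = sorted(set("".join(DOCS)))


def occurs(doc, phrase):
    """True if phrase occurs in doc with at most MAX_GAP words between neighbours."""
    ends = {i for i, w in enumerate(doc) if w == phrase[0]}
    for word in phrase[1:]:
        ends = {j for j, w in enumerate(doc) if w == word
                and any(0 < j - i <= MAX_GAP + 1 for i in ends)}
    return bool(ends)


def is_frequent(phrase):
    return sum(occurs(doc, phrase) for doc in DOCS) >= MIN_FREQ


def is_subphrase(g, m):
    """True if g is a (not necessarily contiguous) subsequence of m."""
    it = iter(m)
    return all(word in it for word in g)


def expand(g):
    """Greedily insert one word at a time while the phrase stays frequent."""
    while True:
        longer = (g[:i] + (w,) + g[i:] for i in range(len(g) + 1) for w in VOCAB)
        g_next = next((p for p in longer if is_frequent(p)), None)
        if g_next is None:
            return g
        g = g_next


def join(grams):
    """Join k-grams that overlap in k-1 words into (k+1)-grams."""
    return {a + b[-1:] for a in grams for b in grams if a[1:] == b[:-1]}


def discover(grams):
    maximal = set()
    while grams:
        for g in sorted(grams):
            if any(is_subphrase(g, m) for m in maximal):
                continue  # already covered by a known maximal phrase
            if is_frequent(g):
                m = expand(g)
                maximal.add(m)
                if m == g:
                    grams.discard(g)
            else:
                grams.discard(g)
        grams = join(grams)  # the Prune step is omitted in this sketch
    return maximal


pairs = {(a, b) for a in VOCAB for b in VOCAB if is_frequent((a, b))}
result = sorted("".join(m) for m in discover(pairs))
print(result)
```

On the example data the sketch returns the eight maximal phrases ABCD, BCDE, BCDK, HK, KLM, PBCD, PBCE and PBCK, although the order in which they are discovered may differ from the slides.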

Page 42: Course on Data Mining (581550-4)


Input: a phrase p

Output: a maximal frequent phrase p' such that p is a subphrase of p'

Repeat
    Let l be the length of the sequence p.
    Find a sequence p' such that the length of p' is l+1, and p is a subsequence of p'.
    If p' is frequent
        p := p'
Until there exists no frequent p'

Return p

Algorithm: Expansion Step

Note! All the possibilities to expand have to be checked: tail, front, middle!

Page 43: Course on Data Mining (581550-4)


1: (A,11) (B,12) (C,13) (D,14) (E,15)
2: (P,21) (B,22) (C,23) (D,24) (K,25)
3: (A,31) (B,32) (C,33) (H,34) (D,35) (K,36)
4: (P,41) (B,42) (C,43) (D,44) (E,45) (N,46)
5: (P,51) (B,52) (C,53) (K,54) (E,55) (L,56) (M,57)
6: (R,61) (H,62) (K,63) (L,64) (M,65)

Freq: AB BD CD DE KL PB
      AC BE CE DK KM PC
      BC BK CK HK LM PD

Exp: AB => ABC => ABCD (- ABCDE, ABCDK)
     BE => BCE => BCDE

Algorithm: Expansion Step
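The expansion chain for a single phrase can be traced with a short sketch. The names `occurs`, `is_frequent` and `expansion_path` are hypothetical, words are single letters, and the threshold of 2 is an assumption taken from the example. The order in which insertion points are tried (front first, then middle, then tail) is also an assumption; a different order may lead from the same start phrase to a different maximal phrase.

```python
DOCS = ["ABCDE", "PBCDK", "ABCHDK", "PBCDEN", "PBCKELM", "RHKLM"]
MIN_FREQ, MAX_GAP = 2, 2
VOCAB = sorted(set("".join(DOCS)))


def occurs(doc, phrase):
    """True if phrase occurs in doc with at most MAX_GAP words between neighbours."""
    ends = {i for i, w in enumerate(doc) if w == phrase[0]}
    for word in phrase[1:]:
        ends = {j for j, w in enumerate(doc) if w == word
                and any(0 < j - i <= MAX_GAP + 1 for i in ends)}
    return bool(ends)


def is_frequent(phrase):
    return sum(occurs(doc, phrase) for doc in DOCS) >= MIN_FREQ


def expansion_path(phrase):
    """Return the chain of phrases visited while expanding `phrase`."""
    path = [phrase]
    while True:
        p = path[-1]
        # Every insertion point is tried: front (i = 0), middle, tail (i = len(p)).
        candidates = (p[:i] + w + p[i:] for i in range(len(p) + 1) for w in VOCAB)
        longer = next((c for c in candidates if is_frequent(c)), None)
        if longer is None:
            return path
        path.append(longer)


print(" => ".join(expansion_path("AB")))
```

Starting from AB this follows the chain AB => ABC => ABCD. Starting from BE, the front-first order used here reaches PBCE via PBE, whereas the slide expands BE through BCE to BCDE; both end points are maximal frequent phrases.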

Page 44: Course on Data Mining (581550-4)


• Maximal frequent sequences after the first expansion step:

AB => ABC => ABCD
BE => BCE => BCDE
BK => BDK => BCDK
KL => KLM
PD => PBD => PBCD
HK

Example

Page 45: Course on Data Mining (581550-4)


• 3-grams after join:

ABC ACK CDE PCD BKM
ABD BCD CDK PCE CKL
ABE BCE PBC PCK CKM
ABK BCK PBD PDE DKL
ACD BDE PBE PDK DKM
ACE BDK PBK BKL KLM

(Legend on the original slide: italics + underlined = already found maximal phrase; the highlighting is lost in this transcript.)

• New maximal frequent sequences:

PBE => PBCE
PBK => PBCK

Example

Page 46: Course on Data Mining (581550-4)


• 3-grams after the second expansion step:

ABC BCE CDE PBE PCK
ABD BCK CDK PBK
ACD BDE PBC PCD
BCD BDK PBD PCE

• 4-grams after join:

ABCD ABDK BCDK PBDE
ABCE ACDE PBCD PBDK
ABCK ACDK PBCE PCDE
ABDE BCDE PBCK PCDK

Example
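The join used in these examples combines two k-grams when the last k-1 words of the first equal the first k-1 words of the second. A minimal sketch, assuming one-letter words and a hypothetical function name `join`:

```python
def join(grams):
    """Join k-grams whose (k-1)-word suffix and prefix coincide."""
    return sorted({a + b[-1] for a in grams for b in grams if a[1:] == b[:-1]})


# Frequent pairs of the example, without the pair HK that is already maximal.
pairs = "AB AC BC BD BE BK CD CE CK DE DK KL KM LM PB PC PD".split()
grams3 = join(pairs)

# 3-grams that remain after the second expansion step, as listed above.
kept3 = "ABC ABD ACD BCD BCE BCK BDE BDK CDE CDK PBC PBD PBE PBK PCD PCE PCK".split()
grams4 = join(kept3)

print(len(grams3), len(grams4))
```

Applied to the pairs, the join yields the 30 3-grams of the first list; applied to the remaining 3-grams, it yields the 16 4-grams shown above.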

Page 47: Course on Data Mining (581550-4)


• After expansion step, every gram is a subsequence of some maximal sequence

• For any other maximal sequence m not found yet: m has to contain grams from two or more other maximal sequences, or from one sequence m' in a different order than in m'

• For each gram g: check if g can join grams of maximal sequences in a new way

=> extract sequences that are frequent and not yet included in any maximal sequence; mark the grams

• Remove grams that are not marked

Algorithm: Pruning Step

Page 48: Course on Data Mining (581550-4)


• BC: ABCD, BCDE, BCDK, PBCD
• Prefixes: A, P
• Suffixes: D, DE, DK
• Check the strings ABCDE, ABCDK, PBCDE, PBCDK: is there a subsequence that is frequent and not included in any maximal sequence?

ABCDE - ABC - ABCD (maximal)
            - ABCE (not frequent)
      - BCD - BCDE (maximal)
            - ABCD (known)
      - BCE - ABCE (known)

Pruning After the 1st Exp. Step

Page 49: Course on Data Mining (581550-4)


PBCDE - PBC - PBCD (maximal)
            - PBCE (frequent, not in maximal)
      - BCD - BCDE (maximal)
            - PBCD (known)
      - BCE - PBCE (known)

PBCDK - PBC - PBCD (maximal)
            - PBCK (frequent, not in maximal)

...

Marked: PB, BC, CE, CK
All the other grams are removed.

Pruning After the 1st Exp. Step

Page 50: Course on Data Mining (581550-4)


Data structures:

• Table: for each pair its exact occurrences in text

• Table: for each prefix the grams that have this prefix

• Table: for each suffix the grams that have this suffix

• Table: for each pair the indexes of maximal sequences within which it is a subsequence

• An array of maximal sequences

• Document identifiers are attached to the grams and occurrences

Algorithm: Implementation
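A possible shape for some of these tables, shown on a small hypothetical sample of 3-grams. This is a sketch only: the variable names are made up, "prefix" and "suffix" are assumed to mean the first and last k-1 words of a gram, and the containment index is built for grams in general rather than for pairs only.

```python
from collections import defaultdict

grams = ["ABC", "ABD", "BCD", "BCE", "PBC"]  # hypothetical sample of 3-grams
maximal = ["ABCD", "BCDE", "PBCD"]           # array of maximal sequences

by_prefix = defaultdict(list)  # first k-1 words -> grams that start with them
by_suffix = defaultdict(list)  # last k-1 words  -> grams that end with them
for g in grams:
    by_prefix[g[:-1]].append(g)
    by_suffix[g[1:]].append(g)


def is_subsequence(gram, seq):
    it = iter(seq)
    return all(word in it for word in gram)


# gram -> indexes into `maximal` of the sequences that contain it
in_maximal = {g: [i for i, m in enumerate(maximal) if is_subsequence(g, m)]
              for g in grams}

# The prefix table turns the join into a lookup: ABC joins with every gram
# whose prefix equals its suffix BC.
joined = ["ABC" + other[-1] for other in by_prefix.get("BC", [])]
print(joined, in_maximal["BCD"])
```

Keeping prefix and suffix tables avoids comparing every gram with every other gram during the join and the pruning step.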

Page 51: Course on Data Mining (581550-4)


• The occurrences of frequent pairs are stored:

AB: [11-12][31-32]
AC: [11-13][31-33]
BC: [12-13][22-23][32-33][42-43][52-53]

• The occurrences of longer sequences are computed from the occurrences of pairs

• All the occurrences computed are stored
– The computation for ABC may help to compute later the frequency for ABCD

Testing Frequency

Page 52: Course on Data Mining (581550-4)


– ABCD can only occur in places where ABC has occurred

• NOTE:

– Already calculated occurrences can be used while adding elements to the front or to the tail

– ABCD may occur in more documents than ABD, since the distance of B and D might be greater than the maximal gap

Testing Frequency
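Computing the occurrences of a longer sequence from stored occurrences can be sketched as follows. The table `OCC` and the functions `extend_tail` and `frequency` are hypothetical names; an occurrence is assumed to be stored as (document id, start position, end position), using the positions of the example documents.

```python
# Stored occurrences: (document id, start position, end position).
OCC = {
    "AB": [(1, 11, 12), (3, 31, 32)],
    "BC": [(1, 12, 13), (2, 22, 23), (3, 32, 33), (4, 42, 43), (5, 52, 53)],
    "CD": [(1, 13, 14), (2, 23, 24), (3, 33, 35), (4, 43, 44)],
}


def extend_tail(phrase, last_pair):
    """Occurrences of phrase + last_pair[1], built from stored occurrences."""
    result = [(doc, start, end2)
              for doc, start, end in OCC[phrase]
              for doc2, start2, end2 in OCC[last_pair]
              # The pair must begin exactly where the phrase ends.
              if doc == doc2 and end == start2]
    OCC[phrase + last_pair[1]] = result  # stored for later reuse
    return result


def frequency(phrase):
    """Number of distinct documents in which the phrase occurs."""
    return len({doc for doc, _, _ in OCC[phrase]})


extend_tail("AB", "BC")   # occurrences of ABC
extend_tail("ABC", "CD")  # ABCD reuses the stored occurrences of ABC
print(OCC["ABCD"], frequency("ABCD"))
```

Because the pair occurrences already respect the maximal gap, the extended occurrences do as well, and no document has to be read again. Extending at the front works the same way with the start positions.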

Page 53: Course on Data Mining (581550-4)


Background

MaxFreq Algorithms

What is Text Mining?

MaxFreq Sequences

MaxFreq Experiments

Text Mining

Page 54: Course on Data Mining (581550-4)


• Data: Reuters-21578 newswire collection (year 1987)
• Around 19000 documents (average length 135 words)
• Originally 2.5 million words, after stopword pruning (400 stopwords) 1.3 million words
– Stopwords: single letters, pronouns, prepositions, some abbreviations (e.g., pct, dlr, cts, shr), etc.
• 50,000 distinct words (stemming was not used)
• Frequency threshold 15, max gap 2 (stopwords pruned)
• Prototype implementation in Perl
• Sun Enterprise 450, with 1 GB of main memory

Experiments

Page 55: Course on Data Mining (581550-4)


• Numbers of maximal frequent sequences of different lengths (frequency threshold f = 15):

Len 2 3 4 5 6 7 8 9 10 11 12

f:15 7,664 1,320 353 146 65 17 8 4 13 12 13

Len 13 14 15 16 17 18 19 20 21 22 23

f:15 5 - 1 1 - 1 - - - 2 -

Experiments

Page 56: Course on Data Mining (581550-4)


• Solid, established phrases:
bundesbank president karl otto poehl
european monetary system ems

• Verb phrases:
bank england provided money market assistance
board declared stock split payable april
boost domestic demand

• Short phrases:
expects higher
expects complete

Examples of MaxFreq Sequences

Page 57: Course on Data Mining (581550-4)


• The following phrases are extracted from one document belonging to the Reuters data set

• The phrases contain both maximal phrases and subphrases that are more frequent than the maximal ones

• The document describes a situation where the persons monitoring the nuclear power plant operation were caught asleep during their shift and the Nuclear Regulatory Commission ordered the power plant to be closed

• As you can see, the phrases do not actually reveal what happened; they just tell about the subject matter

Phrases Extracted from "Doc A"

Page 58: Course on Data Mining (581550-4)


power station 11
immediately after 26
co operations 11
effective april 63
company's operations 20
unit nuclear 12
unit power 16
early week 42
senior management 28
nuclear regulatory commission 14
- regulatory commission 34
nuclear power plant 26
- power plant 55
- nuclear power 42
- nuclear plant 42
electric co 143

Phrases Extracted from "Doc A"

Page 59: Course on Data Mining (581550-4)


• Maximal frequent sequence (frequency = 15):
federal reserve entered u.s. government securities market arrange repurchase agreements fed dealers federal funds trading fed began temporary supply reserves banking system

• One occurrence of the phrase:
The Federal Reserve entered the U.S. Government securities market to arrange 1.5 billion dlrs of customer repurchase agreements, a Fed spokesman said. Dealers said Federal funds were trading at 6-3/16 pct when the Fed began its temporary and indirect supply of reserves to the banking system.

Phrases Extracted from "Doc B"

Page 60: Course on Data Mining (581550-4)


• The frequency of the sequence is 13, and it contains the following subsequences that are more frequent:

arrange repurchase 23      banking system 66
fed federal 25             trading fed 22
fed funds 23               trading system 25
fed temporary 23           reserve u.s. 43
market arrange 23          supply reserves 36
market trading 41          supply system 25
u.s. government 160        dealers federal 30
u.s. dealers 32            dealers funds 27
u.s. trading 35            dealers trading 33
u.s. supply 26             federal u.s. 28
reserves system 36         federal trading 30
securities arrange 23      funds trading 43
securities trading 32      reserve u.s. government 31
government arrange 23      reserves banking system 25

Phrases Extracted from "Doc B"

Page 61: Course on Data Mining (581550-4)


• Goal: rich computational representation for documents
– Feature sets for analysis
– Human-readable description

• Applications
– Key phrases in information retrieval
– Overview of the collection: clustering
– Summary of the content
– Automatic generation of hypertext links
– Associations between documents
– Browsing of document collection

Use of Frequent Phrases

Page 62: Course on Data Mining (581550-4)


• Example: suppose that a query "agricultur*" has been made

• The user has been given a "middle-level list" of phrases that tell something more about the context around the words in the query

Use of Frequent Phrases

QUERY: agricultur*

Page 63: Course on Data Mining (581550-4)


agricultural exports
agricultural production
agricultural products
agricultural stabilization conservation service
agricultural subsidies
agricultural trade
u.s. agriculture
agriculture department usda
agriculture department wheat
agriculture minister
agriculture officials
agriculture undersecretary daniel amstutz
common agricultural policy
ec agriculture ministers
european community agriculture

Use of Frequent Phrases
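How such a middle-level list could be produced from a set of discovered phrases is sketched below. This is an illustration only: the function `phrases_for` and the short phrase list are made up for the example, and the wildcard is matched word by word with the standard library function `fnmatchcase`.

```python
from fnmatch import fnmatchcase

PHRASES = [
    "agricultural exports",
    "agricultural subsidies",
    "u.s. agriculture",
    "common agricultural policy",
    "boost domestic demand",
]


def phrases_for(query, phrases):
    """Return the phrases in which at least one word matches the wildcard query."""
    return [p for p in phrases if any(fnmatchcase(w, query) for w in p.split())]


print(phrases_for("agricultur*", PHRASES))
```

The query keeps the four phrases that contain a word starting with "agricultur" and drops "boost domestic demand".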

Page 64: Course on Data Mining (581550-4)


• Suppose that the user is interested in subject "agricultural subsidies" and selects it from the list

• As an answer to the query, one might now return all the sentences containing the phrase "agricultural subsidies" (e.g., the ones on the next pages)

• Alternatively, the user might want to see directly the whole documents in which the phrase appears, or the other phrases that occur together with the phrase "agricultural subsidies" in the documents

Use of Frequent Phrases

Page 65: Course on Data Mining (581550-4)


• Text mining:
– The "roots" are in text databases and information retrieval
– Data mining techniques might complement or help the existing database/information retrieval techniques

• In this lecture, only a few methods based on association and episode style algorithms were given:
– Naïve approaches are applicable to some extent; maximal frequent phrases might be useful in some cases
– Many clustering, classification and similarity techniques, which will be presented in the next lectures, are useful to go a few steps further

Summary

Page 66: Course on Data Mining (581550-4)


• Helena Ahonen-Myka: Finding All Frequent Maximal Sequences in Text. In ICML-99 Workshop on Machine Learning in Text Data Analysis, p. 11-17, J. Stefan Institute, Ljubljana 1999. See electronic version at http://www.cs.helsinki.fi/u/hahonen/ham_icml99.ps

• Han, J., Kamber, M.: Data Mining: Concepts and Techniques (also available at "http://www.cs.sfu.ca/~han/DM_Book.html"), Section 9.5 of the book.

• Helena Ahonen, Oskari Heinonen, Mika Klemettinen, and Inkeri Verkamo. Applying Data Mining Techniques for Descriptive Phrase Extraction in Digital Document Collections. In Advances in Digital Libraries'98, April 1998. See electronic version at http://www-db.informatik.uni-tuebingen.de/forschung/papers/adl98.ps

References

Page 67: Course on Data Mining (581550-4)


Next Week

• Lecture 14.11.: Clustering, Classification, Similarity
– Pirjo gives the lecture

• Exercise 15.11.: Text mining
– Pirjo takes care of you! :-)

• Seminar 9.11.: Text mining
– Mika gives the lecture
– 2 group presentations (groups 5-6)

Course Organization

Page 68: Course on Data Mining (581550-4)


Seminar Presentations/Groups 5-6

Feldman et al.

Lent, Agrawal, Srikant

R. Feldman et al.: "Knowledge Management: A Text Mining Approach", PAKM 1998.

B. Lent, R. Agrawal, R. Srikant: "Discovering Trends in Text Databases", KDD 1997.

Page 69: Course on Data Mining (581550-4)


• Remember:
– Try to understand the "message" in the article
– Try to present the basic ideas as clearly as possible, use examples
– Do not present detailed mathematics or algorithms
– Test: do you understand your own presentation?
– In the presentation, use PowerPoint or conventional slides

Seminar Presentations

• Requirements:
– Articles are given on the previous week's Wed
– Presentation in an HTML page (around 3-5 printed pages), due by the start of the seminar:
• Can be either an HTML page or a printable document in PostScript/PDF format
– 30 minutes of presentation
– 5-15 minutes of discussion
– Active participation

Page 70: Course on Data Mining (581550-4)


Thank you for your attention!

Thanks to Helena Ahonen-Myka and Jiawei Han for their slides which greatly helped in preparing this lecture!

Text Mining