Course on Data Mining (581550-4)

Mika Klemettinen and Pirjo Moen
University of Helsinki/Dept of CS, Autumn 2001

[Figure: course schedule. Topics: Intro/Ass. Rules, Episodes, Text Mining, Clustering, KDD Process, Appl./Summary, Home Exam. Dates shown: 24./26.10., 30.10., 7.11., 14.11., 21.11., 28.11.]



Today 07.11.2001

• Today's subject:
– Text Mining, focus on maximal frequent phrases or maximal frequent sequences (MaxFreq)

• Next week's program:
– Lecture: Clustering, Classification, Similarity
– Exercise: Text Mining
– Seminar: Text Mining



Text Mining

• Background
• What is Text Mining?
• MaxFreq Sequences
• MaxFreq Algorithms
• MaxFreq Experiments



Text Databases and Information Retrieval

• Text databases (document databases)
– Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, Web pages, etc.

• Information retrieval (IR)
– Information is organized into (a large number of) documents
– Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents



Basic Measures for Text Retrieval

Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., “correct” responses)

$$\text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}$$

Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved

$$\text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}$$

[Figure: Venn diagram of all documents, showing the Relevant set, the Retrieved set, and their intersection (Relevant & Retrieved)]
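The two measures can be illustrated with a minimal Python sketch. The function name `precision_recall` and the document ids are hypothetical, chosen only for this example.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # documents that are relevant AND retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 relevant documents, 5 retrieved, 3 in common.
relevant_docs = {1, 2, 3, 4}
retrieved_docs = {2, 3, 4, 7, 9}
p, r = precision_recall(relevant_docs, retrieved_docs)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here 3 of the 5 retrieved documents are relevant (precision 0.6), and 3 of the 4 relevant documents were retrieved (recall 0.75).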



Keyword/Similarity-Based Retrieval

• A document is represented by a string, which can be identified by a set of keywords

• Find similar documents based on a set of common keywords

• Answer should be based on the degree of relevance based on the nearness of the keywords, relative frequency of the keywords, etc.

• In the following, some basic techniques related to the preprocessing and retrieval are briefly mentioned



• Basic techniques (1): Remove irrelevant words with a stop list
– Set of words that are deemed “irrelevant”, even though they may appear frequently
– E.g., a, the, of, for, with, etc.
– Stop lists may vary when the document set varies

• Basic techniques (2): Take basic forms of words with word stemming
– Several words are small syntactic variants of each other since they share a common word stem (basic form)
– E.g., drug, drugs, drugged
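Both techniques can be sketched in a few lines of Python. This is only an illustration: `toy_stem` is a hypothetical, deliberately naive suffix stripper (not the Porter stemmer or any real library), and the stop list contains just the example words from the slide.

```python
# Stop list taken from the examples on the slide (a real one is much longer).
STOP_WORDS = {"a", "the", "of", "for", "with"}


def toy_stem(word):
    """Naive stemmer (assumption for illustration): strip 'ed' or 's',
    then undo a doubled final consonant (drugged -> drugg -> drug)."""
    for suffix in ("ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiouls":
        word = word[:-1]
    return word


def preprocess(text):
    """Lowercase, strip punctuation, drop stop words, and stem."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    return [toy_stem(t) for t in tokens if t and t not in STOP_WORDS]


print([toy_stem(w) for w in ["drug", "drugs", "drugged"]])
print(preprocess("The drugs of a drugged field"))
```

All three variants drug, drugs, drugged map to the same stem here; a real stemmer handles far more cases than these two suffixes.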




• Basic techniques (3): Calculate occurrences of terms to a term frequency table
– Each entry frequent_table(i, j) = # of occurrences of the word t_i in document d_j (or just "0" or "1")

• Basic techniques (4): Similarity metrics: measure the closeness of a document to a query (a set of keywords)
– Cosine measure:

$$\text{sim}(v_1, v_2) = \frac{v_1 \cdot v_2}{|v_1|\,|v_2|}$$

– Relative term occurrences

• This is all nice to know, but where is the text mining and how does it relate to this?
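The term frequency table and the cosine measure can be sketched as follows. The vocabulary, the example document, and the function names are assumptions made up for this illustration.

```python
import math
from collections import Counter


def term_frequencies(tokens, vocabulary):
    """One column of the term frequency table: counts of each vocabulary term."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def cosine_similarity(v1, v2):
    """sim(v1, v2) = (v1 . v2) / (|v1| |v2|); 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


# Hypothetical vocabulary, document, and query.
vocabulary = ["data", "mining", "text", "retrieval"]
doc_vec = term_frequencies("text mining is data mining in text".split(), vocabulary)
query_vec = term_frequencies("text mining".split(), vocabulary)
similarity = cosine_similarity(doc_vec, query_vec)
print(doc_vec, query_vec, round(similarity, 4))
```

A query is treated as a very short document, so the same vector representation serves both sides of the comparison.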













What is Text Mining?

• Data mining in text: find something useful and surprising from a text collection

• Text mining vs. information retrieval is like data mining vs. database queries



Different Views on Text

• For example, we might have the following text:

Documents are an interesting application field for data mining techniques.

• Remember the market basket data?
– The text can then be considered as a shopping transaction, i.e., a row in the database
– The words occurring in the text can be considered as items bought

Transaction ID   Items Bought
100              A,B,C
200              A,C

Document ID      Words occurring
100              an,application,...
200              ...



Different Views on Text

• Recall the event sequence from episode rules:

[Figure: event sequence D C A B D A B C on a time axis from 0 to 90]

• Now we can consider the text as a sequence of words!

[Figure: the words Documents, are, an, interesting, application, field, for, data, mining, techniques placed in order on a position axis from 0 to 11]



Text Preprocessing

• So, suppose that we have the following example text:

Documents are an interesting application field for data mining techniques.

• To this text, we might do the following preprocessing operations:

1. Find the basic forms of the words (stemming)
2. Use stop list to remove uninteresting words
3. Select, e.g., the wanted word classes (e.g., nouns)



(Documents, 1)(are, 2)(an, 3)(interesting, 4)(application, 5)(field, 6)(for, 7)(data, 8)(mining, 9)(techniques, 10)(., 11)

(document_N_PL, 1)(be_V_PRES_PL, 2)(an_DET, 3)(interesting_A_POS, 4)(application_N_SG, 5)(field_N_SG, 6)(for_PP, 7)(data_N_SG, 8)(mining_N_SG, 9)(technique_N_PL, 10)(STOP, 11)

Morphological information: N = noun, PL = plural, V = verb, PRES = present tense, DET = determiner, A = adjective, POS = positive, SG = singular, PP = preposition



Before removing the uninteresting words:

(document_N_PL, 1)(be_V_PRES_PL, 2)(an_DET, 3)(interesting_A_POS, 4)(application_N_SG, 5)(field_N_SG, 6)(for_PP, 7)(data_N_SG, 8)(mining_N_SG, 9)(technique_N_PL, 10)(STOP, 11)

After:

(document_N_PL, 1)(interesting_A_POS, 4)(application_N_SG, 5)(field_N_SG, 6)(data_N_SG, 8)(mining_N_SG, 9)(technique_N_PL, 10)




Before selecting the wanted word classes (here: nouns):

(document_N_PL, 1)(interesting_A_POS, 4)(application_N_SG, 5)(field_N_SG, 6)

After:

(document_N_PL, 1)(application_N_SG, 5)(field_N_SG, 6)
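The filtering shown on these slides can be imitated with a short Python sketch that works directly on the tagged tokens. Note the assumption: instead of an explicit stop list, the sketch selects by word class, which happens to give the same result for this sentence. The helper names `word_class` and `select_classes` are hypothetical.

```python
# Tagged tokens copied from the example sentence: (basic form + tags, position).
tagged = [
    ("document_N_PL", 1), ("be_V_PRES_PL", 2), ("an_DET", 3),
    ("interesting_A_POS", 4), ("application_N_SG", 5), ("field_N_SG", 6),
    ("for_PP", 7), ("data_N_SG", 8), ("mining_N_SG", 9),
    ("technique_N_PL", 10), ("STOP", 11),
]


def word_class(token):
    """Return the word class tag (second '_'-separated field), or None."""
    parts = token.split("_")
    return parts[1] if len(parts) > 1 else None


def select_classes(tagged_tokens, wanted):
    """Keep only tokens whose word class is in `wanted`; positions are preserved."""
    return [(tok, pos) for tok, pos in tagged_tokens if word_class(tok) in wanted]


content_words = select_classes(tagged, {"N", "A"})  # nouns and adjectives
nouns = select_classes(tagged, {"N"})               # nouns only
print(content_words)
print(nouns)
```

Keeping the original positions means the gaps left by removed words are still visible to a later sequence analysis.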





[Figure: the preprocessed words document, application, field, data, mining, technique on a position axis from 0 to 11]

• Now we have a preprocessed sequence of words

• We might also just throw away the stop words etc., and put words in consecutive "time slots" (1, 2, 3, …)

• Preprocessing can be applied to transaction-based text data in a similar fashion



Types of Text Mining

• Keyword (or term) based association analysis

• Automatic document classification

• Similarity detection
– Cluster documents by a common author
– Cluster documents containing information from a common source

• Sequence analysis: predicting a recurring event, discovering trends

• Anomaly detection: find information that violates usual patterns



Term-Based Assoc. Analysis

• Collect sets of keywords or terms that occur frequently together and then find the association relationships among them

• First preprocess the text data by parsing, stemming, removing stopwords, etc.

• Then invoke association mining algorithms
– Consider each document as a transaction
– View a set of keywords/terms in the document as a set of items in the transaction



Term-Based Assoc. Analysis

• For example, we might find frequent sets such as:
2%: application, field
5%: data, mining

• …and association rules like:
application ⇒ field (2%, 52%)
data ⇒ mining (5%, 75%)

• These kinds of frequent sets etc. might help in expanding user queries or in describing the documents better than simple keywords

• Sometimes it would be nice to discover new descriptive phrases directly from the actual text - what then?
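How the two numbers attached to a rule (frequency and confidence) arise can be shown with a minimal Python sketch. The four "documents" below are made-up word sets, so the resulting percentages differ from those on the slide.

```python
def support(itemset, transactions):
    """Fraction of transactions (documents) containing every term of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Confidence of the rule lhs => rhs: support(lhs and rhs) / support(lhs)."""
    lhs_support = support(lhs, transactions)
    if lhs_support == 0:
        return 0.0
    return support(set(lhs) | set(rhs), transactions) / lhs_support


# Hypothetical collection: each document is the set of terms it contains.
docs = [
    {"data", "mining", "technique"},
    {"data", "mining", "application", "field"},
    {"data", "base"},
    {"application", "field"},
]
s = support({"data", "mining"}, docs)
c = confidence({"data"}, {"mining"}, docs)
print(f"data => mining ({s:.0%}, {c:.0%})")
```

In this toy collection, data and mining occur together in 2 of 4 documents, and 2 of the 3 documents containing data also contain mining.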



Term-Based Episode Analysis

• Now, we want to find words/terms that occur frequently close to each other in the actual text

• Take the preprocessed sequential text data and then find relationships among the words/terms by invoking episode mining algorithms (WINEPI or MINEPI)

• For example, we might find frequent episodes such as:

data, mining, knowledge, discovery

• …and MINEPI style episode rules like:

data, mining ⇒ knowledge, discovery [4] [8] (2%, 81%)



Problems

• Quite often, it could be interesting to try to find very long descriptive phrases to describe the documents…

• …but discovery of long descriptive phrases might be tedious, especially if you have to create all the shorter phrases in order to get the longest ones

• One answer can be maximal frequent sequences or maximal frequent phrases (note: by the concepts "sequence" and "phrase" we mean basically the same)












Frequent Word Sequences

• Assume: S is a set of documents; each document consists of a sequence of words

• A phrase is a sequence of words

• A sequence p occurs in a document d if all the words of p occur in d, in the same order as in p

• A sequence p is frequent in S if p occurs in at least σ documents of S, where σ is a given frequency threshold

• A maximal gap n can be given: the original locations of any two consecutive words of a sequence can have at most n words between them




1: (The,70) (Congress,71) (subcommittee,72) (backed,73) (away,74) (from,75) (mandating,76) (specific,77) (retaliation,78) (against,79) (foreign,80) (countries,81) (for,82) (unfair,83) (foreign,84) (trade,85) (practices,86)

2: (He,105) (urged,106) (Congress,107) (to,108) (reject,109) (provisions,110) (that,111) (would,112) (mandate,113) (U.S.,114) (retaliation,115) (against,116) (foreign,117) (unfair,118) (trade,119) (practices,120)

3: (Washington,407) (charged,408) (France,409) (West,410) (Germany,411) (the,412) (U.K.,413) (Spain, 414) (and,415) (the,416) (EC,417) (Commission,418) (with,419) (unfair,420) (practices,421) (on,422) (behalf,423) (of,424) (Airbus,425)




Examples from the previous slides:

• The phrase (retaliation, against, foreign, unfair, trade, practices) occurs in the first two documents, in the locations (78, 79, 80, 83, 85, 86) and (115, 116, 117, 118, 119, 120).

• The phrase (unfair, practices) occurs in all the documents, namely in the locations (83, 86), (118, 120), and (420, 421).

Note that we only count one occurrence of a sequence/doc!
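The occurrence test with a maximal gap can be checked against these examples with a small Python sketch. The function `find_occurrence` and the helper `numbered` are hypothetical names; the maximal gap of 2 is an assumption (it is the value used later in the experiments), and the gap is measured in words between two consecutive matched words.

```python
def numbered(text, start):
    """Turn a text into [(word, location), ...] with consecutive locations."""
    return [(w, start + i) for i, w in enumerate(text.split())]


def find_occurrence(phrase, doc, max_gap, start=0, prev=None):
    """Return the locations of one occurrence of `phrase` in `doc`, or None.
    At most `max_gap` words may lie between two consecutive phrase words."""
    if not phrase:
        return ()
    for i in range(start, len(doc)):
        if prev is not None and i - prev - 1 > max_gap:
            break  # too far from the previously matched word
        word, location = doc[i]
        if word == phrase[0]:
            rest = find_occurrence(phrase[1:], doc, max_gap, i + 1, i)
            if rest is not None:
                return (location,) + rest
    return None


docs = [
    numbered("The Congress subcommittee backed away from mandating specific "
             "retaliation against foreign countries for unfair foreign trade practices", 70),
    numbered("He urged Congress to reject provisions that would mandate U.S. "
             "retaliation against foreign unfair trade practices", 105),
    numbered("Washington charged France West Germany the U.K. Spain and the EC "
             "Commission with unfair practices on behalf of Airbus", 407),
]
long_phrase = ("retaliation", "against", "foreign", "unfair", "trade", "practices")
long_hits = [find_occurrence(long_phrase, d, max_gap=2) for d in docs]
short_hits = [find_occurrence(("unfair", "practices"), d, max_gap=2) for d in docs]
print(long_hits)
print(short_hits)
```

The frequency of a phrase is then the number of documents for which the result is not None, so each document contributes at most one occurrence.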




Maximal Frequent Sequences

• Maximal frequent sequence:
– A sequence p is a maximal frequent (sub)sequence in S if there does not exist any other sequence p' such that p is a subsequence of p' and p' is frequent in S

• In short, a maximal frequent sequence is a sequence of words that
– appears frequently in the document collection
– is not included in another longer frequent sequence



Maximal Frequent Sequences

• Usually, it makes sense to concentrate on the maximal frequent sequences or maximal frequent phrases
– Subsequences or subphrases usually do not have a meaning of their own
– However, sometimes subsequences or subphrases may also be interesting, if they are much more frequent



A Maximal Seq. with Subseq.s

• Example (maximal sequence + subsequences):

dow jones industrial average

dow jones
dow industrial
dow average
jones industrial
jones average
industrial average
dow jones industrial
dow jones average
jones industrial average
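A short Python sketch can enumerate these subsequences and confirm that only the four-word phrase is maximal among them. The function names are hypothetical, and the maximal gap is ignored here for simplicity.

```python
from itertools import combinations


def subsequences(phrase, min_len=2):
    """All proper subsequences of `phrase` with at least `min_len` words."""
    result = []
    for k in range(min_len, len(phrase)):
        for idx in combinations(range(len(phrase)), k):
            result.append(tuple(phrase[i] for i in idx))
    return result


def is_subsequence(short, long):
    """True if the words of `short` appear in `long` in the same order."""
    remaining = iter(long)
    return all(word in remaining for word in short)


def maximal_only(sequences):
    """Keep the sequences that are not a subsequence of any other one."""
    seqs = set(sequences)
    return [s for s in seqs
            if not any(s != t and is_subsequence(s, t) for t in seqs)]


phrase = ("dow", "jones", "industrial", "average")
subs = subsequences(phrase)
for s in subs:
    print(" ".join(s))
maximal = maximal_only(subs + [phrase])
print(maximal)
```

The loop prints the six pairs and four triples listed on the slide; the final filter is the same check used at the end of the naive approach (is the phrase contained in some other phrase?).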



Examples of Meaningful Subseqs

• Interesting subsequences can be distinguished by the characteristic that they are more frequent than the maximal sequences
– Subsequence has its OWN occurrences in the text
– Subsequence might be shared by MANY maximal sequences
– TOO FREQUENT subsequence might NOT be interesting



Examples of Meaningful Subseqs

• Maximal sequences:

prime minister Lionel Jospin
prime minister Paavo Lipponen

• Subsequences:

prime minister
Lionel Jospin
Paavo Lipponen












Discovery of Frequent Sequences

• Frequency of a sequence cannot be decided locally: all the instances in the collection have to be counted

• However: already a document of length 20 (words) contains over one million sequences

• Only a small fraction of sequences are frequent
– There are many sequences that have only very few occurrences
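The "over one million" figure follows from counting the ways to pick a non-empty subset of the 20 word positions. A quick Python check, assuming no gap restriction and counting choices of positions rather than distinct word strings:

```python
from math import comb


def count_subsequences(n, min_len=1):
    """Number of ways to choose at least `min_len` of `n` positions, order kept."""
    return sum(comb(n, k) for k in range(min_len, n + 1))


total = count_subsequences(20)
print(total)  # equals 2**20 - 1
```

With a maximal gap the count is smaller, but it still grows quickly with the document length, which is why counting every candidate is not feasible.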



Naïve Discovery Approach

• Basic idea: the "standard" bottom-up approach
– Collect all the pairs from the documents, count them, and select the frequent ones
– Build sequences of length p+1 from frequent sequences of length p
– Select sequences that are frequent
– Iterate

• Finally: select maximal sequences (by checking for each phrase, whether it is contained in some other phrase)



Problems in the Naïve Approach

• Problem: frequent sequences in text can be long
– In our experiments: longest phrase 22 words (Reuters-21578 newswire data, 19000 documents, frequency threshold 15, max gap 2)
– Processing all the subphrases of all lengths is not possible
– Straightforward bottom-up approach does not work
– Restricting the length would produce a large number of slightly differing subphrases for each phrase that is longer than the length limit



Combining Bottom-Up and Greedy Approaches: MaxFreq

Initial phase
• First, frequent pairs are collected

Discovery phase
• Longer sequences are constructed from shorter sequences (k-grams) as in the bottom-up approach

Expansion step
• Maximal sequences are discovered directly, starting from a k-gram that is not a subsequence of any known maximal sequence



Combining Bottom-Up and Greedy Approaches: MaxFreq

• Each maximal sequence has at least one unique subsequence that distinguishes it from the other maximal sequences. A maximal sequence is discovered, at the latest, on the level k, where k is the length of the shortest unique subsequence.

Pruning step
• Grams that cannot be used to construct any new maximal sequences are pruned away after each level, before the length of grams is increased

• Let's take a closer look at these phases and steps!



Algorithm: Initial Phase

Input: a set of documents S, a frequency threshold, and a maximal gap
Output: a gram set Grams_2 containing the frequent pairs

For all the documents d ∈ S
    collect all the ordered pairs of words (A,B) within d such that A and B occur in this order (wrt maximal gap)
Grams_2 = all the ordered pairs that are frequent in the set S (wrt frequency threshold)
Return Grams_2




Document 1: (A,11) (B,12) (C,13) (D,14) (E,15)

Document 2: (P,21) (B,22) (C,23) (D,24) (K,25)

Document 3: (A,31) (B,32) (C,33) (H,34) (D,35) (K,36)

Document 4: (P,41) (B,42) (C,43) (D,44) (E,45) (N,46)

Document 5: (P,51) (B,52) (C,53) (K,54) (E,55) (L,56) (M,57)

Document 6: (R,61) (H,62) (K,63) (L,64) (M,65)




• The following pairs of words could be found (with max gap=2). E.g., AB occurs in doc 1 ([11-12]) and in doc 3 ([31-32]), while AE is not counted ([11-15] > max gap).

AB 2   BE 3   CK 3   EL 1   HM 1   PC 3
AC 2   BH 1   CL 1   EM 1   KE 1   PD 2
AD 1   BK 2   CN 1   EN 1   KL 2   PK 1
AH 1   CD 4   DE 2   HD 1   KM 2   RH 1
BC 5   CE 3   DK 2   HK 2   LM 2   RK 1
BD 4   CH 1   DN 1   HL 1   PB 3   RL 1
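The initial phase can be sketched in Python and checked against the table above. The function name `frequent_pairs` is hypothetical, and the frequency threshold of 2 is an assumption; it reproduces the list of frequent pairs used in the expansion example.

```python
from collections import defaultdict

# The six example documents (word locations are implicit in the list order).
DOCS = {
    1: "A B C D E".split(),
    2: "P B C D K".split(),
    3: "A B C H D K".split(),
    4: "P B C D E N".split(),
    5: "P B C K E L M".split(),
    6: "R H K L M".split(),
}


def frequent_pairs(docs, threshold, max_gap):
    """Collect ordered pairs within max_gap and count the documents they occur in."""
    doc_ids = defaultdict(set)
    for doc_id, words in docs.items():
        for i, first in enumerate(words):
            # the second word may have at most max_gap words between it and the first
            for j in range(i + 1, min(i + max_gap + 2, len(words))):
                doc_ids[(first, words[j])].add(doc_id)
    counts = {pair: len(ids) for pair, ids in doc_ids.items()}
    grams2 = {pair for pair, n in counts.items() if n >= threshold}
    return counts, grams2


counts, grams2 = frequent_pairs(DOCS, threshold=2, max_gap=2)
print(sorted("".join(pair) for pair in grams2))
```

Counting document ids in a set means that a pair occurring several times in one document is still counted once for that document.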



Algorithm: Discovery Phase

Input: a gram set Grams_2 containing the frequent pairs (A, B)
Output: the set Max of maximal frequent phrases

k := 2; Max := ∅
While Grams_k is not empty
    For all grams g ∈ Grams_k
        If a gram g is not a subphrase of some m ∈ Max
            If a gram g is frequent
                max := Expand(g)
                Max := Max ∪ {max}
                If max = g
                    Remove {g} from Grams_k
            Else
                Remove {g} from Grams_k
    Prune(Grams_k)
    Join the grams of Grams_k to form Grams_{k+1}
    k := k + 1
Return Max



Algorithm: Expansion Step

Input: a phrase p
Output: a maximal frequent phrase p' such that p is a subphrase of p'

Repeat
    Let l be the length of the sequence p.
    Find a sequence p' such that the length of p' is l+1, and p is a subsequence of p'.
    If p' is frequent
        p := p'
Until there exists no frequent p'
Return p

Note! All the possibilities to expand have to be checked: tail, front, middle!



Algorithm: Expansion Step

1: (A,11) (B,12) (C,13) (D,14) (E,15)
2: (P,21) (B,22) (C,23) (D,24) (K,25)
3: (A,31) (B,32) (C,33) (H,34) (D,35) (K,36)
4: (P,41) (B,42) (C,43) (D,44) (E,45) (N,46)
5: (P,51) (B,52) (C,53) (K,54) (E,55) (L,56) (M,57)
6: (R,61) (H,62) (K,63) (L,64) (M,65)

Freq: AB BD CD DE KL PB
      AC BE CE DK KM PC
      BC BK CK HK LM PD

Exp: AB => ABC => ABCD (ABCDE and ABCDK are not frequent)
     BE => BCE => BCDE
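A greedy expansion along these lines can be sketched in Python on the six example documents. This is a simplified illustration, not the authors' implementation: candidates are tried at every position (front, middle, tail) in alphabetical word order, and the frequency threshold of 2 and maximal gap of 2 are assumptions matching the example.

```python
DOCS = [
    "A B C D E".split(), "P B C D K".split(), "A B C H D K".split(),
    "P B C D E N".split(), "P B C K E L M".split(), "R H K L M".split(),
]
MAX_GAP = 2
THRESHOLD = 2


def occurs(phrase, doc, max_gap=MAX_GAP):
    """True if `phrase` occurs in `doc` with at most max_gap words between neighbours."""
    ends = [i for i, w in enumerate(doc) if w == phrase[0]]
    for word in phrase[1:]:
        # positions of `word` reachable from some end position of the partial match
        ends = [j for j, w in enumerate(doc)
                if w == word and any(0 < j - i <= max_gap + 1 for i in ends)]
    return bool(ends)


def is_frequent(phrase, docs=DOCS, threshold=THRESHOLD):
    return sum(occurs(phrase, d) for d in docs) >= threshold


def expand(phrase, docs=DOCS):
    """Greedily insert one word at a time while the phrase stays frequent."""
    phrase = tuple(phrase)
    vocabulary = sorted({w for d in docs for w in d})
    expanded = True
    while expanded:
        expanded = False
        for pos in range(len(phrase) + 1):          # front, middle, tail
            for word in vocabulary:
                candidate = phrase[:pos] + (word,) + phrase[pos:]
                if is_frequent(candidate, docs):
                    phrase, expanded = candidate, True
                    break
            if expanded:
                break
    return phrase


print(expand(("A", "B")))
print(expand(("H", "K")))
```

The pair AB grows to ABCD and HK cannot be expanded at all, as in the example. For pairs such as BE, which maximal sequence is reached depends on the order in which candidates are tried.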



Example

• Maximal frequent sequences after the first expansion step:

AB => ABC => ABCD
BE => BCE => BCDE
BK => BDK => BCDK
KL => KLM
PD => PBD => PBCD
HK



Example

• 3-grams after join:

ABC ACK CDE PCD BKM
ABD BCD CDK PCE CKL
ABE BCE PBC PCK CKM
ABK BCK PBD PDE DKL
ACD BDE PBE PDK DKM
ACE BDK PBK BKL KLM

Legend on the original slide: italics + underlined = already found maximal phrase

• New maximal frequent sequences:

PBE => PBCE
PBK => PBCK



Example

• 3-grams after the second expansion step:

ABC BCE CDE PBE PCK
ABD BCK CDK PBK
ACD BDE PBC PCD
BCD BDK PBD PCE

• 4-grams after join:

ABCD ABDK BCDK PBDE
ABCE ACDE PBCD PBDK
ABCK ACDK PBCE PCDE
ABDE BCDE PBCK PCDK



Algorithm: Pruning Step

• After expansion step, every gram is a subsequence of some maximal sequence

• For any other maximal sequence m not found yet: m has to contain grams from two or more other maximal sequences, or from one sequence m' in a different order than in m'

• For each gram g: check if g can join grams of maximal sequences in a new way
=> extract sequences that are frequent and not yet included in any maximal sequence; mark the grams

• Remove grams that are not marked



Pruning After the 1st Exp. Step

• BC: ABCD, BCDE, BCDK, PBCD
• Prefixes: A, P
• Suffixes: D, DE, DK
• Check the strings ABCDE, ABCDK, PBCDE, PBCDK: is there a subsequence that is frequent and not included in any maximal sequence?

ABCDE - ABC - ABCD (maximal)
            - ABCE (not frequent)
      - BCD - BCDE (maximal)
            - ABCD (known)
      - BCE - ABCE (known)



Pruning After the 1st Exp. Step

PBCDE - PBC - PBCD (maximal)
            - PBCE (frequent, not in maximal)
      - BCD - BCDE (maximal)
            - PBCD (known)
      - BCE - PBCE (known)

PBCDK - PBC - PBCD (maximal)
            - PBCK (frequent, not in maximal)

...

Marked: PB, BC, CE, CK
All the other grams are removed.



Algorithm: Implementation

Data structures:

• Table: for each pair its exact occurrences in text

• Table: for each prefix the grams that have this prefix

• Table: for each suffix the grams that have this suffix

• Table: for each pair the indexes of maximal sequences within which it is a subsequence

• An array of maximal sequences

• Document identifiers are attached to the grams and occurrences



Testing Frequency

• The occurrences of frequent pairs are stored:

AB: [11-12][31-32]
AC: [11-13][31-33]
BC: [12-13][22-23][32-33][42-43][52-53]

• The occurrences of longer sequences are computed from the occurrences of pairs

• All the occurrences computed are stored
– The computation for ABC may help to compute later the frequency for ABCD



Testing Frequency

– ABCD can only occur in places where ABC has occurred

• NOTE:
– Already calculated occurrences can be used while adding elements to the front or to the tail
– ABCD may occur in more documents than ABD: with C in between, B and D can be further apart than the maximal gap allows for the pair BD alone
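Computing the occurrences of a longer sequence from stored pair occurrences can be sketched as a simple join. The function `extend_tail` is a hypothetical helper, and reading the document id off the location (location // 10) is an assumption that only holds for the toy numbering of the example documents.

```python
def extend_tail(prefix_occurrences, pair_occurrences):
    """Join occurrences of w1..wk with occurrences of the pair (wk, w)
    to obtain occurrences of w1..wk w. Occurrences are tuples of locations."""
    ends_by_start = {}
    for start, end in pair_occurrences:
        ends_by_start.setdefault(start, []).append(end)
    return [occ + (end,)
            for occ in prefix_occurrences
            for end in ends_by_start.get(occ[-1], [])]


# Stored pair occurrences from the example documents (max gap 2).
AB = [(11, 12), (31, 32)]
BC = [(12, 13), (22, 23), (32, 33), (42, 43), (52, 53)]
CD = [(13, 14), (23, 24), (33, 35), (43, 44)]

ABC = extend_tail(AB, BC)    # computed once and stored ...
ABCD = extend_tail(ABC, CD)  # ... then reused for the longer sequence

# Toy assumption: locations 11-15 belong to document 1, 21-25 to document 2, etc.
frequency = len({occ[0] // 10 for occ in ABCD})
print(ABC, ABCD, frequency)
```

Because every stored pair occurrence already respects the maximal gap, the joined occurrences respect it between each pair of neighbouring words as well.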












Experiments

• Data: Reuters-21578 newswire collection (year 1987)
• Around 19000 documents (average length 135 words)
• Originally 2.5 million words, after stopword pruning (400 stopwords) 1.3 million words
– Stopwords: single letters, pronouns, prepositions, some abbreviations (e.g., pct, dlr, cts, shr), etc.
• 50,000 distinct words (stemming was not used)
• Frequency threshold 15, max gap 2 (stopwords pruned)
• Prototype implementation in Perl
• Sun Enterprise 450, with 1 GB of main memory



Experiments

• Numbers of maximal frequent sequences of different lengths (frequency threshold f = 15):

Len    2      3      4    5    6   7   8  9  10  11  12
f:15   7,664  1,320  353  146  65  17  8  4  13  12  13

Len    13  14  15  16  17  18  19  20  21  22  23
f:15   5   -   1   1   -   1   -   -   -   2   -



Examples of MaxFreq Sequences

• Solid, established phrases:
bundesbank president karl otto poehl
european monetary system ems

• Verb phrases:
bank england provided money market assistance
board declared stock split payable april
boost domestic demand

• Short phrases:
expects higher
expects complete



Phrases Extracted from "Doc A"

• The following phrases are extracted from one document belonging to the Reuters data set

• The phrases contain both maximal phrases and subphrases that are more frequent than the maximal ones

• The document describes a situation where the persons monitoring the operation of a nuclear power plant were caught asleep during their shift, and the Nuclear Regulatory Commission ordered the power plant to be closed

• As you can see, the phrases do not actually reveal what happened; they just tell about the subject matter



Phrases Extracted from "Doc A"

Phrase (frequency); subphrases are marked with "-":

power station 11
immediately after 26
co operations 11
effective april 63
company's operations 20
unit nuclear 12
unit power 16
early week 42
senior management 28
nuclear regulatory commission 14
- regulatory commission 34
nuclear power plant 26
- power plant 55
- nuclear power 42
- nuclear plant 42
electric co 143



Phrases Extracted from "Doc B"

• Maximal frequent sequence (frequency = 15):

federal reserve entered u.s. government securities market arrange repurchase agreements fed dealers federal funds trading fed began temporary supply reserves banking system

• One occurrence of the phrase:

The Federal Reserve entered the U.S. Government securities market to arrange 1.5 billion dlrs of customer repurchase agreements, a Fed spokesman said. Dealers said Federal funds were trading at 6-3/16 pct when the Fed began its temporary and indirect supply of reserves to the banking system.



Phrases Extracted from "Doc B"

• The frequency of the sequence is 13, and it contains the following subsequences that are more frequent:

arrange repurchase 23
banking system 66
fed federal 25
trading fed 22
fed funds 23
trading system 25
fed temporary 23
reserve u.s. 43
market arrange 23
supply reserves 36
market trading 41
supply system 25
u.s. government 160
dealers federal 30
u.s. dealers 32
dealers funds 27
u.s. trading 35
dealers trading 33
u.s. supply 26
federal u.s. 28
reserves system 36
federal trading 30
securities arrange 23
funds trading 43
securities trading 32
reserve u.s. government 31
government arrange 23
reserves banking system 25



Use of Frequent Phrases

• Goal: rich computational representation for documents
– Feature sets for analysis
– Human-readable description

• Applications
– Key phrases in information retrieval
– Overview to the collection: clustering
– Summary of the content
– Automatic generation of hypertext links
– Associations between documents
– Browsing of document collection



• Example: suppose that a query "agricultur*" has been made

• The user has been given a "middle-level list" of phrases that tell something more about the context around the words in the query

QUERY: agricultur*



agricultural exports
agricultural production
agricultural products
agricultural stabilization conservation service
agricultural subsidies
agricultural trade
u.s. agriculture
agriculture department usda
agriculture department wheat
agriculture minister
agriculture officials
agriculture undersecretary daniel amstutz
common agricultural policy
ec agriculture ministers
european community agriculture





• Suppose that the user is interested in subject "agricultural subsidies" and selects it from the list

• As an answer to the query, one might now return all the sentences containing the phrase "agricultural subsidies" (e.g., the ones on the next pages)

• Alternatively, the user might want to see directly the whole documents in which the phrase appears, or the other phrases that occur together with the phrase "agricultural subsidies" in the documents




Summary

• Text mining:
– The "roots" are in text databases and information retrieval
– Data mining techniques might complement or help the existing database/information retrieval techniques

• In this lecture, only a few methods based on association and episode style algorithms were given:
– Naïve approaches are applicable to some extent; maximal frequent phrases might be useful in some cases
– Many clustering, classification and similarity techniques, which will be presented in the next lectures, are useful for going a few steps further



References

• Helena Ahonen-Myka: Finding All Frequent Maximal Sequences in Text. In ICML-99 Workshop on Machine Learning in Text Data Analysis, p. 11-17, J. Stefan Institute, Ljubljana 1999. See electronic version at http://www.cs.helsinki.fi/u/hahonen/ham_icml99.ps

• Han, J., Kamber, M.: Data Mining: Concepts and Techniques (also available at "http://www.cs.sfu.ca/~han/DM_Book.html"), Section 9.5 of the book.

• Helena Ahonen, Oskari Heinonen, Mika Klemettinen, and Inkeri Verkamo. Applying Data Mining Techniques for Descriptive Phrase Extraction in Digital Document Collections. In Advances in Digital Libraries'98, April 1998. See electronic version at http://www-db.informatik.uni-tuebingen.de/forschung/papers/adl98.ps



Course Organization: Next Week

• Lecture 14.11.: Clustering, Classification, Similarity
– Pirjo gives the lecture

• Exercise 15.11.: Text mining
– Pirjo takes care of you! :-)

• Seminar 9.11.: Text mining
– Mika gives the lecture
– 2 group presentations (groups 5-6)



Seminar Presentations/Groups 5-6

R. Feldman et al.: "Knowledge Management: A Text Mining Approach", PAKM 1998.

B. Lent, R. Agrawal, R. Srikant: "Discovering Trends in Text Databases", KDD 1997.



Seminar Presentations

• Remember:
– Try to understand the "message" in the article
– Try to present the basic ideas as clearly as possible, use examples
– Do not present detailed mathematics or algorithms
– Test: do you understand your own presentation?
– In the presentation, use PowerPoint or conventional slides

• Requirements:
– Articles are given on the previous week's Wednesday
– Written presentation (around 3-5 printed pages) due by the start of the seminar:
  • Can be either an HTML page or a printable document in PostScript/PDF format
– 30 minutes of presentation
– 5-15 minutes of discussion
– Active participation



Thank you for your attention!

Thanks to Helena Ahonen-Myka and Jiawei Han for their slides which greatly helped in preparing this lecture!

