Text mining & Web mining


Transcript of Text mining & Web mining

Page 1: Text mining & Web mining

Lecture 12: Text Mining and Web Mining

Zhou Shuigeng

June 10, 2007

Page 2: Text mining & Web mining


Part I: Information Retrieval and Text Mining

Zhou Shuigeng

May 28, 2006

Page 3: Text mining & Web mining

Text Databases and IR

Text databases (document databases)

Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, Web pages, library databases, etc.

The stored data is usually semi-structured

Information retrieval (IR): a field developed in parallel with database systems

Information is organized into (a large number of) documents

Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents

Page 4: Text mining & Web mining

Information Retrieval

Typical IR systems:
Online library catalogs
Online document management systems

Information retrieval vs. database systems:
Some DB problems are not present in IR, e.g., update, transaction management, complex objects
Some IR problems are not addressed well in DBMS, e.g., unstructured documents, approximate search using keywords and relevance

Page 5: Text mining & Web mining

IR Techniques (1)

Basic concepts:
A document can be described by a set of representative keywords called index terms.
Different index terms have varying relevance when used to describe document contents.
This effect is captured by assigning a numerical weight to each index term of a document (e.g., frequency, tf-idf).

DBMS analogy:
Index terms correspond to attributes
Weights correspond to attribute values

Page 6: Text mining & Web mining

IR Techniques (2)

Index term (attribute) selection:
Term extraction
Stop list
Word stemming
Index term weighting methods
Term-document frequency matrices

Information retrieval models:
Boolean model
Vector model
Probabilistic model

Page 7: Text mining & Web mining

Keyword Extraction

Goal: given N documents, each consisting of words, extract the most significant subset of words, the keywords

Example: [All the students are taking exams] --> [student, take, exam]

Keyword extraction process:
remove stop words
stem the remaining terms
collapse terms using a thesaurus
build an inverted index
extract keywords - build a keyword index
extract key phrases - build a key phrase index

Page 8: Text mining & Web mining

Stop Words and Stemming

From a given stop word list, e.g., [a, about, again, are, the, to, of, ...]: remove those words from the documents

Or, determine the stop words empirically:
given a large enough corpus of common English, sort the list of words in decreasing order of their occurrence frequency in the corpus
Zipf's law: frequency * rank ≈ constant
the most frequent words tend to be short
the most frequent 20% of words account for 60% of usage

Page 9: Text mining & Web mining

Zipf's Law -- An Illustration

Rank (R)  Term  Frequency (F)  R*F (x 10^6)
1         the   69,971         0.070
2         of    36,411         0.073
3         and   28,852         0.086
4         to    26,149         0.104
5         a     23,237         0.116
6         in    21,341         0.128
7         that  10,595         0.074
8         is    10,009         0.081
9         was    9,816         0.088
10        he     9,543         0.095

Page 10: Text mining & Web mining

Resolving Power of Words

[Figure: words arranged in decreasing frequency order. Non-significant high-frequency terms sit at one end and non-significant low-frequency terms at the other; the presumed resolving power of significant words peaks in the middle band.]

Page 11: Text mining & Web mining

Simple Indexing Scheme Based on Zipf's Law

Uses term frequency information only:

Compute the frequency of term k in document i, Freq_ik
Determine the total collection frequency: TotalFreq_k = Σ (i = 1 to n) Freq_ik
Arrange terms in order of collection frequency
Set thresholds; eliminate the high- and low-frequency terms
Use the remaining terms as index terms

A minimal sketch follows.
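A minimal sketch of this scheme in Python. The corpus, the tokenizer (lowercased whitespace split), and the two thresholds are illustrative assumptions, not values from the slides.

from collections import Counter

def select_index_terms(documents, low=2, high=0.3):
    # TotalFreq_k: collection frequency of term k, summed over all documents.
    total_freq = Counter()
    for doc in documents:
        total_freq.update(doc.lower().split())
    n_tokens = sum(total_freq.values())
    # Eliminate low-frequency terms (too rare to index) and high-frequency
    # terms (likely stop words); keep the rest as index terms.
    return {t for t, f in total_freq.items()
            if f >= low and f / n_tokens <= high}

docs = ["the cat sat on the mat", "the dog sat on the log",
        "the cat chased the dog"]
print(sorted(select_index_terms(docs)))  # 'the' is dropped as high-frequency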

Page 12: Text mining & Web mining

Stemming

The next task is stemming: transforming words to their root form

Computing, Computer, Computation --> comput

Suffix-based methods:
remove "ability" from "computability"
remove suffixes such as "...ness" and "...ive"
Suffix list + context rules

A minimal stemmer sketch follows.
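A minimal suffix-list stemmer in the spirit of this slide. The suffix list and the minimum stem length are illustrative assumptions (far cruder than a production stemmer such as Porter's).

# Longest suffixes first, so "ability" is tried before "s".
SUFFIXES = ["ability", "ation", "ness", "ive", "ing", "er", "es", "s"]

def stem(word, min_stem=3):
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip if a reasonably long stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for w in ["computability", "computing", "computer", "computation"]:
    print(w, "->", stem(w))    # all four print "-> comput"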

Page 13: Text mining & Web mining

Thesaurus Rules

A thesaurus aims at a classification of words in a language:
for a word, it gives related terms that are broader than, narrower than, the same as (synonyms), and opposed to (antonyms) the given word (other kinds of relationships may exist, e.g., composed of)

Static thesaurus tables:
[anneal, strain], [antenna, receiver], ...
Roget's thesaurus
WordNet at Princeton

Page 14: Text mining & Web mining

Thesaurus Rules Can Also Be Learned

From a search engine query log (after typing queries, users browse...):
If query1 and query2 lead to the same document, then Similar(query1, query2)
If query1 leads to a document with title keyword K, then Similar(query1, K)
Then, transitivity...

Microsoft Research China's work in WWW10 (Wen, et al.) on Encarta online

Page 15: Text mining & Web mining

The Vector-Space Model

The distinct terms in the collection are called index terms or the vocabulary
The index terms represent the important terms for an application; a vector represents the document

Example: for a computer science collection, the index terms (vocabulary of the collection) might be
T1 = architecture, T2 = bus, T3 = computer, T4 = database, T5 = xml
and a document is represented as <T1, T2, T3, T4, T5> or <W(T1), W(T2), W(T3), W(T4), W(T5)>

Page 16: Text mining & Web mining

The Vector-Space Model

Assumption: words are uncorrelated

The term-document matrix:

      T1   T2   ...  Tt
D1    d11  d12  ...  d1t
D2    d21  d22  ...  d2t
:     :    :         :
Dn    dn1  dn2  ...  dnt

Given:
1. n documents and a query
2. The query is considered a document too: Q = (q1, q2, ..., qt)
3. Each is represented by t terms
4. Term j in document i has weight dij
5. How to compute the weights comes later

Page 17: Text mining & Web mining

Graphic Representation

Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q  = 0T1 + 0T2 + 2T3

[Figure: D1, D2, and Q plotted as vectors in the three-dimensional term space T1, T2, T3.]

Is D1 or D2 more similar to Q?
How do we measure the degree of similarity? Distance? Angle? Projection?

Page 18: Text mining & Web mining

Similarity Measure - Inner Product

Similarity between document Di and query Q can be computed as the inner (dot) product of their vectors:

sim(Di, Q) = Di • Q = Σ (j = 1 to t) dij * qj

Binary: weight = 1 if the word is present, 0 otherwise
Non-binary: weight represents the degree of similarity (e.g., tf-idf, explained later)

Page 19: Text mining & Web mining

Inner Product -- Examples

Binary, over a 7-term vocabulary (retrieval, database, architecture, computer, text, management, information), so vector size = vocabulary size = 7:

D = (1, 1, 1, 0, 1, 1, 0)
Q = (1, 0, 1, 0, 0, 1, 1)
sim(D, Q) = 3

Weighted:
D1 = 2T1 + 3T2 + 5T3
Q  = 0T1 + 0T2 + 2T3
sim(D1, Q) = 2*0 + 3*0 + 5*2 = 10

A sketch follows.
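The inner product is just a sum of coordinate-wise products. A minimal sketch reproducing both examples above:

def inner_product(d, q):
    # sim(D, Q) = sum over j of d_j * q_j
    return sum(dj * qj for dj, qj in zip(d, q))

# Binary example: weight 1 if the word is present, 0 otherwise.
D = [1, 1, 1, 0, 1, 1, 0]
Q = [1, 0, 1, 0, 0, 1, 1]
print(inner_product(D, Q))                  # 3

# Weighted example: D1 = 2T1 + 3T2 + 5T3, Q = 0T1 + 0T2 + 2T3.
print(inner_product([2, 3, 5], [0, 0, 2]))  # 10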

Page 20: Text mining & Web mining

Properties of Inner Product

The inner product similarity is unbounded
It favors long documents: a long document has a large number of unique terms, each of which may occur many times
It measures how many terms matched, but not how many terms did not match

Page 21: Text mining & Web mining

Cosine Similarity Measure

Cosine similarity measures the cosine of the angle between two vectors: the inner product normalized by the vector lengths

CosSim(Di, Q) = (Σ (k = 1 to t) dik * qk) / sqrt((Σ (k = 1 to t) dik^2) * (Σ (k = 1 to t) qk^2))

[Figure: D1, D2, and Q in the term space t1, t2, t3; θ1 is the angle between D1 and Q, θ2 the angle between D2 and Q.]

Page 22: Text mining & Web mining

Cosine Similarity: an Example

D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 5 / √38 = 0.81
D2 = 3T1 + 7T2 + T3     CosSim(D2, Q) = 1 / √59 = 0.13
Q  = 0T1 + 0T2 + 2T3

D1 is 6 times better than D2 using cosine similarity but only 5 times better using the inner product. A sketch follows.
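A minimal cosine-similarity sketch that reproduces these numbers:

from math import sqrt

def cos_sim(d, q):
    # Inner product normalized by the two vector lengths.
    dot = sum(dj * qj for dj, qj in zip(d, q))
    return dot / (sqrt(sum(x * x for x in d)) * sqrt(sum(x * x for x in q)))

D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(round(cos_sim(D1, Q), 2))   # 0.81  (= 5 / sqrt(38))
print(round(cos_sim(D2, Q), 2))   # 0.13  (= 1 / sqrt(59))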

Page 23: Text mining & Web mining

Document and Term Weights

Document term weights are calculated using frequencies in documents (tf) and in the collection (idf):

tfij = frequency of term j in document i
dfj  = document frequency of term j = number of documents containing term j
idfj = inverse document frequency of term j = log2(N / dfj)   (N: number of documents in the collection)

Inverse document frequency is an indication of a term's value as a document discriminator.

Page 24: Text mining & Web mining

Term Weight Calculations

Weight of the jth term in the ith document:

dij = tfij * idfj = tfij * log2(N / dfj)

TF: Term Frequency
A term that occurs frequently in the document but rarely in the rest of the collection gets a high weight
Let max_l{tfil} be the frequency of the most frequent term in document i
Normalization: normalized term frequency = tfij / max_l{tfil}

Page 25: Text mining & Web mining

An Example of TF

Document = (A Computer Science Student Uses Computers)
Vector model based on the keywords (Computer, Engineering, Student):
tf(Computer) = 2, tf(Engineering) = 0, tf(Student) = 1; max(tf) = 2

TF weights:
Computer = 2/2 = 1
Engineering = 0/2 = 0
Student = 1/2 = 0.5

Page 26: Text mining & Web mining

Inverse Document Frequency

dfj gives the number of documents among N in which term j appears
IDF = 1/DF; typically log2(N / dfj) is used for IDF

Example: given 1000 documents, with "computer" appearing in 200 of them:
IDF = log2(1000/200) = log2(5)

Page 27: Text mining & Web mining

TF-IDF

dij = (tfij / max_l{tfil}) * idfj = (tfij / max_l{tfil}) * log2(N / dfj)

This can be used to obtain non-binary weights
Used with tremendous success in the SMART Information Retrieval System (1983) by the late Gerard Salton and M.J. McGill, Cornell University

A sketch follows.
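A minimal sketch of this normalized tf-idf weighting. The tokenizer (lowercased whitespace split) and the toy collection are assumptions:

from collections import Counter
from math import log2

def tfidf_weights(documents):
    docs = [Counter(d.lower().split()) for d in documents]
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(doc.keys())            # df_j: documents containing term j
    weights = []
    for doc in docs:
        max_tf = max(doc.values())       # frequency of the most frequent term
        weights.append({t: (tf / max_tf) * log2(n / df[t])
                        for t, tf in doc.items()})
    return weights

w = tfidf_weights(["computer science student uses computer",
                   "database systems course",
                   "computer networks course"])
print(w[0]["computer"])  # tf=2, max_tf=2, df=2, N=3 -> log2(1.5), about 0.58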

Page 28: Text mining & Web mining

Implementation Based on Inverted Files

In practice, document vectors are not stored directly; an inverted organization provides much better access speed. The index file can be implemented as a hash file, a sorted list, or a B-tree.

[Figure: each index term (system, computer, database, science) is stored with its df and a postings list of (Dj, tfj) entries, e.g., a term with df = 2 pointing to the postings D1, 3 and D7, 4.]

A minimal sketch follows.
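A minimal in-memory version of such an inverted file, using a hash table (a Python dict): each term maps to its df and a postings list of (document id, tf) pairs. The toy documents are illustrative:

from collections import Counter

def build_inverted_index(documents):
    index = {}   # term -> {"df": ..., "postings": [(doc_id, tf), ...]}
    for doc_id, text in documents.items():
        for term, tf in Counter(text.lower().split()).items():
            entry = index.setdefault(term, {"df": 0, "postings": []})
            entry["df"] += 1
            entry["postings"].append((doc_id, tf))
    return index

index = build_inverted_index({
    "D1": "database systems and database design",
    "D2": "computer science",
})
print(index["database"])   # {'df': 1, 'postings': [('D1', 2)]}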

Page 29: Text mining & Web mining

A Simple Search Engine

We now have enough tools to build a simple search engine (documents == web pages):

1. Starting from well-known web sites, crawl to obtain N web pages (for very large N)
2. Apply stop-word removal, stemming, and the thesaurus to select K keywords
3. Build an inverted index for the K keywords
4. For any incoming user query Q:
   a. For each document D, compute the cosine similarity score between Q and D
   b. Select all documents whose score is over a certain threshold T
   c. Let this result set of documents be M
   d. Return M to the user

A sketch of the query step follows.
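A minimal sketch of step 4, assuming the cos_sim function from the earlier cosine sketch and precomputed dense document vectors; the threshold T is arbitrary:

def search(query_vec, doc_vecs, T=0.5):
    # Score every document, keep those over the threshold T (the set M).
    results = {doc_id: cos_sim(query_vec, d)
               for doc_id, d in doc_vecs.items()}
    M = {doc_id: s for doc_id, s in results.items() if s > T}
    # Return M ranked best-first.
    return sorted(M.items(), key=lambda kv: kv[1], reverse=True)

docs = {"D1": [2, 3, 5], "D2": [3, 7, 1]}
print(search([0, 0, 2], docs))   # [('D1', 0.811...)]; D2 falls below T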

Page 30: Text mining & Web mining

Remaining Questions

How to crawl?
How to evaluate the result?

Given 3 search engines, which one is better?
Is there a quantitative measure?

Page 31: Text mining & Web mining

Measurement

Let M documents be returned out of a total of N documents, where N = N1 + N2:
N1 documents are relevant to the query; N2 are not

M = M1 + M2:
M1 returned documents are relevant to the query; M2 are not

Precision = M1 / M
Recall = M1 / N1

Page 32: Text mining & Web mining

Retrieval effectiveness: recall & precision

recall = number of relevant documents retrieved / total number of relevant documents

precision = number of relevant documents retrieved / total number of documents retrieved

[Figure: Venn diagram over the entire document collection, partitioning it into retrieved & relevant, relevant but not retrieved, retrieved & irrelevant, and not retrieved & irrelevant.]

Page 33: Text mining & Web mining

Precision and Recall

Precision:
evaluates the correlation of the query to the database
an indirect measure of the completeness of the indexing algorithm

Recall:
the ability of the search to find all of the relevant items in the database

Among the three numbers, only two are always available:
total number of items retrieved
number of relevant items retrieved
The total number of relevant items is usually not available

Page 34: Text mining & Web mining

Relationship between Recall and Precision

[Figure: precision (y-axis, 0 to 1) vs. recall (x-axis, 0 to 1). The ideal is the top-right corner. High precision at low recall: relevant documents are returned but many useful ones are missed. High recall at low precision: mostly relevant documents are returned but much junk is included too.]

Page 35: Text mining & Web mining

Fallout Rate

Problems with precision and recall:
A query on "Hong Kong" will return mostly relevant documents, but that doesn't tell you how good or how bad the system is!
The number of irrelevant documents in the collection is not taken into account
Recall is undefined when there is no relevant document in the collection
Precision is undefined when no document is retrieved

Fallout = number of nonrelevant items retrieved / total number of nonrelevant items in the collection

Fallout can be viewed as the inverse of recall. A good system should have high recall and low fallout.

Page 36: Text mining & Web mining

Total Number of Relevant Items

In an uncontrolled environment (e.g., the web), it is unknown.
Two possible approaches to get estimates:
Sampling across the database and performing relevance judgment on the returned items
Applying different retrieval algorithms to the same database for the same query; the aggregate of relevant items is taken as the total relevant set

Page 37: Text mining & Web mining

Computation of Recall and Precision

Suppose the total number of relevant docs = 5 (x marks a relevant document):

n   doc #  relevant  Recall  Precision
1   588    x         0.2     1.00
2   589    x         0.4     1.00
3   576              0.4     0.67
4   590    x         0.6     0.75
5   986              0.6     0.60
6   592    x         0.8     0.67
7   984              0.8     0.57
8   988              0.8     0.50
9   578              0.8     0.44
10  985              0.8     0.40
11  103              0.8     0.36
12  591              0.8     0.33
13  772    x         1.0     0.38
14  990              1.0     0.36

For example: at n=1, R = 1/5 = 0.2 and P = 1/1 = 1; at n=2, R = 2/5 = 0.4 and P = 2/2 = 1; at n=3, R = 2/5 = 0.4 and P = 2/3 = 0.67; at n=13, R = 5/5 = 1 and P = 5/13 = 0.38.

A short sketch that reproduces this table follows.
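A minimal sketch reproducing this table: given which ranks hold relevant documents, print recall and precision after each rank:

def pr_at_ranks(relevant_flags, total_relevant):
    hits = 0
    for n, rel in enumerate(relevant_flags, start=1):
        hits += rel
        print(n, round(hits / total_relevant, 2), round(hits / n, 2))

flags = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]   # the 14 ranks above
pr_at_ranks(flags, total_relevant=5)
# rank 1 -> 0.2 1.0 ... rank 13 -> 1.0 0.38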

Page 38: Text mining & Web mining

Computation of Recall and Precision

[Figure: the (recall, precision) pairs from the table above plotted as a curve; precision falls in a sawtooth from 1.00 at recall 0.2 down to 0.38 at recall 1.0, with the points labeled by rank.]

Page 39: Text mining & Web mining

Compare Two or More Systems

Compute recall and precision values for two or more systems
Superimpose the results in the same graph
The curve closest to the upper right-hand corner of the graph indicates the best performance

[Figure: precision vs. recall curves for a "Stem" run and a "Thesaurus" run superimposed on one graph.]

Page 40: Text mining & Web mining

The TREC Benchmark

TREC: Text Retrieval Conference
Originated from the TIPSTER program sponsored by the Defense Advanced Research Projects Agency (DARPA)
Became an annual conference in 1992, co-sponsored by the National Institute of Standards and Technology (NIST) and DARPA
Participants are given parts of a standard set of documents and queries in different stages for testing and training
Participants submit the P/R values on the final document and query set and present their results at the conference

http://trec.nist.gov/

Page 41: Text mining & Web mining

Interactive Search Engines

Aim to improve search results incrementally
Often applied to queries of the form "find all sites with a certain property"
Content-based multimedia search: given a photo, find all other photos similar to it
Large vector space; question: which feature (keyword) is important?

Procedure:
User submits a query
Engine returns a result
User marks some returned results as relevant or irrelevant, and continues searching
Engine returns new results
Iterate until the user is satisfied

Page 42: Text mining & Web mining

Query Reformulation

Based on the user's feedback on returned results:
Documents that are relevant: DR
Documents that are irrelevant: DN

Build a new query vector Q' from Q: <w1, w2, ..., wt> --> <w1', w2', ..., wt'>

Best known algorithm: Rocchio's algorithm
Also extensively used in multimedia search

Page 43: Text mining & Web mining

Query Modification

Use the previously identified relevant and nonrelevant document sets DR and DN to repeatedly modify the query to reach optimality. Starting with an initial query Q, the modified query is

Q' = α*Q + β*(1/|DR|) * Σ (Di ∈ DR) Di − γ*(1/|DN|) * Σ (Dj ∈ DN) Dj

where α, β, and γ are suitable constants. A sketch follows.
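A minimal sketch of this formula. The division by |DR| and |DN| is done via centroids, and the default α, β, γ follow the example on the next slide:

def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=0.5, gamma=0.25):
    def centroid(docs):
        if not docs:
            return [0.0] * len(q)
        return [sum(col) / len(docs) for col in zip(*docs)]
    cr, cn = centroid(relevant), centroid(nonrelevant)
    # Q' = alpha*Q + beta*centroid(DR) - gamma*centroid(DN)
    return [alpha * qj + beta * r - gamma * n
            for qj, r, n in zip(q, cr, cn)]

Q = [5, 0, 3, 0, 1]
print(rocchio(Q, [[2, 1, 2, 0, 0]], [[1, 0, 0, 0, 2]]))
# [5.75, 0.5, 4.0, 0.0, 0.5]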

Page 44: Text mining & Web mining

An Example

Q: original query; D1: relevant doc; D2: non-relevant doc
α = 1, β = 1/2, γ = 1/4
Assume a dot-product similarity measure: Sim(Q, Di) = Σ (j = 1 to t) qj × dij

     T1  T2  T3  T4  T5
Q  = ( 5,  0,  3,  0,  1)
D1 = ( 2,  1,  2,  0,  0)
D2 = ( 1,  0,  0,  0,  2)

Sim(Q, D1) = (5•2) + (0•1) + (3•2) + (0•0) + (1•0) = 16
Sim(Q, D2) = (5•1) + (0•0) + (3•0) + (0•0) + (1•2) = 7

Page 45: Text mining & Web mining

Example (Cont.)

Q' = Q + (1/2) * Σ (Di ∈ DR) Di − (1/4) * Σ (Di ∈ DN) Di
   = (5, 0, 3, 0, 1) + (1/2)(2, 1, 2, 0, 0) − (1/4)(1, 0, 0, 0, 2)
   = (5.75, 0.5, 4, 0, 0.5)

New similarity scores:
Sim(Q', D1) = (5.75•2) + (0.5•1) + (4•2) + (0•0) + (0.5•0) = 20
Sim(Q', D2) = (5.75•1) + (0.5•0) + (4•0) + (0•0) + (0.5•2) = 6.75

Page 46: Text mining & Web mining

Latent Semantic Indexing (1)

Basic idea:
Similar documents have similar word frequencies
Difficulty: the term-frequency matrix is very large
Use singular value decomposition (SVD) to reduce the size of the frequency table
Retain the K most significant dimensions of the frequency table

Method (a sketch follows):
Create a term-by-document weighted frequency matrix A
SVD construction: A = U * S * V'
Choose K and obtain Uk, Sk, and Vk
Create a query vector q'
Project q' into the term-document space: Dq = q' * Uk * Sk^-1
Calculate similarities: (Dq · D) / (||Dq|| * ||D||)
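A minimal LSI sketch with NumPy following these steps; the tiny term-by-document matrix and k = 2 are illustrative assumptions:

import numpy as np

A = np.array([[1, 1, 0],     # rows = terms, columns = documents
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)
U, s, Vt = np.linalg.svd(A, full_matrices=False)    # A = U * S * V'
k = 2
Uk, Sk = U[:, :k], np.diag(s[:k])                   # keep k factors
docs = Vt[:k, :].T                                  # documents in LSI space
q = np.array([1, 0, 1, 0], dtype=float)             # query over the 4 terms
Dq = q @ Uk @ np.linalg.inv(Sk)                     # Dq = q' * Uk * Sk^-1
sims = docs @ Dq / (np.linalg.norm(docs, axis=1) * np.linalg.norm(Dq))
print(sims)                                         # cosine per document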

Page 47: Text mining & Web mining

Latent Semantic Indexing (2)

[Figure: a weighted term-by-document frequency matrix, queried with the terms "insulation" and "joint".]

Page 48: Text mining & Web mining

Types of Text Data Mining

Keyword-based association analysis
Automatic document classification
Similarity detection:
cluster documents by a common author
cluster documents containing information from a common source
Link analysis: unusual correlations between entities
Sequence analysis: predicting a recurring event
Anomaly detection: finding information that violates usual patterns
Hypertext analysis:
patterns in anchors/links
anchor text correlations with linked objects

Page 49: Text mining & Web mining

Keyword-Based Association Analysis

Motivation:
Collect sets of keywords or terms that occur frequently together, and then find the association or correlation relationships among them

Association analysis process:
Preprocess the text data by parsing, stemming, removing stop words, etc.
Invoke association mining algorithms:
consider each document as a transaction
view the set of keywords in the document as the set of items in the transaction

Term-level association mining:
no need for human effort in tagging documents
the number of meaningless results and the execution time are greatly reduced

Page 50: Text mining & Web mining

Text Classification (1)

Motivation:
Automatic classification for the large number of on-line text documents (Web pages, e-mails, corporate intranets, etc.)

Classification process:
Data preprocessing
Definition of training and test sets
Creation of the classification model using the selected classification algorithm
Classification model validation
Classification of new/unknown text documents

Text document classification differs from the classification of relational data:
document databases are not structured according to attribute-value pairs

Page 51: Text mining & Web mining

Text Classification (2)

Classification algorithms:
Support Vector Machines
k-Nearest Neighbors
Naïve Bayes
Neural Networks
Decision Trees
Association rule-based
Boosting

A classification sketch follows.
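A minimal classification sketch, assuming scikit-learn is available (the slide names algorithms, not a library): naïve Bayes over bag-of-words counts on a toy training set:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["stock market prices fall", "team wins the football match",
              "market trading update", "player scores in final match"]
train_labels = ["finance", "sports", "finance", "sports"]

vec = CountVectorizer()                  # preprocessing: bag-of-words counts
X = vec.fit_transform(train_docs)
clf = MultinomialNB().fit(X, train_labels)
print(clf.predict(vec.transform(["market prices rise"])))   # ['finance']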

Page 52: Text mining & Web mining

Document Clustering

Motivation:
Automatically group related documents based on their contents
No predetermined training sets or taxonomies; generate a taxonomy at runtime

Clustering process (a sketch follows):
Data preprocessing: remove stop words, stem, feature extraction, lexical analysis, etc.
Hierarchical clustering: compute similarities, applying clustering algorithms
Model-based clustering (neural network approach): clusters are represented by "exemplars" (e.g., SOM)

Page 53: Text mining & Web mining


Part II: Web Mining

Zhou Shuigeng

May 28, 2004

Page 54: Text mining & Web mining

Mining the World-Wide Web

The WWW is a huge, widely distributed, global information service center for:
information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
hyper-link information
access and usage information

The WWW provides rich sources for data mining

Challenges:
too huge for effective data warehousing and data mining
too complex and heterogeneous: no standards and structure

Page 55: Text mining & Web mining

Mining the World-Wide Web

Growing and changing very rapidly
Broad diversity of user communities
Only a small portion of the information on the Web is truly relevant or useful:
99% of the Web information is useless to 99% of Web users
How can we find high-quality Web pages on a specified topic?

[Figure: Internet growth, Sep-69 through Sep-99; the number of hosts rises from near zero to tens of millions.]

Page 56: Text mining & Web mining

Web Search Engines

Index-based: search the Web, index Web pages, and build and store huge keyword-based indices
Help locate sets of Web pages containing certain keywords

Deficiencies:
a topic of any breadth may easily contain hundreds of thousands of documents
many documents that are highly relevant to a topic may not contain the keywords defining them (the synonymy problem)

Page 57: Text mining & Web mining

Web Mining: A More Challenging Task

Searches for:
Web access patterns
Web structures
regularity and dynamics of Web contents

Problems:
the "abundance" problem
limited coverage of the Web: hidden Web sources, majority of data in DBMS
limited query interface based on keyword-oriented search
limited customization to individual users

Page 58: Text mining & Web mining

Web Mining Taxonomy

Web Mining
  Web Content Mining
    Web Page Content Mining
    Search Result Mining
  Web Structure Mining
  Web Usage Mining
    General Access Pattern Tracking
    Customized Usage Tracking

Page 59: Text mining & Web mining

Mining the World-Wide Web: Web Page Content Mining

(Within the taxonomy: Web Content Mining -> Web Page Content Mining)

Web page summarization:
WebLog (Lakshmanan et al., 1996), WebOQL (Mendelzon et al., 1998), ...: Web structuring query languages; can identify information within given web pages
Ahoy! (Etzioni et al., 1997): uses heuristics to distinguish personal home pages from other web pages
ShopBot (Etzioni et al., 1997): looks for product prices within web pages

Page 60: Text mining & Web mining

Mining the World-Wide Web: Search Result Mining

(Within the taxonomy: Web Content Mining -> Search Result Mining)

Search engine result summarization:
Clustering search results (Leouski and Croft, 1996; Zamir and Etzioni, 1997): categorizes documents using phrases in titles and snippets

Page 61: Text mining & Web mining

Mining the World-Wide Web: Web Structure Mining

Using links:
PageRank (Brin et al., 1998)
CLEVER (Chakrabarti et al., 1998)
Use the interconnections between web pages to give weight to pages

Using generalization:
MLDB (1994), VWV (1998)
Uses a multi-level database representation of the Web; counters (popularity) and link lists are used for capturing structure

Page 62: Text mining & Web mining

Mining the World-Wide Web: General Access Pattern Tracking

(Within the taxonomy: Web Usage Mining -> General Access Pattern Tracking)

Web Log Mining (Zaïane, Xin and Han, 1998):
uses KDD techniques to understand general access patterns and trends
can shed light on better structure and grouping of resource providers

Page 63: Text mining & Web mining

Mining the World-Wide Web: Customized Usage Tracking

(Within the taxonomy: Web Usage Mining -> Customized Usage Tracking)

Adaptive Sites (Perkowitz and Etzioni, 1997):
analyzes the access patterns of one user at a time
the web site restructures itself automatically by learning from user access patterns

Page 64: Text mining & Web mining

Search Engine Topics

Text-based search engines:
document based
ranking: TF-IDF, vector space model
no relationship between pages modeled
cannot tell which page is important without a query

Link-based search engines: Google; Hubs and Authorities techniques:
can pick out important pages

Page 65: Text mining & Web mining

The PageRank Algorithm

Fundamental question to ask: what is the importance level I(P) of a page P?

Information retrieval alone (cosine + TF-IDF) does not take hyperlinks into account

Link-based view:
important pages (nodes) have many other links pointing to them
important pages also point to other important pages

Page 66: Text mining & Web mining

The Google Crawler Algorithm

"Efficient Crawling Through URL Ordering", Junghoo Cho, Hector Garcia-Molina, Lawrence Page, Stanford
http://www.www8.org
http://www-db.stanford.edu/~cho/crawler-paper/

"Modern Information Retrieval", Baeza-Yates and Ribeiro-Neto, pages 380-382

Sergey Brin, Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. The Seventh International WWW Conference (WWW7). Brisbane, Australia, April 14-18, 1998.
http://www.www7.org

Page 67: Text mining & Web mining

Back Link Metric

IB(P) = total number of backlinks of P
IB(P) is impossible to know in advance, so use IB'(P), the number of backlinks the crawler has seen so far

[Figure: a web page P with three incoming links, so IB(P) = 3.]

Page 68: Text mining & Web mining

Page Rank Metric

IR(P) = (1 - d) + d * Σ (i = 1 to N) IR(Ti) / Ci

where T1, ..., TN are the pages linking to P; 1 - d is the probability that a user randomly jumps to page P ("d" is the damping factor); and Ci is the number of out-links from each Ti.

[Figure: pages T1, T2, ..., TN each link to web page P; the example marks one Ti with C = 2 and uses d = 0.9.]

Page 69: Text mining & Web mining

Matrix Formulation

Consider a random walk on the web (denote IR(P) by r(P)):
Let Bij = the probability of going directly from page i to page j
Let ri be the limiting probability (page rank) of being at page i

In matrix form, with B = (bij) and r = (r1, r2, ..., rn):

B^T r = r

Thus, the final page rank vector r is a principal eigenvector of B^T

Page 70: Text mining & Web mining

How to Compute Page Rank?

For a given network of web pages:
Initialize the page rank of all pages (the example below uses 1/3 for each of three pages)
Set the parameter d (d = 0.90)
Iterate through the network L times

A sketch of this iteration follows.
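A minimal sketch of this iteration with d = 0.9, sequential in-place updates, and per-pass normalization, matching the three-page example on the next slides (whose numbers imply the links A->C, B->C, C->A):

def page_rank(pages, links, d=0.9, iterations=20):
    ranks = {p: 1.0 / len(pages) for p in pages}    # IR = 1/3 each at k=1
    out_deg = {p: len(ts) for p, ts in links.items()}
    for _ in range(iterations):
        for p in pages:      # sequential: new values are reused immediately
            incoming = [s for s in pages if p in links.get(s, [])]
            ranks[p] = (1 - d) + d * sum(ranks[s] / out_deg[s]
                                         for s in incoming)
        total = sum(ranks.values())                 # normalize each pass
        ranks = {p: r / total for p, r in ranks.items()}
    return ranks

print(page_rank(["A", "B", "C"], {"A": ["C"], "B": ["C"], "C": ["A"]}))
# The first pass already gives A 0.38, B 0.095, C 0.52, as on the k=2 slides;
# later passes converge toward the fixed point.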

Page 71: Text mining & Web mining

Example: iteration k=1

[Figure: three pages A, B, C; the computations that follow imply the links A -> C, B -> C, C -> A.]

IR(P) = 1/3 for all nodes; d = 0.9

node  IR
A     1/3
B     1/3
C     1/3

Page 72: Text mining & Web mining

Example: k=2

IR(P) = 0.1 + 0.9 * Σ (i = 1 to l) IR(Ti) / Ci,   where l is the in-degree of P

node  IR
A     0.4
B     0.1
C     0.55

Note: A, B, C's IR values are updated in the order A, then B, then C; the new value of A is used when calculating B, etc.

Page 73: Text mining & Web mining

Example: k=2 (normalize)

node  IR
A     0.38
B     0.095
C     0.52

Page 74: Text mining & Web mining

Crawler Control

All crawlers maintain several queues of URLs to pursue next:
Google initially maintains 500 queues
each queue corresponds to one web site being pursued

Important considerations:
limited buffer space
limited time
avoid overloading target sites
avoid overloading network traffic

Page 75: Text mining & Web mining

Crawler Control

Thus, it is important to visit important pages first
Let G be a lower-bound threshold on I(P)

Crawl and Stop:
select only pages with I(P) > G to crawl
stop after K pages have been crawled

Page 76: Text mining & Web mining

Test Result: 179,000 Pages

[Figure: percentage of the Stanford Web crawled vs. PST, the percentage of hot pages visited so far.]

Page 77: Text mining & Web mining

Google Algorithm (very simplified)

First, compute the page rank of each page on the WWW (query independent)
Then, in response to a query q, return pages that contain q and have the highest page ranks
A problem/feature of Google: it favors big commercial sites

Page 78: Text mining & Web mining

How Powerful is Google?

A PageRank for 26 million web pages can be computed in a few hours on a medium-size workstation
Currently it has indexed a total of 1.3 billion pages

Page 79: Text mining & Web mining

Hubs and Authorities (1998)

Kleinberg, Cornell University
http://www.cs.cornell.edu/home/kleinber/

Main idea: type "java" into a text-based search engine
You get 200 or so pages; which ones are authoritative?
http://java.sun.com
What about the others?
www.yahoo.com/Computer/ProgramLanguages

Page 80: Text mining & Web mining

Hubs and Authorities

[Figure: a bipartite pattern of hubs pointing to authorities, with other pages around them.]

An authority is a page pointed to by many strong hubs
A hub is a page that points to many strong authorities

Page 81: Text mining & Web mining

H&A Search Engine Algorithm

First, submit query Q to a text search engine
Second, among the results returned, select ~200 pages, find their neighbors, and compute hubs and authorities
Third, return the authorities found as the final result

Important issue: how do we find the hubs and authorities?

Page 82: Text mining & Web mining

Link Analysis: Weights

Let Bij = 1 if page i links to page j, 0 otherwise
hi = hub weight of page i
ai = authority weight of page i

Weight normalization:

Σ (i = 1 to N) hi^2 = 1    and    Σ (i = 1 to N) ai^2 = 1      (3)

But, for simplicity, we will use

Σ (i = 1 to N) hi = 1    and    Σ (i = 1 to N) ai = 1      (3')

Page 83: Text mining & Web mining

Link Analysis: Update a-Weight

ai ← Σ (j: Bji ≠ 0) hj,    i.e., a = B^T h      (1)

[Figure: the authority weight of a page is the sum of the hub weights h1, h2, ... of the pages linking to it.]

Page 84: Text mining & Web mining

Link Analysis: Update h-Weight

hi ← Σ (j: Bij ≠ 0) aj,    i.e., h = B a      (2)

[Figure: the hub weight of a page is the sum of the authority weights a1, a2, ... of the pages it links to.]

Page 85: Text mining & Web mining

H&A: Algorithm

1. Set a value for K, the number of iterations
2. Initialize all a and h weights to 1
3. For l = 1 to K, do:
   a. Apply equation (1) to obtain new ai weights
   b. Apply equation (2) to obtain all new hi weights, using the new ai weights obtained in the last step
   c. Normalize the ai and hi weights using equation (3)

A sketch follows.
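A minimal sketch of this algorithm using the simpler L1 normalization (3'). Run on the same three-page graph as the PageRank example (A->C, B->C, C->A), it reproduces the a and h values on the following slides:

def hits(pages, links, K=20):
    a = {p: 1.0 for p in pages}                     # step 2: all weights 1
    h = {p: 1.0 for p in pages}
    for _ in range(K):
        a = {p: sum(h[s] for s in pages if p in links.get(s, []))
             for p in pages}                        # (1): a = B^T h
        h = {p: sum(a[t] for t in links.get(p, []))
             for p in pages}                        # (2): h = B a, new a's
        sa, sh = sum(a.values()), sum(h.values())
        a = {p: v / sa for p, v in a.items()}       # (3'): L1 normalization
        h = {p: v / sh for p, v in h.items()}
    return a, h

a, h = hits(["A", "B", "C"], {"A": ["C"], "B": ["C"], "C": ["A"]})
print(a, h)   # k=2 already gives a = (1/5, 0, 4/5), h = (4/9, 4/9, 1/9)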

Page 86: Text mining & Web mining

Does it Converge?

Yes; the Kleinberg paper includes a proof
The proof needs linear algebra and eigenvector analysis
We will skip the proof and only use the result:
the a and h weight values converge after a sufficiently large number of iterations K

Page 87: Text mining & Web mining

Example: K=1

h = 1 and a = 1 for all nodes

node  a  h
A     1  1
B     1  1
C     1  1

Page 88: Text mining & Web mining

Example: k=1 (update a)

node  a  h
A     1  1
B     0  1
C     2  1

Page 89: Text mining & Web mining

Example: k=1 (update h)

node  a  h
A     1  2
B     0  2
C     2  1

Page 90: Text mining & Web mining

Example: k=1 (normalize)

Using equation (3'):

node  a    h
A     1/3  2/5
B     0    2/5
C     2/3  1/5

Page 91: Text mining & Web mining

Example: k=2 (update a, h, normalize)

Using equation (1):

node  a    h
A     1/5  4/9
B     0    4/9
C     4/5  1/9

If we choose a threshold of 1/2, then C is an authority, and there are no hubs.

Page 92: Text mining & Web mining

Search Engine Using H&A

For each query q:
Enter q into a text-based search engine
Find the top 200 pages
Find the neighbors of the 200 pages by one link; let the set be S
Find the hubs and authorities in S
Return the authorities as the final result

Page 93: Text mining & Web mining

Summary

Link-based analysis is very powerful in finding the important pages
It models the web as a graph, based on in-degrees and out-degrees
Google: crawl only important pages
H&A: post-analysis of search results

Page 94: Text mining & Web mining

Automatic Classification of Web Documents

Assign a class label to each document from a set of predefined topic categories
Based on a set of examples of preclassified documents

Example:
use Yahoo!'s taxonomy and its associated documents as training and test sets
derive a Web document classification scheme
use the scheme to classify new Web documents by assigning categories from the same taxonomy

Keyword-based document classification methods
Statistical models

Page 95: Text mining & Web mining

Web Usage Mining

Mining Web log records to discover user access patterns of Web pages

Applications:
target potential customers for electronic commerce
enhance the quality and delivery of Internet information services to the end user
improve Web server system performance
identify potential prime advertisement locations

Web logs provide rich information about Web dynamics:
a typical Web log entry includes the URL requested, the IP address from which the request originated, and a timestamp

Page 96: Text mining & Web mining

Techniques for Web Usage Mining

Construct a multidimensional view on the Web log database:
perform multidimensional OLAP analysis to find the top N users, top N accessed Web pages, most frequently accessed time periods, etc.

Perform data mining on Web log records:
find association patterns, sequential patterns, and trends of Web accessing
may need additional information, e.g., user browsing sequences of the Web pages in the Web server buffer

Conduct studies to:
analyze system performance; improve system design by Web caching, Web page prefetching, and Web page swapping

Page 97: Text mining & Web mining

Mining the World-Wide Web: Design of a Web Log Miner

The Web log is filtered to generate a relational database
A data cube is generated from the database
OLAP is used to drill down and roll up in the cube
OLAM is used for mining interesting knowledge

Pipeline: Web log -> (1) Data Cleaning -> Database -> (2) Data Cube Creation -> Data Cube -> (3) OLAP -> Sliced and diced cube -> (4) Data Mining -> Knowledge