06/20/22 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer Science University of Illinois at Urbana-Champaign www.cs.uiuc.edu/~hanj Acknowledgements: Based on the slides by students at CS512 (Spring 2009)


Data Mining: Concepts and Techniques

— Chapter 10 —10.3.1 Mining Text and Web Data (I)

Jiawei Han and Micheline Kamber

Department of Computer Science

University of Illinois at Urbana-Champaign

www.cs.uiuc.edu/~hanj

Acknowledgements: Based on the slides by students at CS512 (Spring 2009)


Outline

Introduction to Information Retrieval (Rui Li)

Text categorization (Parikshit Sondhi)

Web link analysis (Kavita Ganesan)

Mining and Searching Structured Data on the Web (Bo Zhao)


Information Retrieval

Rui Li


What Is Information Retrieval?

Information retrieval
  A collection of text documents exists
  A user issues a query to express an information need
  A retrieval system returns relevant documents to the user

Typical IR systems
  Online library catalogs
  Online document management systems
  Web search engines (e.g., Google)

Information retrieval vs. database systems
  Unstructured/free text vs. structured data
  Ambiguous vs. well-defined semantics
  Incomplete vs. complete specification
  Relevant documents vs. matched records
  No transaction management vs. transaction management


Typical IR System Architecture

[Figure: IR system architecture. Labels: User, query, docs, results, Tokenizer, Indexer, Index, Query Rep, Doc Rep (Index), Scorer.]


Document Representation

A document can be described by a set of representative keywords called index terms.

Different index terms have varying relevance when used to describe document contents.

Steps:
  Tokenize the document into words
  Remove stop words using a stop word list, e.g., “is”, “a”, “or”
  Stem the words: several words are small syntactic variants of each other because they share a common word stem, e.g., drug, drugs, drugged
  Calculate the term weights based on word frequency

Query Representation is a similar process
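The document representation steps can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular IR system: the stop word list is only the three examples from the slide, and `naive_stem` is a hypothetical, deliberately crude suffix stripper standing in for a real stemmer.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (assumption: only the slide's examples).
STOP_WORDS = {"is", "a", "or"}

def naive_stem(word):
    """Crude suffix stripping, for illustration only (not a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "drugg" -> "drug".
            if suffix != "s" and len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word

def represent(text):
    """Tokenize, drop stop words, stem, and count term frequencies."""
    tokens = re.findall(r"[a-z]+", text.lower())
    terms = [naive_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(terms)

print(represent("A drug is a drug, or drugs drugged"))
```

The same function can be applied to a query, since query representation is a similar process.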


Indexing

Inverted index
  Maintains two hash- or B+-tree indexed tables:
    document_table: a set of document records <doc_id, postings_list>
    term_table: a set of term records <term, postings_list>
  Answering a query: find all docs associated with one or a set of terms
  + Easy to implement
  + Effective for fetching documents that contain specific terms
  – Does not handle synonymy and polysemy well, and posting lists can be too long (storage can be very large)

Other index techniques: e.g., signature files
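As a rough sketch of the term_table side of an inverted index, the code below maps each term to the set of documents that contain it. It assumes whitespace tokenization and an in-memory dictionary instead of a hash- or B+-tree indexed table; the function names are hypothetical.

```python
from collections import defaultdict

def build_term_table(docs):
    """Map each term to the set of doc_ids whose text contains it."""
    term_table = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            term_table[term].add(doc_id)
    return term_table

def find_docs(term_table, terms):
    """Return the docs associated with every term in `terms`."""
    postings = [term_table.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {1: "text mining", 2: "data mining", 3: "food nutrition"}
table = build_term_table(docs)
print(find_docs(table, ["data", "mining"]))
```

Because lookup is by exact term, a query for "nutrition" would not find a document that only says "diet", which is the synonymy weakness noted above.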


Ranking Model

The basic question: given a query, how do we know if document A is more relevant than document B?

Relevance = similarity
  The query and the documents are represented similarly
  A query can be regarded as a “document”
  Relevance(d, q) ∝ similarity(d, q)

Key issues
  How to represent the query and the documents?
  How to define the similarity measure?

Typical models
  Boolean model
  Vector space model
  Language model


The Notion of Relevance

Relevance
  Similarity: (Rep(q), Rep(d))
    Different rep & similarity
      Vector space model (Salton et al., 75)
      Prob. distr. model (Wong & Yao, 89)
  Probability of relevance: P(r=1|q,d), r ∈ {0,1}
    Regression model (Fox 83)
    Generative model
      Doc generation: classical prob. model (Robertson & Sparck Jones, 76)
      Query generation: LM approach (Ponte & Croft, 98) (Lafferty & Zhai, 01a)
  Probabilistic inference: P(d→q) or P(q→d)
    Different inference system
      Prob. concept space model (Wong & Yao, 95)
      Inference network model (Turtle & Croft, 91)


Vector Space Model

Represent a doc/query by a term vector
  Term: basic concept, e.g., word or phrase
  Each term defines one dimension, and N terms define a high-dimensional space
  Each element of the vector corresponds to a term weight, i.e., the “importance” of the term

[Figure: documents D1 to D11 and the query plotted as vectors in a space with axes Java, Microsoft, and Starbucks.]


How to Assign Weights

Two-fold heuristics based on frequency
  TF (term frequency)
    More frequent within a document → more relevant to semantics
    e.g., “query” vs. “commercial”
  IDF (inverse document frequency)
    Less frequent among documents → more discriminative
    e.g., “algebra” vs. “science”

TF-IDF weighting: weight(t, d) = TF(t, d) * IDF(t)
  Frequent within doc → high tf → high weight
  Selective among docs → high idf → high weight
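A small sketch of the weighting formula weight(t, d) = TF(t, d) * IDF(t). The slide does not fix the exact TF and IDF functions, so this example assumes raw counts for TF and IDF(t) = log(N / df(t)), one common choice among several; `tf_idf` is a hypothetical helper name.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc.

    Assumption: TF is the raw count and IDF(t) = log(N / df(t)).
    """
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["algebra", "science"], ["science", "science"], ["science", "query"]]
for w in tf_idf(docs):
    print(w)
```

In this toy collection "science" occurs in every document, so its IDF (and weight) is zero, while "algebra" occurs in one document and gets a positive weight.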


How to Measure Similarity?

Given two documents, similarity can be defined as:
  dot product
  normalized dot product (or cosine)
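Both similarity definitions can be written directly over sparse term-weight vectors. The sketch below assumes each document is a dictionary from term to weight; the representation and function names are illustrative choices, not taken from the slides.

```python
import math

def dot(u, v):
    """Dot product of two sparse vectors stored as {term: weight}."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def cosine(u, v):
    """Dot product normalized by the vector lengths (0.0 for empty vectors)."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

d1 = {"java": 1.0, "microsoft": 2.0}
d2 = {"microsoft": 3.0}
print(dot(d1, d2), cosine(d1, d2))
```

Unlike the raw dot product, the cosine does not grow just because a document is longer, which is why it is the usual choice for ranking.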


Advantages and Disadvantages of the VS Model

Advantages:
  Empirically effective! (Top TREC performance)
  Intuitive
  Easy to implement
  Well-studied/most evaluated

Disadvantages:
  Assumes term independence
  Assumes the query and the document are the same kind of object
  Lacks “predictive adequacy”
  Arbitrary term weighting
  Arbitrary similarity measure


Language Models for Retrieval (Ponte & Croft 98)

Document: text mining paper
  Language model: … text ?, mining ?, association ?, clustering ?, … food ?, …

Document: food nutrition paper
  Language model: … food ?, nutrition ?, healthy ?, diet ?, …

Query = “data mining algorithms”

Which model would most likely have generated this query?


Text Generation with Unigram LM

(Unigram) language model p(w|θ)

Topic 1: Text mining
  … text 0.2, mining 0.1, association 0.01, clustering 0.02, … food 0.00001, …
  Sampling from this model generates a document: text mining paper

Topic 2: Health
  … food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …
  Sampling from this model generates a document: food nutrition paper


Estimation of Unigram LM

(Unigram) language model p(w|θ) = ?

Document: a “text mining paper” (total #words = 100)
  Word counts: text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1

Estimation
  text 10/100, mining 5/100, association 3/100, database 3/100, …, query 1/100
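The estimate shown on this slide is the relative frequency of each word in the document. A minimal sketch, assuming the document is already tokenized (the helper name and the filler token "other" are hypothetical):

```python
from collections import Counter

def estimate_unigram_lm(tokens):
    """Relative-frequency estimate: p(w) = count(w) / total #words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# A toy 100-word "text mining paper": 10 x text, 5 x mining, 85 other words.
paper = ["text"] * 10 + ["mining"] * 5 + ["other"] * 85
lm = estimate_unigram_lm(paper)
print(lm["text"], lm["mining"])
```

Any word that never occurs in the document gets no probability at all under this estimate, which is why smoothing matters for retrieval.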


Ranking Docs by Query Likelihood

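[Figure: each document d1, d2, …, dN has its own document language model (Doc LM); the query q is scored against each model, giving the query likelihoods p(q|d1), p(q|d2), …, p(q|dN).]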


Retrieval as Language Model Estimation

Document ranking based on query likelihood

$$\log p(q \mid d) = \sum_{i=1}^{n} \log p(w_i \mid d), \quad \text{where } q = w_1 w_2 \ldots w_n$$

Here p(wi|d) is the document language model.

• Retrieval problem ≈ estimation of p(wi|d)

• Smoothing is an important issue, and distinguishes different approaches
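The query-likelihood score can be sketched as follows. The slides do not say which smoothing method to use, so this example assumes simple linear interpolation with a collection model (Jelinek-Mercer style) and a hypothetical mixing weight `lam`; it also assumes every query word occurs somewhere in the collection.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection, lam=0.5):
    """Sum of log p(w|d) over query words, with interpolation smoothing.

    Assumption: p(w|d) = (1 - lam) * p_ml(w|doc) + lam * p_ml(w|collection).
    """
    doc_tf, col_tf = Counter(doc), Counter(collection)
    score = 0.0
    for w in query:
        p_doc = doc_tf[w] / len(doc)
        p_col = col_tf[w] / len(collection)
        score += math.log((1 - lam) * p_doc + lam * p_col)
    return score

d1 = ["text", "mining", "data"]
d2 = ["food", "nutrition", "diet"]
collection = d1 + d2
query = ["data", "mining"]
print(query_log_likelihood(query, d1, collection))
print(query_log_likelihood(query, d2, collection))
```

Documents are then ranked by this score; without the collection term, d2 would receive log(0) for any query word it does not contain.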


Basic Measures for Text Retrieval


Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., “correct” responses)

Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved

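$$\text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}$$

$$\text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}$$

[Figure: Venn diagram over All Documents showing the Relevant set, the Retrieved set, and their overlap, Relevant & Retrieved.]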
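Both measures follow directly from the set definitions. A minimal sketch, assuming documents are identified by hashable ids and defining both measures as 0.0 when the denominator is empty (a convention chosen here, not stated in the slides):

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # relevant AND retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

print(precision_recall(relevant={1, 2, 3, 4}, retrieved={3, 4, 5}))
```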


Acknowledgement

Some slides come from Professor Jiawei Han’s CS512 course slides and from Professor Chengxiang Zhai’s CS410 course slides (language model part)


By: Parikshit Sondhi

Computer Science, University of Illinois at Urbana-Champaign

Some slides have been adapted from Prof. Han's presentation

Text Categorization


Document Classification: Motivation

News article classification
Automatic email filtering
Webpage classification
Word sense disambiguation
… …


Text Categorization

Pre-given categories and labeled document examples (categories may form a hierarchy)
Classify new documents
A standard classification (supervised learning) problem

[Figure: a categorization system with labeled example documents (Sports, Business, Education) and output categories Sports, Business, Education, Science, …]


Document Classification: Problem Definition

Need to assign a boolean value {0,1} to each entry aij of the decision matrix

C = {c1, ..., cm} is a set of pre-defined categories
D = {d1, ..., dn} is a set of documents to be categorized
aij = 1: dj belongs to ci
aij = 0: dj does not belong to ci

A Tutorial on Automated Text Categorisation, Fabrizio Sebastiani, Pisa (Italy)


Flavors of Classification

Single label
  For a given di, at most one (di, ci) is true
  Train a system that takes a di and C as input and outputs a ci

Multi-label
  For a given di, zero, one, or more (di, ci) can be true
  Train a system that takes a di and C as input and outputs C’, a subset of C

Binary
  Build a separate system for each ci, such that it takes a di as input and outputs a boolean value for (di, ci)
  The most general approach
  Based on the assumption that the decision on (di, ci) is independent of the decision on (di, cj)


Classification Methods


Manual: typically rule-based (KE approach)
  Does not scale up (labor-intensive, rule inconsistency)
  May be appropriate for special data in a particular domain

Automatic: typically exploiting machine learning techniques
  Vector space model based
    Prototype-based (Rocchio)
    K-nearest neighbor (KNN)
    Decision tree (learns rules)
    Neural networks (learn a non-linear classifier)
    Support vector machines (SVM)
  Probabilistic or generative model based
    Naïve Bayes classifier


Steps in Document Classification

Classification process
  Data preprocessing, e.g., term extraction, dimensionality reduction, feature selection, etc.
  Definition of the training set and test sets
  Creation of the classification model using the selected classification algorithm
  Classification model validation
  Classification of new/unknown text documents


An Example: TFIDF Classifiers


Vector Space Model


Represent a doc by a term vector
  Term: basic concept, e.g., word or phrase
  Each term defines one dimension
  N terms define an N-dimensional space
  Each element of the vector corresponds to a term weight
    E.g., d = (x1,…,xN), where xi is the “importance” of term i

New document is assigned to the most likely category based on vector similarity (e.g., based on cosine formula).
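One way to make "most likely category based on vector similarity" concrete is a prototype-based classifier: average the training vectors of each category and assign a new document to the category whose prototype has the highest cosine similarity. This is a simplified sketch in the spirit of the prototype-based (Rocchio) family, assuming plain centroids with no negative examples; all function names are hypothetical.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def train_prototypes(labeled_docs):
    """labeled_docs: list of (category, vector). Returns category centroids."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for category, vector in labeled_docs:
        counts[category] += 1
        for term, weight in vector.items():
            sums[category][term] += weight
    return {c: {t: w / counts[c] for t, w in terms.items()}
            for c, terms in sums.items()}

def classify(vector, prototypes):
    """Pick the category whose prototype is most similar to the vector."""
    return max(prototypes, key=lambda c: cosine(vector, prototypes[c]))

training = [
    ("C1", {"java": 1.0}),
    ("C1", {"java": 0.8, "microsoft": 0.2}),
    ("C2", {"starbucks": 1.0}),
]
prototypes = train_prototypes(training)
print(classify({"java": 0.5, "microsoft": 0.1}, prototypes))
```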


VS Model: Illustration


[Figure: categories C1 (Category 1), C2 (Category 2), and C3 (Category 3) and a new doc plotted in a space with axes Java, Microsoft, and Starbucks.]


TFIDF Classifier

The basic idea of the algorithm is to represent each document d as a vector d = (d(1), ..., d(|F|)) in a vector space so that documents with similar content have similar vectors.

Each dimension of the vector space represents a word selected by the feature selection process.

d(i) for a document d is calculated as a combination of the statistics TF(w, d) and DF(w). d(i) is called the weight of word wi in document d.

A probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, Thorsten Joachims, Carnegie Mellon University, Pittsburgh, PA


Representation

Each distinct word is a feature with the number of times the word occurs in the document as its value. This value is usually a function of TF(w, d) and IDF(w).

To avoid unnecessarily large feature vectors, words are considered as features only if they occur in the training data at least m times (e.g., m = 3).

A probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, Thorsten Joachims, Carnegie Mellon University, Pittsburgh, PA


Preprocessing: Feature Selection

All available features vs. a "good" subset
The problem of finding a "good" subset of features is called feature selection
Feature selection methods:
  1. Pruning of infrequent words
     Words are considered as features only if they occur at least a few times in the training data.
  2. Pruning of high-frequency words
     This technique is supposed to eliminate non-content words like "the", "and", "for".

A probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, Thorsten Joachims, Carnegie Mellon University, Pittsburgh, PA


Classification: TFIDF Classifier

A probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, Thorsten Joachims, Carnegie Mellon University, Pittsburgh, PA


Evaluations


Effectiveness measure
  Classic: precision and recall

Precision

Recall


Evaluation (cont’d)


Benchmarks
  Classic: Reuters collection
    A set of newswire stories classified under categories related to economics

Effectiveness
  Difficulties of strict comparison
    Different parameter settings
    Different “split” (or selection) between training and testing
    Various optimizations … …
  However, widely recognized
    Best: boosting-based committee classifier & SVM
    Worst: Naïve Bayes classifier

Need to consider other factors, especially efficiency


Document Classification: Approach Comparisons


Document Clustering


Motivation
  Automatically group related documents based on their contents
  No predetermined training sets or taxonomies
  Generate a taxonomy at runtime

The most popular clustering methods are:
  K-means clustering
  Agglomerative hierarchical clustering
  EM (Gaussian mixture)
  …


The Steps and Algorithms

Clustering process
  Data preprocessing: remove stop words, stem, feature extraction, lexical analysis, etc.
  Hierarchical clustering: compute similarities by applying clustering algorithms
  Model-based clustering (neural network approach): clusters are represented by “exemplars” (e.g., SOM)


K-Means Clustering

Given:
  a set of documents (e.g., TFIDF vectors)
  a distance measure (e.g., cosine)
  K (number of groups)

For each of the K groups, initialize its centroid with a random document

While not converged:
  Assign each document to the nearest group (represented by its centroid)
  For each group, calculate the new centroid (group mass point, the average document in the group)
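The K-means loop above can be sketched as follows. The example assumes dense vectors (lists of floats), uses cosine similarity so that the "nearest" group is the one with the most similar centroid, and stops when assignments no longer change; `seed` and `max_iter` are hypothetical parameters added to keep the run reproducible and bounded.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u))
            * math.sqrt(sum(b * b for b in v)))
    return dot / norm if norm else 0.0

def k_means(docs, k, max_iter=100, seed=0):
    """Return (assignment, centroids) for a list of dense document vectors."""
    rng = random.Random(seed)
    # Initialize each centroid with a random document.
    centroids = [list(d) for d in rng.sample(docs, k)]
    assignment = None
    for _ in range(max_iter):
        # Assign each document to the group with the most similar centroid.
        new_assignment = [
            max(range(k), key=lambda c: cosine(d, centroids[c])) for d in docs
        ]
        if new_assignment == assignment:  # converged
            break
        assignment = new_assignment
        # Recompute each centroid as the average document of its group.
        for c in range(k):
            members = [d for d, a in zip(docs, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
assignment, centroids = k_means(docs, k=2)
print(assignment)
```

The group labels (0 or 1) depend on the random initialization, so only the grouping itself is meaningful.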


Slide adapted from Dr. Andrew Moore’s Presentation


Summary: Text Categorization


Wide application domain

Comparable effectiveness to professionals
  Manual TC is not 100% accurate and is unlikely to improve substantially
  A.T.C. is growing at a steady pace

Prospects and extensions
  Very noisy text, such as text from O.C.R.
  Speech transcripts


References


Fabrizio Sebastiani, “Machine Learning in Automated Text Categorization”, ACM Computing Surveys, Vol. 34, No. 1, March 2002.

Yiming Yang, “An evaluation of statistical approaches to text categorization”, Journal of Information Retrieval, 1:67-88, 1999.

Yiming Yang and Xin Liu, “A re-examination of text categorization methods”, Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99, pp. 42-49), 1999.


Thank You


Web Link Analysis

By Kavita Ganesan


RECAP

What is ranking in information retrieval?


Perform a search on Google:
  Doc 1: ranked 1st
  Doc 2: ranked 2nd
  Doc 3: ranked 3rd
  Doc 4: ranked 4th


Why is ranking important?

Users tend to look at the top few results
  Make sure that good matches are at the very top

Fast access to information!
  Savvy users want results immediately

What happens if pages are poorly ranked?
  Important matches are missed
  Poor user retention


Ranking in Text Information Retrieval

Before the web existed
  Each document was treated as a bag of words
  Minimal structure
  Ranking heuristics
    Based solely on the words in the documents
    E.g., term frequency, inverse document frequency

After the web was born
  Documents
    have structure
    contain hyperlinks
    contain components like title, author, abstract, sections, references

Question: can we leverage this information to improve ranking?


Exploiting inter-document links

What does a link tell us?
  Description (“anchor text”)
  Hub and authority
  Links indicate the utility of a doc


Link Analysis Algorithms

Hyperlink analysis to rank documents
  PageRank
  HITS


PageRank

Based on the idea of a ‘random surfer’
  The likelihood that a person randomly clicking on links will arrive at any particular page

Pages are represented as Markov chain states

The probability of moving from one page to another is modelled as a state transition probability


PageRank

Example: four pages A, B, C, and D

State transition matrix (row = current page, column = next page)

       A     B     C     D
  A    0    1/2   1/2    0
  B   1/2    0     0    1/2
  C    1     0     0     0
  D   1/2    0    1/2    0

PR(A) = ½ * PR(B) + 1 * PR(C) + ½ * PR(D)


PageRank

The PageRank value for any page u can be expressed as

$$PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}$$

where
  L(v) = number of outbound links of page v
  PR(v) = PageRank of page v
  Bu = set of pages linking to page u
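The formula can be solved by repeated substitution (power iteration). The sketch below applies it exactly as written, with no damping factor, to the four-page example whose transition matrix is given earlier. It assumes every page has at least one outbound link and that the iteration converges, which holds for this small graph but not for arbitrary link structures; `tol` and `max_iter` are hypothetical parameters.

```python
def pagerank(out_links, tol=1e-12, max_iter=1000):
    """Iterate PR(u) = sum over v in B_u of PR(v) / L(v).

    out_links: {page: [pages it links to]}; every page needs >= 1 out-link.
    """
    pages = list(out_links)
    pr = {p: 1.0 / len(pages) for p in pages}  # start from a uniform score
    for _ in range(max_iter):
        new_pr = {p: 0.0 for p in pages}
        for v, targets in out_links.items():
            share = pr[v] / len(targets)  # PR(v) / L(v)
            for u in targets:
                new_pr[u] += share
        converged = max(abs(new_pr[p] - pr[p]) for p in pages) < tol
        pr = new_pr
        if converged:
            break
    return pr

# Links read off the example transition matrix: A->B,C; B->A,D; C->A; D->A,C.
graph = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["A", "C"]}
print(pagerank(graph))
```

For this graph the scores settle at PR(A) = 8/19, PR(B) = 4/19, PR(C) = 5/19, and PR(D) = 2/19, which can be checked by substituting them back into the formula.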


HITS

HITS = Hyperlink-Induced Topic Search

Developed by Jon Michael Kleinberg from Cornell

The algorithm produces two types of pages:
  Authority: pages that provide important, trustworthy information on a given topic
  Hub: pages that contain links to authorities

Authorities and hubs exhibit a mutually reinforcing relationship: a better hub points to many good authorities, and a better authority is pointed to by many good hubs


HITS algorithm

Start with each node (page) having a hub score and an authority score of 1

Run the Authority Update Rule

Run the Hub Update Rule

Normalize the values:
  Divide each hub score by the sum of all hub scores
  Divide each authority score by the sum of all authority scores

Repeat from the second step as necessary


HITS algorithm—Authority Update

A node's authority score = the sum of the hub scores of each node that points to it.

A page has high authority if it is linked to by pages that are recognized as hubs for information.

[Figure: nodes B, C, and D each point to node A.]

authority(A) = h(B) + h(C) + h(D)


HITS algorithm—Hub Update

A node’s hub score = the sum of the authority scores of each node that it points to.

A page is a good hub if it links to pages that have high authority.

[Figure: node A points to nodes E, F, and G.]

hub(A) = a(E) + a(F) + a(G)
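Putting the two update rules and the normalization step together gives the following sketch of the HITS iteration. It assumes the link graph is given as a dictionary of outbound links, contains at least one link, and is iterated a fixed number of times (`iterations` is a hypothetical parameter; the slides only say to repeat as necessary).

```python
def hits(out_links, iterations=20):
    """Return (hub, authority) score dicts for a graph {page: [targets]}."""
    nodes = set(out_links) | {t for ts in out_links.values() for t in ts}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum of hub scores of the nodes pointing to it.
        auth = {n: 0.0 for n in nodes}
        for src, targets in out_links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub update: sum of authority scores of the nodes it points to.
        hub = {n: sum(auth[t] for t in out_links.get(n, [])) for n in nodes}
        # Normalize: divide each score by the sum of all scores of its kind.
        auth_sum, hub_sum = sum(auth.values()), sum(hub.values())
        auth = {n: a / auth_sum for n, a in auth.items()}
        hub = {n: h / hub_sum for n, h in hub.items()}
    return hub, auth

# B, C, and D all point to A, as in the authority update example.
hub, auth = hits({"B": ["A"], "C": ["A"], "D": ["A"]})
print(auth["A"], hub["B"])
```

In this tiny graph A receives all of the authority, and B, C, and D share the hub score equally.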


PageRank vs HITS

HITS
  Iterative algorithm based on the linkage of documents on the web
  Executed at query time (authority and hub scores are query specific), so it takes a performance hit
  Computes two scores per document: hub and authority

PageRank
  Iterative algorithm based on the linkage of documents on the web
  Pre-computed
  Computes a single score


End


AUTHORITY PAGE

Kevin Chang

Cheng Zhai

Marianne Winslet

ibm.com

berkeley.edu

Stanford.edu

If a page is popular, then it must be an important page


Mining and Searching Structured Data on the Web

Bo Zhao

Department of Computer Science

University of Illinois at Urbana-Champaign


Structured Data are EVERYWHERE!

Deep Web: databases behind websites (aa.com)
Web 2.0 contents: Flickr, Del.icio.us tags
Google Base: structured data portals
Surface Web: emails, org, country…


Solutions

Deep Web data integration
  Vertical search engines
  On-the-fly meta-querying systems
  Pay-as-you-go integration

Deep Web surfacing

Entity search on the Surface Web


Vertical Search Engines: “Warehousing” Approach

Academic search
  Libra@MSRA
  DBLife@WISC
  ArnetMiner@Tsinghua

Many other domains
  Shopping
  Events
  Apartments
  …


Vertical Search Engines: “Warehousing” Approach, e.g., Libra Academic Search [NieZW+05] (courtesy MSRA)

Integrating information from multiple types of sources
Ranking papers, conferences, and authors for a given query
Handling structured queries

[Figure: source types shown: Web databases; PDF, PS, and DOC files; journal, author, and conference homepages.]


On-the-fly Meta-querying Systems

MetaQuerier@UIUC

WISE-Integrator
  http://www.data.binghamton.edu:8080/wise-integrator/

Commercial systems
  http://www.cheaptickets.com
  http://pipl.com
  …


On-the-fly Meta-querying Systems, e.g., WISE [HeMYW03], MetaQuerier [ChangHZ05]

[Figure: MetaQuerier@UIUC, with labels FIND sources, QUERY sources, db of dbs, and unified query interface, over sources such as Amazon.com, Cars.com, 411localte.com, and Apartments.com.]


Technical Challenges

Source Modeling & Selection: How to describe a source and find the right sources for query answering?

Schema Matching: How to match the schematic structures between sources?

Source Querying, Crawling, and Object Ranking: How to query a source? How to crawl all objects and to search them?

Data Extraction: How to extract result pages into relations?


Source Modeling & Selection: How to describe a source and find the right sources for query answering?

Focus: Discovery of sources.
  Focused crawling to collect query interfaces [BarbosaF05, ChangHZ05].

Focus: Extraction of source models.
  Hidden grammar-based parsing [ZhangHC04].
  Proximity-based extraction [HeMY+04].
  Classification to align with given taxonomy [HessK03, Kushmerick03].

Focus: Organization of sources and query routing.
  Offline clustering [HeTC04, PengMH+04].
  Online search for query routing [KabraLC05].


Form Extraction: the Problem

Output all the conditions; for each:
  Grouping elements (into query conditions)
  Tagging elements with their “semantic roles”: attribute, operator, value


Schema Matching: How to match the schematic structures between sources?

Focus: Matching a large number of interface schemas, often in a holistic way.
  Statistical model discovery [HeC03]; correlation mining [HeCH04, HeC05].
  Query probing [WangWL+04].
  Clustering [HeMY+03, WuYD+04].
  Corpus-assisted [MadhavanBD+05]; Web-assisted [WuDY06].

Focus: Constructing unified interfaces.
  As a global generative model [HeC03].
  Cluster-merge-select [HeMY+03].


WISE-Integrator: Cluster-Merge-Represent [HeMY+03]


Source Querying: How to query a source? How to crawl all objects and to search them?

1. Metaquerying model:
   Focus: On-the-fly querying.
     MetaQuerier Query Assistant [ZhangHC05].

2. Vertical-search-engine model:
   Focus: Source crawling to collect objects.
     Form submission by query generation/selection, e.g., [RaghavanG01, WuWLM06].
   Focus: Object search and ranking [NieZW+05].


On-the-fly Querying: Type-locality based Predicate Translation [ZhangHC05]

[Figure: components: Source predicate s, Target template P, Predicate Mapper (Type Recognizer; Domain Specific, Text, Numeric, and Datetime Handlers), Target predicate t*]

Correspondences occur within localities

Translation by type-handler
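
The idea of recognizing a predicate's type and handing it to a type-specific handler can be sketched as follows. This is only an illustration under assumptions: the names Predicate, recognize_type, translate, the handler functions, and the dictionary layout of the target template are hypothetical and are not taken from [ZhangHC05]; here the type is guessed from the source value alone.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Predicate:
    attribute: str
    operator: str
    value: str


def recognize_type(value: str) -> str:
    """Hypothetical type recognizer: numeric, then datetime, else text."""
    try:
        float(value)
        return "numeric"
    except ValueError:
        pass
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return "datetime"
    except ValueError:
        return "text"


def text_handler(source: Predicate, template: dict) -> Predicate:
    # Keep the keywords; use whatever operator the target form supports.
    return Predicate(template["attribute"], template["operator"], source.value)


def numeric_handler(source: Predicate, template: dict) -> Predicate:
    # Map a numeric value onto the target range that contains it.
    x = float(source.value)
    for low, high in template["ranges"]:
        if low <= x < high:
            return Predicate(template["attribute"], "between", f"{low},{high}")
    raise ValueError(f"no target range covers {x}")


def datetime_handler(source: Predicate, template: dict) -> Predicate:
    # Re-render the date in the format the target form expects.
    day = datetime.strptime(source.value, "%Y-%m-%d")
    return Predicate(template["attribute"], "=", day.strftime(template["format"]))


HANDLERS = {
    "text": text_handler,
    "numeric": numeric_handler,
    "datetime": datetime_handler,
}


def translate(source: Predicate, template: dict) -> Predicate:
    """Dispatch the source predicate to the handler for its recognized type."""
    return HANDLERS[recognize_type(source.value)](source, template)


price = translate(
    Predicate("price", "=", "30"),
    {"attribute": "price_range", "ranges": [(0, 25), (25, 45)]},
)
print(price)
```

A domain-specific handler would be registered in the same table; it is omitted here because its behavior depends on the domain.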


Source Crawling by Query Selection [WuWL+06]

Author    Title        Category
Ullman    Compiler     System
Ullman    Data Mining  Application
Ullman    Automata     Theory
Han       Data Mining  Application

[Figure: graph with nodes Ullman, Han, Compiler, Automata, Data Mining, Application, Theory, System]

Conceptually, the DB is a graph:
  Node: attribute value
  Edge: occurrence relationship (two values occur in the same record)

Crawling is transformed into a graph traversal problem: find a set of nodes N in the graph G such that for every node i in G there exists a node j in N with an edge j->i, and the total cost of the nodes in N is minimal.
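
The node-selection problem stated above is a weighted dominating-set problem. The sketch below uses a simple greedy heuristic (pick the node with the lowest cost per newly covered node); this is a generic approximation chosen for illustration, not necessarily the method of [WuWL+06]. It assumes that a selected node also covers itself, that edges link values occurring in the same record, and that every query has unit cost. The function names are hypothetical.

```python
from itertools import permutations


def build_graph(records):
    """Link every pair of attribute values that occur in the same record."""
    graph = {}
    for record in records:
        for a, b in permutations(record, 2):
            graph.setdefault(a, set()).add(b)
    return graph


def greedy_query_selection(graph, cost):
    """Greedy heuristic: repeatedly pick the cheapest node per newly covered node."""
    uncovered = set(graph)
    chosen = []
    while uncovered:
        def gain(n):
            # Nodes newly covered by querying n (n itself plus its neighbours).
            return len(({n} | graph[n]) & uncovered)

        best = min(
            (n for n in graph if gain(n) > 0),
            key=lambda n: (cost[n] / gain(n), n),  # name breaks ties deterministically
        )
        chosen.append(best)
        uncovered -= {best} | graph[best]
    return chosen


records = [
    ("Ullman", "Compiler", "System"),
    ("Ullman", "Data Mining", "Application"),
    ("Ullman", "Automata", "Theory"),
    ("Han", "Data Mining", "Application"),
]
graph = build_graph(records)
queries = greedy_query_selection(graph, {node: 1.0 for node in graph})
print(queries)
```

On the example table, querying "Ullman" reaches every value except "Han", so one further query on a value that co-occurs with "Han" completes the cover.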


Object Ranking - Object Relationship Graph [NieZW+05]

Popularity Propagation Factor for each type of relationship link

Popularity of an object is also affected by the popularity of the Web pages containing the object
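
One way to read the two statements above is as a PageRank-style iteration in which each relationship type has its own propagation factor and each object also receives a share of popularity from the Web pages that contain it. The sketch below follows that reading only loosely: the mixing weight eps, the factor table gamma, the even split over outgoing links, and the toy data are all illustrative assumptions, not the formulation in [NieZW+05].

```python
def object_popularity(links, web_popularity, gamma, eps=0.15, iterations=50):
    """links: list of (source, target, link_type) triples between objects."""
    # Out-degree per (source, link_type), used to split popularity among targets.
    out_degree = {}
    for src, _, kind in links:
        out_degree[(src, kind)] = out_degree.get((src, kind), 0) + 1

    score = dict(web_popularity)
    for _ in range(iterations):
        # Share contributed by the popularity of the containing Web pages.
        new = {obj: eps * web_popularity[obj] for obj in score}
        # Share propagated along typed relationship links.
        for src, dst, kind in links:
            new[dst] += (1 - eps) * gamma[kind] * score[src] / out_degree[(src, kind)]
        score = new
    return score


scores = object_popularity(
    links=[("p1", "p2", "cites"), ("p3", "p2", "cites")],
    web_popularity={"p1": 1.0, "p2": 1.0, "p3": 1.0},
    gamma={"cites": 0.5},
)
print(scores)
```

In this toy graph the twice-cited paper p2 ends up more popular than p1 and p3, even though all three start with the same Web popularity.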


Data Extraction: How to extract result pages into relations?

[Figure: a Mediator on top of three Wrappers]

Focus: Semi-automatic wrapper construction.

Techniques:
  Wrapper-mediator architecture [Wiederhold92].
  Manual construction.
  Semi-automatic, learning-based: HLRT [KushmerickWD97], Stalker [MusleaMK99], Softmealy [HsuD98].
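
As an illustration of what a learned wrapper looks like once it is applied, the sketch below executes an HLRT-style wrapper: a head delimiter, a tail delimiter, and one left/right delimiter pair per attribute. The delimiters are given by hand here; inducing them from labeled pages, which is the learning part, is not shown. The function name and the sample page are assumptions made for this example.

```python
def hlrt_extract(page, head, tail, delimiters):
    """Apply an HLRT-style wrapper.

    delimiters: list of (left, right) string pairs, one per attribute.
    Returns one tuple per extracted record.
    """
    pos = page.index(head) + len(head)  # skip the page header
    rows = []
    while True:
        start = page.find(delimiters[0][0], pos)
        end = page.find(tail, pos)
        # Stop when no further record starts before the tail delimiter.
        if start == -1 or (end != -1 and end < start):
            break
        row = []
        for left, right in delimiters:
            begin = page.index(left, pos) + len(left)
            stop = page.index(right, begin)
            row.append(page[begin:stop])
            pos = stop + len(right)
        rows.append(tuple(row))
    return rows


page = (
    "<HTML><TITLE>Some Country Codes</TITLE><BODY>"
    "<B>Some Country Codes</B><P>"
    "<B>Congo</B> <I>242</I><BR>"
    "<B>Egypt</B> <I>20</I><BR>"
    "<HR><B>End</B></BODY></HTML>"
)
rows = hlrt_extract(page, "<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")])
print(rows)
```

The head and tail delimiters keep the bold page title and the bold "End" footer from being mistaken for records.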


Data Extraction: How to extract result pages into relations?

[Figure: a Mediator on top of three Wrappers]

Focus: Even more automatic approaches.

Techniques:
  Semi-automatic, learning-based: [ZhaoMWRY05], [IRMKS06].
  Automatic, syntax-based: RoadRunner [MeccaCM01], ExAlg [ArasuG03], DEPTA [LiuGZ03, ZhaiL05].


You can only afford to Pay As You Go

Data Integration Solution
  Build data integration systems with deep web sources
  Reformulate user queries at search-time
  Build data integration for every domain of interest

Impractical for web search!
  Cannot query sources too often
  Precise content description required
  Too many domains of interest?
  Mediated schema design is infeasible!


Web Search Queries and Users

Web Queries are typically keyword queries

Data integration solutions assume structured queries

Web users do not typically care if results are structured or unstructured

User attention restricted to small number of portals (~1)


PAYGO Architecture

There can be many, potentially ill-defined, domains: Mediated Schema -> Schema Clusters

Precise mappings cannot be created to all data sources: Exact Mappings -> Approximate Mappings

Users prefer keyword queries to structured queries: Query Reformulation -> Query Routing

Data sources are diverse and mappings approximate: Exact Answers -> Heterogeneous Result Ranking

Uncertainty everywhere!


Pay As You Go in PAYGO

Integration is a continuous process
  A priori integration impossible
  Understanding of mappings/sources/ranking/etc. evolves over time

Mechanisms to facilitate evolution over time
  Automatic schema clustering and matching
  Implicit use of user feedback, e.g., from result clicks
  Result variations to elicit disambiguating user feedback

Queries always answered with best effort
  “Pay” more by correcting/creating semantic mappings


Query Routing Example

Query: “honda civic 2007 review”

Keyword Analysis: make model year attribute

Domain Selection: vehicle

Query Construction: vehicle (mk:honda, md:civic, yr:2007, review:?)

Source Selection: car-reviews-by-year.com > car-reviews.com > car-prices.com

Result Ranking
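
The first four routing steps can be sketched on the example query as below. Everything in the tables is a hypothetical stand-in: a real system would learn the keyword lexicon, domain schemas, and source descriptions rather than hard-code them, and ranking sources by attribute overlap is just one simple scoring choice. Result ranking is left out.

```python
import re

# Hypothetical knowledge: known values, attribute names, domains, and sources.
VALUES = {"honda": "make", "civic": "model"}
ATTRIBUTE_NAMES = {"review", "price"}
DOMAINS = {
    "vehicle": {"make", "model", "year", "review", "price"},
    "apartment": {"city", "rent", "bedrooms"},
}
SOURCES = {
    "car-prices.com": {"make", "model", "price"},
    "car-reviews.com": {"make", "model", "review"},
    "car-reviews-by-year.com": {"make", "model", "year", "review"},
}


def analyze(keyword):
    """Keyword analysis: tag a keyword as (attribute, value)."""
    if keyword in VALUES:
        return VALUES[keyword], keyword
    if re.fullmatch(r"(19|20)\d\d", keyword):
        return "year", keyword
    if keyword in ATTRIBUTE_NAMES:
        return keyword, "?"  # the user asks for this attribute
    return "text", keyword


def route(query):
    # Query construction: one attribute/value pair per keyword.
    structured = dict(analyze(k) for k in query.lower().split())
    wanted = set(structured)
    # Domain selection: the domain whose schema covers the most attributes.
    domain = max(DOMAINS, key=lambda d: len(DOMAINS[d] & wanted))
    # Source selection: rank sources by how many query attributes they cover.
    ranked = sorted(SOURCES, key=lambda s: len(SOURCES[s] & wanted), reverse=True)
    return domain, structured, ranked


domain, structured, ranked = route("honda civic 2007 review")
print(domain, structured, ranked)
```

With these toy tables the source that covers make, model, year, and review ranks first, matching the ordering in the example.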


Surfacing the Deep Web – A More Practical Solution?

Pre-compute all interesting form submissions for each HTML form
  Each form submission corresponds to a distinct URL
  Add URLs for each form submission into search engine index

Enables the reuse of existing search engine infrastructure
  Deep-web URLs are like any other URL (GET method)

Reduced load on deep-web sites
  Only in response to user clicks on a search result
  Search engine performance not dependent on deep-web source
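
For a GET form, turning form submissions into indexable URLs amounts to encoding each chosen combination of input values into a query string. The sketch below enumerates every combination for a small form; the function name, the form action URL, and the candidate values are assumptions for illustration. Because the full cross product grows quickly, a real system would keep only the submissions it considers interesting.

```python
from itertools import product
from urllib.parse import urlencode


def surface_urls(action, inputs):
    """Build one GET URL per combination of candidate input values.

    action: the form's action URL.
    inputs: mapping from input name to a list of candidate values.
    """
    names = sorted(inputs)
    urls = []
    for combo in product(*(inputs[name] for name in names)):
        urls.append(action + "?" + urlencode(dict(zip(names, combo))))
    return urls


urls = surface_urls(
    "http://example.com/search",  # hypothetical form action
    {"make": ["honda", "ford"], "year": ["2007"]},
)
for url in urls:
    print(url)
```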


Surfacing Challenges

1. Predicting the appropriate values for text inputs
   Valid input values are required for retrieving data
   Ingredients in recipes.com and zipcodes in borderstores.com

2. Predicting the correct input combinations
   Generating all possible URLs is wasteful and unnecessary
   Cars.com has ~500K listings, but 250M possible queries


Google’s Deep-Web crawling system

Affects more than 1000 queries per second
Enables access to more than a million Deep-Web sites
Spans 50+ languages and 100+ domains
Results served from 400K+ distinct forms per day
Results validate the utility of Deep-Web content

Other systems: http://www.deeppeep.org/ …


Searching for Structured Data on the Surface Web: EntitySearch@UIUC

Entity Extraction and Indexing

Ranking Entities Directly:
  Contextual: utilize entities’ surrounding context
  Uncertain: extractions are not “perfect”
  Holistic: evidence from many sources
  Discriminative: Web pages are of varying quality
  Associative: tell true associations from accidental ones

Other systems:
  NAGA (http://www.mpi-inf.mpg.de/~kasneci/naga/)
  Correlator (http://correlator.sandbox.yahoo.net/)


References

Large-Scale Deep Web Integration: Exploring and Querying Structured Data on the Deep Web. K. C.-C. Chang. Tutorial, SIGMOD 2006.

EntityRank: Searching Entities Directly and Holistically. T. Cheng, X. Yan, and K. C.-C. Chang. VLDB 2007.

Web-scale Data Integration: You can only afford to Pay As You Go. Jayant Madhavan, Shawn R. Jeffery, Shirley Cohen, Xin (Luna) Dong, David Ko, Cong Yu, Alon Halevy. CIDR 2007.

Google's Deep-Web Crawl. Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, Alon Halevy. VLDB 2008.


Thank you!