Web Mining


Page 1: Web Mining

Web Mining

Page 2: Web Mining

Two Key Problems Page Rank Web Content Mining

Page 3: Web Mining

PageRank Intuition: solve the recursive equation: “a page is important if important pages link to it.”

Technically: importance = the principal eigenvector of the stochastic matrix of the Web. A few fixups are needed.

Page 4: Web Mining

Stochastic Matrix of the Web

Enumerate pages. Page i corresponds to row and column i.

M[i,j] = 1/n if page j links to n pages, one of which is page i; 0 if j does not link to i.

M[i,j] is the probability we’ll next be at page i if we are now at page j.

Page 5: Web Mining

Example

Suppose page j links to 3 pages, one of which is page i. Then M[i,j] = 1/3.

Page 6: Web Mining

Random Walks on the Web

Suppose v is a vector whose ith component is the probability that we are at page i at a certain time.

If we follow a link from i at random, the probability distribution for the page we are then at is given by the vector M v.

Page 7: Web Mining

Random Walks --- (2) Starting from any vector v, the limit M (M (…M (M v ) …)) is the distribution of page visits during a random walk.

Intuition: pages are important in proportion to how often a random walker would visit them.

The math: limiting distribution = principal eigenvector of M = PageRank.

Page 8: Web Mining

Example: The Web in 1839

Pages: Yahoo (y), Amazon (a), M’soft (m).

        y    a    m
  y [ 1/2  1/2   0 ]
  a [ 1/2   0    1 ]
  m [  0   1/2   0 ]

Page 9: Web Mining

Simulating a Random Walk Start with the vector v = [1,1,…,1], representing the idea that each Web page is given one unit of importance.

Repeatedly apply the matrix M to v, allowing the importance to flow like a random walk.

The limit exists, but about 50 iterations is sufficient to estimate the final distribution.

Page 10: Web Mining

Example Equations v = M v :

y = y/2 + a/2
a = y/2 + m
m = a/2

Successive iterates (y, a, m):
(1, 1, 1) → (1, 3/2, 1/2) → (5/4, 1, 3/4) → (9/8, 11/8, 1/2) → … → (6/5, 6/5, 3/5)
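The iteration above is easy to reproduce. A minimal power-iteration sketch in Python, using exact fractions for the 3-page example (the matrix and the ~50-iteration figure come from the slides; variable names are mine):

```python
from fractions import Fraction

F = Fraction
# Stochastic matrix of the 1839 Web: column j holds page j's
# out-link probabilities (y = Yahoo, a = Amazon, m = M'soft).
M = [[F(1, 2), F(1, 2), F(0)],   # y = y/2 + a/2
     [F(1, 2), F(0),    F(1)],   # a = y/2 + m
     [F(0),    F(1, 2), F(0)]]   # m = a/2

v = [F(1), F(1), F(1)]           # one unit of importance per page
for _ in range(50):              # ~50 iterations suffice
    v = [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

print([float(x) for x in v])     # approaches [6/5, 6/5, 3/5]
```

The total importance stays at 3 throughout, since M is stochastic.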

Page 11: Web Mining

Solving The Equations Because there are no constant terms, these 3 equations in 3 unknowns do not have a unique solution.

Add in the fact that y + a + m = 3 to solve.

In Web-sized examples, we cannot solve by Gaussian elimination; we need to use relaxation (= iterative solution).

Page 12: Web Mining

Real-World Problems

Some pages are “dead ends” (have no links out). Such a page causes importance to leak out.

Other (groups of) pages are spider traps (all out-links are within the group). Eventually spider traps absorb all importance.

Page 13: Web Mining

Microsoft Becomes Dead End

Pages: Yahoo (y), Amazon (a), M’soft (m).

        y    a    m
  y [ 1/2  1/2   0 ]
  a [ 1/2   0    0 ]
  m [  0   1/2   0 ]

Page 14: Web Mining

Example Equations v = M v :

y = y/2 + a/2
a = y/2
m = a/2

Successive iterates (y, a, m):
(1, 1, 1) → (1, 1/2, 1/2) → (3/4, 1/2, 1/4) → (5/8, 3/8, 1/4) → … → (0, 0, 0)

Page 15: Web Mining

M’soft Becomes Spider Trap

Pages: Yahoo (y), Amazon (a), M’soft (m).

        y    a    m
  y [ 1/2  1/2   0 ]
  a [ 1/2   0    0 ]
  m [  0   1/2   1 ]

Page 16: Web Mining

Example Equations v = M v :

y = y/2 + a/2
a = y/2
m = a/2 + m

Successive iterates (y, a, m):
(1, 1, 1) → (1, 1/2, 3/2) → (3/4, 1/2, 7/4) → (5/8, 3/8, 2) → … → (0, 0, 3)

Page 17: Web Mining

Google Solution to Traps, Etc. “Tax” each page a fixed percentage at each iteration. Add the same constant to all pages.

Models a random walk with a fixed probability of going to a random place next.

Page 18: Web Mining

Example: Previous with 20% Tax Equations v = 0.8(M v ) + 0.2:

y = 0.8(y/2 + a/2) + 0.2
a = 0.8(y/2) + 0.2
m = 0.8(a/2 + m) + 0.2

Successive iterates (y, a, m):
(1, 1, 1) → (1.00, 0.60, 1.40) → (0.84, 0.60, 1.56) → (0.776, 0.536, 1.688) → … → (7/11, 5/11, 21/11)
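A sketch of the taxed iteration on the spider-trap matrix (the 0.8/0.2 split is from the slide; variable names are mine):

```python
# Spider-trap matrix: M'soft (m) links only to itself.
M = [[0.5, 0.5, 0.0],   # y
     [0.5, 0.0, 0.0],   # a
     [0.0, 0.5, 1.0]]   # m

beta, tax = 0.8, 0.2
v = [1.0, 1.0, 1.0]
for _ in range(50):
    # v = 0.8 (M v) + 0.2: tax 20% of each page's importance and
    # hand every page the same constant back, so the trap can no
    # longer absorb all the importance.
    v = [beta * sum(M[i][j] * v[j] for j in range(3)) + tax
         for i in range(3)]

print(v)  # approaches [7/11, 5/11, 21/11]
```

Without the tax (beta = 1, tax = 0) the same loop drives y and a to 0, matching the previous slide.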

Page 19: Web Mining

General Case In this example, because there are no dead-ends, the total importance remains at 3.

In examples with dead-ends, some importance leaks out, but the total remains finite.

Page 20: Web Mining

Solving the Equations Because there are constant terms, we can expect to solve small examples by Gaussian elimination.

Web-sized examples still need to be solved by relaxation.

Page 21: Web Mining

Speeding Convergence Newton-like prediction of where components of the principal eigenvector are heading.

Take advantage of locality in the Web.

Each technique can reduce the number of iterations by 50%. Important --- PageRank takes time!

Page 22: Web Mining

Web Content Mining The Web is perhaps the single largest data source in the world.

Much of Web (content) mining is about:
Data/information extraction from semi-structured objects and free text, and
Integration of the extracted data/information.

Due to the heterogeneity and lack of structure, mining and integration are challenging tasks.

Page 23: Web Mining
Page 24: Web Mining
Page 25: Web Mining
Page 26: Web Mining

Wrapper induction Using machine learning to generate extraction rules:
The user marks the target items in a few training pages.
The system learns extraction rules from these pages.
The rules are applied to extract target items from other pages.

Many wrapper induction systems, e.g., WIEN (Kushmerick et al., IJCAI-97), Softmealy (Hsu and Dung, 1998), Stalker (Muslea et al., Agents-99), BWI (Freitag and McCallum, AAAI-00), WL2 (Cohen et al., WWW-02), IDE (Liu and Zhai, WISE-05), Thresher (Hogue and Karger, WWW-05).

Page 27: Web Mining

Stalker: A wrapper induction system (Muslea et al. Agents-99)

E1: 513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515
E2: 90 Colfax, <b>Palms</b>, Phone (800) 508-1570
E3: 523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293
E4: 403 La Tijera, <b>Watts</b>, Phone: (310) 798-0008

We want to extract the area code.

Start rules:
R1: SkipTo(()
R2: SkipTo(-<b>)

End rules:
R3: SkipTo())
R4: SkipTo(</b>)

Page 28: Web Mining

Learning extraction rules Stalker uses sequential covering to learn extraction rules for each target item.
In each iteration, it learns a perfect rule that covers as many positive items as possible without covering any negative items.
Once a positive item is covered by a rule, the whole example is removed.
The algorithm ends when all the positive items are covered. The result is an ordered list of all learned rules.

Page 29: Web Mining

Rule induction through an example Training examples:

E1: 513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515
E2: 90 Colfax, <b>Palms</b>, Phone (800) 508-1570
E3: 523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293
E4: 403 La Tijera, <b>Watts</b>, Phone: (310) 798-0008

We learn a start rule for the area code. Assume the algorithm starts with E2. It creates three initial candidate rules with the first prefix symbol and two wildcards:
R1: SkipTo(()
R2: SkipTo(Punctuation)
R3: SkipTo(Anything)

R1 is perfect. It covers two positive examples but no negative example.

Page 30: Web Mining

Rule induction (cont …)

E1: 513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515
E2: 90 Colfax, <b>Palms</b>, Phone (800) 508-1570
E3: 523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293
E4: 403 La Tijera, <b>Watts</b>, Phone: (310) 798-0008

R1 covers E2 and E4, which are removed. E1 and E3 need additional rules. Three candidates are created:
R4: SkipTo(<b>)
R5: SkipTo(HtmlTag)
R6: SkipTo(Anything)

None is good. Refinement is needed. Stalker chooses R4 to refine, i.e., to add additional symbols, to specialize it. It will find R7: SkipTo(-<b>), which is perfect.
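As an illustration only (this is not Stalker's actual code or API), the learned SkipTo rules can be read as ordered landmark searches; a minimal sketch over the slide's examples:

```python
def skip_to(text, landmark, start=0):
    """Index just past the first occurrence of landmark, or None."""
    i = text.find(landmark, start)
    return None if i == -1 else i + len(landmark)

def extract(text, start_rules, end_rules):
    """Try an ordered list of start rules, then of end rules."""
    for lm in start_rules:
        begin = skip_to(text, lm)
        if begin is not None:
            break
    else:
        return None
    for lm in end_rules:
        end = text.find(lm, begin)
        if end != -1:
            return text[begin:end]
    return None

start_rules = ["(", "-<b>"]   # R1, then the refined R7
end_rules = [")", "</b>"]     # R3, then R4

E1 = "513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515"
E2 = "90 Colfax, <b>Palms</b>, Phone (800) 508-1570"
print(extract(E2, start_rules, end_rules))  # 800  (via R1/R3)
print(extract(E1, start_rules, end_rules))  # 800  (via R7/R4)
```

Note why R7 must be -<b> rather than plain <b>: on E1 a bare <b> landmark would stop at <b>Venice first.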

Page 31: Web Mining

Limitations of Supervised Learning Manual labeling is labor intensive and time consuming, especially if one wants to extract data from a huge number of sites.

Wrapper maintenance is very costly if Web sites change frequently:
It is necessary to detect when a wrapper stops working properly.
Any change may make existing extraction rules invalid.
Re-learning is needed, and most likely manual re-labeling as well.

Page 32: Web Mining

The RoadRunner System (Crescenzi et al., VLDB-01)

Given a set of positive examples (multiple sample pages), each containing one or more data records, generate a wrapper as a union-free regular expression (i.e., no disjunction).

The approach: To start, a sample page is taken as the wrapper. The wrapper is then refined by solving mismatches between the wrapper and each sample page, which generalizes the wrapper.
Page 33: Web Mining
Page 34: Web Mining

Compare with wrapper induction No manual labeling, but RoadRunner needs a set of positive pages of the same template, which is not necessary for a page with multiple data records. It produces a wrapper for whole pages, not for data records, yet a Web page can have many pieces of irrelevant information.

Issues of automatic extraction:
Hard to handle disjunctions.
Hard to generate attribute names for the extracted data.
Extracted data from multiple sites need integration, manual or automatic.

Page 35: Web Mining

Relation Extraction Assumptions:
No single source contains all the tuples.
Each tuple appears on many web pages.
Components of a tuple appear “close” together, e.g.,
  Foundation, by Isaac Asimov
  Isaac Asimov’s masterpiece, the <em>Foundation</em> trilogy
There are repeated patterns in the way tuples are represented on web pages.

Page 36: Web Mining

Naïve approach Study a few websites and come up with a set of patterns, e.g., regular expressions:

letter = [A-Za-z. ]
title  = letter{5,40}
author = letter{10,30}
<b>(title)</b> by (author)
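The pattern family above translates directly into a Python regex (a sketch; the length bounds come from the slide, the sample string is mine):

```python
import re

letter = r"[A-Za-z. ]"  # letters, periods, and spaces
# <b>(title)</b> by (author), with the slide's length bounds
pattern = re.compile(
    rf"<b>({letter}{{5,40}})</b> by ({letter}{{10,30}})")

html = "<b>Foundation and Empire</b> by Isaac Asimov"
m = pattern.search(html)
print(m.groups())  # ('Foundation and Empire', 'Isaac Asimov')
```

The character class excludes < and >, which is what stops the title group at the closing </b> tag.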

Page 37: Web Mining

Problems with naïve approach A pattern that works on one web page might produce nonsense when applied to another, so patterns need to be page-specific, or at least site-specific.

It is impossible for a human to exhaustively enumerate patterns for every relevant website, which results in low coverage.

Page 38: Web Mining

Better approach (Brin) Exploit the duality between patterns and tuples:
Find tuples that match a set of patterns.
Find patterns that match a lot of tuples.
DIPRE (Dual Iterative Pattern Relation Extraction).

Patterns match tuples; tuples generate patterns, in a cycle.

Page 39: Web Mining

DIPRE Algorithm
1. R ← SampleTuples, e.g., a small set of <title,author> pairs.
2. O ← FindOccurrences(R): occurrences of the tuples on web pages; keep some surrounding context.
3. P ← GenPatterns(O): look for patterns in the way tuples occur; make sure patterns are not too general!
4. R ← MatchingTuples(P).
5. Return, or go back to Step 2.
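A toy in-memory sketch of the DIPRE loop. Here a pattern is just a (prefix, middle, suffix) context triple; the real system also records a URL prefix and scores pattern specificity. The corpus strings and helper names are mine:

```python
import re

corpus = [
    "Books: <i>Foundation</i> by Isaac Asimov.",
    "Books: <i>Dune</i> by Frank Herbert.",
    "Books: <i>Neuromancer</i> by William Gibson.",
]

def find_occurrences(tuples, pages):
    """Step 2: find tuple occurrences, keeping surrounding context."""
    occs = []
    for title, author in tuples:
        for page in pages:
            i = page.find(title)
            j = page.find(author)
            if i != -1 and j > i:
                occs.append((page[:i],                      # prefix
                             page[i + len(title):j],        # middle
                             page[j + len(author):][:8]))   # suffix
    return occs

def gen_patterns(occs):
    """Step 3: keep only contexts with all three parts non-empty,
    a crude stand-in for 'not too general'."""
    return {(p, m, s) for p, m, s in occs if p and m and s}

def matching_tuples(patterns, pages):
    """Step 4: turn each pattern into a regex, collect new tuples."""
    found = set()
    for p, m, s in patterns:
        rx = re.compile(re.escape(p) + "(.+?)" + re.escape(m)
                        + "(.+?)" + re.escape(s))
        for page in pages:
            found.update(rx.findall(page))
    return found

tuples = {("Foundation", "Isaac Asimov")}   # step 1: seed tuples
for _ in range(2):                          # step 5: iterate
    patterns = gen_patterns(find_occurrences(tuples, corpus))
    tuples |= matching_tuples(patterns, corpus)
print(sorted(tuples))
```

Starting from a single seed pair, the loop recovers all three <title,author> tuples, because the shared page template yields a pattern that matches every page.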

Page 40: Web Mining

Web query interface integration Many integration tasks:
Integrating Web query interfaces (search forms)
Integrating extracted data
Integrating textual information
Integrating ontologies (taxonomy)
…

We only introduce integration of query interfaces. Many web sites provide forms to query the deep web. Applications: meta-search and meta-query.

Page 41: Web Mining

Global Query Interface [Diagram: the query interfaces of united.com, airtravel.com, delta.com, and hotwire.com are merged into one global query interface.]

Page 42: Web Mining

Synonym Discovery (He and Chang, KDD-04)

Discover synonym attributes, e.g., Author – Writer, Subject – Category.

Example interfaces:
S1: author, title, subject, ISBN
S2: writer, title, category, format
S3: name, title, keyword, binding

Holistic Model Discovery: considers all interfaces at once, grouping {author, name, writer} and {subject, category} as single concepts.

vs.

Pairwise Attribute Correspondence: matches interfaces two at a time, e.g., S1.author ↔ S3.name, S1.subject ↔ S2.category.

Page 43: Web Mining

Schema matching as correlation mining

Across many sources:

Synonym attributes are negatively correlated: synonym attributes are semantic alternatives, and thus rarely co-occur in query interfaces.

Grouping attributes show positive correlation: grouping attributes semantically complement each other, and thus often co-occur in query interfaces.
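A toy illustration of the idea, using plain co-occurrence lift rather than the actual correlation measure from He and Chang (the interface data below is made up):

```python
# Each interface is the set of attributes on one site's search form.
interfaces = [
    {"author", "title", "subject", "ISBN"},
    {"writer", "title", "category", "format"},
    {"name", "title", "keyword", "binding"},
    {"author", "title", "category"},
    {"writer", "title", "subject"},
]

def lift(a, b):
    """Co-occurrence lift: P(a,b) / (P(a) P(b)) over interfaces."""
    n = len(interfaces)
    pa = sum(a in s for s in interfaces) / n
    pb = sum(b in s for s in interfaces) / n
    pab = sum(a in s and b in s for s in interfaces) / n
    return pab / (pa * pb)

print(lift("author", "writer"))  # 0.0: never co-occur -> synonym candidates
print(lift("author", "title"))   # 1.0: co-occur whenever author appears
```

Low lift flags synonym candidates (negative correlation); high lift flags attributes that belong together in one interface (positive correlation).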