New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall...
Transcript of New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall...
![Page 1: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/1.jpg)
Data Mining TechniquesCS 6220 - Section 3 - Fall 2016
Lecture 17: Link AnalysisJan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec et al.)
![Page 2: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/2.jpg)
Graph Data: Media Networks
Connections between political blogsPolarization of the network [Adamic-Glance, 2005]
(adapted from:: Mining of Massive Datasets, http://www.mmds.org)
![Page 3: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/3.jpg)
Schedule Updates
![Page 4: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/4.jpg)
• Human-curated (e.g. Yahoo, Looksmart) • Hand-written descriptions • Wait time for inclusion
• Text-search(e.g. WebCrawler, Lycos) • Prone to term-spam
Web search before PageRank
(adapted from:: Mining of Massive Datasets, http://www.mmds.org)
![Page 5: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/5.jpg)
Web as a Directed Graph
(adapted from:: Mining of Massive Datasets, http://www.mmds.org)
![Page 6: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/6.jpg)
PageRank: Links as Votes
• Pages with more inbound links are more important • Inbound links from important pages carry more weight
Not all pages are equally important
Manyinboundlinks
Few/noinboundlinks
Links fromunimportantpages
Links fromimportantpages
(adapted from:: Mining of Massive Datasets, http://www.mmds.org)
![Page 7: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/7.jpg)
Example: PageRank Scores
B 38.4 C
34.3
E 8.1
F 3.9
D 3.9
A 3.3
1.61.6 1.6 1.6 1.6
(adapted from:: Mining of Massive Datasets, http://www.mmds.org)
![Page 8: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/8.jpg)
Simple Recursive Formulation
• A link’s vote is proportional to the importance of its source page
• If page j with importance rj has n out-links, each link gets rj / n votes
• Page j’s own importance is the sum of the votes on its in-links
j
ki
rj/3
rj/3rj/3rj = ri/3+rk/4
ri/3 rk/4
(adapted from:: Mining of Massive Datasets, http://www.mmds.org)
![Page 9: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/9.jpg)
Equivalent Formulation: Random Surfer
• At time t a surfer is on some page i • At time t+1 the surfer follows a
link to a new page at random • Define rank ri as fraction of time
spent on page i
j
ki
rj/3
rj/3rj/3rj = ri/3+rk/4
ri/3 rk/4
(adapted from:: Mining of Massive Datasets, http://www.mmds.org)
![Page 10: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/10.jpg)
PageRank: The “Flow” Model
• 3 equations, 3 unknowns • Impose constraint: ry + ra + rm = 1• Solution: ry = 2/5, ra = 2/5, rm = 1/5
∑→
=ji
ij
rrid
y
maa/2
y/2a/2
m
y/2
“Flow” equations:ry = ry /2 + ra /2ra = ry /2 + rm
rm = ra /2
(adapted from:: Mining of Massive Datasets, http://www.mmds.org)
![Page 11: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/11.jpg)
PageRank: The “Flow” Model
∑→
=ji
ij
rrid
y
maa/2
y/2a/2
m
y/2
“Flow” equations:ry = ry /2 + ra /2ra = ry /2 + rm
rm = ra /2
r = M·r
Matrix M is stochastic (i.e. columns sum to one) (adapted from:: Mining of Massive Datasets, http://www.mmds.org)
![Page 12: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/12.jpg)
PageRank: Eigenvector Problem
• PageRank: Solve for eigenvector r = M r with eigenvalue λ = 1
• Eigenvector with λ = 1 is guaranteed to exist since M is a stochastic matrix (i.e. if a = M b then Σ ai = Σ bi)
• Problem: There are billions of pages on the internet. How do we solve for eigenvector with order 1010 elements?
![Page 13: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/13.jpg)
PageRank: Power IterationModel for random Surfer:
• At time t = 0 pick a page at random • At each subsequent time t follow an
outgoing link at random
Probabilistic interpretation:
![Page 14: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/14.jpg)
PageRank: Power Iteration
y
maa/2
y/2a/2
m
y/2
pt converges to r. Iterate until |pt - pt -1| < ε(adapted from:: Mining of Massive Datasets, http://www.mmds.org)
![Page 15: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/15.jpg)
• PageRank is assumes a random walkmodel for individual surfers
• Equivalent assumption: flow modelin which equal fractions of surfers follow each link at every time
• Ergodicity: The equilibrium of the flow model is the same as the asymptotic distribution for an individual random walk
Aside: Ergodicity
![Page 16: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/16.jpg)
Aside: Ergodicity• PageRank is assumes a random walk
model for individual surfers • Equivalent assumption: flow model
in which equal fractions of surfers follow each link at every time
• Ergodicity: The equilibrium of the flow model is the same as the asymptotic distribution for an individual random walk
![Page 17: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/17.jpg)
Aside: Ergodicity• PageRank is assumes a random walk
model for individual surfers • Equivalent assumption: flow model
in which equal fractions of surfers follow each link at every time
• Ergodicity: The equilibrium of the flow model is the same as the asymptotic distribution for an individual random walk
Averaging over individuals is equivalentto averaging single individual over time
![Page 18: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/18.jpg)
PageRank: Problems
Dead end
Spider trap
1. Dead Ends • Nodes with no outgoing links. • Where do surfers go next?
2. Spider Traps • Subgraph with no outgoing
links to wider graph • Surfers are “trapped” with
no way out.
(adapted from:: Mining of Massive Datasets, http://www.mmds.org)
![Page 19: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/19.jpg)
Power Iteration: Dead Ends
y
maa/2
y/2a/2
y/2
Probability not conserved(adapted from:: Mining of Massive Datasets, http://www.mmds.org)
![Page 20: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/20.jpg)
Power Iteration: Dead Ends
y
maa/2
y/2a/2
y/2
Fixes “probability sink” issue
(teleport at dead ends)
(adapted from:: Mining of Massive Datasets, http://www.mmds.org)
![Page 21: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/21.jpg)
Power Iteration: Spider Traps
y
maa/2
y/2a/2
y/2
Probability accumulates in traps (surfers get stuck)
m
(adapted from:: Mining of Massive Datasets, http://www.mmds.org)
![Page 22: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/22.jpg)
rj =X
i! j
�ri
di+ (1� �) 1
N
Solution: Random TeleportsModel for teleporting random surfer:
• At time t = 0 pick a page at random • At each subsequent time t
• With probability β follow an outgoing link at random
• With probability 1-β teleport to a new initial location at random
PageRank Equation [Page & Brin 1998]
(adapted from:: Mining of Massive Datasets, http://www.mmds.org)
![Page 23: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/23.jpg)
Power Iteration: Teleports
y
ma (can use power iteration as normal)
(adapted from:: Mining of Massive Datasets, http://www.mmds.org)
![Page 24: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/24.jpg)
Power Iteration: Teleports
y
ma (can use power iteration as normal)
(adapted from:: Mining of Massive Datasets, http://www.mmds.org)
![Page 25: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/25.jpg)
Power Iteration: Teleports
y
ma (can use power iteration as normal)
(adapted from:: Mining of Massive Datasets, http://www.mmds.org)
![Page 26: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/26.jpg)
Computing PageRank
• M is sparse - only store nonzero entries • Space proportional roughly to number of links • Say 10N, or 4*10*1 billion = 40GB • Still won’t fit in memory, but will fit on disk
0 3 1, 5, 71 5 17, 64, 113, 117, 245
2 2 13, 23
source node degree destination nodes
(adapted from:: Mining of Massive Datasets, http://www.mmds.org)
![Page 27: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/27.jpg)
Block-based Update Algorithm• Break rnew into k blocks that fit in memory • Scan M and rold once for each block
0 4 0, 1, 3, 51 2 0, 52 2 3, 4
src degree destination01
23
45
012345
rnew rold
M
(adapted from:: Mining of Massive Datasets, http://www.mmds.org)
![Page 28: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/28.jpg)
Block-Stripe Update Algorithm
0 4 0, 11 3 02 2 1
src degree destination
01
23
45
012345
rnew
rold
0 4 51 3 52 2 4
0 4 32 2 3
Break M into stripes: Each stripe contains only destination nodes in the corresponding block of rnew
(adapted from:: Mining of Massive Datasets, http://www.mmds.org)
![Page 29: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/29.jpg)
First Spammers: Term Spam• How do you make your page appear to be
about movies? • (1) Add the word movie 1,000 times to your page • Set text color to the background color, so only search
engines would see it • (2) Or, run the query “movie” on your
target search engine • See what page came first in the listings • Copy it into your page, make it “invisible”
• These and similar techniques are term spam
(adapted from:: Mining of Massive Datasets, http://www.mmds.org)
![Page 30: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/30.jpg)
Google’s Solution to Term Spam• Believe what people say about you, rather
than what you say about yourself • Use words in the anchor text (words that appear
underlined to represent the link) and its surrounding text
• PageRank as a tool to measure the “importance” of Web pages
(adapted from:: Mining of Massive Datasets, http://www.mmds.org)
![Page 31: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/31.jpg)
Google vs. Spammers: Round 2!• Once Google became the dominant search
engine, spammers began to work out ways to fool Google
• Spam farms were developed to concentrate PageRank on a single page
• Link spam: • Creating link structures that
boost PageRank of a particular page
(adapted from:: Mining of Massive Datasets, http://www.mmds.org)
![Page 32: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/32.jpg)
Link Spamming• Three kinds of web pages from a
spammer’s point of view • Inaccessible pages • Accessible pages
• e.g., blog comments pages • spammer can post links to his pages
• Owned pages • Completely controlled by spammer • May span multiple domain names
(adapted from:: Mining of Massive Datasets, http://www.mmds.org)
![Page 33: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/33.jpg)
Link Farms• Spammer’s goal:
• Maximize the PageRank of target page t
• Technique: • Get as many links from accessible pages as
possible to target page t • Construct “link farm” to get PageRank
multiplier effect
(adapted from:: Mining of Massive Datasets, http://www.mmds.org)
![Page 34: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/34.jpg)
Link Farms
Inaccessible
t
Accessible Owned
1
2
M
One of the most common and effective organizations for a link farm
Millions of farm pages
(adapted from:: Mining of Massive Datasets, http://www.mmds.org)
![Page 35: New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang,](https://reader033.fdocuments.in/reader033/viewer/2022053023/60520ed7ac581a6e182e6658/html5/thumbnails/35.jpg)
PageRank: Extensions
• Topic-specific PageRank: • Restrict teleportation to some set S
of pages related to a specific topic • Set p0i = 1/|S| if i ∈ S, p0i = 0 otherwise
• Trust Propagation • Use set S of trusted pages for
teleport set