Graph-based Algorithms for Information Retrieval and
Natural Language Processing
Rada Mihalcea
University of North Texas
RANLP 2005
Introduction
Motivation:
– Graph theory is a well-studied discipline
– So are the fields of Information Retrieval (IR) and Natural Language Processing (NLP)
– Yet they are often perceived as completely different disciplines
Goal of the tutorial: provide an overview of methods and applications in IR and NLP that rely on graph-based algorithms, e.g.
– Graph-based algorithms: graph traversal, min-cut algorithms, node ranking
– Applied to: Web search, text understanding, text summarization, keyword extraction, text clustering
Outline
Part 1: Elements of graph theory
– Terminology, notations, algorithms on graphs
Part 2: Graph-based algorithms for IR
– Web search
– Text clustering and classification
Part 3: Graph-based algorithms for NLP
– Word Sense Disambiguation
– Clustering of entities (names/numbers)
– Thesaurus construction
– Text summarization
What is a Graph?
A graph G = (V,E) is composed of:
– V: a set of vertices
– E: a set of edges connecting the vertices in V
An edge e = (u,v) is a pair of vertices.
Example:
V = {a,b,c,d,e}
E = {(a,b),(a,c),(a,d),(b,e),(c,d),(c,e),(d,e)}
Terminology: Adjacent and Incident
If (v0, v1) is an edge in an undirected graph:
– v0 and v1 are adjacent
– the edge (v0, v1) is incident on vertices v0 and v1
If <v0, v1> is an edge in a directed graph:
– v0 is adjacent to v1, and v1 is adjacent from v0
– the edge <v0, v1> is incident on v0 and v1
Terminology: Degree of a Vertex
The degree of a vertex is the number of edges incident to that vertex.
For a directed graph:
– the in-degree of a vertex v is the number of edges that have v as the head
– the out-degree of a vertex v is the number of edges that have v as the tail
If di is the degree of vertex i in a graph G with n vertices and e edges, the number of edges is
e = ( Σ_{i=0}^{n-1} di ) / 2
Why? Since adjacent vertices each count the adjoining edge, each edge is counted twice.
Examples
[Figure: example graphs G1 and G2 labeled with vertex degrees, and a directed graph G3 labeled with in-degrees and out-degrees (e.g., in: 1, out: 2).]
Terminology: Path
path: a sequence of vertices v1, v2, …, vk such that consecutive vertices vi and vi+1 are adjacent.
[Figure: two copies of the example graph on vertices a–e, highlighting the paths a b e d c and b e d c.]
More Terminology
simple path: no repeated vertices
cycle: a simple path, except that the last vertex is the same as the first vertex
[Figure: the example graph with a simple path b e c and a cycle a c d a highlighted.]
Even More Terminology
subgraph: a subset of vertices and edges forming a graph
connected component: a maximal connected subgraph; e.g., the graph in the figure has 3 connected components
connected graph: any two vertices are connected by some path
[Figure: an example of a connected graph and of a graph that is not connected.]
Subgraph Examples
[Figure: (a) some of the subgraphs of G1; (b) some of the subgraphs of G3.]
More …
tree: a connected graph without cycles
forest: a collection of trees
[Figure: a forest consisting of three trees.]
Connectivity
Let n = #vertices and m = #edges.
A complete graph: one in which all pairs of vertices are adjacent.
How many total edges in a complete graph?
– Each of the n vertices is incident to n-1 edges; however, this counts each edge twice. Therefore m = n(n-1)/2.
Therefore, if a graph is not complete, m < n(n-1)/2.
[Figure: a complete graph with n = 5 and m = 5·4/2 = 10.]
More Connectivity
n = #vertices, m = #edges
For a tree: m = n - 1
If m < n - 1, G is not connected.
[Figure: a tree with n = 5, m = 4, and a disconnected graph with n = 5, m = 3.]
Directed and Undirected Graphs
An undirected graph is one in which the pair of vertices in an edge is unordered: (v0, v1) = (v1, v0)
A directed graph is one in which each edge is an ordered pair of vertices: <v0, v1> != <v1, v0>, where v0 is the tail and v1 the head of the edge
Weighted and Unweighted Graphs
Edges may or may not have a weight associated with them.
[Figure: an unweighted graph, and a weighted s-t network whose edges carry weights such as 15, 5, 10, 4.]
Graph Representations: Adjacency Matrix
Let G = (V,E) be a graph with n vertices.
The adjacency matrix of G is a two-dimensional n by n array, say adj_mat:
– if the edge (vi, vj) is in E(G), adj_mat[i][j] = 1
– if there is no such edge in E(G), adj_mat[i][j] = 0
The adjacency matrix of an undirected graph is symmetric; the adjacency matrix of a digraph need not be symmetric.
Examples of Adjacency Matrices

G1 (undirected, 4 vertices):
0 1 1 1
1 0 1 1
1 1 0 1
1 1 1 0

G3 (directed, 3 vertices):
0 1 0
1 0 1
0 0 0

[The slide also shows the 8 x 8 adjacency matrix of G2.]

The matrix of an undirected graph is symmetric, so only half of it needs to be stored: n^2/2 entries, versus n^2 for a directed graph.
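The construction above can be sketched in a few lines of Python; `adjacency_matrix` is a hypothetical helper, not code from the tutorial:

```python
def adjacency_matrix(n, edges, directed=False):
    """Return an n x n 0/1 adjacency matrix for the given edge list."""
    adj_mat = [[0] * n for _ in range(n)]
    for u, v in edges:
        adj_mat[u][v] = 1
        if not directed:
            adj_mat[v][u] = 1   # undirected graphs yield a symmetric matrix
    return adj_mat

# G1: the complete graph on 4 vertices
g1 = adjacency_matrix(4, [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)])
# G3: a small directed graph with edges 0->1, 1->0, 1->2
g3 = adjacency_matrix(3, [(0, 1), (1, 0), (1, 2)], directed=True)
```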
Advantages of Adjacency Matrix
Starting with an adjacency matrix, it is easy to determine the links between vertices.
The degree of vertex i is Σ_{j=0}^{n-1} adj_mat[i][j].
For a digraph (= directed graph), the row sum is the out-degree, while the column sum is the in-degree:
ind(vi) = Σ_{j=0}^{n-1} A[j][i]
outd(vi) = Σ_{j=0}^{n-1} A[i][j]
Graph Representations: Adjacency Lists
Each row of the adjacency matrix is represented as a list.
An undirected graph with n vertices and e edges => n head nodes and 2e list nodes.
Pros: more compact, memory-efficient representation
Cons: sometimes less efficient computation of in- or out-degrees
Examples of Adjacency Lists

G1: 0 → 1 2 3
    1 → 0 2 3
    2 → 0 1 3
    3 → 0 1 2

G3: 0 → 1
    1 → 0 2
    2 → (empty)

G4: 0 → 1 2
    1 → 0 3
    2 → 0 3
    3 → 1 2
    4 → 5
    5 → 4 6
    6 → 5 7
    7 → 6
Operations on Adjacency Lists
• degree of a vertex in an undirected graph: # of nodes in its adjacency list
• # of edges in a graph: determined in O(n+e)
• out-degree of a vertex in a directed graph: # of nodes in its adjacency list
• in-degree of a vertex in a directed graph: traverse the whole data structure
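A minimal sketch of these operations on an adjacency-list representation (the helper names are illustrative, not from the slides):

```python
from collections import defaultdict

def adjacency_lists(edges, directed=False):
    """Build {vertex: list of neighbors} from an edge list."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        if not directed:
            adj[v].append(u)   # 2e list nodes for an undirected graph
    return adj

def out_degree(adj, v):
    # out-degree = length of v's adjacency list
    return len(adj[v])

def in_degree(adj, v):
    # in-degree requires traversing the whole structure
    return sum(lst.count(v) for lst in adj.values())

g3 = adjacency_lists([(0, 1), (1, 0), (1, 2)], directed=True)
```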
Algorithms on Graphs
– Graph traversal
– Min-Cut / Max-Flow
– Ranking algorithms
– Spectral clustering
– Other: Dijkstra's algorithm, minimum spanning tree, …
Graph Traversal
Problem: search for a certain node, or traverse all nodes in the graph
Depth-First Search
– Once a possible path is found, continue the search until the end of the path
Breadth-First Search
– Start several paths at a time, and advance in each one step at a time
Depth-First Search
[Figure: a 4 x 4 grid of vertices A–P, traversed depth-first.]
Depth-First Search
Algorithm DFS(v)
Input: a vertex v in a graph
Output: a labeling of the edges as "discovery" edges and "back edges"
for each edge e incident on v do
    if edge e is unexplored then
        let w be the other endpoint of e
        if vertex w is unexplored then
            label e as a discovery edge
            recursively call DFS(w)
        else
            label e as a back edge
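The pseudocode above can be rendered directly in Python; this sketch assumes an adjacency-list input for an undirected graph and stores edge labels in a dictionary:

```python
def dfs(adj, v, visited=None, labels=None):
    """Label the edges reachable from v as 'discovery' or 'back' edges."""
    if visited is None:
        visited, labels = set(), {}
    visited.add(v)
    for w in adj[v]:
        e = frozenset((v, w))          # an undirected edge
        if e not in labels:            # edge e is unexplored
            if w not in visited:       # vertex w is unexplored
                labels[e] = "discovery"
                dfs(adj, w, visited, labels)
            else:
                labels[e] = "back"
    return labels

# a triangle: one of the three edges becomes a back edge
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
labels = dfs(triangle, 0)
```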
Breadth-First Search
[Figure: four snapshots (a)–(d) of BFS on the 4 x 4 grid of vertices A–P, discovering levels 0, 1, 2, 3.]
Breadth-First Search
Algorithm BFS(s)
Input: a vertex s in a graph
Output: a labeling of the edges as "discovery" edges and "cross edges"
initialize container L0 to contain vertex s
i ← 0
while Li is not empty do
    create container Li+1, initially empty
    for each vertex v in Li do
        for each edge e incident on v do
            if edge e is unexplored then
                let w be the other endpoint of e
                if vertex w is unexplored then
                    label e as a discovery edge
                    insert w into Li+1
                else
                    label e as a cross edge
    i ← i + 1
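A Python sketch of the level-by-level BFS above (adjacency-list input assumed, undirected edges labeled in a dictionary):

```python
def bfs(adj, s):
    """Return the BFS levels L0, L1, ... and edge labels ('discovery'/'cross')."""
    explored = {s}
    labels = {}
    levels = [[s]]
    while levels[-1]:
        next_level = []
        for v in levels[-1]:
            for w in adj[v]:
                e = frozenset((v, w))
                if e not in labels:              # edge e is unexplored
                    if w not in explored:        # vertex w is unexplored
                        labels[e] = "discovery"
                        explored.add(w)
                        next_level.append(w)
                    else:
                        labels[e] = "cross"
        levels.append(next_level)
    return levels[:-1], labels

levels, labels = bfs({0: [1, 2], 1: [0, 2], 2: [0, 1]}, 0)
```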
Path Finding
Find a path from a source vertex s to a destination vertex d.
Use graph search starting at s and terminating as soon as we reach d.
– Need to remember the edges traversed
Use depth-first search, or use breadth-first search.
Path Finding with Depth-First Search
[Figure: DFS from start vertex A to destination G. Calls: DFS on A, DFS on B, DFS on C, return to the call on B, DFS on D, DFS on G — destination found. The path is implicitly stored in the DFS recursion: A, B, D, G.]
Path Finding with Breadth-First Search
[Figure: BFS from start vertex A to destination G using a queue. Initial call adds A; dequeue A, add B; dequeue B, add C and D; dequeue C, nothing to add; dequeue D, add G — destination found. The path must be stored separately.]
Max Flow Network
Max flow network: G = (V, E, s, t, u)
– (V, E) = directed graph, no parallel arcs
– two distinguished nodes: s = source, t = sink
– u(e) = capacity of arc e
[Figure: a flow network from s to t over nodes 2–7, with arc capacities 15, 5, 30, 15, 10, 8, 15, 9, 6, 10, 10, 10, 15, 4, 4.]
Flows
An s-t flow is a function f: E → R that satisfies:
– for each e ∈ E: 0 ≤ f(e) ≤ u(e) (capacity)
– for each v ∈ V − {s, t}: Σ_{e in to v} f(e) = Σ_{e out of v} f(e) (conservation)
where "e out of v" ranges over the edges (v,w) ∈ E and "e in to v" over the edges (w,v) ∈ E.
[Figure: the example network annotated with both capacities and a flow of value 4.]
Flows
The value of a flow f is |f| = Σ_{e out of s} f(e).
[Figure: the same network with a flow of value |f| = 4.]
MAX FLOW: find the s-t flow that maximizes the net flow out of the source.
Network Flow Examples

Network        | Nodes                                      | Arcs                                    | Flow
communication  | telephone exchanges, computers, satellites | cables, fiber optics, microwave relays  | voice, video, packets
circuits       | gates, registers, processors               | wires                                   | current
mechanical     | joints                                     | rods, beams, springs                    | heat, energy
hydraulic      | reservoirs, pumping stations, lakes        | pipelines                               | fluid, oil
financial      | stocks, currency                           | transactions                            | money
transportation | airports, rail yards, street intersections | highways, railbeds, airway routes       | freight, vehicles, passengers
chemical       | sites                                      | bonds                                   | energy
Cuts
An s-t cut is a node partition (S, T) such that s ∈ S, t ∈ T.
– The capacity of an s-t cut (S, T) is cap(S, T) = Σ_{e out of S} u(e) = Σ_{(v,w) ∈ E, v ∈ S, w ∈ T} u(v,w).
Min s-t cut: find an s-t cut of minimum capacity.
[Figure: a cut of the example network with capacity 30.]
Cuts
[Figure: a different cut of the same network, with capacity 62.]
Flows and Cuts
L1. Let f be a flow, and let (S, T) be a cut. Then the net flow sent across the cut is equal to the amount reaching t:
Σ_{e out of S} f(e) − Σ_{e in to S} f(e) = Σ_{e out of s} f(e) = |f|
[Figure: the example network with a flow of value 24 and a cut illustrating the lemma.]
Flows and Cuts
[Figure: the same flow of value 24 across a different cut; the net flow across the cut is again 24.]
Flows and Cuts
Let f be a flow, and let (S, T) be a cut. Then
Σ_{e out of S} f(e) − Σ_{e in to S} f(e) = |f|
Proof by induction on |S|:
– Base case: S = {s}.
– Inductive step: assume the claim holds for |S'| = k − 1. Consider a cut (S, T) with |S| = k, where S = S' ∪ {v} for some v ≠ s, t. Adding v to S' changes the net flow across the cut by Σ_{e out of v} f(e) − Σ_{e in to v} f(e) = 0, by conservation.
[Figure: the sets S' and S before and after v is moved across the cut.]
Max-Flow Min-Cut Theorem
Max-Flow Min-Cut Theorem (Ford-Fulkerson, 1956): In any network, the value of the max flow is equal to the value of the min cut.
[Figure: the example network with a flow of value 28 and a cut of capacity 28.]
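The theorem suggests computing a max flow via augmenting paths; below is a compact Edmonds-Karp sketch (Ford-Fulkerson with BFS-chosen augmenting paths) on a small toy network, not the network from the slides:

```python
from collections import deque

def max_flow(cap, s, t):
    """cap: dict {u: {v: capacity}}; returns the value of a maximum s-t flow."""
    # residual capacities, including zero-capacity reverse arcs
    res = {u: dict(nbrs) for u, nbrs in cap.items()}
    for u in list(cap):
        for v in cap[u]:
            res.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        # BFS for an augmenting path in the residual graph
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow            # no augmenting path: flow is maximum
        # find the bottleneck along the path and push flow
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        b = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= b
            res[v][u] += b
        flow += b

# toy network: min cut {s} has capacity 3 + 2 = 5
net = {'s': {'a': 3, 'b': 2}, 'a': {'b': 1, 't': 2}, 'b': {'t': 3}, 't': {}}
```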
Random Walk Algorithms
Algorithms for link analysis, based on random walks
– Result in a ranking of vertex "importance"
Used in many fields:
– Social networks (co-authorship, citations)
– Web search engines
– Clustering
– NLP
– …
PageRank Intuition
(Brin & Page 1998)
Imagine a web surfer doing a simple random walk on the entire web for an infinite number of steps.
Occasionally, the surfer will get bored and instead of following a link pointing outward from the current page will jump to another random page.
At some point, the percentage of time spent at each page will converge to a fixed value.
This value is known as the PageRank of the page.
HITS Intuition
Hyperlink-Induced Topic Search (Kleinberg 1999)
Defines 2 types of importance:
– Authority: can be thought of as being an expert on the subject. "You are important if important hubs point to you."
– Hub: can be thought of as being knowledgeable about the subject. "You are important if you point to important authorities."
Hubs and authorities mutually reinforce each other.
Note: a node in a graph can be both an authority and a hub at the same time, e.g. cnbc.com.
Note: this is equivalent to performing a random walk where:
– on odd steps you walk from hubs to authorities (follow an outlink)
– on even steps you walk from authorities to hubs (follow an inlink)
[Figure: an example graph with nodes A–F and their ranking scores: 1.35, 0.63, 0.36, 1.17, 0.51, 1.47.]
Graph-based Ranking
Decide the importance of a vertex within a graph.
A link between two vertices = a vote
– Vertex A links to vertex B = vertex A "votes" for vertex B
– Iterative voting => ranking over all vertices
Models a random walk on the graph
– A walker takes random steps
– Converges to a stationary distribution of probabilities: the probability of finding the walker at a certain vertex
– Guaranteed to converge if the graph is:
  aperiodic – holds for non-bipartite graphs
  irreducible – holds for strongly connected graphs
Graph-based Ranking
Traditionally applied on directed graphs
– From a given vertex, the walker selects at random one of the out-edges
Applies also to undirected graphs
– From a given vertex, the walker selects at random one of the incident edges
Adapted to weighted graphs
– From a given vertex, the walker selects at random one of the out (directed graphs) or incident (undirected graphs) edges, with higher probability of selecting edges with higher weight
Algorithms
G = (V,E) – a graph
– In(Vi) = predecessors of Vi
– Out(Vi) = successors of Vi
PageRank (Brin & Page 1998) – integrates incoming / outgoing links in one formula, where d ∈ [0,1]:
PR(Vi) = (1 − d) + d · Σ_{Vj ∈ In(Vi)} PR(Vj) / |Out(Vj)|
Adapted to weighted graphs:
PR_w(Vi) = (1 − d) + d · Σ_{Vj ∈ In(Vi)} ( w_ji / Σ_{Vk ∈ Out(Vj)} w_jk ) · PR_w(Vj)
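The unweighted formula above translates into a short iterative sketch (plain Python, d = 0.85; the graph is given as a successor map with no dangling nodes, an assumption of this toy version):

```python
def pagerank(out_links, d=0.85, eps=1e-4):
    """Iterate PR(Vi) = (1-d) + d * sum over In(Vi) of PR(Vj)/|Out(Vj)|."""
    nodes = list(out_links)
    in_links = {v: [u for u in nodes if v in out_links[u]] for v in nodes}
    pr = {v: 1.0 for v in nodes}
    while True:
        new = {v: (1 - d) + d * sum(pr[u] / len(out_links[u])
                                    for u in in_links[v])
               for v in nodes}
        # stop once the maximum change drops below the threshold
        if max(abs(new[v] - pr[v]) for v in nodes) < eps:
            return new
        pr = new

# c receives votes from both a and b, so it outranks b
pr = pagerank({'a': ['b', 'c'], 'b': ['c'], 'c': ['a']})
```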
Algorithms
Positional Function – positional power / weakness (Herings et al. 2001):
POS_P(Vi) = (1/|V|) · Σ_{Vj ∈ Out(Vi)} (1 + POS_W(Vj))
POS_W(Vi) = (1/|V|) · Σ_{Vj ∈ In(Vi)} (1 + POS_P(Vj))
HITS – Hyperlink-Induced Topic Search (Kleinberg 1999) – authorities / hubs:
HITS_A(Vi) = Σ_{Vj ∈ In(Vi)} HITS_H(Vj)
HITS_H(Vi) = Σ_{Vj ∈ Out(Vi)} HITS_A(Vj)
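The mutually recursive HITS updates can be sketched as below; the L2 normalization each round is a standard choice to keep scores bounded, not something specified on the slide:

```python
def hits(out_links, iters=50):
    """Iterate authority/hub updates; returns (auth, hub) score dicts."""
    nodes = list(out_links)
    in_links = {v: [u for u in nodes if v in out_links[u]] for v in nodes}
    auth = {v: 1.0 for v in nodes}
    hub = {v: 1.0 for v in nodes}
    for _ in range(iters):
        auth = {v: sum(hub[u] for u in in_links[v]) for v in nodes}
        hub = {v: sum(auth[w] for w in out_links[v]) for v in nodes}
        # L2-normalize so the scores do not grow without bound
        na = sum(x * x for x in auth.values()) ** 0.5 or 1.0
        nh = sum(x * x for x in hub.values()) ** 0.5 or 1.0
        auth = {v: x / na for v, x in auth.items()}
        hub = {v: x / nh for v, x in hub.items()}
    return auth, hub

# h1 and h2 both point to a: a is the authority, h1/h2 are the hubs
auth, hub = hits({'h1': ['a'], 'h2': ['a'], 'a': []})
```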
Convergence
Convergence: the error between two consecutive iterations drops below a small threshold value (e.g. 0.0001):
Error = |S^{k+1}(Vi) − S^k(Vi)|
Text-based graphs:
– convergence usually achieved after 20-40 iterations
– more iterations for undirected graphs
Convergence Curves
[Figure: convergence curves (score vs. iteration, 30 iterations) for a directed and an undirected graph with 250 vertices and 250 edges.]
Spectral Clustering
Algorithms that cluster data using graph representations.
Similarity graph:
– Represent the dataset as a weighted graph G(V,E)
– Distance decreases as similarity increases
Clustering can be viewed as partitioning a similarity graph.
Bi-partitioning: divide the vertices into two disjoint groups (A, B).
[Figure: a six-vertex weighted similarity graph split into clusters A = {1,2,3} and B = {4,5,6}.]
Clustering Objectives
Traditional definition of a "good" clustering:
1. Points assigned to the same cluster should be highly similar.
2. Points assigned to different clusters should be highly dissimilar.
Apply these objectives to our graph representation: minimize the weight of between-group connections.
[Figure: the six-vertex similarity graph; the cut between {1,2,3} and {4,5,6} crosses only the low-weight edges 0.1 and 0.2.]
Options:
1. Find the minimum cut between two clusters
2. Apply spectral graph theory
Spectral Graph Theory
Possible approach:
– Represent a similarity graph as a matrix
– Apply knowledge from linear algebra
Spectral Graph Theory:
– Analyse the "spectrum" of the matrix representing a graph
– Spectrum: the eigenvectors of a graph, ordered by the magnitude (strength) of their corresponding eigenvalues {λ1, λ2, …, λn}
The eigenvalues and eigenvectors of a matrix provide global information about its structure: W x = λ x, where W is the n x n matrix and x an eigenvector.
Matrix Representations
Adjacency matrix (A)
– n x n matrix
– A = [w_ij]: edge weight between vertices xi and xj

     x1   x2   x3   x4   x5   x6
x1   0    0.8  0.6  0    0.1  0
x2   0.8  0    0.8  0    0    0
x3   0.6  0.8  0    0.2  0    0
x4   0    0    0.2  0    0.8  0.7
x5   0.1  0    0    0.8  0    0.8
x6   0    0    0    0.7  0.8  0

Important property: symmetric matrix.
Matrix Representations (continued)
Degree matrix (D)
– n x n diagonal matrix
– D(i,i) = Σ_j w_ij: total weight of edges incident to vertex xi

     x1   x2   x3   x4   x5   x6
x1   1.5  0    0    0    0    0
x2   0    1.6  0    0    0    0
x3   0    0    1.6  0    0    0
x4   0    0    0    1.7  0    0
x5   0    0    0    0    1.7  0
x6   0    0    0    0    0    1.5
Matrix Representations (continued)
Laplacian matrix (L)
– n x n symmetric matrix
– L = D - A

     x1   x2   x3   x4   x5   x6
x1   1.5  -0.8 -0.6 0    -0.1 0
x2   -0.8 1.6  -0.8 0    0    0
x3   -0.6 -0.8 1.6  -0.2 0    0
x4   0    0    -0.2 1.7  -0.8 -0.7
x5   -0.1 0    0    -0.8 1.7  -0.8
x6   0    0    0    -0.7 -0.8 1.5
Spectral Clustering Algorithms
Three basic stages:
1. Pre-processing: construct a matrix representation of the dataset.
2. Decomposition: compute the eigenvalues and eigenvectors of the matrix; map each point to a lower-dimensional representation based on one or more eigenvectors.
3. Grouping: assign points to two or more clusters, based on the new representation.
Spectral Bi-partitioning Algorithm
1. Pre-processing
– Build the Laplacian matrix L of the graph (as above).
2. Decomposition
– Find the eigenvalues Λ and eigenvectors X of the matrix L. For the example graph, Λ = {0.0, 0.4, 2.2, 2.3, 2.5, 3.0}.
– Map each vertex to the corresponding component of the eigenvector for λ2 (the Fiedler vector; λ2 is the algebraic connectivity of the graph):
  x1: 0.2, x2: 0.2, x3: 0.2, x4: -0.4, x5: -0.7, x6: -0.7
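The decomposition step can be sketched with numpy (assumed available) on the example weight matrix; note that the sign of an eigenvector is arbitrary, so only the two-way split itself is meaningful:

```python
import numpy as np

# adjacency matrix A of the six-vertex example graph
A = np.array([[0.0, 0.8, 0.6, 0.0, 0.1, 0.0],
              [0.8, 0.0, 0.8, 0.0, 0.0, 0.0],
              [0.6, 0.8, 0.0, 0.2, 0.0, 0.0],
              [0.0, 0.0, 0.2, 0.0, 0.8, 0.7],
              [0.1, 0.0, 0.0, 0.8, 0.0, 0.8],
              [0.0, 0.0, 0.0, 0.7, 0.8, 0.0]])
D = np.diag(A.sum(axis=1))          # degree matrix
L = D - A                           # Laplacian

eigvals, eigvecs = np.linalg.eigh(L)   # eigh: L is symmetric
fiedler = eigvecs[:, 1]                # eigenvector of the 2nd-smallest eigenvalue
cluster_a = np.where(fiedler >= 0)[0]  # split vertices by sign
cluster_b = np.where(fiedler < 0)[0]
# the split separates {x1,x2,x3} from {x4,x5,x6}
```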
Spectral Bi-partitioning (continued)
Grouping:
– Sort the components of the reduced 1-dimensional vector.
– Identify clusters by splitting the sorted vector in two.
How to choose a splitting point?
– Naïve approaches: split at 0, or at the mean or median value
– More expensive approaches: attempt to minimise the (normalised) cut in 1 dimension
Splitting at 0 for the example: cluster A = the positive points {x1: 0.2, x2: 0.2, x3: 0.2}; cluster B = the negative points {x4: -0.4, x5: -0.7, x6: -0.7}.
3-Clusters
Let's assume a dataset with three clusters.
[Figure: the partitions obtained when using the 2nd eigenvector vs. the 3rd eigenvector.]
K-Way Spectral Clustering
How do we partition a graph into k clusters?
Two basic approaches:
1. Recursive bi-partitioning (Hagen et al., '91)
– Recursively apply the bi-partitioning algorithm in a hierarchical divisive manner.
2. Cluster multiple eigenvectors
– Build a reduced space from multiple eigenvectors.
– Use the 3rd eigenvector to further partition the clustering output of the 2nd eigenvector.
– Commonly used in recent work; multiple eigenvectors prevent instability due to information loss.
K-Eigenvector Clustering
K-eigenvector algorithm (Ng et al., '01)
1. Pre-processing
– Construct the scaled adjacency matrix A' = D^{-1/2} A D^{-1/2}
2. Decomposition
– Find the eigenvalues and eigenvectors of A'.
– Build the embedded space from the eigenvectors corresponding to the k largest eigenvalues.
3. Grouping
– Apply k-means to the reduced n x k space to produce k clusters:
  – Place k points into the space represented by the objects being clustered; these points represent the initial group centroids.
  – Assign each object to the group with the closest centroid.
  – When all objects have been assigned, recalculate the positions of the k centroids.
  – Repeat the assignment and update steps until the centroids no longer move.
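The grouping step above (Lloyd's k-means) can be sketched in plain Python; the rows of the embedded space would be passed in as `points`, and this toy version assumes the initial centroids are supplied:

```python
def kmeans(points, centroids, iters=100):
    """points, centroids: lists of coordinate tuples; returns the assignment."""
    for _ in range(iters):
        # assign each point to the closest centroid (squared distance)
        assign = [min(range(len(centroids)),
                      key=lambda c: sum((p - q) ** 2
                                        for p, q in zip(pt, centroids[c])))
                  for pt in points]
        # recompute each centroid as the mean of its assigned points
        new = []
        for c in range(len(centroids)):
            members = [pt for pt, a in zip(points, assign) if a == c]
            new.append(tuple(sum(x) / len(members) for x in zip(*members))
                       if members else centroids[c])
        if new == centroids:       # centroids no longer move
            return assign
        centroids = new
    return assign

# two well-separated groups of 2-D points
assign = kmeans([(0, 0), (0.1, 0), (5, 5), (5.1, 5)], [(0, 0), (5, 5)])
```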
Part 2: Graph-based Algorithms for IR
Web-link Analysis for Improved IR
• Web: heterogeneous and unstructured
• No quality control on the web
• Commercial interest to manipulate rankings
• Use a model of Web page importance based on link analysis (Brin and Page 1998), (Kleinberg 1999)
• Related work:
  • academic citation analysis
  • clustering methods of link structure
  • Hubs and Authorities model
Back-links
The link structure of the Web approximates the importance / quality of Web pages.
PageRank:
– Pages with lots of back-links are important
– Back-links coming from important pages convey more importance to a page
– Problem: rank sink
Rank Sink
A cycle of pages pointed to by some incoming link.
Problem: this loop will accumulate rank but never distribute any rank outside.
Escape Term
Solution: a rank source E
PR(Vi) = (1 − d) · E(u) + d · Σ_{Vj ∈ In(Vi)} PR(Vj) / |Out(Vj)|
E(u) is some vector over the web pages
– uniform, favorite pages, etc.
Recall that the rank (PageRank) is the eigenvector corresponding to the largest eigenvalue.
Random Surfer Model
PageRank corresponds to the probability distribution of a random walk on the web graph.
E(u) can be re-phrased as: the random surfer gets bored periodically and jumps to a different page, rather than being kept in a loop forever.
Implementation
Computing resources: 24 million pages, 75 million URLs
– Assign a unique integer ID to each URL
– Sort and remove dangling links
– Assign initial ranks
– Iterate until convergence
– Add back dangling links and re-compute
Ranking Web-pages
Combined PageRank + vectorial model
Vectorial model:
– Finds similarity between query and documents
– Uses term weighting schemes (TF.IDF)
– Ranks documents based on similarity
– Ranking is query-specific
PageRank:
– Ranks documents based on importance
– Ranking is general (not query-specific)
Combined model:
– Weighted combination of vectorial and PageRank rankings
Issues
– Users are not random walkers: content-based methods
– Starting point distribution: actual usage data as starting vector
– Reinforcing effects / bias towards main pages
– No query-specific rank
– Linkage spam: PageRank favors pages that managed to get other pages to link to them; linkage is not necessarily a sign of relevancy, only of promotion (advertisement…)
Personalized PageRank
The initial value of the rank source E can change the ranking:
1. Uniform over all pages: e.g. copyright warnings, disclaimers, and mailing list archives may receive an overly high ranking
2. Total weight on a few sets of pages, based on a-priori known popularity, e.g. Netscape, McCarthy
3. Different weights depending on the query or topic:
– topic-sensitive PageRank (Haveliwala 2001)
– query-sensitive PageRank (Richardson 2002)
Link Analysis for IR
PageRank is a global ranking based on the web's graph structure.
PageRank relies on back-link information to "bring order to the web".
Proven in practice to exceed the performance of search engines based exclusively on vectorial / boolean models.
Can be adapted to "personalized" searches
– although personalized search engines are still not very popular
Text Classification
Given:
– A description of an instance, x ∈ X, where X is the instance language or instance space
– A fixed set of categories C = {c1, c2, …, cn}
Determine:
– The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C
Text Classification Examples
Assign labels to each document or web page:
– Labels are most often topics, such as Yahoo categories: e.g., "finance", "sports", "news>world>asia>business"
– Labels may be genres: e.g., "editorials", "movie-reviews", "news"
– Labels may be opinions: e.g., "like", "hate", "neutral"
– Labels may be domain-specific and binary: e.g., "interesting-to-me" : "not-interesting-to-me"; "spam" : "not-spam"; "is a toner cartridge ad" : "isn't"
Traditional Methods
Manual classification
– Used by Yahoo!, Looksmart, about.com, ODP, Medline
– Very accurate when the job is done by experts
– Consistent when the problem size and team are small
– Difficult and expensive to scale
Automatic document classification
– Hand-coded rule-based systems
  Used by CS departments' spam filters, Reuters, …
  E.g., assign a category if the document contains a given boolean combination of words
– Supervised learning of a document-label assignment function
  Many new systems rely on machine learning: k-Nearest Neighbors (simple, powerful), Naive Bayes (simple, common), support vector machines (newer, more powerful), …
Text Clustering
Goal similar to the task of text classification: classify text into categories
– No a-priori known set of categories: cluster the data into sets of similar documents
– Pros: no annotated data is required
– Cons: more difficult to assign labels
Traditional approaches:
– All clustering methods have also been applied to text clustering: agglomerative clustering, k-means, …
Spectral Learning / Clustering
Use graph representations of text collections for:
– Text clustering: spectral clustering
– Text classification: spectral learning
Recall bi-partitioning and k-way clustering:
– Use the 2nd or the first K eigenvectors as a reduced space
– Map vertices to the reduced space
– Cluster in the reduced space
  E.g. cluster the points in the 2nd eigenvector into two classes, positive vs. negative => bi-partitioning
  Use k-means to cluster the representation obtained from the top K eigenvectors
Spectral Learning / Clustering
Spectral representation (Kamvar et al., 2003):
– Build the similarity graph for the collection
– Build the Laplacian matrix (alternatives: normalized L, LSA, etc.)
– Find the K largest eigenvectors of the Laplacian => X
– Normalize the rows in X to unit length
Clustering:
– Use k-means (or another clustering algorithm) to cluster the rows of X into K clusters
Classification:
– Assign known labels to the training points in the collection, using the representation in X
– Classify the unlabeled rows corresponding to the test points in the collection, using the representation in X, with the help of any learning algorithm (SVM, Naïve Bayes, etc.)
Empirical Results
Data sets:
– 20 newsgroups: 1000 postings from each of 20 usenet newsgroups
– 3 newsgroups: 3 of the 20 newsgroups (crypt, politics, religion)
– Lymphoma: gene expression profiles; 2 classes with 42 and 54 samples each
– Soybean: the soybean-large data set from the UCI repository, with 15 classes
Text Clustering
Compare spectral clustering with k-means.
Use document-to-document similarities to build a similarity graph for each collection:
– pairwise cosine similarity
– use k = 20

Data set  | Spectral Learning | K-means
3 news    | 0.84              | 0.20
20 news   | 0.36              | 0.07
lymphoma  | 0.50              | 0.10
soybean   | 0.41              | 0.34
Text Classification
Use the same spectral representation, but apply a learning algorithm trained on the labeled examples.
– Reinforce the prior knowledge: set Aij to 1 if i and j are known to belong to the same class, or 0 otherwise
Can naturally integrate unlabeled training examples: include them in the initial matrix.
Text Classification
[Figure: text classification results.]
References
– Brin, S. and Page, L. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 1998.
– Herings, P.J., van der Laan, G. and Talman, D. Measuring the power of nodes in digraphs. Technical Report, Tinbergen Institute, 2001.
– Kamvar, S., Klein, D. and Manning, C. Spectral Learning. IJCAI 2003.
– Kleinberg, J. Authoritative sources in a hyperlinked environment. Journal of the ACM, 1999.
– Ng, A., Jordan, M. and Weiss, Y. On Spectral Clustering: Analysis and an algorithm. NIPS 14, 2002.
Part 3: Graph-based Algorithms for NLP
Word Sense Disambiguation
Word sense disambiguation is the problem of selecting a sense for a word from a set of predefined possibilities.
– The sense inventory usually comes from a dictionary or thesaurus.
– Knowledge-intensive methods, supervised learning, and (sometimes) bootstrapping approaches
Word polysemy (with respect to a dictionary): determine which sense of a word is used in a specific sentence.
– Ex: "chair" – furniture or person: "Sit on a chair" / "Take a seat on this chair" vs. "The chair of the Math Department" / "The chair of the meeting"
– Ex: "child" – young person or human offspring
Graph-based Solutions for WSD
Use information derived from dictionaries / semantic networks to construct graphs.
Build graphs using measures of "similarity":
– Similarity determined between pairs of concepts, or between a word and its surrounding context
– Distributional similarity (Lee 1999) (Lin 1999)
– Dictionary-based similarity (Rada 1989)
[Figure: a fragment of the WordNet noun hierarchy rooted at carnivore, including canine/canid, feline/felid, fissiped mammal, dog, wolf, hyena, hyena dog, wild dog, hunting dog, dachshund, terrier, dingo.]
Semantic Similarity Metrics
Input: two concepts (same part of speech). Output: a similarity measure.
E.g. (Leacock and Chodorow 1998):
Similarity(C1, C2) = −log( Path(C1, C2) / 2D ), where D is the taxonomy depth
– E.g. Similarity(wolf, dog) = 0.60; Similarity(wolf, bear) = 0.42
Other metrics:
– Similarity using information content (Resnik 1995) (Lin 1998)
– Similarity using gloss-based paths across different hierarchies (Mihalcea and Moldovan 1999)
– Conceptual density measure between noun semantic hierarchies and the current context (Agirre and Rigau 1995)
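The Leacock-Chodorow formula above is a one-liner; this sketch takes the path length and taxonomy depth as plain numbers (a real system would read both from WordNet):

```python
import math

def lch_similarity(path_len, depth):
    """-log(path / 2D): shorter paths in a deeper taxonomy score higher."""
    return -math.log(path_len / (2.0 * depth))

# with a toy taxonomy depth of 16, a shorter path means a higher similarity
close = lch_similarity(2, 16)
far = lch_similarity(10, 16)
```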
Lexical Chains for WSD
Apply measures of semantic similarity in a global context.
Lexical chains (Hirst and St-Onge 1988), (Halliday and Hasan 1976):
"A lexical chain is a sequence of semantically related words, which creates a context and contributes to the continuity of meaning and the coherence of a discourse."
Algorithm for finding lexical chains:
1. Select the candidate words from the text. These are words for which we can compute similarity measures, and therefore most of the time they have the same part of speech.
2. For each such candidate word, and for each meaning of this word, find a chain to receive the candidate word sense, based on a semantic relatedness measure between the concepts already in the chain and the candidate word meaning.
3. If such a chain is found, insert the word in this chain; otherwise, create a new chain.
Lexical Chains
"A very long train traveling along the rails with a constant velocity v in a certain direction …"
train: #1: public transport; #2: ordered set of things; #3: piece of cloth
travel: #1: change location; #2: undergo transportation
rail: #1: a barrier; #2: a bar of steel for trains; #3: a small bird
[Figure: the candidate senses connected into lexical chains.]
Lexical Chains for WSD
Identify lexical chains in a text
– Usually target one part of speech at a time
Identify the meaning of words based on their membership in a lexical chain.
Evaluation:
– (Galley and McKeown 2003): lexical chains on 74 SemCor texts give 62.09%
– (Mihalcea and Moldovan 2000): on five SemCor texts, lexical chains "anchored" on monosemous words give 90% with 60% recall
RANLP 2005 94
Graph-based Ranking on Semantic Networks
Goal: build a semantic graph that represents the meaning of the text
Input: Any open text Output: Graph of meanings (synsets)
– “importance” scores attached to each synset– relations that connect them
Models text cohesion– (Halliday and Hasan 1976)– From a given concept, follow “links” to semantically related
concepts Graph-based ranking identifies the most recommended
concepts
Two U.S. soldiers and an unknown number of civilian
contractors are unaccounted for after a fuel convoy was
attacked near the Baghdad International Airport today,
a senior Pentagon official said. One U.S. soldier and an
Iraqi driver were killed in the incident. …
…
…
Main Steps
Step 1: Preprocessing– SGML parsing, text tokenization, part of speech tagging,
lemmatization
Step 2: Assume any possible meaning of a word in a text is potentially correct
– Insert all corresponding synsets into the graph
Step 3: Draw connections (edges) between vertices
Step 4: Apply the graph-based ranking algorithm– PageRank, HITS, Positional power
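Step 4 can be sketched with a plain PageRank over the synset graph; here the graph is just an adjacency map standing in for the synset vertices and WordNet relations built in steps 2–3 (a sketch, not the weighted variant used in the actual system):

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Iterative PageRank on an undirected graph given as
    {node: set of neighbors}; returns an importance score per node."""
    n = len(graph)
    score = {v: 1.0 / n for v in graph}
    for _ in range(iterations):
        # each node receives an equal share of each neighbor's score
        score = {v: (1 - damping) / n
                    + damping * sum(score[u] / len(graph[u])
                                    for u in graph[v])
                 for v in graph}
    return score
```

On a toy graph, the best-connected node ends up with the highest score; in the WSD setting, the top-scored synset for each word is chosen as its meaning.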
Semantic Relations
Main relations provided by WordNet– ISA (hypernym/hyponym)– PART-OF (meronym/holonym)– causality– attribute– nominalizations– domain links
Derived relations– coord: synsets with common hypernym
Edges (connections)– directed (direction?) / undirected– Best results with undirected graphs
Output: Graph of concepts (synsets) identified in the text– “importance” scores attached to each synset– relations that connect them
Word Sense Disambiguation
Rank the synsets/meanings attached to each word Unsupervised method for semantic ambiguity resolution of all words in
unrestricted text (Mihalcea et al. 2004) Related algorithms:
– Lesk– Baseline (most frequent sense / random)
Hybrid:– Graph-based ranking + Lesk– Graph-based ranking + Most frequent sense
Evaluation– “Informed” (with sense ordering)– “Uninformed” (no sense ordering)
Data – Senseval-2 all words data (three texts, average size 600)– SemCor subset (five texts: law, sports, debates, education,
entertainment)
Evaluation
                Size (words)  Random   Lesk     TextRank  TextRank+Lesk
SemCor
  law               825       37.12%   39.62%   46.42%    49.36%
  sports            808       29.95%   33.00%   40.59%    46.18%
  education         898       37.63%   41.33%   46.88%    52.00%
  debates           799       40.17%   42.38%   47.80%    50.52%
  entertainment     802       39.27%   43.05%   43.89%    49.31%
  Avg.              826       36.82%   39.87%   45.11%    49.47%
Senseval-2
  d00               471       28.97%   43.94%   43.94%    47.77%
  d01               784       45.47%   52.65%   54.46%    57.39%
  d02               514       39.24%   49.61%   54.28%    56.42%
  Avg.              590       37.89%   48.73%   50.89%    53.86%
Average (all)       740       37.22%   43.19%   47.27%    51.16%

“uninformed” (no sense order)
Evaluation
                Size (words)  MFS      Lesk     TextRank  TextRank+Lesk
SemCor
  law               825       69.09%   72.65%   73.21%    73.97%
  sports            808       57.30%   64.21%   68.31%    68.31%
  education         898       64.03%   69.33%   71.65%    71.53%
  debates           799       66.33%   70.07%   71.14%    71.67%
  entertainment     802       59.72%   64.98%   66.02%    66.16%
  Avg.              826       63.24%   68.24%   70.06%    70.32%
Senseval-2
  d00               471       51.70%   53.07%   58.17%    57.74%
  d01               784       60.80%   64.28%   67.85%    68.11%
  d02               514       55.97%   62.84%   63.81%    64.39%
  Avg.              590       56.15%   60.06%   63.27%    63.41%
Average (all)       740       60.58%   65.17%   67.51%    67.72%

“informed” (sense order integrated)
Ambiguous Entities
Name ambiguity in research papers:– David S. Johnson, David Johnson, D. Johnson– David Johnson (Rice), David Johnson (AT & T)
Similar problem across entities:– Washington (person), Washington (state), Washington
(city) Number ambiguity in texts:
– quantity: e.g. 100 miles– time: e.g. 100 years– money: e.g. 100 euro– misc.: anything else
Can be modeled as a clustering problem
Name Disambiguation
Extract attributes for each person – e.g. for research papers:– Co-authors– Paper titles– Venue titles
Each word in these attribute sets constitutes a binary feature
Apply a weighting scheme:– E.g. normalized TF, TF/IDF
Construct a vector for each occurrence of a person name
(Han et al., 2005)
Spectral Clustering
Apply k-way spectral clustering to name data sets Two data sets:
– DBLP: authors of 400,000 citation records – use top 14 ambiguous names
– Web-based data set: 11 authors named “J. Smith” and 15 authors named “J. Anderson”, in a total of 567 citations
Clustering evaluated using confusion matrices – Disambiguation accuracy = sum of diagonal elements
A[i,i] divided by the sum of all elements in the matrix
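The accuracy computation is straightforward; a sketch (rows of the confusion matrix are true clusters, columns are predicted ones):

```python
def disambiguation_accuracy(confusion):
    """Sum of the diagonal elements A[i, i] divided by the sum of
    all elements of the confusion matrix."""
    diagonal = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return diagonal / total
```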
Name Disambiguation – Results
DBLP data set:
Web-based data set– 11 “J.Smith”: 84.7% (k-means 75.4%)– 15 “J.Anderson”: 71.2% (k-means 67.2%)
Number Disambiguation
Classify numbers into 4 classes: quantity, time, money, miscellaneous
Two algorithms– Spectral clustering on bipartite graphs
– Tripartite updating
Features– Surrounding window +/- 2, and the “range” of the word
(w ≤ 3, 3 < w ≤ 10, 10 < w ≤ 30 …)
– Map all numbers in the data set into these features (Radev, 2004)
Tripartite Updating
F – set of features, L/U – set of labeled / unlabeled examples
T1: connectivity matrix between F and L
T2: connectivity matrix between F and U
F and U are iteratively updated
F^(t) = T1 · L + T2 · U^(t-1)

U^(t) = T2^T · F^(t-1)

(F and U are normalized after each update)
Number Disambiguation – Results
Use a subset of the numbers from the APW section of the ACQUAINT corpus– 5000 unlabeled examples, of which 300 were
manually annotated
– 200 labeled examples for evaluation
Automatic Thesaurus Generation
Idea: Use an (online) traditional dictionary to generate a graph structure, and use it to create thesaurus-like entries for all words (stopwords included)
– (Jannink and Wiederhold, 1999)
For instance:– Input: the 1912 Webster’s Dictionary
– Output: a repository with rank relationships between terms
The repository is comparable with handcrafted efforts such as MindNet and WordNet
Algorithm
1. Extract a directed graph from the dictionary- e.g. relations between head-words and words included
in definitions
2. Obtain the relative measure of arc importance
3. Rank the arcs with ArcRank (a graph-based algorithm for edge ranking)
Algorithm (cont.)
Directed graph: one arc from each headword to all words in the definition.
Potential problems: syllable and accent markers in head words, misspelled head words, accents, special characters, mistagged fields, common abbreviations in definitions, stemming, multi-word head words, undefined words with common prefixes, undefined hyphenated words.
Source words: words never used in definitions
Sink words: undefined words
Transport. To carry or bear from one place to another;…
head word
Transport
to
carry
or
bear
…
1. Extract a directed graph from the Dictionary
Algorithm (cont.)
2. Obtain the relative measure of arc importance
Transport
carry
For an edge e from source node s to target node t:

r_e = (p_s / |a_s|) / p_t

where:
r_e is the rank of the edge
p_s is the rank of the source node
p_t is the rank of the target node
|a_s| is the number of outgoing arcs of s

For m parallel arcs between s and t:

r_s,t = Σ(e=1..m) (p_s / |a_s|) / p_t
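A direct transcription of the measure (a sketch; the node ranks p_s and p_t are assumed to come from a prior graph-ranking step such as PageRank over the dictionary graph):

```python
def arc_importance(p_s, out_degree_s, p_t):
    """r_e = (p_s / |a_s|) / p_t: the share of the source's rank carried
    by this arc, relative to the rank of the target."""
    return (p_s / out_degree_s) / p_t
```

An arc leaving a high-ranked source with few outgoing arcs and entering a low-ranked target scores highest.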
Algorithm (cont.)
ArcRank
Input: triples (source s, target t, importance v_s,t)
1. given source s and target t nodes
2. at s, sort the v_s,tj and rank the arcs r_s(v_s,tj)
3. at t, sort the v_si,t and rank the arcs r_t(v_si,t)
4. compute ArcRank: mean(r_s(v_s,t), r_t(v_s,t))
5. Rank arcs — input: sorted arc importance
• {0.9, 0.75, 0.75, 0.75, 0.6, 0.5, …, 0.1} sample values
• {1, 2, 2, 2, …} equal values take the same rank
• {1, 2, 2, 2, 3, …} ranks are numbered consecutively

3. Rank the arcs using ArcRank
• Ranks the importance of arcs with respect to both source and target nodes
• Promotes arcs that are important at both endpoints
Results
Webster (an automatically built thesaurus starting with the Webster’s dictionary):
- 96,800 terms, 112,897 distinct words
- error rates: <1% of original input; <0.05% incorrect arcs (hyphenation); <0.05% incorrect terms (spelling); 0% artificial terms

WordNet:
- 99,642 terms, 173,941 word senses
- error rates: ~0.1% inappropriate classifications; ~1-10% artificial & repeated terms

MindNet:
- 159,000 head words, 713,000 relationships between headwords
- (not publicly available)
Results Analysis
The Webster’s repository
PROS:
- It has a very general structure
- It can also address stopwords
- It has more relationships than WordNet
- It allows for any relationship between words (not only within a lexical category)
- It is more “natural”: it does not include artificial concepts such as non-existent words and artificial categorizations

CONS:
- The type of relationship is not always evident
- The accuracy increases with the amount of data; nevertheless, the dictionary contains sparse definitions
- It only distinguishes senses based on usage, not grammar
- It is less precise than other thesauri (e.g. WordNet)
Extractive Summarization
Graph-based ranking algorithms for extractive summarization
1. Build a graph:– Sentences in a text = vertices– Similarity between sentences = weighted edges
Model the cohesion of text using intersentential similarity
2. Run a graph-based ranking algorithm:– keep top N ranked sentences– = sentences most “recommended” by other
sentences (Mihalcea & Tarau, 2004) (Erkan & Radev 2004)
A Process of Recommendation
Underlying idea: a process of recommendation– A sentence that addresses certain concepts in a text
gives the reader a recommendation to refer to other sentences in the text that address the same concepts
– Text knitting (Hobbs 1974) “repetition in text knits the discourse together”
– Text cohesion (Halliday & Hasan 1976)
Sentence Similarity
Inter-sentential relationships – weighted edges
Count the number of common concepts
Normalize with the lengths of the sentences
Other similarity metrics are also possible:– cosine, string kernels, etc.
– work in progress
Sim(Si, Sj) = |{wk | wk ∈ Si and wk ∈ Sj}| / ( log|Si| + log|Sj| )
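In code (a sketch; sentences are token lists, and the measure is undefined for one-word sentences, where log|S| = 0):

```python
import math

def sentence_similarity(s_i, s_j):
    """Number of shared tokens, normalized by the log-lengths of the
    two sentences, so long sentences are not unduly favored."""
    overlap = len(set(s_i) & set(s_j))
    return overlap / (math.log(len(s_i)) + math.log(len(s_j)))
```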
Graph Structure
Undirected– No direction established between sentences in
the text– A sentence can “recommend” sentences that
precede or follow in the text Directed forward
– A sentence can “recommend” only sentences that follow in the text
– Seems more appropriate for movie reviews, stories, etc.
Directed backward– A sentence can “recommend” only sentences
that precede in the text – More appropriate for news articles
An Example 3. r i BC-HurricaneGilbert 09-11 0339 4. BC-Hurricane Gilbert , 0348 5. Hurricane Gilbert Heads Toward Dominican Coast 6. By RUDDY GONZALEZ 7. Associated Press Writer 8. SANTO DOMINGO , Dominican Republic ( AP ) 9. Hurricane Gilbert swept toward the Dominican Republic Sunday , and the Civil Defense alerted its heavily populated south coast to prepare for high winds , heavy rains and high seas . 10. The storm was approaching from the southeast with sustained winds of 75 mph gusting to 92 mph . 11. " There is no need for alarm , " Civil Defense Director Eugenio Cabral said in a television alert shortly before midnight Saturday . 12. Cabral said residents of the province of Barahona should closely follow Gilbert 's movement . 13. An estimated 100,000 people live in the province , including 70,000 in the city of Barahona , about 125 miles west of Santo Domingo . 14. Tropical Storm Gilbert formed in the eastern Caribbean and strengthened into a hurricane Saturday night 15. The National Hurricane Center in Miami reported its position at 2a.m. Sunday at latitude 16.1 north , longitude 67.5 west , about 140 miles south of Ponce , Puerto Rico , and 200 miles southeast of Santo Domingo . 16. The National Weather Service in San Juan , Puerto Rico , said Gilbert was moving westward at 15 mph with a " broad area of cloudiness and heavy weather " rotating around the center of the storm . 17. The weather service issued a flash flood watch for Puerto Rico and the Virgin Islands until at least 6p.m. Sunday . 18. Strong winds associated with the Gilbert brought coastal flooding , strong southeast winds and up to 12 feet to Puerto Rico 's south coast . 19. There were no reports of casualties . 20. San Juan , on the north coast , had heavy rains and gusts Saturday , but they subsided during the night . 21. On Saturday , Hurricane Florence was downgraded to a tropical storm and its remnants pushed inland from the U.S. Gulf Coast . 22. 
Residents returned home , happy to find little damage from 80 mph winds and sheets of rain . 23. Florence , the sixth named storm of the 1988 Atlantic storm season , was the second hurricane . 24. The first , Debby , reached minimal hurricane strength briefly before hitting the Mexican coast last month
A text from DUC 2002on “Hurricane Gilbert”24 sentences
[Figure: inter-sentence similarity graph over the 24 sentences above; edges are labeled with similarity weights (0.15-0.59) and each sentence vertex carries its resulting score in brackets (0.15-1.83)]
An Example (cont'd)
Evaluation
Task-based evaluation: automatic text summarization– Single document summarization
100-word summaries
– Multiple document summarization 100-word multi-doc summaries clusters of ~10 documents
Data from DUC (Document Understanding Conference)– DUC 2002
– 567 single documents
– 59 clusters of related documents
Automatic evaluation with ROUGE (Lin & Hovy 2003)
– n-gram based evaluations unigrams found to have the highest correlations with human judgment
– no stopwords, stemming
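A simplified, set-based form of the unigram score (the real ROUGE-1 uses clipped counts plus the stopword-removal and stemming options mentioned above; this sketch just measures unigram recall):

```python
def unigram_recall(candidate_tokens, reference_tokens):
    """Fraction of distinct reference unigrams that also appear in the
    candidate summary (a simplified ROUGE-1-style recall)."""
    reference = set(reference_tokens)
    matched = reference & set(candidate_tokens)
    return len(matched) / len(reference)
```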
Results: Single Document Summarization
Single-doc summaries for 567 documents (DUC 2002)
Graph algorithm   Undirected   Dir. forward   Dir. backward
HITS-A (W)          0.4912       0.4584         0.5023
HITS-H (W)          0.4912       0.5023         0.4584
POS-P (W)           0.4878       0.4538         0.3910
POS-W (W)           0.4878       0.3910         0.4538
PageRank (W)        0.4904       0.4202         0.5008

(W = weighted graph)

Top 5 systems (DUC 2002):
S27      S31      S28      S21      S29      Baseline
0.5011   0.4914   0.4890   0.4869   0.4681   0.4799
Multiple Document Summarization
Cascaded summarization (“meta” summarizer)– Use best single document summarization algorithms
PageRank (Undirected / Directed Backward) HITSA (Undirected / Directed Backward)
– 100-word single document summaries– 100-word “summary of summaries”
Avoid sentence redundancy:– set max threshold on sentence similarity (0.5)
Evaluation:– build summaries for 59 clusters of ~10 documents– compare with top 5 performing systems at DUC
2002– baseline: first sentence in each document
Results: Multiple Document Summarization
               PageRank-U   PageRank-DB   HITS-A-U   HITS-A-DB
PageRank-U       0.3552       0.3499        0.3456     0.3465
PageRank-DB      0.3502       0.3448        0.3519     0.3439
HITS-A-U         0.3368       0.3259        0.3212     0.3423
HITS-A-DB        0.3572       0.3520        0.3462     0.3473

Top 5 systems (DUC 2002):
S26      S19      S29      S25      S20      Baseline
0.3578   0.3447   0.3264   0.3056   0.3047   0.2932

– Multi-doc summaries for 59 clusters (DUC 2002)
Portuguese Single-Document Evaluation
• 100 news articles from Jornal do Brasil and Folha de São Paulo• Create summaries with the same length as the manual summaries • Automatic evaluation with ROUGE • Same baseline (top sentences in each document)
Graph algorithm   Undirected   Forward   Backward
HITS-A              0.4814      0.4834    0.5002
PageRank            0.4939      0.4574    0.5121
Baseline            0.4963
Graph-based Ranking for Sentence Extraction Unsupervised method for sentence
extraction – based exclusively on information drawn from the text itself
Goes beyond simple sentence connectivity Gives a ranking over all sentences in a text
– can be adapted to longer / shorter summaries
No training data required – can be adapted to other languages– Already applied on Portuguese
Subjectivity Analysis for Sentiment Classification
The objective is to detect subjective expressions in text (opinions against facts)
Use this information to improve the polarity classification (positive vs. negative)– E.g. movie reviews (see www.rottentomatoes.com)
Sentiment analysis can be considered a document classification problem, with target classes focusing on the author’s sentiments rather than topic-based categories– Standard machine learning classification techniques
can be applied
Subjectivity Extraction
Subjectivity Detection/Extraction
Detecting the subjective sentences in a text may be useful in filtering out the objective sentences, creating a subjective extract
Subjective extracts facilitate the polarity analysis of the text (increased accuracy at reduced input size)
Subjectivity detection can use local and contextual features:
– Local: relies on individual sentence classifications using standard machine learning techniques (SVM, Naïve Bayes, etc) trained on an annotated data set
– Contextual: uses context information, such as e.g. sentences occurring near each other tend to share the same subjectivity status (coherence)
(Pang and Lee, 2004)
Cut-based Subjectivity Classification
Standard classification techniques usually consider only individual features (classify one sentence at a time).
Cut-based classification takes into account both individual and contextual (structural) features
Suppose we have n items x1,…,xn to divide into two classes: C1 and C2 .
Individual scores: indj(xi) - non-negative estimates of each xi being in Cj based on the features of xi alone
Association scores: assoc(xi,xk) - non-negative estimates of how important it is that xi and xk be in the same class
Cut-based Classification
Maximize each item’s assignment score (individual score for the class it is assigned to, minus its individual score for the other class), while penalizing the assignment of different classes to highly associated items
Formulated as an optimization problem: assign the xi items to classes C1 and C2 so as to minimize the partition cost:
cost(C1, C2) = Σ(x ∈ C1) ind2(x) + Σ(x ∈ C2) ind1(x) + Σ(xi ∈ C1, xk ∈ C2) assoc(xi, xk)
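For a handful of items the cost can be minimized by brute force, which makes the objective concrete (illustration only; the point of the min-cut formulation is precisely to avoid this exponential search):

```python
from itertools import product

def partition_cost(assignment, ind1, ind2, assoc):
    """Cost of putting item i in class assignment[i] (1 or 2): each item
    pays its individual score for the *other* class, and each cross-class
    pair pays its association score (assoc given upper-triangular)."""
    n = len(assignment)
    cost = sum(ind2[i] for i in range(n) if assignment[i] == 1)
    cost += sum(ind1[i] for i in range(n) if assignment[i] == 2)
    cost += sum(assoc[i][k]
                for i in range(n) for k in range(i + 1, n)
                if assignment[i] != assignment[k])
    return cost

def best_partition(ind1, ind2, assoc):
    """Exhaustive search over all 2**n class assignments."""
    n = len(ind1)
    return min(product((1, 2), repeat=n),
               key=lambda a: partition_cost(a, ind1, ind2, assoc))
```

With two strongly associated items, the minimum-cost partition keeps them in the same class even when their individual scores alone are ambiguous.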
Cut-based Algorithm
There are 2^n possible binary partitions of the n elements, so we need an efficient algorithm to solve the optimization problem
Build an undirected graph G with vertices {v1,…vn,s,t} and edges:– (s,vi) with weights ind1(xi)
– (vi,t) with weights ind2(xi)
– (vi,vk) with weights assoc(xi,xk)
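A self-contained sketch of this construction, followed by an Edmonds-Karp max-flow to recover the minimum cut (dense adjacency matrix, fine for small n; assoc is read for i < k and treated as an undirected capacity):

```python
from collections import deque

def min_cut_classes(ind1, ind2, assoc):
    """Build the graph above (source s, sink t, one vertex per item;
    (s,vi) = ind1(xi), (vi,t) = ind2(xi), (vi,vk) = assoc(xi,xk)) and
    return the item indices on the source side of a minimum s-t cut."""
    n = len(ind1)
    s, t = n, n + 1
    cap = [[0.0] * (n + 2) for _ in range(n + 2)]
    for i in range(n):
        cap[s][i] = ind1[i]
        cap[i][t] = ind2[i]
        for k in range(i + 1, n):
            cap[i][k] += assoc[i][k]
            cap[k][i] += assoc[i][k]
    while True:  # Edmonds-Karp: augment along BFS-shortest paths
        parent = [-1] * (n + 2)
        parent[s] = s
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in range(n + 2):
                if parent[v] == -1 and cap[u][v] > 1e-12:
                    parent[v] = u
                    queue.append(v)
        if parent[t] == -1:  # no augmenting path: flow is maximal
            break
        path, v = [], t
        while v != s:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][v] for u, v in path)
        for u, v in path:  # update residual capacities
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
    reachable, queue = {s}, deque([s])
    while queue:  # residual reachability from s gives the cut
        u = queue.popleft()
        for v in range(n + 2):
            if v not in reachable and cap[u][v] > 1e-12:
                reachable.add(v)
                queue.append(v)
    return {i for i in range(n) if i in reachable}
```

The cut found this way minimizes exactly the partition cost of the previous slide, in polynomial rather than exponential time.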
Cut-based Algorithm (cont.)
Cut: a partition of the vertices in two sets:
The cost is the sum of the weights of all edges crossing from S to T
A minimum cut is a cut with minimal cost
A minimum cut can be found using maximum-flow algorithms, with polynomial asymptotic running times
Use the min-cut / max-flow algorithm
S = {s} ∪ S' and T = {t} ∪ T',
where s ∉ S', t ∉ T'
Cut-based Algorithm (cont.)
Notice that without the structural information we would be undecided about the assignment of node M
Subjectivity Extraction
Assign every individual sentence a subjectivity score – e.g. the probability of a sentence being subjective, as assigned
by a Naïve Bayes classifier, etc Assign every sentence pair a proximity or similarity score
– e.g. physical proximity = the inverse of the number of sentences between the two entities
Use the min-cut algorithm to classify the sentences into objective/subjective
Subjectivity Extraction with Min-Cut
Results
2000 movie reviews (1000 positive / 1000 negative)
The use of subjective extracts improves or maintains the accuracy of the polarity analysis while reducing the input data size
References
Galley, M. and McKeown, K.: Improving Word Sense Disambiguation in Lexical Chaining. IJCAI 2003.
Hirst, G. and St-Onge, D. Lexical chains as representations of context in the detection and correction of malapropisms. WordNet: An electronic lexical database, MIT Press, 1998
Halliday, M. and Hasan, R. Cohesion in English. Longman, 1976. Lesk, M. Automatic sense disambiguation using machine readable
dictionaries: How to tell a pine cone from an ice cream cone. SIGDOC 1986.
Mihalcea, R. and Moldovan, D. An iterative approach to word sense disambiguation. FLAIRS 2000.
Mihalcea, R., Tarau, P. and Figa, E. PageRank on Semantic Networks with Application to Word Sense Disambiguation, COLING 2004
References
Mihalcea, R. and Tarau, P. TextRank: Bringing Order Into Texts, EMNLP 2004
Han, H and Zha, H. and Giles, C.L. Name disambiguation in author citations using a K-way spectral clustering method, ACM/IEEE – IJCDL 2005
Radev, D., Weakly-Supervised Graph-based Methods for Classification, U. Michigan Technical Report CSE-TR-500-04, 2004
Jannink, J. and Wiederhold, G., Thesaurus Entry Extraction from an On-line Dictionary, Fusion 1999, Sunnyvale CA, 1999.
Pang, B. and Lee, L., A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, ACL 2004.
Erkan, G. and Radev, D.. "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization", Journal of Artificial Intelligence Research , 2004
Conclusions
Graphs: encode information about nodes and relations between nodes
Graph-based algorithms can lead to state-of-the-art results in several IR and NLP problems
Graph-based NLP/IR:– At the intersection of graph theory and NLP / IR
– Many opportunities for novel approaches to traditional problems