Transcript of "Graph-based Algorithms for Information Retrieval and Natural Language Processing", Rada Mihalcea, University of North Texas, RANLP 2005.

Page 1

Graph-based Algorithms for Information Retrieval and Natural Language Processing

Rada Mihalcea

University of North Texas

rada@cs.unt.edu

RANLP 2005

Page 2

Introduction

Motivation
– Graph theory is a well-studied discipline
– So are the fields of Information Retrieval (IR) and Natural Language Processing (NLP)
– They are often perceived as completely different disciplines

Goal of the tutorial: provide an overview of methods and applications in IR and NLP that rely on graph-based algorithms, e.g.
– Graph-based algorithms: graph traversal, min-cut algorithms, node ranking
– Applied to: Web search, text understanding, text summarization, keyword extraction, text clustering

Page 3

Outline

Part 1: Elements of graph theory
– Terminology, notations, algorithms on graphs

Part 2: Graph-based algorithms for IR
– Web search
– Text clustering and classification

Part 3: Graph-based algorithms for NLP
– Word Sense Disambiguation
– Clustering of entities (names/numbers)
– Thesaurus construction
– Text summarization

Page 4

What is a Graph?

A graph G = (V,E) is composed of:

V: set of vertices

E: set of edges connecting the vertices in V

An edge e = (u,v) is a pair of vertices

Example (graph with vertices a, b, c, d, e):

V= {a,b,c,d,e}

E= {(a,b),(a,c),(a,d),(b,e),(c,d),(c,e),(d,e)}
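As a small illustration (added here, not part of the original slides), this example graph can be stored as an adjacency structure; the Python below is a minimal sketch:

```python
# Minimal sketch: the example graph G = (V, E) stored as an adjacency-set dictionary.
V = {"a", "b", "c", "d", "e"}
E = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "e"), ("c", "d"), ("c", "e"), ("d", "e")]

adj = {v: set() for v in V}
for u, v in E:          # undirected graph: add both directions
    adj[u].add(v)
    adj[v].add(u)

print(adj["a"])         # {'b', 'c', 'd'}
```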

Page 5

Terminology: Adjacent and Incident

If (v0, v1) is an edge in an undirected graph:
– v0 and v1 are adjacent
– The edge (v0, v1) is incident on vertices v0 and v1

If <v0, v1> is an edge in a directed graph:
– v0 is adjacent to v1, and v1 is adjacent from v0
– The edge <v0, v1> is incident on v0 and v1

Page 6

Terminology: Degree of a Vertex

The degree of a vertex is the number of edges incident to that vertex

For a directed graph:
– the in-degree of a vertex v is the number of edges that have v as the head
– the out-degree of a vertex v is the number of edges that have v as the tail

If $d_i$ is the degree of vertex i in a graph G with n vertices and e edges, then the number of edges is

$e = \frac{1}{2} \sum_{i=0}^{n-1} d_i$

Why? Since adjacent vertices each count the adjoining edge, it is counted twice.
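A small Python check of this handshake identity on the example graph from Page 4 (illustrative, not from the slides):

```python
# Degrees from the edge list of the earlier example graph; the edge count
# equals half the degree sum.
E = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "e"), ("c", "d"), ("c", "e"), ("d", "e")]

degree = {}
for u, v in E:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1

assert sum(degree.values()) // 2 == len(E)   # 14 / 2 == 7
```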

Page 7

Examples

[Figure: degree examples. G1 (the complete graph on vertices 0-3): every vertex has degree 3. G2 (an undirected graph on vertices 0-6): vertex degrees 2, 3, 3, 1, 1, 1, 1. G3 (a directed graph on vertices 0-2), in-degree/out-degree per vertex: vertex 0 – in: 1, out: 1; vertex 1 – in: 1, out: 2; vertex 2 – in: 1, out: 0.]

Page 8

Terminology: Path

path: a sequence of vertices v1, v2, ..., vk such that consecutive vertices vi and vi+1 are adjacent.

[Figure: the example graph a, b, c, d, e with two sample paths highlighted, e.g. a, b, e, d, c and b, e, d, c.]

Page 9

More Terminology

simple path: no repeated vertices

cycle: simple path, except that the last vertex is the same as the first vertex

[Figure: the example graph, with a simple path b, e, c and a cycle a, c, d, a highlighted.]

Page 10

Even More Terminology

subgraph: subset of vertices and edges forming a graph

connected component: maximal connected subgraph (e.g., a graph may consist of 3 connected components)

connected graph: any two vertices are connected by some path

[Figure: a connected graph next to a graph that is not connected.]

Page 11

Subgraph Examples

[Figure: (a) some of the subgraphs of G1; (b) some of the subgraphs of G3.]

Page 12

More …

tree: connected graph without cycles

forest: collection of trees

[Figure: a forest consisting of three trees.]

Page 13

Connectivity

Let n = #vertices and m = #edges

A complete graph: one in which all pairs of vertices are adjacent

How many total edges in a complete graph?
– Each of the n vertices is incident to n-1 edges; however, each edge would then be counted twice. Therefore, intuitively, m = n(n-1)/2.

Therefore, if a graph is not complete, m < n(n-1)/2

[Figure: example with n = 5 and m = 5(5-1)/2 = 10.]

Page 14

More Connectivity

n = #vertices, m = #edges

For a tree, m = n - 1 (example: n = 5, m = 4)

If m < n - 1, G is not connected (example: n = 5, m = 3)

Page 15

Directed and Undirected Graphs

An undirected graph is one in which the pair of vertices in an edge is unordered: (v0, v1) = (v1, v0)

A directed graph is one in which each edge is an ordered (directed) pair of vertices: <v0, v1> != <v1, v0>

[Figure: a directed edge drawn from its tail to its head.]

Page 16

Weighted and Unweighted Graphs

Edges have / do not have a weight associated with them

[Figure: an unweighted graph next to a weighted graph on nodes s, 2-7, t, with edge weights such as 15, 5, 10, 4.]

Page 17

Graph Representations: Adjacency Matrix

Let G = (V,E) be a graph with n vertices.

The adjacency matrix of G is a two-dimensional n by n array, say adj_mat
– If the edge (vi, vj) is in E(G), adj_mat[i][j] = 1
– If there is no such edge in E(G), adj_mat[i][j] = 0

The adjacency matrix for an undirected graph is symmetric; the adjacency matrix for a digraph need not be symmetric
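A short illustrative sketch (Python/NumPy, not from the original slides) of building adj_mat from an edge list:

```python
import numpy as np

# Build the adjacency matrix of an undirected graph with n vertices
# from a list of index pairs (an assumed example graph).
n = 5
edges = [(0, 1), (0, 2), (0, 3), (1, 4), (2, 3), (2, 4), (3, 4)]

adj_mat = np.zeros((n, n), dtype=int)
for i, j in edges:
    adj_mat[i, j] = 1
    adj_mat[j, i] = 1          # undirected => symmetric

assert (adj_mat == adj_mat.T).all()
```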

Page 18

Examples of Adjacency Matrices

[Figure: adjacency matrices for the example graphs G1, G2, and G3. For the complete graph G1 on vertices 0-3 the matrix is

    0 1 1 1
    1 0 1 1
    1 1 0 1
    1 1 1 0

The matrices of the undirected graphs G1 and G2 are symmetric; the matrix of the directed graph G3 is not. Space: roughly n^2/2 informative entries for an undirected graph, n^2 for a directed graph.]

Page 19

Advantages of Adjacency Matrix

Starting with an adjacency matrix, it is easy to determine the links between vertices

The degree of a vertex i is $\sum_{j=0}^{n-1} adj\_mat[i][j]$

For a digraph (= directed graph), the row sum is the out-degree, while the column sum is the in-degree:

$ind(v_i) = \sum_{j=0}^{n-1} A[j][i]$

$outd(v_i) = \sum_{j=0}^{n-1} A[i][j]$
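An illustrative NumPy sketch of these row/column sums (the 3-vertex directed graph is an assumed example):

```python
import numpy as np

# Row sums give out-degrees, column sums give in-degrees (directed graph).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 0, 0]])      # edges 0->1, 1->0, 1->2

out_degree = A.sum(axis=1)     # [1, 2, 0]
in_degree  = A.sum(axis=0)     # [1, 1, 1]
print(out_degree, in_degree)
```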

Page 20

Graph Representations: Adjacency Lists

Each row in an adjacency matrix is represented as a list

An undirected graph with n vertices and e edges => n head nodes and 2e list nodes

Pros: more compact, memory-efficient representation

Cons: sometimes less efficient computation of in- or out-degrees

Page 21

Examples of Adjacency Lists

[Figure: adjacency lists for the example graphs G1, G3, and G4; e.g. for G1, vertex 0 -> (1, 2, 3), vertex 1 -> (0, 2, 3), vertex 2 -> (0, 1, 3), vertex 3 -> (0, 1, 2).]

Page 22

Operations on Adjacency Lists

• degree of a vertex in an undirected graph: # of nodes in its adjacency list

• # of edges in a graph: determined in O(n+e)

• out-degree of a vertex in a directed graph: # of nodes in its adjacency list

• in-degree of a vertex in a directed graph: traverse the whole data structure
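A small Python sketch of these operations on an adjacency-list representation (the tiny directed graph is an assumed example):

```python
# Adjacency lists for a small directed graph: 0->1, 1->0, 1->2.
adj = {0: [1], 1: [0, 2], 2: []}

out_degree = {v: len(succ) for v, succ in adj.items()}   # direct lookup per vertex

in_degree = {v: 0 for v in adj}                           # requires a full traversal
for succs in adj.values():
    for w in succs:
        in_degree[w] += 1

print(out_degree)   # {0: 1, 1: 2, 2: 0}
print(in_degree)    # {0: 1, 1: 1, 2: 1}
```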

Page 23

Algorithms on Graphs

Graph traversal

Min-Cut / Max-Flow

Ranking algorithms

Spectral clustering

Other:
– Dijkstra's algorithm
– Minimum spanning tree
– ...

Page 24

Graph Traversal

Problem: search for a certain node, or traverse all nodes in the graph

Depth First Search
– Once a possible path is found, continue the search until the end of the path

Breadth First Search
– Start several paths at a time, and advance in each one step at a time

Page 25

Depth-First Search

[Figure: a 4x4 grid of nodes A-P used to illustrate depth-first traversal.]

Page 26

Depth-First Search

Algorithm DFS(v)
Input: a vertex v in a graph
Output: a labeling of the edges as "discovery" edges and "back edges"

for each edge e incident on v do
    if edge e is unexplored then
        let w be the other endpoint of e
        if vertex w is unexplored then
            label e as a discovery edge
            recursively call DFS(w)
        else
            label e as a back edge
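A runnable Python sketch of this edge-labeling DFS (illustrative; the graph and vertex names are assumptions):

```python
def dfs(graph, v, visited=None, labels=None):
    """Label edges of an undirected graph (dict: vertex -> iterable of neighbors)
    as 'discovery' or 'back' edges, starting from vertex v."""
    if visited is None:
        visited, labels = set(), {}
    visited.add(v)
    for w in graph[v]:
        e = frozenset((v, w))
        if e not in labels:                 # edge still unexplored
            if w not in visited:
                labels[e] = "discovery"
                dfs(graph, w, visited, labels)
            else:
                labels[e] = "back"
    return labels

g = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
print(dfs(g, "a"))
```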

Page 27

Breadth-First Search

[Figure: four snapshots (a)-(d) of breadth-first traversal on the 4x4 grid of nodes A-P, expanding level by level: level 0, then levels 1, 2, 3.]

Page 28

Breadth-First Search

Algorithm BFS(s)
Input: a vertex s in a graph
Output: a labeling of the edges as "discovery" edges and "cross edges"

initialize container L0 to contain vertex s
i <- 0
while Li is not empty do
    create container Li+1, initially empty
    for each vertex v in Li do
        for each edge e incident on v do
            if edge e is unexplored then
                let w be the other endpoint of e
                if vertex w is unexplored then
                    label e as a discovery edge
                    insert w into Li+1
                else
                    label e as a cross edge
    i <- i + 1
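A minimal Python sketch of level-by-level BFS (the small graph is an assumed example):

```python
from collections import deque

def bfs_levels(graph, s):
    """Return the BFS level (distance in edges) of every vertex reachable from s.
    graph: dict mapping vertex -> iterable of neighbors."""
    level = {s: 0}
    queue = deque([s])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in level:          # first time reached: discovery edge
                level[w] = level[v] + 1
                queue.append(w)
    return level

g = {"A": ["B", "E"], "B": ["A", "C"], "C": ["B"], "E": ["A"]}
print(bfs_levels(g, "A"))   # {'A': 0, 'B': 1, 'E': 1, 'C': 2}
```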

Page 29

Path Finding

Find a path from source vertex s to destination vertex d

Use graph search starting at s and terminating as soon as we reach d
– Need to remember the edges traversed

Use depth-first search, or use breadth-first search

Page 30

Path Finding with Depth First Search

[Figure: example graph with start vertex A and destination G. DFS is called on A, then B, then C; the search returns to the call on B, calls DFS on D, and finally on G, where the destination is found. The path is implicitly stored in the DFS recursion: A, B, D, G.]

Page 31

Path Finding with Breadth First Search

[Figure: the same example graph, start A, destination G. BFS enqueues A; dequeues A and adds B; dequeues B and adds C, D; dequeues C (nothing to add); dequeues D and adds G, where the destination is found. Unlike DFS, the path must be stored separately (e.g., via parent pointers).]
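A Python sketch of BFS path finding with parent pointers, matching the walkthrough above (the graph encoding is an assumption):

```python
from collections import deque

def bfs_path(graph, s, d):
    """Shortest path (fewest edges) from s to d, reconstructed from parent pointers."""
    parent = {s: None}
    queue = deque([s])
    while queue:
        v = queue.popleft()
        if v == d:
            path = []
            while v is not None:
                path.append(v)
                v = parent[v]
            return path[::-1]
        for w in graph[v]:
            if w not in parent:
                parent[w] = v
                queue.append(w)
    return None

g = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B"], "D": ["B", "G"], "G": ["D"]}
print(bfs_path(g, "A", "G"))   # ['A', 'B', 'D', 'G']
```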

Page 32

Max Flow Network

Max flow network: G = (V, E, s, t, u)
– (V, E) = directed graph, no parallel arcs
– Two distinguished nodes: s = source, t = sink
– u(e) = capacity of arc e

[Figure: example network with nodes s, 2-7, t and arc capacities (15, 5, 30, 15, 10, 8, 15, 9, 6, 10, 10, 15, 4, 4).]

Page 33

Flows

An s-t flow is a function $f: E \rightarrow \mathbb{R}_{\ge 0}$ that satisfies:
– For each $e \in E$: $0 \le f(e) \le u(e)$   (capacity)
– For each $v \in V - \{s, t\}$:   $\sum_{e \text{ in to } v} f(e) = \sum_{e \text{ out of } v} f(e)$   (conservation)

where $\sum_{e \text{ out of } v} f(e) = \sum_{w:(v,w) \in E} f(v,w)$ and $\sum_{e \text{ in to } v} f(e) = \sum_{w:(w,v) \in E} f(w,v)$.

[Figure: the example network annotated with arc capacities and a feasible flow (most arcs carry 0, a few carry 4).]

Page 34

Flows

An s-t flow is a function $f: E \rightarrow \mathbb{R}_{\ge 0}$ that satisfies the capacity and conservation constraints above.

The value of a flow is the net flow out of the source:
$|f| = \sum_{e \text{ out of } s} f(e)$

[Figure: the same example network carrying a flow of value 4.]

MAX FLOW: find s-t flow that maximizes net flow out of the source.
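A compact Edmonds-Karp (BFS augmenting paths) sketch in Python, as one standard way to compute such a maximum flow; the tiny network below is an assumed example, not the slide's network:

```python
from collections import deque, defaultdict

def max_flow(capacity, s, t):
    """Edmonds-Karp max flow. capacity: dict {u: {v: cap}}. Returns the flow value."""
    res = defaultdict(lambda: defaultdict(int))   # residual capacities
    for u in capacity:
        for v, c in capacity[u].items():
            res[u][v] += c
    value = 0
    while True:
        # BFS for an augmenting path in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in list(res[u].items()):
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return value
        # Bottleneck capacity along the path, then push flow.
        bottleneck, v = float("inf"), t
        while parent[v] is not None:
            bottleneck = min(bottleneck, res[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
            v = u
        value += bottleneck

net = {"s": {"a": 10, "b": 5}, "a": {"t": 7, "b": 3}, "b": {"t": 8}}
print(max_flow(net, "s", "t"))   # 15
```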

Page 35

Network Flow Examples

Network        | Nodes                                      | Arcs                                    | Flow
communication  | telephone exchanges, computers, satellites | cables, fiber optics, microwave relays  | voice, video, packets
circuits       | gates, registers, processors               | wires                                   | current
mechanical     | joints                                     | rods, beams, springs                    | heat, energy
hydraulic      | reservoirs, pumping stations, lakes        | pipelines                               | fluid, oil
financial      | stocks, currency                           | transactions                            | money
transportation | airports, rail yards, street intersections | highways, railbeds, airway routes       | freight, vehicles, passengers
chemical       | sites                                      | bonds                                   | energy

Page 36

Cuts

An s-t cut is a node partition (S, T) such that $s \in S$, $t \in T$.

– The capacity of an s-t cut (S, T) is:
$cap(S,T) = \sum_{e \text{ out of } S} u(e) = \sum_{(v,w) \in E,\ v \in S,\ w \in T} u(v,w)$

Min s-t cut: find an s-t cut of minimum capacity.

[Figure: a cut of the example network with capacity 30.]

Page 37

Cuts

[Figure: a different cut of the same example network, with capacity 62.]

(Definition as before: an s-t cut is a node partition (S, T) with $s \in S$, $t \in T$; its capacity is the total capacity of the arcs leaving S; the min s-t cut is the cut of minimum capacity.)

Page 38

Flows and Cuts

L1. Let f be a flow, and let (S, T) be a cut. Then the net flow sent across the cut is equal to the amount reaching t:
$\sum_{e \text{ out of } S} f(e) - \sum_{e \text{ in to } S} f(e) = |f|$

[Figure: the example network carrying a flow of value 24, shown with one choice of cut (S, T).]

Page 39

Flows and Cuts

[Figure: the same flow of value 24, shown with a different cut (S, T); the net flow across this cut is again 24.]

(Same lemma L1: for any flow f and any cut (S, T), the net flow across the cut equals |f|.)

Page 40

Flows and Cuts

Let f be a flow, and let (S, T) be a cut. Then
$\sum_{e \text{ out of } S} f(e) - \sum_{e \text{ in to } S} f(e) = |f|$

Proof by induction on |S|.
– Base case: S = {s}.
– Inductive hypothesis: assume the claim holds for |S'| = k-1.
  Consider a cut (S, T) with |S| = k, where S = S' ∪ {v} for some v ≠ s, t and |S'| = k-1, so the net flow across (S', T') equals |f|.
  Adding v to S' does not change the net flow, since by conservation
  $\sum_{e \text{ out of } v} f(e) - \sum_{e \text{ in to } v} f(e) = 0$

[Figure: the cut before and after moving vertex v from T into S.]

Page 41

Max-Flow Min-Cut Theorem

Max-Flow Min-Cut Theorem (Ford-Fulkerson, 1956): In any network, the value of the max flow is equal to the value of the min cut.

[Figure: the example network with a flow of value 28 and a cut of capacity 28, illustrating that the maximum flow value equals the minimum cut capacity.]

Page 42

Random Walk Algorithms

Algorithms for link analysis, based on random walks
– Result in a ranking of vertex "importance"

Used in many fields:
– Social networks (co-authorship, citations)
– Web search engines
– Clustering
– NLP
– ...

Page 43

PageRank Intuition

(Brin & Page 1998)

Imagine a web surfer doing a simple random walk on the entire web for an infinite number of steps.

Occasionally, the surfer will get bored and, instead of following a link pointing outward from the current page, will jump to another random page.

At some point, the percentage of time spent at each page will converge to a fixed value.

This value is known as the PageRank of the page.

Page 44

HITS Intuition

Hyperlink-Induced Topic Search (Kleinberg 1999)

Defines 2 types of importance:
– Authority
  Can be thought of as being an expert on the subject
  "You are important if important hubs point to you"
– Hub
  Can be thought of as being knowledgeable about the subject
  "You are important if you point to important authorities"

Hubs and Authorities mutually reinforce each other

Note: a node in a graph can be both an Authority and a Hub at the same time, e.g. cnbc.com

Note: this is equivalent to performing a random walk where:
– On odd steps you walk from hubs to authorities (follow an outlink)
– On even steps you walk from authorities to hubs (follow an inlink)

Page 45

[Figure: example graph with nodes A-F and their ranking scores, e.g. A = 1.35, B = 0.63, C = 0.36, D = 1.47, E = 0.51, F = 1.17.]

Page 46

Graph-based Ranking

Decide the importance of a vertex within a graph

A link between two vertices = a vote
– Vertex A links to vertex B = vertex A "votes" for vertex B
– Iterative voting => ranking over all vertices

Model a random walk on the graph
– A walker takes random steps
– Converges to a stationary distribution of probabilities (the probability of finding the walker at a certain vertex)
– Guaranteed to converge if the graph is:
  Aperiodic – holds for non-bipartite graphs
  Irreducible – holds for strongly connected graphs

Page 47

Graph-based Ranking

Traditionally applied on directed graphs
– From a given vertex, the walker selects at random one of the out-edges

Applies also to undirected graphs
– From a given vertex, the walker selects at random one of the incident edges

Adapted to weighted graphs
– From a given vertex, the walker selects at random one of the out (directed graphs) or incident (undirected graphs) edges, with higher probability of selecting edges with higher weight

Page 48

Algorithms

G = (V,E) – a graph
– In(Vi) = predecessors of Vi, Out(Vi) = successors of Vi

PageRank (Brin & Page 1998) – integrates incoming / outgoing links in one formula, where $d \in [0,1]$:

$PR(V_i) = (1-d) + d \sum_{V_j \in In(V_i)} \frac{PR(V_j)}{|Out(V_j)|}$

Adapted to weighted graphs:

$PR^w(V_i) = (1-d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} PR^w(V_j)$
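A minimal power-iteration sketch of the unweighted formula above (Python; the example graph, damping factor and stopping threshold are illustrative choices):

```python
def pagerank(out_links, d=0.85, tol=1.0e-4):
    """out_links: dict vertex -> list of successors. Returns scores computed by
    iterating PR(Vi) = (1-d) + d * sum over Vj in In(Vi) of PR(Vj)/|Out(Vj)|."""
    vertices = list(out_links)
    pr = {v: 1.0 for v in vertices}
    in_links = {v: [] for v in vertices}
    for v, succs in out_links.items():
        for w in succs:
            in_links[w].append(v)
    while True:
        new = {v: (1 - d) + d * sum(pr[u] / len(out_links[u]) for u in in_links[v])
               for v in vertices}
        if max(abs(new[v] - pr[v]) for v in vertices) < tol:   # convergence test
            return new
        pr = new

g = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(g))
```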

Page 49

Algorithms

Positional Function – Positional Power / Weakness (Herings et al. 2001):

$POS_P(V_i) = \frac{1}{|V|} \sum_{V_j \in Out(V_i)} (1 + POS_P(V_j))$

$POS_W(V_i) = \frac{1}{|V|} \sum_{V_j \in In(V_i)} (1 + POS_W(V_j))$

HITS – Hyperlink-Induced Topic Search (Kleinberg 1999) – authorities / hubs:

$HITS_A(V_i) = \sum_{V_j \in In(V_i)} HITS_H(V_j)$

$HITS_H(V_i) = \sum_{V_j \in Out(V_i)} HITS_A(V_j)$
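An illustrative Python sketch of the HITS mutual-reinforcement iteration (the graph, iteration count and L2 normalization are assumptions, not prescribed by the slides):

```python
import math

def hits(out_links, iterations=50):
    """Iteratively update hub and authority scores for a graph given as
    dict vertex -> list of successors; scores are L2-normalized each round."""
    vertices = list(out_links)
    in_links = {v: [] for v in vertices}
    for v, succs in out_links.items():
        for w in succs:
            in_links[w].append(v)
    auth = {v: 1.0 for v in vertices}
    hub = {v: 1.0 for v in vertices}
    for _ in range(iterations):
        auth = {v: sum(hub[u] for u in in_links[v]) for v in vertices}
        hub = {v: sum(auth[w] for w in out_links[v]) for v in vertices}
        for scores in (auth, hub):                      # normalize
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for v in scores:
                scores[v] /= norm
    return auth, hub

g = {"A": ["B", "C"], "B": ["C"], "C": []}
print(hits(g))
```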

Page 50

Convergence

Convergence – stop when the error drops below a small threshold value (e.g. 0.0001):

$|S^{k+1}(V_i) - S^{k}(V_i)| < \text{threshold}$

Text-based graphs
– convergence usually achieved after 20-40 iterations
– more iterations needed for undirected graphs

Page 51

Convergence Curves

[Figure: ranking score vs. iteration (1-30) for a directed and an undirected graph with 250 vertices and 250 edges; both curves flatten out, the undirected one more slowly.]

Page 52

Spectral Clustering

Algorithms that cluster data points using a graph representation

Similarity graph:
– Represent the dataset as a weighted graph G(V,E); edge weights encode pairwise similarity
– Distance decreases as similarity increases

Clustering can be viewed as partitioning a similarity graph

Bi-partitioning: divide the vertices into two disjoint groups (A, B)

[Figure: a six-vertex similarity graph (edge weights such as 0.8, 0.7, 0.6, 0.2, 0.1) split into two groups A = {1, 2, 3} and B = {4, 5, 6}.]

Page 53

Clustering Objectives

Traditional definition of a "good" clustering:
1. Points assigned to the same cluster should be highly similar.
2. Points assigned to different clusters should be highly dissimilar.

Applied to the graph representation: minimize the weight of between-group connections

[Figure: the six-vertex similarity graph; the weak edges (0.1, 0.2) cross between the two groups.]

Options:
1. Find the minimum cut between two clusters
2. Apply spectral graph theory

Page 54

Spectral Graph Theory

Possible approach
– Represent a similarity graph as a matrix
– Apply knowledge from Linear Algebra

Spectral Graph Theory
– Analyse the "spectrum" of the matrix representing a graph
– Spectrum: the eigenvectors of a graph, ordered by the magnitude (strength) of their corresponding eigenvalues $\{\lambda_1, \lambda_2, \ldots, \lambda_n\}$

The eigenvalues and eigenvectors of a matrix provide global information about its structure ($W x = \lambda x$ for an n x n matrix W).

Page 55

Matrix Representations

Adjacency matrix (A)
– n x n matrix
– $A = [w_{ij}]$: $w_{ij}$ is the edge weight between vertices $x_i$ and $x_j$

     x1    x2    x3    x4    x5    x6
x1   0     0.8   0.6   0     0.1   0
x2   0.8   0     0.8   0     0     0
x3   0.6   0.8   0     0.2   0     0
x4   0     0     0.2   0     0.8   0.7
x5   0.1   0     0     0.8   0     0.8
x6   0     0     0     0.7   0.8   0

Important property:
– Symmetric matrix

[Figure: the corresponding six-vertex weighted similarity graph.]

Page 56

Matrix Representations (continued)

Degree matrix (D)
– n x n diagonal matrix
– $D(i,i) = \sum_j w_{ij}$: total weight of the edges incident to vertex $x_i$

     x1    x2    x3    x4    x5    x6
x1   1.5   0     0     0     0     0
x2   0     1.6   0     0     0     0
x3   0     0     1.6   0     0     0
x4   0     0     0     1.7   0     0
x5   0     0     0     0     1.7   0
x6   0     0     0     0     0     1.5

Page 57

Matrix Representations (continued)

Laplacian matrix (L)
– n x n symmetric matrix
– L = D - A

     x1    x2    x3    x4    x5    x6
x1   1.5   -0.8  -0.6  0     -0.1  0
x2   -0.8  1.6   -0.8  0     0     0
x3   -0.6  -0.8  1.6   -0.2  0     0
x4   0     0     -0.2  1.7   -0.8  -0.7
x5   -0.1  0     0     -0.8  1.7   -0.8
x6   0     0     0     -0.7  -0.8  1.5
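A short NumPy sketch (added for illustration) that reproduces the D and L matrices above from the weighted adjacency matrix:

```python
import numpy as np

# Weighted adjacency matrix from the slide (six points x1..x6).
A = np.array([
    [0.0, 0.8, 0.6, 0.0, 0.1, 0.0],
    [0.8, 0.0, 0.8, 0.0, 0.0, 0.0],
    [0.6, 0.8, 0.0, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.2, 0.0, 0.8, 0.7],
    [0.1, 0.0, 0.0, 0.8, 0.0, 0.8],
    [0.0, 0.0, 0.0, 0.7, 0.8, 0.0],
])

D = np.diag(A.sum(axis=1))   # degree matrix: row sums on the diagonal
L = D - A                    # unnormalized graph Laplacian

print(np.round(np.diag(D), 1))   # [1.5 1.6 1.6 1.7 1.7 1.5]
```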

Page 58

Spectral Clustering Algorithms

Three basic stages:

1. Pre-processing
   Construct a matrix representation of the dataset.

2. Decomposition
   Compute the eigenvalues and eigenvectors of the matrix.
   Map each point to a lower-dimensional representation based on one or more eigenvectors.

3. Grouping
   Assign points to two or more clusters, based on the new representation.

Page 59

Spectral Bi-partitioning Algorithm

1. Pre-processing
   – Build the Laplacian matrix L of the graph (as on the previous slide).

2. Decomposition
   – Find the eigenvalues Λ and eigenvectors X of the matrix L (for this example, Λ ≈ {0.0, 0.4, 2.2, 2.3, 2.5, 3.0}).
   – Map each vertex to the corresponding component of the second eigenvector, the Fiedler vector (λ2 = the algebraic connectivity of the graph):

     x1    0.2
     x2    0.2
     x3    0.2
     x4   -0.4
     x5   -0.7
     x6   -0.7

Page 60

Spectral Bi-partitioning (continued)

Grouping
– Sort the components of the reduced 1-dimensional vector.
– Identify clusters by splitting the sorted vector in two.

How to choose a splitting point?
– Naive approaches: split at 0, or at the mean or median value
– More expensive approaches: attempt to minimise the (normalised) cut in 1 dimension

Example (split at 0):
Cluster A (positive components): x1 = 0.2, x2 = 0.2, x3 = 0.2
Cluster B (negative components): x4 = -0.4, x5 = -0.7, x6 = -0.7
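A NumPy sketch of this bi-partitioning step (added here; note that the sign of an eigenvector is arbitrary, so which side is called A or B may flip):

```python
import numpy as np

def spectral_bipartition(A):
    """Split the vertices of a weighted graph in two using the Fiedler vector
    of the unnormalized Laplacian, with a split at 0."""
    D = np.diag(A.sum(axis=1))
    L = D - A
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                # eigenvector of the 2nd smallest eigenvalue
    return np.where(fiedler >= 0)[0], np.where(fiedler < 0)[0]

A = np.array([
    [0.0, 0.8, 0.6, 0.0, 0.1, 0.0],
    [0.8, 0.0, 0.8, 0.0, 0.0, 0.0],
    [0.6, 0.8, 0.0, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.2, 0.0, 0.8, 0.7],
    [0.1, 0.0, 0.0, 0.8, 0.0, 0.8],
    [0.0, 0.0, 0.0, 0.7, 0.8, 0.0],
])
print(spectral_bipartition(A))   # e.g. (array([0, 1, 2]), array([3, 4, 5]))
```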

Page 61

3-Clusters

Let's assume the following data points (shown in the accompanying figure).

Page 62

[Figure: the clustering obtained if we use the 2nd eigenvector vs. the clustering obtained if we use the 3rd eigenvector.]

Page 63

K-Way Spectral Clustering

How do we partition a graph into k clusters?

Two basic approaches:

1. Recursive bi-partitioning (Hagen et al., '91)
   – Recursively apply the bi-partitioning algorithm in a hierarchical, divisive manner.
   – Expensive: O(n^3).

2. Cluster multiple eigenvectors
   – Build a reduced space from multiple eigenvectors.
   – E.g., use the 3rd eigenvector to further partition the clusters produced by the 2nd eigenvector.
   – Commonly used in recent work; multiple eigenvectors prevent instability due to information loss.

Page 64

K-Eigenvector Clustering

K-eigenvector Algorithm (Ng et al., '01)

1. Pre-processing
   – Construct the scaled adjacency matrix $A' = D^{-1/2} A D^{-1/2}$

2. Decomposition
   – Find the eigenvalues and eigenvectors of A'.
   – Build the embedded space from the eigenvectors corresponding to the k largest eigenvalues.

3. Grouping
   – Apply k-means to the reduced n x k space to produce k clusters:
     – Place K points into the space represented by the objects being clustered; these points are the initial group centroids.
     – Assign each object to the group that has the closest centroid.
     – When all objects have been assigned, recalculate the positions of the K centroids.
     – Repeat the last two steps until the centroids no longer move.
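A condensed sketch of this pipeline (Python; using scikit-learn's KMeans is an implementation choice of this example, not something prescribed by the slides):

```python
import numpy as np
from sklearn.cluster import KMeans

def k_eigenvector_clustering(A, k):
    """Ng et al.-style spectral clustering sketch: scale A, embed with the top-k
    eigenvectors, row-normalize, then run k-means in the reduced space."""
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_scaled = D_inv_sqrt @ A @ D_inv_sqrt              # A' = D^-1/2 A D^-1/2
    eigvals, eigvecs = np.linalg.eigh(A_scaled)         # ascending order
    X = eigvecs[:, -k:]                                 # k largest eigenvalues
    X = X / np.linalg.norm(X, axis=1, keepdims=True)    # unit-length rows
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# Example: the six-point similarity matrix from the earlier slides, k = 2.
A = np.array([
    [0.0, 0.8, 0.6, 0.0, 0.1, 0.0],
    [0.8, 0.0, 0.8, 0.0, 0.0, 0.0],
    [0.6, 0.8, 0.0, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.2, 0.0, 0.8, 0.7],
    [0.1, 0.0, 0.0, 0.8, 0.0, 0.8],
    [0.0, 0.0, 0.0, 0.7, 0.8, 0.0],
])
print(k_eigenvector_clustering(A, 2))   # two clusters, e.g. [0 0 0 1 1 1]
```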

Page 65

Outline

Part 1: Elements of graph theory
– Terminology, notations, algorithms on graphs

Part 2: Graph-based algorithms for IR
– Web search
– Text clustering and classification

Part 3: Graph-based algorithms for NLP
– Word Sense Disambiguation
– Clustering of entities (names/numbers)
– Thesaurus construction
– Text summarization

Page 66

Web-link Analysis for Improved IR

• The Web is heterogeneous and unstructured

• There is no quality control over what is published on the web

• There is commercial interest in manipulating rankings

• Use a model of Web page importance based on link analysis (Brin and Page 1998), (Kleinberg 1999)

• Related work
  • Academic citation analysis
  • Clustering methods based on link structure
  • Hubs and Authorities model

Page 67

Back-links

The link structure of the Web approximates the importance / quality of Web pages

PageRank:
– Pages with lots of back-links are important
– Back-links coming from important pages convey more importance to a page
– Problem: rank sink

Page 68

Rank Sink

A cycle of pages pointed to by some incoming link

Problem: this loop will accumulate rank but never distribute any rank outside

Page 69

Escape Term

Solution: Rank Source

E(u) is some vector over the web pages
– uniform, favorite pages, etc.

$PR(V_i) = (1-d) \cdot E(V_i) + d \sum_{V_j \in In(V_i)} \frac{PR(V_j)}{|Out(V_j)|}$

Recall that the rank (PageRank) vector is the eigenvector corresponding to the largest eigenvalue.

Page 70

Random Surfer Model

PageRank corresponds to the probability distribution of a random walk on the web graph

E(u) can be re-phrased as: the random surfer gets bored periodically and jumps to a different page, so it is not kept in a loop forever

Page 71

Implementation

Computing resources
– 24 million pages
– 75 million URLs

Processing steps:
– Assign a unique integer ID to each URL
– Sort and remove dangling links
– Initial rank assignment
– Iterate until convergence
– Add back the dangling links and re-compute

Page 72

Ranking Web-pages

Combined PageRank + vectorial model

Vectorial model:
– Finds similarity between query and documents
– Uses term weighting schemes (TF.IDF)
– Ranks documents based on similarity
– Ranking is query-specific

PageRank:
– Ranks documents based on importance
– Ranking is general (not query-specific)

Combined model:
– Weighted combination of the vectorial and PageRank rankings

Page 73

Issues

Users are not random walkers
– Content-based methods

Starting point distribution
– Actual usage data as the starting vector

Reinforcing effects / bias towards main pages

No query-specific rank

Linkage spam
– PageRank favors pages that managed to get other pages to link to them
– Linkage is not necessarily a sign of relevancy, only of promotion (advertisement...)

Page 74

Personalized PageRank

The initial value of the rank source E can change the ranking:

1. Uniform over all pages: e.g. copyright warnings, disclaimers, mailing list archives may end up ranked overly high

2. Total weight on a few sets of pages, based on a-priori known popularity, e.g. Netscape, McCarthy

3. Different weight depending on the query or topic:
   1. topic-sensitive PageRank (Haveliwala 2001)
   2. query-sensitive PageRank (Richardson 2002)

Page 75

Link Analysis for IR

PageRank is a global ranking based on the web's graph structure

PageRank relies on back-link information to "bring order to the web"

Proven in practice to exceed the performance of search engines based exclusively on vectorial / boolean models

Can be adapted to "personalized" searches
– Although personalized search engines are still not very popular

Page 76

Text Classification

Given:
– A description of an instance, x ∈ X, where X is the instance language or instance space.
– A fixed set of categories: C = {c1, c2, ..., cn}

Determine:
– The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C

Page 77

Text Classification Examples

Assign labels to each document or web-page:

Labels are most often topics, such as Yahoo categories
– e.g., "finance", "sports", "news>world>asia>business"

Labels may be genres
– e.g., "editorials", "movie-reviews", "news"

Labels may be opinions
– e.g., "like", "hate", "neutral"

Labels may be domain-specific and binary
– e.g., "interesting-to-me" : "not-interesting-to-me"
– e.g., "spam" : "not-spam"
– e.g., "is a toner cartridge ad" : "isn't"

Page 78

Traditional Methods

Manual classification
– Used by Yahoo!, Looksmart, about.com, ODP, Medline
– Very accurate when the job is done by experts
– Consistent when the problem size and team are small
– Difficult and expensive to scale

Automatic document classification
– Hand-coded rule-based systems
  – Used by a CS department's spam filter, Reuters, ...
  – E.g., assign a category if the document contains a given boolean combination of words
– Supervised learning of a document-label assignment function
  – Many new systems rely on machine learning: k-Nearest Neighbors (simple, powerful), Naive Bayes (simple, common method), Support Vector Machines (newer, more powerful), ...

Page 79

Text Clustering

Goal similar to the task of text classification: group texts into categories

No a priori known set of categories; cluster the data into sets of similar documents

Pros: no annotated data is required
Cons: more difficult to assign labels

Traditional approaches: all clustering methods have also been applied to text clustering
– Agglomerative clustering
– K-means
– ...

Page 80

Spectral Learning / Clustering

Use graph representations of text collections for:
– Text clustering: spectral clustering
– Text classification: spectral learning

Recall bi-partitioning and k-way clustering:
– Use the 2nd or the first K eigenvectors as a reduced space
– Map vertices to the reduced space
– Cluster in the reduced space
  – E.g., cluster the points in the 2nd eigenvector into two classes, positive vs. negative => bi-partitioning
  – Use k-means to cluster the representation obtained from the top K eigenvectors

Page 81

Spectral Learning / Clustering

Spectral representation (Kamvar et al., 2003):
– Build the similarity graph for the collection
– Compute the Laplacian matrix (alternatives: normalized L, LSA, etc.)
– Find the K largest eigenvectors of the Laplacian => X
– Normalize the rows in X to unit length

Clustering:
– Use k-means (or another clustering algorithm) to cluster the rows of X into K clusters

Classification:
– Assign the known labels to the training points in the collection, using the representation in X
– Classify the unlabeled rows corresponding to the test points in the collection, using the representation in X, with the help of any learning algorithm (SVM, Naive Bayes, etc.)

Page 82

Empirical Results

Data sets:
– 20 newsgroups: 1000 postings from each of 20 usenet newsgroups
– 3 newsgroups: 3 of the 20 newsgroups (crypt, politics, religion)
– Lymphoma: gene expression profiles; 2 classes with 42 and 54 samples each
– Soybean: the soybean-large data set from the UCI repository, with 15 classes

Page 83

Text Clustering

Compare spectral clustering with k-means

Use document-to-document similarities to build a similarity graph for each collection
– Pairwise cosine similarity
– Use k = 20

            Spectral Learning   K-means
3 news      0.84                0.20
20 news     0.36                0.07
lymphoma    0.50                0.10
soybean     0.41                0.34

Page 84

Text Classification

Use the same spectral representation, but apply a learning algorithm trained on the labeled examples
– Reinforce the prior knowledge: set Aij to 1 if i and j are known to belong to the same class, or 0 otherwise

Can naturally integrate unlabeled training examples – they are simply included in the initial matrix

Page 85

Text Classification

Page 86

References

Brin, S. and Page, L., The anatomy of a large-scale hypertextual Web search engine, Computer Networks and ISDN Systems, 1998

Herings, P.J., van der Laan, G. and Talman, D., Measuring the power of nodes in digraphs, Technical Report, Tinbergen Institute, 2001

Kamvar, S. and Klein, D. and Manning, C. Spectral Learning, IJCAI 2003

Kleinberg, J., Authoritative sources in a hyperlinked environment, Journal of the ACM, 1999

Ng, A. and Jordan, M. and Weiss, Y. On Spectral Clustering: Analysis and an algorithm, NIPS 14, 2002

Page 87

Outline

Part 1: Elements of graph theory
– Terminology, notations, algorithms on graphs

Part 2: Graph-based algorithms for IR
– Web search
– Text clustering and classification

Part 3: Graph-based algorithms for NLP
– Word Sense Disambiguation
– Clustering of entities (names/numbers)
– Thesaurus construction
– Text summarization

Page 88

Word Sense Disambiguation

Word sense disambiguation is the problem of selecting a sense for a word from a set of predefined possibilities.
– The sense inventory usually comes from a dictionary or thesaurus.
– Knowledge-intensive methods, supervised learning, and (sometimes) bootstrapping approaches

Word polysemy (with respect to a dictionary): determine which sense of a word is used in a specific sentence
– Ex: "chair" – furniture or person
  "Sit on a chair"  /  "Take a seat on this chair"
  "The chair of the Math Department"  /  "The chair of the meeting"
– Ex: "child" – young person or human offspring

Page 89

Graph-based Solutions for WSD

Use information derived from dictionaries / semantic networks to construct graphs

Build graphs using measures of "similarity"
– Similarity determined between pairs of concepts, or between a word and its surrounding context
– Distributional similarity (Lee 1999) (Lin 1999)
– Dictionary-based similarity (Rada 1989)

[Figure: a fragment of the WordNet animal hierarchy around carnivore, fissiped mammal, canine/canid, feline/felid, bear, wolf, wild dog, hunting dog, dachshund, terrier, hyena, hyena dog, dingo.]

Page 90

Semantic Similarity Metrics

Input: two concepts (same part of speech)
Output: similarity measure

E.g. (Leacock and Chodorow 1998):

$Similarity(C_1, C_2) = -\log \frac{Path(C_1, C_2)}{2D}$

where D is the taxonomy depth
– E.g. Similarity(wolf, dog) = 0.60; Similarity(wolf, bear) = 0.42

Other metrics:
– Similarity using information content (Resnik 1995) (Lin 1998)
– Similarity using gloss-based paths across different hierarchies (Mihalcea and Moldovan 1999)
– Conceptual density measure between noun semantic hierarchies and the current context (Agirre and Rigau 1995)
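A one-function Python sketch of the Leacock-Chodorow formula above (the numeric inputs are illustrative; the slide's 0.60 / 0.42 scores come from WordNet path lengths):

```python
import math

def leacock_chodorow(path_length, depth):
    """Leacock-Chodorow similarity: -log(path / (2 * D)), where D is the
    taxonomy depth and path is the shortest path between the two concepts."""
    return -math.log(path_length / (2.0 * depth))

print(round(leacock_chodorow(4, 16), 2))   # shorter paths give higher similarity
```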

Page 91

Lexical Chains for WSD

Apply measures of semantic similarity in a global context

Lexical chains (Hirst and St-Onge 1988), (Halliday and Hasan 1976)

"A lexical chain is a sequence of semantically related words, which creates a context and contributes to the continuity of meaning and the coherence of a discourse"

Algorithm for finding lexical chains:
1. Select the candidate words from the text. These are words for which we can compute similarity measures, and therefore most of the time they have the same part of speech.
2. For each such candidate word, and for each meaning of this word, find a chain to receive the candidate word sense, based on a semantic relatedness measure between the concepts that are already in the chain and the candidate word meaning.
3. If such a chain is found, insert the word in this chain; otherwise, create a new chain.

Page 92

Lexical Chains

"A very long train traveling along the rails with a constant velocity v in a certain direction ..."

train:  #1 public transport;  #2 ordered set of things;  #3 piece of cloth
travel: #1 change location;   #2 undergo transportation
rail:   #1 a barrier;         #2 a bar of steel for trains;  #3 a small bird

Page 93

Lexical Chains for WSD

Identify lexical chains in a text
– Usually target one part of speech at a time

Identify the meaning of words based on their membership in a lexical chain

Evaluation:
– (Galley and McKeown 2003): lexical chains on 74 SemCor texts give 62.09%
– (Mihalcea and Moldovan 2000): on five SemCor texts, lexical chains "anchored" on monosemous words give 90% precision with 60% recall

Page 94

Graph-based Ranking on Semantic Networks

Goal: build a semantic graph that represents the meaning of the text

Input: any open text
Output: graph of meanings (synsets)
– "importance" scores attached to each synset
– relations that connect them

Models text cohesion (Halliday and Hasan 1976)
– From a given concept, follow "links" to semantically related concepts

Graph-based ranking identifies the most recommended concepts

Page 95

"Two U.S. soldiers and an unknown number of civilian contractors are unaccounted for after a fuel convoy was attacked near the Baghdad International Airport today, a senior Pentagon official said. One U.S. soldier and an Iraqi driver were killed in the incident. ..."

Page 96

Main Steps

Step 1: Preprocessing
– SGML parsing, text tokenization, part-of-speech tagging, lemmatization

Step 2: Assume any possible meaning of a word in a text is potentially correct
– Insert all corresponding synsets into the graph

Step 3: Draw connections (edges) between vertices

Step 4: Apply the graph-based ranking algorithm
– PageRank, HITS, Positional power

Page 97

Semantic Relations

Main relations provided by WordNet
– ISA (hypernym/hyponym)
– PART-OF (meronym/holonym)
– causality
– attribute
– nominalizations
– domain links

Derived relations
– coord: synsets with a common hypernym

Edges (connections)
– directed (in which direction?) / undirected
– Best results with undirected graphs

Output: graph of concepts (synsets) identified in the text
– "importance" scores attached to each synset
– relations that connect them

Page 98

Word Sense Disambiguation

Rank the synsets/meanings attached to each word

Unsupervised method for semantic ambiguity resolution of all words in unrestricted text (Mihalcea et al. 2004)

Related algorithms:
– Lesk
– Baselines (most frequent sense / random)

Hybrid:
– Graph-based ranking + Lesk
– Graph-based ranking + most frequent sense

Evaluation:
– "Informed" (with sense ordering)
– "Uninformed" (no sense ordering)

Data:
– Senseval-2 all-words data (three texts, average size 600 words)
– SemCor subset (five texts: law, sports, debates, education, entertainment)

Page 99

Evaluation

"Uninformed" (no sense order):

                Size (words)   Random    Lesk      TextRank   TextRank+Lesk
SemCor
law             825            37.12%    39.62%    46.42%     49.36%
sports          808            29.95%    33.00%    40.59%     46.18%
education       898            37.63%    41.33%    46.88%     52.00%
debates         799            40.17%    42.38%    47.80%     50.52%
entertainment   802            39.27%    43.05%    43.89%     49.31%
Avg.            826            36.82%    39.87%    45.11%     49.47%
Senseval-2
d00             471            28.97%    43.94%    43.94%     47.77%
d01             784            45.47%    52.65%    54.46%     57.39%
d02             514            39.24%    49.61%    54.28%     56.42%
Avg.            590            37.89%    48.73%    50.89%     53.86%
Average (all)   740            37.22%    43.19%    47.27%     51.16%

Page 100

Evaluation

"Informed" (sense order integrated):

                Size (words)   MFS       Lesk      TextRank   TextRank+Lesk
SemCor
law             825            69.09%    72.65%    73.21%     73.97%
sports          808            57.30%    64.21%    68.31%     68.31%
education       898            64.03%    69.33%    71.65%     71.53%
debates         799            66.33%    70.07%    71.14%     71.67%
entertainment   802            59.72%    64.98%    66.02%     66.16%
Avg.            826            63.24%    68.24%    70.06%     70.32%
Senseval-2
d00             471            51.70%    53.07%    58.17%     57.74%
d01             784            60.80%    64.28%    67.85%     68.11%
d02             514            55.97%    62.84%    63.81%     64.39%
Avg.            590            56.15%    60.06%    63.27%     63.41%
Average (all)   740            60.58%    65.17%    67.51%     67.72%

Page 101

Ambiguous Entities

Name ambiguity in research papers:
– David S. Johnson, David Johnson, D. Johnson
– David Johnson (Rice), David Johnson (AT&T)

Similar problem across entity types:
– Washington (person), Washington (state), Washington (city)

Number ambiguity in texts:
– quantity: e.g. 100 miles
– time: e.g. 100 years
– money: e.g. 100 euro
– misc.: anything else

Can be modeled as a clustering problem

Page 102

Name Disambiguation

(Han et al., 2005)

Extract attributes for each person – e.g. for research papers:
– Co-authors
– Paper titles
– Venue titles

Each word in these attribute sets constitutes a binary feature

Apply a weighting scheme:
– E.g. normalized TF, TF.IDF

Construct a vector for each occurrence of a person name

Page 103

Spectral Clustering

Apply k-way spectral clustering to name data sets

Two data sets:
– DBLP: authors of 400,000 citation records – use the top 14 ambiguous names
– Web-based data set: 11 authors named "J. Smith" and 15 authors named "J. Anderson", in a total of 567 citations

Clustering evaluated using confusion matrices
– Disambiguation accuracy = sum of the diagonal elements A[i,i] divided by the sum of all elements in the matrix

Page 104

Name Disambiguation – Results

DBLP data set: [table of per-name disambiguation accuracies not reproduced in this transcript]

Web-based data set:
– 11 "J. Smith": 84.7% (k-means: 75.4%)
– 15 "J. Anderson": 71.2% (k-means: 67.2%)

Page 105

Number Disambiguation

Classify numbers into 4 classes: quantity, time, money, miscellaneous

Two algorithms:
– Spectral clustering on bipartite graphs
– Tripartite updating

Features:
– Surrounding window of +/- 2 words, and the "range" of the number (w <= 3, 3 < w <= 10, 10 < w <= 30, ...)
– Map all numbers in the data set onto these features

(Radev, 2004)

Page 106

Tripartite Updating

F – set of features; L / U – sets of labeled / unlabeled examples

T1: connectivity matrix between F and L

T2: connectivity matrix between F and U

F and U are iteratively updated from each other: the feature scores F are first seeded from the labeled examples through T1, and then the feature scores and the scores of the unlabeled examples U are propagated back and forth through T2 (with normalization at each step) until convergence.

Page 107

Number Disambiguation – Results

Use a subset of the numbers from the APW section of the ACQUAINT corpus
– 5000 unlabeled examples, of which 300 were manually annotated
– 200 labeled examples for evaluation

Page 108

Automatic Thesaurus Generation

Idea: use an (online) traditional dictionary to generate a graph structure, and use it to create thesaurus-like entries for all words (stopwords included)
– (Jannink and Wiederhold, 1999)

For instance:
– Input: the 1912 Webster's Dictionary
– Output: a repository with ranked relationships between terms

The repository is comparable with handcrafted efforts such as MindNet and WordNet

Page 109

Algorithm

1. Extract a directed graph from the dictionary
   – e.g. relations between head-words and the words included in their definitions

2. Obtain the relative measure of arc importance

3. Rank the arcs with ArcRank (a graph-based algorithm for edge ranking)


RANLP 2005 110

Algorithm (cont.)

1. Extract a directed graph from the dictionary

Directed graph: one arc from each headword to all words in its definition.

Potential problems: syllable and accent markers in head words, misspelled head words, accents, special characters, mistagged fields, common abbreviations in definitions, stemming, multi-word head words, undefined words with common prefixes, undefined hyphenated words.

Source words: words never used in definitions
Sink words: undefined words

Example: the head word "Transport", defined as "To carry or bear from one place to another; …", gets an arc to each word of its definition: "to", "carry", "or", "bear", …
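A minimal sketch of this first step (my own illustration; it assumes the dictionary is already available as a headword -> definition mapping and uses networkx, which is not part of the original work):

import networkx as nx

def dictionary_graph(entries):
    # One arc from each headword to every word in its definition.
    g = nx.DiGraph()
    for head, definition in entries.items():
        for word in definition.lower().replace(";", " ").replace(".", " ").split():
            g.add_edge(head.lower(), word)
    return g

g = dictionary_graph({"Transport": "To carry or bear from one place to another"})
# Source words: never used in any definition; sink words: undefined (no outgoing arcs).
sources = [n for n in g if g.in_degree(n) == 0]
sinks = [n for n in g if g.out_degree(n) == 0]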


RANLP 2005 111

Algorithm (cont.)

2. Obtain the relative measure of arc importance

Example arc: "Transport" → "carry", where s is the source node, t the target node, e the edge, and |a_s| the number of outgoing arcs of s.

r_e = (p_s / |a_s|) / p_t

where:
– r_e is the rank of the edge
– p_s is the rank of the source node
– p_t is the rank of the target node
– |a_s| is the number of outgoing arcs of the source node

For more than one arc (m) between s and t:

r_s,t = (Σ_{e=1..m} p_s / |a_s|) / p_t
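Translated directly into code (assuming node ranks p, e.g. PageRank values, and out-degrees are already computed; the function name is mine):

def arc_importance(p, out_degree, s, t, m=1):
    # r_{s,t} = (sum over the m parallel arcs of p_s / |a_s|) / p_t
    # Each of the m arcs from s contributes p_s / |a_s|, so the sum is m * p_s / |a_s|.
    return (m * p[s] / out_degree[s]) / p[t]

# e.g. arc_importance(p, out_degree, "transport", "carry")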


RANLP 2005 112

Algorithm (cont.)

3. Rank the arcs using ArcRank
– Ranks the importance of arcs with respect to their source and target nodes
– Promotes arcs that are important at both endpoints

ArcRank
Input: triples (source s, target t, importance v_s,t)
1. given source node s and target node t
2. at s, sort the values v_s,tj and rank the outgoing arcs: r_s(v_s,tj)
3. at t, sort the values v_si,t and rank the incoming arcs: r_t(v_si,t)
4. compute ArcRank = mean(r_s(v_s,t), r_t(v_s,t))
5. ranking the sorted arc importances:
• {0.9, 0.75, 0.75, 0.75, 0.6, 0.5, …, 0.1} – sample values
• {1, 2, 2, 2, …} – equal values take the same rank
• {1, 2, 2, 2, 3, …} – ranks are numbered consecutively
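A small sketch of the ranking step (my own illustration): dense ranks with ties for the sorted importance values, then ArcRank as the mean of the arc's rank at its source and at its target.

def ranks_with_ties(values):
    # {0.9, 0.75, 0.75, 0.75, 0.6, ...} -> ranks {1, 2, 2, 2, 3, ...}
    order = sorted(set(values), reverse=True)
    return {v: r for r, v in enumerate(order, start=1)}

def arc_rank(rank_at_source, rank_at_target):
    # Mean of the arc's rank among the source's outgoing arcs
    # and its rank among the target's incoming arcs.
    return (rank_at_source + rank_at_target) / 2.0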


RANLP 2005 113

Results

An automatically built thesaurus starting with Webster's, compared with WordNet and MindNet:

Webster's repository
– 96,800 terms; 112,897 distinct words
– error rates: <1% of original input; <0.05% incorrect arcs (hyphenation); <0.05% incorrect terms (spelling); 0% artificial terms

WordNet
– 99,642 terms; 173,941 word senses
– error rates: ~0.1% inappropriate classifications; ~1–10% artificial & repeated terms

MindNet
– 159,000 head words; 713,000 relationships between headwords
– not publicly available


RANLP 2005 114

Results Analysis

The Webster’s repository

PROS
– It has a very general structure
– It can also address stopwords
– It has more relationships than WordNet
– It allows for any relationships between words (not only within a lexical category)
– It is more "natural": it does not include artificial concepts such as non-existent words and artificial categorizations

CONS
– The type of relationship is not always evident
– The accuracy increases with the amount of data; nevertheless, the dictionary contains sparse definitions
– It only distinguishes senses based on usage, not grammar
– It is less precise than other thesauruses (e.g. WordNet)


RANLP 2005 115

Extractive Summarization

Graph-based ranking algorithms for extractive summarization

1. Build a graph:
– Sentences in a text = vertices
– Similarity between sentences = weighted edges
– Model the cohesion of text using inter-sentential similarity

2. Run a graph-based ranking algorithm:
– keep the top N ranked sentences
– = the sentences most "recommended" by other sentences

(Mihalcea & Tarau, 2004) (Erkan & Radev, 2004)
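A compact sketch of the two steps (my own illustration, assuming a sentence_similarity function such as the one defined two slides below, and using PageRank from networkx as the ranking algorithm; the other ranking algorithms from the tutorial could be plugged in the same way):

import networkx as nx

def extractive_summary(sentences, similarity, top_n=5):
    # 1. Build the graph: sentences = vertices, similarities = weighted edges.
    g = nx.Graph()
    g.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            w = similarity(sentences[i], sentences[j])
            if w > 0:
                g.add_edge(i, j, weight=w)
    # 2. Rank the vertices and keep the top N sentences, in document order.
    scores = nx.pagerank(g, weight="weight")
    best = sorted(sorted(scores, key=scores.get, reverse=True)[:top_n])
    return [sentences[i] for i in best]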


RANLP 2005 116

A Process of Recommendation

Underlying idea: a process of recommendation
– A sentence that addresses certain concepts in a text gives the reader a recommendation to refer to other sentences in the text that address the same concepts
– Text knitting (Hobbs 1974): "repetition in text knits the discourse together"
– Text cohesion (Halliday & Hasan 1976)


RANLP 2005 117

Sentence Similarity

Inter-sentential relationships – weighted edges

Count the number of common concepts
Normalize by the lengths of the two sentences

Other similarity metrics are also possible:
– cosine, string kernels, etc.
– work in progress

Sim(S_i, S_j) = |{w_k : w_k ∈ S_i and w_k ∈ S_j}| / (log|S_i| + log|S_j|)
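In code, with sentences given as lists of words (a direct translation of the formula above; the guard against degenerate sentence lengths is mine):

import math

def sentence_similarity(s1, s2):
    # |{w : w in S_i and w in S_j}| / (log|S_i| + log|S_j|)
    overlap = len(set(s1) & set(s2))
    denom = math.log(len(s1)) + math.log(len(s2))
    return overlap / denom if denom > 0 else 0.0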


RANLP 2005 118

Graph Structure

Undirected
– No direction established between sentences in the text
– A sentence can "recommend" sentences that precede or follow in the text

Directed forward
– A sentence can "recommend" only sentences that follow in the text
– Seems more appropriate for movie reviews, stories, etc.

Directed backward
– A sentence can "recommend" only sentences that precede in the text
– More appropriate for news articles


RANLP 2005 119

An Example

A text from DUC 2002 on "Hurricane Gilbert" – 24 sentences:

3. r i BC-HurricaneGilbert 09-11 0339
4. BC-Hurricane Gilbert, 0348
5. Hurricane Gilbert Heads Toward Dominican Coast
6. By RUDDY GONZALEZ
7. Associated Press Writer
8. SANTO DOMINGO, Dominican Republic (AP)
9. Hurricane Gilbert swept toward the Dominican Republic Sunday, and the Civil Defense alerted its heavily populated south coast to prepare for high winds, heavy rains and high seas.
10. The storm was approaching from the southeast with sustained winds of 75 mph gusting to 92 mph.
11. "There is no need for alarm," Civil Defense Director Eugenio Cabral said in a television alert shortly before midnight Saturday.
12. Cabral said residents of the province of Barahona should closely follow Gilbert's movement.
13. An estimated 100,000 people live in the province, including 70,000 in the city of Barahona, about 125 miles west of Santo Domingo.
14. Tropical Storm Gilbert formed in the eastern Caribbean and strengthened into a hurricane Saturday night.
15. The National Hurricane Center in Miami reported its position at 2 a.m. Sunday at latitude 16.1 north, longitude 67.5 west, about 140 miles south of Ponce, Puerto Rico, and 200 miles southeast of Santo Domingo.
16. The National Weather Service in San Juan, Puerto Rico, said Gilbert was moving westward at 15 mph with a "broad area of cloudiness and heavy weather" rotating around the center of the storm.
17. The weather service issued a flash flood watch for Puerto Rico and the Virgin Islands until at least 6 p.m. Sunday.
18. Strong winds associated with the Gilbert brought coastal flooding, strong southeast winds and up to 12 feet to Puerto Rico's south coast.
19. There were no reports of casualties.
20. San Juan, on the north coast, had heavy rains and gusts Saturday, but they subsided during the night.
21. On Saturday, Hurricane Florence was downgraded to a tropical storm and its remnants pushed inland from the U.S. Gulf Coast.
22. Residents returned home, happy to find little damage from 80 mph winds and sheets of rain.
23. Florence, the sixth named storm of the 1988 Atlantic storm season, was the second hurricane.
24. The first, Debby, reached minimal hurricane strength briefly before hitting the Mexican coast last month.


RANLP 2005 120

An Example (cont'd)

[Figure: weighted similarity graph over the 24 sentences – edge labels give inter-sentence similarities, bracketed values give the resulting sentence scores]


RANLP 2005 121

Evaluation

Task-based evaluation: automatic text summarization
– Single document summarization: 100-word summaries
– Multiple document summarization: 100-word multi-doc summaries, over clusters of ~10 documents

Data from DUC (Document Understanding Conference)
– DUC 2002
– 567 single documents
– 59 clusters of related documents

Automatic evaluation with ROUGE (Lin & Hovy 2003)
– n-gram based evaluation; unigrams found to have the highest correlation with human judgment
– no stopwords, stemming
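ROUGE-N is essentially n-gram recall of the automatic summary against the reference summaries; below is a simplified ROUGE-1-style stand-in (my own sketch, not the official toolkit, which also handles stemming, stopword removal and multiple references):

from collections import Counter

def rouge_1_recall(candidate, reference):
    # Unigram recall: overlapping unigrams / unigrams in the reference summary.
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(c, ref[w]) for w, c in cand.items())
    return overlap / max(1, sum(ref.values()))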


RANLP 2005 122

Results: Single Document Summarization

Single-doc summaries for 567 documents (DUC 2002)

Graph algorithm   Undirected   Dir. forward   Dir. backward
HITS_A^W          0.4912       0.4584         0.5023
HITS_H^W          0.4912       0.5023         0.4584
POS_P^W           0.4878       0.4538         0.3910
POS_W^W           0.4878       0.3910         0.4538
PageRank^W        0.4904       0.4202         0.5008

Top 5 systems (DUC 2002):
S27       S31       S28       S21       S29       Baseline
0.5011    0.4914    0.4890    0.4869    0.4681    0.4799


RANLP 2005 123

Multiple Document Summarization

Cascaded summarization ("meta" summarizer)
– Use the best single-document summarization algorithms:
  PageRank (Undirected / Directed Backward)
  HITS_A (Undirected / Directed Backward)
– 100-word single-document summaries
– 100-word "summary of summaries"

Avoid sentence redundancy:
– set a maximum threshold on sentence similarity (0.5)

Evaluation:
– build summaries for 59 clusters of ~10 documents
– compare with the top 5 performing systems at DUC 2002
– baseline: first sentence in each document
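A sketch of the "summary of summaries" step (my own illustration): the sentences of the single-document summaries are assumed to arrive already ranked, and a sentence is kept only if its similarity to every sentence already selected stays below the 0.5 threshold, up to roughly 100 words; sentence_similarity is the word-overlap measure defined earlier.

def meta_summary(ranked_sentences, similarity, max_sim=0.5, max_words=100):
    # Greedy selection with a redundancy threshold on sentence similarity.
    selected, words = [], 0
    for sent in ranked_sentences:
        if any(similarity(sent.split(), s.split()) > max_sim for s in selected):
            continue   # too similar to an already selected sentence
        selected.append(sent)
        words += len(sent.split())
        if words >= max_words:
            break
    return selected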


RANLP 2005 124

Results: Multiple Document Summarization

Multi-doc summaries for 59 clusters (DUC 2002); rows and columns give the algorithm / graph direction used in the two stages of the cascaded summarizer:

                PageRank-U   PageRank-DB   HITSA-U   HITSA-DB
PageRank-U      0.3552       0.3499        0.3456    0.3465
PageRank-DB     0.3502       0.3448        0.3519    0.3439
HITSA-U         0.3368       0.3259        0.3212    0.3423
HITSA-DB        0.3572       0.3520        0.3462    0.3473

Top 5 systems (DUC 2002):
S26       S19       S29       S25       S20       Baseline
0.3578    0.3447    0.3264    0.3056    0.3047    0.2932


RANLP 2005 125

Portuguese Single-Document Evaluation

• 100 news articles from Jornal de Brasil and Folha de Sao Paulo
• Create summaries of the same length as the manual summaries
• Automatic evaluation with ROUGE
• Same baseline (top sentences in each document)

Graph algorithm   Undirected   Dir. forward   Dir. backward
HITS-A            0.4814       0.4834         0.5002
PageRank          0.4939       0.4574         0.5121
Baseline          0.4963


RANLP 2005 126

Graph-based Ranking for Sentence Extraction

Unsupervised method for sentence extraction
– based exclusively on information drawn from the text itself

Goes beyond simple sentence connectivity

Gives a ranking over all sentences in a text
– can be adapted to longer / shorter summaries

No training data required
– can be adapted to other languages
– already applied to Portuguese


RANLP 2005 127

Subjectivity Analysis for Sentiment Classification

The objective is to detect subjective expressions in text (opinions vs. facts)

Use this information to improve polarity classification (positive vs. negative)
– e.g. movie reviews (see: www.rottentomatoes.com)

Sentiment analysis can be considered a document classification problem, with target classes focusing on the author's sentiment rather than on topic-based categories
– standard machine learning classification techniques can be applied


RANLP 2005 128

Subjectivity Extraction


RANLP 2005 129

Subjectivity Detection/Extraction

Detecting the subjective sentences in a text may be useful for filtering out the objective sentences, creating a subjective extract

Subjective extracts facilitate the polarity analysis of the text (increased accuracy at reduced input size)

Subjectivity detection can use local and contextual features:
– Local: relies on individual sentence classifications using standard machine learning techniques (SVM, Naïve Bayes, etc.) trained on an annotated data set
– Contextual: uses context information, e.g. sentences occurring near each other tend to share the same subjectivity status (coherence)

(Pang and Lee, 2004)


RANLP 2005 130

Cut-based Subjectivity Classification

Standard classification techniques usually consider only individual features (classify one sentence at a time).

Cut-based classification takes into account both individual and contextual (structural) features

Suppose we have n items x1, …, xn to divide into two classes, C1 and C2.

Individual scores: indj(xi) - non-negative estimates of each xi being in Cj based on the features of xi alone

Association scores: assoc(xi,xk) - non-negative estimates of how important it is that xi and xk be in the same class


RANLP 2005 131

Cut-based Classification

Maximize each item's assignment score (its individual score for the class it is assigned to, minus its individual score for the other class), while penalizing the assignment of different classes to highly associated items

Formulated as an optimization problem: assign the items xi to classes C1 and C2 so as to minimize the partition cost:

cost = Σ_{x ∈ C1} ind2(x) + Σ_{x ∈ C2} ind1(x) + Σ_{xi ∈ C1, xk ∈ C2} assoc(xi, xk)
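Spelled out in code (my own sketch; ind1, ind2 and assoc are assumed to be given as functions):

def partition_cost(C1, C2, ind1, ind2, assoc):
    # Each item pays its individual score for the class it was NOT assigned to,
    # plus the association score of every pair split across the two classes.
    cost = sum(ind2(x) for x in C1) + sum(ind1(x) for x in C2)
    cost += sum(assoc(xi, xk) for xi in C1 for xk in C2)
    return cost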


RANLP 2005 132

Cut-based Algorithm

There are 2^n possible binary partitions of the n elements, so we need an efficient algorithm to solve the optimization problem

Build an undirected graph G with vertices {v1, …, vn, s, t} and edges:
– (s, vi) with weight ind1(xi)
– (vi, t) with weight ind2(xi)
– (vi, vk) with weight assoc(xi, xk)


RANLP 2005 133

Cut-based Algorithm (cont.)

Cut: a partition of the vertices into two sets

S = {s} ∪ S' and T = {t} ∪ T', where s ∉ S', t ∉ T'

The cost is the sum of the weights of all edges crossing from S to T

A minimum cut is a cut with minimal cost

A minimum cut can be found using maximum-flow algorithms, with polynomial asymptotic running times

Use the min-cut / max-flow algorithm
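A minimal sketch of the whole construction (my own illustration, using the max-flow / min-cut routine from networkx; ind1, ind2 and assoc are assumed to be given):

import networkx as nx

def min_cut_classify(items, ind1, ind2, assoc):
    # Graph from the previous slide: (s, v_i) weighted ind1(x_i),
    # (v_i, t) weighted ind2(x_i), (v_i, v_k) weighted assoc(x_i, x_k).
    g = nx.DiGraph()
    for i, x in enumerate(items):
        g.add_edge("s", i, capacity=ind1(x))
        g.add_edge(i, "t", capacity=ind2(x))
    for i in range(len(items)):
        for k in range(i + 1, len(items)):
            a = assoc(items[i], items[k])
            if a > 0:
                g.add_edge(i, k, capacity=a)
                g.add_edge(k, i, capacity=a)
    # Minimum s-t cut via maximum flow; nodes on the s side go to class 1.
    _, (S, T) = nx.minimum_cut(g, "s", "t")
    return S - {"s"}, T - {"t"}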


RANLP 2005 134

Cut-based Algorithm (cont.)

Notice that without the structural information we would be undecided about the assignment of node M


RANLP 2005 135

Subjectivity Extraction

Assign every individual sentence a subjectivity score
– e.g. the probability of the sentence being subjective, as assigned by a Naïve Bayes classifier, etc.

Assign every sentence pair a proximity or similarity score
– e.g. physical proximity = the inverse of the number of sentences between the two entities

Use the min-cut algorithm to classify the sentences into objective / subjective


RANLP 2005 136

Subjectivity Extraction with Min-Cut


RANLP 2005 137

Results

2000 movie reviews (1000 positive / 1000 negative)

The use of subjective extracts improves or maintains the accuracy of the polarity analysis while reducing the input data size


RANLP 2005 138

References

Galley, M. and McKeown, K. Improving Word Sense Disambiguation in Lexical Chaining. IJCAI 2003.
Hirst, G. and St-Onge, D. Lexical chains as representations of context in the detection and correction of malapropisms. In WordNet: An Electronic Lexical Database, MIT Press, 1998.
Halliday, M. and Hasan, R. Cohesion in English. Longman, 1976.
Lesk, M. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. SIGDOC 1986.
Mihalcea, R. and Moldovan, D. An iterative approach to word sense disambiguation. FLAIRS 2000.
Mihalcea, R., Tarau, P. and Figa, E. PageRank on Semantic Networks with Application to Word Sense Disambiguation. COLING 2004.


RANLP 2005 139

References

Mihalcea, R. and Tarau, P. TextRank: Bringing Order into Texts. EMNLP 2004.
Han, H., Zha, H. and Giles, C.L. Name disambiguation in author citations using a K-way spectral clustering method. ACM/IEEE JCDL 2005.
Radev, D. Weakly-Supervised Graph-based Methods for Classification. University of Michigan Technical Report CSE-TR-500-04, 2004.
Jannink, J. and Wiederhold, G. Thesaurus Entry Extraction from an On-line Dictionary. Fusion 1999, Sunnyvale, CA, 1999.
Pang, B. and Lee, L. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. ACL 2004.
Erkan, G. and Radev, D. LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. Journal of Artificial Intelligence Research, 2004.


RANLP 2005 140

Conclusions

Graphs: encode information about nodes and relations between nodes

Graph-based algorithms can lead to state-of-the-art results in several IR and NLP problems

Graph-based NLP/IR:
– at the border between graph theory and NLP / IR
– many opportunities for novel approaches to traditional problems