
Graph-based Learning Models for

Information Retrieval: A Survey

Yi Liu

August 29, 2006

Contents

1 Overview
  1.1 Graph – A Mathematical Point of View

2 Graph Construction
  2.1 Graph Topology Establishment
  2.2 Graph Quantification
    2.2.1 Observation-based Weight
    2.2.2 Similarity-based Weight
    2.2.3 Graph Kernels

3 Graph Analysis
  3.1 Analysis Based on Spectral Graph Theory
  3.2 Analysis Based on Random Field Theory
    3.2.1 Markov Random Fields
    3.2.2 Conditional Random Fields
    3.2.3 Gaussian Random Fields
  3.3 Analysis Based on Matrix Approximation & Factorization

4 Graph-based Learning Models
  4.1 Supervised Learning
    4.1.1 k-Nearest Neighbor
    4.1.2 Gaussian Processes
  4.2 Semi-supervised Learning
    4.2.1 Graph Mincuts
    4.2.2 Gaussian Random Fields and Harmonic Functions
    4.2.3 Spectral Graph Transducer
    4.2.4 Learning with Local and Global Consistency
    4.2.5 Local Laplacian Embedding
    4.2.6 Gaussian Process
    4.2.7 Learning on Directed Graphs
  4.3 Unsupervised Learning
    4.3.1 Spectral Clustering
    4.3.2 Kernel k-means
  4.4 Ranking
    4.4.1 PageRank
    4.4.2 HITS
    4.4.3 Traffic Rank
    4.4.4 Manifold Rank

5 Applications in Information Retrieval
  5.1 Document Retrieval
  5.2 Document Classification
  5.3 Collaborative Filtering
  5.4 Unified Link Analysis
  5.5 Image Retrieval

6 Summary


1 Overview

Graph theory, machine learning, and information retrieval are all well-studied disciplines. Recent advances in machine learning research have produced a number of models based on graphs, from which information retrieval also benefits, since many of these models have been validated on different information retrieval tasks. This survey summarizes those graph-based learning models and their applications in information retrieval.

Graph-based models, in this survey, refer to the collection of machine learning models that assume an underlying graph structure. Note that the graph-based models discussed here carry a more general sense than graphical models, which appear most often in the Bayesian analysis literature ([27, 45]). Graphical models, probabilistic in nature, refer to inference structures in the form of graphs: nodes in graphical models represent random variables and edges represent conditional dependence assumptions. In this survey we will not particularly favor either probabilistic or non-probabilistic models.

1.1 Graph – A Mathematical Point of View

From a mathematical point of view, a graph is a collection of points and lines connecting some (possibly empty) subset of them [55]. The points of a graph are most commonly known as graph vertices, but may also be called “nodes” or simply “points”. Similarly, the lines connecting the vertices of a graph are most commonly known as graph edges, but may also be called “links”, “arcs” or simply “lines”. In an undirected graph, edges are not directional, i.e., a line from point A to point B is not distinguished from a line from point B to point A. The two directions are, however, distinct in a directed graph (or digraph for short). On many occasions, a weight (usually positive) is associated with each edge, indicating the strength of the relationship within the corresponding vertex pair.

Formally, we can denote a graph by G(V,E), where V is the vertex set and E is the edge set. For a finite graph G with n vertices, the adjacency matrix is defined as a binary matrix A = [ai,j]n×n, with ai,j = 1 denoting that there is an edge between the i-th and the j-th vertices and ai,j = 0 otherwise.

Introducing a weight for each edge of a graph results in a weighted graph. In this case the adjacency matrix (or weight matrix) becomes W = [wi,j]n×n, with wi,j > 0 indicating the edge weight between the i-th and the j-th vertices and wi,j = 0 indicating no edge there. For an undirected graph, both the adjacency matrix and the weight matrix are symmetric.


2 Graph Construction

A graph, among all representation forms, is a way to present data or knowledge. One major advantage enjoyed by graph-based models is the structural information embedded in the graphs. Although graphs may not be able to completely uncover possibly complicated structures, they at least provide a first approximation. As the starting point of applying the mathematical concept of a graph to learning problems, graph construction aims at encoding the structural information as completely and accurately as possible.

The topology of a graph is where the structural information is encoded, though it is qualitative in nature. The weight matrix further enriches the topology in a quantitative manner. Accordingly, constructing a graph can in general be divided into two stages: establishing the graph topology and quantifying the graph. The first step involves determining nodes and edges, and the second step assigns weights to edges. However, these two steps are not always carried out in this order. A typical example is constructing a graph based on the k nearest neighbors, where we first measure the distances between node pairs (or equivalently, compute the weights of edges) and then establish links by choosing the k smallest values associated with each node.
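As an illustration of this two-stage construction, the sketch below builds a k-nearest-neighbor graph by first computing pairwise Euclidean distances and then keeping, for each node, the k smallest ones as edges; the helper name and the choice of Euclidean distance are ours, not taken from any cited work.

import numpy as np

def knn_graph(X, k):
    """Build a symmetric k-nearest-neighbor adjacency matrix from row vectors X."""
    n = X.shape[0]
    sq = np.sum(X**2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    np.fill_diagonal(dist, np.inf)                     # exclude self-loops
    A = np.zeros((n, n))
    for i in range(n):
        nearest = np.argsort(dist[i])[:k]              # indices of the k closest nodes
        A[i, nearest] = 1.0
    return np.maximum(A, A.T)                          # symmetrize: undirected graph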

For probabilistic models, a directed graph is used in general: each node represents a random variable and each edge represents the conditional probability between the pair of random variables associated with the corresponding nodes. The following discussion of graph construction mainly applies to non-probabilistic models.

2.1 Graph Topology Establishment

For non-probabilistic models, graphs are often constructed in such a way that nodes correspond to data examples and edges reflect the relationship between examples. From a machine learning perspective, data examples could be examples with class labels or to be assigned class labels (as in a classification problem), to be clustered (as in a clustering problem), or to be ranked (as in a ranking problem). When applied to an information retrieval task, those data examples always have concrete meanings. We name a few cases here:

• To model a document collection, we can build a graph where nodes stand for individual documents and edges link similar documents together

• To model the Web, we can build a graph where nodes stand for web pages and edges symbolize the existence of hyperlink(s) between them

• To model user groups, we can build a graph where each node denotes a user and edges indicate that the corresponding user pair shares similar interests


• . . .

For the sake of easy reference, we use a set X = {xi}_{i=1}^{n} to denote the set of data examples represented by the graph G, where the i-th node stands for the data example xi. A bold font xi will be used when we refer to the vector representation of the i-th data example xi. When we further discuss supervised learning and semi-supervised learning problems, each data example xi is generally associated with a label, whose value will be denoted by yi. The class label yi is typically a discrete value, being +1 or −1 for binary classification and 1, 2, . . . , K for multi-class classification. Also, we assume that among the n = l + m data examples, the first l are labeled and the remaining m are unlabeled. Accordingly, the labeled examples form a subset XL and the unlabeled examples form an unlabeled subset XU, i.e., X = XL ∪ XU.

2.2 Graph Quantification

To reveal more information about the structure of the data, the graph topology can be enriched by further quantifying the edges, i.e., assigning weight values to edges. A few strategies have been used, which are summarized as follows.

2.2.1 Observation-based Weight

The simplest way to determine the edge weights is based on empirical observations between the node pairs, if any. For example, to quantify the link graph for the Web, we assign to the weight wi,j the number of hyperlinks from the i-th web page to the j-th web page.

Observation-based edge weights are straightforward, but the range of the weights can vary. Usually the weight matrix needs to be normalized (as in [43]).

2.2.2 Similarity-based Weight

When observations are not readily available, a similarity measurement can be defined to establish the relationship between nodes based on our prior beliefs. Although any similarity definition could qualify, a kernel function K : V × V → R is often preferred, which maps the node representations to a Hilbert space where the kernel appears as the inner product

K(xi, xj) = ⟨φ(xi), φ(xj)⟩

Mapped to the Hilbert space, a data example can be reformulated as a (possibly infinite) number of features that are independent of each other. The pairwise form of the kernel function suggests that it can be viewed as inducing a long-range relationship from local structures ([31]).

A valid kernel function must satisfy two mathematical constraints


1. It must be symmetric, i.e., K(xi, xj) = K(xj, xi), which inevitably results in an undirected graph being constructed.

2. It must be positive semi-definite, which is often called Mercer’s condition ([53]).

These two properties bring much benefit when solving the optimization problems of graph models constructed from kernel functions. However, they also limit their use in models based on directed graphs. Recently, there has been some development on non-positive kernels ([42]).
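For concreteness, a widely used similarity-based weight that satisfies both conditions above is the Gaussian (RBF) kernel; the sketch below is a minimal illustration, with the bandwidth sigma an assumed free parameter rather than something prescribed by the survey.

import numpy as np

def rbf_weights(X, sigma=1.0):
    """Similarity-based weights w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    W = np.exp(-dist2 / (2.0 * sigma**2))
    np.fill_diagonal(W, 0.0)   # drop self-edges when using W as a graph weight matrix
    return W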

2.2.3 Graph Kernels

Constructing a kernel function is not easy, and not every similarity measurement can be formulated as a kernel function. To enjoy the benefits of kernel functions, it would be desirable to generate a kernel function from an arbitrary similarity function, as long as it is non-negative and symmetric. Furthermore, a challenging part of quantifying a graph by a kernel function is capturing both local and global structures in the data. To address these concerns, diffusion kernels (also known as heat kernels) have been proposed ([31]).

Suppose we have a symmetric similarity matrix S = [si,j]n×n with all si,j ≥ 0, from which a graph Laplacian LS can be constructed as defined in Equation (2) in Section 3.1. Then the diffusion kernel is defined as

Kβ = exp(−βLS) = lim_{n→∞} (I − (β/n) LS)^n    (1)

where β > 0 is a bandwidth parameter. It can be proved that the above result is a valid kernel. It is also easy to find that

dKβ/dβ = −LS Kβ
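Numerically, Equation (1) is just the matrix exponential of the negated, scaled Laplacian; the sketch below uses scipy's dense expm and is meant as an illustration for small graphs, not a scalable implementation.

import numpy as np
from scipy.linalg import expm

def diffusion_kernel(S, beta):
    """K_beta = exp(-beta * L_S), where L_S = D - S for a symmetric similarity matrix S."""
    D = np.diag(S.sum(axis=1))
    L = D - S
    return expm(-beta * L)      # valid (positive semi-definite) kernel for beta > 0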

3 Graph Analysis

Once a graph is constructed, we can assume that all the structural information has been embedded in it. However, most of the time this information cannot be used directly as it is, so it needs to be refined. Generally speaking, further analysis on the graph has one or more of the following purposes:

1. identify properties or find the invariant information

2. fit with some assumptions made on the data


3. approximate with simpler structures

In the rest of the section, we will discuss three graph analysis approaches corresponding to the above purposes respectively.

3.1 Analysis Based on Spectral Graph Theory

A mainstream approach for graph analysis is based on Spectral Graph Theory ([10]). It focuses on studying graph properties by means of the eigenvalues and eigenvectors of the adjacency matrix.

An important concept at the center of Spectral Graph Theory is the graph Laplacian. With respect to a symmetric adjacency matrix W, the graph Laplacian is defined as follows

L = D−W (2)

where D is a diagonal matrix D = diag(d1, d2, · · · , dn) with di = ∑_{j∈V} wi,j. Note that on some occasions the following normalized versions of the above matrices are defined

W̃ = D^{−1/2} W D^{−1/2}

and

L̃ = D^{−1/2} L D^{−1/2} = I − W̃

The graph Laplacian L (or its normalized version L̃) has the following properties:

1. It is positive semi-definite, i.e., all the eigenvalues are non-negative;

2. Its minimum eigenvalue is always 0. For the unnormalized graph Laplacian L, the corresponding eigenvector is e = (1/√n, 1/√n, . . . , 1/√n)ᵀ; for the normalized graph Laplacian L̃, the corresponding eigenvector is D^{1/2} e.

The set of eigenvalues of the graph Laplacian L can be denoted by 0 = λ0 ≤ λ1 ≤ · · · ≤ λn−1, which is also called the spectrum of L (or of the graph itself). Spectral Graph Theory tells us that the structure of a graph and its principal properties can be deduced from its spectrum. In particular, it has been shown that the eigenvalues are closely related to almost all major invariants of the graph, linking one extremal property to another ([10, 9]). To name a few here:

• If the spectra of two graphs, λ0 ≤ λ1 ≤ · · · ≤ λn−1 and µ0 ≤ µ1 ≤ · · · ≤ µn−1, differ, i.e., ∃i, 0 ≤ i ≤ n − 1, λi ≠ µi, then the two graphs cannot be isomorphic¹.

¹Two graphs G and H are isomorphic when there exists a bijection f between the vertices of the two graphs with the property that any two vertices u and v from G are adjacent if and only if f(u) and f(v) are adjacent in H.


• For an unweighted graph (i.e., the adjacency matrix takes values of 0 or 1) on n vertices, we have

∑_i λi ≤ n

with equality holding if and only if the graph has no isolated vertices. And for a graph without isolated vertices, we have

λ1 ≤ n/(n − 1),    λn−1 ≥ n/(n − 1)

• The second smallest eigenvalue λ2 tells us how well we can cut a graph, i.e., split the vertex set V into two disjoint sets A and B. For example, for an unweighted graph, if we use E(A,B) to denote the set of edges whose vertices lie in the two sets A and B respectively, the ratio of the cut

Cut(A,B) = |E(A,B)| / min(|A|, |B|)

can be bounded by

Cut(A,B) ≥ λ2 ≥ [Cut(A,B)]² / (2d*)

where d* is an upper bound on the degree of every vertex in the graph. For a weighted graph, the cut can be defined as

Cut(A,B) = ∑_{vi∈A, vj∈B} wi,j

and we can prove that when we have a size-balanced partitioning, i.e., |A| = |B|, the cut is bounded by

Cut(A,B) ≤ (n/4) λ2

The above bounds indicate that if λ2 is small, then it is possible to cut the graph into two pieces without cutting too many edges; if λ2 is large, then every cut of the graph must cut many edges.

This is why the graph Laplacian plays a central role in graph analysis based on Spectral Graph Theory. As will be shown later in this survey, the graph Laplacian is clearly present in a number of graph-based learning models.
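To make these spectral properties concrete, the following sketch builds a small weighted graph consisting of two dense clusters joined by one weak edge and inspects the Laplacian spectrum; it is only a numerical check of the statements above, not part of any cited method.

import numpy as np

# two triangles joined by a single weak "bottleneck" edge
W = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1

D = np.diag(W.sum(axis=1))
L = D - W
evals = np.linalg.eigvalsh(L)       # eigenvalues in ascending order, all non-negative
print(evals[0])                     # smallest eigenvalue: numerically 0
print(evals[1])                     # small lambda_2: the graph admits a cheap cut (the weak edge)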


3.2 Analysis Based on Random Field Theory

The neighborhood of nodes in a graph preserves the local structure of the data, which can be interpreted as a contextual constraint on the data. This motivates graph analysis that models its spatial properties. Random Field Theory is a general and powerful way to deal with spatial properties: it characterizes the neighborhood relationship as dependency in the language of probability.

There are two typical ways to describe the dependency: Markovian and Gaussian. Assuming the Markov property on the dependency leads to Markov Random Fields and Conditional Random Fields, while assuming a Gaussian property on the dependency leads to Gaussian Random Fields. Though defined differently, the three random fields share two common features which relate them closely to each other:

1. They are all undirected graphical models;

2. Usually they fall into the same family of exponential probability density functions ([2]).

To begin with, let us explain the notation that will be used. For the graph G = (V,E) as defined before, two sets of random variables are introduced: a set of input variables X = {Xi} and a set of output variables Y = {Yi}. The input variables range over all the observed data examples, and the output variables range over a finite set of labels or take real continuous values associated with each data example. (For simplicity, we will use the same notation X for the set of data examples and the corresponding random variable set, and the notation Y is reused in a similar way. In most cases, their exact meaning can be identified from the context. Special notes will be given in case of potential confusion.)

3.2.1 Markov Random Fields

A Markov Random Field (MRF) ([3]) is defined with respect to the graph G, if

Pr(Yi | {Yj : i ≠ j}) = Pr(Yi | {Yj : i ∼ j})

where i ∼ j represents that the i-th node and the j-th node are neighbors.

The graphical structure of the random field allows the joint distribution over the output variables to be factorized into a product of real-valued potential functions, each of which operates on a subset of the random variables. To ensure that (conditionally) independent variables do not appear in the same potential function, the easiest way is to define a potential function Φk for each clique Ck in the graph. Then the joint distribution over the output variables becomes

Pr(Y = y) = (1/Z) ∏_k Φk(y^(k))


Figure 1: Graphical structure of a chain-structured MRF over output variables Y1, Y2, . . . , Yn

Figure 2: Graphical structure of a chain-structured CRF over output variables Y1, . . . , Yn and input variables X1, . . . , Xn

where y^(k) represents the states of the random variables in the k-th clique, and Z is a normalizing factor (also known as the partition function)

Z = ∑_y ∏_k Φk(y^(k))

The potential function often takes the form of an exponential of a weighted sum over features of the clique, Φk(y^(k)) = exp(∑_l λl fl(y^(k))), where l is the index over features.

It is well known that inference in MRFs is generally computationally expensive. To save computational cost, MRFs are often applied to simple graph structures such as a tree or a chain, where each clique reduces to a pair of neighboring nodes. Then an inductive definition can be derived for the marginal distribution, and efficient inference methods such as dynamic programming ([12, 54]) and Viterbi decoding ([46]) can be applied. For example, when the graph is a tree, the joint distribution of a label sequence in an MRF is given by

p(y) = (1/Z) ∏_i φ(yi) ∏_{i∼j} ψ(yi, yj)

where φ(·) is a potential function assigning each node a potential value φ(Yi), ψ(·) is another potential function assigning each edge a potential value ψ(Yi, Yj), and the normalization factor is Z = ∑_{y′} ∏_i φ(y′i) ∏_{i∼j} ψ(y′i, y′j).
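As a tiny, brute-force illustration of this factorization (feasible only for very short chains; dynamic programming replaces the enumeration in practice), the sketch below computes the normalized joint distribution of a binary chain MRF. The particular potentials phi and psi are made up for the example.

import itertools
import numpy as np

phi = np.array([1.0, 2.0])                # node potential: favors label 1
psi = np.array([[2.0, 0.5],
                [0.5, 2.0]])              # edge potential: favors equal neighboring labels

def chain_mrf_joint(n):
    """p(y) = (1/Z) prod_i phi(y_i) prod_{i~j} psi(y_i, y_{i+1}) on a length-n binary chain."""
    scores = {}
    for y in itertools.product([0, 1], repeat=n):
        s = np.prod([phi[yi] for yi in y])
        s *= np.prod([psi[y[i], y[i + 1]] for i in range(n - 1)])
        scores[y] = s
    Z = sum(scores.values())              # partition function
    return {y: s / Z for y, s in scores.items()}

p = chain_mrf_joint(4)
print(max(p, key=p.get))                  # most probable labeling: (1, 1, 1, 1)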

3.2.2 Conditional Random Fields

A Conditional Random Field (CRF) is defined when the output random variables Y, conditioned on the input variables X, obey the Markov property ([32, 54, 40])


Pr(Yi | X, {Yj : i ≠ j}) = Pr(Yi | X, {Yj : i ∼ j})

A comparative illustration of the graph structures of a chain-structured MRF and CRF is given by Figure 1 and Figure 2.

Similar to MRFs, to help factorize the conditional joint probability of the output variables we can define a potential function for each clique in the graph. Just like MRFs, CRFs also suffer from expensive computation and are therefore mainly applied to simple graph structures. For example, for a tree-structured graph with potential functions of an exponential form, the joint distribution over the label sequence Y given X has the following form

p(y|X) ∝ exp( ∑_l ∑_{i∼j} λl fl(yi, yj, x) + ∑_l ∑_i µl gl(yi, x) )

where fl(·) and gl(·) are feature functions associated with transitions (edges) and examples (nodes) respectively, and l is the feature index. A partition function similar to that in an MRF normalizes the probabilities above.

3.2.3 Gaussian Random Fields

Rather than directly modeling the dependency by Markov properties, Gaussian Random Fields assume a Gaussian distribution over the joint probability of any label sequence

p(y|X) = (1 / ((2π)^{n/2} |Σ(X)|^{1/2})) exp( −(1/2) (y − µ)ᵀ Σ(X)⁻¹ (y − µ) )

where the covariance matrix Σ encodes the structural information present in the graph. The covariance matrix Σ is computed by applying a kernel function K(·, ·) to the input patterns of the examples.

Gaussian Random Fields are powerful in that they can not only predict the most probable output values of unobserved variables, but also tell how uncertain the prediction is. This is reflected by the mean and variance of the variables to be predicted. Specifically, suppose we have l + 1 examples X ∪ {x′}, where the first l examples are observed with output values y and the last one, x′, is a new example whose output value is to be predicted. The covariance matrix Σ∗ can be decomposed as

Σ∗ = [ Σ     c
       cᵀ    K(x′, x′) ]


Then the distribution of the output for the new example x′ given the observed examples is a Gaussian with conditional mean and variance

µ_{y′|X} = cᵀ Σ⁻¹ (y − µ)

σ²_{y′|X} = K(x′, x′) − cᵀ Σ⁻¹ c
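These two expressions translate directly into a couple of linear solves; the sketch below assumes a zero-mean field (so y − µ reduces to y), which is a simplifying assumption on our part rather than part of the formulation above.

import numpy as np

def grf_predict(Sigma, c, k_new, y):
    """Conditional mean and variance of the output at a new example under a zero-mean Gaussian random field.

    Sigma : l x l covariance matrix of the observed examples
    c     : length-l vector of covariances between the new example and the observed ones
    k_new : scalar K(x', x')
    y     : observed outputs
    """
    alpha = np.linalg.solve(Sigma, y)                 # Sigma^{-1} y without forming an explicit inverse
    mean = c @ alpha
    var = k_new - c @ np.linalg.solve(Sigma, c)
    return mean, var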

3.3 Analysis Based on Matrix Approximation & Factorization

When we use a matrix to represent a graph, such as the adjacency matrix or its variants, there is a potential computational burden when the graph has a large number of nodes and edges. A natural thought is to approximate the matrix while preserving as much information as possible. This is particularly useful when we are more interested in understanding global structures of the data. Taking this point of view, a few graph-based learning models can be viewed as matrix approximation or matrix factorization problems.

Finding the optimal rank-k approximation of a given rank-r matrix A (k < r) can be formulated as ([49, 59])

B = arg min_{Rank(B)=k} ||A − B||_F    (3)

Applying Singular Value Decomposition to matrix A, we have

A = USVᵀ

where U and V are orthonormal matrices and S = diag(s1, s2, . . . , sr, 0, . . . , 0) with s1 ≥ s2 ≥ · · · ≥ sr > 0. The solution to the low-rank approximation problem is

B = Uk diag(s1, s2, . . . , sk) Vkᵀ    (4)
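Equation (4) is a few lines of linear algebra; the sketch below keeps the k leading singular triplets of A.

import numpy as np

def best_rank_k(A, k):
    """Best rank-k approximation of A in Frobenius norm (truncated SVD, Equation (4))."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]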

Highly related to the matrix approximation problem, non-negative matrix factorization tries to approximate a matrix A by two non-negative matrix factors U and V ([34, 35]).

A ≈ UV

To measure the approximation quality, two cost functions are used. The first is the squared Euclidean distance

||A − B||² = ∑_{i,j} (Ai,j − Bi,j)²

and the second measurement is the divergence

D(A||B) = ∑_{i,j} ( Ai,j log(Ai,j / Bi,j) − Ai,j + Bi,j )


Note that the second measurement D(A||B) is always non-negative and reaches zero only when Ai,j = Bi,j holds for all (i, j) pairs.

An iterative algorithm has been proposed to efficiently solve the problem by alternately applying multiplicative updates to the two factors. In particular, the following updating rules ([34, 35]) decrease the divergence D(A||UV)

Ui,a ← Ui,a ∑_k [ Ai,k Va,k / (UV)i,k ]

Ui,a ← Ui,a / ∑_j Uj,a

Va,k ← Va,k ∑_i [ Ui,a Ai,k / (UV)i,k ]

These rules guarantee that the cost function is non-increasing until a local optimum is reached. To initialize the algorithm, the two matrix factors U and V can be seeded with random non-negative values.
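A literal rendering of these multiplicative updates is sketched below; the small constant added to denominators is an implementation convenience to avoid division by zero, not part of the rules themselves.

import numpy as np

def nmf(A, r, n_iter=200, eps=1e-9):
    """Approximate a non-negative matrix A by U V using the multiplicative updates above."""
    n, m = A.shape
    U = np.random.rand(n, r)                   # random non-negative initialization
    V = np.random.rand(r, m)
    for _ in range(n_iter):
        R = A / (U @ V + eps)                  # elementwise ratios A_ik / (UV)_ik
        U *= R @ V.T                           # U_ia <- U_ia * sum_k A_ik V_ak / (UV)_ik
        U /= U.sum(axis=0, keepdims=True)      # U_ia <- U_ia / sum_j U_ja
        R = A / (U @ V + eps)
        V *= U.T @ R                           # V_ak <- V_ak * sum_i U_ia A_ik / (UV)_ik
    return U, V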

4 Graph-based Learning Models

In this section, we will present a number of graph-based learning models from the literature. These models are roughly categorized into supervised learning models, semi-supervised learning models, unsupervised learning models, and ranking models.

In this survey, if a graph-based learning model constructs a graph (whether explicitly or implicitly) over only the labeled (or training) examples, it is summarized as a supervised learning method in Section 4.1; if a graph is constructed over all examples, including labeled and unlabeled ones, it is summarized as a semi-supervised learning method in Section 4.2.

4.1 Supervised Learning

4.1.1 k-Nearest Neighbor

The weighted Nearest Neighbor method uses the observations in the training set closest to a test example to form a prediction

yi = ∑_{xj∈NNi} wj yj

where NNi represents the nearest neighbor set of the test data example xi and wj is a weight satisfying ∑_j wj = 1, which is always related to the similarity between xi and xj.


The k-Nearest Neighbor method can also have a probabilistic interpretation ([19]). The probability of a test example xi being categorized into the j-th class Cj can be written as

Pr(xi ∈ Cj) = (1/Z) ∑_{x′∈NNi} Pr(xi → x′) Pr(x′ ∈ Cj)

where Z is a normalization factor, Pr(xi → x′) is related to the similarity between xi and x′, and Pr(x′ ∈ Cj) = 1 if x′ belongs to the class Cj and 0 otherwise.

Or we can interpret it in the following way

p(y|xi) = ∑_{x′∈NNi} p(y, x′|xi) = ∑_{x′∈NNi} p(y|x′) p(x′|xi)

where

p(y|x′) = 1 if y = y′ and 0 if y ≠ y′,    p(x′|xi) ∝ 1 if x′ ∈ NNi and 0 otherwise

Whether interpreted in a probabilistic or non-probabilistic way, we can view k-Nearest Neighbor as constructing a nearest neighbor graph around the test examples.
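A minimal realization of this weighted vote, using inverse-distance weights normalized to sum to one (one reasonable choice among many), is sketched below.

import numpy as np

def weighted_knn_predict(X_train, y_train, x_test, k=5):
    """Predict a label for x_test from its k nearest training examples, weighting votes by closeness."""
    dist = np.linalg.norm(X_train - x_test, axis=1)
    nn = np.argsort(dist)[:k]
    w = 1.0 / (dist[nn] + 1e-9)               # closer neighbors receive larger weights
    w /= w.sum()                              # weights sum to one
    classes = np.unique(y_train[nn])
    votes = {c: w[y_train[nn] == c].sum() for c in classes}
    return max(votes, key=votes.get)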

4.1.2 Gaussian Processes

A Gaussian Process is defined as a probability distribution over functions y(x) with the property that any finite selection of points x(1), x(2), . . . , x(k) has a marginal density Pr(y(x(1)), y(x(2)), . . . , y(x(k))) that is Gaussian ([39]). The assumption of a Gaussian Random Field (as discussed in Section 3.2) over the labeled data examples falls into the Gaussian Process category, which is the reason we summarize Gaussian Processes as graph-based models.

When applied to classification problems, a Gaussian Process ([1]) assumes a hidden intermediate stochastic process in the generation of labels from the observations on the training examples. Specifically, the intermediate process is defined as u = [u(x, y)], which gives a compatibility measurement for each training example and its label, and we assume it is a zero-mean Gaussian process with covariance matrix K, which is block diagonal. For an m-ary classification problem, we can identify u as an lm × 1 vector with multi-index (i, y).

A Bayesian treatment of predicting a label for a new observation x can be formulated as follows

p(y|XL, yL, x) = ∫ p(y|u(x, ·)) p(u|XL, yL) du


and it is further approximated as

p(y|XL,yL,x) ≈ p(y|uMAP )

where uMAP is the maximizer of

p(u|XL, yL) ∝ p(u) ∏_{i=1}^{l} p(yi|u(xi, ·))

Considering the Gaussian prior over u, we have

log p(u|X, y) = ∑_{i=1}^{l} [ u(xi, yi) − log ∑_y exp(u(xi, y)) ] − (1/2) uᵀK⁻¹u + const.

The Representer Theorem guarantees that the maximizer of the above objective function takes the form

uMAP(xi, y) = ∑_{j=1}^{l} ∑_{y′=1}^{m} α(j,y′) K(i,y),(j,y′)

which transforms the objective function into the following one, parameterized by α

min_α  αᵀKα − ∑_{i=1}^{l} αᵀK e(i,yi) + ∑_{i=1}^{l} log ∑_y exp(αᵀK e(i,y))

where e(i,y) is the (i, y)-th unit vector.

4.2 Semi-supervised Learning

Semi-supervised learning deals with the use of both labeled and unlabeled data for training. The central idea behind nearly all semi-supervised learning algorithms is a consistency assumption on the data: examples close to each other or lying in the same structure are more likely to have the same label. Various semi-supervised learning approaches differ in how they model the structure of the data and attempt to propagate the label information from labeled examples to unlabeled ones. Among these models, quite a few have clear interpretations from a graph point of view or can be related to graphs.

4.2.1 Graph Mincuts

The graph mincut method extends the algorithm for finding the minimal cut in a graph to a transductive learning setup [5, 6]. The basic idea is to search for a partition of the graph that results in a minimum sum of weights of the edges being cut while agreeing with the labeled data. To enforce consistency with the labeled data, a special weighting scheme is adopted to build a graph over all the labeled and unlabeled examples: for any pair of data examples belonging to different classes, the edge weight indicates their similarity; for any pair of data examples belonging to the same class, an infinite weight is assigned. In a binary case, the search for the minimum cut amounts to the following optimization problem

min_{yi : xi∈XU}  ∑_{i,j} wi,j (yi − yj)²

s.t.  yi = y∗i,  ∀ xi ∈ XL

where y∗i indicates the known labels of the labeled data.

Note that the solution gives binary labels for the unlabeled data, which can also be proved to be optimal in another sense: it minimizes the leave-one-out cross-validation error of the nearest-neighbor algorithm applied to the entire dataset ([5]). To improve the robustness of the solution, a follow-up work ([6]) introduces random noise into the edge weights, resulting in a solution with “soft” labeling.

4.2.2 Gaussian Random Fields and Harmonic Functions

The Gaussian random fields and harmonic functions method ([66, 65, 67]) is motivated by the assumption that the label probability should vary smoothly over the entire graph. To enforce label smoothness on the graph, a quadratic energy function is proposed as follows (in the binary classification case)

E(f) = (1/2) ∑_{i,j} wi,j (fi − fj)² = fᵀLf

where f = (f1, f2, . . . , fn)ᵀ is the label probability vector defined as

fi = δ(yi, 1) if xi is labeled, and fi = Pr(yi = 1|xi) if xi is unlabeled

The energy defined above is small when the label probability vector varies smoothly over the graph, which leads to the minimization of the energy function. The minimizer f∗ makes the function harmonic in the sense that

(Lf∗)i = yi if xi is labeled, and (Lf∗)i = 0 if xi is unlabeled


This method is also related to Gaussian Random Fields because the energy function can be used to form a Gaussian density function

p(f) ∝ exp[−βE(f)]

where β is an “inverse temperature” parameter.

If we define P = D⁻¹W and decompose it into four blocks

P = [ Pll   Plu
      Pul   Puu ]

where Pll corresponds to the labeled data and Puu corresponds to the unlabeled data, then the final prediction is made in the following way

yu = (I − Puu)⁻¹ Pul fl

To avoid the computation introduced by the matrix inverse, an extension is proposed in [67], which essentially forms a backbone graph with super-nodes created by pre-clustering the examples.

Again the important role played by the graph Laplacian L is witnessed: both the smoothness regularization on the graph and the harmonic nature of the energy function are achieved through it, which suggests a close relationship to Spectral Graph Theory ([66]).
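The closed-form prediction yu = (I − Puu)⁻¹ Pul fl amounts to a single linear solve; the sketch below assumes the labeled examples occupy the first rows and columns of W and that fl holds their label probabilities (0/1 values in the binary case).

import numpy as np

def harmonic_solution(W, f_l):
    """Harmonic-function predictions for the unlabeled nodes; the first len(f_l) nodes are labeled."""
    l = len(f_l)
    P = W / W.sum(axis=1, keepdims=True)      # P = D^{-1} W
    P_uu = P[l:, l:]
    P_ul = P[l:, :l]
    I = np.eye(P_uu.shape[0])
    return np.linalg.solve(I - P_uu, P_ul @ f_l)   # y_u = (I - P_uu)^{-1} P_ul f_l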

4.2.3 Spectral Graph Transducer

The Spectral Graph Transducer is a semi-supervised extension of the ratio-cut algorithm originally proposed for unsupervised learning ([28]). Its objective function incorporates a quadratic penalty on the labeled data in addition to minimizing the graph cut

min_f  fᵀLf + c (f − r)ᵀC(f − r)

s.t.  fᵀe = 0,  fᵀf = n

where the vector f is a label probability vector, and the vector r is defined as

ri = r+ if xi is a positive example, r− if xi is a negative example, and 0 if xi is unlabeled

The matrix C = diag(c1, c2, . . . , cn) is a diagonal cost matrix allowing a different misclassification cost for each data example. The trade-off between the graph cut value and the training error penalty is balanced through the constant c. It has been shown that solving the optimization problem in the Spectral Graph Transducer also leads to a matrix eigen-decomposition ([28]).


4.2.4 Learning with Local and Global Consistency

The local and global consistency method ([61]) proposes the following optimization problem

min_F  (1/2) ∑_{i,j} wi,j || Fi/√di − Fj/√dj ||² + µ ∑_i ||Fi − Yi||²

where F is the label probability matrix with each column corresponding to a class, di = ∑_{j∈V} wi,j, and Y is the label matrix with yi,j = 1 if the i-th example is labeled as a member of the j-th class and yi,j = 0 otherwise.

The first term in the objective function addresses the smoothness constraint on the graph through the sum of local variations measured at each edge of the undirected graph; the second term penalizes inconsistency with the labeled data.
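Because the objective above is quadratic in F, setting its gradient to zero yields (L̃ + µI)F = µY, with L̃ the normalized graph Laplacian, so the minimizer is obtained with one linear solve. The sketch below follows that reading of the objective; it is our derivation, not code from [61].

import numpy as np

def local_global_consistency(W, Y, mu=0.1):
    """Minimize tr(F' L_norm F) + mu ||F - Y||^2 in closed form; Y is the n x K 0/1 label matrix."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_norm = np.eye(len(d)) - D_inv_sqrt @ W @ D_inv_sqrt   # normalized graph Laplacian
    F = np.linalg.solve(L_norm + mu * np.eye(len(d)), mu * Y)
    return F.argmax(axis=1)                                  # predicted class for each example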

4.2.5 Local Laplacian Embedding

To learn the global manifold structure of the data, local Laplacian embedding methods project the data from the original space to a dimension-reduced space. In particular, let V = [v1, v2, . . . , vp] be the matrix composed of the p eigenvectors corresponding to the smallest eigenvalues of the graph Laplacian of the nearest neighbor graph built over all labeled and unlabeled data. If we write the rows of V as V = [x̂1ᵀ, x̂2ᵀ, . . . , x̂nᵀ]ᵀ, then x̂i is the projected image of xi in the dimension-reduced space.

For classification purposes, in [4] a linear classifier is learned from the labeled data

a = (VLLᵀ VLL)⁻¹ VLLᵀ yL

where VLL is the sub-matrix of V that corresponds to the labeled examples. Then the prediction for an unlabeled example xi is made by

yi = aᵀ x̂i

Another manifold-learning-based semi-supervised learning algorithm extends the manifold ranking function ([64]). The prediction is as follows

y∗ = (I− αD−1W)−1Dy

where α is a constant that controls how much we rely on the labeled data, and each component of y takes a ±1 value if the corresponding example is labeled and 0 if it is unlabeled.
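A compact rendering of the embed-then-classify recipe of [4] might look like the sketch below; the unregularized least-squares fit and the assumption that labeled examples come first are simplifications on our part.

import numpy as np

def laplacian_embedding_classifier(W, y_l, p=5):
    """Embed all nodes with the p smallest Laplacian eigenvectors, then fit a linear classifier
    on the labeled rows (the first len(y_l) nodes, labels in {+1, -1})."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    evals, evecs = np.linalg.eigh(L)
    V = evecs[:, :p]                                    # row i is the embedded image of x_i
    l = len(y_l)
    a = np.linalg.lstsq(V[:l], y_l, rcond=None)[0]      # a = (V_LL' V_LL)^{-1} V_LL' y_L
    return np.sign(V[l:] @ a)                           # predictions for the unlabeled examples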


4.2.6 Gaussian Process

Gaussian Process classification can be extended to semi-supervised learning if we assume a Gaussian distribution on the marginal density over any labeled and unlabeled examples. A null category noise model (NCNM) has been proposed in [33] for binary classification problems.

Similar to the supervised version of Gaussian Process classification, a hidden process variable fi is introduced as an intermediate step of generating a label yi from the input xi. The label for binary classification takes three possible values, yi = −1, 0 or 1, where yi = 0 refers to the null category, whose role is analogous to the notion of “margin”. The null category acts to exclude unlabeled examples.

To further address the existence of both labeled and unlabeled data, a further variable zi is assumed to be generated from the output yi, with zi = 0 for labeled data and zi = 1 for unlabeled data. The generation process of zi is defined by first imposing

p(zi = 1|yi = 0) = 0

p(zi = 1|yi = 1) = γ+

p(zi = 1|yi = −1) = γ−

and then computing

p(zi = 1|fi) = ∑_{yi} p(zi = 1|yi) p(yi|fi)

The probability of class membership can be decomposed as

p(yi|xi) = ∫ p(yi|fi) p(fi|xi) dfi    (5)

The first component in Equation (5), which is called the null category noise model, is given by

p(yi|fi) = H(−(fi + 0.5)) for yi = −1,  H(fi + 0.5) − H(fi − 0.5) for yi = 0,  H(fi − 0.5) for yi = 1

where H(·) is a Heaviside step function.

The second component in Equation (5), which is called the process model, follows the Gaussian Process assumption

p(fi|xi) = N(fi | µ(xi), σ(xi))

The parameters of the above models can be estimated from the labeled data by maximizing the likelihood p(zL|XL), and the prediction for unlabeled data is given by

p(y|x, z = 1) ∝ p(z = 1|y)p(y|x)


4.2.7 Learning on Directed Graphs

Recently a semi-supervised learning method based on directed graphs has been proposed [62], which extends the work in [61]. To address data consistency, the labeling is expected to be smooth over the directed graph in the sense that nodes lying in a densely linked subgraph are likely to have the same label. To search for a good classifier which results in a labeling with such smoothness on the graph, a smoothness functional is proposed (for binary classification)

Ω(f) = (1/2) ∑_{i,j} πi li,j ( fi/√πi − fj/√πj )²

where f = [f1, f2, . . . , fn]ᵀ is the label probability vector, L = [li,j] is the graph Laplacian over the entire set of data examples, and π = [π1, π2, . . . , πn]ᵀ is the principal eigenvector of the graph Laplacian L.

To find the labeling for the unlabeled data, an optimization problem is proposed with an objective function addressing both the smoothness enforcement and the consistency with the labeled data

arg min_f  Ω(f) + µ ||f − y||

where the components of y take a ±1 value if labeled and 0 if unlabeled, and µ > 0 is a constant specifying the trade-off.

A slightly different version of the above algorithm is described in [63], where two sets of smoothness functionals are defined for the data examples: one for their “hub” scores (which account for outgoing links) and one for their “authority” scores (which account for incoming links). To separate the two smoothness enforcements, directed graphs are transformed into bipartite graphs.

4.3 Unsupervised Learning

4.3.1 Spectral Clustering

Spectral clustering approaches view the problem of data clustering as a problem of graph partitioning. Take 2-way graph partitioning as an example: to form two disjoint data sets A and B from a graph G = (V,E), the edges connecting these two parts should be removed. The degree of dissimilarity between the partitioned parts is captured by the notion of a cut, defined as Cut(A,B) = ∑_{vi∈A, vj∈B} wi,j. Generally speaking, a good partitioning should lead to a small cut value.

Addressing different balancing concerns, there are several variants of the cut definition which lead to optimal partitionings in different senses. To begin with, we define S(A,B) = ∑_{i∈A} ∑_{j∈B} wi,j and dA = ∑_{i∈A} di. Ratio Cut addresses the balance of the sizes of the partitioned graph ([21]), which leads to minimization of the following objective function

JRCut = S(A,B)/|A| + S(A,B)/|B|

Normalized Cut addresses the balance of the weights of the partitioned graph ([48]), which leads to minimization of

JNCut = S(A,B)/dA + S(A,B)/dB

Min-Max Cut addresses the balance between the intra-cluster weights and inter-cluster weights in a partitioning ([16]), which leads to minimization of

JMCut = S(A,B)/S(A,A) + S(A,B)/S(B,B)

By relaxing cluster memberships to real values, the above minimization problems can all be formulated as eigenvector problems related to the graph Laplacian as defined in Section 3.1

JRCut = qᵀLq

JNCut = qᵀL̃q

JMCut = qᵀWq / qᵀDq

where q is related to the relaxed cluster membership. All three of the above problems lead to finding the second eigenvector of the graph Laplacian L or L̃.

2-way spectral clustering can be extended to k-way spectral clustering ([20, 41]), whose solution is related to the first k eigenvectors of the graph Laplacian.
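For the 2-way case the relaxed solution is the second eigenvector of the Laplacian, and a partition can be read off from its signs; the sketch below uses zero as the threshold, which is one common rounding heuristic rather than the only one.

import numpy as np

def two_way_spectral_cut(W):
    """Split a graph with weight matrix W into two clusters via the second eigenvector of L = D - W."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    evals, evecs = np.linalg.eigh(L)    # eigenvalues in ascending order
    fiedler = evecs[:, 1]               # eigenvector of the second smallest eigenvalue
    return (fiedler > 0).astype(int)    # cluster labels in {0, 1}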

4.3.2 Kernel k-means

K-means aims at minimizing the sum of distances within clusters. Although a graph is not constructed explicitly, a distance metric is indispensable for the algorithm. The distance metric is used to quantify the relationship of data examples to the cluster centers, i.e., the averages of the data examples belonging to the same clusters, which is, to some extent, an implicit graph construction. It has been shown that k-means clustering has a close relationship with spectral clustering ([15]).

Specifically, suppose a kernel is introduced for the distance metric with φ as the mapping function, and let us denote the expected k clusters by {Cj}_{j=1}^{k}. The objective function for a generalized kernel k-means algorithm is defined as the minimization of

D({Cj}_{j=1}^{k}) = ∑_{j=1}^{k} ∑_{xi∈Cj} ||φ(xi) − mj||²

where mj is the centroid of the cluster Cj

mj = ( ∑_{xl∈Cj} φ(xl) ) / |Cj|

The above minimization problem is equivalent to the maximization of ([15])

trace(YᵀWY)

where W = ΦᵀΦ with Φ = [φ(x1), φ(x2), . . . , φ(xn)] gives the adjacency matrix computed from the kernel function. Here Y is an n × k orthonormal matrix indicating the class membership. By relaxing Y to real values, the solution of the above maximization problem is given by the first k eigenvectors of the weight matrix W.

Comparing with k-way spectral clustering, we find that both problems lead to an eigenvector problem, which falls into the approach taken by spectral graph analysis.

4.4 Ranking

On many occasions an ordering of the data examples is expected, which leads to the ranking problem. Usually a scoring function is learned to assign a score to each example, and based on the scores the ranking can be generated. A typical family of models for this purpose is regression models. However, this task becomes more difficult when the examples themselves are hard to describe but the relationships between them are relatively easy to capture. A number of graph-based models have been proposed to generate a ranking based on the analysis of the relationships between data pairs.

As a linear relationship, a ranking can be viewed as a first-order approximation of the structure of the data examples. Accordingly, graph-based ranking models can often be explained as solving low-rank matrix approximation problems.

4.4.1 PageRank

Although the basic idea behind the PageRank algorithm was proposed very early and applied in a few areas such as citation analysis, it became well known for its application to Web page ranking [43, 22]. In general, given a set of examples we can model it with a graph whose links indicate the existence of correlations between examples. If we assume that the importance of examples can be mutually reinforced along the links, this can be formalized as

f = (1− α)Wf + αf0

where W is the normalized weight matrix, f is the importance vector, f0 reflects the prior belief on the importance, and α is a damping factor indicating how much we rely on the prior belief. It is easy to prove that the solution of the above equation is the principal eigenvector of the matrix W∗ = (1 − α)W + α f0 eᵀ, where e = [1, 1, . . . , 1]ᵀ, if we require ||f||1 = 1 ([43, 22]).
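In practice the principal eigenvector is usually found by power iteration rather than a full eigendecomposition; the sketch below iterates f ← (1 − α)Wf + αf0 with a uniform prior f0 (an assumed default) and W normalized so that importance flows along incoming links.

import numpy as np

def pagerank(A, alpha=0.15, n_iter=100):
    """Power iteration for f = (1 - alpha) W f + alpha f0; A[i, j] = 1 means a link from i to j."""
    n = A.shape[0]
    out_deg = A.sum(axis=1, keepdims=True)
    out_deg[out_deg == 0] = 1.0                    # crude guard for dangling nodes
    W = (A / out_deg).T                            # W[j, i] = A[i, j] / outdegree(i)
    f = np.full(n, 1.0 / n)
    f0 = np.full(n, 1.0 / n)                       # uniform prior belief
    for _ in range(n_iter):
        f = (1 - alpha) * W @ f + alpha * f0
        f /= f.sum()                               # keep ||f||_1 = 1
    return f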

The importance score induces a ranking, which can be seen as reducing the structure in the data to one dimension. This leads to the following interpretation based on low-rank matrix approximation

B = arg min_{Rank(B)=1} ||W∗ − B||_F

and the importance vector can be uniquely determined from B = f fᵀ.

4.4.2 HITS

The HITS algorithm goes one step further for the purpose of ranking ([29, 30]). It separates the structural information embedded in a directed graph into two parts: a hub score describes the importance of a node based on its outgoing links, and an authority score describes the importance of a node based on its incoming links. However, these two sets of scores are correlated

ai = ∑_{j∼i} hj

hi = ∑_{i∼j} aj

where h = [h1, h2, . . . , hn]ᵀ is the hub score vector and a = [a1, a2, . . . , an]ᵀ is the authority score vector.

The solution of the HITS algorithm is that the authority score vector a is the principal eigenvector of WᵀW and the hub score vector h is the principal eigenvector of WWᵀ. Similarly, this can also be viewed as a rank-1 approximation of the matrices WᵀW and WWᵀ.
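The mutual reinforcement is again solved by an alternating power iteration; the sketch below updates hub and authority scores until they stabilize, normalizing at each step.

import numpy as np

def hits(A, n_iter=100):
    """HITS hub and authority scores for a directed adjacency matrix A (A[i, j] = 1 for a link i -> j)."""
    n = A.shape[0]
    h = np.ones(n)
    a = np.ones(n)
    for _ in range(n_iter):
        a = A.T @ h                  # authority: sum of hub scores of pages pointing to the node
        a /= np.linalg.norm(a)
        h = A @ a                    # hub: sum of authority scores of pages the node points to
        h /= np.linalg.norm(h)
    return h, a                      # converge to the principal eigenvectors of W W^T and W^T W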

4.4.3 Traffic Rank

In [52] a maximum entropy model is proposed to rank the nodes in a graph based on the traffic in an equilibrium state. In particular, a traffic probability pi,j is associated with each directed link in the graph, and it is assumed that the most appropriate traffic agreeing with the graph structure is the one which maximizes the entropy of the traffic probability distribution. This leads to the following optimization problem

max_{pi,j}  −∑_{i,j} pi,j log( pi,j / wi,j )

s.t.  ∑_{i,j} ci,j pi,j = C

      ∑_{i,j} pi,j = 1

      pi,j > 0, ∀(i, j)

where ci,j is a cost or benefit associated with each directed link and C is the total cost or benefit observed empirically. If no cost or benefit is observed, the corresponding constraint can be omitted.

An iterative scaling method is used to solve the above maximum entropy model, and the traffic rank score is computed as

fj = ∑_i pi,j

4.4.4 Manifold Rank

Manifold Rank builds a graph on the set consisting of all the documents as well as the queries, where the local manifold structure embeds the relevance information. The ranking function ([64]) is given by

y∗ = (I− αD−1W)−1Dy

where α is a constant controlling how much we rely on the queries, and the components of y are assigned a value of 1 for queries and 0 for documents to be ranked. Manifold Rank is query-dependent; however, it is equivalent to PageRank (by setting appropriate values for the constants) if we omit the queries from the graph.

5 Applications in Information Retrieval

5.1 Document Retrieval

Traditional Information Retrieval research focuses on modeling the relevance between a textual query and documents. Based on relevance scoring, a ranked list of documents is returned in response to the given query. However, relevance-based retrieval does not fully satisfy the information need behind the given query on many occasions. Two major reasons may account for this dissatisfaction: 1) relevance is hard to model accurately; 2) there are factors other than query-document relevance that also govern the satisfaction of an information need, such as importance, topic coverage, etc.

Motivated by improving relevance modeling, an empirical study in [13] verifies that hyperlinks overlap with content similarities to some extent, and an improved TFIDF weighting scheme is proposed in [51] which takes advantage of the graph structure of web pages. Relevance propagation has also been studied ([47, 44]) to adjust relevance scores through the structure of the document graph. However, much more research effort has been devoted to factors other than relevance that make a better ranking, including algorithms such as PageRank ([43, 22]), HITS ([29, 30]), SALSA ([36]), SimRank ([26]), implicit link analysis ([58]), affinity rank ([60]), etc.

5.2 Document Classification

As a typical application of classification, a number of graph-based semi-supervised learning models have been reported to perform well for text categorization, including the Spectral Graph Transducer ([28]), harmonic mixture models ([67]), and learning models on directed graphs ([62, 63]). All these studies are “purely graph-based” in that the classification is carried out solely on the graphs, without combining with other textual-feature-based classification methods.

A problem with applying graph-based semi-supervised learning models to text categorization is that constructing a graph with a large number of nodes is computationally expensive (it usually involves pairwise similarity computation), while real-world text categorization problems often involve a large number of documents. Besides, some learning algorithms may fail when facing a relatively large number of examples; for instance, the manifold learning approach (discussed in Section 4.2.5) and the harmonic function approach (discussed in Section 4.2.2) will suffer because the computation of the matrix inverse could be prohibitively expensive.

For web page classification, however, the hyperlink-based graph is more readily available. A few studies have been devoted to improving text-based web page classification by means of hyperlink graph structures ([11, 37, 38]).

5.3 Collaborative Filtering

Mining user interest patterns from sparse data is a central problem in collaborative filtering. A typical approach is to infer a user's pattern from observations of other similar users, and to make predictions on an unrated item based on what can be learned from other similar items. Taking this view, the way collaborative filtering works can be explained as letting information propagate along similarity-based structures over users and items, which leads to graph-based models.

Given the sparseness of the user rating matrix, the problem of developing a reliable similarity measurement between users or items should be solved prior to the prediction of users' interest in unrated items. This problem has been addressed in [25, 17] by introducing different weighting factors and normalization methods into the similarity computation. Some issues related to predicting user interest based on the user/item similarity graph have been discussed in [18, 25]. Recently an algorithm based on maximum margin matrix factorization has been proposed in [50], which can also be explained from a maximum-margin perspective.

5.4 Unified Link Analysis

Graph-based models are also referred to as link analysis methods in the Information Retrieval community. As link analysis has become popular, various graphs have been constructed for different purposes, which brings a new research topic of unifying multiple graphs. On one hand, multiple graphs can be constructed for a single set of nodes. For example, a Web graph can be built based on different link types, such as hyperlinks, content similarities or user click-through. On the other hand, multiple graphs can be constructed for different but closely related sets of nodes. For example, in a retrieval system, three graphs can be built for users, queries and documents respectively. Thus a new problem emerges: how can we unify the analysis on those graphs? Several works have attempted to unify the link analysis on multiple graphs. For example, [14, 8] aim at link analysis on multiple graphs for a single set of objects, and in [11] a probabilistic framework is proposed, while [57, 56] aim at a framework to accommodate multiple graphs over inter-related objects.

5.5 Image Retrieval

It is well known that low-level features are not adequate for image representation due to the gap between them and image semantics. As a result, pairwise Euclidean similarity measurements are not reliable, which consequently degrades retrieval performance. This problem has brought Manifold Learning, which aims at learning a global manifold structure of the data from local Euclidean distances, into image retrieval. For example, [24] uses the manifold structure for image representation, and [64, 23] propose manifold ranking schemes.

Besides the manifold learning approach, there is other work taking advantage of the graph structure of images, such as [68], where a similarity propagation scheme is proposed over the graph, and [7], where graph-based clustering is performed to facilitate retrieval.


6 Summary

There are a few potential research problems in graph-based learning models

1. Relatively little work has been devoted to directed-graph-based models, which should be more general and powerful than undirected-graph-based models.

2. A number of problems can be modeled by multiple correlated graphs, such as

(a) The class graph and the example graph in the multi-label learning problem

(b) The user graph and item graph in collaborative filtering

(c) The user graph, query graph and document graph in retrieval

But analysis on correlated graphs is not well studied, though it has been addressed on some occasions ([57, 56]).

3. Graph construction is not well studied. Questions include

(a) How to construct a reliable graph from partially observed data? For example, in collaborative filtering the rating information is sparse, and in image retrieval the low-level features cannot reflect the semantics very well. In those cases, how can we extract the correlation information as accurately as possible to build a reliable graph?

(b) How to construct a graph based on dissimilarity? Most of the time a graph is constructed from the similarity information of examples. In some cases, we may not be confident in judging two examples to be similar, but more confident in judging them to be dissimilar. Can we build a graph based on such dissimilarity information? Can dissimilarity information be complementary to graphs constructed from similarity information, and if so, how?

(c) How to make a robust graph when multiple sources of evidence can be gathered? For example, we can judge whether two web pages are relevant from their contents and from their hyperlinks. Sometimes the conclusions are consistent, sometimes they are not. If a single graph is to be constructed, how can we address both the consistencies and the inconsistencies among multiple sources of evidence?

4. If uncertainty exists in a graph structure, how can we address it? For example, a web graph could contain a lot of uncertainty, since hyperlinks can be very noisy and may not reflect any information regarding the page contents.


References

[1] Yasemin Altun, Thomas Hofmann, and Alexander J. Smola. Gaussian process classification for segmenting and annotating sequences. In ICML ’04: Proceedings of the twenty-first international conference on Machine learning, page 4, New York, NY, USA, 2004. ACM Press.

[2] Yasemin Altun, Alex J. Smola, and Thomas Hofmann. Exponential families for conditional random fields. In AUAI ’04: Proceedings of the 20th conference on Uncertainty in artificial intelligence, pages 2–9, Arlington, Virginia, United States, 2004. AUAI Press.

[3] Dragomir Anguelov, Benjamin Taskar, Vassil Chatalbashev, Daphne Koller, Dinkar Gupta, Geremy Heitz, and Andrew Y. Ng. Discriminative learning of Markov random fields for segmentation of 3D scan data. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), pages 169–176, 2005.

[4] Mikhail Belkin and Partha Niyogi. Using manifold structure for partially labeled classification. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 15, Cambridge, MA, 2002. MIT Press.

[5] Avrim Blum and Shuchi Chawla. Learning from labeled and unlabeled data using graph mincuts. In ICML ’01: Proceedings of the Eighteenth International Conference on Machine Learning, pages 19–26, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.

[6] Avrim Blum, John Lafferty, Mugizi Robert Rwebangira, and Rajashekar Reddy. Semi-supervised learning using randomized mincuts. In ICML ’04: Proceedings of the twenty-first international conference on Machine learning, page 13, New York, NY, USA, 2004. ACM Press.

[7] Deng Cai, Xiaofei He, Zhiwei Li, Wei-Ying Ma, and Ji-Rong Wen. Hierarchical clustering of WWW image search results using visual, textual and link information. In MULTIMEDIA ’04: Proceedings of the 12th annual ACM international conference on Multimedia, pages 952–959, New York, NY, USA, 2004. ACM Press.

[8] Pavel Calado, Marco Cristo, Edleno Moura, Nivio Ziviani, Berthier Ribeiro-Neto, and Marcos André Gonçalves. Combining link-based and content-based methods for web document classification. In CIKM ’03: Proceedings of the twelfth international conference on Information and knowledge management, pages 394–401, New York, NY, USA, 2003. ACM Press.

[9] Fan R. K. Chung. Eigenvalues of graphs. In Proceedings of the International Congress of Mathematicians, pages 1333–1342, Zurich, 1994. Birkhäuser Verlag, Berlin.


[10] Fan R. K. Chung. Spectral Graph Theory. CBMS Regional Conference Series in Mathematics, ISSN: 0160-7642. American Mathematical Society, 1997.

[11] David Cohn and Thomas Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems 13, 2000.

[12] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press & McGraw-Hill, 2nd ed., 2001.

[13] Brian D. Davison. Topical locality in the web. In SIGIR ’00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pages 272–279, New York, NY, USA, 2000. ACM Press.

[14] Brian D. Davison. Toward a unification of text and link analysis. In SIGIR ’03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 367–368, New York, NY, USA, 2003. ACM Press.

[15] Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis. Kernel k-means: spectral clustering and normalized cuts. In KDD ’04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 551–556, New York, NY, USA, 2004. ACM Press.

[16] Chris H. Q. Ding, Xiaofeng He, Hongyuan Zha, Ming Gu, and Horst D. Simon. A min-max cut algorithm for graph partitioning and data clustering. In Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM 2001), pages 107–114. IEEE Computer Society, 2001.

[17] François Fouss, Alain Pirotte, and Marco Saerens. A novel way of computing similarities between nodes of a graph, with application to collaborative recommendation. In Web Intelligence, pages 550–556, 2005.

[18] Kenneth Y. Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Eigentaste: A constant time collaborative filtering algorithm. Inf. Retr., 4(2):133–151, 2001.

[19] Norbert Gövert, Mounia Lalmas, and Norbert Fuhr. A probabilistic description-oriented approach for categorizing web documents. In CIKM ’99: Proceedings of the eighth international conference on Information and knowledge management, pages 475–482, New York, NY, USA, 1999. ACM Press.

[20] Ming Gu, Hongyuan Zha, Chris Ding, Xiaofeng He, and Horst Simon. Spectral relaxation models and structure analysis for k-way graph clustering and bi-clustering. Technical Report CSE-01-007, Department of Computer Science and Engineering, Pennsylvania State University, 2001.


[21] L. Hagen and A. B. Kahng. New spectral methods for ratio cut partitioning and clustering. IEEE Trans. on Computer-Aided Design, 11:1074–1085, 1992.

[22] Taher H. Haveliwala. Topic-sensitive PageRank. In WWW ’02: Proceedings of the 11th international conference on World Wide Web, pages 517–526, New York, NY, USA, 2002. ACM Press.

[23] Jingrui He, Mingjing Li, Hong-Jiang Zhang, Hanghang Tong, and Changshui Zhang. Manifold-ranking based image retrieval. In MULTIMEDIA ’04: Proceedings of the 12th annual ACM international conference on Multimedia, pages 9–16, New York, NY, USA, 2004. ACM Press.

[24] Xiaofei He, Wei-Ying Ma, and Hong-Jiang Zhang. Learning an image manifold for retrieval. In MULTIMEDIA ’04: Proceedings of the 12th annual ACM international conference on Multimedia, pages 17–23, New York, NY, USA, 2004. ACM Press.

[25] Jonathan L. Herlocker, Joseph A. Konstan, Al Borchers, and John Riedl. An algorithmic framework for performing collaborative filtering. In SIGIR ’99: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 230–237, New York, NY, USA, 1999. ACM Press.

[26] Glen Jeh and Jennifer Widom. SimRank: a measure of structural-context similarity. In KDD ’02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 538–543, New York, NY, USA, 2002. ACM Press.

[27] Finn V. Jensen. Bayesian Networks and Decision Graphs (Information Science and Statistics). Springer, July 2001.

[28] Thorsten Joachims. Transductive learning via spectral graph partitioning. In Proceedings of the International Conference on Machine Learning (ICML), 2003.

[29] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.

[30] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604–632, 1999.

[31] Risi Imre Kondor and John D. Lafferty. Diffusion kernels on graphs and other discrete input spaces. In Proceedings of the Nineteenth International Conference on Machine Learning (ICML 2002), pages 315–322, 2002.

[32] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML ’01: Proceedings of the eighteenth international conference on Machine learning, pages 282–289, 2001.


[33] Neil D. Lawrence and Michael I. Jordan. Semi-supervised learning via Gaussian processes. In Lawrence K. Saul, Yair Weiss, and Leon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 753–760. MIT Press, Cambridge, MA, 2005.

[34] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, October 1999.

[35] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems 13, pages 556–562, 2000.

[36] R. Lempel and S. Moran. Rank-stability and rank-similarity of link-based web ranking algorithms in authority-connected graphs. Inf. Retr., 8(2):245–264, 2005.

[37] Qing Lu and Lise Getoor. Link-based classification. In ICML ’03: Proceedings of the twentieth international conference on Machine learning, pages 496–503, 2003.

[38] Qing Lu and Lise Getoor. Link-based text classification. In IJCAI Workshop on Text Mining and Link Analysis, Acapulco, MX, 2003.

[39] David J. C. MacKay. Introduction to Gaussian processes. In C. M. Bishop, editor, Neural Networks and Machine Learning, NATO ASI Series, pages 133–166. Kluwer Academic Press, 1998.

[40] Andrew McCallum. Efficiently inducing features of conditional random fields. In Proceedings of the 19th Annual Conference on Uncertainty in Artificial Intelligence (UAI-03), pages 403–410, San Francisco, CA, 2003. Morgan Kaufmann Publishers.

[41] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14, pages 849–856, 2001.

[42] Cheng Soon Ong, Xavier Mary, Stéphane Canu, and Alexander J. Smola. Learning with non-positive kernels. In ICML ’04: Proceedings of the twenty-first international conference on Machine learning, page 81, New York, NY, USA, 2004. ACM Press.

[43] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998.

[44] Tao Qin, Tie-Yan Liu, Xu-Dong Zhang, Zheng Chen, and Wei-Ying Ma. A study of relevance propagation for web search. In SIGIR ’05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 408–415, New York, NY, USA, 2005. ACM Press.


[45] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 1995.

[46] Matthew S. Ryan and Graham R. Nudd. The Viterbi algorithm. Technical report, Coventry, UK, 1993.

[47] Azadeh Shakery and ChengXiang Zhai. Relevance propagation for topic distillation: UIUC TREC 2003 Web track experiments. In The Twelfth Text Retrieval Conference (TREC 2003), page 673, 2003.

[48] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Trans. on PAMI, 22(8):888–905, August 2000.

[49] Nathan Srebro and Tommi Jaakkola. Weighted low-rank approximations. In ICML ’03: Proceedings of the twentieth international conference on Machine learning, pages 720–727, 2003.

[50] Nathan Srebro, Jason D. M. Rennie, and Tommi S. Jaakkola. Maximum-margin matrix factorization. In Lawrence K. Saul, Yair Weiss, and Leon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1329–1336. MIT Press, Cambridge, MA, 2005.

[51] Kazunari Sugiyama, Kenji Hatano, Masatoshi Yoshikawa, and Shunsuke Uemura. Refinement of tf-idf schemes for web pages using their hyperlinked neighboring pages. In HYPERTEXT ’03: Proceedings of the fourteenth ACM conference on Hypertext and hypermedia, pages 198–207, New York, NY, USA, 2003. ACM Press.

[52] John A. Tomlin. A new paradigm for ranking pages on the world wide web. In WWW ’03: Proceedings of the 12th international conference on World Wide Web, pages 350–355, New York, NY, USA, 2003. ACM Press.

[53] Vladimir N. Vapnik. The nature of statistical learning theory. Springer-Verlag New York, Inc., New York, NY, USA, 1995.

[54] Hanna M. Wallach. Conditional random fields: An introduction. Technical Report MS-CIS-04-21, University of Pennsylvania, 2004.

[55] Eric W. Weisstein. "Graph." From MathWorld - A Wolfram Web Resource. http://mathworld.wolfram.com/graph.html.

[56] Wensi Xi, Edward A. Fox, Weiguo Fan, Benyu Zhang, Zheng Chen, Jun Yan, and Dong Zhuang. SimFusion: measuring similarity using unified relationship matrix. In SIGIR ’05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 130–137, New York, NY, USA, 2005. ACM Press.


[57] Wensi Xi, Benyu Zhang, Zheng Chen, Yizhou Lu, Shuicheng Yan, Wei-Ying Ma, and Edward Allan Fox. Link fusion: a unified link analysis framework for multi-type interrelated data objects. In WWW ’04: Proceedings of the 13th international conference on World Wide Web, pages 319–327, New York, NY, USA, 2004. ACM Press.

[58] Gui-Rong Xue, Hua-Jun Zeng, Zheng Chen, Wei-Ying Ma, Hong-Jiang Zhang, and Chao-Jun Lu. Implicit link analysis for small web search. In SIGIR ’03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 56–63, New York, NY, USA, 2003. ACM Press.

[59] Jieping Ye. Generalized low rank approximations of matrices. In ICML ’04: Proceedings of the twenty-first international conference on Machine learning, page 112, New York, NY, USA, 2004. ACM Press.

[60] Benyu Zhang, Hua Li, Yi Liu, Lei Ji, Wensi Xi, Weiguo Fan, Zheng Chen, and Wei-Ying Ma. Improving web search results using affinity graph. In SIGIR ’05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 504–511, New York, NY, USA, 2005. ACM Press.

[61] Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.

[62] Dengyong Zhou, Jiayuan Huang, and Bernhard Schölkopf. Learning from labeled and unlabeled data on a directed graph. In ICML ’05: Proceedings of the 22nd international conference on Machine learning, pages 1036–1043, New York, NY, USA, 2005. ACM Press.

[63] Dengyong Zhou, Bernhard Schölkopf, and Thomas Hofmann. Semi-supervised learning on directed graphs. In Lawrence K. Saul, Yair Weiss, and Leon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1633–1640. MIT Press, Cambridge, MA, 2005.

[64] Dengyong Zhou, Jason Weston, Arthur Gretton, Olivier Bousquet, and Bernhard Schölkopf. Ranking on data manifolds. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.

[65] Xiaojin Zhu. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005. http://www.cs.wisc.edu/∼jerryzhu/pub/ssl survey.pdf.


[66] Xiaojin Zhu, Zoubin Ghahramani, and John D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML ’03: Proceedings of the twentieth international conference on Machine learning, pages 912–919, 2003.

[67] Xiaojin Zhu and John Lafferty. Harmonic mixtures: combining mixture models and graph-based methods for inductive and scalable semi-supervised learning. In ICML ’05: Proceedings of the 22nd international conference on Machine learning, pages 1052–1059, New York, NY, USA, 2005. ACM Press.

[68] Yueting Zhuang, Jun Yang, Qing Li, and Yunhe Pan. A graphic-theoretic model for incremental relevance feedback in image retrieval. In Proceedings of the 2002 International Conference on Image Processing (ICIP 2002), pages 413–416, 2002.
