Graph-Based Recommendation
Transcript of the lecture "Recomendación Basada en Grafos"
Denis Parra
IIC3633 – Sistemas Recomendadores
The Recommendation Problem
• Once again we revisit the recommendation problem.
• A valid alternative to the methods seen so far is to exploit the relations between items in the form of graphs.
Today
• Applying associative retrieval techniques to alleviate the sparsity problem in CF (Huang et al., 2004)
• The Link-Prediction Problem for Social Networks (Liben-Nowell & Kleinberg, 2007)
• SimRank: A Measure of Structural-Context Similarity (Jeh & Widom, 2002)
Paper 1
• Zan Huang, Hsinchun Chen, and Daniel Zeng. 2004. Applying associative retrieval techniques to alleviate the sparsity problem in collaborative filtering. ACM Trans. Inf. Syst. 22, 1 (January 2004), 116-142.
Summary
• Deals with the sparsity of user evaluations (ratings)
• Collaborative filtering is studied as a bipartite graph.
• Associative retrieval techniques (Spreading Activation) are applied over the graph.
• RESULT: when ratings are sparse, these graph-based techniques improve the results of collaborative filtering.
Reminder: bipartite graph
- A graph whose nodes can be separated into 2 disjoint, independent sets U and V
- Every edge connects a node in U with a node in V
The Sparsity Problem
• By 2004, the cold-start and new-item problems had been attacked using:
– Item-based CF (Sarwar 2001)
– Dimensionality reduction (Goldberg 2001)
– Hybrids (Balabanovic 2002, Basu 1998, Condliff 1999, etc.)
• None of the above methods had achieved absolute consensus on its success
CF as Associative Retrieval
• Basic idea: build a graph between users and items and explore transitive associations between them.
(Figure: a user–item graph illustrating transitive associations of 3 and 5 hops.)
Matrix Notation
• Consider the consumer/product matrix A
• Parameters: M = number of hops, α = decay (the weight associated with each link)
Example
• Given A
• Then, for M = 3, α = 0.5
• Then, for M = 5, α = 0.5
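As a rough sketch of this idea (not Huang et al.'s exact formulation, whose weighting scheme differs in detail), transitive user–item associations up to M hops can be accumulated from powers of the bipartite adjacency matrix, decayed by α per hop; only odd path lengths connect users to items:

```python
import numpy as np

def transitive_association(A, M=3, alpha=0.5):
    """Accumulate user-item associations over odd-length paths up to M hops.

    A: binary user x item interaction matrix.
    alpha: decay weight applied per hop (an assumed form; the exact
    weighting in Huang et al. 2004 may differ).
    """
    n_users, n_items = A.shape
    # Adjacency of the bipartite graph, with users and items as one node set.
    B = np.block([[np.zeros((n_users, n_users)), A],
                  [A.T, np.zeros((n_items, n_items))]])
    W = alpha * B
    power = W.copy()
    assoc = power.copy()
    for _ in range((M - 1) // 2):
        power = power @ W @ W   # advance two hops at a time (odd lengths only)
        assoc += power
    # The user->item block of the accumulated matrix holds the associations.
    return assoc[:n_users, n_users:]

A = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 1]], dtype=float)
print(transitive_association(A, M=3, alpha=0.5))
```

Note that the pair (user 0, item 1) has no direct interaction but receives a nonzero 3-hop association, which is exactly what alleviates sparsity.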
Difficulties
• Computing the power of a matrix can be very costly for large "c" and "n", which motivates the 3 methods tested by Huang et al. in the paper.
Research Assumption
• Spreading Activation methods will work best when the network has very low density; otherwise, over-activation may occur.
Models
• Constrained Leaky Capacitor Model (LCM)
• Branch-and-Bound
• Hopfield Net
LCM
• Proposed by Anderson (1983)
• Steps:
– Identify the initial node vector V, set D(0)
– Compute the activation levels
where (1-γ) is the speed of decay (0.8) and α is the efficiency (0.8)
– Stopping condition: in the paper, 10 iterations, top 50 nodes
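The iterate-spread-decay pattern can be sketched as follows. This is a schematic, not Anderson's exact LCM update (whose equation is not in the transcript); the degree normalization of W is an added assumption to keep activations bounded, while the parameter values match those quoted on the slide:

```python
import numpy as np

def lcm_activation(W, source, n_iter=10, alpha=0.8, decay=0.8, top_k=50):
    """Schematic leaky-capacitor-style spreading activation.

    NOT the exact LCM update of Anderson (1983); this only illustrates
    the inject-spread-decay pattern with the slide's parameter values
    (speed of decay 1-gamma = 0.8, efficiency alpha = 0.8, 10 iterations,
    top-50 nodes).
    """
    # Degree-normalize rows so repeated spreading stays bounded (assumption).
    Wn = W / W.sum(axis=1, keepdims=True)
    a = source.astype(float)
    for _ in range(n_iter):
        # Re-inject the source, spread to neighbors, decay the contribution.
        a = source + decay * alpha * (Wn @ a)
    top = np.argsort(-a)[:top_k]   # stopping: keep the top-k activated nodes
    return top, a

# Toy graph: a chain 0-1-2-3; activate node 0.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
src = np.array([1.0, 0.0, 0.0, 0.0])
top, act = lcm_activation(W, src, top_k=4)
print(top, act)
```

Activation decreases with distance from the source, which is the ranking signal used for recommendation.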
Branch-and-Bound
• Implementation based on (Chen & Ng 1995)
• Step 1, initialization: the node corresponding to the user is activated (1), all others = 0. The priority queue Q_priority is initialized with the active user node.
• Step 2, activation computation: pop nodes from Q_priority; for each neighboring node, compute its activation and add/update the activated node in Q_output.
• Step 3, stopping: determined empirically (70 iterations)
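The three steps above can be sketched with a priority queue. The activation formula below (alpha times edge weight times the popped node's activation) is an illustrative assumption, not the exact Chen & Ng update:

```python
import heapq

def branch_and_bound_activation(adj, start, alpha=0.5, max_pops=70):
    """Best-first spreading activation with a priority queue (sketch).

    adj: dict node -> list of (neighbor, weight).
    A neighbor's activation is assumed to be alpha * weight *
    activation(current node); the exact formula in Chen & Ng (1995) /
    Huang et al. (2004) may differ.
    """
    activation = {start: 1.0}
    # heapq is a min-heap, so push negated activations for best-first order.
    q_priority = [(-1.0, start)]
    q_output = {}
    pops = 0
    while q_priority and pops < max_pops:   # empirical stopping condition
        neg_a, node = heapq.heappop(q_priority)
        pops += 1
        for nbr, w in adj.get(node, []):
            a = alpha * w * (-neg_a)
            if a > activation.get(nbr, 0.0):  # keep the best activation seen
                activation[nbr] = a
                q_output[nbr] = a
                heapq.heappush(q_priority, (-a, nbr))
    return q_output

# Toy bipartite graph: users u1, u2 and items i1, i2 (unit weights).
adj = {"u1": [("i1", 1.0)],
       "i1": [("u1", 1.0), ("u2", 1.0)],
       "u2": [("i1", 1.0), ("i2", 1.0)],
       "i2": [("u2", 1.0)]}
print(branch_and_bound_activation(adj, "u1"))
```

Item i2, never bought by u1, ends up activated through the transitive path u1 → i1 → u2 → i2.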
Hopfield Net
• Analogy with a neural network: users and items are neurons, and the synapses carry the activations.
• Initialization: same as the previous methods
• Activation computation: each neuron's activation is a sigmoid transfer function of the weighted sum of its neighbors' activations
• Stopping condition: stop when activations change less than a small ε between consecutive iterations
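A minimal sketch of this style of update, assuming a standard sigmoid transfer function and a small-change convergence test (the exact transfer function and threshold in Huang et al. 2004 may differ):

```python
import numpy as np

def hopfield_activation(W, source, threshold=1e-4, max_iter=100):
    """Hopfield-net-style spreading activation (sketch).

    Users and items are treated as neurons and edge weights as synapses.
    The sigmoid transfer and the stopping threshold are assumptions made
    for illustration.
    """
    a = source.astype(float)
    for _ in range(max_iter):
        net = W @ a                          # net input to each neuron
        new_a = 1.0 / (1.0 + np.exp(-net))   # sigmoid transfer function
        if np.abs(new_a - a).sum() < threshold:  # stopping condition
            break
        a = new_a
    return a

# Toy graph: a chain 0-1-2; activate node 0.
W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
src = np.array([1.0, 0.0, 0.0])
act = hopfield_activation(W, src)
print(act)
```

At convergence the initial asymmetry washes out and activation reflects connectivity (the degree-2 middle neuron ends up most active), which is why over-activation becomes a risk on dense graphs.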
Experimental Study
• Online bookstore in China
9,695 books / 2,000 users / 18,771 transactions
• Evaluation metrics:
Precision, Recall, F-1
• And utility rank, where h is the viewing halflife (the rank of the book on the list such that there is a 50% chance the user will view that book)
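The utility-rank metric paraphrased above is the half-life utility (R-score) of Breese et al.; with binary relevance it can be computed as:

```python
def half_life_utility(ranked_items, relevant, h=5):
    """Half-life utility (R-score) for one user, binary relevance.

    ranked_items: recommendation list, best first.
    relevant: set of items the user actually viewed/bought.
    h: viewing half-life -- the rank at which the probability that the
    user views the item has dropped to 50%.
    """
    score = 0.0
    for j, item in enumerate(ranked_items, start=1):
        if item in relevant:
            # The utility of a hit decays exponentially with its rank.
            score += 1.0 / (2 ** ((j - 1) / (h - 1)))
    return score

print(half_life_utility(["a", "b", "c", "d"], {"a", "c"}, h=5))
```

A hit at rank 1 contributes 1.0; a hit at rank h contributes 0.5; hits far down the list contribute almost nothing.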
Recall the hypotheses
• H1. Spreading activation-based CF can achieve higher recommendation quality than the 3-hop, User-based (Correlation), User-based (Vector Similarity), and Item-based approaches.
• H2. Spreading activation-based CF can achieve higher recommendation quality than the 3-hop, User-based (Correlation), User-based (Vector Similarity), and Item-based approaches for new users (the cold-start problem).
• H3. The recommendation quality of spreading activation-based CF decreases when the density of user–item interactions is beyond a certain level (the over-activation effect).
Results
H1: Algorithm comparison under normal conditions
H2: Algorithm comparison with sparse users
Results 2
H2: Algorithm comparison based on Recall, with cold-start users
Results 3
Results 3.2
Computational Efficiency
Lessons
• H1, H2 and H3 are confirmed
• Parameter sensitivity:
– LCM: not very sensitive (alpha, gamma, and number of iterations)
– BNB: the difference between 70 and 100 iterations is small; above 100 iterations, performance drops drastically
– Hopfield Net: little difference across parameter settings
Paper 2
• Liben-Nowell, D., & Kleinberg, J. (2007). The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7), 1019-1031.
The Problem
Definitions
Image from http://be.amazd.com/link-prediction/
Notation for the physics arXiv
Metric 1: Graph Distance
Image from http://be.amazd.com/link-prediction/
Common Neighbors
Image from http://be.amazd.com/link-prediction/
Jaccard
Image from http://be.amazd.com/link-prediction/
Adamic-Adar
Preferential Attachment
Image from http://be.amazd.com/link-prediction/
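The four neighborhood-based scores from the slides above (the formulas did not survive the transcript) can be computed directly from neighbor sets; a minimal sketch:

```python
import math

def neighborhood_scores(neighbors, x, y):
    """Classic neighborhood-based link-prediction scores for a pair (x, y),
    as studied by Liben-Nowell & Kleinberg.

    neighbors: dict node -> set of neighbors.
    """
    gx, gy = neighbors[x], neighbors[y]
    common = gx & gy
    return {
        "common_neighbors": len(common),
        "jaccard": len(common) / len(gx | gy) if gx | gy else 0.0,
        # Adamic-Adar: rare (low-degree) common neighbors count more.
        # Degree-1 nodes are skipped to avoid dividing by log(1) = 0.
        "adamic_adar": sum(1.0 / math.log(len(neighbors[z]))
                           for z in common if len(neighbors[z]) > 1),
        "preferential_attachment": len(gx) * len(gy),
    }

neighbors = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b"},
    "e": {"b"},
}
print(neighborhood_scores(neighbors, "a", "b"))
```

Each score ranks the non-edge (x, y) by how strongly the current neighborhoods suggest a future link.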
Katz
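The Katz score sums over paths of every length l, weighted by beta**l; for beta below 1/lambda_max(A) the series has the closed form (I - beta*A)^-1 - I. A sketch (beta = 0.1 is an illustrative choice):

```python
import numpy as np

def katz_scores(A, beta=0.05):
    """Katz link-prediction scores: sum over all path lengths l of
    beta**l * (number of paths of length l), computed in closed form
    as (I - beta*A)^-1 - I.

    beta must be smaller than 1/lambda_max(A) for the series to converge.
    """
    n = A.shape[0]
    return np.linalg.inv(np.eye(n) - beta * A) - np.eye(n)

# Path graph 0-1-2: the pair (0, 2) has no edge but one 2-hop path.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
S = katz_scores(A, beta=0.1)
print(S)
```

The non-adjacent pair (0, 2) still gets a positive score from its length-2 path, but less than the direct neighbors (0, 1), since the leading terms are beta**2 versus beta.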
Spectral / Random Walk Methods
• Hitting Time
• Rooted Page Rank
• SimRank
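Of the random-walk scores listed above, Rooted PageRank ranks candidate neighbors of a root node by the stationary distribution of a walk that periodically restarts at the root. A minimal power-iteration sketch (the restart probability alpha = 0.15 is an illustrative choice):

```python
import numpy as np

def rooted_pagerank(A, root, alpha=0.15, n_iter=100):
    """Rooted (personalized) PageRank as a link-prediction score.

    With probability alpha the surfer jumps back to `root`; otherwise it
    follows a random out-edge. score(root, y) = stationary prob. of y.
    """
    n = A.shape[0]
    P = A / A.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
    e = np.zeros(n)
    e[root] = 1.0
    pi = e.copy()
    for _ in range(n_iter):
        pi = alpha * e + (1 - alpha) * pi @ P
    return pi

# Path graph 0-1-2-3, rooted at node 0.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
pi = rooted_pagerank(A, root=0)
print(pi)
```

Nodes near the root accumulate more stationary mass than distant ones, which is the link-prediction signal.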
Results
Results 2
Paper 3
• Glen Jeh and Jennifer Widom (2002). SimRank: A Measure of Structural-Context Similarity. ACM SIGKDD 2002.
Motivation
• Many applications require a measure of "similarity" between objects.
▪ Web search
▪ Shopping recommendations
▪ Search for "related work" among scientific papers
• But "similarity" may be domain-dependent.
• Can we define a generic model for similarity?
Common Ground
• What do all these applications have in common?
→ a data set of objects linked by a set of relations.
• Then, a generic concept of similarity is structural-context similarity:
▪ "Two objects are similar if they relate to similar objects."
• Recall automorphic equivalence:
▪ "Two objects are equivalent if they relate to equivalent objects."
Problem Statement
• Given a graph G = (V, E), for each pair of vertices a,b ∈ V, compute a similarity (ranking) score s(a,b) based on the concept of structural-context similarity.
Basic Graph Model
• Directed Graph G = (V,E)
▪ V = set of objects
▪ E = set of unweighted edges
▪ Edge (u,v) exists if there is a relation u → v
▪ I(v) = set of in-neighbors of vertex v
▪ O(v) = set of out-neighbors of vertex v
SimRank Similarity
▶ Recursive model
– "Two objects are similar if they are referenced by similar objects"
– That is, a ~ b if
⚫ c → a and d → b, and
⚫ c ~ d
– An object is equivalent to itself (score = 1)
▶ Example
1. ProfA ~ ProfB because both are referenced by Univ.
2. StudentA ~ StudentB because they are referenced by similar nodes {ProfA, ProfB}
Basic SimRank Equation
• s(a,b) = similarity between a and b = average similarity between the in-neighbors of a and the in-neighbors of b
▪ s(a,b) is in the range [0, 1]
▪ If a = b, then s(a,b) = 1
▪ If a ≠ b: s(a,b) = C / (|I(a)||I(b)|) · Σ_i Σ_j s(I_i(a), I_j(b))
▪ C is a constant, 0 < C < 1
▪ If I(a) or I(b) = ∅, then s(a,b) = 0
Decay Factor C
• x is identical to itself: s(x,x) = 1
• Since we have x → a and x → b, should s(a,b) = 1 also?
▪ If the graph represented all the information about x, a, and b, then s(a,b) would ideally be 1.
▪ But in reality the graph does not describe everything about them, so we expect s(a,b) < 1.
• Therefore, the constant C expresses our limited confidence, or decay with distance:
s(a,b) = C ∙ average similarity of (I(a), I(b))
G2 Paired-Vertex Perspective
• Given graph G, define G² = (V², E²) where
▪ V² = V × V. Each vertex in V² is a pair of vertices in V.
▪ E²: (a,b) → (c,d) in G² iff a → c and b → d in G
• Since similarity scores are symmetric, (a,b) and (b,a) are merged into a single vertex.
Source and Flow of Similarity
• The SimRank score for a vertex (a,b) in G² = the similarity between a and b in G.
• The source of similarity is the self-vertices, like (Univ, Univ).
• Then similarity propagates along pair-paths in G², away from the sources.
• Note that values decrease away from (Univ, Univ).
SimRank in Bipartite Domains
• Bipartite: 2 types of objects
▪ Example: Buyers and Items
• Two types of similarity:
▪ Two buyers are similar if they buy similar items; the out-neighbors of the buyers are relevant
▪ Two items are similar if they are bought by similar buyers; the in-neighbors of the items are relevant
• In general, we can use I(.) and/or O(.) for any graph
Bipartite SimRank Equations
• s(A,B) = C1 / (|O(A)||O(B)|) · Σ_i Σ_j s(O_i(A), O_j(B)) for buyers A ≠ B
• s(c,d) = C2 / (|I(c)||I(d)|) · Σ_i Σ_j s(I_i(c), I_j(d)) for items c ≠ d
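The two coupled bipartite equations can be solved by fixed-point iteration. A small sketch on a toy purchase graph (the item names and fixed iteration count are illustrative; the empty-neighborhood case is not handled):

```python
from collections import defaultdict

def bipartite_simrank(purchases, C1=0.8, C2=0.8, k=10):
    """Iterative bipartite SimRank (sketch of the Jeh & Widom equations).

    purchases: dict buyer -> set of items bought.
    Buyer similarity averages item similarities over out-neighbors O(.);
    item similarity averages buyer similarities over in-neighbors I(.).
    """
    items_of = {b: set(i) for b, i in purchases.items()}
    buyers_of = defaultdict(set)
    for b, its in items_of.items():
        for i in its:
            buyers_of[i].add(b)
    buyers, items = list(items_of), list(buyers_of)
    sb = {(a, b): 1.0 if a == b else 0.0 for a in buyers for b in buyers}
    si = {(c, d): 1.0 if c == d else 0.0 for c in items for d in items}
    for _ in range(k):
        new_sb = {}
        for a in buyers:
            for b in buyers:
                if a == b:
                    new_sb[(a, b)] = 1.0
                    continue
                # C1-damped average over pairs of purchased items.
                tot = sum(si[(c, d)] for c in items_of[a] for d in items_of[b])
                new_sb[(a, b)] = C1 * tot / (len(items_of[a]) * len(items_of[b]))
        new_si = {}
        for c in items:
            for d in items:
                if c == d:
                    new_si[(c, d)] = 1.0
                    continue
                # C2-damped average over pairs of buyers.
                tot = sum(sb[(a, b)] for a in buyers_of[c] for b in buyers_of[d])
                new_si[(c, d)] = C2 * tot / (len(buyers_of[c]) * len(buyers_of[d]))
        sb, si = new_sb, new_si
    return sb, si

purchases = {"u1": {"sugar", "frosting"},
             "u2": {"sugar", "eggs"},
             "u3": {"flour"}}
sb, si = bipartite_simrank(purchases)
print(sb[("u1", "u2")], si[("frosting", "eggs")])
```

Buyers u1 and u2 become similar through their shared item, and that similarity then makes "frosting" and "eggs" similar even though no buyer bought both.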
MiniMax Variant
• Motivation: Two students A and B take the same courses: {Eng1, Math1, Chem1, Hist1}
▪ SimRank compares each course of A with each course of B
▪ But intuitively we just want the best matching pairs: s(Eng1_A, Eng1_B), s(Math1_A, Math1_B), etc.
• Solution: Two steps
▪ Max: Pair each neighbor of A with only its most similar neighbor of B, giving s_A(A,B). Do the same in the other direction to get s_B(A,B).
▪ Min: The final s(A,B) is the smaller of s_A(A,B) and s_B(A,B) [weakest link]
Computing SimRank
• R_k(a,b) = estimate of SimRank after k iterations.
▪ Initialization: R_0(a,b) = 1 if a = b, else 0
▪ Iteration: R_{k+1}(a,b) = C / (|I(a)||I(b)|) · Σ_i Σ_j R_k(I_i(a), I_j(b)) for a ≠ b, and R_{k+1}(a,a) = 1
▪ R_k(a,b) is the similarity that has flowed a distance k away from the sources.
▪ R_k values are non-decreasing as k increases.
• We can prove that R_k(a,b) converges to s(a,b)
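The R_k iteration for basic (in-neighbor) SimRank, in a naive form on a simplified version of the Univ/Prof/Student example from the earlier slide:

```python
def simrank(in_neighbors, C=0.8, k=10):
    """Naive fixed-point iteration for basic SimRank (R_k converges to s).

    in_neighbors: dict node -> list of in-neighbors I(v).
    """
    nodes = list(in_neighbors)
    # Initialization: R_0(a,b) = 1 if a == b, else 0.
    R = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(k):
        new_R = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new_R[(a, b)] = 1.0
                    continue
                Ia, Ib = in_neighbors[a], in_neighbors[b]
                if not Ia or not Ib:          # s(a,b) = 0 if I(a) or I(b) is empty
                    new_R[(a, b)] = 0.0
                    continue
                total = sum(R[(x, y)] for x in Ia for y in Ib)
                new_R[(a, b)] = C * total / (len(Ia) * len(Ib))
        R = new_R
    return R

# Simplified slide example: Univ -> ProfA, ProfB; ProfA -> StudentA;
# ProfB -> StudentB.
in_neighbors = {"Univ": [],
                "ProfA": ["Univ"], "ProfB": ["Univ"],
                "StudentA": ["ProfA"], "StudentB": ["ProfB"]}
R = simrank(in_neighbors)
print(R[("ProfA", "ProfB")], R[("StudentA", "StudentB")])
```

Similarity flows out from the self-pair (Univ, Univ): the professors score C = 0.8 and the students, one step further away, score C² = 0.64, illustrating the decay with distance from the source.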
Time and Space Complexity
• Space complexity: O(n²) to store R_k(a,b)
• Time complexity: O(K n² d²), where d² is the average of |I(a)||I(b)| over all vertex pairs (a,b)
• To improve performance, we can prune G²:
▪ Idea: vertices that are far apart should have very low similarity. We can approximate it as 0.
▪ Select a radius r. If a vertex pair (a,b) cannot meet in fewer than r steps, remove it from the graph G².
▪ Space complexity: O(n d_r)
▪ Time complexity: O(K n d_r d²), where d_r = avg. number of neighbors within radius r
Random Surfer-Pairs Model
• SimRank s(a,b) measures how soon two random surfers are expected to meet at the same node if they start at nodes a and b and randomly walk the graph backwards.
• Background: basic forward random walk
▪ Motion is in discrete steps, using edges of the graph.
▪ At each time step, there is an equal probability of moving from your current vertex to one of your out-neighbors.
▪ Given adjacency matrix A, the probability of walking from x to y is p_xy = a_xy / |O(x)|.
• Random walk as a Markov process
▪ The initial location is described by the probability distribution vector π(0)
▪ Probability of being at y at time 1: π_y(1) = Σ_x π_x(0) · p_xy
Random Walk Transition Matrices
• Given adjacency matrix A, the forward transition matrix normalizes each row of A by the out-degree |O(x)|; the backward transition matrix instead normalizes by the in-degree, so each backward step moves to a uniformly chosen in-neighbor.
Paired Backwards Random Walk
▶ Probability of walking backwards from a to x in one step: p(x ← a) = a_xa / |I(a)|
▶ Two walkers meet at x if they start at a and b, and one goes x ← a while the other goes x ← b.
s_x(a,b) = P(meeting at x) = p(x ← a) · p(x ← b)
s(a,b) = P(meeting) = Σ_x p(x ← a) · p(x ← b)
▶ If they start together, they have already met, so s⁽⁰⁾(x,y) = 1 if x = y; 0 otherwise [identity matrix]
▶ Then s(a,b) can be interpreted as the expected value of C^l, where l is the time at which the two backward surfers first meet.
Experiments: Data Sets
• Two data sets
▪ ResearchIndex (www.researchindex.com)
• a corpus of scientific research papers
• 688,898 cross-references among 278,628 papers
▪ Students' transcripts
• 1,030 undergraduate students in the School of Engineering at Stanford University
• Each transcript lists all courses that the student has taken so far (average: 40 courses/student)
Performance Validation Metric
• Problem: It is difficult to know the "correct" similarity between items.
• Solution: Define a rough domain-specific metric σ(p,q):
▪ For scientific papers, we have two versions:
σ_C(p,q) = fraction of q's citations also cited by p
σ_T(p,q) = fraction of words in q's title also in p's title
▪ For university courses:
σ_D(p,q) = 1 if p, q are in the same department, else 0
Computing the Performance Score
• Run the similarity algorithms:
▪ SimRank (naïve, pruned, minimax)
▪ Co-citation
• For each object p and algorithm A, form a set top_{A,N}(p) of the N objects most similar to p.
• For each q ∈ top_{A,N}(p), compute σ(p,q).
• Return the average σ_{A,N}(p) over all q.
Experiment: Scientific Papers
• Setup
▪ Used bipartite SimRank, only considering in-neighbors (validation uses out-neighbors)
▪ N ∈ {5, 10, …, 45, 50}
• Results
▪ Not very sensitive to the decay factors C1 and C2
▪ Pruning the search radius had little effect on the rank order of scores.
Results: Scientific Papers
Experiment: Students and Courses
• Setup
▪ Bipartite domain
▪ N ∈ {5, 10}
• Results
▪ The Min-Max version of SimRank performed the best
▪ Not very sensitive to the decay factors C1 and C2
Results: Students and Courses
Co-citation scores are very poor (0.161 for N=5 and 0.147 for N=10), so they are not shown in the graph.
Conclusions
• Defined a recursive model of structural similarity between objects in a network
• Mathematically formulated SimRank based on the recursive concept
• Presented a convergent algorithm to compute SimRank
• Described a random-walk interpretation of SimRank equations and scores
• Experimentally validated SimRank over two real data sets
Open Issues and Critique
• O(n²) space is large; scalability needs to be improved.
• s(a,b) only includes contributions from paths where a and b are the same distance from some x. What if the distances are offset (the total path length is odd)?
• As |I(a)| and |I(b)| increase, SimRank decreases, even if I(a) = I(b)!
▪ Addressed partially by the Minimax method
References
• Zan Huang, Hsinchun Chen, and Daniel Zeng. 2004. Applying associative retrieval techniques to alleviate the sparsity problem in collaborative filtering. ACM Transactions on Information Systems 22(1), 116-142.
• Liben-Nowell, D., & Kleinberg, J. (2007). The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7), 1019-1031.
• Jeh, G., & Widom, J. (2002). SimRank: A measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, July 2002.
• Nguyen, P., Tomeo, P., Di Noia, T., & Di Sciascio, E. (2015). An evaluation of SimRank and Personalized PageRank to build a recommender system for the Web of Data. In Proceedings of the 24th International Conference on World Wide Web Companion (pp. 1477-1482). International World Wide Web Conferences Steering Committee.