Uncovering Social Links Through Stochastic Point Processes Rui Zhang (u5963436) Dr Marian-Andrei Rizoiu Research School of Computer Science The Australian National University COMP6470 Final Presentation May, 2017




The Problem

[Figure: (a) Twitter: a cascade of Tweet 1 to Tweet 9; (b) Real retweet network (tree structure); (c) Retweet network from the Twitter API (star structure): wrong diffusion structure]

1. How tweets diffuse

2. Which user is important in the diffusion


The Problem

Purpose: Infer the real parent-offspring relationship between tweets using only one cascade

| Existing methods | Probability distribution | NETINF [Gomez-Rodriguez et al KDD’10] |
|---|---|---|
| Description | Predict links based on probabilities | Choose the links that improve the log-likelihood most significantly |
| Shortcomings | Needs cascades for optimizing the parameters of the distribution | Needs cascades for training and for prediction |

Sometimes only one cascade occurs, and no further cascades are available for training and improving prediction.


Contents of this Presentation

• Modeling Retweet Cascades with Hawkes Point Processes

• Optimization by Expectation Maximization Algorithm

• Constructing the Twitter Dataset

• Evaluation and Results


Introduction to Hawkes Point Processes

Point processes describe events occurring at random locations and/or times.

Hawkes Point Processes [Hawkes Biometrika’71]: events that occur increase the likelihood of future events (self-exciting).

Applications of Hawkes Point Processes:

(a) Modeling earthquake aftershocks

(b) Modeling trade

Branching Structure and Hidden Vars

Assumption (self-exciting): retweets in a cascade occur randomly, and the occurrence of retweets is likely to cause more retweets.

[Figure: the root tweet by user $u_1$, marked as event $(t_1, m_1)$ on the occurring-time axis]

$t$: occurring time

$m$: user influence (the number of followers)

[Figure: branching structure along the occurring-time axis. The observed event sequence is $(t_1, m_1), \dots, (t_5, m_5)$, posted by users $u_1, \dots, u_5$; the links shown are $p_{21}$, $p_{41}$, $p_{32}$ and $p_{54}$]

$p_{ji}$: P(the $j$-th retweet is caused by the $i$-th retweet)


Modeling Retweet Cascades

Model: Hawkes Point Processes with Power-law Triggering Kernel [Mishra et al CIKM’16]

$$\lambda(t) = \sum_{t_i < t} \phi_{m_i}(t - t_i)$$

$$\phi_{m_i}(t - t_i) = \kappa \, m_i^{\beta} \, (t - t_i + c)^{-(1+\theta)}$$

Optimize model parameters $(\kappa, \beta, c, \theta)$ and hidden variables $p_{ji}$
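As a rough illustration of the model on this slide, the following Python sketch evaluates the power-law triggering kernel and the event intensity for a small cascade. The cascade, the parameter values, and the function names `kernel` and `intensity` are assumptions made for this example; they do not come from the presentation.

```python
def kernel(tau, m, kappa, beta, c, theta):
    """Power-law kernel: kappa * m**beta * (tau + c)**-(1 + theta)."""
    return kappa * m**beta * (tau + c) ** -(1.0 + theta)


def intensity(t, events, kappa, beta, c, theta):
    """Event intensity lambda(t): sum of kernels over events strictly before t."""
    return sum(
        kernel(t - ti, mi, kappa, beta, c, theta)
        for ti, mi in events
        if ti < t
    )


# Hypothetical cascade of (time, follower count) pairs, root tweet first.
events = [(0.0, 1000), (2.0, 50), (5.0, 200)]
# Illustrative parameter values only.
params = dict(kappa=0.8, beta=0.6, c=10.0, theta=0.8)

print(intensity(6.0, events, **params))    # shortly after the last retweet
print(intensity(100.0, events, **params))  # much later: the intensity has decayed
```

Because every term decays with the elapsed time, the intensity falls between retweets and jumps whenever a new retweet arrives, which is the self-exciting behaviour assumed earlier.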


Optimization by Expectation Maximization Algorithm

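Expectation Maximization (EM) Algorithm (H.EM):

1. An iterative algorithm

2. Alternates between E step and M step

E step: $\{p_{ji}\} \leftarrow (\kappa, \beta, c, \theta)$

$$p_{ji} = \frac{\phi_{m_i}(t_j - t_i)}{\lambda(t_j)}, \qquad i = 1, 2, \dots, j-1, \quad j = 2, \dots, n$$

M step: $(\kappa_{old}, \beta_{old}, c_{old}, \theta_{old}, \{p_{ji}\}) \rightarrow (\kappa, \beta, c, \theta)$

$$(\kappa, \beta, c, \theta) = \arg\max_{\kappa, \beta, c, \theta} \; \sum_{j=2}^{n} \sum_{i=1}^{j-1} p_{ji} \log \phi_{m_i}(t_j - t_i) - \int_{t_1}^{t_n} \lambda(t)\, dt$$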
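The E step can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: events are (time, followers) pairs sorted by strictly increasing time with the root tweet first, the model has no background rate, and the names `kernel` and `e_step` as well as all numeric values are hypothetical. The M step, which would need a numerical optimizer over the four parameters, is left out.

```python
def kernel(tau, m, kappa, beta, c, theta):
    """Power-law kernel: kappa * m**beta * (tau + c)**-(1 + theta)."""
    return kappa * m**beta * (tau + c) ** -(1.0 + theta)


def e_step(events, kappa, beta, c, theta):
    """Return p with p[j][i] = phi_{m_i}(t_j - t_i) / lambda(t_j) for i < j.

    Indices are 0-based, so event 0 is the root tweet and p[0] is empty
    because the root has no parent.
    """
    p = [[]]
    for j in range(1, len(events)):
        tj = events[j][0]
        weights = [
            kernel(tj - ti, mi, kappa, beta, c, theta) for ti, mi in events[:j]
        ]
        lam_j = sum(weights)  # lambda(t_j): only earlier events contribute
        p.append([w / lam_j for w in weights])
    return p


# Hypothetical cascade and illustrative parameter values.
events = [(0.0, 1000), (2.0, 50), (5.0, 200)]
p = e_step(events, kappa=0.8, beta=0.6, c=10.0, theta=0.8)
for j, row in enumerate(p):
    print(j, row)
```

Each row of `p` sums to one, so it can be read as a distribution over the candidate parents of that retweet. Note that $\kappa$ cancels in the ratio, so in this sketch the E step depends only on $\beta$, $c$ and $\theta$.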


Constructing the Twitter Dataset

[Diagram: data collection with the components Twitter Crawler, Twitter API, Retweet Cascades, Twitter Users in Cascades and Friend Networks; Sydney Morning Herald (start: 14th Feb); label: Simultaneously]


Constructing the Twitter Dataset

Statistics on Current Data

| Item | Quantity |
|---|---|
| Cascades | 68040 |
| Tweets in cascades | 259186 |
| Users in cascades | 61174 |
| Cascades with more than 50 retweets ($C_{50}$) | 274 |
| Users in $C_{50}$ | 16125 |
| Tweets in $C_{50}$ | 33539 |
| Downloaded friends of users in $C_{50}$ | 16051 |


Evaluation and Results

Calculate optimal parameters on synthetic data

Baseline: maximizing the observed log-likelihood (MLL) of the same Point Process Model [Mishra et al CIKM’16]

$$(\kappa, \beta, c, \theta) = \arg\max_{(\kappa, \beta, c, \theta)} \; \sum_{i=2}^{n} \log \lambda(t_i) - \int_{t_1}^{t_n} \lambda(t)\, dt$$

Data: 10 cascades (20 experiments with different initial parameters on each cascade)
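The observed log-likelihood that the MLL baseline maximizes can be evaluated directly, because the integral of the power-law kernel has a closed form when $\theta > 0$. The Python sketch below assumes events are (time, followers) pairs sorted by strictly increasing time; the function names and the parameter values are hypothetical, and the optimizer that would search over the parameters is not shown.

```python
import math


def kernel(tau, m, kappa, beta, c, theta):
    """Power-law kernel: kappa * m**beta * (tau + c)**-(1 + theta)."""
    return kappa * m**beta * (tau + c) ** -(1.0 + theta)


def kernel_integral(T, m, kappa, beta, c, theta):
    """Integral of the kernel over [0, T], valid for theta > 0."""
    return kappa * m**beta * (c**-theta - (T + c) ** -theta) / theta


def log_likelihood(events, kappa, beta, c, theta):
    """Sum of log lambda(t_i) for i >= 2 minus the integral of lambda over [t_1, t_n]."""
    t_n = events[-1][0]
    log_term = 0.0
    for j in range(1, len(events)):
        tj = events[j][0]
        lam_j = sum(
            kernel(tj - ti, mi, kappa, beta, c, theta) for ti, mi in events[:j]
        )
        log_term += math.log(lam_j)
    # Each event contributes its kernel integrated from its own time up to t_n.
    compensator = sum(
        kernel_integral(t_n - ti, mi, kappa, beta, c, theta) for ti, mi in events
    )
    return log_term - compensator


# Hypothetical cascade and illustrative parameter values.
events = [(0.0, 1000), (2.0, 50), (5.0, 200)]
print(log_likelihood(events, kappa=0.8, beta=0.6, c=10.0, theta=0.8))
```

A numerical optimizer could call `log_likelihood` repeatedly from different starting points, which is in the spirit of the 20 runs per cascade mentioned above.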


Evaluation and Results

Calculate optimal parameters on synthetic data

[Figure: four panels comparing the parameter values found by H.EM and MLL: K (optimal: 0.025), Beta (optimal: 0.51), C (optimal: 0.001), Theta (optimal: 0.2)]

Performance Measures

[Figure: the candidate links $p_{51}$, $p_{52}$, $p_{53}$, $p_{54}$ of user $u_5$ placed on a probability axis from 0 to 1 and marked True or False; other labels: time, Friend Networks]

• ROC curve: Area Under Curve (AUC)

• Accuracy: the link with the highest probability is taken as an edge
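Both measures can be sketched in plain Python. The slides do not say how link scores are pooled when computing the AUC, so the version below simply takes one flat list of scores with True/False labels; that pooling, the function names, and the toy numbers are assumptions for illustration.

```python
def accuracy(p, true_parent):
    """Fraction of non-root events whose highest-probability link is the true parent.

    p[j][i] is the score of the link from event j to candidate parent i (i < j);
    true_parent[j] is the index of the real parent of event j.
    """
    hits = 0
    for j in range(1, len(p)):
        predicted = max(range(j), key=lambda i: p[j][i])
        hits += (predicted == true_parent[j])
    return hits / (len(p) - 1)


def auc(scores, labels):
    """Rank-based AUC: chance that a true link outscores a false one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((sp > sn) + 0.5 * (sp == sn) for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))


# Toy scores for the candidate parents of one retweet; only the last link is real.
scores = [0.1, 0.2, 0.1, 0.6]
labels = [False, False, False, True]
print(auc(scores, labels))

# Toy cascade of four events with known parents (the root has none).
p = [[], [1.0], [0.3, 0.7], [0.5, 0.2, 0.3]]
true_parent = [None, 0, 1, 2]
print(accuracy(p, true_parent))
```

The AUC rewards ranking the true link above the false ones at any threshold, while accuracy only checks the single top-ranked link per retweet.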


Evaluation and Results

Compare with Seven Methods on Real Data

| Baselines | Description | Kernel or link probability |
|---|---|---|
| H.MLLPL | Infer $p_{ji}$ after optimizing the log-likelihood (H.EM infers it during optimization); does not need training | Power-Law Kernel |
| H.MLLEXP | Same as H.MLLPL | Exponential Kernel |
| Exponential distribution (E) | Directly calculate probabilities of links without iterations; needs training | $p_{ji} = (\alpha - 1)\, e^{-\alpha (t_j - t_i)}$ |
| Power-law distribution (PL) | Same as E | $p_{ji} = (\alpha - 1)\, (t_j - t_i)^{-\alpha}$ |
| Rayleigh distribution (R) | Same as E | $p_{ji} = \alpha\, (t_j - t_i)\, e^{-0.5 \alpha (t_j - t_i)^2}$ |
| Social Exponential (SE) | Same as E | $p_{ji} = \frac{m_i}{\sum_{k=1}^{i} m_k}\, e^{-\alpha (t_j - t_i)}$ |
| NETINF | Select the edges that increase the log-likelihood most significantly; needs training | |

274 cascades: 254 for testing and 20 for training E, PL, R, SE (mean AUC) and NETINF (mean Accuracy)
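The four distribution-based baselines score a candidate link from the time gap alone (plus follower counts for SE), so they are easy to sketch. The Python below transcribes the formulas as they appear on the slide, including the $(\alpha - 1)$ factor and the running follower sum in SE; the function names, the example cascade, and the value of `alpha` are assumptions, and in the presentation $\alpha$ is fitted on training cascades rather than fixed by hand.

```python
import math


def exponential(dt, alpha):
    """E: (alpha - 1) * exp(-alpha * dt)."""
    return (alpha - 1) * math.exp(-alpha * dt)


def power_law(dt, alpha):
    """PL: (alpha - 1) * dt**-alpha (requires dt > 0)."""
    return (alpha - 1) * dt ** -alpha


def rayleigh(dt, alpha):
    """R: alpha * dt * exp(-0.5 * alpha * dt**2)."""
    return alpha * dt * math.exp(-0.5 * alpha * dt**2)


def link_scores(events, j, density, alpha):
    """Scores p_ji for every candidate parent i < j of event j (0-based)."""
    tj = events[j][0]
    return [density(tj - ti, alpha) for ti, _ in events[:j]]


def social_exponential_scores(events, j, alpha):
    """SE: m_i / (m_1 + ... + m_i) * exp(-alpha * (t_j - t_i)), read literally."""
    tj = events[j][0]
    scores, followers_so_far = [], 0
    for ti, mi in events[:j]:
        followers_so_far += mi
        scores.append(mi / followers_so_far * math.exp(-alpha * (tj - ti)))
    return scores


# Hypothetical cascade of (time, follower count) pairs, strictly increasing in time.
events = [(0.0, 1000), (2.0, 50), (5.0, 200)]
alpha = 1.5  # hypothetical value, not taken from the slides

print(link_scores(events, 2, exponential, alpha))
print(link_scores(events, 2, power_law, alpha))
print(link_scores(events, 2, rayleigh, alpha))
print(social_exponential_scores(events, 2, alpha))
```

Unlike H.EM, none of these scores is updated iteratively; once $\alpha$ is chosen, every link score follows in a single pass.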


Evaluation and Results

Compare with Seven Methods on Real Data

| Method | H.EM | SE | H.MLLPL | EXP | H.MLLEXP | PL | R | NETINF |
|---|---|---|---|---|---|---|---|---|
| Mean AUC | 0.832 | 0.872 | 0.83 | 0.726 | 0.726 | 0.714 | 0.728 | NA |

[Figure: Compare H.EM with baselines (AUC), for H.EM, SE, H.MLLPL, EXP, H.MLLEXP, PL and R]


Evaluation and Results

Compare with Seven Methods on Real Data

| Method | H.EM | SE | H.MLLPL | EXP | H.MLLEXP | PL | R | NETINF |
|---|---|---|---|---|---|---|---|---|
| Mean Accuracy | 0.506 | 0.556 | 0.468 | 0.185 | 0.187 | 0.186 | 0.567 | 0.249 |

[Figure: Compare H.EM with baselines (Accuracy), for H.EM, SE, H.MLLPL, EXP, H.MLLEXP, PL, R and NETINF]

1. Our method does not need training

2. Inferring $p_{ji}$ during optimization improves performance


Summary

• Modeling by Hawkes Point Processes with Power-law Kernel

• Branching structure of Hawkes used to retrieve the parenthood relation between retweets

• Inferring $p_{ji}$ during optimization is important

• Applied to retrieving the true retweet relations in Twitter cascades

The Way Ahead

• Experiments on more cascades with different themes

• Try more competitive triggering kernels

Thank You!


References

• Gomez Rodriguez, M., Leskovec, J., & Krause, A. (2010, July). Inferring networks of diffusion and influence. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 1019-1028). ACM.

• Mishra, S., Rizoiu, M. A., & Xie, L. (2016, October). Feature driven and point process approaches for popularity prediction. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management (pp. 1069-1078). ACM.