Discover Tipping Users For Cross Network Inﬂuencingjzhang2/files/2016_iri_paper.pdf · Facebook)...

Discover Tipping UsersFor Cross Network Influencing

Qianyi Zhan∗, Jiawei Zhang†, Philip S. Yu†, Sherry Emery† and Junyuan Xie∗∗ National Laboratory for Novel Software Technology, Nanjing University, Nanjing, China

† University of Illinois at Chicago, Chicago, IL, USAEmail: [email protected], [email protected],

[email protected], [email protected], [email protected]

Abstract—Traditional viral marketing problem aims at se-lecting a set of influential seed users to maximize the awarenessof products and ideas in one single social network. However, inreal scenarios, users’ profiles in the target social network (e.g.,Facebook) are usually confidential to the public, which blockthe conventional viral marketing strategies reaching the targetconsumers effectively. Instead, since users nowadays are usuallyinvolved in multiple social networks simultaneously, the viralmarketing can actually be performed in other public networks.These networks with public profile information are referred asthe source networks, from which information can diffuse toand activate users in the target network indirectly. Thus inthe cross-network information diffusion, besides the influentialseed users, those who act as bridges propagating informationbetween networks actually play a more important role andsome can trigger the tipping point in the target network, whoare named as the tipping users formally.

Motivated by this, in this paper, we studied the “Discov-ering Tipping Users for Cross Network Influencing” (TURN)problem across multiple aligned heterogeneous social networks.To depict the information diffusion process across alignedheterogeneous social networks, we propose a novel networkinformation diffusion model, “Cross Network Information Dif-fusion” (CONFORM). In CONFORM, various diffusion links inthe heterogeneous networks are extracted and fused by weightto calculate the users’ activation probabilities. To address theTURN problem, a new method called “Tipping Users DiscoveryAlgorithm (TUDOR)” is proposed to identify the tipping userswho bring about the largest influence gain, which is a newconcept first introduced in this paper. Extensive experimentsare conducted on real-world social network datasets, whichdemonstrate the effectiveness and efficiency of TUDOR.

I. INTRODUCTION

One of the great advantages about the online socialnetworks is that when an idea takes off, it can propel abrand to seemingly instant fame and fortune with quite a lowcost [17]. This is called viral marketing based on the wordof mouth communication among users in the networks [8].Traditional viral marketing problems and methods mainlyfocus on selecting the seed users for the products and ideasto be promoted within one single network only [9].

However in the real scenarios, it is difficult to applyviral marketing on some online social networks which createintimate and private communicating circles within the users’choice of close friends. For example, since 2012, Centersfor Disease Control and Prevention (CDC) has launched thenational tobacco education campaign ”Tips From Former

Smokers”1 to encourage smokers to quit. One of the aimsof the campaign is to enlarge their exposures and influencein online social networks such as Facebook2 and Twitter3.The campaign has achieved great success, however, theadvertising effect in Facebook is not as good as othernetworks. HMC4 from UIC, which is supported to evaluatethe advertising effects, explains the reason that Facebookenables users to choose their own privacy settings andchoose who can see specific parts of their profile. Thereforedue to the privacy and security policies of the network, adcompanies do not have access to users’ profiles and cannoteasily get word out to the audience whom they most wishto reach. Hence traditional single-network viral marketingstrategies which influence target users directly can no longerperform well.

Meanwhile, users nowadays are usually involved in multi-ple online social networks simultaneously, and those joiningin Facebook are also using other networks, such as Twitter,Foursquare and Instagram, at the same time. These socialnetworks, whose information is more public and users aremuch easier to reach, provide good channels for ad compa-nies to communicate with audience. Moreover, most of themhave provided the services that users can share the posts,images and videos to their Facebook homepage effortlessly.In such a roundabout way, information can be propagated tothe Facebook network that we target on from these publicsocial networks. In other words, viral marketing can actuallybe performed in these public social networks instead, i.e.,information from which can diffuse to and activate users inthe target network indirectly. To differentiate them from thetarget social network (e.g., Facebook), these social networkswith easy access are named as the source networks in thispaper.

However, in such a cross-network viral marketing setting,besides the influence sources (i.e., seed users) to be selectedin traditional viral marketing problems, the anchor userswho acts as the bridges to propagate information betweenthe source networks and the target network play a moresignificant role.

1http://www.cdc.gov/tobacco/campaign/tips/index.html2https://www.facebook.com/cdctobaccofree/3https://twitter.com/CDCTobaccoFree4https://www.healthmediacollaboratory.org/

In traditional sociology studies, the concept tipping pointdenotes a time point when a large number of group membersrapidly and dramatically change their behavior by widelyadopting a previously rare practice [7]. Actually, such mys-terious behavior changes of individuals are happening inonline social networks everyday, e.g., the sudden emergenceof fashion trends [23], the quick swing of public opinion[14], as well as the transformation of an unknown personto be the center of attention [3]. If we let the rare socialaction be the individuals’ purchase of the products in thetarget network, the tipping point leading to the massiveadoption of the target products will be of great intereststo the companies carrying out the promotion activities. Inaddition, when triggering the tipping point in the targetnetwork, those critical anchor users (in the source networks)who occupy the crucial positions in diffusing informationacross the networks are formally named as the tipping users.

In this paper, we aim at identifying the tipping users fromthe source networks, and the problem is formally referred as”discovering Tipping Users for cRoss Network influencing”(TURN) problem.

The TURN problem studied in this paper is a novelproblem, and it is totally different from the conventionalinformation diffusion ( [1], [6], [13]) and viral marketingproblems ( [2], [4], [24]). Instead of finding the influencesources (i.e., the seed users), the TURN problem aims atidentifying the tipping users from the source network. There-fore, the models proposed for traditional viral marketingproblems cannot be applied to address the TURN problem.

Despite its importance and novelty, the TURN problemstudied in this paper is very challenging to solve due to thefollowing reasons:• Problem Definition: To the best of our knowledge,

we are the first to propose the TURN problem. Aclear definition of the tipping user concept as well asthe TURN problem is needed, which is still an openproblem to this context so far.

• Information Diffusion Model across HeterogeneousNetworks: The networks studied in this paper are allheterogeneous information networks, involving varioustypes of nodes and complex intra-network connections.Moreover, due to the shared common users, thesenetworks are partially aligned via the anchor links [29]as well. How to effectively model the information diffu-sion process across the aligned heterogeneous networksis a great challenge.

• NP-hard: The TURN problem based on our cross-network diffusion model to be introduced in Section 3is proved to be NP-hard, thus there exist no polynomial-time algorithms for TURN if P 6= NP.

To address these challenges, we propose a novel method“Tipping Users Discovery algORithm” (TUDOR) in thepaper. This method is based on a novel information diffusionmodel, named “CrOss Network inFORMation diffusion”(CONFORM), proposed in this paper. CONFORM effectivelyaggregates both intra-network and inter-network diffusion

links , which are weighted differently according to their im-portance in the aggregation. Based on CONFORM, the TURNproblem is proved to be NP-hard. However TUDOR appliesa novel step-wise greedy algorithm to identify the tippingusers, which can resolve the TURN problem efficiently andis proven to achieve a (1−1/e)-approximation of the optimalresult.

The rest of this paper is organized as follows. In Section 2,we give the concept definitions and problem formulation. InSection 3 and 4, the CONFORM model is introduced in detail.The TUDOR method is proposed in Section 5. We evaluatethe performance of TUDOR with extensive experiments inSection 6. Finally, we give the related works in Section 7and conclude the paper in Section 8.

II. PROBLEM FORMULATION

A. Basic Terms

In this paper, we will follow the definitions of concepts“heterogeneous networks”, “anchor user” “anchor link”,etc. proposed in [29].

DEFINITION 1. Partially aligned heterogeneous networks:A pair of partially aligned heterogeneous networks is H =(G(s), G(t),L), where G(s) = (V(s), E(s)) is the sourcenetwork and G(t) = (V(t), E(t)) is the target network,and their user sets are represented as U (s) ⊂ V(s) andU (t) ⊂ V(t) respectively. L denotes the set of anchor linkswhich connect anchor user sets A(s) ⊆ U (s) of G(s) andA(t) ⊆ U (t) of G(t).

In the cross-network information diffusion process, mes-sages propagate in discrete steps beginning with a group ofseed users S ⊆ U (s) in the source network, to the targetnetwork G(t) via both the heterogeneous intra and internetwork links. Users in networks have two statuses: activeand inactive, where active status represents the user hasadopted the certain product. The notations used in this paperare listed in Table I.

Table I: Notation

Notation Definition

G(s), G(t) the heterogeneous source or target networkA(s), A(t) set of anchor users in the source or target network

L set of anchor links connected the source and the target networkH partially aligned heterogeneous networksS set of seed users which all belong to the source network, i.e.,

S ∈ U(s)

Z set of tipping users which are all anchor users in the sourcenetwork, i.e., Z ∈ A(s)

δ(S|H) influence function which maps the seed user set S to the numberof activated users in the target network

G(s)Z the reduced source networkHZ the reduced partially aligned heterogeneous networks

B. Tipping Users and Influence Gain

As introduced in the previous section, in the cross-network information diffusion process, the existence ofcertain users occupying the crucial positions in the sourcenetwork can trigger the tipping point [7] of the massiveadoption of the product in the target network G(t). These

Gs Gt

A

B

E C

D

1

2

3

4

5

6

(a) Original Network Structure

Gs Gt

A

E C

D

1

2

3

4

5

6

(b) Reduced Network Structure

Figure 1: Reduced network structure and cross networkinfluence gain

users who are a group of anchor users and cause a substantialincrease of the number of activated users in the targetnetwork G(t) are called the tipping users in this paper.

To identify tipping users, a new metric, called cross net-work influence gain, is proposed to evaluate the contributionof Z and it is based on the reduced network and the influencefunction.

DEFINITION 2. Reduced Network: Given a heterogeneousnetwork G = (V, E) and an group of anchor usersZ ⊆A \ S , the reduced network is GZ = (V − Z, E − EZ),where EZ = (v, u)|v ∈ Z ∨ u ∈ Z is the set of edgesconnecting with users in Z .

DEFINITION 3. Reduced Partially Aligned Networks:Given a pair of partially aligned heterogeneous networksH = (G(s), G(t),L) and a user set Z ⊆ U (s)\S , the reducedpartially aligned network is HZ = (G

(s)Z , G(t),L − LZ),

where G(s)Z is the reduced network, and LZ = (v, u)|v ∈

Z is the set of anchor links connecting with users in Z .The reduced networks GZ and HZ actually denote the

residual network structures after removing users in Z andall the attached edges from networks G and H respectively.

DEFINITION 4. Influence Function: Given a pair ofpartially aligned heterogeneous social networks H =(G(s), G(t),L) and a seed user set S ⊆ U (s), let δ(S|H)be the influence function which maps the seed set S in G(s)

to the number of activated users in the target network G(t)

by the same seed user set S based on H.Therefore the cross network influence gain is defined as:

DEFINITION 5. Cross Network Influence Gain Given apair of partially aligned heterogeneous social networks H =(G(s), G(t),L), a pre-determined seed user set S ⊆ U (s),and a group of anchor user Z ∈ A(s) \ S, δ(S|HZ) isthe number of activated users in G(t) based on the reducednetwork HZ . The cross network influence gain of user set Zis defined as the difference between δ(S|H) and δ(S|HZ),denoted as I(Z) = δ(S|H)− δ(S|HZ).

Figure 1 explains the concepts of reduced network struc-ture and cross network influence gain. In both Fig. 1(a) andFig. 1(b), the left rectangle represents the source networkG(s) and the right one denotes the target network G(t).Each not yet activated user is represented by a blue icon,and the icon becomes red when the user is activated. Letyellow icons be seed users and green icons denote the

candidates of tipping users. Thus, in our example, basedon seed set S = A, we aim to discover one tippingusers (k = 1) from two anchor users of G(s) and user Bis current candidate, i.e., Z = B. The Fig. 1(a) showsthe original network structure H = (G(s), G(t),L), andunder this circumstance, four users 1, 2, 3, 6 in G(t) canbe activated, i.e., δ(S|H) = 4. Consider, for instance, if weremove user B and all edges connecting it (including theanchor link), the obtained reduced network structure can berepresented as HZ = (G

(s)Z , G(t),L−LZ), as shown in Fig.

1(b). Due to the removal of B, information can no longer bepropagated from the seed user A to users in G(t). In otherwords, no user in G(t) can be activated and δ(S|HZ) = 0.Therefore, the cross network influence gain introduced byZ is I(Z) = I(B) = δ(S|H)− δ(S|HZ) = 4.

C. Problem Formulation

Based on the above concepts, in this paper, tipping usersare formally defined as a group of anchor users who has thelargest cross network influence gain, which means withouttipping users, the number of activated users in the targetnetwork will decrease significantly.

DEFINITION 6. Tipping Users: Given a pair of partiallyaligned heterogeneous networks H = (G(s), G(t),L) andthe seed user set S ⊆ U (s), tipping users are a group ofanchor users who can maximize the cross-network influencegain in the source network, i.e., Z = argmaxZI(Z), whereZ ⊆ A(s) \ S .

Therefore based on above definitions, we formulate theproblem of “discovering Tipping Users for cRoss Networkinfluencing” (TURN) as follows:

DEFINITION 7. The TURN Problem: Given the partiallyaligned heterogeneous networks H = (G(s), G(t),L), theseed user set S ⊆ U (s) and the per-specified number oftipping users k, the TURN problem aims at discovering ktipping users in the source network, Z∗ ⊆ A(s) \ S, whocan maximize the cross-network influence gain I(Z), i.e.,.

Z∗ = argmaxZI(Z) = argmaxZδ(S|H)−δ(S|HZ) (1)

III. INFORMATION DIFFUSION MODEL

In the CONFORM model, information propagation processis traced by meta paths, which are based on the networkschema.

DEFINITION 8. Network Schema: Given a heterogeneousnetwork G = (V, E), its network schema is defined as S =(O,R), where O and R denote the type sets of entities andlinks in G respectively.

DEFINITION 9. Meta Path: A meta path Ω, based on thegiven network schema S = (O,R), is denoted in the formof Ω = o1

r1−→ o2r2−→ · · · rk−1−−−→ ok where entity o1, ok =

User ∈ O, oi ∈ O − User, i = 2, · · · , k − 1 and ri ∈R, i ∈ 1, 2, · · · , k − 1.

Table II: Atomic Meta Paths of Foursquare and Twitter

Link Type Network Physical Meaning Meta Path

Intra-network

Foursquare

follow Userfollow−1

−−−−−−−→ User

co-location check-ins User checkin−−−−−−→ Location checkin−1−−−−−−−−→ User

co-location via shared lists Usercreate/like−−−−−−−−→ List contain−−−−−−→ Location

contain−1−−−−−−−−→ List

create/like−1

−−−−−−−−−−→ User

Twitterfollow User

follow−1

−−−−−−−→ User

co-location check-ins User checkin−−−−−−→ Location checkin−1−−−−−−−−→ User

contact via tweet User write−−−−→ Tweet retweet−−−−−−→ Tweet write−1−−−−−−→ User

Inter-network anchor User of Twitter Anchor−−−−−→ User of Foursquare

anchor User of Foursquare Anchor−−−−−→ User of Twitter

In a meta path, the types of start and end entity areboth User and types of other entities are not User. Theinstances of meta paths are called diffusion links, which startand end with two specific users. It is obvious that networkschemes, meta paths and diffusion links of diverse networksare quite different. Therefore information propagation insideeach network and across two networks should be consideredrespectively, thus we classify meta paths into two categories:intra-network meta path and inter-network meta path.

DEFINITION 10. Intra-network and Inter-network MetaPath: Given partially aligned heterogeneous networksH = (G(s), G(t),L) with their network schemes S(s) =(O(s),R(s)) and S(t) = (O(t),R(t)). To a meta pathΩ(i) = o1

r1−→ o2r2−→ · · · rk−1−−−→ ok, if the start and end

users belong to the same network, i.e., o1, ok = User ∈ O(i),the path is called an intra-network meta path. While whenthe start and end users of a meta path belong to differentnetworks, i.e., o1 = User ∈ O(i), ok = User ∈ O(j), thepath is called an inter-network meta path.

Besides following links, a heterogeneous online socialnetwork usually has various kinds of intra-network metapaths. Let m(i) be the number of intra-network meta paths,and Ω

(i)k denotes the kth path of the network G(i). The

set of links connecting two users u and v in the samenetwork G(i) by a specific meta path Ω

(i)k is represented

as Q(i)k (u v). We define ω

(i)k (u, v) as the amount of

information propagating from u to v through meta path Ω(i)k ,

which is calculated as:

ω(i)k (u, v) =

2|Q(i)k (u v)|

|Q(i)k (u ·)|+ |Q(i)

k (· v)|,

where Q(i)k (u ·), Q(i)

k (· v) are the sets of diffusionlinks with Ω

(i)k which start from u and end at v respectively.

We now aggregate information received from the abovechannels. For a specific user v ∈ U (i), the sum of infor-mation from all intra-network diffusion links denoted asW

(i),(intra)v , is defined as follows.

W (i),(intra)v =

m(i)∑k=1

α(i)k ×

∑u∈Γ

(i)k (v)

ω(i)k (u, v) (2)

where Γ(i)k (v) is the set of neighbors connected to v with

intra-network links of meta path type Ω(i)k , and α

(i)k is the

weight of Ω(i)k in the linear aggregation.

As to inter-network meta paths, similar to intra-networkones, let Ω

(j,i)k be the kth inter-network meta path from the

network G(j) to G(i), and the number of inter-network metapaths is m(j,i). Therefore the amount of information receivedby v ∈ U (i) from u ∈ U (j) through inter-network diffusionlinks with meta path type Ω

(j,i)k is

ω(j,i)k (u, v) =

2|Q(j,i)k (u v)|

|Q(j,i)k (u ·)|+ |Q(j,i)

k (· v)|

Therefore, the total amount of information propagated by vvia inter-network diffusion links from the other networks is:

W (i),(inter)v =

m(j,i)∑k=1

β(j,i)k ×

∑u∈Γ

(j,i)k (v)

ω(j,i)k (u, v) (3)

where Γ(j,i)k (v) is the set of neighbors in G(j) connected

to v ∈ U (i) with inter-network diffusion links of meta pathtype Ω

(j,i)k , and β(j,i)

k is the weight of Ω(j,i)k .

We take Foursquare and Twitter as examples of partiallyaligned heterogeneous networks. Inside both networks, userscan follow others and check-in at locations. Meanwhile, (1)in Foursquare, users can create/like lists containing a set oflocations; (2) while in Twitter, users can repost other users’tweets. The intra-network meta paths considered in thispaper as well as their physical meanings are listed in Table(II). These two networks are connected by anchor links. Theintra-network and inter-network meta paths considered inthis paper as well as their physical meanings are listed inTable (II).

Finally we can define the activation function pv = f(·) :R→ 0, 1 and logistic function f(x) = ex

1+ex is used in thispaper. This function maps the received information amountto activation probability of user v of G(i), and is calculatedas:

p(i)v =

e(W (i),(intra)v +W (i),(inter)

v )

1 + e(W(i),(intra)v +W

(i),(inter)v )

(4)

IV. DIFFUSION LINKS WEIGHTING

In the last section, we extract different diffusion linksto describe the information propagation process in partiallyaligned heterogeneous networks and we set weights tothese diffusion links when aggregate information. In thissection, the optimal weight values are learned from the useractivation log data.

In the CONFORM model, each weight measures the sig-nificance of correspond link in the diffusion process, thusdifferent links will be ranked according to their importance,and top links will be selected to increase individuals’ acti-vation probabilities.

For the network G(t), we construct a column vectorP(t) ∈ R|U(t)|×1, where P(t)[v] records user v’s activationprobability, calculated by (4). At the same time ground-truth extracted from the log data is represented by a binaryvector T(t) ∈ R|U(t)|×1, which T(t)[v] = 1 means userv is activated finally, otherwise T(t)[v] = 0. The learnedweights aim to narrow the gap between the prediction andthe ground-truth, i.e., ‖P(t) − T(t)‖2F , where ‖ · ‖F is theFrobenius norm of the matrix.

The other concern is that the behavior of an anchor usershould be consistent. Though we treat an anchor user as twoindependent users in respective networks, it is intuitive thatone person shows the consistent interest to the same topicin real life. Therefore if anchor user v(s) in G(s) is active,there is a high probability for v in other networks, suchas G(t). Thus the values of anchor user P(s)[v] and P(t)[v]should be close. The restriction is the sum of all weights ofdiffusion links which connect to a specific network G(t) orG(s) should equals 1 and all weights should be no-negative.

Therefore, the optimal weights of information deliveredin different diffusion links can be obtained by solving thefollowing objective function:

min ‖P(s) − T(s)‖2F + ‖P(t) − T(t)‖2F+‖(A(t,s))TP(t) − (A(t,s))TA(t,s)P(s)‖2F

s.t.m(s)∑k=1

α(s)k +

m(t,s)∑k=1

β(t,s)k = 1,

m(t)∑k=1

α(t)k +

m(s,t)∑k=1

β(s,t)k = 1,

α(s)1 , · · · , α(s)

m(s) , α(t)1 , · · · , α(t)

m(t) ≥ 0,

β(t,s)1 , · · · , β(t,s)

m(t,s) , β(s,t)1 , · · · , β(s,t)

m(s,t) ≥ 0.

(5)

where anchor matrix A(t,s) ∈ R|U(t)|×|U(s)| represents theanchor users of two networks, which can be constructedbased on the anchor link set L and

∑i A(t,s)[v][i] = 1

denotes v is an anchor user. The matrix (A(t,s))TP(t) −(A(t,s))TA(t,s)P(s) extracts the difference of activation prob-ability of anchor users in P(s) and P(t). The constraints makesure users’ activation probabilities in each network are allbetween the range [0, 1] and all weights are no-negative.

By the triangle inequality and positive homogeneity, everynorm is a convex function. Since constraints have inequali-ties, we need to extend the Lagrange Multiplier Method tothe Karush-Kuhn-Tucker (KKT) conditions to find local andalso globe minim of the objective function:

L(α, β, λ) = ‖P(s) − T(s)‖2F + ‖P(t) − T(t)‖2F+ ‖(A(t,s))TP(2) − (A(t,s))TA(t,s)P(s)‖2F

+ λ1(

m(s)∑k=1

α(s)k +

m(t,s)∑k=1

β(t,s)k − 1)

+ λ2(

m(t)∑k=1

α(t)k +

m(s,t)∑k=1

β(s,t)k − 1)

+∑

µiαi +∑

νiβi

(6)

Setting the gradient ∇α,β,λ,µ,νL(α, β, λ, µ, ν) = 0 canyield a group of equations about variables α, β, λ, µ and ν.They are implicit functions and can be solved with the sourcetoolkit, e.g., Scipy nonlinear solver. Given the parameterswith initial values, multiple solutions can be obtained byresolving the objective function.

V. PROPOSED ALGORITHM

In this section, the TUDOR method is proposed to addressthe TURN problem. Before introducing TUDOR, we firstanalyze the TURN problem.

A. Problem Analysis

THEOREM 1. The TURN Problem based on the CONFORMmodel is NP-hard.

Proof. In the TURN problem, since when the seed user setS is pre-determined, its final activate user set, denoted as Dis fixed. Consider an instance of the NP-hard Vertex Coverproblem defined by a graph G = (D, E) and integer k. Itaims to identify a set of k vertices C such that each edge ofthe graph is incident to at least one vertex of the D. If there isa vertex cover C of size k in G(s), we can block all activatedusers, i.e.D, by setting the tipping user set Z = C. Thisshows the Vertex Cover problem can be viewed as a specialcase of the TURN problem, therefore, the TURN problem isNP-hard.

Therefore there is no known polynomial algorithm whichcan give the optimal solution to the TURN problem and thebrute force method tends to grow very quickly as the size ofthe network increases. However the influence gain functionis proofed to be monotone and submodular motivated by [9].

THEOREM 2. The influence gain function I(Z) is mono-tone.

Proof: To a specific u, the amount of information gotfrom any intra-network diffusion link Ω

(i)k ≥ 0 and the

its corresponding weight α(i)k ≥ 0. Therefore the sum

of information received from all intra-network diffusionlinks W (i),(intra)

v ≥ 0 and the function (2) is monotone.

Similarly, W (j,i),(inter)v ≥ 0 and the function (3) is also

monotone. Since the logistic function is monotone, theactivation probability function (4) is monotone too, whichdeducting more links will descend the value of p(i)

v . So ifwe block more users, the amount of activated users will notincrease, in other words, when Z1 ⊂ Z2, their influencegain I(Z1) ≤ I(Z2). Therefore the influence gain functionis monotone.

COROLLARY 1. The influence gain function I(Z) is non-negative.

Proof: If Z = ∅, I(Z) = 0, which means no activeuser will be blocked when there is no tipping users. As weproved above, the influence gain function I(Z) is monotone,therefore given any tipping user set Z , I(Z) ≥ 0.

THEOREM 3. The influence gain function I(Z) is submod-ular.

Proof: Given a pre-determined seed user set S in G(s),its final active user set D are fixed. Our aim is to discover asubset of anchor users who can block most of active usersin the target network. According to the Claim 2.6 in [9], theblock set obtained by running the CONFORM model, whichis similar to the LT model, is equivalent to reachability vialive-edge paths defined in [9]. Moreover, an active user u inG(t) can be blocked if and only if there is a live-edge pathfrom some node in Z to u based on Claim 2.3 in [9].

Let R(u) denote the set of all nodes that can be blockedfrom u via the live-edge path. Z1,Z2 are two candidatesets of tipping user set and Z1 ⊆ Z2. I(Z1 +u)− I(Z1)denotes the number of nodes in R(u) which are not in theunion set ∪v∈Z1R(v). Since Z1 ⊆ Z2, ∪v∈Z2R(v) containsmore or equal nodes than ∪v∈Z1R(v), thus I(Z1 + u)−I(Z1) ≥ I(Z2 +u)−I(Z2), where I(Z2 +u)−I(Z2)denotes the number of nodes in R(u) which are not in theunion set ∪v∈Z2R(v). Therefore the influence gain functionis submodular.

B. Proposed Algorithm

Algorithm 1 Tipping Users Discovery AlgorithmInput: H = (G(s), G(t),L), seed set S, size of tipping users kOutput: tipping user set Z1: initialize Z = ∅, TIPPING POINTS index i = 0;2: get the network schema S(s), S(t) and user set U(s), U(t);3: extract both intra-network and inter-network meta paths of G(s), G(t)

4: calculate ω(i)k (u, v) and ω(j,i)

k (u, v) for each user v5: learn the weights of meta paths α and β6: while i < k do7: for v ∈ A(s) do8: add v into Z9: estimate Z’s inter-network influence gain I(Z)

10: remove v from Z11: end for12: select z = arg maxvI(Z)13: Z = Z ∪ z and i = i+ 1.14: end while

For a non-negative, monotone, submodular function I(Z),let Z be a node set of size k obtained by choosing one nodeeach round, which maximizes the marginal increase in the

0

5

10

15

1 2 3 4 5

INFL

UEN

CE G

AIN

TIPPING USER SIZE

TUDOR Random Page Rank Degree Brute Force

(a) Influence Gain

-4

-2

0

2

1 2 3 4 5

TIM

E

TIPPING USER SIZE


(b) Running Time

Figure 2: Performance on small dataset when the targetnetwork is Foursquare.

function value. Let Z∗ be the set with largest value of all k-node sets. Thus Z provides a (1− 1/e)-approximation, i.e.,I(Z) ≥ (1 − 1/e) · I(Z∗) [16]. Alg. 1 shows the TUDORalgorithm’s pseudo-code.

VI. EXPERIMENT

In this section, we present empirical validation of ourmethods and results from experiments on real networkdata sets. We first sample a small dataset to compare theperformance of TUDOR with the optimal solution. Then theexperiments are conducted on original real-world alignedsocial networks.

Table III: Dataset Description

# dataset # small dataset

anchor link 698 22

Twitter

user 1000 30follow link 14138 83co-location link 13678 48contact link 3513 27

Foursquare

user 1000 30follow link 13255 73co-list link 2806 6co-tip link 1024 2

A. Small Dataset Experiment

Dataset Description: We sample a small dataset fromFoursquare and Twitter, both of which are famous hetero-geneous online social networks. Users of Foursquare canlink accounts with their Twitter accounts, which are shownin their profile. Based on these anchor links, we crawledinformation of users and links in two networks. In the smalldataset, we sample total 60 users, 30 of each network, andamong them 22 are anchor users. In Twitter, there are 83

0

5

10

1 2 3 4 5

INFL

UEN

CE G

AIN

TIPPING USER SIZE


(a) Influence Gain

-4

-2

0

2

1 2 3 4 5

TIM

E

TIPPING USER SIZE


(b) Seed User Analysis

Figure 3: Performance on small dataset when the targetnetwork is Twitter.

social links, 48 co-location links and 27 contact links. Usersfrom Foursquare generate 73 social links, 6 co-list links and2 co-tips links. The statistical information also listed in smalldataset column in Table III, and the physical meaning ofeach link is listed in Table II.

Comparison Methods: We implemented our proposedmethod and compared with other baselines. Since we arethe first to address the TURN problem, there are barely otheralgorithms can be directly used. Therefore the baselines inthis paper are classical algorithms including:• TUDOR: TUDOR is the method proposed in this paper,

which can discover tipping users based on the givenseed users.

• Random: This method selects k anchor users randomly.• Page Rank: Anchor users in G(t) are sorted according

to their page rank scores and tipping users are the topk among them.

• Out-Degree (Degree): Rank anchor users in G(t) basedon their out-degrees and choose top k as tipping users.

• Brute Force: To achieve the global optimal result, welist all possible combinations of k anchor users andselect one which has the largest influence gain.

Setups: The weights of diffusion links are first learnedbased on Section 3.2 and they are used to calculate useractivation probability by logistic function. Since all variablesof (4) are in [0, 1] due to the normalization, its result, theactivation probability, is in [0.5, 0.75]. Since each user hasa threshold towards the message in the CONFORM model,value of user’s threshold samples from a uniform distributionwith range [0.5, 0.75].

There are Foursquare and Twitter two social networks andeach time we regard one as the target network and the other

as the source network. The size of seed user set is 5 and thenumber of tipping users changes in range of 1, 2, 3, 4, 5.To evaluate the performance of all these methods, we willcalculate the influence gain of selected tipping users and wealso record the running time of algorithms to compare theirefficiency.

Results: When the target network is Foursquare, the ex-periment results of different comparison methods are givenin Fig. 2. Fig. 2(a) shows the influence gain with changingtipping user sizes and Fig. 2(b) shows running time of allalgorithms. Due to the huge difference between Brute Forceand other methods, the horizontal axis of 2(b) is scaledlogarithmically.

Based on the results shown in Fig. 2(a), the influencegains increase as more tipping users are selected for allmethods. The performance of TUDOR is exactly the samewith Brute Force, which achieves the most influence gain.When the number of tipping users is 5, their influence gainsare both 11, which is more than 2 times the result of Degree.Numbers of Page rank and Random algorithm are only 1.

The running time is shown in Fig. 2(b). Random andDegree choose tipping users in a heuristic way and consumethe least time, while Page rank needs update users’ scoresiteratively, so it costs a little more time than the formertwo. Same as expected, the time cost by Brute Force growsexponentially, when tipping user size is 5, it cost 27.08seconds. Actually, when the network size extends to 50users, it is difficult for Brute Force to find out 5 tipping users.TUDOR requires simulating the diffusion processes, which istime-consuming, but TUDOR just use 0.05 second to find thesame optimal solution as Brute Force, which demonstratesboth the efficiency and effectiveness of TUDOR.

When Twitter as the target network, the outcome ofinfluence gain is available in Fig. 3(a) and Fig. 3(b) showsthe time cost, which also uses the logarithmic y-axis.

The results shown in Fig. 3(a) illustrate the influencegains rise with the increase of tipping user size, whichis in agreement with the monotone property of influencegain function. Unlike the big difference in Foursquare, allmethods except Random achieve the optimal result, becauseTwitter network is denser than Foursquare, and choosingmore users has high probability to get the same influencegain. But among them, only TUDOR can find the optimalsolution even when the tipping user size is only 2.

The running time comparison is similar to that ofFoursquare. Page rank selects tipping users with simplemetric and cost more than Random and Degree methods.The time of Brute Force rises dramatically and reach 33.57seconds to find 5 tipping users. TUDOR still balance the timeand performance well, which cost 0.067 second to discoverthe optimal 5 tipping users.

The seed users selected by different methods are quitedistinct from each other. In Table IV, we show the inter-sections of tipping user set selected by different methodsin two networks, where the tipping user size is 5. We canobserve that the tipping user set selected by TUDOR always

Table IV: Intersection of seed users selected by different comparison methods in the viral marketing problem.

TUDOR Random Page Rank Degree Brute Force Brute Force Degree Page Rank Random TUDOR5 1 2 4 5 TUDOR 5 1 0 1 5

5 0 1 1 Random 0 4 2 55 3 2 Page Rank 1 3 5

5 4 Degree 0 5Foursquare 5 Brute Force 5 Twitter

0

100

200

300

400

5 10 15 20 25 30 35 40 45 50

INFL

UEN

CE G

AIN

TIPPING USER SIZE

TUDOR Random Page Rank Degree

(a) Foursquare

0

100

200

300

400

500

5 10 15 20 25 30 35 40 45 50

INFL

UEN

CE G

AIN

TIPPING USER SIZE


(b) Twitter

Figure 4: Performance of Influence Gain on two targetnetworks respectively.

coincides with the optimal set generated by Brute Force. Theinteresting thing is that TUDOR shares most tipping userswith Degree in Foursquare, but their influence gains differgreatly. While tipping user set of TUDOR is distinct fromthose selected by Degree in Twitter, however they achievethe same influence gain. This is because the small networksize, seed user size and tipping users size magnify the effectof each seed user and tipping use, and make the experimentresult sensitive.

B. Original Dataset Experiment

Dataset Description: This dataset is also sampled fromFoursquare and Twitter, which the numbers of users inFoursquare and Twitter are both 1000, among which 689users have accounts in both networks. There exist 4138social links, 13678 co-location links and 3513 contact linksin Twitter network and we crawled 13255 social links, 2806co-list links and 1024 co-tip links which connect users fromFoursquare. Column of large dataset in Table (III) shows thedetail statistics information.

Comparison Methods: Obviously, Brute Force methodcannot be applied in this dataset due to the large user size.Therefore we compare TUDOR with other baselines of thesmall dataset besides Brute Force methods.

Setups: Most of experiment setting are the same with thatof the small dataset. For example users’ threshold values

are also set within [0.5, 0.75] and we also treat Foursquareand Twitter as target network respectively. But the seed setcontains 50 users from source network and the amount oftipping users changes in range of [5, 50] with step 5. With thelarge dataset, we just compare their influence gain generatedby selected tipping users.

Results: The results of influence gain in Foursquare areshown in Fig. 4(a). Except Random, influence gains increasewhen more tipping users are selected. However for TUDOR,after a size point, 20 in Fig. 4(a), the growing speed ofinfluence gain slows down, which is because the saturationof tipping users. Comparing with other methods, the result ofTUDOR has obvious advantages on others. When the tippinguser size is 5, influence gain of TUDOR is 250, howeverDegree’s is only 9. While when the size is enlarged to 50,the result of TUDOR is 367, which is still 53% more than240 of Degree.

When the experiment conducted on Twitter, outcomes arepresented in Fig. 4(b), which shows the advantage of TUDORclearly. Like in Foursquare, with more tipping users, theinfluence gain of TUDOR rises, but after the size extendingto 20, the growth rate becomes smaller, which is also causedby saturation of tipping users. When comparing with othermethods, TUDOR enjoys a big lead with all tipping usersize. When the tipping user size is 50, the influence gain ofTUDOR is almost 2 times of others’ and the superiority ismuch more obvious with other sizes.

The overlap of tipping users selected by different methodsis given in Table (V). From the results, we observe that thetipping users selected by TUDOR is quite distinct from othersin both networks. For example, in Foursquare dataset, theintersection between tipping user sets achieved by TUDORand Page rank is 6 among 50 and so is the case in Twitter.

C. Parameter Analysis

To analyze how parameter the seed user size affects theperformance of all methods, we fix the tipping user size andconduct experiment on two datasets. In the small dataset, thetipping user size is 5, and size of seed users changes in therange of 1, 2, 3, 4, 5. The results are shown in Fig. 5(a)and Fig. 5(b). In the other dataset, the tipping user size isfixed at 50, and the seed users size is in the range of [5, 50]with step 5. Fig. 6(a) and Fig. 6(b) present the outcomesof influence gain. Here we mainly discuss about the largedataset.

Based on the result shown in Fig. 6(a) and Fig. 6(b), weobserve that with more seed users, influence gain shows auptrend at first, but the value drops afterwards. In Twitter(Fig. 6(b)), for example, the influence gain of TUDOR rises

Table V: Intersection of seed users selected by different comparison methods in the viral marketing problem.

TUDOR Random Page Rank Degree Degree Page Rank Random TUDOR50 2 6 3 TUDOR 4 5 3 50

50 2 2 Random 6 2 5050 12 Page Rank 11 50

Foursquare 50 Degree 50 Twitter

0

5

10

15

20

1 2 3 4 5

INFL

UEN

CE G

AIN

SEED USER SIZE


(a) Foursquare

0

5

10

15

1 2 3 4 5

INFL

UEN

CE G

AIN

SEED USER SIZE


(b) Twitter

Figure 5: Parameter Analysis on the small dataset of twotarget networks respectively.

from 84 to 474 when there are 35 seed users, and decrease to453 at last. The reason is explained as following: influencegain is the number of activated users who are blocked inthe target network when tipping users are removed from thesource network. When the size of seed users in the sourcenetwork increases, there are more original activated usersin the target network. Removing a large number of tippingusers declines the activated users substantially, and the influ-ence gain rises fast. But when seed user size is large enough,the saturation of seed user causes the number of originalactivated users keeping the same. However removing tippingusers breaks the saturation and let seed users who used tobe redundant active others again. Therefore the number ofblocked activated users decreases and influence gain drops.

We analyze the results of experiment on small datasetbriefly, which are presented in Fig. 5(a) and Fig. 5(b). Aswe explained before, the number of network size, seed usersize and tipping user size are small, which make the resultis very sensitive. Therefore the result is different from thatof large dataset but the drop of influence gain still can beexplained by the same reason.

D. Discussion

According to the result of extensive experiment above, welist some advises of using TUDOR for viral marketing:

0

100

200

300

400

5 10 15 20 25 30 35 40 45 50

INFL

UEN

CE G

AIN

SEED USER SIZE


(a) Foursquare

0

100

200

300

400

500

5 10 15 20 25 30 35 40 45 50IN

FLU

ENCE

GAI

NSEED USER SIZE


(b) Twitter

Figure 6: Parameter Analysis on two target networks respec-tively.

1) TUDOR works better when the target network is moreintense. The performance in Twitter is better thanFoursquare demonstrates this.

2) “More is better” does not apply to tipping user size.The influence gain grows slowly when having enoughtipping users.

3) Seed users have great effect. If seed users can activemany users, small group of tipping users still can helpdiffusion, but when seed users cannot do good job,selecting more tipping users is a better choice.

VII. RELATED WORK

Generally, our work is related to the following topics.Heterogeneous Network Analysis: A heterogeneous in-

formation network is composed of multiple types of objects.[13], [18], [25] investigated the principles and methodolo-gies of mining heterogeneous information networks in recentyears. Sun et al. studied similarity search that is definedamong the same type of objects in heterogeneous networksin [19]. Liu et al. [12] propose a generative graphical modelwhich utilizes the heterogeneous link information to minetopic-level direct influence.

Multiple Social Networks Alignment: Many worksstudy data mining with fusing and utilizing multiple net-works [24], [26], [28]. Zhang et al identify connections be-tween the shared users’ accounts in multiple social networks

in [28] and predict the formation of social links among usersin the target network with other external social networksin [27]. [24] studies the influence maximization problemin multiple partially aligned heterogeneous OSNs. Howeverour paper is the first to discover the tipping users in crossnetwork information propagation process based on alignednetworks.

Influence Maximization Problem: As an application,influence maximization problem attracts many researchers’interests. Since the seminal paper [9], massive algorithmsaim to propose more effective ways to find the seed users forviral marketing ( [2], [4]). A huge number of papers extendthe original setting to more complicated environment, suchas in networks with friendship and foe relationship [11], incontinuous time diffusion networks [20] and location-awarenetworks [10]. However the goal of influence maximizationproblem is selecting a group of initial users to spreadthe information, which is totally different from the TURNproblem.

Tipping Point: The concept tipping point denotes a pointin time when a large number of group members rapidlyand dramatically change their behavior by widely adoptinga previously rare practice in sociology [7]. The phrase hasextended beyond its original meaning and been applied toany process in which, beyond a certain point, the rate of theprocess increases dramatically [22]. It has been applied inmany fields, such as economics [15], human ecology [21],and epidemiology [5]. But we are the first to extend tippingpoint into the data mining area and define the new concepttipping users based on it.

VIII. CONCLUSION

We study the TURN problem in this paper, which providea novel way to conduct viral marketing. TURN aims atfinding tipping users in the source network to influence usersin the target network based on the given seed users. Wedesign the CONFORM model to describe the cross networkinformation diffusion in a pair of heterogeneous networks.The TUDOR method is proposed to address the TURNproblem, which can achieve a (1−1/e)-approximation of theoptimal results. Extensive experiments on real-world socialnetwork datasets demonstrate the superior performance ofTUDOR in addressing the TURN problem.

ACKNOWLEDGEMENT

This work is supported in part by NSF through grantsIII-1526499. It is also supported by grants from the Centersfor Disease Control and Prevention (CDC), the NationalCancer Institute (NCI), and the FDA Center for TobaccoProducts (CTP) under award numbers U01CA154254-05S1,U01CA154254, and P50CA179546. The content is solelythe responsibility of the authors and does not necessarilyrepresent the official views of CDC, NCI or FDA.

REFERENCES

[1] E. Bakshy, I. Rosenn, C. Marlow, and L. Adamic. The role of socialnetworks in information diffusion. In WWW, 2012.

[2] C. Borgs, M. Brautbar, J. Chayes, and B. Lucier. Maximizing socialinfluence in nearly optimal time. In SODA, 2014.

[3] M. Cha, H. Haddadi, F. Benevenuto, and P. K. Gummadi. Measuringuser influence in twitter: The million follower fallacy. ICWSM, 10(10-17):30, 2010.

[4] W. Chen, Y. Yuan, and L. Zhang. Scalable influence maximization insocial networks under the linear threshold model. In ICDM, 2010.

[5] K. S. Coyne, C. C. Sexton, C. L. Thompson, I. Milsom, D. Irwin,Z. S. Kopp, C. R. Chapple, S. Kaplan, A. Tubaro, L. P. Aiyer, et al.The prevalence of lower urinary tract symptoms (luts) in the usa, theuk and sweden: results from the epidemiology of luts (epiluts) study.BJU international, 104(3):352–360, 2009.

[6] N. Du, L. Song, H. Woo, and H. Zha. Uncover topic-sensitiveinformation diffusion networks. In AISTATS, 2013.

[7] M. Gladwell. The tipping point: How little things can make a bigdifference. Little, Brown, 2006.

[8] J. Goldenberg, B. Libai, and E. Muller. Talk of the network: Acomplex systems look at the underlying process of word-of-mouth.Marketing letters, 12(3):211–223, 2001.

[9] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread ofinfluence through a social network. In KDD, 2003.

[10] G. Li, S. Chen, J. Feng, K.-l. Tan, and W.-s. Li. Efficient location-aware influence maximization. In SIGMOD, 2014.

[11] Y. Li, W. Chen, Y. Wang, and Z.-L. Zhang. Influence diffusiondynamics and influence maximization in social networks with friendand foe relationships. In WSDM, 2013.

[12] L. Liu, J. Tang, J. Han, M. Jiang, and S. Yang. Mining topic-levelinfluence in heterogeneous networks. In CIKM, 2010.

[13] T. Lou and J. Tang. Mining structural hole spanners throughinformation diffusion in social networks. In WWW, 2013.

[14] M. McCombs. Setting the agenda: The mass media and publicopinion. John Wiley & Sons, 2013.

[15] J. Murray and D. King. Climate policy: Oil’s tipping point has passed.Nature, 481(7382):433–435, 2012.

[16] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis ofapproximations for maximizing submodular set functions—i. Mathe-matical Programming, 14(1):265–294, 1978.

[17] D. M. Scott. The new rules of marketing and PR: how to use socialmedia, blogs, news releases, online video, and viral marketing toreach buyers directly. John Wiley & Sons, 2009.

[18] Y. Sun and J. Han. Mining heterogeneous information networks:principles and methodologies. Synthesis Lectures on Data Miningand Knowledge Discovery, 3(2):1–159, 2012.

[19] Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu. Pathsim: Meta path-based top-k similarity search in heterogeneous information networks.VLDB, 2011.

[20] Y. Tang, Y. Shi, and X. Xiao. Influence maximization in near-lineartime: A martingale approach. In SIGMOD, 2015.

[21] A. J. Veraart, E. J. Faassen, V. Dakos, E. H. van Nes, M. Lurling,and M. Scheffer. Recovery rates reflect distance to a tipping point ina living system. Nature, 481(7381):357–359, 2012.

[22] Wikipedia. Tipping point (sociology), 2004.[23] J. Wolny and C. Mueller. Analysis of fashion consumers’ motives

to engage in electronic word-of-mouth communication through socialmedia platforms. Journal of Marketing Management, 29(5-6):562–583, 2013.

[24] Q. Zhan, J. Zhang, S. Wang, S. Y. Philip, and J. Xie. Influencemaximization across partially aligned heterogenous social networks.In Advances in Knowledge Discovery and Data Mining, pages 58–69.Springer, 2015.

[25] J. Zhang and P. Y. Mcd. Mutual clustering across multiple heteroge-neous networks. In IEEE BigData Congress, 2015.

[26] J. Zhang, W. Shao, S. Wang, X. Kong, and S. Y. Philip. Pna: Partialnetwork alignment with generic stable matching. In IRI, 2015.

[27] J. Zhang and P. S. Yu. Integrated anchor and social link predictionsacross partially aligned social networks. In IJCAI, 2015.

[28] J. Zhang and P. S. Yu. Multiple anonymized social networksalignment. In ICDM, 2015.

[29] J. Zhang, P. S. Yu, and Z.-H. Zhou. Meta-path based multi-networkcollective link prediction. In KDD, 2014.

Discover Tipping Users For Cross Network Inﬂuencingjzhang2/files/2016_iri_paper.pdf · Facebook)...

Documents

Transcript of Discover Tipping Users For Cross Network Inﬂuencingjzhang2/files/2016_iri_paper.pdf · Facebook)...