[IEEE 2013 IEEE 25th International Conference on Tools with Artificial Intelligence (ICTAI) -...



Page 1: [IEEE 2013 IEEE 25th International Conference on Tools with Artificial Intelligence (ICTAI) - Herndon, VA, USA (2013.11.4-2013.11.6)] 2013 IEEE 25th International Conference on Tools

CPP-SNS: A Solution to Influence Maximization Problem Under Cost Control

Qianyi ZHAN, Hongchao YANG, Chongjun WANG and Junyuan XIE

Department of Computer Science and Technology
Nanjing University, Nanjing, China

Email: [email protected], chjwang, [email protected]

Abstract: As more and more people join social networks, viral marketing on online social networks has become a new trend in advertising. Motivated by this, plenty of research focuses on how to maximize information propagation, which is called the influence maximization problem. Traditional work has made significant progress on this topic. However, since all ad companies have a marketing budget, research on the influence maximization problem should take cost control into account.

Under the condition of cost control, we model each user's cost of helping spread information as a feature of each node in the network. Then we modify several of the most widely studied algorithms to suit the new model. In this paper, a new algorithm called CPP-SNS is proposed, which selects seeds according to the cost performance of nodes. Further improvements, based on a strategy of partial node loading and the submodular property of the spread function, make CPP-SNS more effective in practical scenarios. Extensive experiments show that this method performs well in different social networks. Based on the results of our research, we also provide some advice for practical marketing.

Keywords: social network; viral marketing; influence maximization; cost control

I. INTRODUCTION

Nowadays, online social networks (OSNs) play an increasingly significant role as a medium for information spread. This trend has given birth to viral marketing, which promotes a brand or product by creating buzz or word-of-mouth effects. How to develop a successful viral marketing campaign has attracted the attention of sociologists, psychologists, mathematicians and even epidemiologists. While computer scientists are trying to use mathematical theories and computing devices to understand the diffusion process in social networks, much research in this field is related to the influence maximization problem, which is the fundamental problem of viral marketing.

In the seminal paper [1], Kempe et al. defined influence maximization as an optimization problem: a social network is modeled as a directed graph G(V,E), where the nodes V represent users and the weighted edges E reflect influence between users. The goal is to find a seed set S of k nodes such that, under a given propagation model, the information propagation range of S is the largest. [1] also proposed two basic stochastic influence propagation models, the independent cascade (IC) model and the linear threshold (LT) model, both extracted from mathematical sociology.

Each node is either active or inactive in both models. In the IC model, an active node spreads influence to its inactive neighbors independently, according to the weight of the corresponding edge. The IC model stresses individual influence among friends. In the LT model, each node has a threshold, and a node is not activated until the sum of the incoming edge weights from its active neighbors is no less than its threshold. The LT model emphasizes threshold behavior in information spreading.
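The IC process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the adjacency-list format, the function names `simulate_ic` and `estimate_spread`, and the single uniform edge probability `p` are hypothetical choices, whereas the model itself allows a separate weight per edge.

```python
import random

def simulate_ic(graph, seeds, p, rng):
    """One run of the independent cascade model; returns the set of active nodes.

    graph: dict mapping node -> list of out-neighbors (assumed format)
    p:     uniform activation probability per edge (an assumption)
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for w in graph.get(u, []):
                # A newly active node gets one chance per inactive neighbor.
                if w not in active and rng.random() < p:
                    active.add(w)
                    next_frontier.append(w)
        frontier = next_frontier
    return active

def estimate_spread(graph, seeds, p, runs=1000, seed=0):
    """Monte Carlo estimate of the spread: mean number of active nodes."""
    rng = random.Random(seed)
    total = sum(len(simulate_ic(graph, seeds, p, rng)) for _ in range(runs))
    return total / runs
```

Averaging over many runs is what makes simulation-based greedy methods expensive: every candidate seed set requires `runs` cascades.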

The influence spread function σ(S) denotes the number of active nodes after propagation starting from seed set S. In both the IC model and the LT model, this function has two nice properties. The function σ(·) is monotone if σ(A) ≤ σ(B) when A ⊆ B. Moreover, it is submodular if σ(A ∪ {v}) − σ(A) ≥ σ(B ∪ {v}) − σ(B) for all A ⊆ B and v ∉ A. Based on these properties, Kempe et al. proved that the optimization problems for both models are NP-hard.

A. Related Work

Most current solutions to the influence maximization problem are greedy algorithms. The simple greedy algorithm in [1] repeatedly chooses the node with the maximum marginal gain. It has been proved that this intuitive algorithm achieves an approximation ratio of (1 − 1/e). However, the simple greedy algorithm is rather slow and not scalable because it uses Monte Carlo (MC) simulations for influence spread estimation. Therefore much work has been devoted to improving the simple greedy method ([2], [3]).

The CELF (Cost-Effective Lazy Forward selection) algorithm [4], proposed by Leskovec et al., is one of them. It requires less running time by reducing the number of spread estimations. Compared with the simple greedy algorithm, CELF is found to be 700 times faster; however, it is still not fast and scalable enough in many situations.

Chen et al. [5] presented the NewGreedyIC algorithm specifically for the IC model. The main idea is to first remove the edges that are not needed in propagation and then run simple greedy on the residual graph. NewGreedyIC improves performance, but if the method used CELF on the residual graph, its efficiency would improve significantly and it would naturally be faster than CELF.

A heuristic strategy is used in the DegreeDiscount algorithm [5], which chooses seeds based on two principles. The first is to prefer nodes with large degree, and the second is to avoid nodes that can be activated by the already chosen seeds. The latter principle is further developed as a "looking forward" strategy in [5].

2013 IEEE 25th International Conference on Tools with Artificial Intelligence
1082-3409/13 $31.00 © 2013 IEEE
DOI 10.1109/ICTAI.2013.129

[6] designed the CG (Community-Based Greedy) algorithm on the basis of community detection. The basic procedure is to divide the social network into several communities and to select as seeds the nodes that maximize influence within their own communities. Compared with other methods, CG combines two key problems in social network analysis: information propagation and community detection.

Since Domingos and Richardson ([7], [8]) first studied influence maximization, much progress has been made on this topic. However, there is an implicit but unrealistic assumption in most current work: a node incurs no cost to become a seed. Some work takes cost into consideration, but the value is the same and fixed for all nodes. In real situations, an ad company usually has a budget for viral marketing. On the other hand, a user spreads information at the cost of time and network traffic, and an indiscretion by a celebrity will damage his or her public image and lose fans. Therefore some users are not willing to be seeds even if they have been selected. These reasons show the necessity of studying influence maximization models and algorithms under the condition of cost control.

Some work has added the element of cost to the research. For example, [9] introduced the notion of mechanism design into the influence maximization problem. Considering that each node is self-interested, an incentive-compatible mechanism is designed to make each node report its true cost. However, [9] made a clear mistake in the proof of budget control, and the mistake also affected its further proofs.

B. Contributions and Roadmap

Considering the lack of research on cost control in the influence maximization problem, this paper adds the factor of a node's cost to the model. The classical algorithms are modified to take a node's cost into consideration; for example, a node's cost performance is used to select seeds greedily. We also propose new methods closely related to cost. Since the time complexity of the simple greedy algorithm is too high, the new algorithm, named CPP-SNS, reduces the running time significantly without narrowing the propagation range. The improvement is based on the submodular property of the influence function and on partial loading techniques. Through extensive experiments, we not only show the effectiveness of CPP-SNS, but also summarize some advice for practical viral marketing.

Some classical algorithms are adapted to the cost control setting in Section II. In Section III the CPP-SNS algorithm is proposed to improve the cost performance greedy algorithm. Experiments demonstrating the performance of the different methods are presented in Section IV, and Section V concludes the paper.

II. SIMPLE ALGORITHMS BASED ON COST CONTROL

To solve the cost control problem, we add the node's cost to the classical model. As a result, the traditional seed selection algorithms, including random selection and greedy selection, have to be altered. The involvement of cost also gives rise to cost based selection algorithms. In this section we focus on these three kinds of selection algorithms. Before that, we first define the influence maximization problem under the cost control condition.

A. Problem Definition

We first define the cost of a node: each node $v_i$ keeps a value $c_i$ that represents its cost of information spread. The cost $c_i$ is generated by the cost function $cost(\vec{a}_i)$, where $\vec{a}_i$ is the feature vector of node $v_i$.

Then the influence maximization problem under the cost control condition is modeled as follows: in the given social network $G(V,E)$, with budget $b$ ($b > 0$) and node cost function $c_i = cost(\vec{a}_i)$, propagation starts from a set $S \subseteq V$ of nodes as initial seeds, and the number of nodes activated at the end is $\sigma(S)$. The aim is to find a seed set $S^*$ such that $\sigma(S^*) = \max_{S \subseteq V} \sigma(S)$ subject to $\sum_{v_i \in S} c_i \le b$. When $cost(\vec{a}_i) = 1$, the cost control problem reduces to the traditional influence maximization problem, so the traditional problem is a special case of the cost control problem.

Since the original problem is NP-hard, it is easy to see:

Theorem 1: The influence maximization problem under the cost control condition is NP-hard.

After this overall description of the problem, we now modify some classical algorithms to solve it.

B. Random Selection Algorithm

Although we add cost to the model, the random selection algorithm stays the same, since it disregards both a node's influence and its cost. Here simple random and repeated random are briefly described.

1) Simple random algorithm: As its name suggests, the simple random algorithm chooses nodes from the social network at random, making sure the sum of the chosen nodes' costs is no more than the budget. If s is one of the seeds chosen by this method, it is obvious that the probability of s ∈ S∗ (the optimal seed set) is extremely low, especially when the number of nodes is large. As a result, the outcome of this easy process is not desirable, even though it requires the least runtime. The simple random algorithm cannot be used to solve real problems; however, it provides a baseline for comparison, which makes it worth discussing.

2) Repeated random algorithm: The simple random algorithm suffers from its low probability of selecting effective seeds. To address this disadvantage, if we repeat the process enough times, the probability increases and the result is much better. This is the main idea of the repeated random algorithm.

The following theorem shows the probability of obtaining the optimal solution using this method.

Theorem 2: The repeated random algorithm achieves the optimal solution with probability at least 1 − 1/e if the number of repetitions N is large enough.

Due to space limitations, the details of the proof are not shown in this paper. It is easy to see that N should be $1/p_s$ to reach that probability, where $p_s$ ($0 < p_s \le 1$) denotes the probability of obtaining the optimal solution using the simple random algorithm. In practice, however, N is a huge number since $p_s$ is extremely low. Therefore the repeated random algorithm cannot be applied, especially in a large network with a large number of users.
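The proof is omitted in the paper. One standard argument that yields the stated bound, assuming the N repetitions are independent and each finds the optimal set with probability $p_s$, can be sketched as follows:

```latex
\begin{align*}
\Pr[\text{at least one of } N \text{ runs is optimal}] &= 1 - (1 - p_s)^{N}, \\
(1 - p_s)^{1/p_s} &\le \left(e^{-p_s}\right)^{1/p_s} = e^{-1}
  \quad \text{since } 1 - x \le e^{-x}, \\
\Rightarrow\quad 1 - (1 - p_s)^{N} &\ge 1 - \frac{1}{e}
  \quad \text{for } N \ge 1/p_s .
\end{align*}
```

Because $p_s$ shrinks rapidly as the network grows, the required $N \ge 1/p_s$ grows just as quickly, which is why the method is impractical on large networks.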

C. Cost Performance Preferred Algorithm

Another classical kind of selection algorithm is greedy selection. Because of its solid mathematical foundation and good performance in optimization problems, the greedy algorithm is well studied in many related topics. The simple greedy selection algorithm chooses the node with the largest marginal gain, which can be formulated as follows:

$$v = \arg\max_{v_i \in V,\ cost(\vec{a}_i) < b^*} \sigma(S_{cur} \cup \{v_i\}) \tag{1}$$

where $V$ is the set of nodes, $S_{cur}$ denotes the current selection result and $b^*$ is the residual budget.

With the addition of the node's cost, the simple greedy algorithm is no longer appropriate. Although there is no hard evidence proving a positive correlation between a node's influence and its cost, it is reasonable to assume that a node's cost also reflects its influence and that large influence generally implies a high cost of spreading. As a result, it is more rational to include cost in the measure alongside each node's influence. The combined factor is the ratio of a node's marginal influence gain to its cost, which we call cost performance. The seed is the node with the highest cost performance, and the selection is formulated as follows:

$$v = \arg\max_{v_i \in V,\ cost(\vec{a}_i) < b^*} \frac{\sigma(S_{cur} \cup \{v_i\}) - \sigma(S_{cur})}{cost(\vec{a}_i)} \tag{2}$$

where $V$ is the set of nodes, $S_{cur}$ denotes the current selection result and $b^*$ is the residual budget.

Algorithm 1 is the pseudo-code of the Cost Performance Preferred (CPP) algorithm.

Like the simple greedy algorithm, the cost performance preferred algorithm calls for a large number of propagation simulations, which causes its poor time efficiency. In addition, with the same budget, CPP chooses more seeds than the simple greedy algorithm since CPP prefers nodes with low cost. This increases the number of simulations as well.

Algorithm 1 Greedy Selection Algorithm Based on Cost Performance
Input: G(V,E), b, $c_i = cost(\vec{a}_i)$, p
Output: S
S = empty;
budget = b;
while budget > 0 do
    costPer = 0; //best cost performance found in this round
    seed = null;
    for i = 1 to |V| do
        if vi ∉ S and vi.cost ≤ budget and (inf(S + vi) − inf(S))/vi.cost > costPer then
            costPer = (inf(S + vi) − inf(S))/vi.cost;
            seed = vi; //choose the node with highest cost performance
        end if
    end for
    if seed = null then break; //no affordable node is left
    S = S + {seed};
    budget = budget − seed.cost;
end while

In summary, the CPP algorithm needs more time than the simple greedy algorithm, which motivates us to improve it in the next section.
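The selection rule of the CPP algorithm can be sketched in Python as below. This is an illustration, not the authors' Java implementation: the function name `cpp_greedy` and the `spread` callable (any estimator of σ(S), for example a Monte Carlo simulator) are assumptions introduced here, and a node is treated as affordable when its cost does not exceed the residual budget.

```python
def cpp_greedy(nodes, cost, budget, spread):
    """Greedy seed selection by cost performance (marginal gain / cost).

    nodes:  list of node ids
    cost:   dict node -> positive cost
    spread: callable taking a set of seeds and returning the estimated
            spread (hypothetical interface)
    """
    seeds = set()
    remaining = budget
    while remaining > 0:
        base = spread(seeds)
        best, best_perf = None, 0.0
        for v in nodes:
            if v in seeds or cost[v] > remaining:
                continue  # already chosen or unaffordable
            perf = (spread(seeds | {v}) - base) / cost[v]
            if perf > best_perf:
                best, best_perf = v, perf
        if best is None:
            break  # no affordable node improves the spread
        seeds.add(best)
        remaining -= cost[best]
    return seeds
```

Every round re-evaluates all remaining nodes, so the number of calls to `spread` grows with both the network size and the number of seeds.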

D. Cost Based Selection Algorithm

Cost itself is an indicator of a node's influence and can therefore serve as the criterion for seed choice. We list two selection methods: the high cost first (HCF) algorithm and the low cost first (LCF) algorithm.

1) High Cost First Algorithm (HCF): Nodes with large influence usually have a high cost of spreading information in viral marketing. The main reason is that a strong opinion from a famous person may cause an uproar in public opinion, and followers will also be disappointed by celebrities who post obvious advertisements. Therefore, to maintain a public image, a user with many followers is more cautious about his or her views and expression. Based on this observation, we design the high cost first algorithm (HCF), which gives preference to nodes with high cost.

In this algorithm, the first step is to sort the nodes in descending order of cost; any sorting algorithm can be used. For the node with the current highest cost, if its cost fits within the residual budget, the node is added to the seed set. The algorithm repeats this step until the budget runs out.

The complexity of HCF depends on the complexity of the sorting algorithm, which is at least O(n log n). Therefore the complexity of this algorithm is O(n log n), where n denotes the number of nodes in the social network.

2) Low Cost First Algorithm (LCF): HCF gives priority to nodes with high cost, because it is believed that the large influence of such nodes helps spread information most widely. The disadvantage is that the budget is used up quickly owing to the high cost of the seeds, and an insufficient number of seeds also works against influence maximization. In contrast to HCF, the low cost first algorithm (LCF) gives priority to nodes with low cost, since it raises influence by enlarging the seed set.

The only difference between LCF and HCF is the sorting order of the nodes: instead of descending order, LCF sorts nodes in ascending order of cost. As with HCF, the complexity of the sorting algorithm decides the complexity of LCF, which is also O(n log n), where n denotes the number of nodes in the social network.
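Since HCF and LCF differ only in sort direction, both can be sketched with one Python function. The name `cost_first` and the `high_first` flag are hypothetical; the scan adds a node whenever its cost fits the residual budget, following the description above.

```python
def cost_first(cost, budget, high_first=True):
    """HCF (high_first=True) or LCF (high_first=False) seed selection.

    cost: dict node -> positive cost. Nodes are scanned in sorted order
    and added whenever their cost fits the residual budget.
    """
    order = sorted(cost, key=cost.get, reverse=high_first)  # O(n log n)
    seeds, remaining = [], budget
    for v in order:
        if cost[v] <= remaining:
            seeds.append(v)
            remaining -= cost[v]
    return seeds
```

No propagation simulation is involved, which is why these two methods are so much cheaper than the greedy variants.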

Further discussion about HCF and LCF is given in Section IV.

III. IMPROVEMENT ON GREEDY ALGORITHM

As mentioned above, although classical algorithms are modified to adapt to the cost control condition, they cannot be used directly in real applications because of their high complexity. Among them, the random selection and cost based selection algorithms are intuitive, so we focus on improving the CPP algorithm in this section.

A. Improvement Based on Submodular Property

Before the analysis, some notation is listed as follows. For the algorithm, the seed set is $S_i$ after adding the $i$th seed. For a node $v$, $c_v$ denotes its cost and $g_v^i$ denotes its marginal gain in the selection process of the $i$th seed (the $i$th round for short).

We state the submodular property again. The influence function is submodular if σ(A ∪ {v}) − σ(A) ≥ σ(B ∪ {v}) − σ(B) for all A ⊆ B and v ∉ A. Roughly speaking, the larger the seed set is, the smaller the marginal gain the same node brings. From this property, we directly get the following result: for a node $v$, $g_v^i \ge g_v^j$ if $i < j$, since $i < j$ implies $S_i \subseteq S_j$.

The cost performance of node $v$ in the $i$th round is defined as $p_v^i = g_v^i / c_v$. For different nodes $v$ and $u$ that are not included in the seed set before the $j$th round, the traditional algorithm has to simulate propagation for each node and then calculate and compare the nodes' cost performance. However, if we know $p_v^i > p_u^i$ in the $i$th ($i < j$) round and $p_v^j > p_u^i$ in the $j$th round, then there is no need to calculate $p_u^j$. The deduction is as follows. Since $g_u^i \ge g_u^j$ ($i < j$), we have

$$p_u^i = \frac{g_u^i}{c_u} \ge \frac{g_u^j}{c_u} = p_u^j$$

So under the condition $p_v^j > p_u^i$, we have $p_v^j > p_u^j$.

According to the analysis above, it is not necessary to run the simulation for every non-seed node in every selection round. The detailed procedure is as follows. When choosing the first seed, we need to simulate the propagation process of each node in the network and calculate its cost performance $p$. Then all nodes are sorted by $p$ in descending order, and the first node is chosen as a seed. In each following round, for example the $i$th round, a simulation is needed only for the first node $v$ in the current array. If $p_v^i = p_v^{i-1}$, node $v$ is the $i$th seed. Otherwise, update node $v$'s cost performance to $p_v^i$ and re-insert it into the array in descending order. After that, take the current first node and repeat the same loop.

To choose the first seed, this method executes |V| information propagation simulations. After that, in the ideal situation, only the nodes that become seeds need simulations. Therefore the total number of simulations is |V| + |S| − 1, which is significantly less than the |V| × |S| of the original greedy algorithm.
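The lazy re-evaluation described above can be sketched in Python with a priority queue in place of the sorted array. This is a CELF-style illustration under assumptions: the function name `lazy_cpp`, the `spread` callable and the evaluation counter are hypothetical, and node ids must be mutually comparable so that ties in the heap can be broken.

```python
import heapq

def lazy_cpp(nodes, cost, budget, spread):
    """Cost-performance greedy with lazy re-evaluation.

    Relies on submodularity: a stored cost performance is an upper bound
    on the current one, so only the top entry needs a new simulation.
    Returns (seeds, number of marginal-gain evaluations).
    """
    seeds, remaining, evals = set(), budget, 0
    base = spread(seeds)
    # Max-heap via negation: (-cost performance, node, round of evaluation).
    heap = []
    for v in nodes:
        heap.append((-(spread({v}) - base) / cost[v], v, 0))
        evals += 1
    heapq.heapify(heap)
    while heap and remaining > 0:
        neg_perf, v, rnd = heapq.heappop(heap)
        if cost[v] > remaining:
            continue  # the budget only shrinks, so v can be dropped
        if rnd == len(seeds):
            # The value is current for this seed set: v is the next seed.
            if neg_perf >= 0:
                break  # no node brings a positive gain any more
            seeds.add(v)
            remaining -= cost[v]
            base = spread(seeds)
        else:
            # Stale value: re-simulate and re-insert.
            perf = (spread(seeds | {v}) - base) / cost[v]
            evals += 1
            heapq.heappush(heap, (-perf, v, len(seeds)))
    return seeds, evals
```

In the ideal case each seed after the first costs exactly one re-evaluation, which reproduces the |V| + |S| − 1 count given above.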

B. Improvement on Partial Loading Strategy

The above method reduces the number of simulations to |V| + |S| − 1 in the best case. According to (3), when |V|/|S| is large enough, the number of simulations decreases by a factor of |S|. However, this is not enough because it happens only in the ideal condition. Further improvement is required.

$$\lim_{|V| \to \infty} \frac{|S| \cdot |V|}{|V| + |S| - 1} = \lim_{|V| \to \infty} \frac{|S|}{1 + \frac{|S| - 1}{|V|}} = |S| \tag{3}$$

It is easy to see that most of the simulations go to the selection of the first seed, because it is the basis for the following choices. Hence we make a compromise between simulating propagation for all nodes and obtaining each node's cost performance. The method is to load, at the beginning, a subset of nodes with a high probability of being seeds and to use cost performance to choose seeds among them. The key question of this method is which nodes should be loaded first and how to add other nodes during the selection.

A new criterion is introduced to decide which nodes are loaded first. If node $v_i$'s activation probability is $p_i$, the expected number of active nodes at the end is $N = \sum_{i=1}^{|V|} p_i$ [5]. Assume only node $v$ has the information at first, which means the propagation starts with one node. Let $l_i$ denote the length of the shortest path between node $v_i$ and node $v$, and let $p^{l_i}$ denote the probability that node $v_i$ is activated along this shortest path, so $p^j$ is the probability of a node being activated along a shortest path of length $j$. It is obvious that $p_i \ge p^{l_i}$. Let $L = \max_{i=1}^{|V|} l_i$ and let $r_j$ be the number of nodes with $l_i = j$; then we have:

$$N = \sum_{i=1}^{|V|} p_i \ge \sum_{i=1}^{|V|} p^{l_i} = \sum_{j=1}^{L} r_j p^j \tag{4}$$

We use $\sum_{j=1}^{L} r_j p^j$ to indicate node $v$'s influence. This value can be calculated by counting node $v$'s neighbors, its neighbors' neighbors, and so on. Because of the "small world effect" in social networks, the shortest path between two nodes is not too long, so the calculation can be completed quickly. The approximate value of cost performance is obtained from the approximate value of influence.
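The level-by-level count can be sketched as a breadth-first search in Python. The adjacency-list format and the names `approx_influence` and `approx_cost_performance` are assumptions made for illustration; `p` is taken to be a single uniform propagation probability, as the power $p^j$ suggests.

```python
from collections import deque

def approx_influence(graph, v, p):
    """Return the sum over j of r_j * p**j for source node v.

    graph: dict node -> list of out-neighbors (assumed format)
    p:     uniform propagation probability (assumed)
    r_j is the number of nodes at shortest-path distance j from v.
    """
    dist = {v: 0}
    queue = deque([v])
    score = 0.0
    while queue:
        u = queue.popleft()
        for w in graph.get(u, []):
            if w not in dist:
                dist[w] = dist[u] + 1
                score += p ** dist[w]  # one more node at this distance
                queue.append(w)
    return score

def approx_cost_performance(graph, cost, p):
    """Approximate cost performance of every node: influence / cost."""
    return {v: approx_influence(graph, v, p) / cost[v] for v in graph}
```

No simulation is run here, so the ranking is cheap to compute but only approximates the true cost performance.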

After ranking the nodes by the approximate value of cost performance, the top m nodes are loaded first. The value of m is a free parameter. When m = |V|, this method is equivalent to the greedy algorithm, and when m = 1, the criterion of this method is the approximate value instead of the actual value of cost performance.

Now that the nodes $v_1, \dots, v_{i+m}$ have been loaded, the remaining question is how to add other nodes to this array. Our approach is to load another node into the array each time a seed is selected from it. Which node should be loaded next? A natural choice is the node ranked next according to the loading criterion, $v_{i+m+1}$. However, since the seed set is no longer empty, if the cost of node $v_{i+m+1}$ exceeds the residual budget, $v_{i+m+1}$ has no chance of being a seed and there is no need to load it. So a more practical and quicker way is to load the highest-ranked node whose cost fits within the residual budget.

C. CPP-SNS Algorithm

By combining the two improvements given above, we propose the CPP-SNS algorithm (Cost Performance Preferred Algorithm Based on Strategy of Nodes Loading and Submodular Property). Algorithm 2 is the pseudo-code:

Algorithm 2 CPP-SNS Algorithm
Input: G(V,E), b, $c_i = cost(\vec{a}_i)$, p, m
Output: S
1: S = empty;
2: budget = b;
3: for i = 0 to |V| do
4:     get cost performance by computing $\sum_{j=1}^{L} r_j p^j$;
5: end for
6: rank nodes by approximate cost performance → cnodes;
7: cnodes[0 ∼ m − 1] → lnodes;
8: for i = 0 to m do
9:     get lnodes[i].gain/lnodes[i].cost;
10: end for
11: rank lnodes by cost performance;
12: pos = m − 1;
13: while budget > 0 do
14:     if lnodes[0].cost ≤ budget then
15:         S = S + lnodes[0];
16:         budget = budget − lnodes[0].cost;
17:     end if
18:     remove lnodes[0] from lnodes;
19:     get the position of next node → pos;
20:     get cnodes[pos].gain/cnodes[pos].cost;
21:     cnodes[pos] → lnodes;
22: end while

The node's approximate cost performance is computed in the 4th line of the CPP-SNS algorithm. To calculate the value of $\sum_{j=1}^{L} r_j p^j$ for one node, the method just needs to count its neighbor nodes level by level, which is O(|V|) in time complexity. Thus the time complexity of the calculation for all nodes is O(|V|²). Any sorting algorithm can be used to rank the nodes in the 6th line. Considering the huge number of users in a social network, we choose quick sort. The first node is removed in the 18th line after scanning, and the following nodes are then moved forward one step. The 19th line chooses the next node to load according to its approximate cost performance. The new node is added to the current loading array after its cost performance is updated in the 20th and 21st lines. Here binary search can be used to find the insertion position, or the comparison of cost performance between the new node and the nodes in the array can start from the end. This reduces the number of comparisons because the later a node is added, the smaller its cost performance should be.
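One possible reading of how partial loading and lazy re-evaluation fit together is sketched below in Python. It is an interpretation, not the authors' implementation: Algorithm 2 leaves several steps open (for instance how line 19 picks the next position), so the heap in place of a sorted array, the names `cpp_sns`, `load` and `load_next`, and the `spread` callable are all assumptions. Node ids must be mutually comparable for tie-breaking.

```python
import heapq

def cpp_sns(approx, cost, budget, spread, m):
    """Sketch of CPP-SNS: partial loading plus lazy re-evaluation.

    approx: dict node -> approximate cost performance (loading criterion)
    cost:   dict node -> positive cost
    spread: callable on a set of seeds returning the estimated spread
    m:      number of nodes loaded before the first selection
    """
    ranked = sorted(approx, key=approx.get, reverse=True)  # cnodes
    seeds, remaining, pos = set(), budget, 0
    loaded = []  # lnodes as a max-heap: (-cost performance, node, round)

    def load(v):
        perf = (spread(seeds | {v}) - spread(seeds)) / cost[v]
        heapq.heappush(loaded, (-perf, v, len(seeds)))

    def load_next():
        nonlocal pos
        # Skip nodes whose cost no longer fits the residual budget.
        while pos < len(ranked) and cost[ranked[pos]] > remaining:
            pos += 1
        if pos < len(ranked):
            load(ranked[pos])
            pos += 1

    for _ in range(m):
        load_next()
    while loaded and remaining > 0:
        neg_perf, v, rnd = heapq.heappop(loaded)
        if cost[v] <= remaining and rnd != len(seeds):
            load(v)  # stale value: re-simulate and re-insert
            continue
        if cost[v] <= remaining and neg_perf < 0:
            seeds.add(v)
            remaining -= cost[v]
        load_next()  # one node leaves the array, another enters
    return seeds
```

With `m` equal to the number of nodes, every node is simulated up front and the sketch behaves like the lazy greedy method; smaller `m` trades accuracy of the first choices for fewer simulations.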

The above is the complete description of the CPP-SNS algorithm. To demonstrate its effectiveness, thorough experiments are conducted in the next section.

IV. EXPERIMENT AND ANALYSIS

After introducing the new algorithm, we are now interested in understanding its behavior in practice and comparing its performance with the other methods mentioned above. Through the experiments, we found that our algorithm achieves better performance in both propagation range and running time.

A. Experiment Setup

The experimental setup is described below, including the algorithms used for comparison, the cost function and the network data.

1) Comparison Algorithms: Besides the CPP-SNS algorithm, other methods, namely the Random Algorithm, Repeated Random (ReRan for short), the Greedy Algorithm, the CPP Algorithm, the HCF Algorithm and the LCF Algorithm, are also tested for comparison. Another algorithm called SNS (Algorithm Based on Strategy of Nodes Loading and Submodular Property) is worth mentioning. It is the combination of the simple greedy algorithm and the improvement methods (both A and B) of the previous section. Comparing the results of SNS and CPP-SNS reveals the necessity of considering cost performance.

2) The Cost Function: The key factor in the budget control problem is the cost function $c_i = cost(\vec{a}_i)$, where $\vec{a}_i$ is the feature vector of node $v_i$. It is observed that there is a positive correlation between the cost of a node and its degree, which means $c_i = cost(d_i)$ is an increasing function of $d_i$, where $d_i$ denotes the degree of node $v_i$. Therefore we design three kinds of cost functions to describe different growth rates of cost.

• Linear cost function: v.cost = v.degree/15 + 1;

• Exponential cost function: v.cost = 1.015^(v.degree);

• Logarithmic cost function: v.cost = log(v.degree + e).

Figure 1. Propagation Range Comparison in COND-MAT
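The three cost functions listed above can be written out as a short Python sketch. The function names are hypothetical, and the logarithm is assumed to be the natural logarithm. That choice is an assumption, not stated in the text; it makes all three functions give a cost of 1 for a node of degree 0.

```python
import math


def linear_cost(degree):
    """Linear growth: cost = degree / 15 + 1."""
    return degree / 15 + 1


def exponential_cost(degree):
    """Exponential growth: cost = 1.015 ** degree."""
    return 1.015 ** degree


def logarithmic_cost(degree):
    """Logarithmic growth: cost = ln(degree + e).
    Natural logarithm is an assumption made for this sketch."""
    return math.log(degree + math.e)


# Compare how fast each cost grows with the degree of a node.
for d in (0, 15, 100, 300):
    print(d, linear_cost(d), exponential_cost(d), logarithmic_cost(d))
```

For high-degree nodes the exponential cost dominates the other two, which is why the number of affordable seeds shrinks quickly under that function.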

3) The Network Data: The real datasets used in the experiments are all from the Stanford Large Network Dataset Collection¹, and their details are listed in Table I. The density of links between nodes differs across these social networks, which shows the algorithms’ performance in diverse networks.

Table I. DETAILS OF DATASET

Name        Nodes    Edges    Description
COND-MAT    108300   186936   Collaboration network of Arxiv Condensed Matter
cit-HepPh   15233    58891    Arxiv High Energy Physics paper citation network
facebook    4039     168486   Social circles from Facebook

The code is written in Java, and the experiments are run on a Windows XP machine with a 2.59 GHz Pentium(R) Dual-Core E5300 CPU and 2 GB of memory.

B. Experiment Results

In the experiments, the algorithms’ performance is measured by the information propagation range and the running time. Each reported result is the mean value of 10 runs of an algorithm on a specific network.
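How a propagation range can be averaged over repeated runs is illustrated by the following Python sketch. It simulates the independent cascade (IC) model with a single uniform propagation probability p on a toy graph; the uniform probability, the graph, the function names and the fixed random seed are all assumptions made for illustration, not details of the paper's Java implementation.

```python
import random


def simulate_ic(graph, seeds, p, rng):
    """Run one independent cascade from the seed set.
    graph maps each node to a list of its out-neighbours.
    Returns the number of active nodes when propagation stops."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in graph.get(u, []):
                # Each newly active node gets one chance per inactive neighbour.
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(active)


def mean_spread(graph, seeds, p, runs=10, seed=0):
    """Average the propagation range over several independent runs."""
    rng = random.Random(seed)
    total = sum(simulate_ic(graph, seeds, p, rng) for _ in range(runs))
    return total / runs


# Hypothetical toy network with four nodes.
toy_graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(mean_spread(toy_graph, [0], p=0.5))
```

Averaging over several runs smooths out the randomness of a single cascade; with p = 1 every reachable node is activated, and with p = 0 only the seeds remain active.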

1) COND-MAT Network: The results for the COND-MAT network are shown in Figure 1. Overall, although CPP-SNS approximates nodes’ cost performance, its final propagation range is quite close to that of CPP.

¹ http://snap.stanford.edu/data

Figure 2. Runtime Comparison in COND-MAT

When the linear cost function is used (Figure 1.a), the results reveal further information: the results of HCF, Greedy and SNS are close, which means that nodes with high cost have a larger influence. This is because HCF, which chooses nodes according to cost, produces results similar to Greedy only when cost represents influence well.

Moreover, the results of HCF, Greedy and SNS are worse than those of the other methods at first; however, their effectiveness rises with the spread probability. The reason is that these methods select fewer seeds than the others, but all of their seeds have high degrees. In contrast, the rise of CPP and CPP-SNS is mild, while LCF shows no clear change because it neglects nodes’ influence.

Figure 1.b, which shows the results for the exponential cost function, reveals some differences. Although the outcomes of HCF, Greedy and SNS are still similar, they do not gain as much advantage as in Figure 1.a when the propagation probability increases, and CPP and CPP-SNS even outperform them in the end. This is because, with cost growing fast, the number of seeds shrinks quickly. CPP-SNS still performs well and is closer to Greedy than in Figure 1.a.

Figure 1.c shows the outcome of applying the logarithmic cost function. All algorithms produce close results, which demonstrates that when cost increases slowly, considering cost performance shows no clear advantage over simple methods.

We also record the runtime of these algorithms when the propagation probability is 0.01, as shown in Figure 2. Greedy, CPP and ReRan are much slower than the other methods, which can be explained by the huge number of simulations they use. SNS, which is based on Greedy, is 1000 to 2000 times faster than Greedy, and CPP-SNS is 400 to 700 times faster than CPP. The choice of cost function has little effect on the runtime of any method.

Figure 3. Propagation Range Comparison in cit-HepPh

2) cit-HepPh Network: Figure 3 shows the results of the three kinds of cost functions in the cit-HepPh network. In Figure 3.a, CPP and CPP-SNS achieve a better propagation range than the other methods, and CPP-SNS is quite close to Greedy. As the spread probability increases, CPP and CPP-SNS improve more in propagation range and outperform Greedy and SNS. This is mainly because CPP and CPP-SNS take cost performance into account and choose more seeds.

The relation between Greedy and SNS can also be seen in Figure 3.a. They share the same starting point, and as the probability increases, SNS falls behind Greedy. The poor performance of HCF in Figure 3.a illustrates that nodes with high cost may not have large influence in the cit-HepPh network, which is why SNS is not as good as Greedy. Moreover, the fast increase of ReRan’s result implies that the probability of selecting a node with large influence is high. In other words, in the cit-HepPh network there is no huge difference between nodes’ influence.

The findings in Figure 3.b are consistent with Figure 3.a, while Figure 3.c is consistent with Figure 1.c. When cost increases slowly, it does not become such an important factor of the node.

As with the COND-MAT network, the runtime is recorded in Figure 4. The conclusions are the same as those drawn from Figure 2. Here CPP-SNS is 1200 to 2000 times faster than CPP.

Figure 4. Runtime Comparison in cit-HepPh

Figure 5. Propagation Range Comparison in facebook

3) facebook Network: The outcome in the facebook network, shown in Figure 5, reveals properties of the different algorithms. For CPP-SNS, the approximation of influence leads to a drop in its result in the end. LCF can have the largest seed set among all methods, but whatever the cost function is, it performs poorly. This tells us that in a strongly connected network such as facebook, nodes with small influence are unfit to serve as seeds.

In addition, compared with Figure 1 and Figure 3, there is little difference among the results of the other 7 algorithms, apart from LCF. The reason is that nodes in this network are closely linked and there are many nodes with large influence. Although the seeds selected by the algorithms differ, a good propagation result can be achieved as long as some nodes with large influence are chosen. The performance of Random and ReRan also confirms this.

Figure 6. Runtime Comparison in facebook

Figure 6 presents the runtime. Greedy, CPP and ReRan take more time here because the facebook network is more complex and its nodes all have large degrees. CPP-SNS is 80 to 150 times faster than CPP.

C. Advice on Viral Marketing

Based on the analysis of our experiments, we give some advice about viral marketing. First of all, a strongly connected network is a better choice for ad service.

Once the network is given, an important problem is how to find the seed users. As mentioned before, the price of nodes with large influence is usually high. In real life, ad companies could draw a "degree-cost" growth curve to help make decisions. When the curve grows fast, the cost of nodes should not be neglected. For users with low ROI (return on investment), even if they have huge influence, it is better to think twice before selecting them. When the "degree-cost" curve grows mildly, ad companies could focus on the nodes with large influence. Although they may not have a desirable ROI, this is an easy and effective approach. When the "degree-cost" relation is unknown, taking ROI into account is still a rational choice because it avoids risk.
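The trade-off between selecting by influence and selecting by ROI can be made concrete with a small Python sketch. All names and numbers below are hypothetical, and summing individual influence estimates ignores overlap between the audiences of different seeds, so this is only a simplified illustration of the advice above, not the CPP-SNS algorithm itself.

```python
def select_under_budget(candidates, budget, key):
    """Pick candidates in descending order of key(c), skipping any
    candidate that no longer fits into the remaining budget."""
    chosen, spent = [], 0.0
    for c in sorted(candidates, key=key, reverse=True):
        if spent + c["cost"] <= budget:
            chosen.append(c["name"])
            spent += c["cost"]
    return chosen, spent


# Hypothetical users whose cost grows quickly with their influence.
candidates = [
    {"name": "hub", "influence": 60.0, "cost": 50.0},
    {"name": "mid1", "influence": 40.0, "cost": 10.0},
    {"name": "mid2", "influence": 35.0, "cost": 10.0},
    {"name": "leaf", "influence": 5.0, "cost": 1.0},
]

by_influence = select_under_budget(
    candidates, 50.0, key=lambda c: c["influence"])
by_roi = select_under_budget(
    candidates, 50.0, key=lambda c: c["influence"] / c["cost"])

print(by_influence)
print(by_roi)
```

In this toy setting the influence-first rule spends the whole budget on one expensive user, while the ROI-first rule buys three cheaper users with a larger combined influence estimate and leaves part of the budget unspent.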

V. CONCLUSION

This paper addresses the cost control problem in influence maximization. We incorporate the factor of node cost into the traditional algorithms and improve the cost performance preferred method to make it more practical. Experiments show that the CPP-SNS algorithm performs well in different networks.

Our future research will focus on how to estimate a node’s cost correctly and how different factors affect the cost. The influence maximization problem under cost control also needs further research.

ACKNOWLEDGMENT

This research was supported by NSFC (No. 61375069,

61105069) and Technology Foundation of Jiangsu Province

of China (No. BE2012181).

REFERENCES

[1] D. Kempe, J. Kleinberg, and E. Tardos, “Maximizing the spread of influence through a social network,” in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003, pp. 137–146.

[2] E. Even-Dar and A. Shapira, “A note on maximizing the spread of influence in social networks,” in Internet and Network Economics, 2007, pp. 281–286.

[3] C. Budak, D. Agrawal, and A. El Abbadi, “Limiting the spread of misinformation in social networks,” in Proceedings of the 20th International Conference on World Wide Web, 2011, pp. 665–674.

[4] J. Leskovec, A. Krause, C. Guestrin, et al., “Cost-effective outbreak detection in networks,” in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2007, pp. 420–429.

[5] W. Chen, Y. Wang, and S. Yang, “Efficient influence maximization in social networks,” in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009, pp. 199–208.

[6] Y. Wang, G. Cong, G. Song, and K. Xie, “Community-based greedy algorithm for mining top-k influential nodes in mobile social networks,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010, pp. 1039–1048.

[7] P. Domingos and M. Richardson, “Mining the network value of customers,” in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001, pp. 57–66.

[8] M. Richardson and P. Domingos, “Mining knowledge-sharing sites for viral marketing,” in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002, pp. 61–70.

[9] Y. Singer, “How to win friends and influence people, truthfully: Influence maximization mechanisms for social networks,” in Proceedings of the 5th ACM International Conference on Web Search and Data Mining, 2012, pp. 733–742.
