Traditional vs Nontraditional Methods for Network Analytics - Ernesto Estrada

Post on 16-Apr-2017

341 views 0 download

Transcript of Traditional vs Nontraditional Methods for Network Analytics - Ernesto Estrada

1

Professor Ernesto EstradaDepartment of Mathematics & Statistics

University of Strathclyde

Glasgow, UK

www.estradalab.org

2

3

Measure what is measurable, and make measurable what is not so.

Galileo Galilei

Professor Ernesto EstradaDepartment of Mathematics & Statistics

University of Strathclyde

Glasgow, UK

www.estradalab.org

kkp ~ kkekp /~

!k

kekp

kk

2

2

2

2

1 k

kk

k

ekp

kkp ~ !k

kekp

kk

Poisson

Exponential

Gamma

Power-law

Log-normal

Stretched exponential

Stumpf & Ingram, Europhys. Lett. 71 (2005) 152.

kekp kk /~

2~

222//ln

k

ekp

mk

Stumpf & Ingram, Europhys. Lett. 71 (2005) 152.

The qualitative interpretation of the skew is

complicated. For example, a zero value

indicates that the tails on both sides of the

mean balance out, which is the case for a

symmetric distribution, but is also true for an

asymmetric distribution where the asymmetries

even out, such as one tail being long but thin,

and the other being short but fat. Further, in

multimodal distributions and discrete

distributions, skewness is also difficult to

interpret. Importantly, the skewness does not

determine the relationship of mean and

median.

Wikipedia

• Collatz-Sinogowitz Index (1957)

kGCS 1

• Bell Index (1992)

2

2 11

i

i

i

i kn

kn

GVar

• Albertson Index (1997)

Eji

ji kkGirr,

Unknown

upper

bounds

i

ik

p1

ji

ijkk

p11

2

1

Estrada: Phys. Rev. E 82, 066102 (2010)

ji

ijkk

p1

ˆ

ijijij pp ˆ

2

11

2112

ji

jiji

ij

kk

kkkk

Eij jiEij

ijkk

G

2

112

AKL

otherwise,0

,~for 1

,for

ji

jik

L

i

ij

2/12/1

2/12/1

KAKI

KLKL

otherwise,0

,~for 1

,for 1

jikk

ji

ji

ijL

.2

2

11

,

2/1

Gn

kkn

G

Eji

ji

T

L

2

1n

Gn

12

2~

nn

GnG

0~1 G

n 210

n

i

j

j

j

T

j in 1

1

1

1cos

j

jj

T nG 2cos11

L

j i

jj

T iG

2

11

L

jjjx cos1

jjjy sin1

n

j

jxnn

nG

1

2

12

~

j

01

j

jx

jy

y

x

8k 16k

pnG ,

017.0~ G 034.0~ G

1227.2

27.018.1

18.1

nn

n

k

kBA

8k 16k

132.0~ G 108.0~ G

S. cereviciaeD. melanogaster

C. elegansH. pylori

148.0~ G 294.0~ G

E. coli

241.0~ G

323.0~ G 383.0~ G Stretched exponential

Log-normal

kekp kk /~

2~

222//ln

k

ekp

mk

Professor Ernesto EstradaDepartment of Mathematics & Statistics

University of Strathclyde

Glasgow, UK

www.estradalab.org

1,305 mi.

Stanley Milgram

1933-1984

CLUSTERINGDISTANCE

5ij

d

i

iCi

node of relations e transitivpossible ofnumber total

node of relations e transitivofnumber

1

2

2/1

ii

i

ii

ii

kk

t

kk

tC

i

iCn

C1

d 3.65

0.79

2.99

0.00027

1847

0.74C

d 2.65

0.28

2.25

0.05

10.5

0.69C

33

d 18.7

0.08

12.4

0.0005

926

0.3C

34

D. J. Watts

S. H. Strogatz

35

There is a high probability that the shortest path connecting nodes p and q goes through the most connected nodes of the network.

Sending the ‘information’ to the most connected nodes (hubs) of the network will increase the chances of reaching the target.

The sender does not know the global structure of the network!

36

But also, there is a high probability that the most connected nodes of the network are involved in a high number of transitive relations.

Sending the ‘information’ to the most connected nodes (hubs) of the network will increase the chances of getting lost.

37

Definition: A route is a walk of length l, which is any sequence of (not necessarily different) nodes v1,…, vl such as for each i =1,…,l there is a link from vl to vl+1.

Definition: The communicability between two nodes in a network is defined as a function of the total number of routes connecting them, giving more importance to the shorter than to the longer ones.

38

Hypothesis: The information not only flows through the shortestpaths but by mean of any available route.

39

0

# of walks in steps from to pq l

l

G c l p q

The between the nodes p and q is then mathematically defined as

where cl should:

makes the series convergent

gives more weight to the shorter than to the longer walks

VpVp pfpfV

2

2

Eqp

qfpAf,

Definition: Let .

The adjacency operator is a bounded operator on

defined as .

Theorem: (Harary) The number of walks of length l

between the nodes p and q in a network is equal to .

pq

lA

V2

40

The communicability function can be written as:

qjpj

l

j

l

n

j

lpq

l

l

lpq cAcG ,,

0 10

41

Let be a nonincreasing ordering

of the eigenvalues of A, and let be the eigenvector

associated with .

n 21

j

j

42

jee

l

AG

n

j

qjpjpq

A

l

pq

l

pq

1

,,

0 !

n

j j

qjpj

pq

l

pq

ll

pq AIAG1

,,1

0 1

1

10

1 1

1 1

2

nn

pq n j j

j

G K p q e e p p

11

2

1 1 11 .

n n

pq n j j

j

nn

eG K e p p

n

ee

n ne ne

pq nG K

2cos1

j

j

n

2sin

1 1j

jpp

n n

2cos

1

1

2cos1

1

2 2sin sin

1 1 1 1

2sin sin .

1 1 1

jnn

pq n

j

jnn

j

jp jqG P e

n n n n

jp jqe

n n n

sin sin 1/ 2 cos cos

2cos 2cos

1 1

1

1 1cos cos

1 1 1 1

j jnn n

pq n

j

j p q j p qG P e e

n n n n

1,2, ,j n 1/ nj ,0

2cos 2cos

0 0

1 1cos cospq nG P p q e d p q e d

/ 1j n

x cos

0

1cos e d I x

2 2pq n p q p qG P I I

1 1 12 2 0n n n nG P I I

x

xx

xxx

x

x

xooo

oooo

ooo

0 5 10 15 20

0.4

0.3

0.2

0.1

0

-0.1

-0.2

-0.3

-0.4

-0.5

Subject

Second larg

est

princip

al valu

e

x

x

x

x

x

x

x

x

oooo o

oo

o

0 5 10 15 20

0.4

0.3

0.2

0.1

0

-0.1

-0.2

-0.3

-0.4

-0.5

x

oo

Subject

Second larg

est

princip

al valu

e

pqpq AWWG 2/12/1exp

~

: Stroke : Decreased communicability

: Stroke : Increased communicability

Images courtesy of Dr. Jonathan J. Crofts.

Professor Ernesto EstradaDepartment of Mathematics & Statistics

University of Strathclyde

Glasgow, UK

www.estradalab.org

Local metrics provide a measurement of a structural

property of a single node

Designed to characterise

Functional role – what part does this node play in

system dynamics?

Structural importance – how important is this node to

the structural characteristics of the system?

jee

l

AG

n

j

pjpp

A

l

pp

l

pp

1

2

,

0 !

1) 3.026

2) 1.641

3) 1.641

4) 3.127

5) 2.284

6) 2.284

7) 1.592

8) 1.592

1

2

3

4

5

6

7

8

1

2

3

4

56

7

8

9 8,6,5,3,11 S

65.451 SpGpp

9,7,4,22 S

70.452 SpGpp

Essential

54

55

Rank Protein Essential

?

1 YCR035C Y

2 YIL062C Y

... ... ...

Estrada: Proteomics 6 (2009) 35-40

0

10

20

30

40

50

60

70P

erc

en

tage o

f E

ssen

tial P

rote

ins

Iden

tifi

ed

57

Consider that two nodes p and q are trying to communicate with each other by sending information both ways through the network.

ppGQuantifies the amount of ‘information’ that is sent by p, wanders around the network, and returns to the origin, the node p.

pqGQuantifies the amount of ‘information’ that is sent by p, wanders around the network, and arrives to its destination, the node q.

We know that:

2def

pq pp qq pqG G G

The goal of communication is to maximise the amount of information that arrives to its destination by minimising that which is lost.

58

Theorem: The function is a Euclidean distance. pq

59

T

pq p q p qe Λφ φ φ φ

1

2

0 0

0 0

0 0 n

Λ

1

2

p

n

p

p

p

φ

Proof:

60

/2 /2

/2 /2 /2 /2

2

T

pq p q p q

T

p q p q

T

p q p q

p q

e e

e e e e

Λ Λ

Λ Λ Λ Λ

φ φ φ φ

φ φ φ φ

x x x x

x x

Proof (cont.):

61

pqpqd

p q

62

p q

1P

p q

2P

09.5

1,,

2/1

PjiEji

ij

80.4

2,,

2/1

PjiEji

ij

Remark. The shortest route based on the communicability

distance is the shortest path that avoids the nodes with the

highest ‘cliquishness’ in the graph.

81.56 10 83.30 10

Fraction of nodes

removed

The routes minimizing the communicability distance are the shortest ones

that avoid the hubs of the network. Then, they can be useful to avoid

possible collapse of the networks due to the removal of the hubs.

67

Essential

q

qp

n

,

1

q

qpd

n

,

1

68

69

large ppG

Wikipedia

70

qqpp

pqdef

pqGG

G

Theorem. The function is the cosine of the Euclidean

angle spanned by the position vectors of the nodes p and q.

pq

qqpp

pq

qp

qp

pqGG

G

xx

xx11 coscos

Definition:

71

a

bcR

2

2 2

4

1

11

AT ea 1 AT esb sesc AT

Theorem. The communicability distance induces an

embedding of a network into an (n-1)-dimensional

Euclidean sphere of radius:

Estrada et al.: Discrete Appl. Math. 176 (2014) 53-77.

Estrada & Hatano, SIAM Rev. (2016) in press.

ATT essC 211 Aediags

Remark. The communicability distance matrix C is circum-

Euclidean:

72

2/1

2/1

2/1

1

2/1

0

2/1

2

2/1

2/1

2/1

3

jjjA

73

0.08.06.0

0.0

7.0

0.0

0.1

0.1

xy

z

2

3

1

0.00.1

2

3

1

0.1

0.1

74

2

1

3

x

y

z

0.1

0.0

0.1

0.10.1

0.0

0.10.2

0.0

O

1x

2x

3x

75

px

qx

O

ppp Gx pq

pq

qqq Gx

Estrada & Hatano: Arxiv (2014) 1412.7388.

Estrada: Lin. Alg. Appl. 436 (2012) 4317-4328.

qppq xxG

Professor Ernesto EstradaDepartment of Mathematics & Statistics

University of Strathclyde

Glasgow, UK

www.estradalab.org

1V 2V

1 2V V V

0

0T

BA

B

1j n j

m

mb

fr1

Problem:

sinh 0Bipartivity Tr A

Bipartivity No odd cycles

exp sinh coshTr A Tr A Tr A

exp coshBipartivity Tr A Tr A

EE

EE

ATr

ATrb even

n

j

j

n

j

j

s

1

1

exp

cosh

exp

cosh

even evens s

EE EE ab G b G e

EE EE a b

Theorem.

G G e

e

b

a

1.000sb 0.829sb 0.769sb 0.721sb

0.692sb 0.645sb 0.645sb 0.597sb

1 1

1 1

cosh sinh

cosh sinh

n n

j j

j j

e n n

j j

j j

b

1

1

expexp

expexp

n

j

j

e n

j

j

Tr Ab

Tr A

1.000sb 0.658sb 0.538sb 0.462sb

0.383sb 0.289sb 0.289sb 0.194sb

Hervibores

Grass

Parasitoids

Hyper-hyper-parasitoids

Hyper-parasitoids

0.766sb

0.532eb

j = 1 j = 2 j = 3

j = 4 j = 5 j = 6

ˆ exp cosh sinhpq pq pqG A A A

ˆ sinh 0pq pqG A

p

q

p

q

ˆ cosh 0pq pqG A

0 and in different partitionsˆ 0 and in the same partitions

pq

p qG

p q

1

3

7

4

6

10

8

11

9

12

5

2

38.919.872.673.680.636.628.777.610.705.704.773.8

19.81.1083.609.711.766.673.862.598.729.728.702.9

72.683.603.817.487.543.504.731.632.456.665.528.7

73.609.717.405.862.550.505.742.562.630.456.629.7

80.611.787.562.517.893.310.768.596.562.632.498.7

36.666.643.550.593.337.777.639.568.542.531.662.5

28.773.804.705.710.777.638.936.680.673.672.619.8

77.662.531.642.568.539.536.637.793.350.543,566.6

10.798.732.462.696.568.580.693.317.862.587.511.7

05.729.756.630.462.642.573.650.562.505.817.409.7

04.728.765.556.632.431.672.643.587.517.403.883.6

73.802.928.729.798.75.6219.866.611.709.783.61.10

G

011111000000

101111000000

110111000000

111011000000

111101000000

111110000000

000000011111

000000101111

000000110111

000000111011

000000111101

000000111110

~G

10

12

11

8

7

9

2

3

4

1

5

6

1 1,2,3,4,5,6C

2 6,7,8,9,10,11,12C

1

3

7

4

6

10

8

11

9

12

5

2

2 3

10

4

1211

1 5 6

87 91

3

7

4

6

10

8

11

9

12

5

2

Professor Ernesto EstradaDepartment of Mathematics & Statistics

University of Strathclyde

Glasgow, UK

www.estradalab.org

VS VS 2/1

S

Let and

represents the number

of edges with exactly one

endpoint in S.

12

min , ,02n

S

S VG S V S

S

1 21 1 22

2G

N 321(Alon-Milman): Let

21

A Ramanujan graph

n = 80, = 1/4

The Petersen graph

n = 10, = 1

S S

SS

bottleneck

MySQL 30.7

93.6 56.5

St Marks St Martin

t

t

t

t

iiG

i

800,628,3321

!

1032

0

iiiiii

k

ii

k

ii

AAA

k

AG

i i

i

1/2 1/6

1/24i

1/40320

iNk

1

1

n

j

kkk jNiNis

i,1

k

jpj

n

p

n

j

ijk iN ,

1 1

,

n

p

k

jqj

n

q

n

j

pj

n

p

k pN1

,

1 1

,

1

n

p

k

jqj

n

q

n

j

pj

k

jpj

n

p

n

j

ij

k is

1

,

1 1

,

,

1 1

,

.

1 1

,1,1

1

,1,1

1 1

1,1,1

1

1,1,1

n

p

n

q

qp

n

p

pi

n

p

n

q

k

qp

n

p

k

pi

k is

2

1

,1

1 1

,1,1

n

j

p

n

p

n

q

qp in

p

p

i

k is ,1

1

,1

,1

k

Remark: The eigenvector centrality of a given node represents the probability of selecting at random a walk of infinite length that hasstarted at that node.

1 2

n

j

jijiiiG2

2

,1

2

,1 expexp

1

2

,1 exp iiiG

2ln

2

1ln 1

,1

iii G

i,1ln

iiGln

5.0

1

2

,1

,1

explnln

ii

i

iG

iiii GVi 1

2

,1,1 exp,0ln

iiii GVi 1

2

,1,1 exp,0ln

iiii GVi 1

2

,1,1 exp,0ln

andVpp ,0ln ,1Vqq ,0ln ,1

Chordless cycle

of length 15

MySQL

3105.1

St Marks St Martin

31090.2

67.1