Graph Annotations in Modeling Complex Network...

arX

iv:0

708.

3879

v4 [

cs.N

I] 2

Nov

200

9

Graph Annotationsin Modeling Complex Network Topologies

XenofontasDimitropoulos

IBM Research, Zurich

[email protected]

Dmitri KrioukovCAIDA/UCSD

[email protected]

Amin VahdatUCSD

[email protected]

George RileyGeorgia Tech

[email protected]

ABSTRACTThe coarsest approximation of the structure of a complexnetwork, such as the Internet, is a simple undirected un-weighted graph. This approximation, however, loses toomuch detail. In reality, objects represented by vertices andedges in such a graph possess some non-trivial internal struc-ture that varies across and differentiates among distinct typesof links or nodes. In this work, we abstract such additionalinformation as network annotations. We introduce a net-work topology modeling framework that treats annotationsas an extended correlation profile of a network. Assumingwe have this profile measured for a given network, we presentan algorithm to rescale it in order to construct networks ofvarying size that still reproduce the original measured an-notation profile.

Using this methodology, we accurately capture the net-work properties essential for realistic simulations of net-work applications and protocols, or any other simulationsinvolving complex network topologies, including modelingand simulation of network evolution. We apply our approachto the Autonomous System (AS) topology of the Internetannotated with business relationships between ASs. Thistopology captures the large-scale structure of the Internet.In depth understanding of this structure and tools to modelit are cornerstones of research on future Internet architec-tures and designs. We find that our techniques are ableto accurately capture the structure of annotation correla-tions within this topology, thus reproducing a number ofits important properties in synthetically-generated randomgraphs.

Categories and Subject DescriptorsC.2.1 [Network Architecture and Design]: Networktopology; C.2.5 [Local and Wide-Area Networks]: In-ternet; G.3 [Probability and Statistics]: Distributionfunctions, multivariate statistics, correlation and regressionanalysis; G.2.2 [Graph Theory]: Network problems

General TermsMeasurement, Design, Theory

KeywordsAnnotations, AS relationships, complex networks, topology

1. INTRODUCTIONSimulations of new network protocols and architectures

are pointless without realistic models of network structureand evolution. Performance of routing [26], multicast [34],and other protocols depends crucially on network topol-ogy. Simulations of these protocols with inaccurate topologymodels can thus result in misleading outcomes.

Inaccuracies associated with representing complex net-work topologies as simple undirected unweighted graphs comenot only from potential sampling biases in topology mea-surements [27, 10, 11], but also from neglecting link andnode annotations. By annotations we mean various typesof links and nodes that abstract their intrinsic structural andfunctional differences to a certain degree. For example, con-sider the Internet topology at the Autonomous System (AS)level. Here, link annotations may represent different busi-ness relationship between ASs, e.g., customer-to-provider,peer-to-peer, etc. [13], while node annotations may representdifferent types of ASs, e.g., large or small Internet ServiceProviders (ISPs), exchange points, universities, customer en-terprises, etc. [15]. In router-level Internet topologies, linkannotations can be different transmission speeds, latencies,packet loss rates, etc. One can also differentiate betweendistinct types of links and nodes in other networks, suchas social, biological, or transportation networks. In manycases, simply reproducing the structure of a given network isinsufficient; we must also understand and reproduce domain-specific annotations.

We propose network annotations as a general frameworkto provide the next level of detail describing the “micro-scopic” structure of links and nodes. Clearly, since linksand nodes are constituents of a global network, increasingdescription accuracy at the “microscopic” level will also in-crease overall accuracy at the “macroscopic” level as well.That is, including appropriate per-node or per-link anno-tations will allow us to capture and reproduce more accu-rately a variety of important global graph properties. Inthe AS topology case, for example, instead of consideringonly shortest paths, we will be able to study the structureof paths that respect constraints imposed by routing policiesand AS business relationships.

Higher accuracy in approximating network structure isdesirable not only for studying applications and protocolsthat depend on such structure, but also for modeling net-work evolution. For example, realistic Internet AS topology

http://arxiv.org/abs/0708.3879v4

growth models should be based on economic realities of theInternet since AS links are nothing but reflections of AScontractual relationships, i.e., results of business decisionsmade by organizations that the corresponding ASs repre-sent. Therefore, economy-based AS topology models nat-urally produce links annotated with AS relationships. ASrelationship annotations are thus intrinsic to such models.

Network annotations should also be useful for researchersstudying only those networks that preserve some domain-specific constraints, thus avoiding “too random” networksthat violate these constraints. Examples of such “techno-logical” constraints for router topologies include maximumnode degree limits, specific relationships between node de-gree and centrality, etc. [28]. In this context, we note thatany node or link attributes, including their degrees and cen-trality, are forms of annotations. Therefore, one can use thenetwork annotation framework to introduce domain-specificor any other constraints to work with network topologiesnarrowed down to a specific class. We also note that thenetwork annotation framework is sufficiently general to in-clude directed and weighted networks as partial cases, sinceboth link directions and weights are forms of annotations.

After reviewing, in Section 2, past work on network topol-ogy modeling and generation, which largely ignores annota-tions, we make the following contributions in this paper:

• In Section 3, we demonstrate the importance of net-work annotations using the specific example of ASbusiness relationships in the Internet.

• In Section 4, we introduce a general network annota-tion formalism and apply it to the Internet AS topol-ogy annotated with AS business relationships.

• In Section 5, we formulate a general methodology andspecific algorithms to: i) rescale the annotation corre-lation profile of the observed AS topology to arbitrarynetwork sizes; and ii) construct synthetic networksreproducing the rescaled annotation profiles. Whilewe discuss our graph rescaling and construction tech-niques in the specific context of AS topologies, thesetechniques are generic and can be used for generatingsynthetic annotated networks that model other com-plex systems.

• In Section 6, we evaluate the properties of the resultingsynthetic AS topologies and show that they recreatethe annotation correlations observed in real annotatedAS topologies as well as other important propertiesdirectly related to common metrics used in simulationand performance evaluation studies.

We conclude by outlining some implications and directionsfor future work in Section 7.

2. RELATED WORKA large number of works have focused on modeling In-

ternet topologies and on developing realistic topology gen-erators. Waxman [44] introduced the first topology gen-erator that became widely known. The Waxman genera-tor was based on the classical (Erdos-Renyi) random graphmodel [18]. After it became evident that observed networkshave little in common with classical random graphs, newgenerators like GT-ITM [46] and Tiers [17] tried to mimic

the perceived hierarchical network structure and were con-sequently called structural. In 1999, Faloutsos et al. [19] dis-covered that the degree distributions of router- and AS-leveltopologies of the Internet followed a power law. Structuralgenerators failed to reproduce the observed power laws. Thisfailure led to a number of subsequent works trying to resolvethe problem.

The existing topology models capable of reproducing powerlaws can be roughly divided into the following two classes:causality-aware and causality-oblivious. The first class in-cludes the Barabasi-Albert (BA) [2] preferential attachmentmodel, the Highly Optimized Tolerance (HOT) model [6],and their derivatives. The BRITE [32] topology generatorbelongs to this class, as it employs preferential attachmentmechanisms to generate synthetic Internet topologies. Themodels in this class grow a network by incrementally addingnodes and links to a graph based on a formalized networkevolution process. One can show that both BA and HOTgrowth mechanisms produce power laws.

On the other hand, the causality-oblivious approaches tryto match a given (power-law) degree distribution withoutaccounting for different forces that might have driven evo-lution of a network to its currently observed state. Themodels in this class include random graphs with given ex-pected [9] and exact [1] degree sequences, Markov graphrewiring models [31, 23], and the Inet [45] topology gen-erator. Recent work by Mahadevan et al. introduced thedK-series [29] extending this class of models to account fornode degree correlations of arbitrary order. Whereas thedK-series provides a set of increasingly accurate descrip-tions of network topologies represented as graphs, networkannotations are another, independent and “orthogonal” todK-series, way to provide more accurate and complete in-formation about actual complex systems that these graphsrepresent.

Frank and Strauss first formally introduced the annotated(colored) random Markov graphs in [20]. In their definition,every edge is colored by one of T colors. More recently,Soderberg suggested a slightly different definition [42], whereevery half-edge, i.e., stub, is colored by one of T colors.Every edge is thus characterized by a pair of colors. Thisdefinition is very generic. It includes uncolored and stan-dard colored [20] random graphs, random vertex-coloredgraphs [40]1 , and random directed graphs [5] as partial cases.Soderberg considers some analytic properties of the ensem-ble of these random colored graphs in [41]. In [43], he ob-serves strong similarities between random graphs colored byT colors and random Feynman graphs representing a pertur-bative description of a T -dimensional system from quantumor statistical mechanics.

Recent works on annotation techniques specific to ASgraphs include [12] and [8]. The GHITLE [12] topology gen-erator produces AS topologies with c2p and p2p annota-tions based on simple design heuristics and user-controlledparameters. The work by Chang et al. [8] describes a topol-ogy evolution framework that models ASs’ decision criteriain establishing c2p and p2p relationships. Our methodol-ogy is different in that it explores the orthogonal, causality-oblivious approach to modeling link annotations. Its mainadvantage is that it is applicable to modeling any type ofcomplex networks.

1Random graphs with colored nodes are a partial case ofrandom graphs with hidden variables [3].

Figure 1: Example AS topology annotated with ASrelationships. The dotted lines represent shortestpaths between ASs 4, 6 and 8 to AS 2. The dashedlines represent policy compliant paths from the samesources to the same destination.

3. AS RELATIONSHIPS AND WHY THEYMATTER

In this section, we introduce our specific example of net-work annotations—AS relationships. We first describe whatAS relationships represent and then discuss the results ofsimple simulation experiments showing why preserving ASrelationship information is important.

AS relationships are annotations of links of the InternetAS-level topology. They represent business agreements be-tween pairs of AS neighbors. There are three major types ofAS relationships: 1) customer-to-provider (c2p), connectingcustomer and provider ASs; 2) peer-to-peer (p2p), connect-ing two peer ASs; and 3) sibling-to-sibling (s2s), connectingtwo sibling ASs. This classification stems from the followingBGP route export policies, dictated by business agreementsbetween ASs:

• exporting routes to a provider or a peer, an AS ad-vertises its local routes and routes received from itscustomer ASs only;

• exporting routes to a customer or a sibling, an ASadvertises all its routes, i.e., its local routes and routesreceived from all its AS neighbors.

Even though there are only two distinct export policies, theylead to the three different AS relationship types when com-bined in an asymmetric (c2p) or symmetric (p2p or s2s)manner.

If all ASs strictly adhere to these export policies, thenone can easily check [21] that every AS path must be ofthe following valley-free or valid pattern: zero or more c2plinks, followed by zero or one p2p links, followed by zero ormore p2c links, where by ‘p2c’ links we mean c2p links inthe direction from the provider to the customer.

Routing policies reflect business agreements and economicincentives. For this reason, they are deemed more importantthan quality of service and other criteria. As a result, sub-optimal routing and inflated AS paths often occur. Gao andWang [22] used BGP data to measure the extent of AS pathinflation in the Internet. They found that at least 45% ofthe AS paths observed in BGP data are inflated by at leastone AS hop and that AS paths can be inflated by as long as9 AS hops.

Taking into account such inflation effects is important formeaningful and realistic simulation studies. For example,

Table 1: Total number of paths for each AS with ASrelationships enabled and AS relationships disabled.

AS number 1 2 3 4 5 6 7 8

Number of pathsAS relationships disabled 12 13 16 15 13 15 15 13AS relationships enabled 12 9 10 8 8 7 9 6

consider the AS topology in Figure 1, which is a small partof the real (measured) AS topology annotated with AS re-lationships inferred using heuristics in [14]. Directed linksrepresent c2p relationships that point towards the providerand undirected links represent p2p relationships. If we ig-nore AS relationships then the shortest paths from ASs 4,6, and 8 to AS 2 are shown with dotted lines. On the otherhand, if we account for AS relationships these paths are nolonger valid. In particular, the path 4→3→2 transverses twop2p links; the path 6→3→2 transverses a p2c link followedby a p2p link; and the path 8→1→2 transverses a c2p linkafter having gone through a p2c link. As all these paths arenot valid, they are not used in practice. The paths actuallyused are the policy compliant paths marked with dashedlines.

In other words, the first effect of taking AS relationshipsinto account is that paths become longer than the corre-sponding shortest paths. From a performance perspective,longer paths can affect metrics such as end-to-end (e2e) de-lay, server response time, jitter, convergence time, and oth-ers.

To illustrate this effect, we simulated the topology in Fig-ure 1 using BGP++ [16]. We used a single router per AS andconfigured appropriate export rules between ASs accordingto the guidelines discussed above. We set the delay of eachlink to 10 milliseconds and the bandwidth to 400kbps. Then,we configured exponential on/off traffic sources at ASs 4, 6and 8 that send traffic to AS 2 at a rate of 500kbps. Werun the simulation for 120 seconds; for the first 100 secondswe waited for routers to converge2 and at the 100th sec-ond we started the traffic sources. We then measured thee2e delay between the sources and the destination with ASrelationships disabled and enabled.

In Figure 2 we depict the cumulative distribution func-tion (CDF) of the e2e delays for the both cases. We firstnotice that the CDF with AS relationships enabled shifts tothe right, which means that there is a significant increasein the e2e delay. In particular, the average e2e delay withAS relationships enabled is 0.853 seconds, whereas withoutAS relationships it drops to 0.389 seconds. Besides the de-crease in the e2e delay, we see that the CDF with AS re-lationships is much smoother than the other CDF, whichexhibits a step-wise increase. The reason for that differenceis that in the former case we have more flows sharing multi-ple queues and, consequently, more diverse queue dynamics,while in the latter case, almost all paths are disjoint, leadingto mostly fixed e2e delays. The observed difference signifiesthat the e2e delay with AS relationships enabled exhibits amuch higher variability compared to the case with AS rela-tionships ignored. This difference in variability is likely toaffect other performance metrics like jitter and router bufferoccupancy.

Another consequence of policy-constrained routing is thatASs have fewer alternative AS paths. For example, in Fig-

2Typically routers take much less than 100 seconds to con-verge, but to be conservative we used a longer period.

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 0.2 0.4 0.6 0.8 1 1.2 1.4

CD

F

end to end delay (sec)

AS relationships enabledAS relationships disabled

Figure 2: CDF of e2e delay between traffic sourcesand destination.

Table 2: Average bandwidth per flow with AS rela-tionships enabled or disabled.

Flow 4 → 2 6 → 2 8 → 2

Bandwidth (Kbps)AS relationships disabled 202 196 397AS relationships enabled 113 164 121

ure 1 when ignoring AS relationships AS 7 has three (onethrough each neighbor) disjoint paths to reach destination 2.One the other hand, with AS relationships enabled, AS 7has only one possible path through AS 5, since the othertwo paths are not valid. In Table 1, we show the total num-ber of paths we found in the BGP tables of the eight ASsin our simulations. The consistent decrease in the numberof paths when AS relationships are enabled highlights thatignoring AS relationships increases the path diversity of theASs in a simulation. Path diversity is an important prop-erty related to network robustness, vulnerability to attacks,links and router failures, load balancing, multi-path routing,convergence of routing protocols, and others.

Yet another effect of policy routing is different distribu-tion of load on ASs and AS links. Indeed, due to the smallernumber of available AS paths, compared to shortest pathrouting, some nodes and links are likely to experience greatertraffic load. For example, in Figure 1 the dashed paths sharethe links from AS 7 to AS 5. On the other hand, when as-suming shortest path routing the three paths are mostlydisjoint: only one link, the link between AS 3 and AS 2, isshared by two flows. Thus, AS links and nodes will receivegreater load, compared to the case with AS relationshipsignored. Higher load is likely to produce more packet loss,increased delay, congestion, router failures, and other unde-sirable effects. In Table 2 we list the average bandwidth inour simulations for each of the three flows with and withoutAS relationships enabled. We find that because of the in-creased load on the links between AS 7 and AS 5 the averagebandwidth of the three flows decreases substantially.

To summarize this section, we have provided three ex-amples showing that ignoring AS relationship annotationsleads to inaccuracies, which make the corresponding prop-erties look “better” than they are in reality. Indeed, if ASrelationships are ignored, then:

• paths are shorter than in reality;

• path diversity is larger than in reality; and

k=8

k=5 k’=6

kg=3

k=(1,2,2)

kr=1

kb=4

k=(kr,kg,kb)=(1,3,4) k’=(3,1,2)

r b

g

b g

r

Figure 3: The 1K- and 2K-annotations. Three differ-ent stub colors are represented by dashed (color red),dash-dotted (color green), and dash-double-dotted(color blue) lines.

• traffic load is lower than in reality.

4. NETWORK TOPOLOGY ANNOTATIONSIn this section we first introduce our general formalism

to annotate network topologies. We then show how thisformalism applies to our example of the AS-level Internettopology annotated with AS relationships.

4.1 General formalismOur general formalism is close to random colored graph

definitions from [42] and borrows parts of the convenientdK-series terminology from [29].

We define the annotated network as a graph G(V, E),|V | = n and |E| = m, such that all 2m edge-ends (stubs)of all m edges in E are of one of several colors c, c = 1 . . . C,where C is the total number of stub colors. We also allowfor node annotations by an independent set of node colorsθ = 1 . . . Θ. We do not use node annotations in this paperand we do not include them in the expressions below in or-der to keep them clearer. It is however trivial to add nodeannotations to these expressions.

Compared to the non-annotated case when the node de-gree is fully specified by an integer value k of the number ofstubs attached to the node, we now have to list the numbersof attached stubs of each color to fully describe the nodedegree. Instead of scalar k, we thus have the node degreevector

k = (k1, . . . , kC),

which has C components kc, each specifying the numberof c-colored stubs attached to the node, cf. the left side ofFigure 3. The L1-norm of this vector yields the node degreewith annotations ignored,

k = |k|1 =C

X

c=1

kc. (1)

The number n(k) of nodes of degree k defines the nodedegree distribution,

n(k)

n−−−−→n→∞

P (k), (2)

in the large-graph limit. We can think of n(k) as a non-normalized form of P (k). From the statistical perspective,the n(k) (P (k)) distribution is a multivariate distribution.Its C marginal distributions are the distributions of nodedegrees of each color c:

n(kc) =X

k′

˛

˛ k′

c=kc

n(k′), (3)

where the summation is over all vectors k′ such that theirc’s component is equal to kc. The degree distribution n(k)thus represents per-node correlations of degrees of differentcolors. Following the terminology in [29], we call the nodedegree distribution n(k) (P (k)) the 1K-annotated distribu-

tion.We then define the 2K-annotated distribution as correla-

tions of annotated degrees of connected nodes, or simply asthe number of edges that have stub of color c connected toa node of degree k and the other stub of color c′ connectedto a node of degree k′, n(c, k; c′,k′). See the right side ofFigure 3 for illustration.

As in the non-annotated case, the 2K-distribution yieldsa more exhaustive statistics about the annotated networktopology and fully defines the 1K-distribution. To see that,we introduce the following notations:

k = (c,k),

k′ = (c′,k′),

µ(c, c′) = 1 + δ(c, c′),

µ(k,k′) = 1 + δ(k,k′),

µ(k, k′) = 1 + δ(k, k′),

where δ(x, x′) is the standard Kronecker delta:

δ(x, x′) =

(

1 if x = x′,

0 otherwise,

and x is either c, k, or k. With these notations, one caneasily check that the normalized 2K-annotated distributionis

P (k, k′) = n(k, k′)µ(k, k′)/(2m), (4)

the number of edges of any pair of colors connecting nodesof degrees k and k′ is

n(k,k′) =X

c,c′

n(k, k′)µ(k, k′)/µ(k,k′),

the normalized form of this distributions is

P (k,k′) = n(k,k′)µ(k,k′)/(2m),

and the 1K-distribution is given by

n(k) =X

k′

n(k,k′)µ(k,k′)/k, (5)

P (k) =k

k

X

k′

P (k,k′), (6)

where k = 2m/n is the average degree. The last two expres-sions show how one can find the 1K-annotated distributiongiven the 2K-annotated distribution, and they look exactlythe same as in the non-annotated case [29], except that wehave vectors k,k′ instead of scalars k, k′.

The dK-annotated distributions with d > 2 [29] can bedefined in a similar way.

4.2 The AS relationship annotationsIn the specific case of the AS-level Internet topology that

interests us in this paper, we have just three colors: cus-tomer, provider, and peer.3 We assign the following numericvalues to represent these three colors:

c =

8

>

<

>

:

1 customer,

2 provider,

3 peer.

These three stub annotations come under the following twoconstraints defining the only two types of edges that we have:1) c2p edges: if one stub of an edge is customer, then theother stub of the same edge is provider, and vice versa; and2) p2p edges: if one stub of an edge is peer, then the otherstub of the same edges is also peer. The c2p edges are thusasymmetric, i.e., a generalization of directed edges, while thep2p edges are symmetric, i.e., a generalization of bi-directedor undirected edges.

While the 2K-annotated distribution n(k, k′) contains themost exhaustive information about the network topology, ithas too many (seven) independent arguments. As a result,the full 2K-annotated statistics is extremely sparse, whichmakes it difficult to model and reproduce directly. We thushave to find some summary statistics of n(k, k′) that we canmodel in practice. For each concrete complex network type,these summary statistics might be different. Given mea-surement data for a specific complex network, one wouldusually have to start with identifying a meaningful set ofsummary statistics of the 2K-annotated distribution, andthen proceed from there. At the same time, we believe thatas soon as the 2K-annotated distribution fully defines an ob-served complex network, i.e., the network is 2K-annotated-random [29], one can generally use the set of summary statis-tics that we found necessary and sufficient to reproduce inorder to model correctly the Internet AS topology. In therest of this section, we list these statistics and describe thespecific meanings that they have in the AS topology case.

Degree distribution (DD). This statistics is the tra-ditional non-annotated degree distribution n(k), where k isas in eq. (1). The DD tells us how many ASs of each totaldegree k are in the network.

Annotation distributions (ADs). The DD of an AStopology does not convey any information about the AS re-lationships. The initial step to account for this informationis to reproduce the distributions of ASs with specific num-bers of attached customer, provider, or peer stubs. Theseannotation distributions (ADs) are the marginal distribu-tions n(kc), c = 1, 2, 3, of the 1K-annotated distribution.They are given by eq. (3). If k1 (k2) customer (provider)stubs attach to an AS, then this AS has exactly k1 (k2)providers (customers), since the c2p edges are asymmetric.Consequently, the ADs n(k1) and n(k2) tell us how manyASs with the specific numbers of providers and, respectively,customers the network has. Since the p2p edges are sym-metric, the AD n(k3) is the distribution of ASs with specificnumbers of peers.

Annotated degree distribution (ADD). The ADs donot tell us anything about the correlations among anno-

3We ignore sibling relationships, since they typically accountfor a very small fraction of the total number of edges. Asfound in [13], the number of s2s edges is only 0.46% of thetotal number of edges in the AS-level Internet.

tated degrees of the same node, i.e., how many customers,providers, and, simultaneously, peers a specific AS has. Cor-relations of this type are fully described by the 1K-annotateddistribution in eq. (2), which we also call the annotated de-gree distribution (ADD). These correlations are present inthe Internet. For example, large tier-1 ISPs typically have alarge number of customers, i.e., large k2, no providers, i.e.,zero k1, and a small number of peers, i.e., small k3. On theother hand, medium-size ISPs tend to have a small set ofcustomers, several peers, and few providers. Ignoring theADD while generating synthetic graphs can lead to artifactslike high-degree nodes with many providers—a property ob-viously absent in the real Internet.

Joint degree distributions (JDDs). While the ADDcontains the full information about degree correlations “atnodes,” it does not tell us anything about degree correla-tions “across links,” while the latter type of correlationsis also characteristic for the Internet. For example, largetier-1 ISPs typically have p2p relationships with other tier-1 ISPs, not with much smaller ISPs, while small ISPs havep2p links with other small ISPs. In other words, p2p linksusually connect ASs of similar degrees, i.e., k ∼ k′. Sim-ilarly, c2p links tend to connect low-degree customers tohigh-degree providers, i.e., k ≪ k′. If we ignore these cor-relations, we can synthesize graphs with inaccuracies likep2p links connecting ASs of drastically dissimilar degrees.To reproduce these correlations, we work with the follow-ing summary statistics of the 2K-annotated distribution ineq. (4):

nc2p(k, k′) =X

k,k′

˛

˛ |k|1=k, |k′|1=k′

n(1,k; 2, k′), (7)

np2p(k, k′) =X

k,k′

˛

˛ |k|1=k, |k′|1=k′

n(3,k; 3, k′), (8)

where the summation is over such vectors k and k′ thattheir L1-norms are k and k′ respectively. The first expres-sion gives the number of c2p links that have their customerstub attached to a node of total degree k and provider stubattached to a node of total degree k′. The second expressionis the number of p2p links between nodes of total degreesk and k′. In other words, these two objects are the jointdegree distributions (JDDs) for the c2p and p2p links.

In summary, we work with the four types of distributions,i.e., DD, ADs, ADD, and JDDs, that allow two types ofclassification:

1. Univariate vs. multivariate distributions:

(a) Univariate. The ADs and DD are distribution ofonly one random variable.

(b) Multivariate. The ADD and JDDs are joint dis-tribution of three and two random variables. Themarginal distribution of these variables are theADs, cf. eq. (3), and DD, cf. eqs. (7,8), respec-tively.

2. 1K- vs. 2K-summary statistics:

(a) 1K-derived. The DD, ADs, and ADD are fully de-fined by the 1K-annotated distribution: that is,we do not need to know the 2K-annotated distri-bution to calculate the distributions in this class.

(b) 2K-derived. The JDDs are fully defined only bythe 2K-annotated distribution. Note that it alsodefines the 1K-annotated distribution via eqs. (5,6).

5. GENERATING ANNOTATED AS GRAPHSIn this section we describe how we generate synthetic an-

notated AS graphs of arbitrary sizes. We want our syntheticgraphs to reproduce as many important properties of theoriginal measured topology as possible. For this purposewe decide to explicitly model and reproduce the summarystatistics of the 2K-annotated distribution from Section 4.2,because [29] showed that by reproducing 2K-distributions,one automatically captures a long list of other importantproperties of AS topologies. In other words, the task ofgenerating synthetic annotated topologies becomes equiv-alent to the task of generating random annotated graphsthat reproduce the summary statistics of the 2K-annotateddistribution of the measured AS topology.

We wish to be able to generate synthetic topologies ofdifferent sizes, but the 2K-summary statistics defined inSection 4.2 are all bound to a specific graph size. There-fore, in order to generate arbitrarily-sized graphs, we needfirst to rescale the 2K-summary statistics from the originalto target graph sizes. We say that an empirical distribu-tion is rescaled with respect to another empirical distribu-tion, if the both distributions are defined by two differentfinite collections of random numbers drawn from the samecontinuous distribution. For example, the distributions ofnode scalar (or vector) degrees in two different graphs arerescaled with respect to each other if these degrees are drawnfrom the same continuous univariate (or multivariate) prob-ability distribution. We say that a 2K-annotated graph isrescaled with respect to another 2K-annotated graph, if allthe 2K-summary statistics of the first graphs are rescaledwith respect to the corresponding 2K-summary statistics ofthe second graph. This definition of rescaling is equivalentto assuming that for each summary statistic, the same dis-tribution function describes the ensemble of empirical dis-tributions of the statistic in past, present, and future In-ternet topologies. In other words, we assume that the 2K-annotated correlation profile of the Internet AS topology isan invariant of its evolution. This assumption is realistic, asdiscussed, for example, in [36], where it is shown that thenon-annotated 1K- and 2K-distributions of the Internet AStopology have stayed approximately the same during all theyears (more than a decade) of the existing data time span.

To illustrate what we mean by rescaling, consider theempirical distribution of peer degrees, i.e., the AD n(k3),in the measured AS topology annotated with AS relation-ships in Figure 4(a). The figure shows the empirical comple-mentary cumulative distribution function (CCDF) for peer-degrees of 19,036 nodes, i.e., 19,036 numbers of peer stubsattached to a node, and the largest such number is 448.The continuous probability distribution of Figure 4(b) ap-proximates the empirical distribution in Figure 4(a). Fig-ures 4(c), 4(d), and 4(e) show the CCDFs of three collectionsof 5,000, 20,000, and 50,000 random numbers drawn fromthe probability distribution in Figure 4(b). According toour definition of rescaling, the distributions in Figures 4(c),4(d), and 4(e) are rescaled with respect to the distributionof Figure 4(a). We see that all the empirical distributionshave the same overall shape, but differ in the total numberof samples and in the maximum values within these sam-

ple collections. Distributions with larger maximums corre-spond, as expected, to bigger collections of samples.

5.1 Overview of the approachWe now move to describing the details of our approach,

which consists of the following three major phases:

1. Extraction.

We first extract the empirical 2K-summary distribu-tions from available AS topology measurement data.We annotate links of the AS topology extracted fromthis data using existing AS relationship inference heuris-tics. This extraction step is conceptually simplest. Onits output, we obtain the extracted 2K-summary dis-tributions that are all bound to the size of the mea-sured AS graph.

2. Rescaling.

(a) We use the extracted empirical distributions tofind their continuous approximations. Referringto our example in Figure 4, this step correspondsto computing the continuous probability distribu-tion in Figure 4(b) based on the empirical distri-bution in Figure 4(a).

(b) We then use the computed probability distribu-tions to rescale the empirical distributions ob-tained at the extraction step. We generate adesired, target number of random scalar or vec-tor degree samples drawn from the correspond-ing probability distributions. The generated de-gree samples have empirical distributions that arerescaled with respect to the corresponding empir-ical distributions of the measured topology. Re-ferring to our example in Figure 4, this step cor-responds to generating the rescaled empirical dis-tributions in Figures 4(c), 4(d), and 4(e) basedon the probability distribution in Figure 4(b).

3. Construction.

Finally, we develop algorithms to generate syntheticgraphs that have their 2K-summary distributions equalto given distributions, i.e., to the corresponding distri-butions obtained at the previous step. The generatedgraphs thus reproduce the rescaled replicas of the 2K-annotated distribution of the original topology, butthey are “maximally random” in all other respects.

In the rest of this section, we describe each of these phasesin detail.

5.2 ExtractionWe extract the AS topology from the RouteViews [38]

data, performing some standard data cleaning, such as ig-noring private AS numbers, AS sets, etc. [30] The resultingAS graph is initially non-annotated. To annotate it, we in-fer c2p and p2p relationships for AS links using the heuris-tics in [13]. We thus obtain the real Internet AS topologyannotated with c2p and p2p relationships. Given this an-notated topology, we straightforwardly calculate all the em-pirical 2K-summary distributions that we have defined inSection 4.2.

While the extraction phase is conceptually and techni-cally the simplest phase of the overall approach, it is its

basis. Therefore the quality of the input Internet topologydata is a natural concern. This data is known to exhibita variety of vagaries, e.g., due to sampling biases [27, 10,11]. However, our approach is oblivious with respect to dataquality. It takes any available data, extracts the describedstatistics from it, and reproduces them, properly rescaled,in random synthetic graphs. A given input topology dataset thus defines an ensemble of random graphs generatedby our method. By construction, all graphs in this ensem-ble reproduce the described set of annotated distributions.In addition, in Section 6, we perform sensitivity analysis inorder to see the strength of fluctuations of these and otherbasic graph metrics within an ensemble. The quality of thesegraph ensembles, in terms of how veraciously they reflect re-ality, will improve as the quality of available topology dataimproves in the future. In this paper, we simply illustrateour approach with the currently available topology data.The RouteViews [38] is just one of very few sources of suchdata [30]. We select it because it appears to be the mostfrequently cited Internet topology data source.

5.3 RescalingOur rescaling approach differs for univariate and multi-

variate distributions.

5.3.1 Rescaling univariate distributionsWe recall from the end of Section 4.2 that we have the

following two types of univariate distributions: the ADs andthe DD. Here we describe how we rescale ADs. We note thatwe do not have to rescale the DD the same way. The reasonis that our approach to rescaling the ADD, which we discussbelow in Section 5.3.2, automatically takes care of rescalingthe DD, since the ADD is the distribution of degree vectorsand the DD is the distribution of the L1-norms of thesevectors, cf. eq. (1).

The first problem we face trying to compute a continu-ous approximation for a given finite empirical distributionis that we have to not only interpolate between points of theempirical distribution, but also extrapolate above its maxi-mum value. For example, if we want to construct a syntheticgraph bigger than the original, then we expect its maximumdegree to be larger than the maximum degree in the orig-inal graph. Therefore we have to properly extrapolate theobserved degree distribution beyond the observed maximumdegree.

We solve this problem by fitting the univariate empiri-cal distributions with smoothing splines. Spline smoothingis a non-parametric estimator of an unknown function rep-resented by a collection of empirical data points. Splinesmoothing produces a smooth curve passing through or nearthe data points. For example, the curve in Figure 4(b) is asmooth spline of the empirical distribution of Figure 4(a).Spline smoothing can also extrapolate the shape of an em-pirical function beyond the original data range.

Another reason to select spline smoothing is that it comesuseful for fitting distributions that do not closely follow regu-lar shapes, e.g., “clean” power laws. The ADs of the Internettopology do not necessarily have such regular shapes. Forexample, the distribution of the number of peers, i.e., theAD n(k3), has a complex shape that we found impossible tofit with any single-parametric distribution.

Among available implementations of spline smoothing tech-niques, we select the one in the smooth.spline method of

0

0.2

0.4

0.6

0.8

1

1 10 100 1000

CC

DF

number of peer stubs

max = 448

(a) Original distribution of 19,036 peer-degrees in the measured AS topology.

0

0.2

0.4

0.6

0.8

1

1 10 100 1000C

CD

F

number of peer stubs(b) Continuous distribution approximatingthe original one.

0

0.2

0.4

0.6

0.8

1

1 10 100 1000

CC

DF


max = 178

(c) Rescaled distribution of 5,000 sam-ples.

0

0.2

0.4

0.6

0.8

1

1 10 100 1000

CC

DF


max = 439

(d) Rescaled distribution of 20,000samples.

0

0.2

0.4

0.6

0.8

1

1 10 100 1000

CC

DF


max = 558

(e) Rescaled distribution of 50,000samples.

Figure 4: Rescaling an empirical distribution. The distributions in the three bottom figures are rescaled withrespect to the empirical distribution in the first figure. The distribution in the second figure is a continuousapproximation of the distribution in the first figure and is used to generate the rescaled distributions in thebottom figures. For each discrete distribution, we show its maximum in the top-right corners of the plots.

the R project [37], a popular statistical computing package.The specific details of this technique are in [7].

We can approximate with splines either the CDFs or CCDFs4

of the ADs obtained at the extraction step. We chose to fitthe CCDFs rather than the CDFs because the former bettercapture the shapes of high-degree tails of our heavy-tailedADs.

Another important detail is that we can define an em-pirical CCDF to be either a left- or right-continuous stepfunction [24]. Usually an empirical CCDF at some point xis defined as the fraction of samples with values strictlylarger than x, which means that the distribution is right-continuous and that the probability of a value larger thanthe observed maximum value is zero, whereas the proba-bility of a value smaller than the observed minimum valueis unspecified. For degree distributions, we know that theprobability of a degree smaller than zero is zero,5 but we donot know the probability of a degree larger than the maxi-mum observed degree. For this reason, we decide to fit theleft-continuous variants of empirical CCDFs, i.e., we definea CCDF at some point x as the fraction of samples withvalues greater than or equal to x.

Having the original ADs fitted with splines and assumingthat our target graph size is N , we finally use the standard,inverse-CDF method to produce N random numbers thatfollow the continuous distributions given by the splines. Re-call that the inverse-CDF method is based on the observa-tion that if the CDF of N random numbers xj , j = 1 . . . N ,closely follows some function F (x), then the distribution ofnumbers yj = F (xj) is approximately uniform in the inter-val [0, 1]. As its name suggests, the inverse-CDF invertsthis observation and operates as follows [25]: given a targetCDF F (x) and a target size N of a collection of randomsamples, the method first generates N random numbers yj

uniformly distributed in [0, 1] and then outputs numbersxj = F−1(yj), where F−1(y) is the inverse of CDF F (x),i.e., F−1(F (x)) = x. The CDF of numbers xj closely fol-lows F (x). Figures 4(c), 4(d), and 4(e) show random num-bers generated this way. These numbers follow the distri-bution in Figure 4(b). To compute values of inverse CDFson N random numbers uniformly distributed in [0, 1] in ourcase, we use the predict.smooth.spline method of the Rproject. Since the random numbers produced in this wayare not, in general, integers, we convert them to integer de-gree values using the floor function. We have to use thefloor and not the ceiling function because we work with left-rather than right-continuous distributions.

The outcome of the described process is three sets ofN random numbers that represent N customer degrees dj

1,

N provider degrees dj2, and N peer degrees dj

3 of nodes in thetarget graph, j = 1 . . . N . We denote the CDFs of these ran-dom numbers by D1(d1), D2(d2), and D3(d3) respectively.By construction, these distributions are properly rescaledversions of the customer-, provider- and peer-annotation dis-tributions (ADs) in the measured AS topology.

5.3.2 Rescaling multivariate distributionsRescaling multivariate distributions is not as simple as

rescaling univariate distributions. Our approach for rescal-ing univariate distributions is not practically applicable to

4Recall that the CCDF of CDF F (x) is 1 − F (x).5For a given node, some but not all the degrees k1, k2, and k3

can be equal to zero.

rescaling multivariate distributions because it is difficult tofit distributions that have many variates and complex shapes.To rescale multivariate distributions, we use copulas [33],which are a statistical tool for quantifying correlations be-tween several random variables. Compared to other well-known correlation metrics, such as Pearson’s coefficient, cop-ulas give not a single scalar value but a function of severalarguments that fully describes complex, fine-grained detailsof the structure of correlations among the variables, i.e.,their correlation profile.

According to Sklar’s theorem [39], any p-dimensional mul-tivariate CDF F of p random variables k = (k1, . . . , kp) canbe written in the following form:

F (k) = H(u), (9)

where u is the p-dimensional vector composed of the F ’smarginal CDFs Fm(km), m = 1, . . . , p:

Fm(km) = F (∞, . . . ,∞, km,∞, . . . ,∞), (10)

u = (F1(k1), . . . , Fp(kp)). (11)

The function H is called a copula and each of its marginaldistributions is uniform in [0, 1].

Copulas play a critical role at the following two steps inour approach for rescaling multivariate distributions. First,they allow us to split a multivariate distribution of the origi-nal, measured topology into two parts: the first part consistsof the marginal distributions Fm, while the second part istheir correlation profile, i.e., copula H . These two partsare independent. Therefore, we can independently rescalethe marginal distributions and the correlation profile. Thisproperty tremendously simplifies the rescaling process. Themarginals are univariate distributions that we rescale as inSection 5.3.1, while this section contains the details of howwe rescale the correlation profile. We use copulas the secondtime to merge together rescaled marginals and their corre-lation profile to yield a rescaled multivariate distribution inits final form.

In Figure 5 we present a high-level overview of our ap-proaches for rescaling univariate and multivariate distribu-tions. To rescale an original empirical univariate distribu-tion, we first approximate it with splines and then use thesesplines to generate random numbers. We split the processof rescaling an original empirical multivariate distributioninto two independent rescaling sub-processes, i.e., rescalingthe marginals and their copula. We rescale the marginalsas any other univariate distributions. To rescale the cop-ula, we re-sample measured correlation data as we describebelow in this section. At the end of multivariate rescaling,we merge the rescaled marginals with the rescaled copulato yield a rescaled multivariate distribution. One can seefrom Figure 5 that multivariate rescaling is a “superset,” interms of actions involved, of univariate rescaling. The fol-lowing three steps summarize the high-level description ofour multivariate rescaling approach:

1. extract and rescale the univariate marginals of a multi-variate distribution as described in Section 5.3.1 (boxes (a),(b), and (c) in Figure 5);

2. extract and rescale the copula of the multivariate dis-tribution (boxes (d) and (e) in Figure 5); and

3. merge the rescaled marginals and copula yielding arescaled multivariate distribution (box (f) in Figure 5).

In the rest of this section, we provide the low-level details forthe last two steps, using the ADD multivariate distributionas an example.

At Step 2, we compute a rescaled ADD copula as follows.The collected AS topology has n nodes, and for each node i,i = 1 . . . n, we record its degree vector ki = (ki

1, ki2, k

i3) pro-

ducing an n-sized set of degree triplets. We then performstatistical simulation on this set to produce another set of adesired size that has the same correlations as the measureddata ki. Specifically, we re-sample, uniformly at randomand with replacement, N degree triplets from the set of vec-tors ki, where N is the target size of our synthetic topol-ogy. We thus obtain an N-sized set of random triplets kj ,j = 1 . . . N , and we denote their joint CDF by F (k). Byconstruction, the empirical distribution of triplets kj hasthe same correlation profile as original triplets ki. This pro-cedure corresponds for box (d) in Figure 5.

Next, see box (e) in Figure 5, we compute the empir-ical copula of distribution F (k). By definition, the cop-ula of F (k) is simply the joint distribution of vectors u ineqs. (9,11). Therefore, we first compute the marginal CDFsF1(k1), F2(k2), and F3(k3) as CDFs of the first, second, andthird components of vectors kj :

ujm = Fm(kj

m) = rjm/N, m = 1, 2, 3, (12)

where rjm is the rank (position number) of value kj

m in theN-sized list of values km sorted in the non-decreasing order.Random triplets uj = (F1(k

j1), F2(k

j2), F3(k

j3)) are uniformly

distributed in the cube [0, 1]3, and their joint CDF H(u),u = (F1(k1), F2(k2), F3(k3)), is the empirical copula for dis-tribution F (k), cf. eq. (9), that describes the correlationsamong k1, k2, and k3.

At Step 3, box (f) in Figure 5, we merge the rescaledmarginals Dm(dm), m = 1, 2, 3, from Section 5.3.1 and cop-ula H(u) by computing the target graph degree tripletsqj = (qj

1, qj2, q

j3), j = 1, . . . , N , as

qjm = D−1

m (ujm), (13)

where D−1m are inverse CDFs of Dm from Section 5.3.1.

By construction, the correlation profile of annotation-degreevectors qj is the same as of the ADD in the original topology,while the distributions of their components qj

m are rescaledADs.

Algorithm 1 lists the described low-level details of ourmultivariate rescaling, using the ADD as an example.

We conclude our discussion of rescaling with the follow-ing remark. Recall from the end of Section 4.2 that we havethe following two types of multivariate statistics: the ADDand the JDDs. As illustrated in Figure 5, we rescale theADD using all the three steps described in this section. Forrescaling a JDD, it is not necessary to separately rescale itsmarginals, i.e., to use the first step of the described rescal-ing process, since the marginals of JDDs are distributionsof scalar degrees that we automatically rescale during theADD rescaling. To rescale a JDD, we execute only the sec-ond step of the described rescaling process to obtain therescaled empirical JDD copula. We then use this copula todetermine proper placement of edges in the final syntheticgraph that we construct. In other words, the last, thirdstep of our multivariate rescaling process applied to JDDstakes place during the graph construction phase, which wedescribe next.

Algorithm 1: Rescaling ADD

Input: Degree vectors ki = (ki1, k

i2, k

i3), i = 1 . . . n, of

the measured topology;Input: Size N of the target synthetic topology.

// Step 1: AD rescaling

forall m = 1, 2, 3 doLet km be the list of the mth component values ofvectors ki;Approximate distribution km by a smoothing splineSm;Sample N numbers dj

m, j = 1 . . . N , withprobability distribution given by Sm;Let Dm(dm) be the CDF of dm.

end

// Step 2: copula rescaling

Re-sample N degree triplets kj from the set of ki;forall m = 1, 2, 3 do

Let km be the list of the mth component values ofvectors kj ;Sort list km in the non-degreasing order of values;forall j = 1 . . . N do

Let rjm be the position number of value kj

m inthe sorted list;uj

m = rjm/N .

end

end

// Step 3: merge rescaled ADs and the ADD

copula

forall m = 1, 2, 3 doforall j = 1 . . . N do

qjm = D−1

m (ujm).

end

end

Output: Degree vectors qj = (qj1, q

j2, q

j3), j = 1 . . . N ,

of the synthetic topology.

5.4 ConstructionWe describe the 1K- and 2K-annotated random graph

constructors that are both generalizations of the well-knownconfiguration or pseudograph approach in the terminologyof [29]. The 1K-constructor requires only the rescaled ADD,while the 2K-constructor needs also the rescaled JDD cop-ulas.

5.4.1 Constructing1K-annotated random graphsUsing the rescaled degree vectors qj , j = 1 . . . N , we con-

struct 1K-annotated random graphs using the following al-gorithm:

1. for each vector qj = (qj1, q

j2, q

j3), prepare a node with qj

1

customer stubs, qj2 provider stubs, and qj

3 peer stubs;

2. randomly select pairs of either customer-and-provideror peer-and-peer stubs, and connect (match) them to-gether to form c2p or p2p links;

3. remove unmatched stubs, multiple edges between thesame pair of nodes (loops), links with both ends con-nected to the same node (self-loops), and extract thelargest connected component.

Figure 5: Overview of rescaling univariate and multivariate distributions.

The last step deals with the known problem of the pseu-dograph approach. As its name suggests, it does not neces-sarily produce simple connected graphs. In general, it gener-ates pseudographs, i.e., graphs with (self-)loops, consistingof several connected components. The size of the largestconnected component is usually comparable with the totalpseudograph size, while all others are small. Extraction ofthis largest connected component and removal of all (self-)loops6 alters the target degree distributions. Therefore,the resulting simple connected graph has a slightly differentADD than the one on the algorithm input.

Annotations alleviate this problem since they introducea series of additional constraints. For example, in the non-annotated case, loops tend to form between high-degree ASs,simply because these ASs have a lot of stubs attached tothem after step 1 of the algorithm. In the annotated case,the number of such loops is smaller because most stubs at-tached to high-degree ASs are annotated as provider stubsthat can be matched only with customer stubs attachedmostly to low-degree ASs.

Still, the 1K-annotated random graphs are not perfectas, for example, p2p links might end up connecting nodeswith drastically dissimilar degree, cf. the JDD discussion inSection 4.2. The 2K-annotated random graphs do not havethis problem.

5.4.2 Constructing2K-annotated random graphsEarlier work [29] extends the pseudograph approach to

non-annotated 2K-distributions. We extend it even furtherfor the 2K-annotated case in the following algorithm:

1. for each vector qj = (qj1, q

j2, q

j3), prepare a node with

qj1 customer stubs, qj

2 provider stubs, qj3 peer stubs,

and total degree qj = |qj |1;

2. determine the total numbers nc2p and np2p of c2p andp2p edges in the target graph as the maximum pos-sible number of customer-and-provider and peer-and-peer stubs that can be matched within the stub col-lection qj ;

3. rescale the c2p and p2p JDD copulas7 to target sizesof nc2p and np2p degree pairs (q, q′) corresponding toc2p and p2p edges between nodes of total degrees qand q′ in the target graph;

6Self-loops are removed, while multiple edges between thesame pair of nodes are mapped to a single edge between thetwo nodes.7See the remark at the end of Section 5.3.2.

4. for each c2p (or p2p) degree pair (q, q′) select ran-domly a customer (or peer) stub attached to a node ofdegree q and a provider (or peer) stub attached to anode of degree q′ and form a c2p (or p2p) edge;

5. use the procedure described below to rewire (self-)loops;

6. remove unmatched stubs, remaining (self-)loops, andextract the largest connected component.

The following rewiring procedure reduces the number ofedges removed from the final graph. For each edge involvedin a (self-)loop between nodes of degrees q1 and q2, we ran-domly select two non-adjacent nodes of degrees q1 and q2

and move the edge to these nodes. This procedure retainsa large number of edges that would, otherwise, be removedfrom the graph. In theory, this procedure may skew the orig-inal 2K-summary statistics. In practice, however, it altersthese statistics negligibly.

The resulting graph has both the ADD and JDDs ap-proximately the same as those obtained after rescaling. Mi-nor discrepancies are due to the last step of the algorithm,but the number of (self-)loops and small connected compo-nents are even smaller than in the 1K-annotated case. Thereason for these improvements is yet additional structuralconstraints, compared with the 1K-annotated case. For ex-ample, the JDD-induced constraints force the algorithm tocreate only one link between a pair of high-degree nodes, orno links between a pair of nodes of degree 1, thus avoidingcreation of many connected components composed of suchnode pairs. The original graph does not have such links, andthe rescaled JDDs preserve these structural properties, thusimproving the resulting graph quality.

6. EVALUATIONIn this section, we present results of evaluation of our

2K-annotated graph generation method. We also evaluatedthe 1K-annotated generator and found that, as expected,it produced less accurate graphs with defects such as thosementioned in Section 4.2, e.g., with p2p links connectingASs of dissimilar degrees, etc.

Experiments. To evaluate the accuracy of our 2K-annotatedgenerator, we want to compare graphs it produces with themeasured annotated Internet AS graph from Section 5.2.To simplify comparisons, we select one, most representa-tive graph from a set of 50 random synthetic graphs. Weselect this most representative graph as follows. We firstlook for a simple graph metric that exhibits high variability

across the generated graphs. One such metric is the maxi-mum degree. The expected maximum degree in an n-nodegraph with a power-law degree distribution P (k) ∼ k−γ is

kmax ≈ n1/(γ−1) [4]. Exponent γ is approximately 2.1 forthe Internet AS topology. This value of γ stays constant asthe Internet grows, and it implies almost linear scaling ofthe maximum degree since 1/(γ − 1) ≈ 0.9, which is consis-tent with scaling of maximum degree in historical Internettopologies [36]. For these reasons, our most representativegraph is the one with its maximum degree closest to its ex-pected value, across all the generated graphs.

In addition, we evaluate the variance of important graphmetrics described below, across ensembles of random graphsthat we generate. Studying the variance properties of agraph generation technique is essential for estimating struc-tural differences between equal-sized random graphs gen-erated by the model, and for gaining insight on how suchdifferences affect performance evaluation experiments. Thevariance properties of a graph generation technique is asso-ciated with the following tradeoff. On the one hand, vari-ance should be small so that generated graphs closely matchthe observed topology. On the other hand, though, randomgraphs should not all be identical or almost identical, be-cause if they do not exhibit sufficient structural diversity,then they have little value for performance evaluation stud-ies. In our experiments, we compute and report the varianceof important graph metrics in sets of 50 equal-sized randomgraphs.

Metrics. Since it is practically impossible to comparegraphs over every existing graph metric, we select a set ofmetrics that were found particularly important in the In-ternet topology literature. These metrics include the degreedistributions that we deal with in previous sections, assorta-tivity coefficient, distance distribution, and spectrum. Theassortativity coefficient is essentially the Pearson correlationcoefficient of the joint degree distribution (JDD). Its positive(negative) values indicate that degrees of connected nodesare positively (negatively) correlated, meaning that nodeswith similar (dissimilar) degrees interconnect with higherprobabilities. The distance distribution is the distribution oflengths of the shortest paths in a graph, which we computeboth with and without constraints imposed by annotations(routing policies). The spectrum of a graph is the set ofthe eigenvalues of its Laplacian L. The Laplacian’s matrixelements Lij are −1/(kikj)

1/2 if there is an edge betweennode i of degree ki and node j of degree kj ; 1 if i = j; and0 otherwise. Among the n eigenvalues of L, the smallestnon-zero and largest eigenvalues are most interesting, sincethey provide tight bounds to a number of important networkproperties. For more details on these and other metrics, andwhy they are important, see [30].

Results. In Figure 6 we plot the ADs of the measured AStopology and of the most representative synthetic graph ofthe equal size. We observe that the distributions of the cus-tomer, provider, and peer degrees in the synthetic graph arevery close to the corresponding distributions in the measuredtopology. The close match demonstrates that: 1) spline-smoothing accurately models complex ADs of real Inter-net topologies, 2) random number generation yields empiri-cal distributions that follow the modeled distributions, and3) rewiring and removal of (self-)loops do not introduce anysignificant artifacts. It is, of course, expected that our gen-erator accurately reproduces ADs, as they are part of the

1 5 10 50 100 500 1000

1e−

041e

−03

1e−

021e

−01

1e+

00

number of stubs

CC

DF

synthetic customer stubssynthetic provider stubssynthetic peer stubsreal customer stubsreal provider stubsreal peer stubs

Figure 6: CCDFs of the number of customer,provider, and peer stubs in the synthetic versusmeasured AS topology.

2K-summary statistics we explicitly model. We also confirmthat synthetic graphs, also as expected, closely reproduce allthe other summary statistics that we explicitly model: theDD, ADD, and JDDs of the synthetic graph are very closeto the originals. We do not show the corresponding plotsfor brevity.

In Figures 7(a) and 7(b) we compare the distance distri-butions of the measured and equal-sized synthetic topologyignoring and accounting for annotation-induced, i.e., rout-ing policy, constraints. In the former case, we calculatelengths of the standard shortest paths between nodes ina graph as if the graph was non-annotated. In the lattercase, we find lengths of shortest valid, i.e., valley-free, pathsdefined in Section 3. In both the non-annotated and anno-tated cases, we observe that the distance distribution in thesynthetic graph closely matches the distance distribution inthe measured topology, even though we have not explicitlymodeled or tried to reproduce the distance distributions.We also observe that the distance distributions in the non-annotated and annotated cases are different, meaning thatannotations in the synthetic graph properly filter realistic,policy-constrained paths from the set of all possible path inthe non-annotated case.

In Table 3 we compare the measured topology with syn-thetic graphs of different sizes over a set of important scalarmetrics, including those we do not explicitly model or tryto reproduce, e.g., the eigenvalues of the Laplacian, etc. Wecompute these metrics for five synthetic graphs of sizes 5, 000,10, 000, 30, 000, and 19, 036 nodes, the last size being equalto the size of the original topology. The first three metricsare the number of (c2p or p2p) edges in a graph. We ob-serve that the number of such edges grows almost linearlywith the number of nodes. This observation is consistentwith that the average degree in historical Internet topologiesstays almost constant [36]. The fourth metric is the maxi-mum degree. As expected, the maximum degree grows withthe size of the graph slightly slower than linearly. The nextfive metrics in the table describe properties that have stayedrelatively constant in historical Internet topologies. These

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

0.5

1 2 3 4 5 6 7 8 9 10

PD

F

Distrance in AS hops

Synthetic topologyCollected topology

(a) Shortest paths.

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

0 2 4 6 8 10 12

PD

F

Distrance in AS hops

Synthetic topologyCollected topology

(b) Shortest valid paths.

Figure 7: Distance distributions in synthetic andmeasured AS topologies.

properties have small variations in the synthetic graphs aswell.

Next we investigate the benefit of modeling the ADs andADD in addition to the DD and JDDs. Previous work [29]shows that modeling DD and JDD is sufficient for captur-ing and reliably reproducing most important non-annotated

graph metrics. The main value of modeling ADs and ADDis that our generated synthetic graphs are properly anno-tated. We saw in Section 3 that the Internet topology an-notations are important. Here we provide another evidencethat they are non-trivial. Specifically, in Figures 8(a), 8(b),and 8(c) we plot the total degrees of ASs in the measuredAS topology versus their annotation degrees: the number ofcustomers k1, providers k2, and peers k3, respectively. Weobserve that a given total degree can correspond to a widerange of different values of k1, k2, and k3. The JDD pro-vides information only on total node degrees and on theircorrelations, whereas it is completely agnostic to annotationdegrees. On the other hand, the ADs and ADD capture thedistribution of annotation degrees and the correlations be-tween annotation degrees, respectively. Therefore, the JDDalone is in principle incapable of capturing topology anno-tations, while the benefit of modeling ADs and ADD lies inreproducing realistic annotations in generated graphs.

To quantify the variance properties of randomly generatedgraphs, we compute the standard deviation of our metricsacross sets of 50 random graphs. We construct 4 sets withtopologies of 5,000, 10,000, 19,036, and 30,000 nodes, a totalof 200 random graphs. Among our evaluation metrics, wedo not compute the eigenvalues of the Laplacian and the as-sortativity coefficient, since they require prohibitively longcomputation times for 200 graphs. In Table 4, we show thestandard deviation and mean value of the remaining metrics.

Table 4: Variance of graph metrics across sets ofequal-sized random graphs.

graph metricstd. deviation / mean

5000 10000 19036 30000nodes nodes nodes nodes

c2p edges 399/17,905 267/36,876 349/71,055 410/112,549p2p edges 100/1,238 203/2,980 396/6,412 541/12,863

max degree 387/1,618 417/2,090 471/2,335 376/2,599av. degree 0.08/3.83 0.03/3.99 0.02/4.07 0.02/4.11

av. distance 0.13/3.16 0.09/3.40 0.10/3.61 0.06/3.77

The maximum degree exhibits the highest standard devia-tion (with respect to the mean) taking values between 376and 471 for graphs of different size. The high variance of themaximum degree is expected, since the degree distributionof Internet topologies is highly skewed. On the other hand,the remaining metrics in Table 4 exhibit low variance. Thesemetrics reflect aggregate graph properties and can be mod-eled as a sum of many i.i.d. random variables. Therefore,according to the central limit theorem, their distributionis approximately normal and their variance is consequentlysmaller than the variance of the maximum degree.

An important difference between the graph generationmethod described in this study and the graph generationmethods described in our previous work [29] is that the for-mer exhibits higher variance. The two methods are concep-tually similar in generating synthetic graphs that reproducethe correlation profile of an observed topology—albeit [29]does not consider annotations. They differ in that our pre-vious techniques directly use the degree distribution or cor-relations of an observed topology to generate new similartopologies. On the other hand, the present work first mod-els the degree correlations of a topology and then uses ran-dom number generators to produce synthetic degree distri-butions fed into final graph constructors. In simpler words,our present technique induces more randomness by meansof the synthetic generation of degree correlations based onthe correlation profile extracted from the real topology. Thetwo approaches are complementary and together provide awider range of options for generating synthetic topologieswith desired variance characteristics.

Overall, our evaluation results show that:

• 2K-annotated random graphs generated with our ap-proach faithfully reproduce a number of important prop-erties of Internet topologies;

• rescaled graphs exhibit the expected behavior accord-ing to a number of definitive graph metrics, i.e., thesemetrics are either properly-rescaled or stay relativelystable as the size of synthetic graphs varies;

• the profile of correlations between annotation and totaldegrees is diverse; and

• random graphs generated with our method exhibit smallvariance, although higher than in our previous work [29].

7. CONCLUSIONSIn this work, we have focused on the problem of gener-

ating synthetic annotated graphs that model real complexnetworks. Our techniques are likely to have many applica-tions not only in networking, but also in other disciplines

Table 3: Scalar metrics of synthetic and collected graphs. Note that smallest eigenvalues are positive, butsome may round to zero.

Number of nodes

Metric19036 5000 10000 19036 30000

(measured) (synthetic) (synthetic) (synthetic) (synthetic)Number of edges 40115 10179 20730 39595 62853

Number of c2p edges 36188 9409 18917 36146 56870Number of p2p edges 3927 770 1813 3448 5983

Maximum degree 2384 1014 1492 2385 3461Average degree 4.21 4.07 4.15 4.16 4.19

Assortativity coefficient -0.20 -0.30 -0.24 -0.25 -0.18Largest eigenvalue of Laplacian 1.97 1.85 1.88 1.91 1.92Smallest eigenvalue of Laplacian 0.03 0.00 0.12 0.09 0.00

Average distance 3.76 3.26 3.52 3.57 3.75

1

10

100

1000

10000

1 10 100 1000 10000

num

ber

of c

usto

mer

s

total node degree

(a) Total degree vs. number of cus-tomers.

0 2 4 6 8

10 12 14 16 18 20

1 10 100 1000 10000

num

ber

of p

rovi

ders

total node degree

(b) Total degree vs. number ofproviders.

1

10

100

1000

1 10 100 1000 10000

num

ber

of p

eers

total node degree

(c) Total degree vs. number of peers.

Figure 8: Scatterplots demonstrating the diversity of annotation degrees and that total node degrees areagnostic with respect to annotations.

where annotated graphs are used to abstract and representnetwork structure. For example, two groups have recentlycontacted us to discuss our techniques as they were search-ing for tools to generate synthetic, semantic-rich, i.e., anno-tated, networks for their simulation studies. The first groupworks on modeling the European powerline networks, whilethe second is in brain and neural network research. Othernetworks to which our techniques are immediately applicableinclude the router-level Internet, WWW, networks of criti-cal resources dependencies, as well as many types of socialand biological networks, such as regulatory pathways [35].

A number of open problems remain. In particular, ourtechniques construct synthetic versions of real topologiesavailable from measurement projects. However, it is well-known that in many cases, the outcome of measurementsdoes not accurately represent a real complete topology. Infact, there might exist inherent limitations in measuring cer-tain network topologies with 100% accuracy. A venue forfurther research is the development of prediction techniquesthat extrapolate what we can presently measure in order topredict what we can not measure.

Another substantial problem is the difficulties in validat-ing results of topology inference studies. For example, inthe specific context of Internet topologies, validation is hardbecause of the unwillingness of service providers to releasedata on their infrastructure, network design, configuration,and performance. On the other hand, validation of any re-search result is a cornerstone to its reliability and utility.Therefore we believe it is imperative to focus on new valida-tion techniques that would combine the limited ground truthdata available today with convincing testbed or simulationexperiments.

8. ACKNOWLEDGMENTSThis work was supported in part by NSF CNS-0434996

and CNS-0722070, by DHS N66001-08-C-2029, and by CiscoSystems.

9. REFERENCES[1] W. Aiello, F. Chung, and L. Lu. A random graph

model for massive graphs. In STOC, 2000.

[2] R. Albert and A.-L. Barabasi. Topology of evolvingnetworks: Local events and universality. Phys Rev

Lett, 85(24):5234–5237, 2000.

[3] M. Boguna and R. Pastor-Satorras. Class of correlatedrandom networks with hidden variables. Phys Rev E,68:036112, 2003.

[4] M. Boguna, R. Pastor-Satorras, and A. Vespignani.Cut-offs and finite size effects in scale-free networks.Eur Phys J B, 38:205–209, 2004.

[5] M. Boguna and M. A. Serrano. Generalizedpercolation in random directed networks. Phys Rev E,72:016106, 2005.

[6] J. M. Carlson and J. Doyle. Highly optimizedtolerance: A mechanism for power-laws in designedsystems. Phys Rev E, 60:1412–1427, 1999.

[7] J. M. Chambers and T. J. Hastie. Statistical Models in

S. Wadsworth and Brooks/Cole, Pacific Grove,California, 1992.

[8] H. Chang, S. Jamin, and W. Willinger. To peer or notto peer: Modeling the evolution of the Internet’sAS-level topology. In INFOCOM, 2006.

[9] F. Chung and L. Lu. Connected components inrandom graphs with given degree sequences. Ann

Comb, 6:125–145, 2002.

[10] A. Clauset and C. Moore. Accuracy and scalingphenomena in Internet mapping. Phys Rev Lett,94:018701, 2005.

[11] L. Dall’Asta, I. Alvarez-Hamelin, A. Barrat,A. Vazquez, and A. Vespignani. Exploring networkswith traceroute-like probes: Theory and simulations.Theor Comput Sci, Complex Networks, 355(1):6–24,2006.

[12] C. de Launois. GHITLE: Generator of HierarchicalInternet Topologies using LEvels.http://ghitle.info.ucl.ac.be/.

[13] X. Dimitropoulos, D. Krioukov, M. Fomenkov,B. Huffaker, Y. Hyun, kc claffy, and G. Riley. ASrelationships: Inference and validation. Comput

Commun Rev, 37(1), 2007.

[14] X. Dimitropoulos, D. Krioukov, B. Huffaker, kc claffy,and G. Riley. Inferring AS relationships: Dead end orlively beginning? In WEA, 2005.

[15] X. Dimitropoulos, D. Krioukov, G. Riley, and kc claffy.Revealing the Autonomous System taxonomy: Themachine learning approach. In PAM, 2006.

[16] X. Dimitropoulos and G. Riley. Creating realisticBGP models. In MASCOTS, 2003.

[17] M. Doar. A better model for generating test networks.In GLOBECOM, 1996.

[18] P. Erdos and A. Renyi. On random graphs. Publ

Math, 6:290–297, 1959.

[19] M. Faloutsos, P. Faloutsos, and C. Faloutsos. Onpower-law relationships of the Internet topology. InSIGCOMM, 1999.

[20] O. Frank and D. Strauss. Markov graphs. J Am Stat

Assoc, 81(395):832–842, 1986.

[21] L. Gao. On inferring Autonomous Systemrelationships in the Internet. IEEE ACM T Network,9(6), 2001.

[22] L. Gao and F. Wang. The extent of AS path inflationby routing policies. In Global Internet, 2002.

[23] C. Gkantsidis, M. Mihail, and E. Zegura. The Markovchain simulation method for generating connectedpower law random graphs. In ALENEX, 2003.

[24] E. Hewitt. Real and Abstract Analysis: A Modern

Treatment of the Theory of Functions of a Real

Variable. Springer Verlag, 1975.

[25] W. Hormann, J. Leydold, and G. Derflinger.Automatic Nonuniform Random Variate Generation.Springer-Verlag, Berlin, 2004.

[26] D. Krioukov, kc claffy, K. Fall, and A. Brady. Oncompact routing for the Internet. Comput Commun

Rev, 37(3):41–52, 2007.

[27] A. Lakhina, J. Byers, M. Crovella, and P. Xie.Sampling biases in IP topology measurements. InINFOCOM, 2003.

[28] L. Li, D. Alderson, W. Willinger, and J. Doyle. Afirst-principles approach to understanding theInternets router-level topology. In SIGCOMM, 2004.

[29] P. Mahadevan, D. Krioukov, K. Fall, and A. Vahdat.Systematic topology analysis and generation usingdegree correlations. In SIGCOMM, 2006.

[30] P. Mahadevan, D. Krioukov, M. Fomenkov,B. Huffaker, X. Dimitropoulos, kc claffy, andA. Vahdat. The Internet AS-level topology: Threedata sources and one definitive metric. Comput

Commun Rev, 36(1):17–26, 2006.

[31] S. Maslov, K. Sneppen, and A. Zaliznyak. Detection oftopological patterns in complex networks: Correlationprofile of the Internet. Physica A, 333:529–540, 2004.

[32] A. Medina, A. Lakhina, I. Matta, and J. Byers.BRITE: An approach to universal topologygeneration. In MASCOTS, 2001.

[33] R. B. Nelson. An introduction to copulas.Springer-Verlag Lecture Notes in Statistics, 139:216,1999.

[34] C. Palmer and G. Steffan. Generating networktopologies that obey power laws. In GLOBECOM,November 2000.

[35] J. Pandey, M. Koyuturk, Y. Kim, W. Szpankowski,S. Subramaniam, and A. Grama. Functionalannotation of regulatory pathways. Bioinformatics,23(13):i377–i386, 2007.

[36] R. Pastor-Satorras and A. Vespignani. Evolution and

Structure of the Internet: A Statistical Physics

Approach. Cambridge University Press, Cambridge,

2004.

[37] The R Project for Statistical Computing.http://www.r-project.org/.

[38] University of Oregon RouteViews Project.http://www.routeviews.org/.

[39] A. Sklar. Fonctions de repartition a n dimensions etleures marges. Publications de l’Institut de Statistique

de L’Universite de Paris, 8:229231, 1959.

[40] B. Soderberg. A general formalism for inhomogeneousrandom graphs. Phys Rev E, 66:066121, 2002.

[41] B. Soderberg. Properties of random graphs withhidden color. Phys Rev E, 68:026107, 2003.

[42] B. Soderberg. Random graphs with hidden color. Phys

Rev E, 68:015102(R), 2003.

[43] B. Soderberg. Random Feynman graphs. In AIP Conf

Procs, volume 776, pages 118–132, 2005.

[44] B. M. Waxman. Routing of multipoint connections.JSAC, 1988.

[45] J. Winick and S. Jamin. Inet-3.0: Internet topologygenerator. Technical Report UM-CSE-TR-456-02,University of Michigan, 2002.

[46] E. W. Zegura, K. Calvert, and S. Bhattacharjee. Howto model an internetwork. In INFOCOM, 1996.

Graph Annotations in Modeling Complex Network...

Documents

Transcript of Graph Annotations in Modeling Complex Network...