
Data-Driven Network Analysis: Do You Really Know Your Data?

Walter Willinger, AT&T Labs-Research

[email protected]


Heard about “Network Science”?

• Recent “hot topic” area in science
  – Thousands of papers, many in high-impact journals such as Science or Nature
  – Interdisciplinary flavor: (Stat.) Physics, Math, CS
  – Main apps: Internet, social science, biology, …

• Offers an alluring new recipe for doing network analysis
  – Largely measurement-driven
  – Main focus is on universal properties
  – Exploiting the predictive power of simple models
      • small world networks: clustering and path lengths
      • scale free networks: power law degree distributions (see the sketch below)
  – Emphasis on self-organization and emergence
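To make concrete what a “power-law degree distribution” claim rests on, here is a minimal sketch (Python; the toy edge list is purely illustrative and not from the talk) of the statistic in question: the empirical complementary CDF of node degrees, the curve that gets inspected for a straight line on log-log axes.

from collections import Counter

# Toy edge list standing in for a measured graph (illustrative only).
edges = [(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (4, 5), (5, 6)]

# Node degrees.
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# Empirical complementary CDF of the degree sequence, P(D >= d).
# A "scale-free" claim amounts to this curve looking roughly linear
# on log-log axes over a wide range of d.
n = len(degree)
values = sorted(degree.values())
for d in sorted(set(values)):
    ccdf = sum(1 for x in values if x >= d) / n
    print(f"P(degree >= {d}) = {ccdf:.2f}")

The talk’s point, of course, is that such a curve is only as trustworthy as the edge list it is computed from.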


NETWORK SCIENCE

http://www.nap.edu/catalog/11516.html

• “First, networks lie at the core of the economic, political, and social fabric of the 21st century.”

• “Second, the current state of knowledge about the structure, dynamics, and behaviors of both large infrastructure networks and vital social networks at all scales is primitive.”

• “Third, the United States is not on track to consolidate the information that already exists about the science of large, complex networks, much less to develop the knowledge that will be needed to design the networks envisaged…”

January, 2006


Network Science

• What? “The study of network representations of physical, biological, and social phenomena leading to predictive models of these phenomena.” (National Research Council Report, 2006)

• Why? “To develop a body of rigorous results that will improve the predictability of the engineering design of complex networks and also speed up basic research in a variety of applications areas.” (National Research Council Report, 2006)

• Who?
  – Physicists (statistical physics), mathematicians (graph theory), computer scientists (algorithm design), etc.


As Internet researchers, why should we care?

• The teaching of “Network Science”


The “New Science of Networks”


Why should we care?

• The teaching of “Network Science”

• The claims “Network Science” makes about the Internet
  – High-degree nodes form a hub-like core
  – Fragile/vulnerable to targeted node removal
  – Achilles’ heel
  – Zero epidemic threshold

• Network Science and the Internet
  – Lies, damned lies, statistics …
  – Rich source for wrong/bad models/theories
  – The published claims about the Internet are not “controversial” – they are simply wrong!


What is wrong with “Network Science”?

• No critical assessment of available data

• Ignores all networking-related “details”

• Overarching desire to reproduce observed properties of the data, even when the quality of the data does not allow any confident statement about those properties

• Reduces model validation to the ability to reproduce an observed statistic of the data (e.g., the node degree distribution)


How to fix “Network Science”?

• Know your data!
  – Importance of data hygiene

• Take model validation more seriously!
  – Model validation ≠ data fitting

• Apply an engineering perspective to engineered systems!
  – Design principles vs. random coin tosses


Some Illustrative Examples

• Example 1
  – Data: Traceroute measurements
  – Objective: Inferring Internet topology at the router level

• Example 2
  – Data: Traceroute measurements
  – Objective: Inferring Internet topology at the level of Autonomous Systems (ASes)

• Example 3
  – Data: BGP measurements
  – Objective: Inferring Internet topology at the level of Autonomous Systems (ASes)


Measurement tool: traceroute

• traceroute www.duke.edu

traceroute to www.duke.edu (152.3.189.3), 30 hops max, 60 byte packets
 1  fp-core.research.att.com (135.207.16.1)  2 ms  1 ms  1 ms
 2  ngx19.research.att.com (135.207.1.19)  1 ms  0 ms  0 ms
 3  12.106.32.1  1 ms  1 ms  1 ms
 4  12.119.12.73  2 ms  2 ms  2 ms
 5  tbr1.n54ny.ip.att.net (12.123.219.129)  4 ms  5 ms  3 ms
 6  ggr7.n54ny.ip.att.net (12.122.88.21)  3 ms  3 ms  3 ms
 7  192.205.35.98  4 ms  4 ms  8 ms
 8  jfk-core-02.inet.qwest.net (205.171.30.5)  3 ms  3 ms  4 ms
 9  dca-core-01.inet.qwest.net (67.14.6.201)  11 ms  11 ms  11 ms
10  dca-edge-04.inet.qwest.net (205.171.9.98)  11 ms  15 ms  11 ms
11  gw-dc-mcnc.ncren.net (63.148.128.122)  18 ms  18 ms  18 ms
12  rlgh7600-gw-to-rlgh1-gw.ncren.net (128.109.70.38)  18 ms  18 ms  18 ms
13  roti-gw-to-rlgh7600-gw.ncren.net (128.109.70.18)  20 ms  20 ms  20 ms
14  art1sp-tel1sp.netcom.duke.edu (152.3.219.118)  23 ms  20 ms  20 ms
15  webhost-lb-01.oit.duke.edu (152.3.189.3)  21 ms  38 ms  20 ms

• 1 traceroute measurement: about 1KB
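As a concrete illustration of what one such measurement looks like as data, here is a minimal parsing sketch (Python; the regular expression and field names are mine, not part of traceroute or of the talk) that turns hop lines like the ones above into structured records.

import re

# Matches hop lines of the form
#   "14 art1sp-tel1sp.netcom.duke.edu (152.3.219.118) 23 ms 20 ms 20 ms"
# or "3 12.106.32.1 1 ms 1 ms 1 ms" (no reverse-DNS name).
HOP_RE = re.compile(
    r"^\s*(\d+)\s+(\S+)(?:\s+\((\d+\.\d+\.\d+\.\d+)\))?((?:\s+[\d.]+ ms)+)"
)

def parse_hop(line):
    m = HOP_RE.match(line)
    if not m:
        return None                      # e.g. "* * *" timeout lines
    hop, name, ip, rtts = m.groups()
    return {
        "hop": int(hop),
        "name": name if ip else None,    # with no "(ip)" part, the name IS the IP
        "ip": ip or name,
        "rtts_ms": [float(x) for x in re.findall(r"[\d.]+(?= ms)", rtts)],
    }

print(parse_hop(" 5  tbr1.n54ny.ip.att.net (12.123.219.129)  4 ms  5 ms  3 ms"))
print(parse_hop(" 3  12.106.32.1  1 ms  1 ms  1 ms"))

Even at this level, unanswered probes (“* * *”) and load-balanced paths mean the raw text does not always map cleanly onto a single forward path.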


Large-scale traceroute experiments

1 million x 1 million traceroutes: 1PB
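The back-of-the-envelope arithmetic behind that figure, using the roughly 1 KB per measurement from the previous slide:

sources = 10**6                 # 1 million vantage points
destinations = 10**6            # 1 million targets
bytes_per_traceroute = 1_000    # "about 1 KB" per measurement

total_bytes = sources * destinations * bytes_per_traceroute
print(total_bytes)              # 10**15 bytes, i.e. about 1 PB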


http://www.isi.edu/scan/mercator/mercator.html

Two Examples of inferred ISP topology


About the Traceroute tool (1)

• traceroute is strictly about IP-level connectivity
  – Originally developed by Van Jacobson (1988)
  – Designed to trace out the route to a host

• Using traceroute to map the router-level topology
  – Engineering hack
  – Example of what we can measure, not what we want to measure!

• Basic problem #1: IP alias resolution problem (see the sketch below)
  – How to map interface IP addresses to IP routers
  – Largely ignored or badly dealt with in the past
  – New efforts in 2008 for better heuristics …
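To make the alias resolution problem concrete, here is a minimal sketch (Python; the addresses are illustrative, and the alias pairs are assumed to come from some external heuristic such as IP-ID-based probing, which is the hard and error-prone part) of collapsing interface-level traceroute links into a router-level graph.

# Interface-level links observed by traceroute (illustrative addresses).
interface_links = {("10.0.0.1", "10.0.1.2"), ("10.0.2.1", "10.0.3.2"),
                   ("10.0.1.2", "10.0.4.1")}

# Alias pairs: interfaces believed to sit on the same router, as reported
# by an external alias resolution heuristic.
alias_pairs = [("10.0.1.2", "10.0.2.1")]

# Union-find over interfaces: each set of aliases becomes one router.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for a, b in alias_pairs:
    union(a, b)

# Router-level links: map each interface to its router representative.
router_links = {tuple(sorted((find(a), find(b)))) for a, b in interface_links}
router_links = {link for link in router_links if link[0] != link[1]}
print(router_links)

Every alias pair the heuristic misses leaves one physical router split across several graph nodes, inflating node counts and distorting degrees.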


Interfaces 1 and 2 belong to the same router


IP Alias Resolution Problem for Abilene (thanks to Adam Bender)


About the Traceroute tool (2)

• traceroute is strictly about IP-level connectivity

• Basic problem #2: Layer-2 technologies (e.g., MPLS, ATM)
  – MPLS is an example of a circuit technology that hides the network’s physical infrastructure from IP
  – Sending traceroutes through an opaque Layer-2 cloud results in the “discovery” of high-degree nodes, which are simply an artifact of an imperfect measurement technique (see the sketch below)
  – This problem has been largely ignored in all large-scale traceroute experiments to date
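A toy illustration of the second problem (Python; the topology and the assumption that the Layer-2 cloud is completely invisible at the IP layer are mine): if N edge routers hang off one opaque cloud, traceroutes between them make every pair look directly adjacent, so the inferred IP-level graph shows degree N-1 where the physical degree of each router is 1.

from collections import Counter
from itertools import combinations

N = 20
edge_routers = [f"R{i}" for i in range(N)]

# Physical picture: each edge router has exactly one link, into the Layer-2 cloud.
physical_degree = {r: 1 for r in edge_routers}

# What traceroute "sees" under the assumption that the cloud neither
# decrements TTL nor answers probes: any two edge routers appear
# directly adjacent at the IP level.
inferred_links = set(combinations(edge_routers, 2))
inferred_degree = Counter()
for a, b in inferred_links:
    inferred_degree[a] += 1
    inferred_degree[b] += 1

print("physical degree of R0:", physical_degree["R0"])   # 1
print("inferred degree of R0:", inferred_degree["R0"])    # N - 1 = 19

None of these high inferred degrees correspond to a physical high-degree device; they are exactly the kind of measurement artifact the slide describes.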


About the Traceroute tool (3)

• The irony of traceroute measurements
  – The high-degree nodes in the middle of the network that traceroute reveals are not for real …
  – If there are high-degree nodes in the network, they can only exist at the edge of the network, where they will never be revealed by generic traceroute-based experiments …

• Additional irony
  – Bias in (mathematical abstraction of) traceroute
  – Has been a major focus within the CS/networking literature
  – Non-issue in the presence of the above-mentioned problems


Example 1: Lessons learned

• Know your measurement technique!
  – Question: Can you trust the data obtained by your tool?

• Know your data!
  – Critical role of data hygiene in the Petabyte Age
  – Corollary: petabytes of garbage = garbage
  – Data hygiene is often viewed as “dirty/unglamorous” work
  – Question: Can the data be used for the purpose at hand?

• Regarding Example 1:
  – (Current) traceroute measurements are of (very) limited use for inferring router-level connectivity
  – It is unlikely that future traceroute measurements will be more useful for the purpose of router-level inference


A textbook example for what can go wrong …

• J.-J. Pansiot and D. Grad, “On routes and multicast trees in the Internet,” ACM Computer Communication Review 28(1), 1998.
  – Original traceroute data – the purpose for using the data is explicitly stated
  – Most of the issues with traceroute are listed!

• M. Faloutsos, P. Faloutsos, and C. Faloutsos, “On power-law relationships of the Internet topology,” Proc. ACM SIGCOMM 1999.
  – Rely on the Pansiot-Grad data, but use it for a very different purpose
  – Take the available data at face value, even though Pansiot/Grad list most of the problems
  – There is no scientific basis for the reported power-law findings!

• R. Albert, H. Jeong, and A.-L. Barabasi, “Error and attack tolerance of complex networks,” Nature, 2000.
  – Do not even cite the original data source (i.e., Pansiot/Grad)
  – Take the results of FFF’99 at face value
  – The reported results are all wrong!


Applying lessons to Example 2

• Example 2: Use of traceroute measurements to infer Internet topology at the level of Autonomous Systems (ASes)

• Know your measurement technique!
  – traceroute (see Example 1)

• Know your data!
  – Main source of errors: IP address sharing between BGP neighbors makes mapping traceroute paths to AS paths very difficult (see the sketch below)
  – Up to 50% of traceroute-derived AS adjacencies appear to be bogus
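As a sketch of where such errors enter (Python; the prefix-to-AS table and the AS numbers are illustrative placeholders, not real assignments), here is the basic hop-IP-to-AS mapping step via longest-prefix match.

import ipaddress

# Hypothetical prefix -> origin-AS table, e.g. derived from BGP RIB dumps
# (prefixes and AS numbers are illustrative placeholders).
PREFIX_TO_AS = {
    ipaddress.ip_network("12.0.0.0/8"): 65001,
    ipaddress.ip_network("205.171.0.0/16"): 65002,
    ipaddress.ip_network("152.3.0.0/16"): 65003,
}

def ip_to_as(ip_str):
    """Longest-prefix match of a hop address against the prefix table."""
    addr = ipaddress.ip_address(ip_str)
    best = None
    for net, asn in PREFIX_TO_AS.items():
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, asn)
    return best[1] if best else None        # None: address not covered by the table

def traceroute_to_as_path(hop_ips):
    """Collapse a traceroute's hop IPs into a (noisy) AS-level path."""
    as_path = []
    for ip in hop_ips:
        asn = ip_to_as(ip)
        if asn is not None and (not as_path or as_path[-1] != asn):
            as_path.append(asn)
    return as_path

# The pitfall the slide points to: a border router often replies from an
# interface numbered out of its *neighbor's* address space, so the inferred
# AS path can gain or lose AS-level edges that were never traversed.
print(traceroute_to_as_path(["12.123.219.129", "205.171.30.5", "152.3.189.3"]))

Border interfaces numbered out of a neighbor’s address space, unanswered hops, and shared infrastructure prefixes all corrupt this mapping, which is one reason so many traceroute-derived AS adjacencies turn out to be suspect.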


Applying lessons to Example 2 (cont.)

• Regarding Example 2
  – (Current) traceroute measurements are of (very) limited use for inferring AS-level connectivity
  – Obtaining the “ground truth” is very challenging
  – It is possible that in the future, more targeted traceroute measurements in conjunction with BGP data will be more useful for the purpose of inferring AS-level connectivity


Applying lessons to Example 3

• Example 3: Use of BGP data to infer Internet topology at the level of Autonomous Systems (ASes)

• Know your measurement technique!
  – BGP – de facto inter-domain routing protocol
  – BGP – designed to propagate reachability information among ASes, not connectivity information
  – Engineering hack – not designed to obtain connectivity information
  – Example of what we can measure, not what we want to measure!
  – Collect BGP routing information base (RIB) information from as many routers as possible (see the sketch below)
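To show what “collecting RIBs” yields in practice, here is a minimal sketch (Python; the AS_PATH strings and AS numbers are made up for illustration) of turning RIB-style AS paths into AS-level adjacencies.

# Hypothetical AS_PATH attributes as they might appear in a text dump of a
# route collector's RIB (AS numbers are made up for illustration).
RIB_AS_PATHS = [
    "65010 65020 65020 65020 65030",   # repeated ASN = path prepending
    "65040 65050 65030",
]

def as_edges(as_path_str):
    """Turn one AS_PATH into AS-level edges, collapsing prepending."""
    asns = []
    for token in as_path_str.split():
        if token.startswith("{"):          # AS_SET: ordering undefined, skip
            continue
        asn = int(token)
        if not asns or asns[-1] != asn:    # drop consecutive duplicates
            asns.append(asn)
    return set(zip(asns, asns[1:]))

adjacencies = set()
for path in RIB_AS_PATHS:
    adjacencies |= as_edges(path)

# Note: this is the union of links on the *best* paths visible from the
# collector's vantage points -- reachability information, not the AS graph.
print(sorted(adjacencies))

Links that never appear on any advertised best path (many peer-to-peer links in particular) are simply invisible to this procedure, which is what the next slide is about.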


Applying lessons to Example 3 (cont.)

• Know your data!
  – Examining the hygiene of BGP measurements requires significant commitment and domain knowledge (see the sketch below)
  – Parts of the available data seem accurate and solid (i.e., customer-provider links, nodes)
  – Parts of the available data are highly problematic and incomplete (i.e., peer-to-peer links)
  – “Ground truth” is hard to come by

• Regarding Example 3
  – (Current) BGP-based measurements are of questionable quality for inferring AS-level connectivity
  – Obtaining the “ground truth” is very challenging
  – It is possible that in the future, more targeted traceroute measurements in conjunction with BGP data will be more useful for the purpose of inferring AS-level connectivity
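One simple example of the kind of hygiene check this calls for (Python; the per-vantage-point link sets are hypothetical, e.g. produced by the AS_PATH sketch above, one set per route-collector peer): count how many vantage points actually observe each AS-level link, since a link seen from a single monitor deserves far less confidence.

from collections import defaultdict

# Hypothetical input: AS-level links seen by each vantage point.
links_by_vantage_point = {
    "collector-peer-A": {(65010, 65020), (65020, 65030)},
    "collector-peer-B": {(65010, 65020), (65040, 65050), (65050, 65030)},
}

# For each (undirected) link, record which vantage points observed it.
seen_by = defaultdict(set)
for vp, links in links_by_vantage_point.items():
    for a, b in links:
        seen_by[tuple(sorted((a, b)))].add(vp)

for link, vps in sorted(seen_by.items()):
    note = "corroborated" if len(vps) > 1 else "single vantage point -- treat with caution"
    print(link, "->", note)

A check like this does not recover missing peer-to-peer links, but it at least separates the well-corroborated part of the inferred graph from the part that rests on a single observation point.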


A Reminder

• Data-driven network analysis in the presence of high-quality data that can be taken at face value
  – “All models are wrong … but some are useful” (G.E.P. Box)

• Data-driven network analysis in the presence of highly ambiguous data that should not be taken at face value
  – “When exactitude is elusive, it is better to be approximately right than certifiably wrong.” (B.B. Mandelbrot)


SOME RELATED REFERENCES

• L. Li, D. Alderson, W. Willinger, and J. Doyle. A first-principles approach to understanding the Internet's router-level topology. Proc. ACM SIGCOMM 2004.

• J.C. Doyle, D. Alderson, L. Li, S. Low, M. Roughan, S. Shalunov, R. Tanaka, and W. Willinger. The "robust yet fragile" nature of the Internet. PNAS 102(41), 2005.

• D. Alderson, L. Li, W. Willinger, and J.C. Doyle. Understanding Internet Topology: Principles, Models, and Validation. IEEE/ACM Transactions on Networking 13(6), 2005.

• L. Li, D. Alderson, J.C. Doyle, and W. Willinger. Towards a Theory of Scale-Free Graphs: Definition, Properties, and Implications. Internet Mathematics 2(4), 2006.

• R. Oliveira, D. Pei, W. Willinger, B. Zhang, and L. Zhang. In Search of the Elusive Ground Truth: The Internet's AS-level Connectivity Structure. Proc. ACM SIGMETRICS 2008.

• B. Krishnamurthy and W. Willinger. What are our standards for validation of measurement-based networking research? Proc. ACM HotMetrics Workshop 2008.

• W. Willinger, D. Alderson, and J.C. Doyle. Mathematics and the Internet: A Source of Enormous Confusion and Great Potential. Notices of the AMS 56(2), 2009.