Toward Validation and Control of Network Models

1

Michael Mitzenmacher

Harvard University

2

Articles Related to This Talk

• A Brief History of Generative Models for Power Law and Lognormal Distributions (Internet Mathematics)

• The Future of Power Law Research (Internet Mathematics)

3

Motivation: General

• Network Science and Engineering is emerging as its own (sub)field.
  – NSF: cross-cutting area starting this year.
  – Courses: Cornell (Easley/Kleinberg), U Penn (Kearns), many others.
    • For undergrads, not just grads!
  – In popular culture: books like Linked by Barabasi or Six Degrees by Watts.
  – Other sciences: economics, biology, physics, ecology, linguistics, etc.

• What has been and what should be the research agenda?

4

My (Biased) View

• The 5 stages of networking research.

1) Observe: Gather data to demonstrate a behavior in a system. (Example: power law behavior.)

2) Interpret: Explain the importance of this observation in the system context.

3) Model: Propose an underlying model for the observed behavior of the system.

4) Validate: Find data to validate (and if necessary specialize or modify) the model.

5) Control: Design ways to control and modify the underlying behavior of the system based on the model.

5

My (Biased) View

• In networks, we have spent a lot of time observing and interpreting behaviors.

• We are currently very active in modeling.
  – Many, many possible models.
  – Perhaps easiest to write papers about.

• We now need to put much more focus on validation and control.
  – Have been moving in this direction.
  – And these are specific areas where computer science has much to contribute!

6

Models

• After observation, the natural step is to explain/model the behavior.

• Outcome: lots of modeling papers.
  – And many models rediscovered.

• Example : power laws

• Lots of history…

7

History

• In the 1990s, the abundance of observed power laws in networks surprised the community.
  – Perhaps it shouldn’t have… power laws appear frequently throughout the sciences.
    • Pareto: income distribution, 1897
    • Zipf-Auerbach: city sizes, 1913/1940s
    • Zipf-Estoup: word frequency, 1916/1940s
    • Lotka: bibliometrics, 1926
    • Yule: species and genera, 1924
    • Mandelbrot: economics/information theory, 1950s+

• Observation/interpretation were/are key to initial understanding.

• My claim: by now the mere existence of power laws should not be surprising, or necessarily even noteworthy.

• My (biased) opinion: the bar should now be very high for observation/interpretation.

8

So Many Models…

• Preferential Attachment

• Optimization (HOT)

• Monkeys typing randomly (scaling)

• Multiplicative processes

• Kronecker graphs

• Forest fire model (densification)

9

What Makes a Good Model…

• New variations coming up all of the time.

• Question: What makes a new network model sufficiently interesting to merit attention and/or publication?
  – Strong connection to an observed process.
    • Many models claim this, but few demonstrate it convincingly.
  – Theory perspective: significant new mathematical insight or sophistication.
    • A matter of taste?

• My (biased) opinion: the bar should start being raised on model papers.

10

Validation: The Current Stage

• We now have so many models.

• It is important to know the right model, to extrapolate and control future behavior.

• Given a proposed underlying model, we need tools to help us validate it.

• We appear to be entering the validation stage of research… BUT the first steps have focused on invalidation rather than validation.

11

Examples : Invalidation

• Lakhina, Byers, Crovella, Xie
  – Show that observed power-law of Internet topology might be because of biases in traceroute sampling.

• Pedarsani, Figueiredo, Grossglauser
  – Show that densification may also arise by sampling approaches, not necessarily intrinsic to network.

• Chen, Chang, Govindan, Jamin, Shenker, Willinger
  – Show that Internet topology has characteristics that do not match preferential-attachment graphs.
  – Suggest an alternative mechanism.
    • But does this alternative match all characteristics, or are we still missing some?

12

My (Biased) View

• Invalidation is an important part of the process! BUT it is inherently different from validating a model.

• Validating seems much harder.

• Indeed, it is arguable what constitutes a validation.

• Question: what should it mean to say “This model is consistent with observed data”?

13

An Alternative View

• There is no “right model”.

• A model is the best until some other model comes along and proves better.
  – Greedy refinement via invalidation in model space.
  – Statistical techniques: compare likelihood ratios for various models.

• My (biased) opinion: this is one useful approach, but not the end of the question.
  – Need methods other than comparison for confirming validity of a model.
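As a rough illustration of the likelihood-ratio comparison mentioned above, the sketch below fits a continuous power law and a lognormal to the same sample by maximum likelihood and reports the difference in log-likelihoods. The function names, the synthetic Pareto sample, and the choice of `xmin = 1.0` are assumptions made for this example, not part of the talk. The lognormal is also fitted without truncation at `xmin`, and no significance test is applied, so a real comparison would need more care.

```python
import math
import random


def power_law_loglik(xs, xmin):
    """Log-likelihood of a continuous power law on x >= xmin at the MLE exponent."""
    n = len(xs)
    log_ratio_sum = sum(math.log(x / xmin) for x in xs)
    alpha = 1.0 + n / log_ratio_sum  # maximum-likelihood exponent
    loglik = n * math.log((alpha - 1.0) / xmin) - alpha * log_ratio_sum
    return loglik, alpha


def lognormal_loglik(xs):
    """Log-likelihood of an (untruncated) lognormal at the MLE parameters."""
    n = len(xs)
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / n
    var = sum((v - mu) ** 2 for v in logs) / n  # MLE variance of log(x)
    loglik = -sum(logs) - 0.5 * n * math.log(2 * math.pi * var) - 0.5 * n
    return loglik, mu, math.sqrt(var)


def log_likelihood_ratio(xs, xmin):
    """Positive values favor the power law, negative values the lognormal."""
    pl, _ = power_law_loglik(xs, xmin)
    ln, _, _ = lognormal_loglik(xs)
    return pl - ln


# Hypothetical data: Pareto samples, whose density decays as x^(-2.5).
rng = random.Random(1)
sample = [rng.paretovariate(1.5) for _ in range(5000)]
llr = log_likelihood_ratio(sample, xmin=1.0)
print("log-likelihood ratio (power law vs lognormal):", round(llr, 1))
```

A positive ratio only says that one candidate fits better than the other on this sample, which is the "comparison" limitation the slide points out.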

14

Time-Series/Trace Analysis

• Many models posit some sort of actions.
  – New pages linking to pages in the Web.
  – New routers joining the network.
  – New files appearing in a file system.

• A validation approach: gather traces and see if the traces suitably match the model.
  – Trace gathering can be a challenging systems problem.
  – Checking the model match requires using appropriate statistical techniques and tests.
  – May lead to new, improved, better justified models.
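One way to picture a trace check, sketched under strong assumptions: suppose the trace is a list of (new node, target) link events, and the model to test is pure preferential attachment starting from a single edge between nodes 0 and 1. The code replays the trace, and for each degree k compares how many links actually went to degree-k nodes with how many the model would predict. The trace format, the function names, and the synthetic trace are all hypothetical; a real study would replace the final comparison with a proper statistical test.

```python
import random
from collections import Counter


def simulate_pa_trace(n_events, seed=0):
    """Synthetic trace: each new node links to a node chosen proportionally to degree."""
    rng = random.Random(seed)
    endpoints = [0, 1]  # node v appears deg(v) times; starts from the edge 0-1
    trace = []
    for new in range(2, n_events + 2):
        target = rng.choice(endpoints)
        trace.append((new, target))
        endpoints.extend((new, target))
    return trace


def observed_vs_expected(trace):
    """Replay a trace; count links to degree-k targets vs. the model's prediction."""
    degree = {0: 1, 1: 1}                # assumed initial graph: edge 0-1
    nodes_with_degree = Counter({1: 2})  # how many nodes currently have each degree
    total_degree = 2
    observed, expected = Counter(), Counter()
    for new, target in trace:
        # Under preferential attachment, P(target has degree k) = k * count_k / total.
        for k, count in nodes_with_degree.items():
            expected[k] += k * count / total_degree
        k = degree[target]
        observed[k] += 1
        # Apply the event: target gains a link, the new node arrives with degree 1.
        nodes_with_degree[k] -= 1
        if nodes_with_degree[k] == 0:
            del nodes_with_degree[k]
        nodes_with_degree[k + 1] += 1
        nodes_with_degree[1] += 1
        degree[target] = k + 1
        degree[new] = 1
        total_degree += 2
    return observed, expected


trace = simulate_pa_trace(5000)
observed, expected = observed_vs_expected(trace)
for k in sorted(observed)[:5]:
    print(k, observed[k], round(expected[k], 1))
```

Large, systematic gaps between the two columns on a real trace would count as evidence against the model; small gaps are consistent with it but do not by themselves validate it.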

15

Sampling and Trace Analysis

• Often, cannot record all actions.
  – Internet is too big!

• Sampling
  – Global: snapshots of entire system at various times.
  – Local: record actions of sample agents in a system.

• Examples:
  – Snapshots of file systems: full systems vs. actions of individual users.
  – Router topology: Internet maps vs. changes at subset of routers.

• Question: how much/what kind of sampling is sufficient to validate a model appropriately?
  – Does this differ among models?

16

To Control

• In many systems, intervention can impact the outcome.
  – Maybe not for earthquakes, but for computer networks!
  – Typical setting: individual agents acting in their own selfish interest. Agents can be given incentives to change behavior.

• General problem: given a good model, determine how to change system behavior to optimize a global performance function.
  – Distributed algorithmic mechanism design.
  – Mix of economics/game theory and computer science.

17

Possible Control Approaches

• Adding constraints: local or global
  – Example: total space in a file system.
  – Example: preferential attachment but links limited by an underlying metric.

• Adding incentives or costs
  – Example: charges for exceeding soft disk quotas.
  – Example: payments for certain AS level connections.

• Limiting information
  – Impact decisions by not letting everyone have true view of the system.

18

My Related Work : Hash Algorithms

• On the Internet, we need a measurement and monitoring infrastructure, for validation and control.
  – Approximate is fine; speed is key.
  – Must be general, multi-purpose.
  – Must allow data aggregation.

• Solution: hash-based architecture.
  – Eventual goal: every router has a programmable “hash engine”.
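To make "approximate is fine" and "must allow data aggregation" concrete, here is a minimal sketch of one well-known hash-based summary, a count-min sketch. The talk does not name this structure; it is swapped in purely as an illustration, and the class name, the dimensions, and the use of salted BLAKE2b as the row hash are assumptions. Estimates can overcount but never undercount, and two sketches with the same dimensions merge by adding counters, which is what lets routers combine their summaries.

```python
import hashlib


class CountMinSketch:
    """Approximate per-key counters in fixed space (illustrative only)."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # One hash function per row, derived by salting BLAKE2b with the row number.
        digest = hashlib.blake2b(
            key.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for r in range(self.depth):
            self.rows[r][self._index(r, key)] += count

    def estimate(self, key):
        # Collisions only inflate counters, so the minimum is the tightest bound.
        return min(self.rows[r][self._index(r, key)] for r in range(self.depth))

    def merge(self, other):
        """Aggregate another sketch with identical dimensions into this one."""
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must share dimensions")
        for mine, theirs in zip(self.rows, other.rows):
            for i, value in enumerate(theirs):
                mine[i] += value


# Two hypothetical routers count flows locally, then aggregate.
router_a, router_b = CountMinSketch(), CountMinSketch()
for flow in ["10.0.0.1", "10.0.0.2", "10.0.0.1"]:
    router_a.add(flow)
router_b.add("10.0.0.1", 5)
router_a.merge(router_b)
print(router_a.estimate("10.0.0.1"))
```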

19

Vision

• Three-pronged research data.

• Low: Efficient hardware implementations of relevant algorithms and data structures.

• Medium: New, improved data structures and algorithms for old and new applications.

• High: Distributed infrastructure supporting monitoring and measurement schemes.

20

The High-Level Pitch

• Lots of hash-based schemes being designed for approximate measurement/monitoring tasks.
  – But not built into the system to begin with.

• Want a flexible router architecture that allows:
  – New methods to be easily added.
  – Distributed cooperation using such schemes.

21

What We Need

[Diagram: components of the proposed architecture. Memory: on-chip memory, off-chip memory, CAM(s). Computation: hashing computation unit, unit for other computation. Communication + control: control system, communication architecture. Programming language.]

22

Lots of Design Questions

• How much space for various memory levels? How to dynamically divide memory among competing applications?

• What hash functions should be included? Openness to new hash functions?

• What programming language and functionality?

• What communication infrastructure?

• Security?

• And so on…

23

Which Hash Functions?

• Theorists:
  – Want analyzable hash functions.
  – Dislike standard assumption of perfectly random hash functions.
  – Hard to prove things about actual performance.

• Practitioners:
  – Want easy implementation, speed, small space.
  – Want simple analysis (back-of-the-envelope).
  – Will accept simulated results under right settings.

24

Why Do Weak Hash Functions Work So Well?

• In reality, assuming perfectly random hash functions seems to be the right thing to do.
  – Easier to analyze.
  – Real systems almost always work that way, even with weak hash functions!

• Can theory explain the strong performance of weak hash functions?

25

Recent Work

• A new explanation (joint work with Salil Vadhan):
  – Choosing a hash function from a pairwise independent family is enough, if data has sufficient entropy.
  – Randomness of hash function and data “combine”.
  – Behavior matches truly random hash function with high probability.

• Techniques based on theory of randomness extraction.
  – Extensions of Leftover Hash Lemma.
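For readers who have not seen a pairwise independent family, a minimal sketch follows. It uses the classic construction h(x) = ((a·x + b) mod p) mod m with a prime p and random a, b. The map into the integers mod p is pairwise independent; the final reduction mod m makes the bucket distribution only approximately uniform. The prime, the bucket count, and the random 48-bit keys standing in for "data with sufficient entropy" are assumptions chosen for this example, not taken from the talk.

```python
import random

MERSENNE_PRIME = (1 << 61) - 1  # a prime larger than any key hashed below


def make_pairwise_hash(num_buckets, rng):
    """Draw one function from the family h(x) = ((a*x + b) mod p) mod m."""
    a = rng.randrange(MERSENNE_PRIME)
    b = rng.randrange(MERSENNE_PRIME)

    def h(x):
        return ((a * x + b) % MERSENNE_PRIME) % num_buckets

    return h


def max_load(keys, num_buckets, h):
    """Largest number of keys landing in any single bucket."""
    loads = [0] * num_buckets
    for key in keys:
        loads[h(key)] += 1
    return max(loads)


rng = random.Random(7)
num_buckets = 10_000
keys = [rng.getrandbits(48) for _ in range(10_000)]  # stand-in for high-entropy data
h = make_pairwise_hash(num_buckets, rng)
print("max bucket load:", max_load(keys, num_buckets, h))
```

Rerunning with structured keys (for example consecutive integers) instead of random ones is a quick way to see how the data's own randomness matters to the observed load.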

26

What Functionality?

• Hash tables should be a basic primitive.

• “Best” hash tables: cuckoo hashing.
  – Worst case constant lookup time.
  – Simple to build, design.

• How can we make them even better?
  – Move cuckoo hashing from theory to practice!

27

Cuckoo Hashing [Pagh,Rodler]

• Basic scheme: each element gets two possible locations.

• To insert x, check both locations for x. If one is empty, insert.

• If both are full, x kicks out an old element y. Then y moves to its other location.

• If that location is full, y kicks out z, and so on, until an empty slot is found.
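The insertion rule above can be sketched as follows. This is an illustrative single-array version, not the authors' implementation: the class name, the capacity, the kick limit, and the use of BLAKE2b to derive the two locations are all assumptions. Keys are assumed never to be `None`. When the kick limit is reached, `insert` hands back whichever key is left without a slot so the caller can decide what to do with it.

```python
import hashlib


class CuckooHashTable:
    """Each key may live in exactly one of two slots (illustrative sketch)."""

    def __init__(self, capacity=16, max_kicks=32):
        self.capacity = capacity
        self.max_kicks = max_kicks
        self.slots = [None] * capacity

    def _locations(self, key):
        # Split one 16-byte digest into two slot indices.
        digest = hashlib.blake2b(repr(key).encode("utf-8"), digest_size=16).digest()
        first = int.from_bytes(digest[:8], "little") % self.capacity
        second = int.from_bytes(digest[8:], "little") % self.capacity
        return first, second

    def contains(self, key):
        # Lookup inspects at most two slots.
        return any(self.slots[i] == key for i in self._locations(key))

    def insert(self, key):
        """Return None on success, or the key left homeless if the kick limit is hit."""
        if self.contains(key):
            return None
        first, second = self._locations(key)
        for loc in (first, second):
            if self.slots[loc] is None:
                self.slots[loc] = key
                return None
        loc = first
        for _ in range(self.max_kicks):
            # Place the current key and pick up the element it kicks out.
            key, self.slots[loc] = self.slots[loc], key
            a, b = self._locations(key)
            loc = b if loc == a else a  # the evicted key's other location
            if self.slots[loc] is None:
                self.slots[loc] = key
                return None
        return key


table = CuckooHashTable(capacity=16)
for name in ["A", "B", "C", "D", "E", "F", "G"]:
    leftover = table.insert(name)
    if leftover is not None:
        print("could not place", leftover)
print(table.slots)
```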

28

Cuckoo Hashing Examples

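[Figures, slides 28–33: successive snapshots of a cuckoo hash table holding elements A–E as F and then G are inserted and existing elements are moved. The cell layout and arrows are not recoverable from this transcript.]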

34

Cuckoo Hashing Failures

• Bad case 1: inserted element runs into cycles.

• Bad case 2: inserted element has very long path before insertion completes.
  – Could be on a long cycle.

• Bad cases occur with small probability when load is sufficiently low, but that probability is not low enough.

• Theoretical solution: re-hash everything if a failure occurs.

• For 2 choices, load less than 50%, n elements gives failure rate of Θ(1/n); maximum insert time O(log n).
  – Better space utilization and rate for more choices, more elements per bucket.

35

Recent Work : A CAM-Stash

• Use a CAM (Content Addressable Memory) to stash away elements that would cause failure.
  – Joint with Kirsch/Wieder.

• Intuition: if failures were independent, probability that s elements cause failures goes to Θ(1/n^s).
  – Failures not independent, but nearly so.
  – A stash holding a constant number of elements greatly reduces failure probability.
  – Implemented as a CAM in hardware, or a cache line in hardware/software.

• Lookup requires also looking at stash.
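The independence intuition on this slide can be written out as a one-line calculation. The constant c below is a hypothetical placeholder, and independence is the slide's simplifying assumption, not a proven property.

```latex
\Pr[\text{one element causes a failure}] \approx \frac{c}{n}
\quad\Longrightarrow\quad
\Pr[\text{$s$ elements cause failures}] \approx \left(\frac{c}{n}\right)^{s}
= \Theta\!\left(\frac{1}{n^{s}}\right)
\quad \text{for constant } s .
```

Each additional element that must be stashed therefore costs roughly another factor of 1/n, which is why a stash of constant size can reduce the failure probability so sharply.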

36

Modeling : Economic Principles

• Joint work with Corbo, Jain, Parkes.

• Exploration: what models make sense for AS connectivity.
  – Extending approach of Chang, Jamin, Mao, Willinger.
  – Entering nodes link according to business model, utility function.
  – Nodes revise their links based on new entrants.
    • Like the forest fire model.

• Future considerations: how to validate such models.

37

Conclusion : My (Biased) View

• There are 5 stages of networking research.

1) Observe: Gather data to demonstrate power law behavior in a system.

2) Interpret: Explain the import of this observation in the system context.

3) Model: Propose an underlying model for the observed behavior of the system.

4) Validate: Find data to validate (and if necessary specialize or modify) the model.

5) Control: Design ways to control and modify the underlying behavior of the system based on the model.

• We need to focus on validation and control.
  – Lots of open research problems.

38

A Chance for Collaboration

• The observe/interpret stages of research are dominated by systems; modeling dominated by theory.
  – And need new insights, from statistics, control theory, economics!!!

• Validation and control require a strong theoretical foundation.
  – Need universal ideas and methods that span different types of systems.
  – Need understanding of underlying mathematical models.

• But also a large systems buy-in.
  – Getting/analyzing/understanding data.
  – Find avenues for real impact.

• Good area for future systems/theory/others collaboration and interaction.

39

More About Me

• Website: www.eecs.harvard.edu/~michaelm
  – Links to papers
  – Link to book
  – Link to blog: mybiasedcoin
    • mybiasedcoin.blogspot.com
