Hailuoto Workshop: A Statistician’s Adventures in Internetland, J. S. Marron, Department of Statistics and Operations Research, University of North Carolina, March 17, 2022

Hailuoto Workshop

A Statistician’s Adventures

in Internetland

J. S. Marron

Department of Statistics and Operations Research

University of North Carolina

April 18, 2023

Co-Authors in this work

Felix Hernández Campos, Fred Godtliebsen

George Michailidis, Cheolwoo Park

Juhyun Park, Vladas Pipiras

Vitaliana Rondonotti, David Rolls

Haipeng Shen, F. D. Smith

Richard Smith, Stilian Stoev

Murad Taqqu, Zhengyuan Zhu

Internet Traffic Background

Gigantic (worldwide) Communic’ns Networks:

Telephone Network

Internet

Both based on “connections” between 2 points

Fundamental Difference, I

Manner of Equipment Usage:

Telephone: each con’n has sole use (of ~2 wires)

Congestion → no connection (busy signal)

Internet: all connections share resources (transmissions split into small “packets”)

Congestion → packet loss & delays

Fundamental Difference, II

Distribution of duration (time) of connections:

Telephone: (roughly) exponential distrib’n (must speak & how long can hold phone?)

Internet: heavy tailed distributions (very long and very short connections!)
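
The contrast between the two duration models can be seen in a small simulation. This is only a sketch: the Pareto shape (1.2), the unit scale, and the sample size are illustrative assumptions, not values fitted to the UNC data. It compares the largest simulated duration with the median for each model.

```python
import random

def sample_exponential(n, mean, rng):
    """Draw n exponential durations with the given mean."""
    return [rng.expovariate(1.0 / mean) for _ in range(n)]

def sample_pareto(n, shape, scale, rng):
    """Draw n Pareto durations by inverse CDF: X = scale * U**(-1/shape)."""
    # 1 - rng.random() lies in (0, 1], so the power is always defined.
    return [scale * (1.0 - rng.random()) ** (-1.0 / shape) for _ in range(n)]

def max_to_median(xs):
    """Ratio of the largest value to the (upper) median."""
    s = sorted(xs)
    return s[-1] / s[len(s) // 2]

rng = random.Random(0)
expo = sample_exponential(100_000, mean=1.0, rng=rng)
pareto = sample_pareto(100_000, shape=1.2, scale=1.0, rng=rng)  # assumed shape

print("exponential max/median:", round(max_to_median(expo), 1))
print("Pareto      max/median:", round(max_to_median(pareto), 1))
```

Under these assumptions the exponential sample has no extreme values relative to its median, while the Pareto sample contains a few durations that are orders of magnitude larger than the typical one.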

Fundamental Difference, III

Mathematical Models:

Telephone: queueing theory

Poisson arrivals, exponential durations

Internet: Heavy Tail Conn’n Durations → Long Range Dependence

Internet Modes of Study

Internet Structure: Connectivity Graphs

Internet “Tomography”: use “measurements at edges” to infer “general structure and behavior”

Traffic measurements at one point: time series of packet information at one location

Internet Measurement – Modeling Goals

Models for simulation???

For “protocol fixing”: QoS & fixing gross inefficiencies

Testbed for business applications, web developers

“Goodness of Fit” of models???

“How do we know this works like real traffic?”

More realistically: How bad is it?

Our Data Collection Point

“Tap” on Main Link at UNC (U. North Carolina)

Heavy traffic both directions

35,000 web browsers

Sunsite (mirror site for large data bases)

Some indication of scale:

1998 peak traffic: ~3 minutes for 1 mil. packets

2001 peak traffic: ~1 minute for 1 mil. packets

Data Source

Sequence of Packet Header Info, such as:

Arrival Time

Source & Destination addresses

Packet Type (request, data, ack’ment, …)

Packet Size (40 – 1500 bytes)

Sequence number

Data extraction: heavy “database filtering”, by UNC Comp. Sci. folks: Jeffay, Smith, Ott, Hernandez Campos, Long

Toy Example View of Packets & Connections

[Figure: toy diagram of connections, each made up of packets]

Observations: starting time, duration (time), size (bytes), packet counts, byte counts

Mice and Elephants Graphic

“Mice and Elephants” plot:

Visual display of HTTP Responses

Show only “times of HTTP Response starts & ends” As horizontal line segments (in time)

Condense “string of packets” to only a line segment

Visually separate by adding random height Tukey’s “jitter plot” idea

Only show random sample of 5000 out of ~100,000
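
The construction described above can be sketched in a few lines. The toy flows below are hypothetical (uniform start times over a 4 hour block and Pareto-like durations with an assumed shape of 1.2); only the steps themselves (reduce each response to a start/end segment, take a random sample of 5000, add a random height) come from the slide.

```python
import random

def mice_and_elephants_segments(flows, n_show=5000, seed=0):
    """Turn (start, end) flows into (start, end, height) line segments.

    A random subset of at most n_show flows is kept, and each gets a
    uniform random height in [0, 1) for visual separation (jitter).
    """
    rng = random.Random(seed)
    shown = rng.sample(flows, min(n_show, len(flows)))
    return [(start, end, rng.random()) for start, end in shown]

# Hypothetical toy data standing in for the HTTP response records.
rng = random.Random(1)
flows = []
for _ in range(100_000):
    start = rng.uniform(0.0, 14_400.0)                        # seconds in a 4 hour block
    duration = 0.01 * (1.0 - rng.random()) ** (-1.0 / 1.2)    # assumed heavy-tailed durations
    flows.append((start, start + duration))

segments = mice_and_elephants_segments(flows)
print(len(segments), "segments; first:", segments[0])
```

Each tuple can then be drawn as a horizontal line from start to end at the given height with any plotting library (for example matplotlib's `hlines`).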

Mice and Elephants Graphic

Many very small “mice”

Few very large “elephants”

Mice and Elephants – Simulated Durations

Same start times

Sim’d Exponential Durations

No “elephants”

No mice

Only “betweens”

Exponential Model is very poor

Some Important Time Series

Binned Counts (aggregated over time)

Count: - Packets - Bytes

(often similar)

This talk: 1 or 10 ms bins
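
As an illustration of how such binned series could be formed from packet header records, here is a minimal sketch. The five packets and their times and sizes are made up for the example; times are assumed to be in seconds, so `bin_width=0.010` corresponds to 10 ms bins.

```python
def binned_counts(arrival_times, sizes, bin_width):
    """Aggregate packets into consecutive time bins of width bin_width.

    Returns (packet_counts, byte_counts), one entry per bin, with the
    first bin starting at the earliest arrival time.
    """
    if not arrival_times:
        return [], []
    t0 = min(arrival_times)
    n_bins = int((max(arrival_times) - t0) / bin_width) + 1
    packet_counts = [0] * n_bins
    byte_counts = [0] * n_bins
    for t, size in zip(arrival_times, sizes):
        k = int((t - t0) / bin_width)   # index of the bin containing t
        packet_counts[k] += 1
        byte_counts[k] += size
    return packet_counts, byte_counts

# Hypothetical packets: arrival times in seconds, sizes in bytes.
times = [0.0000, 0.0004, 0.0125, 0.0131, 0.0350]
sizes = [40, 1500, 40, 1500, 576]
packets, byte_counts = binned_counts(times, sizes, bin_width=0.010)
print(packets)
print(byte_counts)
```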

A Menu of Interesting Issues

Bin Count Time Series: Long Range dependence?

Point Process of Flow Start Times

Duration Distributions (heavy tails)

Heavy tail Durations → LRD

Relationship between Size and Duration?

Time series of packets within flows?
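
For the first item on this menu, one simple diagnostic (not necessarily the one used in this work) is the sample autocorrelation of the bin count series: correlations that decay very slowly with the lag are suggestive of long range dependence. A minimal sketch, with a made-up alternating series as input:

```python
def sample_autocorrelation(series, max_lag):
    """Sample autocorrelation at lags 0..max_lag (biased estimator)."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    variance_sum = sum(c * c for c in centered)
    acf = []
    for lag in range(max_lag + 1):
        cross = sum(centered[t] * centered[t + lag] for t in range(n - lag))
        acf.append(cross / variance_sum)
    return acf

# Made-up example: a strictly alternating series has lag-1 correlation near -1.
toy_series = [1, -1] * 50
toy_acf = sample_autocorrelation(toy_series, max_lag=3)
print([round(r, 2) for r in toy_acf])
```

In practice the input would be a packet or byte count series in 1 or 10 ms bins, examined over many lags.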

Flow Duration Distributions

Study sizes (bytes) of HTTP responses, as surrogate for time between first and last packet in mice and elephants plot

Study 4 hour time block: “Heavy Traffic Time”

Thursday Morning 8:00 AM – 12:00 Noon

In April 2001

From UNC Main Link

Flow Duration Distributions

Log-Log Complementary Cumulative Distribution Function Plot:

$\log_{10}(1 - \hat{F}(x))$ as a function of $\log_{10}(x)$

Would be linear for Pareto

(slope = − shape parameter)

~ Q-Q plot against exponential
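
A minimal sketch of how the points of such a plot could be computed from the empirical CDF, checked on simulated Pareto data. The shape value 1.5, the sample size, and the use of a least squares line are assumptions made for the illustration, not part of the analysis in this talk.

```python
import math
import random

def loglog_ccdf(data):
    """Points (log10 x, log10(1 - F_hat(x))) from the empirical CDF.

    Assumes positive data. The largest observation is skipped because
    1 - F_hat(x) = 0 there and its logarithm is undefined.
    """
    xs = sorted(data)
    n = len(xs)
    return [(math.log10(x), math.log10(1.0 - (i + 1) / n))
            for i, x in enumerate(xs[:-1])]

def least_squares_slope(points):
    """Slope of the ordinary least squares line through the points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

rng = random.Random(0)
shape = 1.5  # assumed Pareto shape parameter
pareto = [(1.0 - rng.random()) ** (-1.0 / shape) for _ in range(20_000)]
slope = least_squares_slope(loglog_ccdf(pareto))
print("fitted slope:", round(slope, 2), " minus shape:", -shape)
```

For Pareto data the points fall close to a straight line whose slope is near minus the shape parameter, which is the linearity the plot is designed to reveal.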

Flow Duration Dist’ns – log log CCDF

Log-log scale stretches quantiles

Allows “clear view of tail”

Flow Duration Dist’ns – log log CCDF

“Wiggles” about possible linear fit

Wiggles really there? Or “natural variation”?

Flow Duration Dist’ns – log log CCDF

Does Pareto Model “fit”?

Looks OK ???

Careful, have ~ 5.6 million data points

What is sampling variation?

How can we assess it?

Flow Duration Dist’ns – log log CCDF

Downey (2001) Controversy:

Log Normal fits as well?

Big problem: log normal is not heavy tailed

E.g. all moments finite (expon’l, not poly’l, tail!)

Implications for “Heavy tails → LRD” theory?

Modified theory: Hannig, Marron, Samorodnitsky and Smith (2002)

Idea: log normal (with changing parameters) → LRD

Flow Duration Dist’ns – log log CCDF

Log normal fit

Looks OK???

What is natural variation?

Flow Duration Dist’ns – log log CCDF

Interesting Viewpoint: Gong, Liu, Misra and Towsley (2001)

Key idea: “distributional fragility”

Several “very different” distributions can all “fit in tails”

Conclusions:

Careful tail fits not very interpretable

“heavy tails” may be slippery concept

Flow Duration Dist’ns – log log CCDF

Now address “sampling variation” issue:

Add overlay of 100 data sets

Same sample size

from given distribution

Gives good intuitive idea of “sampling variation”
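
The overlay idea can be sketched as a simulated envelope: draw 100 data sets of the same size from the candidate distribution and record, at each point of a grid, the smallest and largest simulated CCDF value. Everything specific below is an assumption for illustration: the Pareto sampler with shape 1.2, the grid, and a sample size of 2000 chosen for speed (the real data set has ~5.6 million points).

```python
import bisect
import random

def ccdf_at(sorted_data, x):
    """Empirical 1 - F_hat(x) for data that is already sorted."""
    return 1.0 - bisect.bisect_right(sorted_data, x) / len(sorted_data)

def simulated_envelope(sampler, n, grid, n_sims=100, seed=0):
    """Pointwise min and max of the empirical CCDF over n_sims simulated
    data sets of size n, evaluated at each x in grid."""
    rng = random.Random(seed)
    lo = [1.0] * len(grid)
    hi = [0.0] * len(grid)
    for _ in range(n_sims):
        sim = sorted(sampler(rng) for _ in range(n))
        for j, x in enumerate(grid):
            c = ccdf_at(sim, x)
            lo[j] = min(lo[j], c)
            hi[j] = max(hi[j], c)
    return lo, hi

def pareto_sampler(rng, shape=1.2):
    """One draw from a unit-scale Pareto (assumed candidate model)."""
    return (1.0 - rng.random()) ** (-1.0 / shape)

grid = [10 ** (k / 4) for k in range(1, 13)]   # x from about 1.8 to 1000
lo, hi = simulated_envelope(pareto_sampler, n=2000, grid=grid)
for x, a, b in zip(grid, lo, hi):
    print(f"x={x:8.1f}  envelope=[{a:.4f}, {b:.4f}]")
```

Plotting the data's own CCDF against `lo` and `hi` on log-log axes then shows whether its wiggles stay inside what sampling variation alone would produce under the candidate model.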

Flow Duration Dist’ns – log log CCDF

Pareto Model clearly does not fit this data

Flow Duration Dist’ns – log log CCDF

Overlay from Pareto Distribution:

Gives good intuitive idea of “sampling variation”

Shows “wiggles” are really there

not sampling artifact

Suggests distribution not “heavy tailed”?

Definition is only asymptotic …

Implications for above theory?

Flow Duration Dist’ns – log log CCDF

New concept motivated by above analysis:

Several “tail regions”:

near tail: “very complete info”, rich data, no envelope

far tail: “sketchy info”, sparse data, wide envelope

extreme tail: “no info”, beyond range of data

Important point: have these regardless of “size of data set”

Flow Duration Dist’ns – log log CCDF

Different Viewpoint: Look across time blocks

7 week days: Sun. - Sat.

3 time blocks:

Morning 8:00 AM – 12:00

Afternoon 12:00 – 4:00 PM

Evening 7:30 PM – 11:30

21 log – log CCDF plots:

All overlaid in blue, each highlighted in red

Structure amazingly similar!

Wiggles go the same way!!!

Suggests these are “important” pop’n structure!

Flow Duration Dist’ns – log log CCDF

A deeper distributional look: Interesting candidate:

Double Pareto Log Normal distribution

Reed, W. J. (2001) http://www.math.uvic.ca/faculty/reed/

4 parameter family

Pareto type tails, “center” and “spread” like log-normal

Interpretable as “stopped geometric Brownian Motion”

Makes “physical sense” (via sequence of “file updates”)
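
A sketch of how draws from such a distribution could be simulated, assuming the representation log X = nu + tau Z + E1/alpha − E2/beta with Z standard normal and E1, E2 independent standard exponentials. This is my reading of Reed's construction, and the parameter names and values below are illustrative assumptions, not fitted values.

```python
import math
import random

def sample_dpln(n, alpha, beta, nu, tau, rng):
    """Draw n values with log X = nu + tau*Z + E1/alpha - E2/beta.

    alpha governs the upper (Pareto type) tail, beta the lower tail,
    nu and tau the log-normal "center" and "spread".
    """
    draws = []
    for _ in range(n):
        log_x = (nu
                 + tau * rng.gauss(0.0, 1.0)
                 + rng.expovariate(1.0) / alpha
                 - rng.expovariate(1.0) / beta)
        draws.append(math.exp(log_x))
    return draws

rng = random.Random(0)
alpha, beta, nu, tau = 1.5, 2.0, math.log(1e4), 1.0   # assumed values
sizes = sample_dpln(50_000, alpha, beta, nu, tau, rng)
mean_log = sum(math.log(x) for x in sizes) / len(sizes)
print("mean of log X:", round(mean_log, 3),
      " theory:", round(nu + 1 / alpha - 1 / beta, 3))
```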

Flow Duration Dist’ns – log log CCDF

Seems a better fit?

But doesn’t model “wiggles”

Flow Duration Dist’ns – log log CCDF

Sim’ed envelope shows DPLN actually doesn’t fit the data

Flow Duration Dist’ns – log log CCDF

Richer family:

Mixture distributions

Mixture of 3 DPLN distributions

Parameters fit “visually”

E.g. maximum likelihood looks slippery

Flow Duration Dist’ns – log log CCDF

Amazingly good fit

For ~ 5.6 million data points!

Flow Duration Dist’ns – log log CCDF

Interpretation of Mixture parameters:

~55%, sizes ~$10^2$ bytes: maybe tiny layout images, HTML error status pages, navigation bars in multi-frame pages

~45%, sizes ~$10^4$ bytes: maybe most standard HTML text pages and images

~0.1%, sizes ~$10^6$ bytes: maybe software, multimedia content (such as movies), PDF documents

Makes physical sense!

Flow Duration Dist’ns – log log CCDF

Similar lessons for log-normal

Distribut’l fragility!

Want 4th comp’t?

Only ~ 5 data pts.

Flow Duration Dist’ns – log log CCDF

Serious implications for current theory:

Wobbly tail is not a “heavy tail” in classical sense

Classical Definition requires “convergence at a particular rate”

Not “wobbling between rates” (as modelled above)
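
For reference, one common textbook form of the classical definition (stated here as background, not quoted from the slides) is a power law tail with a single index $\alpha$:

```latex
\[
  1 - F(x) \;\sim\; C\,x^{-\alpha} \quad \text{as } x \to \infty,
  \qquad\text{so that}\qquad
  \log_{10}\bigl(1 - F(x)\bigr) \;\approx\; \log_{10} C \;-\; \alpha \,\log_{10} x .
\]
```

On the log-log CCDF plot this means the curve settles down to a straight line of slope $-\alpha$, which a tail that keeps wobbling between slopes does not do.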

Question from Downey (2001): so where does LRD come from?

Variable Tail Index

Idea: extended model, which allows “wiggly tails”

Approach: consider “location dependent tail index” as “− slope of log – log CCDF”
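
One way to make this concrete is to estimate the slope between successive quantile levels of the empirical CCDF. The sketch below is an assumption-laden illustration, not the estimator from this work: the levels (two per decade), the cut-off of 100 points above the last level, and the Pareto test data with shape 1.5 are all choices made for the example.

```python
import math
import random

def local_tail_index(data, levels_per_decade=2, min_above=100):
    """Local tail index as minus the slope of the log-log CCDF.

    Uses CCDF levels p = 10**(-k / levels_per_decade), k = 1, 2, ...,
    stopping when fewer than min_above points lie above the level.
    Returns (midpoint of log10 x, index) pairs. Assumes positive,
    continuous data (no ties at the chosen order statistics).
    """
    xs = sorted(data)
    n = len(xs)
    points = []
    k = 1
    while True:
        p = 10 ** (-k / levels_per_decade)
        n_above = int(p * n)
        if n_above < min_above:
            break
        # Order statistic with n_above observations above it.
        points.append((math.log10(xs[n - 1 - n_above]), math.log10(p)))
        k += 1
    indices = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        indices.append((0.5 * (x0 + x1), -(y1 - y0) / (x1 - x0)))
    return indices

rng = random.Random(0)
pareto = [(1.0 - rng.random()) ** (-1.0 / 1.5) for _ in range(200_000)]
for log_x, index in local_tail_index(pareto):
    print(f"log10 x ~ {log_x:5.2f}   local tail index ~ {index:4.2f}")
```

For pure Pareto data the local index stays near the shape parameter at every location; for a wiggly tail it would move as the location changes.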

Flow Duration Dist’ns – log log CCDF

Tail Index wobbles mostly between 1 & 2, and outside, too

Variable Tail Index

Enhanced Theory: Hernandez Campos, Marron, Samorodnitsky & Smith (2002) “Variable Heavy Tailed Durations in Internet Traffic”.

Sound bite version: For variable tail index “often between 1 & 2”, still get LRD