Hailuoto Workshop
A Statistician’s Adventures
in Internetland
J. S. Marron
Department of Statistics and Operations Research
University of North Carolina
April 18, 2023
Co-Authors in this work
Felix Hernández Campos
Fred Godtliebsen
George Michailidis
Cheolwoo Park
Juhyun Park
Vladas Pipiras
Vitaliana Rondonotti
David Rolls
Haipeng Shen
F. D. Smith
Richard Smith
Stilian Stoev
Murad Taqqu
Zhengyuan Zhu
Internet Traffic Background
Gigantic (worldwide) communications networks:
  Telephone network
  Internet
Both based on “connections” between 2 points
Fundamental Difference, I
Manner of equipment usage:
  Telephone: each connection has sole use (of ~2 wires)
    Congestion ⇒ no connection (busy signal)
  Internet: all connections share resources (transmissions split into small “packets”)
    Congestion ⇒ packet loss & delays
Fundamental Difference, II
Distribution of duration (time) of connections:
  Telephone: (roughly) exponential distribution (must speak & how long can hold phone?)
  Internet: heavy tailed distributions (very long and very short connections!)
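As a rough illustration of this contrast, the sketch below draws simulated durations from an exponential and from a Pareto distribution with Python's standard `random` module, then compares how far the largest values sit from the median. The Pareto shape of 1.2, the sample size, and the helper name `spread_summary` are assumptions chosen for illustration, not values from the talk.

```python
import random

def spread_summary(durations):
    """Return (max / median, fraction of durations above 10 x median)."""
    ordered = sorted(durations)
    n = len(ordered)
    median = ordered[n // 2]
    far_above = sum(1 for d in ordered if d > 10 * median) / n
    return ordered[-1] / median, far_above

rng = random.Random(0)   # fixed seed so the sketch is repeatable
n = 100_000              # hypothetical number of connections
exponential = [rng.expovariate(1.0) for _ in range(n)]
pareto = [rng.paretovariate(1.2) for _ in range(n)]  # shape 1.2 is an assumption

expo_ratio, expo_far = spread_summary(exponential)
pareto_ratio, pareto_far = spread_summary(pareto)
print(f"exponential: max/median = {expo_ratio:.1f}, share > 10 x median = {expo_far:.4f}")
print(f"Pareto:      max/median = {pareto_ratio:.1f}, share > 10 x median = {pareto_far:.4f}")
```

Under these assumptions the Pareto sample should show a far larger max/median ratio and a larger share of durations well above the median, which is the "elephants" behavior the exponential model lacks.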
Fundamental Difference, III
Mathematical models:
  Telephone: queueing theory
    Poisson arrivals, exponential durations
  Internet: heavy tail connection durations ⇒ long range dependence
Internet Modes of Study
Internet structure: connectivity graphs
Internet “tomography”:
  Use “measurements at edges”
  To infer “general structure and behavior”
Traffic measurements at one point:
  Time series of packet information at one location
Internet Measurement – Modeling Goals
Models for simulation???
  For “protocol fixing”
    QoS & fixing gross inefficiencies
  Testbed for business applications
    web developers
“Goodness of Fit” of models???
  “How do we know this works like real traffic?”
  More realistically: How bad is it?
Our Data Collection Point
“Tap” on main link at UNC (U. North Carolina)
Heavy traffic both directions:
  35,000 web browsers
  Sunsite (mirror site for large databases)
Some indication of scale:
  1998 peak traffic: ~3 minutes for 1 million packets
  2001 peak traffic: ~1 minute for 1 million packets
Data Source
Sequence of packet header info, such as:
  Arrival time
  Source & destination addresses
  Packet type (request, data, acknowledgment, …)
  Packet size (40 – 1500 bytes)
  Sequence number
Data extraction:
  Heavy “database filtering”, by UNC Comp. Sci. folks: Jeffay, Smith, Ott, Hernandez Campos, Long
Toy Example View of Packets & Connections
Connections: made up of packets
Observations:
  Starting time
  Duration (time)
  Size (bytes)
  Packet counts
  Byte counts
Mice and Elephants Graphic
“Mice and Elephants” plot:
  Visual display of HTTP responses
  Show only “times of HTTP response starts & ends”, as horizontal line segments (in time)
  Condense “string of packets” to only a line segment
  Visually separate by adding random height (Tukey’s “jitter plot” idea)
  Only show random sample of 5000 out of ~100,000
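The construction of the plot can be sketched in a few lines of Python. The function name `mice_and_elephants_segments`, the synthetic stand-in data, and the Pareto shape are all hypothetical; the real plot uses measured HTTP response start and end times.

```python
import random

def mice_and_elephants_segments(responses, sample_size=5000, seed=0):
    """Pick a random subset of (start, end) pairs and attach a random height.

    Returns a list of (start, end, height) tuples, one per plotted segment.
    """
    rng = random.Random(seed)
    k = min(sample_size, len(responses))
    chosen = rng.sample(responses, k)  # sampling without replacement
    # The random height plays the role of the jitter that separates segments.
    return [(start, end, rng.random()) for start, end in chosen]

# Hypothetical stand-in data: 100,000 responses in a 4 hour (14,400 s) window,
# with Pareto-distributed durations (shape 1.2 is an assumption).
data_rng = random.Random(1)
responses = []
for _ in range(100_000):
    start = data_rng.uniform(0.0, 14_400.0)
    responses.append((start, start + 0.01 * data_rng.paretovariate(1.2)))

segments = mice_and_elephants_segments(responses)
print(len(segments), "segments; first:", segments[0])
```

Each tuple can then be drawn as a horizontal line from `start` to `end` at vertical position `height` with any plotting tool.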
Mice and Elephants – Simulated Durations
Same start times
Simulated exponential durations
No “elephants”
No “mice”
Only “betweens”
Exponential model is very poor
Some Important Time Series
Binned counts (aggregated over time)
Count:
  - Packets
  - Bytes
  (often similar)
This talk: 1 or 10 ms bins
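A minimal sketch of the binning step, assuming packet arrival times are given in milliseconds from the start of the trace. The function name `binned_counts` and the five example packets are made up for illustration.

```python
def binned_counts(arrival_ms, sizes_bytes, bin_ms=10.0):
    """Aggregate packets into fixed-width time bins.

    arrival_ms: non-negative arrival times in milliseconds.
    Returns (packet_counts, byte_counts), one entry per bin.
    """
    n_bins = int(max(arrival_ms) // bin_ms) + 1
    packet_counts = [0] * n_bins
    byte_counts = [0] * n_bins
    for t, size in zip(arrival_ms, sizes_bytes):
        i = int(t // bin_ms)       # index of the bin containing time t
        packet_counts[i] += 1
        byte_counts[i] += size
    return packet_counts, byte_counts

arrivals = [0.5, 3.2, 12.0, 19.9, 35.0]   # hypothetical arrival times (ms)
sizes = [40, 1500, 40, 576, 1500]         # hypothetical packet sizes (bytes)
packets, total_bytes = binned_counts(arrivals, sizes)
print(packets)       # [2, 2, 0, 1]
print(total_bytes)   # [1540, 616, 0, 1500]
```

Passing `bin_ms=1.0` gives the 1 ms series instead of the 10 ms one.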
A Menu of Interesting Issues
Bin count time series: long range dependence (LRD)?
Point process of flow start times
Duration distributions (heavy tails)
Heavy tail durations ⇒ LRD
Relationship between size and duration?
Time series of packets within flows?
Flow Duration Distributions
Study sizes (bytes) of HTTP responses, as surrogate for:
  Time between first and last packet
  In mice and elephants plot
Study 4 hour time block:
  “Heavy Traffic Time”
  Thursday morning, 8:00 AM – 12:00 noon
  In April 2001
  From UNC main link
Flow Duration Distributions
Log-log Complementary Cumulative Distribution Function (CCDF) plot:
  $\log_{10}(1 - \hat{F}(x))$ as a function of $\log_{10}(x)$
Would be linear for Pareto (slope = −shape parameter)
~ Q-Q plot against exponential
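The plot coordinates can be computed directly from the sorted data. The sketch below uses simulated Pareto data with an assumed shape of 1.5, standing in for the real response sizes, and fits a straight line by ordinary least squares to check that the slope is close to minus the shape parameter. The helper names are hypothetical.

```python
import math
import random

def loglog_ccdf(sample):
    """Return points (log10 x, log10(1 - F_hat(x))) for positive data."""
    ordered = sorted(sample)
    n = len(ordered)
    # The largest observation is dropped because 1 - F_hat is 0 there.
    return [(math.log10(x), math.log10(1.0 - (i + 1) / n))
            for i, x in enumerate(ordered[:-1])]

def least_squares_slope(points):
    """Ordinary least squares slope of y on x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

rng = random.Random(1)
shape = 1.5  # assumed Pareto shape parameter
sample = [rng.paretovariate(shape) for _ in range(20_000)]
points = loglog_ccdf(sample)
slope = least_squares_slope(points)
print(f"fitted slope = {slope:.2f} (Pareto theory: {-shape})")
```

The least squares fit is only a quick check of linearity here, not a recommended tail index estimator.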
Flow Duration Dist’ns – log log CCDF
Log-log scale stretches quantiles
Allows “clear view of tail”
Flow Duration Dist’ns – log log CCDF
“Wiggles” about possible linear fit
Wiggles really there? Or “natural variation”?
Flow Duration Dist’ns – log log CCDF
Does Pareto Model “fit”?
Looks OK ???
Careful, have ~ 5.6 million data points
What is sampling variation?
How can we assess it?
Flow Duration Dist’ns – log log CCDF
Downey (2001) controversy:
  Log normal fits as well?
  Big problem: log normal is not heavy tailed?
    E.g. all moments exist (exponential, not polynomial, tail!)
  Implications for “heavy tails ⇒ LRD” theory?
Modified theory: Hannig, Marron, Samorodnitsky and Smith (2002)
  Idea: log normal (with changing parameters) ⇒ LRD
Flow Duration Dist’ns – log log CCDF
Interesting viewpoint: Gong, Liu, Misra and Towsley (2001)
Key idea: “distributional fragility”
  Several “very different” distributions can all “fit in tails”
Conclusions:
  Careful tail fits not very interpretable
  “Heavy tails” may be slippery concept
Flow Duration Dist’ns – log log CCDF
Now address “sampling variation” issue:
  Add overlay of 100 simulated data sets
  Same sample size
  From given distribution
Gives good intuitive idea of “sampling variation”
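The overlay idea can be sketched as a simulation envelope: draw 100 data sets from the candidate distribution, compute each log-log CCDF on a common grid, and record the lowest and highest curve at each grid point. The Pareto shape, the reduced sample size of 2000 (the real data have millions of points), the grid, and the function names are assumptions made to keep the example fast.

```python
import bisect
import math
import random

def log10_ccdf(sorted_sample, x):
    """log10 of the empirical CCDF at x; -inf when no observation exceeds x."""
    n = len(sorted_sample)
    exceed = n - bisect.bisect_right(sorted_sample, x)
    return math.log10(exceed / n) if exceed > 0 else float("-inf")

def simulation_envelope(draw, n, grid, n_sets=100, seed=0):
    """Pointwise min and max of the log10 CCDF over n_sets simulated samples."""
    rng = random.Random(seed)
    curves = []
    for _ in range(n_sets):
        sample = sorted(draw(rng) for _ in range(n))
        curves.append([log10_ccdf(sample, x) for x in grid])
    lower = [min(c[j] for c in curves) for j in range(len(grid))]
    upper = [max(c[j] for c in curves) for j in range(len(grid))]
    return lower, upper

alpha = 1.2                                   # assumed Pareto shape
grid = [10 ** (k / 4) for k in range(13)]     # x from 1 to 1000
lower, upper = simulation_envelope(lambda rng: rng.paretovariate(alpha), 2000, grid)
for x, lo, hi in zip(grid, lower, upper):
    print(f"x = {x:8.1f}   envelope = [{lo:.2f}, {hi:.2f}]")
```

A data curve that leaves the envelope where the envelope is narrow is unlikely to be explained by sampling variation alone.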
Flow Duration Dist’ns – log log CCDF
Overlay from Pareto distribution:
  Gives good intuitive idea of “sampling variation”
  Shows “wiggles” are really there, not sampling artifact
Suggests distribution not “heavy tailed”?
  Definition is only asymptotic …
  Implications for above theory?
Flow Duration Dist’ns – log log CCDF
New concept motivated by above analysis: several “tail regions”
  Near tail: “very complete info” (rich data, no envelope)
  Far tail: “sketchy info” (sparse data, wide envelope)
  Extreme tail: “no info” (beyond range of data)
Important point: have these regardless of “size of data set”
Flow Duration Dist’ns – log log CCDF
Different viewpoint: look across time blocks
  7 week days: Sun. – Sat.
  3 time blocks:
    Morning 8:00 AM – 12:00 noon
    Afternoon 12:00 noon – 4:00 PM
    Evening 7:30 PM – 11:30 PM
21 log-log CCDF plots:
  All overlaid in blue, each highlighted in red
  Structure amazingly similar!
  Wiggles go the same way!!!
  Suggests this is “important” population structure!
Flow Duration Dist’ns – log log CCDF
A deeper distributional look:
Interesting candidate: Double Pareto Log Normal (DPLN) distribution
  Reed, W. J. (2001) http://www.math.uvic.ca/faculty/reed/
  4 parameter family
  Pareto type tails, “center” and “spread” like log-normal
  Interpretable as “stopped geometric Brownian motion”
  Makes “physical sense” (via sequence of “file updates”)
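One way to get a feel for this family is to simulate from it. The sketch below relies on the representation usually given for the double Pareto log-normal: on the log scale, a normal variable plus the difference of two independent scaled exponentials, which is what a geometric Brownian motion with log-normal start produces when stopped at an exponential time. The parameter names and values are assumptions for illustration, not fitted values from the talk.

```python
import math
import random

def sample_dpln(rng, alpha, beta, nu, tau):
    """Draw one double Pareto log-normal variate.

    log X = nu + tau * Z + E1 / alpha - E2 / beta, with Z standard normal and
    E1, E2 independent standard exponentials. alpha governs the upper tail,
    beta the lower tail; nu and tau act like log-normal center and spread.
    """
    z = rng.gauss(0.0, 1.0)
    e1 = rng.expovariate(1.0)
    e2 = rng.expovariate(1.0)
    return math.exp(nu + tau * z + e1 / alpha - e2 / beta)

# Hypothetical parameter values, chosen only to make the example concrete.
alpha, beta, nu, tau = 1.5, 2.0, math.log(1e4), 1.0
rng = random.Random(0)
sizes = [sample_dpln(rng, alpha, beta, nu, tau) for _ in range(50_000)]

mean_log = sum(math.log(s) for s in sizes) / len(sizes)
expected_mean_log = nu + 1.0 / alpha - 1.0 / beta
print(f"mean of log X: simulated {mean_log:.3f}, theoretical {expected_mean_log:.3f}")
```

The comparison of simulated and theoretical means of log X is just a sanity check on the sampler.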
Flow Duration Dist’ns – log log CCDF
Richer family: mixture distributions
  Mixture of 3 DPLN distributions
  Parameters fit “visually”
  E.g. maximum likelihood looks slippery
Flow Duration Dist’ns – log log CCDF
Interpretation of mixture parameters:
  ~55%, sizes ~10^2 bytes: maybe
    tiny layout images
    HTML error status pages
    navigation bars in multi-frame pages
  ~45%, sizes ~10^4 bytes: maybe
    most standard HTML text pages and images
  ~0.1%, sizes ~10^6 bytes: maybe
    software
    multimedia content (such as movies)
    PDF documents
Makes physical sense!
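To make the three-population reading concrete, the following sketch samples response sizes from a three-component mixture using the approximate weights and typical sizes listed above. Log-normal components are used here as a simple stand-in for the DPLN components, and the common spread `SIGMA` is a made-up value; neither comes from the fitted model.

```python
import math
import random

# (approximate weight, typical size in bytes) as read off the slide.
COMPONENTS = [(0.55, 1e2), (0.45, 1e4), (0.001, 1e6)]
SIGMA = 1.0  # hypothetical spread on the natural-log scale

def sample_mixture(n, seed=0):
    """Return (labels, sizes): component index and simulated size per response."""
    rng = random.Random(seed)
    weights = [w for w, _ in COMPONENTS]  # need not sum to exactly 1
    labels = rng.choices(range(len(COMPONENTS)), weights=weights, k=n)
    sizes = [rng.lognormvariate(math.log(COMPONENTS[j][1]), SIGMA) for j in labels]
    return labels, sizes

labels, sizes = sample_mixture(100_000)
share_small = labels.count(0) / len(labels)
print(f"share from the ~10^2 byte component: {share_small:.3f}")
print(f"largest simulated response: {max(sizes):.3g} bytes")
```

Feeding `sizes` into a log-log CCDF plot shows how a mixture can produce a wobbling tail with no single slope.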
Flow Duration Dist’ns – log log CCDF
Similar lessons for log-normal
Distributional fragility!
Want 4th component?
Only ~5 data points.
Flow Duration Dist’ns – log log CCDF
Serious implications for current theory:
Wobbly tail is not a “heavy tail” in classical sense
Classical Definition requires “convergence at a particular rate”
Not “wobbling between rates” (as modelled above)
Question from Downey (2001): so where does LRD come from?
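For reference, the classical (regular variation) notion of a heavy tail with index $\alpha$ can be written as below, where $L$ is a slowly varying function. The symbols $\alpha$, $L$ and $t$ are introduced here for illustration and do not appear on the slides.

```latex
\[
  1 - F(x) = L(x)\, x^{-\alpha},
  \qquad
  \lim_{x \to \infty} \frac{L(t x)}{L(x)} = 1 \quad \text{for all } t > 0,
\]
\[
  \text{so that} \qquad
  \lim_{x \to \infty} \frac{\log_{10}\bigl(1 - F(x)\bigr)}{\log_{10}(x)} = -\alpha .
\]
```

In this sense the log-log CCDF must settle down to one limiting slope, which is what a tail that keeps wobbling between slopes over the observed range does not show.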
Variable Tail Index
Idea: extended model, which allows “wiggly tails”
Approach: consider “location dependent tail index” as “− slope of log-log CCDF”
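A minimal sketch of a location dependent tail index, assuming it is estimated as minus the least squares slope of the log-log CCDF inside a window around a chosen value of $\log_{10}(x)$. The window width, the function name `local_tail_index`, and the simulated Pareto data are assumptions; the estimator used in the talk may differ.

```python
import math
import random

def local_tail_index(sample, center, half_width):
    """Minus the fitted slope of the log-log CCDF for log10(x) near `center`."""
    ordered = sorted(sample)
    n = len(ordered)
    xs, ys = [], []
    for i, x in enumerate(ordered[:-1]):      # skip the maximum (CCDF = 0)
        log_x = math.log10(x)
        if abs(log_x - center) <= half_width:
            xs.append(log_x)
            ys.append(math.log10(1.0 - (i + 1) / n))
    if len(xs) < 2:
        raise ValueError("not enough data points in the window")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    return -sxy / sxx

rng = random.Random(2)
sample = [rng.paretovariate(1.5) for _ in range(50_000)]  # assumed shape 1.5
index_near = local_tail_index(sample, center=0.5, half_width=0.2)
print(f"local tail index near log10(x) = 0.5: {index_near:.2f}")
```

For a pure Pareto sample the estimate should stay near the shape parameter at every location; for wiggly tails it would vary with `center`.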