Monitorama14: A Melange of Methods for Manipulating Monitored Data

A Melange of Methods for Manipulating Monitored Data: Converging on Consistency. Neil Gunther (@DrQz, en.wikipedia.org/wiki/Neil_J._Gunther), Performance Dynamics. Monitorama PDX, May 6, 2014. © 2014 Performance Dynamics.

description

Discusses The Greatest Scatter Plot (Hubble 1929), Irregular Time Series (Harmonic Mean), Zipf’s Law of Words, Oracle Query Times, and Eleventh Hour Spikes.

Transcript of Monitorama14: A Melange of Methods for Manipulating Monitored Data

Page 1: Monitorama14: A Melange of Methods for Manipulating Monitored Data

A Melange of Methods for Manipulating Monitored Data

Converging on Consistency

Neil Gunther @DrQz en.wikipedia.org/wiki/Neil_J._Gunther

Performance Dynamics

Monitorama PDX, May 6, 2014

© 2014 Performance Dynamics

Page 2: Monitorama14: A Melange of Methods for Manipulating Monitored Data

Introductions


Page 3: Monitorama14: A Melange of Methods for Manipulating Monitored Data

I didn’t do Monitorama Berlin

I didn’t get the memo about plane crashes

Sorry... Deal with it

SFO runway 28L, 11:28 a.m., July 6, 2013

Asiana Airlines Flight 214 landing arse-backwards (sans tail)

Page 10: Monitorama14: A Melange of Methods for Manipulating Monitored Data

“Asiana pilots appear to be overly reliant on instrument-guided landings and lack the training to touch down manually.” (SFO Commissioner Eleanor Johns)

Page 11: Monitorama14: A Melange of Methods for Manipulating Monitored Data

A Message from Your Sponsors

Don’t be too reliant on your instruments (strip charts, colored dials, shiny things)


Page 12: Monitorama14: A Melange of Methods for Manipulating Monitored Data

Consistency

1 It’s not about pretty pictures
2 It’s not about whiz bang tools
3 It’s not about fancy math
4 Data are usually trying to tell you something
5 Your interpretation has to be consistent with other data
6 Your interpretation has to be consistent with other information

This talk is about

Converging on consistency by example

Page 19: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Greatest Scatter Plot

Topics

1 The Greatest Scatter Plot

2 Irregular Time Series

3 The Power of Power Laws
   Zipf’s Law of Words
   Database Query Times
   Eleventh Hour Spikes

Page 20: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Greatest Scatter Plot

The Greatest Scatter Plot


Page 21: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Greatest Scatter Plot

Goggle up! Science ahead...


Page 22: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Greatest Scatter Plot

Some Monitored Data

[Figure: two time-series panels, Metric 1 vs. Time and Metric 2 vs. Time]

Two time series, two metrics: Metric 1 and Metric 2

Page 23: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Greatest Scatter Plot

Scatter Plot

[Figure: scatter plot of Metric 2 (y-axis) against Metric 1 (x-axis)]

Are Metric 1 and Metric 2 related in any way?

Page 24: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Greatest Scatter Plot

Linear Regression

[Figure: scatter plot of Metric 2 against Metric 1 with the fitted regression line]

LSQ fit: Metric 2 = 423.94 × Metric 1, with $R^2 = 0.82$
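
The fit quoted on this slide has a slope but no intercept. A minimal sketch of such a fit in R follows. The data are synthetic and deterministic, invented only for illustration (they are not the values behind the slide), and the variable names are assumptions.

```r
# Synthetic, repeatable stand-in data: NOT the data plotted on the slide.
x <- seq(0.1, 2.0, by = 0.1)          # stand-in for Metric 1
noise <- 100 * sin(seq_along(x))      # made-up scatter, no random numbers needed
y <- 424 * x + noise                  # stand-in for Metric 2

# "0 +" drops the intercept, so the model is y = slope * x
fit <- lm(y ~ 0 + x)
slope <- unname(coef(fit)["x"])

# For a no-intercept model, summary() reports the uncentered R^2
r2 <- summary(fit)$r.squared

cat(sprintf("Metric 2 = %.2f * Metric 1, R^2 = %.2f\n", slope, r2))
```

Forcing the line through the origin is a modeling choice: it says Metric 2 vanishes when Metric 1 does, which is exactly the kind of assumption the consistency questions on the next slide are meant to probe.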

Page 25: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Greatest Scatter Plot

This is Not the End

This is just the beginning

Need to reach consistency

1 Is the linear fit still a reasonable choice?

2 What is the meaning of the slope?

3 Willing to extrapolate this model into the future?

Page 26: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Greatest Scatter Plot

The most important scatter plot in history (1929)

[Figure: first page of Robert P. Kirshner, “Hubble’s diagram and cosmic expansion,” PNAS, January 6, 2004, vol. 101, no. 1, pp. 8–13, www.pnas.org/cgi/doi/10.1073/pnas.2536799100]

Abstract shown on the slide: Edwin Hubble’s classic article on the expanding universe appeared in PNAS in 1929 [Hubble, E. P. (1929) Proc. Natl. Acad. Sci. USA 15, 168–173]. The chief result, that a galaxy’s distance is proportional to its redshift, is so well known and so deeply embedded into the language of astronomy through the Hubble diagram, the Hubble constant, Hubble’s Law, and the Hubble time, that the article itself is rarely referenced. Even though Hubble’s distances have a large systematic error, Hubble’s velocities come chiefly from Vesto Melvin Slipher, and the interpretation in terms of the de Sitter effect is out of the mainstream of modern cosmology, this article opened the way to investigation of the expanding, evolving, and accelerating universe that engages today’s burgeoning field of cosmology.

Fig. 1. Velocity–distance relation among extra-galactic nebulae. Radial velocities, corrected for solar motion (but labeled in the wrong units), are plotted against distances estimated from involved stars and mean luminosities of nebulae in a cluster. The black discs and full line represent the solution for solar motion by using the nebulae individually; the circles and broken line represent the solution combining the nebulae into groups; the cross represents the mean velocity corresponding to the mean distance of 22 nebulae whose distances could not be estimated individually. [Reproduced with permission from ref. 1 (Copyright 1929, The Huntington Library, Art Collections and Botanical Gardens).]

Metric 1 (x-axis) = distance to the observed star (r)
Metric 2 (y-axis) = recessional velocity of the star (v)

$10^6$ parsecs ≡ 1 Mpc = 3.3 million light years

Page 27: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Greatest Scatter Plot

Astronomer Edwin Hubble 1929

1 Is the linear fit still a reasonable choice?

Edwin Hubble suspected v ∼ r
Supports Big Bang hypothesis

2 What does the slope mean?

Slope: $v/r = (r/t) \times (1/r) = 1/t \equiv H_0$ (Hubble’s constant)

Inverse Hubble constant has units of time: $t_H = 1/H_0$

$t_H$ is the expansion time = Age of Universe!

3 Small problem

Hubble calculated: $t_H \simeq 2$ billion years

Age of Earth: $t_E \simeq 3$–5 billion years (Oops!)

Not consistent ☹ Whaddya gonna do?
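
The claim that 1/H0 is a time can be checked by unit conversion. A minimal sketch in R follows. It assumes the slope from the LSQ fit slide (423.94) is in km/s per Mpc, as the axis labels of Hubble's plot suggest. The conversion constants are rounded, and the function name `hubble_time_gyr` is illustrative, not from the talk.

```r
km_per_mpc   <- 3.0857e19   # kilometres in one megaparsec (rounded)
sec_per_year <- 3.156e7     # seconds in one year (rounded)

# h0 in km/s/Mpc; returns the Hubble time 1/h0 in billions of years
hubble_time_gyr <- function(h0) {
  seconds <- km_per_mpc / h0        # (km/Mpc) / (km/s/Mpc) = seconds
  seconds / sec_per_year / 1e9
}

t_h <- hubble_time_gyr(423.94)      # slope quoted on the LSQ fit slide
print(round(t_h, 2))
```

With these constants the result comes out a little over 2 billion years. That is the same order as the figure Hubble obtained and well short of the 3–5 billion years quoted for the age of the Earth.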

Page 31: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Greatest Scatter Plot

[Figure: “Hubble's 1929 Corrected Data”, recessional velocity (km/s) vs. galactic distance (Mpc)]

Hubble even corrected for so-called peculiar velocity (black dots)

Page 32: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Greatest Scatter Plot

[Figure: “Hubble's 1929 Corrected Data”, recessional velocity (km/s) vs. galactic distance (Mpc)]

Slope moved the wrong way ☹

Page 33: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Greatest Scatter Plot

Pay Day 2003

[Figure: page 11 of Kirshner, PNAS, January 6, 2004, vol. 101, no. 1, showing the modern Hubble diagrams (Figs. 3 and 4)]

Fig. 3. The Hubble diagram for type Ia supernovae. From the compilation of well observed type Ia supernovae by Jha (29). The scatter about the line corresponds to statistical distance errors of <10% per object. The small red region in the lower left marks the span of Hubble’s original Hubble diagram from 1929.

Fig. 4. Hubble diagram for type Ia supernovae to z ≈ 1. Plot in astronomers’ conventional coordinates of distance modulus (a logarithmic measure of the distance) vs. log redshift. The history of cosmic expansion can be inferred from the shape of this diagram when it is extended to high redshift and correspondingly large distances. Diagram courtesy of Brian P. Schmidt, Australian National University, based on data compiled in ref. 18.

Hubble’s (linear) Law: $v = H_0 r$ out to 2.3 billion light years

Page 34: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Greatest Scatter Plot

Consistency

1 Hubble took some static for his 1929 paper
2 Couldn’t reach consistency and had to gamble
3 Best measurements (telescopes) at the time
4 Telescopes and measurements improved
5 Converged toward consistency over next decades
6 $t_H = 2.36$ Gy (1929) → $t_H = 13.89$ Gy (2003)

Data was wrong but his interpretation (model) was correct

Guerrilla Mantra 1.16: Treating data as something divine is a sin

Page 37: Monitorama14: A Melange of Methods for Manipulating Monitored Data

Irregular Time Series

Irregular Time Series


Page 38: Monitorama14: A Melange of Methods for Manipulating Monitored Data

Irregular Time Series

Aggregating Time Series

1 Regular sample intervals:
   Samples on tick of a metronome
   Computer performance metrics
   Weather data

2 Irregular sample intervals:
   Missing data (e.g., stock exchanges)
   Unequal sampling due to:
      Events
      Subscriptions (e.g., every 10,000 sign-ups)
      Occasional (e.g., personal weight)

Page 39: Monitorama14: A Melange of Methods for Manipulating Monitored Data

Irregular Time Series

Back to Monitorama Boston 2013

Aggregation always assumes the arithmetic mean (AM)

Aggregation of irregular time series came up in @mleinart’s talk

NJG: “Should aggregate rate data using the harmonic mean (HM)”

But harmonic mean is not clear for time series

Cost me a month after Monitorama Boston to figure it out

See my blog post and detailed slides of April 9, 2013

Harmonic Averaging of Monitored Rate Data

Which is why Monitorama is cool ☺

Page 41: Monitorama14: A Melange of Methods for Manipulating Monitored Data

Irregular Time Series

Equal Intervals

[Figure: bar chart of Metric vs. Time with two bars over equal time intervals and the AM level marked]

Heights: $h_{blue} = 2$ and $h_{red} = 1$

Page 42: Monitorama14: A Melange of Methods for Manipulating Monitored Data

Irregular Time Series

Arithmetic Mean of Heights

[Figure: the same two equal-width bars with the AM level marked]

$AM = \frac{1}{2} h_{blue} + \frac{1}{2} h_{red} = \frac{1}{2}(2 + 1) = 1.5$

Page 43: Monitorama14: A Melange of Methods for Manipulating Monitored Data

Irregular Time Series

Unequal Intervals (Area = 6)

[Figure: bar chart of Metric vs. Time with two bars over unequal time intervals]

Heights: $h_{blue} = 3$ and $h_{red} = 1$

Page 44: Monitorama14: A Melange of Methods for Manipulating Monitored Data

Irregular Time Series

AM Leaves a Gap (Area = 6)

[Figure: the two unequal-width bars with the AM level marked and a region labeled “gap?”]

$AM = \frac{1}{2} h_{blue} + \frac{1}{2} h_{red} = \frac{1}{2}[3 + 1] = 2.0$

Page 45: Monitorama14: A Melange of Methods for Manipulating Monitored Data

Irregular Time Series

Stretch the Rectangle (Area = 6, Width = 4)

[Figure: the two unequal-width bars with the AM and HM levels marked]

HM = 1.5, so HM × Width = 1.5 × 4 = 6, which matches the total area

Page 46: Monitorama14: A Melange of Methods for Manipulating Monitored Data

Irregular Time Series

Lowers the Height

[Figure: the two unequal-width bars with the AM and HM levels marked]

Theorem

HM ≤ AM

The harmonic mean of positive samples is never larger than the arithmetic mean of the same samples; the two are equal only when all samples are equal
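
The rectangle argument of the last few slides can be replayed numerically. The sketch below takes the heights (3 and 1), total area (6) and total width (4) from the slides. The bar widths are not stated there, so they are derived from those three numbers, and the variable names are illustrative.

```r
heights     <- c(blue = 3, red = 1)   # bar heights (rates) from the slides
total_area  <- 6                      # from the slide titles
total_width <- 4

# Widths solve: w_blue + w_red = 4 and 3 * w_blue + 1 * w_red = 6
w_blue <- (total_area - heights[["red"]] * total_width) /
          (heights[["blue"]] - heights[["red"]])
widths <- c(blue = w_blue, red = total_width - w_blue)

am <- mean(heights)                        # arithmetic mean of the heights
hm <- length(heights) / sum(1 / heights)   # harmonic mean of the heights

cat("widths:", widths, "\n")
cat("AM * width =", am * total_width, " HM * width =", hm * total_width, "\n")
```

The harmonic mean recovers the area here because both bars cover the same area (3 each), which is what happens when a sample is taken after a fixed number of events rather than after a fixed time.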

Page 47: Monitorama14: A Melange of Methods for Manipulating Monitored Data

Irregular Time Series

Monitored Subscription Rates

Samples only occur when subscription count reaches 10,000.
Sampling intervals are unevenly spaced in time over 33 days.

[Figure: subscription Rate vs. Time (days) for irregularly spaced samples, with the AM and HM levels marked]

AM and HM are (different) averaged subscription rates.

Only HM gives the correct total time window of 33 days.
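
A small sketch of why only the harmonic mean returns the 33-day window. The count per sample (10,000) and the 33-day total come from the slide. The individual intervals are hypothetical, chosen only so that they sum to 33, and are not the monitored data.

```r
count_per_sample <- 10000               # from the slide: one sample per 10,000 sign-ups
intervals <- c(3, 4, 4, 5, 8, 9)        # HYPOTHETICAL days between samples; sum is 33
rates <- count_per_sample / intervals   # sign-ups per day within each interval

am <- mean(rates)                       # arithmetic mean of the rates
hm <- length(rates) / sum(1 / rates)    # harmonic mean of the rates

total_count  <- count_per_sample * length(intervals)
days_from_am <- total_count / am        # time window implied by the AM rate
days_from_hm <- total_count / hm        # time window implied by the HM rate

cat(sprintf("AM = %.1f/day implies %.1f days\n", am, days_from_am))
cat(sprintf("HM = %.1f/day implies %.1f days\n", hm, days_from_hm))
```

Because every sample represents the same number of sign-ups, the harmonic mean of the rates reduces to total sign-ups divided by total elapsed time, so it reproduces the window exactly whatever the individual intervals are.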

Page 48: Monitorama14: A Melange of Methods for Manipulating Monitored Data

Irregular Time Series

Consistency

Use HM to aggregate monitored data when the following criteria apply:

R — Rate metric (on y -axis)

A — Async time intervals (on x-axis)

T — Threshold is low vs. high

E — Event data

Example metrics:

Cache-hit rate

Video bit-rate

Call rate

Please send in your examples ☺

Page 50: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Power of Power Laws

The Power of Power Laws


Page 51: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Power of Power Laws Zipf’s Law of Words

Example 1: Zipf’s Law

Ranked data is 1000 most common wordforms in UK English based on 29 works ofliterature by 18 authors (i.e., 4.6 million words)

Wordform: english word

Abs: absolute frequency (total number of occurrences)

Data format

> td <- read.table("~/../Power Laws/zipf1000.txt",header=TRUE)
> head(td)
  Rank Wordform    Abs  r      mod
1    1      the 225300 29 223066.9
2    2      and 157486 29 156214.4
3    3       to 134478 29 134044.8
4    4       of 126523 29 125510.2
5    5        a 100200 29  99871.2
6    6        I  91584 29  86645.5


Page 52: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Power of Power Laws Zipf’s Law of Words

Linear Axes

[Figure: "Ranked 1000 UK English Words" on linear axes. x-axis: ranked words (W), labeled with sample words from "the" to "art"; y-axis: frequency of occurrence (F), 0 to 200000.]


Page 53: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Power of Power Laws Zipf’s Law of Words

Log-Log Axes

[Figure: "Ranked 1000 UK English Words" on log-log axes. x-axis: ranked words (W), labeled with sample words from "the" to "dare"; y-axis: frequency of occurrence (F), 5e+02 to 2e+05.]


Page 54: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Power of Power Laws Zipf’s Law of Words

Regression Fit

[Figure: "Ranked 1000 UK English Words" on log-log axes with the regression fit. x-axis: ranked words (W), labeled with sample words from "the" to "dare"; y-axis: frequency of occurrence (F), 5e+02 to 2e+05.]


Page 55: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Power of Power Laws Zipf’s Law of Words

Consistency

Log axes are word frequency (y) and ranked word order (x):

$\log(y) = -1.13 \log(x)$

$y = x^{-1.13} = \frac{1}{x^{1.13}}$

Here, “power” refers to x to the power −1.13 (exponent)

Power laws differ from standard statistical distributions

Power laws carry most of the information in their tail

Fatter tail corresponds to stronger correlations than usual

Power laws imply persistent correlations that have to be explained

Zipf’s law correlations arise from grammatical rules
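The exponent in a relation like this is the slope of a straight-line fit on log-log axes. The sketch below shows one way such a fit could be done with ordinary least squares. It is not the author's R workflow, and the data are synthetic: an exact power law generated with the quoted exponent and an assumed prefactor, not the contents of zipf1000.txt. The helper names `fit_line` and `fit_power_law` are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fit_power_law(ranks, freqs):
    """Fit freq = C * rank**exponent by regressing log(freq) on log(rank)."""
    slope, intercept = fit_line([math.log(r) for r in ranks],
                                [math.log(f) for f in freqs])
    return slope, math.exp(intercept)

# Synthetic ranked frequencies (NOT the zipf1000.txt data): an exact power
# law with the exponent quoted on the slide and an assumed prefactor.
ranks = list(range(1, 1001))
freqs = [225300 * r ** -1.13 for r in ranks]

exponent, prefactor = fit_power_law(ranks, freqs)
print(f"fitted exponent = {exponent:.2f}, prefactor = {prefactor:.0f}")
```

On real ranked data the points scatter around the line, so the fitted exponent depends on which ranks are included in the regression.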


Page 57: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Power of Power Laws Database Query Times

Example 2: Database Query Times

[Figure: orad$Elapstime (y-axis, 0 to 400) versus Index (x-axis, 0 to 500).]

Like Zipf’s law, data must be ranked by frequency of occurrence
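A minimal sketch of the ranking step, assuming that "ranked" means sorting the elapsed times in descending order so that the plot index becomes the rank. The sample values and the helper name `rank_descending` are hypothetical; the slides do not show how the ranked series was produced.

```python
def rank_descending(values):
    """Return (rank, value) pairs with rank 1 assigned to the largest value."""
    ordered = sorted(values, reverse=True)
    return list(enumerate(ordered, start=1))

# Hypothetical elapsed times in arrival order (arbitrary units).
elapsed = [0.11, 412.0, 35.2, 0.09, 78.5, 150.3]

ranked = rank_descending(elapsed)
for rank, value in ranked:
    print(rank, value)
```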

Page 58: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Power of Power Laws Database Query Times

Visualize Ranked Data

[Figure: "Ranked SQL Times". otr (y-axis, 0 to 400) versus Index (x-axis, 0 to 500) on linear axes.]

Impossible to tell functional form of this curve


Page 59: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Power of Power Laws Database Query Times

Try Double-Log Visualization

[Figure: "Log-Log SQL Times". otr (y-axis, 0.1 to 500.0) versus Index (x-axis, 1 to 500) on log-log axes.]

Clearly not power law overall

But first 100 queries do appear to be power law


Page 60: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Power of Power Laws Database Query Times

Three Data Windows

[Figure: three panels. "Log-Log of SQL-A Times": etA (100 to 500) versus Index (1 to 100). "Log-Lin of SQL-B Times": etB (30 to 80) versus Index (0 to 150). "Log-Lin of SQL-C Times": etC (0.090 to 0.110) versus Index (0 to 80).]

This suggests breaking data across 3 regions:

(A) log-log axes

(B) log-linear axes

(C) log-linear axes


Page 61: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Power of Power Laws Database Query Times

Regression Analysis

[Figure: three panels with regression fits. "Log-Log SQL A-Times": etA (100 to 500) versus Index (1 to 100). "Log-Lin SQL B-Times": etB (30 to 80) versus Index (0 to 150). "Log-Lin SQL C-Times": etC (0.090 to 0.110) versus Index (0 to 80).]

(A) $y_A \sim x^{-0.4632}$ power law decay

(B) $y_B \sim e^{-0.0074x}$ exponential decay

(C) $y_C \sim e^{-0.0028x}$ exponential decay

But this is still not enough
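One way to decide which functional form suits a window is to compare how straight the data look on each pair of axes. The sketch below is an assumed procedure, not the one used for these slides: it scores a window by the R² of a straight line on log-log axes versus log-linear axes. The windows are synthetic, built from the exponents quoted above with assumed prefactors, and `classify_window` is a hypothetical helper.

```python
import math

def r_squared(xs, ys):
    """Squared Pearson correlation, i.e. R^2 of a straight-line fit."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy * sxy / (sxx * syy)

def classify_window(index, times):
    """Pick the axes on which the window looks straighter."""
    log_t = [math.log(t) for t in times]
    r2_loglog = r_squared([math.log(i) for i in index], log_t)
    r2_loglin = r_squared(list(index), log_t)
    return "power law" if r2_loglog > r2_loglin else "exponential"

# Synthetic windows using the quoted exponents; the prefactors are assumed.
idx_a = range(1, 101)
window_a = [500 * i ** -0.4632 for i in idx_a]
idx_b = range(1, 151)
window_b = [80 * math.exp(-0.0074 * i) for i in idx_b]

print(classify_window(idx_a, window_a))
print(classify_window(idx_b, window_b))
```

On noisy measurements the two R² values can be close, so a visual check of both plots remains worthwhile.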


Page 62: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Power of Power Laws Database Query Times

Consistency

[Figure: "Log-Log SQL A-Times". etA (y-axis, 100 to 500) versus Index (x-axis, 1 to 100) on log-log axes.]

Power law slope γ = 0.46

Half Zipfian slope γ = 1.0

Correlations stronger than Zipf

Hypothesis

1 Shorter query times (window A) may involve dictionary lookups or other structured data. Structure provides correlations.

2 Longer queries in window B are unstructured (ad hoc?) and randomized. Weak correlations produce exponential decay.

3 Ditto for window C.


Page 63: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Power of Power Laws Eleventh Hour Spikes

Example 3: Eleventh Hour Spikes

All Australian businesses were required to register with the Australian Tax Office (ATO) for an Australian Business Number (ABN) to claim an income tax refund. The ABN was introduced in Y2K.

Time series data from ABN registrations database.

Period covers March 27 to September 19, 2000

Deadline traffic spike on 31 May, 2000

Similar to rush to meet Obamacare deadline of March 31, 2014.

More details in my CMG Australia 2006 paper.

Page 64: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Power of Power Laws Eleventh Hour Spikes

Complete Time Series

[Figure: complete time series of ORA connections (y-axis, 0 to 1 × 10^6) versus date (x-axis, 11/3/2000 to 29/8/2000).]

Question: Could the “11th hour” spike have been predicted?

Answer: Yes, but quite involved.

How: Using a power law.

What else!?


Page 66: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Power of Power Laws Eleventh Hour Spikes

Semi-Log Plot

[Figure: semi-log plot of ORA connections (y-axis, log scale, 1 × 10^4 to 2 × 10^6) versus date (x-axis, 11/3/2000 to 21/5/2000).]

y-axis is the number of Oracle RDBMS connections (log scale)

Peak growth preceding spike looks almost linear on semi-log plot

Time range: 0–38 days


Page 67: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Power of Power Laws Eleventh Hour Spikes

Statistical Regression on Peaks

[Figure: semi-log plot of ORA connections (y-axis, log scale, 1 × 10^4 to 1 × 10^6) versus date (x-axis, from 11/3/2000), with the regression on the peaks.]

Linear growth on semi-log axes implies exponential function $y = A e^{Bt}$

Fit parameters

Origin: $A = 1.14128 \times 10^{5}$

Curvature: $B = 0.0175$

Doubling period: $\ln(2)/B \approx 39.6$ days $\sim$ 6 weeks (with $t$ in days)
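The doubling period follows directly from the fitted growth rate. The sketch below simply evaluates the exponential trend with the two fit parameters quoted on the slide. The unit of the result is whatever unit t was measured in, which the slide does not state next to the fit; the function name `forecast` is hypothetical.

```python
import math

A = 1.14128e5   # fitted origin, from the slide
B = 0.0175      # fitted growth rate per unit of t, from the slide

def forecast(t):
    """Exponential trend y = A * exp(B * t)."""
    return A * math.exp(B * t)

# Time for the trend to double: solve exp(B * T) = 2 for T.
doubling_period = math.log(2) / B
print(f"doubling period = {doubling_period:.1f} time units")

# Sanity check: the forecast doubles after one doubling period.
ratio = forecast(doubling_period) / forecast(0)
print(f"growth over one doubling period = {ratio:.3f}x")
```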


Page 68: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Power of Power Laws Eleventh Hour Spikes

Trend on Linear Axes

[Figure: ORA connections (y-axis, 0 to 1 × 10^6) versus date (x-axis, 11/3/2000 to 10/7/2000) on linear axes, with the exponential trend and crosshairs.]

Exponential forecast looks valid, up to the crosshairs

Significantly underestimates onset of the “11th hour” peak

And rapid drop off after the peak

Faster than exponential suggests power law


Page 69: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Power of Power Laws Eleventh Hour Spikes

Power Law Fit

[Figure: ORA connections (y-axis, 0 to 1 × 10^6) versus date (x-axis, 11/3/2000 to 10/7/2000) on linear axes, with "Exp growth" and "Power law" curves.]

Log axes are connections (y) and time in days (x):

$\log(y) = -0.6421 \log(|x - x_c|)$

$y = \frac{1}{|x - x_c|^{0.6421}}$

where the peak occurs at $x_c = 61$ days
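A short sketch of how a spike of this form can be evaluated and its exponent recovered. The peak day and exponent are the values quoted on the slide. The `scale` prefactor, the synthetic daily values, and the helper names `spike` and `fit_exponent` are assumptions for illustration, since the slide gives the relation without a prefactor. The fit regresses log(y) on log(|x − x_c|), so points on both sides of the peak fall on the same line.

```python
import math

X_C = 61.0       # day of the peak, from the slide
GAMMA = 0.6421   # fitted exponent, from the slide

def spike(x, scale=1.0):
    """Power-law spike y = scale / |x - X_C|**GAMMA (undefined at x == X_C)."""
    return scale / abs(x - X_C) ** GAMMA

def fit_exponent(days, values):
    """Least-squares slope of log(y) against log(|x - X_C|)."""
    us = [math.log(abs(d - X_C)) for d in days]
    vs = [math.log(v) for v in values]
    mu, mv = sum(us) / len(us), sum(vs) / len(vs)
    sxy = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    sxx = sum((u - mu) ** 2 for u in us)
    return sxy / sxx

# Synthetic daily values on both sides of the peak, skipping the peak day.
days = [d for d in range(1, 121) if d != 61]
values = [spike(d, scale=1e6) for d in days]  # scale is an assumed prefactor

slope = fit_exponent(days, values)
print(f"fitted slope = {slope:.4f}")
```

In practice x_c is not known in advance, so a forecast would have to treat it as a fit parameter or take it from a known deadline.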

Page 70: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Power of Power Laws Eleventh Hour Spikes

Consistency

Log-log plots are an easy way to test for power law distributions

May have mixed regions of power law and other distributions

Can even predict critical spikes

Power laws signal presence of strong correlations

Explaining those correlations may be more difficult

Zipf’s law took 40 years

Remember

Aim for consistency

Learn to talk to God (She’s listening)


Page 73: Monitorama14: A Melange of Methods for Manipulating Monitored Data

The Power of Power Laws Eleventh Hour Spikes

Performance Dynamics Company

Castro Valley, California

www.perfdynamics.com

perfdynamics.blogspot.com

twitter.com/DrQz

Facebook

Training classes (May 19, 2014)

[email protected]: +1-510-537-5758
