
Transcript
Page 1: Statistics 101 for System Administrators

Statistics 101 for System Administrators

EuroPython 2014, 22nd July - Berlin

Roberto Polli - [email protected]

Babel Srl - P.zza S. Benedetto da Norcia, 33 - 00040 Pomezia (RM) - www.babel.it


Page 2: Statistics 101 for System Administrators

Who? What? Why?

• Using (and learning) elements of statistics with Python.
• Roberto Polli - Community Manager @ Babel.it. Loves writing in C, Java and Python. Red Hat Certified Engineer and Virtualization Administrator.
• Babel – Proud sponsor of this talk ;) Delivers large mail infrastructures based on Open Source software for Italian ISPs and PA. Contributes to various FLOSS projects.


Page 3: Statistics 101 for System Administrators

Agenda

• A latency issue: what happened?
• Correlation in 30”
• Combining data
• Plotting time
• Modules: scipy, matplotlib


Page 4: Statistics 101 for System Administrators

A Latency Issue

• Episodic network latency issues
• Log traces: message size, #peers, retransmissions
• Do we need to scale? Was it a peak problem?

Find a rapid answer with python!


Page 5: Statistics 101 for System Administrators

Basic statistics

Python libraries provide basic statistics, like

    # mean and std live in NumPy (scipy.stats does not export them)
    from numpy import mean  # x̄
    from numpy import std   # σX

    T = {
        'ts': (1, 2, 3, .., ),
        'late': (0.12, 6.31, 0.43, .. ),
        'peers': (2313, 2313, 2312, ..),
        ...
    }

    # one (name, max, min, mean, std) tuple per series
    print([(k, max(X), min(X), mean(X), std(X)) for k, X in T.items()])
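A minimal, self-contained sketch of the same summary, assuming only the standard library `statistics` module so that it runs without NumPy or SciPy. The sample values in `T` and the helper name `summarize` are made up for illustration; `pstdev` is the population standard deviation, which is also what NumPy's `std` computes by default.

```python
from statistics import mean, pstdev

# Hypothetical sample data: one tuple of measurements per field.
T = {
    "ts": (1, 2, 3, 4, 5),
    "late": (0.12, 6.31, 0.43, 0.25, 0.40),
    "peers": (2313, 2313, 2312, 2310, 2312),
}

def summarize(table):
    """Return {field: (max, min, mean, population std)} for each series."""
    return {k: (max(X), min(X), mean(X), pstdev(X)) for k, X in table.items()}

summary = summarize(T)
for field, (hi, lo, avg, sigma) in summary.items():
    print(f"{field:>6}: max={hi} min={lo} mean={avg:.3f} std={sigma:.3f}")
```

A single outlier such as the 6.31 in `late` inflates both the mean and the standard deviation, which is one reason to look at the distribution as well.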


Page 6: Statistics 101 for System Administrators

Distributions

Data distribution - aka δX - shows event frequency.

    # The fastest way to get a
    # distribution is
    from matplotlib import pyplot as plt
    freq, bins, _ = plt.hist(T['late'])

    # plt.hist returns the bin counts and the bin edges;
    # pair each bin's left edge with its count
    distribution = list(zip(bins, freq))

[Figure: Ping RTT distribution, a histogram. x-axis: rtt in ms, from 158.0 to 162.0.]
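What a histogram computes can be sketched without plotting: cut the range between the smallest and largest value into equal-width bins and count the samples in each. The function name `histogram` and the round-trip times below are hypothetical, and this is a simplified illustration of the idea rather than matplotlib's actual implementation.

```python
def histogram(values, bins=10):
    """Count values in `bins` equal-width intervals spanning [min, max].

    Returns (freq, edges); edges has bins + 1 entries, like the
    (counts, bin edges) pair returned by plt.hist.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal
    freq = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin, not to a new one.
        idx = min(int((v - lo) / width), bins - 1)
        freq[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return freq, edges

# Hypothetical ping round-trip times in ms.
rtt = [158.0, 158.4, 158.7, 159.2, 159.5, 159.6, 159.9, 160.3, 161.1, 162.0]
freq, edges = histogram(rtt, bins=4)
distribution = list(zip(edges, freq))  # (left edge, count) pairs
print(distribution)
```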


Page 7: Statistics 101 for System Administrators

Correlation I

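Are two data series X, Y related? Given $\Delta x_i = x_i - \bar{x}$, Mr. Pearson answered with this formula:

$$\rho(X, Y) = \frac{\sum_i \Delta x_i \, \Delta y_i}{\sqrt{\sum_i \Delta^2 x_i \, \sum_i \Delta^2 y_i}} \in [-1, +1] \qquad (1)$$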

ρ identifies if the values of X and Y ‘move’ together on the same line.
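Formula (1) translates almost term by term into code. The sketch below is a plain-Python illustration, not the SciPy implementation; the function name `pearson` and the sample series are assumptions made for the example.

```python
from math import sqrt

def pearson(X, Y):
    """Pearson's rho for two equal-length series, following formula (1)."""
    if len(X) != len(Y):
        raise ValueError("series must have the same length")
    x_mean = sum(X) / len(X)
    y_mean = sum(Y) / len(Y)
    dx = [x - x_mean for x in X]  # deviations of X from its mean
    dy = [y - y_mean for y in Y]  # deviations of Y from its mean
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    # A constant series has zero deviation, so rho is undefined there
    # and this division raises ZeroDivisionError.
    return numerator / denominator

rising = pearson([1, 2, 3, 4], [10, 20, 30, 40])   # same line, same direction
falling = pearson([1, 2, 3, 4], [40, 30, 20, 10])  # same line, opposite direction
print(rising, falling)
```

Values near +1 or −1 mean the points lie close to a straight line; values near 0 mean no linear relation was found.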


Page 8: Statistics 101 for System Administrators

You must (scatter) plot

ρ doesn’t find non-linear correlation!

Page 9: Statistics 101 for System Administrators

Probability Indicator

Python’s scipy provides a correlation function, returning two values:
• the ρ correlation coefficient ∈ [−1,+1]
• the probability that uncorrelated systems produce datasets with a correlation at least this strong (the p-value)

    from random import randint
    from scipy.stats import pearsonr  # our beloved ρ

    a, b = range(0, 100), range(0, 400, 4)
    c, d = [randint(0, 100) for x in a], [randint(0, 100) for x in a]
    correlation, probability = pearsonr(a, b)  # ρ = 1.000, p = 0.000
    correlation, probability = pearsonr(c, d)  # random data, e.g. ρ = −0.041, p = 0.683


Page 10: Statistics 101 for System Administrators

Combinations

itertools is a gold mine of useful tools.

    from itertools import combinations

    # returns all possible combinations of
    # items grouped N at a time
    items = "hearts spades clubs diamonds".split()
    combinations(items, 2)

    # And now all possible combinations between
    # dataset fields!
    combinations(T, 2)

Combining 4 suits, 2 at a time: ♥♠ ♥♣ ♥♦ ♠♣ ♠♦ ♣♦
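A short runnable sketch of the same idea. `combinations` returns a lazy iterator, so it is wrapped in `list` here to inspect the pairs; the dataset `T` is a hypothetical stand-in with made-up values.

```python
from itertools import combinations
from math import comb

suits = "hearts spades clubs diamonds".split()
pairs = list(combinations(suits, 2))  # materialize the lazy iterator
for left, right in pairs:
    print(left, right)

# Iterating a dict yields its keys, so this pairs up the field names.
T = {"ts": (1, 2, 3), "late": (0.1, 0.2, 0.3), "peers": (5, 5, 6)}  # hypothetical data
field_pairs = list(combinations(T, 2))
print(field_pairs)
```

The number of pairs grows as n(n−1)/2, so a dataset with 10 fields already yields 45 pairs to inspect.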


Page 11: Statistics 101 for System Administrators

Netfishing correlation I

    # Now we have all the ingredients for
    # net-fishing relations between our data!
    for (k1, v1), (k2, v2) in combinations(T.items(), 2):
        # Look for correlations between every dataset!
        corr, prob = pearsonr(v1, v2)

        if corr > .6:
            print("Series", k1, k2, "can be correlated", corr)
        elif prob < 0.05:
            print("Series", k1, k2, "probability lower than 5%", prob)


Page 12: Statistics 101 for System Administrators

Netfishing correlation II

Now plot all combinations: there’s more than meets the eye!

    # Plot everything, and insert data in plots!
    for (k1, v1), (k2, v2) in combinations(T.items(), 2):
        corr, prob = pearsonr(v1, v2)
        plt.scatter(v1, v2)

        # 3 digit precision on title
        plt.title("R={:0.3f} P={:0.3f}".format(corr, prob))
        plt.xlabel(k1); plt.ylabel(k2)

        # save and close the plot
        plt.savefig("{}_{}.png".format(k1, k2)); plt.close()


Page 13: Statistics 101 for System Administrators

Plotting Correlation


Page 14: Statistics 101 for System Administrators

Color is the 3rd dimension

    from itertools import cycle

    colors = cycle("rgb")  # use more than 3 colors!
    labels = cycle("morning afternoon night".split())
    size = datalen // 3  # 3 colors, right? (integer division: slices need ints)
    for (k1, v1), (k2, v2) in combinations(T.items(), 2):
        # one scatter call per third of the day, each with its own color
        for i in range(0, datalen, size):
            plt.scatter(v1[i:i+size], v2[i:i+size],
                        color=next(colors), label=next(labels))
        # set title, save plot & co
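The slicing step can be isolated and checked without plotting. Stepping through the data in fixed-size slices produces a fourth, short chunk whenever the number of samples is not a multiple of three, which would shift the color and label cycles. The sketch below is one possible way around that, assuming samples evenly spaced over one day; the helper name `split_by_period` and the sample values are hypothetical.

```python
def split_by_period(series, labels=("morning", "afternoon", "night")):
    """Yield (label, chunk) pairs, cutting `series` into len(labels) slices.

    The last chunk absorbs any remainder, so no sample is dropped
    and no extra chunk is created.
    """
    size = len(series) // len(labels)
    for n, label in enumerate(labels):
        start = n * size
        end = start + size if n < len(labels) - 1 else len(series)
        yield label, series[start:end]

# Hypothetical latency samples: 10 values, not divisible by 3.
late = [0.1, 0.2, 0.1, 0.4, 0.5, 0.6, 0.9, 0.8, 0.7, 1.0]
chunks = dict(split_by_period(late))
print(chunks)
```

Each chunk can then be passed to `plt.scatter` with its own color and label.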


Page 15: Statistics 101 for System Administrators

Example Correlation


Page 16: Statistics 101 for System Administrators

Latency Solution

• Latency wasn’t related to packet size or system throughput
• Errors were not related to packet size
• Discovered system throughput


Page 17: Statistics 101 for System Administrators

Wrap Up

• Use statistics: it’s easy
• Don’t use ρ to exclude relations
• Plot, Plot, Plot
• Continue collecting results


Page 18: Statistics 101 for System Administrators

That’s all folks!

Thank you for your attention!
Roberto Polli - [email protected]
