Statistics 101 for System Administrators

Statistics 101 for System Administrators EuroPython 2014, 22nd July - Berlin Roberto Polli - [email protected] Babel Srl P.zza S. Benedetto da Norcia, 33 00040, Pomezia (RM) - www.babel.it 22 July 2014 Roberto Polli - [email protected]

description

Learning and using elements of statistics (distributions, standard deviation, linear correlation) in Python is very simple. The slides show an example of managing some data series for network troubleshooting.

Transcript of Statistics 101 for System Administrators

Page 1: Statistics 101 for System Administrators

Statistics 101 for System Administrators

EuroPython 2014, 22nd July - Berlin

Roberto Polli - [email protected]

Babel Srl P.zza S. Benedetto da Norcia, 33 00040, Pomezia (RM) - www.babel.it

22 July 2014 Roberto Polli - [email protected]

Page 2: Statistics 101 for System Administrators

Who? What? Why?

• Using (and learning) elements of statistics with Python.
• Roberto Polli - Community Manager @ Babel.it. Loves writing in C, Java and Python. Red Hat Certified Engineer and Virtualization Administrator.
• Babel - Proud sponsor of this talk ;) Delivers large mail infrastructures based on Open Source software for Italian ISPs and PA. Contributes to various FLOSS projects.

Intro Roberto Polli - [email protected]

Page 3: Statistics 101 for System Administrators

Agenda

• A latency issue: what happened?
• Correlation in 30 seconds
• Combining data
• Plotting time
• Modules: scipy, matplotlib

Page 4: Statistics 101 for System Administrators

A Latency Issue

• Episodic network latency issues
• Log traces: message size, #peers, retransmissions
• Do we need to scale? Was it a peak problem?

Find a rapid answer with Python!

Page 5: Statistics 101 for System Administrators

Basic statistics

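Python provides basic statistics, like

    # mean and std are imported from numpy here:
    # current scipy.stats no longer exports them
    from numpy import mean  # x̄
    from numpy import std   # σX

    T = {
        'ts': (1, 2, 3, .., ),
        'late': (0.12, 6.31, 0.43, .. ),
        'peers': (2313, 2313, 2312, ..),
        ...
    }

    # one (name, max, min, mean, std) tuple per data series
    print([(k, max(X), min(X), mean(X), std(X)) for k, X in T.items()])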
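The summary above can also be tried without any third-party package. This is only a sketch: the sample values are invented for illustration, and `statistics.pstdev` from the standard library is assumed to be an acceptable stand-in for the population standard deviation σX.

```python
from statistics import mean, pstdev

# Hypothetical sample data: timestamps, latency and connected peers.
T = {
    "ts": (1, 2, 3, 4, 5),
    "late": (0.12, 6.31, 0.43, 0.25, 0.30),
    "peers": (2313, 2313, 2312, 2314, 2313),
}

def summarize(table):
    """Return {name: (max, min, mean, population std)} for every series."""
    return {
        name: (max(xs), min(xs), mean(xs), pstdev(xs))
        for name, xs in table.items()
    }

summary = summarize(T)
for name, (hi, lo, avg, sd) in summary.items():
    print("{:6s} max={:.2f} min={:.2f} mean={:.2f} std={:.2f}".format(
        name, hi, lo, avg, sd))
```

A single outlier such as the 6.31 sample in 'late' inflates both the mean and the standard deviation, which is one reason to look at the whole distribution next.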

Page 6: Statistics 101 for System Administrators

Distributions

Data distribution - aka δX - shows event frequency.

    # The fastest way to get a
    # distribution is
    from matplotlib import pyplot as plt
    freq, bins, _ = plt.hist(T['late'])

    # plt.hist returns the frequencies and the bin edges
    distribution = zip(bins, freq)
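To see what a histogram computes, here is a dependency-free sketch of equal-width binning. The `histogram` helper and the RTT samples are hypothetical, written for this example; it is not how matplotlib implements `plt.hist`, only the same idea in a few lines.

```python
def histogram(values, nbins=10):
    """Count values in nbins equal-width bins; return (freq, edges)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins or 1.0  # avoid a zero width if all values are equal
    freq = [0] * nbins
    for v in values:
        # the maximum value is counted in the last bin
        idx = min(int((v - lo) / width), nbins - 1)
        freq[idx] += 1
    edges = [lo + i * width for i in range(nbins + 1)]
    return freq, edges

# Invented ping round-trip times, in ms.
rtt = [158.2, 158.9, 159.1, 159.4, 159.5, 159.6, 160.0, 160.3, 161.0, 161.9]
freq, edges = histogram(rtt, nbins=4)

# There is one more edge than bins, so zip() pairs each bin's left edge
# with its count and drops the final right edge.
distribution = list(zip(edges, freq))
```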

[Figure: Ping RTT distribution. x-axis: rtt in ms, from 158.0 to 162.0; y-axis: from 0.0 to 4.0]

Page 7: Statistics 101 for System Administrators

Correlation I

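Are two data series X, Y related? Given $\Delta x_i = x_i - \bar{x}$, Mr. Pearson answered with this formula:

$$\rho(X, Y) = \frac{\sum_i \Delta x_i \, \Delta y_i}{\sqrt{\sum_i \Delta^2 x_i \, \sum_i \Delta^2 y_i}} \in [-1, +1] \qquad (1)$$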

ρ identifies if the values of X and Y ‘move’ together on the same line.
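Formula (1) translates almost word for word into Python. The `pearson` function below is a sketch written for this transcript, not the scipy implementation; it assumes two series of equal length, neither of them constant.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's rho computed step by step from formula (1)."""
    if len(xs) != len(ys):
        raise ValueError("series must have the same length")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]  # the delta x_i terms
    dy = [y - mean_y for y in ys]  # the delta y_i terms
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator  # ZeroDivisionError if a series is constant

x = [1, 2, 3, 4, 5]
rho_up = pearson(x, [2, 4, 6, 8, 10])    # same line, rising
rho_down = pearson(x, [10, 8, 6, 4, 2])  # same line, falling
```

The sign of ρ tells the direction of the line, its absolute value how closely the points follow it.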

Page 8: Statistics 101 for System Administrators

You must (scatter) plot

ρ doesn’t find non-linear correlation!
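A small sketch of why the scatter plot matters: a parabola is fully determined by x, yet its ρ over a symmetric range is zero. The compact `pearson` helper is an assumption of this example (a pure-Python stand-in for a library routine).

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's rho for two equally long, non-constant series."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

x = list(range(-5, 6))          # symmetric around zero
parabola = [v * v for v in x]   # depends only on x, but not linearly
line = [3 * v + 1 for v in x]   # linear in x

rho_parabola = pearson(x, parabola)
rho_line = pearson(x, line)
print("parabola: {:.3f}  line: {:.3f}".format(rho_parabola, rho_line))
```

A ρ close to zero therefore only rules out a linear relation, not a relation altogether.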

Page 9: Statistics 101 for System Administrators

Probability Indicator

Python scipy provides a correlation function, returning two values:
• the ρ correlation coefficient ∈ [−1,+1]
• the probability that such datasets are produced by uncorrelated systems

    from random import randint
    from scipy.stats import pearsonr  # our beloved ρ

    # a and b lie on the same line; c and d are random
    a, b = range(0, 100), range(0, 400, 4)
    c, d = [randint(0, 100) for x in a], [randint(0, 100) for x in a]
    correlation, probability = pearsonr(a, b)  # ρ = 1.000, p = 0.000
    # random data: values change at every run, e.g. ρ = −0.041, p = 0.683
    correlation, probability = pearsonr(c, d)

Page 10: Statistics 101 for System Administrators

Combinations

itertools is a pot of gold, full of useful tools.

    from itertools import combinations
    # returns all possible combinations of
    # items grouped by N at a time
    items = "hearts spades clubs diamonds".split()
    combinations(items, 2)

    # And now all possible combinations between
    # dataset fields!
    combinations(T, 2)

Combining 4 suits, 2 at a time.

♥♠ ♥♣ ♥♦ ♠♣ ♠♦ ♣♦
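A runnable sketch of the suits example. `combinations` returns a lazy iterator, so it is wrapped in `list()` here to inspect the pairs; the variable names are chosen for this example.

```python
from itertools import combinations

suits = "hearts spades clubs diamonds".split()
pairs = list(combinations(suits, 2))
for first, second in pairs:
    print(first, second)

# n items taken 2 at a time give n * (n - 1) / 2 unordered pairs
n = len(suits)
expected_pairs = n * (n - 1) // 2
```

Each pair shows up once and order does not matter, which is what we want for correlation: ρ(X, Y) equals ρ(Y, X).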

Page 11: Statistics 101 for System Administrators

Netfishing correlation I

    # Now we have all the ingredients for
    # net-fishing relations between our data!
    for (k1, v1), (k2, v2) in combinations(T.items(), 2):
        # Look for correlations between every dataset!
        corr, prob = pearsonr(v1, v2)

        if corr > .6:
            print("Series", k1, k2, "can be correlated", corr)
        elif prob < 0.05:
            print("Series", k1, k2, "probability lower than 5%", prob)
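The net-fishing loop can be tried end to end on made-up data. Everything here is an assumption for illustration: the series names and values are invented, `pearson` is a pure-Python stand-in for scipy's `pearsonr` (so no probability is computed), and `strong_pairs` is a hypothetical helper. Unlike the loop above it compares the absolute value of ρ, so strong negative relations are reported as well.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson's rho for two equally long, non-constant series."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

# Hypothetical series, invented for this example.
T = {
    "late": [0.1, 0.2, 0.4, 0.8, 1.6, 3.2],
    "retrans": [1, 2, 4, 8, 16, 32],        # rises with latency
    "free_buf": [64, 63, 61, 57, 49, 33],   # falls as latency rises
    "size": [500, 480, 510, 490, 505, 495], # unrelated noise
}

def strong_pairs(table, threshold=0.6):
    """Return (name1, name2, rho) for every pair with |rho| > threshold."""
    found = []
    for (k1, v1), (k2, v2) in combinations(table.items(), 2):
        rho = pearson(v1, v2)
        if abs(rho) > threshold:
            found.append((k1, k2, round(rho, 3)))
    return found

for k1, k2, rho in strong_pairs(T):
    print("Series", k1, k2, "can be correlated", rho)
```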

Page 12: Statistics 101 for System Administrators

Netfishing correlation II

Now plot all combinations: there’s more than meets the eye!

    # Plot everything, and insert data in plots!
    for (k1, v1), (k2, v2) in combinations(T.items(), 2):
        corr, prob = pearsonr(v1, v2)
        plt.scatter(v1, v2)

        # 3 digit precision on title
        plt.title("R={:0.3f} P={:0.3f}".format(corr, prob))
        plt.xlabel(k1); plt.ylabel(k2)

        # save and close the plot
        plt.savefig("{}_{}.png".format(k1, k2)); plt.close()

Page 13: Statistics 101 for System Administrators

Plotting Correlation

Page 14: Statistics 101 for System Administrators

Color is the 3rd dimension

    from itertools import cycle
    colors = cycle("rgb")  # use more than 3 colors!
    labels = cycle("morning afternoon night".split())
    # datalen: number of samples per series, assumed to be
    # defined elsewhere (e.g. len(T['ts']))
    size = datalen // 3  # 3 colors, right?
    for (k1, v1), (k2, v2) in combinations(T.items(), 2):
        # one scatter call per slice of the day, each with its own color
        for i in range(0, datalen, size):
            plt.scatter(v1[i:i+size], v2[i:i+size],
                        color=next(colors), label=next(labels))
        # set title, save plot & co
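The slicing step is where an off-by-one can sneak in: with floor division, 10 samples and 3 colors give 4 slices, and the last one recycles the first color. A sketch with ceiling division, where the `chunks` helper and the sample list are hypothetical and no plotting is involved:

```python
def chunks(series, nchunks):
    """Split series into at most nchunks consecutive slices."""
    size = max(1, -(-len(series) // nchunks))  # ceiling division, at least 1
    return [series[i:i + size] for i in range(0, len(series), size)]

labels = "morning afternoon night".split()
samples = list(range(10))  # stand-in for one data series

parts = chunks(samples, len(labels))
for label, part in zip(labels, parts):
    print(label, part)
```

Each slice can then go to its own `plt.scatter` call with a dedicated color and label.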

Page 15: Statistics 101 for System Administrators

Example Correlation

Page 16: Statistics 101 for System Administrators

Latency Solution

• Latency wasn’t related to packet size or system throughput
• Errors were not related to packet size
• Discovered system throughput

Page 17: Statistics 101 for System Administrators

Wrap Up

• Use statistics: it’s easy
• Don’t use ρ to exclude relations
• Plot, Plot, Plot
• Continue collecting results

Page 18: Statistics 101 for System Administrators

That’s all folks!

Thank you for your attention!

Roberto Polli - [email protected]