
Modern Methods of Data Analysis - WS 07/08 Stephanie Hansmann-Menzemer

Modern Methods of Data Analysis

Lecture II (22.10.07)

Contents:

● Characterize distributions: average, spread ...

● Correlations, covariance


Reminder: Average

● arithmetic mean of data set: x̄ = (1/N) Σ_i x_i

● weighted mean of data set: x̄ = Σ_i w_i x_i / Σ_i w_i

● mode: most probable value (peak in distribution)

● median: smallest value which is ≥ 50% of the events. Better use the median than the mean: it is more robust against outliers!

● quantiles are defined similarly: the median is the 50% quantile
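The averages listed on this slide can be compared on a small sample. The following is a minimal sketch using only the Python standard library; the data values and weights are made up for illustration and do not come from the lecture.

```python
import statistics

# Hypothetical measurements; the last value is an outlier.
data = [4.8, 5.1, 5.0, 4.9, 5.0, 5.2, 25.0]
# Hypothetical weights, e.g. inverse squared uncertainties.
weights = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.01]

mean = sum(data) / len(data)
weighted_mean = sum(w * x for w, x in zip(weights, data)) / sum(weights)
median = statistics.median(data)
mode = statistics.mode(data)  # most frequent value in the sample

print(f"mean          = {mean:.3f}")
print(f"weighted mean = {weighted_mean:.3f}")
print(f"median        = {median:.3f}")
print(f"mode          = {mode:.3f}")
```

The single outlier pulls the arithmetic mean far away from 5, while the median and the mode stay at the bulk of the data.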


Reminder: Variance

● mean square deviation from the mean, called sample variance

● RMS (root mean square): standard deviation σ

● FWHM (Full Width at Half Maximum)

FWHM is more robust against outliers than RMS! For describing the core of a distribution use FWHM, for describing the tails use RMS.


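Expectation Values

● So far we characterized a given realization of an experiment (sum over N) by sample mean, sample spread ...

● Now talk about mean, spread of a distribution: E[x] = ∫ x f(x) dx = μ

Note: for a finite sample the sample mean x̄ is in general not equal to E[x]. However, for N → ∞ the sample mean converges to E[x] (law of large numbers).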


Variance of a Distribution:

● discrete case: V[x] = E[(x-μ)²] = Σ_i (x_i-μ)² P(x_i)

● continuous case: V[x] = E[(x-μ)²] = ∫ (x-μ)² f(x) dx

● V[x] = E[x²] - μ²

V[x] is the measure of the spread of the distribution, not how well the mean is defined!
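The step from E[(x-μ)²] to E[x²] - μ² follows from the linearity of the expectation value. A short worked derivation, using only μ = E[x]:

```latex
\begin{aligned}
V[x] &= E\left[(x-\mu)^2\right] \\
     &= E\left[x^2 - 2\mu x + \mu^2\right] \\
     &= E[x^2] - 2\mu\,E[x] + \mu^2 \\
     &= E[x^2] - \mu^2 \qquad \text{since } E[x] = \mu .
\end{aligned}
```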


Example:

[Figure: samples of size N = 100, N = 1000 and N = 10000 from a distribution with μ = 5, σ = 1]
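The point of this example can be reproduced with a small simulation. The sketch below assumes a Gaussian shape (the slide text only quotes μ = 5 and σ = 1) and uses a fixed random seed; it shows that the sample spread stays close to σ for every N, while the sample mean gets closer to μ as N grows.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is reproducible
MU, SIGMA = 5.0, 1.0  # values quoted on the slide

results = {}
for n in (100, 1000, 10000):
    # Assumption: the underlying distribution is Gaussian.
    sample = [random.gauss(MU, SIGMA) for _ in range(n)]
    mean = statistics.fmean(sample)
    spread = statistics.stdev(sample)  # sample standard deviation (N-1)
    results[n] = (mean, spread)
    print(f"N = {n:5d}: mean = {mean:.3f}, spread = {spread:.3f}, "
          f"sigma/sqrt(N) = {SIGMA / n ** 0.5:.3f}")
```

The last column is the expected size of the fluctuation of the mean, which shrinks with N even though the spread does not.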


How to determine uncertainty on the mean?

● E[x̄] = ???

● V[x̄] = ???


Uncertainty of Mean
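The formulas of this slide are not preserved in the transcript. A standard derivation, given here as a sketch under the assumption that the x_i are independent and all drawn from the same distribution with mean μ and variance σ², reads:

```latex
\begin{aligned}
E[\bar{x}] &= E\left[\frac{1}{N}\sum_{i=1}^{N} x_i\right]
            = \frac{1}{N}\sum_{i=1}^{N} E[x_i] = \mu , \\
V[\bar{x}] &= V\left[\frac{1}{N}\sum_{i=1}^{N} x_i\right]
            = \frac{1}{N^2}\sum_{i=1}^{N} V[x_i]
            = \frac{\sigma^2}{N}
\quad\Rightarrow\quad
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}} .
\end{aligned}
```

The uncertainty of the mean therefore shrinks like 1/√N, while the spread σ of the distribution itself does not depend on N.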


● CDF has a mass resolution of 16 MeV
– the reconstructed mass of a single B meson is spread around the true B mass with σ = 16 MeV

● The B mass can be measured with much better precision:

m(B0) = 5279.63 ± 0.53 (stat) ± 0.33 (sys) MeV
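As a rough cross-check of these numbers, one can ask how many B mesons are needed to reach the quoted statistical uncertainty. The sketch below assumes the simple σ/√N scaling for independent, background-free events, which is an idealisation and not a statement about the actual CDF analysis.

```python
# Numbers quoted on the slide (in MeV).
single_event_resolution = 16.0
statistical_uncertainty = 0.53

# Assumption: sigma_mean = sigma / sqrt(N) for independent events without background.
n_events = (single_event_resolution / statistical_uncertainty) ** 2
print(f"about {n_events:.0f} events")
```

Under this assumption the result is of the order of a thousand reconstructed B mesons.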


Unbiased Estimators:

Unbiased estimator (German: "erwartungstreuer Schätzer"): the expectation value of the estimator equals the true value.

An unbiased estimator for the true mean μ is the sample mean: x̄ = (1/N) Σ_i x_i

For N data points, we estimate the true variance V(x) by the "sample variance s²":

- If the true mean μ is known: s² = (1/N) Σ_i (x_i - μ)²

- If the true mean is unknown, then an unbiased estimator for the variance σ² is: s² = (1/(N-1)) Σ_i (x_i - x̄)²

Beware of N-1!

"One single value is not enough to determine mean and spread."
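The effect of the N-1 can be checked numerically. The sketch below is an illustration with assumed settings (Gaussian data with σ² = 1, samples of size N = 5, a fixed seed); it averages both variance estimators over many samples.

```python
import random

random.seed(2)  # fixed seed so the run is reproducible
MU, SIGMA, N, TRIALS = 0.0, 1.0, 5, 20000  # assumed settings for this illustration

sum_biased = 0.0
sum_unbiased = 0.0
for _ in range(TRIALS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    mean = sum(sample) / N  # the true mean is treated as unknown
    squares = sum((x - mean) ** 2 for x in sample)
    sum_biased += squares / N          # divides by N
    sum_unbiased += squares / (N - 1)  # divides by N-1

avg_biased = sum_biased / TRIALS
avg_unbiased = sum_unbiased / TRIALS
print(f"average with 1/N     : {avg_biased:.3f}")
print(f"average with 1/(N-1) : {avg_unbiased:.3f}")
```

With 1/N the average comes out near (N-1)/N · σ² = 0.8, with 1/(N-1) near the true value σ² = 1.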


Solution: Unbiased Estimators
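The content of this slide is not preserved in the transcript. One standard way to show that the factor 1/(N-1) gives an unbiased variance estimate, assuming independent x_i with mean μ and variance σ², is:

```latex
\begin{aligned}
\sum_{i=1}^{N} (x_i - \bar{x})^2
  &= \sum_{i=1}^{N} (x_i - \mu)^2 - N(\bar{x} - \mu)^2 , \\
E\left[\sum_{i=1}^{N} (x_i - \bar{x})^2\right]
  &= N\sigma^2 - N\,V[\bar{x}]
   = N\sigma^2 - N\,\frac{\sigma^2}{N}
   = (N-1)\,\sigma^2 , \\
E[s^2] &= E\left[\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2\right] = \sigma^2 .
\end{aligned}
```

The second line uses V[x̄] = σ²/N for the variance of the sample mean.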


Efficiency of Estimators

● Optimal estimator: result of a Maximum Likelihood fit (see later lectures); "optimal" ↔ smallest variance

● Efficiency of an estimator: "variance of optimal estimator / variance of estimator"

● For a Gaussian distribution the sample mean x̄ is the optimal estimator

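● an estimator is called robust if it is insensitive to outliers and to wrong assumptions about the distribution; robust estimators are in general not optimal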

● E.g. the median of a Gaussian distribution has 64% efficiency (2/π ≈ 0.64 for large N)
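The 64% quoted for the median can be checked with a simulation. The sketch below assumes standard Gaussian data, samples of size 101 and a fixed seed; it estimates the variances of the sample mean and of the sample median over many samples and takes their ratio.

```python
import random
import statistics

random.seed(3)  # fixed seed so the run is reproducible
N, TRIALS = 101, 4000  # odd N so the median is a single central value

means, medians = [], []
for _ in range(TRIALS):
    sample = [random.gauss(0.0, 1.0) for _ in range(N)]
    means.append(statistics.fmean(sample))
    medians.append(statistics.median(sample))

# Efficiency = variance of the optimal estimator (mean) / variance of the median.
efficiency = statistics.variance(means) / statistics.variance(medians)
print(f"efficiency of the median: {efficiency:.2f}")
```

The ratio fluctuates from run to run by a few percent, but it should come out close to 2/π.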


Truncated Mean

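● truncated mean (German: "getrimmter Mittelwert"): ignore the fraction (1-2r)/2 of lowest values and the fraction (1-2r)/2 of highest values, calculate the mean of the central fraction 2r

– e.g. r = 40% truncated mean: 10% lowest and 10% highest values ignored, calculate mean of 80% central values

– r = 50% truncated mean -> ordinary mean x̄

– r -> 0% -> median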
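A minimal sketch of the truncated mean in Python, following the convention of this slide that the central fraction 2r of the sorted values is kept. The function name `truncated_mean` and the sample data are hypothetical, and the rounding of the number of dropped values (rounding down) is a choice made here, not something the slide specifies.

```python
import math

def truncated_mean(values, r):
    """Mean of the central fraction 2*r of the sorted values (0 <= r <= 0.5)."""
    if not values:
        raise ValueError("need at least one value")
    if not 0.0 <= r <= 0.5:
        raise ValueError("r must be between 0 and 0.5")
    ordered = sorted(values)
    n = len(ordered)
    # Number of values dropped on each side; the small offset guards
    # against floating-point round-off such as 0.9999999999999998.
    k = math.floor((1.0 - 2.0 * r) / 2.0 * n + 1e-9)
    # Always keep at least the one or two central values (the median).
    k = min(k, (n - 1) // 2)
    central = ordered[k:n - k]
    return sum(central) / len(central)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # hypothetical sample with one outlier
print(truncated_mean(data, 0.5))  # ordinary mean, pulled up by the outlier
print(truncated_mean(data, 0.4))  # lowest and highest 10% dropped
print(truncated_mean(data, 0.0))  # median
```

For this sample, dropping a single value on each side already removes the outlier.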


Truncated Mean

[Figure: panels for the Cauchy distribution and the Laplace (double exponential) distribution]

The r = 0.23 truncated mean is the best estimator for an unknown symmetric distribution.


Moments

● r-th algebraic moment: μ'_r = E[x^r]

● r-th central moment: μ_r = E[(x-μ)^r]

Expectation value: 1st algebraic moment
Variance: 2nd central moment

Skewness (German: "Schiefe"): γ1 = μ_3/σ³
- dimensionless, positive for distributions with a longer tail to the right

Kurtosis (German: "Wölbung"): γ2 = μ_4/σ⁴ - 3
- measure for the weight of the tails relative to the core
- positive kurtosis: more pronounced tails than a Gaussian
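Skewness and kurtosis can be estimated from the central moments of a sample. The sketch below assumes the common definitions γ1 = μ_3/σ³ and γ2 = μ_4/σ⁴ - 3 (so that a Gaussian has kurtosis 0) with 1/N normalisation; the function names and the sample values are hypothetical.

```python
def central_moment(values, r):
    """r-th central moment of a sample (1/N normalisation)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** r for x in values) / len(values)

def skewness(values):
    """Third central moment divided by sigma cubed."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Fourth central moment divided by sigma^4, minus 3 (zero for a Gaussian)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]    # hypothetical symmetric sample
right_tail = [1, 1, 1, 2, 10]  # hypothetical sample with a tail to the right

print(f"symmetric : skewness = {skewness(symmetric):.3f}, kurtosis = {kurtosis(symmetric):.3f}")
print(f"right tail: skewness = {skewness(right_tail):.3f}, kurtosis = {kurtosis(right_tail):.3f}")
```

The symmetric sample has zero skewness and, being flat, a negative kurtosis; the sample with a long right tail has positive skewness.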


Skewness & Kurtosis

[Figure: example distributions with kurtosis < 0 and kurtosis > 0]

A Gaussian distribution has kurtosis = 0


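Which fraction of events is within 1, 2, 3 σ?

Within ±1σ: 68.27%, within ±2σ: 95.45%, within ±3σ: 99.73%, within ±4σ: 99.9937% of the events

[Figure: distribution with the 3σ and 4σ ranges marked]

This is only true for Gaussian distributions!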


Bienaymé-Tchebycheff-Inequality

For every distribution the following inequality is valid: P(|x-μ| ≥ kσ) ≤ 1/k²

Fraction of events outside ±kσ:

k    Gauss       Tchebycheff
1    0.317       1.0
2    0.0455      0.25
3    0.0027      0.1111
4    0.000063    0.0625
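The two columns of this table can be recomputed directly. In the sketch below the Gaussian column is the two-sided tail probability P(|x-μ| ≥ kσ) = erfc(k/√2), and the Tchebycheff column is the upper limit 1/k²; the dictionary name `table` is chosen here for illustration.

```python
import math

table = {}
for k in (1, 2, 3, 4):
    gauss_tail = math.erfc(k / math.sqrt(2.0))  # P(|x - mu| >= k sigma) for a Gaussian
    chebyshev_bound = 1.0 / k ** 2              # upper limit valid for any distribution
    table[k] = (gauss_tail, chebyshev_bound)
    print(f"k = {k}: Gauss = {gauss_tail:.6f}, Tchebycheff <= {chebyshev_bound:.4f}")
```

The bound holds for every distribution with finite variance, which is why it is so much weaker than the Gaussian values.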


Solution: Bienaymé-Tchebycheff-Inequality

Given a PDF f(x) and a function w(x)≥0: P(w(x) ≥ K) ≤ E[w(x)]/K for every K > 0

with w(x) = (x-μ)² and K = k²σ²: P(|x-μ| ≥ kσ) ≤ 1/k²
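The intermediate steps of this solution are not preserved in the transcript. One standard way to carry out the argument, given as a sketch for a PDF f(x), a function w(x) ≥ 0 and a constant K > 0, is:

```latex
\begin{aligned}
E[w(x)] &= \int_{-\infty}^{\infty} w(x)\, f(x)\, dx
         \;\ge\; \int_{w(x) \ge K} w(x)\, f(x)\, dx
         \;\ge\; K \int_{w(x) \ge K} f(x)\, dx
         = K \, P\big(w(x) \ge K\big) \\
\Rightarrow\quad P\big(w(x) \ge K\big) &\le \frac{E[w(x)]}{K} . \\
\text{With } w(x) = (x-\mu)^2,\ K = k^2\sigma^2 : \quad
P\big(|x-\mu| \ge k\sigma\big) &\le \frac{\sigma^2}{k^2\sigma^2} = \frac{1}{k^2} .
\end{aligned}
```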


Two Dimensional Distributions

Multiple ways to visualize 2-dim distributions:

● box plot
● lego plot
● surface plot
● numbers
● scatter plot
● color map
● contour plot
● ...


Two Dimensional Distributions

● straightforward generalization of 1-dim PDFs

A 2-dim PDF is a function f(x,y)≥0 with ∫∫ f(x,y) dx dy = 1


Marginal Distributions

● Marginal distributions (German: "Randverteilungen"): projections onto the axes

f_x(x) = ∫ f(x,y) dy, f_y(y) = ∫ f(x,y) dx
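For binned data the projection onto an axis becomes a sum over the bins of the other variable. A minimal sketch with a made-up 3x3 histogram (the counts are hypothetical):

```python
# Hypothetical 2-dim histogram: counts[i][j] = entries in x-bin i and y-bin j.
counts = [
    [1, 2, 0],
    [3, 5, 1],
    [0, 2, 6],
]
total = sum(sum(row) for row in counts)

# Normalise so that the discrete analogue of the 2-dim PDF sums to 1.
f_xy = [[c / total for c in row] for row in counts]

# Marginal in x: sum over all y-bins; marginal in y: sum over all x-bins.
f_x = [sum(row) for row in f_xy]
f_y = [sum(col) for col in zip(*f_xy)]

print("marginal in x:", f_x)
print("marginal in y:", f_y)
```

Each marginal distribution is again normalised to 1.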


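Conditional Probability

● conditional PDFs: h(y|x) = f(x,y)/f_x(x) and h(x|y) = f(x,y)/f_y(y), where f_x(x) and f_y(y) are the marginal distributions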


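Covariance (I)

● x,y independent (with marginal distributions f_x(x), f_y(y)):
– h(x|y) = f_x(x) for all y and h(y|x) = f_y(y) for all x
– f(x,y) = f_x(x) f_y(y)

covariance: cov(x,y) = E[(x-μ_x)(y-μ_y)] = E[xy] - E[x]E[y]

Note: cov(x,x) = V(x)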


Covariance (II)

● estimate for cov(x,y) from sample: cov(x,y) ≈ (1/(N-1)) Σ_i (x_i - x̄)(y_i - ȳ)


Exercises:

● Prove: V(x+y) = V(x) + V(y) + 2 cov(x,y)

● Given are the following data points (x,y): (1,1), (2,1), (-3,-1), (2,2), (1,5). Give an estimate for cov(x,y).
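The second exercise can be checked with a few lines of Python. The sketch below assumes the 1/(N-1) normalisation for the sample covariance; the transcript does not preserve which normalisation the lecture uses, so the 1/N value is printed as well.

```python
points = [(1, 1), (2, 1), (-3, -1), (2, 2), (1, 5)]  # data points from the exercise

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n

# Sum of products of the deviations from the sample means.
sum_products = sum((x - mean_x) * (y - mean_y) for x, y in points)

cov_unbiased = sum_products / (n - 1)  # assumption: N-1 normalisation
cov_biased = sum_products / n          # alternative: 1/N normalisation

print(f"cov(x,y) with 1/(N-1): {cov_unbiased:.2f}")
print(f"cov(x,y) with 1/N    : {cov_biased:.2f}")
```

The sample means are 0.6 and 1.6, and the sum of products is 10.2, which gives 2.55 with 1/(N-1) and 2.04 with 1/N.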


Correlation

● based on the covariance between two variables x,y

● define correlation coefficient ρ = cov(x,y)/(σ_x σ_y)
– ρ ranges between +1 and -1
– if two variables are independent, then ρ=0


Example Correlations

● Two correlated Gaussian distributions:

[Figure: four scatter plots with ρ = 0.97, ρ = -0.7, ρ = 0.0 and ρ = 0.5]
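Pairs of correlated Gaussian variables like those in this example can be generated from two independent ones. The sketch below is one common construction, not necessarily the one used for the lecture plots: with independent standard Gaussian x and z, the variable y = ρx + √(1-ρ²) z has unit variance and correlation ρ with x. The function names are hypothetical.

```python
import math
import random

def correlated_pair(rho, rng):
    """Return (x, y), both standard Gaussian, with correlation coefficient rho."""
    x = rng.gauss(0.0, 1.0)
    z = rng.gauss(0.0, 1.0)
    y = rho * x + math.sqrt(1.0 - rho ** 2) * z
    return x, y

def correlation(pairs):
    """Sample correlation coefficient cov(x,y) / (sigma_x * sigma_y)."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(4)  # fixed seed so the run is reproducible
estimates = {}
for rho in (0.97, -0.7, 0.0, 0.5):  # values shown in the example
    pairs = [correlated_pair(rho, rng) for _ in range(20000)]
    estimates[rho] = correlation(pairs)
    print(f"rho = {rho:5.2f}: estimated from sample = {estimates[rho]:.3f}")
```

The normalisation factors cancel in the ratio, so the choice between 1/N and 1/(N-1) does not matter for ρ.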


Example: Linear Correlation

● y = ax+b; f(x) = 0.5 in [-1,1], else 0

● Calculate <y>

● Calculate <xy>

● Calculate cov(x,y) = <xy>-<x><y>

● Calculate ρ
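A worked solution of this example, as a sketch that reads the condition on x as a uniform PDF f(x) = 1/2 on [-1,1] and assumes a ≠ 0:

```latex
\begin{aligned}
\langle x \rangle &= \int_{-1}^{1} \tfrac{1}{2}\, x \, dx = 0 , \qquad
\langle x^2 \rangle = \int_{-1}^{1} \tfrac{1}{2}\, x^2 \, dx = \tfrac{1}{3} , \\
\langle y \rangle &= a \langle x \rangle + b = b , \qquad
\langle xy \rangle = a \langle x^2 \rangle + b \langle x \rangle = \tfrac{a}{3} , \\
\mathrm{cov}(x,y) &= \langle xy \rangle - \langle x \rangle \langle y \rangle = \tfrac{a}{3} , \\
V[x] &= \tfrac{1}{3} , \qquad V[y] = a^2\, V[x] = \tfrac{a^2}{3} , \\
\rho &= \frac{\mathrm{cov}(x,y)}{\sigma_x \sigma_y}
      = \frac{a/3}{\sqrt{\tfrac{1}{3} \cdot \tfrac{a^2}{3}}}
      = \frac{a}{|a|} = \pm 1 .
\end{aligned}
```

A strictly linear relation therefore gives full correlation, with the sign of ρ equal to the sign of the slope a.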


Example: Parabola

● y = x²; f(x) = 0.5 in [-1,1], else 0

● Calculate <y>

● Calculate <xy>

● Calculate cov(x,y)
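A worked solution of this example, again as a sketch that reads the condition on x as a uniform PDF f(x) = 1/2 on [-1,1]:

```latex
\begin{aligned}
\langle x \rangle &= \int_{-1}^{1} \tfrac{1}{2}\, x \, dx = 0 , \\
\langle y \rangle &= \langle x^2 \rangle = \int_{-1}^{1} \tfrac{1}{2}\, x^2 \, dx = \tfrac{1}{3} , \\
\langle xy \rangle &= \langle x^3 \rangle = \int_{-1}^{1} \tfrac{1}{2}\, x^3 \, dx = 0 , \\
\mathrm{cov}(x,y) &= \langle xy \rangle - \langle x \rangle \langle y \rangle
                   = 0 - 0 \cdot \tfrac{1}{3} = 0 .
\end{aligned}
```

The covariance vanishes although y is completely determined by x.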


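Example ρ = 0.0

● By construction the covariance vanishes for independent variables. The converse is not true: zero covariance does not necessarily mean that the variables are independent.

● E.g. the parabola example: for y = x² with x distributed symmetrically around 0, y is completely determined by x, but cov(x,y) = 0.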


Example ρ = 0:


Correlation ≠ Causality

As read in a newspaper:

"If you take your time over your studies, you earn more money!"

(taken from "So lügt man mit Statistik")