Dive into the Data

Dive into the Data: Data Analysis using Python and common sense* (* required). Dr Jean-Paul Ebejer, ICT Tech Talk Series, March 2015.

Transcript of Dive into the Data

Page 1: Dive into the Data

Dive into the Data: Data Analysis using Python and common sense*

Dr Jean-Paul Ebejer ICT Tech Talk Series

March 2015

* required

Page 2: Dive into the Data

“IN GOD WE TRUST, ALL THE OTHERS BRING DATA”

W. Edwards Deming

Page 3: Dive into the Data

Why data analysis?

• Only real way to make informed decisions

• Use data:

– Either to prove that a working hypothesis is correct

– Or to formulate a hypothesis

• Data has increased exponentially in many areas: biology, astrophysics, finance, social networks

– Surprisingly our knowledge has not

• Allows us to make predictions, mostly based on previous behaviour


Page 4: Dive into the Data

Data Scientist / Analyst skillset

Inquisitive

Cloud Computing

Statistics

Persistent

Python

Visualization

R

Data structures

Big Data

SQL / NoSQL

Hadoop

Programming

Machine Learning

Problem Solving

Quantitative Reasoning

Data Mining

Databases


Page 5: Dive into the Data

Data Analysis Example: finding cheaters in a multiple-choice exam

• Exam consists of 40 multiple choice questions

• Compare every possible pair of exam scripts (across all students)

• Graph shows shared correct and wrong answers between a pair of scripts

• Diagonal line y = -x + 40

http://lalashan.mcmaster.ca/theobio/math/index.php/Answer_matching
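The pairwise comparison can be sketched in a few lines of Python. The answer key, student names and answers below are invented for illustration, and the exam is shortened to 5 questions instead of 40; only the idea of counting shared correct and shared wrong answers for each pair of scripts comes from the slide.

```python
from itertools import combinations

# Hypothetical answer key and scripts (5 questions instead of 40).
ANSWER_KEY = "ABCDA"
SCRIPTS = {
    "student_1": "ABCDB",
    "student_2": "ABCDB",
    "student_3": "ABDDA",
}


def shared_answers(script_a, script_b, key):
    """Count answers both scripts got right, and identical wrong answers."""
    shared_correct = 0
    shared_wrong = 0
    for a, b, k in zip(script_a, script_b, key):
        if a == b == k:
            shared_correct += 1
        elif a == b:  # same answer, but not the correct one
            shared_wrong += 1
    return shared_correct, shared_wrong


# One (shared correct, shared wrong) point per pair of scripts.
pair_counts = {
    (name_a, name_b): shared_answers(SCRIPTS[name_a], SCRIPTS[name_b], ANSWER_KEY)
    for name_a, name_b in combinations(sorted(SCRIPTS), 2)
}

for pair, (correct, wrong) in pair_counts.items():
    print(pair, "shared correct:", correct, "shared wrong:", wrong)
```

A pair cannot share more answers than there are questions, so shared correct plus shared wrong is at most 40 in the real exam. That is presumably why the plot is bounded by the diagonal y = -x + 40: pairs close to that line with many shared wrong answers are the ones worth a closer look.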

Page 6: Dive into the Data

Types of data

• Nominal (Categorical) - mutually exclusive e.g. Fgura, Mosta, Sliema

• Ordinal - Ordered categories e.g. exam grades: A (100-80), B (80-70), C (70-55)

• Interval - Equidistant and comparable ordinal data e.g. temperature 80 °C, 90 °C

• Ratio - Interval data with 0 having a meaning (absence of that variable)

• Not always clear (e.g. Colour? Red, Blue, Green but also specific physical wavelength)

• Continuous (measured) versus discrete (counted)


Page 7: Dive into the Data

The nature of data

• Very often

– Incomplete

– Noisy, e.g. audio sensor picking up background signal

– Incorrect / Inconsistent, e.g. totals do not match single row entries

– Out-of-date, e.g. using cost of 3D printers from 10 years ago

– Not normalized, e.g. different date formats (10th November 2014, 10/11/2014) or different scales ($ and €)

– Outliers, e.g. using the Guinness Book of Records to sample population statistics

– Duplicates Duplicates

• Need to clean up and normalize the dataset before starting

• For data to be of quality, it must be accurate, complete, unique, timely, and consistent
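As a rough illustration of the clean-up step, the sketch below normalizes two date formats and two currencies and then drops the duplicates that normalization reveals. The records, the exchange rate and the function names are all made up for this example, and parsing the month name assumes an English locale.

```python
import re
from datetime import datetime

# Hypothetical raw records: (date as typed, amount, currency).
RAW_ROWS = [
    ("10th November 2014", 100.0, "EUR"),
    ("10/11/2014", 100.0, "EUR"),  # same record, different date format
    ("11/11/2014", 50.0, "USD"),
]

# Assumed exchange rate, for illustration only.
USD_TO_EUR = 0.8


def parse_date(text):
    """Parse '10th November 2014' or '10/11/2014' (day first) into a date."""
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", text.strip())
    for fmt in ("%d/%m/%Y", "%d %B %Y"):
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {text!r}")


def to_eur(amount, currency):
    """Bring every amount onto a single scale (euros)."""
    if currency == "EUR":
        return amount
    if currency == "USD":
        return amount * USD_TO_EUR
    raise ValueError(f"unknown currency: {currency!r}")


def clean(rows):
    """Normalize dates and currencies, then drop exact duplicates."""
    seen = set()
    result = []
    for raw_date, amount, currency in rows:
        record = (parse_date(raw_date), to_eur(amount, currency))
        if record not in seen:
            seen.add(record)
            result.append(record)
    return result


cleaned_rows = clean(RAW_ROWS)
print(cleaned_rows)
```

The first two rows only turn out to be duplicates once their dates are in the same format, which is why normalization comes before de-duplication here.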


Page 8: Dive into the Data

Rule 0 of any Data Analysis project: Always look at the data

Page 9: Dive into the Data

Descriptive Statistics (Analysis)

• Ways to provide quantitative, summary descriptions for a dataset

• Use these descriptions to compare datasets

• Data describing:

– Central tendency: mean, median, mode

– Dispersion: standard deviation, variance, min and max values, kurtosis, and skewness

Page 10: Dive into the Data

Central Tendency

• Mean

– Cannot be applied to nominal (categorical) variables

– Affected by extreme values

• Median – pick the middle element in a sorted list

– Need to sort the list first

– Robust to extreme values

• Mode – most common element in a list

– Dataset may have many modes (or none at all if all elements are different)

– May not be representative of data

$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$
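The three measures of central tendency are available in Python's standard `statistics` module. The salary figures below are invented for illustration; the last one is deliberately extreme to show how differently the mean and the median react.

```python
import statistics

# Hypothetical monthly salaries in euros; the last value is an extreme one.
salaries = [1200, 1300, 1300, 1400, 1500, 1600, 20000]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)
modes = statistics.multimode(salaries)  # every most-common value, as a list

print("mean:", mean_salary)
print("median:", median_salary)
print("modes:", modes)

# Dropping the extreme value moves the mean a lot and the median only a little.
trimmed = salaries[:-1]
print("mean without outlier:", statistics.mean(trimmed))
print("median without outlier:", statistics.median(trimmed))
```

Note that `statistics.multimode` returns every value when all elements are different, so the "no mode" case has to be detected by the caller.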

Page 11: Dive into the Data

Dispersion

• Standard Deviation, σ

– Tells you the spread of your data

– With a normal distribution, 1σ around the mean accounts for 68% of the data, 2σ accounts for 95% of the data and 3σ accounts for 99.7%

– σ of almost 0 tells you the data is very close to the mean

• Many more statistics exist: variance, kurtosis, etc.

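[Figure: normal distribution curve with labels 34% (σ, on each side of the mean), 47% (2σ) and 49.5% (3σ)]

$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2}$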
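The standard deviation and the coverage rule for normally distributed data can be checked empirically. The sketch below computes σ as the square root of the mean squared distance from the mean and counts how many simulated samples fall within 1σ, 2σ and 3σ; the simulated data, the seed and the helper names are assumptions made for this example.

```python
import math
import random


def population_std(values):
    """Standard deviation as sqrt of the mean squared distance from the mean."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))


# Simulated, normally distributed data (mean 0, sigma 1); the seed is arbitrary.
rng = random.Random(42)
samples = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

mean = sum(samples) / len(samples)
sigma = population_std(samples)


def fraction_within(k):
    """Fraction of samples lying within k standard deviations of the mean."""
    inside = sum(1 for x in samples if abs(x - mean) <= k * sigma)
    return inside / len(samples)


for k in (1, 2, 3):
    print(f"within {k} sigma: {fraction_within(k):.3f}")
```

With this many samples the printed fractions should land close to 0.68, 0.95 and 0.997 respectively.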

Page 12: Dive into the Data

But why is this important? (I)

• If you earned the average salary, would you be happy with this raise?

Page 13: Dive into the Data

But why is this important? (II)

[Figure labels: σ = 2,000 Euros; σ = 6,000 Euros]

Page 14: Dive into the Data

Rule 1 of any Data Analysis project: Plot the data

Page 15: Dive into the Data

Not everything is as it seems...

Mean = 6.67

Page 16: Dive into the Data

PREDICTIVE MODELS

Page 17: Dive into the Data

Simple Linear Regression

• Used to model a linear relationship between two variables

• Two variables which bear a relationship are said to be “correlated”

• The degree of correlation is commonly measured using the Pearson correlation coefficient, r

• Ability to predict a response variable (Y) from an explanatory variable (X)

• E.g. How is the weight of a person affected by his or her height?


Page 18: Dive into the Data

Correlation (I)

• Both variables increase: positive correlation (1)

• One increases, the other decreases: negative correlation (-1)

• No linear correlation: 0

– This does not mean there is no relationship between the two variables!
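A small sketch of the Pearson coefficient makes the three cases concrete, including a relationship that is perfectly deterministic yet has r = 0. The function name and the sample data are illustrative choices, not taken from the talk.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares; the 1/n factors cancel in the ratio.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [-3, -2, -1, 0, 1, 2, 3]

r_positive = pearson_r(xs, [2 * x + 1 for x in xs])   # y rises with x
r_negative = pearson_r(xs, [-2 * x + 1 for x in xs])  # y falls as x rises
r_parabola = pearson_r(xs, [x ** 2 for x in xs])      # y depends on x, but not linearly

print(r_positive, r_negative, r_parabola)
```

The parabola case is the warning from the last bullet: y is fully determined by x, yet the linear correlation is zero because the left and right halves cancel out.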

Page 19: Dive into the Data

Correlation (II)

• There is a positive correlation between Weight and Height

• As height increases, the weight increases

• Draw the line of best fit

– How?

Page 20: Dive into the Data

Line of best fit

$d_i = y_i - \hat{y}_i$

$\text{Error} = \sum_{i=1}^{n} d_i^2$

[Figure labels: d1, d2, d6, d7 (distances between data points and the line)]

We want to minimize(Error)

Why is the distance squared?
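For a straight line, the error defined above has a well-known closed-form minimum, which the sketch below implements. The height and weight values are invented for illustration, and `fit_line` and `squared_error` are names chosen for this example.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def squared_error(xs, ys, slope, intercept):
    """Sum of squared vertical distances d_i between the points and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical heights (cm) and weights (kg).
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 67, 74, 79]

slope, intercept = fit_line(heights, weights)
best_error = squared_error(heights, weights, slope, intercept)
steeper_error = squared_error(heights, weights, slope + 0.1, intercept)

print(f"weight = {slope:.2f} * height + {intercept:.2f}")
print("error of fitted line:", round(best_error, 3))
# Any other line, e.g. a slightly steeper one, has a larger error.
print("error of steeper line:", round(steeper_error, 3))
```

As for why the distances are squared: squaring stops positive and negative distances from cancelling each other out, and it penalizes large misses more heavily than small ones.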

Page 21: Dive into the Data

How good is this line?

• A measure, R², tells us how good our line of fit is

• Values range from 0 (no correlation) to 1 (perfect correlation)

• Informally, R² is the ratio of SSR to SSE + SSR, each summed over every data point

[Figure labels: mean; correct part of prediction (SSR); incorrect part of prediction (SSE)]
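The ratio can be worked through on a tiny example. The four points below are made up, and the line y = 0.8x + 1.3 is their least-squares line; the split of the total variation into SSR and SSE only adds up this neatly for a least-squares fit.

```python
# Four made-up points; y = 0.8x + 1.3 is their least-squares line.
xs = [0, 1, 2, 3]
ys = [1, 3, 2, 4]
slope, intercept = 0.8, 1.3

mean_y = sum(ys) / len(ys)
predictions = [slope * x + intercept for x in xs]

# SSR: variation explained by the line (prediction versus the mean).
ssr = sum((p - mean_y) ** 2 for p in predictions)
# SSE: variation the line gets wrong (actual value versus prediction).
sse = sum((y - p) ** 2 for y, p in zip(ys, predictions))

r_squared = ssr / (ssr + sse)
print(f"SSR = {ssr:.2f}, SSE = {sse:.2f}, R^2 = {r_squared:.2f}")
```

Here SSR is 3.2 and SSE is 1.8, so the line accounts for 64% of the variation in y.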

Page 22: Dive into the Data

Back to our example ...

R² = 0.97

97% of the total variation in y can be explained by the linear relationship between x and y

Page 23: Dive into the Data

But R² is only part of the story (Anscombe’s quartet)

All four datasets share (almost exactly) the same summary statistics:

• Mean(x) = 9

• Mean(y) = 7.50

• Var(x) = 11

• Var(y) ≈ 4.1

• r = 0.816 (R² ≈ 0.67)

• Linear regression line: y = 3.00 + 0.500x
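The point of the quartet can be reproduced with its first two datasets. The values below are the ones commonly published for Anscombe's quartet, typed in from memory, so they are worth checking against a reference copy; `summarize` is a helper name chosen for this sketch.

```python
import statistics

# First two datasets of Anscombe's quartet (they share the same x values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def summarize(xs, ys):
    """Return mean of y, slope and intercept of the least-squares line, and r."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    slope = sxy / sxx
    return {
        "mean_y": mean_y,
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
        "r": sxy / (sxx * syy) ** 0.5,
    }


summary_1 = summarize(x, y1)
summary_2 = summarize(x, y2)

for name, summary in (("I", summary_1), ("II", summary_2)):
    print(name, {key: round(value, 2) for key, value in summary.items()})
```

Both datasets give practically the same mean, regression line and correlation coefficient of about 0.816, even though the first is a noisy linear trend and the second is a smooth curve. Only a plot reveals the difference.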

Page 24: Dive into the Data

Correlation does not imply Causation!


Page 25: Dive into the Data

You will now understand this joke (perhaps)


Page 26: Dive into the Data

Take care with predictive models...


Page 27: Dive into the Data

LINEAR REGRESSION IN PY

Page 28: Dive into the Data

Naive Bayes Classifier

• Naive Bayes Classifiers – build a probabilistic model for prediction

• Based on Bayes’ theorem

• Used in a lot of places; e.g. your spam filter


Page 29: Dive into the Data

Movie Suggestion using Naive-Bayes Classifiers

Movie Genre?  | Won Oscar? | Explosions? | Like?
Romantic (R)  | Y          | N           | NO
Romantic (R)  | N          | N           | NO
Romantic (R)  | N          | N           | YES
Action (A)    | N          | N           | NO
Action (A)    | Y          | Y           | YES
Action (A)    | N          | Y           | NO
Sci-Fi (SF)   | Y          | N           | YES
Sci-Fi (SF)   | N          | N           | NO
Sci-Fi (SF)   | Y          | Y           | YES

Page 30: Dive into the Data

Will I enjoy ...

• (Genre=Action, Oscar=Y, Explosions=N)

• Note that this combination is not in the training set


Page 31: Dive into the Data

Calculate probabilities of observed training data

P(Romantic|LIKE) = 1/4

P(Action|LIKE) = 1/4

P(Sci-Fi|LIKE) = 2/4

P(Romantic|DISLIKE) = 2/5

P(Action|DISLIKE) = 2/5

P(Sci-Fi|DISLIKE) = 1/5

P(Oscar|LIKE) = 3/4

P(No Oscar|LIKE) = 1/4

P(Oscar|DISLIKE) = 1/5

P(No Oscar|DISLIKE) = 4/5

P(Explosions|LIKE) = 2/4

P(No Explosions|LIKE) = 2/4

P(Explosions|DISLIKE) = 1/5

P(No Explosions|DISLIKE) = 4/5

P(LIKE) = 4/9

P(DISLIKE) = 5/9


Page 32: Dive into the Data

Calculate probabilities

P(Like|Gladiator) ∝ P(LIKE)*P(Action|LIKE)*P(Oscar|LIKE)*P(No Explosions|LIKE) = 4/9 * 1/4 * 3/4 * 2/4 = 0.042

P(Dislike|Gladiator) ∝ P(DISLIKE)*P(Action|DISLIKE)*P(Oscar|DISLIKE)*P(No Explosions|DISLIKE) = 5/9 * 2/5 * 1/5 * 4/5 = 0.036

Since P(Like|Gladiator) > P(Dislike|Gladiator), the prediction is that I like this movie! (TRUE)
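The whole calculation fits in a short, dependency-free sketch. It uses the training table from the earlier slide and the query stated there (Genre=Action, Oscar=Y, Explosions=N), so the explosion terms are P(No Explosions|LIKE) and P(No Explosions|DISLIKE). The `score` function and the tuple layout are choices made for this example, not the code shown in the talk.

```python
from fractions import Fraction

# Training data from the table: (genre, won_oscar, explosions, liked).
TRAINING = [
    ("Romantic", "Y", "N", False),
    ("Romantic", "N", "N", False),
    ("Romantic", "N", "N", True),
    ("Action", "N", "N", False),
    ("Action", "Y", "Y", True),
    ("Action", "N", "Y", False),
    ("Sci-Fi", "Y", "N", True),
    ("Sci-Fi", "N", "N", False),
    ("Sci-Fi", "Y", "Y", True),
]


def score(features, liked):
    """P(class) times the product of P(feature value | class), unnormalized."""
    rows = [row for row in TRAINING if row[3] == liked]
    result = Fraction(len(rows), len(TRAINING))  # prior, e.g. P(LIKE) = 4/9
    for index, value in enumerate(features):
        matches = sum(1 for row in rows if row[index] == value)
        result *= Fraction(matches, len(rows))
    return result


# The unseen movie: Genre=Action, Oscar=Y, Explosions=N.
movie = ("Action", "Y", "N")
like_score = score(movie, True)
dislike_score = score(movie, False)

print("like:", like_score, float(like_score))
print("dislike:", dislike_score, float(dislike_score))
print("prediction:", "LIKE" if like_score > dislike_score else "DISLIKE")
```

The two scores are not probabilities that sum to 1; they are only compared with each other. A feature value that never occurs with a class would zero the whole product, which is why practical implementations usually smooth the counts.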

Page 33: Dive into the Data

NBC IN PY

Page 34: Dive into the Data

Conclusions

• Data Analysis projects – start small

– Look at the data, clean, compute descriptive statistics, visualize (using correct plot type)

• Formulate a hypothesis, and back it up with data

• For predictive modelling, it is very important to use the correct algorithm and understand how the algorithm works internally

– Many parameters to tweak


Page 35: Dive into the Data

@malteseunderdog
