Dive into the Data

Data Analysis using Python and common sense*

Dr Jean-Paul Ebejer

ICT Tech Talk Series

March 2015

* required

“IN GOD WE TRUST, ALL THE OTHERS BRING DATA”

W. Edwards Deming

Why data analysis?

• Only real way to make informed decisions

• Use data:

– Either to prove that a working hypothesis is correct

– Or to formulate a hypothesis

• Data has increased exponentially in many areas: biology, astrophysics, finance, social networks

– Surprisingly our knowledge has not

• Allows us to make predictions, mostly based on previous behaviour

Data Scientist / Analyst skillset

Inquisitive

Cloud Computing

Statistics

Persistent

Python

Visualization

R

Data structures

Big Data

SQL / NoSQL

Hadoop

Programming

Machine Learning

Problem Solving

Quantitative Reasoning

Data Mining

Databases

Data Analysis Example: finding cheaters in a multiple-choice exam

• Exam consists of 40 multiple choice questions

• All possible pairs of exam scripts are compared (for all students)

• Graph shows shared correct and wrong answers between a pair of scripts

• Diagonal line y = -x + 40: a pair lies on it only when both scripts give the same answer to all 40 questions

http://lalashan.mcmaster.ca/theobio/math/index.php/Answer_matching
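The pairwise comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the answer key, the three scripts and the helper name `count_shared` are made up, and the exam is shortened to 5 questions instead of 40.

```python
from itertools import combinations


def count_shared(script_a, script_b, key):
    """Return (shared_correct, shared_wrong) for two answer scripts."""
    shared_correct = 0
    shared_wrong = 0
    for a, b, k in zip(script_a, script_b, key):
        if a == b:  # both students gave the same answer
            if a == k:
                shared_correct += 1
            else:
                shared_wrong += 1
    return shared_correct, shared_wrong


# Made-up data for illustration only (5 questions, options A-D)
answer_key = "ABCDA"
scripts = {
    "student_1": "ABCDB",
    "student_2": "ABCDB",  # identical to student_1, including the wrong answer
    "student_3": "ACCAA",
}

# Compare every possible pair of scripts
pair_counts = {
    (s1, s2): count_shared(scripts[s1], scripts[s2], answer_key)
    for s1, s2 in combinations(sorted(scripts), 2)
}

for pair, (correct, wrong) in pair_counts.items():
    print(pair, "shared correct:", correct, "shared wrong:", wrong)
```

Here the pair (student_1, student_2) has shared correct + shared wrong equal to the number of questions, which is the situation the diagonal line represents.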

Types of data

• Nominal (Categorical) - mutually exclusive e.g. Fgura, Mosta, Sliema

• Ordinal - Ordered categories e.g. exam grades: A (100-80), B (80-70), C (70-55)

• Interval - Equidistant and comparable ordinal data e.g. temperature 80 °C, 90 °C

• Ratio - Interval data with 0 having a meaning (absence of that variable)

• Not always clear (e.g. Colour? Red, Blue, Green but also specific physical wavelength)

• Continuous (measured) versus discrete (counted)

The nature of data

• Very often

– Incomplete

– Noisy, e.g. audio sensor picking up background signal

– Incorrect / Inconsistent, e.g. Totals do not match single row entries

– Out-of-date, e.g. using cost of 3D printers from 10 years ago

– Not normalized (e.g. 10th November 2014, 10/11/2014) or of different scales ($ and €)

– Outliers, e.g. Using the Guinness book of records to sample population statistics

– Duplicates Duplicates

• Need to clean up and normalize the dataset before starting

• For data to be of quality, it must be accurate, complete, unique, timely, and consistent
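A minimal clean-up sketch for three of the problems listed above (inconsistent date formats, incomplete rows, duplicates). The records, the helper names `parse_date` and `clean`, and the assumption that the day comes before the month in `10/11/2014` are all illustrative, not taken from the talk.

```python
import re
from datetime import datetime

# Made-up raw records: (date string, amount in euros or None)
raw_records = [
    ("10th November 2014", 12.5),
    ("10/11/2014", 12.5),   # same record as above, different date format
    ("11/11/2014", None),   # incomplete: amount is missing
    ("12/11/2014", 7.0),
]

# Assumption: dates are written day first, as in the example on the slide
DATE_FORMATS = ("%d %B %Y", "%d/%m/%Y")


def parse_date(text):
    """Normalize both date notations into a datetime.date object."""
    # Strip ordinal suffixes such as "10th" -> "10"
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)", r"\1", text)
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {text!r}")


def clean(records):
    """Drop incomplete rows, normalize dates, and remove duplicates."""
    seen = set()
    cleaned = []
    for date_text, amount in records:
        if amount is None:      # incomplete
            continue
        row = (parse_date(date_text), amount)
        if row in seen:         # duplicate once the date is normalized
            continue
        seen.add(row)
        cleaned.append(row)
    return cleaned


clean_records = clean(raw_records)
for date, amount in clean_records:
    print(date.isoformat(), amount)
```

Note that the duplicate only becomes visible after the dates are normalized, which is one reason to normalize before any other step.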

Rule 0 of any Data Analysis project: Always look at the data

Descriptive Statistics (Analysis)

• Ways to provide quantitative, summary descriptions for a dataset

• Use these descriptions to compare datasets

• Data describing:

– Central tendency: mean, median, mode

– Dispersion: standard deviation, variance, min and max values, kurtosis, and skewness

Central Tendency

• Mean

– Cannot be applied to nominal (categorical) variables

– Affected by extreme values

• Median – pick the middle element in a sorted list

– Need to sort the list first

– Robust to extreme values

• Mode – most common element in a list

– Dataset may have many modes (or none at all if all elements are different)

– May not be representative of data

Mean: $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$
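The three measures of central tendency are available in Python's standard `statistics` module. The salary figures below are made up, with one extreme value included on purpose to show how the mean reacts while the median does not.

```python
import statistics

# Made-up monthly salaries; the last value is deliberately extreme
salaries = [1200, 1300, 1300, 1400, 1500, 1600, 9000]

mean_salary = statistics.mean(salaries)      # pulled upwards by the 9000
median_salary = statistics.median(salaries)  # middle element of the sorted list
mode_salary = statistics.mode(salaries)      # most common element

# A dataset may have more than one mode; multimode returns all of them
all_modes = statistics.multimode([1, 1, 2, 2, 3])

print("mean:  ", round(mean_salary, 2))
print("median:", median_salary)
print("mode:  ", mode_salary)
print("modes of [1, 1, 2, 2, 3]:", all_modes)
```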

Dispersion

• Standard Deviation, σ

– Tells you the spread of your data

– With a normal distribution, 1σ around the mean accounts for about 68% of the data, 2σ for about 95% and 3σ for over 99% (about 99.7%)

– σ of almost 0 tells you the data is very close to the mean

• Many more statistics exist: variance, kurtosis, etc.

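[Figure: normal distribution curve; about 34% of the data lies within 1σ on each side of the mean, about 47% within 2σ and about 49.5% within 3σ]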

Standard deviation: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2}$
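As a rough check of the numbers on this slide, the standard library can compute both the standard deviation (in its population, 1/N form) and the share of a normal distribution that falls within 1σ, 2σ and 3σ of the mean. The sample values are made up.

```python
import statistics
from statistics import NormalDist

# Made-up sample
data = [2, 4, 4, 4, 5, 5, 7, 9]

variance = statistics.pvariance(data)  # population variance (divides by N)
sigma = statistics.pstdev(data)        # population standard deviation

# Share of a normal distribution within k standard deviations of the mean
standard_normal = NormalDist(mu=0, sigma=1)
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

print("variance:", variance, "sigma:", sigma)
for k, share in coverage.items():
    print(f"within {k} sigma: {share:.1%}")
```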

But why is this important? (I)

• If you earned the average salary, would you be happy with this raise?

But why is this important? (II)

[Figure: two distributions, one with σ = 2,000 Euros and one with σ = 6,000 Euros]

Rule 1 of any Data Analysis project: Plot the data

Not everything is as it seems...

Mean = 6.67

PREDICTIVE MODELS

Simple Linear Regression

• Used to model a linear relationship between two variables

• Two variables which bear a relationship are said to be “correlated”

• The degree of correlation is commonly measured using the Pearson correlation coefficient, r

• Ability to predict a response variable (Y) from an explanatory variable (X)

• E.g. How is the weight of a person affected by his or her height?

Correlation (I)

• Both variables increase: positive correlation (1)

• One increases, the other decreases: negative correlation (-1)

• No linear correlation: 0

– This does not mean there is no relationship between the two variables!
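The three cases can be reproduced with a hand-written Pearson coefficient. This is a sketch with made-up data; `pearson_r` is an illustrative helper, not a library function. The last case shows r = 0 for y = x², where y is completely determined by x but the relationship is not linear.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


xs = [-3, -2, -1, 0, 1, 2, 3]

r_positive = pearson_r(xs, [2 * x + 1 for x in xs])  # both increase
r_negative = pearson_r(xs, [-x for x in xs])         # one up, the other down
r_parabola = pearson_r(xs, [x ** 2 for x in xs])     # clear relationship, not linear

print("positive:", round(r_positive, 3))
print("negative:", round(r_negative, 3))
print("parabola:", round(r_parabola, 3))
```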

Correlation (II)

• There is a positive correlation between Weight and Height

• As height increases, the weight increases

• Draw Line of Best fit

– How ?

Line of best fit

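$d_i = y_i - \hat{y}_i$

$\mathrm{Error} = \sum_{i=1}^{n} d_i^2$

[Figure: scatter plot with the line of best fit; the distances d1, d2, ..., d6, d7 run from each data point to the line]

We want to minimize(Error)

Why is the distance squared?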
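For a straight line, the Error above has a closed-form minimum (ordinary least squares). The sketch below uses made-up height and weight values and illustrative helper names; it also shows that nudging the slope away from the fitted value makes the Error larger.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def squared_error(xs, ys, intercept, slope):
    """Sum of squared distances d_i = y_i - y_hat_i."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Made-up data for illustration only
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 75, 86]

intercept, slope = fit_line(heights_cm, weights_kg)
best_error = squared_error(heights_cm, weights_kg, intercept, slope)

# Any other line gives a larger error, e.g. a slightly steeper one
worse_error = squared_error(heights_cm, weights_kg, intercept, slope + 0.1)

print(f"weight = {intercept:.1f} + {slope:.2f} * height")
print("error at best fit:", round(best_error, 2))
print("error with steeper slope:", round(worse_error, 2))
```

Squaring keeps positive and negative distances from cancelling each other out and penalises large misses more heavily than small ones.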

How good is this line?

• A measure, R², tells us how good our line of fit is

• Values range from 0 (no linear relationship) to 1 (perfect fit)

• R² is informally the ratio SSR / (SSE + SSR), with both sums taken over every datapoint

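[Figure: for one datapoint, the distance from the mean to the fitted line is the correct part of the prediction (SSR); the distance from the fitted line to the datapoint is the incorrect part of the prediction (SSE)]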
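The ratio can be computed directly from its two parts. The data is made up, and the intercept and slope are assumed to be the least-squares fit for exactly this data (that is what makes SSR + SSE equal the total variation around the mean).

```python
# Made-up data for illustration only
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 75, 86]

# Assumption: least-squares line already fitted to the data above
intercept, slope = -76.5, 0.85

mean_y = sum(weights_kg) / len(weights_kg)
predicted = [intercept + slope * x for x in heights_cm]

# Correct part of the prediction: fitted line versus the mean
ssr = sum((p - mean_y) ** 2 for p in predicted)
# Incorrect part of the prediction: datapoint versus the fitted line
sse = sum((y - p) ** 2 for y, p in zip(weights_kg, predicted))

r_squared = ssr / (ssr + sse)

print("SSR:", round(ssr, 2), "SSE:", round(sse, 2))
print("R^2:", round(r_squared, 3))
```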

Back to our example ...

R² = 0.97

97% of the total variation in y can be explained by the linear relationship between x and y

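But R² is only part of the story (Anscombe’s quartet)

All four datasets in the quartet share (almost exactly) the same summary statistics:

• Mean(x) = 9

• Mean(y) = 7.50

• Var(x) = 11

• Var(y) ≈ 4.12

• Correlation r = 0.816 (R² ≈ 0.67)

• Linear regression line: y = 3.00 + 0.500x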
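The summary statistics of the quartet can be recomputed with the standard library. The values below are Anscombe's published datasets typed in by hand; the helper name `summarise` is illustrative. The output is numerically almost identical for all four datasets, even though their scatter plots look nothing alike.

```python
import statistics
from math import sqrt

# Anscombe's quartet: datasets I-III share the same x values
x_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarise(xs, ys):
    """Means, sample variances, Pearson r and least-squares line."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": statistics.variance(xs),  # sample variance (divides by n - 1)
        "var_y": statistics.variance(ys),
        "r": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


summaries = {name: summarise(xs, ys) for name, (xs, ys) in quartet.items()}
for name, summary in summaries.items():
    print(name, {key: round(value, 3) for key, value in summary.items()})
```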

Correlation does not imply Causation!

You will now understand this joke (perhaps)

Take care with predictive models...

LINEAR REGRESSION IN PY

Naive Bayes Classifier

• Naive Bayes Classifiers – build a probabilistic model for prediction

• Based on Bayes’ theorem

• Used in a lot of places; e.g. your spam filter

Movie Suggestion using Naive-Bayes Classifiers

Movie Genre?    Won Oscar?   Explosions?   Like?
Romantic (R)    Y            N             NO
Romantic (R)    N            N             NO
Romantic (R)    N            N             YES
Action (A)      N            N             NO
Action (A)      Y            Y             YES
Action (A)      N            Y             NO
Sci-Fi (SF)     Y            N             YES
Sci-Fi (SF)     N            N             NO
Sci-Fi (SF)     Y            Y             YES

Will I enjoy ...

• (Genre=Action, Oscar=Y, Explosions=N)

• Note that this combination is not in the training set
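One way to answer the question is to count frequencies in the movie table and multiply them, which is all a Naive Bayes classifier does for categorical features. The sketch below is a hand-rolled illustration, not the code shown in the talk; the function names are made up, and it applies no smoothing, so a feature value never seen with a class would force that class's score to zero.

```python
from collections import Counter, defaultdict

# Training set from the movie table: (genre, won_oscar, explosions) -> like
training = [
    (("R", "Y", "N"), "NO"),
    (("R", "N", "N"), "NO"),
    (("R", "N", "N"), "YES"),
    (("A", "N", "N"), "NO"),
    (("A", "Y", "Y"), "YES"),
    (("A", "N", "Y"), "NO"),
    (("SF", "Y", "N"), "YES"),
    (("SF", "N", "N"), "NO"),
    (("SF", "Y", "Y"), "YES"),
]


def train(rows):
    """Count how often each class and each (feature value, class) occurs."""
    class_counts = Counter(label for _, label in rows)
    feature_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    for features, label in rows:
        for index, value in enumerate(features):
            feature_counts[(index, label)][value] += 1
    return class_counts, feature_counts


def posterior(features, class_counts, feature_counts):
    """P(label | features), assuming features are independent given the label."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(label)
        for index, value in enumerate(features):
            # likelihood P(feature value | label), no smoothing
            score *= feature_counts[(index, label)][value] / count
        scores[label] = score
    normaliser = sum(scores.values())
    return {label: score / normaliser for label, score in scores.items()}


class_counts, feature_counts = train(training)

# Genre=Action, Oscar=Y, Explosions=N (not present in the training set)
probabilities = posterior(("A", "Y", "N"), class_counts, feature_counts)
prediction = max(probabilities, key=probabilities.get)

print({label: round(p, 3) for label, p in probabilities.items()})
print("prediction:", prediction)
```

Working through the counts by hand: the YES score is 4/9 × 1/4 × 3/4 × 2/4 = 1/24 and the NO score is 5/9 × 2/5 × 1/5 × 4/5 = 8/225, so the classifier leans slightly towards YES.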

NBC IN PY

Conclusions

• Data Analysis projects – start small

– Look at the data, clean, compute descriptive statistics, visualize (using the correct plot type)

• Formulate a hypothesis, and back it up with data

• For predictive modelling, it is very important to use the correct algorithm and to understand how it works internally

– Many parameters to tweak

@malteseunderdog

[email protected]
