Dive into the Data
Dive into the Data: Data Analysis using Python and common sense*
Dr Jean-Paul Ebejer ICT Tech Talk Series
March 2015
* required
“IN GOD WE TRUST, ALL THE OTHERS BRING DATA”
W. Edwards Deming
Why data analysis?
• Only real way to make informed decisions
• Use data:
– Either to prove that a working hypothesis is correct
– Or to formulate a hypothesis
• Data has increased exponentially in many areas: biology, astrophysics, finance, social networks
– Surprisingly, our knowledge has not
• Allows us to make predictions, mostly based on previous behaviour
Data Scientist / Analyst skillset
Inquisitive
Cloud Computing
Statistics
Persistent
Python
Visualization
R
Data structures
Big Data
SQL / NoSQL
Hadoop
Programming
Machine Learning
Problem Solving
Quantitative Reasoning
Data Mining
Databases
Data Analysis Example: find cheaters in a multiple-choice exam
• Exam consists of 40 multiple choice questions
• Compared every possible pair of exam scripts (across all students)
• Graph shows the number of shared correct and shared wrong answers for each pair of scripts
• Diagonal line y = -x + 40: a pair on this line gave the same answer to all 40 questions
http://lalashan.mcmaster.ca/theobio/math/index.php/Answer_matching
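The pairwise comparison can be sketched in a few lines of Python. This is only an illustration: the answer key, the student names and the five-question scripts are made up, and `shared_answers` is a hypothetical helper, not code from the talk or from the linked page.

```python
from itertools import combinations

# Hypothetical data: a 5-question answer key and three scripts.
ANSWER_KEY = "ABCDA"
SCRIPTS = {
    "alice": "ABCDA",
    "bob": "ABDDB",
    "carol": "ABDDB",
}


def shared_answers(script_a, script_b, key):
    """Return (shared_correct, shared_wrong) for one pair of scripts."""
    shared_correct = shared_wrong = 0
    for a, b, k in zip(script_a, script_b, key):
        if a == b:  # both students gave the same answer
            if a == k:
                shared_correct += 1
            else:
                shared_wrong += 1
    return shared_correct, shared_wrong


# One (x, y) point per pair of students, as in the graph on the slide.
PAIRS = {
    (s1, s2): shared_answers(SCRIPTS[s1], SCRIPTS[s2], ANSWER_KEY)
    for s1, s2 in combinations(sorted(SCRIPTS), 2)
}

for pair, (correct, wrong) in PAIRS.items():
    on_diagonal = correct + wrong == len(ANSWER_KEY)
    print(pair, correct, wrong, "identical scripts" if on_diagonal else "")
```

Pairs with many shared wrong answers are the interesting ones: agreeing on a correct answer is expected, agreeing on the same mistake is much less so.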
Types of data
• Nominal (Categorical) - mutually exclusive categories with no order, e.g. Fgura, Mosta, Sliema
• Ordinal - ordered categories, e.g. exam grades: A (100-80), B (80-70), C (70-55)
• Interval - ordinal data with equal, comparable distances between values, e.g. temperature 80 °C, 90 °C
• Ratio - interval data where 0 has a meaning (absence of that variable), e.g. weight
• Not always clear, e.g. colour: Red, Blue, Green (nominal), but also a specific physical wavelength (ratio)
• Continuous (measured) versus discrete (counted)
The nature of data
• Very often
– Incomplete
– Noisy, e.g. audio sensor picking up background signal
– Incorrect / Inconsistent, e.g. totals do not match single row entries
– Out-of-date, e.g. using cost of 3D printers from 10 years ago
– Not normalized (e.g. 10th November 2014, 10/11/2014) or of different scales ($ and €)
– Outliers, e.g. using the Guinness book of records to sample population statistics
– Duplicates Duplicates
• Need to clean up and normalize the dataset before starting
• For data to be of quality, it must be accurate, complete, unique, timely, and consistent
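A small clean-up pass might look like the sketch below. Everything in it is an assumption made for illustration: the rows, the exchange rate and the helper names are invented, and dates written with slashes are assumed to be day-first.

```python
import re
from datetime import datetime

# Hypothetical raw rows: (date as free text, amount, currency).
RAW_ROWS = [
    ("10th November 2014", 100.0, "EUR"),
    ("10/11/2014", 100.0, "EUR"),  # same record, different date format
    ("11/11/2014", 50.0, "USD"),
]
USD_TO_EUR = 0.8  # made-up rate, for illustration only


def normalize_date(text):
    """Convert the two date formats above to ISO 8601 (YYYY-MM-DD)."""
    # Drop ordinal suffixes such as "10th" -> "10".
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", text)
    for fmt in ("%d %B %Y", "%d/%m/%Y"):  # assumes day-first dates
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")


def clean(rows):
    """Normalize dates and currency, then drop exact duplicates."""
    seen, result = set(), []
    for date_text, amount, currency in rows:
        amount_eur = amount * USD_TO_EUR if currency == "USD" else amount
        record = (normalize_date(date_text), round(amount_eur, 2))
        if record not in seen:
            seen.add(record)
            result.append(record)
    return result


CLEANED = clean(RAW_ROWS)
print(CLEANED)
```

Note that the duplicate only becomes visible after the dates are normalized, which is why normalization comes before de-duplication here.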
Rule 0 of any Data Analysis project: Always look at the data
Descriptive Statistics (Analysis)
• Ways to provide quantitative, summary descriptions for a dataset
• Use these descriptions to compare datasets
• Data describing:
– Central tendency: mean, median, mode
– Dispersion: standard deviation, variance, min and max values, kurtosis, and skewness
Central Tendency
• Mean
– Cannot be applied to nominal (categorical) variables
– Affected by extreme values
• Median – pick the middle element in a sorted list
– Need to sort the list
– Robust to extreme values
• Mode – most common element in a list
– Dataset may have many modes (or none at all if all elements are different)
– May not be representative of data
$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$$
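The three measures are available in Python's standard `statistics` module. The sketch below uses a made-up salary list with one extreme value to show how the mean moves while the median stays put.

```python
import statistics

# Made-up monthly salaries; 9000 is a deliberate extreme value.
salaries = [1200, 1300, 1300, 1400, 1500, 9000]

mean = statistics.mean(salaries)        # pulled up by the 9000
median = statistics.median(salaries)    # average of the two middle values
modes = statistics.multimode(salaries)  # list of the most common values

print(f"mean={mean:.2f} median={median} modes={modes}")

# When every element is different, multimode returns all of them,
# which is how this module reports the "no single mode" case.
print(statistics.multimode([1, 2, 3]))
```

`statistics.multimode` needs Python 3.8 or later.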
Dispersion
• Standard Deviation, σ
– Tells you the spread of your data
– With a normal distribution, 1σ around the mean accounts for about 68% of the data, 2σ for about 95% and 3σ for about 99.7%
– σ of almost 0 tells you the data is very close to the mean
• Many more statistics exist: variance, kurtosis, etc.
[Figure: normal curve with areas labelled on each side of the mean: 34% within σ, 47% within 2σ, 49.5% within 3σ]
$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2}$$
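A short sketch of the population standard deviation (dividing by N), checked against the standard library. The data list is a made-up example, and `population_std` is a hypothetical helper name. The second half uses `statistics.NormalDist` (Python 3.8+) to recompute how much of a normal distribution lies within 1σ, 2σ and 3σ of the mean.

```python
import math
from statistics import NormalDist, pstdev


def population_std(values):
    """Square root of the mean squared deviation from the mean."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample
sigma = population_std(data)
print(sigma, pstdev(data))  # both use the divide-by-N definition

# Fraction of a normal distribution within k standard deviations.
standard_normal = NormalDist()  # mean 0, sigma 1
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)
}
for k, fraction in coverage.items():
    print(f"within {k} sigma: {fraction:.1%}")
```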
But why is this important? (I)
• If you earned the average salary, would you be happy with this raise?
But why is this important? (II)
[Figure: two salary distributions, one with σ = 2,000 Euros and one with σ = 6,000 Euros]
Rule 1 of any Data Analysis project: Plot the data
Not everything is as it seems...
Mean = 6.67
PREDICTIVE MODELS
Simple Linear Regression
• Used to model a linear relationship between two variables
• Two variables which bear a relationship are said to be “correlated”
• The degree of linear correlation is commonly measured using the Pearson correlation coefficient, r
• Ability to predict a response variable (Y) from an explanatory variable (X)
• E.g. How is the weight of a person affected by his or her height?
Correlation (I)
• Both variables increase: positive correlation (r close to 1)
• One increases, the other decreases: negative correlation (r close to -1)
• No linear correlation: r close to 0
– This does not mean there is no relationship between the two variables!
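The three cases can be reproduced with a hand-written Pearson coefficient. The function name `pearson_r` and the toy data are illustrative assumptions. The last case is the important one: y = x² is completely determined by x, yet r is 0 because the relationship is not linear.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [-2, -1, 0, 1, 2]
r_up = pearson_r(xs, [2 * x + 1 for x in xs])   # y rises with x
r_down = pearson_r(xs, [-3 * x for x in xs])    # y falls as x rises
r_parabola = pearson_r(xs, [x * x for x in xs])  # strong but non-linear

print(r_up, r_down, r_parabola)
```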
Correlation (II)
• There is a positive correlation between Weight and Height
• As height increases, the weight increases
• Draw the line of best fit
– How?
Line of best fit
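$$d_i = y_i - \hat{y}_i$$
$$\text{Error} = \sum_{i=1}^{n} d_i^2$$
[Figure: scatter plot with a fitted line; d1, d2, d6, d7 label the vertical distances between data points and the line]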
We want to minimize(Error)
Why is the distance squared?
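One common answer to the question: squaring stops positive and negative distances from cancelling out, and it penalizes large misses more than small ones. For a straight line the minimum has a closed form, sketched below. The height and weight values are made up, and `fit_line` and `squared_error` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def squared_error(xs, ys, slope, intercept):
    """Sum of squared vertical distances d_i between points and line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


heights = [150, 160, 170, 180, 190]  # cm, made-up values
weights = [52, 58, 69, 75, 86]       # kg, made-up values

slope, intercept = fit_line(heights, weights)
best = squared_error(heights, weights, slope, intercept)
print(f"weight = {intercept:.1f} + {slope:.2f} * height, error = {best:.2f}")

# Any other line should give a larger error, e.g. a slightly steeper one.
worse = squared_error(heights, weights, slope + 0.05, intercept)
print(worse > best)
```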
How good is this line?
• A measure, R², tells us how good our line of fit is
• Values range from 0 (the line explains none of the variation in y) to 1 (perfect fit)
• R² is informally the ratio SSR / (SSE + SSR), with each sum taken over every data point
[Figure labels: mean; correct part of prediction (SSR); incorrect part of prediction (SSE)]
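The SSR and SSE split can be computed directly. In this sketch SSR is the squared distance from the mean to the fitted line and SSE is the squared distance from the fitted line to the actual point; the data and the function name `r_squared` are illustrative assumptions.

```python
def r_squared(xs, ys):
    """Return (R², SSR, SSE) for a least-squares line through the data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    predictions = [intercept + slope * x for x in xs]

    # Correct part of the prediction: mean -> fitted line.
    ssr = sum((p - mean_y) ** 2 for p in predictions)
    # Incorrect part of the prediction: fitted line -> actual point.
    sse = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    return ssr / (ssr + sse), ssr, sse


heights = [150, 160, 170, 180, 190]  # cm, made-up values
weights = [52, 58, 69, 75, 86]       # kg, made-up values

r2, ssr, sse = r_squared(heights, weights)
print(f"R^2 = {r2:.3f} (SSR = {ssr:.1f}, SSE = {sse:.1f})")
```

For a least-squares line, SSR + SSE equals the total variation of y around its mean, which is why the ratio reads as "fraction of variation explained".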
Back to our example ...
R² = 0.97
97% of the total variation in y can be explained by the linear relationship between x and y
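But R² is only part of the story (Anscombe’s quartet)
All four datasets in the quartet share nearly identical summary statistics:
• Mean(x) = 9
• Mean(y) = 7.50
• Var(x) = 11
• Var(y) ≈ 4.1
• Correlation r = 0.816 (so R² ≈ 0.67)
• Linear regression line: y = 3.00 + 0.500x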
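The point is easy to check on the first two datasets of the quartet, which share the same x values. The sketch below recomputes the summary statistics with sample variances; the `summarize` helper is a hypothetical name. The numbers come out almost identical, although a plot shows that dataset I is a noisy line and dataset II is a smooth curve.

```python
import math
from statistics import mean, variance

# Anscombe's quartet, datasets I and II (they share the same x values).
X = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
Y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def summarize(xs, ys):
    """Summary statistics and least-squares line for one dataset."""
    mean_x, mean_y = mean(xs), mean(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    slope = s_xy / s_xx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": variance(xs),  # sample variance (divides by n - 1)
        "var_y": variance(ys),
        "r": s_xy / math.sqrt(s_xx * s_yy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


SUMMARY_1 = summarize(X, Y1)
SUMMARY_2 = summarize(X, Y2)
for name in SUMMARY_1:
    print(f"{name:10s} {SUMMARY_1[name]:8.3f} {SUMMARY_2[name]:8.3f}")
```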
Correlation does not imply Causation!
You will now understand this joke (perhaps)
Take care with predictive models...
LINEAR REGRESSION IN PY
Naive Bayes Classifier
• Naive Bayes Classifiers – build a probabilistic model for prediction
• Based on Bayes’ theorem
• “Naive” because the features are assumed to be independent of each other given the class
• Used in a lot of places, e.g. your spam filter
Movie Suggestion using Naive-Bayes Classifiers
Movie Genre?    Won Oscar?    Explosions?    Like?
Romantic (R)    Y             N              NO
Romantic (R)    N             N              NO
Romantic (R)    N             N              YES
Action (A)      N             N              NO
Action (A)      Y             Y              YES
Action (A)      N             Y              NO
Sci-Fi (SF)     Y             N              YES
Sci-Fi (SF)     N             N              NO
Sci-Fi (SF)     Y             Y              YES
Will I enjoy ...
• (Genre=Action, Oscar=Y, Explosions=N)
• Note that this combination is not in the training set
Calculate probabilities of observed training data
P(Romantic|LIKE) = 1/4
P(Action|LIKE) = 1/4
P(Sci-Fi|LIKE) = 2/4
P(Romantic|DISLIKE) = 2/5
P(Action|DISLIKE) = 2/5
P(Sci-Fi|DISLIKE) = 1/5
P(Oscar|LIKE) = 3/4
P(No Oscar|LIKE) = 1/4
P(Oscar|DISLIKE) = 1/5
P(No Oscar|DISLIKE) = 4/5
P(Explosions|LIKE) = 2/4
P(No Explosions|LIKE) = 2/4
P(Explosions|DISLIKE) = 1/5
P(No Explosions|DISLIKE) = 4/5
P(LIKE) = 4/9
P(DISLIKE) = 5/9
Calculate probabilities
The query has Explosions=N, so the “No Explosions” probabilities apply. Each product is proportional to the posterior (the common denominator is dropped):
P(Like|Gladiator) ∝ P(LIKE)*P(Action|LIKE)*P(Oscar|LIKE)*P(No Explosions|LIKE) = 4/9 * 1/4 * 3/4 * 2/4 ≈ 0.042
P(Dislike|Gladiator) ∝ P(DISLIKE)*P(Action|DISLIKE)*P(Oscar|DISLIKE)*P(No Explosions|DISLIKE) = 5/9 * 2/5 * 1/5 * 4/5 ≈ 0.036
Since P(Like|Gladiator) > P(Dislike|Gladiator), the prediction is that I like this movie! (TRUE)
NBC IN PY
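A minimal from-scratch sketch of the classifier on the movie table, not the code shown in the talk. It counts frequencies in the nine training rows and scores the query (Genre=Action, Oscar=Y, Explosions=N). The names `train` and `score` are hypothetical, and there is no smoothing, so a feature value never seen with a class would zero out that class's score; real implementations usually add Laplace smoothing.

```python
from collections import Counter
from fractions import Fraction

# Training set from the movie table: (genre, oscar, explosions, like).
TRAINING = [
    ("Romantic", "Y", "N", "NO"),
    ("Romantic", "N", "N", "NO"),
    ("Romantic", "N", "N", "YES"),
    ("Action", "N", "N", "NO"),
    ("Action", "Y", "Y", "YES"),
    ("Action", "N", "Y", "NO"),
    ("Sci-Fi", "Y", "N", "YES"),
    ("Sci-Fi", "N", "N", "NO"),
    ("Sci-Fi", "Y", "Y", "YES"),
]


def train(rows):
    """Count classes and (feature position, value, class) triples."""
    class_counts = Counter(row[-1] for row in rows)
    feature_counts = Counter()
    for *features, label in rows:
        for index, value in enumerate(features):
            feature_counts[(index, value, label)] += 1
    return class_counts, feature_counts


def score(features, label, class_counts, feature_counts):
    """P(label) times the product of P(feature value | label)."""
    result = Fraction(class_counts[label], sum(class_counts.values()))
    for index, value in enumerate(features):
        result *= Fraction(
            feature_counts[(index, value, label)], class_counts[label]
        )
    return result


CLASS_COUNTS, FEATURE_COUNTS = train(TRAINING)
QUERY = ("Action", "Y", "N")
SCORES = {
    label: score(QUERY, label, CLASS_COUNTS, FEATURE_COUNTS)
    for label in CLASS_COUNTS
}
PREDICTION = max(SCORES, key=SCORES.get)

for label, value in SCORES.items():
    print(label, value, float(value))
print("Like?", PREDICTION)
```

`Fraction` keeps the arithmetic exact, so the scores can be compared with the hand calculation without rounding noise.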
Conclusions
• Data Analysis projects – start small
– Look at the data, clean, compute descriptive statistics, visualize (using correct plot type)
• Formulate hypothesis, and back up with data
• For predictive modelling, very important to use correct algorithm and understand how algorithm works internally
– Many parameters to tweak
@malteseunderdog