Meelis Kull (meelis.kull@ut.ee), Autumn 2017
Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03
Demo: Data science mini-project
CRISP-DM: cross-industry standard process
for data mining
Data understanding: Types of data
Data understanding: First look at attributes
Types of attributes
First look at a nominal attribute
First look at an ordinal attribute
First look at a numeric attribute
Demo: Data science mini-project
CRISP-DM: cross-industry standard process
for data mining
Data understanding: Types of data
• Data understanding: First look at attributes
Types of attributes
First look at a nominal attribute
– First look at an ordinal attribute
– First look at a numeric attribute
• All the same applies as for nominal attributes
• Need to make sure that the order is retained
in histograms
• Additional ways to look at the attribute:
– Calculate the min, max, median of the attribute
– More generally, calculate quantiles
• There are 3 quartiles:
– lower quartile, median, upper quartile
• They divide a sorted data set into 4 equal parts
• Percentiles divide into 100 equal parts
– Lower quartile is the 25th percentile
– Median is the 50th percentile
• Deciles divide into 10 equal parts
• Quantiles generalise to any location over sorted data:
– Lower quartile = 25th percentile = quantile at 0.25
– 4th decile = quantile at 0.4
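The relations between quartiles, deciles and quantiles can be checked with a small sketch in base R. The vector `x` is hypothetical data made up for illustration. By default `quantile()` interpolates between neighbouring values (type 7); `type = 1` is used once to return an actual data value, which may suit ordinal attributes better.

```r
# Hypothetical sorted values, used only for illustration
x <- c(1, 2, 2, 3, 4, 5, 5, 6, 8, 9, 10)

# The three quartiles are the quantiles at 0.25, 0.5 and 0.75
quartiles <- quantile(x, probs = c(0.25, 0.5, 0.75))

# The nine deciles are the quantiles at 0.1, 0.2, ..., 0.9
deciles <- quantile(x, probs = seq(0.1, 0.9, by = 0.1))

# 4th decile = quantile at 0.4; type = 1 returns a value present in the data
fourth_decile <- quantile(x, probs = 0.4, type = 1)

# The median is the 2nd quartile, the 5th decile and the 50th percentile
median_x <- median(x)

print(quartiles)
print(deciles)
```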
A. 1st and 2nd decile
B. 2nd and 3rd decile
C. 3rd and 4th decile
D. 4th and 5th decile
E. 5th and 6th decile
F. 6th and 7th decile
G. 7th and 8th decile
H. 8th and 9th decile
A. Nominal attributes
B. Ordinal attributes
C. Both
D. Neither
E. Not sure
• Same aspects as in nominal attributes
• Additionally:
– Is the order specified in the documentation?
(meta-data)
– If all possible values specified in the meta-data:
• Check if all values present in the data
• If not, make sure that the empty bars in histograms
are in the right location corresponding to the order
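A minimal sketch of these checks in base R, assuming the meta-data lists the allowed values in order. The level names and the observed values are hypothetical. Declaring the levels explicitly keeps both the order and any unused value, so an empty bar ends up in the right position.

```r
# Hypothetical ordered values as they might be listed in the meta-data
edu_levels <- c("Primary", "Secondary", "Bachelors", "Masters", "Doctorate")

# Hypothetical observed data; "Masters" never occurs
edu <- c("Secondary", "Bachelors", "Primary", "Secondary",
         "Doctorate", "Bachelors", "Secondary")

# An ordered factor retains the documented order and the unused level
edu_f <- factor(edu, levels = edu_levels, ordered = TRUE)

# Values allowed by the meta-data but absent from the data
missing_levels <- setdiff(edu_levels, unique(edu))

# table() keeps a zero count for "Masters" between "Bachelors" and "Doctorate"
counts <- table(edu_f)
print(counts)

# Min, max and median are meaningful for an ordinal attribute
min_edu <- as.character(min(edu_f))
max_edu <- as.character(max(edu_f))
median_edu <- edu_levels[median(as.integer(edu_f))]  # odd number of items here
```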
Demo: Data science mini-project
CRISP-DM: cross-industry standard process
for data mining
Data understanding: Types of data
• Data understanding: First look at attributes
Types of attributes
First look at a nominal attribute
First look at an ordinal attribute
– First look at a numeric attribute
• Is it really numeric?
– E.g. if contains only values 0.0 and 1.0 then
perhaps nominal is more appropriate?
• Is it discrete (only integers)?
– E.g. if contains integers 0 to 100 then can have a
first look as if it were an ordinal attribute
• Calculate quantiles as for ordinal attributes
• Calculate the mean (arithmetic average): the sum of all values divided by the number of values
• Is it really numeric?
• Is it discrete (only integers)?
• Calculate quantiles as for ordinal attributes
• Calculate the mean (arithmetic average)
• Plot a histogram (with default binning)
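The checklist above could be run roughly as follows in base R. The `age` vector is hypothetical, and `hist(..., plot = FALSE)` is used here only to obtain the default bins without drawing anything.

```r
# Hypothetical numeric attribute, for illustration only
age <- c(22, 35, 35, 41, 58, 19, 27, 35, 63, 44)

# Is it really numeric? Very few distinct values may suggest a nominal attribute
n_distinct <- length(unique(age))

# Is it discrete (only integers)?
is_discrete <- all(age == round(age))

# Quantiles as for ordinal attributes
age_quantiles <- quantile(age, probs = c(0, 0.25, 0.5, 0.75, 1))

# Mean (arithmetic average): sum of the values divided by their count
age_mean <- mean(age)

# Histogram with default binning, returned as an object instead of a plot
h <- hist(age, plot = FALSE)
print(age_quantiles)
print(h$counts)
```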
• Check if some value occurs many times
– With real numbers often no number occurs more
than once
– If a value is frequent then:
• It may denote missingness (e.g. value 0) and
should then be treated as N/A (not available, missing)
• It may denote a capped extreme value (all more
extreme values are recorded as this value)
• Check for rounding
– E.g., if all numbers are rounded to 2 decimal places and one
has more digits then it might be a typing error
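Both checks can be sketched in base R. The data are hypothetical, and treating the frequent value 0 as missing is an assumption that would have to be confirmed from the documentation of a real dataset.

```r
# Hypothetical measurements: 0 is suspiciously frequent, one value has extra digits
x <- c(0, 12.35, 0, 7.10, 0, 3.14159, 9.99, 0, 15.20, 4.75)

# Count how often each value occurs and pick the most frequent one
counts <- table(x)
most_frequent <- as.numeric(names(counts)[which.max(counts)])

# Assumption: the frequent value encodes "missing", so recode it as NA
x_clean <- ifelse(x == most_frequent, NA, x)

# Rounding check: values that change when rounded to 2 decimal places
not_rounded <- x[abs(x - round(x, 2)) > 1e-9]

print(most_frequent)
print(not_rounded)
```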
A. Nominal attributes
B. Ordinal attributes
C. Numeric attributes
D. All of the above
E. None of the above
F. Not sure
• Data understanding: distribution of
attributes
• Types of histograms
• How to describe probability distributions?
• Some standard probability distributions
• More ways to visualise distributions
• Visualising relations of attributes
• Are the attributes related?
• We have now looked at each attribute
• A key question we can now answer:
What do the items in this dataset look like?
• Now we have at least 3 answers to this:
• What do the items in this dataset look like?
• Let us just pick one as an example:
– Age = 22
– Workclass = Private
– Education = 11th
– Occupation = Other-service
– Capital.gain = 0
– ...
• What do the items in this dataset look like?
• Let us just provide the ranges of attributes
– Age: {17,18,19,…,90}
– Workclass: {Federal-gov, Local-gov, Private, …}
– Education: {1st-4th, 5th-6th, …, Doctorate}
– Occupation: {Adm-clerical, Exec-managerial,…}
– Capital.gain: [0,99999]
– ...
• What do the items in this dataset look like?
• Let us just provide the histograms
– Age: <histogram>
– Workclass: <histogram>
– Education: <histogram>
– Occupation: <histogram>
– Capital.gain: <histogram>
– ...
• Histograms on discrete data
– Nominal
– Ordinal
– Numeric with few different values
• E.g. small number of different integers
• Histograms on continuous data
– Numeric with many different values
• Frequency histogram
– Frequency = count of items with each value
ggplot(data) + geom_histogram(aes(x=age),stat="count")
ggplot(data) + geom_histogram(aes(x=workclass),stat="count")
• Relative frequency histogram
– Relative frequency = proportion (0..1) or
percentage (0..100%) of items with each value
– Heights of bars sum up to 1
ggplot(data) + geom_bar(aes(x=workclass,..count../sum(..count..)),stat="count")
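The numbers behind the frequency and relative frequency histograms can also be computed directly. This is a small sketch in base R rather than ggplot2, with a hypothetical `workclass` vector standing in for the real column.

```r
# Hypothetical nominal attribute, for illustration only
workclass <- c("Private", "Private", "Local-gov", "Private", "Federal-gov",
               "Local-gov", "Private", "Private")

# Frequency = count of items with each value
freq <- table(workclass)

# Relative frequency = proportion of items with each value (freq / sum(freq))
rel_freq <- prop.table(freq)

print(freq)
print(rel_freq)
```

The relative frequencies sum to 1, which is what makes them readable as the probability distribution of a randomly chosen item.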
• Depends on the goal
• Frequency histogram
– Gives actual counts in the data
• Relative frequency histogram
– Gives proportions in the data
– Interpretable as probability distribution of a
randomly chosen item
• Continuous attribute:
– Usually value is different in each item
– Need to introduce bins (a.k.a. intervals, ranges)
– Histogram not informative without bins:
ggplot(data) + geom_histogram(aes(x=factor(salaries)),stat="count")
• Frequency histogram of binned data
– Frequency = count of items in each bin
ggplot(data) + geom_histogram(aes(x=salaries),stat="bin",boundary=0,binwidth=10000)
• Relative frequency histogram of binned data
– Relative frequency = proportion of items in bins
– Heights of bars add up to 1
ggplot(data) + geom_histogram(aes(x=salaries,..count../sum(..count..)),stat="bin",boundary=0,binwidth=10000)
• Density histogram of binned data
– Density = Y-axis such that areas of bars in the
histogram add up to 1
– Density scale is invariant to the sizes of bins
ggplot(data) + geom_histogram(aes(x=salaries,..density..),stat="bin",boundary=0,binwidth=10000)
+ geom_histogram(aes(x=salaries,..density..),stat="bin",boundary=0,binwidth=1000,alpha=0,colour="red")
+ geom_histogram(aes(x=salaries),stat="density",colour="blue")
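That the density scale does not depend on the bin size can be checked numerically. The sketch below uses base R `hist()` instead of ggplot2 and simulated, hypothetical salaries; `density_hist` is a helper name introduced only for this illustration.

```r
set.seed(1)
# Hypothetical salaries, simulated only for illustration
salaries <- rlnorm(1000, meanlog = 10, sdlog = 0.5)

# Helper (hypothetical name): histogram object with bins of a given width, starting at 0
density_hist <- function(x, binwidth) {
  breaks <- seq(0, ceiling(max(x) / binwidth) * binwidth, by = binwidth)
  hist(x, breaks = breaks, plot = FALSE)
}

wide <- density_hist(salaries, 10000)
narrow <- density_hist(salaries, 1000)

# Relative frequency: heights of bars add up to 1
rel_wide <- wide$counts / sum(wide$counts)

# Density: areas of bars (height * width) add up to 1, whatever the bin width
area_wide <- sum(wide$density * diff(wide$breaks))
area_narrow <- sum(narrow$density * diff(narrow$breaks))

print(c(area_wide, area_narrow))
```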
• Represented by the probability density
function (pdf)
• Area under the curve is equal to 1
• Areas represent probabilities
Area = P(a<X<b) =
probability that X is between a and b
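As a numerical illustration of areas as probabilities, the sketch below integrates a pdf in base R. The standard normal density `dnorm` and the bounds `a` and `b` are arbitrary choices made for this example, not taken from the slides.

```r
# Hypothetical interval, chosen only for illustration
a <- -1
b <- 2

# Area under the pdf between a and b
area <- integrate(dnorm, lower = a, upper = b)$value

# The same probability from the cumulative distribution function
prob <- pnorm(b) - pnorm(a)

# Total area under the pdf
total <- integrate(dnorm, lower = -Inf, upper = Inf)$value

print(c(area = area, prob = prob, total = total))
```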
Data understanding: distribution of
attributes
Types of histograms
• How to describe probability distributions?
• Some standard probability distributions
• More ways to visualise distributions
• Visualising relations of attributes
• Are the attributes related?
• Statistic – measure calculated from all values
• Mode of the distribution:
– The most probable value (or values)
• The most frequent value if discrete
• Value with highest density if continuous
• Median and other quantiles
– We have defined earlier
• Mean of the distribution:
– Average value of the attribute
• Arithmetic average if discrete
• Expected value if continuous
– Centre of mass of the distribution
• Consider attribute with values:
– 1,2,2,2,3,3,4,7
• Mode:
– 2 because occurs three times
• Median
– 2.5 because 2 & 3 are in the middle, (2+3)/2=2.5
• Mean:
– 3.0 because (1+2+2+2+3+3+4+7)/8 = 24/8 = 3.0
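The same example can be reproduced in base R. Note that R's built-in `mode()` returns the storage type of an object, not the statistical mode, so the mode is taken from a frequency table here.

```r
# Values from the example above
x <- c(1, 2, 2, 2, 3, 3, 4, 7)

# Mode: the value (or values) with the highest count
counts <- table(x)
modes <- as.numeric(names(counts)[counts == max(counts)])

# Median: middle of the sorted values (average of the two middle ones here)
med <- median(x)

# Mean: arithmetic average
avg <- mean(x)

print(c(mode = modes, median = med, mean = avg))
```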
A. 0
B. 0.5
C. 1
D. 1.5
E. None of the above
• Variance of a distribution is:
– average squared deviation from the mean
• Standard deviation is square root of variance
– Root mean square deviation from the mean
• Example:
– Values: 1,2,2,2,3,3,4,7
– Mean: 3.0
– Variance: ((1-3)^2+(2-3)^2+…+(7-3)^2) / 8 = 3.0
– Standard deviation: sqrt(3.0)=1.732…
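A short check of this example in base R. One caveat worth knowing: R's `var()` and `sd()` divide by n - 1 (the sample variance), so they do not reproduce the value 3.0 above, which divides by n.

```r
# Values from the example above
x <- c(1, 2, 2, 2, 3, 3, 4, 7)

# Variance as defined above: average squared deviation from the mean (divide by n)
pop_var <- mean((x - mean(x))^2)
pop_sd <- sqrt(pop_var)

# R's built-in var() divides by n - 1 instead
sample_var <- var(x)

print(c(pop_var = pop_var, pop_sd = pop_sd, sample_var = sample_var))
```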
• Symmetric
• Skewed / right-skewed / left-skewed
• Heavy-tailed
• Bimodal
• Multi-modal
• Capped / right-capped / left-capped
• Simple definitions, not fully correct:
– Unimodal – 1 mode
– Bimodal – 2 modes
– Multimodal – multiple modes
• Actually:
– Bimodal usually means that the density (pdf) has
two local maxima:
– Multimodal means pdf has multiple maxima (2 or
more)
• Symmetric distribution:
– Symmetric around a vertical axis of symmetry
• Left and right side are mirror-images of each other
• Left- (or right-)skewed distribution:
– Non-symmetric and mean below (above) mode
• Usually used only for unimodal distributions
[Figure: positively skewed, negatively skewed and symmetric distributions]
Data understanding: distribution of
attributes
Types of histograms
How to describe probability distributions?
• Some standard probability distributions
• More ways to visualise distributions
• Visualising relations of attributes
• Are the attributes related?
• Statisticians have names for many different
families of probability distributions
• Why do we need to know some of them?
– In practice these are used to communicate the
distribution without having to visualise it
• We will talk about:
– Uniform distributions
– Normal distributions
– Power law distributions
• All options are equally probable
• All values in [a,b] are equally probable:
– The pdf is constant between a and b,
and 0 elsewhere
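For a concrete picture, the constant must be 1/(b - a) so that the area under the pdf is 1. The sketch below checks this with base R's `dunif` and `punif`; the interval [2, 6] and the six-option discrete case are hypothetical choices.

```r
# Hypothetical interval [a, b]
a <- 2
b <- 6

# Continuous uniform: pdf is 1 / (b - a) inside the interval and 0 outside
pdf_values <- dunif(c(1, 3, 4, 5, 7), min = a, max = b)

# Probability of landing in [3, 5] is the area of that slice: (5 - 3) / (b - a)
p_3_to_5 <- punif(5, min = a, max = b) - punif(3, min = a, max = b)

# Discrete uniform: each of k options has probability 1 / k (e.g. a fair die)
k <- 6
die_probs <- rep(1 / k, k)

print(pdf_values)
print(p_3_to_5)
```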
[Figure annotations: central tendency; data dispersion (spread)]
• N(mean,variance)
• The most common non-uniform continuous
distribution
– Why?
– Sum of many independent and identically
distributed (i.i.d.) random variables with finite
variance is approximately normally distributed
• Standard normal distribution: N(0,1)
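The claim about sums of i.i.d. variables can be illustrated with a small simulation in base R. The choice of uniform summands, 30 terms and 10000 repetitions is arbitrary. The sum of 30 uniform(0,1) values has mean 30/2 and variance 30/12, and if the sums are close to normal, about 68.3% of them fall within one standard deviation of the mean.

```r
set.seed(42)
n_terms <- 30     # number of i.i.d. uniform(0,1) variables per sum (arbitrary choice)
n_sums <- 10000   # number of simulated sums (arbitrary choice)

# Each element is the sum of n_terms independent uniform(0,1) values
sums <- replicate(n_sums, sum(runif(n_terms)))

m <- mean(sums)   # theory: n_terms / 2
v <- var(sums)    # theory: n_terms / 12

# Share of sums within one standard deviation of the mean
within_1sd <- mean(abs(sums - m) < sqrt(v))

print(c(mean = m, variance = v, within_1sd = within_1sd))
```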
56
![Page 57: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/57.jpg)
• Heavy-tailed (or long-tailed) distribution
– Very high or very low values are relatively likely
• More likely than under a normal distribution
– Technical definition:
• Tails are not exponentially bounded
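One common way to formalise "not exponentially bounded" for the right tail is the following textbook definition, given here as an assumption about what the slide means rather than its exact wording:

```latex
\lim_{x \to \infty} e^{t x}\, P(X > x) = \infty
\quad \text{for all } t > 0 .
```

For comparison, the tail of a normal distribution decays faster than any exponential, so for a normal X this limit is 0 instead.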
![Page 58: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/58.jpg)
• Alternative names:
– scale-free / scale-independent
• Distributions on positive real values
• A value M times bigger is K times less probable (power law; M and K are parameters)
• The pdf is a descending straight line when drawn on a log-log scale (both x and y logarithmic)
• Examples:
– Number of connections in many real-world graphs
– Frequencies of words in languages
– Sizes of craters on the Moon
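Both properties can be verified numerically. A minimal sketch in R, assuming a pdf of the form p(x) = C * x^(-alpha) for x >= x_min; the values of `alpha`, `x_min` and `M` are hypothetical and chosen only for illustration:

```r
# Sketch: a power-law pdf p(x) = C * x^(-alpha) for x >= x_min (alpha > 1)
alpha <- 2.5                         # hypothetical exponent
x_min <- 1                           # hypothetical lower bound
power_law_pdf <- function(x) (alpha - 1) / x_min * (x / x_min)^(-alpha)

# A value M times bigger has a density K = M^alpha times smaller, at every x
M <- 10
K <- M^alpha
x <- c(1, 2, 5, 20)
ratio <- power_law_pdf(x) / power_law_pdf(M * x)

# On a log-log scale the pdf is a straight line with slope -alpha
grid  <- 10^seq(0, 2, length.out = 50)
fit   <- lm(log10(power_law_pdf(grid)) ~ log10(grid))
slope <- unname(coef(fit)[2])
```

Here `ratio` is the same at every `x`, which is the scale-free property, and `slope` recovers `-alpha`.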
![Page 59: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/59.jpg)
A. Uniform
B. Unimodal symmetric
C. Unimodal right-skewed
D. Unimodal left-skewed
E. Bimodal symmetric
F. Bimodal asymmetric
G. Multi-modal symmetric
H. Multi-modal asymmetric
I. Normally distributed
J. Power-law distributed
![Page 60: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/60.jpg)
A. Uniform
B. Unimodal symmetric
C. Unimodal right-skewed
D. Unimodal left-skewed
E. Bimodal symmetric
F. Bimodal asymmetric
G. Multi-modal symmetric
H. Multi-modal asymmetric
I. Normally distributed
J. Power-law distributed
![Page 61: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/61.jpg)
A. Uniform
B. Unimodal symmetric
C. Unimodal right-skewed
D. Unimodal left-skewed
E. Bimodal symmetric
F. Bimodal asymmetric
G. Multi-modal symmetric
H. Multi-modal asymmetric
I. Normally distributed
J. Power-law distributed
![Page 62: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/62.jpg)
A. Uniform
B. Unimodal symmetric
C. Unimodal right-skewed
D. Unimodal left-skewed
E. Bimodal symmetric
F. Bimodal asymmetric
G. Multi-modal symmetric
H. Multi-modal asymmetric
I. Normally distributed
J. Power-law distributed
[Figure: plot on log-log axes, with tick labels 10^-4 to 10^-2 and 10^0 to 10^2]
![Page 63: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/63.jpg)
Data understanding: distribution of
attributes
Types of histograms
How to describe probability distributions?
Some standard probability distributions
• More ways to visualise distributions
• Visualising relations of attributes
• Are the attributes related?
![Page 64: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/64.jpg)
![Page 65: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/65.jpg)
• Compactly visualise many distributions
– Density plots rotated by 90º and mirrored
– Distribution of salaries for each education level
ggplot(data) + geom_violin(aes(x=education,y=salaries))
![Page 66: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/66.jpg)
• Compactly visualise many distributions
– Marks the median, the upper and lower quartiles, and
outliers (in R's ggplot, an outlier is a point more than
1.5x the interquartile range away from the nearest quartile)
ggplot(data) + geom_boxplot(aes(x=education,y=salaries))
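The 1.5x interquartile-range rule can be computed by hand. A minimal sketch in base R on made-up salary values (the data, the seed and the variable names are assumptions for illustration; quartile conventions can differ slightly between implementations, so the fences may not match a given plotting library exactly):

```r
set.seed(1)                                   # arbitrary seed
# Made-up salaries: 200 log-normal values plus one extreme value
salaries <- c(rlnorm(200, meanlog = 10, sdlog = 0.5), 500000)

q   <- quantile(salaries, c(0.25, 0.75), names = FALSE)  # lower and upper quartile
iqr <- q[2] - q[1]                                       # interquartile range

lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Points outside the fences are the ones a box plot draws individually
outliers <- salaries[salaries < lower_fence | salaries > upper_fence]
```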
![Page 67: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/67.jpg)
• Violin plots and box plots can also be shown
together:
ggplot(data) +
geom_violin(aes(x=education_factor,y=salaries)) +
geom_boxplot(aes(x=education_factor,y=salaries),alpha=0.3)
![Page 68: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/68.jpg)
• Often too crowded, but sometimes provide
extra insights compared to box plots and
violin plots
Over-crowded with too many points
ggplot(data) + geom_point(aes(x=education_factor,y=salaries))
![Page 69: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/69.jpg)
• Often too crowded, but sometimes provide
extra insights compared to box plots and
violin plots
More useful, here only 100 points
ggplot(data) + geom_point(aes(x=education_factor,y=salaries))
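The slide does not show how the 100 points were chosen. One plausible way, sketched here as an assumption, is to draw a random subset of rows before plotting. The data frame below is a made-up stand-in for the lecture's `data`, with column names copied from the slide:

```r
set.seed(1)  # arbitrary seed
# Made-up stand-in for the lecture's `data`; column names follow the slide
data <- data.frame(
  education_factor = factor(sample(c("HS-grad", "Bachelors", "Masters"),
                                   5000, replace = TRUE)),
  salaries = rlnorm(5000, meanlog = 10, sdlog = 0.5)
)

# Keep a random subset of 100 rows before plotting
data_small <- data[sample(nrow(data), 100), ]

# Then plot the subset in the same way as on the slide, e.g.:
# ggplot(data_small) + geom_point(aes(x=education_factor,y=salaries))
```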
![Page 70: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/70.jpg)
Data understanding: distribution of
attributes
Types of histograms
How to describe probability distributions?
Some standard probability distributions
More ways to visualise distributions
• Visualising relations of attributes
• Are the attributes related?
![Page 71: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/71.jpg)
![Page 72: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/72.jpg)
• 2 categorical attributes:
– Cross-table (workclass & education)
table(data$education,data$workclass)
![Page 73: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/73.jpg)
• 2 categorical attributes:
– Cross-table (workclass & education)
– Heatmap (workclass & education)
# assumes library(ggplot2) and library(dplyr) are loaded
ggplot(data %>% group_by(education,workclass) %>% summarise(count=length(education))) +
  geom_tile(aes(y=education,x=workclass,fill=count)) +
  scale_fill_gradient(low='white',high='black',trans="log",breaks=c(1,10,100,1000)) +
  theme_bw()
![Page 74: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/74.jpg)
• 2 categorical attributes:
– Cross-table (workclass & education)
– Heatmap (workclass & education)
– Cross-table and heatmap combined
# appended to the heatmap code from the previous slide:
+ geom_text(aes(x=workclass,y=education,label=count),color="red")
![Page 75: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/75.jpg)
• Scatter plot, box plot, violin plot
![Page 76: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/76.jpg)
• Scatter plot
ggplot(data) + geom_point(aes(x=age,y=salaries))
![Page 77: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/77.jpg)
• Scatter plot
ggplot(data) + geom_point(aes(x=capital.gain,y=salaries))
![Page 78: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/78.jpg)
• Scatter plot
• Discretise one (or both) of the attributes
– Discretise = make into categorical
– For instance, introduce bins
library(arules)
data$capital.gain.discretised =
discretize(data$capital.gain,method="fixed",categories=c(0,5,10,20,50,100)*1000)
ggplot(data) + geom_boxplot(aes(x=capital.gain.discretised,y=salaries))
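Binning with fixed breakpoints can also be sketched in base R with `cut()`, which avoids the extra package. This is an alternative illustration, not the lecture's method; the input values are made up, and only the breakpoints are taken from the slide:

```r
# Made-up capital gain values, for illustration only
capital_gain <- c(0, 1200, 5000, 7500, 15000, 30000, 99999)

# Same breakpoints as on the slide
breaks <- c(0, 5, 10, 20, 50, 100) * 1000

# Left-closed bins [lo, hi); include.lowest = TRUE closes the last bin on the right
binned <- cut(capital_gain, breaks = breaks, right = FALSE, include.lowest = TRUE)

counts <- table(binned)   # number of values per bin
```

The result `binned` is a factor, so it can be used on the x axis of a box plot in the same way as the discretised column above.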
![Page 79: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/79.jpg)
• Make all pairwise visualisations and
organise in a cross-table
• Perform dimensionality reduction
– Introduced later in the course
– Projects the data into a 2-dimensional space
– Then visualise
• Use colours, shapes, etc.
![Page 80: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/80.jpg)
• Four attributes visualised together
ggplot(data) +
geom_point(aes(x=capital.gain,y=salaries,color=workclass,shape=gender)) +
scale_colour_brewer(type="qual")
![Page 81: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/81.jpg)
• Four attributes visualised together
Often hard to read when many
attributes are visualised this way
ggplot(data) +
geom_point(aes(x=capital.gain,y=salaries,color=workclass,shape=gender)) +
scale_colour_brewer(type="qual")
![Page 82: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/82.jpg)
Data understanding: distribution of
attributes
Types of histograms
How to describe probability distributions?
Some standard probability distributions
More ways to visualise distributions
Visualising relations of attributes
• Are the attributes related?
![Page 83: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/83.jpg)
![Page 84: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/84.jpg)
• There are many statistical tests for this;
we will cover some in the next lecture
• Visual inspection can reveal some relations
– For example, university graduates have higher
median salaries than others
![Page 85: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/85.jpg)
• There are many statistical tests for this;
we will cover some in the next lecture
• Visual inspection can reveal some relations
– For example, university graduates have higher
median salaries than others
• Correlation is a statistic on 2 numeric
attributes, quantifying linear relationships
![Page 86: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/86.jpg)
• Mean (actually the sample mean, as explained in the
next lecture): $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
• Correlation (Pearson correlation coefficient)
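For reference, the standard textbook definition of the Pearson correlation coefficient for n paired observations, where \bar{x} and \bar{y} denote the sample means of the two attributes (given as a reminder, not necessarily in the slide's exact notation):

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```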
![Page 87: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/87.jpg)
• Ranges from -1 to +1:
– r = -1: perfectly anti-correlated
– r = +1: perfectly correlated
– r = 0: uncorrelated
Positively correlated
Negatively correlated
![Page 88: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/88.jpg)
[Figure: three scatter plots, each labelled "Uncorrelated"]
![Page 89: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/89.jpg)
![Page 90: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/90.jpg)
Several sets of (x, y) points, with the Pearson correlation coefficient of x and y for
each set. Note that the correlation reflects the noisiness and direction of a linear
relationship (top row), but not the slope of that relationship (middle), nor many
aspects of nonlinear relationships (bottom). N.B.: the figure in the center has a
slope of 0 but in that case the correlation coefficient is undefined because the
variance of Y is zero.
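The points in this caption can be reproduced with a few calls to `cor()`. A minimal sketch in R on a made-up symmetric grid of x values; the slopes and variable names are arbitrary choices for illustration:

```r
x <- seq(-1, 1, length.out = 201)    # made-up symmetric grid

r_steep    <- cor(x, 2 * x)          # perfect linear relation, steep slope
r_flat     <- cor(x, 0.1 * x)        # same correlation despite a flatter slope
r_negative <- cor(x, -3 * x)         # perfect linear relation, negative direction
r_parabola <- cor(x, x^2)            # clear nonlinear dependence, correlation near 0

# Constant y has zero variance, so the coefficient is undefined (R returns NA)
r_constant <- suppressWarnings(cor(x, rep(5, length(x))))
```

The first two values agree because the coefficient ignores the slope, and `r_parabola` is near zero even though y is fully determined by x.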
![Page 91: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/91.jpg)
![Page 92: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/92.jpg)
A. All equal
B. All different
C. Some equal, some different
![Page 93: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/93.jpg)
![Page 94: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/94.jpg)
![Page 95: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/95.jpg)
![Page 96: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/96.jpg)
![Page 97: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/97.jpg)
![Page 98: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/98.jpg)
[Figure: CO2, Noise and Pressure plotted together on one axis; title "405 - Feb 2017"]
![Page 99: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/99.jpg)
[Figure: CO2, Pressure and Noise; title "405 - Feb 2017"; a second axis from 0 to 80 has been added]
Changed the scale of noise
![Page 100: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/100.jpg)
[Figure: scatter plot "CO2 vs Noise in 405", Noise (y) against CO2 (x), with fitted line y = 0.0465x + 19.693, R² = 0.65625]
![Page 101: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/101.jpg)
[Figure: scatter plot "CO2 vs Noise in 405", Noise (y) against CO2 (x), with fitted line y = 0.0465x + 19.693, R² = 0.65625]
Not the same as correlation!
![Page 102: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/102.jpg)
• r – Pearson correlation coefficient
• R² – coefficient of determination
– Proportion of variance in the target attribute that
is predictable from the source attribute
– Basically, measures how well the data follow the
regression line
– In simple linear regression, R² is equal to the
square of r
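The last point can be checked directly. A minimal sketch in R on made-up data that loosely imitates the CO2 and noise example; the coefficients, the noise level and the seed are assumptions, not the lecture's measurements:

```r
set.seed(1)                                     # arbitrary seed
co2   <- runif(200, min = 400, max = 1200)      # made-up source attribute
noise <- 20 + 0.05 * co2 + rnorm(200, sd = 8)   # made-up target attribute

fit       <- lm(noise ~ co2)                    # simple linear regression
r_squared <- summary(fit)$r.squared             # coefficient of determination
r         <- cor(co2, noise)                    # Pearson correlation coefficient

same <- isTRUE(all.equal(r_squared, r^2))       # TRUE: R² equals r squared here
```

With more than one source attribute the identity no longer holds in this simple form, which is why the slide restricts it to simple linear regression.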
![Page 103: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/103.jpg)
• Source:
https://www.valuewalk.com/2016/08/usd-vs-gbp-vs-eur-a-leveling-of-the-playing-field/
![Page 104: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/104.jpg)
Source:
https://www.valuewalk.com/2016/08/usd-vs-gbp-vs-eur-a-leveling-of-the-playing-field/
![Page 105: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/105.jpg)
Source: http://www.tylervigen.com/spurious-correlations
???
???
???
![Page 106: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/106.jpg)
Source: http://www.tylervigen.com/spurious-correlations
![Page 107: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/107.jpg)
???
???
???
![Page 108: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/108.jpg)
![Page 109: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/109.jpg)
Relation ≠ Correlation ≠ Causation
![Page 110: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/110.jpg)
Spurious correlations!
Multiple testing problem
(next lecture)
Relation ≠ Correlation ≠ Causation
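How easily spurious correlations arise when many pairs are compared can be illustrated by simulation. A minimal sketch in R, assuming 200 completely unrelated random series of 10 observations each (all sizes and the seed are arbitrary choices, not from the lecture):

```r
set.seed(1)          # arbitrary seed
n_obs    <- 10       # hypothetical number of observations per series
n_series <- 200      # hypothetical number of unrelated series

# Each column is an independent random series, so no real relation exists
m <- matrix(rnorm(n_obs * n_series), nrow = n_obs)

cors <- cor(m)                                   # all pairwise correlations
pairwise <- cors[upper.tri(cors)]                # each pair counted once
strongest <- max(abs(pairwise))                  # best-looking spurious pair
```

Among the 19900 pairs, some show a strong correlation purely by chance, which is the multiple testing problem in miniature.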
![Page 111: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/111.jpg)
Data understanding: distribution of
attributes
Types of histograms
How to describe probability distributions?
Some standard probability distributions
More ways to visualise distributions
Visualising relations of attributes
Are the attributes related?
![Page 112: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/112.jpg)
![Page 113: Meelis Kull meelis.kull@ut.ee Autumn 2017 · Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03 Demo: Data science mini-project CRISP-DM: cross-industrial standard](https://reader031.fdocuments.in/reader031/viewer/2022021823/5b36626a7f8b9a8b4b8e3e5a/html5/thumbnails/113.jpg)
• “The most merciful thing in the world... is the inability of the human mind to correlate all its contents.”
– H. P. Lovecraft
• “The correlation of quality of life and cost of energy is huge”
– Sam Altman
• “Visualization is daydreaming with a purpose.”
– Bo Bennett
• Source: https://www.brainyquote.com