CAP-394 INTRODUCTION TO DATA SCIENCE
Rafael Santos – [email protected]
Ribeiro – [email protected]
www.lac.inpe.br/~rafael.santos/cap394.html
Updated in 2019
About this Lecture
Where are we?
Basic Machine Learning Concepts
Machine Learning Categories
• Supervised:
  ◦ We have data with labels or associated values that we want to predict from other data (predictors).
  ◦ We also have data for which we have no labels or outcomes.
  ◦ We need to learn a mapping from predictors to labels.
  ◦ Examples: classification, regression.
• Unsupervised:
  ◦ We have data that we want to segment or cluster based on similarity or distance in feature space.
  ◦ Example: clustering.
• Reinforcement learning.
Many algorithms!
• Linear Regression, Neural Networks (MLPs), Extreme Gradient Boosting (XGBoost), Long Short-Term Memory (LSTM), many others.
• Decision Trees, Nearest Neighbors, Minimum Distance, Support Vector Machines, Neural Networks (MLPs), Random Forests, many others.
• Hierarchical Clustering, K-Means, DBScan, Self-Organizing Maps, many others.
• Lots of variants, and combinations are also allowed!
• We’ll play with the basics only.
Classification: Minimum Distance
• Decision boundary with labeled points and class prototypes
Accuracy: 92.444%
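A minimal Python sketch of a minimum distance classifier, assuming each class prototype is the mean of its labeled points and that distance is Euclidean. The toy data and the function names (`fit_prototypes`, `predict`, `accuracy`) are hypothetical; this is not the code or the dataset behind the accuracy figure on the slide.

```python
import math

def fit_prototypes(points, labels):
    """Compute one prototype (mean vector) per class from labeled points."""
    sums, counts = {}, {}
    for point, label in zip(points, labels):
        acc = sums.setdefault(label, [0.0] * len(point))
        for i, value in enumerate(point):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(prototypes, point):
    """Assign the class whose prototype is closest (Euclidean distance)."""
    return min(prototypes, key=lambda label: math.dist(prototypes[label], point))

def accuracy(prototypes, points, labels):
    """Fraction of points whose predicted class matches the given label."""
    hits = sum(predict(prototypes, p) == y for p, y in zip(points, labels))
    return hits / len(points)

# Hypothetical toy data, not the dataset used in the lecture.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (4.0, 4.2), (4.1, 3.9), (3.9, 4.0)]
labels = ["a", "a", "a", "b", "b", "b"]
prototypes = fit_prototypes(points, labels)
print(predict(prototypes, (1.1, 1.0)), accuracy(prototypes, points, labels))
```

With two prototypes, the decision boundary is the set of points equidistant from both, which is why the boundary in this kind of plot is piecewise linear.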
Classification: Decision Tree
• The model is the set of decision rules that best separates the classes.
• The class is determined by evaluating the rules.
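To make "a set of decision rules" concrete, here is a hedged Python sketch of the smallest possible tree: a one-level decision stump that picks the single rule `feature <= threshold` with the fewest training errors. Real decision tree learners split recursively and usually rank splits by an impurity measure; the error-count criterion, the toy data, and the names `fit_stump` and `predict_stump` are assumptions made for illustration.

```python
def fit_stump(points, labels):
    """Return (feature, threshold, left_label, right_label) with fewest training errors."""
    best = None
    classes = sorted(set(labels))
    for f in range(len(points[0])):
        values = sorted(set(p[f] for p in points))
        # Candidate thresholds lie halfway between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            for left in classes:
                for right in classes:
                    errors = sum(
                        (left if p[f] <= t else right) != y
                        for p, y in zip(points, labels)
                    )
                    if best is None or errors < best[0]:
                        best = (errors, f, t, left, right)
    return best[1:]

def predict_stump(rule, point):
    """Evaluate the rule: go left if the feature is at or below the threshold."""
    f, t, left, right = rule
    return left if point[f] <= t else right

# Hypothetical toy data: the first feature separates the classes, the second does not.
points = [(1.0, 5.0), (2.0, 4.0), (3.0, 6.0), (6.0, 5.5), (7.0, 4.5), (8.0, 5.0)]
labels = ["a", "a", "a", "b", "b", "b"]
rule = fit_stump(points, labels)
print(rule, predict_stump(rule, (2.5, 5.0)))
```

A full tree would apply the same search again inside each side of the split until the leaves are pure or a stopping rule is met.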
Classification: Decision Tree
Accuracy: 0.98
Classification: Nearest Neighbors
• K=1
• A perfect classifier? Accuracy = 1
Classification: Nearest Neighbors
• K=5
Accuracy: 0.98
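The K=1 versus K=5 comparison can be reproduced in a small Python sketch. It assumes accuracy is measured on the same points used for training, which would explain a perfect score for K=1: every training point is its own nearest neighbor. That evaluation setup is an assumption, and the toy data (including one deliberately mislabeled-looking point) and function names are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Majority vote among the k training points closest to the query."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def knn_accuracy(train_points, train_labels, test_points, test_labels, k):
    """Fraction of test points classified correctly."""
    hits = sum(knn_predict(train_points, train_labels, p, k) == y
               for p, y in zip(test_points, test_labels))
    return hits / len(test_points)

# Hypothetical toy data: two groups plus one "b" point sitting inside the "a" group.
points = [(1.0, 1.0), (1.2, 1.1), (0.9, 1.3), (1.1, 0.8),
          (3.0, 3.0), (3.2, 2.9), (2.9, 3.3), (3.1, 3.1),
          (1.05, 1.05)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]

acc_k1 = knn_accuracy(points, labels, points, labels, k=1)
acc_k5 = knn_accuracy(points, labels, points, labels, k=5)
print(acc_k1, acc_k5)
```

On the training set K=1 memorizes every point, including the odd one, while K=5 outvotes it. A score on held-out data would be a fairer way to compare the two settings.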
Classification: Neural Networks (MLPs)
• Perceptron
Classification: Neural Networks (MLPs)
• Multilayer Perceptron
Classification: MLPs with a single neuron/hidden layer
Classification: MLPs with a single neuron/hidden layer
• A single hidden neuron cannot separate three classes that require non-parallel hyperplanes!
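One way to see why: a single neuron only ever sees the scalar score w·x + b, so any decision built on its output changes only along the direction of w. The Python sketch below, with hypothetical weights and a sigmoid activation chosen for illustration, moves a point along a direction orthogonal to w and shows that the output does not change. All boundaries derived from that one output are therefore parallel hyperplanes.

```python
import math

def neuron(w, b, x):
    """Single sigmoid neuron: squashes the linear score w.x + b into (0, 1)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and input, chosen only for illustration.
w, b = [1.0, 2.0], -0.5
x = [0.3, 0.4]

# A direction orthogonal to w: 1.0 * 2.0 + 2.0 * (-1.0) == 0.
step = [2.0, -1.0]
moved = [xi + 3.0 * si for xi, si in zip(x, step)]

out_original = neuron(w, b, x)
out_moved = neuron(w, b, moved)
print(out_original, out_moved)  # equal up to floating-point rounding
```

Adding a second hidden neuron with a different weight vector gives a second, independent direction, which is what makes non-parallel boundaries possible.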
Classification: Neural Networks (MLPs)
• 2 neurons in the hidden layer
• Accuracy: 0.9777778
Classification: Neural Networks (MLPs)
• 5 neurons in the hidden layer
• Accuracy: 0.9822222
Classification: Neural Networks (MLPs)
• 60 neurons in the hidden layer
• Accuracy: 0.9933333
Machine Learning in R
“R Programming”
EDA in R
• Let’s switch to a browser: http://www.lac.inpe.br/~rafael.santos/r.html
Clustering: K-Means
• Simulation with an artificial dataset and K=7
Clustering: K-Means
• The implementation in R is very efficient (quick convergence)
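For reference, a plain Python sketch of the basic K-Means loop (Lloyd-style iterations: assign each point to its nearest center, then move each center to the mean of its points, until nothing changes). This is not the R implementation mentioned on the slide. Initial centers are passed in explicitly so the run is deterministic, and the toy data and the `kmeans` signature are hypothetical.

```python
import math

def kmeans(points, initial_centers, max_iter=100):
    """Alternate assignment and update steps until the centers stop moving."""
    centers = [list(c) for c in initial_centers]
    assign = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        assign = [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: each center becomes the mean of its assigned points.
        new_centers = []
        for j, center in enumerate(centers):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(center)  # keep an empty cluster's center as is
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign

# Hypothetical toy data with two obvious groups (the slide uses K=7 on other data).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, assign = kmeans(points, initial_centers=[points[0], points[3]])
print(centers, assign)
```

On well-separated data like this the loop converges after a couple of iterations; the result generally depends on the initial centers, which is why implementations often try several random starts.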
Clustering: DBScan
• Identification of density-based clusters:
  ◦ Find core points
  ◦ Test reachability
  ◦ Identify noise
Figure source: Wikipedia
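The three steps above can be sketched in Python as follows. This is a minimal, quadratic-time illustration rather than a production implementation. The parameter `min_pts` (minimum neighbors, counting the point itself, for a point to be a core point) is not given on the slides and is an assumption here, as are the toy data and the convention of labeling noise as -1.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # Neighborhoods within eps (each point counts as its own neighbor).
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    # Step 1: core points have at least min_pts neighbors.
    core = [len(nb) >= min_pts for nb in neighbors]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # Step 2: grow a cluster through everything reachable from this core point.
        labels[i] = cluster
        frontier = [i]
        while frontier:
            current = frontier.pop()
            for j in neighbors[current]:
                if labels[j] == -1:
                    labels[j] = cluster
                    if core[j]:          # only core points keep expanding
                        frontier.append(j)
        cluster += 1
    # Step 3: points still labeled -1 were never reached and are noise.
    return labels

# Hypothetical toy data: two tight groups and one isolated point.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (5.0, 5.0), (5.2, 5.1), (5.1, 4.8),
          (10.0, 0.0)]
small_eps = dbscan(points, eps=0.5, min_pts=3)
large_eps = dbscan(points, eps=8.0, min_pts=3)
print(small_eps, large_eps)
```

With the small radius the two groups stay separate and the isolated point is noise; with the large radius everything merges into one cluster, which mirrors how the result changes as eps grows.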
Clustering: DBScan
eps=0.3
Clustering: DBScan
eps=1.0
Clustering: DBScan
eps=2.0
Hierarchical Clustering
• Methods that create several partitions of the data:
  ◦ Top-down: starts with all data in a single cluster and splits clusters until each data point is in its own cluster.
  ◦ Bottom-up: starts with each data point in its own cluster and merges points and clusters until all data is in a single cluster.
Hierarchical Clustering
Points:
         X      Y
1        0.25   0.27
2        0.32   0.91
3        0.33   0.80
4        0.18   0.33

Distances:
         1      2      3      4
1        0.000  0.644  0.536  0.092
2        0.644  0.000  0.110  0.597
3        0.536  0.110  0.000  0.493
4        0.092  0.597  0.493  0.000

After merging 1 and 4:
         X      Y
1+4      0.22   0.30
2        0.32   0.91
3        0.33   0.80

         1+4    2      3
1+4      0.000  0.619  0.513
2        0.619  0.000  0.110
3        0.513  0.110  0.000

After merging 2 and 3:
         X      Y
1+4      0.22   0.30
2+3      0.33   0.86

         1+4    2+3
1+4      0.000  0.566
2+3      0.566  0.000

After the final merge:
         X      Y
1+2+3+4  0.27   0.58

Dendrogram leaf order: 1, 4, 2, 3
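The worked example can be reproduced with a short bottom-up procedure in Python. The sketch assumes that a merged cluster is represented by the centroid of its members and that distances are Euclidean between centroids, which matches the numbers on the slide when the unrounded centroids are used (for instance 0.215 rather than 0.22). The function name `agglomerate` and the cluster naming scheme are hypothetical.

```python
import math

# The four points from the slide, keyed by their labels.
points = {"1": (0.25, 0.27), "2": (0.32, 0.91), "3": (0.33, 0.80), "4": (0.18, 0.33)}

def agglomerate(points):
    """Repeatedly merge the two clusters with the closest centroids."""
    clusters = {name: (xy, 1) for name, xy in points.items()}  # name -> (centroid, size)
    merges = []
    centroid = None
    while len(clusters) > 1:
        names = sorted(clusters)
        a, b = min(
            ((p, q) for i, p in enumerate(names) for q in names[i + 1:]),
            key=lambda pair: math.dist(clusters[pair[0]][0], clusters[pair[1]][0]),
        )
        (ca, na), (cb, nb) = clusters.pop(a), clusters.pop(b)
        distance = math.dist(ca, cb)
        # Size-weighted mean of the two centroids (sizes are always equal in this example).
        centroid = tuple((na * u + nb * v) / (na + nb) for u, v in zip(ca, cb))
        clusters[a + "+" + b] = (centroid, na + nb)
        merges.append((a + "+" + b, round(distance, 3)))
    return merges, centroid

merges, final_centroid = agglomerate(points)
print(merges, final_centroid)
```

The merge heights (about 0.092, 0.110 and 0.566) are the values a dendrogram would plot on its vertical axis.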
Hierarchical Clustering
Single Linkage
Complete Linkage
Average Linkage
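The three linkage criteria differ only in how the distance between two clusters is derived from the pairwise point distances: the minimum (single), the maximum (complete), or the mean (average). A small Python sketch, reusing the four points from the worked example slide grouped as {1, 4} and {2, 3}; the function name `linkage_distances` is hypothetical.

```python
import math

def linkage_distances(cluster_a, cluster_b):
    """Cluster-to-cluster distance under single, complete and average linkage."""
    pairwise = [math.dist(p, q) for p in cluster_a for q in cluster_b]
    return {
        "single": min(pairwise),                    # closest pair
        "complete": max(pairwise),                  # farthest pair
        "average": sum(pairwise) / len(pairwise),   # mean over all pairs
    }

cluster_14 = [(0.25, 0.27), (0.18, 0.33)]  # points 1 and 4
cluster_23 = [(0.32, 0.91), (0.33, 0.80)]  # points 2 and 3
d = linkage_distances(cluster_14, cluster_23)
print({name: round(value, 3) for name, value in d.items()})
```

Single linkage tends to chain elongated clusters together, while complete linkage favors compact ones, so the same data can produce visibly different dendrograms.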
Hierarchical Clustering
Hierarchical Clustering
Using non local features for 3D shape grouping. Antonio Adán, Miguel Adán, Santiago Salamanca, and Pilar Merchán. LNCS 5432, 2008.
Self-Organizing Map
Self-Organizing Map
12x12 SOM, data in 8 dimensions.
T=0, R=25, Lr=0.9
Self-Organizing Map
12x12 SOM, data in 8 dimensions.
T=40, R=16.7, Lr=0.74
Self-Organizing Map
12x12 SOM, data in 8 dimensions.
T=320, R=1, Lr=0.18
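The slides do not define T, R and Lr or say how they evolve. The Python sketch below reads them as iteration, neighborhood radius and learning rate, and assumes exponential decay for both. The time constants are assumptions picked because they happen to reproduce the three snapshots shown (R: 25, 16.7, 1 and Lr: 0.9, 0.74, 0.18); the lecture's actual schedule may differ. The Gaussian neighborhood in `som_update` and all names are likewise illustrative.

```python
import math

R0, LR0 = 25.0, 0.9             # starting radius and learning rate (from the T=0 slide)
T_MAX = 320                     # last iteration shown on the slides
TAU_R = T_MAX / math.log(R0)    # assumed: radius decays to exactly 1 at T_MAX
TAU_LR = 200.0                  # assumed: chosen to match the slide values

def radius(t):
    """Neighborhood radius at iteration t (assumed exponential decay)."""
    return R0 * math.exp(-t / TAU_R)

def learning_rate(t):
    """Learning rate at iteration t (assumed exponential decay)."""
    return LR0 * math.exp(-t / TAU_LR)

def som_update(weights, x, t):
    """One SOM step: pull the best matching unit and its grid neighbors toward x.

    weights maps a grid cell (row, col) to a mutable weight vector (list of floats).
    """
    bmu = min(weights, key=lambda cell: math.dist(weights[cell], x))
    r, lr = radius(t), learning_rate(t)
    for cell, w in weights.items():
        grid_d2 = (cell[0] - bmu[0]) ** 2 + (cell[1] - bmu[1]) ** 2
        influence = math.exp(-grid_d2 / (2 * r * r))  # Gaussian neighborhood (assumed)
        for i in range(len(w)):
            w[i] += lr * influence * (x[i] - w[i])
    return bmu

for t in (0, 40, 320):
    print(t, round(radius(t), 1), round(learning_rate(t), 2))

# Tiny hypothetical 1x2 map in 2 dimensions, just to exercise the update.
weights = {(0, 0): [0.0, 0.0], (0, 1): [1.0, 1.0]}
bmu = som_update(weights, [0.9, 0.9], t=320)
```

Early on, the large radius and learning rate let the whole map reorganize; late in training only the best matching unit and its closest neighbors move, and by small amounts.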
Self-Organizing Map
Self-Organizing Map
Visualization of Citizen Science Volunteers’ Behaviors with Data from Usage Logs, Alessandra Marli M. Morais, Rafael D.C. Santos, M. Jordan Raddick, Computing in Science and Engineering, July/August 2015