BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition:...

54
Introducon to Machine Learning BioComp Summer School 2017 Chloé-Agathe Azencot Centre for Computaonal Biology, Mines ParisTech [email protected]

Transcript of BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition:...

Page 1: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

Introduction to Machine Learning

BioComp Summer School 2017

Chloé-Agathe AzencotCentre for Computational Biology, Mines ParisTech

chloe­agathe.azencott@mines­paristech.fr

Page 2: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

Keywordsmachine learning

ensemble learning

regularization

generalization

overfittingcross­validation

model selection

kernels

deep learning

gradient descent

likelihood

Page 3: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

Overview● What kind of problems can machine learning 

solve?● Some popular supervised ML algorithms:

– Linear models– Support vector machines– Random forests– (Deep) neural networks.

● How do we select a machine learning algorithm?

● What is overfitting, and how can we avoid it?

Page 4: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

4

What is (Machine) Learning?

Page 5: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

5

Why Learn?● Learning: 

Modifying a behavior based on experience

[F. Benureau]● Machine learning: Programming computers to 

– model data– by means of optimizing an objective function– using example data.

Page 6: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

6

Why Learn?● There is no need to “learn” to calculate payroll.● Learning is used when

– Human expertise does not exist (bioinformatics);– Humans are unable to explain their expertise 

(speech recognition, computer vision);– Solutions change in time (routing computer 

networks);– Solutions need adapting to new cases (user 

biometrics).

Page 7: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

7

What about AI?

Page 8: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

8

Artificial Intelligence

ML is a subfield of Artificial Intelligence– A system that lives in a changing environment must 

have the ability to learn in order to adapt.– ML algorithms are building blocks that make computers 

behave  more  intelligently  by  generalizing  rather  than merely  storing  and  retrieving  data  (like  a  database system would do).

Page 9: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

9

Zoo of ML Problems 

Page 10: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

10

Unsupervised learning

Data

Images, text, measurements, omics data...

ML algo Data!

Learn a new representation of the data

X

p

n

Page 11: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

11

Clustering

Data

Group similar data points together

– Understand general characteristics of the data;– Infer some properties of an object based on 

how it relates to other objects.

ML algo

Page 12: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

12

Clustering: applications

– Customer relationship management: Customer segmentation

– Image compression: Color quantization– Document clustering: Group documents by 

topics (bag­of­words)– Bioinformatics: Learning motifs.

Page 13: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

13

Dimensionality reduction

Data ML algo

Find a lower­dimensional representation

Data

X

p

n

Images, text, measurements, omics data... X

m

n

Page 14: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

14

Dimensionality reduction

Data

Find a lower­dimensional representation

– Reduce storage space & computational time– Remove redundances– Visualization (in 2 or 3 dimensions) and interpretability.

ML algo Data

Page 15: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

15

Supervised learning

Data

ML algo

Make predictions

Labels

Predictor

decision function

X

p

n yn

Page 16: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

16

ClassificationMake discrete predictions

Binary classification

Multi­class classification

Data

ML algo

Labels

Predictor

Page 17: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

17

Classification

gene 1

Healthy

Sickge

ne 2

Page 18: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

18

Classification: Applications● Face recognition: independently of pose, lighting, 

occlusion (glasses, beard), make­up, hair style. ● Character recognition: independently of different 

handwriting styles.● Speech recognition: account for temporal 

dependency. ● Medical diagnosis: from symptoms to illnesses.

● Precision medicine: from clinical & genetic features to diagnosis, prognosis, response to treatment.

● Biometrics: recognition/authentication using physical or behavioral characteristics: Face, iris, signature...

Page 19: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

19

RegressionMake continunous predictions

Data

ML algo

Labels

Predictor

Page 20: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

20

Regression: Applications

– Car navigation: angle of steering

– Kinematics of a robot arm– Binding affinities between 

molecules– Age of onset of a disease– Solubility of a chemical in 

water– Yield of a crop– Direction of a forest fire

Page 21: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

21

Parametric models● Decision function has a set form● Model complexity ≈ number of parameters

● Decision function can have “arbitrary” form● Model complexity grows with the number of 

samples.

Non­parametric models

Page 22: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

22

Linear models

Page 23: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

23

Linear regression

● Least­squares fit:● Equivalent to maximizing the likelihood                    

under the assumption of Gaussian noise

● Exact solution                                                                      if X has full column rank

Page 24: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

24

Classification: logistic regressionLinear function   probability→

● Solve by maximizing the likelihood● No analytical solution● Use gradient descent.

Page 25: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

25

Support Vector Machines

Page 26: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

26

Large margin classifier● Find the separating hyperplane with the largest 

margin.

f(x)

prediction errorinverse of the margin

Page 27: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

27

When lines are not enough

Page 28: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

28

When lines are not enough

Page 29: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

29

When lines are not enough

● Non­linear mapping to a feature space● https://www.youtube.com/watch?v=3liCbRZPrZA

Page 30: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

30

The kernel trick● The solution & SVM­solving algorithm can be 

expressed using only● Never need to explicitly compute  

● k: kernel

– must be positive semi­definite – can be interpreted as a similarity function.

● Support vectors: training points for which α ≠ 0.

Page 31: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

31

Random Forests

Page 32: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

32

Decision trees

● Well suited to categorical features● Naturally handle multi­class classification● Interpretable● Perform poorly.

Color?

Shape?Size?

Apple BananaGrape Apple

YellowGreen

BigSmall Curved Ball

Page 33: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

33

Ensemble learning● Combining weak learners averages out their individual 

errors (wisdom of crowds)● Final prediction:

– Classification: majority vote– Regression: average.

● Bagging: weak learners are trained on bootstraped samples of the data (Breiman, 1996).

bootstrap: sample n, with replacement.● Boosting: weak learners are built iteratively, based on 

performance (Shapire, 1990).

Page 34: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

34

Random forests● Combine many decision trees

● Each tree is trained on a data set created using

– A bootstrap sample (sample with replacement) of the data– A random sample of the features.

● Very powerful in practice.

Majority vote

Page 35: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

35

(Deep) Neural Networks

Page 36: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

36

(Deep) neural networks

...

... hidden layer

● Nothing more than a (possibly complicated) parametric model

Page 37: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

37

(Deep) neural networks

...

... hidden layer

● Fitting weights:– Non­convex optimization problem– Solved with gradient descent– Can be difficult to tune.

Page 38: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

38

(Deep) neural networks

...

... hidden layer

● Learn an internal representation of the data on which a linear model works well.

● Currently one of the most powerful supervised learning algorithms for large training sets.

Page 39: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

39

(Deep) neural networks

Yann Le Cun et al. (1990)

Internal representation of the digits data

Page 40: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

40

Generalization & overfitting

Page 41: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

41

Generalization● Goal: build models that make good predictions 

on new data.● Models that work “too well” on the data we learn 

on tend to model noise as well as the underlying phenomenon: overfitting.

Page 42: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

42

Overfitting (Classification)

Figure from Hastie, Tibshirani & Friedman, The Elements of Statistical Learning.

Page 43: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

43

Overfitting (Regression)

Underfitting

Overfitting

Figure from Hastie, Tibshirani & Friedman, The Elements of Statistical Learning.

Page 44: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

44

Model complexitySimple models are

– more plausible (Occam’s Razor)– easier to train, use, and interpret.

Pre

dic

tion 

erro

r

Model complexity

On training data

On new data

Page 45: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

45

● Prevent overfitting by designing an objective function that accounts not only for the prediction error but also for model complexity.

min (empirical_error + λ*model_complexity)

● Rember the SVM

Regularization

f(x)

prediction errorinverse of the margin

Page 46: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

46

Ridge regression

● Unique solution, always exists

● Grouped selection:Correlated variables get similar weights.

● ShrinkageCoefficients shrink towards 0.

prediction error regularizer

hyperparameter

Page 47: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

47

Geometry of ridge regression

feasible region

unconstrained minimum

Page 48: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

48

Model selection & evaluation

Page 49: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

49

Evaluation on held­out data● If we evaluate the model on the data we’ve used to 

train it, we risk over­estimating performance.● Proper procedure:

– Separate the data in train/test sets

– Use a cross­validation on the train set to find the best algorithm + hyperparameter(s)

– Train this best algorithm + hyperparameter(s) on the entire train set

– The performance on the test set estimates generalization performance.

Page 50: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

50

Evaluation on held­out data● If we evaluate the model on the data we’ve used to 

train it, we risk over­estimating performance.● Proper procedure:

– Separate the data in train/test sets

– Use a cross­validation on the train set to find the best algorithm + hyperparameter(s)

– Train this best algorithm + hyperparameter(s) on the entire train set

– The performance on the test set estimates generalization performance.

Train Test

Page 51: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

51

Evaluation on held­out data● If we evaluate the model on the data we’ve used to 

train it, we risk over­estimating performance.● Proper procedure:

– Separate the data in train/test sets

– Use a cross­validation on the train set to find the best algorithm + hyperparameter(s)

– Train this best algorithm + hyperparameter(s) on the entire train set

– The performance on the test set estimates generalization performance.

Train Test

Val

ValTrain

Val

Val

Val

Train

Train

Train

Train

perf

perf

perf

perf

perf

CV score

Page 52: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

52

Evaluation on held­out data● If we evaluate the model on the data we’ve used to 

train it, we risk over­estimating performance.● Proper procedure:

– Separate the data in train/test sets

– Use a cross­validation on the train set to find the best algorithm + hyperparameter(s)

– Train this best algorithm + hyperparameter(s) on the entire train set

– The performance on the test set estimates generalization performance.

Train Test

Page 53: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

53

ML Toolboxes● Python: scikit­learnhttp://scikit-learn.org

Get started in python: http://scipy-lectures.github.io/

● R: Machine Learning Task Viewhttp://cran.r-project.org/web/views/MachineLearning.html

● Matlab™: Machine Learning with MATLABhttp://mathworks.com/machine-learning/index.html

– Statistics and Machine Learning Toolbox– Neural Network Toolbox

Page 54: BioComp Summer School 2017 - cazencott.info · 18 Classification: Applications Face recognition: independently of pose, lighting, occlusion (glasses, beard), makeup, hair style. Character

54

SummaryMachine learning = 

data + model + objective function● Catalog:

– Supervised vs unsupervised– Parametric vs non­parametric– Linear models, SVMs, random forests, neural 

networks.

● Key concerns: – avoid overfitting– Measure generalization performance.