Data Science and Scientific Computation Track Core Course

Christoph Lampert

Spring Semester 2016/17, Segment 1, Lecture 2


Overview

Date         no.  Topic
Feb 27  Mon   1   predictive models, least squares regression, model selection, regularization
Mar  1  Wed   2   real data, non-vectorial data, LASSO regression
Mar  6  Mon   3   missing data, nonlinear regression
Mar  8  Wed   4   robust regression, classification
Mar 13  Mon   5   model evaluation, large-scale model learning
Mar 15  Wed   6   project Q&A
Mar 20  Mon   7   project Q&A
Mar 22  Wed   8   project presentations


Refresher


Refresher – Predictive Models

Regression

• Data: anything (number, vector, image, natural text, ...)
• Predicted quantity: a real number, e.g. 5.3

Classification

• Data: anything
• Predicted quantity: a discrete decision, e.g. "yes"

More complex/structured prediction tasks

• Data: anything
• Predicted quantity: complex objects, e.g. a natural language sentence, a segmentation mask of an image, ...


Refresher – Linear Regression

Given: $(x_1, y_1), \dots, (x_n, y_n)$ with $x_i = (x_i^1, \dots, x_i^d) \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$.

In matrix notation: $X = (x_1 \,|\, x_2 \,|\, \dots \,|\, x_n) \in \mathbb{R}^{d \times n}$, $Y \in \mathbb{R}^n$, $w \in \mathbb{R}^d$.

Least squares regression

$f(x) = w^\top x + b$  for  $w = (X X^\top)^{-1} X Y$,  $b = \dots$

Ridge regression
Make it more robust by adding regularization:

$f(x) = w^\top x + b$  for  $w = (X X^\top + \lambda\, \mathrm{Id})^{-1} X Y$,  $b = \dots$

where $\lambda$ is the regularization strength.
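A minimal NumPy sketch of the two closed-form solutions above, using the d×n data layout of this slide; recovering b by centering the data is one common convention for the omitted "b = ...", not necessarily the one used in the course:

    import numpy as np

    def ridge_fit(X, Y, lam=1.0):
        """Closed-form ridge regression; X has shape (d, n), Y has shape (n,)."""
        d, n = X.shape
        x_mean = X.mean(axis=1, keepdims=True)
        y_mean = Y.mean()
        Xc, Yc = X - x_mean, Y - y_mean          # center the data; the bias is handled separately
        w = np.linalg.solve(Xc @ Xc.T + lam * np.eye(d), Xc @ Yc)
        b = y_mean - w @ x_mean.ravel()          # one common way to recover the bias term
        return w, b

    # ordinary least squares corresponds to lam = 0 (provided the matrix is invertible)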


Refresher – Evaluating Models

Use different data to evaluate a model than what you used for training it!

• split the available data into a part for training and a part for evaluating its performance (testing)
• put the testing part away and don't look at it
• use the training data for all steps to construct one final model
  ◦ training itself
  ◦ choice of model class (linear? bias term or not? nonlinear?)
  ◦ choice of regularization strength

• evaluate prediction quality of final model using testing data


Refresher – Model Selection

How to choose models/parameters without looking at the test data?
Simulate the model evaluation step during the training procedure:

K-fold Cross Validation (typically K = 5 or K = 10)

  input: algorithm A, loss function ℓ, data D (trainval part)
  split D = D₁ ∪ ... ∪ D_K into K equal-sized disjoint parts
  for k = 1, ..., K do
      f¬k ← A[D \ D_k]
      r_k ← performance of f¬k on D_k
  end for
  output: R_{K-CV} = (1/K) ∑_{k=1}^{K} r_k   (K-fold cross-validation risk)

"Model selection":
• compute R_{K-CV} for all possible settings/parameter values
• fix the parameters/model settings that lead to the best R_{K-CV} value
• obtain the final model by retraining on all available training data

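A NumPy sketch of the K-fold cross-validation loop above; train and evaluate are placeholders for the learning algorithm A and the performance measure, which the pseudocode leaves abstract:

    import numpy as np

    def kfold_cv(X, Y, train, evaluate, K=5, seed=0):
        """X: (d, n) data, Y: (n,) targets; returns the K-fold cross-validation risk."""
        n = Y.shape[0]
        idx = np.random.RandomState(seed).permutation(n)
        folds = np.array_split(idx, K)                    # K roughly equal-sized disjoint parts
        risks = []
        for k in range(K):
            train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
            f = train(X[:, train_idx], Y[train_idx])      # f_{¬k} = A[D \ D_k]
            risks.append(evaluate(f, X[:, folds[k]], Y[folds[k]]))
        return np.mean(risks)                             # R_{K-CV}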

Real data


Example: Particulate matter (PM)

[Figure: chart of typical particle sizes (axis from 0.0001 to 1000 μm) for types of dust and for particulate, biological, and gaseous contaminants: gas molecules, viruses, tobacco smoke, oil smoke, soot, smog, bacteria, fly ash, cement dust, mold spores, house dust mite allergens, cat allergens, pollen, settling dust, suspended atmospheric dust, heavy dust]

https://en.wikipedia.org/wiki/Particulates


Example: Particulate matter (PM)

Inovafitness "Nova PM Sensor SDS011"

Raspberry Pi

Image: Charlie Kuehnast, http://kuehnast.com/s9y/categories/8-Linux-und-Raspberry-Pi
Image: Raspberry-Pi Foundation, https://www.raspberrypi.org/


Example: Particulate matter (PM)

https://data.sparkfun.com/streams/RMAjV37OvKi6Mb6gLzQX


Real data comes in different formats, e.g.

TXT (text):
• (numeric) entries as a flat matrix
• one data item per row, columns separated by spaces or tabs
• supported by standard software, e.g. Microsoft Excel
• usually human readable
• easy to parse automatically
• useful only for fixed-length objects, e.g. vectors
• no meta-data, such as column names

  15.4 14.0 2017-02-11T12:44:06.643Z
  15.6 14.2 2017-02-11T12:43:06.426Z
  16.3 14.7 2017-02-11T12:42:10.047Z
  16.5 14.8 2017-02-11T12:41:09.862Z
  16.1 14.6 2017-02-11T12:40:08.458Z
  16.0 14.5 2017-02-11T12:39:07.651Z
  ...


Real data comes in different formats, e.g.

CSV (comma-separated values):
• flat matrix/table of numbers or text
• one data item per row, columns usually separated by commas
• supported by standard software, e.g. Microsoft Excel
• human readable (more or less)
• possible to parse automatically (but beware of pitfalls, such as strings containing commas)
• useful mainly for fixed-length objects, e.g. vectors
• meta-data (column names) in the header (first row)

  pm10,pm25,timestamp
  15.4,14.0,2017-02-11T12:44:06.643Z
  15.6,14.2,2017-02-11T12:43:06.426Z
  16.3,14.7,2017-02-11T12:42:10.047Z
  16.5,14.8,2017-02-11T12:41:09.862Z
  ...
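A table like this can be read with pandas instead of a hand-written parser; a sketch (the file name is hypothetical, the column names pm10, pm25, timestamp come from the header row above):

    import pandas as pd

    # parse_dates turns the ISO timestamps into proper datetime objects
    df = pd.read_csv("finedust.csv", parse_dates=["timestamp"])
    print(df[["pm10", "pm25"]].describe())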


Real data comes in different formats, e.g.

JSON (JavaScript object notation)
• hierarchical, elements can have sub-entries
• supported by most modern programming languages
• not easily readable for humans
• not advised to parse manually, better to use a library
• useful for objects of variable length or complexity

  [{"pm10":"15.4","pm25":"14.0","timestamp":"2017-02-11T12:44:06.643Z"},
   {"pm10":"15.6","pm25":"14.2","timestamp":"2017-02-11T12:43:06.426Z"},
   {"pm10":"16.3","pm25":"14.7","timestamp":"2017-02-11T12:42:10.047Z"},
   {"pm10":"16.5","pm25":"14.8","timestamp":"2017- ...
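Such a file is best read with a JSON library rather than by hand; a sketch with Python's standard json module (the file name is hypothetical):

    import json

    with open("finedust.json") as f:
        records = json.load(f)                    # a list of dicts, one per measurement

    pm10 = [float(r["pm10"]) for r in records]    # the values are stored as strings here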

Many others, often proprietary:
• MAT (Matlab)
• XLS (Microsoft Excel)
• SQL (SQLite database)


Real data rarely comes as the perfect vectors the theory demands:

Data can be non-numeric
• (today)

Values can be missing
• (later in the course)

Values can be wrong/broken/misleading
• for scientific reasons → outliers
• for non-scientific reasons → measurement errors

notebook demo


Beyond Vectors


Linear models, such as

$f(x) = w^\top x + b$

only make sense if
• the data are vectors, $x \in \mathbb{R}^d$, of the same dimension
• all entries of the vectors are known

Real data

• can be categorical

• can be of variable size

• can have missing entries


Categorical data

X = {red, green, blue}

Typically not so hard to handle: introduce indicator variables in R^|X|, called "one-hot encoding"

• red ↦ (1, 0, 0)    green ↦ (0, 1, 0)    blue ↦ (0, 0, 1)

Don't use: red ↦ 1, green ↦ 2, blue ↦ 3

That would introduce spurious relations, such as

green + red = blue ?!?

One-hot works well even for large X, e.g. all English words, when using the right data structures (e.g. sparse vectors/matrices)
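A minimal sketch of such an encoder in NumPy; for real data, scikit-learn's OneHotEncoder or pandas.get_dummies provide the same mapping, including sparse output for large X:

    import numpy as np

    categories = ["red", "green", "blue"]
    index = {c: i for i, c in enumerate(categories)}

    def one_hot(value):
        v = np.zeros(len(categories))
        v[index[value]] = 1.0
        return v

    one_hot("green")    # -> array([0., 1., 0.])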


Ordinal data

X = {poor, fair, good, very good, excellent}

The best treatment depends on the situation

• working with distances?

  poor ↦ 1    fair ↦ 2    ...    excellent ↦ 5

  might work well.

• in other situations, one-hot might work better.

• if the values derive from a continuous quantity by quantization
  ◦ ≤ 60%: poor    61–70%: fair    ...    ≥ 91%: excellent

  it might make sense to reflect those values:

  poor ↦ 0.55    fair ↦ 0.65    ...    excellent ↦ 0.95
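The three options above, written as simple lookup tables; the 0.75 and 0.85 midpoints are interpolated for illustration only, the slide specifies just the values for poor, fair, and excellent:

    levels = ["poor", "fair", "good", "very good", "excellent"]

    # option 1: ranks, preserves the ordering
    rank = {lvl: i + 1 for i, lvl in enumerate(levels)}        # poor -> 1, ..., excellent -> 5

    # option 2: one-hot, ignores the ordering
    one_hot = {lvl: [1.0 if j == i else 0.0 for j in range(len(levels))]
               for i, lvl in enumerate(levels)}

    # option 3: map back onto the underlying continuous scale
    # (0.75 and 0.85 are assumed/interpolated values, not taken from the slide)
    midpoint = {"poor": 0.55, "fair": 0.65, "good": 0.75,
                "very good": 0.85, "excellent": 0.95}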


Language data

Example: X = words; task-specific encoding: "word vectors"

• represent each word w by a vector φ(w) ∈ R^d (e.g. 25 ≤ d ≤ 300)
• similar vectors encode words of similar meaning (more or less)

  tiger  -0.70  -0.34   0.44  -0.38  -0.55   0.29   0.79   0.01   0.56  ...
  lion   -0.89  -0.56  -0.37   0.76  -0.78   0.56   0.80  -0.05   0.80  ...
  pion   -0.53  -0.62  -0.13   0.55  -0.55  -0.43  -1.12  -0.39   0.67  ...
  quark  -0.53  -0.55   0.17  -0.67  -0.51  -0.32  -0.90  -1.41   0.74  ...

• φ(tiger) ≈ φ(lion), φ(pion) ≉ φ(lion), etc.

Euclidean distances, ‖φ(wᵢ) − φ(wⱼ)‖:

          tiger  lion  pion  quark
  tiger    0     2.6   4.6   4.0
  lion     2.6   0     4.3   4.6
  pion     4.6   4.3   0     2.8
  quark    4.0   4.6   2.8   0
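A sketch of how such a distance table can be computed from a word-to-vector dictionary; the vectors here are toy values (only the first three components of the table above), real ones would come from a pretrained model like those on the next slide:

    import numpy as np

    vec = {                                   # toy 3-dimensional "word vectors", for illustration only
        "tiger": np.array([-0.70, -0.34,  0.44]),
        "lion":  np.array([-0.89, -0.56, -0.37]),
        "pion":  np.array([-0.53, -0.62, -0.13]),
        "quark": np.array([-0.53, -0.55,  0.17]),
    }

    words = list(vec)
    dist = {(a, b): np.linalg.norm(vec[a] - vec[b]) for a in words for b in words}
    print(dist[("tiger", "lion")], dist[("tiger", "pion")])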


Language data

Vectors are learned automatically from large corpora (e.g. Wikipedia):

For example, GloVe: [Pennington et al., "GloVe: Global Vectors for Word Representation", EMNLP 2014]

• Each unique word $w_i$ has an (unknown) vector $v_i$.
• Treat $v_1, v_2, \dots$ as parameters of a large regression problem

  $$\min_{v_1, v_2, \dots} \sum_{i,j} \big\| v_i^\top v_j - \log p_{ij} \big\|^2$$

  where $p_{ij}$ is the co-occurrence probability of $w_i$ and $w_j$.
• "semantic similarity" is not enforced, but emerges by itself

Pretrained models are publicly available:
• downloads: https://github.com/3Top/word2vec-api#where-to-get-a-pretrained-models
• demo: http://bionlp-www.utu.fi/wv_demo/


Variable size data: text and strings

Given: a text fragment or short sentence W = "w₁ w₂ ... w_k".

Easiest option: average the individual representations

$$\Phi(W) = \frac{1}{k} \sum_{i=1}^{k} \phi(w_i)$$

for a word representation φ.

• a linear function of Φ is the average of linear functions of φ:

  $$w^\top \Phi(W) = w^\top \Big(\frac{1}{k} \sum_i \phi(w_i)\Big) = \frac{1}{k} \sum_i w^\top \phi(w_i)$$

• advantage: very simple
• disadvantage: ignores word order, not really suitable for long texts
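A sketch of the averaging representation Φ(W), assuming a dictionary vec that maps words to NumPy vectors (as in the toy example two slides back); skipping words missing from the dictionary is one possible convention, not prescribed by the slide:

    import numpy as np

    def phi_avg(sentence, vec, d):
        """Average word-vector representation of a whitespace-tokenized sentence."""
        vectors = [vec[w] for w in sentence.split() if w in vec]
        if not vectors:
            return np.zeros(d)                # no known words: return the zero vector (a design choice)
        return np.mean(vectors, axis=0)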


Variable size data: text and strings

Example: X = text documents of arbitrary length

Task-specific encoding, x ↦ φ(x), e.g.,
• create a dictionary of all possible words, w₁, ..., w_L
• represent x by a histogram of word occurrences

  x ↦ (h₁, ..., h_L) ∈ R^L    "bag-of-words" representation

  where hᵢ counts how often word wᵢ occurs in x (absolute or relative)

Include domain knowledge if possible, e.g. stop words
• ignore words known a priori not to be useful for the task at hand:

  a an as at be ... the ... you
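A minimal bag-of-words sketch with a stop-word list; scikit-learn's CountVectorizer offers the same functionality with many more options:

    from collections import Counter

    stop_words = {"a", "an", "as", "at", "be", "the", "you"}

    def bag_of_words(text, dictionary):
        """Histogram of word counts over a fixed dictionary w_1, ..., w_L."""
        counts = Counter(w for w in text.lower().split() if w not in stop_words)
        return [counts[w] for w in dictionary]

    dictionary = ["cat", "dog", "sat", "mat"]
    bag_of_words("The cat sat on the mat", dictionary)    # -> [1, 0, 1, 1]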


Variable size data: text and strings

Given: a set D = {d₁, d₂, ..., d_N} of variable-length documents.

tf-idf: term frequency – inverse document frequency

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)$$

• term frequency tf(t, d): how frequent is term t in document d?

  tf(t, d) = raw count of how often t occurs in d

• inverse document frequency idf(t): in how many documents does the term occur?

  $$\mathrm{idf}(t) = \log \frac{N}{1 + n_t} \qquad \text{for } n_t = |\{d \in D : t \in d\}| \text{ and } N = |D|.$$

Alternatives: boolean or logarithmic tf, constant idf (unweighted), ...
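A sketch of the tf-idf weighting above on a toy corpus; scikit-learn's TfidfVectorizer implements a closely related (but not identical) variant with additional normalization:

    import math

    docs = ["the cat sat on the mat", "the dog sat", "cats and dogs"]    # toy corpus
    tokenized = [d.split() for d in docs]
    N = len(tokenized)

    def tfidf(term, doc_tokens):
        tf = doc_tokens.count(term)                  # raw count of the term in this document
        n_t = sum(term in d for d in tokenized)      # number of documents containing the term
        return tf * math.log(N / (1 + n_t))

    print(tfidf("cat", tokenized[0]))                # tf = 1, idf = log(3/2)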


Variable size data: text and strings

More powerful: count not just terms but short fragments: n-grams

• xᵢ = CTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGG
• count A, C, G, T:        φ₁(xᵢ) = (9, 22, 22, 17) ∈ R^4
• count AA, AC, ..., TT:   φ₂(xᵢ) = (0, 2, 6, 1, 3, ..., 4, 1, 5, 6, 3) ∈ R^16
• count AAA, ..., TTT:     φ₃(xᵢ) = (0, 0, 0, 0, 0, 1, 0, 1, ..., 1, 2, 2) ∈ R^64
• etc.

demo: https://books.google.com/ngrams

data: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
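Counting n-grams of a string could look like the following sketch; the sliding window and the lexicographic ordering of the feature vector are choices made here, not specified on the slide:

    from collections import Counter
    from itertools import product

    def ngram_features(seq, n, alphabet="ACGT"):
        """Count vector over all |alphabet|^n possible n-grams, in lexicographic order."""
        counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
        return [counts["".join(g)] for g in product(alphabet, repeat=n)]

    x = "CTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGG"
    phi1 = ngram_features(x, 1)    # 4-dimensional: counts of A, C, G, T
    phi2 = ngram_features(x, 2)    # 16-dimensional: counts of AA, AC, ..., TT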


Back to learning linear models

$f(x) = w^\top x + b$        $f(x) = w^\top \phi(x) + b$

other regularizers


Other Regularizers

Reminder: ridge regression

$$\min_{w,b} \; \frac{1}{n} \sum_i (w^\top x_i + b - y_i)^2 + \lambda \|w\|^2$$

Instead of $\|w\|^2$ we can use other regularizers to avoid overfitting:

$$\min_{w,b} \; \frac{1}{n} \sum_i (w^\top x_i + b - y_i)^2 + \lambda \|w\|_1 \qquad (*)$$

for $\|w\|_1 = \sum_j |w_j|$ (instead of $\|w\|^2 = \sum_j w_j^2$)

LASSO (least absolute shrinkage and selection operator) procedure
• convex optimization problem, but not differentiable
• no closed-form solution, but efficient numeric solvers, e.g. "FISTA"

Most important difference to ridge regression: "sparsity"
• solutions to (∗) with small $\|w\|_1$ have many 0 entries
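In practice (∗) is usually solved with an existing library; a sketch with scikit-learn on synthetic data (note that scikit-learn expects an n×d data matrix, i.e. the transpose of the slides' convention, calls the regularization constant alpha, and scales the objective slightly differently):

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.RandomState(0)
    X = rng.randn(100, 10)                                   # n = 100 samples, d = 10 features
    y = 3 * X[:, 2] - 2 * X[:, 8] + 0.1 * rng.randn(100)     # only two features actually matter

    lasso = Lasso(alpha=0.1).fit(X, y)
    ridge = Ridge(alpha=0.1).fit(X, y)

    print("nonzero LASSO weights:", np.count_nonzero(lasso.coef_))   # typically few
    print("nonzero ridge weights:", np.count_nonzero(ridge.coef_))   # typically all 10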


Weights learned by Ridge Regression (10 entries of w, then b):

λ = 10⁻²   30.77  -356.56  453.43  172.40  -11.48  -323.84  -136.43  251.37  622.51  -37.57   145.99
λ = 10⁻¹   36.06  -209.73  326.56  146.23  -26.87  -151.14  -152.59  132.29  437.45   19.33   144.21
λ = 10⁰    24.46   -21.02   98.85   57.42    4.21   -15.70   -66.52   55.16  120.99   30.91   138.37
λ = 10¹     4.44    -0.01   13.42    8.59    1.93     0.05   -10.06    8.98   16.00    5.46   134.33

Weights learned by LASSO (10 entries of w, then b):

λ = 10⁻²   22.04  -372.48  474.21  169.83    0.00  -374.14  -109.18  303.41  643.98  -44.13   146.15
λ = 10⁻¹    0.00  -275.46  439.37  107.26    0.00  -172.34  -188.55   25.26  663.43    0.00   145.39
λ = 10⁰     0.00     0.00   82.02    0.00    0.00     0.00     0.00    0.00  330.44    0.00   137.98
λ = 10¹     0.00     0.00    0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00   133.56


Why does ‖w‖₁ lead to sparse w while ‖w‖₂² does not?

$$\sum_i (w^\top x_i - y_i)^2 + \lambda \|w\|_1 \qquad \text{vs.} \qquad \sum_i (w^\top x_i - y_i)^2 + \lambda \|w\|_2^2$$

Intuition: the ℓ₁ ball has corners on the coordinate axes, so the penalized optimum frequently lies exactly on an axis, making some coordinates exactly 0; the smooth ℓ₂ ball shrinks all coordinates but rarely drives any of them exactly to zero.


Data normalization

In real data, different features might have different units/scales: diabetes dataset

x ∈ R^10 (columns AGE ... S6), y = Y

AGE  SEX  BMI   BP   S1   S2     S3  S4  S5      S6     Y
 59   2   32.1  101  157   93.2  38   4  4.8598  87    151
 48   1   21.6   87  183  103.2  70   3  3.8918  69     75
 72   2   30.5   93  156   93.6  41   4  4.6728  85    141
 24   1   25.3   84  198  131.4  40   5  4.8903  89    206

Standard tricks: data normalization

• 1) subtract the data mean
• 2) divide each data dimension by its standard deviation

or
• 1) subtract the data mean
• 2) normalize each xᵢ to have ‖xᵢ‖₂ = 1 → "sphering"

or
• 1) subtract the data mean
• 2) whiten the data, such that cov(X) = Id


Original data
AGE  SEX  BMI   BP   S1   S2     S3  S4  S5      S6     Y
 59   2   32.1  101  157   93.2  38   4  4.8598  87    151
 48   1   21.6   87  183  103.2  70   3  3.8918  69     75
 72   2   30.5   93  156   93.6  41   4  4.6728  85    141
 24   1   25.3   84  198  131.4  40   5  4.8903  89    206

Centering / Variance normalization
 0.801   1.065   1.297   0.460  -0.930  -0.732  -0.912  -0.054   0.419  -0.371   151
-0.040  -0.939  -1.082  -0.554  -0.178  -0.403   1.564  -0.830  -1.437  -1.938    75
 1.793   1.065   0.935  -0.119  -0.959  -0.719  -0.680  -0.054   0.060  -0.545   141
-1.872  -0.939  -0.244  -0.771   0.256   0.525  -0.758   0.721   0.477  -0.197   206

Centering / Sphering
 0.243   0.012   0.132   0.147  -0.744  -0.515  -0.273  -0.002   0.005  -0.099   151
-0.015  -0.014  -0.139  -0.223  -0.179  -0.357   0.590  -0.031  -0.022  -0.649    75
 0.494   0.011   0.087  -0.035  -0.697  -0.459  -0.185  -0.001   0.001  -0.132   141
-0.723  -0.014  -0.032  -0.314   0.261   0.470  -0.289   0.027   0.007  -0.067   206

Whitening
-0.043  -0.008  -0.042  -0.058   0.025   0.025   0.013   0.087  -0.005  -0.042   151
-0.039   0.049  -0.022   0.059   0.039   0.057  -0.024   0.006  -0.060   0.046    75
-0.042  -0.002  -0.042  -0.060   0.087   0.006  -0.007   0.081  -0.031  -0.035   141
-0.051  -0.026   0.054   0.032  -0.077  -0.022  -0.009  -0.017  -0.027   0.042   206

Important: compute the normalizing transforms only on the training part; apply the same transformations before making predictions
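A sketch of variance normalization that follows this rule: the statistics are computed on the training part only, and the resulting transform is applied to both parts (scikit-learn's StandardScaler packages the same fit/transform pattern):

    import numpy as np

    rng = np.random.RandomState(0)
    X_train = rng.randn(300, 10) * 50 + 100       # stand-ins for the real training/testing features
    X_test = rng.randn(100, 10) * 50 + 100

    def fit_normalizer(X):
        """Learn mean and standard deviation on the training part only."""
        mu, sigma = X.mean(axis=0), X.std(axis=0)
        sigma[sigma == 0] = 1.0                   # guard against constant features
        return lambda Z: (Z - mu) / sigma

    normalize = fit_normalizer(X_train)           # statistics come from the training data only
    X_train_n = normalize(X_train)
    X_test_n = normalize(X_test)                  # same transform, no peeking at test statistics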


Summary

General form: regularized least squares regression

$$\min_w \;\; \underbrace{\frac{1}{n} \sum_i (w^\top x_i - y_i)^2}_{\text{loss}} \;+\; \underbrace{\overbrace{\lambda}^{\text{reg. const.}} \, \Omega(w)}_{\text{regularizer}}$$

• loss: makes model fit the training data

• Ω: regularizer, encourages simple models, prevents overfitting

• λ: regularization constant, controls trade-off

Use model selection to
• find a good regularization constant

• choose between different regularizers
