Data Science and Scientific Computation Track Core Course

Christoph Lampert

Spring Semester 2016/17, Segment 1, Lecture 2


Overview

Date         no.  Topic
Feb 27  Mon   1   predictive models, least squares regression, model selection, regularization
Mar  1  Wed   2   real data, non-vectorial data, LASSO regression
Mar  6  Mon   3   missing data, nonlinear regression
Mar  8  Wed   4   robust regression, classification
Mar 13  Mon   5   model evaluation, large-scale model learning
Mar 15  Wed   6   project Q&A
Mar 20  Mon   7   project Q&A
Mar 22  Wed   8   project presentations


Refresher


Refresher – Predictive Models

Regression

• Data: anything (number, vector, image, natural text, ...)
• Predicted quantity: a real number, e.g. 5.3

Classification

• Data: anything
• Predicted quantity: a discrete decision, e.g. "yes"

More complex/structured prediction tasks

• Data: anything
• Predicted quantity: complex objects, e.g. a natural language sentence, a segmentation mask of an image, ...


Refresher – Linear Regression

Given: $(x_1, y_1), \dots, (x_n, y_n)$ with $x_i = (x_i^1, \dots, x_i^d) \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$.

In matrix notation: $X = (x_1 \,|\, x_2 \,|\, \dots \,|\, x_n) \in \mathbb{R}^{d \times n}$, $Y \in \mathbb{R}^n$, $w \in \mathbb{R}^d$.

Least squares regression

$f(x) = w^\top x + b$  for  $w = (X X^\top)^{-1} X Y$,  $b = \dots$

Ridge regression
Make it more robust by adding regularization:

$f(x) = w^\top x + b$  for  $w = (X X^\top + \lambda\, \mathrm{Id})^{-1} X Y$,  $b = \dots$

where $\lambda$ is the regularization strength.
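A minimal NumPy sketch of the two closed-form solutions above, using the d×n data layout of this slide; recovering b by centering the data is one common convention for the omitted "b = ...", not necessarily the one used in the course:

    import numpy as np

    def ridge_fit(X, Y, lam=1.0):
        """Closed-form ridge regression; X has shape (d, n), Y has shape (n,)."""
        d, n = X.shape
        x_mean = X.mean(axis=1, keepdims=True)
        y_mean = Y.mean()
        Xc, Yc = X - x_mean, Y - y_mean          # center the data; the bias is handled separately
        w = np.linalg.solve(Xc @ Xc.T + lam * np.eye(d), Xc @ Yc)
        b = y_mean - w @ x_mean.ravel()          # one common way to recover the bias term
        return w, b

    # ordinary least squares corresponds to lam = 0 (provided the matrix is invertible)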


Refresher – Evaluating Models

Use different data to evaluate a model than what you used for training it!

• split the available data into a part for training and a part for evaluating its performance (testing)
• put the testing part away and don't look at it
• use the training data for all steps to construct one final model
  ◦ training itself
  ◦ choice of model class (linear? bias term or not? nonlinear?)
  ◦ choice of regularization strength

• evaluate prediction quality of final model using testing data


Refresher – Model Selection

How to choose models/parameters without looking at the test data?
Simulate the model evaluation step during the training procedure:

K-fold Cross Validation (typically K = 5 or K = 10)

  input: algorithm A, loss function ℓ, data D (trainval part)
  split D = D₁ ∪ ... ∪ D_K into K equal-sized disjoint parts
  for k = 1, ..., K do
      f¬k ← A[D \ D_k]
      r_k ← performance of f¬k on D_k
  end for
  output: R_{K-CV} = (1/K) ∑_{k=1}^{K} r_k   (K-fold cross-validation risk)

"Model selection":
• compute R_{K-CV} for all possible settings/parameter values
• fix the parameters/model settings that lead to the best R_{K-CV} value
• obtain the final model by retraining on all available training data

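A NumPy sketch of the K-fold cross-validation loop above; train and evaluate are placeholders for the learning algorithm A and the performance measure, which the pseudocode leaves abstract:

    import numpy as np

    def kfold_cv(X, Y, train, evaluate, K=5, seed=0):
        """X: (d, n) data, Y: (n,) targets; returns the K-fold cross-validation risk."""
        n = Y.shape[0]
        idx = np.random.RandomState(seed).permutation(n)
        folds = np.array_split(idx, K)                    # K roughly equal-sized disjoint parts
        risks = []
        for k in range(K):
            train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
            f = train(X[:, train_idx], Y[train_idx])      # f_{¬k} = A[D \ D_k]
            risks.append(evaluate(f, X[:, folds[k]], Y[folds[k]]))
        return np.mean(risks)                             # R_{K-CV}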

Real data


Example: Particulate matter (PM)

[Figure: chart of typical particle sizes (axis from 0.0001 to 1000 μm) for types of dust and for particulate, biological, and gaseous contaminants: gas molecules, viruses, tobacco smoke, oil smoke, soot, smog, bacteria, fly ash, cement dust, mold spores, house dust mite allergens, cat allergens, pollen, settling dust, suspended atmospheric dust, heavy dust]

https://en.wikipedia.org/wiki/Particulates


Example: Particulate matter (PM)

Inovafitness "Nova PM Sensor SDS011"

Raspberry Pi

Image: Charlie Kuehnast, http://kuehnast.com/s9y/categories/8-Linux-und-Raspberry-Pi
Image: Raspberry-Pi Foundation, https://www.raspberrypi.org/


Example: Particulate matter (PM)

https://data.sparkfun.com/streams/RMAjV37OvKi6Mb6gLzQX


Real data comes in different formats, e.g.

TXT (text):
• (numeric) entries as a flat matrix
• one data item per row, columns separated by spaces or tabs
• supported by standard software, e.g. Microsoft Excel
• usually human readable
• easy to parse automatically
• useful only for fixed-length objects, e.g. vectors
• no meta-data, such as column names

  15.4 14.0 2017-02-11T12:44:06.643Z
  15.6 14.2 2017-02-11T12:43:06.426Z
  16.3 14.7 2017-02-11T12:42:10.047Z
  16.5 14.8 2017-02-11T12:41:09.862Z
  16.1 14.6 2017-02-11T12:40:08.458Z
  16.0 14.5 2017-02-11T12:39:07.651Z
  ...


Real data comes in different formats, e.g.

CSV (comma-separated values):
• flat matrix/table of numbers or text
• one data item per row, columns usually separated by commas
• supported by standard software, e.g. Microsoft Excel
• human readable (more or less)
• possible to parse automatically (but beware of pitfalls, such as strings containing commas)
• useful mainly for fixed-length objects, e.g. vectors
• meta-data (column names) in the header (first row)

  pm10,pm25,timestamp
  15.4,14.0,2017-02-11T12:44:06.643Z
  15.6,14.2,2017-02-11T12:43:06.426Z
  16.3,14.7,2017-02-11T12:42:10.047Z
  16.5,14.8,2017-02-11T12:41:09.862Z
  ...
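A table like this can be read with pandas instead of a hand-written parser; a sketch (the file name is hypothetical, the column names pm10, pm25, timestamp come from the header row above):

    import pandas as pd

    # parse_dates turns the ISO timestamps into proper datetime objects
    df = pd.read_csv("finedust.csv", parse_dates=["timestamp"])
    print(df[["pm10", "pm25"]].describe())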


Real data comes in different formats, e.g.

JSON (JavaScript object notation)
• hierarchical, elements can have sub-entries
• supported by most modern programming languages
• not easily readable for humans
• not advised to parse manually, better to use a library
• useful for objects of variable length or complexity

  [{"pm10":"15.4","pm25":"14.0","timestamp":"2017-02-11T12:44:06.643Z"},
   {"pm10":"15.6","pm25":"14.2","timestamp":"2017-02-11T12:43:06.426Z"},
   {"pm10":"16.3","pm25":"14.7","timestamp":"2017-02-11T12:42:10.047Z"},
   {"pm10":"16.5","pm25":"14.8","timestamp":"2017- ...
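Such a file is best read with a JSON library rather than by hand; a sketch with Python's standard json module (the file name is hypothetical):

    import json

    with open("finedust.json") as f:
        records = json.load(f)                    # a list of dicts, one per measurement

    pm10 = [float(r["pm10"]) for r in records]    # the values are stored as strings here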

Many others, often proprietary:
• MAT (Matlab)
• XLS (Microsoft Excel)
• SQL (SQLite database)


Real data rarely comes as the perfect vectors the theory demands:

Data can be non-numeric
• (today)

Values can be missing
• (later in the course)

Values can be wrong/broken/misleading
• for scientific reasons → outliers
• for non-scientific reasons → measurement errors

notebook demo


Beyond Vectors


Linear models, such as

$f(x) = w^\top x + b$

only make sense if
• the data are vectors, $x \in \mathbb{R}^d$, of the same dimension
• all entries of the vectors are known

Real data

• can be categorical

• can be of variable size

• can have missing entries


Categorical data

X = {red, green, blue}

Typically not so hard to handle: introduce indicator variables in R^|X|, called "one-hot encoding"

• red ↦ (1, 0, 0)    green ↦ (0, 1, 0)    blue ↦ (0, 0, 1)

Don't use: red ↦ 1, green ↦ 2, blue ↦ 3

That would introduce spurious relations, such as

green + red = blue ?!?

One-hot works well even for large X, e.g. all English words, when using the right data structures (e.g. sparse vectors/matrices)
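A minimal sketch of such an encoder in NumPy; for real data, scikit-learn's OneHotEncoder or pandas.get_dummies provide the same mapping, including sparse output for large X:

    import numpy as np

    categories = ["red", "green", "blue"]
    index = {c: i for i, c in enumerate(categories)}

    def one_hot(value):
        v = np.zeros(len(categories))
        v[index[value]] = 1.0
        return v

    one_hot("green")    # -> array([0., 1., 0.])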


Ordinal data

X = {poor, fair, good, very good, excellent}

The best treatment depends on the situation

• working with distances?

  poor ↦ 1    fair ↦ 2    ...    excellent ↦ 5

  might work well.

• in other situations, one-hot might work better.

• if the values derive from a continuous quantity by quantization
  ◦ ≤ 60%: poor    61–70%: fair    ...    ≥ 91%: excellent

  it might make sense to reflect those values:

  poor ↦ 0.55    fair ↦ 0.65    ...    excellent ↦ 0.95
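The three options above, written as simple lookup tables; the 0.75 and 0.85 midpoints are interpolated for illustration only, the slide specifies just the values for poor, fair, and excellent:

    levels = ["poor", "fair", "good", "very good", "excellent"]

    # option 1: ranks, preserves the ordering
    rank = {lvl: i + 1 for i, lvl in enumerate(levels)}        # poor -> 1, ..., excellent -> 5

    # option 2: one-hot, ignores the ordering
    one_hot = {lvl: [1.0 if j == i else 0.0 for j in range(len(levels))]
               for i, lvl in enumerate(levels)}

    # option 3: map back onto the underlying continuous scale
    # (0.75 and 0.85 are assumed/interpolated values, not taken from the slide)
    midpoint = {"poor": 0.55, "fair": 0.65, "good": 0.75,
                "very good": 0.85, "excellent": 0.95}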


Language data

Example: X = words; task-specific encoding: "word vectors"

• represent each word w by a vector φ(w) ∈ R^d (e.g. 25 ≤ d ≤ 300)
• similar vectors encode words of similar meaning (more or less)

  tiger  -0.70  -0.34   0.44  -0.38  -0.55   0.29   0.79   0.01   0.56  ...
  lion   -0.89  -0.56  -0.37   0.76  -0.78   0.56   0.80  -0.05   0.80  ...
  pion   -0.53  -0.62  -0.13   0.55  -0.55  -0.43  -1.12  -0.39   0.67  ...
  quark  -0.53  -0.55   0.17  -0.67  -0.51  -0.32  -0.90  -1.41   0.74  ...

• φ(tiger) ≈ φ(lion), φ(pion) ≉ φ(lion), etc.

Euclidean distances, ‖φ(wᵢ) − φ(wⱼ)‖:

          tiger  lion  pion  quark
  tiger    0     2.6   4.6   4.0
  lion     2.6   0     4.3   4.6
  pion     4.6   4.3   0     2.8
  quark    4.0   4.6   2.8   0
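A sketch of how such a distance table can be computed from a word-to-vector dictionary; the vectors here are toy values (only the first three components of the table above), real ones would come from a pretrained model like those on the next slide:

    import numpy as np

    vec = {                                   # toy 3-dimensional "word vectors", for illustration only
        "tiger": np.array([-0.70, -0.34,  0.44]),
        "lion":  np.array([-0.89, -0.56, -0.37]),
        "pion":  np.array([-0.53, -0.62, -0.13]),
        "quark": np.array([-0.53, -0.55,  0.17]),
    }

    words = list(vec)
    dist = {(a, b): np.linalg.norm(vec[a] - vec[b]) for a in words for b in words}
    print(dist[("tiger", "lion")], dist[("tiger", "pion")])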


Language data

Vectors are learned automatically from large corpora (e.g. Wikipedia):

For example, GloVe: [Pennington et al., "GloVe: Global Vectors for Word Representation", EMNLP 2014]

• Each unique word $w_i$ has an (unknown) vector $v_i$.
• Treat $v_1, v_2, \dots$ as parameters of a large regression problem

  $$\min_{v_1, v_2, \dots} \sum_{i,j} \big\| v_i^\top v_j - \log p_{ij} \big\|^2$$

  where $p_{ij}$ is the co-occurrence probability of $w_i$ and $w_j$.
• "semantic similarity" is not enforced, but emerges by itself

Pretrained models are publicly available:
• downloads: https://github.com/3Top/word2vec-api#where-to-get-a-pretrained-models
• demo: http://bionlp-www.utu.fi/wv_demo/


Variable size data: text and strings

Given: a text fragment or short sentence W = "w₁ w₂ ... w_k".

Easiest option: average the individual representations

$$\Phi(W) = \frac{1}{k} \sum_{i=1}^{k} \phi(w_i)$$

for a word representation φ.

• a linear function of Φ is the average of linear functions of φ:

  $$w^\top \Phi(W) = w^\top \Big(\frac{1}{k} \sum_i \phi(w_i)\Big) = \frac{1}{k} \sum_i w^\top \phi(w_i)$$

• advantage: very simple
• disadvantage: ignores word order, not really suitable for long texts
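A sketch of the averaging representation Φ(W), assuming a dictionary vec that maps words to NumPy vectors (as in the toy example two slides back); skipping words missing from the dictionary is one possible convention, not prescribed by the slide:

    import numpy as np

    def phi_avg(sentence, vec, d):
        """Average word-vector representation of a whitespace-tokenized sentence."""
        vectors = [vec[w] for w in sentence.split() if w in vec]
        if not vectors:
            return np.zeros(d)                # no known words: return the zero vector (a design choice)
        return np.mean(vectors, axis=0)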


Variable size data: text and strings

Example: X = text documents of arbitrary length

Task-specific encoding, x ↦ φ(x), e.g.,
• create a dictionary of all possible words, w₁, ..., w_L
• represent x by a histogram of word occurrences

  x ↦ (h₁, ..., h_L) ∈ R^L    "bag-of-words" representation

  where hᵢ counts how often word wᵢ occurs in x (absolute or relative)

Include domain knowledge if possible, e.g. stop words
• ignore words known a priori not to be useful for the task at hand:

  a an as at be ... the ... you
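A minimal bag-of-words sketch with a stop-word list; scikit-learn's CountVectorizer offers the same functionality with many more options:

    from collections import Counter

    stop_words = {"a", "an", "as", "at", "be", "the", "you"}

    def bag_of_words(text, dictionary):
        """Histogram of word counts over a fixed dictionary w_1, ..., w_L."""
        counts = Counter(w for w in text.lower().split() if w not in stop_words)
        return [counts[w] for w in dictionary]

    dictionary = ["cat", "dog", "sat", "mat"]
    bag_of_words("The cat sat on the mat", dictionary)    # -> [1, 0, 1, 1]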


Variable size data: text and strings

Given: a set D = {d₁, d₂, ..., d_N} of variable-length documents.

tf-idf: term frequency – inverse document frequency

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)$$

• term frequency tf(t, d): how frequent is term t in document d?

  tf(t, d) = raw count of how often t occurs in d

• inverse document frequency idf(t): in how many documents does the term occur?

  $$\mathrm{idf}(t) = \log \frac{N}{1 + n_t} \qquad \text{for } n_t = |\{d \in D : t \in d\}| \text{ and } N = |D|.$$

Alternatives: boolean or logarithmic tf, constant idf (unweighted), ...
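A sketch of the tf-idf weighting above on a toy corpus; scikit-learn's TfidfVectorizer implements a closely related (but not identical) variant with additional normalization:

    import math

    docs = ["the cat sat on the mat", "the dog sat", "cats and dogs"]    # toy corpus
    tokenized = [d.split() for d in docs]
    N = len(tokenized)

    def tfidf(term, doc_tokens):
        tf = doc_tokens.count(term)                  # raw count of the term in this document
        n_t = sum(term in d for d in tokenized)      # number of documents containing the term
        return tf * math.log(N / (1 + n_t))

    print(tfidf("cat", tokenized[0]))                # tf = 1, idf = log(3/2)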


Variable size data: text and strings

More powerful: count not just terms but short fragments: n-grams

• xᵢ = CTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGG
• count A, C, G, T:        φ₁(xᵢ) = (9, 22, 22, 17) ∈ R^4
• count AA, AC, ..., TT:   φ₂(xᵢ) = (0, 2, 6, 1, 3, ..., 4, 1, 5, 6, 3) ∈ R^16
• count AAA, ..., TTT:     φ₃(xᵢ) = (0, 0, 0, 0, 0, 1, 0, 1, ..., 1, 2, 2) ∈ R^64
• etc.

demo: https://books.google.com/ngrams

data: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
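Counting n-grams of a string could look like the following sketch; the sliding window and the lexicographic ordering of the feature vector are choices made here, not specified on the slide:

    from collections import Counter
    from itertools import product

    def ngram_features(seq, n, alphabet="ACGT"):
        """Count vector over all |alphabet|^n possible n-grams, in lexicographic order."""
        counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
        return [counts["".join(g)] for g in product(alphabet, repeat=n)]

    x = "CTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGG"
    phi1 = ngram_features(x, 1)    # 4-dimensional: counts of A, C, G, T
    phi2 = ngram_features(x, 2)    # 16-dimensional: counts of AA, AC, ..., TT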


Back to learning linear models

$f(x) = w^\top x + b$        $f(x) = w^\top \phi(x) + b$

other regularizers


Other Regularizers

Reminder: ridge regression

$$\min_{w,b} \; \frac{1}{n} \sum_i (w^\top x_i + b - y_i)^2 + \lambda \|w\|^2$$

Instead of $\|w\|^2$ we can use other regularizers to avoid overfitting:

$$\min_{w,b} \; \frac{1}{n} \sum_i (w^\top x_i + b - y_i)^2 + \lambda \|w\|_1 \qquad (*)$$

for $\|w\|_1 = \sum_j |w_j|$ (instead of $\|w\|^2 = \sum_j w_j^2$)

LASSO (least absolute shrinkage and selection operator) procedure
• convex optimization problem, but not differentiable
• no closed-form solution, but efficient numeric solvers, e.g. "FISTA"

Most important difference to ridge regression: "sparsity"
• solutions to (∗) with small $\|w\|_1$ have many 0 entries
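In practice (∗) is usually solved with an existing library; a sketch with scikit-learn on synthetic data (note that scikit-learn expects an n×d data matrix, i.e. the transpose of the slides' convention, calls the regularization constant alpha, and scales the objective slightly differently):

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.RandomState(0)
    X = rng.randn(100, 10)                                   # n = 100 samples, d = 10 features
    y = 3 * X[:, 2] - 2 * X[:, 8] + 0.1 * rng.randn(100)     # only two features actually matter

    lasso = Lasso(alpha=0.1).fit(X, y)
    ridge = Ridge(alpha=0.1).fit(X, y)

    print("nonzero LASSO weights:", np.count_nonzero(lasso.coef_))   # typically few
    print("nonzero ridge weights:", np.count_nonzero(ridge.coef_))   # typically all 10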


Weights learned by Ridge Regression (10 entries of w, then b):

λ = 10⁻²   30.77  -356.56  453.43  172.40  -11.48  -323.84  -136.43  251.37  622.51  -37.57   145.99
λ = 10⁻¹   36.06  -209.73  326.56  146.23  -26.87  -151.14  -152.59  132.29  437.45   19.33   144.21
λ = 10⁰    24.46   -21.02   98.85   57.42    4.21   -15.70   -66.52   55.16  120.99   30.91   138.37
λ = 10¹     4.44    -0.01   13.42    8.59    1.93     0.05   -10.06    8.98   16.00    5.46   134.33

Weights learned by LASSO (10 entries of w, then b):

λ = 10⁻²   22.04  -372.48  474.21  169.83    0.00  -374.14  -109.18  303.41  643.98  -44.13   146.15
λ = 10⁻¹    0.00  -275.46  439.37  107.26    0.00  -172.34  -188.55   25.26  663.43    0.00   145.39
λ = 10⁰     0.00     0.00   82.02    0.00    0.00     0.00     0.00    0.00  330.44    0.00   137.98
λ = 10¹     0.00     0.00    0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00   133.56


Why does ‖w‖₁ lead to sparse w while ‖w‖₂² does not?

$$\sum_i (w^\top x_i - y_i)^2 + \lambda \|w\|_1 \qquad \text{vs.} \qquad \sum_i (w^\top x_i - y_i)^2 + \lambda \|w\|_2^2$$

Intuition: the ℓ₁ ball has corners on the coordinate axes, so the penalized optimum frequently lies exactly on an axis, making some coordinates exactly 0; the smooth ℓ₂ ball shrinks all coordinates but rarely drives any of them exactly to zero.


Data normalization

In real data, different features might have different units/scales: diabetes dataset

x ∈ R^10 (columns AGE ... S6), y = Y

AGE  SEX  BMI   BP   S1   S2     S3  S4  S5      S6     Y
 59   2   32.1  101  157   93.2  38   4  4.8598  87    151
 48   1   21.6   87  183  103.2  70   3  3.8918  69     75
 72   2   30.5   93  156   93.6  41   4  4.6728  85    141
 24   1   25.3   84  198  131.4  40   5  4.8903  89    206

Standard tricks: data normalization

• 1) subtract the data mean
• 2) divide each data dimension by its standard deviation

or
• 1) subtract the data mean
• 2) normalize each xᵢ to have ‖xᵢ‖₂ = 1 → "sphering"

or
• 1) subtract the data mean
• 2) whiten the data, such that cov(X) = Id


Original data
AGE  SEX  BMI   BP   S1   S2     S3  S4  S5      S6     Y
 59   2   32.1  101  157   93.2  38   4  4.8598  87    151
 48   1   21.6   87  183  103.2  70   3  3.8918  69     75
 72   2   30.5   93  156   93.6  41   4  4.6728  85    141
 24   1   25.3   84  198  131.4  40   5  4.8903  89    206

Centering / Variance normalization
 0.801   1.065   1.297   0.460  -0.930  -0.732  -0.912  -0.054   0.419  -0.371   151
-0.040  -0.939  -1.082  -0.554  -0.178  -0.403   1.564  -0.830  -1.437  -1.938    75
 1.793   1.065   0.935  -0.119  -0.959  -0.719  -0.680  -0.054   0.060  -0.545   141
-1.872  -0.939  -0.244  -0.771   0.256   0.525  -0.758   0.721   0.477  -0.197   206

Centering / Sphering
 0.243   0.012   0.132   0.147  -0.744  -0.515  -0.273  -0.002   0.005  -0.099   151
-0.015  -0.014  -0.139  -0.223  -0.179  -0.357   0.590  -0.031  -0.022  -0.649    75
 0.494   0.011   0.087  -0.035  -0.697  -0.459  -0.185  -0.001   0.001  -0.132   141
-0.723  -0.014  -0.032  -0.314   0.261   0.470  -0.289   0.027   0.007  -0.067   206

Whitening
-0.043  -0.008  -0.042  -0.058   0.025   0.025   0.013   0.087  -0.005  -0.042   151
-0.039   0.049  -0.022   0.059   0.039   0.057  -0.024   0.006  -0.060   0.046    75
-0.042  -0.002  -0.042  -0.060   0.087   0.006  -0.007   0.081  -0.031  -0.035   141
-0.051  -0.026   0.054   0.032  -0.077  -0.022  -0.009  -0.017  -0.027   0.042   206

Important: compute the normalizing transforms only on the training part; apply the same transformations before making predictions
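A sketch of variance normalization that follows this rule: the statistics are computed on the training part only, and the resulting transform is applied to both parts (scikit-learn's StandardScaler packages the same fit/transform pattern):

    import numpy as np

    rng = np.random.RandomState(0)
    X_train = rng.randn(300, 10) * 50 + 100       # stand-ins for the real training/testing features
    X_test = rng.randn(100, 10) * 50 + 100

    def fit_normalizer(X):
        """Learn mean and standard deviation on the training part only."""
        mu, sigma = X.mean(axis=0), X.std(axis=0)
        sigma[sigma == 0] = 1.0                   # guard against constant features
        return lambda Z: (Z - mu) / sigma

    normalize = fit_normalizer(X_train)           # statistics come from the training data only
    X_train_n = normalize(X_train)
    X_test_n = normalize(X_test)                  # same transform, no peeking at test statistics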


Summary

General form: regularized least squares regression

$$\min_w \;\; \underbrace{\frac{1}{n} \sum_i (w^\top x_i - y_i)^2}_{\text{loss}} \;+\; \underbrace{\overbrace{\lambda}^{\text{reg. const.}} \, \Omega(w)}_{\text{regularizer}}$$

• loss: makes model fit the training data

• Ω: regularizer, encourages simple models, prevents overfitting

• λ: regularization constant, controls trade-off

Use model selection to
• find a good regularization constant

• choose between different regularizers
