Data Science and Scientific Computation Track, Core Course
Christoph Lampert
Spring Semester 2016/17, Segment 1, Lecture 2
1 / 32
Overview
Date        no.  Topic
Feb 27 Mon   1   predictive models, least squares regression, model selection, regularization
Mar 1  Wed   2   real data, non-vectorial data, LASSO regression
Mar 6  Mon   3   missing data, nonlinear regression
Mar 8  Wed   4   robust regression, classification
Mar 13 Mon   5   model evaluation, large-scale model learning
Mar 15 Wed   6   project Q&A
Mar 20 Mon   7   project Q&A
Mar 22 Wed   8   project presentations
2 / 32
Refresher
3 / 32
Refresher – Predictive Models
Regression
• Data: anything (number, vector, image, natural text, …)
• Predicted quantity: a real number, e.g. 5.3
Classification
• Data: anything
• Predicted quantity: a discrete decision, e.g. "yes"
More complex/structured prediction tasks
• Data: anything
• Predicted quantity: complex objects, e.g. a natural language sentence, a segmentation mask of an image, …
4 / 32
Refresher – Linear Regression
Given: (x_1, y_1), …, (x_n, y_n) with x_i = (x_i^1, …, x_i^d) ∈ R^d and y_i ∈ R.

In matrix notation: X = ( x_1 | x_2 | … | x_n ) ∈ R^{d×n}, Y ∈ R^n, w ∈ R^d.

Least squares regression

    f(x) = w^T x + b   for   w = (X X^T)^{-1} X Y,   b = …

Ridge regression
Make the solution more robust by adding regularization:

    f(x) = w^T x + b   for   w = (X X^T + λ Id)^{-1} X Y,   b = …

where λ is the regularization strength.
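The closed-form solutions above can be sketched in NumPy. This is a minimal sketch following the slides' convention X ∈ R^{d×n} (one column per sample); the slides elide the bias b, so obtaining it by centering the data is our assumption, as are the function names.

```python
import numpy as np

def ridge_fit(X, Y, lam=0.0):
    """Closed-form ridge regression; lam=0 gives plain least squares.

    X: (d, n) data matrix with one column per sample, Y: (n,) targets.
    The bias b is obtained by centering first (an assumption; the
    slides leave b unspecified)."""
    d, n = X.shape
    x_mean = X.mean(axis=1, keepdims=True)
    y_mean = Y.mean()
    Xc = X - x_mean                 # centered data
    Yc = Y - y_mean
    # w = (X X^T + lam * Id)^(-1) X Y, computed via a linear solve
    w = np.linalg.solve(Xc @ Xc.T + lam * np.eye(d), Xc @ Yc)
    b = y_mean - w @ x_mean.ravel()
    return w, b

def ridge_predict(w, b, X):
    return w @ X + b                # f(x) = w^T x + b, applied column-wise
```

With lam=0 and noise-free linear data, the exact coefficients are recovered; lam > 0 shrinks w toward zero.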
5 / 32
Refresher – Evaluating Models
Use different data to evaluate a model than what you used for training it!

• split the available data into a part for training and a part for evaluating its performance (testing)
• put the testing part away and don't look at it
• use the training data for all steps to construct one final model
  - training itself
  - choice of model class (linear? bias term or not? nonlinear?)
  - choice of regularization strength
• evaluate prediction quality of final model using testing data
6 / 32
Refresher – Model Selection
How to choose models/parameters without looking at the test data?
Simulate the model evaluation step during the training procedure:
K-fold Cross Validation (typically K = 5 or K = 10)
input: algorithm A, loss function ℓ, data D (trainval part)
split D = D_1 ∪ D_2 ∪ … ∪ D_K into K equal-sized disjoint parts
for k = 1, …, K do
    f_¬k ← A[ D \ D_k ]
    r_k ← performance of f_¬k on D_k
end for
output: R_{K-CV} = (1/K) Σ_{k=1}^K r_k    (K-fold cross-validation risk)
"Model selection":• compute RK-CV for for all possible settings/parameter values• fix parameters/model settings that lead to best RK-CV value• obtain final model by retraining on all available training data
7 / 32
Real data
8 / 32
Example: Particulate matter (PM)
[Figure: particle sizes of airborne contaminants on a logarithmic scale from 0.0001 to 1000 μm, grouped into gas molecules / gaseous contaminants, particulate contaminants, types of dust, and biological contaminants. Examples shown: viruses, tobacco smoke, soot, smog, oil smoke, fly ash, cement dust, bacteria, mold spores, pollen, house dust mite allergens, cat allergens, settling dust, suspended atmospheric dust, heavy dust.]
https://en.wikipedia.org/wiki/Particulates
9 / 32
Example: Particulate matter (PM)
Inovafitness "Nova PM Sensor SDS011"
Raspberry Pi
Image: Charlie Kuehnast: http://kuehnast.com/s9y/categories/8-Linux-und-Raspberry-Pi
Image: Raspberry Pi Foundation, https://www.raspberrypi.org/
10 / 32
Example: Particulate matter (PM)
https://data.sparkfun.com/streams/RMAjV37OvKi6Mb6gLzQX
11 / 32
Real data comes in different formats, e.g.
TXT (text):
• (numeric) entries as flat matrix
• one data item per row, columns separated by spaces or tabs
• supported by standard software, e.g. Microsoft Excel
• usually human readable
• easy to parse automatically
• useful only for fixed-length objects, e.g. vectors
• no meta-data, such as column names
15.4 14.0 2017-02-11T12:44:06.643Z
15.6 14.2 2017-02-11T12:43:06.426Z
16.3 14.7 2017-02-11T12:42:10.047Z
16.5 14.8 2017-02-11T12:41:09.862Z
16.1 14.6 2017-02-11T12:40:08.458Z
16.0 14.5 2017-02-11T12:39:07.651Z
...
12 / 32
Real data comes in different formats, e.g.
CSV (comma-separated values):
• flat matrix/table of numbers or text
• one data item per row, columns usually separated by commas
• supported by standard software, e.g. Microsoft Excel
• human readable (more or less)
• possible to parse automatically (but beware of pitfalls, such as strings containing commas)
• useful mainly for fixed-length objects, e.g. vectors
• meta-data (column names) in header (first row)
pm10,pm25,timestamp
15.4,14.0,2017-02-11T12:44:06.643Z
15.6,14.2,2017-02-11T12:43:06.426Z
16.3,14.7,2017-02-11T12:42:10.047Z
16.5,14.8,2017-02-11T12:41:09.862Z
...
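Using the standard `csv` module avoids the quoting pitfalls mentioned above (e.g. commas inside quoted strings). A minimal sketch on a snippet shaped like the slide's example:

```python
import csv
import io

# Inline snippet shaped like the slide's stream (values abridged).
raw = """pm10,pm25,timestamp
15.4,14.0,2017-02-11T12:44:06.643Z
15.6,14.2,2017-02-11T12:43:06.426Z
"""

rows = list(csv.DictReader(io.StringIO(raw)))   # header row becomes the dict keys
pm10 = [float(r["pm10"]) for r in rows]         # convert columns explicitly
print(pm10)                                     # [15.4, 15.6]
```

In practice one would pass an open file instead of `io.StringIO`; `DictReader` also handles quoted fields containing commas correctly.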
13 / 32
Real data comes in different formats, e.g.
JSON (JavaScript object notation):
• hierarchical, elements can have sub-entries
• supported by most modern programming languages
• not well readable for humans
• not advised to parse manually, better use a library
• useful for objects of variable length or complexity
[{"pm10":"15.4","pm25":"14.0","timestamp":"2017-02-11T12:44:06.643Z"},
 {"pm10":"15.6","pm25":"14.2","timestamp":"2017-02-11T12:43:06.426Z"},
 {"pm10":"16.3","pm25":"14.7","timestamp":"2017-02-11T12:42:10.047Z"},
 {"pm10":"16.5","pm25":"14.8","timestamp":"2017- ...
Many others, often proprietary:
• MAT (Matlab)
• XLS (Microsoft Excel)
• SQL (SQLite database)
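As the slide advises, use a library rather than parsing JSON manually. A minimal sketch with the standard `json` module, on a snippet shaped like the slide's stream (note that the values arrive as strings and need explicit conversion):

```python
import json

# Inline snippet shaped like the slide's stream (abridged).
raw = '''[{"pm10":"15.4","pm25":"14.0","timestamp":"2017-02-11T12:44:06.643Z"},
          {"pm10":"15.6","pm25":"14.2","timestamp":"2017-02-11T12:43:06.426Z"}]'''

records = json.loads(raw)                       # a list of dicts
pm25 = [float(r["pm25"]) for r in records]      # string values -> floats
print(pm25)                                     # [14.0, 14.2]
```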
14 / 32
Real data rarely comes as the perfect vectors the theory demands:
Data can be non-numeric• (today)
Values can be missing• (later in the course)
Values can be wrong/broken/misleading
• for scientific reasons → outliers
• for non-scientific reasons → measurement errors
notebook demo
15 / 32
Beyond Vectors
16 / 32
Linear models, such as
f(x) = w>x+ b
only make sense if
• data are vectors, x ∈ R^d, of the same dimension
• all entries of the vectors are known
Real data
• can be categorical
• can be of variable size
• can have missing entries
17 / 32
Categorical data
X = {red, green, blue}

Typically not so hard to handle: introduce indicator variables in R^{|X|}, called "one-hot encoding":

• red ↦ (1, 0, 0)    green ↦ (0, 1, 0)    blue ↦ (0, 0, 1)
Don’t use: red ↦ 1    green ↦ 2    blue ↦ 3
That would introduce spurious relations, such as
green + red = blue ?!?
One-hot works well even for large X , e.g. all English words, when usingthe right data structures (e.g. sparse vectors/matrices)
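A one-hot encoding needs only a few lines; this is a minimal sketch (the function name is ours). For large X, one would store only the index of the single nonzero entry, which is the sparse-vector idea mentioned above.

```python
def one_hot(value, categories):
    """Indicator vector in R^{|X|}: 1 at the category's position, 0 elsewhere."""
    vec = [0.0] * len(categories)
    vec[categories.index(value)] = 1.0
    return vec

categories = ["red", "green", "blue"]
encoded = one_hot("green", categories)   # (0, 1, 0) as a list
```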
18 / 32
Ordinal data
X = {poor, fair, good, very good, excellent}
Best treatment depends on the situation
• working with distances?
poor ↦ 1    fair ↦ 2    …    excellent ↦ 5
might work well.
• in other situations, one-hot might work better.
• if the values derive from a continuous quantity by quantization,
  e.g. ≤ 60%: poor    61–70%: fair    …    ≥ 91%: excellent,

  it might make sense to reflect those intervals:

  poor ↦ 0.55    fair ↦ 0.65    …    excellent ↦ 0.95
19 / 32
Language data
Example: X = {words}, task-specific encoding: "word vectors"
• represent each word w by a vector φ(w) ∈ R^d (e.g. 25 ≤ d ≤ 300)
• similar vectors encode words of similar meaning (more or less)

tiger  -0.70 -0.34  0.44 -0.38 -0.55  0.29  0.79  0.01  0.56 …
lion   -0.89 -0.56 -0.37  0.76 -0.78  0.56  0.80 -0.05  0.80 …
pion   -0.53 -0.62 -0.13  0.55 -0.55 -0.43 -1.12 -0.39  0.67 …
quark  -0.53 -0.55  0.17 -0.67 -0.51 -0.32 -0.90 -1.41  0.74 …
• φ(tiger) ≈ φ(lion),    φ(pion) ≉ φ(lion),    etc.
Euclidean distances, ‖φ(w_i) − φ(w_j)‖:

        tiger  lion  pion  quark
tiger     0    2.6   4.6   4.0
lion     2.6    0    4.3   4.6
pion     4.6   4.3    0    2.8
quark    4.0   4.6   2.8    0
20 / 32
Language data
Vectors are learned automatically from large corpora (e.g. Wikipedia):
For example, GloVe: [Pennington et al . "GloVe: Global Vectors for Word Representation". ACL 2014]
• Each unique word w_i has an (unknown) vector v_i.
• Treat v_1, v_2, … as the parameters of a large regression problem:

    min_{v_1, v_2, …}  Σ_{i,j} ( v_i^T v_j − log p_ij )²

  where p_ij is the co-occurrence probability of w_i and w_j.
• "semantic similarity" is not enforced, but emerges by itself
Pretrained models are publicly available:
• downloads: https://github.com/3Top/word2vec-api#where-to-get-a-pretrained-models
• demo: http://bionlp-www.utu.fi/wv_demo/
21 / 32
Variable size data: text and strings
Given: a text fragment or short sentence W = ”w1 w2 . . . wk”.
Easiest option: average the individual word representations,

    Φ(W) = (1/k) Σ_{i=1}^k φ(w_i)

for a word representation φ.

• a linear function of Φ is the average of linear functions of φ:

    w^T Φ(W) = w^T ( (1/k) Σ_i φ(w_i) ) = (1/k) Σ_i w^T φ(w_i)

• advantage: very simple
• disadvantage: ignores word order, not really suitable for long texts
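The averaging Φ(W) = (1/k) Σ_i φ(w_i) is one line in NumPy. A minimal sketch; the toy 2-d word vectors below are made up for illustration (real ones would come from a pretrained model such as GloVe):

```python
import numpy as np

def average_representation(words, phi):
    """Phi(W) = (1/k) * sum_i phi(w_i), for a word-vector lookup table phi."""
    return np.mean([phi[w] for w in words], axis=0)

# Toy 2-d vectors, invented for illustration only.
phi = {"good": np.array([1.0, 0.0]),
       "movie": np.array([0.0, 1.0])}

Phi = average_representation(["good", "movie"], phi)
```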
22 / 32
Variable size data: text and strings
Example: X = arbitrary-length text documents

Task-specific encoding, x ↦ φ(x), e.g.:
• create a dictionary of all possible words, w_1, …, w_L
• represent x by a histogram of word occurrences

    x ↦ (h_1, …, h_L) ∈ R^L        "bag-of-words" representation

where h_i counts how often word w_i occurs in x (absolute or relative count)

Include domain knowledge if possible, e.g. stop words:
• ignore words known a priori not to be useful for the task at hand:

    a an as at be ... the ... you
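A bag-of-words histogram with stop-word removal can be sketched with `collections.Counter`. The function name, the tiny vocabulary, and the abridged stop-word set are our own illustration choices:

```python
from collections import Counter

STOP_WORDS = {"a", "an", "as", "at", "be", "the", "you"}   # abridged list

def bag_of_words(text, vocabulary):
    """Histogram (h_1, ..., h_L) of word counts over a fixed dictionary."""
    counts = Counter(w for w in text.lower().split() if w not in STOP_WORDS)
    return [counts[w] for w in vocabulary]

vocab = ["dog", "cat", "bites"]
h = bag_of_words("The dog bites the dog", vocab)   # stop word "the" is dropped
```

Dividing each entry by `sum(h)` would give the relative-count variant mentioned above.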
23 / 32
Variable size data: text and strings
Given: a set D = {d_1, d_2, …, d_N} of variable-length documents.

tf-idf: term frequency – inverse document frequency

    tfidf(t, d) = tf(t, d) · idf(t)

• term frequency tf(t, d): how frequent is term t in document d?

    tf(t, d) = raw count of how often t occurs in d

• inverse document frequency idf(t): in how many documents does the term occur?

    idf(t) = log( N / (1 + n_t) )

for n_t = |{d ∈ D : t ∈ d}| and N = |D|.

Alternatives: boolean or logarithmic tf, constant idf (unweighted), …
24 / 32
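The definitions above translate directly into code. A minimal sketch with documents represented as lists of terms; note that with this idf a term appearing in every document gets a negative weight, since log(N / (1 + N)) < 0:

```python
import math

def tf(t, d):
    """Raw count of term t in document d (d is a list of terms)."""
    return d.count(t)

def idf(t, docs):
    """idf(t) = log(N / (1 + n_t)), with n_t = number of documents containing t."""
    n_t = sum(1 for d in docs if t in d)
    return math.log(len(docs) / (1 + n_t))

def tfidf(t, d, docs):
    return tf(t, d) * idf(t, docs)

# Toy corpus, invented for illustration.
docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "dog", "barked"]]
```

Here "cat" (rare) gets a positive score in the first document, while "the" (ubiquitous) gets a negative one, so the weighting indeed downgrades uninformative terms.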
Variable size data: text and strings
More powerful: count not just terms but short fragments: n-grams
• xi = CTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGG
• count A,C,G,T: φ1(xi) = (9, 22, 22, 17) ∈ R4
• count AA,AC,. . . ,TT: φ2(xi) = (0, 2, 6, 1, 3, . . . , 4, 1, 5, 6, 3) ∈ R16
• count AAA,. . . ,TTT: φ3(xi) = (0, 0, 0, 0, 0, 1, 0, 1, . . . , 1, 2, 2) ∈ R64
• etc.
demo: https://books.google.com/ngrams
data: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
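Counting n-grams as on the slide can be sketched as follows; the function name and the `itertools.product` enumeration of the |alphabet|^n histogram bins (in the order AA, AC, …, TT for n = 2) are our own choices:

```python
from collections import Counter
from itertools import product

def ngram_histogram(seq, n, alphabet="ACGT"):
    """Count all length-n substrings of seq; vector of dimension |alphabet|^n."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    return [counts["".join(g)] for g in product(alphabet, repeat=n)]

x = "CTCCTGACTT"                  # short DNA fragment for illustration
phi1 = ngram_histogram(x, 1)      # counts of A, C, G, T
phi2 = ngram_histogram(x, 2)      # counts of AA, AC, ..., TT (dimension 16)
```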
25 / 32
Back to learning linear models
f(x) = w^T x + b        →        f(x) = w^T φ(x) + b
other regularizers
26 / 32
Other Regularizers
Reminder: ridge regression
    min_{w,b}  (1/n) Σ_i ( w^T x_i + b − y_i )² + λ ‖w‖²

Instead of ‖w‖² we can use other regularizers to avoid overfitting:

    min_{w,b}  (1/n) Σ_i ( w^T x_i + b − y_i )² + λ ‖w‖₁        (∗)

for ‖w‖₁ = Σ_j |w_j|    (instead of ‖w‖² = Σ_j w_j²)
LASSO (least absolute shrinkage and selection operator) procedure
• convex optimization problem, but not differentiable
• no closed-form solution, but efficient numeric solvers exist, e.g. "FISTA"

Most important difference to ridge regression: "sparsity"
• solutions to (∗) with small ‖w‖₁ have many 0 entries
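One of the simplest numeric solvers for (∗) is ISTA, the non-accelerated precursor of the FISTA method named above: a gradient step on the smooth loss followed by soft-thresholding. A minimal sketch (our own implementation, row-wise data and no bias term for brevity):

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward 0 by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, steps=5000):
    """ISTA for min_w (1/n)||X w - y||^2 + lam * ||w||_1.

    X: (n, d) data matrix with one row per sample; no bias term here.
    FISTA adds a momentum term on top of this same iteration."""
    n, d = X.shape
    w = np.zeros(d)
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n     # Lipschitz const. of the gradient
    for _ in range(steps):
        grad = 2.0 / n * X.T @ (X @ w - y)      # gradient of the smooth part
        w = soft_threshold(w - grad / L, lam / L)
    return w
```

On data where only one feature matters, the returned w has exact zeros in the irrelevant coordinates, illustrating the sparsity property above; ridge regression would instead return small but nonzero values there.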
27 / 32
Weights learned by ridge regression:

   λ              w                                                                        b
λ = 10^-2   30.77  -356.56  453.43  172.40  -11.48  -323.84  -136.43  251.37  622.51  -37.57   145.99
λ = 10^-1   36.06  -209.73  326.56  146.23  -26.87  -151.14  -152.59  132.29  437.45   19.33   144.21
λ = 10^0    24.46   -21.02   98.85   57.42    4.21   -15.70   -66.52   55.16  120.99   30.91   138.37
λ = 10^1     4.44    -0.01   13.42    8.59    1.93     0.05   -10.06    8.98   16.00    5.46   134.33

Weights learned by LASSO:

   λ              w                                                                        b
λ = 10^-2   22.04  -372.48  474.21  169.83    0.00  -374.14  -109.18  303.41  643.98  -44.13   146.15
λ = 10^-1    0.00  -275.46  439.37  107.26    0.00  -172.34  -188.55   25.26  663.43    0.00   145.39
λ = 10^0     0.00     0.00   82.02    0.00    0.00     0.00     0.00    0.00  330.44    0.00   137.98
λ = 10^1     0.00     0.00    0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00   133.56
28 / 32
Why does ‖w‖₁ lead to sparse w while ‖w‖₂² does not?

    Σ_i (w^T x_i − y_i)² + λ‖w‖₁        vs.        Σ_i (w^T x_i − y_i)² + λ‖w‖₂²
29 / 32
Data normalization
In real data, different features might have different units/scales,
e.g. the diabetes dataset, x ∈ R^10, y ∈ R:

AGE  SEX  BMI   BP   S1   S2     S3  S4  S5      S6      Y
59   2    32.1  101  157   93.2  38  4   4.8598  87      151
48   1    21.6   87  183  103.2  70  3   3.8918  69       75
72   2    30.5   93  156   93.6  41  4   4.6728  85      141
24   1    25.3   84  198  131.4  40  5   4.8903  89      206
Standard tricks: data normalization
• 1) subtract the data mean
  2) divide each data dimension by its standard deviation
or
• 1) subtract the data mean
  2) normalize each x_i to have ‖x_i‖₂ = 1    → "sphering"
or
• 1) subtract the data mean
  2) whiten the data, such that cov(X) = Id
30 / 32
Original data
AGE  SEX  BMI   BP   S1   S2     S3  S4  S5      S6      Y
59   2    32.1  101  157   93.2  38  4   4.8598  87      151
48   1    21.6   87  183  103.2  70  3   3.8918  69       75
72   2    30.5   93  156   93.6  41  4   4.6728  85      141
24   1    25.3   84  198  131.4  40  5   4.8903  89      206

Centering / variance normalization
 0.801  1.065  1.297  0.460 -0.930 -0.732 -0.912 -0.054  0.419 -0.371   151
-0.040 -0.939 -1.082 -0.554 -0.178 -0.403  1.564 -0.830 -1.437 -1.938    75
 1.793  1.065  0.935 -0.119 -0.959 -0.719 -0.680 -0.054  0.060 -0.545   141
-1.872 -0.939 -0.244 -0.771  0.256  0.525 -0.758  0.721  0.477 -0.197   206

Centering / sphering
 0.243  0.012  0.132  0.147 -0.744 -0.515 -0.273 -0.002  0.005 -0.099   151
-0.015 -0.014 -0.139 -0.223 -0.179 -0.357  0.590 -0.031 -0.022 -0.649    75
 0.494  0.011  0.087 -0.035 -0.697 -0.459 -0.185 -0.001  0.001 -0.132   141
-0.723 -0.014 -0.032 -0.314  0.261  0.470 -0.289  0.027  0.007 -0.067   206

Whitening
-0.043 -0.008 -0.042 -0.058  0.025  0.025  0.013  0.087 -0.005 -0.042   151
-0.039  0.049 -0.022  0.059  0.039  0.057 -0.024  0.006 -0.060  0.046    75
-0.042 -0.002 -0.042 -0.060  0.087  0.006 -0.007  0.081 -0.031 -0.035   141
-0.051 -0.026  0.054  0.032 -0.077 -0.022 -0.009 -0.017 -0.027  0.042   206
Important: compute normalizing transforms only on the training part,
then apply the same transformations before making predictions.
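The train-only rule above can be sketched as a fit/apply pair for the variance-normalization variant (the function names are our own; the sample values are the AGE and BMI columns of the slide's table):

```python
import numpy as np

def fit_normalizer(X_train):
    """Compute mean and std on the TRAINING part only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)    # guard against constant features
    return mu, sigma

def apply_normalizer(X, mu, sigma):
    """Apply the *same* transform to any data, train or test."""
    return (X - mu) / sigma

# AGE and BMI columns from the slide's diabetes excerpt.
X_train = np.array([[59.0, 32.1], [48.0, 21.6], [72.0, 30.5], [24.0, 25.3]])
mu, sigma = fit_normalizer(X_train)
X_train_n = apply_normalizer(X_train, mu, sigma)
```

At prediction time, a new sample is normalized with the stored `mu` and `sigma`, never with statistics recomputed on the test data.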
31 / 32
Summary
General form: regularized least squares regression
    min_w  (1/n) Σ_i ( w^T x_i − y_i )²  +  λ Ω(w)
                      (loss)                (regularization constant λ × regularizer Ω)
• loss: makes model fit the training data
• Ω: regularizer, encourages simple models, prevents overfitting
• λ: regularization constant, controls trade-off
Use model selection to• find a good regularization constant
• choose between different regularizers
32 / 32