Sanjay Lall and Stephen Boyd, EE104, Stanford University
Transcript of EE104 lecture slides
Raw data
- raw data pairs are (u, v), with u ∈ U, v ∈ V
- U is the set of all possible input values
- V is the set of all possible output values
- each u is called a record
- typically a record is a tuple, or list, u = (u1, u2, ..., ur)
- each ui is a field or component, which has a type, e.g., real number, Boolean, categorical, ordinal, word, text, audio, image, parse tree (more on this later)
- e.g., a record for a house for sale might consist of
  (address, photo, description, house/apartment?, lot size, ..., # bedrooms)
Feature map
- learning algorithms are applied to (x, y) pairs,
  x = φ(u),   y = ψ(v)
- φ : U → R^d is the feature map for u
- ψ : V → R^m is the feature map for v
- feature maps transform records into vectors
- feature maps usually work on each field separately,
  φ(u1, ..., ur) = (φ1(u1), ..., φr(ur))
- φi is an embedding of the type of field i into a vector
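A minimal sketch of such a field-wise feature map in Python; the record fields and per-field embeddings below are hypothetical, chosen to mirror the house example above:

```python
# Sketch: a feature map phi that works on each field separately,
# phi(u1, ..., ur) = (phi1(u1), ..., phir(ur)), concatenating the
# per-field embeddings into one feature vector.

def embed_number(x):            # real field: identity embedding
    return [float(x)]

def embed_boolean(b):           # Boolean field: true -> +1, false -> -1
    return [1.0 if b else -1.0]

def feature_map(record, field_embeddings):
    """Apply each field's embedding and concatenate the results."""
    x = []
    for field, embed in zip(record, field_embeddings):
        x.extend(embed(field))
    return x

# hypothetical house record: (lot size, is_apartment, # bedrooms)
u = (6000.0, False, 3)
x = feature_map(u, [embed_number, embed_boolean, embed_number])
```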
Embeddings
- embedding puts the different field types on an equal footing, i.e., vectors
- some embeddings are simple, e.g.,
  - for a number field (U = R), φi(ui) = ui
  - for a Boolean field, φi(ui) = +1 if ui = true, -1 if ui = false
  - color to (R, G, B)
- others are more sophisticated
  - text to TF-IDF histogram
  - word2vec (maps words into vectors)
  - pre-trained neural network for images (maps images into vectors)
  (more on these later)
Faithful embeddings
a faithful embedding satisfies
- φ(u) is near φ(ũ) when u and ũ are 'similar'
- φ(u) is not near φ(ũ) when u and ũ are 'dissimilar'
- the lefthand concept is vector distance
- the righthand concept depends on the field type and application
- interesting examples: names, professions, companies, countries, languages, ZIP codes, cities, songs, movies
- we will see later how such embeddings can be constructed
Examples
- geolocation data: φ(u) = (latitude, longitude) in R^2, or embed in R^3 (if data points are spread over the planet)
- day of week (each day is 'similar' to the day before and the day after)

[figure: the seven days, Monday through Sunday, embedded as equally spaced points on the unit circle in R^2]
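A sketch of the circular day-of-week embedding suggested by the figure; the choice of seven equally spaced points on the unit circle is an assumption matching the figure's idea:

```python
import math

# Sketch: embed day-of-week into R^2 as equally spaced points on the
# unit circle, so each day is near the day before and the day after
# (and Sunday wraps around to be near Monday).
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def embed_day(day):
    k = DAYS.index(day)
    angle = 2 * math.pi * k / 7
    return (math.cos(angle), math.sin(angle))

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

# adjacent days are closer than days three steps apart
d_adjacent = dist(embed_day("Monday"), embed_day("Tuesday"))
d_far = dist(embed_day("Monday"), embed_day("Thursday"))
```

Note the wrap-around: a naive embedding into R (Monday = 1, ..., Sunday = 7) would put Sunday far from Monday, which is not faithful.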
Example: word2vec
- word2vec maps a dictionary of 3 million words (and short phrases) into R^300
- developed from a data set from Google News containing 100 billion words
- assigns words that frequently appear near each other in Google News to nearby vectors in R^300
Example: word2vec
(showing only x1 and x2, for a selection of words associated with emotion)
[scatter plot of (x1, x2) for emotion-related words, e.g., tenderness, sympathy, adoration, hostile, bitter, elated, optimistic, melancholy, anxious, frustrated]
Imagenet embedding
- Imagenet is an open image database with 14 million labeled images in 1000 classes
- vgg16 maps images u (224 × 224 pixels with R, G, B components) to x ∈ R^4096
- vgg16 was originally developed to classify the image labels
- repurposed as a general image feature mapping
- vgg16 has neural network form with 16 layers, with input u and output x
vgg16 embedding
[figure: six example images, numbered 1 through 6]
- images ui for i = 1, 2, ..., 6 are embedded to xi = φ(ui) ∈ R^4096
- matrix of pairwise distances dij = ||xi - xj||_2 (symmetric, zero diagonal; upper triangle shown):

  d = [ 0  109.7   97.9   96.2  107.4  103.0 ]
      [       0    63.9   71.6  109.4  109.2 ]
      [             0     69.4  101.5  101.4 ]
      [                    0     96.5   96.8 ]
      [                           0     86.6 ]
      [                                  0   ]
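A sketch of how such a pairwise distance matrix is computed, on small hypothetical vectors rather than the R^4096 vgg16 embeddings:

```python
import math

# Sketch: matrix of pairwise distances d_ij = ||x_i - x_j||_2.
# The three 2-vectors here are hypothetical stand-ins for embeddings.
X = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]

def norm2(a, b):
    """Euclidean distance between two vectors given as tuples."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

d = [[norm2(xi, xj) for xj in X] for xi in X]
```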
Standardized embeddings
we usually assume that an embedding is standardized
- entries of φ(u) are centered around 0
- entries of φ(u) have RMS value around 1
- roughly speaking, entries of φ(u) range over ±1
- with standardized embeddings, entries of the feature map
  φ(u1, ..., ur) = (φ1(u1), ..., φr(ur))
  are all comparable, i.e., centered around zero, standard deviation around one
- rms(φ(u) - φ(ũ)) is a reasonable measure of how close records u and ũ are
Standardization or z-scoring
- suppose U = R (field type is real numbers)
- for a data set u1, ..., un ∈ R,

  ū = (1/n) Σ_{i=1}^n ui,   std(u) = ( (1/n) Σ_{i=1}^n (ui - ū)^2 )^{1/2}

- the z-score or standardization of u is the embedding

  x = zscore(u) = (u - ū) / std(u)

- ensures that embedding values are centered at zero, with standard deviation one
- z-scored features are very easy to interpret: x = φ(u) = +1.3 means that u is 1.3 standard deviations above the mean value
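A direct transcription of the z-score formulas above (using the 1/n convention from the slide); the data values are hypothetical:

```python
import math

# Sketch: z-score a real data set u1, ..., un by subtracting the mean
# and dividing by the standard deviation (1/n convention).
def zscore(u):
    n = len(u)
    mean = sum(u) / n
    std = math.sqrt(sum((ui - mean) ** 2 for ui in u) / n)
    return [(ui - mean) / std for ui in u]

x = zscore([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```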
Log transform
- old school rule of thumb: if field u is positive and ranges over a wide scale, embed as φ(u) = log u
  (or log(1 + u) if u is sometimes zero), then standardize
- examples: web site visits, ad views, company capitalization
- interpretation as a faithful embedding:
  - 20 and 22 are similar, as are 1000 and 1100
  - but 20 and 120 are not similar
  - i.e., you care about fractional or relative differences between raw values
  (here, the log embedding is faithful; an affine embedding is not)
- can also apply to the output or label field, i.e., y = ψ(v) = log v if you care about percentage or fractional errors; recover v = exp(y)
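A small numeric check of the 'relative differences' interpretation, using the log(1 + u) variant mentioned above:

```python
import math

# Sketch: log-embed a positive, wide-ranging field (log1p allows zeros).
# Relative differences in u become absolute differences in the feature;
# standardization would follow as a second step.
def log_embed(u):
    return math.log1p(u)

# 1000 vs 1100 (10% apart) end up about as close as 20 vs 22,
# while 20 vs 120 remain far apart.
gap_small = abs(log_embed(22) - log_embed(20))
gap_large = abs(log_embed(1100) - log_embed(1000))
gap_dissimilar = abs(log_embed(120) - log_embed(20))
```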
Example: House price prediction
- we want to predict house selling price v from record u = (u1, u2)
  - u1 = area (sq. ft.)
  - u2 = # bedrooms
- we care about relative error in price, so we embed v as ψ(v) = log v
  (and then standardize)
- we standardize fields u1 and u2:

  x1 = (u1 - μ1) / σ1,   x2 = (u2 - μ2) / σ2

  - μ1 = ū1 is the mean area
  - μ2 = ū2 is the mean number of bedrooms
  - σ1 = std(u1) is the standard deviation of area
  - σ2 = std(u2) is the standard deviation of # bedrooms
  (means and standard deviations are over our data set)
Example: House price linear regression predictor
- predict y = log v (log of price) from standardized area and # bedrooms
- linear predictor: y = θ1 + θ2 x1 + θ3 x2
- in terms of the original raw data:

  v = exp( θ1 + θ2 (u1 - μ1)/σ1 + θ3 (u2 - μ2)/σ2 )

- exp undoes the log embedding of house price
- readily interpretable, e.g., what does θ2 = 0.7 mean?
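A sketch of the raw-data form of the predictor; all numeric values below (the θ's, means, and standard deviations) are hypothetical, chosen only to illustrate the formula:

```python
import math

# Sketch of v = exp(theta1 + theta2*(u1 - mu1)/sigma1
#                          + theta3*(u2 - mu2)/sigma2).
theta1, theta2, theta3 = 12.0, 0.7, 0.1   # hypothetical coefficients
mu1, sigma1 = 1500.0, 600.0               # hypothetical area stats (sq. ft.)
mu2, sigma2 = 3.0, 1.0                    # hypothetical bedroom stats

def predict_price(area, bedrooms):
    x1 = (area - mu1) / sigma1
    x2 = (bedrooms - mu2) / sigma2
    y = theta1 + theta2 * x1 + theta3 * x2   # predicted log price
    return math.exp(y)                       # undo the log embedding

# an average house (area = mu1, bedrooms = mu2) is predicted at exp(theta1)
p_avg = predict_price(1500.0, 3)
```

With these numbers, θ2 = 0.7 means one standard deviation of extra area multiplies the predicted price by exp(0.7).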
Vector embeddings for real field
- we can embed a field u into a vector x = φ(u) ∈ R^k
- useful even when U = R (real field)
- polynomial embedding: φ(u) = (1, u, u^2, ..., u^d)
- piecewise linear embedding: φ(u) = (1, (u)-, (u)+),
  where (u)- = min(u, 0) and (u)+ = max(u, 0)
- a linear predictor with these features yields polynomial and piecewise linear predictors of the raw features
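Both embeddings transcribed directly:

```python
# Sketch of the two vector embeddings of a real field above.
def poly_embed(u, d):
    """Polynomial embedding (1, u, u^2, ..., u^d)."""
    return [u ** k for k in range(d + 1)]

def pwl_embed(u):
    """Piecewise linear embedding (1, (u)-, (u)+)."""
    return [1.0, min(u, 0.0), max(u, 0.0)]
```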
Categorical data
- a data field is categorical if it only takes a finite number of values
- i.e., U is a finite set {α1, ..., αk}; the αi are category labels
- we often use category labels 1, ..., k, and refer to 'category i'
- examples:
  - true/false (two values, also called Boolean)
  - apple, orange, banana (three values)
  - monday, ..., sunday (seven values)
  - ZIP code (around 40000 values)
  - countries (around 185 values)
  - languages (several thousand spoken by large numbers of people)
One-hot embedding for categoricals
- U = {1, ..., k}
- one-hot embedding: φ(i) = ei ∈ R^k, the i-th standard unit vector
- examples:
  - φ(apple) = (1, 0, 0), φ(orange) = (0, 1, 0), φ(banana) = (0, 0, 1)
  - φ(true) = (1, 0), φ(false) = (0, 1) (another embedding of Boolean, into R^2)
  - φ(Mandarin) = e1, φ(English) = e2, φ(Hindi) = e3, ..., φ(Azeri) = e55, ...
- standardizing these features handles unbalanced data
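A sketch of the one-hot embedding with 1-indexed categories, as on the slide:

```python
# Sketch: one-hot embedding of a categorical field with k values,
# phi(i) = e_i in R^k, categories indexed 1, ..., k.
def one_hot(i, k):
    return [1.0 if j == i else 0.0 for j in range(1, k + 1)]

# hypothetical label-to-index mapping for the fruit example
fruit = {"apple": 1, "orange": 2, "banana": 3}
x = one_hot(fruit["orange"], 3)
```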
Reduced one-hot embedding for categoricals
- U = {1, ..., k}
- one-hot embedding maps U to R^k; reduced one-hot embedding maps U to R^(k-1)
- choose one value, say i = k, as the default or nominal value
- φ(k) = 0 ∈ R^(k-1), i.e., map the default value to the (vector) 0
- φ(i) = ei ∈ R^(k-1), i = 1, ..., k - 1
- example: U = {true, false} with false as the default:
  φ(true) = 1, φ(false) = 0
  (a common embedding of Booleans into R)
Ordinal data
- ordinal data is categorical, with an order
- example: Likert scale, with values
  strongly disagree, disagree, neutral, agree, strongly agree
- can embed into R with values -2, -1, 0, 1, 2
- or treat as categorical, with one-hot embedding into R^5
- example: number of bedrooms in a house
  - can be treated as a real number
  - or as an ordinal with (say) values 1, ..., 6
Feature engineering
- basic idea:
  - start with some features
  - then process or transform them to produce new ('engineered') features
  - use these new features in your predictor
- was it a good idea? did it improve your predictor?
  - train your model with the original features and validate performance
  - train your model with the new features and validate performance
  - if performance with the new features is better, your feature engineering was successful
Types of feature transforms
- modify individual features: replace the original feature xi with a modified or transformed feature xi_new
  - simple example: standardize, xi_new = (xi - μi) / σi
- create multiple features from each original feature
  - simple example: powers, replace xi with (xi, xi^2, ..., xi^q)
- create new features from multiple original features
  - simple example: product, x_new = xk xl
Gamma-transform
- γ-transform: xi_new = sign(xi) |xi|^γ, with γ > 0

[figure: plot of x_new = sign(x)|x|^γ versus x, for γ = 0.5 and γ = 1.5]
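The γ-transform transcribed directly:

```python
# Sketch of the gamma-transform x_new = sign(x) * |x|**gamma, gamma > 0.
# gamma < 1 compresses large magnitudes; gamma > 1 expands them.
# The sign is preserved, so the transform is odd in x.
def gamma_transform(x, gamma):
    sign = 1.0 if x >= 0 else -1.0
    return sign * abs(x) ** gamma

y_compress = gamma_transform(-4.0, 0.5)
y_expand = gamma_transform(4.0, 1.5)
```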
Clipping
- winsorize or clip or saturate with lower and upper clip levels li, ui:

  xi_new = ui if xi > ui,   xi if li ≤ xi ≤ ui,   li if xi < li

[figure: plot of the clip transform x_new versus x]
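The clip transform transcribed directly (l and u are the clip levels):

```python
# Sketch of the clip (winsorize) transform: values outside [l, u]
# are saturated at the nearest boundary.
def clip(x, l, u):
    return min(max(x, l), u)

vals = [-2.0, -0.5, 0.3, 1.7]
clipped = [clip(x, -1.0, 1.0) for x in vals]
```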
Powers
- replace xi with (xi, xi^2, ..., xi^q)

[figure: plot of the power transforms x, x^2, x^3 versus x]
Split into positive and negative parts
- replace xi with ((xi)+, (xi)-)
- or, split into negative, middle, and high values: replace xi with
  ((xi + 1)-, sat(xi), (xi - 1)+)
  where sat(a) = min(1, max(a, -1)) is the saturation function

[figure: plot of the positive part (x)+ and negative part (x)- versus x]
Creating new features from multiple original features
- can be used to model interactions among features
- examples: for i < j
  - maximum: max(xi, xj)
  - product: xi xj
- example: all monomials up to degree 3 of (x1, x2):

  (x1, x2, x1^2, x1 x2, x2^2, x1^3, x1^2 x2, x1 x2^2, x2^3)

  a linear model with these features gives an arbitrary degree 3 polynomial of (x1, x2)
Interpreting products of features as interactions
- suppose the xi are Boolean, with values 0, 1, for i = 1, ..., d, e.g., representing patient symptoms
- create new 'interaction' features xi xj, for i < j, of which there are d(d - 1)/2
- the linear regression model (for d = 3) is

  θ1 x1 + θ2 x2 + θ3 x3 + θ12 x1 x2 + θ13 x1 x3 + θ23 x2 x3

- θ1 is the amount our prediction goes up when x1 = 1
- θ3 is the amount our prediction goes up when x3 = 1
- θ13 is the amount our prediction goes up when x1 and x3 are both 1 (in addition to θ1 + θ3)
- e.g., with θ13 large, the simultaneous presence of symptoms 1 and 3 makes our estimate go up a lot
Quantizing
- specify bin boundaries b1 < b2 < ... < bk
- this partitions R into bins or buckets (-∞, b1], (b1, b2], ..., (b(k-1), bk], (bk, ∞)
- common choice of bin boundaries: quantiles of xi, e.g., deciles
- replace xi with e1 if xi ≤ b1, e2 if b1 < xi ≤ b2, ..., ek if b(k-1) < xi ≤ bk, and e(k+1) if bk < xi,
  i.e., xi maps to el if xi is in bin l
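A sketch of quantization into one-hot bin indicators; the use of Python's bisect module is an implementation choice, with left-open, right-closed bins as on the slide:

```python
import bisect

# Sketch: quantize a real feature. With bin boundaries b1 < ... < bk
# there are k+1 bins, and x maps to the one-hot vector e_l when x
# falls in bin l. bisect_left gives the count of boundaries < x
# (with ties going left, so x = b1 lands in bin 1, matching (-inf, b1]).
def quantize(x, boundaries):
    k = len(boundaries)
    l = bisect.bisect_left(boundaries, x) + 1   # bin index in 1..k+1
    return [1.0 if j == l else 0.0 for j in range(1, k + 2)]

x = quantize(0.5, [0.0, 1.0, 2.0])   # falls in bin (0, 1]
```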
Feature engineering pipeline
- feature transformations can be done multiple times
- start by embedding the original record u into a vector feature x^0 ∈ R^(d0) using φ̃, x^0 = φ̃(u)
  - the superscript 0 in x^0 and d0 marks the starting point for feature engineering
- transform x^0 using a feature engineering transform T^1, to get x^1 = T^1(x^0) ∈ R^(d1)
  - the superscript 1 in x^1 and d1 marks the 'first step' of feature engineering
- repeat M times to get the final embedding x = x^M = T^M(x^(M-1))
- the final feature map is a composition:

  φ = T^M ∘ T^(M-1) ∘ ... ∘ T^1 ∘ φ̃

- called a feature engineering pipeline
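A toy sketch of a pipeline as a composition of transforms; the initial embedding and the two transforms below are hypothetical examples, not from the slides:

```python
# Sketch: final feature map phi = T_M o ... o T_1 o phi_tilde,
# built by composing stages left to right.
def compose(*stages):
    def pipeline(u):
        x = u
        for stage in stages:
            x = stage(x)
        return x
    return pipeline

phi_tilde = lambda u: [float(u), float(u) ** 2]   # initial embedding
T1 = lambda x: x + [x[0] * x[1]]                  # add a product feature
T2 = lambda x: [xi / 10.0 for xi in x]            # rescale all entries

phi = compose(phi_tilde, T1, T2)
x = phi(2)
```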
Hand crafted versus automatic features
- the features and feature engineering described above are generally done by hand, using experience
- can also develop feature mappings automatically, directly from some data
- examples: word2vec and vgg16 were developed automatically (from very large data sets)
- we'll later see some of these methods (PCA, neural nets, ...)
Summary
- raw features are mapped to vectors for subsequent processing
- feature maps can range from simple to complex
- use validation to choose among different candidate feature maps
- sometimes the original feature map is followed by subsequent transformations, called feature engineering
- we'll see later how feature mappings can be derived from data, as opposed to by hand