Machine Learning CS 165B Spring 2012
Course outline
• Introduction (Ch. 1)
• Concept learning (Ch. 2)
• Decision trees (Ch. 3)
• Ensemble learning
• Neural Networks (Ch. 4)
• Linear classifiers
• Support Vector Machines
• Bayesian Learning (Ch. 6)
• Bayesian Networks
• Clustering
• Computational learning theory
Midterm on Wednesday
Midterm Wednesday May 2
• Topics (till today’s lecture)
• Content
– (40%) Short questions
– (20%) Concept learning and hypothesis spaces
– (20%) Decision trees
– (20%) Artificial Neural Networks
• Practice midterm will be posted today
• Can bring one regular 2-sided sheet & calculator
Background on Probability & Statistics
• Random variable, sample space, event (union, intersection)
• Probability distribution
– Discrete (pmf)
– Continuous (pdf)
– Cumulative (cdf)
• Conditional probability
– Bayes Rule
– P(C ≥ 2 | M = 0)
• Independence of random variables
– Are C and M independent?
• Choose which of two envelopes contains a higher number
– Allowed to peek at one of them
(Setup for the questions above: toss 3 coins; C is the count of heads; M = 1 iff all coins match. See the sketch below.)
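A small enumeration sketch of this setup (fair, independent coins assumed) that answers both sub-questions:

```python
from itertools import product
from fractions import Fraction

# All 8 equally likely outcomes of tossing 3 fair coins.
outcomes = list(product("HT", repeat=3))

def C(o):            # count of heads
    return o.count("H")

def M(o):            # 1 iff all three coins match
    return 1 if len(set(o)) == 1 else 0

def P(event):        # probability of an event given as a predicate on outcomes
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

# Conditional probability: P(C >= 2 | M = 0) = P(C >= 2, M = 0) / P(M = 0)
print(P(lambda o: C(o) >= 2 and M(o) == 0) / P(lambda o: M(o) == 0))   # 1/2

# Independence check: does P(C = c, M = m) = P(C = c) * P(M = m) for all c, m?
print(all(P(lambda o, c=c, m=m: C(o) == c and M(o) == m)
          == P(lambda o, c=c: C(o) == c) * P(lambda o, m=m: M(o) == m)
          for c in range(4) for m in (0, 1)))                          # False
```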
Background on Probability & Statistics
• Common distributions
– Bernoulli
– Uniform
– Binomial
– Gaussian (Normal)
– Poisson
• Expected value, variance, standard deviation
Approaches to classification
• Discriminant functions:
– Learn the boundary between classes.
• Infer conditional class probabilities:
– Choose the most probable class
What kind of classifier is logistic regression?
Discriminant Functions
• They can be arbitrary functions of x, such as:
– Nearest Neighbor
– Decision Tree
– Linear functions: g(x) = wᵀx + b
– Nonlinear functions
Sometimes, transform the data and then learn a linear function
High-dimensional data
[Figure: examples of high-dimensional data — gene expression, face images, handwritten digits]
Why feature reduction?
• Most machine learning and data mining techniques may not be effective for high-dimensional data
– Curse of Dimensionality
– Query accuracy and efficiency degrade rapidly as the dimension increases.
• The intrinsic dimension may be small.
– For example, the number of genes responsible for a certain type of disease may be small.
Why feature reduction?
• Visualization: projection of high-dimensional data onto 2D or 3D.
• Data compression: efficient storage and retrieval.
• Noise removal: positive effect on query accuracy.
Applications of feature reduction
• Face recognition
• Handwritten digit recognition
• Text mining
• Image retrieval
• Microarray data analysis
• Protein classification
Feature reduction algorithms
• Unsupervised
– Latent Semantic Indexing (LSI): truncated SVD
– Independent Component Analysis (ICA)
– Principal Component Analysis (PCA)
• Supervised
– Linear Discriminant Analysis (LDA)
Principal Component Analysis (PCA)
• Summarization of data with many variables by a smaller set of derived (synthetic, composite) variables
• PCA based on SVD
– So, look at SVD first
Singular Value Decomposition (SVD)
• Intuition: find the axis that shows the greatest variation, and project all points to this axis
[Figure: a 2-D cloud of points in the original (f1, f2) coordinates, with e1 and e2 marking the directions of greatest and second-greatest variation]
SVD: mathematical formulation
• Let A be an m x n real matrix of m n-dimensional points
• SVD decomposition: A = U Σ Vᵀ
– U (m x m) is orthogonal: UᵀU = I
– V (n x n) is orthogonal: VᵀV = I
– Σ (m x n) has r positive non-zero singular values in descending order on its diagonal
• Columns of U are the orthogonal eigenvectors of AAᵀ (called the left singular vectors of A)
– AAᵀ = (U Σ Vᵀ)(U Σ Vᵀ)ᵀ = U Σ Σᵀ Uᵀ = U Σ² Uᵀ
• Columns of V are the orthogonal eigenvectors of AᵀA (called the right singular vectors of A)
– AᵀA = (U Σ Vᵀ)ᵀ(U Σ Vᵀ) = V Σᵀ Σ Vᵀ = V Σ² Vᵀ
• Σ contains the square roots of the eigenvalues of AAᵀ (or AᵀA)
– These are called the singular values (positive real)
– r is the rank of A, AAᵀ, AᵀA
• U defines the column space of A, V the row space
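These properties are easy to verify numerically; a quick NumPy sketch on an arbitrary matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(7, 5))            # an arbitrary m x n real matrix

U, s, Vt = np.linalg.svd(A)            # A = U @ diag(s) @ Vt, s in descending order

# U and V are orthogonal
print(np.allclose(U.T @ U, np.eye(7)), np.allclose(Vt @ Vt.T, np.eye(5)))

# singular values are the square roots of the eigenvalues of A^T A
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]   # sorted in descending order
print(np.allclose(s, np.sqrt(eigvals)))
```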
SVD - example
• A = U Σ Vᵀ:

       A (7 x 5)           U (7 x 2)       Σ (2 x 2)       Vᵀ (2 x 5)
    1  1  1  0  0         0.18  0
    2  2  2  0  0         0.36  0
    1  1  1  0  0         0.18  0         9.64  0        0.58  0.58  0.58  0     0
    5  5  5  0  0    =    0.90  0     x   0     5.29  x  0     0     0     0.71  0.71
    0  0  0  2  2         0     0.53
    0  0  0  3  3         0     0.80
    0  0  0  1  1         0     0.27

• The first row of Vᵀ is v1ᵀ = (0.58, 0.58, 0.58, 0, 0): the axis of greatest variation
• σ1 = 9.64 measures the variance ('spread') of the data on the v1 axis
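The decomposition above can be reproduced numerically (a sketch; np.linalg.svd may flip the sign of a singular-vector pair, which does not change the product):

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.round(s, 2))        # [9.64 5.29 0.   0.   0.  ]  -> rank 2
print(np.round(Vt[:2], 2))   # the two right singular vectors (rows), up to sign
```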
Dimensionality reduction
• Set the smallest singular values to zero: here σ2 = 5.29 is dropped, keeping only σ1 = 9.64
• Equivalently, keep only the first column of U and the first row of Vᵀ:

       A (7 x 5)           u1 (7 x 1)     σ1          v1ᵀ (1 x 5)
    1  1  1  0  0         0.18
    2  2  2  0  0         0.36
    1  1  1  0  0         0.18
    5  5  5  0  0    ≈    0.90        x  9.64   x    0.58  0.58  0.58  0  0
    0  0  0  2  2         0
    0  0  0  3  3         0
    0  0  0  1  1         0

• Multiplying out gives the rank-1 approximation of A:

    1  1  1  0  0         1  1  1  0  0
    2  2  2  0  0         2  2  2  0  0
    1  1  1  0  0         1  1  1  0  0
    5  5  5  0  0    ≈    5  5  5  0  0
    0  0  0  2  2         0  0  0  0  0
    0  0  0  3  3         0  0  0  0  0
    0  0  0  1  1         0  0  0  0  0
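Continuing that example, the rank-1 truncation is a one-liner once the SVD is computed (a sketch):

```python
import numpy as np

# A is the 7 x 5 matrix from the example above
A = np.array([[1, 1, 1, 0, 0], [2, 2, 2, 0, 0], [1, 1, 1, 0, 0], [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2], [0, 0, 0, 3, 3], [0, 0, 0, 1, 1]], dtype=float)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 1                                      # keep only the largest singular value
A1 = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]    # rank-1 approximation of A
print(np.round(A1, 1))                     # the last three rows become ~0: that part of A is lost
```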
Dimensionality reduction
• 'Spectral decomposition' of the matrix: writing U = [u1 u2 ...], Σ = diag(σ1, σ2, ...) and Vᵀ with rows v1ᵀ, v2ᵀ, ..., the m x n matrix A becomes a sum of rank-1 terms:

    A = u1 σ1 v1ᵀ + u2 σ2 v2ᵀ + ...

• Each ui is m x 1 and each viᵀ is 1 x n; there are r terms, where r is the rank of A
Dimensionality reduction
• Approximation / dimensionality reduction: keep only the first few terms (how many?)

    A ≈ u1 σ1 v1ᵀ + u2 σ2 v2ᵀ + ...

• Assume σ1 ≥ σ2 ≥ ...
Dimensionality reduction
• A heuristic: keep 80-90% of the 'energy' (= sum of squares of the σi's); see the sketch below
• Assume σ1 ≥ σ2 ≥ ...
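A small sketch of this energy heuristic (the 0.90 threshold is just the rule of thumb above):

```python
import numpy as np

def choose_k(singular_values, energy=0.90):
    """Smallest k whose leading singular values retain `energy` of sum(sigma_i^2)."""
    s2 = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(s2) / s2.sum()
    return int(np.searchsorted(cumulative, energy) + 1)

print(choose_k([9.64, 5.29]))   # 2 for the example above (sigma_1 alone holds only ~77%)
```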
Dimensionality reduction
• Matrix V in the SVD decomposition (A = U Σ Vᵀ) is used to transform the data
• AV (= UΣ) defines the transformed dataset
• For a new data element x (a row vector), xV defines the transformed data
• Keeping the first k (k < n) dimensions amounts to keeping only the first k columns of V
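A sketch of the projection step on the running example (variable names are illustrative):

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0], [2, 2, 2, 0, 0], [1, 1, 1, 0, 0], [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2], [0, 0, 0, 3, 3], [0, 0, 0, 1, 1]], dtype=float)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T                                   # columns of V = right singular vectors

k = 1
A_reduced = A @ V[:, :k]                   # transformed dataset; equals U[:, :k] * s[:k]
x_new = np.array([3.0, 3.0, 3.0, 0.0, 0.0])    # a hypothetical new data point
print(A_reduced.ravel(), x_new @ V[:, :k])     # its coordinate along v1
```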
Optimality of SVD
• Let A = U Σ Vᵀ, so that A = Σᵢ σᵢ uᵢ vᵢᵀ
• The Frobenius norm of an m x n matrix M is ||M||_F = √(Σᵢ,ⱼ M[i,j]²); for A, ||A||_F = √(Σᵢ σᵢ²)
• Let Ak = the above summation using only the k largest singular values
• Theorem [Eckart and Young]: among all m x n matrices B of rank at most k,

    ||A − Ak||_F ≤ ||A − B||_F    and    ||A − Ak||_2 ≤ ||A − B||_2

• "Residual" variation is information in A that is not retained. Balancing act between
– clarity of representation, ease of understanding
– oversimplification: loss of important or relevant information
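Both the Frobenius-norm identity and the error of the truncated SVD can be checked numerically; a sketch on a random matrix (the identity ||A − Ak||_F = √(Σ_{i>k} σᵢ²) is a standard consequence of the decomposition):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(7, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# ||A||_F equals the square root of the sum of squared singular values
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(s**2))))

# error of the truncated SVD: ||A - A_k||_F = sqrt(sum of the discarded sigma_i^2)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
print(np.isclose(np.linalg.norm(A - A_k, 'fro'), np.sqrt(np.sum(s[k:]**2))))
```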
Principal Components Analysis (PCA)
• Center the dataset by subtracting the mean of each variable: let matrix A be the result.
• Compute the covariance matrix AᵀA (up to a constant factor).
• Project the dataset along a subset of the eigenvectors of AᵀA.
• Matrix V in the SVD decomposition A = U Σ Vᵀ contains these eigenvectors.
• Also known as the Karhunen-Loève (K-L) transform.
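A minimal PCA sketch along these lines (assumes one object per row; function and variable names are illustrative):

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the top-k principal axes."""
    A = X - X.mean(axis=0)                        # center each variable
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    components = Vt[:k]                           # top-k eigenvectors of A^T A (as rows)
    scores = A @ components.T                     # coordinates of each object (= U[:, :k] * s[:k])
    explained_var = s[:k] ** 2 / (len(X) - 1)     # variance along each principal axis
    return scores, components, explained_var

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5)) @ rng.normal(size=(5, 5))   # correlated toy data
scores, components, var = pca(X, k=2)
print(var, np.round(np.cov(scores.T), 3))   # off-diagonal covariance of the scores ~ 0
```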
Principal Component Analysis (PCA)
• Takes a data matrix of m objects by n variables, which may be correlated, and summarizes it by uncorrelated axes (principal components or principal axes) that are linear combinations of the original n variables
• The first k components display as much as possible of the variation among objects.
2D Example of PCA
V1 = 6.67,  V2 = 6.24,  C1,2 = 3.42
[Figure: scatter plot of variable X2 against variable X1; the '+' marks the sample mean (X̄1 = 8.35, X̄2 = 4.91)]
Configuration is Centered
[Figure: the same scatter plot of X1 and X2 after centering; both variables now have mean zero]
• Each variable is adjusted to a mean of zero (by subtracting the mean from each value).
Principal Components are Computed
[Figure: the centered data re-plotted in the PC 1 - PC 2 coordinate system]
• PC 1 has the highest possible variance (9.88)
• PC 2 has a variance of 3.03
• PC 1 and PC 2 have zero covariance.
[Figure: the original X1-X2 scatter plot with the PC 1 and PC 2 axes drawn through the centroid]
• Each principal axis is a linear combination of the original two variables
Feature reduction algorithms
• Unsupervised
– Latent Semantic Indexing (LSI): truncated SVD
– Independent Component Analysis (ICA)
– Principal Component Analysis (PCA)
• Supervised
– Linear Discriminant Analysis (LDA)
Course outline
• Introduction (Ch. 1)
• Concept learning (Ch. 2)
• Decision trees (Ch. 3)
• Ensemble learning
• Neural Networks (Ch. 4)
• Linear classifiers
• Support Vector Machines
• Bayesian Learning (Ch. 6)
• Bayesian Networks
• Clustering
• Computational learning theory
Midterm analysis
• Grade distribution
• Solution to ANN problem
• Makeup problem on Wednesday
– 20 minutes
– 15 points
– Bring a calculator
Fisher’s linear discriminant
• A simple linear discriminant function is a projection of the data down to 1-D.
– So choose the projection that gives the best separation of the classes. What do we mean by "best separation"?
• An obvious direction to choose is the direction of the line joining the class means.
– But if the main direction of variance in each class is not orthogonal to this line, this will not give good separation (see the next figure).
• Fisher's method chooses the direction that maximizes the ratio of between-class variance to within-class variance.
– This is the direction in which the projected points contain the most information about class membership (under Gaussian assumptions).
Fisher’s linear discriminant
When projected onto the line joining the class means, the classes are not well separated.
Fisher chooses a direction that makes the projected classes much tighter, even though their projected means are less far apart.
Fisher’s linear discriminant (derivation)
Find the best direction w for accurate classification.
A measure of the separation between the projected points is the difference of the sample means.
If mᵢ is the d-dimensional sample mean from Dᵢ, given by

    mᵢ = (1/nᵢ) Σ_{x∈Dᵢ} x,

and m̃ᵢ is the sample mean of the projected points Yᵢ, given by

    m̃ᵢ = (1/nᵢ) Σ_{y∈Yᵢ} y = wᵀmᵢ,

then the difference of the projected sample means is:

    |m̃₁ − m̃₂| = |wᵀ(m₁ − m₂)|
Fisher’s linear discriminant (derivation)
Define the scatter of the projected points of class i:

    s̃ᵢ² = Σ_{y∈Yᵢ} (y − m̃ᵢ)²

Choose w in order to maximize

    J(w) = |m̃₁ − m̃₂|² / (s̃₁² + s̃₂²)

Define scatter matrices Sᵢ (i = 1, 2) and S_W by

    Sᵢ = Σ_{x∈Dᵢ} (x − mᵢ)(x − mᵢ)ᵀ,    S_W = S₁ + S₂

S_W is called the total within-class scatter.
Fisher’s linear discriminant (derivation)
We obtain

    s̃ᵢ² = Σ_{x∈Dᵢ} (wᵀx − wᵀmᵢ)² = wᵀSᵢw,

so that

    s̃₁² + s̃₂² = wᵀS_W w
Fisher’s linear discriminant (derivation)
Similarly, |m̃₁ − m̃₂|² = (wᵀm₁ − wᵀm₂)² = wᵀS_B w, where

    S_B = (m₁ − m₂)(m₁ − m₂)ᵀ   (the between-class scatter matrix)

In terms of S_B and S_W, J(w) can be written as:

    J(w) = (wᵀS_B w) / (wᵀS_W w)
Fisher’s linear discriminant (derivation)
A vector w that maximizes J(w) must satisfy the generalized eigenvalue equation

    S_B w = λ S_W w

In the case that S_W is nonsingular, and since S_B w always points in the direction of (m₁ − m₂), the solution (up to a scale factor) is

    w = S_W⁻¹ (m₁ − m₂)
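A minimal two-class sketch of this result on synthetic data (all names and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
D1 = rng.normal(loc=[0, 0], scale=[3.0, 0.5], size=(100, 2))   # class 1 samples (rows)
D2 = rng.normal(loc=[4, 1], scale=[3.0, 0.5], size=(100, 2))   # class 2 samples (rows)

m1, m2 = D1.mean(axis=0), D2.mean(axis=0)
S1 = (D1 - m1).T @ (D1 - m1)          # scatter matrix of class 1
S2 = (D2 - m2).T @ (D2 - m2)          # scatter matrix of class 2
Sw = S1 + S2                          # total within-class scatter

w = np.linalg.solve(Sw, m1 - m2)      # Fisher direction: w = Sw^{-1}(m1 - m2)
w /= np.linalg.norm(w)
print(np.round(w, 3))   # weights the low-variance axis heavily, unlike the raw mean difference
```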
Linear discriminant
• Advantages:
– Simple: O(d) space/computation
– Knowledge extraction: weighted sum of attributes; positive/negative weights, magnitudes (credit scoring)
Non-linear models
• Quadratic discriminant: g(x) = xᵀW x + wᵀx + w₀
• Higher-order (product) terms:

    z₁ = x₁,  z₂ = x₂,  z₃ = x₁²,  z₄ = x₂²,  z₅ = x₁x₂

Map from x to z using nonlinear basis functions and use a linear discriminant in z-space (see the sketch below).
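A small sketch of such a z-space discriminant (the weights are made up purely for illustration):

```python
import numpy as np

def phi(x):
    # nonlinear basis: z1=x1, z2=x2, z3=x1^2, z4=x2^2, z5=x1*x2
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

# illustrative weights for a linear discriminant in z-space (a circle of radius 2 in x-space)
v = np.array([0.0, 0.0, 1.0, 1.0, 0.0])
v0 = -4.0

def g(x):
    return v @ phi(x) + v0      # linear in z, quadratic in x

print(g(np.array([1.0, 1.0])), g(np.array([3.0, 0.0])))   # negative inside, positive outside
```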
Linear model: two classes
Choose C1 if g(x) > 0, and C2 otherwise.
Geometry of classification
w is orthogonal to the decision surface
D = distance of the decision surface from the origin (writing the bias as w₀ = b). Consider any point x on the decision surface; then g(x) = wᵀx + b = 0, so

    D = wᵀx / ||w|| = −b / ||w||

d(x) = signed distance of x from the decision surface. Write x = xₚ + d(x) w/||w||, where xₚ is the projection of x onto the surface. Then

    wᵀx + b = wᵀxₚ + d(x) wᵀw/||w|| + b
    g(x) = (wᵀxₚ + b) + d(x) ||w|| = d(x) ||w||   (since wᵀxₚ + b = 0)
    d(x) = g(x) / ||w|| = wᵀx / ||w|| − D
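A quick numeric check of these distance formulas (w and b are arbitrary illustrative values):

```python
import numpy as np

w = np.array([3.0, 4.0])      # ||w|| = 5
b = -10.0

def g(x):
    return w @ x + b

D = -b / np.linalg.norm(w)            # distance of the decision surface from the origin: 2.0

x = np.array([4.0, 3.0])
d_x = g(x) / np.linalg.norm(w)        # signed distance of x from the surface
print(D, d_x, w @ x / np.linalg.norm(w) - D)   # the last two values agree
```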