DATA MINING AND MACHINE LEARNING
Lecture 2: Elements of supervised learning

Lecturer: Simone Scardapane

Academic Year 2016/2017


Table of contents

1. Introduction
   - Basic terminology
   - A review of probability theory
   - The statistical learning setting

2. Two basic approaches to supervised learning
   - Linear regression
   - k-nearest neighbors

3. Additional topics
   - Overfitting and model complexity
   - Statistical decision theory
   - The curse of dimensionality
   - More on distance functions


Classification vs. regression

Supervised learning is about inferring a relation between objects of an input space X and objects of an output space Y. As an example, an element x ∈ X could be a representation of an email (e.g., an ordered list of words), while the corresponding element in Y would be 0 if the email is normal and 1 if it is spam.

- Classification: the elements in Y can only take a discrete set of values {1, . . . , M}. If M = 2, as in the previous case, we talk about binary classification or concept learning. Otherwise, we have multi-class classification.

- Regression: Y has infinitely many elements, e.g., predicting a room temperature. In regression we want to predict quantitative information instead of qualitative information.


Vectorial inputs

For the majority of this course, we will assume that the input is a vector of real numbers, denoted with a bold font x, i.e., we will be concerned with the specific case X = R^d. We will also refer to x as a pattern.

Many objects in real-world applications can be reduced to this form through some transformation. As an example, a text can be represented as the frequency of appearance of a given set of words once we discard the information on their ordering (bag-of-words representation).

A single element x_i of the input is called a feature in ML, while it is also known as a predictor or independent variable in the statistical literature.


Categorical features

An important class of features is given by categorical features. For example, in a medical application the i-th feature can distinguish whether a patient is 'European', 'American', or 'Other'. Encoding this as {0, 1, 2} is problematic, because we are implicitly introducing a distance between the categories (in formal terms, a metric space).

If the number of categories is not too large, the common way to solve this is to employ a dummy encoding (or 1-of-K encoding), where we use one bit for each value of the categorical variable:

European = (1, 0, 0),   American = (0, 1, 0),   Other = (0, 0, 1).
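As an illustration, here is a minimal sketch of the 1-of-K encoding in plain NumPy; the category list and the one_hot helper are hypothetical names, not part of any specific library.

```python
import numpy as np

# A minimal sketch of 1-of-K (dummy) encoding for a categorical feature.
# The category names follow the slide's example; the helper is hypothetical.
categories = ['European', 'American', 'Other']
index = {c: i for i, c in enumerate(categories)}

def one_hot(value, n_categories=len(categories)):
    """Return a binary vector with a single 1 in the position of `value`."""
    v = np.zeros(n_categories)
    v[index[value]] = 1.0
    return v

print(one_hot('European'))  # [1. 0. 0.]
print(one_hot('Other'))     # [0. 0. 1.]
```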


Introduction: A review of probability theory


Toy example

Consider the following simple example taken from (Bishop, 2006):

Figure 1 : We have two boxes, each with a given ratio of orange and green balls. We can ask basic questions such as "What is the probability of drawing a green ball?", or "What is the probability that we extracted from the first box, given that the ball is green?".


Frequentist definition of probabilities

Intuitively, the probability that an extracted ball has color green is obtained by simply counting the balls:

p(C = ‘green’) = 5/12 .

The generic notation p(C) defines the probability distribution over all possible values of the random variable C (in this case, 'orange' and 'green'), such that:

$$p(C = c) \ge 0, \qquad \sum_{c} p(C = c) = 1 .$$

Since we are considering the relative frequency of events, this is called the frequentist interpretation of probabilities.


Joint probability distributions

We can write the joint probability distribution between two random variables, e.g.

p(C = 'green', B = 'red') = 2/12 ,

where B is the random variable defining which box was selected. The sum rule of probabilities allows us to marginalize over one of the two variables:

$$p(C) = \sum_{b} p(C, B = b) .$$

As an example, p(C = 'green') = p(C = 'green', B = 'red') + p(C = 'green', B = 'blue') = 2/12 + 3/12 = 5/12.


Conditional probabilities

The conditional probability distribution p(C | B) defines the probability of observing a value for C, once we know the corresponding value for B. As an example:

p(C = 'green' | B = 'red') = 2/8 .

The product rule of probability gives us a relation between conditional and joint distributions as:

$$p(C, B) = p(B)\, p(C \mid B) = p(C)\, p(B \mid C) .$$

If p(C, B) = p(C) p(B), then C and B are said to be independent. By combining the sum and product rules, we obtain Bayes' rule:

$$p(B \mid C) = \frac{p(C \mid B)\, p(B)}{p(C)} .$$
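The box-and-balls example can be checked numerically. The sketch below assumes the counts implied by the numbers on these slides (red box: 2 green and 6 orange balls; blue box: 3 green and 1 orange ball) and verifies the sum rule, the product rule, and Bayes' rule with exact fractions.

```python
from fractions import Fraction as F

# Counts implied by the slides' numbers (an assumption spelled out above):
# red box: 2 green + 6 orange balls, blue box: 3 green + 1 orange ball.
counts = {('red', 'green'): 2, ('red', 'orange'): 6,
          ('blue', 'green'): 3, ('blue', 'orange'): 1}
total = sum(counts.values())                      # 12 balls overall

def p_joint(box, color):
    return F(counts[(box, color)], total)

# Sum rule (marginalization): p(C = 'green') = sum_b p(C = 'green', B = b)
p_green = p_joint('red', 'green') + p_joint('blue', 'green')
print(p_green)                                    # 5/12

# Product rule: p(C | B) = p(C, B) / p(B)
p_red = F(counts[('red', 'green')] + counts[('red', 'orange')], total)
p_green_given_red = p_joint('red', 'green') / p_red
print(p_green_given_red)                          # 1/4, i.e. 2/8

# Bayes' rule: p(B = 'red' | C = 'green') = p(C | B) p(B) / p(C)
p_red_given_green = p_green_given_red * p_red / p_green
print(p_red_given_green)                          # 2/5
```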


Expectations

If we associate numerical values to the outcomes of B and C, we can define the expectation of a function f(B) as:

$$\mathbb{E}[f] = \sum_{b} f(b)\, p(B = b) .$$

We can also write conditional expectations as:

$$\mathbb{E}_{B \mid C}[f] = \sum_{b} f(b)\, p(B = b \mid C) .$$

And the variance:

$$\mathrm{var}[f] = \mathbb{E}\big[(f - \mathbb{E}[f])^2\big] = \mathbb{E}[f^2] - \mathbb{E}[f]^2 .$$


Continuous random variables

In the case of continuous random variables, the probability distribution p(x) should respect the following properties:

$$p(x) \ge 0, \qquad \int p(x)\, dx = 1 .$$

The function is also called a probability density function, to distinguish it from the probability mass function of a discrete variable.

Most of the previous definitions extend to the continuous case by replacing sums with integrals; for example, the expected value of x is given by:

$$\mathbb{E}[x] = \int x\, p(x)\, dx .$$


Introduction: The statistical learning setting


Datasets

Assume that we are given some examples of the relation we are looking for, in the form of a dataset denoted as:

$$S = \big\{ (\mathbf{x}_i, y_i) \mid i = 1, \ldots, N \big\} .$$

In practice, we cannot assume that x_i uniquely determines y_i, for a variety of different reasons:

- y_i could be determined by some additional features that we cannot observe (latent variables).

- Some elements in x_i could be missing or simply wrong.

- The value y_i could be corrupted by some form of noise (e.g., measurements from a sensor).


Statistical learning setting

Very generally, we can assume that the process generating our data is described by some probability distribution:

$$p(\mathbf{x}_i, y_i) = p(\mathbf{x}_i)\, p(y_i \mid \mathbf{x}_i) .$$

Under this mathematical framework, a dataset is just a random variable given by N independent and identically distributed (i.i.d.) realizations of the previous process.

Identical distribution means that the underlying function is static. Predicting a time-varying process is an example where this assumption is violated. A change in the function is sometimes called a concept drift in ML.
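A minimal sketch of this generative view, assuming a synthetic choice of p(x) (uniform inputs) and p(y | x) (a linear mean plus Gaussian noise); a dataset is then one i.i.d. realization of the process.

```python
import numpy as np

rng = np.random.default_rng(0)

# A synthetic instance of p(x, y) = p(x) p(y | x): the particular choices
# (uniform inputs, linear mean, Gaussian noise) are illustrative assumptions.
N, sigma = 50, 0.3
x = rng.uniform(0, 8, size=N)                     # draws from p(x)
y = 0.4 * x - 0.5 + sigma * rng.normal(size=N)    # draws from p(y | x)

# The dataset S = {(x_i, y_i)} is one i.i.d. realization of this process.
S = list(zip(x, y))
```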


Generative vs. discriminative learning

Any approach that (implicitly or explicitly) tries to model the input distribution p(x) is called generative, because we can draw probable inputs by sampling the distribution.

On the contrary, discriminative approaches only care about the conditional distribution p(y | x), because this is what is most useful in practice. Methods that model the conditional distribution can be (very roughly) subdivided into frequentist approaches and Bayesian approaches.


Learning without probabilities

In the majority of situations, we do not care about p(y | x) either, but we just need a single value of y to make a decision.

As an example, consider a classification problem where we incur a penalty of 1 every time we make an incorrect decision, and a penalty of 0 otherwise. It is trivial to show that we should always take the y that maximizes p(y | x).

"When solving a problem of interest, do not solve a more general problem as an intermediate step."

— V. Vapnik, Statistical Learning Theory, 1998


Discriminant functions

A large subfield of ML is formulated around the idea of finding a single discriminant function f(x), which directly provides us with the 'best' guess of y according to some criterion.

To this end, suppose that our function f is defined by some parameters w, and that we have a way of evaluating the error of any given prediction using a loss function L(y, ŷ), where ŷ = f(x). It makes intuitive sense to find a set of parameters such that:

$$\mathbf{w}^* = \arg\min_{\mathbf{w}} \left\{ J(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, f(\mathbf{x}_i)\big) \right\} .$$

This is called empirical risk minimization but, as we will see later on, minimizing it alone is not enough to obtain a good algorithm.
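A minimal sketch of the empirical risk J(w) for a linear model; the squared loss and the toy data are illustrative assumptions (the closed-form minimizer for this particular case is derived in the next slides).

```python
import numpy as np

def empirical_risk(w, X, y, loss):
    """J(w) = (1/N) * sum_i loss(y_i, f(x_i)) for a linear model f(x) = w^T x."""
    predictions = X @ w
    return np.mean([loss(yi, pi) for yi, pi in zip(y, predictions)])

squared_loss = lambda y, y_hat: (y - y_hat) ** 2   # one possible choice of L

# Tiny illustrative data (a constant feature is appended to absorb the bias).
X = np.array([[1.0, 1], [2.0, 1], [3.0, 1]])
y = np.array([1.1, 1.9, 3.2])
print(empirical_risk(np.array([1.0, 0.0]), X, y, squared_loss))   # 0.02
```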


The three elements of learning

In this general formulation, we can already find the three basic elements we discussed before:

1. The model is given by the specific form taken by f(x). We can take functions that have a very restricted behavior (e.g., linear), or functions which have a lot of expressive power and a large set of adaptable parameters.

2. The evaluation is given by selecting a proper loss function.

3. The optimization is the process by which we find a set of values w such that the previous function is minimized.


Two basic approaches to supervised learning: Linear regression


Linear models

The simplest model we can devise is a linear model, where each feature contributes linearly in some amount to the observed output:

$$f(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i ,$$

where w_0 is called the bias or intercept of the model. We can simplify the notation by assuming that the last feature is always constant, i.e., x = [x_1, . . . , x_{d-1}, 1]. In this case we have:

$$f(\mathbf{x}) = \sum_{i=1}^{d} w_i x_i = \mathbf{w}^T \mathbf{x} .$$


Least squares regression

Now we need a way to penalize the errors of our model. A popular technique is least squares regression (also known as ordinary least squares, OLS):

$$J(\mathbf{w}) = \frac{1}{2N} \sum_{i=1}^{N} \big( y_i - \mathbf{w}^T \mathbf{x}_i \big)^2 .$$

The main appeal of this formulation is that it lends itself to a closed-form solution and to a large number of theoretical analyses. Note that this formulation gives proportionally more weight to errors that are farther away from 0 (due to the square).

Stigler, S.M., 1981. Gauss and the invention of least squares. The Annals of Statistics, pp. 465-474.


Visualizing the squared loss function


Figure 2 : Visualization of the squared loss for varying errors.


Matrix formulation of least squares

In order to derive a more convenient formulation, we define the so-called input matrix and output vector as:

$$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1^T \\ \vdots \\ \mathbf{x}_N^T \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} .$$

Each input is a row in our input matrix. OLS is formulated as the optimization of the following cost function:

$$J(\mathbf{w}) = \frac{1}{2N} \big( \mathbf{y} - \mathbf{X}\mathbf{w} \big)^T \big( \mathbf{y} - \mathbf{X}\mathbf{w} \big) = \frac{1}{2N} \| \mathbf{y} - \mathbf{X}\mathbf{w} \|_2^2 ,$$

where we make use of the Euclidean norm (or $\ell_2$ norm) $\|\mathbf{a}\|_2 = \sqrt{\sum_i a_i^2}$.


Normal equations

Solving the previous cost function requires some basic knowledge of optimization, which we cover in the next lecture. In particular, the first-order optimality condition tells us that the solution must satisfy the so-called normal equations:

$$\mathbf{X}^T (\mathbf{y} - \mathbf{X}\mathbf{w}) = \mathbf{0} .$$

Assuming that the matrix X^T X is not singular, we obtain that the solution (which is also unique) can be expressed as:

$$\mathbf{w}^* = \big( \mathbf{X}^T \mathbf{X} \big)^{-1} \mathbf{X}^T \mathbf{y} .$$

The quantity $(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T$ is called the Moore-Penrose pseudo-inverse of X.
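A minimal NumPy sketch of the closed-form OLS solution on synthetic data. Following the advice on the next slide, it calls np.linalg.lstsq (which is decomposition-based) instead of forming the inverse explicitly; the data-generating choices are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a linear model plus Gaussian noise (illustrative choice).
N, d = 100, 3
X = np.hstack([rng.normal(size=(N, d - 1)), np.ones((N, 1))])  # constant last column
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)

# Closed form from the normal equations, computed with a decomposition-based
# routine instead of forming (X^T X)^{-1} explicitly.
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_star)   # close to w_true
```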


Considerations on the computational complexity

1. X^T X takes O(d^2 N) to compute, and similarly X^T y takes O(dN).

2. The matrix inversion takes (naïvely) O(d^3).

3. Note that, in practice, the matrix inversion step is never taken, and an appropriate matrix decomposition is used instead (more on this later on).

4. Whenever d > N, we can reformulate the previous expression in a more convenient form:

$$\mathbf{w}^* = \mathbf{X}^T \big( \mathbf{X} \mathbf{X}^T \big)^{-1} \mathbf{y} .$$


A very simple example

[Plot of y vs. x: the noisy observations, the real underlying line, and the OLS fit.]

Figure 3 : The data is generated from a linear model corrupted with Gaussian noise. In this case, it can be shown that OLS is perfectly unbiased.


OLS does not get us very far (for the moment)

[Plot of y vs. x: noisy observations from a nonlinear function, the real curve, and the (straight) OLS fit.]

Figure 4 : When the underlying model is highly nonlinear, OLS is severely limited.


Two basic approaches to supervised learning: k-nearest neighbors


Global vs. local models

OLS builds a single, global model for the entire input space R^d. As a result of this, a change in a single training point might influence the prediction of the model even very far away from the point itself.

We can think of doing something different, by building a local model where the prediction is influenced only by training points which are close (in some sense) to it.

As an example, the 1-nearest neighbor rule (1-NN) predicts a value by picking a single training point, i.e., the closest one. Commonly, the distance between two points x1 and x2 is measured with the Euclidean distance ‖x1 − x2‖2.
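A minimal from-scratch sketch of the 1-NN prediction rule with the Euclidean distance; the toy arrays are illustrative.

```python
import numpy as np

def predict_1nn(x, X_train, y_train):
    """Return the label of the training point closest to x (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x, axis=1)   # ||x - x_i||_2 for every i
    return y_train[np.argmin(distances)]

# Tiny illustrative training set for binary classification.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
y_train = np.array([0, 0, 1])
print(predict_1nn(np.array([2.5, 2.5]), X_train, y_train))   # 1
```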


An example of 1-NN


Figure 5 : Decision regions for a simple binary classification problem when using the 1-NN rule.


From 1-NN to k-NN

Even from a simple visual inspection, the previous approach has (at least) two problems. In the border region between two classes, the boundary is not smooth, while we intuitively prefer the opposite. Additionally, a single outlier point creates a region of blue among the red points.

We can obtain a better result by considering k neighbors instead of a single one, and taking the majority class for classification (breaking ties at random), or the average value for regression. The result is the so-called k-NN rule.
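A minimal sketch of the k-NN rule using scikit-learn's KNeighborsClassifier (the library is referenced later in these slides for a figure); the toy data and the choice k = 3 are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# k-NN classification with majority voting; data and k are illustrative.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [3.0, 3.0], [3.0, 2.0]])
y_train = np.array([0, 0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)   # k is a hyper-parameter
knn.fit(X_train, y_train)
print(knn.predict(np.array([[2.5, 2.5]])))  # majority class among the 3 neighbors
```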


k-NN classification with k = 5


Figure 6 : Decision regions for a simple binary classification problem when using the 5-NN rule.


k-NN classification with k = 15


Figure 7 : Decision regions for a simple binary classification problem when using the 15-NN rule.


Additional considerations on the k-NN

The k-NN algorithm can build a very complex, nonlinear model defined implicitly from the training data. In this sense, it belongs to a family of learning methods called instance-based algorithms.

In its basic form, the optimization step does not exist, since the entire burden (in terms of memory and computation) is incurred during the prediction phase for computing distance measures. Many advanced data structures can be used to make this step faster than a naïve implementation.


Weighted k-NN

Denote by N_k(x) the indexes of the k points in our training set closest to x. For regression, the basic k-NN is expressed as:

$$f(\mathbf{x}) = \frac{1}{k} \sum_{i \in N_k(\mathbf{x})} y_i .$$

More generally, we can think of giving more importance to points which are closer to x:

$$f(\mathbf{x}) = \sum_{i \in N_k(\mathbf{x})} d_i\, y_i \qquad \text{with} \qquad d_i = \frac{\|\mathbf{x} - \mathbf{x}_i\|_2^{-1}}{\sum_{j \in N_k(\mathbf{x})} \|\mathbf{x} - \mathbf{x}_j\|_2^{-1}} ,$$

where the weights d_i are normalized inverse distances, so that closer neighbors contribute more.

In this form, the k-NN is a special case of the Nadaraya-Watson estimator.
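A minimal sketch comparing uniform and distance-weighted k-NN regression with scikit-learn's KNeighborsRegressor, whose 'distance' option uses inverse-distance weights; the 1-D data is illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Noisy samples of a smooth 1-D function (illustrative data).
X = np.sort(rng.uniform(0, 5, size=40)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=40)

for weights in ("uniform", "distance"):   # plain average vs. inverse-distance weights
    knn = KNeighborsRegressor(n_neighbors=5, weights=weights)
    knn.fit(X, y)
    print(weights, knn.predict([[2.5]]))
```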


Weighted k-NN vs. uniform k-NN for regression

Figure 8 : Image source: Nearest neighbors in scikit-learn


Parametric vs. non-parametric models

An additional crucial difference between k-NN and OLS is given by the number of parameters which can be adapted. OLS has a fixed number of parameters, given by d. On the contrary, k-NN does not have an immediate definition of parameters; in fact, for a reasonably small k its model can take arbitrary shapes by considering more data.

In the statistical literature, the k-NN is called a non-parametric learning model, since its complexity can potentially grow without bound with the size of the training set, as opposed to a parametric model such as OLS.


Additional topics: Overfitting and model complexity


Underfitting and overfitting

One interesting aspect of 1-NN is that it always has 0 empirical (training) error. However, its expected error on points outside the training set will probably be much larger than that of 5-NN in our previous toy example. We say that 5-NN generalizes better than 1-NN or, alternatively, that 1-NN has overfitted the training data.

On the opposite side, 15-NN had both training and test errors higher than what can be achieved by properly varying k. In this case, we generally talk about underfitting of the algorithm.

Learning is very different from simple memorization of patterns: algorithms must be able to properly extrapolate from data in order to make useful predictions on new points.


Evaluating overfitting

A simple way to evaluate whether we are overfitting our dataset is to split the original dataset into two parts (holdout method):

- A training set to select the appropriate parameters for our model.

- A test set (independent of the training set) to evaluate the error.

If the error on the test set is significantly higher than the error on the training set, we are probably overfitting. In the following lectures, we will show some more advanced procedures to obtain results which are more statistically significant.

In classification problems, if the proportions of labels are kept in both datasets, we talk about a stratified holdout.
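A minimal sketch of a stratified holdout with scikit-learn's train_test_split; the synthetic data, the 1-NN model, and the 30% test fraction are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Illustrative binary classification data.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Stratified holdout: label proportions are preserved in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
print("train accuracy:", knn.score(X_tr, y_tr))   # 1.0 (each point is its own neighbor)
print("test accuracy: ", knn.score(X_te, y_te))   # usually lower: a sign of overfitting
```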


Visual depiction of holdout

Figure 9 : Taken from the Google Research blog (source). Check out the link to understand why the analyst is so sad.


Model complexity and hyperparameters

The k in k-NN is called a complexity parameter. By varying it, we can go from an extremely simple model (constant over the dataset) to a very nonlinear model. In most situations, overfitting occurs when the complexity of the model is too high with respect to the data, and it is not kept 'under control'.

More generally, any parameter which can be chosen by the user is called a hyper-parameter. The process of selecting a proper hyper-parameter for our dataset is called model selection.


Visual depiction of overfitting

Figure 10 : Taken from Wikipedia. On the x-axis we have some measure of complexity, on the y-axis some measure of error. Blue and red denote the error over the train and test sets, respectively.


Additional topics: Statistical decision theory


Expected risk

In order to provide more rigor to the previous concepts, we begin by noting that our real objective is to find some function f such that our error is minimized on average over any possible point:

$$I[f] = \mathbb{E}\big[ L(y, f(\mathbf{x})) \big] = \int L(y, f(\mathbf{x}))\, p(\mathbf{x}, y)\, d\mathbf{x}\, dy .$$

The previous quantity is known as the expected risk. The fields of statistical learning theory and probably approximately correct (PAC) learning are devoted to finding bounds that connect I[f] and the empirical risk, asymptotically and for finite dataset sizes.


Expected risk for the squared loss

For the squared loss, we need to find the f that minimizes:

$$I[f] = \mathbb{E}_{\mathbf{x}} \mathbb{E}_{y \mid \mathbf{x}} \big[ (y - f(\mathbf{x}))^2 \big] ,$$

where E_{y|x} denotes the conditional expectation with respect to p(y | x). For any possible x, the minimum is given by taking:

$$f(\mathbf{x}) = \mathbb{E}[y \mid \mathbf{x}] .$$

This is called the Bayes function because it achieves the lowest possible risk. We are just saying that, if we had access to the true probability distributions, the squared loss would be minimized by taking the expected y value for each x.


An explanation for our methods

- k-NN approximates the Bayes function by considering an empirical average over the points in the neighborhood of x, instead of the entire input domain.

- By contrast, OLS is a model-based technique, where we specify a parametric form for our estimator and then try to find the closest fit to the Bayes function. In particular, the optimal least squares solution would be given by:

$$\mathbf{w}^* = \mathbb{E}[\mathbf{x}\mathbf{x}^T]^{-1}\, \mathbb{E}[\mathbf{x} y] .$$

Most of this course will be concerned with the second approach, starting from linear models up to more complex formulations such as support vector machines and artificial neural networks.


The bias-variance tradeoff

To get some additional insights, let us assume that our observations are given by:

y = f (x) + ε ,

where f(x) is a deterministic function of x and ε is distributed according to a Gaussian distribution with mean 0 and variance σ^2. In this case, we can write our expected risk over a single point x as:

$$\mathbb{E}\big[(y - \hat{f}(\mathbf{x}))^2\big] = \mathrm{Bias}\big(\hat{f}(\mathbf{x})\big)^2 + \mathrm{Var}\big(\hat{f}(\mathbf{x})\big) + \sigma^2 ,$$

where $\hat{f}$ denotes the function learned from a particular dataset, and the expectation is taken over all possible datasets (and the noise).
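For completeness, a short derivation of the decomposition (a standard argument, assuming the test noise ε has zero mean and is independent of the training set used to build f̂):

```latex
\begin{align*}
\mathbb{E}\big[(y - \hat{f}(\mathbf{x}))^2\big]
  &= \mathbb{E}\big[(f(\mathbf{x}) + \varepsilon - \hat{f}(\mathbf{x}))^2\big] \\
  &= \mathbb{E}\big[(f(\mathbf{x}) - \hat{f}(\mathbf{x}))^2\big]
     + 2\,\mathbb{E}[\varepsilon]\,\mathbb{E}\big[f(\mathbf{x}) - \hat{f}(\mathbf{x})\big]
     + \mathbb{E}[\varepsilon^2] \\
  &= \mathbb{E}\big[\big(f(\mathbf{x}) - \mathbb{E}[\hat{f}(\mathbf{x})]
     + \mathbb{E}[\hat{f}(\mathbf{x})] - \hat{f}(\mathbf{x})\big)^2\big] + \sigma^2 \\
  &= \underbrace{\big(f(\mathbf{x}) - \mathbb{E}[\hat{f}(\mathbf{x})]\big)^2}_{\mathrm{Bias}(\hat{f}(\mathbf{x}))^2}
     + \underbrace{\mathbb{E}\big[\big(\hat{f}(\mathbf{x}) - \mathbb{E}[\hat{f}(\mathbf{x})]\big)^2\big]}_{\mathrm{Var}(\hat{f}(\mathbf{x}))}
     + \sigma^2 ,
\end{align*}
% The cross terms vanish because E[eps] = 0 and E[f_hat - E[f_hat]] = 0.
```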


The bias-variance tradeoff (2)

The first term in the previous equation is given by:

$$\mathrm{Bias}\big(\hat{f}(\mathbf{x})\big) = \mathbb{E}\big[\hat{f}(\mathbf{x}) - f(\mathbf{x})\big] .$$

This is a modeling error, which is caused by choosing some functional form for our estimator instead of the true one. The second term is given by:

$$\mathrm{Var}\big(\hat{f}(\mathbf{x})\big) = \mathbb{E}\Big[\big(\hat{f}(\mathbf{x}) - \mathbb{E}[\hat{f}(\mathbf{x})]\big)^2\Big] .$$

This term denotes the variance of our predictions with respect to the specific choice of a dataset.


Some additional considerations

- If f(x) is linear, it can be proven that for additive Gaussian noise OLS has bias 0. Under some additional assumptions, it can be shown that OLS has the lowest variance among all possible unbiased estimators.

- In the more general case, OLS might have a large bias. In principle, this bias can be reduced if we allow for an increase in variance, a point that will return in a future lecture.

- More interestingly, in the case of k-NN the variance can be written in closed form as:

$$\mathrm{Var}\big(\hat{f}(\mathbf{x})\big) = \frac{\sigma^2}{k} .$$


Universal consistency

Note that the variance of k-NN decreases as k increases, at the possible expense of the bias. For N, k → ∞, if we let k/N → 0, then it can be shown that the k-NN estimate converges in probability to the Bayes function for any possible input/output distribution. In technical terms, we say that k-NN is universally consistent [2].

However, this tells us nothing about the rate of convergence for finite N. More importantly, this rate might be extremely slow for large d, as we discuss next.

[2] Stone, C.J., 1977. Consistent nonparametric regression. The Annals of Statistics, pp. 595-620.


Additional topics: The curse of dimensionality


The curse of dimensionality

The accuracy of k-NN in high-dimensional spaces (large d) can suffer from a problem called the curse of dimensionality. Very roughly, high-dimensional spaces have non-intuitive properties which can be problematic for any instance-based method.

As an example, the sampling density in a space is proportional to N^{1/d}; if N patterns give a certain density with a single input, we need N^2 patterns to obtain the same density with two inputs (and N^d with d inputs).

As another example, the median distance from the origin to the closest data point (for N points uniformly distributed in the unit ball) follows a similar power law:

$$\text{median distance} = \left( 1 - \left(\tfrac{1}{2}\right)^{1/N} \right)^{1/d} .$$
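A quick numerical check of this formula, under the unit-ball setting mentioned above: already for d = 10, the median distance to the closest of N = 500 points is larger than 0.5, so the 'nearest' neighbor is not local at all.

```python
import numpy as np

def median_closest_distance(N, d):
    """Median distance from the origin to the closest of N uniform points in the unit ball."""
    return (1 - 0.5 ** (1.0 / N)) ** (1.0 / d)

for d in (1, 2, 10, 100):
    print(d, median_closest_distance(500, d))
# d = 10 already gives ~0.52: the nearest neighbor is, in median, closer to the
# boundary than to the origin, so "local" neighborhoods stop being local.
```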


Visualizing the curse of dimensionality

Figure 11 : http://www.newsnshit.com/curse-of-dimensionality-interactive-demo/


Additional topics: More on distance functions


Metric definition

Apart from the Euclidean distance, there is a vast literature on possible distance functions to be used in learning algorithms. In fact, any function d : R^d × R^d → R is a valid metric (or distance) if for any x1, x2, x3 ∈ R^d the following properties are satisfied:

1. Non-negativity: d(x1, x2) ≥ 0.

2. Identity: d(x1, x2) = 0 if and only if x1 = x2.

3. Symmetry: d(x1, x2) = d(x2, x1).

4. Triangle inequality: d(x1, x3) ≤ d(x1, x2) + d(x2, x3).

The Euclidean distance is a valid metric, but it is not the only one.


Manhattan distance

Another popular distance is the Manhattan (or taxicab) metric, given by:

$$d(\mathbf{x}_1, \mathbf{x}_2) = \sum_{i=1}^{d} |x_{1i} - x_{2i}| .$$

Figure 12 : Visual depiction of the Manhattan metric [Wikipedia].


Minkowski distances

The Euclidean distance and the Manhattan distance are two special cases of a general class of metrics, called the Minkowski metrics:

$$d_p(\mathbf{x}_1, \mathbf{x}_2) = \left( \sum_{i=1}^{d} |x_{1i} - x_{2i}|^p \right)^{1/p} .$$

By taking the limit p → ∞, we obtain another interesting metric called the Chebyshev distance:

$$d_{\text{Chebyshev}}(\mathbf{x}_1, \mathbf{x}_2) = \max_{i=1,\ldots,d} |x_{1i} - x_{2i}| .$$
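A small plain-NumPy sketch of the Minkowski family and of its p → ∞ limit; the two vectors are arbitrary.

```python
import numpy as np

def minkowski(x1, x2, p):
    """Minkowski distance of order p between two vectors."""
    return np.sum(np.abs(x1 - x2) ** p) ** (1.0 / p)

x1 = np.array([0.0, 0.0, 0.0])
x2 = np.array([3.0, 4.0, 1.0])

print(minkowski(x1, x2, 1))          # Manhattan: 8.0
print(minkowski(x1, x2, 2))          # Euclidean: ~5.10
print(minkowski(x1, x2, 50))         # approaches the Chebyshev distance...
print(np.max(np.abs(x1 - x2)))       # ...which is 4.0
```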


Normalized distances

Consider the following set of points:

[Scatter plot of points whose x-coordinates range roughly from −5 to 10, while the y-coordinates only range from about −0.5 to 2.5.]

Using the Euclidean distance, a 10% change along the x-axis has a lot more influence than an equivalent change along the y-axis.


Normalized distances (2)

One way to solve this is to rescale the points in such a way that all axes have the same importance. Alternatively, we can use a weighted distance measure to achieve this:

$$d(\mathbf{x}_1, \mathbf{x}_2) = \sqrt{ \sum_{i=1}^{d} \frac{1}{\delta_i^2} (x_{1i} - x_{2i})^2 } ,$$

where δ_i^2 is the empirical variance along the i-th axis. Even more generally, for a given matrix A we can compute:

$$d(\mathbf{x}_1, \mathbf{x}_2) = \sqrt{ (\mathbf{x}_1 - \mathbf{x}_2)^T \mathbf{A}\, (\mathbf{x}_1 - \mathbf{x}_2) } .$$

If we select A as the inverse of the empirical covariance matrix, we obtain the Mahalanobis distance.
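A minimal NumPy sketch of the Mahalanobis distance computed from a data matrix; the synthetic data is illustrative, and scipy.spatial.distance.mahalanobis performs the same computation given the inverse covariance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data with very different scales (and some correlation) per axis.
X = rng.normal(size=(200, 2)) @ np.array([[5.0, 0.0], [1.0, 0.5]])

A = np.linalg.inv(np.cov(X, rowvar=False))   # inverse empirical covariance matrix

def mahalanobis(x1, x2, A):
    """sqrt((x1 - x2)^T A (x1 - x2)), with A the inverse covariance matrix."""
    diff = x1 - x2
    return np.sqrt(diff @ A @ diff)

print(mahalanobis(X[0], X[1], A))
print(np.linalg.norm(X[0] - X[1]))           # plain Euclidean distance, for comparison
```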


Further readings

Most of this lecture is taken from Chapter 2 of 'The Elements of Statistical Learning'.

For someone interested in the theoretical aspects of statistical learning, the following two textbooks provide a good entry point:

[1] Mohri, M., Rostamizadeh, A. and Talwalkar, A., 2012. Foundations of Machine Learning. MIT Press.

[2] Shalev-Shwartz, S. and Ben-David, S., 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.

The book Statistical Learning Theory by V. Vapnik remains an essential historical reading.