Transcript of PART I: INTRODUCTION TO STATISTICAL LEARNING
Donglin Zeng, Department of Biostatistics, University of North Carolina
Introduction
My definition of statistical learning
- Statistical learning is a framework that combines statistical reasoning and computing methods to develop analytic and predictive tools from available data, with the twin goals of extracting information from the data and predicting the future.
Statistical reasoning
- Statistical reasoning emphasizes that:
  - data are always filled with random errors and are assumed to come from some generating mechanism that follows a probability law;
  - available data (the training data or training sample) are only a random copy of future applications and may or may not contain bias, so any method that fits the training data perfectly is believed to fit future data badly.
- Statistical reasoning focuses on developing methods that are generalizable and statistically optimal under certain randomness assumptions.
- Statistical reasoning looks for the answer to the question of data MECHANISM.
Computing method
- A computing method emphasizes:
  - one well-defined objective function to optimize that aligns with the goal in future applications;
  - an effective and efficient optimization algorithm, while accounting for the constraints needed to prevent overfitting and to be resilient to potential bias.
- Computing methods focus on developing algorithms that can solve an optimization problem efficiently and produce results that are useful in practice.
- Computing methods look for the answer to the question of data UTILITY; a toy sketch of this view follows.
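As a concrete illustration (my own toy example, not from the slides), the R sketch below builds one well-defined objective, squared error plus a ridge-type penalty acting as the overfitting constraint, and hands it to a general-purpose optimizer; the simulated data, the penalty form, and lambda = 0.1 are all assumptions.

```r
# Hypothetical objective: squared-error fit plus a ridge penalty, the
# "constraint" guarding against overfitting; lambda = 0.1 is assumed.
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)      # 100 observations, 2 features
y <- x %*% c(1, -0.5) + rnorm(100)     # toy data-generating mechanism
objective <- function(beta, lambda = 0.1) {
  sum((y - x %*% beta)^2) + lambda * sum(beta^2)
}
# A generic, efficient optimization algorithm minimizes the objective.
fit <- optim(par = c(0, 0), fn = objective, method = "BFGS")
fit$par                                # estimated coefficients
```

The penalty term is what keeps the objective aligned with future applications rather than with a perfect fit to the training data.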
- Statistical learning includes the development of both statistical reasoning methods and computational algorithms, as well as analytic and predictive tools (software development).
- Usually, the two main goals of statistical learning are:
  - future prediction (supervised learning);
  - data pattern analysis (unsupervised learning).
- Disciplines involved in modern statistical learning: probability and statistics, computer science, data science (?), informatics, and all subject-matter applications.
- Other names in use: machine learning, data mining, pattern recognition, data analytics, predictive analytics.
How does it differ from classical statistical inference?
- Traditional statistical inference focuses on statistical reasoning, i.e., on understanding the distributional behavior of the data.
- In statistical inference, estimation and hypothesis testing (inference) for distribution parameters are of most interest; statistical properties such as unbiasedness, consistency, and efficiency are the main concerns.
- In statistical learning, the distribution is the backbone, but its estimation is of less concern than the learning goals themselves, such as prediction and feature extraction.
- Most often, prediction accuracy is the most important quantity in statistical learning; thus, prediction rule consistency and risk bounds are widely studied in the statistical learning literature.
Statistical learning and statistical inference share the same reasoning philosophy
- Both assume that data are randomly generated from some underlying distribution, and so both account for random behavior in their procedures.
- Both rely on data-dependent objective functions for estimation and inference.
- Both, directly or indirectly, involve statistical models for the data and computational algorithms for execution.
- Specifically, supervised learning is analogous to regression, and unsupervised learning to density estimation.
Challenges in statistical learning
- Method: which methods achieve the prediction goals?
- Data: how do we handle data complexity (dimensionality, heterogeneous structure, missing data, etc.) relative to those goals?
- Algorithm: which computational algorithms are feasible and efficient for computing predictions?
- Inference: given data randomness, what assurance do we have about performance on future data or applications?
Example 1. Email Spam Data
Example 2. Prostate Cancer Data
Example 3. Handwritten Digit Data
Example 4. DNA Expression Data
Overview of my lectures
- I will introduce a number of statistical and machine learning methods.
- I will discuss the probabilistic and statistical theory behind the learning methods.
- Computational algorithms and examples will be used throughout the lectures.
What you should know
- Many data examples and figures are taken from Hastie, Tibshirani and Friedman's book.
- A number of R algorithms and examples are taken from a variety of publicly available web sources.
- All errors in this course are mine.
Statistical Decision Theory
Basic set-up for Supervised Learning
- The goal of supervised learning is to learn a rule that predicts an outcome from feature variables.
- The variable components in supervised learning are:
  - X: feature variables (continuous, categorical, structured or unstructured) with domain 𝒳;
  - Y: outcome variable (continuous, categorical, ordinal or even functional) with domain 𝒴.
- We assume that (X, Y) follows a probability distribution P(X, Y) (abbreviated as P).
- A prediction rule is a function from 𝒳 to 𝒴: y = f(x).
- Supervised learning learns a prediction rule using training data (X1, Y1), ..., (Xn, Yn); a toy example of this set-up follows.
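To make the set-up concrete, here is a minimal R sketch (not from the slides) that draws training data from one hypothetical distribution P and defines one candidate prediction rule; the sine mean and the noise level are assumptions.

```r
# Toy joint distribution P(X, Y): X uniform on [-2, 2], Y = sin(X) + noise.
set.seed(1)
n <- 100
x <- runif(n, -2, 2)               # features X_i
y <- sin(x) + rnorm(n, sd = 0.3)   # outcomes Y_i under the assumed law P
f <- function(x) sin(x)            # one candidate prediction rule y = f(x)
head(cbind(x, y, fitted = f(x)))   # training pairs and their predictions
```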
Loss function for inaccurate prediction
- A loss function is a function that quantifies the error incurred by the imprecision of a prediction rule.
- It is a map from 𝒴 × 𝒳 × {f : f is a prediction rule} to [0, ∞),
  (y, x, f) → L(y, x, f),
  such that L(y, x, f) = 0 if y = f(x).
- For most applications, L(y, x, f) = L(y, f(x)), i.e., some distance measuring how far the predicted value f(x) is from the outcome value y. But we can also use an outcome-feature-dependent metric, for example w(y, x)L(y, f(x)), where w(y, x) is a non-negative weight function; see the sketch after this list.
- Specifying a loss function depends on (a) the goal of the task and (b) the desired statistical and computational properties of the loss itself.
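A minimal R sketch of the definition, where the weight function w and the rule f are hypothetical choices; the point is only that the loss is non-negative and vanishes exactly when y = f(x).

```r
# Outcome-feature-dependent loss L(y, x, f) = w(y, x) * (y - f(x))^2.
w <- function(y, x) 1 + abs(x)                   # non-negative weight (assumed)
loss <- function(y, x, f) w(y, x) * (y - f(x))^2
f <- function(x) 2 * x                           # hypothetical prediction rule
loss(y = 4, x = 2, f)                            # 0: f(2) = 4 matches y
loss(y = 5, x = 2, f)                            # 3: weighted squared error
```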
Examples of loss functions
- Continuous Y:
  - squared loss: L(y, f(x)) = (y − f(x))^2; the most commonly used, strictly convex;
  - absolute deviation loss (L1 loss): L(y, f(x)) = |y − f(x)|; convex, more robust;
  - more generally, the Huber loss: L(y, f(x)) = (y − f(x))^2 if |y − f(x)| < δ, and 2δ|y − f(x)| − δ^2 otherwise.
- Categorical Y:
  - zero-one (0-1) loss: L(y, f(x)) = I(y ≠ f(x));
  - when Y is binary and labelled as 1 vs. −1 (commonly used in supervised learning), L(y, f(x)) = I(yf(x) < 0);
  - weighted zero-one loss: L(y, f(x)) = w(y)I(y ≠ f(x)) (applications include cancer diagnosis).
- Ordinal Y, taking values y1 < y2 < ... < yk:
  - we can still use the zero-one loss by treating Y as categorical;
  - or we can use the squared loss L(y, f(x)) = (s_y − f(x))^2, where s_y is a score assigned to the value y;
  - an interesting loss function is the preference loss, defined for a pair of data points: L(y1, f(x1), y2, f(x2)) = 1 − I(y1 < y2, f(x1) < f(x2)).
All of these losses are implemented in the sketch below.
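The losses above translate directly into R; a minimal sketch, with the Huber threshold δ and the weight function w(y) taken as assumptions:

```r
# Losses for continuous Y
sq_loss    <- function(y, fx) (y - fx)^2             # squared loss
abs_loss   <- function(y, fx) abs(y - fx)            # absolute deviation (L1)
huber_loss <- function(y, fx, delta = 1) {           # Huber loss, slide's form
  r <- abs(y - fx)
  ifelse(r < delta, r^2, 2 * delta * r - delta^2)
}
# Losses for categorical Y
zero_one    <- function(y, fx) as.numeric(y != fx)      # I(y != f(x))
zero_one_pm <- function(y, fx) as.numeric(y * fx < 0)   # labels in {1, -1}
w_zero_one  <- function(y, fx, w) w(y) * (y != fx)      # weighted zero-one
# e.g., penalize missing a positive case twice as much (weights assumed):
w_zero_one(y = 1, fx = -1, w = function(y) ifelse(y == 1, 2, 1))  # returns 2
```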
Plot of loss functions for continuous Y
[Figure: squared, absolute deviation, and Huber losses plotted against y − f(x) on [−2, 2]; loss values range from 0 to 3.]
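A minimal R sketch that reproduces such a plot, with an assumed Huber threshold δ = 1:

```r
# Plot squared, absolute, and Huber losses against the residual r = y - f(x).
r <- seq(-2, 2, length.out = 401)
delta <- 1
huber <- ifelse(abs(r) < delta, r^2, 2 * delta * abs(r) - delta^2)
plot(r, r^2, type = "l", ylim = c(0, 3), xlab = "y - f(x)", ylab = "loss")
lines(r, abs(r), lty = 2)
lines(r, huber, lty = 3)
legend("top", c("squared", "absolute", "Huber"), lty = 1:3, bty = "n")
```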
Risk function
- Since (Y, X) comes from a distribution P, we define the risk function of any prediction rule f as the expectation of the loss function:
  R(f) = E_P[L(Y, f(X))].
- R(f) equals the average prediction loss (error) incurred if we apply the prediction rule f to the population where (Y, X) follows the probability law P.
- The optimal prediction rule, denoted f*(x), is a rule minimizing the risk function:
  f* = argmin_f R(f).
- Certainly, there are many other ways to summarize the error due to the prediction rule f (median risk, minimax), but the expectation is the most common in statistical learning; a Monte Carlo sketch of R(f) follows.
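When P can be sampled from, R(f) can be approximated by averaging the loss over a large Monte Carlo draw; a sketch under the toy distribution used earlier (sine mean, noise sd 0.3), with the squared loss:

```r
# Monte Carlo approximation of R(f) = E_P[L(Y, f(X))] under a toy P.
set.seed(1)
m <- 1e5
x <- runif(m, -2, 2)
y <- sin(x) + rnorm(m, sd = 0.3)
risk <- function(f) mean((y - f(x))^2)   # empirical version of R(f)
risk(function(x) sin(x))   # near the irreducible error 0.3^2 = 0.09
risk(function(x) 0 * x)    # the constant rule 0 incurs a larger risk
```

Under the squared loss, the risk minimizer f*(x) = E[Y | X = x] is the conditional mean, here sin(x), which is why the first rule attains nearly the smallest possible risk.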
Specific goals for supervised learning
- Given a loss function L(y, f(x)), what is the optimal prediction rule f*?
- With training data (X1, Y1), ..., (Xn, Yn), how can we estimate or learn the optimal prediction rule?
- What computing algorithms can be used?
- How do we evaluate the performance of the learned prediction rule?
- What are the statistical properties of the estimated rule?
Compared to classical maximum likelihood estimation
- The goal is to learn the underlying distribution P, assumed to come from a probability model Pθ.
- Thus, the prediction rule f is equivalent to θ.
- The loss function becomes the negative log-likelihood function.
- The maximum likelihood estimator (MLE) based on the training data is used to estimate θ.
- Optimization algorithms can be used to compute the MLE; with missing data, we often use the EM algorithm.
- Under some assumptions, the MLE is consistent, asymptotically normal, and efficient.
- In some sense, classical statistical inference is one special type of statistical learning, but the goal is to learn distribution parameters instead of a prediction rule; a small sketch of this correspondence follows.
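A small R sketch of the correspondence, using a toy normal-mean model with simulated data: the negative log-likelihood plays the role of the loss, and minimizing it over θ yields the MLE.

```r
# MLE as loss minimization: loss = negative log-likelihood of N(theta, 1).
set.seed(1)
y <- rnorm(50, mean = 2, sd = 1)
negloglik <- function(theta) -sum(dnorm(y, mean = theta, sd = 1, log = TRUE))
opt <- optimize(negloglik, interval = c(-10, 10))
c(mle = opt$minimum, sample_mean = mean(y))  # the two agree for this model
```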