Artificial Intelligence Lecture 2 Dr. Bo Yuan, Professor Department of Computer Science and...
Artificial Intelligence, Lecture 2
Dr. Bo Yuan, Professor, Department of Computer Science and Engineering
Shanghai Jiaotong University
Review of Lecture One
• Overview of AI
– Knowledge-based rules in logic (expert systems, automata, …): symbolism in logic
– Kernel-based heuristics (neural networks, SVMs, …): connectionism for nonlinearity
– Learning and inference (Bayesian, Markovian, …): to sparsely sample for convergence
– Interactive and stochastic computing (uncertainty, heterogeneity): to overcome the limits of the Turing machine
• Course Content
– Focus mainly on learning and inference
– Discuss current problems and research efforts
– Perception and behavior (vision, robotics, NLP, bionics, …) not included
• Exam
– Papers (Nature, Science, Nature Reviews, Reviews of Modern Physics, PNAS, TICS)
– Course materials
Today's Content
• Overview of machine learning
• Linear regression
– Gradient descent
– Least-squares fit
– Stochastic gradient descent
– The normal equation
• Applications
Basic Terminologies
• x = input variables/features
• y = output variables/target variables
• (x, y) = a training example; the ith training example is (x^{(i)}, y^{(i)})
• m = number of training examples (i = 1, …, m)
• n = number of input variables/features (j = 0, …, n)
• h(x) = hypothesis/function/model that outputs the predicted value for a given input x
• θ = parameters/weights, which parameterize the mapping from x to its predicted value
• We define x_0 = 1 (the intercept term), which allows a matrix representation:
h(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \sum_{i=0}^{n} \theta_i x_i = \theta^T x
Gradient Descent

The hypothesis is:

h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x

The Cost Function is defined as:

J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2

Using the matrix X to represent the training samples:

X = \begin{bmatrix} (x^{(1)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}

Gradient descent is based on the partial derivatives of J with respect to θ:

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)

The algorithm is therefore:

Loop {
    \theta_j := \theta_j + \alpha \sum_{i=1}^{m} \left( y^{(i)} - h(x^{(i)}) \right) x_j^{(i)}    (for every j)
}

There is an alternative way to iterate, called stochastic gradient descent, which updates θ with one training example at a time:

Loop over i = 1, …, m {
    \theta_j := \theta_j + \alpha \left( y^{(i)} - h(x^{(i)}) \right) x_j^{(i)}    (for every j)
}
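The batch and stochastic updates above can be sketched in NumPy. This is a minimal illustration; the learning rates, toy data, and iteration counts are arbitrary choices, not from the lecture:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, iters=1000):
    """Batch rule: theta_j += alpha * sum_i (y^(i) - h(x^(i))) * x_j^(i)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        errors = y - X @ theta           # (y^(i) - h(x^(i))) for all i at once
        theta += alpha * (X.T @ errors)  # apply the summed gradient
    return theta

def stochastic_gradient_descent(X, y, alpha=0.05, epochs=300):
    """Stochastic rule: update theta after each single training example."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for i in range(m):
            error = y[i] - X[i] @ theta
            theta += alpha * error * X[i]
    return theta

# Toy data: y = 1 + 2x exactly, with x0 = 1 as the intercept column
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(batch_gradient_descent(X, y))  # approaches [1, 2]
```

Note that the batch version touches all m examples per update, while the stochastic version starts making progress immediately, which matters when m is large.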
Normal Equation

An explicit way to obtain θ directly, without iteration. For the optimization problem, we set the derivatives of J(θ) to zero and obtain the normal equations:

X^T X \theta = X^T y \quad\Rightarrow\quad \theta = (X^T X)^{-1} X^T y
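The closed-form solution can be checked in a few lines of NumPy. This is a sketch with toy data; `np.linalg.solve` is used instead of forming the explicit inverse, which is the numerically preferable way to solve the normal equations:

```python
import numpy as np

# Design matrix with x0 = 1 as the intercept column; y = 1 + 2x exactly
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Normal equations: (X^T X) theta = X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # [1. 2.]
```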
Today's Content
• Linear Regression
– Locally Weighted Regression (an adaptive method)
• Probabilistic Interpretation
– Maximum Likelihood Estimation vs. Least Squares (Gaussian distribution)
• Classification by Logistic Regression
– LMS updating
– A Perceptron-based Learning Algorithm
Linear Regression

Issues to consider:
1. Number of features
2. Over-fitting and under-fitting
3. Feature selection (to be covered later)
4. Adaptivity

Some definitions:
• Parametric learning: a fixed set of parameters θ, with n being constant
• Non-parametric learning: the number of parameters grows (linearly) with m
Locally-Weighted Regression (Loess/Lowess Regression), a non-parametric method:
• A bell-shaped weighting function (not a Gaussian)
• Every prediction requires fitting over the entire training data set for the given query input (high computational complexity)
• The locally fitted model is still linear:

h(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n
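A minimal sketch of locally weighted regression in NumPy. The Gaussian-like bell kernel and the bandwidth `tau` here are illustrative choices; the lecture only specifies that the weighting is bell-shaped:

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    """Predict y at x_query by solving a weighted least-squares problem
    over the ENTIRE training set for every query (hence the cost)."""
    # Bell-shaped weights centered at the query point
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    # Weighted normal equations: (X^T W X) theta = X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(lwr_predict(np.array([1.0, 1.5]), X, y))  # ~4.0 on this exactly linear data
```

Because θ is re-solved per query, nothing is "learned" ahead of time: the model effectively stores the whole training set, which is what makes it non-parametric.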
Extension of Linear Regression

• Linear additive (straight line): x_1 = 1, x_2 = x
• Polynomial: x_1 = 1, x_2 = x, …, x_n = x^{n-1}
• Chebyshev orthogonal polynomials: x_1 = 1, x_2 = x, …, x_n = 2x\,x_{n-1} - x_{n-2}
• Fourier trigonometric polynomials: x_1 = 0.5, followed by sines and cosines of different frequencies of x
• Pairwise interactions: linear terms + x_{k_1} x_{k_2} (k = 1, …, N)
• …
• The central problem underlying these representations is whether or not the optimization process for θ remains convex.

All of these keep the same linear-in-parameters form:

h(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n
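Each extension only changes the feature map; the model stays linear in θ, so the same normal equations apply unchanged. A sketch with arbitrary toy data, using a polynomial basis:

```python
import numpy as np

def poly_features(x, degree):
    """Map scalar inputs x to [1, x, x^2, ..., x^degree]: still linear in theta."""
    return np.vander(x, degree + 1, increasing=True)

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 - x + 3.0 * x ** 2          # exactly quadratic data
X = poly_features(x, 2)
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # recovers [2, -1, 3]
```

The fit is nonlinear in x but the optimization over θ is still the same convex least-squares problem.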
Probabilistic Interpretation

• Why Ordinary Least Squares (OLS)? Why not other power terms?
– Assume y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}, where \epsilon^{(i)} is random noise with \epsilon^{(i)} \sim N(0, \sigma^2)
– The PDF for the Gaussian is:

p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(\epsilon^{(i)})^2}{2\sigma^2} \right)

– This implies that:

p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right)

– Or, y^{(i)} \mid x^{(i)}; \theta \sim N(\theta^T x^{(i)}, \sigma^2)

Why Gaussian for the random noise? The central limit theorem.
Maximum Likelihood

• Consider the training data to be stochastic: y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}
• Assume the \epsilon^{(i)} are i.i.d. (independent and identically distributed)
– The likelihood L(\theta) is the probability of y given X, parameterized by \theta:

L(\theta) = L(\theta; X, y) = p(y \mid X; \theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta)

• What is Maximum Likelihood Estimation (MLE)?
– Choose the parameters \theta that maximize L(\theta), so as to make the training data set as probable as possible
– L(\theta) is the likelihood of the parameters, and the probability of the data
The Equivalence of MLE and OLS

Maximizing the log-likelihood \ell(\theta) = \log L(\theta) turns out to be equivalent to minimizing \frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2 = J(\theta)!?
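The equivalence follows by taking the log of the likelihood, filling in the standard derivation the slide compresses:

```latex
\ell(\theta) = \log L(\theta)
            = \sum_{i=1}^{m} \log \frac{1}{\sqrt{2\pi}\,\sigma}
              \exp\!\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right)
            = m \log \frac{1}{\sqrt{2\pi}\,\sigma}
              - \frac{1}{\sigma^2} \cdot \frac{1}{2}
                \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2
```

The first term and the factor 1/\sigma^2 do not depend on \theta, so maximizing \ell(\theta) is exactly minimizing J(\theta): MLE under Gaussian noise recovers ordinary least squares.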
Sigmoid (Logistic) Function

For binary classification, y ∈ {0, 1}, and we want h(x) ∈ (0, 1):

g(z) = \frac{1}{1 + e^{-z}}

h(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}

Other functions that smoothly increase from 0 to 1 can also be found, but for a couple of good reasons (we will see next time, with Generalized Linear Models) the choice of the logistic function is a natural one.
Recall the least-squares update rule (note the positive sign rather than negative, since we now maximize). Let's work with just one training example (x, y) and derive the Gradient Ascent rule:

\theta := \theta + \alpha \nabla_\theta \ell(\theta)
One Useful Property of the Logistic Function

g'(z) = g(z) \left( 1 - g(z) \right)

Identical to Least Squares Again?

Using this property, the partial derivative of the log-likelihood over the whole training set is:

\frac{\partial}{\partial \theta_j} \ell(\theta) = \sum_{i=1}^{m} \left( y^{(i)} - h(x^{(i)}) \right) x_j^{(i)}

which gives the stochastic gradient ascent update:

\theta_j := \theta_j + \alpha \left( y^{(i)} - h(x^{(i)}) \right) x_j^{(i)}

The rule looks identical to the LMS update for least squares, even though h(x) is now the nonlinear sigmoid.
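The gradient-ascent rule above can be sketched for logistic regression in NumPy (batch form; the data, learning rate, and iteration count are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_ascent(X, y, alpha=0.1, iters=5000):
    """Maximize the log-likelihood: theta_j += alpha * sum_i (y^(i) - h(x^(i))) x_j^(i).
    Structurally the same update as LMS, but h is now the sigmoid."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)
        theta += alpha * (X.T @ (y - h))  # note the positive sign: ascent
    return theta

# Toy 1-D data with x0 = 1 as intercept: class 1 roughly when x > 2
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = logistic_gradient_ascent(X, y)
print(sigmoid(X @ theta) > 0.5)  # separates the two classes
```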