Artificial Intelligence Lecture 2 Dr. Bo Yuan, Professor Department of Computer Science and...
Artificial Intelligence, Lecture 2
Dr. Bo Yuan, Professor, Department of Computer Science and Engineering
Shanghai Jiaotong University
Review of Lecture One
• Overview of AI
– Knowledge-based rules in logic (expert systems, automata, …): symbolism in logic
– Kernel-based heuristics (neural networks, SVMs, …): connectionism for nonlinearity
– Learning and inference (Bayesian, Markovian, …): to sparsely sample for convergence
– Interactive and stochastic computing (uncertainty, heterogeneity): to overcome the limits of the Turing machine
• Course Content
– Focus mainly on learning and inference
– Discuss current problems and research efforts
– Perception and behavior (vision, robotics, NLP, bionics, …) not included
• Exam
– Papers (Nature, Science, Nature Reviews, Reviews of Modern Physics, PNAS, TICS)
– Course materials
Today's Content
• Overview of machine learning
• Linear regression
– Gradient descent
– Least-squares fit
– Stochastic gradient descent
– The normal equation
• Applications
Basic Terminologies
• x = input variables/features
• y = output variables/target variables
• (x, y) = a training example; the ith training example is (x^{(i)}, y^{(i)})
• m = number of training examples (i = 1, …, m)
• n = number of input variables/features (j = 0, …, n)
• h(x) = hypothesis/function/model that outputs the predicted value for a given input x
• θ = parameters/weights, which parameterize the mapping from x to its predicted value
• We define x_0 = 1 (the intercept term), which allows a matrix representation:
h(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \sum_{i=0}^{n} \theta_i x_i = \theta^T x
Gradient Descent

The hypothesis is:

h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x

The Cost Function is defined as:

J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2

Using the matrix X to represent the training samples:

X = \begin{bmatrix} (x^{(1)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}

Gradient descent is based on the partial derivatives of J with respect to θ:

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)

The algorithm is therefore:

Loop {
    \theta_j := \theta_j + \alpha \sum_{i=1}^{m} \left( y^{(i)} - h(x^{(i)}) \right) x_j^{(i)}    (for every j)
}

There is an alternative way to iterate, called stochastic gradient descent, which updates θ with one training example at a time:

Loop over i = 1, …, m {
    \theta_j := \theta_j + \alpha \left( y^{(i)} - h(x^{(i)}) \right) x_j^{(i)}    (for every j)
}
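The batch and stochastic updates above can be sketched in NumPy. This is a minimal illustration; the learning rates, toy data, and iteration counts are arbitrary choices, not from the lecture:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, iters=1000):
    """Batch rule: theta_j += alpha * sum_i (y^(i) - h(x^(i))) * x_j^(i)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        errors = y - X @ theta           # (y^(i) - h(x^(i))) for all i at once
        theta += alpha * (X.T @ errors)  # apply the summed gradient
    return theta

def stochastic_gradient_descent(X, y, alpha=0.05, epochs=300):
    """Stochastic rule: update theta after each single training example."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for i in range(m):
            error = y[i] - X[i] @ theta
            theta += alpha * error * X[i]
    return theta

# Toy data: y = 1 + 2x exactly, with x0 = 1 as the intercept column
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(batch_gradient_descent(X, y))  # approaches [1, 2]
```

Note that the batch version touches all m examples per update, while the stochastic version starts making progress immediately, which matters when m is large.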
Normal Equation

An explicit way to obtain θ directly, without iteration. For the optimization problem, we set the derivatives of J(θ) to zero and obtain the normal equations:

X^T X \theta = X^T y \quad\Rightarrow\quad \theta = (X^T X)^{-1} X^T y
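The closed-form solution can be checked in a few lines of NumPy. This is a sketch with toy data; `np.linalg.solve` is used instead of forming the explicit inverse, which is the numerically preferable way to solve the normal equations:

```python
import numpy as np

# Design matrix with x0 = 1 as the intercept column; y = 1 + 2x exactly
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Normal equations: (X^T X) theta = X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # [1. 2.]
```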
Today's Content
• Linear Regression
– Locally Weighted Regression (an adaptive method)
• Probabilistic Interpretation
– Maximum Likelihood Estimation vs. Least Squares (Gaussian distribution)
• Classification by Logistic Regression
– LMS updating
– A Perceptron-based Learning Algorithm
Linear Regression

Issues to consider:
1. Number of features
2. Over-fitting and under-fitting
3. Feature selection (to be covered later)
4. Adaptivity

Some definitions:
• Parametric learning: a fixed set of parameters θ, with n being constant
• Non-parametric learning: the number of parameters grows (linearly) with m
Locally-Weighted Regression (Loess/Lowess Regression), a non-parametric method:
• A bell-shaped weighting function (not a Gaussian)
• Every prediction requires fitting over the entire training data set for the given query input (high computational complexity)
• The locally fitted model is still linear:

h(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n
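A minimal sketch of locally weighted regression in NumPy. The Gaussian-like bell kernel and the bandwidth `tau` here are illustrative choices; the lecture only specifies that the weighting is bell-shaped:

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    """Predict y at x_query by solving a weighted least-squares problem
    over the ENTIRE training set for every query (hence the cost)."""
    # Bell-shaped weights centered at the query point
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    # Weighted normal equations: (X^T W X) theta = X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(lwr_predict(np.array([1.0, 1.5]), X, y))  # ~4.0 on this exactly linear data
```

Because θ is re-solved per query, nothing is "learned" ahead of time: the model effectively stores the whole training set, which is what makes it non-parametric.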
Extension of Linear Regression

• Linear additive (straight line): x_1 = 1, x_2 = x
• Polynomial: x_1 = 1, x_2 = x, …, x_n = x^{n-1}
• Chebyshev orthogonal polynomials: x_1 = 1, x_2 = x, …, x_n = 2x\,x_{n-1} - x_{n-2}
• Fourier trigonometric polynomials: x_1 = 0.5, followed by sines and cosines of different frequencies of x
• Pairwise interactions: linear terms + x_{k_1} x_{k_2} (k = 1, …, N)
• …
• The central problem underlying these representations is whether or not the optimization process for θ remains convex.

All of these keep the same linear-in-parameters form:

h(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n
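Each extension only changes the feature map; the model stays linear in θ, so the same normal equations apply unchanged. A sketch with arbitrary toy data, using a polynomial basis:

```python
import numpy as np

def poly_features(x, degree):
    """Map scalar inputs x to [1, x, x^2, ..., x^degree]: still linear in theta."""
    return np.vander(x, degree + 1, increasing=True)

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 - x + 3.0 * x ** 2          # exactly quadratic data
X = poly_features(x, 2)
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # recovers [2, -1, 3]
```

The fit is nonlinear in x but the optimization over θ is still the same convex least-squares problem.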
Probabilistic Interpretation

• Why Ordinary Least Squares (OLS)? Why not other power terms?
– Assume y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}, where \epsilon^{(i)} is random noise with \epsilon^{(i)} \sim N(0, \sigma^2)
– The PDF for the Gaussian is:

p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(\epsilon^{(i)})^2}{2\sigma^2} \right)

– This implies that:

p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right)

– Or, y^{(i)} \mid x^{(i)}; \theta \sim N(\theta^T x^{(i)}, \sigma^2)

Why Gaussian for the random noise? The central limit theorem.
Maximum Likelihood

• Consider the training data to be stochastic: y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}
• Assume the \epsilon^{(i)} are i.i.d. (independent and identically distributed)
– The likelihood L(\theta) is the probability of y given X, parameterized by \theta:

L(\theta) = L(\theta; X, y) = p(y \mid X; \theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta)

• What is Maximum Likelihood Estimation (MLE)?
– Choose the parameters \theta that maximize L(\theta), so as to make the training data set as probable as possible
– L(\theta) is the likelihood of the parameters, and the probability of the data
The Equivalence of MLE and OLS

Maximizing the log-likelihood \ell(\theta) = \log L(\theta) turns out to be equivalent to minimizing \frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2 = J(\theta)!?
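The equivalence follows by taking the log of the likelihood, filling in the standard derivation the slide compresses:

```latex
\ell(\theta) = \log L(\theta)
            = \sum_{i=1}^{m} \log \frac{1}{\sqrt{2\pi}\,\sigma}
              \exp\!\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right)
            = m \log \frac{1}{\sqrt{2\pi}\,\sigma}
              - \frac{1}{\sigma^2} \cdot \frac{1}{2}
                \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2
```

The first term and the factor 1/\sigma^2 do not depend on \theta, so maximizing \ell(\theta) is exactly minimizing J(\theta): MLE under Gaussian noise recovers ordinary least squares.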
Sigmoid (Logistic) Function

For binary classification, y ∈ {0, 1}, and we want h(x) ∈ (0, 1):

g(z) = \frac{1}{1 + e^{-z}}

h(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}

Other functions that smoothly increase from 0 to 1 can also be found, but for a couple of good reasons (we will see next time, with Generalized Linear Models) the choice of the logistic function is a natural one.
Recall the least-squares update rule (note the positive sign rather than negative, since we now maximize). Let's work with just one training example (x, y) and derive the Gradient Ascent rule:

\theta := \theta + \alpha \nabla_\theta \ell(\theta)
One Useful Property of the Logistic Function

g'(z) = g(z) \left( 1 - g(z) \right)

Identical to Least Squares Again?

Using this property, the partial derivative of the log-likelihood over the whole training set is:

\frac{\partial}{\partial \theta_j} \ell(\theta) = \sum_{i=1}^{m} \left( y^{(i)} - h(x^{(i)}) \right) x_j^{(i)}

which gives the stochastic gradient ascent update:

\theta_j := \theta_j + \alpha \left( y^{(i)} - h(x^{(i)}) \right) x_j^{(i)}

The rule looks identical to the LMS update for least squares, even though h(x) is now the nonlinear sigmoid.
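The gradient-ascent rule above can be sketched for logistic regression in NumPy (batch form; the data, learning rate, and iteration count are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_ascent(X, y, alpha=0.1, iters=5000):
    """Maximize the log-likelihood: theta_j += alpha * sum_i (y^(i) - h(x^(i))) x_j^(i).
    Structurally the same update as LMS, but h is now the sigmoid."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)
        theta += alpha * (X.T @ (y - h))  # note the positive sign: ascent
    return theta

# Toy 1-D data with x0 = 1 as intercept: class 1 roughly when x > 2
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = logistic_gradient_ascent(X, y)
print(sigmoid(X @ theta) > 0.5)  # separates the two classes
```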