Artificial Intelligence Lecture 2


Transcript of Artificial Intelligence Lecture 2

Page 1: Artificial Intelligence Lecture 2

Artificial Intelligence Lecture 2

Dr. Bo Yuan, Professor, Department of Computer Science and Engineering

Shanghai Jiaotong University

[email protected]

Page 2: Artificial Intelligence Lecture 2

Review of Lecture One

• Overview of AI
– Knowledge-based rules in logics (expert systems, automata, …): symbolism in logics
– Kernel-based heuristics (neural networks, SVMs, …): connectionism for nonlinearity
– Learning and inference (Bayesian, Markovian, …): to sparsely sample for convergence
– Interactive and stochastic computing (uncertainty, heterogeneity): to overcome the limits of the Turing machine

• Course Content
– Focus mainly on learning and inference
– Discuss current problems and research efforts
– Perception and behavior (vision, robotics, NLP, bionics, …) not included

• Exam
– Papers (Nature, Science, Nature Reviews, Reviews of Modern Physics, PNAS, TICS)
– Course materials

Page 3: Artificial Intelligence Lecture 2

Today’s Content

• Overview of machine learning
• Linear regression
– Gradient descent
– Least-squares fit
– Stochastic gradient descent
– The normal equation

• Applications

Page 4: Artificial Intelligence Lecture 2
Page 5: Artificial Intelligence Lecture 2

Basic Terminologies

• x = input variables/features
• y = output variables/target variables
• (x, y) = training example; the i-th training example is $(x^{(i)}, y^{(i)})$
• m = number of training examples (i = 1, …, m)
• n = number of input variables/features (j = 0, …, n)
• h(x) = hypothesis/function/model that outputs the predicted value for a given input x
• θ = parameters/weights, which parameterize the mapping from x to its predicted value
• We define $x_0 = 1$ (the intercept term), which allows a matrix representation:

$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n = \sum_{i=0}^{n} \theta_i x_i = \theta^T x$
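As a minimal sketch (not from the slides; the feature values are made up), the hypothesis is just a dot product once the intercept feature $x_0 = 1$ is prepended:

```python
import numpy as np

def h(theta, x):
    """Linear hypothesis h_theta(x) = theta^T x.

    Assumes x already includes the intercept feature x_0 = 1.
    """
    return theta @ x

# Hypothetical example: n = 2 features plus the intercept term.
theta = np.array([0.5, 2.0, -1.0])   # [theta_0, theta_1, theta_2]
x = np.array([1.0, 3.0, 4.0])        # [x_0 = 1, x_1, x_2]
print(h(theta, x))                   # 0.5 + 2.0*3.0 - 1.0*4.0 = 2.5
```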

Page 6: Artificial Intelligence Lecture 2
Page 7: Artificial Intelligence Lecture 2
Page 8: Artificial Intelligence Lecture 2

Gradient Descent

Using the matrix representation of the training samples with respect to θ:

$h_\theta(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x$

The cost function is defined as:

$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

Gradient descent is based on the partial derivatives with respect to θ:

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$

The algorithm is therefore:

Loop {
  $\theta_j := \theta_j + \alpha \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$  (for every j)
}

There is an alternative way to iterate, called stochastic gradient descent, which updates θ using one training example at a time:

For i = 1 to m {
  $\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$  (for every j)
}
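A compact sketch of both update rules on a small synthetic data set; the data, step sizes, and iteration counts below are illustrative assumptions, not values from the lecture:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.001, iters=2000):
    """Batch update: every step uses all m training examples."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        errors = y - X @ theta              # (y^(i) - h_theta(x^(i))) for all i
        theta += alpha * (X.T @ errors)     # sum of the per-example gradients
    return theta

def stochastic_gradient_descent(X, y, alpha=0.01, epochs=50):
    """Stochastic update: every step uses a single training example."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for i in range(m):
            error = y[i] - X[i] @ theta
            theta += alpha * error * X[i]
    return theta

# Synthetic data: y = 1 + 2x plus noise, with x_0 = 1 prepended.
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=100)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(0, 0.1, size=100)
print(batch_gradient_descent(X, y))       # approximately [1.0, 2.0]
print(stochastic_gradient_descent(X, y))  # approximately [1.0, 2.0]
```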

Page 9: Artificial Intelligence Lecture 2

Normal Equation

An explicit way to directly obtain θ. The training inputs are stacked as the rows of the design matrix X:

$X = \begin{bmatrix} (x^{(1)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}$

Page 10: Artificial Intelligence Lecture 2

The Optimization Problem by the Normal Equation

We set the derivatives to zero, and obtain the Normal Equations:

$\theta = (X^T X)^{-1} X^T y$
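A sketch of the normal equation in NumPy; np.linalg.solve is used instead of an explicit matrix inverse for numerical stability, and the data are hypothetical:

```python
import numpy as np

# Hypothetical design matrix (x_0 = 1 in the first column) and targets.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([3.1, 4.9, 7.2])

# Normal equation: theta = (X^T X)^{-1} X^T y.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # intercept and slope of the least-squares line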

Page 11: Artificial Intelligence Lecture 2

Today’s Content

• Linear Regression
– Locally Weighted Regression (an adaptive method)

• Probabilistic Interpretation
– Maximum Likelihood Estimation vs. Least Squares (Gaussian distribution)

• Classification by Logistic Regression
– LMS updating
– A perceptron-based learning algorithm

Page 12: Artificial Intelligence Lecture 2

Linear Regression

1. Number of features

2. Over-fitting and under-fitting issues

3. Feature selection problem (to be covered later)

4. Adaptive issue

Some definitions:

• Parametric learning (a fixed set of θ, with n being constant)

• Non-parametric learning (the number of θ grows, here linearly, with m)

Locally Weighted Regression (Loess/Lowess regression) is non-parametric:

• A bell-shaped weighting (not a Gaussian); see the code sketch below

• Every prediction requires fitting θ on the entire training data set for the given query input (computational complexity)

$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$
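A sketch of locally weighted regression; the exact kernel from the slide is not shown, so the common bell-shaped weight $w^{(i)} = \exp(-\lVert x^{(i)} - x \rVert^2 / 2\tau^2)$ is assumed here, along with made-up data:

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Locally weighted regression prediction at a single query point.

    Uses the common bell-shaped weight w_i = exp(-||x_i - x||^2 / (2 tau^2));
    the exact kernel from the lecture slide is not shown, so this is assumed.
    Note that the whole training set (X, y) is refit for every query point.
    """
    # Weight every training example by its closeness to the query.
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2.0 * tau ** 2))
    W = np.diag(w)
    # Weighted normal equation: theta = (X^T W X)^{-1} X^T W y.
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta

# Hypothetical 1-D data with x_0 = 1 prepended.
x = np.linspace(0, 6, 60)
X = np.column_stack([np.ones_like(x), x])
y = np.sin(x) + 0.1 * np.random.default_rng(1).normal(size=60)
print(lwr_predict(X, y, np.array([1.0, 3.0])))  # local fit near x = 3, ~ sin(3)
```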

Page 13: Artificial Intelligence Lecture 2

Extension of Linear Regression

• Linear additive (straight line): $x_1 = 1$, $x_2 = x$

• Polynomial: $x_1 = 1$, $x_2 = x$, …, $x_n = x^{n-1}$

• Chebyshev orthogonal polynomial: $x_1 = 1$, $x_2 = x$, …, $x_n = 2x \cdot x_{n-1} - x_{n-2}$

• Fourier trigonometric polynomial: $x_1 = 0.5$, followed by sines and cosines of different frequencies of x

• Pairwise interaction: linear terms + products $x_{k_1} x_{k_2}$ ($k_1, k_2 = 1, \dots, n$)

• …

• The central problem underlying all of these representations is whether or not the optimization process for θ remains convex. A sketch of such feature maps follows below.

$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$
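A minimal sketch of two of these basis expansions (plain polynomial and Chebyshev); the model stays linear in θ, only the features change:

```python
import numpy as np

def polynomial_features(x, n):
    """Columns 1, x, x^2, ..., x^(n-1) for a 1-D input vector x."""
    return np.column_stack([x ** k for k in range(n)])

def chebyshev_features(x, n):
    """Columns via the recurrence T_k = 2x*T_{k-1} - T_{k-2}, T_0 = 1, T_1 = x."""
    cols = [np.ones_like(x), x]
    for _ in range(2, n):
        cols.append(2 * x * cols[-1] - cols[-2])
    return np.column_stack(cols[:n])

x = np.linspace(-1, 1, 5)
print(polynomial_features(x, 3))  # each row: [1, x, x^2]
print(chebyshev_features(x, 3))   # each row: [T_0(x), T_1(x), T_2(x)]
```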

Page 14: Artificial Intelligence Lecture 2

Probabilistic Interpretation

• Why Ordinary Least Squares (OLS)? Why not other power terms?

– Assume $y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$, where $\epsilon^{(i)}$ = random noise, $\epsilon^{(i)} \sim N(0, \sigma^2)$

– The PDF of the Gaussian is:

$p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(\epsilon^{(i)})^2}{2\sigma^2} \right)$

– This implies that:

$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right)$

– Or, $y^{(i)} \mid x^{(i)}; \theta \sim N(\theta^T x^{(i)}, \sigma^2)$

Why Gaussian for the random noise? The central limit theorem.

Page 15: Artificial Intelligence Lecture 2

Maximum Likelihood (updated)

• Consider that the training data are stochastic: $y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$

• Assume the $\epsilon^{(i)}$ are i.i.d. (independently, identically distributed)

– The likelihood of θ equals the probability of y given X, parameterized by θ:

$L(\theta) = L(\theta; X, y) = p(y \mid X; \theta)$

• What is Maximum Likelihood Estimation (MLE)?

– Choose the parameters θ that maximize $L(\theta)$, so as to make the training data set as probable as possible

– The likelihood $L(\theta)$ is a function of the parameters; the probability is of the data

Page 16: Artificial Intelligence Lecture 2

The Equivalence of MLE and OLS

Maximizing the log-likelihood $\ell(\theta)$ reduces to minimizing $J(\theta)$!?
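The slide's derivation did not survive extraction; a standard reconstruction under the Gaussian-noise assumption above goes as follows:

```latex
\ell(\theta) = \log L(\theta)
            = \log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}
              \exp\!\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)
            = m \log \frac{1}{\sqrt{2\pi}\,\sigma}
              - \frac{1}{\sigma^2} \cdot \underbrace{\frac{1}{2}
                \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2}_{J(\theta)}
```

Since the first term does not depend on θ, maximizing $\ell(\theta)$ is exactly minimizing $J(\theta)$: MLE under Gaussian noise coincides with ordinary least squares.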

Page 17: Artificial Intelligence Lecture 2

Sigmoid (Logistic) Function

For classification, $y \in \{0, 1\}$, so we want a hypothesis with $h_\theta(x) \in (0, 1)$:

$g(z) = \frac{1}{1 + e^{-z}}$

$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$

Other functions that smoothly increase from 0 to 1 can also be found, but for a couple of good reasons (we will see next time, with Generalized Linear Models), the choice of the logistic function is a natural one.
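A few spot checks of the function's shape (a trivial sketch, with arbitrary test points):

```python
import numpy as np

def g(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

print(g(np.array([-10.0, 0.0, 10.0])))  # ~[0.0000454, 0.5, 0.9999546]
```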

Page 18: Artificial Intelligence Lecture 2

Recall (note the positive sign rather than the negative one, since we now maximize):

$\theta := \theta + \alpha \nabla_\theta \ell(\theta)$

Let's work with just one training example (x, y) to derive the gradient ascent rule.

Page 19: Artificial Intelligence Lecture 2

One Useful Property of the Logistic Function
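The slide body did not survive extraction; the property referred to here is, presumably, the standard derivative identity of the logistic function:

```latex
g'(z) = \frac{d}{dz}\frac{1}{1+e^{-z}}
      = \frac{e^{-z}}{(1+e^{-z})^2}
      = \frac{1}{1+e^{-z}}\left(1 - \frac{1}{1+e^{-z}}\right)
      = g(z)\,\big(1 - g(z)\big)
```

This identity is what makes the gradient of the log-likelihood on the next page so compact.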

Page 20: Artificial Intelligence Lecture 2

Identical to Least Squares Again?

$\frac{\partial}{\partial \theta_j} \ell(\theta) = \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$

$\theta_j := \theta_j + \alpha \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$

The update rule looks identical to the LMS rule for linear regression, but it is not the same algorithm: here $h_\theta(x^{(i)})$ is the nonlinear logistic function $g(\theta^T x^{(i)})$.
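A minimal sketch of logistic regression trained by gradient ascent; the synthetic data, step size, and iteration count are illustrative assumptions:

```python
import numpy as np

def g(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_ascent(X, y, alpha=0.001, iters=10000):
    """Gradient ascent on the log-likelihood, vectorized:
    theta_j += alpha * sum_i (y^(i) - h_theta(x^(i))) * x_j^(i)."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta += alpha * X.T @ (y - g(X @ theta))
    return theta

# Synthetic data: P(y = 1 | x) = g(-4 + 2x), with x_0 = 1 prepended.
rng = np.random.default_rng(2)
x = rng.uniform(0, 4, size=200)
X = np.column_stack([np.ones_like(x), x])
y = (rng.random(200) < g(-4.0 + 2.0 * x)).astype(float)

theta = logistic_gradient_ascent(X, y)
print(theta)                             # roughly [-4, 2]
print(g(theta @ np.array([1.0, 1.0])))   # low probability at x = 1
print(g(theta @ np.array([1.0, 3.0])))   # high probability at x = 3
```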