
Financial Forecasting With

Support Vector Regression

Rocko Chen

([email protected])

School of Computing & Mathematical Sciences, Auckland University of Technology

Supervised by Dr Jiling Cao

30th of April, 2009


Acknowledgements

This study was made possible through the guidance and support of the following people:

Dr. Jiling Cao, my supervisor, for his vital facilitation and directives.

Neil Binnie, the overseer, for his understanding and assistance.

Peter Watson, our Programme Leader, for the unremitting encouragement.

All School of Computing & Mathematical Sciences faculty members and Staff.

Joshua Thompson, friend, professional trader, for the pragmatic perspective and much needed motivation.

Most especially to the countless Ph.D. researchers whose works I have studied throughout the past year.


Table of Contents

Abstract
Chapter I. Introduction
Chapter II. Fundamentals of Optimization
2.1 Optimization General Structure
2.2 The Lagrangian Function
2.2.1 Example with Lagrangian Function
2.2.2 Example with half-disk region
2.3 Tangent Cones [TΩ(x*)]
2.4 Feasible Sequence
2.4.1 Linearized Feasible Directions F(x)
2.4.2 Example with Tangent Cone ≠ Linearized Feasible Sets
2.4.3 Example with F(x) = TΩ(x)
2.5 LICQ - Linear Independence Constraint Qualification
2.6 Karush-Kuhn-Tucker (KKT) conditions
2.6.5 Second Order Conditions
2.6.7 Example on Critical Cones
2.7 Duality
2.7.1 Duality example
Chapter III. Quadratic Programming
3.1 Equality-constrained QPs
3.2 Karush-Kuhn-Tucker (KKT) matrix
3.3 Solving QP with a Modified Simplex Method
Chapter IV. SVM Mechanics
4.1 General overview
4.2 Structural Risk Minimization (SRM)
4.3 The Loss Function
4.4 ε-SVR
4.5 Standard Method for SVR QP Solutions


4.6 The Decomposition Method
4.7 The Kernel Function
4.8 Cross Validation and Grid-search
Chapter V. Empirical Analysis and Findings
5.1 About Financial Time Series
5.2 Analysis Set-up
5.2.1 General idea
5.2.2 Training Output Selection (Y_{i+1})
5.2.3 Training Input Selection (X_i)
5.2.4 Testing Variables
5.2.5 Error Analysis
5.2.6 SVR Adaptive Training Strategy
5.2.7 Applied Software
5.3 Empirical Findings
5.3.1 Cross Correlation Analysis
5.3.2 Normality Tests
5.3.3 Error Distributions
5.3.4 The Gold Connection
5.4 Test Conclusions
References


Abstract

This study explores Support Vector Machines (SVMs) for the purpose of forecasting the ASX200 index. SVMs have become popular within the financial forecasting realm for their distinctive capability, Structural Risk Minimization (SRM). The paper commences with a review of relevant numerical optimization concepts. Moving from basic notions of Lagrangian functions to quadratic programming, it lays the foundations the reader needs to comprehend the key ideas of SVR. SVR details follow: that section explores SVR's core analytical and SRM machinery, and the key theories are explained with examples. The final section presents an empirical test in which SVR attempts to value the ASX200 stock index and forecast the next-day return. The test applies roughly seven years of daily closing prices of several predictive variables for model building. The best test results follow an adaptive training approach. The test findings, along with the effectiveness of SRM, are analysed thereafter.

Disclaimer

Rocko Chen retains all rights to the above content in perpetuity. Chen grants others permission to copy and re-use the document for non-commercial purposes provided that Chen is given credit. Further, the content remains primarily a subject of research, and the author does not accept liability for any losses arising from its use. April 28, 2009


Chapter I. Introduction

Developed by Vladimir Vapnik [15], SVMs are distinguished from other popular machine learning algorithms by several unique characteristics. They offer robust classifications/predictions while maintaining structural risk minimization. Some interesting accuracy comparisons [14]:

Application     #Training data  #Testing data  #Features  #Classes  Accuracy by users  Accuracy via SVM
Astroparticle   3,089           4,000          4          2         75.20%             96.90%
Vehicle         1,243           41             21         2         4.88%              87.80%

Apparently the machines have learned considerably more efficiently than the human users. Even within the machine learning community, SVMs have displayed superiority over methods such as discriminant analysis (linear or quadratic), logit models, back-propagation neural networks and others [13]. These qualities make SVR desirable for financial forecasting. Though unlikely to be random according to growing academic evidence, the financial markets represent high-noise, non-linear, and non-stationary processes. Over the years many researchers have attempted to make predictions with various statistical methods, including SVR, with some promise. However, some fundamental assumptions have limited the applicability of those studies. The empirical study here examines SVR's assumption-based weaknesses and attempts to remedy them via data selection strategies. The consequent findings point to some interesting elements. On the whole, SVR could potentially contribute significant value for the professional financial instrument trader.


Chapter II. Fundamentals of Optimization

2.1 Optimization General Structure

Given an objective function f(x),

min f(x), x ∈ Rⁿ
subject to c_i(x) = 0, if i ∈ E (equality constraints)
           c_i(x) ≥ 0, if i ∈ I (inequality constraints)

The feasible set Ω is the set of x satisfying the constraints. A local solution x* ∈ Ω is a point for which there is a neighbourhood N such that f(x) ≥ f(x*) for all x ∈ N ∩ Ω. The Active Set A(x) at any feasible x consists of the equality constraint indices together with the inequality constraint indices for which c_i(x) = 0,

i.e. A(x) = E ∪ {i ∈ I | c_i(x) = 0}

e.g. at a feasible point x, (i ∈ I) is active if c_i(x) = 0 and inactive if c_i(x) > 0.

2.2 The Lagrangian Function

L(x, λ_1) = f(x) - λ_1 c_1(x), where λ_1 is the Lagrange multiplier.

Then ∇_x L(x, λ_1) = ∇f(x) - λ_1 ∇c_1(x), and at a solution ∇f(x*) = λ_1 ∇c_1(x*).

2.2.1 Example with Lagrangian Function

min x1 + x2, S.T. (such that) 8 - x1² - x2² ≥ 0

Feasible region: the interior and border of the circle x1² + x2² = 8.

Constraint normal: ∇c_1(x) points toward the interior at boundary points.

Solution: obviously at (-2, -2)ᵀ. Recalling ∇_x L(x, λ_1) = ∇f(x) - λ_1 ∇c_1(x), we have

∇_x L(x, λ_1) = ∇(x1 + x2) - λ_1 ∇(8 - x1² - x2²) = (1, 1)ᵀ - λ_1 (-2x1, -2x2)ᵀ.

Setting this to zero at (-2, -2)ᵀ gives 1 - 4λ_1 = 0, so λ_1* = 0.25; the Lagrange multiplier plays a significant role in this inequality-constrained problem.

*** A feasible point x is NOT optimal if we can find a small step s that both retains feasibility and decreases the objective function f to first order. The step s retains feasibility if

0 ≤ c_1(x + s) ≈ c_1(x) + ∇c_1(x)ᵀs,

so to first order, feasibility is retained if c_1(x) + ∇c_1(x)ᵀs ≥ 0.


Case I: x lies strictly inside the circle, c_1(x) > 0.

Any step vector s with sufficiently small length satisfies c_1(x) + ∇c_1(x)ᵀs ≥ 0. In fact, whenever ∇f(x) ≠ 0, we can obtain a step s that satisfies

∇f(x)ᵀs < 0 and c_1(x) + ∇c_1(x)ᵀs ≥ 0

by setting s = -α∇f(x), with α a sufficiently small positive scalar. However, when ∇f(x) = 0, no such step exists.

Case II: x lies on the boundary of the circle, c_1(x) = 0. The conditions ∇f(x)ᵀs < 0 and c_1(x) + ∇c_1(x)ᵀs ≥ 0 become

∇f(x)ᵀs < 0 (an open half-space) and ∇c_1(x)ᵀs ≥ 0 (a closed half-space).

The intersection of these two regions is empty only when ∇f(x) and ∇c_1(x) point in the same direction, i.e. ∇f(x) = λ_1 ∇c_1(x), λ_1 ≥ 0. Notice that the sign of λ_1 is significant: if ∇f(x*) = λ_1 ∇c_1(x*) held with a negative λ_1, then ∇f(x) and ∇c_1(x) would point in opposite directions.

Optimality for cases I and II, summarized WRT (with respect to) L(x, λ_1): when no first-order feasible descent direction exists at x*,

∇_x L(x*, λ_1*) = 0, for some λ_1* ≥ 0,

together with the required complementarity condition λ_1* c_1(x*) = 0.

The complementarity condition implies that λ_1 can be strictly positive only when c_1 is active.
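As a quick numerical illustration of this example (added here, not part of the original paper), the sketch below uses SciPy's SLSQP solver to recover both the minimizer and the multiplier; the starting point and solver choice are arbitrary assumptions.

```python
# Hypothetical check of the example above: minimize x1 + x2 subject to
# 8 - x1^2 - x2^2 >= 0, then back out the Lagrange multiplier at the optimum.
import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0] + x[1]                     # objective
c = lambda x: 8.0 - x[0]**2 - x[1]**2         # constraint, required >= 0

res = minimize(f, x0=[0.0, 0.0], method="SLSQP",
               constraints=[{"type": "ineq", "fun": c}])

x_star = res.x                                # approx (-2, -2)
# grad f = lambda * grad c  =>  (1, 1) = lambda * (-2*x1, -2*x2)
lam = 1.0 / (-2.0 * x_star[0])                # approx 0.25
print(x_star, lam)
```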

2.2.2 Example with half-disk region

f(x): min x1 + x2, S.T. c_1: 8 - x1² - x2² ≥ 0, c_2: x2 ≥ 0.

The feasible region is the upper half-disk of x1² + x2² ≤ 8, and the solution clearly sits at (-√8, 0)ᵀ.

We would expect a direction d of first-order feasible descent to satisfy

∇c_i(x)ᵀd ≥ 0, i ∈ I = {1, 2}, and ∇f(x)ᵀd < 0.

However, no such direction can exist at x = (-√8, 0)ᵀ. The conditions ∇c_i(x)ᵀd ≥ 0, i = 1, 2 are BOTH satisfied only if d lies in the quadrant defined by ∇c_1 and ∇c_2, but every d in this quadrant gives ∇f(x)ᵀd ≥ 0.

For the Lagrangian of f(x), we include a term λ_i c_i(x) for each additional constraint, so

L(x, λ) = f(x) - λ_1 c_1(x) - λ_2 c_2(x), and λ = (λ_1, λ_2)ᵀ

becomes the vector of Lagrange multipliers.


The extension of the single-constraint condition [∇_x L(x*, λ_1*) = 0 for some λ_1* ≥ 0, with λ_1* c_1(x*) = 0] to this two-constraint case is

∇_x L(x*, λ*) = 0, for some λ* ≥ 0,

where the inequality λ* ≥ 0 means all components of λ* are non-negative. We then apply the complementarity condition (λ_i can be strictly positive only when c_i is active) to both inequality constraints:

λ_1* c_1(x*) = 0, λ_2* c_2(x*) = 0.

When x* = (-√8, 0)ᵀ, we have

∇f(x*) = (1, 1)ᵀ, ∇c_1(x*) = (2√8, 0)ᵀ, ∇c_2(x*) = (0, 1)ᵀ,

so ∇_x L(x*, λ*) = 0 when we select

λ* = (1/(2√8), 1)ᵀ;

notice how both components are positive.

Some feasible points are NOT solutions; let us look at the Lagrangian and its gradient at such points.

At x = (√8, 0)ᵀ, both constraints are active, and a direction such as d = (-1, 0)ᵀ is a feasible descent direction, so this point cannot be optimal. The condition ∇_x L(x, λ) = 0 is satisfied at this point only when

λ = (-1/(2√8), 1)ᵀ,

and since the first component λ_1 is negative, the condition [∇_x L(x*, λ*) = 0 for some λ* ≥ 0] is not satisfied.

At x = (1, 0)ᵀ, only c_2 is active. Any small step s away from this point will continue to satisfy c_1(x + s) > 0, so we only need to consider the behaviour of c_2 and f to see whether s is a feasible descent step. A direction of feasible descent d must satisfy

∇c_2(x)ᵀd ≥ 0, ∇f(x)ᵀd < 0.

Noting that ∇f(x) = (1, 1)ᵀ and ∇c_2(x) = (0, 1)ᵀ, we can see that d = (-0.5, 0.25)ᵀ satisfies both conditions and is therefore a feasible descent direction. To show that the optimality conditions [∇_x L(x*, λ*) = 0 for some λ* ≥ 0] and [λ_1* c_1(x*) = 0, λ_2* c_2(x*) = 0] fail here: since c_1 > 0, we must have λ_1 = 0. Then, to satisfy ∇_x L(x, λ) = 0, we need a value of λ_2 such that ∇f(x) - λ_2 ∇c_2(x) = 0. No such λ_2 exists, so this point fails to satisfy the optimality conditions.


2.3 Tangent Cones [TΩ(x*)]

Here we have Ω: a closed convex constraint set, x* ∈ Ω, and F(x*): the set of first-order feasible directions at x*.

*The earlier approach of examining the first derivatives of f and c_i, via first-order Taylor series expansion about x, to form an approximation in which both objective and constraints are linear, only works when the approximation resembles the feasible set near the point x in question. If, near x, the linearization is fundamentally different from the feasible set (e.g. an entire plane while the feasible set is a single point), then the linear approximation will not yield useful information. This is where we must make assumptions about the nature of the c_i, i.e. the following.

Constraint qualifications ensure the similarity of the constraint set Ω and its linearized approximation in a neighbourhood of x*.

2.4 Feasible Sequence [15]

Given a feasible point x, we call {Z_k} a feasible sequence approaching x if Z_k ∈ Ω for all k sufficiently large and Z_k → x.

A local solution x* is a point at which all feasible sequences approaching x* have the property that f(Z_k) ≥ f(x*) for all k sufficiently large; we will derive conditions under which this property holds. A tangent is a limiting direction of a feasible sequence.

Definition 12.2 d is a tangent (vector) to Ω at a point x if there is a feasible sequence {Z_k} approaching x and a sequence of positive scalars {t_k} with t_k → 0 such that

lim_{k→∞} (Z_k - x)/t_k = d.

The set of all tangents to Ω at x* is called the Tangent Cone, TΩ(x*). Note: if d is a tangent vector with corresponding sequences {t_k} and {Z_k}, then replacing each t_k by α⁻¹ t_k, where α > 0, gives αd ∈ TΩ(x*).

2.4.1 Linearized Feasible Directions F(x)

Given a feasible point x and active constraint set A(x), the set of linearized feasible directions F(x) is

F(x) = {d | dᵀ∇c_i(x) = 0 for all i ∈ E, dᵀ∇c_i(x) ≥ 0 for all i ∈ A(x) ∩ I}.


We can see that F(x) is a cone. Note: the definition of the tangent cone does not rely on the algebraic specification of Ω, only on its geometry. The linearized feasible direction set depends on the definition of the constraint functions c_i, i ∈ E ∪ I.

2.4.2 Example with Tangent Cone ≠ Linearized Feasible Sets

Min f(x) = x1 + x2, such that c_1: x1² + x2² - 2 = 0.

About the constraint: it is a circle with radius √2. Consider the non-optimal point x = (-√2, 0)ᵀ. One feasible sequence approaching x is

Z_k = (-√(2 - 1/k²), -1/k)ᵀ, with t_k = ||Z_k - x||,

and d = (0, -1)ᵀ is a tangent. Note that f increases as we move along Z_k: f(Z_{k+1}) > f(Z_k) for all k = 2, 3, …, yet f(Z_k) < f(x) for k = 2, 3, …, so x cannot be a solution.

Another feasible sequence approaches x = (-√2, 0)ᵀ from the opposite direction,

Z_k = (-√(2 - 1/k²), 1/k)ᵀ.

We can see that f decreases along this sequence, and the tangents along it are d = (0, α)ᵀ.

The tangent cone at x = (-√2, 0)ᵀ is therefore {(0, d2)ᵀ | d2 ∈ ℝ}. Via the definition F(x) = {d | dᵀ∇c_i(x) = 0 for all i ∈ E, dᵀ∇c_i(x) ≥ 0 for all i ∈ A(x) ∩ I}, a direction d = (d1, d2)ᵀ ∈ F(x) if

0 = ∇c_1(x)ᵀd = (2x1, 2x2)(d1, d2)ᵀ = -2√2 d1.

Therefore F(x) = {(0, d2)ᵀ | d2 ∈ ℝ}, so we have TΩ(x) = F(x).

Suppose the feasible set is defined instead by Ω = {x | c_1(x) = 0}, where c_1(x) = (x1² + x2² - 2)² = 0. Note: the algebraic specification of Ω has changed. The vector d belongs to the linearized feasible set if

0 = ∇c_1(x)ᵀd = (4(x1² + x2² - 2)x1, 4(x1² + x2² - 2)x2)(d1, d2)ᵀ = (0, 0)(d1, d2)ᵀ.


This holds for all (d1, d2)ᵀ, so F(x) = ℝ². For this specification of Ω, the tangent cone and the linearized feasible set therefore differ.

2.4.3 Example with F(x) = TΩ(x)

Min x1 + x2, S.T. 2 - x1² - x2² ≥ 0.

Feasible region: on and within the circle of radius √2. The solution looks pretty obvious at x = (-1, -1)ᵀ, i.e. the same as the equality-constrained case, but this time we have many feasible sequences converging to any given feasible point.

E.g. at x = (-√2, 0)ᵀ, the feasible sequences defined for the equality-constrained problem are still feasible for this problem. Infinitely many feasible sequences also converge to x = (-√2, 0)ᵀ along straight lines from the interior of the circle, with the form

Z_k = (-√2, 0)ᵀ + (1/k)w,

where w is a vector whose first component is positive (w1 > 0). Z_k remains feasible if ||Z_k|| ≤ √2, which holds when k ≥ (w1² + w2²)/(2√2 w1).

We could also have an infinite variety of sequences approaching (-√2, 0)ᵀ along curves from the circle's interior. The tangent cone of this set at (-√2, 0)ᵀ is {(w1, w2)ᵀ | w1 ≥ 0}.

Via the definition F(x) = {d | dᵀ∇c_i(x) = 0 for all i ∈ E, dᵀ∇c_i(x) ≥ 0 for all i ∈ A(x) ∩ I}, d ∈ F(x) if

0 ≤ ∇c_1(x)ᵀd = (-2x1, -2x2)(d1, d2)ᵀ = 2√2 d1.

So we have F(x) = TΩ(x) for this specification of the feasible set.

2.5 LICQ - Linear Independence Constraint Qualification [15]

LICQ holds at the point x with active set A(x) if the set of active constraint gradients {∇c_i(x), i ∈ A(x)} is linearly independent.

2.5.1 Constraint Qualifications

Constraint qualifications ensure the similarity of the constraint set Ω and its linearized approximation in a neighbourhood of x*.

Given a feasible point x, we call {Z_k} a feasible sequence approaching x if Z_k ∈ Ω for all k sufficiently large and Z_k → x.


A local solution x* is a point at which all feasible sequences approaching x* have the property that f(Z_k) ≥ f(x*) for all k sufficiently large, and we will derive conditions under which this property holds.

LICQ - Linear Independence Constraint Qualification: this holds at the point x with active set A(x) if the set of active constraint gradients {∇c_i(x), i ∈ A(x)} is linearly independent.

First Order Optimality Conditions

Given the Lagrangian function L(x, λ) = f(x) - Σ_{i∈E∪I} λ_i c_i(x), if

1. x* is a local solution,
2. the functions f and c_i are continuously differentiable, and
3. LICQ holds at x*,

then there is a Lagrange multiplier vector λ* with components λ_i*, i ∈ E ∪ I, such that the following conditions are satisfied at (x*, λ*). This leads to

2.6 Karush-Kuhn-Tucker (KKT) conditions [15]

∇_x L(x*, λ*) = 0
c_i(x*) = 0 for all i ∈ E
c_i(x*) ≥ 0 for all i ∈ I
λ_i* ≥ 0 for all i ∈ I
λ_i* c_i(x*) = 0 for all i ∈ E ∪ I

For the last condition, λ_i* c_i(x*) = 0, the relationship can be

Complementary: one of, or both of, λ_i* and c_i(x*) are 0, or
Strictly complementary: exactly one of λ_i* and c_i(x*) is 0.

Some additional properties follow. They seem fairly obvious, though they are worth stating to clear up potential ambiguities as we move on.

2.6.2 Lemma 12.2 Let x* be a feasible point. Then the following are true:
i. TΩ(x*) ⊂ F(x*)
ii. if LICQ is satisfied at x*, then F(x*) = TΩ(x*).

2.6.3 A Fundamentally Necessary Condition

At a local solution x*, feasible sequences {Z_k} have the property f(Z_k) ≥ f(x*) for all k sufficiently large.


If x* is a local solution, then ∇f(x*)ᵀd ≥ 0 for all d ∈ TΩ(x*).

Lemma 2.6.4 Let the cone K be defined as K = {By + Cw | y ≥ 0}. Given any vector g ∈ Rⁿ, we have either g ∈ K, or there is a d ∈ Rⁿ satisfying gᵀd < 0, Bᵀd ≥ 0, Cᵀd = 0, but NOT both.

2.6.5 Second Order Conditions

The KKT conditions tell us how the first derivatives of f and of the active constraints c_i are related at a solution x*. When these conditions are satisfied, a move along any vector w from F(x*) either

1. increases the first-order approximation to the objective function, i.e. wᵀ∇f(x*) > 0, or
2. keeps it at the same value, i.e. wᵀ∇f(x*) = 0.

The second derivatives of f and of the constraints c_i play the "tiebreaking" role. For directions w ∈ F(x*) with wᵀ∇f(x*) = 0, we cannot tell from first derivatives whether a move along w will increase or decrease f. The second-order conditions examine the second-derivative terms in the Taylor series expansions of f and c_i to resolve this. Essentially, the second-order conditions concern the curvature of the Lagrangian function in these "undecided" directions, i.e. the directions w ∈ F(x*) with wᵀ∇f(x*) = 0.

Given a solution x* at which the KKT conditions are met:

the inequality constraint c_i is strongly active, or binding, if i ∈ A(x*) and λ_i* > 0 for some Lagrange multiplier λ* satisfying KKT;

c_i is weakly active if i ∈ A(x*) and λ_i* = 0 for all λ* satisfying KKT.

2.6.6 F(x): set of linearized feasible directions

Given

F(x) = {d | ∇c_i(x)ᵀd = 0 for all i ∈ E, ∇c_i(x)ᵀd ≥ 0 for all i ∈ A(x) ∩ I},

and some Lagrange multiplier vector λ* satisfying the KKT conditions, we define the Critical Cone C(x*, λ*) as follows:

C(x*, λ*) = {w ∈ F(x*) | ∇c_i(x*)ᵀw = 0, for all i ∈ A(x*) ∩ I with λ_i* > 0}


Equivalently, w ∈ C(x*, λ*) if and only if

∇c_i(x*)ᵀw = 0, for all i ∈ E,
∇c_i(x*)ᵀw = 0, for all i ∈ A(x*) ∩ I with λ_i* > 0,
∇c_i(x*)ᵀw ≥ 0, for all i ∈ A(x*) ∩ I with λ_i* = 0.

The critical cone contains the directions w that tend to adhere to the active inequality constraints (those indices i ∈ I where λ_i* is positive), as well as to the equality constraints, even when we make small changes to the objective.

Since λ_i* = 0 for all inactive components i ∈ I \ A(x*), it follows that

w ∈ C(x*, λ*) ⟹ λ_i* ∇c_i(x*)ᵀw = 0 for all i ∈ E ∪ I.   (12.54)

From the first KKT condition ∇_x L(x*, λ*) = 0 and the Lagrangian definition L(x, λ) = f(x) - Σ_{i∈E∪I} λ_i c_i(x), we have

w ∈ C(x*, λ*) ⟹ wᵀ∇f(x*) = Σ_{i∈E∪I} λ_i* wᵀ∇c_i(x*) = 0.

Hence, the critical cone C(x*, λ*) contains the directions from F(x*) for which the first derivative does not clearly state whether f will increase or decrease.

2.6.7 Example on Critical Cones

Min x1, S.T. x2 ≥ 0, 1 - (x1 - 1)² - x2² ≥ 0.

The second constraint is the interior and border of a circle of radius 1 centred at (1, 0).

Clearly x* = (0, 0)ᵀ, and its active set is A(x*) = {1, 2}.

The gradients of the active constraints at x* are ∇c_1(x*) = (0, 1)ᵀ and ∇c_2(x*) = (2, 0)ᵀ, while ∇f(x*) = (1, 0)ᵀ. The optimal Lagrange multiplier is λ* = (0, 0.5)ᵀ, so that ∇f(x*) = λ_2* ∇c_2(x*).

As LICQ holds, the optimal multiplier is unique. We have the linearized feasible set

F(x*) = {d | d ≥ 0}

and the critical cone

C(x*, λ*) = {(0, w2)ᵀ | w2 ≥ 0}.

2.7 Duality [15]

Given the problem min f(x) subject to c(x) ≥ 0, with the Lagrangian function L(x, λ) = f(x) - λᵀc(x),


where λ ∈ Rᵐ is the vector of Lagrange multipliers, the dual objective function q: Rᵐ → R is

q(λ) := inf_x L(x, λ).

The domain of q is the set of λ values where q is finite, D := {λ | q(λ) > -∞}.

2.7.1 Duality example

min 0.5(x1² + x2²) subject to x1 - 1 ≥ 0

Lagrangian: L(x1, x2, λ_1) = 0.5(x1² + x2²) - λ_1(x1 - 1)

With λ_1 fixed, this is a convex function of (x1, x2)ᵀ, so the infimum is achieved when the partial derivatives with respect to x1 and x2 are zero, i.e. x1 - λ_1 = 0, x2 = 0. Substituting these infimal values into L(x1, x2, λ_1), we get the dual objective

q(λ_1) = 0.5(λ_1² + 0) - λ_1(λ_1 - 1) = -0.5λ_1² + λ_1.

So the dual problem (max q(λ) subject to λ ≥ 0) becomes

max_{λ_1 ≥ 0} -0.5λ_1² + λ_1

Apparent solution: λ_1 = 1.
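A minimal numerical check of this duality example (added here, not from the original text): it solves the primal with SciPy and maximizes the dual over λ ≥ 0, confirming that both optimal values agree (0.5) and that λ_1 = 1.

```python
# Primal: min 0.5*(x1^2 + x2^2)  s.t.  x1 - 1 >= 0
# Dual:   max_{lam >= 0} -0.5*lam^2 + lam
import numpy as np
from scipy.optimize import minimize, minimize_scalar

primal = minimize(lambda x: 0.5 * (x[0]**2 + x[1]**2), x0=[2.0, 2.0],
                  method="SLSQP",
                  constraints=[{"type": "ineq", "fun": lambda x: x[0] - 1.0}])

dual = minimize_scalar(lambda lam: -(-0.5 * lam**2 + lam),
                       bounds=(0.0, 10.0), method="bounded")

print(primal.x, primal.fun)    # approx (1, 0) and 0.5
print(dual.x, -dual.fun)       # approx 1.0 and 0.5 (no duality gap here)
```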


Chapter III. Quadratic Programming [15]

General QP format:

min_x q(x) = ½ xᵀGx + xᵀc
subject to a_iᵀx = b_i, i ∈ E
           a_iᵀx ≥ b_i, i ∈ I

G is a symmetric n×n matrix, E and I are finite sets of indices, and c, x, and a_i, i ∈ E ∪ I, are vectors in Rⁿ.

If G is positive semidefinite, the problem is a convex QP; if G is positive definite, it is a strictly convex QP; if G is indefinite, the problem is nonconvex and more challenging, owing to multiple stationary points and local minima.

3.1 Equality-constrained QPs

These take the form

min_x q(x) := ½ xᵀGx + xᵀc
subject to Ax = b

where A is the m×n Jacobian of the constraints (m ≤ n), whose rows are a_iᵀ, i ∈ E, and b is a vector in Rᵐ with components b_i, i ∈ E.

First-order necessary conditions for a solution x*: there is a vector λ* (Lagrange multipliers) satisfying the system

| G  -Aᵀ | | x* |   | -c |
| A   0  | | λ* | = |  b |

Expressing x* = x + p (for computation), where x is the current solution estimate and p the desired step, we now obtain the

3.2 Karush-Kuhn-Tucker (KKT) matrix

| G  Aᵀ | | -p |   | g |
| A  0  | | λ* | = | h |

where h = Ax - b, g = c + Gx, and p = x* - x.


Z: an n×(n-m) matrix whose columns are a basis for the null space of A, i.e. AZ = 0.

KKT QP Example

min q(x) = 3x1² + 2x1x2 + x1x3 + 2.5x2² + 2x2x3 + 2x3² - 8x1 - 3x2 - 3x3,
subject to x1 + x3 = 3,
           x2 + x3 = 0.   (16.9)

G → the rows of Gx: 6x1 + 2x2 + x3, 2x1 + 5x2 + 2x3, x1 + 2x2 + 4x3
c → coefficients of the linear terms: (-8, -3, -3)ᵀ
A → the constraint functions: x1 + 0 + x3 = 3 and 0 + x2 + x3 = 0
b → constraint right-hand sides: 3, 0

In matrix form,

G = | 6 2 1 |   c = | -8 |   A = | 1 0 1 |   b = | 3 |
    | 2 5 2 |       | -3 |       | 0 1 1 |       | 0 |
    | 1 2 4 |       | -3 |

Using

| G  -Aᵀ | | x* |   | -c |
| A   0  | | λ* | = |  b |

we obtain the linear system

| 6 2 1 -1  0 | | x1 |   | 8 |
| 2 5 2  0 -1 | | x2 |   | 3 |
| 1 2 4 -1 -1 | | x3 | = | 3 |
| 1 0 1  0  0 | | λ1 |   | 3 |
| 0 1 1  0  0 | | λ2 |   | 0 |

with solution x* = (2, -1, 1)ᵀ and λ* = (3, -2)ᵀ.

A basis for the null space of A is Z = (-1, -1, 1)ᵀ. Many methods currently exist to solve QPs; from this point the paper focuses exclusively on the means most popularly applied in contemporary research.
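A minimal sketch (not from the original paper) verifying the example above: it assembles the 5×5 KKT system with NumPy and solves it, reproducing x* = (2, -1, 1)ᵀ and λ* = (3, -2)ᵀ.

```python
# Assemble and solve the KKT system  [G -A^T; A 0] [x*; lambda*] = [-c; b]
import numpy as np

G = np.array([[6.0, 2.0, 1.0],
              [2.0, 5.0, 2.0],
              [1.0, 2.0, 4.0]])
c = np.array([-8.0, -3.0, -3.0])
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
b = np.array([3.0, 0.0])

K = np.block([[G, -A.T],
              [A, np.zeros((2, 2))]])
rhs = np.concatenate([-c, b])
sol = np.linalg.solve(K, rhs)

print(sol[:3])   # x*      -> [ 2. -1.  1.]
print(sol[3:])   # lambda* -> [ 3. -2.]
```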


3.3 Solving QP with a Modified Simplex Method

Jensen and Bard start with an examination of the Karush-Kuhn-Tucker conditions, leading to a set of linear equalities and complementarity constraints [10]. They then apply a modified simplex algorithm to find solutions.

General QP

Min f(x) = cx + ½ xᵀQx
Subject to Ax ≤ b and x ≥ 0

c: an n-dimensional row vector of the coefficients of the linear terms in the objective function.
Q: an (n×n) symmetric matrix of the coefficients of the quadratic terms.
x: n-dimensional column vector of decision variables (as in linear programming).
A: an (m×n) matrix defining the constraints.
b: column vector of the constraints' right-hand-side coefficients.

The model drops constants in the equation. We assume a feasible solution exists and that the constraint region is bounded. A global minimum exists when f(x) is strictly convex for all feasible points.

KKT Conditions

A positive definite Q guarantees strict convexity. Excluding the non-negativity conditions, the Lagrangian function for the QP is

L(x, μ) = cx + ½ xᵀQx + μ(Ax - b)

where μ is an m-dimensional row vector. The KKT conditions for a local minimum follow:

L_{x_j} ≥ 0, j = 1,…,n        c + xᵀQ + μA ≥ 0              (12a)
L_{μ_i} ≤ 0, i = 1,…,m        Ax - b ≤ 0                    (12b)
x_j L_{x_j} = 0, j = 1,…,n    xᵀ(cᵀ + Qx + Aᵀμᵀ) = 0        (12c)
μ_i g_i(x) = 0, i = 1,…,m     μ(Ax - b) = 0                 (12d)
x_j ≥ 0, j = 1,…,n            x ≥ 0                         (12e)
μ_i ≥ 0, i = 1,…,m            μ ≥ 0                         (12f)

To make these conditions more manageable, they introduce nonnegative surplus variables y ∈ Rⁿ into (12a) and nonnegative slack variables v ∈ Rᵐ into (12b):

cᵀ + Qx + Aᵀμᵀ - y = 0 and Ax - b + v = 0

The KKT conditions are now rewritten with the constants moved to the right-hand side:

Qx + Aᵀμᵀ - y = -cᵀ               (13a)
Ax + v = b                         (13b)
x ≥ 0, μ ≥ 0, y ≥ 0, v ≥ 0         (13c)
yᵀx = 0, μv = 0                    (13d)

So the first two expressions are linear equalities, the third restricts all variables to be nonnegative, and the fourth applies complementary slackness.


They then apply a simplex algorithm to solve 13a-13d, with a "restricted basis entry rule" to treat the complementary slackness conditions (13d) implicitly. It takes the following steps:

• Let the structural constraints be 13a and 13b, as defined by the KKT conditions.
• If any right-hand-side value is negative, multiply that equation by -1 so the right-hand sides stay nonnegative.
• Add an artificial variable to each equation.
• The objective function becomes the sum of the artificial variables.
• Put the resulting problem into simplex form.

The aim is to find a solution to this linear program that minimizes the sum of the artificial variables, with the additional requirement that the complementary slackness conditions are satisfied at each iteration. If the sum becomes zero, the solution satisfies 13a-13d. To accommodate 13d, the rule for selecting the entering variable must respect the following relationships:

x_j and y_j are complementary for j = 1,…,n
μ_i and v_i are complementary for i = 1,…,m

The entering variable will be the one with the most negative reduced cost, provided that its complementary variable is not in the basis or would leave the basis on the same iteration. The algorithm ultimately provides the vector x as the optimal solution and the vector μ as the optimal dual variables.

Note: according to the authors, this approach works well only when the objective function is positive definite, while requiring computational effort comparable to a linear programming problem with m + n constraints (where m is the number of constraints and n the number of QP variables). Positive semidefinite forms of the objective function can present computational complications. A suggested remedy for semidefiniteness is to add a small constant to each of the diagonal elements of Q so that the matrix becomes positive definite. Though the solution will then not be exact, the difference remains insignificant as long as the modification is kept relatively small.

An example with the Simplex Method

Minimize f(x) = -8x1 - 16x2 + x1² + 4x2²
subject to x1 + x2 ≤ 5
           x1 ≤ 3
           x1 ≥ 0, x2 ≥ 0

First we rewrite the problem in matrix form:

cᵀ = (-8, -16)ᵀ,  Q = | 2 0 |,  A = | 1 1 |,  b = | 5 |
                      | 0 8 |       | 1 0 |       | 3 |

We can see that Q is positive definite, so the KKT conditions are necessary and sufficient for a global optimum. The linear constraints 13a and 13b take the following form,

2x1 + μ1 + μ2 - y1 = 8


8x2 + μ1 - y2 = 16
x1 + x2 + v1 = 5
x1 + v2 = 3

Then the artificial variables are added to each constraint and their sum is minimized:

Minimize a1 + a2 + a3 + a4
subject to 2x1 + μ1 + μ2 - y1 + a1 = 8
           8x2 + μ1 - y2 + a2 = 16
           x1 + x2 + v1 + a3 = 5
           x1 + v2 + a4 = 3

(note: all variables ≥ 0, plus the complementarity conditions). A modified simplex technique then yields the following iterations:

Iteration  Basic variables   Solution     Objective value  Entering variable  Leaving variable
1          a1, a2, a3, a4    8, 16, 5, 3  32               x2                 a2
2          a1, x2, a3, a4    8, 2, 3, 3   14               x1                 a3
3          a1, x2, x1, a4    2, 2, 3, 0   2                μ1                 a4
4          a1, x2, x1, μ1    2, 2, 3, 0   2                μ2                 a1
5          μ2, x2, x1, μ1    2, 2, 3, 0   0

And the optimal solution: (x1*, x2*) = (3, 2).
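As a sanity check (added here, not part of the original paper), the sketch below solves the same small QP directly with SciPy's SLSQP routine rather than the modified simplex method, and arrives at the same optimum (3, 2).

```python
# minimize -8*x1 - 16*x2 + x1^2 + 4*x2^2  s.t.  x1 + x2 <= 5,  x1 <= 3,  x >= 0
import numpy as np
from scipy.optimize import minimize

obj = lambda x: -8 * x[0] - 16 * x[1] + x[0]**2 + 4 * x[1]**2
cons = [{"type": "ineq", "fun": lambda x: 5.0 - x[0] - x[1]},   # x1 + x2 <= 5
        {"type": "ineq", "fun": lambda x: 3.0 - x[0]}]          # x1 <= 3

res = minimize(obj, x0=[0.0, 0.0], method="SLSQP",
               constraints=cons, bounds=[(0, None), (0, None)])
print(res.x, res.fun)   # approx (3, 2), objective -31
```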


Chapter IV. SVM Mechanics

SVMs map data nonlinearly into a high (potentially infinite) dimensional feature space via a kernel function, where the data become easily separable by a linear hyperplane [14]. The distance (margin) between the closest points (the Support Vectors) and the hyperplane is then maximized. The result is a nonlinear regression in the low-dimensional input space, where extrapolations of improved accuracy, i.e. classification or forecasting, become possible.

4.1 General overview

We commence with a data set G = {(X_i, y_i)}_{i=1}^N of N data points, where X_i denotes the input (predictive variables) and y_i the corresponding output (response variable). ε-SVR then works out a function f(x) that has at most ε deviation from the actually obtained y_i for all the training data, while keeping f(x) as flat as possible. We therefore end up with an "expected" range for extrapolation errors, making the findings more practical. [9]

X_i is mapped nonlinearly into the D-dimensional feature space F, with corresponding output y_i, and we do the linear regression in F. We need to find a function f̂(t+1) = ĝ(X) that approximates y(t+1) based on G. The SVR approximation function is

ĝ(X) = Σ_{i=1}^D ω_i φ_i(X) + b,  where φ: Rⁿ → F, ω ∈ F

ω_i: coefficients
b: threshold value

A dot product takes place between ω and φ(X); ĝ(X) serves as the hyperplane in F defined by the functions φ(X), whose dimensionality can get very high, possibly infinite. A small ω leads to improved flatness; one way to find a small ω is to minimize the norm, i.e. ||ω||² = <ω, ω>.

The SVR QP

The SVR becomes a minimization problem with slack variables ξ and ξ*:

Minimize ½ ||ω||² + C Σ_{i=1}^N (ξ_i + ξ_i*)   [16]
subject to ĝ(X_i) + b - y_i ≤ ε + ξ_i
           y_i - [ĝ(X_i) + b] ≤ ε + ξ_i*
           ξ_i, ξ_i* ≥ 0

This is solved by forming the Lagrangian and transforming into a dual problem,

f̂(t+1) = ĝ(X) = Σ_{i=1}^N (α_i* - α_i) K(X, X_i) + b

where α_i*, α_i are the Lagrange multipliers associated with X_i,


0 ≤ α_i*, α_i ≤ C,
Σ_{i=1}^N (α_i* - α_i) = 0.

Training points with nonzero Lagrange multipliers are called Support Vectors (SVs); the smaller the fraction of SVs, the more general the solution. The coefficients α_i*, α_i are obtained by maximizing the following form subject to the conditions stated above:

R(α_i*, α_i) = Σ_{i=1}^N y_i (α_i* - α_i) - ε Σ_{i=1}^N (α_i* + α_i) - ½ Σ_{i,j=1}^N (α_i* - α_i)(α_j* - α_j) K(X_i, X_j)

The training vectors x_i are mapped into a higher (potentially infinite) dimensional space by the mapping function Φ. The SVM then finds a linear hyperplane separating the (support) vectors in this space with maximal margin (distance to the hyperplane). [12]
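To make the mechanics above concrete, here is a toy ε-SVR fit on synthetic data. It uses scikit-learn's SVR purely as an illustrative stand-in (the study itself used the LS-SVM toolbox for Matlab, see 5.2.7); the data and the values of C, γ and ε are arbitrary assumptions.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)      # noisy target

# epsilon controls the tube width, C the cost of errors outside the tube
svr = SVR(kernel="rbf", C=10.0, gamma=0.5, epsilon=0.1)
svr.fit(X, y)

y_hat = svr.predict(X)
print("support vectors:", len(svr.support_vectors_), "of", len(X))
print("train RMSE:", np.sqrt(np.mean((y - y_hat) ** 2)))
```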

4.2 Structural Risk Minimization (SRM)

Popular financial time series models tend to "over-fit": they focus exclusively on the quality of fit to the training data (the empirical error), so that the structural risk (potential forecast error) becomes perilously volatile. SVMs are unique as a learning algorithm in applying SRM, i.e. they utilize capacity control of the decision function, the kernel function and the sparsity of the solution (Huang, Nakamori, Wang 2005). The SRM principle helps SVMs estimate functions while minimizing generalization error, making SVM classification and SVR highly resistant to over-fitting.

Utilized Risk Function [17]

Suppose the SVR aims to estimate/learn a function f(x, λ), where X is the input space (e.g. stock index prices together with econometric indicators) and λ ∈ Λ is a set of abstract parameters, from an independent identically distributed (i.i.d.) sample set of size N,

(x_1, y_1), …, (x_N, y_N),  x_i ∈ X (⊂ R^d), y_i ∈ R.   (1.1)

The training data (x_i, y_i) are drawn from an unknown distribution P(x_i, y_i). We therefore look for a function f(x, λ*) with the smallest possible value of the expected risk (or extrapolation error)

R[λ] = ∫ l[y, f(x, λ)] P(x, y) dx dy   (1.2)

where l is a loss function, to be defined as needed. With the probability distribution P(x, y) in (1.2) unknown, computing and minimizing R[λ] directly is, so far, impossible. However, we do have some information about P(x, y) from the i.i.d. sample (1.1), and it becomes possible to compute a stochastic approximation of R[λ], the Empirical Risk

R_emp[λ] = (1/N) Σ_{i=1}^N l[y_i, f(x_i, λ)].   (1.3)


According to Yang, by the law of large numbers the empirical risk converges to its statistical expectation. Despite this, for small sample sets, minimizing the empirical risk alone can lead to issues such as loss of accuracy or over-fitting. To remedy the issue of relatively insufficient sample size, statistical learning (VC, Vapnik-Chervonenkis) theory offers bounds on the deviation of the empirical risk from the expected risk. The standard form of the Vapnik-Chervonenkis bound holds with probability 1 - η:

R[λ] ≤ R_emp[λ] + ( [h(ln(2N/h) + 1) - ln(η/4)] / N )^0.5,  ∀ λ ∈ Λ   (1.4)

h: the VC-dimension of f(x, λ). From (1.4) it becomes apparent that to achieve small expected risk, i.e. improved accuracy, both the empirical risk and the ratio between the VC-dimension and the number of data points must stay relatively small. Since the empirical risk is usually a decreasing function of h, an optimal value of the VC-dimension exists for a given number of samples. Therefore, when given a relatively limited number of data points, choosing a suitable value of h (often controlled by free model parameters) is the key to good performance. The above led to the technique of SRM (Vapnik and Chervonenkis 1974), which attempts to choose the most appropriate VC-dimension. Overall, SRM is an inductive principle for optimizing the trade-off between hypothesis space complexity and empirical error. The resulting capacity control offers minimum test error while keeping the model as "simple" as possible.


SRM steps [11]:

• For the given domain, a class of functions is selected (e.g. nth-degree polynomials, n-layer neural networks, n-rule fuzzy logic).

• The function classes are divided into nested subsets, ordered by complexity.

• Empirical risk minimization is carried out on each subset, i.e. general parameter selection.

SRM therefore naturally holds the potential to dramatically improve the stability of forecast errors. As financial markets are exceptionally noise-saturated processes, consistency in prediction errors serves as the next best thing for practical utilization.

4.3 The Loss Function

The loss function measures the empirical risk. Many types exist; some are listed below [17]:

Type                   Loss function l(δ)                             Density function ρ(δ)
Linear ε-insensitive   |δ|_ε                                          1/(2(1+ε)) exp(-|δ|_ε)
Laplacian              |δ|                                            ½ exp(-|δ|)
Gaussian               ½ δ²                                           1/√(2π) exp(-δ²/2)
Huber's robust         (1/(2σ)) δ² if |δ| ≤ σ; |δ| - σ/2 otherwise    exp(-δ²/(2σ)) if |δ| ≤ σ; exp(σ/2 - |δ|) otherwise
Polynomial             (1/d) |δ|^d                                    d/(2Γ(1/d)) exp(-|δ|^d)

The target values y are generated by an underlying functional dependency f plus additive noise δ with density ρ(δ). Minimizing R_emp then coincides with maximum-likelihood estimation when the loss is chosen as

l(f(x), y) = -log[p(y | x, f)].

An example is the square loss function, which corresponds to y being affected by Gaussian (normal) noise:

l_2(y, f(x)) = ½ (y - f(x))², or l_2(δ) = ½ δ².   (1.7)

The squared loss, however, is not always the best choice. The ε-insensitive loss function above is also popular, and is the one applied for the ε-SVR used later in this study.


l_ε(y, f(x)) = { 0,               if |y - f(x)| < ε
               { |y - f(x)| - ε,  otherwise   (1.8)

The ε-insensitive function registers no error as long as the data points remain within the range of ±ε. Increasing ε therefore tends to reduce the number of support vectors; at the extreme it may result in a constant regression function. The loss function thus indirectly affects the complexity and generalization of the model.
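A direct transcription of the ε-insensitive loss (1.8) as a small Python helper (added for illustration); ε plays the role of the tube half-width.

```python
import numpy as np

def eps_insensitive_loss(y, f_x, eps):
    """Zero inside the +/- eps tube, |y - f(x)| - eps outside it (eq. 1.8)."""
    residual = np.abs(np.asarray(y) - np.asarray(f_x))
    return np.maximum(0.0, residual - eps)

print(eps_insensitive_loss([1.0, 1.0, 1.0], [1.05, 1.30, 0.60], eps=0.1))
# -> [0.  0.2 0.3]
```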

4.4 ε-SVR

With the advantages of the ε-insensitive loss, ε-SVR has become quite helpful for regression-type data sets. It aims to find a function f with parameters w and b by minimizing the regression risk

R_reg(f) = ½ <w, w> + C Σ_{i=1}^N l(f(x_i), y_i)   (1.5)

C: cost of error, a trade-off term
<·,·>: inner product; the first term can be seen as the margin, measuring the VC-dimension.

The Euclidean norm <w, w> measures the flatness of the function f, so minimizing <w, w> makes the objective function as flat as possible.

f(x, w, b) = <w, Φ(x)> + b   (1.6)

Minimization of (1.5) is equivalent to the following constrained minimization problem:

min Y(w, b, ξ^(*)) = ½ <w, w> + C Σ_{i=1}^N (ξ_i + ξ_i*)   (1.9)

subject to y_i - (<w, Φ(x_i)> + b) ≤ ε + ξ_i,
           (<w, Φ(x_i)> + b) - y_i ≤ ε + ξ_i*,   (1.10)
           ξ_i^(*) ≥ 0,  i = 1,…,N

(*): denotes both the variables with and without asterisks
ξ_i and ξ_i*: the up-side and down-side errors for sample (x_i, y_i), respectively

4.5 Standard Method for SVR QP Solutions

The standard way to obtain the optimal solution of (1.9), i.e. the function [17]

f(x, w, b) = <w, Φ(x)> + b   (1.6)

involves constructing the dual of the primal minimization problem by the Lagrange method and then maximizing the dual. The dual QP works out as

min Q(α^(*)) = ½ Σ_{i=1}^N Σ_{j=1}^N (α_i - α_i*)(α_j - α_j*) <Φ(x_i), Φ(x_j)>
             + Σ_{i=1}^N (ε - y_i) α_i + Σ_{i=1}^N (ε + y_i) α_i*   (1.11)

subject to Σ_{i=1}^N (α_i - α_i*) = 0,  α_i^(*) ∈ [0, C].   (1.12)

After solving this QP, we get the objective function


f(x) = Σ_{i=1}^N (α_i - α_i*) <Φ(x_i), Φ(x)> + b

α and α*: Lagrange multipliers used to move f towards y. With the QP solved, we still need to find the value of b. The KKT conditions help in this case:

α_i (ε + ξ_i - y_i + <w, Φ(x_i)> + b) = 0
α_i* (ε + ξ_i* + y_i - <w, Φ(x_i)> - b) = 0,
and
(C - α_i) ξ_i = 0
(C - α_i*) ξ_i* = 0.

The above presents some interesting conclusions:

α_i^(*) = C means that the sample (x_i, y_i) lies outside the ε margin.
α_i α_i* = 0, so this pair is never simultaneously non-zero.
α_i^(*) ∈ (0, C) corresponds to (x_i, y_i) lying on the ε margin; b can then be found as

b = y_i - <w, Φ(x_i)> - ε,  for α_i ∈ (0, C)
b = y_i - <w, Φ(x_i)> + ε,  for α_i* ∈ (0, C)

The sample points (x_i, y_i) with nonzero α_i or α_i* are the Support Vectors.

4.6 The Decomposition Method [8]

Real-life QPs, however, do not come in nice forms and require heavy computing power to solve. The decomposition method has become the general approach for solving SVM QPs via an iterative process. The general form for C-SVC, ε-SVR, and one-class SVM is

min_α f(α) = ½ αᵀQα + pᵀα
subject to yᵀα = Δ, 0 ≤ α_t ≤ C, t = 1,…,l,   (3.1)

where y_t = ±1, t = 1,…,l (C-SVC and one-class SVM are already in this form). For ε-SVR we consider the following reformulation,

min ½ [αᵀ, (α*)ᵀ] [Q -Q; -Q Q] [α; α*] + [εeᵀ + zᵀ, εeᵀ - zᵀ] [α; α*]
subject to yᵀ[α; α*] = 0, 0 ≤ α_t, α_t* ≤ C, t = 1,…,l,   (3.2)

where e is the vector of all ones, z the vector of target values, and y is a 2l-by-1 vector with y_t = 1 for t = 1,…,l and y_t = -1 for t = l+1,…,2l.

This method modifies only a subset of α per iteration. This subset, the working set B, leads to a small sub-problem to be minimized each iteration; in each iteration we solve a simple two-variable problem.

Algorithm 1 (SMO-type decomposition method, Fan et al. 2005)

1. Find α¹ as an initial feasible solution. Set k = 1.


2. If α^k is a stationary point of (3.1), stop. Otherwise find a two-element working set B = {i, j} by working set selection (WSS). Define N = {1,…,l}\B, and let α_B^k and α_N^k be the sub-vectors of α^k corresponding to B and N respectively.

3. If a_ij ≡ K_ii + K_jj - 2K_ij > 0, solve the following sub-problem in the variable α_B:

min_{α_B} ½ α_Bᵀ Q_BB α_B + (p_B + Q_BN α_N^k)ᵀ α_B
subject to 0 ≤ α_i, α_j ≤ C, y_i α_i + y_j α_j = Δ - y_Nᵀ α_N^k.   (3.3)

Otherwise solve (3.3) with an added convex modification term proportional to ((α_i - α_i^k)² + (α_j - α_j^k)²).   (3.4)

4. Set α_B^{k+1} to be the optimal solution of the sub-problem and α_N^{k+1} = α_N^k. Set k ← k + 1 and go to step 2.

So B is updated at each iteration. If a_ij ≤ 0, (3.3) is a concave problem, and we use the convex modification (3.4) instead.

Stopping criteria and WSS for C-SVC, ε-SVR, and one-class SVM

The KKT optimality condition says that a vector α is a stationary point of the general form (3.1) if and only if there is a number b and two nonnegative vectors λ and μ such that

∇f(α) + by = λ - μ,
λ_i α_i = 0, μ_i (C - α_i) = 0, λ_i ≥ 0, μ_i ≥ 0, i = 1,…,l,

where ∇f(α) ≡ Qα + p is the gradient of f(α). This can be rewritten as

∇f(α)_i + b y_i ≥ 0 if α_i < C,
∇f(α)_i + b y_i ≤ 0 if α_i > 0.

Since y_i = ±1, by defining

I_up(α) ≡ {t | α_t < C, y_t = 1 or α_t > 0, y_t = -1} and
I_low(α) ≡ {t | α_t < C, y_t = -1 or α_t > 0, y_t = 1},

a feasible α is a stationary point of the general form (3.1) if and only if m(α) ≤ M(α), where

m(α) ≡ max_{t∈I_up(α)} -y_t ∇f(α)_t and M(α) ≡ min_{t∈I_low(α)} -y_t ∇f(α)_t.

We then have the following stopping condition: m(α^k) - M(α^k) ≤ ε.

About the selection of the working set B, we consider the following:

WSS 1
1. For all t, s, define a_ts ≡ K_tt + K_ss - 2K_ts and b_ts ≡ -y_t ∇f(α^k)_t + y_s ∇f(α^k)_s > 0, and select
   i ∈ arg max_t { -y_t ∇f(α^k)_t | t ∈ I_up(α^k) },
   j ∈ arg min_t { -b_it²/a_it | t ∈ I_low(α^k), -y_t ∇f(α^k)_t < -y_i ∇f(α^k)_i }.
2. Return B = {i, j}.


4.7 The Kernel Function [3]

The name kernel comes from integral operator theory, which supplies most of the supporting theory on the relations between kernels and their associated feature spaces.

SVR applies the mapping function Φ to handle non-linear samples, such as financial data sets. It maps the input space X into a new space Ω = {Φ(x) | x ∈ X}, i.e. the mapping function turns x = (x_1,…,x_N) into Φ(x) = [Φ_1(x),…, Φ_N(x)]. In the feature space Ω we can then obtain a linear regression function.

In the standard SVR QP,

min Q(α^(*)) = ½ Σ_{i=1}^N Σ_{j=1}^N (α_i - α_i*)(α_j - α_j*) <Φ(x_i), Φ(x_j)> + Σ_{i=1}^N (ε - y_i) α_i + Σ_{i=1}^N (ε + y_i) α_i*

subject to Σ_{i=1}^N (α_i - α_i*) = 0,  α_i^(*) ∈ [0, C],

we notice that the objective function contains an inner product of the mapping function Φ(x). The inner product lets us specify a kernel function WITHOUT considering the mapping function or the feature space explicitly. With this advantage, we define the kernel function

K(x, z) = <Φ(x), Φ(z)>.

Since the feature vectors are never expressed explicitly, the number of inner product computations need not grow in proportion to the number of features. The kernel makes it possible to map data implicitly into a feature space for training, while evading potential problems in evaluating the feature map. The Gram (kernel) matrix remains the only information about the training set applied in training.

The kernel function usually satisfies Mercer's Theorem, which represents a symmetric positive-definite function on a square as a sum of a convergent sequence of product functions (James Mercer (1909), "Functions of positive and negative type and their connection with the theory of integral equations", Philos. Trans. Roy. Soc. London 209).

There are 4 basic types of kernel functions [4]:

1. Linear: K(x_i, x_j) = x_iᵀx_j
2. Polynomial: K(x_i, x_j) = (γ x_iᵀx_j + r)^d, γ > 0
3. Radial Basis Function (RBF): K(x_i, x_j) = exp(-γ ||x_i - x_j||²), γ > 0
4. Sigmoid: K(x_i, x_j) = tanh(γ x_iᵀx_j + r)

Here γ, r and d are kernel parameters. The RBF kernel usually comes as a reasonable first choice. It non-linearly maps samples into a higher-dimensional space, so it is likely to handle financial (non-linear) data better than the linear kernel. At the same time, it does not become as complex as the polynomial kernel or run into the validity issues of the sigmoid type.


Note: the linear and sigmoid kernels have been shown to behave like the RBF kernel for certain parameters (Keerthi and Lin 2003, Lin and Lin 2003). Also, the polynomial kernel may become too complex, as it has more hyper-parameters than the RBF kernel. The RBF kernel also has fewer numerical difficulties, e.g. 0 < K_ij ≤ 1 for RBF, whereas for the polynomial kernel, as the degree increases, values may go to infinity (γ x_iᵀx_j + r > 1) or to zero (γ x_iᵀx_j + r < 1). Lastly, the sigmoid kernel is not valid in some circumstances. If the number of features is very large, we may have to use the linear kernel.
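For illustration (not from the original paper), the sketch below builds the RBF Gram matrix K_ij = exp(-γ ||x_i - x_j||²) explicitly; this is the kernel-matrix information referred to above, and every entry lies in (0, 1].

```python
import numpy as np

def rbf_kernel_matrix(X, Z, gamma):
    # squared Euclidean distances between all rows of X and all rows of Z
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Z**2, axis=1)[None, :]
                - 2.0 * X @ Z.T)
    return np.exp(-gamma * sq_dists)

X = np.random.default_rng(1).standard_normal((5, 3))
K = rbf_kernel_matrix(X, X, gamma=0.5)
print(K.shape, K.diagonal())   # (5, 5), ones on the diagonal
```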

4.8 Cross Validation and Grid-search

The goal here is to find the best values of (C, γ) for optimal predictions [4]. In v-fold cross-validation, we first divide the training set into v subsets of equal size. Each subset is then tested using the model trained on the other (v - 1) subsets. Thus each instance of the whole training set is predicted once, and the cross-validation accuracy is the percentage of data that are correctly predicted. This process helps prevent the over-fitting problem. Grid-search: pairs of (C, γ) are tried and the one with the best cross-validation performance is picked.
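A minimal sketch of the v-fold cross-validated grid search over (C, γ) described above, using scikit-learn; the data and the parameter grid are placeholders, not the values used in the study.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 6))                    # stand-in predictive variables
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.standard_normal(300)

param_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1.0]}
search = GridSearchCV(SVR(kernel="rbf", epsilon=0.1), param_grid,
                      cv=5, scoring="neg_mean_squared_error")   # 5-fold CV
search.fit(X, y)
print(search.best_params_, search.best_score_)
```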


Chapter V. Empirical Analysis and Findings

5.1 About Financial Time Series

The financial markets do not move entirely at random. Just as natural resources are limited, so is the liquidity of financial instruments on the exchanges [2]. Supply and demand drive price moves, i.e. basic economics. To study prices of the immediate future, then, one must investigate the forces affecting market supply or demand. Many researchers apply the historical response variable (the predicted output) as the exclusive input as well, implying that financial time series are dependent and stationary in nature. This is fundamentally flawed, as the institutions (hedge funds, investment banks, George Soros, etc.) whose trades move prices largely make decisions with respect to economic conditions. Financial time series do contain considerable noise, making precise value forecasts challenging, and short-term intra-day price change predictions require deep analysis of market depth (bid/ask volume changes at the market-maker and specialist levels, etc.) [1]. Notwithstanding the foregoing, empirical evidence suggests noteworthy relations between financial instruments and economic factors over longer time frames.

5.2 Analysis Set-up

This study aims to forecast the ASX200 stock index value (range) and directional bias. The predictions result from interpretations of several independent, significantly correlated financial time series. Historically, equity (and commodity, real estate) markets often respond to inter-market interactions involving factors such as interest rates, political events, and traders' expectations. Inflationary, liquidity, and sentiment measures independent of the ASX200 (the response variable) therefore make viable predictive inputs. The study thus commences on a fundamentally sound structure.

5.2.1 General idea

The training data set (X_i, Y_{i+1}) holds daily closing prices and quantities derived from them.

Y_{i+1}: set of output/response variables used for training at day (i+1)
X_i: set of input variables matched at day (i)
i = 1,…,N, with N = 1,707

The initial training set (X_i, Y_{i+1}) starts from 31/10/01 and ends at the week ending March 2008. Training allows the SVR to create a linear model in the high-dimensional feature space matching the input sets with the output values of the next trading day. Once training completes, we can apply the fitted SVR regression model to testing data X_t to make next-day forecasts of Y_{t+1}.

5.2.2 Training Output Selection (Y_{i+1})

Yahoo Finance serves as the (free) data provider. The following are the 1-day-forward response variables:


y_{i+1}: (ticker symbol ^AXJO) the ASX200 closing price for day (i+1).

y_r: the ASX200 1-day return for day (i+1), computed as y_r = y_{i+1}/y_i - 1 and expressed as a percentage.

5.2.3 Training Input Selection (X i )

While numerous economic factors contribute to effective valuation of global equity indexes (including the ASX200), the nature of this paper focuses on the mathematical potential of SVR; the predictive variables then shall stay relatively minimal, the few selected however hold significant correlations to global equity markets.

As globalization descends, global financial markets have developed highly positive correlations with one another. Therefore it makes sense to utilize American financial quantities for availability, i.e. American stock/derivative exchanges (NYSE, ISE, or CBOE) offer a vast amount of free information. The ASX does not.

The predictive time series are acquired from ratesfx.com and Yahoo Finance and listed below,

1. AUD: The value of 1 AUD (Australian Dollar) in USD (US Dollar). The exchange rate plays an important role, as arbitrageurs, particularly institutional program-trading machines exploit price discrepancies between international exchanges. Program-trading deals with large volumes, often enough to have significant price moving impact. [7]

2. VIX: (Ticker Symbol: ^vix) The CBOE (Chicago Board Options Exchange) Volatility Index. This index reflects the average implied volatility of near-the-money S&P500 index options. As the S&P500 remains the most prevalently referenced American stock index, the VIX reflects the "expected" market volatility of the immediate future and thereby the sentiment of American traders.

3. GOX: (Ticker Symbol: ^gox) The CBOE Gold Index. Throughout the ages, gold has served well as an instrument of fixed intrinsic value, i.e. with insignificant depreciation and no real added value. The gold price therefore presents a near-perfect indication of the real rate of inflation.

From the VIX and GOX series, four derived inputs are computed:

1. V_1: 1-day historical change in the VIX. Formula: VIX_i/VIX_{i-1} - 1

2. V_5: 5-day historical change in the VIX. Formula: VIX_i/VIX_{i-5} - 1

3. G_1: 1-day realized return of the GOX. Formula: GOX_i/GOX_{i-1} - 1

4. G_5: 5-day realized return of the GOX. Formula: GOX_i/GOX_{i-5} - 1

The derived changes in the VIX and GOX should reflect investor sentiment and inflationary concerns respectively. As the rate of change (derivative) of volatility and the rate of inflation appear stationary when referenced historically, these measures should contribute significant predictive power. A sketch of the computation follows.
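Below is a minimal sketch, in Python rather than the Matlab/LS-SVM toolchain used in the study, of deriving these four inputs from raw VIX and GOX closing series; the function and column names are illustrative assumptions.

```python
import pandas as pd

def derived_inputs(vix: pd.Series, gox: pd.Series) -> pd.DataFrame:
    """Compute V_1, V_5, G_1, G_5 as defined in Section 5.2.3."""
    return pd.DataFrame({
        "V1": vix / vix.shift(1) - 1,   # 1-day change in the VIX
        "V5": vix / vix.shift(5) - 1,   # 5-day change in the VIX
        "G1": gox / gox.shift(1) - 1,   # 1-day return of the GOX
        "G5": gox / gox.shift(5) - 1,   # 5-day return of the GOX
    }).dropna()                          # the first 5 rows lack enough history
```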

5.2.4 Testing Variables

X_t: all of the training input variables, observed on day (t).

Z_{t+1}: the forecast set, consisting of the ASX200 index forecast (z_{t+1}) and the 1-day return forecast (z_r), at (t+1).

Test data for (z_{t+1}) ranges from 3/9/08 to the week ending 9/4/09.

Test data for z_r ranges from 10/12/08 to the week ending 9/4/09.


5.2.5 Error Analysis

ε_c: the (z_{t+1}) test error. Formula: ε_c = y_{t+1}/z_{t+1} - 1

ε_r: the (z_r) test error. Formula: ε_r = y_r(t)/z_r(t) - 1
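A small illustration of the ε_c calculation (all numbers are made up; ε_r works the same way on the return series):

```python
import numpy as np

# Illustrative arrays of actual vs. forecast values (made-up numbers).
y_next = np.array([3500.0, 3460.0, 3510.0])   # actual ASX200 closes
z_next = np.array([3525.0, 3440.0, 3490.0])   # SVR forecasts for the same days

eps_c = y_next / z_next - 1                    # closing-price valuation error
print(np.round(eps_c, 4))                      # e.g. [-0.0071  0.0058  0.0057]
```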

5.2.6 SVR Adaptive Training Strategy

Since financial time series are not stationary, distant extrapolations usually show inconsistent, unreliable performance. To remedy this, for every y_{t+1} the training data window is adjusted to end at day (t), thereby adapting with each sequential test. For example, the training set (X_1, …, X_100, Y_2, …, Y_101) leads to the test variables (X_101, Z_102); the next adapted training set (X_2, …, X_101, Y_3, …, Y_102) then leads to (X_102, Z_103).

This slight tweak of the machine learning process resulted in significantly more contained Z_{t+1} test errors: it lets the SVR continuously adjust and learn from the daily changes in x_i. The extrapolation then always remains one time step (i.e. one business day) forward, thereby eliminating nonstationarity-related testing error. A sketch of this rolling retraining loop follows Section 5.2.7.

5.2.7 Applied Software

Many current SVM researchers have created software toolsets for Matlab. LS-SVM (Least Squares SVM) turned out to be the most user-friendly and was therefore applied in this study; the adaptive training modification, however, requires extra legwork, since the toolbox does not carry this capability by default. Minitab handles the test-error and associated analysis adequately, and Excel handles the raw data.
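Below is a minimal sketch of the adaptive one-step-ahead retraining described in Section 5.2.6. It uses scikit-learn's SVR purely as a stand-in for the LS-SVM Matlab toolbox applied in the study; the hyperparameters and names are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVR

def rolling_one_step_forecasts(X: np.ndarray, y: np.ndarray, window: int) -> np.ndarray:
    """Retrain on the latest `window` aligned pairs, forecast one day ahead, slide by one day.

    X[i] holds the inputs of day i and y[i] holds the response of day i+1,
    so each row is already an aligned (X_i, Y_{i+1}) training pair.
    """
    forecasts = []
    for t in range(window, len(X)):
        model = SVR(kernel="rbf", C=10.0, epsilon=0.01)   # illustrative hyperparameters
        model.fit(X[t - window:t], y[t - window:t])        # training pairs end at day t
        forecasts.append(model.predict(X[t:t + 1])[0])     # one-step-ahead forecast z_{t+1}
    return np.array(forecasts)
```

Each iteration discards the oldest day and appends the newest, so the model is always extrapolating exactly one business day ahead.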


5.3 Empirical Findings

Below is some statistical analysis of the SVR forecast results and errors.

[Figure: ASX200 actual performance vs. ε_c and ε_r]

With ε_c tracking the actual ASX200 so closely, the index price moves could potentially be exploited if ε_c shows a likelihood of remaining stationary. While ε_r appears random, it also appears fairly contained.


5.3.1 Cross Correlation Analysis

Cross correlation of Time Series 1: y_t (actual ASX200 closing price) and Time Series 2: z_t (SVR forecast).

A few points of interest:

1. Correlation between (y, z): 38.56% at lag = 0.
2. Correlation between (y, z): 54.83% at lag = 29.

With the correlation significantly positive as the lag grows, particularly at roughly 29 time steps, the ASX200 index seems to "follow" the SVR-derived extrapolation value. A negative correlation at negative lags suggests an equivalent likelihood. An analysis of the error terms could support this idea.
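A minimal sketch of how such a lagged cross-correlation can be computed; this is a pandas stand-in for the statistics package actually used, and the series names are illustrative.

```python
import pandas as pd

def cross_correlation(y: pd.Series, z: pd.Series, max_lag: int = 30) -> pd.Series:
    """Correlation between y_t and z_{t-lag} for lag = 0..max_lag.

    A positive peak at lag k > 0 means y tends to follow z by roughly k days.
    """
    return pd.Series(
        {lag: y.corr(z.shift(lag)) for lag in range(max_lag + 1)},
        name="cross_correlation",
    )
```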


Cross correlation of Time Series 1: y_r (actual 1-day return) and Time Series 2: z_r (forecast 1-day return).

Correlation between (y_r, z_r): 9.7% at lag = 0. It appears that an edge, while not large, still exists for the 1-day return forecast. The apparently spurious correlation values as the absolute lag increases suggest that z_r should be applied to its associated day exclusively.

5.3.2 Normality Tests

These tests offer a feel for the error distributions, as the subsequent analysis depends on their conclusions.

(ε_c): Closing Price Valuation Error Normality Test


The ASX200 closing price valuation error (ε_c) distribution does not appear normal; therefore a nonparametric analysis is required to analyze the data.

(ε_r): 1-day Return Forecast Error Normality Test

Interestingly, the 1-day return forecast error does appear to resemble a normal distribution, so a parametric approach follows below.
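A rough Python equivalent of such a normality check (the study used Minitab's normality test; the ε_r sample below is a randomly generated placeholder):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
eps_r = rng.normal(loc=0.0006, scale=0.0157, size=83)   # placeholder for the real ε_r sample

stat, p_value = stats.shapiro(eps_r)                     # Shapiro-Wilk normality test
print(f"W = {stat:.3f}, p = {p_value:.3f}")              # large p: no evidence against normality
```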


5.3.3 Error Distributions

Let us look at the distributions and perhaps get a feel for their behaviour.

(ε_c)

Applying the nonparametric method:

Wilcoxon Signed Rank CI: Ec

            Estimated    Achieved   Confidence Interval
      N  N*    Median  Confidence     Lower      Upper
Ec  152   2   -0.1492        95.0   -0.1685    -0.1223

The error appears largely negative, reflecting the heavy panic-motivated selling since late 2007. Interestingly, (ε_c) did not present any considerable outliers despite the jumps in market volatility throughout the test period.
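A rough Python counterpart of the point estimate behind Minitab's Wilcoxon signed-rank CI, the Hodges-Lehmann median of Walsh averages; the confidence-interval construction itself is omitted, and the ε_c sample is a placeholder.

```python
import numpy as np
from scipy import stats

def hodges_lehmann(sample: np.ndarray) -> float:
    """Median of all pairwise Walsh averages (x_i + x_j) / 2 with i <= j."""
    i, j = np.triu_indices(len(sample))
    return float(np.median((sample[i] + sample[j]) / 2.0))

eps_c = np.random.default_rng(1).normal(-0.15, 0.02, size=152)   # placeholder for the real ε_c

print("estimated median:", round(hodges_lehmann(eps_c), 4))
print(stats.wilcoxon(eps_c))   # signed-rank test of symmetry about zero
```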


(ε_r)

Applying the parametric method:

Descriptive Statistics: Er

Variable   N  N*     Mean  SE Mean    StDev   Minimum        Q1   Median
Er        83  71  0.00061  0.00172  0.01565  -0.04245  -0.00951  0.00327

Variable       Q3  Maximum  Skewness  Kurtosis
Er        0.01076  0.03427     -0.41      0.18

In spite of the relatively small sample, the negative skew and positive kurtosis agree with actual equity market behaviour. The negative skew reveals the tendency for stock prices to fall harder than they rise, partly as a result of the inherent credit risk associated with each exchange-listed entity. The positive kurtosis coincides with the way financial instrument values tend to move sporadically, as volatility and traders' sentiment do not stand still (evidenced empirically).
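A minimal sketch of computing the same summary statistics in Python as a stand-in for the Minitab output above; the ε_r array is a randomly generated placeholder, so the printed values will not match the table.

```python
import numpy as np
from scipy import stats

eps_r = np.random.default_rng(2).normal(0.0006, 0.0157, size=83)   # placeholder for the real ε_r

print("mean    :", round(eps_r.mean(), 5))
print("std dev :", round(eps_r.std(ddof=1), 5))
print("skewness:", round(stats.skew(eps_r), 2))
print("kurtosis:", round(stats.kurtosis(eps_r), 2))   # excess kurtosis (normal distribution = 0)
```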


5.3.4 The Gold Connection

On a coincidental note, the AXJO shows a considerably high positive correlation with the GOX (see the graph below; y = correlation, x = i) over the period from Oct. 2001 to Feb. 2009.

The correlation suggests that the ASX200 index largely adjusts with the real rate of inflation, so roughly no significant real growth in value has occurred for it over the past decade. It also implies that gold remains a significant factor in the pricing of equity indexes.
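One plausible way to compute such a correlation series, sketched in pandas with illustrative names; the text does not state the exact window construction, so an expanding window up to day i is assumed here.

```python
import pandas as pd

def expanding_correlation(axjo: pd.Series, gox: pd.Series, min_periods: int = 30) -> pd.Series:
    """Correlation between the two closing-price series using all data up to each day i.

    One plausible reading of the graph (correlation plotted against i);
    a fixed-length rolling window would be the obvious alternative.
    """
    return axjo.expanding(min_periods=min_periods).corr(gox)
```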

5.4 Test Conclusions

SRM (Structural Risk Minimization) has really shone throughout the experiment. Despite the recent shocks in global financial economics, the SVR-derived quantities maintained a fairly stable bound on errors. This means practical value for professional traders: having an accurate forecast of price ranges definitely helps, as it makes it possible to literally "buy low, sell high". Though not quite the Holy Grail, SVR offers a promising means of exploiting economic inefficiencies, perhaps opening doors to a new frontier.

[Figure (referenced in Section 5.3.4): GOX, AXJO Correlation; y-axis: correlation from -1.00 to 1.50, x-axis: i from 0 to 2000]


References

[1] Almgren, R., Thum, C., Hauptmann, E., Li, H. (2006). Equity market impact (Quantitative trading, 2006/14). University of Toronto, Dept. of Mathematics and Computer Science; Citigroup Global Quantitative Research, New York.

[2] Boucher, M. (1999). The Hedge Fund Edge. New York: John Wiley & Sons, Inc.

[3] Cao, L., Tay, F. (2001). Financial Forecasting Using Support Vector Machines (Neural Computing & Applications (2001) 10: 184-192). Dept. of Mechanical and Production Engineering, National University of Singapore, Singapore.

[4] Chang, C., Lin, C. (2009). LIBSVM: A Library for Support Vector Machines. Dept. of Computer Science, National Taiwan University, Taipei.

[5] Chen, P., Fan, R., Lin, C. & Joachims, T. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines (Journal of Machine Learning Research 6 (2005) 1889-1918). Dept. of Computer Science, National Taiwan University, Taipei.

[6] Claessen, H., Mittnik, S. (2002). Forecasting Stock Market Volatility and the Informational Efficiency of the DAX-index Options Market. Johann Wolfgang Goethe-University, Frankfurt.

[7] Dubil, R. (2004). An Arbitrage Guide To The Financial Markets. West Sussex, England: John Wiley & Sons, Ltd.

[8] Glasmachers, T., Igel, C. (2006). Maximum-Gain Working Set Selection for SVMs (Journal of Machine Learning Research 7 (2006) 1437-1466). Ruhr-University Bochum, Germany.

[9] Huang, W., Nakamori, Y. & Wang, S. (2004). Forecasting Stock Market Direction with Support Vector Machine (Computers & Operations Research 32 (2005) 2513-2522). School of Knowledge Science, Japan Advanced Institute of Science and Technology; Institute of Systems Science, Academy of Mathematics and Systems Sciences, Chinese Academy of Sciences, Beijing.

[10] Jensen, P., Bard, J. (2008). Operations Research Models and Methods. New York: John Wiley & Sons, Inc.

[11] Joachims, T. (1998). Making Large-Scale SVM Learning Practical. Cambridge, USA: MIT Press.

[12] Kecman, V. (2001). Learning from data, Support vector machines and neural networks. Cambridge, USA: MIT Press.


[13] Kumar, M., Thenmozhi, M. (2005). Forecasting Stock Index Movement: A Comparison of Support Vector Machines And Random Forest. Dept. of Management Studies, Indian Institute of Technology, Chennai.

[14] Li, B., Hu, J. & Hirasawa, K. (2008). Financial Time Series Prediction Using a Support Vector Regression Network. Graduate School of Information, Waseda University, Japan.

[15] Nocedal, J., Wright, S. (1999). Numerical Optimization. New York: Springer-Verlag.

[16] Smola, A., Schölkopf, B. (2003). A Tutorial on Support Vector Regression (NeuroCOLT Technical Report TR-98-030). RSISE, Australian National University, Canberra, Australia & Max-Planck-Institut für biologische Kybernetik, Tübingen, Germany.

[17] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. Dept. of Computer Science & Engineering, The Chinese University of Hong Kong, Hong Kong.