
AUSTRALIAN NATIONAL UNIVERSITY

PROJECT REPORT

Stochastic Volatility Models with Auto-regressive Neural Networks

Author: Aditya KHAIRE

Supervisor: Adj/Prof. Hanna SUOMINEN

Co-Supervisor: Dr. Young LEE

A report submitted in fulfillment of the requirements for the subject Special Topics in Computing (COMP6470)

in the

Department of Computer Science

October 28, 2016


AUSTRALIAN NATIONAL UNIVERSITY

Abstract

Dr. Weifa Liang

Department of Computer Science

Stochastic Volatility Models with Auto-regressive Neural Networks

by Aditya KHAIRE

Financial time series data often exhibit high volatility, which makes financial data more unpredictable and future values very hard to predict. There are various models from econometrics that model the stochastic volatility of such data. They mainly focus on the mean and variance of the time-dependent data, but the prediction of the variance for the next time interval is not their main concern. The algorithm proposed in this project focuses on the prediction of the variance from the time-dependent data set.


Acknowledgements

I would like to express my special thanks and gratitude to my supervisor, Adj/Prof. Hanna SUOMINEN, as well as my co-supervisor, Dr. Young LEE, who gave me the opportunity to do this project on the topic Stochastic Volatility Models with Auto-regressive Neural Networks. It helped me do a great deal of research, and I came to know about many new things; I am really thankful to them. The thesis was partly carried out at National Information and Communication Technology Australia (NICTA) and its successor, Data61. I also express my gratitude to Mr. Kar Wai Lim, my advisor at NICTA/Data61 and the ANU.

Secondly, I would also like to thank my parents and friends, who helped me a lot in finalizing this project within the limited time frame.


Contents

Abstract

Acknowledgements

1 Introduction
   1.1 Introduction
   1.2 Literature Review

2 Gaussian Process Volatility Model
   2.1 Auto-regressive Models
   2.2 Stationarity
   2.3 Stochastic Process
   2.4 Gaussian Processes
   2.5 The Stochastic Volatility Model
       2.5.1 The Priors
   2.6 Heteroscedasticity
   2.7 Generalized Auto-regressive Conditionally Heteroscedastic (GARCH)
       2.7.1 ARCH(1) Processes
       2.7.2 GARCH(1,1) Process
   2.8 Model GARCH Process
   2.9 Results on GARCH

3 Auto-Regressive Neural Network (AR-NN)
   3.1 Feed Forward Neural Network
   3.2 Design of Auto-Regressive Neural Network
       3.2.1 Time Series Data set
       3.2.2 Weight-space
       3.2.3 Activation Function
       3.2.4 Back Propagation Algorithm
       3.2.5 Practical Reducibility
   3.3 Function
   3.4 Results
       3.4.1 Alpha
       3.4.2 When alpha = 1
       3.4.3 When alpha ($-1 \le \alpha \le 1$)
       3.4.4 When alpha ($\alpha > 1$)
       3.4.5 Discussion on Activation Function
   3.5 Prediction Results

4 Comparison of Models
   4.1 Comparison between SV, GARCH and AR-NN

5 Summary
   5.1 Conclusion
   5.2 Recommendation
   5.3 Future Work

6 Independent Study Contract
   6.1 Contract

Bibliography


List of Figures

2.1 Prediction and Observation for the GARCH model

3.1 Auto-Regressive Neural Network
3.2 AUD/USD
3.3 Error Function
3.4 Error Function with $\alpha = 1$
3.5 Error Function with $-1 \le \alpha \le 1$
3.6 Error Function with $\alpha > 1$
3.7 Logistic Activation Function
3.8 tanh Activation Function
3.9 AR-NN Prediction and Observed
3.10 Error on Train set

6.1 Contract
6.2 Contract Page 2
6.3 Contract Page 3


List of Tables

2.1 Values for parameters $\omega$, $\alpha$, $\beta$
2.2 Error on GARCH model

4.1 Number of iterations required for convergence


List of Abbreviations

MCMC - Markov Chain Monte Carlo

AR-NN - Auto-regressive Neural Network

GARCH - Generalized Auto-Regressive Conditional Heteroskedasticity

ARCH - Auto-Regressive Conditional Heteroskedasticity

SV - Stochastic Volatility

NN - Neural Network

AUD/USD - Australian Dollar / U.S. Dollar


Chapter 1

Introduction

1.1 Introduction

Financial time series data from various sources, such as the stock market or foreign exchange, are time dependent and have the time index as one of the dependent variables. It is often observed in financial time series that large changes tend to be followed by further large changes, while small changes are followed by further small changes; this phenomenon is referred to as volatility clustering (Nicolas Chapados, 2012). The stochastic volatility model is used for modelling the financial data. In this model, Gaussian noise is considered an additive factor. When modelling the non-linear stochastic volatility with a linear state space, the model becomes complicated, as the Gaussian noise is transformed into a non-linear distribution and no longer maps to a Gaussian distribution.

The approach for predicting time series data differs from that for time-independent data, which can be transformed and is easy to manipulate; the same cannot be assumed when dealing with time-dependent data. Studies have shown that time series data are easy to model when they are linear in nature with a uniform distribution over time, but when a non-linear time series must be handled it is usually converted into linear form and then modelled using a linear distribution. In the linear approach, Bayesian inference is the most suitable approach for modelling the posterior mean and variance. The posterior mean is the time-varying standard deviation, and it can be modelled through the stochastic volatility model.

Neural models are non-linear in nature because of the activation functions used in the hidden layer of the network. Mapping the non-linear financial data onto a neural model has its own advantages, as a neural network has the universal approximation property: with proper selection of prior parameters, the model can achieve a reasonably good predictor. The normal feed-forward neural network cannot be used for time series data, because it depends only on non-linear factors, which makes it difficult to map time-indexed data. A different kind of neural model, called the auto-regressive neural network, is used in this project for the purpose of prediction analysis. The proposed auto-regressive neural model differs from the feed-forward neural model in that it combines the linear model with additive Gaussian noise when developing the final model. The prediction from the neural model can be compared to the posterior mean from the linear Gaussian model, which in turn minimizes the mean squared error loss.

1.2 Literature Review

In recent years the financial market has become more volatile; after the 2008 crisis the market was unstable for a long time. There was no single specific reason for the crisis, but during this period financial returns were extremely high or low in comparison with normal periods. When building a portfolio, risk assessment, volatility, and other factors always depend on time series analysis. These are major factors in the financial market and have always been unpredictable because of the inefficiency of statistical models in predicting their values. As pointed out in (Nicolas Chapados, 2012), large changes in observations are followed by large changes, while small changes are followed by small changes; this is referred to as volatility clustering. Time series analysis is a field of economics that tries to estimate various factors by observing past values. Time series data are difficult to model because their mean and variance are non-constant. Basically, there are two types of models in time series analysis: linear models and non-linear models. In linear models, shocks are uncorrelated but are not assumed to be identical and independent; in non-linear models, shocks are assumed to be identical and independent (Ruppert, 2001). A financial series is stochastic in nature, with non-constant mean and variance, so the usual regression approach is not beneficial in the financial field. The approaches used for modelling stochastic volatility in the financial market are Bayesian inference (Nicolas Chapados, 2012) and stochastic volatility with Markov Chain Monte Carlo (MCMC) (Nicolas Chapados, 2012). These models are complex in nature and difficult to build. Hence, the most commonly used models were Auto-Regressive Conditionally Heteroskedastic (ARCH) models, and another recently used model is GARCH (Generalized Auto-Regressive Conditionally Heteroskedastic), an improved version of ARCH. These models are widely used in financial time series analysis because they exhibit the volatility clustering phenomenon. Other models, such as Bayesian inference and stochastic volatility with MCMC, are too complicated and not very popular in real-world time series modelling. Relatively less complex models are used in the financial market because of their stability in capturing non-linearity and because they are less complex to implement. Along with the advantages of modelling volatility with GARCH, it has a significant disadvantage: it is not able to model the hidden non-linearity in the data. The reason is that the GARCH model relies only on the previous forecast and variance to predict the current value. GARCH/ARCH models are used specifically for non-linearity in the data or for structural regression coefficients (Dietz, 2010). The models discussed above were never developed as prediction models, but rather to calculate the posterior variance and mean of time series data. The approach for predicting the non-linear volatility in financial time series data is complex, as there is no specific neural network for prediction. However, the neural network has a property called universal approximation (Dietz, 2010), which can be helpful in evaluating any function, and thus it can be utilised in prediction analysis. The universal approximation property says that a network with any continuous, bounded, and non-constant activation function can approximate any function. This property has made neural networks suitable for various applications, which inspired us to explore neural networks in the field of financial time series analysis. We are not using the normal feed-forward neural network, as it does not accommodate the linearity in the time series data. The model proposed for the analysis in (Dietz, 2010) serves as the foundation for our modelling.


Chapter 2

Gaussian Process Volatility Model

2.1 Auto-regressive Models

A time series is a sequence of variables measured over time at uniformly spaced time points, for example monthly, daily, or yearly. Time series data are Markov dependent with higher-order lags. In univariate state space models, a time series model represents the auto-regressive process AR($\rho$). The series $\{y_t\}$ means $y$ measured at time $t$ with a uniform time spacing. The AR(1) model regresses $y_t$ on the past value $y_{t-1}$, the previous value of the series:

$$y_t = \alpha_0 + \alpha_1 y_{t-1} + \epsilon_t \tag{2.1}$$

where $\epsilon_t \sim \mathcal{N}(0, 1)$ is additive white noise with $\mu = 0$ and $\sigma^2 = 1$, uncorrelated with the past values of the AR series $y_{t-1}$. The term $\epsilon_t$ represents the new contribution to $y_t$; these terms are known as the series' random shocks or innovations. The equation is termed auto-regressive because it is actually a linear regression model for $y_t$ in terms of $y_{t-1}$; that is, $y_t$ is modelled as a regression on its own past value $y_{t-1}$.

The value of $\alpha$ strongly affects the behaviour of the AR(1) process. In a series of observations where $y_{t-1}$ is a long vector, if $-1 < \alpha < 1$ the weights given to shocks $\epsilon_t$ that occurred a long time ago are extremely small, which makes the series stationary, so the mean and variance of the model remain constant as $t$ grows. If $\alpha > 1$, the weights given to the distant shocks are much greater than those given to more recent ones; this model is said to be explosive, as the series mean and variance tend to grow exponentially as $t$ grows. Finally, if $\alpha = 1$, the model is neither stationary nor explosive and is called a random walk.
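A minimal simulation sketch of equation (2.1) makes the three regimes concrete (this code is illustrative only and not part of the report's modelling pipeline; the function name and parameter values are assumptions):

```python
import numpy as np

def simulate_ar1(alpha0, alpha1, T, seed=0):
    """Simulate y_t = alpha0 + alpha1 * y_{t-1} + eps_t, with eps_t ~ N(0, 1)  (2.1)."""
    rng = np.random.default_rng(seed)
    y = np.zeros(T)
    for t in range(1, T):
        y[t] = alpha0 + alpha1 * y[t - 1] + rng.standard_normal()
    return y

# Stationary (|alpha| < 1), random walk (alpha = 1), explosive (alpha > 1).
for a1 in (0.5, 1.0, 1.05):
    y = simulate_ar1(alpha0=0.0, alpha1=a1, T=300)
    print(f"alpha1 = {a1}: std of last 100 points = {y[-100:].std():.2f}")
```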

2.2 Stationarity

An AR($\rho$) process is stationary if the following properties of the time series hold:

- $E[y_t] = 0$ for all $t$;
- $\operatorname{Var}(y_t) = \sum_{j=0}^{\infty} \pi_j^2$ for all $t$, which can hold only if the weights decay rapidly as $j \to \infty$;
- $\operatorname{Cov}(y_t, y_{t-k}) = \gamma(k)$ is a function of the $\pi_j$ weights and depends only on the lag $k$, not on $t$.

When $\rho \ge 3$, the restrictions on the $\alpha$ parameters increase and the model becomes much more unstable when predicting; moreover, if the model over-fits the training data, the generalization error measured on the test data will increase.


2.3 Stochastic Process

A stochastic process is a family of random variables $\{X_\theta\}$ with parameter $\theta$ indexed on the set $\Theta$, where $\Theta$ represents time as an index. A stochastic process that depends on time is a simple process that evolves at specific times according to specific probabilistic rules. Thus the state space assumed in the time-dependent analysis evolves into a stationary process in which the probabilistic rules are constant under the transition matrix. If $\pi$ is stationary and $P$ is a transition probability $P(X, Y)$, then

$$\pi = \pi P \quad \text{for all } t \ge 0 \tag{2.2}$$

A measure $\pi$ is stationary for $X$ if the measure $\mu$, where $\mu(x) = q_x \pi(x)$, is stationary for $Y$. For a discrete-time process, the random variable $X_n$ depends on earlier values of the process, $X_{n-1}, X_{n-2}, \ldots$; therefore the conditional distribution has the form

$$\Pr(X_{t_k} \mid X_{t_{k-1}}, X_{t_{k-2}}, \ldots, X_{t_1}) \tag{2.3}$$

for some set of times $t_k > t_{k-1} > t_{k-2} > \ldots > t_1$. Stochastic processes with time dependence satisfy the Markov property, which states that

$$\Pr(X_{t_k} \mid X_{t_{k-1}}, X_{t_{k-2}}, \ldots, X_{t_1}) = \Pr(X_{t_k} \mid X_{t_{k-1}}) \tag{2.4}$$

Stochastic processes that satisfy the Markov property are easy to model; the stock exchange and the exchange rate, which are time dependent, are examples.

2.4 Gaussian Processes

A Gaussian process (GP) is a generalization of the Gaussian distribution in which probability is distributed over functions with a specified mean and covariance. A GP extends the multivariate Gaussian distribution to infinite dimensionality. Definition: a Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. Let $x \in \mathbb{R}^D$ index into the real process $f(x)$. We write

$$f(x) \sim \mathcal{GP}(M(\cdot), K(\cdot, \cdot)) \tag{2.5}$$

where the functions $M(\cdot)$ and $K(\cdot, \cdot)$ are, respectively, the mean and covariance functions:

$$M(x) = E[f(x)], \qquad K(x_1, x_2) = E[(f(x_1) - M(x_1))(f(x_2) - M(x_2))] \tag{2.6}$$

The data used or received from different sources are never consistent and contain errors, so each observation $y$ can be thought of as a function $f(x)$ of the data $x$ with an additive noise model:

$$y = f(x) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma_n^2) \tag{2.7}$$

The volatility measurements obtained from the log-range are normally distributed around the true log-volatility, and hence this equation is assumed to hold with the function $f(\cdot)$ representing the log-volatility, where $y$ is the observed value for $x$ modelled on the time index. For modelling the stochastic volatility with Gaussian processes, the problem is cast in terms of a regression from the time indexes to the volatility measurements obtained from the log-range, and the estimation data are the pairs $D = \{(t_i, y_i)\}$, where $t_i$ is the time index and $y_i$ is obtained from the formulation

$$y_t = f(t) + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma_n^2) \tag{2.8}$$
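As a minimal sketch of this regression formulation, assuming a squared-exponential covariance and toy data (neither is specified at this point in the text), the GP posterior mean under the noise model of equation (2.7) can be computed as:

```python
import numpy as np

def sq_exp_kernel(x1, x2, length_scale=1.0, signal_var=1.0):
    """Squared-exponential covariance function K(x1, x2)."""
    d = x1[:, None] - x2[None, :]
    return signal_var * np.exp(-0.5 * (d / length_scale) ** 2)

# Toy observations y = f(t) + eps, eps ~ N(0, sigma_n^2), per equation (2.7).
rng = np.random.default_rng(0)
t_train = np.linspace(0.0, 10.0, 25)
y_train = np.sin(t_train) + 0.1 * rng.standard_normal(t_train.size)
sigma_n2 = 0.1 ** 2

# GP posterior mean at new time indexes: K_*^T (K + sigma_n^2 I)^{-1} y.
t_test = np.linspace(0.0, 10.0, 100)
K = sq_exp_kernel(t_train, t_train) + sigma_n2 * np.eye(t_train.size)
K_star = sq_exp_kernel(t_train, t_test)
posterior_mean = K_star.T @ np.linalg.solve(K, y_train)
```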

2.5 The Stochastic Volatility Model

The observation at time $t$ is given by

$$y_t = e^{\frac{1}{2} h_t} \epsilon_t, \tag{2.9}$$

for $t = 1, 2, \ldots, T$, where $\epsilon_t \sim \mathcal{N}(0, 1)$. Note that the state $h_t$ is called the log-volatility. The states are assumed to evolve according to a stationary process

$$h_t = \mu_h + \phi_h (h_{t-1} - \mu_h) + \zeta_t \tag{2.10}$$

for $t = 2, 3, \ldots, T$, where $\zeta_t \sim \mathcal{N}(0, \sigma_h^2)$ and is independent of $\epsilon$ at all leads and lags. Hence the conditional variance of $y_t$ is given by

$$\operatorname{Var}(y_t \mid h_t) = \left(e^{\frac{1}{2} h_t}\right)^2 \operatorname{Var}(\epsilon_t) = e^{h_t}. \tag{2.11}$$

We further assume that $|\phi_h| < 1$, and that the states are initialized to

$$h_1 \sim \mathcal{N}\left(\mu_h, \frac{\sigma_h^2}{1 - \phi_h^2}\right), \tag{2.12}$$

which is the stationary distribution of the process.

2.5.1 The Priors

We assume independent prior distributions for $\mu_h$, $\phi_h$, and $\sigma_h^2$, i.e.

$$p(\mu_h, \phi_h, \sigma_h^2) = p(\mu_h)\, p(\phi_h)\, p(\sigma_h^2) \tag{2.13}$$

Specifically, we use the following independent prior distributions:

$$\mu_h \sim \mathcal{N}(\mu_{h0}, V_{\mu_h}), \tag{2.14}$$

$$\phi_h \sim \mathcal{N}(\phi_{h0}, V_{\phi_h})\, \mathbf{1}\{|\phi_h| < 1\}, \tag{2.15}$$

$$\sigma_h^2 \sim \mathcal{IG}(\nu_h, S_h) \tag{2.16}$$

where $\mathbf{1}$ denotes the indicator function and $\mathcal{IG}$ represents the inverse-gamma distribution. The stationarity condition $|\phi_h| < 1$ is imposed through the prior distribution of $\phi_h$. To model stochastic volatility, we have invoked the R package stochvol, which has straightforward functions for doing the auto-regressive AR(1) SV analysis. We use the svsample function to train the model on the AUD/USD daily-returns data set, and then use the prediction function predict to forecast a number of days ahead equal to the number of points in the test data set. The mean squared error (MSE) is used for calculating the prediction error between the test data points and the predicted data points. This model is compared with the other models in Chapter 4 during the model comparison (Kastner, 2016).
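The estimation itself is done with stochvol in R; purely as an illustration of the generative process in equations (2.9)-(2.12), a simulation sketch in Python (with assumed, not fitted, parameter values) looks as follows:

```python
import numpy as np

def simulate_sv(mu_h, phi_h, sigma_h, T, seed=0):
    """Simulate the stochastic volatility model of equations (2.9)-(2.12)."""
    rng = np.random.default_rng(seed)
    h = np.empty(T)
    # h_1 drawn from the stationary distribution (2.12).
    h[0] = rng.normal(mu_h, sigma_h / np.sqrt(1.0 - phi_h ** 2))
    for t in range(1, T):
        # State transition (2.10): h_t = mu_h + phi_h (h_{t-1} - mu_h) + zeta_t.
        h[t] = mu_h + phi_h * (h[t - 1] - mu_h) + rng.normal(0.0, sigma_h)
    # Observation equation (2.9): y_t = exp(h_t / 2) * eps_t, eps_t ~ N(0, 1).
    y = np.exp(0.5 * h) * rng.standard_normal(T)
    return y, h

# Illustrative parameters giving persistent, slowly varying volatility.
y, h = simulate_sv(mu_h=-9.0, phi_h=0.97, sigma_h=0.2, T=2000)
```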


2.6 Heteroscedasticity

Consider a sequence of output random variables $Y_t$ and a sequence of input random variables $X_t$ for which the conditional variance $\operatorname{Var}(Y_t \mid X_t)$ is non-constant over time $t$; a model with constant variance $\sigma^2$ is not able to capture this phenomenon. Heteroscedasticity arises in two forms: conditional and unconditional. Conditional heteroscedasticity can be identified as non-constant volatility whose future high and low periods cannot be identified. Unconditional heteroscedasticity is identified as independent volatility whose future behaviour can be identified.

2.7 Generalized Auto-regressive Conditionally Heteroscedastic (GARCH)

GARCH time series models are widely used in the financial market for capturing the random volatility of returns. GARCH calculates the forecast $y_t$ based on the squared past forecast values $y^2_{t-1}$ and the past variance $\sigma^2_{t-1}$. The model is not difficult to understand, but it is necessary to focus on the reasons for taking the past forecast and variance values into consideration. The main purpose of GARCH is to model the variance of financial returns. To understand the GARCH model, we first go through a simpler approach called the ARCH model (D. Ruppert, 2011). ARCH(1) models the conditional variance by looking at past values, which is similar to the linear AR(1) model discussed in Section 2.1.

The linear regression model with constant variance $\sigma^2$ and expectation equal to 0 is given as

$$Y_t = f(X_t) + \epsilon_t \tag{2.17}$$

with conditional constant variance $\operatorname{Var}(Y_t \mid X_t) = \sigma^2$ and $f$ the conditional expectation of $Y_t$ given $X_t$. Equation (2.17) can be modified to allow conditional heteroscedasticity in the model:

$$Y_t = f(X_t) + \sigma(X_t)\, \epsilon_t \tag{2.18}$$

where $\epsilon_t$ has a conditional mean equal to 0 and variance equal to 1. $\sigma(X_t)$ must be non-negative, as it is a standard deviation. If the function $\sigma(\cdot)$ were linear, its coefficients would have to be constrained to be non-negative so that the standard deviation remains non-negative. This would be a difficult task of controlling the coefficients, so a non-linear non-negative form is used instead. The conditional-variance approach is also used for the GARCH method.

2.7.1 ARCH(1) Processes

We have to consider the Gaussian noise in the ARCH(1) model. When adding the noise into the model, we take the Gaussian noise to have constant mean and variance:

$$E(\epsilon_t \mid \epsilon_{t-1}) = 0 \tag{2.19}$$

and

$$\operatorname{Var}(\epsilon_t \mid \epsilon_{t-1}) = 1 \tag{2.20}$$

This property of the white Gaussian noise is called homoskedasticity. The process $a_t$ in the ARCH(1) model is given as follows:

$$a_t = \sqrt{\omega + \alpha_1 a^2_{t-1}}\; \epsilon_t \tag{2.21}$$

From equation (2.21), the expectation is equal to zero and $\sigma$ is equal to $\sqrt{\omega + \alpha_1 a^2_{t-1}}$, where to make $\sigma > 0$ the coefficients of the variance must satisfy $\omega > 0$ and $\alpha_1 \ge 0$; to make the model $a_t$ stationary, $\alpha_1 < 1$. Equation (2.21) can also be written as

$$a^2_t = (\omega + \alpha_1 a^2_{t-1})\, \epsilon^2_t \tag{2.22}$$

Equation (2.22) is similar to AR(1), but with the squared term and noise with mean 1. The conditional variance for ARCH(1) is

$$\sigma^2_t = \operatorname{Var}(a_t \mid a_{t-1}) \tag{2.23}$$

and as the noise is independent of the past values $a_{t-1}$, $E(\epsilon^2_t) = \operatorname{Var}(\epsilon_t) = 1$, while the conditional expectation of equation (2.21) is zero. Hence

$$\sigma^2_t = E\{(\omega + \alpha_1 a^2_{t-1})\, \epsilon^2_t \mid a_{t-1}\} \tag{2.24}$$

$$\sigma^2_t = (\omega + \alpha_1 a^2_{t-1})\, E(\epsilon^2_t \mid a_{t-1}) \tag{2.25}$$

$$\sigma^2_t = \omega + \alpha_1 a^2_{t-1} \tag{2.26}$$

To understand the GARCH model, note that the variance derived in equation (2.26) is the same variance used in that model. If $a_{t-1}$ has a large magnitude, then $\sigma_t$ will also be large. This tends to make $a_t$ large as well and makes the volatility propagate from $a_t$ to $a_{t+1}$. Similarly, if $a_{t-1}$ is small in magnitude, then $\sigma_t$ is small and $a_t$ will also be small in magnitude; there is a proportional relation between $a_t$ and $\sigma_{t+1}$. Thus the ARCH(1) variance model, with $y_t$ conditional on $y_{t-1}$ and variance at time $t$, is

$$\operatorname{Var}(y_t \mid y_{t-1}) = \sigma^2_t = \alpha_0 + \alpha_1 y^2_{t-1} \tag{2.27}$$

Therefore $y_t$ for the ARCH model with series mean 0 is given by

$$\sigma_t = \sqrt{\sigma^2_t} = \sqrt{\alpha_0 + \alpha_1 y^2_{t-1}} \tag{2.28}$$

$$y_t = \sigma_t\, \epsilon_t \tag{2.29}$$
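A minimal sketch simulating equations (2.28)-(2.29), with illustrative coefficients, shows the volatility clustering this recursion produces:

```python
import numpy as np

def simulate_arch1(alpha0, alpha1, T, seed=0):
    """Simulate ARCH(1) returns via equations (2.28)-(2.29)."""
    rng = np.random.default_rng(seed)
    y = np.zeros(T)
    for t in range(1, T):
        sigma_t = np.sqrt(alpha0 + alpha1 * y[t - 1] ** 2)   # (2.28)
        y[t] = sigma_t * rng.standard_normal()               # (2.29)
    return y

# Large |y_{t-1}| inflates sigma_t, so big moves cluster together.
y = simulate_arch1(alpha0=0.0001, alpha1=0.6, T=1000)
```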

2.7.2 GARCH(1,1) Process

GARCH uses the same approach as the ARCH model of relying on the past forecast values $y_{t-1}$ to predict the new $y_t$, but with the addition of the past variance $\sigma^2_{t-1}$ to the model:

$$\sigma^2_t = \alpha_0 + \alpha_1 y^2_{t-1} + \beta_1 \sigma^2_{t-1} \tag{2.30}$$

In equation (2.30), $\beta_1$ multiplies $\sigma^2_{t-1}$, which is added as the past variance into the model to make the ARCH model a generalized model for any time series data. Since $\sigma_t$ from the past value changes the magnitude of $y_t$, the past variances are added into the GARCH model.

2.8 Model GARCH Process

To model the variance $\sigma_t$ from the GARCH model, the constraints on the variance must be satisfied, and for this the selection of the parameters $\omega, \alpha, \beta$ needs to adhere to those constraints. Care should be taken so that the variance never takes a negative value: the ranges $\alpha_0 \ge 0$, $0 \le \alpha_1 \le 1$, and $0 \le \beta_1 \le 1$ are required to maintain a non-negative variance. To train and test the model, we use the AUD/USD daily-returns data set. The data set is split into training and testing sets using a split function, and all sets are normalized to reduce the effect of outliers. An optimize function is used to run the GARCH model multiple times and find the mean squared error between the observed forecast and the predicted forecast on the training set. The optimize function returns the values for $\omega, \alpha, \beta$, and these values are then used for GARCH prediction.

TABLE 2.1: Values for parameters $\omega$, $\alpha$, $\beta$

Type      | $\omega$   | $\alpha$ | $\beta$
----------|------------|----------|----------
Observed  | 1          | 1        | 1
Predicted | 0.00751332 | 0.066731 | 0.875381

The GARCH prediction function predicts the variance for the specific times $t$ corresponding to the test data set. The mean squared error (MSE) is used as the cost function between the predicted values and the test data. As GARCH is a simple model, it is not able to capture the hidden non-linearity in the financial data. We go through the results in Section 2.9.
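A minimal sketch of this training procedure, assuming placeholder data and reading the training loss as the MSE between $\sqrt{\sigma^2_t}$ and the observed return magnitudes (one plausible reading of the text above, not the report's actual code), could look as follows:

```python
import numpy as np
from scipy.optimize import minimize

def garch_variance(params, y):
    """Variance recursion of equation (2.30): sigma2_t = a0 + a1 y_{t-1}^2 + b1 sigma2_{t-1}."""
    a0, a1, b1 = params
    sigma2 = np.empty_like(y)
    sigma2[0] = y.var()  # simple initialization of the recursion
    for t in range(1, len(y)):
        sigma2[t] = a0 + a1 * y[t - 1] ** 2 + b1 * sigma2[t - 1]
    return sigma2

def mse_loss(params, y):
    """MSE between the predicted forecast sqrt(sigma2_t) and the observed return magnitude."""
    y_hat = np.sqrt(np.maximum(garch_variance(params, y), 1e-12))
    return np.mean((np.abs(y) - y_hat) ** 2)

# Placeholder for the normalized AUD/USD training returns.
y_train = 0.01 * np.random.default_rng(0).standard_normal(500)

# All three parameters start at 1, as described in Section 2.9.
res = minimize(mse_loss, x0=[1.0, 1.0, 1.0], args=(y_train,),
               bounds=[(0.0, None), (0.0, 1.0), (0.0, 1.0)], method="L-BFGS-B")
omega, alpha1, beta1 = res.x
```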

2.9 Results on GARCH

The financial data set is split into 3/4 as the training set and 1/4 as the test set for the GARCH model. The parameters to be selected are $\omega$, $\alpha$, and $\beta$, which need to be chosen to obtain the optimal prediction on the test set. We use an optimization algorithm in the Python code to run the GARCH model iteratively and measure the cost function, minimizing the error over the selected parameter values for $\omega$, $\alpha$, and $\beta$. The initial value of every parameter at the start of the optimization process is 1. $y^2_{t-1}$ is the square of $y_t$ at a lag of one.

$$\sigma^2_t = \alpha_0 + \alpha_1 y^2_{t-1} + \beta_1 \sigma^2_{t-1} \tag{2.31}$$

Equation (2.31) is used for calculating the variance, and $y_t$ is calculated by taking the square root of the variance $\sigma^2_t$:

$$y_t = \sqrt{\sigma^2_t} \tag{2.32}$$

In the actual GARCH model, to calculate $y_t$ we need to multiply it by Gaussian noise $\epsilon_t \sim \mathcal{N}(0, 1)$; but in the training phase we do not add the Gaussian noise, because with it the trained model never reaches the global minimum of the cost function, which leads to less efficient values of $\omega$, $\alpha$, and $\beta$. Table 2.1 gives the values of the parameters after training, and these values can now be used in the prediction model. The GARCH function for prediction is the same as the GARCH function for training; the only difference is that we no longer use the optimize function to run the training model repeatedly. The error calculated on the prediction model is given in Table 2.2.

TABLE 2.2: Error on GARCH model

Type     | Error
---------|------------
Test Set | 0.404414457

In Figure 2.1, the two plots are based on the test set. There are some outliers in the observation plot that the prediction model is not able to capture; the reason is that the hidden non-linearities in the data set are not properly captured by the GARCH model. The total prediction error is 0.33, which is comparatively small, since the model is fairly simple to implement with few factors to consider. As the input lag $y_{t-1}$ is increased to $y_{t-i}$, the model becomes more complex and it is difficult to find the global minimum of the cost function for the trained model. This has a larger impact on the test data set, as the generalization error increases drastically.

FIGURE 2.1: Prediction and Observation for the GARCH model


Chapter 3

Auto-Regressive Neural Network (AR-NN)

3.1 Feed Forward Neural Network

In linear time series analysis, the auto-regressive AR($\rho$) model consists of two layers: the input layer, which contains the entirely independent variables, and the output layer, which contains the dependent variables together with a constant term called the bias. A linear auto-regressive model with two lags is given by

$$y_t = \alpha_0 + \alpha_1 y_{t-1} + \alpha_2 y_{t-2} \tag{3.1}$$

As the linear AR model is not sufficient for prediction, a non-linear part has to be combined with the linear model. The non-linear part is modelled in the hidden layer of the neural network, added between the input and output layers. The AR-NN model is thus one in which there is a direct connection between the input and output layers, as designed for the linear model, as well as a connection between the input and output layers through the hidden layer for the non-linear model. The non-linear function $F(\cdot)$ extending the linear model is given as

$$y_t = \alpha_0 + \alpha_1 y_{t-1} + \alpha_2 y_{t-2} + F(y_{t-1}, y_{t-2}) \tag{3.2}$$

Equation (3.2) can be modelled using a feed-forward neural network as an extension of the linear AR model, to build an auto-regressive neural network. We use a three-layer feed-forward neural network with the back-propagation algorithm as the method for updating the weight vectors of the network. The main purpose of the neural network in this project is to use it for prediction on the time series data set. We have selected an AUD/USD data set to predict the daily returns, with the variance of the prediction compared against the observed values. A neural network is a non-linear model because of the activation functions used in the hidden layer. The hidden layer uses non-linear activation functions, and since these are non-linear in nature, the values from the hidden nodes are bounded within a range. We use the neural network as a regression model for the purpose of predicting the time series vector. The NN has the universal approximation property, which means it can approximate the target as closely as required with an appropriate selection of hidden layer units.

The normal feed-forward network has a set of input attributes and target values, and using the input attributes we try to make the predicted output as close as possible to the target vector. We implement a special case of the neural network called the auto-regressive neural network; the reason for building this network model is that there are no separate input attributes for the time series data set. As this is a time series data set, we use the target vector itself, with lags, as the input attributes.


FIGURE 3.1: Auto-Regressive Neural Network (AR-NN), with input nodes $i_1, i_2$, hidden nodes $h_1, h_2$, a bias unit, and output node $o_1$.

The design of a neural network is something of an art, as there are various factors to be considered in the design and no specific criteria or constraints for building the model. To implement the model, several algorithms, such as gradient descent, back propagation, and the cost function, have to be considered in the design. In Figure 3.1, the input layer nodes $i_t$, where $t = 1, 2, \ldots, T$, act as inputs to the model, with connections from the nodes $i_t$ to the hidden layer nodes $h_t$ and the output layer nodes $o_t$. The network is called non-linear because of the activation functions used in the hidden nodes. Direct connections are also made for the linear part between the input nodes and the output node. The figure is an approximation of the model used for prediction; weights are assigned to the edges connecting the nodes. These weight vectors are selected at the initial state and get updated after predicting the output vector. The output vector is fed into the batch gradient algorithm to calculate the cost function. The aim of the model is to minimize the error and bring the predicted output as close as possible to the target vector.

3.2 Design of Auto-Regressive Neural Network

To design the AR-NN with a regression output, we have to consider prior parameters for the design, and we model with three layers consisting of an input layer, a hidden layer, and an output layer, also sometimes called the first, second, and third layers respectively. The input layer consists of a number of nodes equal to the dimension/attributes of the data set, and it acts as the input $X_t$ for the model. The output layer can have nodes according to the type of learning; as we are using this model for regression, only one node $y_t$ is required at the output layer. The decision on the number of hidden layer nodes is again an art, as there are no specific criteria for hidden nodes. The input vector enters the network through the input layer and is forwarded to the hidden layer after being combined with the weight parameters $w_{ij}$ on the input side. At each hidden node there is a non-linear activation function $f(\cdot)$, which acts as the triggering node in the layer. The outputs from the hidden units pass through the activation function and are of a non-linear nature. These vectors are again multiplied by the output layer weight vector $w_{jk}$ and transferred to the output layer. The equations for the input-side layer of the network are as follows:

$$a_j = \sum_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0} \tag{3.3}$$

$$z_j = h(a_j) \tag{3.4}$$

where $i = 1, \ldots, D$ in equation (3.3) indexes the dimensions of the 'first' layer, also called the input layer nodes, and $a_j$ represents the activation of a single node of the next layer, in our case the hidden layer. We add the bias value $w^{(1)}_{j0}$ to the nodes to handle the output offset in this equation. These activations $a_j$ are then transformed through the non-linear activation function given in equation (3.4).

The above equations are for the input side of the model; the values are again transformed from the hidden layer to the output layer through the output-side weight vector and activation function, to predict the data vector at the output layer. This formulates the whole equation of the neural network model:

$$y_k(X, W) = \sigma\!\left(\sum_{j=1}^{M} w^{(2)}_{kj}\, h\!\left(\sum_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0}\right) + w^{(2)}_{k0}\right) \tag{3.5}$$

In the above equation, layer one and layer two, given by the superscripts (1) and (2), are combined to form the non-linear equation for predicting the value. This is a normal feed-forward network; to design the auto-regressive neural network, this feed-forward network has to be extended by adding a linear term to the equation. The linear input term is used with multiplicative parameters $\alpha$ and is connected directly to the output layer, so no non-linear factor is involved. This linear part is sometimes called the memory of the network, used for storing the previous output vector and supplying it to the output node at the next time interval. The estimated equation for the auto-regressive model is

$$y_t = \alpha_0 + \sum_{i=1}^{n} \alpha_i y_{t-i} + \sigma\!\left(\sum_{j=1}^{M} w^{(2)}_{kj}\, h\!\left(\sum_{i=1}^{D} w^{(1)}_{ji} y_{t-i} + w^{(1)}_{j0}\right) + w^{(2)}_{k0}\right) + \epsilon_t \tag{3.6}$$

The input to this equation is $y_{t-i}$, a lag of the output $y_t$, and the number of lags can be increased to bring the predicted output closer to the target vector. Care should be taken with the lag input, as it might over-fit or under-fit the model, which would then perform poorly on the test or unknown data set.

To build the whole model, all the parameters need to be initialized and defined with specified criteria. In the following subsections, there is a detailed discussion of the input time series data set, the weight space, the back-propagation algorithm, and the activation functions.

3.2.1 Time Series Data set

The data set of interest is financial returns, namely the daily asset returns of AUD/USD from January 2005 to December 2012. The data set has only one variable, representing the returns, and as it is time-dependent time series data it contains time-varying variance with a constant time difference between observations. The input to the model will be the past values of the series data, used to forecast the current value. Time series data require a different approach to prediction: as the prediction depends only on the time index, it is difficult to model in the usual way with other feed-forward models. In time-dependent data the current value depends on the previous data point, which is similar to a Markov process. When one past series value $y_{t-1}$ is used for the prediction, the series represents an auto-regressive AR(1) process, where 1 means that one lag of the $y_t$ vector is used. The linear AR(1) model can be described as

$$y_t = \alpha_0 + \alpha_1 y_{t-1} \tag{3.7}$$

where $\alpha_0$ is the offset and $\alpha_1$ is the weight factor for $y_{t-1}$; this equation resembles the linear regression equation with $y_{t-1}$ as the input for $y_t$.

FIGURE 3.2: AUD/USD daily returns from January 2005 to December 2012

From Figure 3.2, the variance of the daily returns is mostly within $\pm 2\%$, with moves of more than $\pm 8\%$ between 2009 and 2010. The model has to be able to capture these drastic returns, as they signify an important event in the financial market: they occurred during the financial crisis caused by the recession in the US market, and the effect continued for the next two years, as the returns remained in the high variance range of $\pm 4\%$.

Because of the bounded value range of the activation functions, the AUD/USD data set is scaled. This can be done using the mean-variance method, where $\mu$ is the mean of the data set $y_t$ and $\sigma$ is the square root of the variance of $y_t$, which gives

$$y'_t = \frac{y_t - \mu}{\sigma} \tag{3.8}$$

Two points were taken into consideration while scaling the data:

- The model behaves better if all the variables are scaled: if the range of the observed values is much larger than the range of the activation function, only the linear values dominate the process.
- The initial weight parameter values do not depend on the observed values, and if the input vectors are not scaled and the initial weights are not sufficiently small, the output from the activation function might switch between the upper and lower bounds.

Transforming the series data set might lose some information, but the prediction of the series gets much better if the activation function and the series data share a bounded range. This is not a strict criterion to follow, but the cost function with the scaled series gives a good approximation of the observed values.
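A minimal sketch of this preprocessing, with placeholder data standing in for the AUD/USD returns, applies the scaling of equation (3.8) and builds the lagged input vectors:

```python
import numpy as np

def scale_series(y):
    """Mean-variance scaling of equation (3.8): y' = (y - mu) / sigma."""
    return (y - y.mean()) / y.std()

def make_lagged(y, n_lags):
    """Build input rows [y_{t-1}, ..., y_{t-n}] and targets y_t."""
    X = np.column_stack([y[n_lags - k - 1 : len(y) - k - 1] for k in range(n_lags)])
    t = y[n_lags:]
    return X, t

returns = 0.01 * np.random.default_rng(0).standard_normal(2000)  # placeholder returns
y_scaled = scale_series(returns)
X, targets = make_lagged(y_scaled, n_lags=2)
split = 3 * len(X) // 4                     # 3/4 train, 1/4 test, as in Section 2.9
X_train, X_test = X[:split], X[split:]
t_train, t_test = targets[:split], targets[split:]
```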

3.2.2 Weight-space

The weight matrix $W$ consists of the weight vectors $w$ from all the layers of the model. The initialization of the weight vectors is important: if the values are initialized randomly, the network model can require more iterations to reach a minimum cost. The ratio of weight vectors $w$ to hidden units $h$ is $2^n$, and with random weights this makes the computation costly for the neural model. The updating of the weight space depends on the type of gradient descent algorithm chosen for the output prediction. We decided to use the batch gradient approach, with which the weights get updated after computing the gradient of the cost function with respect to the weight parameters $W^{(1)}$ and $W^{(2)}$ over the entire training set. The weight matrix $W$ is updated as

$$W = W - \eta \cdot \nabla_\theta J(\theta) \tag{3.9}$$

In our model, the weights are initialized using the Gaussian normal distribution $\mathcal{N}(0, 1)$ with $\mu = 0$ and $\sigma^2 = 1$. Weights can be initialized with any random distribution, but Gaussian is the distribution preferred in the various studies done so far for regression. If the weight values are greater than one, it usually takes more iterations to reduce the cost function.

Another design consideration with the weight matrix $W$ is that every input unit is connected to every output unit with a weight value $w_{ij}$. There are strategies for connecting units in order to reduce the $W$ matrix calculations; the strategy used here is to connect all the units to each other and initialize the weights accordingly. We are not using any basis functions for the data set, as studies have shown that adding more hidden units is equivalent to adding basis functions. Concerning the selection of hidden neural units, the usual approach is trial and error with an arbitrary number of neurons, keeping all other factors fixed and constantly monitoring the error. A rule of thumb for choosing the number of hidden layer units is the median of the number of input and output variables, given as

$$h = (n + 1)/2 \tag{3.10}$$

This method does not have any technical support, but when we considered this approach our results were better. We started the hidden unit selection with more than 20 units, and the results from the hidden activation function always overshot the range $[-1, 1]$, with the output unit predicting all ones. After taking equation (3.10) into consideration, we chose the number of hidden units to be around 3 to 5, and the results from the output unit were better, though they needed some changes in the type of activation function chosen. The approach used was to increase the units in the hidden layer step by step while monitoring the error function.

The selection of a cost function for the neural model was difficult, as there is no specific cost function for time-dependent non-linear data sets. We have chosen the squared error and the mean squared error (MSE) to measure the performance of the prediction model. The MSE is more efficient than the squared error, and the fact that the derivative of the MSE is easy to take during back propagation was the main criterion for the selection:

$$E = \frac{1}{2k} \sum_{k} (y_k - t_k)^2 \tag{3.11}$$

In order to avoid over-fitting the model to the training data set, we added a regularization parameter to the cost function. The purpose of the regularization parameter is to make sure the model gets penalized if it tries to over-fit the data, which indirectly improves the generalization error measured on the test data. The update after adding the regularization variable $\lambda$ is

$$W = W - \eta \cdot \nabla_\theta J(\theta) + \lambda \cdot W \tag{3.12}$$

The variable $\lambda$ is fixed during the prior initialization, and the same fixed value is used for all the iterations. The criterion used for the selection of the value is trial and error, but if the value of $\lambda$ is increased too much the model might under-fit, so care has to be taken while selecting the value of $\lambda$. As over-fitting is a major problem for the prediction model, a small change in the cost function improves the generalization error.
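A minimal sketch of this initialization and update, reading equation (3.12) in the conventional weight-decay form (the penalty gradient $\lambda W$ is added to $\nabla J$ before the step; layer sizes here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 2, 3                     # e.g. two lags; h = (n + 1) / 2 rounded up

# Weights drawn from N(0, 1), as described above; the extra column holds the bias.
W1 = rng.standard_normal((n_hidden, n_in + 1))
W2 = rng.standard_normal((1, n_hidden + 1))

def batch_update(W, grad_J, eta=0.01, lam=1e-4):
    """One batch-gradient step with weight decay: W <- W - eta * (grad J + lam * W)."""
    return W - eta * (grad_J + lam * W)
```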

3.2.3 Activation Function

The task of choosing the activation function for the model is important in order to concretize the AR-NN function. The selection of the activation function depends on the universal approximation property, which says that a network with any continuous, bounded, and non-constant activation function can approximate any Borel measurable function from one finite-dimensional space to another with any desired non-zero amount of error, provided the network has enough hidden units for the approximation. The derivatives of the feed-forward network can also approximate the derivatives of the function well. The relevance of Borel measurability here is that any continuous function on a closed and bounded subset of $\mathbb{R}^n$ is Borel measurable. Basis functions are not used in the AR-NN model, as they become more complicated with the non-linearity of the function. The bounded activation functions generally used in NN models are the sigmoid functions. One of them is the logistic function, which maps $\mathbb{R} \to [0, 1]$:

$$\sigma(\cdot) = (1 + \exp(-(\cdot)))^{-1} \tag{3.13}$$

Another sigmoid function, bounded on $\mathbb{R} \to [-1, 1]$, is the hyperbolic tangent function (tanh):

$$\sigma(\cdot) = \frac{\exp(\cdot) - \exp(-(\cdot))}{\exp(\cdot) + \exp(-(\cdot))} \tag{3.14}$$

A linear activation function, called the identity function, is sometimes used at the output units for a regression-based prediction model. We tried this activation function for the model, but there was no significant improvement in the cost function value. The two sigmoid activation functions can be used in the model for the hidden units and the output units, and they can be mixed, with the hidden units using the logistic function and the output units using the tanh function; it depends on how much error is reduced by the selected activation functions. Sigmoid functions reduce the effect of outliers because they squash the vector into a bounded range.
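For reference, here are the two activations of equations (3.13)-(3.14), together with the derivatives needed later for back propagation, in a short sketch:

```python
import numpy as np

def logistic(z):
    """Equation (3.13): maps R -> [0, 1]."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_grad(z):
    s = logistic(z)
    return s * (1.0 - s)

def tanh(z):
    """Equation (3.14): maps R -> [-1, 1]."""
    return np.tanh(z)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2
```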


3.2.4 Back Propagation Algorithm

The back-propagation algorithm is a learning procedure for feed-forward neural networks by which the network can map a set of inputs to a set of outputs. The mapping is specified by giving the desired activation function for the units in the hidden and output layers. Learning is carried out iteratively by adjusting the coupling strengths in the network so as to minimize the difference between the actual output state vector and the target vector. The learning process is repeated until the network responds to each input vector with an output vector that is sufficiently close to the desired one. As the weights are initialized randomly from some prior distribution, we need to update them after every iteration. The main purpose of implementing back propagation is to minimize the cost function $C$ with respect to the weights $W$ and the bias $b$, which is achieved through the partial derivatives of the cost function, $\frac{\partial C}{\partial W}$ and $\frac{\partial C}{\partial b}$. After calculating the partial derivatives of the cost function, the weight space and the bias are updated so as to bring the predicted output closer to the target data. Before discussing the algorithm, we need to fix the cost function used at the output unit. We have considered the squared loss function for the model, which has the form

$$\frac{1}{2} \sum_{i=1}^{n} (t - y_t)^2 \tag{3.15}$$

where $y_t$ is the predicted vector and $t$ is the target vector: we take the squared difference of the two, sum over all the data points, and multiply the whole expression by $\frac{1}{2}$. The reason for the $\frac{1}{2}$ is that, when taking the partial derivative of the cost function during gradient descent, the factor of 2 from the squared term is cancelled by this fraction.

We have initialized weight vectors on the edges, one set between the input layer and the hidden units and another between the hidden units and the output layer, so we have to calculate the weight update twice from the same value of the error function. The partial derivative of the cost function is

$$-\sum_{i=1}^{n} (t - y_t), \quad \text{sometimes written as} \quad \sum_{i=1}^{n} (y_t - t) \tag{3.16}$$

This expression gives the total error between the target and output vectors, and this difference is propagated back into the immediately preceding layer. As we use an activation function on each unit in a layer, we need the derivative of the activation function while doing the update. We use three different activation functions in the model when examining the error, but for the calculation here we use the sigmoid function $1/(1 + \exp(-z))$. Combining equation (3.16) with the sigmoid function, we can update the weight vector $w^{(2)}$, as these weights belong to the layer immediately preceding the output layer. The equations for the weight-space vector between the output and hidden layers are

$$a^{(2)} = \sigma(z^{(2)}) \tag{3.17}$$

$$\frac{\partial a^{(2)}}{\partial z^{(2)}} = \sigma(z^{(2)})(1 - \sigma(z^{(2)})) \tag{3.18}$$

$$\frac{\partial E}{\partial w^{(2)}} = -(t - y_t) \cdot \sigma(z^{(2)})(1 - \sigma(z^{(2)})) \cdot a^{(1)} \tag{3.19}$$

Therefore, the output delta $\delta_o$ can be written as

$$\delta_o = -(t - y_t) \cdot \sigma(z^{(2)})(1 - \sigma(z^{(2)})) \tag{3.20}$$

We can update all the weights into the output unit using

$$\frac{\partial E}{\partial w^{(2)}} = \delta_o \cdot a^{(1)} \tag{3.21}$$

From the above equations we have updated the weight vector at the hidden units; now the same error is propagated back from the hidden layer towards the input layer. We use the same set of equations for updating the weight-space vector $W^{(1)}$, with the extra factor $w^{(2)}$ carrying the output delta back through the hidden layer:

$$a^{(1)} = \sigma(z^{(1)}) \tag{3.22}$$

$$\frac{\partial a^{(1)}}{\partial z^{(1)}} = \sigma(z^{(1)})(1 - \sigma(z^{(1)})) \tag{3.23}$$

$$\frac{\partial E}{\partial w^{(1)}} = \delta_o \cdot w^{(2)} \cdot \sigma(z^{(1)})(1 - \sigma(z^{(1)})) \cdot y_{t-i} \tag{3.24}$$

Therefore, the hidden delta $\delta_h$ can be written as

$$\delta_h = \delta_o \cdot w^{(2)} \cdot \sigma(z^{(1)})(1 - \sigma(z^{(1)})) \tag{3.25}$$

We can update all the weights at the different hidden units using $\delta_h$ and the input vector $y_{t-i}$:

$$\frac{\partial E}{\partial w^{(1)}} = \delta_h \cdot y_{t-i} \tag{3.26}$$

Finally, all the weights on the edges $W^{(1)}$ and $W^{(2)}$ of the layers are updated with the batch gradient approach. After every back-propagation pass over the training sample the error should decrease slowly as the weights are updated; for this network model it takes more than 10,000 iterations for the weights to stabilize near a local minimum, after which the error never goes below a certain level and remains in that range for a long time. There are two choices for tackling this problem: either run the algorithm for a fixed number of iterations or, as a second approach, stop at a particular step if the error has not decreased for a long time. The change made in the algorithm is to continuously track the error range and take the decision based on the cost function (M.Bishop, 2006).
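The following sketch assembles equations (3.17)-(3.26) into one batch-gradient training step for the non-linear part of the network (an illustration under the reading above, not the project's actual code; the array shapes and learning rate are assumptions):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(W1, W2, X, t, eta=0.05):
    """One batch-gradient step for a one-hidden-layer sigmoid network on squared error.

    X: (N, D) lagged inputs; t: (N,) targets; W1: (H, D+1); W2: (1, H+1).
    The trailing column/entry of each weight matrix is the bias.
    """
    N = X.shape[0]
    Xb = np.hstack([X, np.ones((N, 1))])      # inputs with bias column
    z1 = Xb @ W1.T                            # hidden pre-activations
    a1 = logistic(z1)                         # (3.22)
    a1b = np.hstack([a1, np.ones((N, 1))])
    z2 = a1b @ W2.T
    y = logistic(z2)[:, 0]                    # (3.17): network output

    delta_o = -(t - y) * y * (1.0 - y)        # (3.20), using sigma'(z2) = y (1 - y)
    grad_W2 = delta_o[None, :] @ a1b / N      # (3.21), averaged over the batch

    # (3.25): push delta_o back through W2 (bias column dropped), times sigma'(z1).
    delta_h = (delta_o[:, None] * W2[:, :-1]) * a1 * (1.0 - a1)
    grad_W1 = delta_h.T @ Xb / N              # (3.26), averaged over the batch

    return W1 - eta * grad_W1, W2 - eta * grad_W2
```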

3.2.5 Practical Reducibility

When the AR-NN is extended to add the additive noise $\epsilon$, the distribution of the noise term is positive everywhere in the range $(-\infty, \infty)$, and thus the AR-NN forms an irreducible and aperiodic Markov chain. The chain is aperiodic because it does not cycle between a set of values at specified multiples of $t$. It is irreducible because it is impossible to reduce the range of $Y_t$ from the entire real line $(-\infty, \infty)$ to a smaller finite set. As the noise is additive, it does not depend on $Y_t$, and $\epsilon_t$ takes a random value from the whole range. Therefore, even if the model converges, the noise term ensures that $Y_t$ is irreducible and aperiodic (Dietz, 2010).

3.3 Function

With the architecture for the model built in the sections above, we now discuss the steps required to implement the model. The data used are AUD/USD returns, with functions created for normalizing the data and calculating the loss functions. The equations needed for the Auto-regressive Neural Network, combining all the algorithms included in the design, make up the prediction model for the time-dependent non-linear data set. The linear equation for the model, with a linear weight vector α and two lags of the observed forecasting data set, is as follows:

y_t = \alpha_0 + \alpha_1 y_{t-1} + \alpha_2 y_{t-2} \qquad (3.27)

The above equation will be needed for the prediction at the output unit; before that, the non-linear part of the model equations has to be derived. The weight matrix W consists of the weight vectors w^{(1)} and w^{(2)}, with the input side denoted by the superscript (1):

z_j^{(1)} = \sum_{i=1}^{D} w_{ji}^{(1)} y_{t-i} + w_{j0}^{(1)} \qquad (3.28)

where i indexes the D dimensions of the input matrix and j indexes the units at the hidden layer. This equation is transformed into non-linear form through the activation function:

a_j^{(1)} = \sigma(z_j^{(1)}) \qquad (3.29)

We have used the sigmoid function as the activation function for the hidden units, transforming the linear value from the input vector into a bounded range (R → (0, 1) for the sigmoid). This transformed value from the hidden units is then combined with the next weight vector w^{(2)}:

z_k^{(2)} = \sum_{j=1}^{M} w_{kj}^{(2)} a_j^{(1)} + w_{k0}^{(2)} \qquad (3.30)

where k indexes the units at the output layer; as we have only one unit at the output layer, M = 1. The final output is passed through the activation function:

a_k^{(2)} = \sigma(z_k^{(2)}) \qquad (3.31)

Finally, equation (3.31) is the non-linear transformation of the input vector; it now needs to be combined with the linear equation (3.27) and the additive noise ε_t. At the output layer the linear and non-linear parts are combined with ε through the activation function for prediction, and equation (3.32) combines all the derived equations for predicting y_t. The data set is normalized through a function with mean μ and variance σ to remove the outliers. The weight vectors w_{ji} and w_{kj} are initialized from a Gaussian distribution N(0, 1), and α_i is also initialized from a Gaussian distribution; α_i could be initialized with fixed values, but as the data set is non-linear, fixing α does not improve the error function. Two different constant values are used for the regularization parameters, and in this way two functions have been created for the predictions. There are two separate calculations for the linear and non-linear parts of the model: for the non-linear part, the input vector is fed into the hidden layer, combined with the weight values, and passed through the activation function. Algorithm (1) below describes the resulting model for predicting the time series data, together with the factors that need to be initialized and the equation considered for the prediction (Dietz, 2010).

y_t = \alpha_0 + \sum_{i=1}^{n} \alpha_i y_{t-i} + \sigma\!\left( \sum_{j=1}^{M} w_{kj}^{(2)} \, h\!\left( \sum_{i=1}^{D} w_{ji}^{(1)} y_{t-i} + w_{j0}^{(1)} \right) + w_{k0}^{(2)} \right) + \epsilon_t \qquad (3.32)
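As an illustration, a minimal forward pass of equation (3.32) could look as follows; the dimensions, the tanh choice for both h(·) and σ(·), and the noise scale are illustrative assumptions, not values taken from the report:

import numpy as np

rng = np.random.default_rng(0)

D, M = 2, 4                                  # assumed: 2 lags, 4 hidden units
alpha = rng.standard_normal(D + 1)           # linear weights, drawn from N(0, 1)
W1 = rng.standard_normal((M, D + 1))         # w(1); last column holds the bias w_j0
W2 = rng.standard_normal(M + 1)              # w(2); last entry holds the bias w_k0

def ar_nn_forward(lags, sigma_noise=0.1):
    # One step of equation (3.32) for an input of D lagged returns.
    x = np.append(lags, 1.0)                          # 1 appended for the bias terms
    hidden = np.tanh(W1 @ x)                          # h(.) on the hidden units
    nonlinear = np.tanh(W2 @ np.append(hidden, 1.0))  # sigma(.) on the output unit
    linear = alpha @ x                                # alpha_0 + sum_i alpha_i y_{t-i}
    eps = rng.normal(0.0, sigma_noise)                # additive noise term epsilon_t
    return linear + nonlinear + eps

print(ar_nn_forward(np.array([0.01, -0.02])))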


For updating the weight matrices after the training samples, the back-propagation algorithm is invoked with gradient descent on the cost function; the same steps followed in section (3.2.4) are used to update the weight matrices w^{(1)} and w^{(2)}:

\frac{\partial y_t}{\partial w^{(2)}} = -(t - y_t) \cdot \sigma(z^{(1)})(1 - \sigma(z^{(1)})) \cdot \sigma(z^{(1)}) \qquad (3.33)

\delta_h = -(t - y_t) \cdot \sigma(z^{(1)})(1 - \sigma(z^{(1)})) \qquad (3.34)

\frac{\partial y_t}{\partial w^{(1)}} = \delta_h \cdot y_{t-i} \qquad (3.35)

Data: Time series data of AUD/USD daily returns
Result: Predicted time series with minimized error

Data ← NormalizeData(Data);
for pass ∈ num_passes do
    hidden h ← combine the input vector with the weight vector W;
    output Y ← combine h and W with the linear weight matrix α and noise ε;
    δ ← back-propagation algorithm;
    error e ← CalculateLossFunction(Y);
    if error in range then
        W ← δ_min + delta;
    else
        W ← δ_max + delta;
    end
end
return model

Algorithm 1: Training the AR-NN

In Algorithm (1) we first initialize the prior parameters W, α, δ and ε, the number of hidden units for the hidden layer, and the activation functions for the hidden and output units. The data set is passed through the function NormalizeData, and the normalized data set is used for all further operations. Data sets with one and two lags, called y_{t−1} and y_{t−2}, are created and set as the input for the model. In the for loop two things are calculated: at the hidden units the input vector is combined with the weight values, and at the output unit the values from the hidden units and the linear values from the input vector are combined with α and passed through the activation function. At that point the back-propagation algorithm is invoked to compute the gradient of the error function and propagate it back into the previous layers by differentiating with respect to the weight vectors at each layer. The weights in the layer immediately preceding the error function receive large updates, while the weights in the first layer of the network are updated by only a very small amount; as the iterations increase, the weights start to stabilize and the error function stays within a specific range for a long time, toggling between its bounds. The total error calculated through CalculateLossFunction is used for updating the weight vectors in the next iteration. The model is run for more than 10,000 iterations, because with fewer the weights do not stabilize, and the AR-NN model will not reach a bounded local minimum because of the noise term included in the prediction and error functions; it is therefore better to run the algorithm for more iterations, so that the weights settle and the network reaches a steadily low error value. From Figure 3.3, the prediction error fluctuates strongly over the first 10,000 iterations; the error then stabilizes in the range [1000, 1100] after 30,000 iterations, and since a neural model cannot have a global minimum, the additive noise in the error function causes the error to fluctuate within that range and never settle to a local minimum.

FIGURE 3.3: Error
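The two helper functions invoked in Algorithm (1) might look roughly as follows; this is a sketch under the assumptions stated in the text (centring by mean and variance, a half sum-of-squares cost, and the regularization weight λ = 0.1 quoted in section 3.5), not the report's actual code:

import numpy as np

def normalize_data(returns):
    # NormalizeData: centre by the mean and scale by the standard deviation.
    mu, sigma = returns.mean(), returns.std()
    return (returns - mu) / sigma

def calculate_loss_function(y_pred, y_true, weights=None, lam=0.1):
    # CalculateLossFunction: half sum-of-squares error, optionally adding
    # the lambda-weighted regularization term on the weights.
    loss = 0.5 * np.sum((y_true - y_pred) ** 2)
    if weights is not None:
        loss += lam * np.sum(weights ** 2)
    return loss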

3.4 Results

In the following section we discuss the results from the AR-NN model, considering the different parameters affecting the results. The subsections that follow give a detailed discussion of each parameter, with the focus on the cost function and the posterior mean and variance, as these quantities define the performance of the network architecture. A neural network cannot achieve a global minimum, since the activation functions used in the units of all layers act as non-linear functions with a different bounding range for each activation function. Some parameters are predefined in the model, as the model performance is not measured on these values and they are not important prior information to be considered when designing the model.

3.4.1 Alpha

As discussed for the linear auto-regressive model, the linear weight matrix α changes y_t between being stationary, explosive, or a random walk, and we tried to see the impact of α on the model. After normalizing the observed data set with its mean and variance, it has mean μ = 5.10937 × 10^{−4} and variance σ² = 1.


3.4.2 When α = 1

At first, the vector α was kept constant at α = 1.

FIGURE 3.4: Error Function with α = 1

From Figure (3.4), after 20,000 iterations the error starts to stabilize in the range [2600, 2800]; because the linear weight vector α is constant at 1, the contribution of the linear part is fixed across iterations, so the error does not fluctuate much and remains in the same range for a long time. The mean of the predicted data set is μ = 1.018 and the variance is σ² = 2.93; both are far from the observed values, so this approach would not perform well on the test data set.

3.4.3 When −1 ≤ α ≤ 1

We now consider −1 ≤ α ≤ 1, for which the predicted set has μ = 0.23 and σ² = 1.35. In Figure (3.5), at the start of the iterations there is a high deviation of the cost function, which continues for 80,000 iterations; after that the error starts to be minimized with minimal deviation. The mean and variance of this run are very close to those of the observed data set. This model might be overfitted, but it has performed better than the previous one.

3.4.4 When α > 1

Finally, α is considered above 1, that is α > 1; since there is no specific value to consider for this condition, we have chosen α = 2 as the fixed parameter vector value. As α grows, the mean and the variance grow exponentially, and from Figure 3.6 the cost value never stabilizes: it starts at around 12,500 and after 80,000 iterations it still stays at around 10,000, never falling below that range.

FIGURE 3.5: Error Function with −1 ≤ α ≤ 1

Compared to the other two α settings, this model has μ = 2.01 and σ² = 8.87, both higher than the other two predicted sets. So when implementing the neural model for prediction it is better to consider α in the range [−1, 1], and one approach to this is to use a Gaussian distribution for the selection of the prior α.
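One simple way to realize that recommendation, sketched here as an assumption rather than as the report's own procedure, is to redraw the Gaussian prior until every component of α falls inside [−1, 1]:

import numpy as np

rng = np.random.default_rng(0)

def draw_alpha(size):
    # Draw alpha from N(0, 1) and redraw any component outside [-1, 1].
    alpha = rng.standard_normal(size)
    while np.any(np.abs(alpha) > 1.0):
        bad = np.abs(alpha) > 1.0
        alpha[bad] = rng.standard_normal(bad.sum())
    return alpha

print(draw_alpha(3))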

3.4.5 Discussion on Activation Function

As two types of activation functions were discussed in section (3.2.3), we explore here the results from the model using each of them; the results for the two functions are quite different. At first we consider the logistic activation function on the hidden and output units, with the formula

\sigma(\cdot) = (1 + \exp(-(\cdot)))^{-1} \qquad (3.36)

From the predicted plot for the logistic activation function in Figure (3.7), the model was able to capture the outliers in the forecasting data, such as the high variance of daily returns during the financial crisis between 2008 and 2010; the mean of the prediction is μ = 0.356012 and the variance is σ² = 1.21708, but the model is not able to predict the normal target points. Secondly, Figure (3.8) shows that the predicted output from the tanh activation is close to the observed values, with μ = 0.046 and σ² = 1.30, against an observed mean of μ = 5.109 × 10^{−18}. The tanh activation performs better than the logistic function, as it is able to capture the outliers with respect to the observed data; it predicts better on the selected data set, but there is no specific criterion for selecting an activation function, and trial and error is the best approach for this selection.
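For reference, the two candidate activations and the derivatives that enter the deltas can be written down in a few lines (a generic sketch, not tied to any particular implementation in the report):

import numpy as np

def logistic(z):
    # Equation (3.36): logistic activation, with range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def d_logistic(z):
    # Derivative used in the deltas: sigma(z) * (1 - sigma(z)).
    s = logistic(z)
    return s * (1.0 - s)

def d_tanh(z):
    # tanh alternative, with range (-1, 1); derivative is 1 - tanh(z)^2.
    return 1.0 - np.tanh(z) ** 2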

3.5 Prediction Results

For prediction with the AR-NN model we choose prior parameters such as the initialization of the weight vectors W, the activation function σ(·), the linear weight vector α, and the regularization parameter λ.

FIGURE 3.6: Error Function with α > 1

The AR-NN model we choose has a Gaussian distribution for the weight vectors W and α, and tanh as the activation function for the hidden and output layers of the network. The regularization parameter added into the cost function is λ = 0.1; when used in the model, the prediction results are optimized. In Figure (3.9) there are two plots, of predicted and observed values, with variance on the y-axis and time on the x-axis. There is a high variance between data points 200 and 300 in the observed plot; this high volatility is of major concern for the financial market, and our model is able to capture the variance, although its magnitude is not as high as in the observed plot. The remedy for this is to consider more hidden units to capture the non-linearity in the data. A neural network cannot reach the global minimum error because of the activation functions; we have to use trial and error to reduce the error consistently and reach a minimum range, but at the same time the model should not over-fit, as this has an impact on the prediction and ultimately increases the prediction error. The plot (3.10) shows the error calculated using the mean squared error (MSE); the main aim is to reduce the error and bring the training set error to its minimum range, which usually takes more than 10,000 iterations.


FIGURE 3.7: Logistic Activation Function

FIGURE 3.8: tanh Activation Function


FIGURE 3.9: AR-NN Prediction and Observed

FIGURE 3.10: Error on Train set


Chapter 4

Comparison of Models

4.1 Comparison between SV, GARCH and AR-NN

We examine the SV, GARCH, and AR-NN models and try to understand them under different parameter selections, measuring the performance of the models under these prior selections on the data. The methods selected for measuring performance are the mean squared error (MSE), the number of iterations required to converge to the minimum error range, and the mean and variance of the predicted values compared with the observed values. The data set used for these models is split into a train set with three quarters of the data points and a test set with the remaining quarter for prediction; these proportions were chosen so that the model can learn the data set properly, with all its weights and hidden parameters updated to the data. Model complexity increases the number of iterations required to reach the minimum error: the SV and AR-NN models are complex in functionality and hence require more iterations to converge to the minimum error range. The number of iterations for the SV model is fixed in the R package, so evaluating the SV process with a varying number of iterations was not possible. The AR-NN model can be stopped after as few as 1,000 iterations, but this gives training and testing errors that are higher than the prediction errors obtained after running the AR-NN model for more than one million iterations; there is no best mechanism for stopping the AR-NN iterations, as it depends on a trial and error approach. The prediction errors in Table (4.1) give a better result for the AR-NN than for the other two prediction models. As the purpose of the project is to predict the output vector, we measure the prediction error to evaluate the performance of the selected models, and prediction error is the main selection criterion. The AR-NN has the universal approximation property, and with a proper prior selection of the number of hidden units the model achieved the minimized prediction error. GARCH tries to predict y_t given the lag y_{t−1} and the lagged variance σ_{t−1}, which are deterministic in nature instead of evolving stochastically; because of this it does not perform well in the prediction phase. The stochastic volatility (SV) model used from the R package performs better than the GARCH model but not as well as the AR-NN model. The SV model uses the Markov Chain Monte Carlo (MCMC) method to sample data points from the Gaussian distribution, modelling y_t from the log-volatility h_t; the MCMC method is complex, and it becomes increasingly complex to model and predict values from the SV process for a large data set.

TABLE 4.1: Prediction error and mean of the predicted values for each model

Measure            SV         GARCH      AR-NN
Prediction Error   0.255484   0.404144   0.1863
Mean μ             0.4397     0.3174     0.68737


The test data set, which is called the observation set, has mean μ = 0.61403. Comparing the observation mean with our models in Table (4.1), the AR-NN model has a mean approximately closest to the mean of the observation set. The neural model has learnt the data set with minimum error, and this gives it an advantage in predicting values close to the observation set. The GARCH model mean is far from the observation set mean, possibly because the GARCH parameters ω, α and β are not able to represent the test set efficiently. From the comparison table we can see that the AR-NN performs better on the error and mean performance measures, but takes more iterations than the other models; this might have seemed a disadvantage a decade ago, when computing power was not as fast as it is today, but on current hardware the required iterations take less than two minutes. The AR-NN model requires more iterations than the other models to obtain stabilized weights because the back-propagation algorithm is a slow learner. The method used to calculate the cost function also makes a difference: we are using the basic batch gradient method, which takes longer than stochastic gradient descent to update the weights.
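The evaluation protocol above reduces to a chronological split and an MSE comparison; a minimal sketch follows (the function names are ours, not the report's):

import numpy as np

def train_test_split_ts(series, train_frac=0.75):
    # Chronological 3/4 - 1/4 split; no shuffling, as the data are time dependent.
    cut = int(len(series) * train_frac)
    return series[:cut], series[cut:]

def mse(pred, obs):
    # Mean squared error, the comparison measure reported in Table 4.1.
    pred, obs = np.asarray(pred), np.asarray(obs)
    return np.mean((pred - obs) ** 2)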


Chapter 5

Summary

5.1 Conclusion

As discussed in the chapters above, we modelled the SV, GARCH and AR-NN models to predict the volatility of financial returns. The purpose of the project was to find a model that is able to capture the volatility in time series data and predict it. The SV and GARCH models come from the econometrics field and are used for calculating the posterior mean and variance. When we conducted prediction experiments with the SV and GARCH models, we found that their prediction errors are considerably high; hence these models are not very efficient for prediction analysis. We then developed a model based on a neural network using auto-regressive methods, called the Auto-regressive Neural Network. We trained the model on the selected training data points to fit the weight parameters, and the trained AR-NN model was then used for prediction analysis on the test data set. The predicted vector from the neural model was compared with the target data set to calculate the mean squared error (MSE), and this error was compared with the SV and GARCH errors. We found that the AR-NN model has a low prediction error compared to the other two models. The selection of the neural model was based on its hidden layer, which is able to capture the non-linearity in the time series data, and the predicted output from the neural network was indeed able to capture the non-linearity in the financial data. We therefore propose the neural model for further experiments on different financial returns to evaluate its performance.

5.2 Recommendation

Based on the study conducted on stochastic volatility, and as suggested by (Ruppert, 2011), modelling volatility through the Auto-regressive Neural Network broadens the field of stochastic volatility, where prediction analysis can bring new studies into stochastic volatility.

5.3 Future Work

We proposed an Auto-regressive Neural Network built on the foundation of a feed-forward neural network. As we know, a feed-forward network has severe limitations in the hidden layer: if we increase the number of units in a single hidden layer, we see an exponential increase in the error instead of a steady decrease. A solution would be to increase the depth of the hidden layers while keeping a small number of hidden units in each layer, which would increase the memory of the overall network and might give a steady error on large data sets. The algorithm used in the project for minimizing the cost function was batch gradient descent, which cannot handle very large data sets because of the batch update. The stochastic gradient descent approach is more flexible and easier to use for real-time updates, and this could make the AR-NN model adaptable to large data sets.
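To make the contrast concrete, a per-sample update loop might look like this (a sketch with a hypothetical gradient helper, standing in for the batch update actually used in the project):

import numpy as np

def sgd_epoch(W, grad_fn, samples, targets, lr=0.01, rng=np.random.default_rng(0)):
    # One epoch of stochastic gradient descent: unlike the batch method,
    # the weights move after every single sample, so memory use stays flat
    # and updates can happen in real time as new observations arrive.
    order = rng.permutation(len(samples))
    for i in order:
        g = grad_fn(W, samples[i], targets[i])  # gradient from one observation
        for key in W:
            W[key] -= lr * g[key]
    return W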

The practical implementation of stochastic volatility with Markov Chain Monte Carlo (MCMC) is complex because of the iterations involved in drawing sample data points from MCMC simulations. Therefore, modelling stochastic volatility with a neural network is a new field to be explored, where econometrics and machine learning can be combined.


Chapter 6

Independent Study Contract

6.1 Contract

INDEPENDENT STUDY CONTRACT
Note: Enrolment is subject to approval by the projects co-ordinator.

SECTION A (Students and Supervisors)

UniID: u5657642
SURNAME: Khaire  FIRST NAMES: Aditya
PROJECT SUPERVISOR (may be external): Adj/Prof Hanna Suominen, Dr Young Lee, Mr Kar Wai Lim
COURSE SUPERVISOR (a RSCS academic): Prof Weifa Liang
COURSE CODE, TITLE AND UNIT: COMP6470, Special Topics in Computing, 6 units
SEMESTER: S2  YEAR: 2016

PROJECT TITLE:
Predicting stock markets using time series analysis of prices together with financial news and blogs

LEARNING OBJECTIVES:
- Conducting a literature survey to relate the project work with the already existing body of knowledge and justify its significance
- Stochastic volatility, prediction analysis and neural models
- Conducting experimental work on real data
- Writing a project report that follows the structure and style of scientific papers and project presentation slides

Research School of Computer Science  Form updated Jun-12

FIGURE 6.1: Contract Page 1


PROJECT DESCRIPTION:
Building and evaluating a processing cascade of
1. web crawling to collect a longitudinal dataset of market prices,
2. machine learning methods to build and evaluate a model to predict market prices of these entities in the future.
Conducting a literature survey to relate it with the already existing body of knowledge and justify its significance.

ASSESSMENT (as per the course's project rules web page, with the differences noted below):

Assessed project components      % of mark   Due date     Evaluated by
Report (research report)         60          28/10/2016   CECS examiner
Artefact (software pipeline)     30          28/10/2016   Hanna Suominen
Presentation                     10          27/10/2016   Weifa Liang

Research School of Computer Science  Form updated Jun-12

FIGURE 6.2: Contract Page 2


MEETING DATES (IF KNOWN):

STUDENT DECLARATION: I agree to fulfil the above defined contract.
Signature / Date

SECTION B (Supervisor):
I am willing to supervise and support this project. I have checked the student's academic record and believe this student can complete the project.
Hanna Suominen, 08/08/2016
Signature / Date

REQUIRED DEPARTMENT RESOURCES:

SECTION C (Course coordinator approval)
Signature / Date

SECTION D (Projects coordinator approval)
Signature / Date

Research School of Computer Science  Form updated Jun-12

FIGURE 6.3: Contract Page 3


Bibliography

Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer.

Chapados, Nicolas and Christian Dorion (2012). "Volatility Forecasting and Explanatory Variables: A Tractable Bayesian Approach to Stochastic Volatility". URL: http://ifsid.ca/wp-content/uploads/2012/10/.

Dietz, Sebastian (2010). "Autoregressive Neural Network". URL: https://opus4.kobv.de/opus4-uni-passau/frontdoor/index/index/docId/142.

Kastner, Gregor (2016). "Dealing with Stochastic Volatility in Time Series Using the R Package stochvol". In: Journal of Statistical Software 69(5), pp. 1–30.

Ruppert, D. (2001). "Introduction to ARCH GARCH models".

Ruppert, D. (2011). "GARCH Models". In: Statistics and Data Analysis for Financial Engineering, Springer Texts in Statistics. URL: http://faculty.washington.edu/ezivot/econ589/ch18-garch.pdf.