EVIC_05 Notes - Forecasting With Neural Networks Tutorial SFCrone


Forecasting with Artificial Neural Networks

EVIC 2005 Tutorial, Santiago de Chile, 15 December 2005. Slides on www.neural-forecasting.com. Sven F. Crone, Centre for Forecasting, Department of Management Science, Lancaster University Management School, email: [email protected]

Lancaster University Management School

What you can expect from this session: the simple back-propagation algorithm [Rumelhart et al. 1982], shown here only as a preview of the maths ahead:

$E_p = C(t_{pj}, o_{pj})$, with $o_{pj} = f_j(net_{pj})$

$\Delta_p w_{ji} \propto -\frac{\partial C(t_{pj}, o_{pj})}{\partial w_{ji}} = -\frac{\partial C(t_{pj}, o_{pj})}{\partial net_{pj}} \cdot \frac{\partial net_{pj}}{\partial w_{ji}}$

$\delta_{pj} = -\frac{\partial C(t_{pj}, o_{pj})}{\partial net_{pj}} = -\frac{\partial C(t_{pj}, o_{pj})}{\partial o_{pj}} \cdot \frac{\partial o_{pj}}{\partial net_{pj}} = -\frac{\partial C(t_{pj}, o_{pj})}{\partial o_{pj}} f_j'(net_{pj})$

$\delta_{pj} = \begin{cases} \frac{\partial C(t_{pj}, o_{pj})}{\partial o_{pj}} f_j'(net_{pj}) & \text{if unit } j \text{ is in the output layer} \\ f_j'(net_{pj}) \sum_k \delta_{pk} w_{kj} & \text{if unit } j \text{ is in a hidden layer} \end{cases}$

But don't worry: this is a how-to on neural network forecasting with limited maths! CD start-up kit for neural net forecasting: 20+ software simulators, datasets, literature & FAQ; slides, data & additional info on www.neural-forecasting.com


Agenda: Forecasting with Artificial Neural Networks
1. Forecasting?
2. Neural Networks?
3. Forecasting with Neural Networks
4. How to write a good Neural Network forecasting paper!

Agenda: Forecasting with Artificial Neural Networks
1. Forecasting?
   1. Forecasting as predictive Regression
   2. Time series prediction vs. causal prediction
   3. Why NN for Forecasting?
2. Neural Networks?
3. Forecasting with Neural Networks
4. How to write a good Neural Network forecasting paper!


Forecasting or Prediction? Data Mining: the application of data analysis & discovery algorithms that extract patterns out of the data.

Data Mining TASKS, split by the timing of the dependent variable (example ALGORITHMS in parentheses):

Descriptive Data Mining
- Categorisation: summarisation, clustering & visualisation (K-means clustering, neural networks, feature selection, principal component analysis, class entropy)
- Association analysis (association rules, link analysis)
- Sequence discovery (temporal association rules)

Predictive Data Mining
- Categorisation: classification (decision trees, logistic regression, neural networks, discriminant analysis)
- Regression (linear regression, nonlinear regression, neural networks: MLP, RBFN, GRNN)
- Time series analysis (exponential smoothing, (S)ARIMA(x), neural networks)


Forecasting or Classification? Choice of method by the scale of the independent and dependent variables:
- metric dependent variable: Regression / Time Series Analysis (metric independents) or Analysis of Variance (nominal independents); DOWNSCALING from metric to ordinal to nominal scale is possible
- nominal dependent variable: Classification or Contingency Analysis
- no dependent variable: Principal Component Analysis, Clustering

Supervised learning: cases of inputs with a target. Unsupervised learning: cases of inputs with NO target.

Simplification: Regression ("How much?") vs. Classification ("Will the event occur?"). FORECASTING = PREDICTIVE modelling (the dependent variable lies in the future), and FORECASTING = REGRESSION modelling (the dependent variable is of metric scale).

Forecasting or Classification? What the experts say.

International Conference on Data Mining: you are welcome to contribute, www.dmin-2006.com!


Agenda: Forecasting with Artificial Neural Networks
1. Forecasting?
   1. Forecasting as predictive Regression
   2. Time series prediction vs. causal prediction
   3. SARIMA-Modelling
   4. Why NN for Forecasting?
2. Neural Networks?
3. Forecasting with Neural Networks
4. How to write a good Neural Network forecasting paper!


Forecasting Models: time series analysis vs. causal modelling.

Time series prediction (univariate): assumes that the data generating process that creates the patterns can be explained from previous observations of the dependent variable alone.

Causal prediction (multivariate): the data generating process can be explained by the interaction of causal (cause-and-effect) independent variables.

Classification of Forecasting Methods

Objective Forecasting Methods
- Time series methods: averages, moving averages, naive methods; exponential smoothing (simple ES, linear ES, seasonal ES, dampened trend ES); simple regression, autoregression, ARIMA; neural networks
- Causal methods: linear regression, multiple regression, dynamic regression, vector autoregression, intervention models; neural networks

Subjective Forecasting Methods ("prophecy" = educated guessing): sales force composite, analogies, Delphi, PERT, survey techniques

Demand planning practice: objective methods + subjective correction.

Neural networks ARE time series methods AND causal methods & CAN be used as averages, ES and regression.


Time Series Definition. Definition: a time series is a series of timely ordered, comparable observations $y_t$ recorded in equidistant time intervals. Notation: $Y_t$ represents the observation of period t, t = 1, 2, ..., n.

Concept of Time Series. An observed measurement is made up of a systematic part and a random part.

Approach: unfortunately we cannot observe either of these! Forecasting methods try to isolate the systematic part; forecasts are based on the systematic part; the random part determines the distribution shape.

Assumption: data observed over time is comparable:
- the time periods are of identical lengths (check!)
- the units they are measured in do not change (check!)
- the definitions of what is being measured remain unchanged (check!)
- they are correctly measured (check!)

Data errors arise from sampling, from bias in the instruments or the responses, and from transcription.

Objective Forecasting Methods: Time Series Methods. Time series analysis / forecasting is a class of objective methods based on the analysis of past observations of the dependent variable alone.

Assumption: there exists a cause-effect relationship that keeps repeating itself with the yearly calendar; the cause-effect relationship may be treated as a BLACK BOX.

TIME-STABILITY HYPOTHESIS, which ASSUMES NO CHANGE: the causal relationship remains intact indefinitely into the future, so the time series can be explained & predicted solely from previous observations of the series. Time series methods consider only past patterns of the same variable; future events (with no occurrence in the past) are explicitly NOT considered, so external EVENTS relevant to the forecast must be corrected MANUALLY.

Simple Time Series Patterns [figures: four example plots]
- Diagram 1.1, Trend: long-term growth or decline occurring within a series; long-term movement in the series.
- Diagram 1.2, Seasonal: more or less regular movements within a year; regular fluctuation within a year (or shorter period) superimposed on trend and cycle.
- Diagram 1.3, Cycle: alternating upswings of varied length and intensity; regular fluctuation superimposed on trend (the period may be random).
- Diagram 1.4, Irregular: random movements and those which reflect unusual events.

Regular Components of Time Series. A time series consists of superimposed components / patterns: signal level L, trend T, seasonality S, and noise (irregular error e).

Sales = LEVEL + SEASONALITY + TREND + RANDOM ERROR

Irregular Components of Time Series: structural changes in the systematic data [figures: stationary time series with a pulse; stationary time series with a level shift]
- PULSE: a one-time occurrence on top of a systematic stationary / trended / seasonal development.
- LEVEL SHIFT: one-time / multiple shifts on top of a systematic stationary / trended / seasonal development.
- STRUCTURAL BREAKS: trend changes (slope, direction); seasonal pattern changes & shifts.

Components of Time Series [figure: a time series decomposed into its components: level, trend, random]

Time Series Patterns [figures: example time series for each pattern]

REGULAR time series patterns:
- STATIONARY time series: $Y_t = f(E_t)$; the time series is influenced by level & random fluctuations.
- SEASONAL time series: $Y_t = f(S_t, E_t)$; the time series is influenced by level, season and random fluctuations.
- TRENDED time series: $Y_t = f(T_t, E_t)$; the time series is influenced by trend from level and random fluctuations.

IRREGULAR time series patterns:
- FLUCTUATING time series: fluctuates very strongly around the level (mean deviation > ca. 50% around the mean).
- INTERMITTENT time series: the number of periods with zero sales is high (ca. 30%-40%).

Combination of individual components: $Y_t = f(S_t, T_t, E_t)$, plus PULSES, LEVEL SHIFTS and STRUCTURAL BREAKS!

Components of complex Time Series [figure: observed series and its components]

$Y_t$, the sales or observation of the time series at point t, consists of a combination of:
- base level L plus seasonal component $S_t$
- trend component $T_t$
- irregular or random error component $E_t$

Different possibilities to combine the components:
- Additive model: $Y_t = L + S_t + T_t + E_t$
- Multiplicative model: $Y_t = L \cdot S_t \cdot T_t \cdot E_t$

Classification of Time Series Patterns [Pegels 1969 / Gardner]: all combinations of {no, additive, multiplicative} seasonal effect with {no, additive, multiplicative} trend effect.

Agenda: Forecasting with Artificial Neural Networks
1. Forecasting?
   1. Forecasting as predictive Regression
   2. Time series prediction vs. causal prediction
   3. SARIMA-Modelling
      1. SARIMA Differencing
      2. SARIMA Autoregressive Terms
      3. SARIMA Moving Average Terms
      4. SARIMA Seasonal Terms
   4. Why NN for Forecasting?
2. Neural Networks?
3. Forecasting with Neural Networks
4. How to write a good Neural Network forecasting paper!

Introduction to ARIMA Modelling. Seasonal Autoregressive Integrated Moving Average processes (SARIMA) were popularised by George Box & Gwilym Jenkins in the 1970s (the names are often used synonymously), and the models are widely studied. Box & Jenkins put together the theoretical underpinning required to understand & use ARIMA and defined a general notation for dealing with ARIMA models.

They claim that most time series can be parsimoniously represented by the ARIMA class of models. ARIMA(p, d, q) models attempt to describe the systematic pattern of a time series by 3 parameters:
- p: number of autoregressive terms (AR terms) in the time series
- d: number of differences needed to achieve stationarity of the time series
- q: number of moving average terms (MA terms) in the time series

$\phi_p(B)(1-B)^d Z_t = \mu + \theta_q(B)\, e_t$

The Box-Jenkins Methodology for ARIMA models

1. Model identification
   - Data preparation: transform the time series for stationarity; difference the time series for stationarity.
   - Model selection: examine the ACF & PACF; identify potential models (p,q)(P,Q).
2. Model estimation & testing
   - Model estimation: estimate the parameters in the potential models; select the best model using a suitable criterion.
   - Model diagnostics / testing: check the ACF / PACF of the residuals for white noise; run a portmanteau test on the residuals; re-identify if necessary.
3. Model application: use the selected model to forecast.

ARIMA Modelling: ARIMA(p,d,q) models
- AR: autoregressive terms AR(p), with p = order of the autoregressive part
- I: order of integration, with d = degree of first differencing involved
- MA: moving average terms MA(q), with q = order of the moving average of the error
- SARIMA(p,d,q)(P,D,Q)s: with (P,D,Q) the corresponding process for the seasonal lags of seasonality s

Objective: identify the appropriate ARIMA model for the time series, i.e. identify the AR term, the I term and the MA term.

Identification through the autocorrelation function (ACF) and the partial autocorrelation function (PACF).

ARIMA Models: Identification of the d term. The parameter d determines the order of integration. ARIMA models assume stationarity of the time series: stationarity in the mean and stationarity of the variance (homoscedasticity).

Recap: let the mean of the time series at time t be $\mu_t = E(Y_t)$, the variance $\gamma_{t,t} = \mathrm{var}(Y_t)$ and the covariances $\gamma_{t,\tau} = \mathrm{cov}(Y_t, Y_\tau)$.

Definition: a time series is stationary if its mean level $\mu_t$ is constant for all t and its variance and covariances $\gamma_{t,t-\tau}$ are constant for all t. In other words: all properties of the distribution (mean, variance, skewness, kurtosis etc.) of a random sample of the time series are independent of the absolute time t of drawing the sample: identity of mean & variance across time.

ARIMA Models: Stationarity and parameter d. Is the time series stationary? [figure: two samples A and B drawn from one series] Stationarity requires $\mu(A) = \mu(B)$, $\mathrm{var}(A) = \mathrm{var}(B)$ etc. For this time series $\mu(B) > \mu(A)$: a trend, so the time series is non-stationary.

ARIMA Models: Differencing for Stationarity. Differencing a time series: e.g. the time series $Y_t$ = {2, 4, 6, 8, 10} exhibits a linear trend. 1st-order differencing between observation $Y_t$ and its predecessor $Y_{t-1}$ derives a transformed time series: 4-2=2, 6-4=2, 8-6=2, 10-8=2. The new time series {2, 2, 2, 2} is stationary through 1st differencing, d=1: an ARIMA(0,1,0) model. If needed, 2nd-order differences give d=2 (see the sketch below).
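A minimal sketch of this first and second differencing in Python with NumPy (the library choice is an assumption for illustration; the toy series is the one from the slide):

import numpy as np

y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # series with a linear trend
d1 = np.diff(y, n=1)                        # first differences: [2. 2. 2. 2.] -> stationary, d=1
d2 = np.diff(y, n=2)                        # second differences, d=2, for quadratic trends
print(d1, d2)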

ARIMA Models: Differencing for Stationarity, i.e. integration. Differencing: $Z_t = Y_t - Y_{t-1}$. Transforms (logarithms etc.): $Z_t$ is a transform of the variable of interest $Y_t$, chosen to make the differenced series $Z_t - Z_{t-1}$, $(Z_t - Z_{t-1}) - (Z_{t-1} - Z_{t-2})$, ... stationary.

Tests for stationarity: Dickey-Fuller test, serial correlation test, runs test (see the sketch below).
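A hedged example of the Dickey-Fuller test using the statsmodels package (assuming it is installed; the simulated random walk is a placeholder series):

import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=200))     # a random walk: non-stationary by construction

stat, pvalue = adfuller(y)[:2]          # augmented Dickey-Fuller statistic and p-value
print(f"levels: p={pvalue:.3f}")        # large p: cannot reject a unit root
stat_d, p_d = adfuller(np.diff(y))[:2]  # test again after first differencing
print(f"differences: p={p_d:.3f}")      # small p: differenced series looks stationary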

Agenda: Forecasting with Artificial Neural Networks
1. Forecasting?
   1. Forecasting as predictive Regression
   2. Time series prediction vs. causal prediction
   3. SARIMA-Modelling
      1. SARIMA Differencing
      2. SARIMA Autoregressive Terms
      3. SARIMA Moving Average Terms
      4. SARIMA Seasonal Terms
   4. Why NN for Forecasting?
2. Neural Networks?
3. Forecasting with Neural Networks
4. How to write a good Neural Network forecasting paper!

ARIMA Models: Autoregressive Terms. Description of the autocorrelation structure via autoregressive (AR) terms: if a dependency exists between lagged observations $Y_t$ and $Y_{t-1}$, we can describe $Y_t$ from the realisation of $Y_{t-1}$:

$Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + e_t$

with $Y_t$ the observation in time t, $\phi_i$ the weights of the AR relationship, $Y_{t-i}$ the lagged (independent) observations, and $e_t$ a random component (white noise).

The equations include only lagged realisations of the forecast variable: an ARIMA(p,0,0) model = AR(p) model. Problems: the independence of the residuals is often violated (heteroscedasticity), and determining the number of past values is problematic. Tests for autoregression are portmanteau tests: the Box-Pierce test and the Ljung-Box test.

ARIMA Models: Parameter p of Autocorrelation. A stationary time series can be analysed for its autocorrelation structure. The autocorrelation coefficient for lag k,

$r_k = \frac{\sum_{t=k+1}^{n} (Y_t - \bar{Y})(Y_{t-k} - \bar{Y})}{\sum_{t=1}^{n} (Y_t - \bar{Y})^2}$,

denotes the correlation between lagged observations of distance k. [figure: scatter plot of $Y_t$ against $Y_{t-1}$] Graphical interpretation: uncorrelated data has low autocorrelations and shows no correlation pattern.

ARIMA Models: Parameter p. Example: pair the observations of the time series with their lagged values. At lag 1 the pairs $(Y_{t-1}, Y_t)$ are (7,8), (8,7), (7,6), (6,5), (5,4), (4,5), (5,6), (6,4), ..., giving $r_1 = .62$; the lag-2 pairs give $r_2 = .32$; the lag-3 pairs give $r_3 = .15$.

The autocorrelations $r_k$ gathered at lags 1, 2, ... make up the autocorrelation function (ACF) [figure: ACF bar chart over the lags] (see the sketch below).
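A small sketch computing sample autocorrelations directly from the formula above (pure NumPy; the example series is only a plausible reconstruction of the slide's data):

import numpy as np

def acf(y, max_lag):
    # sample autocorrelation r_k for k = 1..max_lag
    y = np.asarray(y, dtype=float)
    yc = y - y.mean()                   # deviations from the mean
    denom = np.sum(yc ** 2)
    return np.array([np.sum(yc[k:] * yc[:-k]) / denom for k in range(1, max_lag + 1)])

y = [7, 8, 7, 6, 5, 4, 5, 6, 4]         # assumed example series
print(acf(y, 3))                        # decaying positive autocorrelations at lags 1..3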

ARIMA Models: Autoregressive Terms. Identification of AR terms? [figures: two sample ACFs with confidence limits over lags 1-16] A series of random independent observations shows no significant autocorrelations; an AR(1) process shows slowly decaying significant autocorrelations.

ARIMA Models: Partial Autocorrelations. Partial autocorrelations are used to measure the degree of association between $Y_t$ and $Y_{t-k}$ when the effects of the other time lags 1, 2, 3, ..., k-1 are removed. A significant AC between $Y_t$ and $Y_{t-1}$ together with a significant AC between $Y_{t-1}$ and $Y_{t-2}$ induces correlation between $Y_t$ and $Y_{t-2}$! (The 1st AC = the 1st PAC.)

When fitting an AR(p) model $Y_t = \phi_0 + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + e_t$ to the time series, the last coefficient $\phi_p$ of $Y_{t-p}$ measures the excess correlation at lag p which is not accounted for by an AR(p-1) model. $\phi_p$ is called the p-th order partial autocorrelation:

$\phi_p = \mathrm{corr}(Y_t, Y_{t-p} \mid Y_{t-1}, Y_{t-2}, \dots, Y_{t-p+1})$

The partial autocorrelation coefficient measures the true correlation at lag p.

ARIMA Modelling: AR model patterns [figures: ACF and PACF bar charts with confidence intervals, lags 1-16]. ACF: decaying pattern; PACF: 1st lag significant.

AR(1) model = ARIMA(1,0,0): $Y_t = c + \phi_1 Y_{t-1} + e_t$


ARIMA Modelling: AR model patterns [figures: ACF and PACF]. ACF: decaying pattern; PACF: 1st & 2nd lag significant.

AR(2) model = ARIMA(2,0,0): $Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + e_t$

ARIMA Modelling: AR model patterns [figures: ACF and PACF]. ACF: dampened sine pattern; PACF: 1st & 2nd lag significant.

AR(2) model = ARIMA(2,0,0): $Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + e_t$

ARIMA Modelling: AR Models. Autoregressive model of order one, ARIMA(1,0,0) = AR(1) [figure: simulated AR(1) path]:

$Y_t = c + \phi_1 Y_{t-1} + e_t = 1.1 + 0.8\, Y_{t-1} + e_t$

Higher-order AR models: $Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + e_t$, with stationarity conditions for p = 1: $-1 < \phi_1 < 1$; for p = 2: $-1 < \phi_2 < 1$, $\phi_2 + \phi_1 < 1$, $\phi_2 - \phi_1 < 1$.

Agenda: Forecasting with Artificial Neural Networks
1. Forecasting?
   1. Forecasting as predictive Regression
   2. Time series prediction vs. causal prediction
   3. SARIMA-Modelling
      1. SARIMA Differencing
      2. SARIMA Autoregressive Terms
      3. SARIMA Moving Average Terms
      4. SARIMA Seasonal Terms
   4. Why NN for Forecasting?
2. Neural Networks?
3. Forecasting with Neural Networks
4. How to write a good Neural Network forecasting paper!

ARIMA Modelling: Moving Average Processes. Description of the moving average structure: AR models may not approximate the data generator underlying the observations perfectly, leaving residuals $e_t, e_{t-1}, e_{t-2}, \dots, e_{t-q}$. The observation $Y_t$ may depend on the realisation of previous errors e, so we regress against past errors as explanatory variables:

$Y_t = c + e_t - \theta_1 e_{t-1} - \theta_2 e_{t-2} - \dots - \theta_q e_{t-q}$

An ARIMA(0,0,q) model = MA(q) model, with invertibility conditions for q = 1: $-1 < \theta_1 < 1$; for q = 2: $-1 < \theta_2 < 1$, $\theta_2 + \theta_1 < 1$, $\theta_2 - \theta_1 < 1$.

ARIMA Modelling: MA model patterns [figures: ACF and PACF]. ACF: 1st lag significant; PACF: decaying pattern.

MA(1) model = ARIMA(0,0,1): $Y_t = c + e_t - \theta_1 e_{t-1}$


ARIMA Modelling: MA model patterns [figures: ACF and PACF]. ACF: 1st, 2nd & 3rd lag significant; PACF: decaying pattern.

MA(3) model = ARIMA(0,0,3): $Y_t = c + e_t - \theta_1 e_{t-1} - \theta_2 e_{t-2} - \theta_3 e_{t-3}$

ARIMA Modelling: MA Models. Moving average model of order one, ARIMA(0,0,1) = MA(1) [figure: simulated MA(1) path]:

$Y_t = c + e_t - \theta_1 e_{t-1} = 10 + e_t + 0.2\, e_{t-1}$

ARIMA Modelling: Mixture ARMA Models. Complicated series may be modelled by combining AR & MA terms. ARMA(1,1) model = ARIMA(1,0,1) model:

$Y_t = c + \phi_1 Y_{t-1} + e_t - \theta_1 e_{t-1}$

Higher-order ARMA(p,q) models:

$Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + e_t - \theta_1 e_{t-1} - \theta_2 e_{t-2} - \dots - \theta_q e_{t-q}$

ARIMA Modelling: ARMA model patterns [figures: ACF and PACF]. ACF: 1st lag significant, then a decaying pattern; PACF: 1st lag significant, then a decaying pattern.

AR(1) and MA(1) model = ARMA(1,1) = ARIMA(1,0,1).

Agenda: Forecasting with Artificial Neural Networks
1. Forecasting?
   1. Forecasting as predictive Regression
   2. Time series prediction vs. causal prediction
   3. SARIMA-Modelling
      1. SARIMA Differencing
      2. SARIMA Autoregressive Terms
      3. SARIMA Moving Average Terms
      4. SARIMA Seasonal Terms
      5. SARIMAX Seasonal ARIMA with Interventions
   4. Why NN for Forecasting?
2. Neural Networks?
3. Forecasting with Neural Networks
4. How to write a good Neural Network forecasting paper!

Seasonality in ARIMA Models. Identifying seasonal data: spikes in the ACF / PACF at the seasonal lags, e.g. t-12 & t-13 for yearly seasonality in monthly data, t-4 & t-5 for quarterly data [figure: ACF with seasonal spikes and confidence limits].

Differences: simple, $\nabla Y_t = (1-B) Y_t$; seasonal, $\nabla_s Y_t = (1-B^s) Y_t$, with s the seasonality, e.g. 4 or 12. The data may require seasonal differencing to remove the seasonality.

To identify the model, specify the seasonal parameters (P,D,Q): the seasonal autoregressive parameter P, the seasonal difference D and the seasonal moving average Q. This gives the seasonal ARIMA(p,d,q)(P,D,Q) model.

Seasonality in ARIMA Models [figures: ACF and PACF with upper and lower confidence limits]. Seasonal spikes at lags 12, 24, ... = monthly data.

Seasonality in ARIMA Models: extension of the notation via the backshift operator B:

$\nabla_s Y_t = Y_t - Y_{t-s} = Y_t - B^s Y_t = (1-B^s) Y_t$

A seasonal difference followed by a first difference: $(1-B)(1-B^s)\, Y_t$.

The seasonal ARIMA(1,1,1)(1,1,1)_4 model:

$(1 - \phi_1 B)(1 - \Phi_1 B^4)(1 - B)(1 - B^4)\, Y_t = c + (1 - \theta_1 B)(1 - \Theta_1 B^4)\, e_t$

with a non-seasonal AR(1) term, a seasonal AR(1) term, a non-seasonal difference, a seasonal difference, a non-seasonal MA(1) term and a seasonal MA(1) term (see the sketch below).
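A hedged illustration of estimating such a model with the statsmodels package (the SARIMAX class and its order arguments do exist in statsmodels; the quarterly toy series and the chosen orders are placeholders):

import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(1)
season = np.tile([0.0, 5.0, 0.0, -5.0], 30)       # quarterly seasonal pattern
y = np.cumsum(rng.normal(size=120)) + season      # trend + seasonality + noise

model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 4))  # ARIMA(1,1,1)(1,1,1)_4
fit = model.fit(disp=False)                       # maximum likelihood estimation
print(fit.forecast(steps=4))                      # forecast four quarters ahead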

Agenda: Forecasting with Artificial Neural Networks
1. Forecasting?
   1. Forecasting as predictive Regression
   2. Time series prediction vs. causal prediction
   3. SARIMA-Modelling
      1. SARIMA Differencing
      2. SARIMA Autoregressive Terms
      3. SARIMA Moving Average Terms
      4. SARIMA Seasonal Terms
      5. SARIMAX Seasonal ARIMA with Interventions
   4. Why NN for Forecasting?
2. Neural Networks?
3. Forecasting with Neural Networks
4. How to write a good Neural Network forecasting paper!

Forecasting Models: time series analysis vs. causal modelling.

Time series prediction (univariate): assumes that the data generating process that creates the patterns can be explained from previous observations of the dependent variable alone.

Causal prediction (multivariate): the data generating process can be explained by the interaction of causal (cause-and-effect) independent variables.

Causal Prediction: ARX(p) models and general dynamic regression models.

Agenda: Forecasting with Artificial Neural Networks
1. Forecasting?
   1. Forecasting as predictive Regression
   2. Time series prediction vs. causal prediction
   3. SARIMA-Modelling
   4. Why NN for Forecasting?
2. Neural Networks?
3. Forecasting with Neural Networks
4. How to write a good Neural Network forecasting paper!

Why forecast with NN? Pattern or noise?
- Airline passenger data: seasonal, trended; the real model is disputed: multiplicative seasonality, or additive seasonality with level shifts?
- Fresh-products supermarket sales: seasonal, events, heteroscedastic noise; the real model is unknown.

Why forecast with NN? Pattern or noise?
- Random noise: i.i.d. (normally distributed, mean 0, std. dev. 1).
- BL(p,q) bilinear autoregressive model: $y_t = 0.7\, y_{t-1}\, \varepsilon_{t-2} + \varepsilon_t$

Why forecast with NN? Pattern or noise?
- TAR(p) threshold autoregressive model: $y_t = 0.9\, y_{t-1} + \varepsilon_t$ for $y_{t-1} \le 1$; $y_t = -0.3\, y_{t-1} - \varepsilon_t$ for $y_{t-1} > 1$
- Random walk: $y_t = y_{t-1} + \varepsilon_t$

(see the simulation sketch below)
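A small simulation sketch of these three generators (NumPy; the noise scale, series length and the reconstructed TAR signs are assumptions):

import numpy as np

rng = np.random.default_rng(42)
n = 200
eps = rng.normal(size=n + 2)        # i.i.d. N(0,1) noise, two extra values for the lags

bl = np.zeros(n + 2); tar = np.zeros(n + 2); rw = np.zeros(n + 2)
for t in range(2, n + 2):
    bl[t] = 0.7 * bl[t - 1] * eps[t - 2] + eps[t]   # bilinear BL model
    if tar[t - 1] <= 1.0:                           # threshold AR model, regime 1
        tar[t] = 0.9 * tar[t - 1] + eps[t]
    else:                                           # regime 2 above the threshold
        tar[t] = -0.3 * tar[t - 1] - eps[t]
    rw[t] = rw[t - 1] + eps[t]                      # random walk

print(bl[-3:], tar[-3:], rw[-3:])   # three series that are hard to tell apart from noise by eye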

Motivation for NN in Forecasting: Nonlinearity! The true data generating process is unknown & hard to identify, and many interdependencies in business are nonlinear. NN can approximate any LINEAR and NONLINEAR function to any desired degree of accuracy: they can learn linear time series patterns, can learn nonlinear time series patterns, and can extrapolate linear & nonlinear patterns = generalisation!

NN are nonparametric: they don't assume a particular noise process, e.g. Gaussian. NN model (learn) linear and nonlinear processes directly from the data, approximating the underlying data generating process. NN are a flexible forecasting paradigm.

Motivation for NN in Forecasting: Modelling Flexibility. Unknown data processes require building many candidate models!

Flexibility in input variables, through flexible coding:
- binary scale [0;1] or [-1;1]
- nominal / ordinal scale (0, 1, 2, ..., 10), binary coded [0001, 0010, ...]
- metric scale (0.235; 7.35; 12440.0; ...)

Flexibility in output variables:
- binary prediction of a single class membership
- nominal / ordinal prediction of multiple class memberships
- metric regression (point predictions) OR probability of class membership!

Free choice of the number of input variables and the number of output variables: one SINGLE network architecture for MANY applications.

Applications of Neural Nets in diverse Research Fields: 2500+ journal publications on NN & forecasting alone! [figure: citation analysis by year, 1987-2003, rising trend; title search on (neural AND net* AND (forecast* OR predict* OR time-ser*)), evaluated for sales-forecasting-related point predictions]

- Neurophysiology: simulate & explain the brain
- Informatics: e-mail & URL filtering, virus scanning (Symantec Norton Antivirus), speech recognition & optical character recognition
- Engineering: control applications in plants, automatic target recognition (DARPA), explosive detection at airports, mineral identification (NASA Mars Explorer), starting & landing of jumbo jets (NASA)
- Meteorology / weather: rainfall prediction, El Nino effects
- Corporate business: credit card fraud detection, forecasting methods

[figure: number of publications by business forecasting domain: general business, marketing, finance, production, product sales, electrical load] Business forecasting domains: electrical load / demand forecasting; financial forecasting (currency / exchange rate, stock forecasting etc.); sales forecasting. Note: not all NN recommendations are useful for your DOMAIN!

IBF Benchmark: Forecasting Methods used. Applied forecasting methods, all industries [figure: bar chart over methods: averages, autoregressive methods, decomposition, exponential smoothing, trend extrapolation, econometric models, neural networks, regression, analogies, Delphi, PERT, surveys; reported usage shares between 6% and 69% per method]. Summary: time series methods (objective) 61%; causal methods (objective) 23%; judgemental methods (subjective) ca. 2x%.

[number of replies: survey at 5 IBF conferences in 2001, 240 forecasters, 13 industries]

NN are applied in corporate Demand Planning / S&OP processes! [Warning: limited sample size]

Agenda: Forecasting with Artificial Neural Networks
1. Forecasting?
2. Neural Networks?
   1. What are NN? Definition & Online Preview
   2. Motivation & brief history of Neural Networks
   3. From biological to artificial Neural Network Structures
   4. Network Training
3. Forecasting with Neural Networks
4. How to write a good Neural Network forecasting paper!

What are Artificial Neural Networks? Artificial Neural Networks (NN):
- "a machine that is designed to model the way in which the brain performs a particular task; the network is implemented or simulated in software on a digital computer" [Haykin98]
- a class of statistical methods for information processing consisting of a large number of simple processing units (neurons), which exchange information about their activation via directed connections [Zell97]

[diagram: input, "black box" processing, output] Input: time series observations, causal variables, image data (pixels/bits), fingerprints, chemical readouts. Output: time series predictions, dependent variables, class memberships, class probabilities, principal components.

What are Neural Networks in Forecasting? Artificial Neural Networks (NN) are a flexible forecasting paradigm: a class of statistical methods for time series and causal forecasting; highly flexible, processing arbitrary input-output relationships. Properties: nonlinear, nonparametric (assumed), error robust (not outlier robust!). Data-driven modelling: learning directly from the data.

[diagram: input, "black box" processing, output] Input (nominal, interval or ratio scale, not ordinal): time series observations, lagged variables, dummy variables, causal / explanatory variables. Output (ratio scale): regression, single-period-ahead or multiple-period-ahead forecasts.

DEMO: Preview of Neural Network Forecasting. Simulation of NN for business forecasting. Airline passenger data experiment: 3-layered NN (12-8-1), i.e. 12 input units, 8 hidden units, 1 output unit; 12 input lags t, t-1, ..., t-11 (the past 12 observations); time series prediction, t+1 forecast, single-step-ahead. Benchmark time series [Brown / Box & Jenkins]: 132 observations of monthly data.

Demonstration: Preview of Neural Network Forecasting. NeuraLab Predict!: a look inside neural forecasting. [screenshot: errors on the training / validation / test datasets; time series versus neural network forecast, updated after each learning step; in-sample observations & forecasts and out-of-sample split into training, validation and test; actual time series values vs. NN forecasted values; PQ diagram; absolute forecasting errors]

Agenda: Forecasting with Artificial Neural Networks
1. Forecasting?
2. Neural Networks?
   1. What are NN? Definition & Online Preview
   2. Motivation & brief history of Neural Networks
   3. From biological to artificial Neural Network Structures
   4. Network Training
3. Forecasting with Neural Networks
4. How to write a good Neural Network forecasting paper!

Motivation for using NN: BIOLOGY! Human & other nervous systems (animals, insects, e.g. bats) have the ability to perform various complex functions: perception, motor control, pattern recognition, classification, prediction etc., with speed (e.g. detecting & recognising a changed face in a crowd takes 100-200 ms) and efficiency. Brains are the most efficient & complex computers known to date.

Human brain vs. computer (PC):
- processing speed: 10^-3 ms (0.25 MHz) vs. 10^-9 ms (2500 MHz PC)
- neurons/transistors: 10 billion neurons & 10^3 billion connections vs. 50 million transistors (PC chip)
- weight: 1500 grams vs. kilograms to tons!
- energy consumption: 10^-16 Joule vs. 10^-6 Joule
- computation (vision): 100 steps vs. billions of steps

Comparison: human = 10,000,000,000 neurons; ant = 20,000 neurons.

Brief History of Neural Networks. Developed in interdisciplinary research (McCulloch/Pitts 1943), with motivation from the functions of natural neural networks: both neurobiological and application-oriented motivation [Smith & Gupta, 2000].
- Foundations: Turing 1936; McCulloch / Pitts 1943; Hebb 1949
- Early era: Minsky 1954 builds the 1st neuro-computer; Dartmouth Project 1956; Rosenblatt 1959
- Setback & revival: Minsky/Papert 1969; Kohonen 1972; Werbos 1974; Rumelhart/Hinton/Williams 1986; 1st IJCNN 1987; 1st journals 1988
- Industry context: GE 1954, 1st computer payroll system; INTEL 1971, 1st microprocessor; IBM 1981 introduces the PC; Neuralware founded 1987; White 1988, 1st paper on forecasting; SAS 1997, Enterprise Miner; IBM 1998, $70bn BI initiative

From neuroscience, to applications in engineering (pattern recognition & control), to applications in business forecasting. A research field of soft computing & artificial intelligence: neuroscience, mathematics, physics, statistics, information science, engineering, business management, each with a different VOCABULARY: statistics versus neurophysiology!

Agenda: Forecasting with Artificial Neural Networks
1. Forecasting?
2. Neural Networks?
   1. What are NN? Definition & Online Preview
   2. Motivation & brief history of Neural Networks
   3. From biological to artificial Neural Network Structures
   4. Network Training
3. Forecasting with Neural Networks
4. How to write a good Neural Network forecasting paper!

Motivation & Implementation of Neural Networks: from biological neural networks to artificial neural networks. [diagram: node i receiving outputs $o_1, o_2, \dots, o_j$ over weighted connections $w_{i,1}, w_{i,2}, \dots, w_{i,j}$]

$net_i = \sum_j w_{ij} o_j - \theta_i$, $a_i = f(net_i)$, $o_i = a_i$, e.g. $o_i = \tanh\left(\sum_j w_{ji} o_j - \theta_i\right)$

Mathematics serves as an abstract representation of reality, for use in software simulators, hardware, engineering etc., e.g. in a software simulator:

% MATLAB (Neural Network Toolbox): inspect the structure of a stored network
neural_net = eval(net_name);                           % fetch the network object by its name
[num_rows, ins] = size(neural_net.iw{1});              % input weight matrix -> number of inputs
[outs, num_cols] = size(neural_net.lw{neural_net.numLayers, neural_net.numLayers-1}); % output layer weights
if (strcmp(neural_net.adaptFcn, ''))                   % no adaptation function set ...
    net_type = 'RBF';                                  % ... treated here as an RBF network
else
    net_type = 'MLP';
end
fid = fopen(path, 'w');                                % open the export file for writing

Information Processing in biological Neurons: modelling of biological functions in neurons. 10-100 billion neurons with ca. 10,000 connections each in the brain; input (sensory), processing (internal) & output (motoric) neurons. [diagram: CONCEPT of information processing in neurons]

Alternative notations: information processing in neurons / nodes, in biological representation, graphical notation and mathematical representation.

[diagram: node $u_i$ with inputs $in_1, in_2, \dots, in_j$, weights $w_{i,1}, w_{i,2}, \dots, w_{i,j}$, bias $\theta_i$ (convention $w_{i0}$), input function, activation function, output $out_i$]

Input function: $net_i = \sum_j w_{ij}\, in_j$; activation: $a_i = f(net_i - \theta_i)$.

Mathematical representation of a binary threshold unit:
$y_i = 1$ if $\sum_j w_{ji} x_j - \theta_i \ge 0$, else $y_i = 0$;
alternative: $y_i = \tanh\left(\sum_j w_{ji} x_j - \theta_i\right)$

Information Processing in artificial Nodes: the CONCEPT of information processing in neurons.
- Input function (summation of the previous signals): $net_i = \sum_j w_{ij}\, in_j$
- Activation function (nonlinear): binary step function {0;1}, or sigmoid functions such as the logistic and hyperbolic tangent: $a_i = f(net_i)$
- Output function (linear / identity, SoftMax, ...): $o_i = a_i$

Unidirectional information processing, e.g. for a binary threshold unit:
$out_i = 1$ if $\sum_j w_{ji} o_j - \theta_i \ge 0$, else $out_i = 0$

Input Functions [figure: overview of input functions]

Binary Activation Functions. The binary activation is calculated from the input: $a_j = f_{act}(net_j, \theta_j)$, e.g. $a_j = f_{act}(net_j - \theta_j)$ [figure: step functions]

Information Processing: Node Threshold Logic. Node function BINARY THRESHOLD LOGIC:
1. weight each individual input by its connection strength
2. sum the weighted inputs
3. add the bias term
4. calculate the output of the node through the BINARY transfer function
RERUN with the next input.

Example: inputs $o_1 = 2.2$, $o_2 = 4$, $o_3 = 1$; weights $w_{1,i} = 0.71$, $w_{2,i} = -1.84$, $w_{3,i} = 9.01$; bias $\theta = 8.0$:
$2.2 \cdot 0.71 + 4.0 \cdot (-1.84) + 1.0 \cdot 9.01 = 3.212$; $3.212 - 8.0 = -4.788 < 0$, so $o_j = 0.00$

$net_i = \sum_j w_{ij} o_j$, $a_i = f(net_i)$, $o_i = 1$ if $\sum_j w_{ji} o_j - \theta_i \ge 0$, else $o_i = 0$.

Continuous Activation Functions. The activation is calculated from the input: $a_j = f_{act}(net_j, \theta_j)$, e.g. $a_j = f_{act}(net_j - \theta_j)$ [figures: hyperbolic tangent, logistic function]

Information Processing: Node Threshold Logic. Node function SIGMOID THRESHOLD LOGIC with the tanh activation function:
1. weight each individual input by its connection strength
2. sum the weighted inputs
3. add the bias term
4. calculate the output of the node through the tanh transfer function
RERUN with the next input.

Example, with the same inputs and weights as before: $2.2 \cdot 0.71 + 4.0 \cdot (-1.84) + 1.0 \cdot 9.01 = 3.212$; $3.212 - 8.0 = -4.788$; $\tanh(-4.788) = -0.9998$, so $o_j = -0.9998$

$net_i = \sum_j w_{ij} o_j$, $a_i = f(net_i)$, $o_i = \tanh\left(\sum_j w_{ji} o_j - \theta_i\right)$ (see the sketch below)
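A minimal sketch of this single-node computation (NumPy; the input values, weights and bias are the ones from the slide):

import numpy as np

def node(inputs, weights, bias, act=np.tanh):
    # weighted sum of the inputs minus the bias, passed through an activation function
    net = np.dot(weights, inputs) - bias
    return act(net)

x = np.array([2.2, 4.0, 1.0])
w = np.array([0.71, -1.84, 9.01])

print(node(x, w, bias=8.0))                                # tanh node: approx. -0.9998
print(node(x, w, bias=8.0, act=lambda n: float(n >= 0)))   # binary threshold node: 0.0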

A new Notation: GRAPHICS! Linear regression as an equation:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon$

Linear regression as a directed graph: input nodes $1, x_1, x_2, \dots, x_n$ connected to the output node y with weights $\beta_0, \beta_1, \beta_2, \dots, \beta_n$; unidirectional information processing.

Why Graphical Notation? Even a simple neural network equation without recurrent feedbacks reads:

$y_k = \tanh\left(\sum_j w_{kj} \tanh\left(\sum_i w_{ki} \tanh\left(\sum_j w_{ji} x_j - \theta_j\right) - \theta_i\right) - \theta_k\right) \to \min!$

with

$\tanh\left(\sum_{i=1}^{N} x_i w_{ij} - \theta_j\right) = \frac{e^{\sum_{i=1}^{N} x_i w_{ij} - \theta_j} - e^{-\left(\sum_{i=1}^{N} x_i w_{ij} - \theta_j\right)}}{e^{\sum_{i=1}^{N} x_i w_{ij} - \theta_j} + e^{-\left(\sum_{i=1}^{N} x_i w_{ij} - \theta_j\right)}}$

Also: [network graph] The graphical notation is a simplification for complex models!

Combination of Nodes: simple processing per node; the combination of simple nodes creates complex behaviour. [diagram: several tanh nodes feeding node j] Each node computes

$\tanh\left(\sum_{i=1}^{N} o_i w_{ij} - \theta_j\right) = \frac{e^{\sum_{i=1}^{N} o_i w_{ij} - \theta_j} - e^{-\left(\sum_{i=1}^{N} o_i w_{ij} - \theta_j\right)}}{e^{\sum_{i=1}^{N} o_i w_{ij} - \theta_j} + e^{-\left(\sum_{i=1}^{N} o_i w_{ij} - \theta_j\right)}}$

over the outputs $o_i$ of its predecessor nodes.

Architecture of Multilayer Perceptrons: the classic form of feed-forward neural network! Neurons $u_n$ (units / nodes) are ordered in layers, with unidirectional connections carrying trainable weights $w_{n,n}$; a vector of input signals $x_i$ (input) and a vector of output signals $o_j$ (output). [diagram: input layer $u_1 \dots u_4$, hidden layers $u_5 \dots u_{11}$, output layer $u_{12} \dots u_{14}$ with outputs $o_1 \dots o_3$]

The combination of neurons = a neural network:

$o_k = \tanh\left(\sum_j w_{kj} \tanh\left(\sum_i w_{ki} \tanh\left(\sum_j w_{ji} o_j - \theta_j\right) - \theta_i\right) - \theta_k\right) \to \min!$ (see the sketch below)
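A compact sketch of the feed-forward pass of such an MLP (NumPy; the layer sizes and random weights are arbitrary illustrations, not the tutorial's networks):

import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, layers):
    # propagate input x through a list of (weights, bias) layers of tanh nodes
    a = x
    for W, b in layers:
        a = np.tanh(W @ a - b)      # net input minus bias, squashed by tanh
    return a

layers = [(rng.normal(size=(5, 4)), rng.normal(size=5)),   # 4 inputs -> 5 hidden nodes
          (rng.normal(size=(3, 5)), rng.normal(size=3))]   # 5 hidden -> 3 outputs

x = np.array([0.2, -0.1, 0.7, 0.5])
print(mlp_forward(x, layers))       # three outputs in (-1, 1)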

Dictionary for Neural Network Terminology. Due to its neuro-biological origins, NN research uses a specific terminology:
- Input nodes = independent / lagged variables (statistics)
- Output node(s) = dependent variable(s)
- Training = parameterization
- Weights = parameters

Don't be confused: ASK!

Agenda: Forecasting with Artificial Neural Networks
1. Forecasting?
2. Neural Networks?
   1. What are NN? Definition & Online Preview
   2. Motivation & brief history of Neural Networks
   3. From biological to artificial Neural Network Structures
   4. Network Training
3. Forecasting with Neural Networks
4. How to write a good Neural Network forecasting paper!

Hebbian Learning. HEBB introduced the idea of learning by adapting the weights:

$\Delta w_{ij} = \eta\, o_i a_j$

Delta learning rule of Widrow-Hoff:

$\Delta w_{ij} = \eta\, o_i (t_j - a_j) = \eta\, o_i (t_j - o_j) = \eta\, o_i \delta_j$

(see the sketch below)
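A tiny sketch of one Widrow-Hoff update for a single linear node (NumPy; the learning rate and data are arbitrary assumptions):

import numpy as np

eta = 0.1                          # learning rate
x = np.array([0.5, -0.2, 1.0])     # node inputs o_i
w = np.array([0.1, 0.4, -0.3])     # current weights
t = 1.0                            # teaching output t_j

o = np.dot(w, x)                   # node output o_j
delta = t - o                      # error term delta_j
w += eta * x * delta               # delta rule: w_ij += eta * o_i * delta_j
print(w)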

Neural Network Training with Back-Propagation. Training = LEARNING FROM EXAMPLES:
1. Initialise the connections with randomised weights (symmetry breaking).
2. Show the first input pattern (the independent variables).
3. Forward-propagate the input values to the output layer.
4. Calculate the error between the NN output & the actual value, E = o - t, using the error / objective function.
5. Backward-propagate the errors for each weight towards the input layer.
RERUN with the next input pattern.

[diagram: input vector x, weight matrix W, output vector o, teaching output t]

Neural Network Training: the simple back-propagation algorithm [Rumelhart et al. 1982].

$E_p = C(t_{pj}, o_{pj})$, with $o_{pj} = f_j(net_{pj})$

$\Delta_p w_{ji} \propto -\frac{\partial C(t_{pj}, o_{pj})}{\partial w_{ji}} = -\frac{\partial C(t_{pj}, o_{pj})}{\partial net_{pj}} \cdot \frac{\partial net_{pj}}{\partial w_{ji}}$

Define $\delta_{pj} = -\frac{\partial C(t_{pj}, o_{pj})}{\partial net_{pj}} = -\frac{\partial C(t_{pj}, o_{pj})}{\partial o_{pj}} \cdot \frac{\partial o_{pj}}{\partial net_{pj}} = -\frac{\partial C(t_{pj}, o_{pj})}{\partial o_{pj}} f_j'(net_{pj})$

For a hidden unit, with $net_{pk} = \sum_i w_{ki} o_{pi}$:

$-\frac{\partial C(t_{pj}, o_{pj})}{\partial o_{pj}} = -\sum_k \frac{\partial C(t_{pj}, o_{pj})}{\partial net_{pk}} \frac{\partial net_{pk}}{\partial o_{pj}} = \sum_k \delta_{pk} w_{kj}$, hence $\delta_{pj} = f_j'(net_{pj}) \sum_k \delta_{pk} w_{kj}$

$\delta_{pj} = \begin{cases} \frac{\partial C(t_{pj}, o_{pj})}{\partial o_{pj}} f_j'(net_{pj}) & \text{if unit } j \text{ is in the output layer} \\ f_j'(net_{pj}) \sum_k \delta_{pk} w_{kj} & \text{if unit } j \text{ is in a hidden layer} \end{cases}$

Weight update: $\Delta w_{ij} = \eta\, o_i \delta_j$, with $\delta_j = f_j'(net_j)(t_j - o_j)$ for output nodes j and $\delta_j = f_j'(net_j) \sum_k \delta_k w_{jk}$ for hidden nodes j.

For the logistic activation $f(net_j) = \frac{1}{1 + e^{-net_j}}$ the derivative is $f'(net_j) = o_j(1 - o_j)$, so $\delta_j = o_j (1 - o_j)(t_j - o_j)$ for output nodes and $\delta_j = o_j (1 - o_j) \sum_k \delta_k w_{jk}$ for hidden nodes (see the sketch below).
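A minimal back-propagation sketch for one hidden layer of logistic nodes, following the delta formulas above (NumPy; the network size, data and learning rate are toy assumptions, and biases are omitted for brevity):

import numpy as np

rng = np.random.default_rng(0)
sig = lambda n: 1.0 / (1.0 + np.exp(-n))            # logistic activation

X = rng.normal(size=(100, 3))                       # 100 patterns, 3 inputs
t = sig(X @ np.array([1.0, -2.0, 0.5]))             # toy targets in (0, 1)

W1 = rng.normal(size=(4, 3))                        # a 3-4-1 network
W2 = rng.normal(size=4)
eta = 0.5

for epoch in range(200):
    for x, tj in zip(X, t):
        o1 = sig(W1 @ x)                            # forward pass: hidden layer
        o2 = sig(W2 @ o1)                           # forward pass: output node
        d2 = o2 * (1 - o2) * (tj - o2)              # output delta: o(1-o)(t-o)
        d1 = o1 * (1 - o1) * (W2 * d2)              # hidden deltas: o(1-o) * sum_k delta_k w_jk
        W2 += eta * d2 * o1                         # updates: eta * o_i * delta_j
        W1 += eta * np.outer(d1, x)

print(np.mean((sig(W2 @ sig(W1 @ X.T)) - t) ** 2))  # final in-sample mean squared error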

Neural Network Training = Error Minimization: minimise the error by changing ONE weight $w_j$. [figure: error surface $E(w_j)$ with several local minima and one GLOBAL minimum; different random starting points lead to different local minima]

Error Backpropagation = Gradient Descent in 3D+: a local search on a multi-dimensional error surface, i.e. the task of finding the deepest valley in the mountains. The local search has a fixed step size and follows the steepest descent; a local optimum = any valley, the global optimum = the deepest valley with the lowest error; the result varies with the error surface. [figures: 3D error surfaces]

Demo: Neural Network Forecasting revisited! Simulation of NN for business forecasting. Airline passenger data experiment: 3-layered NN (12-8-1), i.e. 12 input units, 8 hidden units, 1 output unit; 12 input lags t, t-1, ..., t-11 (the past 12 observations); time series prediction, t+1 single-step-ahead forecast. Benchmark time series [Brown / Box & Jenkins]: 132 observations of monthly data.

Agenda: Forecasting with Artificial Neural Networks
1. Forecasting?
2. Neural Networks?
3. Forecasting with Neural Networks
   1. NN models for Time Series & Dynamic Causal Prediction
   2. NN experiments
   3. Process of NN modelling
4. How to write a good Neural Network forecasting paper!

Time Series Prediction with Artificial Neural Networks. ANN are universal approximators [Hornik/Stinchcombe/White 92 etc.]. Forecasting is an application of (nonlinear) function approximation, with various architectures for prediction (time series, causal, combined ...):

$y_{t+h} = f(x_t) + \varepsilon_{t+h}$

with $y_{t+h}$ the forecast for t+h, $f(\cdot)$ a linear / nonlinear function, $x_t$ the vector of observations in t, and $\varepsilon_{t+h}$ an independent error term in t+h.

- a single neuron / node: nonlinear AR(p)
- feed-forward NN (MLP etc.): a hierarchy of nonlinear AR(p)
- recurrent NN (Elman, Jordan): nonlinear ARMA(p,q)

$y_{t+1} = f(y_t, y_{t-1}, y_{t-2}, \dots, y_{t-n+1})$: a nonlinear autoregressive AR(p) model.

Neural Network Training on Time Series: the sliding window approach to presenting the data.
- Input: present a new data pattern to the neural network.
- Forward pass: calculate the neural network output from the input values; compare the neural network forecast against the actual value.
- Backpropagation: change the weights to reduce the output forecast error.
- New data input: slide the window forward to show the next pattern.

(see the sketch below)
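A short sketch of cutting a time series into such sliding-window patterns (NumPy; the window length and forecast horizon are example choices):

import numpy as np

def sliding_windows(y, n_lags, horizon=1):
    # each row holds n_lags past values; the target lies horizon steps ahead
    y = np.asarray(y, dtype=float)
    rows = len(y) - n_lags - horizon + 1
    X = np.stack([y[i:i + n_lags] for i in range(rows)])
    t = y[n_lags + horizon - 1:]
    return X, t

y = np.arange(10)                  # toy series 0..9
X, t = sliding_windows(y, n_lags=3)
print(X[:2], t[:2])                # [[0 1 2], [1 2 3]] with targets [3, 4]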

Neural Network Architectures for Linear Autoregression. Interpretation: the weights represent the autoregressive terms, with the same problems / shortcomings as standard AR models! Extensions: multiple output nodes = simultaneous autoregression models; nonlinearity through a different activation function in the output node.

$y_{t+1} = f(y_t, y_{t-1}, y_{t-2}, \dots, y_{t-n+1})$
$y_{t+1} = y_t w_{t,j} + y_{t-1} w_{t-1,j} + y_{t-2} w_{t-2,j} + \dots + y_{t-n+1} w_{t-n+1,j} - \theta_j$: a linear autoregressive AR(p) model.

Neural Network Architecture for Nonlinear Autoregression:

$y_{t+1} = \tanh\left(\sum_{i=t-n+1}^{t} y_i w_{ij} - \theta_j\right)$: a nonlinear autoregressive AR(p) model, $y_{t+1} = f(y_t, y_{t-1}, y_{t-2}, \dots, y_{t-n+1})$.

Extensions: additional layers with nonlinear nodes; a linear activation function in the output layer.

EVIC05 Sven F. Crone - www.bis-lab.com

Neural Network Architectures for Nonlinear AutoregressionInterpretationAutoregressive modeling AR(p)approach WITHOUT the moving average terms of errors nonlinear ARIMA Similar problems / shortcomings as standard AR-models!

n +1

n+ 2n+5 n +3 n+ 4

Extensionsmultiple output nodes = simultaneous autoregression models

yt +1 = tanh wkj tanh wki tanh w ji yt j j i k i k j Nonlinear autoregressive AR(p)-model

yt +1 = f ( yt , yt 1 , yt 2 ,..., yt n 1 )

EVIC05 Sven F. Crone - www.bis-lab.com

Neural Network Architectures for Multiple-Step-Ahead Nonlinear Autoregression. Interpretation: as the single autoregressive AR(p) model, but with multiple output nodes for the forecast horizons:

$y_{t+1}, y_{t+2}, \dots, y_{t+n} = f(y_t, y_{t-1}, y_{t-2}, \dots, y_{t-n+1})$: a nonlinear autoregressive AR(p) model.

Neural Network Architectures for Forecasting: Nonlinear Autoregression Intervention Model. Interpretation: as the single autoregressive AR(p) model, with an additional event term to explain external events (the event input appears in the slide's network diagram). Extensions: multiple output nodes = simultaneous multiple regression.

$y_{t+1}, y_{t+2}, \dots, y_{t+n} = f(y_t, y_{t-1}, y_{t-2}, \dots, y_{t-n+1})$: a nonlinear autoregressive ARX(p) model.

Neural Network Architecture for Linear Regression, e.g. demand explained by max temperature, rainfall and sunshine hours:

$y = f(x_1, x_2, x_3, \dots, x_n)$
$y = x_1 w_{1j} + x_2 w_{2j} + x_3 w_{3j} + \dots + x_n w_{nj} - \theta_j$: a linear regression model.

Neural Network Architectures for Nonlinear Regression (Logistic Regression), e.g. demand explained by max temperature, rainfall and sunshine hours:

$y = f(x_1, x_2, \dots, x_n)$, e.g. $y = \mathrm{logistic}\left(\sum_i x_i w_{ij} - \theta_j\right)$: a nonlinear multiple (logistic) regression model.

Neural Network Architectures for Nonlinear Regression. Interpretation: similar to linear multiple regression modelling. Without nonlinearity in the output node: a weighted expert regime on nonlinear regressions; with nonlinearity in the output layer: ???

$y = f(x_1, x_2, x_3, \dots, x_n)$: a nonlinear regression model.

Classification of Forecasting Methods (recap)

Objective Forecasting Methods
- Time series methods: averages, moving averages, naive methods; exponential smoothing (simple ES, linear ES, seasonal ES, dampened trend ES); simple regression, autoregression, ARIMA; neural networks
- Causal methods: linear regression, multiple regression, dynamic regression, vector autoregression, intervention models; neural networks

Subjective Forecasting Methods ("prophecy" = educated guessing): sales force composite, analogies, Delphi, PERT, survey techniques

Demand planning practice: objective methods + subjective correction.

Neural networks ARE time series methods AND causal methods & CAN be used as averages, ES and regression.

Different model classes of Neural Networks. Since the 1960s a variety of NN have been developed for different tasks: classification, optimisation, forecasting, and application-specific models. There are different CLASSES of neural networks for forecasting alone! The focus here is only on the original multilayer perceptrons.

Problem! The MLP is the most common NN architecture used. MLPs with a sliding window can ONLY capture nonlinear seasonal autoregressive processes nSAR(p,P). BUT: they can model an MA(q) process through an extended AR(p) window, and SARMAX processes through recurrent NN.

Agenda: Forecasting with Artificial Neural Networks
1. Forecasting?
2. Neural Networks?
3. Forecasting with Neural Networks
   1. NN models for Time Series & Dynamic Causal Prediction
   2. NN experiments
   3. Process of NN modelling
4. How to write a good Neural Network forecasting paper!

Time Series Prediction with Artificial Neural Networks: which time series patterns can ANNs learn & extrapolate [Pegels69/Gardner85]? ??? Simulation of neural network prediction of artificial time series.

Time Series Demonstration: Artificial Time Series. Simulation of NN in business forecasting with NeuroPredictor. Experiment: prediction of artificial time series (Gaussian noise): stationary, seasonal, linear trend, and trend with additive seasonality time series.

Time Series Prediction with Artificial Neural Networks: which time series patterns can ANNs learn & extrapolate [Pegels69/Gardner85]? Neural networks can forecast ALL major time series patterns: NO time-series-dependent preprocessing / integration is necessary, and NO time-series-dependent MODEL SELECTION is required: a SINGLE MODEL APPROACH IS FEASIBLE!

Time Series Demonstration A: Lynx Trappings. Simulation of NN in business forecasting. Experiment: lynx trappings at the McKenzie River; 3-layered NN (12-8-1), i.e. 12 input units, 8 hidden units, 1 output unit; different lag structures t, t-1, ..., t-11 (the past 12 observations); t+1 single-step-ahead forecast. Benchmark time series [Andrews / Herzberg]: 114 observations; periodicity? 8 years?

Time Series Demonstration B: Event Model. Simulation of NN in business forecasting. Experiment: mouthwash sales; 3-layered NN (12-8-1); 12 input lags t, t-1, ..., t-11 (the past 12 observations); time series prediction, t+1 single-step-ahead forecast. Spurious autocorrelations from marketing events: advertisements with a small lift, price reductions with a high lift.

Time Series Demonstration C: Supermarket Sales. Simulation of NN in business forecasting. Experiment: supermarket sales of fresh products with weather data; 4-layered NN (7-4-4-1), i.e. 7 input units, two hidden layers of 4 units, 1 output unit; different lag structures t, t-1, ..., t-7; t+4 forecast, single-step-ahead.

Agenda: Forecasting with Artificial Neural Networks

1. Forecasting?
2. Neural Networks?
3. Forecasting with Neural Networks
   1. NN models for Time Series & Dynamic Causal Prediction
   2. NN experiments
   3. Process of NN modelling
      1. Preprocessing
      2. Modelling NN Architecture
      3. Training
      4. Evaluation
4. How to write a good Neural Network forecasting paper!

EVIC05 Sven F. Crone - www.bis-lab.com

Decisions in Neural Network Modelling: the NN modelling process

Data pre-processing:
- Transformation
- Scaling: normalizing to [0;1] or [-1;1]

Modelling of NN architecture:
- Number of INPUT nodes
- Number of HIDDEN nodes
- Number of HIDDEN LAYERS
- Number of OUTPUT nodes
- Information processing in nodes (activation functions)
- Interconnection of nodes

Training:
- Initializing of weights (how often?)
- Training method (backprop, higher-order methods, ...)
- Training parameters
- Evaluation of the best model (early stopping)

Application of the Neural Network model / evaluation:
- Evaluation criteria & selected dataset

Manual decisions require expert knowledge!


EVIC05 Sven F. Crone - www.bis-lab.com

Modelling Degrees of Freedom: a variety of parameters must be pre-determined for ANN forecasting:

- D = Dataset: selection, sampling
- P = Preprocessing: correction, normalization, scaling
- A = Architecture: NI no. of input nodes, NS no. of hidden nodes, NL no. of hidden layers, NO no. of output nodes, K connectivity / weight matrix, activation strategy
- U = Signal processing: FI input function, FA activation function, FO output function
- L = Learning algorithm: choice of algorithm, PT,L learning parameters per phase & layer, initialization procedure, number of initializations, stopping method & parameters, O objective function

... with interactions & interdependencies between parameter choices!
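One way to make these degrees of freedom explicit in code is a configuration object. The field names below are illustrative, not the tutorial's notation, and the defaults are only plausible starting points.

```python
from dataclasses import dataclass

@dataclass
class MLPDesign:
    """One point in the D-P-A-U-L design space sketched above
    (names and defaults are illustrative assumptions)."""
    n_input: int = 12              # A: NI, no. of input nodes
    hidden: tuple = (8,)           # A: NS per layer; NL = len(hidden)
    n_output: int = 1              # A: NO, no. of output nodes
    activation: str = "tanh"       # U: FA, activation function
    scaling: str = "[-1;1]"        # P: scaling interval
    algorithm: str = "backprop"    # L: learning algorithm
    n_init: int = 10               # L: number of random initialisations
    stopping: str = "early"        # L: stopping method
```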

EVIC05 Sven F. Crone - www.bis-lab.com

Heuristics to Reduce Design Complexity: number of hidden nodes in MLPs (as a function of the number of input nodes n)
- 2n+1 [Lippmann87; Hecht-Nielsen90; Zhang/Patuwo/Hu98]
- 2n [Wong91]; n [Tang/Fishwick93]; n/2 [Kang91]
- 0.75n [Bailey90]; 1.5n to 3n [Kaastra/Boyd96]

Activation function and preprocessing:
- logistic in hidden & output layers [Tang/Fishwick93; Lachtermacher/Fuller95; Sharda/Patil92]
- hyperbolic tangent in hidden & output layers [Zhang/Hutchinson93; DeGroot/Wurtz91]
- linear output nodes [Lapedes/Farber87; Weigend89-91; Wong90]

... with interdependencies!
- no research on the relative performance of all alternatives
- no empirical results to support a preference for any single heuristic
- ADDITIONAL SELECTION PROBLEM of choosing a HEURISTIC
- INCREASED COMPLEXITY through interactions of heuristics
- AVOID the selection problem through EXHAUSTIVE ENUMERATION

EVIC05 Sven F. Crone - www.bis-lab.com

Tips & Tricks in Data Sampling: Dos and Don'ts
- Random order sampling? Yes!
- Sampling with replacement? Depends / try!
- Data splitting: ESSENTIAL!!! Training & validation sets for identification, parameterisation & selection; test set for ex ante evaluation (ideally from multiple ways / origins!). See the splitting sketch below.
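A minimal chronological splitting sketch in Python; the fractions and function name are illustrative assumptions.

```python
import numpy as np

def time_split(X, y, val_frac=0.2, test_frac=0.2):
    """Chronological train/validation/test split: no shuffling across
    the forecast origin, so the test set stays strictly ex ante."""
    n = len(X)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    n_train = n - n_val - n_test
    return (X[:n_train], y[:n_train],                              # identification & parameterisation
            X[n_train:n_train + n_val], y[n_train:n_train + n_val],  # model selection
            X[n - n_test:], y[n - n_test:])                        # ex ante evaluation
```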

Simulation Experiments


EVIC05 Sven F. Crone - www.bis-lab.com

Data Preprocessing: Data Transformation
- Verification, correction & editing (data entry errors etc.)
- Coding of variables
- Scaling of variables
- Selection of independent variables (PCA)
- Outlier removal
- Missing value imputation

Data Coding: binary coding of external events; n and n-1 binary coding have no significant impact on accuracy, though n-coding appears to be more robust (despite issues of multicollinearity)

Modification of Data to enhance accuracy & speed

EVIC05 Sven F. Crone - www.bis-lab.com

Data Preprocessing: Variable Scaling

[Scatter plots: X1 Hair Length vs. X2 Income, raw values (income 0 to 100,000) vs. values linearly scaled to [0;1]]

Linear interval scaling:
y = I_lower + (I_upper - I_lower) * (x - min(x)) / (max(x) - min(x))

Interval features, e.g. turnover [28.12; 70; 32; 25.05; 10.17], are scaled linearly to a target interval, e.g. [-1;1].
Example: x = 72, max(x) = 119.95, min(x) = 0, target [-1;1]:
y = -1 + (1 - (-1)) * (72 - 0) / (119.95 - 0) = -1 + 144/119.95 = 0.2005
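The same scaling as a small Python function, reproducing the worked example above. Passing min/max from the training data (so test data use the same mapping) is an added practical note, not from the slide.

```python
def interval_scale(x, lo=-1.0, hi=1.0, x_min=None, x_max=None):
    """Linear interval scaling y = lo + (hi - lo)*(x - min)/(max - min).
    Supply x_min/x_max from the training data to scale new data consistently."""
    x_min = min(x) if x_min is None else x_min
    x_max = max(x) if x_max is None else x_max
    return [lo + (hi - lo) * (v - x_min) / (x_max - x_min) for v in x]

# Worked example from the slide: x = 72, min = 0, max = 119.95, target [-1;1]
print(interval_scale([72], lo=-1, hi=1, x_min=0, x_max=119.95))  # [0.2005...]
```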

EVIC05 Sven F. Crone - www.bis-lab.com

Data Preprocessing: Variable Scaling (continued)

[Scatter plots: X1 Hair Length vs. X2 Income, raw values vs. standardised values]

Standardisation / normalisation (z-transform):
y = (x - mean(x)) / std(x)

Attention: interaction of the scaling interval with the activation function!
- Logistic: [0;1]
- TanH: [-1;1]


EVIC05 Sven F. Crone - www.bis-lab.com

Data Preprocessing: Outliers

Outliers: extreme values, coding errors, data errors; potential to bias the analysis.

Outlier impact on scaled variables: impact on linear interval scaling (no normalisation / standardisation).

[Figure: raw values 0 and 10 plus a single outlier at 253, mapped onto the scaled interval [-1;+1]; the regular values are compressed near -1]

Actions:
- Eliminate outliers (delete records)
- Replace / impute values as missing values
- Binning of the variable (= rescaling)
- Normalisation of the variable (= scaling)
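A small sketch (illustrative values, with the slide's outlier of 253) showing how a single outlier compresses all regular values under linear interval scaling:

```python
def interval_scale(x, lo=-1.0, hi=1.0):
    """Linear interval scaling using the observed min/max of x."""
    x_min, x_max = min(x), max(x)
    return [lo + (hi - lo) * (v - x_min) / (x_max - x_min) for v in x]

vals = [0, 2, 5, 10, 253]                    # 253 is a single outlier
print([round(v, 2) for v in interval_scale(vals)])
# [-1.0, -0.98, -0.96, -0.92, 1.0]: the regular values are squashed near -1
```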

EVIC05 Sven F. Crone - www.bis-lab.com

Data Preprocessing: Skewed Distributions

Asymmetry of observations. Two remedies:
- Transform the data: functional transformation of values, linearization or normalisation
- Rescale (DOWNSCALE) the data for better analysis: binning of data (grouping of values into buckets), yielding an ordinal scale!

EVIC05 Sven F. Crone - www.bis-lab.com

Data Preprocessing: Data Encoding

Downscaling & coding of variables:
- Metric variables: create bins/buckets of ordinal values (= BINNING), either as equally spaced intervals or as quantiles with equal frequencies
- Ordinal variable of n values: rescale to n or n-1 nominal binary variables
- Nominal variable of n values, e.g. {Business Press, Sports & Fun, Woman}: rescale to n or n-1 binary variables

Original coding: 0 = Business Press, 1 = Sports & Fun, 2 = Woman
1-of-N coding (3 new bit-variables): 1 0 0 = Business Press, 0 1 0 = Sports & Fun, 0 0 1 = Woman
1-of-(N-1) coding (2 new bit-variables): 1 0 = Business Press, 0 1 = Sports & Fun, 0 0 = Woman
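The two codings as a small Python helper (function name illustrative):

```python
def one_of_n(value, categories, drop_last=False):
    """1-of-N binary coding; drop_last=True gives the 1-of-(N-1) variant
    where the last category becomes the all-zero reference."""
    bits = [1 if value == c else 0 for c in categories]
    return bits[:-1] if drop_last else bits

cats = ["Business Press", "Sports & Fun", "Woman"]
print(one_of_n("Sports & Fun", cats))              # [0, 1, 0]
print(one_of_n("Woman", cats, drop_last=True))     # [0, 0]
```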


EVIC05 Sven F. Crone - www.bis-lab.com

Data Preprocessing: Impute Missing Values

Missing values: a feature value is missing for an instance. Caution: some methods interpret missing values as 0! Others create a special class for missing values.

[Scatter plot: X1 Hair Length vs. X2 Income with missing entries]

Solutions:
- Missing value on an interval scale: set to the mean, median, etc.
- Missing value on a nominal scale: set to the most frequent value of the feature
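A minimal imputation sketch assuming numeric columns (nominal categories represented as numeric codes); the function name and interface are illustrative.

```python
import numpy as np

def impute(column, nominal=False):
    """Replace NaNs by the mean (interval scale) or by the most
    frequent value (nominal scale), as suggested above."""
    col = np.asarray(column, dtype=float)
    mask = np.isnan(col)
    if nominal:
        vals, counts = np.unique(col[~mask], return_counts=True)
        fill = vals[np.argmax(counts)]      # most frequent value
    else:
        fill = col[~mask].mean()            # mean imputation
    col[mask] = fill
    return col

print(impute([1.0, np.nan, 3.0]))           # [1. 2. 3.]
```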

EVIC05 Sven F. Crone - www.bis-lab.com

Tips & Tricks in Data Pre-Processing: Dos and Don'ts
- De-seasonalisation? NO! (maybe you can try!)
- De-trending / integration? NO / depends / part of preprocessing!
- Normalisation? Not necessarily; do correct outliers!
- Scaling intervals [0;1] or [-1;1]? Both OK!
- Apply headroom in scaling? YES!
- Interaction between scaling & preprocessing? Limited

Simulation Experiments

EVIC05 Sven F. Crone - www.bis-lab.com

Outlier correction in Neural Network Forecasts? Outlier correction? YES! Neural networks are often characterized as:
- fault tolerant and robust
- showing graceful degradation regarding errors
But does fault tolerance imply outlier resistance in time series prediction?

Simulation Experiments


EVIC05 Sven F. Crone - www.bis-lab.com

Number of OUTPUT nodes: given by the problem domain!

Number of HIDDEN LAYERS: 1 or 2, depends on the information processing in the nodes
Number of HIDDEN nodes: also depends on the nonlinearity & continuity of the time series; trial & error, sorry!

Information processing in nodes (activation functions):
- Sig-Id, Sig-Sig (bounded & additional nonlinear layer)
- TanH-Id, TanH-TanH (bounded & additional nonlinear layer)

Interconnection of nodes: ???

EVIC05 Sven F. Crone - www.bis-lab.com

Tips & Tricks in Architecture Modelling: Dos and Don'ts
- Number of input nodes? DEPENDS! Use the linear ACF/PACF to start (see the sketch below)!
- Number of hidden nodes? DEPENDS! Evaluate each time (few)
- Number of output nodes? DEPENDS on the application!
- Fully or sparsely connected networks? ???
- Shortcut connections? ???
- Activation functions: logistic or hyperbolic tangent? TanH!!!
- Activation function in the output layer? TanH or Identity!
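A sketch of the ACF/PACF starting point using statsmodels; the 2/sqrt(n) significance bound is the usual large-sample approximation, and the series here is synthetic.

```python
import numpy as np
from statsmodels.tsa.stattools import acf, pacf

rng = np.random.default_rng(1)
series = np.sin(2 * np.pi * np.arange(120) / 12) + rng.normal(0, 0.3, 120)

# Lags with significant (P)ACF are candidate input lags for the MLP window
bound = 2 / np.sqrt(len(series))
print(np.where(np.abs(acf(series, nlags=24)) > bound)[0])    # ACF-significant lags
print(np.where(np.abs(pacf(series, nlags=24)) > bound)[0])   # PACF-significant lags
```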

Simulation Experiments

EVIC05 Sven F. Crone - www.bis-lab.com

Agenda: Forecasting with Artificial Neural Networks

1. Forecasting?
2. Neural Networks?
3. Forecasting with Neural Networks
   1. NN models for Time Series & Dynamic Causal Prediction
   2. NN experiments
   3. Process of NN modelling
      1. Preprocessing
      2. Modelling NN Architecture
      3. Training
      4. Evaluation & Selection
4. How to write a good Neural Network forecasting paper!


EVIC05 Sven F. Crone - www.bis-lab.com

Tips & Tricks in Network Training: Dos and Don'ts
- Initialisations? A MUST! Minimum 5-10 times!!! (A selection sketch follows below.)

[Chart: lowest and highest error (MAE, axis 0 to 40) across 1, 5, 10, 25, 50, 100, 200, 400, 800 and 1600 random initialisations]
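A hedged sketch of the multiple-initialisation procedure using scikit-learn; the tutorial used other software, and the seeds, network size and k are illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error

def best_of_k(X_tr, y_tr, X_val, y_val, k=10):
    """Train k networks from different random initialisations and keep
    the one with the lowest validation MAE (cf. the chart above)."""
    best, best_mae = None, np.inf
    for seed in range(k):
        net = MLPRegressor(hidden_layer_sizes=(8,), activation='tanh',
                           max_iter=1000, random_state=seed)
        net.fit(X_tr, y_tr)
        mae = mean_absolute_error(y_val, net.predict(X_val))
        if mae < best_mae:
            best, best_mae = net, mae
    return best, best_mae
```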

Simulation Experiments

EVIC05 Sven F. Crone - www.bis-lab.com

Tips & Tricks in Network Training & Selection: Dos and Don'ts
- Initialisations? A MUST! Minimum 5-10 times!!!
- Selection of training algorithm? Backprop OK, DBD OK; not higher-order methods!
- Parameterisation of the training algorithm? DEPENDS on the dataset!
- Use of early stopping? YES, but careful with the stopping criteria!
- Suitable backpropagation training parameters (to start with): learning rate 0.5 (always
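The slide breaks off here. As a hedged illustration of the early-stopping advice, this sketch uses scikit-learn's MLPRegressor; the parameter names are sklearn's, not the tutorial's, and learning_rate_init=0.5 only mirrors the suggested starting value (many libraries default much lower).

```python
from sklearn.neural_network import MLPRegressor

# Early stopping as a guard against overfitting: hold out a validation
# fraction and stop when its score stops improving for n_iter_no_change epochs.
net = MLPRegressor(hidden_layer_sizes=(8,), activation='tanh',
                   solver='sgd', learning_rate_init=0.5,
                   early_stopping=True, validation_fraction=0.2,
                   n_iter_no_change=25, max_iter=2000, random_state=0)
```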