OSM Lecture and Exercises (2011/10/13)
Akito Sakurai, Keio University
Purposes of my L&E
• To learn the basics of data analysis
• Through exercises
Note on Exercises
• Work through the problems (in fact they are not problems but just exercises) and hand in your solutions
• Follow the procedures described in the slides and report the results together with your discussion
  – Experiment 1. Character recognition
  – Experiment 2. Classification of songs in Japanese
  – Experiment 3. Prediction of the USD/JPY rate
URL: http://www.sakurai.comp.ae.keio.ac.jp/2010OSMLandE.html
Let us start.
Predictions
• Humans have always wanted to predict
• Predicting the seasons is one example
  – since people did not have a calendar
  – for agriculture, they needed to know the best season to plant seeds
    • the temperature observed at any one time is not a reliable guide
• Astronomical observation was therefore very important
  – One who was able to observe could be a ruler
Difficult to predict
• The motions of celestial bodies are predicted fairly well by models that do not really represent reality
  – They are among the phenomena whose motions simple mathematical or physical rules explain (to the accuracy needed at the time)
• But in many other cases a phenomenon has a very complex background
  – too complex to predict
• In many cases the phenomena are probabilistic
  – More observations do not necessarily increase the chance of a correct prediction
But we do predict
• We humans nevertheless try to predict
• Even if reality is complex, each event happens to be (relatively) simple
  – Or it may be that we merely believe it to be simple
  – Economic forecasting is a typical example
  – Professionals forecast with a great deal of data and with their expertise, but, say, politicians make predictions too
Let us think it over
• Prediction or forecasting is to say something in advance
• But in essence it is based on
  – instances, which are pairs of a (first) set of values (e.g. the position of a typhoon on a day) and a (second) set of values (e.g. the position of the typhoon on the next day)
  – the position of a typhoon today (the first value)
  to infer
  – the position of the typhoon tomorrow (the second value)
that is,
[Figure: a known set of observations, shown as two long digit strings forming pairs, and a new observation (one half of a pair), 28383169, whose partner ??? is to be predicted]
Note: data
• strings of numerals
• strings of characters (words); linguistic expressions
• sets of photographs
• sets of paintings
• sets of representations of behaviors
• sets of representations of sounds
Predictions: two types
[Figure: known pairs of digit strings, x = 01745329, 03490659, 05235988, 06981317, 08726646, 10471976, 12217305, 13962634 and y = 1736482, 3420201, 5000000, 6427876, 7660444, 8660254, 9396926, 9848078; new observation 06981428 → ???]
• Find seemingly similar ones
• Find the rules (ground truth)
[Figure: x → x/10000000 → sin() → ∗10000000 → y]
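The two types of prediction can be tried on the digit data of this slide. The following is only a sketch: the split of the digit strings into eight x values and eight y values is my reading of the figure, and the function names are made up for illustration.

```python
import math

# Known pairs (x, y) read off the slide, split into fixed-width chunks.
# The chunking itself is an assumption about how the digit strings were laid out.
xs = [1745329, 3490659, 5235988, 6981317, 8726646, 10471976, 12217305, 13962634]
ys = [1736482, 3420201, 5000000, 6427876, 7660444, 8660254, 9396926, 9848078]

SCALE = 10_000_000


def predict_by_similarity(x_new):
    """Type 1: return the y of the known x that is closest to x_new."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x_new))
    return ys[nearest]


def predict_by_rule(x_new):
    """Type 2: apply the rule from the slide, y = sin(x / 10^7) * 10^7."""
    return round(math.sin(x_new / SCALE) * SCALE)


x_new = 6981428  # the new observation "06981428"
print(predict_by_similarity(x_new))
print(predict_by_rule(x_new))
```

The similarity method can only return a y that was already observed, whereas the rule gives a new value for a new x. This is why knowing the rule (the structure) is worth more, when it can be found.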
"structure" of data
Structures?
[Figure: CCG derivation tree for "Dr. Jekyll saw and Mr. Hyde ate a lemon ...", with lexical categories such as N/N, (S[dcl]\NP)/NP, conj, and NP[nb]/N. Source: J. Curran and S. Clark. C&C tools.]
[Figure: scatter plot of width (14 to 22) against lightness (2 to 10) for salmon and sea bass. Source: R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification]
[Figure: USD/JPY returns in four minutes (unit 0.01%, −100 to 100) against frequency (logarithmic axis, 0.00001 to 1), 2001-2008]
Structure 1.
• For numerals (the simplest and driest case)
  – The data themselves (structure?) representing normal and abnormal samples
  – Stimuli that cause changes in a time series
• For characters or strings of characters (intuitively, there seems to be some structure)
  – Habits in choosing characters (in Japanese: kanji, hiragana, or katakana; or smileys)
  – Habits in choosing words
• For pictures (photographs, paintings, etc.)
  – The objects themselves (depending on intention/objective)
  – Composition, how to draw, how to take photos
http://cert.yahoo.co.jp/text/digicame/chap2/c2_0302.html
Structure 2.
• For sounds (music, voices, the singing of insects, etc.)
  – Sources (musical instruments; sex, age, and health condition of a speaker; kinds of insects)
  – (for music) genre, players, etc.
• For behaviors
  – (shopping) purpose, for whom, for what, where, etc.
  – (web browsing) purpose, by what, for what, etc.
[Figure: S&P500 index, 1950 to 2007; vertical axis 0 to 1800]
If we could know structure
• we could find the best action to take
  – Health condition: we may go to a doctor
  – Time series: sell, buy, or wait
  – Literature: similar novels or different novels
  – Trends found on the web: go with or against the tide
  – In general, when we can predict, we can optimize our actions in some sense
How could we do it?
• In computer science, ways of doing this have long been studied in the field of "machine learning"
  – A research field within "artificial intelligence"
  – Why do machines "learn"?
    • Human learning is to understand, to memorize, and to use what was learned afterwards
    • The point is to "understand"
    • Knowing the structures in data is the first step towards understanding
  – Why "machines"?
    • "Machines" here are computers, i.e. computing machines
  – Why not robots?
    • Robots need to learn, but learning is necessary in other machines too
  – Different from humans?
    • Not as intelligent as humans. Computers do not know the "real world".
    • But computers never complain about the huge amount of data that they are processing
How could we do it? (cont.)
• Many algorithms have been developed
• There are too many to know them all
  – Therefore I will not explain them
• In the following experiments, please allow me to give you the chance simply to use a tool
  – You may later have access to far better (and more expensive) tools or programming environments
Statistics?
• It was a completely different discipline 20 years ago
• Now is the time for merging
• Still, there are differences
• Statistics:
  – Simpler (statistical) structures compared to machine learning. Statistical tests are important. Numerical data comes first.
• Machine learning:
  – Any structures and models. Not concerned with statistical tests. In many cases there is less data than statistical tests require. Discrete or symbolic data are in daily use.
http://wwwcsteep.bc.edu/TIMSS1/database.html (calculus)
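[Figure: bell-shaped distribution curve; horizontal axis −3 to 3, vertical axis 0.1 to 0.4]
[Figure: decision tree for contact lenses. Nodes: tear production rate (reduced / normal), astigmatism (no / yes), spectacle prescription (myope / hypermetrope). Leaf labels as extracted: none, soft, hard, none]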
Data mining?
• No difference
• The same researchers work in both fields
• If there were differences:
  – Machine learning: emphasis on algorithms
    • Accuracy, speed, lightness, breadth of applications, representation
  – Data mining: huge amounts of data
    • Being able to take action is important. Working on specific data is OK
Be a bit more concrete
• Simple applications
Examples of data
Make | Model | Year | Head inj. c. | Chest decel. | L. Leg | R. Leg | D/P | Protection | Doors | Weight | Size
Acura | Integra | 87 | 599 | 35 | 791 | 262 | Driver | manual belts | 2 | 2350 | lt
Acura | Integra RS | 90 | 585 | . | 1545 | 1301 | Driver | Motorized belts | 4 | 2490 | lt
Acura | Legend LS | 88 | 435 | 50 | 926 | 708 | Driver | d airbag | 4 | 3280 | med
Audi | 80 | 89 | 600 | 49 | 168 | 1871 | Driver | manual belts | 4 | 2790 | comp
Audi | 100 | 89 | 185 | 35 | 998 | 894 | Driver | d airbag | 4 | 3100 | med
BMW | 325i | 90 | 1036 | 56 | 865 | . | Driver | d airbag | 2 | 2862 | comp
Buick | Century | 91 | 815 | 47 | 1340 | 315 | Driver | passive belts | 4 | 2992 | comp
Buick | Elect. Park | 88 | 1467 | 54 | 712 | 1366 | Driver | manual belts | 4 | 3360 | med
Buick | Le Sabre | 90 | . | 35 | 1049 | 908 | Driver | passive belts | 2 | 3240 | med
Buick | Regal | 88 | 880 | 50 | 996 | 642 | Driver | passive belts | 2 | 3210 | med
Cadillac | De Ville | 90 | 423 | 39 | 541 | 1629 | Driver | d airbag | 4 | 3500 | hev
Sepal length | Sepal width | Petal length | Petal width | Species
5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa
4.9 | 3 | 1.4 | 0.2 | Iris-setosa
4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa
4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa
5 | 3.6 | 1.4 | 0.2 | Iris-setosa
5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa
4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa
5 | 3.4 | 1.5 | 0.2 | Iris-setosa
4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa
4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa
5.4 | 3.7 | 1.5 | 0.2 | Iris-setosa
4.8 | 3.4 | 1.6 | 0.2 | Iris-setosa
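Day no. | Day of week | Room temp. | Alcohol on the previous evening | Blood pressure LOW (mmHg) | Blood pressure HIGH (mmHg)
0 | Tue | 18 | none | 107 | 153
1 | Wed | 20 | a little | 78 | 132
2 | Thu | 20 | a little | 92 | 133
3 | Fri | 20 | a little | 87 | 130
5 | Sun | 20 | a little | 86 | 134
6 | Mon | 20 | moderate | 90 | 134
7 | Tue | 18 | a little | 87 | 134
8 | Wed | 18 | a little | 104 | 149
9 | Thu | 20 | a little | 83 | 130
10 | Fri | 20 | moderate | 94 | 131
11 | Sat | 20 | a little | 81 | 137
12 | Sun | 20 | a little | 98 | 137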
Date | Open | High | Low | Close | Volume | Adj. Cl | YearWeek
27/11/2000 | 53.6875 | 54.5156 | 51.0312 | 51.25 | 40198100 | 51.250 | 200049
28/11/2000 | 51.9375 | 53.1875 | 50.625 | 51 | 52037000 | 51.000 | 200049
29/11/2000 | 51.3125 | 53 | 50.3125 | 51.6875 | 55316000 | 51.688 | 200049
30/11/2000 | 50.1875 | 50.9375 | 45.1875 | 47.875 | 10840500 | 47.875 | 200049
01/12/2000 | 49.1875 | 51.625 | 47.25 | 48.5 | 70468000 | 48.500 | 200049
04/12/2000 | 49.0625 | 49.5625 | 45 | 45.8125 | 9501200 | 45.813 | 200050
05/12/2000 | 47.75 | 52.125 | 47.3125 | 52.125 | 90848900 | 52.125 | 200050
06/12/2000 | 52 | 53.5625 | 51.2656 | 51.4375 | 71419200 | 51.438 | 200050
07/12/2000 | 50.3125 | 51 | 49 | 49.9375 | 46448400 | 49.938 | 200050
08/12/2000 | 51.9375 | 53.25 | 51 | 52.375 | 55400200 | 52.375 | 200050
11/12/2000 | 52.875 | 55.75 | 52.625 | 54.8125 | 78621500 | 54.813 | 200051
12/12/2000 | 54.75 | 55.125 | 53.3125 | 54.375 | 39485300 | 54.375 | 200051
13/12/2000 | 55.1875 | 55.25 | 50.8125 | 51.125 | 54330600 | 51.125 | 200051
14/12/2000 | 51.0625 | 52.5625 | 50.875 | 50.9375 | 46244400 | 50.938 | 200051
15/12/2000 | 50.0625 | 50.1875 | 47.125 | 48.1719 | 100237900 | 48.172 | 200051
18/12/2000 | 49 | 50.125 | 42.3125 | 42.9375 | 126032400 | 42.938 | 200052
19/12/2000 | 43 | 46 | 41.5 | 41.75 | 99018800 | 41.750 | 200052
Statistical representation
• No single data point shows regularity, whereas the data set as a whole does
  – Uniform distribution
    • Fair dice, fair coin tossing
  – Normal distribution
    • Composition of many independent factors
  – Zipf's law, 80/20 rule, 1/f fluctuations
Normal distribution
• Sum of many independent factors
  – Distribution of the number of heads in 100 coin tosses, observed over 1000 experiments
  – (not correct, in fact) distribution of exam scores
From http://wwwcsteep.bc.edu/TIMSS1/database.html (calculus)
[Figure: standard normal density; horizontal axis −3 to 3, vertical axis 0.1 to 0.4]
Plot[(1/Sqrt[2 Pi]) Exp[-x^2/2], {x, -3, 3}]
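The coin tossing example is easy to try by simulation. The following is a small sketch using only the Python standard library; the seed and the text histogram are my own choices for illustration.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so that the run is repeatable

N_EXPERIMENTS = 1000
N_TOSSES = 100

# Number of heads in each experiment of 100 fair coin tosses
heads = [sum(random.random() < 0.5 for _ in range(N_TOSSES))
         for _ in range(N_EXPERIMENTS)]

mean = sum(heads) / N_EXPERIMENTS
var = sum((h - mean) ** 2 for h in heads) / N_EXPERIMENTS
print(f"mean = {mean:.2f}, standard deviation = {var ** 0.5:.2f}")

# Crude text histogram: one '#' per 5 experiments
counts = Counter(heads)
for k in range(min(counts), max(counts) + 1):
    print(f"{k:3d} {'#' * (counts[k] // 5)}")
```

In theory the number of heads has mean 50 and standard deviation 5, and the histogram should look roughly bell-shaped around 50.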
Zipf's law
• The frequency of words plotted against their rank obeys a power law. Similar laws are seen in
  – Population ranks of cities
  – Hit ranks of web pages
  – Income rankings
  – Property rankings
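A power law appears as a straight line when the logarithm of the frequency is plotted against the logarithm of the rank. The sketch below builds a rank and frequency table and estimates the slope of that line. The corpus is artificial (made up so that the word of rank r occurs 1200/r times), and the function names are only illustrative.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent item first."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(counts, start=1))


def loglog_slope(pairs):
    """Least-squares slope of log(frequency) against log(rank)."""
    pts = [(math.log(r), math.log(f)) for r, f in pairs]
    n = len(pts)
    mx = sum(p[0] for p in pts) / n
    my = sum(p[1] for p in pts) / n
    sxy = sum((px - mx) * (py - my) for px, py in pts)
    sxx = sum((px - mx) ** 2 for px, _ in pts)
    return sxy / sxx


# Artificial corpus: the word of rank r occurs 1200 // r times (hypothetical data)
tokens = []
for r in range(1, 7):
    tokens += [f"w{r}"] * (1200 // r)

pairs = rank_frequency(tokens)
print(pairs)
print(loglog_slope(pairs))
```

For this ideal corpus the slope is −1. With real text the slope is only approximately constant, and the fit is usually worst at the very top and the very bottom of the ranking.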
Rules in machine learning
• Conditional statements
  – IF … THEN …
  – with confidence ??%
• Decision trees
  – In the following slides
• Neural networks
• and MANY others
If-then rule
If tear-prod-rate=reduced then contact-lenses=none
If age=young and astigmatism=no and tear-prod-rate=normal then contact-lenses=soft
If age=pre-presbyopic and astigmatism=no and tear-prod-rate=normal then contact-lenses=soft
If age=presbyopic and spectacle-prescrip=myope and astigmatism=no then contact-lenses=none
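The four rules above can be written directly as a small program. This is only a sketch: the function name is made up, the rules are tried in the order shown, and cases that no rule covers return None.

```python
def classify(age, spectacle_prescrip, astigmatism, tear_prod_rate):
    """Apply the four if-then rules in order; return None if no rule fires."""
    if tear_prod_rate == "reduced":
        return "none"
    if age == "young" and astigmatism == "no" and tear_prod_rate == "normal":
        return "soft"
    if age == "pre-presbyopic" and astigmatism == "no" and tear_prod_rate == "normal":
        return "soft"
    if age == "presbyopic" and spectacle_prescrip == "myope" and astigmatism == "no":
        return "none"
    return None  # the four rules shown do not cover every case


print(classify("young", "myope", "no", "normal"))
print(classify("young", "myope", "yes", "normal"))
```

The first call gives soft. The second gives None, since none of the four rules mentions astigmatism=yes together with a normal tear production rate; more rules would be needed to cover the whole table.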
age | spectacle-prescrip | astigmatism | tear-prod-rate | contact-lenses
young | myope | no | reduced | none
young | myope | no | normal | soft
young | myope | yes | reduced | none
young | myope | yes | normal | hard
young | hypermetrope | no | reduced | none
young | hypermetrope | no | normal | soft
young | hypermetrope | yes | reduced | none
young | hypermetrope | yes | normal | hard
pre-presbyopic | myope | no | reduced | none
pre-presbyopic | myope | no | normal | soft
pre-presbyopic | myope | yes | reduced | none
pre-presbyopic | myope | yes | normal | hard
pre-presbyopic | hypermetrope | no | reduced | none
pre-presbyopic | hypermetrope | no | normal | soft
pre-presbyopic | hypermetrope | yes | reduced | none
pre-presbyopic | hypermetrope | yes | normal | none
presbyopic | myope | no | reduced | none
presbyopic | myope | no | normal | none
presbyopic | myope | yes | reduced | none
presbyopic | myope | yes | normal | hard
presbyopic | hypermetrope | no | reduced | none
presbyopic | hypermetrope | no | normal | soft
presbyopic | hypermetrope | yes | reduced | none
presbyopic | hypermetrope | yes | normal | none
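Decision tree
[Figure: decision tree. Root node tear-prod-rate with branches reduced and normal; next node astigmatism with branches no and yes; next node spectacle-prescrip with branches myope and hypermetrope. Leaf labels as extracted: none, soft, hard, soft]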
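Neural networks
[Figure: a single unit with inputs x_1, x_2, …, x_n, weights w_1, w_2, …, w_n, a summation node ∑, and a sigmoid σ producing the output o]
$$o = \sigma\left(\sum_{i=0}^{n} w_i x_i\right), \qquad \sigma(x) \equiv \frac{1}{1 + e^{-x}}$$
Composition of many functions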
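One unit of a neural network, and the composition of units, can be sketched in a few lines. The code assumes the common convention that the weight with index 0 is paired with a constant input of 1; the function names are made up for illustration.

```python
import math


def sigmoid(x):
    """sigma(x) = 1 / (1 + exp(-x))"""
    return 1.0 / (1.0 + math.exp(-x))


def unit_output(weights, inputs):
    """Output of one unit: sigmoid of the weighted sum of its inputs.

    weights[0] is paired with a constant input x_0 = 1 (a common convention,
    assumed here), so len(weights) == len(inputs) + 1.
    """
    xs = [1.0] + list(inputs)
    net = sum(w * x for w, x in zip(weights, xs))
    return sigmoid(net)


def two_layer(hidden_weights, output_weights, inputs):
    """Compose units: the hidden outputs become the inputs of the output unit."""
    hidden = [unit_output(w, inputs) for w in hidden_weights]
    return unit_output(output_weights, hidden)


print(unit_output([0.0, 0.0, 0.0], [3.0, -2.0]))
```

With all weights equal to zero the weighted sum is 0 and the unit outputs 0.5, whatever the inputs are. Learning means adjusting the weights so that the outputs fit the data.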
Representation of results
• A more important issue in data mining
• In many cases, we have to represent the result and communicate it to others (including ourselves)
  – This does not apply when only a prediction is requested
  – Nor when, as a general rule, only the results matter
  – But in many cases an explanation (why the prediction was deduced) is asked for
  – Comprehensible explanations are requested
• Quite unfortunately, accuracy and comprehensibility of results are in a trade-off relation
• In machine learning, you can choose either one (but of course not both), e.g.
  – Decision tree: comprehensible
  – SVM (support vector machine) and NN: accuracy and flexibility
Summary
• Objects
  – Numbers, and anything that can be represented by numbers
    • e.g. words, images, music, numbers
• Objective and approach
  – Prediction, inference; a knowledge system that learns
    • Find rules that humans cannot describe explicitly (rules for prediction and/or inference)
  – Separate objects from noise, and describe them
    • Noise: random, without correlation
    • Objects: those that have structure
• Instruments and tools
  – Statistics, artificial intelligence, and machine learning!!
Note
• There are many good tools. In practice, what matters is selecting good features
• A "feature" is something obtained by observing the object, or calculated from the observations, and used to represent the object
  – (maybe) useless features:
    • Height for diagnosing influenza
    • Cloud cover on a day for predicting stock prices (some say that weather affects stock prices, though)
• Features, not techniques
  – Success or failure is almost entirely decided by the features
  – Features: values obtained or calculated from observations of objects
    • In general, some combination of features decides the class of the objects we are considering
Experiments
• Weka as a machine learning tool
  – See the separate slides
• Experiment 1: character recognition
  – Just numerals, to gain experience
• Experiment 2: genre inference from song lyrics (sorry, this one is based on Japanese kana)
  – It is not difficult to get data if you know Japanese
• Experiment 3: prediction of USD/JPY
Experiment 1: Character recognition
• Works very well in the real world
  – Reading license plates at highway tollgates
    • N-system in Japan (from Wikipedia)
  – Reading ZIP codes on envelopes. These days, handwritten addresses and names are read, too.
An application
Hitachi information and control
Applications
• Restaurant menu reader
• Universal access
http://tabelog.com/imgview/original?id=r173888580388
http://www.afb.org/afbpress/pub.asp?DocID=aw070605
http://ameblo.jp/20dai-makoto/day-20110624.html
http://www.thepotteries.org/walks/fenton1/7.htm
CR is difficult
• Humans are very good at reading characters, unless the letters are too distorted
• Yet it is very difficult to explain how to read handwritten Japanese hiragana
  – If they are calligraphic, many of us cannot read them
• Why is it difficult to read?
• It is quite difficult for us to study, or for teachers to teach us,
• because no one (so far) has been able to write down the rules
• That does not necessarily mean that there are no rules
[Figure: a waka in calligraphic hiragana: 霞たち木の芽もはるの雪降れば花なきさとも花ぞちりける]
Data for C.R.
• A simple character recognition task
  – Only numerals
• Preprocessing has been done (preprocessing is far more difficult than recognition)
  – Separation (of characters from other characters)
  – Normalization (size, slant, center, etc.)
• But it still seems to be difficult
  – Suppose that we have to explain how to tell numerals apart to people who do not know numerals
• In essence, we do not know the rules even though we behave as if we knew them
  – We could induce "rules" from data
• Data source: UCI Machine Learning Repository, Optical Recognition of Handwritten Digits Data Set
Preprocessing
Giorgos Vamvakas
UCI Machine Learning Repository
Examples of characters in the data
Data format is simple
• 8 pixels × 8 pixels
• Every pixel has a value from 0 (white) to 16 (black)
• Numerals 0 … 9
Look into optdigits.tes.csv and verify the format yourself
Data format is simple
4. Relevant Information:
We used preprocessing programs made available by NIST to extract normalized bitmaps of handwritten digits from a preprinted form. From a total of 43 people, 30 contributed to the training set and different 13 to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of 4x4 and the number of on pixels are counted in each block. This generates an input matrix of 8x8 where each element is an integer in the range 0..16. This reduces dimensionality and gives invariance to small distortions.
5. Number of Instances
optdigits.tra Training 3823
optdigits.tes Testing 1797
6. Number of Attributes
64 input+1 class attribute
Read it in optdigits.names.txt yourself
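The reduction from a 32x32 bitmap to an 8x8 matrix described in the quoted text can be sketched as follows. The function name and the example bitmap are made up for illustration; the data set itself was produced with the NIST preprocessing programs, not with this code.

```python
def block_counts(bitmap, block=4):
    """Count the 'on' pixels in each non-overlapping block x block square."""
    size = len(bitmap)
    out = []
    for r in range(0, size, block):
        row = []
        for c in range(0, size, block):
            row.append(sum(bitmap[r + i][c + j]
                           for i in range(block) for j in range(block)))
        out.append(row)
    return out


# Hypothetical 32x32 bitmap: a filled 6x6 square in the top-left corner
bitmap = [[1 if r < 6 and c < 6 else 0 for c in range(32)] for r in range(32)]
features = block_counts(bitmap)
print(len(features), len(features[0]))
print(features[0][:3], features[1][:3])
```

Each of the 64 values is between 0 and 16. Moving a stroke by one pixel changes the counts only a little, which is the invariance to small distortions mentioned in the text.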
Procedures
• Get the data (I uploaded it to the lecture web site)
  – The features are just pixel brightness values (the worst choice!)
• Create arff files (you already have csv files, so all you have to do is add headers). Use a text editor such as Notepad, for example. Take care when you save the file.
@relation OptDigitsTraining
@attribute 00 real
@attribute 01 real
……
@attribute 06 real
@attribute 07 real
@attribute 10 real
@attribute 11 real
@attribute 12 real
……
……
@attribute 76 real
@attribute 77 real
@attribute class {0,1,2,3,4,5,6,7,8,9}
@data
Here data come.
[Annotations: the header is at your disposal. The data are at your disposal, too (8 x 8 = 64 values per row): optdigits.tra.csv]
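If you prefer not to type the 64 attribute lines by hand, the header can be generated. This is a sketch under the assumption that attributes are named by row and column digits as in the header above; the function names and the output file name are my own.

```python
def arff_header(relation="OptDigitsTraining", side=8):
    """Build the ARFF header lines: one real attribute per pixel plus the class."""
    lines = [f"@relation {relation}"]
    for row in range(side):
        for col in range(side):
            lines.append(f"@attribute {row}{col} real")
    lines.append("@attribute class {0,1,2,3,4,5,6,7,8,9}")
    lines.append("@data")
    return lines


def csv_to_arff(csv_path, arff_path, relation="OptDigitsTraining"):
    """Write the header followed by the unchanged CSV rows."""
    with open(csv_path, encoding="ascii") as src, \
         open(arff_path, "w", encoding="ascii", newline="\n") as dst:
        dst.write("\n".join(arff_header(relation)) + "\n")
        for line in src:
            if line.strip():
                dst.write(line.strip() + "\n")


header = arff_header()
print(len(header))
# Example call (input name as on the slides; the output name is an assumption):
# csv_to_arff("optdigits.tra.csv", "optdigits.tra.arff")
```

The header has 67 lines: the relation line, 64 pixel attributes, the class attribute, and the @data line.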
Procedures: learning and test
• Create a decision tree
  – Weka: J48 under "Trees"
• Get the accuracy by "10-fold cross validation"
• Get the accuracy when the separate test data set optdigits.tes is used
• Look at the decision tree. Is there any meaning in it? I think you will not be able to find any. Why?
• Try other tools
  – neural network
  – SMO (one of many implementations of support vector machines)
  – naïve Bayes
• Compare accuracies and run times
Test data in images: test_images.zip, train_images.zip
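What "10-fold cross validation" computes can be sketched as follows. This is not Weka's implementation (Weka also stratifies the folds by class, which this sketch does not); the learner is a toy that always predicts the majority class, chosen only to keep the example short.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(xs, ys, train, predict, k=10):
    """Average accuracy over k folds; each fold is held out once for testing."""
    folds = k_fold_indices(len(xs), k)
    accuracies = []
    for held_out in folds:
        test = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in test]
        train_y = [y for i, y in enumerate(ys) if i not in test]
        model = train(train_x, train_y)
        hits = sum(predict(model, xs[i]) == ys[i] for i in held_out)
        accuracies.append(hits / len(held_out))
    return sum(accuracies) / k


# Toy learner (an assumption for illustration): always predict the majority class
def train_majority(train_x, train_y):
    return max(set(train_y), key=train_y.count)


def predict_majority(model, x):
    return model


xs = list(range(100))
ys = [0] * 70 + [1] * 30
print(cross_validate(xs, ys, train_majority, predict_majority))
```

Every instance is used for testing exactly once and is never in the training part of its own fold. Testing on the separate file optdigits.tes is a different check: there the model is trained on all the training data and tested on data written by different people.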
How to choose other methods
[Screenshot annotations: click on + of "functions" (this one is neural networks; SMO is an SVM); click on + of "Bayes" (naïve Bayes)]
Experiment 2: classification of lyrics
• The words used in lyrics may differ between songs of different genres
• The same is true of waka or tanka (thirty-one-syllable verse)
  – Prof. Shizuo Mizutani (founder of the Mathematical Linguistics Society of Japan, founded in 1957) analyzed tanka of the Shirakaba-ha and Araragi-ha groups and succeeded in assigning the tanka to their authors' group based on the words used
• Since it is a bit difficult to classify lyrics based on words (difficult for me to prepare), I set up an environment to classify them based on syllables or morae
  – I should say that in fact I used the Japanese syllabary
Data used
• Ten children's songs and ten J-POP songs, written in Japanese syllables
• Frequencies are used, not the songs themselves
• Since the order of the character codes and the order of the Japanese syllabary do not coincide, we have to rearrange them, which is a bit cumbersome
  – "A I U E O" must be grouped, etc.
  – I made an Excel file for you
• The frequencies are normalized so that they sum to 1, because the length (the number of syllables) differs from song to song
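The frequency calculation that the Excel file performs can be sketched as follows. Only the first ten syllables of the syllabary are listed, and the "lyrics" are a made-up string, not a real song; the function name is hypothetical.

```python
from collections import Counter

# A short slice of the syllabary order (the full table is not reproduced here)
SYLLABARY = list("あいうえおかきくけこ")


def normalized_frequencies(lyrics, syllabary=SYLLABARY):
    """Relative frequency of each syllable, in syllabary order, summing to 1."""
    counts = Counter(ch for ch in lyrics if ch in syllabary)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(syllabary)
    return [counts[s] / total for s in syllabary]


lyrics = "あかいあかいかきくけこ"  # made-up text, not a real song
freq = normalized_frequencies(lyrics)
for s, f in zip(SYLLABARY, freq):
    print(s, round(f, 3))
```

Listing the result in syllabary order, rather than in character code order, is what makes the feature vectors of different songs comparable column by column.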
Procedures: data preparation 1
• In column A, the syllables of a song are written in order. Column B holds a code that transforms each syllable in A into a corresponding value. Column C is filled with the "あいうえお" entry corresponding to each syllable in A.
• To draw a histogram, we calculate frequencies. Columns E and F are the ones used to generate the histogram.
• In the Excel menu, select "tool → data analysis → histogram" to get the histogram.
• Be sure to count "あいうえお" when making the histogram.
• The normalized frequencies will appear here.
Procedures: data preparation 2
• Paste the links (done already)
• Sheet "Summary"
Procedures: data preparation 3
1. Paste only the values, with rows and columns exchanged (transposed). You may paste them into another sheet.
2. Save it as a CSV file.
Procedures: data preparation 4
• Prepare the data for ten children's songs and ten J-POP songs
• Transform it into an arff file
  – Insert a header like the one on the right
  – At the right end of each data row (one song per row, 20 rows in total), put ",0" for a children's song and ",1" for a J-POP song
This has been done: Children.xls, J-POP.xls
Procedures: Experiment and test
• Use Weka
  – J48, SMO, naïve Bayes, and others
• Is the accuracy high?
• Is the decision tree obtained meaningful?
• Next, please
  – add 10 children's songs or J-POP songs to them, or
  – prepare at least ten lyrics in a new genre that you prefer
  and find out something
If you do not know Japanese songs (and I am sure you do not), please skip this problem
Experiment 3: Prediction of USD/JPY
• FX: foreign exchange market
  – A financial market for trading currencies, where the relative values of different currencies are determined
• Is it possible to get positive returns in FX?
  – The spread is small, so it differs from a (government-run) lottery
  – But it is a typical gamble. It is very close to a zero-sum game (not exactly). There are winners and losers. The losers far outnumber the winners (80-20 rule or power law)
• Price movement must be a random walk
  – i.e. it must be unpredictable
• Well, we need to find out whether this is really so
Data
• Selling and buying US dollars (USD) in Japanese yen (JPY)
• The unit of price is 0.01 JPY
• Time ticks are in minutes
Tokyo Financial Exchange Inc.
Data 1
• USD/JPY by the minute
  – Let us take June 21, 2010 (any day is fine)
  – We use the data uploaded on Forexite. The time is GMT+1 (Central European Time). The fidelity of the data is not guaranteed.
  – Over 24 hours, Open (opening price), High (highest price), Low (lowest price), and Close (closing price) are recorded in time order
  – We want to predict the "Close" of the next minute
  – Let us try an AR model
  – The USDJPY Close data need to be extracted
210610.zip
210610.txt in 210610.zip
From here
Procedure
• Obtain the data for June 21, 2010
• Use R and its AR modeling functions to fit an AR model to the data and use it for prediction
• A program is already written for you. All you have to do is rerun it with different parameters and with different data to see what happens to the prediction
  – The data file ready for you is: 210610.txt in 210610.zip
  – The program in R is: Sample1.r.txt
Data2010.zip
[Figure: real data of a day, x against Index (0 to 1400), values 90.4 to 91.4]
[Figure: prediction for the data on the left, x[3:length(x)] against Index (0 to 1400), values 90.4 to 91.4]
AR model
• One of the models (mathematical expressions) that express a time series
$$\ldots, X_{-1}, X_0, X_1, X_2, \ldots, X_T, \ldots$$
• AR(p) is
$$X_t = \alpha_1 X_{t-1} + \alpha_2 X_{t-2} + \cdots + \alpha_p X_{t-p} + \varepsilon_t$$
where
$$\varepsilon_t \sim N(0, \sigma^2)$$
  – There are conditions under which a system may be modeled by AR(p), but we do not consider them here
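To get a feeling for the AR(p) expression, a series can be generated from it. The following is a sketch only: the coefficients 0.5 and 0.3, the noise level, and the zero starting values are arbitrary choices for illustration, not values from the exercise.

```python
import random


def simulate_ar(alphas, sigma, n, seed=0):
    """Generate n values of X_t = sum_k alphas[k-1] * X_{t-k} + eps_t.

    eps_t is drawn from N(0, sigma^2); the first p values are set to 0.
    """
    rng = random.Random(seed)
    p = len(alphas)
    x = [0.0] * p
    for _ in range(n - p):
        past = x[-1:-p - 1:-1]            # X_{t-1}, X_{t-2}, ..., X_{t-p}
        value = sum(a * v for a, v in zip(alphas, past))
        x.append(value + rng.gauss(0.0, sigma))
    return x


# Hypothetical coefficients chosen only for illustration
series = simulate_ar([0.5, 0.3], sigma=0.1, n=200)
print(len(series), series[:5])
```

Each new value is a weighted sum of the previous p values plus fresh noise. Fitting an AR model is the reverse task: given the series, estimate the coefficients.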
Expressing FX rates
• To model FX rates or stock prices, it is common to use the ratio or the logarithm of the ratio instead of the rates or prices themselves
• But for educational purposes, we try the values themselves and the logarithm of the values
$$X_t, \qquad X_t - X_{t-1}, \qquad \log\frac{X_t}{X_{t-1}} = \log X_t - \log X_{t-1}$$
Procedure
• R will be used, since Weka does not have the necessary functions
• The data are, for example, those of June 21, 2010
• Read the data and extract the closing rate of USD/JPY for every minute (all USDJPY close values from 210610.txt)
• Plot it for examination
• Use arima for AR modeling. The order is (p,0,0) for AR(p)
# Set the working directory (your folder, please)
setwd("D:/R/")
# Read the file,
x.tmp <- read.csv("210610.txt", header=TRUE)
# pick up the USDJPY rows and then select the X.CLOSE. column,
x <- subset(x.tmp, X.TICKER. == "USDJPY")$X.CLOSE.
# plot it,
plot(x, type="l")
# and fit an AR(2) model to the data
(fit2 <- arima(x, c(2, 0, 0)))
[Figure: x against Index (0 to 1400), values 90.4 to 91.4]
Procedure (cont.)
• Note that the following quantities are simple to compute:
$$X_t - X_{t-1}, \qquad \log\frac{X_t}{X_{t-1}} = \log X_t - \log X_{t-1}$$
par(mfrow=c(2, 1))
plot(diff(x), type="l"); plot(diff(log(x)), type="l")
[Figure: diff(x) against Index (0 to 1400), and diff(log(x)) against Index (0 to 1400)]
• The coefficients in the "arima" output look like the following table:
Coefficients:
         ar1     ar2  intercept
      0.9026  0.0952    90.9245
s.e.  0.0263  0.0263     0.2112
• The meaning is
$$X_t - c = \mathrm{ar1}\,(X_{t-1} - c) + \mathrm{ar2}\,(X_{t-2} - c) + \varepsilon_t$$
where c is the intercept, i.e., 90.9245 here, and s.e. is the standard error.
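The meaning of the coefficients can be checked by computing one-step predictions by hand. The sketch below uses the coefficients from the table above as default values; the closing rates are made up, and the function names are only for illustration.

```python
def predict_next(prev1, prev2, ar1=0.9026, ar2=0.0952, c=90.9245):
    """One-step AR(2) prediction: c + ar1*(X_{t-1} - c) + ar2*(X_{t-2} - c)."""
    return c + ar1 * (prev1 - c) + ar2 * (prev2 - c)


def one_step_predictions(series, **coef):
    """Predict every value from its two predecessors (first two are skipped)."""
    return [predict_next(series[i - 1], series[i - 2], **coef)
            for i in range(2, len(series))]


closes = [90.95, 90.97, 90.96, 90.99, 91.01]  # made-up closing rates
preds = one_step_predictions(closes)
for actual, pred in zip(closes[2:], preds):
    print(f"actual {actual:.2f}  predicted {pred:.4f}")
```

Since ar1 + ar2 is very close to 1 here, each prediction is close to a weighted average of the two previous rates, dominated by the most recent one. Keep this in mind when judging whether a prediction that looks good on a plot is really useful.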
Procedure (cont.)
• To predict, use the following program:
# Read a test data file
y.tmp <- read.csv("220610.txt", header=TRUE)
y <- subset(y.tmp, X.TICKER. == "USDJPY")$X.CLOSE.
# One-step predictions based on fit2 <- arima(x, c(2,0,0)) will be stored in y.ar
y.ar <- array(0, dim = c(length(y)))
int <- fit2$coef["intercept"]
for (i in (2+1):length(y)) y.ar[i] <- int + coef(fit2)[1:2] %*% (y[(i-1):(i-2)] - int)
# True values in black, predicted values in red
plot(10:length(y), y[10:length(y)], type="l"); lines(10:length(y), y.ar[10:length(y)], col=2)
[Figure: y[10:length(y)] against 10:length(y) (0 to 1200), values 87.4 to 88.2]
• It seems that the black line, i.e., the true values, is completely overlaid by the red line, i.e., the predicted values
• But in fact it is not:
plotrange <- 700:730
plot(plotrange, y[plotrange], type="l"); lines(plotrange, y.ar[plotrange], col=2)
[Figure: y[plotrange] against plotrange (700 to 730), values 90.52 to 90.62]
• Is this prediction successful?
• What will happen with other days' data as test data?
• How about the logarithm of the ratio?
Data 2
• USD/JPY by the minute
  – Let us take June 21, 2010 (any day is fine)
  – We use the data uploaded on Forexite. The time is GMT+1 (Central European Time). The fidelity of the data is not guaranteed.
  – Over 24 hours, Open (opening price), High (highest price), Low (lowest price), and Close (closing price) are recorded in time order
  – We want to predict "Close" − "Open" (the return) of a minute
  – Difficult: what should be the basis of the prediction? What are the features?
  – (Let us try with) the returns of every minute from five minutes ago onwards
210610.zip
210610.txt in 210610.zip
From here
Procedures
• Obtain the data for June 21, 2010
• Put it into an Excel file. For every minute, calculate the returns of each minute from five minutes ago onwards
• To make the prediction problem simpler, let us aim at predicting "up or down" (not the relative price value): +1 for up and −1 for down
  – The file is ready for you (all USDJPY rows of 210610.txt in 210610.zip):
USDJPY100621.xls
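What the Excel file computes can be sketched as follows. The prices are made up, the function names are hypothetical, and a third label 0 for "no change" is my assumption for minutes in which Close equals Open.

```python
def minute_returns(opens, closes):
    """Return of each minute, taken as Close - Open."""
    return [c - o for o, c in zip(opens, closes)]


def sign(value):
    """+1 for up, -1 for down, 0 for no change."""
    return (value > 0) - (value < 0)


def make_instances(returns, lags=5):
    """Pair the previous `lags` returns (oldest first) with the current direction."""
    rows = []
    for t in range(lags, len(returns)):
        rows.append((returns[t - lags:t], sign(returns[t])))
    return rows


# Made-up prices in units of 0.01 JPY
opens = [9050, 9052, 9051, 9051, 9049, 9050, 9053]
closes = [9052, 9051, 9051, 9049, 9050, 9053, 9052]
rets = minute_returns(opens, closes)
for features, label in make_instances(rets):
    print(features, label)
```

Each row has five features (the returns of the five preceding minutes) and one class label (the direction of the current minute), which is the shape of data that Weka expects after the arff header is added.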
Procedures: prepare files
• Pick out only the returns to form a csv file
• Add an arff header to make it an arff file for Weka. Use your favorite editor, such as Notepad.
Procedures: a trial
• Use Weka
  – Before applying an algorithm, be sure to remove the "returns" feature in "Preprocess" of Weka (see the next slide)
  – J48 under "trees": decision tree
  – neural network under "functions"
  – SMO under "functions": one of the support vector machine implementations
  – naïveBayes under "Bayes"
• Accuracies are around 1/3
  – Do not just say "it is uniformly random". When we examine the data, it is easy to see that the returns are positive, zero, or negative in almost equal proportions of 1/3 each. It was hard for me to believe.
Procedures in Weka: a note
[Screenshot annotations: select and click; the class distribution is unbelievably symmetric]
Procedures: other days
• Try other days
• How about the 22nd, 23rd, and 24th of June?
  – Please make the xls, csv, and arff files
• Results?
• Is information for prediction lacking?
• It must be. Then shall we include the High and Low of the previous minute?
  – I have prepared June 21st for you. Make the files for the other days and try the prediction.
  – I guess there will be no success.
220610.zip, 230610.zip, 240610.zip
USDJPY100621A.xls
Procedures
• Next: how about making five minutes the unit?
  – Within just one minute, dealers might not be able to observe other dealers' behavior, so correlations among prices might not emerge; the price changes would then seem random and therefore unpredictable
  – Since five minutes seem long enough, there must be some correlation between the prices, and therefore some predictability must exist
  – Correct?
In reality, it has been shown that correlations exist over intervals of up to 20 minutes (but then disappear). See for example P. Gopikrishnan, et al. Scaling of the distribution of fluctuations of financial market indices, Physical Review E vol. 60, 5305-5316 (1999)
Procedures
• Before preparing the five-minute data, we need to obtain the Open, High, Low, and Close of five minutes ago
• After that, we can get the data for five-minute time intervals
[Spreadsheet annotations: one column holds the remainder of <TIME> divided by 500; when we sort the data by this column, we get the five-minute interval data. High is the maximum and Low is the minimum over the interval; copy and paste the formulas.]
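The same five-minute aggregation can be sketched in a few lines. This assumes that <TIME> is written as HHMMSS, so that a remainder of 0 on division by 500 marks every fifth minute; the bars are made up and the function names are hypothetical.

```python
def five_minute_bars(bars, width=5):
    """Merge consecutive one-minute (open, high, low, close) bars into wider bars."""
    merged = []
    for start in range(0, len(bars) - width + 1, width):
        chunk = bars[start:start + width]
        merged.append((
            chunk[0][0],                 # Open of the first minute
            max(b[1] for b in chunk),    # highest High
            min(b[2] for b in chunk),    # lowest Low
            chunk[-1][3],                # Close of the last minute
        ))
    return merged


def on_five_minute_boundary(hhmmss):
    """True when an HHMMSS time stamp leaves remainder 0 on division by 500."""
    return hhmmss % 500 == 0


# Made-up one-minute bars: (open, high, low, close)
bars = [(10, 12, 9, 11), (11, 13, 10, 12), (12, 12, 8, 9),
        (9, 10, 9, 10), (10, 15, 10, 14), (14, 14, 13, 13)]
print(five_minute_bars(bars))
print([t for t in (500, 600, 1000, 1500, 1700) if on_five_minute_boundary(t)])
```

An incomplete group at the end (here the sixth bar) is dropped. Note also that this sketch groups by position, so it silently assumes that no minute is missing from the file; real data should be checked for gaps first.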
Procedures
• Use Weka
  – J48, SMO, naïve Bayes, and others
• You might get better accuracy
  – But if you look at the distribution of returns, you will find that there are fewer instances with returns=0
  – As was expected (?), prediction is not possible?
  – Well, the amount of data could be smaller than necessary
• Let us try June 23rd to June 26th
  – A bit better?
  – Let us test (apply the obtained knowledge to) other days, such as June 28th to July 1st
280610.zip, 290610.zip, 300610.zip, 010710.zip
Procedures
• A way to test a result on other data
• Prepare test data that has the same number of features as the learning data
• Since for the current problem we only have files that include the returns of the current minute, those files will not serve as test data files. One easy way to prepare test data is to delete the "returns" column of the current minute and to save the result as a file in Weka
For those who want to go a bit further
• I have prepared data up to July 2nd, 2010 for USD/JPY and GBP/USD
2010-01to06.zip
Weka: a note for test data
[Screenshot annotations: select this and click here; then click here; specify a file of test data; and click here to close the widget]
Weka: a note for saving as a file
[Screenshot annotations: ① select and click; ② save it]