ITB TERM PAPER
DATA MINING TECHNIQUES (LINEAR MODELLING AND CLASSIFICATION)
RAHUL MAHAJAN (10BM60066)
Table of Contents
INTRODUCTION
    ABOUT WEKA
    ABOUT R
LINEAR MODELLING TECHNIQUE USING R - PREDICTION OF FUTURE SHARE PRICE
    DATA
CASE 1
    THE CODE
    THE RESULT
    INTERPRETATION OF THE RESULT
CASE 2
    THE CODE
    THE RESULT
    INTERPRETATION OF THE RESULT
CLASSIFICATION
    THE DATASET
    CLASSIFICATION PROCEDURE
    INTERPRETING THE RESULTS
INTRODUCTION
In this term paper I have demonstrated two data mining techniques:
LINEAR MODELLING
o The linear modelling technique is demonstrated using R.
CLASSIFICATION
o The classification technique is demonstrated using WEKA.
ABOUT WEKA
Weka is a Java-based collection of many open-source data mining and machine learning
algorithms, including
o Pre-processing of data
o Classification
o Clustering
o Association rule extraction
ABOUT R
R is an open source programming language and software environment for statistical
computing and graphics. The R language is widely used among statisticians for developing
statistical software and for data analysis.
LINEAR MODELLING TECHNIQUE USING R - PREDICTION OF FUTURE SHARE PRICE
Here I will try to use a GARCH model to predict future share prices. GARCH models give us the
liberty to define a model using previous share prices and volatility over a defined period. There are
many versions of GARCH models that give better estimates in different scenarios.
Case 1 - Using previous day share prices and standard deviation.
In the example explained in this term paper, tomorrow's price is expressed as dependent
on yesterday's price and the standard deviation of the last 3 days.
Case 2 – Using previous day share price and gain of previous day.
It is generally known that share prices behave on a momentum basis: for a period of time share
prices go up, then comes a period when prices go down. This model takes advantage of this
behaviour of stock prices.
So, using statistical techniques, I will try to compare the models developed in case 1 and
case 2. It is widely accepted that a model of the kind developed in case 2 fits better than the
model developed in case 1.
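The kind of comparison made below can be sketched with the output of summary(lm(...)). The snippet uses synthetic data (not the Tata Motors file) purely to show how adjusted R-squared separates a well-fitting model from a poor one.

```r
# A minimal sketch on synthetic data (NOT the Tata Motors file): adjusted
# R-squared from summary(lm(...)) is one simple way to compare two models.
set.seed(1)
x <- cumsum(rnorm(100))            # a random-walk "price" series
y <- x + rnorm(100, sd = 0.1)      # next-day price, strongly tied to x
noise <- rnorm(100)                # an unrelated predictor

fit_good <- lm(y ~ x)
fit_poor <- lm(y ~ noise)

summary(fit_good)$adj.r.squared    # close to 1: a good fit
summary(fit_poor)$adj.r.squared    # close to 0: a poor fit
```

The same kind of summary output, fitted on the real price data, is what the two cases below are judged on.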
DATA
Dr Devlina Chatterjee of VGSoM has purchased a large amount of data from NSE for her research, and I
have used a few files from it. In both cases I have used the February 2008 share price data of
Tata Motors. Except for the traded data, all of this data is available in the public domain.
The file contains the following items:
i) Symbol
ii) Series
iii) Date
iv) Prev Close
v) Open Price
vi) High Price
vii) Low Price
viii) Last Price
ix) Close Price
x) Average Price
xi) Total Traded Quantity
xii) Turnover in Lacs
This text file is available at this link- http://bit.ly/TM_PVD
CASE 1
The program first reads the file and extracts the price data. It creates a few vectors for the
prices of the previous 3 days, i.e. A, B and C. Then, using a for loop, it finds the standard
deviation of the prices of the past 3 days. Finally, using linear modelling, it tries to fit a model
to predict future prices.
Before running the case, one thing we need to keep in mind is to change the working directory
of R to the place where we have saved our text file. The packages required to run this code
come pre-installed with R, so there is no need to add any additional packages.
THE CODE
TFile <- "tatamotors.txt"
Trade <- read.table(TFile)      # read the NSE price file
A <- Trade[, 4]                 # Prev Close price column
B <- A[-1]                      # prices shifted by one day
C <- B[-1]                      # prices shifted by two days
B <- B[-length(B)]              # trim so A, B and C have equal length
A <- A[-length(A)]
A <- A[-length(A)]
l <- length(A)
D <- numeric(l)                 # rolling 3-day standard deviation
for (i in 1:l) D[i] <- sd(c(A[i], B[i], C[i]), na.rm = FALSE)
summary(lm(C ~ A + D))          # regress price on lagged price and sd
THE RESULT
The result of the above code is shown below (figure 1).
Figure 1 The output of case 1
INTERPRETATION OF THE RESULT
The p-values and the F-statistic show that the model is not able to predict the prices well.
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.891e+03 1.589e+03 -1.190 0.445
A 3.694e+00 2.257e+00 1.637 0.349
D -9.471e-02 1.066e-01 -0.888 0.538
CASE 2
The program first reads the file. Then it extracts the price data into vector A. Then, using
vectors B and C, it finds the gains for the first n-1 days (where n is the total number of days
available). This data is stored in vector D. Now, using the linear model function, one can find
the statistical significance of the model.
We know that in this case the autocorrelation will be high, so we pass correlation=TRUE to
summary(); the summary then also reports the correlation of the coefficient estimates, which
helps in diagnosing the autocorrelation present in the data.
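This autocorrelation can be checked directly in R. The snippet below uses a simulated random-walk price series rather than the actual NSE file, purely to show the idea.

```r
# A minimal sketch, on simulated prices rather than the actual NSE file,
# of checking the lag-1 autocorrelation referred to above.
set.seed(42)
price <- 100 + cumsum(rnorm(50))             # hypothetical price series
r1 <- cor(price[-1], price[-length(price)])  # lag-1 autocorrelation
r1                                           # near 1 for a trending series
# acf(price) would plot the full autocorrelation function
```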
THE CODE
TFile <- "tatamotors.txt"
Trade <- read.table(TFile)      # read the NSE price file
A <- Trade[, 4]                 # Prev Close price column
l <- length(A)
B <- A[-1]                      # prices for days 2..n
C <- A[-l]                      # prices for days 1..n-1
D <- (C - B) * 100 / C          # previous day's percentage change
summary(lm(B ~ C + D), correlation = TRUE)
THE RESULT
The result of the above code is shown below (figure 2).
Figure 2 The output of case 2
INTERPRETATION OF THE RESULT
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.335684 2.392506 -0.976 0.333
C 1.002842 0.003261 307.518 <2e-16 ***
D -7.187997 0.042478 -169.215 <2e-16 ***
Here we see that the significance of the model is very high, and the adjusted R-squared is also
high. However, part of this high adjusted R-squared reflects the autocorrelation in the prices,
which is very evident in this case. Even so, the F-statistic shows that this model predicts share
prices in a better way. So here we confirm our assumption that the previous-day-gain model
(case 2) fits better than the standard-deviation model (case 1).
CLASSIFICATION
Classification is demonstrated here using decision trees. A decision tree is basically an
algorithm that creates a rule to determine the output of a new data instance.
It creates a tree where each node represents an attribute of our dataset. A decision is made at
each node based on the input. By moving from one node to another you reach the end of the
tree, which gives a predicted output.
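As a toy illustration, a rule of the kind a decision tree encodes can be written by hand. The attributes and thresholds below are hypothetical and are not the rule WEKA learns from the bank data.

```r
# A purely hypothetical decision rule: each if-test plays the role of a
# node in the tree, and the returned string is the leaf's prediction.
predict_product <- function(income, age) {
  if (income > 30000) {
    if (age > 40) "YES" else "NO"
  } else {
    "NO"
  }
}
predict_product(income = 45000, age = 50)   # reaches the "YES" leaf
```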
This is illustrated using the following example
THE DATASET
The dataset used in this example was found on the net. It can be downloaded from the link
http://maya.cs.depaul.edu/classes/ect584/weka/data/bank-data.csv.
Let’s say there is a bank ABC. It has data on 600 people who have either opted for its product
or not. It has the following information about these people: age, gender, income, marital status,
region and mortgage. Now the bank can use this information to create a rule to predict whether
a new potential customer would opt for its product or not, based on the known attributes of the
customer.
CLASSIFICATION PROCEDURE
Load the data in WEKA. To load the data, click on Open file and specify the path. The window
shown in figure 3 should appear after loading.
One will note that there are 12 attributes in the dataset, as seen in the attributes panel of the
window. For this example we will be using only the following attributes:
age, sex, region, income, married, mortgage, savings and product.
Here we will try to predict the response of a new customer using the 7 attributes age, sex,
region, income, married, mortgage and savings.
To remove the remaining attributes, click on the checkboxes on the left side of those attributes
and click on Remove. After removing the attributes one should get the window shown in figure 4.
Now click on the Classify tab at the top. Under the classifier panel, click on Choose > trees >
J48 as shown in figure 5.
J48 is an algorithm used to generate a decision tree, developed by Ross Quinlan. It is an
extension of Quinlan's earlier ID3 algorithm. The decision trees generated by J48 use the
concept of information entropy.
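The entropy idea can be sketched in a few lines of R. The entropy() function below is written here purely for illustration; it is not a built-in R function and not WEKA's implementation.

```r
# A minimal sketch of the information entropy J48/ID3 use to pick splits:
# a node with an even class mix has entropy 1 bit, a pure node has 0.
entropy <- function(labels) {
  p <- table(labels) / length(labels)   # class proportions at the node
  -sum(p * log2(p))
}
entropy(c("yes", "yes", "no", "no"))    # 1 bit: maximally impure node
entropy(c("yes", "yes", "yes", "yes"))  # 0 bits: a pure node
```

The algorithm prefers the split whose child nodes reduce this entropy the most (the information gain).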
Now we can create the model in WEKA. First ensure that "Use training set" is selected, so that
the data we have loaded is used for creating the model. Click Start. The output from this model
will look as shown in figure 6.
INTERPRETING THE RESULTS
The important results to focus on are
1. "Correctly Classified Instances" (75.66 percent) and "Incorrectly Classified Instances"
(24.33 percent), which tell us about the accuracy of the model. Our model is neither very good
nor very bad; it is okay, and further modification needs to be done.
2. The confusion matrix, which shows the number of false positives and negatives. In this case
117 instances of class a are incorrectly classified as b and 29 instances of class b are
incorrectly classified as a.
3. The ROC area, which measures the discrimination ability of the forecast. Although there is
some discrimination whenever the ROC area is > 0.5, in most situations the discrimination
ability of the forecast is not considered useful in practice unless the ROC area is > 0.7. For
our model the ROC area is greater than 0.7 (0.787).
4. The decision tree, which is the main output. It is the rule that will help predict the outcome
of new data instances. To view the decision tree, right-click on the model and select
Visualize tree. You will get the window shown in figure 7.
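The headline accuracy can be recomputed by hand from the confusion-matrix counts quoted above:

```r
# 117 + 29 = 146 of the 600 instances are misclassified, which gives
# the "Correctly Classified Instances" figure of 75.66 percent.
total <- 600
misclassified <- 117 + 29
accuracy <- (total - misclassified) / total
accuracy * 100    # 75.66..., the figure WEKA reports
```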
Figure 3 Window after loading dataset
Figure 4 Window after removing unwanted attributes
Figure 5 Choosing the J48 tree
Figure 6 Output of the classification process
Figure 7 The decision tree