An Evaluation of Commercial Data Mining
Oracle Data Mining
Emily Davis
Computer Science Department
Rhodes University
Supervisor: John Ebden
November 2004
Submitted in partial fulfilment of the requirements for BSc. Honours in Computer Science
Acknowledgements
I am very grateful for all the advice and assistance given to me by my supervisor,
John Ebden. I am exceedingly thankful for all the time and effort he put into helping
me produce this work. I am also grateful for the funding provided by the Andrew
Mellon Foundation in the form of an Honours Degree Scholarship.
I must also acknowledge the financial and technical support of this project provided
by Telkom SA, Business Connexion, Comverse SA, and Verso Technologies through
the Telkom Centre of Excellence at Rhodes University.
I must also thank the technical division in the Computer Science Department at
Rhodes University and especially Jody Balarin and Chris Morley for their help.
Table of Contents
Abstract
Section 1 Introduction
Chapter 1 Introduction
  1.1 Background to Data Mining
  1.2 Supervised Learning and Classification Techniques
  1.3 Oracle Data Mining (ODM)
    1.3.1 Oracle Data Mining Algorithms
    1.3.2 Functionality of Oracle Data Mining Algorithms and ODM
  1.4 Chapter Summary
Section 2 Evaluation of Oracle Data Mining
Chapter 2 Methodology of the Evaluation
  2.1 Approach
  2.2 Choice of Data Mining Tool
  2.3 The Data
  2.4 Classification Algorithms
    2.4.1 Naïve Bayes
    2.4.2 Adaptive Bayes Network
  2.5 Algorithm Settings
    2.5.1 Naïve Bayes Settings
    2.5.2 Adaptive Bayes Network Settings
  2.6 Chapter Summary
Chapter 3 Classification Models
  3.1 Preparing the Data
    3.1.1 Build and Test Data Sets
    3.1.2 Priors
  3.2 Building the Models
    3.2.1 Building the Naïve Bayes Models
      3.2.1.1 nbBuild
      3.2.1.2 nbBuild2
    3.2.2 Building the Adaptive Bayes Network Models
      3.2.2.1 abnBuild
      3.2.2.2 abnBuild2
  3.3 Testing the Models
    3.3.1 Model Accuracy
    3.3.2 Model Confusion Matrices
  3.4 Calculating Model Lift
  3.5 Training and Tuning the Models
  3.6 Applying the Models to New Data
  3.7 Chapter Summary
Chapter 4 Model Results
  4.1 Results of Application to New Data
    4.1.1 Rules Associated with Adaptive Bayes Network Predictions
  4.2 Comparison of Model Results
  4.3 Chapter Summary
Chapter 5 Interpretation of Results
  5.1 Comparison of Model Results
    5.1.1 Comparison 1
    5.1.2 Comparison 2
    5.1.3 Comparison 3
    5.1.4 Comparison 4
  5.2 Effectiveness of Models
  5.3 Significance of Results
  5.4 Chapter Summary
Section 3 Conclusion
Chapter 6 Conclusions Drawn from Results
  6.1 Conclusions Regarding Model Results
  6.2 Conclusions Regarding Data
  6.3 Conclusions Regarding Oracle Data Mining
  6.4 Chapter Summary
Chapter 7 Conclusion
  7.1 Conclusion
  7.2 Possible Extensions to Research
List of Figures
List of Tables
References
Abstract
This project describes an investigation of a commercial data mining suite, the one
available with the Oracle9i database software.
This investigation was conducted in order to determine the type of results achieved
when data mining models were created using Oracle’s data mining components and
applied to data. Issues investigated in this process included whether the algorithms
used in the evaluation found a pattern in a data set, which of the algorithms built the
most effective data mining model, the manner in which the data mining models were
tested and the effect the distribution of the data set had on the testing process.
Two algorithms in the Classification category, Naïve Bayes and Adaptive Bayes
Network, were used to build the data mining models. The models were then tested to
determine their accuracy and applied to new data to establish their effectiveness. The
results of the testing process and the results of applying the models to new data were
analysed and compared as part of this investigation.
A number of conclusions were drawn from this investigation, namely that Oracle Data
Mining provides all the functionality necessary to build an effective data mining
model with ease, and that the Adaptive Bayes Network algorithm produced the most
effective data mining model. As far as actual results were concerned, the accuracy the
models displayed during testing was not a good indication of the accuracy they would
display when applied to new data, and the distribution of the target attribute in the data
sets had an impact on the data mining models and on the testing thereof.
Section 1 Introduction
Chapter 1 Introduction
The purpose of this evaluation is to determine how the Oracle Data Mining suite
provides data mining functionality. This involves investigating a number of issues:
1. How easy the tools available with the data mining software are to use and in
what ways they provide aspects of data mining like data preparation, building
of data mining models and testing of these models.
2. Whether the algorithms selected for this evaluation found a useful pattern in
a data set and what happened when the models produced by the algorithms
were applied to a new data set.
3. Which of the algorithms investigated built the most effective data mining
model and under what circumstances this occurred.
4. How the models were tested and whether test results gave an indication of
how the models would perform when applied to new data.
5. Lastly, the manner in which the distribution of the data used to build the data
mining models affected the models and how the distribution of the data used
to test the models affected the test results.
1.1 Background to Data Mining
Data mining is a relatively new offshoot of database technology which has arisen
primarily as a result of the ability of computers to:
- Store vast quantities of data in data warehouses. (Data warehouses differ from
  operational databases in that the data in a warehouse is historical; it does not
  consist only of the active records in a database.)
- Implement various algorithms for the mining of data.
- Use these algorithms to analyse these vast quantities of data in a reasonable
  amount of time.
The ability to store vast amounts of data is of little use if the data cannot somehow be
organised in a meaningful way. Data mining achieves this by discovering the patterns
in data that represent knowledge and providing some sort of description or abstraction
of what is contained in a data set. These patterns allow organisations to learn from
past behaviour stored in historical data and exploit those patterns that work best for
them.
There are various ways to classify data mining into categories, as suggested by a
number of authors. Berry and Linoff [2000] classify the various techniques of data
mining into two main categories: directed data mining and undirected data mining.
Geatz and Roiger [2003] divide data mining into two categories, supervised and
unsupervised learning. Al-Attar [2004] makes a distinction between data mining and
data modelling.
Berry and Linoff [2000] suggest considering the goals of the data mining project
when classifying data mining and, accordingly, what techniques can be used to fulfil
these goals. Predictive techniques are useful for making predictions, while descriptive
techniques help with understanding of a problem space.
According to Berry and Linoff [2000], directed data mining involves using the data to
build a model that describes one particular variable of interest in terms of the rest of
the data. This category includes techniques such as classification, estimation and
prediction. Undirected data mining builds a model with no single target variable;
instead it establishes the relationships among all the variables. Included in this
category are affinity groupings or association discovery, clustering (classification
with no predefined classes) and description or visualization. [Berry and Linoff, 2000]
Geatz and Roiger [2003] define input variables as independent variables and output
variables as dependent variables. It can then be deduced that dependent variables do
not exist in unsupervised learning as no output variable is produced but rather a
descriptive relationship is produced. In supervised learning a predictive, dependent
variable is produced as output.
According to Al-Attar [2004], data mining results in patterns that are understandable,
such as decision trees, rules and associations. Data modelling produces a model that
fits the data and that may be understandable (trees, rules) or presented as a black box,
as with neural networks.
In keeping with these definitions it is possible to say that directed data mining,
supervised learning and Al-Attar’s [2004] definition of data mining describe similar
predictive techniques and fall into the category of supervised learning. Undirected
data mining, unsupervised learning and Al-Attar’s [2004] data modelling are in the
same class as descriptive techniques and fall into the category of unsupervised
learning.
1.2 Supervised Learning and Classification Techniques
Algorithms are used to implement the techniques in these various data mining
categories. Supervised learning covers techniques that include prediction,
classification, estimation, decision trees and association rules. As this evaluation
investigates classification techniques, these will be discussed in further detail.
Geatz and Roiger [2003] describe classification as a technique where the dependent or
output variable is categorical. The emphasis of the model is to assign new instances of
data to categorical classes. The authors describe estimation as a similar technique that
is used to determine the value of an unknown output attribute that is numerical. Geatz
and Roiger [2003] state that prediction only differs from the two techniques
mentioned above in that it is used to determine future outcomes of data. Classification
techniques such as these are generally used when there is a set of input and output
data as dependent and independent variables exist in the data.
1.3 Oracle Data Mining (ODM)
Oracle embeds data mining in the Oracle 9i Enterprise Edition version 9.2.0.5.0
database, which allows for integration with other database applications. All data
mining functions are provided through a Java API, giving the data miner complete
control over them. [Oracle9i Data Mining Concepts Release 2 (9.2), 2002]
The Oracle Data Mining suite is made up of two components, the data mining Java
API and the Data Mining Server (DMS). [Oracle9i Data Mining Concepts Release 2
(9.2), 2002] The DMS is a server side component that provides a repository of
metadata of the input and result objects of data mining. The DMS also provides a
connection to the database and access to the data that is mined. It is possible to use
JDeveloper 10g to provide the access to the Java API and the DMS. The data mining
can then be performed using Data Mining for Java (DM4J) 9.0.4 or by writing Java
code. DM4J provides a number of wizards that automatically produce the Java code.
[Oracle Data Mining Tutorial, Release 9.0.4, 2004]
1.3.1 Oracle Data Mining Algorithms
ODM supports a number of algorithms, and the choice of algorithm depends on
the data available for mining as well as the format of results required. This project has
made use of the Adaptive Bayes Network and Naïve Bayes algorithms which are
Classification algorithms that assign new instances of data to categorical classes and
can be used to make predictions when applied to new data.
1.3.2 Functionality of Oracle Data Mining Algorithms and ODM
Mining tasks are available to perform data mining operations using these algorithms;
these tasks include building and testing models, computing model lift and applying
models to new data (scoring).
DM4J wizards control the preparation and mining of data as well as the evaluation and
scoring of models. DM4J can also automatically generate Java and SQL code so that
the data mining can be transferred into integrated data mining or business intelligence
applications. [Oracle Data Mining for Java (DM4J), 2004]
1.4 Chapter Summary
This chapter introduces the evaluation and describes what is hoped to be achieved by
investigating the Oracle Data Mining suite. A short background to data mining is
presented and supervised learning and Classification techniques introduced. A short
introduction to ODM is also presented. The next chapter will describe the approach
taken by this evaluation and will present reasons for some of the design decisions.
Section 2 Evaluation of Oracle Data Mining
Chapter 2 Methodology of the Evaluation
This chapter aims to provide an explanation of the approach that has been taken
during this evaluation. It will explain why ODM was selected as the data mining tool
to be evaluated as well as why the Naïve Bayes and Adaptive Bayes Network
algorithms were used to build the data mining models. The parameters required by
these algorithms are explained and the data used during this evaluation is described.
2.1 Approach
One purpose of this evaluation is to determine what functionality is provided with
ODM as well as to ascertain what kinds of models can be produced by ODM. In order
to make these discoveries, it is necessary to use a number of algorithms in the data
mining suite to build data mining models, to test the accuracy of these models and to
validate the results these models produce when applied to new data.
To be able to perform comparisons of the results the models produce, it has been
necessary to select two forms of data mining algorithm that fall into the same
categories, in this case, supervised learning and classification. For this reason, Naïve
Bayes for Classification and Adaptive Bayes Network for Classification have been
selected as both algorithms fall into the supervised learning category and can be used
to make predictions. These predictions could then be compared to determine which
models, built using the different algorithms, are more effective. Both algorithms
allow for building the model, testing the model, computing model lift (providing a
measure of how quickly the model finds actual positive target values) and application
of the model to new data.
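The lift measure mentioned above can be illustrated with a short sketch: records are ranked by the model's predicted probability of the positive target, and the positive rate in the top fraction is compared with the overall positive rate. This is an illustration of the idea only, not DM4J's actual calculation; the scores below are invented.

```python
def lift_at(scored, fraction):
    """scored: list of (predicted_probability, actual) pairs, actual in {0, 1}.
    Lift = positive rate in the top `fraction` of records (ranked by score)
    divided by the positive rate in the whole set."""
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)
    top = ranked[: max(1, int(len(ranked) * fraction))]
    top_rate = sum(actual for _, actual in top) / len(top)
    base_rate = sum(actual for _, actual in scored) / len(scored)
    return top_rate / base_rate

# A model that concentrates actual positives among its highest scores
# achieves a lift greater than 1 in the top segment.
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.4, 0), (0.3, 1), (0.2, 0),
          (0.15, 0), (0.1, 0), (0.05, 0), (0.01, 0)]
print(lift_at(scored, 0.2))  # top 20% are all positives vs a 30% base rate
```

A lift of 1 would mean the model ranks positives no better than random ordering.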
An Oracle 9i Enterprise Edition version 9.2.0.5.0 database was configured and the
tools and software for data mining installed and configured for use with the database.
For the purposes of this investigation, JDeveloper 10g provides the access to the Java
API and the DMS. The data mining itself is performed using DM4J 9.0.4 which is an
extension of JDeveloper that provides the user with a number of wizards that
automatically create the Java programs that perform the data mining when these
programs are run. [Oracle Data Mining Tutorial, Release 9.0.4, 2004]
The data used during the evaluation was obtained from http://www.ru.ac.za/weather/,
which provides an archive of weather data for the Grahamstown area covering a
number of years. This data was chosen because determining whether a pattern was
present in it would yield results of more interest to Rhodes University than those
obtained from sample data with little local relevance.
The two Classification algorithms were then used to build, test and apply a number of
data mining models to the data and it was then possible to compare the predictions
made by each model. During the model building stage it was possible to build the
models using prepared and unprepared data as well as to build models using the
different techniques to determine the effect this had on the results. During testing of
the models it was possible to compare the models’ accuracy and to measure how
quickly the model finds actual positive target values (model lift). Once the models had
been built and tested it was possible to apply the models to new data and then
compare the predictions made by the models to those of the other models as well as to
the actual values in the historical data. It was also of interest to compare the results of
testing the models to those of applying the models to new data.
2.2 Choice of Data Mining Tool
The data mining functionality provided with the Oracle9i Enterprise Edition database
was chosen for evaluation. An aspect of ODM that supported its use was that all data
mining processing occurs within the database. This removes the need to extract data
from the database in order to perform the mining as well as reducing the need for
hardware and software to store and manage this data. According to Berger [2004] this
results in a more secure and stable data management and mining environment and
enhances productivity as the data does not have to be extracted from the database
before it is mined.
ODM uses Java code to build, test and apply the models. It was decided to use DM4J
9.0.4 (an extension of JDeveloper 10g) to conduct the data mining as DM4J provides
wizards that allow the user to adjust the settings for the data mining and automatically
generates the Java code that is run when the mining is performed. This functionality
allows novice users to use the default settings for the various algorithms and more
advanced users can experiment with the different settings without having to rewrite
vast amounts of code. DM4J also provides access to the Oracle 9i database and the
data used for the data mining which allows the user to carry out data preparation
within the database using similar wizards. These factors make it possible to evaluate
the ease of use of the tools and to determine how ODM supports the various stages of
the data mining process.
In the study of related literature it is apparent that a number of authors feel data
mining should be conducted in a procedural manner. Al-Attar [2004] feels that a step
by step data mining methodology needs to be developed to allow non-experts to
conduct data mining and that this methodology should be repeatable for most data
mining projects. This and similar statements show the need for a well defined data
mining process to be used by data miners.
Geatz and Roiger [2003] introduce the KDD (Knowledge Discovery and Data
Mining) process, in which emphasis is placed on data preparation for model building.
The process involves:
- Identifying the goal to be achieved using data mining.
- Selecting the data to be mined.
- Preprocessing the data in order to deal with noisy data.
- Transforming the data, which involves the addition or removal of attributes and
  instances, normalizing of data and type conversions.
- The actual data mining; at this stage the model is built from training and test
  data sets.
- Interpreting the resulting model to determine whether the results it presents are
  useful or interesting.
- Applying the model or acquired knowledge to the problem.
When this suggested process is compared to the process used by ODM as depicted in
Figure 1, it is apparent that ODM makes use of similar stages in their data mining and
places the necessary emphasis on preparation of data and evaluation of results. This
suggests that ODM provides access to the necessary stages involved in conducting a
more successful data mining project.
Figure 1. The Oracle Data Mining Process [Berger, 2004]
2.3 The Data
The data used in this evaluation consists of a number of tables that are stored in the
Oracle database and available in Appendix B on the CD-ROM that accompanies this
project. The data was created from a weather data archive available at
http://www.ru.ac.za/weather/ compiled by Jacot-Guillarmod, F. According to the
explanation on the web page, the data available at the site represents data gathered at
5 minute intervals throughout a day. Data recorded includes:
- Temperature (degrees F)
- Humidity (percent)
- Barometer (inches of mercury)
- Wind Direction (degrees, 360 = North, 90 = East)
- Wind Speed (MPH)
- High Wind Speed (MPH)
- Solar Radiation (Watts/m^2)
- Rainfall (inches)
- Wind Chill (computed from high wind speed and temperature)
Preparing the data to create the database tables involved removing the reading of
rainfall in inches from the records and replacing it with a ‘yes’ or ‘no’ value,
depending on whether rain had been measured or not. This implies that the 5 minute
interval measurements are used to determine whether rain had been recorded on the
day the measurements were taken. Although information is lost regarding the amount
of rain that had fallen on a specific day, for the purposes of this evaluation it is of
interest whether rain fell at all on a specific day as the predictions made by the
algorithms are categorical.
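The transformation described above amounts to replacing a numeric reading with a categorical value. The following sketch illustrates the idea; it is not the actual preparation script used for this project, and the column name follows Table 1.

```python
def categorise_rain(rainfall_inches):
    """Replace a numeric rainfall reading with the categorical RAIN value:
    'yes' if any rain was measured, 'no' otherwise."""
    return "yes" if rainfall_inches > 0 else "no"

# Hypothetical 5 minute readings: only the nonzero ones map to 'yes'.
readings = [0.0, 0.02, 0.0, 0.15]
print([categorise_rain(r) for r in readings])  # ['no', 'yes', 'no', 'yes']
```

Note that the amount of rainfall is discarded; only the fact that rain fell survives, which is what a categorical prediction target requires.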
This categorical variable which was named RAIN would then be predicted by the
models when applied to new data of the same format. The resulting structure of the
tables of data is depicted in Table 1.
Name     Data Type  Size  Nulls?
THETIME  NUMBER           NO
TEMP     NUMBER           YES
HUM      NUMBER           YES
BARO     NUMBER           YES
WDIR     NUMBER     3     YES
WSPD     NUMBER           YES
WSHI     NUMBER           YES
SRAD     NUMBER           YES
CHILL    NUMBER           YES
RAIN     VARCHAR    3     YES
Table 1 Mining Data Table Structure
The data set WEATHER_BUILD is used for the building of the data mining models
for both algorithms. This data set consists of 2601 records and is created from a
number of daily weather archives recorded in September 2004.
The test data set used to evaluate the effectiveness of the models is created from
WEATHER_BUILD and the process of creating this data set will be explained in
more detail later in the project.
WEATHER_APPLY consists of 290 records and is the data set to which the built and
tested model is applied in order to make predictions. All the actual values of the
RAIN attribute had been removed and stored for later comparison. This means the
models will predict whether the value of RAIN will be ‘yes’ or ‘no’, and it will then
be possible to compare these predictions with the actual values in the original data
used to create WEATHER_APPLY. The results of the application of the models to
the data are stored by DM4J for inspection and use. It is also possible to export the
results to spreadsheet format, which has been done in this case to allow for
comparison between models and with the actual data values.
2.4 Classification Algorithms
The two algorithms selected for the evaluation were Naïve Bayes and Adaptive Bayes
Network. Both are classification algorithms that allow the data miner to build a model
using historical data and then apply this model to new data in order to make
predictions regarding a dependent, categorical variable in the data. Berger [2004]
states that both algorithms should be used in a data mining project to see which is
able to build the better model. This provides further justification for comparing these
two algorithms within the data mining suite.
2.4.1 Naïve Bayes
The Naïve Bayes algorithm builds a model that predicts the probability of a variable
falling into a categorical class. This is achieved by discovering patterns present in the
data and counting the number of times certain conditions or relationships in the data
occur. [Berger, 2004] The data mining model represents these relationships and can
be applied to new data to make predictions. The algorithm makes use of Bayes’
Theorem, which is statistical in nature. [Berger, 2004]
The algorithm is said to provide quicker model building and faster application to new
data than the Adaptive Bayes Network algorithm. Naïve Bayes can also be used to
make predictions of categorical classes that consist of binary-type outcomes or
multiple categories of outcomes. [Berger, 2004]
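The counting approach Berger describes can be illustrated with a minimal Naïve Bayes sketch. This is not ODM's implementation; the attribute values are invented, and add-one smoothing is used here so that unseen values do not zero out a probability.

```python
from collections import Counter, defaultdict

def train_nb(records, target):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(r[target] for r in records)
    cond_counts = defaultdict(Counter)  # (attribute, class) -> value counts
    for r in records:
        for attr, value in r.items():
            if attr != target:
                cond_counts[(attr, r[target])][value] += 1
    return class_counts, cond_counts

def predict_nb(model, instance):
    """Pick the class maximising P(class) * product of P(value | class)."""
    class_counts, cond_counts = model
    total = sum(class_counts.values())
    best, best_p = None, -1.0
    for cls, n in class_counts.items():
        p = n / total
        for attr, value in instance.items():
            # Add-one smoothing over a two-valued attribute (illustrative).
            p *= (cond_counts[(attr, cls)][value] + 1) / (n + 2)
        if p > best_p:
            best, best_p = cls, p
    return best

# Hypothetical weather-style records with a categorical RAIN target.
data = [
    {"HUM": "high", "SRAD": "low", "RAIN": "yes"},
    {"HUM": "high", "SRAD": "low", "RAIN": "yes"},
    {"HUM": "low", "SRAD": "high", "RAIN": "no"},
    {"HUM": "low", "SRAD": "high", "RAIN": "no"},
]
model = train_nb(data, "RAIN")
print(predict_nb(model, {"HUM": "high", "SRAD": "low"}))  # -> yes
```

The model really is just counts; applying it to a new instance only multiplies stored frequencies, which is why Naïve Bayes builds and scores quickly.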
2.4.2 Adaptive Bayes Network
The Adaptive Bayes Network model provides similar functionality to that of Naïve
Bayes, but it can also generate rules or decision-tree-like outcomes when built, and
again make predictions when applied to new data. The rules that are generated are
easy to interpret, taking the form of “if…then” statements. Berger [2004] states that
this algorithm can be used to build better models than Naïve Bayes, but it requires a
larger number of parameters to be set and tends to take longer to build such
a model.
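The "if…then" form these rules take can be pictured with a small sketch. The conditions and the rendering below are hypothetical illustrations of the rule shape only; the actual rules produced in this evaluation are presented in Chapter 4.

```python
def make_rule(conditions, prediction):
    """Render an 'if ... then ...' rule from (attribute, operator, value) triples."""
    antecedent = " and ".join(f"{attr} {op} {value}" for attr, op, value in conditions)
    return f"if {antecedent} then RAIN = '{prediction}'"

# A hypothetical rule over the weather attributes.
print(make_rule([("HUM", ">=", 80), ("SRAD", "<", 100)], "yes"))
# if HUM >= 80 and SRAD < 100 then RAIN = 'yes'
```

The readability of this form is the main reason rules are valued: a domain expert can check each antecedent against intuition.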
2.5 Algorithm Settings
2.5.1 Naïve Bayes Settings
Naïve Bayes works by looking at the build data and calculating conditional
probabilities for the target value. This is done by observing the frequency of certain
attribute values and combinations thereof. [Oracle Data Mining Tutorial, Release
9.0.4, 2004] The two parameters that must be supplied to the Naïve Bayes build
wizard, as shown in Figure 2, indicate how outliers in the data should be treated;
occurrences below the threshold values are ignored when creating the model. [Oracle
Data Mining Tutorial, Release 9.0.4, 2004]
The singleton threshold value provides a threshold for the count of items that occur
frequently in the data. Given k as the number of times the item occurs in the data, P
as the number of records and t as the singleton threshold expressed as a percentage of
P, the item is considered to occur frequently if k >= t*P. [Oracle Help for Java,
1997-2004]

The pairwise threshold provides a threshold for the count of pairs of items that occur
frequently in the data. Given k as the number of times two items appear together in
the records, and P and t as above, a pair is frequent if k > t*P. [Oracle Help for Java,
1997-2004]
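Expressed as code, the two frequency tests look as follows. This is a direct sketch of the documented formulas, not ODM internals; note that the documentation states the singleton test as non-strict (>=) and the pairwise test as strict (>).

```python
def is_frequent_singleton(k, P, t):
    """Item occurring k times in P records passes the singleton threshold t."""
    return k >= t * P

def is_frequent_pair(k, P, t):
    """Pair co-occurring k times passes the (strict) pairwise threshold t."""
    return k > t * P

# With 1000 records and a threshold of 0.01, exactly 10 occurrences pass the
# singleton test, but a pair needs more than 10 co-occurrences.
print(is_frequent_singleton(10, 1000, 0.01))  # True
print(is_frequent_pair(10, 1000, 0.01))       # False
```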
Figure 2 Naïve Bayes algorithm settings
2.5.2 Adaptive Bayes Network Settings
Adaptive Bayes Network works by ranking the attributes in a data set and then
building a Naïve Bayes model in order of the ranked attributes. The algorithm then
builds a set of features or ‘trees’ using these attributes, which are in turn tested
against the model in order to determine whether or not they improve its accuracy. If
no improvement is found the feature is discarded. When the number of discarded
features reaches a certain level the building stops, and the model consists of those
features that remain. [Oracle Data Mining Tutorial, Release 9.0.4, 2004]
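The build loop described above can be sketched as a greedy search over candidate features. This is an illustration of the general idea only, not Oracle's algorithm; the attribute names and the accuracy oracle below are invented.

```python
def greedy_feature_build(candidate_features, accuracy_of, max_failures):
    """Keep features that improve accuracy; stop after too many failures."""
    kept, failures = [], 0
    best = accuracy_of(kept)
    for feature in candidate_features:  # candidates arrive ranked best-first
        trial = accuracy_of(kept + [feature])
        if trial > best:
            kept.append(feature)
            best = trial
        else:
            failures += 1  # feature discarded
            if failures >= max_failures:
                break  # building stops; the model is the features that remain
    return kept

# Hypothetical accuracy oracle: only HUM and BARO genuinely help.
def accuracy_of(features):
    gains = {"HUM": 0.10, "BARO": 0.05}
    return 0.6 + sum(gains.get(f, 0.0) for f in features)

print(greedy_feature_build(["HUM", "BARO", "WDIR", "WSPD", "SRAD"],
                           accuracy_of, max_failures=2))  # ['HUM', 'BARO']
```

The failure budget is what bounds build time: without it, every remaining candidate would be tested even after improvements have clearly dried up.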
The choice of settings when building an Adaptive Bayes Network model allows the
user to choose from three types of models: SingleFeatureBuild, MultiFeatureBuild
and NaiveBayesBuild.

The SingleFeatureBuild model produces rules in an “if…then” format and produces
only one feature. The parameters required by this type of model are shown in Figure
3 and include the maximum depth of the feature (the number of attributes in the
feature) and the number of predictors to use during the building of the model. The
algorithm can then determine which attributes to include in the feature and how many
to include, up to the specified maximum. [Oracle Help for Java, 1997-2004] A
greater feature depth, as well as a greater number of predictors included, will result in
a slower model building process.
The MultiFeatureBuild model does not generate any rules. This model builds a form
of Naïve Bayes model and creates one or more features made up of a number of
attributes. The parameters required by this kind of model are the maximum number
of features to build and, as with the SingleFeatureBuild model type, the maximum
number of predictors or attributes to use while the model is built. Also to be specified
are the maximum number of failures to allow when a feature is tested against model
accuracy before it is discarded, and the number of attributes allowed in a feature.
[Oracle Help for Java, 1997-2004] Again, a greater feature depth, a greater number of
predictors and a greater number of failures allowed will result in a slower model
build process.
Figure 3 Adaptive Bayes Network algorithm settings
The NaiveBayesBuild model type does not generate rules either and, like the
MultiFeatureBuild, also builds a form of Naïve Bayes model. The maximum number
of predictors to consider during the build process must be specified by the user.
[Oracle Help for Java, 1997-2004] Again, the greater the number of predictors the
algorithm must consider, the slower the model building will be.
The model type used for all the Adaptive Bayes Network models in this evaluation
was SingleFeatureBuild. It was chosen because, from the descriptions of the model
types, it appears to be the one least similar to a Naïve Bayes model. It is also the only
model type that produces rules, and the rules produced by the model would be of
interest in determining which aspects of the data influenced the predictions made by
the model.
2.6 Chapter Summary
This chapter has described what it is hoped will be achieved by building data
mining models using the Naïve Bayes and Adaptive Bayes Network algorithms in the
Oracle Data Mining suite. The reasons for selecting Oracle Data Mining for this
research have been highlighted. The models built using the algorithms have been
outlined and the parameters required by each algorithm have been described. The
source of the data used for this evaluation has been explained as well as how the data
sets for the data mining were created.
The next chapter will describe the process of preparing the data, building the models,
testing the models and training and tuning the models.
Chapter 3 Classification Models
This chapter describes the process of building the Classification models. The process
of preparing the data to create the build and test data sets is discussed and the Priors
technique is introduced. The actual model building is explained in this chapter. The
model testing process is described including aspects like model accuracy, confusion
matrices and model lift. The process of training and tuning the models to increase
their effectiveness is explained. This chapter provides insight into how ODM provides
data mining functionality.
3.1 Preparing the Data
3.1.1 Build and Test Data Sets
Pyle [2000] emphasises the importance of proper data preparation for data mining and
says the benefits of mining properly prepared data include the faster creation of
more effective models. He states that at least two outputs are required from data
preparation: the training data set which is used for building the model and the testing
data set which helps detect overtraining (noise trained into the model). These data sets
are used by the data mining suite later in the data mining process.
In the case of this evaluation it was necessary to use the data in WEATHER_BUILD
to create the training and testing data sets. DM4J provides a tool which allows the
user to create randomized build and test tables from the existing data. The wizard is
known as the Transformation Split wizard and is specifically developed for use with
Classification models.
The wizard allows the user to select which data is to be used to create the new tables
as well as to specify what percentage of records in the original data should be
allocated to each of the build and test tables. WEATHER_BUILD was used as the
original data and 75% of the records were allocated to the build table and 25% were
placed in the test table. That is, 1951 records were randomly selected from
WEATHER_BUILD and placed in the build table and the remaining 650 records were
placed in the test table. These ratios were chosen because the varying nature of the
weather data meant it would be more beneficial to have a larger number of cases in
the build data set, thus allowing the data mining model to be aware of a larger number
of cases that influenced the target attribute RAIN.
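Although the split itself was produced by the Transformation Split wizard, the underlying operation can be sketched in plain Python. The record values below are invented stand-ins; the real wizard operates on database tables:

```python
import random

# Hypothetical stand-in for WEATHER_BUILD: 2601 records with a RAIN outcome.
records = [{"id": i, "RAIN": "yes" if i % 3 == 0 else "no"} for i in range(2601)]

random.seed(42)                           # make the random split repeatable
shuffled = random.sample(records, k=len(records))

split_point = int(len(shuffled) * 0.75)   # 75% build / 25% test
build_table = shuffled[:split_point]      # plays the role of THE_BUILD
test_table = shuffled[split_point:]       # plays the role of THE_TEST

print(len(build_table), len(test_table))
```

Because `random.sample` draws without replacement, no record can appear in both tables.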
The wizard produced the Transformation Split component which was run and the
resulting tables were named THE_BUILD and THE_TEST and were stored in the
database along with the original data.
3.1.2 Priors
In a number of scenarios where the variable that is being predicted is binary in nature,
one outcome of this variable may occur more frequently in the data than the other.
When the model is built from such data the model may not observe enough of the one
case to build an accurate model and may predict the other case nearly every time but
still show a high accuracy during testing. In order to prevent this from occurring, it is
necessary to create a build table that has approximately equal numbers of each
outcome and also to supply the algorithm with the original distribution of the data or
the prior distribution. This technique, known as Priors, should result in a more
effective model. However, the model must be tested against data of the original
distribution. [Oracle Data Mining Tutorial, Release 9.0.4, 2004]
In order to determine the effect of using such a technique as a form of data
preparation it was decided to build models using both algorithms that would use data
prepared in this way. When the data in THE_BUILD data set was examined it was
apparent that the ‘no’ outcome occurred more frequently than the ‘yes’ for the target
attribute RAIN as shown in Figure 4. An outcome of ‘no’ occurred 1242 times and a
‘yes’ occurred 727 times.
[Figure: histogram of the RAIN attribute in THE_BUILD — bins ‘yes’ and ‘no’ on the x-axis, bin count (0 to 1400) on the y-axis]
Figure 4. Data Distribution for RAIN Attribute from THE_BUILD data set
It was possible to create a build data set with a more even distribution of the target
attribute. This was accomplished using the ODM browser and a Transformation
wizard which created a stratified sample of the data with a balanced distribution of the
target attribute. Stratified random sampling divides the data set into subpopulations
and samples are then taken from these in proportion to subpopulation size.
[Fernandez, 2003] As there were 727 cases of ‘yes’ for the RAIN attribute, creating a
balanced data set would require a data set of approximately twice that size (1454).
This data set was created by the wizard and named THE_BUILD1. When the
distribution of the RAIN attribute was inspected again a more balanced distribution
was shown as depicted in Figure 5.
[Figure: histogram of the RAIN attribute in THE_BUILD1 — bins ‘yes’ and ‘no’ on the x-axis, bin count (0 to 1400) on the y-axis]
Figure 5. Data Distribution for RAIN Attribute from THE_BUILD1 data set
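The rebalancing that produced THE_BUILD1 can be sketched as follows. This is a minimal illustration that simply downsamples the majority class to the minority count, one way of arriving at a balanced set of roughly 1454 records; the real transformation was performed by the ODM wizard:

```python
import random

random.seed(0)
# Hypothetical build set mirroring THE_BUILD's distribution:
# 1242 'no' and 727 'yes' outcomes for the target attribute RAIN.
build = [{"RAIN": "no"} for _ in range(1242)] + [{"RAIN": "yes"} for _ in range(727)]

yes_cases = [r for r in build if r["RAIN"] == "yes"]
no_cases = [r for r in build if r["RAIN"] == "no"]

# Downsample the majority class so both outcomes occur equally often,
# giving a balanced set of twice the minority count (2 * 727 = 1454).
balanced = yes_cases + random.sample(no_cases, k=len(yes_cases))
random.shuffle(balanced)

print(len(balanced))  # 1454
```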
The data sets THE_BUILD and THE_BUILD1 were used to build models using each
algorithm and tested on the same test data, THE_TEST, in order to allow for an
evaluation of the effect the distribution of the data has on the resulting models.
3.2 Building the Models
In total, 8 classification models were built using DM4J, four using the Naïve Bayes
for Classification algorithm and four using the Adaptive Bayes Network for
Classification algorithm. Of the four for each algorithm, two models were built using
the data set THE_BUILD, where the Priors technique was not used and
weighting was used in one of the two; two models were built using THE_BUILD1
with the Priors technique, and again weighting was used in one of the two.
Weighting and its effects on the models will be discussed later in this chapter. All the
models were built using the attribute RAIN as the target value. This means the models
were built in order to predict the outcome, ‘yes’ or ‘no’, of RAIN when applied to
new data.
3.2.1 Building the Naïve Bayes Models
3.2.1.1 nbBuild
The first model built was named nbBuild and used the data set THE_BUILD which
had the uneven distribution of the target attribute RAIN (as discussed in section
3.1.2). The Naïve Bayes algorithm was used with the default algorithm settings,
meaning the singleton threshold was 0.1 and the pairwise threshold was 0.1
for the model.
3.2.1.2 nbBuild2
The second model was named nbBuild2 and made use of the data set THE_BUILD1
which was adjusted using stratified sampling and the Priors technique to have an even
distribution of the target value RAIN. When making use of the Priors technique it was
necessary to specify in the model build wizard what the original distribution of the
data had been in order for the algorithm to be aware of this when making its
classifications. The values supplied at this stage of the model build process are shown
in Figure 6. Again, the default algorithm settings of 0.1 for the pairwise and singleton
thresholds were used.
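For readers unfamiliar with the algorithm, the role of the supplied prior can be illustrated with a generic Naïve Bayes sketch. This is not ODM's implementation: the attribute values are invented, Laplace smoothing is assumed, and the singleton and pairwise thresholds are omitted:

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, target, priors=None):
    """Estimate P(class) and P(attribute=value | class) from the rows.
    `priors`, if given, overrides the class distribution observed in the
    (possibly rebalanced) build data -- the role of ODM's Priors setting."""
    counts = Counter(r[target] for r in rows)
    total = sum(counts.values())
    class_prior = priors or {c: n / total for c, n in counts.items()}
    cond = defaultdict(Counter)
    for r in rows:
        for attr, val in r.items():
            if attr != target:
                cond[(r[target], attr)][val] += 1
    return class_prior, cond, counts

def predict(row, target_classes, class_prior, cond, counts):
    scores = {}
    for c in target_classes:
        p = class_prior[c]
        for attr, val in row.items():
            # Laplace-smoothed conditional probability
            p *= (cond[(c, attr)][val] + 1) / (counts[c] + 2)
        scores[c] = p
    return max(scores, key=scores.get)

# Toy data: cold, westerly-wind hours tend to be rainy.
rows = [
    {"CHILL": "low", "WDIR": "west", "RAIN": "yes"},
    {"CHILL": "low", "WDIR": "west", "RAIN": "yes"},
    {"CHILL": "high", "WDIR": "east", "RAIN": "no"},
    {"CHILL": "high", "WDIR": "west", "RAIN": "no"},
]
prior, cond, counts = train_naive_bayes(rows, "RAIN", priors={"yes": 0.37, "no": 0.63})
print(predict({"CHILL": "low", "WDIR": "west"}, ["yes", "no"], prior, cond, counts))
```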
3.2.2 Building the Adaptive Bayes Network Models
3.2.2.1 abnBuild
The third model was named abnBuild and made use of the data set THE_BUILD.
The Adaptive Bayes Network algorithm was used and a model type of
SingleFeatureBuild was selected. This model type produces rules along with its
predictions. The settings for the model type were left at the defaults. These settings
included a maximum number of predictors of 25, a maximum network feature depth
of 10 and no time limit for the running of the algorithm.
3.2.2.2 abnBuild2
The fourth model was named abnBuild2 and made use of THE_BUILD1 which was
the adjusted data set. Again it was necessary to specify the original distribution of the
original data set as shown in Figure 6. A SingleFeatureBuild model type was selected
and the default settings as described above were used.
Figure 6. Extract from Classification Model Build Wizard, Priors Settings.
3.3 Testing the Models
Roiger and Geatz [2003] state that evaluation of supervised learning models involves
determining the level of predictive accuracy and that supervised learning models can
be evaluated using test data sets. Such models can be evaluated by comparing the test
set error rates of supervised learning models created from the same training data to
determine the accuracy of the models and which model is most effective. It is of
interest how ODM supports testing, whether the accuracy a model displays during
testing indicates how it will perform on new data and how the data used during testing
affects the results of testing the models.
The test model results produced by DM4J are depicted in confusion matrices.
Confusion matrices can be used to determine the accuracy of Classification models
and to show the number of false negative or false positive predictions made by the
model on the test data. Confusion matrices are best used for evaluating the accuracy
of models using categorical data, as is the case here. [Roiger and Geatz,
2003]
Roiger and Geatz [2003] provide an example of a confusion matrix as shown in
Table 2, where Model A is used to classify categorical data into two classes, Accept and
Reject. The rows in the table represent the actual values in the data and the columns
represent the predicted values. The model correctly classified 600 Accept instances
from the data and correctly classified 300 Reject instances. However, there were
actually 625 Accept instances in the data and 375 Reject instances. The model also
classified 675 instances as Accept instances and 325 instances as Reject instances.
The accuracy of the model is then determined by dividing 900 by 1000 and results in
a 90% accuracy or an error rate of 10%.
Example Model   Predicted Accept   Predicted Reject
Actual Accept   600                25
Actual Reject   75                 300
Table 2. Example Confusion Matrix
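The arithmetic behind this example is easily reproduced:

```python
# Roiger and Geatz's example confusion matrix, keyed (actual, predicted).
matrix = {
    ("Accept", "Accept"): 600, ("Accept", "Reject"): 25,
    ("Reject", "Accept"): 75,  ("Reject", "Reject"): 300,
}

total = sum(matrix.values())  # 1000 instances in all
correct = sum(n for (actual, predicted), n in matrix.items() if actual == predicted)

accuracy = correct / total
print(f"{accuracy:.0%} accuracy, {1 - accuracy:.0%} error rate")  # 90% accuracy, 10% error rate
```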
3.3.1 Model Accuracy
The four models discussed in the previous section were each tested on the same test
data set, THE_TEST, consisting of 633 records. The test accuracy for each model is
shown in Table 3. It is interesting to note the greater accuracy of the models built
using the Adaptive Bayes Network algorithm and that using the prior distribution
technique appears to have had a negative impact on the test accuracy of the models.
Model           nbBuild     nbBuild2    abnBuild    abnBuild2
Test Accuracy   72.35387%   71.09005%   85.15008%   84.9921%
Table 3. Model Test Accuracy Rates
3.3.2 Model Confusion Matrices
Testing the models produced a confusion matrix for each model which, when
examined, showed the tendencies of the individual model's predictions. The following
Tables 4-7 depict each model's confusion matrix, which is then discussed. Again, the
rows represent actual values and the columns represent the predicted values.
nbBuild      Predicted no   Predicted yes
Actual no    384            34
Actual yes   141            74
Table 4. Confusion Matrix for Model nbBuild Testing
When nbBuild was tested the model correctly predicted the value of the RAIN attribute in 384 + 74 =
458 cases out of 633 cases. As can be seen in the lower left corner of the matrix, the model also
incorrectly predicts a larger number (141) of ‘no’ values that are actually ‘yes’ values. This error will
be adjusted for when the model is tuned.
nbBuild2     Predicted no   Predicted yes
Actual no    320            98
Actual yes   85             130
Table 5. Confusion Matrix for Model nbBuild2 Testing
The nbBuild2 model correctly predicted the value of the RAIN attribute in 320 + 130
= 450 cases out of 633 cases. When tested this model shows less of a tendency for an
error in a certain direction, i.e. ‘yes’ or ‘no’, as the false prediction numbers of 98 and
85 are close. This can be attributed to the fact that the model was built using the Priors
technique, to compensate for the lower level of ‘yes’ values for RAIN in the original
data.
abnBuild     Predicted no   Predicted yes
Actual no    353            65
Actual yes   29             186
Table 6. Confusion Matrix for Model abnBuild Testing
The abnBuild model correctly predicted the value of the RAIN attribute in 353 + 186
= 539 cases out of 633 cases. This model shows a higher accuracy during testing than
the previous models built using Naïve Bayes. Testing also shows that this model
makes a larger number of incorrect ‘yes’ predictions. This effect could also be
minimised during tuning.
abnBuild2    Predicted no   Predicted yes
Actual no    346            72
Actual yes   23             192
Table 7. Confusion Matrix for Model abnBuild2 Testing
The abnBuild2 model correctly predicted the value of the RAIN attribute in 346+192
= 538 cases out of 633 cases. Similarly, this model shows a higher accuracy during
testing than those models built using Naïve Bayes. This model also tends to make a
larger number of incorrect ‘yes’ predictions. This too could be dealt with during
model tuning.
Once the accuracy of the models is tested it is possible to perform another kind of
model testing using cumulative gains charts or lift charts.
3.4 Calculating Model Lift
A lift or cumulative gains chart shows how well the model improves predictions of
positive target attribute outcomes over a sample of the data containing actual results.
The usefulness of such a technique would be apparent in a business problem where
predicted positive values in a model may indicate possible business opportunities. Lift
allows the miner to estimate how well the model will perform when applied to new
data. [Oracle Help for Java, 1997-2004]
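A cumulative gains calculation of this kind can be sketched as follows, using invented model scores rather than actual ODM output:

```python
# scored: (predicted probability of 'yes', actual outcome) pairs for a test
# set -- hypothetical scores; a real chart would use the model's output.
scored = [
    (0.95, "yes"), (0.90, "yes"), (0.85, "no"), (0.80, "yes"),
    (0.60, "no"), (0.55, "yes"), (0.40, "no"), (0.30, "no"),
    (0.20, "no"), (0.10, "no"),
]
scored.sort(key=lambda s: s[0], reverse=True)   # rank by descending score

total_pos = sum(1 for _, actual in scored if actual == "yes")
baseline_rate = total_pos / len(scored)         # hit rate of random selection

# Cumulative lift per decile: positive rate among the top-scored records
# divided by the positive rate of random selection.
for decile in range(1, 11):
    top = scored[: round(len(scored) * decile / 10)]
    found = sum(1 for _, actual in top if actual == "yes")
    lift = (found / len(top)) / baseline_rate
    print(f"top {decile * 10:3d}%: lift {lift:.2f}")
```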
Figure 7. nbBuild Lift Chart
Figure 7 shows the cumulative lift chart for nbBuild when applied to the test data set,
THE_TEST. The value in the first column, approximately 2.4, indicates that the
model should find approximately 2.4 times as many actual positive values for the
RAIN attribute as a random selection of 10% of the data would show.
Figure 8. nbBuild2 Lift Chart
Figure 8 depicts the cumulative lift chart for nbBuild2 when applied to the test data
set. In the first and second columns the graph indicates that the model should find
approximately 2.4 times as many positive values as random selection would.
Figure 9. abnBuild Lift Chart
Figure 9 shows the cumulative lift chart for abnBuild when applied to the test data
set. The value of approximately 2.6 indicates that the model finds approximately 2.6
times as many positive values as random selection would.
Figure 10. abnBuild2 Lift Chart
Figure 10 shows the cumulative lift chart for abnBuild2 when applied to the test data
set. Similarly, the value of approximately 2.6 indicates that the model finds
approximately 2.6 times as many positive values as random selection would.
It is evident from the above charts that, although the accuracy of the models is not
high in all cases, when applied to new data they should provide a far greater level of
accuracy than attempting to make predictions using no model at all.
3.5 Training and Tuning the Models
Using ODM it is possible to assign weights to the target value when using Naïve
Bayes or Adaptive Bayes so that the model predicts more of one kind of outcome if it
appears that there are a large number of false predictions of a certain kind when
testing the model. [Oracle Data Mining Tutorial, Release 9.0.4, 2004] This bias can be
built into the model to increase predictions of the desired target value.
In this investigation weighting was used to introduce this bias because when testing
the nbBuild model, it was apparent from the confusion matrix that a significant error
was encountered as the model predicted a large number of false negatives, ‘no’ values
for the target attribute RAIN that were in fact ‘yes’ values. These predictions were
false in 141 of the cases. This level of false predictions was high, thus it was viable to
use weighting in order to decrease the number of false negative predictions.
A weighting value is often chosen by trial and error and is then associated with a
certain type of prediction, false negative or positive, and the model will then treat a
false prediction of that kind as ‘the weighting value’ times as costly as an error of the
other kind. This forces the model to make more predictions in the other direction.
[Oracle Data Mining Tutorial, Release 9.0.4, 2004]
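The decision rule this describes can be sketched generically, using a weighting of 3 against false negatives, the value adopted later in this chapter; the exact cost formula ODM applies is not reproduced here:

```python
# Cost matrix keyed (actual, predicted). A false negative (actual 'yes'
# predicted 'no') is treated as 3 times as costly as a false positive.
cost = {("yes", "no"): 3.0, ("no", "yes"): 1.0,
        ("yes", "yes"): 0.0, ("no", "no"): 0.0}

def cost_sensitive_prediction(p_yes):
    """Choose the prediction with the lowest expected cost, given the
    model's estimated probability that the actual outcome is 'yes'."""
    p_no = 1.0 - p_yes
    expected = {
        "yes": p_no * cost[("no", "yes")],   # cost if we say 'yes' but it's 'no'
        "no": p_yes * cost[("yes", "no")],   # cost if we say 'no' but it's 'yes'
    }
    return min(expected, key=expected.get), expected

# With equal costs the decision threshold would be 0.5; weighting false
# negatives by 3 lowers it to 0.25, so borderline cases flip to 'yes'.
print(cost_sensitive_prediction(0.30))
```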
As it was apparent during testing that nbBuild predicted a large number of false
negatives and as this was the most substantial error out of all the models, it was
decided to build another four models, two more for each algorithm, which
incorporated a weighting of 3 against false negatives. The weighting value of 3 was
chosen after some experimentation and used on all the new models as shown in the
extract of the model build wizard for abnBuild4 in Figure 11. The Priors technique
was used in one case for each algorithm.
The models were then tested on the same test data set, THE_TEST, which the
previous models were tested on. Table 8 represents the previous models’ test accuracy
rates and Table 9 represents the new weighted models’ test accuracy rates.
Model           nbBuild (no Priors)   nbBuild2    abnBuild (no Priors)   abnBuild2
Test Accuracy   72.35387%             71.09005%   85.15008%              84.9921%
Table 8. Unweighted Models’ Test Accuracy Rates
Model           nbBuild3 (no Priors)   nbBuild4    abnBuild3 (no Priors)   abnBuild4
Test Accuracy   72.511846%             68.24645%   77.40916%               77.40916%
Table 9. Weighted Models’ Test Accuracy Rates
Figure 11. Extract Showing Weighting of Model Build Wizard for abnBuild4
In only one case, nbBuild3, did weighting improve model test accuracy when
compared to the model with the same settings, nbBuild, before weighting was added.
It is of interest to compare the confusion matrices for these two models.
nbBuild      Predicted no   Predicted yes
Actual no    384            34
Actual yes   141            74
Table 10. nbBuild Confusion Matrix
nbBuild3     Predicted no   Predicted yes
Actual no    381            37
Actual yes   137            78
Table 11. nbBuild3 Confusion Matrix
Table 10 shows the confusion matrix for nbBuild and Table 11 shows the confusion
matrix for the weighted model nbBuild3. nbBuild3 was weighted 3 against false
negatives. The effects of this weighting are shown in the decrease of false ‘no’
predictions, from 141 to 137, the increase in correct ‘yes’ predictions, from 74 to 78,
and the increase in false ‘yes’ predictions, from 34 to 37. The effect of the weighting
seems minimal but can be increased by increasing the value of the weighting.
However, since the weighting appears to have had a negative impact on the test
accuracy of other models it was decided to leave the value at 3.
3.6 Applying the Models to New Data
At this stage it is necessary to provide a summary of the models built and tested thus
far. This summary is provided in Table 12.
Classification Algorithm         Naïve Bayes   Adaptive Bayes Network
No weighting, no use of Priors   nbBuild       abnBuild
No weighting, use of Priors      nbBuild2      abnBuild2
Weighting, no use of Priors      nbBuild3      abnBuild3
Weighting, use of Priors         nbBuild4      abnBuild4
Table 12. Summary of Classification Models
The models were applied to the new data in the WEATHER_APPLY set. The results
were depicted according to the unique THE_TIME attribute for each record and
showed a prediction, ‘yes’ or ‘no’, of whether it was likely to rain. The results were
exported to spreadsheets to allow for inspection and the comparisons are discussed in
the following chapters.
3.7 Chapter Summary
The eight Classification models that have been built have been discussed and it is
apparent that the algorithms have found a pattern in the data. The support ODM
provides for the process of preparing the data to build the models has been described
and the Priors technique has been explained. The model testing process has been
described and has given an indication of the accuracy of the models. It will be
interesting to compare this accuracy with the accuracy the models exhibit when
applied to new data. Model lift has been calculated for the models. Four of the models
have been tuned by introducing weighting into the models. The models have been
applied to new data and the results of this are described in the next chapter.
Chapter 4 Model Results
This chapter describes the results obtained when the models were
applied to new data. Extracts of the results are provided to show
how these can be interpreted. The rules associated with the
predictions made by the Adaptive Bayes Network models are
explained. As a form of external validation the predictions made by
the models are compared to the actual values in the original data.
The results of this validation are compared for the eight models in
order to determine which model is most effective when applied to
new data and with what settings this model was built.
4.1 Results of Application to New Data
The eight classification models were applied to the new data in the
WEATHER_APPLY data set. This data set consisted of 290 records all
of which had had the value for the RAIN attribute removed. These
values had been stored for later comparison. The results were
depicted by THE_TIME attribute and showed a prediction ‘yes’ or
‘no’, of whether it was likely to rain for all 290 records. The
probability of this prediction was also depicted as shown in a sample
from the results for nbBuild in Table 13. The results in this extract
can be interpreted as follows: at THE_TIME value 1, it is
predicted that no rain will have been measured, and this prediction
is given with a probability of 0.9999. At THE_TIME value
138, it is predicted that rain will have been measured, with a
probability of 0.6711.
PREDICTION   PROBABILITY   THE_TIME
no           0.9999        1
yes          0.6711        138
Table 13. Extract of results from model nbBuild
Those models that were weighted provided predictions and cost
figures. This cost figure is provided instead of probability as the
model makes predictions based on the cost of an incorrect prediction
to the model’s accuracy. This cost figure is determined by the
weighting of a certain type of false prediction when the model is
tuned and the algorithm then attempts to minimise costs when
making predictions. An extract from these types of results is shown
in Table 14. This extract can be interpreted as follows: at THE_TIME
value 1, it is predicted that no rain will have been measured, and
the cost of such a prediction is 0. At THE_TIME value 138, it is
predicted that rain will have been measured; if this prediction is
incorrect, the cost is higher, at 0.3288, because a
target value of ‘yes’ was weighted to avoid false negatives. Low cost
can be interpreted as high probability as can be seen from
comparing the two extracts, but it is not possible to directly
calculate probability from cost. [Oracle Data Mining Tutorial,
Release 9.0.4, 2004]
PREDICTION   COST     THE_TIME
no           0        1
yes          0.3288   138
Table 14. Extract of results from model nbBuild3
4.1.1 Rules Associated with Adaptive Bayes Network Predictions
Those models that were built using the Adaptive Bayes Network algorithm provide
the same format of results as shown in Tables 13 and 14 but also provide the rule with
which the associated prediction was made. During the model build stage these rules
are generated and then predictions are made using these rules when the model is
applied to new data. However, not all rules are made use of when the model is applied
to new data. The format of these results is shown in Table 15.
PREDICTION   PROBABILITY   RULE_ID   THE_TIME
no           0.5418        52        1
yes          0.6677        53        138
Table 15. Extract of Results from model abnBuild showing rules
After inspecting the spreadsheets containing the results of those models built using the
Adaptive Bayes Network algorithm, it was apparent that when the models were
applied to the new data only 8 of the 61 rules generated during the model building
process were used to make the predictions. These 8 rules will be expanded upon in
Table 16.
Rule ID   If (Condition)                                       Then (classification)   Confidence    Support
2         CHILL in (37 - 46.6)                                 no                      0.63258135    0.104113765
38        CHILL in (37 - 46.6) and WDIR in (22 - 89.6)         yes                     0.6427132     0.019299136
43        CHILL in (37 - 46.6) and WDIR in (89.6 - 157.2)      yes                     0.94884205    0.019807009
44        CHILL in (46.6 - 56.2) and WDIR in (89.6 - 157.2)    yes                     0.9037015     0.014728288
48        CHILL in (46.6 - 56.2) and WDIR in (157.2 - 224.8)   yes                     0.8486806     0.031488065
52        CHILL in (37 - 46.6) and WDIR in (224.8 - 292.4)     no                      0.54187334    0.019807009
53        CHILL in (46.6 - 56.2) and WDIR in (224.8 - 292.4)   yes                     0.6677172     0.1777552
57        CHILL in (37 - 46.6) and WDIR in (292.4 - 360)       no                      0.93961054    0.0726257
Table 16. Rules used by Adaptive Bayes Network Models to Make Predictions
These rules can be interpreted as follows for rule 52:
IF
    CHILL in (37 - 46.6) and WDIR in (224.8 - 292.4)
THEN
    RAIN equal (no)
Confidence = 0.54187334
Support = 0.019807009
The support value given with a rule indicates the percentage of cases
in the build data set that meet the conditions of the rule and have the same
predicted target value. The confidence value indicates the improvement in the
accuracy of the model that has been made by adding the rule. [Oracle Data
Mining Tutorial, Release 9.0.4, 2004]
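Under their standard definitions (which may differ in detail from ODM's), the two measures can be computed directly from a build set; the rows below are invented for illustration:

```python
# Hypothetical build rows with a CHILL value, a WDIR value and the RAIN outcome.
rows = [
    {"CHILL": 40.0, "WDIR": 250.0, "RAIN": "no"},
    {"CHILL": 42.5, "WDIR": 260.0, "RAIN": "yes"},
    {"CHILL": 44.0, "WDIR": 230.0, "RAIN": "no"},
    {"CHILL": 60.0, "WDIR": 100.0, "RAIN": "yes"},
]

def rule_52(r):
    """Condition of rule 52: CHILL in (37 - 46.6) and WDIR in (224.8 - 292.4).
    Interval endpoints are assumed half-open; ODM's binning may differ."""
    return 37 < r["CHILL"] <= 46.6 and 224.8 < r["WDIR"] <= 292.4

matching = [r for r in rows if rule_52(r)]
predicting_no = [r for r in matching if r["RAIN"] == "no"]

support = len(predicting_no) / len(rows)         # condition AND class, over all cases
confidence = len(predicting_no) / len(matching)  # P(class | condition)
print(f"support={support:.2f} confidence={confidence:.2f}")
```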
4.2 Comparison of Model Results
Once the models had been applied to new data and the results of this step had been
exported to spreadsheets, it was possible to replace the original values of the RAIN
attribute in the WEATHER_APPLY data set to allow the effectiveness of the models
to be evaluated. After the RAIN attribute was replaced in the original data, each
prediction made by each model was compared to the actual value of the RAIN
attribute for that record. The number of correct predictions was counted and the
percentage of correct predictions was calculated. It is also of interest to consider the
accuracy of the model during testing when evaluating the effectiveness of the model
when applied to new data. This makes it possible to determine whether testing results
give a good indication of model performance when applied to new data. These results
are depicted in Table 17.
Model       Model Settings                   Number of Correct Predictions (out of 290)   Percentage of Correct Predictions   Model Accuracy During Testing
nbBuild     No weighting, no use of Priors   40                                           13.79%                              72.35386%
nbBuild2    No weighting, use of Priors      107                                          36.90%                              71.09005%
nbBuild3    Weighting, no use of Priors      40                                           13.79%                              72.511846%
nbBuild4    Weighting, use of Priors         185                                          63.79%                              68.24645%
abnBuild    No weighting, no use of Priors   123                                          42.41%                              85.15008%
abnBuild2   No weighting, use of Priors      123                                          42.41%                              84.9921%
abnBuild3   Weighting, no use of Priors      212                                          73.10%                              77.40916%
abnBuild4   Weighting, use of Priors         212                                          73.10%                              77.40916%
Table 17. Summary of Accuracy of Predictions When Compared to Actual Data
It is also of interest to directly compare the new-data results of those
models built using different algorithms but the same settings in terms of weighting
and use of Priors.
This is depicted in Table 18.
Models                  Settings                         Naïve Bayes % Correct   Adaptive Bayes Network % Correct
nbBuild vs abnBuild     No weighting, no use of Priors   13.79%                  42.41%
nbBuild2 vs abnBuild2   No weighting, use of Priors      36.90%                  42.41%
nbBuild3 vs abnBuild3   Weighting, no use of Priors      13.79%                  73.10%
nbBuild4 vs abnBuild4   Weighting, use of Priors         63.79%                  73.10%
Table 18. Comparison of Models Built using Same Settings
4.3 Chapter Summary
This chapter has provided a description of the results that were obtained when the
models were applied to new data including the rules the Adaptive Bayes Network
algorithm used to make its predictions. The predictions made by the models were
compared to the original values for RAIN in the data. The accuracy of the predictions
was compared between the models, as well as with the accuracy the models showed during
testing.
In the following chapter the results of applying the models to new data and the results
of the comparisons between the models will be interpreted as part of the evaluation of
ODM.
Chapter 5 Interpretation of Results
This chapter provides an interpretation of the results obtained from the data mining
models built using similar techniques but different algorithms. Each comparison
between the models is interpreted and reasons presented for the results obtained. The
effectiveness of all the models is compared and the significance of these observations
discussed.
5.1 Comparison of Model Results
As presented in Table 18 in the previous chapter, the percentage of
correct predictions for each model built using the Naïve Bayes
algorithm was compared to that of the model built using the
Adaptive Bayes Network algorithm using similar techniques, in
terms of Priors and weighting. Table 19 includes the accuracy during testing
for the models along with the other results.
Comparison   Models                  Settings                         Naïve Bayes % Correct   Adaptive Bayes Network % Correct   Naïve Bayes Test Accuracy   Adaptive Bayes Network Test Accuracy
1            nbBuild vs abnBuild     No weighting, no use of Priors   13.79%                  42.41%                             72.35386%                   85.15008%
2            nbBuild2 vs abnBuild2   No weighting, use of Priors      36.90%                  42.41%                             71.09005%                   84.9921%
3            nbBuild3 vs abnBuild3   Weighting, no use of Priors      13.79%                  73.10%                             72.511846%                  77.40916%
4            nbBuild4 vs abnBuild4   Weighting, use of Priors         63.79%                  73.10%                             68.24645%                   77.40916%
Table 19. Comparison of models built using the same techniques, showing accuracy during testing
When each comparison is inspected, it is apparent that in all cases
when the models have been applied to new data, those models built
using the Adaptive Bayes Network algorithm outperform those built
using Naïve Bayes. In all except comparison 2, the percentage of
correct predictions for the Adaptive Bayes Network models are
markedly higher than those for the Naïve Bayes models.
During testing, those models built using the Adaptive Bayes
Network algorithm showed a higher level of accuracy than those
models built using Naïve Bayes. However, this difference in test
accuracy between the models in each comparison is not nearly as
large as that demonstrated when the models are applied to new
data.
In the following subsections each comparison is interpreted and
discussed.
5.1.1 Comparison 1
The models in this comparison were built without making use of the
Priors technique and were not tuned using weighting to introduce
bias. The model built using the Adaptive Bayes Network algorithm,
abnBuild, correctly predicted 42.41% of the RAIN attribute
outcomes when applied to new data whereas the model built using
Naïve Bayes, nbBuild, only predicted 13.79% of the outcomes
correctly. During testing, nbBuild showed an accuracy of 72.35386%
and abnBuild showed an accuracy of 85.15008%.
The fact that nbBuild showed a relatively high test accuracy but a low accuracy on
new data can be attributed to the unbalanced distribution of RAIN outcomes in the
build data set, THE_BUILD. During model building the algorithm did not observe
enough cases of one outcome of the target attribute to build an accurate model, yet
the model still tested well because the test data set, THE_TEST, had a distribution
similar to that of the build data. Thus, when applied to the new data,
WEATHER_APPLY, the model was shown to be ineffective.
abnBuild showed a higher overall accuracy than nbBuild, which was expected as the
Adaptive Bayes Network algorithm is said to build more effective models [Berger,
2004].
5.1.2 Comparison 2
The models in this comparison were built using the Priors technique
in order to minimise the effect of an unbalanced distribution of
outcomes for the RAIN attribute in the build data set. When applied
to the new data, nbBuild2 correctly predicted 36.90% of the
outcomes of RAIN and abnBuild2 correctly predicted 42.41%.
During testing, nbBuild2 showed an accuracy of 71.09005% and
abnBuild2 showed an accuracy of 84.9921%.
When the Priors technique was implemented and the models applied to new data,
nbBuild2 showed an increase in accuracy of 23.11% compared to nbBuild. However,
the test accuracy of this model decreased by a little over 1%.
The increase in accuracy when the model was applied to new data can be attributed to
the fact that the build data set used to build this model had a balanced distribution of
outcomes for the RAIN attribute, thus allowing the model to observe a sufficient
number of cases of each outcome to ensure a more effective model.
The slight decrease in test accuracy is due to the fact that the test data set had a similar
distribution to the original build data set with the uneven distribution of the target
attribute.
abnBuild2 showed the same accuracy when applied to new data as the Adaptive
Bayes Network model that did not make use of the Priors technique. However,
abnBuild2 showed a decrease in test accuracy from that of abnBuild. The fact that
both Adaptive Bayes Network models showed the same accuracy when applied to
new data is indicative of the effectiveness of the algorithm for building a model from
data with varying distributions of the target attribute. The decrease in test accuracy of
abnBuild2 can be attributed to the fact that the build data has a more even
distribution of the target attribute and the test data set has a less even distribution of
the target attribute.
5.1.3 Comparison 3
The models in this comparison were built from the original build data with the uneven
distribution of the target attribute, RAIN. These models were tuned as it was evident
during testing of the model nbBuild, that the model predicted a large number of false
‘no’ values for the target attribute. Thus, it was viable to introduce bias into the model
by using weighting. The models were weighted 3 against false negatives, implying
that the cost of predicting a false negative was 3 times that of predicting a false
positive.
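The decision rule underlying such a cost matrix can be sketched as follows. This is not ODM's implementation, only an illustration of the principle that a cost-weighted model predicts the outcome minimising expected cost rather than the most probable outcome; the probabilities shown are invented:

```python
def predict_with_costs(probs, costs):
    """Choose the outcome with the lowest expected misclassification cost.

    probs: predicted probability of each outcome, e.g. {"yes": 0.3, "no": 0.7}
    costs: costs[actual][predicted] -- the cost of predicting `predicted`
           when the true outcome is `actual`.
    """
    def expected_cost(predicted):
        return sum(probs[actual] * costs[actual][predicted] for actual in probs)
    return min(probs, key=expected_cost)

# A false negative (actual "yes", predicted "no") costs 3 times a false positive:
costs = {"yes": {"yes": 0, "no": 3},
         "no":  {"yes": 1, "no": 0}}

# An unweighted model would predict "no" here (0.7 > 0.3); the 3:1 cost
# ratio tips the decision to "yes" (expected cost 0.7 versus 0.9).
print(predict_with_costs({"yes": 0.3, "no": 0.7}, costs))  # "yes"
```

This is why weighting biases the model away from false 'no' predictions: errors of that kind become three times as expensive, so the algorithm avoids them where the probabilities are close.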
With this weighting in place, nbBuild3 showed no improvement in accuracy when
applied to new data compared to the first Naïve Bayes model. The accuracy remained
at 13.79%. The accuracy of nbBuild3 during testing only improved by 0.157986%
compared to the first model.
It is unexpected that nbBuild3's accuracy on new data did not improve after
weighting was introduced. This could indicate that the model's ineffectiveness
stems mainly from its build data not having been balanced using the Priors
technique.
The slight improvement in nbBuild3's test accuracy can be attributed to the
weighting, but it shows that weighting alone had little effect on the model's
accuracy when applied to new data.
A dramatic improvement in the Adaptive Bayes Network model was observed after
introducing weighting. The accuracy of the model when applied to new data increased
to 73.10% even though during testing the accuracy of abnBuild3 dropped to
77.40916%.
It can be deduced that introducing the weighting had a significant impact on
abnBuild3's accuracy when applied to new data, as the model has been sufficiently
tuned to avoid errors of a certain kind.
The accuracy of the first Adaptive Bayes Network model during testing was
85.15008%. Introducing weighting reduced this accuracy to 77.40916% even though
the weighted model markedly outperforms the non-weighted model when applied to
new data. It must also be emphasised that in this case the accuracy during testing and
the accuracy on new data are relatively close, which had not occurred with previous
models. For this reason, introducing weighting can be said to have improved the
model's effectiveness both in testing and when the model is applied to new data.
5.1.4 Comparison 4
The models in this comparison, nbBuild4 and abnBuild4, were built using the Priors
technique and tuned by introducing a weighting of 3 against false negatives into the
model.
nbBuild4 showed a significant improvement in accuracy when applied to new data,
correctly predicting 63.79% of the outcomes. The accuracy during testing and
application of abnBuild4 was unchanged from that of the previous model that only
made use of weighting.
It is apparent that nbBuild4 improved due to the combination of the Priors
technique and weighting. These results show that building the model on data with a
balanced distribution of the target attribute and then introducing weighting
enhances the effect of the weighting: an effective model is built first and then tuned
to increase its overall accuracy.
Also to be noted is that the accuracy of nbBuild4 during testing, 68.24645%, is
closer to its accuracy on new data than is the case for the other Naïve Bayes
models. This suggests that building the model with Priors and weighting increased
its effectiveness both in testing and in application to new data.
The accuracy of abnBuild4 during application to new data and during testing
remained the same as for the previous model, which did not make use of the Priors
technique. It is therefore apparent that the Priors technique had no impact on this
model's effectiveness, whereas tuning with weighting improved it significantly.
5.2 Effectiveness of Models
Figure 12 graphically portrays the accuracy of the models when applied to the new
data set, WEATHER_APPLY, as well as the settings for each pair of models.
In the case of the Naïve Bayes models, the most effective model correctly predicted
63.79% of the RAIN outcomes and was built using the Priors technique and
introducing a weighting of 3 against false negatives.
The most effective Adaptive Bayes model correctly predicted 73.10% of the RAIN
outcomes and was built by introducing a weighting of 3 against false negatives into
the model. The Priors technique had no influence on the accuracy of the Adaptive
Bayes models when they were applied to new data.
Figure 13 graphically portrays the accuracy of the models when tested on the test data
set, THE_TEST, as well as the settings for each pair of models.
The most accurate Naïve Bayes model during testing was the one that used
weighting but not the Priors technique, showing an accuracy of 72.51%. However,
this model was also one of the two that performed most poorly when applied to
new data, correctly predicting only 13.79% of the RAIN outcomes. Its high test
accuracy is attributed to the build and test data sets sharing a similarly unbalanced
distribution of outcomes for the target attribute. This caused the model to perform
well during testing but to be ineffective when applied to the new data set, which
had a different distribution of outcomes.
The Adaptive Bayes Network model that demonstrated the greatest accuracy during
testing, 85.15%, was the one that made no use of the Priors technique and had no
bias introduced in the form of weighting. It was also one of the two Adaptive Bayes
Network models that performed most poorly when applied to new data, predicting
only 42.41% of the outcomes of RAIN correctly. Since the Priors technique had no
effect on the models' effectiveness on new data, it can be deduced that the
introduction of weighting improved this model's accuracy on new data even though
this was not reflected during testing. These results raise some questions about the
effectiveness of the model testing process.
[Figure: bar chart, "Model Results". X-axis: Model Settings (no weighting/no
priors; no weighting/priors; weighting/no priors; weighting/priors). Y-axis:
Accuracy, 0%–80%. Series: Naïve Bayes and Adaptive Bayes Network.]
Figure 12 Model results and settings of application to new data
[Figure: bar chart, "Testing Results". X-axis: Model Settings (no weighting/no
priors; no weighting/priors; weighting/no priors; weighting/priors). Y-axis:
Accuracy, 0%–90%. Series: Naïve Bayes and Adaptive Bayes Network.]
Figure 13 Model results and settings during testing
5.3 Significance of Results
It is significant that the effectiveness on new data of the models built using the
Adaptive Bayes Network algorithm was not affected by the use of the Priors
technique, which attempts to ensure a balanced distribution of the target attribute
in the build data set. This could indicate the algorithm's effectiveness at
incorporating rare occurrences in the data into the model. The opposite holds for
the models built using the Naïve Bayes algorithm: their effectiveness was markedly
improved by the Priors technique, indicating that this algorithm requires data
preparation.
Also to be emphasised is the effect of introducing weighting into the models of both
kinds. In the case of those models built using the Adaptive Bayes Network algorithm,
the introduction of weighting provided a dramatic increase in the accuracy of the
results when the model was applied to new data. Those models built using the Naïve
Bayes algorithm only benefited from the introduction of weighting when the Priors
technique had also been used in the model. This could indicate that the effect of
introducing weighting in order to tune the model is most beneficial when the model is
already at a relatively high level of effectiveness.
It was interesting to note the discrepancies between the models’ accuracy during
testing and accuracy when applied to new data. In all cases, the test accuracy was
higher than the accuracy calculated when the model was applied to new data. In some
cases this difference was significant. This could indicate the impact the nature of the
test data has on the results of model test accuracy. The data used for testing the
models was created from the data set used to build the models using the
Transformation Split wizard as discussed in Chapter 3. For this reason, the
distribution of the target attribute in both data sets was similar, which positively
influenced the accuracy of the models built from and tested on similar data. External
validation of the models’ performance on the new data emphasised this influence.
These findings indicate the need for test data sets that show fewer similarities to the
build data sets and question the use of the Transformation Split wizard to create build
and test data sets from data that shows a specific distribution of the target attribute.
5.4 Chapter Summary
After interpreting the results obtained from the different models it is apparent that the
most effective model was built using the Adaptive Bayes Network algorithm with a
weighting of 3 against false negatives. It was apparent that the results obtained from
those models built using the Adaptive Bayes Network algorithm were not affected by
the use of the Priors technique whereas the results of the models built using the Naïve
Bayes algorithm were. Weighting had an effect on the results obtained from both
kinds of models but was only noticeable in the case of the Naïve Bayes models when
the Priors technique was used. Also to be noted is that the accuracy of the models
during testing does not always indicate the effectiveness of the models when applied
to new data.
The following chapter will discuss the conclusions that can be drawn from the results
obtained in this chapter.
Section 3 Conclusion
Chapter 6 Conclusions Drawn from Results
This chapter will draw conclusions from the results presented in the previous
chapters. The first set of conclusions will be made from the actual results obtained
when the models were applied to new data. The next set will consider the effect the
data used during the data mining had on the results obtained and lastly, conclusions
regarding Oracle Data Mining will be drawn.
6.1 Conclusions Regarding Model Results
The most effective model built using the Naïve Bayes algorithm correctly predicted
the outcome of the RAIN attribute for 63.79% of the 290 records. The model built
using this algorithm and no other techniques correctly predicted only 13.79% of the
outcomes.
Introduction of bias into the model using weighting had no effect on this accuracy.
Use of the Priors technique increased this accuracy to 36.90%. A combination of
weighting and the use of Priors increased the accuracy of the model when applied to
new data to 63.79%.
Tuning was accomplished by introducing bias into the model using weighting. It was
viable to introduce bias into the model because during testing, the confusion matrix of
the model showed the model tended to make errors by predicting outcomes of a
certain kind. Bias makes these particular errors more costly to the effectiveness of the
model and thus the algorithm attempts to minimise them when building the model.
These observations indicate that the Naïve Bayes algorithm requires the use of the
Priors technique when the build data has an uneven distribution of the target attribute.
This ensures the algorithm observes enough of each target attribute outcome to build a
model that will be effective when applied to data with a different distribution of the
target attribute.
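The principle behind this preparation step can be sketched as a stratified undersampling of the build data. This is a simplified stand-in for ODM's Priors/Stratified Sampling preparation, not its implementation, and the records are hypothetical:

```python
import random

def balance(records, target, seed=0):
    """Undersample the majority outcome of `target` so that each outcome
    appears equally often in the returned build data."""
    rng = random.Random(seed)
    by_outcome = {}
    for rec in records:
        by_outcome.setdefault(rec[target], []).append(rec)
    n = min(len(group) for group in by_outcome.values())  # size of rarest outcome
    balanced = []
    for group in by_outcome.values():
        balanced.extend(rng.sample(group, n))
    rng.shuffle(balanced)
    return balanced

# Hypothetical weather records with an uneven RAIN distribution (9 "no", 3 "yes"):
data = [{"RAIN": "no"}] * 9 + [{"RAIN": "yes"}] * 3
build = balance(data, "RAIN")
# build now holds 3 "yes" and 3 "no" records
```

The algorithm then observes enough cases of each outcome during building, at the cost of discarding some majority-outcome records.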
Adjusting the settings of the algorithm parameters would also be beneficial when
using this algorithm to build a model of data with an uneven target attribute
distribution. These parameters, the pairwise and singleton thresholds, affect how the
algorithm treats outliers in the data. By reducing the values of these parameters a
more accurate model can be built but this would only be beneficial if the model
already observes enough of a certain type of target attribute outcome.
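Roughly, such thresholds can be thought of as pruning statistics for predictor values that occur too rarely to estimate reliably. The sketch below illustrates this principle only; it is not ODM's actual implementation, and the counts and exact threshold semantics are assumptions:

```python
def prune_counts(pair_counts, value_counts, total,
                 singleton_threshold=0.01, pairwise_threshold=0.01):
    """Drop (predictor value, target outcome) statistics whose observed
    frequencies fall below the thresholds, so that rarely seen values
    (outliers) do not distort the conditional probability estimates."""
    kept = {}
    for (value, target), count in pair_counts.items():
        seen_often_enough = value_counts[value] / total >= singleton_threshold
        pair_often_enough = count / value_counts[value] >= pairwise_threshold
        if seen_often_enough and pair_often_enough:
            kept[(value, target)] = count
    return kept

# Hypothetical wind-direction counts: "SSE" appears only once in 51 records,
# so with a 5% singleton threshold its statistics are pruned.
value_counts = {"N": 50, "SSE": 1}
pair_counts = {("N", "yes"): 5, ("N", "no"): 45, ("SSE", "yes"): 1}
kept = prune_counts(pair_counts, value_counts, total=51, singleton_threshold=0.05)
# kept retains only the ("N", ...) pairs
```

Lowering the thresholds keeps more of these rare statistics in the model, which is the trade-off described above.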
Introducing bias into the Naïve Bayes model was most beneficial when the model had
been built using the Priors technique. This could indicate that the effect of weighting
is enhanced when the model is already relatively effective.
The most effective of all the models was built using the Adaptive Bayes Network
algorithm. This model correctly predicted 73.10% of the 290 RAIN attribute outcomes
when applied to the new weather data set. This level of accuracy was increased from
42.41% by tuning the model.
The use of the Priors technique had no effect on the models built using the Adaptive
Bayes Network algorithm. This indicates that the effectiveness of the resulting models
was not affected by the distribution of the target attribute in the data set used to build
the model. Thus, it can be concluded that the algorithm effectively considers
occurrences of instances in the data even when these occurrences are rare.
6.2 Conclusions Regarding Data
It is apparent from the results of applying the models to the WEATHER_APPLY data
set that the algorithms found a pattern in the data that allowed them to correctly
predict the outcome of the RAIN attribute in a significant number of cases.
According to the rules generated by the Adaptive Bayes Network algorithms these
predictions were mostly influenced by the measurements for wind chill factor and
wind direction in the records. Although unexpected, these measurements appear to
allow the models to make accurate predictions in most cases.
The Transformation Split wizard allows a data set to be split up into build and test
data sets by randomly selecting a predetermined number of records and placing them
in the build data set and placing the remainder of records in the test data set. However,
use of this technique to create the data sets results in both data sets showing a similar
distribution of the target attributes.
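This behaviour can be demonstrated with a plain random split standing in for the Transformation Split wizard (the data set below is hypothetical): because records are drawn at random, both resulting sets inherit the RAIN distribution of the source data.

```python
import random

def random_split(records, build_size, seed=0):
    """Randomly place `build_size` records in the build set;
    the remainder form the test set."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    return shuffled[:build_size], shuffled[build_size:]

def rain_fraction(records):
    return sum(1 for r in records if r["RAIN"] == "yes") / len(records)

# Hypothetical data set in which only 20% of records have RAIN = "yes":
data = [{"RAIN": "yes" if i % 5 == 0 else "no"} for i in range(2000)]
build, test = random_split(data, 1400)
# Both splits show roughly the same 20% "yes" rate as the source data.
print(rain_fraction(build), rain_fraction(test))
```

With 1400 and 600 records the sampling variation is small, so an uneven source distribution is reproduced almost exactly in both sets.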
If the distribution of the target attribute is uneven, both data sets will show this to an
extent. This is depicted in Figures 14, 15 and 16. Figure 14 shows the distribution of
the RAIN attribute in the data set from which the Transformation Split wizard created
the build and test data sets. Figure 15 shows the distribution of this attribute in the
build data set and Figure 16 in the test data set.
[Figure: bar chart, "Original Data Distribution". X-axis: Bin Range (yes, no).
Y-axis: Bin Count, 0–1800.]
Figure 14 Distribution of RAIN attribute in the original data set
It appears that similar distributions of the target attribute in both the build and test
data sets influence the accuracy of the model during testing. The result of testing a
model using a data set that resembles the build data set is an inflated accuracy. This
was evident from the significantly lower levels of accuracy the models showed when
applied to the new data of a different distribution. The distribution of the apply data
set is shown in Figure 17.
These findings indicate the need to test a model on a variety of data sets of different
distributions in order to properly validate model accuracy and effectiveness when
applied to data sets with different distributions of the target attribute.
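The kind of multi-data-set validation suggested here can be sketched as follows. The model and data are invented for illustration: an always-'no' predictor scores highly on a 'no'-heavy set yet poorly on a balanced one, which is exactly how a skewed test set inflates apparent accuracy.

```python
def accuracy(model, records, target="RAIN"):
    """Fraction of records whose `target` outcome the model predicts correctly."""
    return sum(1 for r in records if model(r) == r[target]) / len(records)

def validate(model, datasets):
    """Evaluate one model on several data sets to expose distribution effects."""
    return {name: accuracy(model, recs) for name, recs in datasets.items()}

# A trivial model that always predicts "no", tested on two hypothetical sets:
always_no = lambda record: "no"
skewed = [{"RAIN": "no"}] * 90 + [{"RAIN": "yes"}] * 10
balanced = [{"RAIN": "no"}] * 50 + [{"RAIN": "yes"}] * 50
print(validate(always_no, {"skewed": skewed, "balanced": balanced}))
# {'skewed': 0.9, 'balanced': 0.5}
```

A single skewed test set would report 90% accuracy for a model with no predictive power at all; validating across distributions exposes this.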
Further, it appears it would be more beneficial to use the largest data set possible to
build and test models on. This would result in a more effective model as a wider range
of occurrences in the data would be incorporated into the model.
[Figure: bar chart, "Build Data Distribution". X-axis: Bin Range (yes, no).
Y-axis: Bin Count, 0–1800.]
Figure 15 Distribution of RAIN attribute in the build data set
[Figure: bar chart, "Test Data Distribution". X-axis: Bin Range (yes, no).
Y-axis: Bin Count, 0–1800.]
Figure 16 Distribution of RAIN attribute in the test data set
[Figure: bar chart, "Apply Data Distribution". X-axis: Bin Range (yes, no).
Y-axis: Bin Count, 0–300.]
Figure 17 Distribution of RAIN attribute in the apply data set
6.3 Conclusions Regarding Oracle Data Mining
Oracle Data Mining and DM4J in particular provide the user with easy to use and
understand wizards that cover all aspects of the data mining process. Wizards are
available to create the build and test data sets from an original data set, to prepare the
data for use with the Priors technique, to build models, to test these models and to
apply these models to new data.
Although data preparation is an important aspect of the data mining process [Berger,
2004], it is not explicitly emphasised in the wizards that allow for the model building.
Although accessible through the Data Mining Browser, techniques for data
preparation and the benefits of using them are not emphasised.
The Data Mining Browser in DM4J allows the user to easily access the results of
model testing and application of models to new data. These results can also be
exported to spreadsheets allowing increased accessibility and ensuring they can be
easily worked with.
DM4J provides easy and reliable access to the database and the tables stored in the
database. This makes it possible to search for a specific data set during the data
mining process. The Data Mining Browser also allows the user to view summaries of
An Evaluation of Commercial Data Mining: Oracle Data Mining58
the data including distributions of attributes in the data set, which is of use during the
data preparation phase.
It is apparent from the results of the model testing during this evaluation that testing a
model on a single data set does not provide an indication of the effectiveness of the
model when applied to new data. Test accuracy of models can be misleading. For this
reason models should be externally validated using a technique similar to the one used
in this investigation (applying to data where outcome of the target attribute is known)
or tested on a number of data sets with varying distributions to better determine model
accuracy. The need to validate a model on a variety of data sets is not emphasised in
the documentation or by the wizards.
It must be emphasised that the ease and speed of building and testing a model using
the wizards allows for a number of models to be built and tests to be conducted. This
approach is recommended in order to ensure the most effective model possible is
produced.
6.4 Chapter Summary
This chapter has drawn a number of conclusions from the results obtained during the
data mining. Conclusions have been made regarding model results, the effect of data
used during the data mining and Oracle Data Mining itself. The following chapter will
conclude this evaluation.
Chapter 7 Conclusion
This chapter presents the conclusions drawn from the evaluation and suggests
possible extensions to the research area.
7.1 Conclusion
Oracle Data Mining provides data mining functionality through a series of wizards.
These wizards allow the user to perform data preparation, to build models, to test
these models and to apply the models to new data. The data preparation in this
evaluation was performed using the Transformation Split wizard and the Stratified
Sampling wizard. A number of wizards were used to build, test and apply the models
to new data.
The wizards were easy to use and understand and allowed a number of models to be
built in a short amount of time. Access to the database was provided through the
wizards. However, it was found that the wizards for building the data mining models
placed little emphasis on data preparation.
The two Classification algorithms used in this evaluation found a distinct pattern in
the weather data sets. This allowed the models to be used to make predictions of the
outcome of the RAIN attribute when the models were applied to the new data set. It is
possible to conclude that given a new set of weather data, the data mining models
would be able to make fairly accurate predictions of the outcome of the RAIN
attribute.
Of the algorithms investigated, the Adaptive Bayes Network algorithm produced the
most effective model when applied to new data, correctly predicting 73.10% of the
RAIN attribute's outcomes. This model was tuned using a weighting of 3 against false
negatives to introduce bias into the model. The most effective model built using the
Naïve Bayes algorithm correctly predicted 63.79% of the RAIN attribute's outcomes
when applied to new data. This model made use of the Priors technique and was
weighted 3 against false negatives.
The models were tested against a data set created from the original data,
WEATHER_BUILD, to determine the level of predictive accuracy. This testing was
conducted using the test wizards in DM4J and resulted in confusion matrices showing
test accuracy. It can be concluded that testing a model on a single data set does not
provide an accurate indication of how the model will perform when applied to new
data. This is because any similarities between the target attribute distribution (that of
RAIN) in the build and test data sets appear to inflate the test accuracy results. This is
evident from the fact that when the test accuracy of the models is compared to the
accuracy of the models when applied to new data, the test accuracy is significantly
higher. It is recommended that models be tested on a number of data sets of varying
distribution in order to gauge more accurately how they will perform when applied to
new data.
In conclusion, Oracle Data Mining provides the functionality required to build
effective data mining models using the two Classification algorithms, namely
Adaptive Bayes Network and Naïve Bayes. However, in order to build an effective
data mining model it is necessary to perform data preparation and to test the models
on a number of data sets of varying distribution. These aspects are not explicitly
emphasised when using the wizards to build the models. It can be stated that building
an effective model is an iterative process and requires the data miner to have an
awareness of the data sets used in the process, how these could influence the
outcomes of applying the models to new data and what techniques are available to
train and tune the models to be most effective.
7.2 Possible Extensions to Research
Possible extensions to this research could involve conducting similar evaluations of
the other algorithms available in the Oracle Data Mining Suite. These algorithms
include Attribute Importance, Association Rules, O-Cluster and Enhanced k-Means
Clustering. The results of models built on a similar data set using these other
algorithms could be compared to the models built in this evaluation to determine
when a specific algorithm is most applicable to a data mining project.
It would also be possible to compare the two clustering algorithms in ODM, O-
Cluster and Enhanced k-Means, in a manner similar to the one used in this evaluation.
Another extension of the research could involve comparing the functionality
demonstrated by Oracle Data Mining in this evaluation to that provided by another
data mining suite. Other data mining suites that could be compared include IBM
Intelligent Miner and SQL Server Data Mining and Analysis Server.
A further possibility would be to compare the results obtained using Oracle Data
Mining with those obtained after performing a regression analysis on the weather
data.
List of Figures
Figure 1: The Oracle Data Mining Process [Berger, 2004]……………………... 14
Figure 2: Naïve Bayes algorithm settings……………………………………….. 18
Figure 3: Adaptive Bayes Network algorithm settings………………………….. 20
Figure 4: Data Distribution for Rain Attribute from THE_BUILD data set…….. 24
Figure 5: Data Distribution for RAIN Attribute from THE_BUILD1 data set…. 24
Figure 6: Extract from Classification Model Build Wizard, Priors Settings……. 27
Figure 7: nbBuild Lift Chart……………………………………………………. 31
Figure 8: nbBuild2 Lift Chart…………………………………………………... 31
Figure 9: abnBuild Lift Chart…………………………………………………... 32
Figure 10: abnBuild2 Lift Chart…………………………………………………. 33
Figure 11: Extract Showing Weighting of Model Build Wizard for abnBuild4 .. 35
Figure 12: Model results and settings of application to new data………………... 50
Figure 13: Model results and settings during testing…………………………….. 51
Figure 14: Distribution of RAIN attribute in the original data set……………….. 55
Figure 15: Distribution of RAIN attribute in the build data set………………….. 56
Figure 16: Distribution of RAIN attribute in the test data set……………………. 56
Figure 17: Distribution of RAIN attribute in the apply data set…………………. 58
List of Tables
Table 1: Mining Data Table Structure…………………………………………... 15
Table 2: Example Confusion Matrix...................................................................... 28
Table 3: Model Test Accuracy Rates……………………………………………. 28
Table 4: Confusion Matrix for nbBuild Testing……………………………........ 29
Table 5: Confusion Matrix for nbBuild2 Testing ………………………………. 29
Table 6: Confusion Matrix for abnBuild Testing ………………………………. 29
Table 7: Confusion Matrix for abnBuild2 Testing ……………………………... 30
Table 8: Unweighted Models’ Test Accuracy Rates…………………………….. 34
Table 9: Weighted Models’ Test Accuracy Rates ………………………………. 34
Table 10: nbBuild Confusion Matrix …………………………………………… 35
Table 11: nbBuild3 Confusion Matrix ………………………………………….. 36
Table 12: Summary of Classification Models……………………………………. 36
Table 13: Extracts of results from model nbBuild................................................. 38
Table 14: Extracts of results from model nbBuild3 …………………….............. 39
Table 15: Extracts of results from model abnBuild showing rules ……………... 39
Table 16: Rules used by Adaptive Bayes Network Models to make predictions... 40
Table 17: Summary of Accuracy of Predictions When Compared to Actual Data. 42
Table 18: Comparison of Models Built using Same Settings……………………. 43
Table 19: Comparison of Models built using same settings and showing accuracy during testing…… 44
References
Al-Attar, A., 2004, White Paper: Data Mining - Beyond Algorithms, URL: <http://www.attar.com/tutor/mining.htm>, Accessed: 06/2004.
Berger, C., 09/2004, Oracle Data Mining, Know More, Do More, Spend Less - An Oracle White Paper, URL:
<http://www.oracle.com/technology/products/bi/odm/pdf/bwp_db_odm_10gr1_0904.pdf>, Accessed: 10/2004.
Berry, M. J. A. and Linoff, G. S., 2000, Mastering Data Mining: The Art and Science of Customer Relationship Management, USA, Wiley Computer Publishing.
Fernandez, G., 2003, Data Mining Using SAS Applications, USA, Chapman and Hall/CRC.
Jacot-Guillarmod, F., 2004, Index of /weather/ARCHIVE/2004, URL: <http://www.ru.ac.za/weather/>, Accessed: 10/2004.
Oracle9i Data Mining Concepts Release 2 (9.2), 2002, Oracle Home Page, URL: <http://www.lc.leidenuniv.nl/awcourse/oracle/datamine.920/a95961/preface.htm>, Accessed: 06/2004.
Oracle Data Mining Tutorial, Release 9.0.4, 02/2004, Oracle Home Page, URL: <http://www.oracle.com/technology/products/bi/odm/9idm4jv2.html>, Accessed: 09/2004.
Oracle Help for Java, Version 4.2.5.1.0, Copyright 1997-2004, Available via Oracle Data Mining Browser Help Menu in JDeveloper 10g.
Oracle Data Mining for Java (DM4J), 24/03/2004, Oracle Technology Network, URL: <http://www.oracle.com/technology/products/bi/pdf/odm4java.pdf>, Accessed: 09/2004.
Pyle, D., 2000, Data Preparation for Data Mining, San Francisco, California, Morgan Kauffman.
Roiger, R. J. and Geatz M. W., 2003, Data mining: a tutorial- based primer, Boston, Massachusetts, Addison Wesley.
Appendix A – Introductory Manual
This appendix aims to provide the reader with an introductory manual for using Data
Mining for Java (DM4J) 9.0.4 and JDeveloper 10g in order to perform data mining. It
assumes the reader has access to the Oracle Data Mining Tutorial, Release 9.0.4, which is
available on the CD-ROM that accompanies this project.
The data mining performed during this project was conducted with DM4J 9.0.4. This
component is an extension of JDeveloper 10g and provides the user interface to the
data mining components.
A.1 Preparing for Data Mining
It is possible to log in to the machine used in this project with the username
‘g01d1801’ and the password ‘810412’.
Any data sets that will be mined should be loaded into the ODM or ODM_MTR
schema in the Oracle database. This can be accomplished in the Enterprise Manager
Console using the Load wizard which is accessible under the menu items:
Tools>Database Tools>Data Management>Load. The greatest success was achieved by
loading data sets as .txt files, with the items in each record separated by commas
and each record on a new line.
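A file in this format can be produced with a short script. The sketch below writes records as comma-separated lines, one record per line; the file name and the weather attributes shown are illustrative only, not taken from the project data.

```python
# Write records as comma-separated .txt lines, one record per line,
# the format the Load wizard accepted most reliably.
# Field names and values here are illustrative only.
records = [
    ("2004-05-01", 18.2, 65, "rain"),
    ("2004-05-02", 21.7, 48, "no_rain"),
]

def to_line(record):
    """Join a record's fields with commas, as the Load wizard expects."""
    return ",".join(str(field) for field in record)

with open("weather_build.txt", "w") as f:
    for record in records:
        f.write(to_line(record) + "\n")
```

The resulting file can then be selected in the Load wizard described above.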
The Enterprise Manager Console is accessible from the Start Menu:
Start>All Programs>Oracle-Ora1>Enterprise Manager Console.
At the login screen, select the ‘Login to the Oracle Management Server’ radio button. The
administrator is ‘sysman’, the password is ‘810412’ and the Management Server is
ora1.ict.ru.ac.za.
Once the console opens, expand the ‘Database’ item in the navigation pane. Right-click
on Ora1.ict.ru.ac.za and select ‘Connect’ from the menu. The username is ‘sys’ and
the password is ‘emily’. Connect as ‘sysdba’. At this stage it is possible to use the Load
wizard, which guides the user through loading data into the database.
A.2 Starting JDeveloper
JDeveloper can be launched by double clicking on the shortcut icon on the desktop. It
is then necessary to connect to the Oracle database using JDeveloper as follows:
• Click on ‘Connections’ in the System Navigator pane in JDeveloper to expand it.
• Right-click on the ‘Database’ item in the list.
• Select ‘New Connection’ in the menu that appears.
• Follow the instructions in the Connection Wizard.
o At step 2 of the wizard a username and password are required. The
username is ‘odm’ and the password is ‘odm’. The ‘Deploy Password’
check box must be checked.
o Any other information required can be left at the default settings.
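The details gathered by the Connection Wizard correspond to a standard Oracle thin-driver connect descriptor. The sketch below assembles one from the values used in this project; the listener port 1521 and the SID ‘ora1’ are assumed defaults, not values confirmed by the wizard screens.

```python
# Assemble an Oracle JDBC thin-style connect URL from the connection
# details used in this project. The port and SID are assumed defaults,
# not values stated in the text.
conn = {
    "username": "odm",
    "password": "odm",            # as entered at step 2 of the wizard
    "host": "ora1.ict.ru.ac.za",
    "port": 1521,                 # assumed default Oracle listener port
    "sid": "ora1",                # assumed SID for this server
}

def thin_url(c):
    """Return a jdbc:oracle:thin:@host:port:sid style connect URL."""
    return f"jdbc:oracle:thin:@{c['host']}:{c['port']}:{c['sid']}"

print(thin_url(conn))
```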
A.3 Data Mining
It should now be possible to conduct data mining on a data set.
In JDeveloper click on the ‘File’ menu and select the ‘New’ menu item. In the dialog
box that appears select ‘General’ in the left pane and ‘Workspace’ in the right pane.
This will create a workspace to store the components of data mining. Fill in a name
for the workspace and note where it is saved. Click OK, fill in a project name and
click OK. This will be where any data mining models created are stored. It is possible
to view the workspace and projects in it in the System Navigator pane in JDeveloper.
Highlight the project name in the System Navigator pane, select ‘File’ in the
JDeveloper menu. Select ‘New’ and in the dialog box that appears expand the
‘Business Tier’ item in the left pane. Select ‘Data Mining Components’. In the pane
on the right will appear the wizards available for data mining.
The workspace created in this evaluation was named weather.jws. The projects
created within the workspace were named theWeather.jpr and finalWeather.jpr.
theWeather.jpr contains the component created with the Transformation Split wizard.
This wizard allows the user to create the build and test data sets from a single data set.
It is easy to use and provides clear instructions. The data used by this wizard in this
project was stored in the odm_mtr schema and named weather_build2. The wizard
produced the tables theBuild and theTest, which were stored in the odm schema.
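The wizard's behaviour can be approximated in a few lines: the records of a single table are shuffled and divided into a build set and a test set. The sketch below illustrates the idea; the 60/40 ratio is an assumption for illustration, not the wizard's documented default.

```python
import random

def split_records(records, build_fraction=0.6, seed=42):
    """Randomly partition records into build and test sets, as the
    Transformation Split wizard does with a single data set.
    The 60/40 split shown here is illustrative only."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle for the sketch
    cut = int(len(shuffled) * build_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(100))           # stand-in for rows of weather_build2
the_build, the_test = split_records(records)
```

Every input record ends up in exactly one of the two output sets, which is the property the wizard relies on when the build and test tables are later used by the model wizards.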
finalWeather.jpr contains all the components of all the data mining models created
during this project. The data set the models were applied to was named
weather_apply2 and was stored in the odm_mtr schema.
At this stage the reader should refer to Oracle Data Mining Tutorial, Release 9.0.4.
This tutorial provides step-by-step instructions for running the wizards to perform
data mining. Chapters 2 to 8 of the tutorial are the most relevant to this project.