Post on 04-Jun-2018
8/13/2019 Appendix Weka
Appendix: The WEKA Data Mining Software
http://www.cs.waikato.ac.nz/ml/weka/
WEKA: Introduction
WEKA (Waikato Environment for Knowledge Analysis) was developed at the University of Waikato, New Zealand.
History: 1st version (version 2.1), 1996; version 2.3, 1998; version 3.0, 1999; version 3.4, 2003; version 3.6, 2008.
WEKA provides a collection of data mining and machine learning algorithms and preprocessing tools. It includes algorithms for regression, classification, clustering, association rule mining and attribute selection.
It also has data visualization facilities.
WEKA is an environment for comparing learning algorithms.
With WEKA, researchers can implement new data mining algorithms and add them to WEKA.
WEKA is one of the best-known open-source data mining packages.
WEKA: Introduction
WEKA is written in Java. WEKA 3.4 consists of 271,477 lines of code; WEKA 3.6 consists of 509,903 lines of code.
It runs on Windows, Linux and Macintosh.
Users can access its components through Java programming or through a command-line interface.
It provides three main graphical user interfaces: Explorer, Experimenter and Knowledge Flow.
The easiest way to use WEKA is through the Explorer, the main graphical user interface.
Data can be loaded from various sources, including files, URLs and databases. Database access is provided through Java Database Connectivity (JDBC).
WEKA data format
WEKA stores data in flat files in ARFF (Attribute-Relation File Format).
It is easy to transform an Excel file to ARFF format.
An ARFF file consists of a list of instances.
An ARFF file can be created with a text editor such as Notepad (or Word, saving as plain text).
The name of the dataset is given with @relation.
Attribute information is given with @attribute.
The data follows @data.
Besides ARFF, WEKA also accepts the CSV, LibSVM and C4.5 formats.
WEKA ARFF format
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny, 85, 85, FALSE, no
sunny, 80, 90, TRUE, no
overcast, 83, 86, FALSE, yes
rainy, 70, 96, FALSE, yes
rainy, 68, 80, FALSE, yes
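The file above can also be generated programmatically. Below is a minimal sketch in plain Java with no WEKA dependency; the class name `ArffSketch` and the `buildArff` helper are illustrative, not part of WEKA's API.

```java
// Minimal sketch: assembling the weather dataset's ARFF text with plain Java.
// An ARFF document is @relation, then @attribute declarations, then @data rows.
public class ArffSketch {
    public static String buildArff() {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation weather\n\n");
        sb.append("@attribute outlook {sunny, overcast, rainy}\n");
        sb.append("@attribute temperature real\n");
        sb.append("@attribute humidity real\n");
        sb.append("@attribute windy {TRUE, FALSE}\n");
        sb.append("@attribute play {yes, no}\n\n");
        sb.append("@data\n");
        String[] rows = {
            "sunny, 85, 85, FALSE, no",
            "sunny, 80, 90, TRUE, no",
            "overcast, 83, 86, FALSE, yes",
            "rainy, 70, 96, FALSE, yes",
            "rainy, 68, 80, FALSE, yes"
        };
        for (String r : rows) sb.append(r).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(buildArff());
    }
}
```

Writing the returned string to a file with a .arff extension yields a dataset the Explorer can load directly.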
Explorer GUI
Consists of six panels, one for each data mining task:
Preprocess
Classify
Cluster
Associate
Select Attributes
Visualize
Preprocess: to use WEKA's data preprocessing tools (called filters) to transform the dataset in several ways.
WEKA contains filters for discretization, normalization, resampling, attribute selection, and transforming and combining attributes.
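As a concrete illustration of one such filter, here is a standalone Java sketch of min-max normalization, the rescaling of a numeric attribute into [0, 1] that WEKA's unsupervised Normalize filter performs by default; the class and method names here are made up for this example, not WEKA API.

```java
public class NormalizeSketch {
    // Min-max normalization: map each value of a numeric attribute into [0, 1].
    public static double[] normalize(double[] values) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : values) { min = Math.min(min, v); max = Math.max(max, v); }
        double range = max - min;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++)
            out[i] = range == 0 ? 0 : (values[i] - min) / range; // constant column maps to 0
        return out;
    }

    public static void main(String[] args) {
        // The temperature column of the weather data.
        double[] temps = {85, 80, 83, 70, 68};
        System.out.println(java.util.Arrays.toString(normalize(temps)));
    }
}
```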
Explorer (cont.)
Classify:
Regression techniques (predictors of continuous classes):
Linear regression
Logistic regression
Neural network
Support vector machine
Classification algorithms:
Decision trees: ID3, C4.5 (called J48 in WEKA)
Naïve Bayes, Bayes network
k-nearest neighbors
Rule learners: Ripper, Prism
Lazy rule learners
Meta learners (bagging, boosting)
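To make one entry in the list concrete, here is a minimal standalone sketch of the k-nearest-neighbors idea (implemented in WEKA as IBk). It handles only numeric attributes, and all names are illustrative rather than WEKA API.

```java
import java.util.*;

public class KnnSketch {
    // Classify a query point by majority vote among the k closest training points.
    public static String classify(double[][] train, String[] labels, double[] query, int k) {
        Integer[] idx = new Integer[train.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort training indices by Euclidean distance to the query.
        Arrays.sort(idx, Comparator.comparingDouble(i -> dist(train[i], query)));
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k; i++) votes.merge(labels[idx[i]], 1, Integer::sum);
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    private static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        // (temperature, humidity) pairs from the weather data, labeled by "play".
        double[][] x = {{85, 85}, {80, 90}, {83, 86}, {70, 96}, {68, 80}};
        String[] y = {"no", "no", "yes", "yes", "yes"};
        System.out.println(classify(x, y, new double[]{69, 82}, 3));
    }
}
```

An odd k avoids ties in two-class problems; WEKA's IBk additionally supports distance weighting and attribute normalization.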
Clustering
Clustering algorithms:
K-Means, X-Means, FarthestFirst
Likelihood-based clustering: EM (Expectation-Maximization)
Cobweb (an incremental clustering algorithm)
Clusters can be visualized and compared to true clusters (if given).
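The k-means loop behind WEKA's SimpleKMeans can be sketched in a few lines; this one-dimensional, fixed-iteration version is a simplification (no convergence test, no random restarts), and its names are illustrative.

```java
import java.util.*;

public class KMeansSketch {
    // One-dimensional k-means: alternate assignment and mean-update steps.
    public static double[] cluster(double[] data, double[] centers, int iters) {
        double[] c = centers.clone();
        for (int it = 0; it < iters; it++) {
            double[] sum = new double[c.length];
            int[] cnt = new int[c.length];
            for (double x : data) {
                int best = 0; // assign x to its nearest center
                for (int j = 1; j < c.length; j++)
                    if (Math.abs(x - c[j]) < Math.abs(x - c[best])) best = j;
                sum[best] += x; cnt[best]++;
            }
            for (int j = 0; j < c.length; j++)
                if (cnt[j] > 0) c[j] = sum[j] / cnt[j]; // move center to cluster mean
        }
        return c;
    }

    public static void main(String[] args) {
        // Temperatures from the weather data split into two clusters.
        double[] temps = {85, 80, 83, 70, 68};
        System.out.println(Arrays.toString(cluster(temps, new double[]{60, 90}, 10)));
    }
}
```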
Attribute Selection: provides access to various methods for measuring the utility of attributes and identifying the most important attributes in a dataset.
Filter method: the attribute set is filtered to produce the most promising subset before learning begins.
A wide range of filtering criteria, including correlation-based feature selection, the chi-square statistic, gain ratio, information gain, and a support-vector-machine-based criterion.
A variety of search methods: forward and backward selection, best-first search, genetic search and random search.
PCA (principal component analysis) to reduce the dimensionality of a problem.
Discretizing numeric attributes.
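The information gain criterion mentioned above is easy to compute directly: the entropy of the class distribution minus the entropy remaining after splitting on the attribute. Below is a standalone Java sketch (class and method names are illustrative); on the weather sample's outlook attribute it yields about 0.97 bits, since outlook separates the classes perfectly there.

```java
import java.util.*;

public class InfoGainSketch {
    // Entropy of a label distribution, in bits.
    static double entropy(Collection<Integer> counts) {
        double total = 0, h = 0;
        for (int c : counts) total += c;
        for (int c : counts) {
            if (c == 0) continue;
            double p = c / total;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // Information gain of a nominal attribute with respect to the class labels.
    public static double infoGain(String[] attr, String[] labels) {
        Map<String, Integer> classCounts = new HashMap<>();
        for (String y : labels) classCounts.merge(y, 1, Integer::sum);
        double h = entropy(classCounts.values()); // entropy before the split
        // Class counts within each attribute value.
        Map<String, Map<String, Integer>> byValue = new HashMap<>();
        for (int i = 0; i < attr.length; i++)
            byValue.computeIfAbsent(attr[i], k -> new HashMap<>())
                   .merge(labels[i], 1, Integer::sum);
        double cond = 0; // weighted entropy after the split
        for (Map<String, Integer> sub : byValue.values()) {
            int n = sub.values().stream().mapToInt(Integer::intValue).sum();
            cond += (double) n / attr.length * entropy(sub.values());
        }
        return h - cond;
    }

    public static void main(String[] args) {
        String[] outlook = {"sunny", "sunny", "overcast", "rainy", "rainy"};
        String[] play = {"no", "no", "yes", "yes", "yes"};
        System.out.println(infoGain(outlook, play));
    }
}
```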
Explorer (cont.)
Association rule mining: the Apriori algorithm
Works only with discrete data
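The core of Apriori is counting the support of candidate itemsets. This much-simplified standalone Java sketch covers only the first two levels (single items and pairs) and omits rule generation; all names are illustrative rather than WEKA API.

```java
import java.util.*;

public class AprioriSketch {
    // Count how many transactions contain every item in the candidate set.
    static int support(List<Set<String>> tx, Set<String> items) {
        int n = 0;
        for (Set<String> t : tx) if (t.containsAll(items)) n++;
        return n;
    }

    // Frequent item pairs at a minimum support count: find frequent single
    // items first, then (Apriori pruning) only combine those into pairs.
    public static List<Set<String>> frequentPairs(List<Set<String>> tx, int minSupport) {
        Set<String> frequentItems = new TreeSet<>();
        for (Set<String> t : tx)
            for (String item : t)
                if (support(tx, Set.of(item)) >= minSupport) frequentItems.add(item);
        List<String> items = new ArrayList<>(frequentItems);
        List<Set<String>> result = new ArrayList<>();
        for (int i = 0; i < items.size(); i++)
            for (int j = i + 1; j < items.size(); j++) {
                Set<String> pair = new TreeSet<>(List.of(items.get(i), items.get(j)));
                if (support(tx, pair) >= minSupport) result.add(pair);
            }
        return result;
    }

    public static void main(String[] args) {
        List<Set<String>> tx = List.of(
            Set.of("bread", "milk"),
            Set.of("bread", "milk", "eggs"),
            Set.of("bread", "eggs"),
            Set.of("milk", "eggs"));
        System.out.println(frequentPairs(tx, 2));
    }
}
```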
Visualization
Scatter plots, ROC curves, trees, graphs
WEKA can visualize single attributes (1-d) and pairs of attributes (2-d).
Color-coded class values.
Zoom-in function.
[Screenshot: Explorer GUI, Classify panel]
WEKA Experimenter
This interface is designed to facilitate experimental comparisons of the performance of algorithms based on many different evaluation criteria.
Experiments can involve many algorithms that are run on multiple datasets.
Can also iterate over different parameter settings.
Experiments can also be distributed across different computer nodes in a network.
Once an experiment has been set up, it can be saved in either XML or binary form, so that it can be revisited.
Knowledge Flow Interface
The Explorer is designed for batch-based data processing: training data is loaded into memory and then processed.
However, WEKA has implemented some incremental algorithms.
The Knowledge Flow interface can handle incremental updates. It can load and preprocess individual instances before feeding them into incremental learning algorithms.
Knowledge Flow also provides nodes for visualization and evaluation.
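The incremental-update style used by WEKA's updateable learners (e.g. NaiveBayesUpdateable) can be sketched with a toy nearest-centroid classifier whose per-class means are refreshed one instance at a time, with no batch pass over the data; the class and method names here are invented for this example.

```java
import java.util.*;

public class IncrementalSketch {
    private final Map<String, double[]> sum = new HashMap<>();
    private final Map<String, Integer> count = new HashMap<>();

    // Fold a single labeled instance into the running per-class totals.
    public void update(double[] x, String label) {
        double[] s = sum.computeIfAbsent(label, k -> new double[x.length]);
        for (int i = 0; i < x.length; i++) s[i] += x[i];
        count.merge(label, 1, Integer::sum);
    }

    // Predict the class whose running mean is closest to the query.
    public String predict(double[] x) {
        String best = null;
        double bestDist = Double.POSITIVE_INFINITY;
        for (String label : sum.keySet()) {
            double d = 0;
            int n = count.get(label);
            for (int i = 0; i < x.length; i++) {
                double diff = x[i] - sum.get(label)[i] / n;
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = label; }
        }
        return best;
    }

    public static void main(String[] args) {
        IncrementalSketch model = new IncrementalSketch();
        // Instances arrive one at a time, as in a Knowledge Flow stream.
        model.update(new double[]{85, 85}, "no");
        model.update(new double[]{80, 90}, "no");
        model.update(new double[]{70, 96}, "yes");
        model.update(new double[]{68, 80}, "yes");
        System.out.println(model.predict(new double[]{69, 85}));
    }
}
```

Because each update touches only the running totals, memory use is independent of the number of instances seen, which is the point of incremental learning.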
Conclusions
Compared to R, WEKA is weaker in classical statistics but stronger in machine learning (data mining) algorithms.
WEKA has developed a set of extensions covering diverse areas, such as text mining, visualization and bioinformatics.
WEKA 3.6 includes support for importing PMML (Predictive Model Markup Language) models. PMML is an XML-based standard for expressing statistical and data mining models.
WEKA 3.6 can read and write data in the format used by the well-known LibSVM and SVM-Light support vector machine implementations.
WEKA has two limitations:
Most of the algorithms require all the data to be stored in main memory, which restricts application to small or medium-sized datasets.
The Java implementation is somewhat slower than an equivalent in C/C++.