WEKA TUTORIAL 1
Weka Tutorial on Document Classification
Valeria Guevara
Thompson Rivers University
Author Note
This is a final project for COMP 4910 in the Bachelor of Computing Science program at
Thompson Rivers University, supervised by Mila Kwiatkowska.
Abstract
This project focuses on document classification using text mining through a classification
model generated by the open-source software WEKA, a collection of machine learning
algorithms for knowledge discovery. WEKA easily preprocesses the training documents so that
different algorithm configurations can be compared. The accuracy of the generated predictive
model is measured with a confusion matrix. This project illustrates text mining
preprocessing and classification using WEKA. The results are a tool that generates the ARFF
input data files and a video tutorial on document classification in Weka, in English and
Spanish.
Keywords: Weka, document classification, arff, stopwords, tokenizer, pruning,
decision tree C4.5, word vector, text mining, F-measure, machine learning, text
classification, stemming, knowledge society.
Weka document classification
The Weka tool was selected to generate a model that classifies specialized documents
from two different corpora (English and Spanish). WEKA, the Waikato Environment for
Knowledge Analysis, is a collection of machine learning algorithms for data mining tasks.
Text mining uses these algorithms to learn from examples, or a "training set"; new texts are
then classified into the analyzed categories. For more information, visit
http://www.cs.waikato.ac.nz/~ml/weka/.
Installing WEKA
Weka can be downloaded from:
http://www.cs.waikato.ac.nz/ml/weka/downloading.html.
This tutorial uses Weka 3.6.12.
For Windows
WEKA appears in the program launcher, inside a Weka folder. The Weka
default directory is the directory where the program was installed.
For Linux:
Open a terminal and type: java -jar /installation/directory/weka.jar.
Based on the text mining methodology, Weka is presented here in a framework with four
stages: data acquisition, document preprocessing, information extraction, and evaluation.
Data Acquisition
ARFF files are the primary format for any classification task in WEKA. These files
contain the basic input data (concepts, instances, and attributes) for data mining. An
Attribute-Relation File Format file describes a list of instances of a concept with their
respective attributes.
The documents selected for the training data set were found in the Thompson Rivers
University library: http://www.tru.ca/library.html. Seventy-one medical academic articles in
English and Spanish were randomly selected. These documents are stored in Portable Document
Format (PDF). Based on the TRU library, the documents were classified into six recognized
categories: Hemodialysis, Nutrition, Cancer, Obesity, Diet, and Diabetes. The documents are
stored in directories named after their categories within a main folder called Medicine, as
shown in the figure below.
To build the ARFF file, an application was created in Microsoft Visual Studio Professional
C# 2012 that generates the ARFF from a directory containing a collection of documents
organized by category name. This application was made possible by the iTextSharp library,
used for text extraction from Portable Document Format files.
In Documents Directory to ARFF, the user can specify the name of the relation to define,
the location of the home directory that contains all documents subdivided into categorical
directories, and any required comments. The user also specifies the name and location of the
generated .arff file. At the bottom of the application are two buttons: one to exit and
another to generate the ARFF file with the information described.
The application can be downloaded from http://www.scientificdatabases.ca under current
projects for Text Mining.
The resulting ARFF contains a string attribute called "textoDocumento" that holds all text
found in the document, and a nominal attribute "docClass" that defines the class to which
the document belongs. Note that in recent versions of Weka, such as 3.6.12 used here, the
class attribute can never be named "class".
The file will be generated as follows:
% Weka tutorial on document classification.
@RELATION Medicina
@attribute textoDocumento string
@attribute docClass {Hemodialysis, Nutrition, Cancer, Obesity, Diet, Diabetes}
@data
"texto…", Hemodialysis
"texto…", Nutrition
"texto…", Cancer
"texto…", Obesity
"texto…", Diet
"texto…", Diabetes
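The directory-to-ARFF generation described above can be sketched as follows. This is a minimal Python sketch, not the original C# tool: it assumes the PDF text has already been extracted to plain .txt files (the real application uses iTextSharp for that step), and the function name is illustrative.

```python
import os

def build_arff(root_dir, relation, out_path):
    """Walk category subdirectories under root_dir and emit a Weka ARFF file.

    Assumes each subdirectory of root_dir is one class and contains the
    documents already extracted to plain-text files.
    """
    categories = sorted(
        d for d in os.listdir(root_dir)
        if os.path.isdir(os.path.join(root_dir, d))
    )
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("% Weka tutorial on document classification\n")
        out.write("@RELATION %s\n\n" % relation)
        out.write("@attribute textoDocumento string\n")
        out.write("@attribute docClass {%s}\n\n" % ", ".join(categories))
        out.write("@data\n")
        for cat in categories:
            cat_dir = os.path.join(root_dir, cat)
            for name in sorted(os.listdir(cat_dir)):
                with open(os.path.join(cat_dir, name), encoding="utf-8") as f:
                    # Escape embedded quotes and collapse newlines so the
                    # whole text fits on one ARFF data row.
                    text = f.read().replace('"', "'").replace("\n", " ")
                out.write('"%s", %s\n' % (text, cat))
```

The class list in the @attribute line is taken from the subdirectory names, mirroring how the documents are organized under the Medicine folder.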
Document Preprocessing
Weka contains tools for data pre-processing, classification, regression, clustering,
association rules, and visualization.
"Applications" is the first screen in Weka, used to select the desired sub-tool; here
"Explorer" is selected. The Explorer consists of six panels: Preprocess, Classify, Cluster,
Associate, Select attributes, and Visualize.
Preprocess
Preprocessing for the classification of documents.
To load the generated arff, click on the button "Open file ..." at the top right.
Select the created file "medicinaWeka.arff".
Under "Current Relation", the loaded dataset is described: the relation name medicina, the
number of instances (71), and the total number of attributes (2). Below, in the "Attributes"
section, the attributes are listed. This panel allows attributes to be selected; in this
case "textoDocumento" and "docClass" are shown.
When "docClass" is selected, the "Selected attribute" panel describes the nominal attribute
with its 6 labels and their instance counts: 11 instances of Hemodialysis and 12 instances
of each of the others (Nutrition, Cancer, Obesity, Diet, and Diabetes). At the bottom of
this section, a histogram of the "docClass" labels is displayed; hovering over the graph
shows the attribute name, as the following figure illustrates.
Weka uses the StringToWordVector filter to convert the "textoDocumento" attribute into a
set of attributes that represent the occurrence of words in the full text. This filter is an
unsupervised learning technique: an inductive technique designed to detect clusters and
label entries from a set of observations without knowing the correct classification.
The filters are found by clicking the "Choose" button under the "Filter" section. This
button opens a window with the root weka; from there select filters, then the unsupervised
folder, then attribute, and finally StringToWordVector.
The StringToWordVector filter's attributes can be configured with language processing
techniques. To edit the filter, simply click on the filter name; a window opens showing
the following options.
A set of optimal options was derived from different combinations applied to the same
training data. For each resulting model, the F-measure, which summarizes the proportion of
correctly and erroneously predicted instances, was calculated. The options that produced the
greatest number of correctly predicted instances are as follows:
a) wordsToKeep: left at 1000, since it defines the word limit per class to maintain. The
doNotOperateOnPerClassBasis flag is kept as "False" so that wordsToKeep is applied per
class rather than across all classes.
b) TFTransform set to "True", IDFTransform set to "True", outputWordCounts set to "True",
and normalizeDocLength set to "No normalization".
With these settings the filter counts how often a word occurs in a document instead of
only recording whether the term is present, which helps find more interrelated documents.
outputWordCounts is the flag that switches from word presence to word counts, and
normalizeDocLength controls whether a word's tf-idf value is normalized by document
length; here it is not, so the value is kept regardless of how short or long the
document is.
c) lowerCaseTokens: set to "True" to convert all words to lowercase before they are added
to the dictionary, so that the same word in lowercase and uppercase is not analyzed
separately.
d) stemmer: selects the algorithm that eliminates morphemes in a given language in order
to reduce each word to its root. No stemmer is selected here, because the classification
of texts is multilingual and a stemmer would apply to only one language. To configure
this, click the "Select" button and choose "NullStemmer" from the menu that is displayed.
Weka ships with a standard English algorithm from snowball.tartarus.org. Snowball is a
string processing language designed for creating stemmers, and it features a stemming
algorithm for Spanish. To use the Spanish algorithm, download the jar
snowball-20051019.jar from https://weka.wikispaces.com/Stemmers and store it in the
location where the Weka application resides. Finally, the algorithm is added by launching
Weka from the command line:
For Windows: java -classpath "weka.jar;snowball-20051019.jar" weka.gui.GUIChooser
For Linux: java -classpath "weka.jar:snowball-20051019.jar" weka.gui.GUIChooser
This can be confirmed by checking the java.class.path parameter with:
java weka.core.SystemInfo
As shown in the following figure:
Having set up the SnowballStemmer, select it by clicking the "Choose" button. This button
displays a menu; navigate to weka > core > stemmers and choose SnowballStemmer.
Click on the stemmer name and a window appears in which the language can be set. For
Spanish, in the field labeled "stemmer", type "spanish" in place of "porter" and
click "OK".
e) stopwords: determines whether a substring in a text is a word that provides no
information about the text. These words come from a predefined Rainbow list, where the
default is Weka-3-6. Rainbow is a program that performs statistical text classification
based on the Bow library. Rainbow has separate lists for English and Spanish; to cover
both languages, the "ES-stopwords" file, which contains both Rainbow lists, is used. The
"ES-stopwords" list can be downloaded from
http://www.scientificdatabases.ca/current-projects/english-spanish-text-data-mining/.
To change the list, click on Weka-3-6, next to the stopwords label, and choose the
previously downloaded "ES-stopwords" file. Set the useStoplist option to "True" to ignore
the words that are in the "ES-stopwords" list.
f) tokenizer: chooses the unit used to split the "textoDocumento" attribute. By clicking
the "Choose" button a menu is displayed; select "WordTokenizer". Set the "delimiters" for
English and Spanish by clicking on the tokenizer name, which opens the following window.
The delimiter set for Spanish is .,;:'"()?!¿¡-[]'<> and includes the opening characters
that Spanish, unlike English, uses for exclamations and questions, as shown in the figure
below.
Another option is to choose NGramTokenizer, which divides the original text string into
subsets of consecutive words that form a pattern with a unique meaning. It uses the
default "delimiters" ' \r\n\t.,;:"()?!'. This is useful to help uncover patterns of
consecutive words that represent a meaningful context.
g) minTermFreq: left at the default of 1, the minimum frequency a word must have to be
considered as an attribute. For this, the "doNotOperateOnPerClassBasis" flag should be
"False".
h) periodicPruning: set to -1 (no pruning) so that low-frequency words are not
periodically removed.
i) attributeNamePrefix: left empty so that no prefix is added to the generated attribute
names.
j) attributeIndices: kept as first-last to ensure that all attributes are treated, from
the first to the last.
k) invertSelection: kept as "False" to work on the selected attributes.
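Taken together, the options above amount to a bag-of-words transformation of each document. The following Python sketch imitates that behavior under simplifying assumptions; it is not Weka's implementation, and the delimiter and stopword lists here are abbreviated stand-ins for the full sets configured above.

```python
import math
import re
from collections import Counter

# Abbreviated stand-ins; the tutorial configures a fuller delimiter string
# and the Rainbow-based "ES-stopwords" list.
DELIMITERS = r"[ \r\n\t.,;:'\"()?!¿¡\-\[\]<>]+"
STOPWORDS = {"the", "of", "de", "la", "el"}

def word_vector(text, tf_transform=True):
    """Roughly what StringToWordVector does to one document: lowercase,
    split on the delimiters, drop stopwords, count occurrences, and
    optionally apply the TF transform log(1 + f)."""
    tokens = [t for t in re.split(DELIMITERS, text.lower()) if t]
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    if tf_transform:
        return {w: math.log(1 + f) for w, f in counts.items()}
    return dict(counts)
```

Because lowerCaseTokens is applied before counting, "Diet" and "diet" collapse into one attribute, exactly the effect described in option (c).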
At the end, you can save, cancel, or apply the settings. The window should look as follows:
To save the algorithm with these options, click the "Save..." button and select a location
and name.
To apply the algorithm with these options, click the "OK" button. This returns to the
"Preprocess" window, where the "textoDocumento" attribute should be selected in the
"Attributes" panel.
Click the "Apply" button, located at the upper right of the "Filter" module. The Weka
image in the lower right corner will start to dance until the process is complete.
Information Extraction
After the data cleaning on the "Preprocess" tab, the next step is the extraction of
information: click on the "Classify" tab, the second panel of the Explorer.
This stage analyzes the attribute vectors to create the classification model that defines
the structure found in the analyzed information.
The decision tree model J48 is considered the most popular in Weka for text classification.
J48 is the Java implementation of the C4.5 algorithm, in which each internal node represents
one of the possible decisions to be taken and each leaf represents the predicted class.
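Conceptually, the tree that J48 learns reduces to nested if/else tests on the word-vector attributes. The sketch below is a toy illustration only: the attribute names and thresholds are invented for illustration, whereas a real tree is induced from the training data.

```python
def classify(doc_vector):
    """Toy illustration of how a learned J48/C4.5 tree reduces to nested
    if/else tests over word counts. Attribute names and thresholds here
    are hypothetical, not taken from the trained model."""
    if doc_vector.get("insulina", 0) > 0:
        return "Diabetes"
    if doc_vector.get("tumor", 0) > 1:
        return "Cancer"
    return "Diet"
```

Each branch corresponds to a test on one attribute; each return statement is a leaf carrying a predicted class.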
First, choose the classification algorithm from the "Choose" button located at the upper
left side of the window.
This button displays a tree whose root is weka and whose subfolder is "classifiers".
Within the subfolder tree located at weka.classifiers.trees, select the tree model J48, as
shown in the following figure:
Double-click on the name of the J48 classifier, located next to the "Choose" button, to
access its options.
Correct classification of 100% can be reached by disabling pruning and setting the
minimum number of instances per leaf to 1. In this case the changed parameter is:
a) minNumObj: set to 1, leaving the other parameters at their default configuration.
The training data is set in the "Test Options" module.
Select "Use training set" to train the method with all available data and apply the results
to the same input data collection.
Additionally, you can apply a percentage split to the input data by selecting the
"Percentage Split" option and defining the percentage of the total input data used to build
the classifier model, leaving the remaining part for testing.
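The idea behind a percentage split can be sketched in a few lines of Python. This is a simplified illustration, not Weka's internal routine; the seed parameter is an assumption added for reproducibility.

```python
import random

def percentage_split(instances, train_pct, seed=1):
    """Mimic the "Percentage Split" test option: shuffle the instances,
    use train_pct percent to build the model, and keep the remainder
    for testing."""
    data = list(instances)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * train_pct / 100)
    return data[:cut], data[cut:]
```

With 71 documents and a 66% split, for example, 46 documents would train the model and 25 would test it.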
Under "Test Options" is a menu that displays a list of all attributes. In this case select
"docClass", because it is the attribute that acts as the classification result in this
example.
The classification method is started by pressing the "Start" button.
The Weka bird image in the bottom right will begin to dance until the classification
process ends.
WEKA creates a graphical representation of the J48 classification tree. This tree can be
viewed by right-clicking on the last set of results in the "Result List" and selecting the
"Visualize tree" option.
The window size can be adjusted to make the tree more legible by right-clicking and
selecting "Fit to Screen", as shown in the image below.
Results Evaluation
Weka summarizes the proportion of correctly and erroneously predicted instances with the
Fβ score. The value combines precision and recall: precision measures the percentage of
positive predictions that are truly positive, while recall is the ability to detect
positive cases out of the total of all positive cases.
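These quantities follow directly from the counts of true positives (TP), false positives (FP), and false negatives (FN). The following Python sketch shows the standard formulas; the function name is illustrative.

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision, recall and the F-beta score from raw counts. With
    beta = 1 this is the F-measure (the harmonic mean of precision
    and recall) reported by Weka."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f
```

For instance, 8 correct positives with 2 false positives and 2 false negatives give precision, recall, and F-measure of 0.8 each.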
With these percentages, the best model is expected to be the one whose F-measure value is
closest to 1. The following table shows some combinations that are significant in the data
preprocessing for model generation, together with their precision, recall, and F-measure.
First, the best filter options are analyzed with unadjusted values for the J48 classifier,
and the best parameters are selected. Then the best settings for the J48 classifier
algorithm are selected, using the best configuration of the StringToWordVector filter.
Comparison table: Document classification models.

Features                                                    Precision  Recall  F-Measure
Word Tokenizer English & Spanish (E&S)                      0.810      0.803   0.800
Word Tokenizer E&S + Lower Case Conversion                  0.863      0.859   0.860
Trigrams E&S + Lower Case Conversion                        0.823      0.775   0.754
Stemming + Word Tokenizer E&S + Lower Case Conversion       0.864      0.817   0.823
Stopwords + Word Tokenizer E&S + Lower Case Conversion      0.976      0.972   0.972
Stopwords + Stemming + Word Tokenizer E&S
+ Lower Case Conversion                                     0.974      0.972   0.971
Stopwords + Word Tokenizer E&S + Lower Case Conversion
+ J48 minNumObj = 1                                         1.000      1.000   1.000
In conclusion, the best model is the combination of Stopwords + Word Tokenizer E&S + Lower
Case Conversion applied to the filter during data preprocessing, with minNumObj further
adjusted to 1 in the J48 classifier algorithm.
The following confusion matrix is the result of the combination of Stopwords + Word
Tokenizer E&S + Lower Case Conversion with minNumObj adjusted to 1 in the J48 algorithm.
This produces the following values in the confusion matrix.
a   b   c   d   e   f   Classified as
11 0 0 0 0 0 a = Hemodialysis
0 12 0 0 0 0 b = Nutrition
0 0 12 0 0 0 c = Cancer
0 0 0 12 0 0 d = Obesity
0 0 0 0 12 0 e = Diet
0 0 0 0 0 12 f = Diabetes
This table shows that every class was classified with precision and recall at 100%. The
accuracy values for each class are as follows:
Class TP Rate FP Rate Precision Recall F-Measure
Hemodialysis 1 0 1 1 1
Nutrition 1 0 1 1 1
Cancer 1 0 1 1 1
Obesity 1 0 1 1 1
Diet 1 0 1 1 1
Diabetes 1 0 1 1 1
Weighted Avg. 1 0 1 1 1
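The per-class rates in the table above can be derived mechanically from the confusion matrix. The following Python sketch shows that derivation; it is an illustrative helper, with matrix rows taken as actual classes and columns as predicted classes.

```python
def per_class_metrics(matrix, labels):
    """Derive TP rate, FP rate, precision and recall for each class
    from a square confusion matrix (rows = actual, columns = predicted)."""
    n = sum(sum(row) for row in matrix)
    results = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]                                    # diagonal cell
        fn = sum(matrix[i]) - tp                             # rest of the row
        fp = sum(matrix[r][i] for r in range(len(labels))) - tp  # rest of the column
        tn = n - tp - fn - fp
        results[label] = {
            "tp_rate": tp / (tp + fn) if tp + fn else 0.0,
            "fp_rate": fp / (fp + tn) if fp + tn else 0.0,
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return results
```

On a purely diagonal matrix like the one above, every off-diagonal cell is zero, so every class obtains TP rate, precision, and recall of 1 and FP rate of 0.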
Conclusion
Document classification in Spanish was analyzed using text mining through Weka, an
open-source software package. This software analyzes large amounts of data and decides
which parts are the most important; it aims to make automatic predictions that support
decision making.
Text mining is considered a subset of data mining. For this reason, text mining adopts the
data mining techniques, which use machine learning algorithms. Computational linguistics
also provides techniques for text mining; this science studies natural language with
computational methods to make it processable by computers.
Automatic categorization determines the subject matter of a document collection. The
classification starts with a set of training texts previously categorized and then
generates a classification model based on that set of examples. The model is then able to
assign the correct class to a new text. A decision tree is a classification technique that
represents knowledge through an if-else statement structure encoded in the branches of a
tree.
The textual mining methodology provides a framework performed in four stages: data
acquisition, document preprocessing, information extraction, and evaluation of results.
Witten, Frank, and Hall mention these steps in their work on the use of WEKA.
WEKA uses a standard format called the Attribute-Relation File Format (ARFF) to represent
the collection of documents as instances that share an ordered set of attributes, divided
into three sections: relation, attribute, and data.
Data preprocessing is based on the preparation of the text: a series of operations over the
text generates some kind of structured or semi-structured information for analysis. The
most popular way to represent documents is with a vector that contains all words found in
the text, indicating their occurrence. Important preprocessing tasks for categorizing
documents are stemming, lemmatization, removing stopwords, tokenization, and conversion to
lowercase.
A stemming algorithm eliminates morphemes, finding the relationship between words and
their lexeme. Stopword removal excludes the words that do not help to generate knowledge
from the text. Tokenization is the way the text is separated into words using punctuation;
in Spanish the punctuation marks are "; . : ? ! - () [] ' << >>", where the dot and the
dash are ambiguous, and Spanish, unlike English, includes an opening sign for exclamations
and questions. Conversion to lowercase treats all terms equally regardless of letter case.
After data preprocessing, the next step is knowledge extraction. Document classification in
Weka seeks to learn a predictive classification model; such models are used to predict the
class to which an instance belongs. The model is created using the decision tree algorithm
C4.5, as it is simple and widely used for the classification task.
Weka generates a confusion matrix for the generated model. This shows, in an easy way, how
many times the model's predictions were made correctly. The four possible outcomes are:
TP, true positive: a positive instance predicted as positive. TN, true negative: a negative
instance correctly classified as negative. FP, false positive: a negative instance
incorrectly classified as positive. FN, false negative: a positive instance incorrectly
classified as negative.
The selected training data set was found in the Thompson Rivers University library: 71
randomly selected medical academic articles in English and Spanish, stored in PDF format.
Based on the TRU library, the documents were classified into six recognized categories:
Hemodialysis, Nutrition, Cancer, Obesity, Diet, and Diabetes. The documents are stored in
directories named after their categories within a main folder called Medicine.
To form the ARFF file, an application was created that generates the ARFF from a
directory-based document collection. This application was made possible by the iTextSharp
library, used for text extraction from Portable Document Format files. The application is
named Documents Directory to ARFF.
The resulting ARFF contains a string attribute called "textoDocumento" that holds all text
found in the document, and a nominal attribute "docClass" that defines the class to which
the document belongs. Note that in recent versions of Weka, such as 3.6.12 used here, the
class attribute can never be named "class".
Various tests were applied to the same set of texts to assess the predictive accuracy of
the model. A set of optimal options was derived from different combinations applied to the
same training data; for each resulting model, the F-measure was calculated.
First, the best structure for the filter was analyzed with the J48 classifier options
unadjusted, and the best filter parameters were selected. Then, with that configuration,
the best settings for the J48 classifier algorithm were assessed. Based on a comparison
chart, it was discovered that the combination of Stopwords + Word Tokenizer E&S + Lower
Case Conversion, with minNumObj adjusted to 1 in the J48 algorithm, provides values of 1
for recall and precision.
The conclusion is that the best model is the combination of Stopwords + Word Tokenizer E&S
+ Lower Case Conversion applied to the data preprocessing filter, with minNumObj further
adjusted to 1 in the J48 classifier algorithm.
References
Witten, I. H., Frank, E., & Hall, M. A. (2011). Data mining: Practical machine learning
tools and techniques (3rd ed.). Elsevier.
Cs.waikato.ac.nz. (2015). Weka 3 - Data Mining with Open Source Machine Learning Software
in Java. Retrieved 5 May 2015, from http://www.cs.waikato.ac.nz/~ml/weka/
Shams, R. (2015). Weka Tutorial 31: Document Classification 1 (Application). YouTube.
Retrieved 15 May 2015, from https://www.youtube.com/watch?v=jSZ9jQy1sfE
Shams, R. (2015). Weka Tutorial 32: Document classification 2 (Application). YouTube.
Retrieved 15 May 2015, from https://www.youtube.com/watch?v=zlVJ2_N_Olo
Rodríguez, J., Calot, E., & Merlino, H. (2014). Clasificación de prescripciones médicas en
español. Sedici.unlp.edu.ar. Retrieved 15 May 2015, from
http://sedici.unlp.edu.ar/handle/10915/42402
Weinberg, B. (2015). Weka Text Classification for First Time & Beginner Users. YouTube.
Retrieved 15 May 2015, from https://www.youtube.com/watch?v=IY29uC4uem8.
Nlm.nih.gov. (2015). PubMed Tutorial - Building the Search - How It Works - Stopwords.
Retrieved 18 May 2015, from
http://www.nlm.nih.gov/bsd/disted/pubmedtutorial/020_170.html