WEKA TUTORIAL 1
Weka Tutorial on Document Classification
Valeria Guevara
Thompson Rivers University
Author Note
This is a final project for COMP 4910 in the Bachelor of Computing Science program at
Thompson Rivers University, supervised by Mila Kwiatkowska.
Abstract
This project focuses on document classification using text mining through a classification
model generated by the open-source software WEKA, a collection of machine learning
algorithms for knowledge discovery. WEKA easily preprocesses the training documents so that
different algorithm configurations can be compared. The accuracy of the generated predictive
model is measured with a confusion matrix. This project illustrates text mining
preprocessing and classification using WEKA. The results are a tool that generates the ARFF
input data files and a video tutorial on document classification in Weka, in English and
Spanish.
Keywords: Weka, document classification, arff, stopwords, tokenizer, pruning,
decision tree C4.5, word vector, text mining, F-measure, machine learning, text
classification, stemming, knowledge society.
Weka document classification
The Weka tool was selected to generate a model that classifies specialized documents
from two different corpora (English and Spanish). WEKA, the Waikato Environment for
Knowledge Analysis, is a collection of machine learning algorithms for data mining tasks.
Text mining uses these algorithms to learn from examples, or a "training set"; new texts are
then classified into the analyzed categories. For more information, visit
http://www.cs.waikato.ac.nz/~ml/weka/.
Installing WEKA
Weka can be downloaded from:
http://www.cs.waikato.ac.nz/ml/weka/downloading.html.
This tutorial uses Weka 3.6.12.
For Windows
WEKA appears in the program launcher, inside a Weka folder. The Weka
default directory is the directory where the program was installed.
For Linux:
Open a terminal and type: java -jar /installation/directory/weka.jar.
Based on the text mining methodology, Weka is presented here in a framework with four
stages: data acquisition, document preprocessing, information extraction, and evaluation.
Data Acquisition
ARFF files are the primary format for any classification task in WEKA. These files
contain the basic input data (concepts, instances, and attributes) for data mining. An
Attribute-Relation File Format file describes a list of instances of a concept with their
respective attributes.
The documents selected for the training data set were found in the Thompson Rivers
University library: http://www.tru.ca/library.html. Seventy-one medical academic articles in
English and Spanish were randomly selected. These documents are stored in Portable Document
Format (PDF). Based on the TRU library, the documents were classified into six recognized
categories: Hemodialysis, Nutrition, Cancer, Obesity, Diet, and Diabetes. The documents are
stored in directories named after their categories within a main folder called Medicine, as
shown in the figure below.
To build the ARFF file, an application was created in Microsoft Visual Studio Professional
C# 2012 that generates the ARFF from a directory containing a collection of documents
organized by category name. This application was made possible by the iTextSharp library,
used for text extraction from Portable Document Format files.
In Documents Directory to ARFF, the user can specify the name of the relation to define,
the location of the home directory that contains all documents subdivided into categorical
directories, and any required comments. The user also specifies the name and location of the
generated .arff file. At the bottom of the application are two buttons: one to exit and
another to generate the ARFF file with the information described.
The application can be downloaded from http://www.scientificdatabases.ca under current
projects for Text Mining.
The resulting ARFF contains a string attribute called "textoDocumento" that holds all text
found in the document, and a nominal attribute "docClass" that defines the class to which
the document belongs. Note that in recent versions of Weka, such as 3.6.12 used here, the
class attribute can never be named "class".
The file will be generated as follows:
% Weka tutorial on document classification.
@RELATION Medicina
@attribute textoDocumento string
@attribute docClass {Hemodialysis, Nutrition, Cancer, Obesity, Diet, Diabetes}
@data
"texto…", Hemodialysis
"texto…", Nutrition
"texto…", Cancer
"texto…", Obesity
"texto…", Diet
"texto…", Diabetes
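The directory-to-ARFF generation described above can be sketched as follows. This is a minimal Python sketch, not the original C# tool: it assumes the PDF text has already been extracted to plain .txt files (the real application uses iTextSharp for that step), and the function name is illustrative.

```python
import os

def build_arff(root_dir, relation, out_path):
    """Walk category subdirectories under root_dir and emit a Weka ARFF file.

    Assumes each subdirectory of root_dir is one class and contains the
    documents already extracted to plain-text files.
    """
    categories = sorted(
        d for d in os.listdir(root_dir)
        if os.path.isdir(os.path.join(root_dir, d))
    )
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("% Weka tutorial on document classification\n")
        out.write("@RELATION %s\n\n" % relation)
        out.write("@attribute textoDocumento string\n")
        out.write("@attribute docClass {%s}\n\n" % ", ".join(categories))
        out.write("@data\n")
        for cat in categories:
            cat_dir = os.path.join(root_dir, cat)
            for name in sorted(os.listdir(cat_dir)):
                with open(os.path.join(cat_dir, name), encoding="utf-8") as f:
                    # Escape embedded quotes and collapse newlines so the
                    # whole text fits on one ARFF data row.
                    text = f.read().replace('"', "'").replace("\n", " ")
                out.write('"%s", %s\n' % (text, cat))
```

The class list in the @attribute line is taken from the subdirectory names, mirroring how the documents are organized under the Medicine folder.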
Document Preprocessing
Weka contains tools for data pre-processing, classification, regression, clustering,
association rules, and visualization.
"Applications" is the first screen in Weka, used to select the desired sub-tool; here
"Explorer" is selected. The Explorer consists of six panels: Preprocess, Classify, Cluster,
Associate, Select attributes, and Visualize.
Preprocess
Preprocessing for the classification of documents.
To load the generated arff, click on the button "Open file ..." at the top right.
Select the created file "medicinaWeka.arff".
Under "Current Relation", the loaded dataset is described: the relation name medicina, the
number of instances (71), and the total number of attributes (2). Below, in the "Attributes"
section, the attributes are listed. This panel allows attributes to be selected; in this
case "textoDocumento" and "docClass" are shown.
When "docClass" is selected, the "Selected attribute" panel describes the nominal attribute
with its 6 labels and their instance counts: 11 instances of Hemodialysis and 12 instances
of each of the others (Nutrition, Cancer, Obesity, Diet, and Diabetes). At the bottom of
this section, a histogram of the "docClass" labels is displayed; hovering over the graph
shows the attribute name, as the following figure illustrates.
Weka uses the StringToWordVector filter to convert the "textoDocumento" attribute into a
set of attributes that represent the occurrence of words in the full text. This filter is an
unsupervised learning technique: an inductive technique designed to detect clusters and
label entries from a set of observations without knowing the correct classification.
The filters are found by clicking the "Choose" button under the "Filter" section. This
button opens a window with the root weka; from there select filters, then the unsupervised
folder, then attribute, and finally StringToWordVector.
The StringToWordVector filter's attributes can be configured with language processing
techniques. To edit the filter, simply click on the filter name; a window opens showing
the following options.
A set of optimal options was derived from different combinations applied to the same
training data. For each resulting model, the F-measure, which summarizes the proportion of
correctly and erroneously predicted instances, was calculated. The options that produced the
greatest number of correctly predicted instances are as follows:
a) wordsToKeep: left at 1000, since it defines the word limit per class to maintain. The
doNotOperateOnPerClassBasis flag is kept as "False" so that wordsToKeep is applied per
class rather than across all classes.
b) TFTransform set to "True", IDFTransform set to "True", outputWordCounts set to "True",
and normalizeDocLength set to "No normalization".
With these settings the filter counts how often a word occurs in a document instead of
only recording whether the term is present, which helps find more interrelated documents.
outputWordCounts is the flag that switches from word presence to word counts, and
normalizeDocLength controls whether a word's tf-idf value is normalized by document
length; here it is not, so the value is kept regardless of how short or long the
document is.
c) lowerCaseTokens: set to "True" to convert all words to lowercase before they are added
to the dictionary, so that the same word in lowercase and uppercase is not analyzed
separately.
d) stemmer: selects the algorithm that eliminates morphemes in a given language in order
to reduce each word to its root. No stemmer is selected here, because the classification
of texts is multilingual and a stemmer would apply to only one language. To configure
this, click the "Select" button and choose "NullStemmer" from the menu that is displayed.
Weka ships with a standard English algorithm from snowball.tartarus.org. Snowball is a
string processing language designed for creating stemmers, and it features a stemming
algorithm for Spanish. To use the Spanish algorithm, download the jar
snowball-20051019.jar from https://weka.wikispaces.com/Stemmers and store it in the
location where the Weka application resides. Finally, the algorithm is added by launching
Weka from the command line:
For Windows: java -classpath "weka.jar;snowball-20051019.jar" weka.gui.GUIChooser
For Linux: java -classpath "weka.jar:snowball-20051019.jar" weka.gui.GUIChooser
This can be confirmed by checking the java.class.path parameter with:
java weka.core.SystemInfo
As shown in the following figure:
Having set up the SnowballStemmer, select it by clicking the "Choose" button. This button
displays a menu; navigate to weka > core > stemmers and choose SnowballStemmer.
Click on the stemmer name and a window appears in which the language can be set. For
Spanish, in the field labeled "stemmer", type "spanish" in place of "porter" and
click "OK".
e) stopwords: determines whether a substring in a text is a word that provides no
information about the text. These words come from a predefined Rainbow list, where the
default is Weka-3-6. Rainbow is a program that performs statistical text classification
based on the Bow library. Rainbow has separate lists for English and Spanish; to cover
both languages, the "ES-stopwords" file, which contains both Rainbow lists, is used. The
"ES-stopwords" list can be downloaded from
http://www.scientificdatabases.ca/current-projects/english-spanish-text-data-mining/.
To change the list, click on Weka-3-6, next to the stopwords label, and choose the
previously downloaded "ES-stopwords" file. Set the useStoplist option to "True" to ignore
the words that are in the "ES-stopwords" list.
f) tokenizer: chooses the unit used to split the "textoDocumento" attribute. By clicking
the "Choose" button a menu is displayed; select "WordTokenizer". Set the "delimiters" for
English and Spanish by clicking on the tokenizer name, which opens the following window.
The delimiter set for Spanish is .,;:'"()?!¿¡-[]'<> and includes the opening characters
that Spanish, unlike English, uses for exclamations and questions, as shown in the figure
below.
Another option is to choose NGramTokenizer, which divides the original text string into
subsets of consecutive words that form a pattern with a unique meaning. It uses the
default "delimiters" ' \r\n\t.,;:"()?!'. This is useful to help uncover patterns of
consecutive words that represent a meaningful context.
g) minTermFreq: left at the default of 1, the minimum frequency a word must have to be
considered as an attribute. For this, the "doNotOperateOnPerClassBasis" flag should be
"False".
h) periodicPruning: set to -1 (no pruning) so that low-frequency words are not
periodically removed.
i) attributeNamePrefix: left empty so that no prefix is added to the generated attribute
names.
j) attributeIndices: kept as first-last to ensure that all attributes are treated, from
the first to the last.
k) invertSelection: kept as "False" to work on the selected attributes.
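Taken together, the options above amount to a bag-of-words transformation of each document. The following Python sketch imitates that behavior under simplifying assumptions; it is not Weka's implementation, and the delimiter and stopword lists here are abbreviated stand-ins for the full sets configured above.

```python
import math
import re
from collections import Counter

# Abbreviated stand-ins; the tutorial configures a fuller delimiter string
# and the Rainbow-based "ES-stopwords" list.
DELIMITERS = r"[ \r\n\t.,;:'\"()?!¿¡\-\[\]<>]+"
STOPWORDS = {"the", "of", "de", "la", "el"}

def word_vector(text, tf_transform=True):
    """Roughly what StringToWordVector does to one document: lowercase,
    split on the delimiters, drop stopwords, count occurrences, and
    optionally apply the TF transform log(1 + f)."""
    tokens = [t for t in re.split(DELIMITERS, text.lower()) if t]
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    if tf_transform:
        return {w: math.log(1 + f) for w, f in counts.items()}
    return dict(counts)
```

Because lowerCaseTokens is applied before counting, "Diet" and "diet" collapse into one attribute, exactly the effect described in option (c).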
At the end, you can save, cancel, or apply the settings. The window should look as follows:
To save the algorithm with these options, click the "Save..." button and select a location
and name.
To apply the algorithm with these options, click the "OK" button. This returns to the
"Preprocess" window, where the "textoDocumento" attribute should be selected in the
"Attributes" panel.
Click the "Apply" button, located at the upper right of the "Filter" module. The Weka
image in the lower right corner will start to dance until the process is complete.
Information Extraction
After the data cleaning on the "Preprocess" tab, the next step is the extraction of
information: click on the "Classify" tab, the second panel of the Explorer.
This stage analyzes the attribute vectors to create the classification model that defines
the structure found in the analyzed information.
The decision tree model J48 is considered the most popular in Weka for text classification.
J48 is the Java implementation of the C4.5 algorithm, in which each internal node represents
one of the possible decisions to be taken and each leaf represents the predicted class.
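Conceptually, the tree that J48 learns reduces to nested if/else tests on the word-vector attributes. The sketch below is a toy illustration only: the attribute names and thresholds are invented for illustration, whereas a real tree is induced from the training data.

```python
def classify(doc_vector):
    """Toy illustration of how a learned J48/C4.5 tree reduces to nested
    if/else tests over word counts. Attribute names and thresholds here
    are hypothetical, not taken from the trained model."""
    if doc_vector.get("insulina", 0) > 0:
        return "Diabetes"
    if doc_vector.get("tumor", 0) > 1:
        return "Cancer"
    return "Diet"
```

Each branch corresponds to a test on one attribute; each return statement is a leaf carrying a predicted class.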
First, choose the classification algorithm from the "Choose" button located at the upper
left side of the window.
This button displays a tree whose root is weka and whose subfolder is "classifiers".
Within the subfolder tree located at weka.classifiers.trees, select the tree model J48, as
shown in the following figure:
Double-click on the name of the J48 classifier, located next to the "Choose" button, to
access its options.
Correct classification of 100% can be reached by disabling pruning and setting the
minimum number of instances per leaf to 1. In this case the changed parameter is:
a) minNumObj: set to 1, leaving the other parameters at their default configuration.
The training data is set in the "Test Options" module.
Select "Use training set" to train the method with all available data and apply the results
to the same input data collection.
Additionally, you can apply a percentage split to the input data by selecting the
"Percentage Split" option and defining the percentage of the total input data used to build
the classifier model, leaving the remaining part for testing.
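The idea behind a percentage split can be sketched in a few lines of Python. This is a simplified illustration, not Weka's internal routine; the seed parameter is an assumption added for reproducibility.

```python
import random

def percentage_split(instances, train_pct, seed=1):
    """Mimic the "Percentage Split" test option: shuffle the instances,
    use train_pct percent to build the model, and keep the remainder
    for testing."""
    data = list(instances)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * train_pct / 100)
    return data[:cut], data[cut:]
```

With 71 documents and a 66% split, for example, 46 documents would train the model and 25 would test it.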
Under "Test Options" is a menu that displays a list of all attributes. In this case select
"docClass", because it is the attribute that acts as the classification result in this
example.
The classification method is started by pressing the "Start" button.
The Weka bird image in the bottom right will begin to dance until the classification
process ends.
WEKA creates a graphical representation of the J48 classification tree. This tree can be
viewed by right-clicking on the last set of results in the "Result List" and selecting the
"Visualize tree" option.
The window size can be adjusted to make the tree more legible by right-clicking and
selecting "Fit to Screen", as shown in the image below.
Results Evaluation
Weka summarizes the proportion of correctly and erroneously predicted instances with the
Fβ score. The value combines precision and recall: precision measures the percentage of
positive predictions that are truly positive, while recall is the ability to detect
positive cases out of the total of all positive cases.
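These quantities follow directly from the counts of true positives (TP), false positives (FP), and false negatives (FN). The following Python sketch shows the standard formulas; the function name is illustrative.

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision, recall and the F-beta score from raw counts. With
    beta = 1 this is the F-measure (the harmonic mean of precision
    and recall) reported by Weka."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f
```

For instance, 8 correct positives with 2 false positives and 2 false negatives give precision, recall, and F-measure of 0.8 each.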
With these percentages, the best model is expected to be the one whose F-measure value is
closest to 1. The following table shows some combinations that are significant in the data
preprocessing for model generation, together with their precision, recall, and F-measure.
First, the best filter options are analyzed with unadjusted values for the J48 classifier,
and the best parameters are selected. Then the best settings for the J48 classifier
algorithm are selected, using the best configuration of the StringToWordVector filter.
Comparison table: Document classification models.

Features                                                    Precision  Recall  F-Measure
Word Tokenizer English & Spanish (E&S)                      0.810      0.803   0.800
Word Tokenizer E&S + Lower Case Conversion                  0.863      0.859   0.860
Trigrams E&S + Lower Case Conversion                        0.823      0.775   0.754
Stemming + Word Tokenizer E&S + Lower Case Conversion       0.864      0.817   0.823
Stopwords + Word Tokenizer E&S + Lower Case Conversion      0.976      0.972   0.972
Stopwords + Stemming + Word Tokenizer E&S
+ Lower Case Conversion                                     0.974      0.972   0.971
Stopwords + Word Tokenizer E&S + Lower Case Conversion
+ J48 minNumObj = 1                                         1.000      1.000   1.000
In conclusion, the best model is the combination of Stopwords + Word Tokenizer E&S + Lower
Case Conversion applied to the filter during data preprocessing, with minNumObj further
adjusted to 1 in the J48 classifier algorithm.
The following confusion matrix is the result of the combination of Stopwords + Word
Tokenizer E&S + Lower Case Conversion with minNumObj adjusted to 1 in the J48 algorithm.
This produces the following values in the confusion matrix.
a   b   c   d   e   f   Classified as
11 0 0 0 0 0 a = Hemodialysis
0 12 0 0 0 0 b = Nutrition
0 0 12 0 0 0 c = Cancer
0 0 0 12 0 0 d = Obesity
0 0 0 0 12 0 e = Diet
0 0 0 0 0 12 f = Diabetes
This table shows that every class was classified with precision and recall at 100%. The
accuracy values for each class are as follows:
Class TP Rate FP Rate Precision Recall F-Measure
Hemodialysis 1 0 1 1 1
Nutrition 1 0 1 1 1
Cancer 1 0 1 1 1
Obesity 1 0 1 1 1
Diet 1 0 1 1 1
Diabetes 1 0 1 1 1
Weighted Avg. 1 0 1 1 1
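The per-class rates in the table above can be derived mechanically from the confusion matrix. The following Python sketch shows that derivation; it is an illustrative helper, with matrix rows taken as actual classes and columns as predicted classes.

```python
def per_class_metrics(matrix, labels):
    """Derive TP rate, FP rate, precision and recall for each class
    from a square confusion matrix (rows = actual, columns = predicted)."""
    n = sum(sum(row) for row in matrix)
    results = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]                                    # diagonal cell
        fn = sum(matrix[i]) - tp                             # rest of the row
        fp = sum(matrix[r][i] for r in range(len(labels))) - tp  # rest of the column
        tn = n - tp - fn - fp
        results[label] = {
            "tp_rate": tp / (tp + fn) if tp + fn else 0.0,
            "fp_rate": fp / (fp + tn) if fp + tn else 0.0,
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return results
```

On a purely diagonal matrix like the one above, every off-diagonal cell is zero, so every class obtains TP rate, precision, and recall of 1 and FP rate of 0.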
Conclusion
Document classification in Spanish was analyzed using text mining through Weka, an
open-source software package. This software analyzes large amounts of data and decides
which parts are the most important; it aims to make automatic predictions that support
decision making.
Text mining is considered a subset of data mining. For this reason, text mining adopts the
data mining techniques, which use machine learning algorithms. Computational linguistics
also provides techniques for text mining; this science studies natural language with
computational methods to make it processable by computers.
Automatic categorization determines the subject matter of a document collection. The
classification starts with a set of training texts previously categorized and then
generates a classification model based on that set of examples. The model is then able to
assign the correct class to a new text. A decision tree is a classification technique that
represents knowledge through an if-else statement structure encoded in the branches of a
tree.
The textual mining methodology provides a framework performed in four stages: data
acquisition, document preprocessing, information extraction, and evaluation of results.
Witten, Frank, and Hall mention these steps in their work on the use of WEKA.
WEKA uses a standard format called the Attribute-Relation File Format (ARFF) to represent
the collection of documents as instances that share an ordered set of attributes, divided
into three sections: relation, attribute, and data.
Data preprocessing is based on the preparation of the text: a series of operations over the
text generates some kind of structured or semi-structured information for analysis. The
most popular way to represent documents is with a vector that contains all words found in
the text, indicating their occurrence. Important preprocessing tasks for categorizing
documents are stemming, lemmatization, removing stopwords, tokenization, and conversion to
lowercase.
A stemming algorithm eliminates morphemes, finding the relationship between words and
their lexeme. Stopword removal excludes the words that do not help to generate knowledge
from the text. Tokenization is the way the text is separated into words using punctuation;
in Spanish the punctuation marks are "; . : ? ! - () [] ' << >>", where the dot and the
dash are ambiguous, and Spanish, unlike English, includes an opening sign for exclamations
and questions. Conversion to lowercase treats all terms equally regardless of letter case.
After data preprocessing, the next step is knowledge extraction. Document classification in
Weka seeks to learn a predictive classification model; such models are used to predict the
class to which an instance belongs. The model is created using the decision tree algorithm
C4.5, as it is simple and widely used for the classification task.
Weka generates a confusion matrix for the generated model. This shows, in an easy way, how
many times the model's predictions were made correctly. The four possible outcomes are:
TP, true positive: a positive instance predicted as positive. TN, true negative: a negative
instance correctly classified as negative. FP, false positive: a negative instance
incorrectly classified as positive. FN, false negative: a positive instance incorrectly
classified as negative.
The selected training data set was found in the Thompson Rivers University library: 71
randomly selected medical academic articles in English and Spanish, stored in PDF format.
Based on the TRU library, the documents were classified into six recognized categories:
Hemodialysis, Nutrition, Cancer, Obesity, Diet, and Diabetes. The documents are stored in
directories named after their categories within a main folder called Medicine.
To form the ARFF file, an application was created that generates the ARFF from a
directory-based document collection. This application was made possible by the iTextSharp
library, used for text extraction from Portable Document Format files. The application is
named Documents Directory to ARFF.
The resulting ARFF contains a string attribute called "textoDocumento" that holds all text
found in the document, and a nominal attribute "docClass" that defines the class to which
the document belongs. Note that in recent versions of Weka, such as 3.6.12 used here, the
class attribute can never be named "class".
Various tests were applied to the same set of texts to assess the predictive accuracy of
the model. A set of optimal options was derived from different combinations applied to the
same training data; for each resulting model, the F-measure was calculated.
First, the best structure for the filter was analyzed with the J48 classifier options
unadjusted, and the best filter parameters were selected. Then, with that configuration,
the best settings for the J48 classifier algorithm were assessed. Based on a comparison
chart, it was discovered that the combination of Stopwords + Word Tokenizer E&S + Lower
Case Conversion, with minNumObj adjusted to 1 in the J48 algorithm, provides values of 1
for recall and precision.
The conclusion is that the best model is the combination of Stopwords + Word Tokenizer E&S
+ Lower Case Conversion applied to the data preprocessing filter, with minNumObj further
adjusted to 1 in the J48 classifier algorithm.
References
Witten, I. H., Frank, E., & Hall, M. A. (2011). Data mining: Practical machine learning
tools and techniques (3rd ed.). Elsevier.
Cs.waikato.ac.nz. (2015). Weka 3 - Data Mining with Open Source Machine Learning Software
in Java. Retrieved 5 May 2015, from http://www.cs.waikato.ac.nz/~ml/weka/
Shams, R. (2015). Weka Tutorial 31: Document Classification 1 (Application). YouTube.
Retrieved 15 May 2015, from https://www.youtube.com/watch?v=jSZ9jQy1sfE
Shams, R. (2015). Weka Tutorial 32: Document classification 2 (Application). YouTube.
Retrieved 15 May 2015, from https://www.youtube.com/watch?v=zlVJ2_N_Olo
Rodríguez, J., Calot, E., & Merlino, H. (2014). Clasificación de prescripciones médicas en
español. Sedici.unlp.edu.ar. Retrieved 15 May 2015, from
http://sedici.unlp.edu.ar/handle/10915/42402
Weinberg, B. (2015). Weka Text Classification for First Time & Beginner Users. YouTube.
Retrieved 15 May 2015, from https://www.youtube.com/watch?v=IY29uC4uem8.
Nlm.nih.gov. (2015). PubMed Tutorial - Building the Search - How It Works - Stopwords.
Retrieved 18 May 2015, from
http://www.nlm.nih.gov/bsd/disted/pubmedtutorial/020_170.html