Machine Learning in GATE
Valentin Tablan
Machine Learning in GATE
• Uses classification: [Attr1, Attr2, Attr3, … Attrn] → Class
• Classifies annotations. (Documents can be classified as well, using a simple trick.)
• Annotations of a particular type are selected as instances.
• Attributes refer to instance annotations.
• Attributes have a position relative to the instance annotation they refer to.
Attributes
Attributes can be:
– Boolean
  The presence [or absence] of an annotation of a particular type [partially] overlapping the referred instance annotation.
– Nominal
  The value of a particular feature of the referred instance annotation. The complete set of acceptable values must be specified a priori.
– Numeric
  The numeric value (converted from String) of a particular feature of the referred instance annotation.
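The three attribute kinds can be illustrated with a minimal Java sketch. It does not use the GATE API: an annotation is reduced to a list of overlapping annotation types and a map of String features, and all class and method names are hypothetical. Treating a nominal value outside the declared set as missing is an assumption made here, not documented behaviour.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AttributeValues {

    /** Boolean: true if an annotation of the wanted type overlaps the instance. */
    static boolean booleanValue(List<String> overlappingTypes, String wantedType) {
        return overlappingTypes.contains(wantedType);
    }

    /** Nominal: the feature value, provided it is one of the values declared up front. */
    static String nominalValue(Map<String, String> features, String feature,
                               List<String> allowedValues) {
        String value = features.get(feature);
        // Assumption: values outside the declared set are treated as missing (null).
        return (value != null && allowedValues.contains(value)) ? value : null;
    }

    /** Numeric: the feature value is a String and is converted to a number. */
    static Double numericValue(Map<String, String> features, String feature) {
        String value = features.get(feature);
        if (value == null) {
            return null;
        }
        try {
            return Double.valueOf(value.trim());
        } catch (NumberFormatException e) {
            return null; // not a number: treated as missing in this sketch
        }
    }

    public static void main(String[] args) {
        Map<String, String> tokenFeatures = new HashMap<>();
        tokenFeatures.put("category", "NNP");
        tokenFeatures.put("length", "6");

        System.out.println(booleanValue(Arrays.asList("Token", "Lookup"), "Lookup"));
        System.out.println(nominalValue(tokenFeatures, "category",
                Arrays.asList("NN", "NNP", "NNPS")));
        System.out.println(numericValue(tokenFeatures, "length"));
    }
}
```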
Implementation
Machine Learning PR in GATE.
Has two functioning modes:
– training
– application
Uses an XML file for configuration:
<?xml version="1.0" encoding="windows-1252"?>
<ML-CONFIG>
  <DATASET> … </DATASET>
  <ENGINE> … </ENGINE>
</ML-CONFIG>
<DATASET>
<DATASET>
  <INSTANCE-TYPE>Token</INSTANCE-TYPE>
  <ATTRIBUTE>
    <NAME>POS_category(0)</NAME>
    <TYPE>Token</TYPE>
    <FEATURE>category</FEATURE>
    <POSITION>0</POSITION>
    <VALUES>
      <VALUE>NN</VALUE>
      <VALUE>NNP</VALUE>
      <VALUE>NNPS</VALUE>
      …
    </VALUES>
    [<CLASS/>]
  </ATTRIBUTE>
  …
</DATASET>
<ENGINE>
<ENGINE>
  <WRAPPER>gate.creole.ml.weka.Wrapper</WRAPPER>
  <OPTIONS>
    <CLASSIFIER>weka.classifiers.j48.J48</CLASSIFIER>
    <CLASSIFIER-OPTIONS>-K 3</CLASSIFIER-OPTIONS>
    <CONFIDENCE-THRESHOLD>0.85</CONFIDENCE-THRESHOLD>
  </OPTIONS>
</ENGINE>
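To show how such an <ENGINE> element could be read, here is a small Java sketch that pulls out the classifier name and the confidence threshold. It uses the JDK's built-in DOM parser so that it is self-contained; GATE itself passes the options to the engine as a JDOM element. The class and helper names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EngineConfigSketch {

    /** Parses an XML string; checked exceptions are wrapped to keep the sketch short. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new RuntimeException("Could not parse configuration", e);
        }
    }

    /** Returns the trimmed text of the first element with the given tag name. */
    static String text(Document doc, String tag) {
        return doc.getElementsByTagName(tag).item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml =
              "<ENGINE>"
            + "<WRAPPER>gate.creole.ml.weka.Wrapper</WRAPPER>"
            + "<OPTIONS>"
            + "<CLASSIFIER>weka.classifiers.j48.J48</CLASSIFIER>"
            + "<CLASSIFIER-OPTIONS>-K 3</CLASSIFIER-OPTIONS>"
            + "<CONFIDENCE-THRESHOLD>0.85</CONFIDENCE-THRESHOLD>"
            + "</OPTIONS>"
            + "</ENGINE>";

        Document doc = parse(xml);
        String classifier = text(doc, "CLASSIFIER");
        double threshold = Double.parseDouble(text(doc, "CONFIDENCE-THRESHOLD"));

        System.out.println("classifier = " + classifier);
        System.out.println("threshold  = " + threshold);
    }
}
```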
Attribute Positions
Instance type: Token
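A rough Java sketch of how a relative position could be resolved when the instance type is Token. It assumes that positions count instance annotations in document order, with 0 being the instance itself, negative values to its left and positive values to its right; this reading and all names below are assumptions, not the GATE implementation.

```java
import java.util.Arrays;
import java.util.List;

public class AttributePosition {

    /**
     * Returns the value found at (instanceIndex + position), or null when that
     * position falls outside the document.
     */
    static String valueAt(List<String> values, int instanceIndex, int position) {
        int target = instanceIndex + position;
        if (target < 0 || target >= values.size()) {
            return null;
        }
        return values.get(target);
    }

    public static void main(String[] args) {
        // The "category" feature of four consecutive Token annotations.
        List<String> categories = Arrays.asList("DT", "NN", "VBZ", "JJ");
        int instance = 2; // the token tagged VBZ is the current instance

        for (int position = -2; position <= 2; position++) {
            System.out.println("POS_category(" + position + ") = "
                    + valueAt(categories, instance, position));
        }
    }
}
```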
Machine Learning PR
• Can save a learnt model to an external file for later use.
  Saves the actual model and the collected dataset.
• Can export the collected dataset in .arff format.
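For orientation, a minimal Java sketch of what a dataset in .arff format looks like: a relation name, one @attribute line per attribute, then comma-separated rows under @data. This is not the exporter used by the ML PR; the relation name, the attribute names and the sample rows are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class ArffSketch {

    /**
     * Builds an ARFF document. Each entry of types is either "numeric" or a
     * nominal specification such as "{NN,NNP,NNPS}". Attribute names are
     * single-quoted so that characters like parentheses are safe.
     */
    static String toArff(String relation, List<String> names, List<String> types,
                         List<List<String>> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        for (int i = 0; i < names.size(); i++) {
            sb.append("@attribute '").append(names.get(i)).append("' ")
              .append(types.get(i)).append('\n');
        }
        sb.append("\n@data\n");
        for (List<String> row : rows) {
            sb.append(String.join(",", row)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String arff = toArff(
                "pos_context",
                Arrays.asList("POS_category(-1)", "POS_category(0)"),
                Arrays.asList("{NN,NNP,NNPS}", "{NN,NNP,NNPS}"),
                Arrays.asList(
                        Arrays.asList("NN", "NNP"),
                        Arrays.asList("?", "NN"))); // "?" marks a missing value
        System.out.print(arff);
    }
}
```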
Standard Use Scenario
Training
• Prepare training data by enriching the documents with annotations for attributes (e.g. run Tokeniser, POS tagger, Gazetteer, etc.).
• Run the ML PR in training mode.
• Export the dataset as .arff and perform experiments using the WEKA interface in order to find the best attribute set / algorithm / algorithm options.
• Update the configuration file accordingly.
• Run the ML PR again to collect the actual data.
• [ Save the learnt model. ]
Application
• Prepare data by enriching the documents with annotations for attributes (e.g. run Tokeniser, POS tagger, Gazetteer, etc.).
• [ Load the previously saved model. ]
• Run the ML PR in application mode.
• [ Save the learnt model. ]
An Example
Learn POS category from POS context.
Using Other ML Libraries
The MLEngine Interface
Method Summary
• void addTrainingInstance(List attributes)
  Adds a new training instance to the dataset.
• Object classifyInstance(List attributes)
  Classifies a new instance.
• void init()
  This method will be called after an engine is created and has its dataset and options set.
• void setDatasetDefinition(DatasetDefintion definition)
  Sets the definition for the dataset used.
• void setOptions(org.jdom.Element options)
  Sets the options from an XML JDom element.
• void setOwnerPR(ProcessingResource pr)
  Registers the PR using the engine with the engine.
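To give a feel for what an engine built against this interface has to do, here is a self-contained Java sketch of a trivial majority-class engine. It does not implement the real MLEngine interface: EngineLike is a hypothetical stand-in that mirrors only three of the methods listed above, so the example compiles without GATE on the classpath. It also assumes that the class value is the last element of each training list, which the slides do not specify.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MajorityEngineSketch {

    /** Hypothetical stand-in mirroring part of the MLEngine method summary. */
    interface EngineLike {
        void init();
        void addTrainingInstance(List<Object> attributes);
        Object classifyInstance(List<Object> attributes);
    }

    /** Baseline engine: always predicts the class seen most often in training. */
    static class MajorityEngine implements EngineLike {
        private Map<Object, Integer> classCounts;

        @Override
        public void init() {
            classCounts = new HashMap<>();
        }

        @Override
        public void addTrainingInstance(List<Object> attributes) {
            // Assumption: the class value is the last element of the list.
            Object label = attributes.get(attributes.size() - 1);
            classCounts.merge(label, 1, Integer::sum);
        }

        @Override
        public Object classifyInstance(List<Object> attributes) {
            Object best = null;
            int bestCount = 0;
            for (Map.Entry<Object, Integer> entry : classCounts.entrySet()) {
                if (entry.getValue() > bestCount) {
                    best = entry.getKey();
                    bestCount = entry.getValue();
                }
            }
            return best; // null if nothing has been learnt yet
        }
    }

    public static void main(String[] args) {
        MajorityEngine engine = new MajorityEngine();
        engine.init();
        engine.addTrainingInstance(Arrays.<Object>asList("DT", "NN"));
        engine.addTrainingInstance(Arrays.<Object>asList("JJ", "NN"));
        engine.addTrainingInstance(Arrays.<Object>asList("NN", "VBZ"));
        System.out.println(engine.classifyInstance(Arrays.<Object>asList("DT")));
    }
}
```

A real engine would replace the counting logic with calls into the chosen ML library and would also receive its dataset definition, options and owner PR through the remaining setter methods.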