
MALLET: MAchine Learning for LanguagE Toolkit

Outline

• About MALLET

• Representing Data

• Command Line Processing

• Simple Evaluation

• Conclusion

About MALLET

• "MALLET: A Machine Learning for Language Toolkit."
  • Written by Andrew McCallum
  • http://mallet.cs.umass.edu. 2002.
  • Implemented in Java; currently version 2.0.6
• Motivation:
  • Text classification and information extraction
  • Commercial machine learning
  • Analysis and indexing of academic publications


About MALLET

• Main idea
  • Text focus: data is discrete rather than continuous, even when values could be continuous
• How to
  • Command line scripts:
    • bin/mallet [command] --[option] [value] …
    • Text User Interface ("tui") classes
  • Direct Java API
    • http://mallet.cs.umass.edu/api






Representations

• Transform text documents to vectors x1, x2, …
• Elements of a vector are called feature values
  • Example: "Feature at row 345 is the number of times 'dog' appears in the document"
• Retain the meaning of vector indices
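As a rough illustration of the idea on this slide, the Python sketch below assigns each word a fixed index and counts occurrences per document, so that an index always refers to the same word. It is not MALLET code: the `Alphabet` class and `to_vector` function are hypothetical names chosen for this example, and whitespace tokenization is an assumption.

```python
from collections import Counter


class Alphabet:
    """Hypothetical two-way mapping between words and vector indices."""

    def __init__(self):
        self.index_of = {}  # word -> index
        self.words = []     # index -> word

    def lookup(self, word):
        # Assign the next free index the first time a word is seen.
        if word not in self.index_of:
            self.index_of[word] = len(self.words)
            self.words.append(word)
        return self.index_of[word]


def to_vector(text, alphabet):
    """Return a sparse feature vector {index: count} for one document."""
    tokens = text.lower().split()  # assumption: whitespace tokenization
    return dict(Counter(alphabet.lookup(tok) for tok in tokens))


alphabet = Alphabet()
x1 = to_vector("the dog chased the other dog", alphabet)
x2 = to_vector("the cat slept", alphabet)

# The shared alphabet keeps the meaning of each index across documents.
dog = alphabet.index_of["dog"]
print(f"feature {dog} ({alphabet.words[dog]!r}) has value {x1[dog]} in x1")
```

Sharing one alphabet across all documents is what lets a feature index be traced back to the word it counts.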

Documents to Vectors

Instances





Command Line

• Importing Data

• Classification

• Sequence Tagging

• Topic Modeling

Importing Data

• One instance per file
  • files in the folder: sample-data/web/en or sample-data/web/de
  • command line:
    bin/mallet import-dir --input sample-data/web/* --output web.mallet
• One file, one instance per line
  • file format: [URL] [language] [text of the page...]
  • command line:
    bin/mallet import-file --input /data/web/data.txt --output web.mallet
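To make the one-instance-per-line format concrete, here is a minimal Python sketch that splits a line into its three fields (instance name, label, and text). It only illustrates the file layout described above and is not how MALLET itself parses input; the function name `parse_instance` and the example URLs are made up for this sketch.

```python
def parse_instance(line):
    """Split one line of '[URL] [language] [text of the page...]'.

    The first two whitespace-separated fields are taken as the instance
    name and the label; everything after them is the document text.
    """
    name, label, data = line.strip().split(None, 2)
    return name, label, data


# Hypothetical example lines in the format described above.
lines = [
    "http://example.org/a en The quick brown fox",
    "http://example.org/b de Der schnelle braune Fuchs",
]

instances = [parse_instance(line) for line in lines]
for name, label, data in instances:
    print(f"{label}: {name} -> {data!r}")
```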

Classification

• Training a classifier
    bin/mallet train-classifier --input training.mallet --output-classifier my.classifier
• Choosing an algorithm
  • MaxEnt, NaiveBayes, C45, DecisionTree and many others.
    bin/mallet train-classifier --input training.mallet --output-classifier my.classifier --trainer MaxEnt
• Evaluation
  • Randomly split the data into 90% training instances, which are used to train the classifier, and 10% testing instances.
    bin/mallet train-classifier --input labeled.mallet --training-portion 0.9
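The random 90%/10% split can be pictured with the short Python sketch below. It is an illustration of the idea only, not MALLET's implementation; the function name `split_instances` and its `seed` parameter are assumptions introduced for this example.

```python
import random


def split_instances(instances, training_portion, seed=0):
    """Shuffle the instances and cut them into (training, testing) lists."""
    shuffled = list(instances)
    # A fixed seed (an assumption of this sketch) makes the split repeatable.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * training_portion))
    return shuffled[:cut], shuffled[cut:]


# 100 dummy instances stand in for the labeled documents.
train, test = split_instances(range(100), training_portion=0.9)
print(len(train), "training instances,", len(test), "testing instances")
```

Every instance lands in exactly one of the two lists, so the classifier is never evaluated on data it was trained on.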

Sequence Tagging

• Sequence algorithms
  • hidden Markov models (HMMs)
  • linear chain conditional random fields (CRFs)
• SimpleTagger
  • a command line interface to the MALLET Conditional Random Field (CRF) class

SimpleTagger

• Input file: [feature1 feature2 ... featuren label]

    Bill CAPITALIZED noun
    slept non-noun
    here LOWERCASE STOPWORD non-noun

• Train a CRF
  • An input file "sample"
  • A trained CRF in the file "nouncrf"

    java -cp "~/mallet/class:~/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --train true --model-file nouncrf sample
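The training file layout shown on this slide (features first, label last on each line) can be read with a few lines of Python. This is a hedged sketch of the format only, not part of SimpleTagger; `parse_training_line` is a hypothetical helper name.

```python
def parse_training_line(line):
    """Split 'feature1 feature2 ... featuren label' into (features, label)."""
    tokens = line.split()
    # The last token is the label; everything before it is a feature.
    return tokens[:-1], tokens[-1]


# The three-line sample from the slide.
sample = """Bill CAPITALIZED noun
slept non-noun
here LOWERCASE STOPWORD non-noun"""

parsed = [parse_training_line(line) for line in sample.splitlines()]
for features, label in parsed:
    print(f"{label:9} <- {features}")
```

Note that the word itself ("Bill", "slept", "here") is simply the first feature on the line, not a separate field.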

SimpleTagger

• A file "stest" that needs to be labeled

    CAPITAL Al
    slept
    here

• Label the input

    java -cp "~/mallet/class:~/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --model-file nouncrf stest

• Output

    Number of predicates: 5
    noun CAPITAL Al
    non-noun slept
    non-noun here

Topic Modeling

• Building Topic Models

    bin/mallet train-topics --input topic-input.mallet --num-topics 100 --output-state topic-state.gz

  • --input [FILE]
  • --num-topics [NUMBER] The number of topics to use. The best number depends on what you are looking for in the model.
  • --num-iterations [NUMBER] The number of sampling iterations should be a trade-off between the time taken to complete sampling and the quality of the topic model.
  • --output-state [FILENAME] This option outputs a compressed text file containing the words in the corpus with their topic assignments.






Methodology

• Focus on the sequence tagging module in MALLET
  • CRF-based implementation
  • Some scripts written for importing data and evaluating results
• Small corpora collected from the web
  • Divided into two parts: 80% for training, 20% for testing
• Evaluate both POS tagging and Named Entity Recognition (NER)
  • Training performance
  • Accuracy (POS tagging) and Precision, Recall and FB1 (NER)
• All scripts, corpora and results can be found here
  • http://mallet-eval.googlecode.com


A Survey of Named Entity Corpora

• Well-known named entity corpora
  • Language-Independent Named Entity Recognition at CoNLL-2003
    • A manual annotation of a subset of RCV1 (Reuters Corpus Volume 1)
    • free and public, but needs the RCV1 raw texts as input
  • Message Understanding Conference (MUC) 6 / 7
    • not free
  • Automatic Content Extraction (ACE) Training Corpus
    • not free
• Other special-purpose corpora
  • Enron Email Dataset
    • email messages in this corpus are tagged with person names, dates and times
  • A variety of biomedical corpora
    • some corpora in this collection are tagged with entities in the biomedical domain, such as gene names

Small Corpora

• Two small corpora collected from the web
  • Penn Treebank Sample
    • English POS tagging corpus, ~5% fragment of the Penn Treebank, (C) LDC 1995
    • raw, tagged, parsed and combined data from the Wall Street Journal
    • 148120 tokens, 36 standard Treebank POS tags
    • http://web.mit.edu/course/6/6.863/OldFiles/share/data/corpora/treebank/
  • HIT CIR LTP Corpora Sample
    • Chinese NER corpora, integrated
    • 10% of the whole corpora (open to the public)
    • 23751 tokens, 7 kinds of named entities
    • http://ir.hit.edu.cn/demo/ltp/Sharing_Plan.htm


Environment

• Hardware
  • CPU: Q8300 Quad Core 2.50 GHz
  • Memory: 3GB
• Software
  • Fedora 13 x86_64
  • Java 1.6.0_18
  • MALLET 2.0.6

Evaluation

Stage      Measure       pos         chunking    ner
Training   Instance #    3982        8936        1286
Training   Tokens #      95767       211727      20913
Training   Time          308m 23s    190m 50s    17m 13s
Test       Tokens #      46452       47377       2829
Test       Accuracy      85.67%      93.97%      98.55%
Test       Precision     -           90.54%      86.89%
Test       Recall        -           89.89%      86.89%
Test       FB1           -           90.21       86.89
Test       Time          15.80s      4.43s       0.8s
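The FB1 figures in the table are consistent with the usual F-measure with β = 1, the harmonic mean of precision and recall. The Python sketch below recomputes them from the reported precision and recall; the function name `f_beta` is chosen for this example, and the sketch assumes FB1 was computed with this standard formula.

```python
def f_beta(precision, recall, beta=1.0):
    """F-measure: (1 + b^2) * P * R / (b^2 * P + R); beta=1 gives FB1."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall (in percent) taken from the table above.
chunking_fb1 = f_beta(90.54, 89.89)
ner_fb1 = f_beta(86.89, 86.89)

print(f"chunking FB1: {chunking_fb1:.2f}")
print(f"ner FB1:      {ner_fb1:.2f}")
```

When precision and recall are equal, as in the ner column, FB1 equals that same value.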
