MALLET: MAchine Learning for LanguagE Toolkit
Outline
• About MALLET
• Representing Data
• Command Line Processing
• Simple Evaluation
• Conclusion
About MALLET
• "MALLET: A Machine Learning for Language Toolkit"
  • Written by Andrew McCallum
  • http://mallet.cs.umass.edu (2002)
  • Implemented in Java; currently version 2.0.6
• Motivation:
  • Text classification and information extraction
  • Commercial machine learning
  • Analysis and indexing of academic publications
About MALLET
• Main idea
  • Text focus: data is discrete rather than continuous, even when values could be continuous
• How to
  • Command line scripts:
    • bin/mallet [command] --[option] [value] …
  • Text User Interface ("tui") classes
  • Direct Java API
    • http://mallet.cs.umass.edu/api
Representations
• Transform text documents to vectors x1, x2, …
• Elements of a vector are called feature values
  • Example: "The feature at row 345 is the number of times 'dog' appears in the document"
• Retain the meaning of vector indices
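The idea of turning documents into count vectors while keeping track of what each index means can be sketched in a few lines of Python. This is an illustrative sketch only, not MALLET's implementation: the function names `build_alphabet` and `to_vector`, the whitespace tokenization, and the two sample documents are all assumptions made for the example.

```python
from collections import Counter

def build_alphabet(documents):
    """Assign each distinct token a fixed integer index, in first-seen order."""
    alphabet = {}
    for text in documents:
        for token in text.lower().split():
            alphabet.setdefault(token, len(alphabet))
    return alphabet

def to_vector(text, alphabet):
    """Count how often each known token occurs; unknown tokens are ignored."""
    counts = Counter(text.lower().split())
    vector = [0] * len(alphabet)
    for token, count in counts.items():
        if token in alphabet:
            vector[alphabet[token]] = count
    return vector

# Hypothetical sample documents, invented for illustration.
docs = ["the dog chased the cat", "the cat slept"]
alphabet = build_alphabet(docs)
vectors = [to_vector(d, alphabet) for d in docs]
```

Because the token-to-index mapping is kept alongside the vectors, the value at `alphabet["dog"]` can always be read back as "number of times 'dog' appears in the document".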
Documents to Vectors
Instances
Command Line
• Importing Data
• Classification
• Sequence Tagging
• Topic Modeling
Importing Data
• One instance per file
  • Files in the folders: sample-data/web/en or sample-data/web/de
  • Command line: bin/mallet import-dir --input sample-data/web/* --output web.mallet
• One file, one instance per line
  • File format: [URL] [language] [text of the page...]
  • Command line: bin/mallet import-file --input /data/web/data.txt --output web.mallet
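To make the one-instance-per-line format concrete, here is a small Python sketch that splits such a line into its three fields. It assumes the fields are separated by whitespace and that everything after the second field is the page text; the function name `parse_instance_line` and the example URLs are hypothetical, and the sketch does not reproduce MALLET's own parsing.

```python
def parse_instance_line(line):
    """Split a line into (name, label, data) on the first two runs of whitespace."""
    parts = line.strip().split(None, 2)
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    name, label, data = parts
    return name, label, data

# Hypothetical input lines in the [URL] [language] [text] layout.
lines = [
    "http://example.org/a en The quick brown fox",
    "http://example.org/b de Der schnelle braune Fuchs",
]
instances = [parse_instance_line(l) for l in lines]
```

Each tuple corresponds to one instance: the URL acts as the instance name, the language as its label, and the remaining text as the data to be vectorized.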
Classification
• Training a classifier
  bin/mallet train-classifier --input training.mallet --output-classifier my.classifier
• Choosing an algorithm
  • MaxEnt, NaiveBayes, C45, DecisionTree and many others.
  bin/mallet train-classifier --input training.mallet --output-classifier my.classifier --trainer MaxEnt
• Evaluation
  • Randomly split the data into 90% training instances, which are used to train the classifier, and 10% testing instances.
  bin/mallet train-classifier --input labeled.mallet --training-portion 0.9
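The random 90%/10% split described above can be illustrated as follows. This is a generic sketch of a shuffled hold-out split, not the code MALLET runs for --training-portion; the function name `split_instances` and the fixed seed are assumptions chosen so the example is reproducible.

```python
import random

def split_instances(instances, training_portion, seed=0):
    """Shuffle a copy of the instances and cut it into (training, testing) lists."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * training_portion))
    return shuffled[:cut], shuffled[cut:]

# 100 placeholder instances, 90% for training and 10% for testing.
train, test = split_instances(range(100), 0.9)
```

Every instance ends up in exactly one of the two lists, so the classifier is never evaluated on data it was trained on.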
Sequence Tagging
• Sequence algorithms
  • Hidden Markov models (HMMs)
  • Linear-chain conditional random fields (CRFs)
• SimpleTagger
  • A command line interface to the MALLET Conditional Random Field (CRF) class
SimpleTagger
• Input file: one token per line, in the form [feature1 feature2 ... featuren label]
  Bill CAPITALIZED noun
  slept non-noun
  here LOWERCASE STOPWORD non-noun
• Train a CRF
  • An input file "sample"
  • A trained CRF in the file "nouncrf"
  java -cp "$HOME/mallet/class:$HOME/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --train true --model-file nouncrf sample
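The training file layout (features first, label last, one token per line) can be read with a short Python sketch like the one below. It is only meant to show how a line decomposes; the function name `parse_tagged_sequence` is hypothetical, and for simplicity the sketch skips blank lines and treats the whole input as a single sequence.

```python
def parse_tagged_sequence(text):
    """Return a list of (features, label) pairs, one per non-empty line."""
    sequence = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        # The last token on the line is the label; everything before it is a feature.
        *features, label = tokens
        sequence.append((features, label))
    return sequence

sample = """Bill CAPITALIZED noun
slept non-noun
here LOWERCASE STOPWORD non-noun"""
sequence = parse_tagged_sequence(sample)
```

Note that lines may carry different numbers of features: "slept" has only the word itself, while "here" has two extra features.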
SimpleTagger
• A file "stest" that needs to be labeled
  CAPITAL Al
  slept
  here
• Label the input
  java -cp "$HOME/mallet/class:$HOME/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --model-file nouncrf stest
• Output
  Number of predicates: 5
  noun CAPITAL Al
  non-noun slept
  non-noun here
Topic Modeling
• Building topic models
  bin/mallet train-topics --input topic-input.mallet --num-topics 100 --output-state topic-state.gz
--input [FILE]
--num-topics [NUMBER] The number of topics to use. The best number depends on what you are looking for in the model.
--num-iterations [NUMBER] The number of sampling iterations should be a trade-off between the time taken to complete sampling and the quality of the topic model.
--output-state [FILENAME] This option outputs a compressed text file containing the words in the corpus with their topic assignments.
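As a rough illustration of what can be done with the words and topic assignments in the state file, the sketch below counts the most frequent words per topic. The column layout is an assumption (six whitespace-separated columns with the word in the fifth and the topic in the sixth, and comment lines starting with "#"), so check it against an actual file before relying on it. The function name `top_words_per_topic` and the sample lines are invented for the example.

```python
from collections import Counter, defaultdict

def top_words_per_topic(lines, n=3):
    """Map each topic id to its n most frequently assigned words."""
    counts = defaultdict(Counter)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blanks
        fields = line.split()
        # Assumed layout: doc source pos typeindex type topic
        word, topic = fields[4], int(fields[5])
        counts[topic][word] += 1
    return {t: [w for w, _ in c.most_common(n)] for t, c in counts.items()}

# Hypothetical state lines, invented for illustration.
state_lines = [
    "#doc source pos typeindex type topic",
    "0 NA 0 0 dog 1",
    "0 NA 1 1 cat 1",
    "0 NA 2 0 dog 1",
    "1 NA 0 2 stock 0",
]
top = top_words_per_topic(state_lines)
```

For a real run, the lines could come from `gzip.open("topic-state.gz", "rt", encoding="utf-8")` in Python's standard library, since the state file is written as compressed text.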
Demo
Methodology
• Focus on the sequence tagging module in MALLET
  • CRF-based implementation
  • Some scripts written for importing data and evaluating results
• Small corpora collected from the web
  • Divided into two parts: 80% for training, 20% for testing
• Evaluate both POS tagging and named entity recognition (NER)
  • The performance of training
  • Accuracy (POS tagging) and precision, recall and FB1 (NER)
• All scripts, corpora and results can be found here
  • http://mallet-eval.googlecode.com
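The measures named above can be written down compactly. The following Python sketch uses the usual definitions of token accuracy and of entity-level precision and recall; it is not the evaluation script from the project, and the function names, the (start, end, type) span encoding, and the toy data are assumptions made for illustration.

```python
def accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("tag sequences must have the same length")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)

def precision_recall(gold_entities, predicted_entities):
    """Entity-level precision and recall; an entity counts only on exact match."""
    gold, predicted = set(gold_entities), set(predicted_entities)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Toy data: three POS tags, and entities encoded as (start, end, type) spans.
acc = accuracy(["DT", "NN", "VBD"], ["DT", "NN", "NN"])
p, r = precision_recall(
    {(0, 2, "Ni"), (5, 5, "Nh")},
    {(0, 2, "Ni"), (7, 8, "Ns")},
)
```

Accuracy suits POS tagging because every token receives a tag, whereas precision and recall suit NER because most tokens are not part of any entity.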
A Survey of Named Entity Corpora
• Well-known named entity corpora
  • Language-Independent Named Entity Recognition at CoNLL-2003
    • A manual annotation of a subset of RCV1 (Reuters Corpus Volume 1)
    • Free and public, but requires the RCV1 raw texts as input
  • Message Understanding Conference (MUC) 6 / 7
    • Not free
  • Automatic Content Extraction (ACE) Training Corpus
    • Not free
• Other special-purpose corpora
  • Enron Email Dataset
    • Email messages in this corpus are tagged with person names, dates and times.
  • A variety of biomedical corpora
    • Some corpora in this collection are tagged with entities in the biomedical domain, such as gene names
Small Corpora
• Two small corpora collected from the web
  • Penn Treebank Sample
    • English POS tagging corpus, ~5% fragment of the Penn Treebank, (C) LDC 1995.
    • Raw, tagged, parsed and combined data from the Wall Street Journal
    • 148120 tokens, 36 standard Treebank POS tags
    • http://web.mit.edu/course/6/6.863/OldFiles/share/data/corpora/treebank/
  • HIT CIR LTP Corpora Sample
    • Integrated Chinese NER corpora
    • 10% of the whole corpora (open to the public)
    • 23751 tokens, 7 kinds of named entities
    • http://ir.hit.edu.cn/demo/ltp/Sharing_Plan.htm
Environment
• Hardware
  • CPU: Q8300 Quad Core 2.50 GHz
  • Memory: 3GB
• Software
  • Fedora 13 x86_64
  • Java 1.6.0_18
  • MALLET 2.0.6
Data Format and Labels
• Data Format
  • Each token one row, each feature one column
    Bill noun
    slept non-noun
    Here non-noun
• Labels
  • Standard Treebank POS tags
    • CC Coordinating conjunction | CD Cardinal number | DT Determiner | EX Existential there | FW Foreign word | IN Preposition or subordinating conjunction | JJ Adjective | JJR Adjective, comparative | JJS Adjective, superlative | LS List item marker | MD Modal | NN Noun, singular or mass | NNS Noun, plural … … (36 tags in all)
  • HIT Named Entity
    • O not an NE | S- forms an NE by itself | B- beginning of an NE | I- middle of an NE | E- end of an NE
    • Nm numeral | Ni organization name | Ns place name | Nh person name | Nt time | Nr date | Nz proper noun
    • Example: 美国 B-Ni 洛杉矶 I-Ni 警察局 E-Ni (three tokens, "United States", "Los Angeles", "Police Department", forming one organization name)
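To show how the position prefixes (B-, I-, E-, S-, O) combine with the entity types, here is a Python sketch that groups tagged tokens back into entities. It is an illustrative decoder written for this tag scheme, not code from the corpus tools; the function name `decode_entities` is hypothetical, and the last two tokens in the sample (在, 北京) are invented to exercise the O and S- cases.

```python
def decode_entities(tagged_tokens):
    """Group (token, tag) pairs into (entity_type, tokens) entities."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in tagged_tokens:
        position, _, entity_type = tag.partition("-")
        if position == "S":
            # A single token that is an entity on its own.
            entities.append((entity_type, [token]))
            current_tokens, current_type = [], None
        elif position == "B":
            # Start collecting a multi-token entity.
            current_tokens, current_type = [token], entity_type
        elif position in ("I", "E") and current_type == entity_type:
            current_tokens.append(token)
            if position == "E":
                entities.append((entity_type, current_tokens))
                current_tokens, current_type = [], None
        else:
            # "O", or a malformed sequence: discard any partial entity.
            current_tokens, current_type = [], None
    return entities

tagged = [
    ("美国", "B-Ni"), ("洛杉矶", "I-Ni"), ("警察局", "E-Ni"),
    ("在", "O"), ("北京", "S-Ns"),  # hypothetical extra tokens
]
entities = decode_entities(tagged)
```

The three tokens from the example on the slide come back as one organization-name entity, which is the unit that entity-level precision and recall are computed over.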
Evaluation
Stage     Measure      pos        chunking   ner
Training  Instances #  3982       8936       1286
Training  Tokens #     95767      211727     20913
Training  Time         308m 23s   190m 50s   17m 13s
Test      Tokens #     46452      47377      2829
Test      Accuracy     85.67%     93.97%     98.55%
Test      Precision    -          90.54%     86.89%
Test      Recall       -          89.89%     86.89%
Test      FB1          -          90.21      86.89
Test      Time         15.80s     4.43s      0.8s
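FB1 is the balanced F-measure, that is, the harmonic mean of precision and recall. The short Python sketch below recomputes it from the precision and recall figures reported above as a consistency check; the function name `fb1` is an assumption made for the example.

```python
def fb1(precision, recall):
    """Harmonic mean of precision and recall (F-measure with beta = 1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall (in percent) as reported for the test stage.
chunking_fb1 = round(fb1(90.54, 89.89), 2)
ner_fb1 = round(fb1(86.89, 86.89), 2)
```

Both recomputed values agree with the reported FB1 figures (90.21 for chunking and 86.89 for NER).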
DEMO
Q&A