
Source slides: times.cs.uiuc.edu/course/510f17/ppt/meta-lecture.pdf

MeTA: A Unified Toolkit for Text Retrieval and Analysis

Sean Massung, Chase Geigle, and ChengXiang Zhai

December 5th, 2017


Motivation


A Classic Text-Mining Example

Document Classification


Step 1

SvmLight (svmlight.joachims.org/)


Step 2

SvmPerf (www.cs.cornell.edu/people/tj/svm_light/svm_perf.html)


Step 3

SvmMulticlass (www.cs.cornell.edu/people/tj/svm_light/svm_multiclass.html)


Step 4

Data munging


Step 5

Feature generation


Step 6

Handling Unicode


Step 7

Out of memory


Step 8

Incapacitated server, different algorithm?


Step 9

Despair???


Step 0

MeTA!


To the Rescue

MeTA (https://meta-toolkit.org) was created to solve these text-mining headaches once and for all by

1. uniting machine learning and information retrieval algorithms under one roof (no more tool hunting, separate compiling and configuration, etc.),
   • indexing, ranking, searching, topic modeling, classification, parallelization, n-gram language modeling, sequence labeling, parsing, word embeddings, graph mining, …
2. integrating all steps of text tokenization into one framework,
3. respecting the Unicode standards,
4. supporting out-of-the-box out-of-core classification,
5. being competitive with existing, highly specific tools, and
6. letting you write as little code as possible (often none at all).


Components of MeTA


Overview

MeTA can be broken down into the following major components (the ones in bold are covered today):

• Indexes (data storage)

• Analyzers and Filters (text tokenization and processing)

• Search engine (on top of inverted_index)

• Classification (on top of forward_index)

• Regression (on top of forward_index)

• Topic modeling (on top of forward_index)

• Sequence labeling

• Syntactic parsing

• Word embeddings

• Graph mining


Feature Generation


Tokenizers

Tokenizers convert documents to token streams;


Filters

Filters add, remove, or modify tokens;


Analyzers

and Analyzers count.


Example

icu-tokenizer → lowercase → length(2, 35) → unigrams

[[analyzers]]
method = "ngram-word"
ngram = 1
[[analyzers.filter]]
type = "icu-tokenizer"
[[analyzers.filter]]
type = "lowercase"
[[analyzers.filter]]
type = "length"
min = 2
max = 35
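The chain above can be mimicked in a few lines of Python to show what each stage does. This is only a sketch under stated assumptions: the regex tokenizer is a stand-in for icu-tokenizer (which follows Unicode segmentation rules), and the function names are hypothetical, not part of MeTA.

```python
import re
from collections import Counter


def tokenize(text):
    # Stand-in tokenizer (assumption): runs of word characters.
    # The real icu-tokenizer applies Unicode segmentation rules instead.
    return re.findall(r"\w+", text)


def lowercase(tokens):
    # Filter that modifies tokens.
    for tok in tokens:
        yield tok.lower()


def length_filter(tokens, min_len=2, max_len=35):
    # Filter that removes tokens outside [min_len, max_len].
    for tok in tokens:
        if min_len <= len(tok) <= max_len:
            yield tok


def unigram_counts(tokens):
    # Analyzer step: count the tokens that survive the filter chain.
    return Counter(tokens)


def analyze(text):
    return unigram_counts(length_filter(lowercase(tokenize(text)), 2, 35))


counts = analyze("A cat saw a Cat; the cat ran.")
```

Here the single-character token "a" is dropped by the length filter, and "Cat" and "cat" are merged by the lowercase filter before counting.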


Porter2 English Stemmer (https://github.com/smassung/porter2_stemmer)

[[analyzers.filter]]
type = "porter2-filter"

{waits, waited, waiting} → wait
{consist, consisted, consistency, consists} → consist
{run, runs, running} → run


Feature Combining

(default-chain → bigrams) ∪ POS-tags

[[analyzers]]
method = "ngram-word"
ngram = 2
filter = "default-chain"

[[analyzers]]
method = "ngram-pos"
ngram = 1
filter = [{type = "icu-tokenizer"},
          {type = "ptb-normalizer"}]
crf-prefix = "path/to/model/files"


Features from Parse Trees

Example parse tree of a sentence:

(S (NP (PRP They))
   (VP (VBP have)
       (NP (JJ many) (JJ theoretical) (NNS ideas))))

Rewrite rule features:

S → NP VP •
NP → PRP
VP → VBP NP
NP → JJ JJ NNS
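As an illustration of how rewrite rules can be read off a tree, the sketch below parses a bracketed tree and counts one feature per internal production. The bracketed input format, the function names, and the decision to skip tag-over-word nodes are assumptions made for this example, not MeTA's implementation; the sentence-final punctuation node is left out of the input.

```python
from collections import Counter


def parse_tree(text):
    """Parse '(S (NP (PRP They)) ...)' into (label, children) tuples.

    Words (leaves) are kept as plain strings.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1
            label = tokens[pos]
            pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(read())
            pos += 1  # consume the closing parenthesis
            return (label, children)
        word = tokens[pos]
        pos += 1
        return word

    return read()


def rewrite_rules(node, rules=None):
    """Count productions 'PARENT -> CHILD CHILD ...' over internal nodes."""
    if rules is None:
        rules = Counter()
    label, children = node
    # A node whose children are all subtrees is an internal production;
    # tag-over-word nodes such as (PRP They) are skipped.
    if all(isinstance(child, tuple) for child in children):
        rhs = " ".join(child[0] for child in children)
        rules[f"{label} -> {rhs}"] += 1
        for child in children:
            rewrite_rules(child, rules)
    return rules


tree = parse_tree(
    "(S (NP (PRP They)) (VP (VBP have) "
    "(NP (JJ many) (JJ theoretical) (NNS ideas))))"
)
rules = rewrite_rules(tree)
```

Each key of `rules` is a string, so the counts can be used directly as features alongside n-gram counts.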


Structural Features from Parse Trees

[Figure: parse-tree skeletons in which every node is drawn as an unlabeled x.]


Search


Indexing Process

corpus → documents → (analyzer: tokenizer → filters) → indexer

Corpus variants:

• line_corpus: one document per line

• gz_corpus: one document per line, gzip compressed

• file_corpus: one document per file
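To make the corpus → analyzer → indexer flow concrete, here is a small sketch that reads one document per line (plain or gzip-compressed) and builds an in-memory inverted index. The reader functions borrow the corpus names from the slide but are hypothetical stand-ins, the whitespace analyzer is a simplification, and a real index would live on disk.

```python
import gzip
import io
from collections import Counter, defaultdict


def line_corpus(stream):
    # One document per non-empty line of a text stream.
    for line in stream:
        line = line.strip()
        if line:
            yield line


def gz_corpus(raw_bytes):
    # Same layout as line_corpus, but the bytes are gzip-compressed.
    with gzip.open(io.BytesIO(raw_bytes), mode="rt", encoding="utf-8") as stream:
        yield from line_corpus(stream)


def analyze(text):
    # Simplified analyzer (assumption): lowercase + whitespace split + counts.
    return Counter(text.lower().split())


def build_inverted_index(docs):
    # term -> {doc_id: term frequency}
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term, count in analyze(text).items():
            index[term][doc_id] = count
    return index


plain = "the cat sat\nthe dog sat down\n"
index = build_inverted_index(line_corpus(io.StringIO(plain)))
gz_index = build_inverted_index(gz_corpus(gzip.compress(plain.encode("utf-8"))))
```

Because both readers yield the same documents, the two indexes come out identical; only the storage format of the corpus differs.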


…is it any good?

Three criteria:

• Indexing speed

• Index size

• Query speed

          Docs          Size      |D|avg    |V|
Blog06    3,215,171     26 GB     782.3     10,971,746
Gov2      25,205,179    147 GB    515.5     21,203,125

Table 1: (IR) The two TREC datasets used. Uncleaned versions of blog06 and gov2 were 89 GB and 426 GB respectively.


Indexing Speed and Size

          Indri         Lucene        MeTA
Blog06    55m 40s       20m 23s       11m 23s
Gov2      8h 13m 43s    1h 59m 42s    1h 12m 10s

Table 2: (IR) Indexing speed.

          Indri         Lucene        MeTA
Blog06    31.02 GB      2.06 GB       2.84 GB
Gov2      170.50 GB     11.02 GB      10.24 GB

Table 3: (IR) Index size.


Query Speed and Ranking Performance

          Indri         Lucene        MeTA
Blog06    55.0s         1.60s         3.67s
Gov2      24m 6.73s     57.53s        1m 3.98s

Table 4: (IR) Query speed.

          Indri             Lucene            MeTA
          MAP      P@10     MAP      P@10     MAP      P@10
Blog06    29.13    63.20    29.10    63.60    32.34    64.70
Gov2      25.96    53.69    30.23    59.26    29.97    57.43

Table 5: (IR) Query performance via Mean Average Precision and Precision at 10 documents.
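For readers unfamiliar with the two measures in Table 5, the sketch below computes Precision at k and Mean Average Precision from ranked result lists. The function names and the toy document IDs are made up for illustration; the table's values are these same quantities expressed as percentages.

```python
def precision_at_k(ranked, relevant, k=10):
    # Fraction of the top-k results that are relevant.
    top = ranked[:k]
    return sum(1 for doc in top if doc in relevant) / k


def average_precision(ranked, relevant):
    # Mean of the precision values taken at each rank where a relevant
    # document appears, divided by the number of relevant documents.
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    # runs: list of (ranked list, set of relevant documents), one per query.
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6"], {"d6"}),
]
ap_first = average_precision(*runs[0])
p_at_2 = precision_at_k(runs[0][0], runs[0][1], k=2)
map_score = mean_average_precision(runs)
```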


Ranking Functions

• Configurable, state-of-the-art existing rankers:
  • Absolute discounting
  • Dirichlet prior
  • Jelinek-Mercer
  • Okapi BM25(L)
  • Pivoted length normalization
  • Rocchio pseudo-relevance feedback
  • KL-Divergence pseudo-relevance feedback
• Easy to add new rankers (see score_data)

Example configuration:

[ranker]
method = "bm25"
k1 = 1.2 # optional
b = 0.75 # optional
k3 = 500 # optional
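To show what the k1, b, and k3 parameters control, here is a sketch of one common form of the Okapi BM25 scoring function. It is not taken from MeTA's source: the IDF variant and the function signature are assumptions, and documents are plain token lists rather than an index.

```python
import math
from collections import Counter


def bm25_score(query, doc, docs, k1=1.2, b=0.75, k3=500):
    """Score one document (a token list) against a query (a token list)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_tf = Counter(doc)
    score = 0.0
    for term, qtf in Counter(query).items():
        tf = doc_tf[term]
        if tf == 0:
            continue
        df = sum(1 for d in docs if term in d)
        # IDF variant chosen for this sketch (assumption); always positive.
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        # k1 caps the gain from repeated terms; b sets length normalization.
        tf_part = ((k1 + 1.0) * tf) / (k1 * (1.0 - b + b * len(doc) / avg_len) + tf)
        # k3 caps the gain from terms repeated in the query.
        qtf_part = ((k3 + 1.0) * qtf) / (k3 + qtf)
        score += idf * tf_part * qtf_part
    return score


docs = [["cat", "sat"], ["dog", "sat", "down"], ["cat", "cat", "ran"]]
scores = [bm25_score(["cat"], d, docs) for d in docs]
```

The third document mentions the query term twice and so outranks the first, while the second, which lacks the term, scores zero.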


Other Search Components

• Caching: configurable caches improve retrieval speed (and allow memory utilization to be tuned)
• Parallelization: indexing uses all machine cores during tokenization, and most of them during index merging
• Memory-mapped I/O: all index files are accessed efficiently through OS-level memory mapping


DEMO


Classification


Classification Methods

Binary classifiers:

• SGD trained:
  • SVM (hinge, smooth hinge, squared hinge)
  • Perceptron
  • Logistic regression

Multiclass adapters:

• One-vs-all

• One-vs-one

Multiclass classifiers:

• Wrapper for LIBSVM/LIBLINEAR

• Naïve Bayes

• Winnow

• Dual perceptron (kernels)

• kNN (using the search engine)
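The relationship between a binary classifier and a multiclass adapter can be sketched as follows: a linear SVM trained by SGD on the hinge loss, wrapped in a one-vs-all scheme that trains one binary model per class. The function names, hyperparameters, and toy data are all assumptions for illustration and do not reflect MeTA's actual classes or defaults.

```python
import random


def train_binary_svm(examples, n_features, epochs=50, lr=0.1, reg=0.01, seed=0):
    """SGD on the L2-regularized hinge loss; labels must be +1 or -1."""
    rng = random.Random(seed)
    w = [0.0] * n_features
    bias = 0.0
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + bias)
            # Shrink the weights a little on every step (L2 regularization).
            w = [wi * (1.0 - lr * reg) for wi in w]
            # Hinge loss has a non-zero gradient only inside the margin.
            if margin < 1.0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                bias += lr * y
    return w, bias


def train_one_vs_all(examples, n_features, **kwargs):
    # One binary problem per class: that class (+1) against the rest (-1).
    labels = sorted({label for _, label in examples})
    models = {}
    for target in labels:
        binary = [(x, 1 if label == target else -1) for x, label in examples]
        models[target] = train_binary_svm(binary, n_features, **kwargs)
    return models


def predict(models, x):
    # Pick the class whose binary model gives the highest score.
    def score(label):
        w, bias = models[label]
        return sum(wi * xi for wi, xi in zip(w, x)) + bias

    return max(models, key=score)


train = [
    ([1.0, 0.0, 0.0], "sports"), ([0.9, 0.1, 0.0], "sports"),
    ([0.0, 1.0, 0.0], "politics"), ([0.1, 0.9, 0.0], "politics"),
    ([0.0, 0.0, 1.0], "tech"), ([0.0, 0.1, 0.9], "tech"),
]
models = train_one_vs_all(train, n_features=3)
```

A one-vs-one adapter would instead train a model for every pair of classes and let the pairwise winners vote.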


Out-of-core Support

An experiment on rcv1 with only 200MB of memory:

LIBLINEAR crashes; MeTA works just fine! (https://meta-toolkit.org/online-learning.html)


…is it good?

           Docs       Size      Classes    Features
20news     18,846     86 MB     20         112,377
Blog       19,320     778 MB    3          548,812
rcv1       804,414    1.1 GB    2          47,152
Webspam    350,000    24 GB     2          16,609,143

Table 6: (ML) Datasets used.


…is it good?

           liblinear    scikit     SVMmult    MeTA
20news     79.4%        74.3%      67.1%      80.1%
           2.58s        0.326s     2.54s      0.648s
Blog       75.8%        76.2%      72.2%      72.2%
           61.3s        0.801s     17.5s      1.11s
rcv1       94.7%        94.0%      83.6%      94.8%
           17.6s        1.66s      2.01s      3.44s
Webspam    ✗            97.4%      ✗          99.4%
                        11m 52s               1m 16s

Table 7: (ML) Accuracy and speed classification results. Reported time is to both train and test the model. For all except Webspam, this excludes IO.


DEMO


Topic Modeling


Models

MeTA currently offers several flavors of LDA inference:

• Collapsed Gibbs sampling
• Parallel collapsed Gibbs sampling (approximate)
• Collapsed Variational Bayes (0-th order approximation)
• Stochastic Collapsed Variational Bayes (0-th order approximation)


Sequence Labeling


Sequence Labeling

A general task on sequence data in NLP, useful for

• Part of speech (POS) tagging
• Chunking/shallow parsing
• Named-entity recognition (NER)
• Information extraction

Supported in MeTA via an implementation of linear-chain conditional random fields (CRFs) or a greedy averaged perceptron.


Sequence Labeling Framework

sequences → sequence_analyzer → model

The sequence analyzer creates observation features for each element:

• Lexical features: $w_t = v_i$, $w_t$ is capitalized, $w_t$ is a number, …
• Context features: $w_{t+1} = v_j$, $w_{t+2} = v_k$, $w_{t-1} = v_\ell$, $w_{t-2} = v_m$, …

User-configurable: specify a function that takes an observation and generates features (represented as strings).

(Ships with a default analyzer for POS tagging.)
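A feature function of the kind described here might look like the sketch below: it takes a sentence and a position and returns string features for that observation. The feature-name format and the function name are invented for this example and are not MeTA's.

```python
def observation_features(words, t):
    """Return string features for the word at position t (assumed non-empty)."""
    word = words[t]
    # Lexical features of the current word.
    feats = [f"w[t]={word.lower()}"]
    if word[0].isupper():
        feats.append("w[t]_is_capitalized")
    if word.replace(",", "").replace(".", "").isdigit():
        feats.append("w[t]_is_number")
    # Context features: neighbouring words within a window of two.
    for offset in (-2, -1, 1, 2):
        i = t + offset
        if 0 <= i < len(words):
            feats.append(f"w[t{offset:+d}]={words[i].lower()}")
    return feats


sentence = ["They", "have", "42", "ideas"]
first = observation_features(sentence, 0)
third = observation_features(sentence, 2)
```

Because every feature is a string, a model such as a CRF or an averaged perceptron can map each one to an ID without knowing how it was produced.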


CRF Results for POS Tagging

               Sentences    Examples
Training       42,685       969,905
Development    6,526        148,158
Testing        7,505        171,138

Table 8: (NLP) The Penn Treebank dataset.

                    Extra Data    Accuracy
Human annotators                  97.0%
CoreNLP             ✓             97.3%
LTag-Spinal                       97.3%
SCCN                ✓             97.5%
MeTA (CRF)                        97.0%
MeTA (AP)                         96.9%

Table 9: (NLP) Part-of-speech tagging token-level accuracies. “Extra data” implies the use of large amounts of extra unlabeled data (e.g. for distributional similarity features).


Syntactic Parsing


Shift-reduce Constituency Parser

(S (NP (PRP They))
   (VP (VBP have)
       (NP (JJ many) (JJ theoretical) (NNS ideas))))

A linear-time algorithm for producing parse trees from input sequences (POS-tagged words).

Around 34 times faster than similar alternatives (nlp.stanford.edu/software/srparser.shtm)!
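The mechanics of shift-reduce parsing can be sketched with a stack and a queue. In a real parser a trained classifier chooses each action from the current state; in this illustration the action sequence is supplied by hand, and the action encoding and function names are assumptions rather than MeTA's API.

```python
def shift_reduce(tagged_words, actions):
    """Build a tree from (word, tag) pairs by applying the given actions."""
    # Each queue entry is a preterminal node: (tag, [word]).
    queue = [(tag, [word]) for word, tag in tagged_words]
    queue.reverse()  # so that pop() removes the next word in sentence order
    stack = []
    for action in actions:
        if action[0] == "shift":
            # Move the next word from the queue onto the stack.
            stack.append(queue.pop())
        elif action[0] == "reduce":
            # Replace the top `arity` stack items with one labeled node.
            _, label, arity = action
            if arity < 1 or arity > len(stack):
                raise ValueError("invalid reduce arity")
            children = stack[-arity:]
            del stack[-arity:]
            stack.append((label, children))
        else:
            raise ValueError("unknown action")
    if queue or len(stack) != 1:
        raise ValueError("actions did not produce a single tree")
    return stack[0]


def to_brackets(node):
    if isinstance(node, str):
        return node
    label, children = node
    return "(" + label + " " + " ".join(to_brackets(c) for c in children) + ")"


tagged = [("They", "PRP"), ("have", "VBP"), ("many", "JJ"),
          ("theoretical", "JJ"), ("ideas", "NNS")]
actions = [
    ("shift",), ("reduce", "NP", 1),
    ("shift",), ("shift",), ("shift",), ("shift",),
    ("reduce", "NP", 3), ("reduce", "VP", 2), ("reduce", "S", 2),
]
tree = to_brackets(shift_reduce(tagged, actions))
```

Every word is shifted exactly once and every reduce creates one tree node, which is why the number of steps grows in proportion to sentence length.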


Parser Comparison

            CoreNLP                           MeTA
            Training      Testing    F1       Training      Testing    F1
Greedy      7m 27s        18.6s      86.7     17m 31s       12.9s      86.9
            8.85 GB       1.53 GB             0.79 GB       0.29 GB
Beam (4)    6h 10m 43s    46.8s      89.9     2h 17m 25s    59.2s      88.1
            10.84 GB      3.83 GB             2.29 GB       0.94 GB

Table 10: (NLP) Training/testing performance for the shift-reduce constituency parsers. All models were trained for 40 iterations on the standard training split of the Penn Treebank. Accuracy is reported as labeled F1 from evalb on section 23.


DEMO


Conclusion


https://meta-toolkit.org

MIT & UIUC/NCSA Licensed

(Liberal; use it at work!)

Questions?
