Corpus Bootstrapping with NLTK - O'Reilly Media
assets.en.oreilly.com/1/event/75/Corpus Bootstrapping...
Corpus Bootstrapping with NLTK
by Jacob Perkins
Jacob Perkins
http://www.weotta.com
http://streamhacker.com
http://text-processing.com
https://github.com/japerk/nltk-trainer
@japerk
Problem
you want to do NLProc
many proven supervised training algorithms
but you don’t have a training corpus
Solution
make a custom training corpus
Problems with Manual Annotation
takes time
requires expertise
expert time costs $$$
Solution: Bootstrap
less time
less expertise
costs less
requires thinking & creativity
Corpus Bootstrapping at Weotta
review sentiment
keyword classification
phrase extraction & classification
Bootstrapping Examples
english -> spanish sentiment
phrase extraction
Translating Sentiment
start with english sentiment corpus & classifier
english -> spanish -> spanish
English -> Spanish -> Spanish
1. translate english examples to spanish
2. train classifier
3. classify spanish text into new corpus
4. correct new corpus
5. retrain classifier
6. add to corpus & goto 4 until done
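The loop above can be sketched in plain Python. This is a minimal illustration, not the nltk-trainer implementation: the tiny naive Bayes model stands in for whatever train_classifier.py produces, and the names train, prob_classify, bootstrap, and the correct callback (a placeholder for the manual correction step) are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def train(labeled):
    """Train a tiny add-one-smoothed naive Bayes model from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def prob_classify(model, text):
    """Return {label: probability} for one text."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    log_scores = {}
    for label, n in label_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        log_scores[label] = score
    top = max(log_scores.values())
    exp = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(exp.values())
    return {label: v / norm for label, v in exp.items()}

def bootstrap(seed, unlabeled, correct, threshold=0.9, rounds=3):
    """Grow a corpus: classify, keep confident guesses, correct them, retrain."""
    corpus, pending = list(seed), list(unlabeled)
    model = train(corpus)                      # step 2
    for _ in range(rounds):
        accepted, rest = [], []
        for text in pending:                   # step 3
            dist = prob_classify(model, text)
            label = max(dist, key=dist.get)
            (accepted if dist[label] > threshold else rest).append((text, label))
        if not accepted:
            break
        corpus.extend(correct(accepted))       # step 4: hypothetical human review hook
        pending = [text for text, _ in rest]
        model = train(corpus)                  # step 5
    return corpus, model, pending

# Toy data standing in for the translated seed corpus (step 1).
seed = [("muy buena pelicula", "pos"), ("excelente buena historia", "pos"),
        ("muy mala pelicula", "neg"), ("terrible mala historia", "neg")]
corpus, model, pending = bootstrap(
    seed, ["buena excelente", "mala terrible", "pelicula"],
    correct=lambda examples: examples, threshold=0.8)
```

Passing the identity function as correct skips human review; in a real run that callback is where a person fixes wrong labels before they are added.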
Translate Corpus
$ translate_corpus.py movie_reviews --source english --target spanish
Train Initial Classifier
$ train_classifier.py spanish_movie_reviews
Create New Corpus
$ classify_to_corpus.py spanish_sentiment --input spanish_examples.txt --classifier spanish_movie_reviews_NaiveBayes.pickle
Manual Correction
1. scan each file
2. move incorrect examples to correct file
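Moving a misfiled example can also be scripted. The sketch below assumes a layout the slides do not confirm: one text file per category (for example pos.txt and neg.txt) with one example per line. The helper name move_example is hypothetical.

```python
from pathlib import Path

def move_example(corpus_dir, text, wrong_label, right_label):
    """Move one example line from <wrong_label>.txt to <right_label>.txt.

    Assumes one file per category and one example per line.
    """
    root = Path(corpus_dir)
    src = root / f"{wrong_label}.txt"
    dst = root / f"{right_label}.txt"
    lines = src.read_text(encoding="utf-8").splitlines()
    if text not in lines:
        raise ValueError(f"{text!r} not found in {src}")
    lines.remove(text)  # removes the first matching line only
    src.write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    with dst.open("a", encoding="utf-8") as handle:
        handle.write(text + "\n")
```

If the real corpus uses one file per example instead, the same correction is just a file move between category directories.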
Train New Classifier
$ train_classifier.py spanish_sentiment
Adding to the Corpus
start with >90% probability
retrain
carefully decrease probability threshold
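A rough sketch of the threshold schedule, assuming each prediction is a (text, {label: probability}) pair. The function names and the 0.9, 0.8, 0.7 schedule are illustrative assumptions. For brevity the sketch reuses one set of scores, whereas the slides call for retraining (and so re-scoring) between rounds.

```python
def select_confident(predictions, threshold):
    """Split predictions into accepted (text, label) pairs and rejected predictions."""
    accepted, rejected = [], []
    for text, dist in predictions:
        label = max(dist, key=dist.get)
        if dist[label] > threshold:
            accepted.append((text, label))
        else:
            rejected.append((text, dist))
    return accepted, rejected

def lower_threshold_rounds(predictions, thresholds=(0.9, 0.8, 0.7)):
    """Yield (threshold, newly accepted examples) for each successively lower threshold."""
    remaining = list(predictions)
    for threshold in thresholds:
        accepted, remaining = select_confident(remaining, threshold)
        # In practice: correct `accepted`, add to the corpus, retrain, re-score `remaining`.
        yield threshold, accepted
```

Each lower threshold admits more examples but also more mistakes, so the correction step matters more as the threshold drops.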
Add more at a Lower Threshold
$ classify_to_corpus.py categorized_corpus --classifier categorized_corpus_NaiveBayes.pickle --threshold 0.8 --input new_examples.txt
When are you done?
what level of accuracy do you need?
does your corpus reflect real text?
how much time do you have?
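One way to put a number on the first question is to score the classifier against a small hand-checked sample of real text that was never used for training. A minimal sketch follows; the accuracy helper and the classify callable are assumptions for illustration, not part of nltk-trainer.

```python
def accuracy(classify, gold):
    """Fraction of hand-checked (text, label) pairs that classify() gets right."""
    if not gold:
        raise ValueError("gold sample is empty")
    correct = sum(1 for text, label in gold if classify(text) == label)
    return correct / len(gold)
```

Drawing the sample from the text you actually expect to see also speaks to the second question.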
Tips
garbage in, garbage out
correct bad data
clean & scrub text
experiment with train_classifier.py options
create custom features
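As an illustration of the last two tips, here is a small cleaning step and a custom feature extractor. It returns a dict of feature names to values, the feature-set shape NLTK classifiers accept. The specific choices (stripping HTML tags, adding bigrams, a has_exclamation flag) are assumptions for this sketch, not recommendations from the talk.

```python
import re

def clean(text):
    """Strip HTML tags, collapse whitespace, and lowercase."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()

def features(text):
    """Bag of words plus bigrams and one hand-made feature."""
    words = re.findall(r"\w+", clean(text))
    feats = {word: True for word in words}
    feats.update({f"{a} {b}": True for a, b in zip(words, words[1:])})
    feats["has_exclamation"] = "!" in text  # example custom feature
    return feats
```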
Bootstrapping a Phrase Extractor
1. find a pos tagged corpus
2. annotate raw text
3. train pos tagger
4. create pos tagged & chunked corpus
5. tag unknown words
6. train pos tagger & chunker
7. correct errors
8. add to corpus, goto 5 until done
NLTK Tagged Corpora
English: brown, conll2000, treebank
Portuguese: mac_morpho, floresta
Spanish: cess_esp, conll2002
Catalan: cess_cat
Dutch: alpino, conll2002
Indian Languages: indian
Chinese: sinica_treebank
see http://text-processing.com/demo/tag/
Train Tagger
$ train_tagger.py treebank --simplify_tags
Phrase Annotation
Hello world, [this is an important phrase].
Tag Phrases
$ tag_phrases.py my_corpus --tagger treebank_simplify_tags.pickle --input my_phrases.txt
Chunked & Tagged Phrase
Hello/N world/N ,/, [ this/DET is/V an/DET important/ADJ phrase/N ] ./.
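To make the format concrete, here is a small stdlib-only parser for a line like the one above: word/tag tokens separated by spaces, with [ and ] marking a chunk. It is a sketch for inspection only. The function name parse_chunked and the default chunk label "PHRASE" are assumptions; for real work, NLTK's corpus reader does the parsing.

```python
def parse_chunked(line, chunk_label="PHRASE"):
    """Return (all tagged tokens, list of (label, tagged tokens) chunks)."""
    tokens, chunks, current = [], [], None
    for item in line.split():
        if item == "[":
            if current is not None:
                raise ValueError("nested chunks are not supported")
            current = []
        elif item == "]":
            if current is None:
                raise ValueError("']' without matching '['")
            chunks.append((chunk_label, current))
            current = None
        else:
            # Split on the last slash so tokens such as ',/,' and './.' work.
            word, _, tag = item.rpartition("/")
            tokens.append((word, tag))
            if current is not None:
                current.append((word, tag))
    if current is not None:
        raise ValueError("'[' without matching ']'")
    return tokens, chunks

line = "Hello/N world/N ,/, [ this/DET is/V an/DET important/ADJ phrase/N ] ./."
tokens, chunks = parse_chunked(line)
```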
Correct Unknown Words
1. find -NONE- tagged words
2. fix tags
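Both steps can be sketched over sentences represented as lists of (word, tag) pairs. The helper names unknown_words and apply_fixes, and the idea of supplying fixes as a word-to-tag dict, are assumptions for this example.

```python
from collections import Counter

def unknown_words(tagged_sents, unknown_tag="-NONE-"):
    """List (word, count) for words the tagger could not tag, most frequent first."""
    counts = Counter(word for sent in tagged_sents
                     for word, tag in sent if tag == unknown_tag)
    return counts.most_common()

def apply_fixes(tagged_sents, fixes, unknown_tag="-NONE-"):
    """Replace unknown tags using a hand-written {word: tag} dict; leave the rest alone."""
    return [[(word, fixes.get(word, tag) if tag == unknown_tag else tag)
             for word, tag in sent]
            for sent in tagged_sents]
```

Sorting by frequency lets you fix the most common unknown words first, which gives the retrained tagger the most benefit per correction.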
Train New Tagger
$ train_tagger.py my_corpus --reader nltk.corpus.reader.ChunkedCorpusReader
Train Chunker
$ train_chunker.py my_corpus --reader nltk.corpus.reader.ChunkedCorpusReader
Extracting Phrases
import collections, nltk.data
from nltk import tokenize
from nltk.tag import untag

# load the tagger and chunker trained in the previous steps
tagger = nltk.data.load('taggers/my_corpus_tagger.pickle')
chunker = nltk.data.load('chunkers/my_corpus_chunker.pickle')

def extract_phrases(t):
    d = collections.defaultdict(list)
    # NLTK 3 uses s.label(); NLTK 2 used the s.node attribute
    for sub in t.subtrees(lambda s: s.label() != 'S'):
        d[sub.label()].append(' '.join(untag(sub.leaves())))
    return d

# text is the raw input string to extract phrases from
sents = tokenize.sent_tokenize(text)
words = tokenize.word_tokenize(sents[0])
d = extract_phrases(chunker.parse(tagger.tag(words)))
# defaultdict(<type 'list'>, {'PHRASE_TAG': ['phrase']})
Final Tips
error correction is faster than manual annotation
find close enough corpora
use nltk-trainer to experiment
iterate -> quality
no substitute for human judgement
Links
http://www.nltk.org
https://github.com/japerk/nltk-trainer
http://text-processing.com