Transcript of "a library for efficient text classification and word representation" by Piotr Bojanowski (slides, 32 pages)

Pages 1–2:

a library for efficient text classification and word representation

Piotr Bojanowski

November 23rd, 2016

Page 3:

Collaborators

Piotr Bojanowski

Armand Joulin

Edouard Grave

Tomáš Mikolov

Page 4:

Scientific context

• Representing words as vectors [Mikolov et al. 2013]

• Extremely simple and fast -> widely used

• Several drawbacks:

  • No sentence representations: taking the average of pre-trained word vectors is popular, but it does not work very well…

  • Not exploiting morphology: words with the same radicals don't share parameters (disastrous / disaster, mangera / mangerai)

[Mikolov et al. 2013] Distributed Representations of Words and Phrases and their Compositionality; Efficient Estimation of Word Representations in Vector Space

Page 5:

Goal of the library

• Unified framework for
  1. Text representation
  2. Text classification

• Core of the library: given a set of indices -> predict an index (a minimal sketch follows below)

• cbow, skip-gram and bow text classification are instances of this model:

  cbow: many words -> word
  skip-gram: word -> word
  text classification: many words -> label
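To make the shared core concrete, here is a minimal C++ sketch of the "set of input indices -> predicted output index" pattern. It is not the library's actual classes; the names (Matrix, averageRows, predict) are illustrative only. The embedding rows of the input indices are averaged into a hidden vector, which is then scored against every output row.

// Minimal sketch of the shared core (illustrative, not the real fastText API):
// a bag of input indices is averaged into a hidden vector, then scored
// against every output row; the arg-max row index is the prediction.
#include <cstddef>
#include <numeric>
#include <vector>

struct Matrix {
  std::size_t rows, cols;
  std::vector<float> data;  // row-major storage
  Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
  const float* row(std::size_t i) const { return &data[i * cols]; }
};

// Average the embedding rows of the input indices (words, n-grams, ...).
std::vector<float> averageRows(const Matrix& in, const std::vector<int>& indices) {
  std::vector<float> hidden(in.cols, 0.0f);
  for (int idx : indices)
    for (std::size_t j = 0; j < in.cols; ++j)
      hidden[j] += in.row(static_cast<std::size_t>(idx))[j];
  if (!indices.empty())
    for (float& h : hidden) h /= static_cast<float>(indices.size());
  return hidden;
}

// Score the hidden vector against every output row and return the best index.
// Depending on which dictionaries feed `in` and `out`, this single routine covers
// cbow (context words -> word), skip-gram (word -> context word) and
// bag-of-words classification (words -> label).
int predict(const Matrix& in, const Matrix& out, const std::vector<int>& indices) {
  std::vector<float> hidden = averageRows(in, indices);
  int best = 0;
  float bestScore = -1e30f;
  for (std::size_t k = 0; k < out.rows; ++k) {
    float score = std::inner_product(hidden.begin(), hidden.end(), out.row(k), 0.0f);
    if (score > bestScore) { bestScore = score; best = static_cast<int>(k); }
  }
  return best;
}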

Page 6:

Two main applications

• Text classification

  Input: fenomeno interista is an italian sports magazine entirely dedicated to the football club football club internazionale milano . it is released on a monthly basis . it features articles posters and photos of inter players including both the first team players and the youth system kids as well as club employees . it also features anecdotes and famous episodes from the club's history .
  Output: WrittenWork

• Word representation (with character-level features)

  Example: "Je mangerai bien une pomme!", with context words je, bien, une, pomme and the character n-grams of "mangerai" (man, ang, nge, ger, era, rai, mang, ange, nger, gera, erai)

Page 7:

Background knowledge

The skip-gram and cbow models of word2vec

Page 8:

The cbow and skipgram models

[Architecture diagrams: CBOW sums the context vectors w(t-2), w(t-1), w(t+1), w(t+2) to predict w(t); skip-gram uses w(t) to predict each context word w(t-2), w(t-1), w(t+1), w(t+2).]

[Mikolov et al. 2013]

Page 9:

The skip-gram model

• Model probability of a context word given a word

• Word vectors

Example: "The mighty knight Lancelot fought bravely."
From the word "knight", predict each context word: The, mighty, Lancelot, fought, bravely.

Page 10:

Background: the skip-gram model

• Minimize a negative log likelihood:
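(The equation on the slide did not survive extraction; the objective below is reconstructed from the standard word2vec skip-gram formulation, with $u$ and $v$ the input and output word vectors, $\mathcal{C}_t$ the context window around position $t$, and $W$ the vocabulary size.)

$$
-\sum_{t=1}^{T} \sum_{c \in \mathcal{C}_t} \log p(w_c \mid w_t),
\qquad
p(w_c \mid w_t) = \frac{\exp\!\left(u_{w_t}^{\top} v_{w_c}\right)}{\sum_{j=1}^{W} \exp\!\left(u_{w_t}^{\top} v_j\right)}
$$

The softmax normalization runs over the whole vocabulary, which is what makes the exact loss expensive.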

• The above sum hides co-occurrence counts

Computationally intensive!

Page 11:

Approximations to the loss

• Replace the multiclass loss by a set of binary logistic losses

• Negative sampling (objective sketched below)

• Hierarchical softmax
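As a reference, the negative-sampling approximation (reconstructed from the word2vec / fastText papers, with $\ell(x) = \log\!\left(1 + e^{-x}\right)$ the logistic loss, $s$ the scoring function and $\mathcal{N}_{t,c}$ a set of sampled negative words) replaces the softmax with binary logistic losses:

$$
\sum_{t=1}^{T} \sum_{c \in \mathcal{C}_t} \Big[ \ell\big(s(w_t, w_c)\big) + \sum_{n \in \mathcal{N}_{t,c}} \ell\big(-s(w_t, n)\big) \Big]
$$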

Page 12:

The cbow model

• Model probability of a word given a context

• Continuous Bag Of Words

Example: "The mighty knight Lancelot fought bravely."
From the context (The, mighty, Lancelot, fought, bravely.), predict the word "knight".
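(For completeness, a reconstructed form of the cbow probability, under the usual word2vec formulation: the context vectors are averaged and the central word is predicted from that average, with notation as above.)

$$
p(w_t \mid \mathcal{C}_t) = \frac{\exp\!\left(\bar{u}_t^{\top} v_{w_t}\right)}{\sum_{j=1}^{W} \exp\!\left(\bar{u}_t^{\top} v_j\right)},
\qquad
\bar{u}_t = \frac{1}{|\mathcal{C}_t|} \sum_{c \in \mathcal{C}_t} u_{w_c}
$$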

Page 13:

fasttext

• Both models are instances of a broader set of models

• Different input and output dictionaries

• Common core but different pooling strategies

• Efficient and modular C++ implementation

• Allows easy building of extensions by writing your own pooling (a hypothetical sketch follows below)
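The slide shows no code; the sketch below is purely hypothetical and does not match the library's real class names or structure. It only illustrates what "common core, pluggable pooling" could look like: the core consumes a pooled hidden vector, and cbow-style averaging (or any other aggregation) is swapped in behind one interface.

// Hypothetical illustration of "common core, pluggable pooling";
// the real fastText sources use different names and structure.
#include <cstddef>
#include <vector>

class Pooling {  // strategy interface: embedding rows -> hidden vector
 public:
  virtual ~Pooling() = default;
  virtual std::vector<float> pool(const std::vector<std::vector<float>>& rows) const = 0;
};

class MeanPooling : public Pooling {  // cbow / bow-classification style averaging
 public:
  std::vector<float> pool(const std::vector<std::vector<float>>& rows) const override {
    std::vector<float> h(rows.empty() ? 0 : rows.front().size(), 0.0f);
    for (const auto& r : rows)
      for (std::size_t j = 0; j < r.size(); ++j) h[j] += r[j];
    if (!rows.empty())
      for (float& x : h) x /= static_cast<float>(rows.size());
    return h;
  }
};

// The shared core only ever sees the pooled vector, so new model variants can
// be added by implementing another Pooling subclass.
std::vector<float> hiddenState(const Pooling& p,
                               const std::vector<std::vector<float>>& inputRows) {
  return p.pool(inputRows);
}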

Page 14:

Bag of Tricks for Efficient Text Classification

Page 15:

Fast text classification

• BoW model on text classification and tag prediction

• A very strong (and fast) baseline, often on par with SOTA approaches

• Ease of use is at the core of the library

Example documents:

Starsmith (born Finlay Dow-Smith, 8 July 1988, Bromley, England) is a British songwriter, producer, remixer and DJ. He studied a classical music degree at the University of Surrey, majoring in performance on saxophone. He has already received acclaim for the remixes he has created for Lady Gaga, Robyn, Timbaland, Katy Perry, Little Boots, Passion Pit, Paloma Faith, Marina and the Diamonds and Frankmusik amongst many others.

Rikkavesi is a medium-sized lake in eastern Finland. At approximately 63 square kilometres (24 sq mi) it is the 66th largest lake in Finland. Rikkavesi is situated in the municipalities of Kaavi, Outokumpu and Tuusniemi. Rikkavesi is 101 metres (331 ft) above the sea level. Kaavinjärvi and Rikkavesi are connected by the Kaavinkoski Canal. Ohtaansalmi strait flows from Rikkavesi to Juojärvi.

./fasttext supervised -input data/dbpedia.train -output data/dbpedia

./fasttext test data/dbpedia.bin data/dbpedia.test

Page 16:

Model

• Model probability of a label given a paragraph

• Paragraph feature: the average of its word vectors

• Word vectors are latent and not useful per se

• If supervised data is scarce, use pre-trained word vectors
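(The objective on the slide was lost in extraction; from the Bag of Tricks paper, for $N$ documents with normalized bag-of-features vectors $x_n$ and labels $y_n$, the model minimizes the negative log-likelihood of a softmax $f$ over labels, with $A$ the latent word-embedding matrix and $B$ the classifier weights:)

$$
-\frac{1}{N} \sum_{n=1}^{N} y_n \log\!\big(f(B A x_n)\big)
$$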

Page 17:

n-grams

• Possible to add higher-order features

• Avoid building an n-gram dictionary

"I could listen to every track every minute of every day."

Unigrams: I, could, listen, to, every, track, every, minute, of, every, day

Bigrams: I could, could listen, listen to, to every, every track, track every, every minute, minute of, of every, every day

Use a hashed dictionary!
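A minimal sketch of the hashing trick for bigrams, assuming an FNV-1a-style string hash and a fixed number of buckets; the exact hash function and bucket count used by fastText may differ.

// Hash word bigrams into a fixed number of buckets instead of building an
// explicit bigram dictionary (hashing-trick sketch; constants are illustrative).
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

constexpr std::uint32_t kBuckets = 2000000;  // size of the hashed n-gram table

// FNV-1a style string hash (32-bit).
std::uint32_t hashString(const std::string& s) {
  std::uint32_t h = 2166136261u;  // FNV-1a offset basis
  for (unsigned char c : s) {
    h ^= c;
    h *= 16777619u;               // FNV-1a prime
  }
  return h;
}

// Map each consecutive word pair to a bucket index; collisions are accepted.
std::vector<std::uint32_t> bigramBuckets(const std::vector<std::string>& words) {
  std::vector<std::uint32_t> ids;
  for (std::size_t i = 0; i + 1 < words.size(); ++i)
    ids.push_back(hashString(words[i] + " " + words[i + 1]) % kBuckets);
  return ids;
}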

Page 18:

Sentiment analysis - performance

Model                               AG     Sogou   DBP    Yelp P.  Yelp F.  Yah. A.  Amz. F.  Amz. P.
BoW (Zhang et al., 2015)            88.8   92.9    96.6   92.2     58.0     68.9     54.6     90.4
ngrams (Zhang et al., 2015)         92.0   97.1    98.6   95.6     56.3     68.5     54.3     92.0
ngrams TFIDF (Zhang et al., 2015)   92.4   97.2    98.7   95.4     54.8     68.5     52.4     91.5
char-CNN (Zhang and LeCun, 2015)    87.2   95.1    98.3   94.7     62.0     71.2     59.5     94.5
char-CRNN (Xiao and Cho, 2016)      91.4   95.2    98.6   94.5     61.8     71.7     59.2     94.1
VDCNN (Conneau et al., 2016)        91.3   96.8    98.7   95.7     64.7     73.4     63.0     95.7
fastText, h=10                      91.5   93.9    98.1   93.8     60.4     72.0     55.8     91.2
fastText, h=10, bigram              92.5   96.8    98.6   95.7     63.9     72.3     60.2     94.6

Table 1: Test accuracy [%] on sentiment datasets. fastText has been run with the same parameters for all the datasets. It has 10 hidden units and we evaluate it with and without bigrams. For char-CNN, we show the best reported numbers without data augmentation.

Page 19:

Sentiment analysis - runtime

          Zhang and LeCun (2015)           Conneau et al. (2016)             fastText
          small char-CNN   big char-CNN    depth=9   depth=17   depth=29     h=10, bigram
AG        1h               3h              24m       37m        51m          1s
Sogou     -                -               25m       41m        56m          7s
DBpedia   2h               5h              27m       44m        1h           2s
Yelp P.   -                -               28m       43m        1h09         3s
Yelp F.   -                -               29m       45m        1h12         4s
Yah. A.   8h               1d              1h        1h33       2h           5s
Amz. F.   2d               5d              2h45      4h20       7h           9s
Amz. P.   2d               5d              2h45      4h25       7h           10s

Table 2: Training time for a single epoch on sentiment analysis datasets compared to char-CNN and VDCNN.

Page 20:

Tag prediction

Model                      prec@1   Running time
                                    Train     Test
Freq. baseline             2.2      -         -
Tagspace, h=50             30.1     3h8       6h
Tagspace, h=200            35.6     5h32      15h
fastText, h=50             31.2     6m40      48s
fastText, h=50, bigram     36.7     7m47      50s
fastText, h=200            41.1     10m34     1m29
fastText, h=200, bigram    46.1     13m38     1m37

Table 5: Prec@1 on the test set for tag prediction on YFCC100M. We also report the training time and test time. Test time is reported for a single thread, while training uses 20 threads for both models.

• Using Flickr data
• Given an image caption
• Predict the most likely tag
• Sample outputs:

Input: taiyoucon 2011 digitals: individuals digital photos from the anime convention taiyoucon 2011 in mesa, arizona. if you know the model and/or the character, please comment.
Prediction: #cosplay

Input: 2012 twin cities pride 2012 twin cities pride parade
Prediction: #minneapolis

Input: beagle enjoys the snowfall
Prediction: #snow

Page 21:

Enriching Word Vectors with Sub-word Information

Page 22:

Exploiting sub-word information

• Represent a word as the sum of its character n-grams

• We add special positional characters (e.g. ^mangerai$), so n-grams at the start and end of a word have a special meaning

• Grammatical variations still share most of their n-grams:

  Case          Singular         Plural
  Nominative    uniwersytet      uniwersytety
  Genitive      uniwersytetu     uniwersytetów
  Dative        uniwersytetowi   uniwersytetom
  Accusative    uniwersytet      uniwersytety
  Instrumental  uniwersytetem    uniwersytetami
  Locative      uniwersytecie    uniwersytetach
  Vocative      uniwersytecie    uniwersytety

• Compound nouns are easy to model: Tisch + Tennis -> Tischtennis
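A minimal sketch of extracting character n-grams with boundary markers, assuming the slide's ^ and $ markers and n-gram lengths 3 to 6 (the range reported in the paper); for brevity it works on bytes rather than UTF-8 code points.

// Extract character n-grams of a word after adding boundary markers, e.g.
// "^mangerai$" -> "^ma", "man", "ang", ..., "rai$". Sketch only.
#include <cstddef>
#include <string>
#include <vector>

std::vector<std::string> characterNgrams(const std::string& word,
                                         std::size_t minn = 3,
                                         std::size_t maxn = 6) {
  const std::string padded = "^" + word + "$";  // positional boundary markers
  std::vector<std::string> ngrams;
  for (std::size_t i = 0; i < padded.size(); ++i)
    for (std::size_t n = minn; n <= maxn && i + n <= padded.size(); ++n)
      ngrams.push_back(padded.substr(i, n));
  return ngrams;
}

// Example: characterNgrams("mangerai") contains "man", "ang", "nge", ..., "rai$".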

Page 23:

Model

• As in skip-gram: model probability of a context word given a word

• Feature of a word computed using n-grams (formula below):

  Character n-grams of "mangerai": man, ang, nge, ger, era, rai, mang, ange, nger, gera, erai
  Word itself: mangerai

• As for the previous model, use hashing for n-grams
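(The formula on the slide was lost; from the subword paper, with $\mathcal{G}_w$ the set of n-grams of word $w$ plus the word itself, $z_g$ the n-gram vectors and $v_c$ the context vector, the word representation and the skip-gram score become:)

$$
u_w = \sum_{g \in \mathcal{G}_w} z_g,
\qquad
s(w, c) = \sum_{g \in \mathcal{G}_w} z_g^{\top} v_c
$$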

Page 24:

OOV words

• Possible to build vectors for unseen words!

• Evaluated in our experiments vs. word2vec

Character n-grams of "mangerai": man, ang, nge, ger, era, rai, mang, ange, nger, gera, erai
Word itself: mangerai

Page 25:

Word similarity

• Given pairs of words
• Human judgement of similarity
• Similarity given vectors
• Spearman's rank correlation (formula below)
• Works well for rare words and morphologically rich languages!

            sg   cbow   ours*   ours
AR  WS353   51   52     54      55
DE  GUR350  61   62     64      70
DE  GUR65   78   78     81      81
DE  ZG222   35   38     41      44
EN  RW      43   43     46      47
EN  WS353   72   73     71      71
ES  WS353   57   58     58      59
FR  RG65    70   69     75      75
RO  WS353   48   52     51      54
RU  HJ      59   60     60      66

Table 2: Correlation between human judgement and similarity scores on word similarity datasets. We train both our model and the word2vec baseline on normalized Wikipedia dumps. Evaluation datasets contain words that are not part of the training set, so we represent them using null vectors (ours*). With our model, we also compute vectors for unseen words by summing the n-gram vectors (ours).

From the paper:

First, by looking at Table 2, we notice that the proposed model (ours), which uses subword information, outperforms the baselines on all datasets except the English WS353 dataset. Moreover, computing vectors for out-of-vocabulary words (ours) is always at least as good as not doing so (ours*). This proves the advantage of using subword information in the form of character n-grams.

Second, we observe that the effect of using character n-grams is more important for Arabic, German and Russian than for English, French or Spanish. German and Russian exhibit grammatical declensions, with four cases for German and six for Russian. Also, many German words are compound words; for instance the nominal phrase "table tennis" is written as the single word "Tischtennis". By exploiting the character-level similarities between "Tischtennis" and "Tennis", our model does not represent the two words as completely different words.

Finally, we observe that on the English Rare Words dataset (RW), our approach outperforms the baselines while it does not on the English WS353 dataset. This is due to the fact that words in the English WS353 dataset are common words for which good vectors can be obtained without exploiting subword information. When evaluating on less frequent words, we see that using similarities at the character level between words can help learning good word vectors.
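(As a reminder, Spearman's rank correlation between the human scores and the model's cosine similarities, for $n$ word pairs without tied ranks and with $d_i$ the rank difference of pair $i$, is:)

$$
\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n\,(n^2 - 1)}
$$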

Page 26:

Word analogies

• Given triplets of words
• Predict the analogy (prediction rule below)
• Evaluated using accuracy
• Works well for syntactic analogies
• Does not degrade semantic accuracy much
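(The slide does not spell out the prediction rule; the usual one, as in the word2vec evaluation, is: for a question "A is to B as C is to D", predict the word whose vector is closest in cosine similarity to $x_B - x_A + x_C$, excluding the question words:)

$$
D = \arg\max_{w \notin \{A, B, C\}} \cos\!\big(x_w,\; x_B - x_A + x_C\big)
$$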


               sg     cbow   ours
CS  Semantic   25.7   27.6   27.5
    Syntactic  52.8   55.0   77.8
DE  Semantic   66.5   66.8   62.3
    Syntactic  44.5   45.0   56.4
EN  Semantic   78.5   78.2   77.8
    Syntactic  70.1   69.9   74.9
IT  Semantic   52.3   54.7   52.3
    Syntactic  51.5   51.8   62.7

Table 3: Accuracy of our model and baselines on word analogy tasks for Czech, German, English and Italian. We report results for semantic and syntactic analogies separately.


From the paper:

5.2 Word analogy tasks

We now evaluate our approach on word analogy questions, of the form A is to B as C is to D, where D must be predicted by the models. We use the datasets introduced by Mikolov et al. (2013a) for English, by Svoboda and Brychcin (2016) for Czech, by Köper et al. (2015) for German and by Berardi et al. (2015) for Italian. Some questions contain words that do not appear in our training corpus, and we thus excluded these questions from the evaluation.

We report accuracy for the different models in Table 3. We observe that morphological information significantly helps for the syntactic tasks, our approach outperforming the baselines. In contrast, it does not help for semantic questions, and even degrades the performance for German and Italian. We also observe that the improvement over the baselines is more important for morphologically rich languages, such as Czech and German.

Page 27:

Comparison to state-of-the-art methods

                            DE                 EN              ES       FR
                            GUR350   ZG222     WS353   RW      WS353    RG65
Luong et al. (2013)         -        -         64      34      -        -
Qiu et al. (2014)           -        -         65      33      -        -
Soricut and Och (2015)      64       22        71      42      47       67
Ours                        73       43        73      48      54       69
Botha and Blunsom (2014)    56       25        39      30      28       45
Ours                        66       34        54      41      49       52

Table 4: Spearman's rank correlation coefficient between human judgement and model scores for different methods using morphology to learn word representations. We keep all the word pairs of the evaluation set and obtain representations for out-of-vocabulary words with our model by summing the vectors of character n-grams. Our model was trained on the same datasets as the methods we are comparing to (hence the two lines of results for our approach).

5.3 Comparison with morphological representations

We also compare our approach to previous work on incorporating subword information in word vectors, on word similarity tasks. The methods used are: the recursive neural network of Luong et al. (2013), the morpheme cbow of Qiu et al. (2014) and the morphological transformations of Soricut and Och (2015). In order to make the results comparable, we trained our model on the same datasets as the methods we are comparing to: the English Wikipedia data released by Shaoul and Westbury (2010), and the news crawl data from the 2013 WMT shared task for German, Spanish and French. We also compare our approach to the log-bilinear language model introduced by Botha and Blunsom (2014), which was trained on the Europarl and news commentary corpora. Again, we trained our model on the same data to make the results comparable. Using our model, we obtain representations of out-of-vocabulary words by summing the representations of character n-grams. We report results in Table 4. We observe that our simple approach performs well relative to techniques based on subword information obtained from morphological segmentors. We also observe that our approach outperforms the method by Soricut and Och (2015), which is based on prefix and suffix analysis. For German, the large improvement is due to the fact that their approach does not model noun compounding, contrary to ours.

5.4 Effect of the size of the training data

Since we exploit character-level similarities between words, we are able to better model infrequent words. Therefore, we should also be more robust to the size of the training data that we use. In order to assess that, we propose to evaluate the performance of our word vectors on the similarity task as a function of the training data size. To this end, we train our model and the cbow baseline on portions of Wikipedia of increasing size. We use the Wikipedia corpus described above and isolate the first 1, 2, 5, 10, 20, and 50 percent of the data. Since we don't reshuffle the dataset, they are all subsets of each other. We report results for this experiment in Fig. 1.

As in the experiment presented in Sec. 5.1, not all words from the evaluation set are present in the Wikipedia data. Again, by default, we use a null vector for these words (ours*) or compute a vector by summing the n-gram representations (ours). The out-of-vocabulary rate grows as the dataset shrinks, and therefore the performance of ours* and cbow necessarily degrades. In contrast, the proposed model (ours) assigns non-trivial vectors to previously unseen words.

First, we notice that for all datasets and all sizes, the proposed approach (ours) performs better than the baseline. However, the performance of the baseline cbow model gets better as more and more data is available. Our model, on the other hand, seems…

Page 28:

Qualitative results

(a) De-Gur350
i\j   2    3    4    5    6
2     57   64   67   69   69
3          65   68   70   70
4               70   70   71
5                    69   71
6                         70

(b) De-Semantic
i\j   2    3    4    5    6
2     59   55   56   59   60
3          60   58   60   62
4               62   62   63
5                    64   64
6                         65

(c) De-Syntactic
i\j   2    3    4    5    6
2     45   50   53   54   55
3          51   55   55   56
4               54   56   56
5                    56   56
6                         54

(d) En-RW
i\j   2    3    4    5    6
2     41   42   46   47   48
3          44   46   48   48
4               47   48   48
5                    48   48
6                         48

(e) En-Semantic
i\j   2    3    4    5    6
2     78   76   75   76   76
3          78   77   78   77
4               79   79   79
5                    80   79
6                         80

(f) En-Syntactic
i\j   2    3    4    5    6
2     70   71   73   74   73
3          72   74   75   74
4               74   75   75
5                    74   74
6                         72

Table 5: Study of the effect of the sizes of n-grams considered on performance. We compute word vectors by using character n-grams with n in {i, …, j} and report performance for various values of i and j. We evaluate this effect on German and English, and represent out-of-vocabulary words using subword information.

query      tiling      tech-rich           english-born    micromanaging   eateries      dendritic

ours       tile        tech-dominated      british-born    micromanage     restaurants   dendrite
           flooring    tech-heavy          polish-born     micromanaged    eaterie       dendrites

skipgram   bookcases   technology-heavy    most-capped     defang          restaurants   epithelial
           built-ins   .ixic               ex-scotland     internalise     delis         p53

Table 6: Nearest neighbors of rare words using our representations and skipgram. These hand picked examples are for illustration.

… the beginning and end of a word. Therefore, 2-grams will not be enough to properly capture suffixes that correspond to conjugations or declensions, as in that case they are composed of a single proper character and a positional one.

5.6 Qualitative results

We report sample qualitative results in Table 6. For selected words, we show nearest neighbors according to cosine similarity for vectors trained using the proposed approach and for the skipgram baseline. As expected, the nearest neighbors for complex, technical and infrequent words obtained using our approach are better than the ones obtained using the baseline model.

6 Discussion

In this paper, we investigate a simple method to learn word representations by taking into account subword information. Our approach, which incorporates character n-grams into the skipgram model, is related to an old idea that was introduced by Schütze (1993). Because of its simplicity, our model trains fast and does not require any preprocessing or supervision. We show that our model outperforms baselines which do not take into account subword information, as well as methods relying on morphological analysis. We will open source the implementation of our model, in order to facilitate comparison of future work on learning subword representations.

Page 29:

Conclusion

Page 30:

fasttext is open source

• Available on GitHub

• After 6 months: > 6700 stars! 1.6k members in the FB group

• Featured in "popular" press

• C++ code

• Bash scripts as examples

• Very simple usage

• Several OS projects: Python wrapper, Docker files

Page 31:

Questions
