Word Embeddings: Deep Learning for NLP
Kevin Patel
ICON 2017
December 21, 2017
Kevin Patel Word Embeddings 1/100
Outline
1. Introduction
2. Word Embeddings
   - Count Based Embeddings
   - Prediction Based Embeddings
   - Multilingual Word Embeddings
   - Interpretable Word Embeddings
3. Evaluating Word Embeddings
   - Intrinsic Evaluation
   - Extrinsic Evaluation
   - Evaluation Frameworks
   - Visualizing Word Embeddings
4. Discussion on Lower Bounds
5. Applications of Word Embeddings
   - Are Word Embeddings Useful for Sarcasm Detection?
   - Iterative Unsupervised Most Frequent Sense Detection using Word Embeddings
6. Conclusion
Layman(ish) Intro to ML
In simple terms, Machine Learning comprises:
- Representing data in some numeric form
- Learning some function on that representation
How do we place words so as to learn, say, binary sentiment classification?
- Good: Positive
- Awesome: Positive
- Bad: Negative
Representations for Learning Algorithms

Detect whether the following image is a dog or not:
- Basic idea: feed raw pixels as the input vector
- Works well: there is inherent structure in the image

Detect whether a word is a dog or not: labrador
- Nothing in the spelling of labrador connects it to dog
- We need a representation of labrador which indicates that it is a dog
Local Representations
- Information about a particular item is located solely in the corresponding representational element (dimension)
- Effectively, one unit is turned on in a network while all the others are off
- No sharing between represented data
- Each feature is independent
- No generalization on the basis of similarity between features
Distributed Representations
- Information about a particular item is distributed among a set of (not necessarily mutually exclusive) representational elements (dimensions)
- One item is spread over multiple dimensions
- One dimension contributes to multiple items
- A new input is processed similarly to the training samples it resembles
- Better generalization
Distributed Representations: Example
Number   Local Representation   Distributed Representation
0        1 0 0 0 0 0 0 0        0 0 0
1        0 1 0 0 0 0 0 0        0 0 1
2        0 0 1 0 0 0 0 0        0 1 0
3        0 0 0 1 0 0 0 0        0 1 1
4        0 0 0 0 1 0 0 0        1 0 0
5        0 0 0 0 0 1 0 0        1 0 1
6        0 0 0 0 0 0 1 0        1 1 0
7        0 0 0 0 0 0 0 1        1 1 1
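The table above is easy to reproduce programmatically; a minimal sketch (the function names are illustrative):

```python
def local_representation(n, size=8):
    """One-hot vector: a single unit is on, all others are off."""
    vec = [0] * size
    vec[n] = 1
    return vec

def distributed_representation(n, bits=3):
    """Binary code: each number is spread over several dimensions,
    and each dimension contributes to several numbers."""
    return [(n >> i) & 1 for i in reversed(range(bits))]

for n in range(8):
    print(n, local_representation(n), distributed_representation(n))
```

Note that eight items need eight local dimensions, but only three distributed ones.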
Word Embeddings: Intuition

Word embeddings: distributed vector representations of words such that similarity among the vectors correlates with semantic similarity among the corresponding words.
- Given that sim(dog, cat) is greater than sim(dog, furniture), cos(vec(dog), vec(cat)) is greater than cos(vec(dog), vec(furniture))
- Such similarity information is uncovered from context
- Consider the following sentences:
  I like sweet food.
  You like spicy food.
  They like xyzabc food.
- What is xyzabc? The meaning of a word can be inferred from its neighbors (context) and from words that share those neighbors
- Neighbors of xyzabc: like, food
- Words that share the neighbors of xyzabc: sweet, spicy
Modelling Meaning via Word Embeddings
- Geometric metaphor of meaning (Sahlgren, 2006): meanings are locations in a semantic space, and semantic similarity is proximity between the locations
- Distributional Hypothesis (Harris, 1970): words with similar distributional properties have similar meanings
- Only differences in meaning can be modelled
Entire Vector vs. Individual dimensions
- Only proximity in the entire space is represented
- In the majority of algorithms, individual dimensions of the high-dimensional space have no phenomenological correlates
- Models that do have such correlations are known as interpretable models
Modelling Meaning via Word Embeddings
- Co-occurrence matrix (Rubenstein and Goodenough, 1965): a mechanism to capture distributional properties; rows of the co-occurrence matrix can be directly used as word vectors
- Neural word embeddings: vector representations learnt using neural networks (Bengio et al., 2003; Collobert and Weston, 2008a; Mikolov et al., 2013b)
Co-occurrence Matrix
- Originally proposed by Schütze (1992)
- Foundation of the count-based approaches that follow
- Automatic derivation of vectors: collect co-occurrence counts in a matrix; rows or columns are the vectors of the corresponding words
- If counting in both directions, the matrix is symmetric
- If counting on one side only, the matrix is asymmetric, and is known as a directional co-occurrence matrix
Co-occurrence Matrix (contd.)

Corpus: <> I like cats <> I love dogs <> I hate rats <> I rate bats <>

Co-occurrence matrix:

word   <>   I   like  love  hate  rate  rats  cats  dogs  bats
<>     0    4   0     0     0     0     1     1     1     1
I      4    0   1     1     1     1     0     0     0     0
like   0    1   0     0     0     0     0     1     0     0
love   0    1   0     0     0     0     0     0     1     0
hate   0    1   0     0     0     0     1     0     0     0
rate   0    1   0     0     0     0     0     0     0     1
rats   1    0   0     0     1     0     0     0     0     0
cats   1    0   1     0     0     0     0     0     0     0
dogs   1    0   0     1     0     0     0     0     0     0
bats   1    0   0     0     0     1     0     0     0     0
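Counting immediate left and right neighbours over the toy corpus reproduces the matrix above; a minimal sketch:

```python
from collections import defaultdict

corpus = "<> I like cats <> I love dogs <> I hate rats <> I rate bats <>".split()

# Symmetric co-occurrence: count left and right neighbours (window of 1)
counts = defaultdict(int)
for i, word in enumerate(corpus):
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            counts[(word, corpus[j])] += 1

vocab = sorted(set(corpus))
# A word's row in the matrix = its co-occurrence counts with every vocab word
row = lambda w: [counts[(w, v)] for v in vocab]
print(row("like"))
```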
Word Embeddings: Count Based Embeddings
LSA
- Latent Semantic Analysis
- Originally developed as Latent Semantic Indexing (LSI) (Dumais et al., 1988); adapted for word-space models
- Developed to tackle the inability of co-occurrence matrix models to handle synonymy: a query about hotels cannot retrieve results about motels
- Word and document dimensions → latent dimensions
- Uses Singular Value Decomposition (SVD) for dimensionality reduction
LSA (contd.)
- Words-by-documents matrix
- Entropy-based weighting of co-occurrences:

f_{ij} = (\log(TF_{ij}) + 1) \times \left(1 - \sum_j \frac{p_{ij} \log p_{ij}}{\log D}\right)   (1)

where D is the number of documents, TF_{ij} is the frequency of term i in document j, f_i is the frequency of term i in the document collection, and p_{ij} = TF_{ij} / f_i
- Truncated SVD to reduce dimensionality
- Cosine measure to compute vector similarities
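A toy illustration of the LSA pipeline, using a hypothetical 4-term by 4-document count matrix (the counts are made up) and plain truncated SVD in place of the full entropy weighting:

```python
import numpy as np

# Hypothetical term-by-document counts: hotel, motel, dog, cat x 4 documents
X = np.array([
    [2.0, 1.0, 0.0, 0.0],   # hotel
    [1.0, 2.0, 0.0, 0.0],   # motel
    [0.0, 0.0, 3.0, 1.0],   # dog
    [0.0, 0.0, 1.0, 3.0],   # cat
])

# Truncated SVD: keep only the k largest singular values (latent dimensions)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
word_vectors = U[:, :k] * s[:k]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# hotel and motel collapse onto the same latent direction
print(cosine(word_vectors[0], word_vectors[1]))
```

This is how a query about hotels can match documents about motels: the two words never share a document, yet their latent vectors coincide.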
HAL
- Hyperspace Analogue to Language (Lund and Burgess, 1996a)
- Developed specifically for word representations
- Uses directional co-occurrence
HAL (contd.)
- Directional word-by-word matrix
- Distance weighting of the co-occurrences
- Concatenation of row and column vectors
- Dimensionality reduction optional: discard low-variance dimensions
- Normalization of vectors to unit length
- Similarities computed through either Manhattan or Euclidean distance
Word Embeddings: Prediction Based Embeddings
NNLM
- Neural Network Language Model, proposed by Bengio et al. (2003)
- Predict a word given its context
- Word vectors are learnt as a by-product of language modelling
NNLM: Original Model
NNLM: Simplified (1)
NNLM: Simplified (2)
Skip Gram
- Proposed by Mikolov et al. (2013b)
- Predict the context given a word
Skip Gram (contd.)
Given a sequence of training words w_1, w_2, \ldots, w_T, maximize

\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)   (2)

where

p(w_O \mid w_I) = \frac{\exp(u_{w_O}^T v_{w_I})}{\sum_{w=1}^{W} \exp(u_w^T v_{w_I})}   (3)
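Equation (3) is an ordinary softmax over the vocabulary; a self-contained sketch with random toy vectors (the vocabulary and dimensionality are illustrative):

```python
import math
import random

random.seed(0)
vocab = ["I", "like", "sweet", "spicy", "food"]
dim = 4

# Skip-gram keeps two vector tables: v (input/word) and u (output/context)
v = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
u = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def p(context, word):
    """p(w_O | w_I) from Eq. (3): softmax of u_{w_O} . v_{w_I} over the vocab."""
    scores = {w: math.exp(dot(u[w], v[word])) for w in vocab}
    return scores[context] / sum(scores.values())

# A proper distribution: probabilities over all contexts sum to 1
print(sum(p(c, "like") for c in vocab))
```

Training maximizes Eq. (2) by nudging u and v so that observed (word, context) pairs get high probability; the sketch only shows the forward computation.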
Global Vectors (GloVe)
- Proposed by Pennington et al. (2014)
- Predict the context given a word
- Similar to skip-gram, but the objective function is different:

J = \sum_{i,j=1}^{V} f(X_{ij}) \left(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2   (4)

where X_{ij} is the likelihood of the i-th and j-th words occurring together, and f is a weighting function
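The weighting function f is not spelled out above; in Pennington et al. (2014) it is f(x) = (x/x_max)^α for x < x_max and 1 otherwise, with defaults x_max = 100 and α = 0.75. A sketch of one term of Eq. (4) under those defaults:

```python
import math

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: down-weights rare pairs, caps very frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def loss_term(wi_dot_wj, b_i, b_j, x_ij):
    """One term of Eq. (4): f(X_ij) * (w_i . w_j + b_i + b_j - log X_ij)^2."""
    return f(x_ij) * (wi_dot_wj + b_i + b_j - math.log(x_ij)) ** 2

# A perfect fit (w_i . w_j + b_i + b_j == log X_ij) contributes zero loss
print(loss_term(math.log(50.0), 0.0, 0.0, 50.0))
```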
Tuning word embeddings
- Techniques that tune already-trained word embeddings for various tasks using additional information
- Ling et al. (2015) improve the quality of word2vec for syntactic tasks such as POS tagging: take word positioning into account; their Structured Skip-Gram and Continuous Window models are available as wang2vec
- Levy and Goldberg (2014) use dependency parse trees: linear windows capture broad topical similarities, while dependency contexts capture functional similarities
- Patel et al. (2017) use a medical code hierarchy to improve medical-domain-specific word embeddings
Word Embeddings: Multilingual Word Embeddings
Objective
- English data >> data for other languages
- Language-independent phenomena learnt on English should be applicable to other languages
- Solution via word embeddings: project words of different languages into a common subspace
- Goal of multilingual word embeddings: a shared subspace for all languages
- Neural MT learns such embeddings implicitly by optimizing the MT objective; we shall discuss explicit models
- These models are for MT what word2vec and GloVe are for NLP: much lower training cost compared to Neural MT
- Applications: Machine Translation, Automated Bilingual Dictionary Generation, Cross-lingual Information Retrieval, etc.
Types of Cross-lingual Embeddings
Based on the underlying approach:
- Monolingual mapping
- Cross-lingual training
- Joint optimization

Based on the resource used:
- Word-aligned data
- Sentence-aligned data
- Document-aligned data
- Lexicon
- No parallel data
Monolingual Mapping
Learning in two steps:
- Train separate embeddings w_e and w_f on large monolingual corpora of the corresponding languages e and f
- Learn transformations g_1 and g_2 such that w_e = g_1(w_f) and w_f = g_2(w_e)
- Transformations are learnt using bilingual word mappings (a lexicon)
Monolingual Mapping (contd.)
Linear projection, proposed by Mikolov et al. (2013a):

Learn a matrix W such that w_e \approx W w_f, by minimizing

\sum_{i=1}^{n} \|W w_f^{(i)} - w_e^{(i)}\|^2

We adapted this method for automatic synset linking in multilingual wordnets (accepted at GWC 2018)
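Given a lexicon of vector pairs, W is an ordinary least-squares problem; a sketch on synthetic data (the sizes and the known "true" mapping are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 50 lexicon pairs of 5-dimensional embeddings
n, d = 50, 5
Wf = rng.normal(size=(n, d))      # rows: source-language vectors w_f
W_true = rng.normal(size=(d, d))  # the mapping we pretend is unknown
We = Wf @ W_true.T                # rows: corresponding target vectors w_e

# Minimize sum_i ||W w_f^(i) - w_e^(i)||^2  <=>  least squares Wf X ~ We
X, *_ = np.linalg.lstsq(Wf, We, rcond=None)
W = X.T
print(np.allclose(W, W_true))
```

With noiseless synthetic data the recovered W matches exactly; with a real lexicon it is only a least-squares fit.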
Monolingual Mapping (Contd.)
- Linear projection (Mikolov et al., 2013a): lexicon
- Projection via CCA (Faruqui and Dyer, 2014b): lexicon
- Alignment-based projection (Guo et al., 2015): word-aligned data
- Adversarial auto-encoder (Barone, 2016): no parallel data
Cross Lingual Training
- Goal: optimize a cross-lingual objective
- Mainly rely on sentence alignments
- Require a parallel corpus for training
Cross Lingual Training (contd.)
- Bilingual Compositional Sentence Model, proposed by Hermann and Blunsom (2013)
- Train two models to produce sentence representations of aligned sentences in two languages
- Minimize the distance between sentence representations of aligned sentences
Cross Lingual Training (contd.)
- Bilingual compositional sentence model (Hermann and Blunsom, 2013): sentence-aligned data
- Distributed word alignment (Kočiskỳ et al., 2014): sentence-aligned data
- Translation-invariant LSA (Huang et al., 2015): lexicon
- Inverted indexing on Wikipedia (Søgaard et al., 2015): document-aligned data
Joint Optimization
- Jointly optimize both monolingual (M) and cross-lingual (Ω) constraints
- Objective: minimize M_{l1} + M_{l2} + λ(Ω_{l1→l2} + Ω_{l2→l1}), where λ decides the weightage of the cross-lingual constraints
Joint Optimization (contd.)
Multitask Language Model, proposed by Klementiev et al. (2012):
- Train a neural language model (NNLM)
- Jointly optimize the monolingual maximum likelihood (M) with a word-alignment-based MT regularization term (Ω)
Joint Optimization (contd.)
- Multi-task language model (Klementiev et al., 2012): word-aligned data
- Bilingual skip-gram (Luong et al., 2015): word-aligned data
- Bilingual bag-of-words without alignment (Gouws et al., 2015): sentence-aligned data
- Bilingual sparse representations (Vyas and Carpuat, 2016): word-aligned data
Word Embeddings: Interpretable Word Embeddings
Interpretability and Explainability
- A model is interpretable if a human can make sense of it; example: decision trees
- Interpretable models enable one to explain the performance of the system and tune it accordingly
- However, in practice, interpretable models generally perform poorly compared to other systems
Interpretable Word Embeddings

Dimensions become interpretable by ordering words based on their value along that dimension.

Word         d205           Word         d272
iguana       0.599371       thigh        0.875286
bueller      0.584335       knee         0.872282
chimpanzee   0.577834       shoulder     0.866209
wasp         0.556845       elbow        0.857403
chimp        0.553980       wrist        0.852959
hamster      0.534810       ankle        0.851555
giraffe      0.532316       groin        0.841347
unicorn      0.529533       forearm      0.837988
caterpillar  0.528376       leg          0.836661
baboon       0.526324       pelvis       0.777564
gorilla      0.521590       neck         0.758420
tortoise     0.519941       spine        0.754774
sparrow      0.516842       torso        0.707458
lizard       0.515716       hamstring    0.701921
cockroach    0.505015       buttocks     0.689092
crocodile    0.491139       knees        0.676485
alligator    0.486275       ankles       0.658485
moth         0.471682       jaw          0.653126
kangaroo     0.469284       biceps       0.650972
toad         0.463514       hips         0.647000

Examples from the NNSE embeddings of Murphy et al. (2012)
NNSE

Non-Negative Sparse Embeddings, proposed by Murphy et al. (2012)
- Word embeddings that are interpretable and cognitively plausible
- Capture a mixture of topical and taxonomical semantics
- Computation:
  - Dependency co-occurrences adjusted with PPMI (to normalize for word frequency) and reduced with sparse SVD
  - Document co-occurrences adjusted with PPMI and reduced with sparse SVD
  - Their union factorized using a variant of non-negative sparse coding
- Resulting word embeddings have both topical neighbors (judge is near prison) and taxonomical neighbors (judge is near referee)
- Code unavailable; embeddings available at http://www.cs.cmu.edu/~bmurphy/NNSE/
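The PPMI adjustment used in the pipeline above can be sketched as follows (the co-occurrence counts are made up for illustration):

```python
import math
from collections import Counter

# Hypothetical (word, context) co-occurrence counts
counts = Counter({("judge", "prison"): 8, ("judge", "referee"): 2,
                  ("dog", "prison"): 1, ("dog", "referee"): 1})

total = sum(counts.values())
word_totals, ctx_totals = Counter(), Counter()
for (w, c), n in counts.items():
    word_totals[w] += n
    ctx_totals[c] += n

def ppmi(w, c):
    """Positive PMI: max(0, log[ P(w,c) / (P(w) P(c)) ]); 0 for unseen pairs."""
    if counts[(w, c)] == 0:
        return 0.0
    p_wc = counts[(w, c)] / total
    p_w = word_totals[w] / total
    p_c = ctx_totals[c] / total
    return max(0.0, math.log(p_wc / (p_w * p_c)))

print(ppmi("judge", "prison"), ppmi("judge", "referee"))
```

Clamping negative PMI to zero is what keeps the matrix non-negative before the sparse factorization step.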
OIWE
- Online Interpretable Word Embeddings, proposed by Luo et al. (2015)
- Main idea: apply sparsity to skip-gram
- Achieve sparsity by setting to 0 any dimension of a vector that falls below 0
- Propose two techniques to do this via gradient descent
- Outperforms NNSE on the word intrusion task
- Code available on GitHub at https://github.com/SkTim/OIWE
Evaluating Word Embeddings: Intrinsic Evaluation
Word Pair Similarity
- Evaluates the generalizability of word embeddings
- One of the most widely used evaluations
- Many datasets available: WS353, RG65, MEN, SimLex, SCWS, etc.

Word1    Word2     Human Score   Model1 Score   Model2 Score
street   street    10.00         1.00           1.00
street   avenue    8.88          0.04           0.38
street   block     6.88          0.14           0.26
street   place     6.44          0.21           0.18
street   children  4.94          -0.08          0.15
Spearman Correlation             0.6            1.0
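The Spearman correlations in the last row can be reproduced from the columns above; a minimal sketch (no tied scores here, so the simple rank-difference formula applies):

```python
def ranks(values):
    """Rank in descending order: 1 = highest (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(a, b):
    """Spearman rho = 1 - 6 * sum(d^2) / (n (n^2 - 1))."""
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    n = len(a)
    return 1 - 6 * d2 / (n * (n * n - 1))

human  = [10.00, 8.88, 6.88, 6.44, 4.94]
model1 = [1.00, 0.04, 0.14, 0.21, -0.08]
model2 = [1.00, 0.38, 0.26, 0.18, 0.15]
print(round(spearman(human, model1), 2), round(spearman(human, model2), 2))
# -> 0.6 1.0
```

Model2 orders the pairs exactly as humans do (rho = 1.0) even though its raw scores are far from the human ones: only ranks matter.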
Word Analogy task
- Proposed by Mikolov et al. (2013b)
- Tries to answer questions of the form: man is to woman as king is to ?
- Often discussed in the media
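The usual way to answer such questions is nearest-neighbour search around vec(woman) - vec(man) + vec(king); a toy sketch with hand-made 2-d vectors (the vectors are entirely illustrative):

```python
import math

# Hand-crafted 2-d vectors: dimension 0 ~ "male", dimension 1 ~ "female"
vecs = {
    "man":    [1.0, 0.0],
    "woman":  [0.0, 1.0],
    "king":   [1.0, 0.2],
    "queen":  [0.0, 1.2],
    "prince": [1.0, 0.1],
    "throne": [0.6, 0.6],
}

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """a is to b as c is to ? -- nearest neighbour of b - a + c, query words excluded."""
    target = [vb - va + vc for va, vb, vc in zip(vecs[a], vecs[b], vecs[c])]
    candidates = [w for w in vecs if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vecs[w], target))

print(analogy("man", "woman", "king"))  # -> queen
```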
Categorization
- Evaluates the ability of embeddings to form proper clusters
- Given sets of words with different labels, cluster them and check the correspondence between the clusters and the sets
- The purer the clusters, the better the embeddings
- Datasets available: BLESS, Battig, etc.
Word Intrusion Detection
- Proposed by Murphy et al. (2012)
- Provides a way to interpret dimensions
- Most approaches do not report results on this task; experiments done by us suggest many of them are not interpretable
Word Intrusion Detection (contd.)
The approach:
1. Select a dimension
2. Reverse-sort all vectors based on this dimension
3. Select the top 5 words
4. Select a word which is in the bottom half of this list, but in the top 10 percentile in some other columns
5. Give a random permutation of these 6 words to a human evaluator. Example: bathroom, closet, attic, balcony, quickly, toilet
6. Check precision
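A simplified sketch of the procedure above (random "embeddings"; the percentile check in step 4 is replaced by simply taking a bottom-half word):

```python
import random

random.seed(7)
# Hypothetical 10-word vocabulary with random 4-d "embeddings"
vocab = ["bathroom", "closet", "attic", "balcony", "toilet",
         "quickly", "slowly", "often", "never", "sometimes"]
E = {w: [random.random() for _ in range(4)] for w in vocab}

dim = 0
ranked = sorted(vocab, key=lambda w: -E[w][dim])  # step 2: reverse sort on the dimension
top5 = ranked[:5]                                 # step 3: top 5 words
intruder = ranked[len(ranked) // 2]               # step 4 (simplified): a bottom-half word
question = random.sample(top5 + [intruder], 6)    # step 5: random permutation of 6 words
print(question)
```

If the dimension is interpretable, human evaluators pick the intruder well above chance; precision over many such questions is the reported score.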
Evaluating Word Embeddings: Extrinsic Evaluation
Extrinsic Evaluations
- Evaluating word embeddings on downstream NLP tasks such as part-of-speech tagging, Named Entity Recognition, etc.
- Makes more sense, as we ultimately want to use embeddings for such tasks
- However, performance does not rely solely on the embeddings: improvement or degradation could be due to other factors such as network architecture, hyperparameters, etc.
- If an embedding E1 is better than another embedding E2 when used with some network architecture for NER, does that mean E1 will be better for all architectures for NER?
Evaluations on Unified Architectures
- Unified architectures such as that of Collobert and Weston (2008b) are used for extrinsic evaluations
- For different tasks, the architecture remains the same, except the last layer, where the output neurons are changed according to the task at hand
- If an embedding E1 is better than another embedding E2 on all tasks on such a unified architecture, then we can expect it to be truly better
Evaluating Word Embeddings: Evaluation Frameworks
WordVectors.org
- Proposed by Faruqui and Dyer (2014a)
- A web interface for evaluating your embeddings against a collection of word pair similarity datasets, available at http://wordvectors.org/
- Also provides visualizations for common sets of words such as (male, female) and (antonym, synonym) pairs
VecEval
- Proposed by Nayak et al. (2016)
- A web-based tool for performing extrinsic evaluations: http://www.veceval.com/
- Claimed to support six different tasks: POS, NER, Chunking, Sentiment Analysis, Natural Language Inference, Question Answering
- Has never worked for me
- The web interface is no longer active; code is available on GitHub at https://github.com/NehaNayak/veceval
Anago
- A Keras implementation of sequence labelling based on the architecture of Lample et al. (2016)
- Can perform POS tagging, NER, SRL, etc.
- Used in our lab for extrinsic evaluation
- Code available on GitHub at https://github.com/Hironsan/anago
Evaluating Word Embeddings: Visualizing Word Embeddings
Visualizing Word Embeddings
Various ways to visualize word embeddings: PCA, Isomap, t-SNE, etc., all available in scikit-learn. For example, given an embedding matrix E of shape (num_words, num_dims):

    from sklearn import decomposition
    import matplotlib.pyplot as plt

    vis = decomposition.TruncatedSVD(n_components=2)  # PCA-like projection to 2d
    E_vis = vis.fit_transform(E)
    plt.scatter(E_vis[:, 0], E_vis[:, 1])
    plt.show()

Check out http://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html for many methods applied to MNIST visualization
Related Work
- Baroni et al. (2014): neural word embeddings are better than traditional methods such as LSA, HAL, RI (Landauer and Dumais, 1997; Lund and Burgess, 1996b; Sahlgren, 2005)
- Levy et al. (2015): the superiority of neural word embeddings is due not to the embedding algorithm but to certain design choices and hyperparameter optimizations; varies other hyperparameters while keeping the number of dimensions fixed at 500
- Schnabel et al. (2015); Zhai et al. (2016); Ghannay et al. (2016): no justification for the number of dimensions chosen in their evaluations
- Melamud et al. (2016): the optimal number of dimensions differs across evaluations of word embeddings
Why Dimensions matter?: A Practical Example
- Various app developers want to utilize word embeddings
- Example memory limit for an app: 200 MB; size of the Google pre-trained vectors file: 3.4 GB
- Natural thought process: decrease the number of dimensions. To what value? 100? 50? 20?
- It depends on the words/entities we want to place in the space
The number of dimensions of a vector space imposes a restriction on the number of equidistant points it can have
If distance is Euclidean and the number of dimensions λ = N, then the maximum number of equidistant points E in the corresponding space is N + 1 (Swanepoel, 2004)
If distance is cosine, no closed-form solution exists

Dimensions λ and max. no. of equiangular lines E (Barg and Yu, 2014):

λ        E       λ        E
3        6       18       61
4        6       19       76
5        10      20       96
6        16      21       126
7-13     28      22       176
14       30      23       276
15       36      24-41    276
16       42      42       288
17       51      43       344
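The Euclidean bound above (at most N + 1 pairwise equidistant points in N dimensions) can be checked with a quick simplex construction; this is a minimal sketch, not part of the original slides:

```python
import itertools
import math

def simplex_points(n):
    """N+1 pairwise-equidistant points: the standard basis vectors of
    R^(N+1). They lie on the N-dimensional hyperplane sum(x) = 1,
    so they realize the bound E = N + 1 for Euclidean distance."""
    return [[1.0 if i == j else 0.0 for j in range(n + 1)]
            for i in range(n + 1)]

def pairwise_distances(points):
    return [math.dist(p, q) for p, q in itertools.combinations(points, 2)]

pts = simplex_points(3)               # 4 points, a regular tetrahedron
dists = pairwise_distances(pts)
print(all(abs(d - math.sqrt(2)) < 1e-12 for d in dists))   # True
```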
Objective
Problem Statement
Does the number of pairwise equidistant words enforce a lower bound on the number of dimensions for word embeddings?

'Equidistance' determined using the co-occurrence matrix
Plan of action:
  Verify using a toy corpus
  Evaluate on an actual corpus
Motivation (1/4)
Consider the following toy corpus:
<> I like cats <> I love dogs <> I hate rats <> I rate bats <>
Corresponding co-occurrence matrix (rows for the four verbs):

word   <>   I   like  love  hate  rate  rats  cats  dogs  bats
like   0    1   0     0     0     0     0     1     0     0
love   0    1   0     0     0     0     0     0     1     0
hate   0    1   0     0     0     0     1     0     0     0
rate   0    1   0     0     0     0     0     0     0     1

Distance between any pair of these words = √2
The words form a regular tetrahedron
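The √2 distances can be verified directly from the co-occurrence rows above; a small sketch, with the vocabulary order taken from the table:

```python
import itertools
import math

# Co-occurrence rows (window = 1) for the four verbs in the toy corpus,
# over the vocabulary [<>, I, like, love, hate, rate, rats, cats, dogs, bats]
rows = {
    "like": [0, 1, 0, 0, 0, 0, 0, 1, 0, 0],
    "love": [0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
    "hate": [0, 1, 0, 0, 0, 0, 1, 0, 0, 0],
    "rate": [0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
}

# Any two rows agree on the 'I' column and differ in exactly two other
# positions, so every pairwise Euclidean distance is sqrt(1 + 1) = sqrt(2).
for (wa, va), (wb, vb) in itertools.combinations(rows.items(), 2):
    print(wa, wb, round(math.dist(va, vb), 4))   # each pair: 1.4142
```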
Motivation (2/4)
Mean and std dev of the mean of a point's distance to the other points:

Dimension  Mean  Stddev
1          0.94  0.94
2          1.77  0.80
3          2.63  0.10

[Figure: positions of hate, like, love, and rate in 1d, 2d, and 3d embeddings, before and after training]
Motivation (3/4)
Hypothesis:
If the learning algorithm does not get enough dimensions, it will fail to uphold the equality constraint
  The standard deviation of the mean of all pairwise distances will be higher
As we increase the dimensions, the algorithm gets more degrees of freedom to model the equality constraint better
  There will be statistically significant changes in the standard deviation
Once the lower bound on dimensions is reached, the algorithm has enough degrees of freedom
  From this point onwards, increasing the dimensions further yields no statistically significant difference in the standard deviation
Motivation (4/4)
Dim  σ      P-value
7    0.358  -
8    0.293  0.0020
9    0.273  0.0248
10   0.238  0.0313
11   0.189  0.0013
12   0.154  0.0058
13   0.111  0.0001
14   0.044  0.0001
15   0.047  0.3096
16   0.054  0.1659

Avg. standard deviation (σ) for 15 pairwise equidistant words, along with two-tailed p-values of Welch's unpaired t-test for statistical significance
Approach (1/5)
1. Compute the word × word co-occurrence matrix from the corpus

Corpus: <> I like cats <> I love dogs <> I hate rats <> I rate bats <>

word   <>   I   like  love  hate  rate  rats  cats  dogs  bats
<>     0    4   0     0     0     0     1     1     1     1
I      4    0   1     1     1     1     0     0     0     0
like   0    1   0     0     0     0     0     1     0     0
love   0    1   0     0     0     0     0     0     1     0
hate   0    1   0     0     0     0     1     0     0     0
rate   0    1   0     0     0     0     0     0     0     1
rats   1    0   0     0     1     0     0     0     0     0
cats   1    0   1     0     0     0     0     0     0     0
dogs   1    0   0     1     0     0     0     0     0     0
bats   1    0   0     0     0     1     0     0     0     0
Approach (2/5)
2. Create the corresponding word × word cosine similarity matrix

       <>   I    like  love  hate  rate  rats  cats  dogs  bats
<>     1.0  0.0  0.8   0.8   0.8   0.8   0.0   0.0   0.0   0.0
I      0.0  1.0  0.0   0.0   0.0   0.0   0.8   0.8   0.8   0.8
like   0.8  0.0  1.0   0.5   0.5   0.5   0.0   0.0   0.0   0.0
love   0.8  0.0  0.5   1.0   0.5   0.5   0.0   0.0   0.0   0.0
hate   0.8  0.0  0.5   0.5   1.0   0.5   0.0   0.0   0.0   0.0
rate   0.8  0.0  0.5   0.5   0.5   1.0   0.0   0.0   0.0   0.0
rats   0.0  0.8  0.0   0.0   0.0   0.0   1.0   0.5   0.5   0.5
cats   0.0  0.8  0.0   0.0   0.0   0.0   0.5   1.0   0.5   0.5
dogs   0.0  0.8  0.0   0.0   0.0   0.0   0.5   0.5   1.0   0.5
bats   0.0  0.8  0.0   0.0   0.0   0.0   0.5   0.5   0.5   1.0
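Steps 1 and 2 can be sketched as follows for the toy corpus; window size and tokenization follow the slides:

```python
import math
from collections import defaultdict

corpus = "<> I like cats <> I love dogs <> I hate rats <> I rate bats <>".split()

# Step 1: word x word co-occurrence counts, window of 1 on each side
cooc = defaultdict(lambda: defaultdict(int))
for i, w in enumerate(corpus):
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            cooc[w][corpus[j]] += 1

vocab = sorted(set(corpus))

def row(w):
    """Co-occurrence row of w over the (sorted) vocabulary."""
    return [cooc[w][v] for v in vocab]

# Step 2: cosine similarity between two co-occurrence rows
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

print(round(cosine(row("like"), row("love")), 2))  # 0.5: share only the 'I' context
print(round(cosine(row("<>"), row("like")), 2))    # 0.79, shown rounded to 0.8 above
```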
Approach (3/5)
3. For each similarity value s_k, create a graph where the words are nodes, with an edge between node i and node j if sim(i, j) = s_k
(Graphs shown for s_k = 0.0, 0.5, 0.8, and 1.0)
Approach (4/5)
4. Find the maximum clique in this graph. The number of nodes in this clique is the maximum number of pairwise equidistant points E_k

Sim  E_k
0.5  4
0.8  0
1.0  0
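Step 4 can be sketched with a brute-force clique search; the edge set is hand-coded from the toy similarity matrix above (the four verbs are pairwise similar at 0.5, as are the four nouns), and a real implementation would need a scalable clique algorithm, since, as noted later, this step is the bottleneck:

```python
import itertools

# Edges of the s_k = 0.5 graph from the toy similarity matrix.
verbs = ["like", "love", "hate", "rate"]
nouns = ["rats", "cats", "dogs", "bats"]
edges = {frozenset(p) for p in itertools.combinations(verbs, 2)}
edges |= {frozenset(p) for p in itertools.combinations(nouns, 2)}

def max_clique(nodes, edges):
    """Brute-force maximum clique; fine for toy graphs only.
    Cliques of size 2 are ignored, matching the slides' E_k = 0
    for similarity values that only connect isolated pairs."""
    for k in range(len(nodes), 2, -1):
        for cand in itertools.combinations(nodes, k):
            if all(frozenset(p) in edges
                   for p in itertools.combinations(cand, 2)):
                return list(cand)
    return []

nodes = verbs + nouns + ["<>", "I"]
print(len(max_clique(nodes, edges)))   # 4, so E_k = 4 for s_k = 0.5
```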
Approach (5/5)
5. Reverse-look-up E_k in the dimensions table to get the number of dimensions λ

Sim  E_k
0.5  4
0.8  0
1.0  0
Max  4

λ        E       λ        E
3        6       18       61
4        6       19       76
5        10      20       96
6        16      21       126
7-13     28      22       176
14       30      23       276
15       36      24-41    276
16       42      42       288
17       51      43       344
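The reverse lookup amounts to finding the smallest dimension whose equiangular-line capacity covers the observed number of pairwise equidistant words; the dictionary below transcribes the Barg and Yu (2014) table as shown above:

```python
# Dimension lambda -> max. number of equiangular lines E (Barg and Yu, 2014)
E = {3: 6, 4: 6, 5: 10, 6: 16, **{n: 28 for n in range(7, 14)},
     14: 30, 15: 36, 16: 42, 17: 51, 18: 61, 19: 76, 20: 96, 21: 126,
     22: 176, 23: 276, **{n: 276 for n in range(24, 42)}, 42: 288, 43: 344}

def lower_bound(e_max):
    """Smallest dimension whose capacity is at least e_max."""
    return min(dim for dim, cap in E.items() if cap >= e_max)

print(lower_bound(4))    # 3: four pairwise-equidistant words fit in 3 dimensions
print(lower_bound(76))   # 19
```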
Evaluation
Used the Brown Corpus
Found 19 as the lower bound using our approach
Context window: 1 to the left and 1 to the right
Number of dimensions: 1 to 35
5 randomly initialized models for each configuration (average results reported)
Intrinsic evaluation:
  Word pair similarity: predicting sim(w_a, w_b) using the corresponding word embeddings
  Word analogy: finding the missing w_d in the relation: a is to b as c is to d
  Categorization: checking the purity of clusters formed by word embeddings
Results
[Figure: performance on the Word Pair Similarity task with respect to the number of dimensions]
[Figure: performance on the Word Analogy task with respect to the number of dimensions]
[Figure: performance on the Categorization task with respect to the number of dimensions]
Analysis
Found the lower bound consistent with the experimental evaluation

Set of pairwise equiangular points (vectors) from the Brown corpus:

Poltawa, snakestrike, burnings, Tsar's, miswritten, brows, maintained, South-East, far-famed, 27%, non-dramas, octagonal, boatyards, U-2, Devol, mourners, Hearing, sideshow, third-story, upcoming, pram, dolphins, Croydon, neuromuscular, Gladius, pvt, littered, annoying, vuhranduh, athletes, eraser, provincialism, Daly, wreaths, villain, suspicious, nooks, fielder, belly, Gogol's, interchange, two-to-three, resemble, discounted, kidneys, Hangman's, commend, accordion, summarizing, optimality, Orlando, Leamington, swift, Taras-Tchaikovsky, puts, groomed, spit, firmer, rosy-fingered, Bechhofer, campfire, Tomas
Limitations
The max-clique-finding component of the approach
  Renders the approach intractable for larger corpora
  Need to find an alternative
Applications of Word Embeddings
Are Word Embeddings Useful for Sarcasm Detection?
Problem Statement
Detect whether a sentence is sarcastic or not
Especially among sentences that do not contain sentiment-bearing words
Example: A woman needs a man just like a fish needs a bicycle
Motivation
Similarity between word embeddings as a proxy for measuring contextual incongruity
Example: A woman needs a man just like a fish needs a bicycle
  similarity(man, woman) = 0.766
  similarity(fish, bicycle) = 0.131
The imbalance between these similarities is an indication of contextual incongruity
Approach
The gist of the approach is adding word-embedding-similarity-based features, such as:
  Maximum similarity between all pairs of words in a sentence
  Minimum similarity between all pairs of words in a sentence
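A minimal sketch of these similarity features; the vectors here are tiny made-up examples, whereas the actual system uses pretrained embeddings:

```python
import itertools
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def incongruity_features(tokens, vectors):
    """Max and min pairwise embedding similarity over the in-vocabulary
    words of a sentence, used as extra features for a sarcasm classifier."""
    sims = [cosine(vectors[a], vectors[b])
            for a, b in itertools.combinations(tokens, 2)
            if a in vectors and b in vectors]
    if not sims:
        return {"max_sim": 0.0, "min_sim": 0.0}
    return {"max_sim": max(sims), "min_sim": min(sims)}

# Toy 2-d vectors for illustration only.
vecs = {"man": [1.0, 0.2], "woman": [0.9, 0.3],
        "fish": [0.1, 1.0], "bicycle": [-0.5, 0.4]}
feats = incongruity_features(["man", "woman", "fish", "bicycle"], vecs)
print(feats)   # high max_sim (man/woman), negative min_sim (man/bicycle)
```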
Evaluation
1. Liebrecht et al. (2013): unigrams, bigrams, and trigrams as features
2. González-Ibáñez et al. (2011): two sets of features: unigrams and dictionary-based
3. Buschmeier et al. (2014):
   Hyperbole (captured by 3 positive or negative words in a row)
   Quotation marks and ellipsis
   Positive/negative sentiment words followed by an exclamation or question mark
   Positive/negative sentiment scores followed by ellipsis ('...')
   Punctuation, interjections, and laughter expressions
4. Joshi et al. (2015): in addition to unigrams, features based on implicit and explicit incongruity
   Implicit incongruity features: patterns with implicit sentiment, extracted in a pre-processing step
   Explicit incongruity features: number of sentiment flips, length of positive and negative sub-sequences, and lexical polarity
Results
Word Embedding  Average F-score Gain
LSA             0.453
GloVe           0.651
Dependency      1.048
Word2Vec        1.143

Average gain in F-score for the four types of word embeddings; these values are computed on a subset of the embeddings consisting of words common to all four
Applications of Word Embeddings
Iterative Unsupervised Most Frequent Sense Detection using Word Embeddings
WordNet
Groups synonymous words into synsets
Synset example:
  Synset ID: 02139199
  Synset members: bat, chiropteran
  Gloss: nocturnal mouselike mammal with forelimbs modified to form membranous wings and anatomical adaptations for echolocation by which they navigate
  Example: Bats are creatures of the night.
Relations with other synsets (hypernym/hyponym: parent/child; meronym/holonym: part/whole)
Introduction
Word Sense Disambiguation (WSD): one of the relatively hard problems in NLP
Both supervised and unsupervised ML explored in the literature
Most Frequent Sense (MFS) baseline: a strong baseline for WSD
  Given a WSD problem instance, simply assign the most frequent sense of that word
  Ignores context
  Really strong results, due to the skew in the sense distribution of the data
Computing MFS:
  Trivial for sense-annotated corpora, which are not available in large amounts
  Need to learn from raw data
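For a sense-annotated corpus, the MFS baseline is just a frequency count; a minimal sketch with hypothetical sense labels:

```python
from collections import Counter, defaultdict

# Hypothetical sense-annotated tokens: (word, sense_id) pairs.
annotated = [("bat", "bat.n.01"), ("bat", "bat.n.05"), ("bat", "bat.n.05"),
             ("bank", "bank.n.01"), ("bank", "bank.n.01"), ("bank", "bank.n.02")]

counts = defaultdict(Counter)
for word, sense in annotated:
    counts[word][sense] += 1

def mfs(word):
    """MFS baseline: assign the top-count sense of the word, ignoring context."""
    return counts[word].most_common(1)[0][0]

print(mfs("bat"))    # bat.n.05
print(mfs("bank"))   # bank.n.01
```

The hard part, addressed next, is estimating the same quantity from raw (unannotated) text.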
Problem Statement
Given a raw corpus, estimate the most frequent sense of different words in that corpus

Bhingardive et al. (2015) showed that pretrained word embeddings can be used to compute the most frequent sense
Our work further strengthens the claim by Bhingardive et al. (2015) that word embeddings indeed capture the most frequent sense
Our approach outperforms others at the task of MFS extraction
To compute MFS using our approach:
  1. Train word embeddings on the raw corpus
  2. Apply our approach to the trained word embeddings
Intuition
Strive for consistency in the assignment of senses to maintain semantic congruity
Example:
  If cricket and bat co-occur a lot, then cricket taking the insect sense and bat taking the animal sense is less likely
  If cricket and bat co-occur a lot, and cricket's MFS is sports, then bat taking the animal sense is extremely unlikely
Key point: solve easy words first, then use them for difficult words
In other words, iterate over the degree of polysemy from 2 onward
Algorithm
w_i s_j is the vote for sense s_j due to word w_i
Two components:
  WordNet similarity between mfs(w_i) and s_j
  Embedding-space similarity between w_i and the current word
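The voting step can be sketched as follows; the function names, toy senses, and similarity values are illustrative assumptions, not the exact implementation:

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def pick_mfs(candidates, neighbours, mfs_of, vectors, target_vec, wn_sim):
    """Each already-solved neighbour w votes for candidate sense s with
    weight wn_sim(mfs(w), s) * cos(w, target); the top-voted sense wins."""
    def score(s):
        return sum(wn_sim[(mfs_of[w], s)] * cos_sim(vectors[w], target_vec)
                   for w in neighbours)
    return max(candidates, key=score)

# Toy instance: monosemous neighbour 'pitch' pulls 'bat' toward its sports sense.
wn_sim = {("pitch.n.sport", "bat.n.sport"): 0.9,
          ("pitch.n.sport", "bat.n.animal"): 0.1}
winner = pick_mfs(["bat.n.sport", "bat.n.animal"], ["pitch"],
                  {"pitch": "pitch.n.sport"}, {"pitch": [1.0, 0.1]},
                  [0.9, 0.2], wn_sim)
print(winner)   # bat.n.sport
```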
Parameters
K
Similarity measure
Unweighted (no vector-space component) vs. weighted
Evaluation
Two setups:
  Evaluating MFS as a solution for WSD
  Evaluating MFS as a classification task
MFS as solution for WSD
Method                  Senseval2  Senseval3
Bhingardive (reported)  52.34      43.28
SemCor (reported)       59.88      65.72
Bhingardive             48.27      36.67
Iterative               63.2       56.72
SemCor                  67.61      71.06

Accuracy of WSD using MFS (nouns)
MFS as solution for WSD (contd.)
Method                  Senseval2  Senseval3
Bhingardive (reported)  37.79      26.79
Bhingardive (optimal)   43.51      33.78
Iterative               48.1       40.4
SemCor                  60.03      60.98

Accuracy of WSD using MFS (all parts of speech)
MFS as classification task
Method       Nouns  Adjectives  Adverbs  Verbs  Total
Bhingardive  43.93  81.79       46.55    37.84  58.75
Iterative    48.27  80.77       46.55    44.32  61.07

Percentage match between predicted MFS and WFS
MFS as classification task (contd.)
Method       Nouns (49.20)  Verbs (26.44)  Adjectives (19.22)  Adverbs (5.14)  Total
Bhingardive  29.18          25.57          26.00               33.50           27.83
Iterative    35.46          31.90          30.43               47.78           34.19

Percentage match between predicted MFS and the true SemCor MFS. The numbers in the column headers indicate what percent of the total words belong to that part of speech
Analysis
Better than Bhingardive et al. (2015); not able to beat SemCor and WFS
There are words for which WFS does not give the proper dominant sense. Consider the following examples:
  tiger: an audacious person
  life: characteristic state or mode of living (social life, city life, real life)
  option: right to buy or sell property at an agreed price
  flavor: general atmosphere of a place or situation
  season: period of the year marked by special events
Tagged words rank too low to make a significant impact. For example:
  While detecting the MFS for a bisemous word, the first monosemous neighbour actually ranks 1101, i.e., over a thousand polysemous words are closer than this monosemous word
  The monosemous word may not be the one that can influence the MFS
Summary
Proposed an iterative approach for unsupervised most frequent sense detection using word embeddings
Similar trends, yet better overall results than Bhingardive et al. (2015)
Future work:
  Apply the approach to other languages
Conclusion
Discussed why we need word embeddings
Briefly looked at classical word embeddings
Discussed a few cross-lingual word embeddings and interpretable word embeddings
Mentioned evaluation mechanisms and tools
Argued for the existence of lower bounds on the number of dimensions of word embeddings
Discussed some in-house applications
Thank You
References
Barg, A. and Yu, W.-H. (2014). New bounds for equiangular lines. Contemporary Mathematics, 625:111–121.

Barone, A. V. M. (2016). Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders. arXiv preprint arXiv:1608.02996.

Baroni, M., Dinu, G., and Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 238–247. Association for Computational Linguistics.

Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.
Bhingardive, S., Singh, D., V, R., Redkar, H., and Bhattacharyya, P. (2015). Unsupervised most frequent sense detection using word embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1238–1243, Denver, Colorado. Association for Computational Linguistics.

Buschmeier, K., Cimiano, P., and Klinger, R. (2014). An impact analysis of features in a classification approach to irony detection in product reviews. In Proceedings of the 5th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 42–49.
Collobert, R. and Weston, J. (2008a). A unified architecture for natural language processing: deep neural networks with multitask learning. In Cohen, W. W., McCallum, A., and Roweis, S. T., editors, ICML, volume 307 of ACM International Conference Proceeding Series, pages 160–167. ACM.

Collobert, R. and Weston, J. (2008b). A unified architecture for natural language processing: deep neural networks with multitask learning. In Cohen, W. W., McCallum, A., and Roweis, S. T., editors, ICML, volume 307 of ACM International Conference Proceeding Series, pages 160–167. ACM.

Dumais, S. T., Furnas, G. W., Landauer, T. K., Deerwester, S., and Harshman, R. (1988). Using latent semantic analysis to improve access to textual information. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 281–285. ACM.
Faruqui, M. and Dyer, C. (2014a). Community evaluation and exchange of word vectors at wordvectors.org. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, USA. Association for Computational Linguistics.

Faruqui, M. and Dyer, C. (2014b). Improving vector space word representations using multilingual correlation. Association for Computational Linguistics.

Ghannay, S., Favre, B., Estève, Y., and Camelin, N. (2016). Word embedding evaluation and combination. In Calzolari, N., Choukri, K., Declerck, T., Goggi, S., Grobelnik, M., Maegaard, B., Mariani, J., Mazo, H., Moreno, A., Odijk, J., and Piperidis, S., editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France. European Language Resources Association (ELRA).
González-Ibáñez, R., Muresan, S., and Wacholder, N. (2011). Identifying sarcasm in Twitter: a closer look. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, pages 581–586. Association for Computational Linguistics.

Gouws, S., Bengio, Y., and Corrado, G. (2015). BilBOWA: Fast bilingual distributed representations without word alignments. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 748–756.

Guo, J., Che, W., Yarowsky, D., Wang, H., and Liu, T. (2015). Cross-lingual dependency parsing based on distributed representations. In ACL (1), pages 1234–1244.

Harris, Z. S. (1970). Distributional structure. Springer.
Hermann, K. M. and Blunsom, P. (2013). Multilingual distributed representations without word alignment. arXiv preprint arXiv:1312.6173.

Huang, K., Gardner, M., Papalexakis, E., Faloutsos, C., Sidiropoulos, N., Mitchell, T., Talukdar, P. P., and Fu, X. (2015). Translation invariant word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1084–1088.

Joshi, A., Sharma, V., and Bhattacharyya, P. (2015). Harnessing context incongruity for sarcasm detection. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, volume 2, pages 757–762.
Klementiev, A., Titov, I., and Bhattarai, B. (2012). Inducing crosslingual distributed representations of words. Proceedings of COLING 2012, pages 1459–1474.

Kočiskỳ, T., Hermann, K. M., and Blunsom, P. (2014). Learning bilingual word representations by marginalizing alignments. arXiv preprint arXiv:1405.0947.

Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., and Dyer, C. (2016). Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360.

Landauer, T. K. and Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211–240.
Levy, O. and Goldberg, Y. (2014). Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 2: Short Papers, pages 302–308.

Levy, O., Goldberg, Y., and Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Liebrecht, C., Kunneman, F., and van den Bosch, A. (2013). The perfect solution for detecting sarcasm in tweets #not. Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, WASSA.
Ling, W., Dyer, C., Black, A. W., and Trancoso, I. (2015). Two/too simple adaptations of word2vec for syntax problems. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1299–1304.

Lund, K. and Burgess, C. (1996a). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2):203–208.

Lund, K. and Burgess, C. (1996b). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2):203–208.
Luo, H., Liu, Z., Luan, H., and Sun, M. (2015). Online learning of interpretable word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1687–1692.

Luong, T., Pham, H., and Manning, C. D. (2015). Bilingual word representations with monolingual quality in mind. In VS@HLT-NAACL, pages 151–159.

Melamud, O., McClosky, D., Patwardhan, S., and Bansal, M. (2016). The role of context types and dimensionality in learning word embeddings. In Knight, K., Nenkova, A., and Rambow, O., editors, NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, pages 1030–1040. The Association for Computational Linguistics.
Mikolov, T., Le, Q. V., and Sutskever, I. (2013a). Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In Burges, C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K., editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Murphy, B., Talukdar, P., and Mitchell, T. (2012). Learning effective and interpretable semantic models using non-negative sparse embedding, pages 1933–1949. Association for Computational Linguistics.
Nayak, N., Angeli, G., and Manning, C. D. (2016). Evaluating word embeddings using a representative suite of practical tasks. ACL 2016, page 19.
Patel, K., Patel, D., Golakiya, M., Bhattacharyya, P., and Birari, N. (2017). Adapting pre-trained word embeddings for use in medical coding. In BioNLP 2017, pages 302–306, Vancouver, Canada. Association for Computational Linguistics.
Pennington, J., Socher, R., and Manning, C. D. (2014). Glove: Global vectors for word representation. Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014), 12.
Rubenstein, H. and Goodenough, J. B. (1965). Contextual correlates of synonymy. Commun. ACM, 8(10):627–633.
Sahlgren, M. (2005). An introduction to random indexing. In Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE 2005.
Sahlgren, M. (2006). The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. PhD thesis, Institutionen för lingvistik.
Schnabel, T., Labutov, I., Mimno, D., and Joachims, T. (2015). Evaluation methods for unsupervised word embeddings. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 298–307.
Schütze, H. (1992). Dimensions of meaning. In Supercomputing '92, Proceedings, pages 787–796. IEEE.
Søgaard, A., Agić, Ž., Alonso, H. M., Plank, B., Bohnet, B., and Johannsen, A. (2015). Inverted indexing for cross-lingual NLP. In The 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2015).
Swanepoel, K. J. (2004). Equilateral sets in finite-dimensional normed spaces. In Seminar of Mathematical Analysis, volume 71, pages 195–237. Secretariado de Publicationes, Universidad de Sevilla, Seville.
Vyas, Y. and Carpuat, M. (2016). Sparse bilingual word representations for cross-lingual lexical entailment. In HLT-NAACL, pages 1187–1197.
Zhai, M., Tan, J., and Choi, J. D. (2016). Intrinsic and extrinsic evaluations of word embeddings. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pages 4282–4283. AAAI Press.
Web References
http://www.indiana.edu/~gasser/Q530/Notes/representation.html