
Textual Data Analysis

J.-C. Chappelier

Laboratoire d'Intelligence Artificielle, Faculté I&C, EPFL


Objectives of this lecture

Basics of textual data analysis:
- classification
- visualization: dimension reduction / projection
  (useful for a good understanding and presentation of classification/clustering results)


Is this course a Machine Learning Course?

CAVEAT / REMINDER

- NLP makes use of Machine Learning (as Image Processing would, for instance), but good results require:
  - good preprocessing;
  - good data (to learn from) and relevant annotations;
  - a good understanding of the pros/cons, features, outputs, results, ...

→ The goal of this course is to provide you with specific knowledge about NLP.
→ The goal of this lecture is to make some links between general ML and NLP; this lecture is worth deepening with a real ML course.


Introduction: Data Analysis

WHAT does Data Analysis consist of?

"to represent the (statistical) information in a vivid and intelligible manner, simplifying and summarizing it in diagrams" [L. Lebart]

→ classification: regrouping in the original space
→ visualization: projection into a low-dimensional space

Classification/clustering consists in regrouping objects into categories/clusters (i.e. subsets of objects).
Visualization consists in displaying the internal structures of the data (here, documents) in an intelligible way.
The two are complementary.


Contents

1. Classification
   1.1 Framework
   1.2 Methods (in general), with a presentation of a few methods
   1.3 Evaluation
2. Visualization
   2.1 Introduction
   2.2 Principal Component Analysis (PCA)
   2.3 Multidimensional Scaling


Supervised/unsupervised

Classification can be:
- supervised (the strict meaning of "classification"): the classes are known a priori; they are usually meaningful for the user;
- unsupervised (called clustering): the clusters are based on the inner structure of the data (e.g. neighborhoods); their meaning is much more dubious.

Textual Data Analysis: relate documents (or words) so as to structure them (supervised) / discover their structure (unsupervised).


Classify what?

WHAT is to be classified?

Starting point: a chart of numbers representing, in one way or another, a set of objects:
- continuous values
- contingency tables: co-occurrence counts
- presence/absence of attributes
- distances/(dis)similarities (a square symmetric chart)

→ $N$ "row" objects (or "observations") $x^{(i)}$, each characterized by $m$ "features" (columns) $x^{(i)}_j$

Two complementary points of view:
1. $N$ points in $\mathbb{R}^m$
2. $m$ points in $\mathbb{R}^N$

Not necessarily with the same metrics: object similarity vs. feature similarity.


Classify what? (cont.)

[Figure: an $N \times m$ objects-by-features matrix; entry $x^{(i)}_j$ gives the "importance" of feature $j$ for object $i$.]


Textual Data Classification

- What is classified?
  - authors (1 object = several documents)
  - documents
  - paragraphs
  - "words" (/tokens) (vocabulary studies, lexicometry)
- How are the objects represented?
  - document indexing
  - choice of the meaningful textual units
  - choice of the metric/similarity

→ preprocessing: "unsequentialize" the text, suppress (meaningless) lexical variability

Frequently: rows = documents, columns = "words" (tokens, words, n-grams); see the sketch below.
→ these two "visions" (rows/columns) are complementary.
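To make the "rows = documents, columns = words" setting concrete, here is a minimal sketch (not from the original slides; it assumes a recent scikit-learn is installed, and the toy corpus is made up):

    # Build a documents-by-terms count matrix (N rows, m columns).
    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "the cat sat on the mat",
        "the dog ate my homework",
        "my cat likes the dog",
    ]
    vectorizer = CountVectorizer()        # tokenizes and "unsequentializes" the text
    X = vectorizer.fit_transform(docs)    # sparse N x m matrix of counts
    print(vectorizer.get_feature_names_out())  # the m column "words"
    print(X.toarray())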


Textual Data Classification: Examples of applications

- Information Retrieval
- open-question surveys (polls)
- email classification/routing
- client surveys (complaint analysis)
- automated processing of ads
- ...


(Dis)similarity Matrix

Most classification techniques use distance measures or (dis)similarities: the matrix of the distances between all pairs of data points, i.e. $N(N-1)/2$ values (symmetric, with a null diagonal).

A distance satisfies:
1. $d(x,y) \geq 0$, and $d(x,y) = 0 \iff x = y$
2. $d(x,y) = d(y,x)$
3. $d(x,y) \leq d(x,z) + d(z,y)$

A dissimilarity satisfies properties 1 and 2 only.


Some of the usual metrics/similarities

- Euclidean:
  $d(x,y) = \sqrt{\sum_{j=1}^{m} (x_j - y_j)^2}$

- generalized (Minkowski, $p \in [1, \infty)$):
  $d_p(x,y) = \left(\sum_{j=1}^{m} |x_j - y_j|^p\right)^{1/p}$

- $\chi^2$:
  $d(x,y) = \sum_{j=1}^{m} \lambda_j \left(\frac{x_j}{\sum_{j'} x_{j'}} - \frac{y_j}{\sum_{j'} y_{j'}}\right)^2$, where $\lambda_j = \frac{\sum_i \sum_j u_{ij}}{\sum_i u_{ij}}$ depends on some reference data $(u_i,\ i = 1..N)$; see the sketch below.
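A minimal NumPy sketch of these metrics (illustrative, not from the slides; u is an assumed N x m reference matrix providing the chi-squared weights):

    import numpy as np

    def minkowski(x, y, p=2):
        # p = 2 gives the Euclidean distance
        return (np.abs(x - y) ** p).sum() ** (1.0 / p)

    def chi2(x, y, u):
        # lambda_j = (total sum of u) / (column sum of u), as in the formula above
        lam = u.sum() / u.sum(axis=0)
        return (lam * (x / x.sum() - y / y.sum()) ** 2).sum()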


Some of the usual metrics/similarities (cont.)

- cosine (similarity):
  $S(x,y) = \frac{\sum_{j=1}^{m} x_j y_j}{\sqrt{\sum_j x_j^2}\,\sqrt{\sum_j y_j^2}} = \frac{x}{\|x\|} \cdot \frac{y}{\|y\|}$

- for probability distributions:
  - KL divergence:
    $D_{KL}(x,y) = \sum_{j=1}^{m} x_j \log\left(\frac{x_j}{y_j}\right)$
  - Jensen-Shannon divergence:
    $JS(x,y) = \frac{1}{2}\left(D_{KL}\left(x, \frac{x+y}{2}\right) + D_{KL}\left(y, \frac{x+y}{2}\right)\right)$
  - Hellinger distance:
    $d(x,y) = d_{\mathrm{Euclid}}(\sqrt{x}, \sqrt{y}) = \sqrt{\sum_{j=1}^{m} (\sqrt{x_j} - \sqrt{y_j})^2}$
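A minimal NumPy sketch of these similarities/divergences (illustrative; for the last three, x and y are assumed to be probability vectors with y_j > 0 wherever x_j > 0):

    import numpy as np

    def cosine(x, y):
        return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

    def kl(x, y):
        nz = x > 0                      # convention: 0 * log(0/q) = 0
        return (x[nz] * np.log(x[nz] / y[nz])).sum()

    def jensen_shannon(x, y):
        mid = (x + y) / 2
        return 0.5 * (kl(x, mid) + kl(y, mid))

    def hellinger(x, y):
        return np.linalg.norm(np.sqrt(x) - np.sqrt(y))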


Computational Complexity

Complexities vary (they depend on the method), but typically:
- $N(N-1)/2$ distances to compute,
- $m$ operations per single distance,

→ a complexity in $O(m \cdot N^2)$.

Costly: $m \simeq 10^3$ and $N \simeq 10^4$ already give $\simeq 10^{11}$ operations!


Classification as a mathematical problem

- supervised:
  - function approximation: $f(x_1, ..., x_m) = C_k$
  - distribution estimation: $P(C_k | x_1, ..., x_m)$ or $P(x_1, ..., x_m | C_k)$
    - parametric: multi-Gaussian, maximum likelihood, Bayesian inference, discriminant analysis
    - non-parametric: kernels, K nearest neighbors, LVQ, neural nets (Deep Learning, SVM)
  - inference: if $x_i = ...$ and $x_j = ...$ (etc.) then $C = C_k$ → decision trees
- unsupervised (clustering):
  - (local) minimization of a global criterion over the data set


Many different classification methods

How to choose? → several criteria.

Task specification:
- supervised vs. unsupervised
- hierarchical vs. non-hierarchical
- overlapping vs. non-overlapping (partition)

Model choices:
- generative models ($P(X,Y)$) vs. discriminative models ($P(Y|X)$)
- parametric vs. non-parametric (= many parameters)
- linear methods (Statistics), trees (GOFAI), neural networks


Classification methods: examples

- supervised:
  - Naive Bayes
  - K-nearest neighbors
  - ID3 / C4.5 (decision trees)
  - kernels, Support Vector Machines (SVM)
  - Gaussian mixtures
  - neural nets: Deep Learning, SVM, MLP, Learning Vector Quantization
  - ...
- unsupervised:
  - K-means
  - dendrograms
  - minimum spanning tree
  - neural net: Kohonen's Self-Organizing Maps (SOM)
  - ...

→ The question you should ask yourself: what is the optimized criterion?


Bayesian approach

Probabilistic modeling: the classification is made according to $P(C_k|x)$: an object $x^{(i)}$ is classified in the category

$\mathrm{argmax}_{C}\; P(C \mid x = x^{(i)})$

Discriminative: model $P(C_k|x)$ directly.

Generative: assume we know $P(C_k)$ and $P(x|C_k)$, then use the Bayes formula:

$P(C \mid x = x^{(i)}) = \frac{P(x = x^{(i)} \mid C) \cdot P(C)}{P(x = x^{(i)})} = \frac{P(x^{(i)} \mid C) \cdot P(C)}{\sum_{C'} P(C') \cdot P(x^{(i)} \mid C')}$

$P(C)$: the "prior"; $P(C|x)$: the "posterior"; $P(x|C)$: the "likelihood".

In practice, these distributions are hardly ever known. All the difficulty consists in "learning" (estimating) them from samples, under several hypotheses.


Naive Bayes

A supervised, generative, probabilistic (non-overlapping) model:
- the classification is made using the Bayes formula;
- $P(C)$ is estimated directly on typical examples;
- what is "naive" in this approach is the computation of $P(x|C)$.

Hypothesis: feature independence:

$P(x|C) = \prod_{j=1}^{m} p(x_j|C)$

The $p(x_j|C)$ (a priori much fewer than the $P(x|C)$) are estimated on typical examples (the learning corpus).

In the case of textual data: features = indexing terms (e.g. lemmas); see the sketch below.

→ This hypothesis is most certainly wrong, but good enough in practice.
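A minimal sketch of Naive Bayes text classification (illustrative, not from the slides; it assumes scikit-learn, and the toy corpus and labels are made up):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    train_docs = ["cheap pills buy now", "meeting agenda attached",
                  "win money now", "project review next week"]
    train_labels = ["spam", "ham", "spam", "ham"]

    vec = CountVectorizer()
    X = vec.fit_transform(train_docs)   # features = indexing terms
    clf = MultinomialNB()               # estimates P(C) and the p(x_j | C)
    clf.fit(X, train_labels)
    print(clf.predict(vec.transform(["buy cheap pills"])))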


(multinomial) Logistic regression

A supervised, discriminative, probabilistic (non-overlapping) model: directly model $P(C|x)$ as

$P(C|x) = \frac{\exp\left(\sum_{j=1}^{m} w_{C,j}\, x_j\right)}{\sum_{C'} \exp\left(\sum_{j'=1}^{m} w_{C',j'}\, x_{j'}\right)}$

where $w_{C,j}$ is a parameter, the "weight" of $x_j$ for class $C$ ($x_j$ being here some numerical representation of the $j$-th indexing term: 0-1, frequency, log-normalized, ...).

The parameters $w_{C,j}$ can be learned using various approximation algorithms (e.g. iterative or batch: stochastic gradient, IRLS, L-BFGS, ...), for instance:

$w_{C,j}^{(t+1)} = w_{C,j}^{(t)} + \alpha \left(\delta_{C,C_n} - P(C|x_n)\right) x_{n,j}$

with $\alpha$ a learning parameter (step strength/speed) and $\delta_{C,C_n}$ the Kronecker delta between class $C$ and the expected class $C_n$ of sample input $x_n$. A sketch of this update follows.
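A minimal NumPy sketch of one such update step (illustrative; W is an assumed K x m weight matrix, c_n the index of the expected class):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())          # numerically stable softmax
        return e / e.sum()

    def sgd_step(W, x, c_n, alpha=0.1):
        p = softmax(W @ x)               # P(C | x_n) for the K classes
        delta = np.zeros(W.shape[0])
        delta[c_n] = 1.0                 # Kronecker delta for the expected class
        return W + alpha * np.outer(delta - p, x)   # the update rule above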


K nearest neighbors – Parzen window

A non-hierarchical, non-overlapping classification.

K nearest neighbors: very simple: classify a new object according to the majority class among its K nearest neighbors (a vote). There is no learning phase; see the sketch below.

Parzen window: same idea, but the votes are weighted according to the distance to the new object.

[Figure: the vote weight as a decreasing function of the distance.]
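A minimal NumPy sketch of the K-nearest-neighbors vote (illustrative):

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_new, K=3):
        d = np.linalg.norm(X_train - x_new, axis=1)   # distances to all objects
        nearest = np.argsort(d)[:K]                   # the K nearest neighbors
        votes = Counter(y_train[i] for i in nearest)
        return votes.most_common(1)[0][0]             # majority class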


Dendrograms

Dendrograms implement a bottom-up hierarchical clustering. The algorithm starts from a distance chart between the N objects:

1. group the two closest "elements" into one cluster, and consider the new cluster as a new element;
2. compute the distances between this new element and the others;
3. loop to step 1 while there is more than one element.

→ a representation in the form of a binary tree.

Complexity: $O(N^2 \log N)$


Dendrograms: "linkage" scheme (1/2)"regroup the two closest elements" + closest?

Two questions:1. How to define the distance between two clusters (two sets of elements)?

(based on the distances between the elements)

d(A,B) = ?

AB

2. How to (efficiently) compute distance between a former cluster and a (new) mergeof two clusters?(based on the former distances between clusters)

d(C,(A+B)) = ?

C

AB

Textual Data Analysis – 23 / 48


Dendrograms: "linkage" scheme (2/2)

"regroup the two closest elements" + closest?

Let A and B be two subclusters: what is their distance? (Lance-Williams algorithm)

methoddefinition mergingD(A,B) = D(A∪B,C) =

single linkage: minx∈A,y∈B

d(x ,y) min(

D(A,C),D(B,C))

complete linkage: maxx∈A,y∈B

d(x ,y) max(

D(A,C),D(B,C))

average linkage:1

|A| · |B| ∑x∈A,y∈B

d(x ,y)|A| ·D(A,C) + |B| ·D(B,C)

|A|+ |B|

Textual Data Analysis – 24 / 48
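These linkage schemes are available in SciPy's hierarchical-clustering module; a minimal sketch (illustrative; assumes SciPy, and drawing the tree additionally needs Matplotlib):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram

    X = np.random.rand(10, 4)             # 10 toy objects with 4 features
    Z = linkage(X, method="average")      # or "single", "complete"
    dendrogram(Z)                         # draws the binary-tree representation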


K-means

A non-hierarchical, non-overlapping clustering:
1. choose the number of clusters K a priori;
2. randomly draw K objects as cluster representatives ("cluster centers");
3. partition the objects with respect to the K centers (each object goes to the closest center);
4. recompute the K centers as the mean of each cluster;
5. loop to step 3 until convergence (or any other stopping criterion).

A code sketch follows.
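A minimal NumPy sketch of the loop above (illustrative; a toy version that does not handle empty clusters):

    import numpy as np

    def kmeans(X, K, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=K, replace=False)]      # step 2
        for _ in range(n_iter):
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = d.argmin(axis=1)                               # step 3
            new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
            if np.allclose(new_centers, centers):                   # step 5
                break
            centers = new_centers                                   # step 4
        return labels, centers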


K-means (2): example with K = 2

[Figure sequence: random choice of the initial "means" → assignment of classes → re-computation of the means → re-assignment of classes → ... and so on, until convergence.]


K-means (3)

Cluster representatives: the mean (center of gravity): $R_k = \frac{1}{N_k} \sum_{x \in C_k} x$

→ The algorithm is convergent because the intra-class variance can only decrease:

$v = \sum_{i=1}^{K} \sum_{x \in C_i} p(x)\, d(x, R_i)^2$

($p(x)$: probability of the objects)

BUT it converges to a local minimum; improvements:
- stable clusters
- Deterministic Annealing

Other methods similar to K-means:
- having several representatives per cluster
- recomputing the representatives at each assignment of an individual
- choosing the representatives among the objects themselves (K-medoids)


About Word Embeddings & Deep Learning

"Word embedding":
- a numerical representation of words (see the "Information Retrieval" lecture)
- a.k.a. "semantic vectors", "distributional semantics"
- objective: relative similarities between representations correlate with the syntactic/semantic similarity of the words/phrases
- two key ideas:
  1. representation(composition of words) = vectorial-composition(representations(words)); for instance: representation(document) = $\sum_{word \in document}$ representation(word) (see the sketch below)
  2. remove sparseness, compactify the representation: dimension reduction
- embeddings have been around for a long time (renewed these days by the "deep learning buzz"):
  Harris, Z. (1954), "Distributional structure", Word 10(23):146-162.
  Firth, J.R. (1957), "A synopsis of linguistic theory 1930-1955", Studies in Linguistic Analysis, pp. 1-32.
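A minimal sketch of key idea 1 (illustrative; the embedding table here is a random toy, not a trained one):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = "the cat sat on mat".split()
    embedding = {w: rng.normal(size=8) for w in vocab}   # toy word vectors

    def doc_vector(doc):
        # representation(document) = sum of the representation(word)
        return sum(embedding[w] for w in doc.split())

    print(doc_vector("the cat sat"))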


Word embeddings: different techniques

"Many recent publications (and talks) on word embeddings are surprisingly oblivious of the large body of previous work [...]"
(from https://www.gavagai.se/blog/2015/09/30/a-brief-history-of-word-embeddings/)

Main techniques:
- co-occurrence matrices, often reduced (LSI, Hellinger-PCA)
- probabilistic/distributional models (DSIR, LDA)
- shallow (Mikolov) or deep-learning neural networks

There are theoretical and empirical correspondences between these different models [see e.g. Levy, Goldberg and Dagan (2015); Pennington et al. (2014); Österlund et al. (2015)].


About Deep Learning

- there is NO need for deep learning to get good word embeddings
- not all neural network (NN) models are deep learners
- models: convolutional NNs (CNN) or recurrent NNs (RNN, incl. LSTM)
- they still suffer from the same old problems: overfitting and computational cost

A final word, from Michael Jordan (IEEE Spectrum, 2014):
"deep learning is largely a rebranding of neural networks, which go back to the 1980s. They actually go back to the 1960s; it seems like every 20 years there is a new wave that involves them. In the current wave, the main success story is the convolutional neural network, but that idea was already present in the previous wave."

Why such a rebirth now?
→ many more data (user-data pillage) and more computational power (GPUs).


About embeddings: some references

Some software: word2vec, GloVe, TensorFlow, gensim, MALLET, http://www.wordvectors.org/

Some papers:
- O. Levy, Y. Goldberg and I. Dagan (2015), "Improving Distributional Similarity with Lessons Learned from Word Embeddings", Transactions of the ACL, vol. 3, pp. 211-225.
- Österlund et al. (2015), "Factorization of Latent Variables in Distributional Semantic Models", Proc. EMNLP.
- J. Pennington, R. Socher and C. D. Manning (2014), "GloVe: Global Vectors for Word Representation", Proc. EMNLP.
- T. Mikolov et al. (2013), "Distributed Representations of Words and Phrases and their Compositionality", Proc. NIPS.
- R. Lebret and R. Collobert (2014), "Word Embeddings through Hellinger PCA", Proc. EACL.

→ more about this topic in two weeks


Classification: evaluation

- classification (supervised): evaluation is "easy" → a test corpus (some known samples kept for testing only)
- clustering (unsupervised): objective evaluation is more difficult: what are the criteria?

(Supervised) classification, as a REMINDER (see the "Evaluation" lecture):
- check the IAA (inter-annotator agreement), if possible
- measure the misclassification error on the test corpus
  → really separated from the learning set (and also from the validation set, if any)!
  → criteria: confusion matrix, error rate, ...
- is the difference in the results statistically significant?


Clustering (unsupervised learning) evaluation

There is no absolute scheme with which to evaluate clustering, but a variety of ad hoc measures from diverse areas/points of view.

For K non-overlapping clusters (with objects having a probability p), standard measures include:

Intra-cluster variance (to be minimized): $v = \sum_{k=1}^{K} \sum_{x \in C_k} p(x)\, d(x, \bar{x}_k)^2$

Inter-cluster variance (to be maximized): $V = \sum_{k=1}^{K} \underbrace{\Bigl(\sum_{x \in C_k} p(x)\Bigr)}_{=\,p(C_k)} d(\bar{x}_k, \bar{x})^2$

where $\bar{x}_k$ is the center of cluster $C_k$ and $\bar{x}$ the global center (a sketch of both follows).

The best way is to think about how you want to assess the quality of a clustering w.r.t. your needs: usually, high intra-cluster similarity and low inter-cluster similarity (but what does "similar" mean?...).

Another way is a manual evaluation of the clustering.
Note: if you already have a gold standard with classes, why not use (supervised) classification in the first place?? (rather than using a supervised corpus to assess unsupervised methods...)
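A minimal NumPy sketch of both criteria (illustrative; it assumes a uniform p(x) = 1/N and the squared Euclidean distance):

    import numpy as np

    def intra_inter(X, labels):
        N = len(X)
        g = X.mean(axis=0)                            # global center
        intra, inter = 0.0, 0.0
        for k in np.unique(labels):
            Ck = X[labels == k]
            ck = Ck.mean(axis=0)                      # cluster center
            intra += ((Ck - ck) ** 2).sum() / N       # sum of p(x) d(x, c_k)^2
            inter += (len(Ck) / N) * ((ck - g) ** 2).sum()   # p(C_k) d(c_k, g)^2
        return intra, inter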


“Visualization”

Visualize: project/map the data in 2D or 3D.

More generally, the techniques presented in this section lower the dimension of the data:
→ go from N-D to n-D with n < N, or even n ≪ N
→ this usually means going from a sparse to a dense representation.

visualization: projection into a low-dimensional space
classification: regrouping in the original space

Which one to start with depends on your data/application (you can even loop between the two): the two are complementary.


Several approaches

- Simple methods (but poorly informative): ordered lists, "thermometer-like" displays, histograms
- Some of the classification methods can be used: use/display the classes, e.g. dendrograms with the minimum spanning tree
- Linear and non-linear projections/mappings
  (projection: into a subspace of the original space; mapping: into some other space)

[Figure: projection onto a subspace of the original space vs. mapping into a separate target space.]


Several Representation Criteria

A good visualization technique combines several representation criteria:
- positions (relative, absolute): by far the most used criterion, but do not forget the others!
- colors
- shapes
- others... (cf. Chernoff's faces)


Linear projections

Projections onto selected subspaces of the original space:

- Principal Component Analysis (PCA) [Pearson 1901]:
  object-feature chart (continuous values);
  feature similarity: correlations;
  object similarity: distance in the feature space
- Correspondence Analysis:
  contingency tables;
  row/column symmetry (features);
  $\chi^2$ metric

→ singular value decomposition


Principal Component Analysis (PCA)

Input: a matrix M of objects (rows) by features (columns), of size $N \times m$ with $N > m$, centered: $M_{i\bullet} = x^{(i)} - \bar{x}$.

Singular value decomposition (SVD) of M, i.e. (up to squaring the singular values) the eigenvalue decomposition of $M^t M$, the covariance matrix multiplied by $(N-1)$:

$M = U\,\Lambda\,V^t$

with $\Lambda$ diagonal and ordered ($\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_m \geq 0$), $U$ of size $N \times m$ with orthogonal columns, and $V$ orthogonal, of size $m \times m$.


PCA (2)

The "principal components" are the columns of $MV$ (or of $V$); a code sketch follows.
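A minimal NumPy sketch of PCA via SVD (illustrative, with toy random data):

    import numpy as np

    X = np.random.rand(100, 5)               # N = 100 objects, m = 5 features
    M = X - X.mean(axis=0)                   # center the rows
    U, s, Vt = np.linalg.svd(M, full_matrices=False)   # M = U Lambda V^t
    components = M @ Vt.T                    # principal components = columns of M V
    explained = s**2 / (s**2).sum()          # spectrum, used for the scree plot below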


PCA (3)

Projection into a low-dimensional space:

$\tilde{M} = U_q\,\Lambda_q\,V_q^t$

with $q < m$, the subscript-$q$ matrices being reduced to the first $q$ singular values only.

$\tilde{M}$ is the best rank-$q$ approximation of $M$, "best" w.r.t. several criteria: L2 norm, biggest variance (trace and determinant of the covariance matrix), Frobenius norm, ...

This means that the subspace of the first $q$ principal components is the best linear approximation of dimension $q$ of the data, "best" in the sense of the distance between the original data points and their projections.


PCA (4): how to choose the dimension q?

- sometimes imposed by the application (e.g. q = 2 or 3 for visualization)
- otherwise, make use of the spectrum:
  - simple: choose q where there is a "big step" in the $\lambda_i / \sum_j \lambda_j$ plot (a.k.a. "Cattell's scree plot" or "explained variance"):

    [Figure: example scree plot of $\lambda_i / \sum_j \lambda_j$ against i.]

  - advanced: see Tom Minka, "Automatic choice of dimensionality for PCA", NIPS, 2000. https://tminka.github.io/papers/pca/


PCA (5)

A simple and efficient approximation method using subspaces (i.e. linear manifolds).

Weaknesses:
1. it is a linear method (precisely what makes it easy to use!);
2. since the method maximizes the (co)variance, it is strongly dependent on the measurement units used for the features.

In practice, except when the variance is really what has to be maximized, the data are renormalized beforehand: it is then the correlation matrix that is decomposed, rather than the (co)variance matrix.


"Projection Pursuit"

Linear projection methods onto a low-dimensional space (1, 2 or 3) that maximize another criterion than the (co)variance.

→ No analytic solution: numerical optimization (iteration and local convergence)
⇒ the criterion has to be easily computable.

Several possible criteria: entropy, dispersion, higher moments (> 2), divergence from the normal distribution, ...


Linear vs. non-linear

[Figure: the same data reduced with PCA (linear) and with a non-linear method.]


Non-linear Methods

I "principal curve" [Hastie & Stuetzle 89]I ACC (neural net) [Demartines 94]I Non-linear PCA (NLPCA) [Karhunen 94]I Kernel PCA [Schölkopf, Smola, Müller 97]I Gaussian process latent variable models (GPLVM) [Lawrence 03]

Textual Data Analysis – 45 / 48


Multidimensional Scaling (MDS)

MDS uses the chart of distances/dissimilarities between the objects.

Sammon mapping criterion:

$C(d, \tilde{d}) = \sum_{x \neq y} \frac{\bigl(d(x,y) - \tilde{d}(\tilde{x}, \tilde{y})\bigr)^2}{d(x,y)} = \sum_{x \neq y} \mathrm{weight}(x,y) \cdot \mathrm{error}(x,y)$

where $d$ is the dissimilarity in the original object space and $\tilde{d}$ the dissimilarity in the projection space (e.g. Euclidean), object $x$ being mapped to $\tilde{x}$.

→ a more accurate representation of the objects that are close.

More recent alternatives (see the sketch below):
- t-SNE (t-distributed Stochastic Neighbor Embedding)
  [L.J.P. van der Maaten and G.E. Hinton, "Visualizing High-Dimensional Data Using t-SNE", Journal of Machine Learning Research 9(Nov):2579-2605, 2008.]
- UMAP (Uniform Manifold Approximation and Projection)
  [L. McInnes, J. Healy, N. Saul and L. Grossberger, "UMAP: Uniform Manifold Approximation and Projection", Journal of Open Source Software 3(29):861, 2018.]
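A minimal sketch of such a 2-D mapping with t-SNE (illustrative; assumes scikit-learn, with toy random data):

    import numpy as np
    from sklearn.manifold import TSNE

    X = np.random.rand(200, 50)               # toy high-dimensional data
    X2d = TSNE(n_components=2).fit_transform(X)
    print(X2d.shape)                           # (200, 2): ready to scatter-plot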


Keypoints

- There are many classification/clustering techniques (coming from different fields). Know their main characteristics and criteria, and know at least two methods (e.g. Naive Bayes and K-means) that can be useful as baselines in any case.
- The a priori choice of "the best method" is not easy:
  → define well what you are looking for and what means (time, samples, ...) you have access to.
- It is even more difficult for textual data ⇒ preprocessing is really essential (lemmatization, parsing, ...).
- Pay attention to using a proper methodology: a good evaluation protocol, statistical tests, ...
- Classification/clustering and projection methods are complementary in (textual) data analysis.
- Use several representation/classification criteria.
- Visualization: focus on usefulness first. What does it bring/show to the user? How is it useful? Pay attention not to overwhelm the user...


References

- F. Sebastiani, "Machine Learning in Automated Text Categorization", ACM Computing Surveys, 34(1):1-47, 2002.
- C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
- B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, 1996.
- V. Vapnik, The Nature of Statistical Learning Theory, Springer, 2000.
- B. Schölkopf & A. Smola, Learning with Kernels, MIT Press, 2002.