
To t-SNE or not to t-SNE: An Examination of t-SNE’s Perplexities

Tristan Anacker

Department of Mathematical Sciences

Montana State University

May 3, 2019

A writing project submitted in partial fulfillment of the requirements for the degree

Master of Science in Statistics


APPROVAL

of a writing project submitted by

Tristan Anacker

This writing project has been read by the writing project advisor and has been found to be satisfactory regarding content, English usage, format, citations, bibliographic style, and consistency, and is ready for submission to the Statistics Faculty.

May 3, 2019    Mark Greenwood, Writing Project Advisor

May 3, 2019    Mark Greenwood, Writing Projects Coordinator


Abstract

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a dimension reduction technique used to visualize high dimensional data. In applying the t-SNE algorithm, the user is required to input a parameter named perplexity. t-SNE visualizations differ depending on the perplexity input, yet prior suggestions for using t-SNE provide only rough guidelines for selecting the input value. This paper explores one method of selecting the perplexity input value: minimizing the Kullback-Leibler (KL) divergence. To do so, three simulated data sets with varying degrees of high dimensional structure and a subset of the MNIST data set are explored. Visual inspection is used to compare t-SNE maps created using the minimization of the KL divergence method to maps chosen by manual selection. Through this analysis it is shown that minimizing the Kullback-Leibler divergence does not necessarily lead to the best t-SNE map, either at a given perplexity or across perplexities.

I. Introduction

The importance of statistics lies in its ability to extract information in the face of uncertainty. Traditionally, science has relied upon natural observation and theory to create questions and subsequent hypotheses. Following that, data are gathered and analyzed through statistical methods to arrive at a conclusion. Thus, the value of statistics has been based on its inferential abilities. This model of science, while important, is quickly being circumvented by a new approach, one where both the answers and the questions are data driven. When this new model is properly applied, one set of data spawns a question and another set is used to formulate conclusions. The exploratory techniques of this statistical paradigm have merit in their ability to extract these questions, as they are intended to gain insight into data. Such techniques will become increasingly relied upon as time progresses and the amount of quantifiable information grows. Today’s world is a vast network of data generators, where technology has transformed most actions of everyday life into data that need to be explored.

The size (number of observations, N) and dimension (number of measured attributes, Q) of data sets are only going to increase as time progresses. Increasing the size of a data set creates issues in terms of computing power, whereas when dimension increases, an entirely new set of challenges presents itself. These problems are often summarized as the curse of dimensionality. The curse of dimensionality, in lay terms, states that things do not act as expected in high dimensions. To illustrate this peculiar phenomenon, consider a circle and a sphere, both with the same radius. Which has the larger volume? Clearly the sphere. This would intuitively suggest that if the radius were held constant, as the dimension of a circle increases, the volume would also increase. This, however, is not the case. Examining the equation for the volume of a hypersphere, $V_d(R) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)} R^d$, where $d$ is the dimension of the sphere and $R$ is the radius, shows that if the radius is held constant at 1, the volume of the hypersphere is maximized at a dimension of 5. As the dimension increases further, the volume decreases (Zaki and Meira, 2014). The curse of dimensionality does not stop there, but creates a host of problems that must be overcome. One way to handle such issues is to use dimension reduction techniques, which, as the name suggests, reduce the dimensionality of the data set being analyzed.


The focus of this paper is on one dimension reduction technique, t-Distributed Stochastic Neighbor Embedding (t-SNE). t-SNE was first introduced by van der Maaten and Hinton (2008). The authors originally intended t-SNE to be used as a tool for visualizing high dimensional data in a lower dimensional projection, but the technique has been extended to other applications including classification and image retrieval (Xie et al., 2011). t-SNE has become a vastly popular technique, evident by the 7,498 citations the original paper has on Google Scholar as of 4/6/2019. However, this does not necessarily mean it operates without any issues across all these applications. Along with describing the inner workings of t-SNE, this paper will explore one of the more interesting, and less well-explained, aspects of t-SNE: the user defined parameter, perplexity. The method of choosing an input value for perplexity is not explained well; rather, it is stated that the “performance of SNE [and in turn t-SNE] is fairly robust to changes in the perplexity” (van der Maaten and Hinton, 2008). However, in some applications the choice of perplexity will lead to differing results. This paper will explore one potential method for choosing the perplexity input: selecting perplexity based on minimizing the Kullback-Leibler divergence.

To accomplish these tasks this paper contains the following sections: a general explanation of the t-SNE algorithm; a brief comparison to another dimension reduction technique; a description of the data used in this analysis; the methodology behind the procedures that were used to examine the perplexity parameter; and finally, the results of the analysis and a discussion of both the perplexity parameter and t-SNE in general.

II. t-SNE Algorithm

The intent of t-SNE is to produce a low dimensional projection that retains the local structure of the high dimensional data that are being projected. That is, t-SNE aims to produce a visualization in either two or three dimensions that retains the groupings or other structures in the data that were present in the high dimensional space (van der Maaten and Hinton, 2008). In general, the t-SNE algorithm can be broken up into four rough steps. First, it randomly projects the high dimensional data points into either two or three dimensional space. Then, similarities between each pair of points within both the high dimensional space and the low dimensional space are calculated. Once the similarities are obtained, the associated joint distributions that are formed by these similarities are compared using Kullback-Leibler divergence. Finally, gradient descent is used to adjust the lower dimensional projections with the aim of minimizing the Kullback-Leibler divergence between the joint distributions associated with the high and low dimensional similarities (van der Maaten and Hinton, 2008).

An optional (though generally used) initialization step before running t-SNE is to run principal component analysis (PCA) to reduce the number of dimensions in the data. In doing so, some of the noise in the data is removed and calculation of the distances between each point is sped up (van der Maaten and Hinton, 2008). In the experiments run in van der Maaten and Hinton (2008), all data were preprocessed using PCA to reduce the number of dimensions to 30. Then the associated PC scores for the 30 dimensions were extracted and used as variables in their application of t-SNE. In the R package used here (Krijthe, 2015), this option is controlled through the pca argument.
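A minimal sketch of such a call with the Rtsne package is shown below; the matrix X is simulated purely for illustration, and initial_dims is the Rtsne argument that controls how many principal components are retained when pca = TRUE.

library(Rtsne)

# Simulated stand-in for a high dimensional data matrix (100 observations, 10 variables)
X <- matrix(rnorm(100 * 10, mean = 50, sd = 5), nrow = 100)

# pca = TRUE requests PCA preprocessing; theta = 0 runs the exact (non-approximated)
# algorithm; perplexity is the user defined input examined in this paper
tsne_fit <- Rtsne(X, pca = TRUE, perplexity = 30, theta = 0)
head(tsne_fit$Y)  # two dimensional map coordinates, one row per observation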

To calculate the similarities between the high dimensional data points, a Gaussian similarity of the form

$$p_{j|i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)}$$

is used (van der Maaten, 2014). The $p_{j|i}$ can be interpreted as the probability that point $i$ is the closest neighbor, in terms of Euclidean distance, to point $j$ in the high dimensional space (van der Maaten and Hinton, 2008). To simplify the optimization step, a symmetric restriction is imposed such that $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$, where $N$ is the size of the data. It is also assumed that $p_{i|i} = 0$ (van der Maaten, 2014). Of special importance to this paper is the variance term $\sigma_i^2$, which is calculated by using a binary search that involves the user defined perplexity parameter (van der Maaten and Hinton, 2008).
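A small illustrative sketch of these high dimensional similarities follows, assuming the variances sigma_i^2 have already been determined; the helper name high_dim_p is hypothetical and not part of the Rtsne package.

# Sketch (not the package's internal code) of the high dimensional similarities:
# given a data matrix X and a vector sigma2 of per-point variances, compute the
# conditional probabilities p_{j|i} and symmetrize them into p_{ij}.
high_dim_p <- function(X, sigma2) {
  D2 <- as.matrix(dist(X))^2            # squared Euclidean distances ||x_i - x_j||^2
  N  <- nrow(X)
  P_cond <- matrix(0, N, N)
  for (i in 1:N) {
    w <- exp(-D2[i, ] / (2 * sigma2[i]))
    w[i] <- 0                           # p_{i|i} = 0 by assumption
    P_cond[i, ] <- w / sum(w)           # normalize over k != i
  }
  (P_cond + t(P_cond)) / (2 * N)        # symmetric joint probabilities p_{ij}
}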

To calculate the similarities between the projected low dimensional data points, a Cauchy distribution (a t-distribution with 1 degree of freedom) of the following form is used:

$$q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}{\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}$$

(van der Maaten, 2014). The $q_{ij}$ can be interpreted as the probability that point $i$ is the closest neighbor, in terms of Euclidean distance in the low dimensional projection, to point $j$ (van der Maaten and Hinton, 2008). It is again assumed that $q_{ii} = 0$ (van der Maaten, 2014). A Cauchy is used in place of a Gaussian distribution in the lower dimensional similarity calculation to mitigate the potential for overcrowding that is present in the earlier version of t-SNE, named Stochastic Neighbor Embedding (SNE) (van der Maaten and Hinton, 2008). Since there is more area in the tails of a Cauchy distribution in comparison to a Gaussian, observations that are close together are more spread out, alleviating the overcrowding issue that was present in the prior version (van der Maaten, 2014).
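The low dimensional similarities can be sketched in the same way (low_dim_q is again an illustrative helper, not a package function):

# Sketch of the low dimensional similarities: a Cauchy (t with one degree of
# freedom) kernel applied to the current map coordinates Y, normalized over all pairs.
low_dim_q <- function(Y) {
  W <- 1 / (1 + as.matrix(dist(Y))^2)   # (1 + ||y_i - y_j||^2)^(-1)
  diag(W) <- 0                          # q_{ii} = 0 by assumption
  W / sum(W)                            # normalize over all pairs k != l
}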

Once the similarities have been calculated within both the high dimensional data and the associated lower dimensional projection, the joint distributions that these similarities define are compared using Kullback-Leibler divergence. Kullback-Leibler divergence is defined as $KL(P \Vert Q) = \sum_{i \neq j} p_{ij} \log\left(\frac{p_{ij}}{q_{ij}}\right)$ (van der Maaten and Hinton, 2008). By restricting the pairwise similarities to be symmetric, t-SNE differs in more than one way from its predecessor, SNE. The optimization can be done on a single Kullback-Leibler divergence that compares the joint distribution of the high dimensional similarities to the joint distribution of the low dimensional similarities. In comparison, SNE, which did not require the pairwise similarities to be symmetric, uses the sum of Kullback-Leibler divergences as the associated objective function (van der Maaten and Hinton, 2008).
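Continuing the sketch, the objective function then compares the two joint distributions; P and Q below are assumed to be the matrices produced by the illustrative helpers above.

# Kullback-Leibler divergence between the joint distributions P and Q;
# entries with p_ij = 0 contribute nothing to the sum.
kl_divergence <- function(P, Q) {
  keep <- P > 0
  sum(P[keep] * log(P[keep] / Q[keep]))
}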

Using the Kullback-Leibler divergence as the objective function, t-SNE iteratively updates the lower dimensional projection using gradient descent. The goal of the optimization is to make the joint distribution of the low dimensional similarities as close as possible to the joint distribution of the high dimensional similarities, thus minimizing the Kullback-Leibler divergence (van der Maaten and Hinton, 2008).

The user defined parameter perplexity is of particular interest to this paper. Mathematically, perplexity is $Perp(P_i) = 2^{H(P_i)}$, where $H(P_i)$, known as the Shannon entropy, is defined as $H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}$, and $P_i$ is the conditional probability distribution of observation $i$ given all other high dimensional points. Perplexity is thus set by the user, and a binary search is used to calculate each $\sigma_i^2$ that leads to a conditional probability distribution for each point $P_i$ that has the defined perplexity (van der Maaten and Hinton, 2008).
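A rough sketch of this search for a single point is given below. The function name, the bisection bounds, and the stopping rule are illustrative assumptions rather than the package's actual implementation; d2_i is assumed to hold the squared Euclidean distances from point i to every other point.

# Bisection on sigma_i^2 so that the Shannon entropy of P_i matches log2(perplexity).
sigma2_for_perplexity <- function(d2_i, perplexity, tol = 1e-5, max_iter = 100) {
  lo <- 1e-10; hi <- 1e10
  target <- log2(perplexity)
  s2 <- 1
  for (iter in 1:max_iter) {
    s2 <- (lo + hi) / 2
    p  <- exp(-d2_i / (2 * s2))
    p  <- p / sum(p)                              # conditional distribution P_i
    H  <- -sum(p[p > 0] * log2(p[p > 0]))         # Shannon entropy H(P_i)
    if (abs(H - target) < tol) break
    if (H > target) hi <- s2 else lo <- s2        # entropy grows with sigma_i^2
  }
  s2                                              # larger perplexity gives a larger sigma_i^2
}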


[Figure 1: panels (a) “t-SNE: MNIST, Perplexity: 5” and (b) “t-SNE: MNIST, Perplexity: 50”; axes are the X and Y map coordinates.]

Figure 1: t-SNE maps based on perplexities of 5 (a) and 50 (b) for a sample of n = 1000 observations from the MNIST data set.

As a rough interpretation of what perplexity is, van der Maaten and Hinton state that perplexity “can be interpreted as a smooth measure of the effective number of neighbors” (van der Maaten and Hinton, 2008).

They also state that “The performance of SNE [and in turn t-SNE] is fairly robust to changes in perplexity” (van der Maaten and Hinton, 2008). Upon further investigation, this statement does appear overstated. There is typically little qualitative difference between t-SNE maps produced using perplexities that are close or the same in large N data sets. However, not all choices of perplexity will lead to the same results. Figure 1a presents a t-SNE produced map with the perplexity set to 5, whereas Figure 1b presents a t-SNE map of the same data, using the same initial random projection, with perplexity set to 50. Inspection of the two plots yields the conclusion that the two maps differ from one another. This leads to the question of how to set perplexity such that the resulting map is an optimal representation of the true structure in the high dimensional data.

III. t-SNE Comparison

The t-SNE algorithm aims to create a lower dimensional representation of high dimensional data. These representations can be thought of as a map of the high dimensional data. As with all map projections, information is inherently lost as the number of dimensions is reduced. An example of this is the Mercator projection, which is responsible for a common two dimensional map of the earth. The Mercator projection, originally used for sea navigation, retains angles between points. However, it distorts the size of objects on the map. Specifically, land masses near the poles are greatly stretched, suggesting they are larger than they actually are, and land masses near the equator are compressed, suggesting they are smaller than they actually are (Mercator projection, 2019).

t-SNE aims to retain high dimensional structure by keeping points that are close in high dimensional space also close in the lower dimensional projection. However, t-SNE loses information such as the position of groups relative to one another. That is, how the groups are displayed in the low dimensional space does not necessarily indicate their position with respect to one another in the high dimensional space. In addition, t-SNE maps are arbitrary in reflection, rotation, and scale; these are characteristics shared by most dimension reduction techniques. Therefore, interpretation of these maps must take these aspects into account.

Aside from t-SNE’s predecessor, SNE, other algorithms exist that aim to produce lower dimensional maps of high dimensional data. Two classic techniques are principal component analysis (PCA) and non-metric multidimensional scaling (NMDS). PCA works by projecting the data into the desired dimension such that the resulting projection retains the maximum amount of variability on the new orthogonal axes. For visualization, the desired projection would include scores on the first two principal components. These PCs are the eigenvectors associated with the two largest eigenvalues of the correlation or covariance matrix of the Q variables, and this corresponds to the projection that retains the maximum variability (Zaki and Meira, 2014). Unlike t-SNE, which aims to keep points that are close in the high dimensional space close in the low dimensional space, PCA aims at keeping points that are far apart in the high dimensional space also far apart in the low dimensional projection (van der Maaten and Hinton, 2008). Figure 2a displays a t-SNE map of the MNIST data set with the perplexity defined as 35. Figure 2b displays a map created using the first two principal components of the MNIST data set. The map created using t-SNE correctly delineates some groupings in the data, whereas the map created using PCA does not differentiate any clear groups.
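A minimal sketch of the PCA comparison is shown below; the matrix X is a simulated placeholder for the MNIST pixel matrix (the Appendix uses princomp, while prcomp is used here for brevity).

# Project onto the first two principal components and check how much of the
# total variability those two components retain.
X <- matrix(rnorm(100 * 10), nrow = 100)   # placeholder for the pixel data
pca_fit   <- prcomp(X)
pc_scores <- pca_fit$x[, 1:2]              # scores on the first two PCs
summary(pca_fit)$importance["Cumulative Proportion", 2]
plot(pc_scores, xlab = "Principal Component 1", ylab = "Principal Component 2")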

IV. Data Description

To further investigate the performance of t-SNE, four data sets were examined. The first data set was simulated to contain no high dimensional structure, hereafter referred to as “No Structure”. The second data set was simulated to have near-perfect high dimensional structure, hereafter referred to as “Clear Structure”. The third data set was simulated to have a moderate degree of high dimensional structure, hereafter referred to as “Moderate Structure”. The last was a subsample of the classic MNIST data set (LeCun and Cortes, 1999), where 1,000 observations were randomly sampled from the original 60,000 observations, hereafter referred to as “MNIST”. The simulated data sets are used to display how changes in the perplexity parameter impact the resulting t-SNE map under controlled circumstances. Alternatively, the MNIST data set is used to display how changes in perplexity impact the resulting t-SNE map when the classes of the data are known, but not the exact high dimensional structure.



[Figure 2: panel (a) “t-SNE: MNIST, Perplexity: 35” with X and Y map coordinates; panel (b) “PCA: MNIST, First Two Principal Components” with Principal Component 1 and Principal Component 2 axes.]

Figure 2: t-SNE map (a) based on a perplexity of 35 and a PCA map (b) using the first two principal components. Both are for a sample of 1000 observations from the MNIST data set. The first two principal components retain 17.63% of the variation in the original 784 variables.


[Figure 3: modified parallel coordinates plot titled “No Structure”; x-axis Dimension (X1 through X10), y-axis Simulated Value.]

Figure 3: Modified parallel coordinates plot of the No Structure data set displaying each simulated observation across the 10 dimensions.

To simulate the “No Structure” data set, N = 100 observations were randomly generated from a Q = 10 dimensional multivariate normal distribution, where each $y_i \sim MVN(\vec{\mu}, \Sigma)$ with $\vec{\mu} = \vec{50}_{10 \times 1}$ and $\Sigma = 25 I_{10 \times 10}$. Figure 3, a modified parallel coordinates plot (Schloerke et al., 2018), visualizes each of the 100 simulated observations across the 10 dimensions on the original scale. It displays no inherent structure across the 10 dimensions.
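A direct way to generate this data set is sketched below. Because the covariance matrix is diagonal, this single rmnorm call is equivalent to the transposed call used in the Appendix.

library(mnormt)

# "No Structure": 100 draws from a 10 dimensional MVN with mean vector of 50s
# and covariance 25 times the identity matrix.
set.seed(34536)
no_structure <- data.frame(rmnorm(n = 100, mean = rep(50, 10), varcov = diag(25, 10)))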

The “Clear Structure” data set was simulated to have three distinct groups. The first group was generated by sampling 10 observations from a 10 dimensional multivariate normal distribution, such that each $y_i \sim MVN(\vec{\mu}, \Sigma)$ with $\vec{\mu} = \vec{20}_{10 \times 1}$ and $\Sigma = 25 I_{10 \times 10}$. The second group was generated by sampling 30 observations from a 10 dimensional multivariate normal distribution, such that each $y_i \sim MVN(\vec{\mu}, \Sigma)$ with $\vec{\mu} = \vec{50}_{10 \times 1}$ and $\Sigma = 25 I_{10 \times 10}$. The third group was generated by sampling 60 observations from a 10 dimensional multivariate normal distribution, such that each $y_i \sim MVN(\vec{\mu}, \Sigma)$ with $\vec{\mu} = \vec{80}_{10 \times 1}$ and $\Sigma = 25 I_{10 \times 10}$. Figure 4, a modified parallel coordinates plot, visualizes the value for each of the 100 simulated observations across the 10 dimensions on the original scale. It displays the three distinct groups.



[Figure 4: modified parallel coordinates plot titled “Clear Structure”, colored by Group (1, 2, 3); x-axis Dimension (X1 through X10), y-axis Simulated Value.]

Figure 4: Modified parallel coordinates plot of the Clear Structure data set displaying each simulated observation across the 10 dimensions.


[Figure 5: modified parallel coordinates plot titled “Moderate Structure”, colored by Group (1, 2, 3); x-axis Dimension (X1 through X10), y-axis Simulated Value.]

Figure 5: Modified parallel coordinates plot of the Moderate Structure data set displaying each simulated observation across the 10 dimensions.

The “Moderate Structure” data set was also simulated to have three groups, but with less definite boundaries. The first group was generated by sampling 10 observations from a 10 dimensional multivariate normal distribution, such that each $y_i \sim MVN(\vec{\mu}, \Sigma)$ with $\vec{\mu} = \vec{41}_{10 \times 1}$ and $\Sigma = A_{10 \times 10}$, where $A$ has 40s on the main diagonal and 5s on the off diagonals. The second group was generated by sampling 30 observations from a 10 dimensional multivariate normal distribution, such that each $y_i \sim MVN(\vec{\mu}, \Sigma)$ with $\vec{\mu} = \vec{50}_{10 \times 1}$ and $\Sigma = A_{10 \times 10}$. The third group was generated by sampling 60 observations from a 10 dimensional multivariate normal distribution, such that each $y_i \sim MVN(\vec{\mu}, \Sigma)$ with $\vec{\mu} = \vec{59}_{10 \times 1}$ and $\Sigma = A_{10 \times 10}$. Figure 5, a modified parallel coordinates plot, visualizes the value for each of the 100 simulated observations across the 10 dimensions on the original scale. It displays the three groups and indicates some group-level differences but also some overlap across the groups.

The MNIST data set was downloaded from Kaggle.com and contains 60,000 observations and 785 variables. Each observation corresponds to one handwritten image of a number between 0 and 9. One of the variables in the data set is an identifier variable that indicates the number that was drawn, whereas the other 784 variables each correspond to a single pixel in the image and contain the gray-scale value, ranging from 0 to 255. t-SNE has a computational complexity of $O(N^2)$ (van der Maaten and Hinton, 2008). To make repeated analyses of this data set feasible, 1,000 of the images were randomly sampled and used in this analysis.
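The subsampling step can be sketched as follows, mirroring the Appendix code; the file mnist_train.csv is the Kaggle download, with the digit label in the first column and the 784 pixel values in the remaining columns.

library(tidyverse)

# Read the Kaggle MNIST training file and draw 1,000 images at random;
# column 1 ("label") is the digit identifier and columns 2:785 are pixel values.
mnist <- read_csv("mnist_train.csv")
mnist_small <- mnist %>% sample_n(1000, replace = FALSE)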


V. Methodology

t-SNE uses gradient descent to minimize the Kullback-Leibler divergence between the joint probability distribution of the high dimensional similarities and the joint probability distribution of the low dimensional similarities (van der Maaten and Hinton, 2008). Therefore, it seemed reasonable to select the optimal perplexity based on optimizing the same objective function, that is, minimizing the Kullback-Leibler divergence.

To test this idea, the t-SNE algorithm was applied, twice, using every potential value of perplexity for each of the four data sets. Two runs of the algorithm allow for the assessment of variability due to the random initialization. For each application of t-SNE, the Kullback-Leibler divergence of the last iteration was extracted. Then, for each data set, a plot was created that displayed the Kullback-Leibler divergence against perplexity. Based on these plots, a t-SNE map was produced using the perplexity that led to the t-SNE projection with the smallest Kullback-Leibler divergence. That t-SNE map was compared to the “best” t-SNE map found based on manual selection, that is, running through a suite of values of perplexity manually, visually inspecting each associated plot, and selecting the resulting map that appeared “best”.
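The search itself can be sketched in a few lines. This is a simplified stand-in for the KL_perp function in the Appendix; X is a simulated placeholder data set, and with Rtsne's default of 1,000 iterations the last entry of itercosts is the Kullback-Leibler divergence at the final recorded iteration.

library(Rtsne)

X <- matrix(rnorm(100 * 10), nrow = 100)        # placeholder for one of the data sets

# Run exact t-SNE twice at each candidate perplexity and keep the final KL divergence.
kl_final <- sapply(1:33, function(perp) {
  replicate(2, {
    fit <- Rtsne(X, pca = TRUE, perplexity = perp, theta = 0)
    tail(fit$itercosts, 1)
  })
})

# One point per run; the KL divergence is then plotted against perplexity.
matplot(1:33, t(kl_final), pch = 1, xlab = "Perplexity", ylab = "KL divergence")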

The goal of t-SNE is to produce meaningful visualizations that retain the local structure of the high dimensional data. “Meaningful” differs depending on the data set used. For the “No Structure” simulated data set, there should be no t-SNE map that indicates any groupings in the data, whereas for the two structured data sets, the more defined the groupings are in the map, the better. For the MNIST data set, there are 10 known groups, so the best map would display each known group as distinctly as possible.

This metric of visual comparison is clearly subject to bias. However, when the goal is to produce a meaningful visualization, the “eye test” may be what ultimately matters.

All t-SNE maps were estimated in R using the package Rtsne (Krijthe, 2015). Another R package with the ability to run t-SNE is tsne (Donaldson, 2016). Along with the t-SNE algorithm, the Rtsne package provides a few optional parameters: a setting for whether or not to run PCA prior to t-SNE and a “speed up” parameter named theta. The theta input can range from 0 to 1 depending on the degree of speedup; it uses the Barnes-Hut method as an approximation (van der Maaten, 2014), and the closer theta is to one, the greater the approximation. A theta setting of 0 leads to no approximation, and thus the t-SNE algorithm in its original version. For the purposes of this paper, theta was set at zero for all applications, although variation of results across combinations of perplexity and theta could also be of interest.

VI. Results

Figures 6a through 6d display the Kullback-Leibler (KL) divergence across each value of perplexity for the four data sets analyzed. All four plots indicate an interesting relationship between perplexity and KL divergence, Figures 6b through 6d in particular. Initially, as perplexity increases so does the KL divergence. However, quickly after this initial increase, KL divergence decreases across the remaining values of perplexity.


[Figure 6: scatterplots of KL divergence (y-axis) against Perplexity (x-axis) for panels (a) Simulated No Structure Data, (b) Simulated Clear Structure Data, (c) Simulated Moderate Structure Data, and (d) MNIST Data.]

Figure 6: Scatterplots displaying the change in Kullback-Leibler divergence across perplexities for each of the (a) No Structure, (b) Clear Structure, (c) Moderate Structure, and (d) MNIST data sets.

Figure 6a also displays this pattern; however, at a perplexity of 1, the KL divergence is lower than at any other value of perplexity. For the other three data sets, the lowest KL divergence occurs at the maximum value of perplexity considered. If minimization of the KL divergence is the goal, it would appear that a perplexity value of 1 should be used for the “No Structure” simulated data set, perplexity values of 33 for both the “Clear Structure” and “Moderate Structure” simulated data sets, and a perplexity of 333 for the reduced MNIST data set. Those last three are the maximum values of perplexity allowed by the algorithm.

In addition to this noticeable trend in KL divergence across perplexity, it should be noted that the KL divergence differs across different runs using the same perplexity. For each data set, t-SNE was run twice, with different seeds for the random initialization, at each potential value of perplexity. Figures 6a through 6d thus also display variability in the KL divergence at the same value of perplexity. That is, different runs using the same perplexity will create different t-SNE visualizations.

Figures 7a and 7b display t-SNE visualizations for the “No Structure” simulated data set. Figure 7a was created using the KL minimization suggested perplexity of one, and Figure 7b was created using a manually selected perplexity of 15. With no prior experience reading a t-SNE map, one may be inclined to pick out many small groups from Figure 7a.


[Figure 7: panels (a) “No Structure t-SNE, Perplexity: 1” and (b) “No Structure t-SNE, Perplexity: 15”.]

Figure 7: t-SNE maps based on perplexities of 1 (a) and 15 (b) for a sample of n = 100 observations from the No Structure data set.

However, with some experience it becomes clear that maps of this nature, which display small string-like connections, indicate a perplexity value that is too small. Figure 7b, with a perplexity of 15, displays a random cloud of points. In fact, there was no value of perplexity that displayed any sort of grouping (besides the many small string-like groupings) in the low dimensional projection. Thus, in the case where there is absolutely no high dimensional structure (at least with the Q = 10 dimensions considered here), the choice of perplexity does not seem to have a great impact on the resulting t-SNE visualization. It is possible that the number of dimensions in the original data set could impact these results, but that is not explored here.

Figures 8a and 8b display t-SNE visualizations for the “Clear Structure” simulated data set. Figure 8a was created using the KL minimization suggested perplexity of 33, and Figure 8b was created using a manually selected perplexity of nine. Both visualizations display three groups; however, the groupings are more clear in the t-SNE map using a perplexity of nine. Figure 8a displays very compact groups. Once again, without much experience one may be inclined to misinterpret the plot by observing only one or two points in one of the groups. However, all the associated points in that group are there; they just have very similar projected values. An interesting point to note about Figures 8a and 8b is the positioning of the groups in relation to one another.


[Figure 8: panels (a) “Clear Structure t-SNE, Perplexity: 33” and (b) “Clear Structure t-SNE, Perplexity: 9”.]

Figure 8: t-SNE maps based on perplexities of 33 (a) and 9 (b) for a sample of n = 100 observations from the Clear Structure data set.

In Figure 8a, the large group (which was simulated to have a multivariate mean vector of all 80s) is located far away from the other two groups, whereas the two smaller groups (simulated to have multivariate mean vectors of all 20s and 50s, respectively) are relatively close to one another. The projection is clearly not using the means to decide how the groups are positioned with respect to one another, but rather the number of observations that are, or are not, together in a group. Since perplexity is related to the effective number of neighbors, it is evident that with a perplexity of 33, the two smaller groups, which have 10 and 30 points respectively, are pulled closer together, such that each point is close to approximately 33 other points. In Figure 8b, when the perplexity is set at nine, the groups can spread out, as there is no need for the group of size 10 to be close to any other points to meet the perplexity definition. A manual exploration of maps across the perplexities showed that all values of perplexity led to three groups, besides a perplexity of one. When a perplexity of one was used, the resulting t-SNE map showed the many small string-like groupings that were discussed previously. Thus, it appears that in the “Clear Structure” case, the choice of perplexity does not have a qualitative impact on the resulting t-SNE visualization if it is set to reasonable values.

Figures 9a and 9b display t-SNE visualizations for the “Moderate Structure” simulated data set. Figure 9a was created using the KL minimization suggested perplexity of 33 and Figure 9b was created using a manually selected perplexity of 10.


[Figure 9: panels (a) “Moderate Structure t-SNE, Perplexity: 33” and (b) “Moderate Structure t-SNE, Perplexity: 10”.]

Figure 9: t-SNE maps based on perplexities of 33 (a) and 10 (b) for a sample of n = 100 observations from the Moderate Structure data set.

In this case, neither plot clearly displays three groups. Figure 9a displays two obvious groups, one small and one large, whereas in Figure 9b three possibly distinct groups can be seen. In this case, then, the choice of perplexity matters. The goal of t-SNE is to project high dimensional points into a low dimensional space while retaining the local structure. Therefore, a perplexity of 33 results in a visualization that “fails”, as the three groups are not clearly identifiable, whereas in the visualization created using a perplexity of 10 the three groups are clearly identifiable.

Figures 10a and 10b display t-SNE visualizations for the reduced MNIST data set. Figure 10a was created using the KL minimization suggested perplexity of 333, whereas Figure 10b was created using a manually selected perplexity of 35. Figure 10a displays a random cloud of points with no distinguishable groups, whereas Figure 10b displays some groupings. Without prior knowledge of the groupings, a person may pick out six or seven groups. This is not the 10 classes that are present in these data, but it is closer to showing the known groups than Figure 10a, which shows one large group.

Figures 11a and 11b also display t-SNE visualizations for the sample of the MNIST data set discussed previously. However, they are colored to indicate the grouping that each observed point belongs to. Like Figures 10a and 10b, Figures 11a and 11b were created using perplexities of 333 and 35, respectively.


[Figure 10: panels (a) “MNIST t-SNE, Perplexity: 333” and (b) “MNIST t-SNE, Perplexity: 35”.]

Figure 10: t-SNE maps based on perplexities of 333 (a) and 35 (b) for a sample of n = 1000 observations from the MNIST data set.

It is important to first examine the non-colored maps, so as not to be misled by the addition of group information. In practice, t-SNE is often used in situations where groups are not known, and so coloring the points by known groups is not possible. By adding color to distinguish groupings in Figures 11a and 11b, a general understanding of the accuracy of t-SNE can be gained. Ignoring the fact that Figure 11a is a random cloud of points, the colors indicate that t-SNE generally places points of the same class close to one another. However, there is not enough space between the groups to distinguish them from one another without the aid of color. Figure 11b indicates that the groups that were created by the t-SNE algorithm were fairly accurate. For the most part, observations that are close to one another belong to the same class. This method is certainly not perfect, as there are many points that appear to be in groups that actually belong to a different class.

VII. Discussion

Simply selecting the perplexity parameter based on minimization of Kullback-Leibler divergence does not lead to the best possible t-SNE visualization. In the “No Structure” simulated data set, the resulting t-SNE map, which used the perplexity suggested by this method, led to many small groupings that appear commonly for values of perplexity that are too small.


[Figure 11: colored versions of the MNIST t-SNE maps, panels (a) “MNIST t-SNE, Perplexity: 333” and (b) “MNIST t-SNE, Perplexity: 35”.]

Figure 11: t-SNE maps based on perplexities of 333 (a) and 35 (b) for a sample of n = 1,000 observations from the MNIST data set. Color has been added to display known groups based on the digit that was written.


In the “Clear Structure” simulated data set, the resulting t-SNE map which used the perplexity suggested by this method produced a map that correctly identified three groups. However, the proximity of these groups was not in accordance with the global structure of the high dimensional data. In the “Moderate Structure” simulated data set, the resulting t-SNE map which used the perplexity suggested by this method indicated only two groups in these data, whereas other values of perplexity were able to correctly separate all three groups. In the reduced MNIST data set, the results were similar: the resulting t-SNE visualization using the suggested perplexity indicated one large group, while different perplexities indicated at least some grouping in those data. Thus, selecting perplexity based on minimization of Kullback-Leibler divergence does not appear to be an appropriate method.

Based on these varying results, the question arises: how is perplexity to be chosen? It could be as simple, and unsatisfying, as a game of guess and check. The original authors provide guidelines that suggest the optimal perplexity parameter will generally be between 5 and 50 (van der Maaten and Hinton, 2008). Based on the efforts of this project, this statement does appear generally correct. The method of guess and check falters in its inability to produce results without human interaction. A manual inspection of each produced plot and a judgement call are required for each trial. This is time consuming and likely to lead to biased results. The question that arises from this predicament is how to algorithmically mimic the manual inspection and make a selection based on a less biased criterion.

There is no perfect algorithm, and t-SNE is certainly no exception to the rule. That being said, t-SNE does aim to answer the right questions, or, perhaps better yet, to produce the right questions. The scientific method relied upon by the current technological world is based on an observation that leads to questions. Traditionally this observation comes from the natural world. However, as time progresses, I believe these observations will increasingly be made by observing data. Tools such as t-SNE give a glimpse of the data; they provide insight into a phenomenon and spark reason to ask further questions.


VIII. References

Auguie, B. (2017). gridExtra: Miscellaneous Functions for “Grid” Graphics. R packageversion 2.3. https://CRAN.R-project.org/package=gridExtra.

Azzalini, A. and Genz, A. (2016). The R package ‘mnormt’: The multivariate normal and ‘t’ distributions (version 1.5-5). URL: http://azzalini.stat.unipd.it/SW/Pkg-mnormt.

Donaldson, J. (2016). tsne: T-Distributed Stochastic Neighbor Embedding for R (t-SNE). Rpackage version 0.1-3. https://CRAN.R-project.org/package=tsne.

Krijthe, J.H. (2015). Rtsne: T-Distributed Stochastic Neighbor Embedding using a Barnes-Hut Implementation. URL: https://github.com/jkrijthe/Rtsne.

Mercator projection. (2019, February 04). Retrieved April 30, 2019, from https://en.wikipedia.org/wiki/Mercator_projection.

R Core Team (2018). R: A language and environment for statistical computing. R Foundationfor Statistical Computing, Vienna, Austria. URL: https://www.R-project.org/.

Schloerke, B., Crowley, J., Cook, D., Briatte, F., Marbach, M., Thoen, E., Elberg, A. & Larmarange, J. (2018). GGally: Extension to ‘ggplot2’. R package version 1.4.0. https://CRAN.R-project.org/package=GGally.

van der Maaten, L.J.P. (2014). Accelerating t-SNE using Tree-Based Algorithms. Journal of Machine Learning Research 15(Oct):3221-3245.

van der Maaten, L.J.P. & Hinton, G.E. (2012). Visualizing Non-Metric Similarities in Multiple Maps. Machine Learning 87(1):33-55.

van der Maaten, L.J.P. & Hinton, G.E. (2008). Visualizing High Dimensional Data Using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605.

Wickham, H. (2017). tidyverse: Easily Install and Load the ‘Tidyverse’. R package version1.2.1. https://CRAN.R-project.org/package=tidyverse.

Xie, B., Mu, Y., Tao, D., & Huang, K. (2011). m-SNE: Multiview stochastic neighbor embedding. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 41(4):1088-1096.

LeCun, Y. & Cortes, C. (1999). The MNIST database of handwritten digits (Images). http://yann.lecun.com/exdb/mnist/.

Zaki, M. J., & Meira, W. (2014). Data mining and analysis: Fundamental conceptsand algorithms. New York: Cambridge University Press.


IX. Appendix

# Settings
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(warning = FALSE)

###### Setup

### Packages
library(tidyverse)
library(Rtsne)
library(gridExtra)
library(mnormt)
library(GGally)

### t-sne mnist (Figure 1)

mnist <- read_csv("mnist_train.csv")

mnist_small <- mnist %>%
  sample_n(1000, replace = FALSE)

set.seed(34536)
tsne_mnist_1 <- Rtsne(mnist_small[, 2:785], pca = TRUE, perplexity = 5, theta = 0)

set.seed(34536)
tsne_mnist_2 <- Rtsne(mnist_small[, 2:785], pca = TRUE, perplexity = 50, theta = 0)

tm1 <- ggplot(data = data.frame(tsne_mnist_1$Y)) +
  geom_point(aes(x = tsne_mnist_1$Y[, 1], y = tsne_mnist_1$Y[, 2])) +
  labs(title = "t-SNE: MNIST",
       subtitle = "Perplexity: 5",
       x = "X Coordinate",
       y = "Y Coordinate",
       caption = "a") +
  theme_minimal() +
  coord_fixed(ratio = 1)

tm2 <- ggplot(data = data.frame(tsne_mnist_2$Y)) +
  geom_point(aes(x = tsne_mnist_2$Y[, 1], y = tsne_mnist_2$Y[, 2])) +
  labs(title = "t-SNE: MNIST",
       subtitle = "Perplexity: 50",
       x = "X Coordinate",
       y = "Y Coordinate",
       caption = "b") +
  theme_minimal() +
  coord_fixed(ratio = 1)

grid.arrange(tm1, tm2, nrow = 1)

# Figure 2: t-SNE at perplexity 35 compared to the first two principal components
set.seed(34536)
tsne_mnist_good <- Rtsne(mnist_small[, 2:785], pca = TRUE, perplexity = 35, theta = 0)

tmgood <- ggplot(data = data.frame(tsne_mnist_good$Y)) +
  geom_point(aes(x = tsne_mnist_good$Y[, 1], y = tsne_mnist_good$Y[, 2])) +
  labs(title = "t-SNE: MNIST",
       subtitle = "Perplexity: 35",
       x = "X Coordinate",
       y = "Y Coordinate",
       caption = "a") +
  theme_minimal() +
  coord_fixed(ratio = 1) +
  guides(color = FALSE)

pca1 <- princomp(mnist_small[, 2:785])
pca2 <- pca1$scores %>%
  matrix(nrow = 1000) %>%
  data.frame() %>%
  select(c(1, 2))

pcaplot <- ggplot(data = pca2) +
  geom_point(aes(x = pca2$X1, y = pca2$X2)) +
  labs(title = "PCA: MNIST",
       subtitle = "First Two Principal Components",
       x = "Principal Component 1",
       y = "Principal Component 2",
       caption = "b") +
  theme_minimal() +
  coord_fixed(ratio = 1) +
  guides(color = FALSE)

grid.arrange(tmgood, pcaplot, nrow = 1)

### Data Generate

# "No Structure": the transpose of 10 draws from a 100-dimensional MVN gives
# 100 observations on 10 independent N(50, 25) dimensions
set.seed(34536)
no_structure <- rmnorm(n = 10, mean = rep(50, 100), diag(25, 100), sqrt = NULL) %>%
  t() %>%
  data.frame()

# "Clear Structure": three groups of 10, 30, and 60 observations
set.seed(34536)
g1 <- rmnorm(n = 10, mean = rep(20, 10), diag(25, 10)) %>%
  t() %>%
  data.frame()

g2 <- rmnorm(n = 10, mean = rep(50, 30), diag(25, 30)) %>%
  t() %>%
  data.frame()

g3 <- rmnorm(n = 10, mean = rep(80, 60), diag(25, 60)) %>%
  t() %>%
  data.frame()

yes_structure <- rbind(g1, g2, g3) %>%
  data.frame() %>%
  mutate(Group = c(rep("1", 10), rep("2", 30), rep("3", 60)))

# "Moderate Structure": three overlapping groups of 10, 30, and 60 observations
set.seed(343446)
g1 <- rmnorm(n = 10, mean = rep(41, 10), matrix(c(rep(5, 100)), nrow = 10) + diag(35, 10)) %>%
  t() %>%
  data.frame()

g2 <- rmnorm(n = 10, mean = rep(50, 30), matrix(c(rep(5, 900)), nrow = 30) + diag(35, 30)) %>%
  t() %>%
  data.frame()

g3 <- rmnorm(n = 10, mean = rep(59, 60), matrix(c(rep(5, 3600)), nrow = 60) + diag(35, 60)) %>%
  t() %>%
  data.frame()

middle_structure <- rbind(g1, g2, g3) %>%
  data.frame() %>%
  mutate(Group = c(rep("1", 10), rep("2", 30), rep("3", 60)))

# Figures 3-5: modified parallel coordinates plots of the three simulated data sets
ggparcoord(no_structure,
           columns = 1:10,
           scale = "globalminmax") +
  labs(x = "Dimension",
       y = "Simulated Value",
       title = "No Structure") +
  theme_minimal()

ggparcoord(yes_structure,
           columns = 1:10,
           groupColumn = 11,
           scale = "globalminmax") +
  labs(x = "Dimension",
       y = "Simulated Value",
       title = "Clear Structure") +
  theme_minimal()

ggparcoord(middle_structure,
           columns = 1:10,
           groupColumn = 11,
           scale = "globalminmax") +
  labs(x = "Dimension",
       y = "Simulated Value",
       title = "Moderate Structure") +
  theme_minimal()

# Run exact t-SNE rep_perplexity times at each perplexity from 1 to max_perplexity and
# record the final Kullback-Leibler divergence (with the default 1,000 iterations,
# itercosts has 20 entries, one every 50 iterations, so entry 20 is the final cost)
KL_perp <- function(data, rep_perplexity, max_perplexity) {

  KL <- matrix(c(rep(0, max_perplexity * rep_perplexity)), nrow = max_perplexity)

  for (i in 1:max_perplexity) {
    for (j in 1:rep_perplexity) {
      tsne_train <- Rtsne(data, pca = TRUE, perplexity = i, theta = 0.0)
      KL[i, j] <- tsne_train$itercosts[20]
    }
  }

  KL_gather <- KL %>%
    t() %>%
    data.frame() %>%
    gather(key = "Perplexity", value = "KL", 1:max_perplexity) %>%
    mutate(Perplexity = as.numeric(substring(Perplexity, 2)))

  return(KL_gather)
}

### "No Structure"
set.seed(34536)
KL_gather_nostr <- KL_perp(no_structure, 2, 33)

g_nostr <- ggplot(data = KL_gather_nostr) +
  geom_point(aes(x = Perplexity, y = KL)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "KL Divergence Across Perplexities",
       subtitle = "Simulated No Structure Data",
       caption = "a") +
  theme_minimal()

### yes structure
# column 11 (Group) is a character label, so only the 10 simulated dimensions are passed to t-SNE
set.seed(34536)
KL_gather_yesstr <- KL_perp(yes_structure[, 1:10], 2, 33)

g_yesstr <- ggplot(data = KL_gather_yesstr) +
  geom_point(aes(x = Perplexity, y = KL)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "KL Divergence Across Perplexities",
       subtitle = "Simulated Clear Structure Data",
       caption = "b") +
  theme_minimal()

### middle structure
set.seed(34536)
KL_gather_midstr <- KL_perp(middle_structure[, 1:10], 2, 33)

g_midstr <- ggplot(data = KL_gather_midstr) +
  geom_point(aes(x = Perplexity, y = KL)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "KL Divergence Across Perplexities",
       subtitle = "Simulated Moderate Structure Data",
       caption = "c") +
  theme_minimal()

### MNIST
set.seed(34536)

# The KL_perp call is commented out; previously saved results are read from file instead
# KL_MNIST <- KL_perp(mnist_small[, 2:785], 3, 333)
KL_MNIST <- read_csv("MNIST_perplexity.csv")
KL_MNIST <- KL_MNIST[, 2:3]

g_mnist <- ggplot(data = KL_MNIST) +
  geom_point(aes(x = Perplexity, y = KL)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "KL Divergence Across Perplexities",
       subtitle = "MNIST Data",
       caption = "d") +
  theme_minimal()

### plots (Figure 6)
grid.arrange(g_nostr, g_yesstr, g_midstr, g_mnist, nrow = 2)

# No str (Figure 7)
set.seed(34536)
tsne_no_str_1 <- Rtsne(no_structure, pca = TRUE, perplexity = 1, theta = 0)

g_nostr1 <- ggplot(data = data.frame(tsne_no_str_1$Y)) +
  geom_point(aes(x = tsne_no_str_1$Y[, 1], y = tsne_no_str_1$Y[, 2])) +
  labs(title = "No Structure t-SNE",
       subtitle = "Perplexity: 1",
       caption = "a",
       x = "",
       y = "") +
  theme_minimal()

set.seed(34536)
tsne_no_str_15 <- Rtsne(no_structure, pca = TRUE, perplexity = 15, theta = 0)

g_nostr15 <- ggplot(data = data.frame(tsne_no_str_15$Y)) +
  geom_point(aes(x = tsne_no_str_15$Y[, 1], y = tsne_no_str_15$Y[, 2])) +
  labs(title = "No Structure t-SNE",
       subtitle = "Perplexity: 15",
       caption = "b",
       x = "",
       y = "") +
  theme_minimal() +
  coord_fixed(ratio = 1)

grid.arrange(g_nostr1, g_nostr15, nrow = 1)

# yes str (Figure 8); the Group label column is excluded from the t-SNE input
set.seed(34536)
tsne_yes_str33 <- Rtsne(yes_structure[, 1:10], pca = TRUE, perplexity = 33, theta = 0)

g_yesstr33 <- ggplot(data = data.frame(tsne_yes_str33$Y)) +
  geom_point(aes(x = tsne_yes_str33$Y[, 1], y = tsne_yes_str33$Y[, 2])) +
  labs(title = "Clear Structure t-SNE",
       subtitle = "Perplexity: 33",
       caption = "a",
       color = "Multivariate Mean",
       x = "",
       y = "") +
  theme_minimal() +
  coord_fixed(ratio = 1)

set.seed(34536)
tsne_yes_str10 <- Rtsne(yes_structure[, 1:10], pca = TRUE, perplexity = 9, theta = 0)

g_yesstr9 <- ggplot(data = data.frame(tsne_yes_str10$Y)) +
  geom_point(aes(x = tsne_yes_str10$Y[, 1], y = tsne_yes_str10$Y[, 2])) +
  labs(title = "Clear Structure t-SNE",
       subtitle = "Perplexity: 9",
       caption = "b",
       color = "Multivariate Mean",
       x = "",
       y = "") +
  theme_minimal() +
  coord_fixed(ratio = 1)

grid.arrange(g_yesstr33, g_yesstr9, nrow = 1)

# mid str (Figure 9); the Group label column is excluded from the t-SNE input
set.seed(34536)
tsne_mid_str33 <- Rtsne(middle_structure[, 1:10], pca = TRUE, perplexity = 33, theta = 0)

g_midstr33 <- ggplot(data = data.frame(tsne_mid_str33$Y)) +
  geom_point(aes(x = tsne_mid_str33$Y[, 1], y = tsne_mid_str33$Y[, 2])) +
  labs(title = "Moderate Structure t-SNE",
       subtitle = "Perplexity: 33",
       caption = "a",
       color = "Multivariate Mean",
       x = "",
       y = "") +
  theme_minimal() +
  coord_fixed(ratio = 1)

set.seed(34536)
tsne_mid_str10 <- Rtsne(middle_structure[, 1:10], pca = TRUE, perplexity = 10, theta = 0)

g_midstr10 <- ggplot(data = data.frame(tsne_mid_str10$Y)) +
  geom_point(aes(x = tsne_mid_str10$Y[, 1], y = tsne_mid_str10$Y[, 2])) +
  labs(title = "Moderate Structure t-SNE",
       subtitle = "Perplexity: 10",
       caption = "b",
       color = "Multivariate Mean",
       x = "",
       y = "") +
  theme_minimal() +
  coord_fixed(ratio = 1)

grid.arrange(g_midstr33, g_midstr10, nrow = 1)

# mnist (Figure 10)
set.seed(34536)
tsne_mnist_333 <- Rtsne(mnist_small[, 2:785], pca = TRUE, perplexity = 333, theta = 0)

g_mnist333 <- ggplot(data = data.frame(tsne_mnist_333$Y)) +
  geom_point(aes(x = tsne_mnist_333$Y[, 1], y = tsne_mnist_333$Y[, 2])) +
  labs(title = "MNIST t-SNE",
       subtitle = "Perplexity: 333",
       caption = "a",
       color = "Multivariate Mean",
       x = "",
       y = "") +
  theme_minimal() +
  coord_fixed(ratio = 1)

set.seed(34536)
tsne_mnist_40 <- Rtsne(mnist_small[, 2:785], pca = TRUE, perplexity = 35, theta = 0)

g_mnist40 <- ggplot(data = data.frame(tsne_mnist_40$Y)) +
  geom_point(aes(x = tsne_mnist_40$Y[, 1], y = tsne_mnist_40$Y[, 2])) +
  labs(title = "MNIST t-SNE",
       subtitle = "Perplexity: 35",
       caption = "b",
       color = "Multivariate Mean",
       x = "",
       y = "") +
  theme_minimal() +
  coord_fixed(ratio = 1)

grid.arrange(g_mnist333, g_mnist40, nrow = 1)

# Colored versions of the MNIST maps (Figure 11), with points colored by digit label
g_mnist333_col <- ggplot(data = data.frame(tsne_mnist_333$Y)) +
  geom_point(aes(x = tsne_mnist_333$Y[, 1], y = tsne_mnist_333$Y[, 2],
                 color = as.factor(mnist_small$label))) +
  labs(title = "MNIST t-SNE",
       subtitle = "Perplexity: 333",
       caption = "Figure a",
       color = "Number",
       x = "",
       y = "") +
  theme(legend.position = "none") +
  theme_minimal() +
  coord_fixed(ratio = 1) +
  guides(color = FALSE)

g_mnist40_col <- ggplot(data = data.frame(tsne_mnist_40$Y)) +
  geom_point(aes(x = tsne_mnist_40$Y[, 1], y = tsne_mnist_40$Y[, 2],
                 color = as.factor(mnist_small$label))) +
  labs(title = "MNIST t-SNE",
       subtitle = "Perplexity: 35",
       caption = "Figure b",
       color = "Number",
       x = "",
       y = "") +
  theme(legend.position = "none") +
  theme_minimal() +
  coord_fixed(ratio = 1) +
  guides(color = FALSE)

grid.arrange(g_mnist333_col, g_mnist40_col, nrow = 1)
