
From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification

André Martins

Probability & Statistics Seminar, 4/5/16


From Softmax to Sparsemax

This work will appear at:

A. Martins and R. Astudillo. “From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification.” International Conference on Machine Learning, 2016 (preprint available on arXiv).


Outline

1 Motivation

2 The Sparsemax Transformation

3 A Loss Function for Sparsemax

4 Experiments

5 Conclusions


Motivation

Many statistical models require mapping a vector of reals to a probability distribution:

Multinomial logistic regression (McCullagh and Nelder, 1989)

Action selection in reinforcement learning (Sutton and Barto, 1998)

Neural networks for multi-class classification (Bridle, 1990; Goodfellow et al., 2016)

More recently: “attention mechanisms” in neural networks (Bahdanau et al., 2015; Xu et al., 2015)

This is usually done with a softmax transformation

This talk: A new transformation (sparsemax) which maps a vector of reals to a sparse probability distribution


Neural Attention


Outline

1 Motivation

2 The Sparsemax Transformation

3 A Loss Function for Sparsemax

4 Experiments

5 Conclusions


Recap: Softmax

Let ∆^{K−1} := {p ∈ R^K | 1 · p = 1, p ≥ 0} be the (K − 1)-dimensional probability simplex

The “standard” way to map a vector of reals to the probability simplex is the transformation softmax : R^K → ∆^{K−1}:

softmax_i(z) = exp(z_i) / ∑_{k=1}^K exp(z_k)

Resulting distribution has full support: softmax(z) > 0, ∀z

A disadvantage in applications where a sparse probability distribution is desired

Common workaround: define a threshold below which small probability values are truncated to zero
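Not on the slide, but as a concrete reference point: a minimal NumPy sketch of softmax (with the usual max-subtraction for numerical stability) and of the thresholding workaround; the cutoff value below is an arbitrary illustration.

```python
import numpy as np

def softmax(z):
    """Map a real vector z to the probability simplex via softmax."""
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

z = np.array([3.2, 1.1, -0.4, 0.0])
p = softmax(z)
print(p)                  # every entry is strictly positive (full support)

# Ad-hoc workaround: truncate small probabilities and renormalize.
eps = 0.01                # illustrative threshold, not a value from the talk
p_trunc = np.where(p < eps, 0.0, p)
p_trunc /= p_trunc.sum()
print(p_trunc)
```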


Our Proposal: Sparsemax

We propose as an alternative sparsemax : R^K → ∆^{K−1}:

sparsemax(z) := argmin_{p ∈ ∆^{K−1}} ‖p − z‖²

In words: sparsemax returns the Euclidean projection of the input vector z onto the probability simplex

Likely to hit the boundary of the simplex, in which case sparsemax(z) becomes sparse (hence the name)

We will see that sparsemax retains many of the properties of softmax, having in addition the ability to produce sparse distributions


Closed-Form Solution

Projecting onto the simplex is a well-studied problem in optimization (Michelot, 1986; Pardalos and Kovoor, 1990; Duchi et al., 2008)

Such projections correspond to a soft-thresholding operation:

sparsemax_i(z) = [z_i − τ(z)]+ = max{0, z_i − τ(z)},

where the threshold function τ : R^K → R is a piecewise linear function satisfying ∑_j [z_j − τ(z)]+ = 1 for every z.

All we need for evaluating sparsemax is to compute the threshold τ(z)

Coordinates above the threshold will be shifted by this amount; the others will be truncated to zero.


A Formal Algorithm

Input: z ∈ R^K

Sort z as z_(1) ≥ . . . ≥ z_(K)

Find k(z) := max{k ∈ [K] | 1 + k z_(k) > ∑_{j≤k} z_(j)}

Define τ(z) = (∑_{j≤k(z)} z_(j) − 1) / k(z)

Output: p ∈ ∆^{K−1} s.t. p_i = [z_i − τ(z)]+.

Time complexity is O(K log K) due to the sort operation; but O(K) algorithms exist based on linear-time selection.
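A minimal NumPy sketch of this sort-based O(K log K) variant; variable names are mine, not from the talk.

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex (sort-based algorithm)."""
    z = np.asarray(z, dtype=float)
    K = z.size
    z_sorted = np.sort(z)[::-1]                  # z_(1) >= ... >= z_(K)
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, K + 1)
    support = 1 + k * z_sorted > cumsum          # condition defining k(z)
    k_z = k[support][-1]                         # largest k satisfying it
    tau = (cumsum[k_z - 1] - 1) / k_z            # threshold tau(z)
    return np.maximum(z - tau, 0.0)

p = sparsemax([1.2, 0.9, -0.3, 0.1])
print(p)          # e.g. [0.65, 0.35, 0., 0.]: some coordinates are exactly zero
print(p.sum())    # 1.0
```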


Properties of Softmax/Sparsemax

Let A(z) := {k ∈ [K] | z_k = z_(1)} contain the largest entries of z

Let γ(z) := z_(1) − max_{k∉A(z)} z_k be the margin w.r.t. the second largest

The following hold for ρ ∈ {softmax, sparsemax}.

1 Uniform distribution: ρ(0) = 1/K

2 Temperature limit: lim_{ε→0+} ρ(ε^{−1} z) = 1_{A(z)}/|A(z)| (for sparsemax, equality holds for any ε ≤ γ(z) · |A(z)|)

3 Invariance to adding constants: ρ(z) = ρ(z + c1), for any c ∈ R.

4 Commutes with permutations: ρ(Pz) = Pρ(z), for any permutation matrix P.

5 Coordinate non-expansiveness: if z_i ≤ z_j, then 0 ≤ ρ_j(z) − ρ_i(z) ≤ η(z_j − z_i), where η = 1/2 for softmax, and η = 1 for sparsemax.
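A quick numerical spot-check of properties 1, 3 and 4 (not on the slide); the sparsemax helper just re-implements the sort-based algorithm from the previous slide so the snippet runs standalone.

```python
import numpy as np

def sparsemax(z):
    # Same sort-based projection as in the earlier sketch.
    z = np.asarray(z, dtype=float)
    zs = np.sort(z)[::-1]
    k = np.arange(1, z.size + 1)
    k_z = k[1 + k * zs > np.cumsum(zs)][-1]
    tau = (np.cumsum(zs)[k_z - 1] - 1) / k_z
    return np.maximum(z - tau, 0.0)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([0.7, -1.3, 0.2, 0.4])
perm = np.array([2, 0, 3, 1])
for rho in (softmax, sparsemax):
    assert np.allclose(rho(np.zeros(4)), 0.25)        # property 1: uniform on the zero vector
    assert np.allclose(rho(z), rho(z + 5.0))          # property 3: invariance to adding constants
    assert np.allclose(rho(z)[perm], rho(z[perm]))    # property 4: commutes with permutations
print("checks passed")
```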


Properties of Softmax/Sparsemax

Reassuring: sparsemax, while defined very differently from softmax, has a similar behaviour and preserves the same invariances

Other proposed replacements of softmax fail some of these properties, e.g. the spherical softmax (Ollivier, 2013), defined as ρ_i(z) := z_i² / ∑_j z_j²


Two Dimensions

Parametrize z = (t, 0)

The 2D softmax is the logistic (sigmoid) function:

softmax_1(z) = (1 + exp(−t))^{−1}

The 2D sparsemax is the “hard” version of the sigmoid:

sparsemax_1(z) = 1 if t > 1; (t + 1)/2 if −1 ≤ t ≤ 1; 0 if t < −1.
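As a quick sanity check (not on the original slide), here is the derivation of the middle branch, assuming both coordinates stay in the support:

```latex
% Sketch: project z = (t, 0) onto the simplex with both coordinates in the support.
\begin{align*}
p_1 = t - \tau, \quad p_2 = 0 - \tau, \quad p_1 + p_2 = 1
  \;\Longrightarrow\; \tau = \frac{t - 1}{2}
  \;\Longrightarrow\; p_1 = \frac{t + 1}{2},
\end{align*}
% which lies in [0, 1] exactly when -1 <= t <= 1; outside this range one
% coordinate is truncated to zero and sparsemax_1(z) saturates at 0 or 1.
```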

[Figure: softmax_1([t, 0]) and sparsemax_1([t, 0]) plotted as functions of t ∈ [−3, 3]]


Three Dimensions

Parameterize z = (t_1, t_2, 0) and plot softmax_1(z) and sparsemax_1(z) as a function of t_1 and t_2

sparsemax is piecewise linear, but asymptotically similar to softmax

[Figure: 3D surface plots of sparsemax_1([t_1, t_2, 0]) and softmax_1([t_1, t_2, 0]) over t_1, t_2 ∈ [−3, 3]]


Jacobian of Softmax

The Jacobian matrix of a transformation ρ, J_ρ(z) := [∂ρ_i(z)/∂z_j]_{i,j}, is of key importance to train models with gradient-based optimization

For the softmax,

∂softmax_i(z)/∂z_j = (δ_ij e^{z_i} ∑_k e^{z_k} − e^{z_i} e^{z_j}) / (∑_k e^{z_k})² = softmax_i(z)(δ_ij − softmax_j(z)),

where δ_ij is the Kronecker delta.

In matrix notation:

J_softmax(z) = Diag(p) − p p^T, where p := softmax(z)


Jacobian of Sparsemax

Let S(z) := {j ∈ [K] | sparsemax_j(z) > 0} be the support of sparsemax(z)

sparsemax is differentiable everywhere except at splitting points z where S(z) ≠ S(z + εd) for some d and infinitesimal ε

We have that:

∂sparsemax_i(z)/∂z_j = δ_ij − 1/|S(z)| if i, j ∈ S(z), and 0 otherwise.

In matrix notation:

J_sparsemax(z) = Diag(s) − s s^T/|S(z)|, where s := 1_{S(z)}

(equals the Laplacian of a graph whose elements of S(z) are fully connected.)
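A small NumPy sketch (not from the talk) that builds this Jacobian directly from an already computed sparse distribution p, valid away from splitting points; the helper name is mine.

```python
import numpy as np

def sparsemax_jacobian(p):
    """Jacobian of sparsemax at a point where sparsemax(z) = p (away from splitting points)."""
    s = (p > 0).astype(float)                 # indicator vector of the support S(z)
    return np.diag(s) - np.outer(s, s) / s.sum()

p = np.array([0.65, 0.35, 0.0, 0.0])          # e.g. output of the earlier sparsemax sketch
J = sparsemax_jacobian(p)
print(J)
print(J.sum(axis=1))   # rows sum to 0, consistent with invariance to adding c*1
```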


Jacobian Times Vector

Often we don’t need the full Jacobian, but just its product with a given vector v (e.g. for gradient backpropagation)

For softmax we can compute this in linear time, via:

J_softmax(z) · v = p ⊙ (v − v̄ 1), with v̄ := ∑_j p_j v_j,

For sparsemax, we have:

J_sparsemax(z) · v = s ⊙ (v − v̄ 1), with v̄ := (∑_{j∈S(z)} v_j) / |S(z)|.

If sparsemax(z) has already been evaluated (i.e., in the forward pass), then S(z) is already known, and the nonzeros of J_sparsemax(z) · v can be computed in sublinear time O(|S(z)|)

An important advantage of sparsemax over softmax for large K
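A sketch of both Jacobian-vector products in the notation above, assuming p has already been computed in the forward pass:

```python
import numpy as np

def softmax_jvp(p, v):
    """J_softmax(z) @ v, given p = softmax(z)."""
    v_bar = np.dot(p, v)                  # average of v weighted by p
    return p * (v - v_bar)

def sparsemax_jvp(p, v):
    """J_sparsemax(z) @ v, given p = sparsemax(z)."""
    s = p > 0                             # support S(z), known from the forward pass
    v_bar = v[s].mean()                   # plain average of v over the support
    out = np.zeros_like(v, dtype=float)
    out[s] = v[s] - v_bar                 # only |S(z)| entries are touched
    return out

v = np.array([1.0, -2.0, 0.5, 3.0])
print(softmax_jvp(np.array([0.5, 0.3, 0.15, 0.05]), v))
print(sparsemax_jvp(np.array([0.65, 0.35, 0.0, 0.0]), v))
```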


Outline

1 Motivation

2 The Sparsemax Transformation

3 A Loss Function for Sparsemax

4 Experiments

5 Conclusions


Supervised Learning w/ Empirical Risk Minimization

Let x ∈ R^D be an input vector

Let y ∈ {1, . . . ,K} be an output label

Training dataset D := {(x_i, y_i)}_{i=1}^N ⊆ R^D × {1, . . . ,K}

Regularized empirical risk minimization:

minimize (λ/2) ‖W‖²_F + (1/N) ∑_{i=1}^N L(Wx_i + b; y_i), w.r.t. W ∈ R^{K×D}, b ∈ R^K,

where W is a matrix of weights, b is a bias vector, and L is a loss function


Softmax: Logistic Loss

Let z = Wx_i + b, and let k = y_i be the “gold” label

Logistic loss (or negative log-likelihood):

L_softmax(z; k) = − log softmax_k(z) = −z_k + log ∑_j exp(z_j),

The gradient of this loss is

∇_z L_softmax(z; k) = −δ_k + softmax(z),

where δ_k denotes the delta distribution on k

Gradient updates move mass from the distribution predicted by the current model (i.e., softmax_k(z)) to the gold label (via δ_k)
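For concreteness, a small sketch (not on the slide) of this loss and its gradient, using a stable log-sum-exp:

```python
import numpy as np

def logistic_loss_and_grad(z, k):
    """L_softmax(z; k) = -z_k + log sum_j exp(z_j), and its gradient -delta_k + softmax(z)."""
    m = z.max()
    logsumexp = m + np.log(np.exp(z - m).sum())   # numerically stable log-sum-exp
    p = np.exp(z - logsumexp)                     # softmax(z)
    grad = p.copy()
    grad[k] -= 1.0                                # -delta_k + softmax(z)
    return -z[k] + logsumexp, grad

loss, grad = logistic_loss_and_grad(np.array([1.2, 0.9, -0.3, 0.1]), k=0)
print(loss, grad)
```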

Can we have something similar for sparsemax?


How to Define a Sparsemax Loss?

A nice aspect of the log-likelihood: adding up loss terms for several examples (i.i.d.) we obtain the log-probability of the full training data:

∑_{i=1}^N L_softmax(z_i; y_i) = − log ∏_{i=1}^N softmax_{y_i}(z_i)

Unfortunately, this idea cannot be carried over to sparsemax: now, some labels may get exactly probability zero, and this could zero out the probability of the entire training sample

One possible workaround is to define, for a small constant ε,

L^ε_sparsemax(z; k) = − log [(ε + sparsemax_k(z)) / (1 + Kε)],

Disadvantage: this loss is non-convex, unlike the logistic loss


Sparsemax Loss

A better approach: construct an alternative loss function whose gradient resembles the gradient of the logistic loss

The gradient is particularly important, since it is directly involved in the model updates for typical optimization algorithms

That is, we want L_sparsemax to be a differentiable function such that

∇_z L_sparsemax(z; k) = −δ_k + sparsemax(z)

This property is fulfilled by the following function (sparsemax loss):

L_sparsemax(z; k) = −z_k + (1/2) ∑_{j∈S(z)} (z_j² − τ²(z)) + 1/2,

where τ²(z) is the square of the threshold function.
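A sketch of the sparsemax loss and its gradient, re-implementing the sort-based threshold from the algorithm slide so it runs standalone; the second call illustrates that the loss hits exactly zero when the margin condition z_k ≥ 1 + max_{j≠k} z_j holds (see the next slide).

```python
import numpy as np

def sparsemax_and_tau(z):
    """Return (sparsemax(z), tau(z)) via the sort-based algorithm."""
    z = np.asarray(z, dtype=float)
    zs = np.sort(z)[::-1]
    k = np.arange(1, z.size + 1)
    k_z = k[1 + k * zs > np.cumsum(zs)][-1]
    tau = (np.cumsum(zs)[k_z - 1] - 1) / k_z
    return np.maximum(z - tau, 0.0), tau

def sparsemax_loss_and_grad(z, k):
    """L_sparsemax(z; k) and its gradient -delta_k + sparsemax(z)."""
    p, tau = sparsemax_and_tau(z)
    support = p > 0
    loss = -z[k] + 0.5 * np.sum(z[support] ** 2 - tau ** 2) + 0.5
    grad = p.copy()
    grad[k] -= 1.0
    return loss, grad

print(sparsemax_loss_and_grad(np.array([1.2, 0.9, -0.3, 0.1]), k=0))   # positive loss
print(sparsemax_loss_and_grad(np.array([2.0, 0.5, -0.3, 0.1]), k=0)[0])  # 0.0 under margin separation
```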


Properties of the Logistic/Sparsemax Loss

The following hold for L_ρ ∈ {L_softmax, L_sparsemax}.

1 L_ρ is differentiable everywhere, with gradient ∇_z L_ρ(z; k) = −δ_k + ρ(z)

2 L_ρ is convex

3 L_ρ(z + c1; k) = L_ρ(z; k), ∀c ∈ R.

4 L_ρ(z; k) ≥ 0, for all z and k.

5 For sparsemax only, the following statements are all equivalent: (i) L_sparsemax(z; k) = 0; (ii) sparsemax(z) = δ_k; (iii) margin separation holds, z_k ≥ 1 + max_{j≠k} z_j.

Note: the last property is also satisfied by the hinge loss of SVMs...

... but unlike the hinge loss, L_sparsemax is everywhere differentiable, hence amenable to smooth optimization methods such as L-BFGS or accelerated gradient descent (Liu and Nocedal, 1989; Nesterov, 1983).


Two Dimensions: Relation to the Huber Loss

In 2D, the sparsemax loss reduces to the Huber classification loss known from robust statistics (Huber, 1964).

Assume without loss of generality that the correct label is k = 1, and define t = z_1 − z_2. We have

L_sparsemax(t) = 0 if t ≥ 1; −t if t ≤ −1; (t − 1)²/4 if −1 < t < 1,

This loss is a variant of the Huber loss adapted for classification, proposed by Zhang (2004); Zou et al. (2006).


Loss Functions in 2D


Generalization to Multi-Label Classification

In multi-label classification, we assume the target is a non-empty set of labels Y ∈ 2^[K] \ {∅} rather than a single label y ∈ [K]

In sparse label proportion estimation (more general), the target is a probability distribution q ∈ ∆^{K−1}, such that Y = {k | q_k > 0}

Let D := {(x_i, q_i)}_{i=1}^N ⊆ R^D × ∆^{K−1} be a training dataset, where the target distributions q_i are typically sparse

Related to: “learning with a probabilistic teacher” (Agrawala, 1970), and semi-supervised learning (Chapelle et al., 2006), as it can model label uncertainty

Subsumes single-label classification (where all q_i are delta distributions concentrated on a single class)


Multi-Label Logistic/Sparsemax Losses

The generalization of the multinomial logistic loss to this setting is:

L_softmax(z; q) = KL(q ‖ softmax(z)) = −H(q) − q · z + log ∑_j exp(z_j),

The corresponding generalization in the sparsemax case is:

L_sparsemax(z; q) = −q · z + (1/2) ∑_{j∈S(z)} (z_j² − τ²(z)) + (1/2) ‖q‖².

The gradients of these losses are:

∇_z L_softmax(z; q) = −q + softmax(z),

∇_z L_sparsemax(z; q) = −q + sparsemax(z).

We’ll make use of these losses in our experiments.


Outline

1 Motivation

2 The Sparsemax Transformation

3 A Loss Function for Sparsemax

4 Experiments

5 Conclusions


Sparse Label Proportion Estimation

We generated datasets with 1,200 training and 1,000 test examples

Each example emulates a “multi-labeled document”

We sample the number of labels N ∈ {1, . . . ,K} from a Poisson and draw the N labels from a multinomial

We pick a document length from a Poisson, and repeatedly sample its words from the mixture of the N label-specific multinomials

Two settings: uniform mixtures (q_{k_n} = 1/N for the N active labels k_1, . . . , k_N) and random mixtures (whose label proportions q_{k_n} were drawn from a flat Dirichlet).

We set the vocabulary size equal to the number of labels K ∈ {10, 50}, and varied the average doc-length in 200–2,000 words

The regularization constant λ was picked with 5-fold cross-validation
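A rough sketch of this generation process (not the talk's code); the Poisson rate for the number of labels and the Dirichlet priors for the label-specific word distributions are placeholders, since the talk does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10                                           # number of labels = vocabulary size
topics = rng.dirichlet(np.ones(K), size=K)       # one word multinomial per label (fixed for the dataset; placeholder prior)
doc_len_mean = 500                               # average document length (the talk varies this in 200-2,000)
label_count_mean = 2                             # Poisson mean for the number of labels (placeholder)

def sample_example(uniform_mixture=True):
    N = int(np.clip(rng.poisson(label_count_mean), 1, K))    # number of active labels
    active = rng.choice(K, size=N, replace=False)             # draw the N labels
    q = np.zeros(K)
    q[active] = 1.0 / N if uniform_mixture else rng.dirichlet(np.ones(N))
    length = max(int(rng.poisson(doc_len_mean)), 1)
    word_dist = q @ topics                                     # mixture of the active labels' multinomials
    x = rng.multinomial(length, word_dist)                     # bag-of-words counts for the "document"
    return x, q

x, q = sample_example()
print(x.sum(), np.nonzero(q)[0], q[q > 0])
```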


Sparse Label Proportion Estimation

Reported: mean squared error and Jensen-Shannon divergence between the target and predicted label posteriors


Multi-Label Classification

Five benchmark multi-label classification datasets:

Dataset    Descr.   #labels   #train   #test
Scene      images   6         1211     1196
Emotions   music    6         393      202
Birds      audio    19        323      322
CAL500     music    174       400      100
Reuters    text     103       23,149   781,265

We compared three systems (all optimized with 100 epochs of L-BFGS):

Logistic: train binary logistic regressors on each label, then tune a probability threshold δ ∈ [0, 1] on validation data (Koyejo et al., 2015)

Softmax: train a multinomial logistic regressor (L_softmax), then tune a similar probability threshold p_0 for prediction

Sparsemax: train with L_sparsemax, then tune a constant t ≥ 1, and predict using the support of p = sparsemax(tz)
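A sketch of the sparsemax prediction rule from the last bullet (re-using the sort-based projection from earlier); t is the constant tuned on validation data.

```python
import numpy as np

def sparsemax(z):
    z = np.asarray(z, dtype=float)
    zs = np.sort(z)[::-1]
    k = np.arange(1, z.size + 1)
    k_z = k[1 + k * zs > np.cumsum(zs)][-1]
    tau = (np.cumsum(zs)[k_z - 1] - 1) / k_z
    return np.maximum(z - tau, 0.0)

def predict_labels(z, t=1.0):
    """Predicted label set: the support of sparsemax(t * z), with t >= 1 tuned on validation data."""
    return np.nonzero(sparsemax(t * z) > 0)[0]

z = np.array([1.2, 0.9, -0.3, 0.1])
print(predict_labels(z, t=1.0))   # labels {0, 1}
print(predict_labels(z, t=5.0))   # larger t shrinks the support toward the argmax
```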


Multi-Label Classification

Micro-averaged and macro-averaged F1 scores (shown as micro / macro):

Dataset    Logistic        Softmax         Sparsemax
Scene      70.96 / 72.95   74.01 / 75.03   73.45 / 74.57
Emotions   66.75 / 68.56   67.34 / 67.51   66.38 / 66.07
Birds      45.78 / 33.77   48.67 / 37.06   49.44 / 39.13
CAL500     48.88 / 24.49   47.46 / 23.51   48.47 / 26.20
Reuters    81.19 / 60.02   79.47 / 56.30   80.00 / 61.27

The performances of the three systems are all very similar

The sparsemax loss appears better suited for problems with larger numbers of labels


Neural Networks with Attention Mechanisms

We now use sparsemax to construct a “sparse” neural attention mechanism

We ran experiments on the Stanford Natural Language Inference corpus (Bowman et al., 2015), which contains 570,000 human-written English sentence pairs

Each pair consists of a premise and a hypothesis, with a manual label: entailment, contradiction, or neutral

We used an attention-based architecture analogous to the one proposed by Rocktäschel et al. (2015)


Four Neural Attention Strategies

NoAttention, an RNN-based system similar to Bowman et al. (2015)

LogisticAttention, which uses independent logistic activations

SoftAttention, a near-reproduction of Rocktäschel et al. (2015)’s softmax attention-based system

SparseAttention, which replaces the latter softmax by a sparsemax


Some Details (I)

We represent the words in the premise/hypothesis with GloVe vectors (Pennington et al., 2014), projected onto a D-dim subspace, D = 100

Denote by x_1, . . . , x_L and x_{L+1}, . . . , x_N resp. the projected premise and hypothesis word vectors

The two sequences are then each fed into an RNN (instead of LSTMs, we used GRUs, which behave similarly but have fewer parameters)

The premise GRU gives a state sequence H_{1:L} = [h_1 . . . h_L] ∈ R^{D×L} as

z_t = σ(W^{xz} x_t + W^{hz} h_{t−1}),   r_t = σ(W^{xr} x_t + W^{hr} h_{t−1}),
h̃_t = tanh(W^{xh} x_t + W^{hh}(r_t ⊙ h_{t−1})),   h_t = (1 − z_t) h_{t−1} + z_t h̃_t.

Likewise, the hypothesis GRU gives a state sequence [h_{L+1}, . . . , h_N]


Some Details (II)

The NoAttention system then computes the final state u based on the last states from the premise and the hypothesis as follows:

u = tanh(W^{pu} h_L + W^{hu} h_N)

and predicts a label y from u with a standard softmax layer.

SoftAttention replaces the last premise state h_L by a weighted average of premise states computed with an attention mechanism:

z_t = v · tanh(W^{pm} h_t + W^{hm} h_N)
p = softmax(z), where z := (z_1, . . . , z_L)   (1)
r = H_{1:L} p
u = tanh(W^{pu} r + W^{hu} h_N),

LogisticAttention replaces (1) by p = (σ(z_1), . . . , σ(z_L)).

SparseAttention replaces (1) by p = sparsemax(z).
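A minimal NumPy sketch of this attention step with sparsemax in place of (1); the dimensions, the parameter matrices W^{pm}, W^{hm} and the vector v are random placeholders rather than trained weights, and the sparsemax helper is the same sort-based projection as before.

```python
import numpy as np

def sparsemax(z):
    zs = np.sort(z)[::-1]
    k = np.arange(1, z.size + 1)
    k_z = k[1 + k * zs > np.cumsum(zs)][-1]
    tau = (np.cumsum(zs)[k_z - 1] - 1) / k_z
    return np.maximum(z - tau, 0.0)

rng = np.random.default_rng(0)
D, L = 100, 7                                  # hidden size and premise length
H = rng.standard_normal((D, L))                # premise states h_1..h_L (placeholder values)
h_N = rng.standard_normal(D)                   # last hypothesis state
W_pm, W_hm = rng.standard_normal((D, D)), rng.standard_normal((D, D))
v = rng.standard_normal(D)

z = np.array([v @ np.tanh(W_pm @ H[:, t] + W_hm @ h_N) for t in range(L)])  # attention scores z_t
p = sparsemax(z)                               # sparse attention weights (softmax(z) in SoftAttention)
r = H @ p                                      # weighted average of premise states
print(np.round(p, 3))                          # typically only a few nonzero weights
```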

We optimized all the systems with Adam (Kingma and Ba, 2014), with ℓ2-regularization and dropout.


Experimental Results

Soft and sparse-activated attention systems perform similarly

Both outperform the NoAttention and LogisticAttention systems

System              Dev Acc.   Test Acc.
NoAttention         81.84      80.99
LogisticAttention   82.11      80.84
SoftAttention       82.86      82.08
SparseAttention     82.52      82.20


Some Examples

In bold: premise words selected by SparseAttention

Only a few words are selected, which are key for the system’s decision

The sparsemax activation yields a compact and more interpretable selection, which can be particularly useful in long sentences

Premise: A boy rides on a camel in a crowded area while talking on his cellphone.
Hypothesis: A boy is riding an animal. [entailment]

Premise: A young girl wearing a pink coat plays with a yellow toy golf club.
Hypothesis: A girl is wearing a blue jacket. [contradiction]

Premise: Two black dogs are frolicking around the grass together.
Hypothesis: Two dogs swim in the lake. [contradiction]

Premise: A man wearing a yellow striped shirt laughs while seated next to another man who is wearing a light blue shirt and clasping his hands together.
Hypothesis: Two mimes sit in complete silence. [contradiction]


Outline

1 Motivation

2 The Sparsemax Transformation

3 A Loss Function for Sparsemax

4 Experiments

5 Conclusions


Conclusions

We introduced the sparsemax transformation, which has similar properties to the softmax, but is able to output sparse distributions

We derived a closed-form expression for its Jacobian

We proposed a novel “sparsemax loss” function, a sparse analogue of the logistic loss, which is smooth and convex

Empirical results in multi-label classification and in attention networks for natural language inference attest to the validity of our approach

Many avenues for future research!

What can we say about statistical consistency/sparsistency of the sparsemax loss?

Application to neural architectures with random access memory (Graves et al., 2014; Sukhbaatar et al., 2015)

Hierarchical attention: the sparse distributions produced by sparsemax will prune the hierarchy, leading to computational savings

Efficient implementations on GPUs (Alabi et al., 2012)


Thank you!

Acknowledgments:

Tim Rocktäschel, Mário Figueiredo, Chris Dyer

Fundação para a Ciência e Tecnologia, grants UID/EEA/50008/2013 and PTDC/EEI-SII/7092/2014.

Fundação para a Ciência e Tecnologia, GoLocal project, grant CMUPERI/TIC/0046/2014


References I

Agrawala, A. K. (1970). Learning with a Probabilistic Teacher. IEEE Transactions on Information Theory, 16(4):373–379.

Alabi, T., Blanchard, J. D., Gordon, B., and Steinbach, R. (2012). Fast k-Selection Algorithms for Graphics Processing Units. Journal of Experimental Algorithmics (JEA), 17:4–2.

Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. In International Conference on Learning Representations.

Bowman, S. R., Angeli, G., Potts, C., and Manning, C. D. (2015). A Large Annotated Corpus for Learning Natural Language Inference. In Proc. of Empirical Methods in Natural Language Processing.

Bridle, J. S. (1990). Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition. In Neurocomputing, pages 227–236. Springer.

Chapelle, O., Schölkopf, B., and Zien, A. (2006). Semi-Supervised Learning. MIT Press Cambridge.

Duchi, J., Shalev-Shwartz, S., Singer, Y., and Chandra, T. (2008). Efficient Projections onto the L1-Ball for Learning in High Dimensions. In Proc. of International Conference on Machine Learning.

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. Book in preparation for MIT Press.

Graves, A., Wayne, G., and Danihelka, I. (2014). Neural Turing Machines. arXiv preprint arXiv:1410.5401.

Huber, P. J. (1964). Robust Estimation of a Location Parameter. The Annals of Mathematical Statistics, 35(1):73–101.

Kingma, D. and Ba, J. (2014). Adam: A Method for Stochastic Optimization. In Proc. of International Conference on Learning Representations.

Koyejo, S., Natarajan, N., Ravikumar, P. K., and Dhillon, I. S. (2015). Consistent Multilabel Classification. In Advances in Neural Information Processing Systems, pages 3303–3311.

Liu, D. C. and Nocedal, J. (1989). On the Limited Memory BFGS Method for Large Scale Optimization. Mathematical Programming, 45(1-3):503–528.

McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, volume 37. CRC press.

Michelot, C. (1986). A Finite Algorithm for Finding the Projection of a Point onto the Canonical Simplex of R^n. Journal of Optimization Theory and Applications, 50(1):195–200.


References II

Nesterov, Y. (1983). A Method of Solving a Convex Programming Problem with Convergence Rate O(1/k²). Soviet Math. Doklady, 27:372–376.

Ollivier, Y. (2013). Riemannian Metrics for Neural Networks. arXiv preprint arXiv:1303.0818.

Pardalos, P. M. and Kovoor, N. (1990). An Algorithm for a Singly Constrained Class of Quadratic Programs Subject to Upper and Lower Bounds. Mathematical Programming, 46(1):321–328.

Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014), 12:1532–1543.

Rocktäschel, T., Grefenstette, E., Hermann, K. M., Kočiský, T., and Blunsom, P. (2015). Reasoning about Entailment with Neural Attention. arXiv preprint arXiv:1509.06664.

Sukhbaatar, S., Szlam, A., Weston, J., and Fergus, R. (2015). End-to-End Memory Networks. In Advances in Neural Information Processing Systems, pages 2431–2439.

Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction, volume 1. MIT press Cambridge.

Xu, K., Ba, J., Kiros, R., Courville, A., Salakhutdinov, R., Zemel, R., and Bengio, Y. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In International Conference on Machine Learning.

Zhang, T. (2004). Statistical Behavior and Consistency of Classification Methods Based on Convex Risk Minimization. Annals of Statistics, pages 56–85.

Zou, H., Zhu, J., and Hastie, T. (2006). The Margin Vector, Admissible Loss and Multi-class Margin-Based Classifiers. Technical report, Stanford University.
