
Winning the Lottery with Continuous Sparsification

Pedro Savarese*
TTI-Chicago
[email protected]

Hugo Silva*
University of Alberta
[email protected]

Michael Maire
University of Chicago
[email protected]

Abstract

The search for efficient, sparse deep neural network models is most prominently performed by pruning: training a dense, overparameterized network and removing parameters, usually by following a manually-crafted heuristic. Additionally, the recent Lottery Ticket Hypothesis conjectures that, for a typically-sized neural network, it is possible to find small sub-networks which, when trained from scratch on a comparable budget, match the performance of the original dense counterpart. We revisit fundamental aspects of pruning algorithms, pointing out missing ingredients in previous approaches, and develop a method, Continuous Sparsification, which searches for sparse networks based on a novel approximation of an intractable ℓ0 regularization. We compare against dominant heuristic-based methods on pruning as well as ticket search – finding sparse subnetworks that can be successfully re-trained from an early iterate. Empirical results show that we surpass the state-of-the-art for both objectives, across models and datasets, including VGG trained on CIFAR-10 and ResNet-50 trained on ImageNet. In addition to setting a new standard for pruning, Continuous Sparsification also offers fast parallel ticket search, opening doors to new applications of the Lottery Ticket Hypothesis.

1 Introduction

Although deep neural networks have become ubiquitous in fields such as computer vision and natural language processing, extreme overparameterization is typically required to achieve state-of-the-art results, incurring higher training costs and hindering applications limited by memory or inference time. Recent theoretical work suggests that overparameterization plays a key role in network training dynamics [1] and generalization [2]. However, it remains unclear whether, in practice, overparameterization is truly necessary to train networks to state-of-the-art performance.

Concurrently, empirical approaches have been successful in finding compact neural networks, either by shrinking trained models [3, 4, 5] or through efficient architectures, yielding less overparameterized models that can be trained from scratch [6]. Recently, combining these two strategies has led to new methods which discover efficient architectures through optimization instead of design [7, 8]. Nonetheless, parameter efficiency is typically maximized by pruning an already trained network.

Despite the fact that the search for sparse solutions to optimization problems can be naturally described by ℓ0 regularization, the vast majority of pruning methods rely on manually-designed strategies that are not based on the ℓ0 penalty [3, 4, 9, 10]. The approaches that aim to approximate an ℓ0-regularized problem in order to find sparse, less overparameterized networks are limited in number [11, 12] and fail to perform competitively against heuristic-based pruning methods.

Prior work has shown that pruned networks are hard to train from scratch [3], suggesting that while overparameterization is not necessary for a model's capacity, it might be required for successful training. Frankle and Carbin [13] put this idea into question by training heavily pruned networks

*equal contribution

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

arXiv:1912.04427v4 [cs.LG] 11 Jan 2021


from scratch, while achieving performance matching that of their original counterparts. A key finding is that the same initialization should be used when re-training the pruned network, or, equivalently, that better pruning strategies – ones that depend on future values of the weights – can result in trainable pruned networks.

More recently, Frankle et al. [14] show that although this approach can fail in large-scale settings, pruned networks can be successfully re-trained when parameters from very early in training are used as initialization. Coupling a pruned network with a set of parameter values from initialization yields a ticket – a winning ticket if it is able to match the dense model's performance when trained in isolation for a comparable number of iterations. These have already found applications in, for example, transfer learning [15, 16, 17], making ticket search a problem of independent interest.

Iterative Magnitude Pruning (IMP) [13], the first and currently only algorithm able to find winning tickets, consists of repeating a two-stage procedure that alternates between training and pruning. IMP relies on a sensible choice of pruning strategy [18] and can be costly: maximizing the performance of the found subnetworks typically requires multiple rounds of training followed by pruning [19].

In this paper, we focus on two questions related to pruning and the Lottery Ticket Hypothesis. First, can we find sparse networks with competitive performance by approximating ℓ0 regularization instead of relying on a heuristic pruning strategy and, if so, what are the missing ingredients in previous approaches [11, 12]? Second, would a method that relies on ℓ0 regularization, rather than an ad-hoc heuristic, be able to find winning tickets, as IMP does?

We provide positive answers to both questions by proposing Continuous Sparsification¹, a new pruning method that relies on approximating the intractable ℓ0 penalty and finds networks that perform competitively when either fine-tuned or re-trained. Unlike prior ℓ0-based approaches, our approximation is deterministic, providing insights and raising questions on how pruning and sparse regularization should be performed. The core of our method lies in constructing a smooth continuation path [20] connecting training of soft-gated parameters and the intractable ℓ0-regularized objective.

Contributions:
• We propose a novel approximation to ℓ0 regularization, resulting in Continuous Sparsification, a new pruning method with theoretical and empirical advantages over previous ℓ0-based approaches. We show through experiments that the deterministic nature of our re-parameterization is key to achieving competitive results with ℓ0 approximations.

• We show that Continuous Sparsification outperforms state-of-the-art heuristic-based pruning methods. Our experiments include pruning of VGG-16 [21] and ResNet-20 [22] trained on CIFAR-10 [23], and ResNet-50 trained on ImageNet [24].

• Our method raises questions on how to do better ticket search – producing subnetworks that can be re-trained from early iterates. We show empirically that Continuous Sparsification is capable of finding subnetworks of VGG-16, ResNet-20, and ResNet-50 that, when re-trained, outperform ones found by IMP. Moreover, the search cost of our method does not depend on the produced subnetwork's sparsity, making ticket search considerably more efficient when run in parallel.

2 Preliminaries

Here we define terms used throughout the paper.

Subnetwork: For a network f that maps samples x ∈ X and parameters w ∈ R^d to f(x; w), a subnetwork f′ of f is given by a binary mask m ∈ {0, 1}^d, where a parameter component w_i is kept in f′ if m_i = 1 and removed otherwise, i.e., f′ : x, w ↦ f(x; w ⊙ m), with ⊙ denoting element-wise multiplication. For any configuration m, the effective parameter space of the induced network f′ is {w ⊙ m | w ∈ R^d} – a ‖m‖0-dimensional space – hence we say that the subnetwork f′ has ‖m‖0 many parameters instead of d.
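To make the masking operation concrete, here is a minimal PyTorch sketch; the toy network f, the tensor shapes, and the helper name masked_forward are our own illustrations, not part of the paper.

```python
import torch

def masked_forward(f, x, w, m):
    """Evaluate the subnetwork f'(x, w) = f(x; w ⊙ m) induced by a binary mask m."""
    return f(x, w * m)  # element-wise product: w_i is kept only where m_i = 1

# Toy example: a linear "network" with d = 4 parameters.
f = lambda x, w: x @ w
w = torch.randn(4)                      # dense parameters, w ∈ R^d
m = torch.tensor([1.0, 0.0, 1.0, 0.0])  # binary mask with ‖m‖0 = 2
x = torch.randn(3, 4)                   # a batch of 3 samples
y = masked_forward(f, x, w, m)          # subnetwork with 2 effective parameters
```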

Matching subnetwork: For a network f and randomly-initialized parameters w^(0), a matching subnetwork f′ of f is given by a configuration m ∈ {0, 1}^d such that f′ can be trained in isolation from w′^(0) = w^(k) ⊙ m, where w^(k) is the collection of parameter values obtained by training f from w^(0) for k iterations, and k is small. Moreover, to be a matching subnetwork, f′ needs to match the performance of a trained f given the same budget, when measured in terms of training iterations.

¹Code available at https://github.com/lolemacs/continuous-sparsification


Winning ticket: For a network f and randomly-initialized parameters w^(0), a winning ticket is a matching subnetwork f′ of f that can be trained in isolation from initialization, i.e., w′^(0) = w^(0) ⊙ m. In other words, a winning ticket is a matching subnetwork such that k = 0 in the definition above.

Ticket search is the task of finding matching subnetworks given a network f and randomly-initialized parameters w^(0). We say that an algorithm A performs ticket search if A(f, w^(0)) = m ∈ {0, 1}^d, such that m induces a (possibly matching) subnetwork f′.

3 Related Work

3.1 Sparse Networks

Classical pruning methods [25] follow a pre-defined strategy to remove weights, and generally operate by ranking parameters according to an easy-to-compute statistic like weight magnitude [3]. Such methods rely on the assumption that the considered statistic is a sensible surrogate for how much each parameter affects a network's output, and typically select weights for removal once the dense model has been fully trained. Magnitude-based pruning, the most prominent heuristic pruning method, improves when given multiple rounds of training followed by pruning [3, 26].

Another approach consists of approximating an intractable ℓ0-regularized objective which accounts for the number of non-zero weights in the model, yielding one-stage procedures that can be fully described in the optimization framework. More common in the literature are stochastic approximations, where a binary mask m over the weights is sampled from a distribution M(s) at each training iteration, introducing new variables s which are optimized jointly with the weight parameters [11, 12].

Training the mask parameters s is done by estimating the gradients of the expected loss w.r.t. s, e.g., via the straight-through estimator [27], thus relying on estimated gradients which can be biased and have high variance. ℓ0-based methods have the advantage of not relying on a heuristic to prune weights, and they continuously sparsify the network during training instead of at pre-defined steps.

3.2 Lottery Ticket Hypothesis

Frankle and Carbin [13] show that, in some settings, sparse subnetworks can be successfully re-trained and yield better performance than their original dense networks, often also on a smaller compute budget for re-training. This observation leads to the Lottery Ticket Hypothesis [13], which conjectures that for a reasonably-sized network f and randomly-initialized parameters w^(0) ∈ R^d, there exists a sparse subnetwork f′, given by a configuration m ∈ {0, 1}^d with ‖m‖0 ≪ d, that can be trained from w′^(0) = m ⊙ w^(0) to perform comparably to a trained version of the original model f.

The proposal of Iterative Magnitude Pruning (IMP; Algorithm 1) [19] supports this hypothesis. IMP is capable of finding such subnetworks, named winning tickets, in convolutional networks trained for image classification. IMP operates in multiple rounds, sparsifying the network at discrete time intervals and producing subnetworks with increasing sparsity levels during execution. More specifically, each round in IMP consists of: (1) training the weights w of a network, (2) pruning a fixed fraction of the weights with the smallest magnitude, and (3) rewinding: setting the remaining weights back to their original initialization w^(0).
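As a reference point for step (2), the sketch below shows one way to implement magnitude pruning for a single weight tensor in PyTorch; for global pruning, the magnitudes of all prunable layers would be pooled before computing the threshold. The function name and the exact tie-breaking are our own choices, not the paper's implementation.

```python
import torch

def magnitude_prune(weights, mask, tau=0.2):
    """Zero out the tau fraction of currently-kept weights with smallest magnitude.
    `weights` and `mask` are tensors of the same shape; returns the updated binary mask."""
    kept = mask.bool()
    magnitudes = weights[kept].abs()
    k = int(tau * magnitudes.numel())          # number of weights to remove this round
    if k == 0:
        return mask
    threshold = magnitudes.kthvalue(k).values  # k-th smallest magnitude among kept weights
    new_mask = mask.clone()
    new_mask[kept & (weights.abs() <= threshold)] = 0.0
    return new_mask
```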

Following Frankle et al. [19], we consider a general form of IMP where step (3) is relaxed to rewind the weights to an early iterate w^(k) (for relatively small k) instead of the original initialization values w^(0). We refer to the process of searching for a sparse subnetwork and a set of early iterates w^(k) as ticket search, even though the produced subnetworks are truly only winning tickets when they perform comparably to the dense model when trained in isolation from w^(0), i.e., k = 0.

The search for winning tickets has attracted attention due to their valuable properties. In small-scale settings, tickets can be trained faster than their dense counterparts while yielding better final performance [13]. Moreover, they can be transferred between datasets [16, 17] and training methods [15]. Zhou et al. [18] attempt to better understand the Lottery Ticket Hypothesis through extensive experiments, showing that a stochastic approximation to ℓ0 regularization can be used to perform ticket search with SGD, without ever training the weights (non-retroactive search).


Algorithm 1 Iterative Magnitude Pruning [19]
Input: Pruning ratio τ, number of rounds R, iterations per round T, rewind point k
1: Initialize w ∼ D, m ← 1_d, r ← 1
2: Minimize L(f( · ; m ⊙ w)) for T iterations, producing w^(T)
3: Remove τ percent of the weights with smallest magnitude
4: If r = R, output f( · ; m ⊙ w^(k))
5: Otherwise, set w ← w^(k), r ← r + 1 and go back to step 2, thereby starting a new round

Algorithm 2 Continuous Sparsification
Input: Mask init s^(0), penalty λ, number of rounds R, iterations per round T, rewind point k
1: Initialize w ∼ D, s ← s^(0), β ← 1, r ← 1
2: Minimize L(f( · ; σ(βs) ⊙ w)) + λ‖σ(βs)‖1 for T iterations while increasing β, producing w^(T), s^(T), and β^(T)
3: If r = R, output f( · ; H(s^(T)) ⊙ w^(k))
4: Otherwise, set s ← min(β^(T) s^(T), s^(0)), β ← 1, r ← r + 1 and go back to step 2, thereby starting a new round

4 Method

Our goal is to design a method that can efficiently sparsify networks without causing performance degradation. Ideally, and in contrast to magnitude pruning, the time to produce a subnetwork should be independent of its sparsity. Unlike dominant pruning approaches [3, 4, 9], we rely on approximating ℓ0 regularization, as it induces a clear trade-off between sparsity and performance, providing a way to maximize sparsity while maintaining performance. By continuously sparsifying the network during training, we do not require a heuristic to select which parameters to remove or when to remove them.

To avoid gradient estimators and to avoid having to commit to a configuration for m to be used at inference – obstacles that are inherent to stochastic approximations to the ℓ0 objective – we design a deterministic approximation instead, as we describe below.

4.1 Continuous Sparsification by Learning Deterministic Masks

Given a network f that maps samples x to f(x; w) using parameters w ∈ R^d, we first frame the search for sparse subnetworks as a loss minimization problem with ℓ0 regularization:

    min_{w ∈ R^d}  L(f( · ; w)) + λ · ‖w‖0 ,        (1)

where L(f( · ; w)) denotes the loss incurred by the network f( · ; w) and λ ≥ 0 controls the trade-off between the loss and the number of parameters ‖w‖0. We restate the above minimization problem as

    min_{w ∈ R^d, m ∈ {0,1}^d}  L(f( · ; m ⊙ w)) + λ · ‖m‖1 ,        (2)

which uses the fact that ‖m‖0 = ‖m‖1 for binary m. While the ℓ1 penalty is amenable to subgradient descent, the combinatorial constraint m ∈ {0, 1}^d makes local search unsuited for the problem above.

As in most methods that approximate ℓ0 regularization, we will circumvent the discrete space of m by re-parameterizing it as a function of a newly-introduced variable s ∈ R^d. In contrast to previous work [12, 11], we propose a re-parameterization that is fully deterministic, hence avoiding biased and/or noisy training caused by gradient estimators [27].

Consider an intermediate and still intractable problem, given by defining m := H(s), with s ∈ R^d_{≠0} and H : R_{≠0} → {0, 1} being the Heaviside step function applied element-wise, i.e., H(s) = 1 if s > 0 and 0 otherwise. This yields the following equivalent form for the problem in (2):

    min_{w ∈ R^d, s ∈ R^d_{≠0}}  L(f( · ; H(s) ⊙ w)) + λ · ‖H(s)‖1 .        (3)

Being equivalent to (1), the above is still intractable: the step function H is discontinuous and its derivative is zero everywhere. We approximate H by constructing a set of functions indexed by β ∈ [1, ∞), given by s ↦ σ(βs), where σ is the sigmoid function σ(s) = 1/(1 + e^(−s)) applied element-wise. This set can be seen as a path parameterized by β: given any fixed s ∈ R_{≠0}, we have at one of its endpoints lim_{β→∞} σ(βs) = H(s). Conversely, for β = 1 we have σ(βs) = σ(s), the standard sigmoid activation function that is smooth and widely used in neural network models.
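The limit is easy to verify numerically; the short check below is our own illustration (not from the paper) of the soft gates moving toward {0, 1} as β grows.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

s = np.array([-1.0, -0.1, 0.1, 1.0])   # a fixed s with nonzero entries
for beta in [1, 10, 100, 500]:
    print(beta, np.round(sigmoid(beta * s), 6))
# As beta increases, σ(βs) approaches the step function H(s) = [0, 0, 1, 1].
```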

Using this family of functions to approximate H yields the re-parameterization m := σ(βs). Controlling the inverse temperature β allows interpolation between the sigmoid activation σ(s), which assigns continuous values to m, and the step function H(s) ∈ {0, 1}. Each β induces the objective

    Lβ(w, s) := L(f( · ; σ(βs) ⊙ w)) + λ · ‖σ(βs)‖1 .        (4)

Note that if L is continuous in w, then:

    min_{w ∈ R^d, s ∈ R^d_{≠0}}  lim_{β→∞} Lβ(w, s)  =  min_{w ∈ R^d, s ∈ R^d_{≠0}}  L(f( · ; H(s) ⊙ w)) + λ · ‖H(s)‖1 ,        (5)

where the right-hand side is equivalent to the ℓ0-regularized objective. Therefore, β controls the computational hardness of the objective: as β increases from 1 to ∞, the objective changes from L1(w, s), where a soft gating w ⊙ σ(s) is applied to the weights, to L∞(w, s), where weights are either removed or fully preserved. Increasing the hardness of the underlying objective during training stems from continuation methods [20] and can be successful in approximating intractable problems.

In terms of sparsification, every negative component of s will drive the corresponding component of w ⊙ σ(βs) to 0 as β → ∞, effectively pruning a weight. While analytically it is never the case that σ(βs) = 0 regardless of how large β is, limited numerical precision has a fortunate side-effect of causing actual sparsification of the network during training once β becomes sufficiently large.

In a nutshell, our method consists of learning sparse networks by minimizing Lβ(w, s) for T parameter updates with gradient descent while jointly annealing β, producing w^(T), s^(T) and β^(T). Note that, in order to recover a binary mask m from our re-parameterization, β must be large enough such that, numerically², σ(β^(T) s^(T)) = H(s^(T)). Alternatively, we can directly output m = H(s^(T)) at the end of training, guaranteeing that the learned mask is indeed binary. We adopt an exponential schedule β(t) = (β^(T))^(t/T) for β during training, increasing it from 1 up to β^(T). Such a schedule has the advantage of only requiring us to tune β^(T), and has been successfully utilized in prior work [28].
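A minimal PyTorch sketch of one such round is given below, for a single soft-masked linear layer and toy data; the class and function names (SoftMaskedLinear, beta_schedule) and the training details are illustrative assumptions on our part, not the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftMaskedLinear(nn.Module):
    """Linear layer whose weight is gated by the soft mask σ(βs)."""
    def __init__(self, in_features, out_features, s_init=0.05):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.s = nn.Parameter(torch.full((out_features, in_features), s_init))

    def forward(self, x, beta):
        gate = torch.sigmoid(beta * self.s)          # soft gate in (0, 1)
        return F.linear(x, gate * self.weight, self.bias)

def beta_schedule(t, T, beta_final=200.0):
    """Exponential schedule β(t) = (β^(T))^(t/T), growing from 1 to β^(T)."""
    return beta_final ** (t / T)

# One round: minimize Lβ(w, s) = L(f(·; σ(βs) ⊙ w)) + λ‖σ(βs)‖1 while annealing β.
layer = SoftMaskedLinear(20, 10)
opt = torch.optim.SGD(layer.parameters(), lr=0.1, momentum=0.9)
x, y = torch.randn(32, 20), torch.randint(0, 10, (32,))   # toy data
T, lam = 1000, 1e-8

for t in range(1, T + 1):
    beta = beta_schedule(t, T)
    loss = F.cross_entropy(layer(x, beta), y) + lam * torch.sigmoid(beta * layer.s).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

binary_mask = (layer.s > 0).float()   # m = H(s^(T)): guaranteed binary at the end of the round
```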

4.2 Ticket Search through Continuous Sparsification

The method described above is essentially a pruning method, which we use to replace magnitude-based pruning as the backbone for ticket search. Note that searching for matching subnetworks requires the produced masks to be binary: otherwise, the magnitude of the weights will also be learned. We guarantee that the final mask is binary regardless of numerical precision by outputting H(s^(T)).

Similarly to IMP, our ticket search procedure operates in rounds, where each round consists of training and sparsifying the network. At the beginning of a round, we set β back to 1 so that additional weights can be removed (otherwise β would be large throughout the round, causing a vanishing Jacobian of σ(βs) w.r.t. s). Moreover, we reset the parameter s of each weight w that has not been suppressed by the optimizer during the round (i.e., weights whose gating value has increased during training). This is achieved by setting s ← min(β^(T) s^(T), s^(0)), effectively resetting the soft mask parameters s for "kept" weights without interfering with weights that have been suppressed. Algorithm 2 presents our method for ticket search, which does not rewind weights between rounds, in contrast to IMP [19].
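The round structure of Algorithm 2, including the reset s ← min(β^(T) s^(T), s^(0)) and the absence of weight rewinding, can be sketched as the outer loop below; it assumes a layer exposing a mask parameter `s` (as in the previous snippet) and a user-supplied `train_one_round` that minimizes Lβ for T iterations while annealing β. This is an illustrative sketch, not the released code.

```python
import torch

def continuous_sparsification(layer, train_one_round, s_init=0.05,
                              num_rounds=5, T=1000, beta_final=200.0):
    """Outer loop of ticket search with CS (illustrative sketch).
    Weights are kept between rounds (no rewinding); only β and part of s are reset."""
    for r in range(num_rounds):
        train_one_round(layer, T, beta_final)   # anneals β from 1 up to β^(T) internally
        if r < num_rounds - 1:
            with torch.no_grad():
                # s ← min(β^(T)·s^(T), s^(0)): reset the gates of "kept" weights
                # without reviving weights that were suppressed during the round.
                cap = torch.full_like(layer.s, s_init)
                layer.s.copy_(torch.minimum(beta_final * layer.s, cap))
        # β is implicitly reset to 1: the next round's schedule restarts at β = 1.
    return (layer.s > 0).float()   # final binary mask m = H(s^(T))
```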

4.3 Comparison to Stochastic Approaches

Prior works [11, 12] approximating the ℓ0 objective adopt a stochastic re-parameterization m ∼ M(s) for some distribution M with parameters s. During training, a new binary mask m is sampled from M(s) at every forward pass of the network. Hence, outputs can change drastically from one pass to another due to variance in sampling. Such approaches have found limited success in pruning.

Gale et al. [26] report that the stochastic approach from Louizos et al. [12] fails to sparsify a residual network without degrading its accuracy to random chance. Stochastic approximations introduce another problem: different behavior between training and inference. While a new mask is sampled at each training iteration, at inference it is common to use a deterministic mask, such as the one with highest mass [11] or an approximation to it [12]. This ensures that the outputs at inference are consistent, but can introduce a gap in sparsity and performance between training and inference modes.

²In experiments, we observed that a final temperature of 500 is sufficient for iterates of s when training with SGD using 32-bit precision. The required temperature is likely to depend on how s is represented numerically, as our implementation relies on numerical imprecision rather than (alternatively) clamping after some threshold.

Conversely, Continuous Sparsification offers consistency in training mode – outputs for an input are the same across forward passes – and no gap between training and inference. In Section 5.2, experimental comparisons with stochastic approximations show that these differences play a key role in attaining faster training, higher sparsity, and superior performance when pruning deep networks.

5 Experiments

We compare methods on the tasks of pruning and finding matching subnetworks. We quantify the performance of ticket search by focusing on two specific subnetworks produced by each method:
• Sparsest matching subnetwork: the sparsest subnetwork that, when trained in isolation from an early iterate, yields performance no worse than that achieved by the trained dense counterpart.
• Best performing subnetwork: the subnetwork that achieves the best performance when trained in isolation from an early iterate, regardless of its sparsity.

We also measure the efficiency of each method in terms of the total number of epochs to produce subnetworks, given enough parallel computing resources. As we will see, Continuous Sparsification is particularly suited for parallel execution since it requires relatively few rounds to produce subnetworks regardless of sparsity. On the other hand, CS offers no explicit mechanism to control the sparsity of the found subnetworks, hence producing a subnetwork with a pre-defined sparsity level can require multiple runs with different hyperparameter settings. For this use case, IMP is more efficient by design, since a single run suffices to produce subnetworks with varying, pre-defined sparsity levels.

For Continuous Sparsification, we set hyperparameters λ = 10^−8 and β^(T) = 200, based on the analysis in Appendix A, which studies how λ, s^(0), and β^(T) affect the sparsity of produced subnetworks. We observe that s^(0) has a major impact on sparsity levels, while λ and β^(T) require little to no tuning.

We reiterate that Continuous Sparsification does not perform weight rewinding in the following experiments; rather, it maintains weights between rounds. Our experimental comparisons include a variant of IMP that also does not rewind weights between rounds, which we denote as "continued" IMP (IMP-C). Algorithms 1 and 2 provide more implementation details. Comparisons against a baseline inspired by Zhou et al. [18], and described in Appendix B, on the tasks of learning a supermask and ticket search on a 6-layer CNN can be found in Appendices C and D.

5.1 Ticket Search on Residual Networks and VGG

First, we evaluate how IMP and CS perform on the task of ticket search for VGG-16 [21] and ResNet-20³ [22] trained on the CIFAR-10 dataset, a setting where IMP can take over 10 rounds (850 epochs given 85 epochs per round [19]) to find sparse subnetworks. We follow Frankle and Carbin's setup [13]: in each round, we train with SGD, a learning rate of 0.1, and a momentum of 0.9, for a total of 85 epochs, using a batch size of 64 for VGG and 128 for ResNet. We decay the learning rate by a factor of 10 at epochs 56 and 71, and utilize a weight decay of 0.0001.

For CS, we do not apply weight decay to the mask parameters s, since they are already subject to ℓ1 regularization. Sparsification is performed on all convolutional layers, excluding the two skip-connections of ResNet-20 that have 1×1 kernels: for IMP, their parameters are not pruned, while for CS their weights do not have an associated learnable mask.

We evaluate produced subnetworks by initializing their weights with the iterates from the end of epoch 2, similarly to Frankle et al. [19], followed by re-training. IMP performs global pruning at a per-round rate of removing 20% of the remaining parameters with smallest magnitude. We run IMP for 30 iterations, yielding 30 tickets with varying sparsity levels (80%, 64%, . . . ). To produce tickets of differing sparsity with CS, we vary s^(0) across 11 values from −0.3 to 0.3, performing a run of 5 rounds for each setting. We repeat experiments 3 times, with different random seeds.

³We used the same network as Frankle and Carbin [13] and Frankle et al. [19], who refer to it as ResNet-18.

[Figure 1: two panels – "Ticket Search: VGG-16 on CIFAR-10" and "Ticket Search: ResNet-20 on CIFAR-10" – plotting test accuracy (%) against percentage of weights remaining (%) for Iterative Mag.Pr, Iterative Mag.Pr (continued), CS, and CS (all runs).]

Figure 1: Test accuracy and sparsity of subnetworks produced by IMP and CS after re-training from weights of epoch 2. Purple curves show individual runs of CS, while the green curve connects tickets produced after 5 rounds of CS with varying s^(0). Iterative Magnitude Pruning (continued) refers to IMP without rewinding between rounds. Error bars depict variance across 3 runs.

Table 1: Test accuracy and sparsity of the sparsest matching and best performing subnetworks produced by CS, IMP, and IMP-C (IMP without rewinding) for VGG-16 and ResNet-20 trained on CIFAR-10.

                                                VGG-16                                ResNet-20
                                Method    Round   Test Acc.   Weights Rem.     Round   Test Acc.   Weights Rem.
Dense Network                   –         1       92.35%      100.0%           1       90.55%      100.0%
Sparsest Matching Subnetwork    IMP       18      92.36%      1.8%             7       90.57%      20.9%
                                IMP-C     18      92.56%      1.8%             8       91.00%      16.7%
                                CS        5       93.35%      1.7%             5       91.43%      12.3%
Best Performing Subnetwork      IMP       13      92.97%      5.5%             6       90.67%      26.2%
                                IMP-C     12      92.77%      6.9%             4       91.08%      40.9%
                                CS        4       93.45%      2.4%             5       91.54%      16.9%

Figure 1 shows the performance and sparsity of tickets produced by CS and IMP, including IMP without rewinding (continued). Purple curves show individual runs of CS for different values of s^(0), each consisting of 5 rounds, and the green curve shows the performance of subnetworks produced with different hyperparameters. Plots of individual runs are available in Appendix E, but have been omitted here for the sake of clarity. Given a search budget of 5 rounds (i.e., 5 × 85 = 425 epochs), CS successfully finds subnetworks with diverse sparsity levels. Notably, IMP produces tickets with superior performance when weight rewinding is not employed between rounds.

Table 1 summarizes the performance of each method when evaluated in terms of the sparsest matching and best performing subnetworks. IMP-C denotes IMP without rewinding, i.e., IMP (continued) from Figure 1. Sparsest matching subnetworks produced by CS are sparser than the ones found by IMP and IMP-C, while also delivering higher accuracy. CS also outperforms IMP and IMP-C when evaluating the best performing produced subnetworks. In particular, CS yields highly sparse subnetworks that outperform the original model by approximately 1% on both VGG-16 and ResNet-20.

If all runs are executed in parallel, producing all tickets presented in Figure 1 takes CS a total of 5 × 85 = 425 training epochs, while IMP requires 30 × 85 = 2550 epochs instead. Note that our re-parameterization results in approximately 10% longer training times on a GPU due to the mask parameters s, therefore wall-clock time for CS is 10% higher per epoch. Sequential search takes 5 × 11 × 85 = 4675 epochs for CS to produce all tickets in Figure 1, while IMP requires 30 × 85 = 2550 epochs, hence CS is faster given sufficient parallelism, but slower if run sequentially. Appendix F shows preliminary results of a variant of CS designed for sequential search.

5.2 Pruning

Since CS is a general-purpose method to find sparse networks, we also evaluate it on the more standard task of network pruning, where produced subnetworks are fine-tuned instead of re-trained. We compare it against the prominent pruning methods AMC [29], magnitude pruning (MP) [3], GMP [4], and Network Slimming (Slim) [30], along with the ℓ0-based method of Louizos et al. [12] (referred to as "ℓ0"), which, in contrast to ours, adopts a stochastic approximation for ℓ0 regularization.

[Figure 2: two panels – "Pruning: VGG-16 on CIFAR-10" and "Pruning: ResNet-20 on CIFAR-10" – plotting test accuracy (%) against percentage of weights remaining (%) for AMC, GMP, ℓ0, Mag.Pr, Slim, and CS.]

Figure 2: Performance of different methods when performing one-shot pruning on VGG-16 and ResNet-20, measured in terms of test accuracy and sparsity of produced subnetworks after fine-tuning.

Table 2: Sparsity (%) of the sparsest subnetwork within 2% test accuracy of the original dense model, for different pruning methods on CIFAR.

             ℓ0 [12]   AMC     MP      GMP     NetSlim   CS
VGG-16       18.2%     86.0%   97.5%   98.0%   99.0%     99.6%
ResNet-20    13.6%     50.0%   80.0%   86.0%   85.0%     94.4%

We train VGG-16 and ResNet-20 on CIFAR-10 for 200 epochs, with an initial learning rate of 0.1 which is decayed by a factor of 10 at epochs 80 and 120. The subnetwork is produced at epoch 160 and is then fine-tuned for 40 extra epochs with a learning rate of 0.001. More specifically, at epoch 160 the subnetwork structure is fixed: AMC, MP, GMP and Slim zero out elements in the binary matrix m for the last time, while CS fixes m = H(s) and stops training of the mask parameters s.

Adopting the inference behavior suggested in Louizos et al. [12] for ℓ0, i.e., using the expected value of the uniform distribution to generate hard concrete samples, leads to poor results, including accuracy akin to random guessing at sparsity above 90%; this is also reported in Gale et al. [26]. Instead, at epoch 160, we sample different masks and commit to the one that performs the best – this strategy results in drastic improvements at high sparsity levels. This suggests that the gap between training and inference behavior introduced by stochastic approaches can be an obstacle. Although our modification improves results for ℓ0, the method still performs poorly compared to alternatives.

Moreover, some methods required modifications as they were originally designed to perform structured pruning. For AMC, Slim, and ℓ0 we replace a filter-wise mask by one that acts over weights. Since Network Slimming relies on the filter-wise scaling factors of batch norm, we introduce weight-wise scaling factors which are trained jointly with the weights. We observe that applying both ℓ1 and ℓ2 regularization to the scaling parameters, as done by Liu et al. [30], yields inferior performance, which we attribute to over-regularization. A grid search over the penalty of each norm regularizer shows that only applying ℓ1 regularization with a strength of λ1 = 10^−5 for ResNet-20 and λ1 = 10^−6 for VGG-16 improves results.
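For concreteness, the weight-wise scaling adaptation described above could look like the sketch below; the module name, shapes, and the placement of the penalty are our assumptions, with λ1 = 10^−5 as quoted for ResNet-20.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightScaledConv2d(nn.Module):
    """Convolution with a per-weight scaling factor g, trained jointly with the weights
    (an illustrative weight-wise adaptation of Network Slimming's channel-wise scaling)."""
    def __init__(self, in_ch, out_ch, kernel_size, padding=1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, kernel_size, kernel_size) * 0.01)
        self.g = nn.Parameter(torch.ones(out_ch, in_ch, kernel_size, kernel_size))
        self.padding = padding

    def forward(self, x):
        return F.conv2d(x, self.g * self.weight, padding=self.padding)

def scale_l1_penalty(model, lam=1e-5):
    """ℓ1 regularization applied to the scaling factors only, added to the training loss."""
    return lam * sum(m.g.abs().sum() for m in model.modules()
                     if isinstance(m, WeightScaledConv2d))
```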

Figure 2 displays one-shot pruning results. On VGG, only CS and Slim successfully prune over 98% of the weights without severely degrading the performance of the model, while on ResNet the best results are achieved by CS and GMP. Table 2 shows the percentage of weights that each method can remove while maintaining performance within 2% of the original, dense model. CS is capable of removing significantly more parameters than all competing methods on both networks: on ResNet-20, the pruned network found by CS contains 60% fewer parameters than the one found by GMP, when counting prunable parameters only. CS not only offers significantly superior performance compared to the prior ℓ0-based method of Louizos et al. [12], but also comfortably outperforms all other methods, providing a new state-of-the-art for network pruning.


5.3 Residual Networks on ImageNet

We perform pruning and ticket search for ResNet-50 trained on ImageNet [24]. Following Frankle et al. [19], we train the network with SGD for 90 epochs, with an initial learning rate of 0.1 that is decayed by a factor of 10 at epochs 30 and 60. We use a batch size of 256 distributed across 4 GPUs and a weight decay of 0.0001. We run CS for a single round due to the high computational cost of training ResNet-50 on ImageNet. Once the round is complete, we evaluate the performance of the produced subnetwork when fine-tuned (pruning) or re-trained from an early iterate (ticket search).

Table 3: Performance of found ResNet-50 subnetworks on ImageNet.

Method   Top-1 Acc.   Sparsity
GMP      73.9%        90.0%
DNW      74.0%        90.0%
STR      74.3%        90.2%
IMP†     73.6%        90.0%
CS†      75.5%        91.8%

GMP      70.6%        95.0%
DNW      68.3%        95.0%
STR      70.4%        95.0%
CS       72.4%        95.3%
IMP†     69.2%        95.0%
CS†      71.1%        95.3%

STR      67.2%        96.5%
CS       71.4%        97.1%
CS†      69.6%        97.1%

GMP      57.9%        98.0%
DNW      58.2%        98.0%
STR      61.5%        98.5%
CS       70.0%        98.0%
CS†      67.9%        98.0%

GMP      44.8%        99.0%
STR      54.8%        98.8%
CS       66.8%        98.9%
CS†      64.9%        98.9%

We run CS with s^(0) ∈ {0.0, −0.01, −0.02, −0.03, −0.05}, yielding 5 subnetworks with varying sparsity levels. Table 3 summarizes the results achieved by CS, IMP, and the current state-of-the-art pruning methods GMP [4], STR [10], and DNW [9]. A † superscript denotes results of a re-trained, rather than fine-tuned, subnetwork. Differences in each technique's methodology – for example, the adopted learning rate schedule and number of epochs – complicate the comparison.

CS produces subnetworks that, when re-trained, outperform the ones found by IMP by a comfortable margin (compare CS† and IMP†). Moreover, when evaluated as a pruning method, CS outperforms all competing approaches, especially in the high-sparsity regime. Therefore, our method provides state-of-the-art results whether the network is fine-tuned (pruning) or re-trained (ticket search).

6 Discussion

With Frankle and Carbin [13], we now realize that sparse subnetworks can indeed be successfully trained from scratch or from an early iterate, putting in question whether overparameterization is required for proper optimization of neural networks. Such subnetworks can potentially decrease the required resources for training deep networks, as they are shown to transfer between different, but similar, tasks [16, 17].

The search for winning tickets is a poorly explored problem, with, prior to our work, Iterative Magnitude Pruning [13] standing as the only algorithm suited for this task. It is unclear whether IMP's key ingredients – post-training magnitude pruning and parameter rewinding – are the correct choices. Here, we approach the problem of finding sparse subnetworks as an ℓ0-regularized optimization problem, which we approximate through a smooth relaxation of the step function.

Our proposed algorithm, Continuous Sparsification, relies on a deterministic approximation of ℓ0 regularization, removes parameters automatically and continuously during training, and can be fully described by the optimization framework. We show empirically that, indeed, post-training pruning might not be the most sensible choice for ticket search, raising questions on how the search for tickets differs from standard network compression. In tasks such as pruning VGG and finding winning tickets in ResNets, our method offers improvements in terms of ticket search and resulting sparsity – we can sparsify VGG to extreme levels, and speed up ticket search using an efficiently parallelizable framework. We hope to further motivate the problem of quickly finding tickets in complex networks, as the task might be highly relevant to transfer learning and mobile applications.

At the same time, Continuous Sparsification serves as a practical network pruning method, outperforming modern competitors as measured by accuracy and sparsity of produced subnetworks. Continuous Sparsification's principled formulation has the potential to open new avenues for research into neural network optimization and architecture search.


Acknowledgments and Disclosure of Funding

We thank the anonymous reviewers for providing extensive and extremely valuable feedback on earlier drafts of this work.

The University of Chicago CERES Center contributed to the financial support of Pedro Savarese. The authors have no competing interests.

References

[1] Allen-Zhu, Z., Y. Li, Z. Song. A convergence theory for deep learning via over-parameterization. In ICML. 2019.
[2] Neyshabur, B., Z. Li, S. Bhojanapalli, et al. The role of over-parametrization in generalization of neural networks. In ICLR. 2019.
[3] Han, S., J. Pool, J. Tran, et al. Learning both weights and connections for efficient neural networks. In NeurIPS. 2015.
[4] Zhu, M., S. Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv:1710.01878, 2017.
[5] Han, S., H. Mao, W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR. 2016.
[6] Iandola, F. N., M. W. Moskewicz, K. Ashraf, et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. arXiv:1602.07360, 2016.
[7] Liu, H., K. Simonyan, Y. Yang. DARTS: Differentiable architecture search. In ICLR. 2019.
[8] Savarese, P., M. Maire. Learning implicitly recurrent CNNs through parameter sharing. In ICLR. 2019.
[9] Wortsman, M., A. Farhadi, M. Rastegari. Discovering neural wirings. In NeurIPS. 2019.
[10] Kusupati, A., V. Ramanujan, R. Somani, et al. Soft threshold weight reparameterization for learnable sparsity. In ICML. 2020.
[11] Srinivas, S., A. Subramanya, R. Venkatesh Babu. Training sparse neural networks. arXiv:1611.06694, 2016.
[12] Louizos, C., M. Welling, D. P. Kingma. Learning sparse neural networks through l0 regularization. In ICLR. 2018.
[13] Frankle, J., M. Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR. 2019.
[14] Frankle, J., G. Karolina Dziugaite, D. M. Roy, et al. Linear mode connectivity and the lottery ticket hypothesis. In ICML. 2020.
[15] Morcos, A. S., H. Yu, M. Paganini, et al. One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. In NeurIPS. 2019.
[16] Mehta, R. Sparse transfer learning via winning lottery tickets. arXiv:1905.07785, 2019.
[17] Soelen, R. V., J. W. Sheppard. Using winning lottery tickets in transfer learning for convolutional neural networks. In IJCNN. 2019.
[18] Zhou, H., J. Lan, R. Liu, et al. Deconstructing lottery tickets: Zeros, signs, and the supermask. In NeurIPS. 2019.
[19] Frankle, J., G. Karolina Dziugaite, D. M. Roy, et al. Stabilizing the lottery ticket hypothesis. arXiv:1903.01611, 2019.
[20] Allgower, E. L., K. Georg. Introduction to Numerical Continuation Methods. 2003.
[21] Simonyan, K., A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR. 2015.
[22] He, K., X. Zhang, S. Ren, et al. Deep residual learning for image recognition. In CVPR. 2016.
[23] Krizhevsky, A. Learning multiple layers of features from tiny images. Tech. rep., 2009.
[24] Russakovsky, O., J. Deng, H. Su, et al. ImageNet large scale visual recognition challenge. IJCV, 2015.
[25] LeCun, Y., J. S. Denker, S. A. Solla. Optimal brain damage. In NIPS. 1990.
[26] Gale, T., E. Elsen, S. Hooker. The state of sparsity in deep neural networks. In ICML. 2019.
[27] Bengio, Y., N. Léonard, A. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv:1308.3432, 2013.
[28] Jang, E., S. Gu, B. Poole. Categorical reparameterization with gumbel-softmax. In ICLR. 2019.
[29] He, Y., J. Lin, Z. Liu, et al. AMC: AutoML for model compression and acceleration on mobile devices. In ECCV. 2018.
[30] Liu, Z., J. Li, Z. Shen, et al. Learning efficient convolutional networks through network slimming. In ICCV. 2017.
[31] Kingma, D. P., J. Ba. Adam: A method for stochastic optimization. In ICLR. 2015.


Appendix

A Hyperparameter Analysis

A.1 Continuous Sparsification

In this section, we study how the hyperparameters of Continuous Sparsification affect its behavior in terms of sparsity and performance of the produced tickets. More specifically, we consider the following hyperparameters:

• Final temperature β^(T): the final value for β, which controls how close the proxy objective Lβ(w, s) is to the original ℓ0-regularized problem.
• ℓ1 penalty λ: the strength of the ℓ1 regularization applied to the soft mask σ(βs), which promotes sparsity.
• Mask initial value s^(0): the value used to initialize all components of the soft mask m = σ(βs), where smaller values promote sparsity.

Our setup is as follows. To analyze how each of the 3 hyperparameters impacts the performance of Continuous Sparsification, we train ResNet-20 on CIFAR-10 (following the same protocol from Section 5.1), varying one hyperparameter while keeping the other two fixed. To capture how hyperparameters interact with each other, we repeat the described experiment with different settings for the fixed hyperparameters.

Since different hyperparameter settings naturally yield vastly distinct sparsity and performance for the found tickets, we report relative changes in accuracy and in sparsity.

In Figure 3, we vary λ between 0 and 10^−8 for three different (s^(0), β^(T)) settings: (s^(0) = −0.2, β^(T) = 100), (s^(0) = 0.05, β^(T) = 200), and (s^(0) = −0.3, β^(T) = 100). As we can see, there is little impact on either the performance or the sparsity of the found ticket, except for the case where s^(0) = 0.05 and β^(T) = 200, for which λ = 10^−8 yields slightly increased sparsity.

Figure 3: Impact on relative test accuracy and sparsity of tickets found in a ResNet-20 trained on CIFAR-10, for different values of λ and fixed settings for β^(T) and s^(0).

Next, we consider the fixed settings (s^(0) = −0.2, λ = 10^−10), (s^(0) = 0.05, λ = 10^−12), and (s^(0) = −0.3, λ = 10^−8), and proceed to vary the final inverse temperature β^(T) between 50 and 200. Figure 4 shows the results: in all cases, a larger β of 200 yields better accuracy. However, it decreases sparsity compared to smaller temperature values for the settings (s^(0) = −0.2, λ = 10^−10) and (s^(0) = −0.3, λ = 10^−8), while at the same time increasing sparsity for (s^(0) = 0.05, λ = 10^−12). While larger β appears beneficial and might suggest that even higher values should be used, note that the larger β^(T) is, the earlier in training the gradients of s will vanish, at which point training of the mask will stop. Since the performance for temperatures between 100 and 200 does not change significantly, we recommend values around 150 or 200 when either pruning or performing ticket search.

Figure 4: Impact on relative test accuracy and sparsity of tickets found in a ResNet-20 trained on CIFAR-10, for different values of β^(T) and fixed settings for λ and s^(0).

Figure 5: Impact on relative test accuracy and sparsity of tickets found in a ResNet-20 trained on CIFAR-10, for different values of s^(0) and fixed settings for β^(T) and λ.

Lastly, we vary the initial mask value s^(0) between −0.3 and +0.3, with hyperparameter settings (β^(T) = 100, λ = 10^−10), (β^(T) = 200, λ = 10^−12), and (β^(T) = 100, λ = 10^−8). Results are given in Figure 5: unlike the exploration on λ and β^(T), we can see that s^(0) has a strong and consistent effect on the sparsity of the found tickets. For this reason, we suggest proper tuning of s^(0) when the goal is to achieve a specific sparsity value. Since the percentage of remaining weights is monotonically increasing with s^(0), we can employ search strategies over values for s^(0) to achieve pre-defined desired sparsity levels (e.g., binary search). In terms of performance, lower values for s^(0) naturally lead to performance degradation, since sparsity quickly increases as s^(0) becomes more negative.
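One such search strategy is a simple bisection over s^(0), sketched below; `run_cs` is a placeholder for a full CS run that returns the resulting fraction of weights remaining (this is our own illustration, not part of the paper's experiments).

```python
def search_s_init(run_cs, target_remaining, lo=-0.3, hi=0.3, steps=8):
    """Bisection over the mask initialization s^(0), using the (approximate) monotonicity
    of the fraction of weights remaining as a function of s^(0)."""
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if run_cs(mid) > target_remaining:
            hi = mid   # too many weights kept: decrease s^(0) to increase sparsity
        else:
            lo = mid   # too sparse: increase s^(0)
    return (lo + hi) / 2.0
```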

[Figure 6 panel: "Ticket Search using IMP: ResNet-20 on CIFAR-10" – test accuracy (%) vs. percentage of weights remaining (%) for IMP with prune rates 0.1, 0.2, 0.4, and 0.6.]

Figure 6: Performance of tickets found by Iterative Magnitude Pruning in a ResNet-20 trained on CIFAR-10, for different pruning rates.

A.2 Iterative Magnitude Pruning

Here, we assess whether the running time of Iterative Magnitude Pruning can be improved by increasing the amount of parameters pruned at each iteration. The goal of this experiment is to evaluate if better tickets (both in terms of performance and sparsity) can be produced by more aggressive pruning strategies.

Following the same setup as the previous section, we train ResNet-20 on CIFAR-10. We run IMP for 30 iterations, performing global pruning with different pruning rates at the end of each iteration. Figure 6 shows that the performance of tickets found by IMP decays when the pruning rate is increased to 40%. In particular, the final performance of found tickets is mostly monotonically decreasing with the number of remaining parameters, suggesting that, in order to find tickets which outperform the original network, IMP is not compatible with more aggressive pruning rates.

B Iterative Stochastic Sparsification

Algorithm 3 Iterative Stochastic Sparsification (inspired by [18])
Input: Mask init s^(0), penalty λ, number of rounds R, iterations per round T, rewind point k
1: Initialize w ∼ D, s ← s^(0), r ← 1
2: Minimize E_{m∼Ber(σ(s))} [L(f( · ; m ⊙ w))] + λ‖σ(s)‖1 for T iterations, producing w^(T) and s^(T)
3: If r = R, sample m ∼ Ber(σ(s^(T))) and output f( · ; m ⊙ w^(k))
4: Otherwise, set w ← w^(k), set s ← −∞ for components of s where s^(T) < s^(0), set r ← r + 1, and go back to step 2, starting a new round

Besides comparing our proposed method to Iterative Magnitude Pruning (Algorithm 1), we also design a baseline method, Iterative Stochastic Sparsification (ISS, Algorithm 3), motivated by the procedure in Zhou et al. [18] to find a binary mask m with gradient descent in an end-to-end fashion. More specifically, ISS uses a stochastic re-parameterization m ∼ Bernoulli(σ(s)) with s ∈ R^d, and trains w and s jointly with gradient descent and the straight-through estimator [27]. Note that the method is also similar to the one proposed by Srinivas et al. [11] to prune networks. The goal of this baseline and comparisons is to evaluate whether the deterministic nature of CS's re-parameterization is advantageous when performing sparsification through optimization methods.

[Figure 7: two panels – test accuracy (%) vs. epoch, and test accuracy (%) vs. percentage of weights remaining (%), comparing SS and CS.]

Figure 7: Learning a binary mask with weights frozen at initialization with Stochastic Sparsification (SS, Algorithm 3 with one iteration) and Continuous Sparsification (CS), on a 6-layer CNN on CIFAR-10. Left: Training curves with hyperparameters for which masks learned by SS and CS were both approximately 50% sparse. CS learns the mask significantly faster while attaining similar early-stop performance. Right: Sparsity and test accuracy of masks learned with different settings for SS and CS: our method learns sparser masks while maintaining test performance, while SS is unable to successfully learn masks with over 50% sparsity.

When run for multiple iterations, all components of the mask parameters s which have decreased in value from initialization are set to −∞, such that the corresponding weight is permanently removed from the network. While this might look arbitrary, we observe empirically that ISS was unable to remove weights quickly without this step unless λ was chosen to be large – in which case the model's performance decreases in exchange for sparsity.

We also observe that the mask parameters s require different settings in terms of optimization to be successfully trained. In particular, Zhou et al. [18] use SGD with a learning rate of 100 when training s, which is orders of magnitude larger than the one used when training CNNs. Our observations are similar, in that typical learning rates on the order of 0.1 cause s to be barely updated during training, which is likely a side-effect of using gradient estimators to obtain update directions for s. The following sections present experiments that compare IMP, CS and ISS on ticket search tasks.
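For reference, the stochastic re-parameterization used by SS/ISS can be sketched in PyTorch as below; the module name and initialization are illustrative, and the straight-through trick simply copies the gradient of σ(s) onto the non-differentiable Bernoulli sample.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BernoulliMaskedLinear(nn.Module):
    """Linear layer with a stochastic mask m ~ Bernoulli(σ(s)), trained with the
    straight-through estimator: the forward pass uses the sample, while the
    backward pass treats the mask as σ(s)."""
    def __init__(self, in_features, out_features, s_init=1.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.s = nn.Parameter(torch.full((out_features, in_features), s_init))

    def forward(self, x):
        probs = torch.sigmoid(self.s)
        sample = torch.bernoulli(probs).detach()   # fresh binary mask at every forward pass
        m = sample + probs - probs.detach()        # straight-through: ∂m/∂s ≈ ∂σ(s)/∂s
        return F.linear(x, m * self.weight)
```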

C Supermask Search on a 6-layer CNN

We train a neural network with 6 convolutional layers on the CIFAR-10 dataset [23], following Frankle and Carbin [13]. The network consists of three blocks of two resolution-preserving convolutional layers followed by 2×2 max-pooling, where convolutions in each block have 64, 128, and 256 channels, a 3×3 kernel, and are immediately followed by ReLU activations. The blocks are followed by fully-connected layers with 256, 256, and 10 neurons, with ReLUs in between. The network is trained with Adam [31] with a learning rate of 0.0003 and a batch size of 60.

As a first baseline, we consider the task of learning a "supermask" [18]: a binary mask m that aims to maximize the performance of a network with randomly initialized weights once the mask is applied. This task is equivalent to pruning a randomly-initialized network, since weights are neither updated during the search for the supermask, nor for the comparison between different methods.

We only compare ISS and CS for this specific experiment: the reason not to consider IMP is that, since the network weights are kept at their initialization values, IMP amounts to removing the weights whose initializations were the smallest. Hence, we compare ISS and CS, where each method is run for a single round composed of 100 epochs. In this case, where it is run for a single round, ISS is equivalent to the algorithm proposed in Zhou et al. [18] to learn a supermask, referred to here simply as Stochastic Sparsification (SS). We control the sparsity of the learned masks by varying s(0) and λ. All parameters are trained using Adam and a learning rate of 3 × 10−4, excluding the mask parameters s for SS, for which we adopt SGD with a learning rate of 100 – following Zhou et al. [18] and the discussion in the previous section.

Figure 8: Accuracy and sparsity of tickets produced by IMP, ISS and CS after re-training, starting from initialization. Tickets are extracted from a Conv-6 network trained on CIFAR-10. Purple curves show individual runs of CS, while the green curve connects tickets produced after 4 rounds of CS with varying s(0). Blue and red curves show the performance and sparsity of tickets produced by IMP and ISS, respectively. Error bars depict variance across 3 runs.

Figure 7 presents results: CS is capable of finding high-performing sparse supermasks (i.e., 25% or fewer remaining weights while yielding 75% test accuracy), while SS fails at finding competitive supermasks for sparsity levels above 50%. Moreover, CS makes faster progress in training, suggesting that not relying on gradient estimators indeed results in better optimization and faster progress when measured in epochs or parameter updates.

D Ticket Search on a 6-layer CNN

In what follows, we compare IMP, ISS and CS in the task of finding winning tickets on the Conv-6 architecture used in the supermask experiments in Appendix C. The goal of these experiments is to assess how our deterministic re-parameterization compares to the common stochastic approximations to ℓ0-regularization [11, 12, 18]. Therefore, we run CS with weight rewinding between rounds, so that we remove any advantage that might be caused by not performing weight rewinding – in this case, we better isolate the effects caused by our re-parameterization. Following Frankle and Carbin [13], we re-train the produced tickets from their values at initialization (i.e., k = 0 in each algorithm).
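The round structure used here can be summarized as follows. In the sketch, train_one_round and extract_mask are hypothetical callbacks standing in for a full training loop and for binarizing the learned mask, and the weight/mask split again assumes mask parameters named ".s"; this is an illustrative reconstruction, not the implementation used in our experiments.

import torch

def cs_ticket_search_with_rewinding(model, train_one_round, extract_mask, num_rounds=4):
    # Snapshot of the network weights at initialization (k = 0).
    init_weights = {n: p.detach().clone()
                    for n, p in model.named_parameters() if not n.endswith(".s")}
    for _ in range(num_rounds):
        train_one_round(model)  # jointly train weights w and mask parameters s
        with torch.no_grad():
            for n, p in model.named_parameters():
                if not n.endswith(".s"):
                    p.copy_(init_weights[n])  # rewind weights, keep the mask
    # The ticket is the binary mask read off from the final mask parameters;
    # it is then re-trained from the same initialization for evaluation.
    return extract_mask(model)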

We run IMP and ISS for a total of 30 rounds, each consisting of 40 epochs. Parameters are trained with Adam [31] with a learning rate of 3 × 10−4, following Frankle and Carbin [13]. For IMP, we use pruning rates of 15%/20% for convolutional/dense layers. We initialize the Bernoulli parameters of ISS with s(0) = 1 (the all-ones vector), and train them with SGD and a learning rate of 20, along with an ℓ1 regularization of λ = 10−8. For CS, we train both the weights and the mask with Adam and a learning rate of 3 × 10−4. Each run of CS is limited to 4 rounds, and we perform a total of 16 runs, each with a different value for the mask initialization s(0), ranging from −0.2 up to 0.1. Runs are repeated with 3 different random seeds so that error bars can be computed.
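For reference, the settings above can be collected in one place; the dictionary below is purely an organizational convenience (names and structure are ours), with all values taken from the text.

conv6_ticket_search_settings = {
    "IMP": {"rounds": 30, "epochs_per_round": 40,
            "optimizer": ("Adam", 3e-4),
            "pruning_rate": {"conv": 0.15, "dense": 0.20}},
    "ISS": {"rounds": 30, "epochs_per_round": 40,
            "weight_optimizer": ("Adam", 3e-4),
            "mask_optimizer": ("SGD", 20.0),
            "s_init": 1.0, "l1_penalty": 1e-8},
    "CS":  {"rounds_per_run": 4, "runs": 16, "seeds": 3,
            "optimizer": ("Adam", 3e-4),
            "s_init_range": (-0.2, 0.1)},
}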

Figure 8 presents tickets produced by each method, measured by their sparsity and test accuracy when trained from scratch. Even when performing weight rewinding, CS produces tickets that are significantly superior to the ones found by ISS, both in terms of sparsity and test accuracy, showing that our deterministic re-parameterization is fundamental to finding winning tickets.

E Additional Plots for Ticket Search Experiments

In Section 5.1, we compare IMP and CS in the task of performing ticket search for ResNet-20 trained on CIFAR-10, where CS was run with 11 different values for s(0) in order to produce tickets with diverse sparsity levels, each run consisting of 5 rounds.

Figures 9 and 10 contain the training curves for each of the 11 settings of s(0) that produce the tickets presented in Figure 1 (left). Purple curves show the performance and sparsity of tickets produced after each of the 5 rounds. The accuracy for each ticket is computed by re-training from early-training weights (epoch 2). For each setting, we also execute IMP with a per-round pruning rate matching CS, which is presented as a blue curve – note that these runs of IMP are different from the ones in Section 5.1, where IMP had a fixed and pre-defined pruning ratio of 20% per round.

The plots show that CS not only adjusts the per-round pruning ratio automatically, but is also superior in terms of which parameters are removed from the network. The bottom right plot of Figure 10 shows curves connecting the tickets presented in all other plots of Figures 9 and 10, where we can see that CS produces superior tickets even when IMP adopts a dynamic pruning ratio that matches that of CS at each round.



Figure 9: Accuracy and sparsity of tickets produced by IMP and CS after re-training, starting from the weights of epoch 2. Tickets were extracted from a ResNet-20 trained on CIFAR-10. Each plot corresponds to a different value of the mask initialization s(0) of CS, ranging from 0.3 to 0.0 (specifically 0.3, 0.2, 0.1, 0.05, 0.03, and 0.0), with IMP adopting the same pruning rate per round. Ticket performance is given by purple curves when produced by CS, while blue shows the performance of IMP and continued IMP (IMP without weight rewinding between rounds).



Figure 10: Accuracy and sparsity of tickets produced by IMP and CS after re-training, starting from the weights of epoch 2. Tickets were extracted from a ResNet-20 trained on CIFAR-10. Each plot corresponds to a different value of the mask initialization s(0) of CS, ranging from −0.03 to −0.3 (specifically −0.03, −0.05, −0.1, −0.2, and −0.3), with IMP adopting the same pruning rate per round. Ticket performance is given by purple curves when produced by CS, while blue shows the performance of IMP and continued IMP (IMP without weight rewinding between rounds). The bottom right plot shows the performance of tickets produced during the runs corresponding to all other plots in Figures 9 and 10.

F Sequential Search with Continuous Sparsification

There might be cases where the goal is either to find a ticket with a specific sparsity value or to produce a set of tickets with varying sparsity levels in a single run – tasks that can be naturally performed with a single run of Iterative Magnitude Pruning. However, Continuous Sparsification has no explicit mechanism to control the sparsity of the produced tickets, and, as shown in Section 5.1 and Appendix E, CS quickly sparsifies the network in the first few rounds and then roughly maintains the number of parameters during the following rounds until the end of the run. In this scenario, IMP has a clear advantage, as a single run suffices to produce tickets with varying, pre-defined sparsity levels.

Figure 11: Accuracy and sparsity of tickets produced by IMP and Sequential CS after re-training, starting from the weights of epoch 2. Tickets are extracted from a ResNet-20 trained on CIFAR-10.

Here, we present a sequential variant of CS, named Sequential Continuous Sparsification, that removes a fixed fraction of the weights at each round, hence being better suited for the task described above. Unlike IMP, this sequential form of CS removes the weights with the lowest mask values s – note the difference from CS, which, given a large enough temperature β, removes all weights whose corresponding mask parameters are negative.
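The difference between the two removal rules can be made concrete with a short sketch; the function names and the use of a per-round order statistic are our own rendering of the description above, not the code used in our experiments.

import torch

def cs_keep(s):
    # Continuous Sparsification: with a large enough temperature beta, the soft
    # mask sigmoid(beta * s) saturates, so all weights with negative s are removed.
    return s > 0

def sequential_cs_keep(s, keep, prune_rate=0.2):
    # Sequential CS: among the weights still kept, remove the fixed fraction
    # `prune_rate` with the lowest mask values s.
    active = s[keep]
    k = int(prune_rate * active.numel())
    if k == 0:
        return keep
    threshold = torch.kthvalue(active, k).values
    return keep & (s > threshold)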

Following the same experimental protocol from Section 5.1, we again perform ticket search on ResNet-20 trained on CIFAR-10. We run Sequential Continuous Sparsification and Iterative Magnitude Pruning for a total of 30 rounds each, with a pruning rate of 20% per round. Note that, unlike the experiments with Continuous Sparsification (the non-sequential form), we perform a single run with s(0) = 0, i.e., no hyperparameters are used to control the sparsity of the produced tickets.

Figure 11 shows the performance of tickets produced by Sequential CS and IMP, indicating that CS might be a competitive method in the sequential search setting. Note that the performance of the tickets produced by Sequential CS is considerably inferior to those found by CS (refer to Section 5.1, Figure 1). Although these results are promising, additional experiments would be required to more thoroughly evaluate the potential of Sequential Continuous Sparsification and its comparison to Iterative Magnitude Pruning.

G Learned Sparsity Structure

To see how CS differs from magnitude pruning in terms of which layers are more heavily pruned by each method, we force the two to prune VGG to the same sparsity level in a single round. We first run CS with s(0) = 0, yielding 94.19% sparsity, and then run IMP with a global pruning rate of 94.19%, producing a sub-network with the same number of parameters.
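To illustrate this comparison, the sketch below shows how a matching global pruning rate could be applied for IMP, and how per-block remaining-weight percentages (the quantity reported in Figure 12) could be computed. Helper names are ours and this is an illustration, not the code used in the experiment.

import torch

def global_magnitude_mask(weights, prune_fraction=0.9419):
    # IMP with a global pruning rate: remove the `prune_fraction` smallest-magnitude
    # weights across all layers jointly, matching the overall sparsity found by CS.
    all_w = torch.cat([w.abs().flatten() for w in weights])
    k = max(1, int(prune_fraction * all_w.numel()))
    threshold = torch.kthvalue(all_w, k).values
    return [w.abs() > threshold for w in weights]

def remaining_per_block(masks, layers_per_block=2):
    # Percentage of weights remaining in each block of two consecutive
    # convolutional layers.
    percentages = []
    for i in range(0, len(masks), layers_per_block):
        block = masks[i:i + layers_per_block]
        kept = sum(m.sum().item() for m in block)
        total = sum(m.numel() for m in block)
        percentages.append(100.0 * kept / total)
    return percentages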

Figure 12 shows the final sparsity of blocks consisting of two consecutive convolutional layers (8 blocks total, since VGG has 16 convolutional layers). CS applies a pruning rate to the first blocks that is roughly twice as aggressive as IMP's. Both methods heavily sparsify the widest layers of VGG (blocks 5 to 8), while still achieving over 91% test accuracy. More heavily pruning earlier layers in CNNs can offer inference speed benefits: due to the increased spatial size of earlier layers' inputs, each weight is used more times and has a larger contribution in terms of FLOPs.

Figure 12: Sparsity patterns learned by CS and IMP for VGG-16 trained on CIFAR-10 – each block consists of 2 non-overlapping consecutive layers of VGG.
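To make the FLOPs argument concrete: each convolutional weight performs one multiply-accumulate per output spatial position, so its FLOP contribution scales with the output feature-map area. The feature-map sizes in the snippet below are illustrative assumptions for a VGG-style network on 32 × 32 inputs, not measurements from our experiments.

def flops_per_conv_weight(out_height, out_width):
    # Each convolutional weight is applied once per output spatial position.
    return out_height * out_width

early = flops_per_conv_weight(32, 32)  # e.g., a weight in an early, 32x32-resolution layer
late = flops_per_conv_weight(4, 4)     # e.g., a weight in a late, 4x4-resolution layer
print(early / late)                    # 64.0: pruning an early weight saves ~64x more FLOPs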
