



TENSORIZING GENERATIVE ADVERSARIAL NETS

Xingwei Cao†, Xuyang Zhao†‡ and Qibin Zhao†

† RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
‡ Saitama Institute of Technology, Saitama, Japan

{xingwei.cao,xuyang.zhao,qibin.zhao}@riken.jp

ABSTRACT

Generative Adversarial Network (GAN) and its variants exhibit state-of-the-art performance in the class of generative models. To capture higher-dimensional distributions, the common learning procedure requires high computational complexity and a large number of parameters. The problem of employing such a massive framework arises when deploying it on a platform with limited computational power, such as mobile phones. In this paper, we present a new generative adversarial framework that represents each layer as a tensor structure connected by multilinear operations, aiming to reduce the number of model parameters by a large factor while preserving the generative performance and sample quality. To learn the model, we employ an efficient algorithm which alternately optimizes the discriminator and the generator. Experimental outcomes demonstrate that our model can achieve a compression rate for model parameters of up to 35 times compared to the original GAN on the MNIST dataset.

Index Terms— Tensor, Tucker Decomposition, Generative Model, Model Compression

1. INTRODUCTION

Generative Adversarial Networks demonstrate state-of-the-art performance within the class of generative models [1]. The success of GANs is due not only to algorithmic advances but also to the recent growth of computational capacity. For example, state-of-the-art generative adversarial models such as [2, 3] utilize enormous computational resources by constructing complex models with a large number of parameters and training them on powerful Graphics Processing Units (GPUs).

A critical problem that accompanies such dependency on powerful computational systems arises when deploying large-scale generative frameworks on platforms with limited computational power (e.g., tablets, smartphones). For instance, mobile devices such as smartphones can only carry a somewhat limited computational system due to their hardware design. Although unsupervised learning (especially generative adversarial learning) could substantially improve the capability and functionality of mobile devices, the trend of employing massive computational resources does not benefit the use of such generative frameworks by the general public.

Fig. 1: Best viewed in color. Visualization of a three-way tensor layer. Given an input three-way tensor X ∈ R^{I×J×K} and three weight matrices U_1 ∈ R^{L×I}, U_2 ∈ R^{M×J} and U_3 ∈ R^{N×K}, the output is Y ∈ R^{L×M×N}. We call the resulting tensor Y after an activation a tensor layer, and U_1, U_2 and U_3 weight matrices.

The necessity of such complex models is partially attributed to the multi-dimensionality of datasets. Natural datasets often possess multi-modal structure, and among the plethora of available datasets, images are often the subject of generative learning frameworks [1, 2, 3]. As it becomes necessary to learn such high-dimensional datasets, models with a large number of parameters have been employed.

Goodfellow et al. adopt multilayer perceptrons (MLPs) to learn and classify such higher-dimensional datasets [1]. Although MLPs can represent rich classes of functions, their major drawbacks are that 1) the dense connections between layers require a large number of parameters, limiting the applicability of the framework in common computational environments, and 2) the vectorization operation discards the rich inter-modal information of the datasets.

In this paper, we propose a new generative framework with the purpose of reducing the number of model parameters while maintaining the quality of generated samples. Our framework is inspired by recent work on applying tensor methods to machine learning models [4, 5]. For illustration, Novikov et al. proposed to use low-rank Tensor-Train (TT) approximations for weight parameters, showing that employing such tensor approximations reduces the space complexity of a model [4]. Although applying TT decomposition to dense matrices yields a large compression factor [4], finding optimal TT-ranks remains a difficult problem.

In our proposed framework, we compress the traditional affine transformation using tensor algebra in order to reduce the number of parameters of fully-connected layers. In particular, all hidden, input and output layers are represented as tensor structures, not as vectors. By treating a multi-dimensional input without vectorization, our model aims to preserve its original multi-modal information while reducing the space complexity of the model by a large factor. The use of tensor algebra thus yields model compression while preserving the multi-modal information of the dataset.

We empirically demonstrate the high compression rate of our proposed model on both benchmark and synthetic datasets. We compare our framework with models without tensorization, showing that the generative learning process can be accomplished with a smaller number of model parameters. In our experiment with a dataset of handwritten digits, we observed that our model achieves a compression rate of 40 times while producing images of comparable quality.

The rest of this paper is organized as follows. We start with a concise review of basic tensor arithmetic and generative adversarial nets in Section 2. Section 3 introduces the tensor layer. In Section 4, we show that state-of-the-art learning algorithms are readily applicable to our framework. Experimental results and details are given in Section 5, followed by the conclusion in Section 6.

2. BACKGROUND

2.1. Tensor algebra

Several tensor operations are necessary to construct a tensor layer. In particular, the n-mode product, the Kronecker product and the Hadamard product are briefly reviewed in this section. For a more comprehensive review of tensor arithmetic and notation, see [6].

The order of a tensor is its number of dimensions. Vectors, matrices and tensors of order three or higher are denoted by lowercase boldface letters (e.g., a), uppercase boldface letters (e.g., A) and calligraphic letters (e.g., X), respectively. Given an Nth-order tensor X ∈ R^{I_1×I_2×···×I_N}, its (i_1, i_2, ..., i_N)-th entry is denoted by x_{i_1 i_2 ... i_N}, where i_n = 1, 2, ..., I_n for all n ∈ [1, N]. The notation [I, J] denotes the set of integers from I to J inclusive.

The mode-n fibers are the vectors obtained by fixing every index of a tensor except the n-th. The mode-n matricization, or mode-n unfolding, of a tensor X is denoted by X_(n); it arranges the mode-n fibers as the columns of the resulting matrix. Given matrices A and B, both of size I × J, their Hadamard product (or element-wise product) is denoted by A ∗ B; the resulting matrix is also of size I × J.
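As a concrete illustration of the mode-n unfolding and the Hadamard product, the short NumPy sketch below follows the column ordering convention of [6] (earlier modes vary fastest); the helper name `unfold` and the example sizes are our own illustrative choices, not part of the original paper.

```python
import numpy as np

def unfold(T, n):
    # mode-n unfolding T_(n): mode-n fibers become columns; with order='F'
    # the remaining modes are ordered so that earlier modes vary fastest, as in [6]
    return np.reshape(np.moveaxis(T, n, 0), (T.shape[n], -1), order='F')

X = np.arange(24).reshape(2, 3, 4)
print(unfold(X, 0).shape, unfold(X, 1).shape, unfold(X, 2).shape)  # (2, 12) (3, 8) (4, 6)

A = np.random.randn(3, 4)
B = np.random.randn(3, 4)
print((A * B).shape)  # Hadamard (element-wise) product A * B, also of size 3 x 4
```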

The n-mode product of a tensor X ∈ R^{I_1×I_2×···×I_N} with a matrix U ∈ R^{J×I_n} is denoted by X ×_n U ∈ R^{I_1×···×I_{n-1}×J×I_{n+1}×···×I_N}, and an entry of the product is defined by

$$(\mathcal{X} \times_n \mathbf{U})_{i_1 \cdots i_{n-1}\, j\, i_{n+1} \cdots i_N} = \sum_{i_n=1}^{I_n} x_{i_1 i_2 \cdots i_N}\, u_{j i_n}. \tag{1}$$
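The n-mode product of Eq. (1) can be sketched in NumPy as a single tensor contraction; the helper name `mode_n_product` and the sizes below are illustrative assumptions, not the authors' code.

```python
import numpy as np

def mode_n_product(T, U, n):
    # X x_n U as in Eq. (1): contract mode n of T (size I_n) with the columns of U (J x I_n);
    # tensordot places the new mode J first, so move it back to position n
    return np.moveaxis(np.tensordot(U, T, axes=(1, n)), 0, n)

X = np.random.randn(4, 5, 6)   # I_1 x I_2 x I_3
U = np.random.randn(7, 5)      # J x I_2
Y = mode_n_product(X, U, 1)    # replaces mode 2 (size 5) by size 7
print(Y.shape)                 # (4, 7, 6)
```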

The Kronecker product of matrices A ∈ R^{I×J} and B ∈ R^{K×L} is a matrix of size IK × JL, denoted by A ⊗ B. The product is defined by

$$\mathbf{A} \otimes \mathbf{B} = \begin{bmatrix} a_{11}\mathbf{B} & a_{12}\mathbf{B} & a_{13}\mathbf{B} & \cdots & a_{1J}\mathbf{B} \\ a_{21}\mathbf{B} & a_{22}\mathbf{B} & a_{23}\mathbf{B} & \cdots & a_{2J}\mathbf{B} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ a_{I1}\mathbf{B} & a_{I2}\mathbf{B} & a_{I3}\mathbf{B} & \cdots & a_{IJ}\mathbf{B} \end{bmatrix}. \tag{2}$$

We use one property of the Kronecker product in this paper. Given a tensor X ∈ R^{I_1×I_2×···×I_N} and a set of matrices A^(n) ∈ R^{J_n×I_n} for n ∈ [1, N],

$$\mathcal{Y} = \mathcal{X} \times_1 \mathbf{A}^{(1)} \times_2 \mathbf{A}^{(2)} \times_3 \cdots \times_N \mathbf{A}^{(N)} \;\Longleftrightarrow\; \mathbf{Y}_{(n)} = \mathbf{A}^{(n)} \mathbf{X}_{(n)} \left(\mathbf{A}^{(N)} \otimes \cdots \otimes \mathbf{A}^{(n+1)} \otimes \mathbf{A}^{(n-1)} \otimes \cdots \otimes \mathbf{A}^{(1)}\right)^{T}. \tag{3}$$
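The property in Eq. (3) can be checked numerically with the sketch below; the Kronecker factors are taken in descending mode order, matching Eqs. (3) and (9), the unfolding follows the convention of [6], and all sizes are arbitrary illustrative choices of our own.

```python
import numpy as np

def unfold(T, n):
    # mode-n unfolding with earlier modes varying fastest, as in [6]
    return np.reshape(np.moveaxis(T, n, 0), (T.shape[n], -1), order='F')

def mode_n_product(T, U, n):
    return np.moveaxis(np.tensordot(U, T, axes=(1, n)), 0, n)

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 3, 4))
A = [rng.standard_normal((5, 2)), rng.standard_normal((6, 3)), rng.standard_normal((7, 4))]

# left-hand side of Eq. (3): successive mode products
Y = X
for n, An in enumerate(A):
    Y = mode_n_product(Y, An, n)

# right-hand side of Eq. (3) for n = 2 (index 1 here): A^(3) (x) A^(1), highest mode first
n = 1
rhs = A[n] @ unfold(X, n) @ np.kron(A[2], A[0]).T
print(np.allclose(unfold(Y, n), rhs))  # True
```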

2.2. Generative adversarial nets

Generative adversarial nets consist of two components, a generator and a discriminator, which are usually represented by MLPs. The task of the discriminator is to correctly identify whether its input comes from the real data distribution p_data or from the model distribution p_model. Given a prior z ∼ p_z, the generator tries to produce indistinguishable samples in order to deceive the discriminator. By alternately training the discriminator and the generator, a GAN aims to implicitly learn the two distributions p_data and p_model. Due to space limitations, we omit the details of GAN training. For a more detailed explanation of GANs, we refer readers to [1].
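For completeness, a minimal sketch of this alternating training procedure is given below; the tiny MLPs, the toy data and the optimizer settings are placeholders of our own and do not reflect the architectures or hyper-parameters used in this paper.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))   # generator (placeholder)
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))    # discriminator (placeholder)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(32, 2) * 0.2 + 0.5      # stand-in for samples from p_data
    z = torch.rand(32, 16) * 2 - 1             # prior z ~ U(-1, 1)

    # discriminator step: push real samples toward label 1, generated samples toward 0
    fake = G(z).detach()
    loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # generator step: try to fool the discriminator
    loss_g = bce(D(G(z)), torch.ones(32, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```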

3. TENSOR LAYER

In this section, we introduce the tensor layer using the tensor algebra reviewed in Section 2.1. A tensor layer replaces the traditional affine transformation between the layers of an MLP, which forms a GAN, with a multilinear transformation. When training an MLP, it is common to flatten the inputs and feed the resulting vectors to the network. Given such an input vector x, the traditional transformation for MLPs is defined as follows:

$$\mathbf{y} = \sigma\left(\mathbf{W}\mathbf{x} + \mathbf{b}\right). \tag{4}$$

The tensor layer treats the input without vectorization. Given an N-way tensor X ∈ R^{I_1×···×I_N} as input, rather than having a single matrix W, we establish the transformation between two tensor layers by applying mode products between the input tensor and weight matrices U_i for i ∈ [1, N]. We then add a bias tensor B ∈ R^{J_1×···×J_N} to the product to form a tensor layer. The transformation from X to the tensor layer Y ∈ R^{J_1×···×J_N} is formulated as:

$$\mathcal{Y} = \sigma\left(\mathcal{X} \times_1 \mathbf{U}_1 \times_2 \mathbf{U}_2 \times_3 \cdots \times_N \mathbf{U}_N + \mathcal{B}\right) \tag{5}$$

where U_i ∈ R^{J_i×I_i} for i ∈ [1, N]. A visualization for a third-order input is provided in Figure 1.
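A tensor layer as in Eq. (5) can be sketched with the mode product defined earlier; the function names and the third-order sizes (mirroring Fig. 1) are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def mode_n_product(T, U, n):
    # T x_n U: contract mode n of T with the columns of U (Eq. (1))
    return np.moveaxis(np.tensordot(U, T, axes=(1, n)), 0, n)

def tensor_layer(X, Us, B, sigma=np.tanh):
    """Tensor layer of Eq. (5): Y = sigma(X x_1 U_1 x_2 ... x_N U_N + B)."""
    Y = X
    for n, U in enumerate(Us):
        Y = mode_n_product(Y, U, n)
    return sigma(Y + B)

# hypothetical sizes following Fig. 1: I x J x K input mapped to L x M x N output
X  = np.random.randn(4, 5, 6)
Us = [np.random.randn(3, 4), np.random.randn(2, 5), np.random.randn(7, 6)]
B  = np.random.randn(3, 2, 7)
print(tensor_layer(X, Us, B).shape)   # (3, 2, 7)
```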

Tensor layers can be interpreted in terms of the Tucker decomposition, which factorizes a higher-order tensor into a core tensor and factor matrices [7, 6]. If we treat the input tensor X as such a core tensor, the weight matrices U_i for i ∈ [1, N] can be interpreted as the factor matrices.

4. TRAINING FT-NETS

In this paper, networks consisting of tensor layers are referred to as FT-Nets. A generative adversarial net consisting of FT-Nets is called a TGAN. In this section, we demonstrate that gradient-based back-propagation algorithms [8] are readily applicable to FT-Nets.

Given an FT-Net with an input tensor X ∈ R^{I_1×···×I_N×C} and two tensor layers O_1 ∈ R^{J_1×···×J_N×C} and O_2 ∈ R^{K_1×···×K_N×C}, we can formulate the FT-Net as follows:

$$\mathcal{O}_1 = g(\mathcal{H}_1) = g\left(\mathcal{X} \times_1 \mathbf{W}_1 \times_2 \mathbf{W}_2 \times_3 \cdots \times_N \mathbf{W}_N + \mathcal{B}_1\right) \tag{6}$$

$$\mathcal{O}_2 = g(\mathcal{H}_2) = g\left(\mathcal{O}_1 \times_1 \mathbf{U}_1 \times_2 \mathbf{U}_2 \times_3 \cdots \times_N \mathbf{U}_N + \mathcal{B}_2\right) \tag{7}$$

where W_i ∈ R^{J_i×I_i} and U_i ∈ R^{K_i×J_i} for i ∈ [1, N]. The function g(A) is a component-wise activation function applied to a tensor A. Note that the last mode of the input tensor X denotes the batch size C; thus we do not apply any mode product to the (N+1)-th mode when performing the multilinear operation on X.
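The two-layer FT-Net of Eqs. (6) and (7) can be sketched as below, with the batch mode kept as the last, untouched mode; the mode sizes and helper names are hypothetical and only illustrate the data flow, not the authors' code.

```python
import numpy as np

def mode_n_product(T, U, n):
    return np.moveaxis(np.tensordot(U, T, axes=(1, n)), 0, n)

def ft_layer(X, mats, B, g=np.tanh):
    """One FT-Net layer: mode products over the first N modes only;
    the last mode of X is the batch dimension C and is left untouched."""
    Y = X
    for n, U in enumerate(mats):
        Y = mode_n_product(Y, U, n)
    return g(Y + B[..., None])        # broadcast the bias over the batch mode

# hypothetical sizes: input I1 x I2 x C, hidden J1 x J2 x C, output K1 x K2 x C
X  = np.random.randn(4, 5, 8)
W  = [np.random.randn(3, 4), np.random.randn(6, 5)]
U  = [np.random.randn(2, 3), np.random.randn(7, 6)]
B1 = np.random.randn(3, 6)
B2 = np.random.randn(2, 7)

O1 = ft_layer(X, W, B1)               # Eq. (6)
O2 = ft_layer(O1, U, B2)              # Eq. (7)
print(O1.shape, O2.shape)             # (3, 6, 8) (2, 7, 8)
```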

Given the output layer H_2, the gradient of H_2 with respect to each weight matrix U_i is derived using Eq. (3) for all i ∈ [1, N]:

$$\frac{\partial \mathcal{H}_2}{\partial \mathbf{U}_i} = \frac{\partial}{\partial \mathbf{U}_i}\left(\mathcal{O}_1 \times_1 \mathbf{U}_1 \times_2 \mathbf{U}_2 \times_3 \cdots \times_N \mathbf{U}_N + \mathcal{B}_2\right) = \frac{\partial}{\partial \mathbf{U}_i}\left(\mathbf{U}_i \mathbf{O}_{1(i)} \left(\mathbf{U}_N \otimes \cdots \otimes \mathbf{U}_{i+1} \otimes \mathbf{U}_{i-1} \otimes \cdots \otimes \mathbf{U}_1\right)^{T} + \mathbf{B}_{2(i)}\right). \tag{8}$$

The gradient of H_2 with respect to O_1 can be derived as

$$\frac{\partial \mathcal{H}_2}{\partial \mathcal{O}_1} = \frac{\partial}{\partial \mathcal{O}_1}\left(\mathcal{O}_1 \times_1 \mathbf{U}_1 \times_2 \mathbf{U}_2 \times_3 \cdots \times_N \mathbf{U}_N + \mathcal{B}_2\right) = \mathbf{U}_N \otimes \mathbf{U}_{N-1} \otimes \cdots \otimes \mathbf{U}_1. \tag{9}$$
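Because the map in Eq. (7) is multilinear, Eq. (9) can be verified numerically: with a column-major vectorization, vec(H_2 − B_2) equals (U_N ⊗ ··· ⊗ U_1) vec(O_1). The sketch below (bias omitted, arbitrary sizes) is our own check, not the authors' code.

```python
import numpy as np

def mode_n_product(T, U, n):
    return np.moveaxis(np.tensordot(U, T, axes=(1, n)), 0, n)

vec = lambda T: T.reshape(-1, order='F')   # column-major vectorization

rng = np.random.default_rng(1)
O1 = rng.standard_normal((3, 4, 5))
U = [rng.standard_normal((2, 3)), rng.standard_normal((6, 4)), rng.standard_normal((7, 5))]

H2 = O1
for n, Un in enumerate(U):
    H2 = mode_n_product(H2, Un, n)

# Eq. (9): the Jacobian of the multilinear map is U_N (x) ... (x) U_1,
# so vec(H2) = (U_3 (x) U_2 (x) U_1) vec(O1) in this bias-free case
J = np.kron(U[2], np.kron(U[1], U[0]))
print(np.allclose(vec(H2), J @ vec(O1)))   # True
```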

We can derive the gradient of a loss function f with respect to W_i for i ∈ [1, N] as follows:

$$\frac{\partial f}{\partial \mathbf{W}_i} = \left(\left(\left(\frac{\partial f}{\partial \mathcal{O}_2} \ast \frac{\partial \mathcal{O}_2}{\partial \mathcal{H}_2}\right)\frac{\partial \mathcal{H}_2}{\partial \mathcal{O}_1}\right) \ast \frac{\partial \mathcal{O}_1}{\partial \mathcal{H}_1}\right)\frac{\partial \mathcal{H}_1}{\partial \mathbf{W}_i} = \left(\left(\frac{\partial f}{\partial \mathcal{O}_2} \ast \frac{\partial \mathcal{O}_2}{\partial \mathcal{H}_2}\right)\left(\mathbf{U}_N \otimes \cdots \otimes \mathbf{U}_1\right) \ast \frac{\partial \mathcal{O}_1}{\partial \mathcal{H}_1}\right)\mathbf{X}_{(i)}\left(\mathbf{W}_N \otimes \cdots \otimes \mathbf{W}_{i+1} \otimes \mathbf{W}_{i-1} \otimes \cdots \otimes \mathbf{W}_1\right)^{T}. \tag{10}$$
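In practice one rarely implements Eq. (10) by hand; a reverse-mode autodiff framework recovers the same gradients automatically. The sketch below builds the two tensor layers of Eqs. (6)-(7) with einsum and calls backward(); the mode sizes, activation and placeholder loss are our own illustrative assumptions, not the authors' training code.

```python
import torch

torch.manual_seed(0)
I, J, K = (4, 5, 6), (3, 4, 5), (2, 3, 4)   # hypothetical mode sizes
C = 8                                        # batch size (last mode)

X  = torch.randn(*I, C)
W  = [torch.randn(J[n], I[n], requires_grad=True) for n in range(3)]
U  = [torch.randn(K[n], J[n], requires_grad=True) for n in range(3)]
B1 = torch.randn(*J, 1, requires_grad=True)
B2 = torch.randn(*K, 1, requires_grad=True)

g = torch.tanh   # component-wise activation

# Eqs. (6)-(7): mode products over the first N modes, batch mode left untouched
H1 = torch.einsum('abcs,ia,jb,kc->ijks', X, W[0], W[1], W[2]) + B1
O1 = g(H1)
H2 = torch.einsum('abcs,ia,jb,kc->ijks', O1, U[0], U[1], U[2]) + B2
O2 = g(H2)

loss = O2.pow(2).mean()     # placeholder loss f
loss.backward()             # reverse-mode autodiff realizes the chain rule of Eq. (10)
print(W[0].grad.shape)      # torch.Size([3, 4])
```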

Given a neural network with three fully-connected layers of sizes I = ∏_{i=1}^{N} I_i, J = ∏_{i=1}^{N} J_i and K = ∏_{i=1}^{N} K_i, the number of model parameters in the network is J(I + K) + J + K. We can simply use these factorizations of I, J and K to formulate an FT-Net. The number of model parameters of the FT-Net is ∑_{i=1}^{N} (I_i J_i + J_i K_i) + J + K. It is easily observable that the gap in the number of parameters between tensorized and un-tensorized networks can grow exponentially as the order of the input tensor increases.
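As a worked example of this comparison, consider hypothetical layer sizes of 1024 = 32 × 32 units each (i.e. N = 2), chosen only for illustration and unrelated to the architectures used in the experiments:

```python
# three fully-connected layers of 1024 = 32 x 32 units (hypothetical sizes, N = 2)
Is, Js, Ks = (32, 32), (32, 32), (32, 32)
I = J = K = 32 * 32

dense = J * (I + K) + J + K                                         # J(I + K) + J + K
ftnet = sum(i * j + j * k for i, j, k in zip(Is, Js, Ks)) + J + K   # sum(I_i J_i + J_i K_i) + J + K
print(dense, ftnet, round(dense / ftnet))                           # 2099200 6144 342
```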

5. EXPERIMENTS

In this section we empirically demonstrate the compressive power of our framework using both synthetic and real datasets. All hyper-parameters and architectural details for every experiment are available at https://github.com/xwcao/TGAN.

5.1. MNIST

In this section, we present empirical comparisons of TGAN and GAN using images of handwritten digits [9]. We conducted experiments with TGAN and two GANs. The first GAN, namely GAN(1), has a larger number of parameters than TGAN, while GAN(2) has approximately the same number of parameters as TGAN. See Figure 2 for the experimental outcomes. Although the numerical evaluation of generative models is an open problem, we believe that the quality of samples from TGAN and GAN(1) is comparable. We report the compression rate between TGAN and GAN(1) to be a factor of 35.

5.2. Synthetic Data

In this section, we empirically compare the performance of TGAN and GANs using synthetic data drawn from bivariate normal distributions. In particular, we collected 10,000 data points sampled from 6 clusters located circularly around the point (0.5, 0.5). Each cluster is populated by a bivariate normal distribution with σ² = 0.2. We note that all models (i.e., TGAN and the GANs) employ approximately the same number of trainable units. Figure 3 shows the samples produced by TGAN and three GANs at each training step. The experimental outcome demonstrates that: 1) our model converges to the original distribution more quickly than the GANs, and 2) our model successfully learns the true distribution while the others struggle to do so.

Fig. 2: Comparison of MNIST samples generated by TGAN and GANs after 10k, 20k, 30k, 40k and 50k iterations (number of parameters: TGAN 12k, GAN(1) 429k, GAN(2) 12k). We collected the samples in the figure without cherry-picking. We used the prior z ∼ U(−1, 1) to sample randomly for both TGAN and GANs. Our framework demonstrates its compressive power when compared to GAN(1) and GAN(2).

6. CONCLUSION

The trend of employing models with a large number of parameters for generative adversarial learning prohibits its applicability to systems with limited computational resources. At the same time, the traditional affine transformations employed in GANs can significantly increase model complexity. We present a new generative adversarial model with tensorized structure, TGAN. The two major advantages of TGAN are: 1) a significant reduction in the consumption of computational resources, and 2) the capability to capture the multi-modal structure of datasets. We empirically demonstrated a compression rate of 40 times compared to GAN while having negligible impact on the quality of generated samples. One direction for future work is to apply various tensor decomposition algorithms such as [10, 11] to each tensor layer for further reduction in the number of model parameters.

7. REFERENCES

[1] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.

[2] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen, "Progressive growing of GANs for improved quality, stability, and variation," arXiv preprint arXiv:1710.10196, 2017.

Fig. 3: Samples from TGAN and three GANs with layers of different sizes, shown for the true distribution and after 1k, 2k, 4k, 6k, 8k and 10k iterations. Each blue-colored panel is a scatter plot of 10,000 samples generated at that training stage. It is observable that our model (TGAN) learns the distribution quickly while the GANs struggle to do so. We set the prior to z ∼ U(−1, 1) for every synthetic experiment.

[3] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas, "StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks," in IEEE Int. Conf. Comput. Vision (ICCV), 2017, pp. 5907–5915.

[4] Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P. Vetrov, "Tensorizing neural networks," in Advances in Neural Information Processing Systems, 2015, pp. 442–450.

[5] Jean Kossaifi, Aran Khanna, Zachary C. Lipton, Tommaso Furlanello, and Anima Anandkumar, "Tensor contraction layers for parsimonious deep nets," arXiv preprint arXiv:1706.00439, 2017.

[6] Tamara G. Kolda and Brett W. Bader, "Tensor decompositions and applications," SIAM Review, vol. 51, no. 3, pp. 455–500, 2009.

[7] Ledyard R. Tucker, "Some mathematical notes on three-mode factor analysis," Psychometrika, vol. 31, no. 3, pp. 279–311, 1966.

[8] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533–536, 1986.

[9] Yann LeCun, "The MNIST database of handwritten digits," http://yann.lecun.com/exdb/mnist/, 1998.

[10] Ivan V. Oseledets, "Tensor-train decomposition," SIAM Journal on Scientific Computing, vol. 33, no. 5, pp. 2295–2317, 2011.

[11] Qibin Zhao, Guoxu Zhou, Shengli Xie, Liqing Zhang, and Andrzej Cichocki, "Tensor ring decomposition," arXiv preprint arXiv:1606.05535, 2016.