
Learning Functors using Gradient Descent

Bruno Gavranović

University of Zagreb, Faculty of Electrical Engineering and Computing, Zagreb, Croatia

Neural networks have become an increasingly popular tool for solving many real-world problems. They are a general framework for differentiable optimization which includes many other machine learning approaches as special cases. In this paper we build a categorical formalism around a class of neural networks exemplified by CycleGAN [13]. CycleGAN is a collection of neural networks, closed under composition, whose inductive bias is increased by enforcing composition invariants, i.e. cycle-consistencies. Inspired by Functorial Data Migration [12], we specify the interconnection of these networks using a categorical schema, and network instances as set-valued functors on this schema. We also frame neural network architectures, datasets, models, and a number of other concepts in a categorical setting and thus show that a special class of functors, rather than functions, can be learned using gradient descent. We use the category-theoretic framework to conceive a novel neural network architecture whose goal is to learn the task of object insertion and object deletion in images with unpaired data. We test the architecture on the CelebA dataset and obtain promising results.

1 Introduction

Compositionality describes and quantifies how complex things can be assembled out of simpler parts. It is a principle which tells us that the design of abstractions in a system needs to be done in such a way that we can intentionally forget their internal structure [7]. There are two interesting properties of neural networks related to compositionality: (i) they are compositional – increasing the number of layers tends to yield better performance, and (ii) they discover (compositional) structures in data.

Figure 1: We devise a novel method to regularize neural network training using a category presented by generators and relations. We train neural networks to remove glasses from the face of a person and insert them parametrically, without ever telling the neural networks that the image contains glasses or even that it contains a person.

Furthermore, an increasing number of components of a modern deep learning system are learned. For instance, Generative Adversarial Networks [6] learn the cost function. The paper Learning to Learn by gradient descent by gradient descent [2] specifies networks that learn the optimization function. The paper Decoupled Neural Interfaces using Synthetic Gradients [8] specifies how gradients themselves can be learned. The system in [8] can be thought of as a cooperative multi-player game, where some players depend on others to learn but can be trained in an asynchronous manner.

These are just rough examples, but they give a sense of things to come. As more and more components of these systems stop being fixed throughout training, there is an increasingly larger need for a more precise formal specification of the things that do stay fixed. This is not an easy task; the invariants across all these networks seem to be rather abstract and hard to describe. In this paper we explore the hypothesis that the language of category theory could be well suited to describe these systems in a precise manner.


Inter-domain mappings. Recent advances in neural networks describe and quantify the process of discovering high-level, abstract structure in data using gradient information. As such, learning inter-domain mappings has received increasing attention in recent years, especially in the context of unpaired data and image-to-image translation [1, 13]. Pairedness of datasets X and Y generally refers to the existence of some invertible function X → Y. Note that in classification we might also refer to the input dataset as being paired with the dataset of labels, although the meaning is slightly different, as we cannot obviously invert a label f(x) for some x ∈ X.

Obtaining datasets that contain extra information about inter-domain relationships is a much more difficult task than just collecting the relevant datasets. Consider the task of object removal from images: obtaining pairs of images where one of them lacks a certain object, with everything else the same, is much more difficult than the mere task of obtaining two sets of images, one that contains that object and one that does not, with everything else varying. Moreover, we further motivate this example by noting that it resembles the way humans reason about a missing object: simply by observing two unpaired sets of images, where we are told that one set of images lacks an object, we are able to learn what the missing object looks like.

Motivated by the success of Generative Adversarial Networks (GANs) [6] in image generation, some existing unsupervised learning methods [1, 13] use adversarial losses to learn the true data distribution of given domains of natural images and cycle-consistency losses to learn coherent mappings between those domains. CycleGAN is an architecture which learns a one-to-one mapping between two domains. Each domain has an associated discriminator, while the mappings between these domains correspond to generators. The set of generators in CycleGAN is a collection of neural networks which is closed under composition, and whose inductive bias is increased by enforcing composition invariants, i.e. cycle-consistencies. The canonical example of this isomorphism used in [13] is that between images of horses and images of zebras: simply by changing the texture of the animal in such an image we can, approximately, map back and forth between these images.

Outline of the main contributions. In this paper we take a collection of abstractions known to deep learning practitioners and translate them into the language of category theory.

We package a notion of the interconnection of networks as a free category Free(G) on some graph G and specify any equivalences between networks as relations between morphisms, giving a quotient category Free(G)/∼. Given such a category – which we call a schema, inspired by [12] – we specify the architectures of its networks as a functor Arch. We reason about various other notions found in deep learning, such as datasets, embeddings, and parameter spaces. The training process is associated with an indexed family of functors {H_{p_i} : Free(G) → Set}_{i=1}^{T}, where T is the number of training steps and p_i is a choice of parameter for that architecture at step i.

Analogously to standard neural networks, we start with a randomly initialized H_p and iteratively update it using gradient descent. Our optimization is guided by two objectives, which arise as a natural generalization of those found in [13]. One of them is the adversarial objective – the minimax objective found in any Generative Adversarial Network. The other one is a generalization of the cycle-consistency loss, which we call the path-equivalence loss.

This approach yields useful insights and a large degree of generality: (i) it enables learning with unpaired data, as it does not impose any constraints on ordering or pairing of the sets in a category, and (ii) although specialized to generative models in the domain of computer vision, this approach is domain-independent and general enough to hold in any domain of interest, such as sound, text, or video. This allows us to consider a subcategory of Set^{Free(G)} as a space in which we can employ a gradient-based search. In other words, we use the structure of categorical schemas as regularization during training, such that the imposed relationships guide the learning process.

We show that for specific choices of Free(G)/∼ and the dataset we recover GAN [6] and CycleGAN [13]. Furthermore, we propose a novel neural network architecture capable of learning to remove objects from and insert objects into an image with unpaired data. We outline its categorical perspective and test it on the CelebA dataset.


2 Categorical Deep Learning

Modern deep learning optimization algorithms can be framed as a gradient-based search in some function space Y^X, where X and Y are sets that have been endowed with extra structure. Given some sets of data points D_X ⊆ X, D_Y ⊆ Y, a typical approach for adding inductive bias relies on exploiting this extra structure associated to the data points embedded in those sets, or to those sets themselves. This structure includes domain-specific features which can be exploited by various methods – convolutions for images, the Fourier transform for audio, and specialized word embeddings for textual data.

In this section we develop the categorical tools to increase the inductive bias of a model by enforcing the composition invariants of its constituent networks.

2.1 Model schema

Many deep learning models are complex systems, some comprised of several neural networks. Each neural network can be identified with a domain X, a codomain Y, and a differentiable parametrized function X → Y. Given a collection of such networks, we use a directed multigraph to capture their interconnections. Each directed multigraph G gives rise to a corresponding free category Free(G) on that graph. Based on this construction, Figure 2 shows the interconnection pattern for generators of two popular neural network architectures: GAN [6] and CycleGAN [13].

Figure 2: Bird's-eye view of two popular neural network models. (a) GAN: objects Latent space and Image, a single generating morphism h, and no equations. (b) CycleGAN: objects Horse and Zebra, generating morphisms f : Horse → Zebra and g : Zebra → Horse, and the equations g ◦ f = id_Horse and f ◦ g = id_Zebra.

Observe that CycleGAN has some additional properties imposed on it, specified by the equations in Figure 2 (b). These are called cycle-consistency conditions and can roughly be stated as follows: given domains A and B considered as sets, a ≈ g(f(a)) for all a ∈ A, and b ≈ f(g(b)) for all b ∈ B. A particularly clear diagram of the cycle-consistency condition can be found in Figure 3 in [13].

Our approach involves a realization that cycle-consistency conditions can be generalized to path equivalence relations, or, in formal terms, a congruence relation. The condition a ≈ g(f(a)), ∀a ∈ A, can be reformulated such that it does not refer to the elements of the set A. By eta-reducing the equation we obtain id_A = g ◦ f. A similar reformulation can be done for the other condition: id_B = f ◦ g.

This allows us to package the newly formed equations as equivalence relations on the sets Free(G)(A, A) and Free(G)(B, B), respectively. This notion can be further packaged into a quotient category Free(G)/∼, together with the quotient functor Q : Free(G) → Free(G)/∼.

This formulation – a free category on a graph G – represents the cornerstone of our approach. These schemas allow us to reason solely about the interconnections between various concepts, rather than jointly with functions, networks, or other sets. All the other constructs in this paper are structure-preserving maps between categories whose domain, roughly, can be traced back to Free(G).
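To make the schema idea concrete, here is a minimal Python sketch (not taken from the paper; all names are illustrative) of a category presented by generators and relations, with the CycleGAN schema of Figure 2 (b) as an example. A morphism of Free(G) is simply a composable list of generators, and the relations equate pairs of such paths.

```python
from dataclasses import dataclass, field

@dataclass
class Schema:
    objects: list                                   # vertices of the multigraph G
    generators: dict                                # generator name -> (source, target)
    relations: list = field(default_factory=list)   # pairs of equated paths

    def compose(self, *names):
        """Check that a sequence of generator names forms a composable path in Free(G)."""
        for f, g in zip(names, names[1:]):
            assert self.generators[f][1] == self.generators[g][0], \
                f"{f} and {g} are not composable"
        return list(names)

# The CycleGAN schema of Figure 2 (b): the relations equate the two round trips
# with the (empty) identity paths on Horse and Zebra.
cyclegan = Schema(
    objects=["Horse", "Zebra"],
    generators={"f": ("Horse", "Zebra"), "g": ("Zebra", "Horse")},
    relations=[(["f", "g"], []), (["g", "f"], [])],
)
print(cyclegan.compose("f", "g"))   # a morphism Horse -> Horse in Free(G)
```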

2.2 What is a neural network?

In computer science, the idea of a neural network colloquially means a number of different things. At the most fundamental level, it can be interpreted as a system of interconnected units called neurons, each of which has a firing threshold acting as an information filtering system. Drawing inspiration from biology, this perspective is thoroughly explored in the literature. In many contexts we want to focus on the mathematical properties of a neural network and as such identify it with a function f : A → B between sets. Those sets are often considered to have extra structure, such as that of Euclidean spaces or manifolds. Functions are then considered to be maps of a given differentiability class which preserve such structure. We also frequently reason about a neural network jointly with its parameter space P, as a function of type f : P × A → B. For instance, consider a classifier in the context of supervised learning. A convolutional neural network whose input is a 32 × 32 RGB image and whose output is a real number can be represented as a function of type R^n × R^{32×32×3} → R, for some n ∈ N. In this case R^n represents the parameter space of this network.

The former (A → B) and the latter (P × A → B) perspectives on neural networks are related. Namely, consider some function space B^A. Any notion of smoothness in such a space is not well defined without further assumptions on the sets A or B. This is the reason deep learning employs a gradient-based search in such a space via a proxy function P × A → B. This function specifies an entire parametrized family of functions of type A → B, because partial application of each p ∈ P yields a function f(p, −) : A → B. This choice of a parametrized family of functions is part of the inductive bias we are building into the training process. For example, in computer vision it is common to restrict the class of functions to those that can be modeled by convolutional neural networks.
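As a toy illustration of the two perspectives above (a hedged sketch; the one-layer network and its dimensions are made up, not taken from the paper), the following shows a parametrized function f : P × A → B and the plain function A → B obtained by partially applying a parameter p ∈ P.

```python
import numpy as np

def f(p, a):
    """A parametrized function R^6 x R^3 -> R^2: a single linear layer with tanh."""
    W = p.reshape(2, 3)          # interpret the parameter vector as a weight matrix
    return np.tanh(W @ a)

p = np.random.randn(6)           # a point of the parameter space P = R^6
model = lambda a: f(p, a)        # partial application f(p, -): an element of B^A
print(model(np.ones(3)))         # a concrete function R^3 -> R^2, evaluated at a point
```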

With this in mind, we recall the model schema. For each morphism A → B in Free(G) we are interested in specifying a parametrized function f : P × A → B, i.e. a parametrized family of functions in Set. The function f describes a neural network architecture, and partially applying some p ∈ P to f describes a choice of a parameter value for that specific architecture.

We capture the notion of parametrization with the category Para [4]. It is a strict symmetric monoidal category whose objects are Euclidean spaces and whose morphisms R^n → R^m are, roughly, differentiable functions of type R^p × R^n → R^m, for some p. We refer to R^p as the parameter space. Composition of morphisms in Para is defined in such a way that it explicitly keeps track of parameters. For more details, we refer the reader to [4].

A closely related construction we use is Euc, the strict symmetric monoidal category whose objects are finite-dimensional Euclidean spaces and whose morphisms are differentiable maps. The monoidal product on Euc is given by the cartesian product.

We package both of these notions – choosing an architecture and choosing parameters – into functors whose domain is Free(G) and whose codomains are Para and Euc, respectively.

2.3 Network architecture

We now formally specify the model architecture as a functor. Observe that the action on morphisms is defined on the generators of Free(G).

Definition 1. The architecture of a model is a functor Arch : Free(G) → Para.

• For each A ∈ Ob(Free(G)), it specifies a Euclidean space R^a;

• For each generating morphism f : A → B in Free(G), it specifies a morphism Arch(f) : R^a → R^b, which is a differentiable parametrized function of type R^n × R^a → R^b.

Given a non-trivial composite morphism f = f_n ◦ f_{n−1} ◦ · · · ◦ f_1 in Free(G), the image of f under Arch is the composite of the images of its constituents: Arch(f) = Arch(f_n) ◦ Arch(f_{n−1}) ◦ · · · ◦ Arch(f_1). Arch maps identities to the projection π_2 : I × A → A.
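The following hedged sketch (toy dimensions and illustrative names, not the paper's implementation) spells out Definition 1: a Euclidean space for every object, a parametrized function for every generating morphism, and composites sent to composites of the generators' images.

```python
import numpy as np

# Arch on objects: each object is sent to (the dimension of) a Euclidean space.
arch_objects = {"Horse": 16, "Zebra": 16}

def linear_generator(n_in, n_out):
    """A placeholder parametrized map of type R^(n_out*n_in) x R^n_in -> R^n_out."""
    def net(p, a):
        return np.tanh(p.reshape(n_out, n_in) @ a)
    return net, n_out * n_in     # (the architecture, its parameter-space dimension)

# Arch on generating morphisms: f : Horse -> Zebra and g : Zebra -> Horse.
arch_generators = {
    "f": linear_generator(arch_objects["Horse"], arch_objects["Zebra"]),
    "g": linear_generator(arch_objects["Zebra"], arch_objects["Horse"]),
}

def arch_of_path(path, params, a):
    """Image of a composite morphism: compose the images of its generators."""
    for name in path:
        net, _ = arch_generators[name]
        a = net(params[name], a)
    return a
```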

The choice of architecture Arch : Free(G) → Para goes hand in hand with the choice of an embedding.

Proposition 2. An embedding is a functor E : |Free(G)| → Set which agrees with Arch on objects.

Notice that the codomain of E is Set, rather than Para, as Para and Euc have the same objects and objects in Euc are just sets with extra structure.

The embedding E and the architecture Arch come up in two different scenarios. Sometimes we start out with a choice of architecture, which then induces the embedding. In other cases, the embedding is given to us beforehand and it restricts the possible choice of architectures.

3 Parameter space

Each network architecture f : R^n × R^a → R^b comes equipped with its parameter space R^n. Just as Arch : Free(G) → Para is a categorical generalization of an architecture, we now show there exists a categorical generalization of a parameter space. In this case, it is the parameter space of a functor. Before we move on to the main definition, we package the notion of the parameter space of a function f : R^n × R^a → R^b into a simple function p(f) = R^n.


Definition 3 (Functor parameter space). Let Gen_{Free(G)} be the set of generators of Free(G). The total parameter map P : Ob(Para^{Free(G)}) → Ob(Euc) assigns to each functor Arch : Free(G) → Para the product of the parameter spaces of all its generating morphisms:

P(Arch) := ∏_{f ∈ Gen_{Free(G)}} p(Arch(f))

Essentially, just as p returns the parameter space of a function, P does the same for a functor.

We are now in a position to talk about parameter specification. Recall the non-categorical setting: given some network architecture f : P × A → B and a choice of p ∈ p(f), we can partially apply the parameter p to the network to get f(p, −) : A → B. This admits a straightforward generalization to the categorical setting.

Definition 4 (Parameter specification). Parameter specification PSpec is a dependently typed function with the following signature:

PSpec : (Arch : Ob(Para^{Free(G)})) × P(Arch) → Ob(Euc^{Free(G)})     (1)

Given an architecture Arch and a parameter choice (p_f)_{f ∈ Gen_{Free(G)}} ∈ P(Arch) for that architecture, it defines a choice of a functor in Euc^{Free(G)}. This functor acts on objects the same way as Arch. On morphisms, it partially applies every p_f to the corresponding morphism Arch(f) : R^n × R^a → R^b, thus yielding f(p_f, −) : R^a → R^b in Euc.
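Continuing the toy `arch_generators` mapping from the earlier sketch (any dictionary of the same shape works; this is an illustration under those assumptions, not the paper's code), Definitions 3 and 4 amount to: the total parameter space is the product of the generators' parameter spaces, and PSpec partially applies a chosen parameter to every generator.

```python
import numpy as np

def total_parameter_space(arch_generators):
    """P(Arch): record the parameter-space dimension of every generating morphism."""
    return {name: dim for name, (_, dim) in arch_generators.items()}

def sample_parameter(space):
    """A point theta of P(Arch): one parameter vector per generator."""
    return {name: np.random.randn(dim) for name, dim in space.items()}

def pspec(arch_generators, theta):
    """PSpec(Arch, theta): partially apply theta_f to Arch(f) for every generator f."""
    return {name: (lambda a, net=net, p=theta[name]: net(p, a))
            for name, (net, _) in arch_generators.items()}

# Usage with the toy arch_generators from the previous sketch:
#   theta = sample_parameter(total_parameter_space(arch_generators))
#   model = pspec(arch_generators, theta)   # each generator is now a plain function
```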

Elements of Euc^{Free(G)} will play a central role later on in the paper. These elements are functors which we will call models. Given some architecture Arch and a parameter p ∈ P(Arch), a model Model_p : Free(G) → Euc generalizes the standard notion of a model in machine learning – it can be used for inference and evaluated.

Analogous to database instances in [12], we call a network instance H_p the composition of some Model_p with the forgetful functor U : Euc → Set. That is to say, a network instance is a functor H_p := U ◦ Model_p : Free(G) → Set.

We shed some more light on these constructions using Figure 3.

Figure 3: Free(G) is the domain of three types of functors of interest: Arch : Free(G) → Para, Model_p : Free(G) → Euc, and H_p = U ◦ Model_p : Free(G) → Set, where U : Euc → Set is the forgetful functor.

4 Data

We have described constructions which allow us to pick an architecture for a schema and consider its different models Model_p, each of them identified with a choice of a parameter p ∈ P(Arch). In order to understand how the optimization process is steered in updating the parameter choice for an architecture, we need to understand a vital component of any deep learning system – the datasets themselves.

This necessitates that we also understand the relationship between datasets and the space they are embedded in. Recall the embedding functor and the notation |C| for the discretization of some category C.

Definition 5. Let E : |Free(G)| → Set be the embedding. A dataset is a subfunctor D_E : |Free(G)| → Set of E. D_E maps each object A ∈ Ob(Free(G)) to a dataset D_E(A) := {a_i}_{i=1}^{N} ⊆ E(A).

Note that we refer to this functor in the singular, although it assigns a dataset to each object in Free(G). We also highlight that the domain of D_E is |Free(G)|, rather than Free(G). We generally cannot provide an action on morphisms because datasets might be incomplete. Going back to the example with horses and zebras: a dataset functor on Free(G) in Figure 2 (b) maps Horse to the set of obtained horse images and Zebra to the set of obtained zebra images.

The subobject relation D_E ⊆ E in Definition 5 reflects an important property of data; we cannot obtain data without it being, in some shape or form, embedded in some larger space. Any obtained data thus implicitly fixes an embedding.
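In the toy setting used in the earlier sketches (made-up sizes; purely illustrative), a dataset functor in the sense of Definition 5 is nothing more than a finite collection of samples per object, each lying in the corresponding embedding space, with no action on morphisms.

```python
import numpy as np

embedding = {"Horse": 16, "Zebra": 16}        # E(A): a Euclidean space per object

# D_E: a finite subset of E(A) for every object A, and nothing on morphisms.
dataset = {
    "Horse": np.random.rand(100, embedding["Horse"]),
    "Zebra": np.random.rand(100, embedding["Zebra"]),
}
assert all(samples.shape[1] == embedding[obj] for obj, samples in dataset.items())
```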


Observe that when we have a dataset in standard machine learning, we have a dataset of something. We can have a dataset of historical weather data, a dataset of housing prices in New York, or a dataset of cat images. What ties all these concepts together is that each element a_i of some dataset {a_i}_{i=1}^{N} is an instance of a more general concept. As a trivial example, every image in the dataset of horse images is a horse. The word horse refers to a more general concept and as such also covers instances which we do not possess. But all the horse images we do possess are indeed examples of a horse. By considering everything to be embedded in some space E(A), we capture this statement with the relation {a_i}_{i=1}^{N} ⊆ C(A) ⊆ E(A). Here C(A) is the set of all instances of some notion A which are embedded in E(A). In the running example this corresponds to all images of horses in a given space, such as the space of all 64 × 64 RGB images. Obviously, the precise specification of C(A) is unknown – we cannot enumerate or specify the set of all horse images.

We use such calligraphy to denote that this is an abstract concept. Despite the fact that its precise specification is unknown, we can still reason about its relationship to other structures. Furthermore, as is the case with any abstract notion, there might be some edge cases, or it might turn out that this concept is ambiguously defined or even inconsistent. Moreover, it might be possible to identify a dataset with multiple concepts: is a dataset of male human faces associated with the concept of male faces, or is it a non-representative sample of all faces in general? We ignore these concerns and assume each dataset is a dataset of some well-defined, consistent, and unambiguous concept. This does not change the validity of the rest of the formalism in any way, as there exist plenty of datasets satisfying such a constraint.

Armed with this intuition, we show this admits a generalization to the categorical setting. Just as {a_i}_{i=1}^{N} ⊆ C(A) ⊆ E(A) are all subsets of E(A), we might hypothesize that the domain of C is |Free(G)| and that D_E ⊆ C ⊆ E are all subfunctors of E. However, just as we assign a set of all concept instances to objects in Free(G), we also assign a function between these sets to morphisms in Free(G). Unlike with datasets, this can be done because, by definition, these sets are not incomplete.

Definition 6. Given a schema Free(G)/∼ and a dataset D_E : |Free(G)| → Set, a concept associated with the dataset D_E embedded in E is a functor C : Free(G)/∼ → Set such that D_E ⊆ C ◦ I ⊆ E. We say C picks out sets of concept instances and functions between those sets.

Another way to understand a concept C : Free(G)/∼ → Set is that it is required that a human observer can tell, for each A ∈ Ob(Free(G)) and some a ∈ E(A), whether a ∈ C(A). Similarly for morphisms, a human observer should be able to tell whether some function f : C(A) → C(B) is the image of some morphism in Free(G)/∼ under C.

For instance, consider the GAN schema in Figure 2 (a), where C(Image) is the set of all images of human faces embedded in some space such as R^{64×64×3}. For each image in this space, a human observer should be able to tell whether that image contains a face or not. We cannot enumerate such a set C(Image) or write it down explicitly, but we can easily tell if an image contains a given concept. Likewise, for a morphism in the CycleGAN schema (Figure 2 (b)), we cannot explicitly write down a function which transforms a horse into a zebra, but we can tell whether some function did a good job by testing it on different inputs.

The most important thing about this concept is that it represents the goal of our optimization process. Given a dataset D_E : |Free(G)| → Set, we want to extend it into a functor C : Free(G)/∼ → Set, and actually learn its implementation.

4.1 Restriction of network instance to the dataset

We have seen how data is related to its embedding. We now describe the relationship between network instances and data.

Observe that the network instance H_p maps each object A ∈ Ob(Free(G)) to the entire embedding H_p(A) = E(A), rather than just the concept C(A). Even though we started out with an embedding E(A), we want to narrow that embedding down to just the set of instances corresponding to some concept A.

For example, consider a diagram such as the one in Figure 2 (a). Suppose the result of a successful training was a functor H : Free(G) → Set. Suppose that the image of h is H(h) : [0, 1]^100 → [0, 1]^{64×64×3}. As such, our interest is mainly the restriction of [0, 1]^{64×64×3} to C(Image), the image of [0, 1]^100 under H(h), rather than the entire [0, 1]^{64×64×3}. In the case of horses and zebras in Figure 2 (b), we are interested in a map C(Horse) → C(Zebra) rather than a map [0, 1]^{64×64×3} → [0, 1]^{64×64×3}. In what follows we show a construction which restricts some H_p to its smallest subfunctor which contains the dataset D_E. Recall the previously defined inclusion I : |Free(G)| ↪ Free(G).

Definition 7. Let D_E : |Free(G)| → Set be the dataset and let H_p : Free(G) → Set be the network instance on Free(G). The restriction of H_p to D_E is a subfunctor of H_p defined as follows:

I_{H_p} := ⋂ { G ∈ Sub(H_p) | D_E ⊆ G ◦ I }

where Sub(H_p) is the set of subfunctors of H_p.

This definition is quite condensed, so we supply some intuition. We first note that the meet is well defined because each G is a subfunctor of H_p. In Figure 4 we depict the newly defined constructions using a commutative diagram.

Figure 4: Commutative diagram relating D_E, E, H_p, I_{H_p} and the inclusion I : |Free(G)| ↪ Free(G): the functor H_p contains I_{H_p} in such a way that D_E is a subfunctor of I_{H_p} ◦ I.

It is useful to think of I_{H_p} as the restriction of H_p to the smallest subfunctor which fits all the data and the mappings between the data. This means that I_{H_p} contains all data samples specified by D_E.

Corollary 8. D_E is a subfunctor of I_{H_p} ◦ I.

Proof. This is straightforward to show, as I_{H_p} is the intersection of all subfunctors of H_p which, when composed with the inclusion I, contain D_E. Therefore I_{H_p} ◦ I contains D_E as well.

5 Optimization

We now describe how data guides the search process. We identify the goal of this search with the concept functor C : Free(G)/∼ → Set. This means that given a schema Free(G)/∼ and a dataset D_E : |Free(G)| → Set, we want to train some architecture and find a functor H′ : Free(G)/∼ → Set that can be identified with C. Of course, unlike in the case of the concept C, the implementation of H′ is something that will be known to us.

We now define the notion of a task.

Definition 9. Let G be a directed multigraph and ∼ a congruence relation on Free(G). A task is a triple (G, ∼, D_E : |Free(G)| → Set).

In other words, a graph G and a relation ∼ specify a schema Free(G)/∼, and a functor D_E specifies a dataset for that schema. Each dataset is a dataset of something and thus can be associated with a functor C : Free(G)/∼ → Set. Moreover, recall that a dataset fixes an embedding E too, as D_E ⊆ E. This in turn also narrows our choice of architecture Arch : Free(G) → Para, as it has to agree with the embedding on objects. This situation fully reflects what happens in standard machine learning practice – a neural network P × A → B has to be defined in such a way that its domain A and codomain B embed the datasets of all of its inputs and outputs, respectively.

Even though for the same schema Free(G)/∼ we might want to consider different datasets, we will always assume a chosen dataset corresponds to a single training goal C.

5.1 Optimization objectives

We generalize the training procedure described in [13] in a natural way, free of ad-hoc choices.

Suppose we have a task (G, ∼, D_E : |Free(G)| → Set). After choosing an architecture Arch : Free(G) → Para consistent with the embedding E and, hopefully, with the right inductive bias, we start with a randomly chosen parameter θ_0 ∈ P(Arch). This amounts to the choice of a specific Model_{θ_0} : Free(G) → Euc.

Using the loss function defined further down in this section, we partially differentiate each f : R^n × R^a → R^b ∈ Gen_{Free(G)} with respect to the corresponding p_f. We then obtain a new parameter value for that function using some update rule, such as Adam [9]. The product of these parameters for each of the generators, (p_f)_{f ∈ Gen_{Free(G)}} (Definition 3), defines a new parameter θ_1 ∈ P(Arch) for the model Model_{θ_1}. This procedure allows us to iteratively update a given Model_{θ_i} and as such fixes a linear order {θ_0, θ_1, . . . , θ_T} on some subset of P(Arch).

The optimization objective for a model Model_θ : Free(G) → Euc and a task (G, ∼, D_E : |Free(G)| → Set) is twofold. The total loss will be stated as a sum of the adversarial loss and a path-equivalence loss. We now describe both of these losses. As we slowly transition to standard machine learning lingo, we note that some of the notation here will be untyped due to the lack of a proper categorical understanding of these concepts.¹

¹ A categorical formulation of the adversarial component of Generative Adversarial Networks is still an open problem. It seems to require nontrivial reformulations of existing constructions [4] and at least a partial integration of Open Games [5] into the framework of gradient-based optimization.

We start by assigning a discriminator to each object A ∈ Ob(Free(G)) using the following function:

D : (A : Ob(Free(G))) → Para(Arch(A), R)

This function assigns to each object A ∈ Ob(Free(G)) a morphism in Para whose domain is given by Arch(A). This will allow us to compose compatible generators and discriminators. For instance, consider Arch(A) = R^a. The discriminator D(A) is then a function of type R^q × R^a → R, i.e. a morphism in Para(R^a, R), where R^q is the parameter space of the discriminator. As a slight abuse of notation – and to be more in line with machine learning notation – we will call D_A the discriminator of the object A with some partially applied parameter value D(A)(p, −).

In the context of GANs, when we refer to a generator we refer to the image of a generating morphism in Free(G) under Arch. Similarly as with discriminators, a generator corresponding to a morphism f : R^a → R^b in Para with some partially applied parameter value will be denoted by G_f.

The GAN minimax objective L^B_GAN for a generator G_f and a discriminator D_B is stated in Eq. (2). In this formulation we use the Wasserstein distance [3].

L^B_GAN(G_f, D_B) := E_{b ∼ D_E(B)}[D_B(b)] − E_{a ∼ D_E(A)}[D_B(G_f(a))]     (2)

The generator is trained to minimize the loss in Eq. (2), while the discriminator is trained to maximize it.

The second component of the total loss is a generalization of the cycle-consistency loss in CycleGAN [13], analogous to the generalization of the cycle-consistency condition in Section 2.1.

Definition 10. Let f, g : A → B be a parallel pair of morphisms and suppose there exists a path equivalence f = g. For the equivalence f = g and the model Model_i : Free(G) → Euc we define a path-equivalence loss L^{f,g}_∼ as follows:

L^{f,g}_∼ := E_{a ∼ D_E(A)}[ ||Model_i(f)(a) − Model_i(g)(a)||_1 ]

This enables us to state the total loss simply as a weighted sum of adversarial losses for all generators and path-equivalence losses for all equations.

Definition 11. The total loss is given as the sum of all adversarial and path-equivalence losses:

L_i := ∑_{(f : A → B) ∈ Gen_{Free(G)}} L^B_GAN(G_f, D_B) + γ ∑_{(f = g) ∈ ∼} L^{f,g}_∼

where γ is a hyperparameter that balances between the adversarial loss and the path-equivalence loss.
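A hedged numpy sketch of these objectives (minibatch estimates, illustrative names; the actual networks and update rule used in the paper are not reproduced): the Wasserstein objective of Eq. (2), the path-equivalence loss of Definition 10, and their weighted sum from Definition 11.

```python
import numpy as np

def adversarial_loss(D_B, G_f, batch_A, batch_B):
    """Eq. (2): E_b[D_B(b)] - E_a[D_B(G_f(a))], estimated on minibatches."""
    return (np.mean([D_B(b) for b in batch_B])
            - np.mean([D_B(G_f(a)) for a in batch_A]))

def path_equivalence_loss(H_f, H_g, batch_A):
    """Definition 10: expected L1 distance between the images of two equated paths."""
    return np.mean([np.abs(H_f(a) - H_g(a)).sum() for a in batch_A])

def total_loss(adversarial_losses, path_losses, gamma):
    """Definition 11: sum of all adversarial losses plus gamma times all path losses."""
    return sum(adversarial_losses) + gamma * sum(path_losses)
```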

5.2 Path equivalence relations

There is one interesting case of the total loss – when the total path-equivalence loss is zero: ∑_{(f = g) ∈ ∼} L^{f,g}_∼ = 0. This tells us that H(f) = H(g) for all f = g in ∼. In what follows we elaborate on what this means by recalling how Free(G) and Free(G)/∼ are related.

So far, we have only been considering schemas given by Free(G). This indeed is a limiting factor, as it assumes the categories of interest are only those without any imposed relations R between the generators G. One example of a schema with relations is the CycleGAN schema in Figure 2 (b), for which fixing a functor Free(G)/∼ → Set requires that its image satisfies any relations imposed by Free(G)/∼. As neural network parameters are usually initialized randomly, any such image in Set will almost surely not preserve such relations and thus will not be a proper functor whose domain is Free(G)/∼.

However, this construction is a functor if we consider its domain to be Free(G). Furthermore, assuming a successful training process whose end result is a path-equivalence-relation-preserving functor Free(G) → Set, we show this induces a unique functor Free(G)/∼ → Set (Figure 5).

Figure 5: A functor H : Free(G) → Set which preserves path-equivalence relations factors uniquely through the quotient functor Q : Free(G) → Free(G)/∼ via H′ : Free(G)/∼ → Set.

Theorem 12. Let Q : Free(G) → Free(G)/∼ be the quotient functor and let H : Free(G) → Set be a path-equivalence relation preserving functor. Then there exists a unique functor H′ : Free(G)/∼ → Set such that H′ ◦ Q = H.

Proof. See [11], Section 2.8, Proposition 1.

Finding such a functor H is no easier a task than finding a functor H′. However, this construction allows us to initially just guess a functor H_{θ_0}, since this initial choice will not have to preserve any relations. As training progresses and the path-equivalence loss of a network instance H_θ : Free(G) → Set converges to zero, by Theorem 12 we show that H_θ factors uniquely through Q : Free(G) → Free(G)/∼ via H′ : Free(G)/∼ → Set.

5.3 Functor space

Given an architecture Arch, each choice of p ∈ P(Arch) specifies a functor of type Free(G) → Set. In this way exploration of the parameter space amounts to exploration of part of the functor category Set^{Free(G)}. Roughly stated, this means that a choice of an architecture adjoins a notion of space to the image of PSpec(Arch, −) in the functor category Set^{Free(G)}. This space inherits all the properties of Euc.

By using gradient information to search the parameter space P(Arch), we are effectively using gradient information to search part of the functor space Set^{Free(G)}. Although we cannot explicitly explore just Set^{Free(G)/∼}, we penalize the search method for veering into the parts of this space where the specified path equivalences do not hold. As such, the inductive bias of the model is increased without special constraints on the datasets or the embedding space; we merely require that the space is differentiable and that it has a sensible notion of distance.

Note that we do not claim this inductive bias is sufficient to guarantee training convergence, merely that it is a useful regularization method applicable to a wide variety of situations. As categories can encode complex relationships between concepts and functors map between categories in a structure-preserving way, this enables structured learning of concepts and their interconnections in a very general fashion.

6 Product task

We now present a choice of a dataset for the CycleGAN schema which makes up a novel task we will call the product task. The interpretation of this task comes in two flavors: as a simple change of dataset for the CycleGAN schema, and as a method of composition and decomposition of images.

Just as we can take the product of two real numbers a, b ∈ R with a multiplication function (a, b) ↦ ab, we show we can take a product of two sets of images A, B ∈ Ob(Set) with a neural network of type A × B → C. We will show that C ∈ Ob(Set) is a set of images which possesses all the properties of a categorical product.

In a cartesian category such as Set there already exists a notion of a categorical product – the cartesian product. Recall that the categorical product A × B is uniquely isomorphic to any other object AB which satisfies the universal property of the categorical product of objects A and B. This isomorphism will be central to the notion of the product task.

By learning the model Model : Free(G) → Euc corresponding to the isomorphism AB ≅ A × B we are also learning the projection maps θ_A and θ_B of AB. This follows from the universal property of the categorical product: π_A ◦ d = θ_A and π_B ◦ d = θ_B. Note how AB differs from a cartesian product. The domain of the corresponding projections θ_A and θ_B is not a simple pair of objects (a, b), and thus these projections cannot merely discard an element. θ_A needs to learn to remove A from a potentially complex domain. As such, it can be any complex, highly non-linear function which satisfies the coherence conditions of a categorical product.

We will be concerned with supplying this new notion of the product AB with a dataset and learning the image of the isomorphism AB ≅ A × B under Model : Free(G) → Set. We illustrate this on a concrete example. Consider a dataset A of images of human faces, a dataset B of images of glasses, and a dataset AB of people wearing glasses. Learning this isomorphism amounts to learning two things: (i) learning how to decompose an image of a person wearing glasses (ab)_i into an image of a person a_j and an image b_k of these glasses, and (ii) learning how to map this person a_j and some other glasses b_l into an image of the person a_j wearing the glasses b_l. Generally, AB represents some sort of composition of objects A and B in the image space such that all information about A and B is preserved in AB. Of course, this might only be approximately true. Glasses usually cover a part of a face, and sometimes their dark shades cover up the eyes – thus losing information about the eye color in the image and rendering the isomorphism invalid. However, in this paper we ignore such issues and assume that the network Arch(d) can learn to unambiguously fill in the part of the face where the glasses were and that Arch(c) can learn to generate and superimpose the glasses on the relevant part of the face.

Even though we use the same CycleGAN schema from Figure 2 (b), we label one of its objects as AB and the other one as A × B. Note that this does not change the schema itself; the labeling is merely for our convenience. The notion of a product or its projections is not captured in the schema itself. As schemas are merely categories presented by generators G and relations R, they lack the tools needed to encode a complex abstraction such as a universal construction. So how do we capture the notion of a product?

In this paper we frame this simply as a specific dataset functor, which we now describe. A dataset functor corresponding to the product task maps the object A × B in the CycleGAN schema to a cartesian product of two datasets, D_E(A × B) = {a_i}_{i=0}^{N} × {b_j}_{j=0}^{M}. It maps the object AB to a dataset {(ab)_i}_{i=0}^{N}. In this case ab, a, and b are free to be any elements of datasets of a well-defined concept C. Although the difference between the product task and the CycleGAN task boils down to a different choice of a dataset functor, we note this is a key aspect which allows for a significantly different interpretation of the task semantics.

By considering A as some image background and B as the object which will be inserted, this allows us to interpret d and c as maps which remove an object from an image and insert an object into an image, respectively. This seems like a novel method of object generation and deletion with unpaired data, though we cannot claim to know the literature well enough to be sure.
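As a schematic sketch of the product task (toy dimensions, linear placeholder networks; not the architecture actually used in the experiments), the two generators of the relabeled CycleGAN schema are a map c : A × B_Z → AB that inserts the object and a map d : AB → A × B_Z that removes it, with the path equivalences d ◦ c = id and c ◦ d = id providing the training signal.

```python
import numpy as np

dim_A, dim_Z, dim_AB = 16, 100, 16
p_c = np.random.randn(dim_AB, dim_A + dim_Z)      # parameters of Arch(c)
p_d = np.random.randn(dim_A + dim_Z, dim_AB)      # parameters of Arch(d)

def c(a, z):
    """Toy generator A x B_Z -> AB: superimpose the object described by z onto a."""
    return np.tanh(p_c @ np.concatenate([a, z]))

def d(ab):
    """Toy generator AB -> A x B_Z: split an image back into (background, latent)."""
    out = np.tanh(p_d @ ab)
    return out[:dim_A], out[dim_A:]               # (theta_A(ab), theta_B(ab))

# Path-equivalence losses would then compare d(c(a, z)) with (a, z)
# and c(*d(ab)) with ab, using the L1 distance of Definition 10.
a, z = np.random.rand(dim_A), np.random.rand(dim_Z)
a_rec, z_rec = d(c(a, z))
```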

7 Experiments

In this section we test whether the product task described in Section 6 can be trained in practice. In our experiments we use the CelebA dataset.

The CelebFaces Attributes Dataset (CelebA) [10] is a large-scale face attributes dataset with more than 200,000 celebrity images, each with 40 attribute annotations. Frequently used for image generation purposes, it fits perfectly into the proposed paradigm of the product task. The images in this dataset cover large pose variations and background clutter. The attribute annotations include “eyeglasses”, “bangs”, “pointy nose”, “wavy hair”, etc., as boolean flags for every image.

We used the attribute annotations to separate CelebA into two datasets: the dataset D_E(AB) consists of images with the attribute “Eyeglasses”, while the dataset D_E(A) consists of all the other images.
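The split can be performed directly from the CelebA annotation file; the sketch below assumes the standard list_attr_celeba.txt layout (a count line, a header line of attribute names, then one row of ±1 flags per image), which is an assumption about how the dataset is distributed, not something specified in the paper.

```python
def split_by_eyeglasses(attr_path="list_attr_celeba.txt"):
    """Return (filenames with glasses, filenames without), i.e. D_E(AB) and D_E(A)."""
    with open(attr_path) as f:
        lines = f.read().splitlines()
    attributes = lines[1].split()            # header line with the 40 attribute names
    idx = attributes.index("Eyeglasses")
    with_glasses, without_glasses = [], []
    for row in lines[2:]:
        parts = row.split()
        filename, flags = parts[0], parts[1:]
        (with_glasses if flags[idx] == "1" else without_glasses).append(filename)
    return with_glasses, without_glasses
```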

Given that we could not obtain a dataset of images of just glasses, we set D_E(B_Z) = [0, 1]^100 and add the subscript Z to B to make it clear that we are not generating images of this object. We refer to an element z ∈ D_E(B_Z) as a latent vector, in line with machine learning terminology. This is a parametrization of all the information missing from A such that A × B_Z ≅ AB.

We investigate three things: (i) whether it is possible to generate an image of a specific person wearing glasses, (ii) whether we can change the glasses a person wears by changing the corresponding latent vector, and (iii) whether the same latent vector corresponds to the same glasses, irrespective of the person we pair it with.

7.1 Results

In Figure 6 (left) we show that the model learns task (i): generating an image of a specific person wearing glasses. The glasses are parametrized by the latent vector z ∈ D_E(B_Z). The model learns to warp the glasses and put them at the right angle and size, based on the shape of the face. This can especially be seen in Figure 8, where some of the faces are seen from an angle, but the glasses still blend in naturally. Figure 6 (right) shows the model learning task (ii): changing the glasses a person wears.

Figure 6: Parametrically adding glasses (a) and changing glasses (b) on a person’s face. (a): the leftmost column shows a sample from the dataset a_i ∈ D_E(A); the three rightmost columns show the result of c(a_i, z_j), where z_j ∈ D_E(B_Z) is a randomly sampled latent vector. (b): the leftmost column shows a sample from the dataset (ab)_i ∈ D_E(AB); the three rightmost columns show the image c(π_A(d((ab)_i)), z_j), which is the result of changing the glasses of a person. The latent vector z_j ∈ D_E(B_Z) is randomly sampled.

Figure 7 shows that the model can learn to remove glasses. Observe how in some cases the model did not learn to remove the glasses properly, as a slight outline of the glasses can be seen.

An interesting test of the learned semantics can be done by checking whether a specific randomly sampled latent vector z_j is consistent across different images. Does the resulting image of the application of c(a_i, z_j) contain the same glasses as we vary the input image a_i? The results for tasks (ii) and (iii) are shown in Figure 8. It shows how the network has learned to associate a specific vector z_j with a specific type of glasses and insert it in a natural way.

Figure 7: The top row shows samples (ab)_i ∈ D_E(AB). The bottom row shows the result of the function π_A ◦ d : AB → A, which removes the glasses from the person.

We note low diversity in the generated glasses and a slight loss in image quality, which is due to a suboptimal architecture choice for the neural networks. Despite this, these experiments show that it is possible to train networks to (i) remove objects from, and (ii) parametrically insert objects into images in an unsupervised, unpaired fashion. Even though none of the networks were told that the images contain people, glasses, or objects of any kind, we highlight that they learned to preserve all the main facial features.

Figure 8: The bottom row shows true samples a_i ∈ D_E(A). The top two rows show the image c(a_i, z_j) of adding glasses with a specific latent vector: z_1 for the topmost row and z_2 for the middle row. Observe how the general style of the glasses stays the same in a given row, but gets adapted for every person that wears them.

8 From categorical databases to deep learning

The formulation presented in this paper bears a striking and unexpected similarity to Functorial Data Migration (FDM) [12]. Given a categorical schema Free(G)/∼ on some graph G, FDM defines a functor category Set^{Free(G)/∼} of database instances on that schema. The notion of data integrity is captured by path equivalence relations which ensure that any specified “business rules” hold. The analogue of data integrity in neural networks is captured in the same way, first introduced in CycleGAN [13] as cycle-consistency conditions. The main difference between the approaches is that in this paper we do not start out with an implementation of the network instance functor, but rather we randomly initialize it and then learn it.

This shows that the underlying structures used for specifying data semantics in a given database system are equivalent to the structures used to design the data semantics which can be captured by training neural networks.

9 Conclusion and future work

In this paper we introduced a categorical formalism for training networks given by an arbitrary categorical schema. We showed a correspondence between the categorical formulation of databases and neural network training, and developed a rudimentary theory of learning a specific class of functors using gradient descent. Using the CelebA dataset we obtained experimental results and verified that semantic image manipulation can be carried out in a novel way.

We believe this to be one of the first steps exploring a rich connection between category theory and machine learning. It opens up interesting avenues of research and seems to be deserving of further exploration.

References

[1] Amjad Almahairi, Sai Rajeswar, Alessandro Sordoni, Philip Bachman, and Aaron C. Courville. Augmented CycleGAN: Learning many-to-many mappings from unpaired data. CoRR, abs/1802.10151, 2018. URL http://arxiv.org/abs/1802.10151.

[2] Marcin Andrychowicz, Misha Denil, Sergio Gomez Colmenarejo, Matthew W. Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. CoRR, abs/1606.04474, 2016. URL http://arxiv.org/abs/1606.04474.

[3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv e-prints, arXiv:1701.07875, January 2017.

[4] Brendan Fong, David I. Spivak, and Rémy Tuyéras. Backprop as functor: A compositional perspective on supervised learning. CoRR, abs/1711.10455, 2017. URL http://arxiv.org/abs/1711.10455.

[5] Neil Ghani, Jules Hedges, Viktor Winschel, and Philipp Zahn. Compositional game theory. arXiv e-prints, arXiv:1603.04641, March 2016.

[6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf.

[7] Jules Hedges. On compositionality. 2017. URL https://julesh.com/2017/04/22/on-compositionality/.

[8] Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Decoupled neural interfaces using synthetic gradients. CoRR, abs/1608.05343, 2016. URL http://arxiv.org/abs/1608.05343.

[9] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

[10] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.

[11] Saunders MacLane. Categories for the Working Mathematician. Springer-Verlag, New York, 1971. Graduate Texts in Mathematics, Vol. 5.

[12] David I. Spivak. Functorial data migration. CoRR, abs/1009.1166, 2010. URL http://arxiv.org/abs/1009.1166.

[13] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. CoRR, abs/1703.10593, 2017. URL http://arxiv.org/abs/1703.10593.
