
Knowledge Distillation by On-the-Fly Native Ensemble

Xu Lan¹, Xiatian Zhu², Shaogang Gong¹

¹Queen Mary University of London, London, UK   ²Vision Semantics Ltd
x.lan@qmul.ac.uk   eddy@visionsemantics.com   s.gong@qmul.ac.uk

1. Introduction

Solution: Knowledge Distillation

2. Methodology

Cross-Entropy with Hard vs. Soft Class Labels:

Ø ONE leads to higher correlations due to the learning constraint from the distillation loss;

Ø ONE yields superior mean model generalisation capability, with a lower error rate of 26.61 vs. 31.07 by ME.

3. Experiments

4. Further Analysis

Ø CIFAR and SVHN tests

Figure 1: Knowledge distillation and vanilla training

Figure 2: Overview of online distillation training of ResNet-110 by the proposed On-the-Fly Native Ensemble (ONE). With ONE, we reconfigure the network by adding m auxiliary branches. Each branch with shared layers makes an individual model, and their ensemble is used to build the teacher model.
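The reconfiguration described in the caption can be sketched as follows. This is a minimal PyTorch-style sketch assuming a ResNet-style backbone; the names (ONEModel, shared_trunk, make_branch, feat_dim) are illustrative and not taken from the authors' released code.

```python
# Minimal PyTorch-style sketch of the ONE reconfiguration in Figure 2,
# assuming a ResNet-style backbone. Names (ONEModel, shared_trunk,
# make_branch, feat_dim) are illustrative, not the authors' code.
import torch
import torch.nn as nn

class ONEModel(nn.Module):
    def __init__(self, shared_trunk, make_branch, num_branches, feat_dim):
        super().__init__()
        self.shared_trunk = shared_trunk   # low-level layers shared by all branches
        # m + 1 branches; each maps the shared feature to class logits.
        self.branches = nn.ModuleList([make_branch() for _ in range(num_branches)])
        # Gate: one importance score per branch, computed from the shared feature.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(feat_dim, num_branches), nn.Softmax(dim=1),
        )

    def forward(self, x):
        shared = self.shared_trunk(x)                                    # B x feat_dim x H x W
        logits = torch.stack([b(shared) for b in self.branches], dim=1)  # B x (m+1) x num_classes
        g = self.gate(shared)                                            # B x (m+1)
        ensemble_logits = (g.unsqueeze(-1) * logits).sum(dim=1)          # teacher logits, B x num_classes
        return logits, ensemble_logits
```

The gate takes the shared feature map and outputs one importance score per branch, matching the caption's description of building the teacher from the ensemble of branches.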


Ø ImageNet test

Ø Knowledge Distillation and Ensemble Comparisons

5. References

Comparison metrics: (1) Model variance: the average prediction difference between every two models/branches. (2) Mean model generalisation capability.
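As an illustration of metric (1), here is a minimal sketch that averages the prediction difference over every pair of models/branches. The distance used (mean absolute difference between softmax outputs) is an assumption; the poster does not specify the exact measure.

```python
# Sketch of metric (1): average prediction difference between every two
# models/branches. Mean absolute difference of softmax outputs is an
# assumed distance; the poster does not specify the exact measure.
import itertools
import torch

def model_variance(probs):
    """probs: list of N x num_classes prediction tensors, one per model/branch."""
    diffs = [(p - q).abs().mean() for p, q in itertools.combinations(probs, 2)]
    return torch.stack(diffs).mean()
```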

[1] G. Hinton, O. Vinyals, and J. Dean. "Distilling the knowledge in a neural network." arXiv:1503.02531, 2015.

[2] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu. "Deep mutual learning." CVPR, 2018.

Multi-Branch Design:

Ø Considers no correlation between classes

Ø Prone to model overfitting
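To make the two bullets above concrete, here is a toy comparison (with made-up numbers) of a one-hot hard label against a softened teacher distribution: the hard target gives no signal about which wrong classes are related, whereas the soft target does.

```python
# Toy example with made-up numbers: hard one-hot target vs. soft target.
# The soft target encodes that "tiger" is a far more plausible mistake for
# a cat image than "truck"; the one-hot target treats both as equally wrong.
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.5, -1.0]])        # student logits for (cat, tiger, truck)
hard_target = torch.tensor([0])                  # one-hot label: class "cat"
soft_target = torch.tensor([[0.70, 0.28, 0.02]]) # softened teacher distribution

ce_hard = F.cross_entropy(logits, hard_target)
ce_soft = -(soft_target * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
print(f"hard-label CE: {ce_hard.item():.3f}, soft-label CE: {ce_soft.item():.3f}")
```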

[Figure 1 diagram: (a) Vanilla training strategy; (b) Knowledge distillation. Legend: P = model predictions, L = labels, TS = training samples.]

Limitations of Offline Knowledge Distillation:

Ø Lengthy training time

Ø Complex multi-stage training process

Ø Possible teacher model overfitting

Limitations of Online Knowledge Distillation:

Ø Provides limited extra supervision information

Ø Still needs to train multiple models

Ø Complex asynchronous model updating

[Figure 2 diagram: low-level layers (conv1, res2x, res3x blocks) shared by all branches; m + 1 high-level branches (res4x block), each with its own classifier producing logits and predictions; a gate combines the branch logits into ensemble logits/predictions, which drive the on-the-fly knowledge distillation to every branch.]

Gate Network:

On-the-Fly Knowledge Distillation:

A gate which learns to ensemble all (m + 1) branches to build a stronger teacher:

m auxiliary branches with the same configuration, each serving as an independent efficient classification model.

Compute soft probability distributions at a temperature ofT for branches and the ONE teacher as:

Distill knowledge from the teacher to each branch:

Teacher (gated ensemble) logits, where $g_i$ is the gate's importance score for branch $i$:

$$z^e = \sum_{i=0}^{m} g_i \cdot z^i$$

Soft probability distributions at temperature $T$ for branch $i$ (and analogously for the teacher $e$), over $C$ classes:

$$\tilde{p}^i_j = \frac{\exp(z^i_j / T)}{\sum_{k=1}^{C} \exp(z^i_k / T)}$$

Distillation loss from the teacher to all branches:

$$\mathcal{L}_{kl} = \sum_{i=0}^{m} \sum_{j=1}^{C} \tilde{p}^e_j \log \frac{\tilde{p}^e_j}{\tilde{p}^i_j}$$
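A minimal PyTorch-style sketch of the two steps above (temperature softening and the teacher-to-branch KL term), assuming branch logits of shape B x (m+1) x C and ensemble logits of shape B x C, e.g. as returned by the ONEModel sketch earlier; the function name and the T = 3.0 default are placeholders, not taken from the paper's code.

```python
# Sketch of the on-the-fly distillation loss above. Assumes branch_logits is
# B x (m+1) x C and ensemble_logits is B x C; the name and T=3.0 default are
# placeholders, not taken from the paper's code.
import torch.nn.functional as F

def one_distillation_loss(branch_logits, ensemble_logits, T=3.0):
    # Soft probability distributions at temperature T.
    log_p_branch = F.log_softmax(branch_logits / T, dim=-1)            # B x (m+1) x C
    p_teacher = F.softmax(ensemble_logits / T, dim=-1).unsqueeze(1)    # B x 1 x C
    # KL(teacher || branch) per branch: sum_j p~e_j * log(p~e_j / p~i_j).
    kl = F.kl_div(log_p_branch, p_teacher.expand_as(log_p_branch),
                  reduction='none').sum(dim=-1)                        # B x (m+1)
    # Sum over branches, average over the batch.
    return kl.sum(dim=1).mean()
```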

Drawbacks of Hard-Label Based Cross-Entropy:

Overall Loss Function: $\mathcal{L} = \sum_{i=0}^{m} \mathcal{L}^i_{ce} + \mathcal{L}^e_{ce} + T^2 \cdot \mathcal{L}_{kl}$, i.e. the hard-label cross-entropy of every branch and of the gated ensemble (teacher), plus the $T^2$-scaled distillation loss.
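A sketch of how the overall objective above could be assembled: hard-label cross-entropy for every branch and for the gated ensemble, plus the $T^2$-scaled distillation term (recomputed inline here so the sketch stands alone); function name, shapes, and the T default are placeholders.

```python
# Sketch of the overall ONE objective above: per-branch CE + ensemble CE
# + T^2 * distillation loss. Shapes as before; names are placeholders.
import torch.nn.functional as F

def one_total_loss(branch_logits, ensemble_logits, labels, T=3.0):
    """branch_logits: B x (m+1) x C; ensemble_logits: B x C; labels: B (class indices)."""
    num_branches = branch_logits.size(1)
    # Hard-label cross-entropy for every branch and for the gated ensemble (teacher).
    ce = sum(F.cross_entropy(branch_logits[:, i], labels) for i in range(num_branches))
    ce = ce + F.cross_entropy(ensemble_logits, labels)
    # Distillation term (same form as the sketch above).
    log_p_branch = F.log_softmax(branch_logits / T, dim=-1)
    p_teacher = F.softmax(ensemble_logits / T, dim=-1).unsqueeze(1).expand_as(log_p_branch)
    kl = F.kl_div(log_p_branch, p_teacher, reduction='none').sum(dim=-1).sum(dim=1).mean()
    # T^2 keeps the distillation gradients on a comparable scale, as in [1].
    return ce + (T ** 2) * kl
```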

ONE vs. Model Ensemble (ME)

Ø Effect of On-the-Fly Knowledge Distillation