
Knowledge Distillation by On-the-Fly Native Ensemble

Xu Lan¹, Xiatian Zhu², Shaogang Gong¹

¹Queen Mary University of London, London, UK   ²Vision Semantics Ltd
x.lan@qmul.ac.uk   eddy@visionsemantics.com   s.gong@qmul.ac.uk

1. Introduction

Solution: Knowledge Distillation

2. Methodology

Cross-Entropy with Hard vs. Soft Class Labels:

Ø ONE leads to higher correlations due to the learning constraint from the distillation loss;

Ø ONE yields superior mean model generalisation capability, with a lower error rate of 26.61 vs. 31.07 by ME.

3. Experiments

4. Further Analysis

Ø CIFAR and SVHN tests

Figure 1: Knowledge distillation and vanilla training

Figure 2: Overview of online distillation training of ResNet-110 by the proposed On-the-Fly Native Ensemble (ONE). With ONE, we reconfigure the network by adding m auxiliary branches. Each branch with shared layers makes an individual model, and their ensemble is used to build the teacher model.
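The reconfiguration described in the caption can be sketched as follows. This is a minimal PyTorch-style sketch assuming a ResNet-style backbone; the names (ONEModel, shared_trunk, make_branch, feat_dim) are illustrative and not taken from the authors' released code.

```python
# Minimal PyTorch-style sketch of the ONE reconfiguration in Figure 2,
# assuming a ResNet-style backbone. Names (ONEModel, shared_trunk,
# make_branch, feat_dim) are illustrative, not the authors' code.
import torch
import torch.nn as nn

class ONEModel(nn.Module):
    def __init__(self, shared_trunk, make_branch, num_branches, feat_dim):
        super().__init__()
        self.shared_trunk = shared_trunk   # low-level layers shared by all branches
        # m + 1 branches; each maps the shared feature to class logits.
        self.branches = nn.ModuleList([make_branch() for _ in range(num_branches)])
        # Gate: one importance score per branch, computed from the shared feature.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(feat_dim, num_branches), nn.Softmax(dim=1),
        )

    def forward(self, x):
        shared = self.shared_trunk(x)                                    # B x feat_dim x H x W
        logits = torch.stack([b(shared) for b in self.branches], dim=1)  # B x (m+1) x num_classes
        g = self.gate(shared)                                            # B x (m+1)
        ensemble_logits = (g.unsqueeze(-1) * logits).sum(dim=1)          # teacher logits, B x num_classes
        return logits, ensemble_logits
```

The gate takes the shared feature map and outputs one importance score per branch, matching the caption's description of building the teacher from the ensemble of branches.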


Ø ImageNet test

Ø Knowledge Distillation and Ensemble Comparisons

5. References

Comparison metrics: (1) Model variance: the average prediction difference between every two models/branches. (2) Mean model generalisation capability.
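As an illustration of metric (1), here is a minimal sketch that averages the prediction difference over every pair of models/branches. The distance used (mean absolute difference between softmax outputs) is an assumption; the poster does not specify the exact measure.

```python
# Sketch of metric (1): average prediction difference between every two
# models/branches. Mean absolute difference of softmax outputs is an
# assumed distance; the poster does not specify the exact measure.
import itertools
import torch

def model_variance(probs):
    """probs: list of N x num_classes prediction tensors, one per model/branch."""
    diffs = [(p - q).abs().mean() for p, q in itertools.combinations(probs, 2)]
    return torch.stack(diffs).mean()
```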

[1] G. Hinton, O. Vinyals, and J. Dean. "Distilling the knowledge in a neural network." arXiv:1503.02531, 2015.

[2] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu. "Deep mutual learning." CVPR, 2018.

Multi-Branch Design:

Ø Considers no correlation between classes

Ø Prone to model overfitting
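To make the two bullets above concrete, here is a toy comparison (with made-up numbers) of a one-hot hard label against a softened teacher distribution: the hard target gives no signal about which wrong classes are related, whereas the soft target does.

```python
# Toy example with made-up numbers: hard one-hot target vs. soft target.
# The soft target encodes that "tiger" is a far more plausible mistake for
# a cat image than "truck"; the one-hot target treats both as equally wrong.
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.5, -1.0]])        # student logits for (cat, tiger, truck)
hard_target = torch.tensor([0])                  # one-hot label: class "cat"
soft_target = torch.tensor([[0.70, 0.28, 0.02]]) # softened teacher distribution

ce_hard = F.cross_entropy(logits, hard_target)
ce_soft = -(soft_target * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
print(f"hard-label CE: {ce_hard.item():.3f}, soft-label CE: {ce_soft.item():.3f}")
```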

[Figure 1 diagram: (a) Vanilla training strategy; (b) Knowledge distillation. Legend: P = model predictions, L = labels, TS = training samples.]

Limitations of Offline Knowledge Distillation:

Ø Lengthy training time

Ø Complex multi-stage training process

Ø Possible teacher model overfitting

Limitations of Online Knowledge Distillation:

Ø Provides limited extra supervision information

Ø Still needs to train multiple models

Ø Complex asynchronous model updating

[Figure 2 diagram: low-level layers (conv1, res2x, res3x blocks) shared by all branches; m + 1 high-level branches (res4x block), each with its own classifier producing logits and predictions; a gate combines the branch logits into ensemble logits/predictions, which drive the on-the-fly knowledge distillation to every branch.]

Gate Network:

On-the-Fly Knowledge Distillation:

A gate which learns to ensemble all (m + 1) branches to build a stronger teacher:

m auxiliary branches with the same configuration, each serving as an independent efficient classification model.

Compute soft probability distributions at a temperature ofT for branches and the ONE teacher as:

Distill knowledge from the teacher to each branch:

Teacher (gated ensemble) logits, where $g_i$ is the gate's importance score for branch $i$:

$$z^e = \sum_{i=0}^{m} g_i \cdot z^i$$

Soft probability distributions at temperature $T$ for branch $i$ (and analogously for the teacher $e$), over $C$ classes:

$$\tilde{p}^i_j = \frac{\exp(z^i_j / T)}{\sum_{k=1}^{C} \exp(z^i_k / T)}$$

Distillation loss from the teacher to all branches:

$$\mathcal{L}_{kl} = \sum_{i=0}^{m} \sum_{j=1}^{C} \tilde{p}^e_j \log \frac{\tilde{p}^e_j}{\tilde{p}^i_j}$$
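A minimal PyTorch-style sketch of the two steps above (temperature softening and the teacher-to-branch KL term), assuming branch logits of shape B x (m+1) x C and ensemble logits of shape B x C, e.g. as returned by the ONEModel sketch earlier; the function name and the T = 3.0 default are placeholders, not taken from the paper's code.

```python
# Sketch of the on-the-fly distillation loss above. Assumes branch_logits is
# B x (m+1) x C and ensemble_logits is B x C; the name and T=3.0 default are
# placeholders, not taken from the paper's code.
import torch.nn.functional as F

def one_distillation_loss(branch_logits, ensemble_logits, T=3.0):
    # Soft probability distributions at temperature T.
    log_p_branch = F.log_softmax(branch_logits / T, dim=-1)            # B x (m+1) x C
    p_teacher = F.softmax(ensemble_logits / T, dim=-1).unsqueeze(1)    # B x 1 x C
    # KL(teacher || branch) per branch: sum_j p~e_j * log(p~e_j / p~i_j).
    kl = F.kl_div(log_p_branch, p_teacher.expand_as(log_p_branch),
                  reduction='none').sum(dim=-1)                        # B x (m+1)
    # Sum over branches, average over the batch.
    return kl.sum(dim=1).mean()
```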

Drawbacks of Hard-Label Based Cross-Entropy:

Overall Loss Function: $\mathcal{L} = \sum_{i=0}^{m} \mathcal{L}^i_{ce} + \mathcal{L}^e_{ce} + T^2 \cdot \mathcal{L}_{kl}$, i.e. the hard-label cross-entropy of every branch and of the gated ensemble (teacher), plus the $T^2$-scaled distillation loss.
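A sketch of how the overall objective above could be assembled: hard-label cross-entropy for every branch and for the gated ensemble, plus the $T^2$-scaled distillation term (recomputed inline here so the sketch stands alone); function name, shapes, and the T default are placeholders.

```python
# Sketch of the overall ONE objective above: per-branch CE + ensemble CE
# + T^2 * distillation loss. Shapes as before; names are placeholders.
import torch.nn.functional as F

def one_total_loss(branch_logits, ensemble_logits, labels, T=3.0):
    """branch_logits: B x (m+1) x C; ensemble_logits: B x C; labels: B (class indices)."""
    num_branches = branch_logits.size(1)
    # Hard-label cross-entropy for every branch and for the gated ensemble (teacher).
    ce = sum(F.cross_entropy(branch_logits[:, i], labels) for i in range(num_branches))
    ce = ce + F.cross_entropy(ensemble_logits, labels)
    # Distillation term (same form as the sketch above).
    log_p_branch = F.log_softmax(branch_logits / T, dim=-1)
    p_teacher = F.softmax(ensemble_logits / T, dim=-1).unsqueeze(1).expand_as(log_p_branch)
    kl = F.kl_div(log_p_branch, p_teacher, reduction='none').sum(dim=-1).sum(dim=1).mean()
    # T^2 keeps the distillation gradients on a comparable scale, as in [1].
    return ce + (T ** 2) * kl
```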

ONE vs. Model Ensemble (ME)

Ø Effect of On-the-Fly Knowledge Distillation