NetTailor: Tuning the Architecture, Not Just the Weights (morgado/nettailor/poster.pdf · 2019)


Transcript of NetTailor: Tuning the Architecture, Not Just the Weights (morgado/nettailor/poster.pdf · 2019)

NetTailor: Tuning the Architecture, Not Just the Weights
Pedro Morgado, Nuno Vasconcelos
University of California, San Diego

Abstract

Real-world applications of object recognition often require the solution of multiple tasks in a single platform. We propose a transfer learning procedure, denoted NetTailor, in which layers of a pre-trained CNN are used as universal blocks that can be combined with small task-specific layers to generate new networks. In this way, NetTailor can adapt the network architecture, not just its weights, to the target task.

Experiments show that networks adapted to simpler tasks become significantly smaller than those adapted to complex tasks. Due to the modular nature of the procedure, this reduction is achieved without compromising either parameter sharing across tasks or classification accuracy.

Complexity Constraints

Conditions for removing block B_ij, and the probability of each:
• Self exclusion: the block's own coefficient α_ij is small.  P(R_self^ij) = 1 − α_ij
• Input exclusion: all incoming paths have been removed.  P(R_input^ij) = ∏_k (1 − α_{I_k}), over the incoming blocks I_k
• Output exclusion: all departing paths have been removed.  P(R_output^ij) = ∏_k (1 − α_{O_k}), over the departing blocks O_k

Expected complexity:
E[C] = Σ_{i,j} C_ij (1 − P(R_self^ij ∪ R_input^ij ∪ R_output^ij))
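To make the removal probabilities and the resulting penalty concrete, here is a minimal PyTorch sketch, not the released implementation: the (source, destination) indexing of blocks, the container names, and the independence assumption used to evaluate the union of the three events are illustrative choices.

```python
# Sketch of the expected-complexity penalty E[C]; the indexing scheme and the
# independence assumption for the union of removal events are assumptions,
# not the authors' implementation.
import torch

def expected_complexity(alphas, cost):
    """alphas: {(src, dst): attention coefficient (0-dim tensor in [0, 1])}
    cost:   {(src, dst): complexity C_ij of that block (e.g. its FLOPs)}."""

    def all_removed(paths):
        # Probability that every listed path is removed; with no candidate
        # paths the block cannot be excluded this way (probability 0).
        return torch.stack(paths).prod() if paths else torch.zeros(())

    total = torch.zeros(())
    for (src, dst), a in alphas.items():
        p_self = 1.0 - a                                    # self exclusion
        p_input = all_removed(                              # all incoming paths gone
            [1.0 - b for (s, d), b in alphas.items() if d == src])
        p_output = all_removed(                             # all departing paths gone
            [1.0 - b for (s, d), b in alphas.items() if s == dst])
        # P(kept) = 1 - P(R_self ∪ R_input ∪ R_output); treating the three
        # events as independent, this is the product of the complements.
        p_kept = (1.0 - p_self) * (1.0 - p_input) * (1.0 - p_output)
        total = total + cost[(src, dst)] * p_kept
    return total
```

In the first learning stage this scalar is added to the training loss with weight λ2.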

Goals
• High performance on all tasks (HP)
• Models share parameters and computation so they can be supported on the same device (SP)
• Models can be trained independently of other tasks/datasets, enabling distributed development (DD)
• Adaptive model complexity: easier tasks → simpler models, complex tasks → complex models (AC)

Prior Work

[Figure: diagrams of prior transfer approaches (Finetuning, Feature Extraction, Universal classifier, Residual Adapters), each built on a pretrained model shared across Task 1 through Task 4 and scored against the four goals (HP, SP, DD, AC). Captions note that some approaches train the feature extractor jointly for all tasks, while others keep the feature extractor frozen after pre-training.]

Idea

[Figure: three diagrams of the network as a graph of universal blocks (U), small task-specific blocks (P), and the task classifier (C), one for each of the three steps below.]

1. Start with a pre-trained network that remains unchanged.
2. Augment the original network (the universal blocks, U) with a set of small task-specific blocks (P).
3. Find the smallest combination of layers and connections that solves the task.
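As an illustration of steps 1 and 2, the sketch below freezes a pre-trained backbone and attaches small task-specific proxy blocks between nearby layers. The function name, the 1x1-convolution proxies, and the three-layer window are assumptions made here for concreteness, not details taken from the poster; step 3 is carried out by the path-selection and pruning machinery described later.

```python
# Sketch of steps 1 and 2: freeze the pre-trained (universal) blocks and add
# small task-specific proxies between nearby layers.  All names and the choice
# of 1x1 convolutions are illustrative; spatial resolution changes between
# layers are ignored for brevity.
import torch.nn as nn

def add_task_specific_blocks(backbone_blocks, channels, window=3):
    """channels[j] is the number of channels of activation x_j
    (len(channels) == len(backbone_blocks) + 1, including the stem output x_0)."""
    # Step 1: the universal blocks remain unchanged (frozen).
    for block in backbone_blocks:
        for p in block.parameters():
            p.requires_grad = False
    # Step 2: a low-cost proxy P_k^i from layer k to a later layer i.
    proxies = nn.ModuleDict()
    for i in range(1, len(backbone_blocks) + 1):
        for k in range(max(0, i - window), i):
            proxies[f"p{k}_{i}"] = nn.Conv2d(channels[k], channels[i], kernel_size=1)
    return proxies
```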

Evaluation

Learned architectures

[Table: architectures learned for SVHN, Flowers, and VOC, reporting the number of universal layers, FLOPs, number of parameters, and accuracy for the pruned and unpruned models.]

Complex tasks require larger models; simpler tasks require smaller models.

Path Selection

[Figure: layer output x_i is computed from the frozen universal block U_i applied to x_{i−1} and small task-specific proxies P_{i−1}^i, P_{i−2}^i, P_{i−3}^i applied to x_{i−1}, x_{i−2}, x_{i−3}, each path weighted by its coefficient α.]

An attention coefficient α ∈ [0, 1] is attached to each block:
  x_i = α_ii U_i(x_{i−1}) + Σ_k α_ki P_k^i(x_k)

Blocks are removed from the inference path when their α is small. The coefficients α are parameterized by a softmax distribution to force network paths to compete.
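Read as code, the gated combination above might look like the following minimal PyTorch sketch; the class name, the `skips` argument, and the single softmax over all candidate paths into layer i are illustrative assumptions rather than the released implementation.

```python
# Sketch of x_i = a_ii * U_i(x_{i-1}) + sum_k a_ki * P_k^i(x_k).
# `universal` is the frozen pre-trained block U_i; `proxies[k]` is the small
# task-specific block P_k^i fed by the earlier activation skips[k].
import torch
import torch.nn as nn

class TailoredLayer(nn.Module):
    def __init__(self, universal, proxies):
        super().__init__()
        self.universal = universal
        self.proxies = nn.ModuleList(proxies)
        # One logit per candidate path; the softmax keeps the coefficients
        # in [0, 1] and summing to one, so paths compete for mass.
        self.logits = nn.Parameter(torch.zeros(1 + len(proxies)))

    def forward(self, x_prev, skips):
        alpha = torch.softmax(self.logits, dim=0)
        out = alpha[0] * self.universal(x_prev)             # a_ii * U_i(x_{i-1})
        for a, proxy, x_k in zip(alpha[1:], self.proxies, skips):
            out = out + a * proxy(x_k)                      # a_ki * P_k^i(x_k)
        return out
```

At pruning time, paths whose coefficient stays small would simply be dropped from this sum.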

Teacher Network

[Figure: teacher and student networks side by side, with distances d measured between corresponding layer activations.]

To preserve performance, a teacher network finetuned on the target task is used to supervise the student at each stage:
  L_t(x) = Σ_l ||x_l^t − x_l||²
where x_l^t and x_l denote the teacher and student activations at layer l.

Learning

1st stage: tuning the architecture. The student model is optimized by SGD under the loss
  ℒ = L_c(x, y) + λ1 L_t(x) + λ2 E[C]

2nd stage: pruning and retraining. Blocks with small α are removed, and the remaining parameters are retrained with λ2 = 0.
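A minimal sketch of the first-stage objective, assuming cross-entropy for the task loss L_c, matched lists of teacher and student activations for L_t, and a precomputed E[C] term; the default weights and all names are illustrative.

```python
# First-stage loss  L = L_c(x, y) + lam1 * L_t(x) + lam2 * E[C]  (sketch).
import torch.nn.functional as F

def stage1_loss(logits, labels, student_feats, teacher_feats,
                complexity_penalty, lam1=1.0, lam2=1.0):
    l_c = F.cross_entropy(logits, labels)                   # task loss L_c
    l_t = sum(((s - t.detach()) ** 2).sum()                 # teacher loss L_t
              for s, t in zip(student_feats, teacher_feats))
    return l_c + lam1 * l_t + lam2 * complexity_penalty

# Second stage (sketch): remove blocks whose alpha fell below a threshold,
# then retrain the remaining task-specific parameters with lam2 = 0.
```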

Future Work

❖ Study different block pruning strategies
❖ Extend the NetTailor path-selection framework to tasks beyond recognition (e.g. detection, semantic segmentation, multi-task learning)
❖ Study the benefits of NetTailor for general neural architecture search

NetTailor on different datasets

[Bar charts: parameters [M] and FLOPs [B] of the pruned vs. unpruned models on ImageNet, Flowers, CIFAR100, Dped, UCF101, SVHN, DTD, Omniglot, Aircraft, and GTSR.]

Compression rates of up to 80% of parameters and 66% of FLOPs.

NetTailor on different base networks

[Plot: number of parameters [M] vs. accuracy [%] for the original and pruned networks on SVHN, Flowers, and VOC.]

ResNet50 is pruned to the size of ResNet18 while preserving its performance.

Support

Website: https://pedro-morgado.github.io/nettailor
Repository: https://github.com/pedro-morgado/nettailor
