Deep learning and weak supervision for image classification

Matthieu Cord, joint work with Thibaut Durand and Nicolas Thome
Sorbonne Universités - Université Pierre et Marie Curie (UPMC)
Laboratoire d'Informatique de Paris 6 (LIP6) - MLIA Team, UMR CNRS
June 09, 2016
1/35
Outline
Context: Visual classification
1. MANTRA: Latent variable model to boost classification performance
2. WELDON: extension to Deep CNN
2/35
Motivations
• Working on datasets with complex scenes (large and cluttered background), not centered objects, variable size, ...
  VOC07/12, MIT67, 15 Scene, COCO, VOC12 Action
• Select relevant regions → better prediction
• ImageNet: centered objects
  - Efficient transfer: needs bounding boxes [Oquab, CVPR14]
• Full annotations are expensive ⇒ training with weak supervision
3/35
Motivations
How to learn without bounding boxes?
• Multiple-Instance Learning / latent variables for missing information [Felzenszwalb, PAMI10]
• Latent SVM and extensions ⇒ MANTRA

How to learn deep without bounding boxes?
• Learning invariance with input image transformations
  - Spatial Transformer Networks [Jaderberg, NIPS15]
• Attention models: select relevant regions
  - Stacked Attention Networks for Image Question Answering [Yang, CVPR16]
• Parts models
  - Automatic discovery and optimization of parts for image classification [Parizi, ICLR15]
• Deep MIL
  - Is object localization for free? [Oquab, CVPR15]
  - Deep extension of MANTRA: WELDON
4/35
Notations
Variable   Notation   Space   Train        Test         Example
Input      x          X       observed     observed     image
Output     y          Y       observed     unobserved   label
Latent     h          H       unobserved   unobserved   region

• Model missing information with latent variables h
• Most popular approach in Computer Vision: Latent SVM [Felzenszwalb, PAMI10] [Yu, ICML09]
5/35
Latent Structural SVM [Yu, ICML09]
• Prediction function:

  (ŷ, ĥ) = argmax_{(y,h) ∈ Y×H} ⟨w, Ψ(x, y, h)⟩   (1)

  - Ψ(x, y, h): joint feature map
  - Joint inference in the (Y×H) space

• Training: a set of N labeled training pairs (x_i, y_i)
  - Objective function: upper bound of ∆(y_i, ŷ_i)

  min_w (1/2)‖w‖² + (C/N) Σ_{i=1}^N [ max_{(y,h) ∈ Y×H} (∆(y_i, y) + ⟨w, Ψ(x_i, y, h)⟩) − max_{h ∈ H} ⟨w, Ψ(x_i, y_i, h)⟩ ]

  (each bracketed summand is ≥ ∆(y_i, ŷ_i))

  - Difference of convex functions, solved with CCCP
  - Loss-Augmented Inference (LAI): max_{(y,h) ∈ Y×H} [∆(y_i, y) + ⟨w, Ψ(x_i, y, h)⟩]
  - Challenge exacerbated in the latent case: (Y×H) space
6/35
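As a toy illustration of the joint inference of Eq. (1), here is a minimal numpy sketch. The per-class linear scoring layout and all names (`lssvm_predict`, `phi`) are illustrative assumptions, not from the talk:

```python
import numpy as np

def lssvm_predict(w, phi):
    """Joint inference of Eq. (1): (y*, h*) = argmax_{y,h} <w_y, phi(x, h)>.

    w   : (K, D) array, one weight vector per class (hypothetical layout)
    phi : (H, D) array, one D-dim descriptor per candidate region h
    """
    scores = phi @ w.T                       # (H, K): score of every (h, y) pair
    h_star, y_star = np.unravel_index(np.argmax(scores), scores.shape)
    return int(y_star), int(h_star)

# toy example: 3 classes, 4 candidate regions, 5-dim region descriptors
rng = np.random.default_rng(0)
w = rng.normal(size=(3, 5))
phi = rng.normal(size=(4, 5))
y_hat, h_hat = lssvm_predict(w, phi)
```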
MANTRA: Minimum Maximum Latent Structural SVM
Classifying only with the max scoring latent value is not always relevant.

MANTRA model:
• Pair of latent variables (h⁺_{i,y}, h⁻_{i,y})
  - max scoring latent value: h⁺_{i,y} = argmax_{h ∈ H} ⟨w, Ψ(x_i, y, h)⟩
  - min scoring latent value: h⁻_{i,y} = argmin_{h ∈ H} ⟨w, Ψ(x_i, y, h)⟩
• New scoring function:

  D_w(x_i, y) = ⟨w, Ψ(x_i, y, h⁺_{i,y})⟩ + ⟨w, Ψ(x_i, y, h⁻_{i,y})⟩   (2)

• Prediction function ⇒ find the output with maximum score:

  ŷ = argmax_{y ∈ Y} D_w(x_i, y)   (3)

• MANTRA: max+min vs max for LSSVM ⇒ negative evidence
7/35
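The max+min scoring of Eqs. (2)-(3) can be sketched in a few lines of numpy, assuming linear region scores ⟨w_y, φ(x, h)⟩; all names are hypothetical:

```python
import numpy as np

def mantra_score(w_y, phi):
    """D_w(x, y) of Eq. (2): best region score plus worst region score."""
    s = phi @ w_y                  # score of every region h for class y
    return s.max() + s.min()       # <w,Psi(x,y,h+)> + <w,Psi(x,y,h-)>

def mantra_predict(w, phi):
    """Eq. (3): pick the class whose max+min score is largest."""
    return int(np.argmax([mantra_score(w_y, phi) for w_y in w]))

# toy example: 2 classes, 3 regions, 2-dim descriptors
w = np.array([[1.0, 0.0], [0.0, 1.0]])
phi = np.array([[2.0, 0.0], [-1.0, 3.0], [0.0, 1.0]])
# class 0 region scores: [2, -1, 0] -> 2 + (-1) = 1
# class 1 region scores: [0, 3, 1]  -> 3 + 0   = 3
```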
MANTRA: Model & Training Rationale
Intuition of the max+min prediction function
• x image, h image region, y image class
• ⟨w, Ψ(x, y, h)⟩: score of region h for class y
• D_w(x, y) = ⟨w, Ψ(x, y, h⁺_y)⟩ + ⟨w, Ψ(x, y, h⁻_y)⟩
  - h⁺_y: presence of class y ⇒ large for y_i
  - h⁻_y: localized evidence of the absence of class y
    · Not too low for y_i ⇒ latent space regularization
    · Low for y ≠ y_i ⇒ tracking negative evidence [Parizi, ICLR15]

street image x: D_w(x, street) = 2, D_w(x, highway) = 0.7, D_w(x, coast) = −1.5
8/35
MANTRA: Model Training
Learning formulation
• Loss function:

  ℓ_w(x_i, y_i) = max_{y ∈ Y} [∆(y_i, y) + D_w(x_i, y)] − D_w(x_i, y_i)

  - (Margin rescaling) upper bound of ∆(y_i, ŷ), constraints:

    ∀y ≠ y_i:  D_w(x_i, y_i)  ≥  ∆(y_i, y)  +  D_w(x_i, y)
               (ground-truth      (margin)      (score for
                output score)                    other output)

• Non-convex optimization problem:

  min_w (1/2)‖w‖² + (C/N) Σ_{i=1}^N ℓ_w(x_i, y_i)   (4)

• Solver: non-convex one-slack cutting plane [Do, JMLR12]
  - Fast convergence
  - Direct optimization ≠ CCCP for LSSVM
  - Still needs to solve LAI: max_y [∆(y_i, y) + D_w(x_i, y)]
9/35
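A minimal sketch of the margin-rescaled loss above, reusing the D_w scores of the street-image example (the 0/1 task loss and class ordering street/highway/coast are illustrative assumptions):

```python
def mantra_loss(D, y_true, delta):
    """Margin-rescaled loss: max_y [Delta(y_i, y) + D_w(x_i, y)] - D_w(x_i, y_i).

    D     : list of MANTRA scores D_w(x_i, y), one entry per class
    delta : task loss Delta(y_i, y), e.g. the 0/1 loss below
    """
    return max(delta(y_true, y) + D[y] for y in range(len(D))) - D[y_true]

def zero_one(yi, y):
    """0/1 task loss used for (multi-class) classification."""
    return 0.0 if y == yi else 1.0

# scores of the street-image example: D(street)=2, D(highway)=0.7, D(coast)=-1.5
D = [2.0, 0.7, -1.5]
```

With the correct label (street) the margin is satisfied and the loss is zero; labeling the same image "highway" would be penalized.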
MANTRA: Optimization
• MANTRA instantiation: define (x, y, h), Ψ(x, y, h), ∆(y_i, y)
• Instantiations: binary & multi-class classification, AP ranking

               Binary              Multi-class                    AP Ranking
  x            bag                 bag                            set of bags
               (set of regions)    (set of regions)               (of regions)
  y            ±1                  {1, ..., K}                    ranking matrix
  h            instance (region)   region                         regions
  Ψ(x, y, h)   y · Φ(x, h)         [1(y=1)Φ(x, h), ...,           joint latent ranking
                                    1(y=K)Φ(x, h)]                feature map
  ∆(y_i, y)    0/1 loss            0/1 loss                       AP loss
  LAI          exhaustive          exhaustive                     exact and efficient

• Solve inference max_y D_w(x_i, y) & LAI max_y [∆(y_i, y) + D_w(x_i, y)]
  - Exhaustive for binary/multi-class classification
  - Exact and efficient solutions for ranking
10/35
WELDON: Weakly supErvised Learning of Deep cOnvolutional Nets

• MANTRA extension for training deep CNNs
• Learning Ψ(x, y, h): end-to-end learning of deep CNNs with structured prediction and latent variables
  - Incorporating multiple positive & negative evidence
  - Training deep CNNs with structured loss
11/35
Standard deep CNN architecture: VGG16
Simonyan et al. Very deep convolutional networks for large-scale image recognition. ICLR 2015
12/35
MANTRA adaptation for deep CNN

Problem
• Fixed-size image as input

Adapt architecture to weakly supervised learning
1. Fully connected layers → convolution layers
   - sliding window approach
2. Spatial aggregation
   - Perform object localization prediction

13/35
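Replacing the fully connected classifier by a convolution means applying the same weights at every spatial position, which is what makes the sliding-window evaluation cheap. A hedged numpy sketch of this equivalence (shapes and names are illustrative):

```python
import numpy as np

def fc_as_conv(feature_map, w, b):
    """Apply an FC classifier (w, b) at every spatial position: this is a
    1x1 convolution, turning a (H, W, D) feature map into a (H, W, K)
    map of per-window class scores instead of one global score vector."""
    return feature_map @ w.T + b

rng = np.random.default_rng(0)
fmap = rng.normal(size=(6, 6, 512))   # conv feature map of a larger image
w = rng.normal(size=(20, 512))        # 20 classes, as in Pascal VOC
b = np.zeros(20)
score_map = fc_as_conv(fmap, w, b)    # (6, 6, 20) sliding-window scores
```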
WELDON: deep architecture
• C: number of classes

14/35
Aggregation function
[Oquab, 2015]
• Region aggregation = max
• Select the highest-scoring window

original image | motorbike feature map | max prediction

Oquab, Bottou, Laptev, Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. CVPR 2015

15/35
WELDON: region aggregation
Aggregation strategy:
• max+min pooling (MANTRA prediction function)
• k instances
  - From a single region to multiple high-scoring regions:

    max → (1/k) Σ_{i=1}^k (i-th max)      min → (1/k) Σ_{i=1}^k (i-th min)

  - More robust region selection [Vasconcelos, CVPR15]

Maps: max | max+min | 3max+3min

16/35
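The k-max + k-min aggregation above can be sketched in numpy (function name and layout are illustrative; k = 1 recovers MANTRA's max + min):

```python
import numpy as np

def weldon_pool(score_map, k=3):
    """WELDON spatial aggregation: average of the k highest region scores
    plus average of the k lowest (k = 1 is MANTRA's max + min)."""
    s = np.sort(np.ravel(score_map))
    return s[-k:].mean() + s[:k].mean()

scores = np.array([5.0, 1.0, 3.0, -2.0, 0.0, 4.0])
```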
WELDON: architecture
17/35
WELDON: learning
• Objective function for multi-class task and k = 1:

  min_w R(w) + (1/N) Σ_{i=1}^N ℓ(f_w(x_i), y_i^gt)

  f_w(x_i) = argmax_y ( max_h L^w_conv(x_i, y, h) + min_{h'} L^w_conv(x_i, y, h') )

How to learn the deep architecture?
• Stochastic gradient descent training.
• Back-propagation of the error through the selected windows.

18/35
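Back-propagation through the window selection works like the backward pass of max pooling: the gradient flows only to the argmax and argmin windows. A minimal numpy sketch (names hypothetical, shown for k = 1 on a flat score vector):

```python
import numpy as np

def max_min_backward(score_map, grad_out):
    """Backward pass of max+min pooling: the incoming gradient is routed
    only to the selected (argmax and argmin) windows; every other window
    receives zero gradient, as in standard max-pooling."""
    g = np.zeros_like(score_map)
    g[np.argmax(score_map)] += grad_out
    g[np.argmin(score_map)] += grad_out
    return g

scores = np.array([0.2, 1.5, -0.7])
grad = max_min_backward(scores, 1.0)
```

This is why, on the next slides, training increases the scores of the selected windows when the class is present and decreases them when it is absent.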
WELDON: learning
Class is present
• Increase the scores of the selected windows.
Figure: Car map
19/35
WELDON: learning
Class is absent
• Decrease the scores of the selected windows.
Figure: Boat map
20/35
Experiments
• VGG16 pre-trained on ImageNet
• Torch7 implementation
Datasets
• Object recognition: Pascal VOC 2007, Pascal VOC 2012
• Scene recognition: MIT67, 15 Scene
• Visual recognition where context plays an important role: COCO, Pascal VOC 2012 Action
VOC07/12 MIT67 15 Scene COCO VOC12 Action
21/35
Experiments
Dataset        Train     Test      Classes   Classification
VOC07          ~5,000    ~5,000    20        multi-label
VOC12          ~5,700    ~5,800    20        multi-label
15 Scene       1,500     2,985     15        multi-class
MIT67          5,360     1,340     67        multi-class
VOC12 Action   ~2,000    ~2,000    10        multi-label
COCO           ~80,000   ~40,000   80        multi-label
22/35
Experiments
• Multi-scale: 8 scales (combination with Object Bank strategy)
23/35
Object recognition
                          VOC 2007   VOC 2012
VGG16 (online code) [1]   84.5       82.8
SPP net [2]               82.4       –
Deep WSL MIL [3]          –          81.8
WELDON                    90.2       88.5
Table: mAP results on object recognition datasets.
[1] Simonyan et al. Very deep convolutional networks. ICLR 2015
[2] He et al. Spatial pyramid pooling in deep convolutional networks. ECCV 2014
[3] Oquab et al. Is object localization for free? CVPR 2015
24/35
Scene recognition
                          15 Scene   MIT67
VGG16 (online code) [1]   91.2       69.9
MOP CNN [2]               –          68.9
Negative parts [3]        –          77.1
WELDON                    94.3       78.0
Table: Multi-class accuracy results on scene categorization datasets.
[1] Simonyan et al. Very deep convolutional networks. ICLR 2015
[2] Gong et al. Multi-scale Orderless Pooling of Deep Convolutional Activation Features. ECCV 2014
[3] Parizi et al. Automatic discovery and optimization of parts. ICLR 2015
25/35
Context datasets
                            VOC 2012 Action   COCO
VGG16 (online code) [1]     67.1              59.7
Deep WSL MIL [2]            –                 62.8
WELDON (our WSL deep CNN)   75.0              68.8
Table: mAP results on context datasets.
[1] Simonyan et al. Very deep convolutional networks. ICLR 2015
[2] Oquab et al. Is object localization for free? CVPR 2015
26/35
Visual results
Aeroplane model (1.8) Bus model (-0.4)
27/35
Visual results
Motorbike model (1.1) Sofa model (-0.8)
28/35
Visual results
Sofa model (1.2) Horse model (-0.6)
29/35
Visual results (failing examples)
Buffet Restaurant kitchen
30/35
Visual results (failing examples)
Kindergarten | Classroom
31/35
Analysis
Impact of the different improvements:

  a) max   b) +k=3   c) +min   d) +AP   VOC07   VOC12 Action
  X                                     83.6    53.5
  X        X                            86.3    62.6
  X                  X                  87.5    68.4
  X        X         X                  88.4    71.7
  X        X                   X        87.8    69.8
  X        X         X         X        88.9    72.6

• WSL detection results on VOC 2012 Action:

         max (a) [Oquab, 2015]   WELDON
  IoU    25.6                    30.4
32/35
Analysis
• Impact of the number of regions k
k=1 k=3
33/35
Connections to other Latent Variable Models

• Hidden CRF (HCRF) [Quattoni, PAMI07]

  (1/2)‖w‖² + (C/N) Σ_{i=1}^N [ log Σ_{(y,h) ∈ Y×H} exp⟨w, Ψ(x_i, y, h)⟩ − log Σ_{h ∈ H} exp⟨w, Ψ(x_i, y_i, h)⟩ ]

• Latent Structural SVM (LSSVM) [Yu, ICML09]

  (1/2)‖w‖² + (C/N) Σ_{i=1}^N [ max_{(y,h) ∈ Y×H} (∆(y_i, y) + ⟨w, Ψ(x_i, y, h)⟩) − max_{h ∈ H} ⟨w, Ψ(x_i, y_i, h)⟩ ]

• Marginal Structural SVM (MSSVM) [Ping, ICML14]

  (1/2)‖w‖² + (C/N) Σ_{i=1}^N [ max_y (∆(y_i, y) + log Σ_{h ∈ H} exp⟨w, Ψ(x_i, y, h)⟩) − log Σ_{h ∈ H} exp⟨w, Ψ(x_i, y_i, h)⟩ ]

• WELDON

  (1/2)‖w‖² + (C/N) Σ_{i=1}^N [ max_y (∆(y_i, y) + Σ_{h ∈ Ω ⊆ H} ⟨w, Ψ(x_i, y, h)⟩) − Σ_{h ∈ Ω ⊆ H} ⟨w, Ψ(x_i, y_i, h)⟩ ]
34/35
Thibaut Durand Nicolas Thome Matthieu Cord
MLIA Team (Patrick Gallinari)
Sorbonne Universités - UPMC Paris 6 - LIP6

MANTRA project page: http://webia.lip6.fr/~durandt/project/mantra.html

Thibaut Durand, Nicolas Thome, and Matthieu Cord.
MANTRA: Minimum Maximum Latent Structural SVM for Image Classification and Ranking.
In IEEE International Conference on Computer Vision (ICCV), 2015.

Thibaut Durand, Nicolas Thome, and Matthieu Cord.
WELDON: Weakly Supervised Learning of Deep Convolutional Neural Networks.
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
35/35