R-CNN - TAUweb.eng.tau.ac.il/deep_learn/wp-content/uploads/2017/01/RCNN.pdf · R-CNN Test Time per...
R-CNN
Over 2,180 citations!
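Evaluation metrics:
Method → $B_p$: predicted bounding box
Confidence score per detection
Ground truth → $B_{gt}$: actual bounding box
Area overlap: $\mathrm{IoU} \triangleq \dfrac{\mathrm{Area}(B_p \cap B_{gt})}{\mathrm{Area}(B_p \cup B_{gt})}$
Correct detection: $\mathrm{IoU} > \frac{1}{2}$
Average precision: $\mathrm{AP}$
Mean average precision: $\mathrm{mAP} \triangleq \mathrm{mean}(\mathrm{AP} \text{ over all classes})$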
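A minimal Python sketch of the overlap criterion and of mAP, assuming boxes are given as (x1, y1, x2, y2) corner coordinates (the slides do not fix a box format). The function names are illustrative, and the per-class AP values are assumed to be computed elsewhere.

```python
def iou(box_p, box_gt):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle (may be empty).
    ix1 = max(box_p[0], box_gt[0])
    iy1 = max(box_p[1], box_gt[1])
    ix2 = min(box_p[2], box_gt[2])
    iy2 = min(box_p[3], box_gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    union = area_p + area_gt - inter
    return inter / union if union > 0 else 0.0


def is_correct_detection(box_p, box_gt, threshold=0.5):
    """A detection counts as correct when IoU exceeds the threshold (1/2)."""
    return iou(box_p, box_gt) > threshold


def mean_average_precision(ap_per_class):
    """mAP as the plain mean of per-class AP values (dict: class -> AP)."""
    return sum(ap_per_class.values()) / len(ap_per_class)
```

For example, two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, so their IoU is 1/7 and the detection would not count as correct.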
R-CNN pipeline:
1. Input image
2. Regions of interest (ROI) from a proposal method (~2k)
3. Warped image regions
4. Forward each region through ConvNet
5. Classify each region with SVMs
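The warping and classification steps can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the 227x227 target size, nearest-neighbor resampling, and the names `warp_region` and `classify_regions` are not from the slides, and the ConvNet that would turn each warped region into a feature vector is left out.

```python
import numpy as np


def warp_region(image, box, size=227):
    """Crop box (x1, y1, x2, y2) from an H x W x C image and resize it
    to size x size with nearest-neighbor sampling (assumed resampling)."""
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    # Map each output row/column back to a source row/column index.
    rows = np.arange(size) * crop.shape[0] // size
    cols = np.arange(size) * crop.shape[1] // size
    return crop[rows][:, cols]


def classify_regions(features, svm_weights, svm_bias):
    """Score region features with one linear SVM per class.

    features:    (num_regions, feature_dim)
    svm_weights: (num_classes, feature_dim)
    svm_bias:    (num_classes,)
    Returns the best class index and its score for every region.
    """
    scores = features @ svm_weights.T + svm_bias
    return scores.argmax(axis=1), scores.max(axis=1)
```

Every region is warped to the same fixed size regardless of its aspect ratio, which is what allows a single ConvNet with a fixed input size to process all of the roughly 2k proposals.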
Mini-batch size of 128
mAP improves by 3-5%
R-CNN pipeline with bounding-box regression:
1. Input image
2. Regions of interest (ROI) from a proposal method (~2k)
3. Warped image regions
4. Forward each region through ConvNet
5. Classify each region with SVMs
6. Apply bounding box regressors
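The slides do not spell out the form of the bounding box regressors. Assuming the center/size parameterization used in the R-CNN paper, a predicted offset (dx, dy, dw, dh) would shift and rescale a proposal roughly as in this sketch; the function name and the (cx, cy, w, h) box format are illustrative choices.

```python
import math


def apply_box_regression(proposal, deltas):
    """Refine a proposal box with predicted regression offsets.

    proposal: (cx, cy, w, h) center coordinates, width, height
    deltas:   (dx, dy, dw, dh) predicted by the regressor
    The center moves by a fraction of the proposal size; width and
    height are scaled in log space (assumed parameterization).
    """
    px, py, pw, ph = proposal
    dx, dy, dw, dh = deltas
    return (
        pw * dx + px,
        ph * dy + py,
        pw * math.exp(dw),
        ph * math.exp(dh),
    )
```

With all offsets at zero the proposal is returned unchanged, so the regressor only has to learn corrections relative to the proposal.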
Fast R-CNN, arXiv:1504.08083 (2015)
By: Ross Girshick, Microsoft Research
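Fast R-CNN multi-task loss:
$$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda \cdot [u \ge 1] \cdot L_{loc}(t^u, v)$$
$p = (p_0, p_1, \ldots, p_K)$: class probabilities over $K + 1$ categories
$t^k = (t^k_x, t^k_y, t^k_w, t^k_h)$: bounding-box offsets for each of the $K$ object classes, indexed by $k$
$u$: the ground truth class of the RoI
$v$: the ground truth bounding box
$L_{cls}(p, u) = -\log p_u$
$\lambda$: regularization parameter
$[u \ge 1]$: foreground activation (1 if $u \ge 1$, 0 for the background class $u = 0$)
$$L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L1}(t^u_i - v_i)$$
$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5 \cdot x^2, & |x| < 1 \\ |x| - 0.5, & |x| \ge 1 \end{cases}$$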
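The loss for a single RoI can be written out directly from these definitions. The sketch below is plain Python for illustration; the function names and the default `lam=1.0` are assumptions, since the slides do not give a value for λ.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5


def fast_rcnn_loss(p, u, t_u, v, lam=1.0):
    """Multi-task loss for one RoI.

    p:   class probabilities (p_0 ... p_K), index 0 is background
    u:   ground truth class index
    t_u: predicted offsets (tx, ty, tw, th) for class u
    v:   ground truth box targets (vx, vy, vw, vh)
    lam: weight of the localization term (assumed default)
    """
    l_cls = -math.log(p[u])
    l_loc = sum(smooth_l1(t_i - v_i) for t_i, v_i in zip(t_u, v))
    foreground = 1.0 if u >= 1 else 0.0  # the [u >= 1] indicator
    return l_cls + lam * foreground * l_loc
```

For a background RoI (u = 0) the localization term is switched off, so the predicted offsets have no effect on the loss.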
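Back-propagation through the RoI pooling layer:
$$y_{rj} = x_{i^*(r,j)}, \qquad i^*(r,j) = \operatorname*{argmax}_{i' \in \mathcal{R}(r,j)} x_{i'}$$
$$\frac{\partial L}{\partial x_i} = \sum_r \sum_j \, [i = i^*(r,j)] \, \frac{\partial L}{\partial y_{rj}}$$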
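A small NumPy sketch of these two equations, assuming the input activations are flattened into a 1-D array and each pooling bin R(r, j) is given as a list of input indices. The dictionary layout and function names are illustrative only.

```python
import numpy as np


def roi_pool_forward(x, bins):
    """Max-pool each bin.

    x:    1-D array of input activations
    bins: dict mapping (r, j) -> list of input indices R(r, j)
    Returns y[(r, j)] and the winning index i*(r, j) for every bin.
    """
    y, argmax = {}, {}
    for key, indices in bins.items():
        indices = np.asarray(indices)
        i_star = int(indices[np.argmax(x[indices])])
        argmax[key] = i_star
        y[key] = float(x[i_star])
    return y, argmax


def roi_pool_backward(num_inputs, argmax, grad_y):
    """Route each output gradient to the input that won the max.

    An input selected by several bins (possibly from different RoIs)
    accumulates the gradients of all of them.
    """
    grad_x = np.zeros(num_inputs)
    for key, i_star in argmax.items():
        grad_x[i_star] += grad_y[key]
    return grad_x
```

Inputs that were never the maximum of any bin receive zero gradient, which is exactly what the indicator [i = i*(r, j)] expresses.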
Faster R-CNN, Neural Information Processing Systems (NIPS), 2015
By: S. Ren, K. He, R. Girshick, J. Sun, Microsoft Research
RPN loss:
$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \cdot \frac{1}{N_{reg}} \sum_i p_i^* \cdot L_{reg}(t_i, t_i^*)$$
$i$: anchor index
$p_i$: predicted probability of anchor $i$ being an object
$p_i^* = 1$ if anchor $i$ is positive, $0$ if anchor $i$ is negative
$L_{cls}(p_i, p_i^*)$: log loss over two classes
$N_{cls}$: the mini-batch size (256)
$L_{reg}(t_i, t_i^*) = \mathrm{smooth}_{L1}(t_i - t_i^*)$
$N_{reg}$: the number of anchor locations (~2,400)
Parameterizations of all the $t_i$ using the anchors, with $t = (t_x, t_y, t_w, t_h)$:
$$t_x = (x - x_a)/w_a, \qquad t_x^* = (x^* - x_a)/w_a$$
$$t_y = (y - y_a)/h_a, \qquad t_y^* = (y^* - y_a)/h_a$$
$$t_w = \log(w/w_a), \qquad t_w^* = \log(w^*/w_a)$$
$$t_h = \log(h/h_a), \qquad t_h^* = \log(h^*/h_a)$$
$x$: the predicted position
$x_a$: the anchor position
$x^*$: the GT position
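The anchor parameterization and the two-term RPN loss can be sketched together in plain Python. Boxes are assumed to be (x, y, w, h) with x, y the box center. The function names and the default `lam=10.0` are assumptions, as the slides do not state a value for λ. The defaults for `n_cls` and `n_reg` follow the numbers on the slide.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5


def encode_box(box, anchor):
    """Parameterize a box (x, y, w, h) relative to an anchor (xa, ya, wa, ha)."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))


def rpn_loss(p, p_star, t, t_star, n_cls=256, n_reg=2400, lam=10.0):
    """RPN loss over a mini-batch of anchors.

    p:      predicted objectness probability per anchor
    p_star: 1 for positive anchors, 0 for negative anchors
    t:      predicted parameterized boxes per anchor
    t_star: parameterized ground truth boxes per anchor
    lam:    balancing weight (assumed default, not from the slides)
    """
    l_cls = 0.0
    l_reg = 0.0
    for p_i, ps_i, t_i, ts_i in zip(p, p_star, t, t_star):
        # Two-class log loss: object vs. not object.
        l_cls += -math.log(p_i) if ps_i == 1 else -math.log(1.0 - p_i)
        # The regression term is active only for positive anchors.
        l_reg += ps_i * sum(smooth_l1(a - b) for a, b in zip(t_i, ts_i))
    return l_cls / n_cls + lam * l_reg / n_reg
```

In this sketch the regression targets `t_star` would come from `encode_box(gt_box, anchor)`, so that predictions and targets live in the same anchor-relative space.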
Test time per image (using VGG-16) and detection mAP on PASCAL VOC:

Method         Test time per image   VOC 2007   VOC 2010   VOC 2012
R-CNN          47 s                  58.5       53.7       62.4
Fast R-CNN     300 ms (a)            70         68.8       68.4
Faster R-CNN   200 ms (b)            73.2       ---        70.4

(a) Excluding object proposal time, for 2K proposals
(b) Overall time
Thank you for listening. Any questions?