
Transcript of Deep TEN: Texture Encoding Network (IEEE 2016 Conference on Computer Vision and Pattern Recognition)


IEEE 2016 Conference on Computer Vision and Pattern Recognition

Deep TEN: Texture Encoding Network

Hang Zhang, Jia Xue, Kristin Dana {zhang.hang, jia.xue}@rutgers.edu, [email protected], Department of Electrical and Computer Engineering, Rutgers University.

Overview:
➢ Encoding-Net (a new CNN architecture) with a novel Encoding Layer.
➢ State-of-the-art results on texture recognition (MINC-2500, FMD, GTOS, 4D-light datasets).
➢ Flexible deep learning framework (arbitrary image size; learned features are easy to transfer).

Encoding Layer:
➢ Residual Encoder
• Given a set of visual descriptors X = \{x_1, \dots, x_N\} and a learned codebook C = \{c_1, \dots, c_K\}.
• Each descriptor x_i is assigned a weight a_{ik} to each codeword c_k, with residual r_{ik} = x_i - c_k.
• The residual encoder aggregates the residuals with the assignment weights:

  e_k = \sum_{i=1}^{N} e_{ik} = \sum_{i=1}^{N} a_{ik} r_{ik}
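The aggregation above can be sketched in NumPy. This is a minimal illustration of the formulas, not the authors' implementation; the fixed smoothing factor `beta` in the soft assignment is an assumed placeholder (the poster later makes it learnable per codeword).

```python
import numpy as np

def residual_encode(X, C, beta=1.0):
    """Aggregate residuals r_ik = x_i - c_k with soft-assignment weights a_ik.

    X : (N, D) array of visual descriptors
    C : (K, D) array of learned codewords
    Returns the (K, D) encoding with rows e_k = sum_i a_ik * r_ik.
    """
    # Residuals r_ik for every descriptor/codeword pair: shape (N, K, D)
    R = X[:, None, :] - C[None, :, :]
    # Soft-assignment weights a_ik from squared residual norms: shape (N, K)
    A = np.exp(-beta * np.sum(R ** 2, axis=2))
    A /= A.sum(axis=1, keepdims=True)
    # Aggregate: e_k = sum_i a_ik * r_ik  -> shape (K, D)
    return np.sum(A[:, :, None] * R, axis=0)

rng = np.random.default_rng(0)
E = residual_encode(rng.normal(size=(50, 8)), rng.normal(size=(4, 8)))
print(E.shape)  # (4, 8)
```

Note that the output has K rows of dimension D regardless of how many descriptors N come in, which is what lets the layer sit on top of convolutional features of any spatial size.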

Experiments:
➢ Dataset
• Material & texture datasets: MINC-2500, KTH, FMD, 4D-light, GTOS
• General recognition datasets: MIT-Indoor, Caltech-101
➢ Baselines
• FV-SIFT (128 Gaussian components, 32K ⇒ 512)
• FV-CNN (Cimpoi et al., VGG-VD & ResNet, 32 GMM components)

| Method   | Feature extraction           | Dictionary Learning | Encoding           | Classifier            |
|----------|------------------------------|---------------------|--------------------|-----------------------|
| BoWs     | SIFT / Filter Bank Responses | Dictionary          | Histogram Encoding | SVM (off-the-shelf)   |
| FV-CNN   | Pre-trained CNNs             | Dictionary          | Fisher Vector      | SVM (off-the-shelf)   |
| Deep-TEN | Convolutional Layers         | Dictionary          | Residual Encoding  | FC Layer (end-to-end) |

In Deep-TEN, the dictionary and the residual encoding are combined into the Encoding Layer, so the whole pipeline is trained end-to-end.

[Figure: joint encoding setup: shared Conv Layers feed two Encoding Layers, E1 (CIFAR) and E2 (STL).]

➢ Classic Vision Approaches
• Flexible: allow arbitrary input image size.
• No domain-transfer problem (the features are generic).
• Dictionary encoding usually carries domain information.
➢ Deep Learning
• Preserves spatial information, but texture calls for an orderless representation.
• Fixed input image size.
• Difficulties in domain transfer.

[Figure: Encoding-Layer pipeline: Input descriptors → Residuals (w.r.t. the Dictionary) → Assign → Aggregate.]

➢ Relation to Other Approaches:
• Dictionary learning: K-means or K-SVD
• BoW, VLAD, Fisher Vector & Net-VLAD
• Global pooling: Avg-pool, SPP-Net, Bilinear pool

➢ Domain Transfer
• The residual encoding discards frequently appearing features, which are likely to be domain specific.
• A visual feature x_i that appears frequently in the data is likely close to a codeword c_k, so:
  a) e_k \approx 0, since r_{ik} = x_i - c_k \approx 0;
  b) e_j \approx 0 for j \neq k, since a_{ij} \approx 0 (soft-assignment).
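A tiny numeric check of this argument (an illustrative sketch with made-up values, not an experiment from the paper): descriptors that cluster tightly around one codeword contribute a near-zero encoding to every codeword.

```python
import numpy as np

# Codebook with two centers; the descriptors sit almost exactly on c_0,
# mimicking a frequently appearing (domain-specific) feature.
C = np.array([[0.0, 0.0], [5.0, 5.0]])
X = np.array([[0.01, -0.02], [-0.01, 0.02], [0.02, 0.01]])

R = X[:, None, :] - C[None, :, :]        # residuals r_ik, shape (3, 2, 2)
A = np.exp(-np.sum(R ** 2, axis=2))      # unnormalized soft weights
A /= A.sum(axis=1, keepdims=True)
E = np.sum(A[:, :, None] * R, axis=0)    # e_k = sum_i a_ik r_ik

# a) row e_0 is tiny because r_i0 ~ 0; b) row e_1 is tiny because a_i1 ~ 0.
print(np.abs(E).max())  # small compared with the codeword scale of 5.0
```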

➢ Comparison to State-of-the-Art

➢ Assignment Weights
• Soft-weighting:

  a_{ik} = \exp(-\beta \|r_{ik}\|^2) / \sum_{j=1}^{K} \exp(-\beta \|r_{ij}\|^2)

• Learnable smoothing factor s_k (one per codeword):

  a_{ik} = \exp(-s_k \|r_{ik}\|^2) / \sum_{j=1}^{K} \exp(-s_j \|r_{ij}\|^2)
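The two weighting schemes differ only in whether the smoothing factor is a single β or a per-codeword learnable s_k. A minimal NumPy sketch (the concrete β and s values are illustrative, not from the paper):

```python
import numpy as np

def soft_assign(sq_dists, beta):
    """a_ik = exp(-beta_k ||r_ik||^2) / sum_j exp(-beta_j ||r_ij||^2).

    sq_dists : (N, K) squared residual norms ||r_ik||^2
    beta     : scalar (fixed smoothing) or shape-(K,) array (learnable s_k)
    """
    logits = -np.asarray(beta) * sq_dists        # broadcasting covers both cases
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    W = np.exp(logits)
    return W / W.sum(axis=1, keepdims=True)      # rows sum to 1

sq = np.abs(np.random.default_rng(1).normal(size=(6, 3)))
A_fixed = soft_assign(sq, 2.0)                        # single beta for all codewords
A_learn = soft_assign(sq, np.array([0.5, 2.0, 4.0]))  # per-codeword s_k
print(np.allclose(A_fixed.sum(axis=1), 1.0),
      np.allclose(A_learn.sum(axis=1), 1.0))  # True True
```

In the network itself the s_k would be parameters updated by backpropagation along with the codewords; here they are just fixed arrays to show the arithmetic.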

➢ Effect of Multi-size Training
• Ideally supports arbitrary image sizes.
• Train with pre-defined sizes iteratively, without modifying the solver.
• Single-size testing for simplicity.
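Multi-size training is possible because the encoding output is K×D however many descriptors the convolutional feature map yields. A quick shape check (the feature-map sizes below are arbitrary examples, not the poster's training sizes):

```python
import numpy as np

def encode(X, C, beta=1.0):
    """Residual encoding: (N, D) descriptors -> fixed-size (K, D) output."""
    R = X[:, None, :] - C[None, :, :]
    A = np.exp(-beta * np.sum(R ** 2, axis=2))
    A /= A.sum(axis=1, keepdims=True)
    return np.sum(A[:, :, None] * R, axis=0)

rng = np.random.default_rng(0)
C = rng.normal(size=(8, 16))                # K = 8 codewords, D = 16
for h, w in [(10, 10), (14, 9), (20, 20)]:  # different feature-map sizes
    X = rng.normal(size=(h * w, 16))        # N = h * w descriptors
    print(encode(X, C).shape)               # always (8, 16)
```

The fixed-size output feeds the FC-layer classifier unchanged, so image size can vary between training iterations without touching the solver.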

➢ Joint Encoding
• The dictionary-encoding representation is likely to carry domain information.
• The convolutional features are likely to be generic.
• CIFAR-10: 36×36; STL-10: 96×96. Code here!
• (Only shallow architectures considered.)