
Transcript of Deep TEN: Texture Encoding Network (IEEE 2016 Conference on Computer Vision and Pattern Recognition)


IEEE 2016 Conference on Computer Vision and Pattern Recognition

Deep TEN: Texture Encoding Network

Hang Zhang, Jia Xue, Kristin Dana {zhang.hang, jia.xue}@rutgers.edu, [email protected], Department of Electrical and Computer Engineering, Rutgers University.

Overview:
➢ Encoding-Net (a new CNN architecture) with a novel Encoding Layer.
➢ State-of-the-art results on texture recognition (MINC-2500, FMD, GTOS, 4D-light datasets).
➢ Flexible deep learning framework (arbitrary image size; learned features are easy to transfer).

Encoding Layer:
➢ Residual Encoder
• Given a set of visual descriptors X = \{x_1, \dots, x_N\} and a learned codebook C = \{c_1, \dots, c_K\}.
• Each descriptor x_i is assigned a weight a_{ik} to each codeword c_k, with residual r_{ik} = x_i - c_k.
• The residual encoder aggregates the residuals with the assignment weights:

  e_k = \sum_{i=1}^{N} e_{ik} = \sum_{i=1}^{N} a_{ik} r_{ik}
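The aggregation above can be sketched in NumPy. This is a minimal illustration of the formulas, not the authors' implementation; the fixed smoothing factor `beta` in the soft assignment is an assumed placeholder (the poster later makes it learnable per codeword).

```python
import numpy as np

def residual_encode(X, C, beta=1.0):
    """Aggregate residuals r_ik = x_i - c_k with soft-assignment weights a_ik.

    X : (N, D) array of visual descriptors
    C : (K, D) array of learned codewords
    Returns the (K, D) encoding with rows e_k = sum_i a_ik * r_ik.
    """
    # Residuals r_ik for every descriptor/codeword pair: shape (N, K, D)
    R = X[:, None, :] - C[None, :, :]
    # Soft-assignment weights a_ik from squared residual norms: shape (N, K)
    A = np.exp(-beta * np.sum(R ** 2, axis=2))
    A /= A.sum(axis=1, keepdims=True)
    # Aggregate: e_k = sum_i a_ik * r_ik  -> shape (K, D)
    return np.sum(A[:, :, None] * R, axis=0)

rng = np.random.default_rng(0)
E = residual_encode(rng.normal(size=(50, 8)), rng.normal(size=(4, 8)))
print(E.shape)  # (4, 8)
```

Note that the output has K rows of dimension D regardless of how many descriptors N come in, which is what lets the layer sit on top of convolutional features of any spatial size.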

Experiments:
➢ Dataset
• Material & texture datasets: MINC-2500, KTH, FMD, 4D-light, GTOS
• General recognition datasets: MIT-Indoor, Caltech-101
➢ Baselines
• FV-SIFT (128 Gaussian components, 32K ⇒ 512)
• FV-CNN (Cimpoi et al., VGG-VD & ResNet, 32 GMM components)

| Method   | Feature extraction           | Dictionary Learning | Encoding           | Classifier            |
|----------|------------------------------|---------------------|--------------------|-----------------------|
| BoWs     | SIFT / Filter Bank Responses | Dictionary          | Histogram Encoding | SVM (off-the-shelf)   |
| FV-CNN   | Pre-trained CNNs             | Dictionary          | Fisher Vector      | SVM (off-the-shelf)   |
| Deep-TEN | Convolutional Layers         | Dictionary          | Residual Encoding  | FC Layer (end-to-end) |

In Deep-TEN, the dictionary and the residual encoding are combined into the Encoding Layer, so the whole pipeline is trained end-to-end.

[Figure: joint encoding setup: shared Conv Layers feed two Encoding Layers, E1 (CIFAR) and E2 (STL).]

➢ Classic Vision Approaches
• Flexible: allow arbitrary input image size.
• No domain-transfer problem (the features are generic).
• Dictionary encoding usually carries domain information.
➢ Deep Learning
• Preserves spatial information, but texture calls for an orderless representation.
• Fixed input image size.
• Difficulties in domain transfer.

[Figure: Encoding-Layer pipeline: Input descriptors → Residuals (w.r.t. the Dictionary) → Assign → Aggregate.]

➢ Relation to Other Approaches:
• Dictionary learning: K-means or K-SVD
• BoW, VLAD, Fisher Vector & Net-VLAD
• Global pooling: Avg-pool, SPP-Net, Bilinear pool

➢ Domain Transfer
• The residual encoding discards frequently appearing features, which are likely to be domain specific.
• A visual feature x_i that appears frequently in the data is likely close to a codeword c_k, so:
  a) e_k \approx 0, since r_{ik} = x_i - c_k \approx 0;
  b) e_j \approx 0 for j \neq k, since a_{ij} \approx 0 (soft-assignment).
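A tiny numeric check of this argument (an illustrative sketch with made-up values, not an experiment from the paper): descriptors that cluster tightly around one codeword contribute a near-zero encoding to every codeword.

```python
import numpy as np

# Codebook with two centers; the descriptors sit almost exactly on c_0,
# mimicking a frequently appearing (domain-specific) feature.
C = np.array([[0.0, 0.0], [5.0, 5.0]])
X = np.array([[0.01, -0.02], [-0.01, 0.02], [0.02, 0.01]])

R = X[:, None, :] - C[None, :, :]        # residuals r_ik, shape (3, 2, 2)
A = np.exp(-np.sum(R ** 2, axis=2))      # unnormalized soft weights
A /= A.sum(axis=1, keepdims=True)
E = np.sum(A[:, :, None] * R, axis=0)    # e_k = sum_i a_ik r_ik

# a) row e_0 is tiny because r_i0 ~ 0; b) row e_1 is tiny because a_i1 ~ 0.
print(np.abs(E).max())  # small compared with the codeword scale of 5.0
```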

➢ Comparison to State-of-the-Art

➢ Assignment Weights
• Soft-weighting:

  a_{ik} = \exp(-\beta \|r_{ik}\|^2) / \sum_{j=1}^{K} \exp(-\beta \|r_{ij}\|^2)

• Learnable smoothing factor s_k (one per codeword):

  a_{ik} = \exp(-s_k \|r_{ik}\|^2) / \sum_{j=1}^{K} \exp(-s_j \|r_{ij}\|^2)
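The two weighting schemes differ only in whether the smoothing factor is a single β or a per-codeword learnable s_k. A minimal NumPy sketch (the concrete β and s values are illustrative, not from the paper):

```python
import numpy as np

def soft_assign(sq_dists, beta):
    """a_ik = exp(-beta_k ||r_ik||^2) / sum_j exp(-beta_j ||r_ij||^2).

    sq_dists : (N, K) squared residual norms ||r_ik||^2
    beta     : scalar (fixed smoothing) or shape-(K,) array (learnable s_k)
    """
    logits = -np.asarray(beta) * sq_dists        # broadcasting covers both cases
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    W = np.exp(logits)
    return W / W.sum(axis=1, keepdims=True)      # rows sum to 1

sq = np.abs(np.random.default_rng(1).normal(size=(6, 3)))
A_fixed = soft_assign(sq, 2.0)                        # single beta for all codewords
A_learn = soft_assign(sq, np.array([0.5, 2.0, 4.0]))  # per-codeword s_k
print(np.allclose(A_fixed.sum(axis=1), 1.0),
      np.allclose(A_learn.sum(axis=1), 1.0))  # True True
```

In the network itself the s_k would be parameters updated by backpropagation along with the codewords; here they are just fixed arrays to show the arithmetic.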

➢ Effect of Multi-size Training
• Ideally supports arbitrary image sizes.
• Train with pre-defined sizes iteratively, without modifying the solver.
• Single-size testing for simplicity.
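Multi-size training is possible because the encoding output is K×D however many descriptors the convolutional feature map yields. A quick shape check (the feature-map sizes below are arbitrary examples, not the poster's training sizes):

```python
import numpy as np

def encode(X, C, beta=1.0):
    """Residual encoding: (N, D) descriptors -> fixed-size (K, D) output."""
    R = X[:, None, :] - C[None, :, :]
    A = np.exp(-beta * np.sum(R ** 2, axis=2))
    A /= A.sum(axis=1, keepdims=True)
    return np.sum(A[:, :, None] * R, axis=0)

rng = np.random.default_rng(0)
C = rng.normal(size=(8, 16))                # K = 8 codewords, D = 16
for h, w in [(10, 10), (14, 9), (20, 20)]:  # different feature-map sizes
    X = rng.normal(size=(h * w, 16))        # N = h * w descriptors
    print(encode(X, C).shape)               # always (8, 16)
```

The fixed-size output feeds the FC-layer classifier unchanged, so image size can vary between training iterations without touching the solver.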

➢ Joint Encoding
• The dictionary-encoding representation is likely to carry domain information.
• The convolutional features are likely to be generic.
• CIFAR-10: 36×36; STL-10: 96×96. Code here!
• (Only shallow architectures considered.)