Video search by deep-learning
![Page 1: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/1.jpg)
Video Search by Deep Learning
Cees Snoek
![Page 2: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/2.jpg)
2
Which one is the plane?
![Page 3: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/3.jpg)
3
Which one is the plane?
![Page 4: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/4.jpg)
4
Which one is the bird?
![Page 5: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/5.jpg)
5
Which one is the bird?
![Page 6: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/6.jpg)
6
Which one is the Kentucky Warbler?
![Page 7: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/7.jpg)
7
Which one is the Kentucky Warbler?
![Page 8: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/8.jpg)
8
How difficult is the problem?
Human vision consumes 50% brain power…
Van Essen, Science 1992
![Page 9: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/9.jpg)
9
Video recognition in a nutshell
Visualization by Jasper Schulte
![Page 10: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/10.jpg)
10
NIST TRECVID Benchmark
Promote progress in video retrieval research
Big data, standardized tasks, independent evaluation and open innovation
International video search competition
http://trecvid.nist.gov/
![Page 11: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/11.jpg)
11
Concept detection task
http://trecvid.nist.gov/
Aircraft
Beach
Mountain
People marching
Police/Security
Flower
![Page 12: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/12.jpg)
12
From University-lab to spin-off and your mobile phone
• = 1000+ others
* = UvA / Euvision / Qualcomm
Universities win
Start-ups win
Snoek et al., TRECVID 2004-2015
![Page 13: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/13.jpg)
13
Latest jump due to deep learning
[Chart: mean average precision over time; marked years 2006, 2009, 2015]
Progress in video recognition
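The chart's vertical axis is mean average precision. As a minimal sketch of how that number is usually computed, the code below uses the textbook non-interpolated definition and assumes every relevant item appears somewhere in the ranked list. The function names are illustrative, and the benchmark's own evaluation procedure may differ in detail.

```python
from typing import Sequence


def average_precision(ranked_relevance: Sequence[int]) -> float:
    """Average precision of one ranked list (1 = relevant, 0 = not relevant)."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            # Precision measured at the rank of each relevant item.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(runs: Sequence[Sequence[int]]) -> float:
    """Mean of the per-concept (or per-query) average precisions."""
    return sum(average_precision(run) for run in runs) / len(runs)


# Two toy ranked lists, e.g. one per concept.
toy_runs = [[1, 0, 1, 0], [0, 1]]
toy_map = mean_average_precision(toy_runs)
```

A higher value means relevant shots are ranked closer to the top, averaged over all concepts.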
![Page 14: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/14.jpg)
14
The more features the better
Typical shallow learning architecture
Local Feature Extraction: e.g. SIFT, dense sampling
Feature Encoding: BoW, sparse coding, Fisher, VLAD
Feature Pooling: avg/sum pooling, max pooling
Classification: linear / non-linear SVM
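The shallow pipeline on this slide can be sketched in a few lines. The example below assumes the simplest choices: hard-assignment bag-of-words encoding, sum pooling into a normalized histogram, and a linear scoring function standing in for a trained SVM. The codebook, descriptors, and weights are toy values, not taken from the talk.

```python
import numpy as np


def encode_bow(descriptors: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Encode local descriptors as an L1-normalized bag-of-words histogram."""
    # Squared Euclidean distance from every descriptor to every codeword.
    diffs = descriptors[:, None, :] - codebook[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    # Encoding: assign each descriptor to its nearest codeword.
    assignments = dists.argmin(axis=1)
    # Pooling: sum the assignments into one histogram per image or frame.
    hist = np.bincount(assignments, minlength=codebook.shape[0]).astype(float)
    total = hist.sum()
    return hist / total if total > 0 else hist


def linear_score(feature: np.ndarray, weights: np.ndarray, bias: float) -> float:
    """Decision value of a linear classifier (stand-in for a trained SVM)."""
    return float(feature @ weights + bias)


# Toy data: a 2-word codebook and four 2-D "local descriptors".
toy_codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
toy_descriptors = np.array([[1.0, 0.0], [0.0, 1.0], [9.0, 10.0], [1.0, 1.0]])
toy_feature = encode_bow(toy_descriptors, toy_codebook)
toy_decision = linear_score(toy_feature, np.array([1.0, -1.0]), 0.0)
```

In a real system the descriptors would come from something like densely sampled SIFT and the codebook from clustering; both are replaced by hand-written arrays here.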
![Page 15: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/15.jpg)
15
The deeper the better
Typical deep learning architecture
[Diagram: 224×224 input; convolution filters of 11×11, 5×5, 3×3, 3×3 and 3×3; max pooling; layers 6 and 7 with 4,096 units each and dropout; loss]
Building blocks: convolution, non-linearity, pooling
Krizhevsky et al., NIPS 2012
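The three building blocks named on this slide (convolution, non-linearity, pooling) can be illustrated with plain NumPy. This is a single-channel toy sketch with hand-picked values, not the network from the cited paper. It assumes a "valid" cross-correlation (the operation deep learning libraries call convolution), a ReLU non-linearity, and non-overlapping max pooling.

```python
import numpy as np


def conv2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide the kernel over the image without padding (cross-correlation)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def relu(x: np.ndarray) -> np.ndarray:
    """Non-linearity: keep positive responses, zero out the rest."""
    return np.maximum(x, 0.0)


def max_pool(x: np.ndarray, size: int = 2) -> np.ndarray:
    """Non-overlapping max pooling; rows/columns that do not fit are dropped."""
    h = x.shape[0] // size * size
    w = x.shape[1] // size * size
    blocks = x[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))


# Toy 5x5 "image" and a 2x2 diagonal-difference filter.
toy_image = np.arange(25, dtype=float).reshape(5, 5)
toy_kernel = np.array([[-1.0, 0.0], [0.0, 1.0]])
toy_output = max_pool(relu(conv2d_valid(toy_image, toy_kernel)), size=2)
```

A deep network stacks many such stages with learned filters over multiple channels, followed by fully connected layers and a loss.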
![Page 16: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/16.jpg)
16
Video search demos
Social media, forensics, cultural heritage
![Page 17: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/17.jpg)
17
Tomorrow: The Internet of things that video
![Page 18: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/18.jpg)
18
Need to understand what is happening where and when?
![Page 19: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/19.jpg)
19
Examples
Shaking hands
Kissing
![Page 20: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/20.jpg)
20
Goal: obtain the red tube around the action
Jain et al., IJCV 2017
![Page 21: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/21.jpg)
21
Method: Super-voxel segmentation of the video
Jain et al., IJCV 2017
![Page 22: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/22.jpg)
22
Group voxels to generate action proposals
Jain et al., IJCV 2017
Unsupervised and class-agnostic
![Page 23: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/23.jpg)
23
Example proposals
![Page 24: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/24.jpg)
24
Encode video proposals as 15,000 object scores
Jain et al., CVPR 2015
[Diagram: deep convolutional network with 11×11, 5×5 and 3×3 convolutions, max pooling, and layers 6 and 7 with 4,096 units each and dropout]
![Page 25: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/25.jpg)
25
Actions have object preference, relation is generic
Typing, Playing Cello, Bodyweight squats
Jain et al., CVPR 2015
![Page 26: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/26.jpg)
26
We consider three object encodings:
− Whole video
− Outside of tube only
− Inside of tube only
Where do objects aid actions the most?
![Page 27: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/27.jpg)
27
Objects aid most close to the action
[Bar chart, vertical axis from 0 to 100: Whole video, Outside tube, Inside tube]
Jain et al., CVPR 2015
![Page 28: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/28.jpg)
28
Simple convex combination of known classifiers
Objects2action: Translate objects to an action
Object representation, Test video, Object/action affinities
where s() = word2vec
Mikolov et al., NIPS 2013
Jain et al., ICCV 2015
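The slide's equation did not survive in this transcript, so the following is only a sketch of the idea it describes: score an unseen action as a convex combination of known object classifier outputs, weighted by the semantic affinity between the action name and each object name. Everything concrete here is an assumption made for illustration: the 2-D vectors stand in for real word2vec embeddings, cosine similarity is used as the affinity, and the `top_k` selection and normalization are one plausible way to obtain non-negative weights that sum to one.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def action_weights(action_vec: np.ndarray, object_vecs: np.ndarray,
                   top_k: int) -> np.ndarray:
    """Convex weights over objects, derived from embedding affinities."""
    sims = np.array([cosine_similarity(action_vec, o) for o in object_vecs])
    sims = np.clip(sims, 0.0, None)          # weights must be non-negative
    keep = np.argsort(sims)[::-1][:top_k]    # the top_k most related objects
    weights = np.zeros_like(sims)
    weights[keep] = sims[keep]
    total = weights.sum()
    return weights / total if total > 0 else weights  # sums to 1 when total > 0


def action_score(object_scores: np.ndarray, weights: np.ndarray) -> float:
    """Action score of a video = weighted sum of its object scores."""
    return float(object_scores @ weights)


# Hypothetical embeddings (stand-ins for word2vec) for three object classes.
toy_object_vecs = np.array([[1.0, 0.0],    # "cello"
                            [0.0, 1.0],    # "keyboard"
                            [-1.0, 0.0]])  # "dog"
toy_action_vec = np.array([0.9, 0.1])      # "playing cello"
toy_weights = action_weights(toy_action_vec, toy_object_vecs, top_k=2)

# Object classifier outputs for one test video (toy values).
toy_video_scores = np.array([0.8, 0.1, 0.1])
toy_action_score = action_score(toy_video_scores, toy_weights)
```

No action examples are needed: only the object classifiers and the word embeddings are trained, which is what makes the approach zero-shot.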
![Page 29: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/29.jpg)
29
Objects2action localizes actions without examples
Retrieval results from action query only
Jain et al., ICCV 2015
Prediction, Ground truth
![Page 30: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/30.jpg)
30
So far we have considered video search from text only. What about text search from video?
That is: given a video, can we find the best matching sentence?
Matching sentences to videos
![Page 31: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/31.jpg)
31
Word2VisualVec: Predicting the visual representation of text
Training time
Dong et al., arXiv 2017
![Page 32: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/32.jpg)
32
Word2VisualVec: Predicting the visual representation of text
Testing time
Dong et al., arXiv 2017
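Once text has been mapped into the visual feature space, matching a video to sentences reduces to a nearest-neighbour search. The sketch below covers only that final step and assumes the text model has already produced one predicted visual vector per candidate sentence. The vectors are toy values, and cosine similarity is an assumed choice of matching function rather than a detail confirmed by the slides.

```python
import numpy as np


def rank_sentences(video_feature: np.ndarray, sentence_features: np.ndarray):
    """Rank candidate sentences by cosine similarity to a video feature.

    sentence_features holds one predicted visual vector per sentence (rows).
    Returns (indices from best to worst match, similarity per sentence).
    """
    v = video_feature / np.linalg.norm(video_feature)
    s = sentence_features / np.linalg.norm(sentence_features, axis=1,
                                           keepdims=True)
    sims = s @ v
    order = np.argsort(-sims)
    return order, sims


# Toy 3-D "visual features": one video and three candidate sentences.
toy_video = np.array([1.0, 1.0, 0.0])
toy_sentences = np.array([[1.0, 0.9, 0.1],
                          [0.0, 0.0, 1.0],
                          [1.0, 0.0, 0.0]])
toy_order, toy_sims = rank_sentences(toy_video, toy_sentences)
```

The same function works in the other direction (one sentence, many videos) by swapping the roles of the two arguments.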
![Page 33: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/33.jpg)
33
Results
Dong et al., arXiv 2017
![Page 34: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/34.jpg)
34
‘Arithmetic’ with visual and textual query
![Page 35: Video search by deep-learning](https://reader031.fdocuments.in/reader031/viewer/2022030313/58d0efb61a28abba558b6bfb/html5/thumbnails/35.jpg)
35
Video search by deep learning is powerful, even without examples
Field is progressing rapidly
Precise spatiotemporal video understanding is next
Conclusion
www.ceessnoek.info