ICME08 Visual Saliency - pdfs.semanticscholar.org · Detecting visual saliency from a video is...

Dynamic Visual Saliency Modeling Based on Spatiotemporal Analysis

Duan-Yu Chen, Hsiao-Rong Tyan, Dun-Yu Hsiao, Shen-Wen Shih, and Hong-Yuan Mark Liao

ICME 2008 Duan-Yu Chen 2

Outline

IntroductionRelated WorksProposed Approach

Selective Visual Attention ModelingDetect Spatiotemporal Salient PointsCompute Motion Attention MapDetermine the Extent of Attended Regions

Experiment ResultsConclusion


IntroductionMethods for annotating video content as well as related video retrieval techniques have attracted a great deal of attention in recent years.Video content contains much richer information than other medias and thus can be annotated for applications, such as video surveillance and entertainment. Detecting visual saliency from a video is considered a good way of understanding or annotating the video content.


Introduction (cont.)The idea is from the father of American psychology, William James in 1980, who suggests that subjects selectively direct attention to objects in a scene using both bottom-up, image-based saliency cues and top-down, task-dependent cues.


A Typical Model for the Control of Bottom-up Attention

Itti’01


Introduction (c.1)Motion attention models can be classified into two categories

Motion EstimationE.g. block-based matching for motion vector estimationH. J. Zhang, et al., “A User Attention Model for Video Summarization,” ACM MM’02.

Structural tensor-based approachS. Li and M. C. Lee, “An Efficient Spatiotemporal Attention Model and Its Application to Shot Matching,” IEEE Trans. on CSVT, Vol.17, No.10, Oct. 2007.


Motivation – using Structural Tensor

Spatio-temporal image data contains rich information

Emphasize local structures by spatio-temporal analysis over a set of consecutive video frames.

⇒

Observation:

Events in video are frequently characterized by non-constant motion and non-constant appearance

Traditional methods for video analysis include

• optical flow estimation

• tracking of features/models over time• Mubarak Shah, et al.,” Visual Attention Detection in Video Sequences

Using Spatiotemporal Cues,” ACM MM’06.


Mubarak Shah, et al.,” Visual Attention Detection in Video Sequences Using Spatiotemporal Cues,” ACM MM’06.


Mubarak Shah, et al.,” Visual Attention Detection in Video Sequences Using Spatiotemporal Cues,” ACM MM’06.

Temporal Attention ModelUse SIFT (Scale Invariant Feature Transformation) to compute correspondences between points in frame pairs.


Temporal Attention Model

'33:'22:

'11:

3

2

1

→→→

HHHH: Homography

Use RANSAC to determine 321 ,, HHH


Temporal Salience MapsExample


S. Li and M. C. Lee, “An Efficient Spatiotemporal Attention Model and Its Application to Shot Matching,” IEEE Trans. on CSVT, Vol.17, No.10, Oct. 2007.

Spatiotemporal Attention Modeling Center-surround pyramids are used in motion attention detection to produce salience maps with different precisions.


S. Li and M. C. Lee, “An Efficient Spatiotemporal Attention Model and Its Application to Shot Matching,” IEEE Trans. on CSVT, Vol.17, No.10, Oct. 2007. (c.1)

Center-Surround pyramidsThe weighting is realized by a Gaussian kernel. ∇g = (gx, gy, gt) is the space-time gradients of the intensity.

Given the spatiotemporal structural tensors M1 and M2of a local neighborhood Ω1 and a background Ω2respectively,


Compute Salience Maps using Different Visual Cues


T. Liu, et al., “Learning to Detect A Salient Object,” CVPR’07

• Detect A Salient Object in Still Images


20 40 60 80 100 120 140 160

20

40

60

80

100

120

140

20 40 60 80 100 120 140 160

20

40

60

80

100

120

140

shape from: SaliencyMap

20 40 60 80 100 120 140 160

20

40

60

80

100

120

140

1

2

3

Static Visual Saliency –color, intensity, orientation


20 40 60 80 100 120 140 160

20

40

60

80

100

120

140

20 40 60 80 100 120 140 160

20

40

60

80

100

120

140

20

40

60

80

100

120

140 20 40 60 80 100 120 140 160

20

40

60

80

100

120

1

1

1

1

2

2

22

3

3

3

3


Proposed Approach

GoalTo design a measure that better captures the extent of motion saliency without using the clue of spatial saliency, even in a dynamic cluttered background.

Visual Saliency Maps IntegrationsDetect Salient Points to determine the rough boundary of a moving targetDetect Salient Regions with consistent motion to determine the regions inside the estimated boundary


Salient points in space

(Harris and Stephenes 1988): image points with high variation of values in both image direction

⇒ High eigenvectors of the second-moment matrix integrated at the local neighbourhood

where Lx, Ly are Gaussian derivatives

⇒ Select points with maxima of the corner function

Structural Tensor


Interest points in space-time

High variation of image values in both space and time

⇒ extend Harris corner function into 3D spatio-temporal domain; compute the second moment matrix

where Lx, Ly , Lt are Gaussian derivatives in space-time obtained by spatio-temporal convolution:

with


Salient region in space-time

Function of Visual Saliency

where are eigenvalues of .


Problem of only detecting salient regions based on 3D corner detection

Broken parts of moving targets

Why results in broken parts ?Regions inside a moving target are relatively smoothly changed within XYT video volumes.

20 40 60 80 100 120 140 160

20

40

60

80

100

120

140

X: 23 Y: 59Index: 0RGB: 0, 0, 0


SolutionSmoothly changed regions physically mean that these regions are moving with consistent motionsWhat is the motion cue ?

The rank of structural tensor

Rank(μ)=3: multiple motionsRank(μ)=2: one dominant motion


Solution (c.1)

Noise and FG/BG movements in video sequences would result in multiple motions within a neighborhood.

Rank(μ)=3 in almost all video volumes

A new measure of rank deficiency dST =


Median Filter for dST

Goal: to drop off the regions with multiple motions and only keep regions with consistent motion

Use median filter to filter out two extreme cases

rank(μ)=1 and rank(μ)=3

20 40 60 80 100 120 140 160

20

40

60

80

100

120

140

X: 32 Y: 46Index: 0.0002428RGB: 0, 0, 0.563

mask:[30 30]

20 40 60 80 100 120 140 160

20

40

60

80

100

120

140


Demo – 1WP with more complex background


Visual Salient Regions Enclosure Computed Using Exponential Entropy

The most appropriate scale Se for each region centered at the seed (xs,ys) of a motion attention map is defined by

( )( ) ( )( ){ }tyxeWtyxeHmaxargS ssssee ,,,,,, expexp ×=

( )( ) ( ) ( )( )∑∈

−×=Dv

tyxevtyxevss sssspexpptyxeH ,,,,,,,,exp 1,,,

( )( ) ( )( ) ( )( )tyxeHtyxeHtyxeW ssexpssexpss ,,,1,,,,,,exp −−=


Experimental Results

(a) (b)

Qualitative EvaluationQuantitative Evaluation – Precision & Recall


Demo


Demo (c.1)


Demo (c.2)


Performance Comparison[Mubarak Shah, et al.,” Visual Attention Detection in Video Sequences Using Spatiotemporal Cues,” ACM MM’06.]


Quantitative Evaluation

0 10 20 30 40 50 60 700.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Frame

Prec

isio

n

OursZhaiOurs MeanZhai Mean


Quantitative Evaluation (cont.)

0 10 20 30 40 50 60 700.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Rec

all

Frame

OursZhaiOurs MeanZhai Mean


Quantitative Evaluation (cont.)

Our method Zhai and Shah[3]

Precision 71.8% 57.9%

Recall 80.6% 53.8%


Conclusion and Future Work

Visual Salient Maps IntegrationsDetect Salient Points to determine the boundary of a moving targetDetect the appropriate Extent of Salient Regions in the motion attention map and refine using the exponential entropy-based center-surround approach.

Future WorkAnnotate the extent of salient regions automatically in general video sequences.


Object Annotation

Hierarchical FilteringOrientation Filters

Examples

0 90 180 2700

1

2

3

4

5

6

7

8

9

10x 10

5 Car

Orientation

Orie

ntat

ion

Ene

rgy

0 90 180 2700

0.5

1

1.5

2

2.5

3x 10

6

Orientation

Orie

ntat

ion

Ene

rgy

People

ICME08 Visual Saliency - pdfs.semanticscholar.org · Detecting visual saliency from a video is...

Documents

Transcript of ICME08 Visual Saliency - pdfs.semanticscholar.org · Detecting visual saliency from a video is...