
Synchronized digital video subsampling to achieve temporal resolution independence

Jose San Pedro Wandelmer, Sergio Dominguez Cabrerizo
DISAM, Universidad Politecnica de Madrid, Madrid, Spain
Email: [email protected], [email protected]

Abstract- In this paper, we present a selective video subsampling algorithm to achieve temporal resolution independence in applications with hard requirements on frame sequentiality, especially targeted at video identification systems based on fingerprint strings. The algorithm extracts video entropy and, based on the resulting series, selects a set of intervals and frames for each interval. The exact same intervals and frames can be repeatedly extracted from any other variation of the original content, which is especially desirable when variations have undergone dramatic quality and frame rate reduction during their distribution. Unlike basic subsampling techniques, the proposed method not only obtains a frame set of equal length for different frame rate variations, but the generated set also includes more frames when events are taking place in the video, while fewer frames are chosen when the video content remains steady. This provides a significant improvement of performance in terms of recall and precision values in video identification applications.

I. INTRODUCTION

The advent of network-aware multimedia systems makes it possible to cope with the inevitable and unpredictable changes in network conditions to improve the multimedia experience. To reduce network congestion, network-aware frameworks detect the current network condition and adapt the video and audio streams appropriately. For this purpose, encoding parameters are changed dynamically to adjust [1]:

• Spatial resolution: width and height of the frames
• Temporal resolution: number of frames per second
• Bitrate: number of bits to encode each second of video (quality of encoding)
• I-Frame rate: number of I-Frames

These parameters are dynamically adapted depending on the application and on the network conditions. For example, when distributing video for playback on mobile devices through GSM networks, the spatial resolution can be greatly reduced (screen resolution in current cell phones is around 200x150 [2]), while when distributing video for playback on desktop computers, where packet loss is scarce, small reductions in each parameter (including I-Frame rate) can help to decrease bandwidth requirements temporarily.

Visual features of video content are thus subject to change to adapt to the new conditions. These changes can cause malfunction in a number of other systems which depend on the actual content by means of automatic analysis algorithms. This problem becomes especially noticeable in video monitoring applications, such as surveillance or copyright management. In surveillance applications, the visual features of the incoming stream must always remain above certain thresholds to ensure acceptable results. Copyright management relies on video identification techniques, which compare the incoming stream with a database of material to be detected. For both of these applications, changes in the expected visual features must be carefully handled to minimize result degradation.

This paper focuses on video identification applications, which can be extremely sensitive to even slight changes in temporal resolution. These applications must be able to guarantee video matches even when comparing different instances (called variations) of the same content. Dealing with variations featuring dynamic temporal conditions forces the use of techniques that make this comparison independent of temporal characteristics. An entropy-based approach is proposed to select a set of synchronized frames, equal for any variation independently of spatial and temporal resolution, as well as quality (bitrate). This set can be highly configured depending on the actual application. The selected frames are not equally spaced in the timeline, a mandatory characteristic for dealing with dynamic temporal conditions. The selection method is both time and event triggered, i.e. a minimum number of synchronization frames is chosen per user-defined interval, but more frames can be chosen depending on the visual changes in the content.

The paper is organized in the following way. Related previous work in the field of multimedia identification is introduced in Section II. A detailed description of the main problems of hash string based video identification is presented in Section III. Section IV details the requirements for the subsampling method and introduces the proposed solution. In Section V, the proposed solution is used as a pre-stage to an identification system, to evaluate its robustness in real applications.

II. RELATED WORK

Video indexing and retrieval algorithms are the basis of video identification, copy detection, and tracking systems. They make it possible to detect the presence of a certain multimedia item once it has been distributed through a given transmission channel, which can be a live TV source, a set of web pages (youtube.com and the like), etc. Depending on the particular user requirements, the search may be targeted to find exact or partial matches of video items, taking into account all the different transformations that these elements may have suffered during the distribution stage. These systems are commonly used in video copyright management [3], [4], [5], [6] and video content identification [7], [8], [9], [10].

Content-based video retrieval and identification is a computationally complex problem. Digital video identification engines are required to handle different variations while performing their tasks. Monitoring of TV broadcasts, for example, requires adaptation to the different transformations which content may undergo because of the different characteristics of transmission standards (PAL, NTSC, DVB-T, etc.). Network-aware multimedia distribution systems increase the number of possible scenarios in which the handling of this dynamism is necessary. In these cases, handling that complexity in real time requires the support of indexing structures [7], [11] and acceleration techniques [12], [13].

These systems commonly create a signature, or fingerprint, that compactly describes small sections of the video. One way to produce fingerprints is by means of robust hash techniques [14] (also called visual hash techniques). While standard hash techniques try to reduce collisions, especially with very similar input messages (e.g. cryptographic applications), robust hash algorithms deliberately produce similar hash values in these cases.

Robust hash values are obtained from each frame, or set of frames, creating a string of signatures which can later be retrieved using string matching techniques. These methods obtain excellent recall and precision values, but are very sensitive to changes in the temporal features of the variations being compared. In those cases, the actual frames used to compute the hash values and the length of the signature strings may differ from one variation to another.

The work presented in this paper helps to overcome this problem, improving the robustness of this and other kinds of algorithms that base their results on the accurate selection of the same set of frames, independently of the features of the variations they deal with.
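To make the fingerprint-string idea concrete, the sketch below is a toy illustration, not the authors' system nor the exact scheme of [14]: it derives a short binary hash per frame from block-mean luminance comparisons, concatenates the per-frame hashes into a fingerprint string, and compares strings bitwise. All function names are ours, and frames are assumed to be grayscale numpy arrays.

```python
# Illustrative sketch (not the authors' exact scheme): a toy block-based
# robust frame hash and a Hamming comparison of fingerprint strings.
import numpy as np

def frame_hash(frame: np.ndarray, grid: int = 8) -> np.ndarray:
    """Toy robust hash: sign of differences between adjacent block means."""
    h, w = frame.shape
    bh, bw = h // grid, w // grid
    blocks = frame[:bh * grid, :bw * grid].reshape(grid, bh, grid, bw)
    means = blocks.mean(axis=(1, 3))  # grid x grid matrix of block means
    # One bit per horizontally adjacent block pair; block means survive
    # rescaling and mild re-encoding, which is the point of robust hashing.
    return (means[:, 1:] > means[:, :-1]).ravel()

def fingerprint(frames) -> np.ndarray:
    """Concatenate per-frame hashes into a fingerprint 'string'."""
    return np.stack([frame_hash(f) for f in frames])

def distance(fp_a: np.ndarray, fp_b: np.ndarray) -> float:
    """Normalized Hamming distance. It is only defined when both strings
    have the same length, which is exactly what frame rate changes break
    and what the subsampling proposed in this paper restores."""
    return float(np.mean(fp_a != fp_b))
```

Note how the comparison requires strings of equal length built from the same frames; this is the fragility that the rest of the paper addresses.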

III. IDENTIFICATION UNDER DYNAMIC CONDITIONS

This section analyses the problems found in video content identification systems based on robust hash signature strings, justifying the necessity of introducing techniques to achieve independence of temporal features. Basic requirements of video identification systems are also detailed to aid the understanding of both the problem and the valid solutions.

Video identification systems manage a database of video items to be monitored, along with their hash signatures (which are extracted offline and stored for repeated use). The goal is to find appearances of these items in the input signal, normally a much larger video source (e.g. a TV broadcast) or an external database of video clips (online video repositories, e.g. youtube.com). The incoming video is analyzed and a signature is generated from it. The comparison engine tries to match it with the signatures stored in the local database. An identification is reported when these signatures match.

Given the different transformations that input video sources are subject to, these systems often deal with different variations of the content: the original variation, used to generate the registered signature, and the incoming transformed variation, which must be monitored. These transformations may cause slight differences in the signatures, which must be properly handled to perform the identification even in this case.

Robust hash algorithms are able to handle dramatic changes in visual quality, including frame size reduction, bitrate reduction, and other noise or artifacts generated in re-encoding processes [14]. However, the string-based nature of the comparison makes these systems unable to deal with changes in temporal features. Hash signatures are sequentially created from video frames (or frame differences); this imposes strict frame rate conditions on the video sources, because longer frame sequences will create longer hash strings.

This is not, however, the only problem related to the temporal dimension of videos. Most of these systems require the proper identification of partial appearances of the digital items being protected. This possibility makes the process much more complex, because all temporal references that could be used to perform the comparison are removed.

Fig. 1 illustrates how both of the described facts affect the detection. We consider an algorithm that chooses the first and last frame of each second of video to compute the hash signature. The first row shows the frames of this video second for the original variation. The second and third rows represent different possible input sources. In the first case, a partial appearance of the element is missing one frame. In the second case, a reduced frame rate variation is shown, where part of the frames of the reference do not appear. The pair of frames chosen to generate the signature is different for each variation, leading to different signature values.

Fig. 1. Identification problems with variations featuring different temporal characteristics.

In a nutshell, the goal is to be able to select the same subset of frames from two variations of the same content, even when:

• the temporal resolution of each variation is different
• one of the variations is incomplete
• one of the variations has gone through quality degradation processes (spatial resolution reduction, bitrate reduction, re-encoding, etc.).


IV. ENTROPY SYNCHRONIZATION

As introduced in the previous section, an important number of multimedia analysis algorithms are quite sensitive even to very subtle changes in temporal resolution, due to their dependence on frame sequentiality. In this section, a method to achieve independence of temporal resolution is proposed, based on the content-based extraction of synchronization frames, which reduces any variation of the original content to a common subset of frames, even under dramatic changes in temporal resolution and other low-level features (such as bitrate or spatial resolution).

A. The objective

Given the problems presented in the previous section, the authors propose in this paper a way to achieve independence of temporal resolution changes, especially targeted at visual hash fingerprint identification applications. The method is, however, not restricted to this sort of application and can be adapted to fulfil other necessities in different contexts.

The unique features of visual hash fingerprint generation make these systems especially sensitive to changes in frame rate or slight temporal shifts. These temporal inconsistencies cause the algorithm to use different frames to generate the signature, therefore leading to signatures which differ in size and/or value.

The work presented focuses on achieving a method to ensure that the same frames are chosen for any variation and, thus, that the generated signature is equal in size and similar in value (depending on the quality differences). If the exact same subsampling result is obtained from any variation of the same content, and the given set includes every relevant frame, this result can be used as the set of frames to be analyzed by the algorithm. This method is described in this section.

B. Entropy of video frames

Let us consider Shannon's definition of entropy. Given a random variable, X, which takes values in a finite set according to a probability distribution, p(X), the entropy of this probability distribution is defined as:

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i) \qquad (1)$$

Shannon's entropy is the average amount of information contained in the random variable X, that is, the uncertainty removed after the actual outcome of X is revealed. Shannon's function is based on the concept that the information gained from an event is inversely related to its probability of occurrence.

This definition from the information theory field can be directly used in many other areas, such as computer vision. In this case, the entropy of an image, I, is used to obtain a quantitative value of its color complexity, where X is a random variable representing the luminance intensity of a random pixel in that image.

Using (1), the entropy of an image, I, is computed in the following way.

Consider $I_k$ to be the number of pixels in I with a luminance level of k, where $0 \le k \le N-1$. Then,

$$p_I(k) = \frac{I_k}{\sum_{j=0}^{N-1} I_j} \qquad (2)$$

is the probability of X being color k, i.e. the probability distribution of the pixel intensities in image I.

This probability is commonly obtained by means of the luminance histogram of the image. Consider $h_I$ to be the histogram of I and $h_I(k)$ the value of the histogram for luminance level k; then it follows:

$$p_I(k) = \frac{h_I(k)}{\sum_{j=0}^{N-1} h_I(j)} \qquad (3)$$

$$H(X) = -\sum_{k=0}^{N-1} p_I(k) \log(p_I(k)) \qquad (4)$$

This standard entropy value for images represents the uncertainty removed after the actual color of a random pixel in I is revealed. The value will be higher in images with wider histograms, as each pixel can take a color in a wider range; for example, taking logarithms base 2, an 8-bit image with a perfectly uniform histogram attains the maximum $H = \log_2 256 = 8$ bits. Thus, maximum entropy is reached when the color distribution is equiprobable,

$$p_I(c_i) = \frac{1}{N} \quad \forall i \qquad (5)$$

while the minimum is reached when all the pixels fall within a single bin k,

$$p_I(c) = \begin{cases} 1 & \text{if } c = k \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$

Eq. (4) provides a measurement of the luminance complexity of a frame in a sequence. Computing this value for the complete sequence gives us a time series that can be analyzed. The actual pattern depends on the evolution of the color complexity of the frames, which is subject to change for any of the following reasons:

• motion: which can result in the occlusion, appearance, and disappearance of objects in the scene.
• lighting changes: which completely transform the colors of one or more objects.
• editing effects: such as fades, dissolves, etc.

Entropy thus provides a fast way to compute a quantitative measurement of the evolution of frame differences within a temporal range of a certain video. The absence of "action" in the sequence produces a constant series, while the presence of motion, illumination changes, and editing effects produces variations in this value. This behavior has been successfully used in the detection of certain effects [15], and its unique characteristics make it especially suitable for the task of optimum synchronized subsampling of video sequences.
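As a minimal sketch of Eqs. (3)-(4) in practice, the following snippet computes the per-frame entropy series of a clip. It assumes OpenCV (cv2) and numpy are available for decoding and histogramming; the helper names are illustrative, not part of the paper.

```python
# Minimal sketch of the entropy series: per-frame entropy (Eq. 4)
# computed from the normalized luminance histogram (Eq. 3).
import cv2
import numpy as np

def frame_entropy(gray: np.ndarray, levels: int = 256) -> float:
    hist = cv2.calcHist([gray], [0], None, [levels], [0, levels]).ravel()
    p = hist / hist.sum()                  # Eq. (3): normalized histogram
    p = p[p > 0]                           # skip empty bins (0 log 0 = 0)
    return float(-np.sum(p * np.log2(p)))  # Eq. (4), logs base 2

def entropy_series(path: str) -> list[float]:
    """Decode a clip and return one entropy value per frame."""
    cap = cv2.VideoCapture(path)
    series = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        series.append(frame_entropy(gray))
    cap.release()
    return series
```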

C. Subsampling algorithm

In this section, a novel synchronized subsampling algorithm is proposed which takes advantage of the features of frame entropy to extract, from the source video, a common subset of frames independent of temporal resolution. The algorithm does not use any reference to the analyzed videos; the actual content of each variation is analyzed separately, and the subset is extracted without any further help.

In some cases, for instance when the frame rate has been dramatically reduced, selecting the exact same subset of frames may be impossible, especially when the reduced variation actually lacks those frames. In these cases, the algorithm tries to select the closest frame. As the algorithm is both time and event triggered, results are not affected by this issue. Besides helping to subsample temporally reduced variations, being event triggered allows the algorithm to include representative frames, making it suitable for identification, summarization, and similar tasks.

Individual frame entropy values are used to create a temporal entropy series for each clip, which the algorithm analyzes to locate points of interest that determine the frames to be selected. The selection procedure is, however, far from straightforward, given the set of restrictions that have to be met. In this regard, the unique characteristics of frame entropy greatly simplify the algorithm, as will be shown.

The most restricting requirement is the necessity to deal with partial variations, as this implies that a reference frame cannot be selected (see Fig. 1). The problem would be quite simple in the opposite case; a straight subsampling procedure based on minimum and maximum supported frame rates would be enough. Losing this reference, as well as allowing an unbounded range of supported frame rates, makes this process much more complex. Moreover, using straight subsampling methods may lead to losing frames representing important events, due to their independence from the actual content.

In order to cope with these changes in temporal features, it is necessary to base the algorithm on the content analysis of delimited temporal ranges. Content-based analysis ensures the selection of similar frames; delimiting temporal ranges ensures a minimum number of selected frames per second. The content-based analysis stage can be used, in addition, to locate spots of activity, which can serve both as synchronization points and to improve results (by including relevant frames).

Entropy series serve this exact purpose in the proposed scheme. They provide a simple and fast, yet powerful, means of analyzing content looking for synchronization points, which will preferably be chosen at relevant instants of the timeline. A minimum number of synchronization frames is chosen per defined interval of the series. These frames are said to represent that interval, the algorithm being responsible for selecting the same representative frames in the same intervals of different variations. However, due to the necessity of dealing with incomplete or shifted variations, the actual definition of the interval requires some computation. Using a fixed interval duration does not help to solve the problem, given the conditions.

The proposed solution not only achieves near optimum selection of synchronization frames, but also includes support to synchronize the considered interval duration and boundaries. The algorithm can thus be split into two independent parts:

Fig. 2. Synchronization frames for an entropy interval $I_k$ of length $T_I$.

D. Selection of synchronization frames

This stage of the algorithm analyses a given interval, $I_k$, of the incoming video, which spans from $t_0^k$ to $t_f^k$, and selects a fixed number, N, of synchronization frames, i.e. frames chosen by the algorithm to represent that interval in any variation of the video:

$$F_k = \bigcup_{i=1}^{N} \{ f_i^k \} \qquad (7)$$

where $F_k$ is the set of synchronization frames for interval $I_k$, and selected frames are represented by their timestamps, $f_i^k$, in ascending order, i.e. $f_i^k < f_{i+1}^k$.

The proposed solution uses the frame entropy series extracted from the source video to perform the selection. As introduced above, entropy series provide information about events and changes in the content shown in the video. There are many valid approaches to the selection of synchronization frames based on entropy series; once a criterion is set, the algorithm simply applies it to any entropy series interval to get the relevant frames. Depending on these criteria, the selection may be more or less compact, descriptive, or suitable for a certain application.

The authors propose a Max/Min approach, where the frames with the absolute maximum, $H_M^k$, and minimum, $H_m^k$, entropy values are selected. This approach has the following features:

• The number of chosen frames per interval, N, is 2.
• When several frames share the absolute maximum or minimum value, the algorithm chooses $f_1^k$ and $f_2^k$ so as to maximize $D_k = f_2^k - f_1^k$.

Fig. 2 illustrates this approach and helps to understand the names and concepts introduced in this section. This selection method yields a very compact yet useful selection, as shown in Section V, where an exhaustive analysis of its performance is given.
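The Max/Min criterion can be sketched in a few lines. In this illustrative version (our naming, not the paper's), frame timestamps are simply indices into the entropy series, and ties on the extreme values are broken by maximizing the distance $D_k$ between the two chosen frames:

```python
# Sketch of the Max/Min selection (Section IV-D): within an interval,
# pick the frames of absolute maximum and minimum entropy, breaking
# ties so the two timestamps are as far apart as possible.
def select_sync_frames(series, t0, tf):
    window = series[t0:tf]
    h_max, h_min = max(window), min(window)
    # Candidate timestamps sharing the extreme values.
    max_ts = [t0 + i for i, h in enumerate(window) if h == h_max]
    min_ts = [t0 + i for i, h in enumerate(window) if h == h_min]
    # Tie-break: maximize the distance D_k between the two frames.
    f1, f2 = max(((a, b) for a in max_ts for b in min_ts),
                 key=lambda pair: abs(pair[0] - pair[1]))
    return tuple(sorted((f1, f2)))
```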

E. Synchronization of intervals

This stage of the algorithm defines the next interval that will be analyzed. The goal is to be able to define the same intervals even when dealing with variations with temporal changes. Intervals are considered as sliding windows featuring a variable overlapping area. Even though their width is fixed, and corresponds to the duration of the interval, it becomes necessary to vary the overlapping area to achieve boundary synchronization, as shown in Fig. 3. Otherwise, the boundaries remain desynchronized, affecting the frame selection stage.

Fig. 3. Interval synchronization with a real entropy series. Two different interval progressions, $I_1$ and $I_2$, where $t_0(I_1) \ne t_0(I_2)$. The progression shows the different overlapping areas chosen for each of them; selected frames are labeled with circles for $I_1$ and stars for $I_2$. Interval synchronization occurs, in this case, at frame 150.

An interval, $I_k$, is defined by the time it starts, $t_0^k$, and by its duration, $T_I$, which is constant for all intervals. We can conveniently define $t_f^k = t_0^k + T_I$ as the ending time of $I_k$. The frame selection stage must look for synchronization frames within the range $[t_0^k, t_f^k]$. The overlapping area of the following interval, $I_{k+1}$, is defined as

$$O_{k+1} = t_f^k - f_2^k \qquad (8)$$

meaning that $I_{k+1}$ begins just after $f_2^k$, the timestamp of the latest frame found for $I_k$. Fig. 2 helps to understand these concepts.
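Putting the two stages together, the interval progression of Eq. (8) can be sketched as follows, again with indices as timestamps and illustrative naming, and building on select_sync_frames() above. Each new interval starts right after the latest synchronization frame of the previous one, which is what lets boundaries lock onto the same content-defined positions in any variation:

```python
# Sketch of the interval progression (Section IV-E): fixed interval
# duration T_I, with the next interval starting right after the latest
# synchronization frame of the current one (Eq. 8).
def subsample(series, interval_len):
    t0, selected = 0, []
    while t0 + interval_len <= len(series):
        f1, f2 = select_sync_frames(series, t0, t0 + interval_len)
        selected.extend((f1, f2))
        t0 = f2 + 1  # next interval begins just after the latest frame
    return selected
```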

This method is depicted in Fig. 3. A standard fixed-time overlapping method is also depicted in this figure, and the comparison clearly shows how the proposed method achieves synchronization of intervals where simpler approaches are unable to. The figure also shows how the proposed method takes some time to synchronize intervals, depending on the location of the local maxima and minima of the entropy series. This synchronization period will generate different frames for different variations. However, Fig. 3 also shows that the closest frame, given the interval differences, is chosen for other variations. Once the interval synchronization period is complete, at frame 150 in Fig. 3, the selected frames are the same for any variation.

V. EXPERIMENTAL RESULTS

This section contains results obtained in the experiments conducted to test the proposed algorithm. Given the original target of the algorithm, i.e. improving hash-based video identification, most of the reported results have been obtained directly from a video identification system prototype which we built based on [14]. Originally, this previous work was unable to handle temporal feature changes in the variations. We have introduced a subsampling pre-stage, based on our proposed method, keeping the rest of the identification system as it was.

The video test set consisted of more than 4 hours of music video clip footage. A random subset of these video clips, around 25%, was manually selected to build the database of registered elements, i.e. elements that must be detected in the original incoming video stream. Different features of this original input stream were changed to test the robustness of the approach, including:

• Bitrate: the original input stream was encoded at 600 kbps. A variation was created by reducing this value by 40%.
• Frame resolution: the original input stream was encoded at 480x260. A variation was created by reducing this value to 288x216.

Those variations, which kept the temporal features unchanged, are used as the reference values for precision and recall. Using this reference, we proceeded to introduce the subsampling pre-stage and to change the temporal features of the previous variations. This test method allows comparing results obtained in the best-case scenario (temporal features unchanged) with results obtained when introducing our subsampling pre-stage (temporal features reduced).

The algorithm has been configured to use two frames per interval, N = 2, and an interval length of $T_I = 1$ s. That means the minimum number of frames chosen per second of video is $N_f = 2$. The actual number of synchronization frames chosen per second of video is always higher and depends on the features of the video stream. In our experiments, the average value of $N_f$ was indeed much higher, $N_f = 2.71$, meaning that almost 3 frames per second of video were selected. The value of $N_f$ grows when the entropy series experiences oscillations, given the max/min approach used to select frames (see Fig. 3).

Fig. 4 shows the performance of the identification system when dealing with bitrate reduced variations. As can be seen, dramatic reductions in bitrate do not produce a dramatic loss of performance. Fig. 4 also shows performance when the frame rate is reduced, both for the original and the reduced bitrate cases. The first fact to notice is that reducing the frame rate produces a bigger loss of performance than reducing the bitrate. When the frame rate is reduced, part of the frames of the original variation are lost; that means the selection algorithm may not find the same minimum/maximum entropy frames. The frames chosen are, however, very close, though this depends on the fps value. This effect is much more severe when the frame rate is reduced to 10 fps. Note that the worst case presented, with a bitrate of 360 kbps and 10 fps, still achieves considerably high performance.

Fig. 5 shows the performance of the system when reducing frame resolution. As can be seen, frame resolution does not have as much effect on performance as bitrate does. In fact, reducing the frame size by 40% does not really affect performance at all, thanks to the robustness of visual hash techniques. When combining frame resolution and frame rate reduction, some performance is lost, especially in the 10 fps case, but the overall result is better than with bitrate reduced variations.


Fig. 4. Recall & precision values when comparing with bitrate and frame rate reduced variations.

Fig. 5. Recall & precision values when comparing with frame size and frame rate reduced variations.

These results show how the presented algorithm is able to provide temporally restricted systems with an efficient way to support dynamic temporal features.

VI. CONCLUSIONS

In this paper, we deal with the problem of properly handling changes in temporal features in applications highly sensitive to them, such as video identification systems. The proposed method serves, in these applications, as a pre-stage that reduces the video to a significant set of frames, which is fully configurable. The extraction is performed using video entropy series, in a content-based way that satisfies several important requirements. Firstly, differences in temporal features (e.g. frame rate, partial video appearances) are effectively filtered out, so the same set (or subset, in the case of partial appearances) is obtained as a result. The need to handle partial appearances makes the problem much more complex, as no reference points can be chosen, forcing the subsampling scheme to select a variable number of frames per interval. Secondly, the algorithm is able to select more frames when relevant events are occurring, which helps to obtain more complete results and to synchronize intervals faster.

The resulting sets have been used as input to a robust hash-based video identification system, endowing it with the capability of dealing with frame rate reduced variations. Experimental results show how the algorithm, even dealing with a subset of the original frames, is able to achieve high recall/precision when the frame rate is highly reduced (by more than 60%). Improving these results and, especially, adding support for faster synchronization will be our future work. In addition, we will research how to integrate this pre-stage into other kinds of systems with strict temporal requirements.

REFERENCES

[1] J. Cao, D. Zhang, K. M. McNeill, and J. F. Nunamaker Jr., "An overview of network-aware applications for mobile multimedia delivery," in Proceedings of the 37th Annual Hawaii International Conference on System Sciences, vol. 9, 2004, p. 90292b.

[2] C. Narayanaswami and M. Raghunath, "Expanding the digital camera's reach," Computer, vol. 37, no. 12, pp. 65-73, 2004.

[3] A. Hampapur, K. Hyun, and R. M. Bolle, "Comparison of sequence matching techniques for video copy detection," in Storage and Retrieval for Media Databases, vol. 4676. SPIE, 2001, pp. 194-201.

[4] J. S. Pedro, N. Denis, and S. Dominguez, "Video retrieval using an EDL-based timeline," in Pattern Recognition and Image Analysis, IbPRIA 2005, Estoril, Portugal, ser. Lecture Notes in Computer Science, vol. 3522. Springer, 2005, pp. 401-408.

[5] X. Yang, P. Xue, and Q. Tian, "A repeated video clip identification system," in MULTIMEDIA '05: Proceedings of the 13th Annual ACM International Conference on Multimedia. New York: ACM Press, 2005, pp. 227-228.

[6] Y. Rui, T. S. Huang, and S. Mehrotra, "Exploring video structure beyond the shots," in ICMCS '98: Proceedings of the IEEE International Conference on Multimedia Computing and Systems. Washington, DC, USA: IEEE Computer Society, 1998, p. 237.

[7] J. Yuan, Q. Tian, and S. Ranganath, "Fast and robust search method for short video clips from large video collection," in ICPR '04: Proceedings of the 17th International Conference on Pattern Recognition. IEEE Computer Society, 2004, pp. 866-869.

[8] M. Naphade, M. Yeung, and B. Yeo, "A novel scheme for fast and efficient video sequence matching using compact signatures," in Storage and Retrieval for Media Databases, vol. 3972. SPIE, 2000, pp. 564-572.

[9] S. Cheung and A. Zakhor, "Efficient video similarity measurement with video signature," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, 2003.

[10] J. Yuan, L.-Y. Duan, Q. Tian, and C. Xu, "Fast and robust short video clip search using an index structure," in MIR '04: Proceedings of the 6th ACM SIGMM International Workshop on Multimedia Information Retrieval. New York, NY, USA: ACM Press, 2004, pp. 61-68.

[11] M. L. Miller, M. A. Rodriguez, and I. J. Cox, "Audio fingerprinting: nearest neighbor search in high dimensional binary spaces," in IEEE Workshop on Multimedia Signal Processing, 2002, pp. 182-185.

[12] K. Kashino, T. Kurozumi, and H. Murase, "A quick search method for audio and video signals based on histogram pruning," IEEE Transactions on Multimedia, vol. 5, no. 3, pp. 348-357, 2003.

[13] D. DeMenthon and D. Doermann, "Video retrieval of near-duplicates using k-nearest neighbor retrieval of spatio-temporal descriptors," Multimedia Tools and Applications, vol. 30, no. 3, pp. 229-253, 2006.

[14] J. C. Oostveen, T. Kalker, and J. Haitsma, "Visual hashing of digital video: applications and techniques," A. G. Tescher, Ed., vol. 4472. SPIE, 2001, pp. 121-131.

[15] J. S. Pedro, S. Dominguez, and N. Denis, "On the use of entropy series for fade detection," in Lecture Notes in Artificial Intelligence, vol. 4177, 2006, pp. 360-369.
