Time Interval Maximum Entropy Based Event Indexing in Soccer Video


TIME INTERVAL MAXIMUM ENTROPY BASED EVENT INDEXING IN SOCCER VIDEO

Cees G.M. Snoek and Marcel Worring

Intelligent Sensory Information Systems, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands, {cgmsnoek, worring}@science.uva.nl

ABSTRACT

Multimodal indexing of events in video documents poses problems with respect to representation, inclusion of contextual information, and synchronization of the heterogeneous information sources involved. In this paper we present the Time Interval Maximum Entropy (TIME) framework that tackles the aforementioned problems. To demonstrate the viability of TIME for event classification in multimodal video, an evaluation was performed on the domain of soccer broadcasts. It was found that by applying TIME, the amount of video a user has to watch in order to see almost all highlights can be reduced considerably.

1. INTRODUCTION

Effective and efficient extraction of semantic indexes from video documents requires simultaneous analysis of visual, auditory, and textual information sources. In the literature several such methods have been proposed, addressing different types of semantic indexes; see [12] for an extensive overview. Multimodal methods for detection of semantic events are still rare; notable exceptions are [3, 7, 8, 10]. For the integration of the heterogeneous data sources a statistical classifier gives the best results [12], compared to heuristic methods, e.g. [3]. In particular, instances of the Dynamic Bayesian Network (DBN) framework have been used, e.g. [8, 10]. A drawback of the DBN framework is that the model works with fixed common units, e.g. image frames, thereby ignoring differences in layout schemes of the modalities, and thus proper synchronization. Secondly, it is difficult to model several asynchronous temporal context relations simultaneously. Finally, it lacks satisfactory inclusion of the textual modality.

Some of these limitations are overcome by using a maximum entropy framework, which has been successfully applied in diverse research disciplines, including the area of statistical natural language processing, where it achieved state-of-the-art performance [4]. More recently it was also reported in the video indexing literature [7], indicating promising

This research is sponsored by the ICES/KIS MIA project and TNO.

results for the purpose of highlight classification in baseball. However, the presented method lacks synchronization of multimodal information sources. We propose the Time Interval Maximum Entropy (TIME) framework that extends the standard framework with time interval relations, to allow proper inclusion of multimodal data, synchronization, and context relations. To demonstrate the viability of TIME for detection of semantic events in multimodal video documents, we evaluated the method on the domain of soccer broadcasts. Other methods using this domain exist, e.g. [2, 14]. We improve on this existing work by exploiting multimodal, instead of unimodal, information sources, and by using a classifier based on statistics instead of heuristics.

The rest of this paper is organized as follows. We first introduce event representation in the TIME framework. Then we proceed with the basics of the maximum entropy classifier in section 3. In section 4 we discuss the classification of events in soccer video, and the features used. Experiments are presented in section 5.

2. VIDEO EVENT REPRESENTATION

We view the problem of event detection in video as a pattern recognition problem, where the task is to assign to a pattern x an event or category w, based on a set of n features (f_1, f_2, ..., f_n) derived from x. We now consider how to represent a pattern.

A multimodal video document is composed of different modalities, each with their own layout and content elements. Therefore, features have to be defined on layout specific segments. Hence, synchronization is required. To illustrate, consider figure 1. In this example a video document is represented by five time dependent features defined on different asynchronous time scales. At a certain moment an event occurs. Clues for the occurrence of this event are found in the features that have a value within the time window of the event, but also in contextual features that have a value before or after the actual occurrence of the event. As an example consider a goal in a soccer match. Clues that indicate this event are a swift camera pan towards the goal area before the goal, an excited commentator dur-

0-7803-7965-9/03/$17.00 © 2003 IEEE    III-481    ICME 2003

    Authorized licensed use limited to: Francis Xavier Engineering College. Downloaded on January 7, 2009 at 05:28 from IEEE Xplore. Restrictions apply.


Figure 1: Feature based representation of a video document with an event (box) and contextual relations (dashed lines).

ing the goal, and a specific keyword in the closed caption afterwards. Hence, we need a means to express the different visual, auditory, and textual features in one fixed reference pattern without loss of their original layout scheme. For this purpose we propose to use binary fuzzy Allen time interval relations [1]. A total of thirteen possible interval relations, i.e. precedes, meets, overlaps, starts, during, finishes, equals, and their inverses, identified by i at the end, can be distinguished. A margin is introduced to account for imprecise boundary segmentation, explaining the fuzzy nature. By using fuzzy Allen relations it becomes possible to model events, context, and synchronization in one common framework. When we choose a camera shot as a reference pattern, a goal in a soccer broadcast can be modelled by a swift camera pan that precedes the current camera shot, excited speech that finishes the camera shot, and a keyword in the closed caption that precedes_i the camera shot. Note that the precedes and precedes_i relations require a range parameter to limit the amount of contextual information that is included in the analysis.

Thus, we choose a reference pattern, and express a co-occurrence between a pattern x and category w by means of binary fuzzy Allen relations with binary features f_i, where each f_i is defined as:

f_i(x, w) = \begin{cases} 1, & \text{if } A_i(x) = \text{true and } w = w'; \\ 0, & \text{otherwise}; \end{cases}    (1)

where A_i(x) is a predicate function that checks for a fuzzy Allen relation, and w' is one of the categories, or events.
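As an illustration of the representation above, the following sketch (our own assumption, not the authors' implementation) models a fuzzy Allen precedes relation with a boundary margin and range parameter, and the binary feature of Eq. (1); all interval values and names are made up for the soccer example:

```python
# Hypothetical sketch of a fuzzy Allen "precedes" relation and the binary
# feature of Eq. (1); intervals are (start, end) in seconds, values illustrative.

def fuzzy_precedes(a, b, margin=0.5, max_range=15.0):
    """Interval a precedes interval b: a ends before b starts, within a fuzzy
    margin for imprecise boundaries and a range parameter limiting context."""
    gap = b[0] - a[1]                  # time between end of a and start of b
    return -margin < gap <= max_range

def f_i(allen_holds, w, target):
    """Binary feature f_i(x, w) of Eq. (1): 1 iff the Allen predicate holds
    for the pattern and the category w matches the target event w'."""
    return 1 if allen_holds and w == target else 0

shot = (65.0, 72.0)                    # reference pattern: current camera shot
camera_pan = (62.0, 64.5)              # swift pan just before the shot starts

value = f_i(fuzzy_precedes(camera_pan, shot), "goal", "goal")
print(value)  # 1: the pan precedes the shot and the category matches
```

The margin lets an interval that slightly overlaps the boundary still count as preceding, which is what makes the relation fuzzy rather than strict.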

3. PATTERN CLASSIFICATION

Having defined the feature based pattern representation, we now switch to classification using the Maximum Entropy framework [4].

For each feature the expected value over the training set S is computed:

\tilde{p}(f_j) = \sum_{x,w} \tilde{p}(x, w) f_j(x, w)    (2)

where \tilde{p}(x, w) is the observed probability of x and w in S. This creates a model of S.

Figure 2: Simplified visual representation of the maximum entropy framework.

To use this model for classification of unseen patterns, i.e. the reconstructed model p(w|x), we require that the constraints for S are in accordance with the constraints of the test set T. Hence, we need the expected value of f_j with respect to the model p(w|x):

p(f_j) = \sum_{x,w} \tilde{p}(x) p(w|x) f_j(x, w)    (3)

where \tilde{p}(x) is the observed probability of x in S. The complete model of training and test set is visualized in figure 2. We are left with the problem of finding the optimal reconstructed model p*(w|x). This is solved by restricting attention to those models p(w|x) for which the expected value of f_j over T equals the expected value of f_j over S. From all those possible models, the maximum entropy philosophy dictates that we select the one with the most uniform distribution, assuming no evidence if nothing has been observed. The uniformity of the conditional distribution p(w|x) can be measured by the conditional entropy, defined as:

H(p) = -\sum_{x,w} \tilde{p}(x) p(w|x) \log p(w|x)    (4)

The model with maximum entropy, p*(w|x), should be selected. It is shown in [4] that there is always a unique model p*(w|x) with maximum entropy, and that p*(w|x) must have a form that is equivalent to:

p*(w|x) = \frac{1}{Z(x)} \prod_j \alpha_j^{f_j(x,w)}    (5)

where \alpha_j is the weight for feature f_j and Z is a normalizing constant, used to ensure that a probability distribution results. The values for \alpha_j are computed by the Generalized Iterative Scaling (GIS) [5] algorithm. Since GIS relies on
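The model of Eqs. (2)-(5) and the GIS weight updates can be sketched on toy data. This is our own minimal illustration under stated assumptions: the training pairs, feature definitions, and the multiplicative (log-domain) update are made up for the example, and the slack feature GIS formally requires is omitted since the feature sum bound C is 1 here:

```python
import math
from collections import Counter

# Toy conditional maximum entropy model p(w|x) of the Eq. (5) form, fitted
# with GIS-style updates; data and features are illustrative, not the paper's.
S = [("pan", "goal"), ("pan", "goal"), ("pan", "other"),
     ("quiet", "other"), ("quiet", "other"), ("quiet", "goal")]
classes = ["goal", "other"]

def features(x, w):
    # two binary indicator features in the spirit of Eq. (1)
    return [1 if (x == "pan" and w == "goal") else 0,
            1 if (x == "quiet" and w == "other") else 0]

n, C = len(S), 1                 # C bounds the total feature count per (x, w)
lam = [0.0, 0.0]                 # log(alpha_j) weights of Eq. (5)

# empirical expectations of Eq. (2) over the training set S
emp = [sum(features(x, w)[j] for x, w in S) / n for j in range(2)]

def p_w_given_x(x):
    scores = {w: math.exp(sum(l * f for l, f in zip(lam, features(x, w))))
              for w in classes}
    Z = sum(scores.values())     # the normalizing constant Z of Eq. (5)
    return {w: s / Z for w, s in scores.items()}

for _ in range(100):             # GIS iterations
    model = [0.0, 0.0]           # model expectations of Eq. (3)
    for x, cx in Counter(x for x, _ in S).items():
        probs = p_w_given_x(x)
        for w in classes:
            for j in range(2):
                model[j] += (cx / n) * probs[w] * features(x, w)[j]
    for j in range(2):           # scale each weight until Eq. (2) = Eq. (3)
        lam[j] += (1.0 / C) * math.log(emp[j] / model[j])

print(round(p_w_given_x("pan")["goal"], 2))  # converges to the empirical 2/3
```

At convergence the model expectations match the empirical ones, so p(goal|pan) reproduces the 2/3 observed in the toy data, which is exactly the constraint-matching behaviour the text describes.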


