Int. J. Computer Applications in Technology, Vol. x, No. x, 200x

Copyright © 200x Inderscience Enterprises Ltd.

A multimodal data mining framework for soccer goal detection based on decision tree logic

Shu-Ching Chen Distributed Multimedia Information System Laboratory, School of Computer Science, Florida International University, Miami, FL 33199, USA E-mail: [email protected]

Mei-Ling Shyu Department of Electrical and Computer Engineering, University of Miami, Coral Gables, FL 33124, USA Fax: 1 305 284 4044 E-mail: [email protected]

Chengcui Zhang Department of Computer & Information Science, University of Alabama at Birmingham, Birmingham, AL 35294, USA Fax: 1 205 934 5473 E-mail: [email protected]

Min Chen Distributed Multimedia Information System Laboratory, School of Computer Science, Florida International University, Miami, FL 33199, USA Fax: 1 305 348 3549 E-mail: [email protected]

Abstract: In this paper, we propose a new multimedia data mining framework for extracting soccer goal events from soccer videos by utilising both multimodal analysis and decision tree logic. The extracted events can be used to index the soccer videos. We first adopt an advanced video shot detection method that produces shot boundaries together with some important visual features. The visual and audio features are then extracted for each shot at different granularities. This rich multimodal feature set is cleaned by a pre-filtering step that removes noise and reduces irrelevant data. A decision tree model is built upon the cleaned data set and is used to classify the goal shots. We also present experimental results for the proposed framework, which demonstrate its performance for soccer goal extraction.

Keywords: multimedia data mining; soccer event detection; video indexing.

Reference to this paper should be made as follows: Chen, S-C., Shyu, M-L., Zhang, C. and Chen, M. (xxxx) ‘A multimodal data mining framework for soccer goal detection based on decision tree logic’, Int. J. Computer Applications in Technology, Vol. x, No. x, pp.xxx–xxx.

Biographical notes: Dr. Shu-Ching Chen has been an Associate Professor in the School of Computer Science (SCS), Florida International University (FIU), since August 2004. Prior to that, he was an Assistant Professor in SCS at FIU from August 1999. He received his PhD from the School of Electrical and Computer Engineering at Purdue University, West Lafayette, IN, USA, in December 1998. He also received Master's degrees in Computer Science, Electrical Engineering, and Civil Engineering from Purdue University. His main research interests include distributed multimedia database systems, data mining, and multimedia networking. Dr Chen has authored and co-authored more than 120 research papers in journals, refereed conference proceedings, and book chapters. He was awarded the University Outstanding Faculty Research Award from FIU in 2004. He also received Outstanding Faculty Research and Service Awards from SCS at FIU in 2002 and 2004, respectively. He is/was the general co-chair of the IEEE International Conference on Information Reuse and Integration, the co-founder and programme co-chair of the ACM International Workshop on Multimedia Databases, and programme chair of several conferences.

Mei-Ling Shyu received her PhD degree from the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA, in 1999, and her three Master's degrees in Computer Science, Electrical Engineering, and Restaurant, Hotel, Institutional, and Tourism Management from Purdue University. She has been an Assistant Professor at the Department of Electrical and Computer Engineering, University of Miami, since January 2000. Her research interests include data mining, multimedia database systems, multimedia networking, and database systems. She has authored and co-authored more than 100 technical papers published in various prestigious journals, refereed conference/symposium/workshop proceedings, and book chapters. She was the co-founder and programme co-chair of the ACM International Workshop on Multimedia Databases. She is the programme co-chair of the 2005 IEEE International Conference on Information Reuse and Integration.

Chengcui Zhang has been an Assistant Professor of Computer and Information Science at the University of Alabama at Birmingham (UAB) since August 2004. She received her PhD from the School of Computer Science at Florida International University, Miami, FL, USA, in August 2004. She also received her Bachelor's and Master's degrees in Computer Science from Zhejiang University in China. Her research interests include multimedia databases, multimedia data mining, image and video database retrieval, and GIS data filtering. She has authored and co-authored more than 35 technical papers published in various prestigious journals, refereed conference/workshop proceedings, and book chapters. She is the recipient of several awards, including the UAB ADVANCE Junior Faculty Research Award from the National Science Foundation, the Presidential Fellowship, and the Best Graduate Student Research Award at FIU.

Min Chen received her Bachelor's degree in Electrical Engineering from Zhejiang University in China. She is currently a PhD candidate at the School of Computer Science at Florida International University (FIU). Her research interests include distributed multimedia database systems, image and video database retrieval, and multimedia data mining. She was granted the Presidential Fellowship and the Best Graduate Student Research Award from FIU in 2004.

1 INTRODUCTION

The rapid development of various affordable technologies in video capturing, data storage, and data transmission has resulted in a dramatic increase in the size of video collections. However, the practical use of video data as an information resource is still largely limited by the lack of viable event mining, video indexing, and summarisation techniques. Due to the great popularity of sports videos, mining information from sports video data, especially soccer videos, has become an active research topic in recent years. The impact of such research is twofold. First, extracted soccer events can be used to facilitate the indexing and summarisation of soccer videos. Second, the research outcome can be extended to general video analysis. For soccer video analysis and event recognition, most of the existing work is based on unimodal approaches, such as visual (Wu et al., 2002; Xu et al., 2001), audio (El-Maleh et al., 2000; Xu and Maddage, 2003), and text (Babaguchi et al., 2002; Zhang and Chang, 2002). For example, Xu et al. (2001) used a domain-specific visual feature, the grass-area ratio, to detect play/break segments according to whether the soccer ball is in play. As an example of a modality other than the above three, the objects' 3-D location information was used in Tovinkere and Qian (2001) to build an E-R model capturing the domain knowledge. However, this is not a general approach, since 3-D information is not available in general video data; moreover, Tovinkere and Qian (2001) do not discuss goal detection. As another example, the algorithm in Leonardi and Migliorati (2002) took advantage of the motion descriptors that are directly available in MPEG-format video sequences, and the temporal evolution of these descriptors was used for soccer goal detection. Although this method is simple and easy to implement, it reported a high recall value but a poor precision value for soccer goal detection, because the selection of timing requirements for motion evolutions is not scalable. Thus, it is difficult to extend this method to a variety of video sequences.

It is easy to see that different modalities make different contributions in this specific application domain of soccer goal detection (Dagtas and Abdel-Mottaleb, 2001). Recently, approaches using multimodal analysis have drawn increasing attention (Babaguchi et al., 2002; Dagtas and Abdel-Mottaleb, 2001; Snoek and Worring, 2003). Dagtas and Abdel-Mottaleb (2001) proposed a multimodal framework using combined audio/visual/text cues. However, the use of textual transcripts is not reliable because of errors in the captions, and the availability of a transcript is not guaranteed. Their method also used the grass-area ratio as an important cue to identify the close-up segments following the event segments; however, their grass-detection method is not robust because it requires manual effort to obtain training data for each new video sequence. According to their experimental results, the precision value for goal event detection is low. In Snoek and Worring (2003), the 13 binary fuzzy Allen time interval relations (e.g., precedes, meets, and their inverses) were used to model events and their context. For example, a soccer goal event is preceded by a camera pan and followed by excited commentator speech. However, for the detection of soccer goals, this method mainly used textual information (closed captions), which is often not available and is language-dependent.

Though a multimodal approach shows promise, it also raises the issue of how to integrate the rich semantic information contained in large amounts of multimodal features. A set of domain-specific rules is often used to perform reasoning based on the detected multimodal cues. Such approaches are easy to implement and can be very successful in practice (Li and Sezan, 2003; Zhong and Chang, 2001). However, when the number of multimodal features becomes large, it is not easy to derive such rules. In addition, a set of fixed inference rules may not be general enough to be extended to a large number of video samples. Moreover, the fixed thresholds used in many of these rules are not suitable for varied real-world soccer videos (Li and Sezan, 2003). A possible solution to alleviate this problem is to use data mining techniques coupled with simple, or not so specific, heuristic rules. In Assfalg et al. (2002), a data mining method to detect and recognise soccer highlights using Hidden Markov Models (HMMs) was proposed, in which a separate model is trained for each type of event. However, it cannot identify goal events and has difficulty dealing with long video sequences.

Data mining techniques have long been used to discover interesting patterns from large data sets. However, relatively few research efforts have been directed towards the multimedia data mining area, that is, mining high-level semantics and patterns from large amounts of multimedia data. In this paper, we propose a decision-tree-based multimodal data mining framework for soccer goal detection. A soccer goal event is identified when the ball passes over the goal line, between the goal posts and under the crossbar (Ekin and Tekalp, 2003). It is difficult to detect and verify these conditions automatically and reliably; however, the occurrence of a goal event is generally indicated by some special patterns of multimodal features, which is what we exploit in our proposed data mining framework.

The input training data for the data mining component is the multimodal (visual and audio) features extracted for each video shot. It is reasonable to make the framework shot-based because video shots are the basic indexing unit for video content analysis (Chen et al., 2003). In addition, we adopt an advanced video shot detection method that has the advantage of producing some important visual features, and even mid-level features (such as object information), during shot detection. Therefore, we can further extract visual features for the objects, namely object-level features. Based on the object-level features, an unsupervised and robust grass-area detection method is also proposed with very limited extra effort, which distinguishes our framework from most other existing approaches.

However, the extracted feature set cannot be passed directly to the data mining component because of the small percentage of positive samples (goal shots) compared to the huge number of negative samples (non-goal shots). To the best of our knowledge, there is hardly any work addressing this issue. In this framework, the use of domain knowledge has simplified this problem. Instead of trying to solve the general problem of distinguishing a very small percentage of interesting events (e.g., 1%) from a large amount of irrelevant data, clues from the different modalities (e.g., visual and audio clues) are used in our data pre-filtering step to clean the original feature data set and provide a reasonable input training data set for the data mining component. Finally, the decision tree model generated by the data mining process is tested, and its performance is evaluated using a large set of long soccer video sequences with different styles, produced by different broadcasters. It is worth mentioning that we are dealing with broadcast soccer videos that have lower visual/audio quality than the videos used in some other works. Based on our experiments, the average classification results reach 91.0% for recall and 87.7% for precision, which demonstrates the power of integrating data mining and multimodal processing.

The paper is organised as follows. Section 2 discusses the architecture of the proposed framework as well as the details of its major components. Experimental results are presented in Section 3. Section 4 gives the conclusion.

2 ARCHITECTURE OF THE FRAMEWORK

The architecture of our system is shown in Figure 1. As can be seen from this figure, the proposed framework consists of three major components, namely video parsing, data pre-filtering, and data mining.

• Video parsing. Parse the raw soccer video sequences by using a video shot detection subcomponent. It not only detects video shot boundaries, but also produces some important frame-level visual features during shot detection. Moreover, since object segmentation is embedded in video shot detection, high-level semantic information, such as the grass areas that serve as an important indication in soccer goal detection, can be derived from the object segmentation results. Hence, only a small amount of work needs to be done to extract the visual features for each shot, which distinguishes our framework from most of the other existing approaches. The detected shot boundaries are passed to the feature extraction phase, where the complete multimodal (visual and audio) features are extracted for each shot.

• Data pre-filtering. Domain knowledge, such as visual/audio clues, is used to eliminate noise and reduce irrelevant data in the original feature set. This step is critical because the ratio of goal shots to non-goal shots is very small (e.g., 1 goal shot out of 100 shots), so the original feature set is not suitable as direct input for data mining. By data pre-filtering, the ratio of positive samples to negative samples can be increased to about 1:4, which is reasonably good for data mining.

• Data mining. Take the 'cleaned' feature data as the training data and build a decision tree model suitable for soccer goal detection. To illustrate the robustness of the proposed framework, in our experimental settings a large amount of soccer video data with different broadcasting styles serves as the experimental data set, and a random subsampling approach is adopted to evaluate the classification performance. More specifically, we perform K random data splits on the data set, where each split results in a different pair of disjoint subsets, a training data set and a testing data set. The decision tree model is then trained on each training data set and tested on the corresponding testing data set. Finally, the average performance over the K different models is used to evaluate the proposed framework. The experimental results demonstrate the effectiveness and great potential of our proposed framework.

Figure 1 The architecture of the framework

2.1 Video parsing

2.1.1 Video shot detection

Video shot detection is the first step of video parsing, and the detected shot boundaries are the basic units for video feature extraction. In this study, we adopt the shot detection technique developed in our previous work (Chen et al., 2004; Zhang et al., 2003), which utilises a multi-filtering architecture consisting of a pixel-level comparison filter, a histogram comparison filter, a segmentation mask map comparison filter, and an object-tracking filter. The multi-filtering architecture for this method is shown in Figure 2.

Page 5: A multimodal data mining framework for soccer goal detection based ...

A MULTIMODAL DATA MINING FRAMEWORK FOR SOCCER GOAL DETECTION BASED ON DECISION TREE LOGIC 5

Figure 2 The multi-filtering architecture for shot detection

In the traditional pixel-level comparison approach, the grey-scale values of the pixels at corresponding locations in two successive frames are subtracted, and the absolute value is used as a measure of dissimilarity between the pixel values. If this value exceeds a certain threshold, the pixel grey scale is said to have changed. The percentage of pixels that have changed is the measure of dissimilarity between the frames. This approach is computationally simple but sensitive to digitisation noise, illumination changes, and object motion. To compensate for this, histogram comparison (i.e., using the colour histogram difference to compare two consecutive video frames) is incorporated into this method to reduce the false positives detected by the pixel-level comparison. Therefore, the first two filters compensate for each other in reducing the numbers of both false positives and false negatives.

In addition, since object segmentation and tracking techniques are much less sensitive to luminance changes and object motion, they are used as the next two filters in this multi-filtering architecture to help determine the actual shot boundaries when both pixel comparison and histogram comparison fail. In other words, for the sake of efficiency, we apply the segmentation and object-tracking techniques only when necessary. Object segmentation is implemented based on the unsupervised object segmentation and tracking method proposed in our previous work (Chen et al., 2001). Using the unsupervised object segmentation algorithm, the significant objects or regions of interest, as well as the segmentation mask map of a video frame, can be extracted automatically. We adopted two-class segmentation in this shot detection algorithm for efficiency, in which case the pixels in each frame are grouped into two classes. Two frames can then be compared by checking the difference between their segmentation mask maps. An example of subtracting segmentation mask maps is given in Figure 3. In fact, since the pixel value in the segmentation mask map is binary (either one or two), using the 'XOR' operation is very efficient, as shown in Figure 3.
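To make the filter chain concrete, the following is a minimal sketch (not the authors' implementation) of the first three dissimilarity measures, assuming frames are given as 8-bit grey-scale NumPy arrays; the pixel threshold and histogram bin count are illustrative values only:

```python
import numpy as np

def pixel_change_ratio(prev_frame, curr_frame, pixel_threshold=30):
    """Fraction of pixels whose grey-scale value changed by more than
    pixel_threshold between two successive frames (threshold is hypothetical)."""
    diff = np.abs(prev_frame.astype(np.int16) - curr_frame.astype(np.int16))
    return float(np.mean(diff > pixel_threshold))

def histogram_difference(prev_frame, curr_frame, bins=64):
    """Normalised L1 distance between the grey-level histograms of two frames."""
    h1, _ = np.histogram(prev_frame, bins=bins, range=(0, 256))
    h2, _ = np.histogram(curr_frame, bins=bins, range=(0, 256))
    return float(np.sum(np.abs(h1 - h2))) / prev_frame.size

def mask_map_difference(prev_map, curr_map):
    """XOR-style comparison of two two-class segmentation mask maps
    (class labels 1 and 2, as in Figure 3)."""
    return float(np.mean(prev_map != curr_map))
```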

The advantages of this method are:

• It has high precision (>92%) and recall (>98%) values. This overall performance is based on our experiments over more than 1,000 testing shots. With such solid performance, only very little manual effort is needed to correct the false positives and to recover the missed positives.

• It can generate a set of important visual features (the frame-level visual features shown in Figure 1) for each shot during the process of shot detection. Thus, the computation needed to extract visual features can be greatly alleviated.


Figure 3 Subtraction of segmentation mask maps: (a) previous_map; (b) current_map; and (c) diff_map

2.1.2 Visual feature extraction

In addition to shot boundaries, the video shot detection process also generates a rich set of visual features associated with each video shot. Among these, pixel_change represents the average percentage of changed pixels between frames within a shot, generated by the first filter (Pixel-level Filter). The feature histo_change indicates the mean value of the histogram difference between frames within a shot and is the output of the second filter (Histogram Filter). Both global features are important indicators of camera motion and object motion. Other mid-level features, such as the mean (back_mean) and variance (back_var) values of the background pixels, can be obtained via the Segmentation Map Filter.

As shown in Figure 4(c)–(d), the background areas (black) and foreground areas (grey) are detected by object segmentation. In global view shots (Figure 4(a) and (c)), the grass areas tend to be detected as background, while in close-up shots (Figure 4(b) and (d)), the background is very complex and may contain the crowd, signboards, etc. Based on our observations, a large amount of grass area is present in global view shots (including goal shots), while there is little or no grass area in mid- or close-up shots (including the cheering shots following goal shots). This means the average percentage of grass area (grass_ratio) in a video shot is a very important indication for classifying shot types (global, close-up, etc.). Taking advantage of these by-products of video shot detection, we include five visual features in our data mining framework for soccer goal detection, as listed in Table 1.

Page 6: A multimodal data mining framework for soccer goal detection based ...

6 S-C. CHEN, M-L. SHYU, C. ZHANG AND M. CHEN


Figure 4 (a) a sample frame from a goal shot (global view); (b) a sample frame from the cheering shot following the goal shot for (a); and (c)–(d) object segmentation results for (a) and (b), respectively

Table 1 Visual features and their descriptions

| Feature name | Description | Indication |
| --- | --- | --- |
| pixel_change | Shot-level average of the percentage of changed pixels between frames | camera motion, object motion |
| histo_change | Shot-level average of the histogram differences between frames | camera motion, object motion |
| back_var | Shot-level average of the variance of background pixels | grass area |
| back_mean | Shot-level average of the mean of background pixels | grass area |
| grass_ratio | Shot-level average of the percentage of grass area | global, close-up |

We observe that the grass area in a global shot is relatively smooth in terms of its colour and texture. Hence, back_var < threshold indicates a possible grass area. In particular, within each shot, we sample video frames at a 50-frame interval and perform object segmentation on them. By object segmentation, the background areas (grass, crowd, etc.) and foreground areas (player, ball, etc.) are roughly identified, where the background areas have lower variance values and the foreground areas have higher variance values. We then check the back_var value of the background areas for each shot. If back_var < threshold, indicating a possible grass area, we put the corresponding back_mean into a candidate pool of possible grass values. The next step is to group the back_mean values of all the possible grass areas in the candidate pool, filter out the outliers by discarding shots that are too short and shots whose back_mean values fall outside a reasonable range around the average back_mean, and take the average of the remaining values as the grass colour detector.

A robust method is also developed to handle the more complex situation in which the grass colours differ between the global shots and the close-up shots because of camera shooting scale and lighting conditions. In this case, we select the histogram peak(s) within a reasonable range of the values in the candidate pool as the grass detector(s). It should be pointed out that this grass-area detection method is unsupervised: the grass values are learned through unsupervised learning within each video sequence, which makes it invariant to different types of videos. Figure 5 shows the detected grass areas for three sample images from different types of shots (global, close-up, and middle shots), and the results are quite promising.
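The basic grass-colour learning step can be sketched as follows. This is an illustrative reconstruction of the procedure described above, not the paper's code; all threshold values (var_thresh, min_len, dev_thresh) are hypothetical:

```python
import numpy as np

def learn_grass_colour(back_mean, back_var, shot_len,
                       var_thresh=0.05, min_len=25, dev_thresh=0.15):
    """Unsupervised grass-colour estimate for one video sequence.
    back_mean / back_var: per-shot background statistics, normalised to [0, 1];
    shot_len: shot lengths in frames."""
    back_mean = np.asarray(back_mean, dtype=float)
    # Candidate pool: smooth backgrounds from shots that are not too short.
    candidates = back_mean[(np.asarray(back_var) < var_thresh) &
                           (np.asarray(shot_len) >= min_len)]
    if candidates.size == 0:
        return None  # no smooth background found; no grass estimate
    # Filter out candidates far from the pool average, then re-average.
    pool_avg = candidates.mean()
    kept = candidates[np.abs(candidates - pool_avg) <= dev_thresh]
    return kept.mean() if kept.size else pool_avg
```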

Page 7: A multimodal data mining framework for soccer goal detection based ...

A MULTIMODAL DATA MINING FRAMEWORK FOR SOCCER GOAL DETECTION BASED ON DECISION TREE LOGIC 7


Figure 5 Detected grass areas (black areas) for three sample video frames from different types of shots: (a) global shot, (b) close-up shot and (c) middle shot

In summary, five visual features (pixel_change, histo_change, back_mean, back_var, and grass_ratio) are included in our data mining framework. The visual feature extraction process takes advantage of the robust shot detection process described in Section 2.1.1, so that only limited extra effort is needed to extract the visual features for each shot. For example, pixel_change and histo_change are obtained directly during the pixel-level comparison and histogram comparison for video shot detection. Some high-level semantic information, such as the grass-area ratio, can also be obtained automatically by using the object segmentation component of video shot detection. Therefore, this method has great potential to provide more high-level semantic information, such as the locations of players and the ball, as well as the spatio-temporal characteristics among objects. It should be pointed out that the necessary data normalisation is done within each video sequence, so that the values of each visual feature are normalised to the [0, 1] range.

2.1.3 Audio feature extraction

A variety of audio features have been proposed in the literature that can serve the purpose of audio track characterisation (Liu et al., 1998; Wang et al., 2000). Generally, audio features can be classified into two major groups: time-domain features and frequency-domain features. Moreover, with respect to the analysis requirements of specific applications, audio features may be extracted at different granularities, such as frame-level and clip-level. Both time-domain and frequency-domain audio features are considered in our framework. Since the semantic meaning of an audio track is better represented by audio features over a relatively longer period, we also explore both clip-level and shot-level audio features. In this study, we define an audio clip as a fixed length of one second, which usually contains a continuous sequence of audio frames.

The generic audio features utilised in our framework can be divided into three groups, namely volume features (volume), energy features (energy), and spectrum flux features (sf). For each generic audio feature, the audio files are processed to obtain the features at both clip-level and shot-level. The audio data is sampled at a rate of 16,000 Hz. An audio frame contains 512 samples, which lasts 32 ms at this sampling rate. Within each clip, neighbouring frames overlap by 128 samples.
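A minimal sketch of this framing scheme, assuming the audio has already been loaded as a NumPy array of 16 kHz samples:

```python
import numpy as np

FS = 16_000            # sampling rate (Hz)
FRAME_LEN = 512        # samples per frame (32 ms at 16 kHz)
HOP = FRAME_LEN - 128  # neighbouring frames overlap by 128 samples

def frame_clip(samples):
    """Split a one-second audio clip into overlapping frames,
    returning an array of shape (n_frames, FRAME_LEN)."""
    n_frames = 1 + (len(samples) - FRAME_LEN) // HOP
    return np.stack([samples[i * HOP : i * HOP + FRAME_LEN]
                     for i in range(n_frames)])
```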

Some statistical features are calculated at both clip-level and shot-level. The statistical functions are formalised as follows:

• mean: the average value
• std: the standard deviation
• range: the dynamic range of a feature
• stdd: the standard deviation of the frame-to-frame difference
• lowrate: the ratio of the number of frames whose feature values are less than 50% of the average feature value.
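These five statistics are straightforward to compute; a sketch, assuming the per-frame feature values (e.g., per-frame volume) are already available:

```python
import numpy as np

def clip_statistics(values):
    """The five statistics above, applied to a sequence of per-frame values."""
    v = np.asarray(values, dtype=float)
    return {
        'mean':    v.mean(),
        'std':     v.std(),
        'range':   v.max() - v.min(),
        'stdd':    np.diff(v).std(),             # frame-to-frame difference
        'lowrate': np.mean(v < 0.5 * v.mean()),  # frames below 50% of the mean
    }
```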

Volume-related features

Volume is one of the most frequently used and simplest audio features. As an indication of the loudness of sound, volume is very useful for soccer video analysis. One volume-based feature used is:


| Feature name | Description |
| --- | --- |
| volume_mean | The mean value of the volume |

Energy-related features

Short-time energy is the average waveform amplitude defined over a specific time window. To model the energy properties more accurately, the energy characteristics of sub-bands are explored as well. Four energy sub-bands are identified, covering the frequency intervals 1 Hz–fs/16 Hz, fs/16 Hz–fs/8 Hz, fs/8 Hz–fs/4 Hz, and fs/4 Hz–fs/2 Hz, respectively, where fs is the sampling rate.

| Feature name | Description |
| --- | --- |
| energy_mean | The mean RMS (root-mean-square) energy |
| sub1_mean | The average RMS energy of the first sub-band |
| sub3_mean | The average RMS energy of the third sub-band |
| energy_lowrate | The percentage of samples with RMS power less than 0.5 times the mean RMS power |
| sub1_lowrate | The percentage of samples with RMS power less than 0.5 times the mean RMS power of the first sub-band |
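A sketch of how the frame-level RMS energy and the four sub-band energies could be computed. The paper does not specify how band energies are obtained, so this FFT-based approximation (RMS over the magnitude spectrum within each band) is an assumption:

```python
import numpy as np

def subband_rms(frame, fs=16_000):
    """RMS energy of one audio frame plus approximate RMS energies of the
    four sub-bands [1, fs/16], [fs/16, fs/8], [fs/8, fs/4], [fs/4, fs/2] Hz."""
    total = np.sqrt(np.mean(np.square(frame)))          # time-domain RMS
    spectrum = np.abs(np.fft.rfft(frame))               # magnitude spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    bands = [(1, fs / 16), (fs / 16, fs / 8),
             (fs / 8, fs / 4), (fs / 4, fs / 2)]
    band_rms = [np.sqrt(np.mean(np.square(spectrum[(freqs >= lo) & (freqs < hi)])))
                for lo, hi in bands]
    return total, band_rms
```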

Spectrum flux-related features

Spectral flux (Delta Spectrum Magnitude) is defined as the two-norm of the frame-to-frame spectral amplitude difference vector.

| Feature name | Description |
| --- | --- |
| sf_mean | The mean value of the spectrum flux |
| sf_std | The standard deviation of the spectrum flux, normalised by the maximum spectrum flux |
| sf_stdd | The standard deviation of the difference of the spectrum flux, also normalised |
| sf_range | The dynamic range of the spectrum flux |
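The spectral flux definition above translates directly into code; a sketch, assuming a matrix of windowed audio frames as produced by the framing sketch earlier:

```python
import numpy as np

def spectrum_flux(frames):
    """Spectral flux for each consecutive frame pair: the 2-norm of the
    frame-to-frame spectral amplitude difference vector."""
    spectra = np.abs(np.fft.rfft(frames, axis=1))      # one spectrum per frame
    return np.linalg.norm(np.diff(spectra, axis=0), axis=1)
```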

The steps to obtain the abovementioned audio features are summarised as follows:

• The basic information, such as the duration and the total number of frames, of each video file is collected.

• For each video file:

  • process the video file through the video shot boundary detection algorithm to identify its video shots;

  • separate the corresponding audio track from the original video file;

  • for each video shot, calculate the 15 generic audio features of its first three seconds and its last three seconds, denoted firstVec_i and lastVec_i, respectively;

  • for each video shot, calculate the 15 generic shot-level audio features SV_i;

  • obtain the normalised shot-level feature vector NormSV_i via NormSV_i = (SV_i – min(SV_i))/(max(SV_i) – min(SV_i)) (see the sketch below).
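A sketch of the min-max normalisation in the last step, assuming (as the formula does not state explicitly) that the minimum and maximum are taken per feature over all shots in the video; the division-by-zero guard for constant features is our addition:

```python
import numpy as np

def normalise_shot_features(sv):
    """Min-max normalisation of shot-level feature vectors within one video.
    sv: array of shape (n_shots, n_features); each column is scaled to [0, 1]."""
    lo, hi = sv.min(axis=0), sv.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant features
    return (sv - lo) / span
```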

In all, ten audio features (one volume feature, five energy features, and four spectrum flux features) are extracted for each video shot and used in our framework.

2.2 Data pre-filtering

Once the proper visual and audio features have been extracted, data mining techniques can be applied to identify the goal shots. However, these features may contain noisy and inconsistent data introduced during the video production process. Moreover, the amount of data is typically huge, and the ratio of goal shots to non-goal shots is only about 1:100 in our case. It would be difficult for the data mining process to capture the small portion of useful information from a huge amount of data containing irrelevant information; in the worst case, the goal shots might be treated as noise and ignored by the mining process. Therefore, before performing the actual data mining, for the sake of accuracy and efficiency, a pre-filtering process is needed to clean the data and select a small set of candidate goal shots using domain knowledge. Here, domain knowledge is defined as empirically verified or proven information specific to the application domain that serves to reduce the problem or search space (Witten and Frank, 1999). In the rest of this section, we present this pre-filtering process using some computable observation rules on soccer videos.

In soccer videos, the soundtrack mainly comprises the foreground commentary and the background crowd noise. Based on observation and prior knowledge, the commentator and crowd become extremely excited at the end of a goal shot. In addition, unlike other sporadic bursts of excited sound or noise, this kind of excitement normally lasts into the following shot(s). Thus, the duration and intensity of the sound can be used to capture the candidate goal shots, as defined in the following rule:

Rule 1: For a candidate goal shot, the last three (or fewer) seconds of its audio track and the first three (or fewer) seconds of its following shot should both contain at least one exciting point.

Here, an exciting point is defined as a one-second period whose volume is larger than 60% of the highest one-second volume in the video. It is worth mentioning that this volume threshold could be set even higher for most videos. However, based on our experiments, 60% is a reasonable threshold, since the number of candidate goal shots is reduced to 17% of the whole search space while all the goal shots are retained. In addition, this rule acts as a data cleaning step that removes some of the noise data: although noise data often has high volume, it does not last long.
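Rule 1 can be expressed compactly as follows; second_volumes is a hypothetical per-second volume array for the whole video, and the two boolean arguments to passes_rule1 are the flagged seconds for the relevant three-second windows:

```python
import numpy as np

def exciting_points(second_volumes, ratio=0.6):
    """Flag each one-second period whose volume exceeds 60% of the
    loudest one-second volume in the video."""
    v = np.asarray(second_volumes, dtype=float)
    return v > ratio * v.max()

def passes_rule1(shot_last3_excited, next_first3_excited):
    """A candidate goal shot needs at least one exciting point in the last
    three seconds of its own audio and in the first three seconds of the
    following shot."""
    return any(shot_last3_excited) and any(next_first3_excited)
```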


As mentioned earlier, we have two basic types of shots in soccer videos, close-up shots and global shots, based on the ratio of green grass area. We observe that goal shots belong to the global shots with a high grass ratio and are always closely followed by close-up shots, which include cutaways, crowd scenes, and other shots irrelevant to the game without grass pixels, as shown in Figure 6. Figure 6(a)–(c) captures three consecutive shots starting from the goal shot (Figure 6(a)), and Figure 6(d)–(f) shows three consecutive shots where Figure 6(d) is the goal shot. As can be seen from this figure, within the two consecutive shots that follow the goal shot, there is usually a close-up shot (Figure 6(b) and (f), respectively). According to these observations, two rules are defined as follows:

Rule 2: A goal shot should have a grass_ratio larger than 40%.

Rule 3: Within two succeeding shots that follow the goal shot, at least one shot should belong to the close-up shots.

Figure 6 Goal shots followed by close-up shots: (a)–(c) are three consecutive shots in a goal event, (b) is the close-up shot that follows (a) the goal shot, (d)–(f) show another goal event and its three consecutive shots, (f) is the close-up shot that follows (d) the goal shot

Note that the threshold defined in Rule 2 could be raised for most videos. However, our experiments show that 81% of the candidate pool obtained after applying Rule 1 can be eliminated using Rule 2 and Rule 3, which means that only 3% of the whole search space is used as the input for the data mining process. In addition, according to prior knowledge, a goal shot normally lasts more than two seconds, which can be used as an optional filter, as shown below.

Optional rule: A candidate goal shot should have a duration > 2s.
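A sketch combining Rules 2 and 3 with the optional duration rule for a single shot. The grass_ratio threshold used to call a following shot a close-up is a hypothetical value, since the paper does not state how close-up shots are thresholded:

```python
def is_candidate_goal(shot, next_shots, close_up_grass_max=0.1):
    """Apply Rules 2-3 and the optional rule to one shot.
    shot / next_shots are dicts with 'grass_ratio' and 'duration' (seconds);
    next_shots holds the shots that follow, in order."""
    rule2 = shot['grass_ratio'] > 0.40                       # Rule 2
    rule3 = any(s['grass_ratio'] < close_up_grass_max        # Rule 3: a close-up
                for s in next_shots[:2])                     # within two shots
    optional = shot['duration'] > 2.0                        # optional rule
    return rule2 and rule3 and optional
```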

2.3 Mining goal shots using decision trees

In this framework, decision tree logic is adopted for mining the goal shots in soccer videos because of its strength in learning the associations between the different attributes of a set of pre-labelled instances. The goal of decision tree classification is to derive a model for each class in terms of attribute-value pairs and later to use that induced model to categorise any unseen instance characterised by the values of a set of attributes. The output trees summarise all the given information but express it in a more concise and perspicuous manner. A decision tree is constructed by recursively partitioning the training set with respect to certain criteria until all the instances in a partition have the same class label, or no more attributes can be used for further partitioning. An interior node in a decision tree tests a particular attribute, and the branches that fork from that node correspond to the possible outcomes of the test. Eventually, a leaf node is formed, which carries a class label indicating the majority class within the final partition. The classification phase works like traversing a path in the tree: starting from the root, the instance's value of a certain attribute decides which branch to follow at each internal node; whenever a leaf node is reached, its associated class label is assigned to the instance. There are several well-known decision tree classifiers, including CHAID (Chi-square Automatic Interaction Detection) (Kass, 1980), C4.5 (Quinlan, 1993), and CART (Classification and Regression Trees) (Breiman et al., 1984). The algorithm exploited in this study is adopted from the C4.5 decision tree classifier.

Attribute evaluation heuristics

There exist a variety of attribute evaluation heuristics in the literature. Some examples include entropy and its variants (Quinlan, 1986; Quinlan, 1993), the GINI index (Breiman et al., 1984), Chi-square test, and the G statistic (Mingers, 1987). The information gain ratio criterion is used to determine the most appropriate attribute for partitioning due to its efficiency and simplicity. It is defined as:

$$\mathrm{Gainratio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)} \qquad (1)$$

where SplitInfo(S, A) represents the entropy of the data with respect to the values of the attribute:

$$\mathrm{SplitInfo}(S, A) = -\sum_{v=1}^{c} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|} \qquad (2)$$

and Gain(S, A) is the information gain defined as:

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v=1}^{c} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \qquad (3)$$

where S is a set of instances and A is the selected attribute, which has c distinct values {a_1, a_2, …, a_c}. S_v is the vth subset of S, containing the instances whose value of attribute A is a_v. The entropy is calculated as:

$$\mathrm{Entropy}(S) = -\sum_{i=1}^{n} p_i \log_2(p_i) \qquad (4)$$


where n is the number of classes and p_i represents the fraction of class-i examples in set S.
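For reference, equations (1)–(4) translate directly into code; a minimal sketch for a nominal attribute:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, equation (4)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(labels, attr_values):
    """Information gain ratio of a nominal attribute, equations (1)-(3)."""
    n = len(labels)
    subsets = {}
    for y, a in zip(labels, attr_values):
        subsets.setdefault(a, []).append(y)       # partition S by attribute value
    gain = entropy(labels) - sum(len(s) / n * entropy(s)
                                 for s in subsets.values())
    split_info = -sum(len(s) / n * math.log2(len(s) / n)
                      for s in subsets.values())
    return gain / split_info if split_info > 0 else 0.0
```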

Handling numeric attributes

The attribute evaluation process for nominal attributes is straightforward. However, for numeric attributes, the instances need to be divided into several groups so that each group can be regarded as one value of a nominal attribute. Since all the visual and audio features used in this framework are numeric attributes, it is necessary to handle the numeric attributes.

Numeric attributes are accommodated by a two-way split: a single breakpoint is located and serves as a threshold to separate the instances into two groups. The selection of the best breakpoint is based on the information gain value. First, all instances within the current set are sorted according to their values of the tested attribute. Then a breakpoint is placed, splitting the instances into two groups, and the corresponding information gain of that split is calculated and stored. After going through all possible placements of the breakpoint, the placement with the highest information gain is identified; it will later be used to partition the set of instances if the corresponding numeric attribute outperforms all the other attributes and becomes the best attribute for node splitting.

To avoid generating tiny partitions after the binary split (partitions that contain a comparatively small number of instances), a threshold value indicating the minimum number of instances in a partition is set automatically. The threshold value is set proportionally to the total number of instances for the sake of scalability.
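A sketch of the two-way split search described above; the minimum-partition size is passed in as a parameter, and the entropy helper is repeated here for self-containment:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_breakpoint(values, labels, min_partition=2):
    """Try every breakpoint between sorted attribute values and keep the one
    with the highest information gain, skipping tiny partitions."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (float('-inf'), None)
    for i in range(min_partition, n - min_partition + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # a breakpoint must fall between distinct values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = entropy(left + right) - (len(left) / n * entropy(left)
                                        + len(right) / n * entropy(right))
        if gain > best[0]:
            best = (gain, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best  # (information gain, threshold value)
```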

Constructing the decision tree

The data output by the pre-filtering process forms the training data set for the decision tree classifier. Each data entry consists of the shot-level audio and visual features as well as the class label. The shot-level features are extracted in the feature extraction stage (see Section 2.1). A 'yes' or 'non' class label is assigned to each shot manually: a shot is tagged 'yes' if it contains a goal event and 'non' otherwise.

Given the training data set, the C4.5 decision tree algorithm is adopted to learn a classifier that is able to predict the class label of any new instance. The induced classification rules are finally represented in the form of a decision tree.

3 EXPERIMENTAL RESULTS

3.1 Soccer video data and feature extraction

In our experiments, we collected 27 soccer video files from a wide range of sources via the internet, with different styles and produced by different broadcasters. The total duration is 9 hours and 28 minutes. Among the total 4,885 video shots, only 41 are goal shots, which constitute only 0.8% of the total shots. The detailed statistics of all the video files are listed in Table 2. As shown in Table 2, the years of production of our video collection range from 1998 to 2003, including 7 FIFA2003 soccer videos and 20 other videos, which constitutes the test bed for our proposed soccer goal mining framework.

Table 2 Detailed statistics of all the video data files

| No. | Length (seconds) | Shots | Goals | Year | Teams |
| --- | --- | --- | --- | --- | --- |
| 1 | 1,404.2 | 194 | 2 | 1998 | Ukraine vs. Russia |
| 2 | 1,235.7 | 148 | 2 | 1999 | Other |
| 3 | 820.4 | 83 | 1 | 1999 | Other |
| 4 | 958.1 | 93 | 4 | 1999 | Other |
| 5 | 1,926.1 | 346 | 2 | 2002 | Other |
| 6 | 584.5 | 96 | 1 | 2002 | Other |
| 7 | 524.9 | 106 | 1 | 2002 | Other |
| 8 | 2,061 | 418 | 1 | 2002 | Other |
| 9 | 841.9 | 169 | 1 | 2002 | Other |
| 10 | 1,998.4 | 238 | 1 | 2002 | Other |
| 11 | 1,672.7 | 212 | 1 | 2002 | Other |
| 12 | 1,858.9 | 230 | 3 | 2002 | Other |
| 13 | 800.4 | 115 | 1 | 2002 | Other |
| 14 | 1,104.7 | 149 | 2 | 2002 | Other |
| 15 | 1,411 | 150 | 2 | 2002 | Other |
| 16 | 889.1 | 95 | 1 | 2002 | Other |
| 17 | 607.8 | 129 | 1 | 2003 | Other |
| 18 | 1,638.7 | 322 | 3 | 2003 | Other |
| 19 | 765.7 | 119 | 1 | 2003 | Other |
| 20 | 1,132.6 | 137 | 1 | 2003 | Canada vs. China |
| 21 | 1,445.4 | 173 | 2 | 2003 | Canada vs. Sweden |
| 22 | 2,191.4 | 294 | 2 | 2003 | Canada vs. USA |
| 23 | 1,094 | 125 | 1 | 2003 | Germany vs. Russia |
| 24 | 576.8 | 81 | 1 | 2003 | Germany vs. Sweden (1) |
| 25 | 718 | 95 | 1 | 2003 | Germany vs. Sweden (2) |
| 26 | 2,440.4 | 371 | 1 | 2003 | Germany vs. USA |
| 27 | 1,412.6 | 197 | 1 | 2003 | Norway vs. USA |
| Total | 34,115.4 | 4,885 | 41 | | |


These video files are first parsed using the proposed shot detection algorithm. Because of the solid performance of the shot detection algorithm (>92% precision and >98% recall), only a little manual effort is needed to correct the shot boundaries. Then, both visual and audio features are computed and normalised for each video shot via the feature extraction processes. We include ten audio features and five visual features in each feature vector and pass the feature set to the pre-filtering stage. The candidate shots generated by pre-filtering, which contain far less noise and fewer outliers than the original data set, are then used in the data mining stage. The resulting candidate pool size after pre-filtering is 886.

3.2 Video data mining for goal shot detection

These 886 candidate shots are randomly divided into training data and testing data. In our experimental settings, ten random data splits are performed on this data set, where each split results in a training data set containing approximately two-thirds of the whole data set and a testing data set containing the remaining data.
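The random subsampling scheme can be sketched as a generator of index splits; the fixed seed here is only for reproducibility of the illustration:

```python
import random

def random_subsampling(n_shots, k=10, train_frac=2 / 3, seed=0):
    """K random splits into ~2/3 training and 1/3 testing indices,
    yielding (train_idx, test_idx) pairs."""
    rng = random.Random(seed)
    idx = list(range(n_shots))
    for _ in range(k):
        rng.shuffle(idx)
        cut = int(train_frac * n_shots)
        yield idx[:cut], idx[cut:]
```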

The decision tree is induced by the C4.5 approach from the training data set. Figure 7 illustrates an example decision tree model, where the training data set contains a total of 666 shots, including 28 goal shots, while the remaining 220 shots (13 goal shots and 207 non-goal shots) form the testing data set. Both visual and audio features are used in constructing the decision tree. In addition, we explore two effective features based on Rule 1 (specified in Section 2.2). First, for each shot, the feature sumVol stores the sum of the peak volumes of its last three-second audio track and its following shot's first three-second track (nextfirst3 for short). Second, the mean volume of its nextfirst3 is another audio feature, vol_nextfirst3. Using this constructed decision tree classifier, a total of 25 goal shots and 637 non-goal shots are correctly identified (i.e., labelled as yes and non, respectively). In other words, only three yes instances and one non instance are misclassified. Note that such misclassification is a consequence of our effort to avoid overfitting: given the noise and outliers contained in the training data, a decision tree may become too complex if it overfits the training samples.

Figure 7 An example decision tree model


This would eventually deteriorate the model's performance on unseen data. Therefore, in our model building process, the tree stops growing when a data split is not statistically significant. It should also be pointed out that the higher the level at which a feature node is located, the more impact it has on the classification process.

3.3 Classification performance

As mentioned earlier, each induced decision tree classifier is validated on the corresponding testing data set. In total, ten decision tree models are constructed and tested in our experiment. Table 3 shows the classification results for each model as well as the average performance. As can be seen from the table, the average recall and precision values reach 91.0% and 87.7%, respectively, which demonstrates the robustness of the proposed framework.

Table 3 Model performance on the testing data sets

| Model No. | No. of goals | Identified | Missed | Misidentified | Recall (%) | Precision (%) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 13 | 12 | 1 | 1 | 92.3 | 92.3 |
| 2 | 13 | 12 | 1 | 6 | 92.3 | 66.7 |
| 3 | 13 | 12 | 1 | 1 | 92.3 | 92.3 |
| 4 | 13 | 12 | 1 | 1 | 92.3 | 92.3 |
| 5 | 13 | 10 | 3 | 1 | 76.9 | 90.9 |
| 6 | 14 | 14 | 0 | 3 | 100.0 | 82.4 |
| 7 | 14 | 13 | 1 | 1 | 92.9 | 92.9 |
| 8 | 14 | 12 | 2 | 1 | 85.7 | 92.3 |
| 9 | 14 | 14 | 0 | 3 | 100.0 | 82.4 |
| 10 | 14 | 12 | 2 | 1 | 85.7 | 92.3 |
| Average (%) | | | | | 91.0 | 87.7 |

4 CONCLUSIONS AND FUTURE WORK

In this paper, we have presented a multimodal data mining framework that integrates data mining techniques with multimodal processing to extract soccer goal events from soccer videos. The proposed framework is composed of three major components, namely video parsing, data pre-filtering, and data mining. The video parsing component uses efficient visual processing to detect shot boundaries and to extract important visual features and even mid-level features. Then, the audio features are extracted at both clip-level and shot-level to capture the audio characteristics at different granularities. The use of domain knowledge in data pre-filtering cleans the original feature set and provides a reasonable data set for the data mining process. A decision tree classifier is constructed using the cleaned training data set and is tested using the cleaned test data set. Our experimental results over diverse video data from different sources have demonstrated that the integration of data mining and multimodal processing of video is a viable and powerful approach for effective and efficient extraction of soccer goal events.

In our future work, the proposed framework will be extended to detect other soccer events (e.g., corner kicks, free kicks, etc.) and applied to different application domains (e.g., mining significant events from surveillance videos, accident detection in transportation videos, etc.).

ACKNOWLEDGEMENT

For Shu-Ching Chen, this research was supported in part by NSF EIA-0220562 and HRD-0317692. For Mei-Ling Shyu, this research was supported in part by NSF ITR (Medium) IIS-0325260. For Chengcui Zhang, this research was supported in part by SBE-0245090 and the UAB ADVANCE programme of the Office for the Advancement of Women in Science and Engineering.

REFERENCES

Assfalg, J., et al. (2002) ‘Soccer highlights detection and recognition using HMMs’, Proceedings of IEEE International Conference on Multimedia and Expo, pp.825–828.

Babaguchi, N., Kawai, Y. and Kitahashi, T. (2002) ‘Event based indexing of broadcasted sports video by intermodal collaboration’, IEEE Transactions on Multimedia, Vol. 4, No. 1, pp.68–75.

Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984) Classification and Regression Trees, Wadsworth & Brooks, Monterey, CA.

Chen, S-C., Shyu, M-L. and Zhang, C. (2004) ‘Innovative shot boundary detection for video indexing’, in Deb, S. (Ed.): Video Data Management and Information Retrieval, Idea Group Publishing, accepted for publication.

Chen, S-C., Shyu, M-L., Zhang, C. and Kashyap, R.L. (2001) ‘Identifying overlapped objects for video indexing and modeling in multimedia database systems’, International Journal on Artificial Intelligence Tools, Vol. 10, No. 4, pp.715–734.

Chen, S-C., Shyu, M-L., Zhang, C., Luo, L. and Chen, M. (2003) ‘Detection of soccer goal shots using joint multimedia features and classification rules’, Proceedings of International Workshop on Multimedia Data Mining (MDM/KDD’2003), pp.36–44.

Dagtas, S. and Abdel-Mottaleb, M. (2001) ‘Extraction of TV highlights using multimedia features’, Proceedings of IEEE International Workshop on Multimedia Signal Processing, pp.91–96.

Ekin, A. and Tekalp, A.M. (2003) ‘Generic play-break event detection for summarization and hierarchical sports video analysis’, Proceedings of IEEE International Conference on Multimedia and Expo, pp.169–172.

El-Maleh, K. et al. (2000) ‘Speech/music discrimination for multimedia applications’, Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, pp.2445–2448.

Kass, G.V. (1980) ‘An exploratory technique for investigating large quantities of categorical data’, Applied Statistics, Vol. 29, pp.119–127.

Leonardi, R. and Migliorati, P. (2002) ‘Semantic indexing of multimedia documents’, IEEE Multimedia, Vol. 9, pp.44–51.


Li, B. and Sezan, I. (2003) ‘Semantic sports video analysis: approaches and new applications’, Proceedings of the International Conference on Image Processing, Vol. 1, pp.17–20.

Liu, Z., Wang, Y. and Chen, T. (1998) ‘Audio feature extraction and analysis for scene segmentation and classification’, Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, Vol. 20, Nos. 1–2, pp.61–80.

Mingers, J. (1987) ‘Expert systems – experiments with rule induction’, Journal of the Operational Research Society, Vol. 38, pp.39–47.

Quinlan, J.R. (1986) ‘Induction of decision trees’, Machine Learning, Vol. 1, pp.81–106.

Quinlan, J.R. (1993) C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA.

Snoek, C.G.M. and Worring, M. (2003) ‘Time interval maximum entropy based event indexing in soccer video’, Proceedings of IEEE International Conference on Multimedia and Expo, Vol. 3, pp.481–484.

Tovinkere, V. and Qian, R.J. (2001) ‘Detecting semantic events in soccer games: towards a complete solution’, Proceedings of International Conference on Multimedia and Expo, pp.1040–1043.

Wang, Y., Liu, Z. and Huang, J. (2000) ‘Multimedia content analysis using both audio and visual clues’, Signal Processing Magazine, Vol. 17, pp.12–36.

Wu, C., Ma, Y.F., Zhang, H.J. and Zhong, Y.Z. (2002) ‘Event recognition by semantic inference for sports video’, Proceedings of IEEE International Conference on Multimedia and Expo, pp.805–808.

Witten, I.H. and Frank, E. (1999) Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, San Francisco, CA.

Xu, M., Maddage, N.C. et al. (2003) ‘Creating audio keywords for event detection in soccer video’, Proceedings of IEEE International Conference on Multimedia and Expo, pp.281–284.

Xu, P., et al. (2001) ‘Algorithms and system for segmentation and structure analysis in soccer video’, Proceedings of IEEE International Conference on Multimedia and Expo, pp.928–931.

Zhang, C., Chen, S-C. and Shyu, M-L. (2003) ‘PixSO: a system for video shot detection’, Proceedings of the 4th IEEE Pacific-Rim Conference on Multimedia, pp.1–5.

Zhang, D.Q. and Chang, S-F. (2002) ‘Event detection in baseball video using superimposed caption recognition’, Proceedings of the 10th ACM International Conference on Multimedia, pp.315–318.

Zhong, D. and Chang, S-F. (2001) ‘Structure analysis of sports video using domain models’, Proceedings of IEEE International Conference on Multimedia and Expo, pp.182–185.