

An improved Human Action Recognition system using RSD Code generation

Geetha M, Anandsankar B, Lakshmi S Nair, Amrutha T, Amith Rajeev Department of Computer Science

Amrita School of Engineering, Amrita Vishwa Vidyapeetham Amritapuri

[email protected], [email protected], [email protected], [email protected], [email protected]

ABSTRACT

This paper presents a novel method for recognizing human actions from a series of video frames. It uses the idea of RSD (Region, Speed, Direction) code generation, which is capable of recognizing most common activities in spite of the spatiotemporal variability between subjects. The majority of existing research focuses on the upper body or makes use of hand and leg trajectories; trajectory-based approaches give less accurate results due to the variability of action patterns between subjects. In the RSD code, we give importance to three factors, Region, Speed and Direction, to detect the action; together these three factors give better results for recognizing actions. The proposed method is robust to occlusion, positional errors and missing information. The results of our algorithm are comparable to those of existing human action detection algorithms.

1. INTRODUCTION

Human action detection is becoming a basic requirement due to its immense applications in the fields of surveillance, medical diagnosis and web-based video retrieval [Moeslund et al. 2006]. However, variation in the movements of an individual, and differences in video type and camera configuration, make the detection more challenging. Some of the systems that exist today use a head-neck-torso template for the detection of human action, while others employ a kinematic model to extract the feature points.

Our proposed solution is inspired by both these methods. We use the head and torso template for human detection and a human graphical model [Andriluka et al. 2009] for feature extraction. In our method, once the feature points are identified, the associated actions are recognized. For this, the paper uses an RSD (Region, Speed, Direction) code generation approach based on the attributes region, speed and direction. The concept of a region-based code provides effective recognition in spite of the spatiotemporal variability due to differences in the style of actions between subjects, while the speed and direction codes help in improving the accuracy of the method. Most existing systems recognize only upper body actions, but in the proposed solution we consider full body actions. This helps in correctly detecting human activities, as the hand and leg points together carry more information. The temporal features, extracted as RSD codes for the hands and legs, determine the activity.

The first step is to identify the feature points, i.e. the two hand points, the two leg points and the head point, making use of the human dimensions. Considering the fact that a human action can be determined based on the location where a feature point falls, we have developed a method which divides the whole human body into eight regions. The action is detected depending on the region where the feature points fall. But sometimes the region can be the same for different actions, so the action cannot be judged based on the region code alone. Hence, to make the detected action more accurate, the speed and direction of the feature points are also taken into consideration in our analysis.

Figure 1: Video Preprocessing. The figure depicts the different stages of video preprocessing (original frame, grayscale frame, background subtraction, flood fill) used for removal of noise.

2. RELATED WORK

The proposed method can be divided into three important modules: video preprocessing, human detection and human action analysis (see Fig. 2). The related work is as follows.

There is a large body of literature on the role of shape analysis in object recognition, matching and registration. We drew important ideas on shape analysis and preprocessing from the paper Matching Shape Sequences in Video with Applications in Human Movement Analysis by Roy Chaudhary. That paper focuses on the time variation of the shape of an object using gait analysis: it treats the human silhouette as a time sequence of deforming shapes and presents parametric and nonparametric methods for handling stationary shape sequences, proposing uniform sampling along each row or uniform arc-length sampling for feature extraction from Kendall's statistical shape. It concludes that body shape is significantly important in human action detection, along with the kinematics of the human body, to improve the accuracy. In our algorithm we incorporate both shape and kinematic parameters, and our proposed method is invariant to translation, rotation and scaling.

Boosted pose estimation [Wang et al. 2010] uses the human pose estimation technique to determine the action, based on the orientation and positioning of body parts. The featured body parts are divided into boxes and, based on the orientation of these boxes, actions are determined. This can cause flaws, since the action can vary for each person, and the method cannot differentiate actions with similar movements. In this paper, we improve the prediction algorithm for detecting actions in the missed frames.

The paper Role of shape and kinematics in human movement analysis discusses the role of two important cues in human motion: shape and kinematics. The authors propose a new gait recognition algorithm that computes the distance between two sequences of shapes lying on a spherical manifold, and they conclude that models containing both shape and kinematics are required in order to perform accurate activity classification. Based on this analysis, we incorporated three important characteristics of human activities (Region, Speed, Direction) in our proposed algorithm to make it more generic and accurate.

Controlled human pose estimation from depth image streams [Zhu et al. 2008] presents a model-based Cartesian control theoretic approach for estimating human pose. The method stores the trajectories of the primary points of the human body to detect the action. Our approach extends this work by identifying the key points in the lower body as well, along with the RSD code feature developed in this paper for human action detection.

The efficiency of our proposed solution increases as the noise in the frame reduces. This is attained through effective techniques like binarization, background subtraction, erosion and dilation. Human detection from the noise-free frame is done using template matching [Zhu et al. 2008] and face detection using a Haar classifier [Viola and Jones 2001]. The human blob is transformed into a skeletal structure [Wang et al. 2010], [Barron and Kakadiaris 2000] using thinning algorithms. Our paper explains effective methods for thinning and for removing unwanted bifurcations, making end point detection easy. The Region, Speed and Direction (RSD) method used in this paper helps in the detection of human action even in complex situations (similar actions, variance in an individual's actions, body part occlusions, etc.). The results that we obtain are comparable with the existing state-of-the-art methods. The following sections explain the proposed method.

3. VIDEO PREPROCESSING

Video preprocessing, used in extracting the featured objects, is one of the important parts of our algorithm. The frames are converted to gray scale to reduce the complexity of handling three channels, followed by background subtraction, where the previous frame is subtracted from the current frame in order to remove all static objects. The frame is binarized to easily extract the contour points. The resulting frame is then subjected to a sequence of erosion, dilation and flood fill operations [Schalkoff 1989] to obtain a noise-free frame, which is very significant for our proposed algorithm. Fig. 1 shows some snapshots of the video preprocessing.
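As a rough illustration of this pipeline, the following Python/OpenCV sketch performs the same steps; the threshold value, kernel sizes and the use of simple frame differencing as background subtraction are illustrative assumptions and not parameters taken from this work.

import cv2
import numpy as np

def preprocess(prev_frame, curr_frame):
    # Convert both frames to gray scale to work on a single channel.
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)

    # Background subtraction: subtract the previous frame from the
    # current one so that static objects are removed.
    diff = cv2.absdiff(curr_gray, prev_gray)

    # Binarize the difference image (threshold value is an assumption).
    _, binary = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)

    # Erosion followed by dilation removes small speckle noise.
    kernel = np.ones((3, 3), np.uint8)
    clean = cv2.erode(binary, kernel, iterations=1)
    clean = cv2.dilate(clean, kernel, iterations=2)

    # Flood fill from a background corner and combine with the
    # original mask to close holes inside the moving blob.
    flood = clean.copy()
    h, w = clean.shape
    mask = np.zeros((h + 2, w + 2), np.uint8)
    cv2.floodFill(flood, mask, (0, 0), 255)
    filled = clean | cv2.bitwise_not(flood)
    return filled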

Figure 3: The human is segmented into 16 regions based on headSize.

Figure 4: Human detected using the head torso template

4. HUMAN DETECTION

For identification of human action, the human body needs to be segmented out from the moving objects in the video. The method we used is called "Template matching". An adaptive template is created to perform the match in order to avoid scaling issues. The proportionality of the human height, head size and head-to-torso distance is used for creating the adaptive template.

The next step is to define a human template (head and torso) using prior knowledge of human dimensions, as shown in Fig. 3. Using the head size, i.e. the bounding box height divided by eight, a template for the human consisting of a circle and a rectangle can be generated. The human template is used to detect the human in the video frame using template matching [Bradski and Kaehler 2008], [Winter 2009] and to return the location where the best match occurs. Fig. 4 shows the template and the location where the best match occurs.
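The following sketch illustrates this idea with OpenCV's matchTemplate; the exact head and torso proportions of the template and the use of normalized cross-correlation are illustrative assumptions.

import cv2
import numpy as np

def build_head_torso_template(bbox_height):
    # headSize is taken as one eighth of the bounding box height.
    head = bbox_height // 8
    template = np.zeros((head * 4, head * 2), np.uint8)
    # Circular head on top of a torso rectangle (proportions assumed).
    cv2.circle(template, (head, head // 2), head // 2, 255, -1)
    cv2.rectangle(template, (head // 2, head), (head * 3 // 2, head * 4 - 1), 255, -1)
    return template

def locate_human(binary_frame, bbox_height):
    template = build_head_torso_template(bbox_height)
    # Slide the template over the frame and keep the best match.
    result = cv2.matchTemplate(binary_frame, template, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(result)
    return max_loc, max_val   # top-left corner of the best match and its score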

Figure 2: Overview of Human Action Detection


5. HUMAN ANALYSIS

Once the human is detected, the next step is to analyze the detected human to extract the vital information for determining the human action.

5.1 Skeletonization

The human body needs to be abstracted into a model to easily track the actions. Skeletonization of the human body is apt for this case, as the movement of the entire body needs to be taken into consideration; it is one of the most crucial steps in our paper. For skeletonization, the Zhang-Suen algorithm has been used. This thinning algorithm repeatedly removes boundary points from the image until the image becomes irreducible. Not every boundary point qualifies for deletion: first, the 8-neighbourhood pixels are examined, and only if the 8-neighbourhood satisfies certain conditions is the boundary point deleted. The deletion of points takes place in two iterations. In the first iteration, southeast boundary points and northwest corner points are deleted; in the second, northwest boundary points and southeast corner points are removed. Let p1, p2, ..., p8 be the neighbours of p as shown in Fig. 5, let B(p) be the number of non-zero 8-neighbours of p, and let A(p) be the number of zero-to-one transitions in the ordered sequence p1 p2 ... p8 p1. If p is a contour point and its 8-neighbourhood satisfies the following conditions, then p is subjected to deletion.

(a) 2 ≤ B(p) ≤ 6

(b) A(p) = 1

(c) p1·p3·p5 = 0

(d) p3·p5·p7 = 0

Condition (a) ensures that end points are preserved. Condition (b) prevents the deletion of a point that would split the image into two. Conditions (c) and (d) select the southeast boundary points and northwest corner points for deletion in the first iteration. In the second iteration, a contour point is deleted if it satisfies (a) and (b) as well as the following two conditions.

(c') p1·p3·p7 = 0

(d') p1·p5·p7 = 0

After skeletonization, a frame as shown in Fig.7 (a) is obtained. More detail on Zhang-Suen algorithm is available in [Ritter and Wilson 1996].
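A compact Python sketch of the Zhang-Suen thinning iterations described above is given below; it follows the standard formulation of the algorithm, with neighbours p1..p8 numbered clockwise starting from the pixel above p, and assumes a binary image with 1 for foreground.

import numpy as np

def zhang_suen_thinning(img):
    # img: 2-D array with 1 for foreground pixels and 0 for background.
    # Returns the thinned (irreducible) image.
    img = img.copy().astype(np.uint8)
    changed = True
    while changed:
        changed = False
        for step in range(2):           # the two sub-iterations
            to_delete = []
            rows, cols = img.shape
            for r in range(1, rows - 1):
                for c in range(1, cols - 1):
                    if img[r, c] != 1:
                        continue
                    # 8 neighbours, clockwise (the paper's p1..p8).
                    p = [img[r - 1, c], img[r - 1, c + 1], img[r, c + 1],
                         img[r + 1, c + 1], img[r + 1, c], img[r + 1, c - 1],
                         img[r, c - 1], img[r - 1, c - 1]]
                    B = sum(p)                                   # non-zero neighbours
                    A = sum((p[i] == 0 and p[(i + 1) % 8] == 1)
                            for i in range(8))                   # 0 -> 1 transitions
                    if not (2 <= B <= 6 and A == 1):
                        continue
                    if step == 0:   # south-east boundary / north-west corner points
                        if p[0] * p[2] * p[4] == 0 and p[2] * p[4] * p[6] == 0:
                            to_delete.append((r, c))
                    else:           # north-west boundary / south-east corner points
                        if p[0] * p[2] * p[6] == 0 and p[0] * p[4] * p[6] == 0:
                            to_delete.append((r, c))
            for r, c in to_delete:
                img[r, c] = 0
            if to_delete:
                changed = True
    return img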

Figure 5: 8-neighbours around contour pixel p

Figure 6: The skeleton image shows bifurcation points (red) and endpoints (black).

5.2 Pruning of Unwanted Bifurcations

For removing the unwanted bifurcations, we have developed a customized pruning algorithm. The algorithm commences with finding the topmost point (refer to Algorithm 1), i.e. the headPoint (assuming that the first point detected is the head). After finding the headPoint, the skeleton is traversed; Algorithm 2 is used for pruning the unwanted bifurcations. According to the algorithm, for every current pixel its 8 neighbouring pixels are checked. If a neighbouring pixel's value is 0, then the corresponding flag is set (there are eight flags, each representing one neighbouring pixel of the current pixel). The sum of the flag values is calculated and stored in total, as shown in the algorithm, and the following conditions are evaluated:

(a) If total = 1: there is no bifurcation, i.e. the path is straight (Fig. 5). This pixel is made the current pixel and the traversal proceeds.

(b) If total > 1: there is a bifurcation. The linePoint is the previously detected bifurcation point, which is initially set to the headPoint. A line from the linePoint to the current point is drawn and the current point is stored as the new linePoint. The algorithm then traverses through the other pixels whose corresponding flags are set.

(c) If total = 0: it is an end point. The Euclidean distance of the current point from the linePoint is calculated. If the computed distance satisfies the minimum threshold, a line joining the current point and the linePoint is drawn; else the end point is discarded.

5.3 End Point Detection

The pruned skeleton is drawn on a new frame and the headPoint of the skeleton is fed as input to Algorithm 2 to determine the end points. As described above, the 8 neighbouring pixels are checked and flags are set for the pixels whose value is 0. The sum of the flags is calculated and stored in total (a Python sketch of this traversal is given after the list below).

(a) If total = 1: no bifurcation; traversal continues through the skeleton by making the pixel whose flag is set the current pixel.

(b) If total > 1: there is a bifurcation. This pixel is stored in an array (to handle loops) and the function is called recursively, making each pixel whose corresponding flag is set the current pixel.

(c) If total = 0 and the pixel is not in the array: the pixel is marked as an end point and is sent to the classification of points module.
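A minimal Python sketch of this end point traversal is given below; it replaces the recursion and array described above with an iterative traversal and a visited set, and it assumes skeleton pixels have value 0 (as in Algorithm 1) and that the skeleton does not touch the image border.

def detect_end_points(skeleton, head_point):
    # skeleton: 2-D array where skeleton pixels are 0 (black) and the
    # background is non-zero. head_point: (row, col) of the head pixel.
    neighbours = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                  (0, 1), (1, -1), (1, 0), (1, 1)]
    visited = set()          # plays the role of the array used to handle loops
    end_points = []
    stack = [head_point]
    while stack:
        r, c = stack.pop()
        if (r, c) in visited:
            continue
        visited.add((r, c))
        # "total" = number of unvisited skeleton pixels among the 8 neighbours.
        next_pixels = [(r + dr, c + dc) for dr, dc in neighbours
                       if skeleton[r + dr, c + dc] == 0
                       and (r + dr, c + dc) not in visited]
        total = len(next_pixels)
        if total == 0:
            # No way forward: this pixel is an end point.
            end_points.append((r, c))
        else:
            # total = 1 is a straight path, total > 1 a bifurcation;
            # in both cases traversal continues along every branch.
            stack.extend(next_pixels)
    return end_points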

input : A video frame, bounding box points pt1 and pt2
output: headPoint

for every column y, y = pt1.y to pt2.y do
    for every row x, x = pt1.x to pt2.x do
        if pixel value = 0 at (x, y) then
            store (x, y) in headPoint
        end
    end
end
return headPoint;

Algorithm 1: To detect headPoint
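For reference, a direct Python rendering of Algorithm 1 could look as follows; it returns the first (topmost) pixel with value 0 inside the bounding box, which is the intended headPoint, rather than the last match the pseudocode literally stores.

def detect_head_point(frame, pt1, pt2):
    # frame: 2-D array; pt1, pt2: (x, y) corners of the bounding box.
    # Scan row by row from the top and return the first skeleton pixel.
    for y in range(pt1[1], pt2[1] + 1):        # rows, from the top of the box
        for x in range(pt1[0], pt2[0] + 1):    # columns
            if frame[y, x] == 0:
                return (x, y)                  # headPoint as (column, row)
    return None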


6. HUMAN ACTION ANALYSIS

6.1 Classification of Points

As shown in the block diagram, the inputs to this algorithm are the end points that were detected. Since the body is moving, it is hard to determine the action, so the feature points are analyzed with respect to the bounding box. The change of origin for the feature points is calculated using:

new_x = x − pt1.x    (1)

new_y = y − pt1.y    (2)

where (x, y) is a point with respect to the original frame, pt1 and pt2 are the coordinates of the bounding box, and (new_x, new_y) is the point with respect to the bounding box.

Figure 7: Result of Human Analysis. The detected human blob undergoes thinning and pruning algorithms to detect the end points (after skeletonization, after pruning, endpoints detected).

For determining whether a point is a handPoint or a legPoint, the body bifurcations are used. The pruned skeleton has two main bifurcations: the shoulder-point and waist-point bifurcations. The end points traced from the shoulder-point bifurcation are hand points and those traced from the waist-point bifurcation are leg points. These points are then classified as hand1, hand2, leg1 and leg2 points. There are three cases for classification.

Figure 8: (a) Third handPoint detected, (b) newPoint closer to red region, (c) replacing with average.

Case 1: If it is the first frame, then the first point detected is directly classified as hand1 and the second point detected as hand2. But if one more point is detected in the hand region in the same frame (due to some noise), then the distances from this point to hand1 and hand2 are calculated to find which handPoint the newPoint is nearer to. If it is nearer to hand1, then the average of hand1 and the new point replaces hand1 (see Fig. 8). The same procedure is followed for determining the leg points. For all other frames Algorithm 3 is used; this algorithm deals with all cases.

Figure 9: Two handPoints detected in the next frame.

Case 2: In the next frame, two hand points may be detected one after the other (see Fig. 9). Assume the black circle is the handPoint that was classified as hand1 in the previous frame, the blue circle is the newPoint detected first, and the red circle is the point detected next. Let y be the error value calculated between hand1 and the blue point, and let x be the error value calculated between hand1 and the red point. Initially only the blue point is detected, so its error value from both hand1 and hand2 is calculated using Eqn. 5.

Assume the blue point is nearer to hand1 than to hand2, so it is classified as hand1. hand2 has not been detected up to this frame, so it is given a negative value by default. When the red point is detected, it is evident from Fig. 9 that x is less than y, i.e. the red point is closer to hand1 than the blue point. In fact, it was the red point that should have been classified as hand1 and the blue point as hand2; however, since the blue point was detected first, it was wrongly classified as hand1. In order to avoid such situations (case 1 in Algorithm 3), the error values of the red point and the blue point are checked again. If the error value of the red point is less than that of the blue point, then a misclassification

Input: headPoint x and y, linePoint, oldFrame, newFrame
Output: newFrame containing the pruned skeleton

initialize flag1 ... flag8 to 0;
check the 8 neighbours of (x, y);
if pixel value of any of the 8 neighbours = 0 then
    set the respective flag = 1
end
total = sum of the flag values;
if total = 0 then
    find distance using (x, y) and linePoint;
    if distance > threshold then
        exclude the point
    end
    draw a line using (x, y) and linePoint;
else if total = 1 then
    recursively call this function, passing the coordinate position associated with the flag whose value is set
else
    draw a line using (x, y) and linePoint;
    set linePoint as (x, y);
    recursively call this function, passing the coordinate positions associated with the flags whose values are set
end

Algorithm 2: Pruning of Unwanted Bifurcation


can be inferred. So, hand1 is reclassified as the red point and the blue point becomes hand2.

Case 3: handPoints may not be detected in some frames. To understand this more clearly, consider an example. Let the current frame be 15, let hand1 have last been detected in frame 10 and hand2 in frame 12. It is then not obvious whether the newPoint detected in frame 15 is hand1 or hand2. In order to avoid such ambiguities, the probable distance at which the new handPoint is expected to be detected is calculated (using Eqn. 3). In Fig. 10 the black circle shows the probable distance for hand1 and the yellow circle shows the probable distance for hand2.

(3)

Let the point first detected in the current frame be the blue point. The error values of the blue point from hand1 and hand2 are calculated and named errorb1 and errorb2 respectively. Since errorb1 is less than errorb2, the blue point is classified as hand1. Similarly, if the red point is the next point detected, then its error values from both handPoints are calculated and the red point is classified accordingly.

Figure 10: Calculation of error value using probable region method for classification of hands.

Algorithm 3 deals with all the cases mentioned above. In this algorithm the number of missed frames is calculated using Eqn. 4 (frames may be missed due to occlusion of handPoints or legPoints). The number of the frame in which the hand point or leg point was last detected is stored in last_detected_frame.

frames_missed = current_frame_no − last_detected_frame    (4)

Using this, the error value is estimated to determine whether the current point corresponds to hand1 or hand2. The error value is calculated using Eqn. 5.

(5) Where,

(6)

where pre_hand_speed is the hand_speed of the previous frame and dist is the distance between the current point and the hand point (i.e. hand1 or hand2).

In addition to the primary points (hand and leg end points), secondary points, i.e. the elbow and knee points, can also be detected. The secondary points are detected using the maximum curvature detection method. Here, the point that shows the maximum slope change between the hand end point and the shoulder point is considered the elbow point. If there is no remarkable slope change between the hand point and the shoulder point, i.e. the line between these two points is practically straight, then the midpoint of the hand point and the shoulder point is taken as the secondary point. The same method is used to find the knee points using the waist and leg end points.

6.2 RSD code generation

One of the highlights of this paper is the RSD (Region, Speed, and Direction) code generation. The human actions are determined based on the region where the feature point falls. The idea of a region code provides more flexibility for classification when compared to the existing trajectory-based methods. Trajectory-based methods may result in misclassifications when different subjects exhibit spatiotemporal variability in their actions. A region-based approach works better in that context, since it gives approximate information about the trajectory shape and thus accommodates variability to a good extent. Sometimes the action cannot be judged based on the region code alone: for example, in the case of walking and running the regions are almost the same, so we have to consider another factor, Speed, since the speed of walking is less than that of running. To make the action detection more accurate, one more factor is considered, the Direction of the feature points. The three features are explained in the following subsections.

Figure 13: RSD code structure

6.2.1 Region code generation

For the detection of human action, the region in which a particular feature point falls is determined. The regions are classified as shown in Fig. 11.

(a) Region1 (a and b) = headSize

(b) Region2 (c and d) = Region1 + 2(headSize)

(c) Region3 (e and f) = Region2 + 3(headSize)

(d) Region4 (g and h) = Region3 + 2(headSize)

For example, in the case of walking, the handPoints are found in regions e and f and the legPoints in regions g and h, whereas in the case of running the handPoints fall in regions c, d, e and f and the legPoints in regions e, f, g and h. But in the case of running, sometimes the leg points fall in the same regions as for walking. Hence, as mentioned before, another factor, Speed, has to be considered.

The bounding box outlining the human can further be divided to increase the count of regions to multiples of eight. Increasing the count will help the system identify more specific and finer actions and will improve the accuracy of the system while detecting the action.
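A small sketch of the region code assignment is shown below; the cumulative headSize boundaries follow Section 6.2.1, while the left/right labelling of each region pair and the use of the bounding box mid-line to separate them are assumptions.

def region_code(point, bbox_top_left, bbox_width, head_size):
    # point: (x, y) of a feature point in the frame.
    # bbox_top_left: (x, y) corner of the bounding box; head_size: box height / 8.
    x = point[0] - bbox_top_left[0]
    y = point[1] - bbox_top_left[1]

    # Cumulative vertical boundaries of the four horizontal bands
    # (1, 1+2, 1+2+3 and 1+2+3+2 headSizes, as in Section 6.2.1).
    bands = [(1 * head_size, ('a', 'b')),
             (3 * head_size, ('c', 'd')),
             (6 * head_size, ('e', 'f')),
             (8 * head_size, ('g', 'h'))]

    mid_x = bbox_width / 2.0   # left/right split (assumed convention)
    for boundary, (left_label, right_label) in bands:
        if y < boundary:
            return left_label if x < mid_x else right_label
    return None                # point lies outside the segmented body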

6.2.2 Speed code generation

The speeds of both the handPoints and legPoints are calculated using Eqn. 6, and the speed of the body is calculated as

body_speed = dist / max_human_length    (7)

where dist is the distance between the current waistPoint (i.e. the middle point of the human skeleton) and the previous waistPoint, and max_human_length is the maximum bounding box length.


Figure 11: Body segmentation into 8 regions.

Figure 12: The figure shows the possible directions of body movement. The directions are represented by numbers for easy calculation.

6.2.3 Direction code generation

The motion types for a human action can basically be considered as to and fro, straight and constant. The first step is to determine the direction of motion, followed by the type. Fig. 12 gives the possible directions in which a point can move. To determine the direction in which a feature point has moved, the changes deltaX and deltaY are calculated:

deltaX = current_x − previous_x    (8)

deltaY = current_y − previous_y    (9)

where (current_x, current_y) and (previous_x, previous_y) are the positions of the featurePoint in the current and previous frames. After calculating deltaX and deltaY the direction is determined, i.e. if deltaX > 0 and deltaY > 0 then the direction is 2, and if deltaX < 0 and deltaY = 0 then the direction is 5 (as shown in Fig. 12).
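Only two of the direction numbers in Fig. 12 are fixed by the text (deltaX > 0, deltaY > 0 gives 2 and deltaX < 0, deltaY = 0 gives 5), so the mapping in the sketch below is a consistent completion of the figure, not the paper's exact table.

def direction_code(current, previous):
    # current, previous: (x, y) positions of a feature point in two
    # consecutive frames. Returns a direction number 1-8 (Fig. 12),
    # or 0 if the point has not moved.
    delta_x = current[0] - previous[0]
    delta_y = current[1] - previous[1]

    sign = lambda v: (v > 0) - (v < 0)
    # Assumed completion of Fig. 12: only (+,+) -> 2 and (-,0) -> 5
    # are stated in the paper.
    table = {(1, 0): 1, (1, 1): 2, (0, 1): 3, (-1, 1): 4,
             (-1, 0): 5, (-1, -1): 6, (0, -1): 7, (1, -1): 8}
    return table.get((sign(delta_x), sign(delta_y)), 0)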

Once the direction is determined, the next step is to find the type of motion. This is explained in the following section.

6.3 Pruned RSD Code Generation

Figure 14: Pruned RSD Code Structure

Once the RSD code values are determined, they are stored in a structure as shown in Fig. 13. In Fig. 14 the head portion comprises a field, deviation, which is the change in angle of the headPoint from the midpoint of the bounding box, i.e. the headOrigin. The headOrigin is calculated as

(10)

(12)

where width_rect is the width of the bounding box and height_rect is the height of the bounding box. If the deviation is more than a threshold value, the action is considered as bending. The deviation is calculated as

(11)

where deviate_distance is given by

(13)

and deviate_x and deviate_y are calculated as

(14)

(15)

Once the RSD code is generated for each frame, the next step is to create the pruned RSD code, considering all frames in which a human was detected. The pruned RSD code is made after analyzing the entire set of RSD codes. The structure of the pruned RSD code is shown in Fig. 14.

In the pruned RSD code, the first field identifies whether the person is standing straight or is facing sideways. This is done using Haar classifiers, which help us to identify the headPoint in a frame. The Haar classifier output is considered only if the head is detected in more than 75% of the frames in which the body has been detected. There may be cases in which the Haar classifier misclassifies one or more points as the headPoint because of some disturbance. In such cases, the average of the midpoints of the head bounding rectangles is found; each frame is then analyzed to check which rectangle has its midpoint closest to the average value, and that rectangle is taken as the headPoint. In the case of constant actions (with no body movement), the head is identified using the Haar classifier, but the body portion will have been removed by the background subtraction. So, in order to create a body-like figure to classify the action, we use the head detected by the Haar classifier and, using the headSize, a body template is created to compensate for the removal of the actual body.

Input: Endpoints x and y, BoundingBox coordinates pt1 and pt2
Output: Classified Endpoints

calculate frames missed, dist and error value for both handPoints;
if error value is less for first hand then
    switch no_of_hands do
        case 0
            classify point as hand1 and find its speed
        endsw
        case 1
            if hand1 is misclassified then
                set hand2 as hand1, classify the current point as hand1 and update both hands' speeds
            else
                classify point as hand2 and find its speed
            end
        endsw
        case 2
            classify point as hand1 and find its speed
        endsw
        case 3
            set hand1 as average of current point and previously classified hand1
        endsw
    endsw
end

Algorithm 3: Classification of hands and legs


The next fields are for the body motion type and the body displacement. The body motion type is identified from the body direction fields of the individual RSD codes. The motion type is calculated as follows: the direction value of the body in the current RSD code is subtracted from the direction value in the previous RSD code. If the difference is greater than 4, it is subtracted from 8 so that it falls in the range 0-4. This is done for each frame and the average of the direction differences, the motion type value, is determined. The motion type value lies in the range 0-4, and based on it the motion types are classified mainly into three:

(a) 0 to 1: straight type motion

(b) 1.5 to 4: to and fro type motion

(c) 1 to 1.5: constant type motion

For example, if the directions obtained are 2-1-1-8-7, then after subtraction the sequence 1-0-7-1 is obtained. Since the motion type value range is 0-4, any value greater than 4 is normalized by subtracting it from 8; here the difference of 7 (between directions 1 and 8) is greater than 4, so 8 − 7 = 1 is used instead. After normalization the sequence is therefore 1-0-1-1, and the motion type value is (1+0+1+1)/4 = 0.75, so the type of motion is straight. The same method is used to classify the motion types of the hand and leg, which can also be used for identifying the actions.
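The worked example above can be reproduced with the following short sketch, which computes the motion type value from a sequence of per-frame body directions.

def motion_type(directions):
    # directions: list of per-frame body direction codes from the RSD codes.
    diffs = []
    for prev, curr in zip(directions, directions[1:]):
        d = abs(curr - prev)
        if d > 4:               # wrap around the 8-direction circle
            d = 8 - d
        diffs.append(d)
    value = sum(diffs) / len(diffs)     # average motion type value
    if value <= 1:
        return value, "straight"
    if value < 1.5:
        return value, "constant"
    return value, "to and fro"

# Example from the text: directions 2-1-1-8-7 -> 1-0-1-1 -> 0.75 (straight).
print(motion_type([2, 1, 1, 8, 7]))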

The next factor in the pruned RSD code is the speed. It has two components, speed_per_frame and displacement, which are calculated as follows.

(16)

where exist_frame is the number of frames in which the human is detected, and

(17)

where startPoint and endPoint are the body points (the waistPoint of the human) detected in the first frame and the last frame. By looking at the speed_per_frame, actions can be distinguished from each other, whereas displacement helps to identify whether the action is up-down or left-right: after calculating the displacement, the direction is determined (see Section 6.2.3). If the displacement is from 1 to 5, 4 to 8 or 2 to 6, it is a left-right motion (e.g. walking, running), else it is an up-down motion (e.g. jumping). The body displacement value is divided by a factor calculated by multiplying the maximum number of frames in which the human was detected by the maximum size of the human detected in the particular video. This is done because the size of the human body can differ from frame to frame, which would introduce error in the displacement value, as a tall person would otherwise show a larger displacement than a short person. The value is then multiplied by 100 to bring it into the range 0-10.

The next field in the pruned RSD code is the regions of occurrence of the hand. This field exists for both hand1 and hand2. To obtain the regions of hand1, the RSD code of hand1 in each frame is analyzed. A linked list containing the regions in which the hand falls, along with their counts of occurrence, is generated. The counts are converted into percentages by dividing by the maximum number of frames in which either hand1 or hand2 was detected. The regions are found in the same way for hand1, hand2, leg1 and leg2. (For finding the percentage of occurrence of a leg in a particular region, the count is divided by the maximum number of frames in which either leg1 or leg2 was detected.)

percentage_i = (region_no_i / max_detected_frames) × 100    (18)

where region_no_i is the number of times the featurePoint (handPoint or legPoint) lies in region i (a, b, c, d, e, f) and max_detected_frames is the maximum number of frames in which the corresponding feature (hand or leg) was detected.

The RSD code generated for each frame helps in determining the specific set of points that the hands and legs have traced. This set of points determines the path followed by the primary points (hands and legs). From this set of points, the key points that represent the path are extracted using the maximum curvature detection method. In this method, the traced points are grouped into clusters using standard clustering techniques, and the number of elements in each cluster is determined based on the total number of points in the set. Then the centroids of three consecutive clusters are taken, and the differences between the tangential slopes of the first and second centroids and of the second and third centroids are calculated. If the difference in tangential slope is higher than a threshold value, the middle centroid is marked as a key point; otherwise the middle centroid is skipped and the next centroid is considered. The identified key points, when joined using straight lines, represent the geometric structure of the path traced. This geometric structure is used to determine the action accurately. The method is most useful when the primary points of two different actions lie in the same region, and hence it widens the range of detectable actions.
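A possible sketch of this key point extraction is given below; the fixed cluster size, the use of angles instead of raw slopes, and the turning-angle threshold are implementation assumptions rather than values from the paper.

import numpy as np

def key_points(path, cluster_size=5, angle_threshold_deg=30.0):
    # path: list of (x, y) points traced by a primary point over the frames.
    # Consecutive points are grouped into fixed-size clusters (a simple
    # stand-in for the clustering step) and a cluster centroid is kept as a
    # key point when the tangential direction changes sharply around it.
    pts = np.asarray(path, dtype=float)
    centroids = [pts[i:i + cluster_size].mean(axis=0)
                 for i in range(0, len(pts), cluster_size)]

    keys = [tuple(centroids[0])]                  # keep the starting point
    for a, b, c in zip(centroids, centroids[1:], centroids[2:]):
        angle1 = np.arctan2(b[1] - a[1], b[0] - a[0])
        angle2 = np.arctan2(c[1] - b[1], c[0] - b[0])
        change = np.degrees(abs(angle2 - angle1))
        if change > 180:                          # keep the smaller turning angle
            change = 360 - change
        if change > angle_threshold_deg:
            keys.append(tuple(b))                 # sharp turn: mid centroid is a key point
    keys.append(tuple(centroids[-1]))             # keep the end point
    return keys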

7. DECISION MAKING

Once the pruned RSD code is generated, the next step is decision making. Various methods can be used for decision making, such as using a neural network [Alpaydin 2004], [Barber 2012] to train and detect the action, making trajectories using feature points, or using a decision tree [Han et al. 2006] or CART

Table 2: Decision tree implementation details


[Skinner 1999], [Inmon 1996]. The method used in this paper employs a decision tree for action detection. The decision tree [Magerman 1995] was implemented using the conditions given in Table 2. By analyzing the decision tree, the action being shown in the video is determined. For actions like walking and running, in which both hands and legs move with equal prominence, the regions in which the hands and legs (mainly the legs) are moving are considered, along with the body speed and the motion types of the body, hand and leg. For actions like hand waving, clapping and boxing, in which leg movement is not as prominent as hand movement, the hand movement is taken into consideration, i.e. the regions in which the hands are moving, along with other factors such as the hand motion type and the body motion type. For actions like jumping, in which hand movement is minimal, the leg movement, its regions and type of movement, and the body movement type are considered to determine the action. For bending actions, importance is given to the head deviation to analyze what action is shown in the video.

The classification of the action in the decision tree is based on nine key features: the regions of the hands and of the legs, body speed, straight or side-wise motion, motion type, hand motion type, leg motion type, head deviation and body direction. The possible range of values each feature can have for a particular action is depicted in Table 2. The feature 'region' represents the regions through which the hands or legs have moved. Each action has a minimum percentage value for the regions in which it may appear, and only the regions that cross this value are considered. Considering the case of running,

Table 1: Expected sequence for correctly classified video.

ACTION      | STRAIGHT/SIDE | BODY SPEED | LEG REGION | HAND REGION | MOTION TYPE | BODY DIRECTION | LEG MOTION TYPE | HAND MOTION TYPE
WALKING     | 0   | 2 | 1   | 1 | 2 | 1 | 1 | 1
RUNNING     | 0   | 4 | 2   | 2 | 3 | 1 | 1 | 1
JOGGING     | 0   | 3 | 2   | 3 | 4 | 1 | 1 | 1
BOXING      | 0   | 1 | 3,1 | 3 | 1 | 0 | 0 | 1
CLAPPING    | 1   | 1 | 3,1 | 4 | 1 | 0 | 0 | 1
HAND WAVING | 1   | 1 | 3,1 | 5 | 1 | 0 | 0 | 1
JUMPING     | 0,1 | 1 | 3,1 | - | 5 | 2 | 0 | 0

the hands can fall into four regions, mainly c, d, e and f, but regions c and e have a higher probability of occurrence than the other two, so c and e are given a higher percentage value than the others. Each feature is given a specific numeric value for easy manipulation in the decision tree. These numeric values are stored in a numeric array for each feature of an action; the values stored in the array can uniquely identify the specified actions. Table 1 shows these numeric arrays. Each feature is also given a specific weight, as described in Table 3. The weight of each feature of an action is used, through Eqn. 19, to determine the average internal accuracy. This is explained further in the next section.
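As a rough illustration of how the numeric arrays of Table 1 could drive the decision making, the sketch below feeds them to scikit-learn's DecisionTreeClassifier; the reduction of multi-valued entries such as "3,1" to their first value and the omission of the jumping row (whose extracted values are ambiguous) are simplifications, and the paper itself does not specify this implementation.

from sklearn.tree import DecisionTreeClassifier

# Feature order: straight/side, body speed, leg region, hand region,
# motion type, body direction, leg motion type, hand motion type (Table 1).
training_vectors = [
    [0, 2, 1, 1, 2, 1, 1, 1],   # walking
    [0, 4, 2, 2, 3, 1, 1, 1],   # running
    [0, 3, 2, 3, 4, 1, 1, 1],   # jogging
    [0, 1, 3, 3, 1, 0, 0, 1],   # boxing
    [1, 1, 3, 4, 1, 0, 0, 1],   # clapping
    [1, 1, 3, 5, 1, 0, 0, 1],   # hand waving
]
labels = ["walking", "running", "jogging", "boxing", "clapping", "hand waving"]

tree = DecisionTreeClassifier()
tree.fit(training_vectors, labels)

# A pruned RSD code reduced to the same numeric array can then be classified:
print(tree.predict([[0, 4, 2, 2, 3, 1, 1, 1]])[0])   # -> "running"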

8. EXPERIMENTAL RESULT

The bar chart (Fig. 15) shows the average internal accuracy over the total videos taken. For the calculation in the bar chart we have allotted a weight to each feature contributing towards the action recognition. Each action is detected based on a set of features (shown in Table 2).

Figure 15: Internal accuracy (in percentage) of the algorithm in detecting each action (Walking, Running, Jogging, Boxing, Clapping).

If the input video sequence has the same set of features, then the action is correctly classified with full accuracy; if some of the features are missing or wrong, then the weights are allotted accordingly. The features, along with the weights allotted to them, are shown in Table 3. The weight indicates, out of the seven actions, for how many the feature values are unique. For example, for the hand, the sequence of regions in which the hand falls is unique for every action, so the weight allotted to the hand region is 1. If the action is, for example, running, then the probable regions of the legs are e, f, g and h, but in the input video the observed sequence is f, g and h; the weight allotted to the leg is 3, so the internal accuracy contribution of the leg is 3 × (40+10+40)/100 = 2.7. Similarly, the internal accuracy is calculated for each and every feature.

(19)

Table 3: Weight allocated to each feature.


Fig. 16 shows how many videos are correctly classified out of the given videos. As can be inferred, the rate of misclassification is very low and can be further reduced by improving the preprocessing steps.

Figure 16: Count of correctly classified videos (percentage of correctness for Walking, Running, Jogging, Boxing, Clapping).

The confusion matrix given in Table 4 shows the accuracy of actions being correctly classified.

Table 4: Confusion Matrix

            | Walking | Running | Jogging | Boxing | Clapping | Accuracy
Walking     | 8       | 0       | 0       | 0      | 0        | 1.0
Running     | 0       | 9       | 0       | 0      | 0        | 1.0
Jogging     | 0       | 1       | 8       | 0      | 0        | 0.9
Boxing      | 0       | 0       | 0       | 7      | 0        | 1.0
Clapping    | 0       | 0       | 0       | 2      | 6        | 0.8
Reliability | 1.0     | 0.9     | 1.0     | 0.8    | 1.0      |

In this case, accuracy is the number of correctly classified videos divided by the total number of videos of the particular action, and reliability is the number of correctly classified videos divided by the total number of classifications into that action. The algorithm detects almost all actions correctly, so the accuracy and reliability are high. Therefore, we can say our work is comparable with the existing methods.
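The accuracy and reliability values of Table 4 can be re-derived from the row and column sums of the confusion matrix, as the short check below shows (the values match Table 4 to one decimal place).

import numpy as np

actions = ["Walking", "Running", "Jogging", "Boxing", "Clapping"]
confusion = np.array([[8, 0, 0, 0, 0],
                      [0, 9, 0, 0, 0],
                      [0, 1, 8, 0, 0],
                      [0, 0, 0, 7, 0],
                      [0, 0, 0, 2, 6]])

# Accuracy: correctly classified videos / total videos of that action (row sum).
accuracy = confusion.diagonal() / confusion.sum(axis=1)
# Reliability: correctly classified / total classifications into that action (column sum).
reliability = confusion.diagonal() / confusion.sum(axis=0)

for name, acc, rel in zip(actions, accuracy, reliability):
    print(f"{name:8s}  accuracy={acc:.1f}  reliability={rel:.1f}")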

9. CONCLUSION

Existing human action detection methods usually consider only the upper body or require the use of depth images or 3D cameras to detect human actions. Moreover, the accuracy of such algorithms varies with the direction the human is facing.

In this work, the full human posture, speed and direction are considered to generate the RSD code. The use of the RSD code helps to determine the action irrespective of the individual style of different people. It also helps to differentiate between similar actions.

Although eight regions are sufficient for obtaining comparable results, in future we would like to increase the number of regions to sixteen in order to increase the efficiency, and also to include trajectories to get a clearer understanding of the direction in which a human is moving.

10. REFERENCES

[1] Ethem Alpaydin. 2004. Introduction to machine learning. The MIT Press.

[2] Mykhaylo Andriluka, Stefan Roth, and Bernt Schiele. 2009. Pictorial structures revisited: People detection and articulated pose estimation. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 1014–1021.

[3] David Barber. 2012. Bayesian reasoning and machine learning. Cambridge University Press.

[4] Heng Wang, Alexander Klaser, Cordelia Schmid, and Cheng-Lin Liu. 2013. Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision (2013).

[5] Carlos Barron and Ioannis A Kakadiaris. 2000. Estimating anthropometry and pose from a single image. In Computer Vision and Pattern Recognition, 2000. Proceedings. IEEE Conference on, Vol. 1. IEEE, 669–676.

[6] Gary Bradski and Adrian Kaehler. 2008. Learning OpenCV: Computer vision with the OpenCV library. O’Reilly Media, Incorporated.

[7] Claudette Cedras and Mubarak Shah. 1995. Motion-based recognition a survey. Image and Vision Computing 13, 2 (1995), 129–155.

[8] Chi-hau Chen. 2009. Handbook of pattern recognition and computer vision. World Scientific.

[9] Jiawei Han, Micheline Kamber, and Jian Pei. 2006. Data mining: concepts and techniques. Morgan kaufmann.

[10] William H Inmon. 1996. The data warehouse and data mining. Commun. ACM 39, 11 (1996), 49–50.

[11] David M Magerman. 1995. Statistical decision tree models for parsing. In Proceedings of the 33rd annual meeting on Association for Computational Linguistics. Association for Computational Linguistics, 276–283.

[12] Thomas B Moeslund, Adrian Hilton, and Volker Krüger. 2006. A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding 104, 2 (2006), 90–126.

[13] Gerhard X Ritter and Joseph N Wilson. 1996. Handbook of computer vision algorithms in image algebra. Vol. 1. Citeseer.

[14] Robert J Schalkoff. 1989. Digital image processing and computer vision. Wiley, New York.

[15] David C Skinner. 1999. Introduction to decision analysis: a practitioner's guide to improving decision quality. Probabilistic Publishing.

[16] Camillo J Taylor. 2000. Reconstruction of articulated objects from point correspondences in a single uncalibrated image. In Computer Vision and Pattern Recognition, 2000. Proceedings. IEEE Conference on, Vol. 1. IEEE, 677–684.

[17] Paul Viola and Michael Jones. 2001. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, Vol. 1. IEEE, I–511.

[18] Li Wang, Li Cheng, Tuan Hue Thi, and Jian Zhang. 2010. Human action recognition from boosted pose estimation. In Digital Image Computing.