Image Processing and Computer Vision Project Fall - 2014
Abstract
The way humans interact with computers and devices is soon to change with the help of gesture recognition, one of the trending technologies that will revolutionize human computer interaction in the future. This project aims to use hand gestures captured from a camera for a live painting application called Doodle. Detecting the human hand in the scene and estimating its motion in order to produce the painting output as the hand moves are the major parts of this project. Important computer vision techniques are used, such as extraction of the region of interest using background subtraction, skin color segmentation, detection of the convex hull and convexity defects, and the Camshift algorithm for motion tracking. The user is provided with a panel containing different colors and an eraser from which to choose options while drawing on the screen.
Keywords
Background subtraction, Skin Segmentation, Camshift,
Convex hulls, Convexity defects, Motion detection
I. INTRODUCTION
Human Computer Interaction has emerged as an area of spectacular research activity and has introduced a new dimension to the science of computing. One of the most universally known examples is the graphical user interface used in Windows 95 by Microsoft. As computers have become universally prevalent, attention has shifted to incorporating techniques that will enhance the end-user experience. This is one of the fundamental reasons for the increase in popularity, research activity and commercial products related to Human Computer Interaction.
HCI can be defined informally as the science of designing and studying the interaction between humans and computers with an aim to enhance the quality of that interaction. Analytical, cognitive and empirical techniques are used in the process of designing better systems. In the study of HCI, there are many important facets that have evolved into distinct sub-areas. Some of these areas include constraint-based interface design, exploring different implementation methodologies, designing new interface systems, and the evaluation and comparison of interfaces. In recent years, we have seen dramatic changes to the user interfaces of electronic devices by means of touch, voice and gestures.
There are many trending technologies in human computer interaction, including virtual and augmented reality, 3D visualization, natural language processing, speech processing, hand gesture recognition and touch sensing. Owing to advancements in image acquisition and processing as well as computer vision techniques, Hand Gesture Recognition has gained importance in HCI. Using a part of the human body for communicating can be termed a gesture. All human beings are naturally inclined to use gestures while communicating, e.g. while giving a presentation or taking part in a conversation. Therefore, integrating gesture recognition technology with computer operation can provide a noticeably more comfortable experience. Many different types of tools can be used for image/video based gesture recognition. One approach involves the use of gloves that are equipped with tracking capabilities or special features, such as magnetic interaction, for transferring signals. This is called the contact based approach to modelling gestures. The restrictive experience and the requirement of additional hardware such as gloves have kept this method from becoming popular.
In contrast, vision based techniques are cost-effective, non-intrusive and non-restrictive in nature. The quality of such a system depends directly on the kind of algorithm used and the computational capabilities of the system. In image/video analysis based techniques, various tools are used, namely stereo cameras, depth-aware cameras and simple 2D cameras. With stereo cameras, multiple (usually two) cameras capture different views of the same scene, and these views are merged with special processing to obtain a 3D representation of the scene. Depth-aware cameras are capable of sensing the depth map of the video feed within a certain range. Even the simple 2D cameras available in laptops, digital cameras and smartphones can be used for gesture recognition.
Vision based methods can be further categorized into appearance modelling and 3D modelling. Furthermore, the techniques used also depend on whether the gesture recognition is static or dynamic. The static methods are related to template matching, machine learning classification, etc. The temporal aspect is introduced in dynamic systems, so the gestures and postures can vary with time. Therefore, more sophisticated methods are needed for processing, such as Hidden Markov Models, Dynamic Time Warping and Time Delay Neural Networks. In this project, appearance based modelling with a standard 2D camera feed from a laptop is used as the input for gesture modelling.
Doodle using Hand Gesture Recognition
Vandana Ravichandran
Electrical and Computer Engineering, University of Florida, Gainesville 32611
The rest of the report is organized as follows: Section II discusses the fundamental approach to utilizing computer vision techniques for Hand Gesture Recognition, and Section III gives a design description of the painting application called Doodle. Section IV highlights the approach used for implementing the features described in the previous section; here, the various image processing techniques that were used in the project are discussed in depth. The algorithm used for writing the program is described in Section V. Section VI provides the conclusion by summarizing the practical experience and challenges faced during the implementation. Section VII discusses the scope of the project and some related future work.
II. APPROACH
Generally, all object or motion tracking systems and hand interactive systems share the same goal: the required object must be tracked in successive frames of a video or a camera feed as it moves dynamically. In hand interactive systems, the object that needs to be recognized is essentially the hand. These systems can be divided into layers with distinct functions, namely Detection, Tracking and Recognition.
a. Detection Layer: This is the first layer and is responsible for defining, correlating and extracting the features that correspond to the object of interest, i.e. detecting the presence of hands in the field of view obtained from the camera feed.
b. Tracking Layer: This is the mid-layer that makes use of the features extracted in the detection layer and associates data temporally between successive images in order to track the motion of the hands.
c. Recognition Layer: This is the last layer and deals with the classification problem. The spatiotemporal data extracted in the previous two layers are assigned to relevant result groups with labels that are associated with specific classes of gestures.
Once these three steps are performed successfully, the hand can be detected and tracked and the gesture can be classified. The identified gesture can then be associated with a specific output function, e.g. playing music when the hand moves right. These three functions need to be computed iteratively for every frame so as to process dynamic gesture recognition.
III. DESCRIPTION
As described earlier, gesture recognition is employed to fulfill specific function(s). In this project, the result of gesture recognition is used to create a painting application called Doodle. Briefly, the idea is to paint on the live/active webcam window through the movement of the hand. The image shown below was captured during a demonstration.
The application is equipped with the following features:
a. The user can select different colors from a panel of colors present on the upper part of the window.
b. When the user's hand hovers over a particular color, the color is selected and a message indicating that the particular color was selected is displayed at the bottom of the window.
c. The user can select the erase option to use the fingertip like an eraser and erase pixels that were earlier drawn on the screen.
d. The erase option can also be used to clear the screen completely by selecting the white color on the top right corner and hovering over that area for a few seconds. The screen is cleared after a brief countdown.
e. The user can exit the application by pressing the Escape key. The program breaks out of the indefinite while loop and de-allocates or releases any memory associated with image pointers.
f. When the user exits the application, the image drawn by the user is saved to a file named doodle.jpg in the project folder.
Other hidden objectives: The application also needs to fulfill certain requirements in the background in order to function properly, such as:
g. Before the application is tested, the trackbars provided for fine-tuning the YCrCb parameters for skin segmentation must be used for calibration.
h. The upper area of the webcam field holds the panel and the lower area is used for displaying messages. These are control areas and the user should not be able to draw on them. The area accessible for drawing is restricted to the space between these two sections.
i. Sudden illumination disturbances can cause drawing to be triggered even when there is no noticeable movement of the hand. To handle such disturbances, pixels are not drawn if the distance between two consecutive hand positions is greater than 100 pixels.
j. The system should be capable of handling external noise signals to a certain extent.
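The jump-rejection rule in (i) can be sketched as follows; this is a minimal illustration, and the Point2i struct and shouldDraw name are chosen for the sketch rather than taken from the project code:

```cpp
#include <cmath>

// A draw step is rejected when two consecutive hand positions are
// implausibly far apart (e.g. a jump caused by an illumination flicker).
// The 100-pixel threshold follows the value stated in the report.
struct Point2i { int x, y; };

bool shouldDraw(const Point2i& prev, const Point2i& cur, double maxJump = 100.0) {
    double dx = cur.x - prev.x;
    double dy = cur.y - prev.y;
    return std::sqrt(dx * dx + dy * dy) <= maxJump;  // draw only small moves
}
```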
IV. IMPLEMENTATION
The program was written in C++ with the help of the OpenCV image processing and computer vision library. The application was developed in Visual Studio 2012. Many important image processing techniques were used to build the application. Some of the major techniques that are fundamental to this application are listed below:
1. Skin Color Segmentation
2. Morphological Operations
3. Background Subtraction
4. Camshift Algorithm
5. Convex Hull and Convexity Defects
6. Panel design and labelling
The implementation methodology and the usage of these techniques are discussed in detail below.
1. Skin Color Segmentation
The process of distinguishing and separating the area
of interest from the image based on color is called color
segmentation. Human skin possesses a distinct color tone and
we can use color based tracking for skin segmentation. Here
we try to differentiate the skin pixels from the rest of the
image in order to recognize the human body parts in the
image. In order for the skin segmentation to work effectively,
the right color space and the range values of the channel
parameters of that color space must be chosen. The black and white image shown in Fig 3 is a binary image obtained by thresholding the camera feed so that only the skin-colored pixels (the face and hands) are white while the rest of the image is black.
As shown in Fig 4, trackbars were used in the program to fine-tune the Y, Cr and Cb parameters for our application.
Why was the YCrCb color space chosen?
Although the RGB color space is commonly used to represent images and is widely used in computer graphics, it is an additive model with high correlation between channels. So usually, the HSV or YCbCr color space is preferred for skin segmentation, since separating the brightness information from the chrominance reduces the effect of uneven illumination. In this project, I tried both color spaces and YCbCr seemed to work a little more effectively, so YCbCr was finally used.
Y, Cr and Cb values can each vary from 0 to 255, and every color in the RGB color space has a corresponding unique value in the YCrCb color space. The standard equations used to convert from RGB to YCrCb are:

Y = 0.299·R + 0.587·G + 0.114·B
Cr = (R − Y)·0.713 + 128
Cb = (B − Y)·0.564 + 128

The values of the three parameters that were most suitable for my skin tone and the ambient conditions in my room are subject to change based on the person and location, so the trackbars can be used to adjust them.
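The conversion and range test can be sketched as below. The conversion uses the standard BT.601 coefficients (the same conversion OpenCV's cvtColor applies); the isSkin bounds are illustrative placeholders, not the values tuned with the trackbars in this project:

```cpp
#include <cmath>

// Convert an RGB pixel to YCrCb with the standard BT.601 weights, then
// classify it as skin if the chroma channels fall inside a tuned range.
struct YCrCb { int y, cr, cb; };

YCrCb rgbToYCrCb(int r, int g, int b) {
    double y  = 0.299 * r + 0.587 * g + 0.114 * b;
    double cr = (r - y) * 0.713 + 128.0;  // red-difference chroma
    double cb = (b - y) * 0.564 + 128.0;  // blue-difference chroma
    return { (int)std::lround(y), (int)std::lround(cr), (int)std::lround(cb) };
}

// Range test mirroring what the trackbar-tuned thresholds do; these bounds
// are placeholder values for the sketch, not the project's settings.
bool isSkin(const YCrCb& p, int crLo = 133, int crHi = 173,
            int cbLo = 77, int cbHi = 127) {
    return p.cr >= crLo && p.cr <= crHi && p.cb >= cbLo && p.cb <= cbHi;
}
```

Applying this test to every pixel yields exactly the kind of binary mask shown in Fig 3.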
2. Morphological Operations
Usually the results of segmentation contain noise; we use morphological operations, namely erosion and dilation, to remove it. Morphological operations are shape-based non-linear operations applied for the removal of noise and the smoothing of edges. They depend only on the relative ordering of pixel values, not on the numerical values themselves, and hence are most suitable for processing binary images.
In this technique, the image is probed with a small
shape called a structuring element. At all the possible
locations, the structuring element is positioned and the pixel
values of the structuring element are compared with that of the
neighborhood pixels in a particular position. If for every pixel
in the structuring element that is set to 1, the corresponding
image pixel is also 1, then the element is said to fit the image.
If for at least one of the pixels in the structuring element that is
set to 1, the corresponding image pixel is also 1, then the
element is said to hit the image. The same operation is
iteratively applied to every pixel of the image and based on
whether the structuring element hits or fits the image, we have
two fundamental operations called erosion and dilation.
During erosion a new binary image is produced such
that if the element fits at pixel P(x,y) then new pixel P(x,y) is
1, otherwise it is 0. Erosion is generally used for reducing the
noise in the image but it also has a negative effect of reducing
the region of interest. Similarly, during dilation a new binary
image is produced such that if the element hits at pixel P(x,y)
then new pixel P(x,y) is 1, otherwise it is 0. Dilation is
responsible for expanding the region of interest and is used for filling tiny unwanted gaps in the image. Results of
dilation or erosion are influenced both by the size and shape of
a structuring element.
In this project, after smoothing the image using a Gaussian blur, erosion and dilation are applied consecutively for two iterations using a 3x3 ellipse as the structuring element. In morphology this is termed the opening of an image: erosion followed by dilation with the same structuring element. The result of applying the opening operation to a binary image in my project is shown in Fig 6.
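Erosion, dilation and opening can be sketched on a small binary grid as follows; a 3x3 square structuring element is used here to keep the sketch short, whereas the project uses a 3x3 ellipse:

```cpp
#include <vector>

using Grid = std::vector<std::vector<int>>;

// Out-of-bounds pixels are treated as 0.
static int at(const Grid& g, int r, int c) {
    if (r < 0 || c < 0 || r >= (int)g.size() || c >= (int)g[0].size()) return 0;
    return g[r][c];
}

Grid erode(const Grid& g) {    // element must FIT: all 9 neighbors are 1
    Grid out(g.size(), std::vector<int>(g[0].size(), 0));
    for (int r = 0; r < (int)g.size(); ++r)
        for (int c = 0; c < (int)g[0].size(); ++c) {
            int fit = 1;
            for (int dr = -1; dr <= 1; ++dr)
                for (int dc = -1; dc <= 1; ++dc)
                    fit &= at(g, r + dr, c + dc);
            out[r][c] = fit;
        }
    return out;
}

Grid dilate(const Grid& g) {   // element must HIT: any neighbor is 1
    Grid out(g.size(), std::vector<int>(g[0].size(), 0));
    for (int r = 0; r < (int)g.size(); ++r)
        for (int c = 0; c < (int)g[0].size(); ++c) {
            int hit = 0;
            for (int dr = -1; dr <= 1; ++dr)
                for (int dc = -1; dc <= 1; ++dc)
                    hit |= at(g, r + dr, c + dc);
            out[r][c] = hit;
        }
    return out;
}

Grid open(const Grid& g) { return dilate(erode(g)); }  // opening: erode then dilate
```

A solid 3x3 blob survives the opening while isolated noise pixels do not, which is precisely why the operation is applied after segmentation.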
3. Background Subtraction
Background subtraction is a widely used technique for obtaining the foreground mask of a scene. The foreground extracted using this method is usually sent for further processing. Generally, an image's regions of interest are the objects (humans, cars, or in our case the hand) in its foreground. Background subtraction is typically used for detecting moving objects in videos from static cameras, and the method is mostly applicable when the image in question is part of a video stream. This is an area of much research interest and many effective algorithms have been proposed. The choice of background subtraction method usually depends on the requirements of the application, such as the required sensitivity of the system, the accuracy with which the moving object has to be tracked, and any constraints in terms of memory.
In the OpenCV library, we have preset background subtraction techniques based on the Mixture of Gaussians method. When there is a possibility of multiple objects being introduced into the frames, or of the background being permanently or significantly altered, this method is useful as it is adaptive in nature. In our application, since there will not be much change to the background and only one object (the hand) needs to be tracked, a simpler approach can be used. Here we make use of the frame difference method: the moving object is extracted by subtracting the current frame from a static background image, usually known as the background model.
Sometimes using only a static frame as a background model doesn't suffice; it increases the sensitivity of the system, causing very small illumination disturbances to be wrongly captured as moving objects. In order to reduce the sensitivity, instead of directly using the frame difference, we use the weighted average method described by the equation

B_{i+1} = α·F_i + (1 − α)·B_i

In this equation, F_i represents the foreground of the i-th frame and B_i represents the background of the i-th frame. The new background at time instance i+1 is calculated as a weighted average of F_i and B_i. Alpha (α) is called the learning rate and lies between 0 and 1. This parameter is used for tuning the sensitivity of the system; by varying this factor, the effect
of the older objects present in the scene can be controlled
while determining the new background. After determining the
background in this manner, frame difference is conventionally
used to extract the foreground objects of interest. The image
shown in Fig 7 is an example of background subtraction used for extracting the hand from the binary image.
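The weighted-average model and frame differencing described above can be sketched on a one-dimensional strip of gray values (the BackgroundModel name and the simple threshold are choices made for this sketch):

```cpp
#include <cmath>
#include <vector>

// Running-average background model: B_{i+1} = alpha*F_i + (1-alpha)*B_i,
// followed by frame differencing against a threshold to get the mask.
struct BackgroundModel {
    std::vector<double> bg;   // current background estimate
    double alpha;             // learning rate in (0, 1); lower = less sensitive

    BackgroundModel(const std::vector<double>& first, double a)
        : bg(first), alpha(a) {}

    // Returns the binary foreground mask, then folds the frame into the model.
    std::vector<int> apply(const std::vector<double>& frame, double thresh) {
        std::vector<int> mask(frame.size());
        for (std::size_t i = 0; i < frame.size(); ++i) {
            mask[i] = std::fabs(frame[i] - bg[i]) > thresh ? 1 : 0;
            bg[i] = alpha * frame[i] + (1.0 - alpha) * bg[i];  // weighted update
        }
        return mask;
    }
};
```

A pixel that stops changing is gradually absorbed into the background, which is exactly the effect the learning rate controls.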
4. Meanshift and Camshift algorithms
Meanshift and Camshift are popular object tracking algorithms. For a given pixel distribution, the centroid or mean is computed and a track window is used to demarcate the object of interest in the frame. Subsequently, as the object moves, the pixel distribution changes and the track window needs to be shifted to the area of higher pixel density. Since the result of skin segmentation and background subtraction is provided as the input to the Camshift algorithm, the area of higher pixel density should correspond to the tracked object, which in our case is the hand.
Here, the method used for moving the track window is of interest to us. To determine the probability distribution of the image pixels, we use the histogram back projection technique. Initially, the area of the hand is selected as the feature region from which the histogram model is calculated. This histogram is then used to determine the area of each subsequent image that contains a matching feature; this is called the back projection technique. Since this method makes use of histograms composed of pixel frequencies, we again rely on the color distribution of the frame.
The image with the colored bars shown in Fig 8 is the
histogram obtained for my hand. Different colored rectangular
boxes are used to plot the histogram in order to visualize it as
there was no readily available plotting function in OpenCV.
We use a Gaussian kernel function K(x) to obtain the weighted mean of the pixels in the neighborhood:

K(x) = e^(−c·‖x‖²)

The neighborhood N(x) of a pixel x is defined as the set of pixels around x for which the kernel function evaluates to a non-zero value. The mean m(x) is calculated using the following equation:

m(x) = Σ_{x_i ∈ N(x)} K(x_i − x)·x_i / Σ_{x_i ∈ N(x)} K(x_i − x)

That is, the mean m(x) is a weighted average of every pixel x_i in the neighborhood N(x) of x, weighted by the Gaussian kernel K. After the mean is calculated, the track window is shifted to the new centroid m(x). This process repeats iteratively until convergence is achieved. This is how an object is tracked using the Meanshift algorithm.
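A one-dimensional mean-shift iteration can be sketched as follows; each coordinate carries a weight standing in for a back-projection value, and the function names and bandwidth parameter are choices made for this sketch:

```cpp
#include <cmath>
#include <vector>

// Gaussian kernel over the distance d, with a bandwidth controlling the
// effective neighborhood size.
double kernel(double d, double bandwidth) {
    return std::exp(-(d * d) / (2.0 * bandwidth * bandwidth));
}

// Repeatedly move x to the kernel-weighted mean m(x) of the weighted
// samples until the shift becomes negligible.
double meanShift(const std::vector<double>& xs, const std::vector<double>& ws,
                 double x, double bandwidth, int maxIter = 100) {
    for (int it = 0; it < maxIter; ++it) {
        double num = 0.0, den = 0.0;
        for (std::size_t i = 0; i < xs.size(); ++i) {
            double k = ws[i] * kernel(xs[i] - x, bandwidth);
            num += k * xs[i];
            den += k;
        }
        double next = num / den;                      // m(x)
        if (std::fabs(next - x) < 1e-6) return next;  // converged
        x = next;
    }
    return x;
}
```

Starting between two clusters, the estimate climbs toward the denser one and stops when the shift becomes negligible.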
How does Camshift differ?
The term Camshift stands for Continuously Adaptive Meanshift. Camshift imitates the Meanshift procedure with only a slight difference: once convergence is achieved using Meanshift, not only is the position of the track window updated based on the new mean, but the size of the track window is updated as well. In the original Camshift formulation, each dimension s of the track window is scaled in proportion to the square root of the zeroth moment M00 (the total mass) of the back projection inside the window:

s = 2·√(M00 / 256)

Also, the
orientation of the track window is calculated if there is a tilt in
the object tracked from its initial position. Camshift then uses the new scaled track window for determining the centroid, and this process continues iteratively. The images in Fig 9 and 10 show how the size of the track window is updated in proportion to the size of the hand when it is moved closer to the camera. This is how Camshift improves on Meanshift.
5. Convex Hull and Convexity Defects
Convex Hull is a prominent area of interest in geometry. In simple words, the convex hull can be defined as the smallest convex curve that completely encloses a required set of points. It can be visualized as a rubber band stretched to exactly fit the points in the plane. The problem of determining the convex hull pertains to computational geometry. Mathematically, in an n-dimensional space, the convex hull of a given set of points S is the intersection of all convex sets containing S. The convex hull C for a given set of points P_1, P_2, ..., P_N can be written as:

C = { Σ_{i=1}^{N} λ_i·P_i : λ_i ≥ 0 for all i, Σ_{i=1}^{N} λ_i = 1 }
If we can compute the convex hull of the white blob that belongs to the hand in the binary image, then we can visually track the hand. First, the contours of the hand in the binary image are found; these can be determined using any edge-detection technique. After the contour of the hand is obtained, this set of points is provided as input for calculating the hull. The hull is calculated using the ConvexHull2 function available in OpenCV. The output of this function is an array of points in the frame that correspond to the convex hull of the hand. Line segments using selected points from this array are plotted on the frame to form a polygon around the hand.
The points in the enclosed area that deviate from the convex hull are called the defect points or convexity defects. These deviations from the hull are the valley points, i.e. the points between the fingers. So, computing the convex hull and the convexity defects enables us to track the border of the hand along with the fingers and the area between them; thus the palm can be tracked in totality. In Fig 11, the hull is represented by the pink polygon, which is bounded by the blue rectangular bounding box. The fingertips and defect points are represented by green dots.
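As a standalone illustration of the hull computation that ConvexHull2 performs for us, here is Andrew's monotone chain, one standard convex hull algorithm (the project itself relies on the OpenCV function, not this code):

```cpp
#include <algorithm>
#include <vector>

struct Pt { long long x, y; };

// Cross product of (a - o) and (b - o); positive for a counter-clockwise turn.
long long cross(const Pt& o, const Pt& a, const Pt& b) {
    return (a.x - o.x) * (b.y - o.y) - (a.y - o.y) * (b.x - o.x);
}

// Andrew's monotone chain: returns the hull in counter-clockwise order.
std::vector<Pt> convexHull(std::vector<Pt> pts) {
    std::sort(pts.begin(), pts.end(), [](const Pt& a, const Pt& b) {
        return a.x < b.x || (a.x == b.x && a.y < b.y);
    });
    if (pts.size() < 3) return pts;
    std::vector<Pt> hull(2 * pts.size());
    std::size_t k = 0;
    for (std::size_t i = 0; i < pts.size(); ++i) {              // lower hull
        while (k >= 2 && cross(hull[k-2], hull[k-1], pts[i]) <= 0) --k;
        hull[k++] = pts[i];
    }
    for (std::size_t i = pts.size() - 1, t = k + 1; i > 0; --i) {  // upper hull
        while (k >= t && cross(hull[k-2], hull[k-1], pts[i-1]) <= 0) --k;
        hull[k++] = pts[i-1];
    }
    hull.resize(k - 1);  // the last point duplicates the first
    return hull;
}
```

Interior points, such as those inside the palm contour, are discarded; only the enclosing polygon's vertices remain.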
6. Panel design and labelling
The drawing is produced by tracking the fingertips with the help of the convex hull as described in the previous section. First, an empty white image matrix is initialized. As the user moves the hand, the fingertip, taken to be the highest point of the hull in the vertical dimension, is tracked and the corresponding pixels are drawn on the blank image. The default color is red, and the user can change the drawing color by selecting one of the colors shown in the panel image below. The drawing is also added to the webcam feed so the user can visualize how the drawing is produced as the hand is moved.
The panel was designed using mspaint and was saved as an image with a .panel extension. This panel image is then overlaid on the webcam feed to provide the user with the color options. When the white color is selected, the pixels drawn on the white image can be erased using the fingertip.
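The fingertip selection and the panel hit-test can be sketched as follows; Point, fingertip and selectedSwatch are names invented for this sketch, and the panel geometry is a placeholder rather than the project's actual layout:

```cpp
#include <vector>

// In image coordinates y grows downward, so the "highest" point of the
// hull on screen is the one with the smallest y.
struct Point { int x, y; };

Point fingertip(const std::vector<Point>& hull) {
    Point best = hull.front();
    for (const Point& p : hull)
        if (p.y < best.y) best = p;
    return best;
}

// If the fingertip enters the top panel strip, map its x coordinate to a
// color swatch index; -1 means the tip is in the drawing area.
int selectedSwatch(const Point& tip, int panelHeight, int swatchWidth) {
    if (tip.y >= panelHeight) return -1;
    return tip.x / swatchWidth;
}
```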
Whenever the user selects a color option or clears the screen, a corresponding message is displayed in the lower area of the webcam feed. This is achieved by using a text buffer that holds the text in the selected color, font and font size; the text is overlaid on the webcam feed using the cvPutText function.
V. ALGORITHM
The flowchart below summarizes the algorithm that was used and the important steps that were followed while writing the program. The following process is carried out iteratively until the user quits the application.
Fig 13: Flowchart of the algorithm
The algorithm can be briefly explained as follows. Initially, an image is acquired from the camera and sent for pre-processing. Next, skin segmentation is used to extract the skin-colored pixels from the frame, which consist of the hand and may also contain other body parts such as the face and neck. The image frame is subtracted from the background calculated using the running (weighted) average method.
At this stage, morphological opening is applied to reduce noise. Then Camshift is used for tracking the hand based on histogram back projection, and the contours, convex hull and defect points are determined. The fingertip, the highest point of the hull, is tracked and the corresponding pixels are added to the drawing buffer, which is initially an empty image. Finally, the drawing buffer, color panel and webcam feed are combined with a bitwise AND operation to get the final Doodle window.
VI. CONCLUSION
The Doodle application was developed using fundamental computer vision and image processing techniques. The project was implemented in two steps: in the first step, only hand detection and tracking was developed; in the next step, a program for drawing with a colored glove was developed. Later, both of these parts were integrated to get Doodle functioning. In terms of challenges faced during the implementation, fine-tuning the YCrCb parameters and rendering the system less sensitive to ambient light conditions were the most difficult tasks.
VII. RELATED FUTURE WORK
This project has good scope for future improvement, wherein more sophisticated techniques such as classification using machine learning and advanced background subtraction and edge detection methods can be used to increase the stability of the system. Many other interesting options for the user, such as drawing different shapes and changing the brush effect and its thickness, can be included. Work on such future developments of this project is currently in progress. This project has definitely been a very good starting point for creating a painting application using hand gestures.
VIII. REFERENCES
[1] Amir Rosenfeld and Daphna Weinshall. Extracting Foreground Masks towards Object Recognition. In 13th IEEE International Conference on Computer Vision, Nov 2011.
[2] Ryosuke Araki and Takeshi Ikenaga. Real-time both hands tracking using CAMshift with motion mask and probability reduction by motion prediction. APSIPA ASC, 2012.
[3] Dr. Dapeng Wu, University of Florida, Lecture Notes.
[4] Lee, D., and Lee. Vision-Based Finger Action Recognition by Angle Detection and Contour Analysis. ETRI Journal, 2011.
[5] Afef Salhi and Ameni Yengui Jammoussi. Object tracking system using Camshift, Meanshift and Kalman filter. In World Academy of Science, Engineering and Technology, Apr 2012.
[6] Gary Bradski and Adrian Kaehler. Learning OpenCV. O'Reilly Media, September 2008.
[7] S. François and R.J. Alexandre. Camshift Tracker Design Experiments. IMSC, no. 11, pp. 1-11, 2004.
[8] Massimo Piccardi. Background subtraction techniques: a review. University of Technology, Sydney.
[9] Son Lam Phung and A. Bouzerdoum. Skin segmentation using color pixel classification: analysis and comparison. University of Wollongong.
[10] Fan-Chieh Cheng. Advanced background subtraction approach using Laplacian distribution model. ICME 2010.
[11] M. Panwar and P. S. Mehra. Hand Gesture Recognition for Human Computer Interaction. In Proceedings of the IEEE International Conference on Image Information Processing (ICIIP 2011), Waknaghat, India, November 2011.
[12] L. Howe, F. Wong and A. Chekima. Comparison of Hand Segmentation Methodologies for Hand Gesture Recognition. IEEE, 2008.
[13] P. Kakumanu, S. Makrogiannis and N. Bourbakis. A survey of skin-color modeling and detection methods. Department of Computer Science and Engineering, Wright State University.
[14] Ryosuke Araki, Seiichi Gohshi and Takeshi Ikenaga. Real-Time Both Hands Tracking Using CAMshift with Motion Mask and Probability Reduction by Motion Prediction. Kogakuin University.
[15] Web resources: http://docs.opencv.org, http://opencv-srf.blogspot.com, http://opencvpython.blogspot.com