Image Processing and Computer Vision Project Fall - 2014
Abstract
The way humans interact with computers and devices is soon to change with the help of gesture recognition, one of the trending technologies that will revolutionize human computer interaction in the future. This project aims to use hand gestures captured from a camera for a live painting application called Doodle. Detecting the human hand in the scene and estimating its motion in order to produce the painting output as the hand moves are the major parts of this project. Important computer vision techniques are used, such as extraction of the region of interest using background subtraction, skin color segmentation, detection of the convex hull and convexity defects, and the Camshift algorithm for motion tracking. The user is provided with a panel containing different colors and an eraser from which to choose options while drawing on the screen.
Keywords
Background subtraction, Skin Segmentation, Camshift,
Convex hulls, Convexity defects, Motion detection
I. INTRODUCTION
Human Computer Interaction has emerged as an area of spectacular research activity and has introduced a new dimension to the science of computing. One of the most universally known examples is the graphical user interface used in Windows 95 by Microsoft. As computers have become universally prevalent, attention has shifted to incorporating techniques that will enhance the end-user experience. This is one of the fundamental reasons for the increase in popularity, research activity and commercial products related to Human Computer Interaction.
HCI can be defined informally as the science of designing and studying the interaction between humans and computers with an aim to enhance the quality of that interaction. Analytical, cognitive and empirical techniques are used in the process of designing better systems. In the study of HCI, there are many important facets that have evolved into distinct sub-areas. Some of these areas include constraint-based interface design, exploring different implementation methodologies, designing new interface systems, and the evaluation and comparison of interfaces. In recent years, we have seen dramatic changes to the user interfaces of electronic devices by means of touch, voice and gestures.
There are many trending technologies in human computer interaction, including virtual and augmented reality, 3D visualization, natural language processing, speech processing, hand gesture recognition and touch sensing. Owing to advancements in image acquisition and processing as well as computer vision techniques, Hand Gesture Recognition has gained importance in HCI. Using a part of the human body for communicating can be termed a gesture. All human beings are naturally inclined to use gestures while communicating, e.g. while giving a presentation or taking part in a conversation. Therefore, integrating gesture recognition technology with computer operation can provide a noticeably more comfortable experience. Many different types of tools can be used for image/video based gesture recognition. One approach involves the use of gloves that are equipped with tracking capabilities or special features, such as magnetic interaction, for transferring signals. This is called the contact based approach to modelling gestures. The restrictive experience and the requirement of additional hardware such as gloves have kept this method from becoming popular.
In contrast, vision based techniques are cost-effective, non-intrusive and non-restrictive in nature. The quality of such a system depends directly on the kind of algorithm used and the computational capabilities of the system. In image/video analysis based techniques, various tools are used, namely stereo cameras, depth-aware cameras and simple 2D cameras. With stereo cameras, multiple (usually two) cameras capture different views of the same scene, and these views are merged with special processing to obtain a 3D representation of the scene. Depth-aware cameras are capable of sensing the depth map of the video feed within a certain range. Even the simple 2D cameras available in laptops, digital cameras and smartphones can be used for gesture recognition.
Vision based methods can be further categorized into appearance modelling and 3D modelling. Furthermore, the techniques used also depend on whether the gesture recognition is static or dynamic. The static methods are related to template matching, machine learning classification, etc. The temporal aspect is introduced in dynamic systems, so the gestures and postures can vary with time. Therefore, more sophisticated methods are needed for processing, such as Hidden Markov Models, Dynamic Time Warping and Time Delay Neural Networks. In this project, appearance based modelling with a standard 2D camera feed from a laptop is used as the input for gesture modelling.
Doodle using Hand Gesture Recognition
Vandana Ravichandran
Electrical and Computer Engineering, University of Florida, Gainesville 32611
The rest of the report is organized as follows: Section II discusses the fundamental approach to utilizing computer vision techniques for Hand Gesture Recognition, and Section III gives a design description of the painting application called Doodle. Section IV highlights the approach used for implementing the features described in the previous section; here, the various image processing techniques that were used in the project are discussed in depth. The algorithm used for writing the program is described in Section V. Section VI provides the conclusion by summarizing the practical experience and challenges faced during the implementation. Section VII discusses the scope of the project and some related future work.
II. APPROACH
Generally, all object or motion tracking systems and hand interactive systems share the same goal: the required object must be tracked in successive frames of a video or a camera feed as it moves dynamically. In hand interactive systems, the object that needs to be recognized is essentially the hand. These systems can be divided into layers with distinct functions, namely Detection, Tracking and Recognition.
a. Detection Layer: This is the first layer and is responsible for defining, correlating and extracting the features that correspond to the object of interest, i.e. detecting the presence of hands in the field of view obtained from the camera feed.
b. Tracking Layer: This is the mid-layer that makes use of the features extracted in the detection layer and associates data temporally between successive images in order to track the motion of the hands.
c. Recognition Layer: This is the last layer and deals with the classification problem. The spatiotemporal data extracted in the previous two layers are assigned to relevant result groups with labels that are associated with specific classes of gestures.
Once these three steps are performed successfully, the hand can be detected and tracked and the gesture can be classified. The identified gesture can then be associated with a specific output function, e.g. playing music when the hand moves right. These three functions need to be computed iteratively for every frame so as to process dynamic gesture recognition.
III. DESCRIPTION
As described earlier, gesture recognition is employed to fulfill specific function(s). In this project, the result of gesture recognition is used to create a painting application called Doodle. Briefly, the idea is to paint on the live/active webcam window through the movement of the hand. The image shown below was captured during a demonstration.
The application is equipped with the following features:
a. The user can select different colors from a panel of colors present on the upper part of the window.
b. When the user's hand hovers over a particular color, the color is selected and a message indicating that the particular color was selected is displayed at the bottom of the window.
c. The user can select the erase option to use the fingertip like an eraser and erase pixels that were earlier drawn on the screen.
d. The erase option can also be used to clear the screen completely by selecting the white color on the top right corner and hovering over that area for a few seconds. The screen is cleared after a brief countdown.
e. The user can exit the application by pressing the Escape key. The program breaks out of the indefinite while loop and de-allocates or releases any memory associated with image pointers.
f. When the user exits the application, the image drawn by the user is saved to a file named doodle.jpg in the project folder.
Other hidden objectives: The application also needs to fulfill certain requirements in the background in order to function properly, such as:
g. Before the application is tested, the trackbars provided for fine-tuning the YCrCb parameters for skin segmentation must be used for calibration.
h. The upper area of the webcam field holds the panel and the lower area is used for displaying messages. These are control areas and the user should not be able to draw on them. The area accessible for drawing is restricted to the space between these two sections.
i. Sudden illumination disturbances can cause drawing to be triggered even when there is no noticeable movement of the hand. To handle such disturbances, pixels are not drawn if the distance between two consecutive hand positions is greater than 100 pixels.
j. The system should be capable of handling external noise signals to a certain extent.
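The jump-rejection rule in (i) can be sketched as follows; this is a minimal illustration, and the Point2i struct and shouldDraw name are chosen for the sketch rather than taken from the project code:

```cpp
#include <cmath>

// A draw step is rejected when two consecutive hand positions are
// implausibly far apart (e.g. a jump caused by an illumination flicker).
// The 100-pixel threshold follows the value stated in the report.
struct Point2i { int x, y; };

bool shouldDraw(const Point2i& prev, const Point2i& cur, double maxJump = 100.0) {
    double dx = cur.x - prev.x;
    double dy = cur.y - prev.y;
    return std::sqrt(dx * dx + dy * dy) <= maxJump;  // draw only small moves
}
```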
IV. IMPLEMENTATION
The program was written in C++ with the help of the OpenCV image processing and computer vision library. The application was developed in Visual Studio 2012. Many important image processing techniques were used to build the application. Some of the major techniques that are fundamental to this application are listed below:
1. Skin Color Segmentation
2. Morphological Operations
3. Background Subtraction
4. Camshift Algorithm
5. Convex Hull and Convexity Defects
6. Panel design and labelling
The implementation methodology and the usage of these techniques are discussed in detail below.
1. Skin Color Segmentation
The process of distinguishing and separating the area
of interest from the image based on color is called color
segmentation. Human skin possesses a distinct color tone and
we can use color based tracking for skin segmentation. Here
we try to differentiate the skin pixels from the rest of the
image in order to recognize the human body parts in the
image. In order for the skin segmentation to work effectively,
the right color space and the range values of the channel
parameters of that color space must be chosen. The black and white image shown in Fig 3 is a binary image obtained by thresholding the camera feed so that only the skin-colored pixels (the face and hands) are white while the rest of the image is black.
As shown in Fig 4, trackbars were used in the program to fine-tune the Y, Cr and Cb parameters for our application.
Why was the YCrCb color space chosen?
Although the RGB color space is commonly used to represent images and is widely used in computer graphics, it is an additive model with high correlation between channels. So usually, the HSV or YCbCr color space is preferred for skin segmentation, since separating the brightness information from the chrominance reduces the effect of uneven illumination. In this project, I tried both color spaces and YCbCr seemed to work a little more effectively, so YCbCr was finally used.
Y, Cr and Cb values can each vary from 0 to 255, and every color in the RGB color space has a corresponding unique value in the YCrCb color space. The standard equations used to convert from RGB to YCrCb are:

Y = 0.299·R + 0.587·G + 0.114·B
Cr = (R − Y)·0.713 + 128
Cb = (B − Y)·0.564 + 128

The values of the three parameters that were most suitable for my skin tone and the ambient conditions in my room are subject to change based on the person and location, so the trackbars can be used to adjust them.
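The conversion and range test can be sketched as below. The conversion uses the standard BT.601 coefficients (the same conversion OpenCV's cvtColor applies); the isSkin bounds are illustrative placeholders, not the values tuned with the trackbars in this project:

```cpp
#include <cmath>

// Convert an RGB pixel to YCrCb with the standard BT.601 weights, then
// classify it as skin if the chroma channels fall inside a tuned range.
struct YCrCb { int y, cr, cb; };

YCrCb rgbToYCrCb(int r, int g, int b) {
    double y  = 0.299 * r + 0.587 * g + 0.114 * b;
    double cr = (r - y) * 0.713 + 128.0;  // red-difference chroma
    double cb = (b - y) * 0.564 + 128.0;  // blue-difference chroma
    return { (int)std::lround(y), (int)std::lround(cr), (int)std::lround(cb) };
}

// Range test mirroring what the trackbar-tuned thresholds do; these bounds
// are placeholder values for the sketch, not the project's settings.
bool isSkin(const YCrCb& p, int crLo = 133, int crHi = 173,
            int cbLo = 77, int cbHi = 127) {
    return p.cr >= crLo && p.cr <= crHi && p.cb >= cbLo && p.cb <= cbHi;
}
```

Applying this test to every pixel yields exactly the kind of binary mask shown in Fig 3.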
2. Morphological Operations
Usually the results of segmentation contain noise; we use morphological operations, namely erosion and dilation, to remove it. Morphological operations are shape-based non-linear operations applied for the removal of noise and the smoothing of edges. They depend only on the relative ordering of pixel values, not on the numerical values themselves, and hence are most suitable for processing binary images.
In this technique, the image is probed with a small
shape called a structuring element. At all the possible
locations, the structuring element is positioned and the pixel
values of the structuring element are compared with that of the
neighborhood pixels in a particular position. If for every pixel
in the structuring element that is set to 1, the corresponding
image pixel is also 1, then the element is said to fit the image.
If for at least one of the pixels in the structuring element that is
set to 1, the corresponding image pixel is also 1, then the
element is said to hit the image. The same operation is
iteratively applied to every pixel of the image and based on
whether the structuring element hits or fits the image, we have
two fundamental operations called erosion and dilation.
During erosion a new binary image is produced such
that if the element fits at pixel P(x,y) then new pixel P(x,y) is
1, otherwise it is 0. Erosion is generally used for reducing the
noise in the image but it also has a negative effect of reducing
the region of interest. Similarly, during dilation a new binary
image is produced such that if the element hits at pixel P(x,y)
then new pixel P(x,y) is 1, otherwise it is 0. Dilation is
responsible for expanding the region of interest and is used for filling tiny unwanted gaps in the image. Results of
dilation or erosion are influenced both by the size and shape of
a structuring element.
In this project, after smoothing the image using a Gaussian blur, erosion and dilation are applied consecutively for two iterations using a 3x3 ellipse as the structuring element. In morphology this is termed the opening of an image: erosion followed by dilation with the same structuring element. The result of applying the opening operation to a binary image in my project is shown in Fig 6.
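Erosion, dilation and opening can be sketched on a small binary grid as follows; a 3x3 square structuring element is used here to keep the sketch short, whereas the project uses a 3x3 ellipse:

```cpp
#include <vector>

using Grid = std::vector<std::vector<int>>;

// Out-of-bounds pixels are treated as 0.
static int at(const Grid& g, int r, int c) {
    if (r < 0 || c < 0 || r >= (int)g.size() || c >= (int)g[0].size()) return 0;
    return g[r][c];
}

Grid erode(const Grid& g) {    // element must FIT: all 9 neighbors are 1
    Grid out(g.size(), std::vector<int>(g[0].size(), 0));
    for (int r = 0; r < (int)g.size(); ++r)
        for (int c = 0; c < (int)g[0].size(); ++c) {
            int fit = 1;
            for (int dr = -1; dr <= 1; ++dr)
                for (int dc = -1; dc <= 1; ++dc)
                    fit &= at(g, r + dr, c + dc);
            out[r][c] = fit;
        }
    return out;
}

Grid dilate(const Grid& g) {   // element must HIT: any neighbor is 1
    Grid out(g.size(), std::vector<int>(g[0].size(), 0));
    for (int r = 0; r < (int)g.size(); ++r)
        for (int c = 0; c < (int)g[0].size(); ++c) {
            int hit = 0;
            for (int dr = -1; dr <= 1; ++dr)
                for (int dc = -1; dc <= 1; ++dc)
                    hit |= at(g, r + dr, c + dc);
            out[r][c] = hit;
        }
    return out;
}

Grid open(const Grid& g) { return dilate(erode(g)); }  // opening: erode then dilate
```

A solid 3x3 blob survives the opening while isolated noise pixels do not, which is precisely why the operation is applied after segmentation.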
3. Background Subtraction
Background subtraction is a widely used technique for obtaining the foreground mask of a scene. The foreground extracted using this method is usually sent for further processing. Generally, an image's regions of interest are the objects (humans, cars, or in our case the hand) in its foreground. Background subtraction is typically used for detecting moving objects in videos from static cameras, and the method is mostly applicable when the image in question is part of a video stream. This is an area of much research interest and many effective algorithms have been proposed. The choice of background subtraction method usually depends on the requirements of the application, such as the required sensitivity of the system, the accuracy with which the moving object has to be tracked, and any constraints in terms of memory.
In the OpenCV library, we have preset background subtraction techniques based on the Mixture of Gaussians method. When there is a possibility of multiple objects being introduced into the frames, or of the background being permanently or significantly altered, this method is useful as it is adaptive in nature. In our application, since there will not be much change to the background and only one object (the hand) needs to be tracked, a simpler approach can be used. Here we make use of the frame difference method: the moving object is extracted by subtracting the current frame from a static background image, usually known as the background model.
Sometimes using only a static frame as a background model doesn't suffice; it increases the sensitivity of the system, causing very small illumination disturbances to be wrongly captured as moving objects. In order to reduce the sensitivity, instead of directly using the frame difference, we use the weighted average method described by the equation

B_{i+1} = α·F_i + (1 − α)·B_i

In this equation, F_i represents the foreground of the i-th frame and B_i represents the background of the i-th frame. The new background at time instance i+1 is calculated as a weighted average of F_i and B_i. Alpha (α) is called the learning rate and lies between 0 and 1. This parameter is used for tuning the sensitivity of the system; by varying this factor, the effect
of the older objects present in the scene can be controlled
while determining the new background. After determining the
background in this manner, frame difference is conventionally
used to extract the foreground objects of interest. The image
shown in Fig 7 is an example of background subtraction used for extracting the hand from the binary image.
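The weighted-average model and frame differencing described above can be sketched on a one-dimensional strip of gray values (the BackgroundModel name and the simple threshold are choices made for this sketch):

```cpp
#include <cmath>
#include <vector>

// Running-average background model: B_{i+1} = alpha*F_i + (1-alpha)*B_i,
// followed by frame differencing against a threshold to get the mask.
struct BackgroundModel {
    std::vector<double> bg;   // current background estimate
    double alpha;             // learning rate in (0, 1); lower = less sensitive

    BackgroundModel(const std::vector<double>& first, double a)
        : bg(first), alpha(a) {}

    // Returns the binary foreground mask, then folds the frame into the model.
    std::vector<int> apply(const std::vector<double>& frame, double thresh) {
        std::vector<int> mask(frame.size());
        for (std::size_t i = 0; i < frame.size(); ++i) {
            mask[i] = std::fabs(frame[i] - bg[i]) > thresh ? 1 : 0;
            bg[i] = alpha * frame[i] + (1.0 - alpha) * bg[i];  // weighted update
        }
        return mask;
    }
};
```

A pixel that stops changing is gradually absorbed into the background, which is exactly the effect the learning rate controls.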
4. Meanshift and Camshift algorithms
Meanshift and Camshift are popular object tracking algorithms. For a given pixel distribution, the centroid or mean is computed and a track window is used to demarcate the object of interest in the frame. Subsequently, as the object moves, the pixel distribution changes and the track window needs to be shifted to the area of higher pixel density. Since the result of skin segmentation and background subtraction is provided as the input to the Camshift algorithm, the area of higher pixel density should correspond to the tracked object, which in our case is the hand.
Here, the method used for moving the track window is of interest to us. To determine the probability distribution of the image pixels, we use the histogram back projection technique. Initially, the area of the hand is selected as the feature region from which the histogram model is calculated. This histogram is then used to determine the area of each subsequent image that contains a matching feature; this is called the back projection technique. Since this method makes use of histograms composed of pixel frequencies, we again rely on the color distribution of the frame.
The image with the colored bars shown in Fig 8 is the
histogram obtained for my hand. Different colored rectangular
boxes are used to plot the histogram in order to visualize it as
there was no readily available plotting function in OpenCV.
We use a Gaussian kernel function K(x) to obtain the weighted mean of the pixels in the neighborhood:

K(x) = e^(−c·‖x‖²)

The neighborhood N(x) of a pixel x is defined as the set of pixels around x for which the kernel function evaluates to a non-zero value. The mean m(x) is calculated using the following equation:

m(x) = Σ_{x_i ∈ N(x)} K(x_i − x)·x_i / Σ_{x_i ∈ N(x)} K(x_i − x)

That is, the mean m(x) is a weighted average of every pixel x_i in the neighborhood N(x) of x, weighted by the Gaussian kernel K. After the mean is calculated, the track window is shifted to the new centroid m(x). This process repeats iteratively until convergence is achieved. This is how an object is tracked using the Meanshift algorithm.
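A one-dimensional mean-shift iteration can be sketched as follows; each coordinate carries a weight standing in for a back-projection value, and the function names and bandwidth parameter are choices made for this sketch:

```cpp
#include <cmath>
#include <vector>

// Gaussian kernel over the distance d, with a bandwidth controlling the
// effective neighborhood size.
double kernel(double d, double bandwidth) {
    return std::exp(-(d * d) / (2.0 * bandwidth * bandwidth));
}

// Repeatedly move x to the kernel-weighted mean m(x) of the weighted
// samples until the shift becomes negligible.
double meanShift(const std::vector<double>& xs, const std::vector<double>& ws,
                 double x, double bandwidth, int maxIter = 100) {
    for (int it = 0; it < maxIter; ++it) {
        double num = 0.0, den = 0.0;
        for (std::size_t i = 0; i < xs.size(); ++i) {
            double k = ws[i] * kernel(xs[i] - x, bandwidth);
            num += k * xs[i];
            den += k;
        }
        double next = num / den;                      // m(x)
        if (std::fabs(next - x) < 1e-6) return next;  // converged
        x = next;
    }
    return x;
}
```

Starting between two clusters, the estimate climbs toward the denser one and stops when the shift becomes negligible.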
How does Camshift differ?
The term Camshift stands for Continuously Adaptive Meanshift. Camshift imitates the Meanshift procedure with only a slight difference: once convergence is achieved using Meanshift, not only is the position of the track window updated based on the new mean, but the size of the track window is updated as well. In the original Camshift formulation, each dimension s of the track window is scaled in proportion to the square root of the zeroth moment M00 (the total mass) of the back projection inside the window:

s = 2·√(M00 / 256)

Also, the
orientation of the track window is calculated if there is a tilt in
the object tracked from its initial position. Camshift then uses the new scaled track window for determining the centroid, and this process continues iteratively. The images in Fig 9 and 10 show how the size of the track window is updated in proportion to the size of the hand when it is moved closer to the camera. This is how Camshift improves on Meanshift.
5. Convex Hull and Convexity Defects
Convex Hull is a prominent area of interest in geometry. In simple words, the convex hull can be defined as the smallest convex curve that completely encloses a required set of points. It can be visualized as a rubber band stretched to exactly fit the points in the plane. The problem of determining the convex hull pertains to computational geometry. Mathematically, in an n-dimensional space, the convex hull of a given set of points S is the intersection of all convex sets containing S. The convex hull C for a given set of points P_1, P_2, ..., P_N can be written as:

C = { Σ_{i=1}^{N} λ_i·P_i : λ_i ≥ 0 for all i, Σ_{i=1}^{N} λ_i = 1 }
If we can compute the convex hull of the white blob that belongs to the hand in the binary image, then we can visually track the hand. First, the contours of the hand in the binary image are found; these can be determined using any edge-detection technique. After the contour of the hand is obtained, this set of points is provided as input for calculating the hull. The hull is calculated using the ConvexHull2 function available in OpenCV. The output of this function is an array of points in the frame that correspond to the convex hull of the hand. Line segments using selected points from this array are plotted on the frame to form a polygon around the hand.
The points in the enclosed area that deviate from the convex hull are called the defect points or convexity defects. These deviations from the hull are the valley points, i.e. the points between the fingers. So, computing the convex hull and the convexity defects enables us to track the border of the hand along with the fingers and the area between them; thus the palm can be tracked in totality. In Fig 11, the hull is represented by the pink polygon, which is bounded by the blue rectangular bounding box. The fingertips and defect points are represented by green dots.
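As a standalone illustration of the hull computation that ConvexHull2 performs for us, here is Andrew's monotone chain, one standard convex hull algorithm (the project itself relies on the OpenCV function, not this code):

```cpp
#include <algorithm>
#include <vector>

struct Pt { long long x, y; };

// Cross product of (a - o) and (b - o); positive for a counter-clockwise turn.
long long cross(const Pt& o, const Pt& a, const Pt& b) {
    return (a.x - o.x) * (b.y - o.y) - (a.y - o.y) * (b.x - o.x);
}

// Andrew's monotone chain: returns the hull in counter-clockwise order.
std::vector<Pt> convexHull(std::vector<Pt> pts) {
    std::sort(pts.begin(), pts.end(), [](const Pt& a, const Pt& b) {
        return a.x < b.x || (a.x == b.x && a.y < b.y);
    });
    if (pts.size() < 3) return pts;
    std::vector<Pt> hull(2 * pts.size());
    std::size_t k = 0;
    for (std::size_t i = 0; i < pts.size(); ++i) {              // lower hull
        while (k >= 2 && cross(hull[k-2], hull[k-1], pts[i]) <= 0) --k;
        hull[k++] = pts[i];
    }
    for (std::size_t i = pts.size() - 1, t = k + 1; i > 0; --i) {  // upper hull
        while (k >= t && cross(hull[k-2], hull[k-1], pts[i-1]) <= 0) --k;
        hull[k++] = pts[i-1];
    }
    hull.resize(k - 1);  // the last point duplicates the first
    return hull;
}
```

Interior points, such as those inside the palm contour, are discarded; only the enclosing polygon's vertices remain.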
6. Panel design and labelling
The drawing is produced by tracking the fingertips with the help of the convex hull as described in the previous section. First, an empty white image matrix is initialized. As the user moves the hand, the fingertip, taken to be the highest point of the hull in the vertical dimension, is tracked and the corresponding pixels are drawn on the blank image. The default color is red, and the user can change the drawing color by selecting one of the colors shown in the panel image below. The drawing is also added to the webcam feed so the user can visualize how the drawing is produced as the hand is moved.
The panel was designed using mspaint and was saved as an image with a .panel extension. This panel image is then overlaid on the webcam feed to provide the user with the color options. When the white color is selected, the pixels drawn on the white image can be erased using the fingertip.
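The fingertip selection and the panel hit-test can be sketched as follows; Point, fingertip and selectedSwatch are names invented for this sketch, and the panel geometry is a placeholder rather than the project's actual layout:

```cpp
#include <vector>

// In image coordinates y grows downward, so the "highest" point of the
// hull on screen is the one with the smallest y.
struct Point { int x, y; };

Point fingertip(const std::vector<Point>& hull) {
    Point best = hull.front();
    for (const Point& p : hull)
        if (p.y < best.y) best = p;
    return best;
}

// If the fingertip enters the top panel strip, map its x coordinate to a
// color swatch index; -1 means the tip is in the drawing area.
int selectedSwatch(const Point& tip, int panelHeight, int swatchWidth) {
    if (tip.y >= panelHeight) return -1;
    return tip.x / swatchWidth;
}
```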
Whenever the user selects a color option or clears the screen, a corresponding message is displayed in the lower area of the webcam feed. This is achieved by using a text buffer that holds the text in the selected color, font and font size; the text is overlaid on the webcam feed using the cvPutText function.
V. ALGORITHM
The flowchart below summarizes the algorithm that was used and the important steps that were followed while writing the program. The following process is carried out iteratively until the user quits the application.
Fig 13: Flowchart of the algorithm
The algorithm can be briefly explained as follows. Initially, an image is acquired from the camera and sent for pre-processing. Next, skin segmentation is used to extract the skin-colored pixels from the frame, which consist of the hand and may also contain other body parts such as the face and neck. The image frame is subtracted from the background calculated using the running (weighted) average method.
At this stage, morphological opening is applied to reduce noise. Then Camshift is used for tracking the hand based on histogram back projection, and the contours, convex hull and defect points are determined. The fingertip, the highest point of the hull, is tracked and the corresponding pixels are added to the drawing buffer, which is initially an empty image. Finally, the drawing buffer, color panel and webcam feed are combined with a bitwise AND operation to get the final Doodle window.
VI. CONCLUSION
The Doodle application was developed using fundamental computer vision and image processing techniques. The project was implemented in two steps: in the first step, only hand detection and tracking was developed; in the next step, a program for drawing with a colored glove was developed. Later, both of these parts were integrated to get Doodle functioning. In terms of challenges faced during the implementation, fine-tuning the YCrCb parameters and rendering the system less sensitive to ambient light conditions were the most difficult tasks.
VII. RELATED FUTURE WORK
This project has good scope for future improvement, wherein more sophisticated techniques such as classification using machine learning and advanced background subtraction and edge detection methods can be used to increase the stability of the system. Many other interesting options for the user, such as drawing different shapes and changing the brush effect and its thickness, can be included. Work on such future developments of this project is currently in progress. This project has definitely been a very good starting point for creating a painting application using hand gestures.
VIII. REFERENCES
[1] Amir Rosenfeld and Daphna Weinshall. Extracting Foreground Masks towards Object Recognition. In 13th IEEE International Conference on Computer Vision, Nov 2011.
[2] Ryosuke Araki and Takeshi Ikenaga. Real-time both hands tracking using CAMshift with motion mask and probability reduction by motion prediction. APSIPA ASC, 2012.
[3] Dr. Dapeng Wu, University of Florida, Lecture Notes.
[4] Lee, D., and Lee. Vision-Based Finger Action Recognition by Angle Detection and Contour Analysis. ETRI Journal, 2011.
[5] Afef Salhi and Ameni Yengui Jammoussi. Object tracking system using Camshift, Meanshift and Kalman filter. In World Academy of Science, Engineering and Technology, Apr 2012.
[6] Gary Bradski and Adrian Kaehler. Learning OpenCV. O'Reilly Media, September 2008.
[7] S. François and R.J. Alexandre. Camshift Tracker Design Experiments. IMSC, no. 11, pp. 1-11, 2004.
[8] Massimo Piccardi. Background subtraction techniques: a review. University of Technology, Sydney.
[9] Son Lam Phung and A. Bouzerdoum. Skin segmentation using color pixel classification: analysis and comparison. University of Wollongong.
[10] Fan-Chieh Cheng. Advanced background subtraction approach using Laplacian distribution model. ICME 2010.
[11] M. Panwar and P. S. Mehra. Hand Gesture Recognition for Human Computer Interaction. In Proceedings of the IEEE International Conference on Image Information Processing (ICIIP 2011), Waknaghat, India, November 2011.
[12] L. Howe, F. Wong and A. Chekima. Comparison of Hand Segmentation Methodologies for Hand Gesture Recognition. IEEE, 2008.
[13] P. Kakumanu, S. Makrogiannis and N. Bourbakis. A survey of skin-color modeling and detection methods. Department of Computer Science and Engineering, Wright State University.
[14] Ryosuke Araki, Seiichi Gohshi and Takeshi Ikenaga. Real-Time Both Hands Tracking Using CAMshift with Motion Mask and Probability Reduction by Motion Prediction. Kogakuin University.
[15] Web resources: http://docs.opencv.org, http://opencv-srf.blogspot.com, http://opencvpython.blogspot.com