A Human-Machine Interface with Unmanned Aerial Vehicles

Daniel Soto-Guerrero
Information Technology Laboratory, Cinvestav Tamaulipas, México.
Email: [email protected]

José Gabriel Ramírez Torres
Information Technology Laboratory, Cinvestav Tamaulipas, México.
Email: [email protected]

Abstract—Many UAV applications rely on a ground-based computer to accomplish all processing tasks, and the power consumed by such systems is a major roadblock when it comes to portability. In this paper we present an efficient and practical human-machine interface with an unmanned aerial vehicle (UAV). One of the main goals of the proposed interface is to be ubiquitous, so we moved the entire computational load to a mobile device running Android. Through this interface, the UAV responds to both the movements and the body gestures of the user; moreover, all processing tasks are carried out efficiently on the mobile device. We describe how we track the user throughout the video stream, which techniques we used and how we implemented them on our mobile platform to achieve a good level of performance.

I. INTRODUCTION

The interface proposed and described in this article contributes to what has been observed in many social interactions between unmanned aerial vehicles (UAVs) and untrained personnel. Technically, the software developed for this interface interprets what the user does throughout a video stream and translates those actions into control commands. We analyzed the user's movements and body gestures to make our UAV a loyal companion. To assemble this interface, we combined several digital image processing techniques in one application for Google's mobile operating system, Android. We consider mobile devices the best processing platform for this interface: they already provide a more natural interaction through their touch capabilities, and they are overtaking PCs in common everyday tasks. Furthermore, today's mobile devices provide enough processing power to make this interface a portable solution limited only by battery resources.

For the digital image processing, we avoided placing distinguishable markers on the user in order to keep the interaction as natural as possible, and we tracked the user's upper body to classify gestures performed with the arms. We also took advantage of the drone's video decoder architecture for Android in our tracking algorithm, obtaining a processing rate of 14 FPS while still being able to compensate for the UAV's rotation changes.

Previous studies show how people can interact and become emotionally attached to a UAV that responds to the user's movements and body gestures. Texas A&M University used seven UAVs as fairies during a production of William Shakespeare's A Midsummer Night's Dream [2]. The untrained actors' unpredictable actions uncovered unsafe forms of operation for both the UAVs and the actors; only after the roboticists in charge introduced scary metaphors did the actors begin to treat the UAVs carefully.

Another study argues that human-robot interaction research has ignored the social and collocated aspects, which will matter once autonomous flying robots take a larger role in social activities [6]. The authors moved beyond remote-controlled interaction by using the Wizard of Oz technique: a test subject performs gestures for the drone while an unseen operator controls the UAV wirelessly. They showed that both children and adults become emotionally attached to the UAV.

Among recent research works on human-machine interaction with UAVs, we can cite a sports assistant from the University of Tokyo [5]. The UAV follows an athlete to provide what is known as external visual imagery: athletes no longer need to picture themselves from the perspective of an external observer to evaluate their skills, strategies and game plans. All that information is delivered by the drone to the athlete through a head-mounted display. They used a particle filter to track one color in the video stream, corresponding to the user's jacket color. Although their system depends on either a laptop or a PC, and they did not describe their tracking algorithm's implementation or performance, they did report that the external visual imagery made the athlete feel dizzy.

In addition, if humans had to choose between a companion of mechanical or humanoid appearance, we would likely choose the mechanical one, because we unconsciously find machines that imitate us imperfectly repulsive [9]. This is also supported by the uncanny valley hypothesis [4], which suggests that robots intended to participate actively in a social context must first cross the uncanny valley to gain acceptance from their human counterparts; otherwise, they should not look human at all.

The structure of this article is as follows. Section II describes all hardware and software components involved in our interface. Tracking and control schemes are discussed in Sections III and V, respectively. How we accomplish gesture recognition can be found in Section IV. Results and conclusions are presented in the last two sections, VI and VII.

II. HUMAN-UAV INTERFACE

This section provides a description of our interface's hardware, software and control schemes.


Fig. 1: Interface description. (a) User and UAV. (b) UAV's front camera. (c) Four possible scenarios. (d) Interface component interoperability diagram: the AR-Drone's camera observes the user (wearing a long-sleeve jacket) and exchanges video, navigation data and commands with IHRVANT over WiFi/UDP; on the Android mobile device, the decoded video feeds color segmentation, gray-tone segmentation and a particle filter, whose outputs are fused to estimate the user's state, classify gestures and drive the control of the UAV.

reacts to user’s commands that it captures through its on boardcamera; all commands are processed on one mobile devicecarried by the user. The software is capable of estimatingthe user’s movements and gestures through the video streamsent from the UAV (Fig.1a); it also estimates user’s position,relative to the drone, to control the aircraft and make it followthe user. On our software, the user can set up the trackingalgorithm using the same screen section in which the videostream is being displayed; he also can toggle between manualand automatic flight control. Figure 1d describes how allsoftware and hardware components are related to each other.

A. Hardware

The chosen drone for this set-up is Parrot's AR-Drone, a WiFi-controlled quadcopter1. Communication is carried over UDP ports: control commands (UDP 5556) tell the drone the movement direction, navigation data (UDP 5554) describe the drone's current state and, lastly, the video stream is received on UDP 5555. For the mobile processing platform, we chose the Asus

1Ar-Drone website: http://ardrone2.parrot.com/

Transformer Prime2, which features Tegra 3, the world's first quad-core mobile processor, from NVIDIA. More on how we take advantage of the tablet's architecture is included in the next section.

B. Software

We built an application named IHRVANT from the ground up, in Java with native code for the image-processing routines, discarding Parrot's API3. IHRVANT is a multi-threaded application that initializes, attends and closes the communication with the UAV. The first four threads handle the GUI, control commands, navigation data and video stream reception, respectively (see Fig. 2). The video reception thread is by far the most complex, because it manages all image-processing threads, which implement the tracking algorithm and the gesture-recognition functionality. Android relies on the Dalvik virtual machine to assign cores on demand to each application and does not allow direct control over system resources; IHRVANT therefore assumes that resource administration is handled efficiently.
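As a rough sketch of this thread layout (class and method names below are hypothetical, not taken from the IHRVANT sources; only the UDP port numbers come from Fig. 2), the service threads could be started as follows:

    import java.net.DatagramSocket;
    import java.net.InetAddress;

    public class DroneSession {
        static final String DRONE_IP   = "192.168.1.1"; // AR-Drone's usual default address
        static final int PORT_COMMANDS = 5556;          // AT commands
        static final int PORT_NAVDATA  = 5554;          // navigation data
        static final int PORT_VIDEO    = 5555;          // video stream

        public static void main(String[] args) throws Exception {
            InetAddress drone = InetAddress.getByName(DRONE_IP);
            DatagramSocket cmd = new DatagramSocket();
            DatagramSocket nav = new DatagramSocket();
            DatagramSocket vid = new DatagramSocket();
            cmd.connect(drone, PORT_COMMANDS);
            nav.connect(drone, PORT_NAVDATA);
            vid.connect(drone, PORT_VIDEO);

            // One thread per service, mirroring Fig. 2; the GUI stays on the UI thread.
            new Thread(() -> sendQueuedCommands(cmd),    "control").start();
            new Thread(() -> receiveAndParseNavdata(nav), "navdata").start();
            new Thread(() -> receiveDecodeAndTrack(vid),  "video").start();
        }

        static void sendQueuedCommands(DatagramSocket s)     { /* poll AT queue, send, reset watchdog */ }
        static void receiveAndParseNavdata(DatagramSocket s) { /* receive, parse, notify GUI */ }
        static void receiveDecodeAndTrack(DatagramSocket s)  { /* receive, decode, feed DIP threads */ }
    }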

The video data stream is coded with UVLC (4:2:0, 8x8 DCT, QVGA), which means each frame is first decoded into the YCbCr color space and later converted to an RGB bitmap. Since we based our digital image processing (DIP) techniques on the YCbCr color space, the decoder feeds the color channels directly to the DIP threads, but only when they are idle. Notice that the DIP threads may reject one or more frames if they are still busy processing the last frame they accepted. With this strategy, the video stream is displayed flawlessly while the DIP threads work on the latest available frame. The user can set all these parameters through a touch-based interface.
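The "always display, process only the latest frame" policy can be sketched with a single atomic slot between the decoder and the DIP threads (the Frame holder below is a hypothetical container for the decoded Y, Cb and Cr planes):

    import java.util.concurrent.atomic.AtomicReference;

    final class FrameMailbox {
        // Holds at most one frame; older unprocessed frames are silently dropped.
        private final AtomicReference<Frame> latest = new AtomicReference<>();

        // Decoder side: called for every decoded frame, after it has been displayed.
        void publish(Frame f) { latest.set(f); }

        // DIP side: called when the processing threads become idle; returns null
        // if no new frame arrived since the last call.
        Frame takeLatest() { return latest.getAndSet(null); }
    }

    final class Frame { int[] y, cb, cr; int width, height; } // placeholder container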

Besides the video decoder and DIP threads, IHRVANT also handles drone control commands and navigation data in separate threads; these two threads make a closed-loop control scheme possible. The UAV's self-state estimation is delivered as navigation data and is used as feedback for the control commands. The navigation data sent from the UAV includes, but is not limited to: battery charge, flight height and orientation angles (pitch, roll and yaw).
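As a minimal illustration of closing the loop with navigation data (the actual control scheme is the subject of Section V; the gain and variable names here are assumptions), the yaw angle reported by the drone could be fed back into a normalized turn command:

    final class YawHold {
        private static final double KP = 0.8; // proportional gain, chosen arbitrarily here

        // navYawDeg comes from the navigation-data thread, targetYawDeg from the tracker.
        // The result is a normalized angular-rate command in [-1, 1].
        double yawRateCommand(double navYawDeg, double targetYawDeg) {
            double error = targetYawDeg - navYawDeg;
            while (error > 180)   error -= 360;  // wrap so the drone turns the short way
            while (error <= -180) error += 360;
            double cmd = KP * error / 180.0;
            return Math.max(-1.0, Math.min(1.0, cmd));
        }
    }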

III. TRACKING ALGORITHM

This section describes how we solved the tracking problem, taking into account that detecting the user's upper-body shape is also necessary to accomplish gesture recognition. We used the color of the user's long-sleeve jacket as a marker, so we could segment its representative area in each video frame. The technique we used for segmenting the user's jacket is suitable for the mobile processing platform because of its simplicity; it is described alongside its implementation in Section III-B. By fusing the color-segmented areas with a particle filter, as pictured in Fig. 1d, we can overcome the false positives that appear during the segmentation process, while adding resilience to partial occlusion.

There is a wide variety of vision-based tracking algorithms. Among them, probabilistic algorithms such as particle filters have been implemented for tracking colored objects [1], [11], and some have been implemented using the AR-Drone's front camera [5].

2Tablet's website: http://eee.asus.com/en/transformer-prime/
3API website: https://projects.ardrone.org/


Fig. 2: Thread diagram for IHRVANT. The video stream reception thread decodes and processes all received frames. After start-up and thread creation, the control-command thread opens UDP port 5556 and sends queued AT commands (fed by the GUI joysticks), resetting the watchdog when the queue is empty; the navigation-data thread opens UDP port 5554, receives and parses data, and reports to the GUI through a callback; the video-stream thread opens UDP port 5555, receives data, runs the decoder and the tracking algorithm, and reports to the GUI through a callback.

TABLE I: Color histogram with two discrete intervals (n = 2).

                        Cb ∈ [0,124]    Cb ∈ [125,255]
    Cr ∈ [0,124]           Bin 1           Bin 2
    Cr ∈ [125,255]         Bin 3           Bin 4


A. Particle Filter

Particle filters estimate a variable's distribution function using a set of weighted samples (the particles) drawn from previous observations of the variable. They work with a weighted particle set Xt of size I. Resampling based on each particle's weight is what makes this technique so relevant [10]. In our case, the observation process corresponds to the coordinates at which our area of interest is located in the video frames. We achieved this by making all particles track one specific color histogram, using the Cb and Cr channels of the video stream. A color histogram can be seen as a bidimensional table in which all pixels are categorized according to their chroma values and n discrete intervals per channel, giving a total of n² bins (see Table I).
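A straightforward way to build such a histogram for one particle's window is sketched below, under the assumption that the Cb and Cr planes are stored as row-major arrays with values in 0-255; the exact interval boundaries of Table I are approximated here by an even split:

    // Normalized 2-D Cb/Cr histogram over the square window Wi of side l,
    // centered at (cx, cy), with n intervals per channel (n*n bins total).
    static double[] chromaHistogram(int[] cb, int[] cr, int width,
                                    int cx, int cy, int l, int n) {
        double[] hist = new double[n * n];
        int half = l / 2, count = 0;
        for (int y = cy - half; y <= cy + half; y++) {
            for (int x = cx - half; x <= cx + half; x++) {
                int idx = y * width + x;          // bounds checks omitted for brevity
                int binCb = cb[idx] * n / 256;    // 0 .. n-1
                int binCr = cr[idx] * n / 256;
                hist[binCr * n + binCb]++;
                count++;
            }
        }
        for (int b = 0; b < hist.length; b++) hist[b] /= count; // normalize
        return hist;
    }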

The first important aspect of our implementation is that each particle is defined as a square region Wi with side length l, centered at coordinates (xt, yt) at time t. The state vector for each particle is defined as xt = (xt, yt, xt−1, yt−1)^T. Secondly, the state transition distribution from which we sample the variable's a priori hypothetical state x̄t+1 does not require all previous observations. We assume that the object moves with uniform motion frame by frame, hence we only require the last two observations xt−1 and xt; this is called a second-order dynamic model (see Eq. 1, with noise vt ∼ N(0, σ)) [8]. Figure 3 shows the i-th particle, its surrounding region Wi with l = 3, its last two observations (xt−1, xt) and its a priori state x̄t+1 estimated with Eq. 1.

Fig. 3: Particle filter and probabilistic framework. (a) Measurement a priori. (b) Measurement a posteriori, marked with an oval.
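The a priori prediction step can be sketched as follows, assuming the usual constant-velocity form of the second-order model described above, x̄t+1 = 2xt − xt−1 + vt with vt ∼ N(0, σ):

    import java.util.Random;

    final class Particle {
        double x, y;          // current observation (x_t, y_t)
        double prevX, prevY;  // previous observation (x_{t-1}, y_{t-1})

        // Projects the particle to its a priori state for the next frame.
        void predict(Random rng, double sigma) {
            double nx = 2 * x - prevX + rng.nextGaussian() * sigma;
            double ny = 2 * y - prevY + rng.nextGaussian() * sigma;
            prevX = x;  prevY = y;    // shift the observation history
            x = nx;     y = ny;       // the new hypothesis x̄_{t+1}
        }
    }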


In order to assign relevance to each particle, for all regions Wi we first obtain their normalized bidimensional histograms over the Cb and Cr channels, divided into n² buckets. Secondly, we measure each one's similarity to a reference normalized histogram with the Bhattacharyya similarity coefficient (Eq. 2), which yields a distance D[q∗, qt(xi)] between a previously defined reference histogram q∗ and the histogram qt(xi) of the i-th particle's surroundings at time t. We observed a consistent exponential behavior of the squared distance D², hence we used Eq. 3 to weight all particles [7]. Subsequent resampling with probability ∝ p(xt+1 | x̄t+1, zt+1) then suffices to achieve object tracking.

D[q∗, qt(xi)] = ( 1 − Σ_{n=1..N} √( q∗(n) · qt(xi)(n) ) )^(1/2)          (2)

p(xt+1 | x̄t+1, zt+1) ∝ exp( −λ D²[q∗, qt(xi)] )          (3)
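A direct transcription of Eqs. 2 and 3 (λ is a tuning constant whose value is not given in the text; histograms are assumed normalized, as above):

    // Bhattacharyya distance between the reference histogram q* and the
    // histogram of the i-th particle's window (Eq. 2).
    static double bhattacharyyaDistance(double[] qRef, double[] qParticle) {
        double bc = 0;
        for (int b = 0; b < qRef.length; b++) bc += Math.sqrt(qRef[b] * qParticle[b]);
        return Math.sqrt(Math.max(0, 1 - bc));   // clamp against rounding error
    }

    // Un-normalized particle weight (Eq. 3); weights are normalized over the
    // whole particle set before resampling.
    static double particleWeight(double[] qRef, double[] qParticle, double lambda) {
        double d = bhattacharyyaDistance(qRef, qParticle);
        return Math.exp(-lambda * d * d);
    }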

B. Color Segmentation

In order to track the user's body shape as input for gesture classification, we segment all color areas similar to our reference color Cref, which is in fact the user's jacket color. We denote any color as two coordinates (cb, cr) in the CbCr color plane; any color C whose squared Euclidean distance to Cref is smaller than our threshold r² is segmented. We do something similar with the luminance channel, which provides important contrast information, by segmenting a range of gray tones. The segmented gray tones lie within ±15 discrete values of the user's jacket tone.
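The two per-pixel tests can be sketched as follows (the reference values would come from the user's set-up step on the touch interface, Section II; names are illustrative):

    // Chroma test: C = (cb, cr) is segmented if its squared Euclidean distance
    // to the reference color Cref = (refCb, refCr) is below the threshold r^2.
    static boolean chromaMatch(int cb, int cr, int refCb, int refCr, int r2) {
        int dCb = cb - refCb, dCr = cr - refCr;
        return dCb * dCb + dCr * dCr < r2;
    }

    // Gray-tone test: luma within +/- 15 discrete values of the jacket's tone.
    static boolean grayToneMatch(int y, int refY) {
        return Math.abs(y - refY) <= 15;
    }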

The segmentation strategies just described easily lead to undesirable false positives. It is by fusing the color-segmentation results with morphological operations and the particle filter that we distinguish the user throughout the video stream.

Consider one frame of the video stream, its color-segmented area (containing false positives), its gray-tone segmentation and a particle set Xt at time t. Keeping in mind that the Y frame is four times larger than the chroma frames (Fig. 4b), color-segmented pixels (Fig. 4a) are kept only if 3 or more of their corresponding luma pixels are lit. After one dilation with a diamond (3 × 3) structuring element, we overlay all particles, apply a bush-fire (region-growing) pass seeded at the particles (Fig. 4e) and discard all areas not reached by the fire (Fig. 4f). Figure 5 shows how IHRVANT displays the fusion results in blue, the particles in red, and the centroid and bounding box in yellow.
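The first fusion step can be sketched as below, assuming row-major boolean masks and the 4:2:0 layout in which each chroma pixel covers a 2 × 2 block of luma pixels; the dilation and the particle-seeded bush-fire pass then operate on the resulting mask as in Fig. 4d-4f:

    // A chroma-resolution pixel survives only if it was color segmented AND at
    // least 3 of its 4 underlying luma pixels passed the gray-tone test.
    static boolean[] fuseMasks(boolean[] chromaMask, boolean[] lumaMask, int cw, int ch) {
        boolean[] out = new boolean[cw * ch];
        int lumaWidth = 2 * cw;
        for (int y = 0; y < ch; y++) {
            for (int x = 0; x < cw; x++) {
                if (!chromaMask[y * cw + x]) continue;
                int lit = 0;
                for (int dy = 0; dy < 2; dy++)
                    for (int dx = 0; dx < 2; dx++)
                        if (lumaMask[(2 * y + dy) * lumaWidth + (2 * x + dx)]) lit++;
                out[y * cw + x] = lit >= 3;
            }
        }
        return out;
    }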

Fig. 4: Fusion of all DIP techniques. (a) Color segmented. (b) Gray-tone segmented. (c) Fusion of 4a and 4b. (d) Dilation and particles. (e) After bush-fire. (f) Area of interest.

Fig. 5: Upper body silhouette segmentation. (a) Particles in red, color-segmented area in blue. (b) Bounding box in yellow.

IV. GESTURE RECOGNITION

To distinguish the user's gestures given by the upper-body silhouettes, we first do some conditioning, transferring the segmented area into one stabilized, smaller bitmap Bs. Since we know the orientation of the UAV's on-board camera through the navigation data, we can absorb the minor misleading rotations caused by normal flight conditions. We scale the silhouette's height to match that of Bs, make the center of Bs coincide with the silhouette's centroid and then rotate it to absorb the camera's rotation. In the lower-right corner, Figures 5a and 5b show the corresponding stabilized bitmap Bs.
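A sketch of this conditioning step, producing the stabilized bitmap Bs by inverse mapping (the bitmap size, variable names and nearest-neighbour sampling are assumptions):

    // Copies the silhouette mask into a small square bitmap Bs, scaled so the
    // silhouette height fills Bs, centered on the silhouette centroid (cx, cy)
    // and counter-rotated by the camera roll reported in the navigation data.
    static boolean[] stabilize(boolean[] mask, int w, int h,
                               double cx, double cy, int silhouetteHeight,
                               double rollRad, int bsSize) {
        boolean[] bs = new boolean[bsSize * bsSize];
        double scale = (double) silhouetteHeight / bsSize;  // Bs pixel -> source pixels
        double cos = Math.cos(rollRad), sin = Math.sin(rollRad);
        for (int v = 0; v < bsSize; v++) {
            for (int u = 0; u < bsSize; u++) {
                double du = (u - bsSize / 2.0) * scale;
                double dv = (v - bsSize / 2.0) * scale;
                int sx = (int) Math.round(cx + du * cos - dv * sin);  // rotate the offset,
                int sy = (int) Math.round(cy + du * sin + dv * cos);  // then translate to centroid
                if (sx >= 0 && sx < w && sy >= 0 && sy < h)
                    bs[v * bsSize + u] = mask[sy * w + sx];
            }
        }
        return bs;
    }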



(a) Measurement a priori.

(b) Measurement a posteriori marked withan oval.

Fig. 3: Particle filter and probabilistic framework.

process in the Y CbCr color space . We overcomedfalse-positives by fusioning results from color seg-mented areas with a particle filter.

There is a wide variety of vision based trackingalgorithms, among them, probabilistic algorithmssuch as particle filters have been implemented fortracking color objects [10], [1] and some have beenimplemented using AR-Drone’s front camera [5].

3.1 Particle FilterIn particle filters, each particle represents a sampleof the observation process. All particles are firstprojected into the future with an estimation ofa hypothetical a priori state xt+1, using as refer-ence previous observations xt−1 and xt (Fig. 3a).Then, when the corresponding measurement zt+1

is available, an a posteriori hypothetical estimationp(xt+1|xt+1, zt+1) takes place (Fig. 3b). The particlefilter works with a particle set Xt of size I , dis-tinguishing which ones are more relevant is whatmakes particle filtering so relevant.

We assume that the object moves in uniformmotion frame by frame, therefore we chose a secondorder dynamic model to estimate all a priori states[8]. In our implementation, each particle definesa square region Wi centered in coordinates xi =(x, y) within the video frame and the sides of Wi

are l pixels long. In order to assign relevance to

10TH INTL. CONFERENCE ON ELECTRICAL ENGINEERING, COMPUTING SCIENCE AND AUTOMATIC CONTROL, SEPTEMBER-2013 3

IHRVANT is a multi-threaded application that initializes, attends and closes the communication with the UAV. The first four threads attend the GUI, control commands, navigation data and video stream reception, respectively (see Fig. 2). The video reception thread is the most complex, because it manages all the image processing threads, which implement the tracking algorithm and the gesture recognition functionality. Android relies on the Dalvik virtual machine to assign cores on demand to each application and does not allow direct control over system resources, so IHRVANT assumes that resource administration is handled efficiently by the platform.
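As a rough illustration only (the paper does not show IHRVANT's source code), the thread layout described above could be organized in Java along the following lines; every class and method name here is a placeholder.

// Minimal sketch of the thread layout: one thread each for navigation data,
// control commands and video reception. Names are illustrative placeholders.
public class IhrvantSession {
    private final Thread navDataThread;   // receives pose, battery and altitude
    private final Thread controlThread;   // sends control commands to the drone
    private final Thread videoThread;     // decodes frames and feeds the DIP workers

    public IhrvantSession(Runnable navDataLoop, Runnable controlLoop, Runnable videoLoop) {
        navDataThread = new Thread(navDataLoop, "navdata");
        controlThread = new Thread(controlLoop, "control");
        videoThread = new Thread(videoLoop, "video");
    }

    public void start() {
        navDataThread.start();
        controlThread.start();
        videoThread.start();   // the video thread spawns the DIP worker threads itself
    }

    public void stop() throws InterruptedException {
        navDataThread.interrupt();
        controlThread.interrupt();
        videoThread.interrupt();
        navDataThread.join();
        controlThread.join();
        videoThread.join();
    }
}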

The video data stream arrives coded with UVLC (4:2:0, 8x8 DCT, QVGA), which means each frame is first decoded into the YCbCr color space and later converted to an RGB bitmap. Since we based our digital image processing (DIP) techniques on the YCbCr color space, the decoder feeds the color channels directly to the DIP threads, but only when they are idle. Notice that the DIP threads may reject one or more frames if they are still busy processing the last accepted frame. With this strategy the video stream is displayed flawlessly while the DIP threads work on the latest available frame. The user can initialize all parameters through the tablet's touch interface.
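A minimal sketch of this "feed only when idle" handoff is shown below, assuming a single-slot mailbox between the decoder thread and a DIP worker; the frame representation and all names are assumptions, not IHRVANT's actual code.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// A bounded queue of capacity 1 acts as a single-slot mailbox: the decoder
// offers each decoded frame and the frame is simply dropped when the slot is
// still occupied, so at most one frame waits while the worker is busy.
class DipWorker implements Runnable {
    private final BlockingQueue<int[][]> mailbox = new ArrayBlockingQueue<>(1);

    /** Called by the decoder thread; returns false when the frame is dropped. */
    boolean offerFrame(int[][] ycbcrPlanes) {
        return mailbox.offer(ycbcrPlanes);
    }

    @Override
    public void run() {
        try {
            while (true) {
                int[][] planes = mailbox.take();   // block until a frame is available
                processFrame(planes);              // segmentation, particle filter, fusion...
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();    // exit cleanly on shutdown
        }
    }

    private void processFrame(int[][] planes) { /* DIP pipeline goes here */ }
}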

Besides the video decoder and DIP threads, IHRVANT also handles the drone control commands and the navigation data in separate threads; these two threads make a closed-loop control scheme possible. The UAV's self-state estimation is delivered as navigation data and used as feedback for the control commands. The navigation data sent from the UAV includes, but is not limited to: battery charge, flight altitude and orientation angles (pitch, roll and yaw).

III. TRACKING ALGORITHM

In this section we discuss how we solved the tracking problem, taking into consideration that detection of the user's upper body shape is also necessary to accomplish gesture recognition. We used the color of the user's long-sleeve jacket as a marker, so we could segment its representative area in each video frame. The technique we used for segmenting the user's jacket is suitable for the mobile processing platform because of its simplicity; it is described alongside its implementation in Section III-B. Unfortunately, many false positives appeared during the segmentation process, which we overcame, while also adding resilience to partial occlusion, by fusing the results of the color-segmented areas with a particle filter, as pictured in Fig. 1d.

Fig. 2: Thread diagram for IHRVANT. The video stream reception thread decodes and processes all received frames.

There is a wide variety of vision-based tracking algorithms; among them, probabilistic algorithms such as particle filters have been implemented for tracking colored objects [10], [1], and some have been implemented using the AR-Drone's front camera [4].

A. Particle Filter

Particle filters estimate a variable's distribution function using a set of weighted samples (the particles) drawn from previous observations of that variable. They work with a weighted particle set X_t of size I; resampling based on each particle's weight is what makes this technique so effective [9]. In this particular case, our observation process corresponds to the coordinates at which our area of interest is located in the video frames.

The first important aspect of our implementation is how we defined each particle: a square region W_i with side length l, centered at coordinates (x_t, y_t) at time t. The state vector for each particle is defined as x_t = (x_t, y_t, x_{t-1}, y_{t-1})^T. Secondly, the state transition distribution from which we sample the variable's a priori hypothetical state \tilde{x}_{t+1} does not require all previous observations. We assumed that the object moves in uniform motion frame by frame, hence we only required the last two observations x_{t-1} and x_t; this is called a second-order dynamic model (see Eq. 1, v_t ~ N(0, σ)) [7]. Figure 3 shows a particle with l = 3 and its surrounding.

Fig. 3: A particle with its surrounding.

\[
\tilde{\mathbf{x}}_{t+1} =
\begin{pmatrix}
2 & 0 & -1 & 0 \\
0 & 2 & 0 & -1 \\
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0
\end{pmatrix}
\mathbf{x}_t +
\begin{pmatrix}
1 & 0 \\
0 & 1 \\
0 & 0 \\
0 & 0
\end{pmatrix}
\mathbf{v}_t
\tag{1}
\]

The third important aspect of our implementation is how we weighted all particles: we computed the similarity between the normalized color histogram q_t, taken from W_i, and a normalized reference color histogram q* with the Bhattacharyya similarity coefficient (Eq. 2). As noted by Perez et al. [7], we also observed a consistent exponential behavior for the squared distance D^2, hence we used Eq. 3 for weighting all particles (λ = 20). Resampling with draw probability proportional to p(x_{t+1} | \tilde{x}_{t+1}, q_{t+1}) then suffices to make all particles behave like bees looking for honey, all trying to match the reference color histogram q*. Figure 3 pictures how a particle tracks a gray area over time, computing the color histogram from its surroundings with every displacement.

\[
D[q^*, q_t(x_t, y_t)] = \left[ 1 - \sum_{n=1}^{N} \sqrt{q^*(n)\, q_t(n; x_t, y_t)} \right]^{\frac{1}{2}}
\tag{2}
\]

\[
p(x_{t+1} \mid \tilde{x}_{t+1}, q_{t+1}) \propto \exp\!\left(-\lambda\, D^2[q^*, q_{t+1}]\right)
\tag{3}
\]
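A small sketch of the weighting step of Eqs. 2 and 3 is given below; the histogram extraction from W_i is application specific and omitted, and resampling would then draw I particles with probability proportional to these weights.

// Bhattacharyya distance between a particle's normalized histogram and the
// reference histogram, turned into a weight with the exponential of Eq. (3).
final class ParticleWeighting {
    static double bhattacharyyaDistance(double[] reference, double[] candidate) {
        double coefficient = 0.0;
        for (int n = 0; n < reference.length; n++) {
            coefficient += Math.sqrt(reference[n] * candidate[n]);
        }
        return Math.sqrt(Math.max(0.0, 1.0 - coefficient));   // Eq. (2)
    }

    static double weight(double[] reference, double[] candidate, double lambda) {
        double d = bhattacharyyaDistance(reference, candidate);
        return Math.exp(-lambda * d * d);                      // Eq. (3), with lambda = 20 as stated above
    }
}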

B. Color segmentation

In order to track user’s body shape, we segmented allcolor areas similar to one reference color Cref , which is infact, user’s jacket color. Complementary, we reinforced oursegmentation process by adding contrast information fromthe grayscale representation for each frame. To achieve thisefficiently, we took advantage of the UVLC video decoderarchitecture, which works with the subsampled Y CbCr colorspace to transmit all video frames. This drastically improvedthe segmentation process performance in the mobile platform,since we didn’t had to implement any space color transitionfunction.

Every pixel’s chromatic information is denoted as onevector c = (cb, cr), any given pixel is to be segmented if the

2013 10th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE) Mexico City, Mexico. September 30-October 4, 2013

IEEE Catalog Number CFP13827-ART ISBN 978-1-4799-1461-6 978-1-4799-1461-6/13/$31.00 ©2013 IEEE

309

Page 4: [IEEE 2013 10th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE) - Mexico City, Mexico (2013.09.30-2013.10.4)] 2013 10th International

euclidian distance d(·) between its corresponding chromaticdata to a reference chromatic vector c∗ = (c∗b , c

∗r) is smaller

than our threshold Tc (See Eq. 4).

Cs(x, y) =

{1 if d(c(x, y), c∗) < Tc0 if d(c(x, y), c∗) > Tc

(4)

Similarly, gray tones segmented from the Y channel must lie within a fixed interval [y* − 15, y* + 15], where y* represents a gray tone reference (the user's jacket gray tone).

\[
Y_s(x, y) =
\begin{cases}
1 & \text{if } (y^* - 15) \le Y(x, y) \le (y^* + 15) \\
0 & \text{otherwise}
\end{cases}
\tag{5}
\]
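The per-pixel tests of Eqs. 4 and 5 reduce to the following sketch; the reference values (cbRef, crRef, yRef) and the threshold tc would come from the user's touch initialization, and all names are illustrative.

// Chroma test of Eq. (4) and luma test of Eq. (5) for a single pixel.
final class Segmentation {
    static boolean chromaMatch(int cb, int cr, int cbRef, int crRef, double tc) {
        double db = cb - cbRef;
        double dr = cr - crRef;
        return Math.sqrt(db * db + dr * dr) < tc;     // Euclidean distance in (Cb, Cr)
    }

    static boolean grayMatch(int y, int yRef) {
        return y >= yRef - 15 && y <= yRef + 15;      // fixed interval around y*
    }
}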

The color and gray segmentation strategies just described can easily produce undesirable false positives; in other words, not all segmented pixels necessarily belong to the user's true location in the video frame. It is by fusing the results of these segmentation techniques with the particle filter's output that we distinguish the user throughout the video stream. In the next section we describe how this is accomplished.

C. Data fusion

We developed a way to fuse all the available processed data, based on morphological operations, to deliver one final result. As input, this process receives the color-segmented (C_s) and gray-tone-segmented (Y_s) areas, which may contain false positives at time t, and the particle set X_t. Keeping in mind that the video decoder's subsampling scheme (4:2:0) makes the Y frames (Fig. 4b) double the resolution of the chroma frames (Fig. 4a), we considered a color-segmented pixel useful only if 3 or more of its corresponding Y pixels are lit; those are marked in green in Fig. 4c and represent the fusion of color and contrast. After one dilation with a 3 × 3 diamond structuring element, we overlaid all particles and applied a bush-fire algorithm [3] (Fig. 4e), discarding all areas not reached by the fire (Fig. 4f); by doing so, and provided the particle filter was set up correctly, we eliminated all the false positives shown in green in Fig. 4e. Figure 4g shows how IHRVANT displays the fusion results in blue, the particles in red, and the segmented area's centroid (B_c) and its bounding box (B_bb) in yellow. During normal operation, the blue area corresponds to the user's position in the video frame and represents his upper body silhouette, which we can classify to accomplish body gesture interpretation.
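The sketch below illustrates the core of this fusion under two simplifications: the dilation step is omitted, and the bush-fire propagation of [3] is stood in for by a plain flood fill seeded at the particle positions; the array layouts and names are assumptions.

import java.util.ArrayDeque;

// Fusion sketch: a chroma pixel survives only if at least 3 of its 4
// corresponding Y pixels are lit (4:2:0 subsampling), then only the regions
// connected to some particle are kept.
final class Fusion {
    /** chromaMask is w x h; yMask is 2w x 2h because of the 4:2:0 subsampling. */
    static boolean[][] fuse(boolean[][] chromaMask, boolean[][] yMask) {
        int w = chromaMask.length, h = chromaMask[0].length;
        boolean[][] fused = new boolean[w][h];
        for (int x = 0; x < w; x++) {
            for (int y = 0; y < h; y++) {
                if (!chromaMask[x][y]) continue;
                int lit = 0;
                for (int dx = 0; dx < 2; dx++)
                    for (int dy = 0; dy < 2; dy++)
                        if (yMask[2 * x + dx][2 * y + dy]) lit++;
                fused[x][y] = lit >= 3;               // color and contrast agree
            }
        }
        return fused;
    }

    /** Keep only the fused pixels 4-connected to at least one particle position. */
    static boolean[][] keepReached(boolean[][] fused, int[][] particleXY) {
        int w = fused.length, h = fused[0].length;
        boolean[][] kept = new boolean[w][h];
        ArrayDeque<int[]> queue = new ArrayDeque<>();
        for (int[] p : particleXY) queue.add(p);
        while (!queue.isEmpty()) {
            int[] q = queue.poll();
            int x = q[0], y = q[1];
            if (x < 0 || y < 0 || x >= w || y >= h || !fused[x][y] || kept[x][y]) continue;
            kept[x][y] = true;
            queue.add(new int[]{x + 1, y});
            queue.add(new int[]{x - 1, y});
            queue.add(new int[]{x, y + 1});
            queue.add(new int[]{x, y - 1});
        }
        return kept;
    }
}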

IV. GESTURE RECOGNITION

For classifying user’s upper body silhouette given by thesegmentation process described in last Section, we first dosome conditioning by stabilizing all segmented area Bbb withcentroid B∗

bb, into one normalized bitmap Bs (See Fig. 4g).Consider the fact that we know UAV’s onboard camera’sorientation (roll, pitch and yaw angles), as it is directly relatedto UAV’s pose received through navigation data; from that,we can improve our classifier performance by absorbing minormisleading rotations caused by normal flight conditions.

Bs’ width and height (ws and hs, respectively) ought tofollow human proportions. We used a widely known humanproportion model in graphical arts in which the basic unit of

(a) Color segmentated area. (b) Gray tone segmented area.

(c) Fusion 4a and 4b. (d) Dilation and particles.

(e) After bush-fire. (f) Area of interest.

(g) Upper body silhouette segmentation. Particles in red,color segmented area in blue and boundary box in yellow.Lower right side, Bs bitmap.

Fig. 4: Fusion of all DIP techniques and how IHRVANTdisplays the result on the tablet’s screen.

measurement is the head. According to this model our bodyis 7.5 heads tall and 4/3 heads wide, from that we concludedthat a proportion equal to ws : 2.5hs will gratefully depicthuman’s upper body proportions when arms are horizontalor close to the body. For preconditioning Bbb, we fixed hs(hence ws, because of the ws : 2.5hs relationship) and thenscale boundary box’s height (hbb) to match hs (height scalefactor s = hbb/hs). Then, we translated all segmented area toBs making sure B∗

bb overlaps the center of Bs (ws/2, hs/2);this make look the body silhouette always centered. Finally,we rotated the result area according to the UAV’s roll statevariable, using as pivot the center of Bs (see Eq. 6). By doingso, we diminish misguidances caused by rotation along theUAV’s fixed roll axis. In the lower-right side, in Fig. 5c weshow their corresponding Bs bitmap after scaling, translatingand rotating is done.

\[
B_s = R_z(\text{roll})\; T\!\big(B^*_{bb} \rightarrow (w_s/2,\, h_s/2)\big)\; S(h_{bb}/h_s)\; B_{bb}
\tag{6}
\]
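Assuming Android's graphics API is used for the warp (the paper does not say which primitives IHRVANT relies on), Eq. 6 can be sketched as one affine transform: scale the bounding-box crop to the normalized height, translate its centroid to the center of B_s, then counter-rotate by the roll angle. All names, and the sign of the rotation, are assumptions.

import android.graphics.Bitmap;
import android.graphics.Canvas;
import android.graphics.Matrix;

// Sketch of the stabilization of Eq. (6) with android.graphics.Matrix.
final class SilhouetteStabilizer {
    static Bitmap stabilize(Bitmap boundingBoxCrop, float cxBb, float cyBb,
                            int ws, int hs, float rollDegrees) {
        float s = (float) hs / boundingBoxCrop.getHeight();      // scale so h_bb matches h_s
        Matrix m = new Matrix();
        m.postScale(s, s);
        m.postTranslate(ws / 2f - s * cxBb, hs / 2f - s * cyBb); // put B*_bb at (w_s/2, h_s/2)
        m.postRotate(-rollDegrees, ws / 2f, hs / 2f);            // compensate roll about the center of B_s

        Bitmap bs = Bitmap.createBitmap(ws, hs, Bitmap.Config.ARGB_8888);
        new Canvas(bs).drawBitmap(boundingBoxCrop, m, null);
        return bs;
    }
}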

(a) All four gestures tested with IHRVANT.
(b) Vertical and horizontal histograms for gesture 4 and its complete normalized histogram.
(c) Silhouette stabilization in bitmap B_s: successful classification while the drone is rotated about the fixed roll axis. IHRVANT displays the classification results using 4 yellow bars; the winner's label is in red.

Fig. 5: Histograms for classification.

Our classifier is based on histograms: from B_s we obtain its vertical and horizontal histograms, which in turn are unified into one normalized histogram h (see Fig. 5b). The classifier is trained with the average of 31 samples of h, taken 1.2 seconds apart while the user holds his pose in front of the UAV's camera; the resulting histogram h* defines the reference histogram for the trained gesture. With two or more gestures trained, the classifier measures the squared error between any given histogram and all available trained gestures using Eq. 7. The gesture for which the error is minimum is considered the winner. All computed errors are shown by IHRVANT as a bar graph, and the winner's label is shown in red to the user (see Fig. 5c).

\[
e = \sum_{i=1}^{N} \left(h^*_i - h_i\right)^2
\tag{7}
\]

IHRVANT supports up to four different gestures and recognizes a gesture as valid if it is classified continuously during two consecutive seconds. Figure 5a shows all four classified gestures.
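The classifier itself is simple enough to sketch directly: compute the squared error of Eq. 7 against every trained reference, take the minimum, and accept the winner only after it has persisted for two seconds. Names and the millisecond bookkeeping are illustrative.

// Histogram classifier: Eq. (7) against each trained gesture, winner = argmin,
// reported only after winning continuously for 2 seconds.
final class GestureClassifier {
    private final double[][] references;   // one trained reference histogram h* per gesture
    private int lastWinner = -1;
    private long winnerSinceMs = 0;

    GestureClassifier(double[][] references) { this.references = references; }

    static double squaredError(double[] reference, double[] h) {
        double e = 0.0;
        for (int i = 0; i < reference.length; i++) {
            double d = reference[i] - h[i];
            e += d * d;                     // Eq. (7)
        }
        return e;
    }

    /** Returns the gesture index once it has won for two consecutive seconds, else -1. */
    int classify(double[] h, long nowMs) {
        int winner = 0;
        for (int g = 1; g < references.length; g++) {
            if (squaredError(references[g], h) < squaredError(references[winner], h)) winner = g;
        }
        if (winner != lastWinner) {
            lastWinner = winner;
            winnerSinceMs = nowMs;
        }
        return (nowMs - winnerSinceMs >= 2000) ? winner : -1;
    }
}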

V. CONTROL

In Fig. 6a we show AR-Drone’s flight altitude relative tothe ground (g) and the three fixed axis that are used to describeany aircraft’s pose on air; the AR-Drone can estimate its posethanks to the onboard inertial measurement unit, and it makesit available to us through the navigation data stream. For eachof these references there is one control variable we can modifyto govern the AR-Drone. If we modify the variable related tothe roll axis, we make the AR-Drone go sideways, the pitchvariable will make it go forward or backwards, the yaw variable

Yaw

Pitch

Roll h

r

l

g

Video frame

d

(a) Control scheme.

(b) A third-degree polynomial adjustment to all experimentalmeasurements.

Fig. 6: Control scheme and distance estimation to the user,relative to AR-Drone’s front camera.

will make it spin and the altitude variable will make it go upand down.

For each of these variables we used a closed-loop PID control scheme. To begin with, we used the estimated flight altitude received from the drone (g, see Fig. 6a) as input to the altitude controller. After experimental PID tuning, the drone was able to maintain a stable flight altitude; it is worth mentioning that the control loop's refresh rate is twice the navigation data reception rate, approximately 16 Hz.

With the tracking algorithm (Sec. III) we obtain the user's segmented area, its bounding box dimensions and its centroid, from which we can estimate the user's position relative to the drone. In Fig. 6a a gray rectangle depicts a possible user position and its distance to the video frame's center (r). This makes it possible for a PID controller to correct the yaw axis, taking as input the distance in pixels between the center of the frame and the upper body's centroid (r).
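Each of these loops is a textbook PID; the sketch below shows the form we have in mind, with the yaw loop fed by the pixel error r as an example. The gains are placeholders, since the paper only states that they were tuned experimentally.

// Generic PID loop; one instance per controlled variable (altitude, yaw, pitch).
final class PidController {
    private final double kp, ki, kd;
    private double integral = 0.0, previousError = 0.0;

    PidController(double kp, double ki, double kd) {
        this.kp = kp; this.ki = ki; this.kd = kd;
    }

    /** error = setpoint - measurement; dt in seconds (roughly 1/16 s here). */
    double update(double error, double dt) {
        integral += error * dt;
        double derivative = (error - previousError) / dt;
        previousError = error;
        return kp * error + ki * integral + kd * derivative;
    }
}

// Example: correct yaw from the horizontal pixel offset of the user's centroid.
//   double yawCommand = yawPid.update(frameCenterX - centroidX, 1.0 / 16.0);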

Figure 6b shows five different photos of the user; the reader can observe that the farther the user moves away from the UAV, the smaller the bounding box becomes. In order to estimate the distance d to the user, we mapped the height h of the bounding box to d in meters (see Fig. 6b) with a third-degree polynomial, fitted to several experimental measurements. Fig. 6a shows two gray areas of different heights with l < h; it also illustrates that the closer the user is to the drone, the bigger h becomes. The difference between the estimated distance d and a reference value (1.5 meters in our case) serves as input to the PID controller for the UAV's pitch axis.
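Once the cubic has been fitted offline, the online distance estimate is a single polynomial evaluation; the coefficients below are placeholders for the experimentally fitted values.

// Map the bounding-box height h (pixels) to an estimated distance d (meters)
// with a third-degree polynomial, evaluated in Horner form.
final class DistanceEstimator {
    private final double[] c;   // d = c[0] + c[1]*h + c[2]*h^2 + c[3]*h^3

    DistanceEstimator(double[] cubicCoefficients) { this.c = cubicCoefficients; }

    double metersFromBoxHeight(double h) {
        return c[0] + h * (c[1] + h * (c[2] + h * c[3]));
    }
}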

VI. RESULTS

The reader is welcome to watch a video on YouTube named "IHRVANT"4, which shows how the UAV follows the user and reacts to body gestures; it also includes a link to another video showing how we used the tablet's touch capabilities to initialize all segmentation processes. Since our tracking algorithm works alongside the video decoder, we reached more than sufficient performance, all measured in frames per second (FPS). Consider that the UAV sends video at 18 FPS; color and gray tone segmentation ran at 14 FPS and the particle filter (I = 200, l = 7, n = 4) at 7 FPS, and after the fusion of results we reached an overall 14 FPS. For gesture classification, trying roll angles in the [0°, 15°] range, we measured our classifier's performance statistically. By measuring the proportion of actual positives correctly identified as such, we reached a sensitivity of 89%; on the other hand, by measuring the proportion of actual negatives identified as such, we reached a specificity of 90%. It is worth mentioning that we did not use architecture-specific instructions, such as NEON instructions or GPU capabilities, to achieve this performance.

VII. CONCLUSION

We presented a feasible implementation of a portable human-machine interface with a UAV, capable of successfully tracking the user indoors throughout the video stream, without many lighting restrictions and with a dynamically changing background. We also showed how to accomplish a vision-based control system on recent portable devices, capable of autonomously driving the UAV without the need for high-performance computing. The AR-Drone is not able to carry any significant additional payload, which made it impossible for us to add exteroceptive sensors onboard the drone; these would have made the system more resilient, and doing so is part of our future work. Our main effort is driven by the idea that sooner or later, robots will take a step forward into our everyday social interaction. We are getting there...

4 Look for this video in the uploaded videos section of one of the author's YouTube channels: www.youtube.com/danielsoto888

REFERENCES

[1] Y. Chen, S. Yu, J. Fan, W. Chen, and H. Li. An improved color-based particle filter for object tracking. In Genetic and Evolutionary Computing, 2008. WGEC '08. Second International Conference on, pages 360-363, Sept. 2008.

[2] B. A. Duncan, R. R. Murphy, D. Shell, and A. G. Hopper. A midsummer night's dream: Social proof in HRI. In Human-Robot Interaction (HRI), 2010 5th ACM/IEEE International Conference on, pages 91-92, March 2010.

[3] R. Fabbri, L. F. Da Costa, J. C. Torelli, and O. M. Bruno. 2D Euclidean distance transform algorithms: A comparative survey. ACM Comput. Surv., 40(1):2:1-2:44, February 2008.

[4] F. Hara. Artificial emotion of face robot through learning in communicative interactions with human. In Robot and Human Interactive Communication, 2004. ROMAN 2004. 13th IEEE International Workshop on, pages 7-15, Sept. 2004.

[5] K. Higuchi, T. Shimada, and J. Rekimoto. Flying sports assistant: external visual imagery representation for sports training. In Proceedings of the 2nd Augmented Human International Conference, AH '11, pages 7:1-7:4, New York, NY, USA, 2011. ACM.

[6] Wai Shan Ng and E. Sharlin. Collocated interaction with flying robots. In RO-MAN, 2011 IEEE, pages 143-149, July 31-Aug. 3, 2011.

[7] P. Perez, C. Hue, J. Vermaak, and M. Gangnet. Color-based probabilistic tracking. In Proc. ECCV, pages 661-675, 2002.

[8] Y. Satoh, T. Okatani, and K. Deguchi. A color-based tracking by Kalman particle filter. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 3, pages 502-505, Aug. 2004.

[9] D. S. Syrdal, Kheng Lee Koay, M. L. Walters, and K. Dautenhahn. A personalized robot companion? - The role of individual differences on spatial preferences in HRI scenarios. In Robot and Human Interactive Communication, 2007. RO-MAN 2007. The 16th IEEE International Symposium on, pages 1143-1148, Aug. 2007.

[10] S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics (Intelligent Robotics and Autonomous Agents). The MIT Press, 2005.

[11] T. Zhang, S. Fei, X. Li, and H. Lu. An improved particle filter for tracking color object. In Intelligent Computation Technology and Automation (ICICTA), 2008 International Conference on, volume 2, pages 109-113, Oct. 2008.
