
Feature Segmentation for Object Recognition Using Robot Manipulation

Oleg O. Sushkov and Claude Sammut
School of Computer Science and Engineering
University of New South Wales, Australia

[email protected], [email protected]

Abstract

Object recognition and localisation is vital for many robotics applications, such as household service robotics. However, it is not always possible to provide information about an object a priori. Instead we aim to have the robot autonomously acquire information about an object for the purposes of recognition and localisation. We present a system combining robot-initiated object motion, stereo vision, and long-term feature tracking to effectively extract object features from scene images. This allows a robot to learn to recognise previously unseen objects in the presence of clutter, noise and background motion.

1 Introduction

This paper presents a method for a humanoid robot to actively learn to recognise a previously unseen object instance. Our method uses motion to separate object features from the background in order to build an object feature database. The robot picks up an object and performs a series of motions, while stereo vision and local image features, specifically the Scale Invariant Feature Transform (SIFT) [Lowe, 2004], are used to gather trajectory data for each feature over multiple frames. Our contribution is demonstrating that by tracking stable image features over many frames we can effectively separate object and background features in a dynamic, highly cluttered environment (example scene in Figure 1). These features can then be used for building an all-aspect object model [Cyr and Kimia, 2004], allowing recognition, localisation, and effective manipulation of the object by the robot.

Most object recognition methods are split into two parts: learning the appearance of an object, and later using the learned data to recognise the object in a scene. Training data are typically segmented views of the object where it is clear which image regions belong to the object. These data are typically provided a priori.

Figure 1: Robot view of the arm gripping an object in a cluttered environment.

However, in many scenarios (e.g., a household service robot) this is not desirable, as it is impossible to predict which objects may be encountered. Instead we would like the robot to learn to recognise new objects autonomously. This can be done by generating training data autonomously, segmenting object features from the background. The challenges include separating background features from object features, separating robot arm and object features, and being robust to changes in arm appearance and to background motion. This paper addresses the issue of finding object features in a scene.

The remainder of this paper presents a review of related work, a description of our algorithm, an evaluation of the effectiveness of our approach, and finally future work.

2 Related Work

Object recognition is crucial for robotics applications such as household service robotics [Kemp et al., 2007], with some recognition methods using local image descriptors [Mikolajczyk and Schmid, 2005] such as SIFT [Lowe, 2004] for robust detection and localisation in cluttered scenes [Collet et al., 2009].


SIFT is a scale and rotation invariant interest point detector which generates highly discriminatory feature vectors for image interest points. However, these methods require accurately segmented views of the objects to use as training examples. In the case of an autonomous robot working in a dynamic environment, it may not be feasible to provide training examples for all objects it may encounter.

One solution is to have the robot generate the required training examples itself. Specifically, the robot must generate segmented views of an unfamiliar object, separating the object from the background. There are several static image segmentation algorithms which may be used for this purpose [Beveridge et al., 1989; Shi and Malik, 2000]. However, these may not be effective in complex cluttered environments due to ambiguity between object and background image regions. An alternative is to take an active approach, where the robot can resolve object-background ambiguities by performing a probing action. Background subtraction [Stauffer and Grimson, 1999] can be used for segmenting object views by modelling the background and then placing the object in the camera view with the robot manipulator [Ude et al., 2008]. Optical flow [Horn and Schunck, 1981] can be used to track scene motion as the robot manipulator nudges the target object [Fitzpatrick, 2003; Kenney et al., 2009]. The resulting motion provides a cue for the segmentation of the object image region.

The disadvantage of these approaches is their inability to robustly handle background motion, making them unsuitable for a dynamic environment. The challenges to be addressed include:

1. separating object features from a cluttered background,

2. separating object and robot gripper features,

3. robustness to background motion, and

4. robustness to changes in the visual appearance of the robot arm.

We solve these by tracking the trajectory of each feature over a time period long enough to allow effective segmentation of object and background features.

3 Feature Segmentation Algorithm

3.1 Overview

Our algorithm allows a robot to autonomously find object image features in a dynamic unstructured environment; these can later be used for object recognition and localisation. We do this by using the robot arm to move the object through a scene, recording this motion with a stereo camera, extracting long-term feature trajectories from the resulting video stream, and finally using this data to extract the object features.

Figure 2: Robot and workspace setup. The robot has a stereo camera on a pan-tilt unit and a 6-DOF arm with a gripper attached.

We track the motion in 3D world space rather than in 2D image space as this provides more information to determine whether a feature belongs to the background or the object. The use of SIFT features, which can be reliably correlated from frame to frame, allows a feature to be tracked through multiple frames. This provides more information for segmentation as compared to instantaneous motion in the case of optical flow methods.

The rest of this section describes the different stages of our algorithm. These are:

1. extracting and correlating SIFT features from the stereo video stream (described in sub-section 3.2),

2. tracking the trajectory of each feature during the motion (described in sub-section 3.3),

3. periodically taking snapshots of object features when sufficient trajectory data are available (described in sub-section 3.3),

4. separating arm and object features (described in sub-section 3.4),

5. filtering out background motion by ignoring features whose motion differs from the trajectory of the arm (described in sub-section 3.5), and

6. finally, improving the separation of arm and object features by comparing the feature snapshots from different objects and removing common features (described in sub-section 3.6).

The platform we used for implementing and testing our algorithm consists of a Bumblebee2 stereo camera and a 6-DOF arm with a gripper attached; a workspace is accessible in front of the robot (see Figure 2).



We do not give details on grasping [Bicchi and Kumar, 2000] an object with the robot gripper. We use objects that are easy to grip by aligning the gripper with the long axis of the object. Additionally, we assume that a previously unknown object can be located in the scene in order to perform the initial grasp. This can be done by detecting contrasting cues such as vertical displacement above a flat table top.

3.2 Stereo Feature Generation

Our algorithm tracks stereo features, which are pairs of matching SIFT features from the left and right images of the stereo camera and a corresponding 3D world position. A SIFT feature is characterised by a highly discriminatory 128-dimensional description vector and an image gradient orientation at the feature point.

Algorithm 1 Finding stereo features

leftFtrs ← SIFTFeatures(leftImage)
rightFtrs ← SIFTFeatures(rightImage)
stereoFtrs ← [ ]
for all lFtr in leftFtrs do
    epipolarFtrs ← epipolarFeatures(rightFtrs, lFtr)
    rFtr ← findNearest(epipolarFtrs, lFtr)
    oriDist ← oriDistance(lFtr, rFtr)
    if oriDist ≤ oriThreshold then
        fvDist ← ftrVecDistance(lFtr, rFtr)
        if fvDist ≤ ftrVecThreshold then
            worldPos ← findPos(lFtr, rFtr)
            newStereoFtr ← {lFtr, rFtr, worldPos}
            stereoFtrs ← newStereoFtr : stereoFtrs
        end if
    end if
end for

Algorithm 1 describes the procedure for finding the stereo features in a frame. We extract the SIFT features for both the left and right images and match them by comparing their image positions, description vectors and orientations. For each feature in the left image, we search the right image for a feature on the same epipolar line with the closest description vector by Euclidean distance. If this pair of left and right image features have their orientations and description vectors within a threshold distance, they are said to match and form a new stereo feature.
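As an illustration only, the following Python sketch shows one way this matching step could be implemented, assuming a rectified stereo pair (so the epipolar constraint reduces to comparing image rows), OpenCV's SIFT implementation, and a placeholder triangulate function standing in for the stereo geometry. The threshold constants are the values derived below; this is a minimal sketch, not the implementation used in our system.

import numpy as np
import cv2

ORI_THRESHOLD = 13.0   # degrees, orientation threshold derived below
FV_THRESHOLD = 300.0   # SIFT descriptor distance threshold derived below
EPIPOLAR_TOL = 2.0     # pixels; rectified images, so matching features share a row

def find_stereo_features(left_img, right_img, triangulate):
    """Pair left/right SIFT features into stereo features (sketch of Algorithm 1)."""
    sift = cv2.SIFT_create()
    l_kp, l_desc = sift.detectAndCompute(left_img, None)
    r_kp, r_desc = sift.detectAndCompute(right_img, None)
    if l_desc is None or r_desc is None:
        return []

    stereo_ftrs = []
    r_rows = np.array([kp.pt[1] for kp in r_kp])
    for i, lkp in enumerate(l_kp):
        # Candidate right features on (approximately) the same epipolar line.
        on_line = np.where(np.abs(r_rows - lkp.pt[1]) <= EPIPOLAR_TOL)[0]
        if len(on_line) == 0:
            continue
        # Nearest right descriptor by Euclidean distance.
        dists = np.linalg.norm(r_desc[on_line] - l_desc[i], axis=1)
        j = on_line[np.argmin(dists)]
        ori_dist = abs(lkp.angle - r_kp[j].angle) % 360.0
        ori_dist = min(ori_dist, 360.0 - ori_dist)
        if ori_dist <= ORI_THRESHOLD and dists.min() <= FV_THRESHOLD:
            world_pos = triangulate(lkp.pt, r_kp[j].pt)  # 3D position from the stereo pair
            stereo_ftrs.append((lkp, r_kp[j], l_desc[i], world_pos))
    return stereo_ftrs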

The performance of this algorithm is dependent on the choice of the orientation threshold (oriThreshold) and the feature vector threshold (ftrVecThreshold) values. We determined suitable values by manually finding corresponding SIFT feature pairs in a large number of stereo images and recording the distance between their description vectors and their orientation difference.

Figure 3: This shows the percentage of corresponding and non-corresponding SIFT stereo feature pairs with description vectors within a given distance threshold.

We found the mean orientation difference to be 3.7° with a standard deviation of 3.1°; we set the orientation threshold to be x̄ + 3σ (which is equal to 13°). For the description vector distance threshold we consider the percentage of corresponding stereo feature pairs with description vector distance less than x : {0 ≤ x ≤ 700} (we found that no feature description vector distance exceeded 700), as compared to random non-corresponding feature pairs. Figure 3 shows this relationship in detail. We set the feature vector threshold to 300, as we found that 90% of corresponding stereo feature pairs have description vectors with a distance less than this, while the average distance between non-matching feature pairs is 534 with a standard deviation of 65.

Evaluating these threshold parameters, we found that the rate of incorrect feature correspondence was 1.3%. The above values should be applicable to a variety of scenes and environments, since SIFT features are not dependent on the overall scene composition and are robust to lighting changes. However, cameras of varying quality, frame rate and baseline length may require different parameters.

3.3 Feature Tracking and Snapshotting

The purpose of the feature tracking module is to track each stereo feature across multiple frames, dealing with intermittent feature visibility and motion of the features in 3D space. This is done by maintaining a set of feature trajectories, each an ordered list of (Feature, DetectTime) tuples, where DetectTime refers to when the given feature was detected. The ultimate aim of keeping track of the feature trajectories is to determine which features are part of the background and which are part of the object held by the gripper.
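As a rough illustration of this bookkeeping, the following Python sketch represents a trajectory as a newest-first list of (feature, detection time) tuples; the class and field names, and the displacement helper, are assumptions made for the example rather than part of our implementation.

from dataclasses import dataclass, field
from typing import List, Tuple
import numpy as np

@dataclass
class StereoFeature:
    descriptor: np.ndarray   # 128-D SIFT description vector
    orientation: float       # image gradient orientation, degrees
    world_pos: np.ndarray    # 3D position from stereo triangulation

@dataclass
class Trajectory:
    # Newest observation first: ordered list of (feature, detect_time) tuples.
    observations: List[Tuple[StereoFeature, float]] = field(default_factory=list)

    def head(self) -> Tuple[StereoFeature, float]:
        """Most recent observation of this trajectory."""
        return self.observations[0]

    def displacement(self) -> float:
        """Total 3D distance between the newest and oldest observation."""
        newest, _ = self.observations[0]
        oldest, _ = self.observations[-1]
        return float(np.linalg.norm(newest.world_pos - oldest.world_pos))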


Algorithm 2 Matching a stereo feature to a trajectory

Input: allTrjctrs
Input: nFtr
nearTrjctry ← φ
bestFvDist ← ∞
for all trjctry in allTrjctrs do
    (tFtr, tDetectTime) ← head(trjctry)
    posDist ← worldPosDist(nFtr, tFtr)
    oriDist ← oriDistance(nFtr, tFtr)
    fvDist ← ftrVecDistance(nFtr, tFtr)
    if fvDist ≤ min(ftrVecThreshold, bestFvDist) then
        if posDist ≤ distThreshold(tDetectTime) then
            if oriDist ≤ oriThreshold then
                nearTrjctry ← trjctry
                bestFvDist ← fvDist
            end if
        end if
    end if
end for
if nearTrjctry ≠ φ then
    nearTrjctry ← (nFtr, cTime) : nearTrjctry
else
    allTrjctrs ← [(nFtr, cTime)] : allTrjctrs
end if

We maintain a set of visible feature trajectories; each new stereo feature is inserted into an existing matching trajectory, or, if none match, a new trajectory is created. Algorithm 2 describes this process. Matching a new feature to an existing trajectory relies on spatial locality and descriptor locality. When inserting a feature, we consider its description vector, orientation and 3D world coordinates. Successive features in a trajectory should have spatial locality in 3D world space and similar feature description vectors and orientations, as they should be centred on the same point on an object. When inserting a new stereo feature we try to find a trajectory whose most recent feature is within a threshold distance in world space (distThreshold), whose orientation differs by less than a threshold amount (oriThreshold), and whose description vector is within a threshold distance (ftrVecThreshold) and is the closest such descriptor among all candidate trajectories.

The distThreshold depends on how long ago the latest feature of the trajectory was detected. Since no point on the held object should move faster than the arm, we set the distance threshold to be:

ArmSpeed × (CurTime − DetectTime)    (1)

where ArmSpeed is the maximum arm movement speed, CurTime is the current time, and DetectTime is the time when the particular trajectory was last updated with a new feature.
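The following Python sketch illustrates how Algorithm 2 could look with the time-dependent distance threshold of Equation (1). Trajectories are represented as newest-first lists of (descriptor, orientation, world position, detection time) tuples, and the ArmSpeed value shown is illustrative rather than the speed used on our robot; the threshold constants are those derived below.

import numpy as np

ARM_SPEED = 5.0        # assumed maximum gripper speed, world units per unit time (illustrative)
FV_THRESHOLD = 250.0   # descriptor distance threshold derived below
ORI_THRESHOLD = 5.0    # orientation threshold in degrees, derived below

def match_feature_to_trajectory(trajectories, descriptor, orientation, world_pos, cur_time):
    """Insert a new stereo feature into the best matching trajectory (sketch of Algorithm 2).

    Each trajectory is a newest-first list of
    (descriptor, orientation, world_pos, detect_time) tuples.
    """
    best, best_fv = None, np.inf
    for traj in trajectories:
        t_desc, t_ori, t_pos, t_time = traj[0]   # most recent observation
        fv_dist = np.linalg.norm(descriptor - t_desc)
        pos_dist = np.linalg.norm(world_pos - t_pos)
        ori_dist = abs(orientation - t_ori) % 360.0
        ori_dist = min(ori_dist, 360.0 - ori_dist)
        # Time-dependent spatial gate: the held object cannot outrun the arm (Eq. 1).
        dist_threshold = ARM_SPEED * (cur_time - t_time)
        if (fv_dist <= min(FV_THRESHOLD, best_fv)
                and pos_dist <= dist_threshold
                and ori_dist <= ORI_THRESHOLD):
            best, best_fv = traj, fv_dist
    if best is not None:
        best.insert(0, (descriptor, orientation, world_pos, cur_time))
    else:
        trajectories.append([(descriptor, orientation, world_pos, cur_time)])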

Figure 4: These graphs show the increasing description vector and orientation difference between the most recent stereo feature of a trajectory and older features of the same trajectory.

We determined appropriate values for the other thresholds by examining the variability of the description vector and orientation of features during linear movement (movement such that the object remains in the same orientation in the image). We plot the average description vector distance and orientation difference between a feature and its corresponding match in each of the previous 30 frames; this is shown in Figure 4. We can see that after 3 frames the average feature has its description vector shift by approximately 100 units with a standard deviation of 50, and the orientation changes by 1.3° with a standard deviation of 1.2°.


We set our thresholds for matching a feature to a trajectory to be x̄ + 3σ for both the feature descriptor and orientation difference. This corresponds to a value of 250 for the feature vector threshold (ftrVecThreshold) and 5° for the orientation threshold (oriThreshold). These values should be applicable to a variety of environments, as SIFT keypoints have been shown to be robust to the addition of noise and to illumination changes [Lowe, 2004].

The purpose of tracking feature trajectories is to use motion to separate foreground object features from the background; these features then form a feature database used for object recognition. While the object is being moved by the robot, we periodically take snapshots of the leading features of trajectories which have a displacement over a certain threshold (see Figure 5). The intent is to use the snapshots of object features to build a database of object features for later recognition and localisation. The frequency of feature snapshots depends on several factors: the camera frame rate, the speed with which the robot moves the object, and the desired level of detail for the learned object feature database. We perform a snapshot every five frames, inserting into the object feature database the latest features of trajectories which have moved in 3D world space at least ArmSpeed × (TimeSinceLastSnapshot − 1) since the last snapshot.
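A minimal sketch of this snapshot rule follows, under the same assumed trajectory representation as above and with times measured in frames; the way the position at the previous snapshot is recovered here is an approximation made for the example.

import numpy as np

ARM_SPEED = 5.0   # assumed maximum gripper speed, world units per frame (illustrative)

def snapshot_features(trajectories, last_snapshot_time, cur_time):
    """Collect the leading feature of every trajectory that has moved roughly with the arm.

    Each trajectory is a newest-first list of (descriptor, orientation, world_pos, detect_time)
    tuples. A trajectory qualifies if its 3D displacement since the last snapshot is at least
    ArmSpeed * (TimeSinceLastSnapshot - 1), as described above.
    """
    min_displacement = ARM_SPEED * (cur_time - last_snapshot_time - 1)
    snapshot = []
    for traj in trajectories:
        newest_pos = traj[0][2]
        # Position of this feature at (or just before) the previous snapshot.
        older = [obs for obs in traj if obs[3] <= last_snapshot_time]
        ref_pos = older[0][2] if older else traj[-1][2]
        if np.linalg.norm(newest_pos - ref_pos) >= min_displacement:
            snapshot.append(traj[0])   # latest (descriptor, orientation, pos, time)
    return snapshot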

3.4 Separating Arm and Object Features

The result of the previous step is a set of snapshots of arm and object features, separated from the background using motion tracking. The next step is to separate the object features from the arm features in each snapshot. We do this by using a known database of arm SIFT features. We generate this database in a training stage where the robot performs the previous set of steps without holding an object. In this way, the robot generates a database of arm SIFT features, as the only features which would be captured belong to the arm. An alternative method would be to use a human-labelled set of training images of the arm, with the disadvantage of being time consuming for the human operator and lacking in robot autonomy.

Once a robot arm feature database is generated, we can use it to filter out arm features from the feature snapshots, leaving behind object features. We use Lowe's [2004] method for SIFT feature-to-database matching. SIFT description vectors are not equally discriminatory, and a simple Euclidean distance for feature matching is not sufficient for comparing against a database of features generated under potentially different lighting conditions and orientations. The solution is to search the arm feature database for the nearest and second-nearest features to the snapshot feature.

Figure 5: This image shows the trajectories of stereo features in green; the ones on the robot arm and the held object have long trails because they are moving through the scene. The red markers indicate stereo features which form a snapshot, having moved over a threshold amount. Note that this is before arm features are separated from object features.

If the Euclidean distance of the snapshot feature to the nearest database arm feature is less than 80% of the distance to the second-nearest feature, then the snapshot feature matches the arm feature. We remove from each snapshot all features which match an arm feature; the remaining features are labelled as object features. The justification for this approach is that the density of features in the neighbourhood of a feature indicates how discriminatory that feature's description vector is. The 80% value was determined experimentally by Lowe [2004] to give the optimal trade-off between false positives and false negatives.
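The following sketch illustrates this ratio-test filtering against an arm feature database, assuming the descriptors are stored as NumPy arrays and the database holds at least two features; the 0.8 ratio follows Lowe [2004], while the function and variable names are invented for the example.

import numpy as np

RATIO = 0.8   # Lowe's nearest / second-nearest distance ratio

def remove_arm_features(snapshot_descriptors, arm_db):
    """Drop snapshot features that pass the ratio test against the arm database.

    snapshot_descriptors: (M, 128) array of SIFT descriptors from one snapshot.
    arm_db:               (N, 128) array of descriptors learned with an empty gripper (N >= 2).
    Returns the indices of features kept as object features.
    """
    kept = []
    for i, desc in enumerate(snapshot_descriptors):
        dists = np.linalg.norm(arm_db - desc, axis=1)
        order = np.argsort(dists)
        nearest, second = dists[order[0]], dists[order[1]]
        # Matches an arm feature only if clearly closer to it than to any other arm feature.
        if nearest < RATIO * second:
            continue              # labelled as an arm feature, discard
        kept.append(i)
    return kept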

3.5 Filtering Background Motion

So far we have assumed a largely static background, where the only significant motion in the scene is due to the robot arm moving with a held object. In a dynamic environment, significant background motion must be filtered out in order for the trajectory snapshot method to work.

Background motion can be filtered in several ways. First, distant features outside the robot's workspace can be removed by examining their 3D world position. A further refinement is to use arm kinematics to determine the gripper position in each frame and only consider features near this point. In our case, however, the accuracy of the kinematics while the arm was in motion was poor due to lag and camera-arm synchronisation difficulties, and not sufficient to filter out background motion in close proximity to the gripper.

Our solution to this problem is to use the fact that the gripper and the features of the held object follow similar trajectories. If the robot gripper features are known, we can compare the trajectory paths of the arm features to all others and remove trajectories with dissimilar paths, thus filtering out background motion.


Prior to generating a trajectory snapshot, the arm feature trajectories are found by comparing the features comprising each trajectory against the database of arm features. The remaining trajectories are then compared to the arm trajectories by comparing their 3D world paths. To compare two trajectories, a list of feature positions is extracted from each and normalised to the origin as follows:

p_n, p_{n+1}, ..., p_m → (p_n − p_m), (p_{n+1} − p_m), ..., (p_m − p_m)    (2)

where p_n is the position of a trajectory feature at frame n. To calculate the difference between two trajectories, we take the average distance between positions on matching frames across the two normalised position lists:

(1/n) Σ_a |p_a − q_a|    (3)

where n is the number of matching frames, p_a is the normalised position of the first trajectory at frame number a, and q_a is the normalised position of the second trajectory at the same frame number. We remove any feature trajectories which are not within a threshold distance of an arm feature trajectory as determined by the above distance measure. We determined the appropriate threshold by considering the variability of trajectories of a rigid object. We found that the average distance between feature trajectories on a rigid object, using measure (3), is 0.21 cm with a standard deviation of 0.25 cm. We set the threshold to be x̄ + 3σ, which is 0.96 cm.
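The trajectory comparison of Equations (2) and (3), together with the threshold test, could be sketched as follows; aligning the two paths over their most recent common frames is an approximation made for the example, and the threshold constant is the value derived above.

import numpy as np

TRAJ_DIST_THRESHOLD = 0.96   # cm, the x̄ + 3σ threshold derived above

def normalise(path):
    """Shift a trajectory's 3D positions so its last position sits at the origin (Eq. 2)."""
    path = np.asarray(path, dtype=float)
    return path - path[-1]

def trajectory_distance(path_a, path_b):
    """Average 3D distance between two normalised paths over their common frames (Eq. 3)."""
    a, b = normalise(path_a), normalise(path_b)
    n = min(len(a), len(b))
    return float(np.mean(np.linalg.norm(a[-n:] - b[-n:], axis=1)))

def filter_background_motion(candidate_paths, arm_paths):
    """Keep only paths whose motion resembles at least one known arm-feature path."""
    kept = []
    for path in candidate_paths:
        if any(trajectory_distance(path, arm) <= TRAJ_DIST_THRESHOLD for arm in arm_paths):
            kept.append(path)
    return kept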

It should be possible to extend our approach of separating background motion from object motion to utilise accurate arm kinematics when available. We compare feature trajectories to gripper feature trajectories to find background motion. However, if accurate gripper kinematics are available, we could compare feature trajectories to the gripper position trajectory derived directly from the kinematics. We leave this for future work.

It is important to note that this method of background motion filtering can only be used if the motion performed by the robot is a translation, rather than a rotation. In the case of a rotation, the path of each feature depends on its distance from the axis of rotation, which is not suitable for this approach. However, SIFT features cannot be reliably tracked through more than a few degrees of rotation (out of the camera plane). As a result, to gather long-term feature trajectory data, we are restricted to translation motion anyway. In the case of learning features of multiple aspects of the object, we would need to perform a translation motion for each separate aspect.

3.6 Refining the Segmentation

Up to this point we have assumed that we can reliably separate the robot arm and object features using a known database of arm features (sub-section 3.4). However, due to various types of noise, lighting changes, or actual changes in arm appearance (for example dirt or other wear), some arm features may fail to match the stored database and instead be mislabelled as object features. This would have a detrimental effect on the accuracy of object recognition using the generated database of features.

We avoid this problem by performing some post-processing of the database features. We compare snapshots of features of different objects to one another and remove any common features. This is because similar features appearing in snapshots of different objects are likely to be misclassified arm or background features (assuming that the different objects are visually unique from one another).

When a set of feature snapshots has been generated for a number of objects, we consider each object in turn and all of the feature snapshots of that object are compared against the snapshots of the remaining objects. Call the set of snapshots belonging to the current object A, and the set of snapshots belonging to the remaining objects B. Every snapshot a ∈ A is compared against a number of randomly sampled snapshots b ∈ B, searching for matching features; the number of snapshots b ∈ B that are considered varies depending on the desired accuracy and time constraints.

When comparing snapshots, we search for features that match. The matching criterion is similar to that used for matching arm features: feature f_a from snapshot a matches feature f_b from snapshot b if f_b is the nearest neighbour to f_a of all features occurring in b using the SIFT descriptor distance metric, and this distance is less than 80% of the distance to the second nearest neighbour. All features f_a with a matching feature in b are marked as arm features. If some of the objects are expected to have a similar appearance, and thus share similar SIFT features, this criterion can be extended to only mark features f_a which match multiple snapshots from different objects.
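A sketch of this cross-object refinement step, assuming each snapshot is stored as an array of SIFT descriptors; the sample size and helper names are illustrative rather than taken from our implementation.

import numpy as np
import random

RATIO = 0.8        # same nearest / second-nearest criterion as for arm matching
NUM_SAMPLED = 10   # assumed number of other-object snapshots to compare against

def refine_object_snapshot(snapshot_desc, other_object_snapshots):
    """Remove features that also match snapshots of other objects (sketch of Section 3.6).

    snapshot_desc:           (M, 128) descriptors from one snapshot of the current object.
    other_object_snapshots:  list of (K_i, 128) descriptor arrays from the remaining objects.
    """
    snapshot_desc = np.asarray(snapshot_desc)
    sampled = random.sample(other_object_snapshots,
                            min(NUM_SAMPLED, len(other_object_snapshots)))
    keep = np.ones(len(snapshot_desc), dtype=bool)
    for other in sampled:
        if len(other) < 2:
            continue   # ratio test needs at least two candidate features
        for i, desc in enumerate(snapshot_desc):
            dists = np.linalg.norm(other - desc, axis=1)
            order = np.argsort(dists)
            nearest, second = dists[order[0]], dists[order[1]]
            if nearest < RATIO * second:
                keep[i] = False   # shared feature: likely a mislabelled arm or background feature
    return snapshot_desc[keep]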

4 Evaluation

We tested the effectiveness of our algorithm by examining the effect of each separate refinement on the feature segmentation quality during training, as well as the recognition performance of the feature database that was generated. To test the object feature segmentation, the robot picked up an object and moved it with its gripper along a square-shaped path of side length approximately 50 cm. During this movement, 18 feature snapshots were taken, with the number of arm and object features detected being recorded for each snapshot.


Figure 6: Some of the objects used to test the robot's learning of object recognition.

                         Feature Detection
True Arm Features                      328
True Object Features                   378
False Arm Features                       8
False Object Features                   40

Table 1: The average number of correctly and incorrectly identified object and arm features during the learning phase.

We tested with ten different objects, simple shapes such as soft drink cans and cubes of edge length 7 cm with an image on the side (Figure 6).

The baseline result is summarised in Table 1. This shows how successful our method is in terms of separating object features from arm and background features during training. We can see that the number of background and arm features misclassified as object features is low when compared to the number of correctly classified object features.

To test the effectiveness of background motion filtering, we introduced background movement into the scene in the form of a textured object (the same form and size as the held object) moving in a circle (radius 10 cm) in the middle of the workspace. We then compared the performance of our algorithm with background motion filtering disabled and enabled, by counting the number of correct and incorrect arm and object features found during the course of the arm movement (see Table 2). The effect of background motion filtering was a ten-fold reduction in false object features, the filtering largely removing feature trajectories of the object moving in the background. However, there was a slight reduction in the number of correctly detected object features.

To test the feature correlation filtering, the appearance of the arm is altered slightly compared to the robot training stage by attaching a small textured marker (see Figure 7). The features from this marker would not match against the database of arm features generated during training, simulating the scenario of the arm getting dirty or worn. When correlating snapshot features for each object, the snapshots are compared against those of all nine other objects, with 18 snapshots per object.

Motion Filtering          Disabled    Enabled
True Arm Features              289        289
True Object Features           312        282
False Arm Features               3          6
False Object Features          366         38

Table 2: Motion filtering reduces false positive object features in the presence of background motion.

Figure 7: The marker attached to the robot gripper is circled in red; it is used to test the effectiveness of correlation filtering.

Table 3 shows that, as a result of correlation filtering, the number of false object features falls by 81%. This is due to the removal of the texture marker features from the object feature set.

Finally, we tested the object recognition accuracy of the generated object feature database. The robot autonomously learns to recognise each object using our algorithm, adding the object features from each snapshot to a database. A baseline is created by a human manually segmenting six image snapshots of each object; the object features from these are added to a comparison feature database. To compare the performance of the two databases, we take nine image snapshots of each of the ten objects located amongst clutter, with the robot arm visible and with the objects partially occluded.

Correlation               Disabled    Enabled
True Arm Features              253        581
True Object Features           446        494
False Arm Features               1          6
False Object Features          380         72

Table 3: Feature cross-correlation reduces false positive object features when the arm appearance changes.


Figure 8: Performance of object recognition using automatic segmentation vs manual segmentation. The error bars represent the 95% confidence interval.

Each of these images has a manually generated ground truth feature classification, determining which of the features in the image are object features and which are background or arm features. Both the human-generated feature database and the autonomously generated database are tested on this set of images by recording the number of correctly and incorrectly classified object features as a fraction of all object features in each image. For feature matching we use Lowe's [2004] matching method. For the automatically generated database we test the effectiveness of the database after each feature snapshot is added, to determine the improvement of the classification as more snapshots are added (for a maximum of 18 snapshots per object). These results are presented in Figure 8.

The autonomously generated database detected 65% of object features after learning the full series of snapshot features; the human-generated database detected 56% of object features. Neither database incorrectly matched any background features as object features.

5 Discussion

We have presented an effective algorithm for active learning of object recognition on a robot platform using motion-based autonomous feature segmentation. Stable image features localised in 3D world space using stereo vision, combined with motion, provide sufficient data to perform effective segmentation of object features from background features in the presence of background motion and clutter.

However, this algorithm has some shortcomings. The motion the robot performs while holding the object is chosen to be a translation, allowing one particular aspect of the object's appearance to be learned.

It is still possible to learn to recognise an object from multiple viewpoints by performing a separate translation motion for each individual viewpoint. However, this is not ideal. Ideally the robot should rotate the object in place so as to view the object from all sides, allowing multiple aspects to be learned. However, we found that tracking features becomes a problem under rotation (pitch or yaw relative to the camera): features could only be tracked through approximately 15° of rotation. This reduces the trajectory information per feature as compared to translation motion, making the separation of foreground object and arm features from background features less reliable. Furthermore, background motion filtering based on arm feature trajectories becomes ineffective. As a result, to learn the features of an object from multiple viewpoints, we fall back to performing multiple translation motions, each with a slightly different rotation of the robot wrist, rather than a single rotation motion.

6 Conclusion and Future Work

There are many possible applications and extensions of our work. One improvement would be to accurately determine the robot gripper pose in every frame based on the arm kinematics, and use it to improve and simplify both the background motion filtering and the feature tracking. We could compare each feature trajectory against the arm trajectory derived from the kinematics, which may be faster and simpler than our current approach. By considering the gripper motion, feature trajectory tracking could be improved, narrowing down the search area when inserting new features into existing trajectories.

Our algorithm may be used by a robot as a basis to learn to recognise, localise, and manipulate new objects autonomously. One possibility is to use the described approach to extract object features for different aspects and to stitch these together into a single coherent 3D feature point cloud. This would then allow the object's 3D shape to be determined by fitting a model to the points, as well as allowing accurate 3D positioning of the object within a scene. Together this would lead to the robot being able to manipulate and use a previously unknown object.

It may also be worth exploring the use of different types of local image features such as SURF [Bay et al., 2006] or PCA-SIFT [Ke and Sukthankar, 2004] rather than SIFT, as these may provide improved performance in some circumstances.

References

[Lowe, 2004] D. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.


[Cyr and Kimia, 2004] C. Cyr, B. Kimia. A Similarity-Based Aspect-Graph Approach to 3D Object Recognition. International Journal of Computer Vision, 57:5-22, 2004.

[Kemp et al., 2007] C. Kemp, A. Edsinger, E. Torres-Jara. Challenges for robot manipulation in human environments. IEEE Robotics & Automation Magazine, 14:20-29, 2007.

[Mikolajczyk and Schmid, 2005] K. Mikolajczyk, C. Schmid. A Performance Evaluation of Local Descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1615-1630, 2005.

[Collet et al., 2009] A. Collet, D. Berenson, S. Srinivasa, D. Ferguson. Object recognition and full pose registration from a single image for robotic manipulation. IEEE International Conference on Robotics and Automation, pages 48-55, 2009.

[Beveridge et al., 1989] R. Beveridge, J. Griffith, R. Kohler, A. Hanson, E. Riseman. Segmenting images using localized histograms and region merging. International Journal of Computer Vision, 2(3):311-347, 1989.

[Shi and Malik, 2000] J. Shi, J. Malik. Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.

[Stauffer and Grimson, 1999] C. Stauffer, W. Grimson. Adaptive background mixture models for real-time tracking. Computer Vision and Pattern Recognition, 2:252, 1999.

[Ude et al., 2008] A. Ude, D. Omrcen, and G. Cheng. Making object learning and recognition an active process. International Journal of Humanoid Robotics, 5(2):267-286, 2008.

[Horn and Schunck, 1981] B. Horn, B. Schunck. Determining optical flow. Artificial Intelligence, 17:185-203, 1981.

[Fitzpatrick, 2003] P. Fitzpatrick. First contact: an active vision approach to segmentation. Intelligent Robots and Systems Proceedings, 3:2161-2166, 2003.

[Kenney et al., 2009] J. Kenney, T. Buckley, O. Brock. Interactive segmentation for manipulation in unstructured environments. IEEE International Conference on Robotics and Automation, pages 1377-1382, 2009.

[Bicchi and Kumar, 2000] A. Bicchi, V. Kumar. Robotic grasping and contact: a review. IEEE International Conference on Robotics and Automation, 1:348-353, 2000.

[Bay et al., 2006] H. Bay, T. Tuytelaars, L. Van Gool. SURF: Speeded Up Robust Features. Computer Vision - ECCV 2006, pages 404-417, 2006.

[Ke and Sukthankar, 2004] Y. Ke, R. Sukthankar. PCA-SIFT: A More Distinctive Representation for Local Image Descriptors. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2:506-513, 2004.
