
HarvestNet: Mining Valuable Training Data from High-Volume Robot Sensory Streams

Sandeep Chinchali, Evgenya Pergament, Manabu Nakanoya, Eyal Cidon, Edward Zhang, Dinesh Bharadia, Marco Pavone, Sachin Katti

Abstract—Today’s robotic fleets are increasingly measuring high-volume video and LIDAR sensory streams, which can be mined for valuable training data, such as rare scenes of road construction sites, to steadily improve robotic perception models. However, re-training perception models on growing volumes of rich sensory data in central compute servers (or the “cloud”) places an enormous time and cost burden on network transfer, cloud storage, human annotation, and cloud computing resources. Hence, we introduce HARVESTNET, an intelligent sampling algorithm that resides on-board a robot and reduces system bottlenecks by only storing rare, useful events to steadily improve perception models re-trained in the cloud. HARVESTNET significantly improves the accuracy of machine-learning models on our novel dataset of road construction sites, field testing of self-driving cars, and streaming face recognition, while reducing cloud storage, dataset annotation time, and cloud compute time by between 65.7-81.3%. Further, it is between 1.05-2.58× more accurate than baseline algorithms and scalably runs on embedded deep learning hardware.

I. INTRODUCTION

Learning to identify rare events, such as traffic disruptions due to construction or hazardous weather conditions, can significantly improve the safety of robotic perception and decision-making models. While performing their primary tasks, fleets of future robots, ranging from delivery drones to self-driving cars, can passively collect rare training data they happen to observe from their high-volume video and LIDAR sensory streams. As such, we envision that robotic fleets will act as distributed, passive sensors that collect valuable training data to steadily improve the robotic autonomy stack.

The benefits of continually improving robotic models from field data, however, come with severe systems costs that are largely under-appreciated in today’s robotics literature. Specifically, these systems bottlenecks (shown in Fig. 1) stem from the time or cost of (1) network data transfer to compute servers, (2) limited robot and cloud storage, (3) dataset annotation with human supervision, and (4) cloud computing for training models on massive datasets. Indeed, it is estimated that a single self-driving car will produce upwards of 4 Terabytes (TB) of LIDAR and video data per day [1, 7], which we contrast with our own calculations in Section II.

In this paper, we introduce a novel Robot Sensory Sampling Problem (Fig. 1), which asks how a robot can maximally improve model accuracy via cloud re-training, but with minimal end-to-end systems costs. To address this problem, we introduce HARVESTNET, a compute-efficient sampler that scalably runs on even resource-constrained robots and reduces systems bottlenecks by “harvesting” only task-relevant training examples for cloud processing.

In general, the problem of intelligently sampling useful or anomalous training data is of extremely wide scope, since field robotic data contains both known weakpoints of a model as well as unknown weakpoints that have yet to even be discovered. A concrete example of a known weakpoint is shown in Fig. 2, where a fleet operator identifies that construction sites are common errors of a perception model and wishes to collect more of such images, perhaps because they are insufficiently represented in a training dataset. In contrast, unknown weakpoints could be rare visual concepts that are not even anticipated by a roboticist.

To provide a focused contribution, this paper addresses the problem of how robotic fleets can source more training data for known weakpoints to maximize model accuracy, but with minimal end-to-end systems costs. Though of tractable scope, this problem is still extremely challenging since a robot must harvest rare, task-relevant training examples from growing volumes of robotic sensory data. A principal benefit of addressing known model weakpoints is that the cloud can directly use knowledge of task-relevant image classes, which require more training data, to guide and improve a robot’s sampling process. Our key technical insight is to keep robot sampling logic simple, but to leverage the power of the cloud to continually re-train models and annotate uncertain images with a human-in-the-loop, thereby using cloud feedback to direct robotic sampling. Moreover, we discuss how insights from HARVESTNET can be extended to collecting training data for unknown weakpoints in future work.

Fig. 1: The Robot Sensory Sampling Problem. What is the minimal set of “interesting” training samples robots can send to the cloud to periodically improve perception model accuracy while minimizing the system costs of network transfer, cloud storage, human annotation, and cloud computing?


Literature Review: HARVESTNET is closely aligned with cloud robotics [6, 16, 15], where robots send video or LIDAR to the cloud to query compute-intensive models or databases when they are uncertain. Robots can face significant network congestion while transmitting high-datarate sensory streams to the cloud, leading to work that selectively offloads only uncertain samples [11] or uses low-latency “fog computing” nodes for object manipulation and tele-operation [25, 26]. As opposed to focusing solely on network latency for real-time control, HARVESTNET addresses overall system cost (from network transfer to human annotation time) and instead focuses on the open problem of continual learning from cleverly-sampled batches of field data.

Our work is inspired by traditional active learning [27, 22, 12, 14], novelty detection [18, 10], and few-shot learning [30, 28, 20], but differs significantly since we use novel compute-efficient algorithms, distributed between the robot and cloud, that are scalable to run on even resource-constrained robots. Many active learning algorithms [27, 22, 12] use compute-intensive acquisition functions to decide which limited samples a learner should label to maximize model accuracy, which often involve Monte Carlo sampling to estimate uncertainty over a new image. Indeed, the most similar work to ours [14] uses Bayesian Deep Neural Networks (DNNs), which have a distribution over model parameters, to take several stochastic forward passes of a neural network to empirically quantify uncertainty and decide whether an image should be labeled. Such methods are infeasible for resource-constrained robots, which might not have the time to take several forward passes of a DNN for a single image or the capability to efficiently run a distribution over model parameters. Indeed, we show in Section V how our sampler uses a hardware DNN accelerator, specifically the Google Edge Tensor Processing Unit (TPU), to evaluate an on-board vision DNN, with one fixed set of pre-compiled model parameters, in < 16 ms, and decides to sample an image in only ≈ 0.3 ms! More broadly, our work differs since we (1) address sensory streams (as opposed to a pool of unlabeled data), (2) use a compute-efficient model to guide sampling, (3) direct sampling with task-relevant cloud feedback, and (4) also address storage, network, and computing costs.

Statement of Contributions: In light of prior work, our contributions are four-fold. First, we collect months of novel video footage to quantify the systems bottlenecks of processing growing sensory data. Second, we empirically show that large improvements in perception model accuracy can be realized by automatically sampling task-relevant training examples from vast sensory streams. Third, we design a sampler that intelligently delegates compute-intensive tasks to the cloud, such as adaptively tuning key parameters of robot sampling policies via cloud feedback. Finally, we show that our perception models run efficiently on the state-of-the-art Google Edge TPU [5] for resource-constrained robots.

Organization: This paper is organized as follows. Section II motivates the Robot Sensory Sampling Problem by quantifying our system costs for collecting three novel perception datasets. Then, Section III formulates a general sampling problem and introduces the HARVESTNET system architecture (Fig. 3).

Next, Sections IV and V quantify HARVESTNET’s accuracy improvements for perception models, key reductions in system bottlenecks, and performance on the Edge TPU [5]. Finally, we conclude with future applications in Section VI.

Fig. 2: HARVESTNET Re-Trained Vision Models. From only a minimal set of training examples, HARVESTNET adapts default vision models trained on the standard COCO dataset [17] (top images) to better, domain-specific models (bottom images). We flexibly fine-tune mobile-optimized models, such as MobileNet 2, as well as compute-intensive, but more accurate, models such as Faster R-CNN [19].

II. MOTIVATION

We now describe the costs and benefits of continual learning in the cloud by collecting and processing three novel perception datasets.

A. Costs and Benefits of Continual Learning in the Cloud

Benefits: A prime benefit of continual learning in the cloud is to correct weak points of existing models or specialize them to new domains. As shown in Fig. 2 (right, top), standard, publicly-available vision deep neural networks (DNNs) pre-trained on benchmark datasets like MS-COCO [17] can misclassify traffic cones as humans but also miss the large excavator vehicle, which looks like neither a standard truck nor a car. However, as shown in Fig. 2 (right, bottom), a specialized model re-trained in the cloud using automatically sampled, task-relevant field data can correct such errors. More generally, field robots that “crowd-source” interesting or anomalous training data could automatically update high-definition (HD) road maps.

Costs: The above benefits of cloud re-training, however, often come with understated costs, which we now quantify. Table I estimates monthly systems costs for a small fleet of 10 self-driving cars, each of which is assumed to capture 4 hours of driving data per day. In the first scenario, called “Dashcam-ours”, we assume a single car has only one dashcam capturing H.264-compressed video, as in our collected dataset. In the second scenario, called “Multi-Sensor”, we base our calculations off of an Intel report [1] that estimates each SDV will generate 4 TB of video and LIDAR data in just 1.5 hours.

To be conservative, we assumed all field data is uploaded over fast 10 Gbps ethernet, which might be realistic for SDVs, but is much slower for network-limited robots. Most importantly, we assume only 1% of image data is interesting and needs to be annotated, specifically by the state-of-the-art Google Data Labeling service [3], which has standard rates for object detection bounding boxes and (costlier) segmentation masks (referred to as “B. Box” and “Mask” in Table I). Further, to roughly estimate annotation time, we hand-labeled 1000 bounding boxes in 1.4 hours. Since we did not hand-label nor train perception models for semantic segmentation, the time estimates are marked as “NA”. Though conservative, our estimates, which are further detailed in supplementary information, show significant costs, especially for annotation. Indeed, we confirmed the sampling problem’s relevance by talking with robotics startups that face high systems costs.

TABLE I: Systems Costs: Monthly storage/annotation cost and daily network upload time are significant for self-driving car data even if we annotate 1% of video frames.

                       Multi-Sensor               Dashcam-ours
Bottleneck             B. Box      Mask           B. Box     Mask
Storage                $31,200     $31,200        $172.50    $172.50
Transfer Time (hr)     8.88        8.88           0.05       0.05
Annotation Cost        $95,256     $1,652,400     $15,876    $275,400
Annot. Time (days)     22          NA             18         NA

B. Novel Datasets We Collected

To quantify systems costs first-hand, we collected over three months of field data, comprising over 10 GB of compressed video clips from a dashboard iPhone 6 camera (“dashcam”). While modest in size, it contains 346,788 frames of video, which is challenging to automatically mine for training examples. The following datasets, as well as re-trained vision models, are available at https://sites.google.com/view/harvestnet:

1. Identifying Construction Sites: Our three months of construction footage shows safety-critical scenarios and model drift, including roads that were first excavated by tractors (Fig. 2), filled with tar, and rolled back to normal by “compactor” machines and human workers.

2. Identifying Self-driving Vehicles (SDVs): We captured rare footage of SDVs being tested on public roads over 3 months, predominantly of the Waymo van (Fig. 2). This example is a proof-of-concept of HARVESTNET’s ability to sift rare, task-relevant training samples from large volumes of regular car footage, and is easily generalizable to other image classes.

3. Streaming Face Recognition (FaceNet): To test HARVESTNET’s robustness to domains outside of road scenes, we use a video dataset of streaming faces of consenting volunteers from a prior robotics paper [11]. We expressly do not advocate privacy-infringing surveillance missions, and simply test a search-and-rescue scenario where a sampler must use face embeddings, such as from the de-facto FaceNet [21, 9] DNN, to sift faces of interest from large volumes of video. Our results on all three datasets, henceforth referred to as Waymo, Construction, and FaceNet, are shown in Section V.

III. PROBLEM STATEMENT

We now formulate a general Robot Sensory Sampling Problem that is applicable to diverse field robots. Examples include self-driving cars, low-power drones, and even mining robots, which might only have network access when they surface. As a common theme, these robots must prioritize sensory samples due to limits in some combination of on-board storage, network connectivity, human annotation, and/or cloud-computing time that preclude naïvely using all field data. We now define general abstractions for the sampling problem.

Fig. 3: System Architecture: Our core contribution is a robot sensory sampler (in gray), which uses predictions from an on-board vision DNN and cloud-selected image targets.

1. Sensory Input: A robot measures a stream of sensory inputs {x_{t,i}}_{t=0}^{T}, where each x_{t,i} is a new image or LIDAR point cloud, on a day or learning “round” i. At the end of a day’s horizon of T steps, it can upload samples to the cloud for downstream annotation and model learning. In practice, the time horizon T can be the duration of a day’s field test, when a subterranean robot surfaces and connects to a network. Since the robot is caching images for model-learning, as opposed to real-time control, it can sub-sample the sensory input at a sustainable rate, such as ∆t = 30 s, by skipping frames.

2. Robot Perception Models: A vision DNN f_DNN maps an image x_t to a prediction y_t (e.g., image class or object detections) and an associated confidence conf_t, typically from softmax layers in the DNN. Further, the DNN also provides embeddings emb_t, which can be used to compare the similarity between images, as shown in Fig. 4 (right). The DNN outputs a vector of predictions, confidences, and embeddings, defined as:

    y_t = [y_t, conf_t, emb_t] = f_DNN(x_t).    (1)

We denote the perception DNN learned after round i as f^i_DNN, where f^0_DNN is the initial DNN the robot is equipped with before re-training. Our goal is to make f^i_DNN as accurate as possible for increasing i with the fewest samples sent to the cloud. Fixed model f^i_DNN is not re-trained within a round i of sample collection, since the samples must first be annotated.
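For concreteness, the per-frame interface of Eq. 1 can be sketched in PyTorch as below. This is a minimal sketch, assuming a hypothetical stand-in classifier whose pooled penultimate features serve as the embedding; the model choice and layer split are illustrative, not the exact networks used in the paper.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Illustrative stand-in for f_DNN: a pre-trained classifier whose pooled
# features serve as emb_t and whose softmax gives conf_t.
backbone = models.mobilenet_v2(pretrained=True)
backbone.eval()

def f_dnn(x_t: torch.Tensor):
    """Return (y_t, conf_t, emb_t) for one pre-processed image tensor of shape (3, H, W)."""
    with torch.no_grad():
        x = x_t.unsqueeze(0)                             # add batch dimension
        emb_t = backbone.features(x).mean(dim=[2, 3])    # pooled feature embedding
        logits = backbone.classifier(emb_t)              # final linear layers
        probs = F.softmax(logits, dim=1)
        conf_t, y_t = probs.max(dim=1)                   # top-class confidence and label
    return y_t.item(), conf_t.item(), emb_t.squeeze(0)
```

Because this forward pass is already needed for the robot’s primary perception task, the sampler obtains conf_t and emb_t essentially for free, as discussed in Section IV.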

3. Target Image Set: As stated in the introduction, the scope of this paper is to collect more examples of known weakpoints of a model, such as the mis-labeled construction sites in Fig. 2. In practice, these weakpoints, which we refer to as task-relevant image classes, can be identified by a fleet operator who might nightly review model performance on historical data. The set of task-relevant image classes, such as traffic cones, excavators, and LIDAR units (Fig. 2), is denoted by TargetClassSet. Importantly, we assume the cloud has a few initial examples of images belonging to TargetClassSet, denoted as TargetImageSet. This is a practical assumption, since TargetImageSet can be obtained from a few examples of mis-labeled field data, like Fig. 2, that initially prompted the problem, or even collated from Google Images. Equally importantly, the cloud dataset can already have sufficient training data for well-understood scenes from previous field tests or public datasets like ImageNet [13], to ensure model performance does not degrade on such data.

4. Sampling Policy: Our goal is to develop a lightweight sampling policy π_sample (grey unit in Fig. 3) that decides whether or not to store a new image x_t, based on predictions from the on-board perception DNN y_t = f^i_DNN(x_t) and the desired target set TargetClassSet of images to look for:

    a_t = π_sample(x_t, y_t, TargetClassSet, θ^i_sampler),    (2)

where θ^i_sampler are sampler hyper-parameters that, in general, can be adapted per round i. Here, a_t = 1 indicates storing the image to later send to the cloud and a_t = 0 indicates dropping it, such as for common, well-understood scenes. Ideally, to work on resource-constrained robots as well as not interfere with real-time decision-making, π_sample’s logic, including its use of hyper-parameters θ^i_sampler, should be compute-efficient.

5. Limited Robot Sample Cache: In round i, the robot saves sampled images in a small cache denoted by Cache^i = {x_τ}_{τ=0}^{N_cache}, where 0 ≤ τ ≤ T. These images, limited to a size N_cache ≪ T, are small enough to store on-board a robot and will be uploaded to the cloud at the end of round i.

6. Annotated Cloud Dataset: When samples in Cache^i are uploaded to the cloud, a human, or trusted machine, provides ground-truth annotations y_t^GroundTruth for the selected images, which could be different from the robot’s predictions y_t. In the cloud, we store a dataset D^i = {x_τ, y_τ} that grows with rounds i and comprises images x_τ and ground-truth labels y_τ:

    D^i = D^{i-1} ∪ {(x_τ, y_τ^GroundTruth), ∀ x_τ ∈ Cache^i}.    (3)

Initial dataset D^0 has very few examples of the target images that we care about, such as self-driving cars or road disruptions. However, initial dataset D^0 can be seeded with a large set of “common” training examples, such as from ImageNet [13]. In the cloud, we split the full dataset D^i into train, validation, and test components, denoted by D^i_train, D^i_val, and D^i_test.

7. Independent Test Dataset D_test,final: Model accuracy is always reported on an independent, held-out test set denoted D_test,final. To accurately assess the efficacy of a sampler in accruing task-relevant data that yields steady performance gains on the target classes, D_test,final should have sufficient representation of, or even be biased towards, target classes. The rest of the images can come from cloud datasets like ImageNet, if we wish to ensure performance does not degrade on those classes. In the extreme case where we only desire a specialized model, such as for road-scene classes only, D_test,final can also contain only target-class images.

In practice, D_test,final can be sourced from designated robots in a fleet tasked with collecting only test data. If special test robots are scarce, D_test,final can be accumulated from a random subset of data annotated by a human per round i. In both scenarios, we ensure test data is independently verified by annotators and never used for training.

8. Model Re-training and Evaluation: Once the newly annotated cache Cache^i is added to the dataset, a model re-training function g_retrain takes the current robot model f^i_DNN and the updated annotated dataset D^{i+1} and outputs an updated model f^{i+1}_DNN. The training objective is to minimize a loss function L(y_t, y_t^GroundTruth) penalizing prediction errors, such as the standard cross-entropy loss for image classification or mean average precision (mAP) for object detection [4]. Then, an evaluation function g_evaluate quantifies the accuracy improvements of the newly trained model f^{i+1}_DNN given a validation dataset D^i_val or test set D_test,final. The model re-train and evaluation functions are denoted by:

    f^{i+1}_DNN ← g_retrain(f^i_DNN, D^{i+1}_train)    (4)

    L^i = g_evaluate(f^i_DNN, D_test,final).    (5)

Finally, the new model f^{i+1}_DNN is downloaded back to the robot over a network link. Vision DNN models are typically less than a few GB, which is very small in contrast to raw sensory data, and hence feasible to download daily even over bandwidth-limited networks.

A. The Robot Sensory Sampling Problem (Fig. 1)

We now formally define the abstract sampling problem.

Problem 1 (Robot Sensory Sampling Problem): Given a target set of image classes TargetClassSet, an initial cloud dataset D^0, and an initial robot model f^0_DNN, find the best sampling policy π*_sample that minimizes the held-out test-set loss on D_test,final for the final robot DNN f^{N_round}_DNN trained over N_round days:

    π*_sample ∈ argmin_{π_sample} L(f^{N_round}_DNN, D_test,final),    (6)

subject to the constraint that a robot can only sample N_cache images daily.

Sampling Problem Complexity: Deriving an optimal sampling policy that solves Problem 1 is extremely challenging for several reasons. First, it is hard to analytically model the sequence of high-dimensional sensory inputs {x_{t,i}}_{t=0}^{T} a robot can possibly measure on a day i. Second, even if such an observation model exists, it is extremely challenging to decide which limited subset of N_cache samples, chosen from T possibilities, will lead to the best model performance of a complex DNN evaluated on a final test set at the end of several rounds N_round. Indeed, even for a small drone that measures video for only 15 minutes at a slow frame-rate of 5 frames per second with a cache of N_cache = 2000 daily images, there is a combinatorial explosion of C(T, N_cache) = 4.21 × 10^1340 images to choose from for downstream model learning!
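As a sanity check on this count, the magnitude of the binomial coefficient can be computed with log-gamma functions. The sketch below assumes the drone example above, i.e., T = 15 min × 60 s × 5 frames/s = 4500 frames per day, which reproduces the quoted order of magnitude.

```python
import math

# Daily stream for the drone example: 15 minutes of video at 5 frames per second.
T = 15 * 60 * 5          # 4500 frames observed per day
N_cache = 2000           # daily cache budget

# log10 of the binomial coefficient C(T, N_cache), via log-gamma to avoid overflow.
log10_choices = (math.lgamma(T + 1) - math.lgamma(N_cache + 1)
                 - math.lgamma(T - N_cache + 1)) / math.log(10)

exponent = math.floor(log10_choices)
mantissa = 10 ** (log10_choices - exponent)
print(f"C({T}, {N_cache}) ~ {mantissa:.2f} x 10^{exponent}")   # ~4.2 x 10^1340
```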

Wide Practicality of the Sensory Sampling Problem: Despite its complexity, the sampling problem is built on simple, but general, abstractions of (a) a robot perception model, (b) limited on-board storage, (c) occasional network access to upload samples, and (d) centralized annotation and compute resources. These abstractions apply to a wide variety of robots where the rate of sensory measurements outpaces that of human annotation and downstream cloud processing.


Fig. 4: Empirical insights guiding HARVESTNET: (Left) For relatively static targets, classification test accuracy saturates with more images used to fine-tune a vision DNN. (Right) We use either embedding distances or softmax confidence scores to signal if a new image is likely a target of interest.

IV. THE HARVESTNET INTELLIGENT SAMPLER

While obtaining a provably optimal solution to Problem 1 is infeasible, one can envision a myriad of possibly well-performing heuristics. To ensure HARVESTNET applies to a wide diversity of robots, we enforce a practical requirement: the sampler should be sufficiently compute-efficient to run on even a resource-constrained robot equipped with, for example, a Raspberry Pi and Google Edge TPU, as shown in Fig. 8. A minimalistic on-board sampler that does not interfere with real-time control is extremely useful for larger robots as well, such as SDVs. We now describe three empirical insights, evidenced in Fig. 4, that constrain the sampler design space and directly lead to HARVESTNET.

A. Empirical Insights Behind the HARVESTNET Sampler

Insight 1: Use transfer learning for high model accuracy with few sampled images.

Many pre-trained CNNs’ initial network layers serve as general-purpose feature extractors that yield finite-dimensional embeddings [23, 29, 2]. Embeddings can be used to fine-tune final linear classification or object detection layers to recognize new classes of interest, such as autonomous cars or construction sites, that a CNN was not initially trained upon.

HARVESTNET exploits fine-tuning, guided by our empirical observation that only a limited set of images is necessary to adapt a pre-trained vision DNN to new domains, especially if the new domains’ data does not change much with time. Fig. 4 (left) shows the full-dataset test accuracy on our three separate datasets (y-axis) versus the percentage of total training data provided to the learning model (x-axis). Clearly, for the relatively static Waymo and FaceNet domains, the final model accuracy saturates as more training images are acquired.

In contrast, for the dynamic task of identifying construction sites, the accuracy steadily increases as more data is acquired, even after months of data. This is because construction sites are constantly changing, with new tractors and signs being updated daily. Indeed, HARVESTNET can be used for continual re-training, allowing robots to update HD maps in the cloud.

Insight 2: Filter task-relevant target images using DNN embeddings and softmax confidences.

Embedding distances: Many, but not all, CNNs have the property that the L2 distance between embeddings of two images is a proxy for image similarity. In particular, FaceNet DNN embeddings typically have a lower embedding distance for two pictures of the same person than for unrelated faces, since the DNN is optimized with a triplet loss [9, 21]. We test this property with our FaceNet data in Fig. 4 (right), allowing HARVESTNET to exploit embedding distance as a valuable indication of whether a new image is of the same desired class as training data we wish to acquire.

Softmax confidence thresholds: Softmax confidence scores yield a probability distribution over CNN class predictions. We empirically note that, while the softmax probabilities are not necessarily a formal probability measure, images that are truly of a class (“true positives”) have a higher softmax confidence element for that class of interest compared to “true negatives”. For example, as shown in Fig. 4 (right), images that are indeed of Waymo cars (red) have a higher confidence score for the Waymo class output compared to negatives (in blue), showing that confidence scores above a separating threshold are a useful signal to store a sample of interest.

Crucially, embeddings and softmax confidence scores are readily produced simply by running a DNN forward pass for inference. Thus, when a robot’s perception model is already being run for a high-level task, such as motion planning, the sampler obtains the embeddings and confidence scores “for free”. Then, these two metrics can be used in a lightweight, data-driven filter to discern if a new image is likely a target of interest for caching. We further quantify in Sec. V-B how we can run HARVESTNET’s models on the Edge TPU [5].

Insight 3: Use prioritized human feedback in the cloud to refine the target image filter.

The final insight behind HARVESTNET, shown in Fig. 5, is that the DNN softmax confidence distributions between target images (red) and irrelevant images (blue) become more differentiated as more human-annotated images are added to the cloud dataset at every round i. Crucially, this allows the cloud to better signal to the robot’s sampler how to identify task-relevant target images from background noise, thereby directly building on Insight 2. A broader insight behind HARVESTNET is that we can delegate the compute-intensive task of training to the cloud, but provide simple feedback to a robot to improve its target image filter.

B. Practical Application of the HARVESTNET Insights

We now formalize how, together, Insights 1-3 directly lead to our novel HARVESTNET sampler, introduced in Algorithm 1. Importantly, HARVESTNET is markedly distinct from today’s active learning and data selection algorithms, due to its use of a compute-efficient robot model and adaptive feedback from cloud computing resources.

Fig. 5: Role of Cloud Feedback: After a human annotates more samples on each round i, the softmax confidence distribution difference between target (red) and irrelevant (blue) images becomes sharper, which HARVESTNET uses to improve sampling by adapting the confidence threshold conf^i_thresh.

Application of Insight 1 (Fine-Tuning): Insight 1, which governs how to fine-tune a DNN with minimal images (Fig. 4, left), is addressed by using pre-trained feature extraction layers for the initial robot model f^0_DNN, such as from publicly available models, and only re-training final layers with transfer learning for new target image classes. Further, we limit the sampled cache size by N_cache.
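As a concrete illustration, the PyTorch sketch below freezes a pre-trained feature extractor and re-trains only the final classification layer on the newly annotated cloud dataset, in the spirit of g_retrain. The dataset object, number of target classes, and optimizer settings are placeholders; the exact hyper-parameters used in the paper are listed in its technical report.

```python
import torch
import torch.nn as nn
import torchvision.models as models

def g_retrain(dataset_train, num_classes, epochs=20, device="cpu"):
    """Sketch of g_retrain: fine-tune only the final layer of a pre-trained backbone."""
    # Assumes images in dataset_train are resized to 299x299, as Inception V3 expects.
    model = models.inception_v3(pretrained=True, aux_logits=False)
    for param in model.parameters():           # freeze feature-extraction layers
        param.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, num_classes)   # new final layer
    model = model.to(device)

    loader = torch.utils.data.DataLoader(dataset_train, batch_size=32, shuffle=True)
    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images.to(device)), labels.to(device))
            loss.backward()
            optimizer.step()
    return model
```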

Application of Insight 2 (Simple Target Image Filter): Insight 2, which governs how to use DNN softmax confidences or embedding distances to build a compute-efficient on-board target image filter (Fig. 4, right), is now formalized in two variants of the sampling policy π_sample (Eq. 2).

Variant 1. Embedding Distance Based Sampling: Two pictures of the same person, x_t1 and x_t2, typically have a lower embedding distance ρ(x_t1, x_t2) = ||emb_t1 − emb_t2||_2^2 than unrelated faces [21]. Using this insight, we compare the embedding distance between a robot’s new sensory input x_t and each of the target images in TargetImageSet, and cache the sample if the median distance is below an empirical threshold ρ^i_thresh, indicating similarity to a face we want to learn. The embedding distance threshold ρ^i_thresh serves as the sampler hyper-parameter θ^i_sampler = {ρ^i_thresh} in Eqn. 2, yielding:

    a_t = π_sample(x_t, y_t, TargetImageSet, θ^i_sampler)
        = 1( MEDIAN{ρ(x_t, x_targ), ∀ x_targ ∈ TargetImageSet} < ρ^i_thresh ).

In this variant of the sampling policy, the cloud sends the TargetImageSet of images and their embeddings to the robot to compare with the embedding of a new image x_t. In practice, it is trivial to store a limited set of target images on-board, such as the 18 images which sufficed in our experiments.
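A minimal sketch of this embedding-distance filter is shown below, assuming the target embeddings have already been computed in the cloud and downloaded to the robot; the tensor shapes and threshold value are illustrative placeholders.

```python
import torch

def sample_by_embedding(emb_t: torch.Tensor,
                        target_embeddings: torch.Tensor,
                        rho_thresh: float) -> bool:
    """Variant 1: cache x_t if its median squared L2 distance to the
    TargetImageSet embeddings falls below the round-i threshold rho_thresh."""
    # target_embeddings: (N_targets, d); emb_t: (d,)
    dists = torch.sum((target_embeddings - emb_t) ** 2, dim=1)
    return bool(torch.median(dists) < rho_thresh)

# Example with a handful of hypothetical 128-dimensional face embeddings.
targets = torch.randn(18, 128)
new_embedding = torch.randn(128)
a_t = sample_by_embedding(new_embedding, targets, rho_thresh=1.1)
```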

Variant 2. Classification Confidence Based Sampling: In cases where embedding distances are not optimized for similarity between two input images, we can directly use confidence scores conf_t to determine if a new image is likely to be a useful representative of the desired target. We rely on the fact that the initial robot DNN f^0_DNN, as well as successive versions f^i_DNN, are trained with a few examples of target images and can predict their target class TargetClass in the DNN model architecture. Then, given a new image x_t, we check whether the confidence element for a target class, denoted by conf_t[TargetClass], is above an empirically determined threshold conf^i_thresh (Fig. 5), which serves as the sampler hyper-parameter θ^i_sampler = {conf^i_thresh} in Eqn. 2. Thus, the sampling policy becomes:

    a_t = π_sample(x_t, y_t, TargetClassSet, θ^i_sampler)
        = 1( conf_t[TargetClass] > conf^i_thresh ).

Unlike the embedding-distance variant of the sampling policy, the softmax confidence version simply needs to know the desired target classes TargetClassSet to harvest more images of, and does not require storing any target images TargetImageSet on-board. As measured in Sec. V, it is trivial to compare the output softmax confidences of a DNN with a fixed threshold efficiently on embedded devices.
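A sketch of the confidence-threshold variant follows; the class index and threshold are placeholders that would be sent by the cloud at the start of each round.

```python
import torch

def sample_by_confidence(softmax_probs: torch.Tensor,
                         target_class: int,
                         conf_thresh: float) -> bool:
    """Variant 2: cache x_t if the softmax confidence assigned to the
    target class exceeds the round-i threshold conf_thresh."""
    return bool(softmax_probs[target_class] > conf_thresh)

# Example: a hypothetical 4-class model where index 2 is the construction class.
probs = torch.tensor([0.05, 0.10, 0.80, 0.05])
a_t = sample_by_confidence(probs, target_class=2, conf_thresh=0.6)   # True
```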

Application of Insight 3 (Cloud Feedback): Our key insight, depicted in Fig. 5, is to use the new cloud dataset and improved model at the end of round i, D^{i+1} and f^{i+1}_DNN, to best adapt the sampler hyper-parameters θ^{i+1}_sampler = {conf^{i+1}_thresh, ρ^{i+1}_thresh} for the next round. This is achieved by the AdaptThresh function:

    θ^{i+1}_sampler ← AdaptThresh(D^{i+1}, f^{i+1}_DNN).    (7)

In practice, AdaptThresh can be implemented using any learning algorithm that creates a simple decision boundary between task-relevant images’ scores or embeddings and those of non-target images, such as logistic regression or a support vector machine (SVM). In Sec. V, we show that cloud feedback on thresholds outperforms non-adaptive schemes.
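For instance, AdaptThresh could fit a one-dimensional logistic regression on the confidence scores of annotated target versus non-target images and return the decision boundary as the next round’s threshold. The sketch below, using scikit-learn, is one such implementation under those assumptions; the score values are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def adapt_thresh(conf_scores: np.ndarray, is_target: np.ndarray) -> float:
    """Fit a 1-D decision boundary between confidence scores of annotated
    target images (is_target = 1) and irrelevant images (is_target = 0),
    and return it as the next round's confidence threshold."""
    clf = LogisticRegression()
    clf.fit(conf_scores.reshape(-1, 1), is_target)
    # The boundary is where the logit w * conf + b crosses zero.
    return float(-clf.intercept_[0] / clf.coef_[0][0])

# Example with confidence scores from a hypothetical annotated cache.
scores = np.array([0.2, 0.3, 0.35, 0.7, 0.8, 0.9])
labels = np.array([0, 0, 0, 1, 1, 1])
conf_thresh_next = adapt_thresh(scores, labels)   # lies between the two score clusters
```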

C. The Lightweight HARVESTNET Sampling Algorithm

Having explained Insights 1-3, we can now present HARVESTNET’s sampler in Algorithm 1, which serves as a practical heuristic solution to Problem 1. In each learning round i, a robot has a fixed perception model f^i_DNN and observes a stream of sensory inputs {x_{t,i}}_{t=0}^{T}. Sampling policy π_sample stores sampled images in cache Cache^i, which is uploaded to the cloud, annotated with ground-truth labels, appended to cloud dataset D^i, and finally used to re-train the perception model into f^{i+1}_DNN. This new model and an updated set of sampling parameters are then downloaded to the robot. Learning repeats for a fixed set of rounds N_round, until DNN validation accuracy saturates.

Algorithm 1: HARVESTNET Sensory Stream Sampler

Assemble targets TargetClassSet and cloud dataset D^0
Pre-train initial robot perception DNN f^0_DNN on D^0
Init. {conf^0_thresh, ρ^0_thresh} ← AdaptThresh(D^0, f^0_DNN)
for learning round i ← 0 to N_round do
    Cache^i ← {∅}                      // init empty cache
    for t ← 0 to T do
        y_t ← [y_t, conf_t, emb_t] = f^i_DNN(x_t)
        a_t ← π_sample(x_t, y_t, TargetClassSet, {conf^i_thresh, ρ^i_thresh})
        Cache^i.APPEND(x_t) iff a_t = 1
    end
    {(x_τ, y_τ^GroundTruth)} ← ANNOTATE(Cache^i)
    D^{i+1} ← D^i ∪ {(x_τ, y_τ^GroundTruth)}, ∀ x_τ ∈ Cache^i
    D^{i+1}_train, D^{i+1}_val ← TRAINVALSPLIT(D^{i+1})
    f^{i+1}_DNN ← g_retrain(f^i_DNN, D^{i+1}_train)    // re-train
    {conf^{i+1}_thresh, ρ^{i+1}_thresh} ← AdaptThresh(D^{i+1}, f^{i+1}_DNN)
end
Result: Final DNN f^{N_round}_DNN, cloud dataset D^{N_round}

D. Benchmark Sampling Algorithms

We compare HARVESTNET with the following benchmarks:

1. Random Sampling: This scheme randomly fills the limited cache from the T possible samples in round i.

2. Non-adaptive cloud feedback: This scheme does not adapt thresholds using Eq. 7 (the AdaptThresh step in Alg. 1). Rather, it caches an image with a probability proportional to the softmax confidence. For example, an image with a softmax confidence of 90% is added to the cache with probability 0.90. Analogously, an image at the 10% quantile of embedding distances is cached with probability 0.90, since lower embedding distances are preferred. Since this scheme does not calibrate itself using a threshold, it often misses true positives, which have low confidence scores at early rounds (Fig. 5, left). We also experimented with an “arg-max” policy, which stores an image if the confidence element for the target class was most probable, but this performed poorly since the cutoff was too strict to store any task-relevant images.

3. HARVESTNET-PriorityQueue: This scheme caches an image into a priority queue ranked by either softmax confidence scores or embedding distances (see the sketch after this list). The first items to be dropped when the priority queue is full are the target images with the lowest softmax confidence or highest embedding distance.

4. Oracle Sampling: This is an un-realizable upper bound, since it perfectly identifies images of the desired target at the robot. Since it is analytically hard to compute the best set of limited images of size N_cache that best improves the model, we instead approximate the upper bound by averaging several samples of N_cache true target images, chosen without any error. Importantly, the oracle takes the place of an “all-cloud” approach, since uploading all the data to the cloud is infeasible for network-limited robots like subterranean rovers.
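As a sketch of the HARVESTNET-PriorityQueue variant (item 3 above), the snippet below keeps the N_cache images with the highest target-class confidence using a min-heap, so that the lowest-confidence cached item is evicted first; for embedding distances one would rank by negative distance instead. The confidence source and stream format are placeholders.

```python
import heapq

def cache_with_priority_queue(stream, n_cache):
    """Keep the n_cache images with the highest target-class confidence.

    `stream` yields (confidence, image_id) pairs; a min-heap on confidence
    means the lowest-confidence cached item is evicted when the cache is full.
    """
    heap = []   # (confidence, image_id) tuples, smallest confidence at the root
    for confidence, image_id in stream:
        if len(heap) < n_cache:
            heapq.heappush(heap, (confidence, image_id))
        elif confidence > heap[0][0]:
            heapq.heapreplace(heap, (confidence, image_id))
    return [image_id for _, image_id in heap]

# Example on a toy stream of (confidence, frame index) pairs with N_cache = 3.
toy_stream = [(0.2, 0), (0.9, 1), (0.4, 2), (0.95, 3), (0.1, 4), (0.7, 5)]
cached = cache_with_priority_queue(toy_stream, n_cache=3)   # frames 1, 3, 5
```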

V. HARVESTNET’S EXPERIMENTAL PERFORMANCE

We now show HARVESTNET’s ability to (1) achieve high model accuracy for a minimal number of sampled images (Figs. 6, 7), and (2) produce compute-efficient vision models. Every benchmark that we evaluate has a fixed cache limit of N_cache and thus obeys the constraints of Problem 1. Further, our results metric of high model accuracy is equivalent to minimizing model loss, which is the objective of Problem 1.

A. Performance on Diverse Video Streams

FaceNet: Our key result for the FaceNet scenario shows the final DNN classification accuracy on all test images (y-axis) as more learning rounds i progress (x-axis). As shown in Fig. 6 (left), our HARVESTNET sampler, in orange, achieves up to 1.68× the final accuracy of benchmarks and no lower than 92% of the accuracy of an upper-bound oracle solution, in green. Fig. 6 also shows HARVESTNET can flexibly adapt to different target faces of interest, since it shows steady model improvements for two randomly selected targets, referred to as Face A and B in the plots. Importantly, the robot is only provided a TargetImageSet of 18 initial examples of the face of interest and only caches a minimal set of N_cache = 10 images per learning round i, much less than the thousands of frames it observes per video. We used the widely-adopted OpenFace nn4.small2.v1 DNN for face embeddings [9], which were passed to an SVM for face classification.

Fig. 6: FaceNet Results: We achieve steady accuracy gains with minimal face examples, even matching the oracle performance for Face B, shown by the curve overlap.

Fig. 7: Road-Scene Results: HARVESTNET outperforms baselines that sample randomly or without cloud feedback.
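A sketch of this embedding-plus-SVM classifier is shown below using scikit-learn; the embedding arrays are placeholders standing in for 128-dimensional OpenFace outputs, and the linear kernel is an assumption rather than a detail stated in the paper.

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder 128-dim face embeddings (OpenFace nn4.small2.v1 outputs in the paper).
rng = np.random.default_rng(0)
train_embeddings = rng.normal(size=(40, 128))
train_labels = np.array([0] * 20 + [1] * 20)     # 0: other faces, 1: target face

# SVM classifier on top of the embeddings, as in the FaceNet scenario.
classifier = SVC(kernel="linear", probability=True)
classifier.fit(train_embeddings, train_labels)

new_embedding = rng.normal(size=(1, 128))
target_prob = classifier.predict_proba(new_embedding)[0, 1]   # confidence it is the target face
```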

Generalization to Dynamic Road Scenes: We now stress-test HARVESTNET’s ability to identify challenging road scenes of self-driving cars and construction sites. Both domains showcase HARVESTNET’s flexibility in using softmax confidence scores to decide if an image is likely a task-relevant training example to cache, unlike the embedding distances used in the FaceNet scenario. In each round i, we fine-tuned the Inception V3 DNN [24] in PyTorch on an Nvidia Tesla K80 cloud GPU for 20 epochs, which led to stable convergence, using hyper-parameters outlined in our technical report.

Fig. 7 shows the steady increase of final test-set accuracy of various samplers (y-axis) over learning rounds i in both the construction and self-driving car domains.


Device      Model          Acc. (mAP)   Infer. Time (ms)   Size (MB)
TPU         MobileNet 2    0.170        6.9                5.5
CPU         MobileNet 2    0.167        285.6              4.7
Cloud GPU   MobileNet 2    0.242        56.8               20
Cloud GPU   Faster R-CNN   0.284        447.5              183

Fig. 8: Prototype: (Left) The lightweight sampler runs on an embedded CPU, such as on the Raspberry Pi, which uses an Edge TPU USB accelerator for DNN inference. Model Performance: (Right) HARVESTNET re-trained object detectors flexibly run on the Edge TPU or cloud GPUs.

For Waymo, HARVESTNET, in orange, achieves up to 1.11× the final accuracy of benchmarks and 93.5% of the accuracy of an upper-bound oracle solution, in green. The most challenging test of HARVESTNET comes from classifying whether a road scene has a construction site, since these sites changed rapidly from day to day. Still, HARVESTNET outperforms non-adaptive sampling by over 1.38× and is within 87% of the oracle solution.

Comparison with Priority Queue Version: HARVESTNET, which uses adaptive thresholds, convincingly outperforms the non-adaptive and random sampling benchmarks. It also outperforms a variant of HARVESTNET that uses the exact same algorithm but instead caches images in a priority queue, dubbed HNetPriorityQueue, albeit by a lower margin, such as in the Waymo domain. The priority-queue version does worse when the proportion of target images per day is highly non-uniform, especially on days where there are no targets, leading it to cache irrelevant images below the empirical threshold. Depending on how uniform the distribution of sensory inputs is per day, a designer might consider using the HNetPriorityQueue variant, which is also extremely compute-efficient and simple.

Benefits of re-training with field data: To appreciate the accuracy gains of HARVESTNET, we consider the final model accuracy if we trained a CNN solely on high-quality public images of Waymo cars and construction sites and tested on our realistic road dataset. We could find fewer than 200 Google Images for each scenario, which yielded an extremely poor accuracy of only 41.6% for Waymo and 0% for construction. Indeed, public images were often zoomed-in, clear pictures from marketing material, and were not representative of real road scenes with partially occluded targets. Thus, our steady accuracy gains are substantial, especially with minimally-sampled images. Unlike the rapid saturation of FaceNet and Waymo accuracy, the construction example kept improving with time, serving as a key benefit of continual learning.

Systems Bottleneck Savings: The principal benefits of intelligent sampling derive from limiting the number of images cached in each learning round (e.g., N_cache = 60). Thus, by re-training on only 18-34% of possible images, we decrease storage, annotation cost and time, and cloud-computing time by 65.7-81.3%, since costs scaled linearly with the number of sampled images in our setup. These estimates will vary considerably based on systems infrastructure.

B. Prototype on the Edge Tensor Processing Unit (TPU)

Finally, we show that training examples collected by HARVESTNET can also be used to re-train object detection CNNs, shown in Fig. 2, that scalably run on compute-limited robots. Accordingly, we ran HARVESTNET’s DNNs on Google’s Edge TPU development board [5], where the TPU is a dedicated circuit for fast, low-power TensorFlow [8] model inference. The relatively new Edge TPU board supports, at best, a quantized MobileNet 2 (MN2) DNN and features a complementary ARM Cortex embedded CPU, which runs DNNs without the aid of an AI accelerator.

Fig. 8 (right) shows that HARVESTNET can flexibly re-train mobile-optimized models, such as MobileNet 2, or more accurate but slower “cloud-grade” models such as Faster R-CNN (contrasted in Fig. 2). Thus, one can trade off speed, size, and accuracy, quantified by the standard COCO mAP metric [4], for a wide variety of robot compute resources.

For example, one can scalably run real-time vision algorithms on the TPU, which readily provides embeddings and class confidences, and pass them to HARVESTNET’s sampler running on an embedded CPU with minimal overhead. This configuration is shown on the robot we used in Fig. 8, where the Raspberry Pi’s CPU can decide whether to cache a new image with a mean speed of just ≈ 0.31 ms, which is an order of magnitude faster than DNN inference on the Edge TPU!
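The split between accelerated inference and a cheap CPU-side threshold check can be sketched as follows with the TensorFlow Lite runtime and an Edge TPU delegate; the model path, tensor indices, and output scaling are placeholders that depend on the compiled model, so this is an illustrative configuration rather than the exact deployment used in the paper.

```python
import numpy as np
import tflite_runtime.interpreter as tflite

# Compiled, quantized classifier (placeholder path) running on the Edge TPU.
interpreter = tflite.Interpreter(
    model_path="harvestnet_classifier_edgetpu.tflite",
    experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")])
interpreter.allocate_tensors()
input_idx = interpreter.get_input_details()[0]["index"]
output_idx = interpreter.get_output_details()[0]["index"]

def should_cache(frame: np.ndarray, target_class: int, conf_thresh: float) -> bool:
    """Run the TPU-accelerated forward pass, then do the cheap CPU-side threshold check."""
    # frame: uint8 array already resized to the model's expected input shape.
    interpreter.set_tensor(input_idx, frame[np.newaxis, ...])   # add batch dimension
    interpreter.invoke()                                        # DNN inference on the TPU
    scores = interpreter.get_tensor(output_idx)[0]              # per-class confidences
    # For a quantized model, scores may need dequantizing via the output scale/zero-point.
    return bool(scores[target_class] > conf_thresh)             # cheap comparison on the CPU
```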

Limitations of our sampler: HARVESTNET might fail in scenarios with extremely rare visual data, such as cases where we cannot obtain enough samples to seed the initial cloud dataset and train the initial robot model at round 0. In future work, we can consider whether unknown weakpoints of a model can be cached by storing examples with high confidence-score entropy.

VI. DISCUSSION AND CONCLUSIONS

The key contribution of this paper is to introduce a new, widely-applicable sensory sampling problem to the robotics community and to experimentally show the gains of our compute-efficient sampler. Since we anticipate the volume of rich robotic sensory data to grow, thereby making systems bottlenecks more acute, we encourage the robotics community to significantly build upon our work. As such, we have open-sourced our dataset, Edge TPU vision DNNs, and sampling algorithm code at https://sites.google.com/view/harvestnet.

In the future, we hope to enhance HARVESTNET by fusing data from several cameras and LIDAR sensors on-board a robot. Further, the sampling algorithm can be modified to select adversarial training examples for vision DNNs. Lastly, HARVESTNET’s sampler can be enhanced to explicitly avoid storing “private” training data by rejecting such samples. We believe HARVESTNET is a vital first step in keeping pace with surging sensory data in robotics by insightfully filtering video streams for actionable training examples, using insights from embedded AI, networking, and cloud robotics.


REFERENCES

[1] Data is the new oil in the future of automated driving. https://newsroom.intel.com/editorials/krzanich-the-future-of-automated-driving/#gs.LoDUaZ4b, 2016. [Online; accessed 30-Jan.-2019].
[2] Transfer learning. http://cs231n.github.io/transfer-learning/, 2019. [Online; accessed 08-Sep.-2019].
[3] AI Platform Data Labeling Service. https://cloud.google.com/data-labeling/docs/, 2019.
[4] COCO: Common Objects in Context: Detection evaluation. http://cocodataset.org/#detection-eval, 2019.
[5] Edge TPU. https://cloud.google.com/edge-tpu/, 2019. [Online; accessed 01-Sep.-2019].
[6] Cloud robotics and automation. http://goldberg.berkeley.edu/cloud-robotics/, 2019. [Online; accessed 02-Jan.-2019].
[7] Autonomous and ADAS test cars produce over 11 TB of data per day. https://www.tuxera.com/blog/autonomous-and-adas-test-cars-produce-over-11-tb-of-data-per-day/, 2019. [Online; accessed 08-Sep.-2019].
[8] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In Proceedings of OSDI 2016, Savannah, Georgia, USA, 2016.
[9] Brandon Amos, Bartosz Ludwiczuk, and Mahadev Satyanarayanan. OpenFace: A general-purpose face recognition library with mobile applications. Technical Report CMU-CS-16-118, CMU School of Computer Science, 2016.
[10] Paul Bodesheim, Alexander Freytag, Erik Rodner, Michael Kemmler, and Joachim Denzler. Kernel null space methods for novelty detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3374-3381, 2013.
[11] Sandeep Chinchali, Apoorva Sharma, James Harrison, Amine Elhafsi, Daniel Kang, Evgenya Pergament, Eyal Cidon, Sachin Katti, and Marco Pavone. Network offloading policies for cloud robotics: a learning-based approach. arXiv preprint arXiv:1902.05703, 2019.
[12] David A. Cohn, Zoubin Ghahramani, and Michael I. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129-145, 1996.
[13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255. IEEE, 2009.
[14] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep Bayesian active learning with image data. In International Conference on Machine Learning, 2017.
[15] Ken Goldberg and Ben Kehoe. Cloud robotics and automation: A survey of related work. EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2013-5, 2013.
[16] J. Kuffner. Cloud-enabled robots. In IEEE-RAS International Conference on Humanoid Robots. Piscataway, NJ: IEEE, 2010.
[17] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740-755. Springer, 2014.
[18] Pramuditha Perera and Vishal M. Patel. Deep transfer learning for multiple class novelty detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11544-11552, 2019.
[19] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99, 2015.
[20] Artsiom Sanakoyeu, Vadim Tschernezki, Uta Büchler, and Björn Ommer. Divide and conquer the embedding space for metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 471-480, 2019.
[21] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815-823, 2015.
[22] Burr Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2009.
[23] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 806-813, 2014.
[24] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818-2826, 2016.
[25] Ajay Kumar Tanwani, Nitesh Mor, John Kubiatowicz, Joseph E. Gonzalez, and Ken Goldberg. A fog robotics approach to deep robot learning: Application to object recognition and grasp planning in surface decluttering. arXiv preprint arXiv:1903.09589, 2019.
[26] Nan Tian, Jinfa Chen, Mas Ma, Robert Zhang, Bill Huang, Ken Goldberg, and Somayeh Sojoudi. A fog robotic system for dynamic visual servoing. arXiv preprint arXiv:1809.06716, 2018.
[27] Simon Tong. Active learning: theory and applications, volume 1. Stanford University, USA, 2001.
[28] Xin Wang, Fisher Yu, Ruth Wang, Trevor Darrell, and Joseph E. Gonzalez. TAFE-Net: Task-aware feature embeddings for low shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1831-1840, 2019.
[29] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320-3328, 2014.
[30] Pengkai Zhu, Hanxiao Wang, and Venkatesh Saligrama. Generalized zero-shot recognition based on visually semantic embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2995-3003, 2019.