
Large-scale Situation Awareness with Camera Networks and Multimodal Sensing

Umakishore Ramachandran, Kirak Hong, Liviu Iftode, Ramesh Jain, Rajnish Kumar, Kurt Rothermel, Junsuk Shin, Raghupathy Sivakumar

Abstract—Sensors of various modalities and capabilities, especially cameras, have become ubiquitous in our environment. Their intended use is wide-ranging and encompasses surveillance, transportation, entertainment, education, healthcare, emergency response, disaster recovery, and the like. Technological advances and the low cost of such sensors enable the deployment of large-scale camera networks in large metropolises such as London and New York. Multimedia algorithms for analyzing and drawing inferences from video and audio have also matured tremendously in recent times. Despite all these advances, large-scale reliable systems for media-rich sensor-based applications, often classified as situation awareness applications, are yet to become commonplace. Why is that? There are several forces at work here. First of all, the system abstractions are just not at the right level for quickly prototyping such applications in the large. Second, while Moore’s law has held true for predicting the growth of processing power, the volume of data that applications are called upon to handle is growing similarly, if not faster. Enormous amounts of sensing data are continually generated for real-time analysis in such applications. Further, due to the very nature of the application domain, there are dynamic and demanding resource requirements for such analyses. The lack of the right set of abstractions for programming such applications, coupled with their data-intensive nature, has hitherto made realizing reliable large-scale situation awareness applications difficult. Incidentally, situation awareness is a very popular but ill-defined research area that has attracted researchers from many different fields. In this paper, we adopt a strong systems perspective and consider the components that are essential in realizing a fully functional situation awareness system.

U. Ramachandran, K. Hong, R. Kumar, and J. Shin are with the College of Computing, Georgia Institute of Technology, e-mail: {rama,hokira,rajnish,jshin}@cc.gatech.edu

L. Iftode is with the Department of Computer Science, Rutgers University, e-mail: [email protected]

R. Jain is with the School of Information and Computer Sciences, University of California, Irvine, e-mail: [email protected]

K. Rothermel is with the IPVS, University of Stuttgart, Stuttgart, Germany, e-mail: [email protected]

R. Sivakumar is with the School of ECE, Georgia Institute of Technology, e-mail: [email protected]

Index Terms—Situation Awareness; Resource Management; Scalability; Programming Model; Video-Based Surveillance; Large-scale Distributed Systems

I. Introduction

Situation Awareness is both a property and an application class that deals with recognizing when sensed data could lead to actionable knowledge. With advances in technology, it is becoming feasible to integrate sophisticated sensing, computing, and communication in a single small-footprint sensor platform (e.g., smart cameras). This trend is enabling the deployment of powerful sensors of different modalities in a cost-effective manner. While Moore’s law has held true for predicting the growth of processing power, the volume of data that applications handle is growing similarly, if not faster. Situation awareness applications are inherently distributed, interactive, dynamic, stream-based, and computationally demanding, and they need real-time or near-real-time guarantees. A sense-process-actuate control loop characterizes the behavior of this application class.

There are three main challenges posed by data explosion for realizing situation awareness: overload on the infrastructure, cognitive overload on humans in the loop, and a dramatic increase in false positives and false negatives in identifying threat scenarios. Consider, for example, providing situation awareness in a battlefield. It needs complex
fusion of contextual knowledge with time-sensitive sensor data obtained from different sources to derive higher-level inferences. With an increase in the sensed data, a fighter pilot will need to take more data into account in decision-making, leading to cognitive overload and an increase in human errors (false positives and negatives). Also, to process and disseminate the sensed data, more computational and network resources are needed, thus overloading the infrastructure.

Distributed video-based surveillance is a good canonical example of this application class. Visual information plays a vital role in surveillance applications, demonstrated by the strategic use of video cameras as a routine means of physical security. With advances in imaging technology, video cameras have become increasingly versatile and sophisticated. They can be multi-spectral, can sense at varying resolutions, can operate with differing levels of actuation (stationary, moving, controllable), and can even be airborne (e.g., in military applications). Cameras are being deployed in the large, from airports to city-scale infrastructures. Such large-scale deployments result in massive amounts of visual information that must be processed in real time to extract useful and actionable knowledge for timely decision-making. The overall goal of surveillance systems is to detect and track suspicious activities to ward off potential threats. Reliable computer-automated surveillance using vision-based tracking, identification, and activity monitoring can relieve operator tedium and allow coverage of larger areas for various applications (airports, cities, highways, etc.). Figure 1 is a visual of the camera deployment in an airport to serve as the infrastructure for such a video-based surveillance system.

Video surveillance based on Closed-Circuit Television (CCTV) was first introduced in the UK in the middle of the last century. Since then, camera surveillance networks have proliferated in the UK, with over 200,000 cameras in London alone [1]. In the US, the penetration of CCTV has been relatively slower; Chicago is leading with more than 2000 cameras, which connect to an operation center

constantly monitored by police officers [2]. Apart from the legal and privacy aspects of the CCTV technology [3], it is both expensive and hard to scale due to the huge human capital involved in monitoring the camera feeds.

Fig. 1: Cameras and People Movement in an Airport.

Smart or intelligent surveillance combines sensing and computing to automatically identify interesting objects and suspicious behaviors. Advances in computer vision have enabled a range of technologies including: human detection and discrimination [4]–[7]; single and multi-camera target tracking [8], [9]; biometric information gathering, such as face [10] and gait signatures [11], [12]; and human motion and activity classification [13]–[15]. Such advances (often referred to as video analytics) are precursors to fully automated surveillance, and bode well for use in many critical applications including and beyond surveillance.

As image processing and interpretation tasks migrate from a manual to a computer-automated model, questions of system scalability and efficient resource management will arise and must be addressed. In large settings such as airports or urban environments, processing the data streaming continuously from multiple video cameras is a computationally intensive task. Moreover, given the goals of surveillance, images must be processed in real time in order to provide the timeliness required by modern security practice. Questions of system scalability go beyond video analytics, and fall squarely in the purview of distributed systems research.


[Fig. 2: Surveillance System Resource Utilization. (a) CPU load and (b) memory usage over time for one to four cameras. Cameras produce images at 5 FPS with 320x240 resolution; image processing and RFID reading occur at a centralized system on a 1.4GHz Intel Pentium CPU.]

Consider a typical smart surveillance system in an airport with cameras deployed in a pattern to maintain continuous surveillance of the terminal concourses (Figure 1). Images from these cameras are processed by some application-specific logic to produce the precise level of actionable knowledge required by the end user (human and/or software agent). The application-specific processing may analyze multiple camera feeds to extract higher-level information such as “motion”, “presence of a human face”, or “committing a suspicious activity”. Additionally, a security agent can specify policies, e.g., “only specified people are allowed to enter a particular area”, which causes the system to trigger an alert whenever such a policy is violated.

The surveillance system described above, fully realized, is no longer a problem confined to computer vision but a large-scale distributed systems problem with intensive data-processing resource requirements. Consider, for example, a simple small-scale surveillance system that does motion sensing and JPEG encoding/decoding. Figure 2 shows the processing requirements for such a system using a centralized setup (single 1.4GHz Intel Pentium CPU). In this system, each camera is restricted to stream images at a slow rate of 5 frames per

second (FPS), and each image has a very coarse-grained resolution of only 320x240. Even under these severely restricted data-processing conditions, the results show that the test system cannot scale beyond four cameras due to CPU saturation. Increasing the video quality (FPS and resolution) to that required by modern security applications would saturate even a high-end computing system attempting to process more than a few cameras. Clearly, scaling up to a large number of cameras (on the order of hundreds or thousands) warrants a distributed systems solution.

We take a systems approach to scalable smart surveillance, embodying several inter-related research threads: (i) determining the appropriate system abstractions to aid the computer-vision domain expert in developing such complex applications, (ii) determining the appropriate execution model that fully exploits the resources across the distributed system, and (iii) identifying technologies spanning sensing hardware and wireless infrastructures for supporting large-scale situation awareness.

Situation awareness as a research area is still evolving. It has attracted researchers from vastly different fields spanning computer vision, robotics, artificial intelligence, systems, and networking. In this paper, we discuss the component technologies, not an end-to-end system, that are essential to realize a fully functional situation awareness system. We start by understanding the application requirements, especially in the domain of video-based surveillance (Section II). We use this domain knowledge to raise questions about the systems research that is needed to support large-scale situation awareness. We then present a bird’s eye view of the enabling technologies of relevance to large-scale situation awareness (Section III). This tour of technologies spans computer vision, smart cameras and other sensors, wireless, context-aware computing, and programming models. We then report on our own experience in developing a system architecture for situation awareness applications (Section IV). IBM’s S3 system [16] is perhaps the only complete end-to-end system for situation awareness that we are aware of. We include a case study of IBM’s
S3 product, which represents the state-of-the-art in online video-based surveillance (Section V). We conclude with thoughts on where we are headed in the future in the exploration of large-scale situation awareness (Section VI).

II. Application Model

Using video-based surveillance as a concrete instance of the domain of situation awareness applications, let us first understand the application model. In a video-based surveillance application, there are two key functions: detection and tracking. For the sake of this discussion, we will say that detection is concerned with identifying any anomalous behavior of a person or an object in a scene. For example, in an airport, a person leaving a bag in a public place and walking away is one such anomalous event. Such an event has to be captured in real time by an automated surveillance system among thousands of normal activities in the airport. As can be imagined, there could be several such potentially anomalous events happening in an airport at any given time. Once such an event is detected, the object or the person becomes a target, and the automated surveillance system should keep track of the target that triggered the event. While tracking the target across multiple cameras, the surveillance system provides all relevant information about the target, including location and multiple views captured by different cameras, to eventually lead to a resolution of whether the original event is benign or something serious warranting appropriate action by a security team. For clarity, we will use the terms detector and tracker to denote these two pieces of the application logic.

The application model reveals the inherent parallel/distributed nature of a video-based surveillance application. Each detector is a per-camera computation, and these computations are inherently data parallel since there is no data dependency among the detectors working on different camera streams. Similarly, each tracker is a per-target computation that can be run concurrently for each target. If a target simultaneously appears in the field of view (FOV) of multiple cameras, the trackers following

the target on each of the different camera streams need to work together to build composite knowledge of the target. Moreover, there exist complex data sharing and communication patterns among the different instances of detectors and trackers. For example, the detector and trackers have to work together to avoid duplicate detection of the same target.

The application model as presented above can easily be realized in the small (i.e., on the order of tens of camera streams) by implementing the application logic to be executed on each of the cameras, with the output centrally analyzed for correlation and refinement. Indeed, there are already video analytics solution providers [17], [18] that offer mature commercial products for such scenarios. However, programming such scenarios in the large requires a distributed approach, whose scalability is a hard open problem.

How can a vision expert write an application for video-based surveillance that spans thousands of cameras and other sensors? How can we design a scalable infrastructure that spans a huge geographical area such as an airport or a city to support such applications? How do we reduce the programming burden on the domain expert by providing the right high-level abstractions? What context information is needed to support prioritization of data streams and associated computations? How can we transparently migrate computations between the edges of the network (i.e., at or close to the sensors) and the computational workhorses (e.g., the Cloud)? How do we adaptively adjust the fidelity of the computation commensurate with the application dynamics (e.g., more targets to be observed than can be sustained by the infrastructure)? These are some of the questions that our vision for large-scale situation awareness raises.

III. Enabling Technologies

The objective in this section is to give a bird’s eye view of the state-of-the-art in technologies that are key enablers for large-scale situation awareness. We start with a brief survey of computer vision technologies as they apply to video-based
surveillance (Section III-A). We then discuss smart camera technology, which is aimed at reducing the stress on the compute and networking infrastructure by facilitating efficient edge processing (such as filtering and motion detection) to quench uninteresting camera streams at the source (Section III-B). We then survey wireless technologies (Section III-C) that allow smart cameras to be connected to backend servers, given the computationally intensive nature of computer vision tasks. This is followed by a review of middleware frameworks for context-aware computing, a key enabler for paying selective attention to streams of interest for deeper analysis in situation awareness applications (Section III-D). Lastly, we review programming models and execution frameworks – perhaps the most important piece of the puzzle for developing large-scale situation awareness applications (Section III-E).

A. Computer Vision

Computer vision technologies have advanced dramatically during the last decade in a number of ways. Many algorithms have been proposed in different subareas of computer vision and have significantly improved the performance of computer vision processing tasks. There are two aspects to performance when it comes to vision tasks: accuracy and latency. Accuracy has to do with the correctness of the inference made by the vision processing task (e.g., how precise is the bounding box around a face generated by a face detection algorithm?). Latency, on the other hand, has to do with the time it takes for a vision processing task to complete its work. Traditionally, computer vision research has focused on developing algorithms that increase the accuracy of detection, tracking, etc. However, when computer vision techniques are applied to situation awareness applications, there is a tension between accuracy and latency. Algorithms that increase the accuracy of event detection are clearly preferable. However, if the algorithm is too slow, then the outcome of the event detection may be too late to serve as actionable knowledge. In general, in a video-based surveillance application the objective is to shrink the elapsed time (i.e., latency) between sensing and actuation. Since video processing is continuous in

nature, computer vision algorithms strive to achieve a higher processing frame rate (i.e., frames per second) to ensure that important events are not missed. Therefore, computer vision research has been focusing on improving performance, both in terms of accuracy and latency, for the computer vision tasks of relevance to situation awareness, namely, (1) Object Detection, (2) Object Tracking, and (3) Event Recognition.

Object Detection algorithms, as the name suggests, detect and localize objects in a given image frame. As the same object can have significant variations in appearance due to orientation, lighting, etc., accurate object detection is a hard problem. For example, with human subjects there can be variation from frame to frame in poses, hand position, and facial expressions [19]. Object detection can also suffer from occlusion [20]. This is the reason detection algorithms tend to be slow and do not achieve a very high frame rate. For example, a representative detection algorithm proposed by Felzenszwalb et al. [19] takes 3 seconds to train on one frame and 2 seconds to evaluate one frame. To put this performance in perspective, camera systems are capable of grabbing frames at rates upwards of 30 frames a second.

Object Tracking research has addressed online algorithms that train and track in real time (see [21] for a comprehensive survey). While previous research has typically used lab environments with static backgrounds and slowly moving objects in the foreground, recent research [22]–[24] has focused on improving (1) real-time processing, (2) handling of occlusions, (3) handling of movement of both target and background, and (4) handling of scenarios where an object leaves the field of view of one camera and appears in front of another. The real-time performance versus accuracy tradeoff is evident in the design and experimental results reported by the authors of these algorithms. For example, the algorithm proposed by Babenko et al. [22] runs at 25 fps, while the algorithm proposed by Kwon and Lee [24] takes 1 ∼ 5 seconds per frame (for similarly sized video frames). However, Kwon and Lee [24] show through experimental results that their algorithm results in higher accuracy over a
larger dataset of videos.

Event Recognition is a higher-level computer vision task that plays an important role in situation awareness applications. There are many different types of event recognition algorithms that are trained to recognize certain events and/or actions from video data. Examples of high-level events include modeling individual object trajectories [25], [26], recognizing specific human poses [27], and detecting anomalies and unusual activities [28]–[30]. In recent years, the state of the art in automated visual surveillance has advanced quite a bit for many tasks including: detecting humans in a given scene [4], [5]; tracking targets within a given scene from a single camera or multiple cameras [8], [9]; following targets in a wide field of view given overlapping sensors; classification of targets into people, vehicles, animals, etc.; collecting biometric information such as face [10] and gait signatures [11]; and understanding human motion and activities [13], [14].

In general, it should be noted that computer vision algorithms for the tasks of importance to situation awareness, namely detection, tracking, and recognition, are computationally intensive. The first line of defense is to quickly eliminate streams that are uninteresting. This is one of the advantages of using smart cameras (to be discussed next). More generally, facilitating real-time execution of such algorithms on a large-scale deployment of camera sensors necessarily points to a parallel/distributed solution (see Section III-E).
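
Relating the per-frame latencies quoted above to achievable processing rates makes the accuracy-latency tension concrete; the sustainable frame rate is simply the reciprocal of the per-frame latency:

    f_{\mathrm{proc}} = \frac{1}{t_{\mathrm{frame}}}, \qquad
    t_{\mathrm{frame}} = 2\,\mathrm{s} \Rightarrow f_{\mathrm{proc}} = 0.5\ \mathrm{fps}, \qquad
    t_{\mathrm{frame}} = 40\,\mathrm{ms} \Rightarrow f_{\mathrm{proc}} = 25\ \mathrm{fps},

compared with the roughly 30 fps at which camera systems can grab frames.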

B. Smart Cameras

One of the keys to a scalable infrastructure for large-scale situation awareness is to quench the camera streams at the source if they are not relevant (e.g., no action in front of a camera). One possibility is moving some aspects of the vision processing (e.g., object detection) to the cameras themselves. This would reduce the communication requirements from the cameras to the backend servers and add to the overall scalability of the wireless infrastructure (see Section III-C). Further, it will also reduce the overall computation

requirements in the backend server for the vision processing tasks.

With the evolution of sensing and computation technologies, smart cameras have also evolved along three dimensions: data acquisition, computational capability, and configurability. Data acquisition sensors are used for capturing the images of camera views. There are two current alternatives for such sensors: Charge-Coupled Device (CCD) sensors and Complementary Metal Oxide Semiconductor (CMOS) sensors. Despite the superior image quality obtainable with CCD sensors, CMOS sensors are more common in today’s smart cameras, mainly because of their flexible digital control, high-speed exposure, and other functionalities.

The computational element, needed for real-time processing of the sensed data, is in general implemented using one of (or a combination of) the following technologies: digital signal processors (DSPs), microcontrollers or microprocessors, Field Programmable Gate Arrays (FPGAs), multimedia processors, and Application-Specific Integrated Circuits (ASICs). Microcontrollers provide the most flexibility among these options but may be less suitable for the implementation of image processing algorithms compared to DSPs or FPGAs. With recent advances, memory controllers and microcontrollers are integrated with FPGA circuits to attain hardware-level parallelism while maintaining the reconfigurability of microprocessors. Thus, FPGAs are emerging as a good choice for implementing the computational elements of smart cameras [31].

Finally, because of the reconfigurability of CMOS sensors and the FPGA-based computational flexibility of today’s smart cameras, it is now possible to have fine-grained control of both sensing and processing units, leading to a whole new field of computational cameras [32]. This new breed of cameras can quickly adjust their optical circuitry to obtain high-quality images even under dynamic lighting or depth-of-view conditions. Such computational cameras, combined with pan-tilt-zoom (PTZ) controllers, FPGA-based image-processing elements, and a communication element to interact with other cameras or remote servers, can be
considered today as the state-of-the-art design of a smart camera. CITRIC [33] is a recent example, from an academic setting, of a wireless camera platform. Multi-tiered camera platforms have also been proposed, wherein low-power camera motes wake up higher-resolution cameras to capture and process interesting images. SensEye [34] is one such platform; it achieves low latency from sensing to actuation without sacrificing energy efficiency. Companies such as Philips, Siemens, Sony, and Texas Instruments [35], [36] have commercial smart camera products, and such smart cameras usually have programmable interfaces for customization. Axis [37], while focusing on IP cameras, incorporates multimodal sensors, such as passive infrared (PIR) sensors (for motion detection), in its camera offerings. The entertainment industry has also embraced cameras with additional sensing modalities; e.g., Microsoft Kinect [38] uses advanced sensor technologies to construct 3D video data with depth information using a combination of CMOS cameras and infrared sensing.

One of the problems with depending on only one sensing technology is the potential for increasing false positives (a false alarm for a non-existent threat situation) and false negatives (a real threat missed by the system). Despite the sophistication of computer vision algorithms, it is still the case that these algorithms are susceptible to lighting conditions, ambient noise, occlusions, etc. One way of enhancing the quality of the inference is to augment the vision techniques with other sensing modalities that may be less error-prone. Because of the obvious advantage of multi-modal sensing, many smart camera manufacturers nowadays add different sensors along with the optics and provide an intelligent surveillance system that takes advantage of the non-optical data, e.g., the use of integrated GPS to tag location awareness onto the streamed data.

C. Wireless Infrastructure

The physical deployment for a camera-based situation awareness application would consist of a plethora of wired and wireless infrastructure components: simple and smart cameras, wireless access points, wireless routers, gateways, and

Internet-connected backend servers. The cameras will of course be distributed spatially in a given region along with wireless routers and gateways. The role of the wireless routers is to stream the camera images to backend servers in the Internet (e.g., Cloud computing resources) using one or more gateways. The gateways connect the wireless infrastructure with the wired infrastructure and are connected to the routers using long-range links referred to as backhaul links. Similarly, the links between the wireless cameras and the wireless routers are short-range links and are referred to as access links. Additionally, wireless access points may be available to directly connect the cameras to the wired infrastructure. A typical deployment may in fact combine wireless access points and gateways together, or access points may be connected to gateways via Gigabit Ethernet.

1) Short-Range Technologies

IEEE 802.11n [39] is a very high throughput standard for wireless LANs. The 802.11n standard has evolved considerably from its predecessors, 802.11b and 802.11a/g. The 802.11n standard includes unique capabilities such as the use of multiple antennas at the transmitter and receiver to realize high-throughput links, along with frame aggregation and channel bonding. These features enable a maximum physical layer data rate of up to 600Mbps. The 802.11 standards provide an indoor communication range of less than 100m and hence are good candidates for short-range links.

IEEE 802.15.4 (Zigbee) [40] is another standard for small low-power radios intended for networking low-bit-rate sensors. The protocol specifies a maximum physical layer data rate of 250kbps and a transmission range between 10m and 75m. Zigbee uses multi-hop routing built upon the Ad-hoc On-demand Distance Vector (AODV [41]) routing protocol. In the context of situation awareness applications, Zigbee would be useful for networking other sensing modalities in support of the cameras (e.g., RFID, temperature, and humidity sensors).

The key issue in the use of the above technologies in a camera sensor network is the performance vs. energy trade-off. IEEE 802.11n provides
much higher data rates and wider coverage but is less energy-efficient when compared to Zigbee. Depending on the power constraints and data rate requirements in a given deployment, either of these technologies may be more appropriate than the other.

2) Long-Range Technologies

The two main candidate technologies for long-range links (for connecting routers to gateways) are Long Term Evolution (LTE) [42] and IEEE 802.16, Worldwide Interoperability for Microwave Access (WiMax) [43]. The LTE specification provides an uplink data rate of 50Mbps and communication ranges from 1km to 100km. WiMax provides high data rates of up to 128Mbps uplink and a maximum range of 50km. Thus, both of these technologies are well suited as backhaul links for camera sensor networks. There is an interesting rate-range tradeoff between access links and backhaul links. To support the high data rates (but short range) of access links, it is quite common to bundle multiple backhaul links together. While both of the above technologies allow long-range communication, the use of one technology in a given environment would depend on the spectrum that is available in a given deployment (licensed vs. unlicensed) and the existence of prior cellular core networks in the deployment area.
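
As a rough illustration of why backhaul links are bundled, using the nominal peak rates quoted above (actual effective throughputs are lower):

    \left\lceil \frac{600\ \mathrm{Mbps}\ \text{(802.11n access link)}}{128\ \mathrm{Mbps}\ \text{(WiMax uplink)}} \right\rceil = 5,

i.e., on the order of five bundled WiMax backhaul links would be needed to carry one fully loaded 802.11n access link.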

3) Higher-Layer Protocols

In addition to the link layer technologies that comprise a camera sensor network, higher-layer protocols for routing are also essential for the successful operation of camera networks: (i) Surge Mesh Routing is a popular routing protocol used in several commercial Zigbee devices such as the Crossbow Micaz motes [44]; it provides automatic re-routing when a camera sensor link fails and constructs a topology dynamically by keeping track of link conditions. (ii) RPL [45] is an IPv6 routing protocol for communicating multipoint-to-point traffic from low-power devices towards a central control point, as well as point-to-multipoint traffic from the central control point to the devices. RPL allows dynamic construction of routing trees with the root at the gateways in a camera sensor network. (iii) Mobile IP is a protocol that is designed

to allow mobile device users to move from one network to another while maintaining a permanent IP address [46]; this is especially useful in applications with mobile cameras (mounted on vehicles or robots). In the context of large-scale camera sensor networks, high throughput and scalability are essential. While Mobile IP and Surge Mesh Routing are easier to deploy, they are usable only for small networks. For large-scale networks, RPL is better suited but is also more complex to operate.

D. Context-aware Frameworks

Situation awareness applications are context-sensitive, i.e., they adapt their behavior depending on the state of their physical environment or information derived from the environment. Further, the context associated with applications is increasing in both volume and geographic extent over time. The notion of context is application-dependent. For example, IBM’s Smarter Planet [47] vision includes the global management of the scarce resources of our earth, such as energy and water, traffic, transportation, and manufacturing.

Specifically, with respect to video-based surveillance, the context information is the metadata associated with a specific object that is being tracked. An infrastructure for context management should integrate both the collected sensor data and geographic data, including maps, floor plans, 3D building models, and utility network plans. Such context information can be used by the automated surveillance system to preferentially allocate computing, networking, and cognitive resources. In other words, context-awareness enables selective attention, leading to better situation awareness.

An appropriate framework for managing and sharing context information for situation awareness applications is the sensor web concept, exemplified by Microsoft SenseWeb [48] and Nokia Sensor Planet [49]. Also, various middleware systems for monitoring and processing sensor data have been proposed (e.g., [50]–[52]). These infrastructures provide facilities for sensor discovery (e.g., by type or location), retrieval and processing of sensor data, and more sophisticated operations such as filtering
and/or event recognition.

Federated Context Management goes one step further than sensor webs. It integrates not only sensor data but also geographic data from various sources. The Nexus framework [53], [54] federates the context information of the different providers and offers context-aware applications a global and consistent view of the federated context. This global context not only includes observable context information (such as sensor values, road maps, and 3D models), but also higher-level situation information inferred from other contexts.

E. Programming Models

By far the biggest challenge in building large-scale situation awareness applications is the programming complexity associated with large-scale distributed systems, both in terms of ease of development for the domain expert and efficiency of the execution environment to ensure good performance. We will explore a couple of different approaches to addressing the programming challenge.

1) Thread based Programming Model

The lowest-level approach to building surveillance systems is to have the application developer handle all aspects of the system, including traditional systems aspects, such as resource management, and more application-specific aspects, such as mapping targets to cameras. In this model, a developer wishing to exploit the inherent application-level concurrency (see Section I) has to manage the concurrently executing threads over a large number of computing nodes. This approach gives maximum flexibility to the developer for optimizing the computational resources since he/she has complete control of the system resources and the application logic.

However, effectively managing the computational resources for multiple targets and cameras is a daunting responsibility for the domain expert. For example, the shared data structures between detectors and trackers ensuring target uniqueness should be carefully synchronized to achieve the most efficient parallel implementation. Multiple trackers operating on different video streams may also need to share data structures when they are

[Fig. 3: Target Tracking based on Stream-oriented Models. A stream graph in which a per-camera Detector stage passes frames, foreground masks, and newly detected blobs to a TrackerList stage, which feeds target positions back to the Detector for comparison against known targets.]

monitoring the same target. These complex patterns of data communication and synchronization place an unnecessary burden on an application developer, which is exacerbated by the need to scale the system to hundreds or even thousands of cameras and targets in a large-scale deployment (airports, cities).

2) Stream oriented Programming Model

Another approach is to use a stream-oriented programming model [55]–[58] as a high-level abstraction for developing surveillance applications. In this model, a programmer does not need to deal with low-level system issues such as communication and synchronization. Rather, she can focus on writing an application as a stream graph consisting of computation vertices and communication edges. Once a programmer provides the necessary information, including a stream graph, the underlying stream processing system manages the computational resources to execute the stream graph over multiple nodes. Various optimizations are applied at the system level, shielding the programmers from having to consider performance issues.

Figure 3 illustrates a stream graph for a target tracking application using IBM System S [55], one of the representative off-the-shelf stream processing engines. A detector processes each frame from a camera and produces a digest containing newly detected blobs, the original camera frame, and a foreground mask. A second stream stage, the trackerlist, maintains a list of trackers following different targets within a camera stream. It internally creates a new tracker for blobs newly detected by the
detector. Each tracker in the trackerlist uses the original camera frame and the foreground mask to update its target’s blob position. The updated blob position is sent back to the detector (associated with this camera stream) to prevent redundant detection of the same target.

There are several drawbacks to using this approach for a large-scale surveillance application. First, a complete stream graph must be provided by the programmer. In a large-scale setting, specifying such a stream graph with a huge number of stream stages and connections among them (taking into account camera proximities) is a very tedious task. Second, it cannot exploit the inherent parallelism of target tracking. Dynamically creating a new stream stage is not supported by System S; therefore a single stream stage, namely the trackerlist, has to execute multiple trackers internally. This drawback creates a significant load imbalance among the stream stages of different trackerlists, as well as low target tracking performance due to the sequential execution of the trackers by a given stream stage. Lastly, stream stages can only communicate through statically defined stream channels (internal to IBM System S), which prohibits arbitrary real-time data sharing among different computation modules. As shown in Figure 3, a programmer has to explicitly connect stream stages using stream channels and deal with the ensuing communication latency under conditions of infrastructure overload.
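
The second drawback is easy to see in code. The sketch below is generic C++ (not actual System S code) with illustrative types; because stages cannot be spawned dynamically, every tracker for a camera stream runs sequentially inside the single trackerlist stage:

    // Generic illustration, not System S code: a single "trackerlist" stage
    // serializes all per-target trackers for one camera stream.
    #include <functional>
    #include <vector>

    struct Digest { /* new blobs + camera frame + foreground mask from the detector */ };

    struct TrackerListStage {
        std::vector<std::function<void(const Digest&)>> trackers;  // one entry per target

        void on_tuple(const Digest& d) {
            // All trackers run one after another in this single stage, so many
            // targets on one camera mean load imbalance and added latency.
            for (auto& track : trackers) track(d);
        }
    };

    int main() {
        TrackerListStage stage;
        stage.trackers.push_back([](const Digest&) { /* track target 1 */ });
        stage.trackers.push_back([](const Digest&) { /* track target 2 */ });
        stage.on_tuple(Digest{});  // both trackers execute sequentially
    }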

IV. System Building Experiences

In this section, we report on our experiences in building a system infrastructure for large-scale situation awareness applications. Specifically, we describe (a) a novel programming model called Target Container (TC) (Section IV-A) that addresses some of the limitations of the existing programming models described in Section III-E; and (b) a peer-to-peer distributed software architecture called ASAP (the name is “Priority Aware Situation Awareness” read backwards) that addresses many of the challenges for scaling up situation awareness applications that were identified in Sections III-A through III-D. ASAP uses the principles in the wireless model for deployment of the physical infrastructure (see Section III-C); allows for edge processing (where possible with smart cameras) to conserve computing and networking resources; incorporates multi-modal sensing to reduce the ill effects of false positives and false negatives; and exploits context-awareness to prioritize camera streams of interest.

A. TC Programming Model

The TC programming model is designed for domain experts to rapidly develop large-scale surveillance applications. The key insight is to elevate the target to a first-class entity, both from the perspective of the programmer and from the perspective of resource allocation by the execution environment. Consequently, all application-level vision tasks become target-centric, which is more natural from the point of view of the domain expert. The runtime system is also able to optimize the resource allocation commensurate with the application’s perception of importance (expressed as target priority, see Table I) and equitably across all equal-priority targets. In principle, the TC model generalizes to dealing with heterogeneous sensors (cameras, RFID readers, microphones, etc.). However, for the sake of clarity of exposition, we adhere to cameras as the only sensors in this section.

The TC programming model shares with large-scale stream processing engines [55], [56] the concept of providing a high-level abstraction for large-scale stream analytics. However, TC is specifically designed for real-time surveillance applications, with special support based on the notion of a target.

1) TC Handlers and API

The intuition behind the TC programming model is quite simple and straightforward. Figure 4 shows the conceptual picture of how a surveillance application is structured using the new programming model, and Table I summarizes the API provided by the TC system. The application is written as a collection of handlers. There is a detector handler associated with each camera stream. The role of the detector handler is to analyze each camera image it receives to detect any new target that is


[Fig. 4: Surveillance Application using the TC Model. Detectors and trackers are organized into Target Containers (TCs); detector data, tracker data, and TC data are shared among them, and an equality checker operates on TC data.]

not already known to the surveillance system. The detector creates a target container for each new target it identifies in a camera frame by calling TC create target with an initial tracker and TC data. In the simple case, where a target is observed by only one camera, the target container contains a single tracker handler, which receives images from the camera and updates the target information on every frame arrival. However, due to overlapping fields of view, a target may appear in multiple cameras. Thus, in the general case, a target container may contain multiple trackers following a target observed by multiple cameras. A tracker can call TC stop track to notify the TC system that this tracker need not be scheduled anymore; it would do so upon realizing that the target it is tracking is leaving the camera’s field of view.

In addition to the detectors (one for each sensor stream) and the trackers (one per target per sensor stream associated with that target), the application must provide additional handlers to the TC system for the purposes of merging TCs, as explained below. Upon detecting a new target in its field of view, a detector would create a new target container. However, it is possible that this is not a new target but simply an already identified target that happened to move into the field of view of this camera. To address this situation, the application also provides a handler for equality checking of two targets. Upon establishing the equality of two

targets, the associated containers will be merged to encompass the two trackers (see Target Container in Figure 4). The application provides a merger handler to accomplish this merging of two targets by combining the two application-specific target data structures (TC data) into one. Incidentally, the application may also choose to merge two distinct targets into a single one (for example, consider a potential threat situation when two cohorts join together and walk in unison in an airport).

As shown in Figure 4, there are three categories of data with different sharing properties and life cycles. Detector data is the result of processing per-stream input and is associated with a detector. This data can be used to maintain detector context, such as detection history and the average motion level in the camera’s field of view, which are potentially useful for surveillance applications using per-camera information. The detector data is potentially shared by the detector and the trackers spawned thereof. The trackers spawned by the detector as a result of blob detection may need to inspect this detector data. The tracker data maintains the tracking context for each tracker. The detector may inspect this data to ensure target uniqueness. TC data represents a target. It is the composite of the tracking results of all the trackers within a single TC. The equality checking handler inspects TC data to see if two TCs are following the same target.

While all three categories of data are shared, the locality and degree of sharing for these three categories can be vastly different. For example, the tracker data is unique to a specific tracker and is at most shared with the detector that spawned it. On the other hand, the TC data may be shared by multiple trackers, potentially spanning multiple computational nodes, if an object is in the FOV of several cameras. The detector data is also shared among all the trackers that are working off a specific stream and the detector associated with that stream. This is the reason our API (see Table I) includes six different access calls for these three categories of shared data.

When programming a target tracking application, the developer has to be aware of the fact that
the handlers may be executed concurrently. Therefore, the handlers should be written as sequential code with no side effects on shared data structures, to avoid explicit application-level synchronization. The TC programming model allows an application developer to use optimized handlers written in low-level programming languages such as C and C++. To shield the domain expert from having to deal with concurrency bugs, data sharing between different handlers is only allowed through TC API calls (shown in Table I), which subsume data access with synchronization guarantees.
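
To make the handler structure concrete, below is a minimal C++ sketch of a detector handler and a tracker handler written against the Table I calls. The underscore naming (TC_create_target, and so on), the types (Frame, Blob, TCHandle), the helper vision routines, and the stub bodies are all illustrative assumptions; the real TC runtime would supply the API, and the domain expert would supply the vision code.

    // Illustrative sketch only: names, signatures, and stub bodies are assumptions,
    // not the actual TC runtime API.
    #include <cstdio>
    #include <vector>

    struct Frame    { int camera_id; /* pixels, timestamp, ... */ };
    struct Blob     { int x, y, w, h; };
    struct TCHandle { int id; };

    // Stand-in stubs for calls the TC runtime would provide (Table I names).
    TCHandle TC_create_target(const Blob&, const Frame&) { static int n = 0; return {n++}; }
    void TC_stop_track(TCHandle) {}
    void TC_update_tracker_data(TCHandle, const Blob&) {}
    void TC_update_detector_data(const std::vector<Blob>&) {}
    std::vector<Blob> TC_read_detector_data() { return {}; }

    // Application-supplied vision routines (trivially stubbed here).
    std::vector<Blob> find_new_blobs(const Frame&, const std::vector<Blob>&) { return {{0, 0, 8, 8}}; }
    bool still_in_view(const Frame&, const Blob&) { return true; }
    Blob refine_position(const Frame&, const Blob& last) { return last; }

    // Detector handler: runs once per frame of one camera stream.
    void detector_handler(const Frame& frame) {
        std::vector<Blob> known = TC_read_detector_data();       // per-detector data
        for (const Blob& b : find_new_blobs(frame, known)) {
            TC_create_target(b, frame);   // new TC with a tracker on this camera stream
            known.push_back(b);
        }
        TC_update_detector_data(known);   // shared with this stream's trackers
    }

    // Tracker handler: runs per frame for one target on one camera stream.
    void tracker_handler(TCHandle tc, const Frame& frame, const Blob& last_seen) {
        if (!still_in_view(frame, last_seen)) { TC_stop_track(tc); return; }
        TC_update_tracker_data(tc, refine_position(frame, last_seen));
    }

    int main() {
        Frame f{0};
        detector_handler(f);
        tracker_handler(TCHandle{0}, f, Blob{0, 0, 8, 8});
        std::puts("handlers ran once");
    }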

2) TC Merge Model

To seamlessly merge two TCs into one while tracking the targets in real time, the TC system periodically calls the equality checker on candidates for the merge operation. After a merge, one of the two TCs is eliminated, while the other TC becomes the union of the two previous TCs.

Execution of the equality checker on different pairs of TCs can be done in parallel, since it does not update any TC data. Similarly, merge operations can proceed in parallel so long as the TCs involved in the parallel merges are all distinct.

The TC system may use camera topology information for efficient merge operations. For example, if many targets are being tracked in a large-scale camera network, only those targets in nearby cameras should be compared and merged, to reduce the performance overhead of the real-time surveillance application.
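
A sketch of such a topology-pruned merge pass is given below; camera_distance(), the hop threshold, and the stub equality/merger handlers are illustrative assumptions.

    // Illustrative sketch: a periodic merge pass that compares only TCs whose
    // cameras are near each other in the deployment topology.
    #include <cstdlib>
    #include <vector>

    struct TC { int id; int camera; bool retired; /* + application TC data */ };

    int camera_distance(int a, int b) { return std::abs(a - b); }  // stand-in topology
    bool targets_equal(const TC&, const TC&) { return false; }     // app's equality checker
    void merge_into(TC& survivor, TC& other) { other.retired = true; /* union the TC data */ }

    void merge_pass(std::vector<TC>& tcs, int max_hops) {
        // Equality checks on distinct pairs could also run in parallel, since
        // the equality checker does not update TC data.
        for (std::size_t i = 0; i < tcs.size(); ++i)
            for (std::size_t j = i + 1; j < tcs.size(); ++j)
                if (!tcs[i].retired && !tcs[j].retired &&
                    camera_distance(tcs[i].camera, tcs[j].camera) <= max_hops &&
                    targets_equal(tcs[i], tcs[j]))
                    merge_into(tcs[i], tcs[j]);  // one TC becomes the union, the other retires
    }

    int main() {
        std::vector<TC> tcs = {{0, 1, false}, {1, 2, false}};
        merge_pass(tcs, 1);
    }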

B. ASAP

Situation awareness applications (such as video-based surveillance) are capable of stressing the available computation and communication infrastructures to their limits. Hence the underlying system infrastructure should be: (a) highly scalable (i.e., designed to reduce infrastructure overload, cognitive overload, and false positives and false negatives); (b) flexible to use (i.e., provide query- and policy-based user interfaces to exploit context-sensitive information for selective attention); and (c) easily extensible (i.e., accommodate heterogeneous sensing and allow for incorporation of new sensing modalities).

[Fig. 5: Functional view of ASAP. Sense (data), prioritize (filter/fuse), process (event), and actuate (actionable knowledge) stages span the control and data networks, with cameras and other sensors (motion, sound, glass-break, fire alarm, RFID) attached.]

[Fig. 6: ASAP Software Architecture]

We have designed and implemented a distributed software architecture called ASAP for situation awareness applications. Figure 5 shows the logical organization of the ASAP architecture into a control network and a data network. The control network deals with low-level sensor-specific processing to derive priority cues. These cues in turn are used by the data network to prioritize the streams and carry out further processing such as filtering and fusion of streams. It should be emphasized that this logical separation is simply a convenient vehicle to partition the functionalities of the ASAP architecture. The two networks are in fact overlaid on the same physical network and share the computational and sensing resources. For example, low-bitrate sensing such as an RFID tag or a fire alarm is part of the control network, whereas a high-bitrate camera sensor, while serving the video stream for the data network, may also be used by the control network for discerning motion.

Figure 6 shows the software architecture of ASAP: it is a peer-to-peer network of ASAP agents (AA) that execute on independent nodes of the distributed system. The software organization in each node consists of two parts: the ASAP Agent (AA) and the Sensor Agent (SA). There is one sensor agent per sensor, and a collection of sensor agents is assigned


TC create target(): Upon a call by a detector, the system creates a TC and associates it with the new target, associating a tracker within this TC (for this new target) that analyzes the same camera stream as the detector.
TC stop track(): Called by a tracker when a target disappears from a camera’s FOV, to prevent further execution of this tracker on behalf of this target.
TC get priority(): Get the priority of a TC.
TC set priority(): Set the priority of a TC.
TC update detector data(): Used by a detector for updates to per-detector data structures.
TC read detector data(): Used by a detector/tracker for read access to per-detector data structures.
TC update tracker data(): Used by a tracker for updates to per-tracker data structures.
TC read tracker data(): Used by a detector/tracker for read access to per-tracker data structures.
TC update TC data(): Used by a tracker for updates to per-TC data structures.
TC read TC data(): Used by a detector/tracker for read access to per-TC data structures.

TABLE I: Target Container API

dynamically to an ASAP agent.

1) Sensor Agent

The SA provides a virtual sensor abstraction: a uniform interface for incorporating heterogeneous sensing devices as well as for supporting multi-modal sensing in an extensible manner. This abstraction allows new sensor types to be added without requiring any change to the ASAP agent (AA). There is a potential danger in such virtualization that some specific capability of a sensor may get masked from full utilization. To avoid such semantic loss, we have designed a minimal interface that serves the needs of situation awareness applications.

The virtual sensor abstraction allows the same physical sensor to be used for providing multiple sensing services. For example, a camera can serve not only as a video data stream, but also as a motion or face detection sensor. Similarly, an SA may even combine multiple physical sensors to provide a multi-modal sensing capability. Once these different sensing modalities are registered with ASAP agents, they are displayed as a list of available features that users can select to construct a query for the ASAP platform. The ASAP platform uses these features as control cues for prioritization (to be discussed shortly).
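
A minimal sketch of the virtual-sensor idea is shown below; the class and method names are illustrative assumptions, not the actual ASAP interface.

    // Hypothetical sketch: one abstract interface lets heterogeneous devices
    // (cameras, RFID readers, microphones) register with an AA in the same way.
    #include <string>
    #include <vector>

    struct Cue { std::string name; double value; };  // e.g. {"motion", 0.8}

    class VirtualSensor {
    public:
        virtual ~VirtualSensor() = default;
        virtual std::vector<std::string> features() const = 0;  // cues this SA advertises
        virtual std::vector<Cue> sense() = 0;                    // produce control cues
    };

    // One physical camera exposed as two sensing services: video and motion.
    class CameraMotionSensor : public VirtualSensor {
    public:
        std::vector<std::string> features() const override { return {"video", "motion"}; }
        std::vector<Cue> sense() override { return {{"motion", 0.0}}; }  // stubbed reading
    };

    int main() {
        CameraMotionSensor cam;
        cam.sense();  // an AA would poll registered virtual sensors like this
    }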

2) ASAP Agent

As shown in Figure 6, an AA is associated with a set of SAs. The association is dynamic and is engineered at runtime in a peer-to-peer fashion among the AAs. The components of an AA are shown in Figure 6. The ASAP agent provides a simple query interface with SQL-like syntax. Clients can pose an SQL query using control cues as attributes. Different cues can be combined using “AND” and “OR” operators to create multi-modal sensing queries.
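
A hypothetical example of such a query is shown below; the exact SQL-like syntax and the client-side call are assumptions, not the actual ASAP interface.

    // Hypothetical multi-modal query posed by an ASAP client; the syntax is illustrative.
    #include <iostream>
    #include <string>

    int main() {
        std::string query =
            "SELECT video FROM cameras "
            "WHERE motion = 1 AND rfid_tag = 'TAG-1234'";  // cues combined with AND
        std::cout << query << '\n';  // a client would submit this to its ASAP agent
    }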

False positives and negatives

Figure 5 shows that sensed data leads to events, which, when filtered and fused, ultimately lead to actionable knowledge. Unfortunately, individual sensors may often be unreliable due to environmental conditions (e.g., poor lighting conditions near a camera). Thus it may not always be possible to have high confidence in the sensed data; consequently, there is a danger that the system may experience high levels of false negatives and false positives. It is generally recognized that multi-modal sensors would help reduce the ill effects of false positives and negatives. The virtual sensor abstraction of ASAP allows multiple sensors to be fused together and registered as a new sensor. Unlike multi-feature fusion (a la a face recognizer), where features are derived from the same (possibly noisy) image, multi-sensor fusion uses different sensing
ties. ASAP exploits a quorum system to make adecision. Even though a majority vote is imple-mented at the present time, AA may assign differentweights to the different sensors commensurate withthe error rates of the sensors to make the votingmore accurate.

Prioritization Strategies

ASAP needs to continuously extract prioritization cues from all the cameras and other sensors (the control network), and disseminate the selected camera streams (the data network) to interested clients (which could be detectors/trackers of the TC system from Section IV-A and/or an end user such as security personnel). ASAP extracts information from a sensor stream by invoking the corresponding SA. Since there may be many SAs registered at any time, invoking all SAs may be very compute-intensive. ASAP needs to prioritize the invocations of SAs to scale well with the number of sensors. This leads to the need for priority-aware computation in the control network. Once the set of SAs relevant to client queries is identified, the corresponding camera feeds need to be disseminated to the clients. If the bandwidth required to disseminate all streams exceeds the available bandwidth near the clients, the network will end up dropping packets. This leads to the need for priority-aware communication in the data network. Based on these needs, the prioritization strategies employed by ASAP can be grouped into two categories: priority-aware computation and priority-aware communication.

Priority-aware Computation. The challenge is dynamically determining the set of SAs, among all available SAs, that need to be invoked such that the overall value of the derived actionable knowledge (the benefit for the application) is maximized. We use the term Measure of Effectiveness (MOE) to denote this overall benefit. ASAP currently uses a simple MOE based on clients’ priorities.

The priority of an SA should reflect the amount of possibly “new” information the SA output may have and its importance to the query in progress. Therefore, the priority value is dynamic, and it depends on multiple factors, including the application requirements and the information already available

from other SAs. In its simplest form, priority assign-ment can be derived from the priority of the queriesthemselves. For instance, given two queries froman application, if the first query is more importantthan the second one, the SAs relevant to the firstquery will have higher priority compared to theSAs corresponding to the second query. More im-portantly, computations do not need to be initiatedat all of SAs since (1) such information extractedfrom sensed data may not be required by anyAA, and (2) unnecessary computation can degradeoverall system performance. The “WHERE” clausein the SQL-like query is used to activate a specificsensing task. If multiple WHERE conditions exist,the lowest computation-intensive task is initiatedfirst that activates the next task in turn. While it hasa trade-off between latency and overhead, ASAPuses this heuristic for the sake of scalability.Priority-aware Communication. The challenge isdesigning prioritization techniques for communi-cation on the data network such that applicationspecific MOE can be maximized. Questions to beexplored here include: how to assign priorities todifferent data streams and how to adjust their spatialor temporal fidelities that maximizes the MOE?In general, the control network packets are givenhigher priority than data network packets. Sincethe control network packets are typically muchsmaller than the data network packets, supportinga cluster of SAs with each AA does not overloadthe communication infrastructure.
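The cascaded activation heuristic for multiple WHERE conditions might be realized along the following lines. The task names and relative cost estimates are assumptions made for illustration; the actual ASAP scheduler is not reproduced here.

    # Sketch of cascaded activation: cheaper sensing tasks gate more expensive
    # ones, so a costly task runs only if every cheaper condition already held.
    from typing import Callable, List, Tuple

    def evaluate_where_cascade(
        conditions: List[Tuple[str, float, Callable[[], bool]]]
    ) -> bool:
        """Each condition is (name, estimated_cost, predicate). Evaluate them
        in increasing cost order and stop at the first failure."""
        for name, _cost, predicate in sorted(conditions, key=lambda c: c[1]):
            if not predicate():
                return False          # a cheaper cue failed; skip costlier tasks
        return True                   # all cues held; the stream can be prioritized

    conditions = [
        ("face_detection", 10.0, lambda: True),   # expensive vision task
        ("motion",          1.0, lambda: True),   # cheap frame differencing
        ("rfid_presence",   0.1, lambda: True),   # near-free tag lookup
    ]
    print(evaluate_where_cascade(conditions))     # True

This ordering is exactly the latency-versus-overhead trade-off noted above: the expensive task is deferred until the inexpensive cues have justified it.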

C. Summary of Results
We have built a testbed with network cameras and RFID readers for object tracking based on RFID tags and motion detection. This testbed allows us both to understand the programmability of large-scale camera networks using the TC model and to understand the scalability of the ASAP architecture. Specifically, in implementing ASAP, we had three important goals: (1) platform neutrality for the "box" that hosts the AA and SA, (2) the ability to support a variety of sensors seamlessly (e.g., network cameras as well as USB cameras), and (3) extensibility to support a wide range of handheld devices. We augmented our real testbed, consisting of tens of cameras, RFID readers, and microphones, with emulated sensors. The emulated sensors use the uniform virtual sensor interface discussed in Section IV-B. Due to the virtual sensor abstraction, an ASAP agent does not distinguish whether data comes from an emulated sensor or a real sensor. The emulated camera sends JPEG images at a rate requested by a client. The emulated RFID reader sends tag detection events based on an event file, where different event files mimic different object movement scenarios.

By using real devices (cameras, RFID readers, and microphones) and emulated sensors, we were able to conduct experiments to verify that our proposed software architecture scales to a large number of cameras. The workload used is as follows. An area is assumed to be made up of a set of cells organized as a grid. Objects start from a randomly selected cell, wait for a predefined time, and move to a neighboring cell. The number of objects, the grid size, and the object wait time are workload parameters (a small generator sketch is given at the end of this subsection). We used end-to-end latency (from sensing to actuation), network bandwidth usage, and CPU utilization as figures of merit as we scale up the system size (i.e., the number of cameras, from 20 to 980) and the number of queries (i.e., interesting events to be observed). The scalability is attested by two facts: (a) the end-to-end latency remains the same as we increase the number of queries (for a system with 980 camera streams); and (b) the CPU load and the network bandwidth requirements grow linearly with the number of interesting events to be observed (i.e., the number of queries) and not with the size of the system (i.e., the number of camera sensors in the deployment). Details of the TC programming system can be found in [59], and detailed results of the ASAP system evaluation can be found in [60].
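The grid-based workload described above could be emulated with a generator along the following lines. The parameter values and the printed event format are illustrative assumptions rather than the actual event files used in our evaluation.

    # Sketch of the grid workload: objects start in a random cell, dwell for a
    # fixed number of steps, then move to a neighboring cell.
    import random

    def simulate(num_objects: int, grid: int, wait_steps: int, total_steps: int):
        positions = {o: (random.randrange(grid), random.randrange(grid))
                     for o in range(num_objects)}
        for t in range(total_steps):
            if t % wait_steps:          # object dwells in its cell between moves
                continue
            for obj, (x, y) in positions.items():
                dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
                nx = min(max(x + dx, 0), grid - 1)
                ny = min(max(y + dy, 0), grid - 1)
                positions[obj] = (nx, ny)
                print(f"t={t} object={obj} cell=({nx},{ny})")  # emulated tag event

    simulate(num_objects=5, grid=10, wait_steps=3, total_steps=9)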

V. Case Study: IBM S3
The IBM Smart Surveillance project [16], [61] is one of the few research projects in smart surveillance systems that has turned into a product, one that has recently been used to augment Chicago's video surveillance network [62]. Quite a bit of fundamental research in computer vision technologies forms the cornerstone of IBM's smart surveillance solution. Indeed, IBM S3 transformed video-based surveillance systems from a pure data acquisition endeavor (i.e., recording the video streams on DVRs for post-mortem analysis) into an intelligent, real-time, online video analysis engine that converts raw data into actionable knowledge. The IBM S3 product includes several novel technologies [63], including multi-scale video acquisition and analysis, salient motion detection, 2-D multi-object tracking, 3-D stereo object tracking, video-tracking-based object classification, object structure analysis, face categorization following face detection to prevent "tailgating", etc. Backed by the IBM DB2 product, IBM S3 is a powerful engine for online querying of live and historical data in the hands of security personnel.

Our work, focusing on the programming model for large-scale situation awareness and a scalable peer-to-peer system architecture for multi-modal sensing, is complementary to the state of the art established by the IBM S3 research.

VI. Concluding Remarks
Large-scale situation awareness applications will continue to grow in scale and importance as our penchant for instrumenting the world with sensors of various modalities and capabilities continues. Using video-based surveillance as a concrete example, we have reviewed the enabling technologies spanning wireless networking, smart cameras, computer vision, context-aware frameworks, and programming models. We have reported our own experiences in building scalable programming models and software infrastructure for situation awareness.

Any interesting research answers a few questions and raises several more. This work is no different. One of the thorniest problems with physical deployment is the heterogeneity of, and lack of standards for, smart cameras. A typical large-scale deployment will include smart cameras of different models and capabilities. Vendors typically provide their own proprietary software for analyzing camera feeds and controlling the cameras from a dashboard. Interoperability of camera systems from different vendors is difficult, if not impossible.

From the perspective of computer vision, one of the major challenges is increasing the accuracy of detection and/or scene analysis in the presence of ambient noise, occlusion, and rapid movement of objects. Multiple views of the same object help improve the accuracy of detection and analysis; with the ubiquity of cameras in recent years, it is now feasible to deploy tens if not hundreds of cameras in relatively small spaces (e.g., one gate of an airport), but the challenge of using these multiple views to develop accurate and scalable object detection algorithms remains an open problem.

From a systems perspective, there is a considerable amount of work to be done in aiding the domain expert. There needs to be closer synergy between vision researchers and systems researchers to develop the right abstractions for programming large-scale camera networks, facilitating seamless handoff from one camera to another as objects move, state and computation migration between smart cameras and backend servers, elastically increasing the computational resources to deal with dynamic application needs, and so on.

Last but not least, one of the thorniest problems plaguing the Internet today is bound to hit sensor-based distributed computing in the near future, namely, spam. We have intentionally avoided discussing tamper-proofing techniques such as steganography in camera systems; but as we explore mission-critical applications (such as surveillance, urban terrorism, emergency response, and healthcare), ensuring the veracity of sensor sources will become increasingly important.

References
[1] M. McCahill and C. Norris, "Estimating the extent, sophistication and legality of CCTV in London," in CCTV, M. Gill, Ed. Perpetuity Press, 2003.
[2] R. Hunter, "Chicago's surveillance plan is an ambitious experiment," Gartner Research, 2004, http://www.gartner.com/DisplayDocument?doc_cd=123919.
[3] C. Norris, M. McCahill, and D. Wood, "The growth of CCTV: a global perspective on the international diffusion of video surveillance in publicly accessible space," Surveillance and Society, vol. 2, no. 2/3, 2004.
[4] W. E. L. Grimson and C. Stauffer, "Adaptive background mixture models for real-time tracking," in IEEE Conference on Computer Vision and Pattern Recognition, 1999.
[5] A. Elgammal, D. Harwood, and L. S. Davis, "Non-parametric background model for background subtraction," in Proc. of 6th European Conference on Computer Vision, 2000.
[6] D. M. Gavrila, "Pedestrian detection from a moving vehicle," in ECCV '00: Proceedings of the 6th European Conference on Computer Vision, Part II, 2000, pp. 37–49.
[7] P. Viola and M. Jones, "Robust real-time object detection," International Journal of Computer Vision, 2001.
[8] I. Haritaoglu, D. Harwood, and L. S. Davis, "W4: Who? When? Where? What? A real time system for detecting and tracking people," in International Conference on Face and Gesture Recognition, 1998.
[9] C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland, "Pfinder: Real-time tracking of the human body," IEEE Trans. Pattern Anal. Mach. Intell., 1997.
[10] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, "Face recognition: A literature survey," ACM Comput. Surv., vol. 35, no. 4, pp. 399–458, 2003.
[11] J. Little and J. Boyd, "Recognizing people by their gait: the shape of motion," Videre, vol. 1, no. 2, 1998.
[12] A. Bissacco, A. Chiuso, Y. Ma, and S. Soatto, "Recognition of human gaits," in Proc. of CVPR, vol. 2, 2001, pp. 52–58.
[13] D. M. Gavrila, "The visual analysis of human movement: a survey," Comput. Vis. Image Underst., vol. 73, no. 1, pp. 82–98, 1999.
[14] A. F. Bobick and J. W. Davis, "The recognition of human movement using temporal templates," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 3, pp. 257–267, 2001.
[15] T. B. Moeslund and E. Granum, "A survey of computer vision-based human motion capture," Comput. Vis. Image Underst., vol. 81, no. 3, pp. 231–268, 2001.
[16] "IBM smart surveillance system (S3)," http://www.research.ibm.com/peoplevision/.
[17] Siemens Building Technologies, "Video surveillance: Integrated surveillance systems," https://www.buildingtechnologies.siemens.com.
[18] ObjectVideo, "Products that make surveillance smart," http://www.objectvideo.com/products/.
[19] P. F. Felzenszwalb, R. B. Girshick, D. A. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, pp. 1627–1645, 2010.
[20] X. Wang, T. X. Han, and S. Yan, "An HOG-LBP human detector with partial occlusion handling," in International Conference on Computer Vision, 2009.
[21] A. Yilmaz, O. Javed, and M. Shah, "Object tracking: A survey," ACM Comput. Surv., vol. 38, no. 4, p. 13, 2006.
[22] B. Babenko, M. Yang, and S. J. Belongie, "Visual tracking with online multiple instance learning," in Computer Vision and Pattern Recognition, 2009, pp. 983–990.
[23] D. A. Ross, J. Lim, R. Lin, and M. Yang, "Incremental learning for robust visual tracking," International Journal of Computer Vision, vol. 77, pp. 125–141, 2008.
[24] J. Kwon and K. M. Lee, "Visual tracking decomposition," in Computer Vision and Pattern Recognition, 2010.
[25] A. Basharat, A. Gritai, and M. Shah, "Learning object motion patterns for anomaly detection and improved object detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
[26] I. Saleemi, K. Shafique, and M. Shah, "Probabilistic modeling of scene dynamics for applications in visual surveillance," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, pp. 1472–1485, 2009.
[27] B. Yao and L. Fei-Fei, "Modeling mutual context of object and human pose in human-object interaction activities," in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 17–24.
[28] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos, "Anomaly detection in crowded scenes," in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 1975–1981.
[29] X. Wang, X. Ma, and W. Grimson, "Unsupervised activity perception in crowded and complicated scenes using hierarchical Bayesian models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, pp. 539–555, 2009.
[30] H. Zhong, J. Shi, and M. Visontai, "Detecting unusual activity in video," in IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 2004, pp. 819–826.
[31] I. Bravo, J. Baliñas, A. Gardel, J. L. Lázaro, F. Espinosa, and J. García, "Efficient smart CMOS camera based on FPGAs oriented to embedded image processing," Sensors, vol. 11, no. 3, pp. 2282–2303, 2011. [Online]. Available: http://www.mdpi.com/1424-8220/11/3/2282/
[32] S. K. Nayar, "Computational cameras: Redefining the image," IEEE Computer Magazine, Special Issue on Computational Photography, pp. 30–38, Aug. 2006.
[33] P. Chen, P. Ahammad, C. Boyer, S. Huang, L. Lin, E. Lobaton, M. Meingast, S. Oh, S. Wang, P. Yan, A. Y. Yang, C. Yeo, L.-C. Chang, J. D. Tygar, and S. S. Sastry, "CITRIC: A low-bandwidth wireless camera network platform," in Proceedings of the International Conference on Distributed Smart Cameras, 2008.
[34] P. Kulkarni, D. Ganesan, P. Shenoy, and Q. Lu, "SensEye: a multi-tier camera sensor network," in MULTIMEDIA '05: Proceedings of the 13th Annual ACM International Conference on Multimedia. New York, NY, USA: ACM Press, 2005, pp. 229–238.
[35] H. Broers, W. Caarls, P. Jonker, and R. Kleihorst, "Architecture study for smart cameras," 2005.
[36] "Texas Instruments IP camera," http://www.ti.com/ipcamera.
[37] "Axis Communications," http://www.axis.com/.
[38] "Microsoft Kinect," http://www.xbox.com/kinect.
[39] IEEE standard for Wireless Local Area Networks: 802.11n. [Online]. Available: http://www.ieee802.org/11
[40] The ZigBee specification. [Online]. Available: http://www.zigbee.org
[41] C. Perkins, E. B. Royer, and S. Das, "Ad hoc on-demand distance vector (AODV) routing," IETF RFC 3561, 2003.
[42] 3GPP Long Term Evolution. [Online]. Available: http://www.3gpp.org/article/lte
[43] Worldwide Interoperability for Microwave Access. [Online]. Available: http://www.wimaxforum.org
[44] Xbow MICAz motes. [Online]. Available: http://www.xbow.com
[45] "RPL: IPv6 Routing Protocol for Low Power and Lossy Networks," draft-ietf-roll-rpl-19. [Online]. Available: http://tools.ietf.org/html/draft-ietf-roll-rpl-19
[46] "RFC 3344: IP mobility support for IPv4," Tech. Rep., 2002.
[47] IBM, "A smarter planet," http://www.ibm.com/smarterplanet.
[48] L. Luo, A. Kansal, S. Nath, and F. Zhao, "Sharing and exploring sensor streams over geocentric interfaces," in Proc. of 16th ACM SIGSPATIAL Int'l Conf. on Advances in Geographic Information Systems (ACM GIS '08), Irvine, CA, USA, Nov. 2008, pp. 3–12.
[49] Nokia, "Sensor planet," http://www.sensorplanet.org/.
[50] L. Sanchez, J. Lanza, M. Bauer, R. L. Olsen, and M. G. Genet, "A generic context management framework for personal networking environments," in Proceedings of the 3rd Annual International Conference on Mobile and Ubiquitous Systems (Workshop on Personalized Networks), 2006.
[51] S. Kang et al., "SeeMon: Scalable and energy-efficient context monitoring framework for sensor-rich mobile environments," in ACM Int. Conf. on Mobile Systems, 2008.
[52] D. J. Lillethun, D. Hilley, S. Horrigan, and U. Ramachandran, "MB++: An integrated architecture for pervasive computing and high-performance computing," in Proceedings of the 13th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA '07), August 2007, pp. 241–248.
[53] F. Hohl, U. Kubach, A. Leonhardi, K. Rothermel, and M. Schwehm, "Next century challenges: Nexus – an open global infrastructure for spatial-aware applications," in Proc. of 5th ACM/IEEE Int'l Conf. on Mobile Computing and Networking (MobiCom '99), Seattle, WA, USA, Aug. 1999, pp. 249–255.
[54] R. Lange, N. Cipriani, L. Geiger, M. Grossmann, H. Weinschrott, A. Brodt, M. Wieland, S. Rizou, and K. Rothermel, "Making the world wide space happen: New challenges for the Nexus platform," in Proceedings of the Seventh IEEE International Conference on Pervasive Computing and Communications (PerCom '09), 2009.
[55] B. Gedik, H. Andrade, K.-L. Wu, P. S. Yu, and M. Doo, "SPADE: the System S declarative stream processing engine," in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD '08). New York, NY, USA: ACM, 2008, pp. 1123–1134. [Online]. Available: http://doi.acm.org/10.1145/1376616.1376729
[56] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari, "S4: Distributed stream computing platform," in Proceedings of the IEEE International Conference on Data Mining Workshops, 2010.
[57] W. Thies, M. Karczmarek, and S. P. Amarasinghe, "StreamIt: A language for streaming applications," in Proceedings of the 11th International Conference on Compiler Construction (CC '02). London, UK: Springer-Verlag, 2002, pp. 179–196. [Online]. Available: http://portal.acm.org/citation.cfm?id=647478.727935
[58] P. S. Pillai, L. B. Mummert, S. W. Schlosser, R. Sukthankar, and C. J. Helfrich, "Slipstream: scalable low-latency interactive perception on streaming data," in Proceedings of the 18th International Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV '09). New York, NY, USA: ACM, 2009, pp. 43–48. [Online]. Available: http://doi.acm.org/10.1145/1542245.1542256
[59] K. Hong, B. Branzoi, J. Shin, S. Smaldone, L. Iftode, and U. Ramachandran, "Target container: A target-centric parallel programming abstraction for video-based surveillance," 2010, http://hdl.handle.net/1853/36186.
[60] J. Shin, R. Kumar, D. Mohapatra, U. Ramachandran, and M. Ammar, "ASAP: A camera sensor network for situation awareness," in OPODIS '07: Proceedings of the 11th International Conference on Principles of Distributed Systems, 2007.
[61] R. Feris, A. Hampapur, Y. Zhai, R. Bobbitt, L. Brown, D. Vaquero, Y. Tian, H. Liu, and M.-T. Sun, "Case study: IBM smart surveillance system," in Intelligent Video Surveillance: Systems and Technologies, Y. Ma and G. Qian, Eds. Taylor & Francis, CRC Press, 2009.
[62] "ABC7 puts video analytics to the test," ABC News, Feb. 23, 2010, http://abclocal.go.com/wls/story?section=news/special_segments&id=7294108.
[63] A. Hampapur, L. Brown, J. Connell, A. Ekin, N. Haas, M. Lu, H. Merkl, S. Pankanti, A. Senior, C.-F. Shu, and Y. L. Tian, "Smart video surveillance: exploring the concept of multiscale spatiotemporal tracking," IEEE Signal Processing Magazine, vol. 22, no. 2, pp. 38–51, March 2005.

Umakishore Ramachandran is the Director of the Samsung Tech Advanced Research (STAR) Center and a Professor in the College of Computing, Georgia Tech. He received his Ph.D. in Computer Science from the University of Wisconsin-Madison in 1986. His research interests span parallel and distributed systems, sensor networks, pervasive computing, and mobile and embedded computing.

Kirak Hong received a B.S. in Computer Science from Yonsei University, Korea, in 2009. Currently, he is a Ph.D. student in the College of Computing, Georgia Tech. His dissertation research focuses on programming models and execution frameworks for large-scale situation awareness applications. His research interests span distributed systems, mobile and embedded computing, and sensor networks.

Liviu Iftode is a Professor of Computer Science at Rutgers University. He received a Ph.D. in Computer Science from Princeton University in 1998. His research interests include operating systems, distributed systems, and mobile, vehicular, and pervasive computing. He is a senior member of IEEE and a member of ACM.

Ramesh Jain is a Donald Bren Professor in Information and Computer Sciences at the University of California, Irvine, where he is doing research in Event Web and experiential computing. He is a Fellow of ACM, IEEE, AAAI, IAPR, and SPIE. His current research interests are in searching multimedia data and creating EventWebs for experiential computing.

Rajnish Kumar received his Ph.D. in Computer Science from Georgia Tech in 2006. As part of his dissertation, he designed and implemented SensorStack, which provides systems support for cross-layering in the network stack for adaptability. Rajnish is currently working as chief technology officer at Weyond, and his research interests are in systems support for large-scale streaming data analytics.

Kurt Rothermel received his Ph.D. in Computer Science from the University of Stuttgart in 1985. Since 1990, he has been with the University of Stuttgart, where he is a Professor of Computer Science and the Director of the Institute of Parallel and Distributed Systems. His research interests span distributed systems, computer networks, mobile computing, and sensor networks.


Junsuk Shin received a B.S. in Electrical Engineering from Yonsei University, Korea, and an M.S. in Computer Science from the Georgia Institute of Technology. He joined Microsoft in 2009 and is currently working on his Ph.D. thesis. His research interests include distributed systems, sensor networks, mobile computing, and embedded systems.

Raghupathy Sivakumar is a Professor in the School of Electrical and Computer Engineering at Georgia Tech. He leads the Georgia Tech Networking and Mobile Computing (GNAN) Research Group, conducting research in the areas of wireless networking, mobile computing, and computer networks. He received his Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign in 2000.