Storage and Analysis of Big Data from Sensor Networks: Challenges and Opportunities Sameer Tilak, Calit2, UCSD




Large-Scale Sensor Network Applications

• Environmental Monitoring

– Limnology, Marine Science

• Participatory Sensing (Healthcare)

• Disaster Management and Emergency Response (Floods, earthquakes)

• Smart Home and grid


Overview

Large-scale environmental observing systems consist of sensors embedded deeply within our physical environment. Typically these systems consist of highly calibrated sensors deployed at strategic locations to generate science-quality data. Recently, personalized mobile sensing (e.g., sensors mounted on cars or carried by people) has received considerable attention from the research community. Although these systems consist of cheap, mobile sensors that are not well calibrated, they can provide greater spatial sampling diversity than traditional environmental observing systems. Together, these large-scale sensor networks can gather data at high spatio-temporal resolutions and have the potential to give scientists unprecedented insights into complex physical environments.


Research Challenges

• Software O&M (Sensor-Rocks)

• Power Management: Smart phones and sensors (context-aware sensing)

• Visualization (Calit2 Optiportal)

• Data Processing (Storm and Apache Big data ecosystem)

• Data Storage (HDD vs SSD Technologies)


Sensor-Rocks: Sensor Network Software O&M Automation

Sameer Tilak, Tony Fountain, Philip Papadopoulos, and Tajana Rosing

Tim Telfer (MURPA)


Motivation: Automating Sensor Device and Network Definition

• Inexpensive sensor devices have no interactive screens, so they can be configured only through a flashed system image, and today only hand-built techniques exist for that step;

• Scaling to 100s or 1000s of sensors means that we must be able to handle hardware heterogeneity of individual sensors without the time-consuming process of building a highly-customized, independent image for each variant; and

• We want better reproducibility of the basic software configuration so that we can easily adjust to the rapid changes of the Android environment and reap the benefits of new capabilities.


Rocks Background

• Rocks is a software toolkit that solves the computing cluster definition, deployment and management problem and has reduced the time from raw hardware to a working system within data centers from days or weeks to a few hours.

• The toolkit treats a complete software footprint on any machine as a set of software packages and configuration that together form a Rocks Appliance (e.g., database, parallel file servers, load managers, software routers). Appliances can share packages and configuration.


Data Center Configuration using Rocks

A Portion and a Complete Rocks Config. Graph. Software installation on a given node is performed through the traversal of the configuration graphs.

A Rocks-based sample Data center architecture
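
The graph traversal described on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not Rocks code: the node names, package names, and the `collect_packages` helper are hypothetical, and real Rocks graphs are written in XML and carry configuration as well as packages.

```python
from collections import deque

# Hypothetical configuration graph: each node lists its own packages and
# the nodes it points to. All names here are illustrative assumptions.
GRAPH = {
    "compute":  {"packages": ["mpi"],           "edges": ["base"]},
    "database": {"packages": ["mysql-server"],  "edges": ["base", "storage"]},
    "storage":  {"packages": ["lvm2"],          "edges": ["base"]},
    "base":     {"packages": ["kernel", "ssh"], "edges": []},
}


def collect_packages(graph, appliance):
    """Breadth-first traversal starting at an appliance node.

    Returns the packages of every reachable node, each listed once,
    in the order the nodes are visited.
    """
    seen = {appliance}
    queue = deque([appliance])
    packages = []
    while queue:
        node = queue.popleft()
        for pkg in graph[node]["packages"]:
            if pkg not in packages:
                packages.append(pkg)
        for nxt in graph[node]["edges"]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return packages


if __name__ == "__main__":
    print(collect_packages(GRAPH, "database"))
```

Because shared nodes such as `base` are reachable from several appliances, each appliance reuses the same package definitions rather than repeating them, which is the sharing the previous slide describes.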


Android Software Stack

Android application and kernel stack

Key system definition differences between Android and full Linux

• No scriptable system definition framework like Red Hat's Kickstart exists today for Android

• Unlike installed Linux systems, common tools for modifying configuration files via scripting languages are not part of the installed Android environment. This means that Linux can be used to automatically define and configure itself (this is what happens when you install from a DVD), but Android does not have the same closed form.


CyanogenMod

• Custom device firmware based on AOSP

• Android benefits for Wireless Sensor Devices:

o Free

o Open source

o Large developer base

o Many supported devices

o Many communication options

o Virtualised applications

o Hardware abstraction layer

• CM's improvements on Android:

o Older device support

o Greater battery efficiency

o Less bloat

o Nightly builds

o CPU under/overclocking

o Wifi/Bluetooth/USB tethering


Sensor-Rocks Profile graphs

• Profile graph to describe overall set of configurations

• Rolls as sub-graphs of nodes, containing packages and scripts

• Produces an installable custom ROM for selected distributions

• Ability to specify conditions and dependencies


Sensor-Rocks (technical)

• Graph, roll and node information is stored in XML files

• Edify is used to manage installation scripting

– Instead of Kickstart, as in standard Rocks

• Shell scripts are used for post-install scripting

• Package Manager handles source repositories and compiling for packages

• Roll Manager takes care of creating ROMs and the graph

• Device Manager will handle the database of devices and configurations
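
As a rough sketch of how an XML-stored profile graph with conditional edges might be resolved, the Python below parses a small document and collects packages for a given set of device attributes. The XML layout, the `cond-key`/`cond-value` attributes, the package names, and the `resolve` helper are all assumptions made for illustration; the slides do not show the actual Sensor-Rocks schema. Only the roll name `earth-sensor` and the attribute `hw = grouper` come from the deck.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout (NOT the real Sensor-Rocks schema).
PROFILE_XML = """
<graph>
  <node name="earth-sensor"><package>sensor-logger</package></node>
  <node name="base"><package>busybox</package></node>
  <node name="grouper-drivers"><package>grouper-kernel</package></node>
  <edge from="earth-sensor" to="base"/>
  <edge from="earth-sensor" to="grouper-drivers"
        cond-key="hw" cond-value="grouper"/>
</graph>
"""


def resolve(xml_text, root_name, attrs):
    """Return the packages reachable from root_name, following an edge
    only when it is unconditional or its condition matches attrs."""
    root = ET.fromstring(xml_text)
    packages = {
        n.get("name"): [p.text for p in n.findall("package")]
        for n in root.findall("node")
    }
    edges = {}
    for e in root.findall("edge"):
        key = e.get("cond-key")
        if key is not None and attrs.get(key) != e.get("cond-value"):
            continue  # condition not satisfied for this device
        edges.setdefault(e.get("from"), []).append(e.get("to"))

    ordered, stack, seen = [], [root_name], set()
    while stack:  # depth-first, visiting edges in document order
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        ordered.extend(packages.get(node, []))
        stack.extend(reversed(edges.get(node, [])))
    return ordered


if __name__ == "__main__":
    print(resolve(PROFILE_XML, "earth-sensor", {"hw": "grouper"}))
    print(resolve(PROFILE_XML, "earth-sensor", {"hw": "other"}))
```

Keeping the condition on the edge rather than on the node means one graph can describe many hardware variants, which is the heterogeneity problem raised in the motivation slide.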


Steps for Building a new device

• sensor-rocks-cm.py compile-roll -r earth-sensor -a "{'hw' : 'grouper'}"

• adb push <path-to-sensor-rolls>/out/rom.zip /sdcard/

• adb reboot recovery
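
The three steps above could be wrapped in a small script. The sketch below only assembles the commands exactly as they appear on the slide and prints them; it runs nothing unless `dry_run=False`. The helper names `build_steps` and `run` are hypothetical, and the rolls path must be supplied by the caller because the slide leaves it as a placeholder.

```python
import shlex
import subprocess


def build_steps(roll, hw, rolls_path):
    """Return the three commands from the slide as argument lists.

    rolls_path stands in for the slide's <path-to-sensor-rolls>.
    """
    attrs = "{'hw' : '%s'}" % hw
    return [
        ["sensor-rocks-cm.py", "compile-roll", "-r", roll, "-a", attrs],
        ["adb", "push", f"{rolls_path}/out/rom.zip", "/sdcard/"],
        ["adb", "reboot", "recovery"],
    ]


def run(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in steps:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failure


if __name__ == "__main__":
    # "/tmp/sensor-rolls" is an illustrative path, not one from the slides.
    run(build_steps("earth-sensor", "grouper", "/tmp/sensor-rolls"))
```

Passing argument lists instead of a single shell string avoids quoting problems with the attribute dictionary, which itself contains quotes and spaces.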


Specifying Post-Config: Edify Script


Applications and Updater Script

• Apk file stored appropriately

• Updater Script generated automatically

– Location

– Contents


OA Deployment

Moorea LTER

Martz SeapHox: pH, conductivity, temperature

Pro-Oceanus CO2-Pro: PCO2

MBARI-modified: pH

Sea-Bird Inductive Modem


Lake Deployment

NTL LTER

Hydro-lab DS-5: chlorophyll, dissolved oxygen

NexSens T-Node: temperature


Participatory Sensing: Healthcare


Patient-Centric Healthcare


Context-Aware Sensing


Source: Dr. Larry Smarr, Calit2, UCSD


Source: Dr. Larry Smarr, Calit2, UCSD

Data Visualization


Data Processing Challenges: Real-time and Batch-mode

• Real-time processing of tens of thousands of streams per second

• Data representation

• Integration of analysis tools

• Development of light-weight and scalable algorithms and models

• Accommodating failures: hardware (servers, disks, network, ...) and software (OS, middleware, apps)

• Accurate workload characterization

• Development of efficient data pipeline
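
One example of what a light-weight, scalable streaming algorithm can look like is an online mean and variance that keeps constant state per stream and never stores past readings. The sketch below uses Welford's method; it is an illustration chosen here, not a technique named in the slides, and the class and function names are hypothetical.

```python
class RunningStats:
    """Welford's online algorithm: O(1) memory and O(1) work per reading."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; defined as 0.0 until two readings have arrived."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


def update_streams(stats, readings):
    """Fold (stream_id, value) pairs into a dict of per-stream statistics."""
    for stream_id, value in readings:
        if stream_id not in stats:
            stats[stream_id] = RunningStats()
        stats[stream_id].update(value)
    return stats


if __name__ == "__main__":
    stats = update_streams({}, [("lake-temp", 20.1), ("lake-temp", 20.4),
                                ("ph", 8.05), ("lake-temp", 19.9)])
    for name, s in stats.items():
        print(name, s.n, round(s.mean, 3), round(s.variance, 4))
```

Because each stream's state is independent, the dictionary can be partitioned by stream ID across workers, which is the kind of decomposition stream processors such as Storm are built around.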


Big Data Ecosystem

• Apache Hadoop

• Apache Pig

• Apache Hive

• Apache Cassandra

• Apache HBase

• Apache Oozie

• Storm

• Esper

• Add your favorite here…


Solid State Drives

• Solid-state (Flash) drives present an attractive storage option. They are becoming cheaper and more common in data centers, and we believe this trend will continue. By 2020, the quantity of electronically stored data is projected to reach 35 trillion gigabytes. Big data technologies such as Apache Hadoop, HBase, Pig, and Hive are striving to make the storage, manipulation, and analysis of huge volumes of data cheaper and faster than ever. With the current 6 Gb/s SATA III interface, NAND-based solid-state drives deliver far better random I/O performance than traditional hard-disk drives. However, their benefit for big data processing is not yet quantified.
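
The figures on this slide are easier to compare after a unit conversion. The sketch below assumes decimal units and SATA's 8b/10b line encoding (10 bits on the wire per payload byte), which caps a nominal 6 Gbit/s SATA III link at about 600 MB/s of payload. The function names are hypothetical; the 480 GB capacity is the SSD size listed for the Lima nodes later in the deck.

```python
GB = 10**9   # decimal gigabyte, in bytes
ZB = 10**21  # decimal zettabyte, in bytes


def forecast_in_zettabytes(trillion_gb=35):
    """35 trillion gigabytes expressed in zettabytes."""
    return trillion_gb * 10**12 * GB // ZB


def sata3_payload_bytes_per_s(line_rate_bits=6_000_000_000):
    """Upper bound on payload bandwidth of a 6 Gbit/s link.

    8b/10b encoding carries 8 payload bits in every 10 line bits.
    """
    payload_bits = line_rate_bits * 8 // 10
    return payload_bits // 8


def seconds_to_stream(capacity_gb, bytes_per_s):
    """Time to read a whole drive once at the given sustained rate."""
    return capacity_gb * GB / bytes_per_s


if __name__ == "__main__":
    rate = sata3_payload_bytes_per_s()
    print(forecast_in_zettabytes(), "ZB")
    print(rate // 10**6, "MB/s")
    print(seconds_to_stream(480, rate), "s to read a 480 GB SSD once")
```

This is a ceiling set by the interface, not a measured result; real drives, and Hadoop workloads running on them, will see lower numbers.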


FutureGrid Key Concepts I

• FutureGrid is an international testbed modeled on Grid5000

– June 27 2012: 225 Projects, 920 users

• Supporting international Computer Science and Computational Science research in cloud, grid and parallel computing (HPC)

– Industry and Academia

• The FutureGrid testbed provides to its users:

– A flexible development and testing platform for middleware and application users looking at interoperability, functionality, performance or evaluation

– Each use of FutureGrid is an experiment that is reproducible

– A rich education and teaching platform for advanced cyberinfrastructure (computer science) classes

Source: Prof. Geoffrey Fox, Indiana University.


FutureGrid: a Grid/Cloud/HPC Testbed

[Figure: FutureGrid network diagram showing private and public segments and the FG Network. Labels: NID: Network Impairment Device; 10TF disk-rich + GPU, 320 cores.]

Source: Prof. Geoffrey Fox, Indiana University.


Compute Hardware

Name | System type | # CPUs | # Cores | TFLOPS | Total RAM (GB) | Secondary Storage (TB) | Site | Status
india | IBM iDataPlex | 256 | 1024 | 11 | 3072 | 180 | IU | Operational
alamo | Dell PowerEdge | 192 | 768 | 8 | 1152 | 30 | TACC | Operational
hotel | IBM iDataPlex | 168 | 672 | 7 | 2016 | 120 | UC | Operational
sierra | IBM iDataPlex | 168 | 672 | 7 | 2688 | 96 | SDSC | Operational
xray | Cray XT5m | 168 | 672 | 6 | 1344 | 180 | IU | Operational
foxtrot | IBM iDataPlex | 64 | 256 | 2 | 768 | 24 | UF | Operational
Bravo | Large disk & memory | 32 | 128 | 1.5 | 3072 (192 GB per node) | 192 (12 TB per server) | IU | Operational
Delta | Large disk & memory with Tesla GPUs | 32 CPU, 32 GPUs | 192 + 14336 GPU | ? 9 | 1536 (192 GB per node) | 192 (12 TB per server) | IU | Operational

TOTAL Cores: 4384

Source: Prof. Geoffrey Fox, Indiana University.


FutureGrid: Inca Monitoring

Source: Prof. Geoffrey Fox, Indiana University.


5 Use Types for FutureGrid

• 225 approved projects (~920 users) June 27 2012

– USA, China, India, Pakistan, lots of European countries

– Industry, Government, Academia

• Training Education and Outreach (8%)

– Semester and short events; promising for small universities

• Interoperability test-beds (3%)

– Grids and Clouds; Standards; from Open Grid Forum OGF

• Domain Science applications (31%)

– Life science highlighted (18%), Non Life Science (13%)

• Computer science (47%)

– Largest current category

• Computer Systems Evaluation (27%)

– XSEDE (TIS, TAS), OSG, EGI

• Clouds are meant to need less support than other models; FutureGrid needs more user support…

Source: Prof. Geoffrey Fox, Indiana University.


Online MOOCs

• Science Cloud MOOC repository

– http://iucloudsummerschool.appspot.com/preview

• FutureGrid MOOCs

– https://fgmoocs.appspot.com/explorer

• A MOOC that will use FutureGrid for class laboratories (for advanced students in IU Online Data Science masters degree)

– https://x-informatics.appspot.com/course

• MOOC Introduction to FutureGrid can be used by all classes and tutorials on FutureGrid

• Currently use Google Course Builder: Google Apps + YouTube

– Built as collection of modular ~10 minute lessons

Source: Prof. Geoffrey Fox, Indiana University.


SSD experimentation using Lima, a FutureGrid resource

Lima @ UCSD

• 8 nodes, 128 cores

• AMD Opteron 6212

• 64 GB DDR3

• 10GbE Mellanox ConnectX 3 EN

• 1 TB 7200 RPM Ent SATA Drive

• 480 GB SSD SATA Drive (Intel 520)


TestDFSIO: Throughput (MB/sec)

[Figure: two charts comparing HDD and SSD, titled Random Write and Random Read. x-axis: 1000 to 25000; y-axis: 0 to 180 on both charts.]


TestDFSIO: Execution Time (sec)

[Figure: two charts comparing HDD and SSD, titled Random Write and Random Read. x-axis: 1000 to 25000; y-axis: 0 to 700 on one chart and 0 to 1400 on the other.]


TestDFSIO: Throughput (MB/sec)

[Figure: two charts comparing HDD and SSD, titled Random Write and Random Read. x-axis: 10 to 250; y-axis: 0 to 160 on one chart and 0 to 50 on the other.]


TestDFSIO: Execution Time (sec)

[Figure: two charts comparing HDD and SSD, titled Random Write and Random Read. x-axis: 10 to 250; y-axis: 0 to 600 on one chart and 0 to 800 on the other.]


MURPA Experience

• Geoff Pascoe

• Thomas Moore (Journal article in SPE): Honors thesis

• Tim Telfer (one conference paper, with one more planned for submission): Honors thesis


Why UCSD?

• World-renowned research university

• Calit2 is multi-disciplinary

• San Diego: great weather and culturally diverse