Storage and Analysis of Big Data from Sensor Networks: Challenges
and Opportunities
Sameer Tilak,
Calit2, UCSD
Large-Scale Sensor Network Applications
• Environmental Monitoring
– Limnology, Marine Science
• Participatory Sensing (Healthcare)
• Disaster Management and Emergency Response (Floods, earthquakes)
• Smart Homes and the Smart Grid
Overview
Large-scale environmental observing systems consist of sensors embedded deeply within our physical environment. Typically, these systems consist of highly calibrated sensors deployed at strategic locations to generate science-quality data. Recently, personalized mobile sensing (e.g., sensors mounted on cars or carried by people) has received considerable attention from the research community. Although these systems consist of cheap, mobile sensors that are not well calibrated, they can provide spatial sampling diversity compared to traditional environmental observing systems. Together, these large-scale sensor networks can gather data at high spatio-temporal resolutions and have the potential to provide scientists with unprecedented insights into our complex physical environment.
Research Challenges
• Software O&M (Sensor-Rocks)
• Power Management: Smart phones and sensors (context-aware sensing)
• Visualization (Calit2 Optiportal)
• Data Processing (Storm and Apache Big data ecosystem)
• Data Storage (HDD vs SSD Technologies)
Sensor-Rocks: Sensor Network Software O&M Automation
Sameer Tilak, Tony Fountain, Philip Papadopoulos, and Tajana Rosing Tim Telfer (MURPA)
Motivation: Automating Sensor Device and Network Definition
• Inexpensive sensor devices have no interactive screens and must therefore be configured by flashing a system image; today, only hand-built techniques exist for this step;
• Scaling to 100s or 1000s of sensors means that we must be able to handle hardware heterogeneity of individual sensors without the time-consuming process of building a highly-customized, independent image for each variant; and
• We want better reproducibility of the basic software configuration so that we can easily adjust to the rapid changes of the Android environment and reap the benefits of new capabilities.
Rocks Background
• Rocks is a software toolkit that solves the computing-cluster definition, deployment, and management problem, and has reduced the time from raw hardware to a working system in the data center from days/weeks to a few hours.
• The toolkit treats the complete software footprint on any machine as a set of software packages and configuration that together form a Rocks Appliance (e.g., database, parallel file servers, load managers, software routers). Appliances can share packages and configuration.
Data Center Configuration using Rocks
A portion and a complete Rocks configuration graph. Software installation on a given node is performed by traversing the configuration graph.
A sample Rocks-based data center architecture
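The graph-traversal step can be sketched in a few lines. The appliance names, edges, and package names below are hypothetical examples, not taken from an actual Rocks distribution; the sketch only illustrates how a node's software set falls out of walking the configuration graph.

```python
# Minimal sketch of resolving a node's software set by traversing a
# configuration graph, in the spirit of Rocks installation.
# Appliance names and packages here are hypothetical examples.

GRAPH = {                     # edges: appliance -> inherited appliances
    "database": ["base"],
    "file-server": ["base"],
    "base": [],
}

PACKAGES = {                  # packages attached to each graph node
    "base": {"ssh", "syslog"},
    "database": {"postgresql"},
    "file-server": {"nfs-utils"},
}

def resolve_packages(appliance):
    """Collect all packages reachable from an appliance node."""
    seen, stack, packages = set(), [appliance], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        packages |= PACKAGES.get(node, set())
        stack.extend(GRAPH.get(node, []))
    return packages
```

Because appliances share sub-graphs, `resolve_packages("database")` picks up the `base` packages automatically, which is what lets many appliance variants share one definition.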
Android Software Stack
Android application and kernel stack
Key system definition differences between Android and full Linux
• No scriptable system-definition framework like Red Hat's Kickstart exists today for Android.
• Unlike installed Linux systems, common tools for modifying configuration files via scripting languages are not part of the installed Android environment. This means that Linux can be used to automatically define and configure itself (this is what happens when you install from a DVD), but Android does not have the same closed form.
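For contrast, a scriptable system definition in Kickstart can be a short declarative file; the fragment below is an illustrative example (package selections and settings are arbitrary), showing the kind of self-describing install recipe that Android lacks.

```
# Illustrative Kickstart fragment: a declarative, scriptable
# system definition that the installer executes unattended.
lang en_US.UTF-8
rootpw --plaintext changeme
autopart

%packages
@core
openssh-server
%end

%post
echo "configured by kickstart" > /etc/motd
%end
```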
CyanogenMod
• Custom device firmware based on AOSP
• Android benefits for Wireless Sensor Devices:
o Free
o Open source
o Large developer base
o Many supported devices
o Many communication options
o Virtualised applications
o Hardware abstraction layer
• CM’s improvements on Android:
o Older device support
o Greater battery efficiency
o Less bloat
o Nightly builds
o CPU under/overclocking
o Wifi/Bluetooth/USB tethering
Sensor-Rocks Profile graphs
• Profile graph to describe overall set of configurations
• Rolls as sub-graphs of nodes, containing packages and scripts
• Produces installable custom rom for selected distributions
• Ability to specify conditions and dependencies
Sensor-Rocks (technical)
• Graph, roll, and node information stored in XML files
• Edify used to manage installation scripting, instead of Kickstart as in standard Rocks
• Shell scripts used for post-install scripting
• Package Manager handles source repositories and compiling for packages
• Roll Manager takes care of creating ROMs and the graph
• Device Manager will handle a database of devices and configurations
Steps for Building a new device
• sensor-rocks-cm.py compile-roll -r earth-sensor -a "{'hw' : 'grouper'}"
• adb push <path-to-sensor-rolls>/out/rom.zip /sdcard/
• adb reboot recovery
Specifying Post-Config: Edify Script
Applications and Updater Script
• APK file stored appropriately
• Updater script generated automatically
o Location
o Contents
OA Deployment
Moorea LTER
Martz SeapHox: pH, conductivity, temperature
Pro-Oceanus CO2-Pro: PCO2
MBARI-modified: pH
Sea-Bird Inductive Modem
Lake Deployment
NTL LTER
Hydro-lab DS-5: chlorophyll, dissolved oxygen
NexSens T-Node: temperature
Participatory Sensing: Healthcare
Patient-Centric Healthcare
Context-Aware Sensing
Source: Dr. Larry Smarr, Calit2, UCSD
Source: Dr. Larry Smarr, Calit2, UCSD
Data Visualization
Data Processing Challenges: Real-time and Batch-mode
• Real-time processing of tens of thousands of streams/second
• Data representation
• Integration of analysis tools
• Development of light-weight and scalable algorithms and models
• Accommodating failures: hardware (servers, disks, network, ...) and software (OS, middleware, apps)
• Accurate workload characterization
• Development of efficient data pipelines
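A common pattern on the real-time side is windowed aggregation over many streams, the kind of operation a Storm topology distributes across workers. A minimal single-process sketch follows; the stream identifiers and the window size are arbitrary choices, not values from the deployments above.

```python
# Minimal sketch of windowed aggregation over sensor streams, the kind
# of per-stream operation a Storm bolt would perform at scale.
from collections import defaultdict, deque

WINDOW = 10  # keep the last 10 readings per stream (arbitrary choice)
windows = defaultdict(lambda: deque(maxlen=WINDOW))

def ingest(stream_id, value):
    """Add a reading to its stream's window and return the window average."""
    w = windows[stream_id]
    w.append(value)            # deque(maxlen=...) evicts the oldest reading
    return sum(w) / len(w)
```

In a real deployment each worker would own a shard of the stream IDs, so this per-stream state stays small and local even with tens of thousands of streams.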
Big Data Ecosystem
• Apache Hadoop
• Apache Pig
• Apache Hive
• Apache Cassandra
• Apache HBase
• Apache Oozie
• Storm
• Esper
• Add your favorite here…
Solid State Drives
• Solid-state (flash) drives present an attractive storage option: they are becoming cheaper and more common in data centers, and we believe this trend will continue. By 2020, the quantity of electronically stored data is projected to reach 35 trillion gigabytes. Big data technologies such as Apache Hadoop, HBase, Pig, and Hive are striving to make the storage, manipulation, and analysis of huge volumes of data cheaper and faster than ever. Over the current 6 Gb/s SATA III interface, NAND-based solid-state drives deliver astounding performance (in terms of random I/O) compared to traditional hard-disk drives. However, their benefit for big data processing is not yet quantified.
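The random-I/O gap can be probed with a tiny micro-benchmark in the spirit of TestDFSIO's random-read test, run once against an HDD-backed path and once against an SSD-backed path. The file size, block size, and read count below are arbitrary, and on small files the OS page cache will mask device differences, so treat this as a sketch rather than a rigorous benchmark.

```python
# Tiny random-read micro-benchmark sketch for comparing an HDD-backed
# and an SSD-backed path. Caveat: the OS page cache will mask device
# differences unless the file is large or caches are dropped.
import os
import random
import time

def random_read_throughput(path, file_mb=4, block=4096, reads=512):
    """Create a test file, do random block reads, return MB/s."""
    size = file_mb * 1024 * 1024
    with open(path, "wb") as f:              # create the test file
        f.write(os.urandom(size))
    start = time.perf_counter()
    with open(path, "rb") as f:
        for _ in range(reads):
            f.seek(random.randrange(0, size - block))
            f.read(block)
    elapsed = time.perf_counter() - start
    os.remove(path)                          # clean up the test file
    return (reads * block) / (1024 * 1024) / elapsed  # MB/s
```

Pointing `path` at mount points on the two drive types gives a quick first-order comparison before running the full Hadoop TestDFSIO suite.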
FutureGrid Key Concepts I
• FutureGrid is an international testbed modeled on Grid5000
– June 27 2012: 225 Projects, 920 users
• Supporting international Computer Science and Computational Science research in cloud, grid and parallel computing (HPC)
– Industry and Academia
• The FutureGrid testbed provides to its users:
– A flexible development and testing platform for middleware and application users looking at interoperability, functionality, performance or evaluation
– Each use of FutureGrid is an experiment that is reproducible
– A rich education and teaching platform for advanced cyberinfrastructure (computer science) classes
Source: Prof. Geoffrey Fox, Indiana University.
FutureGrid: a Grid/Cloud/HPC Testbed
[Diagram: FutureGrid testbed — private and public resources connected by the FG network; NID: Network Impairment Device; one resource is disk-rich with GPUs (10 TF, 320 cores)]
Source: Prof. Geoffrey Fox, Indiana University.
Compute Hardware

Name     System type                           # CPUs           # Cores          TFLOPS  Total RAM (GB)        Secondary storage (TB)  Site  Status
india    IBM iDataPlex                         256              1024             11      3072                  180                     IU    Operational
alamo    Dell PowerEdge                        192              768              8       1152                  30                      TACC  Operational
hotel    IBM iDataPlex                         168              672              7       2016                  120                     UC    Operational
sierra   IBM iDataPlex                         168              672              7       2688                  96                      SDSC  Operational
xray     Cray XT5m                             168              672              6       1344                  180                     IU    Operational
foxtrot  IBM iDataPlex                         64               256              2       768                   24                      UF    Operational
Bravo    Large disk & memory                   32               128              1.5     3072 (192 per node)   192 (12 TB per server)  IU    Operational
Delta    Large disk & memory with Tesla GPUs   32 CPU, 32 GPUs  192 + 14336 GPU  9 (?)   1536 (192 per node)   192 (12 TB per server)  IU    Operational

TOTAL cores: 4384
Source: Prof. Geoffrey Fox, Indiana University.
FutureGrid: Inca Monitoring
Source: Prof. Geoffrey Fox, Indiana University.
5 Use Types for FutureGrid
• 225 approved projects (~920 users) as of June 27, 2012
– USA, China, India, Pakistan, many European countries
– Industry, Government, Academia
• Training Education and Outreach (8%) – Semester and short events; promising for small universities
• Interoperability test-beds (3%) – Grids and Clouds; Standards; from Open Grid Forum OGF
• Domain Science applications (31%) – Life science highlighted (18%), Non Life Science (13%)
• Computer science (47%) – Largest current category
• Computer Systems Evaluation (27%) – XSEDE (TIS, TAS), OSG, EGI
• Clouds are meant to need less support than other models; in practice, FutureGrid needs more user support.
Source: Prof. Geoffrey Fox, Indiana University.
Online MOOCs
• Science Cloud MOOC repository
– http://iucloudsummerschool.appspot.com/preview
• FutureGrid MOOCs
– https://fgmoocs.appspot.com/explorer
• A MOOC that will use FutureGrid for class laboratories (for advanced students in the IU online Data Science master's degree)
– https://x-informatics.appspot.com/course
• The MOOC Introduction to FutureGrid can be used by all classes and tutorials on FutureGrid
• Currently uses Google Course Builder: Google Apps + YouTube
– Built as a collection of modular ~10-minute lessons
Source: Prof. Geoffrey Fox, Indiana University.
SSD experimentation using Lima, a FutureGrid resource
Lima @ UCSD
• 8 nodes, 128 cores
• AMD Opteron 6212
• 64 GB DDR3
• 10GbE Mellanox ConnectX 3 EN
• 1 TB 7200 RPM Ent SATA Drive
• 480 GB SSD SATA Drive (Intel 520)
TestDFSIO Results (HDD vs. SSD)
[Chart: TestDFSIO throughput (MB/s), random-write and random-read panels, HDD vs. SSD, x-axis values 1000–25000]
[Chart: TestDFSIO execution time (sec), random-write and random-read panels, HDD vs. SSD, x-axis values 1000–25000]
[Chart: TestDFSIO throughput (MB/s), random-write and random-read panels, HDD vs. SSD, x-axis values 10–250]
[Chart: TestDFSIO execution time (sec), random-write and random-read panels, HDD vs. SSD, x-axis values 10–250]
MURPA Experience
• Geoff Pascoe
• Thomas Moore (Journal article in SPE): Honors thesis
• Tim Telfer (one conference paper; will try to submit one more): Honors thesis
Why UCSD?
• World-renowned research university
• Calit2 is multi-disciplinary
• San Diego: great weather and culturally diverse