Physical Data Storage Stephen Dawson-Haggerty. Data Sources sMAP - Data exploration/visualization -...

Date posted: 15-Jan-2016


Page 1

Physical Data Storage

Stephen Dawson-Haggerty

Page 2

[Architecture diagram. Labels: Data Sources; sMAP (four instances); StreamFS; Hadoop; HDFS; Applications]

Applications
- Data exploration/visualization
- Control loops
- Demand response
- Analytics
- Mobile feedback
- Fault detection

Page 3

Time-Series Databases

• Expected workload
• Related work
• Server architecture
• API
• Performance
• Future directions

Page 4

[Diagram: Dent circuit meter, sMAP]

Write Workload

• sMAP sources
  – HTTP/REST protocol for exposing physical information
  – Data trickles in as it's generated
  – Typical data rates: one reading every 1-60 s
• Bulk imports
  – Existing databases
  – Migrations

Page 5

Read Workload

• Plotting engine
• Matlab & Python adaptors for analysis
• Mobile apps
• Batch analysis

Dominated by range queries

Latency is important for interactive data exploration

Page 6

[Architecture diagram: readingdb. Labels, in the order they appear:]
• Page Cache, Lock Manager
• Key-Value Store
• Storage Alloc.
• Time-series Interface
• Bucketing, RPC, Compression
• insert, resample, aggregate, query
• streaming pipeline
• SQL
• Storage mapper
• MySQL

Page 7

Time series interface

db_open()
db_query(streamid, start, end)          Query points in a range
db_next(streamid, ref), db_prev(...)    Query points near a reference time
db_add(streamid, vector)                Insert points into the database
db_avail(streamid)                      Retrieve storage map
db_close()

All data is part of a stream, identified only by streamid

A stream is a series of tuples: (timestamp, sequence, value, min, max)
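
To make the call shapes above concrete, here is a minimal in-memory sketch in Python. It is not the readingdb implementation: the class name `MemoryReadingDB`, the half-open range in `db_query`, and the "strictly before/after" behaviour of `db_prev`/`db_next` are all assumptions made for illustration. Only the function names and the tuple layout come from the slide; `db_open`, `db_close` and `db_avail` are omitted.

```python
import bisect
from collections import defaultdict, namedtuple

# Tuple layout taken from the slide; everything else here is an assumption.
Point = namedtuple("Point", "timestamp sequence value min max")


class MemoryReadingDB:
    """Toy in-memory stand-in for the time-series interface (hypothetical)."""

    def __init__(self):
        # streamid -> list of Points kept sorted by timestamp
        self._streams = defaultdict(list)

    def db_add(self, streamid, vector):
        """Insert points, keeping each stream sorted by timestamp."""
        points = self._streams[streamid]
        points.extend(Point(*p) for p in vector)
        points.sort(key=lambda p: p.timestamp)

    def _timestamps(self, streamid):
        return [p.timestamp for p in self._streams[streamid]]

    def db_query(self, streamid, start, end):
        """Return points with start <= timestamp < end (half-open: assumed)."""
        ts = self._timestamps(streamid)
        lo = bisect.bisect_left(ts, start)
        hi = bisect.bisect_left(ts, end)
        return self._streams[streamid][lo:hi]

    def db_next(self, streamid, ref):
        """Return the first point strictly after ref, or None."""
        ts = self._timestamps(streamid)
        i = bisect.bisect_right(ts, ref)
        return self._streams[streamid][i] if i < len(ts) else None

    def db_prev(self, streamid, ref):
        """Return the last point strictly before ref, or None."""
        ts = self._timestamps(streamid)
        i = bisect.bisect_left(ts, ref)
        return self._streams[streamid][i - 1] if i > 0 else None


db = MemoryReadingDB()
db.db_add(1, [(t, 0, float(t), float(t), float(t)) for t in (10, 20, 30, 40)])
in_range = db.db_query(1, 15, 40)
```

Because a stream is identified only by its streamid, every call takes that id first; any metadata about what the stream measures would have to live elsewhere.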

Page 8

Storage Manager: BDB

• Berkeley DB (BDB): embedded key-value store
• Store binary blobs using B+ trees
• Very mature: around since 1992, supports transactions, free-threading, replication
• We use version 4

Page 9

RPC Evolution

• First: shared memory
  – Low latency
• Move to threaded TCP
• Google Protocol Buffers
  – zig-zag integer representation, multiple language bindings
  – Extensible for multiple versions
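
The zig-zag representation mentioned above is what Protocol Buffers uses for its `sint32`/`sint64` field types: signed values are interleaved onto the unsigned integers so that small negative numbers still encode as short varints. The following Python sketch shows the mapping and a base-128 varint writer; the function names are illustrative and are not part of any protobuf library.

```python
def zigzag_encode(n: int, bits: int = 64) -> int:
    """Map a signed integer to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return ((n << 1) ^ (n >> (bits - 1))) & ((1 << bits) - 1)


def zigzag_decode(z: int) -> int:
    """Invert zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


def varint(z: int) -> bytes:
    """Base-128 varint: 7 payload bits per byte, high bit set on all but the last."""
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


# -1 as a plain 64-bit two's-complement varint versus zig-zag encoded.
plain_len = len(varint(-1 & (2**64 - 1)))
zigzag_len = len(varint(zigzag_encode(-1)))
```

This matters for time-series data because deltas between neighbouring readings are often small and can be negative.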

Page 10

On-Disk Format

• All data stores perform poorly with one key per reading
  – index size is high
  – unnecessary
• Solution: bucket readings
• Excellent locality of reference with B+ tree indexes
  – Data sorted by streamid and timestamp
  – Range queries translate into mostly large sequential IOs

[Diagram: a bucket keyed by (streamid, timestamp)]
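
A rough Python sketch of the bucketing idea follows. It assumes a fixed one-hour bucket width (the slides do not give one) and uses a sorted key list as a stand-in for a B+ tree; `BucketStore` and `bucket_key` are hypothetical names, not readingdb code.

```python
import bisect

BUCKET_SECONDS = 3600  # assumed bucket width; the slides do not give one


def bucket_key(streamid, timestamp):
    """Key of the bucket holding this reading: (streamid, bucket start time)."""
    return (streamid, timestamp - timestamp % BUCKET_SECONDS)


class BucketStore:
    """Sorted key -> bucket map standing in for a B+ tree (illustrative only)."""

    def __init__(self):
        self._keys = []      # sorted list of (streamid, bucket_start)
        self._buckets = {}   # key -> list of (timestamp, value), sorted

    def add(self, streamid, timestamp, value):
        key = bucket_key(streamid, timestamp)
        if key not in self._buckets:
            bisect.insort(self._keys, key)
            self._buckets[key] = []
        bisect.insort(self._buckets[key], (timestamp, value))

    def query(self, streamid, start, end):
        """Scan adjacent buckets covering [start, end) for one stream."""
        lo = bisect.bisect_left(self._keys, bucket_key(streamid, start))
        hi = bisect.bisect_left(self._keys, (streamid, end))
        out = []
        for key in self._keys[lo:hi]:
            out.extend(p for p in self._buckets[key] if start <= p[0] < end)
        return out


store = BucketStore()
for t in range(0, 3 * 3600, 60):       # three hours of one-minute readings
    store.add(7, t, t / 60.0)
    store.add(8, t, -1.0)
points = store.query(7, 3000, 4000)
```

In this example 360 readings are indexed by only six keys, and a range query touches a contiguous run of keys for one stream, which is the locality property the slide describes.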

Page 11

On-Disk Format

• Represent in memory with materialized structure
  – 32b/rec
  – Inefficient on disk: lots of repeated data, missing fields
• Solution: compression
  – First: delta encode each bucket in protocol buffer
  – Second: Huffman Tree or Run Length encoding (zlib)
• Combined compression 2x better than gzip or either one
• 1M rec/second compress/decompress on modest hardware

[Diagram: bucket, compress, bdb page]
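
The two-stage scheme on this slide can be sketched in Python as delta encoding followed by zlib (DEFLATE, which combines LZ77 matching with Huffman coding). This is an approximation under stated assumptions: a fixed `struct` layout of two 32-bit integers stands in for the protocol buffer encoding, the record is cut down to (timestamp, value), and the sample data is synthetic. It does not reproduce the 2x figure quoted above.

```python
import struct
import zlib

RECORD = struct.Struct("<ii")  # (timestamp, value) as 32-bit ints: assumed layout


def delta_encode(records):
    """Replace each field with its difference from the previous record."""
    out, prev = [], (0, 0)
    for rec in records:
        out.append((rec[0] - prev[0], rec[1] - prev[1]))
        prev = rec
    return out


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out, prev = [], (0, 0)
    for d in deltas:
        prev = (prev[0] + d[0], prev[1] + d[1])
        out.append(prev)
    return out


def pack(rows):
    return b"".join(RECORD.pack(*row) for row in rows)


def unpack(blob):
    return [tuple(r) for r in RECORD.iter_unpack(blob)]


def compress_bucket(records):
    return zlib.compress(pack(delta_encode(records)))


def decompress_bucket(blob):
    return delta_decode(unpack(zlib.decompress(blob)))


# One bucket: a reading every 10 s whose value drifts slowly.
bucket = [(1_300_000_000 + 10 * i, 500 + i // 7) for i in range(1000)]
plain = zlib.compress(pack(bucket))
combined = compress_bucket(bucket)
```

Delta encoding turns regularly spaced timestamps into a run of identical small numbers, which the second stage then compresses far better than the raw counters.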

Page 12

Other Services: Storage Mapping

• What is in the database?
  – Compute a set of tuples (start, end, n)
• The desired interpretation is “the data source was alive”
• Different data sources have different ways of maintaining this information and maintaining confidence
  – Sometimes you have to infer it from the data
  – Sometimes data sources give you liveness/presence guarantees: “I haven’t heard from you in an hour, but I’m still alive!”

[Figure: dead or alive?]
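
For the "infer it from the data" case, one simple approach is to split a stream wherever the gap between consecutive readings exceeds a threshold. The Python sketch below does that; the function name `storage_map` and the `max_gap` parameter are assumptions for illustration, and the slides do not say how readingdb actually computes its (start, end, n) tuples.

```python
def storage_map(timestamps, max_gap):
    """Collapse sorted timestamps into (start, end, n) runs.

    A new run starts whenever two consecutive readings are more than
    max_gap apart, i.e. the source is presumed dead in between.
    """
    runs = []
    for t in timestamps:
        if runs and t - runs[-1][1] <= max_gap:
            start, _, n = runs[-1]
            runs[-1] = (start, t, n + 1)
        else:
            runs.append((t, t, 1))
    return runs


readings = [0, 60, 120, 180, 4000, 4060, 9000]
alive = storage_map(readings, max_gap=300)
```

A source that sends explicit liveness messages could feed those timestamps into the same function, so that a quiet but healthy source is not reported as dead.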

Page 13

readingdb6

• Up since December supporting Cory Hall, SDH Hall, most other LoCal deployments
  – behind www.openbms.org
• > 2 billion points in 10k streams
  – 12GB on disk ~= 5b/rec including index
  – So... we fit in memory!
• Import at around 300k points/sec
  – We maxed out the NIC
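
The per-record figure can be checked with a little arithmetic. The sketch below assumes that the on-disk size is 12 gigabytes with 1 GB = 10**9 bytes, and that the "32b/rec" in-memory figure from the on-disk format slide means 32 bytes; both readings are assumptions.

```python
points = 2_000_000_000        # "> 2 billion points", so a lower bound
disk_bytes = 12 * 10**9       # 12 GB, taking 1 GB = 10**9 bytes (assumption)
in_memory_bytes_per_rec = 32  # from the on-disk format slide, read as bytes

# Upper bound on bytes per record, since the point count is a lower bound.
bytes_per_rec = disk_bytes / points

# Point count that would make the quoted ~5 bytes/rec exact.
points_at_5_bytes = disk_bytes / 5

# Size of the same data held as uncompressed 32-byte records, in GB.
uncompressed_gb = points * in_memory_bytes_per_rec / 10**9
```

With exactly 2 billion points this gives 6 bytes per record; the quoted ~5 bytes corresponds to about 2.4 billion points, which is consistent with "> 2 billion".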

Page 14

Low Latency RPC

Page 15

Compression ratios

Page 16

Write load

Importing old data: 150k points/sec
Continuous write load: 300-500 pts/sec
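
As a rough consistency check, the continuous write load can be related to the stream count given on the readingdb slide. The sketch assumes that all 10k streams contribute to the steady load, which the slides do not state.

```python
streams = 10_000                 # "10k streams" from the readingdb slide
steady_rates = (300, 500)        # continuous write load, points/sec
import_rate = 150_000            # bulk import rate on this slide, points/sec

# Average seconds between readings per stream at each steady rate.
intervals = [streams / r for r in steady_rates]

# How many times the steady load the bulk-import rate represents.
headroom = [import_rate / r for r in steady_rates]
```

Under that assumption each stream reports once every 20 to 33 seconds on average, inside the 1-60 s range given for typical sMAP sources, and bulk import runs at 300 to 500 times the steady rate.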

Page 17

Future thoughts

• A component of a cloud storage stack for physical data
• Hadoop adaptor: improve MapReduce performance over HBase solution
• The data is small: 2 billion points in 12GB
  – We can go a long time without distributing this very much
  – Distribution is probably necessary for reasons other than performance

Page 18

THE END