A fast time series data server

A fast time series data server Bob Weigel George Mason University Status: In development


Page 1: A fast time series data server

A fast time series data server

Bob Weigel
George Mason University

Status: In development

Page 2: A fast time series data server

Motivation

• Want to do fast large scale analysis on time series data
– Data volume and data processing speed often do matter!
– Speed enables many services.
– If you want users to contribute data, provide
1. $,
2. free storage,
3. better organization and search than their OS + local file system provides, or
4. services on their data that are better than what they can do on their local machine.

Page 3: A fast time series data server

Demo

Page 4: A fast time series data server
Page 5: A fast time series data server

The problems

• Heliophysics “data bases”
– The “granule” paradigm. The fundamental unit is the granule (file), which contains many parameters.
– The “small-box” paradigm. Given a user request, return a list of granule URLs that match. The user needs to do the rest. This slows query response times by up to a factor of 100!
– The fundamental unit exposed to the scientist should be the data set. This requires “aggregation”, which can be client-side or server-side.

• Well-known and widely available relational database systems (RDBS) don’t work well for time series (“column-based” versus “row-based”)

Page 6: A fast time series data server

Approaches for Large Scale Analysis

0. Let the user do the “aggregation”

A. Service: The “run-on-demand” paradigm - A reader (or “accessor”) is developed for each data provider that downloads data to the user's computer, extracts the relevant parts, and puts the data in a uniform form in an array or structure in the user's software analysis program.

1. Disadvantages: Requires high server reliability (servers are typically run by scientists …). Higher server load, higher data transfer volume.

2. Advantages: No additional disk space. Always up-to-date.

B. Service: The “pre-caching” approach - The data are stored in a uniform manner on an intermediate server. The user makes a request to a single server.

1. Disadvantages: More disk space. Cache may be out-of-date.

2. Advantages: 5-100x speed-ups in response. Reliability (errors are caught ahead of time, as are server problems). Many new services will be enabled.

Page 7: A fast time series data server

Ideal Approach

Note that pre-caching requires “run-on-demand” solutions, but takes data a step further

Note that “run-on-demand” approach will eventually develop a caching approach anyway – better to develop caching as a separate component

=> Use “pre-caching” for reasonably sized data sets. Use “run-on-demand” for large data sets and for filling the cache. A significant portion of heliophysics data could be pre-cached.
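The combination described on this slide can be sketched as a cache that falls back to a run-on-demand reader. This is a minimal illustration, not the server's actual code: the class name `PreCache`, the `reader` callable, and the `max_values` size limit are all hypothetical.

```python
from typing import Callable, Dict, List


class PreCache:
    """Serve data sets from a cache, filling it with a run-on-demand reader."""

    def __init__(self, reader: Callable[[str], List[float]], max_values: int):
        self.reader = reader          # hypothetical run-on-demand accessor
        self.max_values = max_values  # hypothetical limit for "reasonably sized"
        self.store: Dict[str, List[float]] = {}

    def get(self, dataset_id: str) -> List[float]:
        if dataset_id in self.store:
            return self.store[dataset_id]   # fast path: already pre-cached
        data = self.reader(dataset_id)      # slow path: run on demand
        if len(data) <= self.max_values:
            self.store[dataset_id] = data   # fill the cache for the next request
        return data                         # large data sets are never cached
```

Keeping the cache as a separate component around the reader means the reader itself stays unchanged, which is the separation the slide argues for.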

Page 8: A fast time series data server

Question

Why hasn’t this been done before?

1. Looks like data centralization.

2. Without an improved data base, improvements using existing infrastructure are incremental.

Page 9: A fast time series data server

Only one data type

• Focus on only one data type: time series.

• Defined as
– Scalar: x(t), x(t+1), …
– Vector: Bx(t), By(t), Bx(t+1), By(t+1), …
– Spectrogram: A1(t), A2(t), …, AN(t), …, A1(t+1), A2(t+1), …, AN(t+1)
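All three cases share one layout: the components for a given time step are stored together, followed by the components for the next step. A short sketch of that ordering, assuming zero-based time-step and component indices (the function names are illustrative only):

```python
def flat_index(step: int, component: int, n_components: int) -> int:
    """Position of one component at one time step in the interleaved layout."""
    if not 0 <= component < n_components:
        raise ValueError("component out of range")
    return step * n_components + component


def interleave(columns):
    """Turn per-component columns [[Bx...], [By...]] into Bx(t), By(t), Bx(t+1), ..."""
    return [value for row in zip(*columns) for value in row]
```

A scalar is the case `n_components = 1`, a vector has 2 or 3 components, and a spectrogram has N.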

Page 10: A fast time series data server

Development history

• Developed as a part of ViRBO
• Built on OPeNDAP

Page 11: A fast time series data server

Codebase

• Java
• OPeNDAP

– Have written “I/O Service Provider” for data files.

– Added ability to pass time constraint expressions

– Added ability to output data as an ASCII table

– Added basic filters

Page 12: A fast time series data server

Technical details

• Each time series is stored as a single flat binary file with IEEE 754 floating point values.

• Time series that are close to being on a uniform grid are re-gridded with fill values.

• All time series use a single fill value of NaN.

• Files are stored on a compressed file system.

– Fast random access to compressed files. About 6x slower access speed, but compression ratio is usually 8.

• Files are stored on a versioning file system. Only differences are stored.
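The re-gridding and flat binary storage described above might look like the following sketch. It assumes 8-byte little-endian doubles (the slide only says IEEE 754 floating point) and nearest-slot assignment; the function names and the grid parameters `t0`, `dt`, `n` are illustrative.

```python
import math
import struct


def regrid(times, values, t0, dt, n):
    """Place samples on the uniform grid t0, t0+dt, ...; empty slots stay NaN."""
    grid = [math.nan] * n
    for t, v in zip(times, values):
        k = round((t - t0) / dt)   # nearest grid slot
        if 0 <= k < n:
            grid[k] = v            # if two samples share a slot, the later one wins
    return grid


def to_bytes(grid):
    """Pack the grid as little-endian IEEE 754 doubles (width is an assumption)."""
    return struct.pack(f"<{len(grid)}d", *grid)


def from_bytes(buf):
    """Unpack a buffer written by to_bytes back into a list of floats."""
    return list(struct.unpack(f"<{len(buf) // 8}d", buf))
```

Because every slot has the same width and the grid is uniform, the time of a value follows from its position in the file, so no separate time column needs to be stored.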

Page 13: A fast time series data server

API – lowest level

• HTTP byte-range request

http://timeseries.org/data/TimeSeries.ncml (contains data structure information and a URL to the science metadata)

http://timeseries.org/data/TimeSeries.bin (just a time-ordered set of values Bx(t), By(t), Bx(t+1), By(t+1))
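Since the `.bin` file is a flat array of fixed-width values, a range of time steps maps directly to a byte range. The sketch below builds such a request with the standard library. It assumes 8-byte values and that the number of components has been read from the `.ncml` file; `byte_range` and `range_request` are illustrative names, not part of the server.

```python
import urllib.request

VALUE_BYTES = 8  # assumption: each value is an 8-byte double


def byte_range(first_step, last_step, n_components, value_bytes=VALUE_BYTES):
    """Inclusive byte offsets covering time steps first_step..last_step."""
    record = n_components * value_bytes   # bytes per time step
    start = first_step * record
    end = (last_step + 1) * record - 1    # HTTP ranges include the end byte
    return start, end


def range_request(url, first_step, last_step, n_components):
    """Build (but do not send) an HTTP request for the given time steps."""
    start, end = byte_range(first_step, last_step, n_components)
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
```

Passing the result to `urllib.request.urlopen` would fetch only those bytes, provided the server honors range requests.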

Page 14: A fast time series data server

API – highest level

DAP protocol (builds on HTTP)

http://timeseries.org/data/TimeSeries.{ascii,bin,dods,dat,etc.}?time<1999:01:01

http://timeseries.org/data/TimeSeries.ascii?time<1999:01:01&value>10

http://timeseries.org/data/TimeSeries.ascii?time<1999:01:01&value>10&filter=5minboxcar
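The slides do not define the `5minboxcar` filter. One plausible reading, sketched here as an assumption, is a centered moving average over a fixed number of samples (5 samples for 1-minute data) that skips NaN fill values. The function name `boxcar` and its `width` parameter are illustrative.

```python
import math


def boxcar(values, width):
    """Centered moving average over `width` samples (odd), ignoring NaN fills."""
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd integer")
    half = width // 2
    out = []
    for i in range(len(values)):
        # The window is shortened at both ends of the series.
        window = [v for v in values[max(0, i - half): i + half + 1]
                  if not math.isnan(v)]
        out.append(sum(window) / len(window) if window else math.nan)
    return out
```

With a single fill value of NaN for every series, a filter like this needs no per-data-set knowledge of missing-data flags.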

Page 15: A fast time series data server

Future

• Add submission API
• Implement versioning file system
• Implement suite of filters
• Add ability to scale
• Implement suite of applications
• Connect to Universal Reader Library
• Connect to QData set