A fast time series data server

Bob Weigel, George Mason University

Status: In development

Motivation

• Want to do fast, large-scale analysis on time series data
– Data volume and data processing speed often do matter!
– Speed enables many services.
– If you want users to contribute data, provide
1. $,
2. free storage,
3. better organization and search than their OS + local file system provides, or
4. services on their data that are better than what they can do on their local machine.

The problems

• Heliophysics “data bases”
– The “granule” paradigm. The fundamental unit is the granule (file), which contains many parameters.
– The “small-box” paradigm. Given a user request, return a list of granule URLs that match. The user needs to do the rest. Leads to slow-downs in response time to queries by up to a factor of 100!
– The fundamental unit exposed to the scientist should be the data set. Requires “aggregation”. Can be client-side or server-side.

• Well-known and widely available RDBS don’t work well for time series (“column-based” versus “row-based”)

Approaches for Large Scale Analysis

0. Let the user do the “aggregation”

A. Service: The “run-on-demand” paradigm - A reader (or “accessor”) is developed for each data provider that downloads data to the user's computer, extracts the relevant parts, and puts the data in a uniform form in an array or structure in the user's software analysis program.

1. Disadvantages: Requires high server reliability (servers are typically run by scientists …). Higher server load, higher data transfer volume.

2. Advantages: No additional disk space. Always up-to-date.

B. Service: The “pre-caching” approach - The data are stored in a uniform manner on an intermediate server. The user makes a request to a single server.

1. Disadvantages: More disk space. Cache may be out-of-date.

2. Advantages: 5-100x speed-ups in response. Reliability (errors are caught ahead of time, as are server problems). Many new services will be enabled.

Ideal Approach

Note that pre-caching requires “run-on-demand” solutions, but takes data a step further

Note that a “run-on-demand” approach will eventually develop a caching approach anyway, so it is better to develop caching as a separate component

=> Use “pre-caching” for reasonably sized data sets. Use “run-on-demand” for large data sets and for filling the cache. A significant portion of heliophysics data could be pre-cached.
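The combination recommended here can be sketched in a few lines of Python. This is only an illustration of the idea, not the server's actual code: the class name, the `reader` callable (standing in for a “run-on-demand” accessor), and the size threshold are all hypothetical.

```python
class CachingServer:
    """Serve small data sets from a cache; read large ones on demand."""

    def __init__(self, reader, max_cached_values):
        self.reader = reader                        # hypothetical "run-on-demand" accessor
        self.max_cached_values = max_cached_values  # assumed cut-off for "reasonably sized"
        self.cache = {}

    def prefill(self, dataset_ids):
        """Fill the cache ahead of time, so reader errors surface early."""
        for dataset_id in dataset_ids:
            self.get(dataset_id)

    def get(self, dataset_id):
        if dataset_id in self.cache:
            return self.cache[dataset_id]
        data = self.reader(dataset_id)              # run-on-demand path
        if len(data) <= self.max_cached_values:
            self.cache[dataset_id] = data           # small enough to keep
        return data
```

In this sketch a data set above the threshold is fetched through the reader on every request, while anything smaller is read once and then served from the cache.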

Question

Why hasn’t this been done before?

1. Looks like data centralization.

2. Without an improved data base, improvements using existing infrastructure are incremental.

Only one data type

• Focus on only one data type: time series.

• Defined as
– Scalar: x(t), x(t+1), …
– Vector: Bx(t), By(t), Bx(t+1), By(t+1), …
– Spectrogram: A1(t), A2(t), …, AN(t), A1(t+1), A2(t+1), …, AN(t+1), …
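The three cases share one time-ordered layout: all components for time t, then all components for t+1, and so on. A minimal Python sketch of that ordering follows; the function names are hypothetical and chosen only for illustration.

```python
def interleave(columns):
    """Flatten equal-length component columns into time-ordered values.

    [[Bx(t), Bx(t+1)], [By(t), By(t+1)]] -> [Bx(t), By(t), Bx(t+1), By(t+1)]
    """
    n_samples = len(columns[0])
    if any(len(column) != n_samples for column in columns):
        raise ValueError("all components must have the same number of samples")
    return [column[i] for i in range(n_samples) for column in columns]


def flat_index(sample, component, n_components):
    """Position of one value in the interleaved sequence."""
    if not 0 <= component < n_components:
        raise ValueError("component out of range")
    return sample * n_components + component
```

A scalar is the case of one component, a vector has a few, and a spectrogram has N channels per time step.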

Development history

• Developed as a part of ViRBO

• Built on OPeNDAP

Codebase

• Java

• OPeNDAP

– Have written “I/O Service Provider” for data files.

– Added ability to pass time constraint expressions

– Added ability to output data as an ASCII table

– Added basic filters
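The slides do not list the filters, but a boxcar filter appears in one of the example URLs. As an illustration only, here is a minimal Python sketch of a centered boxcar (moving average) that skips NaN fill values. The centered window, the NaN handling, and the function name are assumptions, and the server itself is written in Java.

```python
import math


def boxcar(values, width):
    """Centered moving average over `width` samples, ignoring NaN fill values."""
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd number of samples")
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        # The window is truncated at the start and end of the series.
        window = [v for v in values[max(0, i - half):i + half + 1]
                  if not math.isnan(v)]
        smoothed.append(sum(window) / len(window) if window else math.nan)
    return smoothed
```

On a uniform grid, a filter specified as a time span would translate to a window width in samples by dividing by the cadence.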

Technical details

• Each time series is stored as a single flat binary file with IEEE 754 floating point values.

• Time series that are close to being on a uniform grid are re-gridded with fill values.

• All time series use a single fill value of NaN.

• Files are stored on a compressed file system.

– Fast random access to compressed files. About 6x slower access speed, but compression ratio is usually 8.

• Files are stored on a versioning file system. Only differences are stored.
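The storage scheme (uniform grid, NaN fill, flat IEEE 754 binary) can be sketched in Python as follows. The slides do not state the precision, byte order, or how close to the grid a sample must be, so little-endian 64-bit values and the `tol` parameter are assumptions, and the function names are hypothetical.

```python
import math
import struct


def regrid(times, values, t0, dt, n, tol=0.25):
    """Place samples on the uniform grid t0 + k*dt; unfilled points are NaN.

    A sample further than tol*dt from its nearest grid point is skipped
    (an assumption of this sketch, not a rule taken from the slides).
    """
    grid = [math.nan] * n
    for t, v in zip(times, values):
        k = round((t - t0) / dt)
        if 0 <= k < n and abs(t - (t0 + k * dt)) <= tol * dt:
            grid[k] = v
    return grid


def pack_values(values):
    """Encode values as a flat run of little-endian 64-bit floats."""
    return struct.pack(f"<{len(values)}d", *values)


def unpack_values(blob):
    """Decode a flat run of little-endian 64-bit floats."""
    return list(struct.unpack(f"<{len(blob) // 8}d", blob))
```

Because the grid is uniform, times need not be stored alongside the values in this sketch: the time of value k follows from the start time and cadence.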

API – lowest level

• HTTP byte-range request

http://timeseries.org/data/TimeSeries.ncml (contains data structure information and a URL to the science metadata)

http://timeseries.org/data/TimeSeries.bin (just a time-ordered set of values Bx(t), By(t), Bx(t+1), By(t+1))
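Because the .bin file is a flat, time-ordered array, a client can turn a range of sample indices into a byte range. The Python sketch below assumes 8 bytes per value, which the slides do not specify, and uses hypothetical function names. It builds the request but does not send it.

```python
import urllib.request

BYTES_PER_VALUE = 8  # assumption: 64-bit IEEE 754 values


def byte_range(first_sample, last_sample, n_components,
               bytes_per_value=BYTES_PER_VALUE):
    """HTTP Range header value covering samples first..last (inclusive)."""
    if first_sample < 0 or last_sample < first_sample:
        raise ValueError("invalid sample range")
    record = n_components * bytes_per_value   # bytes per time step
    start = first_sample * record
    end = (last_sample + 1) * record - 1      # Range end offsets are inclusive
    return f"bytes={start}-{end}"


def range_request(url, first_sample, last_sample, n_components):
    """Build (but do not send) a byte-range request for part of a .bin file."""
    header = byte_range(first_sample, last_sample, n_components)
    return urllib.request.Request(url, headers={"Range": header})
```

For a two-component series, samples 10 through 19 map to bytes 160 through 319. The number of components would come from the structure information in the .ncml file.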

API – highest level

DAP protocol (builds on HTTP)

http://timeseries.org/data/TimeSeries.{ascii,bin,dods,dat,etc.}?time<1999:01:01

http://timeseries.org/data/TimeSeries.ascii?time<1999:01:01&value>10

http://timeseries.org/data/TimeSeries.ascii?time<1999:01:01&value>10&filter=5minboxcar
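A client could assemble these request URLs with a small helper. The Python sketch below reproduces the URL forms shown on this slide; the function name and argument layout are hypothetical. It leaves “<” and “>” unescaped to match the slides, although a real client may need to percent-encode them.

```python
def dap_url(base, suffix, constraints=(), filter_name=None):
    """Compose a request URL from a base, an output suffix, and constraints.

    `constraints` holds strings such as "time<1999:01:01" or "value>10";
    `filter_name` is an optional filter such as "5minboxcar".
    """
    parts = list(constraints)
    if filter_name is not None:
        parts.append(f"filter={filter_name}")
    url = f"{base}.{suffix}"
    if parts:
        url += "?" + "&".join(parts)
    return url
```

For example, `dap_url("http://timeseries.org/data/TimeSeries", "ascii", ["time<1999:01:01", "value>10"], "5minboxcar")` yields the last URL above.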

Future

• Add submission API
• Implement versioning file system
• Implement suite of filters
• Add ability to scale
• Implement suite of applications
• Connect to Universal Reader Library
• Connect to QData set
