07 December 2001
Data Streaming in Wide-Area Computations: the Missing Link
Beth Plale
Computer Science Dept.
Indiana University
Grid services that facilitate access to remote data currently focus on the file as the unit of transfer.
In this talk, we argue that the data stream, which delivers data as a sequence of events, is a valuable complementary form of remote data access in grid applications.
Talk Outline
• Wide area computing applications
  – Data oriented services
• Data streams: the missing link
  – dQUOB system for querying streaming data
• WAN performance results
Grid Applications
• Authenticate once
• Submit a grid computation (code, resources, data, …)
• Locate resources
• Negotiate authorization, acceptable use, etc.
• Select and acquire resources
• Initiate data transfers, computation
• Monitor progress
• Steer computation
• Store and distribute results
• Account for usage
[Figure: LIGO in Louisiana, panels a, b, c]
Slide courtesy Ann Chervenak, ISI
Data Oriented Grid Services
[Diagram components]
• Application
• Metadata Service
• Replica Location Service
• Information Services
• Planner: data location, replica selection, selection of compute and storage nodes
• Security and Policy
• Executor: initiates data transfers and computations
• Data Movement
• Data Access
• Compute Resources
• Storage Resources
Courtesy Ann Chervenak, ISI
Metadata
• Information that describes the contents of data
• Typically, each discipline develops an ontology:
  – What attributes are important?
  – What is the exact meaning of the attributes?
  – How is information structured?
• Must be able to interpret, share and reproduce data
Metadata Examples
• UNIX-style file system metadata
  – file size, access permissions, creation time, modify time, etc.
• Information needed to read/interpret the bits
  – Format of information: MPEG, JPEG, GIF, ppt, doc, …
  – File type: ascii, binary, …
  – Big-endian or little-endian
  – For removable media: what device wrote bits?
Metadata Examples (cont.)
• Descriptions of meaning of files: what do ones and zeroes represent?
  – Satellite image of South America
  – Results of Monte Carlo simulation
  – Word document
• Contextual information
  – When was data created? By whom?
  – Under what experimental conditions?
  – Using what application software, operating system software, hardware configuration?
  – What input files/parameter settings?
  – Goal: experimental results understandable, repeatable
A Metadata Service
• Records metadata attributes associated with data
  – Typically stored in attribute:value pairs
  – Schema determined by application domain
• Answers queries
  – Attribute-based search capability
  – Identify files that contain data with specified metadata attributes
  – “Find precipitation measurements over North America for January to June 1999”
• Example Metadata Service
  – a Metadata CATalog (MCAT), SDSC
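The attribute-based search described on this slide can be illustrated with a minimal in-memory sketch. It is not the MCAT interface; the class `MetadataCatalog`, its methods, the file names, and the attribute names (`variable`, `region`, `year`, `month`) are all hypothetical and chosen only to mirror the precipitation query quoted above.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataCatalog:
    """Toy catalog (hypothetical): logical file name -> attribute:value pairs."""
    records: dict = field(default_factory=dict)

    def register(self, name, **attributes):
        # Record the attributes associated with one logical file.
        self.records[name] = dict(attributes)

    def find(self, **criteria):
        # Attribute-based search. Each criterion is either an exact value
        # or a predicate (callable) for range-style conditions.
        hits = []
        for name, attrs in self.records.items():
            if all(_matches(attrs.get(key), want) for key, want in criteria.items()):
                hits.append(name)
        return sorted(hits)


def _matches(actual, want):
    # A file that lacks the attribute never matches.
    if actual is None:
        return False
    return want(actual) if callable(want) else actual == want


catalog = MetadataCatalog()
catalog.register("precip_na_1999_03.dat", variable="precipitation",
                 region="North America", year=1999, month=3)
catalog.register("precip_na_1999_09.dat", variable="precipitation",
                 region="North America", year=1999, month=9)
catalog.register("temp_sa_1999_03.dat", variable="temperature",
                 region="South America", year=1999, month=3)

# "Find precipitation measurements over North America for January to June 1999"
january_to_june = catalog.find(variable="precipitation", region="North America",
                               year=1999, month=lambda m: 1 <= m <= 6)
```

In a real service the schema would come from the application domain, as the slide notes; the flat keyword arguments here are only a convenience for the sketch.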
Replica Management
• Terabytes or petabytes of data shared by researchers around the world
• Often read-only data, “published” by experiments
• Replicate data at multiple locations
  1. Fault tolerance
     • Avoid single points of failure
  2. Performance
     • Avoid wide area data transfer latencies
     • Load balancing
Replica Management, cont.
• Issues:
  – Location: finding copies of files
  – Aggregation: manage groups of files to improve convenience and scalability
  – Consistency model: how out of date is the file one obtains from a replica?
• Current Replication Grid Services
  – Globus Replica Catalog, ANL/ISI
  – Grid Data Mirroring Package (GDMP), CERN, Caltech, and Fermilab
  – Storage Resource Broker (SRB), SDSC
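The location and selection steps of replica management can be sketched as a mapping from a logical file name to its physical copies. This is a toy illustration, not the Globus Replica Catalog, GDMP, or SRB interface; the class `ReplicaCatalog`, the site names, the example URLs, and the latency estimates are all assumptions made for the example.

```python
class ReplicaCatalog:
    """Toy replica catalog (hypothetical): logical name -> physical copies."""

    def __init__(self):
        self._replicas = {}

    def add_replica(self, logical_name, site, url):
        # Register one more physical copy of a logical file.
        self._replicas.setdefault(logical_name, []).append((site, url))

    def locate(self, logical_name):
        # Location: return every known copy of the file.
        return list(self._replicas.get(logical_name, []))

    def select(self, logical_name, latency_ms):
        # Selection: pick the copy at the site with the lowest estimated
        # latency. Sites without an estimate sort last.
        copies = self.locate(logical_name)
        if not copies:
            raise KeyError(logical_name)
        return min(copies, key=lambda copy: latency_ms.get(copy[0], float("inf")))


replicas = ReplicaCatalog()
replicas.add_replica("lfn:run42/events", "site_a", "gsiftp://a.example.org/run42/events")
replicas.add_replica("lfn:run42/events", "site_b", "gsiftp://b.example.org/run42/events")

# Assumed latency estimates as seen from the requesting client.
best_site, best_url = replicas.select("lfn:run42/events", {"site_a": 80, "site_b": 12})
```

The sketch ignores the consistency question raised on the slide: it assumes read-only, published data, so every copy is treated as equally current.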
Information Services
• Repository of information about people, organizations, computational resources, software, storage devices, etc.
• Emerging Grid standard is Globus MDS2
  – Level 1: distributed LDAP directory servers, typically one per organization or administrative domain. These are called GRIS servers.
  – Level 2: aggregating directory servers (GIIS servers).
  – Grid applications can query either level.
Security
• Forms of Security:
  – Authorization: Verify that users are allowed to perform requested operations
  – Privacy: Knowledge of existence, location and content of data must be controlled
  – Integrity: Prevent adversary from tampering with stored data or data transfers (access control)
• Emerging standards:
  – Grid Security Infrastructure (GSI), Globus
    • for authentication and access control
  – Community Authorization Service (CAS), Globus
    • for authentication and access control
Data Movement
• Want efficient, secure movement of large amounts of data for:
  – Publishing large data collections
  – Replication of large files or collections of files
  – As input data to grid application
• Emerging standard: GridFTP, Globus
  – Extends standard FTP protocol: get/put etc.
  – Secure (GSS security bindings)
  – Parallel data movement
  – Automatic and manual TCP buffer setting
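Two of the ideas listed above, parallel data movement and TCP buffer setting, rest on simple arithmetic that a short sketch can make concrete. This is not GridFTP code and does not reproduce its wire protocol; the function names are hypothetical, the partitioning scheme is one plausible way to divide a file among streams, and the 50 ms round-trip time is an assumed value. The buffer estimate uses the standard bandwidth-delay product.

```python
def split_into_ranges(total_bytes, streams):
    """Return (offset, length) pairs, one per stream, that cover the file."""
    if total_bytes < 0 or streams < 1:
        raise ValueError("need total_bytes >= 0 and streams >= 1")
    base, extra = divmod(total_bytes, streams)
    ranges = []
    offset = 0
    for i in range(streams):
        # The first `extra` streams carry one additional byte each.
        length = base + (1 if i < extra else 0)
        ranges.append((offset, length))
        offset += length
    return ranges


def bandwidth_delay_product_bytes(bandwidth_bits_per_s, rtt_ms):
    """Bytes in flight needed to keep a path full: bandwidth x round-trip time."""
    # Integer arithmetic: bits/s * ms / 1000 gives bits; / 8 gives bytes.
    return bandwidth_bits_per_s * rtt_ms // 8000


ranges = split_into_ranges(10, 4)

# A 2.4 Gbit/s path with an assumed 50 ms round-trip time.
total_buffer = bandwidth_delay_product_bytes(2_400_000_000, 50)
```

A single TCP connection would need a buffer of roughly `total_buffer` bytes to fill such a path, which is one reason buffer tuning and parallel streams matter for wide-area transfers.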
Data Access
• Fine-grain operations on large data sets
  – Partial file accesses
  – Database queries
  – Structured data formats (HDF)
  – Containers
• Current Grid data access services
  – Storage Resource Broker (SRB), UCSD
    • Container can hold logically connected files
[Figure: a container holding four data objects]
Role of Topography on Tornado Formation
[Figure labels: raw Level 2 data from Peachtree City, GA; Grier, SC; and Hytop, AL; time axis with long time frames; convert data format and stream; exponential average; candidate substreams; archive; unneeded data]
[Figure labels: streams from Peachtree City, GA; Hytop, AL; and Grier, SC; relations R1, R2, R3, R4; dispatcher; SQL queries; user functions]
Relational Schema
R1: attrib1, attrib2, …
R2: attrib3, attrib4, …
R3: attrib5, attrib6, …
R4: attrib7, attrib8, …
SQL Queries to Extract, Transform and Filter Data Streams
• View data stream as set of relations (tables) in DBMS
• Data streams joined
  – Join on valid time (timestamp or logical time)
• Materialized views
  – New event streams created
• Queries embedded into data streams on-the-fly
• Statistics collected about data at runtime
  – Sample data stream, build histograms
  – Use statistics to reoptimize queries
• Associate user supplied (mathematical) function with SQL query to strengthen transformation capability
  – e.g., FFT
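Joining two streams on valid time, with the result forming a new event stream, can be sketched as follows. This is an illustration of the idea only, not dQUOB's query engine; the function `window_join`, the stream names `R` and `S`, and the event tuples are hypothetical. It corresponds roughly to `SELECT ... FROM R, S WHERE R.ts = S.ts AND <predicate>` evaluated over sliding windows.

```python
from collections import deque


def window_join(events, window_size, predicate=lambda r, s: True):
    """Join streams R and S on equal timestamps over sliding windows.

    `events` is an iterable of (stream, timestamp, value) in arrival order,
    where stream is "R" or "S". Yields (timestamp, r_value, s_value), which
    plays the role of the new event stream (materialized view).
    """
    # Each stream keeps only its most recent `window_size` tuples.
    windows = {"R": deque(maxlen=window_size), "S": deque(maxlen=window_size)}
    for stream, ts, value in events:
        other = "S" if stream == "R" else "R"
        # Compare the arriving tuple with every retained tuple of the other stream.
        for other_ts, other_value in windows[other]:
            if other_ts == ts:
                r, s = (value, other_value) if stream == "R" else (other_value, value)
                if predicate(r, s):
                    yield (ts, r, s)
        windows[stream].append((ts, value))


arrivals = [("R", 1, 10.0), ("S", 1, 0.5), ("R", 2, 20.0), ("S", 3, 0.7), ("S", 2, 0.9)]

joined = list(window_join(arrivals, window_size=2))
filtered = list(window_join(arrivals, window_size=2, predicate=lambda r, s: r > 15))
```

The predicate stands in for the selection part of a query; a user-supplied function such as an FFT would be applied to the joined tuples in a similar position.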
Instantiation of a Query
[Figure labels: script; dQUOB server; command channel; repository; compiled queries; dQUOB library; TCL interpreter; dQUOB runtime; quoblet; preexisting event handler; action routines; event channel; provider; consumer; steps numbered 1 to 6]
ECho: Event Delivery Middleware
• Publish/subscribe model of event flow
• Receivers register for events that are pushed from a source
• Sender unaware of identity and number of receivers
• Binary data transmission, based on dynamically defined event formats
• Georgia Tech, Eisenhauer, Schwan
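The publish/subscribe flow and binary event encoding listed above can be sketched in a few lines. This is not the ECho API; the class `EventChannel`, its methods, and the field layout are hypothetical, and Python's standard `struct` module stands in for ECho's dynamically defined binary formats.

```python
import struct


class EventChannel:
    """Toy event channel (hypothetical): the sender never sees its receivers."""

    def __init__(self, field_names, struct_format):
        # The event format is supplied at run time rather than compiled in.
        self.field_names = tuple(field_names)
        self._codec = struct.Struct(struct_format)
        self._handlers = []

    def subscribe(self, handler):
        # Receivers register a handler; events are later pushed to it.
        self._handlers.append(handler)

    def submit(self, **fields):
        # Encode the event once in binary form, then push it to every receiver.
        wire = self._codec.pack(*(fields[name] for name in self.field_names))
        for handler in list(self._handlers):
            handler(dict(zip(self.field_names, self._codec.unpack(wire))))
        return len(wire)


# Little-endian layout: two 32-bit ints and one 64-bit float.
channel = EventChannel(("tid", "tag", "time_measure"), "<iid")

received, tags = [], []
channel.subscribe(received.append)
channel.subscribe(lambda event: tags.append(event["tag"]))

wire_size = channel.submit(tid=7, tag=3, time_measure=1.5)
```

The sender calls `submit` without knowing how many handlers are registered, which is the decoupling the slide describes.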
Current Research Issues
• Better memory utilization
• Integrated support for mathematical transformations
• Support for complex data types
Efficient Memory Utilization
• Stream arrival rates can vary vastly between one stream and another
• Detect stream arrival characteristics, adjust sliding window size
• Adjustments done within context of global memory usage
Relation R contains tuples a, b, …, l, …, z
Relation S contains tuples 1, 2, …, 5, …, 20
Can detect that stream (relation) S is both slow and erratic, so reduce sliding window size.
[Figure: sliding window over the most recent tuples of R (a through l) and S (1 through 5)]
Notes:
-- Joins are typically Cartesian Product.
-- Sliding window controls number of tuples participating in join.
-- Participating tuples must be retained in memory, thus consume resources.
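One way to adjust sliding window sizes within a global memory budget is to divide the budget among streams in proportion to their observed arrival rates. The sketch below is an assumption about how such a policy could look, not the dQUOB algorithm: the function `allocate_windows`, the rates, and the budget are hypothetical, and the policy considers only how fast a stream is, not how erratic.

```python
def allocate_windows(arrival_rates, memory_budget, min_window=1):
    """Split a global budget of retained tuples across streams by arrival rate.

    `arrival_rates` maps stream name -> observed tuples per unit time.
    Returns stream name -> sliding window size (in tuples).
    """
    total = sum(arrival_rates.values())
    if total <= 0:
        # Nothing observed yet: keep the smallest window everywhere.
        return {name: min_window for name in arrival_rates}
    sizes = {}
    for name, rate in arrival_rates.items():
        share = int(memory_budget * rate / total)
        # The floor of `min_window` can push the total slightly over budget.
        sizes[name] = max(min_window, share)
    return sizes


# R arrives nine times as fast as S; 20 tuples may be retained in total.
windows = allocate_windows({"R": 90, "S": 10}, memory_budget=20)

# With a Cartesian-product join, each step compares |R window| x |S window| pairs.
pairs_compared = windows["R"] * windows["S"]
```

Shrinking the window of the slow stream S frees memory for R while keeping the number of tuple pairs examined per join small.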
Efficient memory utilization, cont.
[Figure labels: quoblet; dispatcher; event handlers; QM; QN; joins; memory space]
-- All tuples participating in joins are resident in common memory space.
-- Most tuples are either forwarded on, used as input for a new tuple (materialized view), or discarded.
Better support for large data
typedef struct EC_Data_ {
    int tid;                  /* task id */
    int tag;                  /* logical timestep */
    int aid;                  /* adaptation id */
    unsigned long timestamp;  /* event timestamp */
    double time_measure;
    int spec_nr;              /* species type */
    int lon_min;              /* longitude */
    int lon_max;
    int lon_count;
    int lat_min;              /* latitude */
    int lat_max;
    int lat_count;
    int level_min;            /* atmospheric 'level' */
    int level_max;
    int level_count;
    int values_size;
    float * values;
} EC_Data, * EC_DataPtr;

Annotations:
-- Fields tid through values_size: attributes (fields over which queries can be expressed)
-- Field values: data; int values_size = 32768, 262K of data as vector
-- Event instance transported in binary network format using TCP
Problem: data not visible to query (though it is visible to user-defined function or ‘action’)
Better support for large data, cont.
Issues:
• Files too large: streaming on ‘record-by-record’ basis becomes impractical
• Ability to query data as well as attributes
  – Attribute points to location in vector
Solutions:
• DATALINK SQL data type
  • data type is pointer to file where data resides
  • vector replaced by pointer to file
• Object-relational representation
Integrate mathematical transformations
[Figure labels: quoblet; dispatcher; event handlers; query (selects, projects, joins); user supplied function; memory space]
-- Results of user-supplied function available to query at next timestep
Measuring Wide-area Performance
Workload: 540 events generated by global atmospheric transport model
Environment:
-- Georgia Tech: Sun Ultra 30 cluster, Solaris 7
-- Albuquerque High Performance Computing Center (AHPCC): Onyx 2, 8 processor, Irix64 6.5
-- NCSA, Urbana-Champaign, Illinois: Origin 2000 Array, 48 processor, Irix
Network: Abilene (2.4 Gbits per second)
[Figure: AHPCC, Albuquerque, NM; Georgia Tech, Atlanta, GA; and NCSA, Urbana, IL connected by Abilene at 2.4 Gbits per second]
-- Baseline case (LAN communication)
-- Query’s filtering capability progressively strengthens in response to changes in network environment
Pushing data transformation closer to source yields total duration times that are closer to baseline (LAN) time.
-- Variance due to traffic flow control in TCP
[Figure: plot series: filter at destination, filter at source, baseline]
[Figure: plot series: filter at source, baseline, filter at destination]
NCSA Origin appears to throttle process-to-process communication when both processes reside on the same machine.
Related Research
• SQL query processing; non-traditional application
  – Active Disks (U Maryland)
  – Continual Queries (Georgia Tech)
  – Snodgrass (U Arizona)
  – Eddies (UC Berkeley)
• Data stream computation
  – ABACUS (CMU)
  – Data Cutter (U Maryland)
  – Distributed Laboratories (Georgia Tech)
Summary
• The grid is an emerging computational and networking infrastructure providing pervasive, uniform, and reliable access to remote data, computational, sensor, and human resources.
• Grid data services include metadata, replication, information services, data movement, and data access.
• Data streams play an important role in grid applications and are the missing link.
• Our view of data streams as a database enables scientists to better manage data streams.
• dQUOB: an implementation of our approach.
http://www.cs.indiana.edu/~plale/projects/dQUOB