An Architecture for a Data-Intensive Computer
Edward Givelberg
Department of Physics and Astronomy, IDIES, The Johns Hopkins University
Email: [email protected]

Alexander Szalay
Department of Physics and Astronomy, Computer Science Department, IDIES, The Johns Hopkins University

Kalin Kanov and Randal Burns
Computer Science Department, IDIES, The Johns Hopkins University
Abstract: Scientific instruments, as well as simulations, generate increasingly large datasets, changing the way we do science. We propose that processing of Petascale-sized datasets will be carried out in a data-intensive computer, a system consisting of an HPC cluster, a massively parallel database and an intermediate operating system layer. The operating system will run on dedicated servers and will exploit massive parallelism in the database, as well as numerous optimization strategies, to deliver high-throughput, balanced and regular data flow for I/O operations between the HPC cluster and the database. The programming model of sequential file storage is not appropriate for data-intensive computations, so we propose a data-object-oriented operating system, where support for high-level data objects, such as multi-dimensional arrays, is built in. User application programs will be compiled into code that is executed both on the HPC cluster and inside the database. The data-intensive operating system is, however, non-local, so that user applications running on a remote PC will be compiled into code executing both on the PC and inside the database. This model supports a collaborative environment, where a large data set is typically created and processed by a large group of users.
We have implemented a software library, MPI-DB, which is a prototype of the data-intensive operating system. It is currently being used to ingest the output of a simulation of turbulent channel flow into the database.
I. INTRODUCTION
The traditional process of scientific discovery consists of
systematic observation, experimentation, measurement and
data collection, leading to the creation of a theory that
explains past observations and predicts the results of future
experiments. In virtually every field of science, technological progress has led to the construction of high-throughput measurement instruments (telescopes, high-energy particle accelerators, gene sequencing machines, etc.) that generate very large data sets from observations of complex physical systems. On the other hand, the theoretical description of complex physical phenomena starts with a set of basic laws (typically expressed as partial differential equations), whose consequences are investigated with the help of simulation experiments using
computational models. The volume of data produced by com-
puter simulations has been increasing even more rapidly than
the size of empirical measurement data sets.
In astrophysics, for example, database technology has been
developed to host the results of the Sloan Digital Sky Survey,
and to expose the data to scientists through the SDSS Sky-
Server [1]. Following the example of the SDSS SkyServer,
the Millennium simulation [2] created a remotely accessible
database with a collaborative environment [3], which drew
hundreds, if not thousands of astronomers into analyzing
simulations.
The need to process very large data sets is changing the
way we do science. The ability to perform computations with
very large data sets is necessary for mining both experimental
and simulation output data sets, as well as for developing
new methodologies for comparing empirical observations with theory.
In this paper we propose that computations with Petascale-
sized data sets should be carried out on a data-intensive
computer, a system consisting of an HPC cluster, a massively
parallel database and an intermediate operating system layer.
We present a design for the operating system and discuss
novel features distinguishing the data-intensive computer from
the traditional computer. Our approach to data-intensive com-
puting is substantially different from the database-centric
approach of SciDB [4].
The NSF has recently awarded our group funding to build a 5PB cluster for extreme data-intensive computations. The Data-Scope will be co-located and integrated with a Beowulf cluster of about 4000 CPU cores at the Department of Physics and Astronomy. The system's components are currently being evaluated and tested, and deployment is expected in October 2011. The Data-Scope will consist of 90 performance and 12 storage servers. The total disk capacity will exceed 5PB, with 3PB in the storage layer and 2.2PB in the performance layer. The peak aggregate sequential I/O performance is projected to be 460GB/s.
The driving goal behind the Data-Scope design is to max-
imize stream processing throughput over 100TB-size datasets
while using commodity components to keep acquisition and
maintenance costs low. Accessing the data in a massively
parallel fashion from the cluster via locally attached disks and
SSDs is significantly faster than serving the data from shared network file servers to multiple compute servers.
We have developed a software library called MPI-DB
which is a prototype of the operating system for the data-
intensive computer. MPI-DB will enable us to establish peer-
to-peer connections between nodes on the HPC cluster and
the Data-Scope I/O nodes both for the on-the-fly ingest of
data generated by an MPI application, and for the parallel
compute-intensive analysis of large data sets read from the
parallel database. MPI-DB has been successfully tested and is
presently being used to ingest the output of the simulation of
a turbulent channel flow directly into the database.
The remainder of the paper is organized as follows. In section II we survey several research projects involving data-intensive computing. An examination of computing requirements for these projects leads us to propose the concept of the data-intensive computer in section III, where we discuss design requirements for the computer and its operating system. A prototype of this operating system, MPI-DB, is described in section IV. We present the user view of the software library, then describe its software architecture and discuss implementation issues. We summarize our conclusions in section V.
II. EXAMPLES OF DATA-INTENSIVE SCIENTIFIC RESEARCH
In this section we examine several examples involving data-
intensive computations from astrophysics, turbulence, neu-
roscience and hearing research. The examples, taken from
research activities of our group members, illustrate the demand
for data-intensive computations, the present technological lim-
itations, and guide us in the design of the operating system
for the data-intensive computer.
A. The large-scale structure of the Universe
Contemporary research in astrophysics has deep and important connections to particle physics. Observations of large structures in the universe led physicists to the discovery of dark matter and dark energy, and understanding these new forms of matter will change our view of the universe on all scales, including the particle scale and the human scale. Theoretical developments in astrophysics must be tested against vast amounts of data collected by instruments, such as the Hubble Space Telescope, as well as against the results of supercomputer simulation experiments, like the Millennium Run [5]. These data sets are available in public databases and are being mined by scientists to gain intuition and to make new discoveries, but the researchers are limited by the technological means available to access the data. In order to analyze astrophysical data, researchers write scripts that perform database queries, transfer the resulting data sets to their local computers and store them as flat files.
Such limited access has already produced important discoveries. For example, a new log-power density spectrum was recently discovered by such analysis of the data in the Millennium Run database [6]. This is the most efficient quantitative description of the distribution of the density of matter in the Universe obtained so far.
B. Turbulence research
A new database approach to scientific computing in fluid
dynamics was developed by researchers at the JHU Institute
of Data-Intensive Engineering and Science (IDIES), where the
entire time-history of a simulation is stored in a database using
relational database technologies [7], [8], and is publicly accessible for scientific analysis. A pseudospectral direct numerical simulation of forced isotropic turbulence on a 1024³-point Eulerian computational grid with periodic boundary conditions was computed, and 1024 time-steps are stored, covering a full large-eddy turnover time of model evolution. The resulting 27 Tbyte data set is publicly available in the JHU Public Turbulence Database [9]. Velocity, pressure, velocity gradient
tensor and other quantities can be obtained directly from the
query web page of the Turbulence Database.
The technology enabling scientists to execute such queries
includes a mediator (DataBase Access Server), which acts as
a gateway for users and applications, and takes individual
requests and routes them to a node that has the data needed to
service that request. Users and applications discover the inter-
face through the Web-Service Definition Language (WSDL).
Requests and results are transmitted between the mediator
and its clients through the Simple Object Access Protocol
(SOAP). Analysis tools within the database are implemented
in user-defined functions (UDFs) in SQL Server 2005 using
the common-language runtime (CLR) [10].
To service data calls from C and Fortran codes, a wrapper
interface to the gSOAP library is provided, which is linked
at compile time to the user program. While executing, the program makes subroutine-like calls that request desired parts of the data from the database over the internet. The request may trigger execution of pre-defined analysis routines in the database, e.g. interpolation, differentiation, etc.
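From the application programmer's point of view, such a request is an ordinary subroutine invocation. The fragment below is a minimal sketch of this pattern; the function name getVelocity, its argument list and the dataset name are illustrative assumptions modeled on the description above, not the exact interface of the JHU wrapper library.

#include <cstdio>

// Illustrative stand-in for a wrapper call to the turbulence database:
// fetch the velocity at a set of spatial points and a simulation time.
// The real wrapper is linked at compile time and forwards the request
// to the database over the internet via gSOAP.
extern "C" int getVelocity(const char* dataset, float time,
                           int npoints, float points[][3],
                           float velocities[][3]);

int main() {
    const int npoints = 2;
    float points[2][3] = {{0.1f, 0.2f, 0.3f}, {1.0f, 1.5f, 2.0f}};
    float velocities[2][3];

    // Each call may trigger database-side interpolation at the given points.
    if (getVelocity("isotropic_turbulence", 0.364f, npoints,
                    points, velocities) != 0) {
        std::fprintf(stderr, "database request failed\n");
        return 1;
    }
    for (int i = 0; i < npoints; ++i)
        std::printf("u(%d) = (%g, %g, %g)\n", i,
                    velocities[i][0], velocities[i][1], velocities[i][2]);
    return 0;
}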
An example of the transformative impact of the new ap-
proach on scientific research is the recent study of H. Yu
and C. Meneveau [11], where new evidence for Kolmogorov's
Refined Similarity Hypothesis has been found by following the
time evolution of velocity and pressure gradients in isotropic
turbulence and quantifying their autocorrelation functions and
decorrelation time scales.
Data analysis was performed on a workstation, accessing
the database using web service tools. Fluid particles were
tracked using the second-order Runge-Kutta particle tracking algorithm [12]. Lagrangian information, such as fluid velocity and pressure gradients, along the fluid particle trajectories was extracted and used in the analysis. The required velocities were interpolated using 8th-order Lagrange polynomials in space and piecewise-cubic Hermite polynomials in time. Averages
were performed over more than 800 million pairs of stochastic
trajectories, consuming a couple of months of wall-clock
time. These operations were implemented in predefined func-
tions in the database [8].
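To make the workflow concrete, the sketch below advances one fluid particle by a single step of the second-order Runge-Kutta (predictor-corrector) scheme used in the study; the velocity callback stands in for the database-resident interpolation routine (8th-order Lagrange in space, piecewise-cubic Hermite in time), and its signature is an assumption made for illustration.

#include <array>
#include <functional>

using Vec3 = std::array<double, 3>;

// One second-order Runge-Kutta (Heun) step of fluid particle tracking.
// 'velocity' stands in for the database-resident interpolation routine.
Vec3 rk2Step(const Vec3& x, double t, double dt,
             const std::function<Vec3(const Vec3&, double)>& velocity) {
    Vec3 u0 = velocity(x, t);           // velocity at the particle position
    Vec3 xPred;
    for (int i = 0; i < 3; ++i) xPred[i] = x[i] + dt * u0[i];   // predictor
    Vec3 u1 = velocity(xPred, t + dt);  // velocity at the predicted position
    Vec3 xNext;
    for (int i = 0; i < 3; ++i)         // corrector: average the two velocities
        xNext[i] = x[i] + 0.5 * dt * (u0[i] + u1[i]);
    return xNext;
}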
The JHU turbulence database has been providing services
to a growing community of users both from JHU and from
outside. So far, it has been accessed by over 160 separate IPs. There are presently over 2 × 10^4 individual web queries per day, each requesting data on an average of 250 points, so that every day over 5 × 10^6 points are queried.
The database has been used to evaluate key assumptions in a new experimental pressure measurement technique [13],
bubble dispersion statistics in turbulence [14], evolution of
material surfaces and curvature statistics in turbulence [15],
etc. In the JHU turbulence group, the data have been exploited
to study Lagrangian models of intermittency [8], small-scale
turbulent magnetic dynamo effect [16], subgrid-scale models
[17], and statistics of rotation and strain-rates along fluid
particle trajectories [11].
As a consequence of Kolmogorov's Refined Similarity Hypothesis, the viscous cut-off length must be regarded as a
local fluctuating quantity, and it is expected that many high
Reynolds number simulations substantially underresolve some
regions in space. The Turbulence group at JHU is planning to
carry out database-driven simulations, where a nested grid
of very fine resolution is positioned in the vicinity of intense
small-scale structures of interest and co-moving with the mean
velocity at that position, in order to locally refine the archived
data.
Such a simulation has never been carried out before, and it
requires boundary conditions on the surface of the nested, co-
moving grid, which may be extracted from the time-history
stored in the database, interpolated to finer resolution. There is
presently no technology to provide such a capability. MPI-DB
will provide this new capability by employing a smart combi-
nation of caching, scheduling and parallelization techniques.
We believe this capability is extremely valuable, not only for turbulence, but in virtually every scientific simulation. To give just one example, consider environmental impact assessment of the spreading of a chemical agent in a building or an area.
This can be carried out as a database-driven simulation, using
a previously computed fluid-flow simulation.
C. Computational modeling of the cochlea
The human cochlea is a remarkable, highly nonlinear transducer that extracts vital information from sound pressure and converts it into neuronal impulses that are sent to the auditory cortex. The cochlea's accuracy, amplitude range and frequency range are orders of magnitude better than those of man-made transducers. Understanding its function has tremendous medical and engineering significance. The two most fundamental questions of cochlear research are to provide a mathematical description of the transform computed by the cochlea and to explain the biological mechanisms that compute this transform. Presently there is no adequate answer to either of these two questions. Signal processing in the cochlea is carried out by a collection of coupled biological processes occurring on length scales ranging from one centimeter down to a fraction of a nanometer. A comprehensive model describing the coupling of the dynamics of the biological processes occurring on multiple scales is needed in order to achieve a system-level understanding of cochlear signal processing.
A model of cochlear macro-mechanics was constructed in 1999-2002 by Givelberg and Bunn [18], who used supercomputers to generate very large data sets containing results of simulation experiments. These results were stored as flat files which were subsequently analyzed by the authors on workstations using specially developed software. A set of web pages devoted to this research [19] is widely and frequently accessed; however, the data was never exposed to the wider community for analysis, since no tools to ingest simulation output into a database existed when the cochlea model was developed.
This model remains the most comprehensive large-scale
model that has been constructed to study cochlear mechanics,
but it does not include the crucial dynamics of the smaller
scales. The first author is presently developing a new multi-scale method that will make it possible to extend the macro-mechanical cochlea model to include microscopic structures. The new
cochlea model will run on a multi-processor cluster using
a special algorithm for distributed immersed boundary com-
putations [20]. We plan to use MPI-DB to ingest cochlea
simulation results into the database and to expose these results
for analysis by the wide hearing research community. Further-
more, since the development of such a model is a product
of many years of research, a new technological capability is
needed to enable cochlea researchers outside JHU to use the
model, to remotely carry out simulations, as well as to observe,
monitor and analyze ongoing simulations.
D. Neuroscience
One of the biggest unanswered questions in neuroscience
today is the organization of the brain at the level of the neural micro-circuits that form the basis of neural computation. Recently, due to technological and experimental advances in
electron microscopy, it has become possible to investigate
these networks through high-resolution imaging of the brain
[21]. For example, Bock et al. [22] have recently imaged
a 350 x 450 x 50 cubic micron region of mouse cortex with 4 nanometer lateral resolution, a sufficiently detailed dataset to resolve every synaptic connection in the field of view (indeed, even vesicles are readily apparent). It has been
estimated that imaging the whole brain at this resolution would
require multiple exabytes; a cubic millimeter occupies roughly
1 petabyte. Ultimately, to fully exploit this data, it is desirable
to assign a label to each voxel indicating its identity and the
structure to which it belongs.
Clearly, while even collecting this type of data is an
enormous task, interpreting and analyzing the data is far
more difficult. It is infeasible to annotate this volume of
data manually, and probably impractical to assume that any
one group will devise a perfect automated solution. The
Open Connectome project is working to provide universal
access to this type of data via web services hosted at
http://openconnectomeproject.org. More specifically, one of us
(R. Burns) is developing tools for both human (visualization)
and computer (application programming interface, or API)
access to the data. Granting global access will enable the
largest possible community of image processing and machine
learning experts to investigate the data and develop algorithms to annotate it. Unlike standard crowdsourcing endeavors, the goal is to compile efforts from a variety of machine annotators, as opposed to human annotators, an approach that was dubbed "alg-sourcing" (for algorithm outsourcing). As
different groups tackle different aspects of the problem with
different approaches, the results will be aggregated and the
collective output will be shared, building towards a long-term
vision of a fully-annotated cortical volume.
The project is being initialized with two datasets: (1) a 12
TB dataset from Bock et al. described above, and (2) a >600GB dataset from Kasthuri and Lichtman (unpublished; spatial
resolution: 3 x 3 x 29 cubic nanometers). Panning, zooming,
and manual annotation are made possible via a web-based
graphical user interface called CATMAID [23]. An API for two-dimensional analysis of the data, including downloading arbitrary image planes and uploading planar annotations to the shared repository, is in progress. An additional server
for three-dimensional representation of the data is being built,
along with an API for downloading volumes and upload-
ing volumetric annotations. Graphics processor unit (GPU)-
enabled software will allow for visualizing arbitrary rotations
of the data in three dimensions, overlaid with the annotations.
All of the services are designed to scale up to petabytes and
beyond, and all of the developed code will be released as open
source.
The Open Connectome Project is gearing up for massive
polyscience, i.e. science collectively conducted by a large
group of individuals. This marks a radical departure from the
typical scientific workflow, in which raw data are kept local
until results are released, and will hopefully usher in a new era of understanding about the brain.
III. THE DATA-INTENSIVE COMPUTER
The operating system of a modern computer is designed
to balance programmer productivity with implementation ef-
ficiency. High-level programming languages hide the com-
puters memory hierarchy and system architecture, while the
operating system provides highly optimized services for all
application developers. The only means of permanently storing
data is by writing it in a file, and the abstract programming
model of sequential file access is efficiently implemented in
the operating system. The operating system typically does not
include services for handling high-level programming objects, such as arrays or graphs. When there is a need to store such objects for subsequent computation, the programmer must
make use of the file system with serialization/unserialization
of these objects.
Scientific data sets are now approaching the Petascale,
exceeding the capabilities of file systems, and are therefore
stored in databases. They are not easily accessible to com-
putation because performing I/O operations between an HPC
system and a database is presently very difficult. There are
no off-the-shelf solutions and a considerable effort is required
on the part of domain scientists to incorporate special-purpose
database access tools in the analysis code. Data access using
web services does not provide a scalable solution for many data-intensive applications. Furthermore, the resulting data flow throughput needs to be improved by orders of magnitude;
even trivially parallelizable data processing tasks are presently
very difficult.
In order to satisfy the increasing demand for computations
with very large data sets it is necessary to build an operating
system that will exploit the massive parallelism in the database
system to efficiently carry out data storage and retrieval
between the database and a multiprocessor computing system.
We therefore define the data-intensive computer as a system consisting of an HPC cluster, a massively parallel database and an intermediate operating system layer (see Figure 1).
The operating system enables direct I/O operations between
HPC processor memory and the database, making the database
transparent to the programmer and effectively turning it into a
layer in the memory hierarchy of the data-intensive computer.
Fig. 1. The architecture of the data-intensive computer: the HPC cluster and the database are connected to operating system servers by a high-bandwidth network. Remote clients obtain operating system services over the internet.
The operating system also provides services to remote
computers (see section III-E). The data-intensive computer
differs from the traditional computer in a number of important
aspects which we discuss below.
A. Direct I/O between memory and database
The main challenges in the design of the data-intensive
operating system are to guarantee the quality of service in the
level of performance of the data flow and to ensure scalability
and efficient parallel scheduling and resource management.
We plan the operating system to run on a set of dedicated
servers, with the user HPC applications acting as clients for
the operating system processes. A major goal of the operating
system is to transform application burst I/O into uniform,
balanced traffic in the database.
Since execution of database queries is significantly slower
than the transmission of the results over the high bandwidth
network, it is advantageous, whenever possible, to execute
queries on multiple database servers in parallel. The operating system will act as a distributed scheduler for the database
(see Figure 2): each dedicated operating system server pro-
cess allocates multiple database server connections for data-
intensive applications, and fewer database server connections
for applications with lower data requirements. This design
is scalable and is aimed at minimizing application I/O by
employing smart heuristic scheduling algorithms.
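A minimal sketch of one such heuristic follows: database connections are handed out in proportion to each application's declared I/O intensity, with every application guaranteed at least one connection. The structure and the proportional policy are illustrative assumptions, not the implemented MPI-DB scheduler.

#include <algorithm>
#include <string>
#include <vector>

struct Application {
    std::string name;
    double ioIntensity;     // estimated fraction of run time spent in I/O (> 0)
    int connections = 0;    // database server connections granted
};

// Distribute a fixed pool of database connections in proportion to each
// application's I/O intensity, guaranteeing at least one connection each.
void allocateConnections(std::vector<Application>& apps, int totalConnections) {
    double totalIntensity = 0.0;
    for (const auto& a : apps) totalIntensity += a.ioIntensity;
    int granted = 0;
    for (auto& a : apps) {
        a.connections = std::max(1,
            static_cast<int>(totalConnections * a.ioIntensity / totalIntensity));
        granted += a.connections;
    }
    // If rounding over- or under-shot the pool, adjust the heaviest consumer.
    auto heaviest = std::max_element(apps.begin(), apps.end(),
        [](const Application& x, const Application& y) {
            return x.ioIntensity < y.ioIntensity; });
    heaviest->connections += totalConnections - granted;
}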
When a large number of applications are accessing the same
data set, such as for example the user analysis applications
-
7/23/2019 DataIntensive Computer
5/10
accessing the JHU Public Turbulence database, significant efficiencies may be realized by grouping the I/O requests of different applications together.

Fig. 2. Scalable design of I/O operations in the data-intensive operating system. The operating system acts as a scheduler for the database, allocating multiple database server connections for each process of a data-intensive application, and few connections to an application with lower data requirements.

The operating system will
maintain its own storage for caching I/O requests and will optimize database access based on each application's access patterns, as well as across applications (see for example Multicollective
I/O [24]). The operating system layer will incorporate efficient
management of available resources, and will grow or shrink
on demand.
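The following sketch illustrates the simplest form of such grouping: pending read requests from different applications are keyed by the data region they address, so that each distinct region is fetched from the database once and the result is shared among all requesters. The request structure is an illustrative assumption.

#include <map>
#include <tuple>
#include <vector>

struct ReadRequest {
    int applicationId;
    int datasetId;
    long long offset;   // start of the requested region
    long long length;   // extent of the requested region
};

// Group pending requests by the exact region they address. Each group is
// serviced by a single database query whose result is then handed to every
// requesting application, e.g. the many analysis jobs scanning the same
// turbulence time step.
std::map<std::tuple<int, long long, long long>, std::vector<ReadRequest>>
groupRequests(const std::vector<ReadRequest>& pending) {
    std::map<std::tuple<int, long long, long long>, std::vector<ReadRequest>> groups;
    for (const auto& r : pending)
        groups[{r.datasetId, r.offset, r.length}].push_back(r);
    return groups;
}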
B. Moving the program to the data
A major goal in the design of the operating system of the data-intensive computer is to enable applications with an
arbitrary mix of I/O and computation. In many instances it is
advantageous to carry out computations with large data objects
in the database. The "move the program to the data" approach of Szalay et al. [25] has been a fundamental tenet in the
design of large-scale scientific databases. We have seen that
in the JHU Turbulence database (see section II) data requests
may trigger execution of pre-defined routines in the database.
A serious limitation of this approach is that such routines must
be pre-programmed in the database.
We propose to extend the "move the program to the data" approach by automatically generating the code that will be executed in the database. An HPC application will be compiled into code that will execute on the HPC cluster, as well as
code for computations with operating-system-supported data
objects that will execute in the database. The operating system
will carry out moving the program to the data. Compiler-
generated code for large data object computations will be sent
from the HPC cluster to the database using the operating sys-
tem's client-server communications. The user HPC application
will be linked against the operating system client software. At
run time, it will execute code on the HPC cluster, call system
services that will execute in the operating system layer and
execute the application-generated code in the database.
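The fragment below sketches what this split might look like from the programmer's point of view: a whole-array reduction over a database-resident array is a natural candidate for compiler-generated database-side code, while the surrounding control flow stays on the HPC cluster. The Sum method shown here is a hypothetical illustration and is not part of the current MPI-DB interface.

namespace mpidb {
// Stand-in for the operating-system-supported array object of section III-C;
// Sum() is a hypothetical whole-array reduction added purely for illustration.
class Array {
public:
    double Sum() const;   // candidate for execution inside the database
};
}

void advanceSimulation(int step);        // executes on the HPC cluster
void monitor(int step, double value);    // executes on the HPC cluster

void mainLoop(mpidb::Array& density, int nSteps) {
    for (int step = 0; step < nSteps; ++step) {
        advanceSimulation(step);
        // The compiler can translate this reduction into code that runs next
        // to the data, inside the database servers; only the scalar result
        // travels back to the cluster over the network.
        double totalMass = density.Sum();
        monitor(step, totalMass);
    }
}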
User applications will be developed in a high-level pro-
gramming language (Fortran, C, C++, etc.) that includes
mechanisms for concurrency control (e.g. MPI), allowing easy
porting of legacy applications to the data-intensive computer.
A specially designed language, such as Titanium [26] or
Charm++ [27], which has a built-in mechanism for concur-
rency control, can also be used for application development.
Furthermore, we believe that designing a special-purpose lan-
guage for processing large data sets will improve programmer
productivity.
C. Data-object-oriented operating system
The abstract programming model of sequential file access
is not appropriate to represent the storage layout of large data
objects in the distributed database; nor is it convenient for
the applications programmer. Instead, the operating system
must provide support for storing and manipulating abstract
objects such as arrays, graphs, sparse arrays, etc. Implementing
system-level support for a particular data structure is non-trivial: a distributed database layout and optimized I/O man-
agement must be planned for each data object. Nevertheless,
the operating system must provide a set of services which
simplify development and execution of application programs.
These can be made available to the application programmer
through the use of advanced compilers.
Our prototype operating system, MPI-DB, provides support
for multi-dimensional arrays. (See section IV-B.) The survey
in section II readily suggests that sparse arrays and graphs
are among the data structures that must be supported in the
data-intensive operating system. Indeed, it is estimated that the number of synaptic connections in the human brain is on the order of 10^15. Operating system support for additional data structures will depend on applications, such as the Berkeley seven (or thirteen) dwarfs [28].
D. Operating system support for distributed data objects
Virtually every parallel computation involves a partition of
a large data object among several processors with the aim
of reducing the total wall-clock time of the computation. A
typical example is the partitioning of a multi-dimensional array
(see section IV-B). This Map/Reduce methodology breaks up
data objects, creating distributed data objects which exist only
during computation. At the end of the computation a single
coherent data object must be assembled from the distributed
data object. We believe that support for distributed data objects must be provided both in programming languages (e.g. the
single objects in Titanium [26]) and in the operating system
of the data-intensive computer.
While the data object stored in the database is logically
single, its storage layout is distributed among database servers.
In the process of reducing a run-time distributed data object to
a logically single object stored in the database, the operating
system will generate a physical mapping of the object's storage
layout in the global database system. This mapping will
identify the database servers, the server-attached databases and
the storage partitions that hold the data representing the object,
and will determine methods for access and modification of the
object.
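A minimal sketch of the information such a mapping might record is given below; the field names and the three-dimensional index ranges are illustrative assumptions.

#include <string>
#include <vector>

// One piece of a logically single array: which server, database and storage
// partition hold it, and which index ranges of the global array it covers.
struct StoragePiece {
    std::string server;        // database server holding this piece
    std::string database;      // server-attached database
    std::string partition;     // storage partition (e.g. a table or file group)
    int lo[3];                 // inclusive lower corner in global array indices
    int hi[3];                 // inclusive upper corner in global array indices
};

// The physical mapping of a distributed data object: the complete list of
// pieces, from which access and modification methods can locate any element.
struct StorageLayout {
    std::vector<StoragePiece> pieces;

    // Find the piece containing global index (i, j, k); returns -1 if absent.
    int findPiece(int i, int j, int k) const {
        for (std::size_t p = 0; p < pieces.size(); ++p) {
            const StoragePiece& s = pieces[p];
            if (i >= s.lo[0] && i <= s.hi[0] &&
                j >= s.lo[1] && j <= s.hi[1] &&
                k >= s.lo[2] && k <= s.hi[2])
                return static_cast<int>(p);
        }
        return -1;
    }
};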
E. Collaborative, non-local operating system services
A large data set is typically created and processed by a
large group of possibly collaborating individuals, who execute
a set of concurrent processes. While the main goal of the
data-intensive operating system is to enable data-intensive
computations, its services can be used by remote users.
Remote computers obtain services from the data-intensive
operating system in the same way the HPC applications do:
an application running on a remote computer is compiled into
code that executes on that computer, connects to the data-
intensive operating system over the network and sends to the
operating system code that is executed within the database.
The main difference is in the network connection speed.
Remote users with slow network connections may choose
to download portions of data sets from the database to their
computers, perform extensive local computations and send results back to the database. Furthermore, the data-intensive
operating system can be used as a software library installed
on the remote computer and run in conjunction with a local
database, enabling the user to store data objects imported from
a remote database directly into the local database, and to
process the data in the local database using the same program
that was previously created for remote, possibly large-scale,
data processing.
Such exposure of data and data-processing services has
the potential of an impact beyond the scientific community,
involving the general public in both science education and sci-
entific research. For example, the GalaxyZoo project, initiated at IDIES, which serves galaxy images to the general public for visual type classification, opened on July 12, 2007, and after 8 days of operation the JHU servers had 56 million web hits, and 8.5
million galaxies had been classified. The project was featured everywhere from the BBC to the London Times, the Economist, the Washington Post and the Christian Science Monitor, and was on the front page of
Slashdot and Wikipedia. This unprecedented interest is a clear
example that open access to science is in the interest of the
public and is beneficial to the advancement of science.
IV. THE MPI-DB SOFTWARE LIBRARY
The MPI-DB software library is a prototype of an operating
system for a data-intensive computer. It provides database
services to scientific computing processes and currently supports SQL Server and MySQL databases on Windows and
Linux with C, C++ and Fortran language bindings. The library
consists of two compatible software packages: the MPI-DB
client and the MPI-DB server, and it requires a working MPI
installation (including MPI-2 functionality) and UDT sockets [29] for its client-server communications, and a database connected to the server. The MPI-DB server accepts connections from clients at a known network address and services clients' requests by querying the database and sending the results back
to the clients. User client applications must be compiled and
linked against the MPI-DB client and use the library to store,
retrieve and modify the data in the database.
A. An introductory tutorial
Consider a scientific application consisting of several paral-
lel MPI processes, continuously generating output that needs
to be stored. We show how MPI-DB can be easily used to
store the output in a database. The user application is written
in C++ with MPI. It is linked against the MPI-DB library
and in our simple example there are two parallel processes at
runtime, whose ranks are 0 and 1.
The user interaction with MPI-DB starts by defining the data
structures that will be stored in the database. In our example
the two parallel MPI processes jointly perform a computation using a single three-dimensional array of 128 × 128 × 128 double precision floating point numbers. The array is divided between the two processors, with processor 0 holding in its local memory the [0...127] × [0...127] × [0...63] portion of the array and processor 1 holding the [0...127] × [0...127] × [64...127] part. Correspondingly, each process defines an mpidb::Domain object subdomain and an mpidb::Array object a.
// this user process has rank = MyID,
// which in our example is either 0 or 1
MPI_Comm_rank(MPI_COMM_WORLD, &MyID);
mpidb::Domain subdomain(0, 127, 0, 127, 64 * MyID,
64 * MyID + 63);
mpidb::Array a(subdomain, mpidb::DOUBLE_PRECISION);
// generate a stream of array data objects
mpidb::DataStream s(a);
mpidb::DataSet d;  // DataSet d is a single
                   // object, common to both processes
// DataSet d will contain two data streams
d.AddStream(s);
Our application will perform repeated computation of the data
array, with each process periodically storing its portion of the
data array in the database. Each process will therefore generate
a stream of arrays. This is expressed in the definition of the
mpidb::DataStream object s.
Finally, the application defines the mpidb::DataSet
object d, which in contrast to previously defined objects is
a single (distributed) object common to both processes. After
each process adds a data stream to this data set, it will contain
exactly two streams.
Having defined the data structures, each of the two
MPI processes attempts to establish a connection with
an MPI-DB server. This is achieved by defining an
mpidb::Connection object c and executing on it the
ConnectToServer method with a given server address.
mpidb::Connection c;
char * ServerAddress = "128.220.233.155::52415";
if (!c.ConnectToServer(ServerAddress))
{
cerr << "cannot connect to MPI-DB server at " << ServerAddress << endl;
return 1;
}
MPI-DB assigns unique integer descriptors to objects stored
in the database, allowing the user to access previously stored
objects and load them from the database directly into pro-
gramming objects. In the future MPI-DB will provide a set of
services to name objects and list user-defined objects which
are already stored in the database.
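Purely as an illustration of how descriptor-based retrieval might look to the application programmer, consider the commented fragment below; LoadArray and the descriptor-passing convention are hypothetical and are not taken from the current MPI-DB interface.

// Hypothetical retrieval of a previously stored array by its descriptor
// (all names below are illustrative assumptions):
//
// mpidb::Connection c;
// c.ConnectToServer(ServerAddress);
//
// int descriptor = ...;                      // returned by an earlier store
// mpidb::Domain subdomain(0, 127, 0, 127, 0, 63);
// mpidb::Array a(subdomain, mpidb::DOUBLE_PRECISION);
// c.LoadArray(descriptor, a);                // fill a directly from the database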
C. MPI-DB Software architecture
MPI-DB software is built as a layered structure, analogous
to multi-layer protocols used in computer network commu-
nications. Such a design is flexible and extensible. Figure 3
illustrates the software architecture.
Fig. 3. Layered software architecture of MPI-DB. The Data Transport Layer encapsulates physical communication. The server of the Database Access Layer acts as a client of the database server.
The Data Transport Layer is the lowest layer of the
MPI-DB layer hierarchy. It provides the basic functionality for
establishing and managing the connection between clients and
servers over a fast network. This design encapsulates packet
transmission in the Data Transport Layer. Indeed, we have
found it necessary to create two independent implementations
of the Transport Layer: one using UDT sockets and the other
using the MPI-2 standard (see section IV-D). The MPI protocol
is a de facto standard in scientific computing. MPI installations
are available for a wide range of operating systems and
computer networks, and in many instances benchmarking tests
have shown MPI to be the fastest protocol for data transfer
[32].
The Database Access Layer provides basic functionality
to remotely execute queries and access database tables over
the network. It provides the Data Object Layer with a narrow
set of abstract operations needed to manipulate MPI-DB
programming objects in the database. This layer encapsulates
all SQL queries and includes drivers for major databases, such
as SQL Server, MySQL and PostgreSQL.
The Data Object Layer contains the description of
the user-defined programming objects that are stored in the
database, including their physical storage layout, and provides
access and manipulation methods for these objects. User-
defined objects are serialized by the client, sent to the server
and unserialized by the server, to be subsequently stored in
the database. A hierarchical description of the physical storage
layout lists the servers, the server-attached databases and the
storage partitions holding the data associated with each object.
Data access methods implement the mapping between the user-defined run-time partition of the object among multiple processors and the object's hierarchical database storage layout.
The System Management Layer maintains a resource
map, describing all the resources (storage and servers) avail-
able in the global database system. It includes a caching
system for grouping applications' I/O requests and a scheduler
assigning the I/O requests to database servers. This layer is
also planned to handle administration functions, managing all
user-related information, including managing user logins and
monitoring user connections.
D. Implementation
MPI-DB is being developed as object-oriented software in C++ and will be made available under the new BSD
open-source software license. It requires a working imple-
mentation of the MPI standard, including MPI-2 function-
ality, most importantly functions for client-server interaction
(MPI_Open_Port, etc.) and dynamic process management
(MPI_Comm_spawn). In our experience this requirement
proved to be difficult to satisfy: Presently, MPI-DB works only
with the Intel MPI software library. Many MPI implementations contain run-time bugs, do not provide adequate hardware drivers, or do not support the MPI-2 standard at all. Process intercommunication depends on accessing MPI ports using an address string whose format is implementation-dependent. This complicates implementation of client-server software when the client and server are built using different implementations of
MPI.
In view of these difficulties we re-implemented the Trans-
port Layer of MPI-DB using UDT sockets. While the client
side of the library currently uses MPI, the server side relies
on MPI only to spawn server processes. Future versions of
the library will use different services to manage processes and
will be independent of MPI.
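A sketch of how such a pluggable transport can be organized is shown below: the layers above see only an abstract connect/send/receive interface, and the UDT-based and MPI-based mechanisms are interchangeable implementations behind it. The class and method names are illustrative assumptions, not the actual MPI-DB classes.

#include <cstddef>
#include <memory>
#include <string>

// Abstract transport: the layers above it see only connect/send/receive,
// so the physical mechanism (UDT sockets, MPI-2 ports, ...) can be swapped.
class Transport {
public:
    virtual ~Transport() = default;
    virtual bool connect(const std::string& address) = 0;
    virtual std::size_t send(const void* buffer, std::size_t bytes) = 0;
    virtual std::size_t receive(void* buffer, std::size_t bytes) = 0;
};

class UdtTransport : public Transport {
    // ... would wrap the UDT socket API ...
public:
    bool connect(const std::string& address) override;
    std::size_t send(const void* buffer, std::size_t bytes) override;
    std::size_t receive(void* buffer, std::size_t bytes) override;
};

class MpiTransport : public Transport {
    // ... would wrap MPI_Open_port / MPI_Comm_connect and point-to-point calls ...
public:
    bool connect(const std::string& address) override;
    std::size_t send(const void* buffer, std::size_t bytes) override;
    std::size_t receive(void* buffer, std::size_t bytes) override;
};

// The layer above selects an implementation at start-up; no other code changes.
std::unique_ptr<Transport> makeTransport(bool useUdt) {
    if (useUdt) return std::make_unique<UdtTransport>();
    return std::make_unique<MpiTransport>();
}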
We have tested the throughput of data ingest in a pro-
totype MPI-DB installation. Both the client and the server
ran on Linux nodes having eight physical cores each. The
client and the server were linked by a 10 Gigabit Ethernet connection with a MySQL database running on the server. The client application generated a three-dimensional array of data which was partitioned among p parallel processes with p = 1, 2, 4, 8, 16, 32 and 64. For each client process MPI-DB allocated a dedicated server process, which opened a
connection to MySQL, received data from its assigned client process and ingested the data into the MySQL database. The data ingestion process was thus performed in p parallel threads over the Ethernet connection. The results of two test run sequences are shown in Figure 4.
Fig. 4. Throughput test of MPI-DB in parallel data ingestion over a 10 Gigabit/sec Ethernet connection, with both the client and the server having eight physical CPU cores each. Client data is sent to the server and then ingested into a MySQL database. MPI-DB allocated a dedicated server process for each client process. The aggregate throughput (in Megabytes per second) is recorded as a function of the number of parallel ingestion processes. Maximum throughput between 600 and 700 Megabytes per second is realized when the number of ingestion processes equals the number of physical cores.
Peak performance between 600 and 700 Megabytes per second was achieved when the number of parallel processes was the same as the number of available physical cores (eight).
Following successful tests of MPI-DB, it is now being
used to ingest the results of the simulation of a turbulent
channel flow directly from the HPC cluster at the JHU Physics
department into the database.
V. CONCLUSION
We have introduced the concept of the data-intensive com-
puter and have proposed a novel operating system architecture
to support data-intensive computations with Petascale-sized
data sets. Section III outlines a research and development
program for building a data-object-oriented operating system
with advanced compiler technology to translate user applica-
tion programs into code that runs on the HPC cluster and
code that runs in the database. The proposed operating system
is non-local: it supports large-scale collaborative computations
where user applications can be translated into code that runs
on a remote computer and code that runs in the database.
Our discussion touches only a few topics related to the
design of the data-intensive computer, and necessarily omits
many important aspects. The construction of the data-intensive
computer is an ongoing research project. A prototype of the data-intensive operating system has been implemented as a
software library, MPI-DB, and is currently being used in
production by the Turbulence research group at JHU.
REFERENCES
[1] http://www.sdss.org/.
[2] V. Springel, S. White, A. Jenkins, C. Frenk, N. Yoshida, L. Gao, J. Navarro, R. Thacker, D. Croton, J. Helly, J. Peacock, S. Cole, P. Thomas, H. Couchman, A. Evrard, J. Colberg, and F. Pearce, "Simulations of the formation, evolution and clustering of galaxies and quasars," Nature, vol. 435, pp. 629-636, 2005.
[3] G. Lemson and the Virgo Consortium, "Halo and galaxy formation histories from the millennium simulation: Public release of a VO-oriented database," 2006.
[4] M. Stonebraker, P. Brown, A. Poliakov, and S. Raman, "The architecture of SciDB," in Scientific and Statistical Database Management, ser. Lecture Notes in Computer Science, J. Bayard Cushing, J. French, and S. Bowers, Eds. Springer Berlin / Heidelberg, 2011, vol. 6809, pp. 1-16.
[5] http://www.mpa-garching.mpg.de/millennium/.
[6] M. C. Neyrinck, I. Szapudi, and A. S. Szalay, "Rejuvenating the Matter Power Spectrum: Restoring Information with a Logarithmic Density Mapping," Astrophysics J. Letters, vol. 698, pp. L90-L93, Jun. 2009.
[7] E. Perlman, R. Burns, Y. Li, and C. Meneveau, "Data exploration of turbulence simulations using a database cluster," in Proceedings of the Supercomputing Conference (SC07), 2007.
[8] Y. Li, E. Perlman, M. Wan, Y. Yang, R. Burns, C. Meneveau, S. Chen, A. Szalay, and G. Eyink, "A public turbulence database cluster and applications to study Lagrangian evolution of velocity increments in turbulence," J. Turbulence, vol. 9, no. 31, 2008.
[9] http://turbulence.pha.jhu.edu/.
[10] C. Blakeley, N. Cunningham, B. Ellis, Rathakrishnan, and M. C. Wu, "Distributed/heterogeneous query processing in Microsoft SQL Server," in 21st Int. Conf. on Data Engineering (ICDE05), 2005, pp. 1001-1012.
[11] H. Yu and C. Meneveau, "Lagrangian refined Kolmogorov similarity hypothesis for gradient time evolution and correlation in turbulent flows," Phys. Rev. Lett., vol. 104, no. 8, p. 084502, Feb 2010.
[12] P. K. Yeung and S. B. Pope, "An algorithm for tracking fluid particles in numerical simulations of homogeneous turbulence," J. Comput. Phys., vol. 79, no. 2, pp. 373-416, 1988.
[13] L. X and J. Katz, "Measurement of pressure-rate-of-strain, pressure diffusion and velocity-pressure-gradient tensors around an open cavity trailing corner," Bull. Am. Phys. Soc., vol. 53, no. 15, 2008.
[14] Snyder, personal communication, 2009.
[15] Leonard, personal communication, 2009.
[16] G. L. Eyink, "Stochastic flux freezing and magnetic dynamo," 2011.
[17] C. Meneveau, "A web-services accessible turbulence database of isotropic turbulence: lessons learned," in Progress in Wall Turbulence: Understanding and Modeling (M. Stanislas, ed.), held on 21-23 April in Lille, France, 2009.
[18] E. Givelberg and J. Bunn, "A comprehensive three-dimensional model of the cochlea," J. Comp. Phys., vol. 191, no. 2, pp. 377-391, 2003.
[19] http://pcbunn.cacr.caltech.edu/cochlea/.
[20] E. Givelberg and K. Yelick, "Distributed immersed boundary simulation in Titanium," SIAM J. on Scientific Computing, vol. 28, no. 4, pp. 1367-1378, Jul. 2006.
[21] N. Kasthuri, K. Hayworth, J. Tapia, R. Schalek, S. Nundy, and J. Lichtman, "The brain on tape: Imaging an ultra-thin section library (UTSL)," Society for Neuroscience Abstracts, 2009.
[22] D. D. Bock, W.-C. A. Lee, A. M. Kerlin, M. L. Andermann, G. Hood, A. W. Wetzel, S. Yurgenson, E. R. Soucy, H. S. Kim, and R. C. Reid, "Network anatomy and in vivo physiology of visual cortical neurons," Nature, no. 471, pp. 177-182, March 2011.
[23] S. Saalfeld, A. Cardona, V. Hartenstein, and P. Tomancak, "CATMAID: collaborative annotation toolkit for massive amounts of image data," Bioinformatics, vol. 25, no. 15, pp. 1984-1986, 2009.
[24] G. Memik, M. T. Kandemir, W.-K. Liao, and A. Choudhary, "Multicollective I/O: A technique for exploiting inter-file access patterns," ACM Transactions on Storage, vol. 2, no. 3, Aug. 2006.
[25] A. S. Szalay, P. Z. Kunszt, A. Thakar, J. Gray, D. Slutz, and R. J. Brunner, "Designing and mining multi-terabyte astronomy archives: the Sloan Digital Sky Survey," in SIGMOD '00: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM, 2000, pp. 451-462.
[26] K. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. Hilfinger, S. Graham, D. Gay, P. Colella, and A. Aiken, "Titanium: A high-performance Java dialect," Concurrency: Practice and Experience, vol. 10, no. 11-13, September-November 1998.
[27] L. V. Kale and G. Zheng, "Charm++ and AMPI: Adaptive runtime strategies via migratable objects," M. Parashar, Ed. Wiley-Interscience, 2009, pp. 265-282.
[28] K. Asanovic, R. Bodik, B. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, S. Williams, and K. Yelick, "The landscape of parallel computing research: A view from Berkeley," Electrical Engineering and Computer Sciences, University of California at Berkeley, Tech. Rep. UCB/EECS-2006-183, December 2006.
[29] Y. Gu and R. L. Grossman, "UDT: UDP-based data transfer for high-speed wide area networks," Comput. Netw., vol. 51, pp. 1777-1799, May 2007. [Online]. Available: http://dl.acm.org/citation.cfm?id=1229189.1229240
[30] L. Dobos, I. Csabai, M. Milovanovic, T. Budavari, A. Szalay, M. Tintor, J. Blakeley, A. Jovanovic, and D. Tomic, "Array requirements for scientific applications and an implementation for Microsoft SQL Server," in Proc. of the EDBT/ICDT 2011 Workshop on Array Databases, Uppsala, Sweden (ed.: P. Baumann), 2011.
[31] T. Budavari, A. Szalay, and G. Fekete, "Searchable sky coverage of astronomical observations: Footprints and exposures," submitted, 2010.
[32] Benchmarks provided by InfiniBand vendors.