An Architecture for a Data-Intensive Computer
Edward Givelberg
Department of Physics and Astronomy, IDIES, The Johns Hopkins University
Email: [email protected]

Alexander Szalay
Department of Physics and Astronomy, Computer Science Department, IDIES, The Johns Hopkins University

Kalin Kanov and Randal Burns
Computer Science Department, IDIES, The Johns Hopkins University
Abstract: Scientific instruments, as well as simulations, generate increasingly large datasets, changing the way we do science. We propose that processing of Petascale-sized datasets will be carried out in a data-intensive computer, a system consisting of an HPC cluster, a massively parallel database and an intermediate operating system layer. The operating system will run on dedicated servers and will exploit massive parallelism in the database, as well as numerous optimization strategies, to deliver high-throughput, balanced and regular data flow for I/O operations between the HPC cluster and the database. The programming model of sequential file storage is not appropriate for data-intensive computations, so we propose a data-object-oriented operating system, where support for high-level data objects, such as multi-dimensional arrays, is built in. User application programs will be compiled into code that is executed both on the HPC cluster and inside the database. The data-intensive operating system is, however, non-local, so that user applications running on a remote PC will be compiled into code executing both on the PC and inside the database. This model supports a collaborative environment, where a large data set is typically created and processed by a large group of users.
We have implemented a software library, MPI-DB, which is a prototype of the data-intensive operating system. It is currently being used to ingest the output of a simulation of turbulent channel flow into the database.
I. INTRODUCTION
The traditional process of scientific discovery consists of
systematic observation, experimentation, measurement and
data collection, leading to the creation of a theory that
explains past observations and predicts the results of future
experiments. In virtually every field of science, technological progress has led to the construction of high-throughput measurement instruments (telescopes, high-energy particle accelerators, gene sequencing machines, etc.) that generate very large data sets from observations of complex physical systems. On the other hand, the theoretical description of complex physical phenomena starts with a set of basic laws (typically expressed as partial differential equations), whose consequences are investigated with the help of simulation experiments using
computational models. The volume of data produced by com-
puter simulations has been increasing even more rapidly than
the size of empirical measurement data sets.
In astrophysics, for example, database technology has been
developed to host the results of the Sloan Digital Sky Survey,
and to expose the data to scientists through the SDSS Sky-
Server [1]. Following the example of the SDSS SkyServer,
the Millennium simulation [2] created a remotely accessible
database with a collaborative environment [3], which drew
hundreds, if not thousands of astronomers into analyzing
simulations.
The need to process very large data sets is changing the
way we do science. The ability to perform computations with
very large data sets is necessary for mining both experimental
and simulation output data sets, as well as for developing
new methodologies for comparing empirical observations with theory.
In this paper we propose that computations with Petascale-
sized data sets should be carried out on a data-intensive
computer, a system consisting of an HPC cluster, a massively
parallel database and an intermediate operating system layer.
We present a design for the operating system and discuss
novel features distinguishing the data-intensive computer from
the traditional computer. Our approach to data-intensive com-
puting is substantially different from the database-centric
approach of SciDB [4].
The NSF has recently awarded our group funding to build a 5PB cluster for extreme data-intensive computations. The Data-Scope will be co-located and integrated with a Beowulf cluster of about 4000 CPU cores at the Department of Physics and Astronomy. The system's components are currently being evaluated and tested, and deployment is expected in October 2011. The Data-Scope will consist of 90 performance and 12 storage servers. The total disk capacity will exceed 5PB, with 3PB in the storage layer and 2.2PB in the performance layer. The peak aggregate sequential I/O performance is projected to be 460GB/s.
The driving goal behind the Data-Scope design is to max-
imize stream processing throughput over 100TB-size datasets
while using commodity components to keep acquisition and
maintenance costs low. Accessing the data in a massively
parallel fashion from the cluster via locally attached disks and
SSDs is significantly faster than serving the data from shared network file servers to multiple compute servers.
We have developed a software library called MPI-DB
which is a prototype of the operating system for the data-
intensive computer. MPI-DB will enable us to establish peer-
to-peer connections between nodes on the HPC cluster and
the Data-Scope I/O nodes both for the on-the-fly ingest of
data generated by an MPI application, and for the parallel
compute-intensive analysis of large data sets read from the
parallel database. MPI-DB has been successfully tested and is
presently being used to ingest the output of the simulation of
a turbulent channel flow directly into the database.
The remainder of the paper is organized as follows. In section II we survey several research projects involving data-intensive computing. An examination of computing requirements for these projects leads us to propose the concept of the data-intensive computer in section III, where we discuss design requirements for the computer and its operating system. A prototype of this operating system, MPI-DB, is described in section IV. We present the user view of the software library, then describe its software architecture and discuss implementation issues. We summarize our conclusions in section V.
II. EXAMPLES OF DATA-INTENSIVE SCIENTIFIC RESEARCH
In this section we examine several examples involving data-
intensive computations from astrophysics, turbulence, neu-
roscience and hearing research. The examples, taken from
research activities of our group members, illustrate the demand
for data-intensive computations, the present technological lim-
itations, and guide us in the design of the operating system
for the data-intensive computer.
A. The large-scale structure of the Universe
Contemporary research in astrophysics has deep and important connections to particle physics. Observations of large structures in the universe led physicists to the discovery of dark matter and dark energy, and understanding these new forms of matter will change our view of the universe on all scales, including the particle scale and the human scale. Theoretical developments in astrophysics must be tested against vast amounts of data collected by instruments, such as the Hubble Space Telescope, as well as against the results of supercomputer simulation experiments, like the Millennium Run [5]. These data sets are available in public databases and are being mined by scientists to gain intuition and to make new discoveries, but the researchers are limited by the technological means available to access the data. In order to analyze astrophysical data, researchers write scripts that perform database queries, transfer the resulting data sets to their local computers and store them as flat files.
Such limited access has already produced important discoveries. For example, a new log-power density spectrum was recently discovered by such analysis of the data in the Millennium Run database [6]. This is the most efficient quantitative description of the distribution of the density of matter in the Universe obtained so far.
B. Turbulence research
A new database approach to scientific computing in fluid
dynamics was developed by researchers at the JHU Institute
of Data-Intensive Engineering and Science (IDIES), where the
entire time-history of a simulation is stored in a database using
relational database technologies [7], [8], and is publicly accessible for scientific analysis. A pseudospectral direct numerical simulation of forced isotropic turbulence on a 1024³-point Eulerian computational grid with periodic boundary conditions was computed, and 1024 time-steps are stored, covering a full large-eddy turnover time of model evolution. The resulting 27 Tbyte data set is publicly available in the JHU Public Turbulence Database [9]. Velocity, pressure, velocity gradient
tensor and other quantities can be obtained directly from the
query web page of the Turbulence Database.
The technology enabling scientists to execute such queries
includes a mediator (DataBase Access Server), which acts as
a gateway for users and applications, and takes individual
requests and routes them to a node that has the data needed to
service that request. Users and applications discover the inter-
face through the Web-Service Definition Language (WSDL).
Requests and results are transmitted between the mediator
and its clients through the Simple Object Access Protocol
(SOAP). Analysis tools within the database are implemented
in user-defined functions (UDFs) in SQL Server 2005 using
the common-language runtime (CLR) [10].
To service data calls from C and Fortran codes, a wrapper
interface to the gSOAP library is provided, which is linked
at compile time to the user program. While executing, the program makes subroutine-like calls that request desired parts of the data from the database over the internet. The request may trigger execution of pre-defined analysis routines in the database, e.g. interpolation, differentiation, etc.
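From the application programmer's point of view, such a request is an ordinary subroutine invocation. The fragment below is a minimal sketch of this pattern; the function name getVelocity, its argument list and the dataset name are illustrative assumptions modeled on the description above, not the exact interface of the JHU wrapper library.

#include <cstdio>

// Illustrative stand-in for a wrapper call to the turbulence database:
// fetch the velocity at a set of spatial points and a simulation time.
// The real wrapper is linked at compile time and forwards the request
// to the database over the internet via gSOAP.
extern "C" int getVelocity(const char* dataset, float time,
                           int npoints, float points[][3],
                           float velocities[][3]);

int main() {
    const int npoints = 2;
    float points[2][3] = {{0.1f, 0.2f, 0.3f}, {1.0f, 1.5f, 2.0f}};
    float velocities[2][3];

    // Each call may trigger database-side interpolation at the given points.
    if (getVelocity("isotropic_turbulence", 0.364f, npoints,
                    points, velocities) != 0) {
        std::fprintf(stderr, "database request failed\n");
        return 1;
    }
    for (int i = 0; i < npoints; ++i)
        std::printf("u(%d) = (%g, %g, %g)\n", i,
                    velocities[i][0], velocities[i][1], velocities[i][2]);
    return 0;
}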
An example of the transformative impact of the new ap-
proach on scientific research is the recent study of H. Yu
and C. Meneveau [11], where new evidence for Kolmogorov's
Refined Similarity Hypothesis has been found by following the
time evolution of velocity and pressure gradients in isotropic
turbulence and quantifying their autocorrelation functions and
decorrelation time scales.
Data analysis was performed on a workstation, accessing
the database using web service tools. Fluid particles were
tracked using the second-order Runge-Kutta particle tracking algorithm [12]. Lagrangian information, such as fluid velocity and pressure gradients, along the fluid particle trajectories was extracted and used in the analysis. The required velocities were interpolated using 8th-order Lagrange polynomials in space and piecewise-cubic Hermite polynomials in time. Averages
were performed over more than 800 million pairs of stochastic
trajectories, consuming a couple of months of wall-clock
time. These operations were implemented in predefined func-
tions in the database [8].
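To make the workflow concrete, the sketch below advances one fluid particle by a single step of the second-order Runge-Kutta (predictor-corrector) scheme used in the study; the velocity callback stands in for the database-resident interpolation routine (8th-order Lagrange in space, piecewise-cubic Hermite in time), and its signature is an assumption made for illustration.

#include <array>
#include <functional>

using Vec3 = std::array<double, 3>;

// One second-order Runge-Kutta (Heun) step of fluid particle tracking.
// 'velocity' stands in for the database-resident interpolation routine.
Vec3 rk2Step(const Vec3& x, double t, double dt,
             const std::function<Vec3(const Vec3&, double)>& velocity) {
    Vec3 u0 = velocity(x, t);           // velocity at the particle position
    Vec3 xPred;
    for (int i = 0; i < 3; ++i) xPred[i] = x[i] + dt * u0[i];   // predictor
    Vec3 u1 = velocity(xPred, t + dt);  // velocity at the predicted position
    Vec3 xNext;
    for (int i = 0; i < 3; ++i)         // corrector: average the two velocities
        xNext[i] = x[i] + 0.5 * dt * (u0[i] + u1[i]);
    return xNext;
}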
The JHU turbulence database has been providing services
to a growing community of users both from JHU and from
outside. So far, it has been accessed by over 160 separate IPs. There are presently over 2 × 10^4 individual web queries per day, each requesting data on an average of 250 points, so that every day over 5 × 10^6 points are queried.
The database has been used to evaluate key assumptions in a new experimental pressure measurement technique [13],
bubble dispersion statistics in turbulence [14], evolution of
material surfaces and curvature statistics in turbulence [15],
etc. In the JHU turbulence group, the data have been exploited
to study Lagrangian models of intermittency [8], small-scale
turbulent magnetic dynamo effect [16], subgrid-scale models
[17], and statistics of rotation and strain-rates along fluid
particle trajectories [11].
As a consequence of Kolmogorov's Refined Similarity Hypothesis, the viscous cut-off length must be regarded as a
local fluctuating quantity, and it is expected that many high
Reynolds number simulations substantially underresolve some
regions in space. The Turbulence group at JHU is planning to
carry out database-driven simulations, where a nested grid
of very fine resolution is positioned in the vicinity of intense
small-scale structures of interest and co-moving with the mean
velocity at that position, in order to locally refine the archived
data.
Such a simulation has never been carried out before, and it
requires boundary conditions on the surface of the nested, co-
moving grid, which may be extracted from the time-history
stored in the database, interpolated to finer resolution. There is
presently no technology to provide such a capability. MPI-DB
will provide this new capability by employing a smart combi-
nation of caching, scheduling and parallelization techniques.
We believe this capability is extremely valuable, not only for turbulence, but in virtually every scientific simulation. To give just one example, consider environmental impact assessment of the spreading of a chemical agent in a building or an area.
This can be carried out as a database-driven simulation, using
a previously computed fluid-flow simulation.
C. Computational modeling of the cochlea
The human cochlea is a remarkable, highly nonlinear transducer that extracts vital information from sound pressure and converts it into neuronal impulses that are sent to the auditory cortex. The cochlea's accuracy, amplitude range and frequency range are orders of magnitude better than those of man-made transducers. Understanding its function has tremendous medical and engineering significance. The two most fundamental questions of cochlear research are to provide a mathematical description of the transform computed by the cochlea and to explain the biological mechanisms that compute this transform. Presently there is no adequate answer to either of these two questions. Signal processing in the cochlea is carried out by a collection of coupled biological processes occurring on length scales ranging from one centimeter down to a fraction of a nanometer. A comprehensive model describing the coupling of the dynamics of the biological processes occurring on multiple scales is needed in order to achieve a system-level understanding of cochlear signal processing.
A model of cochlear macro-mechanics was constructed in 1999-2002 by Givelberg and Bunn [18], who used supercomputers to generate very large data sets containing results of simulation experiments. These results were stored as flat files which were subsequently analyzed by the authors on workstations using specially developed software. A set of web pages devoted to this research [19] is widely and frequently accessed; however, the data was never exposed to the wider community for analysis, since no tools to ingest simulation output into a database existed when the cochlea model was developed.
This model remains the most comprehensive large-scale
model that has been constructed to study cochlear mechanics,
but it does not include the crucial dynamics of the smaller
scales. The first author is presently developing a new multi-scale method that will make it possible to extend the macro-mechanical cochlea model to include microscopic structures. The new
cochlea model will run on a multi-processor cluster using
a special algorithm for distributed immersed boundary com-
putations [20]. We plan to use MPI-DB to ingest cochlea
simulation results into the database and to expose these results
for analysis by the wide hearing research community. Further-
more, since the development of such a model is a product
of many years of research, a new technological capability is
needed to enable cochlea researchers outside JHU to use the
model, to remotely carry out simulations, as well as to observe,
monitor and analyze ongoing simulations.
D. Neuroscience
One of the biggest unanswered questions in neuroscience
today is the organization of the brain at the level of the neural micro-circuits that form the basis of neural computation. Recently, due to technological and experimental advances in
electron microscopy, it has become possible to investigate
these networks through high-resolution imaging of the brain
[21]. For example, Bock et al. [22] have recently imaged
a 350 x 450 x 50 cubic micron region of mouse cortex with 4 nanometer lateral resolution, a sufficiently detailed dataset to resolve every synaptic connection in the field of view (indeed, even vesicles are readily apparent). It has been
estimated that imaging the whole brain at this resolution would
require multiple exabytes; a cubic millimeter occupies roughly
1 petabyte. Ultimately, to fully exploit this data, it is desirable
to assign a label to each voxel indicating its identity and the
structure to which it belongs.
Clearly, while even collecting this type of data is an
enormous task, interpreting and analyzing the data is far
more difficult. It is infeasible to annotate this volume of
data manually, and probably impractical to assume that any
one group will devise a perfect automated solution. The
Open Connectome project is working to provide universal
access to this type of data via web services hosted at
http://openconnectomeproject.org. More specifically, one of us
(R. Burns) is developing tools for both human (visualization)
and computer (application programming interface, or API)
access to the data. Granting global access will enable the
largest possible community of image processing and machine
learning experts to investigate the data and develop algorithms to annotate it. Unlike standard crowdsourcing endeavors, the goal is to compile efforts from a variety of machine annotators, as opposed to human annotators, an approach that was dubbed "alg-sourcing" (for algorithm outsourcing). As
different groups tackle different aspects of the problem with
different approaches, the results will be aggregated and the
collective output will be shared, building towards a long-term
vision of a fully-annotated cortical volume.
The project is being initialized with two datasets: (1) a 12
TB dataset from Bock et al. described above, and (2) a >600GB dataset from Kasthuri and Lichtman (unpublished; spatial
resolution: 3 x 3 x 29 cubic nanometers). Panning, zooming,
and manual annotation are made possible via a web-based
graphical user interface called CATMAID [23]. An API for two-dimensional analysis of the data, including downloading arbitrary image planes and uploading planar annotations to the shared repository, is in progress. An additional server
for three-dimensional representation of the data is being built,
along with an API for downloading volumes and upload-
ing volumetric annotations. Graphics processor unit (GPU)-
enabled software will allow for visualizing arbitrary rotations
of the data in three dimensions, overlaid with the annotations.
All of the services are designed to scale up to petabytes and
beyond, and all of the developed code will be released as open
source.
The Open Connectome Project is gearing up for massive
polyscience, i.e. science collectively conducted by a large
group of individuals. This marks a radical departure from the
typical scientific workflow, in which raw data are kept local
until results are released, and will hopefully usher in a new era of understanding about the brain.
III. THE DATA-INTENSIVE COMPUTER
The operating system of a modern computer is designed
to balance programmer productivity with implementation ef-
ficiency. High-level programming languages hide the com-
puters memory hierarchy and system architecture, while the
operating system provides highly optimized services for all
application developers. The only means of permanently storing
data is by writing it in a file, and the abstract programming
model of sequential file access is efficiently implemented in
the operating system. The operating system typically does not
include services for handling high-level programming objects, such as arrays or graphs. When there is a need to store such objects for subsequent computation, the programmer must
make use of the file system with serialization/unserialization
of these objects.
Scientific data sets are now approaching the Petascale,
exceeding the capabilities of file systems, and are therefore
stored in databases. They are not easily accessible to com-
putation because performing I/O operations between an HPC
system and a database is presently very difficult. There are
no off-the-shelf solutions and a considerable effort is required
on the part of domain scientists to incorporate special-purpose
database access tools in the analysis code. Data access using
web services does not provide a scalable solution for many data-intensive applications. Furthermore, the resulting data flow throughput needs to be improved by orders of magnitude;
even trivially parallelizable data processing tasks are presently
very difficult.
In order to satisfy the increasing demand for computations
with very large data sets it is necessary to build an operating
system that will exploit the massive parallelism in the database
system to efficiently carry out data storage and retrieval
between the database and a multiprocessor computing system.
We therefore define the data-intensive computer as a system consisting of an HPC cluster, a massively parallel database and an intermediate operating system layer (see Figure 1).
The operating system enables direct I/O operations between
HPC processor memory and the database, making the database
transparent to the programmer and effectively turning it into a
layer in the memory hierarchy of the data-intensive computer.
Fig. 1. The architecture of the data-intensive computer: the HPC cluster and the database are connected to operating system servers by a high-bandwidth network. Remote clients obtain operating system services over the internet.
The operating system also provides services to remote
computers (see section III-E). The data-intensive computer
differs from the traditional computer in a number of important
aspects which we discuss below.
A. Direct I/O between memory and database
The main challenges in the design of the data-intensive
operating system are to guarantee the quality of service in the
level of performance of the data flow and to ensure scalability
and efficient parallel scheduling and resource management.
We plan the operating system to run on a set of dedicated
servers, with the user HPC applications acting as clients for
the operating system processes. A major goal of the operating
system is to transform application burst I/O into uniform,
balanced traffic in the database.
Since execution of database queries is significantly slower
than the transmission of the results over the high bandwidth
network, it is advantageous, whenever possible, to execute
queries on multiple database servers in parallel. The operating system will act as a distributed scheduler for the database
(see Figure 2): each dedicated operating system server pro-
cess allocates multiple database server connections for data-
intensive applications, and fewer database server connections
for applications with lower data requirements. This design
is scalable and is aimed at minimizing application I/O by
employing smart heuristic scheduling algorithms.
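A minimal sketch of one such heuristic follows: database connections are handed out in proportion to each application's declared I/O intensity, with every application guaranteed at least one connection. The structure and the proportional policy are illustrative assumptions, not the implemented MPI-DB scheduler.

#include <algorithm>
#include <string>
#include <vector>

struct Application {
    std::string name;
    double ioIntensity;     // estimated fraction of run time spent in I/O (> 0)
    int connections = 0;    // database server connections granted
};

// Distribute a fixed pool of database connections in proportion to each
// application's I/O intensity, guaranteeing at least one connection each.
void allocateConnections(std::vector<Application>& apps, int totalConnections) {
    double totalIntensity = 0.0;
    for (const auto& a : apps) totalIntensity += a.ioIntensity;
    int granted = 0;
    for (auto& a : apps) {
        a.connections = std::max(1,
            static_cast<int>(totalConnections * a.ioIntensity / totalIntensity));
        granted += a.connections;
    }
    // If rounding over- or under-shot the pool, adjust the heaviest consumer.
    auto heaviest = std::max_element(apps.begin(), apps.end(),
        [](const Application& x, const Application& y) {
            return x.ioIntensity < y.ioIntensity; });
    heaviest->connections += totalConnections - granted;
}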
When a large number of applications are accessing the same
data set, such as for example the user analysis applications
-
7/23/2019 DataIntensive Computer
5/10
accessing the JHU Public Turbulence database, significant efficiencies may be realized by grouping the I/O requests of different applications together.

Fig. 2. Scalable design of I/O operations in the data-intensive operating system. The operating system acts as a scheduler for the database, allocating multiple database server connections for each process of a data-intensive application, and few connections to an application with lower data requirements.

The operating system will
maintain its own storage for caching I/O requests and will optimize database access based on each application's access patterns, as well as across applications (see for example Multicollective
I/O [24]). The operating system layer will incorporate efficient
management of available resources, and will grow or shrink
on demand.
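The following sketch illustrates the simplest form of such grouping: pending read requests from different applications are keyed by the data region they address, so that each distinct region is fetched from the database once and the result is shared among all requesters. The request structure is an illustrative assumption.

#include <map>
#include <tuple>
#include <vector>

struct ReadRequest {
    int applicationId;
    int datasetId;
    long long offset;   // start of the requested region
    long long length;   // extent of the requested region
};

// Group pending requests by the exact region they address. Each group is
// serviced by a single database query whose result is then handed to every
// requesting application, e.g. the many analysis jobs scanning the same
// turbulence time step.
std::map<std::tuple<int, long long, long long>, std::vector<ReadRequest>>
groupRequests(const std::vector<ReadRequest>& pending) {
    std::map<std::tuple<int, long long, long long>, std::vector<ReadRequest>> groups;
    for (const auto& r : pending)
        groups[{r.datasetId, r.offset, r.length}].push_back(r);
    return groups;
}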
B. Moving the program to the data
A major goal in the design of the operating system of the data-intensive computer is to enable applications with an
arbitrary mix of I/O and computation. In many instances it is
advantageous to carry out computations with large data objects
in the database. The "move the program to the data" approach of Szalay et al. [25] has been a fundamental tenet in the
design of large-scale scientific databases. We have seen that
in the JHU Turbulence database (see section II) data requests
may trigger execution of pre-defined routines in the database.
A serious limitation of this approach is that such routines must
be pre-programmed in the database.
We propose to extend the "move the program to the data" approach by automatically generating the code that will be executed in the database. An HPC application will be compiled into code that will execute on the HPC cluster, as well as
code for computations with operating-system-supported data
objects that will execute in the database. The operating system
will carry out moving the program to the data. Compiler-
generated code for large data object computations will be sent
from the HPC cluster to the database using the operating sys-
tem's client-server communications. The user HPC application
will be linked against the operating system client software. At
run time, it will execute code on the HPC cluster, call system
services that will execute in the operating system layer and
execute the application-generated code in the database.
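The fragment below sketches what this split might look like from the programmer's point of view: a whole-array reduction over a database-resident array is a natural candidate for compiler-generated database-side code, while the surrounding control flow stays on the HPC cluster. The Sum method shown here is a hypothetical illustration and is not part of the current MPI-DB interface.

namespace mpidb {
// Stand-in for the operating-system-supported array object of section III-C;
// Sum() is a hypothetical whole-array reduction added purely for illustration.
class Array {
public:
    double Sum() const;   // candidate for execution inside the database
};
}

void advanceSimulation(int step);        // executes on the HPC cluster
void monitor(int step, double value);    // executes on the HPC cluster

void mainLoop(mpidb::Array& density, int nSteps) {
    for (int step = 0; step < nSteps; ++step) {
        advanceSimulation(step);
        // The compiler can translate this reduction into code that runs next
        // to the data, inside the database servers; only the scalar result
        // travels back to the cluster over the network.
        double totalMass = density.Sum();
        monitor(step, totalMass);
    }
}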
User applications will be developed in a high-level pro-
gramming language (Fortran, C, C++, etc.) that includes
mechanisms for concurrency control (e.g. MPI), allowing easy
porting of legacy applications to the data-intensive computer.
A specially designed language, such as Titanium [26] or
Charm++ [27], which has a built-in mechanism for concur-
rency control, can also be used for application development.
Furthermore, we believe that designing a special-purpose lan-
guage for processing large data sets will improve programmer
productivity.
C. Data-object-oriented operating system
The abstract programming model of sequential file access
is not appropriate to represent the storage layout of large data
objects in the distributed database; nor is it convenient for
the applications programmer. Instead, the operating system
must provide support for storing and manipulating abstract
objects such as arrays, graphs, sparse arrays, etc. Implementing
system-level support for a particular data structure is non-trivial: a distributed database layout and optimized I/O man-
agement must be planned for each data object. Nevertheless,
the operating system must provide a set of services which
simplify development and execution of application programs.
These can be made available to the application programmer
through the use of advanced compilers.
Our prototype operating system, MPI-DB, provides support
for multi-dimensional arrays. (See section IV-B.) The survey
in section II readily suggests that sparse arrays and graphs
are among the data structures that must be supported in the
data-intensive operating system. Indeed, it is estimated that the number of synaptic connections in the human brain is on the order of 10^15. Operating system support for additional data structures will depend on applications, such as the Berkeley seven (or thirteen) dwarfs [28].
D. Operating system support for distributed data objects
Virtually every parallel computation involves a partition of
a large data object among several processors with the aim
of reducing the total wall-clock time of the computation. A
typical example is the partitioning of a multi-dimensional array
(see section IV-B). This Map/Reduce methodology breaks up
data objects, creating distributed data objects which exist only
during computation. At the end of the computation a single
coherent data object must be assembled from the distributed
data object. We believe that support for distributed data objects must be provided both in programming languages (e.g. the
single objects in Titanium [26]) and in the operating system
of the data-intensive computer.
While the data object stored in the database is logically
single, its storage layout is distributed among database servers.
In the process of reducing a run-time distributed data object to
a logically single object stored in the database, the operating
system will generate a physical mapping of the object's storage
layout in the global database system. This mapping will
identify the database servers, the server-attached databases and
the storage partitions that hold the data representing the object,
and will determine methods for access and modification of the
object.
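A minimal sketch of the information such a mapping might record is given below; the field names and the three-dimensional index ranges are illustrative assumptions.

#include <string>
#include <vector>

// One piece of a logically single array: which server, database and storage
// partition hold it, and which index ranges of the global array it covers.
struct StoragePiece {
    std::string server;        // database server holding this piece
    std::string database;      // server-attached database
    std::string partition;     // storage partition (e.g. a table or file group)
    int lo[3];                 // inclusive lower corner in global array indices
    int hi[3];                 // inclusive upper corner in global array indices
};

// The physical mapping of a distributed data object: the complete list of
// pieces, from which access and modification methods can locate any element.
struct StorageLayout {
    std::vector<StoragePiece> pieces;

    // Find the piece containing global index (i, j, k); returns -1 if absent.
    int findPiece(int i, int j, int k) const {
        for (std::size_t p = 0; p < pieces.size(); ++p) {
            const StoragePiece& s = pieces[p];
            if (i >= s.lo[0] && i <= s.hi[0] &&
                j >= s.lo[1] && j <= s.hi[1] &&
                k >= s.lo[2] && k <= s.hi[2])
                return static_cast<int>(p);
        }
        return -1;
    }
};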
E. Collaborative, non-local operating system services
A large data set is typically created and processed by a
large group of possibly collaborating individuals, who execute
a set of concurrent processes. While the main goal of the
data-intensive operating system is to enable data-intensive
computations, its services can be used by remote users.
Remote computers obtain services from the data-intensive
operating system in the same way the HPC applications do:
an application running on a remote computer is compiled into
code that executes on that computer, connects to the data-
intensive operating system over the network and sends to the
operating system code that is executed within the database.
The main difference is in the network connection speed.
Remote users with slow network connections may choose
to download portions of data sets from the database to their
computers, perform extensive local computations and send results back to the database. Furthermore, the data-intensive
operating system can be used as a software library installed
on the remote computer and run in conjunction with a local
database, enabling the user to store data objects imported from
a remote database directly into the local database, and to
process the data in the local database using the same program
that was previously created for remote, possibly large-scale,
data processing.
Such exposure of data and data-processing services has
the potential of an impact beyond the scientific community,
involving the general public in both science education and sci-
entific research. For example, the GalaxyZoo project, initiated at IDIES, which serves galaxy images to the general public for visual type classification, opened on July 12, 2007, and after 8 days of operation the JHU servers had 56 million web hits, and 8.5
million galaxies had been classified. The project was featured everywhere from the BBC to the London Times, the Economist, the Washington Post and the Christian Science Monitor, and was on the front page of
Slashdot and Wikipedia. This unprecedented interest is a clear
example that open access to science is in the interest of the
public and is beneficial to the advancement of science.
IV. THE MPI-DB SOFTWARE LIBRARY
The MPI-DB software library is a prototype of an operating
system for a data-intensive computer. It provides database
services to scientific computing processes and currently supports SQL Server and MySQL databases on Windows and
Linux with C, C++ and Fortran language bindings. The library
consists of two compatible software packages: the MPI-DB
client and the MPI-DB server, and it requires a working MPI
installation (including MPI-2 functionality) and UDT sockets [29] for its client-server communications, and a database connected to the server. The MPI-DB server accepts connections from clients at a known network address and services clients' requests by querying the database and sending the results back
to the clients. User client applications must be compiled and
linked against the MPI-DB client and use the library to store,
retrieve and modify the data in the database.
A. An introductory tutorial
Consider a scientific application consisting of several paral-
lel MPI processes, continuously generating output that needs
to be stored. We show how MPI-DB can be easily used to
store the output in a database. The user application is written
in C++ with MPI. It is linked against the MPI-DB library
and in our simple example there are two parallel processes at
runtime, whose ranks are 0 and 1.
The user interaction with MPI-DB starts by defining the data
structures that will be stored in the database. In our example
the two parallel MPI processes jointly perform a computation using a single three-dimensional array of 128 × 128 × 128 double precision floating point numbers. The array is divided between the two processors, with processor 0 holding in its local memory the [0...127] × [0...127] × [0...63] portion of the array and processor 1 holding the [0...127] × [0...127] × [64...127] part. Correspondingly, each process defines an mpidb::Domain object subdomain and an mpidb::Array object a.
// this user process has rank = MyID,
// which in our example is either 0 or 1
MPI_Comm_rank(MPI_COMM_WORLD, &MyID);
mpidb::Domain subdomain(0, 127, 0, 127, 64 * MyID,
64 * MyID + 63);
mpidb::Array a(subdomain, mpidb::DOUBLE_PRECISION);
// generate a stream of array data objects
mpidb::DataStream s(a);
mpidb::DataSet d;  // DataSet d is a single
                   // object, common to both processes
// DataSet d will contain two data streams
d.AddStream(s);
Our application will perform repeated computation of the data
array, with each process periodically storing its portion of the
data array in the database. Each process will therefore generate
a stream of arrays. This is expressed in the definition of the
mpidb::DataStream object s.
Finally, the application defines the mpidb::DataSet
object d, which in contrast to previously defined objects is
a single (distributed) object common to both processes. After
each process adds a data stream to this data set, it will contain
exactly two streams.
Having defined the data structures, each of the two
MPI processes attempts to establish a connection with
an MPI-DB server. This is achieved by defining an
mpidb::Connection object c and executing on it the
ConnectToServer method with a given server address.
mpidb::Connection c;
char * ServerAddress = "128.220.233.155::52415";
if (!c.ConnectToServer(ServerAddress))
{
cerr << "cannot connect to MPI-DB server at " << ServerAddress << endl;
return 1;
}
MPI-DB assigns unique integer descriptors to objects stored
in the database, allowing the user to access previously stored
objects and load them from the database directly into pro-
gramming objects. In the future MPI-DB will provide a set of
services to name objects and list user-defined objects which
are already stored in the database.
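Purely as an illustration of how descriptor-based retrieval might look to the application programmer, consider the commented fragment below; LoadArray and the descriptor-passing convention are hypothetical and are not taken from the current MPI-DB interface.

// Hypothetical retrieval of a previously stored array by its descriptor
// (all names below are illustrative assumptions):
//
// mpidb::Connection c;
// c.ConnectToServer(ServerAddress);
//
// int descriptor = ...;                      // returned by an earlier store
// mpidb::Domain subdomain(0, 127, 0, 127, 0, 63);
// mpidb::Array a(subdomain, mpidb::DOUBLE_PRECISION);
// c.LoadArray(descriptor, a);                // fill a directly from the database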
C. MPI-DB Software architecture
MPI-DB software is built as a layered structure, analogous
to multi-layer protocols used in computer network commu-
nications. Such a design is flexible and extensible. Figure 3
illustrates the software architecture.
Fig. 3. Layered software architecture of MPI-DB. The Data Transport Layer encapsulates physical communication. The server of the Database Access Layer acts as a client of the database server.
The Data Transport Layer is the lowest layer of the
MPI-DB layer hierarchy. It provides the basic functionality for
establishing and managing the connection between clients and
servers over a fast network. This design encapsulates packet
transmission in the Data Transport Layer. Indeed, we have
found it necessary to create two independent implementations
of the Transport Layer: one using UDT sockets and the other
using the MPI-2 standard (see section IV-D). The MPI protocol
is a de facto standard in scientific computing. MPI installations
are available for a wide range of operating systems and
computer networks, and in many instances benchmarking tests
have shown MPI to be the fastest protocol for data transfer
[32].
The Database Access Layer provides basic functionality
to remotely execute queries and access database tables over
the network. It provides the Data Object Layer with a narrow
set of abstract operations needed to manipulate MPI-DB
programming objects in the database. This layer encapsulates
all SQL queries and includes drivers for major databases, such
as SQL Server, MySQL and PostgreSQL.
The Data Object Layer contains the description of
the user-defined programming objects that are stored in the
database, including their physical storage layout, and provides
access and manipulation methods for these objects. User-
defined objects are serialized by the client, sent to the server
and unserialized by the server, to be subsequently stored in
the database. A hierarchical description of the physical storage
layout lists the servers, the server-attached databases and the
storage partitions holding the data associated with each object.
Data access methods implement the mapping between the user-defined run-time partition of the object among multiple processors and the object's hierarchical database storage layout.
The System Management Layer maintains a resource
map, describing all the resources (storage and servers) avail-
able in the global database system. It includes a caching
system for grouping applications' I/O requests and a scheduler
assigning the I/O requests to database servers. This layer is
also planned to handle administration functions, managing all
user-related information, including managing user logins and
monitoring user connections.
D. Implementation
MPI-DB is being developed as object-oriented software in C++ and will be made available under the new BSD
open-source software license. It requires a working imple-
mentation of the MPI standard, including MPI-2 function-
ality, most importantly functions for client-server interaction
(MPI_Open_Port, etc.) and dynamic process management
(MPI_Comm_spawn). In our experience this requirement
proved to be difficult to satisfy: Presently, MPI-DB works only
with the Intel MPI software library. Many MPI implementations contain run-time bugs, do not provide adequate hardware drivers, or do not support the MPI-2 standard at all. Process intercommunication depends on accessing MPI ports using an address string whose format is implementation-dependent. This complicates implementation of client-server software when the client and server are built using different implementations of
MPI.
In view of these difficulties we re-implemented the Trans-
port Layer of MPI-DB using UDT sockets. While the client
side of the library currently uses MPI, the server side relies
on MPI only to spawn server processes. Future versions of
the library will use different services to manage processes and
will be independent of MPI.
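A sketch of how such a pluggable transport can be organized is shown below: the layers above see only an abstract connect/send/receive interface, and the UDT-based and MPI-based mechanisms are interchangeable implementations behind it. The class and method names are illustrative assumptions, not the actual MPI-DB classes.

#include <cstddef>
#include <memory>
#include <string>

// Abstract transport: the layers above it see only connect/send/receive,
// so the physical mechanism (UDT sockets, MPI-2 ports, ...) can be swapped.
class Transport {
public:
    virtual ~Transport() = default;
    virtual bool connect(const std::string& address) = 0;
    virtual std::size_t send(const void* buffer, std::size_t bytes) = 0;
    virtual std::size_t receive(void* buffer, std::size_t bytes) = 0;
};

class UdtTransport : public Transport {
    // ... would wrap the UDT socket API ...
public:
    bool connect(const std::string& address) override;
    std::size_t send(const void* buffer, std::size_t bytes) override;
    std::size_t receive(void* buffer, std::size_t bytes) override;
};

class MpiTransport : public Transport {
    // ... would wrap MPI_Open_port / MPI_Comm_connect and point-to-point calls ...
public:
    bool connect(const std::string& address) override;
    std::size_t send(const void* buffer, std::size_t bytes) override;
    std::size_t receive(void* buffer, std::size_t bytes) override;
};

// The layer above selects an implementation at start-up; no other code changes.
std::unique_ptr<Transport> makeTransport(bool useUdt) {
    if (useUdt) return std::make_unique<UdtTransport>();
    return std::make_unique<MpiTransport>();
}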
We have tested the throughput of data ingest in a pro-
totype MPI-DB installation. Both the client and the server
ran on Linux nodes having eight physical cores each. The
client and the server were linked by a 10 Gigabit Ethernet connection with a MySQL database running on the server. The client application generated a three-dimensional array of data which was partitioned among p parallel processes with p = 1, 2, 4, 8, 16, 32 and 64. For each client process MPI-DB allocated a dedicated server process, which opened a
connection to MySQL, received data from its assigned client process and ingested the data into the MySQL database. The data ingestion process was thus performed in p parallel threads over the Ethernet connection. The results of two test run sequences are shown in Figure 4.
Fig. 4. Throughput test of MPI-DB in parallel data ingestion over a 10 Gigabit/sec Ethernet connection, with both the client and the server having eight physical CPU cores each. Client data is sent to the server and then ingested into a MySQL database. MPI-DB allocated a dedicated server process for each client process. The aggregate throughput (in Megabytes per second) is recorded as a function of the number of parallel ingestion processes. Maximum throughput between 600 and 700 Megabytes per second is realized when the number of ingestion processes equals the number of physical cores.
Peak performance between 600 and 700 Megabytes per second was achieved when the number of parallel processes was the same as the number of available physical cores (eight).
Following successful tests of MPI-DB, it is now being
used to ingest the results of the simulation of a turbulent
channel flow directly from the HPC cluster at the JHU Physics
department into the database.
V. CONCLUSION
We have introduced the concept of the data-intensive com-
puter and have proposed a novel operating system architecture
to support data-intensive computations with Petascale-sized
data sets. Section III outlines a research and development
program for building a data-object-oriented operating system
with advanced compiler technology to translate user applica-
tion programs into code that runs on the HPC cluster and
code that runs in the database. The proposed operating system
is non-local: it supports large-scale collaborative computations
where user applications can be translated into code that runs
on a remote computer and code that runs in the database.
Our discussion touches only a few topics related to the
design of the data-intensive computer, and necessarily omits
many important aspects. The construction of the data-intensive
computer is an ongoing research project. A prototype of the data-intensive operating system has been implemented as a
software library, MPI-DB, and is currently being used in
production by the Turbulence research group at JHU.
REFERENCES
[1] http://www.sdss.org/.
[2] V. Springel, S. White, A. Jenkins, C. Frenk, N. Yoshida, L. Gao, J. Navarro, R. Thacker, D. Croton, J. Helly, J. Peacock, S. Cole, P. Thomas, H. Couchman, A. Evrard, J. Colberg, and F. Pearce, "Simulations of the formation, evolution and clustering of galaxies and quasars," Nature, vol. 435, pp. 629-636, 2005.
[3] G. Lemson and the Virgo Consortium, "Halo and galaxy formation histories from the millennium simulation: Public release of a VO-oriented database," 2006.
[4] M. Stonebraker, P. Brown, A. Poliakov, and S. Raman, "The architecture of SciDB," in Scientific and Statistical Database Management, ser. Lecture Notes in Computer Science, J. Bayard Cushing, J. French, and S. Bowers, Eds. Springer Berlin / Heidelberg, 2011, vol. 6809, pp. 1-16.
[5] http://www.mpa-garching.mpg.de/millennium/.
[6] M. C. Neyrinck, I. Szapudi, and A. S. Szalay, "Rejuvenating the Matter Power Spectrum: Restoring Information with a Logarithmic Density Mapping," Astrophysics J. Letters, vol. 698, pp. L90-L93, Jun. 2009.
[7] E. Perlman, R. Burns, Y. Li, and C. Meneveau, "Data exploration of turbulence simulations using a database cluster," in Proceedings of the Supercomputing Conference (SC07), 2007.
[8] Y. Li, E. Perlman, M. Wan, Y. Yang, R. Burns, C. Meneveau, S. Chen, A. Szalay, and G. Eyink, "A public turbulence database cluster and applications to study Lagrangian evolution of velocity increments in turbulence," J. Turbulence, vol. 9, no. 31, 2008.
[9] http://turbulence.pha.jhu.edu/.
[10] C. Blakeley, N. Cunningham, B. Ellis, Rathakrishnan, and M. C. Wu, "Distributed/heterogeneous query processing in Microsoft SQL Server," in 21st Int. Conf. on Data Engineering (ICDE05), 2005, pp. 1001-1012.
[11] H. Yu and C. Meneveau, "Lagrangian refined Kolmogorov similarity hypothesis for gradient time evolution and correlation in turbulent flows," Phys. Rev. Lett., vol. 104, no. 8, p. 084502, Feb 2010.
[12] P. K. Yeung and S. B. Pope, "An algorithm for tracking fluid particles in numerical simulations of homogeneous turbulence," J. Comput. Phys., vol. 79, no. 2, pp. 373-416, 1988.
[13] L. X and J. Katz, "Measurement of pressure-rate-of-strain, pressure diffusion and velocity-pressure-gradient tensors around an open cavity trailing corner," Bull. Am. Phys. Soc., vol. 53, no. 15, 2008.
[14] Snyder, personal communication, 2009.
[15] Leonard, personal communication, 2009.
[16] G. L. Eyink, "Stochastic flux freezing and magnetic dynamo," 2011.
[17] C. Meneveau, "A web-services accessible turbulence database of isotropic turbulence: lessons learned," in Progress in Wall Turbulence: Understanding and Modeling (M. Stanislas, ed.), held on 21-23 April in Lille, France, 2009.
[18] E. Givelberg and J. Bunn, "A comprehensive three-dimensional model of the cochlea," J. Comp. Phys., vol. 191, no. 2, pp. 377-391, 2003.
[19] http://pcbunn.cacr.caltech.edu/cochlea/.
[20] E. Givelberg and K. Yelick, "Distributed immersed boundary simulation in Titanium," SIAM J. on Scientific Computing, vol. 28, no. 4, pp. 1367-1378, Jul. 2006.
[21] N. Kasthuri, K. Hayworth, J. Tapia, R. Schalek, S. Nundy, and J. Lichtman, "The brain on tape: Imaging an ultra-thin section library (UTSL)," Society for Neuroscience Abstracts, 2009.
[22] D. D. Bock, W.-C. A. Lee, A. M. Kerlin, M. L. Andermann, G. Hood, A. W. Wetzel, S. Yurgenson, E. R. Soucy, H. S. Kim, and R. C. Reid, "Network anatomy and in vivo physiology of visual cortical neurons," Nature, no. 471, pp. 177-182, March 2011.
[23] S. Saalfeld, A. Cardona, V. Hartenstein, and P. Tomancak, "CATMAID: collaborative annotation toolkit for massive amounts of image data," Bioinformatics, vol. 25, no. 15, pp. 1984-1986, 2009.
[24] G. Memik, M. T. Kandemir, W.-K. Liao, and A. Choudhary, "Multicollective I/O: A technique for exploiting inter-file access patterns," ACM Transactions on Storage, vol. 2, no. 3, Aug. 2006.
[25] A. S. Szalay, P. Z. Kunszt, A. Thakar, J. Gray, D. Slutz, and R. J. Brunner, "Designing and mining multi-terabyte astronomy archives: the Sloan Digital Sky Survey," in SIGMOD '00: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM, 2000, pp. 451-462.
[26] K. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. Hilfinger, S. Graham, D. Gay, P. Colella, and A. Aiken, "Titanium: A high-performance Java dialect," Concurrency: Practice and Experience, vol. 10, no. 11-13, September-November 1998.
[27] L. V. Kale and G. Zheng, "Charm++ and AMPI: Adaptive runtime strategies via migratable objects," M. Parashar, Ed. Wiley-Interscience, 2009, pp. 265-282.
[28] K. Asanovic, R. Bodik, B. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, S. Williams, and K. Yelick, "The landscape of parallel computing research: A view from Berkeley," Electrical Engineering and Computer Sciences, University of California at Berkeley, Tech. Rep. UCB/EECS-2006-183, December 2006.
[29] Y. Gu and R. L. Grossman, "UDT: UDP-based data transfer for high-speed wide area networks," Comput. Netw., vol. 51, pp. 1777-1799, May 2007. [Online]. Available: http://dl.acm.org/citation.cfm?id=1229189.1229240
[30] L. Dobos, I. Csabai, M. Milovanovic, T. Budavari, A. Szalay, M. Tintor, J. Blakeley, A. Jovanovic, and D. Tomic, "Array requirements for scientific applications and an implementation for Microsoft SQL Server," in Proc. of the EDBT/ICDT 2011 Workshop on Array Databases, Uppsala, Sweden (ed.: P. Baumann), 2011.
[31] T. Budavari, A. Szalay, and G. Fekete, "Searchable sky coverage of astronomical observations: Footprints and exposures," submitted, 2010.
[32] Benchmarks provided by InfiniBand vendors.