
    An Architecture for a Data-Intensive Computer

Edward Givelberg
Department of Physics and Astronomy

    IDIES

    The Johns Hopkins University

    Email: [email protected]

Alexander Szalay
Department of Physics and Astronomy

    Computer Science Department

    IDIES

    The Johns Hopkins University

Kalin Kanov and Randal Burns
Computer Science Department

    IDIES

    The Johns Hopkins University

Abstract: Scientific instruments, as well as simulations, generate increasingly large datasets, changing the way we do science. We propose that the processing of Petascale-sized datasets be carried out on a data-intensive computer, a system consisting of an HPC cluster, a massively parallel database and an intermediate operating system layer. The operating system will run on dedicated servers and will exploit massive parallelism in the database, as well as numerous optimization strategies, to deliver high-throughput, balanced and regular data flow for I/O operations between the HPC cluster and the database. The programming model of sequential file storage is not appropriate for data-intensive computations, so we propose a data-object-oriented operating system, where support for high-level data objects, such as multi-dimensional arrays, is built in. User application programs will be compiled into code that is executed both on the HPC cluster and inside the database. The data-intensive operating system is, however, non-local, so that user applications running on a remote PC will be compiled into code executing both on the PC and inside the database. This model supports the collaborative environment in which a large data set is typically created and processed by a large group of users.

We have implemented a software library, MPI-DB, which is a prototype of the data-intensive operating system. It is currently being used to ingest the output of a simulation of turbulent channel flow into the database.

    I. INTRODUCTION

    The traditional process of scientific discovery consists of

    systematic observation, experimentation, measurement and

    data collection, leading to the creation of a theory that

    explains past observations and predicts the results of future

    experiments. In virtually every field of science technological

    progress has led to the construction of high throughput mea-

    surement instruments (telescopes, high-energy particle accel-

    erators, gene sequencing machines, etc.), that generate very

    large data sets from observation of complex physical systems.

    On the other hand, theoretical description of complex physical

    phenomena starts with a set of basic laws (typically expressed

as partial differential equations), whose consequences are investigated with the help of simulation experiments using

    computational models. The volume of data produced by com-

    puter simulations has been increasing even more rapidly than

    the size of empirical measurement data sets.

    In astrophysics, for example, database technology has been

    developed to host the results of the Sloan Digital Sky Survey,

    and to expose the data to scientists through the SDSS Sky-

    Server [1]. Following the example of the SDSS SkyServer,

    the Millennium simulation [2] created a remotely accessible

    database with a collaborative environment [3], which drew

hundreds, if not thousands, of astronomers into analyzing

    simulations.

    The need to process very large data sets is changing the

    way we do science. The ability to perform computations with

    very large data sets is necessary for mining both experimental

    and simulation output data sets, as well as for developing

    new methodologies for comparing empirical observations with

theory.

In this paper we propose that computations with Petascale-

    sized data sets should be carried out on a data-intensive

    computer, a system consisting of an HPC cluster, a massively

    parallel database and an intermediate operating system layer.

    We present a design for the operating system and discuss

    novel features distinguishing the data-intensive computer from

    the traditional computer. Our approach to data-intensive com-

    puting is substantially different from the database-centric

    approach of SciDB [4].

The NSF has recently awarded our group funding to build a 5PB

    cluster for extreme data-intensive computations. The Data-

    Scope will be co-located and integrated with a BeoWulf cluster

of about 4000 CPU cores at the Department of Physics and Astronomy. The system's components are currently being evaluated

    and tested, and deployment is expected to happen in October

    2011. The Data-Scope will consist of 90 performance and 12

    storage servers. The total disk capacity will exceed 5PB, with

    3PB in the storage and 2.2PB in the performance layer. The

    peak aggregate sequential IO performance is projected to be

    460GB/s.

    The driving goal behind the Data-Scope design is to max-

    imize stream processing throughput over 100TB-size datasets

    while using commodity components to keep acquisition and

    maintenance costs low. Accessing the data in a massively

    parallel fashion from the cluster via locally attached disks and

SSDs is significantly faster than serving the data from shared network file servers to multiple compute servers.

    We have developed a software library called MPI-DB

    which is a prototype of the operating system for the data-

    intensive computer. MPI-DB will enable us to establish peer-

    to-peer connections between nodes on the HPC cluster and

    the Data-Scope I/O nodes both for the on-the-fly ingest of

    data generated by an MPI application, and for the parallel

    compute-intensive analysis of large data sets read from the

    parallel database. MPI-DB has been successfully tested and is


    presently being used to ingest the output of the simulation of

    a turbulent channel flow directly into the database.

    The remainder of the paper is organized as follows. In

section II we survey several research projects involving data-

    intensive computing. An examination of computing require-

    ments for these projects leads us to propose the concept of

    the data-intensive computer in section III, where we discuss

    design requirements for the computer and its operating system.

    A prototype of this operating system, MPI-DB, is described in

section IV. We present the user view of the software library;

    then describe its software architecture and discuss implemen-

tation issues. We summarize our conclusions in section V.

    II. EXAMPLES OF DATA-INTENSIVE SCIENTIFIC RESEARCH

    In this section we examine several examples involving data-

    intensive computations from astrophysics, turbulence, neu-

    roscience and hearing research. The examples, taken from

    research activities of our group members, illustrate the demand

    for data-intensive computations, the present technological lim-

    itations, and guide us in the design of the operating system

    for the data-intensive computer.

    A. The large-scale structure of the Universe

    Contemporary research in astrophysics has deep and im-

    portant connections to particle physics. Observations of large

structures in the universe led physicists to the discovery of

dark matter and dark energy, and understanding these

    new forms of matter will change our view of the universe

    on all scales, including the particle scale and the human

    scale. Theoretical developments in astrophysics must be tested

    against vast amounts of data collected by instruments, such as

    the Hubble Space Telescope, as well as against the results

of supercomputer simulation experiments, like the Millennium

Run [5]. These data sets are available in public databases, and are being mined by scientists to gain intuition and to

    make new discoveries, but the researchers are limited by the

    technological means available to access the data. In order

    to analyze astrophysical data researchers write scripts that

    perform database queries, transfer the resulting data sets to

    their local computers and store them as flat files.

    Such limited access has already produced important dis-

coveries. For example, a new log-power density

spectrum was recently discovered by such analysis of the data in

the Millennium Run database [6]. This is the most efficient

quantitative description of the distribution of the density of

matter in the Universe obtained so far.

    B. Turbulence research

    A new database approach to scientific computing in fluid

    dynamics was developed by researchers at the JHU Institute

    of Data-Intensive Engineering and Science (IDIES), where the

    entire time-history of a simulation is stored in a database using

relational database technologies [7], [8], and is publicly acces-

    sible to scientific analysis. A pseudospectral direct numerical

simulation of forced isotropic turbulence on a 1024³-point Eulerian computational grid with periodic boundary conditions

    was computed, and 1024 time-steps are stored, covering a full

    large-eddy turnover time of model evolution. The resulting

    27 Tbyte data set is publicly available in the JHU Public

    Turbulence Database [9]. Velocity, pressure, velocity gradient

    tensor and other quantities can be obtained directly from the

    query web page of the Turbulence Database.

    The technology enabling scientists to execute such queries

    includes a mediator (DataBase Access Server), which acts as

    a gateway for users and applications, and takes individual

    requests and routes them to a node that has the data needed to

    service that request. Users and applications discover the inter-

    face through the Web-Service Definition Language (WSDL).

    Requests and results are transmitted between the mediator

    and its clients through the Simple Object Access Protocol

    (SOAP). Analysis tools within the database are implemented

    in user-defined functions (UDFs) in SQL Server 2005 using

    the common-language runtime (CLR) [10].

    To service data calls from C and Fortran codes, a wrapper

    interface to the gSOAP library is provided, which is linked

    at compile time to the user program. While executing, the

program makes subroutine-like calls that request desired parts of the data from the database over the internet. The request

    may trigger execution of pre-defined analysis routines in the

    database, e.g. interpolation, differentiation, etc.
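As a rough illustration of this usage pattern (the function name, its arguments and the token below are assumptions made for this sketch, not the wrapper's documented interface), such a subroutine-like request might look as follows from a C++ analysis code:

#include <cstdio>

// Sketch of a "subroutine-like" call requesting velocity data from the
// turbulence database. The function name, arguments and token are
// illustrative assumptions; in the real wrapper the call would be turned
// into a SOAP request, and the database may interpolate before replying.
// A stand-in body is provided here so the sketch compiles on its own.
int getVelocity(const char *token, const char *dataset, float time,
                int npoints, const float points[][3], float result[][3])
{
    (void)token; (void)dataset; (void)time;
    for (int i = 0; i < npoints; ++i)
        for (int c = 0; c < 3; ++c) result[i][c] = 0.0f;  // stub: zero velocity
    return 0;
}

int main()
{
    const int N = 2;
    float points[N][3] = {{0.1f, 0.2f, 0.3f}, {1.0f, 1.5f, 2.0f}};
    float u[N][3];

    // one call fetches interpolated velocities at the requested points and time
    if (getVelocity("example-token", "isotropic-turbulence", 0.364f, N, points, u) != 0) {
        std::fprintf(stderr, "database request failed\n");
        return 1;
    }
    std::printf("u(0) = (%g, %g, %g)\n", u[0][0], u[0][1], u[0][2]);
    return 0;
}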

    An example of the transformative impact of the new ap-

    proach on scientific research is the recent study of H. Yu

and C. Meneveau [11], where new evidence for Kolmogorov's

    Refined Similarity Hypothesis has been found by following the

    time evolution of velocity and pressure gradients in isotropic

    turbulence and quantifying their autocorrelation functions and

    decorrelation time scales.

    Data analysis was performed on a workstation, accessing

    the database using web service tools. Fluid particles were

tracked using the second-order Runge-Kutta particle tracking algorithm [12]. Lagrangian information, such as fluid velocity

    and pressure gradients, along the fluid particle trajectories was

    extracted and used in analysis. The required velocities were

interpolated using 8th-order Lagrange polynomials in space

    and piecewise-cubic Hermite polynomials in time. Averages

    were performed over more than 800 million pairs of stochastic

    trajectories, consuming a couple of months of wall-clock

    time. These operations were implemented in predefined func-

    tions in the database [8].
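For concreteness, one second-order Runge-Kutta tracking step has the following form; this is a generic sketch in which the velocity lookup is a caller-supplied callback standing in for the database request with its 8th-order spatial and piecewise-cubic temporal interpolation:

#include <array>
#include <functional>

using Vec3 = std::array<double, 3>;

// One second-order Runge-Kutta (predictor-corrector) step for a fluid
// particle advected by the velocity field u(x, t). In the database setting
// the `velocity` callback would be a remote, interpolated lookup.
Vec3 rk2_step(const Vec3 &x, double t, double dt,
              const std::function<Vec3(const Vec3 &, double)> &velocity)
{
    Vec3 u1 = velocity(x, t);

    // predictor: advance with the velocity at the current position and time
    Vec3 xp;
    for (int i = 0; i < 3; ++i) xp[i] = x[i] + dt * u1[i];

    // corrector: average the velocities at the two ends of the step
    Vec3 u2 = velocity(xp, t + dt);
    Vec3 xn;
    for (int i = 0; i < 3; ++i) xn[i] = x[i] + 0.5 * dt * (u1[i] + u2[i]);
    return xn;
}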

    The JHU turbulence database has been providing services

    to a growing community of users both from JHU and from

    outside. So far, it has been accessed by over 160 separate

IPs. There are presently over 2 × 10⁴ individual web queries per day, each requesting data on an average of 250 points, so

that every day over 5 × 10⁶ points are queried.

The database has been used to evaluate key assumptions

in a new experimental pressure measurement technique [13],

    bubble dispersion statistics in turbulence [14], evolution of

    material surfaces and curvature statistics in turbulence [15],

    etc. In the JHU turbulence group, the data have been exploited

    to study Lagrangian models of intermittency [8], small-scale

turbulent magnetic dynamo effect [16], subgrid-scale models


    [17], and statistics of rotation and strain-rates along fluid

particle trajectories [11].

As a consequence of Kolmogorov's Refined Similarity

    Hypothesis the viscous cut-off length must be regarded as a

    local fluctuating quantity, and it is expected that many high

    Reynolds number simulations substantially underresolve some

    regions in space. The Turbulence group at JHU is planning to

    carry out database-driven simulations, where a nested grid

    of very fine resolution is positioned in the vicinity of intense

    small-scale structures of interest and co-moving with the mean

    velocity at that position, in order to locally refine the archived

    data.

Such a simulation has never been carried out before, and it

    requires boundary conditions on the surface of the nested, co-

    moving grid, which may be extracted from the time-history

    stored in the database, interpolated to finer resolution. There is

    presently no technology to provide such a capability. MPI-DB

    will provide this new capability by employing a smart combi-

    nation of caching, scheduling and parallelization techniques.

    We believe this capability is extremely valuable, not only for

turbulence, but in virtually every scientific simulation. To give just one example, consider environmental impact assessment

    of the spreading of a chemical agent in a building or an area.

    This can be carried out as a database-driven simulation, using

    a previously computed fluid-flow simulation.

    C. Computational modeling of the cochlea

The human cochlea is a remarkable, highly nonlinear trans-

    ducer that extracts vital information from sound pressure and

    converts it into neuronal impulses that are sent to the auditory

cortex. The cochlea's accuracy, amplitude range and frequency

range are orders of magnitude better than those of man-made transduc-

    ers. Understanding its function has tremendous medical and

engineering significance. The two most fundamental questions of cochlear research are to provide a mathematical description

    of the transform computed by the cochlea and to explain the

    biological mechanisms that compute this transform. Presently

    there is no adequate answer to either of these two questions.

    Signal processing in the cochlea is carried out by a collection

of coupled biological processes occurring on length scales

    measuring from one centimeter down to a fraction of a

    nanometer. A comprehensive model describing the coupling of

    the dynamics of the biological processes occurring on multiple

    scales is needed in order to achieve system level understanding

    of cochlear signal processing.

    A model of cochlear macro-mechanics was constructed in

1999–2002 by Givelberg and Bunn [18], who used supercomputers to generate very large data sets containing results

    of simulation experiments. These results were stored as flat

    files which were subsequently analyzed by the authors on

workstations using specially developed software. A set of web

    pages devoted to this research [19] is widely and frequently

accessed; however, the data was never exposed to the wide

    community for analysis since no tools to ingest simulation

    output into a database existed when the cochlea model was

    developed.

    This model remains the most comprehensive large-scale

    model that has been constructed to study cochlear mechanics,

    but it does not include the crucial dynamics of the smaller

    scales. The first author is presently developing a new multi-

scale method that will make it possible to extend the macro-mechanical

    cochlea model to include microscopic structures. The new

    cochlea model will run on a multi-processor cluster using

    a special algorithm for distributed immersed boundary com-

    putations [20]. We plan to use MPI-DB to ingest cochlea

    simulation results into the database and to expose these results

    for analysis by the wide hearing research community. Further-

    more, since the development of such a model is a product

    of many years of research, a new technological capability is

    needed to enable cochlea researchers outside JHU to use the

    model, to remotely carry out simulations, as well as to observe,

    monitor and analyze ongoing simulations.

    D. Neuroscience

    One of the biggest unanswered questions in neuroscience

    today is the organization of the brain at the level of the

neural micro-circuits that form the basis of neural computation.

Recently, due to technological and experimental advances in

    electron microscopy, it has become possible to investigate

    these networks through high-resolution imaging of the brain

    [21]. For example, Bock et al. [22] have recently imaged

a 350 × 450 × 50 cubic micron region of mouse cortex

with 4 nanometer lateral resolution, a sufficiently detailed

    dataset to resolve every synaptic connection in the field of

    view (indeed, even vesicles are readily apparent). It has been

    estimated that imaging the whole brain at this resolution would

    require multiple exabytes; a cubic millimeter occupies roughly

    1 petabyte. Ultimately, to fully exploit this data, it is desirable

    to assign a label to each voxel indicating its identity and the

structure to which it belongs.

Clearly, while even collecting this type of data is an

    enormous task, interpreting and analyzing the data is far

    more difficult. It is infeasible to annotate this volume of

    data manually, and probably impractical to assume that any

    one group will devise a perfect automated solution. The

    Open Connectome project is working to provide universal

    access to this type of data via web services hosted at

    http://openconnectomeproject.org. More specifically, one of us

    (R. Burns) is developing tools for both human (visualization)

    and computer (application programming interface, or API)

    access to the data. Granting global access will enable the

    largest possible community of image processing and machine

learning experts to investigate the data and develop algorithms to annotate it. Unlike standard crowdsourcing endeavors, the

    goal is to compile efforts from a variety of machine an-

    notators, as opposed to human annotators, an approach that

was dubbed "alg-sourcing" (for "algorithm outsourcing"). As

    different groups tackle different aspects of the problem with

    different approaches, the results will be aggregated and the

    collective output will be shared, building towards a long-term

    vision of a fully-annotated cortical volume.

    The project is being initialized with two datasets: (1) a 12


    TB dataset from Bock et al. described above, and (2) a >600GB dataset from Kasthuri and Lichtman (unpublished; spatial

    resolution: 3 x 3 x 29 cubic nanometers). Panning, zooming,

    and manual annotation are made possible via a web-based

    graphical user interface called CATMAID [23]. An API for

    two-dimensional analysis of the data, including downloading

    arbitrary image planes and uploading planar annotations to

the shared repository, is in progress. An additional server

    for three-dimensional representation of the data is being built,

    along with an API for downloading volumes and upload-

    ing volumetric annotations. Graphics processor unit (GPU)-

    enabled software will allow for visualizing arbitrary rotations

    of the data in three dimensions, overlaid with the annotations.

    All of the services are designed to scale up to petabytes and

    beyond, and all of the developed code will be released as open

    source.

    The Open Connectome Project is gearing up for massive

    polyscience, i.e. science collectively conducted by a large

    group of individuals. This marks a radical departure from the

    typical scientific workflow, in which raw data are kept local

until results are released, and will hopefully usher in a new era of understanding of the brain.

    III. THE DATA-INTENSIVE COMPUTER

    The operating system of a modern computer is designed

    to balance programmer productivity with implementation ef-

    ficiency. High-level programming languages hide the com-

puter's memory hierarchy and system architecture, while the

    operating system provides highly optimized services for all

    application developers. The only means of permanently storing

    data is by writing it in a file, and the abstract programming

    model of sequential file access is efficiently implemented in

    the operating system. The operating system typically does not

include services for handling high-level programming objects, such as arrays or graphs. When there is a need to store such

    objects for subsequent computation, the programmer must

    make use of the file system with serialization/unserialization

    of these objects.

    Scientific data sets are now approaching the Petascale,

    exceeding the capabilities of file systems, and are therefore

    stored in databases. They are not easily accessible to com-

    putation because performing I/O operations between an HPC

    system and a database is presently very difficult. There are

    no off-the-shelf solutions and a considerable effort is required

    on the part of domain scientists to incorporate special-purpose

    database access tools in the analysis code. Data access using

web services does not provide a scalable solution for many data-intensive applications. Furthermore, the resulting data

    flow throughput needs to be improved by orders of magnitude;

    even trivially parallelizable data processing tasks are presently

    very difficult.

    In order to satisfy the increasing demand for computations

    with very large data sets it is necessary to build an operating

    system that will exploit the massive parallelism in the database

    system to efficiently carry out data storage and retrieval

    between the database and a multiprocessor computing system.

We therefore define the data-intensive computer as a system

consisting of an HPC cluster, a massively parallel database

and an intermediate operating system layer (see Figure 1).

    The operating system enables direct I/O operations between

    HPC processor memory and the database, making the database

    transparent to the programmer and effectively turning it into a

    layer in the memory hierarchy of the data-intensive computer.

Fig. 1. The architecture of the data-intensive computer: the HPC cluster and the database are connected to operating system servers by a high-bandwidth network. Remote clients obtain operating system services over the internet.

    The operating system also provides services to remote

    computers (see section III-E). The data-intensive computer

    differs from the traditional computer in a number of important

    aspects which we discuss below.

A. Direct I/O between memory and database

The main challenges in the design of the data-intensive

operating system are to guarantee the quality of service in the

performance of the data flow and to ensure scalability

    and efficient parallel scheduling and resource management.

We plan for the operating system to run on a set of dedicated

    servers, with the user HPC applications acting as clients for

    the operating system processes. A major goal of the operating

    system is to transform application burst I/O into uniform,

    balanced traffic in the database.

    Since execution of database queries is significantly slower

    than the transmission of the results over the high bandwidth

    network, it is advantageous, whenever possible, to execute

queries on multiple database servers in parallel. The operating system will act as a distributed scheduler for the database

    (see Figure 2): each dedicated operating system server pro-

    cess allocates multiple database server connections for data-

    intensive applications, and fewer database server connections

    for applications with lower data requirements. This design

    is scalable and is aimed at minimizing application I/O by

    employing smart heuristic scheduling algorithms.
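A minimal sketch of the kind of heuristic intended here (the structure and the proportional-share rule are assumptions for illustration, not the system's actual policy) is the following allocator, which hands out a fixed pool of database connections roughly in proportion to each application's declared I/O rate:

#include <algorithm>
#include <vector>

// Illustrative heuristic only: divide a pool of database server connections
// among client applications in proportion to their declared I/O rates,
// giving every application at least one connection. A real scheduler would
// also enforce the pool limit and react to measured, not declared, load.
struct AppRequest {
    int id;
    double io_rate;       // e.g. requested MB/s reported by the client
    int connections = 0;  // filled in by the scheduler
};

void allocate_connections(std::vector<AppRequest> &apps, int pool_size)
{
    double total = 0.0;
    for (const auto &a : apps) total += a.io_rate;

    int remaining = pool_size;
    for (auto &a : apps) {
        int share = (total > 0.0)
            ? static_cast<int>(pool_size * (a.io_rate / total))
            : pool_size / static_cast<int>(apps.size());
        a.connections = std::max(1, share);   // proportional share, at least one
        remaining -= a.connections;
    }

    // hand any leftover connections to the most data-intensive application
    if (remaining > 0 && !apps.empty()) {
        auto heaviest = std::max_element(apps.begin(), apps.end(),
            [](const AppRequest &x, const AppRequest &y) { return x.io_rate < y.io_rate; });
        heaviest->connections += remaining;
    }
}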

Fig. 2. Scalable design of I/O operations in the data-intensive operating system: the operating system acts as a scheduler for the database, allocating multiple database server connections to each process of a data-intensive application, and few connections to an application with lower data requirements.

When a large number of applications are accessing the same

data set, such as for example the user analysis applications

    accessing the JHU Public Turbulence database, significant

    efficiencies may be realized by grouping the I/O requests

    of different applications together. The operating system will

    maintain its own storage for caching I/O requests and will op-

timize database access based on applications' access patterns,

    as well as across applications (see for example Multicollective

    I/O [24]). The operating system layer will incorporate efficient

    management of available resources, and will grow or shrink

    on demand.

    B. Moving the program to the data

A major goal in the design of the operating system of the data-intensive computer is to enable applications with an

    arbitrary mix of I/O and computation. In many instances it is

    advantageous to carry out computations with large data objects

in the database. The "move the program to the data" approach

of Szalay et al. [25] has been a fundamental tenet in the

    design of large-scale scientific databases. We have seen that

    in the JHU Turbulence database (see section II) data requests

    may trigger execution of pre-defined routines in the database.

    A serious limitation of this approach is that such routines must

    be pre-programmed in the database.

We propose to extend the "move the program to the data"

    approach by automatically generating the code that will be

executed in the database. An HPC application will be compiled into code that will execute on the HPC cluster, as well as

    code for computations with operating-system-supported data

    objects that will execute in the database. The operating system

    will carry out moving the program to the data. Compiler-

    generated code for large data object computations will be sent

    from the HPC cluster to the database using the operating sys-

    tems client-server communications. The user HPC application

    will be linked against the operating system client software. At

    run time, it will execute code on the HPC cluster, call system

    services that will execute in the operating system layer and

    execute the application-generated code in the database.
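Purely as a hypothetical illustration of the intended programming model (the array class below is a stand-in invented for this sketch; it is not an interface defined by MPI-DB or by the proposed compiler), a user routine could be written as ordinary C++ over a database-resident array, with the loop body being the part the compiler would extract and execute inside the database:

#include <cstddef>
#include <vector>

// Stand-in for an operating-system-supported, database-resident array of
// 3-component vectors; invented here for illustration only.
class VectorField {
public:
    explicit VectorField(std::size_t n) : data_(3 * n), n_(n) {}
    std::size_t NumberOfPoints() const { return n_; }
    double &operator()(std::size_t i, int c) { return data_[3 * i + c]; }
    double operator()(std::size_t i, int c) const { return data_[3 * i + c]; }
private:
    std::vector<double> data_;
    std::size_t n_;
};

// The user writes an ordinary reduction over the field. Under the proposed
// model the compiler would turn this loop into code that runs inside the
// database, next to the stored array, returning only the scalar result to
// the HPC application.
double kinetic_energy(const VectorField &u)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < u.NumberOfPoints(); ++i) {
        const double ux = u(i, 0), uy = u(i, 1), uz = u(i, 2);
        sum += 0.5 * (ux * ux + uy * uy + uz * uz);
    }
    return sum;
}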

    User applications will be developed in a high-level pro-

    gramming language (Fortran, C, C++, etc.) that includes

    mechanisms for concurrency control (e.g. MPI), allowing easy

    porting of legacy applications to the data-intensive computer.

    A specially designed language, such as Titanium [26] or

    Charm++ [27], which has a built-in mechanism for concur-

    rency control, can also be used for application development.

    Furthermore, we believe that designing a special-purpose lan-

    guage for processing large data sets will improve programmer

    productivity.

    C. Data-object-oriented operating system

    The abstract programming model of sequential file access

    is not appropriate to represent the storage layout of large data

    objects in the distributed database; nor is it convenient for

    the applications programmer. Instead, the operating system

    must provide support for storing and manipulating abstract

    objects such as arrays, graphs, sparse arrays, etc. Implementing

    system-level support for a particular data structure is non-trivial: a distributed database layout and optimized I/O man-

    agement must be planned for each data object. Nevertheless,

    the operating system must provide a set of services which

    simplify development and execution of application programs.

    These can be made available to the application programmer

    through the use of advanced compilers.

    Our prototype operating system, MPI-DB, provides support

    for multi-dimensional arrays. (See section IV-B.) The survey

    in section II readily suggests that sparse arrays and graphs

    are among the data structures that must be supported in the

    data-intensive operating system. Indeed, it is estimated that the

    number of synaptic connections in the human brain is on the

order of 10¹⁵. Operating system support for additional data

structures will depend on applications, such as the Berkeley

    seven (or thirteen) dwarfs [28].

    D. Operating system support for distributed data objects

    Virtually every parallel computation involves a partition of

    a large data object among several processors with the aim

    of reducing the total wall-clock time of the computation. A

    typical example is the partitioning of a multi-dimensional array

    (see section IV-B). This Map/Reduce methodology breaks up

    data objects, creating distributed data objects which exist only

    during computation. At the end of the computation a single

    coherent data object must be assembled from the distributed

data object. We believe that support for distributed data objects must be provided both in programming languages (e.g. the

    single objects in Titanium [26]) and in the operating system

    of the data-intensive computer.

    While the data object stored in the database is logically

    single, its storage layout is distributed among database servers.

    In the process of reducing a run-time distributed data object to

    a logically single object stored in the database, the operating

system will generate a physical mapping of the object's storage

    layout in the global database system. This mapping will


    identify the database servers, the server-attached databases and

    the storage partitions that hold the data representing the object,

    and will determine methods for access and modification of the

    object.
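A minimal sketch of what such a hierarchical mapping could record (the field names and three-dimensional extents below are assumptions for illustration, not MPI-DB's actual metadata schema):

#include <string>
#include <vector>

// Illustrative storage-layout descriptor for a logically single array whose
// pieces are distributed over the database; invented for this sketch.
struct Extent {          // axis-aligned index box [lo, hi] in each dimension
    int lo[3];
    int hi[3];
};

struct PartitionEntry {
    std::string db_server;   // database server holding this piece
    std::string database;    // server-attached database
    std::string partition;   // storage partition / table holding the data
    Extent extent;           // sub-box of the global array stored here
};

struct ObjectLayout {
    int object_id;                       // system-wide object descriptor
    Extent global_extent;                // extent of the whole logical array
    std::vector<PartitionEntry> pieces;  // physical placement of each piece

    // which pieces must be touched to access or modify a given sub-box
    std::vector<const PartitionEntry *> Covering(const Extent &box) const {
        std::vector<const PartitionEntry *> out;
        for (const auto &p : pieces) {
            bool overlaps = true;
            for (int d = 0; d < 3; ++d)
                overlaps = overlaps && p.extent.lo[d] <= box.hi[d]
                                    && box.lo[d] <= p.extent.hi[d];
            if (overlaps) out.push_back(&p);
        }
        return out;
    }
};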

    E. Collaborative, non-local operating system services

    A large data set is typically created and processed by a

    large group of possibly collaborating individuals, who execute

    a set of concurrent processes. While the main goal of the

    data-intensive operating system is to enable data-intensive

    computations, its services can be used by remote users.

    Remote computers obtain services from the data-intensive

    operating system in the same way the HPC applications do:

    an application running on a remote computer is compiled into

    code that executes on that computer, connects to the data-

    intensive operating system over the network and sends to the

    operating system code that is executed within the database.

    The main difference is in the network connection speed.

    Remote users with slow network connections may choose

    to download portions of data sets from the database to their

computers, perform extensive local computations and send results back to the database. Furthermore, the data-intensive

    operating system can be used as a software library installed

    on the remote computer and run in conjunction with a local

    database, enabling the user to store data objects imported from

    a remote database directly into the local database, and to

    process the data in the local database using the same program

    that was previously created for remote, possibly large-scale,

    data processing.

    Such exposure of data and data-processing services has

    the potential of an impact beyond the scientific community,

    involving the general public in both science education and sci-

    entific research. For example, the GalaxyZoo project, initiated

at IDIES, serving galaxy images for visual type classification to the general public, opened on July 12, 2007, and after 8 days

    of operation the JHU servers had 56 million web hits, and 8.5

million galaxies classified. The project was featured in outlets ranging from the

    BBC to the London Times, the Economist, the Washington

Post and the Christian Science Monitor, and was on the front page of

    Slashdot and Wikipedia. This unprecedented interest is a clear

    example that open access to science is in the interest of the

    public and is beneficial to the advancement of science.

IV. THE MPI-DB SOFTWARE LIBRARY

    The MPI-DB software library is a prototype of an operating

    system for a data-intensive computer. It provides database

    services to scientific computing processes and currently sup-ports SQL-Server and MySQL databases on Windows and

    Linux with C, C++ and Fortran language bindings. The library

    consists of two compatible software packages: the MPI-DB

    client and the MPI-DB server, and it requires a working MPI

    installation (including MPI-2 functionality) and UDT sockets

[29] for its client-server communications, and a database con-

    nected to the server. The MPI-DB server accepts connections

from clients at a known network address, and services clients'

    requests by querying the database and sending the results back

    to the clients. User client applications must be compiled and

linked against the MPI-DB client and use the library to store,

    retrieve and modify the data in the database.

    A. An introductory tutorial

    Consider a scientific application consisting of several paral-

    lel MPI processes, continuously generating output that needs

    to be stored. We show how MPI-DB can be easily used to

    store the output in a database. The user application is written

    in C++ with MPI. It is linked against the MPI-DB library

    and in our simple example there are two parallel processes at

    runtime, whose ranks are 0 and 1.

    The user interaction with MPI-DB starts by defining the data

    structures that will be stored in the database. In our example

    the two parallel MPI processes jointly perform a computation

using a single three-dimensional array of 128 × 128 × 128

double-precision floating point numbers. The array is divided

between the two processors, with processor 0 holding in its

local memory the [0 . . . 127] × [0 . . . 127] × [0 . . . 63] portion of the array and processor 1 holding the [0 . . . 127] × [0 . . . 127] × [64 . . . 127] part. Correspondingly, each process defines an mpidb::Domain object subdomain and an

    mpidb::Array object a.

// this user process has rank = MyID,
// which in our example is either 0 or 1
MPI_Comm_rank(MPI_COMM_WORLD, &MyID);

mpidb::Domain subdomain(0, 127, 0, 127,
                        64 * MyID, 64 * MyID + 63);
mpidb::Array a(subdomain, mpidb::DOUBLE_PRECISION);

// generate a stream of array data objects
mpidb::DataStream s(a);

// DataSet d is a single object, common to both processes;
// it will contain two data streams
mpidb::DataSet d;
d.AddStream(s);

    Our application will perform repeated computation of the data

    array, with each process periodically storing its portion of the

    data array in the database. Each process will therefore generate

    a stream of arrays. This is expressed in the definition of the

    mpidb::DataStream object s.

    Finally, the application defines the mpidb::DataSet

    object d, which in contrast to previously defined objects is

    a single (distributed) object common to both processes. After

    each process adds a data stream to this data set, it will contain

    exactly two streams.

    Having defined the data structures, each of the two

    MPI processes attempts to establish a connection with

    an MPI-DB server. This is achieved by defining an

    mpidb::Connection object c and executing on it the

    ConnectToServer method with a given server address.


mpidb::Connection c;
const char * ServerAddress = "128.220.233.155::52415";

if (!c.ConnectToServer(ServerAddress))
{
    // minimal completion of the truncated listing: report the failure and stop
    cerr << "failed to connect to MPI-DB server at " << ServerAddress << endl;
    return 1;
}
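Purely as an illustration of how the objects defined above might be used together during the simulation (the Open and Write methods are assumed names for this sketch, not part of the MPI-DB interface shown in this tutorial):

// hypothetical continuation, for illustration only: Open and Write are
// assumed method names, not the documented MPI-DB interface
double * buffer = new double[128 * 128 * 64]; // this process's sub-array

d.Open(c); // bind the distributed data set to the server connection (assumed)

for (int step = 0; step < NumberOfTimeSteps; ++step)
{
    ComputeTimeStep(buffer); // user simulation code (placeholder)

    // append this process's portion of the array to its data stream;
    // MPI-DB ships it to the server, which ingests it into the database
    s.Write(buffer);
}

delete [] buffer;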


    MPI-DB assigns unique integer descriptors to objects stored

    in the database, allowing the user to access previously stored

    objects and load them from the database directly into pro-

    gramming objects. In the future MPI-DB will provide a set of

    services to name objects and list user-defined objects which

    are already stored in the database.

    C. MPI-DB Software architecture

    MPI-DB software is built as a layered structure, analogous

    to multi-layer protocols used in computer network commu-

    nications. Such a design is flexible and extensible. Figure 3

    illustrates the software architecture.

Fig. 3. Layered software architecture of MPI-DB. The Data Transport Layer encapsulates physical communication. The server of the Database Access Layer acts as a client of the database server.

    The Data Transport Layer is the lowest layer of the

    MPI-DB layer hierarchy. It provides the basic functionality for

    establishing and managing the connection between clients and

    servers over a fast network. This design encapsulates packet

    transmission in the Data Transport Layer. Indeed, we have

    found it necessary to create two independent implementations

    of the Transport Layer: one using UDT sockets and the other

using the MPI-2 standard (see section IV-D). The MPI protocol

    is a de facto standard in scientific computing. MPI installations

    are available for a wide range of operating systems and

    computer networks, and in many instances benchmarking tests

    have shown MPI to be the fastest protocol for data transfer

[32].

The Database Access Layer provides basic functionality

    to remotely execute queries and access database tables over

    the network. It provides the Data Object Layer with a narrow

    set of abstract operations needed to manipulate MPI-DB

    programming objects in the database. This layer encapsulates

    all SQL queries and includes drivers for major databases, such

    as SQL Server, MySQL and PostgreSQL.

    The Data Object Layer contains the description of

    the user-defined programming objects that are stored in the

    database, including their physical storage layout, and provides

    access and manipulation methods for these objects. User-

    defined objects are serialized by the client, sent to the server

    and unserialized by the server, to be subsequently stored in

    the database. A hierarchical description of the physical storage

    layout lists the servers, the server-attached databases and the

    storage partitions holding the data associated with each object.

    Data access methods implement the mapping between user-

    defined run-time partition of the object among multiple pro-

cessors and the object's hierarchical database storage layout.
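As a rough sketch of the client-side serialization step (the wire format below, a fixed-size header followed by the raw values, is an assumption for illustration rather than MPI-DB's actual format):

#include <cstdint>
#include <cstring>
#include <vector>

// Illustrative client-side serialization of a three-dimensional sub-domain
// and its double-precision data into one contiguous buffer for the Data
// Transport Layer. The wire format is an assumption made for this sketch.
struct Domain3D {
    int32_t lo[3];
    int32_t hi[3];
    std::size_t NumberOfPoints() const {
        std::size_t n = 1;
        for (int d = 0; d < 3; ++d)
            n *= static_cast<std::size_t>(hi[d] - lo[d] + 1);
        return n;
    }
};

std::vector<char> Serialize(const Domain3D &dom, const double *data)
{
    const std::size_t n = dom.NumberOfPoints();
    std::vector<char> buf(sizeof(Domain3D) + n * sizeof(double));
    std::memcpy(buf.data(), &dom, sizeof(Domain3D));                      // header
    std::memcpy(buf.data() + sizeof(Domain3D), data, n * sizeof(double)); // values
    return buf;
}

// The server reverses the two copies to recover the domain header and the
// array values before handing them to the Database Access Layer.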

    The System Management Layer maintains a resource

    map, describing all the resources (storage and servers) avail-

    able in the global database system. It includes a caching

system for grouping applications' I/O requests and a scheduler

    assigning the I/O requests to database servers. This layer is

    also planned to handle administration functions, managing all

user-related information, including managing user logins and

    monitoring user connections.

    D. Implementation

MPI-DB is being developed as object-oriented software in C++ and will be made available under the new BSD

    open-source software license. It requires a working imple-

    mentation of the MPI standard, including MPI-2 function-

    ality, most importantly functions for client-server interaction

    (MPI_Open_Port, etc.) and dynamic process management

    (MPI_Comm_spawn). In our experience this requirement

    proved to be difficult to satisfy: Presently, MPI-DB works only

    with the Intel MPI software library. Many MPI implementa-

    tions contain run-time bugs, do not provide adequate hardware

drivers or do not support the MPI-2 standard at all. Process

    intercommunication depends on accessing MPI ports using an

    address string whose format is implementation dependent. This

complicates implementation of client-server software when client and server are built using different implementations of

    MPI.
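For reference, the MPI-2 client-server mechanism referred to above works roughly as follows (a minimal sketch; MPI_Init/MPI_Finalize, error handling and the out-of-band exchange of the port name are omitted):

#include <mpi.h>
#include <cstdio>

// Minimal sketch of the MPI-2 client/server handshake used by the MPI-based
// Transport Layer. The server publishes a port name whose string format is
// implementation dependent; the client must obtain that string out of band.
void run_server()
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Open_port(MPI_INFO_NULL, port);          // ask MPI for a port
    std::printf("server listening on %s\n", port);

    MPI_Comm client;
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);
    // ... exchange data with the client over the intercommunicator ...
    MPI_Comm_disconnect(&client);
    MPI_Close_port(port);
}

void run_client(const char *port /* obtained out of band */)
{
    MPI_Comm server;
    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);
    // ... send requests and receive results over the intercommunicator ...
    MPI_Comm_disconnect(&server);
}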

    In view of these difficulties we re-implemented the Trans-

    port Layer of MPI-DB using UDT sockets. While the client

    side of the library currently uses MPI, the server side relies

    on MPI only to spawn server processes. Future versions of

    the library will use different services to manage processes and

    will be independent of MPI.

    We have tested the throughput of data ingest in a pro-

    totype MPI-DB installation. Both the client and the server

    ran on Linux nodes having eight physical cores each. The

    client and the server were linked by a 10 Gigabit Ethernet

connection with a MySQL database running on the server.

The client application generated a three-dimensional array of

    data which was partitioned among p parallel processes with

    p = 1, 2, 4, 8, 16, 32 and 64. For each client process MPI-DB allocated a dedicated server process, which opened a

connection to MySQL, received data from its assigned client

process and ingested the data into the MySQL database. The data

    ingestion process was thus performed in p parallel threads over

    the Ethernet connection. The results of two test run sequences

are shown in Figure 4. Peak performance between 600 and 700


Fig. 4. Throughput test of MPI-DB in parallel data ingestion over a 10 Gigabit/sec Ethernet connection, with the client and the server each having eight physical CPU cores. Client data is sent to the server and then ingested into a MySQL database. MPI-DB allocated a dedicated server process for each client process. The aggregate throughput is recorded as a function of the number of parallel ingestion processes. Maximum throughput between 600 and 700 Megabytes per second is realized when the number of ingestion processes equals the

    number of physical cores.

    Megabytes per second has been achieved when the number of

    parallel processes was the same as the number of available

    physical cores (eight).

    Following successful tests of MPI-DB, it is now being

    used to ingest the results of the simulation of a turbulent

    channel flow directly from the HPC cluster at the JHU Physics

    department into the database.

    V. CONCLUSION

    We have introduced the concept of the data-intensive com-

    puter and have proposed a novel operating system architecture

    to support data-intensive computations with Petascale-sized

    data sets. Section III outlines a research and development

    program for building a data-object-oriented operating system

    with advanced compiler technology to translate user applica-

    tion programs into code that runs on the HPC cluster and

    code that runs in the database. The proposed operating system

    is non-local: it supports large-scale collaborative computations

    where user applications can be translated into code that runs

    on a remote computer and code that runs in the database.

    Our discussion touches only a few topics related to the

    design of the data-intensive computer, and necessarily omits

    many important aspects. The construction of the data-intensive

computer is an ongoing research project. A prototype of the data-intensive operating system has been implemented as a

    software library, MPI-DB, and is currently being used in

    production by the Turbulence research group at JHU.

    REFERENCES

[1] http://www.sdss.org/.

[2] V. Springel, S. White, A. Jenkins, C. Frenk, N. Yoshida, L. Gao, J. Navarro, R. Thacker, D. Croton, J. Helly, J. Peacock, S. Cole, P. Thomas, H. Couchman, A. Evrard, J. Colberg, and F. Pearce, "Simulations of the formation, evolution and clustering of galaxies and quasars," Nature, vol. 435, pp. 629–636, 2005.

[3] G. Lemson and the Virgo Consortium, "Halo and galaxy formation histories from the Millennium simulation: Public release of a VO-oriented database," 2006.

[4] M. Stonebraker, P. Brown, A. Poliakov, and S. Raman, "The architecture of SciDB," in Scientific and Statistical Database Management, ser. Lecture Notes in Computer Science, J. Bayard Cushing, J. French, and S. Bowers, Eds. Springer Berlin / Heidelberg, 2011, vol. 6809, pp. 1–16.

[5] http://www.mpa-garching.mpg.de/millennium/.

[6] M. C. Neyrinck, I. Szapudi, and A. S. Szalay, "Rejuvenating the Matter Power Spectrum: Restoring Information with a Logarithmic Density Mapping," Astrophys. J. Letters, vol. 698, pp. L90–L93, Jun. 2009.

[7] E. Perlman, R. Burns, Y. Li, and C. Meneveau, "Data exploration of turbulence simulations using a database cluster," in Proceedings of the Supercomputing Conference (SC07), 2007.

[8] Y. Li, E. Perlman, M. Wan, Y. Yang, R. Burns, C. Meneveau, S. Chen, A. Szalay, and G. Eyink, "A public turbulence database cluster and applications to study Lagrangian evolution of velocity increments in turbulence," J. Turbulence, vol. 9, no. 31, 2008.

[9] http://turbulence.pha.jhu.edu/.

[10] C. Blakeley, N. Cunningham, B. Ellis, Rathakrishnan, and M. C. Wu, "Distributed/heterogeneous query processing in Microsoft SQL Server," in 21st Int. Conf. on Data Engineering (ICDE'05), 2005, pp. 1001–1012.

[11] H. Yu and C. Meneveau, "Lagrangian refined Kolmogorov similarity hypothesis for gradient time evolution and correlation in turbulent flows," Phys. Rev. Lett., vol. 104, no. 8, p. 084502, Feb. 2010.

[12] P. K. Yeung and S. B. Pope, "An algorithm for tracking fluid particles in numerical simulations of homogeneous turbulence," J. Comput. Phys., vol. 79, no. 2, pp. 373–416, 1988.

[13] L. X and J. Katz, "Measurement of pressure-rate-of-strain, pressure diffusion and velocity-pressure-gradient tensors around an open cavity trailing corner," Bull. Am. Phys. Soc., vol. 53, no. 15, 2008.

[14] Snyder, personal communication, 2009.

[15] Leonard, personal communication, 2009.

[16] G. L. Eyink, "Stochastic flux freezing and magnetic dynamo," 2011.

[17] C. Meneveau, "A web-services accessible turbulence database of isotropic turbulence: lessons learned," in Progress in Wall Turbulence: Understanding and Modeling (M. Stanislas, ed.), held on 21-23 April in Lille, France, 2009.

[18] E. Givelberg and J. Bunn, "A comprehensive three-dimensional model of the cochlea," J. Comp. Phys., vol. 191, no. 2, pp. 377–391, 2003.

[19] http://pcbunn.cacr.caltech.edu/cochlea/.

[20] E. Givelberg and K. Yelick, "Distributed immersed boundary simulation in Titanium," SIAM J. on Scientific Computing, vol. 28, no. 4, pp. 1367–1378, Jul. 2006.

[21] N. Kasthuri, K. Hayworth, J. Tapia, R. Schalek, S. Nundy, and J. Lichtman, "The brain on tape: Imaging an ultra-thin section library (UTSL)," Society for Neuroscience Abstracts, 2009.

[22] D. D. Bock, W.-C. A. Lee, A. M. Kerlin, M. L. Andermann, G. Hood, A. W. Wetzel, S. Yurgenson, E. R. Soucy, H. S. Kim, and R. C. Reid, "Network anatomy and in vivo physiology of visual cortical neurons," Nature, vol. 471, pp. 177–182, March 2011.

[23] S. Saalfeld, A. Cardona, V. Hartenstein, and P. Tomančák, "CATMAID: collaborative annotation toolkit for massive amounts of image data," Bioinformatics, vol. 25, no. 15, pp. 1984–1986, 2009.

[24] G. Memik, M. T. Kandemir, W.-K. Liao, and A. Choudhary, "Multicollective I/O: A technique for exploiting inter-file access patterns," ACM Transactions on Storage, vol. 2, no. 3, Aug. 2006.

[25] A. S. Szalay, P. Z. Kunszt, A. Thakar, J. Gray, D. Slutz, and R. J. Brunner, "Designing and mining multi-terabyte astronomy archives: the Sloan Digital Sky Survey," in SIGMOD '00: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM, 2000, pp. 451–462.

[26] K. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. Hilfinger, S. Graham, D. Gay, P. Colella, and A. Aiken, "Titanium: A high-performance Java dialect," Concurrency: Practice and Experience, vol. 10, no. 11-13, September-November 1998.

[27] L. V. Kale and G. Zheng, "Charm++ and AMPI: Adaptive runtime strategies via migratable objects," M. Parashar, Ed. Wiley-Interscience, 2009, pp. 265–282.

[28] K. Asanović, R. Bodik, B. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, S. Williams, and K. Yelick, "The landscape of parallel computing research: A view from Berkeley," Electrical Engineering and Computer Sciences, University of California at Berkeley, Tech. Rep. UCB/EECS-2006-183, December 2006.

[29] Y. Gu and R. L. Grossman, "UDT: UDP-based data transfer for high-speed wide area networks," Comput. Netw., vol. 51, pp. 1777–1799, May 2007. [Online]. Available: http://dl.acm.org/citation.cfm?id=1229189.1229240

[30] L. Dobos, I. Csabai, M. Milovanovic, T. Budavari, A. Szalay, M. Tintor, J. Blakeley, A. Jovanovic, and D. Tomic, "Array requirements for scientific applications and an implementation for Microsoft SQL Server," in Proc. of the EDBT/ICDT 2011 Workshop on Array Databases, Uppsala, Sweden (ed.: P. Baumann), 2011.

[31] T. Budavari, A. Szalay, and G. Fekete, "Searchable sky coverage of astronomical observations: Footprints and exposures," submitted, 2010.

[32] Benchmarks provided by InfiniBand vendors.