A Service for Data-Intensive Computations on Virtual Clusters

Rainer Schmidt, Christian Sadilek, and Ross King

rainer.schmidt@arcs.ac.at

Intensive 2009, Valencia, Spain

Executing Preservation Strategies at Scale

Planets Project

“Permanent Long-term Access through NETworked Services”

Addresses the problem of digital preservation

driven by National Libraries and Archives

Project instrument: FP6 Integrated Project

5th IST Call

Consortium: 16 organisations from 7 countries

Duration: 48 months, June 2006 – May 2010

Budget: 14 million euro

http://www.planets-project.eu/

Outline

Digital Preservation

Need for eScience Repositories

The Planets Preservation Environment

Distributed Data and Metadata Integration

Data-Intensive Computations

Grid Service for Job Execution on AWS

Experimental Results

Conclusions

Drivers for Digital Preservation

Exponential growth of digital data across all sectors of society

Production of large quantities of often irreplaceable information

Produced by academic and other research

Data becomes significantly more complex and diverse

Digital information is fragile

Ensure long term access and interpretability

Requires data curation and preservation

Development and assessment of preservation strategies

Need for integrated and largely automated environments

Repository Systems and eScience

Digital repositories and libraries are designed for the creation, discovery, and publication of digital collections.

Systems are often heterogeneous, closed, and limited in scale.

Data grid middleware focuses on the controlled sharing and management of large data sets in a distributed environment.

Limited in metadata management and preservation functionality.

Integrate repositories into a large-scale environment

distributed data management capabilities

integration of policies, tools, and workflows

integration with computational facilities

The Planets Environment

A collaborative research infrastructure for the systematic development and evaluation of preservation strategies.

A decision-support system for the development of preservation plans

An integrated environment for dynamically providing, discovering and accessing a wide range of preservation tools, services, and technical registries.

A workflow environment for the data integration, process execution, and the maintenance of preservation information.

Service Gateway Architecture

Preservation Planning (Plato)

Experimentation Testbed Application

Notification and Logging System

Workflow Execution UI

Workflow Execution and Monitoring

Experiment Data and Metadata Repository

Service and Tool Registry

Application Services

Execution Services

Data Storage Services

Administration UI

Authentication and Authorization

User Applications

Portal Services

Application Execution and Data Services

Physical Resources, Computers, Networks

Data and Metadata Integration

Support distributed and remote storage facilities

data registry and metadata repository

Integrate different data sources and encoding standards

Currently a range of interfaces/protocols (OAI-PMH, SRU) and encoding standards (DC, METS) are supported.

Development of common digital object model

Recording of provenance information, preservation, integrity, and descriptive metadata

Dynamic mapping of objects from different digital libraries (TEL, memory institutions, research collections)
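A common digital object model of the kind listed above could be sketched as follows. This is only an illustration under stated assumptions: the class `DigitalObject`, its field names, and the helper `from_dublin_core` are hypothetical and are not taken from the Planets code base.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """Hypothetical common object model; all field names are illustrative."""
    identifier: str
    content: bytes
    descriptive: dict = field(default_factory=dict)  # e.g. Dublin Core terms
    provenance: list = field(default_factory=list)   # ordered event records
    checksum: str = ""                               # integrity (fixity) value

    def __post_init__(self):
        # Record a SHA-256 fixity value at creation time if none was supplied.
        if not self.checksum:
            self.checksum = hashlib.sha256(self.content).hexdigest()

    def verify(self) -> bool:
        """Return True if the content still matches the recorded checksum."""
        return hashlib.sha256(self.content).hexdigest() == self.checksum

    def record_event(self, agent: str, action: str) -> None:
        """Append a provenance event describing who did what."""
        self.provenance.append({"agent": agent, "action": action})


def from_dublin_core(record: dict, content: bytes) -> DigitalObject:
    """Map a flat Dublin Core style record onto the common model."""
    descriptive = {k: v for k, v in record.items() if k != "identifier"}
    obj = DigitalObject(record["identifier"], content, descriptive)
    obj.record_event(agent="importer", action="ingest")
    return obj
```

A record harvested from one library would be passed through a mapping function like `from_dublin_core`, and a METS source would get its own mapping onto the same class.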

Data Catalogue

Experiment Data

Metadata Repository

Digital Objects (DigObj)

Content Repositories

Temp Storage

Portal Application

Application Service

Data and Metadata Registry

Data Intensive Applications

Development of Planets Job Submission Services

Allow Job Submission to a PC cluster (e.g. Hadoop, Condor)

Standard grid protocols/interfaces (SOAP, HPC-BP, JSDL)

Demand for massive compute and storage resources

Instantiate a cluster on top of (leased) cloud resources based on AWS EC2 + S3 (or alternatively in-house in a PC lab)

Allows computation to be moved to data or vice-versa

On-demand cluster based on virtual machine images

Many Preservation Tools are/rely on 3rd party applications

Software can be preinstalled on virtual cluster nodes
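To make the JSDL part of the interface concrete, the sketch below builds and parses a minimal job description with Python's standard library. The namespaces and element names follow the JSDL 1.0 specification, but the document is not schema-validated, and the job name, executable, arguments, and input URI in the usage note are made-up placeholders rather than values used by the Planets service.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the JSDL 1.0 specification.
JSDL = "http://schemas.ggf.org/jsdl/2005/11/jsdl"
POSIX = "http://schemas.ggf.org/jsdl/2005/11/jsdl-posix"
NS = {"jsdl": JSDL, "posix": POSIX}


def build_job(name, executable, arguments, input_uri):
    """Build a minimal JSDL job description and return it as an XML string."""
    job = ET.Element(f"{{{JSDL}}}JobDefinition")
    desc = ET.SubElement(job, f"{{{JSDL}}}JobDescription")
    ident = ET.SubElement(desc, f"{{{JSDL}}}JobIdentification")
    ET.SubElement(ident, f"{{{JSDL}}}JobName").text = name
    app = ET.SubElement(desc, f"{{{JSDL}}}Application")
    posix = ET.SubElement(app, f"{{{POSIX}}}POSIXApplication")
    ET.SubElement(posix, f"{{{POSIX}}}Executable").text = executable
    for arg in arguments:
        ET.SubElement(posix, f"{{{POSIX}}}Argument").text = arg
    # Stage one input location; "input" is an arbitrary local name.
    staging = ET.SubElement(desc, f"{{{JSDL}}}DataStaging")
    ET.SubElement(staging, f"{{{JSDL}}}FileName").text = "input"
    ET.SubElement(staging, f"{{{JSDL}}}CreationFlag").text = "overwrite"
    source = ET.SubElement(staging, f"{{{JSDL}}}Source")
    ET.SubElement(source, f"{{{JSDL}}}URI").text = input_uri
    return ET.tostring(job, encoding="unicode")


def parse_job(xml_text):
    """Extract the fields a job manager would need from a JSDL document."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.findtext(".//jsdl:JobName", namespaces=NS),
        "executable": root.findtext(".//posix:Executable", namespaces=NS),
        "arguments": [a.text for a in root.findall(".//posix:Argument", NS)],
        "inputs": [u.text for u in root.findall(".//jsdl:Source/jsdl:URI", NS)],
    }
```

For example, `parse_job(build_job("migrate", "convert", ["-format", "png"], "s3://example-bucket/in"))` returns a dictionary with the job name, the executable, its arguments, and the list of input locations.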

Experimental Architecture

Virtual Node (Xen)

Data Service

Cloud Infrastructure (EC2)

Storage Infrastructure

Virtual Cluster and File System (Apache Hadoop)

Digital Objects

Job Description File

A Light-Weight Job Execution Service

WS-Container

AccountManager

JSDLParser

SessionHandler

Data Resolver

Input Generator

Job Manager

Hadoop

The MapReduce Application

MapReduce implements a framework and programming model for processing large documents (sorting, searching, indexing) on multiple nodes.

Automated decomposition (split)

Mapping to intermediary pairs (map), optionally (combine)

Merge output (reduce)

Provides implementation for data parallel + i/o intensive applications

Migrating a digital object (e.g. web page, folder, archive)

Decompose into atomic pieces (e.g. file, image, movie)

On each node, process input splits

Merge pieces and create new data collections
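The decompose, process, and merge steps above can be illustrated with a small single-process sketch of the MapReduce programming model. It does not use Hadoop, and every name in it is hypothetical: `to_png` only rewrites a file extension and stands in for a real preservation tool that would run on each node.

```python
from collections import defaultdict


def split(collections):
    """Decompose each collection into atomic (collection_id, file_name) pieces."""
    for collection_id, files in collections.items():
        for file_name in files:
            yield collection_id, file_name


def map_migrate(piece, migrate):
    """Map step: migrate one piece and emit an intermediate key/value pair."""
    collection_id, file_name = piece
    return collection_id, migrate(file_name)


def reduce_merge(pairs):
    """Reduce step: merge migrated pieces back into new collections."""
    merged = defaultdict(list)
    for collection_id, migrated in pairs:
        merged[collection_id].append(migrated)
    return {key: sorted(values) for key, values in merged.items()}


def to_png(file_name):
    """Stand-in for a migration tool: only changes the file extension."""
    return file_name.rsplit(".", 1)[0] + ".png"


def run(collections, migrate=to_png):
    """Run split, map, and reduce in sequence within one process."""
    pairs = (map_migrate(piece, migrate) for piece in split(collections))
    return reduce_merge(pairs)
```

In a real cluster the framework distributes the input splits across nodes and groups the intermediate pairs by key before the reduce step; here both happen in one process so the data flow is easy to follow.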

Experimental Setup

Amazon Elastic Compute Cloud (EC2)

1 – 150 cluster nodes

Custom image based on Fedora 8 i386

Amazon Simple Storage Service (S3)

max. 1 TB I/O per experiment

All measurements based on a single core per virtual machine

Apache Hadoop (v.0.18) MapReduce Implementation

Preinstalled command line tools

Quantitative evaluation of VMs (AWS) based on execution time, number of tasks, number of nodes, physical size

Experimental Results 1 – Scaling Job Size

[Figure: plot comparing EC2 and SLE runs for file sizes of 0.07 MB, 7.5 MB, and 250 MB; axis ticks from 1 to 1000]

number of nodes = 5, tasks = 1000

x(1k) = t_seq / t_parallel

x(1k) = 3.5, x(1k) = 4.4, x(1k) = 3.6

Experimental Results 2 – Scaling #nodes

[Figure: plot comparing EC2 and SLE for 1000 x 0.07 MB; axis ticks from 0 to 150]

n=1 (local), t=26

n=1, t=36, s=0.72, e=72%

n=5, t=4.5, s=3.3, e=66%

n=10, t=4.5, s=5.5, e=55%

n=50, t=1.68, s=16, e=31%

n=100, t=1.03, s=26, e=26%
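The speedup and efficiency annotations appear to follow the usual definitions: speedup s = t_seq / t_parallel, with the local run (t=26) as the sequential baseline, and efficiency e = s / n. The sketch below recomputes them for three of the annotated runs. The function names are illustrative, and times are in whatever unit the slide uses.

```python
def speedup(t_seq, t_par):
    """Speedup of a parallel run relative to the sequential baseline."""
    return t_seq / t_par


def efficiency(t_seq, t_par, nodes):
    """Speedup per node; 1.0 would mean perfect linear scaling."""
    return speedup(t_seq, t_par) / nodes


T_LOCAL = 26.0  # n=1 (local) baseline from the slide

# Execution times annotated on the slide, keyed by number of nodes.
runs = {1: 36.0, 50: 1.68, 100: 1.03}

for nodes, t in runs.items():
    s = speedup(T_LOCAL, t)
    e = efficiency(T_LOCAL, t, nodes)
    print(f"n={nodes}: s={s:.2f}, e={e:.0%}")
```

For the single-node cloud run this gives s = 26 / 36, or about 0.72, matching the annotation. The values for larger clusters agree with the slide up to rounding.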

Results and Observations

AWS + Hadoop + JSS

Robust and fault tolerant, scalable up to large numbers of nodes

~32.5 MB/s download / ~13.8 MB/s upload (cloud internally)

Small clusters of virtual machines are also reasonable

S = 4.4 for p = 5, with n = 1000 and s = 7.5 MB

Ideally, size of cluster grows/shrinks on demand

Overheads:

SLE vs. cloud (p=1, n=1000): 30% due to file system master

On average, less than 10% due to S3

Small overheads due to coordination, latencies, pre-processing

Conclusion

Challenges in digital preservation are diversity and scale

Focus: scientific data, arts and humanities

Repositories and preservation systems need to be embedded in large-scale distributed environments

grids, clouds, eScience environments

Cloud computing provides a powerful solution for getting on-demand access to an IT infrastructure.

Current integration issues: policies, legal aspects, SLAs

We have presented current developments in the context of the EU project Planets on building an integrated research environment for the development and evaluation of preservation strategies
