
Data-Intensive and High Performance Computing on Cloud Environments

Gagan Agrawal

1

Outline

• Introduction to Cloud Computing
• Ongoing Projects in Cloud Computing
  ‣ Data-Intensive Computing Middleware Systems
  ‣ Resource Provisioning with Budget and Time Constraints
  ‣ Workflow Consolidation with Power Constraints
  ‣ An Elastic Cache on the Amazon Cloud
• Other Research Projects
  ‣ Heterogeneous High-Performance Computing
  ‣ Deep Web Integration and Mining
  ‣ Scientific Data Management

2

3

Utilities: Things We Can’t Live without

4

Utility Costs Depend on Usage

Utility Providers Consumers

Resource on Demand

5

Utility Costs Depend on Usage

Utility Providers Consumers

Pay Per Usage


Utilities of Today Haven’t Always Been Utilities

6

A hand pump; a horse cart: you purchase and maintain the source of power for your transportation

7

How Do We Currently Do Computing?

Resources are co-located on site

Computing ResourcesSupport Personnel

Computing Consumer

8


Computing as a Utility

Cloud “Utility” Providers: Amazon AWS, Azure, Cloudera, Google App Engine

Consumers: companies, labs, schools, et al.

10

Processed Results

Computing as a Utility

Algorithms & Data

Cloud “Utility” Providers: Amazon AWS, Azure, Cloudera, Google App Engine

Consumers: companies, labs, schools, et al.

11

12

Why Now?

• It has finally become cost-effective to offer computing as a service

• Large companies, e.g., Amazon, Microsoft, Google, Yahoo!
  ‣ Already have the computing personnel and infrastructure in place
  ‣ Decreasing costs of hardware
  ‣ Virtualization advancements

13

Example of Cost Effectiveness at the Provider

14

Why Now?

• This creates a win-win situation

• For the provider:
  ‣ They get paid to fully utilize otherwise idle hardware
• For the user:
  ‣ They save on costs
  ‣ Example: Amazon’s cloud is $0.10 per machine-hour

15

Promises of Cloud Computing

• Cost Associativity
  ‣ Running 1 machine for 10 hours = running 10 machines for 1 hour
• Elasticity
  ‣ Cloud applications can stretch and contract their resource requirements
• “Infinite resources”

Research Challenges

• How do we exploit cost associativity and elasticity of the cloud for various applications?

• How do cloud providers provide adequate QoS to various applications and users?
  ‣ Maximize their revenue, lower their costs
• How do we develop effective services to support applications on cloud platforms?
• How can we combine the use of cloud and traditional resources for various applications?
  ‣ (HPC) Cloud Bursting

• How do we effectively manage large scale data on the cloud?

16

Outline

• Introduction to Cloud Computing
• Ongoing Projects in Cloud Computing
  ‣ Data-Intensive Computing Middleware Systems
  ‣ Resource Provisioning with Budget and Time Constraints
  ‣ Workflow Consolidation with Power Constraints
  ‣ An Elastic Cache on the Amazon Cloud
• Other Research Projects
  ‣ Heterogeneous High-Performance Computing
  ‣ Deep Web Integration and Mining
  ‣ Scientific Data Management

17

18

• Growing need for analysis of large-scale data
  ‣ Scientific
  ‣ Commercial

• Data-intensive Supercomputing (DISC) • Map-Reduce has received a lot of attention

‣ Database and Datamining communities ‣ High performance computing community

• Closely coupled with interest in cloud computing

Motivation

19

• Positives:
  ‣ Simple API
    - Functional language based
    - Very easy to learn
  ‣ Support for fault-tolerance
    - Important for very large-scale clusters
• Questions:
  ‣ Performance?
    - Comparison with other approaches
  ‣ Suitability for different classes of applications?

Map-Reduce: Positives and Questions
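The “simple API” claim is easy to see in miniature. The single-process Python sketch below is illustrative only (not Hadoop code): it shows the two user-supplied functions and the shuffle/group step that a Map-Reduce runtime performs between them.

```python
from collections import defaultdict

# A minimal, single-process sketch of the Map-Reduce programming model.
# Real systems (e.g., Hadoop) distribute, sort, and checkpoint these steps.

def map_func(document):
    # Emit (key, value) pairs -- here, one (word, 1) pair per word.
    for word in document.split():
        yield (word, 1)

def reduce_func(key, values):
    # Combine all values for a key -- here, sum the counts.
    return (key, sum(values))

def map_reduce(documents):
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_func(doc):
            groups[key].append(value)
    return dict(reduce_func(k, v) for k, v in groups.items())

counts = map_reduce(["the cloud", "the deep web"])
print(counts)  # {'the': 2, 'cloud': 1, 'deep': 1, 'web': 1}
```

Note that the runtime, not the programmer, materializes and groups every intermediate pair; that hidden cost is exactly what the questions above probe.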

Class of Data-Intensive Applications

• Many different types of applications
  ‣ Data-center kind of applications
    - Data scans, sorting, indexing
  ‣ More “compute-intensive” data-intensive applications
    - Machine learning, data mining, NLP
    - Map-Reduce / Hadoop being widely used for this class
  ‣ Standard database operations
    - SIGMOD 2009 paper compares Hadoop with databases and OLAP systems
• What is Map-Reduce suitable for?
• What are the alternatives?
  ‣ MPI/OpenMP/Pthreads: too low level?

20

Our Work

• Proposes MATE (a Map-Reduce system with an AlternaTE API) based on Generalized Reduction
  ‣ Phoenix implemented Map-Reduce on shared-memory systems
  ‣ MATE adopted Generalized Reduction, first proposed in FREERIDE, developed at Ohio State (2001-2003)
  ‣ Observed API similarities and subtle differences between Map-Reduce and Generalized Reduction
• Comparison for
  ‣ Data mining applications
  ‣ Compare performance and API
  ‣ Understand performance overheads
• Will an alternative API be better for “Map-Reduce”?

21

Comparing Processing Structures

22

• The reduction object represents the intermediate state of the execution
• The reduce function is commutative and associative
• Sorting and grouping overheads are eliminated with the reduction function/object


Observations on Processing Structure

• Map-Reduce is based on a functional idea
  ‣ Does not maintain state
• This can lead to overheads of managing intermediate results between map and reduce
  ‣ Map could generate intermediate results of very large size
• The MATE API is based on a programmer-managed reduction object
  ‣ Not as ‘clean’
  ‣ But avoids sorting of intermediate results
  ‣ Can also help shared-memory parallelization
  ‣ Helps better fault recovery
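The contrast with Map-Reduce can be sketched in a few lines. This is an illustrative Python sketch of the Generalized Reduction style, not the actual MATE/FREERIDE API (the function names here are hypothetical): each element folds directly into a programmer-managed reduction object, so no intermediate (key, value) pairs are generated, sorted, or grouped.

```python
# Hypothetical sketch of Generalized Reduction (not the real MATE API).
# Each data element updates a programmer-managed reduction object in place.

def local_reduction(data_chunk, reduction_object):
    # Fold each element directly into the reduction object -- no
    # intermediate (word, 1) pairs are ever materialized.
    for word in data_chunk.split():
        reduction_object[word] = reduction_object.get(word, 0) + 1
    return reduction_object

def global_combine(objects):
    # Merge per-thread/per-node reduction objects; this is safe because
    # the reduction is commutative and associative.
    merged = {}
    for obj in objects:
        for key, count in obj.items():
            merged[key] = merged.get(key, 0) + count
    return merged

# Two chunks processed independently (e.g., by two threads), then combined.
obj_a = local_reduction("the cloud", {})
obj_b = local_reduction("the deep web", {})
print(global_combine([obj_a, obj_b]))  # {'the': 2, 'cloud': 1, 'deep': 1, 'web': 1}
```

The result matches word-count under Map-Reduce, but the in-place update is what eliminates the sorting/grouping overheads the slide mentions.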

23

24

Results: Data Mining (I)

• K-Means: 400MB dataset, 3-dim points, k = 100 on one WCI node with 8 cores

[Chart: average time per iteration (sec) vs. number of threads (1, 2, 4, 8) for Phoenix, MATE, and Hadoop]

Outline

• Introduction to Cloud Computing
• Ongoing Projects in Cloud Computing
  ‣ Data-Intensive Computing Middleware Systems
  ‣ Resource Provisioning with Budget and Time Constraints
  ‣ Workflow Consolidation with Power Constraints
  ‣ An Elastic Cache on the Amazon Cloud
• Other Research Projects
  ‣ Heterogeneous High-Performance Computing
  ‣ Deep Web Integration and Mining
  ‣ Scientific Data Management

25

26

Resource Provisioning Motivation: Adaptive Applications

Earthquake modeling, coastline forecasting, medical systems

• Time-critical event processing
  - Compute-intensive
  - Time constraints
  - Application-specific flexibility
  - Application Quality of Service (QoS)

27

Adaptive Applications (Cont’d)

Adaptive applications that perform time-critical event processing:
• Application-specific flexibility: parameter adaptation
• Trade-off between application QoS and execution time

HPC applications (compute-intensive):
• Aim to maximize performance
• Do not consider adaptation

Deadline-driven scheduling:
• Not very compute-intensive

28

Challenges

-- Resource Budget Constraints

• Elastic cloud computing
  - Pay-as-you-go model
• Satisfy the application QoS with the minimum resource cost
• Dynamic resource provisioning
  - Dynamically varying application workloads
  - Resource budget

29

Background: Pricing Model

• Charged fees
  ‣ Base price
  ‣ Transfer fee
• Linear pricing model
• Exponential pricing model

Notation:
• Base price: charged for the smallest amount of CPU cycles
• Transfer fee: charged for each CPU allocation change
• CPU cycles at the ith allocation
• Time duration at the ith allocation
• Number of CPU cycle allocations
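The slide lists the pricing model's inputs but its formula images did not survive the transcript, so the cost function below is an assumed reading of the linear model: cost grows linearly with allocated cycles over time, plus a flat transfer fee for every allocation change.

```python
# Hedged sketch of a linear pricing model. The exact formula is not shown
# in the transcript; this assumes cost = base_price * sum(c_i * t_i)
# plus a transfer fee per allocation change.

def linear_cost(base_price, transfer_fee, allocations):
    """allocations: list of (cpu_cycles_i, duration_i) pairs."""
    usage = sum(cycles * duration for cycles, duration in allocations)
    changes = max(len(allocations) - 1, 0)  # each re-allocation pays a fee
    return base_price * usage + transfer_fee * changes

# Three allocation intervals: scale up, then back down.
cost = linear_cost(base_price=0.10, transfer_fee=0.05,
                   allocations=[(1, 2.0), (4, 1.0), (2, 3.0)])
print(round(cost, 2))  # 1.3
```

The transfer-fee term is what makes frequent re-provisioning expensive, which motivates the "avoid frequent resource changes" optimization later in this section.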

30

Problem Description

• Adaptive Applications

‣ Adaptive parameters

‣ Benefit

‣ Time constraint

• Cloud Computing Environment

‣ Resource budget

‣ Overprovisioning/Underprovisioning

• Goal

‣ Maximize the application benefit while satisfying the time constraints and resource budget

31

Approach Overview

Dynamic resource provisioning (feedback control) ↔ Resource model (with optimization)

• Resource Provisioning Controller

‣ Multi-input-multi-output (MIMO) feedback control model

‣ Modeling between adaptive parameters and performance metrics

‣ Control policy: reinforcement learning

• Resource Model

‣ Map change of parameters to change in CPU/memory allocations

‣ Optimization: avoid frequent resource changes

(The controller’s changes to the adaptive parameters are mapped to changes in CPU/memory allocations.)

32

Resource Provisioning Controller

Performance Metrics → Multi-Input-Multi-Output Model → Control Policy

• Performance metrics: satisfy time constraints and resource budget
• Multi-input-multi-output model: relationship between adaptive parameters and performance metrics
• Control policy: decide how to change values of the adaptive parameters

Control Model Formulation -- Performance Metrics

• Performance metrics
  ‣ Processing progress: ratio between the currently obtained application benefit and the elapsed execution time
  ‣ Performance/cost ratio: ratio between the currently obtained application benefit and the cost of the resources that have been assigned
• Notation (per time step k)
  ‣ Application benefit obtained at time step k
  ‣ Elapsed execution time at time step k
  ‣ Resource cost at time step k
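The two metrics can be written compactly. The symbols B(k), T(k), and C(k) below are our own labels for the benefit, elapsed time, and cost in the slide's notation list, since the original formula images did not survive the transcript:

```latex
\mathrm{progress}(k) = \frac{B(k)}{T(k)}, \qquad
\mathrm{perf/cost}(k) = \frac{B(k)}{C(k)}
```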

34

Control Model Formulation -- Multi-Input-Multi-Output Model

• Auto-Regressive Moving-Average with Exogenous Inputs (ARMAX)
  ‣ Second-order model
  ‣ Inputs: previous and current values of the adaptive parameters (the ith adaptive parameter at time step k)
  ‣ Outputs: previously observed performance metrics
  ‣ Model coefficients are updated at the end of every interval
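A second-order MIMO ARMAX model of the kind described here typically takes the textbook form below; the exact structure used in this work is not shown in the transcript, so this is an assumed formulation. With y(k) the vector of performance metrics, u(k) the vector of adaptive parameters, and A_j, B_j coefficient matrices re-estimated at the end of every interval:

```latex
y(k) = A_1\, y(k-1) + A_2\, y(k-2) + B_0\, u(k) + B_1\, u(k-1)
```

The "second-order" in the slide corresponds to the two lagged output terms y(k-1) and y(k-2).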

35

Framework Design

Applications

Virtualization management (Eucalyptus, OpenNebula, ...) over Xen hypervisors, each hosting multiple VMs

Components: Service Deployment, Service Wrapper, Resource Provisioning Controller, Application Controller, Resource Model, Model Optimizer, Performance Manager, Priority Assignment, Status Query, Performance Analysis

Outline

• Introduction to Cloud Computing
• Ongoing Projects in Cloud Computing
  ‣ Data-Intensive Computing Middleware Systems
  ‣ Resource Provisioning with Budget and Time Constraints
  ‣ Workflow Consolidation with Power Constraints
  ‣ An Elastic Cache on the Amazon Cloud
• Other Research Projects
  ‣ Heterogeneous High-Performance Computing
  ‣ Deep Web Integration and Mining
  ‣ Scientific Data Management

36

37

Workflow Consolidation: Motivation

• Another Critical Issue in Cloud Environments: Power Management

‣HPC servers consume a lot of energy

‣Significant adverse impact on the environment

• To Reduce Resource and Energy Costs

‣Server consolidation

‣Minimize the total power consumption and resource costs without a substantial degradation in performance

38

Problem Description

• Our Target Applications

‣ Workflows with DAG structure

‣ Multiple processing stages

‣ Opportunities for consolidation

• Research Problems

‣ Combine parameter adaptation, budget constraints and resource allocation with consolidation and power optimization

‣ Challenge: consolidation without parameter adaptation

‣ Support power-aware parameter adaptation -- future work

39

Contributions

• A power-aware consolidation framework, pSciMapper, based on hierarchical clustering and an optimization search method

• pSciMapper is able to reduce the total power consumption by up to 56% with at most a 15% slowdown for the workflow

• pSciMapper incurs low overhead and is thus suitable for large-scale scientific workflows

40

The pSciMapper Framework Design

Offline analysis: scientific workflows → resource usage generation → temporal feature extraction (time series) → feature reduction and modeling → temporal signatures and models, stored in a knowledge base

Online consolidation: hierarchical clustering → optimization search algorithm → time-varying resource provisioning → consolidated workloads

Outline

• Introduction to Cloud Computing
• Ongoing Projects in Cloud Computing
  ‣ Data-Intensive Computing Middleware Systems
  ‣ Resource Provisioning with Budget and Time Constraints
  ‣ Workflow Consolidation with Power Constraints
  ‣ An Elastic Cache on the Amazon Cloud
• Other Research Projects
  ‣ Heterogeneous High-Performance Computing
  ‣ Deep Web Integration and Mining
  ‣ Scientific Data Management

41

Motivation: Data-Intensive Services on Clouds

• Clouds can provide flexible storage
• Data-intensive services can be executed on clouds
• Caching is an age-old idea to accelerate services
  ‣ On clouds, can we exploit elasticity?
• A cost-sensitive elastic cache for clouds!

42

43

Problem: Query Intensive Circumstances


44

Scaling up to Handle Load

Which proxy has the page? h(k) = (k mod num_proxies). With three proxies (0, 1, 2), the request invoke: haitimap(29) hashes to h(29) = (29 mod 3) = 2, so proxy 2 is checked: HIT, and data(29) is returned from the derived data cache (cloud nodes) by the HaitiMap service.

46

Scaling up to Handle Load

After adding a fourth proxy (now 0-3), the same request invoke: haitimap(29) hashes to h(29) = (29 mod 4) = 1. Proxy 1 does not hold the page: MISS, and the request falls through to the service infrastructure (HaitiMap) behind the derived data cache (cloud nodes).
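The two scaling slides condense to a few lines of code. This sketch shows why naive modulo hashing makes an elastic cache expensive to grow: resizing from 3 to 4 proxies remaps most keys, so previously cached pages (like page 29) miss.

```python
# Modulo placement: h(k) = k mod num_proxies, as in the slides above.

def proxy_for(key, num_proxies):
    return key % num_proxies

key = 29
print(proxy_for(key, 3))  # 2 -> with 3 proxies, proxy 2 holds page 29: HIT
print(proxy_for(key, 4))  # 1 -> after scaling to 4 proxies: MISS

# Fraction of keys that keep their proxy when going from 3 to 4 proxies:
stable = sum(1 for k in range(1000) if proxy_for(k, 3) == proxy_for(k, 4))
print(stable / 1000)  # 0.252 -- roughly 3 out of 4 cached keys move
```

Consistent hashing is the standard remedy, keeping most keys in place as nodes are added or removed; a cost-sensitive elastic cache needs placement that survives resizing.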

Outline

• Introduction to Cloud Computing
• Ongoing Projects in Cloud Computing
  ‣ Resource Provisioning with Budget and Time Constraints
  ‣ Workflow Consolidation with Power Constraints
  ‣ An Elastic Cache on the Amazon Cloud
  ‣ Data-Intensive Computing Middleware Systems
• Other Research Projects
  ‣ Heterogeneous High-Performance Computing
  ‣ Deep Web Integration and Mining
  ‣ Scientific Data Management

47

Heterogeneous High Performance Computing

• Heterogeneous architectures are commonplace
  ‣ E.g., today’s desktops and notebooks
  ‣ Multi-core CPU + graphics card on PCI-E
• A recent HPC system
  ‣ E.g., Tianhe-1 [5th fastest supercomputer, Nov 2009]
  ‣ Uses multi-core CPUs and a GPU (ATI Radeon HD 4870) on each node
• Multi-core CPU and GPU usage is still divided
  ‣ Resources may be under-utilized
• Can the multi-core CPU and GPU be used simultaneously for computation?

48

Overall System Design

49

User input: simple C code with annotations

Application Developer

Multi-core

Middleware API

GPU Code for

CUDA

Compilation Phase

Code Generator

Run-time System

Worker Thread Creation and Management

Map Computation to CPU and GPU

Dynamic Work Distribution

Key Components
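One common way to realize the dynamic work distribution component above is a shared work queue from which CPU and GPU workers both pull chunks, so the faster device automatically takes a larger share. The sketch below is illustrative only: Python threads stand in for the real CPU/GPU workers, and run_workers and the toy process functions are hypothetical names, not the middleware API.

```python
import queue
import threading

# Sketch of queue-based dynamic work distribution between two workers
# (e.g., one driving the multi-core CPU, one driving the GPU).

def run_workers(chunks, process_cpu, process_gpu):
    work = queue.Queue()
    for chunk in chunks:
        work.put(chunk)
    results, lock = [], threading.Lock()

    def worker(process):
        # Each worker pulls the next chunk as soon as it is free,
        # so no static split between devices is needed.
        while True:
            try:
                chunk = work.get_nowait()
            except queue.Empty:
                return
            out = process(chunk)
            with lock:
                results.append(out)

    threads = [threading.Thread(target=worker, args=(p,))
               for p in (process_cpu, process_gpu)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Toy example: both "devices" just sum their chunk.
out = run_workers([[1, 2], [3, 4], [5]], sum, sum)
print(sorted(out))  # [3, 5, 7]
```

The per-chunk results would then be merged by a reduction step, exactly as in the generalized-reduction middleware described earlier.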

Performance of K-Means (Heterogeneous - NUCS)

50


Outline

• Introduction to Cloud Computing
• Ongoing Projects in Cloud Computing
  ‣ Resource Provisioning with Budget and Time Constraints
  ‣ Workflow Consolidation with Power Constraints
  ‣ An Elastic Cache on the Amazon Cloud
  ‣ Data-Intensive Computing Middleware Systems
• Other Research Projects
  ‣ Heterogeneous High-Performance Computing
  ‣ Deep Web Integration and Mining
  ‣ Scientific Data Management

51

52

The Deep Web

• The definition of “the deep web” from Wikipedia

The deep Web refers to World Wide Web content that is not part of the surface web, which is indexed by standard search engines.

• Some Examples: Expedia, Priceline

53

The Deep Web is Huge and Informative

• 500 times larger than the surface web
• 7,500 terabytes of information (19 terabytes in the surface web)
• 550 billion documents (1 billion in the surface web)
• More than 200,000 deep web sites
• Relevant to every domain: scientific, e-commerce, market
• 95 percent of the deep web is publicly accessible (with access limitations)

54

How to Access Deep Web Data

1. A user issues a query through the input interfaces of deep web data sources
2. The query is translated into a SQL-style query
3. The query triggers a search on the backend database
4. Answers are returned through the network

Select price
From Expedia
Where depart=CMH and arrive=SEA
  and dedate=“7/13/10” and redate=“7/16/10”

55

System Overview

Deep web schema mining builds the system models: a data source model and a data source dependency model (over data sources D1-D4).

Data source model:
Source  Input  Output  Constraint
S1      A1     B1, B2  C2
S2      A1     B2, B3  C1

Exploring part of SEEDEEP: hidden schema discovery, data source integration.

Querying part of SEEDEEP: query planning, query optimization, and fault tolerance, supporting complex structured queries, approximate query answering, and aggregation/low-selectivity queries (structured SQL query, sampling the deep web, online aggregation, low-selectivity query).

Summary

• Research in cloud, high-performance computing, and data-intensive computing (including data mining and web mining)

• Currently working with 10 PhD students and 5 MS students

• 10 PhDs completed in the last 6 years
• To get involved:
  ‣ Join 888 in Winter 2011

56