A Look at Clouds from an HPC Point of View

High Performance Computing Cloud point of view Alexey Ragozin [email protected] Sep 2012


Alexey Ragozin, Technical Lead, Caching and Data Grid Services, VP, Risk and PnL, Deutsche Bank


Page 1

High Performance Computing: Cloud point of view

Alexey Ragozin [email protected]

Sep 2012

Page 2

Massively parallel computing

I/O bound workload
• Data mining / machine learning / indexing
• Focus: do not move data, process it in place

CPU bound
• Complex simulations / complex math models
• Focus: keep all cores busy

Latency bound
• Physical process simulations (e.g. weather forecast)
• Focus: minimize communication latencies

Page 3

CPU bound tasks

Stream of independent tasks
• Tasks are independent of each other
• Random continuous stream of tasks
• E.g. video conversion, crawling

Structured batch jobs
• A single batch is split into subtasks for parallel execution
• Tasks may have data dependencies on each other
• Tasks may be generated during batch execution
• E.g. portfolio risk calculation

Page 4

Handling task stream in cloud

[Diagram: incoming tasks enter a task queue served by a worker pool; a controller reads queue metrics and adjusts the pool size based on them]

Simple pattern. Exploits the “elasticity” of the cloud. Cost effective.
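The controller loop from this slide can be sketched as a sizing rule. This is a minimal sketch, not the author's implementation: the function name, the "drain the backlog within a target time" rule, and all parameters are assumptions made for illustration.

```python
import math

def desired_pool_size(queue_length, avg_task_seconds, target_drain_seconds,
                      min_workers=1, max_workers=100):
    """Hypothetical sizing rule: enough workers to drain the backlog in time.

    queue_length         -- number of tasks waiting in the queue (queue metric)
    avg_task_seconds     -- observed average processing time of one task
    target_drain_seconds -- how quickly we want the current backlog gone
    """
    needed = math.ceil(queue_length * avg_task_seconds / target_drain_seconds)
    # Clamp to the limits we are willing to pay for.
    return max(min_workers, min(max_workers, needed))

# Example: 100 queued tasks of 30 s each, backlog should drain in 60 s.
print(desired_pool_size(100, 30, 60))
```

A real controller would call such a rule periodically and start or stop cloud instances to match the result; shrinking the pool when the queue is empty is what makes the pattern cost effective.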

Page 5

Structured batch jobs in cloud

Batches are usually more sporadic
• e.g. end of day risk calculations

Tasks may have cross dependencies
• scheduler should be “cloud-aware”

Supplying tasks with data
• data delivery delay is critical
• worker pool is generally very large
• data sets could also be very large
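To illustrate what cross dependencies mean for a scheduler, here is a minimal sketch that groups the subtasks of a batch into "waves" that can run in parallel. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+); the task names and the `execution_waves` helper are hypothetical, and a cloud-aware scheduler would additionally consider data placement and pool size.

```python
from graphlib import TopologicalSorter

def execution_waves(dependencies):
    """Group tasks into waves; all tasks in one wave can run in parallel.

    dependencies maps a task name to the set of tasks it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose inputs are complete
        waves.append(ready)
        sorter.done(*ready)                 # pretend the whole wave finished
    return waves

# Hypothetical risk batch: load data, price two books, aggregate results.
batch = {
    "aggregate": {"price_a", "price_b"},
    "price_a": {"load"},
    "price_b": {"load"},
}
print(execution_waves(batch))
```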

Page 6

Data delivery strategy

Push approach
• scheduler controls data delivery
• worker expects data to be available locally
• more opportunities for optimization
• complex

Pull approach
• worker pulls required data from a central service
• scheduler is unaware of data sets
• requires a scalable data service
• much simpler
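The difference between the two approaches can be sketched in a few lines. Everything here (`Worker`, `DataService`, `assign_push`, `run_pull`) is a hypothetical in-memory model for illustration, not an existing scheduler API.

```python
from dataclasses import dataclass, field

@dataclass
class Worker:
    name: str
    local_data: set = field(default_factory=set)  # keys already on this node

class DataService:
    """Stand-in for a central data service; counts how often it is read."""
    def __init__(self, blobs):
        self.blobs = dict(blobs)
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self.blobs[key]

def run_pull(task_key, service):
    """Pull: the worker fetches its own input; the scheduler knows nothing."""
    return service.get(task_key)

def assign_push(workers, task_key, service):
    """Push: the scheduler places the task where the data already is,
    otherwise it ships the data to the least loaded worker first."""
    for worker in workers:
        if task_key in worker.local_data:
            return worker
    target = min(workers, key=lambda w: len(w.local_data))
    service.get(task_key)             # one transfer, controlled by scheduler
    target.local_data.add(task_key)
    return target
```

The push variant needs the scheduler to track what every worker holds, which is where both the optimization opportunities and the complexity come from.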

Page 7

What kind of data do we have?

Working set
• working set is divided between jobs
• each portion of the working set is processed by a single job
• jobs often produce the working set for the next computation stage

Reference data
• exactly the same data is shared by multiple/all jobs
• usually a static data set

Page 8

Data distribution problem

Working set
• Spiky workload, especially at the start
• Hard to predict where a piece of data will be required
• Caching is ineffective

Reference data set
• Naïve approach will produce a huge volume of redundant transfers; smart caching is required
• Spiky workload
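For reference data, the redundant transfers come from every job fetching the same item again. A minimal sketch of a per-node cache, assuming a hypothetical `fetch` callable that reads from the remote data service:

```python
class NodeCache:
    """Per-node cache for reference data (illustrative, not thread-safe)."""

    def __init__(self, fetch):
        self._fetch = fetch          # hypothetical remote read, e.g. from a blob store
        self._store = {}
        self.remote_fetches = 0      # how many times we really went to the network

    def get(self, key):
        if key not in self._store:
            self._store[key] = self._fetch(key)
            self.remote_fetches += 1
        return self._store[key]

# Ten jobs on the same node asking for the same reference data set.
cache = NodeCache(lambda key: f"data:{key}")
for _ in range(10):
    cache.get("yield_curves")
print(cache.remote_fetches)
```

With the naïve approach the number of transfers grows with the number of jobs; with a node-level cache it grows with the number of nodes. The cache does nothing for the working set, where each portion is read by a single job.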

Page 9

Private grid practice

[Diagram: HPC grid, data grid, and RDBMS or data warehouse]

Page 10

Data grid, what is it?

• Key/value storage
• Data is distributed across a cluster of servers
• RAM is usually used as storage
• Redundant copies provide a level of fault tolerance / durability
• No single point of failure
• Automatic rebalancing of data when servers are added to or removed from the grid
• Capacity and throughput scale linearly
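How keys and their redundant copies can be spread over servers is sketched below. The slide does not say which placement scheme a given data grid uses; this sketch uses rendezvous (highest random weight) hashing as one illustrative choice, and the `owners` helper and server names are hypothetical.

```python
import hashlib

def _score(key, server):
    """Deterministic pseudo-random weight for a (server, key) pair."""
    digest = hashlib.sha256(f"{server}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def owners(key, servers, copies=2):
    """Return the servers holding the primary and backup copies of key."""
    ranked = sorted(servers, key=lambda s: _score(key, s), reverse=True)
    return ranked[:copies]

servers = ["node-1", "node-2", "node-3", "node-4"]
print(owners("trade:42", servers))
```

With this scheme, removing a server that does not own a key leaves that key's owners unchanged, so only the data of the departed server has to be rebalanced.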

Page 11

Data service for cloud HPC

• Block storage service
  Azure Drive / Amazon EBS
  – Lack of shared access to data
• Key/value storage
  Azure Tables / Amazon SimpleDB
  – Pricing: volume + usage
• Blob store
  Azure Blob storage / Amazon S3
  – Pricing: volume + transactions
  – Good read scalability

Page 12

Use case for caching

Avoid storage of data in cloud
• Upload data once per batch and cache it in cloud

Reduce storage cost by reducing number of operations

Save IO bandwidth for shared data
• Edge caching
• Routing overlays

Page 13

Distribution tree / Routing overlays

[Diagram labels: Switch, Switch, Switch; Clients; Proxy; Storage]

Page 14

Task stealing

Task stealing: an alternative scheduling approach
Task stealing is widely used for in-process multi-core concurrency

Why use it for cluster task scheduling?
• Stochastic and adaptive
• Can use cost models accounting for internal cloud topology
• Decently solves the problem of data delivery, without additional caching
• Unproven for cluster computation, so far

Page 15

Task stealing

[Diagram: Worker 1; labels: fork, fork, processing, fork]

Work backlog is organized in the form of a stack
• Tasks are generated recursively
• Top of stack: fine grained tasks
• Bottom of stack: coarse grained tasks
• Execution: from top of stack
• Stealing: from bottom of stack
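The stack discipline above can be sketched with a double-ended queue. This is a single-threaded simulation for illustration only: the `Worker` class, the `(lo, hi)` range tasks, and the `grain` parameter are assumptions, and a real implementation needs a concurrent deque (in-process) or a network protocol (cluster).

```python
from collections import deque

class Worker:
    def __init__(self, name):
        self.name = name
        self.backlog = deque()        # left = bottom of stack, right = top

    def fork(self, task):
        self.backlog.append(task)     # new tasks go on top

    def next_task(self):
        return self.backlog.pop() if self.backlog else None   # execute from top

    def steal_from(self, victim):
        if not victim.backlog:
            return None
        task = victim.backlog.popleft()   # steal from bottom: coarsest task
        self.backlog.append(task)
        return task

def step(worker, results, grain=2):
    """Take one task; split it recursively, forking the halves not processed."""
    task = worker.next_task()
    if task is None:
        return False
    lo, hi = task
    while hi - lo > grain:
        mid = (lo + hi) // 2
        worker.fork((mid, hi))        # upper half waits (or gets stolen)
        hi = mid
    results.append(sum(range(lo, hi)))
    return True

w1, w2, results = Worker("worker-1"), Worker("worker-2"), []
w1.fork((0, 16))
step(w1, results)                     # w1 backlog: (8,16), (4,8), (2,4)
w2.steal_from(w1)                     # w2 takes the coarse task (8,16)
while step(w1, results) or step(w2, results):
    pass
print(sum(results))
```

Stealing the coarsest task means one steal moves a large amount of work, so steals (and the data transfers they imply in a cluster) stay rare.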

Page 16

Task stealing

[Diagram: Worker 1 and Worker 2; labels: fork, processing, done, steal]

Page 17

IO bound workload in cloud

Dawn of Map/Reduce
- high bandwidth interconnects are expensive
- network storage is expensive (due to network cost)
- cheap servers and local processing for keeping costs low
- price: very complex computation model

“Cloud” reality
- network bandwidth is cheap
- disks are already “networked”
- RAM is abundant

Page 18

Hadoop is cloud unfriendly

Assume I have a 50-node Hadoop cluster in cloud

What will I gain by adding another 50 nodes?
- Not much, until they are populated with data.

What if I shut these 50 nodes down afterward?
- The effort to populate them with data will be wasted.

Hadoop couples execution and storage services together: you have to pay for both even if you use only one.

Page 19

How should cloud M/R look?

• Use cloud storage service and persistent storage
• Streaming M/R processing
• Aggressive use of memory for intermediate data

Peregrine, a storeless M/R framework
http://peregrine_mapreduce.bitbucket.org/

Spark, an in-memory M/R framework
http://www.spark-project.org/
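What "streaming M/R with intermediate data in memory" means can be shown at toy scale. This sketch uses plain Python generators and is not the API of Peregrine or Spark; the function names are hypothetical.

```python
from collections import defaultdict

def map_words(lines):
    """Map phase: stream (word, 1) pairs without materializing them on disk."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_counts(pairs):
    """Reduce phase: aggregate the stream in memory as it arrives."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Input could just as well be lines streamed from a cloud blob store.
lines = ["cloud grid cloud", "Grid cloud"]
print(reduce_counts(map_words(lines)))
```

Because the map output is consumed as a stream, no intermediate files are written to local disks, so nodes hold no state that would be lost when the pool is scaled down.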

Page 20

Looking into future

Highly anticipated features
• Scheduler as a Service
  Azure HPC / Amazon SWF
• Simple middleware for organizing caches and routing overlays
  Existing solutions are far from simple
• Cloud friendly map/reduce frameworks
  Cloud providers work hard to offer effective Hadoop

Page 21

Thank you

Alexey Ragozin [email protected]

http://blog.ragozin.info - my articles