Migratory File Services for Batch-Pipelined Workloads


Transcript of Migratory File Services for Batch-Pipelined Workloads

Page 1: Migratory File Services for Batch-Pipelined Workloads

Migratory File Services for Batch-Pipelined Workloads

John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny
WiND and Condor Projects
6 May 2003

Page 2: Migratory File Services for Batch-Pipelined Workloads

How to Run a Batch-Pipelined Workload?

[Figure: a batch-pipelined workload. In each of three pipelines, job a_i reads its private input in.i together with the batch shared data and writes pipe.i; job b_i reads pipe.i and writes out.i.]
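A rough sketch of that structure as plain data (not from the slides; the dictionary layout and the assumption that the a-stage is the one reading the shared batch data are illustrative only):

# Minimal sketch of the pictured workload: three independent pipelines that
# all read one shared batch dataset (assumed here to be read by the a-stage).
def pipeline(i):
    return [
        {"job": f"a{i}", "reads": [f"in.{i}", "batch-data"], "writes": [f"pipe.{i}"]},
        {"job": f"b{i}", "reads": [f"pipe.{i}"], "writes": [f"out.{i}"]},
    ]

workload = {"shared": ["batch-data"],
            "pipelines": [pipeline(i) for i in (1, 2, 3)]}

for chain in workload["pipelines"]:
    for job in chain:
        print(job["job"], "reads", job["reads"], "writes", job["writes"])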

Page 3: Migratory File Services for Batch-Pipelined Workloads

Cluster-to-Cluster Computing

[Figure: a home system and the Internet Archive connected over the wide area to a Condor pool, a PBS cluster, and a Grid Engine cluster, each made up of many nodes.]

Page 4: Migratory File Services for Batch-Pipelined Workloads

How to Run a Batch-Pipelined Workload?

“Remote I/O”
• Submit jobs to a remote batch system.
• Let all I/O come directly home.
• Inefficient if re-use is common.
• (( But perfect if no data sharing! ))

“FTP-Net”
• User finds remote clusters.
• Manually stages data in.
• Submits jobs, deals with failures.
• Pulls data out.
• Lather, rinse, repeat.

Page 5: Migratory File Services for Batch-Pipelined Workloads

Hawk: A Migratory File Service for Batch-Pipelined Workloads

Automatically deploys a “task force” across multiple wide-area systems.

Manages applications from a high level, using knowledge of process interactions.

Provides dependable performance with peer-to-peer techniques. (( Locality is key! ))

Understands and reacts to failures using knowledge of the system and workloads.

Page 6: Migratory File Services for Batch-Pipelined Workloads

Dangers

Failures
• Physical: Networks fail, disks crash, CPUs halt.
• Logical: Out of space/memory, lease expired.
• Administrative: You can’t use cluster X today.

Dependencies
• A comes before C and D, which are simultaneous.
• What do we do if the output of C is lost?

Risk vs Reward
• A gamble: Staging input data to a remote CPU.
• A gamble: Leaving output data at a remote CPU.

Page 7: Migratory File Services for Batch-Pipelined Workloads

Hawk In Action

[Figure: the batch-pipelined workload deployed across the wide area. Hawk components run on nodes of the Condor pool, the PBS cluster, and the Grid Engine cluster, alongside the home system and the Internet Archive. Inputs i1–i3 feed jobs a1–a3, intermediate data flows through b1–b3 to c1–c3, and outputs o1–o3 return home.]

Page 8: Migratory File Services for Batch-Pipelined Workloads

Workflow Language 1 (Start With Condor DAGMan)

job a a.condor
job b b.condor
job c c.condor
job d d.condor
parent a child c
parent b child d

[Figure: the resulting DAG — a runs before c, and b before d.]

Page 9: Migratory File Services for Batch-Pipelined Workloads

Workflow Language 2

volume v1 ftp://archive/mydata
mount v1 a /data
mount v1 b /data

volume v2 scratch
mount v2 a /tmp
mount v2 c /tmp

volume v3 scratch
mount v3 b /tmp
mount v3 d /tmp

[Figure: the same DAG annotated with volumes — v1 is backed by mydata in archive storage; v2 and v3 are scratch volumes shared along each branch (a with c, b with d).]

Page 10: Migratory File Services for Batch-Pipelined Workloads

Workflow Language 3

extract v2 x ftp://home/out.1
extract v3 x ftp://home/out.2

[Figure: the same DAG; file x written by c into scratch volume v2 is extracted to out.1 at home, and x written by d into v3 is extracted to out.2.]

Page 11: Migratory File Services for Batch-Pipelined Workloads

Mapping the Workflow to the Migratory File System

Abstract Jobs
• Become jobs in a batch system.
• May start, stop, fail, checkpoint, restart...

Logical “scratch” volumes
• Become temporary containers on a scratch disk.
• May be created, replicated, and destroyed...

Logical “read” volumes
• Become blocks in a cooperative proxy cache.
• May be created, cached, and evicted...
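A rough sketch of this mapping (invented class and function names, not the slides' implementation) might look like:

# Illustrative sketch only: how the declared workflow objects could map onto
# migratory resources (batch jobs, containers, cooperative-cache volumes).
from dataclasses import dataclass

@dataclass
class BatchJob:           # an abstract job -> a job in some batch system
    name: str
    state: str = "idle"   # may start, stop, fail, checkpoint, restart...

@dataclass
class Container:          # a logical "scratch" volume -> a temporary container
    volume: str           # on a scratch disk; may be replicated or destroyed
    host: str = ""

@dataclass
class CachedVolume:       # a logical "read" volume -> blocks in a cooperative
    volume: str           # proxy cache; blocks may be cached and evicted
    source_url: str

def map_workflow(jobs, scratch_volumes, read_volumes):
    """Turn the declarative workflow into concrete migratory resources."""
    return {
        "jobs": [BatchJob(j) for j in jobs],
        "containers": [Container(v) for v in scratch_volumes],
        "caches": [CachedVolume(v, url) for v, url in read_volumes.items()],
    }

# Example drawn from the workflow-language slides above:
plan = map_workflow(jobs=["a", "b", "c", "d"],
                    scratch_volumes=["v2", "v3"],
                    read_volumes={"v1": "ftp://archive/mydata"})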

Page 12: Migratory File Services for Batch-Pipelined Workloads

System Components

[Figure: the home system runs the workflow manager, a Condor matchmaker (MM), a Condor schedd, and the archive; the remote resources are a PBS cluster and a Condor pool of nodes.]

Page 13: Migratory File Services for Batch-Pipelined Workloads

Gliding In

[Figure: a glide-in job is submitted to each remote cluster; on every node it starts a Condor master, a startd, and a proxy, turning the PBS and Condor pool nodes into Hawk execution sites.]

Page 14: Migratory File Services for Batch-Pipelined Workloads

System Components

[Figure: after gliding in, each node in the PBS cluster and the Condor pool runs a startd and a proxy; the home system still runs the workflow manager, Condor MM, schedd, and archive.]

Page 15: Migratory File Services for Batch-Pipelined Workloads

Cooperative Proxies

[Figure: the same deployment, highlighting the proxies on the glided-in nodes, which cooperate with one another and with the archive to serve data.]

Page 16: Migratory File Services for Batch-Pipelined Workloads

System Components

[Figure: the same system components as before.]

Page 17: Migratory File Services for Batch-Pipelined Workloads

Batch Execution System

[Figure: the Condor matchmaker and schedd at home, together with the startds on the remote nodes, form the batch execution system; a proxy sits alongside each startd.]

Page 18: Migratory File Services for Batch-Pipelined Workloads

System Components

[Figure: the same system components as before, with the workflow manager, Condor MM, schedd, and archive at home and a startd/proxy pair on each remote node.]

Page 19: Migratory File Services for Batch-Pipelined Workloads

Workflow Manager Detail

[Figure: the workflow manager at home, alongside the Condor MM, schedd, and archive, directing a remote startd/proxy pair.]

Page 20: Migratory File Services for Batch-Pipelined Workloads

[Figure: detail of one execution node (startd and proxy) interacting with the home system (workflow manager, Condor MM, schedd, and archive).]

Page 21: Migratory File Services for Batch-Pipelined Workloads

Create Container

[Figure: across the wide-area network, the workflow manager has the proxy create scratch container 120 on the execution node; the proxy’s cooperative block input cache holds blocks d15 and d16 of /mydata fetched from the archive.]

Page 22: Migratory File Services for Batch-Pipelined Workloads

Execute Job

[Figure: the job runs under the startd with an agent providing a POSIX library interface. The agent maps /tmp to cont://host5/120 and /data to cache://host5/archive/mydata, so creat("/tmp/outfile") writes outfile into container 120 and open("/data/d15") reads block d15 through the cooperative input cache.]
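A minimal sketch of the kind of name translation the agent performs (the two mount-table entries come from the figure; everything else, including the function name, is illustrative):

# Illustrative sketch: translate POSIX paths seen by the job into the
# container / cache names used by the proxy, via a per-job mount table.
MOUNT_TABLE = {
    "/tmp": "cont://host5/120",               # scratch volume -> container 120
    "/data": "cache://host5/archive/mydata",  # read volume -> cooperative cache
}

def translate(path: str) -> str:
    """Rewrite a job-visible path according to the mount table."""
    for prefix, target in MOUNT_TABLE.items():
        if path == prefix or path.startswith(prefix + "/"):
            return target + path[len(prefix):]
    return path  # paths outside any mount pass through unchanged

# The two calls shown on the slide:
assert translate("/tmp/outfile") == "cont://host5/120/outfile"
assert translate("/data/d15") == "cache://host5/archive/mydata/d15"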

Page 23: Migratory File Services for Batch-Pipelined Workloads

Extract Output

[Figure: when the job completes, the proxy reports JobCompleted and extracts outfile from container 120 across the wide-area network back to the home system, where it is stored as out65.]

Page 24: Migratory File Services for Batch-Pipelined Workloads

Delete Container

[Figure: with the output extracted, the workflow manager has the proxy delete container 120 and its contents.]

Page 25: Migratory File Services for Batch-Pipelined Workloads

Container Deleted

[Figure: the container is gone; the cooperative cache entries for /mydata (d15, d16) remain at the proxy, and the extracted output out65 is at home.]
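Pages 21–25 walk through one job's lifecycle. A hedged sketch of that sequence as the workflow manager might drive it (all class, method, and host names are invented for illustration; this is not the Hawk API):

# Illustrative sketch of the per-job lifecycle shown on the preceding slides:
# create a scratch container, run the job against it, extract the output home,
# then delete the container.
class Proxy:
    def __init__(self, host):
        self.host = host
        self.containers = {}

    def create_container(self, cid):
        self.containers[cid] = {}

    def run_job(self, cid, job):
        job(self.containers[cid])           # the job writes files into its container

    def extract(self, cid, name):
        return self.containers[cid][name]   # ship one file back over the WAN

    def delete_container(self, cid):
        del self.containers[cid]

def run_pipeline_step(proxy, cid, job, output_name):
    proxy.create_container(cid)               # CreateContainer
    proxy.run_job(cid, job)                   # ExecuteJob
    result = proxy.extract(cid, output_name)  # ExtractOutput
    proxy.delete_container(cid)               # DeleteContainer -> ContainerDeleted
    return result

# Example: a toy job writing "outfile" into container 120 on host5.
home_copy = run_pipeline_step(
    Proxy("host5"), 120,
    job=lambda c: c.update({"outfile": b"results"}),
    output_name="outfile",
)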

Page 26: Migratory File Services for Batch-Pipelined Workloads

Fault Detection and Repair

The proxy, startd, and agent detect failures:
• Job evicted by machine owner.
• Network disconnection between job and proxy.
• Container evicted by storage owner.
• Out of space at proxy.

The workflow manager knows the consequences:
• Job D couldn’t perform its I/O.
• Check: Are volumes V1 and V3 still in place?
• Aha: Volume V3 was lost -> Run B to create it.
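A sketch of that recovery reasoning (the volume and job names follow the workflow-language example; the code itself and its function names are only illustrative):

# Illustrative sketch of the workflow manager's repair logic: when a job's I/O
# fails, check which of its mounted volumes are still in place and re-run the
# jobs that produce any volume that was lost.
MOUNTS = {"c": ["v1", "v2"], "d": ["v1", "v3"]}   # job -> volumes it mounts
PRODUCERS = {"v2": "a", "v3": "b"}                # scratch volume -> job that fills it

def repair(failed_job, volume_exists):
    """Return the jobs to re-run so that failed_job's volumes exist again."""
    to_rerun = []
    for vol in MOUNTS[failed_job]:
        if not volume_exists(vol):                # e.g. container evicted or lost
            producer = PRODUCERS.get(vol)
            if producer:                          # scratch data: regenerate it
                to_rerun.append(producer)
            # read volumes (like v1) are simply re-fetched from the archive
    return to_rerun + [failed_job]                # finally, retry the failed job

# Example from the slide: job d fails, and volume v3 turns out to be lost.
print(repair("d", volume_exists=lambda v: v != "v3"))   # -> ['b', 'd']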

Page 27: Migratory File Services for Batch-Pipelined Workloads

Performance Testbed

Controlled “remote” cluster:
• 32 cluster nodes at UW.
• Hawk submitter also at UW.
• Connected by a restricted 800 Kb/s link.

Also some preliminary tests on uncontrolled systems:
• Hawk over PBS cluster at Los Alamos.
• Hawk over Condor system at INFN Italy.

Page 28: Migratory File Services for Batch-Pipelined Workloads

Batch-Pipelined Applications

Name    Stages    Load              Remote (jobs/hr)    Hawk (jobs/hr)
BLAST   1         Batch Heavy       4.67                747.40
CMS     2         Batch and Pipe    33.78               1273.96
HF      3         Pipe Heavy        40.96               3187.22
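As a quick sanity check on the table (simple arithmetic, not from the slides), the throughput ratios work out to roughly 160x for BLAST, 38x for CMS, and 78x for HF:

# Throughput ratios computed from the table above (Hawk jobs/hr over Remote jobs/hr).
results = {"BLAST": (4.67, 747.40), "CMS": (33.78, 1273.96), "HF": (40.96, 3187.22)}

for app, (remote, hawk) in results.items():
    print(f"{app}: {hawk / remote:.0f}x faster under Hawk")
# BLAST: 160x, CMS: 38x, HF: 78x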

Page 29: Migratory File Services for Batch-Pipelined Workloads

Failure Recovery

[Figure: failure-recovery experiment showing a failure followed by a cascading rollback.]

Page 30: Migratory File Services for Batch-Pipelined Workloads

A Little Bit of Philosophy

Most systems build from the bottom up:
• “This disk must have five nines, or else!”

MFS works from the top down:
• “If this disk fails, we know what to do.”

By working from the top down, we finesse many of the hard problems in traditional filesystems.

Page 31: Migratory File Services for Batch-Pipelined Workloads

Future Work

• Integration with Stork
• P2P aspects: discovery & replication
• Optional knowledge: size & time
• Delegation and disconnection
• Names, names, names:
   • Hawk – A migratory file service.
   • Hawkeye – A system monitoring tool.

Page 32: Migratory File Services for Batch-Pipelined Workloads

Feeling overwhelmed?

Let Hawk juggle your work!

[Figure: closing cartoon of Hawk juggling jobs and data.]