Khalil Amiri*, David Petrou, Gregory R. Ganger* and Garth A. Gibson, "Dynamic Function Placement for Data-intensive Cluster Computing," Proceedings of the USENIX Annual Technical Conference, San Diego, CA, June 2000. (http://www.pdl.cs.cmu.edu/Publications/publications.html)

Transcript of Khalil Amiri*, David Petrou, Gregory R. Ganger* and Garth A. Gibson, "Dynamic Function Placement for Data-intensive Cluster Computing."

Function Placement for Data Intensive Cluster Computing

• Data intensive applications that filter, mine, sort or manipulate large data sets
– Spread data-parallel computations across source/sink servers
– Exploit servers' computational resources
– Reduce network bandwidth

Programming model and runtime system

• Compose data intensive applications from explicitly-migratable, functionally independent components

• Mobile objects provide explicit methods that checkpoint and restore state

• Application and filesystem represented as graph of communicating mobile objects

• Graph rooted at storage servers by non-migratable storage objects
• Anchored at clients by a non-migratable console object
• Mobile objects have explicit methods that checkpoint and restore state during migration
• Storage objects provide persistent storage
• Console object contains the part of the application that must remain at the node where the application is started

ABACUS runtime system

• Migration and location-transparent invocation component (binding manager)
– Responsible for creation of location-transparent references to mobile objects

– Redirection of method invocations in face of object migrations

– Each machine’s binding manager notifies local resource manager of each procedure call to and return from mobile object

• Resource monitoring and management component (resource manager)
– Uses notifications to collect statistics about the bytes moved between objects and the resources used by objects

– Monitors load on local processor and costs associated with moving data to and from storage servers

– Server side managers collect statistics from client side resource managers

– Employ analytic models to estimate performance advantages that might accrue from moving to alternate placements
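Below is a minimal sketch (hypothetical names, not the actual ABACUS code) of how a binding manager might wrap each method invocation so that the local resource manager sees every call and return along with the bytes moved:

    #include <chrono>
    #include <cstdint>
    #include <map>
    #include <utility>

    // Hypothetical per-node resource manager: accumulates bytes moved and time
    // spent between each pair of communicating objects.
    class ResourceManager {
    public:
        void Record(int caller, int callee, uint64_t bytes, double seconds) {
            Edge& e = stats_[{caller, callee}];
            e.bytes += bytes;
            e.seconds += seconds;
        }
    private:
        struct Edge { uint64_t bytes = 0; double seconds = 0.0; };
        std::map<std::pair<int, int>, Edge> stats_;
    };

    // Hypothetical binding-manager wrapper: notifies the resource manager on
    // every procedure call to, and return from, a mobile object.
    template <typename Fn>
    auto Invoke(ResourceManager& rm, int caller, int callee,
                uint64_t bytes_passed, Fn&& method) {
        auto start = std::chrono::steady_clock::now();
        auto result = method();                              // forward the call
        std::chrono::duration<double> dt =
            std::chrono::steady_clock::now() - start;
        rm.Record(caller, callee, bytes_passed, dt.count()); // notify on return
        return result;
    }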

Programming Model

• Mobile objects
– Written in C++
– Required to implement a few methods to enable the runtime system to create instances and migrate them
– Medium granularity: performs a self-contained, data-intensive processing step such as parity computation, caching, searching or aggregation
– Has private state not accessible to outside objects except via an exported interface
– Responsible for saving private state, including the state of all embedded objects, when the Checkpoint() method is called by ABACUS
– Responsible for restoring state, including creation and initialization of all embedded objects, when the runtime system invokes the Restore() method after migration to a new node

– Checkpoint and restore go to/from external file or memory

• See Figure 1
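As an illustration of this model (a sketch, not the paper's actual class definitions), a mobile object could look like the following: the runtime calls Checkpoint() before migration and Restore() at the new node, and the object is responsible for serializing its own private state, including any embedded objects.

    #include <string>
    #include <vector>

    // Hypothetical base interface an ABACUS-style runtime would require.
    class MobileObject {
    public:
        virtual ~MobileObject() = default;
        virtual void Checkpoint(std::string& out) = 0;    // save private state
        virtual void Restore(const std::string& in) = 0;  // rebuild after migration
    };

    // Example medium-granularity object: a filter whose only private state is a
    // running count of matching records.
    class FilterObject : public MobileObject {
    public:
        // Process one buffer of records, returning only the matching ones.
        std::vector<std::string> Process(const std::vector<std::string>& records) {
            std::vector<std::string> out;
            for (const std::string& r : records) {
                if (r.find("target") != std::string::npos) {   // filtering predicate
                    out.push_back(r);
                    ++matches_;
                }
            }
            return out;
        }
        void Checkpoint(std::string& out) override {
            out = std::to_string(matches_);   // state goes to a buffer or file
        }
        void Restore(const std::string& in) override {
            matches_ = std::stoul(in);        // reinitialize at the new node
        }
    private:
        unsigned long matches_ = 0;
    };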

Storage Servers

• Provide local storage objects exporting a flat file interface
• Storage objects are accessible only at the server that hosts them and never migrate
• Migratable objects lie between storage objects and console objects
– Applications can declare other objects to be non-migratable
– e.g., an object that implements write-ahead logging can be declared by the filesystem as non-migratable
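A rough sketch of the kind of flat-file interface a non-migratable storage object might export (method names here are illustrative, not the prototype's actual API):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical flat-file storage object: fixed to its server, never migrates.
    class StorageObject {
    public:
        virtual ~StorageObject() = default;
        // Read `len` bytes of flat-file object `oid` starting at `offset`.
        virtual std::vector<uint8_t> Read(uint64_t oid, uint64_t offset,
                                          std::size_t len) = 0;
        // Write a buffer into flat-file object `oid` at `offset`.
        virtual void Write(uint64_t oid, uint64_t offset,
                           const std::vector<uint8_t>& data) = 0;
    };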

Iterative Processing Model

• Synchronous invocations start at the top-level console object and propagate down the object graph

• Amount of data moved is an application-specific number of records, rather than the entire file or data set

• ABACUS accumulates statistics on return from method invocations to make object migration decisions
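A sketch of this iteration pattern (illustrative only): the console side repeatedly pulls one application-chosen batch of records through the graph rather than requesting the whole data set, which gives the runtime regular points at which to update statistics and consider migration.

    #include <cstddef>
    #include <string>
    #include <vector>

    // Hypothetical downstream stage (e.g., a filter layered over storage objects).
    struct Stage {
        std::size_t remaining = 10;   // pretend the data set holds 10 batches
        // Returns up to `batch` records; an empty vector signals end of data.
        std::vector<std::string> NextRecords(std::size_t batch) {
            if (remaining == 0) return {};
            --remaining;
            return std::vector<std::string>(batch, "record");
        }
    };

    // Console-side loop: synchronous invocations propagate down the object graph,
    // one buffer of records at a time.
    void ConsoleLoop(Stage& pipeline, std::size_t batch) {
        for (;;) {
            std::vector<std::string> records = pipeline.NextRecords(batch);
            if (records.empty())
                break;                // data set exhausted
            // ... consume this batch (aggregate, print, etc.) ...
            // On each return, the runtime can update its statistics and decide
            // whether some object should migrate.
        }
    }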

Object based distributed filesystem

• Filesystems composed of explicitly migratable objects (Figure 2)
– RAID
– Caching
– Application-specific functionality

• Coherent file and directory abstractions atop flat file space exported by base storage objects

• File: a stack of objects supporting the services bound to that file
– Files whose data cannot be lost include a RAID object
– When a file is opened, the top-most object is instantiated, and the lower-level objects are then instantiated (see the sketch below)
– Supports inter-client file and directory sharing
– Allows both file data and directory data to be cached and manipulated at trusted clients
– AFS-style callbacks for cache coherence
– Timestamp ordering protocol ensures that updates performed at a client are consistent before being committed at the server
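A sketch of how such a per-file stack could be assembled on open (layer names are hypothetical): each layer's constructor instantiates the layer beneath it, so creating the top-most object builds the whole stack down to the storage object.

    #include <memory>

    struct StorageObj {};                            // anchored at a storage server
    struct IsolationObj {                            // isolation/atomicity layer
        std::unique_ptr<StorageObj> below = std::make_unique<StorageObj>();
    };
    struct Raid5Obj {                                // optional RAID 5 layer
        std::unique_ptr<IsolationObj> below = std::make_unique<IsolationObj>();
    };
    struct CacheObj {                                // per-file cache index
        std::unique_ptr<Raid5Obj> below = std::make_unique<Raid5Obj>();
    };
    struct FileObj {                                 // VFS-facing top of the stack
        std::unique_ptr<CacheObj> below = std::make_unique<CacheObj>();
    };

    // Opening a file instantiates the top-most object; the rest follow.
    std::unique_ptr<FileObj> Open() {
        return std::make_unique<FileObj>();
    }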

Virtual File System Interface (VFS)

• The Virtual File System (VFS) interface hides implementation dependent parts of the file system

• BSD implemented VFS for NFS, with the aim of dispatching to different filesystems
• Manages kernel-level file abstractions in one format for all file systems
• Receives system call requests from user level (e.g. write, open, stat, link)
• Interacts with a specific file system based on mount point traversal
• Receives requests from other parts of the kernel, mostly from memory management
• (http://bukharin.hiof.no/fag/iad22099/innhold/lab/lab3/nfsnis_slides/text13.htm)

Virtual File System Interface (VFS)

• Microsoft Windows has VFS-type interfaces
• Functions of the VFS:
– Separate file-system-generic operations from their implementation
– Enable transparent access to a variety of different local file systems
• At the VFS interface, files are represented as v-nodes, which are network-wide unique numerical designators for files
• A vnode contains a pointer to its parent file system and to the file system over which it is mounted (sketched below)
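A simplified sketch of what such a vnode-like structure carries, per the description above (the fields are illustrative, not any particular kernel's definition):

    #include <cstdint>

    struct vfs;   // a mounted file system instance

    // Simplified v-node: the VFS-level representation of a file.
    struct vnode {
        uint64_t id;              // network-wide unique designator for the file
        struct vfs* parent_vfs;   // the file system this vnode belongs to
        struct vfs* mounted_here; // file system mounted over this vnode, if any
        // ... an operations vector (open, read, write, ...) would follow ...
    };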

File Graph

• File’s graph provides:
– VFS interface to applications
– Cache object
• Keeps an index of the particular object’s blocks in the shared cache kept by the ABACUS filesystem
– Optional RAID 5 object
• Stripes and maintains parity for individual files across a set of storage servers
– One or more storage objects
– RAID isolation/atomicity object anchored at storage servers
• Intercepts reads and writes to the base storage object and verifies the consistency of updates before committing
– Linux ext2 filesystem or CMU’s NASD prototype can be used for backing store

Directory Graph

• Directory object
– POSIX-like directory calls; caches directory entries
• Isolation/atomicity object
– Specialized to directory semantics for performance reasons

• Storage object

Accessing ABACUS filesystem

• Applications that include ABACUS objects directly append per-file object subgraphs onto their application object graphs

• Can be mounted as a standard file system via VFS layer redirection
– Legacy applications can use filesystem objects adaptively migrating below them

– Legacy applications themselves do not migrate

Object-based applications

• Data intensive applications decomposed into objects that search, aggregate or data mine

• Formulate applications to iterate over input data and operate on data one buffer at a time

• Encapsulate the filtering component in a C++ object and write its checkpoint and restore methods

• Applications instantiate mobile objects by making request to ABACUS run-time system

• ABACUS allocates and returns to the caller a network-wide unique run-time identifier (rid)
• Acts as a layer of indirection
• Per-node hash tables map each rid to a (node, object_reference_within_node) pair
• Data is passed by procedure call when objects are in the same address space, and by RPC when objects cross machines (see the sketch below)
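A sketch of this indirection layer (structures and helpers are hypothetical; LocalInvoke and RemoteInvoke stand in for the real call paths): each node keeps a table mapping a rid to the node currently hosting the object plus a within-node reference, and dispatch chooses a local procedure call or an RPC accordingly.

    #include <cstdint>
    #include <string>
    #include <unordered_map>

    struct Location {
        int node;            // node currently hosting the object
        uint64_t local_ref;  // object reference within that node
    };

    // Hypothetical per-node table: rid -> (node, object_reference_within_node).
    std::unordered_map<uint64_t, Location> location_table;

    extern int this_node;
    std::string LocalInvoke(uint64_t local_ref, const std::string& args);
    std::string RemoteInvoke(int node, uint64_t rid, const std::string& args);

    std::string Invoke(uint64_t rid, const std::string& args) {
        const Location& loc = location_table.at(rid);
        if (loc.node == this_node)
            return LocalInvoke(loc.local_ref, args);  // same address space
        return RemoteInvoke(loc.node, rid, args);     // cross-machine RPC
    }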

Object Migration

• Migrate from source to target:
– Binding manager blocks new calls to the migrating object

– Binding manager waits until all active invocations to migrating object have drained

– Object is locally checkpointed by invoking Checkpoint() method

– State is written into a memory buffer or an external file

– Location tables at source, target and home node are updated

– Invocations are unblocked and redirected to proper node via updated location table

– Per-node hash tables may contain stale data; if an object cannot be located, up-to-date information can be found at the object’s home node
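Putting the steps above together, a sketch of the migration sequence as described on this slide (the structures and helper names are illustrative):

    #include <string>

    // Illustrative hooks; each stands in for one of the steps listed above.
    struct MobileObjectHandle {
        void BlockNewCalls();              // binding manager blocks new calls
        void WaitForActiveCallsToDrain();  // let in-flight invocations finish
        std::string Checkpoint();          // object serializes its private state
        void DestroyLocalInstance();
    };
    void CreateAndRestoreAt(int target_node, const std::string& state);
    void UpdateLocationTables(int source, int target, int home_node);
    void UnblockAndRedirectCalls();

    void Migrate(MobileObjectHandle& obj, int source, int target, int home_node) {
        obj.BlockNewCalls();
        obj.WaitForActiveCallsToDrain();
        std::string state = obj.Checkpoint();  // to a buffer or file
        CreateAndRestoreAt(target, state);     // Restore() runs at the target
        obj.DestroyLocalInstance();
        UpdateLocationTables(source, target, home_node);
        UnblockAndRedirectCalls();             // callers follow the new location
    }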

Resource Monitoring

• Memory consumption, instructions executed per byte, stall time
– Since the runtime system redirects calls to objects, it is in a position to collect all necessary statistics
– Monitoring code is interposed between mobile-object procedure call and return
– Number of bytes transferred is recorded in a timed data-flow graph
– Moving averages of bytes moved between every pair of communicating objects in the graph (see the sketch below)
– Runtime system tracks dynamic memory allocation using wrappers around each memory allocation routine
– Linux interval timers or the Pentium cycle counter are used to count instructions executed by objects
– The time a thread is blocked is tracked by having the kernel update the blocking timers of any threads in the queue marked as blocked
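For instance, the per-edge byte counts could be smoothed with an exponentially weighted moving average; the snippet below is only a sketch of that idea, not the monitoring code ABACUS actually uses.

    #include <cstdint>
    #include <map>
    #include <utility>

    // Exponentially weighted moving average of bytes moved per monitoring
    // interval, kept for every pair of communicating objects.
    class EdgeMonitor {
    public:
        explicit EdgeMonitor(double alpha) : alpha_(alpha) {}

        void AddTransfer(int from, int to, uint64_t bytes) {
            pending_[{from, to}] += bytes;          // bytes seen this interval
        }

        // Called once per monitoring interval by the resource manager.
        void EndInterval() {
            for (auto& entry : averages_) {
                double observed = static_cast<double>(pending_[entry.first]);
                entry.second = alpha_ * observed + (1.0 - alpha_) * entry.second;
            }
            for (const auto& entry : pending_)
                averages_.emplace(entry.first, static_cast<double>(entry.second));
            pending_.clear();
        }

        double Average(int from, int to) const {
            auto it = averages_.find({from, to});
            return it == averages_.end() ? 0.0 : it->second;
        }

    private:
        double alpha_;                                   // smoothing factor in (0, 1]
        std::map<std::pair<int, int>, uint64_t> pending_;
        std::map<std::pair<int, int>, double> averages_;
    };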

Dynamic Placement

• Net benefit
– Server-side resource manager collects per-object measurements

– Receives statistics about client processor speed and current load

– Given the data-flow graph between objects and the latency of the client-server link, the model estimates the change in stall time if an object changes location

– Model also estimates change in execution time for other objects executing at target node
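The paper specifies the actual analytic model; purely to illustrate the flavor of such a calculation (the names and simplifications below are assumptions, not the paper's formulas), one might estimate the change in stall time from the bytes that would cross the link under each placement and migrate only when the benefit outweighs the migration cost.

    // Illustrative-only net-benefit estimate; the real model also accounts for
    // changes in execution time for other objects at the target node.
    struct PlacementInput {
        double bytes_over_link_now;    // bytes/s crossing the network currently
        double bytes_over_link_moved;  // bytes/s that would cross after relocation
        double link_bandwidth;         // bytes/s of the client-server link
        double migration_cost_s;       // time to checkpoint, ship, and restore
        double horizon_s;              // window over which the benefit is amortized
    };

    // A positive result suggests moving the object is worthwhile.
    double EstimatedNetBenefitSeconds(const PlacementInput& in) {
        double stall_now   = in.horizon_s * in.bytes_over_link_now   / in.link_bandwidth;
        double stall_moved = in.horizon_s * in.bytes_over_link_moved / in.link_bandwidth;
        return (stall_now - stall_moved) - in.migration_cost_s;
    }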

Example – Placement of RAID Object

• Figure 3:
– For Client A, the RAID object runs at the client; for Client B, the RAID object runs at the storage device

– If LAN is slow, better if RAID object runs at storage device

• Figure 4:
– Two clients writing 4MB files sequentially

– Stripe size is 5 (4 data + parity)

– Stripe unit is 32KB

– In LAN case, better to execute on server

– In the SAN case, running the RAID object locally at the client is 1.3X faster

– In degraded read case, client based RAID wins (due to computational cost of doing reconstruction)

Placement of filter

• Vary filter’s selectivity and CPU consumption

• High selectivity filters are better on server

• If the filter is computationally expensive enough, it can be better to run it at the client (see the illustration below)
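To make the trade-off concrete, a rough back-of-the-envelope comparison (illustrative parameters only, ignoring pipelining of computation with communication):

    // A highly selective filter (small `selectivity`) favors the server, since
    // only the matching records cross the network; a CPU-heavy filter on a slow
    // or loaded server favors shipping the raw data to the client instead.
    double TimeIfFilterAtServer(double n_bytes, double selectivity,
                                double server_sec_per_byte, double net_bytes_per_sec) {
        return n_bytes * server_sec_per_byte               // filter at the server
             + selectivity * n_bytes / net_bytes_per_sec;  // ship only the matches
    }

    double TimeIfFilterAtClient(double n_bytes, double client_sec_per_byte,
                                double net_bytes_per_sec) {
        return n_bytes / net_bytes_per_sec                 // ship the whole input
             + n_bytes * client_sec_per_byte;              // filter at the client
    }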

David F. Nagle, Gregory R. Ganger, Jeff Butler, Garth Gibson, and Chris Sabol, "Network Support for Network-Attached Storage," Hot Interconnects 1999, August 18-20, 1999, Stanford University, Stanford, California.

Network Support for Network Attached Storage

• Enable scalable storage systems in ways that minimize the file manager bottleneck
– Homogeneous network of trusted clients that issue unchecked commands to shared storage
• Poor security and integrity (anybody can read or write to anything!)

– NetSCSI
• Minimal changes to the hardware and software of SCSI disks

• NetSCSI disks send data directly to clients

• Cryptographic hashes and encryption, verified by the NetSCSI disks, provide integrity and privacy

• File manager still involved in each storage access

• Translates namespaces and sets up third party transfer on each request

Network Attached Secure Disks (NASD)

• NASD architecture provides a command interface that reduces the number of client-storage interactions

• Data intensive operations go right to disk, less common policy making operations (e.g. namespace and access control) go to the file manager

• See Figure 1 for scheme

• NASD drives map and authorize requests for disk sectors

• Time-limited capability provided by the file manager (see the sketch below)
– Allows access to a given file’s map and contents
– Storage mapping metadata is maintained by the drive
• Smart drives can exploit knowledge of their own resources to optimize data layout, read-ahead and cache management
– NASD drive exports a “namespace”
• Describes file objects, which can generalize to banks of striped files
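As a rough sketch of the capability idea (the fields and checks below are illustrative; the NASD work defines the actual format and cryptography), the drive verifies that a request is covered by an unexpired capability minted by the file manager before executing it.

    #include <cstdint>
    #include <ctime>

    // Illustrative time-limited capability handed out by the file manager.
    struct Capability {
        uint64_t object_id;  // which NASD object it grants access to
        uint32_t rights;     // e.g. read/write permission bits
        time_t expires;      // revocation by expiration time
        // A real capability would also carry a keyed MAC so the drive can check
        // that the file manager issued it and that it has not been altered.
    };

    enum : uint32_t { RIGHT_READ = 1, RIGHT_WRITE = 2 };

    // Drive-side check before mapping and executing the request.
    bool Authorize(const Capability& cap, uint64_t object_id,
                   uint32_t requested_rights, time_t now) {
        if (cap.object_id != object_id) return false;
        if (now >= cap.expires) return false;  // capability has expired
        return (cap.rights & requested_rights) == requested_rights;
    }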

NASD Implementation

• Ported AFS and NFS to use the interface
• Implemented a striped version of NFS on top of the interface

• In the NASD/AFS and NASD/NFS filesystems, frequent data-moving operations occur directly between the client and the NASD drive
– NFS: stateless server, weak cache consistency
– Based on the client’s RPC opcode, RPC destination addresses are modified to deliver requests to the NASD drive

• AFS port had to maintain the sequential consistency guarantees of AFS
– Used the ability of NASD capabilities to be revoked based on expiration time or object attributes

Performance Comparisons

• Compare NASD/AFS and NASD/NFS vs. Server-Attached Disk (SAD) AFS and NFS
– Computing frequent sets over 300MB of sales transactions

– Maximum bandwidth with given number of disks or NASD drives and up to 10 clients

– Bandwidth of client reading from single NASD file striped across n drives (linear scaling)

– NFS – performance of all clients reading from single file striped across n disks on server (bottlenecks near 20 MB/s)

– NFS_parallel – each client reads from separate file on an independent disk through the same server (bottlenecks near 22.5 MB/s)

Network Support

• Non-cached read/write can be serviced by modest hardware

• Requests that hit in cache need much higher bandwidth – lots of time in network stack

• Need to deliver scalable bandwidth
– Deal efficiently with small messages

– Don’t spend too much time going between OS layers

– Don’t copy data too many times

Reducing Data Copies

• Integrate buffering/caching into the OS
– Effective where caching plays a central role
• Direct user-level access to the network
– For high-bandwidth applications
• Layered NASD on top of the VI-Architecture (VIA) interface
– Integrates user-level network interface controller access with protection mechanisms (send/receive/DMA)
– Commercial VIA implementations are available, delivering full network bandwidth while consuming less than 10% of the CPU’s cycles

– Support from hardware, software vendors

Integrating VIA with NASD

• NASD software runs in kernel mode
• Drive must support the external VIA interface and semantics
• Can result in hundreds of connections and lots of RAM
• Interface (see the sketch below)
– Client preposts a set of buffers matching the read request size
– Issues the NASD read command
– Drive returns data
– Writes are analogous, but bursts require a large amount of preposted buffer space
– VIA’s remote DMA (RDMA)

• Clients send write command with pointer to data stored in client’s pinned memory

• Drive uses a VIA RDMA command to pull data out of the client’s memory
• Drive would treat the client’s RAM as an extended buffer/cache
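A sketch of the read path and the RDMA write variant described above, using placeholder primitives (ViaPostReceive, ViaSend and ViaRdmaRead are stand-ins for illustration, not the actual VIPL calls):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Placeholder VIA-style primitives (illustrative signatures only).
    void ViaPostReceive(int connection, void* buf, std::size_t len);  // prepost buffer
    void ViaSend(int connection, const void* msg, std::size_t len);   // small command
    void ViaRdmaRead(int connection, void* local_dst,                 // drive pulls
                     uint64_t remote_src, std::size_t len);           // from client RAM

    struct NasdReadCmd { uint64_t object_id, offset; uint32_t length; };

    // Client-side read: prepost buffers sized to the reply, then send the command;
    // the drive's reply lands directly in the preposted buffers (no extra copies).
    void ClientRead(int conn, const NasdReadCmd& cmd, std::vector<uint8_t>& data) {
        data.resize(cmd.length);
        ViaPostReceive(conn, data.data(), data.size());
        ViaSend(conn, &cmd, sizeof(cmd));
        // ... wait for the receive to complete ...
    }

    struct NasdWriteCmd {
        uint64_t object_id, offset;
        uint32_t length;
        uint64_t client_buffer_addr;  // data stays in the client's pinned memory
    };

    // Drive-side RDMA write variant: the command carries only a pointer, and the
    // drive pulls the data from the client's pinned memory when it is ready,
    // effectively treating client RAM as an extended buffer/cache.
    void DriveHandleWrite(int conn, const NasdWriteCmd& cmd, void* staging) {
        ViaRdmaRead(conn, staging, cmd.client_buffer_addr, cmd.length);
        // ... then commit the staged data to the object store ...
    }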

Network Striping and Incast

• File striping across multiple storage devices
– Poor support for incast (many-to-one traffic)

– Client should receive equal bandwidth from each source

– Poor performance (Figure 4)
