Positioning Dynamic Storage Caches for Transient Data

Positioning Dynamic Storage Caches for Transient Data Sudharshan Vazhkudai Oak Ridge National Lab Douglas Thain University of Notre Dame Xiaosong Ma North Carolina State Univ. Vince Freeh North Carolina State Univ. High Performance I/O Workshop at IEEE Cluster Computing 2006


Page 1: Positioning Dynamic Storage Caches for Transient Data

Positioning Dynamic Storage Caches for Transient Data

Sudharshan Vazhkudai, Oak Ridge National Lab
Douglas Thain, University of Notre Dame
Xiaosong Ma, North Carolina State Univ.
Vince Freeh, North Carolina State Univ.

High Performance I/O Workshop at IEEE Cluster Computing 2006

Page 2: Positioning Dynamic Storage Caches for Transient Data

Problem Space

• Data Deluge
– Experimental facilities: SNS, LHC (PBs/yr)
– Observatories: sky surveys, world-wide telescopes
– Simulations from NLCF end-stations
– Internet archives: NIH GenBank (serves 100 gigabases of sequence data)

• Typical user access traits on large scientific data
– Download remote datasets using favorite tools
   • FTP, GridFTP, hsi, wget
– Shared interest among groups of researchers
   • A bioinformatics group collectively analyzes and visualizes a sequence database for a few days: locality of interest!
– Oftentimes, discard original datasets after interest dissipates

Page 3: Positioning Dynamic Storage Caches for Transient Data

Existing Storage Models

• Local Disk
– High bandwidth local access to small data.

• Distributed File Systems and NAS
– Medium bandwidth for dist/shared data.

• Mass Storage ($)
– High latency access for disaster recovery.

• Parallel Storage ($$$)
– High bandwidth shared access to large data with high reliability and fault tolerance.

Page 4: Positioning Dynamic Storage Caches for Transient Data

What’s Missing?

[Figure: diagram. Labels: Computing Cluster (CPUs), Parallel Storage, Computing Cluster (CPUs), Mass Storage, FatPipe (two links), Wide Area Networks (Medium Bandwidth, High Latency), University Cluster (CPUs), Private Workstations (CPUs).]

Page 5: Positioning Dynamic Storage Caches for Transient Data

Needed: Transient Storage

• High bandwidth
– Needs to keep up with network and archive.
– Also needs to keep up with aggressive apps. (viz?)

• Some management control.
– Capacity, bandwidth, locality are all limited.
– Need some controls in order to guarantee QoS.

• Understandable latency.
– Keep user informed about stage-in latency.
– Once staged, should have consistent latency.

• Low cost.
– Old idea: Lots of commodity disks.
– Can we scavenge space from existing systems?

• Reliability useful, but not crucial.
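One way to "keep the user informed about stage-in latency" is to report a rough time estimate before a transfer starts. The sketch below is illustrative only: it assumes a constant transfer rate, and the function names and the example figures (4 GB at 50 MB/s) are hypothetical, not taken from the slides.

```python
def estimate_stage_in_seconds(dataset_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Rough stage-in time, assuming the transfer runs at a constant rate."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return dataset_bytes / bandwidth_bytes_per_s


def format_eta(seconds: float) -> str:
    """Render a duration in seconds as h:mm:ss for display to the user."""
    total = int(round(seconds))
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


if __name__ == "__main__":
    # Hypothetical example: a 4 GB dataset over a 50 MB/s path.
    eta = estimate_stage_in_seconds(4 * 10**9, 50 * 10**6)
    print("Estimated stage-in time:", format_eta(eta))
```

A real system would refine the estimate from measured throughput as the transfer proceeds.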

Page 6: Positioning Dynamic Storage Caches for Transient Data

Transient Storage: Use Cases

• Checkpointing Large Computations
– Don’t need to keep all forever!

• Impedance Matching for Large Outputs
– Evacuate CPUs, then trickle data to archive.

• Caching Large Inputs
– Share same data among many local users.

• Out of Core Datasets
– Large temporary array split across caches.

Page 7: Positioning Dynamic Storage Caches for Transient Data

A Real Example: Grid3 (OSG)

Robert Gardner, et al. (102 authors), "The Grid3 Production Grid: Principles and Practice," IEEE HPDC 2004

"The Grid2003 Project has deployed a multi-virtual organization, application-driven grid laboratory that has sustained for several months the production-level services required by… ATLAS, CMS, SDSS, LIGO…"

Page 8: Positioning Dynamic Storage Caches for Transient Data

Grid2003: The Details

The good news:
– 27 sites with 2800 CPUs
– 40985 CPU-days provided over 6 months
– 10 applications with 1300 simultaneous jobs

The bad news on ATLAS jobs:
– 40-70 percent utilization
– 30 percent of jobs would fail.
– 90 percent of failures were site problems
– Most site failures were due to disk space!
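The failure percentages above chain together: if 30 percent of jobs fail and 90 percent of those failures are site problems, then about 27 percent of all jobs fail because of site problems. The small sketch below works this through for a hypothetical batch of 1000 jobs; the batch size and function name are assumptions for illustration.

```python
def site_failure_share(fail_pct: int, site_pct_of_failures: int, jobs: int = 1000):
    """Return (failed jobs, jobs failed due to site problems) for a batch.

    Integer arithmetic is used so the counts are exact.
    """
    failed = jobs * fail_pct // 100
    site_failed = failed * site_pct_of_failures // 100
    return failed, site_failed


if __name__ == "__main__":
    failed, site_failed = site_failure_share(30, 90)
    print(f"Of 1000 jobs: {failed} fail, {site_failed} of them from site problems")
```

The slide says only that "most" site failures were due to disk space, so no exact share of disk-space failures can be derived from these figures.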

Page 9: Positioning Dynamic Storage Caches for Transient Data

Two Transient Storage Projects

• FreeLoader
– Oak Ridge Natl Lab and North Carolina State U
– Scavenge unused desktop storage.
– Provide a large cache for archival backends.
– Modify scientific apps slightly for direct access.

• Tactical Storage
– University of Notre Dame
– Use comp. cluster storage as flexible substrate.
– Configure subsets for distinct needs.
– Filesystem interfaces for existing apps.

Page 10: Positioning Dynamic Storage Caches for Transient Data

Desktop Storage Scavenging?

• FreeLoader
– Imagine Condor for storage

• Harness the collective storage potential of desktop workstations ~ Harnessing idle CPU cycles
– Increased throughput due to striping
   • Split large datasets into pieces, Morsels, and stripe them across desktops

• Scientific data trends
– Usually write-once-read-many
– Remote copy held elsewhere
– Primarily sequential accesses

• Data trends + LAN-Desktop Traits + user access patterns make collaborative caches using storage scavenging a viable alternative!
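The striping idea (split a dataset into morsels and spread them across desktops) can be sketched as follows. This is a minimal illustration, not FreeLoader's actual placement algorithm: the round-robin policy, the fixed morsel size, and the function and host names are all assumptions.

```python
from typing import Dict, List, Tuple


def stripe_morsels(
    dataset_size: int, morsel_size: int, desktops: List[str]
) -> Dict[str, List[Tuple[int, int]]]:
    """Assign (offset, length) morsels to desktops in round-robin order."""
    if morsel_size <= 0:
        raise ValueError("morsel_size must be positive")
    if not desktops:
        raise ValueError("need at least one desktop")
    placement: Dict[str, List[Tuple[int, int]]] = {d: [] for d in desktops}
    offset = 0
    index = 0
    while offset < dataset_size:
        # The last morsel may be shorter than morsel_size.
        length = min(morsel_size, dataset_size - offset)
        placement[desktops[index % len(desktops)]].append((offset, length))
        offset += length
        index += 1
    return placement


if __name__ == "__main__":
    # Hypothetical example: a 10-byte dataset, 4-byte morsels, two desktops.
    print(stripe_morsels(10, 4, ["desk-a", "desk-b"]))
```

Because consecutive morsels land on different machines, a sequential reader can fetch from several desktops in parallel, which is where the throughput gain from striping comes from.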

Page 11: Positioning Dynamic Storage Caches for Transient Data

Properties of Desktop Machines

• Desktop capabilities better than ever before

• Ratio of used space to available storage is low in academic and industry settings

• Increasing numbers of workstations online most of the time
– At ORNL-CSMD, ~600 machines are estimated to be online at any given time
– At NCSU, > 90% availability of 500 machines

• Well-connected, secure LAN settings
– A high-speed LAN connection can stream data faster than local disk I/O

Page 12: Positioning Dynamic Storage Caches for Transient Data

FreeLoader Environment

Page 13: Positioning Dynamic Storage Caches for Transient Data

FreeLoader Architecture

• Lightweight UDP
• Scavenger device: metadata bitmaps, morsel organization
• Morsel service layer
• Monitoring and Impact control
• Global free space management
• Metadata management
• Soft-state registrations
• Data placement
• Cache management
• Profiling
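"Soft-state registrations" generally means that a registration expires unless it is refreshed, so machines that go offline drop out of the pool without explicit deregistration. The sketch below shows that general pattern feeding a global free-space count. The class, its methods, the time-to-live value, and the node names are hypothetical; the slides do not describe FreeLoader's actual protocol or data structures.

```python
from typing import Dict, List, Tuple


class SoftStateRegistry:
    """Tracks nodes that have registered recently; stale entries are ignored."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        # node name -> (time of last registration, advertised free bytes)
        self._entries: Dict[str, Tuple[float, int]] = {}

    def register(self, node: str, free_bytes: int, now: float) -> None:
        """Record or refresh a node's registration at time `now`."""
        self._entries[node] = (now, free_bytes)

    def live_nodes(self, now: float) -> List[str]:
        """Nodes whose last registration is within the time-to-live."""
        return sorted(
            node for node, (seen, _) in self._entries.items() if now - seen <= self.ttl
        )

    def total_free(self, now: float) -> int:
        """Sum of free space advertised by live nodes only."""
        return sum(
            free for seen, free in self._entries.values() if now - seen <= self.ttl
        )


if __name__ == "__main__":
    registry = SoftStateRegistry(ttl_seconds=30)
    registry.register("desk-1", 100, now=0)
    registry.register("desk-2", 50, now=20)
    print(registry.live_nodes(now=25), registry.total_free(now=25))
    print(registry.live_nodes(now=40), registry.total_free(now=40))
```

Time is passed in explicitly so the behavior is easy to test; a deployed service would read a clock instead.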

Page 14: Positioning Dynamic Storage Caches for Transient Data

Comparing FreeLoader with other storage systems

[Figure: throughput (MB/sec, axis from 0 to 120) versus dataset size (512MB, 4GB, 32GB, 64GB) for FreeLoader, PVFS, HPSS-Hot, HPSS-Cold, RemoteNFS, and wget-ncbi.]

Page 15: Positioning Dynamic Storage Caches for Transient Data

Tactical Storage Systems (TSS)

• A TSS allows any node to serve as a file server or as a file system client.

• All components can be deployed without special privileges – but with security.

• Users can build up complex structures.
– Filesystems, databases, caches, ...
– Admins need not know/care about larger structures.

• Two Independent Concepts:
– Resources – The raw storage to be used.
– Abstractions – The organization of storage.

Page 16: Positioning Dynamic Storage Caches for Transient Data

[Figure: diagram. Labels: App, Parrot, Simple Filesystem, Distributed Database Abstraction, Distributed Filesystem Abstraction, file transfer, 3PT, file servers and filesystems on UNIX hosts. Captions: "Cluster administrator controls policy on all storage in cluster." "Workstation owners control policy on each machine."]

Page 17: Positioning Dynamic Storage Caches for Transient Data

Applications: High BW Access to Astrophys Data

[Figure: diagram. Labels: General Purpose Computing Cluster (CPU and Disk nodes), Tape Archive, Scratch Disk, Adapter, "tcsh, cp, vi, emacs, fortran...", 10 TB Logical Volume (Disks), GBs/Day.]

Page 18: Positioning Dynamic Storage Caches for Transient Data

Applications: High BW Access to Biometric Data

[Figure: diagram. Labels: General Purpose Computing Cluster (CPU and Disk nodes), Storage Archive (Disks), Gb Ethernet, Jobs, NFS I/O.]

Page 19: Positioning Dynamic Storage Caches for Transient Data

Applications: High BW Access to Biometric Data

[Figure: diagram. Labels: General Purpose Computing Cluster (CPU and Disk nodes), Storage Archive (Disks), Gb Ethernet, Controlled Replication, Jobs.]

Page 20: Positioning Dynamic Storage Caches for Transient Data
Page 21: Positioning Dynamic Storage Caches for Transient Data

Open Problems

• Combining Technologies
– A filesystem interface for FreeLoader.
– Making TSS harness FL benefactors.

• Seamless Data Migration
– Not easy to move between parallel systems!
– Can transient storage “match impedance?”

• Performance Adaptation
– Many axes: BW, Latency, Locality, Mgmt.
– Can we have a system that allows for a more continuous tradeoff or reconfiguration?

Page 22: Positioning Dynamic Storage Caches for Transient Data

Take-Home Message

Big, fast storage archives are important, but...

Making transient storage usable, accessible, and high performance is critical to improving the end-user experience.