Exa-Scale Volunteer Computing

David P. Anderson

Space Sciences Laboratory, U.C. Berkeley

Outline

• Volunteer computing

• BOINC

• Applications

• Research directions

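[Figure: computing paradigms arranged by number of processors (1, 100, 1000, 10K-1M) and by workload (single job vs. multiple jobs). Starting point: program runs too slow on PC. High-performance computing: cluster (MPI), supercomputer. High-throughput computing: cluster (batch), Grid, commercial cloud, volunteer computing.]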

Volunteer computing

• Early projects

– 1997: GIMPS, distributed.net

– 1999: SETI@home, Folding@home

• Today

– 50 projects

– 500K volunteers

– 900K computers

– 10 PetaFLOPS

The PetaFLOPS barrier

• September 2007: Folding@home

• January 2008: BOINC

• June 2008: IBM Roadrunner

ExaFLOPS

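• Current PetaFLOPS breakdown:

[Figure: bar chart of PetaFLOPS by processor type, vertical axis 0 to 5. Categories in extracted order: NVIDIA, CPU, PS3 (Cell), ATI. Bar values in extracted order: 4.6, 2.4, 2.2, 1.2.]

• Potential: ExaFLOPS by 2010

– 4M GPUs * 1 TFLOPS * 0.25 availability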
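
The ExaFLOPS estimate is a single multiplication. A quick check in Python, using only the three numbers on the slide (the variable and function names are illustrative):

```python
# Back-of-the-envelope check of the slide's ExaFLOPS estimate.
# The three inputs come from the slide; the names are illustrative.
NUM_GPUS = 4_000_000      # 4M volunteered GPUs
FLOPS_PER_GPU = 1e12      # 1 TFLOPS each
AVAILABILITY = 0.25       # fraction of time each GPU is usable


def sustained_flops(num_gpus: int, flops_per_gpu: float, availability: float) -> float:
    """Aggregate sustained throughput in FLOPS."""
    return num_gpus * flops_per_gpu * availability


total = sustained_flops(NUM_GPUS, FLOPS_PER_GPU, AVAILABILITY)
print(f"{total / 1e18:.1f} ExaFLOPS")
```

The product is 10^18 FLOPS, i.e. one ExaFLOPS, which is where the slide's figure comes from.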

BOINC

• Middleware for volunteer computing

– client, server, web

• Based at UC Berkeley Space Sciences Lab

• Open source (LGPL)

• NSF-funded since 2002

• http://boinc.berkeley.edu

BOINC: volunteers and projects

[Diagram: volunteers linked by attachments to projects (CPDN, LHC@home, WCG)]

The BOINC computing ecosystem

Goals:

• Better research gets more computing power

• The public decides what’s better

[Diagram: the world’s computing power, scientific research, the public]

BOINC software overview

[Diagram: volunteer host (client, apps, screensaver, GUI) talks over HTTP to the project server (scheduler, MySQL, data server, daemons)]

Scheduler RPC

• Request:

– hardware, software description

– work requests (CPU, GPU)

– completed jobs

• Reply:

– application descriptions

– job descriptions
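
The request and reply contents listed above can be pictured as two small records. This is a minimal sketch only: every field and function name is hypothetical, the unit of a work request (seconds) is an assumption, and the real exchange carries far more detail than shown here.

```python
from dataclasses import dataclass, field

# Hypothetical record layout mirroring the slide's bullet points.
# None of these names are taken from the BOINC source code.


@dataclass
class SchedulerRequest:
    host_description: dict            # hardware and software description
    cpu_seconds_requested: float      # work request for CPUs (unit assumed)
    gpu_seconds_requested: float      # work request for GPUs (unit assumed)
    completed_jobs: list = field(default_factory=list)


@dataclass
class SchedulerReply:
    app_descriptions: list = field(default_factory=list)
    job_descriptions: list = field(default_factory=list)


def handle_request(req: SchedulerRequest) -> SchedulerReply:
    """Toy server: one placeholder job per resource that asked for work."""
    reply = SchedulerReply()
    if req.cpu_seconds_requested > 0:
        reply.job_descriptions.append({"name": "job_cpu_0", "resource": "CPU"})
    if req.gpu_seconds_requested > 0:
        reply.job_descriptions.append({"name": "job_gpu_0", "resource": "GPU"})
    # One application description per job handed out.
    reply.app_descriptions = [
        {"app": "example_app", "resource": job["resource"]}
        for job in reply.job_descriptions
    ]
    return reply


request = SchedulerRequest({"os": "linux", "ncpus": 4}, 3600.0, 0.0)
reply = handle_request(request)
```

Here the host asks for CPU work only, so the toy reply contains a single CPU job.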

Client: job scheduling

• Queue lots of jobs

– to avoid starvation

– for variety

• Job scheduling

– Round-robin time-slicing

– Earliest deadline first
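
The slide names two policies without saying how they are combined. One plausible combination, sketched below, is to predict completion times under round-robin and fall back to earliest deadline first only when some job would miss its deadline. The switch rule, the `Job` fields, and the idealised equal-share model of round-robin are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    remaining: float  # estimated CPU seconds still needed
    deadline: float   # seconds from now


def round_robin_finish_times(jobs, ncpus=1):
    """Finish time of each job under idealised round-robin:
    unfinished jobs share the CPUs equally, at most one CPU per job."""
    finish = {}
    now = 0.0
    done = 0.0  # CPU seconds every still-active job has received so far
    active = sorted(jobs, key=lambda j: j.remaining)
    while active:
        rate = min(1.0, ncpus / len(active))
        shortest = active.pop(0)
        now += (shortest.remaining - done) / rate
        done = shortest.remaining
        finish[shortest.name] = now
    return finish


def choose_policy(jobs, ncpus=1):
    """Return (policy name, job order). EDF is used only if round-robin
    is predicted to miss at least one deadline (assumed rule)."""
    finish = round_robin_finish_times(jobs, ncpus)
    if all(finish[j.name] <= j.deadline for j in jobs):
        return "round-robin", list(jobs)
    return "EDF", sorted(jobs, key=lambda j: j.deadline)


relaxed = [Job("a", 10, 100), Job("b", 20, 100)]
urgent = [Job("b", 20, 100), Job("a", 10, 15)]
print(choose_policy(relaxed)[0])
print(choose_policy(urgent)[0])
```

In the second case job `a` would finish at 20 seconds under time-slicing but is due at 15, so the sketch switches to deadline order and runs `a` first.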

Client: work fetch policy

• When? From which project? How much?

• Goals

– maintain enough work

– minimize scheduler requests

– honor resource shares

• per-project “debt”

[Figure: work queued on CPU 0 through CPU 3, with min and max levels marked]

Work fetch for GPUs: goals

• Queue work separately for different resource types

• Resource shares apply to aggregate

Example: projects A, B have same resource share

A has CPU and GPU jobs, B has only GPU jobs

[Figure: CPU and GPU usage by projects A and B]

Work fetch for GPUs

• For each resource type

– per-project backoff

– per-project debt

• accumulate only while not backed off

• A project’s overall debt is weighted average of resource debts

• Get work from project with highest overall debt
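
The debt rules above can be sketched in a few lines. This is an illustration under stated assumptions, not the client's actual code: debt is taken to be "seconds owed according to resource share, minus seconds actually used", the weights in the weighted average are supplied by the caller, and all names are hypothetical.

```python
def update_debts(debts, shares, used, backed_off, dt):
    """Update per-project debt for ONE resource type over an interval dt.

    debts, shares, used: {project: value}; backed_off: set of projects.
    Projects that are backed off for this resource accumulate nothing.
    """
    eligible = [p for p in debts if p not in backed_off]
    total_share = sum(shares[p] for p in eligible)
    for p in eligible:
        owed = dt * shares[p] / total_share
        debts[p] += owed - used.get(p, 0.0)


def overall_debt(resource_debts, weights):
    """Weighted average of one project's per-resource debts."""
    total_weight = sum(weights[r] for r in resource_debts)
    return sum(weights[r] * d for r, d in resource_debts.items()) / total_weight


def pick_project(per_project_debts, weights):
    """Project with the highest overall debt."""
    return max(per_project_debts,
               key=lambda p: overall_debt(per_project_debts[p], weights))


# Project A has CPU and GPU debts, project B only a GPU debt.
debts = {"A": {"CPU": 10.0, "GPU": -5.0}, "B": {"GPU": 8.0}}
weights = {"CPU": 1.0, "GPU": 4.0}  # assumed relative weights
print(pick_project(debts, weights))
```

With these numbers A's overall debt is (10 - 20) / 5 = -2 and B's is 8, so work is fetched from B.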

Scheduling: server

• Possible outcomes of a job:

– success

– runs but returns wrong answer

– doesn’t run, returns wrong answer (hacker)

– crashes, client reports it

– never hear from client again

• Job delay bounds

• Replicated computing

– homogeneous replication
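
Replicated computing guards against the wrong-answer outcomes listed above by running several instances of a job and accepting an answer only when enough of them agree. A minimal sketch, assuming results can be compared for exact equality and that a missing result is represented as `None`; the function name and the default quorum of 2 are illustrative.

```python
from collections import Counter


def canonical_result(results, quorum=2):
    """Return the answer reported by at least `quorum` instances, else None.

    `results` holds one entry per job instance; None stands for an
    instance that crashed or was never heard from again.
    """
    reported = [r for r in results if r is not None]
    if not reported:
        return None
    answer, count = Counter(reported).most_common(1)[0]
    return answer if count >= quorum else None


print(canonical_result([42, 42, 17]))
print(canonical_result([42, None]))
```

The first call reaches a quorum on 42 despite one wrong answer; the second does not, so in this sketch the server would need to issue another instance.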

Server abstractions

[Diagram: applications have app versions (Win32, Win32 N-core, Win32 + NVIDIA, Win64, Mac OS X) and jobs; jobs have instances]

Scheduler overview

[Diagram: MySQL, feeder, shared-memory job cache, schedulers, client]

How scheduler chooses app versions

• App versions have project-supplied “planning function”

• Inputs:

– host description

• Outputs:

– Whether host can run app version

– Resource usage (#CPUs, #GPUs)

– expected FLOPS

App version selection

• Call planning function for platform’s app versions

• Skip versions that use resources for which no work is being requested

• Use the version with highest expected FLOPS

• Repeat this when a resource request is satisfied
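
The planning function and the selection steps above fit together roughly as follows. This is a sketch under assumptions: the host is a plain dictionary, a version counts as a "GPU version" whenever it uses any GPU, and every name (`Plan`, `AppVersion`, `choose_version`, the dictionary keys) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Plan:
    ncpus: float   # CPUs used
    ngpus: float   # GPUs used
    flops: float   # expected FLOPS on this host


@dataclass
class AppVersion:
    name: str
    plan: Callable[[dict], Optional[Plan]]  # returns None if host cannot run it


def plan_cpu(host):
    return Plan(ncpus=1, ngpus=0, flops=host["cpu_flops"])


def plan_gpu(host):
    if host.get("gpu_flops", 0) <= 0:
        return None  # no usable GPU on this host
    return Plan(ncpus=0.1, ngpus=1, flops=host["gpu_flops"])


def choose_version(versions, host, want_cpu, want_gpu):
    """Pick the runnable version with the highest expected FLOPS,
    skipping versions whose resource is not being requested."""
    best, best_plan = None, None
    for version in versions:
        plan = version.plan(host)
        if plan is None:
            continue
        uses_gpu = plan.ngpus > 0
        if (uses_gpu and not want_gpu) or (not uses_gpu and not want_cpu):
            continue
        if best_plan is None or plan.flops > best_plan.flops:
            best, best_plan = version, plan
    return best, best_plan


versions = [AppVersion("cpu", plan_cpu), AppVersion("gpu", plan_gpu)]
host = {"cpu_flops": 3e9, "gpu_flops": 1e11}
chosen, _ = choose_version(versions, host, want_cpu=True, want_gpu=True)
print(chosen.name)
```

Once the GPU request is satisfied, calling `choose_version` again with `want_gpu=False` corresponds to the "repeat" step and falls back to the CPU version.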

Anonymous platform mechanism

• The idea: volunteer supplies app versions. Why?

– security

– optimization

– unsupported platforms

Science areas using BOINC

• Biology

– protein study, genetic analysis

• Medicine

– drug discovery, epidemiology

• Physics

– LHC, nanotechnology, quantum computing

• Astronomy

– data analysis, cosmology, galactic modeling

• Environment

– climate modeling, ecosystem simulation

• Math

• Graphics rendering

Application types

• Computing-intensive analysis of large data

• Physical simulations

• Genetic algorithms

– GA-inspired optimization

• Non-CPU-intensive

– Internet study

– distributed sensor network

Malariacontrol.net

Simulation models of the transmission dynamics and health effects of malaria are an important tool for malaria control. They can be used to determine optimal strategies for delivering mosquito nets, chemotherapy, or new vaccines which are currently under development and testing.

Climateprediction.net

Einstein@home

• Gravitational waves; gravitational pulsars

SETI@home

Milkyway@home

GPUGRID.net

AQUA@home

• D-Wave Systems

• Simulation of “adiabatic quantum algorithms” for binary quadratic optimization

Collatz Conjecture

• even N → N/2

• odd N → 3N + 1

• always goes to 1?
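
The iteration is easy to state in code. The sketch below counts the steps needed to reach 1 from a given start; the function name and the `max_steps` safety cap are illustrative, and this is not the project's actual search code.

```python
def collatz_steps(n: int, max_steps: int = 10_000) -> int:
    """Number of Collatz steps needed to reach 1 from n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    steps = 0
    while n != 1:
        if steps >= max_steps:
            raise RuntimeError("step limit reached before hitting 1")
        # even N -> N/2, odd N -> 3N + 1
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


print(collatz_steps(6))   # 6, 3, 10, 5, 16, 8, 4, 2, 1 -> 8 steps
print(collatz_steps(27))  # a famously long trajectory: 111 steps
```

Each starting value can be checked independently of all others, which is what makes the problem a natural fit for volunteer computing.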

Quake Catcher Network

Organizational models

Umbrella projects

• Institutional

– Lattice, VTU@home

• Corporate

– IBM World Community Grid

• Community

– AlmereGrid

• Research community

– MindModeling.org

[Diagram: project; publicity, web development, sysadmin]

Volunteer computing research

• Goals (mutually incompatible)

– maximize throughput

– minimize makespan of job batches

– minimize average time until credit

– minimize network traffic

– minimize server disk usage

Characterizing hosts

• What are good models? What are correlations with other characteristics? How to model churn?

• BOINC client is instrumented to log all this; have data from 200K hosts over 1 year

Mining for Statistical Models of Availability in Large-Scale Distributed Systems: An Empirical Study of SETI@home. Bahman Javadi, Derrick Kondo, Jean-Marc Vincent, David P. Anderson. 17th Annual Meeting of the IEEE/ACM International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems, Sept 21-23 2009, London.

On Correlated Availability in Internet-Distributed Systems. Derrick Kondo, Artur Andrzejak, and David P. Anderson. 9th IEEE/ACM International Conference on Grid Computing (Grid 2008), Tsukuba, Japan, Sept 29 - Oct 1 2008.

[Figure: host states: powered on, connected, available]

Studying server scheduling policies

• EmBOINC: BOINC project emulator

Performance Prediction and Analysis of BOINC Projects: An Empirical Study with EmBOINC. Trilce Estrada, Michela Taufer, David Anderson. To appear, Journal of Grid Computing.

EmBOINC: An Emulator for Performance Analysis of BOINC Projects. Trilce Estrada, Michela Taufer, Kevin Reed, David Anderson. 3rd Workshop on Desktop Grids and Volunteer Computing Systems (PCGrid 2009), May 29, 2009, Rome.

[Diagram: MySQL, feeder, shared-memory job cache, scheduler; simulator of a large, dynamic set of volunteer hosts]

Studying client scheduling policies

• BOINC client simulator

– simulates a client connected to several projects

– based on actual client code

Performance Evaluation of Scheduling Policies for Volunteer Computing. Derrick Kondo, David P. Anderson and John McLeod VII. 3rd IEEE International Conference on e-Science and Grid Computing. Bangalore, India, December 10-13 2007.

Local Scheduling for Volunteer Computing. David P. Anderson and John McLeod VII. Workshop on Large-Scale, Volatile Desktop Grids (PCGrid 2007) held in conjunction with the IEEE International Parallel & Distributed Processing Symposium (IPDPS), March 30, 2007, Long Beach.

Supporting distributed applications

• Volpex: Linda-like “dataspace” system

– MPI layer

– centralized implementation

– fault tolerance, performance issues

A Communication Framework for Fault-tolerant Parallel Execution. Nagarajan Kanna, Jaspal Subhlok, Edgar Gabriel, Eshwar Rohit and David Anderson. The 22nd International Workshop on Languages and Compilers for Parallel Computing, Newark, Delaware, Oct 8-10 2009.

Using virtual machines

• App version is VM wrapper + virtual machine image

• VM image may contain the client of a non-BOINC distributed batch system

[Diagram: BOINC client, VM wrapper, hypervisor (VirtualBox, kQEMU, etc.), VM]

Data-intensive computing

• Maintain large data set on clients

– 10 years of radio telescope data

– gene/protein data

• Compute against data set

– MapReduce, other models

Volunteer motivation study

• Online survey correlated with participation data

• Survey is currently being designed

• Preliminary findings:

– Talk is cheap: claimed motivations not supported by data

– Team members contribute more

– Contribution decreases over time (especially for non-team members)

Conclusion

• Volunteer computing: Exa-scale potential

– GPUs are crucial

• BOINC: enabling technology

• Bottlenecks

– organizational models

– public awareness

• Lots of research opportunities
