Exa-Scale Volunteer Computing
David P. Anderson, Space Sciences Laboratory, U.C. Berkeley

Outline
- Volunteer computing
- BOINC
- Applications
- Research directions

High-throughput vs. high-performance computing
If a program runs too slowly on a PC, the options fall along two axes: a single large job vs. many independent jobs, and the number of processors.
- High-performance computing (single job): cluster (MPI), supercomputer
- High-throughput computing (multiple jobs): cluster (batch), grid, commercial cloud, volunteer computing (10K-1M processors)

Volunteer computing
- Early projects: 1997: GIMPS, distributed.net; 1999: SETI@home
- Today: 50 projects, 500K volunteers, 900K computers, 10 PetaFLOPS

The PetaFLOPS barrier
- September 2007: Folding@home
- January 2008: BOINC
- June 2008: IBM Roadrunner

ExaFLOPS
- Current PetaFLOPS breakdown (chart in original slides)
- Potential: ExaFLOPS by 2010: 4M GPUs * 1 TFLOPS * 0.25 availability = 4e6 * 1e12 * 0.25 = 1e18 FLOPS

BOINC
- Middleware for volunteer computing: client, server, web
- Based at the UC Berkeley Space Sciences Lab
- Open source (LGPL)
- NSF-funded since 2002

BOINC: volunteers and projects
- Volunteers attach their computers to one or more projects (e.g. CPDN, WCG)

The BOINC computing ecosystem
- Goals: better research gets more computing power; the public decides what's better
- The ecosystem channels the world's computing power, held by the public, into scientific research

BOINC software overview
- Volunteer host: client, apps, screensaver, GUI
- Project server: scheduler, MySQL database, data server, daemons
- Client and server communicate over HTTP

Scheduler RPC
- Request: hardware and software description, work requests (CPU, GPU), completed jobs
- Reply: application descriptions, job descriptions

Client: job scheduling
- Queue lots of jobs, to avoid starvation and for variety
- Job scheduling: round-robin time-slicing; earliest deadline first (a sketch follows this outline)

Client: work fetch policy
- When? From which project? How much?
- Goals: maintain enough work, minimize scheduler requests, honor resource shares
- Per-project debt tracks how far each project has fallen behind its resource share

Work fetch for GPUs: goals
- Queue work separately for different resource types
- Resource shares apply to the aggregate
- Example: projects A and B have the same resource share; A has CPU and GPU jobs, B has only GPU jobs

Work fetch for GPUs
- For each resource type: per-project backoff and per-project debt; debt accumulates only while not backed off
- A project's overall debt is a weighted average of its resource debts
- Get work from the project with the highest overall debt (see the work-fetch sketch below)

Scheduling: server
- Possible outcomes of a job: success; runs but returns a wrong answer; doesn't run, returns a wrong answer (hacker); crashes and the client reports it; never hear from the client again
- Job delay bounds
- Replicated computing; homogeneous replication

Server abstractions
- Applications have app versions (e.g. Win32, Win32 + NVIDIA, Win32 N-core, Win64, Mac OS X)
- Jobs have instances

Scheduler overview
- MySQL database -> feeder -> shared-memory job cache -> schedulers -> clients

How the scheduler chooses app versions
- App versions have a project-supplied planning function
- Inputs: host description
- Outputs: whether the host can run the app version; resource usage (#CPUs, #GPUs); expected FLOPS

App version selection
- Call the planning function for the app versions of the host's platforms
- Skip versions that use resources for which no work is being requested
- Use the version with the highest expected FLOPS (see the selection sketch below)
- Repeat this when a resource request is satisfied
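The two client scheduling policies named above (round-robin time-slicing, with earliest deadline first when deadlines are threatened) can be illustrated with a small sketch. This is not the actual BOINC client logic: the deadline-danger test, the safety margin, and the field names (deadline, remaining, last_run) are simplifications invented for the example.

```python
import time

def pick_next_job(runnable_jobs, safety_margin=3600, now=None):
    """runnable_jobs: objects with .deadline and .last_run (epoch seconds)
    and .remaining (estimated seconds of work left). Returns the job to run next."""
    now = now if now is not None else time.time()
    # A job is in deadline trouble if its remaining work barely fits before its deadline.
    in_danger = [j for j in runnable_jobs if now + j.remaining + safety_margin > j.deadline]
    if in_danger:
        # Earliest deadline first among endangered jobs
        return min(in_danger, key=lambda j: j.deadline)
    # Otherwise round-robin time-slicing: run the job that has waited longest
    return min(runnable_jobs, key=lambda j: j.last_run)
```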
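A minimal sketch of the debt-based work fetch described above, under simplifying assumptions: the Project class, update_debts, overall_debt, and choose_project_for_work are illustrative names rather than BOINC's actual data structures, and the accounting is reduced to its essentials (debt accumulates only while a project is not backed off for that resource; the project with the highest overall debt is asked for work).

```python
RESOURCES = ["cpu", "gpu"]

class Project:
    def __init__(self, name, share, usable_resources):
        self.name = name
        self.share = share                          # resource share set by the volunteer
        self.usable = set(usable_resources)         # resources this project has apps for
        self.debt = {r: 0.0 for r in RESOURCES}     # per-resource debt
        self.backoff = {r: 0.0 for r in RESOURCES}  # seconds until requests are allowed again

def update_debts(projects, work_done, dt):
    """work_done[project_name][resource]: resource-seconds used during the last dt seconds."""
    for r in RESOURCES:
        eligible = [p for p in projects if r in p.usable and p.backoff[r] <= 0]
        total_share = sum(p.share for p in eligible) or 1.0
        for p in eligible:
            owed = dt * p.share / total_share            # what the project's share entitles it to
            got = work_done.get(p.name, {}).get(r, 0.0)  # what it actually received
            p.debt[r] += owed - got                      # accumulates only while not backed off

def overall_debt(p, weights):
    """A project's overall debt is a weighted average of its per-resource debts."""
    usable = [r for r in RESOURCES if r in p.usable]
    return sum(weights[r] * p.debt[r] for r in usable) / sum(weights[r] for r in usable)

def choose_project_for_work(projects, resource, weights):
    """Request work from the highest-debt project not backed off for this resource."""
    candidates = [p for p in projects if resource in p.usable and p.backoff[resource] <= 0]
    return max(candidates, key=lambda p: overall_debt(p, weights), default=None)
```

Averaging only over the resources a project can actually use means that a GPU-only project, like project B in the slide's example, competes for GPU work without being penalized for never consuming CPU time.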
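The app-version selection rule above can be written down almost directly; the plan() callback here is a hypothetical stand-in for the project-supplied planning function, assumed to return (can_run, resource_usage, expected_flops) for a given host and version.

```python
# Sketch of the selection rule: among the versions the host can run and that only use
# resources the client is requesting work for, pick the highest expected FLOPS.

def select_app_version(host, app_versions, requested_resources, plan):
    best, best_flops = None, 0.0
    for version in app_versions:
        result = plan(host, version)
        if result is None:
            continue
        can_run, usage, flops = result
        if not can_run:
            continue
        # Skip versions that use a resource for which no work is being requested
        if any(amount > 0 and resource not in requested_resources
               for resource, amount in usage.items()):
            continue
        if flops > best_flops:
            best, best_flops = version, flops
    return best
```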
Anonymous platform mechanism
- The idea: the volunteer supplies app versions
- Why? Security, optimization, unsupported platforms

Science areas using BOINC
- Biology: protein study, genetic analysis
- Medicine: drug discovery, epidemiology
- Physics: LHC, nanotechnology, quantum computing
- Astronomy: data analysis, cosmology, galactic modeling
- Environment: climate modeling, ecosystem simulation
- Math
- Graphics rendering

Application types
- Computing-intensive: analysis of large data, physical simulations, genetic algorithms, GA-inspired optimization
- Non-CPU-intensive: Internet study, distributed sensor networks

Example projects
- Malariacontrol.net: simulation models of the transmission dynamics and health effects of malaria are an important tool for malaria control. They can be used to determine optimal strategies for delivering mosquito nets, chemotherapy, or new vaccines which are currently under development and testing.
- Climateprediction.net
- Gravitational waves; gravitational pulsars
- GPUGRID.net
- D-Wave Systems: simulation of adiabatic quantum algorithms for binary quadratic optimization
- Collatz Conjecture: even N -> N/2, odd N -> 3N + 1; does every sequence reach 1? (a small illustration appears at the end of this section)
- Quake Catcher Network

Organizational models
- Umbrella projects
  - Institutional: Lattice
  - Corporate: IBM World Community Grid
  - Community: AlmereGrid
  - Research community: MindModeling.org
- Running a project requires publicity, web development, and system administration

Volunteer computing research
- Goals (mutually incompatible): maximize throughput; minimize makespan of job batches; minimize average time until credit; minimize network traffic; minimize server disk usage

Characterizing hosts
- What are good models? What are the correlations with other characteristics? How to model churn?
- Host states: powered on, connected, available
- The BOINC client is instrumented to log all this; we have data from 200K hosts over 1 year
- Mining for Statistical Models of Availability in Large-Scale Distributed Systems: An Empirical Study of SETI@home. Bahman Javadi, Derrick Kondo, Jean-Marc Vincent, David P. Anderson. 17th Annual Meeting of the IEEE/ACM International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS 2009), September 2009, London.
- On Correlated Availability in Internet-Distributed Systems. Derrick Kondo, Artur Andrzejak, and David P. Anderson. 9th IEEE/ACM International Conference on Grid Computing (Grid 2008), Tsukuba, Japan, Sept 29 - Oct 1, 2008.

Studying server scheduling policies
- EmBOINC: BOINC project emulator; a simulator of a large, dynamic set of volunteer hosts drives the real server components (MySQL database, feeder, shared-memory job cache, scheduler)
- Performance Prediction and Analysis of BOINC Projects: An Empirical Study with EmBOINC. Trilce Estrada, Michela Taufer, David Anderson. To appear, Journal of Grid Computing.
- EmBOINC: An Emulator for Performance Analysis of BOINC Projects. Trilce Estrada, Michela Taufer, Kevin Reed, David Anderson. 3rd Workshop on Desktop Grids and Volunteer Computing Systems (PCGrid 2009), May 29, 2009, Rome.

Studying client scheduling policies
- BOINC client simulator: simulates a client attached to several projects; based on the actual client code
- Performance Evaluation of Scheduling Policies for Volunteer Computing. Derrick Kondo, David P. Anderson and John McLeod VII. 3rd IEEE International Conference on e-Science and Grid Computing, Bangalore, India, December 2007.
- Local Scheduling for Volunteer Computing. David P. Anderson and John McLeod VII. Workshop on Large-Scale, Volatile Desktop Grids (PCGrid 2007), held in conjunction with the IEEE International Parallel & Distributed Processing Symposium (IPDPS), March 30, 2007, Long Beach.
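As the small illustration promised in the project list above: the Collatz rule maps even N to N/2 and odd N to 3N + 1, and the conjecture is that every starting value eventually reaches 1. The helper below simply counts steps and, like the project itself, assumes termination for the inputs it is given.

```python
def collatz_steps(n):
    """Count the steps for n to reach 1 under the Collatz rule
    (even N -> N/2, odd N -> 3N + 1). Non-termination for some n
    would be a counterexample to the conjecture."""
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps

# For example, collatz_steps(27) == 111.
```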
Supporting distributed applications
- Volpex: Linda-like dataspace system; MPI layer
- Centralized implementation; fault-tolerance and performance issues
- A Communication Framework for Fault-tolerant Parallel Execution. Nagarajan Kanna, Jaspal Subhlok, Edgar Gabriel, Eshwar Rohit and David Anderson. 22nd International Workshop on Languages and Compilers for Parallel Computing (LCPC 2009), Newark, Delaware, October 2009.

Using virtual machines
- An app version is a VM wrapper plus a virtual machine image
- The VM image may contain the client of a non-BOINC distributed batch system
- Structure: BOINC client -> VM wrapper -> hypervisor (VirtualBox, kQEMU, etc.) -> VM

Data-intensive computing
- Maintain a large data set on clients: 10 years of radio telescope data, gene/protein data
- Compute against the data set: MapReduce, other models

Volunteer motivation study
- Online survey correlated with participation data; the survey is currently being designed
- Preliminary findings: talk is cheap (claimed motivations are not supported by data); team members contribute more; contribution decreases over time, especially for non-team members

Conclusion
- Volunteer computing: exa-scale potential; GPUs are crucial
- BOINC: enabling technology
- Bottlenecks: organizational models, public awareness
- Lots of research opportunities