Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San...

Post on 16-Jan-2016

215 views 3 download

Tags:

Transcript of Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San...

Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees

Mark A. MillerSan Diego Supercomputer Center

Systematics is the study of the diversification of life on the planet Earth, both past and present, and the relationships among living things through time

?

Evolutionary relationships can (for the most part) be represented as a rooted graph.

Originally, evolutionary relationships were inferred from morphology alone:

Morphological characters are scored “by hand” to create matrices of characters.

Scoring occurs via low volume/low throughput methodologies

Even though tree inference is NP hard, matrices created using morphological characters alone are typically relatively small, so computations are relatively tractable (with heuristics developed by the community)

Originally, evolutionary relationships were inferred from morphology alone:

Evolutionary relationships are also inferred from DNA sequence comparisons:

Unlike morphological characters, DNA sequence determination is now fully automated.

The increase in DNA sequences with time is faster than Moore’s law.

There are at least 107 species, each with 3000 - 30,000 genes, so the needfor computational power and new tools will continue to grow.

Tree inference is NP hard, so even with heuristics, computational powerfrequently limits the analysis.

Analyses often involve 1000’s of species, and 1000’s of characters, creatingvery large matrices.

Evolutionary relationships are also inferred from DNA sequence comparisons:

The CIPRES Project was created to support this new age of large phylogenetic data sets. The project had as its principal goals:

1. Developing heuristics and tools for analyzing the large DNA data sets that are available.

2. Improving researcher access to computational resources.

The CIPRES Portal was created as part of Goal 2, improving researcher access to computational resources

The CIPRES Portal was designed to be a flexible web application that allows users to run analyses of large sequence data sets using community codes on a significant computational resource.

User requirements:

• Provide login-protected personal user space for storing results indefinitely.

• Provide access to most or all native command line options for each code.

• Support addition of new tools and new versions as needed.

The CIPRES Portal was built on a generic portal software package called The Workbench Framework

<?xml version="1.0" encoding="ISO-8859-1" ?><!DOCTYPE pise SYSTEM "http://www.phylo.org/dev/rami/PARSER/pise.dtd" [<!ENTITY nucdbs SYSTEM "http://www.phylo.org/dev/rami/XMLDIR/nucdbs.xml"><!ENTITY protdbs SYSTEM "http://www.phylo.org/dev/rami/XMLDIR/protdbs.xml"><!ENTITY blastDBpath SYSTEM "http://www.phylo.org/dev/rami/XMLDIR/blastDBpath.xml"><!ENTITY fastaDBpath SYSTEM "http://www.phylo.org/dev/rami/XMLDIR/fastaDBpath.xml"><!ENTITY blocksDBpath SYSTEM "http://www.phylo.org/dev/rami/XMLDIR/blocksDBpath.xml"><!ENTITY nucDBfasta SYSTEM "http://www.phylo.org/dev/rami/XMLDIR/nucDBfasta.xml"><!ENTITY protDBfasta SYSTEM "http://www.phylo.org/dev/rami/XMLDIR/protDBfasta.xml">]><pise>

<head> <title>TFASTY</title> <version>34t10d3</version> <description>Compare PS to Translated NS Or NS-DB</description> <authors>W. Pearson</authors> <reference>Pearson, W. R. (1999) Flexible sequence similarity searching with the FASTA3 program package. Methods in Molecular Biology</reference> <reference>W. R. Pearson and D. J. Lipman (1988), Improved Tools for Biological Sequence Analysis, PNAS 85:2444-2448</reference> <reference> W. R. Pearson (1998) Empirical statistical estimates for sequence similarity searches. In J. Mol. Biol. 276:71-84</reference> <reference>Pearson, W. R. (1996) Effective protein sequence comparison. In Meth. Enz., R. F. Doolittle, ed. (San Diego: Academic Press) 266:227-258</reference><category>Protein Sequence</category>

To expose Command Line Tools quickly, the Workbench Framework uses the PISE XML standard….

All command line parameters can be set.

All command line parameters can be set.

Usage Statistics for CIPRES Portal 5/2007 – 11/2009

47,500 total jobs in 30 months

Limitations of the original CIPRES Portal

• all jobs were run serially (efficient, but no gain in wall time)• the cluster was modest (16 X 8-way dual core nodes)• runs were limited to 72 hours• the cluster was at the end of its useful lifetime• funding for the project was ending• demand for job runs was increasing

This is not a scalable, sustainable solution!

Workbench

Framework

The solution: make community codes available on scalable, sustainable resources (e.g. TeraGrid).

TeraGrid

CIPRES Cluster

Triton

Parallel codes

Serial codes

Greater than 90% of all computational time is used for three tree inference codes: MrBayes, RAxML, and GARLI.

Implement parallel versions of these codes on TeraGrid Machines Abe and Lonestar; using Globus/GRAM.

Work with community developers to improve the speed-up available through the parallel codes offered by CSG.

Keep other serial codes on local SDSC resources that provide the project with fee-for-service cycles.

The Workbench Framework design made it possible to deploy jobs on TeraGrid resources fairly easily. A Science Gateway development allocation allowed us to accomplish the initial setup.

Code Type Max cores

Speed-up Efficiency

MrBayes Hybrid MPI/OpenMP 32 2.4 X (4 nodes)

~60%

RAxML Hybrid MPI/OpenMP 40 3.0 X(5 nodes)

~ 60%

GARLI MPI 100 77 X(100 nodes)

77-94%

CIPRES Science Gateway parallel code profiles

Month 2 4 6 8 1012

Us

ers

/

mo

nth

100

200

300

400all users

Jan

new users

Mar May Jul

repeat users

Sep

all users

new users

repeat users

CIPRES Science Gateway Usage Dec 2009 – Oct 2010

Month 2 4 6 8 1012

Us

ers

/

mo

nth

100

200

300

400all users

Jan

new users

Mar May Jul

repeat users

Sep

all users

new users

repeat users

CIPRES Science Gateway Usage Dec 2009 – Oct 2010

1575 new TGusers

Month

2 4 6 8 10 12

Jo

bs

/ m

on

th

1000

2000

3000

4000

Jan Mar JulMay

SU

s / m

on

th (in

tho

us

an

ds

)

200

400

600

800

Sep

CIPRES Science Gateway Usage

:

Intellectual Merit: 90+ publications enabled by the CIPRES Science Gateway

Broad Impact: CIPRES Science Gateway used to deliver curriculum by at least 21 instructors

What happens if you build it and too many people come???

Month2 4 6 810

SU

s (

tho

us

an

ds

)

5001000150020002500300035004000?!?

What happens if you build it and too many people come???

• make an explicit fair resource use policy

• make sure resource use delivers impact

• make sure resource use is efficient

• then expand resource base as required

What happens if you build it and too many people come???

• make an explicit fair resource use policy

• make sure resource use delivers impact

• make sure resource use is efficient

• then expand resource base as required

What happens if you build it and too many people come???

SUs /monthNumber of

Users% total SU % per user

< 100 536 (75.3%) 19 0.04

100 - 500 86 (12.1%) 9 0.10

500 - 2000 53 ( 7.5%) 10 0.19

2000 - 5000 24 ( 3.4%) 17 0.71

5,000 - 10,000 9 ( 1.3%) 14 1.56

> 10,000 3 ( 0.4%) 31 10.33

What happens if you build it and too many people come???

SUs /monthNumber of

Users% total SU % per user

< 100 536 (75.3%) 19 0.04

100 - 500 86 (12.1%) 9 0.10

500 - 2000 53 ( 7.5%) 10 0.19

2000 - 5000 24 ( 3.4%) 17 0.71

5,000 - 10,000 9 ( 1.3%) 14 1.56

> 10,000 3 ( 0.4%) 31 10.33

What happens if you build it and too many people come???

SUs /monthNumber of

Users% total SU % per user

< 100 536 (75.3%) 19 0.04

100 - 500 86 (12.1%) 9 0.10

500 - 2000 53 ( 7.5%) 10 0.19

2000 - 5000 24 ( 3.4%) 17 0.71

5,000 - 10,000 9 ( 1.3%) 14 1.56

> 10,000 3 ( 0.4%) 31 10.33

This level of resource use requires additional justification…

What happens if you build it and too many people come???

Tier % Community

allocation usedAccount Status

1< 2% of monthly

allocationOpen access

22% of monthly

allocationRequest personal

TG allocation

33% of monthly

allocationUse of community allocation blocked

CIPRES SG Fair Use Policy

What happens if you build it and too many people come???

• make an explicit fair resource use policy

• make sure resource use delivers impact

• make sure resource use is efficient

• then expand resource base as required

What happens if you build it and too many people come???

High end users must acquire a personal allocation from the TRAC committee. This will insure that very heavy users of the resource are supporting peer-reviewed research (and have a US institutional affiliation).

What happens if you build it and too many people come???

Tools required to implement the CIPRES SG Fair Use Policy:

• ability to halt submissions from a given user account

• ability to charge to a user’s personal TG allocation

• ability to monitor usage by each account

What happens if you build it and too many people come???

• make an explicit fair resource use policy

• make sure resource use delivers impact

• make sure resource use is efficient

• then expand resource base as required

Job Attrition on the CIPRES Science Gateway

CPU time User Staff

Input error low high low

Machine error 0 high low

Communication error high high high

Unknown error high high low

Error Impact analysis

Communication Errors

• occur when there is a break in communication between the web application and the TG resource

• kills the job monitoring process for all running jobs.

• jobs will continue to execute and consume SU, but web application can no longer return job results.

• the user must ask for the results, and a staff member must fetch them manually.

CONCLUSION: Time to refactor the job monitoring system!

User Submission Command line;files TG Machine

Running Task Table

Globus “gsissh”

Normal Operation

LoadResults DaemonDetects results,fetches via Grid ftp,puts results in the user db.

User DB

User Submission checkJobsDaemon TG Machine

Running Task Table

Globus “gsissh”

If normal notification fails from:Machine outageJob timeoutCIPRES Application down

Abnormal Operation:

User DBLoadResults DaemonDetects results w/delivery error;fetches via Grid ftp;puts results in the user DB.

SEPT OCT

MrBayes 39 71

RAxML 106 172

GARLI 14 23

Total 159* 266*

JOBS SAVED BY THE GSISSH / TASK TABLE SYSTEM

* 7% of all submitted jobs

Future plans to address other source of attrition

User errors (15%) : pre-check uploaded files for valid format.

Machine errors (8%): establish redundancy in where codes can run, use tools to check machine availability,queue depth.

Unknown errors (3%): TBD

What happens if you build it and too many people come???

• make an explicit fair resource use policy

• make sure resource use delivers impact

• make sure resource use is efficient

• then expand resource base as required

What happens if you build it and too many people come???

Month2 4 6 810

SU

s (

tho

us

an

ds

)

5001000150020002500300035004000?!?

Month2 4 6 810

SU

s (

tho

us

an

ds

)

5001000150020002500300035004000

What happens if you build it and too many people come???

Other possible resource opportunities:

New TG machines (e.g. Trestles machine at SDSC)

The Open Science Grid (OSG)

The NSF FutureGrid project (adapting these HPC applications to a cloud environment).

CIPRES Science Gateway Terri Liebowitz

TeraGrid Hybrid Code Development Wayne PfeifferAlexandros Stamatakis

TeraGrid Implementation Support Nancy Wilkins-DiehrDoru Marcusiu

Workbench Framework: Paul HooverLucie Chan

Acknowledgements: