Grid Usecase BioMed


Page 1: Grid Usecase BioMed

November 14, 2008

1

Grid Usecase BioMed

How to get biologists to compute

Surfnet / Grid Tutorial

Jan Bot


Page 2: Grid Usecase BioMed


Who am I

• Graduated March 2008

• Bioinformatics group TU Delft

• BioAssist programmer

• Happy grid user

• Working on the grid as part of the TU Delft – NKI collaboration

• Chris Klijn: human copy number variation

• Jeroen de Ridder: viral insertions in mice

Page 3: Grid Usecase BioMed


DNA & Genes

Page 4: Grid Usecase BioMed


Copy number variation & Viral insertions

• Pieces of DNA can be added, deleted or moved

• Viruses can insert themselves into a genome

• This causes all kinds of problems, for example cancer:

• Multiple mutations needed before a tumor starts to develop

Page 5: Grid Usecase BioMed


aCGH data

• Array comparative genomic hybridization

• Compare DNA of sample against a reference

Page 6: Grid Usecase BioMed


KCSmart: Datasets

• Leukaemia & lymphoma cell-lines

• aCGH data (10k Affy) from the Sanger Institute

• Same samples measured on 1.8M SNP6

• 105 cell-line samples

• About 350 MB of data

Page 7: Grid Usecase BioMed


KCSmart: Overview

For each tumor we construct a pair-wise space by comparing each chromosome arm with each other chromosome arm. A point in this space is a pair of genomic loci.

Page 8: Grid Usecase BioMed


KCSmart: Compute Co-occurrence Score

Using a 2D Gaussian kernel we look for local enrichment of high scores in the pairwise space.

Peaks in the convolved space allow us to define two genomic loci that can be said to be co-aberrated to a certain degree.
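The smoothing step can be sketched in pure Python. This is an illustration only, not the KCSmart implementation. It assumes the score at a pair of grid points is the product of the two per-locus aberration values (the slides do not define the score), that `sigma` is given in grid points, and that the kernel is truncated at 3 sigma. All function names are hypothetical.

```python
import math

def pairwise_space(scores_a, scores_b):
    """Assumed score: product of the per-locus values on the two arms."""
    return [[a * b for b in scores_b] for a in scores_a]

def gaussian_kernel_1d(sigma):
    """Normalised 1D Gaussian weights, truncated at 3 sigma."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth_rows(grid, kernel):
    """Convolve every row with the kernel; points outside the grid count as 0."""
    radius = len(kernel) // 2
    out = []
    for row in grid:
        n = len(row)
        out.append([
            sum(kernel[k + radius] * row[j + k]
                for k in range(-radius, radius + 1) if 0 <= j + k < n)
            for j in range(n)
        ])
    return out

def transpose(grid):
    return [list(col) for col in zip(*grid)]

def smooth_2d(grid, sigma):
    """2D Gaussian smoothing as two 1D passes (the Gaussian kernel is separable)."""
    kernel = gaussian_kernel_1d(sigma)
    return transpose(smooth_rows(transpose(smooth_rows(grid, kernel)), kernel))

def peak(grid):
    """Return (row, col, value) of the highest point in the grid."""
    return max(((i, j, v) for i, row in enumerate(grid) for j, v in enumerate(row)),
               key=lambda t: t[2])

# Toy example: an aberration around grid point 3 on one arm and 5 on the other.
arm_a = [0, 0, 1, 1, 1, 0, 0, 0]
arm_b = [0, 0, 0, 0, 1, 1, 1, 0]
row, col, value = peak(smooth_2d(pairwise_space(arm_a, arm_b), sigma=1.0))
```

In the toy example the peak lands in the middle of the two aberrated stretches, which is the pair of loci one would report as co-aberrated.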

Page 9: Grid Usecase BioMed


KCSmart: Parameters (1)

Chromosome arms:

• Natural split at the centromere to better divide the workload

• Not all p-arms contain measurements (39 of the 44 arms do)

Resolution:

• 'Grid points' are fixed on the genome

• Location of the grid points, and thus the computational complexity, doesn't change when using different datasets

• Measurements are allocated to grid points

• Tried this for [20, 25, 35, 50] kbp

• Choice based on the best resolution which still fits in memory

[Figure labels: 10k data, Grid, 1.8M data]

Page 10: Grid Usecase BioMed


KCSmart: Parameters (2)

Scale:

• The kernel width in base pairs

• Capture changes on different scales: [0.2, 2, 10, 20] Mbp (6 sigma)

Amplification type:

• Either insertion or deletion

• All possible combinations for two chromosomes: [ins:ins, del:del, ins:del, del:ins] (ins = amplification, del = loss)
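As a rough worked example of how scale and resolution interact: if "(6 sigma)" is read as "the stated kernel width spans six standard deviations" (an assumption, the slide does not spell it out), the standard deviation in grid points follows directly. The helper name is hypothetical.

```python
def sigma_in_grid_points(scale_mbp, resolution_bp, sigmas_per_width=6):
    """Kernel standard deviation in grid points, assuming width = 6 sigma."""
    width_bp = scale_mbp * 1_000_000
    return width_bp / sigmas_per_width / resolution_bp

# Scales from the slide at a 20 kbp resolution.
sigmas = {scale: sigma_in_grid_points(scale, 20_000) for scale in (0.2, 2, 10, 20)}
```

Under that reading the smallest scale (0.2 Mbp) corresponds to a sigma of fewer than two grid points at 20 kbp, and the largest (20 Mbp) to well over a hundred.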

Page 11: Grid Usecase BioMed


KCSmart: Getting the Parameters Right

• 10k data to estimate memory consumption and running times

• Find best resolution & scale that still fit in 2.3 GB of memory

• Final Parameters:

• chr = [1.0, 1.5, ..., 22.5]

• res = [20000]

• scale = [0.2, 2, 10, 20]

• amp = ['ins-ins', 'del-del', 'ins-del', 'del-ins']

• Roughly 10k jobs (without the jobs required for finding the correct parameter settings!)

• All parameters generated using a Python script

• In a jdl it looks like:

Parameters={"19.5 15.5 2 1 20000", "2.5 4.0 2 1 20000"};
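The original script is not shown, so the following is only a sketch of how such a parameter list could be generated. It assumes that each line holds `arm_a arm_b scale amp_index resolution` (the field order is guessed from the example above), that arms are numbered 1.0, 1.5, ..., 22.5, and that every unordered pair of distinct arms is run once. All names are hypothetical.

```python
from itertools import combinations, product

SCALES = [0.2, 2, 10, 20]
AMPS = ["ins-ins", "del-del", "ins-del", "del-ins"]
RESOLUTION = 20000

def arm_ids(n_chromosomes=22):
    """Arm identifiers 1.0, 1.5, 2.0, ..., 22.5 (two arms per chromosome)."""
    return [c + half for c in range(1, n_chromosomes + 1) for half in (0.0, 0.5)]

def parameter_lines(arms, scales=SCALES, amps=AMPS, resolution=RESOLUTION):
    """One line per (arm pair, scale, amplification type) combination.

    Assumption: the amplification type is passed as an index into `amps`.
    """
    lines = []
    for (arm_a, arm_b), scale, amp_index in product(
            combinations(arms, 2), scales, range(len(amps))):
        lines.append(f"{arm_a} {arm_b} {scale} {amp_index} {resolution}")
    return lines

def jdl_parameters(lines):
    """Format the lines as a JDL Parameters attribute, as shown on the slide."""
    return "Parameters={" + ", ".join(f'"{line}"' for line in lines) + "};"
```

With 39 arms this gives 741 arm pairs, times 4 scales and 4 amplification types, so 11,856 lines. That is consistent with the "roughly 10k jobs" quoted above, although the exact pairing used in the project is not stated.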

Page 12: Grid Usecase BioMed


KCSmart: Output

• +/- 10k files

• 7.5 GB of 'peak-info'

• 1 TB of raw data

• Problems with the grid:

• once you have all the scripts in place to run jobs it's easy to create more output than a biologist can analyze

• once the biologist has some results he'll ask you to do it again (and again...)

Page 13: Grid Usecase BioMed


KCSmart: Results 10k data

Page 14: Grid Usecase BioMed


KCSmart: Results 1.8M data

Page 15: Grid Usecase BioMed


KCSmart: Results 1.8M data

Found a known deletion pair (T-cell receptor): the method works.

Page 16: Grid Usecase BioMed


KCSmart: Future work

• Higher resolution (once we have 64-bit WNs)

• Smaller scale

• Mutual exclusiveness tests

• Run on real tumor dataset

Page 17: Grid Usecase BioMed


Matlab jobs

• Compile code using Matlab (on a UI), run using MCR

• Add the ctf file & executable to the input sandbox:

InputSandbox={"kcsmart_topos.sh","kcsmart_large.bin","kcsmart_large_run.ctf","curl.gz"};

• Add a requirement to the jdl:

Requirements = Member("lsgmcr-7.5",other.GlueHostApplicationSoftwareRunTimeEnvironment);

• Load the module on the WN:

module load mcr

• Call executable

Page 18: Grid Usecase BioMed


Job status tracking problems

• How do you check which jobs failed?

• Use output files as indicators:

lcg-ls lfn:///grid/lsgrid/jbot/chris_large/output/ > output.txt

cat output.txt | ~/code/chris/check_missing.pl > to_do.txt

• Copy subset of parameters to jdl file

• Submit job again

• This takes too long!
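The `check_missing.pl` script itself is not shown. The idea of using output files as success indicators can be sketched as follows, assuming each job writes exactly one output file whose name can be derived from its parameter line. The naming scheme in `expected_name` is hypothetical; the listing may contain either bare names or full paths.

```python
import posixpath

def expected_name(params):
    """Hypothetical naming scheme: parameter fields joined by underscores."""
    return "result_" + "_".join(params.split()) + ".mat"

def missing_parameters(all_params, listing, name_for=expected_name):
    """Return the parameter lines whose output file is absent from the listing."""
    present = {
        posixpath.basename(line.strip())
        for line in listing.splitlines()
        if line.strip()
    }
    return [p for p in all_params if name_for(p) not in present]

# Two submitted jobs, only one output file found on the storage element.
submitted = ["19.5 15.5 2 1 20000", "2.5 4.0 2 1 20000"]
listing = "/grid/lsgrid/jbot/chris_large/output/result_19.5_15.5_2_1_20000.mat\n"
to_do = missing_parameters(submitted, listing)
```

The resulting `to_do` list holds the parameter lines that still need to be resubmitted.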

Page 19: Grid Usecase BioMed


The Annoyances: glite-wms-job-*

glite-wms-job-status

• It barely tells me anything (unless I specified error codes myself)

• I would rather know

• the number of failed / running jobs

• the error output or the parameters with which this job was run

• Use with grep & awk:

glite-wms-job-status `job-ids` > status.txt

cat status.txt | gawk '{prev=$7;getline;if($0~/Exit\ Code/){print prev;}}'

• Output: https://wms.grid.sara.nl:9000/ztINwkKvTJfKnUuZBTYs_g

Status info for the Job : https://wms.grid.sara.nl:9000/ztINwkKvTJfKnUuZBTYs_g
Current Status: Done (Exit Code !=0)
Exit code: 1
Status Reason: Warning: job exit code != 0
Destination: gb-ce-lumc.lumc.nl:2119/jobmanager-pbs-medium
Submitted: Sun Sep 7 21:24:56 2008 CEST
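The wish list above (number of failed and running jobs, which jobs failed) can be covered by a small parser over the saved status report. This is a sketch that assumes the report has one field per line, with `Current Status:` following the `Status info for the Job :` line as in the sample output. The function names are hypothetical.

```python
import re
from collections import Counter

JOB_RE = re.compile(r"^Status info for the Job\s*:\s*(\S+)")
STATUS_RE = re.compile(r"^Current Status:\s*(.+?)\s*$")

def parse_status(text):
    """Map each job id (URL) to its 'Current Status' string."""
    statuses = {}
    job_id = None
    for raw in text.splitlines():
        line = raw.strip()
        job_match = JOB_RE.match(line)
        if job_match:
            job_id = job_match.group(1)
            continue
        status_match = STATUS_RE.match(line)
        if status_match and job_id is not None:
            statuses[job_id] = status_match.group(1)
            job_id = None
    return statuses

def summarise(statuses):
    """Count how many jobs are in each state."""
    return Counter(statuses.values())

def failed_jobs(statuses):
    """Job ids that finished with a non-zero exit code."""
    return [job for job, status in statuses.items() if "Exit Code !=0" in status]
```

Running `summarise(parse_status(open("status.txt").read()))` on the file written by the command above would then give a per-state job count, and `failed_jobs` gives the ids to resubmit.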

Page 20: Grid Usecase BioMed


The Annoyances: glite-wms-job-*

glite-wms-job-cancel

• Does not recursively cancel jobs stored in a file

• Fix:

glite-wms-job-status -i jobs.txt | grep 'http' | gawk '{print $7}' > to_cancel.txt

glite-wms-job-cancel -i to_cancel.txt

Status info for the Job : https://wms.grid.sara.nl:9000/ztINwkKvTJfKnUuZBTYs_g
Current Status: Done (Exit Code !=0)
Exit code: 1
Status Reason: Warning: job exit code != 0
Destination: gb-ce-lumc.lumc.nl:2119/jobmanager-pbs-medium
Submitted: Sun Sep 7 21:24:56 2008 CEST

Page 21: Grid Usecase BioMed


The Annoyances: lcg-*

lcg-cr

• Getting files to and from the SEs:

• What, lcg-cr doesn't always work?

• On error: try again

• No error: good to go, right?

• Try copying the file back to the WN

lcg-cp

• Copying > 3000 files from an SE to the UI machine takes > 1 hour

• Copying the same files over ssh (scp) to my (remote) machine: ~2 minutes

• Security overhead?

• Work-around:

• lcg-rec-cp: slow

• custom script (do it in parallel): nasty

Both: neither works when the MCR is loaded
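Both work-arounds (retry an upload until a read-back check passes, and copy many files in parallel) can be sketched generically. The sketch deliberately takes the actual transfer as a callable instead of invoking `lcg-cr` or `lcg-cp` itself, so no command-line flags are assumed. All names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def upload_with_check(upload, verify, attempts=3, pause=0.0):
    """Call upload() until verify() confirms the file; return the attempt number.

    An upload that reports no error is still verified, because "no error"
    did not always mean the file had arrived.
    """
    for attempt in range(1, attempts + 1):
        try:
            upload()
        except Exception:
            pass  # treat any upload error as "try again"
        else:
            if verify():
                return attempt
        if attempt < attempts:
            time.sleep(pause)
    raise RuntimeError(f"upload not verified after {attempts} attempts")

def _succeeds(func, arg):
    """Run func(arg) and report whether it completed without raising."""
    try:
        func(arg)
        return True
    except Exception:
        return False

def copy_in_parallel(copy_one, names, workers=8):
    """Copy all names concurrently; return the names whose copy failed."""
    names = list(names)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda name: _succeeds(copy_one, name), names))
    return [name for name, ok in zip(names, results) if not ok]
```

In practice `upload`, `verify` and `copy_one` would wrap the grid commands, for example via `subprocess.run`. The list returned by `copy_in_parallel` can be fed back in for another round.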

Page 22: Grid Usecase BioMed


ToPoS

• Main developer: Pieter van Beek

• WebDav + Tokens + pilot job

• Instead of submitting one job at a time, claim a (bunch of) computer(s) until all jobs are done

Page 23: Grid Usecase BioMed


ToPoS Overview

ToPoS Server

User

The Grid

(1) Job tokens

(2) Pilot Jobs

(3) Job Request

(4) Job Token

(5) Job Output

(6) All Output

Page 24: Grid Usecase BioMed


Token renewal

[Flow diagram. Labels: Submit; Pilot job; Get unused token; Pilot job with token; Running pilot job; Affirm token use; Execute token task; Finished? (yes / no); Delete token]
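One way to read the token life cycle is as a lease: a pilot job claims an unused token, keeps affirming it while working, and deletes it when the task has finished. A token that is no longer affirmed becomes available again. The sketch below models that reading with an in-memory pool. It is not the ToPoS API (ToPoS works over HTTP), and every name in it is hypothetical.

```python
import time

class TokenPool:
    """In-memory stand-in for a token server with leased tokens."""

    def __init__(self, tasks, lease_seconds, clock=time.monotonic):
        self._tasks = dict(enumerate(tasks))   # token id -> task description
        self._leases = {}                      # token id -> lease expiry time
        self._lease_seconds = lease_seconds
        self._clock = clock

    def get_unused_token(self):
        """Claim the first token without a live lease, or return None."""
        now = self._clock()
        for token_id, task in self._tasks.items():
            if self._leases.get(token_id, float("-inf")) <= now:
                self._leases[token_id] = now + self._lease_seconds
                return token_id, task
        return None

    def affirm(self, token_id):
        """Extend the lease of a token that is still being worked on."""
        if token_id in self._tasks:
            self._leases[token_id] = self._clock() + self._lease_seconds

    def delete(self, token_id):
        """Remove a token whose task has finished."""
        self._tasks.pop(token_id, None)
        self._leases.pop(token_id, None)

def run_pilot(pool, execute):
    """Pilot-job loop: keep claiming tokens until none are available."""
    finished = []
    while True:
        claim = pool.get_unused_token()
        if claim is None:
            return finished
        token_id, task = claim
        pool.affirm(token_id)
        if execute(task):           # True means the task completed
            pool.delete(token_id)
            finished.append(task)
        # On failure the token is kept; it becomes claimable again
        # once its lease expires, so another pilot job can retry it.
```

This shows why failed jobs are handled without extra bookkeeping: a token whose task failed simply reappears as unused after its lease runs out.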

Page 25: Grid Usecase BioMed


ToPoS: Conclusion

• Advantages:

• Easy output handling using Curl with atomic operations

• Handles failed jobs

• Less overhead

• Able to dynamically add or remove nodes

• Easy to re-run jobs

• Easy access to output

• Disadvantages:

• Little / no security

• Some overhead at the end of a run (unless you're reserving tokens)

• Feature request: progress bar

Page 26: Grid Usecase BioMed


Fixing the difficulties: LEARN BASH!

• diff is your friend:

• Useful to transfer missing files to and from SE

• grep

• Useful for querying the status of jobs (use with the -c option)

• (g)awk

• Handy to cancel jobs

• Redirect output to file and push processes to background:

• lcg-ls is a typical example

Page 27: Grid Usecase BioMed


Why not let the biologist do it?

• Resources needed to get this working on the grid:

• +/- 180 replies from grid support

• +/- 100 messages exchanged with the biologists

• Many hours of work, mostly finding out about the 'quirks' of the software

• Advantage of making a programmer submit the jobs:

• One person to handle support

• Re-use experience with other projects

Page 28: Grid Usecase BioMed


Some other tricks

• Nikhef does not 'advertise' the installed software

• Do your own load balancing (once the job is in a queue, it doesn't get re-scheduled)

• Easy to do with the cancel-script shown previously

• Don't keep your stuff in $HOME when on WNs; change directory to $TMPDIR at the beginning of your script

• Keep in mind: once you retrieved your job-output it's gone from the grid

• Use startGridSession

• When using ToPoS: make sure you land in the 'long' queue

Page 29: Grid Usecase BioMed


Thanks!

• SARA Grid Support

• Jeroen Engelberts

• Pieter van Beek

• Machiel Jansen

• Nikhef

• Jan Just Keijser

• Collaborators

• Chris Klijn

• Jeroen de Ridder