Post on 21-May-2020

Software: the good, the bad and the ugly

Festival of Genomics 2017

Russell Hamilton, rsh46@cam.ac.uk

Software: What’s good, bad and ugly

Bioinformatics Software

[Diagram: Software X]

All software contains bugs:

• Industry average: “about 15-50 errors per 1000 lines of delivered code”, Steve McConnell (author of Code Complete and Software Estimation: Demystifying the Black Art)

• Bugs range from a spelling mistake in an error message to completely incorrect results

• Most software will process the input to produce an output without errors or warnings

Never blindly trust software or pipelines

• Always test and validate results

• Avoid black box software (definition: produces results, but no one knows how)

Different classes of bioinformatics software

Class        Examples                    Description
Processing   TopHat2, Bowtie2            Performing a computationally intensive task, applying mathematical models
Evaluation   FastQC, BamQC               Deriving QC metrics from output files
Converters   SamToFastq (Picard Tools)   Simply converting between file formats. Generally stable, no regular updates
Pipelines    Galaxy, Cluster Flow        The glue for joining software to create an automated pipeline

12 Step Guide for evaluating and selecting bioinformatics software tools

1. Finding software to do the job
2. Has the software been published?
3. Software Availability
4. Documentation Availability
5. Presence on user groups
6. Installation and Running
7. Errors and Log Files
8. Use standard file formats
9. Evaluating Commercial Software
10. Bugs in scripts/pipelines to run software
11. Writing your own software
12. Using and creating pipelines

1. Finding software to do the job

Identify the required task
• Alignment of methylation sequencing data to a reference genome

Are there related studies performing a similar analysis?
• Publications / posters / talks

Required features

Must have
1. INPUT: standard FASTQ format files
2. OUTPUT: standard BAM alignments
3. OUTPUT: compatible with methylKit

Like to have
1. Perform methylation calls
2. Must make use of multiple processors for large numbers of samples

2. Has the software been published?

✓ Published in a peer reviewed journal
As standalone software or as part of a study

✓ Cited by other peer reviewed papers

✓ Has the software been benchmarked (by people other than the authors)?
Short read mapping is a “generally solved problem”
Informative for run times
BMC Bioinformatics; 2016; 17(Suppl 4):69, DOI: 10.1186/s12859-016-0910-3

3. Software Availability

Software available for download
Hosted on a recognised software repository, e.g. GitHub, BitBucket, SourceForge

Software regularly updated / bugs fixed / releases
More than one developer (e.g. group account)

Permanent archive of software releases, e.g. zenodo.org, figshare.com

University / Institute / Company website
Software is the responsibility of a group, not just an individual

Bugs are reported and fixed

New feature requests are added

4. Documentation Availability

✓ User documentation

Release documentation
Versions: aids reproducibility

[Figure: grid of the genotypes WT, KO1, KO2 and KO3 against each other, with cells marked X]

Example: RNA-Seq differential gene analysis using DESeq2

Discover limitations

DESeq2 manual: “The results function without any arguments will automatically perform a contrast of the last level of the last variable in the design formula over the first level.”

# Build the data set from the HTSeq-count files listed in the sample table
count.data <- DESeqDataSetFromHTSeqCount(sampleTable=smplTbl, design= ~ genotype)

count.data <- DESeq(count.data)

# No arguments: contrast of the last genotype level over the first
binomial.result <- results(count.data)

# Explicit contrast: KO1 over KO2
binomial.result <- results(count.data, contrast=c("genotype","KO1","KO2"))

Name     fileName                 genotype
WT1      wt1.htseq_counts.txt     WT
WT2      wt2.htseq_counts.txt     WT
WT3      wt3.htseq_counts.txt     WT
KO1.1    ko1.1.htseq_counts.txt   KO1
KO1.2    ko1.2.htseq_counts.txt   KO1
KO1.3    ko1.3.htseq_counts.txt   KO1
KO2.1    ko2.1.htseq_counts.txt   KO2
KO2.2    ko2.2.htseq_counts.txt   KO2
KO2.3    ko2.3.htseq_counts.txt   KO2
KO3.1    ko3.1.htseq_counts.txt   KO3
KO3.2    ko3.2.htseq_counts.txt   KO3
KO3.3    ko3.3.htseq_counts.txt   KO3
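The quoted sentence from the manual has a practical consequence for this sample table. R orders factor levels alphabetically unless told otherwise, so the "first level" is probably not the wild type. The shell sketch below only mimics that ordering to show which comparison a bare results() call would be expected to make. It assumes the genotype column keeps the default alphabetical level order; the variable names are illustrative and not part of DESeq2.

```shell
#!/usr/bin/env sh
# Genotype labels from the sample table (one entry per sample).
labels="WT WT WT KO1 KO1 KO1 KO2 KO2 KO2 KO3 KO3 KO3"

# Unique labels in alphabetical order, mimicking default factor levels.
# Assumption: nobody has re-levelled the genotype factor.
levels=$(printf '%s\n' $labels | LC_ALL=C sort -u)

first_level=$(printf '%s\n' "$levels" | head -n 1)
last_level=$(printf '%s\n' "$levels" | tail -n 1)

# With no arguments, results() contrasts the last level over the first.
default_contrast="$last_level vs $first_level"
echo "Default contrast: $default_contrast"
```

Under these assumptions the default comparison is WT over KO1, which may not be the comparison the analyst had in mind. Passing an explicit contrast removes the ambiguity.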

5. Presence on user groups

Evidence for support questions being answered, e.g. FAQ, searchable public support group
• https://www.biostars.org
• http://seqanswers.com
• GitHub Issues
• Google Groups

Is there someone nearby you can ask for help?
• Bioinformatics Core Facility
• Research group down the corridor

6. Installation and Running

✓ Will run on standard architecture

✓ Easy to install

Source code available
Binaries can simplify installation

Docker / Galaxy / BaseSpace

✓ Release versions

$ bismark --version

Bismark - Bisulfite Mapper and Methylation Caller.
Bismark Version: v0.16.3_dev
Copyright 2010-15 Felix Krueger, Babraham Bioinformatics
www.bioinformatics.babraham.ac.uk/projects/

✓ Default parameters
A sensible set of default parameters that are likely to produce a good first pass at the results

Example: Traceability of results through the steps in the analysis

[Figure: RNA-Seq Differential Gene Expression Analysis. Workflow labels: Alignment, BAM, HTSeq-count, counts, DESeq2, normalised read counts, differentially expressed genes. Checkpoints marked ✓: sample-to-sample correlation, MA plots, sample PCA, per-gene std. dev.]

Intermediate results are excellent checkpoints

7. Errors and Log Files

Warnings: Don’t ignore warnings, they may be telling you something crucial about your data

Errors: Problem severe enough for the program to stop and produce an error

Keep and read log files for each software run
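One low-effort way to act on this is to count the lines in each kept log that mention a warning or an error. The sketch below assumes plain-text logs. The search patterns and the demo log contents are made up for illustration and do not come from any particular tool.

```shell
#!/usr/bin/env sh
# Count log lines mentioning a warning or error, case-insensitively.
# Assumption: the tool writes plain-text logs containing these words.
count_flagged_lines() {
    grep -i -c -E 'warn|error' "$1"
}

# Demonstration with an invented three-line log file.
demo_log=$(mktemp)
cat > "$demo_log" <<'EOF'
Processing sample_1
Warning: 35% of reads were discarded
Done
EOF

flagged=$(count_flagged_lines "$demo_log")
echo "Lines needing a closer look: $flagged"
rm -f "$demo_log"
```

A count above zero is only a prompt to open the log and read it; it does not replace reading it.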

8. Use standard file formats

✓ Standard input files: FASTA, FASTQ
Converting between formats could introduce errors

✓ Standard output files: BAM
Compatible with downstream tools

Bioinformaticians spend an embarrassing amount of time converting between file formats

9. Evaluating Commercial Software

Should you use commercial software to do RNA-Seq DGE analysis?

Lots of good commercial software available, e.g. Partek

Pros                                     Cons
Graphical interface (no command line)    Run analysis without understanding the steps
Single application for all steps         Harder to trace back step by step
Dedicated customer support               Limited user group activity
                                         Less transparency (methods / bugs fixed)
                                         Expensive
                                         License required to reproduce analysis (e.g. reviewers)

10. Bugs in scripts/pipelines to run software

Often written specifically for each analysis or project, and prone to bugs

Examples of accidentally missing out samples

1. Bash script for running FastQC

# Note: the glob only matches *_1.fq.gz, so any *_2.fq.gz files are never passed to fastqc
for file in *_1.fq.gz; do
    fastqc "$file"
done

multiqc .
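A cheap guard against this kind of omission is to compare inputs with outputs after the run. The sketch below lists FASTQ files that have no matching report. It assumes reports are written next to the inputs and named `<prefix>_fastqc.html`. The function name and the placeholder files are hypothetical, and FastQC itself is not run.

```shell
#!/usr/bin/env sh
# List FASTQ files in directory $1 that have no matching FastQC report.
# Assumption: reports are named <prefix>_fastqc.html beside the input files.
missing_reports() {
    for fq in "$1"/*.fq.gz; do
        [ -e "$fq" ] || continue          # glob matched nothing
        prefix=${fq%.fq.gz}
        [ -e "${prefix}_fastqc.html" ] || basename "$fq"
    done
}

# Demonstration with empty placeholder files (no real data).
demo_dir=$(mktemp -d)
touch "$demo_dir/s1_1.fq.gz" "$demo_dir/s1_2.fq.gz"
touch "$demo_dir/s1_1_fastqc.html"        # only read 1 was processed

missing=$(missing_reports "$demo_dir")
echo "FASTQ files without a report: $missing"
rm -rf "$demo_dir"
```

In the demonstration only the read 2 file is reported, which is the sort of gap a loop over `*_1.fq.gz` leaves behind.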

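2. RNA-Seq DESeq2 sample table

# Note: labels are case sensitive, so 'WT', 'wt' and 'Wt' count as three different genotypes
genotype <- data.frame(
    'WT',  'wt',  'Wt',
    'KO1', 'kO1', 'KO1',
    'KO2', 'Ko2', 'KO2',
    'KO3', 'KO3', 'KO3')

...
# Note: 'K02' (with a zero) does not match the label 'KO2' (with a letter O)
results(dds, contrast=c("genotype","WT","K02"))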
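Mixed capitalisation in a sample table is easy to miss by eye and easy to catch mechanically. The sketch below groups labels that differ only by case. It assumes one label per line on standard input; the function name is hypothetical, and the labels are the ones from the sample table example above.

```shell
#!/usr/bin/env sh
# Report genotype labels that differ only by capitalisation.
# Reads one label per line on standard input.
inconsistent_labels() {
    LC_ALL=C sort -u | awk '
        { key = toupper($0); n[key]++; seen[key] = seen[key] " " $0 }
        END { for (k in n) if (n[k] > 1) print k ":" seen[k] }
    ' | LC_ALL=C sort
}

labels="WT wt Wt KO1 kO1 KO1 KO2 Ko2 KO2 KO3 KO3 KO3"
report=$(printf '%s\n' $labels | inconsistent_labels)
printf '%s\n' "$report"
```

Each output line names a genotype followed by the spellings found for it, so any line at all means the table needs fixing before it reaches DESeq2.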

11. Utilising dedicated Pipeline tools

The bad and ugly

• Homemade “glue” scripts for running software can be bug prone
• “Dark script matter” isn’t reviewed or assessed and is rarely released in methods sections
• In a 3000 sample study, errors are propagated 3000 times!

The good

• Purpose built pipeline tools
• Premade pipelines, e.g. for RNA-Seq differential gene expression
• Job queuing: load balancing across hardware (laptop to cluster farm)
• Log files track a sample’s progress through the pipeline

Galaxy (https://usegalaxy.org/)
• Interaction via a web browser
• Public and private server installs
• Many pre-built pipelines
• Large user community

Cluster Flow (http://clusterflow.io/)
• Command line interface
• Many pre-built pipelines

Common Workflow Language (https://github.com/common-workflow-language)
• A language for building your own pipelines
• Utilised by other pipeline tools e.g. NextIO

12. Developing your own software

Only if you are sure a great piece of software doesn’t already exist or can’t be modified for the task

Developing your own tools gives an appreciation of how difficult it can be

Rule 1: Identify the Missing Pieces
Rule 2: Collect Feedback from Prospective Users
Rule 3: Be Ready for Data Growth
Rule 4: Use Standard Data Formats for Input and Output
Rule 5: Expose Only Mandatory Parameters
Rule 6: Expect Users to Make Mistakes
Rule 7: Provide Logging Information
Rule 8: Get Users Started Quickly
Rule 9: Offer Tutorial Material
Rule 10: Consider the Future of Your Tool

Weighting the evaluation criteria

    Criteria                                     Importance  Comments
1   Finding software to do the job               +++++       Use the right tools for the job
2   Has the software been published?             +++         New software is being released, so check for improved methods. Just because it’s published and well used doesn’t mean it’s still the best
3   Software Availability                        ++          Openness is a good sign for finding errors/bugs and suggesting feature enhancements
4   Documentation Availability                   +++++
5   Presence on user groups                      +++++
6   Installation and Running                     +++
7   Errors and Log Files                         +++++
8   Use standard file formats                    ++++        Conversions could add sources of error
9   Evaluating Commercial Software               +           Price vs open source software
10  Bugs in scripts/pipelines to run software    +++++       Pipelines standardise workflows
11  Utilising dedicated pipeline tools           +
12  Writing your own software                    +           Don’t re-invent the wheel

Compromises for run time vs accuracy/sensitivity

Project A has 3000 samples vs Project B with 12 samples

Method 1: 4 hours per sample, 98% accuracy
Method 2: 30 mins per sample, 97% accuracy

What would you choose if:
Method 1: 4 hours per sample, 98% accuracy
Method 3: 15 mins per sample, 90% accuracy
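Multiplying the figures out makes the trade-off concrete. The sketch below assumes samples are processed one after another on a single machine, with no parallelism, which is a simplification. The per-sample times and accuracies are the ones quoted above, and the function name is illustrative.

```shell
#!/usr/bin/env sh
# Total serial run time in hours: samples x minutes per sample / 60.
# Assumption: one sample at a time, no parallel processing.
total_hours() {
    echo $(( $1 * $2 / 60 ))
}

for samples in 3000 12; do
    echo "$samples samples:"
    echo "  Method 1 (240 min/sample, 98%): $(total_hours "$samples" 240) h"
    echo "  Method 2 (30 min/sample, 97%):  $(total_hours "$samples" 30) h"
    echo "  Method 3 (15 min/sample, 90%):  $(total_hours "$samples" 15) h"
done
```

For Project A, Method 1 comes to 12000 hours (500 days of serial compute) against 1500 hours for Method 2. For Project B the same methods take 48 and 6 hours, so the extra accuracy costs far less.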

Summary

Many ways to evaluate software

• Openness and engagement with users is very important: bugs fixed, features added, large user base

• Evaluate features, e.g. run time, against your project requirements

• If you are using pipelines, use purpose built pipelining tools