Building bioinformatic pipelines
6/20/2019
P.Zumbo
What is a pipeline?
A pipeline or workflow refers to a series of processing steps such that the output of each process is the input of the next, typically done to transform raw data into something more interpretable.
Why bother building pipelines?
1. Reproducibility
2. Data provenance
3. Automation
4. Transparency
Pipelines aid in reproducibility
Reproducibility = obtaining the same result* using the same code and data
* within reason (e.g., some aligners assign multi-mapping reads to a random location)
Data provenance contextualizes results
Provenance refers to the description of the origin of a piece of data:
• The steps taken to arrive at a piece of data
• The software used
• The version of the software used
• The arguments supplied to the software used
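As a rough illustration of the four items above, a wrapper function can log the tool, its version, and its arguments before running each step. This is only a sketch: `log_step` and `provenance.log` are hypothetical names, and `sort` stands in for a real tool such as an aligner.

```shell
#!/usr/bin/env bash
# Sketch: record provenance for one pipeline step.
# log_step and provenance.log are hypothetical names; sort is a stand-in tool.
set -eu

workdir=$(mktemp -d)
cd "$workdir"
printf 'b\na\nc\n' > input.txt

log_step() {
  # Append the time, the tool, its version, and the exact arguments to the
  # log, then run the command itself.
  local tool=$1
  {
    printf 'date: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    printf 'tool: %s\n' "$tool"
    printf 'version: %s\n' "$("$tool" --version 2>&1 | head -n 1)"
    printf 'command: %s\n' "$*"
  } >> provenance.log
  "$@"
}

log_step sort -o sorted.txt input.txt
cat provenance.log
```

Each call appends one record, so the log lists the steps in the order they were run.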
Automation: the amount of data keeps increasing
Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, et al. (2015) Big Data: Astronomical or Genomical? PLOS Biology 13(7): e1002195. https://doi.org/10.1371/journal.pbio.1002195
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
Automation: some pipelines are complex
From: A review of bioinformatic pipeline frameworks. Brief Bioinform. 2016;18(3):530-536. doi:10.1093/bib/bbw020
Simple alignment pipeline with bowtie2
# align reads with bowtie2
bowtie2 -x ref.fa -U short_read.fq > aln-se.sam
# convert from sam to bam
samtools view -bS aln-se.sam > aln-se.bam
# sort bam file
samtools sort aln-se.bam > aln-se.sorted.bam
Simple sample script
#!/usr/bin/env bash
## tools
BOWTIE=/usr/local/bin/bowtie2 # v2.3.5.1
SAMTOOLS=/usr/local/bin/samtools # v1.9
## reference genome
REFERENCE=/usr/local/ref/e_coli.fa
$BOWTIE -x $REFERENCE -U A.fastq.gz > A.sam
$SAMTOOLS view -bS A.sam > A.bam
$SAMTOOLS sort A.bam > A.sorted.bam
For loops
#!/usr/bin/env bash
BOWTIE=/usr/local/bin/bowtie2
SAMTOOLS=/usr/local/bin/samtools
REFERENCE=/usr/local/ref/e_coli.fa
# loop over the gzipped FASTQ files; a glob avoids parsing the output of ls
for read in *fastq.gz; do
  $BOWTIE -x $REFERENCE -U "$read" > "${read/.fastq.gz/.sam}"
  $SAMTOOLS view -bS "${read/.fastq.gz/.sam}" > "${read/.fastq.gz/.bam}"
  $SAMTOOLS sort "${read/.fastq.gz/.bam}" > "${read/.fastq.gz/.sorted.bam}"
done
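The loop derives every output name from the input name with bash's `${variable/pattern/replacement}` expansion. A minimal sketch of just that naming step, using hypothetical sample names and `echo` in place of the real tools:

```shell
#!/usr/bin/env bash
# Sketch: how ${var/pattern/replacement} derives output names from inputs.
# A.fastq.gz and B.fastq.gz are hypothetical file names; no files are read.
for read in A.fastq.gz B.fastq.gz; do
  sam=${read/.fastq.gz/.sam}
  bam=${read/.fastq.gz/.bam}
  sorted=${read/.fastq.gz/.sorted.bam}
  echo "$read -> $sam -> $bam -> $sorted"
done
```

This form replaces the first match anywhere in the name. `${read%.fastq.gz}.sam` is an alternative that strips the suffix only at the end of the name.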
GNU parallel
tool for processing repetitive commands
parallel [options] [command [arguments]] ::: <files>
• ::: <files> or find <files> |
• The filename: {}
• The filename with the extension removed: {.} e.g. test.fa would become test
• --jobs, -j n
GNU parallel pipeline
THREADS=2
parallel --jobs $THREADS gunzip {} ::: *fastq.gz
parallel --jobs $THREADS $BOWTIE -x $REFERENCE -U {} ">" {.}.sam ::: *fastq
parallel --jobs $THREADS $SAMTOOLS view -bS {.}.sam ">" {.}.bam ::: *sam
parallel --jobs $THREADS $SAMTOOLS sort {.}.bam ">" {.}.sorted.bam ::: *bam
A brief history of make
• first introduced by Stuart Feldman in 1977 at Bell Labs
• build automation tool
• used to build executable programs and libraries from source code
• however, make is not limited to building binaries and libraries
Key features of make
• Dependency analysis
• Re-entrancy
• Parallelization
• Pattern rules / abstraction
• Audit trail
what is make?
make is a program that reads a makefile and builds one or more files from zero or more other files that they depend on.
how does make do what it does?
make parses the makefile, builds a dependency tree (by determining the relationships between the inputs and outputs), and then traverses each branch of the tree, executing commands along the way.
what is a makefile?
a makefile is a text file which contains rules for how to create a set of target files.
what is a rule?
a rule tells make which series of commands to execute and what files must exist beforehand in order to create a set of targets from some input.
the general form of a rule is:
target … : dependency …
	command
	…
	…
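To make the general form concrete, here is a small sketch with one target, one dependency, and one command. The file names and the `tr` command are stand-ins chosen so the example runs without any bioinformatics tools, and it assumes GNU make is installed.

```shell
#!/usr/bin/env bash
# Sketch of a single rule; lower.txt, upper.txt and tr are stand-ins.
set -eu
workdir=$(mktemp -d)
cd "$workdir"
echo "acgt" > lower.txt

# target: upper.txt, dependency: lower.txt, command: tr ...
# printf writes the mandatory leading tab (\t) of the command line explicitly.
printf 'upper.txt: lower.txt\n\ttr a-z A-Z < lower.txt > upper.txt\n' > Makefile

make            # builds upper.txt, the first (default) target
cat upper.txt   # ACGT
```

Command lines in a makefile must begin with a tab character. Writing the file with `printf` is just a way to keep that tab visible in the example.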
a practical example: alignment
BOWTIE=/usr/local/bin/bowtie2 #v2.3.5.1
SAMTOOLS=/usr/local/bin/samtools #v1.9
REFERENCE=/usr/local/ref/e_coli.fa
all: A.sam
A.sam: A.fastq.gz
	$(BOWTIE) -x $(REFERENCE) -U A.fastq.gz > A.sam
adding another step:
BOWTIE=/usr/local/bin/bowtie2 #v2.3.5.1
SAMTOOLS=/usr/local/bin/samtools #v1.9
REFERENCE=/usr/local/ref/e_coli.fa
all: A.bam
A.sam: A.fastq.gz
	$(BOWTIE) -x $(REFERENCE) -U A.fastq.gz > A.sam
A.bam: A.sam
	$(SAMTOOLS) view -bS A.sam > A.bam
automatic variables
BOWTIE=/usr/local/bin/bowtie2 #v2.3.5.1
SAMTOOLS=/usr/local/bin/samtools #v1.9
REFERENCE=/usr/local/ref/e_coli.fa
all: A.bam
A.sam: A.fastq.gz
	$(BOWTIE) -x $(REFERENCE) -U $< > $@
A.bam: A.sam
	$(SAMTOOLS) view -bS $< > $@
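In the rules above, `$<` stands for the first dependency and `$@` for the target. The sketch below shows the expansion with a dry run. It assumes GNU make is installed, and uses `tr` and the hypothetical files `A.txt` and `A.upper` in place of the aligner and its files.

```shell
#!/usr/bin/env bash
# Sketch: automatic variables in a recipe; tr and the file names are stand-ins.
set -eu
workdir=$(mktemp -d)
cd "$workdir"
echo "acgt" > A.txt

# The recipe names no files directly:
# $< expands to the first dependency (A.txt), $@ to the target (A.upper).
printf 'A.upper: A.txt\n\ttr a-z A-Z < $< > $@\n' > Makefile

make -n    # dry run: prints the expanded command without running it
make       # runs it and creates A.upper
```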
using pattern rules: the percent sign
%: roughly equivalent to * in a Unix shell
- represents any number of any characters
- can be placed anywhere within pattern
- can only occur once
some valid uses:
%.v
s%.o
wrapper_%
- characters other than % match literally within a filename
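A single pattern rule can stand in for any number of explicit rules. This is a sketch, assuming GNU make is installed; `s1.txt`, `s2.txt` and the `tr` recipe are hypothetical stand-ins.

```shell
#!/usr/bin/env bash
# Sketch: one pattern rule applied to two inputs; names are stand-ins.
set -eu
workdir=$(mktemp -d)
cd "$workdir"
echo "acgt" > s1.txt
echo "ttga" > s2.txt

# One rule covers every X.upper that has a matching X.txt.
# %% is printf's escape for a literal % in the Makefile.
printf '%%.upper: %%.txt\n\ttr a-z A-Z < $< > $@\n' > Makefile

# A makefile with only pattern rules has no default target,
# so the targets are requested by name.
make s1.upper s2.upper
cat s1.upper s2.upper
```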
revisiting alignment…
FASTQFILES := $(wildcard *.fastq.gz)
all: $(FASTQFILES:.fastq.gz=.sorted.bam)
%.sam: %.fastq.gz
	$(BOWTIE) -x $(REFERENCE) -U $< > $@
%.bam: %.sam
	$(SAMTOOLS) view -bS $< > $@
%.sorted.bam: %.bam
	$(SAMTOOLS) sort $< > $@
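The `all` line relies on two make features: `$(wildcard ...)` to list the inputs, and a substitution reference to turn each input name into a target name. The sketch below only prints the resulting target list. It assumes GNU make; the `TARGETS` variable, the `$(sort ...)` call (added for a stable order), and the empty sample files are additions for illustration.

```shell
#!/usr/bin/env bash
# Sketch: how the target list for "all" is derived from the input files.
# A.fastq.gz and B.fastq.gz are empty, hypothetical sample files.
set -eu
workdir=$(mktemp -d)
cd "$workdir"
touch A.fastq.gz B.fastq.gz

cat > Makefile <<'EOF'
FASTQFILES := $(sort $(wildcard *.fastq.gz))
TARGETS := $(FASTQFILES:.fastq.gz=.sorted.bam)
$(info targets: $(TARGETS))
all:
EOF

make   # prints: targets: A.sorted.bam B.sorted.bam
```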
visualizing the dependency tree
[Figure: dependency tree with nodes default; Sample1.bam, Sample2.bam; Sample1.sam, Sample2.sam; Sample1.fastq.gz, Sample2.fastq.gz; makefile]
the -j switch
-j [jobs], --jobs[=jobs]
specifies the number of jobs (commands) to run simultaneously.
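A small sketch of the switch in use, assuming GNU make is installed. The two targets do not depend on each other, so with `-j 2` make is free to run both recipes at once. The `sleep` and `cp` recipe and the file names are stand-ins for real work.

```shell
#!/usr/bin/env bash
# Sketch: two independent targets built with up to two jobs at a time.
set -eu
workdir=$(mktemp -d)
cd "$workdir"
touch s1.txt s2.txt

{
  printf 'all: s1.out s2.out\n'
  printf '%%.out: %%.txt\n\tsleep 1; cp $< $@\n'
} > Makefile

make -j 2   # the two recipes may run at the same time
```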
why make? the limits of a script:
1. linear execution
• make -j
2. truncated files
• .DELETE_ON_ERROR:
3. unable to resume
• make
4. poor audit trail
• make -nB > make.log
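The last three points can be tried out with a toy makefile. This is a sketch under stated assumptions: GNU make is installed, `tr` stands in for a real tool, and `bad.out` is a hypothetical target whose recipe writes a partial file and then fails.

```shell
#!/usr/bin/env bash
# Sketch: resuming, deleting partial outputs, and logging commands with make.
set -eu
workdir=$(mktemp -d)
cd "$workdir"
echo "acgt" > s1.txt
echo "ttga" > s2.txt

{
  printf '.DELETE_ON_ERROR:\n'
  printf 'all: s1.upper s2.upper\n'
  printf '%%.upper: %%.txt\n\ttr a-z A-Z < $< > $@\n'
  printf 'bad.out: s1.txt\n\techo partial > $@; false\n'
} > Makefile

make                 # first run builds both targets
make                 # second run finds nothing to rebuild
sleep 1              # make sure the next timestamp is newer
touch s1.txt
make                 # only s1.upper is rebuilt

make bad.out || true # recipe fails, so make deletes the partial bad.out

make -nB > make.log  # dry run of every command for "all", kept as a record
cat make.log
```

Without `.DELETE_ON_ERROR:`, the partial `bad.out` would be left behind and could be mistaken for a finished result on the next run.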
Limitations of make
• Wasn’t designed for bioinformatic analyses
• Syntax requires understanding rule structure
• Lacks support for multiple outputs from single command
• No support for multiple wildcards per name
• No built-in support for distributed computing
Ways to parallelize
Image From: http://cloudcomputingnet.com/category/clouldcomputing/grid-computing/
Single computer, single core
Single computer, multiple cores
Multiple computers, multiple cores
Future trends
Image From: https://www.hpcwire.com/2017/05/04/singularity-hpc-container-technology-moves-lab/#foobox-3/0/Singularity-architecture_G-Kurtzer-e1477021972985.jpg
Singularity containers
Many contemporary alternatives to make
https://github.com/pditommaso/awesome-pipeline
CWL
From: https://www.commonwl.org/user_guide/02-1st-example/
Pipelines are the tip of the iceberg concerning reproducibility
From: Experimenting with reproducibility in bioinformatics. Yang-Min Kim, Jean-Baptiste Poline, Guillaume Dumas. bioRxiv 143503; doi: https://doi.org/10.1101/143503