Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on...
Transcript of Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on...
![Page 1: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/1.jpg)
SETTING UP A BIOINFORMATICS QC PIPELINEBRIAN MCCONEGHY
BIOINFORMATICS SPECIALIST
SEQUENCING AND BIOINFORMATICS CONSORTIUM, UBC
OFFICE OF THE VICE-PRESIDENT, RESEARCH & INNOVATION
WESTGRID WEBINAR 2019-11-13
1
![Page 2: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/2.jpg)
2
“More than 70% of researchers have tried and
failed to reproduce another scientist's experiments,
and more than half have failed to reproduce their
own experiments.”
Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604),
452-454. doi:10.1038/533452a
![Page 3: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/3.jpg)
3
Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454.
doi:10.1038/533452a
![Page 4: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/4.jpg)
Background
Pipelining Tools
Writing the Pipeline
Metrics to Track
Implementation
Conclusions
4
![Page 5: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/5.jpg)
Background
Pipelining Tools
Writing the Pipeline
Metrics to Track
Implementation
Conclusions
5
![Page 6: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/6.jpg)
WHAT IS BIOINFORMATICS?
6
![Page 7: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/7.jpg)
NEXT GENERATION SEQUENCING
7
Sample Preparation SequencingSample (input) QC Sample (library) QC Data!
https://www.makeuseof.com/tag/best-linux-server-operating-systems/www.Illumina.com
https://www.nextadvance.com/ https://pdb101.rcsb.org/motm/84www.thermofisher.com
![Page 8: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/8.jpg)
NEXT GENERATION SEQUENCING
8
Sample Preparation SequencingSample (input) QC Sample (library) QC Data!
https://www.makeuseof.com/tag/best-linux-server-operating-systems/www.Illumina.com
https://www.nextadvance.com/ https://pdb101.rcsb.org/motm/84www.thermofisher.com
![Page 9: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/9.jpg)
WHAT IS NEXT GEN SEQUENCING DATA, EXACTLY?
• High-throughput sequencing technology
• Generates millions of ‘reads’
• Reads are just strings of G’s, A’s, T’s, and C’s (with associated quality values)
9
@NB999999:999:ZZZZZZZZ9:1:11101:16570:1094 1:N:0:1AAAGCNGCTGAATTGTTCGCGTTTACCTTGCGTGTACGCGCAGGAAACACT+AAAAA#A6EA66EEEAEEE/E//EAE/E//EEEEEEEEAAEEEAEEEEAEE
phiX 174 control DNA
![Page 10: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/10.jpg)
10
https://bloggenohub.files.wordpress.com/2015/01/slide1.jpg
![Page 11: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/11.jpg)
Background
Pipelining Tools
Writing the Pipeline
Metrics to Track
Implementation
Conclusions
11
![Page 12: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/12.jpg)
WHAT AND WHY
• Workflow management system
• Reproducible and scalable data analysis
• Rules, inputs, and outputs
12
![Page 13: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/13.jpg)
NEEDS
• Reproducible
• Scalable
• Efficient (Parallelizable)
• Portable
• Automated
13
https://slides.com/johanneskoester/snakemake-short#/3
![Page 14: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/14.jpg)
WANTS
• Ease of development
• Unix-compatible
• FREE
14
https://slides.com/johanneskoester/snakemake-short#/3
![Page 15: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/15.jpg)
COMPARISON
• Galaxy
• Ruffus
• Snakemake
15
![Page 16: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/16.jpg)
COMPARISON
• Galaxy
• Ruffus
•Snakemake
16
![Page 17: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/17.jpg)
Background
Pipelining Tools
Writing the Pipeline
Metrics to Track
Implementation
Conclusions
17
![Page 18: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/18.jpg)
RESOURCES - HARDWARE
• Cedar - Compute Canada
• 58,416 Cores
• 306,306 GB of RAM
• 10TB scratch space
18
https://medium.com/monplan/how-we-automated-deployments-
and-testing-with-bitbucket-pipelines-bb478c12c55f
![Page 19: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/19.jpg)
RESOURCES - PEOPLE
• Advanced Research Computing (ARC)
• Jamie Rosner
• Venkat Mahadevan
• VP Research & Innovation (VPRI)
• Dr. Helen Burt
19
https://medium.com/monplan/how-we-automated-deployments-
and-testing-with-bitbucket-pipelines-bb478c12c55f
![Page 20: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/20.jpg)
SNAKEMAKE
• Decompose workflow into rules
• Rules define how to obtain output files from input files
• Snakemake infers dependencies and execution order
20
![Page 21: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/21.jpg)
21
https://xkcd.com/1343/
SNAKEMAKE
![Page 22: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/22.jpg)
22
TECHNICAL
![Page 23: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/23.jpg)
SIMPLICITY!
rule sort:
input:
"path/to/dataset.txt"
output:
"dataset.sorted.txt"
shell:
"sort {input} > {output}"
23
![Page 24: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/24.jpg)
GENERALIZE RULES WITH NAMED WILDCARDS
24
rule sort:input:
"path/to/{dataset}.txt"output:
"{dataset}.sorted.txt"shell:
"sort {input} > {output}"
![Page 25: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/25.jpg)
SPECIFY MULTIPLE INPUTS (AND OUTPUTS)REFER BY INDEX
25
rule sort_and_annotate:input:
"path/to/{dataset}.txt","path/to/annotation.txt"
output:"{dataset}.sorted.txt"
shell:"paste <(sort {input[0]}) {input[1]} > {output}"
![Page 26: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/26.jpg)
CAN SPECIFY MULTIPLE INPUTS (AND OUTPUTS), AND REFER BY NAME
26
rule sort_and_annotate:input:
a="path/to/{dataset}.txt",b="path/to/annotation.txt"
output:"{dataset}.sorted.txt"
shell:"paste <(sort {input.a}) {input.b} > {output}"
![Page 27: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/27.jpg)
USE PYTHONWITHIN RULES
27
rule sort:input:
a="path/to/{dataset}.txt"output:
b="{dataset}.sorted.txt"run:
with open(output.b, "w") as out:for l in sorted(open(input.a)):
print(l, file=out)
![Page 28: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/28.jpg)
A REAL RULE
28
• Used in DNA QC pipeline
rule bwa_mem_map_reads:input:
get_trimmed_readsoutput:
temp('mapped/{sample}-{unit}.sorted.bam')log:
'logs/bwa_mem/{sample}-{unit}.log'params:
index = get_genome_index,rg = get_read_group_bwa
threads: 46shell:
'(bwa mem -t {threads} {params.rg} {params.index} {input} | ''samtools sort -T $SLURM_TMPDIR/ -o {output} -) 2> {log}'
![Page 29: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/29.jpg)
JOB EXECUTION
29
• A job only executes if:
1. output file is the target requested and does not exist
2. output file needed by another executed job (i.e. is an
input to another job) and does not exist
3. input file is newer than the output file
4. input file will be updated by other job
5. execution is forced
![Page 30: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/30.jpg)
CLUSTER EXECUTION
30
• Can set up pipeline profiles
• Execute DAG by way of cluster job submission
• Configuration file
• Max jobs at a time
• CPUs
• MEM
• General (per profile) or granular (per rule)
![Page 31: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/31.jpg)
Background
Pipelining Tools
Metrics to Track
Writing the Pipeline
Implementation
Conclusions
31
![Page 32: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/32.jpg)
METRICS TO TRACK
• Adapter trimming
• Duplicate Rate
• % Aligned (for genomes we can map to)
• Insert size
• Coverage
• Error rate
• GC Content
• For RNA, specifically:
• Strand specificity (% correct strand)
• 5’-3’ bias
• % rRNA
• Intron-exon ratio
32
![Page 33: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/33.jpg)
Background
Pipelining Tools
Writing the Pipeline
Metrics to Track
Implementation
Conclusions
33
![Page 34: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/34.jpg)
IMPLEMENTATION
• Conda environment
• Version controlled - GitHub
• SBC has 3 QC pipelines (combined into 1, dynamically determined)
• Paired-end DNA QC
• Single-end DNA QC
• Paired-end RNA QC
34
![Page 35: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/35.jpg)
DNA QC WORKFLOW – 2 SAMPLES
35
![Page 36: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/36.jpg)
WORKFLOWS CAN BE COMPLEX
36
Leipzig, J. (2016). A review of bioinformatic pipeline frameworks. Briefings in Bioinformatics. doi:10.1093/bib/bbw020
![Page 37: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/37.jpg)
Background
Pipelining Tools
Writing the Pipeline
Metrics to Track
Implementation
Conclusions
37
![Page 38: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/38.jpg)
CONCLUSIONS
• Snakemake satisfied all needs of the SBC and is simple to work with
• The complex metrics the SBC is most interested in are being tracked, in an
automated fashion
• Implementation allows for reproducible, scalable, flexible, trackable QC
38
![Page 39: Setting up a bioinformatics QC pipeline · 3 Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452-454. doi:10.1038/533452a](https://reader034.fdocuments.in/reader034/viewer/2022051811/602765710c0aaf5585015715/html5/thumbnails/39.jpg)
THANK YOU
39
Link to GitHub with workshop instructions:
https://bit.ly/2Xf4HN6