Roman Cherniatchik, 2020 May 6, St. Petersburg
Snakemake Advanced Tutorial
Snakemake
What for?
Snakemake vs Bash Scripts
| Criterion | Snakemake pipeline | Bash scripts |
| --- | --- | --- |
| "Entrance" level | Hard | Easy |
| Programming language | Python + Bash | Bash (convenient for simple scripts only) |
| Calculations automation | + | + |
| Results consistency / reproducibility | + | depends on you |
| Computation environment reproducibility | Docker, Conda, Singularity integration | depends on you |
| Multiple platforms: write once, launch everywhere | PC, computational clusters (HPC, LSF, ..), cloud computing | New scripts for each platform; hard to build a universal solution |
Bash scripts are simple and easier to start with, but can become a nightmare for complicated large pipelines and effective cloud computing.
Snakemake Basics
<Reminder>
Snakemake Dependency Graph (DAG)
Understanding the DAG concept is crucial for writing Snakemake pipelines.
The dependency graph is used to decide:
• Which rules to execute?
• Which input files to use?
Dependency graph building: Input ← Output files (from the requested outputs back to the inputs)
Rule execution order: Input → Output files (from the inputs forward to the outputs)
Pipeline Execution Order
reads/ A.fq.gz; B.fq.gz; C.fq.gz
Example:
snakemake --cores 1 --dag | dot -Tsvg > dag.svg
Pipeline Dependencies Lookup
The lookup starts at the final target (START HERE) and walks back to the raw reads (The End):
• rule: all. Input: plot.svg
• rule: plot. Input: peaks/A.bed, peaks/B.bed, peaks/C.bed. Output: plot.svg
• rule: call_peaks. Input: bams/{sample}.bam. Output: peaks/{sample}.bed. A, B, C -> {sample}
• rule: align. Input: reads/{sample}.fq.gz. Output: bams/{sample}.bam. A, B, C -> {sample}
• Source files: reads/ A.fq.gz; B.fq.gz; C.fq.gz
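The backward lookup above can be sketched in plain Python. This is only a toy model of the idea, not Snakemake's actual implementation: the RULES table, pattern_to_regex and resolve are hypothetical names made up for this illustration, and each wildcard is assumed to appear once per pattern.

```python
import re

# Toy rules mirroring the slide: one output pattern and a list of input patterns each.
RULES = {
    "plot": {"output": "plot.svg",
             "input": ["peaks/A.bed", "peaks/B.bed", "peaks/C.bed"]},
    "call_peaks": {"output": "peaks/{sample}.bed", "input": ["bams/{sample}.bam"]},
    "align": {"output": "bams/{sample}.bam", "input": ["reads/{sample}.fq.gz"]},
}
SOURCE_FILES = {"reads/A.fq.gz", "reads/B.fq.gz", "reads/C.fq.gz"}


def pattern_to_regex(pattern):
    """Turn 'peaks/{sample}.bed' into a regex with one named group per wildcard."""
    parts = re.split(r"\{(\w+)\}", pattern)  # odd indexes hold wildcard names
    regex = ""
    for i, part in enumerate(parts):
        regex += f"(?P<{part}>.+)" if i % 2 else re.escape(part)
    return re.compile(regex)


def resolve(target, jobs=None):
    """Walk from a target file back to the source files; return jobs in run order."""
    jobs = [] if jobs is None else jobs
    if target in SOURCE_FILES:
        return jobs  # raw input: nothing to run
    for name, rule in RULES.items():
        match = pattern_to_regex(rule["output"]).fullmatch(target)
        if match is None:
            continue
        wildcards = match.groupdict()
        for pattern in rule["input"]:
            resolve(pattern.format(**wildcards), jobs)  # dependencies first
        job = (name, wildcards)
        if job not in jobs:
            jobs.append(job)
        return jobs
    raise LookupError(f"No rule produces {target!r}")


jobs = resolve("plot.svg")
for name, wildcards in jobs:
    print(name, wildcards)
```

For the three samples this yields seven jobs: align and call_peaks for each of A, B and C, followed by plot. The lookup runs from the target backwards, while the resulting list is in execution order.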
Demo time: example_01
Topic: Snakemake Dependency Graph
• Q1: Why doesn’t it work?
Demo time: example_01
• call_peaks looks for “bams/A.bam”
• Candidate found: align with output “bams/{sample}” => {sample} in align is “A.bam” => its input should be “reads/A.bam.fq.gz”
• call_peaks {sample} (“A”) != align {sample} (“A.bam”)
• A wildcard value makes sense only inside a single rule
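The mismatch can be reproduced with a few lines of Python. This is a sketch that assumes each {name} is matched with the regex `.+`, which is Snakemake's default for unconstrained wildcards; match_output is a hypothetical helper, not a Snakemake function.

```python
import re


def match_output(pattern, path):
    """Return the wildcard values if `path` matches `pattern`, else None."""
    parts = re.split(r"\{(\w+)\}", pattern)  # odd indexes hold wildcard names
    regex = "".join(
        f"(?P<{part}>.+)" if i % 2 else re.escape(part)
        for i, part in enumerate(parts)
    )
    match = re.fullmatch(regex, path)
    return match.groupdict() if match else None


# Broken rule: the output pattern lacks the ".bam" suffix.
broken = match_output("bams/{sample}", "bams/A.bam")
print(broken)                                    # {'sample': 'A.bam'}
print("reads/{sample}.fq.gz".format(**broken))   # reads/A.bam.fq.gz

# Fixed rule: the suffix is part of the pattern, so {sample} is just "A".
fixed = match_output("bams/{sample}.bam", "bams/A.bam")
print(fixed)                                     # {'sample': 'A'}
```

The requested file is the same in both cases; only the pattern of the producing rule decides what {sample} means inside that rule.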
Demo time: example_01
DAG: Jobs Graph vs Rules Graph
Practical advice:
Use the rules graph for your pipelines
• Real use cases: 10..1000 input files
• The full graph with all jobs will be extremely large
• The rules graph is compact
snakemake --cores 1 --rulegraph | dot -Tsvg > rulegraph.svg
[Figures: DAG with rules only; DAG with all jobs]
Demo time: example_01
How to quickly find things in the snakemake command line help:
• snakemake --help | less : in less press ‘/’, then type ‘rulegraph’ (search in less works the same way as in Vim)
Useful snakemake options for debugging:
• --dry-run : Builds the DAG, checks the pipeline and exits. Doesn’t execute rules
• --debug-dag : Prints wildcard details etc. while inferring the DAG; doesn’t stop rule execution
• --rulegraph : Prints a compact DAG (rules only) and exits
Some useful snakemake helper functions:
• touch: Creates an empty file; can be used in the output: section to mock shell/run sections
• directory: Marks an output: argument that is a directory, not a file
• protected: Marks an output: argument so that the file becomes ‘read-only’; meant for important pipeline results
Snakemake Editing Tools Matter
Why am I using PyCharm + SnakeCharm?
Let’s compare:
• cat
• vim + snakemake syntax highlighting
• Atom (recommended by Snakemake)
• PyCharm + SnakeCharm Plugin
Example2: cat
Let’s use an editor without syntax highlighting, e.g. cat (or less / vim / nano / …)
Pros
• Installed on almost any Linux machine
• Works in an SSH session
• Fast & light
Cons
• Easy to make an error
• Hard to read pipeline code
Example2: vim + snakemake bundle
Let’s use an editor with a syntax highlighting plugin, e.g. vim (or nano / …)
Pros
• Installed on almost any Linux machine
• Works in an SSH session
• Fast & light
• Code looks better, some errors are easier to notice
Cons
• Requires installing the snakemake bundle
• Still easy to make an error
Supplementary: How do I enable syntax highlighting in Vim for Snakefiles?
Example2: Atom
Let’s use the IDE suggested by the official Snakemake tutorial. They should know better which tool to use, right?
Pros
• Fast & light
• Code looks readable, some errors are easier to notice
• Reasonable default choice
Cons
• Must be installed
• Unlikely to work via SSH
• Still easy to make an error (see later)
Is this code OK?
14 mistakes in 24 lines
Example_02: PyCharm + SnakeCharm
SnakeCharm is developed by the JetBrains Biolabs team
Pros
• Code analysis, lots of errors highlighted
• Smart code completion
• PyCharm is also good for:
  • Python, Markdown, R, ..
  • Git
  • ….
Cons
• Must be installed
• Can show false positives
• Can’t be used via SSH directly (put the code in git, or use sshfs or other tricks)
Text Editors Takeaways:
My own preferences:
• Local machine
  • PyCharm IDE
  • SnakeCharm plugin: for Snakemake
  • IdeaVim plugin: `Vim style` emulation
• Remote machine (computational clusters, docker machines, …)
  • Vim / Nano + Snakemake syntax bundle
Keep your pipeline in Git:
• Use it to sync the pipeline with the remote machine
• Convenient for pipeline development
“Snakenstein”
A Snakemake file is a mixture of:
• Python code
• Some additional syntax
  • rule declarations, rule sections, etc.
  • special syntax in strings: “path/{sample}.bam”
The snakemake tool:
• reads the Snakefile
• generates valid Python code
• executes the Python code in a special Python environment
A “Frankenstein” in terms of programming languages
Example_03: Snakefile - Not Python
snakemake --print-compilation > Snakefile.py
Snakemake generates a Python file and executes it!
Section arguments types
Most sections support:
• positional arguments:
  • string arguments
  • lists of strings
  • other Python expressions
• named arguments: key1 = value
The input section TEXT is inserted into a Python function call, workflow.input(…), as shown by --print-compilation
=> same syntax as for Python call arguments, e.g. as in
print(“fooo”, “boo”, file=…, end=..)
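Since section arguments follow Python call syntax, the usual call rules apply. A minimal sketch with a stand-in function: input_section and the file names are hypothetical and are not what Snakemake generates.

```python
def input_section(*paths, **named_paths):
    """Stand-in for the generated call: collects positional and named arguments."""
    return {"positional": list(paths), "named": dict(named_paths)}


SAMPLES = ["A", "B"]
args = input_section(
    "genome.fa",                             # string argument
    [f"reads/{s}.fq.gz" for s in SAMPLES],   # list of strings from a Python expression
    index="genome.fa.fai",                   # named argument: key = value
)
print(args)

# As in any Python call, a positional argument after a named one is a syntax error.
try:
    compile('input_section(index="genome.fa.fai", "genome.fa")', "<demo>", "eval")
    rejected = False
except SyntaxError:
    rejected = True
print("positional after named rejected:", rejected)
```

The same ordering rule explains a common Snakefile error: unnamed inputs have to be listed before named ones.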
Lambda / Input Functions
Lambda functions
• Access to: wildcards, threads, input, output, resources
• Different sections - different sets of arguments for input functions
Input function:
• Similar to lambda functions, but for larger pieces of code
• Only for input: sections
• Can be used to handle dynamic dependencies (see checkpoints)
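A sketch of how such callables behave, in plain Python. The READS table and the call_with_available helper are hypothetical; the helper only imitates the idea that a function receives the arguments its signature asks for.

```python
import inspect
from types import SimpleNamespace

# Hypothetical sample sheet: sample name -> reads path.
READS = {"A": "reads/run1/A.fq.gz", "B": "reads/run2/B.fq.gz"}


def reads_for_sample(wildcards):
    """Input function: receives the matched wildcards, returns the input path."""
    return READS[wildcards.sample]


def call_with_available(func, **available):
    """Hypothetical helper: pass only the arguments named in func's signature."""
    wanted = inspect.signature(func).parameters
    return func(**{name: available[name] for name in wanted})


wildcards = SimpleNamespace(sample="A")

# Input function: only `wildcards` is requested.
path = call_with_available(reads_for_sample, wildcards=wildcards)
print(path)  # reads/run1/A.fq.gz

# Lambda for another section: asks for wildcards and threads, ignores the rest.
label = lambda wildcards, threads: f"{wildcards.sample} on {threads} threads"
text = call_with_available(label, wildcards=wildcards, threads=4, output=["x.bam"])
print(text)  # A on 4 threads
```

A named function is the better choice once the lookup needs more than one expression, for example validation or a fallback path.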
Sections Syntax is not Equal
• output, log, benchmark:
  • lambdas/input functions cannot be used, only expressions
• input:
  • expressions, lambdas, and also “input functions”
• threads: lambdas/functions or expressions which return integer or float values
• shell:, wrapper: only one positional argument, an expression returning a Python string
• run: Python code block
• ….
Check the Snakemake docs / Trust SnakeCharm!
Wildcards Syntax
3 types:
1. Sections that introduce wildcards:
• output, log, benchmark. E.g.: “peaks/{sample}.bed”
• => lambdas cannot be used here
• => the set of wildcards should be the same in these sections
• everything in `{..}` is a wildcard name, e.g. `{config[reads]}`
2. Sections which use wildcards without the `wildcards.` prefix:
• input, params, … E.g.: “peaks/{sample}.bed”
• everything in `{..}` is a wildcard name, e.g. `{SOME_VARIABLE}`
3. Sections which require the `wildcards.` prefix:
• message, shell, run, … E.g.: “peaks/{wildcards.sample}.bed”
• without the `wildcards.` prefix it is just Python, e.g. `{config[reads]}`, `{SOME_VARIABLE}`
Constraining wildcards example: “sorted_reads/{sample,[A-Za-z0-9]+}.bam”
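The difference between types 2 and 3, and the effect of a constraint, can be imitated with Python's own str.format and re. This is an analogy under stated assumptions, not Snakemake's formatter; the config value is made up.

```python
import re
from types import SimpleNamespace

wildcards = SimpleNamespace(sample="A")
config = {"reads": "reads_dir"}  # hypothetical config entry

# Type 2 (input, params, ...): names in {..} are filled from the wildcard values.
type2 = "peaks/{sample}.bed".format(**vars(wildcards))
print(type2)  # peaks/A.bed

# Type 3 (shell, message, ...): wildcards need the `wildcards.` prefix,
# other names in {..} refer to ordinary Python objects.
type3 = "cat {config[reads]}/{wildcards.sample}.fq.gz".format(
    wildcards=wildcards, config=config
)
print(type3)  # cat reads_dir/A.fq.gz

# Constraint: {sample,[A-Za-z0-9]+} limits what the wildcard may match.
constrained = re.compile(r"sorted_reads/(?P<sample>[A-Za-z0-9]+)\.bam")
ok = constrained.fullmatch("sorted_reads/A1.bam")
rejected = constrained.fullmatch("sorted_reads/A.sorted.bam")
print(ok.group("sample"), rejected)  # A1 None
```

Without the constraint the second path would match with sample set to "A.sorted", which is the kind of ambiguity constraints are meant to prevent.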
Three Execution Phases
1. Python module loading
• File top level
• Section arguments
2. DAG computation
• input and lambda functions
3. Rule running
• run, script, shell, wrapper sections
• bonus: after DAG computation, but before/after all rules execution
See Example_03.6
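The order of the three phases can be imitated in plain Python. This is a toy timeline, not Snakemake code: log, input_for and run_job are hypothetical names.

```python
events = []


def log(phase, what):
    events.append(f"{phase}: {what}")


# Phase 1: loading. Top-level statements and plain section arguments run immediately.
log("load", "top-level code")
output_pattern = "bams/{sample}.bam"  # plain expression, evaluated at load time


def input_for(wildcards):
    """Input function: only defined in phase 1, called during phase 2."""
    log("dag", f"input function for {wildcards['sample']}")
    return f"reads/{wildcards['sample']}.fq.gz"


def run_job(inputs, output):
    """Stand-in for the body of a run/shell section, executed in phase 3."""
    log("run", f"{inputs} -> {output}")


# Phase 2: DAG computation calls the input functions for every job.
jobs = [
    (input_for({"sample": s}), output_pattern.format(sample=s))
    for s in ["A", "B"]
]

# Phase 3: rule running.
for inputs, output in jobs:
    run_job(inputs, output)

print("\n".join(events))
```

The practical consequence: expensive or side-effecting code at the top level of a Snakefile runs on every invocation, dry runs included, while code inside run or shell sections runs only when a job is executed.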
Rules References
rules.NAME.output.key
• Performance improvement for large pipeline DAG computations
• Reduces code duplication - fewer ERRORs!
• Helps in finding usages of a rule in code
See Example_04 and the Snakemake documentation
Dynamic Dependencies
Sometimes not all intermediate file names are known before pipeline execution
Examples:
• Download some files (e.g. fastq) from a database by id
• Align samples, perform QC, use only the samples that passed QC downstream
Use:
• checkpoint rules (see data-dependent-conditional-execution)
• the dynamic flag for output is deprecated and will be removed
Checkpoints
Special rule sub-type:
• declaration ~ rule syntax
• usage: input function + checkpoint reference: checkpoints.NAME.get(**wildcards).output.key
DAG evaluation:
• Evaluate the DAG except for the rules ‘using’ the checkpoint (syntax above), e.g. w/o expression rule
• Run pipeline
• Re-calculate the DAG after the checkpoint has finished (separately for every wildcard, if used)
• Run pipeline
See Example_05
[Figure: checkpoint declaration and usage syntax]
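The two-pass evaluation can be imitated in plain Python. This is only a sketch of the idea: checkpoint_download and aggregate_input are hypothetical stand-ins, and returning None replaces whatever Snakemake does internally when a checkpoint has not finished yet.

```python
import tempfile
from pathlib import Path


def checkpoint_download(outdir):
    """Stand-in for a checkpoint job: file names are only known after it has run."""
    outdir.mkdir(parents=True, exist_ok=True)
    for sample in ["S1", "S2", "S3"]:  # hypothetical result of a database query
        (outdir / f"{sample}.fq").write_text("reads\n")


def aggregate_input(outdir):
    """Stand-in for an input function that depends on the checkpoint output."""
    if not outdir.exists():
        return None  # checkpoint not finished: this part of the DAG stays open
    samples = sorted(p.stem for p in outdir.glob("*.fq"))
    return [f"aligned/{s}.bam" for s in samples]


with tempfile.TemporaryDirectory() as tmp:
    outdir = Path(tmp) / "downloaded"
    first_pass = aggregate_input(outdir)   # before the checkpoint has run
    checkpoint_download(outdir)            # run pipeline up to the checkpoint
    second_pass = aggregate_input(outdir)  # DAG re-evaluated with real file names

print(first_pass)
print(second_pass)
```

The first pass cannot name the downstream inputs; the second pass, after the checkpoint, lists one aligned file per downloaded sample.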
Shell Commands
• Different Python string syntaxes are available
• Use the one most convenient for the situation
See Example_06.1
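A sketch of the string forms that can carry the same command. The samtools command line is just an illustrative choice, and str.format stands in for Snakemake's own formatting of shell strings.

```python
# Three ways to spell the same command template.
one_line = "samtools sort -@ {threads} -o {output} {input}"

implicit_concat = (
    "samtools sort -@ {threads} "   # note the trailing space before the break
    "-o {output} {input}"
)

triple_quoted = """
samtools sort -@ {threads} \
    -o {output} {input}
"""

# Common pitfall: adjacent literals are glued together without any separator.
missing_space = (
    "samtools sort -@ {threads}"
    "-o {output} {input}"
)


def normalize(command):
    """Collapse runs of whitespace so the variants can be compared."""
    return " ".join(command.split())


values = {"threads": 4, "output": "sorted.bam", "input": "raw.bam"}
print(normalize(triple_quoted.format(**values)))
print(missing_space.format(**values))
```

The first three variants describe the same command once whitespace is collapsed; the last one silently produces "4-o", which is why implicit concatenation needs explicit trailing spaces.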
Wrappers, Scripts, ...
shell is enough, but the alternatives are:
• wrapper: see docs. The recommended way for standard tools.
  • Automatically installs the tool via conda
  • Keeps each tool in a separate conda env
  • Collects shell args for you
  • Wrappers repo: https://snakemake-wrappers.readthedocs.io
• script: syntactic sugar for passing arguments into Python and R scripts.
• notebook: a way to launch a notebook (R or Python) and use its output, see docs
See Example_06.2
[Figure: wrapper example]
Project Layout
• Recommended structure
• Keep pipeline settings in config.yaml
• Pass input file information as TSV / CSV tables with columns like sample name, reads path, etc.
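A minimal sketch of reading such a table with the standard library. The column names, the inline table and load_samples are assumptions made for this illustration; in a project the table would live in a file next to config.yaml.

```python
import csv
import io

# Hypothetical samples table (tab-separated).
SAMPLES_TSV = (
    "sample\treads\tcondition\n"
    "A\treads/A.fq.gz\tcontrol\n"
    "B\treads/B.fq.gz\ttreated\n"
)


def load_samples(handle):
    """Map sample name -> row dict; fail early on duplicated sample names."""
    samples = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["sample"] in samples:
            raise ValueError(f"Duplicated sample: {row['sample']}")
        samples[row["sample"]] = row
    return samples


samples = load_samples(io.StringIO(SAMPLES_TSV))

# Typical uses: build the list of final targets and look up per-sample paths.
targets = [f"peaks/{name}.bed" for name in samples]
print(targets)                 # ['peaks/A.bed', 'peaks/B.bed']
print(samples["B"]["reads"])   # reads/B.fq.gz
```

Keeping sample metadata in a table means new samples are added by editing data, not the pipeline code.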
Results Consistency / Reproducibility
Snakemake is smart enough, but it needs your assistance!
• Rule output exists:
  • Recalculated only if the input files changed (not always for checkpoints)
  • If an input is marked with `ancient`, its file modification date is not checked
• Rule fails:
  • Only the files mentioned in output: of the failed rule are deleted
  • => Mention all tool output files in output:, or use shadow: and mention only the required files
• shadow:
  • Useful if the tool outputs too many files
  • Runs the tool in a temp directory: .snakemake/shadow/tmpxxx
  • Copies only the files requested in the output: section
  • Uses symlinks to make everything work
  • Shadow levels: minimal, shallow, full
  • `minimal` - most cases
See Example_07
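The modification-time rule can be written down as a small Python function. This is a simplified model of the check described above, not Snakemake's real logic, which covers more cases; needs_rerun and its `ancient` parameter are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def needs_rerun(output, inputs, ancient=()):
    """True if output is missing or older than any input not marked ancient."""
    output = Path(output)
    if not output.exists():
        return True
    out_mtime = output.stat().st_mtime
    return any(
        Path(path).stat().st_mtime > out_mtime
        for path in inputs
        if path not in ancient
    )


with tempfile.TemporaryDirectory() as tmp:
    reads, bam = Path(tmp, "A.fq"), Path(tmp, "A.bam")
    reads.write_text("reads")
    missing = needs_rerun(bam, [reads])    # output does not exist yet

    bam.write_text("bam")
    os.utime(reads, (100, 100))            # explicit timestamps keep the demo deterministic
    os.utime(bam, (200, 200))
    fresh = needs_rerun(bam, [reads])      # output newer than input

    os.utime(reads, (300, 300))            # input modified after the output
    stale = needs_rerun(bam, [reads])
    ignored = needs_rerun(bam, [reads], ancient=[reads])

print(missing, fresh, stale, ignored)      # True False True False
```

The last call shows the effect of marking an input as ancient: its newer timestamp no longer triggers a rerun.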
Bad Pipeline - Inconsistent Results
Snakemake is not a silver bullet
You can always break all Snakemake conventions and write an inconsistent, non-reproducible pipeline. Think about what you are doing and why.
Example of broken conventions: [code figure]
Bugs in the example:
• ‘my_file.csv’ will be incorrect if several jobs work in parallel
• if one of the samples fails, ‘my_file.csv’ will contain inconsistent results
• also ‘my_file.csv’ won’t be deleted
• also ‘my_file.csv’ won’t be recalculated automatically on the next pipeline launch
• if an input file changes, ‘my_file.csv’ won’t be recalculated automatically
Computation Environment Reproducibility
See “distribution and reproducibility” in the snakemake docs
conda
• Used for automatic installation of tools with proper versions
• --use-conda snakemake option + conda: section in the pipeline
wrappers
• Shell commands + conda environment. Required: --use-conda option
docker
• You can run the whole pipeline or each job in a single docker container
• See the --use-singularity snakemake option and the container: section
• up to 100% reproducibility
Computational clusters
Different types: HPC, LSF, Slurm, …
Idea:
• Only Linux terminal (SSH) access to the cluster entry point (`login` node machine)
• Each rule - single job submission
Snakemake:
• Does all the complicated work for you ~ feels like a local machine
• localrules: forces a job to be launched without job submission
• Required options: --profile XXX, --jobscript XXX.sh, --restart-times NN
My latest data processing was:
• 120 WGBS samples
• 10 TB reads, 150 machines, 2 weeks, 100k+ jobs
• Each rule - own Docker container
Snakemake Pipeline Examples
• Community created workflows - https://github.com/snakemake-workflows
  Curated, but not all are good
• Our team workflows:
  • ChIP-Seq: https://github.com/JetBrains-Research/chipseq-smk-pipeline
  • SC ATAC-Seq: https://github.com/JetBrains-Research/scasat-smk-pipeline
  • WGBS Methylation: <not published yet>
Out of Scope
Please read snakemake docs for
• Reports
• Jupyter integration
• Piped output
• Benchmark Rules
• Handling Ambiguous Rules
• Subworkflows
Resources
• Snakemake Documentation: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html
• Snakemake Wrappers: https://snakemake.readthedocs.io/en/stable/
• SnakeCharm Plugin: https://jetbrains-research.github.io/snakecharm/