Paprica course
Welcome!
Universidade de São Paulo
PAthway PRediction by phylogenetIC plAcement (paprica) short course
Jeff Bowman, [email protected]
30 March 2016
Introduction and Logistics
Schedule (tentative)
0900 – 0915: Introductions and logistics
0915 – 1015: Task 1: Troubleshoot installations, Task 2: Tutorial 1
1015 – 1030: Break
1030 – 1100: Discussion: The paprica workflow
1100 – 1130: Discussion: Tutorial 1 results
1130 – 1200: Troubleshooting installation for custom build of paprica database
1200 – 1300: Lunch
1300 – 1330: Tutorial 2: Building the paprica database
1330 – 1400: Discussion: The paprica database workflow
1400 – 1430: Demonstration: Metagenomic analysis with paprica (break during module)
1430 – 1630: Your analysis with paprica. If you don’t have a set of libraries that you’d like to work with we will help you find some.
Objectives
1. Install paprica and dependencies, and learn how to use it to analyze a set of 16S rRNA gene sequences
2. Install the dependencies for building the paprica database, and learn how to build a custom database
What is paprica, and what can I do with it?
paprica is a pipeline to estimate the metabolic pathways, enzymes (EC numbers), and genome parameters associated with 16S rRNA gene sequences.
• Designed for NGS data
• Also applicable to small libraries or even single 16S rRNA gene sequences (e.g. isolates)
Bowman and Ducklow, 2015 Bowman, 2015
Bowman, 2015
Function | Pathway | Sanger studies | Hatam et al. (2014) | Bowman et al. (2012)
CO2 fixation | CO2 fixation into oxaloacetate (anaplerotic) | Pseudoalteromonas haloplanktis TAC125 | Polaribacter MED152, Acidimicrobiales YM16-304 | Psychrobacter cryohalolentis K5, Polaribacter MED152
Antibiotic resistance | Triclosan resistance | Pelagibacter ubique HTCC1062, Polaribacter MED152 | Polaribacter MED152, Leadbetterella byssophila DSM17132, Thiomicrospira spp., Gloeocapsa PCC7428, Acidimicrobiales YM16-304, Janthinobacterium spp. | P. cryohalolentis K5, Polaribacter MED152, GSOS
C1 metabolism | Formaldehyde oxidation II (glutathione-dependent) | Colwellia psychrerythraea 34H | Gloeocapsa PCC7428, Marinobacter BSs20148, Glaciecola nitratireducens FR1064 | Octadecabacter antarcticus 307
Choline degradation | Choline degradation I | C. psychrerythraea 34H | Acidimicrobiales YM304 | P. cryohalolentis K5, O. antarcticus 307
Glycine betaine production | Glycine betaine biosynthesis I (Gram-negative bacteria) | C. psychrerythraea 34H | Acidimicrobiales YM304 | P. cryohalolentis K5, O. antarcticus 307
Halocarbon degradation | 2-chlorobenzoate degradation | P. cryohalolentis K5 | Polaromonas naphthalenivorans CJ2 | P. cryohalolentis K5
Mercury conversion | Phenylmercury acetate degradation | Marinobacter BSs20148, P. haloplanktis TAC125, Octadecabacter arcticus 238 | Belliella baltica DSM15883, Bordetella petrii | O. antarcticus 307
Nitrogen fixation | Nitrogen fixation | Coraliomargarita akajimensis DSM45221 | C. akajimensis DSM45221, Methylomonas methanica MC09, Aeromonas spp. | C. akajimensis DSM45221
Sulfite oxidation | Sulfite oxidation II/III | Pelagibacter ubique HTCC1062 | Cellvibrio japonicus UEDA107 | GSOS
Sulfate reduction | Sulfate reduction IV/V | Halomonas elongata DSM2581, Psychrobacter arcticum 273 | Vibrio vulnificus YJ016 | GSOS
Denitrification | Nitrate reduction I/VII | C. psychrerythraea 34H | C. japonicus UEDA107 | -
Bowman et al, in revision
Troubleshoot installation and conduct basic analysis
Tutorial 1 – Initial analysis with paprica
• Finish downloading and installing all remaining dependencies; let me know if you need assistance
  • Archaeopteryx
    sudo apt-get install default-jre
    wget https://googledrive.com/host/0BxMokdxOh-JRM1d2azFoRnF3bGM/download/forester_1038.jar
    mv forester_1038.jar archaeopteryx.jar
    chmod a+x archaeopteryx.jar
    ## create bash script archaeopteryx containing these lines (no indentation):
    ## #!/bin/bash
    ## java -cp archaeopteryx.jar org.forester.archaeopteryx.Archaeopteryx
    ## make this script executable
    chmod a+x archaeopteryx
  • R and RStudio
• Remove existing paprica directory, then download latest version of paprica:
    rm -r paprica
    git clone https://github.com/bowmanjeffs/paprica.git
• Start working through the tutorial located here: http://www.polarmicrobes.org/?p=1473
  • Start at “Testing the Installation”
[Figure: flowchart of the paprica workflow in three panels: Database Construction, Analysis, and Confidence Scoring. Boxes shown: 16S sequence library (the bigger the better!); Obtain all completed genomes (Genbank); Predict metabolic pathways (ptools); Construct 16S rRNA gene tree (Infernal, RAxML); Place reads on reference tree (Infernal, pplacer); Extract pathways for each placement; Generate confidence score for sample; Find pathways shared across all members of all clades; Calculate confidence for each node; Evaluate genomic plasticity for terminal nodes; Evaluate relative core genome size.]
Three components to metabolic inference:
1. Database construction
2. Analysis
3. Confidence scoring
Caveats: Metabolic inference is only as good as…
• Our genome annotations
• The diversity of completed genomes
• Our knowledge of metabolic pathways
And is further limited by…
• Genomic plasticity
The paprica workflow
• Data preparation
  • Read QC – basic steps
    • Overlap if PE
    • Trim for quality
    • Remove chloroplasts, mitochondria, anything else that looks weird
  • Methods
    • Mothur (preferred)
    • Qiime
    • paprica/utilities/read_qc.py
• Test run on single sample
• Setup run for multiple samples
    while read -r f; do ./paprica-run.sh "$f" bacteria; done < samples.txt
  • where samples.txt contains a list of the sample files without their extension
• Let’s take a look at paprica-run.sh…
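The samples.txt file read by the loop above can be generated rather than typed by hand. The following is a minimal sketch, assuming the libraries sit in one directory and share a single extension (".fasta" here); the function names are hypothetical and not part of paprica.

```python
from pathlib import Path


def list_sample_names(directory, extension=".fasta"):
    """Return sorted file names in `directory` with `extension` stripped."""
    names = []
    for path in Path(directory).iterdir():
        if path.is_file() and path.name.endswith(extension):
            # Drop only the final extension, e.g. summer.fasta -> summer
            names.append(path.name[: -len(extension)])
    return sorted(names)


def write_samples_file(names, out_path="samples.txt"):
    """Write one sample name per line, the layout the shell loop reads."""
    Path(out_path).write_text("".join(name + "\n" for name in names))
```

Calling write_samples_file(list_sample_names(".")) in the directory holding the libraries would produce the list. Intermediate files that also end in the chosen extension would be picked up too, so it is worth running this before the first paprica run or checking the list by eye.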
Tutorial 1 results
Files initially provided or created by paprica
• summer.fasta
• summer.sub.fasta
• summer.sub.clean.fasta
Files produced for or during infernal/pplacer
• summer.sub.combined_16S.bacteria.tax.clean.align.phyloxml
• summer.sub.combined_16S.bacteria.tax.clean.align.csv
• summer.sub.combined_16S.bacteria.tax.clean.align.sto
• summer.sub.combined_16S.bacteria.tax.clean.align.fasta
• summer.sub.combined_16S.bacteria.tax.clean.align.jplace
paprica output files
• summer.bacteria.ec.csv
• summer.bacteria.sum_ec.csv
• summer.bacteria.pathways.csv
• summer.bacteria.sum_pathways.csv
• summer.bacteria.edge_data.csv
• summer.bacteria.sample_data.txt
summer.sub.combined_16S.bacteria.tax.clean.align.csv (first rows)
origin,name,multiplicity,edge_num,like_weight_ratio,post_prob,likelihood,marginal_like,distal_length,pendant_length,classification,map_ratio,map_overlap
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.1832,1,2568,0.497633,0.769127,-42222.2,-42226,0.457927,0.317102,NA,NA,NA
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.4354,1,2253,0.840252,0.915613,-41188,-41192.1,7.3661e-06,0.263113,NA,NA,NA
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.3662,1,2422,0.614939,0.615935,-42880.8,-42884.1,6.32695e-06,0.17298,NA,NA,NA
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.2443,1,242,0.557322,0.787045,-43458.2,-43459.3,9.2618e-06,0.0380588,NA,NA,NA
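A placement table like this one can be tallied by edge with the standard csv module. The sketch below uses only the column names shown above (edge_num, multiplicity, post_prob); the function name and the min_post_prob cutoff are illustrative assumptions, not a filter that paprica itself applies.

```python
import csv
import io


def reads_per_edge(csv_text, min_post_prob=0.0):
    """Sum read multiplicity per edge, keeping placements at or above a cutoff."""
    counts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if float(row["post_prob"]) >= min_post_prob:
            edge = int(row["edge_num"])
            counts[edge] = counts.get(edge, 0) + int(row["multiplicity"])
    return counts


# The four placements shown on the slide.
PLACEMENTS = """\
origin,name,multiplicity,edge_num,like_weight_ratio,post_prob,likelihood,marginal_like,distal_length,pendant_length,classification,map_ratio,map_overlap
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.1832,1,2568,0.497633,0.769127,-42222.2,-42226,0.457927,0.317102,NA,NA,NA
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.4354,1,2253,0.840252,0.915613,-41188,-41192.1,7.3661e-06,0.263113,NA,NA,NA
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.3662,1,2422,0.614939,0.615935,-42880.8,-42884.1,6.32695e-06,0.17298,NA,NA,NA
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.2443,1,242,0.557322,0.787045,-43458.2,-43459.3,9.2618e-06,0.0380588,NA,NA,NA
"""

print(reads_per_edge(PLACEMENTS, min_post_prob=0.78))
```

With a cutoff of 0.78 only the placements on edges 2253 and 242 are retained, since the other two have post_prob values of 0.769127 and 0.615935.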
summer.bacteria.ec.csv (rows: EC number; columns: edge number for each CCG and CEG)
,15,37,51,142,242,243,552,649,678,739,796,802,805,1030,1050,1075,1106,1107,2139…
1.-.-.-,0.0,0.0,0.0,35.25,90.0,14.0,0.0,0.0…
1.1.-.-,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0…
1.1.1.-,0.0,0.0,0.0,35.25,0.0,0.0,0.0,0…
1.1.1.1,0.0,0.0,0.0,23.5,135.0,21.0,0.333333333333…
summer.bacteria.sum_ec.csv (rows: EC number; value: sum (normalized) across all CCG and CEG)
1.-.-.-,175.159090909
1.1.-.-,0.333333333333
1.1.1.-,44.0984848485
1.1.1.1,192.475757576
1.1.1.10,0.0
1.1.1.100,1168.89333799
1.1.1.102,0.333333333333
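A two-column summary like this is easy to rank. The sketch below assumes the file has exactly two comma-separated fields per row (EC number, summed value), as in the excerpt; the function name is hypothetical and rows whose second field is not numeric, such as a header, are skipped.

```python
import csv
import io


def top_ec_numbers(csv_text, n=3):
    """Return the n (EC number, value) pairs with the largest values."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text)):
        if len(record) != 2:
            continue
        ec, value = record
        try:
            rows.append((ec, float(value)))
        except ValueError:
            continue  # e.g. a header line
    rows.sort(key=lambda pair: pair[1], reverse=True)
    return rows[:n]


# Values copied from the slide excerpt.
SUM_EC = """\
1.-.-.-,175.159090909
1.1.-.-,0.333333333333
1.1.1.-,44.0984848485
1.1.1.1,192.475757576
1.1.1.10,0.0
1.1.1.100,1168.89333799
1.1.1.102,0.333333333333
"""

for ec, value in top_ec_numbers(SUM_EC):
    print(ec, value)
```

For the excerpt, the three highest-ranked entries are 1.1.1.100, 1.1.1.1 and 1.-.-.-, in that order.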
summer.bacteria.pathways.csv (rows: pathway; columns: edge number for each CCG and CEG)
,15,37,51,142,242,243,552,649,678,739,796,802,805,1030,1050,1075,1106,1107,2139…
"(1,3)-beta-D-xylan degradation",0.0,0.0,0.0,0.0,0.0,0.0,0.0…
(KDO)2-lipid A biosynthesis I,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6…
(R)-acetoin biosynthesis I,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0…
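The wide layout (one row per pathway, one column per edge) can be queried for a single edge as follows. This is a sketch under the assumption that the first header cell is empty and the remaining header cells are edge numbers, as in the excerpt; the toy table uses pathway names from the slide with made-up values, and the function name is hypothetical. The csv module matters here because some pathway names contain commas and are quoted.

```python
import csv
import io


def pathways_for_edge(csv_text, edge):
    """List pathways with a value above zero in the column for `edge`."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    column = header.index(str(edge))  # raises ValueError if the edge is absent
    found = []
    for row in reader:
        if len(row) > column and row[column] and float(row[column]) > 0:
            found.append(row[0])
    return found


# Toy table: real pathway names, invented values.
TOY_PATHWAYS = """\
,242,243
"(1,3)-beta-D-xylan degradation",0.0,0.0
(KDO)2-lipid A biosynthesis I,0.6,0.0
(R)-acetoin biosynthesis I,0.0,2.0
"""

print(pathways_for_edge(TOY_PATHWAYS, 242))
```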
summer.bacteria.edge_data.csv (transposed and put in table)
edge_num 242 243
taxon GCF_000012345.1_Candidatus Pelagibacter ubique HTCC1062_strain=HTCC1062
nedge 53 5
n16S 1 1
nedge_corrected 53 5
nge 1 1
ncds 1333 1355.5
genome_size 1308759 1325981
GC 29.68308145 29.15748
phi 0.478821295 0.480875
clade_size 1 2
branch_length 0.0189682 0.246143
npaths_terminal 119.5
npaths_actual 116 144
confidence 0.478821295 0.625556
post_prob 0.789555434 0.814622
nec_actual 369 461
nec_terminal 315.5
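One way to summarize per-edge values such as confidence is an abundance-weighted mean, using nedge_corrected as the weight. The sketch below illustrates that calculation with the two edges shown above; it is an assumption for illustration only and is not claimed to be the formula paprica uses for sample_confidence. The function name is hypothetical.

```python
def weighted_mean_confidence(edges):
    """Mean of `confidence` across edges, weighted by `nedge_corrected`."""
    total_weight = sum(edge["nedge_corrected"] for edge in edges)
    if total_weight == 0:
        raise ValueError("no reads placed on any edge")
    weighted_sum = sum(
        edge["nedge_corrected"] * edge["confidence"] for edge in edges
    )
    return weighted_sum / total_weight


# Values for edges 242 and 243, copied from the table.
EDGES = [
    {"edge_num": 242, "nedge_corrected": 53, "confidence": 0.478821295},
    {"edge_num": 243, "nedge_corrected": 5, "confidence": 0.625556},
]

print(weighted_mean_confidence(EDGES))
```

Because edge 242 carries 53 of the 58 corrected reads, the result sits much closer to its confidence than to that of edge 243.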
summer.bacteria.sample_data.txt
name summer.bacteria
sample_confidence 0.49199211424
npathways 572
ppathways 1007
nreads 1000
database_created_at 2016-03-03T00:59:34.792240
Tutorial 2
• Download the remaining dependencies
  • RAxML
    git clone https://github.com/stamatak/standard-RAxML.git
    cd standard-RAxML
    make -f Makefile.AVX2.PTHREADS.gcc
    rm *.o
    • add to PATH
    • What if CPU can’t support AVX2? Cheat.
  • pathway-tools
    • follow GUI instructions
  • taxtastic
    • make sure that system Python is Anaconda (or alternate distro), then:
      pip install taxtastic
• Follow the tutorial here: http://www.polarmicrobes.org/?p=1543
  • Only complete the “Test paprica-build.sh” section!
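Whether a CPU supports AVX2 can be checked before running make. The sketch below parses the "flags" line of Linux /proc/cpuinfo text and suggests a Makefile from the standard-RAxML repository. This is a separate approach, not the "cheat" mentioned on the slide; the function names are hypothetical, and the alternative Makefile names should be confirmed against your own clone.

```python
def read_cpu_flags(cpuinfo_text):
    """Return the CPU feature flags from the first 'flags' line, if any."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            return line.split(":", 1)[1].split()
    return []


def choose_raxml_makefile(cpu_flags):
    """Pick the most capable PTHREADS Makefile the CPU flags allow."""
    flags = set(cpu_flags)
    if "avx2" in flags:
        return "Makefile.AVX2.PTHREADS.gcc"
    if "avx" in flags:
        return "Makefile.AVX.PTHREADS.gcc"
    if "sse3" in flags or "pni" in flags:  # Linux reports SSE3 as "pni"
        return "Makefile.SSE3.PTHREADS.gcc"
    return "Makefile.PTHREADS.gcc"
```

On a Linux machine, choose_raxml_makefile(read_cpu_flags(open("/proc/cpuinfo").read())) gives the name to pass to make -f.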
Discussion: The paprica database workflow
[Figure: directory tree of the paprica database. Directories shown: ref_genome_database, ptools-local, user, bacteria, archaea, refseq, comb…refpkg, and one GCF…* directory per genome. File lists recovered from the figure follow.]
Per-domain files (shown for both bacteria and archaea):
• terminal_paths.csv
• terminal_ec.csv
• internal_probs.csv
• internal_ec_probs.csv
• internal_ec_n.csv
• internal_data.csv
• genome_data_final.csv
• genome_data.csv
• combined_16S.bacteria.tax.database_info.txt (combined_16S.archaea.tax.database_info.txt for archaea)
Files in each GCF…* directory (first pair of listings):
• *.fasta
• *.hits
• *.sto
• *.5mer_bints.txt.gz
• *.genomic.fna
• *.genomic.gbff
• *.protein.faa
Files in each GCF…* directory (second pair of listings, shown with draft.combined_16S.fasta):
• *.fasta
• *.hits
• *.sto
• *.genomic.fna
• *protein.gbk
Files for paprica-mg:
• paprica-mg.dmnd
• paprica-mg.prot.csv.gz
Reference package contents:
• combined_16S.[domain].tax.clean.align.fasta
• combined_16S.[domain].tax.clean.align.sto
• CONTENTS.json
• phylo_modeleSi5_T.json
• RAxML_fastTreeSH_Support.conf.root.ref.tre
• RAxML_info.ref.tre
paprica-make_ref.py
• Downloads all completed genomes from Genbank
• Counts 16S genes in each genome and pulls a representative
• Calculates other genome parameters
• Constructs 16S alignment and distance matrix
• Constructs genome distance matrix (compositional vector based)
• Calculates phi from 16S distance matrix and genome distance matrix
• Finds 16S genes in user genomes (if present)
• Adds user 16S genes to previous alignment
paprica-place_it.py
• Constructs reference tree and reference package from 16S alignment
paprica-build_core_genomes.py
• Predicts metabolic pathways for each genome
• Tallies up EC numbers for each genome
• For each internal node on reference tree determines mean parameters, and fraction of occurrence of EC numbers and metabolic pathways
• Exports all of this information as csv files
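The "fraction of occurrence" step for an internal node can be pictured as follows: for the genomes in the clade below the node, count how many contain each EC number and divide by the clade size. This is a minimal sketch of that idea, not code from paprica-build_core_genomes.py; the function name and the example EC sets are hypothetical.

```python
from collections import Counter


def fraction_of_occurrence(member_ec_sets):
    """Map each EC number to the fraction of clade members that carry it."""
    clade_size = len(member_ec_sets)
    if clade_size == 0:
        raise ValueError("clade has no members")
    counts = Counter()
    for ec_numbers in member_ec_sets:
        counts.update(set(ec_numbers))  # count each EC once per genome
    return {ec: count / clade_size for ec, count in counts.items()}


# Hypothetical clade of two genomes.
clade = [{"1.1.1.1", "1.1.1.100"}, {"1.1.1.1"}]
print(fraction_of_occurrence(clade))
```

An EC number found in every member gets a fraction of 1.0, and one found in a single member of a two-genome clade gets 0.5. The same function would work for pathway names in place of EC numbers.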
Demonstration: paprica-mg.py
• If you’re on a server you can follow the tutorial at http://www.polarmicrobes.org/?p=1596
    paprica-mg_run.py -i ERR318619_1.qc.fasta.gz -o demo -ref_dir ref_genome_database -pathways F
• test.annotation.csv: The number of hits in the metagenome, by EC number. This is probably the most useful file to you. The columns are:
  • index: The accession of a representative protein from the database
  • genome: Genome the representative protein comes from
  • domain: Domain of this genome
  • EC_number: The EC number
  • product: A sensible name for the gene product
  • start: Start position of the gene in the genome
  • end: End position of the gene in the genome
  • n_occurences: The number of occurrences of this EC number in the database
  • nr_hits: The number of reads that matched this EC number. Each read is allowed only one hit.
• test.paprica-mg.nr.daa: The DIAMOND format results file. Only one hit per read is reported.
• test.paprica-mg.nr.txt: A text file of the DIAMOND results. Only one hit per read is reported.
• test_mg.pathologic (for -pathways T only): A directory containing .gbk files for each genome in the paprica database that received a hit, with each EC number that got a hit for that genome.
• test.pathways.txt: A simple list of all the pathways that were predicted for the metagenome.
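The annotation file can be ranked by nr_hits to see which EC numbers recruited the most reads. The sketch below assumes a comma-separated file whose header uses the column names listed for test.annotation.csv; the rows in the toy table are invented placeholders, and the function name is hypothetical.

```python
import csv
import io


def top_hits(csv_text, n=2):
    """Return (EC_number, product, nr_hits) for the n rows with most hits."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append((row["EC_number"], row["product"], float(row["nr_hits"])))
    rows.sort(key=lambda item: item[2], reverse=True)
    return rows[:n]


# Toy table: column names from the slide, placeholder values.
TOY_ANNOTATION = """\
index,genome,domain,EC_number,product,start,end,n_occurences,nr_hits
protein_A,genome_A,bacteria,1.1.1.1,alcohol dehydrogenase,100,1100,12,7
protein_B,genome_B,bacteria,1.1.1.100,example reductase,200,950,30,42
protein_C,genome_C,archaea,2.7.7.7,DNA polymerase,50,2500,5,0
"""

for ec, product, hits in top_hits(TOY_ANNOTATION):
    print(ec, product, hits)
```

In a real file the first header cell may be empty rather than "index", which does not matter here because that column is not used.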
• Evaluations
On to your own analysis!