GBS Bioinformatics Pipeline(s) Overview - Cornell...


Page 1: GBS Bioinformatics Pipeline(s) Overview - Cornell Universitycbsu.tc.cornell.edu/lab/doc/GBS_Bioinformatics... · GBS Bioinformatics Pipeline(s) Overview Getting from sequence files

GBS Bioinformatics Pipeline(s) Overview

Getting from sequence files to genotypes.

Pipeline Coding: Ed Buckler, Jeff Glaubitz, James Harriman

Presentation: Terry Casstevens, with supporting information from the coders.


Three Pipelines

• Discovery Pipeline
  – Requires a reference genome
  – Multiple steps to get to genotypes
  – Hands-on tutorial is based on this pipeline

• Production Pipeline
  – Uses information from Discovery Pipeline
  – One step from sequence to genotypes

• UNEAK Pipeline
  – For species without a reference genome
  – Fei Lu will present this tomorrow at 9:30


Vocabulary

• Sequence File
  – Text file containing DNA sequence reads and supplemental information from the Illumina platform.

• Taxa
  – An individual sample.

• GBS Bar Code
  – A short, known DNA sequence used to assign a GBS Tag to its original Taxa.

• Key File
  – Text file used to assign a GBS Bar Code to a Taxa.

• GBS Tag
  – DNA sequence consisting of a cut site remnant and additional sequence.

• Plugin
  – TASSEL pipeline module that performs a specific task.


GBS Discovery Pipeline

Diagram (Discovery): boxes labeled Sequence, Tag Counts, TOPM, Tags by Taxa, SNP Caller, and Genotypes.


HWI-ST397 0 3 68 15896 200039 0 1 GTCGATTCTGCTGACTTCATGGCTTCTGTTGACG
HWI-ST397 0 3 68 15960 200043 0 1 GAGAATCAGCTTTTCCAACACCTTGAGTTTGAGT
HWI-ST397 0 3 68 15831 200053 0 1 ATGTACTGCACCGTTGCAAGCGAGCACCACCAA
HWI-ST397 0 3 68 15867 200049 0 1 CCAGCTCAGCCTGCATTCTTTCAAAAACTTCCAA
HWI-ST397 0 3 68 15943 200048 0 1 GATTTTACTGCACATCGGTCTTGTCACACCAGCT
HWI-ST397 0 3 68 15812 200062 0 1 TCACCCAGCATCACGCCCCTTCACATCCAGTAAA
HWI-ST397 0 3 68 15888 200067 0 1 CTTGACTGCCACCATGAATATGTGTTCCAAGTGC
HWI-ST397 0 3 68 15969 200067 0 1 CCACAACTGCTCCATCTTTTCCATGAGACATTGC
HWI-ST397 0 3 68 15786 200078 0 1 GTATTCTGCACACGAATCAGCTGAGACACCAATT
HWI-ST397 0 3 68 15830 200072 0 1 AATATGCCAGCAGTTAAGAGAGTTCAAGATCCAG
HWI-ST397 0 3 68 15863 200073 0 1 CTCCCTGCGGGTGCGCGCGACCCATCTTCAGTT
HWI-ST397 0 3 68 15762 200088 0 1 TGGTACGTCTGCGGAATGGCGTTTTTTATGCCTT
HWI-ST397 0 3 68 15903 200085 0 1 GGACCTACTGCCCAAGAACGGCTCACCCATCAT
HWI-ST397 0 3 68 15921 200082 0 1 GAGAATCAGCGTGTACGGGGCACGGGGTGACT
HWI-ST397 0 3 68 15984 200085 0 1 TTCTCCAGCCGCATGGGCCGGAGACCAGAGAG
HWI-ST397 0 3 68 15788 200096 0 1 GCGTCAGCAAATGCCCCAACAGCCAAGTCAGCA
HWI-ST397 0 3 68 15842 200099 0 1 TAGGCCATCAGCTGACTTCCCGGGTGTGGAGAA
HWI-ST397 0 3 68 15876 200105 0 1 GGACCTACTGCCGGCGGGACGAAAGCGGTTGT
HWI-ST397 0 3 68 15937 200097 0 1 CTCCCTGTTGAAGCATGTGCAAAAGAGCTTGTTC
HWI-ST397 0 3 68 15958 200102 0 1 CGCCTTATCTGCCCTCGCCGGTCATGGGGAGTG

Raw Sequence (Qseq)
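A qseq file is tab-delimited text with one read per line. The sketch below pulls the fields of such a line into a dictionary. It assumes the usual Illumina qseq column order (machine, run, lane, tile, x, y, index, read number, sequence, quality, filter); the last two columns are not visible on the slide, and the function and field names are illustrative, not part of TASSEL.

```python
# Column names are an assumption based on the usual qseq layout;
# the slide itself shows only the first nine columns.
QSEQ_FIELDS = ("machine", "run", "lane", "tile", "x", "y",
               "index", "read_number", "sequence", "quality", "filter")


def parse_qseq_line(line):
    """Split one tab-delimited qseq line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(QSEQ_FIELDS, values))
```

All values are kept as strings, so callers can decide which ones (lane, tile, coordinates) they want converted to integers.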


Key File


GBS Tags

Fragment from GBS library: Barcode adapter | Cut site | Insert | Cut site | Common adapter

‘Good’ reads: (only the first 64 bases after the barcode are kept)

• typical read: Barcode | Cut site | Insert (first 64 bases)
• chimera or partial digestion: Barcode | Cut site | Insert (<64bp) | Cut site | 2nd Insert
• short fragment: Barcode | Cut site | Insert (<64bp) | Cut site | Common adapter


GBS Tags

Fragment from GBS library: Barcode adapter | Cut site | Insert | Cut site | Common adapter

‘Good’ reads: (only the first 64 bases after the barcode are kept)

• typical read: Barcode | Cut site | Insert (first 64 bases)
• chimera or partial digestion: Barcode | Cut site | Insert (<64bp) | Cut site
• short fragment: Barcode | Cut site | Insert (<64bp) | Cut site

adapter dimer: Barcode | Cut site | Common adapter

Rejected reads:

• Not matching barcode and cut site remnant
• Contains N in first 64 bases after the barcode
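The acceptance rules above can be sketched in Python as follows. This is an illustration rather than the TASSEL implementation: the cut site remnants CAGC and CTGC assume an ApeKI library, the function name is made up, and the trimming of chimeras and short fragments at the second cut site is left out.

```python
TAG_LENGTH = 64  # bases kept after the barcode, as stated on the slide


def extract_tag(read, barcodes, remnants=("CAGC", "CTGC")):
    """Return (barcode, tag) for a good read, or None if the read is rejected.

    `remnants` are the expected cut site remnants; the defaults assume ApeKI.
    """
    # Try longer barcodes first so a short barcode that is a prefix of a
    # longer one cannot claim the read.
    for barcode in sorted(barcodes, key=len, reverse=True):
        if not read.startswith(barcode):
            continue
        rest = read[len(barcode):]
        if not rest.startswith(remnants):
            continue  # barcode matched but no cut site remnant follows
        tag = rest[:TAG_LENGTH]
        if "N" in tag:
            return None  # rejected: N within the first 64 bases
        return barcode, tag
    return None  # rejected: no barcode + cut site remnant match
```

For example, with the hypothetical barcode `GTCGATT`, the read `GTCGATTCTGCTGAC...` would yield a tag beginning `CTGCTGAC`.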


Tag Counts

• With information from the key file, each sequence file is processed, tags are identified and counted.

• If a tag is shorter than 64 bases it is padded.

• The tags and counts are put into a tag count file for each sequence file.

QseqToTagCountsPlugin / FastqToTagCountsPlugin
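A minimal Python sketch of the counting step, assuming the tags have already been extracted from the reads. The pad character "A" is an assumption (the slide says only that short tags are padded), and the function names are illustrative.

```python
from collections import Counter

TAG_LENGTH = 64


def pad_tag(tag, pad_char="A"):
    """Pad (or trim) a tag to TAG_LENGTH; the pad character is an assumption."""
    return tag[:TAG_LENGTH].ljust(TAG_LENGTH, pad_char)


def count_tags(tags):
    """Count how often each padded tag occurs in one sequence file."""
    return Counter(pad_tag(tag) for tag in tags)
```

Because padding can make two tags of different original length look identical, a real tag count file would also need to record each tag's true length.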


Master Tag Counts

• The individual tag count files are merged into a master tag count file.

• A minimum count is specified at the merge stage to exclude tags with low counts (likely sequencing errors).

MergeMultipleTagCountsPlugin
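The merge with a minimum count can be sketched in Python like this, treating each tag count file as a mapping from tag to count. The function and parameter names are hypothetical and do not correspond to the plugin's options.

```python
from collections import Counter


def merge_tag_counts(count_tables, min_count=1):
    """Sum per-file tag counts and drop tags seen fewer than min_count times."""
    merged = Counter()
    for table in count_tables:
        merged.update(table)  # adds counts for tags already present
    return {tag: n for tag, n in merged.items() if n >= min_count}
```

Applying the threshold after summing means a tag that is rare in every single file can still be kept if its total across files is high enough.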


Conversion of Tags to Fastq

• Sequence aligners do not work with the tag count file format.

• In preparation for the alignment step, the Master Tag Count file is converted to fastq format.

TagCountsToFastqPlugin
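A rough Python sketch of such a conversion is shown below. Tags carry no base qualities, so a constant placeholder quality character is written; that choice, the record naming scheme, and the function name are all assumptions made for illustration.

```python
def tags_to_fastq(tag_counts, quality_char="I"):
    """Render a tag -> count mapping as FASTQ text, one 4-line record per tag."""
    records = []
    for number, (tag, count) in enumerate(sorted(tag_counts.items())):
        header = f"@tag{number}_count={count}"   # hypothetical naming scheme
        quality = quality_char * len(tag)        # placeholder qualities
        records.append(f"{header}\n{tag}\n+\n{quality}\n")
    return "".join(records)
```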


Tag Alignment / TOPM

• The GBS pipeline uses an external aligner to do the initial alignment. 

• The current version uses bowtie2 which produces the alignment in the SAM format.

• We convert the SAM file into our Tags On Physical Map (TOPM) format.

bowtie2

SAMConverterPlugin
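To illustrate what a converter has to read from the alignment, the Python sketch below extracts the chromosome, position, and strand of each tag from SAM records. It relies only on the standard SAM columns and FLAG bits (0x4 unmapped, 0x10 reverse strand). The output dictionary is a stand-in for illustration; the actual TOPM format stores more than this, and how it records positions for reverse-strand tags is not covered here.

```python
def sam_record_to_position(line):
    """Extract tag name, chromosome, position and strand from one SAM record."""
    fields = line.rstrip("\n").split("\t")
    tag_name, flag = fields[0], int(fields[1])
    if flag & 0x4:  # FLAG bit 0x4: read is unmapped
        return {"tag": tag_name, "chrom": None, "pos": None, "strand": None}
    strand = "-" if flag & 0x10 else "+"  # FLAG bit 0x10: reverse strand
    return {"tag": tag_name, "chrom": fields[2],
            "pos": int(fields[3]), "strand": strand}


def read_sam(lines):
    """Convert all alignment records, skipping '@' header lines."""
    return [sam_record_to_position(l) for l in lines if not l.startswith("@")]
```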


TOPM


So Far We Have

• Identified and counted GBS tags.
• Converted tag counts file to fastq.
• Aligned the tags to a reference.
• Converted the alignment to TOPM.


Tags by Taxa

• In this step we identify which tags are present in which taxa.
  – Original Sequence Files
  – Key File
  – Master Tag Count File

• Recently migrated to HDF5 file format.
  – Efficient storage
  – Large data sets

SeqToTBTHDF5Plugin
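Conceptually, a Tags by Taxa table is a matrix with one row per taxon and one column per tag in the master tag list. The Python sketch below builds such a table in memory; it is an illustration under that assumption and says nothing about the HDF5 layout the plugin uses. All names are hypothetical.

```python
def build_tbt(tags_by_taxon, master_tags):
    """Count, for every taxon, how often each master tag was observed.

    tags_by_taxon: taxon name -> list of tags read for that taxon
    master_tags:   ordered list of tags from the master tag count file
    """
    column = {tag: i for i, tag in enumerate(master_tags)}
    tbt = {}
    for taxon, tags in tags_by_taxon.items():
        row = [0] * len(master_tags)
        for tag in tags:
            i = column.get(tag)
            if i is not None:  # tags absent from the master list are ignored
                row[i] += 1
        tbt[taxon] = row
    return tbt
```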


Tags By Taxa Additional Operations

• If multiple TBTs have been created, they are merged into one TBT.

• Taxa that were sequenced multiple times are merged.

• The TBT table is pivoted in preparation for SNP calling.

ModifyTBTHDF5Plugin
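Two of these operations can be sketched in Python on an in-memory table (taxon name -> list of per-tag counts). The sketch assumes that merging replicate taxa means summing their counts and that pivoting means switching from one row per taxon to one row per tag; both are interpretations of the slide, and the names are hypothetical.

```python
def merge_replicates(tbt, sample_of):
    """Sum the rows of taxa that map to the same sample name."""
    merged = {}
    for taxon, row in tbt.items():
        name = sample_of.get(taxon, taxon)
        if name in merged:
            merged[name] = [a + b for a, b in zip(merged[name], row)]
        else:
            merged[name] = list(row)
    return merged


def pivot(tbt, tags):
    """Turn a taxon-major table into a tag-major one."""
    taxa = list(tbt)
    by_tag = {tag: [tbt[t][i] for t in taxa] for i, tag in enumerate(tags)}
    return taxa, by_tag
```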


SNP Calling

• Files used in SNP Calling
  – TOPM
  – TBT
  – Pedigree File (optional)

• Some Key Settings
  – mnF: Minimum F (inbreeding coefficient)
  – mnMAF: Minimum Minor Allele Frequency
  – mnMAC: Minimum Minor Allele Count
  – mnLCov: Minimum Locus Coverage

TagsToSNPByAlignmentPlugin
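The quantities behind these settings can be computed for a single site as in the Python sketch below. It uses textbook definitions (F = 1 - observed heterozygosity / expected heterozygosity; coverage = fraction of taxa with a call); the plugin's exact definitions and how it combines the thresholds may differ, and the function name is illustrative.

```python
from collections import Counter


def site_stats(genotypes):
    """Summarise one site.

    genotypes: list of (allele, allele) tuples, or None for a missing call.
    Returns coverage, minor allele frequency/count and inbreeding coefficient.
    """
    called = [g for g in genotypes if g is not None]
    coverage = len(called) / len(genotypes) if genotypes else 0.0
    allele_counts = Counter(a for g in called for a in g)
    total = sum(allele_counts.values())
    ranked = allele_counts.most_common()
    mac = ranked[1][1] if len(ranked) > 1 else 0  # second most common allele
    maf = mac / total if total else 0.0
    h_exp = 1.0 - sum((n / total) ** 2 for n in allele_counts.values()) if total else 0.0
    h_obs = sum(1 for a, b in called if a != b) / len(called) if called else 0.0
    f = 1.0 - h_obs / h_exp if h_exp > 0 else None  # undefined if monomorphic
    return {"coverage": coverage, "maf": maf, "mac": mac, "f": f}
```

A site would then be kept or dropped by comparing these values against the chosen mnLCov, mnMAF, mnMAC, and mnF thresholds.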


HapMap

rs# alleles chrom pos strand SgSBRIL067:633Y5AAXX:2:C9 SgSBRIL019:633Y5AAXX:2:C3
S1_2100 A/G 1 2100 + N N N N N N N R N A N
S1_2163 T/C 1 2163 + N N N N N N T C T T N
S1_13837 T/G 1 13837 + N N N N N N N G N N T
S1_14606 C/T 1 14606 + N N C N N N T T T T C
S1_2061 T/A 1 20601 + T N N N N N N A N N N
S1_68332 C/T 1 68332 + N N N N N N N N N N N
S1_68596 A/T 1 68596 + A N N N N N N N N A N
S1_69309 G/A 1 69309 + N G N N N N N A N N N
S1_79955 T/G 1 79955 + N T G T T N T T N N N
S1_79961 T/G 1 79961 + N T T T T N T T N N N
S1_80584 G 1 80584 + N N N N N N N N N N G
S1_80647 C/T 1 80647 + N N N N N N N C N N C
S1_81274 T/G 1 81274 + N N N N N N T G N N N
S1_108834 G/A 1 108834 + N N N N N N N N N N N
S1_112345 T/G 1 112345 + N N N N N N K T N N N
S1_115359 C/T 1 115359 + N N N N N N T C N T
S1_115362 T/C 1 115362 + N N N N N N N C N N N
S1_115405 G/A 1 115405 + G G A N N G G G G N
S1_115516 T/G 1 115516 + N N T N N N T T N N T
S1_116694 A/G 1 116694 + N A G N N N G A N N N
S1_119016 C/T 1 119016 + N N N N C N N C N N N
S1_155366 T/C 1 155366 + N T N N N N
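The genotype columns use single-letter calls: A, C, G, and T for homozygotes, IUPAC ambiguity codes for heterozygotes (for example R = A/G, K = G/T), and N for missing data. A small Python sketch for decoding a row as it appears on the slide is shown below; it assumes the five leading columns visible here (full HapMap files carry additional fixed columns), and the function names are illustrative.

```python
# IUPAC nucleotide codes for homozygous and two-allele heterozygous calls.
IUPAC = {
    "A": ("A", "A"), "C": ("C", "C"), "G": ("G", "G"), "T": ("T", "T"),
    "R": ("A", "G"), "Y": ("C", "T"), "S": ("C", "G"),
    "W": ("A", "T"), "K": ("G", "T"), "M": ("A", "C"),
}


def decode_call(code):
    """Return the allele pair for a one-letter call, or None for missing (N)."""
    if code == "N":
        return None
    return IUPAC[code]


def parse_hapmap_row(line):
    """Parse a whitespace-separated row with five leading columns."""
    fields = line.split()
    return {
        "rs": fields[0],
        "alleles": fields[1].split("/"),
        "chrom": fields[2],
        "pos": int(fields[3]),
        "strand": fields[4],
        "calls": [decode_call(c) for c in fields[5:]],
    }
```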


GBS Discovery pipeline

Diagram (Discovery): boxes labeled Fastq, Tag Counts, TOPM, Tags by Taxa, SNP Caller, Genotypes, and Filtered Genotypes.


Production Pipeline


Why another pipeline?

• The last maize build (30,000 taxa) with the discovery pipeline took weeks.

• Most common alleles have been identified after the first few discovery builds.

• Use the information from the discovery pipeline to call SNPs in new runs quickly.

• Improve efficiency and automate.


GBS Bioinformatics Pipelines

Diagram: the Discovery side shows Fastq, Tag Counts, Tags by Taxa, TOPM, SNP Caller, Genotypes, and Filtered Genotypes; the Production side shows Fastq, TOPM, and Genotypes.


Running the Production Pipeline

• Required Files:
  – Sequence file (fastq or qseq)
  – Key file
  – Production TOPM

• TASSEL 3 Standalone & RawReadsToHapMapPlugin

• Running the Pipeline:
  – One lane processed at a time
  – HapMap files by chromosome

• ~40 minutes
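The idea of going from reads to genotypes in one step can be illustrated with a lookup table. The Python sketch below assumes, purely for illustration, that a production TOPM can be viewed as a mapping from tag to the site and allele that tag supports; the real file carries more information, and this is not how RawReadsToHapMapPlugin is implemented.

```python
def call_genotypes(tags_by_taxon, production_topm):
    """Collect the alleles observed per site and taxon.

    tags_by_taxon:   taxon name -> iterable of tags read for that taxon
    production_topm: tag -> (site_id, allele); a hypothetical simplified view
    """
    calls = {}
    for taxon, tags in tags_by_taxon.items():
        for tag in tags:
            hit = production_topm.get(tag)
            if hit is None:
                continue  # tag carries no known SNP allele
            site, allele = hit
            calls.setdefault(site, {}).setdefault(taxon, set()).add(allele)
    return calls
```

A taxon with one allele at a site would be reported as homozygous and one with two alleles as heterozygous; no new SNP discovery happens in this step, which is why it is fast.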


Testing Production Pipeline

• Compared HapMap files produced by Discovery Pipeline and Production Pipeline

• Site Comparison:
  – Discovery 48,139
  – Production 47,676
  – Difference due to maximum 8 alleles

• 99.98% correlation of genetic distance matrices
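The slide does not say how the correlation was computed. One common approach, sketched below in Python, is a Pearson correlation over the off-diagonal entries of the two taxa-by-taxa distance matrices; the function names are illustrative.

```python
from math import sqrt


def upper_triangle(matrix):
    """Return the entries above the diagonal of a square matrix, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def distance_matrix_correlation(a, b):
    """Pearson correlation between the pairwise distances in a and b."""
    x, y = upper_triangle(a), upper_triangle(b)
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y))
    var_x = sum((p - mean_x) ** 2 for p in x)
    var_y = sum((q - mean_y) ** 2 for q in y)
    return cov / sqrt(var_x * var_y)
```

Both matrices must list the same taxa in the same order for the comparison to be meaningful.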


Next Steps In Pipeline Development

• Hierarchical Data Format: supports very large data sets and complex data structures.
• Working to fuse TOPM, TBT, Keyfile, and Pedigree File into one HDF5 repository.
• Continued improvements to SNP caller.
• Ability to use tags not present in the reference.