Post on 12-Jan-2016

GBS Bioinformatics Pipeline(s) Overview

Getting from sequence files to genotypes.

Pipeline Coding: Ed Buckler, Jeff Glaubitz, James Harriman

Presentation: Terry Casstevens, with supporting information from the coders.

Three Pipelines

• Discovery Pipeline
– Requires a reference genome
– Multiple steps to get to genotypes
– Hands-on tutorial is based on this pipeline

• Production Pipeline
– Uses information from Discovery Pipeline
– One step from sequence to genotypes

• UNEAK Pipeline
– For species without a reference genome
– Fei Lu will present this tomorrow at 9:30

Vocabulary

• Sequence File
– Text file containing DNA sequence reads and supplemental information from the Illumina platform.
• Taxa
– An individual sample
• GBS Bar Code
– A short known sequence of DNA used to assign a GBS Tag to its original Taxa
• Key File
– Text file used to assign a GBS Bar Code to a Taxa
• GBS Tag
– DNA sequence consisting of a cut site remnant and additional sequence.
• Plugin
– TASSEL pipeline module that performs a specific task

GBS Discovery Pipeline

[Diagram: Discovery pipeline. Stages shown: Sequence, Tag Counts, TOPM, Tags by Taxa, SNP Caller, Genotypes.]

HWI-ST397 0 3 68 15896 200039 0 1 GTCGATTCTGCTGACTTCATGGCTTCTGTTGACGACGATGTGGAACGAGCTGTTGTTGAAACTGATGAGGTTGCTGAGATCGGAAGAGCGGTTCAGCAGG
HWI-ST397 0 3 68 15960 200043 0 1 GAGAATCAGCTTTTCCAACACCTTGAGTTTGAGTATGCGATGACAGTTACTCTTACTGTCCATTGTCAGCATTGCCAGAGCTTGACCAGCTGAGATCGGA
HWI-ST397 0 3 68 15831 200053 0 1 ATGTACTGCACCGTTGCAAGCGAGCACCACCAAGCGGCGGTATGCACTTTGCAATATGTAGCTAGAATAGGATTTTCAGGTGATTAGGAGCGTAAAAAAG
HWI-ST397 0 3 68 15867 200049 0 1 CCAGCTCAGCCTGCATTCTTTCAAAAACTTCCAATGCCTCTCTTGGCCTAGCATTTTGGGCATACCCTGTGACCATTGCTGTCCATGCCACCATATCCTT
HWI-ST397 0 3 68 15943 200048 0 1 GATTTTACTGCACATCGGTCTTGTCACACCAGCTATACCTGTAGAGTTGCCTTCCACAGTTGTAGAGATCGGAAGAGCGGTTCAGCGGGACTGCCGAGAA
HWI-ST397 0 3 68 15812 200062 0 1 TCACCCAGCATCACGCCCCTTCACATCCAGTAAAACCCCTGAATGATGTGCTGTCACTGTTTGATATACAGTTGTTAACGTGAGGACGGGCTTTGAAGGA
HWI-ST397 0 3 68 15888 200067 0 1 CTTGACTGCCACCATGAATATGTGTTCCAAGTGCCACAAGGACTTGGCCCTGAAGCAAGAACAAGCCAAACTTGCAGAGATCGGAAGAGCGGTTCAGCAG
HWI-ST397 0 3 68 15969 200067 0 1 CCACAACTGCTCCATCTTTTCCATGAGACATTGCTCCCGCCATTGCACCCTTGGCATCAGCAGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
HWI-ST397 0 3 68 15786 200078 0 1 GTATTCTGCACACGAATCAGCTGAGACACCAATTGGGCATGAATCAAATGGCGCCATTGCCGGGGATCGAACCCCGAATCAAATGGTGCCATTGCCACTG
HWI-ST397 0 3 68 15830 200072 0 1 AATATGCCAGCAGTTAAGAGAGTTCAAGATCCAGGGCTCATATTCAGTCACCTATATCAATTTCGAAATGGATTTCCAGGGTTTTAAGAGCCTAACAAAG
HWI-ST397 0 3 68 15863 200073 0 1 CTCCCTGCGGGTGCGCGCGACCCATCTTCAGTTGGAGCGTCTATCGGCGTTGCTGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTA
HWI-ST397 0 3 68 15762 200088 0 1 TGGTACGTCTGCGGAATGGCGTTTTTTATGCCTTAGTGGTTCGCAGAGCATTTGGCAGCTGAGATGGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGAT
HWI-ST397 0 3 68 15903 200085 0 1 GGACCTACTGCCCAAGAACGGCTCACCCATCATCCGCTTTCTTCACCTTCCGTCTTCTTTGGCTGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGAC
HWI-ST397 0 3 68 15921 200082 0 1 GAGAATCAGCGTGTACGGGGCACGGGGTGACTGCTGTTGCGTGCGAGGGCTGAGATCGGAAGAGCGGTTCAGCAGGAGTGCCGAGACCGATCTCGTATGC
HWI-ST397 0 3 68 15984 200085 0 1 TTCTCCAGCCGCATGGGCCGGAGACCAGAGAGGCCTCCCCAGGATTTGCACGATAGACCACGACTTATGGACGATTGGGAAGCCCTTGTTGGAAGGAAAT
HWI-ST397 0 3 68 15788 200096 0 1 GCGTCAGCAAATGCCCCAACAGCCAAGTCAGCAATTGCCTCAGCAACTTGGGCCACAAACACCACAGCTGAGATCGGAAGAGCGGTTCAGCAGGAATGCC
HWI-ST397 0 3 68 15842 200099 0 1 TAGGCCATCAGCTGACTTCCCGGGTGTGGAGAAAAGAGGGCCCCTCACTTCTCTCAAGTGCTGAGATCGGAAGAGCGGTTCAGCAGGAATGCGGAGACCG
HWI-ST397 0 3 68 15876 200105 0 1 GGACCTACTGCCGGCGGGACGAAAGCGGTTGTTGAATGATGGGGGTCACTAGGCCTTCCAGGGCCTTTAAGCGCGCGCTGAGATCGGAAGAGGGGTTCAG
HWI-ST397 0 3 68 15937 200097 0 1 CTCCCTGTTGAAGCATGTGCAAAAGAGCTTGTTCTCGGCCTTCTTCAAGCCATTCTCTTGGCAGACGGCTTTGCCTAGAAGTTTCGCCCCATCACCCTTG
HWI-ST397 0 3 68 15958 200102 0 1 CGCCTTATCTGCCCTCGCCGGTCATGGGGAGTGGTGCCCCTACCTCGGACAAGACAGATGCAGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG

Raw Sequence (Qseq)

Key File

GBS Tags

Fragment from GBS library:
Barcode adapter | Cut site | Insert | Cut site | Common adapter

‘Good’ reads: (only the first 64 bases after the barcode are kept)
– typical read: Barcode | Cut site | Insert (first 64 bases)
– short fragment: Barcode | Cut site | Insert (<64bp) | Cut site | Common adapter
– chimera or partial digestion: Barcode | Cut site | Insert (<64bp) | Cut site | 2nd Insert

Rejected reads:
– adapter dimer: Barcode | Cut site | Common adapter
• Not matching barcode and cut site remnant
• Contains N in first 64 bases after the barcode

Tag Counts

• With information from the key file, each sequence file is processed, tags are identified and counted.

• If a tag is shorter than 64 bases it is padded.

• The tags and counts are put into a tag count file for each sequence file.

QseqToTagCountsPlugin / FastqToTagCountsPlugin
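The tag counting step can be sketched in a few lines of Python. This is only an illustration of the rules stated on the slides (match barcode plus cut site remnant, keep 64 bases after the barcode, reject reads containing N, pad short tags), not the plugin's actual code. The function names, the example barcode and cut site remnant, and the choice of "A" as the padding base are assumptions. The sketch also does not truncate short fragments at a second cut site or at the common adapter.

```python
from collections import Counter

TAG_LENGTH = 64   # bases kept after the barcode, per the slides
PAD_BASE = "A"    # assumption: the slides do not say which character is used for padding


def extract_tag(read, barcodes, cut_site_remnant):
    """Return (taxon, tag) for a good read, or None if the read is rejected.

    barcodes maps barcode sequence -> taxon name (the role of the key file).
    """
    for barcode, taxon in barcodes.items():
        if read.startswith(barcode + cut_site_remnant):
            # Keep at most TAG_LENGTH bases, starting at the cut site remnant.
            tag = read[len(barcode):len(barcode) + TAG_LENGTH]
            if "N" in tag:
                return None  # rejected: N within the kept bases
            return taxon, tag.ljust(TAG_LENGTH, PAD_BASE)
    return None  # rejected: no barcode + cut site remnant match


def count_tags(reads, barcodes, cut_site_remnant):
    """Count how often each tag occurs among the good reads of one sequence file."""
    counts = Counter()
    for read in reads:
        hit = extract_tag(read, barcodes, cut_site_remnant)
        if hit is not None:
            counts[hit[1]] += 1
    return counts
```

For example, with the hypothetical key `{"CTCC": "sample_1"}` and remnant `"CAGC"`, a read starting with `CTCCCAGC` yields a 64-base tag beginning with `CAGC`, while a read with an unknown barcode is dropped.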

Master Tag Counts

• The individual tag count files are merged into a master tag count file.

• A minimum count is specified at the merge stage to exclude tags with low counts (likely sequencing errors).

MergeMultipleTagCountsPlugin
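A minimal sketch of the merge step, assuming each per-file tag count is a mapping from tag sequence to count and that the minimum count is applied to the summed total (the slides do not say whether the threshold applies before or after summing). The function name `merge_tag_counts` is hypothetical.

```python
from collections import Counter


def merge_tag_counts(per_file_counts, min_count):
    """Sum per-file tag counts and drop tags whose total is below min_count."""
    total = Counter()
    for counts in per_file_counts:
        total.update(counts)  # adds counts tag by tag
    return {tag: n for tag, n in total.items() if n >= min_count}
```

Tags seen only once or twice across all files are the ones most likely to be sequencing errors, which is why a threshold at this stage removes them cheaply.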

Conversion of Tags to Fastq

• Sequence aligners do not work with the tag count file format.

• In preparation for the alignment step, the Master Tag Count file is converted to fastq format.

TagCountsToFastqPlugin
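As an illustration of what such a conversion involves, the sketch below writes each tag as a four-line FASTQ record. The record naming scheme (`tag0;count=3`) and the constant placeholder quality character are assumptions made for this example; the plugin's actual output conventions are not shown on the slides.

```python
def tag_counts_to_fastq(tag_counts):
    """Render a tag -> count mapping as FASTQ text, one record per tag."""
    lines = []
    for i, (tag, count) in enumerate(sorted(tag_counts.items())):
        lines.append(f"@tag{i};count={count}")  # hypothetical read name
        lines.append(tag)
        lines.append("+")
        lines.append("I" * len(tag))  # placeholder quality, same length as the tag
    return "\n".join(lines) + "\n" if lines else ""
```

Tags carry no per-base quality of their own, so a sketch like this has to supply a dummy quality string for the aligner to accept the file.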

Tag Alignment / TOPM

• The GBS pipeline uses an external aligner to do the initial alignment.

• The current version uses bowtie2, which produces the alignment in the SAM format.

• We convert the SAM file into our tags on physical map format (TOPM).

bowtie2

SAMConverterPlugin

TOPM
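To make the SAM-to-TOPM idea concrete, here is a small sketch that reads SAM alignment lines and records, for each tag, the chromosome, position and strand it aligned to. It relies only on the standard SAM columns (QNAME, FLAG, RNAME, POS) and flag bits (0x4 unmapped, 0x10 reverse strand). The function name and the returned dictionary layout are assumptions; the real TOPM format stores more information than this.

```python
def parse_sam_alignments(sam_text):
    """Map each aligned tag name to (chromosome, position, strand), or None if unmapped."""
    positions = {}
    for line in sam_text.splitlines():
        if not line or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.split("\t")
        name, flag, chrom, pos = fields[0], int(fields[1]), fields[2], int(fields[3])
        if flag & 0x4:
            positions[name] = None  # tag did not align to the reference
        else:
            strand = "-" if flag & 0x10 else "+"
            positions[name] = (chrom, pos, strand)  # pos is the 1-based leftmost position
    return positions
```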

So Far We Have

• Identified and counted GBS tags.
• Converted tag counts file to fastq.
• Aligned the tags to a reference.
• Converted the alignment to TOPM.

Tags by Taxa

• In this step we identify which tags are present in which taxa.
– Original Sequence Files
– Key File
– Master Tag Count File

• Recently migrated to HDF5 file format.
– Efficient storage
– Large data sets

SeqToTBTHDF5Plugin
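Conceptually, a Tags by Taxa table records, for each taxon, which master tags were observed. The sketch below builds such a table as nested dictionaries from (taxon, tag) observations. Storing counts rather than simple presence, the in-memory dictionary layout, and the function name are all assumptions for illustration; the plugin itself writes HDF5.

```python
def build_tags_by_taxa(observations, master_tags):
    """Build taxon -> {tag: count}, keeping only tags present in the master tag list."""
    kept = set(master_tags)
    tbt = {}
    for taxon, tag in observations:
        if tag in kept:  # tags filtered out at the merge stage are ignored
            row = tbt.setdefault(taxon, {})
            row[tag] = row.get(tag, 0) + 1
    return tbt
```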

Tags By Taxa Additional Operations

• If many TBTs have been created, they are merged into one TBT.

• Taxa that were sequenced multiple times are merged.

• The TBT table is pivoted in preparation for SNP calling.

ModifyTBTHDF5Plugin
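The merge and pivot operations can be pictured with a small sketch. It assumes a TBT held as a nested dictionary (sample name to tag counts), a hypothetical `sample_to_taxon` mapping that says which sequenced samples belong to the same taxon, and a pivot that simply turns taxon-major rows into tag-major rows. None of these names come from the plugin.

```python
def merge_taxa(tbt, sample_to_taxon):
    """Sum tag counts of samples that map to the same taxon."""
    merged = {}
    for sample, tag_counts in tbt.items():
        taxon = sample_to_taxon.get(sample, sample)  # unmapped samples keep their name
        row = merged.setdefault(taxon, {})
        for tag, n in tag_counts.items():
            row[tag] = row.get(tag, 0) + n
    return merged


def pivot(tbt):
    """Turn taxon -> {tag: count} into tag -> {taxon: count}."""
    pivoted = {}
    for taxon, tag_counts in tbt.items():
        for tag, n in tag_counts.items():
            pivoted.setdefault(tag, {})[taxon] = n
    return pivoted
```

The tag-major layout suits SNP calling because the caller works through tags aligned to the same locus and needs, for each tag, the set of taxa that carry it.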

SNP Calling

• Files used in SNP Calling
– TOPM
– TBT
– Pedigree File (optional)

• Some Key Settings
– mnF: Minimum F (inbreeding coefficient)
– mnMAF: Minimum Minor Allele Frequency
– mnMAC: Minimum Minor Allele Count
– mnLCov: Minimum Locus Coverage

TagsToSNPByAlignmentPlugin
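To show what the four settings measure, the sketch below computes the corresponding statistics for one biallelic site and compares each against its threshold. It uses the textbook definitions (F = 1 - observed heterozygosity / expected heterozygosity, coverage = proportion of taxa with a call). The function names and genotype representation are assumptions, and the sketch deliberately reports each check separately because the slides do not say how the plugin combines them.

```python
from collections import Counter


def site_stats(genotypes):
    """Compute coverage, MAC, MAF and F for one biallelic site.

    genotypes: list of (allele, allele) tuples, with None for a missing call.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return {"coverage": 0.0, "mac": 0, "maf": 0.0, "f": None}
    coverage = len(called) / len(genotypes)
    alleles = Counter(a for g in called for a in g)
    total = sum(alleles.values())
    mac = min(alleles.values()) if len(alleles) == 2 else 0  # monomorphic site: no minor allele
    maf = mac / total
    het_obs = sum(1 for g in called if g[0] != g[1]) / len(called)
    het_exp = 2 * maf * (1 - maf)
    f = 1 - het_obs / het_exp if het_exp > 0 else None
    return {"coverage": coverage, "mac": mac, "maf": maf, "f": f}


def check_thresholds(stats, mnF, mnMAF, mnMAC, mnLCov):
    """Report, per setting, whether the site meets the threshold."""
    return {
        "mnF": stats["f"] is not None and stats["f"] >= mnF,
        "mnMAF": stats["maf"] >= mnMAF,
        "mnMAC": stats["mac"] >= mnMAC,
        "mnLCov": stats["coverage"] >= mnLCov,
    }
```

For a fully inbred population with no heterozygous calls, F comes out as 1, which is why a high minimum F is a useful filter against paralogous or mis-aligned tags in inbred material.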

HapMap

rs# alleles chrom pos strand SgSBRIL067:633Y5AAXX:2:C9 SgSBRIL019:633Y5AAXX:2:C3
S1_2100 A/G 1 2100 + N N N N N N N R N A N
S1_2163 T/C 1 2163 + N N N N N N T C T T N
S1_13837 T/G 1 13837 + N N N N N N N G N N T
S1_14606 C/T 1 14606 + N N C N N N T T T T C
S1_2061 T/A 1 20601 + T N N N N N N A N N N
S1_68332 C/T 1 68332 + N N N N N N N N N N N
S1_68596 A/T 1 68596 + A N N N N N N N N A N
S1_69309 G/A 1 69309 + N G N N N N N A N N N
S1_79955 T/G 1 79955 + N T G T T N T T N N N
S1_79961 T/G 1 79961 + N T T T T N T T N N N
S1_80584 G 1 80584 + N N N N N N N N N N G
S1_80647 C/T 1 80647 + N N N N N N N C N N C
S1_81274 T/G 1 81274 + N N N N N N T G N N N
S1_108834 G/A 1 108834 + N N N N N N N N N N N
S1_112345 T/G 1 112345 + N N N N N N K T N N N
S1_115359 C/T 1 115359 + N N N N N N T C N T
S1_115362 T/C 1 115362 + N N N N N N N C N N N
S1_115405 G/A 1 115405 + G G A N N G G G G N
S1_115516 T/G 1 115516 + N N T N N N T T N N T
S1_116694 A/G 1 116694 + N A G N N N G A N N N
S1_119016 C/T 1 119016 + N N N N C N N C N N N
S1_155366 T/C 1 155366 + N T N N N N
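The single-letter calls in a HapMap row use IUPAC nucleotide codes: A, C, G and T are homozygous calls, codes such as R (A/G) and K (G/T) are heterozygous, and N is missing. A short sketch of reading one row as it appears on the slide follows. The function name is hypothetical, and the row is assumed to have only the five leading columns shown on the slide; full HapMap files carry additional fixed columns before the genotypes.

```python
# IUPAC codes for unphased diploid calls; None marks a missing call.
IUPAC = {
    "A": ("A", "A"), "C": ("C", "C"), "G": ("G", "G"), "T": ("T", "T"),
    "R": ("A", "G"), "Y": ("C", "T"), "S": ("C", "G"),
    "W": ("A", "T"), "K": ("G", "T"), "M": ("A", "C"),
    "N": None,
}


def parse_hapmap_row(line, n_fixed=5):
    """Split a whitespace-separated HapMap row into site info and decoded calls."""
    fields = line.split()
    site = {
        "rs": fields[0],
        "alleles": fields[1],
        "chrom": fields[2],
        "pos": int(fields[3]),
        "strand": fields[4],
    }
    calls = [IUPAC[code] for code in fields[n_fixed:]]
    return site, calls
```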

GBS Discovery pipeline

[Diagram: Discovery pipeline with Fastq as input. Stages shown: Fastq, Tag Counts, TOPM, Tags by Taxa, SNP Caller, Genotypes, Filtered Genotypes.]

Production Pipeline

Why another pipeline?

• The last maize build (30,000 taxa) with the discovery pipeline took weeks.

• Most common alleles have been identified after the first few discovery builds.

• Use the information from the discovery pipeline to call SNPs in new runs quickly.

• Improve efficiency and automate.

GBS Bioinformatics Pipelines

[Diagram: Discovery and Production pipelines side by side. Discovery: Fastq, Tag Counts, TOPM (TagsOnPhysicalMap), Tags by Taxa, SNP Caller, Genotypes, Filtered Genotypes. Production: Fastq, TOPM, Genotypes.]

Running the Production Pipeline

• Required Files:
– Sequence file (fastq or qseq)
– Key file
– Production TOPM

• TASSEL 3 Standalone & RawReadsToHapMapPlugin

• Running the Pipeline:
– One lane processed at a time
– HapMap files by chromosome

• ~40 minutes

Testing Production Pipeline

• Compared HapMap files produced by Discovery Pipeline and Production Pipeline

• Site Comparison:
– Discovery 48,139
– Production 47,676
– Difference due to maximum 8 alleles

• 99.98% correlation of genetic distance matrices
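One way to compare two genetic distance matrices is to correlate their off-diagonal entries. The sketch below uses a plain Pearson correlation over the upper triangles of two symmetric matrices built on the same taxa in the same order. The slides do not state which correlation measure was used for the 99.98% figure, so this choice and the function names are assumptions.

```python
import math


def upper_triangle(matrix):
    """Return the entries above the diagonal of a square matrix, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def distance_matrix_correlation(a, b):
    """Correlate the pairwise distances of two matrices over the same taxa."""
    return pearson(upper_triangle(a), upper_triangle(b))
```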

Next Steps In Pipeline Development

• Hierarchical Data Format – supports very large data sets and complex data structures.

• Working to fuse TOPM, TBT, Keyfile, and Pedigree File into one HDF5 repository.

• Continued improvements to SNP caller.

• Ability to use tags not present in the reference.