Sequencing, Genome Assembly and the SGN Platform

Surya Saha, Ph.D. Cornell University & Boyce Thompson Institute

suryasaha@cornell.edu @SahaSurya

Centre for Agricultural Bioinformatics Pusa, New Delhi

June 13, 2014 Slides: http://bit.ly/CABin_Pusa_2014

http://www.acgt.me/blog/2014/3/7/next-generation-sequencing-must-die

Genome Assembly

Jason Chin http://www.bit.ly/SZPKIG

6/15/2014 Centre for Agricultural Bioinformatics, Pusa 2

You are free to:

Copy, share, adapt, or re-mix;

Photograph, film, or broadcast;

Blog, live-blog, or post video of;

This presentation. Provided that:

You attribute the work to its author and respect the rights and licenses associated with its components.

Slide Concept by Cameron Neylon, who has waived all copyright and related or neighbouring rights. This slide only ccZero. Social Media Icons adapted with permission from originals by Christopher Ross. Original images are available under GPL at http://www.thisismyurl.com/free-downloads/15-free-speech-bubble-icons-for-popular-websites

Sequencing

Sequencing over the Ages

DNA structure discovery

Sanger DNA sequencing by chain-terminating inhibitors

Epstein-Barr virus (170 Kb)

ABI 370 sequencer

Homo sapiens (3.0 Gb)

Solexa

Ion Torrent

PacBio

Haemophilus influenzae (1.83 Mb)

Illumina

Illumina Hiseq X

Pinus taeda (24 Gb)

MinION

Slide credit: Aureliano Bombarely

The Next Generation

It's all about the $£€¥

http://www.genome.gov/sequencingcosts/

First generation sequencing

Sanger method

Frederick Sanger (13 Aug 1918 – 19 Nov 2013) won the Nobel Prize in Chemistry in 1958 and 1980. He published the dideoxy chain-termination method, or “Sanger method”, in 1977.

http://dailym.ai/1f1XeTB

Sanger method

http://bit.ly/1g6Cudq

http://bit.ly/1lcQO4J

First generation sequencing

• Very high quality sequences (99.999% per-base accuracy)

• Very low throughput

Platform                          Run time  Read length  Reads / run  Nucleotides sequenced  Cost / MB
Capillary sequencing (ABI3730xl)  20m-3h    400-900 bp   96 or 386    1.9-84 Kb              $2400

http://bit.ly/1clLps3 http://1.usa.gov/1cLqIRd

Use the specific technology used to generate the data

– Illumina Hiseq/Miseq/NextSeq

– Pacific Biosciences RS I/RS II

– Ion Torrent Proton/PGM

– SOLiD

– 454

http://www.acgt.me/blog/2014/3/10/next-generation-sequencing-must-diepart-2

454 Pyrosequencing

One purified DNA fragment, to one bead, to one read.

http://bit.ly/1ehwxWN

GS FLX Titanium

http://bit.ly/1ehAcEh

Illumina

Output           15 GB       120 GB       1000 GB                            1800 GB
Number of reads  25 Million  400 Million  4 Billion                          6 Billion
Read length      2x300 bp    2x150 bp     2x125 bp (2x250 update mid-2014)   2x150 bp
Cost             $99K        $250K        $740K                              $10M

Source: Illumina

$1000 human genome??

http://1.usa.gov/1fP9ybl

http://bit.ly/1aEPOBn

Pacific Biosciences SMRT sequencing

Single Molecule Real Time sequencing

http://bit.ly/1naxgTe

Pacific Biosciences SMRT sequencing: Error correction methods

Hierarchical genome-assembly process (HGAP)

PBJelly (English et al., PLOS ONE, 2012)

Pacific Biosciences SMRT sequencing: Read Lengths

Oxford Nanopore

https://www.nanoporetech.com/

• No data yet??

• Error model

http://erlichya.tumblr.com/post/66376172948/hands-on-experience-with-oxford-nanopore-minion

Others

• Ion Torrent Proton/PGM

• Nabsys

• SOLiD

Comparison

Next generation sequencing

Platform             Run time  Read length  Quality                      Nucleotides sequenced  Cost / MB
Pyrosequencing       24h       700 bp       Q20-Q30                      0.7 GB                 $10
Illumina Miseq       27h       2x250bp      >Q30                         15 GB                  $0.15
Illumina Hiseq 2500  11 days   2x125bp      >Q30                         1000 GB                $0.05
Ion torrent          2h        400bp        >Q20                         50MB-1GB               $1
Pacific Biosciences  2h        10-20kb      >Q30 consensus, >Q10 single  400-800MB / SMRT cell  $0.33-$1

http://bit.ly/1clLps3 http://1.usa.gov/1cLqIRd
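The Cost / MB column of the comparison can be turned into a rough per-genome estimate: genome size in Mb, times the desired coverage, times the cost per Mb. The sketch below applies that arithmetic to the Homo sapiens genome size (3.0 Gb) quoted earlier. It assumes that "MB" in the comparison means megabases, and the 30x coverage is an assumed depth chosen only for illustration; the function name is hypothetical.

```python
def sequencing_cost(genome_size_mb, coverage, cost_per_mb):
    """Rough raw-sequencing cost: megabases needed times cost per megabase."""
    return genome_size_mb * coverage * cost_per_mb

# Cost / MB values copied from the comparison (low end where a range is given).
COST_PER_MB = {
    "Pyrosequencing": 10.0,
    "Illumina Miseq": 0.15,
    "Illumina Hiseq 2500": 0.05,
    "Ion torrent": 1.0,
    "Pacific Biosciences": 0.33,
}

if __name__ == "__main__":
    human_genome_mb = 3000  # Homo sapiens, 3.0 Gb
    coverage = 30           # assumed depth, for illustration only
    for platform, cost in COST_PER_MB.items():
        total = sequencing_cost(human_genome_mb, coverage, cost)
        print(f"{platform:22s} ${total:,.0f}")
```

By this arithmetic, 30x of a 3.0 Gb genome at $0.05 per Mb comes to about $4,500 of raw sequence. Library preparation, storage, and analysis are not included.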

http://omicsmaps.com/

Next Generation Genomics: World Map of High-throughput Sequencers


http://bit.ly/18pfUId

Real cost of Sequencing!!

Sboner, Genome Biology, 2011


Library Types

Single end

Paired end (PE, 150-800 bp, Fwd:/1, Rev:/2)

Mate pair (MP, 2Kb to 20 Kb)

F R 454/Roche

F R Illumina

Illumina

Slide credit: Aureliano Bombarely
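The Fwd:/1, Rev:/2 notation refers to the read-name suffixes that mark the two mates of a pair. A minimal sketch of pairing reads by that suffix follows. The read names and sequences are made up, the function name is hypothetical, and newer Illumina headers encode the mate number differently, so treat the /1 and /2 convention as an assumption about your data.

```python
def pair_reads(reads):
    """Group reads named '<id>/1' and '<id>/2' into (forward, reverse) pairs.

    reads: dict mapping read name -> sequence.
    Returns (pairs, orphans): pairs maps id -> (fwd_seq, rev_seq);
    orphans lists names whose mate is missing or that lack a /1 or /2 suffix.
    """
    forward, reverse, orphans = {}, {}, []
    for name, seq in reads.items():
        if name.endswith("/1"):
            forward[name[:-2]] = seq
        elif name.endswith("/2"):
            reverse[name[:-2]] = seq
        else:
            orphans.append(name)
    pairs = {}
    for read_id in forward:
        if read_id in reverse:
            pairs[read_id] = (forward[read_id], reverse[read_id])
        else:
            orphans.append(read_id + "/1")
    for read_id in reverse:
        if read_id not in forward:
            orphans.append(read_id + "/2")
    return pairs, orphans

if __name__ == "__main__":
    # Hypothetical read names and sequences
    example = {"frag1/1": "ACGT", "frag1/2": "TTGA", "frag2/1": "GGCC"}
    print(pair_reads(example))
```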

Implications of Choice of Library

Slide credit: Aureliano Bombarely

Consensus sequence (contig)

Scaffold (or supercontig): contigs joined using paired-read information, with gaps written as runs of N

Pseudomolecule (or ultracontig): scaffolds placed using genetic information (markers)

Quality control: Encoding

http://bit.ly/N28yUd

The Phred score of a base is: Q_phred = -10 × log10(e)

where e is the estimated probability of a base being incorrect
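As a quick illustration of the formula, the sketch below converts between error probability and Phred score and decodes a FASTQ quality string. It assumes the Sanger / Illumina 1.8+ encoding (ASCII offset 33); older Illumina pipelines used offset 64, so the offset is left as a parameter. The function names are illustrative only.

```python
import math

def phred_from_error(e):
    """Phred score for an estimated error probability e (0 < e <= 1)."""
    return -10 * math.log10(e)

def error_from_phred(q):
    """Inverse of the formula: error probability for a Phred score q."""
    return 10 ** (-q / 10)

def decode_quality(quality_string, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores.

    offset=33 is an assumption (Sanger / Illumina 1.8+);
    use offset=64 for older Illumina encodings.
    """
    return [ord(ch) - offset for ch in quality_string]

if __name__ == "__main__":
    print(phred_from_error(0.001))  # 1 error in 1000 bases
    print(error_from_phred(20))     # Q20
    print(decode_quality("II5+!"))
```

So Q20 corresponds to a 1 in 100 chance that the base call is wrong, and Q30 to 1 in 1000.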

Genome Assembly

Whole Genome Shotgun Sequencing

Slide credit: cbcb.umd.edu

Genome Sequencing Strategies

Slide credit: Aureliano Bombarely

Genome Sequencing Strategies

International Human Genome Sequencing Consortium 2001

Overlap Layout Consensus

http://contig.wordpress.com/

cbcb.umd.edu
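As a toy illustration of the overlap and layout steps, the sketch below finds the longest suffix-prefix overlap between reads and greedily merges the best-overlapping pair until no overlaps remain. It assumes error-free reads from a single strand and skips the consensus step entirely. Real overlap-layout-consensus assemblers are far more involved, and the function names and reads here are made up.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    olen = overlap(a, b, min_len)
                    if olen > best_len:
                        best_len, best_i, best_j = olen, i, j
        if best_len == 0:
            break  # no overlaps left: remaining reads stay as separate contigs
        # Layout: append the non-overlapping tail of the second read
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads

if __name__ == "__main__":
    # Hypothetical reads sampled from the made-up sequence ATGGCCTAAGTCA
    print(greedy_assemble(["ATGGCCT", "GCCTAAG", "TAAGTCA"]))
```

Greedy merging like this is easily misled by repeats, which is one reason paired-end and mate-pair libraries matter for scaffolding.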

Ingredients for a Good Assembly

Slide credit: Mike Schatz

Bird Snake

The BEST?? Genome Assembler for YOU

• You have the expertise to install and run

• You have the suitable infrastructure (CPU & RAM) to run the assembler

• You have sufficient time to run the assembler

• Is designed to work with the specific mix of NGS data that you have generated

• Best addresses what you want to get out of a genome assembly (bigger overall assembly, more genes, most accuracy, longer scaffolds, most resolution of haplotypes, most tolerant of repeats, etc.)

http://haldanessieve.org/2013/01/28/our-paper-making-pizzas-and-genome-assemblies/

Which technology to use??

• Microbial genomes

• Eukaryotic genomes

• Resequencing genomes

• RNAseq and other XXXseq methods

http://bit.ly/1ko9Kgh

SOL Genomics Network

The SGN Team!!

Surya Saha, Tom Fisher-York, Hartmut Foerster, Suzy Strickler, Jeremy Edwards,

Noe Fernandez, Naama Menda, Aure Bombarely, Aimin Yan, Isaak Tecle

SGN Website

http://solgenomics.net

Main web page (front page): web icons and tool bar (menus)

But the DATA can also be edited

Locus Locus Editor Data

Community Data Curation

You need:

• An SGN account

• Submitter / Locus Editor privileges activated by an SGN curator

Genome Browser: GBrowse

Genome Browser: JBrowse

CassavaBase

http://cassavabase.org/

Slide credit: Jeremy Edwards

NextGen Cassava Project

● Project: Adapt SGN database for Cassava Breeding

● Goal: Apply Genomic Selection to cassava breeding

● Predict breeding values from genotype information

● Shorten the breeding cycle

● Massive amounts of genotypic data (GBS)

● Phenotypic data

● Data management challenge

● Improve flowering

● http://nextgencassava.org

SGN/Cassavabase behind the scenes

● Perl/Catalyst MVC Framework

● PostgreSQL Database

● Generic Model Organism Database (GMOD)

– Chado relational database schema

– GBrowse

– JBrowse

– Experimental design

– QTL mapping

– Genomic selection

Slide credit: Jeremy Edwards

Objectives

Provide cassava breeders and researchers access to data and tools in a centralized, user-friendly and reliable database.

– Improve partner breeding program information tracking

– Streamline management of genotypic and phenotypic data

– Pipeline genotypic and phenotypic data through Genomic Selection prediction analyses

Slide credit: Jeremy Edwards

Genomic Selection

The 'training population' is genotyped and phenotyped to 'train' the genomic selection (GS) prediction model. Genotypic information from the breeding material is then fed into the model to calculate genomic estimated breeding values (GEBV) for these lines. From Heffner et al. 2009 Crop Sci. 49:1–12

Information from a majority of lines in the breeding population (the training set) is used to create the prediction model. The model is then used to predict the phenotypes of the remaining lines (the validation set), using genotypic information only. The results from the model are compared to the actual data to give the prediction accuracy. Image courtesy of Martha Hamblin, Cornell University

Flow diagram of a genomic selection breeding program. Breeding cycle time is shortened by removing phenotypic evaluation of lines before selection as parents for the next cycle. From Heffner et al. 2009 Crop Sci. 49:1–12
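The train-then-validate procedure in these captions can be sketched on simulated data. The slides do not say which prediction model is used, so plain ridge regression on marker genotypes is swapped in here as a simple stand-in. The marker effects, population sizes, noise level, and all function names are assumptions made up for illustration.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def train_ridge(genotypes, phenotypes, lam=1.0):
    """Estimate marker effects from the training set: (X'X + lam*I) b = X'y."""
    p = len(genotypes[0])
    xtx = [[sum(g[i] * g[j] for g in genotypes) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(g[i] * y for g, y in zip(genotypes, phenotypes)) for i in range(p)]
    return solve(xtx, xty)

def predict_gebv(genotypes, effects):
    """GEBV of each line: sum over markers of genotype times estimated effect."""
    return [sum(x * b for x, b in zip(g, effects)) for g in genotypes]

def correlation(u, v):
    """Pearson correlation, used here as the prediction accuracy."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5

def simulate(n_lines, true_effects, rng, noise_sd=0.2):
    """Simulated lines: genotypes coded 0/1/2, phenotype = marker effects + noise."""
    genotypes = [[rng.choice([0, 1, 2]) for _ in true_effects] for _ in range(n_lines)]
    phenotypes = [sum(x * b for x, b in zip(g, true_effects)) + rng.gauss(0, noise_sd)
                  for g in genotypes]
    return genotypes, phenotypes

def demo(seed=1):
    rng = random.Random(seed)
    true_effects = [0.5, -0.3, 0.8, 0.0, 0.2]             # made-up marker effects
    train_g, train_y = simulate(60, true_effects, rng)    # training set
    valid_g, valid_y = simulate(20, true_effects, rng)    # validation set
    effects = train_ridge(train_g, train_y)
    gebv = predict_gebv(valid_g, effects)                 # genotypes only
    return correlation(gebv, valid_y)                     # prediction accuracy

if __name__ == "__main__":
    print("prediction accuracy:", demo())
```

Real genomic selection data sets have far more markers than lines, which is where the shrinkage term (lam) matters. This toy example uses only five markers so that it stays readable.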

Data collection in the field

● Android tablets

● Field book app

– Jesse Poland's group at USDA-ARS / Kansas State University

Cassava Trait Ontology

Kulakow et al. 2011

● Standard terminology

● Facilitate the sharing of information

● Allow users to query keywords related to traits

Position available at Solgenomics

Cassavabase project

Plant Breeding + Bioinformatician

● Familiar with breeding

● Programming in Perl, R, SQL, Hadoop

● Linux

● Africa

● Genius

http://www.cassavabase.org/forum/posts.pl?topic_id=9

Thank you!! Questions??
