The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid...

31
The challenge of NGS data Guy Cochrane

Transcript of The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid...

Page 1: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne

The challenge of NGS data

Guy Cochrane

Page 2: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne

Data intensity

Availability of technologies

Broadening of applications

Page 3: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne

Context

Page 4: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne

EMBL European Bioinformatics Institute Genes, genomes & variation

ArrayExpress Expression Atlas Metabolights

PRIDE

InterPro Pfam UniProt

ChEMBL ChEBI

Literature & ontologies

Europe PubMed Central

Gene Ontology

Experimental Factor

Ontology

Molecular structures

Protein Data Bank in Europe

Electron Microscopy Data Bank

European Nucleotide Archive

1000 Genomes

Gene, protein & metabolite expression

Protein sequences, families & motifs

Chemical biology

Reactions, interactions & pathways

IntAct Reactome MetaboLights

Systems

BioModels

Enzyme Portal

BioSamples

Ensembl

Ensembl Genomes

European Genome-phenome Archive

Metagenomics portal

Page 5: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne

European Nucleotide Archive (ENA)

http://www.ebi.ac.uk/ena/

•  A broad platform for the management, sharing, integration and dissemination of sequence data

•  Established in the early 1980s, extended for new technologies and applications

•  Globally comprehensive scientific record and European node of INSDC

•  Connectivity with broader EMBL-EBI resources

•  Sequence data foundation

•  Sustained within EMBL-EBI under EMBL funding with additional support from EC, UK Research councils, Wellcome Trust, etc.

•  Substantial scale: 1.3 petabase pairs across >1 million taxa, 2,000-5,000 active data providers, global consumer userbase

•  Rich submission, discovery and retrieval software, tools and services

Page 6: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne

Submissions

Page 7: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne

Data discovery temperature>=10 AND temperature<=25 AND geo_box1(42, 17, 43, 18)

tax_tree(10090) AND library_source="GENOMIC" AND instrument_platform="ILLUMINA” AND library_strategy="ChIP-Seq"

Page 8: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne

Cross-references and tagging GUI: http://www.ebi.ac.uk/ena/data/xref/search

Page 9: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne

COMPARE

COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne outbreaks by generation and comparison of genomic information on samples and pathogens across sectors, time and locations, with additional contextual data.

A global platform for the sequence-based rapid identification of pathogens

!

Page 10: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne

Ocean Samping Day and Tara Oceans

Page 11: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne

Technology

Page 12: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne

Data accumulation

Cook CE, Bergman MT, Finn RD, Cochrane G, Birney E, Apweiler R. The European Bioinformatics Institute in 2016: Data growth and integration. Nucleic Acids Res. 2016 Jan;44(D1) D20-6. doi:10.1093/nar/gkv1352. PMID: 26673705; PMCID: PMC4702932.

Page 13: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne

Starting point

TGAGCTCTAAGTACC329183050298757

Page 14: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne

Sequence compression

•  Encoding of read starts and differences

•  3.5x–100x compression over existing formats

•  Scales favourably with increasing read length and density

Coverage

Bits

per b

ase

0.025

0.05

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.01% unpaired

0 10 20 30 40 50

0.1% unpaired

0 10 20 30 40 50

1.0% unpaired

●●

●●

●●

0 10 20 30 40 50

0.01% paired

●●

● ●

●●

●●

●●

0 10 20 30 40 50

0.1% paired

●●

● ●

●●

●●

●●

0 10 20 30 40 50

1.0% paired

● ●●

● ●

●●

● ●●

●●

● ●●

0 10 20 30 40 50

Read length● 25● 50● 100● 200● 400

Cold Spring Harbor Laboratory Press on May 4, 2011 - Published by genome.cshlp.orgDownloaded from

Coverage

Bits

per b

ase

0.025

0.05

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.01% unpaired

0 10 20 30 40 50

0.1% unpaired

0 10 20 30 40 50

1.0% unpaired

●●

●●

●●

0 10 20 30 40 50

0.01% paired

●●

● ●

●●

●●

●●

0 10 20 30 40 50

0.1% paired

●●

● ●

●●

●●

●●

0 10 20 30 40 50

1.0% paired

● ●●

● ●

●●

● ●●

●●

● ●●

0 10 20 30 40 50

Read length● 25● 50● 100● 200● 400

Cold Spring Harbor Laboratory Press on May 4, 2011 - Published by genome.cshlp.orgDownloaded from

Coverage

Bits

per b

ase

0.025

0.05

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.01% unpaired

0 10 20 30 40 50

0.1% unpaired

0 10 20 30 40 50

1.0% unpaired

●●

●●

●●

0 10 20 30 40 50

0.01% paired

●●

● ●

●●

●●

●●

0 10 20 30 40 50

0.1% paired

●●

● ●

●●

●●

●●

0 10 20 30 40 50

1.0% paired

● ●●

● ●

●●

● ●●

●●

● ●●

0 10 20 30 40 50

Read length● 25● 50● 100● 200● 400

Cold Spring Harbor Laboratory Press on May 4, 2011 - Published by genome.cshlp.orgDownloaded from

Fritz, M.H. Leinonen, R., et al. (2011) Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 21 (5), 734-40.

Page 15: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne

Quality compression

TGAGCTCTAAGTACC329183050298757

002020010022212 -2---30---9---7

Horizontal reduction Vertical reduction

Page 16: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne

Quality compression: simple, horizontal reduction

Photograph from MichaelMaggs, http://en.wikipedia.org/wiki/File:Amanita_muscaria_(fly_agaric).JPG

Page 17: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne

Models for data reduction

Jong-Seok Lee et al. (2009), http://mmspg.epfl.ch/files/content/sites/mmspl/files/shared/lee_icme.pdf

Page 18: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne

CRAM performance

Bits/base

Page 19: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne

CRAM performance

Bits/base

1000 Genomes

WT Sanger Institute data

Page 20: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne

Human aspects

Page 21: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne

Standards

event

organism

!"#$%&'$(

)*+,!-.(

/-!#$%+(

$'$0+(12(

%!33$0+(

4$'5%$(

.56$(7-)%&!0(8(9*)0&+:(

%!0+$0+(

3)-50$(-$;5!0(

/)-)3$+$-(0)3$($0'5-!0<(/)-)3$+$-.(

=57$(.+);$(

.$>(

3$+,!4(

.56$(7-)%&!0(?(+-$)+3$0+(.+!-);$(+-$)+3$0+(%,$35%)=(

.56$("5!3)..("5!'!=*3$(

9*)0&+:(453$0.5!0.(

%*--$0%:(

*05+.(3$+,!4(%!33$0+(

%!0+)50$-(

!"#$%&'(

)*+,('-"'./&01(

event

organism

!"#$"%&'(

)%*+(

$,"-./#(

0"*"12#+(

,"2*30+(,.'&%*30+((

)",%'%*4(

$/.*.!.,(,"5+,(

0+$*6(

)"#$,+(2*,+(

*+#$+/"*3/+(

7+"*3/+(#"*+/%",(

$"/"#+*+/(89(

5%.#+()!%+'2:!('"#+(

*";.'(89(

ten Hoopen et al. Standards in Genomic Sciences (2015) 10:20

Page 22: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne
Page 23: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne

Metagenomics

Page 24: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne

https://www.ebi.ac.uk/metagenomics/

Page 25: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne
Page 26: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne

Scheuch et al. (2015) BMC Bioinformatics 16:69

Page 27: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne

Identification data

Sequence Analysis Genes

Taxa Samples

Page 28: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne

Identification data accumulation

•  Identification against context is informative

•  (Molecular observations may be tentative or high confidence)

•  Coincidental observations (time, place, virulence, host phenotypes)

•  ‘Capture’ identifications to ‘connect’ coincidental observations

Page 29: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne

Identification data

Sequence Analysis Genes

Taxa Samples

User

Page 30: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne

Identification data

Sequence Analysis Genes

Taxa Samples

User

User

Page 31: The challenge of NGS data - talk.ictvonline.org · COMPARE COMPARE: the enabling system for rapid identification, containment and mitigation of emerging infectious diseases and foodborne

Acknowledgements

Standards & support Ana Cerdeño-Tárraga, Ana Luisa Toribio, Petra ten Hoopen, Marc Rosello, Richard Gibson, Jeena Rajan, Clara Amid Compression technology Vadim Zalunin, Ewan Birney, James Bonfield, Rasko Leinonen ENA technical services Blaise Alako, Simon Kay, Nima Pakseresht, Xin Liu, Suran Jayathilaka, Nicole Silvester, Daniel Vaughan, Neil Goodgame, Iain Cleland, Dmitriy Smirnov, Kethi Reddy, Vadim Zalunin, Rasko Leinonen ENA - http://www.ebi.ac.uk/ena/