The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec...
![Page 1: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/1.jpg)
The Data Tsunami in Biomedical Research
Guillaume Bourque
McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University
June 5th, 2013
![Page 2: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/2.jpg)
Next-generation sequencing (NGS)
Stein, Genome Biol. 2010
![Page 3: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/3.jpg)
Falling cost of sequencing
DeWitt, Nat. Biotechnol. 2012
![Page 4: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/4.jpg)
Sequencing human genomes
• 2001: The Human Genome, ~3 billion $
• 2011: 1000 Genomes Project, ~10,000 $
• 2013 (?): Your Genome, 100 to 1,000 $
![Page 5: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/5.jpg)
Outline
• Overview of Next-Generation Sequencing (NGS)
• Applications
• Challenges
• Solutions
![Page 6: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/6.jpg)
Sequencing Revolution
• Sanger sequencing: 100s of reactions… 10,000s of base pairs…
• Next-Generation sequencing: Millions of reactions! Billions of base pairs!
Metzker, Nat. Rev. Genet. 2010
http://www.brusselsgenetics.be
![Page 7: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/7.jpg)
High-throughput Sequencing
• 2009: 36 bp X 20M X 8 lanes = 6 Gbases
• 2013: 2 X 150 bp X 250M X 8 lanes = 600 Gbases
200 Human Genomes in 1 run!!!
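The run yields on this slide can be checked with a few lines of arithmetic. The sketch below assumes that "20M" and "250M" are reads per lane, that the leading "2 X" means paired-end reads, and that a human genome is about 3 billion bp; the function name is hypothetical.

```python
# Figures copied from the slide. Reading "20M"/"250M" as reads per lane and
# using a ~3 Gb genome for the "genomes per run" estimate are assumptions.
HUMAN_GENOME_BP = 3_000_000_000

def run_yield_bases(read_length_bp, reads_per_lane, lanes, ends=1):
    """Total bases for one run: read length x ends x reads per lane x lanes."""
    return read_length_bp * ends * reads_per_lane * lanes

yield_2009 = run_yield_bases(36, 20_000_000, 8)             # single-end
yield_2013 = run_yield_bases(150, 250_000_000, 8, ends=2)   # paired-end

print(f"2009: {yield_2009 / 1e9:.2f} Gbases")
print(f"2013: {yield_2013 / 1e9:.0f} Gbases")
print(f"Genome equivalents at 1X: {yield_2013 // HUMAN_GENOME_BP}")
```

Note that the "200 genomes" figure corresponds to 1X coverage; at higher depth the number of genomes per run shrinks proportionally.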
![Page 8: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/8.jpg)
NGS Technology Comparison

| Instrument | PacBio | Ion Torrent | 454 | Illumina | SOLiD |
| --- | --- | --- | --- | --- | --- |
| Method | Single-molecule in real-time | Ion semiconductor | Pyrosequencing | Synthesis | Ligation |
| Read length | 3 kb average | 200 bp | 700 bp | 50 to 250 bp | 50+35 or 50+50 bp |
| Error type | indel | indel | indel | substitution | A-T bias |
| Single-pass error rate (%) | 13 | ~1 | ~0.1 | ~0.1 | ~0.1 |
| Reads per run | 35000–75000 | up to 4M | 1M | up to 3.2G | 1.2 to 1.4G |
| Time per run | 30 minutes to 2 hours | 2 hours | 24 hours | 1 to 10 days | 1 to 2 weeks |
| Cost per 1 million bases (US$) | $2 | $1 | $10 | $0.05 to $0.15 | $0.13 |
| Advantages | Longest read length. Fast. | Less expensive equipment. Fast. | Long read size. Fast. | High sequence yield, cost, accuracy | Low cost per base. |
| Disadvantages | Low yield at high accuracy. Equipment can be very expensive. | Homopolymer errors. | Runs are expensive. Homopolymer errors. | Equipment can be very expensive. | Slower than other methods, read length, longevity of the platform |
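To put the cost-per-megabase row in perspective, the sketch below scales it to a whole human genome. The 3 billion bp genome size and 30X depth are assumptions chosen for illustration, and the result covers only the per-base cost listed in the comparison, not equipment, labour or analysis.

```python
# Cost per million bases (US$) as listed in the comparison.
# Ranges are stored as (low, high); single values repeat the same number.
COST_PER_MB_USD = {
    "PacBio": (2.0, 2.0),
    "Ion Torrent": (1.0, 1.0),
    "454": (10.0, 10.0),
    "Illumina": (0.05, 0.15),
    "SOLiD": (0.13, 0.13),
}

GENOME_MB = 3_000   # assumed ~3 billion bp genome, in millions of bases
COVERAGE = 30       # assumed sequencing depth

def genome_cost_usd(platform, genome_mb=GENOME_MB, coverage=COVERAGE):
    """Return the (low, high) per-base cost of one genome at the given depth."""
    low, high = COST_PER_MB_USD[platform]
    megabases = genome_mb * coverage
    return low * megabases, high * megabases

for platform in COST_PER_MB_USD:
    low, high = genome_cost_usd(platform)
    print(f"{platform:12s} ${low:>10,.0f} to ${high:>10,.0f}")
```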
![Page 9: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/9.jpg)
Genome Canada
• > $915M investment and > $900M in co-funding
• 100s of large-scale genomics projects
• 5 innovation centers
![Page 10: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/10.jpg)
Outline
• Overview of Next-Generation Sequencing (NGS)
• Applications
• Challenges
• Solutions
![Page 11: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/11.jpg)
Applications (I)
• De novo sequencing
– From the human genome… To all model organisms… To all relevant organisms (e.g. extreme genomes)… To “all” organisms?
![Page 12: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/12.jpg)
Human Genome
• 3 billion DNA base pairs (bp)
• Two human genomes are ~99.9% identical
• There are about 3M bp differences between you and me
• Some of these differences explain variation in:
– Disease susceptibility
– Drug metabolism
– …
www.dnacenter.com
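The "3M differences" figure follows directly from the genome size and the identity percentage. A minimal check, using only the two numbers on the slide (the variable names are illustrative):

```python
GENOME_SIZE_BP = 3_000_000_000   # 3 billion bp, from the slide
IDENTITY = 0.999                 # ~99.9% identical, from the slide

# Positions expected to differ between two genomes.
expected_differences = GENOME_SIZE_BP * (1 - IDENTITY)
print(f"Expected differing positions: about {expected_differences / 1e6:.0f} million bp")
```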
![Page 13: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/13.jpg)
Applications (II)
• Genome re-sequencing
– Genetic disorders
– Cancer genome sequencing
– Map genomic structural variations across individuals
– Genealogy and migration
– Agricultural crops
– …
1000 Genomes Project
The Cancer Genome Atlas
![Page 14: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/14.jpg)
Exome sequencing for Mendelian disease
“… about one-half to one-third (~3,000) of all known or suspected Mendelian disorders (for example, cystic fibrosis and sickle cell anaemia) have been discovered. However, there is a substantial gap in our knowledge about the genes that cause many rare Mendelian phenotypes.”
“Accordingly, we can realistically look towards a future in which the genetic basis of all Mendelian traits is known, …”
![Page 15: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/15.jpg)
Exome sequencing
![Page 16: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/16.jpg)
Cancer genome sequencing
Can obtain a full catalogue of mutations
![Page 17: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/17.jpg)
Michael Stromberg, bioinformatics.ca
![Page 18: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/18.jpg)
Mutations in paediatric glioblastoma
Jabado, Pfister and Majewski
![Page 19: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/19.jpg)
Mutations in paediatric glioblastoma
Sequenced the exomes of 48 paediatric GBM samples, found:
• Somatic mutations in the H3.3-ATRX-DAXX chromatin remodelling pathway in 44% of tumours
• Recurrent mutations in H3F3A, which encodes the replication-independent histone 3 variant H3.3, in 31% of tumours
![Page 20: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/20.jpg)
Applications (III)
• Quantitative biology of complex systems
– New high-throughput technologies in functional genomics: ChIP-Seq, RNA-Seq, ChIA-PET, RIP-Seq, …
– From single-gene measurements, to thousands of probes on arrays, to profiles covering all 3B bases of the genome
– Important systems: Stem cells, Cancer, Infectious diseases…
![Page 21: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/21.jpg)
Outline
• Overview of Next-Generation Sequencing (NGS)
• Applications
• Challenges
• Solutions
![Page 22: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/22.jpg)
High-throughput Sequencing
• 2009: 36 bp X 20M X 8 lanes = 6 Gbases
• 2013: 2 X 150 bp X 250M X 8 lanes = 600 Gbases
200 Human Genomes in 1 run!!!
![Page 23: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/23.jpg)
Big Data
2013:
• Image files: 70 TBytes
• Intensity files: 2 X 10 TBytes
• Reads + qualities: 1 TBytes
![Page 24: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/24.jpg)
Big Data
2013:
• Intensity files: 2 X 10 TBytes; 240 TBytes
• Reads + qualities: 1 TBytes; 12 TBytes
• 25 TB of raw data / month
• 300 TB of raw data / year
From: Alexandre Montpetit
Subject: news from Illumina
Date: 4 June, 2013 2:15:16 PM EDT
To: Guillaume Bourque
From Mark Van Oene (Illumina VP of sales): over the next year we should expect 2x more reads in 2x less time (and 2x longer). Does that cause a problem?
Alex
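The three factors of two in the e-mail (2x more reads, 2x less time, 2x longer reads) compound rather than add. The sketch below treats them as independent multipliers, which is an assumption, and applies the result to the 25 TB / month figure from the slide purely as an illustration, not as a forecast from the talk.

```python
# Factors quoted in the e-mail; treating them as independent multipliers
# is an assumption made for this sketch.
READS_FACTOR = 2      # 2x more reads
LENGTH_FACTOR = 2     # reads 2x longer
TIME_FACTOR = 0.5     # half the run time

data_per_run_factor = READS_FACTOR * LENGTH_FACTOR
data_rate_factor = data_per_run_factor / TIME_FACTOR

CURRENT_RAW_TB_PER_MONTH = 25  # from the slide

print(f"Data per run: {data_per_run_factor}x")
print(f"Data per unit time: {data_rate_factor:.0f}x")
print(f"Current raw data per year: {CURRENT_RAW_TB_PER_MONTH * 12} TB")
print(f"Projected raw data per month: {CURRENT_RAW_TB_PER_MONTH * data_rate_factor:.0f} TB")
```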
![Page 25: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/25.jpg)
Large NGS project
Cancer project with whole genome data:
• 500 tumors: 500 X 3 lanes = 500 X 250 GB = 125 TB raw
vs
• 500 matched-normal: 500 X 3 lanes = 500 X 250 GB = 125 TB raw
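The storage estimate for this project is a straightforward product. The sketch below reproduces the 125 TB per arm and adds the combined total, assuming decimal units (1 TB = 1000 GB); the variable names are illustrative.

```python
SAMPLES_PER_ARM = 500    # 500 tumors, and again 500 matched normals
LANES_PER_SAMPLE = 3     # from the slide
GB_PER_SAMPLE = 250      # raw data per sample (3 lanes), from the slide
GB_PER_TB = 1000         # decimal units assumed

raw_tb_per_arm = SAMPLES_PER_ARM * GB_PER_SAMPLE / GB_PER_TB
raw_tb_total = 2 * raw_tb_per_arm
lanes_total = 2 * SAMPLES_PER_ARM * LANES_PER_SAMPLE

print(f"Raw data per arm: {raw_tb_per_arm:.0f} TB")
print(f"Raw data, tumors + normals: {raw_tb_total:.0f} TB")
print(f"Lanes required: {lanes_total}")
```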
![Page 26: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/26.jpg)
DNA bases sequenced at the Innovation Center
[Chart, y-axis: DNA bases]
12 HiSeqs
72 trillion!
Or 800 genomes at 30X
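The conversion from 72 trillion bases to 800 genomes can be verified as follows, assuming a 3 billion bp human genome (the genome size quoted earlier in the talk):

```python
TOTAL_BASES = 72 * 10**12        # 72 trillion bases, from the slide
GENOME_SIZE_BP = 3 * 10**9       # assumed 3 billion bp human genome
COVERAGE = 30                    # 30X, from the slide

genomes_at_30x = TOTAL_BASES // (GENOME_SIZE_BP * COVERAGE)
print(f"Genomes at {COVERAGE}X: {genomes_at_30x}")
```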
![Page 27: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/27.jpg)
adventure.nationalgeographic.com
![Page 28: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/28.jpg)
Biomedical research is built on data integration
Your data
![Page 29: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/29.jpg)
Biomedical research is built on data integration
100X
Your data
![Page 30: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/30.jpg)
Challenges
• NGS instruments generate TBs of data
• NGS instruments are getting faster and cheaper, and will increasingly be found in small research labs and hospitals
• Data sharing and integration are critical in biomedical research
• Sequencing data is sensitive, private and identifiable
![Page 31: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/31.jpg)
Outline
• Overview of Next-Generation Sequencing (NGS)
• Applications
• Challenges
• Solutions
![Page 32: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/32.jpg)
Nanuq software
Has tracked data and meta-data for more than:
• 2.6 million sample aliquots
• 20,500 reagents
• 17,000 plates
• 140,000 tubes
• Multiple platforms, technologies and workflows (sequencing, genotyping, microarray, etc.)
• 3,900 external users
![Page 33: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/33.jpg)
Standardized analysis pipelines
ChIP-Seq Analysis report
RNA-Seq Analysis report
Methylation Analysis report
…
![Page 34: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/34.jpg)
Data center at the Innovation Center
• > 1200 cores
• > 2 PB disk
• > 5 PB tape
![Page 35: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/35.jpg)
Need more!
McGill Guillimin – 16000 cores
UdeS Mammouth – 39168 cores
![Page 36: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/36.jpg)
Data processing issues
• We have many different projects all needing space and processing.
• We want to use the Compute Canada clusters for scalability but also to facilitate data distribution (we have >800 users).
• This brings uniformity problems:
– Different setups (hardware and software)
– Different configurations
– Etc.
![Page 37: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/37.jpg)
Our strategy
• We wrote analysis pipelines to be easily configurable across clusters.
• Same code, one ini file to customize (we already have templates for 3 cluster sites)
• We install Linux modules readable by all on all these clusters so we know exactly what is available everywhere
• We also deploy common genomes across sites.
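The "same code, one ini file" approach can be sketched with Python's standard `configparser`. Everything specific here is hypothetical: the section names, keys, paths and helper functions are invented for illustration and are not taken from the pipelines described in the talk.

```python
import configparser

# Hypothetical per-cluster settings. [DEFAULT] holds values shared by all
# sites; each cluster section overrides only what differs.
CLUSTER_INI = """
[DEFAULT]
aligner_module = bwa
genome_root = /path/to/genomes

[cluster_a]
scheduler = qsub
cores_per_node = 12

[cluster_b]
scheduler = sbatch
cores_per_node = 24
genome_root = /other/path/to/genomes
"""

def load_cluster_settings(ini_text, cluster):
    """Return the merged settings (defaults plus overrides) for one cluster."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return dict(parser[cluster])

def job_lines(settings, script):
    """Build the site-specific lines; the pipeline logic itself never changes."""
    return [
        f"module load {settings['aligner_module']}",
        f"{settings['scheduler']} {script}",
    ]

for cluster in ("cluster_a", "cluster_b"):
    settings = load_cluster_settings(CLUSTER_INI, cluster)
    print(cluster, settings["genome_root"], job_lines(settings, "align.sh"))
```

Keeping site differences in one file means a new cluster needs only a new section, while shared modules and genome locations stay in the defaults.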
![Page 38: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/38.jpg)
Usage on Compute Canada
![Page 39: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/39.jpg)
Canadian Epigenetics, Environment and Health Research Consortium (CEEHRC)
$1.5M (2012-2017)
![Page 40: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/40.jpg)
PORTal for the Analysis of Genetics and Genomics Experiments (PORTAGGE)
![Page 41: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/41.jpg)
Conclusions
• NGS offers a variety of technologies and numerous exciting applications
• Many areas of NGS data analyses are still under active development (e.g. RNA-Seq)
• A major challenge is to ensure sufficient compute and storage capacities not to limit more advanced analyses
• Need to work together to avoid duplication of efforts in installing tools but also to develop efficient ways to use HPC in biomedical research
![Page 42: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/42.jpg)
Acknowledgements
• IT team: Terrance Mcquilkin, Marc-André Labonté, Genevieve Dancausse, Andras Frankel, Alexandru Guja
• Development team: Nathalie Émond, David Bujold, Francois Cantin, Catherine Côté, Burak Demirtas, Daniel Guertin, Louis Dumond Joseph, Francois Korbuly, Marc Michaud, Thuong Ngo
• Analysis team: Louis Letourneau, Mathieu Bourgey, Maxime Caron, Gary Lévesque, Robert Eveleigh, Francois Lefebvre, Johanna Sandoval, Pascale Marquis
• EDCC team: David Morais (UdeS), Carol Gauthier (UdeS), Bryan Caron (McGill), Alain Veilleux (UdeS), ME Rousseau (McGill)
![Page 43: The Data Tsunami in Biomedical Research Guillaume Bourque McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649c7c5503460f949308d9/html5/thumbnails/43.jpg)
Questions?