Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure
description
Transcript of Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure
![Page 1: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/1.jpg)
EBI is an Outstation of the European Molecular Biology Laboratory.
Ewan Birney
Human genetics and genomicsThe Science of the 21st Century…and why it needs infrastructure
![Page 2: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/2.jpg)
Outline of the talk
• Who am I?• A quick crash course in genomics • Genomics 2000-2012
• HapMap, 1000 Genomes, GWAS, ENCODE• Route into medicine• Elixir: Why do we need an information infrastructure?
![Page 3: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/3.jpg)
Who am I?• Associate Director at
European Bioinformatics Institute (EBI)
• Involved in genomics since I was 19 (almost 20 years!)
• Trained as a biochemist – most people think I am CS
• Analysed – sometimes lead – human/mouse/rat/platypus etc genomes
EBI is in Hinxton, SouthCambridgeshire
EBI is part of EMBL, likeCERN for molecular biology
![Page 4: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/4.jpg)
EBI is an Outstation of the European Molecular Biology Laboratory.
Crash Course in genomics for geeks
![Page 5: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/5.jpg)
DNA is a covalently linked polymer nearly always found in anti-parallel, non covalent pairs
![Page 6: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/6.jpg)
We represent it as strings, not worrying about one pair of the two polymers
>6 dna:chromosome chromosome:GRCh37:6:133017695:133161157:1GCAGCAAGACAGAAGTGACTCATACATACAAGGGATCCCCAATAAGATTATCGGCAGATTTCTCATCAATAACTTTGGAGACCACAAAGCATTGAGCTGATATATTTAAAGTACTGAAAGAAAAAAAAATCTGACAACCAAGAATTCTATATCCATCAGAACTGCCCTTCAAAAGGGAGGGAGAAATGAAGACATTCTCAGATTTGAGAAGAAAGGAAAGAGAGAAGGGAGGGGAGGGGAGAGGAGGGGAGGGGAGGAGAGGAGAGGAGAGGGCACAGTGGCTCACGCCTGTAATCCTAGCACTTTGCAAGACTGAGGCCAGTGGAACACCTGAGGTCAGGAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCCACTAAAAATACAAAAAATTAGCCAGGCGTGGTGGCAGGTGCCTGTAGTTCCAGCTACTCAGGAGGCTGAGGCAGCAGAATGGCGTGAACTCGGGAGGTGGAGCTTGCAGTGAGCTGAGATTGCGCCCCTGCACTCCAGCCTGGGTGACAGAGTGAGACTCTGTCTCAAAAAAATAAAAAGTTTAAAAATATTTTAAAAAAAGAAAGAAAGAAGGGAG
1 monomer is called a “base pair” – bp
![Page 7: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/7.jpg)
We can routinely determine small parts of DNA1977-1990 – 500 bp, manual tracking
1990-2000 – 500 bp, computational tracking, 1D, “capillary”
2005-2012 – 20-100bp, 2D systems, (“2nd Generation” or NGS)
2012 - ?? >5kb, Real time “3rd Generation”
Fred Sanger, inventorof terminator DNA sequencing
![Page 8: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/8.jpg)
Costs have come exponentially down
![Page 9: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/9.jpg)
1980 1985 1990 1995 2000 2005 2010 2015100000
1000000
10000000
100000000
1000000000
10000000000
100000000000
1000000000000
10000000000000
100000000000000
Capillary reads
Assembled sequences
Next gen. reads
Date
Bases
Volumes have gone up!
Similar explosion ofImage based methods
![Page 10: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/10.jpg)
A genome is all our DNA
Every cell has two copies of 3e9bp(one from mum, one from dad) in24 polymers (“chromosomes”)
Ecoli: 4e6, Yeast, 12e6 Medaka, 0.9e9
White Pine20e9
![Page 11: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/11.jpg)
Human Genome project
• 1989 – 2000 – sequencing the human genome• Just 1 “individual” – actually a mosaic of about 24 individuals but
as if it was one• Old school technologies• A bit epic
• Now• Same data volume generated in ~3mins in a current large scale
centre • It’s all about the analysis
![Page 12: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/12.jpg)
Reference Based Compression
0.5 to 0.01 Bits/Base
10x to 100x more compressed than currentdata storage
![Page 13: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/13.jpg)
EBI is an Outstation of the European Molecular Biology Laboratory.
What happened next?
![Page 14: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/14.jpg)
We looked into human variation
3 in 10,000 bases between any two individuals are different(a bit more between Africans)
The similarity of a European to an African (any population) isOnly marginally smaller than European to European (2 or 3%).Only a minute amount of DNA is unique to any population
![Page 15: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/15.jpg)
… and associate this with traits or disease
(you can infer the majority of the genome by knowing a base About 1 every 5,000 to 10,000 bases – the experiments toLook at this density is far cheaper than sequencing)
![Page 16: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/16.jpg)
![Page 17: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/17.jpg)
Genome
ENCODE Dimensions
Methods/Factors
Cel
ls
182
Cel
l Lin
es/ T
issu
es
164 Assays (114 different Chip)
3,010 Experiments5 TeraBases1716x of the Human Genome
![Page 18: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/18.jpg)
ENCODE Uniform Analysis Pipeline
Mapped reads from production (Bam)
Uniform Peak Calling Pipeline (SPP, PeakSeq)
IDR Processing, QC and Blacklist Filtering
Motif Discovery Stats, GSC enrichments, etc.
Signal Aggregation over peaks
Signal Generation(read extension and mappability correction)
Segmentation
Poor reproducibilityGood reproducibility
Rep1
Rep2
Self Organising Maps
ChromHMM/Segway
Anshul Kundaje, Qunhua Li, Michael Hoffman, Jason Ernst, Joel Rozowsky, Pouya Kheradpour
![Page 19: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/19.jpg)
Fish Transgenics
• 7 strong positives, 6 negatives from Segmentation, 3 Blood• 2 strong positives, 4 weak, 5 negatives from Naïve picks
(Fisher’s Exact 0.0393)• Wittbrodt Lab, Heidelberg
![Page 20: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/20.jpg)
EBI is an Outstation of the European Molecular Biology Laboratory.
Impact on Medicine
![Page 21: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/21.jpg)
3 big areas of impact for medicine
![Page 22: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/22.jpg)
Germ Line impact
• Everyone has differential risk of disease
• But the shift in risk is small• Perhaps 1 to 2% have a
striking change in risk to a serious disease (>10 fold) which is “actionable”
• This goes up to 3-4% if you count some less clinically worrying diseases
1:500 people have HCM1:500 people have FH
![Page 23: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/23.jpg)
Precision cancer diagnosis
• Cancer is a genomic disease
• By sequencing a cancer you can understand its molecular form better
• Particular molecular forms respond to particular bugs
![Page 24: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/24.jpg)
Pathogens
• Sequencing provides a clear cut diagnosis of pathogens
• Can also be used to sequence environments (eg, hospitals)
• Immune systems for hospitals
![Page 25: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/25.jpg)
EBI is an Outstation of the European Molecular Biology Laboratory.
What can geeks do for biology?
![Page 26: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/26.jpg)
Biology is a big data science
• Not perhaps as big as high energy physics, but “just” one order of magnitude less (~35 PB plays ~300PB)
• Hetreogenity/diversity of data far larger• Lots and lots of details• Always have “dirty” data even in best case scenarios
• Big Data => scaleable algorithms• CS ‘stringology’ – eg, Burrows-Wheeler transform
• Biological inference => statistics• “just” need to collapse a 3e9 dimensional problem into something
more tractable• We are often I/O not CPU bound
![Page 27: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/27.jpg)
We need geeks!
• In Academic/Industry/Start ups
• Convert physicists/statisticians to the 21st century science!
• Insist on maths skills in biology
![Page 28: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/28.jpg)
Why we need an infrastructure…
![Page 29: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/29.jpg)
Infrastructures are critical…
![Page 30: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/30.jpg)
But we only notice them when they go wrong
![Page 31: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/31.jpg)
Biology already needs an information infrastructure
• For the human genome• (…and the mouse, and the rat, and… x 150 now, 1000 in the
future!) - Ensembl• For the function of genes and proteins
• For all genes, in text and computational – UniProt and GO• For all 3D structures
• To understand how proteins work – PDBe• For where things are expressed
• The differences and functionality of cells - Atlas
![Page 32: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/32.jpg)
..But this keeps on going…
• We have to scale across all of (interesting) life• There are a lot of species out there!
• We have to handle new areas, in particular medicine• A set of European haplotypes for good imputation• A set of actionable variants in germline and cancers
• We have to improve our chemical understanding• Of biological chemicals• Of chemicals which interfere with Biology
![Page 33: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/33.jpg)
How?
Fully Centralised
Pros: Stability, reuse,Learning ease
Cons: Hard to concentrateExpertise across of life scienceGeographic, language placementBottlenecks and lack of diversity
Pros: Responsive, GeographicLanguage responsive
Cons: Internal communication overheadHarder for end users to learnHarder to provide multi-decade stability
Fully Distributed
![Page 34: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/34.jpg)
How?
GeneralHub (EBI)
Robust network with a strong hubNode
“DomainSpecific hub”
“National”
![Page 35: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/35.jpg)
Overlapping Networks
![Page 36: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/36.jpg)
Other infrastructures needed for biology
• EuroBioImaging• Cellular Imaging and whole body imaging
• BioBanks (BBMRI)• We need numbers – European populations – in particular for rare
diseases, but also for specific sub types of common disease• Mouse models and phenotypes (Infrafrontier)
• A baseline set of knockouts and phenotypes in our most tractable mammalian model
• (it’s hard to prove something in human)• Robust molecular assays in a clinical setting (EATRIS)
• The ability to reliably use state of the art molecular techniques in a clinical research setting
![Page 37: Human genetics and genomics The Science of the 21 st Century …and why it needs infrastructure](https://reader036.fdocuments.in/reader036/viewer/2022062815/568168b3550346895ddf83ba/html5/thumbnails/37.jpg)
Questions?(you can follow me on twitter @ewanbirney)