2014 aus-agta
Transcript of 2014 aus-agta
WHAT’S AHEAD FOR BIOLOGY? THE DATA INTENSIVE FUTURE
C. Titus Brown
Assistant Professor, Michigan State University
(In January, moving to UC Davis / VetMed.)
Talk slides on slideshare.net/c.titus.brown
The Data Deluge (a traditional requirement for these talks)
The short version
• Data gathering & storage are growing by leaps & bounds!
• Biology is completely unprepared for this at every level:
  • Technical and infrastructure
  • Cultural
  • Training
• Our funding/incentivization/prioritization structures are also largely unprepared.
• This is a huge missed opportunity!!
(What does Titus think we should be doing?)
Challenges:
1. Dealing with Big Data (my current research)
2. Interpreting the unknowns (future research)
3. Accelerating research with better data/methods/results sharing.
4. Expanding the role of exploratory data analysis in biology. (career windmill)
1. Dealing with Big Data
A. Lossy compression
B. Streaming algorithms
Looking forward 5 years…
Navin et al., 2011
Some basic math:
• 1000 single cells from a tumor…
• …sequenced to 40x haploid coverage with Illumina…
• …yields 120 Gbp each cell…
• …or 120 Tbp of data.
• HiSeq X10 can do the sequencing in ~3 weeks.
• The variant calling will require 2,000 CPU weeks…
• …so, given ~2,000 computers, can do this all in one month.
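The arithmetic on this slide can be checked with a short sketch. The 3 Gbp haploid human genome size and the "one CPU per computer" reading are assumptions made here; the other inputs are the figures quoted above.

```python
# Back-of-the-envelope check of the slide's numbers.
# ASSUMPTION: haploid human genome of about 3 Gbp (not stated on the slide).
HAPLOID_GENOME_BP = 3_000_000_000
COVERAGE = 40                      # 40x haploid coverage (from the slide)
N_CELLS = 1000                     # single cells from one tumor (from the slide)
SEQUENCING_WEEKS = 3               # HiSeq X10 estimate (from the slide)
VARIANT_CALLING_CPU_WEEKS = 2000   # from the slide
N_COMPUTERS = 2000                 # ASSUMPTION: one CPU per computer

bp_per_cell = HAPLOID_GENOME_BP * COVERAGE
total_bp = bp_per_cell * N_CELLS
compute_weeks = VARIANT_CALLING_CPU_WEEKS / N_COMPUTERS
total_weeks = SEQUENCING_WEEKS + compute_weeks

print(f"per cell:  {bp_per_cell / 1e9:.0f} Gbp")
print(f"all cells: {total_bp / 1e12:.0f} Tbp")
print(f"compute:   {compute_weeks:.0f} week(s) on {N_COMPUTERS} computers")
print(f"total:     about {total_weeks:.0f} weeks")
```

Under these assumptions the sequencing (3 weeks) plus the parallel variant calling (1 week) comes to roughly the one month quoted on the slide.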
Similar math applies:
• Pathogen detection in blood;
• Environmental sequencing;
• Sequencing rare DNA from circulating blood.
• Two issues:
  • Volume of data & compute infrastructure;
  • Latency for clinical applications.
Approach A: Lossy compression
Lossy compression can substantially reduce data size while retaining
information needed for later (re)analysis.
(Reduce volume of data & compute infrastructure requirements)
http://en.wikipedia.org/wiki/JPEG
Digital normalization
e.g. de novo assembly now scales with richness, not diversity.
• 10-100 fold decrease in memory requirements
• 10-100 fold speed up in analysis
Brown et al., arXiv, 2012
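The slides for digital normalization are mostly figures, so here is a minimal sketch of the idea as I understand it from the cited preprint: keep a read only if the median abundance of its k-mers, among reads kept so far, is still below a cutoff. The function names, the tiny `k`, and the use of an exact `Counter` are illustrative assumptions; a real implementation would use a fixed-memory probabilistic counter and canonical (strand-independent) k-mers.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=4, cutoff=3):
    """Single-pass filter: discard reads whose k-mers are already well covered.

    ASSUMPTION: exact counting with a Counter stands in for the
    fixed-memory counting structure a production tool would use.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k
        # Estimated coverage of this read = median count of its k-mers so far.
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads contribute to counts
    return kept


# Ten copies of a high-coverage read plus one rare read.
reads = ["ACGTTGCA"] * 10 + ["GGATCCAA"]
kept = digital_normalization(reads, k=4, cutoff=3)
print(len(reads), "reads in,", len(kept), "reads kept")
```

In this toy input the redundant copies are cut down to the cutoff while the rare read survives, which is the sense in which the compression is lossy but keeps the information needed downstream.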
Hey, cool, our approach and software are used by Illumina for long-read sequencing!
Our general strategy: compressive prefilters
Approach B: streaming data analysis
See also eXpress, Roberts et al., 2013.
(Reduce latency for clinical applications)
Current variant calling approaches are multipass
Streaming graph-based approaches can detect information saturation
Approach supports compute-intensive interludes – remapping, etc.
Rimmer et al., 2014
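One way to picture "information saturation" in a stream is sketched below: read the data once, and stop when a window of recent reads contributes almost no new k-mers. This is a simplified stand-in using a plain set of k-mers, not the graph-based method the slides refer to; the function name, window size, and threshold are all hypothetical.

```python
def kmers(seq, k):
    """Yield all overlapping k-mers of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def reads_until_saturation(read_stream, k=4, window=5, min_novel_fraction=0.05):
    """Consume reads one at a time; return how many were read when the
    fraction of never-before-seen k-mers in the latest window drops below
    min_novel_fraction. Return None if the stream ends first.

    ASSUMPTION: k-mer novelty is used as a crude proxy for saturation.
    """
    seen = set()
    novel = total = consumed = 0
    for read in read_stream:
        consumed += 1
        for km in kmers(read, k):
            total += 1
            if km not in seen:
                seen.add(km)
                novel += 1
        if consumed % window == 0:
            if total and novel / total < min_novel_fraction:
                return consumed  # later reads are never touched
            novel = total = 0    # start a fresh window
    return None


stream = iter(["ACGTTGCA"] * 100)
stopped_at = reads_until_saturation(stream)
remaining = sum(1 for _ in stream)
print("stopped after", stopped_at, "reads;", remaining, "reads never examined")
```

Because the stream is consumed in a single pass, analysis can in principle begin while sequencing is still running, which is where the latency reduction would come from.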
Streaming with bases
Integrate sequencing and analysis
Decrease latency!
So, how do we deal with Big Data issues?
• Fairly record cost of data analysis (running software & cost of computational infrastructure)
• This incentivizes development of better approaches!
• Lossy compression, streaming, …??
• Think 5 years ahead, rather than 2 years behind!
• Pay attention to workflows, software lifecycle, etc. etc.
(See ABiC 2014 talk :)
2. Dealing with the unknowns
“What is the function of ….?”
We can observe almost everything at a DNA/RNA level!
But,
• Experimentally based functional annotations are sparse;
• Most genes play multiple roles and are generally annotated for only one;
• Model organisms are phylogenetically quite limited and biased;
• …there is little or no $$$ or reputation gain for characterizing novel genes (and nor is it straightforward or easy to do so!)
"...ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains. The major distinguishing characteristic between these sets of genes is date of discovery, early discovery
being associated with greater research momentum—a genomic bandwagon effect."
Ref.: Pandey et al. (2014), PLoS ONE 9, e88889. Slide courtesy Erich Schwarz
The problem of lopsided gene characterization: e.g., the brain "ignorome"
How do we systematically broaden our functional understanding of genes?
1. More experimental work!
• Population studies, perturbation studies, good ol’ fashioned molecular biology, etc.
2. Integrate modeling, to see where we have (or lack) sufficiency of knowledge for a particular phenotype.
3. Sequence it all and let the bioinformaticians sort it out!
What I think will work best: a tight integration between all three approaches (cf. physics) – hypothesis-driven investigation, modeling, and exploratory data science.
See also: ivory.idyll.org/blog/2014-function-of-unknown-genes.html
3. Accelerating research with better sharing of results, data, methods.
Our current journal system is a 20th century solution to a 17th century problem.
- Paraphrased from Cameron Neylon
(Note: 20th century was LAST century)
We could accelerate research with better sharing.
Recent example re rare diseases:
http://www.newyorker.com/magazine/2014/07/21/one-of-a-kind-2
“The current academic publication system does patients an enormous disservice.” – Daniel MacArthur
There are many barriers to better communication of results, data, and methods, but most of them are cultural, not
technical. (Much harder!)
Preprints
• Many fields (including bioinformatics and increasingly genomics) routinely share papers prior to publication. This facilitates reproduction, dissemination, and ultimately progress.
• Biology is behind the times!
See:
1. Haldane’s Sieve (blog discussion of preprints)
2. Evidence that preprints confer massive citation advantage in physics (http://arxiv.org/abs/0906.5418)
Current model for data sharing
In a data-limited world, this kind of made sense.
This model ignores the fact that data often has multiple (unrealized or serendipitous) uses.
(Among many other problems ;)
The train wreck ahead
When data is cheap, and interpretation is expensive, most data doesn’t get published, and therefore is lost.
(Program managers are not fans of this)
Data sharing challenges
• Few immediate or career rewards for sharing data; incentives are almost entirely punitive (if you DON’T…)
• Sharing data in a usable form is still rather difficult.
• Submitting data to archival services is, in many cases, surprisingly difficult.
• Few methods for gaining recognition for data sharing prior to publication of conclusions.
The Ocean Cruise Model
DeepDOM – photo courtesy E. Kujawinski, WHOI
One really expensive cruise, many data collectors, shared data.
Sage Bionetworks / “walled garden”
Collaborative data sharing policy with restricted access to outsiders;
Central platform with analysis provenance tracking;
A model for the future of biomedical research?
See, e.g., Enabling transparent and collaborative computational analysis of 12 tumor types within The Cancer Genome Atlas. Omberg et al, 2014.
Distributed cyberinfrastructure to encourage sharing?
ivory.idyll.org/blog/2014-moore-ddd-talk.html
Better metadata collection is needed!
Suppose the NSA could EITHER track who was calling whom, OR what they were saying – which would be more valuable?
Who? What? Who?
We need to track sample origin, phenotype/environmental conditions, etc.
Sample information The –omic data Phenotype
This will facilitate discovery, serendipity, re-analysis, and cross-validation.
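As a rough illustration of what tracking this metadata could look like, here is a sketch of a record that ties sample information, the -omic data, and phenotype together so that datasets can be found again by their conditions. The class, field names, file names, and example values are all hypothetical and do not represent any existing standard.

```python
from dataclasses import dataclass, field


@dataclass
class SampleRecord:
    """Hypothetical record linking a sample to its data and phenotype."""
    sample_id: str
    origin: str                                       # where the sample came from
    conditions: dict = field(default_factory=dict)    # environmental conditions
    omic_files: list = field(default_factory=list)    # paths/IDs of -omic data
    phenotype: dict = field(default_factory=dict)     # observed phenotype


def find_samples(records, **wanted):
    """Return records whose conditions match every key=value in wanted."""
    return [
        r for r in records
        if all(r.conditions.get(key) == value for key, value in wanted.items())
    ]


records = [
    SampleRecord("s1", "soil, site A", {"ph": 6.5, "season": "summer"},
                 ["s1_reads.fq.gz"], {"respiration": "high"}),
    SampleRecord("s2", "soil, site B", {"ph": 7.2, "season": "summer"},
                 ["s2_reads.fq.gz"], {"respiration": "low"}),
    SampleRecord("s3", "soil, site A", {"ph": 6.5, "season": "winter"},
                 ["s3_reads.fq.gz"], {"respiration": "low"}),
]

summer = find_samples(records, season="summer")
print([r.sample_id for r in summer])
```

Even a structure this small makes re-analysis possible: someone who did not collect the samples can select by condition and compare phenotypes across datasets.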
Data and software citation
There are now methods for:
• Assigning DOIs to data (which makes it citable) – figshare, Dryad.
• Data publications – GigaScience, SIGS, Scientific Data.
• Software citations – Zenodo, MozSciLab/GitHub
• Software publications – F1000 Research
Will this address the need to incentivize data sharing and methods? Probably not but it’s a good start ;)
4. Exploratory data analysis
Old model:
New model
Your data is most useful when combined with everyone else’s.
Given enough publicly accessible data…
But: we face lack of training.
The lack of training in data science is the biggest challenge facing biology.
Students! There’s a great future in data analysis!
Also see:
Data integration?
Once you have all the data, what do you do?
"Business as usual simply cannot work."
Looking at millions to billions of genomes.
(David Haussler, 2014)
Illumina estimate: 228,000 human genomes will be sequenced in 2014, mostly by researchers.
http://www.technologyreview.com/news/531091/emtech-illumina-says-228000-human-genomes-will-be-sequenced-this-year/
Looking to the future
For the senior scientists and funders amongst us,
• How do we incentivize data sharing, and training?
• How do we fund the meso- and micro-scale cyberinfrastructure development that will accelerate bio?
See: ivory.idyll.org/blog/2014-nih-adds-up-meeting.html
The NIH and NSF are exploring this; the Moore and Sloan foundations are simply doing it (but at 1% the size).
Thanks for listening!
combine.org.au
Annual Student Symposium
Friday 28th November 2014
Parkville, Victoria
Now accepting abstracts for talks and posters
Talk abstracts close 31st October
For Australian students and early career researchers in bioinformatics and computational biology