Big data nebraska
![Page 1: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/1.jpg)
RIDING THE DATA
TIDAL WAVE IN
MICROBIOLOGY
Future of Big Data, Lincoln, NE 11/6/2014
Adina Howe
germslab.org (Genomics and Environmental Research in Microbial Systems)
Argonne National Laboratory / Michigan State University
Iowa State University, Ag & Biosystems Engr (January)
Slides available at www.slideshare.com/adinachuanghowe
![Page 2: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/2.jpg)
Microbes are critical
Climate Change
Energy Supply
USGCRP 2009
www.alutiiq.com
http://guardianlv.com/
Human & Animal
Health
An understanding
of microbial ecology
Global Food
Security
![Page 3: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/3.jpg)
Understanding community
dynamics
Who is there?
What are they doing?
How are they doing it?
![Page 4: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/4.jpg)
Understanding community
dynamics
Who is there?
What are they doing?
How are they doing it?
Kim Lewis, 2010
![Page 5: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/5.jpg)
Gene / Genome Sequencing
Collect samples
Extract DNA
Sequence DNA
“Analyze” DNA to identify its content and origin
Taxonomy
(e.g., pathogenic E. coli)
Function
(e.g., degrades cellulose)
![Page 6: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/6.jpg)
Cost of Sequencing
Stein, Genome Biology, 2010
E. coli genome 4,500,000 bp ($4.5M, 1992)
[Figure: DNA sequencing (Mbp per $, log scale) by year, 1990-2012]
![Page 7: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/7.jpg)
Rapidly decreasing costs with
NGS Sequencing
Stein, Genome Biology, 2010
Next Generation Sequencing
4,500,000 bp (E. coli, $200, presently)
[Figure: DNA sequencing (Mbp per $, log scale) by year, 1990-2012]
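The two price points quoted for the E. coli genome can be turned into a per-base cost and a fold change. This is only a back-of-the-envelope sketch using the figures on the slides ($4.5M in 1992, $200 "presently"); the variable names are illustrative.

```python
# Figures taken from the slides; names are illustrative only.
GENOME_BP = 4_500_000      # E. coli genome size in base pairs
COST_1992_USD = 4_500_000  # quoted cost in 1992
COST_NOW_USD = 200         # quoted cost "presently" (2014 talk)

cost_per_bp_1992 = COST_1992_USD / GENOME_BP  # dollars per base pair in 1992
cost_per_bp_now = COST_NOW_USD / GENOME_BP    # dollars per base pair now
fold_cheaper = COST_1992_USD / COST_NOW_USD   # how many times cheaper

print(f"1992: ${cost_per_bp_1992:.2f} per bp")
print(f"now:  ${cost_per_bp_now:.6f} per bp")
print(f"fold decrease in cost: {fold_cheaper:,.0f}x")
```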
![Page 8: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/8.jpg)
Effects of low cost
sequencing…
First free-living bacterium sequenced
for billions of dollars and years of
analysis
A personal genome can be
mapped in a few days for
hundreds to a few thousand
dollars
![Page 9: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/9.jpg)
The experimental continuum
Single Isolate
Pure Culture
Enrichment
Mixed Cultures
Natural systems
![Page 10: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/10.jpg)
The era of big data in biology
Stein, Genome Biology, 2010
Computational Hardware
(doubling time 14 months)
Sanger Sequencing
(doubling time 19 months)
NGS (Shotgun) Sequencing
(doubling time 5 months)
[Figure: disk storage (Mb/$) and DNA sequencing (Mbp per $), both on log scales, by year, 1990-2012]
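To make the doubling times concrete, the sketch below compares how far each technology moves over the same period. It assumes simple exponential growth at a constant doubling time; the 60-month horizon is an arbitrary choice for illustration, not a figure from the talk.

```python
def growth_factor(months, doubling_time_months):
    """Fold increase after `months` for a quantity that doubles
    every `doubling_time_months` (constant exponential growth assumed)."""
    return 2 ** (months / doubling_time_months)

# Doubling times as quoted on the slide (Stein, Genome Biology, 2010).
DOUBLING_TIMES_MONTHS = {
    "Computational hardware": 14,
    "Sanger sequencing": 19,
    "NGS (shotgun) sequencing": 5,
}

HORIZON_MONTHS = 60  # hypothetical 5-year horizon, chosen for illustration

for name, doubling in DOUBLING_TIMES_MONTHS.items():
    fold = growth_factor(HORIZON_MONTHS, doubling)
    print(f"{name}: ~{fold:,.0f}x in {HORIZON_MONTHS} months")
```

Under these assumptions, NGS output per dollar grows by 2^12 = 4096x over five years, far more than the hardware that has to store and process it.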
![Page 11: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/11.jpg)
Postdoc experience with data
2003-2008 Cumulative sequencing in PhD = 2000 bp
2008-2009 Postdoc Year 1 = 50 Gbp
2009-2010 Postdoc Year 2 = 450 Gbp
2014 = 50 Tbp
2015 = 500 Tbp budgeted
![Page 12: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/12.jpg)
THE DIRT ON SOIL
Biodiversity in the dark, Wall et al., Nature Geoscience, 2010 Jeremy Burgress
MAGNIFICENT BIODIVERSITY
![Page 13: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/13.jpg)
THE DIRT ON SOIL
SPATIAL HETEROGENEITY
http://www.fao.org/ www.cnr.uidaho.edu
![Page 14: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/14.jpg)
THE DIRT ON SOIL
DYNAMIC
![Page 15: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/15.jpg)
THE DIRT ON SOIL
INTERACTIONS: BIOTIC, ABIOTIC, ABOVE, BELOW, SCALES
Philippot, 2013, Nature Reviews Microbiology
![Page 16: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/16.jpg)
I. Technical side of microbial big data in
biology
II. Future of big data in soil microbial
communities
III. Bottlenecks for microbiologists
![Page 17: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/17.jpg)
Tackling Soil Biodiversity
Source: Chuck Haney
C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU)
Janet Jansson, Susannah Tringe (JGI)
![Page 18: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/18.jpg)
Lesson #1: Accessing information in
data
http://siliconangle.com/files/2010/09/image_thumb69.png
![Page 19: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/19.jpg)
de novo assembly
Compresses dataset size significantly
Improved data quality (longer sequences, gene order)
Reference not necessary (novelty)
Raw sequencing data (“reads”) → Computational algorithms → Informative genes / genomes
![Page 20: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/20.jpg)
Metagenome assembly…a scaling
problem.
![Page 21: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/21.jpg)
Shotgun sequencing and de novo
assembly
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
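The idea of rebuilding one long sequence from overlapping fragments can be sketched with a toy greedy assembler. This is an illustration only: it uses three made-up, error-free DNA reads, merges whichever pair has the longest suffix-prefix overlap, and ignores the sequencing errors and repeats shown in the example above. Production assemblers typically use graph-based methods and error handling that are not shown here.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`
    (0 if the best overlap is shorter than `min_len`)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of contigs with the largest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left to merge; remaining contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]
    return contigs

# Hypothetical reads sampled (out of order) from a made-up 16 bp sequence.
toy_reads = ["CCTTAGAC", "ATGGCGTA", "CGTACCTT"]
assembled = greedy_assemble(toy_reads)
print(assembled)
```

The pairwise comparison in every round is what makes naive assembly scale poorly as the number of reads grows.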
![Page 22: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/22.jpg)
Practical Challenges – Intensive
computing
Howe et al, 2014, PNAS
Months of
“computer
crunching” on a
super computer
![Page 23: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/23.jpg)
Practical Challenges – Intensive
computing
Howe et al, 2014, PNAS
Months of
“computer
crunching” on a
super computer
Assembly of 300 Gbp (70,000
genomes' worth) can be done with
any assembly program in less
than 14 GB RAM and less than
24 hours.
50 Gbp = 10,000 genomes
![Page 24: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/24.jpg)
Natural community characteristics
Diverse
Many organisms
(genomes)
![Page 25: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/25.jpg)
Natural community characteristics
Diverse
Many organisms
(genomes)
Variable abundance
Most abundant organisms, sampled
more often
Assembly requires a minimum amount
of sampling
More sequencing, more errors
Sample 1x
![Page 26: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/26.jpg)
Natural community characteristics
Diverse
Many organisms
(genomes)
Variable abundance
Most abundant organisms, sampled
more often
Assembly requires a minimum amount
of sampling
More sequencing, more errors
Sample 1x Sample 10x
![Page 27: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/27.jpg)
Natural community characteristics
Diverse
Many organisms
(genomes)
Variable abundance
Most abundant organisms, sampled
more often
Assembly requires a minimum amount
of sampling
More sequencing, more errors
Sample 1x Sample 10x
Overkill
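The link between abundance and sampling depth can be sketched with a simple expected-coverage calculation. The numbers are assumptions for illustration: a 5 Mbp average genome (consistent with "50 Gbp = 10,000 genomes"), a 10x minimum coverage for assembly, and made-up relative abundances.

```python
def expected_coverage(total_bp, genome_bp, relative_abundance):
    """Average number of times each base of one genome is sampled,
    assuming reads are drawn in proportion to relative abundance."""
    return total_bp * relative_abundance / genome_bp

def bp_needed(target_coverage, genome_bp, relative_abundance):
    """Total sequencing required for one genome to reach `target_coverage`."""
    return target_coverage * genome_bp / relative_abundance

TOTAL_BP = 50e9    # 50 Gbp of sequencing
GENOME_BP = 5e6    # assumed average genome size (5 Mbp)
MIN_COVERAGE = 10  # assumed minimum coverage for assembly

for label, abundance in [("abundant (10%)", 0.1), ("rare (0.001%)", 1e-5)]:
    cov = expected_coverage(TOTAL_BP, GENOME_BP, abundance)
    need = bp_needed(MIN_COVERAGE, GENOME_BP, abundance)
    print(f"{label}: {cov:g}x coverage; {need:.3g} bp needed for {MIN_COVERAGE}x")
```

In this sketch the abundant organism is sampled far beyond what assembly needs while the rare one stays well below the minimum, which is the imbalance the "Overkill" label points to.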
![Page 28: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/28.jpg)
Digital normalization
Brown et al., 2012, arXiv
Howe et al., 2014, PNAS
Zhang et al., 2014, PLOS One
![Page 29: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/29.jpg)
Digital normalization
Brown et al., 2012, arXiv
Howe et al., 2014, PNAS
Zhang et al., 2014, PLOS One
![Page 30: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/30.jpg)
Digital normalization
Brown et al., 2012, arXiv
Howe et al., 2014, PNAS
Zhang et al., 2014, PLOS One
![Page 31: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/31.jpg)
Digital normalization
Brown et al., 2012, arXiv
Howe et al., 2014, PNAS
Zhang et al., 2014, PLOS One
![Page 32: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/32.jpg)
Digital normalization
Brown et al., 2012, arXiv
Howe et al., 2014, PNAS
Zhang et al., 2014, PLOS One
![Page 33: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/33.jpg)
Digital normalization
Brown et al., 2012, arXiv
Howe et al., 2014, PNAS
Zhang et al., 2014, PLOS One
Scales datasets down for assembly by up to 95%, with the same assembly
outputs.
Genomes, mRNA-seq, metagenomes (soils, gut, water)
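A minimal sketch of the idea behind digital normalization, as described in Brown et al., 2012: stream through the reads and keep a read only if the median abundance of its k-mers, counted over the reads kept so far, is still below a cutoff. Everything concrete here is an assumption for illustration: the exact dictionary counter (the published approach relies on a memory-efficient approximate counter), the tiny `k`, the cutoff, and the toy reads.

```python
from collections import Counter
from statistics import median

def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def digital_normalization(reads, k=4, cutoff=2):
    """Keep a read only while its estimated coverage (median k-mer count
    among previously kept reads) is below `cutoff`."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept

# Hypothetical input: one abundant sequence seen 5 times, one rare sequence once.
toy_reads = ["ATGGCGTACC"] * 5 + ["TTAGACGGAT"]
normalized = digital_normalization(toy_reads)
print(f"kept {len(normalized)} of {len(toy_reads)} reads")
```

Redundant copies of the abundant sequence are discarded once it reaches the cutoff, while the rare sequence is retained.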
![Page 34: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/34.jpg)
Tackling Soil Biodiversity
Source: Chuck Haney
C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU)
Janet Jansson, Susannah Tringe (JGI)
![Page 35: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/35.jpg)
The reality?
![Page 36: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/36.jpg)
More like…
Source: Chuck Haney
Howe et al., 2014, PNAS
![Page 37: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/37.jpg)
What we learned from deeply sequencing
soil
Grand Challenge effort –
10% of soil biodiversity
sampled
Incredible soil biodiversity
(estimate required 10
Tbp/sample)
“To boldly go where no man
has gone before”: >60%
Unknown
[Figure: total count of KO annotations (0-400) by functional category, for three groups: corn and prairie, corn only, prairie only. Categories: amino acid metabolism; carbohydrate metabolism; membrane transport; signal transduction; translation; folding, sorting and degradation; metabolism of cofactors and vitamins; energy metabolism; transport and catabolism; lipid metabolism; transcription; cell growth and death; replication and repair; xenobiotics biodegradation and metabolism; nucleotide metabolism; glycan biosynthesis and metabolism; metabolism of terpenoids and polyketides; cell motility]
Howe et al, 2014, PNAS
Managed agricultural soils exhibit less
diversity, likely because of their history of
cultivation.
![Page 38: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/38.jpg)
If soil is so diverse, what are the most
consistent signals we can see at the plot
level?
![Page 39: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/39.jpg)
Is there an identifiable “soil
functional core” (carbon cycling
focus)?
Kirsten Hofmockel, Iowa State University
Ames, Iowa, COBS Field Site, Fertilized Prairie Whole Soil Samples
4 deeply sampled whole soil metagenomes (16S rRNA 5000 reads, Shotgun 20-50 million reads)
![Page 40: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/40.jpg)
How much genetic sequence do
you think is shared in 4 replicates?
More than 1%? 10%? 50%?
What kind of genes do you expect in this core?
Minimal critical genes will be abundant & diverse
![Page 41: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/41.jpg)
How much genetic sequence do
you think is shared in 4 replicates?
More than 1%? 10%? 50%?
What kind of genes do you expect in this core?
Minimal critical genes have varying abundances &
are diverse
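One simple way to put a number on "how much is shared" is to treat each replicate as a set of detected genes and compare the intersection to the union. This is a sketch of that idea only, not the analysis used in the study; the gene identifiers are made up.

```python
def core_fraction(replicates):
    """Return the set of items present in every replicate and the fraction
    of all observed items (the union) that this core represents."""
    core = set.intersection(*replicates)
    observed = set.union(*replicates)
    fraction = len(core) / len(observed) if observed else 0.0
    return core, fraction

# Hypothetical gene identifiers detected in 4 replicate metagenomes.
toy_replicates = [
    {"geneA", "geneB", "geneC"},
    {"geneA", "geneB", "geneD"},
    {"geneA", "geneB", "geneE"},
    {"geneA", "geneB", "geneF"},
]
core, shared = core_fraction(toy_replicates)
print(sorted(core), f"{shared:.0%} of observed genes are shared by all replicates")
```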
![Page 42: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/42.jpg)
Core genes: soil-specific
![Page 43: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/43.jpg)
My vision (and future research)
Microbial markers for ecosystem services
Nutrient cycling
Pathogens
Antibiotic resistance
Biodiversity
![Page 44: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/44.jpg)
Capabilities exist already…
http://vimeo.com/90059732
Indoor Microbiome Project Animation
![Page 45: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/45.jpg)
Is more data better?
Bottlenecks for the emerging
microbiologists
![Page 46: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/46.jpg)
Technical obstacles in the big data
deluge
Access to the data
Access to the resources
Democratization of both data and resource access
“80% of awards and 50% of $$ are for grants < $350,000” (Ian Foster)
Data volume and velocity
Previous efforts are difficult to integrate
Innovation is necessary
![Page 47: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/47.jpg)
Software Developers
Computer Scientists
Clinicians
PIs
Data generators
Microbiologists
Data Analyzers
Statisticians
Bioinformaticians
http://ivory.idyll.org/blog/2014-the-emerging-field-of-data-intensive-biology.html
Data intensive microbiology
![Page 48: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/48.jpg)
![Page 49: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/49.jpg)
![Page 50: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/50.jpg)
Social obstacles – the main
challenge
A shift in costs does not mean a shift in
expectations
http://www.deluxebattery.com/25-hilarious-expectation-vs-reality-photos/
Dear PI,
It will take longer than
the time it took you to do
your experiment to
analyze the data. Please
do not write me for
results within 24 hours of
your sequences
becoming available.
- Adina
![Page 51: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/51.jpg)
Culture of sharing
http://www.heathershumaker.com/
Metagenomic Datasets
![Page 52: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/52.jpg)
Training / Incentives
Emails between collaborators don’t contain as
much “science” as I’d like:
![Page 53: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/53.jpg)
All analysis: accessible,
reproducible, and automated
![Page 54: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/54.jpg)
All analysis: accessible,
reproducible, and automated
To reproduce analysis in a publication,
1. Rent Amazon EC2 computer
2. Clone github repository containing data and scripts
3. Open IPython notebook and execute
To run same analysis on different dataset,
1. Replace data files with your own data, execute notebook.
2. Tweak scripts as needed.
![Page 55: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/55.jpg)
The journey in summary
![Page 56: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/56.jpg)
RIDING THE BIG DATA
TIDAL WAVE OF MODERN
MICROBIOLOGY
Adina Howe
Argonne National Laboratory / Michigan State University
Iowa State University, Ag & Biosystems Engr (January)
![Page 57: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/57.jpg)
RIDING THE BIG DATA
TIDAL WAVE OF MODERN
MICROBIOLOGY
Adina Howe
Argonne National Laboratory / Michigan State University
Iowa State University, Ag & Biosystems Engr (January)
![Page 58: Big data nebraska](https://reader030.fdocuments.in/reader030/viewer/2022032421/55a7366f1a28abd9528b465d/html5/thumbnails/58.jpg)
Acknowledgements
C. Titus Brown (MSU)
James Tiedje (MSU)
Daina Ringus (UC)
Folker Meyer (ANL)
Eugene Chang (UC)
NSF Biology Postdoc Fellowship
DOE Great Lakes Bioenergy Research Center