Next-generation sequencing for microbial ecology:
alpha diversity, beta diversity, and biases in high-throughput sequencing
Rachel Adams, Andrew Rominger, Sara Branco, Thomas Bruns
The microbiome of the built environment
Understudied but fundamental ecological habitat
Implications for human health
Sick building syndrome
Metrics are practically absent: composition and quantitative characteristics
Need comparison of “typical” buildings and high replication across settings to detect patterns
The What and Why of the indoor microbiome
Architecture
Ventilation
Building function
Environmental setting
Residents
Fungi in the indoor microbiome, and beyond
Yeasts
Filaments
Saprobes
Symbionts: from parasites (−) to mutualists (+)
Assessing environmental fungi
1. An estimated 5-20% of fungi grow in culture
2. Identification requires a fungal taxonomist
[Figure: ribosomal DNA region in order: SSU RNA (18S), ITS1, 5.8S, ITS2, LSU RNA (28S)]
Nuclear ribosomal internal transcribed spacer (ITS) region as a universal DNA barcode marker for Fungi (Schoch et al. 2012)
High-throughput sequencing has greatly expanded capabilities in microbial ecology
[Figure: short barcode sequences tagging reads: ACGAGTGCGT, ACGCTCGACA, AGACGCACTC, AGCACTGTAG, ATCAGACACG]
10^4 – 10^7 sequence reads
alpha, beta, gamma diversity
[Figure: three samples with within-sample diversities α1, α2, α3; pairwise between-sample diversities β12, β13, β23; and overall diversity γ]
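The three levels of diversity can be illustrated on a small sample-by-OTU table. The following is only a sketch: it uses OTU richness for alpha and gamma and Jaccard dissimilarity for pairwise beta, which is one of many possible choices and not necessarily the one used in this work. The table values and the function name `diversity_summary` are hypothetical.

```python
def diversity_summary(counts):
    """Summarize a sample-by-OTU count table (rows are samples).

    Returns per-sample richness (alpha), pooled richness (gamma),
    and a matrix of pairwise Jaccard dissimilarities (beta).
    """
    # Set of OTU indices present in each sample
    present = [{k for k, c in enumerate(row) if c > 0} for row in counts]
    alpha = [len(p) for p in present]
    gamma = len(set().union(*present))
    n = len(present)
    beta = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            union = present[i] | present[j]
            shared = present[i] & present[j]
            # Two empty samples are treated as identical (dissimilarity 0)
            d = 1.0 - len(shared) / len(union) if union else 0.0
            beta[i][j] = beta[j][i] = d
    return alpha, gamma, beta


# Hypothetical table: 3 samples (rows) x 5 OTUs (columns)
otu_table = [
    [10, 5, 0, 0, 1],
    [8, 0, 3, 0, 0],
    [0, 0, 4, 6, 2],
]
alpha, gamma, beta = diversity_summary(otu_table)
print("alpha:", alpha)
print("gamma:", gamma)
print("beta12, beta13, beta23:", beta[0][1], beta[0][2], beta[1][2])
```

Spurious OTUs added to a sample raise its alpha directly, whereas their effect on beta depends on whether they land in the shared or the unshared part of each pair.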
Groundtruthing high-throughput sequencing for alpha richness
Kunin et al. 2010
α_true < α_est
Groundtruthing high-throughput sequencing
[Figure: true samples (α1, α2, α3) pass through high-throughput sequencing to give observed samples with inflated richness (α1+, α2+, α3+)]
In terms of diversity, we know that α can be elevated in high-throughput sequenced communities...
[Figure: true community (α1, α2, α3; β12, β13, β23) passes through high-throughput sequencing to give the observed community (α1+, α2+, α3+; β12?, β13?, β23?)]
...but how does that change conclusions about ecological processes that are based on β diversity?
A key component of community ecology: linking processes to this compositional variation
Adams et al., ISME Journal, 2013
Beta diversity: the variation in species composition among sites
Question and hypotheses
Do errors that inflate alpha diversity bias conclusions on beta diversity between samples?
Why would it?
• Particular taxa in one environment grouping do not amplify or amplify in a way that skews relative abundance of all others*
• Clustering incorrectly groups divergent taxa or splits identical taxa
Hypothesis: No
While richness/diversity estimations will be off for any given sample, conclusions of beta-diversity will be robust to the errors
Simulation process
Initial community: sample-by-OTU table (Sample 1 … Sample i by OTU1 … OTUj)
Simulated community: sample-by-OTU table (Sample 1 … Sample i by OTU1 … OTUk)
Simulation process
Initial communities: expected relative abundance of OTUs
1. Variation in taxon-specific amplification → biased relative abundance
2. Sequence error → biased relative abundance + error
3. Clustering OTUs → biased relative abundance + error + clustering
Simulated communities
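The pipeline above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the simulation code behind these slides. It borrows the distributions listed in the backup slides (amplification bias from a beta distribution with a=0.5, b=1.0; number of splits from a binomial with n=100; split location from a beta distribution with a=b=0.5). How those draws are combined here is assumed: sequence error and clustering are folded into a single "splitting" step, and each taxon gets one efficiency and one split plan shared by all samples. The function name `simulate_observed` and the input table are hypothetical.

```python
import random


def simulate_observed(true_table, p_split, n_trials=100, seed=0):
    """Turn a true sample-by-OTU table into a biased, over-split one.

    true_table: list of samples, each a list of OTU abundances.
    p_split: per-trial probability that an OTU is split (assumed meaning).
    """
    rng = random.Random(seed)
    n_otus = len(true_table[0])

    # Step 1: taxon-specific amplification efficiency, Beta(0.5, 1.0)
    efficiency = [rng.betavariate(0.5, 1.0) for _ in range(n_otus)]

    # Steps 2-3: for each taxon, draw how many spurious OTUs it sheds
    # (Binomial(n_trials, p_split)) and where each split falls (Beta(0.5, 0.5))
    split_plan = []
    for _ in range(n_otus):
        n_splits = sum(rng.random() < p_split for _ in range(n_trials))
        split_plan.append([rng.betavariate(0.5, 0.5) for _ in range(n_splits)])

    observed = []
    for sample in true_table:
        row = []
        for abundance, eff, fractions in zip(sample, efficiency, split_plan):
            parent = abundance * eff          # biased abundance
            children = []
            for fraction in fractions:
                moved = parent * fraction     # reads assigned to a spurious OTU
                children.append(moved)
                parent -= moved
            row.append(parent)
            row.extend(children)
        observed.append(row)
    return observed


# Hypothetical initial communities: 2 samples x 3 OTUs
initial = [[100, 50, 10], [20, 80, 40]]
simulated = simulate_observed(initial, p_split=0.0334)
print(len(initial[0]), "true OTUs ->", len(simulated[0]), "observed OTUs")
```

Because every split adds a column, the simulated table has k ≥ j OTUs, which matches the inflated alpha diversity described earlier; splitting only redistributes reads within a sample, so sample totals after amplification are unchanged.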
Model summary – 2 types of errors
1. Create group differences that aren’t there (Type I error)
[Figure: NMDS ordinations (NMDS1 vs. NMDS2) of the true and perceived communities]
2. Lose group differences that are there (Type II error)
[Figure: NMDS ordinations (NMDS1 vs. NMDS2) of the true and perceived communities]
Model summary output
1. Presence of bias: Statistical categorical differences

Groups     R2     p-value
Location   0.02   0.34
Season     0.20   0.001

2. Degree of bias: percentage difference between true and simulated communities
Normalized error = (Simulated distance – True distance) / True distance
Morisita-Horn distance metric
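The degree-of-bias measure can be written out directly. The sketch below uses the standard Morisita-Horn overlap index converted to a distance (1 minus the index) and the normalized error defined above; the function names and the example abundances are hypothetical.

```python
def morisita_horn_distance(x, y):
    """Morisita-Horn distance between two abundance vectors.

    Overlap = 2 * sum(x_i * y_i) / ((Dx + Dy) * X * Y),
    with X = sum(x), Y = sum(y), Dx = sum(x_i^2) / X^2, Dy = sum(y_i^2) / Y^2.
    Distance = 1 - overlap.
    """
    total_x, total_y = sum(x), sum(y)
    if total_x == 0 or total_y == 0:
        raise ValueError("both samples need at least one read")
    cross = sum(a * b for a, b in zip(x, y))
    dx = sum(a * a for a in x) / total_x ** 2
    dy = sum(b * b for b in y) / total_y ** 2
    return 1.0 - 2.0 * cross / ((dx + dy) * total_x * total_y)


def normalized_error(simulated_distance, true_distance):
    """(Simulated distance - True distance) / True distance."""
    return (simulated_distance - true_distance) / true_distance


# Hypothetical pair of samples before and after simulated sequencing
true_d = morisita_horn_distance([50, 30, 20], [20, 30, 50])
sim_d = morisita_horn_distance([45, 25, 20, 5, 5], [15, 30, 40, 10, 5])
print(true_d, sim_d, normalized_error(sim_d, true_d))
```

A positive normalized error means the simulated communities look more different from each other than the true ones do.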
Model findings
Categorical differences are robust to high-throughput sequencing errors in alpha diversity, regardless of the underlying patterns of beta diversity
The degree of bias is not affected by the underlying patterns of beta diversity but depends on community characteristics
Model summary – Type I & II error
[Figure: p-values (0.0 to 1.0) for true and simulated communities, in the no-groups and two-groups scenarios]
Whether groups are different or the same will not be biased by inflated alpha diversity
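A categorical test of this kind returns an R2 and a permutation p-value for a grouping of samples. The slides do not name the test, so the following is an assumption: a PERMANOVA-style pseudo-F computed from a distance matrix, with labels shuffled to build the null distribution. The function name `permanova` and the example distances are hypothetical.

```python
import random


def permanova(dist, groups, n_perm=999, seed=0):
    """Permutation test for group differences on a distance matrix.

    dist: square symmetric matrix (list of lists); groups: one label per
    sample, with at least two distinct labels. Returns (R2, pseudo-F, p).
    """
    n = len(groups)
    n_groups = len(set(groups))
    # Total sum of squares from all pairwise squared distances
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n

    def pseudo_f(labels):
        ss_within = 0.0
        for g in set(labels):
            members = [i for i, lab in enumerate(labels) if lab == g]
            pairs = sum(dist[i][j] ** 2
                        for a, i in enumerate(members) for j in members[a + 1:])
            ss_within += pairs / len(members)
        ss_among = ss_total - ss_within
        if ss_within == 0:
            return ss_among, float("inf")
        f = (ss_among / (n_groups - 1)) / (ss_within / (n - n_groups))
        return ss_among, f

    ss_among, f_obs = pseudo_f(list(groups))
    r2 = ss_among / ss_total

    rng = random.Random(seed)
    labels = list(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        _, f_perm = pseudo_f(labels)
        if f_perm >= f_obs - 1e-12:   # tolerance for floating-point ties
            hits += 1
    return r2, f_obs, (hits + 1) / (n_perm + 1)


# Hypothetical distances: two tight groups of three samples, far apart
dist = [[0.0 if i == j else (0.1 if (i < 3) == (j < 3) else 0.9)
         for j in range(6)] for i in range(6)]
r2, f_stat, p = permanova(dist, ["A", "A", "A", "B", "B", "B"])
print(r2, f_stat, p)
```

Running the same test on the true and the simulated distance matrices and comparing the p-values gives the Type I and Type II comparison summarized above.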
Model summary – Degree of bias
Degree of bias will be affected by
- the error rate of the platform and OTU clustering
- the gamma diversity of the environment
- the precise shape of the species abundance distribution
But not the relationship among samples
Increasing probability of sequencing error and over-splitting OTUs increases bias
[Figure: normalized error (0.0 to 0.6) vs. probability of splitting (1e-04, 0.0334, 0.0667, 0.1), for the no-groups and two-groups scenarios]
Increasing OTU richness decreases bias
[Figure: normalized error (0.0 to 0.8) vs. number of OTUs (100, 600, 1100)]
Shape of species abundance distribution (SAD) affects bias
[Figure: rank-abundance curve (abundance vs. rank, ranks 0 to 1200)]
[Figure: normalized error (0.0 to 0.8) vs. increasing SAD variance (1.5, 2.5, 3.5)]
As true community distance increases, degree of error decreases
[Figure: normalized error (0.2 to 0.6) vs. true distance (0.65 to 0.80)]
Clustering is the main error-producing step
[Figure: R^2 values (0.0 to 0.5) for the True, Amplified, and Split communities in the two-groups scenario]
Simulation overview
Categorical analysis is very robust to high-throughput sequencing errors and biases
Degree of bias will be affected by the error rate of the sequencing platform and OTU clustering, the gamma diversity of the environment, and the precise shape of the species abundance distribution
High-throughput error leads to an overestimation of the difference between groups
Mean bias is ~20-40%
Incorrect OTU clustering is most of that
Steps
1. In silico: Add further complexity to simulations
2. In vitro: Empirically test artificially-created microbial communities
Do errors that inflate alpha diversity bias conclusions on beta diversity between samples?
Why would it?
• Particular taxa in one environment grouping do not amplify or amplify in a way that skews relative abundance of all others*
• Clustering incorrectly groups divergent taxa or splits identical taxa
Hypothesis: No
While richness/diversity estimations will be off for any given sample, conclusions of beta-diversity will be robust to the errors
Question and hypotheses
Air samples in a mycology classroom: a unique source distorts perceived species richness
Mycology classroom appears to be less rich than other classrooms…
[Figure: Chao estimated richness vs. number of individuals (0 to 8000) for classrooms A to E]
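For reference, a Chao-type richness estimate can be computed from the counts of singleton and doubleton OTUs. The slide does not say which variant was used, so the bias-corrected Chao1 form below is an assumption, and the example counts are hypothetical.

```python
def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)).

    counts: reads per OTU in one sample. F1 and F2 are the numbers of
    OTUs seen exactly once and exactly twice.
    """
    observed = [c for c in counts if c > 0]
    s_obs = len(observed)
    f1 = sum(1 for c in observed if c == 1)
    f2 = sum(1 for c in observed if c == 2)
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical samples with the same number of reads
even_sample = [20, 20, 20, 20, 18, 1, 1]
dominated_sample = [95, 2, 1, 1, 1]
print(chao1(even_sample), chao1(dominated_sample))
```

When a few taxa take up most of the reads, rare taxa are less likely to be sampled at all, which lowers both the observed and the estimated richness at a fixed sequencing depth.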
… but has higher biomass
[Figure: Penicillium spore equivalents (0 to 200) by classroom (A to E)]
Composition of non-mycology classrooms is similar
[Figure: proportion (0 to 100) of taxa in classrooms A to E]
Mycology classroom dominated by a few taxa
[Figure: proportion (0 to 100) of taxa in classrooms A to E; four taxa are marked with asterisks]
Puffballs dominate mycology classroom
Pisolithus, aka dog turd fungus
Battarrea, tall stiltball
Lycoperdon, common puffball
Adams et al., in review
Beta diversity of mycology classroom: distinct communities
[Figure: NMDS ordination (NMDS1 vs. NMDS2) with three series: observed, taxonomy reassigned, and abundance reassigned]
Conclusions
• While deciphering alpha diversity is problematic:
- Inflated alpha due to sequence error & clustering
- Deflated alpha due to unevenness
beta diversity calculations are robust to these errors in high-throughput sequencing
• Empirical test will be used to corroborate conclusions of in silico simulations
• High-throughput sequencing will continue to be a promising tool for microbial ecologists
References – potential biases in high-throughput sequencing
DNA extraction: Frostegard et al Appl Environ Microbiol 1999; DeSantis et al FEMS Microbiology 2005; Feinstein et al Appl Environ Microbiol 2009; Morgan et al PLoS ONE 2010; Delmont et al Appl Environ Microbiol 2011
PCR amplification/Relative abundance: Amend et al Mol Ecol 2010; Engelbrektson et al ISME Journal 2010; Bellemain et al BMC Microbiol 2010; Schloss et al PLoS ONE 2011; Pinto & Raskin PLoS ONE 2012; Klindworth et al Nucleic Acids Res 2013
Sequencing error/Chimeras/OTU clustering: Huse et al Genome Biol 2007; Huse et al Environ Microbiol 2010; Kunin et al Environ Microbiol 2010; Quince et al BMC Bioinformatics 2010; Lee et al PLoS ONE 2012; Pinto & Raskin PLoS ONE 2012; Bachy et al ISME Journal 2013
Sequencing platform/protocol: Morgan et al PLoS ONE 2010; Luo et al PLoS ONE 2012
Even sampling depth: Schloss et al PLoS ONE 2011; Gihring et al Environ Microbiol 2012
Denoising: Gaspar & Thomas PLoS ONE 2013
Empirical test of simulation results
[Figure: normalized error (0.0 to 0.8) vs. number of OTUs (100, 600, 1100)]
PCR bias
PCR bias: beta distribution a=0.5, b=1.0
[Figure: density of the PCR bias distribution]
Scatter around line of true abundance versus amplified abundance
[Figure: amplified abundance (0 to 1400) vs. true abundance (0 to 1200)]
OTU splitting bias
Split bias: binomial distribution with n=100
[Figure: density of the number of splits (0 to 20) for p=0.001, p=0.0667, p=0.0334, p=0.0001]
Split location: beta distribution with a=b=0.5
[Figure: density of the location of split (0.0 to 1.0)]
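To get a feel for the split-count distribution, the binomial probabilities and the expected number of splits (n times p) can be computed directly. This is a small illustrative sketch; the function names are hypothetical, and the probabilities are the values listed in the legend above.

```python
from math import comb


def split_count_pmf(k, p, n=100):
    """Probability of exactly k splits under Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def expected_splits(p, n=100):
    """Mean of Binomial(n, p)."""
    return n * p


for p in (0.0001, 0.001, 0.0334, 0.0667):
    print(p, expected_splits(p), split_count_pmf(0, p))
```

At the lowest probabilities almost every OTU stays intact, while at the highest an OTU is expected to shed several spurious OTUs.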